Tracking down a bug

As part of working with Active Directory, I was making sure that we report good error messages on failure, and I soon found what I thought was a bug in the ruby-ldap project. Specifically, after including their code, all of our "CONNECTION REFUSED" messages changed into "INVALID ARGUMENT." We can cope with that, since we control the higher layers, but it's a loss of granularity, and it causes one to raise an eyebrow.


I started with a binary search through all the ldap packages, and found that they all had the bug. Then I did a second binary search by commenting out code, and found that even an almost empty source file would still trigger the bug.





So then I tried dropping out the libraries one by one. Eventually, as you can see in that screen shot, I found out that it was -lpthreads that was triggering it.


I cruised through the Ruby source code and figured out where the call was happening. I dropped through several layers of code before getting into their networking stuff, which I was able to extract out into a standalone test program:
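The original test program is not reproduced here, so what follows is only a minimal sketch of what such a program might look like, not the author's actual code. It assumes that nothing is listening on port 1 of 127.0.0.1, and the names errno_probe and probe_errno_across_close are made up for illustration. The idea is to let connect() fail, record errno, make a successful close() call, and then record errno again.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* errno as seen right after connect() and again after close(). */
struct errno_probe {
    int after_connect;
    int after_close;
};

/*
 * Hypothetical helper: connect to a port assumed to have no listener,
 * then close the socket and record errno at both points.
 */
struct errno_probe probe_errno_across_close(void)
{
    struct errno_probe result = {0, 0};
    struct sockaddr_in addr;
    int fd;

    errno = 0;
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        /* No socket to close; report the same value twice. */
        result.after_connect = result.after_close = errno;
        return result;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(1);                       /* assumed closed */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* 127.0.0.1 */

    errno = 0;
    (void)connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    result.after_connect = errno;   /* typically ECONNREFUSED */

    (void)close(fd);                /* expected to succeed */
    result.after_close = errno;     /* should match after_connect */

    return result;
}
```

Calling probe_errno_across_close() from main() and printing strerror() for both fields should show the same message twice on a system without the bug. Linking the same file with and without the thread library is one way to compare the two cases.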



I couldn't reproduce it on a few Linux boxes, but Lawrence confirmed that it was also a problem in a recent OpenBSD snapshot.


This behavior violates OpenBSD's errno man page, which explicitly says that errno only changes on an error. (Fully POSIX-compliant systems may warn that errno can change on even successful system or library calls; this uncertainty was made explicit in Technical Corrigendum 2.)


So, off through the pthread code. The libpthread.so object includes its own close() library call, which calls _thread_fd_entry_close(), which calls _thread_fs_flags_replace(), which calls _thread_sys_fstat(). I haven't quite tracked down what this last function does (it doesn't appear in the source tree; it's probably auto-generated), but I believe it's some library-safe version of fstat(). And, just as fstat() modifies errno, so does _thread_sys_fstat().
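To make the mechanism concrete, here is a small sketch of how a wrapped close() can succeed and still leave a different errno behind, along with one common way to guard against it. This is not the OpenBSD code: the function names are invented, and the stand-in bookkeeping call deliberately fails with EBADF rather than the EINVAL seen in the real bug.

```c
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Stand-in for internal bookkeeping inside a wrapped close(). It calls
 * fstat() on an invalid descriptor so that it fails and sets errno,
 * imitating the side effect described above.
 */
static void bookkeeping_that_touches_errno(void)
{
    struct stat st;
    (void)fstat(-1, &st);   /* fails with EBADF, overwriting errno */
}

/* Leaky wrapper: returns 0 on success but leaves errno clobbered. */
int leaky_close(int fd)
{
    bookkeeping_that_touches_errno();
    return close(fd);
}

/* Careful wrapper: restores the caller's errno when close() succeeds. */
int careful_close(int fd)
{
    int saved_errno = errno;
    int rc;

    bookkeeping_that_touches_errno();
    rc = close(fd);
    if (rc == 0)
        errno = saved_errno;
    return rc;
}
```

A caller that sets errno to ECONNREFUSED and then calls leaky_close() on a valid descriptor gets a return value of 0 but finds EBADF in errno, while careful_close() leaves ECONNREFUSED in place. Callers can also protect themselves by copying errno into a local variable immediately after the failing call, before doing any cleanup.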


Interestingly, the errno value set by both fstat() and _thread_sys_fstat() is EINVAL, which isn't one of fstat()'s documented error codes.


These are both now fixed, since the OpenBSD team is pretty responsive about the correctness and security of their code.


It's kind of interesting to look back over all of that, and realize that the bug that seemed to be in the library of a very high-level language was in the basic OS libraries.