Russ Cox

Who are you?

I’m a programmer.

I write programs. I worked on Plan 9 from Bell Labs for about a decade, writing kernel code, networked servers, file systems, and a bit of graphics code. Now I work at Google, where I’m one of the lead developers of the Go programming language. Go has turned out to be a nice general-purpose language, but its original design target was concurrent networked servers, the kinds of programs we were writing for Plan 9 and for Google.

I also write about programs. My most well-known articles are about implementing regular expressions, putting a zip file inside itself, and making pictures in QR codes. I used Go for all three.

What’s the most interesting bug you’ve met?

To me, the most interesting bugs are the ones that reveal fundamental, subtle misunderstandings about the way a program works. A good bug is like a good science experiment: through it, you learn something unexpected about the virtual world you are exploring.

About ten years ago I was working on a networked server that used threads, coordinating with locks and condition variables. This server was part of Plan 9 and was written in C. Occasionally it would crash inside malloc, which usually means some kind of memory corruption due to a write-after-free error. One day, while benchmarking with the bulk of the server disabled, I was lucky enough to have the crash happen reproducibly. The server being mostly disabled gave me a head start in isolating the bug, and the reproducibility made it possible to cut code out, piece by piece, until one section was very clearly implicated.

The code in question was cleaning up after a client that had recently disconnected. In the server, there is a per-client data structure shared by two threads: the thread R reads from the client connection, and the thread W writes to it. R notices the disconnect as an EOF from a read, notifies W, waits for an acknowledgement from W, and then frees the per-client structure.

To acknowledge the disconnect, W ran code like:

qlock(&conn->lk);
conn->writer_done = 1;
qsignal(&conn->writer_ack);
qunlock(&conn->lk);
thread_exit();

And to wait for the acknowledgement, R ran code like:

qlock(&conn->lk);
while(!conn->writer_done)
    qwait(&conn->writer_ack);

// The writer is done, and so are we:
// free the connection.
free(conn);

This is a standard locks and condition variables piece of code: qwait is defined to release the lock (here, conn->lk), wait, and then reacquire the lock before returning. Once R observes that writer_done is set, R knows that W is gone, so R can free the per-connection data structure.

R does not call qunlock(&conn->lk). My reasoning was that calling qunlock before free sends mixed messages: qunlock suggests coordination with another thread using conn, but free is only safe if no other thread is using conn. W was the other thread, and W is gone. But somehow, when I added qunlock(&conn->lk) before free(conn), the crashes stopped. Why?

To answer that, we have to look at how locks are implemented.

Conceptually, the core of a lock is a variable with two markings unlocked and locked. To acquire a lock, a thread checks that the core is marked unlocked and, if so, marks it locked, in one atomic operation. Because the operation is atomic, if two (or more) threads are attempting to acquire the lock, only one can succeed. That thread—let’s call it thread A—now holds the lock. Another thread vying for the lock—thread B—sees the core is marked locked and must now decide what to do.

The first, simplest approach, is to try again, and again, and again. Eventually thread A will release the lock (by marking the core unlocked), at which point thread B’s atomic operation will succeed. This approach is called spinning, and a lock using this approach is called a spin lock.

A simple spin lock implementation looks like:

struct SpinLock
{
    int bit;
};

void
spinlock(SpinLock *lk)
{
    for(;;) {
        if(atomic_cmp_and_set(&lk->bit, 0, 1))
            return;
    }
}

void
spinunlock(SpinLock *lk)
{
    atomic_set(&lk->bit, 0);
}

The spin lock’s core is the bit field. It is 0 or 1 to indicate unlocked or locked. The atomic_cmp_and_set and atomic_set use special machine instructions to manipulate lk->bit atomically.

Spinning only makes sense if the lock is never held for very long, so that B’s spin loop only executes a small number of times. If the lock can be held for longer periods of time, spinning while it is held wastes CPU and can interact badly with the operating system scheduler.

The second, more general approach is to maintain a queue of threads interested in acquiring the lock. In this approach, when thread B finds the lock already held, it adds itself to the queue and uses an operating system primitive to go to sleep. When thread A eventually releases the lock, it checks the queue, finds B, and uses an operating system primitive to wake B. This approach is called queueing, and a lock using this approach is called a queue lock. Queueing is more efficient than spinning when the lock may be held for a long time.

The queue lock’s queue needs its own lock, almost always a spin lock. In the library I was using, qlock and qunlock were implemented as:

struct QLock
{
    SpinLock spin;
    Thread *owner;
    ThreadQueue queue;
};

void
qlock(QLock *lk)
{
    spinlock(&lk->spin);
    if(lk->owner == nil) {
        lk->owner = current_thread();
        spinunlock(&lk->spin);
        return;
    }
    push(&lk->queue, current_thread());
    spinunlock(&lk->spin);
    os_sleep();
}

void
qunlock(QLock *lk)
{
    Thread *t;

    spinlock(&lk->spin);
    t = pop(&lk->queue);
    lk->owner = t;
    if(t != nil)
        os_wakeup(t);
    spinunlock(&lk->spin);
}

The queue lock’s core is the owner field. If owner is nil, the lock is unlocked; otherwise owner records the thread that holds the lock. The operations on lk->owner are made atomic by holding the spin lock lk->spin.

Back to the bug.

The locks in the crashing code were queue locks. The acknowledgement protocol between R and W sets up a race between W’s call to qunlock and R’s call to qlock (either the explicit call in the code or the implicit call inside qwait). Which call happens first?

If W’s qunlock happens first, then R’s qlock finds the lock unlocked, locks it, and everything proceeds uneventfully.

If R’s qlock happens first, it finds the lock held by W, so it adds R to the queue and puts R to sleep. Then W’s qunlock executes. It sets the owner to R, wakes up R, and unlocks the spin lock. By the time W unlocks the spin lock, R may have already started running, and R may have already called free(conn). The spinunlock’s atomic_set writes a zero to conn->lk.spin.bit. That’s the write-after-free, and if the memory allocator didn’t want a zero there, the zero may cause a crash (or a memory leak, or some other behavior).

But is the server code wrong or is qunlock wrong?

The fundamental misunderstanding here is in the definition of the queue lock API. Is a queue lock required to be unlocked before being freed? Or is a queue lock required to support being freed while locked? I had written the queue lock routines as part of a cross-platform library mimicking Plan 9’s, and this question had not occurred to me when I was writing qunlock.

  • If the queue lock must be freed only when unlocked, then qunlock’s implementation is correct and the server must change. If R calls qunlock before free, then R’s qunlock’s spinlock must wait for W’s qunlock’s spinunlock, so W will really be gone by the time R calls free.
  • If the queue lock can be freed while locked, then the server is correct and qunlock must change: the os_wakeup gives up control of lk and must be delayed until after the spinunlock.The Plan 9 documentation for queue locks does not address the question directly, but the implementation was such that freeing locked queue locks was harmless, and since I was using my library to run unmodified Plan 9 software, I changed the lock implementation to call os_wakeup after spinunlock. Two years later, while fixing a different bug, I defensively changed the server implementation to call qunlock too, just in case. The definition of the POSIX pthread_mutex_destroy function gives a different answer to the same design question: “It is safe to destroy an initialised mutex that is unlocked. Attempting to destroy a locked mutex results in undefined behaviour.”What did we learn?

    The rationale I gave for not calling qunlock before free made an implicit assumption that the two were independent. After looking inside an implementation, we can see why the two are intertwined and why an API might specify, as POSIX does, that you must unlock a lock before destroying it. This is an example of implementation concerns influencing an API, creating a “leaky abstraction.”

    What makes this bug interesting is that it was caused by a complex interaction between manual memory management and concurrency. Obviously a program must stop using a resource before freeing it. But a concurrent program must stop all threads from using a resource before freeing it. On a good day, that can require bookkeeping or careful coordination to track which threads are still using the resource. On a bad day, that can require reading the lock implementation to understand the exact order of operations carried out in the different threads.

    In the modern computing world of clients and servers and clouds, concurrency is a fundamental concern for most programs. In that world, choosing garbage collection instead of manual memory management eliminates a source of leaky abstractions and makes programs simpler and easier to reason about.

    Anything else to add?

    I started the post by saying that good bugs help you learn something unexpected about the virtual world you are exploring. This was especially true for Maurice Wilkes and his team, who built EDSAC, the first practical stored-program computer. The first program they ran on EDSAC (printing square numbers) ran correctly, but the second did not: the log for May 7, 1949 reads “Table of primes attempted – programme incorrect.” That was a Saturday, making this the first weekend spent working on a buggy program.

    What did they learn? Wilkes later recalled,

    “By June 1949, people had begun to realize that it was not so easy to get a program right as had at one time appeared. … It was on one of my journeys between the EDSAC room and the punching equipment that the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs.” (Wilkes, p. 145)

    For more about this early history, see Brian Hayes’s “The Discovery of Debugging” and Martin Campbell-Kelly’s “The Airy Tape: An Early Chapter in the History of Debugging.”

Marc-André Cournoyer

Who are you?

I’m a software developer and entrepreneur. I live in Quebec, Canada. I created OSS, wrote a book and sold a few businesses. Nowadays I mainly code and teach at Coded.

What’s the hardest/funniest bug you’ve met and how did you solve it?

I was recently coding a VM for my code club and couldn’t figure out why none of it was working.

A VM (virtual machine) works like a real CPU. It has a pointer to the current instruction being executed. Usually it’s called the program counter (pc).

The VM works by executing one instruction and moving on to the next. So I coded just that. Executing the instruction, done in a big switchcase loop. And then moving to the next instruction, by incrementing the program counter (instructions are 2 bytes long, so pc += 2).

However, some of those instructions also play with the pc. One in particular is used for implementing control flow (think of it as an if). It would set the pc to an address in memory. If a condition is true move the pc to that address in memory. That’s how all control flow structures are implemented on your CPU.

Here’s the bug. Remember from earlier, I was incrementing the pc right after executing an instruction. That means each time an if was executed the VM would jump to a memory address + 2. Two bytes too far.

The solution was to increment the pc before executing the instruction: https://github.com/greatcodeclub/chip8/blob/master/vm.js#L37.

Anything else to add?

Here’s my process for solving a bug in pseudo machine code.

  1. Get stuck because of said bug.
  2. Bang head on keyboard.
  3. Stop coding and move away from machine.
  4. Get back to machine and try something new.
  5. Jump to step 7 if bug solved.
  6. Jump to step 2.
  7. Celebrate with a drink.

But usually when I find the bug, it is so dumb and stupid I try hard to forget it to keep my confidence up.

Alan Cox on porting Linux to the m68

This is when I first learned the horrors of the Mac. While Unix says ‘Im sorry you can’t do that’, MacOS has two error messages.

It either goes ‘eep?’ or the box you wanted to set but couldn’t is simply not there on you computer until you’ve installed the other 12 unidenfied items and filled in 3 apparently unrelated dialog boxes. This was an error of the latter category.

David Welton

Who are you?

I am originally from Oregon, in the US, live in Padova, Italy, and have been a programmer for the past 15 years, working with a number of languages.

What’s the most interesting/funny bug you’ve met so far and how did you solve it?

When I was heavily involved with Apache Rivet, I once got some very strange errors, that were very difficult to replicate.

After some intense work with GDB, I narrowed the problem down to a particular struct that appears in a library that Rivet links to.

Being a bridge between Apache and Tcl, Rivet links to elements of both, and it turned out that on this particular system, Apache and Tcl were compiled against very slightly different versions of this struct, meaning that in certain situations, code thinking it was going to get one got another, and occasionally the small difference would be enough to blow things up in a way that was not very clear.

Kyle Harr

Who are you?

My name is Kyle Harr and I’ve been developing Java professionally for the past four years in Ann Arbor Michigan (the United States big mitten). Prior to that, I was a freelance developer, primarily designing custom panoramic tours.

What interesting bug did you solve?

The most interesting bug I’ve solved disappeared when I tried debugging it.

I was comparing a sub-class of java.util.Date with a java.util.Date object using date.after(subInstance). However, no matter what dates I used, the Date always seemed to be after() the sub-class date.

So I naturally attached the debugger and after a bit of work identifying the correct line, I inserted a breakpoint. When the breakpoint tripped, I checked the variables list and everything looked appropriate, so I clicked run. And of course, the comparison now worked correctly. Scratching my head, I tried it a few more times and the comparison still worked.

Confused, thinking maybe I had inadvertently been using bad data earlier, I turned off the debugger and went back to work.

But the next time I tried it, it was still not working. So I turned the debugger on and set the breakpoint again and the comparison started worked again. I tested this multiple times and became convinced that running the debugger was fixing the bug!

The breakthrough came when I turned on the debugger and moved the breakpoint to a different line. This time the bug was still there!

Turns out the implementer of the subclass was concerned about performance when dealing with large numbers of date objects that would get deserialized from the database. So they wrote a lazy implementation of the Date class that delayed initializing many fields of the object until any method was called on the instance of the Date subclass.

When I used date.after(subInstance), no methods were called on subInstance. Its uninitialized fields were accessed directly, so it started at time 0.

When the breakpoint tripped, it would call toString() on the subInstance to display its value in the variable list, triggering the lazy initialization of the object and making the bug disappear.

Brent Simmons

Who are you?

I’m a software developer. I live in Seattle — in the Pacific Northwest, which is in the far back of the bus in the United States.

I write Vesper, a note-taking app for iPhone, along with my co-workers Dave Wiskus and John Gruber. In the past I’ve created apps such as NetNewsWire, MarsEdit, and Glassboard.

I blog at inessential.com, and I publish a podcast at therecord.co along with my friend Chris Parrish.

What interesting bug did you solve?

With some version of OS X — perhaps OS X 10.5 — Apple changed how crash logs were stored on disk. It used to be one file per app, but then Apple switched it to one file per crash log.

NetNewsWire, my app at the time, had a crash log catcher that would send me crash logs, so I could figure out what went wrong and fix it.

I updated the crash-log-catching code to handle the new format, the app went out to beta testers, and eventually that code made its way into the next release.

To my surprise, with that next release, a whole bunch of people were getting crashes the first time they launched the app!

I knew this because they were telling me — but also because the app was sending me their crash logs.

What was interesting was that the crash was in the crash log catcher itself. I had forgotten to test the crash log catcher when there were no crash logs.

And when there were no crash logs, it crashed.

Which then created at least one crash log, so the app didn’t crash again. The bug was self-healing!

Of course, I fixed that in the next release. (It was something minor, a one-line fix.) I should have caught that myself, because it’s always good practice to test when there is zero of something.

But it’s not surprising that I, the developer, always had crash logs, and my beta testers did too (since they were using not-ready-to-release versions of the app).

I should have had automated testing for this, but I didn’t. Lesson learned. I was just lucky that, in this one case, the crash could only ever happen once per computer.

Editor’s note: I’d like to thank Brent for being the first!