To test the code, I wrote a monstrous “basher.” It ran lots of transactions,each of which contained nested transactions, recursively up to somemaximum nesting depth. Each of the nested transactions would lock andread several elements of a shared array in ascending order and addsomething to each element, preserving the invariant that the sum of all theelements in the array was zero. Each subtransaction was either committedor aborted—90 percent commits, 10 percent aborts, or whatever. Multiplethreads ran these transactions concurrently and beat on the array for aprolonged period. Since it was a shared-memory facility that I was testing, Iran multiple multithreaded bashers concurrently, each in its own process.

At reasonable concurrency levels, the basher passed with flying colors. Butwhen I really cranked up the concurrency, I found that occasionally, justoccasionally, the basher would fail its consistency check. I had no idea whatwas going on. Of course I assumed it was my fault because I had written allof this new code.

I spent a week or so writing painfully thorough unit tests of eachcomponent, and all the tests passed. Then I wrote detailed consistencychecks for each internal data structure, so I could call the consistencychecks after every mutation until a test failed. Finally I caught a low-levelconsistency check failing—not repeatably, but in a way that allowed me toanalyze what was going on. And I came to the inescapable conclusion thatmy locks weren’t working. I had concurrent read-modify-write sequencestaking place in which two transactions locked, read, and wrote the samevalue and the last write was clobbering the first.

I had written my own lock manager, so of course I suspected it. But the lockmanager was passing its unit tests with flying colors. In the end, Idetermined that what was broken wasn’t the lock manager, but theunderlying mutex implementation! This was before the days when operatingsystems supported threads, so we had to write our own threading package.It turned out that the engineer responsible for the mutex code hadaccidentally exchanged the labels on the lock and try-lock routines in theassembly code for our Solaris threading implementation. So every time youthought you were calling lock, you were actually calling try-lock, and viceversa. Which means that when there was actual contention—rare in thosedays—the second thread just sailed into the critical section as if the firstthread didn’t have the lock. The funny thing was that that this meant thewhole company had been running without mutexes for a couple weeks, andnobody noticed.

There’s a wonderful Knuth quote about testing, quoted by Bentley andMcIlroy in their wonderful paper called “Engineering a Sort Function,” aboutgetting yourself in the meanest and nastiest mood that you can. I mostcertainly did that for this set of tests. But this tickled all of the things thatmake a bug hard to find. First of all, it had to do with concurrency and itwas utterly unreproducible. Second of all, you had some core assumptionthat turned out to be false. It’s the hallmark of the tyro that they say, ”Yeah,well, the language is broken” or, “The system is broken.” But in this case,yes, the bedrock on which I was standing—the mutex—was, in fact, broken.

Seibel: So the bug wasn’t in your code but in the meantime you hadwritten such thorough unit tests for your code that you had no choice butto look outside your code. Do you think there were tests that the author ofthe mutex code could have, or should have, written that would have foundthis bug and saved you a week and a half of debugging?

Bloch: I think a good automated unit test of the mutex facility could havesaved me from this particular agony, but keep in mind that this was in theearly ‘90s. It never even occurred to me to blame the engineer involved fornot writing good enough unit tests. Even today, writing unit tests forconcurrency utilities is an art form.

Seibel: We talked a bit before about stepping through code, but what arethe actual tools you use for debugging?

Bloch: I’m going to come out sounding a bit Neanderthal, but the mostimportant tools for me are still my eyes and my brain. I print out all thecode involved and read it very carefully.

Debuggers are nice and there are times when I would have used a printstatement, but instead use a breakpoint. So yes, I use debuggersoccasionally, but I don’t feel lost without them, either. So long as I can putprint statements in the code, and can read it thoroughly, I can usually findthe bugs.

As I said, I use assertions to make sure that complicated invariants aremaintained. If invariants are corrupted, I want to know the instant ithappens; I want to know what set of actions caused the corruption to takeplace.

That reminds me of another very difficult-to-find bug. My memory of thisone is a bit hazy; either it happened at Transarc or when I was a gradstudent at CMU, working on the Camelot distributed transaction system. Iwasn’t the one who found this one, but it sure made an impression on me.

We had a trace package that allowed code to emit debugging information.Each trace event was tagged with the ID of the thread that emitted it.Occasionally we were getting incorrect thread IDs in the logs, and we hadno idea why. We just decided that we could live with the bug for a while. Itseemed innocuous enough.

It turned out that the bug wasn’t in the trace package at all: it was muchmore serious. To find the thread ID, the trace package called into thethreading package. To get the thread ID, the threading package used a trickthat was fairly common at the time: it looked at some high-order bits of theaddress of a stack variable. In other words, it took a pointer to a stackvariable, shifted it to the right by a fixed distance, and that was the threadID. This trick depends on the fact that each thread has a fixed-size stackwhose size is a well-known power of two.

Seems like a reasonable approach, right? Except that people who didn’tknow any better were creating objects on the stack that were, by thestandards of the day, very big. Perhaps arrays of 100 elements, each 4k insize—so you’ve got 400k slammed onto your thread stack. You jump rightover the stack’s red zone and into the next thread’s stack. Now the thread-IDmethod misidentifies the thread. Worse, when the thread accessesthread-local variables, it gets the next thread’s values, because the thread IDwas used as the key to the thread-local variables.

So what we took to be a minor flaw in the tracing system was actuallyevidence of a really serious bug. When an event was attributed to thread-43instead of thread-42, it was because thread-42 was now unintentionallyimpersonating thread-43, with potentially disastrous consequences.

This is an example of why you need safe languages. This is just notsomething that anyone should ever have to cope with. I was talking tosomeone recently at a university who asked me what I thought about thefact that his university wanted to teach C and C++ first and then Java,because they thought that programmers should understand the system “allthe way down.”

I think the premise is right but the conclusion is wrong. Yes, students shouldlearn low-level languages. In fact, they should learn assembly language, andeven chip architecture. Though chips have turned into to these unbelievablecomplicated beasts where even the chips don’t have good performancemodels anymore because of the fact that they are such complicated statemachines. But they’ll be much better high-level language programmers ifthey understand what’s going on in the lower layers of the system.

So yes, I think it’s important that you learn all this stuff. But do I think youshould start with a low-level language like C? No! Students should not haveto deal with buffer overruns, manual memory allocation, and the like in theirfirst exposure to programming.


Перейти на страницу:
Изменить размер шрифта: