Seibel: So are there language features that make programmers—folks who have mastered this unnatural act—more productive? You’re designing a language right now so you’ve obviously got some opinions about this.

Steele: I said earlier that I think you can’t afford to neglect correctness. On the other hand, I think we can design tools to make it easier to achieve that. We can’t make it trivial, but I think we can make it easier to avoid mistakes of various kinds. A good example is overflow detection on arithmetic, or providing bignums instead of just letting 32-bit integers wrap around. Now, implementing those is more expensive but I believe that providing full-blown bignums is a little less error-prone for some kinds of programming.
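For concreteness, here is a minimal Haskell sketch of the trade-off Steele is describing; it is illustrative only and not drawn from the interview. A fixed-width 32-bit integer silently wraps around on overflow, while an arbitrary-precision bignum (Haskell's Integer) simply keeps growing.

import Data.Int (Int32)

main :: IO ()
main = do
  -- A 32-bit integer silently wraps around on overflow.
  let wrapped = (maxBound :: Int32) + 1
  print wrapped                    -- prints -2147483648

  -- A bignum just keeps growing; there is no overflow to reason about.
  let big = (2 :: Integer) ^ 100 + 1
  print big                        -- prints 1267650600228229401496703205377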

A trap that I find systems programmers and designers of operating-systems algorithms constantly falling into is they say, “Well, we need to synchronize some phases here so we’re going to use a take-a-number strategy. Every time we enter a new phase of the computation we’ll increment some variable and that’ll be the new number and then the different participants will make sure they’re all working on the same phase number before a certain operation happens.” And that works pretty well in practice, but if you use a 32-bit integer it doesn’t take that long to count to four billion anymore. What happens if that number wraps around? Will you still be OK or not? It turns out that a lot of such algorithms in the literature have that lurking bug. What if some thread stalls for 2 to the 32nd iterations? That’s highly unlikely in practice, but it’s a possibility. And one should either mitigate that correctness problem or else do the calculation to show that, yeah, it’s sufficiently unlikely that I don’t want to worry about it. Or maybe you’re willing to accept one glitch every day. But the point is you should do the analysis rather than simply ignoring the issue. And the fact that counters can wrap around is a lurking pitfall that doesn’t hurt most programmers but for a very few it lays traps in their algorithms.
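A toy sketch of that wraparound trap, again in Haskell and purely for illustration (it is not one of the published algorithms Steele alludes to): a phase counter held in a 32-bit word returns to the same value after 2^32 increments, so a participant that stalled for that long cannot tell a stale phase from the current one.

import Data.Word (Word32)

main :: IO ()
main = do
  let observed = 7 :: Word32        -- phase number a stalled participant remembers
      current  = observed + 2 ^ 32  -- counter after 2^32 more increments: wraps back to 7
  -- The "are we still in the same phase?" check passes, incorrectly.
  print (current == observed)       -- prints True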

Seibel: Speaking of glitches, what was the worst bug you’ve ever had to track down?

Steele: I’m not sure I can dig up a worst one, but I can tell you a few stories. Certainly, dealing with parallel processes has produced the most difficult-to-deal-with bugs.

When I was a teenager programming on the IBM 1130 was the one time in my life when the solution to a bug came to me in a dream. Or as I woke up. I had been puzzling over a bug for a couple days, couldn’t figure it out. And then suddenly sat bolt upright in the middle of the night and realized I knew what the problem was. And it was because I had overlooked something in an interface specification.

It had to do with concurrent processes. I was writing a decompiler so that I could study the IBM disk operating system by decompiling it. And to this end it would take binary data on the disk and print it in a variety of formats including as instructions, as character codes, as numbers, and so on. And in order to convert the characters, I was feeding the data to various character-conversion routines, one of which was designed for use after reading card codes from a card reader. And I had overlooked the tiny footnote in the specification that said, “We assume that before you call the procedure, the buffer in which the card data will be read has all of the low-order bits cleared.” Or maybe had them set.

Anyway, the 12 bits from the card-code column were going into the high 12 bits of the 16-bit word and they were using the low bit for a clever trick whereby you could call the card-reader routine asynchronously and it would asynchronously load the buffer and the conversion routine would follow behind, using that low bit to determine whether the next card column had been read in or not. And if it had been read it would then convert that character. Thereby, as soon as the card was read, very shortly thereafter the conversion would finish—they were overlapping the cost of the card conversion with the time it took to read the card. And I was feeding it raw binary data that wasn’t obeying this constraint. And I had just overlooked this—I thought it was yet another card-conversion routine but it turned out this one had something special about its interface—it was relying on those low-order bits, which ordinarily you wouldn’t think about. It was interpreting what was in the buffer as saying, “Oh, the data has not yet arrived from the card reader.” In principle I knew this, but it wasn’t occurring to me. And then, as I say, it came to me while I was asleep. So that was an odd case.
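The following is a hypothetical Haskell reconstruction of the handshake Steele describes, not the actual IBM 1130 code; the function names are invented. It shows the low-order "arrived" bit trick, and why the routine's interface depends on the buffer having its low-order bits cleared before the read begins: stale nonzero data in the buffer would look like a column that had already arrived.

import Data.Bits ((.|.), shiftL, shiftR, testBit)
import Data.Word (Word16)

-- Reader side: pack a 12-bit column into the high bits and mark it present.
storeColumn :: Word16 -> Word16
storeColumn col = (col `shiftL` 4) .|. 1

-- Converter side: only convert a word whose "arrived" bit is set.
nextColumn :: Word16 -> Maybe Word16
nextColumn w
  | testBit w 0 = Just (w `shiftR` 4)  -- column has arrived; extract it
  | otherwise   = Nothing              -- still waiting on the card reader

main :: IO ()
main = do
  print (nextColumn (storeColumn 0x5A5))  -- Just 1445
  print (nextColumn 0)                    -- Nothing: the buffer was cleared first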

The other really interesting story I can think of was while I was the maintainer of the Maclisp system and Maclisp supported bignums—integers of arbitrary precision. They’d been around for several years and were considered pretty well debugged and shaken out. They’d been used for all kinds of stuff in Macsyma, and Macsyma users were using them all the time. And yet a report came in from Bill Gosper saying, “The quotient of these two integers is wrong.” And he could tell because the quotient was supposed to be very near a decimal multiple of pi.

These were numbers about a hundred digits each and it really wasn’t feasible to trace through the entire thing by hand because the division routine was fairly complicated and these were big numbers. So I stared at the code and didn’t see anything obviously wrong. But one thing that caught my eye was a conditional step that I didn’t quite understand.

These routines were based on algorithms from Knuth, so I got Knuth off the shelf and read the specification and proceeded to map the steps of Knuth’s algorithm onto the assembly-language code. And what caught my eye in Knuth was a comment that this step happens rarely—with a probability of roughly only one in two to the size of the word. So we expected this to happen only once in every four billion times or something.

From that I reasoned to myself, “These routines are thought to be well shaken out, thus this must be a rare bug; therefore the problem is likely to be in the rarely executed code.” That was enough to focus my attention on that code and realize that a data structure wasn’t being properly copied. As a result, later down the line there was a side-effect bug where something was getting clobbered. And so I repaired that and then fed the numbers through and got the right answer and Gosper seemed to be satisfied.

A week later he came back with two somewhat larger numbers and said, “These don’t divide properly either.” This time, properly prepared, I returned to that same little stretch of ten instructions and discovered there was a second bug of the same kind in that code. So then I scoured the code thoroughly, made sure everything was copied in the right ways, and no errors were ever reported after that.

Seibel: That’s always the trick—that there’s going to be more than one.

Steele: So I guess there’s lessons there—the lesson I should have drawn is there may be more than one bug here and I should have looked harder the first time. But another lesson is that if a bug is thought to be rare, then looking at rarely executed paths may be fruitful. And a third thing is, having good documentation about what the algorithm is trying to do, namely a reference back to Knuth, was just great.

Seibel: When it’s not just a matter of waking up in the middle of the night and realizing what the problem is, what are your preferred debugging techniques? Do you use symbolic debuggers, print statements, assertions, formal proofs, all of the above?

Steele: I admit to being lazy—the first thing I will try is dropping in print statements to see if it will help me, even though that is probably the least effective for dealing with a complicated bug. But it does such a good job of grabbing the simple bugs that it’s worth a try. On the other hand, one of the great revelations in the evolution of my thinking about programming was when I was working on that one project in Haskell. Because it was a purely functional language, I couldn’t just drop in print statements.
