Seibel: What is the worst bug you’ve ever had to track down?

Norvig: Well, I guess the most consequential bugs I was involved with were not mine, but the ones I had to clean up after: the Mars program failures in ’98. One was foot-pounds vs. newtons. And the other was, we think, though we’re not 100 percent sure, prematurely shutting off the engines due to a software problem.

Seibel: I read one of the reports on the Mars Climate Orbiter (that was the one that was the foot-pounds vs. newtons problem) and you were the only computer scientist on that panel. Were you involved in talking to the software guys to figure out what the problem was?

Norvig: That was pretty easy, post hoc, because they knew the failure mode. From that they were able to back it out and it didn’t take long to figure that one out. Then there was this postmortem of why did it happen? And I think it was a combination of things. One was outsourcing. It was a joint effort between JPL in Pasadena and Lockheed-Martin in Colorado. There were two people on two different teams and they just weren’t sitting down and having lunch together. I’m convinced that if they had, they would have solved this problem. But instead, one guy sent an email saying, you know, “Something not quite right with these measurements, seems like we’re off by a little bit. It’s not very much, it’s probably OK, but ...”

Seibel: That was all during the flight?

Norvig: Right. During the flight they had chance after chance to catch it. They knew something was wrong and they sent this email but they did not put it into the bug-tracking system. If they had, NASA has very good controls for bug tracking and at later points in the flight somebody would have had to OK it. Instead it was just an informal email that never got an answer back, and JPL said, “Oh, I guess Lockheed-Martin must have solved this problem.” And Lockheed says, “Oh, JPL’s not asking anymore; they must not be concerned.”

So it was this communications problem. It was also a software reuse problem. They have extremely good checks for the stuff that’s mission-critical, and on the previous mission the stuff that was recorded in foot-pounds was non–mission-critical; it was just a log that wasn’t used for navigation. So it had been classified as non–mission-critical. In the new mission they reused most of the stuff but they changed the navigation so that what was formerly a log file now became an input to the navigation.

Seibel: So the actual problem was that one side generated a data file in foot-pounds and that data file was fed into a piece of software that was calculating inputs to the actual navigation and was expecting newtons?

Norvig: Right. So essentially the other root cause was too many particles from the sun. The spacecraft is asymmetrical and it’s got these solar panels. Particles twist the spacecraft a little bit so you’ve got to fire the rockets to twist it back. So this new hire at Lockheed went to the rocket manufacturer, and they had all their specifications in foot-pounds, so he just said, “I’ll go with that,” and he recorded them that way, not knowing that NASA wanted them in metric.
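
As an illustration of the mismatch described above, here is a minimal sketch of one way to keep a unit attached to every value so that a consumer expecting metric cannot silently read imperial numbers. It is not the mission software. The names Force and navigation_input are hypothetical, and the sketch assumes the quantities are forces in pound-force and newtons, which is a simplification of the "foot-pounds vs. newtons" shorthand used in the conversation.

```python
from dataclasses import dataclass

# One pound-force in newtons (exact by definition of the pound-force).
LBF_TO_NEWTONS = 4.4482216152605


@dataclass(frozen=True)
class Force:
    """A force value that always carries its unit (hypothetical type)."""
    value: float
    unit: str  # "N" or "lbf"

    def to_newtons(self) -> float:
        if self.unit == "N":
            return self.value
        if self.unit == "lbf":
            return self.value * LBF_TO_NEWTONS
        # Refuse to guess: an unknown unit is an error, not a number.
        raise ValueError(f"unknown force unit: {self.unit!r}")


def navigation_input(thrust: Force) -> float:
    """Hypothetical consumer that only ever computes in newtons."""
    return thrust.to_newtons()
```

With this shape, a value recorded as Force(1.0, "lbf") reaches the consumer as about 4.45 newtons instead of being read as 1.0, and a record with a missing or unrecognized unit fails loudly.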

Seibel: I was struck, reading that report, by the NASA attitude of, “Well, the problem was due to this software bug but we had so many other chances to notice that the ship wasn’t where we expected, and we should have. We should have fixed it anyway even though the numbers we were dealing with were totally bogus because of some stupid software glitch.” I thought that was admirable.

Norvig: Yeah, they were looking at the process.

Seibel: Is it actually common for there to be software bugs of that magnitude, which we never hear about because all the other processes keep everything online?

Norvig: Yeah, I think so. Look at all the software bugs on your computer. There are millions of them. But most of the time it doesn’t crash.

Seibel: Yet you hear about how the shuttle flight software costs $1,500 a line or something because of the care with which they write it and which is allegedly bug-free. Is that just a lie?

Norvig: No, that’s probably true. But I don’t know if it’s optimal. I think they might be better off with buggy software.

Seibel: With cheaper software and better operations?

Norvig: Yeah, because of the amount of training they have to give to these astronauts to be able to deal with the things the software just can’t do. They put these astronauts in simulators and give them all these situations and when things go bad you’ve got this screen and stuff is scrolling through it and you can’t pause the screen, you can’t go back, you can’t get a summary of what the important things are. The astronauts just have to be trained to know, “When I see this happening, here’s what’s really going on.” There are a hundred messages in a row saying, “This electrical thing has faulted,” and you train them to say, “OK, that must be because this one original one faulted and then there was a cascade downstream and all the other ones are reported.” Why can’t you do that in software rather than train the astronaut? They don’t try because they don’t want to mess with it.
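
The cascade summary described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how any flight system works: it assumes a known map from each component to the components that feed it (the hypothetical feeds_from), and it treats a fault as a root cause when none of that component's upstream feeders has also faulted.

```python
def root_faults(faulted, feeds_from):
    """Return the faulted components with no faulted upstream feeder.

    faulted:    component names in the order their messages arrived
                (repeats are allowed and are collapsed).
    feeds_from: hypothetical map from a component to the components
                it depends on.
    """
    faulted_set = set(faulted)
    roots = []
    # dict.fromkeys drops repeated messages while keeping arrival order.
    for component in dict.fromkeys(faulted):
        upstream = feeds_from.get(component, [])
        if not any(u in faulted_set for u in upstream):
            roots.append(component)
    return roots


# Made-up example: one bus failure cascades to everything it feeds.
feeds_from = {
    "bus_a": [],
    "pump_1": ["bus_a"],
    "pump_2": ["bus_a"],
    "sensor_9": ["pump_1"],
}
messages = ["bus_a", "pump_1", "pump_2", "sensor_9", "pump_1"]
print(root_faults(messages, feeds_from))
```

Here the five scrolling messages collapse to a single entry, bus_a, which is the summary the astronaut is otherwise trained to work out by hand.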

Seibel: On a different topic, what are your preferred debugging techniques and tools? Print statements? Formal proofs? Symbolic debuggers?

Norvig: I think it’s a mix and it depends on where I am. Sometimes I’m using an IDE that has good tracing capability and sometimes I’m just using Emacs and don’t have all that. Certainly tracing and printing. And thinking. Writing smaller test cases and watching them go, and breaking the functionality down to see where the test case failed. And I’ve got to admit, I often end up rewriting. Sometimes I do that without ever finding the bug. I get to the point where I can just feel that it’s in this part here. I’m just not very comfortable about this part. It’s a mess. It really shouldn’t be that way. Rather than tweak it a little bit at a time, I’ll just throw away a couple hundred lines of code, rewrite it from scratch, and often then the bug is gone.

Sometimes I feel guilty about that. Is that a failure on my part? I didn’t understand what the bug was. I didn’t find the bug. I just dropped a bomb on the house and blew up all the bugs and built a new house. In some sense, the bug eluded me. But if it becomes the right solution, maybe it’s OK. You’ve done it faster than you would have by finding it.

Seibel: What about things like assertions or invariants? How formally do you think about those kinds of things when you’re coding?

Norvig: I guess I’m more on the informal side. I haven’t used languages where that’s a big part of the formal mechanism, other than just type declarations. Like loop invariants: I’ve always thought that was more trouble than it was worth. I occasionally have a problem where this loop doesn’t terminate, but mostly you don’t, and I just feel like it slows you down to do the formal part. And if you do have a problem, the debugger will tell you what loop you’re stuck inside. I guess if you’re writing high-dependability software that’s embedded in something that it’s really important that it doesn’t fail, then you really want to prove everything. But just in terms of getting the first version of the program running or debugging it, I’d rather move fast towards that than worry about the degree of formal specification you need later.
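
For readers unfamiliar with the term, a loop invariant is a condition that holds every time the loop comes around. The sketch below shows what checking one with plain assertions might look like, using binary search over a sorted list as an assumed example; it is not code from the interview, and the assertion checks are deliberately slow, debugging-only aids.

```python
def binary_search(items, target):
    """Return an index of target in sorted items, or -1 if absent."""
    lo, hi = 0, len(items)
    while lo < hi:
        # Invariant: everything left of lo is smaller than target and
        # everything from hi onward is larger, so target can only be
        # in items[lo:hi]. These checks are O(n), for debugging only.
        assert all(x < target for x in items[:lo])
        assert all(x > target for x in items[hi:])
        mid = (lo + hi) // 2
        if items[mid] < target:
            lo = mid + 1
        elif items[mid] > target:
            hi = mid
        else:
            return mid
    return -1
```

The range hi - lo shrinks on every pass, which is the termination argument; the assertions cover the correctness half. Running Python with the -O flag strips the assert statements, so the checks cost nothing in that mode.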

Seibel: Have you ever done anything explicit to try and learn from the bugs that you’ve created?

Norvig: Yeah, I think that’s pretty interesting and I wish I could do more with that. I’m actually in a discussion now to see if I can do an experiment, company-wide and then maybe for the world at large, to understand more some of these issues. How do you classify bugs, but also what are some factors in terms of productivity? How do you know? Is there a certain type of person? What are the factors of that person that makes them more productive? And I think it’s more interesting what the controllable factors are that make somebody do better. If giving them a bigger monitor increases productivity by such and such a percent, then you should probably do it.

