The ability to go study something in a systematic, and maybe even leisurely, way is attractive. The go-to-market, ride Moore’s law, and compete and deal with fast product cycles and sometimes throwaway software—seems like a shame if that’s all everybody does. So there’s a role for people who want to get PhDs, who have the skills for it. And there is interesting research to do. One of the things that we’re pushing at Mozilla is in between what’s respected in academic research circles and what’s already practice in the industry. That’s compilers and VM stuff, debuggers even—things like Valgrind—profiling tools. Underinvested-in and not sexy for researchers, maybe not novel enough, too much engineering, but there’s room for breakthroughs. We’re working with Andreas Gal and he gets these papers rejected because they’re too practical.

Of course, we need researchers who are inclined that way, but we also need programmers who do research. We need to have the programming discipline not be just this sort of blue-collar thing that’s cut off from the people in the ivory towers.

Seibel: How do you feel about proofs?

Eich: Proofs are hard. Most people are lazy. Larry Wall is right. Laziness should be a virtue. So that’s why I prefer automation. Proofs are something that academics love and most programmers hate. Writing assertions can be useful. In spite of bad assertions that should’ve been warnings, we’ve had more good assertions over time in Mozilla. From that we’ve had some illumination on what the invariants are that you’d like to express in some dream type system.

I think thinking about assertions as proof points helps. But not requiring anything that pretends to be a complete proof—there are enough proofs that are published in academic papers that are full of holes.
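[Editor’s note: a minimal sketch of the kind of invariant-as-assertion Eich describes. The doubly linked list and `insert_after` are illustrative inventions, not Mozilla code; the point is that the assertions state a structural invariant—every neighbor link is mutual—that one might wish a “dream type system” could check statically.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy doubly linked list node. The assertions below spell out the
 * invariant that prev/next links are always mutual. */
struct node {
    struct node *prev;
    struct node *next;
    int value;
};

/* Splice n in after pos, asserting the invariant on entry and exit. */
static void insert_after(struct node *pos, struct node *n)
{
    assert(pos != NULL && n != NULL);
    /* Invariant holds on entry: pos's successor (if any) points back. */
    assert(pos->next == NULL || pos->next->prev == pos);

    n->prev = pos;
    n->next = pos->next;
    if (pos->next != NULL)
        pos->next->prev = n;
    pos->next = n;

    /* Invariant restored: all touched links are mutual again. */
    assert(pos->next == n && n->prev == pos);
    assert(n->next == NULL || n->next->prev == n);
}
```

A debug build runs these checks on every call; a release build compiles them away with `-DNDEBUG`, which is roughly the assertion-versus-warning trade-off mentioned above.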

Seibel: On a completely different topic, what’s the worst bug you ever had to track down?

Eich: Oh, man. The worst bugs are the multithreaded ones. The work I did at Silicon Graphics involved the Unix kernel. The kernel originally started out, like all Unix kernels of the day, as a monolithic monitor that ran to completion once you entered the kernel through a system call. Except for interrupts, you could be sure you could run to completion, so no locks for your own data structure. That was cool. Pretty straightforward.

But at SGI the bright young things from HP came in. They sold symmetric multiprocessing to SGI. And they really rocked the old kernel group. They came in with some of their new guys and they did it. They stepped right up and they kept swinging until they knocked the ball pretty far out of the field. But they didn’t do it with anything better than C and semaphores and spin locks and maybe monitors, condition variables. All hand-coded. So there were tons of bugs. It was a real nightmare.

I got a free trip to Australia and New Zealand that I blogged about. We actually fixed the bug in the field but it was hellish to find and fix because it was one of these bugs where we’d taken some single-threaded kernel code and put it in this symmetric multiprocessing multithreaded kernel and we hadn’t worried about a particular race condition. So first of all we had to produce a test case to find it, and that was hard enough. Then under time pressure, because the customer wanted the fix while we were in the field, we had to actually come up with a fix.
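[Editor’s note: a minimal sketch of the class of race that bites when single-threaded code moves onto an SMP machine—not the actual SGI kernel code. A shared counter’s read-modify-write is only safe here because of the mutex; delete the lock/unlock pair and two CPUs can interleave the increment and silently lose updates, often only under specific timing, which is why such bugs need a test case to even reproduce.]

```c
#include <pthread.h>

/* Two threads hammer one shared counter. The mutex serializes the
 * non-atomic read-modify-write (load, add, store). Without it, the
 * final count is nondeterministic and usually less than expected. */
enum { ITERS = 100000 };

static long counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&counter_lock);   /* enter critical section */
        counter++;
        pthread_mutex_unlock(&counter_lock); /* leave critical section */
    }
    return NULL;
}

long run_two_threads(void)
{
    pthread_t t1, t2;
    counter = 0;
    pthread_create(&t1, NULL, bump, NULL);
    pthread_create(&t2, NULL, bump, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;  /* deterministically 2 * ITERS with the lock held */
}
```

Compile with `-pthread`. The unlocked variant typically passes small tests and fails under load, which is exactly the “hand-coded semaphores and spin locks” nightmare described above.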

Diagnosing it was hard because it was timing-sensitive. It had to do with these machines being abused by terminal concentrators. People were hooking up a bunch of PTYs to real terminals. Students in a lab or a bunch of people in a mining software company in Brisbane, Australia in this sort of ’70s sea of cubes with a glass wall at the end, behind which was a bunch of machines including the SGI two-processor machine. That was hard and I’m glad we found it.

These bugs generally don’t linger for years but they are really hard to find. And you have to sort of suspend your life and think about them all the time and dream about them and so on. You end up doing very basic stuff, though. It’s like a lot of other bugs. You end up bisecting—you know, “wolf fence.” You try to figure out by monitoring execution and the state of memory and try to bound the extent of the bug and control flow and data that can be addressed. If it’s a wild pointer store then you’re kinda screwed and you have to really start looking at harder-to-use tools, which have only come to the fore recently, thanks to those gigahertz processors, like Valgrind and Purify.

Instrumenting and having a checked model of the entire memory hierarchy is big. Robert O’Callahan, our big brain in New Zealand, did his own debugger based on the Valgrind framework, which efficiently logs every instruction so he can re-create the entire program state at any point. It’s not just a time-traveling debugger. It’s a full database so you see a data structure and there’s a field with a scrogged value and you can say, “Who wrote to that last?” and you get the full stack. You can reason from effects back to causes. Which is the whole game in debugging. So it’s very slow. It’s like a hundred times slower than real time, but there’s hope.

Or you can use one of these faster recording VMs—they checkpoint only at system call and I/O boundaries. They can re-create corrupt program states at any boundary but to go in between those is harder. But if you use that you can probably close in quickly at near real time and then once you get to that stage you can transfer it into Rob’s Chronomancer and run it much slower and get all the program states and find the bug.

Debugging technology has been sadly underresearched. That’s another example where there’s a big gulf between industry and academia: the academics are doing proofs, sometimes by hand, more and more mechanized thanks to the POPLmark challenge and things like that. But in the real world we’re all in debuggers and they’re pieces of shit from the ’70s like GDB.

Seibel: In the real world one big split is between people who use symbolic debuggers and people who use print statements.

Eich: Yeah. So I use GDB, and I’m glad GDB, at least on the Mac, has a watchpoint facility that mostly works. So I can watch an address and I can catch it changing from good bits to bad bits. That’s pretty helpful. Otherwise I’m using printfs to bisect. Once I get close enough usually I can just try things inside GDB or use some amount of command scripting. But it’s incredibly weak. The scripting language itself is weak. I think Van Jacobson added loops and I don’t even know if those made it into the real GDB, past the FSF hall monitors.

But there’s so much more debugging can do for you and these attempts, like Chronomancer and Replay, are good. They certainly changed the game for me recently. But I don’t know about multithreading. There’s Helgrind and there are other sorts of dynamic race detectors that we’re using. Those are producing some false positives we have to weed through, trying to train the tools or to fix our code not to trigger them. The jury is still out on those.

The multithreaded stuff, frankly, scares me because before I was married and had kids it took a lot of my life. And not everybody was ready to think about concurrency and all the possible combinations of orders that are out there for even small scenarios. Once you combine code with other people’s code it just gets out of control. You can’t possibly model the state space in your head. Most people aren’t up to it. I could be like one of these chest-thumpers on Slashdot—when I blogged about “Threads suck” someone was saying, “Oh, he doesn’t know anything. He’s not a real man.” Come on, you idiot. I got a trip to New Zealand and Australia. I got some perks. But it was definitely painful and it takes too long. As Oscar Wilde said of socialism, “It takes too many evenings.”
