I think that raised maybe six or seven thousand dollars or something to buy these two big Dells and put them in Speakeasy in downtown Seattle. Somebody recommended some servers, Dells, these huge 6U things, like ninety pounds each. The logical split was the database server and the web server. That was the only division I knew because I was running a MySQL process and an Apache process.

That worked well for a while. The web servers spoke directly to the world and had two network cards and had a little crossover cable to the database server. Then the web server got overloaded, but that was still fairly easy. At this point I got 1U servers. Then we had three web servers and one database server. At that point, I started playing with three or four HTTP load balancers, mod_backhand and mod_proxy and Squid, and hated them all. That started my hate for HTTP load balancers.

The next thing to fall over was the database, and that’s when I was like, “Oh, shit.” The web servers scale out so nicely. They’re all stateless. You just throw more of them in and spread the load. So that was a long, stressful time. “Well, I can optimize queries for a while,” but that only gives you another week until it’s loaded again. So at some point, I started thinking about what an individual request actually needs.

That’s when (I thought I was the first person in the world to think of this) I was like, we’ll shard it out, partition it. So I wrote up a design doc with pictures saying how our code would work: “We’ll have our master database just for metadata about global things that are low traffic, and all the per-blog and per-comment stuff will be partitioned onto a per-user database cluster. These user IDs are on this database partition.” Obvious in retrospect; it’s what everyone does. Then there was a big effort to port the code while the service was still running.
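
As a rough illustration of the split he describes, a master database for low-traffic global metadata plus per-user clusters for everything else, a Perl/DBI sketch might look like this. The table, column, and connection names are invented here, not LiveJournal’s actual schema.

    # Rough sketch of the partitioning idea, assuming a 'user' table on the
    # master that records each user's cluster number. Names are illustrative.
    use strict;
    use warnings;
    use DBI;

    my %cluster_dsn = (
        1 => "DBI:mysql:database=ljcluster1;host=db-c1",
        2 => "DBI:mysql:database=ljcluster2;host=db-c2",
    );

    my $master = DBI->connect("DBI:mysql:database=ljglobal;host=db-master",
                              "lj", "secret", { RaiseError => 1 });

    # Global, low-traffic metadata: which cluster does this user live on?
    sub cluster_for_user {
        my ($userid) = @_;
        my ($cid) = $master->selectrow_array(
            "SELECT clusterid FROM user WHERE userid = ?", undef, $userid);
        return $cid;
    }

    # Hand back a handle to wherever the per-blog/per-comment rows live.
    sub dbh_for_user {
        my ($userid) = @_;
        my $cid = cluster_for_user($userid);
        return $master if !$cid;    # 0 means still on the master
        return DBI->connect($cluster_dsn{$cid}, "lj", "secret",
                            { RaiseError => 1 });
    }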

Seibel: Was there a red-flag day where you just flipped everything over?

Fitzpatrick: No. Every user had a flag basically saying what cluster number they were on. If it was zero, they were on the master; if it was nonzero, they were partitioned out. Then there was a “Your Account Is Locked” version number. So it would lock and try to migrate the data and then retry if you’d done some mutation in the meantime: basically, wait ’til we’ve done a migration where you hadn’t done any write on the master, and then pivot and say, “OK, now you’re over there.”
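
A schematic version of that lock-migrate-retry loop might look like the sketch below. Every helper function in it is a hypothetical stand-in; this is not the actual LiveJournal code.

    # Sketch of the migration flow described above. set_user_flag,
    # master_write_version, copy_user_rows, and set_user_cluster are all
    # invented helpers for illustration.
    sub migrate_user {
        my ($userid, $target_cid) = @_;

        set_user_flag($userid, "readonly");          # "Your Account Is Locked"
        my $version_before = master_write_version($userid);

        copy_user_rows($userid, $target_cid);        # per-blog, per-comment rows

        if (master_write_version($userid) == $version_before) {
            # No writes hit the master while we copied: safe to pivot.
            set_user_cluster($userid, $target_cid);
            set_user_flag($userid, "active");
            return 1;
        }

        # A mutation snuck in during the copy; unlock and retry this user later.
        set_user_flag($userid, "active");
        return 0;
    }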

This migration took months to run in the background. We calculated that if we just did a straight data dump and wrote something to split out the SQL files and reload it, it would have taken a week or something. We could have a week of downtime or two months of slow migration. And as we migrated, say, 10 percent of the users, the site became bearable again for the other ones, so then we could turn up the rate of migration off the loaded cluster.

Seibel: That was all pre-memcached and pre-Perlbal.

Fitzpatrick: Yeah, pre-Perlbal for sure. Memcached might have come after that. I don’t think I did memcached until like right after college, right when I moved out. I remember coming up with the idea. I was in my shower one day. The site was melting down and I was showering and then I realized we had all this free memory all over the place. I whipped up a prototype that night, wrote the server in Perl and the client in Perl, and the server just fell over because it was just way too much CPU for a Perl server. So we started rewriting it in C.
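
The core idea he describes, hash the key to decide which box’s spare memory should hold the value, fits in a few lines of Perl. The host list and the hash function below are invented for illustration; real memcached clients use more careful hashing.

    # Minimal sketch: pick a cache host by hashing the key.
    my @memcached_hosts = ("web1:11211", "web2:11211", "web3:11211");

    sub host_for_key {
        my ($key) = @_;
        my $h = 0;
        # Simple djb2-style rolling hash over the key's characters.
        $h = ($h * 33 + ord($_)) % 4294967296 for split //, $key;
        return $memcached_hosts[ $h % @memcached_hosts ];
    }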

Seibel: So that saved you from having to buy more database servers.

Fitzpatrick: Yeah, because they were expensive and slow to migrate. Web servers were cheap and we could add them and they would take effect immediately. You buy a new database and it’s like a week of setup and validation: test its disks, and set it all up and tune it.

Seibel: So all the pieces of infrastructure you built, like memcached and Perlbal, were written in response to the actual scaling needs of LiveJournal?

Fitzpatrick: Oh, yeah. Everything we built was because the site was falling over and we were working all night to build a new infrastructure thing. We bought one NetApp ever. We asked, “How much does it cost?” and they’re like, “Tell us about your business model.” “We have paid accounts.” “How many customers do you have? What do you charge?” You just see them multiplying. “The price is: all the disposable income you have without going broke.” We’re like, “Fuck you.” But we needed it, so we bought one. We weren’t too impressed with the I/O on it and it was way too expensive and there was still a single point of failure. They were trying to sell us a configuration that would be high availability and we were like, “Fuck it. We’re not buying any more of these things.”

So then we just started working on a file system. I’m not even sure the GFS paper had been published at this point; I think I’d heard about it from somebody. At this point I was always spraying memory all over just by taking a hash of the key and picking the shard. Why can’t we do this with files? Well, files are permanent. So, we should record where each file actually is, because the configuration will change over time as we add more storage nodes. That’s not much I/O, just keeping track of where stuff is, but how do we make that highly available? So we figured that part out, and I came up with a scheme: “Here’s all the reads and writes we’ll do to find where stuff is.” And I wrote the MySQL schema first for the master and the tracker for where the files are. Then I was like, “Holy shit! Then this part could just be HTTP. This isn’t hard at all!”
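
A guessed-at, heavily simplified version of the kind of tracker he is describing: a small MySQL database records which storage nodes hold each file, and the bytes themselves are fetched over plain HTTP. The table and column names below are invented; the schema they actually shipped differs.

    # Invented, simplified tables for the "where does this file live?" tracker.
    my $schema = <<~'SQL';
        CREATE TABLE file (
            fid    INT UNSIGNED NOT NULL PRIMARY KEY,   -- internal file id
            dkey   VARCHAR(255) NOT NULL,               -- the key callers ask for
            length BIGINT UNSIGNED,
            UNIQUE KEY (dkey)
        );
        CREATE TABLE device (
            devid    INT UNSIGNED NOT NULL PRIMARY KEY, -- one storage node/disk
            hostname VARCHAR(255) NOT NULL
        );
        CREATE TABLE file_on (
            fid   INT UNSIGNED NOT NULL,                -- which file
            devid INT UNSIGNED NOT NULL,                -- lives on which device
            PRIMARY KEY (fid, devid)
        );
        SQL

    # A read is just: ask the tracker which devices hold the key, then fetch
    # one of the returned URLs over plain HTTP.
    sub paths_for_key {
        my ($dbh, $dkey) = @_;
        return @{ $dbh->selectcol_arrayref(
            "SELECT CONCAT('http://', d.hostname, '/dev', fo.devid, '/', f.fid)
               FROM file f
               JOIN file_on fo ON fo.fid = f.fid
               JOIN device d   ON d.devid = fo.devid
              WHERE f.dkey = ?", undef, $dkey) };
    }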

I remember coming into work after I’d been up all night thinking about this. We had a conference room downstairs in the shared office building, a really dingy, gross conference room. “All right, everyone, stop. We’re going downstairs. We’re drawing.” Which is pretty much what I said every time we had a design: we’d go find the whiteboards to draw.

I explained the schema and who talks to who, and who does what with the request. Then we went upstairs and I think I first ordered all the hardware because it takes two weeks or something to get it. Then we started writing the code, hoping we’d have the code done by the time the machines arrived.

Everything was always under fire. Something was always breaking, so we were always writing new infrastructure components.

Seibel: Are there things that, if someone had just sat you down at the very beginning and told you, “You need to know X, Y, and Z,” your life would have been much easier?

Fitzpatrick: It’s always easier to do something right the first time than to do a migration with a live service. That’s the biggest pain in the ass ever. Everything I’ve described, you could do on a single machine. Design it like this to begin with. You no longer make assumptions about being able to join this user data with that user data or something like that. Assume that you’re going to want to load these 20 assets: your implementation can be to load them all from the same table, but your higher-level code that just says, “I want these 20 objects” can have an implementation that scatter-gathers over a whole bunch of machines. If I had done that from the beginning, I’d have saved a lot of migration pain.
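
One way to read that advice in code: the caller asks for objects by ID and never sees the storage layout. In the sketch below, cluster_for_id and fetch_from_cluster are hypothetical helpers; on a single machine the whole thing could be one query behind the same interface.

    # Sketch of the "ask for 20 objects, let the implementation scatter-gather"
    # idea. cluster_for_id() and fetch_from_cluster() are invented helpers.
    sub load_objects {
        my (@ids) = @_;

        # Group the wanted ids by whichever cluster owns them.
        my %by_cluster;
        push @{ $by_cluster{ cluster_for_id($_) } }, $_ for @ids;

        # Scatter the fetches, then gather the rows into one result set.
        my %objects;
        for my $cid (keys %by_cluster) {
            my $rows = fetch_from_cluster($cid, $by_cluster{$cid});
            $objects{ $_->{id} } = $_ for @$rows;
        }

        return map { $objects{$_} } @ids;    # preserve the caller's order
    }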

Seibel: So basically the lesson is, “You have to plan for the day when your data doesn’t all fit into one database.”

Fitzpatrick: Which I think is common knowledge nowadays in the web community. And people can go overkill on assuming that their site is going to be huge. But at the time, the common knowledge was, Apache is all you need and MySQL is all you need.

Seibel: It does seem that while you were writing all this stuff because you needed it, you also enjoyed doing it.

Fitzpatrick: Oh, yeah. I definitely try to find an excuse to use anything, to learn it. Because you never learn something until you have to write something in it, until you have to live and breathe it. It’s one thing to go learn a language for fun, but until you write some big, complex system in it, you don’t really learn it.

