What Second Life can teach about scaling

February 3rd, 2010

Just read Ian Wilkes’ article, “What Second Life can teach your datacenter about scaling Web apps”.

It’s packed full of really great, radically pragmatic advice. Go read it. A couple of times I literally shouted out “Yes!”, so I pulled out a few choice quotes.

herein lies a trap for smaller ones: the belief that you can “do it right the first time.”

Wanted to jump up and down when I read this. Building it “right” the first time is one of the best guarantees of failure I know. Scaling is always a catch-up game.

a recurring billing system needs to touch each user annually, and the product is only available to Internet users in the US and Europe, and by the biggest estimates will achieve no more than 10% penetration, then it needs to handle about 2-3 events per second (1bn * 75% * 10% / (365 * 86,400)). Conversely, a chat system with a similar userbase averaging 10 messages/day, concentrated during work hours, might need to handle 20,000 messages per second or more.

Events per second is usually the first and most important metric I calculate when designing a system. Even if you only have the roughest of notions, orders of magnitude are important. (And remember, you’re the cynical geek on the team. There are folks on the team paid to dream of world domination; don’t let them influence your numbers too much.)
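The arithmetic in the quote is easy to redo for your own numbers. Here is a rough sketch in Python that reproduces both estimates. The function name and the 8-hour "work hours" window are my assumptions for illustration; the article only says traffic is "concentrated during work hours".

```python
SECONDS_PER_DAY = 86_400


def events_per_second(users, events_per_user_per_day,
                      active_seconds_per_day=SECONDS_PER_DAY):
    """Back-of-envelope average event rate while the system is busy."""
    return users * events_per_user_per_day / active_seconds_per_day


# Figures from the quoted example: 1bn Internet users, 75% of them in the
# US and Europe, 10% penetration.
users = 1_000_000_000 * 0.75 * 0.10  # about 75 million

# Recurring billing touches each user once a year: roughly 2.4 events/sec.
billing_rate = events_per_second(users, 1 / 365)

# Chat at 10 messages per user per day, spread evenly: roughly 8,700/sec.
chat_average_rate = events_per_second(users, 10)

# Same chat traffic squeezed into an ASSUMED 8-hour working day:
# roughly 26,000/sec, in line with the quote's "20,000 or more".
chat_busy_rate = events_per_second(users, 10, active_seconds_per_day=8 * 3600)

print(f"billing:        {billing_rate:,.1f} events/sec")
print(f"chat (average): {chat_average_rate:,.0f} messages/sec")
print(f"chat (8h day):  {chat_busy_rate:,.0f} messages/sec")
```

Same userbase, four orders of magnitude apart. That gap is what you need to know before you pick an architecture.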

can the system be shut down at regular intervals?

Because change is inevitable, and anything resembling perfect uptime is more expensive than you can afford.

Another often-overlooked component of a scaling strategy is the makeup and attitude of the team … the entire development team needs to be aware of at least the basic implications of working on a large system … . This is especially a risk if a centralized resource (say, a database) is heavily abstracted and somewhat invisible to the developer (by, say, an ORM).

So true! Abstractions kill.

the ultimate solution is typically to partition databases into horizontal slices of the data set (typically by user), but this approach can be very expensive to implement.

Not sure why partitioning is thought of as so expensive. It’s annoying, and not for the lazy, but it’s not that difficult/expensive.
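For the unfamiliar, the core of partitioning by user is just a routing function. A minimal sketch, assuming a fixed list of shards and a stable hash of the user ID; the shard names and `shard_for_user` are hypothetical, not anything from Ian's article or from a real system:

```python
import hashlib

# Hypothetical shard names; in practice these would map to connection settings.
SHARDS = ["users_db_0", "users_db_1", "users_db_2", "users_db_3"]


def shard_for_user(user_id, shards=SHARDS):
    """Pick the shard holding a user's rows from a stable hash of their ID.

    hashlib is used instead of the built-in hash() so the mapping stays the
    same across processes and restarts.
    """
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[bucket]


# Every query for a given user is routed to the same database.
print(shard_for_user(42))
```

The annoying parts are everything around this: queries that span users, and moving users when the shard count changes. Plain modulo hashing reshuffles most users when you add a shard, so a lookup table from user to shard is a common alternative.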

Instrument, propagate, and isolate errors

Flickr’s mantra is graph, graph, graph everything that moves.

It pays to thoroughly embrace the exception model

I can only say I wish I had this. I haven’t scaled with it, but living without it is instructive. And painful.

“Fix all the bugs” is rarely a realistic plan.

Similarly, advice to “close bugs first” will leave your product dead in the water.

Batch jobs: the silent killer

Yup.

Beware the grand re-write

Oh my yes.

Have a Plan B

Someday I’ll publish some of our “plan B” documents. Plan Bs are critical to moving fast.

Don’t be afraid to change the product. Sometimes, a small number of features are responsible for the lion’s share of bottlenecks.

Twitter is the master of this.

All around great pragmatic advice.

3 responses to “What Second Life can teach about scaling”

  1. John Allspaw says:

    The Friendster quotes from Lunt are 100% true and spot-on.

  2. Erik kastner says:

    I haven’t worked at scale for very long or very large, so I can’t speak to that directly, except to say all that advice sounds awesome.

    I have, however, been programming for the greater part of my life. When I first read “So true! Abstractions kill.” (in regards to things like ORM), I pumped my fist in the air. Then I thought, wait… abstractions are our sharpest tool. Yes they can kill, but they can also create. Just like a chef’s knife, they are dangerous, but without them we’d still be flipping switches on the front of our analog computers.

    The trick is to find the correct (or least-incorrect) abstractions that provide the best balance between computer inefficiency and coder efficiency. As computers have increased in speed and complexity, the slider has been moving more and more towards insulating the human operator, and allowing more and more expressiveness.

    That’s where the rub lies. All abstractions, by their nature, are “leaky” (the map is not the territory, etc). And scale is where those leaks can (and usually do) become a problem. If you’ve made your abstraction so rigid that you can’t bypass it to get “closer to the metal”, you’re going to get hurt.

    I don’t have a problem with “ORM”s per se (at least how they’re implemented in popular frameworks – they’re not conceptually pure ORMs). What they do encourage is the encapsulation of business logic in a single place. What I do have a problem with is an ORM that doesn’t let you drop down directly to SQL when necessary. Careful violation of abstractions is key.

    We can’t hold a lot in our heads at once. Anything that helps us “chunk down” the huge amounts of information we deal with will help us get stuff done faster. Whether you end up with a good system or a bad one is determined in large part by the architect of said system – if they had discipline and a strong vision, chances are good that you’ll be able to evolve it in a sane manner.

    This is getting very far afield of your post… The “new” thing is the web scaling, the not-new thing is making it possible to communicate with yourself and others through code.

    Programming is communicating with machines.
    Coding is communicating with people.

  3. Kellan says:

    @kastner Absolutely. I regularly ignore the filesystem. Similarly, I regularly ignore most of the moving parts of MySQL. I even pretend that the PHP I’m writing is what is actually running rather than opcodes compiled by APC.

    Ian’s point, which perhaps got obscured in my response, is that your whole team needs to be able to drill down past the abstractions. Because you’ll need to in order to reach your next scaling plateau. (Which is sometimes growth and sometimes recovering from downtime.)

    The danger of introducing excessive abstractions is that it creates mental boundaries that have a cost to cross. Additionally, the abstraction you build into your app will by necessity be less battle-tested than, say, a filesystem.

    And I do dislike ORMs. I would love an object-ROW-mapper, but the attempts to write SQL/manage the relational model have always failed in the domains where I have experience.