Blog posts tagged "scalability"

On the Freebase custom tuple store, graphd

September 29th, 2008

Thanks to Simon for tickling my memory on this great blog post from Freebase on their custom tuple server.

Graphd is another good example of the log-oriented, append-only pattern. This is the sort of stuff I’ve been thinking about for a bit, and wishing I had more time to play with. Disks and disk metaphors might turn out to be our most dramatically parallelizable constructs.
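
For the uninitiated, a toy sketch of what the append-only idea buys you (everything here is invented for illustration, and has nothing to do with graphd’s actual internals): writes only ever land at the end of the log, and current state is whatever you get by replaying it.

    import json

    class AppendOnlyLog:
        """Toy log-structured store: writes append, reads replay."""

        def __init__(self, path):
            self.path = path

        def append(self, record):
            # Writes never touch existing data; they only go on the end.
            with open(self.path, "a") as f:
                f.write(json.dumps(record) + "\n")

        def current_state(self):
            # Reads replay the log; the latest record for an id wins.
            # A real store keeps indexes rather than rescanning, but the
            # principle is the same.
            state = {}
            with open(self.path) as f:
                for line in f:
                    record = json.loads(line)
                    if record.get("deleted"):
                        state.pop(record["id"], None)
                    else:
                        state[record["id"]] = record
            return state

    log = AppendOnlyLog("tuples.log")
    log.append({"id": 1, "left": "freebase", "type": "uses", "right": "graphd"})
    log.append({"id": 1, "left": "freebase", "type": "uses", "right": "graphd, still"})
    print(log.current_state())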

Still, my favorite hack is that, because they’re building a wiki-like tool, Freebase can bubble their eventual consistency implementation all the way up to the end users, who are mentally prepared to deal with write contention (they’re already dealing with rightness contention, after all). We’re living in a post-ACID world.

The only downside: everyone I’ve talked to at Freebase seems pretty solid on this being their proprietary secret sauce, which is a shame, because a good, fast, scalable open source tuple store might actually jump-start a real semantic (small-s) web after all these years.

Oh, and only tangentially related, Myles published a good high-level overview of our job queue system last Friday.

Facebook on “Scaling Out”

August 21st, 2008

Jason Sobel has an interesting post, “Scaling Out” on Facebook’s BCP work and the move to being multi-colo.

Interesting to me was noting that:

  • they just got around to this 8 months ago, and they’re fscking Facebook (which means you can wait)
  • they’re still doing all writes to a single datacenter
  • they’re hacking an object-level mark/sweep into the MySQL replication stream, which suggests a certain parable about a hammer and nails.

via PaulH


A Couple of Caveats on Queuing

July 7th, 2008

Les’ “Delight Everyone” post is the latest, greatest addition to the “17th letter of the alphabet as savior” conversation.

And believe me, I’m a huge fan, and I’m busy carving out a night sometime this week to play with the RabbitMQ/XMPP bridge (/waves hi Alexis).

But … there are a couple of caveats:

1) Some writes need to be real time.

Les notes this as well, but I just wanted to emphasize it because, really, they do.

If you can’t see your changes take effect in a system, your understanding of cause and effect breaks down. It doesn’t matter that your understanding is wrong; you still need one to function, ideally one with a physical analogy. There are no real-world effects that get queued for later application. Violate the principle of (falsely) seeming to respect real-world cause and effect, and your users will remain forever confused.

del.icio.us showing you the wrong state when you use the inline editing tool, and Flickr taking a handful of seconds to index a newly tagged photo are both good examples of subtly broken interfaces that can really throw people.

My data, now, in real time. Everyone else can wait (how long depends on how social your users are).

2) You’ve got to process that queue eventually.

Ideally you can add processing boxes in parallel forever, but if your dequeuing rate falls below your queuing rate you are, in technical terms, screwed.

Think about it: if you’re falling behind by 1 event per second (processing 1,000,000 events a second but adding 1,000,001, say), then at the end of the day you’re 86,400 events in debt and counting. It’s like losing money on individual sales but trying to make it up in volume.
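
The arithmetic, spelled out (same made-up rates as above):

    # Back-of-the-envelope queue debt; the rates are the made-up ones above.
    enqueue_rate = 1000001      # events added per second
    dequeue_rate = 1000000      # events processed per second
    seconds_per_day = 86400

    deficit_per_second = enqueue_rate - dequeue_rate
    print(deficit_per_second * seconds_per_day)   # 86400 events in debt, and counting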

Good news: Traffic is spiky and most sites see daily cycles with quiet times.

Bad news: Many highly tuned systems slow down further as their backlogs increase. Like a credit card, processing debt can get exponentially unmanageable.

In practice this means that most of the time your queue consumers should be sitting around bored. (see Allspaw’s Capacity Planning slides for more on that theme.)

If you can’t guarantee those real-time writes for thems that cares, and mostly bored queue consumers the rest of the time, then your queues might not delight you after all.

See also: Twitter, or Architecture Will Not Save You

Twitter, or Architecture Will Not Save You

May 28th, 2008


(circa 2006 Twitter maintenance cat)

Along with a whole slew of smart folks, I’ve been playing the current think game du jour, “How would you re-architect Twitter?”. Unlike most, I’ve been having this conversation off and on for a couple of years, mostly with Blaine, in my unofficial “Friend of Twitter” capacity (the same capacity in which I wrote the first Twitter bot, and have on rare occasion logged into their boxes to play “spot the runaway performance issue”).

For my money, Leonard’s Brought to You By the 17th Letter of the Alphabet is probably the best proposed architecture I’ve seen — or at least it matches the biases I had when I sat down last month to sketch out how to build a Twitter-like thing. But when Leonard and I were chatting about this stuff last week, I was struck by what was missing from the larger Blogosphere’s conversation: the issues Twitter is actually facing.

Folks both within Twitter and without have framed the conversation as an architectural challenge. Meanwhile the nattering classes have struck on the fundamental challenge of all social software (namely the network effects) and are reporting that they’ve gotten confirmation from “an individual who is familiar with the technical problems at Twitter” that indeed Twitter is a social software site!

Living and Dying By the Network

All social software has to deal with the network effect. At scale it’s hard. And all large social software has had to solve it. If you’re looking for the roots of Twitter’s special challenges, you’re going to have to look a bit farther afield.

Though you can hedge your bets with this stuff by making less explicit promises than Twitter does (“everything from my friends in a timely fashion” is a pretty hard promise to keep). Flickr mitigates some of this impact by making promises about recent contacts, not recent photos (there are fewer people than photos), and meanwhile Facebook can hide a slew of sins behind the fact that their newsfeeds are “editorialized”, with no claims of completeness anywhere in sight. (There is a figure floating around that, at least at one point, Facebook was dropping 80% of their updates on the floor.)

So while architectures that strip Twitter down to queues and logs could be a huge win, and while thinking about new architectures is the sexy, hard problem we all want to fix, Twitter’s problems are really a more pedestrian kind of hard, of the plumbing and ditch-digging variety. Which is less fun, but reality.

Growth

Their first problem is growth. Honest-to-god hockey stick growth is so weird, and wild, and hard, that it’s hard to imagine and cope with if you haven’t been through it at least once. To quote Leonard again (this from a few weeks ago, back when TC thought they’d figured out that Twitter’s problem was Blaine):

“Even if you’re architecturally sound, you’re dealing with development with extremely tight timelines/pressures, so you have to make decisions to pick things that will work but will probably need to eventually be replaced (e.g. DRb for Twitter) — usually you won’t know when and what component will be the limiting factor since you don’t know what the use cases will be to begin with. Development from prototype on is a series of compromises against the limited resources of man-hours and equipment. In a perfect world, you’d have perfect capacity planning and infinite resources, but if you’ve ever experienced real-world hockey-stick growth on a startup shoestring, you know that’s not the case. If you have, you understand that scaling is the brick that hits you when you’ve gone far beyond your capacity limits and when your machines hit double or triple digit loads. Architecture doesn’t help you one bit there.”

Growth is hard. Dealing with growth is rarely sexy. When your growth goes non-linear you’re tempted to think you’ve stumbled into a whole class of new problems that need wild new thinking. Resist. New ideas should be applied judiciously. Because mostly it’s plumbing: tuning your databases, getting your thread buffer sizes right, managing the community, and the abuse.

Intelligence and Monitoring

Growth compounds the other hard problem that Twitter (and almost every site I’ve seen) has: they’re running black boxes. Social software is hard to heartbeat, socially or technically. It’s one of the places where our jobs are actually harder than those real-time trading systems and other five-nines-style hard computing systems.

And it’s a problem Twitter is still struggling to solve. (Really, you never stop solving it; your next SPOF will always come find you, and then you have something new to monitor.) Twitter came late in life to Ganglia, and hasn’t had the time to really burnish it. And Ganglia doesn’t ship by default with a graph for what to do when your site needs its memcache servers hot to run. What do you do when Ganglia starts telling you your recent framework upgrade is causing a 10x increase in data returned from your DBs for the same QPS? Or that your URL shortening service is starting to slow down sporadically, adding an extra 30ms burn to message handling? (How do you even graph that?)
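
By way of illustration only (invented names and numbers, not anything Twitter or Ganglia actually runs), this is the kind of derived metric you end up hand-rolling: a bytes-returned-per-query ratio computed from two counter scrapes.

    # A derived metric your graphs won't hand you out of the box: bytes
    # returned from the DB tier per query, from two counter samples taken
    # a minute apart. Names and numbers are invented for illustration.
    def bytes_per_query(bytes_now, bytes_then, queries_now, queries_then):
        queries = queries_now - queries_then
        if queries <= 0:
            return 0.0
        return (bytes_now - bytes_then) / float(queries)

    # e.g. counters sampled before and after a one-minute window
    print(bytes_per_query(9400000000, 9100000000, 61000, 1000))  # 5000.0 bytes/query

Graph that ratio across a deploy and a 10x jump is suddenly visible, instead of staying buried inside two counters that each look healthy on their own.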

Beyond LAMP Needs Better Intelligence

Monitoring and intelligence get even harder as you start to embrace these new architectures. Partly because the systems are more complex, but largely because we don’t know what monitoring and resourcing for Web-scale queues of data and distributed hash tables look like. We don’t yet have the scars from living through the failure scenarios. And we’re rolling our own solutions, because it’s early days, without the battle-hardened tweaks and flags of an Apache or a MySQL.

We all know that Jabber has different performance characteristics than the Web (that’s rather the point), but we don’t have the data to quantify what it looks like at network-effect-impacted scale. (The big IM installs, particularly LJ and Google, have talked a bit in public, but their usage patterns tend to be pretty different from stream-style APIs. Btw, I’ll be talking about this a bit in Portland at OSCON in a few months!)

Recommendations

So I’d add, to Leonard’s architecture (and I know Leonard is thinking about this) and to the various other cloud architectures emerging, that to make this work you need to build monitoring and resourcing in from the ground up, or your distributed-in-the-cloud queues are going to fail.

And solve the growth issues with appropriate solutions for growth, which rarely means architectural solutions.

XMPP in TiVo

January 11th, 2008

“Today each TiVo polls TiVo’s servers roughly every 15 minutes to check for new scheduled recordings, TiVoCast downloads, Unbox downloads, etc. That’s highly inefficient – nearly all of those polling calls are for nothing. There is nothing waiting to be done. And it introduces a lag when you want to start a download – up to 15 minutes. And it doesn’t scale well as TiVo’s user base keeps growing.

So what’s changed? The polling system is gone. TiVo is using XMPP now instead. What is XMPP? The Extensible Messaging and Presence Protocol – better known as the instant messaging protocol that powers Jabber, Google Talk, and other IM systems.” – Peter St. Andre noticed an interesting announcement coming out of CES. (via aaron)
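
Back-of-the-envelope, with a fleet size I made up (the 15 minutes is from the quote above), the cost of the polling loop looks like this:

    # Rough polling math. The fleet size is invented; the 15 minutes is
    # from the quote above.
    boxes = 5000000                   # hypothetical TiVo fleet
    poll_interval_seconds = 15 * 60   # each box checks in every 15 minutes

    print(boxes / float(poll_interval_seconds))   # ~5,555 requests/sec, nearly all no-ops

With push, the servers only talk to a box when there’s actually something to do, and the up-to-15-minute download lag goes away with it.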

Google Talk Architecture, and High Availability (HA)

July 29th, 2007


Via the HA blog (an obviously unserved niche in retrospect), a very interesting 30 minute presentation on the Google Talk architecture.

ConnectedUsers * BuddyListSize * OnlineStateChanges

Interestingly, people keep independently rediscovering that maintaining presence is the hard part of scaling these systems.
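
To make that formula concrete, with numbers pulled entirely out of the air:

    # ConnectedUsers * BuddyListSize * OnlineStateChanges, with invented numbers.
    connected_users = 2000000
    avg_buddy_list_size = 100
    state_changes_per_user = 6        # sign-ons, idles, sign-offs per hour

    print(connected_users * avg_buddy_list_size * state_changes_per_user)
    # 1,200,000,000 presence notifications an hour to fan out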

It’s something that really came home hard in my talking with Twitter while helping with their scaling challenges (so much so that we took a slide out of our “Social Software for Robots” talk to talk about it, and Blaine mentioned it again in his “Scaling Twitter” talk).

So by way of a PSA:

Presence isn’t easy.

Growth in social systems is non-linear. Ignore the network effect at your peril.

Kick the Tires

Also interesting was “Real Life Load Tests”. The GTalk team deployed to Orkut and GMail weeks before actually turning on the UI for the features to be able to monitor the load. These are the practices that make Bill’s recent observation on HA systems possible:

An interesting takeaway is that it’s clearly possible to re-architect data storage on super-busy production systems seemingly no matter where you start from.

For the rest of bullets see the HA blog post.

Twitter, Ruby, and Scaling

April 12th, 2007

Alex gave a phenomenal interview on Twitter and Rails a couple of weeks ago. This morning it’s all over the Net — but I think folks are taking the wrong lessons from it.

  1. Ruby is dead slow. This is not news, though it can be surprising when you’re used to thinking about scripting languages as all being roughly equal.

  2. Rails trades developer performance for framework performance. Also not news, as this has been the mantra of Rails since day 1.

More importantly, he gives a quick insight into the how of making social software scale. It’s hard, it has ugly network effects, it makes databases cry. Alex’s advice is to cache like mad (because frankly no one but the content creator needs to see fresh data).

Also denormalize like mad, federate like mad, and prune features that make your site slow. (And these are the same techniques that they’re working on behind the scenes at Twitter, and that we use to scale Flickr.)
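
For the “cache like mad” half, the shape is the familiar cache-aside pattern. A minimal sketch, with a dict standing in for memcached and a made-up timeline query (this is not Twitter’s or Flickr’s actual code):

    # Cache-aside sketch: the dict stands in for memcached, and
    # build_timeline_from_db for the expensive denormalized query.
    cache = {}

    def build_timeline_from_db(user_id):
        # pretend this is the slow, join-heavy work
        return ["status %d for %s" % (i, user_id) for i in range(3)]

    def get_timeline(user_id):
        key = "timeline:%s" % user_id
        if key in cache:              # everyone else reads slightly stale data
            return cache[key]
        timeline = build_timeline_from_db(user_id)
        cache[key] = timeline
        return timeline

    def post_status(user_id, status):
        # (in real life the status gets written to the DB here)
        # the content creator needs to see their own write right away,
        # so blow away their cached view on write
        cache.pop("timeline:%s" % user_id, None)

    print(get_timeline("some_user"))
    post_status("some_user", "tinkering with tuple stores")
    print(get_timeline("some_user"))

The wrinkle, per the queuing caveats above, is that the writer’s own view gets invalidated immediately so cause and effect still appear to hold; everyone else can live with slightly stale data.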

You’ll never build a successful site if you build to scale from day 1; scaling is always a catch-up game, but it’s the best game there is.

(And yes, this is my all Twitter all the time blog week)

update: Blaine, lead Twitter engineer, is giving a talk on how they scale Rails/Twitter next weekend at the Ruby SD Forum (which has done a terrible job of publicizing its existence, but has a pretty killer-looking lineup).

Bad internet karma day.

April 11th, 2006

Unscheduled Bloglines downtime, and WoW’s weekly 7 hours of scheduled downtime is pushing 12 with no end in sight (and their homepage is even down).
