• March 8, 2010

    SnapGroups: Mark Fletcher’s Lightweight Discussion.

    I’m a big fan of lightweight, ridiculously easy group forming, and of Mark Fletcher’s work, so I’m excited about SnapGroups. More interesting to me, though, is that he chose to use Mongo, given that Mark hand-rolled a custom datastore for Bloglines. It validates some of my cautious optimism that Mongo will deliver on some of its promise.

Armory Data Mining

March 5th, 2010

“It is time for some truth in advertising. If I will present my thesis adviser with this analysis, she will probably hang me, rez me, hang me again, and then /gkick me out of my PhD program.” – Armory Data Mining.

Great, accessible look at population stats.

  • February 16, 2010

    Wikimedia has a users RPE of 30mil.

    “Wikimedia Foundation currently employs 14 technical people (not all of whom are developers). At 400 million readers/month (as of February 2010), that’s about 1 developer per 30 million users. Accounting for open source developers probably doesn’t change that ratio by an order of magnitude.” (not exactly apples to apples, but still interesting)

KrazyDad: Mayor of the North Pole

February 16th, 2010

I’ve been blatantly cheating at foursquare for the past week … At some point last week, I devolved into a 12 year old hacker, and I spent many spare hours (and my computer’s spare cycles) abusing the system with a set of scripts operating fake accounts. Not only did I add new venues like the North Pole, but I started persistently checking into coveted landmarks, like the Statue of Liberty. – Jim

I would have thought that cheating at 4sq was so easy as to not invite this kind of concerted effort. After all, cheating is implicitly allowed in the social contract of the site, a fact that may or may not have gotten lost as it expanded beyond the ex-Dodgeball early adopters and the game mechanics were pushed to the forefront.

I assume that Foursquare is carefully monitoring the return it gets on the game mechanics. The game was necessary to get the early adopters in the door, but it will forever strand the product on one side of the chasm. So at some point, once critical mass is reached and the social cascade is ignited, they’ll burn down the game and move to a more utilitarian product.

geobloggers: Flickr Photos now in Bing Maps

February 12th, 2010

“This, is what geotagging photos is all about, it’s about having enough of them, millions and millions, so that they can be thrown through complex analysis, allowing them to be matched up, combined, calculated and computed into a geo-spatal context. It’s also about people sharing the world about them. Start of mini rant: You’ll see that all these advances are made by Google and Microsoft …” – Rev. Dan Catt.

I try not to let it get to me anymore that we’ve been actively de-prioritizing geo as an axis of understanding the human experience while everyone else has been spinning it up.

A Whole Lotta Nothing: Skinner Boxes

February 11th, 2010

“Flickr offers the wonderful Recent Activity page that I loved so much I copied it for MetaFilter. It's pretty much the ultimate tool for finding what has happened with your content on the network and I hope other services are watching and following suit. I would love to see an internet-wide tool that worked like this to track stuff people have said about my writing/photos as well as any followups on comments I left on any other blog. Many companies have tried, no one has succeeded yet.” – Matt Haughey

Yay! That almost makes it worth how much pain it was to build that page. The whole post is good, a couple of neat tricks I’d missed for tracking the conversations.

Ticket Servers: Distributed Unique Primary Keys on the Cheap

February 8th, 2010

(re-published from the Flickr Code Blog)

This is the first post in the Using, Abusing and Scaling MySQL at Flickr series.

Ticket servers aren’t inherently interesting, but they’re an important building block at Flickr. Among other things they are core to topics we’ll be talking about later, like sharding and master-master. Ticket servers give us globally (Flickr-wide) unique integers to serve as primary keys in our distributed setup.

Why?

Sharding (aka data partitioning) is how we scale Flickr’s datastore. Instead of storing all our data on one really big database, we have lots of databases, each with some of the data, and spread the load between them. Sometimes we need to migrate data between databases, so we need our primary keys to be globally unique. Additionally, our MySQL shards are built as master-master replicant pairs for resiliency. This means we need to be able to guarantee uniqueness within a shard in order to avoid key collisions. We’d love to go on using MySQL auto-incrementing columns for primary keys like everyone else, but MySQL can’t guarantee uniqueness across physical and logical databases.

GUIDs?

Given the need for globally unique ids the obvious question is, why not use GUIDs? Mostly because GUIDs are big, and they index badly in MySQL. One of the ways we keep MySQL fast is we index everything we want to query on, and we only query on indexes. So index size is a key consideration. If you can’t keep your indexes in memory, you can’t keep your database fast. Additionally ticket servers give us sequentiality which has some really nice properties including making reporting and debugging more straightforward, and enabling some caching hacks.
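
To put rough numbers on "GUIDs are big", here is a back-of-the-envelope sketch. It only compares raw key bytes and says nothing about MySQL's real per-entry index overhead. The row count and the `raw_key_bytes` helper are made up for illustration.

```python
import uuid

BIGINT_BYTES = 8                       # a 64-bit integer key
guid = uuid.uuid4()
GUID_BINARY_BYTES = len(guid.bytes)    # a GUID packed as raw bytes
GUID_TEXT_BYTES = len(str(guid))       # canonical hex-and-dashes text form


def raw_key_bytes(rows, key_bytes):
    """Bytes spent on the keys alone, ignoring all index overhead."""
    return rows * key_bytes


ROWS = 1_000_000_000  # hypothetical row count, for illustration only

for label, size in [
    ("64-bit ticket", BIGINT_BYTES),
    ("GUID (binary)", GUID_BINARY_BYTES),
    ("GUID (text)", GUID_TEXT_BYTES),
]:
    gib = raw_key_bytes(ROWS, size) / 2**30
    print(f"{label:14s} {size:2d} bytes/key  ~{gib:.1f} GiB of raw keys")
```

Every secondary index that carries the key pays that per-row cost again, which is why key width matters so much when the goal is keeping indexes in memory.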

Consistent Hashing?

Some projects like Amazon’s Dynamo provide a consistent hashing ring on top of the datastore to handle the GUID/sharding issue. This is better suited for write-cheap environments (e.g. LSMTs), while MySQL is optimized for fast random reads.

Centralizing Auto-Increments

If we can’t make MySQL auto-increments work across multiple databases, what if we just used one database? If we inserted a new row into this one database every time someone uploaded a photo we could then just use the auto-incrementing ID from that table as the primary key for all of our databases.

Of course at 60+ photos a second that table is going to get pretty big. We can get rid of all the extra data about the photo, and just have the ID in the centralized database. Even then the table gets unmanageably big quickly. And there are comments, and favorites, and group postings, and tags, and so on, and those all need IDs too.

REPLACE INTO

A little over a decade ago MySQL shipped with a non-standard extension to the ANSI SQL spec, “REPLACE INTO”. Later “INSERT ON DUPLICATE KEY UPDATE” came along and solved the original problem much better. However REPLACE INTO is still supported.

REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted.

This allows us to atomically update a single row in place, and get a new auto-incremented primary ID.

Putting It All Together

A Flickr ticket server is a dedicated database server, with a single database on it, and in that database there are tables like Tickets32 for 32-bit IDs, and Tickets64 for 64-bit IDs.

The Tickets64 schema looks like:

CREATE TABLE `Tickets64` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `stub` char(1) NOT NULL default '',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM

SELECT * from Tickets64 returns a single row that looks something like:

+-------------------+------+
| id                | stub |
+-------------------+------+
| 72157623227190423 |    a |
+-------------------+------+

When I need a new globally unique 64-bit ID I issue the following SQL:

REPLACE INTO Tickets64 (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
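
The behaviour is easy to try locally. The following is a minimal sketch that uses SQLite (via Python's standard `sqlite3` module) as a stand-in for MySQL, so it is an analogy and not the production setup: SQLite's `AUTOINCREMENT` and the cursor's `lastrowid` play the roles of `auto_increment` and `LAST_INSERT_ID()`, and `next_ticket` is a hypothetical helper name.

```python
import sqlite3

# SQLite stands in for MySQL here so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Tickets64 (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        stub TEXT NOT NULL DEFAULT '' UNIQUE
    )
    """
)


def next_ticket(conn):
    """Hypothetical helper: replace the single stub row, return its new id."""
    cur = conn.execute("REPLACE INTO Tickets64 (stub) VALUES ('a')")
    conn.commit()
    return cur.lastrowid


ids = [next_ticket(conn) for _ in range(5)]
row_count = conn.execute("SELECT COUNT(*) FROM Tickets64").fetchone()[0]
print("issued ids:", ids)
print("rows in table:", row_count)
```

Each call hands back a larger id than the last, while the table never grows past one row, which is the whole trick.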

SPOFs

You really, really don’t want provisioning your IDs to be a single point of failure. We achieve “high availability” by running two ticket servers. At this write/update volume replicating between the boxes would be problematic, and locking would kill the performance of the site. We divide responsibility between the two boxes by dividing the ID space down the middle, evens and odds, using:

TicketServer1:
auto-increment-increment = 2
auto-increment-offset = 1

TicketServer2:
auto-increment-increment = 2
auto-increment-offset = 2

We round robin between the two servers to load balance and deal with down time. The sides do drift a bit out of sync. I think we have a few hundred thousand more odd-numbered objects than even-numbered objects at the moment, but this hurts no one.
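
The evens-and-odds split with round-robin failover can be simulated in a few lines. This is an in-memory sketch under stated assumptions, not the Flickr client code: `TicketServer` and `TicketClient` are hypothetical classes, and an outage is modelled by a simple flag.

```python
import itertools


class TicketServer:
    """In-memory stand-in for one ticket server (hypothetical, no MySQL)."""

    def __init__(self, offset, increment):
        self.increment = increment   # plays the role of auto-increment-increment
        self.next_id = offset        # plays the role of auto-increment-offset
        self.up = True

    def ticket(self):
        if not self.up:
            raise ConnectionError("ticket server is down")
        issued = self.next_id
        self.next_id += self.increment
        return issued


class TicketClient:
    """Round-robins between servers, skipping any that are down."""

    def __init__(self, servers):
        self.servers = servers
        self._order = itertools.cycle(range(len(servers)))

    def ticket(self):
        for _ in range(len(self.servers)):
            server = self.servers[next(self._order)]
            try:
                return server.ticket()
            except ConnectionError:
                continue  # fall through to the other box
        raise RuntimeError("all ticket servers are down")


odds = TicketServer(offset=1, increment=2)
evens = TicketServer(offset=2, increment=2)
client = TicketClient([odds, evens])

first = [client.ticket() for _ in range(4)]          # alternates odd, even
evens.up = False                                     # simulate an outage
during_outage = [client.ticket() for _ in range(3)]  # odd ids only
evens.up = True
after = [client.ticket() for _ in range(2)]          # evens pick up where they left off

print(first, during_outage, after)
```

After the simulated outage the odd side is ahead of the even side, which is the same harmless drift described above. The ids stay unique because the two servers can never issue the same number.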

More Sequences

We actually have more tables than just Tickets32 and Tickets64 on the ticket servers. We have sequences for Photos, for Accounts, for OfflineTasks, and for Groups, etc. OfflineTasks get their own sequence because we burn through so many of them we don’t want to unnecessarily run up the counts on other things. Groups and Accounts get their own sequence because we get comparatively so few of them. Photos have their own sequence that we made sure to sync to our old auto-increment table when we cut over because it’s nice to know how many photos we’ve had uploaded, and we use the ID as a shorthand for keeping track.

So There’s That

It’s not particularly elegant, but it works shockingly well for us, having been in production since Friday the 13th, January 2006, and is a great example of the Flickr engineering “dumbest possible thing that will work” design principle.

More soon.

Using, Abusing and Scaling MySQL at Flickr

February 8th, 2010

(re-published from the Flickr Code Blog)

I like “NoSQL”. But at Flickr, MySQL is our hammer, and we use it for nearly everything. It’s our federated data store, our key-value store, and our document store. We’ve built an event queue, and a job server on top of it, a stats feature, and a data warehouse.

We’ve spent the last several years abusing, twisting, and generally mis-using MySQL in ways that could only be called “post relational”. Our founding architect is famously in print saying, “Normalization is for sissies.”

So while it’s great to see folks going back to basics (instead of assuming a complex and historically dictated series of interfaces, assuming just disks, RAM, data, and a problem to solve), I think it’s also worth looking a bit harder at what you can do with MySQL. Because frankly MySQL brings some difficult-to-beat advantages.

  • it is a very well known component. When you’re scaling a complex app everything that can go wrong, will. Anything which cuts down on your debugging time is gold. All of MySQL’s flags and stats can be a bit overwhelming at times, but they’ve accumulated over time to solve real problems.

  • it’s pretty darn fast and stable. Speed is usually one of the key appeals of the new NoSQL architectures, but MySQL isn’t exactly slow (if you’re doing it right). I’ve seen two large, commercial “NoSQL” services flounder, stall and eventually get rewritten on top of MySQL. (and you’ve used services backed by both of them)

Over the next bit I’ll be writing a series of blog posts looking into how Flickr scales MySQL to do all sorts of things it really wasn’t intended for. I can’t promise you these are the best techniques, they are merely our techniques, there are others, but these are ours. They’re in production, and they work. I was tempted to call the series “YesSQL”, but that really doesn’t capture the spirit, so instead I’m calling it “Using and Abusing MySQL”.

And the first article is on ticket servers.

What Second Life can teach about scaling

February 3rd, 2010

Just read Ian Wilkes’ What Second Life can teach your datacenter about scaling Web apps article.

It’s packed full of really great radically pragmatic advice. Go read it. Couple of times I literally shouted out “Yes!”, so I pulled a few choice quotes out.

herein lies a trap for smaller ones: the belief that you can “do it right the first time.”

Wanted to jump up and down when I read this. Building it “right” the first time is one of the best guarantees of failure I know. Scaling is always a catch up game.

a recurring billing system needs to touch each user annually, and the product is only available to Internet users in the US and Europe, and by the biggest estimates will achieve no more than 10% penetration, then it needs to handle about 2-3 events per second (1bn * 75% * 10% / (365 * 86,400)). Conversely, a chat system with a similar userbase averaging 10 messages/day, concentrated during work hours, might need to handle 20,000 messages per second or more.

Events per second is usually the first and most important metric I calculate when designing a system. Even if you only have the roughest of notions, orders of magnitude are important. (and remember you’re the cynical geek on the team, there are folks on the team paid to dream of world domination, don’t let them influence your numbers too much)
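
The arithmetic in the quote is worth redoing by hand once. Here is a small sketch of it. The billing numbers come straight from the quote. The 8-hour window used for the chat peak is an assumption added here to stand in for "concentrated during work hours", and `events_per_second` is just a hypothetical helper name.

```python
SECONDS_PER_DAY = 86_400


def events_per_second(events, period_seconds):
    """Average event rate over a period."""
    return events / period_seconds


internet_users = 1_000_000_000   # "1bn" from the quote
reachable = 0.75                 # US and Europe share, from the quote
penetration = 0.10               # biggest-estimate penetration, from the quote
users = internet_users * reachable * penetration   # roughly 75 million

# Billing: one event per user per year.
billing = events_per_second(users, 365 * SECONDS_PER_DAY)

# Chat: 10 messages per user per day.
messages_per_day = users * 10
chat_average = events_per_second(messages_per_day, SECONDS_PER_DAY)

WORK_HOURS = 8  # assumption: all traffic lands in an 8-hour window
chat_peak = events_per_second(messages_per_day, WORK_HOURS * 3600)

print(f"billing:      {billing:.1f} events/sec")
print(f"chat average: {chat_average:,.0f} messages/sec")
print(f"chat peak:    {chat_peak:,.0f} messages/sec")
```

Same user base, four orders of magnitude apart, which is why the rough number is worth having before any design decisions get made.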

can the system be shut down at regular intervals?

Because change is inevitable, and anything resembling perfect uptime is more expensive than you can afford.

Another often-overlooked component of a scaling strategy is the makeup and attitude of the team … the entire development team needs to be aware of at least the basic implications of working on a large system … . This is especially a risk if a centralized resource (say, a database) is heavily abstracted and somewhat invisible to the developer (by, say, an ORM).

So true! Abstractions kill.

the ultimate solution is typically to partition databases into horizontal slices of the data set (typically by user), but this approach can be very expensive to implement.

Not sure why partitioning is thought of as so expensive. It’s annoying, and not for the lazy, but it’s not that difficult/expensive.

Instrument, propagate, and isolate errors

Flickr’s mantra is graph, graph, graph everything that moves.

It pays to thoroughly embrace the exception model

I can only say I wish I had this, haven’t scaled it, but living without it is instructive. And painful.

“Fix all the bugs” is rarely a realistic plan.

Similarly, advice to “close bugs first” will leave your product dead in the water.

Batch jobs: the silent killer

Yup.

Beware the grand re-write

Oh my yes.

Have a Plan B

Someday I’ll publish some of our “plan B” documents. Plan Bs are critical to moving fast.

Don’t be afraid to change the product. Sometimes, a small number of features are responsible for the lion’s share of bottlenecks.

Twitter is the master of this.

All around great pragmatic advice.
