Ticket Servers: Distributed Unique Primary Keys on the Cheap

February 8th, 2010

(re-published from the Flickr Code Blog)

This is the first post in the Using, Abusing and Scaling MySQL at Flickr series.

Ticket servers aren’t inherently interesting, but they’re an important building block at Flickr. Among other things they are core to topics we’ll be talking about later, like sharding and master-master. Ticket servers give us globally (Flickr-wide) unique integers to serve as primary keys in our distributed setup.

Why?

Sharding (aka data partitioning) is how we scale Flickr’s datastore. Instead of storing all our data on one really big database, we have lots of databases, each with some of the data, and spread the load between them. Sometimes we need to migrate data between databases, so we need our primary keys to be globally unique. Additionally our MySQL shards are built as master-master replicant pairs for resiliency. This means we need to be able to guarantee uniqueness within a shard in order to avoid key collisions. We’d love to go on using MySQL auto-incrementing columns for primary keys like everyone else, but MySQL can’t guarantee uniqueness across physical and logical databases.

GUIDs?

Given the need for globally unique ids the obvious question is, why not use GUIDs? Mostly because GUIDs are big, and they index badly in MySQL. One of the ways we keep MySQL fast is we index everything we want to query on, and we only query on indexes. So index size is a key consideration. If you can’t keep your indexes in memory, you can’t keep your database fast. Additionally ticket servers give us sequentiality which has some really nice properties including making reporting and debugging more straightforward, and enabling some caching hacks.
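To put a rough number on “big”, here is a minimal Python sketch (not Flickr code) comparing bytes per key. It assumes a GUID would be stored either as 36 characters of text or as 16 raw bytes, and that a ticket fits in an 8-byte unsigned integer; the variable names are made up for illustration.

```python
import struct
import uuid

guid = uuid.uuid4()

guid_as_text = str(guid)    # what a CHAR(36)-style column would hold (assumed storage)
guid_as_bytes = guid.bytes  # what a BINARY(16)-style column would hold (assumed storage)
ticket_as_bigint = struct.pack(">Q", 72157623227190423)  # 64-bit unsigned ticket id

sizes = {
    "guid_text": len(guid_as_text),
    "guid_binary": len(guid_as_bytes),
    "bigint": len(ticket_as_bigint),
}
print(sizes)
```

Every index entry carries the key, so a key that is two to four times wider makes it that much harder to keep the indexes in memory.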

Consistent Hashing?

Some projects like Amazon’s Dynamo provide a consistent hashing ring on top of the datastore to handle the GUID/sharding issue. This is better suited for write-cheap environments (e.g. LSMTs), while MySQL is optimized for fast random reads.

Centralizing Auto-Increments

If we can’t make MySQL auto-increments work across multiple databases, what if we just used one database? If we inserted a new row into this one database every time someone uploaded a photo we could then just use the auto-incrementing ID from that table as the primary key for all of our databases.

Of course at 60+ photos a second that table is going to get pretty big. We can get rid of all the extra data about the photo, and just have the ID in the centralized database. Even then the table gets unmanageably big quickly. And there are comments, and favorites, and group postings, and tags, and so on, and those all need IDs too.

REPLACE INTO

A little over a decade ago MySQL shipped with a non-standard extension to the ANSI SQL spec, “REPLACE INTO”. Later “INSERT ON DUPLICATE KEY UPDATE” came along and solved the original problem much better. However REPLACE INTO is still supported.

REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted.

This allows us to atomically update a single row in place, and get a new auto-incremented primary ID.

Putting It All Together

A Flickr ticket server is a dedicated database server, with a single database on it, and in that database there are tables like Tickets32 for 32-bit IDs, and Tickets64 for 64-bit IDs.

The Tickets64 schema looks like:

CREATE TABLE `Tickets64` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `stub` char(1) NOT NULL default '',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM

SELECT * from Tickets64 returns a single row that looks something like:

+-------------------+------+
| id                | stub |
+-------------------+------+
| 72157623227190423 |    a |
+-------------------+------+

When I need a new globally unique 64-bit ID I issue the following SQL:

REPLACE INTO Tickets64 (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
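
If you want to watch the trick work without a MySQL box handy, here is a small Python sketch. It is a stand-in, not our setup: it uses SQLite (which also understands REPLACE INTO) in place of MySQL and MyISAM, and `next_ticket` is a hypothetical helper name. The cursor's `lastrowid` plays the role of LAST_INSERT_ID().

```python
import sqlite3

# In-memory SQLite database standing in for the ticket server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Tickets64 (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        stub TEXT NOT NULL DEFAULT '' UNIQUE
    )
    """
)


def next_ticket(conn):
    """Replace the single stub row and return the id assigned to the new row."""
    cur = conn.execute("REPLACE INTO Tickets64 (stub) VALUES ('a')")
    conn.commit()
    return cur.lastrowid


ids = [next_ticket(conn) for _ in range(3)]
row_count = conn.execute("SELECT COUNT(*) FROM Tickets64").fetchone()[0]
print(ids, row_count)
```

Each call hands back a larger id than the one before, and the table never grows past the one row.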

SPOFs

You really, really don’t want provisioning your IDs to be a single point of failure. We achieve “high availability” by running two ticket servers. At this write/update volume replicating between the boxes would be problematic, and locking would kill the performance of the site. We divide responsibility between the two boxes by dividing the ID space down the middle, evens and odds, using:

TicketServer1:
auto-increment-increment = 2
auto-increment-offset = 1

TicketServer2:
auto-increment-increment = 2
auto-increment-offset = 2

We round robin between the two servers to load balance and deal with down time. The sides do drift a bit out of sync; I think we have a few hundred thousand more odd-numbered objects than even-numbered objects at the moment, but this hurts no one.
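
The odds-and-evens split plus round robin can be sketched in a few lines of Python. This is an in-memory toy, not our client code: `TicketServer` and `TicketClient` are hypothetical names, and a server being “down” is just a flag.

```python
import itertools


class TicketServer:
    """In-memory stand-in for one ticket server (hypothetical class)."""

    def __init__(self, offset, increment=2):
        # Mirrors auto-increment-offset and auto-increment-increment.
        self.increment = increment
        self.next_value = offset
        self.up = True

    def next_id(self):
        if not self.up:
            raise ConnectionError("ticket server is down")
        value = self.next_value
        self.next_value += self.increment
        return value


class TicketClient:
    """Round-robins across servers and skips any that are down."""

    def __init__(self, servers):
        self.servers = servers
        self._cycle = itertools.cycle(servers)

    def next_id(self):
        # Try each server at most once per request.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            try:
                return server.next_id()
            except ConnectionError:
                continue
        raise RuntimeError("no ticket server available")


odd = TicketServer(offset=1)
even = TicketServer(offset=2)
client = TicketClient([odd, even])

first = [client.next_id() for _ in range(4)]          # both servers up
even.up = False
during_outage = [client.next_id() for _ in range(3)]  # only odd ids are handed out
even.up = True
after = [client.next_id() for _ in range(2)]          # even side resumes where it left off
print(first, during_outage, after)
```

The ids stay unique because the two sides never share a number, but they are no longer strictly ordered across servers once one side has been down for a while. That is the drift described above.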

More Sequences

We actually have more tables than just Tickets32 and Tickets64 on the ticket servers. We have sequences for Photos, for Accounts, for OfflineTasks, and for Groups, etc. OfflineTasks get their own sequence because we burn through so many of them we don’t want to unnecessarily run up the counts on other things. Groups and Accounts get their own sequence because we get comparatively so few of them. Photos have their own sequence that we made sure to sync to our old auto-increment table when we cut over because it’s nice to know how many photos we’ve had uploaded, and we use the ID as a shorthand for keeping track.

So There’s That

It’s not particularly elegant, but it works shockingly well for us, having been in production since Friday the 13th, January 2006, and is a great example of the Flickr engineering “dumbest possible thing that will work” design principle.

More soon.

Using, Abusing and Scaling MySQL at Flickr

February 8th, 2010

(re-published from the Flickr Code Blog)

I like “NoSQL”. But at Flickr, MySQL is our hammer, and we use it for nearly everything. It’s our federated data store, our key-value store, and our document store. We’ve built an event queue, and a job server on top of it, a stats feature, and a data warehouse.

We’ve spent the last several years abusing, twisting, and generally mis-using MySQL in ways that could only be called “post relational”. Our founding architect is famously in print saying, “Normalization is for sissies.”

So while it’s great to see folks going back to basics — instead of assuming a complex and historically dictated series of interfaces, assuming just disks, RAM, data, and a problem to solve — I think it’s also worth looking a bit harder at what you can do with MySQL. Because frankly MySQL brings some difficult to beat advantages.

  • it is a very well known component. When you’re scaling a complex app everything that can go wrong, will. Anything which cuts down on your debugging time is gold. All of MySQL’s flags and stats can be a bit overwhelming at times, but they’ve accumulated over time to solve real problems.

  • it’s pretty darn fast and stable. Speed is usually one of the key appeals of the new NoSQL architectures, but MySQL isn’t exactly slow (if you’re doing it right). I’ve seen two large, commercial “NoSQL” services flounder, stall and eventually get rewritten on top of MySQL. (and you’ve used services backed by both of them)

Over the next bit I’ll be writing a series of blog posts looking into how Flickr scales MySQL to do all sorts of things it really wasn’t intended for. I can’t promise you these are the best techniques, they are merely our techniques, there are others, but these are ours. They’re in production, and they work. I was tempted to call the series “YesSQL”, but that really doesn’t capture the spirit, so instead I’m calling it “Using and Abusing MySQL”.

And the first article is on ticket servers.

What Second Life can teach about scaling

February 3rd, 2010

Just read Ian Wilkes’ What Second Life can teach your datacenter about scaling Web apps article.

It’s packed full of really great radically pragmatic advice. Go read it. Couple of times I literally shouted out “Yes!”, so I pulled a few choice quotes out.

herein lies a trap for smaller ones: the belief that you can “do it right the first time.”

Wanted to jump up and down when I read this. Building it “right” the first time is one of the best guarantees of failure I know. Scaling is always a catch up game.

a recurring billing system needs to touch each user annually, and the product is only available to Internet users in the US and Europe, and by the biggest estimates will achieve no more than 10% penetration, then it needs to handle about 2-3 events per second (1bn * 75% * 10% / (365 * 86,400)). Conversely, a chat system with a similar userbase averaging 10 messages/day, concentrated during work hours, might need to handle 20,000 messages per second or more.

Events per second is usually the first and most important metric I calculate when designing a system. Even if you only have the roughest of notions, orders of magnitude are important. (and remember you’re the cynical geek on the team, there are folks on the team paid to dream of world domination, don’t let them influence your numbers too much)
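
The arithmetic in that quote is worth redoing by hand once. A quick Python sketch, using only the figures from the quote plus one assumption of mine: “work hours” is taken to mean an 8-hour window.

```python
SECONDS_PER_DAY = 86_400

internet_users = 1_000_000_000
addressable_share = 0.75  # US and Europe, per the quote
penetration = 0.10        # biggest estimate, per the quote
users = internet_users * addressable_share * penetration

# Annual billing: every user is touched once a year.
billing_per_second = users / (365 * SECONDS_PER_DAY)

# Chat: 10 messages per user per day.
messages_per_day = users * 10
chat_average_per_second = messages_per_day / SECONDS_PER_DAY

work_hours = 8  # assumption: all traffic lands in an 8-hour window
chat_work_hours_per_second = messages_per_day / (work_hours * 3600)

print(round(billing_per_second, 1))
print(round(chat_average_per_second))
print(round(chat_work_hours_per_second))
```

Same userbase, roughly four orders of magnitude apart: a couple of events a second for billing versus something north of 20,000 a second for chat once you squeeze it into the working day.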

can the system be shut down at regular intervals?

Because change is inevitable, and anything resembling perfect uptime is more expensive than you can afford.

Another often-overlooked component of a scaling strategy is the makeup and attitude of the team … the entire development team needs to be aware of at least the basic implications of working on a large system … . This is especially a risk if a centralized resource (say, a database) is heavily abstracted and somewhat invisible to the developer (by, say, an ORM).

So true! Abstractions kill.

the ultimate solution is typically to partition databases into horizontal slices of the data set (typically by user), but this approach can be very expensive to implement.

Not sure why partitioning is thought of as so expensive. It’s annoying, and not for the lazy, but it’s not that difficult/expensive.

Instrument, propagate, and isolate errors

Flickr’s mantra is graph, graph, graph everything that moves.

It pays to thoroughly embrace the exception model

I can only say I wish I had this, haven’t scaled it, but living without it is instructive. And painful.

“Fix all the bugs” is rarely a realistic plan.

Similarly advice to “close bugs first” will leave your product dead in the water.

Batch jobs: the silent killer

Yup.

Beware the grand re-write

Oh my yes.

Have a Plan B

Someday I’ll publish some of our “plan B” documents. Plan Bs are critical to moving fast.

Don’t be afraid to change the product. Sometimes, a small number of features are responsible for the lion’s share of bottlenecks.

Twitter is the master of this.

All around great pragmatic advice.

(One of the many) Ebook Dilemmas.

January 25th, 2010

I'm going to need books, lots of books

How do I support and reward the excellent curation of the local bookstore if I want the ebook version of something I find? – Kellan

I am not an unsophisticated consumer of science fiction. And finding new material to feed the book addiction is something I spend a not inconsiderable number of cycles on. Yet, there I was standing in Borderlands last week, and books to buy were jumping off the shelves. 2-3 of “my authors” had new books out that I hadn’t heard about. (tho 2 of them are on low rotation right now, as they’ve disappointed me of late) A book multiple friends had mentioned but I’d failed to track was featured. And I found several other new promising options, none of which I had heard of, and several of which aren’t normally available in print in this country.

Low Paper Diet

And I was stuck. You see, I’m on a pretty strict dead tree diet right now. I simply don’t have the space to store books. And while I’m at it I’d rather not incur the carbon debt of chopping down trees, mass printing on paper, warehousing and transporting a product which is statistically likely to be pulped before ever being purchased. Clearly I’m getting a huge amount of value out of Borderlands, but I didn’t really have a way to include them in the exchange. I wasn’t even sure I was really comfortable wandering next door to their newly opened cafe and settling in with my Kindle as I was inclined to do.

Micro-slicing the pie vs trickle down?

Charlie Stross wrote a really great post recently, The monetization paradox analyzing the value chain of content production right now, summed up as,

“Google could in principle afford to pay every novelist currently active in the English language out of the petty cash.” – Charles Stross

Amazon is doing something similar. Capturing greater value than they’re providing. (and I love Amazon) I visit Amazon.com, I visit the Amazon.com Kindle Store. And I walk away empty handed. Amazon captures the value when I buy a book for my Kindle, but isn’t providing sufficient tools for me to do this. Without Borderlands, Amazon would have gotten no $$ from me last week; as it is, they did all right.

So how do I cut my local bookstore/curator in? I asked on Twitter and the consensus emerged around “buy the book, steal the ebook”, or “tip the bookstore.” (thanks to waferbaby, dajobe, BOBTHEBUTCHER, benprincess, timoni, carlcoryell, bhyde, and rabble for feedback!)

One of the ways I know I’m getting old is most of the time stealing media isn’t worth it. This also is a product of consuming outside of the most mainstream troughs, and genuinely liking/respecting most of the players in my media supply chain. I’ve got sitting on my drive detailed specs for building a relatively high throughput personal book scanner, and in the moments when I’m honest with myself I’ll probably never build it.

Open Questions?

Which brings me around to, how do I tip bookstores? And is there a viable model of funding that allows me to express my generalized appreciation of the existence of these important curators while getting some specific value back, a Kickstarter inspired model if you will? Would anyone besides me use it, I wonder? How does this interact with Charlie’s ideas of a subscription model for writers? Given a semi-hypothetical open e-reader with a radio, could we partially fund bookstores with a real world version of Amazon affiliate links?

Unfortunately I still don’t have the answers, but I wanted to write down the problem, and I’m going to keep looking into it. Meanwhile if you know of anyone experimenting with this, I’d love to hear about it.

(so concludes the latest in this week’s series of blog posts written by the simple expedient of scaling up a tweet by a 30x inflation factor)

(update: a few really interesting comments, thanks guys!)

4294967295 and MySQL INT(20) Syntax Blows

January 24th, 2010

Big Numbers 2 by pjern.

When you’ve been working with a technology for a long time, it’s difficult not to develop Stockholm syndrome. Not sure when I started using MySQL, but I bought my first license in 1998. I think it wasn’t until mid-to-late ‘98 when we had to call Monty long distance to Sweden to get help with some tricky issues. Which is to say it’s been a long time since I thought about how confusing MySQL’s CREATE TABLE syntax can be.

Which is not to say that the documentation isn’t clear:

M indicates the maximum display width for integer types. The maximum legal display width is 255. Display width is unrelated to the range of values a type can contain, as described in Section 10.2, “Numeric Types”. For floating-point and fixed-point types, M is the total number of digits that can be stored.

But last week Flickr had a hiccup. We hit 4,294,967,295 photos. Or as a geek might say it, the largest number that can be represented by a 32-bit unsigned integer. This didn’t exactly catch us by surprise. We’d switched to using 64-bit ids for some things on Friday the 13th, January 2006. That, and we got bit a few years ago when we hit 2,147,483,647 photos (that’d be the max signed 32-bit integer). Shortly after that we did a full audit of our tables.

But somehow we went on writing code after that, and we managed to slip a couple of new tables into the mix. And some of those tables ended up with INT(20) columns. Which simply meant we were adding some non-significant zeros to pad the display but truncating photo ids over 4294967295.

INT(5), INT(10), INT(20), and INT(255) all store the same amount of data.

Funny thing is, when I told this story to folks last week, this caught them by surprise. Sophisticated engineers, some of whom had deployed quite large MySQL backed sites. Because they were right, that syntax is dumb. And confusing. And I’d been taking it for granted so long I hadn’t thought about it in a decade. Which is why I’d bother to write a blog post about a popular piece of software, behaving exactly as it’s extensively documented to work.

Also, it’s interesting to note how if you keep making the same mistakes they become easier and easier to fix.

If you’re ever debugging a problem and you see the number 42-mumble-mumble-mumble-7295 (4294967295), you’ve run out of unsigned 32-bit storage. If you see 2-mumble-mumble-mumble-647 (2147483647), you’ve run out of signed 32-bit storage. At 167-mumble-mumble-15 (16777215) you’ve run out of 24 bits, and at 65-mumble-mumble-35 (65535) you’ve run out of 16 bits.

Somehow those numbers just jump out at me after all this time, you ignore the numbers in the middle, and notice the significant bits at the front and the end.
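
If you’d rather derive the magic numbers than memorize them, this little Python sketch prints the ceilings. The labels map each width to the MySQL integer type of that size; the helper names are made up for illustration. Note the display width, the 20 in INT(20), appears nowhere in the math.

```python
def max_unsigned(bits):
    """Largest value an unsigned integer of the given width can hold."""
    return 2 ** bits - 1


def max_signed(bits):
    """Largest value a signed (two's complement) integer of the given width can hold."""
    return 2 ** (bits - 1) - 1


limits = {
    "SMALLINT UNSIGNED (16-bit)": max_unsigned(16),
    "MEDIUMINT UNSIGNED (24-bit)": max_unsigned(24),
    "INT (signed 32-bit)": max_signed(32),
    "INT UNSIGNED (32-bit)": max_unsigned(32),
    "BIGINT UNSIGNED (64-bit)": max_unsigned(64),
}

for name, value in limits.items():
    print(f"{name:30} {value}")
```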

Photo from pjern

Counting Things, and RPEs

January 22nd, 2010

306 Million And Counting

On an unrelated email thread this morning I got to thinking about how I quantify the Flickr engineering team, and counting things in general.

Depending on how I’m counting I tend to place the Flickr engineering team at ~20 people. In that group I include everyone on our team who writes code (including HTML, CSS, Javascript, PHP, Java, Perl, Python, C, C++, XUL, or Objective-C). Additionally I include our operations team (aka sysadmins aka “service engineering”), our “tech support” team (technical customer care/qa/researchers), and various folks with “manager” in their title.

(a more traditional count would probably put the Flickr engineering team at 5 application/backend engineers, 4 front-end engineers, and 4 technical manager types.)

Which got me thinking about a new metric, the RPE or “roughly per engineer”. Mostly it’s a useful thought tool (for me) to think about what sorts of things scale up with economies of scale, and what doesn’t. Here are a couple of quick RPE metrics I pulled tonight.

Photo from siliconmonkey

Quotable

December 10th, 2009

It’s cheating to start a blog post with a quote from Winston Churchill. He was that good. – Fred Wilson

I’ll probably steal that line some day.

On a roof top in Brooklyn ….

November 16th, 2009

Just one of those moments that makes it all worth it. Gorgeous light streamed through our big, impossible to open, difficult to clean, over designed loft inspired windows, bathing the too small apartment in pinks, yellows and oranges. This photo was taken from the roof of our six story apartment building as the last light turned the skyline into silhouettes.

A Certain Kind of Memory

November 10th, 2009

I think of it as a kind of a pseudo random number generator for my memory, or maybe a probabilistic PhotoJojo Time Capsule powered by a certain inscrutable logic, but lately the odd blog spam comment that slips through Akismet’s filters and triggers a WordPress “Please moderate” comment email on some long forgotten blog post is more blessing than curse, a chance to remember, sort of.

Roughly 31 minutes ago some botnet left a gibberish comment on a blog post from July 2003, “That certain kind of tired”.

Oddly enough I remember very little about what was going on in my life at that point. I say odd, because here I am reading a diary entry where I’m situated in time, and space (Providence, July 2003), and I’ve posted a hyper specific list of recent movements (“38 hours of bus travel, 19 hours of car travel, 12 hours of air travel, 7 hours of train/subway travel”), yet nothing about that travel comes back to me. An odd unforeseen (by me at least) consequence of diarying in public is I’ve left out the context in preference of the shape of things. I wonder what I’ll make of my Twitter stream 6+ years on (assuming any of it survives and is accessible).

In contrast, my grandmother recently found a box of my great grandmother’s diaries (her mother-in-law) with nearly daily entries for a 20 year span, sometimes obscurely personal in nature, but never the intentionally obfuscating dance of public performance that my old post is.

And there is another kind of unanticipated (again probably only by me) forgetting in that post. A mere 6 years into the life span of that post, of the 5 links in that post, only one of them still works. (and all of these links to the sites of web dorks)

The title of the post is a phrase I borrowed from Jessamyn. Her server seems to be down. Hopefully it will come back up. I can remember thinking as I quoted it that I was pretty sure I’d met her, but I wasn’t sure when, but given our overlaps (mutual friends, mutual alma mater, similar geographic patterns) we’d meet again some day. I still think we will, but it hasn’t happened yet.

And I remember vividly, even though it’s only alluded to briefly in that original post in an attempt at wit and snark, that just prior to writing that post was the first time I met my good friend Aaron in person. We sat outside on the porch at The Otherside Cafe on Newbury St., in Boston, and talked about many many things including my first, but hardly my last, attempt to make him explain RDF to me.

Sky Captain is Building Ur Cloudz

November 6th, 2009

(for Kellan)

by straup

What Everybody Knows

November 1st, 2009

Last Friday, Jasmine and I saw Leonard Cohen play. That’s not a sentence I ever expected to type.

It was an amazing show. Everything conspired against it: lousy cavernous venue, weird crowd, show boating instrumentalists. But it was awesome.

And I was transported back to the first moment I ever heard Leonard Cohen. Those first minutes of Pump Up the Volume when “Everybody Knows” comes rolling out of Christian Slater’s pirate radio station, and it was the coolest thing I’d ever heard. Later I wondered how many of us got involved with helping start Indymedia because of that movie.

But that’s not why it stuck with me. You see, when the sound track for Pump Up the Volume came out that song wasn’t on there. Or rather it was, but it wasn’t the right version. I just had a tape of a tape of a tape, sound wasn’t great, and certainly didn’t come with any liner notes explaining that the version included on the sound track was from Concrete Blonde, whoever they were. Disappointment. Really profound disappointment.

And I didn’t know what to do. I was stuck. Transported back to that moment in the early 90s as an adult I could probably find a solution, but honestly I’m not sure. It wasn’t until I got to college, met other people who had been touched by that song, that I ever heard the name Leonard Cohen, and even then it took us a while to obtain a copy of it. Instead we spent a lot of late nights watching “Pump Up the Volume”. Eventually, it was the first MP3 I ever downloaded.

The double barreled revelation really hit me hard. First just to see him play it. Second to remember what I spend most days forgetting, how profoundly the world has been changed by the Net and particularly by the Web. It’s trite, and I lived through it, and I live with it every day, but I rarely get the distance to see how much has changed. And that little distance gave me some hope.

Kind of like the first time I heard that song.

I miss blogging.

October 25th, 2009

Like anyone who used to blog with frequency pre-2005, I’d like to post here more often — not just to fill up bits and bytes, but to write again. Remember when blogs were more casual and conversational? Before a post’s purpose was to grab search engine clicks or to promise “99 Answers to Your Problem That We’re Telling You You’re Having”. Yeah. I’d like to get back to that here. – Dan Cederholm

This is the idea I’ve been trying to play with again, really starting just this week: rejecting the consensus about how to blog that’s emerged over the last couple of years, which holds up Digg-ability and Techcrunch-i-tude as good indicators. Dan, of course, said it better.

It’s probably an indicator of slipping into my dotage, but a new stray link and I’m happily back wandering through those early archives, even my own, having stumbled across a rather odd review of the rather minor Ruled Britannia, circa 2003 earlier this evening.