Blog posts tagged "data"

data and doing the right thing

January 5th, 2011

WifiMug (“Caffeinated and Unstrung”) was a moderately successful community collaboration I built and ran to catalog good coffee shops to work out of, originally and most successfully in Seattle, then spread to Vancouver, Portland, Chicago, Boston, and New York.

It was also an experiment in extracting structured data from semi-structured, free-form, wiki-like data entry.

I moved away, and the site kept running itself. Eventually the spammers overwhelmed the community and I had to shut it down. I feel bad about this. Bringing back (and rewriting) WifiMug is on the todo list, but it’s on the given-the-ability-to-freeze-time todo list (by far the longest of my todo lists).

So I spent 15 minutes attempting to do the right thing, and all the data (and all the spam) is available for download under a Creative Commons Attribution-ShareAlike license.

The dump contains the following directories per city:

  • database contains the raw text of the wiki pages that represented each of the cafes (and the various meta pages).
  • metadata contains some of the structured data, per page
  • rcs contains the history of each page. (yes, RCS)

Ideally I’d do more. Ideally I’d scrub the data, port it to some sane format, tease out the implicit metadata encoded in the markup, attribute all of the various community members, etc. But I’m trying to not let the perfect be the enemy of the ok. This is a minimal competence thing.

And ideally we should be able to hope and expect that Yahoo! would do something like this with the infinitely more important and influential data set: dedicate a few weeks or months of time to preserving one of the greatest new libraries of our time, possibly donating it to or the LoC. But nothing makes you uncomfortable like holding Y! to a standard you aren’t personally living up to.

Get the data.

All done/written under the influence of 30k feet and 10 minutes of reflection, treat accordingly.

ps. I almost got all of this into 140 characters, but failed. I hate the way blog posts feel so flabby and fluffy after the compressed kinetic energy of a tweet. I mean Anil makes me feel all noble doing it, but I miss my creative restrictions.

Minimal Competence: Data Access, Data Ownership, and Sharecropping.

May 18th, 2010

A friend (from Google) recently trolled me, asking, “What’s up with the data lock-in at Flickr?”.

Got me thinking about standards. I wrote back a rant to a mailing list of fellow senior hacker and coder types. Below I’ve included that rant, largely verbatim. I’d been meaning to turn it into a more reasoned blog post, maybe something suitable for posting on a more official outlet, but life is short, and Rod’s post about Quora reminded me to get on it.

As software engineers, as social software engineers, it’s important to have standards. You can debate how much of what we do can be called engineering, even charitably, but the code we write determines the rules that govern the spaces more and more people spend time in, and while “First, do no harm” might be reaching, a few standards that you should be embarrassed to not meet seem appropriate.

One of those is around data access, data ownership, and sharecropping. This is something Flickr takes very seriously.

The Minimum

With Flickr you can get out, via the API, every single piece of information you put into the system.

Every photo, in every size, plus the completely untouched original. (which we store for you indefinitely, whether or not you pay us) Every tag, every comment, every note, every people tag, every fave. Also your stats, view counts, and referers.

Not the most recent N, not a subset of the data. All of it.
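A rough sketch of what "all of it" means for a client, in Python: keep asking for pages until the service says there are no more, rather than stopping at the most recent N. The `fetch_page` callable and the shape of its return value (a dict with `pages` and `photo` keys) are assumptions made for illustration; a real client would wrap an authenticated API call there.

```python
from typing import Callable, Dict, Iterator

# Hypothetical fetcher type: given a 1-based page number, return one decoded
# page of results shaped like {"page": 1, "pages": 3, "photo": [...]}.
PageFetcher = Callable[[int], Dict]


def iter_all_photos(fetch_page: PageFetcher) -> Iterator[Dict]:
    """Yield every photo record, walking pages until none are left."""
    page = 1
    while True:
        result = fetch_page(page)
        for photo in result.get("photo", []):
            yield photo
        # Stop once the reported page count has been reached (or is zero).
        if page >= int(result.get("pages", 0)):
            break
        page += 1
```

Passing the fetcher in keeps the paging logic separate from authentication and HTTP, so the same loop can be exercised against canned data.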

It’s your data, and you’ve granted us a limited license to use it.

Additionally we provide a moderately competently built API that allows you to access your data at rates roughly 500x faster than the rate that will get you banned from Twitter.

Asking people to accept anything else is sharecropping. It’s a bad deal. Flickr helped pioneer “Web 2.0”, and personal data ownership is a key piece of that vision. Just because the wider public hasn’t caught on yet to all the nuances around data access, data privacy, data ownership, and data fidelity, doesn’t mean you shouldn’t be embarrassed to be failing to deliver a quality product.

The ability to get out the data you put in is the bare minimum. All of it, at high fidelity, in a reasonable amount of time.

The bare minimum that you should be building, bare minimum that you should be using, and absolutely the bare minimum you should be looking for in tools you allow and encourage people who aren’t builders to use.

A Reasonable Exchange of Value

Flickr actually goes a bit farther: not only can you get your data out, but it gets enriched as it passes through the system.

If you use the geotagging feature, you don’t just get out the lat/long you put in; your photo comes back with a whole hierarchy of geographic descriptors that are pointers into a publicly available gazetteer (Y! GeoPlanet). It would be good if there were pointers into other publicly available gazetteers (if, for example, Google ever released one), but there isn’t a good concordance service yet (it’s being worked on).

You get structured access to all the metadata that people have added to your photos, with proper attribution available. (of course there is a working privacy model, so your “friends” aren’t getting data they aren’t supposed to, like your friend requests, and chat logs)

If you used our machine tags vocab, you get extra information pulled in from 3rd party APIs like Open Street Maps, Open Library, various transit administrations, and Foursquare.

Additionally you also have access to the data that was created in aggregate using the data you shared with us, like tag clusters, and the Creative Commons licensed neighborhood shape boundaries.

This isn’t the exhaustive list, just a few of the things Flickr does to respect, and collaborate with the people who share their time and data with us.

I’d certainly love to get a fraction of this data back from other services I use. Imagine getting access to all the data Google has about you, and everything they’ve learned partially based on observing you. I’ve gotten used to being disappointed by most of my fellow practitioners, but I still dream about using good tools that treat me with respect and want to collaborate.

Thanks go to Jesse Vincent, for the useful sharecropping metaphor.

(and I’ll state the obvious: this is my personal blog; nothing I post here should be taken as official Flickr or Yahoo communication or policy unless otherwise noted. That isn’t what they pay me to do.)

Henry Blodget: “Facebook’s Approach To Innovation Is The Secret To Its Success”

May 17th, 2010

Blodget gets the headline right, and nearly everything else wrong.

I’m really surprised we aren’t seeing more people writing and talking about what I see as Facebook’s key competitive advantage: It’s a data driven company, which is nimble enough to act on that data.

When I look at the Facebook engineering culture I see the best parts of what we’ve done at Flickr, scaled up in a way I didn’t think was possible.

And when you look at the work the data team is doing (which you can get a sense of by the tools they throw off), you know that Facebook’s innovation is being tested, put through its paces, and extensively analyzed before most of us are aware of it.

This is a unique combination.

Armory Data Mining

March 5th, 2010

“It is time for some truth in advertising. If I will present my thesis adviser with this analysis, she will probably hang me, rez me, hang me again, and then /gkick me out of my PhD program.” Armory Data Mining.

Great, accessible look at population stats.

  • April 17, 2009

    #2 Every Building with a Shoebox in its Basement.

    “Buildings could offer WiFi photo uploading service, in return for keeping the photos taken of them….… what if Cloudgate were built with servers and wireless inside, right from the start, offering to consume the photos taken of it. You take a shot with a wireless enabled camera and it could store a copy for you. It’s building up a library of itself, in all seasons, in all weather. Meanwhile you, have a backup, findable by time and browsing, stored safely in the Cloud!”


Is a Firehose of Snowflakes a Nor’easter?

March 4th, 2009

I tried explaining the title of this blog post to Jasmine this morning. Suffice to say my explanation needed a bit of practice. And more than 140 characters. Or it might just be I’m a bit stir crazy from Winter returning with a vengeance in these here parts. But I wanted to call out a couple of points that might have gotten overshadowed in the good Reverend’s recent post on the Flickr Panda APIs.

NewsWire API


The NY Times, at their great Times Open event, announced their Newswire API, which is a real time stream of their content. Stories, and blog posts, and what not. More interesting was their discussion about how they’ve built a backend “pinging service” that makes it easy for them to add new types of data to their stream. I’m enough of a dork that a Grey Lady firehose sounds pretty awesome.

But they got some flak for it being a snowflake API. From where I sit snowflake APIs look like opening up your data as fast as possible, by any means necessary, and trying not to pre-judge how people will use it, but I’m thankful for the metaphor, as it allowed me to spend the morning envisioning fire hoses of snowflakes.

Still, I spent 2007 and 2008 talking about how XMPP was going to be a key piece of building firehoses, standardizing and enabling the real time Web, so it’s a criticism I’m sensitive to. (and I’ve already been skipping conferences in 2009 in the hopes of actually having some time to build it, though thankfully minor details like time haven’t stopped my colleagues at Fire Eagle from launching theirs)


Flickr Panda!

Which is all apropos of saying, we launched our own “snowflake” realtime API yesterday. (though actually it’s just a slight modification of our standard photo response format). And it’s Panda-shaped. And it is awesome.

Near Real-Time, Every Minute, up to 120 Events

But because the documentation is quirky, I think people missed the significance. These are Flickr real time data APIs.

We’re building streams of photos in real time. Examining the huge stream of data events that happen on Flickr, the social activity, the searching, the meta-data creation, and fishing from that stream to build 3 real time streams. We’re then exposing those streams via a near real time polling based API.

The API pattern is specifically structured around making it easy to call from client side scripting, and the data streams are structured around discovery rather than guided search, but we’re pushing up to 120 discovered photos down these streams each minute, every minute. Two streams of real-time interestingness, and 1 of lightly interestingnessed geotagged photos.
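For anyone wondering what consuming a near real time, polling based stream looks like, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `fetch` stands in for whatever call returns the current batch of photos for one panda, each photo is assumed to be a dict with an `id`, and the 60 second default simply mirrors the "every minute" cadence described above. The de-duplication is there in case successive batches overlap.

```python
import time
from typing import Callable, Dict, Iterator, List, Set


def poll_stream(
    fetch: Callable[[], List[Dict]],
    interval_seconds: float = 60.0,
    max_polls: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Dict]:
    """Poll `fetch` on a fixed interval, yielding only photos not seen before."""
    seen: Set[str] = set()
    for poll in range(max_polls):
        for photo in fetch():
            photo_id = photo["id"]
            if photo_id not in seen:
                seen.add(photo_id)
                yield photo
        # Wait between polls, but not after the final one.
        if poll < max_polls - 1:
            sleep(interval_seconds)
```

The `sleep` parameter is injectable so the loop can be tried out without actually waiting a minute between polls.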

And they’re named after famous pandas. Really what more do you want?

Whither XMPP

So what’s up with the blossoming real time data APIs? And where is our promised standardization? They’re coming. There has always been a tricky chicken and egg problem. There is so little data out there that is appropriate to expose in a real time fashion, that there is little demand to consume it, so the tools fail to evolve. But I’m seeing tons of work: great toolkits like Fire Hydrant from FireEagle and Babylon from, and Google’s decision to make XMPP a standard part of their AppEngine toolkit, are just the things I’ve been most excited about recently.

NYTimes Article Search API

February 7th, 2009

NYTimes: Sex & Scandal since 1981

I don’t have much to add that the New York Times hasn’t already said about their Article Search API. It’s an amazing corpus to be searchable, both in breadth, and scope, and for sheer richness of the classification. I can’t think of a remotely comparable dataset with such a rich API.

Couple of things I noticed that I wanted to call out.

Get info about an article/Search by URL

Positioned as a search API, it also doubles as a “getInfo”-style API, as article URL is one of the searchable fields.


Just make sure to remove the various query string bits that the Times appends, as these aren’t indexed. Should make a “find the history of this topic being discussed” Greasemonkey script a snap.
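Stripping the query string is a couple of lines with Python's standard library. This is only a sketch of the cleanup step; the function name is made up, and how the cleaned URL is then handed to the search API is left out because the exact field syntax isn't shown here.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    # Keep scheme, host, and path; drop everything after "?" or "#".
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

For a (hypothetical) article link such as `http://www.nytimes.com/2009/02/07/example.html?ref=technology`, this keeps everything up to `example.html` and discards the rest.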

Expert’s attention information

One of my less comprehensible requests to the NYTimes developer team at OSCON last year was to make sure their APIs exposed the “attention information of [their] editors.” Age of amateur, citizen journalism, and radical decentralization are all awesome, but the NYTimes’ editors job is to think about what is important and interesting full time; and that’s information worth mining.

And they did!

The page_facet and nytd_section_facet both allow you to gauge some degree of relative weight given to a story. (section_page_facet seems like it ought to do the same thing, but I couldn’t get it to work)

?query=flickr nytd_section_facet:[Front Page]

Gives you articles mentioning “flickr” featured on the NYTimes front page. (of which it only finds 3, alas)
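If you're building that query in code rather than in the browser, the spaces, colon, and brackets need URL encoding. A small Python sketch follows; the base URL and the `api-key` parameter name are my assumptions about the endpoint, so check them against the NYTimes documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint; not taken from the post itself.
BASE_URL = "http://api.nytimes.com/svc/search/v1/article"


def front_page_search_url(term: str, api_key: str) -> str:
    """Build a search URL restricted to the Front Page section facet."""
    query = f"{term} nytd_section_facet:[Front Page]"
    # urlencode escapes the space, colon, and square brackets in the query.
    return BASE_URL + "?" + urlencode({"query": query, "api-key": api_key})
```

Calling `front_page_search_url("flickr", "YOUR_KEY")` produces the encoded form of the query shown above, ready to paste into a browser or an HTTP client.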

API Design

Good stuff:

  • Clean hackable URLs, you can play with it in your browser and see what you’re going to get.
  • The getList + extras (called fields in the NYTimes API) is the house wisdom at Flickr, and I’m glad to see it elsewhere
  • The parsed tokens block is neat, and I can see it being incredibly useful for working with such a large, varied corpus
  • The sheer amount of searchable/indexable metadata, and its granularity, is really unprecedented; great to see them go out with such a rich, “here’s the data do something great” approach.


The graphic at the top of this blog post is a “visualization of the frequency of occurrence of the words ‘sex’ and ‘scandal’ in the New York Times, since 1981.”, part of a set of visualizations by blprnt_van built with the article search API, and Processing.

Fire Eagle: Interesting Choices

March 5th, 2008

Fire Eagle

Other folks are talking about and writing about the long germinating, launched in beta, location broker from Yahoo’s Brickhouse, Fire Eagle.

I wanted to call out just a couple of the cool, and non-intuitive decisions they made.

Is NOT a consumer brand

Fire Eagle is a service for building and sharing location data. It’s the applications built on top of it that you’ll interact with, unless you’re building stuff.

Fire Eagle does NOT manage the social graph

It’s a service for sharing your data with friends (or services, or your toaster), but it doesn’t know who your friends are. The social graph has been outsourced. Best example of a small piece loosely joined I’ve seen in a long time.

Cares about privacy and ease of use

Ninja privacy is built in. But you don’t have to care. The TOS requires developers to discuss how the data is used. And privacy levels are front and center. And from day one data is delete-able, and in fact data is flushed on a regular basis.

Built on OAuth


Notes from Social Graph Foo

February 4th, 2008

Here is my quick dump of the notebook, probably useful to no one but me. Names mostly removed to protect the guilty.

I think “Social Graph” is kind of a dumb phrase to apply to the back question of relationships. I promptly re-dubbed the event “Social Foo” and thereby found interesting things to talk about. Kevin Marks proposed “social cloud”, clouds hide details. (operations people get hives when you talk about clouds)

XMPP, OpenID, OAuth are all going to be huge in 2008; DiSo, DataPortability, and Social Graph API aren’t as clear winners to me.

“Bowling Alone misses the point. There has been a transformative change from groups to networks. Groups are just a funny form of network.”

“Differentiated role networks”. Differentiated roles, and the failure of monolithic identity and friending were one of the things I went to Sebastopol to talk about this weekend, the people who got it got it, and everyone else wasn’t interested in the hard squishy details of real community. I think this might be the side effect of running social software for social software’s sake vs social software as bath for social media object sharing.

“Relationships can be broken down into 5 types: emotional aid, sociality, major help, minor help, and $$$”

Note to self: try block modeling interactions in high profile/high turn Flickr groups. (central, utata, etc)

No one really understands user expectations. Privacy expectation is currently, “unstable”.

Huge conceptual issues with the difference between public information hand aggregated, and public information computer aggregated. Cognitive dissonance ensues.

Rules, games, and rulesets. Modeling of social software as games. Tension of implicit vs. explicit rules. Mag.nol.ia’s altruism game derived from the cracks board (witnessing altruistic acts is a public good, way to update the Mag rules of game to support this?), Satisfaction’s status update game. Hoping Teresa can bring the quality gaming to BoingBoing’s anemic community. Social games + advertising.

Parody/pastiche as lit analysis. Investigate for web.

Social networks need NPCs. e.g. the Instructables Robot.

Standards work should be done in small groups, with a clear need, that selectively grow the list of participants. No hierarchy of early/late joiners (aka OAuth did it right)

“Everything public” bores me.

Beyond LAMP.

Find a feed for Nathan Eagle’s research.

“locations rights management”

“trusts are largely not transitive”

Language communities are “small world networks”, partitions communities by language. 2-5 hops vs 8 in analyzed network.

The Plaxo way: “We gets ze data Lebowski”

“Twitter is my early warning system. My blood pressure has gone down over the last 18 months”

Identity and sharing can make everyone warm and fuzzy, but also came face to face with sobering consequences that kept me up at night with a bottle of tequila. Re-thinking proposed Flickr features.

Flickr: A Place of Our Own

December 10th, 2007

You might have seen the post on the Flickr blog announcing Places, or maybe the Good Reverend’s write up, but if you haven’t:

Places is a new Flickr feature that mines our corpus of geotagged photos, identifies characteristic features on a per location basis, and then goes back into the data looking for “iconic” beautiful photos. (btw try reloading that /places page, the featured places are random. As to a certain degree are the photos on the individual Places pages themselves)

It also is where a good chunk of my creative energy went for the last few months, which is why the blog has been so quiet. And it’s a hell of a lot of fun, not to mention a privilege and pleasure to deep dive into our database and be reminded just how much fabulous photography there is on Flickr, and maybe just barely fumble around the edges of surfacing the diverse community’s shared vision. Eyes of the world indeed.

A Place for GeoRSS feeds

Dan roped me in on Places months ago. We had geoFeeds working for semi-arbitrary places, and we needed a page to hang them off of. That page looked a lot like a search result. You never saw it because the Flickr project management process (a blog post of its own) left that particular prototype a bloody, heaving wreck. Don’t worry, the current version is much much much better. (of course you also never saw Dan’s brilliant prototype of the current version, which was too cool to release on an unsuspecting public) And voila, many months later, the feeds are there. (though I’d still like to bring back that SRP view to allow rich searching within a location)

Increased Surface Area

We brought a bunch of different design goals to Places, but one of my obsessions that I think we nailed was the idea of “increasing the surface area” of Flickr. (also known as providing new ways to level up in the Game of Flickr[tm]). Only a few people, and a limited range of styles will ever be featured on the Flickr Explore pages. Which is fine, most people don’t care. But Places provides another way to recognize the contributions of Flickr members, by highlighting their geotagging and their photography skills. I’m looking forward to adding a couple more similar features to Places, recognizing other Flickr Games one can level up in, and other contributions back to the commons you can make.

Mo’ Betta

A bunch of stuff didn’t make our initial launch. Some of that has come in since then. More will be coming. I’m particularly excited about adding some new data sources to improve the page. (e.g. the Groups are right now a bit weak, and we don’t have reliable neighborhoods in cities, both of which are in the process of being fixed)

That’s kH8dLOubBZRvX_YZ to You

Turns out there are a lot of San Franciscos in the world, and we personally struggle to keep track of which one is which. So we’ve been experimenting with giving them unique place_ids. If you look really close you’ll start to see these popping up around flickr, in photos.getInfo, and as microformats on the Places pages. It’s all very experimental, this unique identifiers thing, but we think it might work.
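The point of an opaque identifier is that you look a place up by id instead of by an ambiguous name. A hedged Python sketch of building such a lookup request: the REST endpoint, the `flickr.places.getInfo` method name, and the parameter names are assumptions on my part rather than anything spelled out here, and the id is the one from the heading.

```python
from urllib.parse import urlencode

# Assumed REST endpoint and method name; verify against the API docs.
REST_ENDPOINT = "https://api.flickr.com/services/rest/"


def place_info_url(place_id: str, api_key: str) -> str:
    """Build a request URL that looks a place up by its unique place_id."""
    params = {
        "method": "flickr.places.getInfo",
        "api_key": api_key,
        "place_id": place_id,
        "format": "json",
        "nojsoncallback": "1",
    }
    return REST_ENDPOINT + "?" + urlencode(params)
```

With `place_info_url("kH8dLOubBZRvX_YZ", "YOUR_KEY")` there is no need to say which San Francisco you mean; the id carries that.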

Arm Chair Travel

And because I love you, I’m going to let you in on a secret. Have a great trip.

Just beyond the door

Personal Data Stores and the Network

October 31st, 2007

Thinking about what “personal data stores” are going to look like, how this interacts with decentralized models for community services, (I swear I’ve written something more recent than 2005 on that topic, but can’t find it), mulling models for updating clouds, wondering if projects like G’s OpenSocial, and Portable Social Networks are a step forward or back, speculating that digital curation is a viable near future business model, and that individual curations would work well as shareable social media objects.

Nothing necessarily novel. Just where my head is at.

*Lots* of New Semi-Structured Data

June 21st, 2006

Showed up at Microformats party last night and promptly fell asleep in my drink (long day), but this is the real action. Andy and Y!Local have rolled out hCard, hCal, and hReview to all of Yahoo Local.

And Gordon has whipped up the first microformat -> to Y! bridge with Greasemonkey.



Usability and Stockholm Syndrome

February 17th, 2006

Going home, and working with my grandmother on her computer, is always an eye-opening experience. I think it’s the only time I get any real insight into how computers should actually work, or how much time I spend working for my computer versus my computer working for me. Trite I’m sure, but I’m floored every time I do it, and floored again when I think of the energy I spend justifying (largely to myself) how computers work.

This morning I realized how arbitrary the distinction between photos you’ve downloaded from your camera, and photos you’ve been emailed by friends is. (and by extension photos out in the ether) Why can’t you find them all in iPhoto?
