You are officially entering wet cat territory.

Inspired by Scott’s patch, the million or so sites that only produce Atom, and a couple of requests, I hacked experimental support for parsing Atom into Magpie.

Taking a page from Mark’s Feed Parser, it should be relatively transparent to move between parsing an RSS feed and parsing an Atom feed. Specifically:

  • Atom feed elements and RSS channel elements are both accessible via $feed->channel[$element_name].
  • Atom link elements that point to an alternate HTML version (i.e. those with the attribute rel="alternate") are treated as equivalent to RSS’s link elements, and are accessible via $feed->channel['link'] and $item['link'].
  • channel/description is mapped to channel/tagline and channel/tagline is mapped to channel/description
  • item/description is mapped to item/summary and item/summary is mapped to item/description
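The aliasing in the last two bullets can be sketched in a few lines of PHP. This is a minimal illustration of the idea, not Magpie's actual code: the helper name `alias_fields` and the sample arrays are made up for the example.

```php
<?php
// Hypothetical helper (not part of Magpie): copy a value between two keys
// so that whichever one the feed supplied, both are readable afterwards.
function alias_fields(array $fields, string $a, string $b): array
{
    if (isset($fields[$a]) && !isset($fields[$b])) {
        $fields[$b] = $fields[$a];
    } elseif (isset($fields[$b]) && !isset($fields[$a])) {
        $fields[$a] = $fields[$b];
    }
    return $fields;
}

// Made-up sample data: an Atom feed supplies tagline and summary ...
$atom_channel = ['title' => 'Example', 'tagline' => 'An Atom feed'];
$atom_item    = ['title' => 'Post', 'summary' => 'Short text'];

// ... and after aliasing, the RSS-style names work too.
$atom_channel = alias_fields($atom_channel, 'description', 'tagline');
$atom_item    = alias_fields($atom_item, 'description', 'summary');

echo $atom_channel['description'], "\n"; // An Atom feed
echo $atom_item['description'], "\n";    // Short text
```

The point of aliasing in both directions is that code written against RSS field names keeps working when handed an Atom feed, and vice versa.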

### Namespaces and Atom’s item/content field

Magpie handles namespaces by adding an array to an item, using the namespace prefix as the key. For example, an item’s <dc:subject> (aka item/dc/subject) field is available at $item['dc']['subject']. This has never been ideal, but it is simple from both the parser’s and the user’s perspective. It does cause a small conflict between RSS’s item/content/encoded field and Atom’s item/content field, since both would want the content key. I’ve chosen to make Atom’s item/content field available at $item['atom_content']. If the content field is of type xml, I flatten it to a string instead of making the parse tree available. (I don’t think anyone using Magpie wants the parse tree.) Like I said, wet cat country. Also, item/content/encoded and item/atom_content are mapped to each other.

### Nested Elements

Magpie has never handled elements nested more than one level deep. While this could potentially have been a problem when parsing RSS, no one has mentioned it yet. However, Atom even at its simplest has a number of nested elements, so just ignoring them wasn’t going to work. Here is what I do. This:

```
<author>
  <name>Mark Pilgrim</name>
  <url>http://diveintomark.org/</url>
  <email>f8dy@diveintomark.org</email>
</author>
```

Becomes:

```
[author_name] => Mark Pilgrim
[author_url] => http://diveintomark.org/
[author_email] => f8dy@diveintomark.org
```
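For the curious, the flattening can be sketched with PHP's bundled XML parser functions. This is a minimal illustration under my own assumptions, not the code in Magpie: the function name `flatten_entry` is hypothetical, and it ignores attributes and namespaces entirely.

```php
<?php
// Hypothetical sketch (not Magpie's code): flatten nested elements so that
// <author><name>..</name></author> ends up under the key "author_name".
function flatten_entry(string $xml): array
{
    $stack  = [];   // names of the elements currently open
    $fields = [];   // flattened result

    $parser = xml_parser_create();
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$stack) {
            $stack[] = $name;       // element opened
        },
        function ($p, $name) use (&$stack) {
            array_pop($stack);      // element closed
        }
    );

    xml_set_character_data_handler(
        $parser,
        function ($p, $text) use (&$stack, &$fields) {
            // Drop the root element, then join what is left with underscores.
            $path = array_slice($stack, 1);
            if ($path === []) {
                return;
            }
            $key = implode('_', $path);
            // Ignore whitespace that only separates child elements.
            if (!isset($fields[$key]) && trim($text) === '') {
                return;
            }
            $fields[$key] = ($fields[$key] ?? '') . $text;
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $fields;
}

$entry = '<entry><title>Hello</title><author>'
       . '<name>Mark Pilgrim</name>'
       . '<url>http://diveintomark.org/</url>'
       . '<email>f8dy@diveintomark.org</email>'
       . '</author></entry>';

print_r(flatten_entry($entry));
```

Running this prints an array whose keys include author_name, author_url, and author_email, alongside the top-level title.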

Lastly, there are two new methods, $feed->is_rss() and $feed->is_atom(), which return false when the feed is not of that type and the feed’s version number when it is (e.g. Atom will likely return ‘0.3’; RSS could return ‘1.0’, ‘2.0’, ‘0.91’, ‘0.93b71’, or a variety of other values).
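Because these methods return either false or a version string, comparing against false explicitly is the safest way to branch on them. The sketch below uses a stand-in class, `FakeFeed`, that only mimics the return convention described here; it is not Magpie's class, and its constructor arguments and the `describe_feed` helper are invented for the example.

```php
<?php
// Stand-in for illustration only: mimics the "false or version string"
// return convention. This is not Magpie's implementation.
class FakeFeed
{
    private $type;
    private $version;

    public function __construct(string $type, string $version)
    {
        $this->type    = $type;
        $this->version = $version;
    }

    public function is_rss()
    {
        return $this->type === 'rss' ? $this->version : false;
    }

    public function is_atom()
    {
        return $this->type === 'atom' ? $this->version : false;
    }
}

// Hypothetical helper showing how a caller might branch on the result.
function describe_feed(FakeFeed $feed): string
{
    // Compare against false explicitly rather than relying on truthiness.
    if (($version = $feed->is_atom()) !== false) {
        return "Atom $version";
    }
    if (($version = $feed->is_rss()) !== false) {
        return "RSS $version";
    }
    return 'unknown';
}

echo describe_feed(new FakeFeed('atom', '0.3')), "\n"; // Atom 0.3
echo describe_feed(new FakeFeed('rss', '0.91')), "\n"; // RSS 0.91
```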

### Getting Started

I think that is everything you need to know to get started playing. I’ll do a release, complete with tarball, once Sourceforge’s CVS servers are back online. In the meantime you can download rss_parse.inc.with.atom, rename it rss_parse.inc, and it should be a drop-in replacement for your current rss_parse.inc. The documentation at the beginning of the file is out of date, but the inline comments have been updated, and you have this blog entry. (As an alternative, you might want to look at using Aaron’s Atom to RSS stylesheets.)

### Caveat

I tested against only two Atom feeds: Steve’s, which I took to be representative of Blogger’s output, and Mark’s, which I assume is an example of best practices per Atom 0.3. There was enough variation between them that I don’t feel it was a horrible sampling. Also, I only tested against an RSS 1.0 feed to make sure that RSS parsing hadn’t broken, but again, I’m feeling pretty good about it.

### Next Steps

The code is still kind of hoary, and in need of a major refactoring. Also, I’m not sure how happy I am with this whole solution; it is partially a proof of concept. So if you’re interested in parsing Atom with PHP, or have thoughts on Magpie and Atom, take it for a spin, give me some feedback, and we’ll see where it goes.

### Thoughts on Atom

I’m still not as excited about Atom as I am about RSS. It feels like a dead-end format designed for one thing, and one thing only: blogs. I guess it’s a good idea to do one thing and do it well, but I’m not sure I would have chosen blogs as my one thing to do well in life. Also, little things, like the channel’s summary field being called tagline, are just annoying, and reminiscent of some of RSS’s worse decisions. The various modes and types of fields make it hard to write a parser which is “correct” (as opposed to us writing RSS parsers).

update: Magpierss-0.6a (alpha) is available for download. This release adds the above support for Atom, as well as the patch to support broken webservers. This is not the fabled 0.6 release that was going to be a total rewrite of the parser for better namespace support; that is still vapor.

update: MagpieRSS 0.61 (not alpha) is out with Atom support.