A Pretty Puzzle with Straw and rssfinder

Outside it’s pouring rain. I like this. This is what winter is like at home. However, I’m still not inclined to go out in it when the sidewalks are torrential rivers. So I spent the day crossing minor todos off the list while avoiding thinking about the big ones. One such item was adding <content:encoded></content:encoded> data to the IMC feeds, mostly for my own selfish reasons of being able to read the feeds from Straw. Doing so triggered a minor Python exploration through the guts of Straw, rssfinder, and RSS validation.

### Feed Not Found?

Adding the <content:encoded> section was as simple as expected. I tested the new feeds to make sure XML::RSS could parse them, gave them a once-over, and then went to test how they looked in Straw. I:

  1. clicked Add a New Feed
  2. entered the URL for the feed (http://localhost/features_test.rdf)
  3. clicked find
  4. waited

Eventually Straw told me “No feed was found at specified location”. This puzzled me mightily. I checked that the feed was viewable from within a browser, tried a couple of different names, tried copying it to a couple of different servers, and made sure it validated, all to no avail. I even added a few other feeds to see if Straw would add them. It did, just not mine.

### Digging: SGMLParseError: expected name token

Nothing like totally weird behaviour to make you go digging. I spent a little while cruising around the Straw source, and eventually decided that my problem lay in SubscribeDialog.py, in particular with the line:

<pre class="code">
feeds = straw.rssfinder.getFeeds(site)
</pre>

When I removed this from the try/except block, I got a weird SGMLParseError. Why the hell was Straw trying to use an SGML parser merely to confirm the existence of my RSS file?

### When is RSS not RSS?

So while I was scratching my head, I realized that rssfinder must be trying to be too smart, trying to do RSS auto-detect or something. But why? So I went looking through the rssfinder code, and found:

<pre class="code">
def isRSS(data):
    data = data.lower()
    if data.count('<html'): return 0
    return data.count('<rss') + data.count('<rdf')
</pre>

And while I thought the count() + count() idiom was pretty cool, my debugging sense also told me here was my problem. A quick grep over the test RSS feed turned up an entry that tested an earlier bug fix in the IMC features code, from when people insisted on wrapping new features in “<html>….</html>”. And that was indeed the problem. rssfinder was finding my “<html>” tag (buried in a CDATA section), deciding that this must be an HTML file, and kicking over to smart mode.
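
To make the failure concrete, here is a minimal reproduction. It assumes the `isRSS` function exactly as quoted above; the `SAMPLE_FEED` and `CLEAN_FEED` strings are made up for illustration and are not the real IMC test feed.

```python
def isRSS(data):
    # Same logic as the rssfinder function quoted above.
    data = data.lower()
    if data.count('<html'): return 0
    return data.count('<rss') + data.count('<rdf')

# Hypothetical feed: well-formed RSS 1.0 whose item body carries an
# <html> wrapper inside a CDATA section.
SAMPLE_FEED = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <item rdf:about="http://localhost/feature/1">
    <title>Test feature</title>
    <content:encoded><![CDATA[<html><body>Hello</body></html>]]></content:encoded>
  </item>
</rdf:RDF>
"""

# The same feed with the <html> wrapper removed from the CDATA section.
CLEAN_FEED = SAMPLE_FEED.replace("<html>", "").replace("</html>", "")

print(isRSS(SAMPLE_FEED))  # 0: the '<html' substring short-circuits the check
print(isRSS(CLEAN_FEED))   # 1: only the opening '<rdf' tag is counted
```

The check is a plain substring count, so it cannot tell markup that belongs to the document from text that merely sits inside a CDATA section.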

### In Conclusion

I’m not sure what my conclusion is. I haven’t decided if this is a bug in rssfinder, or if testing for an “<html>” tag is a legitimate heuristic and my feeds were badly formed, or if the problem was in Straw’s reporting of the condition. Just thought I would document it, in case someone else has the same problem someday.
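
For anyone weighing the "bug in rssfinder" option, one possible alternative is to look only at the root element instead of counting substrings. The sketch below is purely illustrative: `looks_like_feed` is a hypothetical name, it is not what rssfinder does, and it uses Python's standard `xml.etree.ElementTree` parser.

```python
import xml.etree.ElementTree as ET

def looks_like_feed(data):
    """Hypothetical alternative to isRSS: inspect only the root element."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # Not well-formed XML; the caller would have to fall back
        # to some other heuristic.
        return False
    # Namespaced tags come back as "{namespace}local"; keep the local part.
    local = root.tag.rsplit('}', 1)[-1].lower()
    return local in ('rss', 'rdf')

# Made-up inputs for illustration.
feed = ('<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
        '<item><description><![CDATA[<html>hi</html>]]></description></item>'
        '</rdf:RDF>')
page = '<html><body><p>hi</p></body></html>'

print(looks_like_feed(feed))  # True: the CDATA content is never treated as markup
print(looks_like_feed(page))  # False: the root element is html
```

The trade-off is strictness: a real XML parser rejects feeds that are not well-formed, which a lenient substring heuristic would happily accept, so this may simply swap one class of "feed not found" surprises for another.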