Inexplicable Magpie/Snoopy Problem

I got a bug report against Magpie a couple of days ago. A user was puzzled why the seemingly valid, perfectly ordinary url http://www.dr.dk/nyheder/html/nyheder/rss/ was failing to parse.

I spent a while confirming that Snoopy (both the Magpie-patched version and the recently released 1.01) was treating the entire document as if it were HTTP headers, and treating the body as blank. I’ve spent a little while futzing with this tonight, but my motivation/energy on this is running pretty low. Anyone feeling up to a spot of debugging?

update [17-jan-04]: Thank you Phil! The problem was, as I kind of suspected, with the CRLF terminators. Seems the above host was returning “\n” instead of “\r\n”. Having futzed around with trying to get curl, or wget, or lynx to dump the raw headers I was stumped about how to flag the problem.

Phil used what he describes as “The most incredibly tacky debugger ever”, but it looks great to me:

 str_replace("\n","|n|",$snoopy->headers[9]);
 str_replace("\r","|r|",$snoopy->headers[9]);

(and it was the middle of the night no less)

### Solution

See comments for discussion; here is the patch:

```
--- magpierss-0.5.2/extlib/Snoopy.class.inc Wed Jun 25 19:34:48 2003
+++ dev/magpierss/extlib/Snoopy.class.inc Sat Jan 17 10:00:21 2004
@@ -808,7 +808,8 @@
             return false;
         }

-        if($currentHeader == "\r\n")
+//      if($currentHeader == "\r\n")
+        if(preg_match("/^\r?\n$/", $currentHeader) )
             break;
```
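To see why the relaxed test matters, here is a minimal sketch of a header-reading loop that accepts either terminator. It is not Snoopy's actual code: the function name `split_http_response` and the in-memory stream are made up for illustration, and only the `preg_match` check comes from the patch.

```php
<?php
// Hypothetical helper (not part of Snoopy or Magpie): split a raw HTTP
// response into header lines and body, accepting "\r\n" or a bare "\n".
function split_http_response($raw)
{
    // Put the raw response in an in-memory stream so it can be read
    // line by line with fgets(), as one would read from a socket.
    $fp = fopen('php://memory', 'r+');
    fwrite($fp, $raw);
    rewind($fp);

    $headers = array();
    while (($line = fgets($fp)) !== false) {
        // A line holding only a terminator ends the header block.
        // A strict ($line == "\r\n") test never fires for LF-only servers.
        if (preg_match('/^\r?\n$/', $line)) {
            break;
        }
        $headers[] = rtrim($line, "\r\n");
    }

    // Whatever is left in the stream is the body.
    $body = stream_get_contents($fp);
    fclose($fp);

    return array($headers, $body);
}

// Usage with an LF-only response like the one the problem host sent:
list($headers, $body) = split_http_response("HTTP/1.1 200 OK\nContent-Type: text/xml\n\n<rss/>");
```

With the strict comparison in place of the `preg_match`, the loop above would never break on the LF-only input, so every line would land in `$headers` and `$body` would come back empty, which matches the symptom described earlier.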

I’m going back and forth on whether to do a point release with this (and whether to even apply it to Magpie), as it feels wrong to me to make a piece of code less correct. I guess I’m a draconian, not a tolerant (which comes as no surprise to anyone who knows me). Opinions? And I’m still considering dumping Snoopy, if anyone wants to give feedback on that.

Thanks again Phil, like I said, it is a rare and pleasant surprise to wake up in the morning with fewer problems than when you went to bed.

Snoopy Alternatives (Dump the Beagle?)

I’m really tempted to ditch Snoopy altogether, as its dev cycle is abysmal (and they didn’t apply my patch! pout pout), and it includes a huge amount of HTML parsing code which is inappropriate in an HTTP library.

Other options include:

HTTP_Request from PEAR, which looks promising. Support for SSL (using OpenSSL instead of cURL), HTTP Authentication (Basic), gzip encoding, all fresh and out of the box. Looks nice and clean. And my Perl soul wants to support PEAR on the idea that it’s like CPAN. On the downside it introduces the PEAR dependency cascade which seems to be beyond the ability (either due to experience or hosting limitations) of many if not most of my users (or at least the ones I hear from).
The other option, which I’ve explored before, is HttpClient from the excellent Simon Willison. I suppose it’s a false sense of security to say “hey, I like his blog, his code must be good”, but there you have it. And he is where I cribbed how to add gzip encoding support to Snoopy. But his call syntax is awkward, there is no SSL support, and I don’t think it’s an active project. On the up side it is self contained; I could (presumably) drop the file into extlib (which stands for external library btw) in place of Snoopy, and be off to the races.

One option would be to pack the entire PEAR dependency tree into extlib, but I’m unclear on how well that will work, and it will increase the library size several fold, and probably cause some version clashes to boot. Another option would be to offer two versions of Magpie, with and without dependencies included (MoveableType takes this approach). Or lastly I could figure out how to fix bloody Snoopy. Thoughts?