Blog posts tagged "screenscraping"

Weather RSS and the Dangers of Screen Scraping

August 24th, 2003

So the Weather RSS service is down right now. The whole thing is driven by screen scraping because the freely available sources of weather info suck (or at least the ones I can find do).

I have a working screen scraper for but unfortunately their URLs are relatively obscure. I have some code to mechanize their search form, but they got a bit jumpy and temporarily blocked my IP while I was tuning it. So I put them on hold and went with Wunderground.


Well, Wunderground has US and international weather, so that is a plus, and nicely predictable URLs, which really helps when you’re screen scraping. But their HTML sucks. It’s very much circa 1997, but more cluttered. It’s not a field I have a lot of experience with, and I thought my scraper was pretty good, but 2 weeks later it is broken. Ugh.

Haven’t tried yet. Years ago, when I was last playing with weather, they blocked my IP for scraping (and I’m being well behaved, I promise!). I don’t know if they still do that; my instinct is not, at least not within reason. However, their URLs seem totally arbitrary, probably pegged to an internal numbering scheme. If anyone knows differently, that would be great. (Also their editorial voice is insipid, the difference I suppose between people who study weather for a living and those who sell it for a living, e.g. “Sunny” becomes “Plentiful sunshine”.)

More Domain Knowledge and Directories

Looks like I’m going to have to move away from Wunderground; their HTML just isn’t reliable enough. To do that I’ll need to understand the identifiers the NOAA and are using. The NOAA identifiers are standard, I think, but I haven’t found good documentation of them. The identifiers could probably be fetched by walking their website (or their syndicated Yahoo weather site, which has a more directory-like structure). The one problem with moving to identifiers rather than a search interface is that, rather than allowing free-form entry, one would want to present the users with a deeply nested list of possible choices, which has never worked all that well on the web. I just downloaded WeatherPop (sweet little app btw), which I suspect is also a screen scraper (though I should probably examine my outgoing traffic before making that claim), to see how it handles this problem, and presenting a drill-down list is exactly what it does.

If anyone has other suggestions for data in a workable format, let me know. God knows I hate screen scraping in this day and age (CSS makes it easier, but it feels so backwards).

That, and I never quite got conditional GETs supported: it was working in Magpie and at least one of the Windows clients (don’t remember which one), but not in NNW or most of the other readers.
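For reference, the server side of a conditional GET boils down to comparing the validators the client sends back against the feed’s current ones. The sketch below is a rough illustration, not the code behind the Weather RSS service: the function name `conditional_get_status`, the lower-cased header dictionary, and the sample ETag are assumptions made up for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    """Drop a weak-validator prefix so W/"x" compares equal to "x"."""
    return tag[2:] if tag.startswith("W/") else tag


def conditional_get_status(request_headers, etag, last_modified):
    """Return 304 if the client's cached copy is still current, else 200.

    request_headers: dict keyed by lower-cased header name (an assumption).
    etag: the feed's current entity tag, quotes included.
    last_modified: timezone-aware datetime of the feed's last change.
    """
    if_none_match = request_headers.get("if-none-match")
    if if_none_match is not None:
        # When If-None-Match is present it wins over If-Modified-Since.
        sent = [_strip_weak(t.strip()) for t in if_none_match.split(",")]
        return 304 if "*" in sent or _strip_weak(etag) in sent else 200

    if_modified_since = request_headers.get("if-modified-since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparseable date: just send the full feed
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop microseconds.
        if last_modified.replace(microsecond=0) <= since:
            return 304
    return 200


changed = datetime(2003, 8, 24, 12, 0, 0, tzinfo=timezone.utc)
print(conditional_get_status({"if-none-match": '"v1"'}, '"v1"', changed))
```

A reader that never sends `If-None-Match` or `If-Modified-Since` always falls through to 200, which would look exactly like conditional GET "not working" in that client.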

And last but not least, there is the lovely “Division by zero” error I get when trying to view Denver’s weather.


CPAN & Reputation

March 5th, 2003

A few days of reading my scraped CPAN RSS feed made me realize what an integral part of CPAN the PAUSE IDs are. With so many…questionable…modules being uploaded, it really helps to know who is who. I think this is one of the (many) examples where the CPAN clones fail to learn from CPAN’s success. So yeah, added the PAUSE ID to the feed.

Update: I’ve moved the RSS feed (for the last time for the foreseeable future) to CPAN recent module releases.


I Want I Want

February 18th, 2003

Note to Self: Besides a better CPAN RSS feed, and better RSS feeds for lists, I also want an RSS feed for WSWS and exploding dog (I never remember to check them).
