Sitescooper and RSS

I did this a while ago, but I’ve been very busy at work and haven’t had time to mention it. Still, it’s worth doing some preliminary pointing at Sitescooper RSS.

Basically, I’ve added RSS output to Sitescooper, the venerable HTML-scraping script that can disassemble a news/blog/reading-material website efficiently, use a cache, log in, cope with redirects, figure out when stuff is new and when it’s old, perform diffs, confuse you with copious regular expressions, etc. etc.

Sitescooper was originally oriented entirely towards display on a Palm; then new PDAs came out that could do good text or HTML display, so they’re now supported too. These days I’m no longer commuting, and I use an RSS aggregator for that kind of daily reading instead, so RSS is the natural next step.

What this means is that for those annoying blogs that don’t include the full text in the item block, or those websites you like that don’t have an RSS feed, you can make a site file and scrape them into your aggregator yourself!
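To give a rough feel for the output end of that, here’s a minimal Python sketch that turns already-scraped items into an RSS 2.0 feed with the full text in each item. To be clear, this is not Sitescooper’s code (Sitescooper is a Perl script driven by site files), and the `build_rss` function, the dict keys and the example.com URLs are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(title, link, description, items):
    """Return an RSS 2.0 document, as a string, for already-scraped items.

    `items` is a list of dicts with hypothetical keys:
    "title", "link", "body" (full HTML text) and "date" (aware datetime).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description

    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        # The full text goes into <description>; ElementTree escapes the markup.
        ET.SubElement(node, "description").text = item["body"]
        # RSS 2.0 wants RFC 822 style dates, which format_datetime produces.
        ET.SubElement(node, "pubDate").text = format_datetime(item["date"])

    return ET.tostring(rss, encoding="unicode")


# Hypothetical scraped data, standing in for whatever a scraper extracted.
scraped = [
    {
        "title": "First story",
        "link": "http://example.com/stories/1",
        "body": "<p>Hello & welcome to the full text.</p>",
        "date": datetime(2004, 1, 5, 9, 0, tzinfo=timezone.utc),
    },
]

feed = build_rss(
    "Example Site (scraped)",
    "http://example.com/",
    "Full-text feed built from scraped pages",
    scraped,
)
print(feed)
```

Point an aggregator at a file containing that output and each scraped story shows up as a full-text item. All the hard parts (caching, logins, redirects, working out what’s new) happen before this step, and the sketch skips them entirely.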

This code is present in the current Sitescooper CVS version; the only doco is really what’s in that RSS directory on sitescooper.org.

If your interest is piqued, take a look…