September 12, 2002 - Justin's Linklog

Sitescooper: Aaron notes that the Wayback Machine has added support for diffing HTML, using technology licensed from DocuComp (demo), and he notes “HTML Diff is extremely difficult and they do a half decent job, but it’s got plenty of room to improve.”

Maybe they should look at Sitescooper: it’s had HTML diffing for the last 3 years, using diff(1) or Algorithm::Diff and some basic knowledge of HTML presentation. Though mind you, DocuComp might have some trouble having a look, as it’s free software, licensed under the GPL. :)
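To give a rough idea of what "diff plus some basic knowledge of HTML presentation" can mean, here is a minimal sketch in Python. It is not Sitescooper's code (that is Perl, using diff(1) or Algorithm::Diff); it swaps in Python's standard difflib and html.parser modules. The names BLOCK_TAGS, BlockTextExtractor, html_to_blocks and new_blocks are made up for illustration, and the list of block-level tags is an assumption, nowhere near complete.

```python
import difflib
from html.parser import HTMLParser

# Tags treated as block boundaries. An assumed, incomplete list.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "blockquote", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockTextExtractor(HTMLParser):
    """Collect visible text, one entry per block-level chunk."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def _flush(self):
        # Join the text fragments and normalise whitespace.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def html_to_blocks(html):
    """Reduce an HTML page to a list of text blocks."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def new_blocks(old_html, new_html):
    """Return the text blocks in new_html that are absent from old_html."""
    old, new = html_to_blocks(old_html), html_to_blocks(new_html)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added = []
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if op in ("insert", "replace"):
            added.extend(new[j1:j2])
    return added
```

Diffing text blocks rather than raw markup means a change in inline tags or whitespace does not register as new content, which is roughly the point of knowing a bit about how HTML is presented.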

Of course, Sitescooper is a big, chunky lump of an application, very much oriented towards scraping an entire news site: downloading the latest news, stripping down the HTML and delivering that in one file. That’s exactly what you want for viewing news sites offline on a PDA, but when you want to use just one nifty feature in there, you’re stuck with the whole application. It’s just not UNIX.

So one thing I’ve been thinking about doing recently is taking some of the code in Sitescooper and refactoring it into a UNIX toolset: a wget-style getting tool, which has Sitescooper’s knowledge of how to cache and rewrite URLs; an HTML differ; and a few other tools. But this is still just thinking, at the moment.


HTML diffing