August 24, 2004 - Justin Mason's Weblog

Tech: So Danny O’Brien’s ‘Life Hacks’ talk is one of the most worthwhile reflections on productivity (and productivity technology) I’ve heard. (Cory Doctorow’s transcript from NotCon 2004, video from ETCon.)

There are a couple of things I wanted to write about it, so I’ll do them in separate blog entries.

(First off, I’d love to see Ward Cunningham’s ‘cluster files by time’ hack, it sounds very useful. But that’s not what I wanted to write about ;)

People don’t extract stuff from big complex apps using OLE and so on; it’s brittle, and undocumented. Instead they write little command-line scriptlets. Sometimes they do little bits of ‘open this URL in a new window’ OLE-type stuff to use in a pipeline, but that’s about it. And fundamentally, they pipe.

This ties into the post that reminded me to write about it — Diego Doval’s atomflow, which is essentially a small set of command-line apps for Atom storage. Diego notes:

Now, here’s what’s interesting. I have of course been using pipes for years. And yet the power and simplicity of this approach had simply not occurred to me at all. I have been so focused on end-user products for so long that my thoughts naturally move to complex uber-systems that do everything in an integrated way. But that is overkill in this case.

Exactly! He’s not the only one to get that recently: MS and Google are two very high-profile organisations that have picked up the insight; it’s the Unix way.

There’s fundamentally a breaking point where shrink-wrapped GUI apps cannot do everything you want done, and you have to start developing code yourself; the best API for that, after 30 years, is still the command-line and pipe metaphor.
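As a rough illustration of that metaphor, here’s a minimal Python sketch of a filter that reads text on stdin and writes one URL per line on stdout, so it can sit in the middle of a pipeline. The names `extract_urls`, `unique` and `main` are made up for this example; they don’t come from atomflow or any other existing tool.

```python
import re
import sys

# Deliberately loose pattern: anything that looks like an http(s) URL.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(lines):
    """Yield every http(s) URL found in an iterable of text lines."""
    for line in lines:
        for match in URL_RE.finditer(line):
            yield match.group(0)


def unique(items):
    """Yield each item once, preserving first-seen order."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


def main(in_stream=None, out_stream=None):
    """Behave like a Unix filter: text in, one unique URL per line out."""
    in_stream = sys.stdin if in_stream is None else in_stream
    out_stream = sys.stdout if out_stream is None else out_stream
    for url in unique(extract_urls(in_stream)):
        out_stream.write(url + "\n")

# A script wrapping this would simply call main() at the bottom.
```

Saved as a hypothetical `urls.py` with a call to `main()` at the end, it could be used as `wget -q -O - http://example.com/ | python urls.py | sort`. Each stage is just a generator, which is roughly the in-process equivalent of a pipe.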

(Also, complex uber-apps are what people think is needed — however, that’s just a UI scheme that’s prevailing at the moment. Bear in mind that anyone using the web today uses a command line every day. A command line will not necessarily confuse users.)

Tying back into the Life Hacks stuff — one thing that hasn’t yet been done properly as a command-line-and-pipe tool, though, is web-scraping. Right now, if you scrape, you’ve got to do either (a) lots of munging in a single big fat script of your own devising, if you’re lucky using something like WWW::Mechanize (which is excellent!); (b) use a scraping app like sitescooper; or (c) get hacky with a shell script that runs wget and greps bits of output out in a really brittle way.

I’ve been considering a ‘next-generation sitescooper’ on and off over the past year, and I think the best way to do it is to split its functionality up into individual scripts/Perl modules:

one to download files, maintaining a cache, taking likely freshness into account, and dealing with crappy HTTP/HTTPS weirdness like cookies, logins and redirects;
one to diff HTML;
one to lobotomise (i.e. simplify) HTML;
one to scrape out the ‘good bits’ using sitescooper-style regions.
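To make the first item a bit more concrete, here’s a small Python sketch of the caching part only. It’s an assumption-laden toy: `PageCache` is a hypothetical name, ‘likely freshness’ is reduced to a fixed maximum age, the cache lives in memory, and the actual fetching (with all the cookie, login and redirect handling) is left to whatever callable you pass in.

```python
import time


class PageCache:
    """In-memory cache keyed by URL; `fetch` is any callable returning page text."""

    def __init__(self, fetch, max_age_seconds=3600, clock=time.time):
        self.fetch = fetch              # hypothetical downloader, supplied by caller
        self.max_age = max_age_seconds  # crude stand-in for "likely freshness"
        self.clock = clock              # injectable so the logic can be tested
        self.entries = {}               # url -> (fetched_at, body)

    def get(self, url):
        """Return the cached body if it is still fresh, else fetch and store it."""
        now = self.clock()
        entry = self.entries.get(url)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]  # still fresh: no network traffic
        body = self.fetch(url)
        self.entries[url] = (now, body)
        return body
```

A real version would want an on-disk cache and per-site freshness rules, but keeping the fetcher pluggable means the messy HTTP handling can live in its own tool.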

Tie those into HTML Tidy and XMLStarlet, and you have an excellent command-line scraping framework.
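The diff, simplify and region-scraping pieces could each be a small function (and from there a small script). What follows is a hedged Python sketch using only the standard library, not how sitescooper actually does it: the function names, the tag whitelist, and the idea of marking a region with a start and end regular expression are all assumptions made for illustration.

```python
import difflib
import re
from html import escape
from html.parser import HTMLParser

KEEP_TAGS = {"p", "a", "h1", "h2", "h3", "ul", "ol", "li", "pre", "blockquote"}
SKIP_CONTENT = {"script", "style"}  # drop these tags and everything inside them


class _Simplifier(HTMLParser):
    """Keeps text and a small whitelist of tags; drops all attributes except href."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP_TAGS and not self.skip_depth:
            href = dict(attrs).get("href") if tag == "a" else None
            if href:
                self.out.append('<a href="%s">' % escape(href, quote=True))
            else:
                self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP_TAGS and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def simplify_html(markup):
    """'Lobotomise' a page: strip layout tags, attributes, scripts and styles."""
    parser = _Simplifier()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


def extract_region(markup, start_pattern, end_pattern):
    """Return the text between the first start match and the next end match."""
    start = re.search(start_pattern, markup)
    if not start:
        return None
    end = re.search(end_pattern, markup[start.end():])
    if not end:
        return None
    return markup[start.end():start.end() + end.start()]


def added_lines(old_text, new_text):
    """Return the lines difflib reports as added in new_text relative to old_text."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]
```

Chained together, something like `simplify_html(extract_region(page, start, end))` gives the ‘good bits’ of one fetch, and `added_lines` on two successive results shows what’s new since last time. A line-based diff is a crude approximation of a proper HTML diff, but it’s enough to show the shape of the pipeline.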

Still haven’t got any time to do all that though. :(

Life Hacks: getting back to the command-line