
Tag: hacks

User script: add my delicious search results to Google

For years now, I’ve been collecting bookmarks at delicious.com/jm — nearly 7000 of them by now. I’ve been scrupulous about tagging and describing each one, so they’re eminently searchable, too. I’ve frequently found this to be a very useful personal reference resource.

I was quite pleased, accordingly, to come across the Delicious Search Results on Google Greasemonkey userscript. It intercepts Google searches, adding Delicious tag-search results at the top of the search page, and works pretty well. Unfortunately, though, it searches all of Delicious, not specifically my own bookmarks.

So here’s a quick hack fix to do just that:

my_delicious_search_results.user.js – My Delicious Search Results on Google

Shows tag-search results from my Delicious account on Google search pages, with links to more extensive Delicious searches. Use ‘User Script Commands’ -> ‘Set Delicious Username’ to specify your username.


Enjoy!

2 Comments

Google Reader productivity hack: change your Home

So, if you use Google Reader, read your news with the “All items” page, and are subscribed to hundreds of feeds, the volume can be pretty overwhelming. I’ve found a better way to deal with this.

Select a ‘most important’ subset of feeds. For each of those, click through to the feed details page, hit the “Feed Settings…” menu, and select “Change folders…”. Put the feed into a new “top” folder (creating it if necessary).

Now go to “Settings” -> “Preferences” and check out the “Start page” preference. By default, it’s set to “Home”; change it to “Folders and Tags: top”.

Hey presto — now, when you load Google Reader, it’ll come up with your “top” items. You can get through those quickly enough, and get on to other more important tasks. When you’re bored and need something to read, though, just hit “Navigation” -> “All items” (or even just type ‘ga’), and every other feed is now there for your delectation. Sweet!

2 Comments

Hack: reassassinate

A coworker today, returning from a couple of weeks of holiday, bemoaned the quantities of spam he had to wade through. I mentioned a hack I often use in this situation: discard the spam, download the two weeks of supposed non-spam as one huge mbox, and rescan it all with SpamAssassin. Since the intervening two weeks gave plenty of time for the URLs to be blacklisted by URIBLs and the sending IPs to be listed by DNSBLs, this generally yields better spamfilter accuracy, at least in terms of reducing false negatives (the “missed spam”). In other words, it gets rid of most of the remaining spam nicely.

Chatting about this, it occurred to us that it’d be easy enough to generalize this hack into something more widely useful by hooking up the Mail::IMAPClient CPAN module with Mail::SpamAssassin, and in fact, it’d be pretty likely that someone else would already have done so.

Sure enough, a search threw up this node on perlmonks.org, containing a script which did pretty much all that. Here’s a minor freshening: download

reassassinate – run SpamAssassin on an IMAP mailbox, then reupload

Usage: ./reassassinate --user jmason --host mail.example.com --inbox INBOX --junkfolder INBOX.crap

Runs SpamAssassin over all mail messages in an IMAP mailbox, skipping ones it’s processed before. It then reuploads the rewritten messages to one of two locations depending on whether they are spam or not: nonspam messages are simply re-saved to the original mailbox, while spam messages are sent to the mailbox specified with “--junkfolder”.

This is especially handy if some time has passed since the mails were originally delivered, allowing more of the spam mails’ contents to be blacklisted by third-party DNSBLs and URIBLs in the meantime.

Prerequisites:

  • Mail::IMAPClient
  • Mail::SpamAssassin
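
The same workflow can be sketched in Python using the stdlib’s imaplib, piping each message through the `spamassassin` command-line client. This is a rough illustration of the steps described above, not the Perl script itself; the skip-processed check and folder handling are simplified.

```python
import imaplib
import subprocess

def headers_flag_spam(rewritten):
    """True if SpamAssassin's rewrite added X-Spam-Flag: YES to the headers."""
    headers = rewritten.replace(b"\r\n", b"\n").split(b"\n\n", 1)[0]
    return b"X-Spam-Flag: YES" in headers

def rescan_message(raw_bytes):
    """Pipe one message through the spamassassin CLI; return (is_spam, rewritten)."""
    proc = subprocess.run(["spamassassin"], input=raw_bytes,
                          stdout=subprocess.PIPE, check=True)
    return headers_flag_spam(proc.stdout), proc.stdout

def reassassinate(host, user, password, inbox="INBOX", junkfolder="INBOX.crap"):
    imap = imaplib.IMAP4_SSL(host)
    imap.login(user, password)
    imap.select(inbox)
    _, data = imap.search(None, "ALL")
    for num in data[0].split():
        _, fetched = imap.fetch(num, "(RFC822)")
        raw = fetched[0][1]
        if b"X-Spam-Status:" in raw:
            continue                     # crude "already processed" check
        spam, rewritten = rescan_message(raw)
        imap.append(junkfolder if spam else inbox, None, None, rewritten)
        imap.store(num, "+FLAGS", r"\Deleted")   # remove the original copy
    imap.expunge()
    imap.logout()
```

Invoked as, say, `reassassinate("mail.example.com", "jmason", password)`, mirroring the usage line above.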
3 Comments

Links for 2008-10-10

Comments closed

Links for 2008-10-07

2 Comments

Hack: twitter_no_popups.user.js

Twitter has this nasty habit — if you come across a tweet in your feed reader containing a URL, and you want to follow that link, you can’t, because Twitter doesn’t auto-link URLs in its RSS feeds. Instead, you have to click on the feed item, itself, wait for that to open in the browser, then click on the link in the new browser tab. That link will, in turn, open in another new tab.

Here’s a quick-hack Greasemonkey user script to inhibit this second new-tab:

twitter_no_popups.user.js

3 Comments

Full-text RSS bookmarklet

This site offers a nifty utility for dealing with those annoying sites which offer only partial text content in their RSS and Atom feeds.

Given an RSS or Atom feed’s URL, the CGI will iterate through the posts in the feed, scrape the full text of each post from its HTML page, and re-generate a new RSS feed containing the full text.

The one thing it’s missing is a one-click bookmarklet version. So here it is:

Full-text RSS Bookmarklet

Drag that to your bookmarks menu, and next time you’re looking at a partial-text feed, click the bookmark to transform the viewed page into the full-text version. Enjoy!

7 Comments

converting TAP output to JUnit-style XML

Here’s a perl script that may prove useful: tap-to-junit-xml

NAME

tap-to-junit-xml – convert perl-style TAP test output to JUnit-style XML

SYNOPSIS

tap-to-junit-xml "test suite name" [ outputprefix ] < tap_output.log

DESCRIPTION

Parse test suite output in TAP (Test Anything Protocol) format, and produce XML output in a similar format to that produced by the <junit> ant task. This is useful for consumption by continuous-integration systems like Hudson.

Written in perl, requires TAP::Parser and XML::Generator. It's based on junit_xml.pl by Matisse Enzer, although pretty much entirely rewritten.
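
The core of the conversion is simple enough to sketch in Python. This minimal version handles only bare “ok”/“not ok” lines; the real script, via TAP::Parser, also copes with plans, directives and diagnostics:

```python
import re
import xml.etree.ElementTree as ET

def tap_to_junit(tap_text, suite_name="suite"):
    """Convert basic TAP lines into a JUnit-style <testsuite> XML string."""
    suite = ET.Element("testsuite", name=suite_name)
    tests = failures = 0
    for line in tap_text.splitlines():
        m = re.match(r"(not )?ok\b\s*\d*\s*-?\s*(.*)", line)
        if not m:
            continue  # skip plans ("1..N"), comments, diagnostics
        tests += 1
        case = ET.SubElement(suite, "testcase", name=m.group(2) or "unnamed")
        if m.group(1):  # "not ok"
            failures += 1
            ET.SubElement(case, "failure", message="test failed")
    suite.set("tests", str(tests))
    suite.set("failures", str(failures))
    return ET.tostring(suite, encoding="unicode")
```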

11 Comments

Announcing IrishPulse

As I previously threatened, I’ve gone ahead and created a “Microplanet” for Irish twitterers, similar to Portland’s Pulse of PDX — an aggregator of the “stream of consciousness” that comes out of our local Twitter community: IrishPulse.

Here’s what you can do:

  • Add yourself: if you’re an Irish Twitter user, follow the user ‘irishpulse’. This will add you to the sources list.
  • Publicise it: feel free to pass the URL on to other Irish Twitter users, and blog about it.
  • Read it: bookmark it and take a look now and again!

In terms of implementation, it’s just a (slightly patched) copy of Venus and a perl script using Net::Twitter to generate an OPML file of the Twitter followers. Here’s the source. I’d love to see more “Pulse” sites using this…
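
The OPML-generation half is easy to sketch in Python; assume the follower screen names have already been fetched via the Twitter API (the original uses Net::Twitter), and note that the per-user RSS feed URL below is the old-style one and an assumption of this sketch:

```python
import xml.etree.ElementTree as ET

def followers_to_opml(followers, title="IrishPulse sources"):
    """Build an OPML reading list, one outline per follower, for Planet Venus."""
    opml = ET.Element("opml", version="1.1")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for name in followers:
        # old-style per-user Twitter RSS URL; an assumption of this sketch
        ET.SubElement(body, "outline", type="rss", text=name,
                      xmlUrl="http://twitter.com/statuses/user_timeline/%s.rss" % name)
    return ET.tostring(opml, encoding="unicode")
```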

4 Comments

Remote sound playback through a Nokia 770

For a while now, I’ve been using various hacks to play music from my Linux laptop, holding my main music collection, to client systems which drive the speakers.

Previously, I used this setup to play via my MythTV box. Nowadays, however, my TV isn’t in the room where I want to listen to music. Instead, I have my Nokia 770 hooked up to the speakers; this plays the BBC Radio 4 RealAudio streams nicely, and also the laptop’s MP3 collection using a UPnP AV MediaServer.

I specifically use TwonkyMedia right now, playing back via the N770’s Media Streamer app. (That works pretty well; UPnP AV is one of those standards plagued with incompatibilities, but TwonkyMedia and Media Streamer seem to be a reliable combination.)

However, TwonkyMedia sometimes fails to notice updates of the library, and nothing has quite as good a music-player user interface as JuK, the KDE music player and organiser app, so a way to play directly from the laptop instead of via UPnP would be nice…
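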

A weekend’s hacking reveals that this is pretty easily done nowadays, thanks to some cool features in PulseAudio, the current standard sound server on Ubuntu Gutsy, and the Esound server running on the N770.

Unfortunately, the N770 doesn’t (yet) support pulseaudio directly, otherwise we could use its seriously cool support for RTP multicast streams. Still, we can hack something up using the venerable “esd” protocol (again!) Here’s how to set it up…

On the N770:

You need to fix the N770’s “esd” sound server to allow public connections. Set up your wifi network’s DHCP server to give the N770 a static IP address. Log in over SSH, or fire up an xterm. Run the following:

mv /usr/bin/esd /usr/bin/esd.real

cat > /usr/bin/esd <<'EOM'
#!/bin/sh
exec /usr/bin/esd.real -tcp -public -promiscuous -port 5678 "$@"
EOM

chmod 755 /usr/bin/esd
/etc/init.d/esd restart

On the server:

Download this file, and save it as n770.pa. Edit it, and change server=n770:5678 on the fourth line to use the IP address or hostname of your Nokia 770 instead of n770. Then run:

cp n770.pa ~/.n770.pa

cat > ~/bin/sound_n770 <<EOM
#!/bin/sh
pulseaudio -k; pulseaudio -nF $HOME/.n770.pa &
EOM

cat > ~/bin/sound_here <<EOM
#!/bin/sh
pulseaudio -k; pulseaudio &
EOM

chmod 755 ~/bin/sound_here ~/bin/sound_n770

Now you just need to run ‘~/bin/sound_n770’ to redirect sound playback to the N770, and ‘~/bin/sound_here’ to reset back to laptop speaker output, for the entire desktop environment. Nifty!

Update: it appears that things may work more reliably if you add “rate=22050” at the end of the “load-module module-esound-sink” line — this halves the bitrate of the network stream, which copes better with harsh wifi network conditions. The n770.pa file above now includes this.

5 Comments

Host monitoring with Jaiku

A few weeks back, we were having trouble with dogma, our shared server where taint.org is hosted, which would occasionally be unavailable for unknown reasons. We needed to monitor its availability so that it could be fixed when it crashed again, and we’d be able to investigate quickly. Since it was happening mostly out of working hours, SMS notification was essential.

Normally, that kind of monitoring is pretty basic stuff, and there are plenty of options out there that can do it, from Host-Tracker.com to more complex self-hosted apps like monit and Nagios. But looking around, I found that none of them offered SMS notification for free, and since this was our personal-use server, I wasn’t willing to sign up for a $10-per-month paid account to support it, or buy any hardware to act as a private SMS gateway.

Instead, I thought of Jaiku — the Finnish company which offers a microblogging/presence platform similar to Twitter. Jaiku had a couple of cool features:

  • SMS notifications
  • it’s possible to broadcast messages to a “channel”, which others could subscribe to, IRC-style
  • it has an open API

This would let me notify any interested party of dogma’s downtime, with subscribers able to join and leave using whatever notification systems Jaiku supports.

With a little perl and LWP, I rigged up a quick monitoring script to check http://taint.org/ via HTTP, and report if it was unavailable over the course of 5 retries in 50 seconds. If it was broken, the script sends a JSON-formatted POST request to Jaiku’s “presence.send” method, informing the target channel of the issue. (Perl source here.)
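
In outline, the script looks something like this Python sketch (the original is Perl + LWP; the Jaiku endpoint and payload fields here are illustrative, not the documented API):

```python
import json
import time
import urllib.request

def site_is_up(url, tries=5, delay=10):
    """True if any of `tries` HTTP GETs succeeds (5 tries in ~50 seconds)."""
    for attempt in range(tries):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass                      # connection refused, timeout, DNS failure...
        if attempt < tries - 1:
            time.sleep(delay)
    return False

def notify_jaiku(endpoint, user, personal_key, message):
    """POST a JSON-formatted presence.send request (illustrative payload)."""
    payload = json.dumps({"method": "presence.send", "user": user,
                          "personal_key": personal_key,
                          "message": message}).encode("utf-8")
    req = urllib.request.Request(endpoint, data=payload,
                                 headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)
```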

You can see the ‘#dogmastatus’ channel here — as you can see, we fixed the problem with dogma just over 2 weeks ago ;)

It’s worth noting that I had to set up an additional user, “downtimebot”, on Jaiku to send the messages — otherwise I’d never see them on my configured mobile phone! Jaiku uses the optimisation that, if I sent the message, there’s no need to cc me with a copy of what I just sent; logical enough.

Anyway, if you’re interested in dogma’s availability (there might be one or two taint.org readers who are), feel free to add yourself to the #dogmastatus channel and receive any updates.

Update: Fergal noted that it’s pretty simple to use Cape Clear’s assembly framework to perform an HTTP ping test with output to Jabber/XMPP. Nifty!

7 Comments

How to solve a maze with Photoshop

wow, this is cool. lod3n, confronted by this heinous puzzle, wrote:

‘2 minutes in Photoshop. All too easy. So, where do I pick up my cake?

  1. Increase contrast.
  2. Select the right wall of the maze using the magic wand.
  3. Select > Modify > Expand 4 pixels
  4. Create new layer.
  5. Fill with Red.
  6. Select > Modify > Contract 2 pixels.
  7. Delete. Now you’ve got a line tracing the solution.
  8. Manually clean up the outer edge, and connect the dots.
  9. Cake!’

Here’s the result. Seriously nifty!

(Update: wow, this got Dugg heavily — 17000 pageviews from Digg alone! Unfortunately that caused a bit of a server meltdown. Should be back now though…)

118 Comments

A SpamAssassin rule-discovery algorithm

Just to get a little techie again… here’s a short article on a new algorithm I’ve come up with.

Text-matching rule-based anti-spam systems are pretty common; SpamAssassin is probably the best-known, and the proprietary apps built on SpamAssassin take the same approach. Other proprietary scanners seem to use similar techniques too, such as Symantec’s Brightmail and MessageLabs’ scanner (hi Matt ;), and doubtless there are others. As a result, ways to write rules quickly and effectively are valuable.

So far, most SpamAssassin text rules are manually developed; somebody looks at a few spam samples, spots common phrases, and writes a rule to match that. It’d be great to automate more of that work. Here’s an algorithm I’ve developed to perform this in a memory-efficient and time-efficient way. I’m quite proud of this, so thought it was worth a blog posting. ;)

Corpus collection

First, we collect a corpus of spam and “ham” (non-spam) mails. Standard enough, although in this case it helps to try to keep it to a specific type of mail (for example, a recent stock spam run, or a run from the OEM spammer).

Typically, a simple “grep” will work here, as long as the source corpus is all spam anyway; a small number of irrelevant messages can be left in, as long as the majority (80% or so) are variations on the target message set. (The SpamAssassin mass-check tool can now perform this on the fly using the new ‘GrepRenderedBody’ mass-check plugin, which is helpful.)

Rendering

Next, for each spam message, render the body. This involves:

  • decoding MIME structure
  • discarding non-textual parts, or parts that are not presented to the viewer by default in common end-user MUAs (such as attachments)
  • decoding quoted-printable and base64 encoding
  • rendering HTML, again based on the behaviour of the HTML renderers used in common end-user MUAs
  • normalising whitespace, “this is\na \ntest” -> “this is a test”

All pretty basic stuff, and performed by the SpamAssassin “body” rendering process during a “mass-check” operation. A SpamAssassin plugin outputs each message’s body string to a log file.

Next, we take the two log files, and process them using the following algorithm:

N-gram Extraction

Iterate through each mail message in the spam set. Each message is assigned a short message ID number. Cut off all but the first 32 kbytes of the text (for this algorithm, I think it’s safe to assume that anything past 32 KB will not be a useful place for spammers to place their spam text). Save a copy of this shortened text string for the later “collapse patterns” step.

Split the text into “words” — ie. space-separated chunks of non-whitespace chars. Compress each “word” into a shorter ID to save space:

"this is a test" => "a b c d"

(The compression dictionary used here is shared between all messages, and also needs to allow reverse lookups.)

Then tokenize the message into 2-word and 3-word phrase snippets (also known as N-grams):

"a b c d" => [ "a b", "b c", "c d", "a b c", "b c d" ]

Remove duplicate N-grams, so each N-gram only appears once per message.

For each N-gram token in this token set, increment a counter in a global “token count” hashtable, and add the message ID to the token’s entry in a “message subset hit” table.

Next, process the ham set. Perform the same algorithm, except: don’t keep the shortened text strings, don’t cut at 32KB, and instead of incrementing the “token count” hash entries, simply delete the entries in the “token count” and “message subset hit” tables for all N-grams that are found.

By the end of this process, all ham and spam have been processed, and in a memory-efficient fashion. We now have:

  • a table of hit-counts for the message text N-grams, with all N-grams where P(spam) < 1.0 — ie. where even a single ham message was hit — already discarded
  • the “message subset hit” table, containing info about exactly which subset of messages contain a given N-gram
  • the token-to-word reverse-lookup table

To further reduce memory use, the word-to-token forward-lookup table can now be freed. In addition, the values in the “message subset hit” table can be replaced with their hashes; we don’t need to be able to tell exactly which messages are listed there, we just need a way to tell if one entry is equal to another.
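
For concreteness, the extraction bookkeeping above can be sketched in Python (the real implementation is Perl, in SpamAssassin SVN; the table names here are my own):

```python
from collections import defaultdict

word2id = {}                      # shared compression dictionary
id2word = []                      # reverse lookup, for decoding patterns later
token_count = defaultdict(int)    # N-gram -> number of spams hit
subset_hits = defaultdict(set)    # N-gram -> set of message IDs hit

def compress(text, limit=32 * 1024):
    """Cut at 32 KB and compress each space-separated word to an integer ID."""
    out = []
    for word in text[:limit].split():
        if word not in word2id:
            word2id[word] = len(id2word)
            id2word.append(word)
        out.append(word2id[word])
    return out

def ngrams(ids):
    """Unique 2- and 3-word N-grams: each counts at most once per message."""
    grams = set()
    for n in (2, 3):
        for i in range(len(ids) - n + 1):
            grams.add(tuple(ids[i:i + n]))
    return grams

def add_spam(msg_id, text):
    for gram in ngrams(compress(text)):
        token_count[gram] += 1
        subset_hits[gram].add(msg_id)

def add_ham(text):
    """Any N-gram also seen in ham is discarded outright (P(spam) < 1.0).
    Ham is not truncated at 32 KB, and its text is not kept."""
    for gram in ngrams(compress(text, limit=None)):
        token_count.pop(gram, None)
        subset_hits.pop(gram, None)
```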

Summarisation

Iterate through the hit-count table. Discard entries that occur too infrequently to be listed; discard, especially, entries that occur only once. (We’ve already discarded entries that hit any ham.)

Make a hash that maps the message subsets to the set of all N-gram patterns for that message-subset. For each subset, pick a single N-gram, and note the hit-count associated with it as the hit-count value for that entire message-subset. (Since those N-grams all appear in the exact same subset of messages, they will always have the same P(spam) — this is a safe shortcut.)
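
The grouping step, sketched in Python with illustrative names; a frozenset key stands in for the subset hash the real code uses to save memory:

```python
from collections import defaultdict

def group_by_subset(token_count, subset_hits, min_hits=2):
    """Map each distinct message-subset to all N-gram patterns hitting exactly
    that subset; the subset's size is the shared hit-count for the group."""
    groups = defaultdict(list)
    for gram, count in token_count.items():
        if count < min_hits:
            continue          # discard patterns that occur too infrequently
        groups[frozenset(subset_hits[gram])].append(gram)
    # one representative hit-count per subset (== the subset's size)
    return {subset: (len(subset), patterns)
            for subset, patterns in groups.items()}
```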

Iterate through the message subsets, in order of their hit-count. Take all of the message-subset’s patterns, decode the N-grams in all patterns using the token-to-word reverse-lookup table, and apply this algorithm to that pattern set:

Collapse patterns

So, input here is an array of N-gram patterns, which we know always occur in the same subset of messages. We also have the saved array of all spam messages’ shortened text strings, from the N-gram extraction step. With this, we can apply a form of the BLAST pattern-discovery algorithm, from bioinformatics.

Pop the first entry off the array of patterns. Find any one mail from the saved-mails array that hits this pattern. Find the single character before the pattern in this mail, and prepend it to the pattern. See if the hits for this new pattern are the same message set as hit the old pattern; if not, restore the old pattern and break. If you hit the start of the mail message’s text string, break. Then apply the same algorithm forward through the mail text.

By the end of that, you have expanded the pattern from the basic N-gram as far as it’s possible to go in both directions without losing a hit.

Next, discard all patterns in the pattern array that are subsumed by (ie. appear in) this new expanded pattern. Add it to the output list of expanded patterns, unless it in turn is already subsumed by a pattern in that list; discard any patterns in the output list that are subsumed by this new pattern; and move onto the next pattern in the input list until they’re all exhausted.

(By the way, the “discard if subsumed” trick is the reason why we start off with 3-word N-grams — it gives faster results than just 2-word N-grams alone, presumably by reducing the amount of work that this collapse stage has to do, by doing more of it upfront at a relatively small RAM cost.)
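
A rough Python sketch of this expand-and-subsume loop, assuming the saved mails are plain text strings:

```python
def hits(pattern, mails):
    """The set of message indices whose text contains the pattern."""
    return frozenset(i for i, m in enumerate(mails) if pattern in m)

def expand(pattern, mails):
    """Grow the pattern one character at a time, left then right, for as
    long as the set of messages it hits stays exactly the same."""
    target = hits(pattern, mails)
    mail = next(m for m in mails if pattern in m)   # any one hit mail
    while True:                                     # grow leftwards
        pos = mail.find(pattern)
        if pos == 0:
            break                                   # start of message text
        wider = mail[pos - 1] + pattern
        if hits(wider, mails) != target:
            break
        pattern = wider
    while True:                                     # then grow rightwards
        end = mail.find(pattern) + len(pattern)
        if end == len(mail):
            break
        wider = pattern + mail[end]
        if hits(wider, mails) != target:
            break
        pattern = wider
    return pattern

def collapse(patterns, mails):
    """Expand each pattern and discard any pattern subsumed by (i.e.
    appearing inside) an already-expanded one."""
    out = []
    for p in patterns:
        if any(p in done for done in out):
            continue                                # already subsumed
        e = expand(p, mails)
        out = [d for d in out if d not in e]        # drop subsumed outputs
        out.append(e)
    return out
```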

Summarisation (continued)

Finally, output a line listing the percentage of the input spam messages hit (ie. (hit-count value / total number of spams) * 100) and the list of expanded patterns for that message-subset, then iterate on to the next message-subset.

Example

Here’s an example of some output from recent “OEM” stock spam:

$ ./seek-phrases-in-corpus --grep 'OEM' \
        spam:dir:/local/cor/recent/spam/*.2007022* \
        ham:dir:/local/cor/recent/ham/*.200702*
[mass-check progress noises omitted]
 RATIO   SPAM%    HAM%   DATA
 1.000  72.421   0.000  / OEM software - throw packing case, leave CD, use electronic manuals. Pay for software only and save 75-90%! /,
                         / TOP 1O ITEMS/
 1.000  73.745   0.000  / $99 Macromedia Studio 8 $59 Adobe Premiere 2.0 $59 Corel Grafix Suite X3 $59 Adobe Illustrator CS2 $129 Autodesk Autocad 2007 $149 Adobe Creative Suite 2 /,
                         /s: Adobe Acrobat PR0 7 $69 Adobe After Effects $49 Adobe Creative Suite 2 Premium $149 Ableton Live 5.0.1 $49 Adobe Photoshop CS $49 http:\/\//,
                         / Microsoft Office 2007 Enterprise Edition Regular price: $899.00 Our offer: $79.95 You save: $819.95 (89%) Availability: Pay and download instantly. http:\/\//,
                         / Adobe Acrobat 8.0 Professional Market price: $449.00 We propose: $79.95 Your profit: $369.05 (80%) Availability: Available for /,
                         / $49 Windows XP Pro w\/SP2 $/,
                         / Top-ranked item. (/,
                         /, use electronic manuals. Pay for software only and save 75-90%! /,
                         / Microsoft Windows Vista Ultimate Retail price: $399.00 Proposition: $79.95 Your benefit: $319.05 (80%) Availability: Can be downloaded /,
                         / $79 MS Office Enterprise 2007 $79 Adobe Acrobat 8 Pro $/,
                         / Best choice for home and professional. (/,
                         / OEM software - throw packing case, leave CD/,
                         / Sales Rank: #1 (/,
                         / $79 Microsoft Windows Vista /,
                         / manufacturers: Microsoft...Mac...Adobe...Borland...Macromedia http:\/\//
 1.000  73.855   0.000  / MS Office Enterprise 2007 /,
                         /9 Microsoft Windows Vista /,
                         / Microsoft Windows Vista Ultimate /,
                         /9 Macromedia Studio 8 /,
                         / Adobe Acrobat 8.0 /,
                         / $79 Adobe /
 1.000  74.242   0.000  / Windows XP Pro/
 1.000  74.297   0.000  / Adobe Acrobat /
 1.000  74.462   0.000  / Adobe Creative Suite /
 1.000  74.573   0.000  / Adobe After Effects /
 1.000  74.738   0.000  / Adobe Illustrator /
 1.000  74.959   0.000  / Adobe Photoshop CS/
 1.000  75.014   0.000  / Adobe Premiere /
 1.000  75.290   0.000  / Macromedia Studio /
 1.000  75.786   0.000  /OEM software/
 1.000  75.841   0.000  / Creative Suite /
 1.000  75.896   0.000  / Photoshop CS/
 1.000  75.951   0.000  / After Effects /
 1.000  76.062   0.000  /XP Pro/
 1.000  82.460   0.000  / $899.00 Our /,
                         / Microsoft Office 2007 Enterprise /,
                         / $79.95 You/

Immediately, that provides several useful rules; in particular, that final set of patterns can be combined with a SpamAssassin “meta” rule to hit 82% of the samples. Generating this took a quite reasonable 58 MB of virtual memory, with a runtime of about 30 minutes, analyzing 1816 spam and 7481 ham mails on a 1.7 GHz Pentium M laptop.

(Update:) here’s a sample message from that test set, demonstrating the top extracted snippets in bold:

  Return-Path: <[email protected]>
  X-Spam-Status: Yes, score=38.2 required=5.0 tests=BAYES_99,DK_POLICY_SIGNSOME,
          FH_HOST_EQ_D_D_D_D,FH_HOST_EQ_VERIZON_P,FH_MSGID_01C67,FUZZY_SOFTWARE,
          HELO_LOCALHOST,RCVD_IN_NJABL_DUL,RCVD_IN_PBL,RCVD_IN_SORBS_DUL,RDNS_DYNAMIC,
          URIBL_AB_SURBL,URIBL_BLACK,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_RHS_DOB,
          URIBL_SBL,URIBL_SC_SURBL shortcircuit=no autolearn=spam version=3.2.0-r492202
  Received: from localhost (pool-71-125-81-238.nwrknj.east.verizon.net [71.125.81.238])
          by dogma.boxhost.net (Postfix) with SMTP id E002F310055
          for <[email protected]>; Sun, 18 Feb 2007 08:58:20 +0000 (GMT)
  Message-ID: <000001c7533a$b1d3ba00$0100007f@localhost>
  From: "Kevin Morris" <[email protected]>
  To: <[email protected]>
  Subject: Need S0ftware?
  Date: Sun, 18 Feb 2007 03:57:56 -0500

  OEM software - throw packing case, leave CD, use electronic manuals.
  Pay for software only and save 75-90%!

  Discounts! Special offers! Software for home and office!
              TOP 1O ITEMS.

    $79 Microsoft Windows Vista Ultimate
    $79 MS Office Enterprise 2007
    $79 Adobe Acrobat 8 Pro
    $49 Windows XP Pro w/SP2
    $99 Macromedia Studio 8
    $59 Adobe Premiere 2.0
    $59 Corel Grafix Suite X3
    $59 Adobe Illustrator CS2
  $129 Autodesk Autocad 2007
  $149 Adobe Creative Suite 2
  http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t0

            Mac Specials:
  Adobe Acrobat PR0 7             $69
  Adobe After Effects             $49
  Adobe Creative Suite 2 Premium $149
  Ableton Live 5.0.1              $49
  Adobe Photoshop CS              $49
  http://ot.rezinkaoem.com/-software-for-mac-.php?0B85330BA896A9992D0561E08037493852CE
  6E1FAE&t6

  See more by this manufacturers:
  Microsoft...Mac...Adobe...Borland...Macromedia
  http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t4

  Microsoft Windows Vista Ultimate
  Retail price:  $399.00
  Proposition:  $79.95
  Your benefit:  $319.05 (80%)
  Availability: Can be downloaded INSTANTLY.
  http://ot.rezinkaoem.com/2480.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t3
  Best choice for home and professional. (37268 reviews)

  Microsoft Office 2007 Enterprise Edition
  Regular price:  $899.00
  Our offer:  $79.95
  You save:  $819.95 (89%)
  Availability: Pay and download instantly.
  http://ot.rezinkaoem.com/2442.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t1
  Sales Rank: #1 (121329 reviews)

  Adobe Acrobat 8.0 Professional
  Market price:  $449.00
  We propose:  $79.95
  Your profit:  $369.05 (80%)
  Availability: Available for INSTANT download.
  http://ot.rezinkaoem.com/2441.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t2
  Top-ranked item. (31949 reviews)

Further work

Things that would be nice:

  • It’d be nice to extend this to support /.*/ and /.{0,10}/ — matching “anys”, also known as “gapped alignment” searches in bioinformatics, using algorithms like the Smith-Waterman or Needleman-Wunsch algorithms. (Update: this has been implemented.)
  • A way to detect and reverse-engineer templates, e.g. “this is foo”, “this is bar”, “this is baz” => “this is (foo|bar|baz)”, would be great.
  • Finally, heuristics to detect and discard likely-poor patterns are probably the biggest wishlist item.

Tuits are the problem, of course, since $dayjob is the one that pays the bills, not this work. :(

The code is being developed here, in SpamAssassin SVN. Feel free to comment/mail if you’re interested, have improvement ideas, or want more info on how to use it… I’d love to see more people trying it out!

Some credit: I should note that IBM’s Chung-Kwei system, presented at CEAS 2004, was the first time I’d heard of a pattern-discovery algorithm (namely, their proprietary Teiresias algorithm) being applied to spam.

9 Comments

Script: knewtab

Here’s a handy script for konsole users like myself:

knewtab — create a new tab in a konsole window, from the commandline

usage: knewtab {tabname} {command line …}

Creates a new tab in a “konsole” window (the current window, or a new one if the command is not run from a konsole).

Requires that the konsole app be run with the “--script” switch.

Download ‘knewtab.txt’

Comments closed

Cliche-finder bookmarklet

Quinn posted a link to a nifty CGI by Aaron Swartz which detects uses of common cliches, with the list of cliches to avoid taken from the Associated Press Guide to News Writing. In addition, she also mentioned there’s the Passivator, ‘a passive verb and adverb flagger for Mozilla-derived browsers, Safari, and Opera 7.5’.

Combining the two, I’ve hacked together a bookmarklet version of the cliche finder — it can be found on this page. (Couldn’t place it inline into this post due to stupid over-aggressive Markdown, grr.)

Fun! Probably not IE-compatible, though.

10 Comments

Top 100 Irish Blogs, pt 2

The previous post was pretty popular, and one of the requests was for a regularly-updated listing. So here it is: http://taint.org/technorati/

Since Technorati limits queries to about 500 per day (IIRC), and there are quite a few more blogs than that in the Irish blogs list, I plan to update it on a nightly basis, with each set of blogs updating on different days. This should keep the figures more-or-less up to date without hammering T’rati too much.

46 Comments

Technorati-ranked Irish Blogs Top 100

So, I was thinking about the various Irish blog aggregators, Planet.journals.ie, IrishBlogs.ie, and IrishBlogs.info. Michele’s Irishblogs.info attempts to “rank” the blogs by hits, but many of the Irish webloggers don’t include that hit-counting HTML snippet in their web pages, so quite a few are probably missing; on top of that, RSS readers don’t count. It lists me as #3, which I knew was definitely wrong, anyway ;)

However, it occurred to me that an alternative way to compute a “top 100” would be to use the Technorati rank of each blog, and build a table based on that; that would measure the blogs by Technorati’s readership-estimation algorithm, which may still be faulty, of course, but it seemed worth a try. I was curious, so I gave it a go, and here are the results. Enjoy!

Update: This table is no longer up-to-date — a much fresher version is now available over here, and will be updated regularly.

Top 100 by rank / inbound blog links:

Position Rank Inbound blogs Inbound links Blog
1 2940 638 1931   http://www.tomrafteryit.net/
2 6636 371 1280   http://www.mulley.net/
3 8231 315 625   http://twentymajor.blogspot.com/
4 10984 249 512   http://www.natterjackpr.com/
5 15720 181 409   http://www.avalon5.com/
6 18897 151 315   http://irish.typepad.com/irisheyes/
7 19364 148 472   http://www.gavinsblog.com/
8 21214 136 385   http://www.blather.net/
9 21715 133 968   http://ocaoimh.ie/
10 22210 132 399   http://eirepreneur.blogs.com/eirepreneur/
11 22258 130 323   http://thetorturegarden.blogspot.com/
12 23921 122 351   http://www.dehora.net/journal/
13 24143 121 199   http://www.atlanticblog.com/
14 24828 118 174   http://freestater.blogspot.com/
15 25570 115 260   http://arseblog.com/WP
16 25570 115 246   http://tcal.net/
17 27174 109 252   http://www.digitalrights.ie/
18 27189 110 169   http://cork2toronto.blogspot.com/
19 28004 106 731   http://taint.org/
20 29008 103 286   http://unitedirelander.blogspot.com/
21 29008 103 232   http://www.nialler9.com/blog
22 29008 103 175   http://clickhere.blogs.ie/
23 29978 100 270   http://www.mneylon.com/blog
24 31954 95 901   http://www.irishelection.com/
25 33397 91 231   http://memex.naughtons.org/
26 34121 89 370   http://siciliannotes.blogspot.com/
27 35022 86 285   http://www.sineadgleeson.com/blog
28 35022 86 146   http://www.cfdan.com/
29 35858 84 904   http://www.pkellypr.com/blog
30 36223 84 255   http://www.thinkingoutloud.biz/
31 37735 80 175   http://www.dervala.net/
32 39719 76 207   http://backseatdrivers.blogspot.com/
33 40078 76 229   http://fdelondras.blogspot.com/
34 40276 75 203   http://www.mediangler.com/
35 40821 74 128   http://www.thinkinghomebusiness.com/blog
36 44148 69 122   http://outofambit.blogspot.com/
37 45075 67 147   http://www.podleaders.com/
38 45075 67 87   http://www.aidanf.net/
39 45729 66 238   http://www.argolon.com/
40 46477 65 201   http://www.sarahcarey.ie/
41 46477 65 191   http://disillusionedlefty.blogspot.com/
42 47586 64 141   http://www.johnbreslin.com/blog
43 48011 63 66   http://www.branedy.net/
44 52278 58 398   http://dossing.blogspot.com/
45 54710 56 155   http://redmum.blogspot.com/
46 55758 55 103   http://richarddelevan.blogspot.com/
47 56390 54 148   http://donal.wordpress.com/
48 56390 54 129   http://prettycunning.net/blog
49 57527 53 104   http://www.dublinblog.ie/
50 58724 52 167   http://www.tuppenceworth.ie/blog
51 58724 52 102   http://www.inter-actions.biz/blog/
52 59920 51 101   http://seanmcgrath.blogspot.com/
53 60315 51 76   http://www.blackphoebe.com/msjen/
54 62483 49 112   http://www.infactah.com/
55 62885 49 118   http://mamanpoulet.blogspot.com/
56 63869 48 229   http://icecreamireland.com/
57 68503 45 93   http://www.web2ireland.org/
58 68503 45 75   http://www.davidmcwilliams.ie/
59 68503 45 73   http://vipglamour.net/
60 68824 45 193   http://imeall.blogspot.com/
61 72248 43 81   http://planetpotato.blogs.com/planet_potato_an_irish_bl/
62 73843 42 149   http://lettertoamerica.blogs.com/
63 73843 42 119   http://www.kenmc.com/
64 73843 42 102   http://www.pmooney.net/blogsphe.nsf
65 73843 42 70   http://bohanna.typepad.com/pureplay/
66 75725 41 107   http://bonhom.ie/
67 75725 41 93   http://www.bibliocook.com/
68 75725 41 78   http://shittyfirstdraft.blogspot.com/
69 77680 40 225   http://bestofbothworlds.blogspot.com/
70 77680 40 134   http://www.stdlib.net/%7Ecolmmacc
71 77957 40 82   http://davesrants.com/
72 79732 39 103   http://ricksbreakfastblog.blogspot.com/
73 80012 39 92   http://manuel-estimulo.blogspot.com/
74 81970 38 91   http://gingerpixel.com/
75 82240 38 248   http://www.linksheaven.com/
76 84304 37 726   http://thelimerick.blogspot.com/
77 84304 37 127   http://www.ryderdiary.com/
78 84304 37 83   http://morgspace.net/
79 84304 37 64   http://talideon.com/weblog/
80 86729 36 140   http://www.damienblake.com/
81 86729 36 124   http://irisheagle.blogspot.com/
82 86729 36 102   http://blog.rymus.net/
83 86729 36 65   http://www.adammaguire.com/blog
84 87068 36 272   http://progressiveireland.blogspot.com/
85 89814 35 145   http://www.windsandbreezes.org/
86 92646 34 43   http://football-corner.blogspot.com/
87 95258 33 207   http://www.fustar.org/
88 95258 33 171   http://www.iced-coffee.com/
89 95258 33 82   http://www.bytesurgery.com/gearedup
90 101881 31 90   http://phoblacht.blogspot.com/
91 101881 31 70   http://counago-and-spaves.blogspot.com/
92 101881 31 58   http://www.firstpartners.net/blog
93 105668 30 82   http://realitycheckdotie.blogspot.com/
94 109643 29 142   http://bifsniff.com/cartoons/
95 109643 29 75   http://dave.antidisinformation.com/
96 109643 29 60   http://conoroneill.com/
97 109643 29 55   http://www.minds.may.ie/%7Edez/serendipity/
98 109643 29 51   http://dublin.metblogs.com/
99 110005 29 78   http://www.janinedalton.com/blog
100 110005 29 54   http://www.runningwithbulls.com/blog

List by inbound links:

Position Rank Inbound blogs Inbound links Blog
1 2940 638 1931   http://www.tomrafteryit.net/
2 6636 371 1280   http://www.mulley.net/
3 21715 133 968   http://ocaoimh.ie/
4 35858 84 904   http://www.pkellypr.com/blog
5 31954 95 901   http://www.irishelection.com/
6 28004 106 731   http://taint.org/
7 84304 37 726   http://thelimerick.blogspot.com/
8 8231 315 625   http://twentymajor.blogspot.com/
9 258886 13 519   http://newswire99.blogspot.com/
10 10984 249 512   http://www.natterjackpr.com/
11 19364 148 472   http://www.gavinsblog.com/
12 164780 20 451   http://inao.blogspot.com/
13 15720 181 409   http://www.avalon5.com/
14 22210 132 399   http://eirepreneur.blogs.com/eirepreneur/
15 52278 58 398   http://dossing.blogspot.com/
16 21214 136 385   http://www.blather.net/
17 34121 89 370   http://siciliannotes.blogspot.com/
18 23921 122 351   http://www.dehora.net/journal/
19 156276 21 336   http://www.ebbybrett.co.uk/blog
20 22258 130 323   http://thetorturegarden.blogspot.com/
21 18897 151 315   http://irish.typepad.com/irisheyes/
22 29008 103 286   http://unitedirelander.blogspot.com/
23 35022 86 285   http://www.sineadgleeson.com/blog
24 87068 36 272   http://progressiveireland.blogspot.com/
25 239963 14 271   http://www.thehealthtechblog.com/
26 29978 100 270   http://www.mneylon.com/blog
27 25570 115 260   http://arseblog.com/WP
28 36223 84 255   http://www.thinkingoutloud.biz/
29 27174 109 252   http://www.digitalrights.ie/
30 82240 38 248   http://www.linksheaven.com/
31 977738 3 248   http://www.tomgriffin.org/the_green_ribbon/
32 25570 115 246   http://tcal.net/
33 45729 66 238   http://www.argolon.com/
34 29008 103 232   http://www.nialler9.com/blog
35 33397 91 231   http://memex.naughtons.org/
36 40078 76 229   http://fdelondras.blogspot.com/
37 63869 48 229   http://icecreamireland.com/
38 77680 40 225   http://bestofbothworlds.blogspot.com/
39 208904 16 210   http://www.anlionra.com/
40 471327 7 208   http://www.ravenfamily.org/sam/
41 39719 76 207   http://backseatdrivers.blogspot.com/
42 95258 33 207   http://www.fustar.org/
43 40276 75 203   http://www.mediangler.com/
44 46477 65 201   http://www.sarahcarey.ie/
45 637233 5 200   http://armchaircelts.co.uk/
46 24143 121 199   http://www.atlanticblog.com/
47 280786 12 199   http://conann.com/
48 68824 45 193   http://imeall.blogspot.com/
49 46477 65 191   http://disillusionedlefty.blogspot.com/
50 637233 5 182   http://www.everysecondpaycheck.com/blog
51 164524 20 181   http://irishlinks.blogspot.com/
52 542250 6 176   http://www.dublinka.com/
53 29008 103 175   http://clickhere.blogs.ie/
54 37735 80 175   http://www.dervala.net/
55 24828 118 174   http://freestater.blogspot.com/
56 155943 21 172   http://www.jamesgalvin.com/
57 95258 33 171   http://www.iced-coffee.com/
58 164524 20 171   http://irishcraftworker.typepad.com/an_irish_craftworkers_goo/
59 27189 110 169   http://cork2toronto.blogspot.com/
60 58724 52 167   http://www.tuppenceworth.ie/blog
61 141242 23 164   http://atp.datagate.net.uk/blog
62 148304 22 159   http://www.lifewithouttoast.com/
63 184241 18 158   http://funferal.org/
64 54710 56 155   http://redmum.blogspot.com/
65 73843 42 149   http://lettertoamerica.blogs.com/
66 56390 54 148   http://donal.wordpress.com/
67 45075 67 147   http://www.podleaders.com/
68 155943 21 147   http://dublinopinion.com/
69 35022 86 146   http://www.cfdan.com/
70 89814 35 145   http://www.windsandbreezes.org/
71 109643 29 142   http://bifsniff.com/cartoons/
72 195745 17 142   http://podcasting.ie/podcast
73 47586 64 141   http://www.johnbreslin.com/blog
74 86729 36 140   http://www.damienblake.com/
75 223280 15 137   http://thegurrier.com/
76 77680 40 134   http://www.stdlib.net/%7Ecolmmacc
77 980795 3 131   http://www.sineadcochrane.com/
78 56390 54 129   http://prettycunning.net/blog
79 40821 74 128   http://www.thinkinghomebusiness.com/blog
80 84304 37 127   http://www.ryderdiary.com/
81 86729 36 124   http://irisheagle.blogspot.com/
82 44148 69 122   http://outofambit.blogspot.com/
83 73843 42 119   http://www.kenmc.com/
84 62885 49 118   http://mamanpoulet.blogspot.com/
85 135121 24 117   http://nellysgarden.blogspot.com/
86 195745 17 115   http://blog.infurious.com/
87 542250 6 114   http://ainelivia.typepad.com/aine_livia_at_the_midnigh/
88 62483 49 112   http://www.infactah.com/
89 75725 41 107   http://bonhom.ie/
90 57527 53 104   http://www.dublinblog.ie/
91 55758 55 103   http://richarddelevan.blogspot.com/
92 79732 39 103   http://ricksbreakfastblog.blogspot.com/
93 58724 52 102   http://www.inter-actions.biz/blog/
94 73843 42 102   http://www.pmooney.net/blogsphe.nsf
95 86729 36 102   http://blog.rymus.net/
96 59920 51 101   http://seanmcgrath.blogspot.com/
97 173857 19 99   http://www.ofoghlu.net/log/
98 118678 27 96   http://irishkc.com/
99 68503 45 93   http://www.web2ireland.org/
100 75725 41 93   http://www.bibliocook.com/

Update: Here’s a full list of all 569 tested blogs. Also, there’s been a minor change to the rankings here; I’ve just realised that there was a bug in how the script handled evenly-matched blogs, so (for example) #15 and #16 were reversed in order; that’s now fixed.

If you find a blog missing, it’s probably because (a) it’s not pinging Planet.journals.ie or (b) it’s not registered with Technorati; this method requires both. Most Irish blogs satisfy both, but some (Old Rotten Hat, for example) don’t…

Methodology

I found this more-or-less full list of Irish weblogs at Planet.journals.ie, and selected the blogs that had pinged their site in the past 6 months, then cut that down to just the blog main-page URLs, removing duplicates.
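
That trimming step (reducing each pinged URL to its blog’s main-page URL, then removing duplicates) can be sketched roughly as follows; this is a hypothetical Python reconstruction, with heuristics of my own, not the actual code from the tarball:

```python
from urllib.parse import urlparse

def main_page_url(url):
    """Reduce a pinged URL (often a post permalink) to the blog's
    main-page URL: scheme + host, keeping a single leading path
    segment if it looks like a blog root (e.g. /blog)."""
    p = urlparse(url.strip())
    parts = [s for s in p.path.rstrip('/').split('/') if s]
    keep = '/' + parts[0] if len(parts) == 1 else ''
    return f"{p.scheme}://{p.netloc.lower()}{keep}/"

def dedupe(urls):
    """Normalise and de-duplicate, preserving first-seen order."""
    seen, out = set(), []
    for u in urls:
        n = main_page_url(u)
        if n not in seen:
            seen.add(n)
            out.append(n)
    return out
```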

Given that list, I then looked up each blog URL using the Technorati API, and got its rank, inbound link count, and inbound linking blogs count.
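
The per-blog lookup came back as XML from Technorati’s bloginfo call (long since retired). Here’s a Python sketch of the parsing side, run against a canned response; the element names are from memory and should be treated as assumptions:

```python
import xml.etree.ElementTree as ET

# A canned response in roughly the shape the Technorati bloginfo call
# used to return (element names are assumptions; the API is long gone).
SAMPLE = """<tapi version="1.0"><document><result><weblog>
  <name>taint.org</name>
  <url>http://taint.org/</url>
  <rank>28004</rank>
  <inboundblogs>106</inboundblogs>
  <inboundlinks>731</inboundlinks>
</weblog></result></document></tapi>"""

def parse_bloginfo(xml_text):
    """Pull (rank, inbound blogs, inbound links) out of a bloginfo response."""
    weblog = ET.fromstring(xml_text).find('.//weblog')
    grab = lambda tag: int(weblog.findtext(tag))
    return grab('rank'), grab('inboundblogs'), grab('inboundlinks')
```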

top100code.tgz is a tarball of the perl code I wrote to do this, if you fancy doing it yourself on whichever set of blogs you fancy…

18 Comments

The vagaries of Google Image Search

Remember the C=64-izer, the quick hack to display an image in the style of the Commodore 64?

Recently, I’ve started getting hits to this demo image of the “O RLY?” owl — lots of ’em.

It turns out that the C=64-ized rendition of this image is now the top hit for “O RLY” on Google Image Search; pretty bizarre, since there are obviously better images on the first search page (one result along, in fact). What’s more, the page listed as the ‘origin page’, http://taint.org/tag/today, doesn’t even use that text.

This has resulted in lots of Myspace kiddies etc. obliviously using the C=64 rendering. Yay for Commodore ;)

Comments closed

SpicyLinks and del.icio.us Network Summarization

Ross Mayfield:

Every time I see Gabe Rivera of TechMeme, I ask for the same thing — MeMeme. Give me TechMeme where the core index is based on who I read, about 150 people at any given time, to show me what my friends are interested in.

Funnily enough, that is exactly why I wrote SpicyLinks!

It works pretty well — in fact, nowadays I don’t really bother reading slashdot, Digg, Reddit, et al, particularly frequently, because I know that all the really interesting stuff will be at the top of my newsreader in the SpicyLinks feed.

Anyway, I’ve been calling SpicyLinks a ‘summarizing aggregator’, but the discussion that arose from Ross’ posting inspired me. A little bit of hacking has come up with an interesting twist: take a del.icio.us social network, a CGI script called deliciousnetwork2opml.cgi, and 15 minutes hacking on SpicyLinks to support inclusion of OPML via a remote URI, and hey presto — it’s now a social-network summarising aggregator. ;)
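
For the curious, the core of a deliciousnetwork2opml-style converter is just OPML templating; here’s a minimal Python sketch (the per-user del.icio.us RSS feed URL pattern is assumed from memory, and the service itself is long gone):

```python
from xml.sax.saxutils import quoteattr

def network_to_opml(usernames):
    """Emit a minimal OPML document with one outline per network member,
    pointing at that member's del.icio.us RSS feed."""
    outlines = "\n".join(
        '    <outline type="rss" text=%s xmlUrl=%s/>' % (
            quoteattr(u),
            quoteattr("http://del.icio.us/rss/" + u),
        )
        for u in usernames
    )
    return (
        '<?xml version="1.0"?>\n<opml version="1.1">\n'
        '  <head><title>delicious network</title></head>\n'
        '  <body>\n%s\n  </body>\n</opml>' % outlines
    )
```

Point SpicyLinks (or any aggregator that imports OPML) at the output, and the network’s feeds get slurped in as one subscription list.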

6 Comments

“Stretch-to-fit Textareas” Greasemonkey User Script

Here’s another quick-hack Greasemonkey user script I wrote recently.

Stretch-to-fit Textareas is a user script which improves the usability of editable textareas; it causes them to “stretch” vertically to fit their contents, as you type. This behaviour was inspired by that of textareas in FogBugz.

It can be inhibited by turning off the small checkbox to the right of each textarea.

Update: it’s worth noting that this is different from the Resizeable Textareas Firefox extension. Whereas the latter allows the user to resize the textareas by hand, this user script does that action automatically, based on the contents of the field; no manual resize-handle-searching and dragging is required. On the other hand, this user script will only stretch textareas vertically, whereas the extension allows them to be dragged in both dimensions. In fact, the two are complementary — I’m running both, and I suggest you do too ;)

Update 2: here’s a Firefox extension version — Greasemonkey not required!

1 Comment

Retroactive Tagging With TagThe.Net

Hacky hack hack.

Ever since I enabled tags on taint.org, I’ve been mildly annoyed by the fact that there were thousands of older entries deprived of their folksonomic chunky goodness. A way to ‘retroactively tag’ those entries somehow would be cool.

Last week, Leonard posted a link on his linkblog to TagThe.net, a web service which offers a nifty REST API; simply upload a chunk of text, and it’ll suggest a few tags for that text, like this (here, ‘urlencode’ is a little helper that URL-encodes its standard input):

echo 'Hi there, I am a tag-suggesting robot' | curl "http://tagthe.net/api/?text=`urlencode`"
<?xml version="1.0" encoding="UTF-8"?>
<memes>
  <meme source="urn:memanage:BAD542FA4948D12800AA92A7FAD420A1" updated="Tue May 30 20:20:39 CEST 2006">
    <dim type="topic">
      <item>robot</item>
    </dim>
    <dim type="language">
      <item>english</item>
    </dim>
  </meme>
</memes>
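
Extracting the suggested tags from that response is a few lines of XML-walking; a Python sketch, run against the sample output above:

```python
import xml.etree.ElementTree as ET

RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<memes>
  <meme source="urn:memanage:BAD542FA4948D12800AA92A7FAD420A1"
        updated="Tue May 30 20:20:39 CEST 2006">
    <dim type="topic"><item>robot</item></dim>
    <dim type="language"><item>english</item></dim>
  </meme>
</memes>"""

def suggested_tags(xml_text):
    """Return the <item> values from every dim of type 'topic'."""
    root = ET.fromstring(xml_text)
    return [item.text
            for dim in root.iter('dim') if dim.get('type') == 'topic'
            for item in dim.iter('item')]
```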

This looked promising.

Anyway, I’ve now implemented this — it worked great! If you’re curious, here’s details of how I did it. It’s a bit hacky, since I’m only going to be doing this once — and very UNIXy and perlish, because that’s how I do these things — but maybe somebody will find it useful.

How I Retroactively Tagged taint.org

This weblog runs WordPress, so all the entries are stored in a MySQL database. I took the MySQL dump of the tables, and a quick script figured out that, of the 1600-odd posts, there were 1352 that came from the pre-tag era, requiring tag inference. A mail to the TagThe.Net team established that they were happy with this level of usage.

I grepped the post IDs and text out of the SQL dump, threw those into a text file using the simple format ‘id=NNN text=SQLHTMLSTRING’ (where SQLHTMLSTRING was the nicely-escaped HTML text taken directly from the SQL dump), and ran them through this script.

That rendered the first 2k of each of those entries as a URL-encoded string, invoked the REST API with that, got the XML output, and extracted the tags into another UNIXy text-format output file. (It also added one tag for the ‘proto-tag’ system I used in the early days, where the first word of the entry was a single tag-style category name.)
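
That per-entry preparation (clip to the first 2k, URL-encode it for the REST call, and recover the old proto-tag first word) is easy enough to sketch; this is a hypothetical Python reconstruction, not the actual script:

```python
from urllib.parse import quote_plus

def prepare_entry(text, limit=2048):
    """Return (proto_tag, encoded) for one pre-tag-era entry: the first
    word as the old proto-tag category name, and the first `limit`
    characters of the text URL-encoded for the tagthe.net REST call."""
    words = text.split()
    proto_tag = words[0].strip(':').lower() if words else None
    return proto_tag, quote_plus(text[:limit])
```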

Next, I ran this script, which in turn took that intermediate output and converted it to valid PHP code, like so:

cat suggestedtags | ./taglist-to-php.pl  > addtags.php
scp addtags.php my.server:taint.org/wp-admin/

The generated page ‘addtags.php’ looks like this:

<?php
  require_once('admin.php');
  global $utw;
  $utw->SaveTags(997, array("music","all","audio","drm-free",
      "faq","lunchbox","destination","download","premiere","quote"));
  [...]
  $utw->SaveTags(998, array("software","foo","swf","tin","vnc"));
  $utw->SaveTags(999, array("oses","eek","longhorn","ram",
    "winsupersite","windows","amount","base","dog","preview","system"));
?>
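
The taglist-to-php.pl step is mostly string templating. Assuming an intermediate format along the lines of ‘id=NNN tags=a,b,c’ (the real format wasn’t published), a Python equivalent might look like:

```python
def taglist_to_php(lines):
    """Turn 'id=NNN tags=a,b,c' lines into an addtags.php-style page
    that calls UltimateTagWarrior's SaveTags() for each post."""
    calls = []
    for line in lines:
        ident, tags = line.strip().split(None, 1)
        post_id = int(ident.split('=', 1)[1])
        tag_list = tags.split('=', 1)[1].split(',')
        quoted = ','.join('"%s"' % t for t in tag_list)
        calls.append('  $utw->SaveTags(%d, array(%s));' % (post_id, quoted))
    return ("<?php\n  require_once('admin.php');\n  global $utw;\n"
            + "\n".join(calls) + "\n?>")
```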

Once that page was in place, I just visited it in my (already logged in) web browser window, at http://taint.org/wp-admin/addtags.php, and watched as it gronked for a while. Eventually it stopped, and all those entries had been tagged. (If I wasn’t so hackish, I might have put in a little UI text here — but I didn’t.)

The results are very good, I think.

A success: http://taint.org/tag/research has picked up a lot of the interesting older entries where I discussed things like IBM’s Tieresias pattern-recognition algorithm. That’s spot on.

A minor downside: it’s not so good at nouns. This entry talks about Silicon Valley and geographical insularity, and mentions “Silicon Valley” prominently — one or both of those words would seem to be a good thing to tag with, but it missed them.

Still, that’s a minor issue — the tags it has suggested are generally very appropriate and useful.

Next, I need to find a way to auto-generate titles for the really old entries ;)

1 Comment

Another script: goog-love.pl

A quick hack —

goog-love.pl – find out where your site’s google juice comes from

This script will grind through your web site’s “access.log” file (which must be in the “combined” log format). It’ll pick out the top 100 Google searches found in the referer field, re-run those searches, and determine which ones are giving your website all the linky Google love — in other words, the searches that your site ‘wins’ on.
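
The interesting bit is teasing the search query out of each combined-format referer field; here’s a Python sketch of just that part (the real script goes further and re-runs each search through the Google SOAP API):

```python
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs, unquote

# In the combined log format, the referer is the second-to-last
# quoted field on each line (followed by the user-agent).
REFERER_RE = re.compile(r'"([^"]*)" "[^"]*"$')

def google_queries(log_lines):
    """Count the Google search queries found in referer fields."""
    counts = Counter()
    for line in log_lines:
        m = REFERER_RE.search(line.rstrip())
        if not m:
            continue
        ref = urlparse(m.group(1))
        if 'google.' in ref.netloc and ref.path == '/search':
            q = parse_qs(ref.query).get('q')
            if q:
                counts[unquote(q[0])] += 1
    return counts.most_common(100)
```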

The output is in plain text and a chunk of HTML.

usage:

goog-love.pl sitehost google-api-key < access.log > out.html

e.g.

cat /var/www/logs/taint.org.* | goog-love.pl \
  taint.org 0xb0bd0bb5yourgoogleapikeyhere0xdeadbeef | tee out.html

NOTE: this script requires the SOAP::Lite module be installed. Install it using apt-get install libsoap-lite-perl or cpan SOAP::Lite. It also requires a Google API key.

For example, here are the current results for this site. You can immediately see some interesting stuff that’s not immediately obvious otherwise, such as my site being the top hit for [beardy justin] ;)

Download here (5 KiB perl script).

Notes:

  • if you see a lot of “502 Bad Gateway” errors, it’s probably over-zealous anti-bot ACLs on Google’s side. Try from another host.

  • Read the comments for notes on a bug in recent releases of SOAP::Lite; please let me know if you hear of them getting fixed ;)

5 Comments

Urban Dead HUD; added Inventory Sorting

I’ve updated the Urban Dead HUD Greasemonkey userscript; it now offers inventory sorting, inspired by Ikko’s userscript (albeit a little different in implementation). Here’s a screenshot:

Right now, UD is reasonably interesting — our team of plucky survivors have been helping out with the defence of Caiger Mall, a major mall towards the north-west of the city. We’ve repulsed the Church of the Resurrection‘s attempts to wipe us out, but that seems to have made us quite a juicy target; there are now no less than three separate Zombie groups ganging up on us. For now, we’re still holding out.

4 Comments

Life Hacks: getting back to the command-line

Tech: So Danny O’Brien’s ‘Life Hacks’ talk is one of the most worthwhile reflections on productivity (and productivity technology) I’ve heard. (Cory Doctorow’s transcript from NotCon 2004, video from ETCon.)

There’s a couple of things I wanted to write about it, so I’ll do them in separate blog entries.

First off: I’d love to see Ward Cunningham’s ‘cluster files by time’ hack; it sounds very useful. But that’s not what I wanted to write about ;)

People don’t extract stuff from big complex apps using OLE and so on; it’s brittle, and undocumented. Instead they write little command-line scriptlets. Sometimes they do little bits of ‘open this URL in a new window’ OLE-type stuff to use in a pipeline, but that’s about it. And fundamentally, they pipe.

This ties into the post that reminded me to write about it — Diego Doval’s atomflow, which is essentially a small set of command-line apps for Atom storage. Diego notes:

Now, here’s what’s interesting. I have of course been using pipes for years. And yet the power and simplicity of this approach had simply not occurred to me at all. I have been so focused on end-user products for so long that my thoughts naturally move to complex uber-systems that do everything in an integrated way. But that is overkill in this case.

Exactly! He’s not the only one to get that recently — MS and Google are two very high-profile organisations that have picked up the insight; it’s the Egypt way.

There’s fundamentally a breakage point where shrink-wrapped GUI apps cannot do everything you want done, and you have to start developing code yourself — and the best API for that, after 30 years, has been the command-line-and-pipe metaphor.

(Also, complex uber-apps are what people think is needed — however, that’s just a UI scheme that’s prevailing at the moment. Bear in mind that anyone using the web today uses a command line every day. A command line will not necessarily confuse users.)

Tying back into the Life Hacks stuff — one thing that hasn’t yet been done properly as a command-line-and-pipe tool, though, is web-scraping. Right now, if you scrape, you’ve got to (a) do lots of munging in a single big fat script of your own devising, if you’re lucky using something like WWW::Mechanize (which is excellent!); (b) use a scraping app like sitescooper; or (c) get hacky with a shell script that runs wget and greps bits of the output in a really brittle way.

I’ve been mulling over a ‘next-generation sitescooper’ on and off over the past year, and I think the best way to do it is to split its functionality up into individual scripts/perl modules:

  • one to download files, maintaining a cache, taking likely freshness into account, and dealing with crappy HTTP/HTTPS weirdness like cookies, logins and redirects;
  • one to diff HTML;
  • one to lobotomise (i.e. simplify) HTML;
  • one to scrape out the ‘good bits’ using sitescooper-style regions.

Tie those into HTML Tidy and XMLStarlet, and you have an excellent command-line scraping framework.
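
The ‘lobotomise HTML’ component, at least, is easy to prototype. Here’s a crude Python sketch that drops scripts, styles, comments and attributes while keeping the bare structural tags (regex-based, so it’ll choke on pathological markup; a real version would sit behind HTML Tidy):

```python
import re

def lobotomise(html):
    """Crudely simplify HTML: drop <script>/<style> blocks, comments,
    and all tag attributes, leaving bare structural tags and text."""
    html = re.sub(r'(?is)<(script|style)\b.*?</\1>', '', html)
    html = re.sub(r'(?s)<!--.*?-->', '', html)
    # strip attributes: <a href="..."> becomes <a>
    html = re.sub(r'<(/?\w+)[^>]*>', r'<\1>', html)
    return re.sub(r'\n\s*\n+', '\n', html).strip()
```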

Still haven’t got any time to do all that though. :(

Comments closed

Going to LayerOne

Conferences: I’m going to LayerOne; it looks interesting, and I’ve been hoping for quite a while to bump into Danny O’Brien (who’s there doing his Life Hacks talk) for a couple of drinks and a blather. Other speakers look similarly interesting, in an ‘offbeat hacker conference’ way, so I think it’ll be fun.

It conflicts with The Streets playing the Wiltern, but c’est la vie ;)

Comments closed

Life Hacks

Work: Life Hacks: Tech Secrets of Overprolific Alpha Geeks, Danny O’Brien’s ETech talk.

Amazingly, despite not being an alpha geek ;), I already use all these things:

  • a todo.txt file (anything else is inconvenient).
  • everything incoming comes through email, including RSS (thanks to rss2email). Again, anything else is inconvenient; I couldn’t be bothered with another desktop app.
  • I hack scripts for every repetitive task I run into
  • I sync instead of backup; everything has a CVS repository running on a remote server, even my home dir
  • I have a nasty tendency to web-scrape data

These tips definitely are good advice, although I have a feeling they’re optimised for a weblogging UNIX geek who spends hours hacking perl/python scripts. ;)

I’m looking forward to LifeHacks.com when it does eventually go live… should be interesting.

Comments closed