Retroactive Tagging With TagThe.Net

Hacky hack hack.

Ever since I enabled tags on, I’ve been mildly annoyed by the fact that there were thousands of older entries deprived of their folksonomic chunky goodness. A way to ‘retroactively tag’ those entries somehow would be cool.

Last week, Leonard posted a link on his linkblog to, a web service which offers a nifty REST API; simply upload a chunk of text, and it’ll suggest a few tags for that text, like this:

echo 'Hi there, I am a tag-suggesting robot' | curl "`urlencode`"
<?xml version="1.0" encoding="UTF-8"?>
  <meme source="urn:memanage:BAD542FA4948D12800AA92A7FAD420A1" updated="Tue May 30 20:20:39 CEST 2006">
    <dim type="topic">
    <dim type="language">

This looked promising.

Anyway, I’ve now implemented this — it worked great! If you’re curious, here’s details of how I did it. It’s a bit hacky, since I’m only going to be doing this once — and very UNIXy and perlish, because that’s how I do these things — but maybe somebody will find it useful.

How I Retroactively Tagged

This weblog runs WordPress — so all the entries are stored in a MySQL database. I took the MySQL dump of the tables, and a quick script figured out that out of somewhere over 1600-ish posts, there were 1352 that came from the pre-tag era, requiring tag inference. A mail to the TagThe.Net team established that they were happy with this level of usage.

I grepped the post IDs and text out of the SQL dump, threw those into a text file using the simple format ‘id=NNN text=SQLHTMLSTRING’ (where SQLHTMLSTRING was the nicely-escaped HTML text taken directly from the SQL dump), and ran them through this script.

That rendered the first 2k of each of those entries as a URL-encoded string, invoked the REST API with that, got the XML output, and extracted the tags into another UNIXy text-format output file. (It also added one tag for the ‘proto-tag’ system I used in the early days, where the first word of the entry was a single tag-style category name.)

Next, I ran this script, which in turn took that intermediate output and converted it to valid PHP code, like so:

cat suggestedtags | ./  > addtags.php
scp addtags.php

The generated page ‘addtags.php’ looks like this:

  global $utw;
  $utw->SaveTags(997, array("music","all","audio","drm-free",
  $utw->SaveTags(998, array("software","foo","swf","tin","vnc"));
  $utw->SaveTags(999, array("oses","eek","longhorn","ram",

Once that page was in place, I just visited it in my (already logged in) web browser window, at, and watched as it gronked for a while. Eventually it stopped, and all those entries had been tagged. (If I wasn’t so hackish, I might have put in a little UI text here — but I didn’t.)

The results are very good, I think.

A success: has picked up a lot of the interesting older entries where I discussed things like IBM’s Tieresias pattern-recognition algorithm. That’s spot on.

A minor downside: it’s not so good at nouns. This entry talks about Silicon Valley and geographical insularity, and mentions “Silicon Valley” prominently — one or both of those words would seem to be a good thing to tag with, but it missed them.

Still, that’s a minor issue — the tags it has suggested are generally very appropriate and useful.

Next, I need to find a way to auto-generate titles for the really old entries ;)

This entry was posted in Uncategorized and tagged , , , , , , . Bookmark the permalink. Both comments and trackbacks are currently closed.