
Tag: hacks

User script: add my delicious search results to Google

For years now, I’ve been collecting bookmarks at delicious.com/jm — nearly 7000 of them by now. I’ve been scrupulous about tagging and describing each one, so they’re eminently searchable, too. I’ve frequently found this to be a very useful personal reference resource.

I was quite pleased, accordingly, to come across the Delicious Search Results on Google Greasemonkey userscript. It intercepts Google searches, adding Delicious tag-search results at the top of the search page, and works pretty well. Unfortunately, though, it searches all of Delicious, not specifically my own bookmarks.

So here’s a quick hack fix to do just that:

my_delicious_search_results.user.js – My Delicious Search Results on Google

Shows tag-search results from my Delicious account on Google search pages, with links to more extensive Delicious searches. Use ‘User Script Commands’ -> ‘Set Delicious Username’ to specify your username.


Enjoy!

2 Comments

Google Reader productivity hack: change your Home

So, if you use Google Reader, read your news with the “All items” page, and are subscribed to hundreds of feeds, the volume can be pretty overwhelming. I’ve found a better way to deal with this.

Select a ‘most important’ subset of feeds. For each of those, click through to the feed details page, hit the “Feed Settings…” menu, and select “Change folders…”. Put the feed into a new “top” folder (creating it if necessary).

Now go to “Settings” -> “Preferences” and check out the “Start page” preference. By default, it’s set to “Home”; change it to “Folders and Tags: top”.

Hey presto — now, when you load Google Reader, it’ll come up with your “top” items. You can get through those quickly enough, and get on to other more important tasks. When you’re bored and need something to read, though, just hit “Navigation” -> “All items” (or even just type ‘ga’), and every other feed is now there for your delectation. Sweet!

2 Comments

Hack: reassassinate

A coworker today, returning from a couple of weeks of holiday, bemoaned the quantities of spam he had to wade through. I mentioned a hack I often use in this situation: discard the spam, download the two weeks of supposed non-spam as one huge mbox, and rescan it all with SpamAssassin. Since the intervening two weeks gave plenty of time for the URLs to be blacklisted by URIBLs and the sending IPs to be listed by DNSBLs, this generally yields better spamfilter accuracy, at least in terms of reducing false negatives (the “missed spam”). In other words, it gets rid of most of the remaining spam nicely.

Chatting about this, it occurred to us that it’d be easy enough to generalize this hack into something more widely useful by hooking up the Mail::IMAPClient CPAN module with Mail::SpamAssassin, and in fact, it’d be pretty likely that someone else would already have done so.

Sure enough, a search threw up this node on perlmonks.org, containing a script which did pretty much all that. Here’s a minor freshening: download

reassassinate – run SpamAssassin on an IMAP mailbox, then reupload

Usage: ./reassassinate --user jmason --host mail.example.com --inbox INBOX --junkfolder INBOX.crap

Runs SpamAssassin over all mail messages in an IMAP mailbox, skipping ones it’s processed before. It then reuploads the rewritten messages to one of two locations depending on whether they are spam or not: nonspam messages are simply re-saved to the original mailbox, while spam messages are sent to the mailbox specified with “--junkfolder”.

This is especially handy if some time has passed since the mails were originally delivered, allowing more of the spam mails’ contents to be blacklisted by third-party DNSBLs and URIBLs in the meantime.

Prerequisites:

  • Mail::IMAPClient
  • Mail::SpamAssassin
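
The same workflow can be sketched in Python using the stdlib’s imaplib, piping each message through the `spamassassin` command-line client. This is a rough illustration of the steps described above, not the Perl script itself; the skip-processed check and folder handling are simplified.

```python
import imaplib
import subprocess

def headers_flag_spam(rewritten):
    """True if SpamAssassin's rewrite added X-Spam-Flag: YES to the headers."""
    headers = rewritten.replace(b"\r\n", b"\n").split(b"\n\n", 1)[0]
    return b"X-Spam-Flag: YES" in headers

def rescan_message(raw_bytes):
    """Pipe one message through the spamassassin CLI; return (is_spam, rewritten)."""
    proc = subprocess.run(["spamassassin"], input=raw_bytes,
                          stdout=subprocess.PIPE, check=True)
    return headers_flag_spam(proc.stdout), proc.stdout

def reassassinate(host, user, password, inbox="INBOX", junkfolder="INBOX.crap"):
    imap = imaplib.IMAP4_SSL(host)
    imap.login(user, password)
    imap.select(inbox)
    _, data = imap.search(None, "ALL")
    for num in data[0].split():
        _, fetched = imap.fetch(num, "(RFC822)")
        raw = fetched[0][1]
        if b"X-Spam-Status:" in raw:
            continue                     # crude "already processed" check
        spam, rewritten = rescan_message(raw)
        imap.append(junkfolder if spam else inbox, None, None, rewritten)
        imap.store(num, "+FLAGS", r"\Deleted")   # remove the original copy
    imap.expunge()
    imap.logout()
```

Invoked as, say, `reassassinate("mail.example.com", "jmason", password)`, mirroring the usage line above.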
3 Comments

Links for 2008-10-10

Comments closed

Links for 2008-10-07

2 Comments

Hack: twitter_no_popups.user.js

Twitter has this nasty habit — if you come across a tweet in your feed reader containing a URL, and you want to follow that link, you can’t, because Twitter doesn’t auto-link URLs in its RSS feeds. Instead, you have to click on the feed item, itself, wait for that to open in the browser, then click on the link in the new browser tab. That link will, in turn, open in another new tab.

Here’s a quick-hack Greasemonkey user script to inhibit this second new-tab:

twitter_no_popups.user.js

3 Comments

Full-text RSS bookmarklet

This site offers a nifty utility for dealing with those annoying sites which offer only partial text content in their RSS and Atom feeds.

Given an RSS or Atom feed’s URL, the CGI will iterate through the posts in the feed, scrape the full text of each post from its HTML page, and re-generate a new RSS feed containing the full text.

The one thing it’s missing is a one-click bookmarklet version. So here it is:

Full-text RSS Bookmarklet

Drag that to your bookmarks menu, and next time you’re looking at a partial-text feed, click the bookmark to transform the viewed page into the full-text version. Enjoy!

7 Comments

converting TAP output to JUnit-style XML

Here’s a perl script that may prove useful: tap-to-junit-xml

NAME

tap-to-junit-xml – convert perl-style TAP test output to JUnit-style XML

SYNOPSIS

tap-to-junit-xml "test suite name" [ outputprefix ] < tap_output.log

DESCRIPTION

Parse test suite output in TAP (Test Anything Protocol) format, and produce XML output in a similar format to that produced by the <junit> ant task. This is useful for consumption by continuous-integration systems like Hudson.

Written in perl, requires TAP::Parser and XML::Generator. It's based on junit_xml.pl by Matisse Enzer, although pretty much entirely rewritten.
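
The core of the conversion is simple enough to sketch in Python. This minimal version handles only bare “ok”/“not ok” lines; the real script, via TAP::Parser, also copes with plans, directives and diagnostics:

```python
import re
import xml.etree.ElementTree as ET

def tap_to_junit(tap_text, suite_name="suite"):
    """Convert basic TAP lines into a JUnit-style <testsuite> XML string."""
    suite = ET.Element("testsuite", name=suite_name)
    tests = failures = 0
    for line in tap_text.splitlines():
        m = re.match(r"(not )?ok\b\s*\d*\s*-?\s*(.*)", line)
        if not m:
            continue  # skip plans ("1..N"), comments, diagnostics
        tests += 1
        case = ET.SubElement(suite, "testcase", name=m.group(2) or "unnamed")
        if m.group(1):  # "not ok"
            failures += 1
            ET.SubElement(case, "failure", message="test failed")
    suite.set("tests", str(tests))
    suite.set("failures", str(failures))
    return ET.tostring(suite, encoding="unicode")
```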

11 Comments

Announcing IrishPulse

As I previously threatened, I’ve gone ahead and created a “Microplanet” for Irish twitterers, similar to Portland’s Pulse of PDX — an aggregator of the “stream of consciousness” that comes out of our local Twitter community: IrishPulse.

Here’s what you can do:

  • Add yourself: if you’re an Irish Twitter user, follow the user ‘irishpulse’. This will add you to the sources list.
  • Publicise it: feel free to pass the URL on to other Irish Twitter users, and blog about it.
  • Read it: bookmark it and take a look now and again!

In terms of implementation, it’s just a (slightly patched) copy of Venus and a perl script using Net::Twitter to generate an OPML file of the Twitter followers. Here’s the source. I’d love to see more “Pulse” sites using this…
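
The OPML-generation half is easy to sketch in Python; assume the follower screen names have already been fetched via the Twitter API (the original uses Net::Twitter), and note that the per-user RSS feed URL below is the old-style one and an assumption of this sketch:

```python
import xml.etree.ElementTree as ET

def followers_to_opml(followers, title="IrishPulse sources"):
    """Build an OPML reading list, one outline per follower, for Planet Venus."""
    opml = ET.Element("opml", version="1.1")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for name in followers:
        # old-style per-user Twitter RSS URL; an assumption of this sketch
        ET.SubElement(body, "outline", type="rss", text=name,
                      xmlUrl="http://twitter.com/statuses/user_timeline/%s.rss" % name)
    return ET.tostring(opml, encoding="unicode")
```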

4 Comments

Remote sound playback through a Nokia 770

For a while now, I’ve been using various hacks to play music from my Linux laptop, holding my main music collection, to client systems which drive the speakers.

Previously, I used this setup to play via my MythTV box. Nowadays, however, my TV isn’t in the room where I want to listen to music. Instead, I have my Nokia 770 hooked up to the speakers; this plays the BBC Radio 4 RealAudio streams nicely, and also the laptop’s MP3 collection using a UPnP AV MediaServer.

I specifically use TwonkyMedia right now, playing back via the N770’s Media Streamer app. (That works pretty well; UPnP AV is one of those standards plagued with incompatibilities, but TwonkyMedia and Media Streamer seem to be a reliable combination.)

However, TwonkyMedia sometimes fails to notice updates of the library, and nothing has quite as good a music-player user interface as JuK, the KDE music player and organiser app, so a way to play directly from the laptop instead of via UPnP would be nice…
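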

A weekend’s hacking reveals that this is pretty easily done nowadays, thanks to some cool features in PulseAudio, the current standard sound server on Ubuntu Gutsy, and the Esound server running on the N770.

Unfortunately, the N770 doesn’t (yet) support pulseaudio directly, otherwise we could use its seriously cool support for RTP multicast streams. Still, we can hack something up using the venerable “esd” protocol (again!) Here’s how to set it up…

On the N770:

You need to fix the N770’s “esd” sound server to allow public connections. Set up your wifi network’s DHCP server to give the N770 a static IP address. Log in over SSH, or fire up an xterm. Run the following:

mv /usr/bin/esd /usr/bin/esd.real

cat > /usr/bin/esd <<'EOM'
#!/bin/sh
exec /usr/bin/esd.real -tcp -public -promiscuous -port 5678 "$@"
EOM

chmod 755 /usr/bin/esd
/etc/init.d/esd restart

On the server:

Download this file, and save it as n770.pa. Edit it, and change server=n770:5678 on the fourth line to use the IP address or hostname of your Nokia 770 instead of n770. Then run:

cp n770.pa ~/.n770.pa

cat > ~/bin/sound_n770 <<EOM
#!/bin/sh
pulseaudio -k; pulseaudio -nF $HOME/.n770.pa &
EOM

cat > ~/bin/sound_here <<EOM
#!/bin/sh
pulseaudio -k; pulseaudio &
EOM

chmod 755 ~/bin/sound_here ~/bin/sound_n770

Now you just need to run ‘~/bin/sound_n770’ to redirect sound playback to the N770, and ‘~/bin/sound_here’ to reset back to laptop speaker output, for the entire desktop environment. Nifty!

Update: it appears that things may work more reliably if you add “rate=22050” at the end of the “load-module module-esound-sink” line — this halves the bitrate of the network stream, which copes better with harsh wifi network conditions. The n770.pa file above now includes this.

5 Comments

Host monitoring with Jaiku

A few weeks back, we were having trouble with dogma, our shared server where taint.org is hosted, which would occasionally be unavailable for unknown reasons. We needed to monitor its availability so that it could be fixed when it crashed again, and we’d be able to investigate quickly. Since it was happening mostly out of working hours, SMS notification was essential.

Normally, that kind of monitoring is pretty basic stuff, and there are plenty of options out there that can do it, from Host-Tracker.com to more complex self-hosted apps like monit and Nagios. But looking around, I found that none of them offered SMS notification for free, and since this was our personal-use server, I wasn’t willing to sign up for a $10-per-month paid account to support it, or buy any hardware to act as a private SMS gateway.

Instead, I thought of Jaiku — the Finnish company which offers a microblogging/presence platform similar to Twitter. Jaiku had a couple of cool features:

  • SMS notifications
  • it’s possible to broadcast messages to a “channel”, which others could subscribe to, IRC-style
  • it has an open API

This would let me notify any interested party of dogma’s downtime, with subscribers able to join and leave using whatever notification systems Jaiku supports.

With a little perl and LWP, I rigged up a quick monitoring script to check http://taint.org/ via HTTP, and report if it was unavailable over the course of 5 retries in 50 seconds. If it was broken, the script sends a JSON-formatted POST request to Jaiku’s “presence.send” method, informing the target channel of the issue. (Perl source here.)
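
In outline, the script looks something like this Python sketch (the original is Perl + LWP; the Jaiku endpoint and payload fields here are illustrative, not the documented API):

```python
import json
import time
import urllib.request

def site_is_up(url, tries=5, delay=10):
    """True if any of `tries` HTTP GETs succeeds (5 tries in ~50 seconds)."""
    for attempt in range(tries):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass                      # connection refused, timeout, DNS failure...
        if attempt < tries - 1:
            time.sleep(delay)
    return False

def notify_jaiku(endpoint, user, personal_key, message):
    """POST a JSON-formatted presence.send request (illustrative payload)."""
    payload = json.dumps({"method": "presence.send", "user": user,
                          "personal_key": personal_key,
                          "message": message}).encode("utf-8")
    req = urllib.request.Request(endpoint, data=payload,
                                 headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)
```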

You can see the ‘#dogmastatus’ channel here — as you can see, we fixed the problem with dogma just over 2 weeks ago ;)

It’s worth noting that I had to set up an additional user, “downtimebot”, on Jaiku to send the messages — otherwise I’d never see them on my configured mobile phone! Jaiku uses the optimisation that, if I sent the message, there’s no need to cc me with a copy of what I just sent; logical enough.

Anyway, if you’re interested in dogma’s availability (there might be one or two taint.org readers who are), feel free to add yourself to the #dogmastatus channel and receive any updates.

Update: Fergal noted that it’s pretty simple to use Cape Clear’s assembly framework to perform an HTTP ping test with output to Jabber/XMPP. Nifty!

7 Comments

How to solve a maze with Photoshop

wow, this is cool. lod3n, confronted by this heinous puzzle, wrote:

‘2 minutes in Photoshop. All too easy. So, where do I pick up my cake?

  1. Increase contrast.
  2. Select the right wall of the maze using the magic wand.
  3. Select > Modify > Expand 4 pixels
  4. Create new layer.
  5. Fill with Red.
  6. Select > Modify > Contract 2 pixels.
  7. Delete. Now you’ve got a line tracing the solution.
  8. Manually clean up the outer edge, and connect the dots.
  9. Cake!’

Here’s the result. Seriously nifty!

(Update: wow, this got Dugg heavily — 17000 pageviews from Digg alone! Unfortunately that caused a bit of a server meltdown. Should be back now though…)

118 Comments

A SpamAssassin rule-discovery algorithm

Just to get a little techie again… here’s a short article on a new algorithm I’ve come up with.

Text-matching rule-based anti-spam systems are pretty common; SpamAssassin is probably the best-known, and the proprietary apps built on SpamAssassin take the same approach. Other proprietary scanners seem to use similar techniques too, such as Symantec’s Brightmail and MessageLabs’ scanner (hi Matt ;), and doubtless there are others. As a result, ways to write rules quickly and effectively are valuable.

So far, most SpamAssassin text rules are manually developed; somebody looks at a few spam samples, spots common phrases, and writes a rule to match that. It’d be great to automate more of that work. Here’s an algorithm I’ve developed to perform this in a memory-efficient and time-efficient way. I’m quite proud of this, so thought it was worth a blog posting. ;)

Corpus collection

First, we collect a corpus of spam and “ham” (non-spam) mails. Standard enough, although in this case it helps to try to keep it to a specific type of mail (for example, a recent stock spam run, or a run from the OEM spammer).

Typically, a simple “grep” will work here, as long as the source corpus is all spam anyway; a small number of irrelevant messages can be left in, as long as the majority (80% or so) are variations on the target message set. (The SpamAssassin mass-check tool can now perform this on the fly using the new ‘GrepRenderedBody’ mass-check plugin, which is helpful.)

Rendering

Next, for each spam message, render the body. This involves:

  • decoding MIME structure
  • discarding non-textual parts, or parts that are not presented to the viewer by default in common end-user MUAs (such as attachments)
  • decoding quoted-printable and base64 encoding
  • rendering HTML, again based on the behaviour of the HTML renderers used in common end-user MUAs
  • normalising whitespace, “this is\na \ntest” -> “this is a test”

All pretty basic stuff, and performed by the SpamAssassin “body” rendering process during a “mass-check” operation. A SpamAssassin plugin outputs each message’s body string to a log file.

Next, we take the two log files, and process them using the following algorithm:

N-gram Extraction

Iterate through each mail message in the spam set. Each message is assigned a short message ID number. Cut off all but the first 32 kbytes of the text (for this algorithm, I think it’s safe to assume that anything past 32 KB will not be a useful place for spammers to place their spam text). Save a copy of this shortened text string for the later “collapse patterns” step.

Split the text into “words” — ie. space-separated chunks of non-whitespace chars. Compress each “word” into a shorter ID to save space:

"this is a test" => "a b c d"

(The compression dictionary used here is shared between all messages, and also needs to allow reverse lookups.)

Then tokenize the message into 2-word and 3-word phrase snippets (also known as N-grams):

"a b c d" => [ "a b", "b c", "c d", "a b c", "b c d" ]

Remove duplicate N-grams, so each N-gram only appears once per message.

For each N-gram token in this token set, increment a counter in a global “token count” hashtable, and add the message ID to the token’s entry in a “message subset hit” table.

Next, process the ham set. Perform the same algorithm, except: don’t keep the shortened text strings, don’t cut at 32KB, and instead of incrementing the “token count” hash entries, simply delete the entries in the “token count” and “message subset hit” tables for all N-grams that are found.

By the end of this process, all ham and spam have been processed, and in a memory-efficient fashion. We now have:

  • a table of hit-counts for the message text N-grams, with all N-grams where P(spam) < 1.0 — ie. where even a single ham message was hit — already discarded
  • the “message subset hit” table, containing info about exactly which subset of messages contain a given N-gram
  • the token-to-word reverse-lookup table

To further reduce memory use, the word-to-token forward-lookup table can now be freed. In addition, the values in the “message subset hit” table can be replaced with their hashes; we don’t need to be able to tell exactly which messages are listed there, we just need a way to tell if one entry is equal to another.
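
For concreteness, the extraction bookkeeping above can be sketched in Python (the real implementation is Perl, in SpamAssassin SVN; the table names here are my own):

```python
from collections import defaultdict

word2id = {}                      # shared compression dictionary
id2word = []                      # reverse lookup, for decoding patterns later
token_count = defaultdict(int)    # N-gram -> number of spams hit
subset_hits = defaultdict(set)    # N-gram -> set of message IDs hit

def compress(text, limit=32 * 1024):
    """Cut at 32 KB and compress each space-separated word to an integer ID."""
    out = []
    for word in text[:limit].split():
        if word not in word2id:
            word2id[word] = len(id2word)
            id2word.append(word)
        out.append(word2id[word])
    return out

def ngrams(ids):
    """Unique 2- and 3-word N-grams: each counts at most once per message."""
    grams = set()
    for n in (2, 3):
        for i in range(len(ids) - n + 1):
            grams.add(tuple(ids[i:i + n]))
    return grams

def add_spam(msg_id, text):
    for gram in ngrams(compress(text)):
        token_count[gram] += 1
        subset_hits[gram].add(msg_id)

def add_ham(text):
    """Any N-gram also seen in ham is discarded outright (P(spam) < 1.0).
    Ham is not truncated at 32 KB, and its text is not kept."""
    for gram in ngrams(compress(text, limit=None)):
        token_count.pop(gram, None)
        subset_hits.pop(gram, None)
```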

Summarisation

Iterate through the hit-count table. Discard entries that occur too infrequently to be listed; discard, especially, entries that occur only once. (We’ve already discarded entries that hit any ham.)

Make a hash that maps the message subsets to the set of all N-gram patterns for that message-subset. For each subset, pick a single N-gram, and note the hit-count associated with it as the hit-count value for that entire message-subset. (Since those N-grams all appear in the exact same subset of messages, they will always have the same P(spam) — this is a safe shortcut.)
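
The grouping step, sketched in Python with illustrative names; a frozenset key stands in for the subset hash the real code uses to save memory:

```python
from collections import defaultdict

def group_by_subset(token_count, subset_hits, min_hits=2):
    """Map each distinct message-subset to all N-gram patterns hitting exactly
    that subset; the subset's size is the shared hit-count for the group."""
    groups = defaultdict(list)
    for gram, count in token_count.items():
        if count < min_hits:
            continue          # discard patterns that occur too infrequently
        groups[frozenset(subset_hits[gram])].append(gram)
    # one representative hit-count per subset (== the subset's size)
    return {subset: (len(subset), patterns)
            for subset, patterns in groups.items()}
```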

Iterate through the message subsets, in order of their hit-count. Take all of the message-subset’s patterns, decode the N-grams in all patterns using the token-to-word reverse-lookup table, and apply this algorithm to that pattern set:

Collapse patterns

So, input here is an array of N-gram patterns, which we know always occur in the same subset of messages. We also have the saved array of all spam messages’ shortened text strings, from the N-gram extraction step. With this, we can apply a form of the BLAST pattern-discovery algorithm, from bioinformatics.

Pop the first entry off the array of patterns. Find any one mail from the saved-mails array that hits this pattern. Find the single character before the pattern in this mail, and prepend it to the pattern. See if the hits for this new pattern are the same message set as hit the old pattern; if not, restore the old pattern and break. If you hit the start of the mail message’s text string, break. Then apply the same algorithm forward through the mail text.

By the end of that, you have expanded the pattern from the basic N-gram as far as it’s possible to go in both directions without losing a hit.

Next, discard all patterns in the pattern array that are subsumed by (ie. appear in) this new expanded pattern. Add it to the output list of expanded patterns, unless it in turn is already subsumed by a pattern in that list; discard any patterns in the output list that are subsumed by this new pattern; and move onto the next pattern in the input list until they’re all exhausted.

(By the way, the “discard if subsumed” trick is the reason why we start off with 3-word N-grams — it gives faster results than just 2-word N-grams alone, presumably by reducing the amount of work that this collapse stage has to do, by doing more of it upfront at a relatively small RAM cost.)
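
A rough Python sketch of this expand-and-subsume loop, assuming the saved mails are plain text strings:

```python
def hits(pattern, mails):
    """The set of message indices whose text contains the pattern."""
    return frozenset(i for i, m in enumerate(mails) if pattern in m)

def expand(pattern, mails):
    """Grow the pattern one character at a time, left then right, for as
    long as the set of messages it hits stays exactly the same."""
    target = hits(pattern, mails)
    mail = next(m for m in mails if pattern in m)   # any one hit mail
    while True:                                     # grow leftwards
        pos = mail.find(pattern)
        if pos == 0:
            break                                   # start of message text
        wider = mail[pos - 1] + pattern
        if hits(wider, mails) != target:
            break
        pattern = wider
    while True:                                     # then grow rightwards
        end = mail.find(pattern) + len(pattern)
        if end == len(mail):
            break
        wider = pattern + mail[end]
        if hits(wider, mails) != target:
            break
        pattern = wider
    return pattern

def collapse(patterns, mails):
    """Expand each pattern and discard any pattern subsumed by (i.e.
    appearing inside) an already-expanded one."""
    out = []
    for p in patterns:
        if any(p in done for done in out):
            continue                                # already subsumed
        e = expand(p, mails)
        out = [d for d in out if d not in e]        # drop subsumed outputs
        out.append(e)
    return out
```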

Summarisation (continued)

Finally, output a line listing the percentage of the input spam messages hit (ie. (hit-count value / total number of spams) * 100) and the list of expanded patterns for that message-subset, then iterate on to the next message-subset.

Example

Here’s an example of some output from recent “OEM” stock spam:

$ ./seek-phrases-in-corpus --grep 'OEM' \
        spam:dir:/local/cor/recent/spam/*.2007022* \
        ham:dir:/local/cor/recent/ham/*.200702*
[mass-check progress noises omitted]
 RATIO   SPAM%    HAM%   DATA
 1.000  72.421   0.000  / OEM software - throw packing case, leave CD, use electronic manuals. Pay for software only and save 75-90%! /,
                         / TOP 1O ITEMS/
 1.000  73.745   0.000  / $99 Macromedia Studio 8 $59 Adobe Premiere 2.0 $59 Corel Grafix Suite X3 $59 Adobe Illustrator CS2 $129 Autodesk Autocad 2007 $149 Adobe Creative Suite 2 /,
                         /s: Adobe Acrobat PR0 7 $69 Adobe After Effects $49 Adobe Creative Suite 2 Premium $149 Ableton Live 5.0.1 $49 Adobe Photoshop CS $49 http:\/\//,
                         / Microsoft Office 2007 Enterprise Edition Regular price: $899.00 Our offer: $79.95 You save: $819.95 (89%) Availability: Pay and download instantly. http:\/\//,
                         / Adobe Acrobat 8.0 Professional Market price: $449.00 We propose: $79.95 Your profit: $369.05 (80%) Availability: Available for /,
                         / $49 Windows XP Pro w\/SP2 $/,
                         / Top-ranked item. (/,
                         /, use electronic manuals. Pay for software only and save 75-90%! /,
                         / Microsoft Windows Vista Ultimate Retail price: $399.00 Proposition: $79.95 Your benefit: $319.05 (80%) Availability: Can be downloaded /,
                         / $79 MS Office Enterprise 2007 $79 Adobe Acrobat 8 Pro $/,
                         / Best choice for home and professional. (/,
                         / OEM software - throw packing case, leave CD/,
                         / Sales Rank: #1 (/,
                         / $79 Microsoft Windows Vista /,
                         / manufacturers: Microsoft...Mac...Adobe...Borland...Macromedia http:\/\//
 1.000  73.855   0.000  / MS Office Enterprise 2007 /,
                         /9 Microsoft Windows Vista /,
                         / Microsoft Windows Vista Ultimate /,
                         /9 Macromedia Studio 8 /,
                         / Adobe Acrobat 8.0 /,
                         / $79 Adobe /
 1.000  74.242   0.000  / Windows XP Pro/
 1.000  74.297   0.000  / Adobe Acrobat /
 1.000  74.462   0.000  / Adobe Creative Suite /
 1.000  74.573   0.000  / Adobe After Effects /
 1.000  74.738   0.000  / Adobe Illustrator /
 1.000  74.959   0.000  / Adobe Photoshop CS/
 1.000  75.014   0.000  / Adobe Premiere /
 1.000  75.290   0.000  / Macromedia Studio /
 1.000  75.786   0.000  /OEM software/
 1.000  75.841   0.000  / Creative Suite /
 1.000  75.896   0.000  / Photoshop CS/
 1.000  75.951   0.000  / After Effects /
 1.000  76.062   0.000  /XP Pro/
 1.000  82.460   0.000  / $899.00 Our /,
                         / Microsoft Office 2007 Enterprise /,
                         / $79.95 You/

Immediately, that provides several useful rules; in particular, that final set of patterns can be combined with a SpamAssassin “meta” rule to hit 82% of the samples. Generating this took a quite reasonable 58 MB of virtual memory, with a runtime of about 30 minutes, analyzing 1816 spam and 7481 ham mails on a 1.7 GHz Pentium M laptop.

(Update:) here’s a sample message from that test set, demonstrating the top extracted snippets in bold:

  Return-Path: <[email protected]>
  X-Spam-Status: Yes, score=38.2 required=5.0 tests=BAYES_99,DK_POLICY_SIGNSOME,
          FH_HOST_EQ_D_D_D_D,FH_HOST_EQ_VERIZON_P,FH_MSGID_01C67,FUZZY_SOFTWARE,
          HELO_LOCALHOST,RCVD_IN_NJABL_DUL,RCVD_IN_PBL,RCVD_IN_SORBS_DUL,RDNS_DYNAMIC,
          URIBL_AB_SURBL,URIBL_BLACK,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_RHS_DOB,
          URIBL_SBL,URIBL_SC_SURBL shortcircuit=no autolearn=spam version=3.2.0-r492202
  Received: from localhost (pool-71-125-81-238.nwrknj.east.verizon.net [71.125.81.238])
          by dogma.boxhost.net (Postfix) with SMTP id E002F310055
          for <[email protected]>; Sun, 18 Feb 2007 08:58:20 +0000 (GMT)
  Message-ID: <000001c7533a$b1d3ba00$0100007f@localhost>
  From: "Kevin Morris" <[email protected]>
  To: <[email protected]>
  Subject: Need S0ftware?
  Date: Sun, 18 Feb 2007 03:57:56 -0500

  OEM software - throw packing case, leave CD, use electronic manuals.
  Pay for software only and save 75-90%!

  Discounts! Special offers! Software for home and office!
              TOP 1O ITEMS.

    $79 Microsoft Windows Vista Ultimate
    $79 MS Office Enterprise 2007
    $79 Adobe Acrobat 8 Pro
    $49 Windows XP Pro w/SP2
    $99 Macromedia Studio 8
    $59 Adobe Premiere 2.0
    $59 Corel Grafix Suite X3
    $59 Adobe Illustrator CS2
  $129 Autodesk Autocad 2007
  $149 Adobe Creative Suite 2
  http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t0

            Mac Specials:
  Adobe Acrobat PR0 7             $69
  Adobe After Effects             $49
  Adobe Creative Suite 2 Premium $149
  Ableton Live 5.0.1              $49
  Adobe Photoshop CS              $49
  http://ot.rezinkaoem.com/-software-for-mac-.php?0B85330BA896A9992D0561E08037493852CE
  6E1FAE&t6

  See more by this manufacturers:
  Microsoft...Mac...Adobe...Borland...Macromedia
  http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t4

  Microsoft Windows Vista Ultimate
  Retail price:  $399.00
  Proposition:  $79.95
  Your benefit:  $319.05 (80%)
  Availability: Can be downloaded INSTANTLY.
  http://ot.rezinkaoem.com/2480.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t3
  Best choice for home and professional. (37268 reviews)

  Microsoft Office 2007 Enterprise Edition
  Regular price:  $899.00
  Our offer:  $79.95
  You save:  $819.95 (89%)
  Availability: Pay and download instantly.
  http://ot.rezinkaoem.com/2442.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t1
  Sales Rank: #1 (121329 reviews)

  Adobe Acrobat 8.0 Professional
  Market price:  $449.00
  We propose:  $79.95
  Your profit:  $369.05 (80%)
  Availability: Available for INSTANT download.
  http://ot.rezinkaoem.com/2441.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t2
  Top-ranked item. (31949 reviews)

Further work

Things that would be nice:

  • It’d be nice to extend this to support /.*/ and /.{0,10}/ — matching “anys”, also known as “gapped alignment” searches in bioinformatics, using algorithms like the Smith-Waterman or Needleman-Wunsch algorithms. (Update: this has been implemented.)
  • A way to detect and reverse-engineer templates, e.g. “this is foo”, “this is bar”, “this is baz” => “this is (foo|bar|baz)”, would be great.
  • Finally, heuristics to detect and discard likely-poor patterns are probably the biggest wishlist item.

Tuits are the problem, of course, since $dayjob is the one that pays the bills, not this work. :(

The code is being developed here, in SpamAssassin SVN. Feel free to comment/mail if you’re interested, have improvement ideas, or want more info on how to use it… I’d love to see more people trying it out!

Some credit: I should note that IBM’s Chung-Kwei system, presented at CEAS 2004, was the first time I’d heard of a pattern-discovery algorithm (namely, their proprietary Teiresias algorithm) being applied to spam.

9 Comments

Script: knewtab

Here’s a handy script for konsole users like myself:

knewtab — create a new tab in a konsole window, from the commandline

usage: knewtab {tabname} {command line …}

Creates a new tab in a “konsole” window (the current window, or a new one if the command is not run from a konsole).

Requires that the konsole app be run with the “--script” switch.

Download ‘knewtab.txt’

Comments closed

Cliche-finder bookmarklet

Quinn posted a link to a nifty CGI by Aaron Swartz which detects uses of common cliches, with the list of cliches to avoid taken from the Associated Press Guide to News Writing. In addition, she also mentioned there’s the Passivator, ‘a passive verb and adverb flagger for Mozilla-derived browsers, Safari, and Opera 7.5’.

Combining the two, I’ve hacked together a bookmarklet version of the cliche finder — it can be found on this page. (Couldn’t place it inline into this post due to stupid over-aggressive Markdown, grr.)

Fun! Probably not IE-compatible, though.

10 Comments

Top 100 Irish Blogs, pt 2

The previous post was pretty popular, and one of the requests was for a regularly-updated listing. So here it is: http://taint.org/technorati/

Since Technorati limits queries to about 500 per day (IIRC), and there are quite a few more blogs than that in the Irish blogs list, I plan to update it on a nightly basis, with each set of blogs updating on different days. This should keep the figures more-or-less up to date without hammering T’rati too much.

46 Comments

Technorati-ranked Irish Blogs Top 100

So, I was thinking about the various Irish blog aggregators, Planet.journals.ie, IrishBlogs.ie, and IrishBlogs.info. Michele’s Irishblogs.info attempts to “rank” the blogs by hits, but many of the Irish webloggers don’t include that hit-counting HTML snippet in their web pages, so quite a few are probably missing; on top of that, RSS readers don’t count. It lists me as #3, which I knew was definitely wrong, anyway ;)

However, it occurred to me that an alternative way to compute a “top 100” would be to use the Technorati rank of each blog, and build a table based on that; that would measure the blogs by Technorati’s readership-estimation algorithm, which may still be faulty, of course, but it seemed worth a try. I was curious, so I gave it a go, and here are the results. Enjoy!

Update: This table is no longer up-to-date — a much fresher version is now available over here, and will be updated regularly.

Top 100 by rank / inbound blog links:

Position Rank Inbound blogs Inbound links Blog
1 2940 638 1931   http://www.tomrafteryit.net/
2 6636 371 1280   http://www.mulley.net/
3 8231 315 625   http://twentymajor.blogspot.com/
4 10984 249 512   http://www.natterjackpr.com/
5 15720 181 409   http://www.avalon5.com/
6 18897 151 315   http://irish.typepad.com/irisheyes/
7 19364 148 472   http://www.gavinsblog.com/
8 21214 136 385   http://www.blather.net/
9 21715 133 968   http://ocaoimh.ie/
10 22210 132 399   http://eirepreneur.blogs.com/eirepreneur/
11 22258 130 323   http://thetorturegarden.blogspot.com/
12 23921 122 351   http://www.dehora.net/journal/
13 24143 121 199   http://www.atlanticblog.com/
14 24828 118 174   http://freestater.blogspot.com/
15 25570 115 260   http://arseblog.com/WP
16 25570 115 246   http://tcal.net/
17 27174 109 252   http://www.digitalrights.ie/
18 27189 110 169   http://cork2toronto.blogspot.com/
19 28004 106 731   http://taint.org/
20 29008 103 286   http://unitedirelander.blogspot.com/
21 29008 103 232   http://www.nialler9.com/blog
22 29008 103 175   http://clickhere.blogs.ie/
23 29978 100 270   http://www.mneylon.com/blog
24 31954 95 901   http://www.irishelection.com/
25 33397 91 231   http://memex.naughtons.org/
26 34121 89 370   http://siciliannotes.blogspot.com/
27 35022 86 285   http://www.sineadgleeson.com/blog
28 35022 86 146   http://www.cfdan.com/
29 35858 84 904   http://www.pkellypr.com/blog
30 36223 84 255   http://www.thinkingoutloud.biz/
31 37735 80 175   http://www.dervala.net/
32 39719 76 207   http://backseatdrivers.blogspot.com/
33 40078 76 229   http://fdelondras.blogspot.com/
34 40276 75 203   http://www.mediangler.com/
35 40821 74 128   http://www.thinkinghomebusiness.com/blog
36 44148 69 122   http://outofambit.blogspot.com/
37 45075 67 147   http://www.podleaders.com/
38 45075 67 87   http://www.aidanf.net/
39 45729 66 238   http://www.argolon.com/
40 46477 65 201   http://www.sarahcarey.ie/
41 46477 65 191   http://disillusionedlefty.blogspot.com/
42 47586 64 141   http://www.johnbreslin.com/blog
43 48011 63 66   http://www.branedy.net/
44 52278 58 398   http://dossing.blogspot.com/
45 54710 56 155   http://redmum.blogspot.com/
46 55758 55 103   http://richarddelevan.blogspot.com/
47 56390 54 148   http://donal.wordpress.com/
48 56390 54 129   http://prettycunning.net/blog
49 57527 53 104   http://www.dublinblog.ie/
50 58724 52 167   http://www.tuppenceworth.ie/blog
51 58724 52 102   http://www.inter-actions.biz/blog/
52 59920 51 101   http://seanmcgrath.blogspot.com/
53 60315 51 76   http://www.blackphoebe.com/msjen/
54 62483 49 112   http://www.infactah.com/
55 62885 49 118   http://mamanpoulet.blogspot.com/
56 63869 48 229   http://icecreamireland.com/
57 68503 45 93   http://www.web2ireland.org/
58 68503 45 75   http://www.davidmcwilliams.ie/
59 68503 45 73   http://vipglamour.net/
60 68824 45 193   http://imeall.blogspot.com/
61 72248 43 81   http://planetpotato.blogs.com/planet_potato_an_irish_bl/
62 73843 42 149   http://lettertoamerica.blogs.com/
63 73843 42 119   http://www.kenmc.com/
64 73843 42 102   http://www.pmooney.net/blogsphe.nsf
65 73843 42 70   http://bohanna.typepad.com/pureplay/
66 75725 41 107   http://bonhom.ie/
67 75725 41 93   http://www.bibliocook.com/
68 75725 41 78   http://shittyfirstdraft.blogspot.com/
69 77680 40 225   http://bestofbothworlds.blogspot.com/
70 77680 40 134   http://www.stdlib.net/%7Ecolmmacc
71 77957 40 82   http://davesrants.com/
72 79732 39 103   http://ricksbreakfastblog.blogspot.com/
73 80012 39 92   http://manuel-estimulo.blogspot.com/
74 81970 38 91   http://gingerpixel.com/
75 82240 38 248   http://www.linksheaven.com/
76 84304 37 726   http://thelimerick.blogspot.com/
77 84304 37 127   http://www.ryderdiary.com/
78 84304 37 83   http://morgspace.net/
79 84304 37 64   http://talideon.com/weblog/
80 86729 36 140   http://www.damienblake.com/
81 86729 36 124   http://irisheagle.blogspot.com/
82 86729 36 102   http://blog.rymus.net/
83 86729 36 65   http://www.adammaguire.com/blog
84 87068 36 272   http://progressiveireland.blogspot.com/
85 89814 35 145   http://www.windsandbreezes.org/
86 92646 34 43   http://football-corner.blogspot.com/
87 95258 33 207   http://www.fustar.org/
88 95258 33 171   http://www.iced-coffee.com/
89 95258 33 82   http://www.bytesurgery.com/gearedup
90 101881 31 90   http://phoblacht.blogspot.com/
91 101881 31 70   http://counago-and-spaves.blogspot.com/
92 101881 31 58   http://www.firstpartners.net/blog
93 105668 30 82   http://realitycheckdotie.blogspot.com/
94 109643 29 142   http://bifsniff.com/cartoons/
95 109643 29 75   http://dave.antidisinformation.com/
96 109643 29 60   http://conoroneill.com/
97 109643 29 55   http://www.minds.may.ie/%7Edez/serendipity/
98 109643 29 51   http://dublin.metblogs.com/
99 110005 29 78   http://www.janinedalton.com/blog
100 110005 29 54   http://www.runningwithbulls.com/blog

List by inbound links:

Position Rank Inbound blogs Inbound links Blog
1 2940 638 1931   http://www.tomrafteryit.net/
2 6636 371 1280   http://www.mulley.net/
3 21715 133 968   http://ocaoimh.ie/
4 35858 84 904   http://www.pkellypr.com/blog
5 31954 95 901   http://www.irishelection.com/
6 28004 106 731   http://taint.org/
7 84304 37 726   http://thelimerick.blogspot.com/
8 8231 315 625   http://twentymajor.blogspot.com/
9 258886 13 519   http://newswire99.blogspot.com/
10 10984 249 512   http://www.natterjackpr.com/
11 19364 148 472   http://www.gavinsblog.com/
12 164780 20 451   http://inao.blogspot.com/
13 15720 181 409   http://www.avalon5.com/
14 22210 132 399   http://eirepreneur.blogs.com/eirepreneur/
15 52278 58 398   http://dossing.blogspot.com/
16 21214 136 385   http://www.blather.net/
17 34121 89 370   http://siciliannotes.blogspot.com/
18 23921 122 351   http://www.dehora.net/journal/
19 156276 21 336   http://www.ebbybrett.co.uk/blog
20 22258 130 323   http://thetorturegarden.blogspot.com/
21 18897 151 315   http://irish.typepad.com/irisheyes/
22 29008 103 286   http://unitedirelander.blogspot.com/
23 35022 86 285   http://www.sineadgleeson.com/blog
24 87068 36 272   http://progressiveireland.blogspot.com/
25 239963 14 271   http://www.thehealthtechblog.com/
26 29978 100 270   http://www.mneylon.com/blog
27 25570 115 260   http://arseblog.com/WP
28 36223 84 255   http://www.thinkingoutloud.biz/
29 27174 109 252   http://www.digitalrights.ie/
30 82240 38 248   http://www.linksheaven.com/
31 977738 3 248   http://www.tomgriffin.org/the_green_ribbon/
32 25570 115 246   http://tcal.net/
33 45729 66 238   http://www.argolon.com/
34 29008 103 232   http://www.nialler9.com/blog
35 33397 91 231   http://memex.naughtons.org/
36 40078 76 229   http://fdelondras.blogspot.com/
37 63869 48 229   http://icecreamireland.com/
38 77680 40 225   http://bestofbothworlds.blogspot.com/
39 208904 16 210   http://www.anlionra.com/
40 471327 7 208   http://www.ravenfamily.org/sam/
41 39719 76 207   http://backseatdrivers.blogspot.com/
42 95258 33 207   http://www.fustar.org/
43 40276 75 203   http://www.mediangler.com/
44 46477 65 201   http://www.sarahcarey.ie/
45 637233 5 200   http://armchaircelts.co.uk/
46 24143 121 199   http://www.atlanticblog.com/
47 280786 12 199   http://conann.com/
48 68824 45 193   http://imeall.blogspot.com/
49 46477 65 191   http://disillusionedlefty.blogspot.com/
50 637233 5 182   http://www.everysecondpaycheck.com/blog
51 164524 20 181   http://irishlinks.blogspot.com/
52 542250 6 176   http://www.dublinka.com/
53 29008 103 175   http://clickhere.blogs.ie/
54 37735 80 175   http://www.dervala.net/
55 24828 118 174   http://freestater.blogspot.com/
56 155943 21 172   http://www.jamesgalvin.com/
57 95258 33 171   http://www.iced-coffee.com/
58 164524 20 171   http://irishcraftworker.typepad.com/an_irish_craftworkers_goo/
59 27189 110 169   http://cork2toronto.blogspot.com/
60 58724 52 167   http://www.tuppenceworth.ie/blog
61 141242 23 164   http://atp.datagate.net.uk/blog
62 148304 22 159   http://www.lifewithouttoast.com/
63 184241 18 158   http://funferal.org/
64 54710 56 155   http://redmum.blogspot.com/
65 73843 42 149   http://lettertoamerica.blogs.com/
66 56390 54 148   http://donal.wordpress.com/
67 45075 67 147   http://www.podleaders.com/
68 155943 21 147   http://dublinopinion.com/
69 35022 86 146   http://www.cfdan.com/
70 89814 35 145   http://www.windsandbreezes.org/
71 109643 29 142   http://bifsniff.com/cartoons/
72 195745 17 142   http://podcasting.ie/podcast
73 47586 64 141   http://www.johnbreslin.com/blog
74 86729 36 140   http://www.damienblake.com/
75 223280 15 137   http://thegurrier.com/
76 77680 40 134   http://www.stdlib.net/%7Ecolmmacc
77 980795 3 131   http://www.sineadcochrane.com/
78 56390 54 129   http://prettycunning.net/blog
79 40821 74 128   http://www.thinkinghomebusiness.com/blog
80 84304 37 127   http://www.ryderdiary.com/
81 86729 36 124   http://irisheagle.blogspot.com/
82 44148 69 122   http://outofambit.blogspot.com/
83 73843 42 119   http://www.kenmc.com/
84 62885 49 118   http://mamanpoulet.blogspot.com/
85 135121 24 117   http://nellysgarden.blogspot.com/
86 195745 17 115   http://blog.infurious.com/
87 542250 6 114   http://ainelivia.typepad.com/aine_livia_at_the_midnigh/
88 62483 49 112   http://www.infactah.com/
89 75725 41 107   http://bonhom.ie/
90 57527 53 104   http://www.dublinblog.ie/
91 55758 55 103   http://richarddelevan.blogspot.com/
92 79732 39 103   http://ricksbreakfastblog.blogspot.com/
93 58724 52 102   http://www.inter-actions.biz/blog/
94 73843 42 102   http://www.pmooney.net/blogsphe.nsf
95 86729 36 102   http://blog.rymus.net/
96 59920 51 101   http://seanmcgrath.blogspot.com/
97 173857 19 99   http://www.ofoghlu.net/log/
98 118678 27 96   http://irishkc.com/
99 68503 45 93   http://www.web2ireland.org/
100 75725 41 93   http://www.bibliocook.com/

Update: Here’s a full list of all 569 tested blogs. Also, there’s been a minor change to the rankings here; I’ve just realised that there was a bug in how the script handled evenly-matched blogs, so (for example) #15 and #16 were reversed in order; that’s now fixed.

If you find a blog missing, it’s probably because (a) it’s not pinging Planet.journals.ie or (b) it’s not registered with Technorati; this method requires both. Most Irish blogs satisfy both, but some (Old Rotten Hat, for example) don’t…

Methodology

I found this more-or-less full list of Irish weblogs at Planet.journals.ie, and selected the blogs that had pinged their site in the past 6 months, then cut that down to just the blog main-page URLs, removing duplicates.
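
That trimming step (reducing each pinged URL to its blog’s main-page URL, then removing duplicates) can be sketched roughly as follows; this is a hypothetical Python reconstruction, with heuristics of my own, not the actual code from the tarball:

```python
from urllib.parse import urlparse

def main_page_url(url):
    """Reduce a pinged URL (often a post permalink) to the blog's
    main-page URL: scheme + host, keeping a single leading path
    segment if it looks like a blog root (e.g. /blog)."""
    p = urlparse(url.strip())
    parts = [s for s in p.path.rstrip('/').split('/') if s]
    keep = '/' + parts[0] if len(parts) == 1 else ''
    return f"{p.scheme}://{p.netloc.lower()}{keep}/"

def dedupe(urls):
    """Normalise and de-duplicate, preserving first-seen order."""
    seen, out = set(), []
    for u in urls:
        n = main_page_url(u)
        if n not in seen:
            seen.add(n)
            out.append(n)
    return out
```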

Given that list, I then looked up each blog URL using the Technorati API, and got its rank, inbound link count, and inbound linking blogs count.
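
The per-blog lookup came back as XML from Technorati’s bloginfo call (long since retired). Here’s a Python sketch of the parsing side, run against a canned response; the element names are from memory and should be treated as assumptions:

```python
import xml.etree.ElementTree as ET

# A canned response in roughly the shape the Technorati bloginfo call
# used to return (element names are assumptions; the API is long gone).
SAMPLE = """<tapi version="1.0"><document><result><weblog>
  <name>taint.org</name>
  <url>http://taint.org/</url>
  <rank>28004</rank>
  <inboundblogs>106</inboundblogs>
  <inboundlinks>731</inboundlinks>
</weblog></result></document></tapi>"""

def parse_bloginfo(xml_text):
    """Pull (rank, inbound blogs, inbound links) out of a bloginfo response."""
    weblog = ET.fromstring(xml_text).find('.//weblog')
    grab = lambda tag: int(weblog.findtext(tag))
    return grab('rank'), grab('inboundblogs'), grab('inboundlinks')
```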

top100code.tgz is a tarball of the perl code I wrote to do this, if you fancy doing it yourself on whichever set of blogs you fancy…

18 Comments

The vagaries of Google Image Search

Remember the C=64-izer, the quick hack to display an image in the style of the Commodore 64?

Recently, I’ve started getting hits to this demo image of the “O RLY?” owl — lots of ’em.

It turns out that the C=64-ized rendition of this image is now the top hit for “O RLY” on Google Image Search; pretty bizarre, since there are obviously better images on the first search page (one result along, in fact). What’s more, the page listed as the ‘origin page’, http://taint.org/tag/today, doesn’t even use that text.

This has resulted in lots of Myspace kiddies etc. obliviously using the C=64 rendering. Yay for Commodore ;)

Comments closed

SpicyLinks and del.icio.us Network Summarization

Ross Mayfield:

Every time I see Gabe Rivera of TechMeme, I ask for the same thing — MeMeme. Give me TechMeme where the core index is based on who I read, about 150 people at any given time, to show me what my friends are interested in.

Funnily enough, that is exactly why I wrote SpicyLinks!

It works pretty well — in fact, nowadays I don’t really bother reading slashdot, Digg, Reddit, et al, particularly frequently, because I know that all the really interesting stuff will be at the top of my newsreader in the SpicyLinks feed.

Anyway, I’ve been calling SpicyLinks a ‘summarizing aggregator’, but the discussion that arose from Ross’ posting inspired me. A little bit of hacking has come up with an interesting twist: take a del.icio.us social network, a CGI script called deliciousnetwork2opml.cgi, and 15 minutes hacking on SpicyLinks to support inclusion of OPML via a remote URI, and hey presto — it’s now a social-network summarising aggregator. ;)
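
For the curious, the core of a deliciousnetwork2opml-style converter is just OPML templating; here’s a minimal Python sketch (the per-user del.icio.us RSS feed URL pattern is assumed from memory, and the service itself is long gone):

```python
from xml.sax.saxutils import quoteattr

def network_to_opml(usernames):
    """Emit a minimal OPML document with one outline per network member,
    pointing at that member's del.icio.us RSS feed."""
    outlines = "\n".join(
        '    <outline type="rss" text=%s xmlUrl=%s/>' % (
            quoteattr(u),
            quoteattr("http://del.icio.us/rss/" + u),
        )
        for u in usernames
    )
    return (
        '<?xml version="1.0"?>\n<opml version="1.1">\n'
        '  <head><title>delicious network</title></head>\n'
        '  <body>\n%s\n  </body>\n</opml>' % outlines
    )
```

Point SpicyLinks (or any aggregator that imports OPML) at the output, and the network’s feeds get slurped in as one subscription list.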

6 Comments

“Stretch-to-fit Textareas” Greasemonkey User Script

Here’s another quick-hack Greasemonkey user script I wrote recently.

Stretch-to-fit Textareas is a user script which improves the usability of editable textareas; it causes them to “stretch” vertically to fit their contents, as you type. This behaviour was inspired by that of textareas in FogBugz.

It can be inhibited by turning off the small checkbox to the right of each textarea.

Update: it’s worth noting that this is different from the Resizeable Textareas Firefox extension. Whereas the latter allows the user to resize the textareas by hand, this user script does that action automatically, based on the contents of the field; no manual resize-handle-searching and dragging is required. On the other hand, this user script will only stretch textareas vertically, whereas the extension allows them to be dragged in both dimensions. In fact, the two are complementary — I’m running both, and I suggest you do too ;)

Update 2: here’s a Firefox extension version — Greasemonkey not required!

1 Comment

Retroactive Tagging With TagThe.Net

Hacky hack hack.

Ever since I enabled tags on taint.org, I’ve been mildly annoyed by the fact that there were thousands of older entries deprived of their folksonomic chunky goodness. A way to ‘retroactively tag’ those entries somehow would be cool.

Last week, Leonard posted a link on his linkblog to TagThe.net, a web service which offers a nifty REST API; simply upload a chunk of text, and it’ll suggest a few tags for that text, like this (here, ‘urlencode’ is a little helper that URL-encodes its standard input):

echo 'Hi there, I am a tag-suggesting robot' | curl "http://tagthe.net/api/?text=`urlencode`"
<?xml version="1.0" encoding="UTF-8"?>
<memes>
  <meme source="urn:memanage:BAD542FA4948D12800AA92A7FAD420A1" updated="Tue May 30 20:20:39 CEST 2006">
    <dim type="topic">
      <item>robot</item>
    </dim>
    <dim type="language">
      <item>english</item>
    </dim>
  </meme>
</memes>
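
Extracting the suggested tags from that response is a few lines of XML-walking; a Python sketch, run against the sample output above:

```python
import xml.etree.ElementTree as ET

RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<memes>
  <meme source="urn:memanage:BAD542FA4948D12800AA92A7FAD420A1"
        updated="Tue May 30 20:20:39 CEST 2006">
    <dim type="topic"><item>robot</item></dim>
    <dim type="language"><item>english</item></dim>
  </meme>
</memes>"""

def suggested_tags(xml_text):
    """Return the <item> values from every dim of type 'topic'."""
    root = ET.fromstring(xml_text)
    return [item.text
            for dim in root.iter('dim') if dim.get('type') == 'topic'
            for item in dim.iter('item')]
```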

This looked promising.

Anyway, I’ve now implemented this — it worked great! If you’re curious, here’s details of how I did it. It’s a bit hacky, since I’m only going to be doing this once — and very UNIXy and perlish, because that’s how I do these things — but maybe somebody will find it useful.

How I Retroactively Tagged taint.org

This weblog runs WordPress, so all the entries are stored in a MySQL database. I took the MySQL dump of the tables, and a quick script figured out that, of the 1600-odd posts, there were 1352 that came from the pre-tag era, requiring tag inference. A mail to the TagThe.Net team established that they were happy with this level of usage.

I grepped the post IDs and text out of the SQL dump, threw those into a text file using the simple format ‘id=NNN text=SQLHTMLSTRING’ (where SQLHTMLSTRING was the nicely-escaped HTML text taken directly from the SQL dump), and ran them through this script.

That rendered the first 2k of each of those entries as a URL-encoded string, invoked the REST API with that, got the XML output, and extracted the tags into another UNIXy text-format output file. (It also added one tag for the ‘proto-tag’ system I used in the early days, where the first word of the entry was a single tag-style category name.)
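
That per-entry preparation (clip to the first 2k, URL-encode it for the REST call, and recover the old proto-tag first word) is easy enough to sketch; this is a hypothetical Python reconstruction, not the actual script:

```python
from urllib.parse import quote_plus

def prepare_entry(text, limit=2048):
    """Return (proto_tag, encoded) for one pre-tag-era entry: the first
    word as the old proto-tag category name, and the first `limit`
    characters of the text URL-encoded for the tagthe.net REST call."""
    words = text.split()
    proto_tag = words[0].strip(':').lower() if words else None
    return proto_tag, quote_plus(text[:limit])
```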

Next, I ran this script, which in turn took that intermediate output and converted it to valid PHP code, like so:

cat suggestedtags | ./taglist-to-php.pl  > addtags.php
scp addtags.php my.server:taint.org/wp-admin/

The generated page ‘addtags.php’ looks like this:

<?php
  require_once('admin.php');
  global $utw;
  $utw->SaveTags(997, array("music","all","audio","drm-free",
      "faq","lunchbox","destination","download","premiere","quote"));
  [...]
  $utw->SaveTags(998, array("software","foo","swf","tin","vnc"));
  $utw->SaveTags(999, array("oses","eek","longhorn","ram",
    "winsupersite","windows","amount","base","dog","preview","system"));
?>
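
The taglist-to-php.pl step is mostly string templating. Assuming an intermediate format along the lines of ‘id=NNN tags=a,b,c’ (the real format wasn’t published), a Python equivalent might look like:

```python
def taglist_to_php(lines):
    """Turn 'id=NNN tags=a,b,c' lines into an addtags.php-style page
    that calls UltimateTagWarrior's SaveTags() for each post."""
    calls = []
    for line in lines:
        ident, tags = line.strip().split(None, 1)
        post_id = int(ident.split('=', 1)[1])
        tag_list = tags.split('=', 1)[1].split(',')
        quoted = ','.join('"%s"' % t for t in tag_list)
        calls.append('  $utw->SaveTags(%d, array(%s));' % (post_id, quoted))
    return ("<?php\n  require_once('admin.php');\n  global $utw;\n"
            + "\n".join(calls) + "\n?>")
```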

Once that page was in place, I just visited it in my (already logged in) web browser window, at http://taint.org/wp-admin/addtags.php, and watched as it gronked for a while. Eventually it stopped, and all those entries had been tagged. (If I wasn’t so hackish, I might have put in a little UI text here — but I didn’t.)

The results are very good, I think.

A success: http://taint.org/tag/research has picked up a lot of the interesting older entries where I discussed things like IBM’s Tieresias pattern-recognition algorithm. That’s spot on.

A minor downside: it’s not so good at nouns. This entry talks about Silicon Valley and geographical insularity, and mentions “Silicon Valley” prominently — one or both of those words would seem to be a good thing to tag with, but it missed them.

Still, that’s a minor issue — the tags it has suggested are generally very appropriate and useful.

Next, I need to find a way to auto-generate titles for the really old entries ;)

1 Comment

Another script: goog-love.pl

A quick hack —

goog-love.pl – find out where your site’s google juice comes from

This script will grind through your web site’s “access.log” file (which must be in the “combined” log format). It’ll pick out the top 100 Google searches found in the referer field, re-run those searches, and determine which ones are giving your website all the linky Google love — in other words, the searches that your site ‘wins’ on.
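
The interesting bit is teasing the search query out of each combined-format referer field; here’s a Python sketch of just that part (the real script goes further and re-runs each search through the Google SOAP API):

```python
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs, unquote

# In the combined log format, the referer is the second-to-last
# quoted field on each line (followed by the user-agent).
REFERER_RE = re.compile(r'"([^"]*)" "[^"]*"$')

def google_queries(log_lines):
    """Count the Google search queries found in referer fields."""
    counts = Counter()
    for line in log_lines:
        m = REFERER_RE.search(line.rstrip())
        if not m:
            continue
        ref = urlparse(m.group(1))
        if 'google.' in ref.netloc and ref.path == '/search':
            q = parse_qs(ref.query).get('q')
            if q:
                counts[unquote(q[0])] += 1
    return counts.most_common(100)
```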

The output is in plain text and a chunk of HTML.

usage:

goog-love.pl sitehost google-api-key < access.log > out.html

e.g.

cat /var/www/logs/taint.org.* | goog-love.pl \
  taint.org 0xb0bd0bb5yourgoogleapikeyhere0xdeadbeef | tee out.html

NOTE: this script requires the SOAP::Lite module be installed. Install it using apt-get install libsoap-lite-perl or cpan SOAP::Lite. It also requires a Google API key.

For example, here are the current results for this site. You can immediately see some interesting stuff that’s not immediately obvious otherwise, such as my site being the top hit for [beardy justin] ;)

Download here (5 KiB perl script).

Notes:

  • if you see a lot of “502 Bad Gateway” errors, it’s probably over-zealous anti-bot ACLs on Google’s side. Try from another host.

  • Read the comments for notes on a bug in recent releases of SOAP::Lite; please let me know if you hear of them getting fixed ;)

5 Comments

Urban Dead HUD; added Inventory Sorting

I’ve updated the Urban Dead HUD Greasemonkey userscript; it now offers inventory sorting, inspired by Ikko’s userscript (albeit a little different in implementation). Here’s a screenshot:

Right now, UD is reasonably interesting — our team of plucky survivors have been helping out with the defence of Caiger Mall, a major mall towards the north-west of the city. We’ve repulsed the Church of the Resurrection‘s attempts to wipe us out, but that seems to have made us quite a juicy target; there are now no less than three separate Zombie groups ganging up on us. For now, we’re still holding out.

4 Comments

Life Hacks: getting back to the command-line

Tech: So Danny O’Brien’s ‘Life Hacks’ talk is one of the most worthwhile reflections on productivity (and productivity technology) I’ve heard. (Cory Doctorow’s transcript from NotCon 2004, video from ETCon.)

There’s a couple of things I wanted to write about it, so I’ll do them in separate blog entries.

First off: I’d love to see Ward Cunningham’s ‘cluster files by time’ hack; it sounds very useful. But that’s not what I wanted to write about ;)

People don’t extract stuff from big complex apps using OLE and so on; it’s brittle, and undocumented. Instead they write little command-line scriptlets. Sometimes they do little bits of ‘open this URL in a new window’ OLE-type stuff to use in a pipeline, but that’s about it. And fundamentally, they pipe.

This ties into the post that reminded me to write about it — Diego Doval’s atomflow, which is essentially a small set of command-line apps for Atom storage. Diego notes:

Now, here’s what’s interesting. I have of course been using pipes for years. And yet the power and simplicity of this approach had simply not occurred to me at all. I have been so focused on end-user products for so long that my thoughts naturally move to complex uber-systems that do everything in an integrated way. But that is overkill in this case.

Exactly! He’s not the only one to get that recently — MS and Google are two very high-profile organisations that have picked up the insight; it’s the Egypt way.

There’s fundamentally a breakage point where shrink-wrapped GUI apps cannot do everything you want done, and you have to start developing code yourself — and the best API for that, after 30 years, has been the command-line-and-pipe metaphor.

(Also, complex uber-apps are what people think is needed — however, that’s just a UI scheme that’s prevailing at the moment. Bear in mind that anyone using the web today uses a command line every day. A command line will not necessarily confuse users.)

Tying back into the Life Hacks stuff — one thing that hasn’t yet been done properly as a command-line-and-pipe tool, though, is web-scraping. Right now, if you scrape, you’ve got to (a) do lots of munging in a single big fat script of your own devising, if you’re lucky using something like WWW::Mechanize (which is excellent!); (b) use a scraping app like sitescooper; or (c) get hacky with a shell script that runs wget and greps bits of the output in a really brittle way.

I’ve been mulling over a ‘next-generation sitescooper’ on and off over the past year, and I think the best way to do it is to split its functionality up into individual scripts/perl modules:

  • one to download files, maintaining a cache, taking likely freshness into account, and dealing with crappy HTTP/HTTPS weirdness like cookies, logins and redirects;
  • one to diff HTML;
  • one to lobotomise (i.e. simplify) HTML;
  • one to scrape out the ‘good bits’ using sitescooper-style regions.

Tie those into HTML Tidy and XMLStarlet, and you have an excellent command-line scraping framework.
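
The ‘lobotomise HTML’ component, at least, is easy to prototype. Here’s a crude Python sketch that drops scripts, styles, comments and attributes while keeping the bare structural tags (regex-based, so it’ll choke on pathological markup; a real version would sit behind HTML Tidy):

```python
import re

def lobotomise(html):
    """Crudely simplify HTML: drop <script>/<style> blocks, comments,
    and all tag attributes, leaving bare structural tags and text."""
    html = re.sub(r'(?is)<(script|style)\b.*?</\1>', '', html)
    html = re.sub(r'(?s)<!--.*?-->', '', html)
    # strip attributes: <a href="..."> becomes <a>
    html = re.sub(r'<(/?\w+)[^>]*>', r'<\1>', html)
    return re.sub(r'\n\s*\n+', '\n', html).strip()
```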

Still haven’t got any time to do all that though. :(

Comments closed

Going to LayerOne

Conferences: I’m going to LayerOne; it looks interesting, and I’ve been hoping for quite a while to bump into Danny O’Brien (who’s there doing his Life Hacks talk) for a couple of drinks and a blather. Other speakers look similarly interesting, in an ‘offbeat hacker conference’ way, so I think it’ll be fun.

It conflicts with The Streets playing the Wiltern, but c’est la vie ;)

Comments closed

Life Hacks

Work: Life Hacks: Tech Secrets of Overprolific Alpha Geeks, Danny O’Brien’s ETech talk.

Amazingly, despite not being an alpha geek ;), I already use all these things:

  • a todo.txt file (anything else is inconvenient).
  • everything incoming comes through email, including RSS (thanks to rss2email). Again, anything else is inconvenient; I couldn’t be bothered with another desktop app.
  • I hack scripts for every repetitive task I run into
  • I sync instead of backup; everything has a CVS repository running on a remote server, even my home dir
  • I have a nasty tendency to web-scrape data

These tips definitely are good advice, although I have a feeling they’re optimised for a weblogging UNIX geek who spends hours hacking perl/python scripts. ;)

I’m looking forward to LifeHacks.com when it does eventually go live… should be interesting.

Comments closed