User script: add my delicious search results to Google

For years now, I’ve been collecting bookmarks at delicious.com/jm — nearly 7000 of them by now. I’ve been scrupulous about tagging and describing each one, so they’re eminently searchable, too. I’ve frequently found this to be a very useful personal reference resource.

I was quite pleased to come across the Delicious Search Results on Google Greasemonkey userscript, accordingly. It intercepts Google searches, adding Delicious tag-search results at the top of the search page, and works pretty well. Unfortunately though, that searches all of delicious, not specifically my own bookmarks.

So here’s a quick hack fix to do just that:

my_delicious_search_results.user.js – My Delicious Search Results on Google

Shows tag-search results from my Delicious account on Google search pages, with links to more extensive Delicious searches. Use ‘User Script Commands‘ -> ‘Set Delicious Username‘ to specify your username.

Screenshot:

Enjoy!

Tags: , , , , , ,

Comments (2)

Reminder: Irish computing history talk next Monday

Don’t forget — next Monday, the Heritage Society of Engineers Ireland, in association with The Irish Computer Society, and the ICT and Electronic and Electrical Divisions of Engineers Ireland, will be hosting an evening lecture entitled “Reminiscences of Early days of Computing in Ireland”, by Gordon Clarke (M.A., CEng., F.B.C.S., C.I.T.P., F.I.C.S). Sounds like it’ll be great. More details.

Update: it starts at 8pm; useful info! Also, the event’s flyer can be found on this page, which notes:

For those new to using our webcast facility, please see www.engineersireland.ie/webcast for information on how to set-up and access our webcasts. To view the event, please log onto the url below: https://engineersireland.webex.com/engineersireland/onstage/g.php?t=a&d=841959965 The password: computer

Tags: , , ,

Comments

“Fundamentally flawed”

Killer presentation — “RPC And Its Offspring: Convenient, Yet Fundamentally Flawed” from Steve Vinoski, who presented it at QCon London last week. It’s full of reminders of the mid-90’s, hacking away on CORBA technology — Steve was one of the key players at Iona while I was there.

But never mind where we’ve been; let me hit you with the summary slide to show where Steve’s going:

  • RPC is a convenient but flawed accident of history

    • 1980s research focused on monoliths of programming languages, distributed applications, and operating systems
    • each computer vendor of the time owned their own full stack, from language to hardware and network, and you used what they gave you
    • imperative languages won back then simply because of their superior performance at that time
  • It’s almost 2010, folks — we can do WAY better

    • pull your head from the imperative language sand and learn functional programming
    • the world is many-core and highly distributed, and the old ways aren’t going to keep working much longer

Awesome ;)

Tags: , , , , , , , ,

Comments

Continuous deployment

This is awesome, if a little insane. Continuous Deployment at IMVU: Doing the impossible fifty times a day:

Continuous Deployment means running all your tests, all the time. That means tests must be reliable. We’ve made a science out of debugging and fixing intermittently failing tests. When I say reliable, I don’t mean “they can fail once in a thousand test runs.” I mean “they must not fail more often than once in a million test runs.” We have around 15k test cases, and they’re run around 70 times a day. That’s a million test cases a day. Even with a literally one in a million chance of an intermittent failure per test case we would still expect to see an intermittent test failure every day. It may be hard to imagine writing rock solid one-in-a-million-or-better tests that drive Internet Explorer to click ajax frontend buttons executing backend apache, php, memcache, mysql, java and solr. I am writing this blog post to tell you that not only is it possible, it’s just one part of my day job.

OK, so far, so sensible. But this is where it gets really hairy:

Back to the deploy process, nine minutes have elapsed and a commit has been greenlit for the website. The programmer runs the imvu_push script. The code is rsync’d out to the hundreds of machines in our cluster. Load average, cpu usage, php errors and dies and more are sampled by the push script, as a basis line. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed. This whole process is simple enough that it’s implemented by a handfull of shell scripts.

Mental. So what we’ve got here is:

  • phased rollout: automated gradual publishing of a new version to small subsets of the grid.

  • stats-driven: rollout/rollback is controlled by statistical analysis of error rates, again on an automated basis.

Worth noting some stuff from the comments. MySQL schema changes break this system:

Schema changes are done out of band. Just deploying them can be a huge pain. Doing an expensive alter on the master requires one-by-one applying it to our dozen read slaves (pulling them in and out of production traffic as you go), then applying it to the master’s standby and failing over. It’s a two day affair, not something you roll back from lightly. In the end we have relatively standard practices for schemas (a pseudo DBA who reviews all schema changes extensively) and sometimes that’s a bottleneck to agility. If I started this process today, I’d probably invest some time in testing the limits of distributed key value stores which in theory don’t have any expensive manual processes.

They use an interesting two-phased approach to publishing of the deploy file tree:

We have a fixed queue of 5 copies of the website on each frontend. We rsync with the “next” one and then when every frontend is rsync’d we go back through them all and flip a symlink over.

All in all, this is very intriguing stuff, and way ahead of most sites. Cool!

(thanks to Chris for the link)

Tags: , , , , , , , ,

Comments (4)

Links for 2008-10-01

Tags: , , , , , , , , , , , , , , , , , , , , , , , , , , , ,

Comments (3)

Links for 2008-09-18

Tags: , , , , , , , , , , , , , , ,

Comments (2)

Links for 2008-08-20

Tags: , , , , , , , , , , , , , , ,

Comments

Links for 2008-08-13

Tags: , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,

Comments (1)

Links for 2008-07-22

ZSFA — I Want The Mutt Of Feed Readers Zed recommends Newsbeuter. must take a look

We Want A Dead Simple Web Tablet For $200. Help Us Build It. having worked on a project to do just this, believe me, this is doomed. DOOMED

Science Clouds ‘compute cycles in the cloud for scientific communities .. allows you to provision customized compute nodes .. that you have full control over using a leasing model based on the Amazon’s EC2 service.’ Wonder if they’d like to give SA some time ;)

Tags: , , , , , , , , , , , , , , , , , , , , , ,

Comments (1)

Links for 2008-07-21

O2 Leaking Customer Photos (updated) the JBoss/Tomcat install leaks the “secret” URLs through it’s default status page. this is the 3rd helping of FAIL for O2’s web team; 2 previous occasions in the last year exposed customer data through “secret” URL manipulation

Avant Window Navigator “a ‘dock-like’ (cough) navigator bar for the Linux desktop” (via Danny, again!)

trickle ‘user-space bandwidth shaper’, ie. like nice(1) for network bandwidth (via Danny)

RFC 5218 – What Makes For a Successful Protocol? ‘Based on case studies, this document identifies some of the factors influencing success and failure of protocol designs.’ (via spicylinks)

Tags: , , , , , , , , , , , , , , , , , , , , , , , , , , , ,

Comments

Hack: twitter_no_popups.user.js

Twitter has this nasty habit — if you come across a tweet in your feed reader containing a URL, and you want to follow that link, you can’t, because Twitter doesn’t auto-link URLs in its RSS feeds. Instead, you have to click on the feed item, itself, wait for that to open in the browser, then click on the link in the new browser tab. That link will, in turn, open in another new tab.

Here’s a quick-hack Greasemonkey user script to inhibit this second new-tab:

twitter_no_popups.user.js

Tags: , , , , ,

Comments (3)

adding to the “Going Dark” and DVCS debate

On programmers “going dark” — Aristotle Pagaltzis writes:

Jeff Atwood argues that open source projects are in real danger of programmers “going dark,” which means they lock themselves away silently for a long time, then surface with a huge patch that implements a complex feature.

It seems to me that this is as much a technological problem as a social issue… and that we have the technological solution figured out: it’s called distributed version control. It means that that lone developer who locked himself in a room need not resurface with a single huge patch – instead, he can come back with a branch implementing the feature in individually comprehensible steps. At the same time, it allows the lone programmer to experiment in private and throw away the most embarrassing mistakes, addressing part of the social problem.

However, I don’t think he realised that the Jeff Atwood story he responded to was in fact an echo of Ben Collins-Sussman’s original article, where he specifically picked out DVCS as a source of this danger:

A friend of mine works on several projects that use git or mercurial. He gave me this story recently. Basically, he was working with two groups on a project. One group published changes frequently…

“…and as a result, I was able to review consistently throughout the semester, offering design tweaks and code reviews regularly. And as a result of that, [their work] is now in the mainline, and mostly functional. The other group [...] I haven’t heard a peep out of for 5 months. Despite many emails and IRC conversations inviting them to discuss their design and publish changes regularly, there is not a single line of code anywhere that I can see it. [...] Last weekend, one of them walked up to me with a bug [...] and I finally got to see the code to help them debug. I failed, because there are about 5000 lines of crappy code, and just reading through a single file I pointed out two or three major design flaws and a dozen wonky implementation issues. I had admonished them many times during these 5 months to publish their changes, so that we (the others) could take a look and offer feedback… but each time met with stony silence. I don’t know if they were afraid to publish it, or just don’t care. But either way, given the code I’ve seen, the net result is 5 wasted months.”

Before you scream; yes yes, I know that the potential for cave-hiding and writing code bombs is also possible with a centralized version control system like Subversion. But my friend has an interesting point:

“I think this failure is at least partially due to the fact that [DVCS] makes it so damn easy to wall yourself into a cave. Had we been using svn, I think the barrier to caving would have been too high, and I’d have seen the code.”

In other words, yes, this was fundamentally a social problem. A team was embarrassed to share code. But because they were using distributed version control, it gave them a sense of false security. “See, we’re committing changes to our repository every day… making progress!” If they had been using Subversion, it’s much less likely they would have sat on a 5000 line patch in their working copy for 5 months; they would have had to share the work much earlier.

To be honest, I’d tend to agree with Aristotle; just because centralized VC makes it harder to maintain a “private branch” with this “high barrier to caving”, and this therefore imposes a technical pressure to fix a social problem, doesn’t mean that is a good thing. I’d prefer to fix the DVCS to apply social pressure, and have both working tools and a working social organisation.

Another commenter on Ben’s original post put it well:

I [..] disagree, strongly, that DVCS makes code hiding any more difficult than single-branch VCS. When using a single branch, it’s usually a very small group of people who are allowed to commit. Any patches from non-core contributors get lost in a tangle of IRC pastebins, mailing lists, bug trackers, and blog posts. Furthermore, even if these patches are eventually committed, they have lost all their associated version information — the destructive rebase you complain about. DVCS allows anybody to branch from trunk, record their changes, and publish their branch in a service like Launchpad or github. For an example of this, look at the mass of user-created branches for popular projects like GNOME Do or AWN.

It’s very interesting to see those Launchpad sites, in my opinion.

I’ve spent many years shepherding contributions to SpamAssassin through our Bugzilla. We’ve often lost rule contributors, who are particularly hard to attract for some reason, due to delays and human overhead involved in this method. :( So an improved interface for this would be very useful…

Tags: , , , , , ,

Comments (3)

Upgrading to Firefox 3

Firefox 3 Release Candidate 1 was released earlier this month. I’ve upgraded.

I tried switching to it a couple of months back, but gave up, since my favourite extensions were AWOL. This time around though, they’re almost all present. Since Firefox is now basically an operating system in its own right, with upgrade pain all of its own, and a couple of people have asked, here’s what I needed to do to get from Firefox 2 to 3:

Make a list of my favoured extensions

Namely, from most important to least:

Create a new Mozilla profile

This allowed me to keep my Firefox 2.0 settings entirely intact, a key step. Install Firefox 3, and start it with “firefox -ProfileManager”, then create a new profile and start with that.

Get installing

The following extensions from the above list were available by now for Firefox 3, through addons.mozilla.org:

Firebug was slightly trickier, since you need the 1.1 beta version, directly from their site 1.2 beta version, specially designed for Firefox 3 support, available only from their ‘releases’ page.

However, Greasemonkey, SubmitToTab, and MozEx were still missing. :(

Greasemonkey, thankfully, wasn’t too hard to find — the latest nightly build from this directory does the trick.

MozEx seems dead — the Firefox 2 support was added in a development snapshot, and there’s no sign of Firefox 3 support. This was in danger of becoming a show-stopper, since I spend all day editing text in browser textareas in Trac, Bugzilla, and Wordpress — until I found It’s All Text!, which is even slightly prettier and simpler than MozEx. yay. The only thing to watch out for is that after setting the path to the editor command, I had to quit and restart the browser for it to recognise it as valid.

SubmitToTab is the only desirable plugin remaining. It looks like it won’t be making it any time soon, but I’m prepared to live without it. ;)

Also, while discussing this on Twitter, Vipul wondered if XPather was available — turns out that yes, v1.4 of XPather supports FF3. Looks cool too; I’ve installed it ;)

Copy bookmarks

Exit the browser, copy the “bookmarks.html” file from the old profile directory (~/.mozilla/firefox/jocfzbfo.jm in my case) to the new one (~/.mozilla/firefox/7bkf89ws.ff3), and restart it.

I didn’t bother copying cookies — I’m happy to log in again on all those sites. (I don’t like carrying too much baggage between upgrades…)

I also opened the Greasemonkey user scripts dir (~/.mozilla/firefox/jocfzbfo.jm/gm_scripts), clicked on each script there, and installed them that way to FF3. A little laborious, but nothing serious really.

Done!

End result: I’m using FF3, and it’s working quite nicely. Memory usage is consistently below 300MB, so far — I haven’t seen any bloating yet, which is a big improvement. I’m probably going to stick with it.

One thing: I did have to turn off the new image scaling effect, however — text font size modification also now scales images to match, which is very annoying (and jaggy). No Squint allows this quite neatly.

Tags: , , , , ,

Comments (9)

A historical DailyWTF moment

Today, in work, we wound up discussing this classic DailyWTF.com article — “Remember, the enterprisocity of an application is directly proportionate to the number of constants defined”:

public class SqlWords
{
  public const string SELECT = " SELECT ";
  public const string TOP = " TOP ";
  public const string DISTINCT = " DISTINCT ";
  /* etc. */
}

public class SqlQueries { public const string SELECT_ACTIVE_PRODCUTS = SqlWords.SELECT + SqlWords.STAR + SqlWords.FROM + SqlTables.PRODUCTS + SqlWords.WHERE + SqlColumns.PRODUCTS_ISACTIVE + SqlWords.EQUALS + SqlMisc.NUMBERS_ONE; /* etc. */ }

This made me recall the legendary source code for the original Bourne shell, in Version 7 Unix. As this article notes:

Steve Bourne, at Bell Labs, worked on his version of shell starting from 1974 and this shell was released in 1978 as Bourne shell. Steve previously was involved with the development of Algol-68 compiler and he transferred general approach and some syntax sugar to his new project.

“Some syntax sugar” is an understatement. Here’s an example, from cmd.c:

LOCAL REGPTR    syncase(esym)
        REG INT esym;
{
        skipnl();
        IF wdval==esym
        THEN    return(0);
        ELSE    REG REGPTR      r=getstak(REGTYPE);
                r->regptr=0;
                LOOP wdarg->argnxt=r->regptr;
                     r->regptr=wdarg;
                     IF wdval ORF ( word()!=')' ANDF wdval!='|' )
                     THEN synbad();
                     FI
                     IF wdval=='|'
                     THEN word();
                     ELSE break;
                     FI
                POOL
                r->regcom=cmd(0,NLFLG|MTFLG);
                IF wdval==ECSYM
                THEN    r->regnxt=syncase(esym);
                ELSE    chksym(esym);
                        r->regnxt=0;
                FI
                return(r);
        FI
}

Here are the #define macros Bourne used to “Algolify” the C compiler, in mac.h:

/*
 *      UNIX shell
 *
 *      S. R. Bourne
 *      Bell Telephone Laboratories
 *
 */

define LOCAL static

define PROC extern

define TYPE typedef

define STRUCT TYPE struct

define UNION TYPE union

define REG register

define IF if(

define THEN ){

define ELSE } else {

define ELIF } else if (

define FI ;}

define BEGIN {

define END }

define SWITCH switch(

define IN ){

define ENDSW }

define FOR for(

define WHILE while(

define DO ){

define OD ;}

define REP do{

define PER }while(

define DONE );

define LOOP for(;;){

define POOL }

define SKIP ;

define DIV /

define REM %

define NEQ ^

define ANDF &&

define ORF ||

define TRUE (-1)

define FALSE 0

define LOBYTE 0377

define STRIP 0177

define QUOTE 0200

define EOF 0

define NL '\n'

define SP ' '

define LQ '`'

define RQ '\''

define MINUS '-'

define COLON ':'

define MAX(a,b) ((a)>(b)?(a):(b))

Having said all that, the Bourne shell was an awesome achievement; many of the coding constructs we still use in modern Bash scripts, 30 years later, are identical to the original design.

Tags: , , , , , , , , , , , ,

Comments (8)

My Nokia 770

A couple of weeks back, there was quite a bit of buzz in the Irish blogosphere and elsewhere about the Nokia 770; prices for new N770s had dropped from $290ish to a very reasonable $140 / EUR130-ish price-point. I, along with a good few others, bought one.

I bought mine through Expansys, with a free 1GB RS-MMC memory card. They’ve sold out and no longer have any N770s listed; however, Buy.com still seem to have them in stock, so if you’re interested, you can probably still pick one up. (It seems Nokia is trying to sell off their remaining N770 stock, cheap, with plans to drop support for the software platform. I’m fine with this, but it may put other buyers off.)

I’ve now been using it for a while, and am still happy. ;) Here are my recommended top apps:

Slimserver. Originally designed to operate as the backend software for the Squeezebox thin-client MP3 player, this has a fantastic UI built for the N770, and its MP3 stream output works perfectly on the tablet.

This is by far the neatest way to get at a 6000-song music library without a laptop; there was some talk in the GNOME community of making a decent DAAP client, but so far there’s no working results there that I could find. :(

maemo-mapper. This is a fantastic mapping app for the tablet; it presents map tiles downloaded from OpenStreetMap or Google Maps in an N770-optimized format, with the usual nice draggable UI. Bonus: it’ll work offline, so you can follow a route while online, then take the tablet along to help navigate.

Tip: once you start maemo-mapper, click the “Download…” button in the “Repository Manager” and it’ll download details for the 5 most useful map repositories, including Google and Virtual Earth.

FBReader. A very nice document reader; much nicer than trying to read long HTML pages in the builtin web browser, especially since it allows you to turn the device on its side.

In general, the Opera Mini browser works fine; be sure to enable Javascript and set up a swap file on the RS-MMC card first. It does all the basic HTML and rudimentary AJAX; Google Calendar is a no-go, but GMail and even Google Maps works adequately, modulo minor bugs. Plain Old HTML sites like Wikipedia, IMDB and so on all work great.

As long as you’re realistic about the platform, it won’t disappoint — video requires custom transcoding, for example, and proprietary apps like Flash and RealPlayer lag behind their desktop equivalents, but as far as I can tell that’s the case for every embedded platform. (Since I spent a couple of years developing such a platform, I’m quite comfortable with this.)

A really really nifty thing about the N770 is that it’s now entirely hackable — within 30 minutes of powering on, I was able to get a terminal window open with a root prompt, and was adding ext3 partitions to the RS-MMC card. Apps are installed using “apt-get”. The terminal even has word-completion system optimized for the UNIX command-line – nice ;)

This SomethingAwful thread contains plenty more good tips. I’m happy I bought it — so many of these gadgets can wind up as an overpriced door-stop, but this is easily worth what I paid for it.

Update: this thread at InternetTabletTalk seems pretty chock-full of good advice, too.

Tags: , , , ,

Comments (7)

Ohloh

Ohloh is fun — a social networking site for free software and open source developers! Well, I guess Advogato was there first, but this is quite a bit more web2.0-compliant. ;) There are some cool features, like neat Sparkline graphs of commits per person over time. nifty.

Here I am, and here’s the Apache SpamAssassin project page.

Tags: , , , , , ,

Comments (1)

Host monitoring with Jaiku

A few weeks back, we were having trouble with dogma, our shared server where taint.org is hosted, which would occasionally be unavailable for unknown reasons. We needed to monitor its availability so that it could be fixed when it crashed again, and we’d be able to investigate quickly. Since it was happening mostly out of working hours, SMS notification was essential.

Normally, that kind of monitoring is pretty basic stuff, and there’s plenty of services out there, from Host-Tracker.com to the more complex self-hosted apps like monit and Nagios which can do that. But looking around, I found that none of them offered SMS notification for free, and since this was our personal-use server, I wasn’t willing to sign up for a $10-per-month paid account to support it, or buy any hardware to act as a private SMS gateway.

Instead, I thought of Jaiku — the Finnish company which offers a microblogging/presence platform similar to Twitter. Jaiku had a couple of cool features:

  • SMS notifications
  • it’s possible to broadcast messages to a “channel”, which others could subscribe to, IRC-style
  • it has an open API

This would allow me to notify any interested party of dogma’s downtime, allowing subscribers to subscribe and unsubscribe using whatever notification systems Jaiku support.

With a little perl and LWP, I rigged up a quick monitoring script to check http://taint.org/ via HTTP, and report if it was unavailable over the course of 5 retries in 50 seconds. If it was broken, the script sends a JSON-formatted POST request to Jaiku’s “presence.send” method, informing the target channel of the issue. (Perl source here.)

You can see the ‘#dogmastatus’ channel here — as you can see, we fixed the problem with dogma just over 2 weeks ago ;)

It’s worth noting that I had to set up an additional user, “downtimebot”, on Jaiku to send the messages — otherwise I’d never see them on my configured mobile phone! Jaiku uses the optimisation that, if I sent the message, there’s no need to cc me with a copy of what I just sent; logical enough.

Anyway, if you’re interested in dogma’s availability (there might be one or two taint.org readers who are), feel free to add yourself to the #dogmastatus channel and receive any updates.

Update: Fergal noted that it’s pretty simple to use Cape Clear’s assembly framework to perform a HTTP ping test with output to Jabber/XMPP. nifty!

Tags: , , , , , , ,

Comments (7)

SpamAssassin 3.2.0!

W00t! SpamAssassin 3.2.0 has finally gone gold!

This release is a big one — it’s the first major release since 3.1.0, back in September 2005, just over a year and a half ago. Here is the release announcement mail, containing a list of major changes since version 3.1.8. There are a few major new features that I feel worth picking out in more detail and editorialising about:

sa-compile

This is a biggie. This new script takes the active SpamAssassin ruleset, and uses code contributed by Matt Sergeant to produce input for re2c. re2c in turn compiles the ruleset into a deterministic finite automaton, which can match multiple regular expressions in parallel. That’s not all, though; re2c then compiles that DFA into C code — which is then compiled into native object code. SpamAssassin will then load that object code and use it to replace the slower perl regexp tests, if it’s available at scan-time.

Now, it’s been a long time since SpamAssassin’s ruleset consisted mainly of rudimentary regular expressions matched against the body text — a good portion of SpamAssassin’s ruleset these days operates against headers, performs network lookups, analyzes URLs extracted from the body, uses the more advanced features supported by Perl’s NFA regexp engine, or so on. But even given that, the effects of ’sa-compile’ seem to average between a 15% and 25% speedup, in my testing. That’s good ;)

Many of the commercial versions of SpamAssassin include their own body-rule speedups — but this is the first time anything similar has made it into the open source code.

Short-circuiting

Another good one for performance. There are some rules that you can reasonably assume will never hit nonspam or spam mail in a well-configured setup. For example, a hit on “ALL_TRUSTED” should mean that the message never traversed an untrusted network, therefore it cannot be spam, so why bother applying the expensive tests? It should be reasonable to “short-circuit” and immediately return a “ham” score for that mail.

This new plugin implements that algorithm — and efficiently, too, which historically has been the hard part!

I’ve been using this for a while with a ruleset like this one — in my experience, it’s cut overall CPU time spent scanning mail by 20%.

It is pretty flexible, too — there’s lot of tweakage that can be done with this functionality to suit your own setup.

Reduced memory footprint

One aim of this release has been to reduce the memory usage of SpamAssassin; the core code now uses less RAM than 3.1.x does, when tested with the same ruleset. (Unfortunately we’ve added lots more rules in the interim, so it’s a bit of a wash overall. ;)

The VBounce anti-bounce ruleset

Detects spurious bounce messages sent by broken mail systems in response to spam or viruses. More info about that here.

Apache-spamd

apache-spamd implements spamd as a mod_perl module. This was contributed by Radoslaw Zielinski, as a Google Summer of Code project last year. Thanks Radoslaw!

There are plenty more new, useful features and rules — these are just the top ones, in my opinion. Pretty cool stuff!

Tags: , , , , , , , ,

Comments (2)

Bar Camp Dublin next weekend

Dublin hackers/software people — don’t forget! Bar Camp Dublin is happening on April 21st — that’s 9 days from now.

It should be interesting — there are 93 attendees signed up already, and I see a good few familiar names I haven’t run into in a while! The last Bar Camp was a good opportunity to meet up for some very informal talks, and this looks likely to be the same.

Sign up here, go on…

Tags: , , , ,

Comments

A SpamAssassin rule-discovery algorithm

Just to get a little techie again… here’s a short article on a new algorithm I’ve come up with.

Text-matching rule-based anti-spam systems are pretty common — SpamAssassin, of course, is probably the most well-known, and of course the proprietary apps built on SpamAssassin also use this. However, other proprietary apps also seem to use similar techniques, such as Symantec’s Brightmail and MessageLabs’ scanner (hi Matt ;) — and doubtless there are others. As a result, ways to write rules quickly and effectively are valuable.

So far, most SpamAssassin text rules are manually developed; somebody looks at a few spam samples, spots common phrases, and writes a rule to match that. It’d be great to automate more of that work. Here’s an algorithm I’ve developed to perform this in a memory-efficient and time-efficient way. I’m quite proud of this, so thought it was worth a blog posting. ;)

Corpus collection

First, we collect a corpus of spam and “ham” (non-spam) mails. Standard enough, although in this case it helps to try to keep it to a specific type of mail (for example, a recent stock spam run, or a run from the OEM spammer).

Typically, a simple “grep” will work here, as long as the source corpus is all spam anyway; a small number of irrelevant messages can be left in, as long as the majority 80% or so are variations on the target message set. (The SpamAssassin mass-check tool can now perform this on the fly, which is helpful, using the new ‘GrepRenderedBody’ mass-check plugin.)

Rendering

Next, for each spam message, render the body. This involves:

  • decoding MIME structure
  • discarding non-textual parts, or parts that are not presented to the viewer by default in common end-user MUAs (such as attachments)
  • decoding quoted-printable and base64 encoding
  • rendering HTML, again based on the behaviour of the HTML renderers used in common end-user MUAs
  • normalising whitespace, “this is\na \ntest” -> “this is a test”

All pretty basic stuff, and performed by the SpamAssassin “body” rendering process during a “mass-check” operation. A SpamAssassin plugin outputs each message’s body string to a log file.

Next, we take the two log files, and process them using the following algorithm:

N-gram Extraction

Iterate through each mail message in the spam set. Each message is assigned a short message ID number. Cut off all but the first 32 kbytes of the text (for this algorithm, I think it’s safe to assume that anything past 32 KB will not be a useful place for spammers to place their spam text). Save a copy of this shortened text string for the later “collapse patterns” step.

Split the text into “words” — ie. space-separated chunks of non-whitespace chars. Compress each “word” into a shorter ID to save space:

"this is a test" => "a b c d"

(The compression dictionary used here is shared between all messages, and also needs to allow reverse lookups.)

Then tokenize the message into 2-word and 3-word phrase snippets (also known as N-grams):

"a b c d" => [ "a b", "b c", "c d", "a b c", "b c d" ]

Remove duplicate N-grams, so each N-gram only appears once per message.

For each N-gram token in this token set, increment a counter in a global “token count” hashtable, and add the message ID to the token’s entry in a “message subset hit” table.

Next, process the ham set. Perform the same algorithm, except: don’t keep the shortened text strings, don’t cut at 32KB, and instead of incrementing the “token count” hash entries, simply delete the entries in the “token count” and “message subset hit” tables for all N-grams that are found.

By the end of this process, all ham and spam have been processed, and in a memory-efficient fashion. We now have:

  • a table of hit-counts for the message text N-grams, with all N-grams where P(spam) < 1.0 — ie. where even a single ham message was hit — already discarded
  • the “message subset hit” table, containing info about exactly which subset of messages contain a given N-gram
  • the token-to-word reverse-lookup table

To further reduce memory use, the word-to-token forward-lookup table can now be freed. In addition, the values in the “message subset hit” table can be replaced with their hashes; we don’t need to be able to tell exactly which messages are listed there, we just need a way to tell if one entry is equal to another.

Summarisation

Iterate through the hit-count table. Discard entries that occur too infrequently to be listed; discard, especially, entries that occur only once. (We’ve already discarded entries that hit any ham.)

Make a hash that maps the message subsets to the set of all N-gram patterns for that message-subset. For each subset, pick a single N-gram, and note the hit-count associated with it as the hit-count value for that entire message-subset. (Since those N-grams all appear in the exact same subset of messages, they will always have the same P(spam) — this is a safe shortcut.)

Iterate through the message subsets, in order of their hit-count. Take all of the message-subset’s patterns, decode the N-grams in all patterns using the token-to-word reverse-lookup table, and apply this algorithm to that pattern set:

Collapse patterns

So, input here is an array of N-gram patterns, which we know always occur in the same subset of messages. We also have the saved array of all spam messages’ shortened text strings, from the N-gram extraction step. With this, we can apply a form of the BLAST pattern-discovery algorithm, from bioinformatics.

Pop the first entry off the array of patterns. Find any one mail from the saved-mails array that hits this pattern. Find the single character before the pattern in this mail, and prepend it to the pattern. See if the hits for this new pattern are the same message set as hit the old pattern; if not, restore the old pattern and break. If you hit the start of the mail message’s text string, break. Then apply the same algorithm forward through the mail text.

By the end of that, you have expanded the pattern from the basic N-gram as far as it’s possible to go in both directions without losing a hit.

Next, discard all patterns in the pattern array that are subsumed by (ie. appear in) this new expanded pattern. Add it to the output list of expanded patterns, unless it in turn is already subsumed by a pattern in that list; discard any patterns in the output list that are subsumed by this new pattern; and move onto the next pattern in the input list until they’re all exhausted.

(By the way, the “discard if subsumed” trick is the reason why we start off with 3-word N-grams — it gives faster results than just 2-word N-grams alone, presumably by reducing the amount of work that this collapse stage has to do, by doing more of it upfront at a relatively small RAM cost.)

Summarisation (continued)

Finally, output a line listing the percentage of the input spam messages hit (ie. (hit-count value / total number of spams) * 100) and the list of expanded patterns for that message-subset, then iterate on to the next message-subset.

Example

Here’s an example of some output from recent “OEM” stock spam:

$ ./seek-phrases-in-corpus --grep 'OEM' \
        spam:dir:/local/cor/recent/spam/.2007022 \
        ham:dir:/local/cor/recent/ham/.200702
[mass-check progress noises omitted]
 RATIO   SPAM%    HAM%   DATA
 1.000  72.421   0.000  / OEM software - throw packing case, leave CD, use electronic manuals. Pay for software only and save 75-90%! /,
                         / TOP 1O ITEMS/
 1.000  73.745   0.000  / $99 Macromedia Studio 8 $59 Adobe Premiere 2.0 $59 Corel Grafix Suite X3 $59 Adobe Illustrator CS2 $129 Autodesk Autocad 2007 $149 Adobe Creative Suite 2 /,
                         /s: Adobe Acrobat PR0 7 $69 Adobe After Effects $49 Adobe Creative Suite 2 Premium $149 Ableton Live 5.0.1 $49 Adobe Photoshop CS $49 http:\/\//,
                         / Microsoft Office 2007 Enterprise Edition Regular price: $899.00 Our offer: $79.95 You save: $819.95 (89%) Availability: Pay and download instantly. http:\/\//,
                         / Adobe Acrobat 8.0 Professional Market price: $449.00 We propose: $79.95 Your profit: $369.05 (80%) Availability: Available for /,
                         / $49 Windows XP Pro w\/SP2 $/,
                         / Top-ranked item. (/,
                         /, use electronic manuals. Pay for software only and save 75-90%! /,
                         / Microsoft Windows Vista Ultimate Retail price: $399.00 Proposition: $79.95 Your benefit: $319.05 (80%) Availability: Can be downloaded /,
                         / $79 MS Office Enterprise 2007 $79 Adobe Acrobat 8 Pro $/,
                         / Best choice for home and professional. (/,
                         / OEM software - throw packing case, leave CD/,
                         / Sales Rank: #1 (/,
                         / $79 Microsoft Windows Vista /,
                         / manufacturers: Microsoft...Mac...Adobe...Borland...Macromedia http:\/\//
 1.000  73.855   0.000  / MS Office Enterprise 2007 /,
                         /9 Microsoft Windows Vista /,
                         / Microsoft Windows Vista Ultimate /,
                         /9 Macromedia Studio 8 /,
                         / Adobe Acrobat 8.0 /,
                         / $79 Adobe /
 1.000  74.242   0.000  / Windows XP Pro/
 1.000  74.297   0.000  / Adobe Acrobat /
 1.000  74.462   0.000  / Adobe Creative Suite /
 1.000  74.573   0.000  / Adobe After Effects /
 1.000  74.738   0.000  / Adobe Illustrator /
 1.000  74.959   0.000  / Adobe Photoshop CS/
 1.000  75.014   0.000  / Adobe Premiere /
 1.000  75.290   0.000  / Macromedia Studio /
 1.000  75.786   0.000  /OEM software/
 1.000  75.841   0.000  / Creative Suite /
 1.000  75.896   0.000  / Photoshop CS/
 1.000  75.951   0.000  / After Effects /
 1.000  76.062   0.000  /XP Pro/
 1.000  82.460   0.000  / $899.00 Our /,
                         / Microsoft Office 2007 Enterprise /,
                         / $79.95 You/

Immediately, that provides several useful rules; in particular, that final set of patterns can be combined with a SpamAssassin “meta” rule to hit 82% of the samples. Generating this took a quite reasonable 58MB of virtual memory, with a runtime of about 30 minutes, analyzing 1816 spam and 7481 ham mails on a 1.7Ghz Pentium M laptop.

(Update:) here’s a sample message from that test set, demonstrating the top extracted snippets in bold:

  Return-Path: <tyokaluassa.com@ultradian.com>
  X-Spam-Status: Yes, score=38.2 required=5.0 tests=BAYES_99,DK_POLICY_SIGNSOME,
          FH_HOST_EQ_D_D_D_D,FH_HOST_EQ_VERIZON_P,FH_MSGID_01C67,FUZZY_SOFTWARE,
          HELO_LOCALHOST,RCVD_IN_NJABL_DUL,RCVD_IN_PBL,RCVD_IN_SORBS_DUL,RDNS_DYNAMIC,
          URIBL_AB_SURBL,URIBL_BLACK,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_RHS_DOB,
          URIBL_SBL,URIBL_SC_SURBL shortcircuit=no autolearn=spam version=3.2.0-r492202
  Received: from localhost (pool-71-125-81-238.nwrknj.east.verizon.net [71.125.81.238])
          by dogma.boxhost.net (Postfix) with SMTP id E002F310055
          for <xxxxxxxxxxx@jmason.org>; Sun, 18 Feb 2007 08:58:20 +0000 (GMT)
  Message-ID: <000001c7533a$b1d3ba00$0100007f@localhost>
  From: "Kevin Morris" <tyokaluassa.com@ultradian.com>
  To: <xxxxxxxx@jmason.org>
  Subject: Need S0ftware?
  Date: Sun, 18 Feb 2007 03:57:56 -0500

OEM software - throw packing case, leave CD, use electronic manuals. Pay for software only and save 75-90%!

Discounts! Special offers! Software for home and office! TOP 1O ITEMS.

$79 Microsoft Windows Vista Ultimate
$79 MS Office Enterprise 2007
$79 Adobe Acrobat 8 Pro
$49 Windows <b>XP Pro</b> w/SP2
$99 Macromedia Studio 8
$59 Adobe Premiere 2.0
$59 Corel Grafix Suite X3
$59 Adobe Illustrator CS2

$129 Autodesk Autocad 2007 $149 Adobe Creative Suite 2 http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t0

        Mac Specials:

Adobe Acrobat PR0 7 $69 Adobe After Effects $49 Adobe Creative Suite 2 Premium $149 Ableton Live 5.0.1 $49 Adobe Photoshop CS $49 http://ot.rezinkaoem.com/-software-for-mac-.php?0B85330BA896A9992D0561E08037493852CE 6E1FAE&t6

See more by this manufacturers: Microsoft...Mac...Adobe...Borland...Macromedia http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t4

Microsoft Windows Vista Ultimate Retail price: $399.00 Proposition: $79.95 Your benefit: $319.05 (80%) Availability: Can be downloaded INSTANTLY. http://ot.rezinkaoem.com/2480.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t3 Best choice for home and professional. (37268 reviews)

Microsoft Office 2007 Enterprise Edition Regular price: $899.00 Our offer: $79.95 You save: $819.95 (89%) Availability: Pay and download instantly. http://ot.rezinkaoem.com/2442.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t1 Sales Rank: #1 (121329 reviews)

Adobe Acrobat 8.0 Professional Market price: $449.00 We propose: $79.95 Your profit: $369.05 (80%) Availability: Available for INSTANT download. http://ot.rezinkaoem.com/2441.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t2 Top-ranked item. (31949 reviews)

Further work

Things that would be nice:

  • It’d be nice to extend this to support /.*/ and /.{0,10}/ — matching “anys”, also known as “gapped alignment” searches in bioinformatics, using algorithms like the Smith-Waterman or Needleman-Wunsch algorithms. (Update: this has been implemented.)
  • A way to detect and reverse-engineer templates, e.g. “this is foo”, “this is bar”, “this is baz” => “this is (foo|bar|baz)”, would be great.
  • Finally, heuristics to detect and discard likely-poor patterns are probably the biggest wishlist item.

Tuits are the problem, of course, since $dayjob is the one that pays the bills, not this work. :(

The code is being developed here, in SpamAssassin SVN. Feel free to comment/mail if you’re interested, have improvement ideas, or want more info on how to use it… I’d love to see more people trying it out!

Some credit: I should note that IBM’s Chung-Kwei system, presented at CEAS 2004, was the first time I’d heard of a pattern-discovery algorithm (namely, their proprietary Teiresias algorithm) being applied to spam.

Tags: , , , , , , ,

Comments (9)

BT’s daily disconnects, revisited

As I noted last year, BT, the ISP I use here in Ireland, disconnects broadband sessions on a daily basis, assigning a new IP address; this is really aggravating to anyone who uses a VPN, such as most telecommuters. Reportedly, this is done to work around deficiencies in their billing system.

A comment from Jeremy on that post suggested something interesting, though:

Just had a very helpful tech support guy on from BT. [... he] told me to restart the modem sometime that will make it convenient for the 24 hour IP change – i.e. restart it at 6am, and then it’ll change IP every day at 6am.

I’ve tested this, and it works. Much more convenient! Now the renumbering and VPN breakage can take place when I want it to — at the start of the workday, instead of some random point chosen by BT’s billing system. Quite an improvement.

To make this useful, here’s a script, “reboot-zyxel”, which will reboot your Zyxel P-660RU router remotely over the LAN. (It requires perl and curl.)

Tags: , , , , , , , , , , ,

Comments (13)

Kernighan and Pike on debugging

While reading the log4j manual, I came across this excellent quote from Brian W. Kernighan and Rob Pike’s “The Practice of Programming”:

As personal choice, we tend not to use debuggers beyond getting a stack trace or the value of a variable or two. One reason is that it is easy to get lost in details of complicated data structures and control flow; we find stepping through a program less productive than thinking harder and adding output statements and self-checking code at critical places. Clicking over statements takes longer than scanning the output of judiciously-placed displays. It takes less time to decide where to put print statements than to single-step to the critical section of code, even assuming we know where that is. More important, debugging statements stay with the program; debugging sessions are transient.

+1 to that.

Tags: , , , , , , ,

Comments (3)

Script: knewtab

Here’s a handy script for konsole users like myself:

knewtab — create a new tab in a konsole window, from the commandline

usage: knewtab {tabname} {command line …}

Creates a new tab in a “konsole” window (the current window, or a new one if the command is not run from a konsole).

Requires that the konsole app be run with the “–script” switch.

Download ‘knewtab.txt’

Tags: , , , ,

Comments

SpicyLinks and del.icio.us Network Summarization

Ross Mayfield:

Every time I see Gabe Rivera of TechMeme, I ask for the same thing — MeMeme. Give me TechMeme where the core index is based on who I read, about 150 people at any given time, to show me what my friends are interested in.

Funnily eough, that is exactly why I wrote SpicyLinks!

It works pretty well — in fact, nowadays I don’t really bother reading slashdot, Digg, Reddit, et al, particularly frequently, because I know that all the really interesting stuff will be at the top of my newsreader in the SpicyLinks feed.

Anyway, I’ve been calling SpicyLinks a ’summarizing aggregator’, but the discussion that arose from Ross’ posting inspired me. A little bit of hacking has come up with an interesting twist: take a del.icio.us social network, a CGI script called deliciousnetwork2opml.cgi, and 15 minutes hacking on SpicyLinks to support inclusion of OPML via a remote URI, and hey presto — it’s now a social-network summarising aggregator. ;)

Tags: , , , , , , ,

Comments (6)

“Stretch-to-fit Textareas” Greasemonkey User Script

Here’s another quick-hack Greasemonkey user script I wrote recently.

Stretch-to-fit Textareas is a user script which improves the usability of editable textareas; it causes them to “stretch” vertically to fit their contents, as you type. This behaviour was inspired by that of textareas in FogBugz.

It can be inhibited by turning off the small checkbox to the right of each textarea.

Update: it’s worth noting that this is different from the Resizeable Textareas Firefox extension. Whereas the latter allows the user to resize the textareas by hand, this user script does that action automatically, based on the contents of the field; no manual resize-handle-searching and dragging is required. On the other hand, this user script will only stretch textareas vertically, whereas the extension allows them to be dragged in both dimensions. In fact, the two are complementary — I’m running both, and I suggest you do too ;)

Update 2: here’s a Firefox extension version — Greasemonkey not required!

Tags: , , , , , , ,

Comments (1)

A Released Perl With Trie-based Regexps!

Good news! From the Perl 5.9.2 ‘perl592delta’ change log:

The regexp engine now implements the trie optimization : it’s able to factorize common prefixes and suffixes in regular expressions. A new special variable, ${^RE_TRIE_MAXBUF}, has been added to fine-tune this optimization.

in other words, the trie-optimization patch contributed by demerphq back in March 2005 is now in a released build of Perl. Yay!

Here’s a writeup of what it does:

A trie is a way of storing keys in a tree structure where the branching logic is determined by the value of the digits of the key. Ie: if we have “car”, “cart”, “carp”, “call”, “cull” and “cars” we can build a trie like this:

        c + a + r + t
          |   |   |
          |   |   + p
          |   |   |
          |   |   + s
          |   | 
          |   + l - l
          |
+ u - l - l

What the patch does is make /a | list | of | words/ into a trie that matches those words. This means that we can efficiently tell if any of the words are at a given location in a strng by simply walking the string and trie at the same time. In many cases we can rule out the entire list by looking at only one character of the input. The current way perl handles this would require looking at N chars where N is the number of words involved. (BTW: Thats the beauty of a trie, its lookup time is independent of the number of words it stores but rather on the key length of the word being looked up. )

SpamAssassin is, of course, both (a) very regular-expression-intensive and (b) searches a single block of text for a large number of independent patterns in parallel. I’d love to see someone coming up with a patch to SpamAssassin that uses trie-compatible regexps when the perl version is >= 5.9.2, and gets increased performance that way. hint ;)

BTW, the Regexp::Trie module on CPAN is related — in that it, similar to Regexp::Optimizer, Regex::PreSuf, or Regexp::Assemble, will compile a list of words or regular expressions into a super-efficient trie-style regexp. However, without the trie patch to the regexp engine itself, this would be a minor efficiency tweak at best; although having said that, Regexp::Assemble’s POD notes:

You should realise that large numbers of alternations are processed in perl’s regular expression engine in O(n) time, not O(1). If you are still having performance problems, you should look at using a trie. Note that Perl’s own regular expression engine will implement trie optimisations in perl 5.10 (they are already available in perl 5.9.3 if you want to try them out). Regexp::Assemble will do the right thing when it knows it’s running on a a trie’d perl. (At least in some version after this one).

(PS: interestingly, demerphq mentioned back in March 2005 that he was working on Aho-Corasick matching next. A-C is a great parallel-matching algorithm, and I would imagine it would increase performance yet more. I wonder what happened to that…)

Tags: , , , , ,

Comments (16)

Running Dapper

I took the plunge over the weekend, and live-upgraded the new ‘Dapper Drake’ Ubuntu release — ouch. Here’s the two key lessons I learned:

  • Don’t run “grub-install” in a misremembered attempt to update the current GRUB boot menu ‘menu.lst’ file with the new kernel; sadly, this will quietly remove important details from your old menu.lst, such as “initrd” lines, rendering those kernels unbootable. Moral: ensure brain is in gear before meddling with MBRs!

  • If you’re a Kubuntu user, watch out. Ensure you run apt-get install ubuntu-base ubuntu-desktop — bringing the entirety of GNOME up to date — as well as apt-get install kubuntu-desktop after the upgrade; it appears that some part of a new hotplugging subsystem is not included as a dependency of kubuntu-desktop. Failure to do this results in an inability to use USB/hotpluggable devices, including internal devices like the Synaptics touchpad. No pointer devices (mice or touchpads) means no X server at boot, which is always a little annoying.

Some day I’ll just do things the right way, and do a fresh-from-CD install instead. Ah well. The good stuff: the new kernel, or possibly Xorg, is proving to be a lot speedier — window updates are noticeably smoother; and the new Ubuntu GNOME theme is similarly tasty.

Tags: , , , , , , , ,

Comments (11)

Web 2.0 and Open Source

A commenter at this post on Colm MacCarthaigh’s weblog writes:

I guess I still don’t understand how Open Source makes sense for the developers, economically. I understand how it makes sense for adapters like me, who take an app like Xoops or Gecko and customize it gently for a contract. Saves me hundreds of hours of labour. The down side of this is that the whole software industry is seeing a good deal of undercutting aimed at sales to small and medium sized commercial institutions.

Similarly, in the follow-up to the O’Reilly “web 2.0″ trademark shitstorm, there’s been quite a few comments along the lines of “it’s all hype anyway”.

I disagree with that assertion — and Joe Drumgoole has posted a great list of key Web 2.0 vs Web 1.0 differentiators, which nails down some key ideas about the new concepts, in a clear set of one-liners.

Both open source software companies, and “web 2.0″ companies, are based on new economic ideas about software and the internet. There’s still quite a lot of confusion, fear and doubt about both, I think.

Open Source

As I said in my comment at Colm’s weblog — open source is a network effect. If you think of the software market as a single buyer and seller, with the seller producing software and selling to the buyer, it doesn’t make sense.

But that’s not the real picture of a software market. If you expand the picture beyond that, to a more realistic picture of a larger community of all sorts of people at all levels, with various levels interacting in a more complex maze of conversation and transactions, open source creates new opportunities.

Here’s one example, speaking from experience. As the developer of SpamAssassin, open source made sense for me because I could never compete with the big companies any other way.

If I had been considering it in terms of me (the seller) and a single customer (the buyer), economically I could make a case of ‘proprietary SpamAssassin’ being a viable situation — but that’s not the real situation; in reality there was me, the buyer, a few 800lb gorillas who could stomp all over any puny little underfunded Irish company I could put together, and quite a few other very smart people, who I could never afford to employ, who were happy to help out on ‘open-source SpamAssassin’ for free.

Given this picture, I’m quite sure that I made the right choice by open sourcing my code. Since then, I’ve basically had a career in SpamAssassin. In other words my open source product allowed me to make income that I wouldn’t have had, any other way.

It’s certainly not simple economics, is a risk, and is complicated, and many people don’t believe it works — but it’s viable as an economic strategy for developers, in my experience. (I’m not sure how to make it work for an entire company, mind you, but for single developers it’s entirely viable.)

Web 2.0

Similarly — I feel some of the companies that have been tagged as “web 2.0″ are using the core ideas of open source code, and applying them in other ways.

Consider Threadless, which encourages designers to make their designs available, essentially for free — the designer doesn’t get paid when their tee shirt is printed; they get entered into a contest to win prizes.

Or Upcoming.org, where event tracking is entirely user-contributed; there’s no professional content writers scribbling reviews and leader text, just random people doing the same. For fun, wtf!

Or Flickr, where users upload their photos for free to create the social experience that is the site’s unique selling point.

In other words — these companies rely heavily on communities (or more correctly certain actors within the community) to produce part of the system – exactly as open source development relies on bottom-up community contribution to help out a little in places.

The alternative is the traditional, “web 1.0″ style; it’s where you’re Bill Gates in the late 90’s, running a commercial software company from the top down.

  • You have the “crown jewels” — your source code — and the “users” don’t get to see it; they just “use”.
  • Then they get to pay for upgrades to the next version.
  • If you deal with users, it’s via your sales “channels” and your tech support call centre.
  • User forums are certainly not to be encouraged, since it could be a PR nightmare if your users start getting together and talking about how buggy your products are.
  • Developers (er, I mean “engineers”) similarly can’t go talking to customers on those forums, since they’ll get distracted and give away competitive advantage by accidentally leaking secrets.
  • Anyway, the best PR is the stuff that your PR staff put out — if customers talk to engineers they’ll just get confused by the over-technical messages!

Yeah, so, good luck with that. I remember doing all that back in the ’90’s and it really wasn’t much fun being so bloody paranoid all the time ;)

URLs:

(PS: The web2.0 companies aren’t using all of the concepts of open-source, of course — not all those web apps have their source code available for public reimplementation and cloning. I wish they were, but as I said, I can’t see how that’s entirely viable for every company. Not that it seems to stop the cloners, anyway. ;)

Tags: , , , , , , ,

Comments (15)

Script: new-referrer-rss

new-referrer-rss.pl – generate RSS feed of new referrer URLs from access_log

SYNOPSIS

new-referrers-rss nameofsite [source ...] > new-referrers.xml

DESCRIPTION

Given the name of a web site, and a selection of Apache combined log format ‘access_log’ files containing referrer URL data, this will generate an RSS feed containing the latest referrers.

The script should be run periodically with ‘fresh’ access_log data, from cron.

Tags: , , , , ,

Comments (6)

Software Patenting and “Hot” Fields

Paul Graham’s recent essay on his experience with software patenting has been making the rounds recently.

Now Kevin Marks has commented. Worth reading, since he demonstrates nicely the kind of crap you see in a ‘hot’ field, such as video (which he worked on with Apple’s Quicktime):

I broadly agree with Paul Graham’s essay on Software Patents, but I do think he underestimates the damage from patent trolls, and from what he calls the mafia-like behaviour of some patent holders. Paul has been lucky in the field he has worked in, but in the Audio and Video area there are many patent thickets. … While I was at Apple on QuickTime, there was a steady stream of patent trolls claiming that Apple should pay them royalties; enough to keep several lawyers busy, and a lot of engineers spending time working on prior art evidence demonstrations. Several potential features were excluded from QuickTime due to patent thickets. The obvious one was the Unisys LZW patent that encumbered GIF, but there were other more subtle pressures that meant adopting open source codecs was discouraged. Working on the patent license agreements for MPEG meant that technology ready to ship was deferred pending legal agreement on more than one occasion.

In my experience, that’s what happens — once a field becomes “hot”, patent trolls and other nuisance “inventors” start appearing en masse, and then you’ve got to waste a lot of time dealing with that crap.

Tags: , , ,

Comments

A Gotcha With perl’s “each()”

It’s my bi-monthly perl blog entry, to earn my place on planet.perl.org! ;)

Here’s an interesting “gotcha”. Take this code:

    perl -e '%t=map{$_=>1}qw/1 2 3/;
    while(($k,$v)=each %t){print "1: $k\n"; last;}
    while(($k,$v)=each %t){print "2: $k\n";}'

In other words, iterate through all the key-value pairs in %t once, then do it again — but exit early in the first loop.

You would expect to get something like this output:

    1: 1
    2: 1
    2: 3
    2: 2

instead, you see:

    1: 1
    2: 3
    2: 2

The “1″ entry in the second loop is AWOL. Here’s why — as “perldoc -f each” notes:

There is a single iterator for each hash, shared by all “each”, “keys”, and “values” function calls in the program

That’s all “each” calls, throughout the entire codebase, possibly in a different class entirely. Argh.

The workaround: reset the iterator using “keys” between calls to “each”:

    perl -e '%t=map{$_=>1}qw/1 2 3/;
    while(($k,$v)=each %t){print "1: $k\n"; last;}
    keys %t;
    while(($k,$v)=each %t){print "2: $k\n";}'

This got us in SpamAssassin — bug 4829.

To be honest, having to call “keys” after the loop is kludgy — as you can see if you check the patch in bug 4829 there, we had to change from a “return inside loop” pattern to a “set variable and exit loop, reset state, then return” pattern. It’d be nice to have a scoped version of each(), instead of this global scope, so that this would work:

    perl -e '%t=map{$_=>1}qw/1 2 3/;
    { while(($k,$v)=scoped_each %t){print "1: $k\n"; last;} }
    # that each() iterator is now out of scope, so GC'd;
    # the next call uses a new iterator, starting from scratch
    { while(($k,$v)=scoped_each %t){print "2: $k\n";} }'

Scoping, of course, has the benefit of allowing “return early” patterns to work; in my opinion, those are clearer — at the least because they require less lines of code ;)

Tags: , , , ,

Comments (4)

Another script: goog-love.pl

A quick hack –

goog-love.pl – find out where your site’s google juice comes from

This script will grind through your web site’s “access.log” file (which must be in the “combined” log format). It’ll pick out the top 100 Google searches found in the referer field, re-run those searches, and determine which ones are giving your website all the linky Google love — in other words, the searches that your site ‘wins’ on.

The output is in plain text and a chunk of HTML.

usage:

goog-love.pl sitehost google-api-key < access.log > out.html

e.g.

cat /var/www/logs/taint.org.* | goog-love.pl \
  taint.org 0xb0bd0bb5yourgoogleapikeyhere0xdeadbeef | tee out.html

NOTE: this script requires the SOAP::Lite module be installed. Install it using apt-get install libsoap-lite-perl or cpan SOAP::Lite. It also requires a Google API key.

For example, here are the current results for this site. You can immediately see some interesting stuff that’s not immediately obvious otherwise, such as my site being the top hit for [beardy justin] ;)

Download here (5 KiB perl script).

Notes:

  • if you see a lot of “502 Bad Gateway” errors, it’s probably over-zealous anti-bot ACLs on Google’s side. Try from another host.

  • Read the comments for notes on a bug in recent releases of SOAP::Lite; please let me know if you hear of them getting fixed ;)

Tags: , , , , , ,

Comments (5)

What Works in Software Development

I already posted this to the link-blog yesterday, but it’s so good it’s worth promoting more widely. If you write software for a living, you really ought to read the slides for Michael Schwern’s excellent ‘What Works In Software Development’ talk.

It’s a long presentation (108 slides!), but during the course of that, he covers:

  • effective teamwork
  • dealing with bad customers
  • dealing with bad management
  • classic coding mistakes
  • classic project management mistakes
  • classic design mistakes
  • test-driven development
  • refactoring
  • patterns

It’s a really good synthesis of what I think are the best bits of good OO design, XP, CPAN and perl’s design and coding styles, without most of the cruft. I’ll be pointing people at this for years to come, I think…

(Found via yoz.)

Tags: , , , , , , ,

Comments (1)

New SpamAssassin Rule Development Tools

Recently, I’ve been working on new systems to develop SpamAssassin rules faster, and with a lower ‘barrier to entry’ to the core ruleset. Some highlights seem bloggable, seeing as it’s all web-based and I can link to it!

The ‘preflight’ BuildBot:

This uses the fantastic BuildBot continuous-integration system to monitor changes to our Subversion repository.

Every time something is checked into SVN, this wakes up and immediately runs mass-checks using that latest code and rules, allowing near-real-time viewing of changes in rule behaviour. (A ‘mass-check’ is a massive run of SpamAssassin across a corpus of hundreds of thousands of emails, en masse, to measure rule hit-rates.)

The corpus it mass-checks is split in a certain way so that results will be available very quickly — typically in under 10 minutes — with increasing quantities of results becoming available as time elapses.

Progress of the mass-checks are visible at the BuildBot here; as they complete, their results become visible on the Rule-QA app (below). (More info, if you’re curious.)

The Rule-QA App:

To date, we’ve used the basic “freqs” table — output from the hit-frequencies command-line script — as the UI for rule QA and evaluation. This is fine for a small number of developers, but it scales badly and (like mass-checks) requires a pretty complex setup on the developer’s machine.

This new component is a web application, which takes the “freqs” table, and “webifies” it — demo.

Some major improvements are also made possible; the most important, that it can now display ‘freqs’ for multiple revisions during the day, and keeps historical data for comparison. It adds several new reports from ‘hit-frequencies’; a score-map, overlaps, a performance measurement, and a boolean ‘promoteability’ measurement.

Finally, a really useful new report is the graph of rule hit-rate, as it changes over time. Here’s a cached demo, or see the same data produced ‘live’. This gives a totally new insight into how the rule hits for various people’s corpora, how that changed over time, and allows a whole new type of rule analysis. (In fact, it also allows pretty good corpus analysis, too; can you tell which submitters bounce high-scoring spam at receipt time?)

(More info on these.)

Tags: , , , , , ,

Comments

Urban Dead HUD; added Inventory Sorting

I’ve updated the Urban Dead HUD Greasemonkey userscript; it now offers inventory sorting, inspired by Ikko’s userscript (albeit a little different in implementation). Here’s a screenshot:

Right now, UD is reasonably interesting — our team of plucky survivors have been helping out with the defence of Caiger Mall, a major mall towards the north-west of the city. We’ve repulsed the Church of the Resurrection’s attempts to wipe us out, but that seems to have made us quite a juicy target; there are now no less than three separate Zombie groups ganging up on us. For now, we’re still holding out.

Tags: , , , , ,

Comments (4)

False Positive ‘Reports’ != FP Measurement

John Graham-Cumming writes an excellent monthly newsletter on anti-spam, concentrating on technical aspects of detecting and filtering spam. Me, I have a habit of sending follow-up emails in response ;)

This month, it was this comment, from a techie at another software company making anti-spam products:

When I look at the stats produced on our spam traps, which get millions of messages per day from 11 countries all over the world, I see our spam catch rate being consistently over 98% and over 99% most of the time. We also don’t get more than 1 or 2 false positive reports from our customers per week, which can give an impression of our FP rate, considering the number of mailboxes we protect.

My response:

‘Worth noting that a “false positive report from our customer” is NOT the same thing as a “false positive” (although in fairness, [the sender] does note only that it will “give an impression” of their FP rate).

This is something that I’ve seen increasingly in the commercial anti-spam world — attempting to measure false positive rates from what gets reported “upstream” via the support channels.

In reality, the false positives are still happening — it’s just that there are obstacles between the end-user noticing them, and the FP report arriving on a developer’s desk; changes to the organisational structure, surly tech support staff, or even whether the user was too busy to send that report, will affect whether the FP is counted.

Many FPs will go uncounted as a result. As a result, IMO it is not a valid approach to measurement.’

I’ve been saying this a lot in private circles recently, so in my opinion that’s a good reason to post it here…

Tags: , , ,

Comments (7)

‘Life Hacking’ and Metacity

The NY Times story on “life hacking” is a pretty good one, and an excellent intro for anyone who hasn’t been religiously reading the changing transcripts of Danny O’Brien’s talk and so on.

This line:

Mann has embarked on a 12-step-like triage: he canceled his Netflix account, trimmed his instant-messaging “buddy list” so only close friends can contact him and set his e-mail program to bother him only once an hour.

Reminded me of something I ran into recently.

Last month, I switched from Sawfish, the venerable UNIX window manager, to GNOME’s Metacity, which is the new(ish) GNOME standard window manager. (I was tired of some long-standing Sawfish crashes, and didn’t want to be the last Sawfish user on the planet, which was seeming increasingly likely.)

One interesting UI change is that application windows no longer ‘pop up’ — if an app wants to notify you of some important change, it instead can only cause its taskbar button to subtly pulse in the corner of your screen.

Initially, this threw me for a loop, and I rudely (albeit accidentally) ignored my friends on IM and suchlike. But I quickly got the hang of glancing at the taskbar once in a while when I wasn’t concentrating on a task; it’s now second nature, and has significantly reduced the number of interruptions I find myself experiencing in a typical day.

BTW, in passing: switching WMs is a big deal, user interface-wise. One of the key gating factors, for me, was a feature I use to control windows without laying hands on the dreaded rodent — namely, a ‘move window to screen corner’ keyboard shortcut. This patch implements it for Metacity.

I implemented this last year for KWin, too, to resounding disapproval and bitchy comments about how I’m using the mouse all wrong. Meh. I fully expect the Metacity maintainers to throw it out, likewise, leaving me hand-patching WMs for a while yet ;)

Update, Nov 2006: they applied it! yay.

Tags: , , , , , , , , ,

Comments (5)

PRNGs and Groove Theory

Urban Dead is a new browser-based MMORPG that’s become popular recently. I’m not planning to talk about the game itself, at least not until I’ve played it a bit!, but there’s something worth noting here — a cheat called Groove Theory:

Groove Theory was a cheat for Urban Dead that claimed to exploit an apparent lack [sic] of a random number generator in the game, [so] that performing an action exactly eight seconds after a successful action would also be successful.

Kevan, the Urban Dead developer, confirmed that Groove Theory did indeed work, and made this comment after fixing the bug:

There is some pattern to the random numbers, playing around with them; “srand(time())” actually brings back some pretty terrible patterns, and an eight-second wait will catch some of these.

So — here’s my guess as to how this happened.

It appears that Urban Dead is implemented as a CGI script. I’ll speculate that somewhere near the top of the script, there’s a line of code along the lines of srand(time()), as Kevan mentioned. With a sufficiently fast network connection, and a sufficiently unloaded server, you can be reasonably sure that hitting “Refresh” will cause that srand call to be executed on the server within a fraction of a second of your button-press. In other words, through careful timing, the remote user can force the pseudo-random-number generator used to implement rand() into a desired set of states!

As this perl script demonstrates, the output from perl’s rand() is perfectly periodic in its low bits on a modern Linux machine, if constantly reseeded using srand()the demo script’s output decrements from 3 to 0 by 1 every 2 seconds, then repeats the cycle, indefinitely.

I don’t know if Urban Dead is a perl script, PHP, or whatever; but whatever language it’s written in, I’d guess that language uses the same PRNG implementation as perl is using on my Linux box.

As it turns out, this PRNG failing is pretty well-documented in the manual page for rand(3):

on older rand() implementations, and on current implementations on different systems, the lower-order bits are much less random than the higher-order bits. Do not use this function in applications intended to be portable when good randomness is needed.

That manual page also quotes Numerical Recipes in C: The Art of Scientific Computing (William H. Press, Brian P. Flannery, Saul A. Teukolsky, William T. Vetterling; New York: Cambridge University Press, 1992 (2nd ed., p. 277)) as noting:

“If you want to generate a random integer between 1 and 10, you should always do it by using high-order bits, as in

j=1+(int) (10.0*rand()/(RAND_MAX+1.0));

and never by anything resembling

j=1+(rand() % 10);

(which uses lower-order bits).”

I think Groove Theory demonstrates this nicely!

Update: I need to be clearer here.

Most of the Groove Theory issue is caused by the repeated use of srand(). If the script could be seeded once, instead of at every request, or if the seed data came from a secure, non-predictable source like /dev/random, things would be a lot safer.

However, the behaviour of rand() is still an issue though, due to how it’s implemented. The classic UNIX rand() uses the srand() seed directly, to entirely replace the linear congruential PRNG’s state; on top of that, the arithmentic used means that the low-order bits have an extremely obvious, repeating, simple pattern, mapping directly to that seed’s value. This is what gives Groove Theory its practicability by a human, without computer aid; with a more complex algorithm, it’d still be guessable with the aid of a computer, but with the simple PRNG, it’s guessable, unaided.

Update 2: as noted in a comment, Linux glibc’s rand(3) is apparently quite good at producing decent numbers. However, perl’s rand() function doesn’t use that; it uses drand48(), which in glibc is still a linear congruential PRNG and displays the ‘low randomness in low-order bits’ behaviour.

Tags: , , , , , , , ,

Comments (8)

Buying Music From iTMS in Linux

On saturday, I spent a little time trying to work out how to give Steve Jobs my money; more accurately, I wanted to get some way to buy music from the iTunes Music Store from my Linux desktop, and this isn’t as easy as it really should be, because the official iTMS is a mess of proprietary Mac- and Windows-only DRM-laden badness.

Here’s a quick walkthrough of how this went:

  • install iTunes in my VMWare Windows install
  • sign up for iTMS, and give Apple all my personal info, including super-s3kr1t card verification codes, eek
  • buy a song
  • find the DRM’d file in the filesystem; it’s an .m4p file, and xine doesn’t seem to like it
  • do some googling for ‘iTunes DRM remove linux’; that leads to Jon Lech Johansen’s JusteTune
  • download and run JusteTune installer
  • get obscure hexadecimal error code dialog. hmm! what could that mean?
  • download and run .NET runtime, link on JusteTune page
  • rerun JusteTune — it works this time
  • select Account -> Authorize, enter login info
  • drag and drop file — it’s decrypted!

So, that yields a decrypted AAC file, which I can play on Linux using xine. That’s the hard part done!

However, I want to play my purchases in JuK, the very nice iTunes-style music player app for KDE.

While the gstreamer audio framework supports playback of AAC files with the gstreamer0.8-faad package (’sudo apt-get install gstreamer0.8-faad’), JuK itself can’t find the file or read its metadata, so it doesn’t show up in the music collection as playable. I don’t want to go hacking code from CVS into my desktop’s music player — possibly the most essential app on the desktop — so transcoding them to MP3 seems to be the best option.

Somebody’s already been here before, though — that’s one of the benefits of being a late adopter! Here’s a script to convert .m4a files to .mp3 using the ‘faad’ tool (’sudo apt-get install faad’).

During this work, I came across Jon Lech Johansen’s latest masterwork — SharpMusique, a fully operational native Linux interface to the iTMS. Building on Ubuntu Hoary was a simple matter of tar xvfz, configure, make, sudo make install, and it works great — and automatically de-DRMs the files on the fly as it downloads them! Now that’s the way to enjoy the iTMS on Linux, at least until Apple’s engineers break it again.

Update, May 2006: Apple’s engineers broke it. Thanks Wilfredo ;)

End result: a brand new, complete, high-quality copy of Dengue Fever’s new album, Escape From Dragon House. Previously I’d only had a couple of tracks off this, so I’m now a happy camper, music-wise.

BTW, I was also considering trying out the new Yahoo! Music Store, but it too uses fascist DRM tricks and is platform-limited, and I’m not sure how breakable it is. On top of that, the prospect of not being able to try it out before handing over credit-card details put me off. As far as I can see, I can’t even look up the albums offered before subscribing. All combined, I’ll stick with iTMS for now.

Tags: , , , , ,

Comments (2)

Faster string search alternative to Boyer-Moore: BloomAV

An interesting technique, from the ClamAV development list — using Bloom filters to speed up string searching. This kind of thing works well when you’ve got 1 input stream, and a multitude of simple patterns that you want to match against the stream. Bloom filters are a hashing-based technique to perform extremely fast and memory-efficient, but false-positive-prone, binary lookups.

The mailing list posting (’Faster string search alternative to Boyer-Moore‘) gives some benchmarks from the developers’ testing, along with the core (GPL-licensed) code:

Regular signatures (28,326) :

  • Extended Boyer-Moore: 11 MB/s

  • BloomAV 1-byte: 89 MB/s

  • BloomAV 4-bytes: 122 MB/s

Some implementation details:

the (implementation) we chose is a simple bit array of (256 K * 8) bits. The filter is at first initialized to all zeros. Then, for every virus signature we load, we take the first 7 bytes, and hash it with four very fast hash functions. The corresponding four bits in the bloom filter are set to 1s.

Our intuition is that if the filter is small enough to fit in the CPU cache, we should be able to avoid memory accesses that cost around 200 CPU cycles each.

Also, in followup discussion, the following paper was mentioned: A paper describing hardware-level Bloom filters in the Snort IDS — S. Dharmapurikar, P. Krishnamurthy, T. Sproull, and J. W. Lockwood, “Deep packet inspection using parallel Bloom filters,” in Hot Interconnects, (Stanford, CA), pp. 44–51, Aug. 2003.

This system is dubbed ‘BloomAV’. Pretty cool. It’s unclear if the ClamAV developers were keen to incorporate it, though, but it does point at interesting new techniques for spam signatures.

Tags: , , , , , ,

Comments

« Previous entries Next Page » Next Page »