Skip to content

Month: February 2008

On the effects of lowering your SpamAssassin threshold

So I was chatting to Danny O’Brien a few days ago. He noted that he’d reduced his Spamassassin “this is spam” threshold from the default 5.0 points to 3.7, and was wondering what that meant:

I know what it means in raw technical terms — spamassassin now marks anything >3.7 as spam, as opposed to the default of five. But given the genetic algorithm way that SA calculates the rule scoring, what does lowering the score mean? That I’m more confident that stuff marked ham is stuffed marked ham than the average person? That my bayesian scoring is now really good?

Do people usually do this without harmful side-effects? What does it mean about them if they do it?

Does it make me a good person? Will I smell of ham? These are the things that keep me awake at night.

It’s a good question! Here’s what I responded with — it occurs to me that this is probably quite widely speculated about, so let’s blog it here, too.

As you tweak the threshold, it gets more or less aggressive.

By default, we target a false positive rate of less than 0.1% — that means 1 FP, a ham marked as spam incorrectly, per 1000 ham messages. Last time the scores were generated, we ran our usual accuracy estimation tests, and got a false positive rate of 0.06% (1 in 1667 hams) and a false negative rate of 1.49% (1 in 67 spams) for the default threshold of 5.0 points. That’s assuming you’re using network tests (you should be) and have Bayes training (this is generally the case after running for a few weeks with autolearning on).

If you lower the threshold, then, that trades off the false negatives (reducing them — less spam getting past) in exchange for more false positives (hams getting caught). In those tests, here’s some figures for other thresholds:

SUMMARY for threshold 3.0: False positives: 290 0.43% False negatives: 313 0.26%

SUMMARY for threshold 4.0: False positives: 104 0.15% False negatives: 1084 0.91%

SUMMARY for threshold 4.5: False positives: 68 0.10% False negatives: 1345 1.13%

so you can see FPs rise quite quickly as the threshold drops. At 4.0 points, the nearest to 3.7, 1 in 666 ham messages (0.15%) will be marked incorrectly as spam. That’s nearly 3 times as many FPs as the default setting’s value (0.06%). On the other hand, only 1 in 109 spams will be mis-filed.

Here’s the reports from the last release, with all those figures for different thresholds — should be useful for figuring out the likelihoods!

In fact, let’s get some graphs from that report. Here is a graph of false positives (in orange) vs false negatives (in blue) as the threshold changes…

and, to illustrate the details a little better, zoom in to the area between 0% and 1%…

You can see that the default threshold of 5.0 isn’t where the FP% and FN% rates meet; instead, it’s got a much lower FP% rate than FN%. This is because we consider FPs to be much more dangerous than missed spams, so we try to avoid them to a higher degree.

An alternative, more standardized way to display this info is as a Receiver Operating Characteristic curve, which is basically a plot of the true positive rate vs false positives, on a scale from 0 to 1.

Here’s the SpamAssassin ROC curve:

More usefully, here’s the ROC curve zoomed in nearer the “perfect accuracy” top-left corner:

Unfortunately, this type of graph isn’t much use for picking a SpamAssassin threshold. GNUplot doesn’t allow individual points to be marked with the value from a certain column, otherwise this would be much more useful, since we’d be able to tell which threshold value corresponds to each point. C’est la vie!

Update:: this is possible with GNUplot 4.2 onwards, it seems. great news! Hat tip to Philipp K Janert for the advice. here are updated graphs using this feature:

(GNUplot commands to render these graphs are here.)

Update again: much better interactive Flash graphs here.

Microplanets

Intriguing! Via Glynn Moody comes an interesting new site, Pulse of Open Source:

To highlight open source activity on Twitter, I have launched a new web application today called The Pulse of Open Source. This is the stream of collective consciousness from the open source community on Twitter. You can follow this stream by simply bookmarking the site and visiting regularly or by adding the RSS feed to your feed reader. You can also create a Twitter account and add the individuals you’d like to follow to your own Twitter friends list if you’d prefer. There is also a mobile version of the site for on-the-go viewing.

I’m not entirely convinced it makes sense — the “open source community” is a pretty wide and amorphous concept, covering “enterprisey” types like Iona, to conference organisers, to web standards guys to GNOME developers. That’s a wide range.

However, that site links to the original, and a version which resonates better: PulseOfPDX.com, ‘the stream of Portland’s collective consciousness‘. Basically, this is a local syndication site, with microblogging from a community of local Twitterers. Similar to the “Planet” concept, which aggregates posts from multiple weblogs into a new ‘river of news’ combined feed, as seen on Planet Antispam, Planet Perl, Planet.journals.ie, but for off-the-cuff Twitter microblog comments. It’s a microplanet, to coin a phrase.

I think I might set up one of these for Ireland… what a great idea!

Update: Ted Leung posted about this today as well, I see, linking to this call for an “out-of-the-box” Twitter aggregator:

In theory, this whole pulse idea could be packaged up to be as easily deployable as “planet” sites. Here, “pulse” is the operational brand-name of aggregating Twitter accounts, where as “planet” is the tried and true operational brand-name of aggregating blogs.

I think I still prefer “microplanet” ;)

Update 2: check out IrishPulse!

Bea picture of the week


Another super-cute photo of Bea, from the latest batch. Nowadays, my photos are all-Bea, all of the time…


Plug plug

It’s been a while since I’ve posted about good shopping experiences I’ve had. Here’s a couple:

SoleTrader.co.uk: I’m a terrible shopper. I hate shops, I always wind up having to visit them at their busiest times on the weekend, and the last time I tried to go shopping for a new pair of shoes, I got caught in torrential rain, fell over and broke my thumb instead. seriously. So feck that.

Instead, I resolved to buy them online, and that I did — from SoleTrader. They had a great range of trainers, I found what I was after, the price was grand, and delivery on time. Shoes are always the same size — their sizes are standardised, after all — so naturally they fit fine. All in all, it worked out great.

Be Organic: these guys operate in North Dublin, delivering bags of organic fruit and vegetables to your door, weekly. We get the Essential Fruit Bag and the Mini Box, with a bi-weekly bag of spuds on top, for EUR 32 per week. The quality of the food is absolutely fantastic, there’s never any spoilage or wilting, and it’s always fresh and delicious. Compared to supermarket fare, it’s leagues ahead. They’ve also been grand and flexible when we need to tweak the order slightly — for example we have a veto on celery, and that’s not an issue at all. The only problem would be that they’ve recently increased their prices… but unfortunately that seems to be a general problem in Ireland these days!

vote for Dustin on Saturday

A friend of a friend writes:

Unless you are pretty good at avoiding the media, you will be aware that Dustin the Turkey has been chosen as one of six finalists for RTE’s Eurosong, the winner of which will go on to represent Ireland in the Eurovision Song Contest in Serbia in May.

What you may not be aware of is that I wrote and recorded the song with him and need your votes to help get me to Serbia!!!

The TV show will be broadcast live on RTE this Saturday Feb 23rd, at 7pm. It is a televote (a la X-factor format), so get your mobile phones ready. The results are at 9:45pm.

The song, Irlande Douze Points, is a parody on the current types of songs, acts and block-voting in the Eurovision. It may make your ears bleed a bit, you may ask yourself why, but what the hell, send someone you know to the final!!!

Apparently, Dustin urges the contest judges to “give douze points to Ireland, for its lowlands and its highlands, for Terry Wogan’s wig and Bono’s leather pants. We brought you Guinness and Westlife, 800-years of war and strife, but we all apologise for Riverdance.”

Check out the outraged reactions from Ireland’s past Eurovision “winners”:

Frank McNamara, who wrote two of the Irish Eurovision winners, asked whether RTE, the state broadcaster that selected the six acts, was “giving two fingers” to Irish ‘song’writers. “I think it is absolutely disgraceful.”

Shay Healy, who wrote Johnny Logan’s Eurovision hit What’s Another Year?, wondered “how any bunch of grown-ups could come up with this as a solution”

Phil Coulter thought that Eurovision was going “down the tubes”.

The choice on Saturday is between a turkey puppet taking the piss in a Northside accent, and such po-faced “serious pop” mawkfests as ‘“Double Cross My Heart” performed by Donal Skehan’ and ‘“Time to Rise” performed by Maya’. snore. You know it’s got to be the turkey.

Here’s the official Bebo page, and the Facebook group — and here’s the song itself:

Update: actually, here’s another, higher quality clip — with an entirely different song! Let’s hope this is the one…

Update 2: he won. Dana and the other professional Eurovision types have been chewing wasps, it’s hilarious!

A historical DailyWTF moment

Today, in work, we wound up discussing this classic DailyWTF.com article — “Remember, the enterprisocity of an application is directly proportionate to the number of constants defined”:

public class SqlWords
{
  public const string SELECT = " SELECT ";
  public const string TOP = " TOP ";
  public const string DISTINCT = " DISTINCT ";
  /* etc. */
}

public class SqlQueries
{
  public const string SELECT_ACTIVE_PRODCUTS =
    SqlWords.SELECT +
    SqlWords.STAR +
    SqlWords.FROM +
    SqlTables.PRODUCTS +
    SqlWords.WHERE +
    SqlColumns.PRODUCTS_ISACTIVE +
    SqlWords.EQUALS +
    SqlMisc.NUMBERS_ONE;
  /* etc. */
}

This made me recall the legendary source code for the original Bourne shell, in Version 7 Unix. As this article notes:

Steve Bourne, at Bell Labs, worked on his version of shell starting from 1974 and this shell was released in 1978 as Bourne shell. Steve previously was involved with the development of Algol-68 compiler and he transferred general approach and some syntax sugar to his new project.

“Some syntax sugar” is an understatement. Here’s an example, from cmd.c:

LOCAL REGPTR    syncase(esym)
        REG INT esym;
{
        skipnl();
        IF wdval==esym
        THEN    return(0);
        ELSE    REG REGPTR      r=getstak(REGTYPE);
                r->regptr=0;
                LOOP wdarg->argnxt=r->regptr;
                     r->regptr=wdarg;
                     IF wdval ORF ( word()!=')' ANDF wdval!='|' )
                     THEN synbad();
                     FI
                     IF wdval=='|'
                     THEN word();
                     ELSE break;
                     FI
                POOL
                r->regcom=cmd(0,NLFLG|MTFLG);
                IF wdval==ECSYM
                THEN    r->regnxt=syncase(esym);
                ELSE    chksym(esym);
                        r->regnxt=0;
                FI
                return(r);
        FI
}

Here are the #define macros Bourne used to “Algolify” the C compiler, in mac.h:

/*
 *      UNIX shell
 *
 *      S. R. Bourne
 *      Bell Telephone Laboratories
 *
 */

#define LOCAL   static
#define PROC    extern
#define TYPE    typedef
#define STRUCT  TYPE struct
#define UNION   TYPE union
#define REG     register

#define IF      if(
#define THEN    ){
#define ELSE    } else {
#define ELIF    } else if (
#define FI      ;}

#define BEGIN   {
#define END     }
#define SWITCH  switch(
#define IN      ){
#define ENDSW   }
#define FOR     for(
#define WHILE   while(
#define DO      ){
#define OD      ;}
#define REP     do{
#define PER     }while(
#define DONE    );
#define LOOP    for(;;){
#define POOL    }


#define SKIP    ;
#define DIV     /
#define REM     %
#define NEQ     ^
#define ANDF    &&
#define ORF     ||

#define TRUE    (-1)
#define FALSE   0
#define LOBYTE  0377
#define STRIP   0177
#define QUOTE   0200

#define EOF     0
#define NL      '\n'
#define SP      ' '
#define LQ      '`'
#define RQ      '\''
#define MINUS   '-'
#define COLON   ':'

#define MAX(a,b)        ((a)>(b)?(a):(b))

Having said all that, the Bourne shell was an awesome achievement; many of the coding constructs we still use in modern Bash scripts, 30 years later, are identical to the original design.

Technorati bloginfo API wierdness

For the benefit of other Technorati API users…

In a comment on this entry, Padraig Brady mentioned that his blog had mysteriously disappeared from the Irish Blogs Top 100 list.

I investigated, and found something odd — it seems Technorati has made a change to their bloginfo API, now listing weblogs with their ‘rank’, but without some of the important metadata, like ‘inboundblogs’, ‘inboundlinks’, and with a ‘lastupdate’ time set to the epoch (1970-01-01 00:00:00 GMT), in the API. Here’s an example:

<!-- generator="Technorati API version 1.0" -->
<!DOCTYPE tapi PUBLIC "-//Technorati, Inc.//DTD TAPI 0.02//EN"
                 "http://api.technorati.com/dtd/tapi-002.xml">
<tapi version="1.0">
<document>
    <result>
        <url>http://www.pixelbeat.org</url>
                    <weblog>
                <name>Pádraig Brady</name>
                <url>http://www.pixelbeat.org</url>
                <rssurl></rssurl>
                <atomurl></atomurl>
                <inboundblogs></inboundblogs>
                <inboundlinks></inboundlinks>
                <lastupdate>1970-01-01 00:00:00 GMT</lastupdate>
                <rank>74830</rank>
            </weblog>
                            </result>
</document>
</tapi>

Compare that with this lookup result, on my own blog:

<?xml version="1.0" encoding="utf-8"?>
<!-- generator="Technorati API version 1.0" -->
<!DOCTYPE tapi PUBLIC "-//Technorati, Inc.//DTD TAPI 0.02//EN"
                 "http://api.technorati.com/dtd/tapi-002.xml">
<tapi version="1.0">
<document>
    <result>
        <url>http://taint.org</url>
                    <weblog>
                <name>taint.org: Justin Mason’s Weblog</name>
                <url>http://taint.org</url>
                <rssurl>http://taint.org/feed</rssurl>
                <atomurl>http://taint.org/feed/atom</atomurl>
                <inboundblogs>143</inboundblogs>
                <inboundlinks>227</inboundlinks>
                <lastupdate>2008-02-12 11:48:10 GMT</lastupdate>
                <rank>43404</rank>
            </weblog>
                            <inboundblogs>143</inboundblogs>
                            <inboundlinks>227</inboundlinks>
            </result>
</document>
</tapi>

This bug had caused a number of blogs to be dropped from the list, since I was using “inboundblogs and inboundlinks == 0” as an indication that a blog was not registered with Technorati.

It’s now worked around in my code, although a side-effect is that blogs which have this set will appear with question-marks in the ‘inboundblogs’ and ‘inboundlinks’ columns, and will perform poorly in the ‘ranked by inbound link count’ table (unsurprisingly).

I’ve posted a query to the support forum — let’s see what the story is.

Interesting Irish Blog Awards shortlistee

This year’s Irish Blog Awards shortlists were posted yesterday. I maintain the Irish Blogs Technorati Top 100 list, so good sources of Irish blog URLs are always welcome; I took the shortlisted blogs and added them all.

Interestingly, straight in at number 2 went towleroad.com (warning: not worksafe!). It has a staggering Technorati rank of 1074 — way ahead of Donncha’s 5831 or Mulley’s 8678. I was pretty curious as to how an Irish blog could hit those heights without me having heard of it, so I took a look.

Let’s just say the content isn’t quite what you’d expect to find in a blog shortlisted for ‘Best News/Current Affairs Blog’ — a little bit short on Irish news, but heavy on pictures of naked guys getting off with each other. ;)

I took a quick glance, and I couldn’t spot any Irish content. WHOIS says the the publisher is LA-based. so I’m curious as to what qualified it as an “Irish blog”…

(by the way, I tried to leave a comment on the blog entry, but it appears Akismet is marking my comments as spam on a number of Wordpress-based blogs at the moment. Yes, I am aware of the irony. No, if SpamAssassin was a blog-spam filter, it wouldn’t do that ;)

Update: it’s sorted — they’re now gone. Also, it appears I’ve been removed from the Akismet blacklist, yay.

More on the Trend Micro patent

Dutch free knowledge and culture advocacy group ScriptumLibre.org has called for a worldwide boycott of Trend Micro products. Their chairman, Wiebe van der Worp, claims Trend Micro’s aggressive use of litigation is “well beyond the borders of decency”.

Also, this Linux.com feature has a great quote from Jim Zemlin, the executive director of the Linux Foundation:

“A company that files a patent claim against code coming from a widely adopted open source project vastly underestimates the self-inflicted damage to its customer and community relationships. In today’s world, all of our customers in the software industry are enjoying the benefits of a wide variety of open source projects that provide stability and vendor-neutral solutions to the most basic of their computing needs. I talk to those customers every day. They consider these claims short-sighted and those that assert them to be fearful of their ability to compete in today’s economy.”

Well said.

Plug: Lenovo service still rocks

I needed to buy a new laptop for work a few months back, and after a little agonizing between the MacBook Pro and a Thinkpad T61p, I plumped for the latter. As I noted at the time, one of the major selling points was the quality of IBM/Lenovo’s after-sales warranty service, compared to the atrocious stories I’d heard about AppleCare in Europe. I was, however, taking a leap of faith — I had used IBM service to great effect in the US, but had never actually tried it out in Ireland.

Sadly, I had to put this to the test today, after the hard disk started producing these warnings:

/var/log/messages:Feb  7 11:21:13 wall kernel: 
[2075890.116000] end_request: I/O error, dev sda, sector 116189461
/var/log/messages:Feb  7 11:21:38 wall kernel: 
[2075914.824000] end_request: I/O error, dev sda, sector 116189460
/var/log/messages:Feb  7 11:24:18 wall kernel: 
[2076075.072000] end_request: I/O error, dev sda, sector 116189462
/var/log/messages:Feb  7 11:25:05 wall kernel: 
[2076121.932000] end_request: I/O error, dev sda, sector 116189463

It’s a brand new machine, and a Hitachi TravelStar 7K100 drive, with a good reputation for reliability — but these things do happen. :(

Interestingly, I thought this was a case of the Bathtub curve in action — but this comprehensive CMU study of hard drive reliability notes that the ‘infant mortality’ concept doesn’t seem to apply to current hard-drive technology:

Replacement rates [of hard drives in a cluster] are rising significantly over the years, even during early years in the lifecycle. Replacement rates in HPC1 nearly double from year 1 to 2, or from year 2 to 3. This ob- servation suggests that wear-out may start much earlier than expected, leading to steadily increasing replacement rates during most of a system’s useful life. This is an in- teresting observation because it does not agree with the common assumption that after the first year of operation, failure rates reach a steady state for a few years, forming the “bottom of the bathtub”.

Anyway, I digress.

I ran the BIOS hard disk self-test, got the expected failure, then rang up Lenovo’s International Warranty line for Ireland. I got through immediately to a helpful guy in India, and gave him my details and the BIOS error message; he had no tricky questions, no guff about me using Linux rather than Windows, and there were no attempts to sting me for shipping.

There’s now a replacement HD (and a set of spare recovery disks, bonus!) winging their way via 2-day shipping, expected on Tuesday; I’m to hand over the broken HD to the courier once it arrives. Fantastic stuff!

Assuming the courier doesn’t screw up, this is yet another major win for IBM/Lenovo support, and I feel vindicated. ;)

Update: the HD arrived this morning at 10am — a day early. Very impressive!

CEAS needs your ham

CEAS 2008 is doing another Spam Challenge test of various spam-filters, and as part of this, they need samples of ham mail messages.

As part of the data collection effort, we have set up a website through which it is possible to donate non-sensitive legitimate email, to be used in the evaluation. Any kind of email that the recipient considers legitimate is welcome, including computer generated (non-spam) messages.

After the CEAS evaluation, the benchmark data will be made publicly available to facilitate future reasearch and development in the field of spam prevention.

Here is the collection site; they accept UNIX mbox format, and tar.gz or zip files of same, with an 8MB upload limit.

Remote sound playback through a Nokia 770

For a while now, I’ve been using various hacks to play music from my Linux laptop, holding my main music collection, to client systems which drive the speakers.

Previously, I used this setup to play via my MythTV box. Nowadays, however, my TV isn’t in the room where I want to listen to music. Instead, I have my Nokia 770 hooked up to the speakers; this plays the BBC Radio 4 RealAudio streams nicely, and also the laptop’s MP3 collection using a uPnP AV MediaServer.

I specifically use TwonkyMedia right now, playing back via the N770’s Media Streamer app. (That works pretty well — uPnP AV is one of those standards plagued with incompatibilities, but TwonkyMedia and Media Streamer seem to be a reliable combination.)

However, TwonkyMedia sometimes fails to notice updates of the library, and nothing has quite as good a music-player user interface as JuK, the KDE music player and organiser app, so a way to play directly from the laptop instead of via uPnP would be nice…

A weekend’s hacking reveals that this is pretty easily done nowadays, thanks to some cool features in pulseaudio, the current standard sound server on Ubuntu gutsy, and the Esound server running on the N770.

Unfortunately, the N770 doesn’t (yet) support pulseaudio directly, otherwise we could use its seriously cool support for RTP multicast streams. Still, we can hack something up using the venerable “esd” protocol (again!) Here’s how to set it up…

On the N770:

You need to fix the N770’s “esd” sound server to allow public connections. Set up your wifi network’s DHCP server to give the N770 a static IP address. Log in over SSH, or fire up an xterm. Run the following:

mv /usr/bin/esd /usr/bin/esd.real

cat > /usr/bin/esd <<EOM
#!/bin/sh
exec /usr/bin/esd.real -tcp -public -promiscuous -port 5678 $*
EOM

chmod 755 /usr/bin/esd
/etc/init.d/esd restart

On the server:

Download this file, and save it as n770.pa. Edit it, and change server=n770:5678 on the fourth line to use the IP address or hostname of your Nokia 770 instead of n770. Then run:

cp n770.pa ~/.n770.pa

cat > ~/bin/sound_n770 <<EOM
#!/bin/sh
pulseaudio -k; pulseaudio -nF $HOME/.n770.pa &
EOM

cat > ~/bin/sound_here <<EOM
#!/bin/sh
pulseaudio -k; pulseaudio &
EOM

chmod 755 ~/bin/sound_here ~/bin/sound_n770

Now you just need to run ‘~/bin/sound_n770’ to redirect sound playback to the N770, and ‘~/bin/sound_here’ to reset back to laptop speaker output, for the entire desktop environment. Nifty!

Update: it appears that things may work more reliably if you add “rate=22050” at the end of the “load-module module-esound-sink” line — this halves the bitrate of the network stream, which copes better with harsh wifi network conditions. The n770.pa file above now includes this.