TREC Spam Corpus

Some news from TREC’s Gordon Cormack:

The TREC 2005 Corpus (92,000 messages - 42,000 ham; 50,000 spam) is now available for self-serve download.

TREC Spam Evaluation is a NIST program to develop methods to measure spam filter accuracy and performance. More details here.

The corpus can be picked up at Gordon’s site. As far as I can tell, this should be a pretty solid corpus for spam researchers and developers.


[thx] HAM


flickr_IMG_7139.jpg
Originally uploaded by Andy Cadaver.

I was just emailing with Sarah Carey, and she correctly noted that my weblog has been tending towards the techie-incomprehensible recently. A brief look at the front page confirms this.

So here’s a remedy: a photo of the delicious ham which the lovely C cooked up for Thanksgiving, last Thursday. Just look at that, mmmmm!

When I get back to Ireland, I will be bringing Thanksgiving with me; a holiday based around eating cooked fowl, with no religious baggage whatsoever? I’m so there.



Bayesian learning animation

Spam: via this month’s edition of John Graham-Cumming’s excellent anti-spam newsletter comes a very cool animation of the dbacl Bayesian anti-spam filter being trained to classify a mail corpus. Here’s the animation:

And Laird’s explanation:

dbacl computes two scores for each document, a ham score and a spam score. Technically, each score is a kind of distance, and the best category for a document is the lowest scoring one. One way to define the spamminess is to take the numerical difference of these scores.

Each point in the picture is one document, with the ham score on the x-axis and the spam score on the y-axis. If a point falls on the diagonal y=x, then its scores are identical and both categories are equally likely. If the point is below the diagonal, then the classifier must mark it as spam, and above the diagonal it marks it as ham.

The points are colour coded. When a document is learned we draw a square (blue for ham, red for spam). The picture shows the current scores of both the training documents, and the as yet unknown documents in the SA corpus. The unknown documents are either cyan (we know it’s ham but the classifier doesn’t), magenta (spam), or black. Black means that at the current state of learning, the document would be misclassified, because it falls on the wrong side of the diagonal. We don’t distinguish the types of errors. Only we know the point is black, the classifier doesn’t.

At time zero, when nothing has been learned, all the points are on the diagonal, because the two categories are symmetric.

Over time, the points move because the classifier’s probabilities change a little every time training occurs, and the clouds of points give an overall picture of what dbacl thinks of the unknown points. Of course, the more documents are learned, the fewer unknown points are left.
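The decision rule and colour coding Laird describes can be sketched in a few lines of Python. This is only an illustration of the explanation above, not dbacl's code: the names `ScoredDocument`, `spamminess`, `classify` and `point_colour` are made up for the example, the sign convention for spamminess (ham score minus spam score) is an assumption, and so is the choice to leave points lying exactly on the diagonal undecided rather than black.

```python
from dataclasses import dataclass


@dataclass
class ScoredDocument:
    ham_score: float   # x-axis in the animation (lower = closer to ham)
    spam_score: float  # y-axis in the animation (lower = closer to spam)


def spamminess(doc: ScoredDocument) -> float:
    """Numerical difference of the scores; positive means 'closer to spam'.

    The sign convention is an assumption made for this sketch.
    """
    return doc.ham_score - doc.spam_score


def classify(doc: ScoredDocument) -> str:
    """Pick the lowest-scoring category, as in the quoted explanation."""
    if doc.spam_score < doc.ham_score:
        return "spam"       # point lies below the diagonal y = x
    if doc.spam_score > doc.ham_score:
        return "ham"        # point lies above the diagonal
    return "undecided"      # exactly on the diagonal: equally likely


def point_colour(doc: ScoredDocument, true_label: str, learned: bool) -> str:
    """Colour of a point, following the scheme described above."""
    if learned:
        # Training documents are drawn as squares: blue ham, red spam.
        return "blue" if true_label == "ham" else "red"
    predicted = classify(doc)
    if predicted != "undecided" and predicted != true_label:
        return "black"      # wrong side of the diagonal
    return "cyan" if true_label == "ham" else "magenta"


if __name__ == "__main__":
    doc = ScoredDocument(ham_score=10.0, spam_score=4.0)
    print(classify(doc), spamminess(doc), point_colour(doc, "ham", False))
```

At time zero every document would have equal scores, so `classify` returns "undecided" for all of them, which matches the cloud of points starting out on the diagonal.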

This is an excellent visualisation of the process, and demonstrates nicely what happens when you train a Bayesian spam-filter. You can clearly see the ‘unsure’ classifications becoming more reliable as the training corpus size increases. Very nice work!

It’s interesting to note the effects of an unbalanced corpus early on; a lot of spam training and little ham training results in a noticeable bias towards the classifier returning a spam classification.


Blocking mail with no Message-ID

Spam: Bram shares a spam-filtering tip — ‘most of the viruses I get have a Message-Id tacked on by the local mailserver. A little bit of messing with procmail and suddenly my junk mail level is under control.’

This is what the SpamAssassin rule MSGID_FROM_MTA_SHORT detects. Its hit statistics are:

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
  4.432   6.7680   0.0560    0.992   0.94    3.67  MSGID_FROM_MTA_SHORT

The rule hits 6.7680% of spam, but also 0.0560% of ham mail, which gives a spam-to-overall ratio of 0.992; in other words, it is 99.2% accurate. In the default 2.6x rule set, it gets a score of 3.67 points.
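For anyone who wants to check the arithmetic, here is a small sketch that reproduces the 0.992 figure from the two hit rates. It assumes the accuracy figure is computed as spam% / (spam% + ham%), and that the first and fifth columns of the line are an overall hit rate and a rank; `parse_hit_line` and `spam_over_overall` are names invented for this example, not SpamAssassin functions.

```python
def parse_hit_line(line: str) -> dict:
    """Split one whitespace-separated statistics line into named fields.

    The meanings of 'overall_pct' and 'rank' are assumptions; the post only
    explains the spam%, ham%, accuracy and score columns.
    """
    overall, spam, ham, s_o, rank, score, name = line.split()
    return {
        "overall_pct": float(overall),
        "spam_pct": float(spam),
        "ham_pct": float(ham),
        "s_o": float(s_o),
        "rank": float(rank),
        "score": float(score),
        "name": name,
    }


def spam_over_overall(spam_pct: float, ham_pct: float) -> float:
    """Fraction of the rule's (equally weighted) hit rate that is spam."""
    return spam_pct / (spam_pct + ham_pct)


row = parse_hit_line(
    "  4.432   6.7680   0.0560    0.992   0.94    3.67  MSGID_FROM_MTA_SHORT"
)
s_o = spam_over_overall(row["spam_pct"], row["ham_pct"])
print(f"{row['name']}: S/O = {s_o:.3f}")  # MSGID_FROM_MTA_SHORT: S/O = 0.992
```

Because the ratio uses the per-class hit percentages rather than raw message counts, it does not depend on how much spam versus ham happens to be in the corpus.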

There’s a lot of divergence between people’s corpora — for instance, I currently have no ham mails that hit this, so it’s 100% accurate for my current mail collection; but some other people have an 80% hit-rate.

This is because some large-scale legitimate mass-mailers — for no apparent reason — also omit the Message-ID when they send the message across the internet. This isn’t quite a contravention of RFC 2822, but that RFC strongly recommends using the header:

Though optional, every message SHOULD have a ‘Message-ID:’ field.

(see RFC 2119 for what ‘SHOULD’ means — it’s a strong recommendation.)
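A legit sender can check for this before anything leaves the building. The sketch below uses Python's standard `email` module to test whether a message carries a Message-ID header at all. It is a minimal illustration, not how SpamAssassin implements MSGID_FROM_MTA_SHORT: it only looks at the message as written, and makes no attempt to work out whether a header was added later by a receiving mail server. The helper name `has_message_id` and the example.com addresses are made up for the example.

```python
from email import message_from_string
from email.message import Message


def has_message_id(msg: Message) -> bool:
    """True if the message has a non-empty Message-ID header.

    Header lookup in the email module is case-insensitive, so this also
    matches 'Message-Id' and 'message-id'.
    """
    value = msg.get("Message-ID")
    return value is not None and value.strip() != ""


raw_with_id = (
    "From: sender@example.com\n"
    "To: rcpt@example.com\n"
    "Message-Id: <20031201.1234@example.com>\n"
    "Subject: newsletter\n"
    "\n"
    "Hello!\n"
)
raw_without_id = (
    "From: sender@example.com\n"
    "To: rcpt@example.com\n"
    "Subject: newsletter\n"
    "\n"
    "Hello!\n"
)

for raw in (raw_with_id, raw_without_id):
    msg = message_from_string(raw)
    print(msg["Subject"], "has Message-ID:", has_message_id(msg))
```

Running a check like this over outgoing mail before it is handed to the mail server would catch the missing-header case that trips up the bulk senders mentioned above.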

The moral for legit senders: make sure you read the RFCs before you start sending SMTP; otherwise you’ll look like a spammer.

The moral for spamfilter developers: watch out for the legit bulk mail senders; some of them do really bizarre things with SMTP. ;)
