October 10, 2002 - Justin's Linklog

Found on Paul Graham’s site: “according to a recent study, the MAPS RBL, probably the best known blacklist, catches only 24% of spam, with 34% false positives. It would take a conscious effort to write a content-based filter with performance that bad.”

The “recent study” is by David Nelson at Giga Information Group, sometime last year.

For the sake of it, I’ve checked out how the MAPS figures stack up using TCR (Total Cost Ratio), Ion Androutsopoulos’ metric for measuring spam filter performance. TCR is a very nice single-figure metric, which takes into account the “inconvenience factor” of misfiled mails, based on a “lambda” setting: a weight saying how many times worse it is to misfile a legitimate mail as spam than to let a spam through, which depends on what action is taken when a mail is classified as spam. For MAPS, I’m assuming a lambda of 9, the guideline figure for systems which bounce mail back to the sender, instead of 1 for simple tagging, or 999 for outright deletion with no notification.

So: using a lambda of 9, MAPS gets a TCR of 0.0912, a Spam Recall of 24%, and a Spam Precision of 17%. It’s worth noting that the baseline figure for TCR is 1.0, which represents no filtering whatsoever, i.e. all the spam comes right into your mailbox. Higher is better, and anything below 1.0 costs you more than it saves.

In other words, using MAPS is more inconvenient all-round than not filtering your mail at all, if these figures are to be believed ;)
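
For anyone who wants to check the arithmetic, here is a rough Python sketch of the TCR calculation. TCR is the number of spams divided by (lambda times the legitimate mails misfiled as spam, plus the spams that got through). Note the assumptions: the corpus mix is not given above, so the sketch assumes 30 spams per 100 legitimate mails, a ratio picked only because it lands close to the figures quoted; it also reads the “34% false positives” as 34% of legitimate mail being blocked. The function and variable names are made up for illustration.

```python
def tcr(n_spam, n_legit_as_spam, n_spam_as_legit, lam):
    """Total Cost Ratio: cost of running no filter / cost of running the filter.

    With no filter, every spam gets through, so the baseline cost is n_spam.
    With a filter, each misfiled legitimate mail counts lam times as much
    as each spam that slips through.
    """
    return n_spam / (lam * n_legit_as_spam + n_spam_as_legit)


# ASSUMPTION: corpus mix of 30 spams per 100 legitimate mails (not from the study).
n_spam = 30
n_legit = 100

recall = 0.24    # fraction of spam caught, from the quoted study
fp_rate = 0.34   # ASSUMPTION: read as the fraction of legitimate mail blocked
lam = 9          # guideline weight for filters that bounce mail to the sender

caught = recall * n_spam        # spam correctly blocked
missed = n_spam - caught        # spam that reaches the inbox
false_pos = fp_rate * n_legit   # legitimate mail wrongly blocked

precision = caught / (caught + false_pos)
score = tcr(n_spam, false_pos, missed, lam)

print(f"Spam Recall:    {recall:.0%}")
print(f"Spam Precision: {precision:.1%}")
print(f"TCR (lambda={lam}): {score:.4f}")
```

Under those assumptions the TCR comes out at roughly 0.091 and the precision at a little over 17%. Changing `lam` to 1 (simple tagging) or 999 (silent deletion) shows how strongly the score depends on what the filter does with the mail it catches.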

More spam: I’ve just assembled a totally-public corpus of spam and non-spam mail, to allow spamfilter developers to compare and contrast results using the same data. Let’s hope it proves useful.

Not spam: finally, I’m off to Chester for a wedding tomorrow morning; my good mates Kitty and Gerry are tying the knot, in Chester Zoo, no less. Let’s hope this horrible cold I’ve had all week dies down before Saturday…


MAPS gets the TCR treatment, a public corpus, and a wedding