On the effects of lowering your SpamAssassin threshold

So I was chatting to Danny O’Brien a few days ago. He noted that he’d reduced his Spamassassin “this is spam” threshold from the default 5.0 points to 3.7, and was wondering what that meant:

I know what it means in raw technical terms — spamassassin now marks anything >3.7 as spam, as opposed to the default of five. But given the genetic algorithm way that SA calculates the rule scoring, what does lowering the score mean? That I’m more confident that stuff marked ham is stuffed marked ham than the average person? That my bayesian scoring is now really good?

Do people usually do this without harmful side-effects? What does it mean about them if they do it?

Does it make me a good person? Will I smell of ham? These are the things that keep me awake at night.

It’s a good question! Here’s what I responded with — it occurs to me that this is probably quite widely speculated about, so let’s blog it here, too.

As you tweak the threshold, it gets more or less aggressive.

By default, we target a false positive rate of less than 0.1% — that means 1 FP, a ham marked as spam incorrectly, per 1000 ham messages. Last time the scores were generated, we ran our usual accuracy estimation tests, and got a false positive rate of 0.06% (1 in 1667 hams) and a false negative rate of 1.49% (1 in 67 spams) for the default threshold of 5.0 points. That’s assuming you’re using network tests (you should be) and have Bayes training (this is generally the case after running for a few weeks with autolearning on).

If you lower the threshold, then, that trades off the false negatives (reducing them — less spam getting past) in exchange for more false positives (hams getting caught). In those tests, here’s some figures for other thresholds:

SUMMARY for threshold 3.0: False positives: 290 0.43% False negatives: 313 0.26%

SUMMARY for threshold 4.0: False positives: 104 0.15% False negatives: 1084 0.91%

SUMMARY for threshold 4.5: False positives: 68 0.10% False negatives: 1345 1.13%

so you can see FPs rise quite quickly as the threshold drops. At 4.0 points, the nearest to 3.7, 1 in 666 ham messages (0.15%) will be marked incorrectly as spam. That’s nearly 3 times as many FPs as the default setting’s value (0.06%). On the other hand, only 1 in 109 spams will be mis-filed.

Here’s the reports from the last release, with all those figures for different thresholds — should be useful for figuring out the likelihoods!

In fact, let’s get some graphs from that report. Here is a graph of false positives (in orange) vs false negatives (in blue) as the threshold changes…

and, to illustrate the details a little better, zoom in to the area between 0% and 1%…

You can see that the default threshold of 5.0 isn’t where the FP% and FN% rates meet; instead, it’s got a much lower FP% rate than FN%. This is because we consider FPs to be much more dangerous than missed spams, so we try to avoid them to a higher degree.

An alternative, more standardized way to display this info is as a Receiver Operating Characteristic curve, which is basically a plot of the true positive rate vs false positives, on a scale from 0 to 1.

Here’s the SpamAssassin ROC curve:

More usefully, here’s the ROC curve zoomed in nearer the “perfect accuracy” top-left corner:

Unfortunately, this type of graph isn’t much use for picking a SpamAssassin threshold. GNUplot doesn’t allow individual points to be marked with the value from a certain column, otherwise this would be much more useful, since we’d be able to tell which threshold value corresponds to each point. C’est la vie!

Update:: this is possible with GNUplot 4.2 onwards, it seems. great news! Hat tip to Philipp K Janert for the advice. here are updated graphs using this feature:

(GNUplot commands to render these graphs are here.)

Update again: much better interactive Flash graphs here.

Tags: , , , , , ,

Comments (12)

A fishy Challenge-Response press release

I have a Google News notification set up for mentions of “SpamAssassin”, which is how I came across this press release on PRNewsWire:

Study: Challenge-Response Surpasses Other Anti-Spam Technologies in Performance, User Satisfaction and Reliability; Worst Performing are Filter-based ISP Solutions

NORTHBOROUGH, Mass., July 17 /PRNewswire/ — Brockmann & Company, a research and consulting firm, today released findings from its independent, self-funded “Spam Index Report– Comparing Real-World Performance of Anti-Spam Technologies.”

The study evaluated eight anti-spam technologies from the three main technology classes — filters, real-time black list services and challenge- response servers. The technologies were evaluated using the Spam Index, a new method in anti-spam performance measurement that leverages users’ real-world experiences.

[...] The report finds that the best performing anti-spam technology is challenge-response, based on that technology’s lowest average Spam Index score of 160.

[...] Filter - Open Source software-(Spam Index: 388): This technology is frequently configured to work in conjunction with PC email client filters. The server adds * * SPAM * * to the subject line so that the client filter can move the message into the junk folder. This class of software includes projects such as ASSP, Mail Washer and SpamAssassin, among others.

The “Spam Index” is a proprietary measurement of spam filtering, created by Brockmann and Company. A lower “Spam Index” score is better, apparently, so C/R wins! (Funny that. The author, Peter Brockmann, seems to have some kind of relationship with C/R vendor Sendio, being quoted in Sendio press releases like this one and this one, and providing a testimonial on the Sendio.com front page.)

However — there’s a fundamental flaw with that “Spam Index” measurement, though; it’s designed to make C/R look good. Here’s how it’s supposed to work. Take these four measurements:

  • Average number of spam messages each day x 20 (to get approximate number per work-month)
  • Average minutes spent dealing with spam each day x 20 (to get approximate minutes per work-month)
  • Number of resend requests last month
  • Number of trapped messages last month

Then sum them, and that gives you a “Spam Index”.

First off, let’s translate that into conventional spam filter accuracy terms. The ‘minutes spent dealing with spam each day’ measures false negatives, since having to ‘deal with’ (ie delete) spam means that the spam got past the filter into the user’s inbox. The ‘number of trapped messages’ means, presumably, both true positives — spam marked correctly as spam — and false positives – nonspam marked incorrectly as spam. The ‘number of resend requests last month’ also measures false positives, although it will vastly underestimate them.

Now, here’s the first problem. The “Spam Index” therefore considers a false negative as about as important as a false positive. However, in real terms, if a user’s legit mail is lost by a spam filter, that’s a much bigger failure than letting some more spam through. When measuring filters, you have to consider false positives as much more serious! (In fact, when we test SpamAssassin, we consider FPs to be 50 times more costly than a false negative.)

Here’s the second problem. Spam is sent using forged sender info, so if a spammer’s mail is challenged by a Challenge/Response filter, the challenge will be sent to one of:

  • (a) an address that doesn’t exist, and be discarded (this is fine); or
  • (b) to an invalid address on an innocent third-party system (wasting that system’s resources); or
  • (c) to an innocent third-party user on an innocent third-party system (wasting that system’s resources and, worst of all, the user’s time).

The “Spam Index” doesn’t measure the latter two failure cases in any way, so C/R isn’t penalised for that kind of abusive traffic it generates.

Also, if a good, nonspam mail is challenged, either

  • (a) the sender will receive the challenge and take the time to jump through the necessary hoops to get their mail delivered (”visit this web page, type in this CAPTCHA, click on this button” etc.); or
  • (b) they’ll receive the challenge, and not bother jumping through hoops (maybe they don’t consider the mail that important); or
  • (c) they’ll not be able to act on the challenge at all (for example, if an automated mail is challenged).

Again, the “Spam Index” doesn’t measure the latter two failure cases.

In other words, the situations where C/R fails are ignored. Is it any wonder C/R wins when the criteria are skewed to make that happen?

Tags: , , , , , ,

Comments (37)

False Positive ‘Reports’ != FP Measurement

John Graham-Cumming writes an excellent monthly newsletter on anti-spam, concentrating on technical aspects of detecting and filtering spam. Me, I have a habit of sending follow-up emails in response ;)

This month, it was this comment, from a techie at another software company making anti-spam products:

When I look at the stats produced on our spam traps, which get millions of messages per day from 11 countries all over the world, I see our spam catch rate being consistently over 98% and over 99% most of the time. We also don’t get more than 1 or 2 false positive reports from our customers per week, which can give an impression of our FP rate, considering the number of mailboxes we protect.

My response:

‘Worth noting that a “false positive report from our customer” is NOT the same thing as a “false positive” (although in fairness, [the sender] does note only that it will “give an impression” of their FP rate).

This is something that I’ve seen increasingly in the commercial anti-spam world — attempting to measure false positive rates from what gets reported “upstream” via the support channels.

In reality, the false positives are still happening — it’s just that there are obstacles between the end-user noticing them, and the FP report arriving on a developer’s desk; changes to the organisational structure, surly tech support staff, or even whether the user was too busy to send that report, will affect whether the FP is counted.

Many FPs will go uncounted as a result. As a result, IMO it is not a valid approach to measurement.’

I’ve been saying this a lot in private circles recently, so in my opinion that’s a good reason to post it here…

Tags: , , ,

Comments (7)

Fingerprinting and False Positives

New Scientist News - How far should fingerprints be trusted? (via jwz):

Evidence from qualified fingerprint examiners suggests a higher error rate. These are the results of proficiency tests cited by Cole in the Journal of Criminal Law & Criminology (vol 93, p 985). From these he estimates that false matches occurred at a rate of 0.8 per cent on average, and in one year were as high as 4.4 per cent. Even if the lower figure is correct, this would equate to 1900 mistaken fingerprint matches in the US in 2002 alone.

This is why I’m so unhappy about getting fingerprinted as part of US immigration’s US-VISIT program and similar. My fingerprints have been collected on several occasions as part of that program, and as a result will now be shared throughout the US government, and internationally, and will be retained for 75 to 100 years, whether I like it or not.

As a result, with sufficient bad luck, I may become one of those false positives. Fingers crossed all those government and international partner agencies are competent enough to avoid that!

Update: oh wow, this snippet from the New Scientist editorial clearly demonstrates one case where it all went horribly wrong:

Last year, an Oregon lawyer named Brandon Mayfield was held in connection with the Madrid bombings after his fingerprint was supposedly found on a bag in the Spanish capital. Only after several weeks did the Spanish police attribute the print to Ouhnane Daoud, an Algerian living in Spain.

eek! Coverage from the National Assoc of Criminal Defense Lawyers, and the Washington Post.

Tags: , , , , ,

Comments (1)