False Positive ‘Reports’ != FP Measurement

John Graham-Cumming writes an excellent monthly newsletter on anti-spam, concentrating on technical aspects of detecting and filtering spam. Me, I have a habit of sending follow-up emails in response ;)

This month, it was this comment, from a techie at another software company making anti-spam products:

When I look at the stats produced on our spam traps, which get millions of messages per day from 11 countries all over the world, I see our spam catch rate being consistently over 98% and over 99% most of the time. We also don’t get more than 1 or 2 false positive reports from our customers per week, which can give an impression of our FP rate, considering the number of mailboxes we protect.

My response:

‘Worth noting that a “false positive report from our customer” is NOT the same thing as a “false positive” (although in fairness, [the sender] does note only that it will “give an impression” of their FP rate).

This is something that I’ve seen increasingly in the commercial anti-spam world — attempting to measure false positive rates from what gets reported “upstream” via the support channels.

In reality, the false positives are still happening — it’s just that there are obstacles between the end-user noticing them and the FP report arriving on a developer’s desk: changes to the organisational structure, surly tech support staff, or even whether the user was too busy to send that report, will all affect whether the FP is counted.

Many FPs will go uncounted. As a result, IMO it is not a valid approach to measurement.’
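To put a rough number on the effect, here is a minimal sketch; the weekly report count and the reporting probability below are invented purely for illustration:

```python
# Sketch: if each false positive is independently reported with
# probability p, the reported count understates the true count by 1/p.
# All figures below are invented for illustration.

def true_fp_estimate(reported_fps: int, report_probability: float) -> float:
    """Scale a reported-FP count up by the assumed reporting rate."""
    return reported_fps / report_probability

# e.g. 2 FP reports reach the vendor per week, but only 1 in 50
# affected users actually bothers to file a report:
print(true_fp_estimate(2, 0.02))  # 100.0 true FPs per week
```

Even a modest drop in users’ willingness to report opens a large gap between the support-channel figure and reality.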

I’ve been saying this a lot in private circles recently, which seems like a good enough reason to post it here…



  1. Posted October 27, 2005 at 00:14 | Permalink

    Very true. A lot of “professionals” seem to think that just because they are stopping more spam, they are automagically not blocking any “innocent” mail. There will always be a level of FPs, but whether anybody is aware of them or not is a different matter.

  2. Posted October 27, 2005 at 01:41 | Permalink

    Hi Michele —

    yes, that’s a problem I’ve seen mainly in mail admins who want to deal with spam any way they can think of, but don’t want to install SA or another filter for some reason — typically it’s because they want to block the spam at SMTP time, thereby saving bandwidth.

    The problem is, if you don’t receive the message text, it can be damn hard to figure out if the message you just blocked was spam or not — therefore they seem to just assume it was spam!

    (It’s a form of the Texas Sharpshooter fallacy, I think.)

  3. adam
    Posted October 27, 2005 at 04:15 | Permalink

    I’m not sure if this is the spam or the virus filter, because I run OpenProtect and I’m not sure which picks it up, but a recent upgrade toggled reports back on and I’ve been noticing several FPs a day generated by FORMs and SCRIPTs. I think it’s just incredible that people are still doing that these days; and not penny-ante operations either: these guys must be sending out tens of thousands of mails in each run.

  4. Posted October 27, 2005 at 05:30 | Permalink

    hi Adam!

    yep, the larger, legit senders are the worst for that. I think it’s because they just use pages designed for the web, and send those out in email — invalid HTML, commented sections, forms, javascript, remote CSS and all.

    The cat is really out of the bag when it comes to HTML in email. Blame Netscape. ;)

  5. Matt Sergeant
    Posted October 27, 2005 at 20:11 | Permalink

    The companies are stuck between a rock and a hard place a bit when it comes to this stuff. We can’t test with customer email because of privacy issues, and if we test with our own mail we get into the issue (I believe Paul Graham pointed this out) that if you can’t stop the stuff you already know is spam, you’re doing a really bad job! (and vice-versa for HAM).

    What I’m not saying, though, is that I support this notion of “FP rate based on support calls”. Ours would be something like one in 250 billion if we measured it that way. The Veritest tests are very good for this – we worked very hard with them to ensure the test is legitimate (perhaps to a fault – our results suffer because of it), but it’s very sad that Veritest prevents people from re-publishing the results – I guess they want to sell a report.

    For what it’s worth I don’t think SpamAssassin’s measure of FP rate (based on mass-check results) is any better – it’s purely based on a few people’s corpora, and doesn’t tend to reflect things like business oriented email (it more closely reflects “geek” email). At least going by FP reports covers your whole customer base. I know this is something you work on though so I know you understand I’m not trying to be critical.

    Oh, meta point: the auto-preview can’t keep up with my typing, which is really annoying.

  6. Posted October 27, 2005 at 21:02 | Permalink

    The problem with all FP tests is that there is an effectively infinite amount of legitimate e-mail out there. You can conduct a more or less accurate catch-rate analysis based on good spam-trap data, knowing that your spam feeds cover at least 90% of all active spam campaigns. But an FP test based on some stale “ham” corpus collected over the past few years will never reflect the real situation. E.g. two comparable anti-spam solutions can give you two very contradictory FP test results on two different sets of ham corpora.

    Having said that, I agree that an FP rate measured on customer complaints is far from ideal, and I’d never put it in our marketing materials. But in many cases, that is the best information you can get about your product’s behaviour today versus, let’s say, 2 months ago.

    PS: I think a better way to measure FP rate in independent tests would be to base it on a significant amount of opt-in bulk email — moderated mailing lists, newsletters, jokes of the day, etc. That’s where most FPs happen, plus this is a kind of legitimate traffic that’s common across different organizations. At least this way, you can compare apples to apples.


  7. Posted November 22, 2005 at 13:42 | Permalink

    Have a look at the TREC spam tests at http://plg.uwaterloo.ca/~gvcormac/trecspamtrack05/. In preparing one of the corpora, we used one user’s feedback as the basis for a gold standard. We found that that one user (Mr. X) failed to report about half of the false positives (8 of 16) and about half of the false negatives (400 of 800) that his filter produced. Had his reporting been used as the gold standard for truth, the filter’s error rate would have been underestimated by a factor of 2. Worse, if a second filter had been tested using these results, every one of the 400 non-reported false negatives would potentially have been scored against it as a spurious false positive!

    Anyway, I completely agree with you that user reporting is a bogus measure of filter accuracy.
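The factor-of-2 underestimate described in the last comment can be checked with a quick arithmetic sketch, using only the Mr. X figures quoted there:

```python
# Worked check of the Mr. X figures from the TREC corpus preparation:
# 8 of 16 false positives and 400 of 800 false negatives were reported.
true_fp, reported_fp = 16, 8
true_fn, reported_fn = 800, 400

# Ratio of actual errors to user-reported errors:
factor = (true_fp + true_fn) / (reported_fp + reported_fn)
print(factor)  # 2.0 — the error rate is underestimated by a factor of 2
```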