The ‘humans are 99.84% accurate’ figure

Spam: ‘The spam-classifying accuracy of a human being is 99.84%’. This statement has passed into Slashdot lore as the gospel truth, so it’s time for some debunking.

First off, that’s not what Bill Yerazunis said in the CRM114 paper, Sparse Binary Polynomial Hashing and the CRM114 Discriminator. Here’s the real quote:

the human author’s measured accuracy as an antispam filter is only 99.84% on the first pass

Here’s a copy of the original mail:

I manually classified the same set of 1900 messages twice, and found three errors in my own classifications, hence I have a 99.84% success rate.

(my emphasis). In other words, the author sat down and ran through 1900 messages manually, then ran through them again, and counted how many of his first-pass classifications disagreed with the second pass.
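
As a quick check on where the number comes from, here is a small sketch of the arithmetic. It assumes the three errors are counted against a single pass of 1900 messages, which is the reading that reproduces the quoted figure; the variable names are mine, not the paper’s.

```python
# Figures from the quoted mail: 1900 messages, 3 errors found.
messages = 1900
errors = 3

# Reading 1: 3 errors out of one pass of 1900 messages.
accuracy = 1 - errors / messages
print(f"{accuracy:.2%}")  # 99.84%

# Reading 2 (an assumption for comparison only): 3 errors across both passes.
accuracy_both_passes = 1 - errors / (2 * messages)
print(f"{accuracy_both_passes:.2%}")  # 99.92%
```

Only the first reading gives 99.84%, so the figure is three mistakes in one run through the pile.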

Let’s consider an alternative situation, where a user is presented with one message, and asked to take their time, give it a full examination and some thought, and then classify the message. I would expect that message to be more likely to be classified correctly, since fatigue will not be an issue (after 1900 messages, I’m pretty tired of eyeballing), and neither will time pressure (taking 20 seconds on each of 1900 mails would require about 10.5 hours, and would be excruciatingly boring to boot).
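
The time estimate is easy to verify. This is just the arithmetic from the paragraph above, with the 20-second figure taken as my own guess at a careful per-message examination:

```python
messages = 1900
seconds_per_message = 20  # assumed time for a careful look at one mail

total_hours = messages * seconds_per_message / 3600
print(f"{total_hours:.2f} hours")  # 10.56 hours
```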

In addition, the study wasn’t clear on exactly how much information from each mail was presented. Too little (just the subject line) or too much (every header and raw HTML), and a human will be more likely to make mistakes than if the mail is rendered fully, and the extraneous header info hidden. In my experience, I’ve never hand-classified 1900 messages purely through either method, because it’s just too tiring, and I know I’ll make quite a few mistakes. The UI for this work is important.

And finally, the figure is derived from a study with one user performing a task once. There’s no way you could use that figure in a serious setting — it’s not valid statistical science. Here’s Henry’s comment:

Yerazunis’ study of “human classification performance” is fundamentally flawed. He did a “user study” where he sat down and re-classified a few thousand of his personal e-mails and wrote down how many mistakes he made. He repeats this experiment once and calls his results “conclusive.” There are several reasons why this is not a sound methodology:
  • a) He has only one test subject (himself). You cannot infer much about the population from a sample size of 1.
  • b) He has already seen the messages before. We have very good associative memory. You will also notice that he makes fewer mistakes on the second run which indicates that a human’s classification accuracy (on the same messages) increases with experience. For this very reason, it is of the utmost importance to test classification performance on unseen data. After all, the problem tends towards “duplicate detection” when you’ve seen the data before hand.
  • c) He evaluates his own performance. When someone’s own ego is on the line, you would expect that it would be very difficult to remain objective.
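
To give a feel for how little a single run pins down, here is a rough sketch that puts a 95% Wilson score interval around 3 errors in 1900 messages. It assumes each classification is an independent trial with the same error rate, which is generous given the fatigue and memory effects described above, and it says nothing at all about other people. The function name and the choice of the Wilson interval are mine, not anything from the paper.

```python
import math

def wilson_interval(errors, trials, z=1.96):
    """Wilson score interval for an error rate; z=1.96 gives roughly 95%."""
    p = errors / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half_width, centre + half_width

# 3 errors in 1900 messages, as in the quoted mail.
low_err, high_err = wilson_interval(3, 1900)
print(f"accuracy between {1 - high_err:.2%} and {1 - low_err:.2%}")
```

Even under those friendly assumptions, the interval runs from roughly 99.5% to a little over 99.9%, so quoting the result to two decimal places overstates what one run can show.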

So, to correct the statement:

‘The spam-classifying accuracy of this one guy, when classifying nearly two thousand mails by hand, was 99.84%, once.’