Spam: ‘The spam-classifying accuracy of a human being is 99.84%’. This
statement has passed into Slashdot lore as the gospel truth, so it's time for
some debunking.
First off, that's not what Bill Yerazunis said in the CRM-114 paper,
Sparse Binary Polynomial Hashing and the CRM114 Discriminator.
Here’s the real quote:
the human author’s measured accuracy as an antispam filter is only 99.84% on the first pass
Here’s a copy of the original mail:
I manually classified the same set of 1900 messages *twice*, and found three errors in my own classifications, hence I have a 99.84% success rate.
(my emphasis). In other words, the author sat down and ran through 1900
messages manually, then ran through them again, and checked how many
classifications from the first pass disagreed with the second.
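For the record, the arithmetic behind the figure checks out. A quick sanity check, in purely illustrative Python:

```python
messages = 1900
errors = 3
accuracy = (messages - errors) / messages
print(f"{accuracy:.2%}")  # prints 99.84%
```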
Let's consider an alternative situation, where a user is presented with
one message, asked to take their time, give it a full examination
and some thought, and then classify it. I'd expect that message to be
classified correctly more often, since fatigue will not be an issue
(after 1900 messages, I'm pretty tired of eyeballing), and neither will
time pressure (taking 20 seconds on each of 1900 mails would require
about 10.5 hours, and would be excruciatingly boring to boot).
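(The back-of-the-envelope arithmetic there, assuming a flat 20 seconds per mail:

```python
total_seconds = 1900 * 20
hours, rem = divmod(total_seconds, 3600)
print(f"{hours}h {rem // 60}m")  # 10h 33m
```
)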
In addition, the study wasn't clear on exactly how much of each mail was
presented. Given too little (just the subject line) or too much (every
header and raw HTML), a human is more likely to make mistakes than if
the mail is rendered fully, with the extraneous header info hidden. I've
never hand-classified 1900 messages via either extreme, because it's
just too tiring, and I know I'd make quite a few mistakes. The UI for
this work is important.
And finally, the figure is derived from a study with one user
performing a task once. There’s no way you could use that figure in
a serious setting — it’s not valid statistical science. Here’s Henry’s
comment:
Yerazunis’ study of “human classification performance” is fundamentally flawed. He did a “user study” where he sat down and re-classified a few thousand of his personal e-mails and wrote down how many mistakes he made. He repeats this experiment once and calls his results “conclusive.” There are several reasons why this is not a sound methodology:
- a) He has only one test subject (himself). You cannot infer much about the population from a sample size of 1.
- b) He has already seen the messages before. We have very good associative memory. You will also notice that he makes fewer mistakes on the second run which indicates that a human’s classification accuracy (on the same messages) increases with experience. For this very reason, it is of the utmost importance to test classification performance on unseen data. After all, the problem tends towards “duplicate detection” when you’ve seen the data beforehand.
- c) He evaluates his own performance. When someone’s own ego is on the line, you would expect that it would be very difficult to remain objective.
So, to correct the statement:
‘The spam-classifying accuracy of this one guy, when classifying nearly two thousand mails by hand, was 99.84%, once.’
MS’ latest patent
Patents: Oh, come on. USPTO: ‘Task list window for use in an integrated development environment’. Here’s claim 1:
A computer-implemented method for managing development-related tasks, the method comprising:
during an interactive code development session, evaluating source code to determine whether a comment token is present;
in response to determining that the source code contains a comment token, inserting a task into a task list; and
in response to completion of a task, modifying the task list during the interactive code development session to indicate that the task has been completed.
There are 74 more claims of about that standard, including the usual ‘an input module connected to the knee-bone’ mumbo-jumbo that means it ‘isn’t a software patent’.
This is quite simply absurd. Are we really supposed to believe that nobody had thought of what is essentially a list of tickboxes displaying the output of ‘grep TODO *.c’ (see the sketch below) before March 6, 2000? You have got to be kidding. This /. comment suggests that Delphi 5 (released in 1999) did it.
(update: looks like there was a provisional patent application, so that date may have to be March 5, 1999.)
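To give a sense of just how thin claim 1 is, here's a minimal sketch of it (Python; the regex, the ‘*.c’ glob, and the function names are my own illustrative choices, not anything from the patent):

```python
import re
from pathlib import Path

# a 'comment token', per the claim: a TODO/FIXME marker in a source comment
TOKEN = re.compile(r"(?:/[/*]|#)\s*(TODO|FIXME)\b[:\s]*(.*)")

def build_task_list(source_files):
    """'Evaluate source code to determine whether a comment token is
    present', and 'insert a task into a task list' for each one found."""
    tasks = []
    for path in source_files:
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            match = TOKEN.search(line)
            if match:
                tasks.append({"where": f"{path}:{lineno}",
                              "text": match.group(2).strip(),
                              "done": False})
    return tasks

def complete_task(tasks, index):
    """'Modify the task list to indicate that the task has been
    completed' -- i.e. tick the tickbox."""
    tasks[index]["done"] = True

# essentially 'grep TODO *.c', with checkboxes
tasks = build_task_list(Path(".").glob("*.c"))
for task in tasks:
    box = "x" if task["done"] else " "
    print(f"[{box}] {task['where']}  {task['text']}")
```

That's the whole ‘invention’: a grep, a list, and a boolean.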
William Chiles, Anders Hejlsberg, Randy Kimmerly and Peter Loforte should be ashamed of themselves for filing this joke. And the USPTO examiner who granted it should be fired.
(PS: a factoid from the Slashdot comments: IBM receives (note: not even ‘files for’) nearly 10 patents every day.)