Slashdot: This will fail because… Tick the boxes to produce
a generic Slashdot comment on a new anti-spam proposal. Very funny.
So, regarding the Noise Reduction probabilistic-classification tokenizer tweak posted on Slashdot yesterday — it does look interesting; basically, it operates by monitoring the ‘noisiness’ of the token stream, and if the current probabilities for the tokens from the stream differ from what’s defined as acceptable for too long, it ‘dubs’ them out. In other words, it ignores those tokens until another sequence of ‘useful’ tokens is encountered. Plus I’m totally down with the Janine ref ;)
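To make the ‘dubbing’ idea concrete, here’s a minimal sketch of that control flow. This is my own illustration, not DSPAM’s actual code: the function name, the ‘acceptable’ probability band, and the streak thresholds are all assumptions I’ve made up for the example.

```python
def dub_noisy_tokens(token_probs, acceptable=(0.2, 0.8),
                     noise_limit=3, useful_run=2):
    """Sketch of stream 'dubbing' (hypothetical, not DSPAM's code).

    token_probs: list of (token, probability) pairs in stream order.
    Tokens whose probability falls outside the 'acceptable' band count
    as noise; once more than noise_limit consecutive noisy tokens are
    seen, the stream is dubbed out (tokens ignored) until useful_run
    consecutive acceptable tokens are encountered again.
    """
    kept = []
    noisy_streak = 0
    useful_streak = 0
    dubbing = False
    for token, p in token_probs:
        in_band = acceptable[0] <= p <= acceptable[1]
        if dubbing:
            # While dubbing, drop tokens; watch for a run of
            # 'useful' (in-band) tokens to resume keeping them.
            if in_band:
                useful_streak += 1
                if useful_streak >= useful_run:
                    dubbing = False
                    noisy_streak = 0
            else:
                useful_streak = 0
            continue
        if in_band:
            noisy_streak = 0
            kept.append(token)
        else:
            noisy_streak += 1
            if noisy_streak > noise_limit:
                # Too much sustained noise: start dubbing.
                dubbing = True
                useful_streak = 0
            else:
                kept.append(token)
    return kept
```

A run of out-of-band tokens longer than the limit gets silenced, and the filter only starts listening again after a couple of well-behaved tokens in a row — which, as far as I can tell, is the gist of the proposal.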
However, it’s disappointing to come across this in the DSPAM FAQ
Why Should I use DSPAM Instead of SpamAssassin? — a lovely selection of anti-Perl and anti-SpamAssassin FUD, generally overlooking SpamAssassin’s training components (‘leaves the end-user with no means of recourse or satisfaction when they receive a spam’), and in general taking a combative tone. Is that really necessary?
BTW, in case you’ve been living in a hole for the last year — SpamAssassin does include a probabilistic classifier, in the form of the BAYES rules. It’s easy to train, uses good tokenizing and combining algorithms to get high accuracy (though it doesn’t yet do multi-word windowing; that’s on hold until we’ve determined it works acceptably given the database size increase), and, importantly, has been measured on corpora that are not my own mail.
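For the curious, the ‘combining’ step mentioned above can be illustrated with Robinson-style chi-square combining, the general approach used by several Bayesian spam filters. This is a sketch of the technique in general, not SpamAssassin’s actual implementation; the function names are mine.

```python
import math

def chi2Q(x2, v):
    """Survival function of the chi-square distribution for even
    degrees of freedom v: P(chisq(v) >= x2)."""
    m = x2 / 2.0
    s = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        s += term
    return min(s, 1.0)

def combine_chi2(probs):
    """Combine per-token spam probabilities into one message score
    in [0, 1], Robinson-style: compute a 'spamminess' and a
    'hamminess' statistic and average them."""
    n = len(probs)
    # If the tokens were uninformative, -2*sum(ln p) ~ chisq(2n).
    S = chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    H = chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + S - H) / 2.0
```

A message whose tokens mostly carry high spam probabilities scores near 1, mostly-hammy tokens score near 0, and a mixed bag lands in the uncertain middle — which is what makes the combining step robust to a few stray tokens.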
A story: way back when, in June 2001, the SpamAssassin README boasted of its 99.94% accuracy rate. This was true — it was measured on my mail feed over the course of a couple of months. However, once measured on someone else’s mail, that figure dropped pretty quickly. Measuring a spam filter on the developer’s mail feed (where the presence of HTML is a killer spam-sign!) is a sure-fire way to get accuracy figures that are (a) great but (b) non-portable.