January 21, 2003 - Justin Mason's Weblog

Another good trip report, from ‘babbage’ at perl.org.

Again, and interestingly, quite a few folks agreed with one of SA’s core tenets: no single approach (stats, RBLs, rules, distributed hashes) can filter effectively on its own, as spammers will soon figure out a way to subvert that technique. However, if you combine several techniques, they cannot all be subverted at once, so your effectiveness in the face of active attacks is much better.
Also interesting to note how everyone working with learning-based approaches commented on how hard it was to persuade ‘normal people’ to keep a corpus. Let’s hope SA’s auto-training will work well enough to avoid that problem.
In passing: babbage noted the old canard about Hotmail selling their user database to spammers. That must really piss the Hotmail folks off ;) I think it’s much more likely that, with Moore’s Law and the modern internet, a dictionary attack *will* find your account eventually.
Good tip on the legal angle from John Praed of The Internet Law Group: if a spam misuses the name of a trademarked product like ‘Viagra’, get a copy to Pfizer pronto. Trademark holders have a particular desire to follow up on infringements like this, since a trademark that goes undefended can lose its TM status.
David Berlind, ZDNet executive editor: ‘They don’t want to be involved (in developing an SMTPng)’. He might say that, but I bet their folks working on sending out their bulk-mailed email newsletters might disagree ;). Legit bulk mail senders have to be involved for it to work, and they will want to be involved, too.
Brightmail have a patent on spam honeypots? Must take a look for this sometime.
the plural of ‘corpus’ is ‘corpora’ ;)

Great report, overall.

It’s interesting to see that Infoworld notes that reps from AOL, Yahoo! and MS were all present.

Since the conf, Paul Graham has a new paper up about ‘Better Bayesian Filtering’, and lists some new tokenization techniques he’s using:

keep dollar signs, exclamation and most punctuation intact (we do that!)
prepend header names to header-mined tokens (us too!)
case is preserved (ditto!)
keep ‘degenerate’ tokens; ‘Subject:FREE!!!’ degenerates to ‘Subject:free’, to ‘FREE!!!’, and ‘free’. (ditto! well, partly. We use degeneration of tokens, but we keep the degenerate tokens in a separate, prefixed namespace from the non-degenerate ones, as he contemplates in footnote 7. It’s worth noting that case-sensitivity didn’t pay for the database bloat it produced: each token has to be duplicated into the case-insensitive namespace, which doubled the database size, and the hit-rate didn’t go up nearly enough to make it worthwhile.)

Most of these were also discovered and verified experimentally by SpamBayes, too, BTW.
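To make the tokenizer ideas above a bit more concrete, here is a rough Python sketch. It is not SpamAssassin's or Paul Graham's actual code: the `deg:` prefix, the regex, and the function names are all made up for illustration, and the degeneration rule is just the simplified one from the ‘Subject:FREE!!!’ example (strip trailing exclamation marks, lowercase, drop the header name).

```python
import re

# Hypothetical prefix marking the separate namespace for degenerate tokens.
DEGENERATE_PREFIX = "deg:"

# Keep dollar signs, exclamation marks and a little other punctuation intact.
TOKEN_RE = re.compile(r"[A-Za-z0-9$!'-]+")


def degenerate_forms(token):
    """Return simpler variants of a token, most specific first."""
    if ":" in token:
        header, word = token.split(":", 1)
    else:
        header, word = None, token
    plain = word.rstrip("!").lower()

    candidates = []
    if header is not None and plain:
        candidates.append(f"{header}:{plain}")  # e.g. Subject:free
    if header is not None:
        candidates.append(word)                 # e.g. FREE!!!
    if plain:
        candidates.append(plain)                # e.g. free

    # Drop duplicates and anything identical to the original token.
    seen = {token}
    forms = []
    for candidate in candidates:
        if candidate not in seen:
            seen.add(candidate)
            forms.append(candidate)
    return forms


def tokenize(text, header=None):
    """Tokenize text, preserving case and prepending the header name if given."""
    tokens = []
    for word in TOKEN_RE.findall(text):
        token = f"{header}:{word}" if header else word
        tokens.append(token)
        # Degenerate forms live in their own prefixed namespace, so they
        # never collide with exact tokens in the database.
        tokens.extend(DEGENERATE_PREFIX + form for form in degenerate_forms(token))
    return tokens


print(tokenize("FREE!!!", header="Subject"))
print(tokenize("Win $100 now"))
```

Every mixed-case or punctuated token produces extra database entries this way, which is where the size-versus-hit-rate trade-off mentioned above comes from.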

When we were working on SpamAssassin’s Bayesian-ish implementation, we took a scientific approach, and used suggestions from the SpamBayes folks and from the SpamAssassin community on tokenizer and stats-combining techniques. We then tested these experimentally on a test corpus, and posted the results. In almost all cases, our results matched up with the SpamBayes folks’ results, which is very nice, in a scientific sense.

(PS: update on the Fly UI story — ‘apis’ is not French, it’s Latin. oops! Thanks Craig…)
