Another good trip report, from ‘babbage’ at perl.org.
-
Again, and interestingly, quite a few folks agreed with one of SA’s
core tenets: no single approach (stats, RBLs, rules, distributed
hashes) can filter effectively on its own, as spammers will soon
figure out a way to subvert that technique. However, if you combine
several techniques, they cannot all be subverted at once, so your
effectiveness in the face of active attacks is much better.
-
Also interesting to note how everyone working with learning-based
approaches commented on how hard it was to persuade ‘normal people’ to
keep a corpus. Let’s hope SA’s auto-training will work well enough to
avoid that problem.
-
In passing, babbage noted the old canard about Hotmail selling their
user database to spammers. That must really piss the Hotmail folks
off ;) I think it’s much more likely that, with Moore’s Law and the
modern internet, a dictionary attack *will* find your account
eventually.
-
Good tip on the legal angle from John Praed of The Internet Law Group:
if a spam misuses the name of a trademarked product like ‘Viagra’, get
a copy to Pfizer pronto. Trademark holders have a particular incentive
to follow up on infringements like this, as an undefended trademark
loses its TM status.
-
David Berlind, ZDNet executive editor: ‘They don’t want to be involved
(in developing an SMTPng)’. He might say that, but I bet their folks
working on sending out their bulk-mailed newsletters would
disagree ;). Legit bulk mail senders have to be involved for
it to work, and they will want to be involved, too.
-
Brightmail have a patent on spam honeypots? Must look this up
sometime.
-
The plural of ‘corpus’ is ‘corpora’ ;)
Great report, overall.
It’s interesting to see that InfoWorld notes that reps from AOL,
Yahoo! and MS were all present.
Since the conf, Paul Graham has put up a new paper, ‘Better Bayesian
Filtering’, which lists some new tokenization techniques he’s using:
-
keep dollar signs, exclamation and most punctuation intact (we do
that!)
-
prepend header names to header-mined tokens (us too!)
-
case is preserved (ditto!)
-
keep ‘degenerate’ tokens; ‘Subject:FREE!!!’ degenerates to
‘Subject:free’, to ‘FREE!!!’, and to ‘free’. (ditto! well, partly. We
use degeneration of tokens, but we keep the degenerate tokens in a
separate, prefixed namespace from the non-degenerate ones, as he
contemplates in footnote 7. It’s worth noting that case-sensitivity
wasn’t worth the database bloat it produced: each token has to be
duplicated into the case-insensitive namespace, which doubled the
database size, and the hit-rate didn’t go up nearly enough to make
it worthwhile.)
Most of these were also discovered and verified experimentally by
SpamBayes, BTW.
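To make the token tricks above a bit more concrete, here is a rough Python sketch of header-prefixed tokens plus degeneration into a separate, prefixed namespace. This is not SpamAssassin’s (or Paul Graham’s) actual code: the `deg:` prefix, the function names, and the exact punctuation-stripping rule are all made up for illustration, and it only produces the three degenerate forms mentioned in the list.

```python
import re

# Hypothetical prefix marking the separate namespace for degenerate tokens.
DEGENERATE_PREFIX = "deg:"

# Assumed rule: degeneration strips trailing punctuation and lowercases.
_TRAILING_PUNCT = re.compile(r"[!?.,;:]+$")


def degenerate_forms(word, header=None):
    """Return simpler forms of a token, most specific first.

    degenerate_forms("FREE!!!", header="Subject")
    gives ["Subject:free", "FREE!!!", "free"].
    """
    stripped = _TRAILING_PUNCT.sub("", word).lower()
    forms = []
    if header:
        forms.append(f"{header}:{stripped}")  # keep header, drop case/punctuation
        forms.append(word)                    # drop header, keep case/punctuation
    forms.append(stripped)                    # drop everything

    # Skip empty forms, duplicates, and anything identical to the exact token.
    exact = f"{header}:{word}" if header else word
    seen = {exact}
    result = []
    for form in forms:
        if form and form not in seen:
            seen.add(form)
            result.append(form)
    return result


def tokenize(text, header=None):
    """Split on whitespace, keeping case and punctuation ($, !, ...) intact.

    Each exact token is followed by its degenerate forms, which live in
    their own prefixed namespace so they never collide with exact tokens.
    """
    tokens = []
    for word in text.split():
        tokens.append(f"{header}:{word}" if header else word)
        tokens.extend(DEGENERATE_PREFIX + form
                      for form in degenerate_forms(word, header))
    return tokens
```

With this sketch, `tokenize("FREE!!! money", header="Subject")` yields the exact tokens `Subject:FREE!!!` and `Subject:money`, plus `deg:`-prefixed fallbacks for each. Every extra namespace multiplies the number of tokens stored, which is the database-bloat trade-off described above.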
When we were working on SpamAssassin’s Bayesian-ish implementation,
we took a scientific approach, and used suggestions from the SpamBayes
folks and from the SpamAssassin community on tokenizer and stats-combining
techniques. We then tested these experimentally on a test corpus,
and posted the results. In almost all cases, our results matched
up with the SpamBayes folks’ results, which is very nice, in a scientific
sense.
(PS: update on the Fly UI story: ‘apis’ is not French, it’s Latin. Oops!
Thanks Craig…)