April 15, 2004 - Justin Mason's Weblog

Update: greetings, visitors from 2006! Please pay no attention to these figures; they’re from 2004, and both GMail and SpamAssassin have undergone major changes since those days. Historical interest only.

So, I set up a .forward to forward all my personal mail to GMail to see how it coped with my spam load, and compared it against the personal SpamAssassin install I’m running these days. Here are the results:

test start: Mon Apr 12 15:50:39 PDT 2004
test end: Tue Apr 13 18:26:45 PDT 2004
total spam messages received by both during the test: 210
total ham messages received by both during the test: 528

The SpamAssassin results:

true positives: 189
false positives: 0
false negatives: 21
true negatives: 528
FP%: 0.00%
FN%: 10.00%

The GMail results:

true positives: 144
false positives: 7
false negatives: 66
true negatives: 521
FP%: 1.32%
FN%: 31.42%
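
For anyone wanting to check the arithmetic, here is a minimal sketch of how the percentages above fall out of the raw counts. It assumes FP% means false positives as a share of all ham, and FN% means false negatives as a share of all spam, which is the reading that matches the figures in this post. The `Counts` class and its property names are made up for illustration and are not part of SpamAssassin or GMail.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts for one filter (hypothetical helper)."""
    tp: int  # spam correctly flagged
    fp: int  # ham wrongly flagged as spam
    fn: int  # spam that reached the inbox
    tn: int  # ham correctly delivered

    @property
    def fp_rate(self) -> float:
        # Assumed definition: false positives as a percentage of all ham.
        return 100.0 * self.fp / (self.fp + self.tn)

    @property
    def fn_rate(self) -> float:
        # Assumed definition: false negatives as a percentage of all spam.
        return 100.0 * self.fn / (self.fn + self.tp)


# Counts copied from the results above.
spamassassin = Counts(tp=189, fp=0, fn=21, tn=528)
gmail = Counts(tp=144, fp=7, fn=66, tn=521)

for name, c in (("SpamAssassin", spamassassin), ("GMail", gmail)):
    print(f"{name}: FP% = {c.fp_rate:.2f}%, FN% = {c.fn_rate:.2f}%")
```

Python rounds to two decimals here, so the GMail figures print as 1.33% and 31.43%; the 1.32% and 31.42% quoted above look like the same values truncated rather than rounded.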

So, not too hot. But there are extenuating circumstances! ;)

The GMail false positives were not ‘typical’ mail, whatever that is — all of them were Mailman ‘administration required’ messages regarding spam in Mailman mailing list queues. I’d only be annoyed if I was a GMail user administrating Mailman lists. And it turns out there’s a bug in current dev SpamAssassin that now does the same thing…
Presumably, GMail allows some element of per-user probabilistic classifier training; if so, a few ‘move to Inbox’ actions might also sort those out quite quickly, I’d guess.
GMail seems to be a four-way classification system. Messages can go into one of: 1. the inbox, 2. the spam box, 3. the inbox with a little green ‘Spam’ indicator, or 4. the spam box with a little green ‘Inbox’ indicator. Not sure what the latter two do, but they may indicate some level of ‘unsure’ as per spambayes; worth noting that most of the FNs in the Inbox did not get the green ‘Spam’ indicator beside them, though.
I used a .forward to bounce the traffic over, so if GMail does any spam filtering at the SMTP level, alongside whatever content filtering and probabilistic classification it uses, it wouldn’t have had the benefit of that in this test.
SpamAssassin has the benefit of some user configuration; I’d got a couple of my spamtrap addresses blacklisted in the SpamAssassin config, and my Bayes databases have been trained using SpamAssassin’s autolearning.
This is all really unscientific, and it’s a really small sample ;)

Surprisingly, though, none of the SpamAssassin mailing list traffic discussing spam, throwing around spammy URLs and phrases, got caught; probably because the density of spammy phrases in those messages is lower than in the Mailman admin stuff.
