Some stats on GMail’s spam filter

Update: greetings, visitors from 2006! Please pay no attention to these figures; they’re from 2004, and both GMail and SpamAssassin have undergone major changes since those days. Of historical interest only.

So, I set up a .forward to send all my personal mail to GMail to see how it coped with my spam load, and compared it against the personal SpamAssassin install I’m running these days. Here are the results:

  • test start: Mon Apr 12 15:50:39 PDT 2004
  • test end: Tue Apr 13 18:26:45 PDT 2004
  • total spam messages received by both during the test: 210
  • total ham messages received by both during the test: 528

The SpamAssassin results:

  • true positives: 189
  • false positives: 0
  • false negatives: 21
  • true negatives: 528
  • FP%: 0.00%
  • FN%: 10.00%

The GMail results:

  • true positives: 144
  • false positives: 7
  • false negatives: 66
  • true negatives: 521
  • FP%: 1.32%
  • FN%: 31.42%
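For anyone wanting to check the percentages, here is a minimal sketch of how they can be derived from the four counts, assuming FP% means false positives as a share of all ham and FN% means false negatives as a share of all spam (the function name `error_rates` is made up for illustration). With normal rounding it prints 1.33 and 31.43 for GMail, so the 1.32% and 31.42% listed above look truncated rather than rounded.

```python
def error_rates(tp, fp, fn, tn):
    """Return (FP%, FN%) from confusion-matrix counts.

    Assumed definitions:
      FP% = false positives / all ham  (fp + tn)
      FN% = false negatives / all spam (fn + tp)
    """
    fp_rate = 100.0 * fp / (fp + tn)
    fn_rate = 100.0 * fn / (fn + tp)
    return fp_rate, fn_rate


# Counts copied from the two result lists above.
RESULTS = {
    "SpamAssassin": dict(tp=189, fp=0, fn=21, tn=528),
    "GMail": dict(tp=144, fp=7, fn=66, tn=521),
}

for name, counts in RESULTS.items():
    fp_rate, fn_rate = error_rates(**counts)
    print(f"{name}: FP% = {fp_rate:.2f}, FN% = {fn_rate:.2f}")
```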

So, not too hot. But there are extenuating circumstances! ;)

  • The GMail false positives were not ‘typical’ mail, whatever that is — all of them were Mailman ‘administration required’ messages regarding spam in Mailman mailing list queues. I’d only be annoyed if I was a GMail user administrating Mailman lists. And it turns out there’s a bug in current dev SpamAssassin that now does the same thing…
  • presumably, GMail allows some element of per-user probabilistic classifier training. If so, a few ‘move to Inbox’ actions might also sort those out quite quickly, I’d guess.
  • GMail seems to be a four-way classification system. Messages can go into one of: 1. the inbox, 2. the spam box, 3. the inbox with a little green ‘Spam’ indicator, or 4. the spam box with a little green ‘Inbox’ indicator. Not sure what the latter two mean, but they may indicate some level of ‘unsure’ as per spambayes; worth noting that most of the FNs in the Inbox did not get the green ‘Spam’ indicator beside them, though.
  • I used a .forward to bounce the traffic over. So if GMail does any spam filtering at the SMTP level, on top of whatever content-filtering and probabilistic classification they’re using, it wouldn’t have had the benefit of that in this test.
  • SpamAssassin has the benefit of some user configuration; I’d got a couple of my spamtrap addresses blacklisted in the SpamAssassin config, and my Bayes databases have been trained using SpamAssassin’s autolearning.
  • this is all really unscientific, and it’s a really small sample ;)
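To put a rough number on the ‘really small sample’ caveat, here is a sketch that computes approximate 95% Wilson score intervals for the error rates, using the counts from the lists above. This was not part of the original test; the choice of the Wilson interval, the z value of 1.96, and the name `wilson_interval` are all assumptions made for illustration.

```python
import math


def wilson_interval(errors, total, z=1.96):
    """Approximate Wilson score interval for an error proportion.

    z=1.96 corresponds to roughly 95% confidence (an assumption here).
    Returns (low, high) as fractions between 0 and 1.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = errors / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# (errors, total) pairs taken from the counts in the post.
FN_COUNTS = {"SpamAssassin": (21, 210), "GMail": (66, 210)}
FP_COUNTS = {"SpamAssassin": (0, 528), "GMail": (7, 528)}

for label, table in (("FN", FN_COUNTS), ("FP", FP_COUNTS)):
    for name, (errors, total) in table.items():
        low, high = wilson_interval(errors, total)
        print(f"{label} {name}: {100 * low:.2f}% to {100 * high:.2f}%")
```

With these counts the two FN intervals should come out clearly separated, while the two FP intervals overlap, so the false-positive difference is the shakier of the two comparisons.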

Surprisingly, though, none of the SpamAssassin mailing list traffic discussing spam, throwing around spammy URLs and phrases, got caught; probably because the volume of spammy phrases in those messages is lower than in the Mailman admin stuff.

  1. Posted August 30, 2006 at 22:43 | Permalink

    I’m analyzing GMail spam a bit differently. I receive too many spam messages and it’s difficult to analyze all of them. So I’ve written a couple of PHP scripts that connect to GMail and read the spam emails’ data. They store it in a database table, and you can later review it from a webpage with some nice graphics. Check it out, follow my webpage address. Hope you like it. Regards, Jose Canciani.

  2. Posted March 20, 2007 at 21:19 | Permalink

    I want spam for testing my own SpamAssassin install.
