TREC Spam Corpus

Some news from TREC’s Gordon Cormack:

The TREC 2005 Corpus (92,000 messages - 42,000 ham; 50,000 spam) is now available for self-serve download.

TREC Spam Evaluation is a NIST program to develop methods to measure spam filter accuracy and performance. More details here.

The corpus can be picked up at Gordon’s site. As far as I can tell, this should be a pretty solid corpus for spam researchers and developers.


Annoying Non-spam Tricks, pt. XVIII

Spam: OK, I just noticed that I have a few hits for the SpamAssassin rule HTTP_ENTITIES_HOST in my corpus. This searches for obfuscated hostnames in the URL links in mail messages, and is generally a very reliable sign of spam — because who would want to hide a hostname apart from spammers?

Well, Buy4Now.IE, for one, it seems. WTF? I have a mail here that uses this markup:

  <a href="http://www&#46;buy4now&#46;ie/fbd">

Totally and utterly nuts. If they really wanted a way to tickle malware detectors, mail filters, and anti-spam measures, they could hardly pick a better one. I have no idea why they did this.

grr….
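For the curious, here is a rough sketch of the kind of check involved. It is not SpamAssassin's actual HTTP_ENTITIES_HOST implementation; the function names and regexes are my own assumptions, and it only looks at the host part of `href` attributes, so an ordinary `&amp;` in a query string is left alone.

```python
import html
import re

# Numeric or named character references, e.g. &#46; or &#x2e; or &period;
ENTITY_RE = re.compile(r"&(#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")
# Tolerates one or more quote characters before the URL (sloppy markup happens).
HREF_RE = re.compile(r"""href\s*=\s*["']+([^"'>\s]+)""", re.IGNORECASE)


def raw_host(url: str) -> str:
    """Return the host portion of a URL *before* any entity decoding."""
    rest = url.split("://", 1)[-1]
    # Do not split on '#': numeric entities contain it.
    return re.split(r"[/?]", rest, maxsplit=1)[0]


def entity_obfuscated_hosts(markup: str) -> list[str]:
    """Return the decoded hostnames of links whose host was written with entities."""
    hits = []
    for match in HREF_RE.finditer(markup):
        host = raw_host(match.group(1))
        if ENTITY_RE.search(host):
            hits.append(html.unescape(host))
    return hits
```

Fed the Buy4Now link above, this reports `www.buy4now.ie` as an entity-obfuscated host, which is exactly the pattern the rule treats as suspicious.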


Spam load and Hallowe’en

Spam: The volume of spam continues to rise inexorably. Brightmail are now estimating that 54% of all mail messages are spam.

Nowadays, my personal mail account is getting about 70 spams a day, rising to over 200 a day at the weekends. It’s getting tiresome; pretty much all of it gets marked as spam and diverted, but I still have to wade through it ‘just in case’, and to build the corpus. I guess I need to extend my .procmailrc to divert high-scoring spams somewhere I can check even less frequently ;)

That’s not the really annoying thing, though. Most of the time, I use tagged addressing when I publish my email address. It works very well to identify spam sources overall, and to divert ‘dead’ addresses that are getting spam into the spamtraps. That’s the plus.

But the curse of writing spam filters is that you need a good archive of spam; and one of our SpamAssassin corpus guidelines is to attempt to trim out duplicate spams where possible. Many spammers will wind up sending more-or-less identical spam messages, modulo random subject lines, hash-busters, etc., and with (let’s say) 8 tagged addresses in their lists, I’ll get 8 copies of that spam, and have to pay a little bit of attention to trim it down to 1 copy for the corpus.
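A minimal sketch of how that trimming could be automated, assuming near-duplicates differ only in their headers, in the tagged recipient address, and in hash-buster strings. The `user+tag@host` address pattern and the "long run of mixed letters and digits" hash-buster heuristic are both assumptions for illustration, not anything from the SpamAssassin corpus guidelines.

```python
import hashlib
import re
from email import message_from_string

# Assumption: tagged addresses look like user+tag@example.com.
TAGGED_ADDR_RE = re.compile(r"[\w.\-]+\+[\w.\-]+@[\w.\-]+")
# Assumption: hash-busters are words of 8+ characters mixing letters and digits.
HASH_BUSTER_RE = re.compile(r"\b(?=\w*\d)(?=\w*[a-z])\w{8,}\b")


def fingerprint(raw_message: str) -> str:
    """Hash the body only, so random subject lines and recipients are ignored."""
    msg = message_from_string(raw_message)
    body = "\n".join(
        part.get_payload() for part in msg.walk() if not part.is_multipart()
    )
    text = body.lower()
    text = TAGGED_ADDR_RE.sub("", text)
    text = HASH_BUSTER_RE.sub("", text)
    text = " ".join(text.split())  # collapse whitespace differences
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def trim_duplicates(raw_messages):
    """Keep the first message seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for raw in raw_messages:
        key = fingerprint(raw)
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique
```

Anything this crude will miss cleverer variations, so it would cut down the manual trimming rather than replace it.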

Damn spam-filter development! All this corpus building is hard work ;)

BTW, note how spam load rises at the weekends. (Tim Hunter, Paul Terry and Alan Judge of eircom.net also noted this in their paper presented at LISA ’03 yesterday ;) There’s a good reason: spammers attempt to deliver their spam while abuse staff are not at their desks. The same thing applies in the network security world; many attacks have taken place over a US holiday weekend.

Hallowe’en: best too-late idea for a Hallowe’en costume: ‘Top Gun GWB’ in his flight suit. In the end, I played half of the ‘Dr. Frankenstein and Monster’ pair (I was the monster, as C really is a scientist, and computer ‘science’ doesn’t count). Best costume seen: a very impressive onnagata kabuki player.


Lotsa SpamConf linkage and commentary

Another good trip report, from ‘babbage’ at perl.org.

  • Again, and interestingly, quite a few folks agreed with one of SA’s core tenets: no single approach (stats, RBLs, rules, distributed hashes) can filter effectively on its own, as spammers will soon figure out a way to subvert that technique. However, if you combine several techniques, they cannot all be subverted at once, so your effectiveness in the face of active attacks is much better.

  • Also interesting to note how everyone working with learning-based approaches commented on how hard it was to persuade ‘normal people’ to keep a corpus. Let’s hope SA’s auto-training will work well enough to avoid that problem.

  • in passing — babbage noted the old canard about Hotmail selling their user database to spammers. That must really piss the Hotmail folks off ;) I think it’s much more likely that, with Moore’s Law and the modern internet, a dictionary attack *will* find your account eventually.

  • Good tip on the legal angle from John Praed of The Internet Law Group: if a spam misuses the name of a trademarked product like ‘Viagra’, get a copy to Pfizer pronto. Trademark holders have a particular desire to follow up on infringements like this, as an undefended trademark loses its TM status otherwise.

  • David Berlind, ZDNet executive editor: ‘They don’t want to be involved (in developing an SMTPng)’. He might say that, but I bet their folks working on sending out their bulk-mailed email newsletters might disagree ;). Legit bulk mail senders have to be involved for it to work, and they will want to be involved, too.

  • Brightmail have a patent on spam honeypots? Must take a look for this sometime.

  • the plural of ‘corpus’ is ‘corpora’ ;)

Great report, overall.
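The ‘combine several techniques’ point from the list above can be illustrated with a toy scorer. Everything here is hypothetical: the rule names, the header strings standing in for real blocklist and hash lookups, the per-rule scores and the threshold are all made up for the sketch and are not SpamAssassin's rules or values.

```python
from typing import Callable, List, Tuple

# (name, test, score) triples; each test is a stand-in for a real technique.
Rule = Tuple[str, Callable[[str], bool], float]

RULES: List[Rule] = [
    ("SHOUTY_SUBJECT", lambda m: "FREE!!!" in m, 2.0),               # content rule
    ("ON_BLOCKLIST", lambda m: "X-Blocklisted: yes" in m, 3.0),      # pretend RBL hit
    ("KNOWN_BULK_HASH", lambda m: "X-Bulk-Hash-Seen: yes" in m, 3.0),  # pretend hash hit
]

THRESHOLD = 5.0  # arbitrary cut-off for this sketch


def score(message: str, rules: List[Rule] = RULES) -> float:
    """Sum the scores of every rule that fires on the message."""
    return sum(points for _name, test, points in rules if test(message))


def is_spam(message: str) -> bool:
    return score(message) >= THRESHOLD
```

A spammer who rewrites the subject line to dodge the content rule still trips the other two checks, so the message stays over the threshold; a single shouty subject line on its own does not.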

It’s interesting to see that Infoworld notes that reps from AOL, Yahoo! and MS were all present.

Since the conf, Paul Graham has a new paper up about ‘Better Bayesian Filtering’, and lists some new tokenization techniques he’s using:

  • keep dollar signs, exclamation and most punctuation intact (we do that!)

  • prepend header names to header-mined tokens (us too!)

  • case is preserved (ditto!)

  • keep ‘degenerate’ tokens; ‘Subject:FREE!!!’ degenerates to ‘Subject:free’, to ‘FREE!!!’, and ‘free’. (ditto! well, partly. We use degeneration of tokens, but we keep the degenerate tokens in a separate, prefixed namespace from the non-degenerate ones, as he contemplates in footnote 7. It’s worth noting that case-sensitivity wasn’t worth the database bloat it produced: each token had to be duplicated into the case-insensitive namespace, which doubled the database size, and the hit-rate didn’t go up nearly enough to make it worthwhile.)

Most of these were also discovered and verified experimentally by SpamBayes, BTW.
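To make the ‘degenerate token’ idea concrete, here is a small sketch. It is neither Paul Graham's code nor SpamAssassin's tokenizer: the function names and the `degen:` prefix are hypothetical, and the only degradations modelled are dropping the header name, lowercasing, and stripping trailing punctuation.

```python
import string

DEGENERATE_PREFIX = "degen:"  # hypothetical namespace marker


def degenerate_forms(token: str) -> list[str]:
    """Return less specific variants of a token, excluding the token itself."""
    if ":" in token:
        header, word = token.split(":", 1)
        prefixes = [header + ":", ""]  # with and without the header name
    else:
        word = token
        prefixes = [""]

    forms = []
    for prefix in prefixes:
        for base in (word, word.rstrip(string.punctuation)):
            for cased in (base, base.lower()):
                candidate = prefix + cased
                if cased and candidate != token and candidate not in forms:
                    forms.append(candidate)
    return forms


def tokens_for_lookup(token: str) -> list[str]:
    """Exact token first, then its degenerate forms in a separate namespace."""
    return [token] + [DEGENERATE_PREFIX + form for form in degenerate_forms(token)]
```

For ‘Subject:FREE!!!’ the variants include ‘Subject:free’, ‘FREE!!!’ and ‘free’, as in the example above. Prefixing them keeps a degenerate ‘free’ from being counted together with a token that genuinely appeared as ‘free’.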

When we were working on SpamAssassin’s Bayesian-ish implementation, we took a scientific approach, and used suggestions from the SpamBayes folks and from the SpamAssassin community on tokenizer and stats-combining techniques. We then tested these experimentally on a test corpus, and posted the results. In almost all cases, our results matched up with the SpamBayes folks’ results, which is very nice, in a scientific sense.

(PS: update on the Fly UI story — ‘apis’ is not French, it’s Latin. oops! Thanks Craig…)
