TREC Spam Corpus

Some news from TREC’s Gordon Cormack:

The TREC 2005 Corpus (92,000 messages – 42,000 ham; 50,000 spam) is now available for self-serve download.

TREC Spam Evaluation is a NIST program to develop methods to measure spam filter accuracy and performance. More details here.

The corpus can be picked up at Gordon’s site. As far as I can tell, this should be a pretty solid corpus for spam researchers and developers.



  1. Posted November 2, 2006 at 19:35


    How dangerous is it to download Gordon’s corpus? I am an applied linguist and would like to analyse the linguistic properties of spam; in particular, linguistic repetition or language patterns in fraudulent solicitations.



  2. Posted November 2, 2006 at 19:41

    Keith — I wouldn’t say it’s dangerous at all. If you’re worried, don’t read the messages using a “real” mail user agent like Outlook — just parse the files directly using other tools like SpamAssassin.
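As a rough illustration of "parsing the files directly", here is a minimal sketch that reads a raw message as bytes and pulls out the subject and plain-text body without ever opening it in a mail client. It uses Python's standard `email` package rather than SpamAssassin, and the helper name `extract_text` is made up for this example. It assumes each corpus file holds one raw RFC 822 message, which you should check against the corpus documentation.

```python
from email import policy
from email.parser import BytesParser


def extract_text(raw: bytes) -> str:
    """Return the subject line plus any text/plain parts of a raw message.

    Hypothetical helper for illustration; nothing is rendered or executed,
    the message is only parsed as data.
    """
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    pieces = [str(msg.get("Subject", ""))]
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue  # skip HTML, images, and other attachments
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "latin-1"
        try:
            pieces.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Spam often declares charsets Python does not know about.
            pieces.append(payload.decode("latin-1"))
    return "\n".join(pieces)


# Example with an in-memory message; for a corpus file you would pass
# the bytes read from disk instead (the path is up to you).
sample = (
    b"Subject: Hello\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Buy now\r\n"
)
text = extract_text(sample)
```

The resulting string can then go into whatever text-analysis tooling you prefer. Reading the files in binary mode and decoding each part defensively matters here, because spam frequently carries malformed or mislabeled encodings.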