TREC Spam Corpus
Some news from TREC’s Gordon Cormack:
The TREC 2005 Corpus (92,000 messages: 42,000 ham and 50,000 spam) is now available for self-serve download.
TREC Spam Evaluation is a NIST program to develop methods to measure spam filter accuracy and performance. More details here.
The corpus can be picked up at Gordon’s site. As far as I can tell, this should be a pretty solid corpus for spam researchers and developers.

Keith Stuart said,
November 2, 2006 @ 7:35 pm
Justin,
How dangerous is it to download Gordon’s corpus? I am an applied linguist and would like to analyse the linguistic properties of spam; in particular, linguistic repetition or language patterns in fraudulent solicitations.
Thanks,
Keith
Justin said,
November 2, 2006 @ 7:41 pm
Keith, I wouldn’t say it’s dangerous at all. If you’re worried, don’t read the messages using a “real” mail user agent like Outlook. Instead, parse the files directly with other tools, such as SpamAssassin.
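
For linguistic analysis, one way to "parse the files directly" is a short script that pulls out only the subject and plain-text parts, so no HTML is rendered and no attachment is opened. The following is a minimal sketch using Python's standard `email` package. It assumes each corpus file holds one raw RFC 822 message; the function names `extract_text` and `extract_text_from_file` are made up for this illustration and are not part of the corpus or of SpamAssassin.

```python
from email import message_from_bytes, policy
from pathlib import Path


def extract_text(raw: bytes) -> str:
    """Return the subject plus all plain-text body parts of a raw message.

    Nothing is rendered or executed: HTML parts and attachments are skipped.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    chunks = [str(msg.get("Subject", ""))]
    for part in msg.walk():
        # Multipart containers, HTML parts and attachments are ignored.
        if part.get_content_type() != "text/plain" or part.is_attachment():
            continue
        try:
            chunks.append(part.get_content())
        except (LookupError, UnicodeError):
            # Spam often declares a bogus charset; fall back to a lossy decode.
            payload = part.get_payload(decode=True) or b""
            chunks.append(payload.decode("latin-1", errors="replace"))
    return "\n".join(chunks)


def extract_text_from_file(path: str) -> str:
    """Read one message file as bytes and extract its text (hypothetical helper)."""
    return extract_text(Path(path).read_bytes())
```

Calling `extract_text_from_file` on each message file yields plain strings that can go straight into a concordancer or n-gram counter. Badly malformed spam may still need extra error handling around the header lookup.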