TREC Spam Corpus

Some news from TREC’s Gordon Cormack:

The TREC 2005 Corpus (92,000 messages - 42,000 ham; 50,000 spam) is now available for self-serve download.

TREC Spam Evaluation is a NIST program to develop methods to measure spam filter accuracy and performance. More details here.

The corpus can be picked up at Gordon’s site. As far as I can tell, this should be a pretty solid corpus for spam researchers and developers.


2 Comments

  1. Keith Stuart said,

    November 2, 2006 @ 7:35 pm

    Justin,

    How dangerous is it to download Gordon’s corpus? I am an applied linguist and would like to analyse the linguistic properties of spam; in particular, linguistic repetition or language patterns in fraudulent solicitations.

    Thanks,

    Keith

  2. Justin said,

    November 2, 2006 @ 7:41 pm

    Keith — I wouldn’t say it’s dangerous at all. If you’re worried, don’t read the messages using a “real” mail user agent like Outlook — just parse the files directly using other tools like SpamAssassin.
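
Parsing the files directly, as the reply above suggests, works with any tool that treats a message as data rather than rendering it. Here is a minimal sketch in Python, assuming each corpus message is stored as a raw RFC 822 file; the function name and the sample message are hypothetical, and the corpus's actual directory layout is not assumed. It uses only the standard `email` package to pull out the subject and plain-text body, so no HTML is rendered, no remote images are fetched, and no attachments are opened.

```python
import email
from email import policy


def extract_plain_text(raw_bytes):
    """Return (subject, text) for one raw message, read purely as data.

    Hypothetical helper for illustration; not part of the TREC tooling.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    chunks = []
    for part in msg.walk():
        # Only look at text/plain parts; HTML and attachments are skipped.
        if part.get_content_type() != "text/plain":
            continue
        try:
            chunks.append(part.get_content())
        except (LookupError, UnicodeDecodeError):
            # Spam often declares bogus charsets; fall back to latin-1,
            # which can decode any byte sequence.
            payload = part.get_payload(decode=True) or b""
            chunks.append(payload.decode("latin-1"))
    subject = msg["subject"]
    return (str(subject) if subject is not None else ""), "\n".join(chunks)


# Hypothetical sample standing in for one file from the corpus.
sample = (
    b"Subject: Hello\n"
    b"Content-Type: text/plain; charset=us-ascii\n"
    b"\n"
    b"Buy now\n"
)
subject, text = extract_plain_text(sample)
```

For a real run, you would read each message file in binary mode and pass its bytes to the function; the extracted text can then go into whatever repetition or pattern analysis you have in mind.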
