TREC Spam Corpus

Some news from TREC’s Gordon Cormack:

The TREC 2005 Corpus (92,000 messages - 42,000 ham; 50,000 spam) is now available for self-serve download.

TREC Spam Evaluation is a NIST program to develop methods to measure spam filter accuracy and performance. More details here.

The corpus can be picked up at Gordon’s site. As far as I can tell, this should be a pretty solid corpus for spam researchers and developers.


playing around with Google Suggest

Web: Google Suggest, a drop-down list of suggestions — with hitrates! The one letter hits are interesting, too.

“spam” hitrates, the top 3 (aside from “spam” itself):

  • “spam filter”: 6,400,000 results
  • “spamcop”: 1,570,000
  • “spamassassin”: 1,350,000

SpamAssassin's in the top 3. getting there!

unfortunately, you have to get as far as “justin ma” before my name shows up, so not doing too great in that competition. ;)


Open source v closed-source spam filtering

Spam: I’m quoted in New Scientist! w00t!

Slashdot picked it up pretty quickly. One comment there misses the point, though:

This is interesting and promising technology. But like all antispam techniques, spammers will find a way around it. Once spammers get a copy of the software, they can create and test countermeasures in the comfort of their own sleazy lairs.

It’s worth talking about this. Newsflash: spammers have no difficulty testing their spam against closed-source spam filters, even when they can’t ‘get a copy’ and test them in ‘their sleazy lairs’.

How do they do it? Easy: just set up an account at a site that uses that filter (for AOL, Yahoo!, Hotmail, and Gmail, it’s pretty obvious how to do that; for other closed-source filters, find an ISP that uses it). Then send ‘test mails’ repeatedly to that account, and use trial and error to see what gets past the filter and what doesn’t. Eventually, they figure out what works against that filter.

How did I figure this out? Well, I came across the manual for the Send-Safe ratware on-line. It noted that the ‘hashbuster’ randomisation technique, which we in the SpamAssassin team had long assumed was intended to block hash matches by DCC, Pyzor and Razor, was in fact intended to block AOL’s implementation of that system. The open source ones weren’t even mentioned.

Update: found it — from their FAQ:

Mime Encoded content

If you want to get into AOL… use it.

MIME encoders allow you to send documents written within a specific application through email without causing readability or formatting problems. For example, you can send a letter created in MSWord with and be certain that it arrives at its destination in the same format by encoding it with MIME first. The recipient then decodes it back into the original MSWord format.

That isn’t why we use it though.

We use it to cause ‘uniqueness’.

When you put a rotate tag at the beginning of a MIME encoded email, it causes everything after that point (including checksums) to be ‘different’ in every message.

Why is that that important?

Because it throws off filters that look for many copies of the same message to nuke.
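To see why that works against a filter that counts exact copies, here's a minimal Python sketch. It is not how DCC, Pyzor, Razor or AOL actually compute their signatures (none of that is modelled here); it just assumes a naive filter that takes a plain SHA-1 of the message body. The names `body_checksum` and `add_random_token` are made up for illustration.

```python
import hashlib
import random
import string


def body_checksum(body: str) -> str:
    """Naive 'bulk detector' signature: an exact SHA-1 of the body text."""
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


def add_random_token(body: str, rng: random.Random, length: int = 12) -> str:
    """Prepend a random string, standing in for the 'rotate tag' in the FAQ."""
    token = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return f"{token}\n{body}"


rng = random.Random(0)  # seeded only so the demo is repeatable
base = "Buy now! Limited offer.\n"

# Three byte-identical copies collapse to a single checksum...
identical_sums = {body_checksum(base) for _ in range(3)}

# ...while three 'randomised' copies each get a different one.
varied_sums = {body_checksum(add_random_token(base, rng)) for _ in range(3)}

print(len(identical_sums), "distinct checksum(s) for identical copies")
print(len(varied_sums), "distinct checksum(s) for randomised copies")
```

An exact-match counter sees the randomised copies as unrelated messages, which is the ‘uniqueness’ the FAQ is talking about.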


Spamometer

Spam: The Spamometer; a 1997-vintage spamfilter along the lines of filter.plx. Interestingly, I hadn’t seen this before — who knows, if I had, SpamAssassin could have used a (0.0, 1.0) scoring system instead of the ‘5 point threshold’. ;) (Thanks, Gary!)


nose-picking

Funny: According to a ‘top Austrian doctor’, picking your nose and eating it is good for you:

‘Medically it makes great sense and is a perfectly natural thing to do. In terms of the immune system the nose is a filter in which a great deal of bacteria are collected, and when this mixture arrives in the intestines it works just like a medicine.

‘Modern medicine is constantly trying to do the same thing through far more complicated methods, people who pick their nose and eat it get a natural boost to their immune system for free.’


Blocked By SonicWall!

Censorship: This is pretty funny — a friend writes that SonicWall’s ‘Content Filter’ has judged my home page and FOUND IT WANTING:

  The URL
  http://jmason.org/
  is currently rated as:
  category 4 - Pornography

w00t! It’s true, I have some pretty hot pics up there — the accuracy of their content filtering product amazes me!


‘Goblin-fancier’?

Insults: Tom takes issue with my assumption that ‘anyone not living in a hole would know that SpamAssassin includes a probabilistic classifier’. Hmm. OK, I should have made it clear I meant anyone following anti-spam filter development. Henceforth I’ll over-qualify every statement on this weblog accordingly.

But at least I know that badgers are CLEARLY down, since they do live in a hole. DO YOUR RESEARCH, FARRELL.


Spam filters and FTC’s ‘Do Not Call’ list

Wired News: Yahoo! Spam Filter Thwarts FTC:

Consumers who used Yahoo Mail e-mail accounts to register for the Federal Trade Commission’s new do-not-call service were met with an ironic twist Friday — Yahoo’s spam filter intercepted confirmation messages sent from FTC servers.

‘Our tests showed that Yahoo’s spam filter was automatically sending the confirmation messages from the do-not-call list into users’ bulk-mail folders,’ said NetFrameworks co-founder and CTO Eric Greenberg. ‘The irony of it is that the spam filter is blocking the very thing that’s supposed to help you stop getting spam over the phone.’

FWIW, I signed up, without any hitches.

As noted elsewhere, their mail-sending systems were massively overloaded – an insane number of people were signing up at the same time, from what I’ve heard.

But a day later, the confirmation message eventually came through, and got run through my ‘dogfood’ SpamAssassin 2.60 installation. That gave it -5.2 points. Not bad, considering they didn’t have reverse DNS records for the machines sending the mails out ;) (update: they do now, btw.)

In case you’re wondering, the tests it hit were: BAYES_00,CLICK_BELOW,DATE_IN_PAST_12_24,NO_REAL_NAME. Pretty respectable, really. Aside: that message getting a BAYES_00 match is impressive, given that (a) that Bayes db was initialized entirely from auto-learned mails, no hand-training; and (b) I’d never received a mail from the Do Not Call registry operators before.

Tamales: this is cool — San Francisco’s boozy culture paid homage last night to ‘The Tamale Lady’:

Tonight, Zeitgeist will swell again for Ramos’ 50th birthday party. There, San Francisco filmmaker Cecil B. Feeder will premiere his mini-documentary ‘Our Lady of Tamale,’ featuring 30-second songs submitted by dozens of San Francisco musicians.

Isn’t that nice. Ben says it went well. Somehow or other we missed her tamales last time we were up, but I’ll be sure to get one next time…


minor bloglet

New Scientist: Turing tests filter spam email. “Simple tests designed to distinguish computers from humans are increasingly being used to clamp down on unsolicited, or ‘spam’, email advertising.”

The article notes that Yahoo! has imposed such a test to block automated account-signup-then-spam bots. (Thankfully — that might discourage some of the more automated 419 spammers.)

Sorry ’bout the lack of blogging — very busy ’round here, what with a new SpamAssassin release in the pipeline and a move to the US in the offing…


MAPS gets the TCR treatment, a public corpus, and a wedding

Found on Paul Graham’s site: “according to a recent study, the MAPS RBL, probably the best known blacklist, catches only 24% of spam, with 34% false positives. It would take a conscious effort to write a content-based filter with performance that bad.”

The “recent study” is by David Nelson at Giga Information Group, sometime last year.

For the sake of it, I’ve checked out how the MAPS figures stack up using TCR, Ion Androutsopoulos’ metric for measuring spam filter performance. TCR is a very nice single-figure metric, which takes into account the “inconvenience factor” of misfiled mails: each legitimate mail misfiled as spam is weighted by a “lambda” factor, which depends on what action is taken when a mail is classified as spam. For MAPS, I’m assuming a lambda of 9, the guideline figure for systems which bounce mail back to the sender, instead of 1 for simple tagging, or 999 for outright deletion with no notification.

So: using a lambda of 9, MAPS gets a TCR of 0.0912, a Spam Recall of 24%, and a Spam Precision of 17%. It’s worth noting that the baseline figure for TCR is 1.0, which represents no filtering whatsoever: i.e. all the spam comes right into your mailbox.

In other words, using MAPS is more inconvenient all-round than not filtering your mail at all, if these figures are to be believed ;)
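For anyone who wants to check the arithmetic, here's a rough Python sketch of the TCR calculation, using the usual definition: TCR = (number of spam) / (lambda × ham misfiled as spam + spam missed). The message counts behind the study aren't given above, so the mix of 300 spam and 1,000 non-spam below is purely an assumption, picked because it happens to reproduce the quoted figures. `FilterCounts` and `tcr` are just illustrative names.

```python
from dataclasses import dataclass


@dataclass
class FilterCounts:
    spam_caught: int   # spam correctly classified as spam
    spam_missed: int   # spam that reached the inbox
    ham_blocked: int   # legitimate mail wrongly classified as spam


def tcr(c: FilterCounts, lam: float) -> float:
    """Total Cost Ratio: values above 1.0 beat having no filter at all."""
    n_spam = c.spam_caught + c.spam_missed
    return n_spam / (lam * c.ham_blocked + c.spam_missed)


def recall(c: FilterCounts) -> float:
    """Fraction of all spam that the filter caught."""
    return c.spam_caught / (c.spam_caught + c.spam_missed)


def precision(c: FilterCounts) -> float:
    """Fraction of flagged mail that really was spam."""
    return c.spam_caught / (c.spam_caught + c.ham_blocked)


# ASSUMED corpus: 300 spam, 1,000 ham (not stated in the post).
# 24% of the spam caught -> 72; 34% of the ham blocked -> 340.
maps = FilterCounts(spam_caught=72, spam_missed=228, ham_blocked=340)

print(f"TCR (lambda=9): {tcr(maps, 9):.4f}")
print(f"Spam recall:    {recall(maps):.0%}")
print(f"Spam precision: {precision(maps):.0%}")
```

With no filter at all, nothing is blocked and every spam is missed, so the ratio comes out at exactly 1.0; that's the baseline MAPS falls so far short of here.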

More spam: I’ve just assembled a totally-public corpus of spam and non-spam mail, to allow spamfilter developers to compare and contrast results using the same data. Let’s hope it proves useful.

Not spam: finally, I’m off to Chester for a wedding tomorrow morning; my good mates Kitty and Gerry are tying the knot, in Chester Zoo, no less. Let’s hope this horrible cold I’ve had all week dies down before Saturday…
