CEAS needs your ham

CEAS 2008 is doing another Spam Challenge test of various spam-filters, and as part of this, they need samples of ham mail messages.

As part of the data collection effort, we have set up a website through which it is possible to donate non-sensitive legitimate email, to be used in the evaluation. Any kind of email that the recipient considers legitimate is welcome, including computer generated (non-spam) messages.

After the CEAS evaluation, the benchmark data will be made publicly available to facilitate future research and development in the field of spam prevention.

Here is the collection site; they accept UNIX mbox format, and tar.gz or zip files of same, with an 8MB upload limit.

Spammers “giving up” according to Google

According to this Wired story, Google reckons spammers are giving up on spam:

a remarkable trend is underfoot, according to Brad Taylor, a staff software engineer at Google: The number of spam attempts — that is, the number of junk messages sent out by spammers — is flat, and may even be declining for the first time in years.

Actually, this is a wilful misunderstanding of what the Googler in question really said, which was that ‘attempts to spam Gmail users have been leveling off over the last year and more recently, even declining slightly’. In other words, they didn’t make an observation about the state of the spam problem on an internet-wide basis — just about the “local” situation as it pertains to Gmail. Bad reporting there, Wired.

But, in passing…

David Berlind at ZDNet recently blogged a rather grumpy response to InfoWorld coverage of CEAS 2007. He raised a very important point:

If I could say something to the author of that story, it would be that so long as any anti-spam solution is not deployed universally throughout the Internet’s e-mail system (in other words, so long as some anti-spam tech is not a standard), that anti-spam solution actually makes the spam problem worse. You read that right. Worse. Proprietary anti-spam solutions make the global spam problem worse. They are digging us deeper into the hole that the Internet is already in because everyone who makes those solutions is under the false belief that “s/he who is finally successful at filtering out all spam while allowing the legitimate mail in wins.”

Google’s blog post is a case in point: ‘we’re keeping more spam out of your inbox than ever before, so more and more, you can use Gmail for things you enjoy without even realizing that the spam filter is there most of the time.’

That’s great — but it doesn’t help anyone except Gmail. It’s a myopic view of the spam problem, and David’s point stands.

(I disagree with his later conclusion that the only way forward is for Google, MS, AOL and Yahoo! to get together and ‘commit to jointly supporting the same technical solutions’ — when the usual BigCos get together, they tend to focus on their own priorities. Take what happened back in 2005 with nofollow for blog-spam — while it helped the search giants with their own overriding priority, which was to tweak their algorithms to filter out the spam on the search results page, it did nothing to slow the spam flood itself, which has continued unabated.)

We need more open-source, and open-data, anti-spam work.

CEAS

Spam: back from CEAS. The schedule with links to full papers is up, so anyone can go along and check ‘em out, if you’re curious.

Overall, it was pretty good — not as good as last year’s, but still pretty worthwhile. I didn’t find any of the talks to be quite up to the standards of last year’s TCP damping or Chung-Kwei papers; but the ‘hallway track’ was unbeatable ;)

Here are my notes:

AOL’s introductory talk had some good figures; a Pew study reported that 41% of people check email first thing in the morning, 40% have checked in the middle of the night, and 26% don’t go more than 2-3 days without checking mail. It also noted that URLs spimmed (spammed via IM) are not the same as URLs spammed, but the obfuscation techniques are the same; and they’re using 2 learning databases, per-user and global, and the ‘Report as Spam’ button feeds both.

Experiences with Greylisting: John Levine’s talk had some useful data: there are still senders, such as an ‘old version of Lotus Notes’, that treat a 4xx SMTP response (temp fail) as 5xx (permanent fail), particularly after the end of the DATA phase of the transaction; and there are some legit senders, such as Kodak’s mail-out systems, which regenerate the body in full on each send, even after a temp fail, so the body will look different. He found that less than 4% of real mail from real MTAs is delayed, and overall, 17% of his mail traffic was temp-failed. For the nonspam that was delayed, the gap between first tempfail and eventual delivery peaked at 400 and 900 seconds.
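For anyone who hasn't run into greylisting, here is a minimal Python sketch of the basic idea: temp-fail the first attempt from an unseen (IP, sender, recipient) triplet, and accept a retry once a minimum delay has passed. The `Greylist` class, the 300-second delay and the response strings are my own assumptions for illustration, not details from John's talk or from any particular implementation.

```python
import time


class Greylist:
    """Temp-fail the first attempt from an unseen (ip, sender, recipient) triplet."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay   # seconds a sender must wait before a retry is accepted (assumed value)
        self.clock = clock           # injectable clock, so the logic can be exercised without sleeping
        self.first_seen = {}         # triplet -> timestamp of the first delivery attempt

    def check(self, client_ip, mail_from, rcpt_to):
        """Return the SMTP reply this sketch would give for one delivery attempt."""
        triplet = (client_ip, mail_from.lower(), rcpt_to.lower())
        now = self.clock()
        # Record the first attempt; later attempts reuse the stored timestamp.
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            return "250 OK"
        # 4xx means "temporary failure": a well-behaved MTA queues the mail and retries.
        return "451 4.7.1 Greylisted, please try again later"
```

This version keys only on the envelope, so a sender that regenerates the body on each retry would still get through on the second attempt; a variant that also fingerprints the body after DATA would see such a retry as a brand-new message. Senders that treat the 4xx reply as permanent never retry at all, which is the failure case the talk describes.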

As usual, there were a variety of ‘antispam via social networks’ talks – there always are. Richard Clayton had a great point about all that: paraphrasing, I trust my friends and relatives on some things, and they are in my social networks — but I don’t trust their judgement of what is and is not spam. (If you’ve ever talked to your mother about how she always considers mails from Amazon to be spam, you’ll know what he means.)

Combating Spam through Legislation: A Comparative Analysis of US and European Approaches: the EU ‘opt-in’ directive is now transposed everywhere in the EU; EU citizens who are spammed by someone in another EU country should send their reports to the antispam authority in the sender’s country; and there’s something called ‘ECNSA’, an EU contact network of spam authorities, which sounds interesting (although ungoogleable).

Searching For John Doe: Finding Spammers and Phishers: MS’ antispam attorney, Aaron Kornblum, had a good talk discussing their recent court cases. Notably, he found one case where an Austrian domain owner had set up a redirector site which sounded like it was expressly set up for spam use, which was news to me (and worrying).

A Game Theoretic Model of Spam E-Mailing: Ion Androutsopoulos gave a very interesting talk on a game theoretic approach to anti-spam — it was a little too complex for the time allotted, but I’d say the paper is worth a read.

Understanding How Spammers Steal Your E-Mail Address: An Analysis of the First Six Months of Data from Project Honey Pot: Matthew Prince of Project Honey Pot had some excellent data in this talk; recommended. He’s found that there’s an exponential relationship between Google PageRank and spam received at scraped addresses, which matches my theory of how scrapers work; and that only 3.2% of address-harvesting IPs are in proxy/zombie lists, compared to 14% of spam SMTP delivery IPs. (BTW, my theory is that address scraping generally uses Google search results as a seed, which explains the former.)

Computers beat Humans at Single Character Recognition in Reading based Human Interaction Proofs (HIPs): this presented some great demonstrations of how a neural network can be used to solve HIPs (aka CAPTCHAs) automatically. However, I’m unsure how useful this data is, given that the NN required 90000 training characters to achieve the accuracy levels noted in the paper; unless the attacker has access to their own copy of the HIP implementation they can run themselves, they’d have to spend months performing HIPs to train it, before an attack is viable.

Throttling Outgoing SPAM for Webmail Services: cites Goodman in ACM E-Commerce 2004 as saying that ESP webmail services are a ‘substantial source of spam’, which was news to me! (less than 1% of spam corpora, I’d guess). It then discusses requiring the submitter of email via an ESP webmail system to perform a hashcash-style proof-of-work before their message is delivered. By using a Bayesian spam filter to classify submitted messages, the ESP can cause spammers to perform more work than non-spammers, thereby reducing their throughput. Didn’t strike me as particularly useful; Yahoo!’s Miles Libbey got right to the heart of the matter, asking if they’d considered a situation where spammers have access to more than one computer; they had not. A better paper for this situation would be Alan Judge’s USENIX LISA 2003 one, which discusses more industry-standard rate-limiting techniques.
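To make the scheme a bit more concrete, here is a rough Python sketch of a hashcash-style puzzle whose difficulty scales with the classifier's spam probability. Everything specific in it is my own assumption for illustration and not taken from the paper: the function names, the use of SHA-256 (classic hashcash uses SHA-1), and the 8 to 20 bit difficulty range.

```python
import hashlib
from itertools import count


def required_bits(spam_probability, min_bits=8, max_bits=20):
    """Map a classifier score in [0, 1] to a puzzle difficulty (illustrative range)."""
    p = min(max(spam_probability, 0.0), 1.0)
    return min_bits + round(p * (max_bits - min_bits))


def leading_zero_bits(digest):
    """Count the leading zero bits of a hash digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def puzzle_digest(message_id, nonce):
    return hashlib.sha256(f"{message_id}:{nonce}".encode()).digest()


def mint(message_id, bits):
    """Client side: brute-force a nonce; expected cost is about 2**bits hashes."""
    for nonce in count():
        if leading_zero_bits(puzzle_digest(message_id, nonce)) >= bits:
            return nonce


def verify(message_id, nonce, bits):
    """Server side: a single hash checks the submitted work."""
    return leading_zero_bits(puzzle_digest(message_id, nonce)) >= bits
```

Each extra bit doubles the expected work for the submitter while verification stays at one hash, so a spammy-looking message costs far more to send than a clean one. The cost is per machine, though, so it divides neatly across however many machines the sender controls.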

SMTP Path Analysis: IBM Research’s anti-spam team discussed something very similar to several techniques used in SpamAssassin. Our versions have been around for a while: the auto-whitelist (which tracks the submitter’s IP address truncated to its /16 network) dates from 2001 or 2002, and the Bayes tweaks we added from bug 2384 date back to 2003.
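As a rough illustration of the auto-whitelist idea, here is a simplified Python sketch: keep a running average score per (sender address, /16 network) pair and pull each new message's score part of the way towards that average. This is not SpamAssassin's actual code; the `AutoWhitelist` class, `awl_key` helper and the 0.5 factor are assumptions made for the example, and it only considers IPv4.

```python
import ipaddress
from collections import defaultdict


def awl_key(sender, ip):
    """Key on the sender address plus the /16 network containing the IPv4 address."""
    network = ipaddress.ip_network(f"{ip}/16", strict=False)
    return (sender.lower(), str(network.network_address))


class AutoWhitelist:
    def __init__(self):
        # key -> [sum of scores seen so far, number of messages]
        self.totals = defaultdict(lambda: [0.0, 0])

    def adjust(self, sender, ip, score, factor=0.5):
        """Return the score nudged towards this sender's historical mean, then record it."""
        key = awl_key(sender, ip)
        total, n = self.totals[key]
        # With no history there is nothing to average against, so the score is unchanged.
        adjusted = score if n == 0 else score + (total / n - score) * factor
        self.totals[key] = [total + score, n + 1]
        return adjusted
```

Masking to the /16 means a sender whose mail arrives from several nearby relays shares one history, while a forged From: address used from an unrelated network starts from scratch.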

Naive Bayes Spam Filtering Using Word-Position-Based Attributes: an interesting tweak to Bayesian classification using a ‘distance from start’ metric for the tokens in a message. Worth trying out for Bayesian-style filters, I think.
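One cheap way to experiment with this in an existing Bayesian-style filter is to tag each token with a coarse position bucket before training, so the same word near the top of a message and further down becomes two different features. The Python sketch below shows that general idea only; the `position_tokens` name, the whitespace tokenisation and the bucket size are my assumptions, and the paper's actual attribute definition may well differ.

```python
def position_tokens(text, bucket_size=50):
    """Tag each word with the index of the bucket it falls in, counted from the start."""
    return [
        f"{word.lower()}@{index // bucket_size}"
        for index, word in enumerate(text.split())
    ]
```

The resulting tokens can be fed to the classifier in place of (or alongside) the plain words; smaller buckets give finer position information at the cost of sparser training data.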

Good Word Attacks on Statistical Spam Filters: not so exciting. A bit of a rehash of several other papers — jgc’s talk at the MIT conference on attacking a Bayesian-style spam filter, the previous year’s CEAS paper on using a selection of good words from the SpamBayes guys, and it entirely missed something we found in our own tech report — that effective attacks will result in poisoned training data, with a significant bias towards false positives. In my opinion, the latter is a big issue that needs more investigation.

Stopping Outgoing Spam by Examining Incoming Server Logs: Richard Clayton’s talk. Well worth a read. It’s an interesting technique for ISPs: detecting outgoing spam by monitoring hits to your MX from your own dialup pools that use known ratware patterns.

CEAS Roundup

Spam: So, CEAS was great fun, and very educational:

  • Got to meet up with various antispammers, including Daniel and Theo from the SpamAssassin dev team, Jeff Chan from SURBL, Dan Kohn from Habeas, Catherine Hampton from The SpamBouncer, Miles Libbey, John Levine, Neil Schwartzman — lots of good chats.
  • MS really know how to feed a conference! I hear rumours there was an extra-special tinned-meat-product-based dish at the banquet…
  • But their firewalling tendencies put a serious damper on keeping in touch with the outside world, at least until we set up an SSH tunnel on port 443 ;)
  • During a lull, Dan Kohn fired off a hands-up census — a good 75% of the attendees (roughly) admitted to using SpamAssassin!

My highlight papers:

  • IBM’s Chung-Kwei pattern-discovery system — the one which Mark dug up. Very interesting stuff; it turns out that bioinformatics is full of large corpora of data (genomes) which you then need to find patterns in. Funnily enough, so is SpamAssassin: s/genomes/spam/, s/patterns/regular expressions/. The more advanced pattern-discovery algorithms even allow complex patterns to contain alternative blocks, ‘don’t-cares’ and similar regular-expression-like features.

    The really good bit of Chung-Kwei is the Teiresias algorithm (more pages, online demo). Of course, being IBM research, it’s probably patented to the hilt, and may be tricky to license; but it’s certainly pointed us in a whole new interesting direction — anyone know any bioinformaticians?

    IBM is really gearing up on anti-spam research. 4 of the 6 papers listed on their website were presented this year, at CEAS.

  • Another good paper was On Attacking Statistical Spam Filters, by Gregory L. Wittel and S. Felix Wu, which (similarly to Henry Stern’s submission, which I helped a little with) dealt with an attack on Bayesian filters.

    This is interesting stuff; we’re pretty sure it’s not as serious as it could possibly be, in SpamAssassin’s implementation, but it’s still a serious attack.

  • The Impact of Feature Selection on Signature-Driven Spam Detection was an interesting paper on AOL’s new signature schemes. (The conference was sponsored by Cloudmark, BTW, but those guys were nowhere to be seen — in which case they missed this presentation ;)
  • Reputation Network Analysis for Email Filtering was interesting, in that it mirrors to a degree the thinking behind web-o-trust.org, but in my opinion suffered due to a lack of thought about avoiding spoofing (by including IP address information in the FOAF file, it could do this now). However, once SPF becomes pervasive, this could be combined with that to generate personalised webs of trust usable for email whitelisting.
  • Resisting SPAM Delivery by TCP Damping was very nifty; plug a classifier into your MTA, and thereby detect connections from spam relays. Once you’ve found them, you then throttle down their connection as they attempt to deliver spam. Some other TCP-level tricks can do nifty stuff like massively increasing the bandwidth consumption of the spamming machines. Very very nice!

I took copious notes on the SpamAssassin wiki, if anyone’s curious.
