Sender Address Verification considered harmful

(as an anti-spam technique, at least.)

Sender-address verification, also known as callback verification, is a technique to verify that mail is being sent with a valid envelope-sender return address. It is supported by Exim and Postfix, among others.
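To make the mechanism concrete, here is a minimal sketch of what a callback check does, assuming the receiving server already knows which MX host to probe. The function name `callback_verify` and the injectable `connect` parameter are hypothetical, purely for illustration, and this is not how Exim or Postfix implement it; the `smtplib` calls are from Python's standard library.

```python
import smtplib

def callback_verify(sender, mx_host, connect=smtplib.SMTP):
    """Probe mx_host to see whether it would accept mail for `sender`.

    Returns True on a 2xx reply to RCPT TO, False on a 5xx reply,
    and None when the answer is inconclusive (e.g. a 4xx temp failure).
    """
    smtp = connect(mx_host, 25, timeout=30)
    try:
        smtp.helo()
        smtp.mail("")                 # null envelope sender, as a bounce would use
        code, _ = smtp.rcpt(sender)   # ask about the address, but never send DATA
    finally:
        smtp.quit()
    if 200 <= code < 300:
        return True
    if 500 <= code < 600:
        return False
    return None
```

Note that a positive answer only shows that the address exists, not that its owner sent the message.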

Some view this as a useful anti-spam technique. In my opinion, it’s not.

Spam/anti-spam is an adversarial “game”. Whenever you’re considering anti-spam techniques, it’s important to bear in mind game theory, and the possible countermeasures that spammers will respond with. Before SAV became prevalent, spam was often sent using entirely fake sender data; hence the initial attractiveness of SAV. Once SAV became worth evading, the spammers needed to find “real” sender addresses to evade it. And where’s the obvious place to find real addresses? On the list of target addresses they’re spamming!

Since the spam is now sent using forged sender addresses of “real” people, when a spam bounces (as much of it does), the bounce will be sent back not to an entirely fake address, but to a spam recipient’s address.

Hence, the spam recipients now get twice as much mail from each spam run – spam aimed at them, and bounce blowback from hundreds of spams aimed at others, forged to appear to be from them.

This is the obvious “next move” in response to SAV, which is one reason why we never implemented something like it in SpamAssassin.

On top of this — it doesn’t work well enough anymore. Verizon use SAV. Have you ever heard anyone talk about how great Verizon’s spam filtering is? Didn’t think so.

(This post is a little late, given that SAV has been used for years now, but better late than never ;)

By the way, it’s worth noting that it’s still marginally acceptable to use SAV as a general email acceptance policy for your site, i.e. as a way to assert that you’re not going to accept mail from people who won’t accept mail to the envelope sender address used to deliver it. Just don’t be fooled into thinking it’s helping the spam problem, or is helping anyone else but yourself.

Finally, this Sender Address Verification is different from what Sendio calls Sender Address Verification. That’s just challenge-response, which is crap for an entirely different, and much worse, set of reasons.


Masonic spam

Wow, here’s a new one — and kind of appropriate, given my surname ;) Masonic spam!

To: xxxxxx at taint.org

Subject: Dear Benefactor Of 2007 Masory Grant,

From: Dr.Lavine Ferdon Ferdon

Date: Wed, 21 Feb 2007 15:40:26 +0100 (CET)

Dear Benefactor Of 2007 Masory Grant,

The Freemason society of Bournemout under the jurisdiction of the all Seeing Eye, Master Nicholas Brenner has after series of secret deliberations selected you to be a beneficiary of our 2007 foundation laying grants and also an optional opening at the round table of the Freemason society.

These grants are issued every year around the world in accordance with the objective of the Freemasons as stated by Thomas Paine in 1808 which is to ensure the continuous freedom of man and to enhance mans living conditions.

We will also advice that these funds which amount to USD2.5million be used to better the lot of man through your own initiative and also we will go further to inform that the open slot to become a Freemason is optional, you can decline the offer.

In order to claim your grant, contact the Grand Lodge Office co-secretary Dr.Lavine Ferdon Ferdon Grand Lodge Office Co-Secretary’s email: (lavin_ferd_law at excite.com)

Dr.Lavine Ferdon Ferdon,

Co-Secretary Freemason Society of Holdenhurst Road,

Bournemouth.

Sir David Hurley,

Secretary Freemason Society of Holdenhurst Road,

Brilliant. But why Bournemouth?


Odd legal mail

Last week, I received an odd-looking mail from “Claims Administration Center” ClaimsAdministrationCenter /at/ enotice.info, sent to my private email address — the one listed in an image on http://jmason.org/ (it never gets spam).

The mail reads:

Mittlholtz v . International Medical Research, Inc., Sophie Chen, John Chen, and Allan Wang (“IMR Defendants”), aka Meco, et al. v. IMR, et al., case No. GIC846200.

We are requesting by order of the Court filed with the Superior Court for the County of San Diego, CA, that you post the attached Summary notice as a Public Service Announcement on your web-site.

Below is a link to the PDF Summary Notice (Note: The document is in the .PDF format. To view the documents you will need the Adobe Acrobat Reader)

http://echo.bluehornet.com/ct/ct.php?t=….

This message was intended for: webaddress@jmason.org You were added to the system January 17, 2007. For more information please follow the URL below: http://echo.bluehornet.com/subscribe/source.htm?c=…

Follow the URL below to update your preferences or opt-out: http://echo.bluehornet.com/phase2/survey1/survey.htm?CID=…

Googling for GIC846200, I find it on a cached “civil new filed cases index” page at sandiego.courts.ca.gov:

CASE NUMBER FILE DATE CATEGORY LOCATION

GIC846200 04/21/2005 A72120 - Personal Injury (Other) San Diego MECO vs INTERNATIONAL MEDICAL RESEARCH INCORPORATED

So the case exists. I have no idea who either of the parties are, however.

The URLs in the message were all web-bugged; but bluehornet seem legit in general.

The URL http://www.enotice.info/ times out. Seems to have no spam-related Google Groups hits, although there are a lot of discussions about some iffy-looking class-action suit about Google Adsense.

After quite a bit of discomfort and asking around about the reputation of both bluehornet.com and enotice.info, I eventually succumbed and clicked through. The Summary URL above, after logging my click, redirects to this PDF file, which reads:

This case, called Mittleholtz v . International Medical Research, Inc., Sophie Chen, John Chen, and Allan Wang (‘IMR Defendants’), et al., case No. GIC846200, is a class action lawsuit that alleges that the IMR Defendants unlawfully distributed a product containing synthetic chemicals, the presence of which was also concealed from the public as a result of the IMR Defendants’ alleged failure to conduct any testing for adulteration by synthetic chemicals, including but not limited to diethylstilbestrol (DES) and warfarin (or coumadin), which is the active chemical in bloodthinners. Defendants deny the allegations. The Court has not formed any opinions concerning the merits of the lawsuit nor has it ruled for or against the Plaintiffs as to any of their claims. The sole purpose of this notice is to inform you of the lawsuit so that you may make an informed decision as to whether you wish to remain in or opt out of this class action.

You have legal rights and choices in this case. You can:

  • Join the case. You do not have to do or pay anything to be part of this case. And, you have to accept the final result in the case.

  • Exclude yourself and file your own lawsuit. If you want your own lawyer, you will have to exclude yourself as set forth below and pay your lawyer’s fees and costs.

  • Exclude yourself and not sue. If you do not wish to be part of this case and do not want to bring your own lawsuit, please mail a first class letter stating that you want to be excluded from the Mittleholtz v IMR class action (Case No. GIC846200), or you may fill out the letter available at www.gilardi.com/mittleholtzsettlement. Make sure the letter has your full name, address and signature. Mail it to: PC-SPES Litigation, Class Administrator, c/o Gilardi & Co. LLC, P O Box 8060 San Rafael, CA 94912-8060 by March 23, 2007.

    *This is only a summary. For complete notice and further information go to: www.gilardi.com/mittleholtzsettlement or call the toll-free number 1-877-800-7853.

So in other words, it’s hand-targeted unsolicited, but probably not bulk, email, flogging a class-action suit about ‘synthetic chemicals’ (presumably as opposed to the ‘organic’ variety). I suspect, given the phrasing in the initial mail, they probably googled for a keyword or company name, and found a hit somewhere in taint.org’s 5 years of archives; hence the PSA request.

In fact, I bet this forwarded story is what they found through Googling. Pity they didn’t include a URL for that!

Does sending legal notices like this through email not seem particularly risky, given the lack of reliability of the medium?

An odd situation, all told…


CEAS

Spam: back from CEAS. The schedule with links to full papers is up, so anyone can go along and check ’em out, if you’re curious.

Overall, it was pretty good — not as good as last year’s, but still pretty worthwhile. I didn’t find any of the talks to be quite up to the standards of last year’s TCP damping or Chung-Kwei papers; but the ‘hallway track’ was unbeatable ;)

Here are my notes:

AOL’s introductory talk had some good figures; a Pew study reported that 41% of people check email first thing in the morning, 40% have checked in the middle of the night, and 26% don’t go more than 2-3 days without checking mail. It also noted that URLs spimmed (spammed via IM) are not the same as URLs spammed, but the obfuscation techniques are the same; and they’re using 2 learning databases, per-user and global, and the ‘Report as Spam’ button feeds both.

Experiences with Greylisting: John Levine’s talk had some useful data. There are still senders that treat a 4xx SMTP response (temp fail) as 5xx (permanent fail), particularly after the end of the DATA phase of the transaction, such as an ‘old version of Lotus Notes’; and there are some legit senders, such as Kodak’s mail-out systems, which regenerate the body in full on each send, even after a temp fail, so the body will look different. He found that less than 4% of real mail from real MTAs is delayed, and overall, 17% of his mail traffic was temp-failed. For the 4% of nonspam that was delayed, the time between first tempfail and eventual delivery peaked at 400 and 900 seconds.
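For anyone who hasn't met greylisting, the core idea fits in a few lines. This is a minimal sketch, not the implementation from the talk: the `Greylist` class, the 300-second minimum delay and the envelope-only triplet are all assumptions made for illustration.

```python
import time

class Greylist:
    """Temp-fail the first attempt for each (client IP, sender, recipient) triplet."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay   # seconds a sender must wait before a retry is accepted
        self.clock = clock           # injectable so the logic can be exercised without sleeping
        self.first_seen = {}         # triplet -> timestamp of the first attempt

    def check(self, client_ip, mail_from, rcpt_to):
        """Return an SMTP reply code: 451 (try again later) or 250 (accept)."""
        triplet = (client_ip, mail_from.lower(), rcpt_to.lower())
        now = self.clock()
        first = self.first_seen.setdefault(triplet, now)
        if now - first < self.min_delay:
            return 451
        return 250
```

A well-behaved MTA queues the message and retries after the delay; a sender that treats the 451 as permanent never comes back. This sketch keys on the envelope only, so a regenerated body would not affect it; a variant that also fingerprinted the body would be tripped up by the body-regenerating senders mentioned above.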

As usual, there were a variety of ‘antispam via social networks’ talks – there always are. Richard Clayton had a great point about all that: paraphrasing, I trust my friends and relatives on some things, and they are in my social networks — but I don’t trust their judgement of what is and is not spam. (If you’ve ever talked to your mother about how she always considers mails from Amazon to be spam, you’ll know what he means.)

Combating Spam through Legislation: A Comparative Analysis of US and European Approaches: the EU ‘opt-in’ directive is now transposed everywhere in the EU; when EU citizens are spammed by a citizen of another EU country, the reports should be sent to the antispam authority in the sender’s country; and there’s something called ‘ECNSA’, an EU contact network of spam authorities, which sounds interesting (although ungoogleable).

Searching For John Doe: Finding Spammers and Phishers: MS’ antispam attorney, Aaron Kornblum, had a good talk discussing their recent court cases. Notably, he found one case where an Austrian domain owner had set up a redirector site which sounded like it was expressly set up for spam use, which was news to me (and worrying).

A Game Theoretic Model of Spam E-Mailing: Ion Androutsopoulos gave a very interesting talk on a game theoretic approach to anti-spam — it was a little too complex for the time allotted, but I’d say the paper is worth a read.

Understanding How Spammers Steal Your E-Mail Address: An Analysis of the First Six Months of Data from Project Honey Pot: Matthew Prince of Project Honey Pot had some excellent data in this talk; recommended. He’s found that there’s an exponential relationship between Google PageRank and spam received at scraped addresses, which matches with my theory of how scrapers work; and that only 3.2% of address-harvesting IPs are in proxy/zombie lists compared to 14% of spam SMTP delivery IPs. (BTW, my theory is that address scraping generally uses Google search results as a seed, which explains the former.)

Computers beat Humans at Single Character Recognition in Reading based Human Interaction Proofs (HIPs): this presented some great demonstrations of how a neural network can be used to solve HIPs (aka CAPTCHAs) automatically. However, I’m unsure how useful this data is, given that the NN required 90000 training characters to achieve the accuracy levels noted in the paper; unless the attacker has access to their own copy of the HIP implementation they can run themselves, they’d have to spend months performing HIPs to train it, before an attack is viable.

Throttling Outgoing SPAM for Webmail Services: cites Goodman in ACM E-Commerce 2004 as saying that ESP webmail services are a ‘substantial source of spam’, which was news to me! (less than 1% of spam corpora, I’d guess). It then discusses requiring the submitter of email via an ESP webmail system to perform a hashcash-style proof-of-work before their message is delivered. By using a Bayesian spam filter to classify submitted messages, the ESP can cause spammers to perform more work than non-spammers, thereby reducing their throughput. Didn’t strike me as particularly useful: Yahoo!’s Miles Libbey got right to the heart of the matter, asking if they’d considered a situation where spammers have access to more than one computer; they had not. A better paper for this situation would be Alan Judge’s USENIX LISA 2003 one which discusses more industry-standard rate-limiting techniques.
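The throttling idea can be sketched as follows. This is only an illustration of the general shape, not the paper's scheme: SHA-256 stands in for hashcash's actual stamp format, and the `required_bits` mapping and its parameters are made up.

```python
import hashlib
from itertools import count

def required_bits(spam_probability, base_bits=8, max_extra_bits=12):
    """Map a classifier's spam probability (0..1) to a proof-of-work difficulty."""
    return base_bits + round(spam_probability * max_extra_bits)

def leading_zero_bits(digest):
    """Count the leading zero bits of a bytes digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()

def mint(message, bits):
    """Search for a nonce so that SHA-256(message:nonce) starts with `bits` zero bits."""
    for nonce in count():
        digest = hashlib.sha256(f"{message}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= bits:
            return nonce

def verify(message, nonce, bits):
    """Cheap server-side check of the work the submitter claims to have done."""
    digest = hashlib.sha256(f"{message}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= bits
```

Each extra bit doubles the expected minting cost, so a message the filter scores as spammy costs its submitter far more CPU time than one it scores as ham. As Miles Libbey's question points out, that only helps if the spammer cannot simply spread the work over many machines.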

SMTP Path Analysis: IBM Research’s anti-spam team discuss something very similar to several techniques used in SpamAssassin; our versions have been around for a while, such as the auto-whitelist (which tracks the submitter’s IP address rounded to the nearest /16 boundary), since 2001 or 2002, and the Bayes tweaks we added from bug 2384, back in 2003.
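To show what keying on a /16 means in practice, here is a rough sketch of an auto-whitelist-style history. It is not SpamAssassin's code: the class and function names are hypothetical, it handles IPv4 only, and ‘rounded to the nearest /16 boundary’ is read here as masking the address down to its /16 network.

```python
import ipaddress

def awl_key(sender, ip):
    """Key a sender's history by address plus the /16 network of the submitting IP."""
    network = ipaddress.ip_network(f"{ip}/16", strict=False)
    return (sender.lower(), str(network))

class AutoWhitelist:
    def __init__(self):
        self.totals = {}   # key -> (sum of scores, number of messages)

    def mean(self, sender, ip):
        """Average historical score for this sender/network, or None if unseen."""
        total, n = self.totals.get(awl_key(sender, ip), (0.0, 0))
        return total / n if n else None

    def add(self, sender, ip, score):
        key = awl_key(sender, ip)
        total, n = self.totals.get(key, (0.0, 0))
        self.totals[key] = (total + score, n + 1)
```

Masking to a /16 means a sender who moves between nearby relays keeps their history, while a forger using the same address from an unrelated network starts from scratch.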

Naive Bayes Spam Filtering Using Word-Position-Based Attributes: an interesting tweak to Bayesian classification using a ‘distance from start’ metric for the tokens in a message. Worth trying out for Bayesian-style filters, I think.

Good Word Attacks on Statistical Spam Filters: not so exciting. A bit of a rehash of several other papers — jgc’s talk at the MIT conference on attacking a Bayesian-style spam filter, the previous year’s CEAS paper on using a selection of good words from the SpamBayes guys, and it entirely missed something we found in our own tech report — that effective attacks will result in poisoned training data, with a significant bias towards false positives. In my opinion, the latter is a big issue that needs more investigation.

Stopping Outgoing Spam by Examining Incoming Server Logs: Richard Clayton’s talk. Well worth a read. It’s an interesting technique for ISPs: detecting outgoing spam by monitoring hits to your MX from your own dialup pools that use known ratware patterns.


A highlight (or low-light) from the world of spam bounces

Spam: recently, I’ve been getting a lot of spam bounces; that is, messages sent by people’s autoresponders, in response to forged spam claiming to come from my domain. (I have an SPF record, but these autoresponders naturally don’t bother to check that before replying.)

I have a SpamAssassin ruleset which catches these, and it gets rid of the vast majority, but the odd weird one gets past. This one caught my eye before I deleted it:

On October 5, 2004, I will be going to the Illinois Department of Corrections for approximately 18 months. If you wish to contact me, please snail mail me at: (address deleted)
Your letters will be forwarded to me and I will reply as soon as I receive them! Thanks…and please do write! Mail is vitally important! :-)

… ouch. Good luck to this guy, whoever he is…


Mailing List Wishlist

Mail: Ask’s mods to ezmlm got me thinking about mailing list managers. Hence, here’s my wishlist for what MLMs should be capable of…


The ‘humans are 99.84% accurate’ figure

Spam: ‘The spam-classifying accuracy of a human being is 99.84%’. This statement has passed into Slashdot lore as the gospel truth, so time for some debunking.

First off, that’s not what Bill Yerazunis said in the CRM114 paper, Sparse Binary Polynomial Hashing and the CRM114 Discriminator. Here’s the real quote:

the human author’s measured accuracy as an antispam filter is only 99.84% on the first pass

Here’s a copy of the original mail:

I manually classified the same set of 1900 messages twice, and found three errors in my own classifications, hence I have a 99.84% success rate.

(my emphasis). In other words, the author sat down and ran through 1900 messages manually, then ran through them again, and checked to see how many messages in the first batch disagreed with the second.

Let’s consider an alternative situation, where a user is presented with one message, and asked to take their time, give it a full examination and some thought, and then classify the message. I would consider that more likely to be classified correctly, since fatigue will not be an issue (after 1900 messages, I’m pretty tired of eyeballing), and neither will time pressure (taking 20 seconds on each of 1900 mails would require 10.5 hours, and would be excruciatingly boring to boot).
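The two numbers in play are easy to check. A quick sketch of the arithmetic, using only the figures quoted above (1900 messages, three errors, 20 seconds per message):

```python
messages = 1900
errors = 3
seconds_per_message = 20

# Share of the 1900 hand classifications that agreed across both passes.
accuracy = 1 - errors / messages

# Time needed to give every message 20 seconds of attention.
hours = messages * seconds_per_message / 3600

print(f"accuracy: {accuracy:.2%}")
print(f"time required: {hours:.2f} hours")
```

So the headline figure rests on just three disagreements, and the more careful alternative would indeed take a little over ten and a half hours.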

In addition, the study wasn’t clear on exactly how much information from each mail was presented. Too little (just the subject line) or too much (every header and raw HTML), and a human will be more likely to make mistakes than if the mail is rendered fully, and the extraneous header info hidden. In my experience, I’ve never hand-classified 1900 messages purely through either method, because it’s just too tiring, and I know I’ll make quite a few mistakes. The UI for this work is important.

And finally, the figure is derived from a study with one user performing a task once. There’s no way you could use that figure in a serious setting — it’s not valid statistical science. Here’s Henry’s comment:

Yerazunis’ study of “human classification performance” is fundamentally flawed. He did a “user study” where he sat down and re-classified a few thousand of his personal e-mails and wrote down how many mistakes he made. He repeats this experiment once and calls his results “conclusive.” There are several reasons why this is not a sound methodology:
  • a) He has only one test subject (himself). You cannot infer much about the population from a sample size of 1.
  • b) He has already seen the messages before. We have very good associative memory. You will also notice that he makes fewer mistakes on the second run which indicates that a human’s classification accuracy (on the same messages) increases with experience. For this very reason, it is of the utmost importance to test classification performance on unseen data. After all, the problem tends towards “duplicate detection” when you’ve seen the data before hand.
  • c) He evaluates his own performance. When someone’s own ego is on the line, you would expect that it would be very difficult to remain objective.

So, to correct the statement:

‘The spam-classifying accuracy of this one guy, when classifying nearly two thousand mails by hand, was 99.84%, once.’


Shortest URL evah


German neo-nazi UBE, and CAN-SPAM

Spam: Reg: German hate mail spam attack stuns experts: ‘Mailboxes in Germany and the Netherlands were flooded yesterday with spam containing German right-wing propaganda. Spammers used the Sober.G virus - a mass mailing worm that sends itself to email addresses harvested from infected computers - to spread their messages as widely as possible.’

The one good thing about this is that it might help some people realise that spam isn’t all about porn and commercial email; any kind of mail can be spam, including political speech.

However, this may be a bit late for the US, since CAN-SPAM explicitly does not regulate political spam. Ah well, you live and learn, I suppose. ;)


GMail Invites

Mail: GMail users, check your mail; if mine was anything to go by, you should have three new invites to give out.


More Thoughts on GMail

Mail: I’ve been playing around with GMail a bit more recently. They’ve fixed the issues they had with Firefox and keyboard control, and it is nice.

Threading: since I plan to bother a few open-source MUA developers ;), I’ve written up a thorough analysis of their ‘conversation’ model, with its ‘collapsable history’, archive-not-delete approach, etc. Take a look, if you’re curious.

HTML: one feature that no-one’s commented on is that GMail does not create HTML mail; all mail composed through their composer is sent as text/plain only.

This is very interesting, because it suits me just fine. HTML mail causes so many more problems than it solves, especially when full-featured web browser components are used to display it, IMO. I get to see the security exploits this enables, every day in my anti-spam work.

But it’s also very significant that nobody else has commented on it – nobody misses it!

Phantom Labels: another interesting thing I’ve noted: sometimes a mail will appear in your Inbox with a ‘spam’ label, even though you’ve never defined one. It’s not in the ‘Spam’ folder; it’s in your inbox.

Aaron has a good theory on what this is, and I think he’s right: he suggests it’s when ‘the two emails are in a conversation (same subject); one is marked as spam, one isn’t. So the conversation (which is what appears in your inbox) gets two tags: Spam, and Inbox. So when viewing the list it looks like it gets the Spam tag.’

Also, while I’m here — details on LiveJournal’s distributed filesystem, MogileFS, which apparently ‘will be open source’. Link via acme.


Email Usability List updated in light of GMail, given new home

Mail: I’ve dusted off my old e-mail usability wishlist, made a couple of changes to reflect the current situation now that GMail has implemented some of them, and Wikified the page.

There are still a couple that I think would be valuable, so anyone looking at new usability ideas for email is welcome to take a look ;)


Some stats on GMail’s spam filter

Update: greetings, visitors from 2006! Please pay no attention to these figures, they’re from 2004, and both GMail and SpamAssassin have undergone major changes since those days. Historical interest only.

So, I set up a .forward to forward all my personal mail to GMail to see how it coped with my spam load, and compared it against the personal SpamAssassin install I’m running these days. Here are the results:

  • test start: Mon Apr 12 15:50:39 PDT 2004
  • test end: Tue Apr 13 18:26:45 PDT 2004
  • total spam messages received by both during the test: 210
  • total ham messages received by both during the test: 528

The SpamAssassin results:

  • true positives: 189
  • false positives: 0
  • false negatives: 21
  • true negatives: 528
  • FP%: 0.00%
  • FN%: 10.00%

The GMail results:

  • true positives: 144
  • false positives: 7
  • false negatives: 66
  • true negatives: 521
  • FP%: 1.32%
  • FN%: 31.42%
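For reference, the percentages above follow from the raw counts like this. The `rates` helper is just an illustration; the counts are the ones listed above.

```python
def rates(tp, fp, fn, tn):
    """Return (FP%, FN%) from a confusion matrix of message counts."""
    fp_pct = 100.0 * fp / (fp + tn)   # share of ham wrongly flagged as spam
    fn_pct = 100.0 * fn / (fn + tp)   # share of spam that got through
    return fp_pct, fn_pct

spamassassin = rates(tp=189, fp=0, fn=21, tn=528)
gmail = rates(tp=144, fp=7, fn=66, tn=521)

for name, (fp_pct, fn_pct) in (("SpamAssassin", spamassassin), ("GMail", gmail)):
    print(f"{name}: FP% {fp_pct:.3f}  FN% {fn_pct:.3f}")
```

The GMail figures work out to roughly 1.326% and 31.429%, so the 1.32% and 31.42% listed above look truncated rather than rounded; the difference is immaterial at this sample size.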

So, not too hot. But there are extenuating circumstances! ;)

  • The GMail false positives were not ‘typical’ mail, whatever that is – all of them were Mailman ‘administration required’ messages regarding spam in Mailman mailing list queues. I’d only be annoyed if I was a GMail user administrating Mailman lists. And it turns out there’s a bug in current dev SpamAssassin that now does the same thing…
  • presumably, GMail allows some element of per-user probabilistic classifier training — if so, some ‘move to Inbox’ might also sort those out quite quickly, I’d guess.
  • GMail seems to be a four-phase classification system. Messages can either go into: 1. the inbox, 2. the spam box, 3. the inbox with a little green ‘Spam’ indicator, or 4. the spam box with a little green ‘Inbox’ indicator. Not sure what the latter two do, but they may indicate some level of ‘unsure’ as per spambayes; worth noting that most of the FNs in the Inbox did not get the green ‘Spam’ indicator beside them, though.
  • I used a .forward to bounce the traffic over. So if GMail includes spam-evasion at the SMTP level, along with whatever content-filtering and probabilistic classification they’re using, they wouldn’t get the benefits of that.
  • SpamAssassin has the benefit of some user configuration; I’d got a couple of my spamtrap addresses blacklisted in the SpamAssassin config, and my Bayes databases have been trained using SpamAssassin’s autolearning.
  • this is all really unscientific, and it’s a really small sample ;)

Surprisingly, all the SpamAssassin mailing list traffic discussing spam, throwing around spammy URLs and phrases, didn’t get caught, however; probably because the volume of spammy phrases in those is less than in the Mailman admin stuff.


Blocking mail with no Message-ID

Spam: Bram shares a spam-filtering tip — ‘most of the viruses I get have a Message-Id tacked on by the local mailserver. A little bit of messing with procmail and suddenly my junk mail level is under control.’

This is what the SpamAssassin rule MSGID_FROM_MTA_SHORT does. It gets:

  4.432   6.7680   0.0560    0.992   0.94    3.67  MSGID_FROM_MTA_SHORT

6.7680% of spam is hit, but so is 0.0560% of ham mail — which makes it 99.2% accurate. By default in 2.6x, it gets a score of 3.67 points.
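The 0.992 column is the ratio of the spam hit-rate to the combined spam and ham hit-rates. A small sketch of that calculation, assuming that definition (the function name is made up for illustration):

```python
def spam_over_overall(spam_pct, ham_pct):
    """Fraction of a rule's hits that land on spam, given per-class hit rates."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0

so = spam_over_overall(6.7680, 0.0560)
print(f"{so:.3f}")
```

Because it uses hit rates rather than raw counts, the figure does not depend on how much spam versus ham happens to be in the corpus.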

There’s a lot of divergence between people’s corpora — for instance, I currently have no ham mails that hit this, so it’s 100% accurate for my current mail collection; but some other people have an 80% hit-rate.

This is because some large-scale legitimate mass-mailers — for no apparent reason — also omit the Message-ID when they send the message across the internet. This isn’t quite a contravention of RFC 2822, but that RFC strongly recommends using the header:

Though optional, every message SHOULD have a ‘Message-ID:’ field.

(see RFC 2119 for what ‘SHOULD’ means — it’s a strong recommendation.)
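Checking for the header itself is straightforward. This is a minimal sketch using Python's standard `email` package; `lacks_message_id` is a hypothetical helper, and it only looks at the message as handed to it. It makes no attempt to recognise an ID that the local mail server added afterwards, which is the harder part of what the rule above has to do.

```python
from email import message_from_string

def lacks_message_id(raw_message):
    """True if the message has no Message-ID header, or only an empty one."""
    msg = message_from_string(raw_message)
    value = msg.get("Message-ID")   # header lookup is case-insensitive
    return value is None or not value.strip()
```

A legitimate bulk sender could run the same check on its own outgoing mail before it ever reaches anyone's filter.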

The moral for legit senders: make sure you read the RFCs before you start sending SMTP; otherwise you’ll look like a spammer.

The moral for spamfilter developers: watch out for the legit bulk mail senders; some of them do really bizarre things with SMTP. ;)


Ca Plane Pour Moi, GMail, and XCP

Music: Ever wondered what the lyrics to Plastic Bertrand’s classic belgopunk tune really said? (Apart from ‘I am the king of the divan’, that is.) Wonder no more. (…ok, maybe these are a bit more likely. ‘Ey up!’, indeed.)

Mail: Google Mail front page. It has MXes — but they don’t answer yet. No SPF record yet, either ;)

Funny: XCP - the XML Control Protocol ‘is a drop in replacement for traditional Transmission Control Protocol, or TCP. With the advent of XCP/IP, connection-oriented networking will finally move from the legacy environment of inscrutable bits and bytes to a structured, human-readable world relying upon XML. XCP is the first 4th Generation Protocol, or 4GP. It is designed for a networking environment that is very fast and very reliable - the Internet of today!’


GMail

Mail: Google announces new mail service. This is not an April Fool’s Day joke — just terrible timing. ;) It’s for real.

Diego has some good comments.

My thoughts:

  • Privacy: ‘we do not disclose your personally identifying information to third parties unless we believe we are required to do so by law or have a good faith belief that such access, preservation or disclosure is reasonably necessary to … (c) detect, prevent, or otherwise address fraud, security or technical issues (including, without limitation, the filtering of spam)’. They’re going to build one hell of a spam-filtering corpus this way ;)
  • A nice ToS clause: ‘Your Intellectual Property Rights. Google does not claim any ownership in any of the content, including any text, data, information, images, photographs, music, sound, video, or other material, that you upload, transmit or store in your Gmail account. We will not use any of your content for any purpose except to provide you with the Service.’


LOAF

Social: LOAF is ‘a way to share your address book without abandoning your privacy.’

A nifty use of Bloom filters to share your address book in a one-way manner — when you receive a mail, you can query your LOAF db to see if any of your correspondents previously corresponded with the sender; but they cannot look up the LOAF file to determine your correspondents, unless they know that correspondent’s email address in advance.
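For the curious, here is roughly how a Bloom filter gives that one-way property. This is a generic sketch rather than LOAF's actual file format: the filter size, the number of hashes and the use of salted SHA-1 are all assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=32768, num_hashes=5):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        """Derive num_hashes bit positions from the normalised address."""
        data = item.strip().lower().encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May give a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Only the bit array is shared. Someone holding it can test an address they already know, but cannot enumerate the addresses that were added, and an occasional false positive gives each entry a little deniability.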

This, BTW, would be a very good way to implement a ‘Do-Not-Email’ list — although the other two problems with those still apply.

Interesting stuff — although I wonder how acceptable the 4-8Kb MIME part overhead per message will be…


GPRS, and the price of it

Tech: GPRS roaming works… technically. Joi Ito gets a $3,500 bill for checking his mail around the world. Yowch.

FWIW, I’ve never met anyone who’s used GPRS for anything other than the odd demo, or emergency use only, except for employees of the mobile carriers — and they get it for free.

My bet is that the basic failure was a disconnect between the real world and the specification stages — someone somewhere picked up one of those massively-inflated analyst reports a few years ago, said ‘I’d like a piece of that road-warrior market which will be worth $5 billion by 2005, it says here!’ and set prices (to stun) accordingly.


Slashdot Anti-FUSSP Form, and DSPAM’s FAQ

Spam: Slashdot: This will fail because… Tick the boxes to produce a generic Slashdot comment on a new anti-spam proposal. Very funny.

So, regarding the Noise Reduction probabilistic-classification tokenizer tweak posted on Slashdot yesterday: it does look interesting. Basically, it operates by monitoring the ‘noisiness’ of the token stream, and if the current probabilities for the tokens from the stream differ from what’s defined as acceptable for too long, it ‘dubs’ them out. In other words, it ignores those tokens until another sequence of ‘useful’ tokens is encountered. Plus I’m totally down with the Janine ref ;)

However, it’s disappointing to come across this in the DSPAM FAQ list: ‘Why Should I use DSPAM Instead of SpamAssassin?’, a lovely selection of anti-Perl and anti-SpamAssassin FUD, generally overlooking SpamAssassin’s training components (‘leaves the end-user with no means of recourse or satisfaction when they receive a spam’), and in general taking a combative tone. Is that really necessary?

BTW, in case you’ve been living in a hole for the last year: SpamAssassin does include a probabilistic classifier, in the form of the BAYES rules. It’s easy to train, uses good tokenizing and combining algorithms to get high accuracy (although it doesn’t yet do multi-word windowing, and won’t until we’ve determined that that works acceptably for the db size increase), and, importantly, has been measured on corpora that are not my own mail.

A story: way back when, in June 2001, the SpamAssassin README boasted of its 99.94% accuracy rate. This was true; it was measured on my mail feed over the course of a couple of months. However, once measured on someone else’s mail, that dropped pretty quickly. Measuring a spam filter on the developer’s mail feed (where presence of HTML is a killer spam-sign!) is a sure-fire way to get (a) great but (b) non-portable accuracy figures.


How To Increase Voter Turnout With New Technology - The Right Way

eVoting: One of the desired features for new voting mechanisms is that they will increase voter ‘turnout’, encouraging people to vote who are too busy (or too unmotivated) to visit a polling station.

This has been used to suggest internet voting (see the fiasco that was the now-scrapped SERVE project) and voting-by-phone. Both offer a scary number of vote-fixing opportunities and possible failure modes, and are fundamentally a bad idea.

However, it turns out there is a great system to implement absentee voting securely, reliably, conveniently (for the voter) and even cheaply! A comment on Bruce Schneier’s Crypto-Gram newsletter (scroll down to comment number 3) details this.

I’ve copied the entire mail here, since it’s hard to link to in the other location, and is well worth a page to itself:

From: Fred Heutte

Thanks for your cogent thoughts on ballot security. I almost completely agree and was one of the first signers of David Dill’s petition. I am also involved professionally in voter data — from the campaign side, with voter files, not directly with voting equipment – but we’re close enough to the vote counting process to see how it actually works.

I would only disagree slightly in one area. Absentee voting is quite secure when looking at the overall approach and assessing the risks in every part of the process. As long as reasonable precautions like signature checking are done, it would be difficult and expensive to change the results of mail voting significantly.

For example, in Oregon, ballots are returned in an inside security envelope which is sealed by the voter. The outside envelope has a signature area on the back side. This is compared to the voter’s signature on file at the elections office. The larger counties actually do a digitized comparison, and back that up with a manual comparison with a stratified random sample (to validate machine results on an ongoing basis), as well as a final determination for any questionable matches.

Certainly it is possible to forge a signature. However, this authentication process would greatly raise the cost of forged mail ballots, absent consent of the voter. In turn, interference or coercion with absentee voting would require much higher travel costs (at least) than doing so at a polling place, for a given change in the outcome.

It is true that precincts have poll watchers, and absentee voters do not. But consider this. Ballot boxes, which are often delivered by temporary poll workers from the precinct to the elections office, are occasionally stolen, but mail ballots are handled within a vast stream of other mail by employees with paychecks and pensions at stake. The relatively low level of mail fraud inside the postal system is a testament to its relative security, and the points where ballots are aggregated for delivery to the elections office are usually on public property and can also be watched by outside observers if need be.

Oregon has had some elections with 100% ‘vote by mail’ since 1996, and all elections since 1999. So far, no verifiable evidence of voter fraud has emerged, despite many checks and some predictions by those with a political axe to grind that we would be engulfed in a wave of election fixing.

The reality is that Oregon’s system, which is based on some common-sense security principles, has proven to be robust. The one lingering problem has been the need of some counties to make their voters use punch cards at home because of their antiquated vote counting equipment. But while this is a vote integrity issue – since state statistics show a much higher undervote and spoiled ballot total for punch cards as compared to mark-sense ballots – it is not a security issue per se. And with Help America Vote Act (HAVA) funding to convert to more modern vote counting systems, the Oregon chad remains in only one county and will go extinct after 2004.

The mark-sense (‘fill in the ovals’) ballots we have work well, and have low rates of over-votes and under-votes, despite the lack of automated machine checking that is possible in well-designed precinct voting systems. This suggests that reasonable visual design and human-friendly paper and pencil/pen home voting is a very reliable and secure system. When aided by automated counting equipment, we even have the additional benefit of very fast initial counts.

The increase in voter participation in Oregon since the advent of vote-by-mail — 10 to 30 percentage points above national averages, depending on the kind of election — leads to the only other issue, which is slow machine counts on election night after the polls close due to the surge of late ballots received at drop-off locations around the state. Oregon in fact isn’t really ‘vote by mail,’ it’s vote-at-home, with a paper ballot that can be mailed or left at any official drop-off point in the state, including county election offices, many schools and libraries, malls, town squares, etc.

The great advantage of the Oregon system is that it relies on the principle that if you appeal to the best instincts of the citizen, the overwhelming majority will ‘do our part’ to ensure the integrity of the democratic voting process, whether it is full consideration of the candidates and issues before voting, watching to make sure all ballots are securely transferred and counted, or favoring those laws and policies that insure that everyone eligible can vote, that their votes are counted, and that the candidates and measures with the most votes win.

The system is also cheaper than running traditional precinct elections. What’s not to like?

It’s so simple, and so sensible. Next time someone suggests ‘i-voting’ or ‘m-voting’ or whatever, you know what to point to…

Annoying Non-spam Tricks, pt. XVIII

Spam: OK, I just noticed that I have a few hits for the SpamAssassin rule HTTP_ENTITIES_HOST in my corpus. This searches for obfuscated hostnames in the URL links in mail messages, and is generally a very reliable sign of spam — because who would want to hide a hostname apart from spammers?

Well, Buy4Now.IE, for one, it seems. WTF? I have a mail here that uses this markup:

  <a href="''http://www&#46;buy4now&#46;ie/fbd''>

Totally and utterly nuts. If they really wanted a way to tickle malware detectors, mail filters, and anti-spam measures, they could hardly pick a better one. I have no idea why they did this.
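For illustration, here is a small Python sketch of the kind of check involved. It is not the actual HTTP_ENTITIES_HOST rule; the regexes and the function name `entity_obfuscated_hosts` are assumptions invented for this example. It pulls out `href` values, looks only at the host part of each URL, and reports hosts that contain HTML character references.

```python
import re
from html import unescape

# href value, tolerating stray quote characters around it
HREF_RE = re.compile(r"""href\s*=\s*["']*([^"'>\s]+)""", re.IGNORECASE)
# scheme://host, where the host runs up to the first '/' or '?'
HOST_RE = re.compile(r"^[a-z][a-z0-9+.-]*://([^/?]+)", re.IGNORECASE)
# numeric (&#46; or &#x2e;) or named (&period;) character references
ENTITY_RE = re.compile(r"&(#\d+|#x[0-9a-f]+|[a-z]+);", re.IGNORECASE)


def entity_obfuscated_hosts(html_text):
    """Return the decoded hostnames of links whose host part uses entities."""
    hits = []
    for raw_url in HREF_RE.findall(html_text):
        m = HOST_RE.match(raw_url)
        # Entities elsewhere in the URL (e.g. &amp; in a query string) are
        # normal; only entities inside the hostname are suspicious.
        if m and ENTITY_RE.search(m.group(1)):
            hits.append(unescape(m.group(1)))
    return hits
```

Fed the markup above, this flags the link and decodes the host to `www.buy4now.ie`, while an ordinary link with `&amp;` in its query string passes untouched.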

grr….

Post-Xmas

Vacation: We’re back. Well, technically, my body is back, but the silver thread is reeling in somewhere over Greenland. So I’m pre-classifying my mail and looking for urgent stuff with my eyes glazing over instead of doing anything more useful.

Scams: Interesting Wired News article: ‘Cyber-blackmail artists are shaking down office workers, threatening to delete computer files or install pornographic images on their work PCs unless they pay a ransom’. ‘The e-mail typically contains a demand that unless a small fee is paid … they will attack the PC … or download onto the machine images of child pornography.’

Of course, it’s simply spammed out, and they phish in anyone who is dumb enough to take it seriously and reply. But it does raise an interesting point, which I read about last week in this interview with Pete Townshend:

‘Perhaps Townshend (was) thinking of a case at Southwark Crown Court in 1998, in which the judge made it clear what constituted possession: that you were in possession of child pornography not just if you actively downloaded it, but if it appeared on your computer screen at all.’

So that sounds like, if child-porn images are found on a PC — and it doesn’t matter how they got there — the PC’s owner is liable. So theoretically this could be exploited to cause serious legal difficulties to a UK resident with a lack of computer literacy, or a bad email client that displays images in messages from unknown senders without user approval first. Another bad law.

Funny: Andy Kershaw in North Korea: songs about revolutionary cabbage-growing.

Potentially objectionable xscreensaver

Humour: xscreensaver, the default (and greatest) screensaver on most free UNIX distros, may contain R-rated content, as this mail to the Fedora discussion list notes.

Much to my surprise, I stumbled across it drawing an ‘erect penis’ when I returned from lunch today. So I did some investigating:

    $ strings /usr/X11R6/lib/xscreensaver/glsnake | grep penis
    erect penis
    flaccid penis
  

XmlStarlet, and lots of stuff

XML: XmlStarlet: ‘a set of command line utilities (tools) which can be used to transform, query, validate, and edit XML documents and files using simple set of shell commands in similar way it is done for plain text files using UNIX grep, sed, awk, diff, patch, join, etc commands.’ Sheer genius!

SCOvEveryone: Humorix: ‘PROVO, UTAH — Nearly two hundred humor writers, fake news reporters, and tongue-in-cheek columnists descended on SCO’s headquarters yesterday to protest the company’s continued slide into unreality.’

‘Humor writers have very active imagination. But none of us — absolutely none of us — could ever have imagined the kind of ludicrous and inconceivable things that SCO has decided to pursue,’ explained a reporter for the New York Times, the world’s leading source of spurious news. ‘You simply can’t make this stuff up… a fact which represents a great hardship on humorists everywhere.’

(thanks Ben!)

Ireland: some beautiful pics of Dublin in Autumn from Diego Doval.

Books: Hari Kunzru rejects the John Llewellyn Rhys award, since it is sponsored by two notoriously anti-immigrant newspapers, the Daily Mail and the Mail on Sunday:

both ‘pursue an editorial policy of vilifying and demonising refugees and asylum-seekers … As the child of an immigrant, I am only too aware of the poisonous effect of the Mail’s editorial line. The atmosphere of prejudice it fosters translates into violence, and I have no wish to profit from it. … The Impressionist is a novel about the absurdity of a world in which race is the main determinant of a person’s identity. My hope is that one day the sponsors of the John Llewellyn Rhys prize will join with the judges in appreciating this.’

Well said! (via Oblomovka)

Health: University of Chicago healthcare ‘stories of shame’. A shockingly widespread situation in the US, as far as I can tell. For non-USians wondering what all the fuss is about, have a read of this and it’ll become clear. At the same time, the US government spends more per capita on healthcare than Sweden does. Figure that one out…

Real-time DNS blocklist accuracy figures

Spam: DNS blocklists are the oldest means of spam-blocking, and are still exceedingly useful; nowadays, many of these are fully automated systems, using proxy-detection algorithms and sensing patterns in mailer behaviour indicative of spam.

A few months back on the ASRG list, there was a discussion of DNSBL accuracy; I posted some SpamAssassin figures, based on our ‘mass-check’ tests, but noted that they were computed using current DNSBL contents against a corpus of saved mail, so due to the time delta, were not 100% representative.

These figures are a lot better. Since August, I’ve been collecting real-time DNSBL hit data on my mail, as it is delivered at my SpamAssassin installation. In other words, it’s live accuracy data — it’s using just what the DNSBLs had listed at scan time.

Note, however, that it’s still incomplete:

  • some DNSBLs were not measured; this is just the default DNSBL list in SpamAssassin 2.60, excluding RCVD_IN_NJABL_DIALUP (which I had to remove because I can’t parse out accurate data for it).
  • it’s only 1 person’s hand-classified mail.
  • SpamAssassin tests more than just the ‘delivering’ SMTP relay; it’ll also look backwards through the headers, at earlier relays, to catch spam sent via mailing lists. This is different from what’s used with most traditional DNSBL-supporting systems.

But the results should still be quite useful.
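As a rough illustration of the ‘look backwards through the headers’ point, here is a Python sketch that collects the bracketed relay IPs from every Received header and builds the name a DNSBL lookup would query (reversed IPv4 octets plus the list’s zone). It is a simplification, not SpamAssassin’s actual header parser: the function names and the `dnsbl.example` zone are placeholders, and no DNS query is actually made.

```python
import re
from email import message_from_string

# An IPv4 address in square brackets, as relays usually record it
IP_RE = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def relay_ips(raw_message):
    """Return bracketed IPv4 addresses from all Received headers, newest first."""
    msg = message_from_string(raw_message)
    ips = []
    for received in msg.get_all("Received", []):
        ips.extend(IP_RE.findall(received))
    return ips


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed octets, then the DNSBL zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone


if __name__ == "__main__":
    raw = (
        "Received: from list.example.org (list.example.org [192.0.2.10])\n"
        "\tby mx.example.net; Sat, 25 Oct 2003 23:11:52 -0700\n"
        "Received: from dialup.example.com (unknown [198.51.100.7])\n"
        "\tby list.example.org; Sat, 25 Oct 2003 23:10:00 -0700\n"
        "Subject: test\n\nbody\n"
    )
    for ip in relay_ips(raw):
        print(dnsbl_query_name(ip, "dnsbl.example"))
```

A traditional MTA-level DNSBL check would only ever see the first of those two addresses (the delivering relay); checking the second as well is what catches spam that arrived via a mailing list.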

The time period covered:

  • Thu, 21 Aug 2003 17:11:30 -0700 (PDT)
  • Sat, 25 Oct 2003 23:11:52 -0700 (PDT)

Recap of the fields:

  • OVERALL% = percentage of all messages that hit the test
  • SPAM% = percentage of spam messages that hit the test
  • HAM% = percentage of ham (non-spam) messages that hit the test
  • S/O = Spam/Overall = Bayesian probability of spam
  • RANK = artificial ranking figure, ignore this!
  • SCORE = default SpamAssassin 2.60 score
  • NAME = name of test. Figuring out the exact DNSBL should be pretty obvious ;)

OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
   21839     1993    19846  0.091  0.00   0.00  (all messages)
 100.000   9.1259  90.8741  0.091  0.00   0.00  (all messages as %)
   5.989  59.0567   0.6601  0.989  1.00   2.25  RCVD_IN_BL_SPAMCOP_NET
   3.869  37.7822   0.4636  0.988  0.96   1.10  RCVD_IN_DSBL
   0.751   8.2288   0.0000  1.000  0.95   4.30  RCVD_IN_OPM_HTTP
   1.964  20.2709   0.1260  0.994  0.95   1.10  RCVD_IN_NJABL_PROXY
   0.659   7.1751   0.0050  0.999  0.95   0.64  RCVD_IN_NJABL_SPAM
   0.614   0.0000   0.6752  0.000  0.94  -0.10  RCVD_IN_BSP_OTHER
   0.050   0.5519   0.0000  1.000  0.94   4.30  RCVD_IN_OPM_SOCKS
   0.027   0.3011   0.0000  1.000  0.94   4.30  RCVD_IN_OPM_WINGATE
   0.119   0.0000   0.1310  0.000  0.94  -4.30  RCVD_IN_BSP_TRUSTED
   0.939   9.7341   0.0554  0.994  0.94   4.30  RCVD_IN_OPM
   1.081  10.9383   0.0907  0.992  0.93   1.52  RCVD_IN_SORBS_SOCKS
   1.062  10.7376   0.0907  0.992  0.93   1.27  RCVD_IN_SBL
   0.229   2.4084   0.0101  0.996  0.93   1.10  RCVD_IN_SORBS_MISC
   0.618   6.3221   0.0453  0.993  0.93   1.10  RCVD_IN_SORBS_HTTP
   0.595   5.9709   0.0554  0.991  0.92   4.30  RCVD_IN_OPM_HTTP_POST
   0.078   0.7526   0.0101  0.987  0.90   2.60  RCVD_IN_SORBS_ZOMBIE
   0.815   7.5263   0.1411  0.982  0.89   1.39  DNS_FROM_RFCI_DSN
   3.594  24.8369   1.4613  0.944  0.81   2.55  RCVD_IN_DYNABLOCK
   1.685  11.4400   0.7054  0.942  0.78   0.10  RCVD_IN_RFCI
   0.380   2.4586   0.1713  0.935  0.75   1.31  RCVD_IN_NJABL_RELAY
   6.182  33.9689   3.3911  0.909  0.73   0.10  RCVD_IN_NJABL
  10.422  44.4054   7.0090  0.864  0.63   0.10  RCVD_IN_SORBS
   0.037   0.1505   0.0252  0.857  0.54   2.80  RCVD_IN_SORBS_WEB
   2.344   4.1144   2.1667  0.655  0.17   0.00  RCVD_IN_SORBS_SPAM
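For anyone wanting to sanity-check the columns, the per-test numbers appear to be consistent with the following arithmetic, sketched here in Python. The formulas are inferred from the figures in the table rather than taken from the mass-check scripts, and the function names are made up: SPAM% and HAM% are read as the share of spam and ham messages that hit the test, S/O as SPAM% / (SPAM% + HAM%), and OVERALL% as the two rates weighted by the corpus sizes in the ‘(all messages)’ row.

```python
def spam_over_overall(spam_pct, ham_pct):
    """S/O for one test: the spam hit-rate as a fraction of both hit-rates."""
    return spam_pct / (spam_pct + ham_pct)


def overall_pct(spam_pct, ham_pct, n_spam, n_ham):
    """Percentage of all messages hit, given per-class rates and corpus sizes."""
    hits = spam_pct / 100.0 * n_spam + ham_pct / 100.0 * n_ham
    return 100.0 * hits / (n_spam + n_ham)


if __name__ == "__main__":
    # RCVD_IN_BL_SPAMCOP_NET row, with 1993 spam and 19846 ham messages
    print(round(spam_over_overall(59.0567, 0.6601), 3))        # 0.989
    print(round(overall_pct(59.0567, 0.6601, 1993, 19846), 3))  # 5.989
```

Because S/O is computed from the two rates rather than from raw hit counts, it is not skewed by the corpus being about 91% ham; a test that hits 59% of spam and 0.66% of ham scores 0.989 regardless of how much ham there is.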

Spam load and Hallowe’en

Spam: The volume of spam continues to rise inexorably. Brightmail are now estimating that 54% of all mail messages are spam.

Nowadays, my personal mail account is getting about 70 spams a day, rising to over 200 a day at the weekends. It’s getting tiresome; pretty much all of it gets marked as spam and diverted, but I still have to wade through it ‘just in case’, and to build the corpus. I guess I need to extend my .procmailrc to divert high-scoring spams somewhere I can check even less frequently ;)

That’s not the really annoying thing, though. I use tagged addressing when I publish my email address, most of the time. It works very well to identify spam sources overall, and divert ‘dead’ addresses that are getting spam, into the spamtraps. That’s the plus.

But the curse of writing spam filters is that you need a good archive of spam; and one of our SpamAssassin corpus guidelines is to attempt to trim out duplicate spams where possible. Many spammers will wind up sending more-or-less identical spam messages, modulo random subject lines, hash-busters, etc., and with (let’s say) 8 tagged addresses in their lists, I’ll get 8 copies of that spam, and have to pay a little bit of attention to trim it down to 1 copy for the corpus.

Damn spam-filter development! All this corpus building is hard work ;)

BTW, note how spam load rises at the weekends. (Tim Hunter, Paul Terry and Alan Judge of eircom.net also noted this in their paper presented at LISA ‘03 yesterday ;) There’s a good reason – spammers attempt to deliver their spam while abuse staff are not at their desks. The same thing applies in the network security world; many attacks have taken place over a US holiday weekend.

Hallowe’en: best too-late idea for a hallowe’en costume: ‘Top Gun GWB’ in his flight suit. In the end, I played half of the ‘Dr. Frankenstein and Monster’ pair (I was the monster, as C really is a scientist, and computer ‘science’ doesn’t count). Best costume seen: a very impressive onnagata kabuki player.

On Pay-Per-Mail

Spam: Lee Maguire on pay-per-mail schemes. A great read — recommended to anyone who has given thought to this system.

It’s usually the fear of the odd overlooked gem that has rendered anti-spam techniques impotent. A salutation from a long lost friend with the subject ‘Hi’, an important business mail sent out-of-hours from the kid’s computer, that domain renewal reminder. Most people would apply no charge on the things they want to read, and a bajillion dollars on spam. And if there’s mail you don’t want to read but have to? Chances are you’re being paid to read them already - get back to work.

SoCal: an amazing satellite picture of the wild fires, courtesy of NASA’s Earth Observatory.

For Reference: Why Greylisting Sucks

Spam: I’ve been meaning to collate a page about why I don’t like greylisting. My previous posting is relatively useful, but it needs an update, so here it is:

First off, every single message is delayed until a database match is found for the combination of sending IP, envelope-from and envelope-to. As Alan Leghart pointed out, ‘So…we punish everyone in the world, and hope that a delay of one or more hours is considered ‘acceptable’? Maybe some people already expect a mail to take several hours to reach a recipient. In that case, you need to fix your mail server.’
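For anyone who hasn’t met it, the mechanism boils down to something like this Python sketch. It’s an illustration under assumptions, not any real implementation: the `Greylist` class, the one-hour `min_delay` and the in-memory dict standing in for the database are all made up for the example, and 451 is used as a typical SMTP 4xx temp-fail code.

```python
class Greylist:
    """Toy greylist keyed on (sending IP, envelope-from, envelope-to)."""

    def __init__(self, min_delay=3600):
        self.min_delay = min_delay  # assumed: seconds a new triple must wait
        self.first_seen = {}        # triple -> time of first delivery attempt

    def check(self, ip, mail_from, rcpt_to, now):
        """Return an SMTP reply code: 451 (try again later) or 250 (accept)."""
        key = (ip, mail_from.lower(), rcpt_to.lower())
        # Remember when this triple was first seen; later attempts reuse it.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return 250
        return 451


if __name__ == "__main__":
    g = Greylist()
    triple = ("192.0.2.1", "alice@example.com", "bob@example.net")
    print(g.check(*triple, now=0))     # first attempt: tempfailed
    print(g.check(*triple, now=3600))  # retry an hour later: accepted
```

Every previously unseen triple takes the 451 branch first, which is exactly the delay being complained about: legitimate first-time mail waits for however long the sender’s retry timer and `min_delay` add up to.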

Secondly, large mailing lists that use VERP (generating keyed From addresses for each mail for good bounce-handling) will require manual whitelisting for each list, or each host.
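To see why, consider what a keyed envelope sender looks like. The Python sketch below uses a hypothetical address layout (list name, a per-message serial, then the recipient with `@` swapped for `=`); real list managers each have their own format, so treat `verp_sender` and its layout as assumptions for illustration only.

```python
def verp_sender(list_name, list_domain, recipient, serial):
    """Build a per-message, per-recipient envelope sender (hypothetical layout)."""
    user, _, domain = recipient.rpartition("@")
    # Encoding the recipient in the sender lets a bounce identify exactly
    # which subscriber (and which message) failed.
    return f"{list_name}-bounce-{serial}-{user}={domain}@{list_domain}"


if __name__ == "__main__":
    for serial in (101, 102):
        print(verp_sender("mylist", "lists.example.org", "alice@example.com", serial))
```

Since the envelope-from changes with every message, the (IP, envelope-from, envelope-to) combination never matches an earlier entry, so each list post looks brand new to the greylist and gets delayed again unless the list or host is whitelisted.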

Yahoo! Groups, for example, uses VERP for all its lists, and also will not retry delivery if the first attempt fails.

There are even buggy SMTP servers that do not support retrying, believe it or not.

(Once again, as for many spamfilter designs, the unusual SMTP clients are the ‘edge cases’ that cause the most trouble.)

Manual whitelisting == work == what spam filtering is trying to reduce == bad.

Thirdly, and most seriously, greylisting assumes spammers would never introduce retries into their spam-tools if it took off. Tempfailing, which greylisting is based on, is effective right now because spamtools don’t retry. But every proposed spam solution has to consider what would happen if every server admin in the world implemented it, and spammers then wanted to subvert it.

For a spamtool to retry, it just needs to track 4xx responses, and if it encounters one, save these items of data:

  • From, To addrs and HELO string used
  • proxy IP used (btw proxies are almost never shut down successfully, so the spammer can generally assume this can be reused next time)
  • random seed used to generate random hashbuster tokens etc., so the body text matches

That’s really not a lot of data — 64 bytes per address that requires a retry. Then, an hour or more later, do the retry.

So, IMO, ‘greylisting‘ will work fine in the short term, until it becomes reasonably common — then the spamtool develop