Still using perl 5.6.x?

For the upcoming release of Apache SpamAssassin, we’re considering dropping support for perl 5.6.x interpreters. Perl 5.6.0 is 9 years old, and the most recent maintainance release, 5.6.2, dates back to November 2003. The current 5.x release branch is 5.10, so we’re still sticking with a “support the release branch before the current one” policy this way.

If you’re still using one of the 5.6.x versions, or know of a (relatively recent) distro that does, please reply to highlight this….

Tags: , , ,

Comments (5)

Hack: reassassinate

A coworker today, returning from a couple of weeks holiday, bemoaned the quantities of spam he had to wade through. I mentioned a hack I often used in this situation, which was to discard the spam and download the 2 weeks of supposed-nonspam as a huge mbox, and rescan it all with spamassassin — since the intervening 2 weeks gave us plenty of time for the URLs to be blacklisted by URIBLs and IPs to be listed by DNSBLs, this generally results in better spamfilter accuracy, at least in terms of reducing false negatives (the “missed spam”). In other words, it gets rid of most of the remaining spam nicely.

Chatting about this, it occurred to us that it’d be easy enough to generalize this hack into something more widely useful by hooking up the Mail::IMAPClient CPAN module with Mail::SpamAssassin, and in fact, it’d be pretty likely that someone else would already have done so.

Sure enough, a search threw up this node on perlmonks.org, containing a script which did pretty much all that. Here’s a minor freshening: download

reassassinate – run SpamAssassin on an IMAP mailbox, then reupload

Usage: ./reassassinate –user jmason –host mail.example.com –inbox INBOX –junkfolder INBOX.crap

Runs SpamAssassin over all mail messages in an IMAP mailbox, skipping ones it’s processed before. It then reuploads the rewritten messages to two locations depending on whether they are spam or not; nonspam messages are simply re-saved to the original mailbox, spam messages are sent to the mailbox specified in “–junkfolder”.

This is especially handy if some time passed since the mails were originally delivered, allowing more of the message contents of spam mails to be blacklisted by third-party DNSBLs and URIBLs in the meantime.

Prerequisites:

  • Mail::IMAPClient
  • Mail::SpamAssassin

Tags: , , , , ,

Comments (3)

Links for 2008-09-13

Tags: , , , ,

Comments

Links for 2008-08-14

Tags: , , , , , , , , , , , , , , , , , , , , , , , ,

Comments

Links for 2008-08-12

Tags: , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,

Comments

Links for 2008-07-22

ZSFA — I Want The Mutt Of Feed Readers Zed recommends Newsbeuter. must take a look

We Want A Dead Simple Web Tablet For $200. Help Us Build It. having worked on a project to do just this, believe me, this is doomed. DOOMED

Science Clouds ‘compute cycles in the cloud for scientific communities .. allows you to provision customized compute nodes .. that you have full control over using a leasing model based on the Amazon’s EC2 service.’ Wonder if they’d like to give SA some time ;)

Tags: , , , , , , , , , , , , , , , , , , , , , ,

Comments (1)

Backscatter rising

Recently, more and more people have been complaining about backscatter; its levels seem to have increased over the past few weeks.

If you’re unfamiliar with the terminology — backscatter is mail you didn’t ask to receive, generated by legitimate, non-spam-sending systems in response to spam. Here are some examples, courtesy of Al Iverson:

  • Misdirected bounces from spam runs, from mail servers who “accept then bounce” instead of rejecting mail during the SMTP transaction.
  • Misdirected virus/worm “OMG your mail was infected!” email notifications from virus scanners.
  • Misdirected “please confirm your subscription” requests from mailing lists that allow email-based signup requests.
  • Out of office or vacation autoreplies and autoresponders.
  • Challenge requests from “Challenge/Response” anti-spam software. Maybe C/R software works great for you, but it generates significant backscatter to people you don’t know.

It used to be OK to send some of these types of mail — but no longer. Nowadays, due to the rise in backscatter caused by spammer/malware abuse, it is no longer considered good practice to “accept then bounce” mail from an SMTP session, or in any other way respond by mail to an unauthorized address of the mail’s senders.

Backscatter as spam delivery mechanism

I would hazard a guess that this rise is due to one of the major spam-sending botnets adopting the use of “real” sender addresses rather than randomly-generated fake ones, probably in order to evade broken-by-design Sender-Address Verification filters.

There’s an alternate theory that spammers use backscatter as a means of spam delivery — intending for the mails to bounce, in effect using the bounce as the spam delivery mechanism. Symantec’s most recent “State of Spam” report in particular highlights this.

I don’t buy it, however. Compare their own example message — here’s what the mail originally sent by the spammer to the bouncer, rendered:

img

And here’s what it looks like once it passes through the bouncer’s mail system:

img2

That’s simply unreadable. There’s absolutely no way for a targeted end user to read the “payload” there…

Getting rid of it

I haven’t run into this recent spike in backscatter at all, myself, since I have a working setup that deals with it. This blog post describes it. If you’re using Postfix and SpamAssassin, it would be well worth taking a look; if you’re just using SpamAssassin and not Postfix, you should still try using the Virus Bounce Ruleset to rid yourself of various forms of unwanted bounce message.

Note that you need to set the ‘whitelist_bounce_relays’ setting to use the ruleset, otherwise its rules will not fire.

SPF

There’s a theory that setting SPF records (or other sender-auth mechanisms like DomainKeys or DKIM) on your domains, will reduce the amount of backscatter sent to your domains. Again, I doubt it.

Backscatter is being sent by old, legacy mail systems. These systems aren’t configured to take SPF into account either. When they’re eventually updated, it’s likely they’ll be fixed to simply not send “accept then bounce” responses after the SMTP transaction has completed. It’s unlikely that a system will be fixed to take SPF into account, but not fixed to stop sending backscatter noise.

It’s good advice to use these records anyway, but don’t do it because you want to stop backscatter.

What about my own bounces?

You might be worried that the SpamAssassin VBounce ruleset will block bounces sent in response to your own mail. As long as the error conditions are flagged during the SMTP transaction (as they should be nowadays), and you’ve specified your own mailserver(s) in ‘whitelist_bounce_relays’, you’re fine.

Tags: , , , , ,

Comments (7)

SpamAssassin thresholds again

Following on from this post, I came across a great graphing Flash applet from amCharts which is ideal for putting a much improved UI on those SpamAssassin ‘required_score’ threshold line charts.

Here’s the result — nice, zoomable, dynamic charts, displaying the effectiveness of various ‘required_score’ thresholds depending on whether Bayes and network rules are active in your SpamAssassin configuration. Very cool!

Tags: , , ,

Comments

On the effects of lowering your SpamAssassin threshold

So I was chatting to Danny O’Brien a few days ago. He noted that he’d reduced his Spamassassin “this is spam” threshold from the default 5.0 points to 3.7, and was wondering what that meant:

I know what it means in raw technical terms — spamassassin now marks anything >3.7 as spam, as opposed to the default of five. But given the genetic algorithm way that SA calculates the rule scoring, what does lowering the score mean? That I’m more confident that stuff marked ham is stuffed marked ham than the average person? That my bayesian scoring is now really good?

Do people usually do this without harmful side-effects? What does it mean about them if they do it?

Does it make me a good person? Will I smell of ham? These are the things that keep me awake at night.

It’s a good question! Here’s what I responded with — it occurs to me that this is probably quite widely speculated about, so let’s blog it here, too.

As you tweak the threshold, it gets more or less aggressive.

By default, we target a false positive rate of less than 0.1% — that means 1 FP, a ham marked as spam incorrectly, per 1000 ham messages. Last time the scores were generated, we ran our usual accuracy estimation tests, and got a false positive rate of 0.06% (1 in 1667 hams) and a false negative rate of 1.49% (1 in 67 spams) for the default threshold of 5.0 points. That’s assuming you’re using network tests (you should be) and have Bayes training (this is generally the case after running for a few weeks with autolearning on).

If you lower the threshold, then, that trades off the false negatives (reducing them — less spam getting past) in exchange for more false positives (hams getting caught). In those tests, here’s some figures for other thresholds:

SUMMARY for threshold 3.0: False positives: 290 0.43% False negatives: 313 0.26%

SUMMARY for threshold 4.0: False positives: 104 0.15% False negatives: 1084 0.91%

SUMMARY for threshold 4.5: False positives: 68 0.10% False negatives: 1345 1.13%

so you can see FPs rise quite quickly as the threshold drops. At 4.0 points, the nearest to 3.7, 1 in 666 ham messages (0.15%) will be marked incorrectly as spam. That’s nearly 3 times as many FPs as the default setting’s value (0.06%). On the other hand, only 1 in 109 spams will be mis-filed.

Here’s the reports from the last release, with all those figures for different thresholds — should be useful for figuring out the likelihoods!

In fact, let’s get some graphs from that report. Here is a graph of false positives (in orange) vs false negatives (in blue) as the threshold changes…

and, to illustrate the details a little better, zoom in to the area between 0% and 1%…

You can see that the default threshold of 5.0 isn’t where the FP% and FN% rates meet; instead, it’s got a much lower FP% rate than FN%. This is because we consider FPs to be much more dangerous than missed spams, so we try to avoid them to a higher degree.

An alternative, more standardized way to display this info is as a Receiver Operating Characteristic curve, which is basically a plot of the true positive rate vs false positives, on a scale from 0 to 1.

Here’s the SpamAssassin ROC curve:

More usefully, here’s the ROC curve zoomed in nearer the “perfect accuracy” top-left corner:

Unfortunately, this type of graph isn’t much use for picking a SpamAssassin threshold. GNUplot doesn’t allow individual points to be marked with the value from a certain column, otherwise this would be much more useful, since we’d be able to tell which threshold value corresponds to each point. C’est la vie!

Update:: this is possible with GNUplot 4.2 onwards, it seems. great news! Hat tip to Philipp K Janert for the advice. here are updated graphs using this feature:

(GNUplot commands to render these graphs are here.)

Update again: much better interactive Flash graphs here.

Tags: , , , , , ,

Comments (13)

Test my auto-generated ruleset

(I posted this to the SA users and dev lists, too.)

I’ve been working on a new way to auto-generate body rules recently (see previous posts). The results are checked into SVN trunk daily in the “rulesrc/sandbox/jm/20_sought.cf” file.

We haven’t had much time to figure out how to produce auto-generated 3.2.x rule updates for our entire ruleset at updates.SpamAssassin.org, so instead of dealing with that, I’ve taken a shortcut around it ;) I’m now making just the “20_sought.cf” ruleset available as a standalone, unofficial sa-update ruleset at sought.rules.yerp.org.

Before using it, you’ll need the GPG key:

  wget http://yerp.org/rules/GPG.KEY
  sudo sa-update --import GPG.KEY

then use this to update:

  sudo sa-update \
        --gpgkey 6C6191E3 --channel sought.rules.yerp.org \
        [...other channels...] \
        --channel updates.spamassassin.org

(similar to how you’d use Daryl’s sa-update version of the SARE rulesets.)

Feel free to run sa-update as frequently as you like.

Please consider it alpha; I may take it down in a few months depending on how it goes, or if we can get it working as part of the core updates. In the meantime though, I’m curious to hear how you get on with it. (In particular, copies of false positives would be very welcome.)

Update: it’s been very successful, so I’d now consider it in production.

Tags: , , ,

Comments (39)

Rule Discovery Progress Update

Back in March, I wrote a post about a new rule discovery algorithm I’d come up with, based on the BLAST bioinformatics algorithm. I’m still hacking on that; it’s gradually meandering towards production status, as time permits, so here’s an update on that progress.

There have been various tweaks to improve memory efficiency; I won’t go into those here, since they’re all in SVN history anyway. But the results are that the algorithm can now extract rules from 3500 spam and 50000 ham messages without consuming more than 36 MB of RAM, or hitting disk. It can also now generate a SpamAssassin rules file directly, and apply a basic set of QA parameters (required hit rate, required length of pattern, etc.).

On top of this, I’ve come up with a workflow to automatically generate a usable batch of rules, on a daily basis, from a spam and ham corpus. This works as follows:

  • Take a sample of the past 4 days traffic from our spamtrap network. Today this was about 3000 messages.

  • add the hand-vetted spam from my own accounts over the same period (this helps reduce bias, since spamtraps tend to collect a certain type of spam), about 3400 messages.

  • discard spams that scored over 10 points (to concentrate on the stuff we’re missing).

  • Pass the remaining 3517 spams, and text strings from over 50000 nonspam messages, into the “seek-phrases-in-log” script, specifying a minimum pattern length of 30 characters, and a minimum hitrate of 1% (in today’s corpus, a rule would have to hit at least 34 messages to qualify).

  • That script gronks for a couple of minutes, then produces an output rules file, in this case containing 28 rules, for human vetting. (Since I’ve started this workflow, I’ve only had to remove a couple of rules at this step, and not for false positives; instead, they were leaking spamtrap addresses.)

  • Once I’ve vetted it, I check it into rulesrc/sandbox/jm/20_sought.cf for testing by the SpamAssassin rule QA system.

The QA results for the ruleset from yesterday (Aug 3) can be seen here, and give a pretty good idea of how these rules have been performing over the past week or two; out of the nearly 70000 messages hit by the rules, only 2 ham mails are hit — 0.0009%.

In fact, I measured the ruleset’s overall performance in the logs provided by the 4 mass-check contributors who provided up-to-date data in yesterday’s nightly mass-check; bb-jm, jm, daf, dos, and theo (all SpamAssassin committers):

Contributor Hits Spams Percent
bb-jm 4249 24996 17.00%
jm 3450 14994 23.00%
daf 1236 35563 3.48%
dos 32867 100223 32.79%
theo 28077 382562 7.34%

(bb-jm and jm are both me; they scan different subsets of my mail.)

The “Percent” column measures the percentage of their spam collection that is hit by at least one of these rules; it works out to an average of 16.72% across all contributors. This is underestimating the true hitrate on “fresh” spam, too, since the mass-check corpora also include some really old spam collections (daf’s collection, for example, looks like it hasn’t been updated since the start of July).

Even better, a look at the score-map for these rules shows that they are, indeed, hitting the low-scoring spam that other rules don’t hit.

That’s pretty good going for an entirely-automated ruleset!

The next step is to come up with scores, and publish these for end-user use. I haven’t figured out how this’ll work yet; possibly we could even put them into the default “sa-update” channel, although the automated nature of these rules may mean this isn’t a goer.

If you’re interested, the hits-over-time graph for one of the rules (body JM_SEEK_ICZPZW / Home Networking For Dummies 3rd Edition \$10 /) can be viewed here.

Tags: , , , , ,

Comments (3)

A fishy Challenge-Response press release

I have a Google News notification set up for mentions of “SpamAssassin”, which is how I came across this press release on PRNewsWire:

Study: Challenge-Response Surpasses Other Anti-Spam Technologies in Performance, User Satisfaction and Reliability; Worst Performing are Filter-based ISP Solutions

NORTHBOROUGH, Mass., July 17 /PRNewswire/ — Brockmann & Company, a research and consulting firm, today released findings from its independent, self-funded “Spam Index Report– Comparing Real-World Performance of Anti-Spam Technologies.”

The study evaluated eight anti-spam technologies from the three main technology classes — filters, real-time black list services and challenge- response servers. The technologies were evaluated using the Spam Index, a new method in anti-spam performance measurement that leverages users’ real-world experiences.

[...] The report finds that the best performing anti-spam technology is challenge-response, based on that technology’s lowest average Spam Index score of 160.

[...] Filter – Open Source software-(Spam Index: 388): This technology is frequently configured to work in conjunction with PC email client filters. The server adds * * SPAM * * to the subject line so that the client filter can move the message into the junk folder. This class of software includes projects such as ASSP, Mail Washer and SpamAssassin, among others.

The “Spam Index” is a proprietary measurement of spam filtering, created by Brockmann and Company. A lower “Spam Index” score is better, apparently, so C/R wins! (Funny that. The author, Peter Brockmann, seems to have some kind of relationship with C/R vendor Sendio, being quoted in Sendio press releases like this one and this one, and providing a testimonial on the Sendio.com front page.)

However — there’s a fundamental flaw with that “Spam Index” measurement, though; it’s designed to make C/R look good. Here’s how it’s supposed to work. Take these four measurements:

  • Average number of spam messages each day x 20 (to get approximate number per work-month)
  • Average minutes spent dealing with spam each day x 20 (to get approximate minutes per work-month)
  • Number of resend requests last month
  • Number of trapped messages last month

Then sum them, and that gives you a “Spam Index”.

First off, let’s translate that into conventional spam filter accuracy terms. The ‘minutes spent dealing with spam each day’ measures false negatives, since having to ‘deal with’ (ie delete) spam means that the spam got past the filter into the user’s inbox. The ‘number of trapped messages’ means, presumably, both true positives — spam marked correctly as spam — and false positives – nonspam marked incorrectly as spam. The ‘number of resend requests last month’ also measures false positives, although it will vastly underestimate them.

Now, here’s the first problem. The “Spam Index” therefore considers a false negative as about as important as a false positive. However, in real terms, if a user’s legit mail is lost by a spam filter, that’s a much bigger failure than letting some more spam through. When measuring filters, you have to consider false positives as much more serious! (In fact, when we test SpamAssassin, we consider FPs to be 50 times more costly than a false negative.)

Here’s the second problem. Spam is sent using forged sender info, so if a spammer’s mail is challenged by a Challenge/Response filter, the challenge will be sent to one of:

  • (a) an address that doesn’t exist, and be discarded (this is fine); or
  • (b) to an invalid address on an innocent third-party system (wasting that system’s resources); or
  • (c) to an innocent third-party user on an innocent third-party system (wasting that system’s resources and, worst of all, the user’s time).

The “Spam Index” doesn’t measure the latter two failure cases in any way, so C/R isn’t penalised for that kind of abusive traffic it generates.

Also, if a good, nonspam mail is challenged, either

  • (a) the sender will receive the challenge and take the time to jump through the necessary hoops to get their mail delivered (”visit this web page, type in this CAPTCHA, click on this button” etc.); or
  • (b) they’ll receive the challenge, and not bother jumping through hoops (maybe they don’t consider the mail that important); or
  • (c) they’ll not be able to act on the challenge at all (for example, if an automated mail is challenged).

Again, the “Spam Index” doesn’t measure the latter two failure cases.

In other words, the situations where C/R fails are ignored. Is it any wonder C/R wins when the criteria are skewed to make that happen?

Tags: , , , , , ,

Comments (37)

Lyris’ low SpamAssassin threshold

via jgc’s newsletter, Lyris’ latest ISP Deliverability Report (Q1 2007) makes an interesting point about legitimate bulk mail and SpamAssassin:

Contrary to popular belief among marketers, message content is not a major cause of deliverability challenges for most email marketers. This finding is a result of testing the content of more than 1,705 unique emails, using [Lyris] EmailAdvisor’s content scoring tool. The content scoring function is based on the content scoring rules of the widely adopted Spam Assassin open source project. The emails tested had an average content point score of 1.04 well below the filter’s generally accepted spam identification level of 3.0 or higher.

Now, that’s broadly good advice — SpamAssassin hasn’t really given much strength to signatures found in message body text in the past couple of years, since the signatures from other sources (especially DNS blocklists and URI blocklists) are much more reliable.

However, note the bit I emphasised. Since when is 3.0 the ‘generally accepted spam identification level’? Only the most paranoid user would ever go that low, since at that level, they’d expect to find 2.22% of their nonspam mail going into the spam folder (according to our own tests). In reality, our recommended level has always been 5.0 points, and that’s what we optimise for. I’m mystified as to where they’re getting 3.0 from…

Tags: , , ,

Comments (3)

MAAWG Talk

Here’s the talk I gave at MAAWG, entitled New Features in SpamAssassin 3.2.0 Of Interest To Large Receivers:

Abstract:

Many ISPs and mail receivers, at all scales, use SpamAssassin as part of their spam-filtering arsenal. The recent release of SpamAssassin 3.2.0 introduces much new functionality, and some of this is of particular interest to the large-scale mail receiver; in particular, rules compiled to parallel-matching native object code for increased speed, early short-circuiting based on administrator-specified rules, the new “msa_networks” setting to specify MSA hosts or pools, a new ruleset to detect spam/virus backscatter bounces, a way to run SpamAssassin in the Apache httpd server using mod_perl, and support for Amazon’s EC2 virtual server farm. In this talk, I’ll discuss each of these in detail, and discuss why it may be useful to you.

If you were at MAAWG, hope you enjoyed it ;)

Tags: , , , , ,

Comments (1)

Dealing with backscatter, revisited

Back in January, I wrote about how I deal with email backscatter nowadays. Since then, I’ve made a notable tweak.

This is that I no longer reject “null-sender” traffic during the SMTP transaction. It turned out that it broke Exim’s implementation of Sender Address Verification, which performs the SAV check using a MAIL FROM of <>, rendering it indistinguishable from a bounce during the SMTP transaction.

Now, I’ve complained about SAV, but I have to be pragmatic anyway (Postel’s law and all that!) — so it was better to just allow other sites to perform SAV lookups against our server, and fix the anti-bounce stuff some other way.

The new method (below) does this, by allowing null-sender SMTP traffic just fine; it detects bounces in Postfix if they arrive via SMTP in RFC-3464 format, and bounces that slip past are then dealt with in a more CPU-intensive manner using the SpamAssassin “VBounce” ruleset (which is part of the now-released SpamAssassin 3.2.0, btw).

This increases the load, since some bounces cannot be rejected at MAIL FROM time now, and instead we have to wait ’til DATA — but CPU hasn’t been a problem recently, so this is ok.

Here are the updated instructions:

In Postfix

In my Postfix configuration, on the machine that acts as MX for my domains – edit ‘/etc/postfix/header_checks’, and add these lines:

/^Content-Type: multipart\/report; report-type=delivery-status\;/  REJECT no third-party DSNs
/^Content-Type: message\/delivery-status; /     REJECT no third-party DSNs

Edit ‘/etc/postfix/main.cf’, and ensure it contains:

header_checks = regexp:/etc/postfix/header_checks

Then run:

sudo /etc/init.d/postfix restart

This catches most of the bounces — RFC-3464-format Delivery-Status-Notification messages from other mail servers.

In SpamAssassin

As before, install the Virus-bounce ruleset and set it up. This will catch challenge-response mails, “out of office” noise, “virus scanner detected blah” crap, and bounce mails generated by really broken groupware MTAs — the stuff that gets past the Postfix front-line.

Tags: , , , , , , , , ,

Comments (7)

SpamAssassin 3.2.0!

W00t! SpamAssassin 3.2.0 has finally gone gold!

This release is a big one — it’s the first major release since 3.1.0, back in September 2005, just over a year and a half ago. Here is the release announcement mail, containing a list of major changes since version 3.1.8. There are a few major new features that I feel worth picking out in more detail and editorialising about:

sa-compile

This is a biggie. This new script takes the active SpamAssassin ruleset, and uses code contributed by Matt Sergeant to produce input for re2c. re2c in turn compiles the ruleset into a deterministic finite automaton, which can match multiple regular expressions in parallel. That’s not all, though; re2c then compiles that DFA into C code — which is then compiled into native object code. SpamAssassin will then load that object code and use it to replace the slower perl regexp tests, if it’s available at scan-time.

Now, it’s been a long time since SpamAssassin’s ruleset consisted mainly of rudimentary regular expressions matched against the body text — a good portion of SpamAssassin’s ruleset these days operates against headers, performs network lookups, analyzes URLs extracted from the body, uses the more advanced features supported by Perl’s NFA regexp engine, or so on. But even given that, the effects of ’sa-compile’ seem to average between a 15% and 25% speedup, in my testing. That’s good ;)

Many of the commercial versions of SpamAssassin include their own body-rule speedups — but this is the first time anything similar has made it into the open source code.

Short-circuiting

Another good one for performance. There are some rules that you can reasonably assume will never hit nonspam or spam mail in a well-configured setup. For example, a hit on “ALL_TRUSTED” should mean that the message never traversed an untrusted network, therefore it cannot be spam, so why bother applying the expensive tests? It should be reasonable to “short-circuit” and immediately return a “ham” score for that mail.

This new plugin implements that algorithm — and efficiently, too, which historically has been the hard part!

I’ve been using this for a while with a ruleset like this one — in my experience, it’s cut overall CPU time spent scanning mail by 20%.

It is pretty flexible, too — there’s lot of tweakage that can be done with this functionality to suit your own setup.

Reduced memory footprint

One aim of this release has been to reduce the memory usage of SpamAssassin; the core code now uses less RAM than 3.1.x does, when tested with the same ruleset. (Unfortunately we’ve added lots more rules in the interim, so it’s a bit of a wash overall. ;)

The VBounce anti-bounce ruleset

Detects spurious bounce messages sent by broken mail systems in response to spam or viruses. More info about that here.

Apache-spamd

apache-spamd implements spamd as a mod_perl module. This was contributed by Radoslaw Zielinski, as a Google Summer of Code project last year. Thanks Radoslaw!

There are plenty more new, useful features and rules — these are just the top ones, in my opinion. Pretty cool stuff!

Tags: , , , , , , , ,

Comments (2)

Using qpsmtpd for traps.spamassassin.org

Like many anti-spam systems these days, SpamAssassin operates a network of spamtraps. One set of these run off traps.SpamAssassin.org, a server kindly donated by ISP Sonic.net.

Large-scale spam-trapping systems like this are generally run in quite a secretive manner, but we’re an open source project — so it may be interesting if I give some details of our setup. Here’s a potted history of how this spamtrap server has run over the years…

The beginning

The architecture was initially very simple. The MX was Postfix, delivering to the “trapper” user, which in turn ran procmail, which directly ran a perl script. This perl script then performed the trap actions, namely: DoS prevention, discarding viruses and malware, discarding backscatter bounces, extraction and cleanup of the incoming mails, then onward reporting, archival, and further distribution.

Given that this was a target for spam — and we want as much spam as possible here! — this would predictably run into load issues. Right at the beginning, back in around 2001/2002, I ran this on our shared server, where it pretty quickly caused trouble for delivery of other, more useful mail. It was around this time that Sonic kindly donated the server.

With dedicated hardware, we weren’t seeing much trouble — it was enough to just wait for the few hours for a traffic spike to pass, and the Postfix queue would then clear.

Clearing the queues

After a few months, though, this wasn’t enough — the queue would get consistently clogged, and the backlog became enough to result in the incoming spam being delayed for days before it made it from the MX to the trap archives. For a spamtrap, you want fresh spam, but not necessarily all spam — so I installed a cron job to simply clear the queue on a nightly basis. (I also had to restart the Postfix server, too, since it’d occasionally get hung and stop accepting connections on port 25, presumably due to load issues.)

IPC::DirQueue

The next level was an inability of the procmail/perl script end to process the mail fast enough for the MTA to keep up with the incoming connections, and follow-on problems, caused by load generated by the perl script impacting the MX’s activity. To work around these, I designed a new queueing backend, based around IPC::DirQueue. This allowed a new split architecture; the procmail-run perl script was extremely lightweight, delivering all inbound mail to a dirqueue and exiting quickly, allowing the MX to get back to the next inbound spam message, and the trap processing script was then split into a web of dirqueues, allowing each individual part of the trap backend pipeline to operate independently.

There were several benefits to this:

  1. Since dirqueues operate as a batch-processing model, load spikes become irrelevant; the load incurred is limited by how many dequeuer processes are run.
  2. The time taken in backend tasks becomes irrelevant to the MX throughput, since that is bottlenecked only by the lightweight perl script and its write speed to the “incoming” dirqueue.
  3. By splitting the backend work into multiple queues, outages in the spam-reporting systems or onward forwardings become much less of a problem, since they won’t affect inbound spam, archival, outbound delivery to other reporting systems, forwards, etc.

Again, the dirqueues were cleared on a frequent basis, to discard the “spiky” traffic and ensure we were just seeing samples of the freshest spam. The dirqueues use a tmpfs as the backing storage directory, so it never hits the disk at all.

This worked pretty well for several years — from 80 megabytes of spam per day to the current level, which is around 130MB per day. However, we still occasionally saw problems from load spikes, where high load caused the traps to refuse incoming SMTP connections — purely because the load of inbound connections is too high for the Postfix MX to accept them all in a timely fashion.

qpsmtpd

Last weekend, I had a go at a project I’d been thinking of trying out for a long time — switching from Postfix to qpsmtpd. A while back, Matt Sergeant rewrote qpsmtpd to use Danga::Socket, Danga Interactive / Six Apart’s insanely scalable event-driven asynchronous socket class, as used in mogilefsd, perlbal and djabberd. This article notes that ‘two large antispam companies’ high-traffic spam traps have used this effectively since the second quarter of 2005, delivering concurrency as high as 10,000 on some occasions’, so it seemed likely to work ;)

Sure enough, results have been great… we now have a pure-perl system handling heavy volumes without breaking a sweat, certainly compared to the previous system. qpsmtpd’s plugin system was elegant, allowing me to annotate inbound spam with more details of the SMTP transaction, write plugins to deliver mail to a dirqueue directly instead of to an MTA, and do some conditional code (ie. basic “deliver this RCPT TO to this queue”) where needed.

Full details are over on the QpsmtpdSpamtrap page on the taint.org wiki, for the curious.

Tags: , , , , , , , ,

Comments (5)

A SpamAssassin rule-discovery algorithm

Just to get a little techie again… here’s a short article on a new algorithm I’ve come up with.

Text-matching rule-based anti-spam systems are pretty common — SpamAssassin, of course, is probably the most well-known, and of course the proprietary apps built on SpamAssassin also use this. However, other proprietary apps also seem to use similar techniques, such as Symantec’s Brightmail and MessageLabs’ scanner (hi Matt ;) — and doubtless there are others. As a result, ways to write rules quickly and effectively are valuable.

So far, most SpamAssassin text rules are manually developed; somebody looks at a few spam samples, spots common phrases, and writes a rule to match that. It’d be great to automate more of that work. Here’s an algorithm I’ve developed to perform this in a memory-efficient and time-efficient way. I’m quite proud of this, so thought it was worth a blog posting. ;)

Corpus collection

First, we collect a corpus of spam and “ham” (non-spam) mails. Standard enough, although in this case it helps to try to keep it to a specific type of mail (for example, a recent stock spam run, or a run from the OEM spammer).

Typically, a simple “grep” will work here, as long as the source corpus is all spam anyway; a small number of irrelevant messages can be left in, as long as the majority 80% or so are variations on the target message set. (The SpamAssassin mass-check tool can now perform this on the fly, which is helpful, using the new ‘GrepRenderedBody’ mass-check plugin.)

Rendering

Next, for each spam message, render the body. This involves:

  • decoding MIME structure
  • discarding non-textual parts, or parts that are not presented to the viewer by default in common end-user MUAs (such as attachments)
  • decoding quoted-printable and base64 encoding
  • rendering HTML, again based on the behaviour of the HTML renderers used in common end-user MUAs
  • normalising whitespace, “this is\na \ntest” -> “this is a test”

All pretty basic stuff, and performed by the SpamAssassin “body” rendering process during a “mass-check” operation. A SpamAssassin plugin outputs each message’s body string to a log file.

Next, we take the two log files, and process them using the following algorithm:

N-gram Extraction

Iterate through each mail message in the spam set. Each message is assigned a short message ID number. Cut off all but the first 32 kbytes of the text (for this algorithm, I think it’s safe to assume that anything past 32 KB will not be a useful place for spammers to place their spam text). Save a copy of this shortened text string for the later “collapse patterns” step.

Split the text into “words” — ie. space-separated chunks of non-whitespace chars. Compress each “word” into a shorter ID to save space:

"this is a test" => "a b c d"

(The compression dictionary used here is shared between all messages, and also needs to allow reverse lookups.)

Then tokenize the message into 2-word and 3-word phrase snippets (also known as N-grams):

"a b c d" => [ "a b", "b c", "c d", "a b c", "b c d" ]

Remove duplicate N-grams, so each N-gram only appears once per message.

For each N-gram token in this token set, increment a counter in a global “token count” hashtable, and add the message ID to the token’s entry in a “message subset hit” table.

Next, process the ham set. Perform the same algorithm, except: don’t keep the shortened text strings, don’t cut at 32KB, and instead of incrementing the “token count” hash entries, simply delete the entries in the “token count” and “message subset hit” tables for all N-grams that are found.

By the end of this process, all ham and spam have been processed, and in a memory-efficient fashion. We now have:

  • a table of hit-counts for the message text N-grams, with all N-grams where P(spam) < 1.0 — ie. where even a single ham message was hit — already discarded
  • the “message subset hit” table, containing info about exactly which subset of messages contain a given N-gram
  • the token-to-word reverse-lookup table

To further reduce memory use, the word-to-token forward-lookup table can now be freed. In addition, the values in the “message subset hit” table can be replaced with their hashes; we don’t need to be able to tell exactly which messages are listed there, we just need a way to tell if one entry is equal to another.

Summarisation

Iterate through the hit-count table. Discard entries that occur too infrequently to be listed; discard, especially, entries that occur only once. (We’ve already discarded entries that hit any ham.)

Make a hash that maps the message subsets to the set of all N-gram patterns for that message-subset. For each subset, pick a single N-gram, and note the hit-count associated with it as the hit-count value for that entire message-subset. (Since those N-grams all appear in the exact same subset of messages, they will always have the same P(spam) — this is a safe shortcut.)

Iterate through the message subsets, in order of their hit-count. Take all of the message-subset’s patterns, decode the N-grams in all patterns using the token-to-word reverse-lookup table, and apply this algorithm to that pattern set:

Collapse patterns

So, input here is an array of N-gram patterns, which we know always occur in the same subset of messages. We also have the saved array of all spam messages’ shortened text strings, from the N-gram extraction step. With this, we can apply a form of the BLAST pattern-discovery algorithm, from bioinformatics.

Pop the first entry off the array of patterns. Find any one mail from the saved-mails array that hits this pattern. Find the single character before the pattern in this mail, and prepend it to the pattern. See if the hits for this new pattern are the same message set as hit the old pattern; if not, restore the old pattern and break. If you hit the start of the mail message’s text string, break. Then apply the same algorithm forward through the mail text.

By the end of that, you have expanded the pattern from the basic N-gram as far as it’s possible to go in both directions without losing a hit.

Next, discard all patterns in the pattern array that are subsumed by (ie. appear in) this new expanded pattern. Add it to the output list of expanded patterns, unless it in turn is already subsumed by a pattern in that list; discard any patterns in the output list that are subsumed by this new pattern; and move onto the next pattern in the input list until they’re all exhausted.

(By the way, the “discard if subsumed” trick is the reason why we start off with 3-word N-grams — it gives faster results than just 2-word N-grams alone, presumably by reducing the amount of work that this collapse stage has to do, by doing more of it upfront at a relatively small RAM cost.)

Summarisation (continued)

Finally, output a line listing the percentage of the input spam messages hit (ie. (hit-count value / total number of spams) * 100) and the list of expanded patterns for that message-subset, then iterate on to the next message-subset.

Example

Here’s an example of some output from recent “OEM” stock spam:

$ ./seek-phrases-in-corpus --grep 'OEM' \
        spam:dir:/local/cor/recent/spam/.2007022 \
        ham:dir:/local/cor/recent/ham/.200702
[mass-check progress noises omitted]
 RATIO   SPAM%    HAM%   DATA
 1.000  72.421   0.000  / OEM software - throw packing case, leave CD, use electronic manuals. Pay for software only and save 75-90%! /,
                         / TOP 1O ITEMS/
 1.000  73.745   0.000  / $99 Macromedia Studio 8 $59 Adobe Premiere 2.0 $59 Corel Grafix Suite X3 $59 Adobe Illustrator CS2 $129 Autodesk Autocad 2007 $149 Adobe Creative Suite 2 /,
                         /s: Adobe Acrobat PR0 7 $69 Adobe After Effects $49 Adobe Creative Suite 2 Premium $149 Ableton Live 5.0.1 $49 Adobe Photoshop CS $49 http:\/\//,
                         / Microsoft Office 2007 Enterprise Edition Regular price: $899.00 Our offer: $79.95 You save: $819.95 (89%) Availability: Pay and download instantly. http:\/\//,
                         / Adobe Acrobat 8.0 Professional Market price: $449.00 We propose: $79.95 Your profit: $369.05 (80%) Availability: Available for /,
                         / $49 Windows XP Pro w\/SP2 $/,
                         / Top-ranked item. (/,
                         /, use electronic manuals. Pay for software only and save 75-90%! /,
                         / Microsoft Windows Vista Ultimate Retail price: $399.00 Proposition: $79.95 Your benefit: $319.05 (80%) Availability: Can be downloaded /,
                         / $79 MS Office Enterprise 2007 $79 Adobe Acrobat 8 Pro $/,
                         / Best choice for home and professional. (/,
                         / OEM software - throw packing case, leave CD/,
                         / Sales Rank: #1 (/,
                         / $79 Microsoft Windows Vista /,
                         / manufacturers: Microsoft...Mac...Adobe...Borland...Macromedia http:\/\//
 1.000  73.855   0.000  / MS Office Enterprise 2007 /,
                         /9 Microsoft Windows Vista /,
                         / Microsoft Windows Vista Ultimate /,
                         /9 Macromedia Studio 8 /,
                         / Adobe Acrobat 8.0 /,
                         / $79 Adobe /
 1.000  74.242   0.000  / Windows XP Pro/
 1.000  74.297   0.000  / Adobe Acrobat /
 1.000  74.462   0.000  / Adobe Creative Suite /
 1.000  74.573   0.000  / Adobe After Effects /
 1.000  74.738   0.000  / Adobe Illustrator /
 1.000  74.959   0.000  / Adobe Photoshop CS/
 1.000  75.014   0.000  / Adobe Premiere /
 1.000  75.290   0.000  / Macromedia Studio /
 1.000  75.786   0.000  /OEM software/
 1.000  75.841   0.000  / Creative Suite /
 1.000  75.896   0.000  / Photoshop CS/
 1.000  75.951   0.000  / After Effects /
 1.000  76.062   0.000  /XP Pro/
 1.000  82.460   0.000  / $899.00 Our /,
                         / Microsoft Office 2007 Enterprise /,
                         / $79.95 You/

Immediately, that provides several useful rules; in particular, that final set of patterns can be combined with a SpamAssassin “meta” rule to hit 82% of the samples. Generating this took a quite reasonable 58MB of virtual memory, with a runtime of about 30 minutes, analyzing 1816 spam and 7481 ham mails on a 1.7Ghz Pentium M laptop.

(Update:) here’s a sample message from that test set, demonstrating the top extracted snippets in bold:

  Return-Path: <tyokaluassa.com@ultradian.com>
  X-Spam-Status: Yes, score=38.2 required=5.0 tests=BAYES_99,DK_POLICY_SIGNSOME,
          FH_HOST_EQ_D_D_D_D,FH_HOST_EQ_VERIZON_P,FH_MSGID_01C67,FUZZY_SOFTWARE,
          HELO_LOCALHOST,RCVD_IN_NJABL_DUL,RCVD_IN_PBL,RCVD_IN_SORBS_DUL,RDNS_DYNAMIC,
          URIBL_AB_SURBL,URIBL_BLACK,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_RHS_DOB,
          URIBL_SBL,URIBL_SC_SURBL shortcircuit=no autolearn=spam version=3.2.0-r492202
  Received: from localhost (pool-71-125-81-238.nwrknj.east.verizon.net [71.125.81.238])
          by dogma.boxhost.net (Postfix) with SMTP id E002F310055
          for <xxxxxxxxxxx@jmason.org>; Sun, 18 Feb 2007 08:58:20 +0000 (GMT)
  Message-ID: <000001c7533a$b1d3ba00$0100007f@localhost>
  From: "Kevin Morris" <tyokaluassa.com@ultradian.com>
  To: <xxxxxxxx@jmason.org>
  Subject: Need S0ftware?
  Date: Sun, 18 Feb 2007 03:57:56 -0500

OEM software - throw packing case, leave CD, use electronic manuals. Pay for software only and save 75-90%!

Discounts! Special offers! Software for home and office! TOP 1O ITEMS.

$79 Microsoft Windows Vista Ultimate
$79 MS Office Enterprise 2007
$79 Adobe Acrobat 8 Pro
$49 Windows <b>XP Pro</b> w/SP2
$99 Macromedia Studio 8
$59 Adobe Premiere 2.0
$59 Corel Grafix Suite X3
$59 Adobe Illustrator CS2

$129 Autodesk Autocad 2007 $149 Adobe Creative Suite 2 http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t0

        Mac Specials:

Adobe Acrobat PR0 7 $69 Adobe After Effects $49 Adobe Creative Suite 2 Premium $149 Ableton Live 5.0.1 $49 Adobe Photoshop CS $49 http://ot.rezinkaoem.com/-software-for-mac-.php?0B85330BA896A9992D0561E08037493852CE 6E1FAE&t6

See more by this manufacturers: Microsoft...Mac...Adobe...Borland...Macromedia http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t4

Microsoft Windows Vista Ultimate Retail price: $399.00 Proposition: $79.95 Your benefit: $319.05 (80%) Availability: Can be downloaded INSTANTLY. http://ot.rezinkaoem.com/2480.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t3 Best choice for home and professional. (37268 reviews)

Microsoft Office 2007 Enterprise Edition Regular price: $899.00 Our offer: $79.95 You save: $819.95 (89%) Availability: Pay and download instantly. http://ot.rezinkaoem.com/2442.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t1 Sales Rank: #1 (121329 reviews)

Adobe Acrobat 8.0 Professional Market price: $449.00 We propose: $79.95 Your profit: $369.05 (80%) Availability: Available for INSTANT download. http://ot.rezinkaoem.com/2441.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t2 Top-ranked item. (31949 reviews)

Further work

Things that would be nice:

  • It’d be nice to extend this to support /.*/ and /.{0,10}/ — matching “anys”, also known as “gapped alignment” searches in bioinformatics, using algorithms like the Smith-Waterman or Needleman-Wunsch algorithms. (Update: this has been implemented.)
  • A way to detect and reverse-engineer templates, e.g. “this is foo”, “this is bar”, “this is baz” => “this is (foo|bar|baz)”, would be great.
  • Finally, heuristics to detect and discard likely-poor patterns are probably the biggest wishlist item.

Tuits are the problem, of course, since $dayjob is the one that pays the bills, not this work. :(

The code is being developed here, in SpamAssassin SVN. Feel free to comment/mail if you’re interested, have improvement ideas, or want more info on how to use it… I’d love to see more people trying it out!

Some credit: I should note that IBM’s Chung-Kwei system, presented at CEAS 2004, was the first time I’d heard of a pattern-discovery algorithm (namely, their proprietary Teiresias algorithm) being applied to spam.

Tags: , , , , , , ,

Comments (9)

How to deal with joe-jobs and massive bounce storms

As I’ve noted before, we still have a major problem with sites generating bounce/backscatter storms in response to forged mail — whether deliberately targeted, as a “Joe-Job”, or as a side-effect of attempts to evade over-simplistic sender address verification as seen in spam, viruses, and so on.

Sites sending these bounces have a broken mail configuration, but there are thousands remaining out there — it’s very hard to fix an old mail setup to avoid this issue. As a result, even if your mail server is set up correctly and can handle the incoming spam load just fine, a single spam run sent to other people can amplify the volume of response bounces in a Smurf-attack-style volume multiplication, acting as a denial of service. I’ve regularly had serious load problems and backlogs on my MX, due solely to these bounces.

However, I think I’ve now solved it, with only a little loss of functionality. Here’s how I did it, using Postfix and SpamAssassin.

(UPDATE: if you use the algorithm described below, you’ll block mail from people using Sender Address Verification! Use this updated version instead.)

Firstly, note that if you adopt this, you will lose functionality. Third party sites will not be able to generate bounces which are sent back to senders via your MX — except during the SMTP transaction.

However, if a message delivery attempt is run from your MX, and it is bounced by the host during that SMTP transaction, this bounce message will still be preserved. This is good, since this is basically the only bounce scenario that can be recommended, or expected to work, in modern SMTP.

Also, a small subset of third-party bounce messages will still get past, and be delivered — the ones that are not in the RFC-3464 bounce format generated by modern MTAs, but that include your outbound relays in the quoted header. The idea here is that “good bounces”, such as messages from mailing lists warning that your mails were moderated, will still be safe.

OK, the details:

In Postfix

Ideally, we could do this entirely outside Postfix — but in my experience, the volume (amplified by the Smurf attack effects) is such that these need to be rejected as soon as possible, during the SMTP transaction.

Update: I’ve now changed this technique: see this blog post for the current details, and skip this section entirely!

(If you’re curious, though, here’s what I used to recommend:)

In my Postfix configuration, on the machine that acts as MX for my domains – edit ‘/etc/postfix/header_checks’, and add these lines:
/^Return-Path: <>/                              REJECT no third-party DSNs
/^From:.*MAILER-DAEMON/                         REJECT no third-party DSNs
Edit ‘/etc/postfix/null_sender’, and add:
<>              550 no third-party DSNs
Edit ‘/etc/postfix/main.cf’, and ensure it contains these lines:
header_checks = regexp:/etc/postfix/header_checks
smtpd_sender_restrictions = check_sender_access hash:/etc/postfix/null_sender
(If you already have an ’smtpd_sender_restrictions’ line, just add ‘check_sender_access hash:/etc/postfix/null_sender’ to the end.) Finally, run:
sudo postmap /etc/postfix/null_sender
sudo /etc/init.d/postfix restart
This catches most of the bounces — RFC-3464-format Delivery-Status-Notification messages from other mail servers.

In SpamAssassin

Install the Virus-bounce ruleset. This will catch challenge-response mails, “out of office” noise, “virus scanner detected blah” crap, and bounce mails generated by really broken groupware MTAs — the stuff that gets past the Postfix front-line.

Once you’ve done these two things, that deals with almost all the forged-bounce load, at what I think is a reasonable cost. Comments welcome…

Tags: , , , , , , , , ,

Comments (15)

Our first physical award

Comments (11)

SpamAssassin as an EC2 service

I had a bit of an epiphany while chatting to Antoin about the qpsmtpd/EC2 idea. Craig had the same thoughts.

Here’s the thing — there’s actually no need to offload the SMTP part at all. That stuff is tricky, since you’ve got to build in a lot of fault tolerance, quality-of-service, uptime, etc. to ensure that the MX really is reachable. Since an EC2 instance will lose its “disks” once rebooted/shut down, you need to store your queues in Amazon S3 — which has differing filesystem semantics from good old POSIX — so things get quite a bit hairier. On top of that, it requires a little RFC-breakage; there are issues with using CNAMEs in MX records, reportedly.

However, if we offload just the spamd part, it becomes a whole lot simpler. The SPAMD protocol will work fine across long distances, securely, with SSL encryption active, and SpamAssassin will work fine as a filtering system in an entirely stateless mode, with no persistent-across-reboots storage. (What about the persistent-storage aspects of spamd operation? There’s just the auto-whitelist, which can be easily ignored, and I haven’t trained a Bayes database in 2 years, so I doubt I’ll need that either ;)

If the spamd server is down or uncontactable, spamc will handle this and retry with another server, or eventually give up and pass the message through, safely intact (though unscanned).

Given that there’s a cool third-party ClamAV plugin now available for SpamAssassin, this system can offload the virus-scanning work, too.

So here’s the new plan: run the MTA, MX, and the super-lean “spamc” client on the normal MX machine — and offload the “spamd” work to one or more EC2 machines.

Basically, there would be a CNAME record in DNS, listing the dynamic DNS names of the EC2 spamd instances. Then, spamc is set to point at that CNAME as the spamd host to use. As EC2 instances are started/removed, they are added/removed from that CNAME list and spamc will automatically keep up.

Pricing is reasonably affordable — don’t send over-large messages to the EC2 spamd; rate-limit total incoming SMTP traffic in the MTA; and use the SPAMD protocol’s REPORT verb to reduce the bandwidth consumption of mails in transit by ensuring that the mail messages are only transmitted one-way, MX-to-EC2, instead of both MX-to-EC2 and EC2-to-MX. That will keep the bandwidth pricing down.

Recent figures indicate that I got about 90MB of mail per day, at peak, over the past weekend (which nearly DOS’d my server and caused some firefighting) – 68MB of spam, and 13MB of blowback. At 20 cents per GB, that’s 1.8 cents per day for traffic. Plus the $0.10 per instance hour, that’s $2.42 per day to run a single EC2 instance to handle DDOS spikes. Of course, that can be shut down when load is low.

Yep, this is looking very promising. Now when are Amazon going to let me onto the beta program for EC2?…

Tags: , , , , ,

Comments (13)

Bleadperl regexp optimization vs SA

I’ve been looking some more into recent new features added to bleadperl by demerphq, such as Aho-Corasick trie matching, and how we can effectively support this in SpamAssassin. Here’s the state of play.

These are the “base strings” extracted from the SpamAssassin SVN trunk body ruleset (ignore the odd mangled UTF-8 char in here, it’s suffering from cut-and-paste breakage). A “base string” is a simplified subset of the regular expression; specifically, these are the cases where the “base strings” of the rule are simpler than the full perl regular expression language, and therefore amenable to fast parallel string matching algorithms.

The base strings appear in that file as “r” lines, like so:

r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE
r I drive a:__DOS_I_DRIVE_A
r I might be c:__DOS_COMING_TO_YOUR_PLACE
r I might c:__DOS_COMING_TO_YOUR_PLACE

The base string is the part after “r” and before the “:”; after that, the rule names appear.

Now, here are some limitations that make this less easy:

  • One string to many rules: each one of those strings corresponds to one or more SpamAssassin rules.

  • One rule to many strings: each rule may correspond to one or more of those strings. So it’s not a one-to-one correspondence either way.

  • No anchors: the strings may match anywhere inside the line, similar to ("foo bar baz" =~ /bar/).

  • Multiple rules can fire on the same line: each line can cause multiple rules to fire on different parts of its text.

  • Subsumption is not permitted: the base-string extractor plugin has already established cases where subsumption takes place. Each string will not subsume another string; so a match of the string “food” against the strings “food” and “foo” should just fire on “food”, not on “foo”.

  • Overlapping is permitted: on the other hand, overlapping is fine; “foobar” matched against “foo” and “oobar” should fire on both base strings. (The above two are basically for re2c compatibility. This is the main reason the strings are so simple, with no RE metachars — so that this is possible, since re2c is limited in this way.)

  • Most rules are more complex: most of the ruleset — as you can see from the ‘orig’ lines in that file — are more complex than the base string alone. So this means that a base string match often needs to be followed by a “verification” match using the full regexp.

Now, the problem is to iterate through each line of the (base64-decoded, encoding-decoded, HTML-decoded, whitespace-simplified) “body text” of a mail message, with each paragraph appearing as a single “line”, and run all those base strings in parallel, identifying the rule names that then need to be run.

This is turning out to be quite tricky with the bleadperl trie code.

For example, if we have 3 base strings, as follows:

  hello:RULE_HELLO
  hi:RULE_HI
  foo:RULE_FOO

At first, it appears that we could use the pattern itself as a key into a lookup table to determine the pattern that fired:

  %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']
  );

if ($line =~ m{(hello|hi|foo)}) { $rule_fired = $base_to_rulename_lookup{$1}; }

However, that will fail in the face of the string “hi foo!”, since only one of the bases will be returned as $1, whereas we want to know about both “RULE_HI” and “RULE_FOO”.

m//gc might help:

  %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']
  );

while ($line =~ m{(hello|hi|foo)}gc) { $rule_fired = $base_to_rulename_lookup{$1}; }

That works pretty well, but not if two patterns overlap: /abc/ and /bcd/, matching on the string “abcd”, for example, will fire only on “abc”, and miss the “bcd” hit.

Given this, it appears the only option is to run the trie match, and then iterate on all the regexps for the base strings it contains:

  if ($line =~ m{hello|hi|foo}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    $line =~ /foo/ and rule_fired("FOO");
  }

Obviously, that doesn’t provide much of a speedup — in fact, so far, I’ve been unable to get any at all out of this method. :(

This can be optimized a little by breaking into multiple trie/match sets:

  if ($line =~ m{hello|hi}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    ...
  }
  if ($line =~ m{foo|bar}) {
    $line =~ /foo/ and rule_fired("FOO");
    $line =~ /bar/ and rule_fired("BAR");
    ...
  }

But still, the reduction in regexp OPs vs the addition of logic OPs to do this, result in an overall slowdown, even given the faster trie-based REs.

Suggestions, anyone?

(by the way, if you’re curious, the current code is here in SVN.)

Tags: , , , , , ,

Comments (18)

PhishTank now supported by SpamAssassin

Thanks to Jeff Chan of the SURBL project, data from PhishTank is now being included in the SURBL ‘ph’ anti-phishing list.

This means it’s now supported by all existing versions of SpamAssassin from 3.0.0 onwards. Good news, and thanks to Jeff and the OpenDNS guys!

Tags: , , , , ,

Comments

Some p0f Data From Craig

Regarding the use of p0f, passive OS fingerprinting, as an anti-spam measure — on top of this analysis which I linked to a few weeks back, one of the emeritus SA guys, Craig Hughes, sends over some p0f experiences. Handily, this includes a more detailed breakdown by OS release:

I’ve been using the SA p0f plugin for nearly a month or so now both on gumstix’s web server and my hughes-family.org server, and it actually looks like it could be pretty useful. So far I’ve just been scoring 0.001 for each OS to collect data, but here’s the results amavis has logged:

This breakdown shows what %age of the stuff coming in via OS xyz is spam or ham. ie 84.6% of all mail received from Windows-2000 is spam, 14.9% is ham (the rest is viruses). The first numeric column is number of messages of each type. Statistics are only since the last time amavis restarted:

On his home machine (comcast cable modem connection) :

spam.byOS.Windows-2000438 1/h 84.6 %
spam.byOS.Linux417 1/h 18.3 %
spam.byOS.Windows-XP265 1/h 97.8 %
spam.byOS.UNKNOWN135 0/h 55.1 %
spam.byOS.Windows-XP/200024 0/h 100.0 %
spam.byOS.Novell5 0/h 100.0 %
spam.byOS.Windows-983 0/h 60.0 %
spam.byOS.Windows-20032 0/h 66.7 %
spam.byOS.FreeBSD2 0/h 1.3 %
spam.byOS.Solaris1 0/h 1.8 %
spam.byOS.Windows-SP31 0/h 100.0 %
ham.byOS.Linux1851 6/h 81.2 %
ham.byOS.FreeBSD143 0/h 96.0 %
ham.byOS.UNKNOWN102 0/h 41.6 %
ham.byOS.Windows-200077 0/h 14.9 %
ham.byOS.Solaris56 0/h 98.2 %
ham.byOS.NetCache6 0/h 100.0 %
ham.byOS.Windows-XP6 0/h 2.2 %
ham.byOS.Tru642 0/h 100.0 %
ham.byOS.AIX2 0/h 100.0 %
ham.byOS.Windows-982 0/h 40.0 %
ham.byOS.Windows-20031 0/h 33.3 %

On gumstix.com (hosted at some provider in Texas):

spam.byOS.Windows-2000 401 1/h 58.4 %
spam.byOS.Windows-XP 131 0/h 92.9 %
spam.byOS.UNKNOWN 64 0/h 18.7 %
spam.byOS.Windows-XP/2000 29 0/h 96.7 %
spam.byOS.FreeBSD 11 0/h 4.1 %
spam.byOS.Linux 11 0/h 0.5 %
spam.byOS.Windows-98 6 0/h 85.7 %
spam.byOS.Solaris 4 0/h 3.3 %
spam.byOS.Windows-SP3 2 0/h 100.0 %
ham.byOS.Linux 1983 4/h 97.6 %
ham.byOS.UNKNOWN 277 0/h 80.8 %
ham.byOS.Windows-2000 271 0/h 39.4 %
ham.byOS.FreeBSD 253 0/h 93.7 %
ham.byOS.Solaris 116 0/h 96.7 %
ham.byOS.NetCache 40 0/h 100.0 %
ham.byOS.Windows-XP 9 0/h 6.4 %
ham.byOS.Windows-NT 7 0/h 70.0 %
ham.byOS.Novell 3 0/h 100.0 %
ham.byOS.Windows-XP/2000 1 0/h 3.3 %
ham.byOS.Windows-98 1 0/h 14.3 %
ham.byOS.Windows-2003 1 0/h 100.0 %

my home machine has a lot more relayed mail coming to it (all my various craig@* email addresses forward into there) which is probably why the linux spam rate is higher there — the relaying machines are probably running linux and forwarding spam through.

Interesting figures — but I’m still not-convinced that the correlation is quite high enough to form a good enough basis for solid anti-spam rules; reliable rules in the SpamAssassin core typically have over 95% accuracy at differentiating ham from spam (at least when we first check them in).

Update: it’s a natural for use as a Bayes token, though. The way amavisd-new implements p0f support is perfect for this use.

BTW, my guess is that many of the spam hits for “linux” are due to things like Netgear/Linksys routers, running embedded linuces. No evidence, just guessing ;)

Tags: , , , ,

Comments

Linus on Bayesian filtering

Linus Torvalds, in a post to linux-kernel today:

I’m sorry, but spam-filtering is simply harder than the bayesian word-count weenies think it is. I even used to know something about bayesian filtering, since it was one of the projects I worked on at uni, and dammit, it’s not a good approach, as shown by the fact that it’s trivial to get around.

I don’t know why people got so excited about the whole bayesian thing. It’s fine as one small clause in a bigger framework of deciding spam, but it’s totally inappropriate for a “yes/no” kind of decision on its own.

If you want a yes/no kind of thing, do it on real hard issues, like not accepting email from machines that aren’t registered MX gateways. Sure, that will mean that people who just set up their local sendmail thing and connect directly to port 25 will just not be able to email, but let’s face it, that’s why we have ISP’s and DNS in the first place.

But don’t do it purely on some bogus word analysis.

If you want to do word analysis, use it like SpamAssassin does it – with some Bayesian rule perhaps adding a few points to the score. That’s entirely appropriate. But running bogo-filter instead of spamassassin is just asinine.

Me, I like bogofilter — those guys are cool, and it’s a great anti-spam product for many purposes. But of course I have to agree with Linus that the correct approach in most cases is a bigger picture than just Bayes alone, a la SpamAssassin ;)

Tags: , , ,

Comments (4)

More parallel string-match algorithm hacking: re2xs

Last week, Matt Sergeant released a great little perl script, re2xs, which takes a set of simplified regexps, converts them to the subset of regular expression language supported by re2c, then uses that to build an XS module.

In other words, it offers the chance for SpamAssassin rules to be compiled into a trie structure in C code to match multiple patterns in parallel. Given that this is then compiled down to native machine code, it has the potential to be the fastest method possible, apart from using dedicated hardware co-processors.

Sure enough, Matt’s results were pretty good — he says, ‘I managed to match 10k regexps against 10k strings in 0.3s with it, which I think is fairly good.’ ;)

Unfortunately, turning this into something that works with SpamAssassin hasn’t been quite so easy. SpamAssassin rules are free to use the full perl regular expression language — and this language supports many features that re2c’s subset does not. So we need to extract/translate the rule regexps to simplified subsets. This has generally been the case with all parallel matching systems, anyway, so that’s not a massive problem.

More problematically, re2c itself does not support nested patterns — if one token is contained within another, e.g. “FOO” within “FOOD”, then the subsumed token will not be listed as a match. SpamAssassin rules, of course, are free to overlap or subsume each other, so an automated way to detect this is required.

For simple text patterns, this is easy enough to do using substring matching – e.g. “FOOD” =~ /\QFOO\E/ . Unfortunately, once any kind of sophisticated regexp functionality is available, this is no longer the case: consider /FOO*OD/ vs /FOO/ , /F[A-Z]OD/ vs /FO[M-P]/ , /F(?:OO|U)D/ vs /F(?:O|UU)?O/ .

The only way to do this is to either (a) fully parse the regexp, build the trie, and basically reimplement most of re2c to do this in advance; or (b) change the trie-generation code in re2c to support states returning multiple patterns, as Aho-Corasick does.

I requested support for this in re2c, but got a brush-off, unfortunately. So work continues…

In other news, that food poisoning thing I had back at the end of June has lingered on. It’s now pretty clear that it isn’t food poisoning or a stomach bug… but I still have no idea what it actually is. No fun :(

Tags: , , , , , , ,

Comments (5)

SpamAssassin advisory CVE-2006-2447

CVE 2006-2447, in which Radoslaw Zielinski spotted a nasty in spamd’s ‘vpopmail’ support in pretty much all recent versions of Apache SpamAssassin.

If you use spamd with vpopmail, go read the advisory and determine if you need to take action. Not many people will need to, I think; it’s a very rare setup. Still, it’s important to get the warning out there anyway.

The irony is that the bug is triggered partly by the “–paranoid” switch. This was intended to increase security, by increasing paranoia when possibly-unsafe situations arose — hence providing a great demonstration of how the addition of optional code paths, even in the best intentions, can reduce security by allowing bugs to creep in unnoticed.

Tags: , , , ,

Comments

SpamAssassin in the Google Summer of Code 2006

Are you a student, and interested in earning $4,500 for contributing to open source, and fighting spam, over the course of the summer?

If so, get thee hence to the Google Summer of Code 2006 site, and propose a project!

Last year, we in SpamAssassin didn’t get it together to mentor SoC projects. This year, however, we have a few prospective mentors (including myself), and a few sample project ideas lined up; we’re all ready to go! Here’s the Student FAQ. Be quick; applications end in a week and a bit.

Here’s hoping we get some interesting submissions ;)

Tags: , , , ,

Comments

Phishing and Inept Banks

John-Graham Cumming asks, ‘Are Citibank crazy?’:

I blogged a while ago about Thunderbird’s phishing filter trapping a seemingly innnocent mail. Now, a reader has forwarded to me a genuine email from Citibank that he says was trapped by Thunderbird. I’m not going to reproduce the email here because it contains private details of the user, but it is a valid Citibank message.

Thunderbird thinks it’s a scam because Citibank uses one of the oldest phishing tricks in the book. The have a URL displayed in the message then when clicked goes to a totally different URL.

Sadly, this has proven to be really quite common. We’ve investigated using this rule as a worthwhile phish-detection rule in SpamAssassin, several times, and without much luck. In fact, we’ve had to create a FAQ entry for it — since it’s such a superficially-attractive but ultimately useless, idea, many people have had long discussions on our lists about it!

The companies that produce these false positives in their mails include American Express, Bed Bath & Beyond, Universal Studios, Microsoft, Hilton Hotels — and now Citibank.

A couple of other examples from real mails:

  <a href="http://www65.americanexpress.com/clicktrk/Tracking?
    mid=MESSAGEID&msrc=ENG-ALERTS&url=
    https://www.americanexpress.com/estatement/?12345">
    https://www.americanexpress.com/estatement/?12345</a>

<A HREF="http://echo.epsilon.com/WebServices/EchoEngine/T.aspx?l=ID"> https://www.hilton.com/en/ww/email/tab_email_subscriptions.jhtml</A>

By the way, it really is quite impressive for a bank as heavily phished as Citibank to still be making this kind of basic mistake in their mail-outs! It reinforces a point I made in a mailing list posting recently:

As far as I can see, the approach taken by pretty much all banks to their online services is simply too bureaucratic, hide-bound, and fundamentally driven by their marketing departments, to ever cope effectively with phishing. :(

(For what it’s worth, I know Citi have some smart techies working there; but the rest of the company needs to start paying attention to them.)

Tags: , , , , , ,

Comments (5)

Disclosure

As of yesterday, I have a new day-job.

I won’t be working on email spam as part of the job, which is an interesting turn of events. However, I’ll be sticking with the open-source Apache SpamAssassin project, and keeping up the rate of work on that [*].

I’m not sure how much I can blog about the new place just yet, but I will say it’s certainly looking like it’ll be very interesting work ;)

[*: modulo the next couple of weeks while I'm waiting for my bloody DSL to be installed. argh!]

Tags: , , ,

Comments (2)

We Win

ongoing: The ASF Server:

Tim Bray: Which Apache project burns the most resources?

Mads: Spamassassin by a wide margin. [...]

Heh, we win ;)

Helios, the Zones server, has been an incredible resource for us. SpamAssassin isn’t a traditional open-source software project in one respect: we use a lot of centralized “phone home” infrastructure to support rule and score generation. Having a virtualized server of this quality and horsepower to use for this has been fantastic.

(thanks to John O’Shea for the pointer!)

Tags: , , , , , , ,

Comments (1)

DataMation Anti-Spam Product of the Year!

Hooray!

SpamAssassin has been voted DataMation Anti-Spam Product of the Year for 2006, earning three times as many votes as the next contender.

This is the second year in a row, which is fantastic — and our margin is increasing each year. ;)

Tags: , , , , , , , , ,

Comments (1)

My ApacheCon Roundup

Back from ApacheCon!

I’ve got to say, I found it really useful this year. Last year, I was pretty new to the ASF, and found that my expectations of ApacheCon didn’t quite match reality; it wasn’t a rip-roaring success exactly, for me, as a result.

However, many details of how the ASF works — and how the conference itself works and is organised — are much clearer after you’ve spent some time lurking and absorbing practices in the meantime. (The visibility one gets into the process as a member of the ASF makes this a lot easier.)

Result: it was much more of a success for me this time around. Plenty of networking, putting faces to the names, hanging out, and discussing many aspects of our work.

The hackathon really worked out, too; while we didn’t produce a hell of a lot of code per se, it made for a good ‘developer summit’ and I think we established solid agreement on SpamAssassin’s short-term directions and goals. (summary: rules, and faster).

On top of that, I got to meet up with Colm MacCarthaigh and Cory Doctorow for discussion of Digital Rights Ireland. Looks like I’ll be spending a bit of time on that next year ;)

Finally: Solaris. On Monday night, I got to sit down with Daniel Price, one of the kernel engineers behind Solaris Zones, work through a quick demo of a bug I was running into with chroot(2) and zones on our rule-QA buildbot server, and watch as he visually traced it through the OpenSolaris kernel source on the web. From this — and from talking to Daniel — it’s pretty clear that things have changed at Sun. Pretty much the entire Solaris operating system is now a full-on open-source project; it’s not just a marketing gimmick. The source is up there on the web, that’s the source for the code they’re running now, and there’s no half-assed ‘freeze it, cut out the good bits, and throw it over the wall’ fake-open-source tricks.

The concept of getting this level of access to Solaris source code and engineers, would have blown my mind when I was Iona’s sysadmin back in the 1990s ;) I’m very impressed.

Tags: , , , , , , , , , , ,

Comments

ApacheCon US 2005

In a couple of weeks, I’ll be going to San Diego for ApacheCon US 2005 (including the hackathon beforehand). There’ll be quite a few other SpamAssassin committers there, too, so if you’re working with SA, or interested in getting some face time with the developers, there’s no better way of doing so.

Tags: , , , ,

Comments

New SpamAssassin Rule Development Tools

Recently, I’ve been working on new systems to develop SpamAssassin rules faster, and with a lower ‘barrier to entry’ to the core ruleset. Some highlights seem bloggable, seeing as it’s all web-based and I can link to it!

The ‘preflight’ BuildBot:

This uses the fantastic BuildBot continuous-integration system to monitor changes to our Subversion repository.

Every time something is checked into SVN, this wakes up and immediately runs mass-checks using that latest code and rules, allowing near-real-time viewing of changes in rule behaviour. (A ‘mass-check’ is a massive run of SpamAssassin across a corpus of hundreds of thousands of emails, en masse, to measure rule hit-rates.)

The corpus it mass-checks is split in a certain way so that results will be available very quickly — typically in under 10 minutes — with increasing quantities of results becoming available as time elapses.

Progress of the mass-checks are visible at the BuildBot here; as they complete, their results become visible on the Rule-QA app (below). (More info, if you’re curious.)

The Rule-QA App:

To date, we’ve used the basic “freqs” table — output from the hit-frequencies command-line script — as the UI for rule QA and evaluation. This is fine for a small number of developers, but it scales badly and (like mass-checks) requires a pretty complex setup on the developer’s machine.

This new component is a web application, which takes the “freqs” table, and “webifies” it — demo.

Some major improvements are also made possible; the most important, that it can now display ‘freqs’ for multiple revisions during the day, and keeps historical data for comparison. It adds several new reports from ‘hit-frequencies’; a score-map, overlaps, a performance measurement, and a boolean ‘promoteability’ measurement.

Finally, a really useful new report is the graph of rule hit-rate, as it changes over time. Here’s a cached demo, or see the same data produced ‘live’. This gives a totally new insight into how the rule hits for various people’s corpora, how that changed over time, and allows a whole new type of rule analysis. (In fact, it also allows pretty good corpus analysis, too; can you tell which submitters bounce high-scoring spam at receipt time?)

(More info on these.)

Tags: , , , , , ,

Comments

SpamAssassin 3.1.0

‘ANNOUNCE: SpamAssassin 3.1.0 available!’ – MARC

Phew. That took a while, but it’s worth it ;)

Tags: , ,

Comments (1)

DnsblAccuracy082005 – Spamassassin Wiki

Do you use anti-spam DNS blocklists? If so, you should probably go take a look at DnsblAccuracy082005 on the SpamAssassin wiki; I’ve collated the results from our recent mass-check rescoring runs for 3.1.0, to produce have up-to-date measurements of the accuracy and hit-rates for most of the big DNS blocklists.

A few highlights:

We don’t have accurate figures for the new URIBL.COM lists, btw — only the rulesets that are distributed with SpamAssassin were measured.

Tags: , , ,

Comments (1)

The Life of a SpamAssassin Rule

Spam: during a recent discussion on the SpamAssassin dev list, the question came up as to how long a rule could expect to maintain its effectiveness once it was public — the rule secrecy issue.

In order to make a point — that certain types of very successful rules can indeed last a long time — I picked out one rule, MIME_BOUND_DD_DIGITS. Here’s a smartened-up copy of what I found out.

This rule matches a certain format of MIME boundary, one observed in 17.4637% of our spam collection and with 0 nonspam hits. Since we have a massive collection of mails, received between Jan 2004 to May 2005, and a rule with a known history, we can then graph its effectiveness over time.

The rule’s history was:

  • bug 3396: the initial contribution from Bob Menschel, May 15 2004
  • r10692: arrived in SVN: May 16 2004
  • r20178: promoted to ‘MIME_BOUND_DD_DIGITS’: May 20 2004 (funnily enough, with a note speculating about its lifetime from felicity!)
  • released in the SpamAssassin 3.0.0 release: mid-Sep 2004

So, we would expect to see a drop in its effectiveness against spam in late May 2004 and onwards, if the spammers were reacting to SVN changes; or post September 2004, if they react to what’s released.

By graphing the number of hits on mails within each 2-hour window, we can get a good idea of its effectiveness over time:

The red bars are total spam mails in each time period; green bars, the number of spam mails that hit the rule in each period. May 15 2004 and Sep 20 2004 are marked; Jan 2004 is at the left, and May 2005 is at the right-most extreme of the graph. (There’s a massive spike in spam volume at the right — I think this is Sober.Q output, which disappears after a week or so.)

It appears that the rule remains about even in effectiveness in the 4 months it’s in SVN, but unreleased; it declines a little more after it makes it into a SpamAssassin release. However, it trails off very slowly — even in May 2005, it’s still hitting a good portion of spam.

Given this, I suspect that most spammers are not changing structural aspects of their spam in response to SpamAssassin with any particular alacrity, or at least are not capable of doing so.

To speculate on the latter, I think many spammers are using pirated copies of the spamware apps, so cannot get their hands on updated versions through ‘legitimate’ channels.

Speculating on the former — in my opinion there’s a very good chance that SpamAssassin just isn’t a particular big target for them to evade, compared to the juicy pool of gullible targets behind AOL’s filters, for example. ;)

Tags: , , , , , , , , ,

Comments (3)

DCC no longer open source

Patents: DCC (Distributed Checksum Clearinghouse) is a venerable, and widely-used anti-spam system created by Vernon Schryver; we’ve supported it in SpamAssassin for yonks.

It now appears that DCC is now no longer open source software; it’s still free for personal and noncommercial use, but this clause has been added to the new license text:

This agreement is not applicable to any entity which sells anti-spam solutions to others or provides an anti-spam solution as part of a security solution sold to other entities, or to a private network which employes DCC or uses data provided by operation of DCC but does not provide corresponding data to other users.

So there’s talk that those commercial users should now license it – interestingly, from another company called Commtouch, not Vernon’s Rhyolite Software. (More info).

It appears that the license change is part of an agreement with Commtouch, owner of US Patent 6,330,590, a patent on the idea of hash-sharing antispam techniques. (I haven’t read the patent due to ASF and other policies so I can’t tell you what it really covers.)

It looks like we’ll be disabling DCC’s use in SpamAssassin by default, as we did with Razor, as a result. (Our policy is that the default ruleset used in SpamAssassin be usable by anyone who can use our software, so that the normal usage is open source by default, rather than subsets of the overall functionality.)

Tags: , , , , , , , , , ,

Comments

Back in the US, and Daniel’s interview

Misc: So I was travelling last week — a very productive trip to the UK visiting the main work dev office, and getting a little socialising in too while I was at it. A pretty good trip overall, especially since I seem to have figured out how to use my frequent flyer miles effectively to get great seats! ;)

Here’s a good interview with SpamAssassin PMC chair, Daniel; well worth a read if you want to see what we in SpamAssassin think about the state of the onion in spam-filtering.

In not-so-good news, it seems Charlie McCreevy has managed to push the software patent directive through, despite massive EU Parliament unhappiness. Third time around at the Fisheries meeting, naturally; and there’s some serious questions about the legitimacy of the procedural rules invoked by the Commission in refusing to take the directive off the A-item menu. Now that’s what I call democracy…

It can still be defeated, but it’s an uphill battle now — for it to be thrown out in the second reading at the European Parliament, it’ll need a two-thirds majority of all MEPs (not just the MEPs present), reportedly.

In the meantime, thanks to the FF and PDs’ bullying tactics, Ireland’s small but growing pool of homegrown software developers are being ignored, and the Irish software industry looks more like a lame import operation for the likes of Microsoft. Our reputation is dragged through the mud for a few multinationals, and the rest of Europe resents us for it. Wonderful.

BTW, even if it does pass, there are ways to fix it — directives must be implemented into national law in each country. This means that Ireland could still write their implementation of the directive to exclude software inventions (even the ones where it’s supposedly a patent on hardware like ‘a CPU connected to a hard disk, with such-and-such software running on the CPU’). However, given McCreevy’s obvious bias in favour of getting this specific text into place, how likely is that going to be?

Tags: , , , , , , , , , ,

Comments

« Previous entries Next Page » Next Page »