RAII in perl

Suppose you have matching start() and end() functions. You want to ensure that each start() is always matched with its corresponding end(), without having to explicitly pepper your code with calls to that function. Here’s a good way to do it in perl — create a guard object:

package Scoper;
sub new {
  my $class = shift; bless({ func => shift },$class);
}
sub DESTROY {
  my $self = shift; $self->{func}->();
}

Here’s an example of its use:

{
  start();
  my $s = Scoper->new(sub { end(); });
  [... do something...]
}
[at this point, end() has been called, even if a die() occurred]

The idea is simply to use DESTROY to perform whatever the cleanup operation is. Once the $s object goes out of scope, its reference count drops to zero and perl frees it, calling $s->DESTROY() in the process. In other words, it’s using the GC for its own ends.

Unlike an eval { } block to catch die()s, this will even be called if exit() or POSIX::exit() is called. (POSIX::_exit(), however, skips DESTROY.)
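To see the guard fire on a die(), here is a minimal sketch. The start() and end() bodies and the @log array are stand-ins invented for the demonstration; only the Scoper class follows the code above.

```perl
use strict;
use warnings;

package Scoper;
sub new     { my ($class, $func) = @_; return bless { func => $func }, $class; }
sub DESTROY { my $self = shift; $self->{func}->(); }

package main;

my @log;                              # records the order of events
sub start { push @log, 'start' }      # hypothetical stand-ins for the real
sub end   { push @log, 'end' }        # start()/end() pair

sub risky {
    start();
    my $guard = Scoper->new(sub { end() });
    die "something went wrong\n";     # leave the scope abnormally
}

# The guard is destroyed while the stack unwinds, before eval returns.
eval { risky(); 1 } or push @log, 'caught';
print join(',', @log), "\n";          # start,end,caught
```

Note that end() runs before the 'caught' handler does, which is the ordering you want for cleanup.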

This is a pretty old C++ pattern — Resource Acquisition Is Initialization. C++’s auto_ptr template class is the best-known example in that language. Here’s a perl.com article on its use in perl, from last year, mostly regarding the CPAN module Object::Destroyer. To be honest, though, it’s 6 lines of code — not sure if that warrants a CPAN module! ;)

RAII is used in SpamAssassin, in the Mail::SpamAssassin::Util::ScopedTimer class.


converting TAP output to JUnit-style XML

Here’s a perl script that may prove useful: tap-to-junit-xml

NAME

tap-to-junit-xml - convert perl-style TAP test output to JUnit-style XML

SYNOPSIS

tap-to-junit-xml "test suite name" [ outputprefix ] < tap_output.log

DESCRIPTION

Parse test suite output in TAP (Test Anything Protocol) format, and produce XML output in a similar format to that produced by the <junit> ant task. This is useful for consumption by continuous-integration systems like Hudson.

Written in perl, requires TAP::Parser and XML::Generator. It’s based on junit_xml.pl by Matisse Enzer, although pretty much entirely rewritten.
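The core of the conversion can be sketched roughly as follows. This is not the tap-to-junit-xml script itself: it uses TAP::Parser as the script does, but builds the XML by hand instead of with XML::Generator, and the tap_to_junit() and xml_escape() names and the exact attributes emitted are assumptions for illustration.

```perl
use strict;
use warnings;
use TAP::Parser;

# Escape the characters that are special in XML text and attributes.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Turn a string of TAP output into a JUnit-style <testsuite> element.
sub tap_to_junit {
    my ($suite, $tap) = @_;
    my $parser = TAP::Parser->new({ tap => $tap });
    my ($tests, $failures, @cases) = (0, 0);
    while (my $result = $parser->next) {
        next unless $result->is_test;          # skip plans, comments, etc.
        $tests++;
        my $name = $result->description;
        $name = '' unless defined $name;
        $name =~ s/^-\s*//;                    # drop the conventional "- " prefix
        $name = 'test ' . $result->number if $name eq '';
        my $case = sprintf '  <testcase name="%s"', xml_escape($name);
        if ($result->is_ok) {
            $case .= "/>\n";
        }
        else {
            $failures++;
            $case .= "><failure message=\"not ok\"/></testcase>\n";
        }
        push @cases, $case;
    }
    return sprintf(qq{<testsuite name="%s" tests="%d" failures="%d">\n},
                   xml_escape($suite), $tests, $failures)
         . join('', @cases) . "</testsuite>\n";
}

print tap_to_junit('demo suite', "1..2\nok 1 - loads\nnot ok 2 - parses\n");
```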


SpamAssassin 3.2.0!

W00t! SpamAssassin 3.2.0 has finally gone gold!

This release is a big one — it’s the first major release since 3.1.0, back in September 2005, just over a year and a half ago. Here is the release announcement mail, containing a list of major changes since version 3.1.8. There are a few major new features that I feel worth picking out in more detail and editorialising about:

sa-compile

This is a biggie. This new script takes the active SpamAssassin ruleset, and uses code contributed by Matt Sergeant to produce input for re2c. re2c in turn compiles the ruleset into a deterministic finite automaton, which can match multiple regular expressions in parallel. That’s not all, though; re2c then compiles that DFA into C code — which is then compiled into native object code. SpamAssassin will then load that object code and use it to replace the slower perl regexp tests, if it’s available at scan-time.

Now, it’s been a long time since SpamAssassin’s ruleset consisted mainly of rudimentary regular expressions matched against the body text. A good portion of SpamAssassin’s ruleset these days operates against headers, performs network lookups, analyzes URLs extracted from the body, uses the more advanced features supported by Perl’s NFA regexp engine, and so on. But even given that, the effects of ‘sa-compile’ seem to average between a 15% and 25% speedup, in my testing. That’s good ;)

Many of the commercial versions of SpamAssassin include their own body-rule speedups — but this is the first time anything similar has made it into the open source code.

Short-circuiting

Another good one for performance. There are some rules that you can reasonably assume will only ever hit nonspam mail, or only ever hit spam mail, in a well-configured setup. For example, a hit on “ALL_TRUSTED” should mean that the message never traversed an untrusted network, therefore it cannot be spam, so why bother applying the expensive tests? It should be reasonable to “short-circuit” and immediately return a “ham” score for that mail.

This new plugin implements that algorithm — and efficiently, too, which historically has been the hard part!

I’ve been using this for a while with a ruleset like this one — in my experience, it’s cut overall CPU time spent scanning mail by 20%.

It is pretty flexible, too — there’s a lot of tweakage that can be done with this functionality to suit your own setup.
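The general idea can be sketched like this. It is an illustration only, not the plugin’s code: the rule table, the priority and shortcircuit fields, the scan() function and the score threshold of 2 are all invented for the example.

```perl
use strict;
use warnings;

# Hypothetical rules: cheap, decisive ones get a low priority number so
# that they run first.
my @rules = (
    { name => 'ALL_TRUSTED', priority => 0, shortcircuit => 'ham',
      test => sub { $_[0]{all_trusted} } },
    { name => 'EXPENSIVE_BODY_SCAN', priority => 10, score => 2.5,
      test => sub { $_[0]{body} =~ /viagra/i } },
);

sub scan {
    my ($msg) = @_;
    my ($score, @ran) = (0);
    for my $rule (sort { $a->{priority} <=> $b->{priority} } @rules) {
        push @ran, $rule->{name};
        next unless $rule->{test}->($msg);
        # A short-circuit hit decides the verdict; skip all remaining rules.
        return { verdict => $rule->{shortcircuit}, ran => \@ran }
            if $rule->{shortcircuit};
        $score += $rule->{score};
    }
    return { verdict => ($score >= 2 ? 'spam' : 'ham'), ran => \@ran };
}

my $r = scan({ all_trusted => 1, body => 'cheap viagra' });
print "$r->{verdict} after running: @{ $r->{ran} }\n";
```

The saving comes from the rules that never appear in the "ran" list for a short-circuited message.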

Reduced memory footprint

One aim of this release has been to reduce the memory usage of SpamAssassin; the core code now uses less RAM than 3.1.x does, when tested with the same ruleset. (Unfortunately we’ve added lots more rules in the interim, so it’s a bit of a wash overall. ;)

The VBounce anti-bounce ruleset

Detects spurious bounce messages sent by broken mail systems in response to spam or viruses. More info about that here.

Apache-spamd

apache-spamd implements spamd as a mod_perl module. This was contributed by Radoslaw Zielinski, as a Google Summer of Code project last year. Thanks Radoslaw!

There are plenty more new, useful features and rules — these are just the top ones, in my opinion. Pretty cool stuff!


Bleadperl regexp optimization vs SA

I’ve been looking some more into recent new features added to bleadperl by demerphq, such as Aho-Corasick trie matching, and how we can effectively support this in SpamAssassin. Here’s the state of play.

These are the “base strings” extracted from the SpamAssassin SVN trunk body ruleset (ignore the odd mangled UTF-8 char in here, it’s suffering from cut-and-paste breakage). A “base string” is a simplified subset of the regular expression; specifically, these are the cases where the “base strings” of the rule are simpler than the full perl regular expression language, and therefore amenable to fast parallel string matching algorithms.

The base strings appear in that file as “r” lines, like so:

r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE
r I drive a:__DOS_I_DRIVE_A
r I might be c:__DOS_COMING_TO_YOUR_PLACE
r I might c:__DOS_COMING_TO_YOUR_PLACE

The base string is the part after “r” and before the “:”; after that, the rule names appear.
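A small sketch of reading those lines into a lookup table follows. The parse_base_line() name is made up, and it assumes rule names consist only of word characters, which holds for the samples above.

```perl
use strict;
use warnings;

# Split one "r" line into its base string and list of rule names.
sub parse_base_line {
    my ($line) = @_;
    # The greedy (.+) backtracks to the last colon that is followed only by
    # rule names, so a colon inside the base string itself survives.
    my ($base, $names) = $line =~ /^r (.+):([\w ]+)$/ or return;
    return ($base, [ split ' ', $names ]);
}

my %rules_for;   # base string => array ref of rule names
for my $line (
    'r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE',
    'r I drive a:__DOS_I_DRIVE_A',
) {
    my ($base, $names) = parse_base_line($line) or next;
    $rules_for{$base} = $names;
}
print "$_ => @{ $rules_for{$_} }\n" for sort keys %rules_for;
```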

Now, here are some limitations that make this less easy:

  • One string to many rules: each one of those strings corresponds to one or more SpamAssassin rules.

  • One rule to many strings: each rule may correspond to one or more of those strings. So it’s not a one-to-one correspondence either way.

  • No anchors: the strings may match anywhere inside the line, similar to ("foo bar baz" =~ /bar/).

  • Multiple rules can fire on the same line: each line can cause multiple rules to fire on different parts of its text.

  • Subsumption is not permitted: the base-string extractor plugin has already established cases where subsumption takes place. Each string will not subsume another string; so a match of the string “food” against the strings “food” and “foo” should just fire on “food”, not on “foo”.

  • Overlapping is permitted: on the other hand, overlapping is fine; “foobar” matched against “foo” and “oobar” should fire on both base strings. (The above two are basically for re2c compatibility. This is the main reason the strings are so simple, with no RE metachars — so that this is possible, since re2c is limited in this way.)

  • Most rules are more complex: most of the ruleset — as you can see from the ‘orig’ lines in that file — are more complex than the base string alone. So this means that a base string match often needs to be followed by a “verification” match using the full regexp.
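To pin down the matching semantics in the list above (no anchors, overlaps allowed, subsumed hits suppressed), here is a slow reference matcher. It is only a specification sketch using index(), not a fast parallel algorithm, and fired_bases() is a made-up name.

```perl
use strict;
use warnings;

# Return the base strings that fire on $line. Every occurrence is located
# with index(), then any hit lying wholly inside a longer hit is dropped.
sub fired_bases {
    my ($line, @bases) = @_;
    my @hits;                                   # each hit: [ start, end, base ]
    for my $base (@bases) {
        next if $base eq '';                    # an empty base would loop forever
        my $pos = 0;
        while (($pos = index($line, $base, $pos)) >= 0) {
            push @hits, [ $pos, $pos + length($base), $base ];
            $pos++;                             # allow overlapping occurrences
        }
    }
    my %fired;
    for my $h (@hits) {
        my $subsumed = grep {
            $_->[0] <= $h->[0] && $_->[1] >= $h->[1]          # covers this hit
            && ($_->[1] - $_->[0]) > ($h->[1] - $h->[0])      # and is longer
        } @hits;
        $fired{ $h->[2] } = 1 unless $subsumed;
    }
    return sort keys %fired;
}

print join(',', fired_bases('food',   qw(foo food))), "\n";   # food
print join(',', fired_bases('foobar', qw(foo oobar))), "\n";  # foo,oobar
```

A base string found by the matcher would still need the “verification” match against the full regexp, as described in the last point.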

Now, the problem is to iterate through each line of the (base64-decoded, encoding-decoded, HTML-decoded, whitespace-simplified) “body text” of a mail message, with each paragraph appearing as a single “line”, and run all those base strings in parallel, identifying the rule names that then need to be run.

This is turning out to be quite tricky with the bleadperl trie code.

For example, if we have 3 base strings, as follows:

  hello:RULE_HELLO
  hi:RULE_HI
  foo:RULE_FOO

At first, it appears that we could use the pattern itself as a key into a lookup table to determine the pattern that fired:

  %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']
  );

  if ($line =~ m{(hello|hi|foo)}) {
    $rule_fired = $base_to_rulename_lookup{$1};
  }

However, that will fail in the face of the string “hi foo!”, since only one of the bases will be returned as $1, whereas we want to know about both “RULE_HI” and “RULE_FOO”.

m//gc might help:

  %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']
  );

  while ($line =~ m{(hello|hi|foo)}gc) {
    $rule_fired = $base_to_rulename_lookup{$1};
  }

That works pretty well, but not if two patterns overlap: /abc/ and /bcd/, matching on the string “abcd”, for example, will fire only on “abc”, and miss the “bcd” hit.
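The overlap problem is easy to reproduce. The second pattern below is a variant that is not part of the approach described here: a zero-width lookahead, which does report both hits. Whether it keeps any of the trie speedup is untested, so treat it as an experiment and not a fix.

```perl
use strict;
use warnings;

my $line = 'abcd';

# Plain m//g resumes after the end of each match, so "bcd" is never tried.
my @consuming = $line =~ /(abc|bcd)/g;

# Variant: a zero-width lookahead consumes nothing, so the engine retries
# at every offset and sees both strings.
my @lookahead = $line =~ /(?=(abc|bcd))/g;

print "consuming: @consuming\n";   # consuming: abc
print "lookahead: @lookahead\n";   # lookahead: abc bcd
```

Even the lookahead form reports only one alternative per starting offset, so two base strings that begin at the same position would still need separate handling.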

Given this, it appears the only option is to run the trie match, and then iterate on all the regexps for the base strings it contains:

  if ($line =~ m{hello|hi|foo}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    $line =~ /foo/ and rule_fired("FOO");
  }

Obviously, that doesn’t provide much of a speedup — in fact, so far, I’ve been unable to get any at all out of this method. :(

This can be optimized a little by breaking into multiple trie/match sets:

  if ($line =~ m{hello|hi}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    ...
  }
  if ($line =~ m{foo|bar}) {
    $line =~ /foo/ and rule_fired("FOO");
    $line =~ /bar/ and rule_fired("BAR");
    ...
  }

But still, the reduction in regexp OPs is outweighed by the addition of the logic OPs needed to do this, resulting in an overall slowdown, even given the faster trie-based REs.

Suggestions, anyone?

(by the way, if you’re curious, the current code is here in SVN.)


More parallel string-match algorithm hacking: re2xs

Last week, Matt Sergeant released a great little perl script, re2xs, which takes a set of simplified regexps, converts them to the subset of regular expression language supported by re2c, then uses that to build an XS module.

In other words, it offers the chance for SpamAssassin rules to be compiled into a trie structure in C code to match multiple patterns in parallel. Given that this is then compiled down to native machine code, it has the potential to be the fastest method possible, apart from using dedicated hardware co-processors.

Sure enough, Matt’s results were pretty good — he says, ‘I managed to match 10k regexps against 10k strings in 0.3s with it, which I think is fairly good.’ ;)

Unfortunately, turning this into something that works with SpamAssassin hasn’t been quite so easy. SpamAssassin rules are free to use the full perl regular expression language — and this language supports many features that re2c’s subset does not. So we need to extract/translate the rule regexps to simplified subsets. This has generally been the case with all parallel matching systems, anyway, so that’s not a massive problem.

More problematically, re2c itself does not support nested patterns — if one token is contained within another, e.g. “FOO” within “FOOD”, then the subsumed token will not be listed as a match. SpamAssassin rules, of course, are free to overlap or subsume each other, so an automated way to detect this is required.

For simple text patterns, this is easy enough to do using substring matching – e.g. “FOOD” =~ /\QFOO\E/ . Unfortunately, once any kind of sophisticated regexp functionality is available, this is no longer the case: consider /FOO*OD/ vs /FOO/ , /F[A-Z]OD/ vs /FO[M-P]/ , /F(?:OO|U)D/ vs /F(?:O|UU)?O/ .
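For the easy case of literal strings, the check might look like the sketch below; subsumed_pairs() is a hypothetical helper, and it deliberately does nothing for the harder regexp cases just listed.

```perl
use strict;
use warnings;

# For literal strings only: return [ inner, outer ] pairs where the inner
# string occurs somewhere inside the outer one.
sub subsumed_pairs {
    my @strings = @_;
    my @pairs;
    for my $outer (@strings) {
        for my $inner (@strings) {
            next if $inner eq $outer;
            # \Q...\E quotes metacharacters, so this is a pure substring test.
            push @pairs, [ $inner, $outer ] if $outer =~ /\Q$inner\E/;
        }
    }
    return @pairs;
}

print "$_->[0] is subsumed by $_->[1]\n" for subsumed_pairs(qw(FOO FOOD BAR));
```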

The only way to do this is to either (a) fully parse the regexp, build the trie, and basically reimplement most of re2c to do this in advance; or (b) change the trie-generation code in re2c to support states returning multiple patterns, as Aho-Corasick does.

I requested support for this in re2c, but got a brush-off, unfortunately. So work continues…

In other news, that food poisoning thing I had back at the end of June has lingered on. It’s now pretty clear that it isn’t food poisoning or a stomach bug… but I still have no idea what it actually is. No fun :(


A Released Perl With Trie-based Regexps!

Good news! From the Perl 5.9.2 ‘perl592delta’ change log:

The regexp engine now implements the trie optimization : it’s able to factorize common prefixes and suffixes in regular expressions. A new special variable, ${^RE_TRIE_MAXBUF}, has been added to fine-tune this optimization.

In other words, the trie-optimization patch contributed by demerphq back in March 2005 is now in a released build of Perl. Yay!

Here’s a writeup of what it does:

A trie is a way of storing keys in a tree structure where the branching logic is determined by the value of the digits of the key. Ie: if we have “car”, “cart”, “carp”, “call”, “cull” and “cars” we can build a trie like this:

        c + a + r + t
          |   |   |
          |   |   + p
          |   |   |
          |   |   + s
          |   | 
          |   + l - l
          |   
          + u - l - l

What the patch does is make /a | list | of | words/ into a trie that matches those words. This means that we can efficiently tell if any of the words are at a given location in a string by simply walking the string and trie at the same time. In many cases we can rule out the entire list by looking at only one character of the input. The current way perl handles this would require looking at N chars where N is the number of words involved. (BTW: That’s the beauty of a trie: its lookup time depends not on the number of words it stores but on the key length of the word being looked up.)
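As a rough illustration of that walk, here is a trie built from nested hashes in plain Perl. This is not how the regexp engine implements it internally; build_trie() and words_at() are made-up names, and using "\0" as the end-of-word marker assumes that character never appears in the input.

```perl
use strict;
use warnings;

# Build a trie as nested hashes, one level per character. The key "\0"
# marks the end of a stored word.
sub build_trie {
    my %trie;
    for my $word (@_) {
        my $node = \%trie;
        $node = $node->{$_} //= {} for split //, $word;
        $node->{"\0"} = $word;
    }
    return \%trie;
}

# Return every stored word that starts at offset $pos of $text, walking the
# text and the trie in step and stopping at the first character with no branch.
sub words_at {
    my ($trie, $text, $pos) = @_;
    my ($node, @found) = ($trie);
    while ($pos < length($text)) {
        $node = $node->{ substr($text, $pos++, 1) } or last;
        push @found, $node->{"\0"} if exists $node->{"\0"};
    }
    return @found;
}

my $trie = build_trie(qw(car cart carp call cull cars));
print join(',', words_at($trie, 'the carts', 4)), "\n";   # car,cart
print scalar(() = words_at($trie, 'dog', 0)), "\n";       # 0
```

The second call shows the “rule out the entire list by looking at only one character” case: 'd' has no branch at the root, so the walk stops immediately.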

SpamAssassin is, of course, both (a) very regular-expression-intensive and (b) searches a single block of text for a large number of independent patterns in parallel. I’d love to see someone coming up with a patch to SpamAssassin that uses trie-compatible regexps when the perl version is >= 5.9.2, and gets increased performance that way. hint ;)

BTW, the Regexp::Trie module on CPAN is related — in that it, similar to Regexp::Optimizer, Regex::PreSuf, or Regexp::Assemble, will compile a list of words or regular expressions into a super-efficient trie-style regexp. However, without the trie patch to the regexp engine itself, this would be a minor efficiency tweak at best; although having said that, Regexp::Assemble’s POD notes:

You should realise that large numbers of alternations are processed in perl’s regular expression engine in O(n) time, not O(1). If you are still having performance problems, you should look at using a trie. Note that Perl’s own regular expression engine will implement trie optimisations in perl 5.10 (they are already available in perl 5.9.3 if you want to try them out). Regexp::Assemble will do the right thing when it knows it’s running on a trie’d perl. (At least in some version after this one).

(PS: interestingly, demerphq mentioned back in March 2005 that he was working on Aho-Corasick matching next. A-C is a great parallel-matching algorithm, and I would imagine it would increase performance yet more. I wonder what happened to that…)


A Gotcha With perl’s “each()”

It’s my bi-monthly perl blog entry, to earn my place on planet.perl.org! ;)

Here’s an interesting “gotcha”. Take this code:

    perl -e '%t=map{$_=>1}qw/1 2 3/;
    while(($k,$v)=each %t){print "1: $k\n"; last;}
    while(($k,$v)=each %t){print "2: $k\n";}'

In other words, iterate through all the key-value pairs in %t once, then do it again — but exit early in the first loop.

You would expect to get something like this output:

    1: 1
    2: 1
    2: 3
    2: 2

Instead, you see:

    1: 1
    2: 3
    2: 2

The “1” entry in the second loop is AWOL. Here’s why, as “perldoc -f each” notes:

There is a single iterator for each hash, shared by all “each”, “keys”, and “values” function calls in the program

That’s all “each” calls, throughout the entire codebase, possibly in a different class entirely. Argh.

The workaround: reset the iterator using “keys” between calls to “each”:

    perl -e '%t=map{$_=>1}qw/1 2 3/;
    while(($k,$v)=each %t){print "1: $k\n"; last;}
    keys %t;
    while(($k,$v)=each %t){print "2: $k\n";}'

This got us in SpamAssassin — bug 4829.

To be honest, having to call “keys” after the loop is kludgy — as you can see if you check the patch in bug 4829 there, we had to change from a “return inside loop” pattern to a “set variable and exit loop, reset state, then return” pattern. It’d be nice to have a scoped version of each(), instead of this global scope, so that this would work:

    perl -e '%t=map{$_=>1}qw/1 2 3/;
    { while(($k,$v)=scoped_each %t){print "1: $k\n"; last;} }
    # that each() iterator is now out of scope, so GC'd;
    # the next call uses a new iterator, starting from scratch
    { while(($k,$v)=scoped_each %t){print "2: $k\n";} }'

Scoping, of course, has the benefit of allowing “return early” patterns to work; in my opinion, those are clearer, at the least because they require fewer lines of code ;)
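In the meantime, the two workable shapes look something like this. The function names are invented for the example and this is not the actual patch from bug 4829. The first is the “set variable, exit loop, reset state, then return” pattern; the second sidesteps each() entirely by looping over the list that keys() returns, at the cost of building that list up front.

```perl
use strict;
use warnings;

# Pattern 1: remember the answer, leave the loop, reset the iterator, return.
sub first_key_with_each {
    my ($hash, $wanted) = @_;
    my $found;
    while (my ($k, $v) = each %$hash) {
        if ($v eq $wanted) { $found = $k; last; }
    }
    keys %$hash;          # keys() in void context resets the shared iterator
    return $found;
}

# Pattern 2: iterate over a key list, so returning early leaves no
# half-finished each() iterator behind.
sub first_key_with_keys {
    my ($hash, $wanted) = @_;
    for my $k (keys %$hash) {
        return $k if $hash->{$k} eq $wanted;
    }
    return;
}

my %t = (a => 1, b => 2, c => 3);
print first_key_with_each(\%t, 2), "\n";   # b
print first_key_with_keys(\%t, 3), "\n";   # c
```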


Another script: goog-love.pl

A quick hack –

goog-love.pl - find out where your site’s google juice comes from

This script will grind through your web site’s “access.log” file (which must be in the “combined” log format). It’ll pick out the top 100 Google searches found in the referer field, re-run those searches, and determine which ones are giving your website all the linky Google love — in other words, the searches that your site ‘wins’ on.

The output is in plain text and a chunk of HTML.

usage:

goog-love.pl sitehost google-api-key < access.log > out.html

e.g.

cat /var/www/logs/taint.org.* | goog-love.pl \
  taint.org 0xb0bd0bb5yourgoogleapikeyhere0xdeadbeef | tee out.html

NOTE: this script requires the SOAP::Lite module be installed. Install it using apt-get install libsoap-lite-perl or cpan SOAP::Lite. It also requires a Google API key.
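The log-grinding half of this is simple enough to sketch. What follows is not the script’s own code: count_google_queries() is a made-up name, the referer pattern only recognises google.* /search URLs with a q= parameter, and the sample log line is invented.

```perl
use strict;
use warnings;

# Tally Google search queries found in the referer field of "combined"
# format log lines. Returns a hash ref of query => hit count.
sub count_google_queries {
    my %count;
    for my $line (@_) {
        # A combined-format line ends: "request" status bytes "referer" "user-agent"
        my ($referer) = $line =~ /" \d{3} \S+ "([^"]*)" "[^"]*"\s*$/ or next;
        my ($query_string) =
            $referer =~ m{^https?://(?:www\.)?google\.[a-z.]+/search\?(.*)}i or next;
        my ($q) = $query_string =~ /(?:^|&)q=([^&]*)/ or next;
        $q =~ tr/+/ /;                               # undo form encoding
        $q =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;     # undo %XX escapes
        $count{ lc $q }++;
    }
    return \%count;
}

my $counts = count_google_queries(
    '1.2.3.4 - - [10/Oct/2006:13:55:36 +0000] "GET / HTTP/1.1" 200 2326 '
  . '"http://www.google.com/search?q=beardy+justin&hl=en" "Mozilla/4.0"',
);
my @top = sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts;
print "$counts->{$_}\t$_\n" for @top[0 .. ($#top < 99 ? $#top : 99)];
```

Re-running the top searches against the Google API is the part that needs SOAP::Lite and an API key, and is left out here.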

For example, here are the current results for this site. You can immediately see some interesting stuff that’s not immediately obvious otherwise, such as my site being the top hit for [beardy justin] ;)

Download here (5 KiB perl script).

Notes:

  • if you see a lot of “502 Bad Gateway” errors, it’s probably over-zealous anti-bot ACLs on Google’s side. Try from another host.

  • Read the comments for notes on a bug in recent releases of SOAP::Lite; please let me know if you hear of them getting fixed ;)


What Works in Software Development

I already posted this to the link-blog yesterday, but it’s so good it’s worth promoting more widely. If you write software for a living, you really ought to read the slides for Michael Schwern’s excellent ‘What Works In Software Development’ talk.

It’s a long presentation (108 slides!), but during the course of that, he covers:

  • effective teamwork
  • dealing with bad customers
  • dealing with bad management
  • classic coding mistakes
  • classic project management mistakes
  • classic design mistakes
  • test-driven development
  • refactoring
  • patterns

It’s a really good synthesis of what I think are the best bits of good OO design, XP, CPAN and perl’s design and coding styles, without most of the cruft. I’ll be pointing people at this for years to come, I think…

(Found via yoz.)


IPC::DirQueue 0.06 released

More details on the mailing list, if you’re into that sort of thing. ;)


A couple of links while del.icio.us is ill

Happy birthday, Perl!

Perl was 18 today. In many jurisdictions, it can now drink intoxicating liquors, vote, and join the armed forces.

Global Warming Sceptic Bingo:

Just tick the box when they use the argument next to it. Get four in a row and you win!

Get well soon, del.icio.us.


trueColor() bug in GD::Graph

Hacking on a new rule-QA subsystem for SpamAssassin, I came across this bug in GD::Graph. If:

  • you are drawing a graph using GD::Graph;
  • outputting in PNG or GIF format;
  • and the ‘box’ area — the margins outside the graph — keeps coming up as black, instead of white as you’ve specified;

check your code for calls to GD::Image->trueColor(1);, or the third argument to the GD::Image->new() constructor being 1. It appears that there’s a bug in the current version of GD (or GD::Graph) where graphing to a true-colour buffer is concerned, in that the ‘box’ area continually comes out in black.

(Seen in versions: perl 5.8.7, GD 2.23, GD::Graph 1.43 on Linux ix86; perl 5.8.6, GD 2.28, GD::Graph 1.43 on Solaris 5.10.)


Latest Script Hack: utf8lint

Perl: double-encoding is a frequent problem when dealing with UTF-8 text, where a UTF-8 string is treated as (typically) ISO Latin-1, and is re-encoded.

utf8lint is a quick hack script which uses perl’s Encode module to detect this. Feed it your data on STDIN, and it’ll flag lines that contain text which may be doubly-encoded UTF-8, in a lintish way.
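One plausible way to do the detection with Encode is sketched below. This is a guess at the technique, not utf8lint’s actual code, and looks_double_encoded() is a made-up name. The heuristic: if the bytes decode as UTF-8, the resulting characters all fit in Latin-1, and those Latin-1 bytes are themselves valid UTF-8, the text was probably encoded twice.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return 1 if the byte string looks like doubly-encoded UTF-8, else 0.
sub looks_double_encoded {
    my ($bytes) = @_;
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;   # die on bad input, keep the source intact

    my $once = eval { decode('UTF-8', $bytes, $check) };
    return 0 unless defined $once;               # not valid UTF-8 at all
    return 0 unless $once =~ /[^\x00-\x7f]/;     # pure ASCII cannot be double-encoded

    # Undo the suspected Latin-1 misreading; this fails if any character
    # is outside Latin-1.
    my $latin1 = eval { encode('iso-8859-1', $once, $check) };
    return 0 unless defined $latin1;

    # If what is left is valid UTF-8 again, flag it.
    my $twice = eval { decode('UTF-8', $latin1, $check) };
    return defined $twice ? 1 : 0;
}

print looks_double_encoded("caf\xC3\x83\xC2\xA9"), "\n";   # 1: e-acute encoded twice
print looks_double_encoded("caf\xC3\xA9"), "\n";           # 0: e-acute encoded once
```

This can give false positives on legitimate text that happens to contain such character pairs, which fits the “may be doubly-encoded” wording above.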


Continuations in perl

Code: Ugo Cei: Building Interactive Web Programs with Continuations quoting Phil Windley:

This leads to the question: what if I could write programs for the Web that were ‘structured’ in the programming sense of that word? The result would be Web programs that were more natural to write and easy to read. You’d no longer have to maintain the state of your program outside the language and the data could be kept in variables, where it belongs. The answer is: you can.

I hate the ‘save all state’ model imposed by developing for the web, and have been hoping for a way to do this for a while. And now I know what it’s called ;)

It seems Seaside is the leading continuations-based web-app framework, using Smalltalk, and (as Ugo noted) Apache Cocoon has it too, but there’s a whole load more. Can you tell I haven’t been following web-app development techniques much recently?

Never mind those other languages, though — Continuity looks promising as a Perl framework based around continuations. Perl 6 will reportedly have native continuation support, and Dan Sugalski gives a good write-up of how they’re implemented and their ramifications there.


IPC::DirQueue 0.04 released

Perl: at last, a perl-related posting! I’ve released IPC::DirQueue 0.04; details of what’s changed (summary, a couple of bugs fixed) are at that link.

BTW, thanks to Ask and Robert at perl.org, who are providing free SVN repository and list hosting for CPAN modules! (And don’t overlook the fact that the mailing list/newsgroups each have their own RSS feed, woot!)


Web-browser style history for the command line

Code: Here’s something I came up with recently — it’s actually an evolution of the idea of pushd and popd, as included in BASH. To quote the POD docs:

cdhistory is a perl script used to implement web-browser style “history” for UNIX shells; as you use the cd command to explore the filesystem, your moves are remembered, and you can go “back” through history, and “forward” again, as you like.
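The underlying data structure is the same one a browser uses: a list of visited places plus a cursor. A minimal sketch follows; the DirHistory class is hypothetical and is not taken from the cdhistory script.

```perl
use strict;
use warnings;

package DirHistory;   # hypothetical name, for illustration only

sub new {
    my ($class, $start) = @_;
    return bless { stack => [$start], pos => 0 }, $class;
}

# A new cd discards any "forward" entries, as a browser does when you
# follow a link after going back.
sub visit {
    my ($self, $dir) = @_;
    splice @{ $self->{stack} }, $self->{pos} + 1;
    push @{ $self->{stack} }, $dir;
    return $self->{stack}[ ++$self->{pos} ];
}

# Move the cursor one step, staying put at either end of the history.
sub back {
    my ($self) = @_;
    $self->{pos}-- if $self->{pos} > 0;
    return $self->{stack}[ $self->{pos} ];
}

sub forward {
    my ($self) = @_;
    $self->{pos}++ if $self->{pos} < $#{ $self->{stack} };
    return $self->{stack}[ $self->{pos} ];
}

package main;

my $history = DirHistory->new('/home');
$history->visit('/etc');
$history->visit('/var/log');
print $history->back, "\n";      # /etc
print $history->forward, "\n";   # /var/log
```

A real shell integration would also need to persist this state between invocations and wrap the cd builtin; both are left out of the sketch.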

Download the perl script here.


Announcing IPC::DirQueue

Perl: So, I wrote a new CPAN module recently — IPC::DirQueue. It implements a nifty design pattern for slightly larger systems, ones where multiple processes, possibly on multiple machines, must collaborate to deal with incoming task submissions. To quote the POD:

This module implements a FIFO queueing infrastructure, using a directory as the communications and storage media. No daemon process is required to manage the queue; all communication takes place via the filesystem.

A common UNIX system design pattern is to use a tool like lpr as a task queueing system; for example, this article describes the use of lpr as an MP3 jukebox.

However, lpr isn’t as efficient as it could be. When used in this way, you have to restart each task processor for every new task. If you have a lot of startup overhead, this can be very inefficient. With IPC::DirQueue, a processing server can run persistently and cache data needed across multiple tasks efficiently; it will not be restarted unless you restart it.

Multiple enqueueing and dequeueing processes on multiple hosts (NFS-safe locking is used) can run simultaneously, and safely, on the same queue.

Since multiple dequeuers can run simultaneously, this provides a good way to process a variable level of incoming tasks using a pre-defined number of worker processes.

If you need more CPU power working on a queue, you can simply start another dequeuer to help out. If you need less, kill off a few dequeuers.

If you need to take down the server to perform some maintenance or upgrades, just kill the dequeuer processes, perform the work, and start up new ones. Since there’s no ‘socket’ or similar point of failure aside from the directory itself, the queue will just quietly fill with waiting jobs until the new dequeuer is ready.

Arbitrary ‘name = value’ metadata pairs can be transferred alongside data files. In fact, in some cases, you may find it easier to send unused and empty data files, and just use the ‘metadata’ fields to transfer the details of what will be worked on.
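To give a feel for how a directory can act as a queue with no daemon, here is a toy version built on rename(). It is emphatically not IPC::DirQueue’s implementation or API: the tmp/queue/active layout and the make_queue(), enqueue() and dequeue() functions are invented for the sketch, and it ignores the NFS-safe locking and metadata support the module provides.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical layout: tmp (being written), queue (waiting), active (claimed).
sub make_queue {
    my $root = tempdir(CLEANUP => 1);
    for my $sub (qw(tmp queue active)) {
        mkdir File::Spec->catdir($root, $sub) or die "mkdir $sub: $!";
    }
    return $root;
}

my $seq = 0;
sub enqueue {
    my ($root, $data) = @_;
    # Zero-padded timestamp plus counter keeps names sortable, giving FIFO order.
    my $name = sprintf '%012d.%06d.%d', time, $seq++, $$;
    my $tmp  = File::Spec->catfile($root, 'tmp', $name);
    open my $fh, '>', $tmp or die "open $tmp: $!";
    print {$fh} $data;
    close $fh or die "close $tmp: $!";
    # rename() within one local filesystem is atomic, so a dequeuer never
    # sees a half-written job.
    rename $tmp, File::Spec->catfile($root, 'queue', $name) or die "rename: $!";
    return $name;
}

sub dequeue {
    my ($root) = @_;
    opendir my $dh, File::Spec->catdir($root, 'queue') or die "opendir: $!";
    my @names = sort grep { !/^\./ } readdir $dh;
    closedir $dh;
    for my $name (@names) {
        my $claimed = File::Spec->catfile($root, 'active', $name);
        # Only one process can win this rename; losers move on to the next job.
        next unless rename File::Spec->catfile($root, 'queue', $name), $claimed;
        open my $fh, '<', $claimed or die "open $claimed: $!";
        my $data = do { local $/; <$fh> };
        close $fh;
        unlink $claimed;                 # job finished
        return $data;
    }
    return;                              # queue is empty
}

my $queue = make_queue();
enqueue($queue, "job one");
enqueue($queue, "job two");
print dequeue($queue), "\n";             # job one
```

Because a job is either in queue or in active, never both, adding or killing dequeuers needs no coordination beyond the filesystem itself.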

Sound interesting? Here’s the tarball.


Easy-peasy web scraping: HTTP::Recorder

Perl: I’ve been writing a few convenience web-scrapers recently using WWW::Mechanize, with great success.

So the latest development, HTTP::Recorder, looks very nifty too:

HTTP::Recorder is a browser-independent recorder that records interactions with web sites and produces scripts for automated playback. Recorder produces WWW::Mechanize scripts by default (see WWW::Mechanize by Andy Lester), but provides functionality to use your own custom logger.

… Simply speaking, HTTP::Recorder removes a great deal of the tedium from writing scripts for web automation. If you’re like me, you’d rather spend your time writing code that’s interesting and challenging, rather than digging through HTML files, looking for the names of forms and fields, so that you can write your automation scripts. HTTP::Recorder records what you do as you do it, so that you can focus on the things you care about.

No SSL support yet, though, as far as I can see, but for simple scraping – or as a good starting point for a more complex Mechanize script — it looks like it’ll work great.


Bloom Filters

Code: A very good intro to Bloom Filters at perl.com by Maciej Ceglowski.

Strikes me as something that might be very applicable to the SpamAssassin auto-whitelist…
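For the curious, a tiny Bloom filter fits in a few lines of Perl using vec() and Digest::MD5. This sketch is not taken from the perl.com article; the function names, the salting scheme for deriving several hashes from MD5, and the sizes used are all assumptions.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# $bits is the size of the bit vector, $hashes the number of hash functions.
sub bloom_new {
    my ($bits, $hashes) = @_;
    return { bits => $bits, hashes => $hashes, vec => '' };
}

# Derive one bit position per hash function by salting an MD5 digest and
# taking its first 32 bits.
sub bloom_positions {
    my ($bf, $key) = @_;
    return map { unpack('N', md5("$_:$key")) % $bf->{bits} } 1 .. $bf->{hashes};
}

sub bloom_add {
    my ($bf, $key) = @_;
    vec($bf->{vec}, $_, 1) = 1 for bloom_positions($bf, $key);
    return;
}

# 0 means definitely absent; 1 means probably present, since false
# positives are possible but false negatives are not.
sub bloom_check {
    my ($bf, $key) = @_;
    for my $pos (bloom_positions($bf, $key)) {
        return 0 unless vec($bf->{vec}, $pos, 1);
    }
    return 1;
}

my $seen = bloom_new(1024, 3);
bloom_add($seen, 'friend@example.com');
print bloom_check($seen, 'friend@example.com'), "\n";   # 1
```

The attraction for something like a whitelist is that membership can be tested without storing the addresses themselves.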


5p@mff1ti


Lotsa SpamConf linkage and commentary

Another good trip report, from ‘babbage’ at perl.org.

  • Again, and interestingly, quite a few folks agreed with one of SA’s core tenets; no single approach (stats, RBLs, rules, distributed hashes) can filter effectively on its own, as spammers will soon figure out a way to subvert that technique. However, if you combine several techniques, they cannot all be subverted at once, so your effectiveness in the face of active attacks is much better.

  • Also interesting to note how everyone working with learning-based approaches commented on how hard it was to persuade ‘normal people’ to keep a corpus. Let’s hope SA’s auto-training will work well enough to avoid that problem.

  • in passing — babbage noted the old canard about Hotmail selling their user database to spammers. That must really piss the Hotmail folks off ;) I think it’s much more likely that, with Moore’s Law and the modern internet, a dictionary attack *will* find your account eventually.

  • Good tip on the legal angle from John Praed of The Internet Law Group: if a spam misuses the name of a trademarked product like ‘Viagra’, get a copy to Pfizer pronto. Trademark holders have a particular desire to follow up on infringements like this, as an undefended trademark loses its TM status otherwise.

  • David Berlind, ZDNet executive editor: ‘They don’t want to be involved (in developing an SMTPng)’. He might say that, but I bet their folks working on sending out their bulk-mailed email newsletters might disagree ;). Legit bulk mail senders have to be involved for it to work, and they will want to be involved, too.

  • Brightmail have a patent on spam honeypots? Must take a look for this sometime.

  • the plural of ‘corpus’ is ‘corpora’ ;)

Great report, overall.

It’s interesting to see that Infoworld notes that reps from AOL, Yahoo! and MS were all present.

Since the conf, Paul Graham has a new paper up about ‘Better Bayesian Filtering’, and lists some new tokenization techniques he’s using:

  • keep dollar signs, exclamation and most punctuation intact (we do that!)

  • prepend header names to header-mined tokens (us too!)

  • case is preserved (ditto!)

  • keep ‘degenerate’ tokens; ‘Subject:FREE!!!’ degenerates to ‘Subject:free’, to ‘FREE!!!’, and ‘free’. (ditto! well, partly. We use degeneration of tokens, but we keep the degenerate tokens in a separate, prefixed namespace from the non-degenerate ones, as he contemplates in footnote 7. It’s worth noting that case-sensitivity didn’t work well compared to the database bloat it produced; each token needs to be duplicated into the case-insensitive namespace, but that doubled the database size, and the hit-rate didn’t go up nearly enough to make it worthwhile.)
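The degeneration step in the last point can be sketched as below, following the ‘Subject:FREE!!!’ example. This is neither Paul Graham’s nor SpamAssassin’s tokenizer: the degenerate() function, the way the header prefix is split off, and the trailing-punctuation rule are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Return the degenerate forms of a token, most specific first, leaving out
# the original token and any duplicates.
sub degenerate {
    my ($token) = @_;
    # Split off an optional "Header:" prefix.
    my ($header, $word) = $token =~ /^([A-Za-z-]+:)(.+)$/ ? ($1, $2) : ('', $token);
    # The plainest form: lowercased, trailing punctuation removed.
    (my $plain = lc $word) =~ s/[[:punct:]]+$//;
    my %seen = ($token => 1);
    return grep { length($_) && !$seen{$_}++ } ("$header$plain", $word, $plain);
}

print join(' ', degenerate('Subject:FREE!!!')), "\n";   # Subject:free FREE!!! free
print join(' ', degenerate('FREE!!!')), "\n";           # free
```

Keeping these forms in a separate, prefixed namespace, as described above, would just mean prepending a marker to each returned string before storing it.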

Most of these were also discovered and verified experimentally by SpamBayes, too, BTW.

When we were working on SpamAssassin’s Bayesian-ish implementation, we took a scientific approach, and used suggestions from the SpamBayes folks and from the SpamAssassin community on tokenizer and stats-combining techniques. We then tested these experimentally on a test corpus, and posted the results. In almost all cases, our results matched up with the SpamBayes folks’ results, which is very nice, in a scientific sense.

(PS: update on the Fly UI story — ‘apis’ is not French, it’s Latin. oops! Thanks Craig…)


Lamest patent prior-art search ever?

AOL patents instant messaging (/.). ‘Specifically, any technology that provides ‘a network that allows multiple users to see when other users are present and then to communicate with them’ is covered.’

The CNet story which /. references points out that the patent was filed in 1997 — but that’s still 6 years after I wrote a similar perl script on the Maths Department UNIX machines in TCD. There’s a myriad of similar apps, of the same vintage, too.

The thing I find amazing is this, however — the AOL patent actually cites prior art in its References section, namely the xhtalk README file, dated 1992. There’s nothing different between xhtalk and AOL Instant Messenger apart from the protocol and the look and feel, and those aren’t key to the patent.

The US patent office really needs to start reading the patent applications before granting them.
