Bleadperl regexp optimization vs SA

I’ve been looking some more into recent new features added to bleadperl by demerphq, such as Aho-Corasick trie matching, and how we can effectively support this in SpamAssassin. Here’s the state of play.

These are the “base strings” extracted from the SpamAssassin SVN trunk body ruleset (ignore the odd mangled UTF-8 char in here, it’s suffering from cut-and-paste breakage). A “base string” is a simplified subset of the regular expression; specifically, these are the cases where the “base strings” of the rule are simpler than the full perl regular expression language, and therefore amenable to fast parallel string matching algorithms.

The base strings appear in that file as “r” lines, like so:

r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE
r I drive a:__DOS_I_DRIVE_A
r I might be c:__DOS_COMING_TO_YOUR_PLACE
r I might c:__DOS_COMING_TO_YOUR_PLACE

The base string is the part after “r” and before the “:”; after that, the rule names appear.

Now, here are some limitations that make this less easy:

  • One string to many rules: each one of those strings corresponds to one or more SpamAssassin rules.

  • One rule to many strings: each rule may correspond to one or more of those strings. So it’s not a one-to-one correspondence either way.

  • No anchors: the strings may match anywhere inside the line, similar to ("foo bar baz" =~ /bar/).

  • Multiple rules can fire on the same line: each line can cause multiple rules to fire on different parts of its text.

  • Subsumption is not permitted: the base-string extractor plugin has already established cases where subsumption takes place. Each string will not subsume another string; so a match of the string “food” against the strings “food” and “foo” should just fire on “food”, not on “foo”.

  • Overlapping is permitted: on the other hand, overlapping is fine; “foobar” matched against “foo” and “oobar” should fire on both base strings. (The above two are basically for re2c compatibility. This is the main reason the strings are so simple, with no RE metachars — so that this is possible, since re2c is limited in this way.)

  • Most rules are more complex: most of the ruleset — as you can see from the ‘orig’ lines in that file — are more complex than the base string alone. So this means that a base string match often needs to be followed by a “verification” match using the full regexp.

Now, the problem is to iterate through each line of the (base64-decoded, encoding-decoded, HTML-decoded, whitespace-simplified) “body text” of a mail message, with each paragraph appearing as a single “line”, and run all those base strings in parallel, identifying the rule names that then need to be run.

This is turning out to be quite tricky with the bleadperl trie code.

For example, if we have 3 base strings, as follows:

  hello:RULE_HELLO
  hi:RULE_HI
  foo:RULE_FOO

At first, it appears that we could use the pattern itself as a key into a lookup table to determine the pattern that fired:

  %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']
  );

if ($line =~ m{(hello|hi|foo)}) { $rule_fired = $base_to_rulename_lookup{$1}; }

However, that will fail in the face of the string “hi foo!”, since only one of the bases will be returned as $1, whereas we want to know about both “RULE_HI” and “RULE_FOO”.

m//gc might help:

  %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']
  );

while ($line =~ m{(hello|hi|foo)}gc) { $rule_fired = $base_to_rulename_lookup{$1}; }

That works pretty well, but not if two patterns overlap: /abc/ and /bcd/, matching on the string “abcd”, for example, will fire only on “abc”, and miss the “bcd” hit.

Given this, it appears the only option is to run the trie match, and then iterate on all the regexps for the base strings it contains:

  if ($line =~ m{hello|hi|foo}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    $line =~ /foo/ and rule_fired("FOO");
  }

Obviously, that doesn’t provide much of a speedup — in fact, so far, I’ve been unable to get any at all out of this method. :(

This can be optimized a little by breaking into multiple trie/match sets:

  if ($line =~ m{hello|hi}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    ...
  }
  if ($line =~ m{foo|bar}) {
    $line =~ /foo/ and rule_fired("FOO");
    $line =~ /bar/ and rule_fired("BAR");
    ...
  }

But still, the reduction in regexp OPs vs the addition of logic OPs to do this, result in an overall slowdown, even given the faster trie-based REs.

Suggestions, anyone?

(by the way, if you’re curious, the current code is here in SVN.)

Tags: , , , , , ,

Comments (18)

Search Engine Optimisation

Tom Coates on search engine optimisation. Summary: they don’t work; smart search engines realise you’re trying to game them, and will ignore or penalise your site as a result. The correct answer is to provide interesting/good/linkworthy textual information, and keep superfluous eye candy at a sensible level. I agree with his essay, FWIW.

Personally, I reckon Google deserve a lot of credit for turning the web around, from a flashy, Flash-laden animated DHTML blinky-blink medium, back into one where text is king. Once it got recognized that Google used titles, h1 tags, and other semantic markup as key metadata, and that the gimmicky stuff is unindexable, the never-ending slide into flashy blinky-blink land was halted. Phew!

Aside: Labour MP Tom Watson has a weblog?! Wow. He’d get my vote straight away, no matter what his policies were — that’s transparency ;)

Interesting — so does Liberal Democrats MP Richard Allen. This is really amazing. He even links to SpamAssassin as part of a discussion on the All-Party Internet Group’s spam summit to be held on July 1st!

It’s worth noting that his comment here notes that the APIG concept seems to be leaning towards prosecution of spamvertised products; advertise via spam (sent by you or by a ’spam outsourcing’ company), and you’re liable. A very sensible approach, as long as they can avoid the danger of malicious spammers spamvertising a product without that company’s permission — a la what happens regularly to SpamCop and SpamHaus.

Tags: , , , , , , , , ,

Comments