Bleadperl regexp optimization vs SA
I’ve been looking some more into recent new features added to bleadperl by demerphq, such as Aho-Corasick trie matching, and how we can effectively support this in SpamAssassin. Here’s the state of play.
These are the “base strings” extracted from the SpamAssassin SVN trunk body ruleset (ignore the odd mangled UTF-8 char in here, it’s suffering from cut-and-paste breakage). A “base string” is a simplified subset of the regular expression; specifically, these are the cases where the “base strings” of the rule are simpler than the full perl regular expression language, and therefore amenable to fast parallel string matching algorithms.
The base strings appear in that file as “r” lines, like so:
r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE r I drive a:__DOS_I_DRIVE_A r I might be c:__DOS_COMING_TO_YOUR_PLACE r I might c:__DOS_COMING_TO_YOUR_PLACE
The base string is the part after “r” and before the “:”; after that, the rule names appear.
Now, here are some limitations that make this less easy:
One string to many rules: each one of those strings corresponds to one or more SpamAssassin rules.
One rule to many strings: each rule may correspond to one or more of those strings. So it’s not a one-to-one correspondence either way.
No anchors: the strings may match anywhere inside the line, similar to
("foo bar baz" =~ /bar/).Multiple rules can fire on the same line: each line can cause multiple rules to fire on different parts of its text.
Subsumption is not permitted: the base-string extractor plugin has already established cases where subsumption takes place. Each string will not subsume another string; so a match of the string “food” against the strings “food” and “foo” should just fire on “food”, not on “foo”.
Overlapping is permitted: on the other hand, overlapping is fine; “foobar” matched against “foo” and “oobar” should fire on both base strings. (The above two are basically for re2c compatibility. This is the main reason the strings are so simple, with no RE metachars — so that this is possible, since re2c is limited in this way.)
Most rules are more complex: most of the ruleset — as you can see from the ‘orig’ lines in that file — are more complex than the base string alone. So this means that a base string match often needs to be followed by a “verification” match using the full regexp.
Now, the problem is to iterate through each line of the (base64-decoded, encoding-decoded, HTML-decoded, whitespace-simplified) “body text” of a mail message, with each paragraph appearing as a single “line”, and run all those base strings in parallel, identifying the rule names that then need to be run.
This is turning out to be quite tricky with the bleadperl trie code.
For example, if we have 3 base strings, as follows:
hello:RULE_HELLO hi:RULE_HI foo:RULE_FOO
At first, it appears that we could use the pattern itself as a key into a lookup table to determine the pattern that fired:
%base_to_rulename_lookup = (
'hello' => ['RULE_HELLO'],
‘hi’ => ['RULE_HI'],
‘foo’ => ['RULE_FOO']
);
if ($line =~ m{(hello|hi|foo)}) {
$rule_fired = $base_to_rulename_lookup{$1};
}
However, that will fail in the face of the string “hi foo!”, since only one of the bases will be returned as $1, whereas we want to know about both “RULE_HI” and “RULE_FOO”.
m//gc might help:
%base_to_rulename_lookup = (
'hello' => ['RULE_HELLO'],
‘hi’ => ['RULE_HI'],
‘foo’ => ['RULE_FOO']
);
while ($line =~ m{(hello|hi|foo)}gc) {
$rule_fired = $base_to_rulename_lookup{$1};
}
That works pretty well, but not if two patterns overlap: /abc/ and /bcd/, matching on the string “abcd”, for example, will fire only on “abc”, and miss the “bcd” hit.
Given this, it appears the only option is to run the trie match, and then iterate on all the regexps for the base strings it contains:
if ($line =~ m{hello|hi|foo}) {
$line =~ /hello/ and rule_fired("HELLO");
$line =~ /hi/ and rule_fired("HI");
$line =~ /foo/ and rule_fired("FOO");
}
Obviously, that doesn’t provide much of a speedup — in fact, so far, I’ve been unable to get any at all out of this method. :(
This can be optimized a little by breaking into multiple trie/match sets:
if ($line =~ m{hello|hi}) {
$line =~ /hello/ and rule_fired("HELLO");
$line =~ /hi/ and rule_fired("HI");
...
}
if ($line =~ m{foo|bar}) {
$line =~ /foo/ and rule_fired("FOO");
$line =~ /bar/ and rule_fired("BAR");
...
}
But still, the reduction in regexp OPs vs the addition of logic OPs to do this, result in an overall slowdown, even given the faster trie-based REs.
Suggestions, anyone?
(by the way, if you’re curious, the current code is here in SVN.)
Tags: aho-corasick, hacking, optimisation, perl, regexps, spamassassin, tries
More parallel string-match algorithm hacking: re2xs
Last week, Matt Sergeant released a great little perl script, re2xs, which takes a set of simplified regexps, converts them to the subset of regular expression language supported by re2c, then uses that to build an XS module.
In other words, it offers the chance for SpamAssassin rules to be compiled into a trie structure in C code to match multiple patterns in parallel. Given that this is then compiled down to native machine code, it has the potential to be the fastest method possible, apart from using dedicated hardware co-processors.
Sure enough, Matt’s results were pretty good — he says, ‘I managed to match 10k regexps against 10k strings in 0.3s with it, which I think is fairly good.’ ;)
Unfortunately, turning this into something that works with SpamAssassin hasn’t been quite so easy. SpamAssassin rules are free to use the full perl regular expression language — and this language supports many features that re2c’s subset does not. So we need to extract/translate the rule regexps to simplified subsets. This has generally been the case with all parallel matching systems, anyway, so that’s not a massive problem.
More problematically, re2c itself does not support nested patterns — if one token is contained within another, e.g. “FOO” within “FOOD”, then the subsumed token will not be listed as a match. SpamAssassin rules, of course, are free to overlap or subsume each other, so an automated way to detect this is required.
For simple text patterns, this is easy enough to do using substring matching – e.g. “FOOD” =~ /\QFOO\E/ . Unfortunately, once any kind of sophisticated regexp functionality is available, this is no longer the case: consider /FOO*OD/ vs /FOO/ , /F[A-Z]OD/ vs /FO[M-P]/ , /F(?:OO|U)D/ vs /F(?:O|UU)?O/ .
The only way to do this is to either (a) fully parse the regexp, build the trie, and basically reimplement most of re2c to do this in advance; or (b) change the trie-generation code in re2c to support states returning multiple patterns, as Aho-Corasick does.
I requested support for this in re2c, but got a brush-off, unfortunately. So work continues…
In other news, that food poisoning thing I had back at the end of June has lingered on. It’s now pretty clear that it isn’t food poisoning or a stomach bug… but I still have no idea what it actually is. No fun :(
Tags: aho-corasick, matt-sergeant, perl, re2c, regexps, spamassassin, strings, tries
