Bleadperl regexp optimization vs SA

I’ve been looking some more into recent new features added to bleadperl by demerphq, such as Aho-Corasick trie matching, and how we can effectively support this in SpamAssassin. Here’s the state of play.

These are the “base strings” extracted from the SpamAssassin SVN trunk body ruleset (ignore the odd mangled UTF-8 char in here, it’s suffering from cut-and-paste breakage). A “base string” is a simplified subset of the regular expression; specifically, these are the cases where the “base strings” of the rule are simpler than the full perl regular expression language, and therefore amenable to fast parallel string matching algorithms.

The base strings appear in that file as “r” lines, like so:

r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE
r I drive a:__DOS_I_DRIVE_A

The base string is the part after “r” and before the “:”; after that, the rule names appear.

Now, here are some limitations that make this less easy:

  • One string to many rules: each one of those strings corresponds to one or more SpamAssassin rules.

  • One rule to many strings: each rule may correspond to one or more of those strings. So it’s not a one-to-one correspondence either way.

  • No anchors: the strings may match anywhere inside the line, similar to ("foo bar baz" =~ /bar/).

  • Multiple rules can fire on the same line: each line can cause multiple rules to fire on different parts of its text.

  • Subsumption is not permitted: the base-string extractor plugin has already established cases where subsumption takes place. Each string will not subsume another string; so a match of the string “food” against the strings “food” and “foo” should just fire on “food”, not on “foo”.

  • Overlapping is permitted: on the other hand, overlapping is fine; “foobar” matched against “foo” and “oobar” should fire on both base strings. (The above two are basically for re2c compatibility. This is the main reason the strings are so simple, with no RE metachars — so that this is possible, since re2c is limited in this way.)

  • Most rules are more complex: most of the ruleset — as you can see from the ‘orig’ lines in that file — are more complex than the base string alone. So this means that a base string match often needs to be followed by a “verification” match using the full regexp.

Now, the problem is to iterate through each line of the (base64-decoded, encoding-decoded, HTML-decoded, whitespace-simplified) “body text” of a mail message, with each paragraph appearing as a single “line”, and run all those base strings in parallel, identifying the rule names that then need to be run.

This is turning out to be quite tricky with the bleadperl trie code.

For example, if we have 3 base strings, as follows:


At first, it appears that we could use the pattern itself as a key into a lookup table to determine the pattern that fired:

  %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']

  if ($line =~ m{(hello|hi|foo)}) {
    $rule_fired = $base_to_rulename_lookup{$1};

However, that will fail in the face of the string “hi foo!”, since only one of the bases will be returned as $1, whereas we want to know about both “RULE_HI” and “RULE_FOO”.

m//gc might help:

  %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']

  while ($line =~ m{(hello|hi|foo)}gc) {
    $rule_fired = $base_to_rulename_lookup{$1};

That works pretty well, but not if two patterns overlap: /abc/ and /bcd/, matching on the string “abcd”, for example, will fire only on “abc”, and miss the “bcd” hit.

Given this, it appears the only option is to run the trie match, and then iterate on all the regexps for the base strings it contains:

  if ($line =~ m{hello|hi|foo}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    $line =~ /foo/ and rule_fired("FOO");

Obviously, that doesn’t provide much of a speedup — in fact, so far, I’ve been unable to get any at all out of this method. :(

This can be optimized a little by breaking into multiple trie/match sets:

  if ($line =~ m{hello|hi}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
  if ($line =~ m{foo|bar}) {
    $line =~ /foo/ and rule_fired("FOO");
    $line =~ /bar/ and rule_fired("BAR");

But still, the reduction in regexp OPs vs the addition of logic OPs to do this, result in an overall slowdown, even given the faster trie-based REs.

Suggestions, anyone?

(by the way, if you’re curious, the current code is here in SVN.)

This entry was posted in Uncategorized and tagged , , , , , , . Bookmark the permalink. Both comments and trackbacks are currently closed.


  1. fishbot
    Posted November 16, 2006 at 17:44 | Permalink

    What about using some of the new backtracking controls, and failing matches to collect lists of matches with the wonder of backtracking?

    my $string = "foobar";
    my %stuff;
    $string =~ m{
        (?{$stuff{$^N}++}) (*FAIL)
    print Dumper \%stuff;
    $VAR1 = {
              'foo' => 1,
              'oobar' => 1

    I did a re qw{debug}, and that is using trie+AHOCORASICK-EXACTF for the alternation, but being overlap-greedy. Have no idea if the eval blocks negate and efficiency gains.

  2. Posted November 16, 2006 at 17:58 | Permalink

    awesome — that looks promising, alright! in fact, it even works with fully-subsumed matches:

    $string =~ m{
        (?{$stuff{$^N}++}) (*FAIL)
    use Data::Dumper; print Dumper \%stuff;
    $VAR1 = {
              'foo' => 1,
              'oob' => 1,
              'foobar' => 1

    nifty. fingers crossed those eval blocks are low-impact, performance wise.

  3. demerphq
    Posted November 16, 2006 at 18:03 | Permalink

    Its possible we have to work on this together a bit to tune things correctly.

    The core issue you have is want to find all the overlapping matches in a list of words.

    The correct way to do this in the most recent blead is:

    /(list|of|words)(?{push @matched,$1})(*FAIL)/

    Replace (*FAIL) with (?!) if you need to.

    This will force the engine to backtrack and try to find the next match. If this doesnt work well please let me know.

    Also, there are open optimisation issues with this type of matching.

    Also, id like to point out its no longer necessary to use only fixed strings in the list. Just make sure that each entry in the list starts with a literal sequence (the longer the better).

    Also, in one of the most recent patches I added a feature that specifically had this problem (and SA) in mind. Imagine your keyword list is


    obviously you cant use $1 anymore to get the info you need. And using capture buffers doesnt help much (named capture buffers would a bit but not much).

    So I added a feature called


    or its short form


    Lets change the pattern to:


    So now after a successful match of this pattern the package variable $REGMARK will be set to either ‘A’ or ‘B’.

    Hrm, i just realized you cant access $REGMARK properly inside of a (?{…}) block, which is not right.

    Expect further action on this. Also please feel free to contact me directly with further discussion. This isnt the ideal medium for this.

    Cheers, Yves

  4. Posted November 16, 2006 at 18:26 | Permalink

    hi Yves — good tips, thanks.

    That REGMARK feature would be awesome! Could you mail/post on your journal when it can be accessed inside a (?{…}) block? I’ll be keeping an eye out…

    For what it’s worth, this isn’t a particularly bad medium for this discussion — there’s a few people beyond just the two of us who are interested in how this would work, and this way I can get comments/discussion from all of you ;).

  5. ben
    Posted November 16, 2006 at 18:39 | Permalink

    My cat’s breath smells like catfood.

  6. demerphq
    Posted November 16, 2006 at 18:48 | Permalink

    Hi Justin,

    Yes ill keep you informed of the regmark work as it progreses. Naturally.

    And I didnt mean to slag your blog off, its great that you are discussing this stuff here. I mostly said that because I had to fight a bit to get what I had written to render properly. Anyway, you are totally right that posts here are exposed to a more diverse audience than emails would be. Ill just have to RTFM on the markup here I guess. :-)

    BTW, right now im thinking that the best solution for the problem above is to create a new special purpose regop, lets call it (*LOGMARK), that pushes the most recent MARK into @REGMARK and then fails. Therefore doing an exhaustive search, but populating an array to see what matched without having eval overhead.


  7. Posted November 16, 2006 at 19:08 | Permalink

    That (*LOGMARK) regop idea makes a lot of sense to me, given that it allows this multiple-rules-in-one-trie pattern to be expressed cleanly.

    don’t worry too much about the Markdown markup — I can always fix anything that’s misformatted anyway ;)

  8. demerphq
    Posted November 16, 2006 at 19:49 | Permalink

    BTW, there was something else about this post that reminded me of a project i have on the slow burner right now. Id like to provide something like a B interface for the regex engine which would amongst other things allow access to data such as the longest fixed string that must appear in the pattern.

    Its clear to me from this post that the longest fixed string data is sufficiently useful that I should provide a tool to access it regardless of the deeper introspection code I have in mind.

    Something like:

    my $pat=qr/..../;
    my $must=regmust($pat);

    which would return the longest string that must match for the pattern to match. This data is actually easily obtained from XS so I can provide a wrapper in the re module to do it.

  9. Posted November 16, 2006 at 20:02 | Permalink

    wow — that would be very useful, alright… I’ve been writing some horrible pure-perl regex-parsing code instead, but logically it would be much cooler to get the re engine to do it instead.

  10. demerphq
    Posted November 17, 2006 at 01:01 | Permalink

    Ok, I’ve posted a patch to add ‘regmust()’ to, which would allow you to do something like:

     my $qr=qr/here .* there/x;
     my ($anchored,$floating)=regmust($qr);
     print "anchored:'$anchored'\nfloating:'$floating'\n";

    and get


    Anyway, presumably this should be available in bleadperl soon, or you can apply the patch yourself from the mailing list. It should apply ok to any relatively recent bleadperl. See the pod changes in the patch for more details.

  11. demerphq
    Posted November 17, 2006 at 11:32 | Permalink

    FYI: re::regmust() was added to bleadperl in patch #29299

  12. Posted November 19, 2006 at 21:44 | Permalink

    fwiw, I’m using regmust in the jm_re2c_hacks branch now to extract required text strings… very handy ;)

    let me know if you get a chance to do the regmark thing. I think with that, I can turn the current speedup approach on its head — instead of selecting a small subset of rules with “simple enough” base strings, I could simply just put virtually all of the body rules into a single (?:foo|bar|baz) alternation, run that single regexp match OP, and get the matched rules using the @REGMARK array! that would be cool — and hopefully perform faster than the current body rule approach.

  13. demerphq
    Posted December 6, 2006 at 19:07 | Permalink

    Just to let you know I havent forgotten you here. In fact ive been lurking in IRC (#Spamassissin) wondering if youd pop up.

    Anyway, Ive been focusing on getting certain Perl 5.10 objectives done before doing more work on the stuff you need. Although afaik I have fixed the issue with $REGMARK in (?{}) and (??{}) blocks, but I havent added a better way yet.

    Partly this is because I havent exactly decided how it will work, and partly because ive been focusing on getting the new pluggable regex engine interface sorted out first. What that means is that at some hypothetical point in the future you will be able to use a Spamassissin specific regex engine if you wish, with whatever custom stuff you need. Of course id still like to get some of the stuff we have discussed done in time for 5.10 but the heat is now off. We can swap out regex engine easily so that means you can do custom stuff at whatever time you like.

    Anyway, keep an eye on #irc and say hello some time.

  14. Posted December 6, 2006 at 19:43 | Permalink

    hey —

    yeah, I don’t visit IRC often ;) I’ll start dropping in, too. what’s your nick?

    do you mean the REGMARK thing works now, as described in ? I’ll have a try with blead, if so… I’ve been following perl5-porters again, but must have missed this one going in. That’d be cool!

  15. demerphq
    Posted December 6, 2006 at 22:05 | Permalink

    My nick is dmq, and yeah, $REGMARK as explained earlier should work fine in regex eval code.

  16. Posted December 16, 2006 at 13:41 | Permalink

    hey — I’m never in IRC. stupid work has me too busy in the afternoon/early-evenings! sorry about that. ;)

    Anyway — I’ve been playing around with the REGMARK feature, and I think your thoughts might be best on this one.

    Here’s the sub that the entire current SA body ruleset compiles into:

    Output of perl -Mre=debug:

    Right now, this performs slower than the basic SpamAssassin body tests (with each RE applied separately instead of one big set of alternations).

    I tried the other approach, too, where a limited subset of rules with easy “bases” are extracted (strings with no metachars), but that doesn’t provide any speed-up either compared to existing SpamAssassin :(

    What would you suggest? Should I think more about removing a different subset of regexps to simplify (and make more trie-friendly) the alternation? what kind of regexps should be avoided?

  17. demerphq
    Posted December 16, 2006 at 14:26 | Permalink

    Ok, so when you look at the debug output you can see that the program consists of a bunch of branches. This means that the trie optimisation hasnt kicked in and that perl is doing N comparisons per position in the text being searched. Thats what we want to avoid. The strategy is to try to prune our search space before resorting to the heavy machinery.

    What im thinking is you want to do something like this:

    Identify a non trivial fixed string that must appear in each pattern (maybe using some logic to handle both fixed and floating components). If necessary to do this unroll existing patterns. IOW, you have the relation

    /(foo|bar)baz/ => RULE1

    so you convert that to

    /foo baz/ => RULE1 /bar baz/=> RULE1

    In the case of a pattern like

    /foo bar \d+ blah blah blah/=>RULE2 set up something like:

    /foo bar/ => RULE2a /blah blah blah/ => RULE2b

    Once you have reduced your patterns down to this type of thing create a pattern out of the elements along the lines of the following:

    /(?:fnorble(:RULE1)|foo bar(:RULE2a)|blah blah blah(:RULE2b)) (?{%count{$REGMARK}++})(FAIL)/x

    and then apply that to the full_mail. Not line by line.

    Once thats done the %count array will be populated with the rules that need to be tried. Now when you loop over the pattern line by line you can apply only those rules that might match. If none of the rules match, then the mail is checked.

    So for instance you might have code like

    my @rules;
    foreach my $k (keys %count) {
       if ($k=~/\d$/ || ($k=~s/a$/b/ && $count{$k} )){
          push @rules,$k
    foreach my $line (@lines_of_mail) {
       foreach my $rule (@[email protected]}) {
           if ($line=~$rule) { ... }


  18. Posted December 16, 2006 at 16:19 | Permalink

    ok, that’s good news — most of the code written for the rule2xs fast matcher can be reused. Good to know it was on the right track, too. ;)

    I’ll take a look at that.

    Also, good point that it will be faster performed over the entire body string that way — it hadn’t occurred to me that it’s safe to run rules over the entire body since there’s no danger of exponentional .* matching.