A Released Perl With Trie-based Regexps!

Good news! From the Perl 5.9.2 ‘perl592delta’ change log:

The regexp engine now implements the trie optimization: it’s able to factorize common prefixes and suffixes in regular expressions. A new special variable, ${^RE_TRIE_MAXBUF}, has been added to fine-tune this optimization.

In other words, the trie-optimization patch contributed by demerphq back in March 2005 is now in a released build of Perl. Yay!

Here’s a writeup of what it does:

A trie is a way of storing keys in a tree structure where the branching logic is determined by the value of the digits of the key. For example, if we have “car”, “cart”, “carp”, “call”, “cull” and “cars”, we can build a trie like this:

        c + a + r + t
          |   |   |
          |   |   + p
          |   |   |
          |   |   + s
          |   | 
          |   + l - l
          |   
          + u - l - l
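
Here’s a minimal sketch of that same trie as nested Perl hashes, just to make the structure concrete (the word list mirrors the diagram; the end-of-word marker is my own convention, not anything from the patch):

        # build a trie from a word list: each character descends one
        # level, and an empty-string key marks "a word ends here"
        my %trie;
        for my $word (qw(car cart carp call cull cars)) {
            my $node = \%trie;
            $node = $node->{$_} ||= {} for split //, $word;
            $node->{''} = 1;
        }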

What the patch does is make /a | list | of | words/ into a trie that matches those words. This means that we can efficiently tell if any of the words are at a given location in a string, by simply walking the string and the trie at the same time. In many cases we can rule out the entire list by looking at only one character of the input. The way perl previously handled this would require looking at N chars, where N is the number of words involved. (BTW: that’s the beauty of a trie; its lookup time depends not on the number of words it stores, but on the length of the key being looked up.)
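
On a trie-enabled perl you can watch this happen by asking the engine to dump its compiled program via the re pragma; the exact dump format varies by version, but on 5.9.2+ a plain-word alternation should show up as a single TRIE node rather than a chain of BRANCH ops:

        # dump the compiled program and the match trace to STDERR;
        # on perl >= 5.9.2 the alternation below compiles to a TRIE
        use re 'debug';
        "call me" =~ /car|cart|carp|call|cull|cars/;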

SpamAssassin, of course, both (a) is very regular-expression-intensive and (b) searches a single block of text for a large number of independent patterns in parallel. I’d love to see someone come up with a patch to SpamAssassin that uses trie-compatible regexps when the perl version is >= 5.9.2, and gets increased performance that way. hint ;)

BTW, the Regexp::Trie module on CPAN is related, in that it, like Regexp::Optimizer, Regex::PreSuf, or Regexp::Assemble, will compile a list of words or regular expressions into a super-efficient trie-style regexp. Without the trie patch to the regexp engine itself, however, this would be a minor efficiency tweak at best; although, having said that, Regexp::Assemble’s POD notes:

You should realise that large numbers of alternations are processed in perl’s regular expression engine in O(n) time, not O(1). If you are still having performance problems, you should look at using a trie. Note that Perl’s own regular expression engine will implement trie optimisations in perl 5.10 (they are already available in perl 5.9.3 if you want to try them out). Regexp::Assemble will do the right thing when it knows it’s running on a trie’d perl. (At least in some version after this one).
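
For reference, feeding a word list through Regexp::Assemble looks roughly like this (new/add/re are its documented methods; the word list and the sample output are just my illustration):

        use Regexp::Assemble;

        # assemble several literal words into one prefix-sharing pattern
        my $ra = Regexp::Assemble->new;
        $ra->add($_) for qw(car cart carp call cull cars);
        my $re = $ra->re;   # yields something like qr/c(?:a(?:r[pst]?|ll)|ull)/
        print "matched\n" if "cart" =~ $re;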

(PS: interestingly, demerphq mentioned back in March 2005 that he was working on Aho-Corasick matching next. A-C is a great parallel-matching algorithm, and I would imagine it would increase performance still further. I wonder what happened to that…)


16 Comments

  1. Posted July 7, 2006 at 19:48 | Permalink

    Heh, I guess SA might finally get a mass-regexp matching system which does better than the hacky-but-useful pre-parser thing we did at Deersoft.

  2. Matt Sergeant
    Posted July 7, 2006 at 20:46 | Permalink

    Aho-Corasick is a trie-based matcher. In essence that’s exactly what demerphq has created here.

  3. Posted July 8, 2006 at 00:35 | Permalink

    Craig — that Deersoft matcher, frankly, rocked in terms of results ;)

    Matt — demerphq mentioned that the current one is not Aho-Corasick; not sure what the difference is, but I got the impression he considers A-C to be superior to whatever’s in 5.9.2-3.

  4. Posted July 8, 2006 at 01:59 | Permalink

    Of course, that Deersoft matcher was designed by a card-carrying genius.

  5. Posted July 8, 2006 at 15:13 | Permalink

    What kind of thing was that?

  6. Posted July 9, 2006 at 15:04 | Permalink

    Aristotle — sadly, I don’t think I can talk too much about it. It wasn’t that fancy an algorithm, but I can say, at least, that it matched the workload very well…

  7. Posted July 9, 2006 at 17:03 | Permalink

    Ah, it wasn’t published, OK.

  8. Randy W. Sims
    Posted July 10, 2006 at 00:50 | Permalink

    Aho-Corasick was implemented in blead-perl in early June, thanks to Yves (demerphq).

  9. demerphq
    Posted July 10, 2006 at 09:13 | Permalink

    I thought I’d add a little info here, since you were so kind as to write this up. (Thanks, by the way.)

    Trie optimization occurs when a pattern contains a list of alternations that do not use any regex metacharacters. For instance,

    /foo|bar|baz|[Ff]norble/

    will result in the /foo|bar|baz/ part of the alternation sequence being converted to a single TRIE regop. /[Ff]norble/ won’t be included, as the optimization code currently doesn’t convert /[Ff]norble/ into /Fnorble|fnorble/. This means the regex program will still have a branch alternation structure, and that the Aho-Corasick start-position logic won’t be available.

    When the alternation can be converted to a TRIE as a whole, such as in the case of

    /foo|bar|baz|Fnorble|fnorble/
    

    then the entire branch-alternation structure is removed, giving O(1) lookup on the strings stored within. If this TRIE regop is sufficiently close to the beginning of the pattern/program, and there are no better candidates for the optimizer to use for start-point optimizations (such as the presence of a mandatory constant string somewhere in the pattern), then the TRIE will be upgraded to an Aho-Corasick structure.

    Aho-Corasick is useful because it allows Perl to efficiently scan the search string looking for the point at which the trie will match, without having to retry from a given starting point[1] when encountering a partial match. So for instance, if you were looking for /foo|bar|ooh|ahh/ and the string contained a lot of stuff like ‘fobaooah’ in it, an alternation or TRIE would try to match ‘fo’ and then fail when it saw the ‘b’. After that fail, the TRIE would advance the start point by 1 and then try again. The pre-TRIE alternation would be even worse, trying to match ‘bar’, ‘ooh’ and ‘ahh’ at the same start point before failing. Aho-Corasick would simply continue on, knowing that the ‘b’ that failed the ‘foo’ possibility both precludes any match starting at the ‘o’ and also starts a different possibility, ‘bar’.

    In addition to this, Aho-Corasick essentially includes built-in (?=[letter]) logic. This means that when the string contains mostly non-matching chars, the engine can scream through the non-matching junk looking for a character that actually starts a possible matching sequence.
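
    You can write a crude version of that gate by hand, incidentally; the lookahead below is purely my illustration of the logic, not something the engine needs you to add:

      # a first-letter gate in front of the alternation: roughly the
      # check Aho-Corasick performs for free while scanning junk
      my $re = qr/(?=[fboa])(?:foo|bar|ooh|ahh)/;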

    Other optimisations have been added as well. Char classes with a single char in them are silently converted to their literal equivalent. This means that

    /foo[ ]bar/

    is now as efficient as

    /foo bar/

    (In older perls it is much less efficient, partly because the char class means that the Fast-Boyer-Moore optimisation will look for ‘foo’ and not ‘foo bar’, and FBM matching is really fast.)

    Also, alternations like (a|b|c|d) are now only a tiny bit slower than the char-class equivalent [abcd]. But something like /a|b|c|d|foo|bar/ should be considerably faster than /[abcd]|foo|bar/.
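
    If you want to check claims like this on your own build, a quick Benchmark comparison along these lines will do (the test string here is just an illustration):

      use Benchmark qw(cmpthese);

      # compare the plain alternation against the mixed form
      my $str = 'nothing interesting here until foo';
      cmpthese(-2, {
          alternation => sub { $str =~ /a|b|c|d|foo|bar/ },
          char_class  => sub { $str =~ /[abcd]|foo|bar/ },
      });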

    Future plans include extending the TRIE optimisation so that it handles

    /blah[!.]|foo[ld]/

    If you see mention of JTRIES (jump-tries) getting applied to blead then you’ll know I’ve completed this.

    Also, I thought I’d point something out: don’t stick really huge strings in an alternation (or, if possible, in the regex at all). Constant strings are chunked into 255-byte sections at most. If you try to match a constant string of 500 chars, it will get broken into two pieces, and if those pieces are in an alternation, the second chunk is sufficient to invalidate that branch for consideration by the TRIE optimisation. If and when I get JTRIES working, this will no longer be true.

    As a last comment: a while ago I looked into how SA does its pattern matching a bit, and it’s structured in such a way that it probably isn’t getting the best out of the RE engine. For instance, regex-based rules should probably come in two flavours, a simple part and a complex part. The simple part should be some pattern (preferably a constant string) that must match for the complex pattern to match. All of these simple parts could be rolled together into a single pattern (at least on TRIE-enabled perls) that would be used to scan the ENTIRE mail (not line by line). A hash lookup on the matching string would then give access to the complex patterns, which could then be applied to see if a real match was possible. This would eliminate many no-match patterns in a single sweep of the email.

    If anybody has questions I’d be happy to address them on p5p or directly by mail. (I’ve requested reply notifications from this, so I’ll get a mail if you reply here.)

    Hope this is useful and helpful to people, and Justin, thanks again for posting this.

    Oh, as Randy points out, both TRIE and Aho-Corasick (which is an extension of TRIE matching) are in bleadperl already. 5.10 will have as many optimisations as I can get done in time and as the pumpking is willing to apply.

    Cheers, Yves/demerphq

    [1] You could call this backtracking, but it’s a little different, in that it’s not using the stack to do its business.

  10. Posted July 10, 2006 at 13:38 | Permalink

    Yves, Randy — thanks for the comments!

    Very interesting to see that A-C is now in blead; I didn’t know that! I can’t wait for 5.9.4; it sounds like significant enhancements are in the pipeline there. ;)

    Your suggestion for SpamAssassin’s matching is very interesting; would it be possible for you to open a bugzilla entry at http://issues.apache.org/SpamAssassin/enter_bug.cgi and suggest it there?

    I expect we may have to make a few changes to SpamAssassin to take advantage of the trie functionality… given enough of a speed boost, I doubt anyone will worry about that though. ;)

  11. Posted July 10, 2006 at 19:18 | Permalink

    As a last comment: a while ago I looked into how SA does its pattern matching a bit, and it’s structured in such a way that it probably isn’t getting the best out of the RE engine. For instance, regex-based rules should probably come in two flavours, a simple part and a complex part. The simple part should be some pattern (preferably a constant string) that must match for the complex pattern to match. All of these simple parts could be rolled together into a single pattern (at least on TRIE-enabled perls) that would be used to scan the ENTIRE mail (not line by line). A hash lookup on the matching string would then give access to the complex patterns, which could then be applied to see if a real match was possible. This would eliminate many no-match patterns in a single sweep of the email.

    Boy, does that sound familiar! I wonder if that “simple part” matching could be efficiently implemented as a tree scanner, perhaps implemented in something like flex. I bet you could even, say, write a script to read the rules files and automatically generate a flex input file. Wouldn’t that be neat?

    C

  12. demerphq
    Posted July 10, 2006 at 21:04 | Permalink

    I wonder if that “simple part” matching could be efficiently implemented as a tree scanner, perhaps implemented in something like flex.

    Not sure if I’m following you right. What’s the problem with doing

    my $pat="(".join("|",@simple).")";

    The Aho-Corasick/trie combo should handle that pretty efficiently (possibly not as efficiently as something like flex), and it’s easy to implement.

    I was thinking the difficult part of this is getting the rules to be constructed in such a way that they tend to have simple parts. In my experience people tend to write complex patterns after a while, especially as, in earlier perls, a little hand-tweaking can make a big difference in the run time of a pattern. Unfortunately it’s exactly that tweaking that disables the perl-level optimisations which will give better performance in 5.10.

  13. Posted July 10, 2006 at 22:45 | Permalink

    my $pat="(".join("|",@simple).")";

    That’ll work great for telling you that something matched. But which “shortcut” rule matched? The nice thing about the flex approach would be that you scan once, and end up in a scanner leaf node which just lists off all the rules which apply. Getting the list of full-blown rules which need to be checked will be a bit trickier with your code sample, since you can get $&, but what you actually want is which part of the combined RE caused the match. And then you need to iterate with /g to get all the matches.

    Is there an easy way in

    $search =~ /some|foo/;

    to tell whether it was the “some” part or the “foo” part which matched? Without having to do another search on $&?

  14. demerphq
    Posted July 10, 2006 at 23:25 | Permalink

    I’d expect to see something like this:

      use re 'eval'; # needed because $pat is interpolated into a
                     # pattern that also contains a (?{ }) code block

      my %complex=some_func(@data); # simple string => arrayref of complex patterns
      my $pat=join("|",sort keys %complex);
      our %check; # must be our: the (?{ }) block writes to it at match time
      # record each alternative that matches, then force failure with (?!)
      # so the engine keeps scanning; sigh, we need a FAIL regop.
      $pat=qr/($pat)(?{$check{$1}++})(?!)/;
      foreach my $mail (@mails) {
        %check=();
        $mail=~$pat; # always "fails", but fills %check as a side effect
        foreach my $hit (keys %check) {
          foreach my $complex (@{$complex{$hit}}) {
            do_something() if $mail=~$complex;
          }
        }
      }
    

    Obviously untested. You don’t really want to use /g, because it won’t get all the possible matches; it will get all the possible non-overlapping matches, such that leftmost-longest applies.
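
    A quick illustration of that /g caveat, with a made-up string; the overlapping ‘arf’ match is never reported because ‘bar’ consumed those characters first:

      # /g reports non-overlapping matches only: 'bar' matches at
      # position 0, so the overlapping 'arf' at position 1 is skipped
      my @hits = ('barfoo' =~ /bar|arf|foo/g); # yields ('bar', 'foo')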

    Sure, you could use lexer technology for all of this, but I wonder if it would be as easy to do…

    BTW, it’s not wise to use $& if regex performance is a priority. Using it once, anywhere in a program, results in a performance penalty for every regex in that program; that is essentially impossible to fix, due to Perl’s dynamic nature.
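
    The usual workaround, for what it’s worth, is to capture explicitly instead of touching $&:

      # wrap the whole pattern in capturing parens and read $1 instead
      # of $&, avoiding the program-wide copy penalty
      if ( $search =~ /(some|foo)/ ) {
          my $matched = $1; # the same text $& would have held
      }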

    Sorry about the indentation, I don’t know how to fix it… (jm: edited to fix it)

  15. Posted July 11, 2006 at 18:15 | Permalink

    There’s one (probably minor) hitch, which is that multiple rules may end up having the same fixed-length check RE, so $complex{$hit} needs to give a list of rules rather than just a single rule. Also, the above limits the “simple” rule checking to static strings only, since otherwise $1 in the RE-embedded code won’t actually be a key in %complex; that is, the code depends on the pattern string being the same as the matched string. Probably there’ll at least need to be some case-shifting to help ensure that. I can’t, off the top of my head, remember from 2-3 years ago whether there were any non-static-string cases that it made sense to use a pre-check for. Oh, and also, you might want to check for multiple static strings in order to decide on trying a complex rule, e.g.:

    $complex = qr/blahblah(?:$uglything)foofoo/;

    You’d want to ensure hits on both “blahblah” and “foofoo” (maybe even in order) before trying to match the full $complex. That’s trickier to do the logic on with the join()ed pattern approach.

  16. demerphq
    Posted July 11, 2006 at 19:26 | Permalink

    There’s one (probably minor) hitch, which is that multiple rules may end up having the same fixed-length check RE, so $complex{$hit} needs to give a list of rules rather than just a single rule.

    Hence the

    foreach my $complex (@{$complex{$hit}}) {

    :-)

    the above limits the “simple” rule checking to static strings only,

    Yes, that’s right. The point being that by using static strings, you get the speed of AC matching for the simple component. When I said “simple”, I meant it. :-)

    You’d want to ensure hits on both “blahblah” and “foofoo” (maybe even in order) before trying to match the full $complex. That’s trickier to do the logic on with the join()ed pattern approach

    Yeah, I suppose. I imagine you could come up with a number of strategies to deal with that type of case. For instance, if each complex rule had a list of must-match simple strings associated with it, then you could scan the rule list looking for rules for which all the simple criteria were met, and only fire those rules. It would scale linearly with the number of rules, which is kinda annoying, but the checks would be pretty efficient compared to running many passes over the input string.
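
    A rough sketch of that strategy, with entirely made-up rule data (the required-literals list and the pattern are just placeholders, and $mail holds the message text as in the earlier code):

      # each complex pattern carries the literal strings that must all
      # be present before it is worth running the expensive match
      my @rules = (
          { required => ['blahblah', 'foofoo'], pattern => qr/blahblah\d+foofoo/ },
      );
      for my $rule (@rules) {
          # cheap gate: skip unless every required literal appears
          next if grep { index($mail, $_) < 0 } @{ $rule->{required} };
          do_something() if $mail =~ $rule->{pattern};
      }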

    But it depends on what point of view you are looking at this from. I was originally thinking of this mostly in terms of how you would structure things to use the RE engine most efficiently, but other considerations might lead to different design decisions.

    Another factor is the relationship between the size of the data being scanned and the number of rules there are. How those numbers look would, IMO, strongly influence the design chosen.