More parallel string-match algorithm hacking: re2xs

Last week, Matt Sergeant released a great little perl script, re2xs, which takes a set of simplified regexps, converts them to the subset of regular expression language supported by re2c, then uses that to build an XS module.

In other words, it offers the chance for SpamAssassin rules to be compiled into a trie structure in C code to match multiple patterns in parallel. Given that this is then compiled down to native machine code, it has the potential to be the fastest method possible, apart from using dedicated hardware co-processors.

Sure enough, Matt’s results were pretty good — he says, ‘I managed to match 10k regexps against 10k strings in 0.3s with it, which I think is fairly good.’ ;)

Unfortunately, turning this into something that works with SpamAssassin hasn’t been quite so easy. SpamAssassin rules are free to use the full perl regular expression language — and this language supports many features that re2c’s subset does not. So we need to extract/translate the rule regexps to simplified subsets. This has generally been the case with all parallel matching systems, anyway, so that’s not a massive problem.

More problematically, re2c itself does not support nested patterns — if one token is contained within another, e.g. “FOO” within “FOOD”, then the subsumed token will not be listed as a match. SpamAssassin rules, of course, are free to overlap or subsume each other, so an automated way to detect this is required.

For simple text patterns, this is easy enough to do using substring matching – e.g. “FOOD” =~ /\QFOO\E/ . Unfortunately, once any kind of sophisticated regexp functionality is available, this is no longer the case: consider /FOO*OD/ vs /FOO/ , /F[A-Z]OD/ vs /FO[M-P]/ , /F(?:OO|U)D/ vs /F(?:O|UU)?O/ .

The only way to do this is to either (a) fully parse the regexp, build the trie, and basically reimplement most of re2c to do this in advance; or (b) change the trie-generation code in re2c to support states returning multiple patterns, as Aho-Corasick does.

I requested support for this in re2c, but got a brush-off, unfortunately. So work continues…

In other news, that food poisoning thing I had back at the end of June has lingered on. It’s now pretty clear that it isn’t food poisoning or a stomach bug… but I still have no idea what it actually is. No fun :(

This entry was posted in Uncategorized. Bookmark the permalink. Both comments and trackbacks are currently closed.

5 Comments

  1. Luke
    Posted August 17, 2006 at 14:15 | Permalink

    “Last week, Matt Sergeant released a great little perl script, re2xs, which takes a set of simplified regexps, converts them to the subset of regular expression language supported by re2c, then uses that to build an XS module.”

    This is one of the alltime great opening paragraphs of modern literature.

    L

  2. Posted August 17, 2006 at 14:26 | Permalink

    yes, it’s one of those techie entries. Sorry about that ;) does the discussion of my stomach woes not help?

  3. Matt Sergeant
    Posted August 17, 2006 at 18:45 | Permalink

    Hey! I resemble that remark.

  4. Posted January 24, 2007 at 11:31 | Permalink

    If you need to order regexps so that a simpler one is listed before the regexp which implies it, may I recommend my Regexp::Compare (AFAIK still waiting for its first user :-) )? One problem is that Regexp::Compare doesn’t compare all regexps – ordering i.e. regexps containing Perl code is fundamentally impossible, and even many simpler cases proved too hard for me – but I’d be very interested to know whether it’s good enough…

  5. Posted January 24, 2007 at 12:10 | Permalink

    Vaclav — hmm, that looks interesting! I’ll keep it in mind…