RAII in perl

Suppose you have matching start() and end() functions. You want to ensure that each start() is always matched with its corresponding end(), without having to explicitly pepper your code with calls to that function. Here’s a good way to do it in perl — create a guard object:

package Scoper;
sub new {
  my $class = shift; bless({ func => shift },$class);
}
sub DESTROY {
  my $self = shift; $self->{func}->();
}

Here’s an example of its use:

{
  start();
  my $s = Scoper->new(sub { end(); });
  [... do something...]
}
[at this point, end() has been called, even if a die() occurred]

The idea is simply to use DESTROY to perform whatever the cleanup operation is. Once the $s object goes out of scope, it’ll be deleted by perl’s GC, in the process of which, calling $s->DESTROY(). In other words, it’s using the GC for its own ends.

Unlike an eval { } block to catch die()s, this will even be called if exit() or POSIX::exit() is called. (POSIX::_exit(), however, skips DESTROY.)

This is a pretty old C++ pattern — Resource Acquisition Is Initialization. C++’s auto_ptr template class is the best-known example in that language. Here’s a perl.com article on its use in perl, from last year, mostly regarding the CPAN module Object::Destroyer. To be honest, though, it’s 6 lines of code — not sure if that warrants a CPAN module! ;)

RAII is used in SpamAssassin, in the Mail::SpamAssassin::Util::ScopedTimer class.

Tags: , , , , , ,

Comments (4)

A historical DailyWTF moment

Today, in work, we wound up discussing this classic DailyWTF.com article — “Remember, the enterprisocity of an application is directly proportionate to the number of constants defined”:

public class SqlWords
{
  public const string SELECT = " SELECT ";
  public const string TOP = " TOP ";
  public const string DISTINCT = " DISTINCT ";
  /* etc. */
}

public class SqlQueries
{
  public const string SELECT_ACTIVE_PRODCUTS =
    SqlWords.SELECT +
    SqlWords.STAR +
    SqlWords.FROM +
    SqlTables.PRODUCTS +
    SqlWords.WHERE +
    SqlColumns.PRODUCTS_ISACTIVE +
    SqlWords.EQUALS +
    SqlMisc.NUMBERS_ONE;
  /* etc. */
}

This made me recall the legendary source code for the original Bourne shell, in Version 7 Unix. As this article notes:

Steve Bourne, at Bell Labs, worked on his version of shell starting from 1974 and this shell was released in 1978 as Bourne shell. Steve previously was involved with the development of Algol-68 compiler and he transferred general approach and some syntax sugar to his new project.

“Some syntax sugar” is an understatement. Here’s an example, from cmd.c:

LOCAL REGPTR    syncase(esym)
        REG INT esym;
{
        skipnl();
        IF wdval==esym
        THEN    return(0);
        ELSE    REG REGPTR      r=getstak(REGTYPE);
                r->regptr=0;
                LOOP wdarg->argnxt=r->regptr;
                     r->regptr=wdarg;
                     IF wdval ORF ( word()!=')' ANDF wdval!='|' )
                     THEN synbad();
                     FI
                     IF wdval=='|'
                     THEN word();
                     ELSE break;
                     FI
                POOL
                r->regcom=cmd(0,NLFLG|MTFLG);
                IF wdval==ECSYM
                THEN    r->regnxt=syncase(esym);
                ELSE    chksym(esym);
                        r->regnxt=0;
                FI
                return(r);
        FI
}

Here are the #define macros Bourne used to “Algolify” the C compiler, in mac.h:

/*
 *      UNIX shell
 *
 *      S. R. Bourne
 *      Bell Telephone Laboratories
 *
 */

#define LOCAL   static
#define PROC    extern
#define TYPE    typedef
#define STRUCT  TYPE struct
#define UNION   TYPE union
#define REG     register

#define IF      if(
#define THEN    ){
#define ELSE    } else {
#define ELIF    } else if (
#define FI      ;}

#define BEGIN   {
#define END     }
#define SWITCH  switch(
#define IN      ){
#define ENDSW   }
#define FOR     for(
#define WHILE   while(
#define DO      ){
#define OD      ;}
#define REP     do{
#define PER     }while(
#define DONE    );
#define LOOP    for(;;){
#define POOL    }


#define SKIP    ;
#define DIV     /
#define REM     %
#define NEQ     ^
#define ANDF    &&
#define ORF     ||

#define TRUE    (-1)
#define FALSE   0
#define LOBYTE  0377
#define STRIP   0177
#define QUOTE   0200

#define EOF     0
#define NL      '\n'
#define SP      ' '
#define LQ      '`'
#define RQ      '\''
#define MINUS   '-'
#define COLON   ':'

#define MAX(a,b)        ((a)>(b)?(a):(b))

Having said all that, the Bourne shell was an awesome achievement; many of the coding constructs we still use in modern Bash scripts, 30 years later, are identical to the original design.

Tags: , , , , , , , , , , , ,

Comments (8)

Rule Discovery Progress Update

Back in March, I wrote a post about a new rule discovery algorithm I’d come up with, based on the BLAST bioinformatics algorithm. I’m still hacking on that; it’s gradually meandering towards production status, as time permits, so here’s an update on that progress.

There have been various tweaks to improve memory efficiency; I won’t go into those here, since they’re all in SVN history anyway. But the results are that the algorithm can now extract rules from 3500 spam and 50000 ham messages without consuming more than 36 MB of RAM, or hitting disk. It can also now generate a SpamAssassin rules file directly, and apply a basic set of QA parameters (required hit rate, required length of pattern, etc.).

On top of this, I’ve come up with a workflow to automatically generate a usable batch of rules, on a daily basis, from a spam and ham corpus. This works as follows:

  • Take a sample of the past 4 days traffic from our spamtrap network. Today this was about 3000 messages.

  • add the hand-vetted spam from my own accounts over the same period (this helps reduce bias, since spamtraps tend to collect a certain type of spam), about 3400 messages.

  • discard spams that scored over 10 points (to concentrate on the stuff we’re missing).

  • Pass the remaining 3517 spams, and text strings from over 50000 nonspam messages, into the “seek-phrases-in-log” script, specifying a minimum pattern length of 30 characters, and a minimum hitrate of 1% (in today’s corpus, a rule would have to hit at least 34 messages to qualify).

  • That script gronks for a couple of minutes, then produces an output rules file, in this case containing 28 rules, for human vetting. (Since I’ve started this workflow, I’ve only had to remove a couple of rules at this step, and not for false positives; instead, they were leaking spamtrap addresses.)

  • Once I’ve vetted it, I check it into rulesrc/sandbox/jm/20_sought.cf for testing by the SpamAssassin rule QA system.

The QA results for the ruleset from yesterday (Aug 3) can be seen here, and give a pretty good idea of how these rules have been performing over the past week or two; out of the nearly 70000 messages hit by the rules, only 2 ham mails are hit — 0.0009%.

In fact, I measured the ruleset’s overall performance in the logs provided by the 4 mass-check contributors who provided up-to-date data in yesterday’s nightly mass-check; bb-jm, jm, daf, dos, and theo (all SpamAssassin committers):

Contributor Hits Spams Percent
bb-jm 4249 24996 17.00%
jm 3450 14994 23.00%
daf 1236 35563 3.48%
dos 32867 100223 32.79%
theo 28077 382562 7.34%

(bb-jm and jm are both me; they scan different subsets of my mail.)

The “Percent” column measures the percentage of their spam collection that is hit by at least one of these rules; it works out to an average of 16.72% across all contributors. This is underestimating the true hitrate on “fresh” spam, too, since the mass-check corpora also include some really old spam collections (daf’s collection, for example, looks like it hasn’t been updated since the start of July).

Even better, a look at the score-map for these rules shows that they are, indeed, hitting the low-scoring spam that other rules don’t hit.

That’s pretty good going for an entirely-automated ruleset!

The next step is to come up with scores, and publish these for end-user use. I haven’t figured out how this’ll work yet; possibly we could even put them into the default “sa-update” channel, although the automated nature of these rules may mean this isn’t a goer.

If you’re interested, the hits-over-time graph for one of the rules (body JM_SEEK_ICZPZW / Home Networking For Dummies 3rd Edition \$10 /) can be viewed here.

Tags: , , , , ,

Comments (2)

Ohloh

Ohloh is fun — a social networking site for free software and open source developers! Well, I guess Advogato was there first, but this is quite a bit more web2.0-compliant. ;) There are some cool features, like neat Sparkline graphs of commits per person over time. nifty.

Here I am, and here’s the Apache SpamAssassin project page.

Tags: , , , , , ,

Comments (1)

A SpamAssassin rule-discovery algorithm

Just to get a little techie again… here’s a short article on a new algorithm I’ve come up with.

Text-matching rule-based anti-spam systems are pretty common — SpamAssassin, of course, is probably the most well-known, and of course the proprietary apps built on SpamAssassin also use this. However, other proprietary apps also seem to use similar techniques, such as Symantec’s Brightmail and MessageLabs’ scanner (hi Matt ;) — and doubtless there are others. As a result, ways to write rules quickly and effectively are valuable.

So far, most SpamAssassin text rules are manually developed; somebody looks at a few spam samples, spots common phrases, and writes a rule to match that. It’d be great to automate more of that work. Here’s an algorithm I’ve developed to perform this in a memory-efficient and time-efficient way. I’m quite proud of this, so thought it was worth a blog posting. ;)

Corpus collection

First, we collect a corpus of spam and “ham” (non-spam) mails. Standard enough, although in this case it helps to try to keep it to a specific type of mail (for example, a recent stock spam run, or a run from the OEM spammer).

Typically, a simple “grep” will work here, as long as the source corpus is all spam anyway; a small number of irrelevant messages can be left in, as long as the majority 80% or so are variations on the target message set. (The SpamAssassin mass-check tool can now perform this on the fly, which is helpful, using the new ‘GrepRenderedBody’ mass-check plugin.)

Rendering

Next, for each spam message, render the body. This involves:

  • decoding MIME structure
  • discarding non-textual parts, or parts that are not presented to the viewer by default in common end-user MUAs (such as attachments)
  • decoding quoted-printable and base64 encoding
  • rendering HTML, again based on the behaviour of the HTML renderers used in common end-user MUAs
  • normalising whitespace, “this is\na \ntest” -> “this is a test”

All pretty basic stuff, and performed by the SpamAssassin “body” rendering process during a “mass-check” operation. A SpamAssassin plugin outputs each message’s body string to a log file.

Next, we take the two log files, and process them using the following algorithm:

N-gram Extraction

Iterate through each mail message in the spam set. Each message is assigned a short message ID number. Cut off all but the first 32 kbytes of the text (for this algorithm, I think it’s safe to assume that anything past 32 KB will not be a useful place for spammers to place their spam text). Save a copy of this shortened text string for the later “collapse patterns” step.

Split the text into “words” — ie. space-separated chunks of non-whitespace chars. Compress each “word” into a shorter ID to save space:

"this is a test" => "a b c d"

(The compression dictionary used here is shared between all messages, and also needs to allow reverse lookups.)

Then tokenize the message into 2-word and 3-word phrase snippets (also known as N-grams):

"a b c d" => [ "a b", "b c", "c d", "a b c", "b c d" ]

Remove duplicate N-grams, so each N-gram only appears once per message.

For each N-gram token in this token set, increment a counter in a global “token count” hashtable, and add the message ID to the token’s entry in a “message subset hit” table.

Next, process the ham set. Perform the same algorithm, except: don’t keep the shortened text strings, don’t cut at 32KB, and instead of incrementing the “token count” hash entries, simply delete the entries in the “token count” and “message subset hit” tables for all N-grams that are found.

By the end of this process, all ham and spam have been processed, and in a memory-efficient fashion. We now have:

  • a table of hit-counts for the message text N-grams, with all N-grams where P(spam) < 1.0 — ie. where even a single ham message was hit — already discarded
  • the “message subset hit” table, containing info about exactly which subset of messages contain a given N-gram
  • the token-to-word reverse-lookup table

To further reduce memory use, the word-to-token forward-lookup table can now be freed. In addition, the values in the “message subset hit” table can be replaced with their hashes; we don’t need to be able to tell exactly which messages are listed there, we just need a way to tell if one entry is equal to another.

Summarisation

Iterate through the hit-count table. Discard entries that occur too infrequently to be listed; discard, especially, entries that occur only once. (We’ve already discarded entries that hit any ham.)

Make a hash that maps the message subsets to the set of all N-gram patterns for that message-subset. For each subset, pick a single N-gram, and note the hit-count associated with it as the hit-count value for that entire message-subset. (Since those N-grams all appear in the exact same subset of messages, they will always have the same P(spam) — this is a safe shortcut.)

Iterate through the message subsets, in order of their hit-count. Take all of the message-subset’s patterns, decode the N-grams in all patterns using the token-to-word reverse-lookup table, and apply this algorithm to that pattern set:

Collapse patterns

So, input here is an array of N-gram patterns, which we know always occur in the same subset of messages. We also have the saved array of all spam messages’ shortened text strings, from the N-gram extraction step. With this, we can apply a form of the BLAST pattern-discovery algorithm, from bioinformatics.

Pop the first entry off the array of patterns. Find any one mail from the saved-mails array that hits this pattern. Find the single character before the pattern in this mail, and prepend it to the pattern. See if the hits for this new pattern are the same message set as hit the old pattern; if not, restore the old pattern and break. If you hit the start of the mail message’s text string, break. Then apply the same algorithm forward through the mail text.

By the end of that, you have expanded the pattern from the basic N-gram as far as it’s possible to go in both directions without losing a hit.

Next, discard all patterns in the pattern array that are subsumed by (ie. appear in) this new expanded pattern. Add it to the output list of expanded patterns, unless it in turn is already subsumed by a pattern in that list; discard any patterns in the output list that are subsumed by this new pattern; and move onto the next pattern in the input list until they’re all exhausted.

(By the way, the “discard if subsumed” trick is the reason why we start off with 3-word N-grams — it gives faster results than just 2-word N-grams alone, presumably by reducing the amount of work that this collapse stage has to do, by doing more of it upfront at a relatively small RAM cost.)

Summarisation (continued)

Finally, output a line listing the percentage of the input spam messages hit (ie. (hit-count value / total number of spams) * 100) and the list of expanded patterns for that message-subset, then iterate on to the next message-subset.

Example

Here’s an example of some output from recent “OEM” stock spam:

$ ./seek-phrases-in-corpus --grep 'OEM' \
        spam:dir:/local/cor/recent/spam/*.2007022* \
        ham:dir:/local/cor/recent/ham/*.200702*
[mass-check progress noises omitted]
 RATIO   SPAM%    HAM%   DATA
 1.000  72.421   0.000  / OEM software - throw packing case, leave CD, use electronic manuals. Pay for software only and save 75-90%! /,
                         / TOP 1O ITEMS/
 1.000  73.745   0.000  / $99 Macromedia Studio 8 $59 Adobe Premiere 2.0 $59 Corel Grafix Suite X3 $59 Adobe Illustrator CS2 $129 Autodesk Autocad 2007 $149 Adobe Creative Suite 2 /,
                         /s: Adobe Acrobat PR0 7 $69 Adobe After Effects $49 Adobe Creative Suite 2 Premium $149 Ableton Live 5.0.1 $49 Adobe Photoshop CS $49 http:\/\//,
                         / Microsoft Office 2007 Enterprise Edition Regular price: $899.00 Our offer: $79.95 You save: $819.95 (89%) Availability: Pay and download instantly. http:\/\//,
                         / Adobe Acrobat 8.0 Professional Market price: $449.00 We propose: $79.95 Your profit: $369.05 (80%) Availability: Available for /,
                         / $49 Windows XP Pro w\/SP2 $/,
                         / Top-ranked item. (/,
                         /, use electronic manuals. Pay for software only and save 75-90%! /,
                         / Microsoft Windows Vista Ultimate Retail price: $399.00 Proposition: $79.95 Your benefit: $319.05 (80%) Availability: Can be downloaded /,
                         / $79 MS Office Enterprise 2007 $79 Adobe Acrobat 8 Pro $/,
                         / Best choice for home and professional. (/,
                         / OEM software - throw packing case, leave CD/,
                         / Sales Rank: #1 (/,
                         / $79 Microsoft Windows Vista /,
                         / manufacturers: Microsoft…Mac…Adobe…Borland…Macromedia http:\/\//
 1.000  73.855   0.000  / MS Office Enterprise 2007 /,
                         /9 Microsoft Windows Vista /,
                         / Microsoft Windows Vista Ultimate /,
                         /9 Macromedia Studio 8 /,
                         / Adobe Acrobat 8.0 /,
                         / $79 Adobe /
 1.000  74.242   0.000  / Windows XP Pro/
 1.000  74.297   0.000  / Adobe Acrobat /
 1.000  74.462   0.000  / Adobe Creative Suite /
 1.000  74.573   0.000  / Adobe After Effects /
 1.000  74.738   0.000  / Adobe Illustrator /
 1.000  74.959   0.000  / Adobe Photoshop CS/
 1.000  75.014   0.000  / Adobe Premiere /
 1.000  75.290   0.000  / Macromedia Studio /
 1.000  75.786   0.000  /OEM software/
 1.000  75.841   0.000  / Creative Suite /
 1.000  75.896   0.000  / Photoshop CS/
 1.000  75.951   0.000  / After Effects /
 1.000  76.062   0.000  /XP Pro/
 1.000  82.460   0.000  / $899.00 Our /,
                         / Microsoft Office 2007 Enterprise /,
                         / $79.95 You/

Immediately, that provides several useful rules; in particular, that final set of patterns can be combined with a SpamAssassin “meta” rule to hit 82% of the samples. Generating this took a quite reasonable 58MB of virtual memory, with a runtime of about 30 minutes, analyzing 1816 spam and 7481 ham mails on a 1.7Ghz Pentium M laptop.

(Update:) here’s a sample message from that test set, demonstrating the top extracted snippets in bold:

  Return-Path: <tyokaluassa.com@ultradian.com>
  X-Spam-Status: Yes, score=38.2 required=5.0 tests=BAYES_99,DK_POLICY_SIGNSOME,
          FH_HOST_EQ_D_D_D_D,FH_HOST_EQ_VERIZON_P,FH_MSGID_01C67,FUZZY_SOFTWARE,
          HELO_LOCALHOST,RCVD_IN_NJABL_DUL,RCVD_IN_PBL,RCVD_IN_SORBS_DUL,RDNS_DYNAMIC,
          URIBL_AB_SURBL,URIBL_BLACK,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_RHS_DOB,
          URIBL_SBL,URIBL_SC_SURBL shortcircuit=no autolearn=spam version=3.2.0-r492202
  Received: from localhost (pool-71-125-81-238.nwrknj.east.verizon.net [71.125.81.238])
          by dogma.boxhost.net (Postfix) with SMTP id E002F310055
          for <xxxxxxxxxxx@jmason.org>; Sun, 18 Feb 2007 08:58:20 +0000 (GMT)
  Message-ID: <000001c7533a$b1d3ba00$0100007f@localhost>
  From: “Kevin Morris” <tyokaluassa.com@ultradian.com>
  To: <xxxxxxxx@jmason.org>
  Subject: Need S0ftware?
  Date: Sun, 18 Feb 2007 03:57:56 -0500

  OEM software - throw packing case, leave CD, use electronic manuals.
  Pay for software only and save 75-90%!

  Discounts! Special offers! Software for home and office!
              TOP 1O ITEMS.

    $79 Microsoft Windows Vista Ultimate
    $79 MS Office Enterprise 2007
    $79 Adobe Acrobat 8 Pro
    $49 Windows XP Pro w/SP2
    $99 Macromedia Studio 8
    $59 Adobe Premiere 2.0
    $59 Corel Grafix Suite X3
    $59 Adobe Illustrator CS2
  $129 Autodesk Autocad 2007
  $149 Adobe Creative Suite 2
  http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t0

            Mac Specials:
  Adobe Acrobat PR0 7             $69
  Adobe After Effects             $49
  Adobe Creative Suite 2 Premium $149
  Ableton Live 5.0.1              $49
  Adobe Photoshop CS              $49
  http://ot.rezinkaoem.com/-software-for-mac-.php?0B85330BA896A9992D0561E08037493852CE
  6E1FAE&t6

  See more by this manufacturers:
  Microsoft…Mac…Adobe…Borland…Macromedia
  http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t4

  Microsoft Windows Vista Ultimate
  Retail price:  $399.00
  Proposition:  $79.95
  Your benefit:  $319.05 (80%)
  Availability: Can be downloaded INSTANTLY.
  http://ot.rezinkaoem.com/2480.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t3
  Best choice for home and professional. (37268 reviews)

  Microsoft Office 2007 Enterprise Edition
  Regular price:  $899.00
  Our offer:  $79.95
  You save:  $819.95 (89%)
  Availability: Pay and download instantly.
  http://ot.rezinkaoem.com/2442.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t1
  Sales Rank: #1 (121329 reviews)

  Adobe Acrobat 8.0 Professional
  Market price:  $449.00
  We propose:  $79.95
  Your profit:  $369.05 (80%)
  Availability: Available for INSTANT download.
  http://ot.rezinkaoem.com/2441.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t2
  Top-ranked item. (31949 reviews)

Further work

Things that would be nice:

  • It’d be nice to extend this to support /.*/ and /.{0,10}/ — matching “anys”, also known as “gapped alignment” searches in bioinformatics, using algorithms like the Smith-Waterman or Needleman-Wunsch algorithms. (Update: this has been implemented.)
  • A way to detect and reverse-engineer templates, e.g. “this is foo”, “this is bar”, “this is baz” => “this is (foo|bar|baz)”, would be great.
  • Finally, heuristics to detect and discard likely-poor patterns are probably the biggest wishlist item.

Tuits are the problem, of course, since $dayjob is the one that pays the bills, not this work. :(

The code is being developed here, in SpamAssassin SVN. Feel free to comment/mail if you’re interested, have improvement ideas, or want more info on how to use it… I’d love to see more people trying it out!

Some credit: I should note that IBM’s Chung-Kwei system, presented at CEAS 2004, was the first time I’d heard of a pattern-discovery algorithm (namely, their proprietary Teiresias algorithm) being applied to spam.

Tags: , , , , , , ,

Comments (8)

Kernighan and Pike on debugging

While reading the log4j manual, I came across this excellent quote from Brian W. Kernighan and Rob Pike’s “The Practice of Programming”:

As personal choice, we tend not to use debuggers beyond getting a stack trace or the value of a variable or two. One reason is that it is easy to get lost in details of complicated data structures and control flow; we find stepping through a program less productive than thinking harder and adding output statements and self-checking code at critical places. Clicking over statements takes longer than scanning the output of judiciously-placed displays. It takes less time to decide where to put print statements than to single-step to the critical section of code, even assuming we know where that is. More important, debugging statements stay with the program; debugging sessions are transient.

+1 to that.

Tags: , , , , , , ,

Comments (3)

A Released Perl With Trie-based Regexps!

Good news! From the Perl 5.9.2 ‘perl592delta’ change log:

The regexp engine now implements the trie optimization : it’s able to factorize common prefixes and suffixes in regular expressions. A new special variable, ${^RE_TRIE_MAXBUF}, has been added to fine-tune this optimization.

in other words, the trie-optimization patch contributed by demerphq back in March 2005 is now in a released build of Perl. Yay!

Here’s a writeup of what it does:

A trie is a way of storing keys in a tree structure where the branching logic is determined by the value of the digits of the key. Ie: if we have “car”, “cart”, “carp”, “call”, “cull” and “cars” we can build a trie like this:

        c + a + r + t
          |   |   |
          |   |   + p
          |   |   |
          |   |   + s
          |   | 
          |   + l - l
          |   
          + u - l - l

What the patch does is make /a | list | of | words/ into a trie that matches those words. This means that we can efficiently tell if any of the words are at a given location in a strng by simply walking the string and trie at the same time. In many cases we can rule out the entire list by looking at only one character of the input. The current way perl handles this would require looking at N chars where N is the number of words involved. (BTW: Thats the beauty of a trie, its lookup time is independent of the number of words it stores but rather on the key length of the word being looked up. )

SpamAssassin is, of course, both (a) very regular-expression-intensive and (b) searches a single block of text for a large number of independent patterns in parallel. I’d love to see someone coming up with a patch to SpamAssassin that uses trie-compatible regexps when the perl version is >= 5.9.2, and gets increased performance that way. hint ;)

BTW, the Regexp::Trie module on CPAN is related — in that it, similar to Regexp::Optimizer, Regex::PreSuf, or Regexp::Assemble, will compile a list of words or regular expressions into a super-efficient trie-style regexp. However, without the trie patch to the regexp engine itself, this would be a minor efficiency tweak at best; although having said that, Regexp::Assemble’s POD notes:

You should realise that large numbers of alternations are processed in perl’s regular expression engine in O(n) time, not O(1). If you are still having performance problems, you should look at using a trie. Note that Perl’s own regular expression engine will implement trie optimisations in perl 5.10 (they are already available in perl 5.9.3 if you want to try them out). Regexp::Assemble will do the right thing when it knows it’s running on a a trie’d perl. (At least in some version after this one).

(PS: interestingly, demerphq mentioned back in March 2005 that he was working on Aho-Corasick matching next. A-C is a great parallel-matching algorithm, and I would imagine it would increase performance yet more. I wonder what happened to that…)

Tags: , , , , ,

Comments (16)

Web 2.0 and Open Source

A commenter at this post on Colm MacCarthaigh’s weblog writes:

I guess I still don’t understand how Open Source makes sense for the developers, economically. I understand how it makes sense for adapters like me, who take an app like Xoops or Gecko and customize it gently for a contract. Saves me hundreds of hours of labour. The down side of this is that the whole software industry is seeing a good deal of undercutting aimed at sales to small and medium sized commercial institutions.

Similarly, in the follow-up to the O’Reilly “web 2.0″ trademark shitstorm, there’s been quite a few comments along the lines of “it’s all hype anyway”.

I disagree with that assertion — and Joe Drumgoole has posted a great list of key Web 2.0 vs Web 1.0 differentiators, which nails down some key ideas about the new concepts, in a clear set of one-liners.

Both open source software companies, and “web 2.0″ companies, are based on new economic ideas about software and the internet. There’s still quite a lot of confusion, fear and doubt about both, I think.

Open Source

As I said in my comment at Colm’s weblog — open source is a network effect. If you think of the software market as a single buyer and seller, with the seller producing software and selling to the buyer, it doesn’t make sense.

But that’s not the real picture of a software market. If you expand the picture beyond that, to a more realistic picture of a larger community of all sorts of people at all levels, with various levels interacting in a more complex maze of conversation and transactions, open source creates new opportunities.

Here’s one example, speaking from experience. As the developer of SpamAssassin, open source made sense for me because I could never compete with the big companies any other way.

If I had been considering it in terms of me (the seller) and a single customer (the buyer), economically I could make a case of ‘proprietary SpamAssassin’ being a viable situation — but that’s not the real situation; in reality there was me, the buyer, a few 800lb gorillas who could stomp all over any puny little underfunded Irish company I could put together, and quite a few other very smart people, who I could never afford to employ, who were happy to help out on ‘open-source SpamAssassin’ for free.

Given this picture, I’m quite sure that I made the right choice by open sourcing my code. Since then, I’ve basically had a career in SpamAssassin. In other words my open source product allowed me to make income that I wouldn’t have had, any other way.

It’s certainly not simple economics, is a risk, and is complicated, and many people don’t believe it works — but it’s viable as an economic strategy for developers, in my experience. (I’m not sure how to make it work for an entire company, mind you, but for single developers it’s entirely viable.)

Web 2.0

Similarly — I feel some of the companies that have been tagged as “web 2.0″ are using the core ideas of open source code, and applying them in other ways.

Consider Threadless, which encourages designers to make their designs available, essentially for free — the designer doesn’t get paid when their tee shirt is printed; they get entered into a contest to win prizes.

Or Upcoming.org, where event tracking is entirely user-contributed; there’s no professional content writers scribbling reviews and leader text, just random people doing the same. For fun, wtf!

Or Flickr, where users upload their photos for free to create the social experience that is the site’s unique selling point.

In other words — these companies rely heavily on communities (or more correctly certain actors within the community) to produce part of the system – exactly as open source development relies on bottom-up community contribution to help out a little in places.

The alternative is the traditional, “web 1.0″ style; it’s where you’re Bill Gates in the late 90’s, running a commercial software company from the top down.

  • You have the “crown jewels” — your source code — and the “users” don’t get to see it; they just “use”.
  • Then they get to pay for upgrades to the next version.
  • If you deal with users, it’s via your sales “channels” and your tech support call centre.
  • User forums are certainly not to be encouraged, since it could be a PR nightmare if your users start getting together and talking about how buggy your products are.
  • Developers (er, I mean “engineers”) similarly can’t go talking to customers on those forums, since they’ll get distracted and give away competitive advantage by accidentally leaking secrets.
  • Anyway, the best PR is the stuff that your PR staff put out — if customers talk to engineers they’ll just get confused by the over-technical messages!

Yeah, so, good luck with that. I remember doing all that back in the ’90’s and it really wasn’t much fun being so bloody paranoid all the time ;)

URLs:

(PS: The web2.0 companies aren’t using all of the concepts of open-source, of course — not all those web apps have their source code available for public reimplementation and cloning. I wish they were, but as I said, I can’t see how that’s entirely viable for every company. Not that it seems to stop the cloners, anyway. ;)

Tags: , , , , , , ,

Comments (15)

Software Patenting and “Hot” Fields

Paul Graham’s recent essay on his experience with software patenting has been making the rounds recently.

Now Kevin Marks has commented. Worth reading, since he demonstrates nicely the kind of crap you see in a ‘hot’ field, such as video (which he worked on with Apple’s Quicktime):

I broadly agree with Paul Graham’s essay on Software Patents, but I do think he underestimates the damage from patent trolls, and from what he calls the mafia-like behaviour of some patent holders. Paul has been lucky in the field he has worked in, but in the Audio and Video area there are many patent thickets. … While I was at Apple on QuickTime, there was a steady stream of patent trolls claiming that Apple should pay them royalties; enough to keep several lawyers busy, and a lot of engineers spending time working on prior art evidence demonstrations. Several potential features were excluded from QuickTime due to patent thickets. The obvious one was the Unisys LZW patent that encumbered GIF, but there were other more subtle pressures that meant adopting open source codecs was discouraged. Working on the patent license agreements for MPEG meant that technology ready to ship was deferred pending legal agreement on more than one occasion.

In my experience, that’s what happens — once a field becomes “hot”, patent trolls and other nuisance “inventors” start appearing en masse, and then you’ve got to waste a lot of time dealing with that crap.

Tags: , , ,

Comments

A Gotcha With perl’s “each()”

It’s my bi-monthly perl blog entry, to earn my place on planet.perl.org! ;)

Here’s an interesting “gotcha”. Take this code:

    perl -e '%t=map{$_=>1}qw/1 2 3/;
    while(($k,$v)=each %t){print "1: $k\n"; last;}
    while(($k,$v)=each %t){print "2: $k\n";}'

In other words, iterate through all the key-value pairs in %t once, then do it again — but exit early in the first loop.

You would expect to get something like this output:

    1: 1
    2: 1
    2: 3
    2: 2

instead, you see:

    1: 1
    2: 3
    2: 2

The “1″ entry in the second loop is AWOL. Here’s why — as “perldoc -f each” notes:

There is a single iterator for each hash, shared by all “each”, “keys”, and “values” function calls in the program

That’s all “each” calls, throughout the entire codebase, possibly in a different class entirely. Argh.

The workaround: reset the iterator using “keys” between calls to “each”:

    perl -e '%t=map{$_=>1}qw/1 2 3/;
    while(($k,$v)=each %t){print "1: $k\n"; last;}
    keys %t;
    while(($k,$v)=each %t){print "2: $k\n";}'

This got us in SpamAssassin — bug 4829.

To be honest, having to call “keys” after the loop is kludgy — as you can see if you check the patch in bug 4829 there, we had to change from a “return inside loop” pattern to a “set variable and exit loop, reset state, then return” pattern. It’d be nice to have a scoped version of each(), instead of this global scope, so that this would work:

    perl -e '%t=map{$_=>1}qw/1 2 3/;
    { while(($k,$v)=scoped_each %t){print "1: $k\n"; last;} }
    # that each() iterator is now out of scope, so GC'd;
    # the next call uses a new iterator, starting from scratch
    { while(($k,$v)=scoped_each %t){print "2: $k\n";} }'

Scoping, of course, has the benefit of allowing “return early” patterns to work; in my opinion, those are clearer — at the least because they require less lines of code ;)

Tags: , , , ,

Comments (4)

What Works in Software Development

I already posted this to the link-blog yesterday, but it’s so good it’s worth promoting more widely. If you write software for a living, you really ought to read the slides for Michael Schwern’s excellent ‘What Works In Software Development’ talk.

It’s a long presentation (108 slides!), but during the course of that, he covers:

  • effective teamwork
  • dealing with bad customers
  • dealing with bad management
  • classic coding mistakes
  • classic project management mistakes
  • classic design mistakes
  • test-driven development
  • refactoring
  • patterns

It’s a really good synthesis of what I think are the best bits of good OO design, XP, CPAN and perl’s design and coding styles, without most of the cruft. I’ll be pointing people at this for years to come, I think…

(Found via yoz.)

Tags: , , , , , , ,

Comments (1)