
Month: March 2007

New list for Irish users of MythTV

MythTV is a pretty great product, once you get it working — however, it can be labour-intensive, involving lots of local knowledge to deal with the ins and outs of each area’s TV provider, cable service, etc.

To that end, we’ve recently set up a new mailing list: mythtv-ireland, a list for discussion of topics of interest to MythTV users in Ireland.

Particularly on-topic:

  • the NTL frequencies list for areas in Ireland

  • hacks to scrape the Channel 6 schedule from their website

  • dealing with the NTL Digital set-top box

Sign up, if you’re interested!

Twitter and del.icio.us

Walter Higgins says:

It’s just occurred to me why I don’t like twitter – It doesn’t fulfill any need that isn’t already fulfilled by del.icio.us. I usually post a note alongside each bookmark which lets me micro-blog (post short comments without having to think too much). If I want to signal to someone to take a look at the bookmarked item I just tag it for:[nameofperson] which I suppose you could loosely call ‘chat’. Since I gave up personal blogging, del.icio.us has fulfilled a need for short-hand blogging. Thinking about it – twitter is like del.icio.us but without the bookmarks – viewed in that light it really is hard to understand why anyone would use twitter.

To my mind, though, there’s a big difference:

  • My del.icio.us page is where I post things I’m reading, and things I think others may be interested in reading;

  • My twitter page is where I post things I’m doing, and chat.

There’s no way I’d try to hold a conversation in my del.icio.us bookmarks! ;) Different tools for different uses.

Geeking out on the ‘leccy bill

A good post from Lars Wirzenius on measuring the electricity consumption of his computer hardware. Here’s a previous post of mine on the subject.

With the rising cost of energy, a keenness to reduce consumption for green purposes, and an overweening nerdity in general, I did some more investigation around my house recently.

I have a pretty typical Irish electricity meter; it contains a visible disc with a red dot, which spins at a speed proportional to power usage. (There’s a good pic of something similar at the Wikipedia page).

The fuse-board works out as follows (discarding the boring ones like the house alarm etc.):

  • Fuse 7 – gas-fired central heating (on), fridge (on), kitchen power sockets

  • Fuse 8 – TV in standby, idle PVR, Wii in standby, digital cable set-top box, washing machine

  • Fuse 9 – telephone, DSL router, Linksys WRT54G AP/router

  • Fuse 10 – bedroom sockets, home office with laptop, printer, speakers, laptop-server etc.

The approach was simply to turn off the house fuses at the fuse board, one by one, and measure how long it took the disc to make a full revolution; then invert that (1/n) to convert from time taken at a fixed power, to a notional unit of power consumed over a fixed time interval (I haven’t figured out how to convert these to kWh or anything like that; they’re just makey-uppy units).

Fuses                       Time per revolution       Relative power (1/time)
Baseline (all fuses on)     22.71 seconds             0.0440
Fuse 7 off                  43.03                     0.0232
Fuses 7 and 8 off           57.92                     0.0172
Fuses 7, 8 and 9 off        84.88                     0.0117
Fuses 7, 8, 9 and 10 off    ~20 minutes (I’d guess)   0.0008?

(I stopped measuring on the last one and just estimated; it was crawling around.)

Breaking out the individual fuses, that works out as:

Fuse                                             Relative power (1/time)
Fuse 7 (central heating, fridge, kitchen bits)   0.0208
Fuse 8 (TV, Wii, set-top box, washing machine)   0.0060
Fuse 9 (phones, routers)                         0.0055
Fuse 10 (home office, bedrooms)                  0.0109
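The differencing above can be sketched in a few lines of Python, using the measured revolution times (the last reading, which I only estimated at ~20 minutes, is treated as 1200 seconds):

```python
# Relative power from meter-disc revolution times: power is proportional
# to 1/(seconds per revolution), and differencing successive readings
# (each with one more fuse switched off) isolates each fuse's share.

times = {
    "baseline": 22.71,          # all fuses on
    "fuse 7 off": 43.03,
    "fuses 7-8 off": 57.92,
    "fuses 7-9 off": 84.88,
    "fuses 7-10 off": 1200.0,   # the ~20-minute estimate
}

power = {k: 1.0 / t for k, t in times.items()}

fuse7 = power["baseline"] - power["fuse 7 off"]
fuse8 = power["fuse 7 off"] - power["fuses 7-8 off"]
fuse9 = power["fuses 7-8 off"] - power["fuses 7-9 off"]
fuse10 = power["fuses 7-9 off"] - power["fuses 7-10 off"]

print(round(fuse7, 4))   # 0.0208
print(round(fuse8, 4))   # 0.006
print(round(fuse9, 4))   # 0.0055
print(round(fuse10, 4))  # 0.0109
```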

Good results already: (a) it was pretty clear that fuse 7 was doing all the quotidian legwork, eating the majority of the power; (b) the TV equipment and internet/wifi infrastructure were pretty good at low-power operation (yay); however (c) the computer bits aren’t so great, though still only half the power consumption of the kitchen bits.

Breaking down the kitchen consumption further:

Appliances                                         Time per revolution   Relative power (1/time)
Gas central heating on (rechecking the baseline)   20.46                 0.0488
Gas central heating off                            34.15                 0.0292
Washing machine on (40 degree wash)                13.65                 0.0732
Dishwasher on                                      2.53                  0.3952
Dishwasher and dehumidifier on                     2.53                  0.3952

Subtracting the baseline:

Appliance                     Relative power (1/time)
Gas central heating           0.0196
Washing machine               0.0244
Dishwasher                    0.3464
Dishwasher and dehumidifier   0.3464

So the central heating, despite being supposedly gas-fired, eats lots of power! I guess this is the electric pump, used to drive the heated water around the house to the radiators. Ah well, I’m not skimping on that ;)

More practically: the dishwasher result is incredible. That’s 30 times the power usage of the house’s computer hardware. This is a ~7-year-old standard dishwasher; obviously green power consumption wasn’t an issue back then! We’re running it less frequently now, obviously; the odd hand-wash of bulky and nearly-clean items helps. With any luck when we move in a few months, we can replace it with a greener model.

The washing machine is about what I would expect, so I’m OK with that.

Also interesting to note that our dehumidifier’s consumption is unnoticeable next to the dishwasher’s; I could have tried to measure it properly in isolation, but couldn’t be bothered by that stage ;)

Sender Address Verification considered harmful

(as an anti-spam technique, at least.)

Sender-address verification, also known as callback verification, is a technique to verify that mail is being sent with a valid envelope-sender return address. It is supported by Exim and Postfix, among others.
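For concreteness, the core of a callback check looks something like this. This is a hypothetical sketch, not Exim’s or Postfix’s actual implementation; the SMTP probe is injected as a callable so the decision logic is visible and self-contained, where a real implementation would do a DNS MX lookup and speak SMTP over the network:

```python
# Sketch of sender-address (callback) verification: before accepting a
# message, connect back to the envelope sender's MX and check whether a
# RCPT TO for that address would itself be accepted.

def verify_sender(envelope_sender, probe):
    """Return True if the envelope sender address verifies.

    `probe(domain, address)` should perform HELO / MAIL FROM:<> /
    RCPT TO:<address> against `domain`'s MX and return the RCPT TO
    reply code.
    """
    if "@" not in envelope_sender:
        return False                 # null/malformed sender; policy varies
    _, domain = envelope_sender.rsplit("@", 1)
    code = probe(domain, envelope_sender)
    return 200 <= code < 300         # 250 => address deliverable

# A fake probe standing in for the network round-trip:
known = {"user@example.com": 250}
fake_probe = lambda dom, addr: known.get(addr, 550)

print(verify_sender("user@example.com", fake_probe))   # True
print(verify_sender("bogus@example.com", fake_probe))  # False
```

Note that the check only establishes that the address is deliverable, which is exactly the property spammers can satisfy by forging real people’s addresses.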

Some view this as a useful anti-spam technique. In my opinion, it’s not.

Spam/anti-spam is an adversarial “game”. Whenever you’re considering anti-spam techniques, it’s important to bear in mind game theory, and the possible countermeasures that spammers will respond with. Before SAV became prevalent, spam was often sent using entirely fake sender data; hence the initial attractiveness of SAV. Once SAV became worth evading, the spammers needed to find “real” sender addresses to evade it. And where’s the obvious place to find real addresses? On the list of target addresses they’re spamming!

Since the spam is now sent using forged sender addresses of “real” people, when a spam bounces (as much of it does), the bounce will be sent back not to an entirely fake address, but to a spam recipient’s address.

Hence, the spam recipients now get twice as much mail from each spam run — spam aimed at them, and bounce blowback from hundreds of spams aimed at others, forged to appear to be from them.

This is the obvious “next move” in response to SAV, which is one reason why we never implemented something like it in SpamAssassin.

On top of this — it doesn’t work well enough anymore. Verizon use SAV. Have you ever heard anyone talk about how great Verizon’s spam filtering is? Didn’t think so.

(This post is a little late, given that SAV has been used for years now, but better late than never ;)

By the way, it’s worth noting that it’s still marginally acceptable to use SAV as a general email acceptance policy for your site — ie. as a way to assert that you’re not going to accept mail from people who won’t accept mail to the envelope sender address used to deliver it. Just don’t be fooled into thinking it’s helping the spam problem, or is helping anyone else but yourself.

Finally, this Sender Address Verification is different from what Sendio calls Sender Address Verification. That’s just challenge-response, which is crap for an entirely different, and much worse, set of reasons.

Something in the oven

Check out what’s cooking chez Mason:

Thrills and spills! I may have to cut down on the extra-curricular activities for a while, so we’d better get SpamAssassin 3.2.0 released before August 21st ;)

Spam volumes at accidental-DoS levels

Both Jeremy Zawodny and Dale Dougherty at O’Reilly Radar are expressing some pretty serious frustration with the current state of SMTP. I have to say, I’ve been feeling it too.

A couple of months back, our little server came under massive load; this had happened before, and normally in those situations it was a joe-job attack. Switching off all filtering and just collecting the targeted domain’s mail in a buffer for later processing would work to ameliorate the problem, by allowing the load to “drain”. Not this time, though.

Instead, when I turned off the filtering, the load was still too high — the massive volume of spam (and spam blowback / backscatter) was simply too much for the Postfix MTA. The MTA could not handle all the connections and SMTP traffic in time to simply collect all the data and store it in a file!

Looking into the “attack” afterwards, once the load was back under control, it looked likely that it wasn’t really an attack — it was just a volume spike. Massive SMTP load, caused by spammers increasing the volume of their output for no apparent reason. (Since then, spam volumes have been increasing still further on a nearly weekly basis.)

This is the effect of botnets — the number of compromised hosts is now big enough to amplify spam attacks to server-swamping levels. Our server is not a big one, but then it serves fewer than 50 users’ email, I’d say; the user-to-CPU-power ratio is pretty good compared to most ISPs’ servers.

So here’s the thing. New SMTP-based methods of delivering nonspam email — whether based on DKIM, SPF, webs of trusted servers, or whatever — will not be able to operate if they have to compete for TCP connection slots with spammers, since spammers can now swamp the SMTP listener for port 25 with connections. In effect, spam will DDoS legitimate email, no matter what authentication system that legit mail uses to authenticate itself.

This, in my opinion, is a big problem.

What’s the fix? A “new SMTP” on a whole different port, where only authed email is permitted? How do you make that DoS-resistant? Ideas?

(Obviously, counting on spammers to notice or care is not a good approach.)

A SpamAssassin rule-discovery algorithm

Just to get a little techie again… here’s a short article on a new algorithm I’ve come up with.

Text-matching rule-based anti-spam systems are pretty common — SpamAssassin is probably the most well-known, of course, and the proprietary apps built on it use the same approach. Other proprietary apps also seem to use similar techniques, such as Symantec’s Brightmail and MessageLabs’ scanner (hi Matt ;) — and doubtless there are others. As a result, ways to write rules quickly and effectively are valuable.

So far, most SpamAssassin text rules are manually developed; somebody looks at a few spam samples, spots common phrases, and writes a rule to match that. It’d be great to automate more of that work. Here’s an algorithm I’ve developed to perform this in a memory-efficient and time-efficient way. I’m quite proud of this, so thought it was worth a blog posting. ;)

Corpus collection

First, we collect a corpus of spam and “ham” (non-spam) mails. Standard enough, although in this case it helps to try to keep it to a specific type of mail (for example, a recent stock spam run, or a run from the OEM spammer).

Typically, a simple “grep” will work here, as long as the source corpus is all spam anyway; a small number of irrelevant messages can be left in, as long as the majority, 80% or so, are variations on the target message set. (The SpamAssassin mass-check tool can now perform this on the fly using the new ‘GrepRenderedBody’ mass-check plugin, which is helpful.)

Rendering

Next, for each spam message, render the body. This involves:

  • decoding MIME structure
  • discarding non-textual parts, or parts that are not presented to the viewer by default in common end-user MUAs (such as attachments)
  • decoding quoted-printable and base64 encoding
  • rendering HTML, again based on the behaviour of the HTML renderers used in common end-user MUAs
  • normalising whitespace, “this is\na \ntest” -> “this is a test”

All pretty basic stuff, and performed by the SpamAssassin “body” rendering process during a “mass-check” operation. A SpamAssassin plugin outputs each message’s body string to a log file.
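As a rough illustration, the decoding and normalisation steps might look like this — a much-simplified stand-in for SpamAssassin’s actual body rendering, with a crude regex in place of a real MUA-faithful HTML renderer:

```python
import quopri
import re

def render_body(raw, qp=False):
    """Very rough sketch of body rendering: optionally decode
    quoted-printable, crudely strip HTML tags, and normalise
    whitespace."""
    if qp:
        raw = quopri.decodestring(raw.encode()).decode("utf-8", "replace")
    text = re.sub(r"<[^>]+>", " ", raw)       # crude HTML-tag removal
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace runs

print(render_body("this is\na \ntest"))                         # this is a test
print(render_body("<p>Pay=20for software only!</p>", qp=True))  # Pay for software only!
```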

Next, we take the two log files, and process them using the following algorithm:

N-gram Extraction

Iterate through each mail message in the spam set. Each message is assigned a short message ID number. Cut off all but the first 32 kbytes of the text (for this algorithm, I think it’s safe to assume that anything past 32 KB will not be a useful place for spammers to place their spam text). Save a copy of this shortened text string for the later “collapse patterns” step.

Split the text into “words” — ie. space-separated chunks of non-whitespace chars. Compress each “word” into a shorter ID to save space:

"this is a test" => "a b c d"

(The compression dictionary used here is shared between all messages, and also needs to allow reverse lookups.)

Then tokenize the message into 2-word and 3-word phrase snippets (also known as N-grams):

"a b c d" => [ "a b", "b c", "c d", "a b c", "b c d" ]

Remove duplicate N-grams, so each N-gram only appears once per message.

For each N-gram token in this token set, increment a counter in a global “token count” hashtable, and add the message ID to the token’s entry in a “message subset hit” table.

Next, process the ham set. Perform the same algorithm, except: don’t keep the shortened text strings, don’t cut at 32KB, and instead of incrementing the “token count” hash entries, simply delete the entries in the “token count” and “message subset hit” tables for all N-grams that are found.
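Here’s a minimal Python sketch of the extraction bookkeeping (hypothetical helper names and toy messages; the real implementation lives in SpamAssassin SVN):

```python
from collections import defaultdict

word2id, id2word = {}, {}

def compress(word):
    """Map each distinct word to a short numeric ID (reverse lookup kept)."""
    if word not in word2id:
        word2id[word] = len(word2id)
        id2word[word2id[word]] = word
    return word2id[word]

def ngrams(text):
    """Deduplicated 2- and 3-word N-grams over the compressed token stream."""
    ids = [compress(w) for w in text.split()]
    grams = set()
    for n in (2, 3):
        for i in range(len(ids) - n + 1):
            grams.add(tuple(ids[i:i + n]))
    return grams

token_count = defaultdict(int)   # N-gram -> number of spams hit
subset_hits = defaultdict(set)   # N-gram -> which spam messages hit it

def add_spam(msg_id, text):
    for g in ngrams(text[:32 * 1024]):   # cut at 32 KB
        token_count[g] += 1
        subset_hits[g].add(msg_id)

def add_ham(text):
    for g in ngrams(text):               # any ham hit disqualifies the N-gram
        token_count.pop(g, None)
        subset_hits.pop(g, None)

add_spam(0, "buy cheap OEM software today")
add_spam(1, "buy cheap OEM software now")
add_ham("software today was slow")

surviving = {tuple(id2word[i] for i in g) for g in token_count}
print(("buy", "cheap", "OEM") in surviving)   # True: hit both spams, no ham
print(("software", "today") in surviving)     # False: hit a ham message
```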

By the end of this process, all ham and spam have been processed, and in a memory-efficient fashion. We now have:

  • a table of hit-counts for the message text N-grams, with all N-grams where P(spam) < 1.0 — ie. where even a single ham message was hit — already discarded
  • the “message subset hit” table, containing info about exactly which subset of messages contain a given N-gram
  • the token-to-word reverse-lookup table

To further reduce memory use, the word-to-token forward-lookup table can now be freed. In addition, the values in the “message subset hit” table can be replaced with their hashes; we don’t need to be able to tell exactly which messages are listed there, we just need a way to tell if one entry is equal to another.

Summarisation

Iterate through the hit-count table. Discard entries that occur too infrequently to be listed; discard, especially, entries that occur only once. (We’ve already discarded entries that hit any ham.)

Make a hash that maps the message subsets to the set of all N-gram patterns for that message-subset. For each subset, pick a single N-gram, and note the hit-count associated with it as the hit-count value for that entire message-subset. (Since those N-grams all appear in the exact same subset of messages, they will always have the same P(spam) — this is a safe shortcut.)

Iterate through the message subsets, in order of their hit-count. Take all of the message-subset’s patterns, decode the N-grams in all patterns using the token-to-word reverse-lookup table, and apply this algorithm to that pattern set:

Collapse patterns

So, input here is an array of N-gram patterns, which we know always occur in the same subset of messages. We also have the saved array of all spam messages’ shortened text strings, from the N-gram extraction step. With this, we can apply a form of the BLAST pattern-discovery algorithm, from bioinformatics.

Pop the first entry off the array of patterns. Find any one mail from the saved-mails array that hits this pattern. Find the single character before the pattern in this mail, and prepend it to the pattern. See if the hits for this new pattern are the same message set as hit the old pattern; if not, restore the old pattern and break. If you hit the start of the mail message’s text string, break. Then apply the same algorithm forward through the mail text.

By the end of that, you have expanded the pattern from the basic N-gram as far as it’s possible to go in both directions without losing a hit.

Next, discard all patterns in the pattern array that are subsumed by (ie. appear in) this new expanded pattern. Add it to the output list of expanded patterns, unless it in turn is already subsumed by a pattern in that list; discard any patterns in the output list that are subsumed by this new pattern; and move onto the next pattern in the input list until they’re all exhausted.

(By the way, the “discard if subsumed” trick is the reason why we start off with 3-word N-grams — it gives faster results than just 2-word N-grams alone, presumably by reducing the amount of work that this collapse stage has to do, by doing more of it upfront at a relatively small RAM cost.)
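A simplified, character-level sketch of that expansion step — operating on plain strings rather than the compressed token stream, with made-up sample mails:

```python
def hits(pattern, texts):
    """IDs of texts containing `pattern` as a literal substring."""
    return {i for i, t in enumerate(texts) if pattern in t}

def expand(pattern, texts):
    """BLAST-style extension: grow the pattern one character at a time,
    leftwards then rightwards, for as long as the set of matching texts
    stays exactly the same."""
    target = hits(pattern, texts)
    while True:                     # extend leftwards
        t = texts[min(target)]      # any one text that hits the pattern
        i = t.find(pattern)
        if i == 0:                  # hit the start of the text
            break
        candidate = t[i - 1] + pattern
        if hits(candidate, texts) != target:
            break                   # would lose a hit; keep old pattern
        pattern = candidate
    while True:                     # extend rightwards
        t = texts[min(target)]
        j = t.find(pattern) + len(pattern)
        if j >= len(t):             # hit the end of the text
            break
        candidate = pattern + t[j]
        if hits(candidate, texts) != target:
            break
        pattern = candidate
    return pattern

mails = [
    "special offer: OEM software - save 75-90% today",
    "new offer: OEM software - save 75-90% now",
    "totally unrelated text",
]
print(expand("OEM software", mails))  # " offer: OEM software - save 75-90% "
```

The seed N-gram "OEM software" grows into the longest surrounding string shared by exactly the same subset of mails, which is the property that lets subsumed shorter patterns be discarded.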

Summarisation (continued)

Finally, output a line listing the percentage of the input spam messages hit (ie. (hit-count value / total number of spams) * 100) and the list of expanded patterns for that message-subset, then iterate on to the next message-subset.

Example

Here’s an example of some output from recent “OEM” stock spam:

$ ./seek-phrases-in-corpus --grep 'OEM' \
        spam:dir:/local/cor/recent/spam/*.2007022* \
        ham:dir:/local/cor/recent/ham/*.200702*
[mass-check progress noises omitted]
 RATIO   SPAM%    HAM%   DATA
 1.000  72.421   0.000  / OEM software - throw packing case, leave CD, use electronic manuals. Pay for software only and save 75-90%! /,
                         / TOP 1O ITEMS/
 1.000  73.745   0.000  / $99 Macromedia Studio 8 $59 Adobe Premiere 2.0 $59 Corel Grafix Suite X3 $59 Adobe Illustrator CS2 $129 Autodesk Autocad 2007 $149 Adobe Creative Suite 2 /,
                         /s: Adobe Acrobat PR0 7 $69 Adobe After Effects $49 Adobe Creative Suite 2 Premium $149 Ableton Live 5.0.1 $49 Adobe Photoshop CS $49 http:\/\//,
                         / Microsoft Office 2007 Enterprise Edition Regular price: $899.00 Our offer: $79.95 You save: $819.95 (89%) Availability: Pay and download instantly. http:\/\//,
                         / Adobe Acrobat 8.0 Professional Market price: $449.00 We propose: $79.95 Your profit: $369.05 (80%) Availability: Available for /,
                         / $49 Windows XP Pro w\/SP2 $/,
                         / Top-ranked item. (/,
                         /, use electronic manuals. Pay for software only and save 75-90%! /,
                         / Microsoft Windows Vista Ultimate Retail price: $399.00 Proposition: $79.95 Your benefit: $319.05 (80%) Availability: Can be downloaded /,
                         / $79 MS Office Enterprise 2007 $79 Adobe Acrobat 8 Pro $/,
                         / Best choice for home and professional. (/,
                         / OEM software - throw packing case, leave CD/,
                         / Sales Rank: #1 (/,
                         / $79 Microsoft Windows Vista /,
                         / manufacturers: Microsoft...Mac...Adobe...Borland...Macromedia http:\/\//
 1.000  73.855   0.000  / MS Office Enterprise 2007 /,
                         /9 Microsoft Windows Vista /,
                         / Microsoft Windows Vista Ultimate /,
                         /9 Macromedia Studio 8 /,
                         / Adobe Acrobat 8.0 /,
                         / $79 Adobe /
 1.000  74.242   0.000  / Windows XP Pro/
 1.000  74.297   0.000  / Adobe Acrobat /
 1.000  74.462   0.000  / Adobe Creative Suite /
 1.000  74.573   0.000  / Adobe After Effects /
 1.000  74.738   0.000  / Adobe Illustrator /
 1.000  74.959   0.000  / Adobe Photoshop CS/
 1.000  75.014   0.000  / Adobe Premiere /
 1.000  75.290   0.000  / Macromedia Studio /
 1.000  75.786   0.000  /OEM software/
 1.000  75.841   0.000  / Creative Suite /
 1.000  75.896   0.000  / Photoshop CS/
 1.000  75.951   0.000  / After Effects /
 1.000  76.062   0.000  /XP Pro/
 1.000  82.460   0.000  / $899.00 Our /,
                         / Microsoft Office 2007 Enterprise /,
                         / $79.95 You/

Immediately, that provides several useful rules; in particular, that final set of patterns can be combined with a SpamAssassin “meta” rule to hit 82% of the samples. Generating this took a quite reasonable 58MB of virtual memory, with a runtime of about 30 minutes, analysing 1816 spam and 7481 ham mails on a 1.7GHz Pentium M laptop.

(Update:) here’s a sample message from that test set, demonstrating the top extracted snippets in bold:

  Return-Path: <[email protected]>
  X-Spam-Status: Yes, score=38.2 required=5.0 tests=BAYES_99,DK_POLICY_SIGNSOME,
          FH_HOST_EQ_D_D_D_D,FH_HOST_EQ_VERIZON_P,FH_MSGID_01C67,FUZZY_SOFTWARE,
          HELO_LOCALHOST,RCVD_IN_NJABL_DUL,RCVD_IN_PBL,RCVD_IN_SORBS_DUL,RDNS_DYNAMIC,
          URIBL_AB_SURBL,URIBL_BLACK,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_RHS_DOB,
          URIBL_SBL,URIBL_SC_SURBL shortcircuit=no autolearn=spam version=3.2.0-r492202
  Received: from localhost (pool-71-125-81-238.nwrknj.east.verizon.net [71.125.81.238])
          by dogma.boxhost.net (Postfix) with SMTP id E002F310055
          for <[email protected]>; Sun, 18 Feb 2007 08:58:20 +0000 (GMT)
  Message-ID: <000001c7533a$b1d3ba00$0100007f@localhost>
  From: "Kevin Morris" <[email protected]>
  To: <[email protected]>
  Subject: Need S0ftware?
  Date: Sun, 18 Feb 2007 03:57:56 -0500

  OEM software - throw packing case, leave CD, use electronic manuals.
  Pay for software only and save 75-90%!

  Discounts! Special offers! Software for home and office!
              TOP 1O ITEMS.

    $79 Microsoft Windows Vista Ultimate
    $79 MS Office Enterprise 2007
    $79 Adobe Acrobat 8 Pro
    $49 Windows XP Pro w/SP2
    $99 Macromedia Studio 8
    $59 Adobe Premiere 2.0
    $59 Corel Grafix Suite X3
    $59 Adobe Illustrator CS2
  $129 Autodesk Autocad 2007
  $149 Adobe Creative Suite 2
  http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t0

            Mac Specials:
  Adobe Acrobat PR0 7             $69
  Adobe After Effects             $49
  Adobe Creative Suite 2 Premium $149
  Ableton Live 5.0.1              $49
  Adobe Photoshop CS              $49
  http://ot.rezinkaoem.com/-software-for-mac-.php?0B85330BA896A9992D0561E08037493852CE
  6E1FAE&t6

  See more by this manufacturers:
  Microsoft...Mac...Adobe...Borland...Macromedia
  http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t4

  Microsoft Windows Vista Ultimate
  Retail price:  $399.00
  Proposition:  $79.95
  Your benefit:  $319.05 (80%)
  Availability: Can be downloaded INSTANTLY.
  http://ot.rezinkaoem.com/2480.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t3
  Best choice for home and professional. (37268 reviews)

  Microsoft Office 2007 Enterprise Edition
  Regular price:  $899.00
  Our offer:  $79.95
  You save:  $819.95 (89%)
  Availability: Pay and download instantly.
  http://ot.rezinkaoem.com/2442.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t1
  Sales Rank: #1 (121329 reviews)

  Adobe Acrobat 8.0 Professional
  Market price:  $449.00
  We propose:  $79.95
  Your profit:  $369.05 (80%)
  Availability: Available for INSTANT download.
  http://ot.rezinkaoem.com/2441.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t2
  Top-ranked item. (31949 reviews)

Further work

Things that would be nice:

  • It’d be nice to extend this to support /.*/ and /.{0,10}/ — matching “anys”, also known as “gapped alignment” searches in bioinformatics, using algorithms like the Smith-Waterman or Needleman-Wunsch algorithms. (Update: this has been implemented.)
  • A way to detect and reverse-engineer templates, e.g. “this is foo”, “this is bar”, “this is baz” => “this is (foo|bar|baz)”, would be great.
  • Finally, heuristics to detect and discard likely-poor patterns are probably the biggest wishlist item.

Tuits are the problem, of course, since $dayjob is the one that pays the bills, not this work. :(

The code is being developed here, in SpamAssassin SVN. Feel free to comment/mail if you’re interested, have improvement ideas, or want more info on how to use it… I’d love to see more people trying it out!

Some credit: I should note that IBM’s Chung-Kwei system, presented at CEAS 2004, was the first time I’d heard of a pattern-discovery algorithm (namely, their proprietary Teiresias algorithm) being applied to spam.

Irish Blog Awards 2007

Well, that was fun! Taint.org didn’t make the shortlists, but I went along anyway just to hang out — and lots of chat was had accordingly. Got to finally meet up with a few people I’d chatted with online, like Nialler9 — and with a few old friends I don’t get to see often enough: Antoin, Elana, Brendan, Clare Dillon (ex-Iona!), and another ex-Ionian, Aisling Mackey. A good laugh.

Have to say though, it seems a vote from me was the kiss of death in many of the categories: Sarah Carey, Blogorrah, Ireland from a Polish perspective, and (the late lamented) TCAL all got my thumbs-up in the shortlist voting, and all wound up missing out on the chunk’o’lucite. Sorry about that guys. ;)

Thanks again to Damien for organising the whole do! It’s great to have an event like this to bring each of our disparate blogs physically together for a bit of community.

By the way I’d like to point out that, in contrast to the Blogorrah Bock the Robber mafiosi, I had a real moustache… ;)

BT’s daily disconnects, revisited

As I noted last year, BT, the ISP I use here in Ireland, disconnects broadband sessions on a daily basis, assigning a new IP address; this is really aggravating to anyone who uses a VPN, such as most telecommuters. Reportedly, this is done to work around deficiencies in their billing system.

A comment from Jeremy on that post suggested something interesting, though:

Just had a very helpful tech support guy on from BT. [… he] told me to restart the modem sometime that will make it convenient for the 24 hour IP change – i.e. restart it at 6am, and then it’ll change IP every day at 6am.

I’ve tested this, and it works. Much more convenient! Now the renumbering and VPN breakage can take place when I want it to — at the start of the workday, instead of some random point chosen by BT’s billing system. Quite an improvement.

To make this useful, here’s a script, “reboot-zyxel”, which will reboot your Zyxel P-660RU router remotely over the LAN. (It requires perl and curl.)