
Month: August 2004

Linux and small hardware vendors

Linux: Everyone who’s used a non-MS system will have learned — typically the hard way — that not all hardware is equal. Not just in terms of specs, flexibility and power, but also in terms of whether or not it can be used at all.

Most hardware vendors consider their specification and interface documentation to be their crown jewels; giving access to these without a signed NDA is impossible. On the other hand, for free software developers, signing an NDA makes life quite difficult — it can be done, but nobody else can help you maintain it further without signing an NDA, the resulting code may ‘disclose’ too much of the ‘IP’, and so on. In a lot of cases, the vendor isn’t interested in giving access to the specs, even with an NDA — it’s their IP and why isn’t the customer just using Windows?

The end result: lots of hardware with crappy support on non-MS operating systems.

Things aren’t as bad as they used to be, though: nowadays high-end hardware is more likely to support standards, and Linux is a top choice on embedded hardware (set-top boxes, for example), so it has a much higher profile. But cheap, end-user-oriented PCs still wind up with components from vendors who couldn’t be bothered with non-Windows customers, and that can mean using a hacked-up, reverse-engineered driver and hoping it works. (That’s not to denigrate reverse-engineered drivers; some of them work great. But fundamentally, the vendors are making a mistake here.)

So it’s pretty impressive to see that LaCie are now sponsoring development of k3b, the CD/DVD burning application for KDE!

Good timing too, I was about to buy a DVD burner ;)

kgst output to ALSA/Artsd/ESD instead of just OSS

Linux: Here’s a patch that adds support for ALSA/Artsd/ESD output from kgst, the KDE gstreamer middleware used by JuK.

Background: JuK is a great music player app for KDE. However, it hogs
the sound device while running, which means that nothing else gets access to play sound until the app is shut down. This is suboptimal.

It does this because it plays sound via this chain of components: juk -> kgst -> gstreamer -> sink. Unfortunately, the kgst component doesn’t allow control over which output sink to use, instead hard-coding the string 'osssink' — the OSS drivers, for traditional Linux /dev/dsp sound. My laptop doesn’t support mixing in the sound hardware, which means I need to use a software mixer; 'osssink' doesn’t support software mixing, instead giving the caller exclusive access to the sound card, so other apps just have to wait for it to finish.

(As to why JuK doesn’t just play MP3s by running ‘mpg321 name-of-file.mp3’, letting us specify the ‘-o’ switch to use, I wish I knew. ObOldbieGripe: component-based architectures are full of this kind of needless over-complexity ;)

Anyway, the patch in the bug report above lets the user set an environment variable naming the string for kgst to use instead of 'osssink'.
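
Usage then looks something like this (a rough Perl sketch; the variable name here is invented, so check the patch itself for the real one):

    # Hypothetical usage of the patched kgst.  KGST_AUDIOSINK is an invented
    # name -- see the patch for the actual variable it reads.
    $ENV{'KGST_AUDIOSINK'} = 'alsasink';    # use GStreamer's ALSA sink instead of 'osssink'
    exec 'juk' or die "couldn't exec juk: $!";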

Life Hacks: the magic of flat files

Tech: This is the second entry about ‘Life Hacks’. Possibly the best tip I came away from the talk with is this one:

All geeks have a todo.txt file. They use text editors (Word, BBEdit, Emacs, Notepad), not Outlook or what-have-you.

What we keep in our todo is the stuff we want to forget. Geeks say they remember details well, but they forget their spouses’ birthdays and the dry-cleaning. Because it’s not interesting.

It’s the 10-second rule: if you can’t file something in 10 seconds, you won’t do it. Todo.txt involves cut-and-paste, the simplest interface we can imagine.

It’s also the simplest way to find information. EMACS, Moz and Panther have incremental search: when you type a “t” it goes to the first mention of “t”, add “to” and you jump to the first instance of “to”, etc.

Power-users don’t trust complicated apps. Every time a power-geek has had a crash, s/he moves away from it. You can’t trust software unless you’ve written it — and then you’re just more forgiving. Text files are portable (except for CRLF issues) between mac and win and *nix. Geeks will try the Brain, etc., but they want to stay in text.

I was already doing this, having learned the latter lesson ;), but I was making one mistake — I was trying to keep the TODO.txt file small by clearing out old stuff, done stuff, and cut-and-paste snippets of command lines, and by moving things into files in ‘storage’ directories.

That doesn’t work. You think you’ll be able to grep for it later, but you’ll have forgotten what to grep for. You’ll even have forgotten what storage directory you used. The solution is to keep it all in one big file, and use i-search. That really does work.

In fairness, I actually have two files of this type. One is the “real” TODO.txt. But the other is a GPG-encrypted file containing usernames, URLs, passwords, nameservers, VPN settings, etc. I have a feeling this is another common Life Hack idiom, too…

Another great tip in the same vein, from JWZ, is to make an /etc/LOG:

Every machine I admin has a file called /etc/LOG where I keep a script of every system-level change I make (installing software, etc.) I rsync these LOG files around (keeping redundant copies of all of them in several places) so that if/when I need to re-build a server from scratch, it’s just a matter of following the script.

This has been working out great (when I remember to do it. Discipline! ;)

Life Hacks: getting back to the command-line

Tech: So Danny O’Brien’s ‘Life Hacks’ talk is one of the most worthwhile reflections on productivity (and productivity technology) I’ve heard. (Cory Doctorow’s transcript from NotCon 2004, video from ETCon.)

There’s a couple of things I wanted to write about it, so I’ll do them in separate blog entries.

(First off, I’d love to see Ward Cunningham’s ‘cluster files by time’ hack, it sounds very useful. But that’s not what I wanted to write about ;)

People don’t extract stuff from big complex apps using OLE and so on; it’s brittle, and undocumented. Instead they write little command-line scriptlets. Sometimes they do little bits of ‘open this URL in a new window’ OLE-type stuff to use in a pipeline, but that’s about it. And fundamentally, they pipe.

This ties into the post that reminded me to write about it — Diego Doval’s atomflow, which is essentially a small set of command-line apps for Atom storage. Diego notes:

Now, here’s what’s interesting. I have of course been using pipes for years. And yet the power and simplicity of this approach had simply not occurred to me at all. I have been so focused on end-user products for so long that my thoughts naturally move to complex uber-systems that do everything in an integrated way. But that is overkill in this case.

Exactly! He’s not the only one to get that recently — MS and Google are two very high-profile organisations that have picked up the insight; it’s the Egypt way.

There’s fundamentally a breakage point where shrink-wrapped GUI apps cannot do everything you want done, and you have to start developing code yourself — and the best API for that, after 30 years, is still the command-line and pipe metaphor.

(Also, complex uber-apps are what people think is needed — however, that’s just a UI scheme that’s prevailing at the moment. Bear in mind that anyone using the web today uses a command line every day. A command line will not necessarily confuse users.)

Tying back into the Life Hacks stuff — one thing that hasn’t yet been done properly as a command-line-and-pipe tool, though, is web-scraping. Right now, if you scrape, you’ve got to either (a) do lots of munging in a single big fat script of your own devising, if you’re lucky using something like WWW::Mechanize (which is excellent!); (b) use a scraping app like sitescooper; or (c) get hacky with a shell script that runs wget and greps bits of the output out in a really brittle way.
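
For what it’s worth, an option-(a) scriptlet looks something like the sketch below; the URL, link text and marker comments are all made up for illustration:

    #!/usr/bin/perl -w
    # Minimal option-(a) scraper sketch: fetch a page, follow a link, and grep
    # out the 'good bits' with a brittle regexp.  The URL, link text and the
    # marker comments are all hypothetical -- this just shows the shape of it.
    use strict;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://news.example.com/');                    # made-up URL
    $mech->follow_link(text_regex => qr/today.s headlines/i);  # made-up link text

    # the fragile bit: pull out whatever sits between two marker strings
    my ($good_bits) = ($mech->content() =~
        m{<!-- story starts -->(.*?)<!-- story ends -->}s);
    print defined $good_bits ? $good_bits : "markers not found\n";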

I’ve been considering a ‘next-generation sitescooper’ on and off over the past year, and I think the best way to do it is to split its functionality up into individual scripts/perl modules:

  • one to download files, maintaining a cache, taking likely freshness into account, and dealing with crappy HTTP/HTTPS weirdness like cookies, logins and redirects;
  • one to diff HTML;
  • one to lobotomise (i.e. simplify) HTML;
  • one to scrape out the ‘good bits’ using sitescooper-style regions.

Tie those into HTML Tidy and XMLStarlet, and you have an excellent command-line scraping framework.
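
As a taster, here’s a rough sketch of the first of those pieces, the caching downloader, using LWP’s mirror() so unchanged pages aren’t re-fetched; the cache location and naming scheme are just placeholders:

    #!/usr/bin/perl -w
    # Rough sketch of the 'downloader' piece: fetch a URL into a local cache
    # directory, using LWP's mirror() (If-Modified-Since) so unchanged pages
    # aren't re-downloaded, and keeping cookies across runs.  The cache
    # location and naming scheme are simplistic placeholders.
    use strict;
    use LWP::UserAgent;
    use HTTP::Cookies;
    use Digest::MD5 qw(md5_hex);

    my $url   = shift @ARGV or die "usage: $0 URL\n";
    my $cache = "$ENV{HOME}/.scrape-cache";
    mkdir $cache unless -d $cache;

    my $ua = LWP::UserAgent->new(agent => 'scrape-fetch/0.1');
    $ua->cookie_jar(HTTP::Cookies->new(file => "$cache/cookies", autosave => 1));

    my $file = "$cache/" . md5_hex($url);    # one cache file per URL
    my $resp = $ua->mirror($url, $file);     # a 304 leaves the cached copy alone
    die "fetch failed: " . $resp->status_line . "\n"
        unless $resp->is_success or $resp->code == 304;
    print "$file\n";                         # hand the cached filename down the pipeline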

Still haven’t got any time to do all that though. :(

Image Watermarking With ‘pamcomp’

Web: My Dad runs a couple of websites — his architectural photography business, and Andalucia Photo Gallery, a side project selling some lovely photos from the Andalusia region of Spain.

Needless to say, as the family geek, guess who coded all that up? Using WebMake, naturally ;) This was the main reason I wrote the ‘thumbnail_tag’ plugin.

You’ll note, however, that the image to the right is watermarked, quite small, and encoded with a low quality setting. It turned out, after a couple of years of operation, that the images were being downloaded and used in print all over the place — from both sites!

It seems photo piracy is rampant. Even with terms of use clearly linked on the sites, it’s still commonplace for print publications to swipe the images — and not just the little guys, either — some big commercial names have apparently used the images without asking (or paying licensing fees).

The Andalucia gallery site was a favourite; being a good hit for ‘travel photos spain’ meant lots of images being used for holiday pages in magazines, newspapers, and so on.

Needless to say, digital watermarking software doesn’t work — it’s trivial to load an image into Photoshop, resize or crop, and resave, apparently. Even if PS did respect the watermarks, netpbm doesn’t, and a watermarked image isn’t identifiable as such once it appears in print anyway! So we went for the blunt-tool approach, adding visible watermarks to the images.

It’s pretty easy — pamcomp allows you to overlay one image on top of another, using a third as an ‘alpha mask’ to control transparency. The results are pretty nice and not too intrusive.
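
The whole thing scripts up easily; here’s a rough sketch of the sort of wrapper involved (filenames are placeholders, and the netpbm option names are from memory, so check the man pages):

    #!/usr/bin/perl -w
    # Rough sketch: stamp a visible watermark onto a JPEG using netpbm tools.
    # Filenames are placeholders, and the option names are from memory of the
    # netpbm man pages, so double-check pamcomp(1) before relying on this.
    use strict;

    my ($photo, $out) = ('photo.jpg', 'photo-marked.jpg');   # placeholders
    my $overlay = 'watermark.pam';        # the watermark image to overlay
    my $mask    = 'watermark-alpha.pgm';  # greyscale mask controlling transparency

    system("jpegtopnm $photo > base.pnm") == 0
        or die "jpegtopnm failed";
    system("pamcomp -alpha=$mask -align=center -valign=middle $overlay base.pnm > marked.pnm") == 0
        or die "pamcomp failed";
    system("pnmtojpeg --quality=60 marked.pnm > $out") == 0
        or die "pnmtojpeg failed";
    unlink 'base.pnm', 'marked.pnm';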

It’s a shame it has to be done, though… :(

MS Patents sudo(8)

Patents: The varchars.com scraped RSS feeds now include new patent grants and applications by certain companies! Interesting, although given that most developers are advised not to look at patents, perhaps not advisable ;)

However, I glanced at the MS one — and immediately spotted this gem: US Patent 6,775,781, filed by Microsoft, is a patent on the concept of ‘a process configured to run under an administrative privilege level’ which, based on authorization information ‘in a data store’, may perform actions at administrative privilege on behalf of a ‘user process’.

This, and the patent claims, perfectly describe the operation of sudo, fundamentally as it has operated since running on a 4.1BSD VAX-11/750 in 1980.

20 years head start on a patent application — surely that must qualify as prior art ;)

RFID Security

Security: It looks like the security people are starting to take a look at RFID, and it’s not pretty.

I link-blogged this the other day — RFDump is a tool to display and modify data in RFID tags — including deployed ones, at least in some cases. (Think rewriting the price tags in a shop, scrambling the tracking numbers on a warehouse full of goods, or corrupting frequent-shopper data on a card.)

It looks like this was also discussed at USENIX Security ’04 in an RSA presentation (those notes are swarming with typos, but the content’s there ;)

That talk has some interesting stuff — ‘blocker’ tags which spoof readers with gibberish data, or crash the collision-detection network protocol. While that’s being discussed as a security tool here, if the protocol is that hackable and the hardware is available, I could see it having additional interesting effects in a supermarket. Of course, range is an issue — but that hasn’t stopped Bluetooth hacking, wardriving, etc.

If you ask me, it looks an awful lot like RFID is chock-full of security holes, and the features that make it so attractive (low power use, low cost, tiny size) will be the very features that militate against adding security. We could be in for interesting times here…

A ‘Boulder Pledge scoreboard’ website

Spam: Ask Slashdot: How Powerful is the Turn-Off Power of Spam? The question is, ‘How often do you make the decision to NOT buy something from a company because you know they engage in spamming activities?’

This is an old idea — it goes back to a December 1996 column by Roger Ebert, of all people, who proposed the following pledge that all internet users should take:

Under no circumstances will I ever purchase anything offered to me as the result of an unsolicited e-mail message. Nor will I forward chain letters, petitions, mass mailings, or virus warnings to large numbers of others. This is my contribution to the survival of the online community.

8 years later, it’s more important than ever.

However, it’s complicated by one additional factor — not everyone knows which products and companies use spam to advertise. For example, did you know that Kraft routinely advertise their Gevalia coffee through spam?

My suggestion — a daring individual (that rules me out ;) should set up a website where samples of major-product-advertising spam are collected from (trusted) reporters. A quick scoreboard based on how many reports a particular company accumulates, and we have a Boulder Pledge reputation service.

Some simple rules should be applied:

  • Messages arriving at never-used spamtrap addresses, or at addresses scraped from USENET or the web, especially if the message hits several of those addresses (indicating a high volume), would be the basis for a listing;
  • Failure to respect opt-outs, of course, would be a biggie;
  • Using a known spamhaus, or sending via open proxies in Shandong, would be a massive thumbs-down;
  • Failure to clean up its act after being made aware of the problem, oh dear.

It’d be essential to take an extremely careful approach to this; any hint of personal axe-grinding, and the site would be useless, written off as just the work of ‘another anti-spam kook’.

Essentially, this’d be a Fortune-500-oriented version of spamvertized.org.

Reportedly, many of the large companies using spam to advertise are fully aware at a management level that they are responsible for spamming. (That line about open proxies in Shandong is no joke — at least one Fortune 500 company has hired a spamhaus that does this.)

Some spamvertisers, on the other hand, may simply be the victims of an overzealous but clueless marketing department — but either way, a public ‘name and shame’ forum gives them a strong incentive to avoid this problem, at least once they’ve been bitten the first time.

In some cases, it’s dodgy ‘affiliates’ that use spam to advertise their products — but a company that operates affiliates really should post a policy stating that affiliates found to be spamming will be terminated and forfeit their commissions; reportedly, that has been enough to cut off the problem quickly in other programs.

Spamusement rocks!

Spam: oh man, Spamusement started off well, and has just been getting better and better; * HEATH WARNING * had me laughing out loud, and the idea of linking the entries since August 8 as a series is genius.

Announcing IPC::DirQueue

Perl: So, I wrote a new CPAN module recently — IPC::DirQueue. It implements a nifty design pattern for slightly larger systems, ones where multiple processes, possibly on multiple machines, must collaborate to deal with incoming task submissions. To quote the POD:

This module implements a FIFO queueing infrastructure, using a directory as the communications and storage media. No daemon process is required to manage the queue; all communication takes place via the filesystem.

A common UNIX system design pattern is to use a tool like lpr as a task queueing system; for example, this article describes the use of lpr as an MP3 jukebox.

However, lpr isn’t as efficient as it could be. When used in this way, you have to restart each task processor for every new task. If you have a lot of startup overhead, this can be very inefficient. With IPC::DirQueue, a processing server can run persistently and cache data needed across multiple tasks efficiently; it will not be restarted unless you restart it.

Multiple enqueueing and dequeueing processes on multiple hosts (NFS-safe locking is used) can run simultaneously, and safely, on the same queue.

Since multiple dequeuers can run simultaneously, this provides a good way to process a variable level of incoming tasks using a pre-defined number of worker processes.

If you need more CPU power working on a queue, you can simply start another dequeuer to help out. If you need less, kill off a few dequeuers.

If you need to take down the server to perform some maintenance or upgrades, just kill the dequeuer processes, perform the work, and start up new ones. Since there’s no ‘socket’ or similar point of failure aside from the directory itself, the queue will just quietly fill with waiting jobs until the new dequeuer is ready.

Arbitrary ‘name = value’ metadata pairs can be transferred alongside data files. In fact, in some cases, you may find it easier to send unused and empty data files, and just use the ‘metadata’ fields to transfer the details of what will be worked on.

Sound interesting? Here’s the tarball.
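
In practice it looks roughly like this, along the lines of the module’s synopsis (the queue path is made up):

    # enqueueing side: drop a data file onto the queue
    use IPC::DirQueue;
    my $dq = IPC::DirQueue->new({ dir => '/var/spool/myqueue' });   # made-up path
    $dq->enqueue_file('some-data-file');

    # dequeueing side: a worker picks up one queued job and processes it
    my $worker = IPC::DirQueue->new({ dir => '/var/spool/myqueue' });
    if (my $job = $worker->pickup_queued_job()) {
        print "working on ", $job->get_data_path(), "\n";   # ...do the real work here...
        $job->finish();                                      # mark the job as completed
    }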

CEAS Roundup

Spam: So, CEAS was great fun, and very educational:

  • Got to meet up with various antispammers, including Daniel and Theo from the SpamAssassin dev team, Jeff Chan from SURBL, Dan Kohn from Habeas, Catherine Hampton from The SpamBouncer, Miles Libbey, John Levine, Neil Schwartzman — lots of good chats.
  • MS really know how to feed a conference! I hear rumours there was an extra-special tinned-meat-product-based dish at the banquet…
  • But their firewalling tendencies put a serious damper on keeping in touch with the outside world, at least until we set up an SSH tunnel on port 443 ;)
  • During a lull, Dan Kohn fired off a hands-up census — a good 75% of the attendees (roughly) admitted to using SpamAssassin!

My highlight papers:

  • IBM’s Chung-Kwei pattern-discovery system — the one which Mark dug up. Very interesting stuff; it turns out that bioinformatics is full of large corpora of data (genomes) which you then need to find patterns in. Funnily enough, so is SpamAssassin: s/genomes/spam/, s/patterns/regular expressions/. The more advanced pattern-discovery algorithms even allow complex patterns to contain alternative blocks, ‘don’t-cares’ and similar regular-expression-like features.

    The really good bit of Chung-Kwei is the Teiresias algorithm (more pages, online demo). Of course, being IBM research, it’s probably patented to the hilt, and may be tricky to license; but it’s certainly pointed us in a whole new interesting direction — anyone know any bioinformaticians?

    IBM is really gearing up on anti-spam research. 4 of the 6 papers listed on their website were presented this year, at CEAS.

  • Another good paper was On Attacking Statistical Spam Filters, by Gregory L. Wittel and S. Felix Wu, which (similarly to Henry Stern’s submission, which I helped a little with) dealt with an attack on Bayesian filters.

    This is interesting stuff; we’re pretty sure the attack isn’t as serious against SpamAssassin’s implementation as it could possibly be, but it’s still a serious one.

  • The Impact of Feature Selection on Signature-Driven Spam Detection was an interesting paper on AOL’s new signature schemes. (The conference was sponsored by Cloudmark, BTW, but those guys were nowhere to be seen — which means they missed this presentation ;)
  • Reputation Network Analysis for Email Filtering was interesting, in that it mirrors to a degree the thinking behind web-o-trust.org, but in my opinion it suffered from a lack of thought about avoiding spoofing (by including IP address information in the FOAF file, it could do this now). However, once SPF becomes pervasive, this approach could be combined with it to generate personalised webs of trust usable for email whitelisting.
  • Resisting SPAM Delivery by TCP Damping was very nifty; plug a classifier into your MTA, and thereby detect connections from spam relays. Once you’ve found them, you then throttle down their connection as they attempt to deliver spam. Some other TCP-level tricks can do nifty stuff like massively increasing the bandwidth consumption of the spamming machines. Very very nice!

I took copious notes on the SpamAssassin wiki, if anyone’s curious.