Linus on Bayesian filtering

Linus Torvalds, in a post to linux-kernel today:

I’m sorry, but spam-filtering is simply harder than the bayesian word-count weenies think it is. I even used to know something about bayesian filtering, since it was one of the projects I worked on at uni, and dammit, it’s not a good approach, as shown by the fact that it’s trivial to get around.

I don’t know why people got so excited about the whole bayesian thing. It’s fine as one small clause in a bigger framework of deciding spam, but it’s totally inappropriate for a “yes/no” kind of decision on its own.

If you want a yes/no kind of thing, do it on real hard issues, like not accepting email from machines that aren’t registered MX gateways. Sure, that will mean that people who just set up their local sendmail thing and connect directly to port 25 will just not be able to email, but let’s face it, that’s why we have ISP’s and DNS in the first place.

But don’t do it purely on some bogus word analysis.

If you want to do word analysis, use it like SpamAssassin does it – with some Bayesian rule perhaps adding a few points to the score. That’s entirely appropriate. But running bogo-filter instead of spamassassin is just asinine.

Me, I like bogofilter: those guys are cool, and it’s a great anti-spam product for many purposes. But of course I have to agree with Linus that the correct approach in most cases takes in a bigger picture than Bayes alone, a la SpamAssassin ;)
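
To make the "Bayes as one clause in a bigger framework" idea concrete, here is a rough Python sketch of score-based filtering. This is not SpamAssassin's actual code: the rule set, the point values, the threshold of 5.0 and every function name are made up for illustration, and the Bayes part is a bare-bones naive Bayes with add-one smoothing rather than anything a real filter ships.

```python
import math

# Hypothetical threshold and rule weights, chosen for illustration only.
SPAM_THRESHOLD = 5.0


def bayes_spam_probability(tokens, spam_counts, ham_counts, n_spam, n_ham):
    """Naive Bayes estimate of P(spam | tokens) with add-one smoothing.

    spam_counts / ham_counts map a token to the number of training
    messages of that class containing it; n_spam / n_ham are the class sizes.
    """
    log_odds = math.log(n_spam / n_ham)  # prior odds from the training mix
    for tok in set(tokens):
        p_tok_spam = (spam_counts.get(tok, 0) + 1) / (n_spam + 2)
        p_tok_ham = (ham_counts.get(tok, 0) + 1) / (n_ham + 2)
        log_odds += math.log(p_tok_spam / p_tok_ham)
    # Clamp so math.exp cannot overflow on very long messages.
    log_odds = max(-50.0, min(50.0, log_odds))
    return 1.0 / (1.0 + math.exp(-log_odds))


def score_message(bayes_p, subject, relay_has_rdns):
    """Add up points from several independent rules (all hypothetical)."""
    score = 0.0
    # The Bayes verdict only nudges the score; it never decides alone.
    if bayes_p >= 0.99:
        score += 3.5
    elif bayes_p >= 0.80:
        score += 1.5
    elif bayes_p <= 0.01:
        score -= 1.5
    # A couple of non-content rules, standing in for a real rule set.
    if subject.isupper():
        score += 1.0
    if not relay_has_rdns:
        score += 2.0
    return score


def is_spam(score):
    return score >= SPAM_THRESHOLD


# Toy training data: 100 spam and 100 ham messages.
spam_counts = {"viagra": 90, "free": 80}
ham_counts = {"free": 5, "kernel": 60}

p = bayes_spam_probability(["free", "viagra"], spam_counts, ham_counts, 100, 100)
bayes_only = score_message(p, "hello", relay_has_rdns=True)
with_other_rules = score_message(p, "FREE VIAGRA", relay_has_rdns=False)
print(is_spam(bayes_only), is_spam(with_other_rules))
```

With these invented weights, even a very confident Bayes verdict stays under the threshold by itself, and the message is only flagged once other rules agree. That is the whole point of the scoring approach: a word-count trick that fools the Bayes rule moves the score by a few points instead of flipping the decision.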

    “Linus on Bayesian filtering” is close to the platonic ideal of a post to this blog.

    That Linus, he’s no spamateur! ;)

    Yawn. I wish Linus would stick to what he’s good at, because anything else he rants about will be picked up and circulated and perhaps even believed.

    What’s so difficult to believe about what Linus said? To me, it’s just common sense: I don’t believe there will ever be a content filter that can cut directly to the binary kill/no-kill decision. One of the patterns that starts mushrooming beyond any control is the natural inclination to make unpredictable errors. Within the context of a single cross-cultural population, for instance, it can be shown that certain substitutions and permutations of the preferred or prescribed syllables or “symbol-groups” have a certain predictability. Add to that the word patterns caused by multiple cross-cultural groups using multiple languages and it’s easy to see that just the accidental errors prevent appropriate filtering based on “content.” Then, when you consider all the “personal styles” and intentional errors, you can see how easy it is to “get around” a filter that bases decisions purely on word analysis.

    Anybody who has done command-line processing probably deserves an opinion on the use of Bayesian rules. Anybody who would assert that Linus can’t parse his way out of a paper bag has bigger problems than an inflated ego. It’s one thing to have examined myriad proposed solutions and quite another to be infatuated with Bayesian techniques. There are a lot of good hammers in my toolbox: I’ve yet to use any of them at the dinner table (at least during meals).