October 2, 2006 - Justin's Linklog

Linus Torvalds, in a post to linux-kernel today:

I’m sorry, but spam-filtering is simply harder than the bayesian word-count weenies think it is. I even used to know something about bayesian filtering, since it was one of the projects I worked on at uni, and dammit, it’s not a good approach, as shown by the fact that it’s trivial to get around.

I don’t know why people got so excited about the whole bayesian thing. It’s fine as one small clause in a bigger framework of deciding spam, but it’s totally inappropriate for a “yes/no” kind of decision on its own.

If you want a yes/no kind of thing, do it on real hard issues, like not accepting email from machines that aren’t registered MX gateways. Sure, that will mean that people who just set up their local sendmail thing and connect directly to port 25 will just not be able to email, but let’s face it, that’s why we have ISP’s and DNS in the first place.

But don’t do it purely on some bogus word analysis.

If you want to do word analysis, use it like SpamAssassin does it – with some Bayesian rule perhaps adding a few points to the score. That’s entirely appropriate. But running bogo-filter instead of spamassassin is just asinine.

Me, I like bogofilter: those guys are cool, and it’s a great anti-spam product for many purposes. But of course I have to agree with Linus that the correct approach in most cases is to look at a bigger picture than Bayes alone, a la SpamAssassin ;)
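
To make the "Bayes as one small clause" idea concrete, here is a rough sketch of a score-based filter in which a word-count classifier can only nudge the total. It is not SpamAssassin's code or its real ruleset: the rule names, the point values, the `sender_has_mx` field and the 2-point cap on the Bayes contribution are all made up for illustration. The 5.0 threshold is borrowed from SpamAssassin's default `required_score`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class WordCountBayes:
    """Tiny naive Bayes word-count classifier with add-one smoothing."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, label, text):
        self.words[label].update(tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        vocab_size = len(set(self.words["spam"]) | set(self.words["ham"]))
        totals = {k: sum(c.values()) for k, c in self.words.items()}
        # Start from the smoothed prior odds, then add per-word evidence.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for word in tokenize(text):
            p_spam = (self.words["spam"][word] + 1) / (totals["spam"] + vocab_size + 1)
            p_ham = (self.words["ham"][word] + 1) / (totals["ham"] + vocab_size + 1)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() from overflowing
        return 1.0 / (1.0 + math.exp(-log_odds))


REQUIRED_SCORE = 5.0    # total points needed to call a message spam
BAYES_MAX_POINTS = 2.0  # hypothetical cap: Bayes alone can never reach the threshold

# Hypothetical non-Bayes rules as (name, points, test) tuples.
RULES = [
    ("SUBJECT_ALL_CAPS", 1.5, lambda msg: msg["subject"].isupper()),
    ("NO_MX_FOR_SENDER", 2.5, lambda msg: not msg["sender_has_mx"]),
    ("MANY_LINKS", 1.0, lambda msg: msg["body"].count("http") >= 3),
]


def score(msg, bayes):
    """Sum the points of every rule that fires, plus a bounded Bayes term."""
    points = sum(pts for _name, pts, test in RULES if test(msg))
    p = bayes.spam_probability(msg["body"])
    # Map probability 0..1 onto -BAYES_MAX_POINTS..+BAYES_MAX_POINTS.
    return points + (p - 0.5) * 2 * BAYES_MAX_POINTS


def is_spam(msg, bayes):
    return score(msg, bayes) >= REQUIRED_SCORE


bayes = WordCountBayes()
bayes.train("spam", "cheap pills buy now")
bayes.train("spam", "buy cheap watches now")
bayes.train("ham", "kernel patch review attached")
bayes.train("ham", "meeting notes for the kernel patch")

# Same spammy wording in both; only the second trips the other rules too.
wordy_but_clean = {"subject": "Re: offer", "body": "buy cheap pills now", "sender_has_mx": True}
wordy_and_shady = {"subject": "BUY NOW", "body": "buy cheap pills now", "sender_has_mx": False}

print(is_spam(wordy_but_clean, bayes), is_spam(wordy_and_shady, bayes))
```

The point of the cap is that spammy wording on its own never crosses the threshold, and hammy wording can only subtract a couple of points. A message gets rejected only when the word analysis agrees with other, harder evidence.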

Linus on Bayesian filtering