Bayesian learning animation

Spam: this month’s edition of John Graham-Cumming’s excellent anti-spam newsletter points to a very cool animation of the dbacl Bayesian anti-spam filter being trained to classify a mail corpus. Here’s the animation:

And Laird’s explanation:

dbacl computes two scores for each document, a ham score and a spam score. Technically, each score is a kind of distance, and the best category for a document is the lowest scoring one. One way to define the spamminess is to take the numerical difference of these scores.

Each point in the picture is one document, with the ham score on the x-axis and the spam score on the y-axis. If a point falls on the diagonal y=x, then its scores are identical and both categories are equally likely. If the point is below the diagonal, then the classifier must mark it as spam, and above the diagonal it marks it as ham.

The points are colour coded. When a document is learned we draw a square (blue for ham, red for spam). The picture shows the current scores of both the training documents, and the as yet unknown documents in the SA corpus. The unknown documents are either cyan (we know it’s ham but the classifier doesn’t), magenta (spam), or black. Black means that at the current state of learning, the document would be misclassified, because it falls on the wrong side of the diagonal. We don’t distinguish the types of errors. Only we know the point is black, the classifier doesn’t.

At time zero, when nothing has been learned, all the points are on the diagonal, because the two categories are symmetric.

Over time, the points move because the classifier’s probabilities change a little every time training occurs, and the clouds of points give an overall picture of what dbacl thinks of the unknown points. Of course, the more documents are learned, the fewer unknown points are left.
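
The decision rule and colour coding described above can be sketched in a few lines of Python. This is only an illustration of the explanation, not dbacl's own code: the function names, the sign convention for spamminess, and the handling of points exactly on the diagonal are all assumptions.

```python
def classify(ham_score, spam_score):
    """Pick the category with the lowest score (scores act like distances)."""
    if spam_score < ham_score:
        return "spam"    # point lies below the diagonal y = x
    if ham_score < spam_score:
        return "ham"     # point lies above the diagonal
    return "unsure"      # exactly on the diagonal: both equally likely


def spamminess(ham_score, spam_score):
    """Difference of the two scores.

    Assumed sign convention: positive means the document leans spam.
    """
    return ham_score - spam_score


def colour(true_label, ham_score, spam_score, learned):
    """Colour of one point, following the scheme in the explanation."""
    if learned:
        # Training documents are drawn as squares: blue for ham, red for spam.
        return "blue" if true_label == "ham" else "red"
    if classify(ham_score, spam_score) != true_label:
        # Would be misclassified right now. Treating a point exactly on
        # the diagonal as black is an assumption of this sketch.
        return "black"
    return "cyan" if true_label == "ham" else "magenta"
```

For example, an unlearned spam message with a ham score of 3 and a spam score of 5 sits above the diagonal, so `classify` returns "ham" and the point is drawn black.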

This is an excellent visualisation of the process, and demonstrates nicely what happens when you train a Bayesian spam filter. You can clearly see the ‘unsure’ classifications becoming more reliable as the training corpus size increases. Very nice work!

It’s interesting to note the effects of an unbalanced corpus early on; a lot of spam training and little ham training results in a noticeable bias towards the classifier returning a spam classification.
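
One possible contributor to that bias, assuming a textbook naive Bayes model whose class prior is estimated from the training counts (an assumption for illustration; dbacl's actual model may well differ), is the prior term alone. A minimal sketch, with a hypothetical `prior_scores` helper:

```python
import math


def prior_scores(n_ham, n_spam):
    """Return (ham_score, spam_score) from the class prior alone.

    Scores are negative log probabilities, so lower is better.
    Assumes a plain naive Bayes prior estimated from document counts,
    which is not necessarily what dbacl does.
    """
    total = n_ham + n_spam
    return -math.log(n_ham / total), -math.log(n_spam / total)


# Unbalanced early training: 10 ham documents, 90 spam documents.
ham_score, spam_score = prior_scores(n_ham=10, n_spam=90)
print(spam_score < ham_score)  # the spam score is lower before any words are seen
```

Under that assumption, every unknown document starts out nudged towards spam, and the nudge vanishes once the two counts are equal.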


Muff News

Travel: I’m just back from a great road trip around Nevada and Arizona – lots of fun was had, and I even came out $100 up on the blackjack!

In other travels, my mate Eoin recently visited Muff, Co. Donegal, and made sure to get a picture of the event.

Muff is renowned as one of those towns with a silly name; the story goes that they even have a SCUBA diving club called, guess what, “Muff Diving Club”. Sadly, the reports are apparently greatly exaggerated. Eoin writes:

I have been hearing the story of the ‘muff diving club’ for the last 10 years, and now I can categorically state that it’s an urban legend. No such thing. There was a ‘top muff’ petrol station though where we picked up a few keyrings. The girl behind the counter was trying to give us all 200 keyrings left in the bag as she was so sick of muppets like us coming in for a laugh.


Daytime Fireballs

Astronomy: APOD: A Daytime Fireball Over South Wales. Great picture of a fireball disintegrating in the daytime sky.

I saw a similar daytime fireball streak through the sky when I was in Fraser Island in Australia last year; a little bit smaller than this one, mind you ;) Unfortunately, I didn’t get a picture in time. Very cool though!

