Spam: via John Graham-Cumming’s excellent anti-spam newsletter this month
comes a very cool animation of the dbacl Bayesian anti-spam filter being
trained to classify a mail corpus. Here’s the animation:

And Laird’s explanation:
dbacl computes two scores for each document, a ham score and a spam
score. Technically, each score is a kind of distance, and the best
category for a document is the lowest scoring one. One way to define
the spamminess is to take the numerical difference of these scores.
Each point in the picture is one document, with the ham score on the
x-axis and the spam score on the y-axis. If a point falls on the
diagonal y=x, then its scores are identical and both categories are
equally likely. If the point is below the diagonal, then the
classifier must mark it as spam, and above the diagonal it marks it
as ham.
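To make the geometry concrete, here is a minimal sketch of that decision rule in Python. It is not dbacl's code: the function names, the sign convention for spamminess (ham score minus spam score, so positive means spam), and the "tie" result for points exactly on the diagonal are all assumptions made for illustration.

```python
def spamminess(ham_score, spam_score):
    """Difference of the two scores (hypothetical sign convention).

    Scores behave like distances, so lower is better. With this
    convention a positive result means the spam score is the lower
    one, i.e. the point (x=ham_score, y=spam_score) lies below y = x.
    """
    return ham_score - spam_score


def classify(ham_score, spam_score):
    """Pick the lowest-scoring category; report ties explicitly."""
    d = spamminess(ham_score, spam_score)
    if d > 0:
        return "spam"   # below the diagonal: spam score is lower
    if d < 0:
        return "ham"    # above the diagonal: ham score is lower
    return "tie"        # on the diagonal: both equally likely


print(classify(10.0, 4.0))  # spam score is lower
print(classify(3.0, 8.0))   # ham score is lower
print(classify(5.0, 5.0))   # identical scores
```

The scores passed in above are made-up numbers, not real dbacl output.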
The points are colour coded. When a document is learned we draw a
square (blue for ham, red for spam). The picture shows the current
scores of both the training documents, and the as yet unknown
documents in the SA corpus. The unknown documents are either cyan
(we know it’s ham but the classifier doesn’t), magenta (spam), or
black. Black means that at the current state of learning, the
document would be misclassified, because it falls on the wrong side
of the diagonal. We don’t distinguish the types of errors. Only we
know the point is black, the classifier doesn’t.
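The colour scheme described above can be sketched as a small Python function. This is an illustration only, not the code used to make the animation: the function name and parameters are hypothetical, and treating a point exactly on the diagonal as "not an error" is an assumption, since the explanation does not say how ties are drawn.

```python
def point_colour(true_label, learned, ham_score, spam_score):
    """Return the colour for one document in the plot.

    true_label -- "ham" or "spam" (known to us, not to the classifier)
    learned    -- True if the document has already been used for training
    """
    if learned:
        # Training documents are drawn as squares: blue ham, red spam.
        return "blue" if true_label == "ham" else "red"

    # Lower score wins, so the predicted category is the smaller score.
    if spam_score < ham_score:
        predicted = "spam"
    elif ham_score < spam_score:
        predicted = "ham"
    else:
        predicted = true_label  # assumption: a tie is not drawn as an error

    if predicted != true_label:
        return "black"  # wrong side of the diagonal, either kind of error
    return "cyan" if true_label == "ham" else "magenta"


print(point_colour("ham", False, 3.0, 8.0))    # correctly placed ham
print(point_colour("ham", False, 10.0, 4.0))   # ham on the spam side
```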
At time zero, when nothing has been learned, all the points are on
the diagonal, because the two categories are symmetric.
Over time, the points move because the classifier’s probabilities
change a little every time training occurs, and the clouds of points
give an overall picture of what dbacl thinks of the unknown
points. Of course, the more documents are learned, the fewer unknown
points are left.
This is an excellent visualisation of the process, and demonstrates
nicely what happens when you train a Bayesian spam-filter. You
can clearly see the ‘unsure’ classifications becoming more reliable
as the training corpus size increases. Very nice work!
It’s interesting to note the effects of an unbalanced corpus early on: a
lot of spam training and little ham training noticeably biases the
classifier towards returning a spam classification.