Bayesian learning animation

Spam: via John Graham-Cumming’s excellent anti-spam newsletter this month comes a very cool animation of the dbacl Bayesian anti-spam filter being trained to classify a mail corpus. Here’s the animation:

And Laird’s explanation:

dbacl computes two scores for each document, a ham score and a spam score. Technically, each score is a kind of distance, and the best category for a document is the lowest scoring one. One way to define the spamminess is to take the numerical difference of these scores.

Each point in the picture is one document, with the ham score on the x-axis and the spam score on the y-axis. If a point falls on the diagonal y=x, then its scores are identical and both categories are equally likely. If the point is below the diagonal, then the classifier must mark it as spam, and above the diagonal it marks it as ham.

The points are colour coded. When a document is learned we draw a square (blue for ham, red for spam). The picture shows the current scores of both the training documents, and the as yet unknown documents in the SA corpus. The unknown documents are either cyan (we know it’s ham but the classifier doesn’t), magenta (spam), or black. Black means that at the current state of learning, the document would be misclassified, because it falls on the wrong side of the diagonal. We don’t distinguish the types of errors. Only we know the point is black, the classifier doesn’t.

At time zero, when nothing has been learned, all the points are on the diagonal, because the two categories are symmetric.

Over time, the points move because the classifier’s probabilities change a little every time training occurs, and the clouds of points give an overall picture of what dbacl thinks of the unknown points. Of course, the more documents are learned, the fewer unknown points are left.
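To make the geometry concrete, here is a rough Python sketch of the decision rule and colour legend described above. It is not dbacl code. The function names, the sign convention for spamminess, and the handling of points exactly on the diagonal are my own assumptions.

```python
def spamminess(ham_score, spam_score):
    """Difference of the two scores; positive means the spam category fits better.

    The sign convention is an assumption: the text only says
    'numerical difference'.
    """
    return ham_score - spam_score


def classify(ham_score, spam_score):
    """Pick the lowest-scoring category; a point on the diagonal is a tie."""
    if spam_score < ham_score:
        return "spam"   # below the diagonal y = x
    if ham_score < spam_score:
        return "ham"    # above the diagonal
    return "tie"        # on the diagonal: both categories equally likely


def point_colour(true_label, learned, ham_score, spam_score):
    """Colour of one document in the picture, following the legend above."""
    if learned:
        # Training documents are drawn as squares: blue for ham, red for spam.
        return "blue" if true_label == "ham" else "red"
    guess = classify(ham_score, spam_score)
    if guess not in (true_label, "tie"):
        return "black"  # wrong side of the diagonal: would be misclassified
    # Assumption: a tie is not counted as an error.
    return "cyan" if true_label == "ham" else "magenta"
```

With made-up scores, `classify(120.0, 95.5)` picks spam because the spam score is the lower of the two, and an unlearned ham document with those scores would be drawn in black.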

This is an excellent visualisation of the process, and demonstrates nicely what happens when you train a Bayesian spam-filter. You can clearly see the ‘unsure’ classifications becoming more reliable as the training corpus size increases. Very nice work!

It’s interesting to note the effects of an unbalanced corpus early on; a lot of spam training and little ham training results in a noticeable bias towards the classifier returning a spam classification.


IBM patents web transcoding proxies

Web: I link-blogged this, but it’s generated some email already, so it deserves a proper posting.

One thing you quickly learn about IBM where software patents are concerned is that if IBM Research is making noise about a new software technique, they’ve probably patented it already. A few years ago, IBM was keen on HTTP transcoding: rewriting web content in a proxy to make it more suitable for display and access on less-capable devices, like PDAs and mobile phones.

So I probably should not have been surprised today when I came across USPTO patent 6,886,013, which is an IBM patent on a ‘HTTP caching proxy to filter and control display of data in a web browser’. It was applied for on Sep 11 1997, and finally granted on Apr 26 of this year.

The first claim covers:

  1. A method of controlling presentation on a client of a Web document formatted according to a markup language and supported on a server, the client including a browser and connectable to the server via a computer network, the method comprising the steps of:

    as the Web document is received on the client, parsing the Web document to identify formatting information;

    altering the formatting information to modify at least one display characteristic of the Web document; and

    passing the Web document to the browser for display.

Notice that there’s actually no mention of an HTTP proxy there; in other words, an in-browser rewriting element, such as Greasemonkey or Trixie, may be covered by that claim. However, the claim does indicate that the document is passed from the ‘client’ to the ‘browser’, so perhaps having the ‘client’ inside the ‘browser’ evades that.

It appears this really wasn’t original research even when the patent was applied for; there’s probable prior art, even if the patent itself doesn’t cite it. For example, WWW4 in 1995 included Application-Specific Proxy Servers as HTTP Stream Transducers, which discusses ‘transduction’ of the HTTP traffic and gives an example of ‘A “rewriting” OreO (transducer element) that encapsulates each anchor inside the Netscape Blink extension, making anchors easier to spot on monochrome displays’. On top of that, Craig Hughes notes that his ‘senior project at Stanford in 1992 was an implementation of a content-modifying HTTP proxy. It re-worked HTML in http streams to add some markup to enable full navigability through touch screen or voice control, for screen-only kiosks.’
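For a sense of how little is involved, here is a rough Python sketch of the parse, alter, pass-along steps, modelled on the anchor-wrapping transducer quoted above. It is not the OreO code or anything from the patent. The names `BlinkAnchors` and `transduce` are made up for illustration, and it uses only the standard library’s `html.parser`.

```python
from html.parser import HTMLParser


class BlinkAnchors(HTMLParser):
    """Re-emit HTML as parsed, except that each <a>...</a> is wrapped in <blink>."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be passed through.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.out.append("<blink>")            # alter the formatting
        self.out.append(self.get_starttag_text()) # original tag text, verbatim

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text()) # e.g. <br/>

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag == "a":
            self.out.append("</blink>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def transduce(html):
    """Parse the document, rewrite anchors, and return the HTML to hand on."""
    parser = BlinkAnchors()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Feeding it `<p>See <a href="http://example.org/">this</a></p>` gives back the same markup with the anchor wrapped in `<blink>…</blink>`. Whether that runs in a proxy or inside the browser makes no difference to the code.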

Add this to the ever-growing list of over-broad software patents.


Thank you, MS Word Metadata

Politics: California AG forwards anti-P2P screed on behalf of the MPAA.

However, the metadata associated with the Microsoft Word document indicates it was either drafted or reviewed by a senior vice president of the Motion Picture Association of America. According to this metadata (automatically generated by the Word application), the document’s author or editor is ‘stevensonv.’ (The metadata of a document is viewable through the File menu under Properties.)

Sources tell Wired News that the draft letter’s authorship is attributed to Vans Stevenson, the MPAA’s senior vice president for state legislative affairs. MPAA representatives have issued similar criticisms of P2P technology in the past. Stevenson could not be reached for comment.
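The same kind of check can be scripted. The sketch below is an illustration only, and it assumes the newer zip-based .docx format, which is not the binary .doc format the letter above would have used. The function name `docx_authors` is made up. It reads the author and last-editor fields from the `docProps/core.xml` part of the file.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core-properties part of a .docx package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def docx_authors(path_or_file):
    """Return (creator, last_modified_by) from a .docx file's core properties.

    Either value is None if the field is absent.
    """
    with zipfile.ZipFile(path_or_file) as z:
        root = ET.fromstring(z.read("docProps/core.xml"))
    creator = root.findtext("dc:creator", default=None, namespaces=NS)
    last_editor = root.findtext("cp:lastModifiedBy", default=None, namespaces=NS)
    return creator, last_editor
```

Called on a file path, it returns whatever names the word processor stored, which is often a login name like the one quoted above.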

Funny: Humorix: Feds Unveil Practical Method To Combat Spam. ‘“If a spammer has access to a list of millions of clue-impaired users, they won’t need to bother sending spam to anybody else”, Thullweppon argued.’ (thanks to Kenneth Porter for the link!)


MS Word’s change history feature strikes again

Security: SCO accidentally leaked their previous lawsuit plans — to sue Bank of America — through MS Word’s ability to retain prior changes in a Word document.

This seems as good a time as any to re-plug find-hidden-word-text, a quick perl hack to use ‘antiword’ to extract hidden text from MS Word documents in an automated fashion, based on Simon Byers’ paper Scalable Exploitation of, and Responses to Information Leakage Through Hidden Data in Published Documents. It works well ;)
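The general idea of comparing two text extractions can be sketched in a few lines of Python. This is not the script itself, and I’m not claiming it works the same way. It assumes you already have two plain-text dumps of the same document, one as normally displayed and one including the hidden material. The function name `hidden_lines` is made up.

```python
import difflib


def hidden_lines(visible_text, full_text):
    """Return the lines that appear in full_text but not in visible_text."""
    diff = difflib.ndiff(visible_text.splitlines(), full_text.splitlines())
    # ndiff prefixes lines unique to the second sequence with "+ ".
    return [line[2:] for line in diff if line.startswith("+ ")]
```

Given a visible dump of `"Dear customer\nRegards"` and a full dump with an extra `"DRAFT: do not send"` line in the middle (invented sample text), it reports just that extra line.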

Safety: Great Malcolm Gladwell article on S.U.V.’s. My favourite bit:

when, in focus groups, industry marketers probed further, they heard things that left them rolling their eyes. …. what consumers said was ‘If the vehicle is up high, it’s easier to see if something is hiding underneath or lurking behind it.’

Bradsher brilliantly captures the mixture of bafflement and contempt that many auto executives feel toward the customers who buy their S.U.V.s. Fred J. Schaafsma, a top engineer for General Motors, says, ‘Sport-utility owners tend to be more like ‘I wonder how people view me,’ and are more willing to trade off flexibility or functionality to get that.’ According to Bradsher, internal industry market research concluded that S.U.V.s tend to be bought by people who are insecure, vain, self-centered, and self-absorbed, who are frequently nervous about their marriages, and who lack confidence in their driving skills.

… Toyota’s top marketing executive in the United States, Bradsher writes, loves to tell the story of how at a focus group in Los Angeles ‘an elegant woman in the group said that she needed her full-sized Lexus LX 470 to drive up over the curb and onto lawns to park at large parties in Beverly Hills.’

Social: Ted Leung: Google requires that its employees spend 20% of their working hours on ‘personal projects’. Wow.


What’s in a Name?

Mark Pilgrim quotes some guy called “Kevin Hemenway” who wrote a document called The Semantic Web: 1-2-3. So I was thinking “hmmm… Kevin Hemenway… I thought Morbus Iff wrote that”. Then the penny dropped. Another pseudonym blown apart by the callous Mark Pilgrim!


(Untitled)

Due to a set of advocacy and plain show-off mails recently, regarding sub-pixel font rendering under Linux, my hand has been forced ;)

As a result, here’s a little HOWTO document I’ve written up for getting sub-pixel rendering working under Linux. Check it out if you’ve got a Linux laptop and want some sweet-looking fonts!


(Untitled)

Some vague web musing: while reading Cory Doctorow’s “Metacrap” essay on metadata, I noticed this:

Certain kinds of implicit metadata is awfully useful, in fact. Google exploits metadata about the structure of the World Wide Web: by examining the number of links pointing at a page (and the number of links pointing at each linker), Google can derive statistics about the number of Web-authors who believe that that page is important enough to link to, and hence make extremely reliable guesses about how reputable the information on that page is.

He’s right, of course; that’s how Google works. But while reading this, it occurred to me that this implicitly rewards websites that consist of small numbers of large pages, instead of large numbers of short pages; if your site has a page for every sub-heading (think of a Linux HOWTO document here), and a linker to your site links to the page that’s relevant to what they’re talking about, your Google ranking will be lower than if you keep the document all in one page and use named anchors.
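A toy illustration in Python of why that happens, counting raw inbound links per page. This is only link counting under my own assumptions, not Google’s actual ranking. The function name `inbound_counts` and the example.org URLs are made up.

```python
from collections import Counter
from urllib.parse import urldefrag


def inbound_counts(links):
    """Count inbound links per page, ignoring any #fragment on the URL."""
    return Counter(urldefrag(url).url for url in links)


# Three linkers, each pointing at the section they care about.
split_site = [
    "http://example.org/howto/intro.html",
    "http://example.org/howto/fonts.html",
    "http://example.org/howto/xft.html",
]
single_page = [
    "http://example.org/howto.html#intro",
    "http://example.org/howto.html#fonts",
    "http://example.org/howto.html#xft",
]
```

Here `inbound_counts(split_site)` gives each of the three pages a single link, while `inbound_counts(single_page)` credits all three links to the one page, since named anchors don’t change which page is being linked to.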

Personally, despite what Jakob Nielsen thinks, I prefer the all-in-one page mode. It’s quicker to download (overall), easier to print or read offline, and I’m not afraid to use a scrollbar. Interesting to see that Google (accidentally) recommends it too ;)

The rest of the essay is spot on, in my opinion.

BTW, Cory also writes for Boing Boing, one of the coolest mags I used to read back when, and now a top-quality weblog.
