Google’s CAPTCHA – not entirely broken after all?

A couple of weeks ago, WebSense posted this article with details of a spammer’s attack on Google’s CAPTCHA puzzle, using web services running on two centralized servers:

[...] It is observed that two separate hosts active on same domain are contacted during the entire process. These two hosts work collaboratively during the CAPTCHA break process. [...]

Why [use 2 hosts]? Because of variations included in the Google CAPTCHA image, chances are that host 1 may fail breaking the code. Hence, the spammers have a backup or second CAPTCHA-learning host 2 that tries to learn and break the CAPTCHA code. However, it is possible that spammers also use these two hosts to check the efficiency and accuracy of both hosts involved in breaking one CAPTCHA code at a time, with the ultimate goal of having a successful CAPTCHA breaking process.

To be specific, host 1 has a similar concept that was used to attack Live mail CAPTCHA. This involved extracting an image from a victim’s machine in the form of a bitmap file, bearing BM.. file headers and breaking the code. Host 2 uses an entirely different concept wherein the CAPTCHA image is broken into segments and then sent as a portable image / graphic file bearing PV..X file headers as requests. [...]

While it doesn’t say so outright, some have read the post to mean that Google’s CAPTCHA has been solved algorithmically. I’m pretty sure this isn’t the case. Here’s why.

Firstly, the FAQ text that appears on “host 1” (thanks Alex for the improved translation!):

[image]

FAQ

If you cannot recognize the image or if it doesn’t load (a black or empty image gets displayed), just press Enter.

Whatever happens, do not enter random characters!!!

If there is a delay in loading images, exit from your account, refresh the page, and log in again.

The system was tested in the following browsers: Internet Explorer, Mozilla Firefox

Before each payment, recognized images are checked by the admin. We pay only for correctly recognized images!!!

Payment is made once per 24 hours. The minimum payment amount is $3. To request payment, send your request to the admin by ICQ. If the admin is free, your request will be processed within 10-15 minutes, and if he is busy, it will be processed as soon as possible.

If you have any problems (questions), ICQ the admin.

That reads to me a lot like instructions to human “CAPTCHA farmers”, working as a distributed team via a web interface.

Secondly, take a look at the timestamps in this packet trace:

[image: packet trace]

The interesting point is that there’s a 40-second gap between the invocation on “Captcha breaking host 1” and the invocation on “Captcha breaking host 2”. There is then a short gap of 5 seconds before the invocations occur on the Gmail websites.

Here’s my theory: “host 1” is a web service gateway, proxying for a farm of human CAPTCHA solvers. “host 2”, however, is an algorithm-driven server, with no humans involved. A human may take 40 seconds to solve a CAPTCHA, but pure code should be a lot speedier.
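The timing argument can be written down as a small heuristic. This is only a sketch: the clock times are made up (only the 40-second and 5-second gaps come from the trace), and the 10-second cutoff and the function names are my own assumptions, not anything measured.

```python
from datetime import datetime

# Assumed cutoff: responses slower than this look human-paced.
# The 10-second value is a guess for illustration, not a measured figure.
HUMAN_PACE_SECONDS = 10.0


def gap_seconds(earlier: str, later: str) -> float:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds()


def likely_solver(seconds: float) -> str:
    """Guess whether a response delay looks like a person or pure code."""
    return "human" if seconds >= HUMAN_PACE_SECONDS else "algorithm"


# Hypothetical clock times; only the gaps (40s, then 5s) match the trace.
host1_called = "12:00:00"
host2_called = "12:00:40"
gmail_called = "12:00:45"

host1_delay = gap_seconds(host1_called, host2_called)
host2_delay = gap_seconds(host2_called, gmail_called)
print("host 1:", likely_solver(host1_delay))
print("host 2:", likely_solver(host2_delay))
```

Bear in mind that the 5-second gap includes network round trips as well as compute time, so it's an upper bound on how long the algorithm itself takes.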

It’s interesting to note that they’re running both systems in parallel, on the same data. By doing this, the attackers can:

  1. collect training data for a machine-learning algorithm (this is implied by the ‘do not enter random characters!’ warning from the FAQ — they don’t want useless training data)

  2. collect test cases for test-driven development of improvements to the algorithm

  3. measure success/failure rates of their algorithms, “live”, as the attack progresses
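Here’s a rough sketch of how that kind of parallel setup could score an algorithm against the human answers. To be clear, this is my guess at the general shape, not the attackers’ code: the `Sample` record, the field names, and the "empty answer means the worker pressed Enter" convention are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One CAPTCHA image answered by both systems (hypothetical record)."""
    image_id: str
    human_answer: str      # assumed: "" means the worker just pressed Enter
    algorithm_answer: str


def usable_labels(samples: List[Sample]) -> List[Sample]:
    """Keep only images a human actually answered; these can serve as
    training data or as test cases for the algorithm."""
    return [s for s in samples if s.human_answer]


def algorithm_agreement(samples: List[Sample]) -> float:
    """Fraction of human-labelled images where the algorithm agrees."""
    labelled = usable_labels(samples)
    if not labelled:
        return 0.0
    hits = sum(
        1 for s in labelled
        if s.algorithm_answer.lower() == s.human_answer.lower()
    )
    return hits / len(labelled)


# Made-up data, purely to exercise the functions.
samples = [
    Sample("a", "xkcdq", "xkcdq"),   # algorithm agrees with the human
    Sample("b", "mnrwt", "mnrvvt"),  # algorithm mis-segments a letter
    Sample("c", "", "qqqq"),         # human skipped it, so no label
]
print(len(usable_labels(samples)), algorithm_agreement(samples))
```

This also shows why random input from the human workers would hurt: every bogus answer becomes a wrong label, which spoils the training data and the accuracy measurement alike.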

Worth noting this, too:

Observation*: On average, only 1 in every 5 CAPTCHA breaking requests are successfully including both algorithms used by the bot, approximating a success rate of 20%. The second algorithm (segmentation) has very poor performance that sometimes totally fails and returns garbage or incorrect answers.

So their algorithm is unreliable, and hasn’t yet caught up with the human farmers. Good news for Google — and for the CAPTCHA farmers of Romania ;)
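For a sense of scale, here’s the arithmetic behind a 1-in-5 success rate. It’s a back-of-the-envelope sketch that assumes each attempt succeeds independently with the same probability, which the WebSense data doesn’t confirm; the function names are mine.

```python
def expected_attempts(p: float) -> float:
    """Mean number of attempts per success, assuming independent
    attempts that each succeed with probability p."""
    return 1.0 / p


def prob_at_least_one(p: float, attempts: int) -> float:
    """Chance of at least one success in the given number of attempts,
    under the same independence assumption."""
    return 1.0 - (1.0 - p) ** attempts


success_rate = 0.20  # the 1-in-5 figure from the WebSense observation
print(expected_attempts(success_rate))
print(prob_at_least_one(success_rate, 5))
```

So on average each successful signup costs about five attempts, and a run of five attempts succeeds at least once roughly two times in three.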

Update: here’s the NYTimes’ take, with broadly agreeing comments from Brad Taylor of Google. (The Register coverage is off-base, however.)


5 Comments

  1. Posted March 5, 2008 at 15:43

    I had exactly the same thought when I read that piece, Justin. I just thought they were using an odd definition of “cracked”.

    adam

  2. Jim Galding
    Posted March 19, 2008 at 19:53

    20% is enough when you automate.

  3. Dean
    Posted January 25, 2009 at 02:58

    Now the writer has an interesting story, but to conclude that a 20% success rate means good news for Google is showing his mathematical ignorance. When you have an army of machines each processing tens of thousands of CAPTCHA break requests, that 20% amounts to a very large number of successes. It’s like having a door that needs to be hit with 5 hits from a sledgehammer instead of just one in order to be taken down. With this technology, Google has already been compromised.


  4. Posted April 6, 2009 at 02:27

    Yo Dean,

    you are wrong or you missed the point. What Justin was trying to explain is that it isn’t really a technology (algorithm) yet, though they were working on it; it is rather a solution. Setting up a farm I don’t find so f.. brilliant.

    Yo Jim,

    20% .. hmm .. right, but the CAPTCHA ought not to be unreadable by a human, though.

    Yo dahamsta, Yo Justin,

    that is my opinion too.

  5. Posted April 6, 2009 at 11:27

    Um, yo, apparently.