Google Webmaster Tools now includes ‘goog-love.pl’

Back in 2006, I wrote a script I called “goog-love.pl”; it used Google’s now-dead SOAP search API (thanks, Nelson!) to figure out which Google queries your web site was “winning” on. Unfortunately, Google shut down new signups for the SOAP interface later that year.

I was just looking through Google’s Webmaster Tools page for taint.org, when I came across the Statistics / Top search queries page:

[image: screenshot of the Top search queries page]

This is exactly what goog-love.pl produced. Hooray!


Google now include Code Search in normal results

Latest Google curiosity… I hadn’t spotted this before: it appears Google is now including ‘Code Snippet’ results in the results for its normal search. For example, a search for XSLoader gives this result:

xsloader

The results highlighted on the page are for a local variable in a Java module, rather than the much more common XSLoader Perl module. I guess ‘Code Snippet’ search is case-sensitive.


Long-lived spam via Yahoo! search

Back in May, I noticed some spam in my Moin Moin wiki, and fixed it.

As this Yahoo! Site Explorer view of taint.org demonstrates, Yahoo!’s search is still showing some of these results. Although the spam content was deleted long ago (example), the results still show the spam title and URL, even though the title and text no longer contain those spam keywords.

Annoyingly, I’m still seeing referrer clickthroughs from search.yahoo.com to these deleted pages from lusers looking for porn, as a result. Come on Yahoo!, fix your search to notice the title change at least, so people don’t think the pages still contain porn!


VAST.com

So, my new employer just launched today!

It’s a new search service, VAST.com. As the blog says, ‘we are building a search service that extracts classified ads from across the web, structures them, and then makes them available via an open REST API for commercial and non-commercial uses.’

Now you can see why I’m excited ;)


Spam and Broken Windows, and wecanstopspam.org

Spam: Spam Chongqing: Spamming Experiment:

Kasia at unix-girl.com decided to run a spamming experiment on her blog. She posted a couple spams to her own blog and waited to see what would happen. In less than 24 hours she received 356 more spams.

The chongqing guys confirm this, and I’ve noticed this as well (although just in passing, I’ve never tried testing it).

Interestingly, I’m pretty sure the same thing can happen with mailing lists, if the mailing list archives are allowed to contain the mailing list’s posting address, and the list allows open posting. It works like this:

  • spammer A posts a spam to the list
  • spam is archived
  • google finds archived spam
  • list-builders B, C, D google for search terms, find archive page for that mail message
  • B, C, D scrape the addresses from that page and pick up the list posting address
  • they then either sell on to spammers E, F, and G, who spam that address, or they spam the address themselves
  • and redo loop from the start.
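The loop above depends on the archive page exposing the posting address. As a rough sketch of one way an archiver could break that step (this is an illustration only, not something any particular list software is known to do), the following redacts addresses before a message is published. The address `discuss@lists.example.org`, the function names, and the loose address pattern are all hypothetical.

```python
import re

# Hypothetical posting address, used only for illustration.
LIST_ADDRESS = "discuss@lists.example.org"

# Deliberately loose pattern for anything that looks like an email address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_address(address: str) -> str:
    """Keep the first character of the local part and hide the rest."""
    local, _, domain = address.partition("@")
    return f"{local[:1]}...@{domain}"


def redact_for_archive(message_text: str, posting_address: str = LIST_ADDRESS) -> str:
    """Return message text with the posting address removed and others masked."""

    def replace(match: re.Match) -> str:
        found = match.group(0)
        if found.lower() == posting_address.lower():
            # The posting address is the one scrapers would spam, so drop it.
            return "[list address removed]"
        return mask_address(found)

    return EMAIL_RE.sub(replace, message_text)


if __name__ == "__main__":
    sample = "To: discuss@lists.example.org\nFrom: alice@example.com"
    print(redact_for_archive(sample))
```

With the posting address gone from the archived page, a scraper that finds the page through a keyword search has nothing to feed back into the list.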

One key factor is the search terms B, C, and D use. My theory is that they are intending to generate ‘targeted’ lists, and in spamming, most targeted lists are simply lists of addresses scraped from pages that show up in a google search for a specific keyword — ‘meds’, ‘viagra’, ‘degree’, etc.

Joe at chongqing surmises that it may be through the Broken Windows Theory: that spam appearing in a weblog’s comments, or in a wiki page, indicates that the administrator is asleep at the wheel and more spam can be posted with impunity. In my opinion, that’s probably more likely for google-spam and wiki-spam than for email spam, but it is undoubtedly a factor.

PS: wecanstopspam.org has been allowed to lapse and has been stolen by a spammer (http://chongq.blogspot.com/2005/04/another-spammer-owned-antispam-site.html). Oh dear.


Echo chamber goes crazy about ‘nofollow’

Blogs: Just to expand on a linkblog posting I made yesterday, Google’s search team have announced support for a new piece of Google functionality; they’ll fix their crawlers to ignore links with a rel="nofollow" attribute, for PageRank calculations, the idea being that spammers will stop blog-spamming once they can’t get PageRank out of it.
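To make the idea concrete, here is a minimal sketch of how a crawler might separate links that count towards ranking from links marked rel="nofollow". It is not Google's implementation, which is not public; the class and function names are made up for illustration, and it uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect hrefs, split by whether the link carries rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []  # links that would count towards ranking
        self.ignored = []   # links marked nofollow

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel is a space-separated token list; a bare attribute has value None.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.ignored.append(href)
        else:
            self.followed.append(href)


def split_links(html_text):
    """Return (followed, ignored) lists of hrefs found in html_text."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.followed, collector.ignored


if __name__ == "__main__":
    page = (
        '<p><a href="http://example.com/a">a real link</a> '
        '<a rel="nofollow" href="http://example.com/spam">comment spam</a></p>'
    )
    print(split_links(page))
```

Weblog software would sit on the other side of this: it adds rel="nofollow" to every link in user-submitted comments, so those links end up in the ignored list.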

The blog world has been all aflutter:

BurningBird is right, to a degree. In fact, it’s been solved before.

Here’s a taint.org posting from November 2003 where I point out that by using a trivial Javascript URL one can link to another page without conferring PageRank. The format is:

javascript:document.location=target

The result looks like this, and works in any browser with a basic JS engine, from IE 3.02 and Netscape Navigator 2 onwards. I’ve been using it for my referrer logs, among other things, for over a year. I wrote a patch that implemented it for external links in the Moin Moin wiki software.
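As an illustration of the format above, this sketch builds such a link on the server side. It is not the Moin Moin patch itself; the function name is hypothetical, and the quoting choices (json.dumps for the JavaScript string, html.escape for the attribute) are my assumptions about a reasonably safe way to do it.

```python
import html
import json


def js_redirect_link(target_url: str, label: str) -> str:
    """Build an anchor whose href is a javascript: URL, not a plain link."""
    # json.dumps produces a double-quoted string literal that is valid JavaScript.
    script = "javascript:document.location=" + json.dumps(target_url)
    # Escape for the HTML attribute so quotes and ampersands survive intact.
    # Caveat: targets containing percent-escapes may need extra care, since
    # browsers may decode them before running the script.
    return '<a href="{}">{}</a>'.format(
        html.escape(script, quote=True), html.escape(label)
    )


if __name__ == "__main__":
    print(js_redirect_link("http://example.com/?a=1&b=2", "Example"))
```

A crawler that does not execute JavaScript sees no ordinary URL in the href, so it has nothing to credit with PageRank, while a visitor who clicks the link is still taken to the target.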

Amazingly, despite my plugging this idea at virtually every opportunity, it seems nobody noticed! At least, nobody among the people who (it would seem) should be looking into comment spam, thinking about how to deal with it, etc.

Disappointing — the echo chamber keeps talking to itself, once again. Maybe I’ll stick with dealing with email spam instead ;)

Ah, whatever. Anyway, this is a nicer fix; relying on JS isn’t a good thing. So nice work, Google.

(PS: worth noting that while this is a good plan, comment spam won’t be going away any time soon, as Mark Pilgrim noted. Still, here’s hoping it’ll help in the long term…)


Patents in an open source world

Patents: Newsforge: Patents in an open source world, by Lawrence Rosen (founding partner of Rosenlaw and Einschlag).

Interesting article, but I’m not sure summary point number 2 (‘continue to document our own “prior art” to prevent others from patenting things they weren’t the first to invent’) really helps, when the patent examiners clearly haven’t performed the simplest Google check. In the past, I’ve found obvious prior art in 30 seconds by plugging 3 words from patent claims into Google (and yes, I have a reasonable idea how to read patent claims by now).

Point number 3 is interesting, since it contradicts most other advice I’ve read regarding patent searches: ‘Conduct a reasonably diligent search for patents we might infringe. At least search the portfolios of our major competitors. (This, by the way, is also a great way to make sure we’re aware of important technology advances by our competitors.) Maintain a commercially reasonable balance between doing nothing about patents and being obsessed with reviewing every one of them.’

However, this comment really is interesting and raises something major that I’d never heard of before: users of proprietary software can also face a significant risk from the patent threat. In particular, according to the linked comment, Microsoft licensed some patented technology from a company called Timeline Inc., but the license was not sublicensable; in other words, it did not grant their customers the rights to fully use the technology! (In fairness to MS, this was established later in court.) Result: MS SQL Server OEMs and ISVs are now being sued (http://trends.newsforge.com/comments.pl?sid=39443&cid=96153).


Microsoft 0wnz ‘http’

Web: Back in 2002, it occurred to someone to check the Google search results for ‘http’, to figure out what the most popular sites were.

Looks like it’s changed; here are the top five results from a Google search for ‘http’ now:

  • 1: Microsoft
  • 2: AltaVista (!!)
  • 3: Yahoo!
  • 4: My Excite
  • 5: Google

My guess: older links are getting good PageRank, using whatever new tweaked algorithm they’re using. But AltaVista beating Google? ;)


The ‘Hog Bog’

Architecture: For reasons which I won’t go into here, I wound up doing a Google Image Search for ‘toilet’ which turned up a link to this page: Toilets of the World. However, he’s missing one very important variety: the world-famous Goan ‘Hog Bog’.

Here’s a tasteful pic of an expectant pig waiting for lunch (local mirror) — and then, if your stomach can take it, a rather more graphic account here. (warning: not safe for lunch)


Ma, Google won’t leave me alone

Bizarre: OK, OK, Google, I’m planning to! Geesh, all I wanted was a search engine, not health advice. They’re not even my ads!


Google Sets

Web: Google Labs has a nifty toy called Google Sets; name a few items, and it’ll tell you what other items have been seen in conjunction with them.

Of course, the only use I know for it is this search for Blonde and Brunette, which says more about the modern web than we really need to know.


Search Engine Optimisation

Tom Coates on search engine optimisation. Summary: SEO tricks don’t work; smart search engines realise you’re trying to game them, and will ignore or penalise your site as a result. The correct answer is to provide interesting/good/linkworthy textual information, and keep superfluous eye candy at a sensible level. I agree with his essay, FWIW.

Personally, I reckon Google deserve a lot of credit for turning the web around, from a flashy, Flash-laden animated DHTML blinky-blink medium, back into one where text is king. Once it got recognized that Google used titles, h1 tags, and other semantic markup as key metadata, and that the gimmicky stuff is unindexable, the never-ending slide into flashy blinky-blink land was halted. Phew!

Aside: Labour MP Tom Watson has a weblog?! Wow. He’d get my vote straight away, no matter what his policies were — that’s transparency ;)

Interesting: so does Liberal Democrat MP Richard Allan. This is really amazing. He even links to SpamAssassin as part of a discussion on the All-Party Internet Group’s spam summit to be held on July 1st!

It’s worth noting that his comment here says the APIG concept seems to be leaning towards prosecution of spamvertised products; advertise via spam (sent by you or by a ‘spam outsourcing’ company), and you’re liable. A very sensible approach, as long as they can avoid the danger of malicious spammers spamvertising a product without that company’s permission, a la what happens regularly to SpamCop and Spamhaus.


MSN’s Google-Killer

Maciej, Jeremy and Dave have all been blogging about this: Microsoft have unleashed MSNBOT, a new web crawler (judging by the robots.txt string, written in COBOL) which heralds their new search service which will topple Google.

My thoughts: dream on, guys.

What makes Google cool? Fast, accurate searches, and no ads. OK, MSN could do fast searching; that’s doable, it’s just a technical matter.

But what does the latter require? IMO, it takes very strong technical leadership, willing to resist any and every business unit that fancies dropping some cruddy ads on the front page; it’s a cultural issue. This is especially tricky where ads (and money) are involved. Now go take a look at MSN.com. See what I mean? I rest my case.


‘Internet advances not always pure tech’ shocker

Jason Kottke: Portal Wars II: When Search Engines Attack. He makes a great point (from Robert Morris at Etech 2002): while advances on the internet are typically heralded as tech-driven, in fact they’re more often usability-driven. Examples:

Mosaic was not an advancement in technology over TBL’s original browser. Blogger is a highly-specialized FTP client. IM is IRC++ (or IRC for Dummies, depending on your POV).

Dead right. Good tech is useless without the rough edges sanded down and a degree of comprehensibility.

Aside: I wonder if Robert Morris of IBM is any relation to Robert T Morris, the 1988 internet worm guy?


The top 100 PageRanked CGI scripts

Similar to the much-discussed-elsewhere http search trick, which figures out the top 100 websites according to PageRank, here are the top 100 CGI scripts according to PageRank. The results are incomplete, since only scripts with “cgi-bin” in the URL will show up, but hey ho. The top ten:

And the winner is:

boo.


(Untitled)

Checking out the logs and stats for this site, I notice that a Google search for “jennifer aniston nipples” is one of the main referrers. It is, of course, a hit to this page, the fake-nipples story. Sex (or nipples, at least) brings hits!


(Untitled)

Before coming over here to Australia from Ireland, I put my CV (i.e. résumé) up on http://jmason.org/. (I initially assumed I’d be looking for work over here; it’s since turned out that my Irish employers are happy to keep me on, even when I’m on the other side of the world.)

I’ve been getting loads of job offers (about 3 a week, by email and phone) from companies and recruiters in the US, since I put the CV up.

I think I’ve just figured out why… a search for “unix cv resume” on Google returns my CV as the first hit!

No wonder. Any half-awake recruiter who wants someone who can “do UNIX” will try a Google search. Better figure out some way of fixing it to get a lower ranking…
