Day: February 20, 2006

TREC Spam Corpus

Some news from TREC’s Gordon Cormack:

The TREC 2005 Corpus (92,000 messages – 42,000 ham; 50,000 spam) is now available for self-serve download.

TREC Spam Evaluation is a NIST program to develop methods to measure spam filter accuracy and performance. More details here.

The corpus can be picked up at Gordon’s site. As far as I can tell, this should be a pretty solid corpus for spam researchers and developers.

Four Things

I don’t do silly blog antics much, but I got tagged by Mat for the Four Things meme. Looking around, it is indeed a bit more interesting than things like the usual LJ quiz, so why not!

I wrote this on the plane from LA to Dublin, which may have affected some of the selections, in “4 places I would rather be right now” at least ;)

4 jobs I’ve had:

  • I was Iona Technologies’ first employee, and stayed there for no less than 7 years. I got to see the company grow from a handful of people, most of whom weren’t getting paid (hence how I wound up as the first employee ;), all the way up to a 300-strong multinational, while the company itself formed a core of Ireland’s mini dot-com boom. That was fantastic fun, and educational to boot.

  • my Dad’s gun/fishing/sporting-goods shop. Was it really a good idea to have a teenager working near firearms? At least I wasn’t the one who unplugged the fridge where the maggots were kept, so that they all hatched over the course of one weekend…

  • A horrible teenage job — picking tomatoes. I can still feel the orange dust under my fingernails every time I smell fresh tomatoes :( I didn’t last very long at that at all.

  • writing an Amiga-based kiosk system for virtually no pay whatsoever, at the age of 18 or 19. Ah, exploitation.

4 movies I can watch over and over:

  • Koyaanisqatsi — it’s dating a little now, since every ad agency through the 90s ripped it off. But still, the invention of a new format. I remember looking at the 405 freeway in LA, and thinking “looks like something out of Koyaanisqatsi” — of course, it was.

  • Princess Mononoke — either that, or Nausicaa. I just love the way the characters are coloured in shades of grey, rather than black and white.

  • the Lord of the Rings trilogy — oh dear I’m a hopeless Tolkien fanboy.

  • Spinal Tap — pure genius.

4 places I’ve lived:

  • Melbourne, Australia; around the time of the annoying TV drama, The Secret Lives Of Us;

  • Newport Beach, CA; around the time of the annoying TV drama, The O.C.;

  • Dublin, Ireland; no annoying TV drama — so far

  • University of California Irvine, CA; while Irvine itself is the most soulless suburban hellhole I’ve ever visited, living on the UCI campus is quite fun by comparison. Take about 1000 grad students, post-docs and lecturers from around the world; put them all in the same square mile or so; remove all fun (and bars!) from the surrounding areas; watch them make their own entertainment, or go mad.

4 tv shows I love:

4 places I’ve vacationed:

  • Annapurna Base Camp, Nepal; we trekked our way up to there, then trekked back down again. Unforgettable. I really want to do another Nepal trek as a result

  • car-camping around the Australian state of Victoria; they have some fantastic national park campsites, which most tourists overlook

  • learning how to dive in Ko Tao, Thailand; great setting, great dive sites, pretty cheap too!

  • Yosemite; amazing, world-class natural beauty. Californians don’t realise just how lucky they’ve got it ;)

4 of my favourite dishes:

  • A good Thai green curry

  • Laos-style green papaya salad with sticky rice

  • a good meaty cassoulet, from Fandango in San Luis Obispo. At least, that was the tastiest meal I’ve had in recent months ;)

  • Mangosteen — the queen of fruit, according to the Thais. I could, and probably have, eaten hundreds of these

4 places I would rather be right now:

  • spending New Year’s Day with a bunch of friends in rural West Cork or County Galway; until I moved to the US, this was one of my favourite annual traditions.

  • the Stag’s Head Bar, Dublin, in the snug, again with a bunch of friends

  • sitting on the grass outside the Pavilion bar in TCD, on a sunny summer’s day (hmm, that’s a lot of bars!)

  • Chiang Mai, Thailand

4 sites I visit daily:

4 people I’m tagging:

The Return of Sneakernet

Keith Dawson sent this on — an interview with Jim Gray, head of Microsoft’s Bay Area Research Center and winner of the ACM Turing Award, talking about new transmission systems for truly massive data collections. Very interesting:

[One] option is to send whole computers. …. We’re now into the 2-terabyte realm, so we can’t actually send a single disk; we need to send a bunch of disks. It’s convenient to send them packaged inside a metal box that just happens to have a processor in it. I know this sounds crazy — but you get an NFS or CIFS server and most people can just plug the thing into the wall and into the network and then copy the data.

Dave Patterson, interviewer: What’s the difference in cost between sending a disk and sending a computer?

JG: If I were to send you only one disk, the cost would be double — something like $400 to send you a computer versus $200 to send you a disk. But I am sending bricks holding more than a terabyte of data — and the disks are more than 50 percent of the system cost. Presumably, these bricks circulate and don’t get consumed by one use.

DP: Are you sending them a whole PC?

JG: Yes, an Athlon with a Gigabit Ethernet interface, a gigabyte of RAM, and seven 300-GB disks — all for about $3,000.

DP: It’s your capital cost to implement the Jim Gray version of “Netflicks.” (jm: sic)

JG: Right. We built more than 20 of these boxes we call TeraScale SneakerNet boxes. Three of them are in circulation. We have a dozen doing TeraServer work; we have about eight in our lab for video archives, backups, and so on. It’s real convenient to have 40 TB of storage to work with if you are a database guy. Remember the old days and the original eight-inch floppy disks? These are just much bigger.

DP: “Sneaker net” was when you used your sneakers to transport data?

JG: In the old days, sneaker net was the notion that you would pull out floppy disks, run across the room in your sneakers, and plug the floppy into another machine. This is just TeraScale SneakerNet. You write your terabytes onto this thing and ship it out to your pals. Some of our pals are extremely well connected — they are part of Internet 2, Virtual Business Networks (VBNs), and the Next Generation Internet (NGI). Even so, it takes them a long time to copy a gigabyte. Copy a terabyte? It takes them a very, very long time across the networks they have.
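To put rough numbers on that last point, here is a back-of-the-envelope sketch. The box capacity (seven 300-GB disks) comes from the interview; the 24-hour courier time and the 100 Mbit/s sustained link speed are purely my assumptions, as are the function names.

```python
def effective_throughput_mbps(payload_bytes: float, transit_hours: float) -> float:
    """Average throughput (megabits/second) of physically shipping the data."""
    bits = payload_bytes * 8
    seconds = transit_hours * 3600
    return bits / seconds / 1e6


def network_transfer_hours(payload_bytes: float, link_mbps: float) -> float:
    """Hours needed to push the same data over a link sustaining link_mbps."""
    bits = payload_bytes * 8
    return bits / (link_mbps * 1e6) / 3600


# From the interview: seven 300-GB disks per box.
BOX_BYTES = 7 * 300e9

# Assumptions, not from the interview:
ASSUMED_TRANSIT_HOURS = 24   # overnight courier
ASSUMED_LINK_MBPS = 100      # sustained end-to-end network speed

shipped = effective_throughput_mbps(BOX_BYTES, ASSUMED_TRANSIT_HOURS)
over_wire = network_transfer_hours(BOX_BYTES, ASSUMED_LINK_MBPS)

print(f"Shipping the box: about {shipped:.0f} Mbit/s effective")
print(f"Copying over the network: about {over_wire:.0f} hours")
```

Under those assumptions the courier works out to roughly 194 Mbit/s sustained for a full day, while the network copy takes about 47 hours. Change the two assumed constants to match your own courier and link, and the comparison shifts accordingly.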

E-Pending

Boing Boing has an interesting case today:

“I filled out a web form for a contest from Miller using a throwaway junk email address and then, months after I dumped the throwaway account, I got this to my main account! Not sure I like the idea of companies tracking me down like this.”

I sent a mail to follow up on this, but it’s worth blogging here too.

This is, unfortunately, common practice among the “legitimate” bulk mailer companies; it’s called “e-pending” (short for “email address appending”). Basically, the advertiser contacts one of the big data-mining companies, provides them with the data they have about the customer (name, postal address, etc.), and gets them to match that against their database; the data-miner then provides any other email addresses they may have on file for that user, even if those addresses were provided for bills, promotional use for other companies, etc.
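The matching step is simple enough to sketch in a few lines. This is only an illustration of the general idea, not any real data broker's method: the `Record` type, the `append_emails` function, the crude normalization and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One row in the (hypothetical) data-miner's database."""
    name: str
    postal_address: str
    email: str


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences still match."""
    return " ".join(text.lower().split())


def append_emails(name, postal_address, known_emails, broker_records):
    """Return emails on file for the same name and address that the
    advertiser does not already have."""
    key = (normalize(name), normalize(postal_address))
    found = []
    for record in broker_records:
        record_key = (normalize(record.name), normalize(record.postal_address))
        if record_key == key and record.email not in known_emails:
            if record.email not in found:
                found.append(record.email)
    return found


# Hypothetical broker data, collected from unrelated sources.
broker_db = [
    Record("Pat Example", "1 Main St, Springfield", "pat.main@example.com"),
    Record("pat example", "1  Main St,  Springfield", "pat.bills@example.com"),
    Record("Sam Other", "2 Side Rd, Springfield", "sam@example.com"),
]

# The advertiser only ever received the throwaway address.
appended = append_emails(
    "Pat Example",
    "1 Main St, Springfield",
    known_emails={"throwaway@example.com"},
    broker_records=broker_db,
)
print(appended)
```

Note that nothing in the lookup records which address the person actually agreed to be mailed at, which is exactly the problem.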

The advertisers contend that permission was given by the person who’s being mailed; the recipients contend that permission was given to send to a specific address, not all of that person’s addresses in perpetuity.

Here are a few more examples of e-pending gone bad: two Jennifer Millers, Sony scraping ancient Internic contact addresses, Spamvertized.org’s comment on the practice, and Joe St. Sauver’s comments.

It’s exclusively a US phenomenon, as far as I know; I think most cases of e-pending are rendered illegal under EU data protection law. Handy. ;)

Update: Brian at the Spam Kings weblog notes that ‘this spooky little spam was the work of Equifax, the big credit reporting agency that shut down its Boca Raton-based spam operation, Naviant, in 2003, due to the impending passage of CAN-SPAM.’