IR book recommendation

Thanks to Pierce for pointing me at this review of an interesting-sounding book called Introduction to Information Retrieval. The book sounds quite useful, but I wanted to pick out a particularly noteworthy quote, on compression:

One benefit of compression is immediately clear. We need less disk space.

There are two more subtle benefits of compression. The first is increased use of caching … With compression, we can fit a lot more information into main memory. [For example,] instead of having to expend a disk seek when processing a query … we instead access its postings list in memory and decompress it … Increased speed owing to caching — rather than decreased space requirements — is often the prime motivator for compression.

The second more subtle advantage of compression is faster transfer of data from disk to memory … We can reduce input/output (IO) time by loading a much smaller compressed posting list, even when you add on the cost of decompression. So, in most cases, the retrieval system runs faster on compressed postings lists than on uncompressed postings lists.

This is something I’ve been thinking about recently — we’re getting to the stage where CPU speed has so far outstripped disk I/O speed and network bandwidth that pervasive compression may be worthwhile. It’s simply worth keeping data compressed for longer, since CPU time is cheap. There’s certainly little point in not compressing data travelling over the internet, anyway.
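To make the postings-list case concrete, here’s a minimal sketch of the classic gap-plus-variable-byte scheme for compressing a sorted list of document IDs — the general style of encoding the book discusses, though the function names and details here are my own, not the book’s:

```python
def vb_encode(gaps):
    """Variable-byte encode a list of non-negative integer gaps."""
    out = bytearray()
    for g in gaps:
        # collect 7-bit chunks, least-significant first
        chunks = []
        while True:
            chunks.append(g & 0x7F)
            g >>= 7
            if g == 0:
                break
        # emit most-significant chunks first, with the continuation
        # bit set on every byte except the last
        for b in reversed(chunks[1:]):
            out.append(b | 0x80)
        out.append(chunks[0])
    return bytes(out)

def vb_decode(data):
    """Decode a variable-byte stream back into a list of gaps."""
    gaps, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if not byte & 0x80:   # continuation bit clear: gap complete
            gaps.append(n)
            n = 0
    return gaps

def compress_postings(doc_ids):
    """Delta-encode sorted doc IDs, then variable-byte encode the gaps."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vb_encode(gaps)

def decompress_postings(data):
    """Undo compress_postings: cumulative-sum the decoded gaps."""
    ids, cur = [], 0
    for g in vb_decode(data):
        cur += g
        ids.append(cur)
    return ids

# Sorted doc IDs with smallish gaps, as a common term might produce.
postings = list(range(10_000, 1_000_000, 97))
blob = compress_postings(postings)
assert decompress_postings(blob) == postings
print(len(postings) * 4, "bytes raw vs", len(blob), "bytes compressed")
```

Since most gaps in a dense postings list fit in a single byte versus four bytes for a raw 32-bit ID, the compressed list is roughly a quarter of the size here — which is exactly why it’s cheaper to keep it compressed in memory and decode on access.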

On other topics, it looks equally insightful; the quoted paragraphs on Naive Bayes and feature-selection algorithms both cover things I learned myself, “in the field” so to speak, working on classifiers. I really should have read this book years ago, I think ;)

The entire book is online here, in PDF and HTML. One to read in that copious free time…


5 Comments

  1. Posted February 9, 2009 at 13:31 | Permalink

Yes, if the data size being processed is less than RAM when compressed (i.e. it fits in cache), then it’s a big win. I’ve had this experience with my access_log files, which I always keep compressed on disk. Processing them is faster than processing uncompressed files from the hard disk (and that’s sequential access, which the hard disk handles relatively well), and reprocessing them is much faster.

    The storage hierarchy is changing though with the advent of solid state disks. Perhaps when these move to MRAM, there will be no benefit to compression? http://www.pixelbeat.org/docs/memory_hierarchy/

  2. Posted February 9, 2009 at 16:00 | Permalink

    hey Padraig —

    wow, MRAM is going to be great! looking forward to that…

  3. Craig Hughes
    Posted February 9, 2009 at 17:40 | Permalink

    Another good book which deals with the advantages of storing everything compressed (including strategies for searching inside compressed files, etc.) is Managing Gigabytes, published by Morgan Kaufmann a few years back now. I lent my copy to Komal, I think, and have not seen it in a couple of years…. Great book though.

  4. Craig Hughes
    Posted February 9, 2009 at 17:42 | Permalink

    Haha. Then after posting I click through to TFA and see first sentence also mentions Managing Gigabytes!

  5. David Malone
    Posted February 9, 2009 at 21:06 | Permalink

    I know that some people looked at compressing pages rather than swapping them in a VM system (see info.iet.unipi.it/~luigi/swap.ps), and that was taken a bit further here (http://www.usenix.org/events/usenix99/full_papers/wilson/wilson_html/). I guess this is something of an evolution of these ideas.