Cache Control Using Redirects
This is a handy technique used on the nearly-live planetary desktop backgrounds site; I've been asked for a brief write-up on how it works, so here goes.
Essentially, the problem is that you have a file or set of files with a known validity time period. (For example, in the site above, this is images of the world's current cloud patterns overlaid on a world map, where the images are regenerated every 3 hours.) So you want to ensure that downloaders download an up-to-date file, but at the same time you're shipping a lot of data, and want to take advantage of caching.
The Technique
The technique is to put a redirector URL in place, which redirects to a fresh target URL once per validity period. So, in the example above, http://taint.org/xplanet/day_clouds_800x600.png is the redirector. It's set up to redirect, using a temporary HTTP redirect code (302), to something like http://taint.org.nyud.net:8090/xplanet/tmp/200503141756.432933/day_clouds_800x600.png, which is the cacheable target file. (That URL is no longer valid, so there's no need to try clicking it.)
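For illustration, the exchange the client sees looks roughly like this (the Location URL is the example one from above, which is no longer live):

```
GET /xplanet/day_clouds_800x600.png HTTP/1.1
Host: taint.org

HTTP/1.1 302 Found
Location: http://taint.org.nyud.net:8090/xplanet/tmp/200503141756.432933/day_clouds_800x600.png
```

Because the redirect is temporary rather than permanent, clients and caches keep coming back to the redirector for the current target, rather than remembering an old one.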
Note that the target URL has other features:
- it's Coralized; in other words, it uses the Coral content distribution network as a front-end. This reduces the bandwidth requirements on the host, since Coral will cache copies on the CDN during the 3-hour validity window.
- the filename component of the URL remains the same, for least user surprise when they're saving the file or using a download manager.
- the full URL contains a timestamped directory name, including some random data, to both ensure the URL will not repeat at a future time, and to attempt to force users to use the redirector rather than second-guessing the redirection algorithm and hitting the server directly.
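A minimal sketch of how such a target URL could be built in shell — the host and filename are the examples from above, and the shell's process ID stands in for the real random component (the actual script below uses a separate gen_rand_999999 helper):

```shell
# Build a non-repeating, timestamped target URL:
# a UTC timestamp plus a random component ($$ is a stand-in here).
cookiedir=$(date -u +%Y%m%d%H%M).$$
url="http://taint.org.nyud.net:8090/xplanet/tmp/$cookiedir/day_clouds_800x600.png"
echo "$url"
```

Note that the filename component stays constant; only the directory portion changes from run to run.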
In addition, since the redirector URL is on your server, and since a downloader must download that URL to get the current target file's URL, you'll get a usable hit-count from that.
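To make that concrete, here's a sketch of pulling the hit-count out of an Apache access log; the log file and its contents are fabricated for illustration:

```shell
# Fake access log: two hits on the redirector, one unrelated request.
cat > access.log <<'EOF'
1.2.3.4 - - [14/Mar/2005:17:56:00 +0000] "GET /xplanet/day_clouds_800x600.png HTTP/1.1" 302 -
5.6.7.8 - - [14/Mar/2005:18:01:12 +0000] "GET /xplanet/day_clouds_800x600.png HTTP/1.1" 302 -
9.9.9.9 - - [14/Mar/2005:18:02:30 +0000] "GET /other.png HTTP/1.1" 200 1234
EOF

# Every downloader must hit the redirector, so this count is a
# reasonable proxy for the number of downloads.
grep -c 'GET /xplanet/day_clouds_800x600.png ' access.log
```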
Cache Expiration Control
In addition, files in the target URL's directories carry explicit cache-control headers, set via a .htaccess file and Apache's mod_expires module, using these commands:
ExpiresActive On
ExpiresByType image/png "modification plus 1 day"
Note that ExpiresByType has a very flexible syntax to specify validity periods. In this case, it uses an expiry time longer than the required 3 hours, just in case a cache's clock is off by several hours or has faulty timestamp handling.
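For instance (a sketch of mod_expires' alternate syntax, not the site's actual configuration), periods can also be counted from access time, and a catch-all default can be set:

```
ExpiresActive On
ExpiresByType image/png "access plus 4 hours"
ExpiresDefault "modification plus 1 day"
```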
The Code
If you're planning to implement a similar scheme, here's the shell script that generates this:
#!/bin/sh

cd $HOME/shared/xplanet
PATH=$PATH:/usr/local/bin:$HOME/bin
. config.sh

mkdir output > /dev/null 2>&1
(
  cd state

  [.... generation of output into "../output" omitted ....]

  date
  cookiedir=`date -u +%Y%m%d%H%M`.`../gen_rand_999999`
  outputdir=$PLAIN_PATH_BASE/tmp/$cookiedir

  mv $PLAIN_PATH_BASE/tmp $PLAIN_PATH_BASE/tmp.OLD
  mkdir -p $outputdir

  # ensure that the tmp dir is unlistable
  touch $PLAIN_PATH_BASE/tmp/index.html

  # these are what gets requested (and cached)
  cp -p ../output/* $outputdir/.
  files=`ls $outputdir`

  # generate .htaccess
  (
    echo '
ExpiresActive On
ExpiresByType image/png "modification plus 1 day"
'
    for f in $files ; do
      echo "Redirect temp /xplanet/$f ${CACHED_URLS_BASE}tmp/$cookiedir/$f"
    done
  ) > $PLAIN_PATH_BASE/.htaccess

  # and these are never actually accessed, but make for good wget targets
  for f in $files ; do
    touch $PLAIN_PATH_BASE/$f
  done

  rm -rf $PLAIN_PATH_BASE/tmp.OLD
  date
) > LOG
and the config.sh file it sources contains:
PLAIN_PATH_BASE=$HOME/taint.org/xplanet/
PLAIN_URLS_BASE=http://taint.org/xplanet/
CACHED_URLS_BASE=http://taint.org.nyud.net:8090/xplanet/
gen_rand_999999 is a short perl script to generate a random integer between 0 and 999998 (rand(999999) returns a value strictly below 999999):
#!/usr/bin/perl
srand (time^$$^ unpack "%L*", `ps axww | gzip`);
print int rand(999999);
Perl's default rand() seeding is weak, so the script seeds it explicitly with a mix of the time, the process ID, and a checksum of compressed ps output — a classic Perl trick for a less predictable seed — to reduce the chance of two runs producing the same value.
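If Perl isn't handy, a similar value can be drawn from the kernel's entropy pool instead — a sketch, assuming a system with /dev/urandom and od:

```shell
# Read 4 random bytes as an unsigned integer, then reduce modulo
# 1000000 to get a suffix of at most six digits.
rand=$(od -An -N4 -tu4 /dev/urandom | tr -d ' ')
rand=$((rand % 1000000))
echo "$rand"
```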
(The details of how the image generation takes place are omitted here, since that's not what's important for the purposes of this page.)