Script: new-referrers-rss – generate RSS feed of new referrer URLs from access_log


new-referrers-rss nameofsite [source ...] > new-referrers.xml


Given the name of a web site, and a selection of Apache combined log format ‘access_log’ files containing referrer URL data, this will generate an RSS feed containing the latest referrers.

The script should be run periodically with ‘fresh’ access_log data, from cron.
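To illustrate the idea for readers who haven't looked at the combined log format, here is a minimal sketch of how new referrers might be picked out of log lines. It is not the script's actual code: the sub names `parse_combined_line` and `new_referrers`, the regex, and the in-memory `%seen` hash are all assumptions made for this example. It also simplifies things by de-duplicating on the full referrer URL and by ignoring escaped quotes inside log fields.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the real script): pull the status code and
# referrer out of one Apache combined-format log line, laid out as
#   host ident user [time] "request" status bytes "referrer" "user-agent"
sub parse_combined_line {
    my ($line) = @_;
    if ($line =~ m/^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+ "([^"]*)" "[^"]*"/) {
        return { status => $1, referrer => $2 };
    }
    return;    # line did not look like combined format
}

# Hypothetical helper: return referrers from successful (200) requests,
# skipping empty ones ("-") and any already recorded in %$seen.
# %$seen is updated as a side effect so later calls skip them too.
sub new_referrers {
    my ($seen, @lines) = @_;
    my @new;
    for my $line (@lines) {
        my $hit = parse_combined_line($line) or next;
        next unless $hit->{status} == 200;
        my $ref = $hit->{referrer};
        next if $ref eq '-' || $ref eq '';
        next if $seen->{$ref}++;    # already listed once
        push @new, $ref;
    }
    return @new;
}
```

Something like `print "$_\n" for new_referrers(\%seen, <STDIN>);` would then list each unseen referrer once. A real run from cron would also need to save `%seen` between runs, which this sketch leaves out.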



  1. Posted May 10, 2006 at 00:54

    Hi Justin.

    That’s a great idea! One small nit: you need to put a full URL into the RSS ‘link’ field for it to validate.

    e.g. $rssfile->channel( title => "New referrers for $site", description => "New referrers for $site", link => "http://$site", );

  2. Posted May 10, 2006 at 01:58

    Nice, but what about the referrer spammers??

  3. Posted May 10, 2006 at 06:32

    Any reason you’re generating RSS rather than Atom? :)

  4. Posted May 10, 2006 at 06:56

    Hey Justin, what the heck is that read_history() BS? Something wrong with a good ol’ tie to a DB? If you get lots of referrers over time, that read_history thing is going to get really ugly. Plus you can cut the code roughly in half by just tying to a DB.

  5. Posted May 10, 2006 at 13:24

    wow comments!

    Ian: thanks, good point, I’d missed that. I should really have tried validating the output before posting ;) Fixed now.

    Michele: actually, referrer spam hasn’t been a problem. The script requires that the target exist (i.e. return a 200 status code), which filters out the referrer spam that hits nonexistent paths and gets a 404. Some ref spammers (the ones that hit “/”) get through, but they don’t cause much irritation: once a site has been listed, it won’t be listed again. So you’d typically spot one or two ref spammers in the output for the first couple of runs (from old data), and after that only one or two every week or two.

    Aristotle: I didn’t use Atom purely out of laziness. My feed reader supports XML::RSS’ output very well, it works fine for the purposes of this script, I’m familiar with that module, and it’s quite widely available — including a ‘libxml-rss-perl’ apt-gettable package in Debian and Ubuntu. I just haven’t bothered getting to grips with whatever CPAN modules can produce Atom yet… XML::Atom looks like a possibility for next time maybe.

    Craig: actually, after several years dealing with DB_File and its incompatible revisions between libdb releases in SpamAssassin, yep, there is quite a lot wrong with a tie to a DB, I’ve concluded ;) It’s just easier to avoid the incompatibility hassle by keeping it simple, which DB_File isn’t. Nowadays I try to avoid using it as a result.

    Plus I’d still have had to use a separate hash, or another CPAN module, in order to stringify the data structure for saving, which is where most of those 16 lines of code are going anyway…

    (If only YAML was in perl core nowadays — that would be perfect. ;)

  6. Posted May 10, 2006 at 22:06

    Ah; that means I should hurry up and finish XML::Atom::SimpleFeed then, I guess. :)