Script: new-referrer-rss

new-referrer-rss.pl - generate RSS feed of new referrer URLs from access_log

SYNOPSIS

new-referrer-rss.pl nameofsite [source ...] > new-referrers.xml

DESCRIPTION

Given the name of a web site and one or more Apache combined log format ‘access_log’ files containing referrer URL data, this script generates an RSS feed containing the latest referrers.

The script should be run periodically from cron, with ‘fresh’ access_log data each time.
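
As a rough illustration of the filtering described above, here is a minimal sketch in core Perl. It is not the script's actual code: the helper names `parse_combined_line` and `new_referrers` are hypothetical, the sample log lines are invented, and deduplicating on the full referrer URL is an assumption. The `%seen` hash simply stands in for whatever history the real script keeps between cron runs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): extract the status
# code and referrer from one Apache combined-format log line.
# Layout: host ident user [time] "request" status bytes "referrer" "agent"
sub parse_combined_line {
    my ($line) = @_;
    return unless $line =~ m{
        ^\S+ \s+ \S+ \s+ \S+ \s+      # host, ident, user
        \[[^\]]+\] \s+                # [timestamp]
        "(?:[^"\\]|\\.)*" \s+         # "request line"
        (\d{3}) \s+ \S+ \s+           # status, bytes
        "((?:[^"\\]|\\.)*)"           # "referrer"
    }x;
    return ($1, $2);
}

# Hypothetical helper: return referrers not seen before, in first-seen order.
# $seen is a hash ref that the caller keeps around between calls.
sub new_referrers {
    my ($seen, @lines) = @_;
    my @new;
    for my $line (@lines) {
        my ($status, $ref) = parse_combined_line($line);
        next unless defined($status) && $status == 200;  # target must exist
        next if $ref eq '' || $ref eq '-';               # no referrer sent
        next if $seen->{$ref}++;                         # already listed (assumed key: full URL)
        push @new, $ref;
    }
    return @new;
}

# Invented sample data, for illustration only.
my %seen;
my @log = (
    '192.0.2.1 - - [10/May/2006:00:54:00 +0000] "GET /blog/ HTTP/1.1" 200 5120 "http://example.org/links.html" "Mozilla/5.0"',
    '192.0.2.2 - - [10/May/2006:00:55:00 +0000] "GET /nope HTTP/1.1" 404 312 "http://spam.example/" "Mozilla/5.0"',
    '192.0.2.3 - - [10/May/2006:00:56:00 +0000] "GET /blog/ HTTP/1.1" 200 5120 "http://example.org/links.html" "Mozilla/5.0"',
    '192.0.2.4 - - [10/May/2006:00:57:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
);
print "$_\n" for new_referrers(\%seen, @log);
```

With the sample lines above, only `http://example.org/links.html` is printed: the 404 hit and the empty referrer are dropped, and the repeat is skipped. Calling `new_referrers` again with the same `%seen` prints nothing, which is the behaviour a periodic cron run would rely on.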


6 Comments

  1. Ian Holsman said,

    May 10, 2006 @ 12:54 am

    Hi Justin.

    That’s a great idea! One small nit: you need to put a full URL into the RSS ‘link’ field for it to validate.

    e.g. $rssfile->channel( title => "New referrers for $site", description => "New referrers for $site", link => "http://$site", );

  2. Michele said,

    May 10, 2006 @ 1:58 am

    Nice, but what about the referrer spammers??

  3. Aristotle Pagaltzis said,

    May 10, 2006 @ 6:32 am

    Any reason you’re generating RSS rather than Atom? :)

  4. Craig Hughes said,

    May 10, 2006 @ 6:56 am

    Hey Justin, what the heck is that read_history() BS? Something wrong with a good ol’ tie to a DB? If you get lots of referrers over time, that read_history thing is going to get really ugly. Plus you can cut the code in like half by just tying to a DB.

  5. Justin said,

    May 10, 2006 @ 1:24 pm

    wow comments!

    Ian: thanks, good point, I’d missed that. I should really have tried validating the output before posting ;) Fixed now.

    Michele: actually, referrer spam hasn’t been a problem. The script requires that the target exist (i.e. return a 200 status code), which cuts out the referrer spam that produces 404 errors by hitting nonexistent paths. Some ref spammers, the ones that hit “/”, get through, but don’t cause much irritation; once a site has been listed, it won’t be listed again. So you would typically spot one or two ref spammers in the output for the first couple of runs (from old data), then after that there’d be only one or two every week or two.

    Aristotle: I didn’t use Atom purely out of laziness. My feed reader supports XML::RSS’ output very well, it works fine for the purposes of this script, I’m familiar with that module, and it’s quite widely available — including a ‘libxml-rss-perl’ apt-gettable package in Debian and Ubuntu. I just haven’t bothered getting to grips with whatever CPAN modules can produce Atom yet… XML::Atom looks like a possibility for next time maybe.

    Craig: actually, after several years dealing with DB_File and its incompatible revisions between libdb releases in SpamAssassin, yep, there is quite a lot wrong with a tie to a DB, I’ve concluded ;) It’s just easier to avoid incompatibility hassle by keeping it simple, which DB_File isn’t. Nowadays I try to avoid using it as a result.

    Plus I’d still have had to use a separate hash, or another CPAN module, in order to stringify the data structure for saving, which is where most of those 16 lines of code are going anyway…

    (If only YAML was in perl core nowadays — that would be perfect. ;)

  6. Aristotle Pagaltzis said,

    May 10, 2006 @ 10:06 pm

    Ah; that means I should hurry up and finish XML::Atom::SimpleFeed then, I guess. :)
