How to eliminate spam bots from AWStats for good

Here is an example Perl implementation of the Apache access.log filter for “telling human from computer [site visitors] apart”.
For a description of the approach, please see the main page.

#!/usr/bin/perl -w
#
# Extract human-like entries from httpd server log.
# Note:
# Some legitimate users may be filtered out,
# however, they probably are not interesting economically anyway,
# so not really required in the analysis.

my $logfile = defined $ARGV[0] ? $ARGV[0] : "";

# Pass 1: get the list of (spam) bots.
open(L, '<', $logfile) or die "Cannot open logfile '$logfile' (pass 1): $!\n";
my %bots = ();
my %humans = ();
my $bad_lines = 0;
my $MAX_BAD = 3;
my $verb = 0;
while (<L>) {
        /^(\S+) .+? (\S*?) HTTP\/\S*? (\d\d\d) / or do {
                ++$bad_lines;
                $verb and $bad_lines <= $MAX_BAD and
                        warn "Bad line: $_";
                next;
        };
        $host           = $1;
        $request        = $2;
        $status         = $3;
        if (like_human()) {
                $humans{$host} = 1;
                delete $bots{$host};
        }
        else {
                exists $humans{$host} or $bots{$host} = 1;
        }
}
close L;
$verb && printf STDERR " bots: %d, humans: %d, bad lines: %d\n",
        scalar keys %bots,
        scalar keys %humans,
        $bad_lines;

# Pass 2: extract human-like entries.
open(L, '<', $logfile) or die "Cannot open logfile '$logfile' (pass 2): $!\n";
while (<L>) {
        /^(\S+) / or next;
        $host           = $1;
        print unless exists $bots{$host};
}
close L;

sub like_human
{
        return 1 if exists $humans{$host};
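        # Human-like if a local request was answered with 304 Not Modified
        # (the client sent a conditional request, typically from a browser
        # cache), or if a static asset (script, style, image, media) was
        # requested. Note that && binds tighter than ||.
        return
                $request =~ /^\// &&
                $status eq "304" ||
                $request =~ /\.(?:js|css|jpe?g|png|gif|ico|pdf|mp3|avi)$/i
                # Is "htm" ok?..
                ?
                1 : 0;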
}
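
The classification rule in like_human() can be tried in isolation. The sketch below is not part of the original script: looks_human is a hypothetical helper that takes the request and status as arguments instead of reading globals, it skips the "host already known as human" shortcut, and the three log lines are invented for illustration. It assumes the Apache combined log format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): the same test as
# like_human(), with the operator precedence written out as two steps.
sub looks_human {
    my ($request, $status) = @_;
    # A local request answered with 304 Not Modified.
    return 1 if $request =~ m{^/} && $status eq "304";
    # A request for a static asset.
    return 1 if $request =~ /\.(?:js|css|jpe?g|png|gif|ico|pdf|mp3|avi)$/i;
    return 0;
}

# Invented sample lines in combined log format, for illustration only.
my @sample = (
    '192.0.2.1 - - [10/Dec/2008:11:07:00 +0000] "GET /style.css HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.2 - - [10/Dec/2008:11:07:01 +0000] "POST /wp-comments-post.php HTTP/1.0" 200 90 "-" "-"',
    '192.0.2.3 - - [10/Dec/2008:11:07:02 +0000] "GET /index.html HTTP/1.1" 304 - "-" "Mozilla/5.0"',
);

for my $line (@sample) {
    # Same pattern as pass 1 of the script above.
    my ($host, $request, $status) =
        $line =~ /^(\S+) .+? (\S*?) HTTP\/\S*? (\d\d\d) /
        or next;
    printf "%-10s %-25s %s => %s\n", $host, $request, $status,
        looks_human($request, $status) ? "human-like" : "bot-like";
}
```

With these sample lines, the first and third hosts are reported as human-like (a stylesheet request and a 304 response) and the second as bot-like. In the full script a single human-like hit is enough to keep every line from that host.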


10 Responses to “How to eliminate spam bots from AWStats for good”

  • 1
    Results of taking care for the site • STEREO.org.ua
    November 3rd, 2007 4:23am

    [...] Statistics are very reliable, I believe. A special technique for filtering out spam bots from apache log files has been employed for the report Feb-Oct 2007 (blog.stereo.org.ua) and Jul-Oct 2007 [...]

  • 2
    Penultimate Reality » Blog Archive » Spambots Hurting Statistics
    December 22nd, 2007 3:20am

    [...] reason), and it’d be nice to filter them out of AWStats somehow. I found an interesting script that purports to help solve this problem, but I haven’t actually figured out how the script [...]

  • 3
    Ruben Rubio
    December 10th, 2008 11:07am

    If you have this error:

    Use of uninitialized value in pattern match (m//) at ./remove_spambots.pl line 19.

    Change the lines at line 19 and line 45 that look like:

    while () {

    to

    while (<L>) {

  • 4
    Ruben Rubio
    December 10th, 2008 11:08am

    Testing … <

  • 5
    Ruben Rubio
    December 10th, 2008 11:09am

    So, change

    while () {

    by

    while (<L>) {

  • 6
    Alex
    December 10th, 2008 2:30pm

    Thanks for contributing, Ruben.

    Actually, I see the script is in need of revision as my stats once again started to seem poisoned.

    Looks like .js access is the single most reliable indication for humanness, robots are rarely interested in JavaScript. Only problem – not all sites readily include this type of files in every page, so this test isn’t totally universal out of the box.

  • 7
    Joe
    April 7th, 2009 4:46pm

    What about old reports?!

  • 8
    Joe
    April 7th, 2009 4:47pm

    What about old reports?!

  • 9
    Alex
    April 7th, 2009 5:10pm

    You’ll need to re-run logs through new filter obviously.
    The old reports just lack the needed key information to distinguish trash within them.

    BTW, I believe it’s a good policy to keep all your web access logs. Quite a few of my websites take only 300 MB uncompressed in log files space per month (just over 1 million hits). Disks are cheap nowadays and you may get new ideas of what to do with the logs in future.

  • 10
    Carsten
    September 22nd, 2009 8:43pm

    Is there any way to make AWStats apply this filter on the fly as it analyzes the log files?
