How to eliminate spam bots from AWStats for good

Here’s an example Perl implementation of the Apache access.log filter for “telling human from computer [site visitors] apart”.
For a description, please see the main page.

#!/usr/bin/perl -w
#
# Extract human-like entries from httpd server log.
# Note:
# Some legitimate users may be filtered out,
# however, they probably are not interesting economically anyway,
# so not really required in the analysis.

my $logfile = defined $ARGV[0] ? $ARGV[0] : "";

# Pass 1: get the list of (spam) bots.
open(L, '<', $logfile) or die "Logfile $logfile inaccessible (pass 1): $!\n";
my %bots = ();
my %humans = ();
my $bad_lines = 0;
my $MAX_BAD = 3;
my $verb = 0;
while (<L>) {
        /^(\S+) .+? (\S*?) HTTP\/\S*? (\d\d\d) / or do {
                ++$bad_lines;
                $verb and $bad_lines <= $MAX_BAD and
                        warn "Bad line: $_";
                next;
        };
        $host           = $1;
        $request        = $2;
        $status         = $3;
        if (like_human()) {
                $humans{$host} = 1;
                delete $bots{$host};
        }
        else {
                exists $humans{$host} or $bots{$host} = 1;
        }
}
close L;
$verb && printf STDERR " bots: %d, humans: %d, bad lines: %d\n",
        scalar keys %bots,
        scalar keys %humans,
        $bad_lines;

# Pass 2: extract human-like entries.
open(L, '<', $logfile) or die "Logfile $logfile inaccessible (pass 2): $!\n";
while (<L>) {
        /^(\S+) / or next;
        $host           = $1;
        print unless exists $bots{$host};
}
close L;

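sub like_human
{
        # Once a host has looked human, it stays human.
        return 1 if exists $humans{$host};
        # Human-like: a local request answered with 304 Not Modified
        # (a cached copy being revalidated), or a request for a static asset.
        return (
                ($request =~ /^\// && $status eq "304") ||
                $request =~ /\.(?:js|css|jpe?g|png|gif|ico|pdf|mp3|avi)$/i
                # Is "htm" ok?..
        ) ? 1 : 0;
}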
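
To see what the first pass decides, here is a minimal self-contained sketch that applies the same regular expression and the same human-likeness rule to a few in-memory lines. The sample log lines, the host addresses and the helper names `looks_human` and `classify` are made up for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same test as like_human() above, written as a pure function.
sub looks_human {
    my ($request, $status) = @_;
    return 1 if $request =~ m{^/} && $status eq '304';
    return 1 if $request =~ /\.(?:js|css|jpe?g|png|gif|ico|pdf|mp3|avi)$/i;
    return 0;
}

# Hypothetical helper: split the hosts seen in @lines into humans and bots.
# A host is a bot only if none of its requests looked human.
sub classify {
    my @lines = @_;
    my (%humans, %seen);
    for my $line (@lines) {
        # Same pattern as in pass 1; lines that do not match are skipped.
        my ($host, $request, $status) =
            $line =~ /^(\S+) .+? (\S*?) HTTP\/\S*? (\d\d\d) / or next;
        $seen{$host} = 1;
        $humans{$host} = 1 if looks_human($request, $status);
    }
    my @bots = sort grep { !$humans{$_} } keys %seen;
    return ([sort keys %humans], \@bots);
}

# Made-up entries in Apache combined log format.
my @sample = (
    '10.0.0.1 - - [01/Jan/2010:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [01/Jan/2010:00:00:02 +0000] "GET /style.css HTTP/1.1" 200 812 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [01/Jan/2010:00:00:03 +0000] "POST /comment.php HTTP/1.0" 200 40 "-" "-"',
    '10.0.0.3 - - [01/Jan/2010:00:00:04 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0"',
);

my ($humans, $bots) = classify(@sample);
print "humans: @$humans\n";   # humans: 10.0.0.1 10.0.0.3
print "bots:   @$bots\n";     # bots:   10.0.0.2
```

Host 10.0.0.1 counts as human because it fetched a stylesheet, 10.0.0.3 because its request was answered with 304, and 10.0.0.2 stays a bot because it never did either.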



11 thoughts on “How to eliminate spam bots from AWStats for good”

  1. Pingback: Results of taking care for the site • STEREO.org.ua

  2. Pingback: Penultimate Reality » Blog Archive » Spambots Hurting Statistics

  3. If you have this error:

    Use of uninitialized value in pattern match (m//) at ./remove_spambots.pl line 19.

    Change lines 19 and 45, which look like this:

    while () {

    to

    while (<L>) {

  4. Thanks for contributing, Ruben.

    Actually, I see the script is in need of revision, as my stats have once again started to look poisoned.

    It looks like .js access is the single most reliable indicator of humanness; robots are rarely interested in JavaScript. The only problem: not all sites include this type of file in every page, so this test isn’t universal out of the box.

  5. Obviously, you’ll need to re-run the logs through the new filter.
    The old reports simply lack the key information needed to pick out the trash in them.

    BTW, I believe it’s good policy to keep all your web access logs. The log files for quite a few of my websites take up only 300 MB uncompressed per month (just over 1 million hits). Disks are cheap nowadays, and you may get new ideas about what to do with the logs in the future.
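
The stricter test mentioned in comment 4, where a .js request is the only accepted sign of a human visitor, could be sketched roughly as follows. This is a guess at how such a revision might look, not the author's revised script: `looks_human_js_only` is a hypothetical name, and ignoring a trailing query string is an assumption added here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stricter rule: only a request for a .js file counts as human.
# A trailing query string such as "?v=2" is ignored (an assumption; the
# pattern in the original script anchors on the very end of the request).
sub looks_human_js_only {
    my ($request) = @_;
    return $request =~ /\.js(?:\?.*)?$/i ? 1 : 0;
}

for my $request ('/js/site.js', '/js/site.js?v=2', '/index.html', '/logo.png') {
    printf "%-18s %s\n", $request,
        looks_human_js_only($request) ? 'human' : 'unknown';
}
```

As the comment notes, this only works on sites where every page pulls in at least one JavaScript file; elsewhere, real visitors would be classified as bots.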
