How to eliminate spam bots from AWStats for good
Here’s an example Perl implementation of the Apache access.log filter for “telling human from computer [site visitors] apart”.
For the description, please see the main page.
#!/usr/bin/perl -w
#
# Extract human-like entries from httpd server log.
# Note:
# Some legitimate users may be filtered out,
# however, they probably are not interesting economically anyway,
# so not really required in the analysis.
my $logfile = defined $ARGV[0] ? $ARGV[0] : "";
# Pass 1: get the list of (spam) bots.
open(L, $logfile) or die "Logfile $logfile unaccessible (pass 1).\n";
my %bots = ();
my %humans = ();
my $bad_lines = 0;
my $MAX_BAD = 3;
my $verb = 0;
while (<L>) {
/^(\S+) .+? (\S*?) HTTP\/\S*? (\d\d\d) / or do {
++$bad_lines;
$verb and $bad_lines <= $MAX_BAD and
warn "Bad line: $_";
next;
};
$host = $1;
$request = $2;
$status = $3;
if (like_human()) {
$humans{$host} = 1;
delete $bots{$host};
}
else {
exists $humans{$host} or $bots{$host} = 1;
}
}
close L;
$verb && printf STDERR " bots: %d, humans: %d, bad lines: %d\n",
scalar keys %bots,
scalar keys %humans,
$bad_lines;
# Pass 2: extract human-like entries.
open(L, $logfile) or die "Logfile $logfile unaccessible (pass 2).\n";
while (<L>) {
/^(\S+) / or next;
$host = $1;
print unless exists $bots{$host};
}
close L;
sub like_human
{
return 1 if exists $humans{$host};
return
$request =~ /^\// &&
$status eq "304" ||
$request =~ /\.(?:js|css|jpe?g|png|gif|ico|pdf|mp3|avi)$/i
# Is "htm" ok?..
?
1 : 0;
}
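To see what the script's line pattern actually captures, here is a small sketch that runs the same regular expression over two made-up log entries in Apache combined format. The helper name `parse_line` and the sample lines are my own assumptions for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same line pattern as in the script: host, requested path, status code.
my $line_re = qr/^(\S+) .+? (\S*?) HTTP\/\S*? (\d\d\d) /;

# Hypothetical helper: parse one log line into (host, request, status),
# or return an empty list if the line does not match.
sub parse_line {
    my ($line) = @_;
    return $line =~ $line_re ? ($1, $2, $3) : ();
}

# Made-up sample entries in Apache combined log format.
my @sample = (
    '192.0.2.10 - - [28/Sep/2007:10:00:00 +0300] "GET /style.css HTTP/1.1" 200 1234 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [28/Sep/2007:10:00:05 +0300] "POST /wp-comments-post.php HTTP/1.0" 200 512 "-" "-"',
);

for my $line (@sample) {
    my ($host, $request, $status) = parse_line($line);
    print "$host requested $request (status $status)\n";
}
```

With these samples, the first host would be marked human-like by `like_human` (it fetched a .css file), while the second would stay in the bot list unless it later requests a static asset or gets a 304.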
Categories: On the Web, Ongoing Projects, Tips & tricks, UNIX & Computing
Tags: How-To • Ideas • On the Web • Perl • spam • spam_bots • stats • traffic • UNIX
Published on Friday, September 28th, 2007
November 3rd, 2007 4:23am
[...] Statistics are very reliable, I believe. A special technique for filtering out spam bots from apache log files has been employed for the report Feb-Oct 2007 (blog.stereo.org.ua) and Jul-Oct 2007 [...]
December 22nd, 2007 3:20am
[...] reason), and it’d be nice to filter them out of AWStats somehow. I found an interesting script that purports to help solve this problem, but I haven’t actually figured out how the script [...]
December 10th, 2008 11:07am
If you have this error:
Use of uninitialized value in pattern match (m//) at ./remove_spambots.pl line 19.
Change the lines at line 19 and line 45 that look like:
while () {
to
while (<L>) {
December 10th, 2008 11:09am
So, change
while () {
to
while (<L>) {
December 10th, 2008 2:30pm
Thanks for contributing, Ruben.
Actually, I see the script is in need of revision, as my stats have once again started to look poisoned.
Looks like .js access is the single most reliable indication of humanness: robots are rarely interested in JavaScript. The only problem is that not all sites include this type of file in every page, so this test isn’t universal out of the box.
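A minimal sketch of what such a stricter test could look like, assuming a host counts as human only once it has fetched a JavaScript file. The name `fetched_js`, the query-string handling, and the sample entries are assumptions for illustration, not the revised script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stricter test: true only for requests for a .js file
# (an optional query string after the file name is ignored).
sub fetched_js {
    my ($request) = @_;
    return $request =~ /\.js(?:\?.*)?$/i ? 1 : 0;
}

# Made-up (host, request) pairs standing in for parsed log entries.
my @entries = (
    [ '192.0.2.10',   '/index.html' ],
    [ '192.0.2.10',   '/js/site.js?ver=2' ],
    [ '198.51.100.7', '/index.html' ],
    [ '198.51.100.7', '/wp-comments-post.php' ],
);

my (%seen, %humans);
for my $entry (@entries) {
    my ($host, $request) = @$entry;
    $seen{$host} = 1;
    $humans{$host} = 1 if fetched_js($request);
}

for my $host (sort keys %seen) {
    printf "%s: %s\n", $host, $humans{$host} ? 'human-like' : 'bot-like';
}
```

On a site whose pages do not all load a script, this would throw away real visitors, so it only makes sense where every page includes at least one .js file.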
April 7th, 2009 4:46pm
What about old reports?!
April 7th, 2009 5:10pm
You’ll need to re-run the logs through the new filter, obviously.
The old reports simply lack the key information needed to distinguish the trash within them.
BTW, I believe it’s a good policy to keep all your web access logs. Quite a few of my websites take only 300 MB of uncompressed log file space per month (just over 1 million hits). Disks are cheap nowadays, and you may get new ideas about what to do with the logs in the future.
September 22nd, 2009 8:43pm
Is there any way to make AWStats apply this filter on the fly as it analyzes the log files?