Here’s an example Perl implementation of the Apache access.log filter for “telling human from computer [site visitors] apart”.
For the description, please see the main page.
#!/usr/bin/perl -w
#
# Extract human-like entries from httpd server log.
# Note:
# Some legitimate users may be filtered out,
# however, they probably are not interesting economically anyway,
# so not really required in the analysis.
my $logfile = defined $ARGV[0] ? $ARGV[0] : "";
# Pass 1: get the list of (spam) bots.
open(L, $logfile) or die "Logfile $logfile unaccessible (pass 1).\n";
my %bots = ();
my %humans = ();
my $bad_lines = 0;
my $MAX_BAD = 3;
my $verb = 0;
while (<L>) {
    # Pick out the host, the requested path and the HTTP status code.
    /^(\S+) .+? (\S*?) HTTP\/\S*? (\d\d\d) / or do {
        ++$bad_lines;
        $verb and $bad_lines <= $MAX_BAD and
            warn "Bad line: $_";
        next;
    };
    $host = $1;
    $request = $2;
    $status = $3;
    if (like_human()) {
        # One human-like request clears the host for good.
        $humans{$host} = 1;
        delete $bots{$host};
    }
    else {
        exists $humans{$host} or $bots{$host} = 1;
    }
}
close L;
$verb && printf STDERR " bots: %d, humans: %d, bad lines: %d\n",
    scalar keys %bots,
    scalar keys %humans,
    $bad_lines;
# Pass 2: extract human-like entries.
open(L, $logfile) or die "Logfile $logfile unaccessible (pass 2).\n";
while (<L>) {
    /^(\S+) / or next;
    $host = $1;
    print unless exists $bots{$host};
}
close L;
sub like_human
{
    return 1 if exists $humans{$host};
    return
        $request =~ /^\// &&
        $status eq "304" ||
        $request =~ /\.(?:js|css|jpe?g|png|gif|ico|pdf|mp3|avi)$/i
        # Is "htm" ok?..
        ?
        1 : 0;
}
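To see how the test classifies individual requests, here is a minimal standalone sketch. It is not part of the script above: the function name looks_human, the sample hosts and the sample log lines are all made up for illustration, and the function takes its inputs as arguments instead of reading the script's global variables.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same conditions as like_human() above, minus the %humans lookup:
# a 304 (Not Modified) reply for a local path, or a request for a
# static asset, counts as human-like.
sub looks_human {
    my ($request, $status) = @_;
    return 1 if $request =~ m{^/} && $status eq "304";
    return 1 if $request =~ /\.(?:js|css|jpe?g|png|gif|ico|pdf|mp3|avi)$/i;
    return 0;
}

# Hypothetical log lines in Apache combined format.
my @lines = (
    '10.0.0.1 - - [01/Jan/2009:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [01/Jan/2009:00:00:02 +0000] "GET /style.css HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
    '10.0.0.3 - - [01/Jan/2009:00:00:03 +0000] "GET /index.html HTTP/1.1" 304 0 "-" "Mozilla/5.0"',
);

for my $line (@lines) {
    # Same pattern as in pass 1: host, requested path, status code.
    my ($host, $request, $status) =
        $line =~ m{^(\S+) .+? (\S*?) HTTP/\S*? (\d\d\d) } or next;
    printf "%-10s %-14s %s => %s\n", $host, $request, $status,
        looks_human($request, $status) ? "human-like" : "not human-like";
}
```

In the full script a host flagged human-like once stays in %humans, so all of its lines survive pass 2, including plain page fetches like the first sample line.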
If you have this error:
Use of uninitialized value in pattern match (m//) at ./remove_spambots.pl line 19.
Change the lines at line 19 and line 45 that look like:
while () {
to:
while (<L>) {
Thanks for contributing, Ruben.
Actually, I see the script needs revising, as my stats have once again started to look poisoned.
Looks like .js access is the single most reliable indication of humanness, since robots are rarely interested in JavaScript. The only problem: not all sites include this type of file in every page, so this test isn’t totally universal out of the box.
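A stricter test along these lines might treat only JavaScript fetches as a sign of a human. This is a rough sketch, not the revised script: the function name fetches_js is hypothetical, and accepting a query string after the extension is an added assumption (the original filter anchors the extension at the end of the request).

```perl
use strict;
use warnings;

# Hypothetical stricter test: a request counts as human-like
# only if it asks for a .js file.
sub fetches_js {
    my ($request) = @_;
    # Optional query string, e.g. /app.js?v=2 (assumption, see above).
    return $request =~ m{\.js(?:\?.*)?$}i ? 1 : 0;
}

print fetches_js("/static/site.js"), "\n";   # JavaScript file
print fetches_js("/index.html"), "\n";       # ordinary page
```

On a site whose pages do not all load a script, this would push real visitors into the bot list, so it only suits sites that serve .js with every page.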
What about old reports?!
You’ll need to re-run the logs through the new filter, obviously.
The old reports simply lack the key information needed to pick out the trash within them.
BTW, I believe it’s a good policy to keep all your web access logs. Quite a few of my websites take only 300 MB of uncompressed log file space per month (just over 1 million hits). Disks are cheap nowadays, and you may get new ideas about what to do with the logs in the future.
Is there any way to make AWStats apply this filter on the fly as it analyzes the log files?