How to eliminate spam bots from AWStats for good
The two most common approaches in Web analytics are:
- Web server logfile analysis
- Page tagging
Page tagging is the method of choice from a commercial standpoint. However, it has its characteristic drawbacks:
- changes to the web application are needed
- vendor lock-in of some sort takes place (regardless of whether you use a subscription-based solution or acquire a hosted one).
On the other hand, there is one nice web stats tool operating in the good old logfile analysis realm: AWStats. Until recently it was a reliable workhorse for many webmasters, delivering quite useful reports on origin breakdown, sessions (visit duration), and lists of landing ("entry") and exit pages - categories commonly associated with the more complex page tagging statistics systems.
What happened to it?
It’s spam bots and referrer spammers who now spoil the reports produced by AWStats.
When you look at, say, the Top Hosts report in AWStats and see that (almost) all of the Top 10 are non-humans, it’s frustrating. They may not account for the bulk of the totals, but they push real people down beyond the Top 10 report’s boundaries, and you simply lose that whole part of your stats, which hardly speaks in favor of the stats system in use.
And it was exactly this section of the AWStats page - Top Hosts - that made me think up ways to cure the problem.
As I mentioned above, there are two distinct types of spoilers in the stats. They are somewhat similar, as both are robots, yet they have major differences:
- Referrer spammers.
These specifically target the logs being analyzed, so they have been combated for some time already.
- Comment spammers.
These do not target logs in their malice; messed-up logs are merely a by-product of their activities. Because of this, inadequate attention seems to have been paid to them to date.
So here’s an idea on how to weed comment spam bot activity out of web server statistics, presumably working for referrer spam bots as well.
When you look at a Top Hosts report for a highly spammed web site, you will most likely notice remarkably similar numbers for each host in the two columns - Pages and Hits.
That equality, meaning a Pages/Hits ratio of 1, points by contrast to one very special characteristic of human users accessing web pages:
Real people’s browsers request some non-page elements besides the pages themselves.
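To make the Pages/Hits observation concrete, here is a minimal Python sketch that tallies both numbers per host from raw log lines. It assumes the Apache common/combined log layout and reuses the extension list from the AWStats NotPageList setting; the names `LOG_LINE`, `is_page` and `pages_and_hits` are made up for illustration, and AWStats itself applies more rules than this when counting.

```python
import re
from collections import defaultdict

# Extensions treated as non-page elements (same values as the AWStats NotPageList setting).
NOT_PAGE_EXTENSIONS = {"css", "js", "class", "gif", "jpg", "jpeg", "png", "bmp"}

# Assumed log layout: host ident user [time] "METHOD path PROTOCOL" status ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)[^"]*" (\d{3})')


def is_page(path):
    """Return True when the requested path does not end in a non-page extension."""
    filename = path.split("?", 1)[0].rsplit("/", 1)[-1]
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return extension not in NOT_PAGE_EXTENSIONS


def pages_and_hits(lines):
    """Return {host: [pages, hits]} for every line that matches LOG_LINE."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        host, path, _status = match.groups()
        totals[host][1] += 1      # every request counts as a hit
        if is_page(path):
            totals[host][0] += 1  # only non-asset requests count as pages
    return dict(totals)
```

A host whose two counters come out equal never fetched a stylesheet, script or image, which is the pattern described above.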
Technically this could be detected as:
- requesting some files from this list of extensions (taken directly from the AWStats config file):
NotPageList="css js class gif jpg jpeg png bmp"
- having some of the requests return HTTP codes 304, 303 and such.
Anybody NOT doing at least one of the above is very likely a spamming robot. Robots aren’t fond of style, are they?
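The rule above can be sketched as a two-pass log filter in Python. This is only an illustration of the idea, not the script from page 2: it assumes the Apache common/combined log layout, the names `LOG_LINE`, `looks_human` and `filter_log` are hypothetical, and it checks only status 304 (the cache revalidation answer) rather than every code mentioned above, which is my own simplification.

```python
import re

# Extensions from the NotPageList setting quoted above.
NOT_PAGE_EXTENSIONS = {"css", "js", "class", "gif", "jpg", "jpeg", "png", "bmp"}

# Assumption: only 304 is treated as a browser signal here; extend the set if needed.
BROWSER_STATUS_CODES = {"304"}

# Assumed log layout: host ident user [time] "METHOD path PROTOCOL" status ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)[^"]*" (\d{3})')


def looks_human(path, status):
    """One request is browser-like if it fetches a non-page file or gets a 304."""
    filename = path.split("?", 1)[0].rsplit("/", 1)[-1]
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return extension in NOT_PAGE_EXTENSIONS or status in BROWSER_STATUS_CODES


def filter_log(lines):
    """Keep only lines from hosts that made at least one browser-like request."""
    lines = list(lines)

    # Pass 1: collect hosts that showed browser-like behaviour at least once.
    human_hosts = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and looks_human(match.group(2), match.group(3)):
            human_hosts.add(match.group(1))

    # Pass 2: keep every line of those hosts, drop everything else.
    kept = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(1) in human_hosts:
            kept.append(line)
    return kept
```

Two passes are needed because a browser’s page request usually appears in the log before its stylesheet or image requests, so a host can only be judged after all of its lines have been seen. The kept lines could then be written to a new file and analyzed in place of the raw log.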
One little problem I see with this method is losing mobile users and users on slow connections - those who try to cut traffic in every possible way.
But hey, they aren’t likely to spend money with you anyway (I mean, they are respectable users, but nobody really loses anything by leaving them out of the web site stats).
Yet another class of potential blunder might be some old site featuring text-only pages. Here again, if you’re planning to use those pages as an advertising medium, you’ll have to find a way to include additional elements in the pages, and those will let the filter distinguish humans from machines.
So the solution (see page 2 below), built on a one-line idea and easy to implement, seems to have a huge effect for little effort.
And I’m telling you, I literally rediscovered some of the AWStats reports after installing the new log file filter.
For example, I found out that about 30% of STEREO.org.ua visitors in the old stats were comment spammers. On some less popular sites this figure made up more than half of the page traffic!
Or, before the filter I saw only a modest 15-20% of visitors bookmarking my website (and I thought that was nice); afterwards I was amazed to see around 30% of visitors adding my blog to favourites.
I have to say that is really encouraging.
A recent trend of Search Keywords spamming appears to be eliminated too.
In short - just try it for yourself, and who knows, maybe you’ll be able to put off purchasing a commercial analysis package until the next “big” spammers’ attack.




November 3rd, 2007 4:23am
[...] Statistics are very reliable, I believe. A special technique for filtering out spam bots from apache log files has been employed for the report Feb-Oct 2007 (blog.stereo.org.ua) and Jul-Oct 2007 [...]
December 22nd, 2007 3:20am
[...] reason), and it’d be nice to filter them out of AWStats somehow. I found an interesting script that purports to help solve this problem, but I haven’t actually figured out how the script [...]