As SEOs we’ve come to rely heavily on analytics packages – it’s how we prove our value. We’ve always been told spiders don’t execute Javascript. That means stats based on log files include bot traffic, while stats based on Javascript tagging don’t. Some see this as a problem, while others see it as a benefit because the tag-based data is pure human traffic.
But what if that weren’t true? What if competitors and other unfriendlies decided to manipulate the system for their advantage? Or what if bored computer science students decided to mess with your web analytics rather than hack your root password? What if it became common for botnets to roam the Internet and toy with the analytics of random sites? And what if they were good at doing it – mimicking human behavior in relation to geolocation, browser type, time on site, pages crawled, etc?
Other forms of hacking are routine these days – if you’ve ever managed a web server you just expect to see hackers trying to break into it. There’s going to come a time, pretty soon, when analytics will be routinely hacked, messed with, and corrupted. The sooner those of us who rely on analytics start thinking about it, the better.
I’ve encountered a couple of incidents recently that can only be explained by spiders that are indeed mimicking human behavior and executing Javascript-based analytics tags. This is actually a bit distressing, since it means bots could become an increasing problem for web analytics.
Case 1 – Obvious Bots & Suspicious Behavior
We have one client that uses both Google Analytics and Unica’s NetInsight. Recently we were looking at some extreme cases of users with high page counts per visit. Here’s an example of one visitor (who claims to be running IE7) that came in, pulled the first page once, then pulled the next page 1,055 times!
For the same day Google Analytics shows a total of 500 page views for the same document and says there was 1 unique page view. In other words, Google Analytics seems to cap the damage at 500 (a nice even number). I love this next graphic – 500 pageviews, 1 unique view, time on page: 3 seconds!
That’s clearly bot activity – no human is going to load a page more than 1,000 times in rapid succession. But the significant things to take note of are 1) the bot executed the Javascript-based tags, and 2) it lied about its user agent rather than properly identifying itself…
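If you can get hit-level data out of your raw logs or your analytics package, catching this kind of obvious bot is trivial. Here’s a minimal sketch in Python – the CSV layout, column names, and threshold are my own assumptions for illustration, not something any particular package actually exports:

```python
import csv
from collections import Counter, defaultdict

REPEAT_THRESHOLD = 50  # no human requests one page this many times in a session

def find_repeat_offenders(path):
    # Count how many times each page was requested within each session.
    repeats_per_session = defaultdict(Counter)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # expected columns: session_id, page
            repeats_per_session[row["session_id"]][row["page"]] += 1

    flagged = []
    for session_id, pages in repeats_per_session.items():
        page, count = pages.most_common(1)[0]
        if count >= REPEAT_THRESHOLD:
            flagged.append((session_id, page, count))
    return flagged

if __name__ == "__main__":
    for session_id, page, count in find_repeat_offenders("hits.csv"):
        print(f"session {session_id}: {page} requested {count} times")
```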
Then there are other cases that just make you wonder…
- The person (?) in Asia who had 591 page views on different pages with a few natural duplicates in a single session of less than 2 hours.
- The “person” in Greece who did 1,250 page views in 4 hours. It started out looking pretty normal, then all of a sudden you see the exact same page repeated 34 times over a 6-minute period…
- The “person” in Japan who did 181 page views in one session – nearly all one type of page, many of them in sequential database order.
Sure, it’s possible those cases were real human activity, but they’re very suspicious… Is someone really going to spend 2 to 4 hours on one web site, or view a site largely in sequential database order?
So in this case what we’re seeing is some bots that are completely obvious (like the one that hits the same page over 1,000 times in rapid succession), and other cases that may well be bots, but it’s hard to say. One can assume there is also more discreet bot activity mixed in there that isn’t easily identifiable…
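The “sequential database order” pattern above lends itself to a similar quick-and-dirty check. This sketch assumes page URLs carry a numeric record ID (the regex, column names, and run-length cutoff are illustrative guesses on my part):

```python
import csv
import re
from collections import defaultdict

ID_PATTERN = re.compile(r"id=(\d+)")  # hypothetical URL scheme, e.g. /product?id=123
RUN_THRESHOLD = 15                    # this many consecutive IDs in a row looks scripted

def longest_sequential_run(ids):
    # Length of the longest run of strictly consecutive IDs (n, n+1, n+2, ...).
    best = run = 1
    for prev, cur in zip(ids, ids[1:]):
        run = run + 1 if cur == prev + 1 else 1
        best = max(best, run)
    return best

def find_sequential_sessions(path):
    ids_by_session = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # expected columns: session_id, page
            match = ID_PATTERN.search(row["page"])
            if match:
                ids_by_session[row["session_id"]].append(int(match.group(1)))

    flagged = []
    for session_id, ids in ids_by_session.items():
        run = longest_sequential_run(ids)
        if run >= RUN_THRESHOLD:
            flagged.append((session_id, run))
    return flagged
```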
Case 2 – Suspicious Affiliate Traffic
Last week I was also seeing strange things in the analytics for a customer whose sites are very different from the ones in the first case. The customer runs several sites that get affiliate traffic, and when I looked at the reports, a few of the affiliates stood out like sore thumbs – details of their traffic patterns were completely different from other affiliates and from non-affiliate traffic. However, in certain cases their traffic blended in with the other traffic perfectly.
In the following two charts I’ve replaced the names of the referring sites with codes (A1, A2, A3, etc.). You can see two referring sites (A1 and A3) really stand out – they have completely unrealistic bounce rates. And other data (not shown here) reveal that their click-through rates (to learn more about the product) are typically more than 5 times the click-through rates from other affiliate sites.
Yet in the traffic to a similar site owned by the same customer and promoted by many of the same affiliates, traffic from one of the suspicious affiliates suddenly blends in perfectly with the other affiliates, while the other suspicious affiliate continues to stand out. (The B# codes label the referrers in this graphic; the A# codes show where they appeared in the previous graphic.)
Notice A3 (B5) now has a normal bounce rate (a little high, actually), while A1 (B1) continues to have an abnormally low bounce rate.
We asked our customer to look into sales from these affiliates and they confirmed that, with the exception of one day for one of the suspicious affiliates, the conversion rates (defined by actual sales) were extremely low.
Going further, it’s not like the suspicious affiliate traffic is coming from one or two servers that can be easily identified. It appears to be coming from all over – just like “normal” traffic would. Here’s a chart of where the suspicious affiliate traffic supposedly came from – notice that the bounce rate is the only thing that’s really remarkable…
Now, compare that to regular traffic from a non-affiliate site, which is typical of the non-suspicious traffic…
Those are pretty much the same traffic centers as the suspicious traffic. The NY bounce rate looks a bit odd, but I suppose that could just be a factor of sample size.
And the weird affiliate traffic has a broad spectrum of browsers as well…
The bottom line is these affiliates are playing some sort of game with bots and sending fake traffic. Doing it to the sites our customer runs doesn’t help them in sales and makes their conversion ratios look bad, but I’m guessing they do this as a blanket policy for all of their outbound links: there are link-trade scenarios where they benefit from the perception that they can send a lot of traffic. And if they get a reputation in their industry for having a lot of traffic, that helps them long term, even if a few people know their conversion rates are low.
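For what it’s worth, here’s roughly how you could surface referrers like A1 automatically instead of eyeballing charts – compare each referrer’s bounce rate against the spread across all referrers and flag the outliers. The input format and the 2-sigma cutoff are my own assumptions, not anything a particular analytics package provides:

```python
import csv
from statistics import mean, stdev

def bounce_rate_outliers(path, sigmas=2.0):
    # expected columns: referrer, visits, bounces
    rows = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            visits = int(row["visits"])
            if visits:
                rows.append((row["referrer"], int(row["bounces"]) / visits))

    if len(rows) < 3:  # not enough referrers to say anything meaningful
        return []
    rates = [rate for _, rate in rows]
    mu, sd = mean(rates), stdev(rates)
    return [(ref, rate) for ref, rate in rows if sd and abs(rate - mu) > sigmas * sd]

if __name__ == "__main__":
    for referrer, rate in bounce_rate_outliers("referrers.csv"):
        print(f"{referrer}: bounce rate {rate:.0%} looks like an outlier")
```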
Wrap Up:
Judah Phillips wrote a post last July where he talked about seeing a similar phenomenon. He said he sees bots doing the following:
- Crawl inordinate numbers of pages per visit when compared to human visitors
- Enter the site at various intervals for various durations
- Crawl a site in unusual patterns
- Repeatedly request pages that human visitors don’t access
And he goes on to say bots are evolving in that they…
- Execute javascript
- Enter the site from various referrers using various methods
- Come from different IP addresses and subnets
- Repeatedly hit one page
- Spoof their user agents, thus not identifying themselves
- Take cookies
That’s pretty much what I’ve seen. Some of those behaviors come from obvious first-generation Javascript-capable bots that are easily identified (BTW, Javascript support isn’t actually required, but it makes the hacker’s life considerably easier). What’s a bit scary are the “better bad bots” that blend in with human traffic. Without taking unusual measures, the only way to tell the bot traffic from real traffic is when they slip up – when they don’t understand the typical traffic patterns of the sites they’re sending fake traffic to, or otherwise act in ways only bots would act (like asking for the same page 1,000 times).
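If you wanted to screen for this systematically, one approach is to score each session on several of those weak signals rather than relying on any single one. The fields, weights, and thresholds below are entirely made up for illustration – real values would have to be tuned against what “normal” looks like for your own site:

```python
from dataclasses import dataclass

@dataclass
class Session:
    pages: int                   # page views in the session
    max_repeats: int             # most times any single page was requested
    duration_seconds: int        # session length
    bot_user_agent: bool         # the user agent admits to being a bot

def bot_score(s: Session) -> int:
    # Add up weak signals; the weights and cutoffs are made-up examples.
    score = 0
    if s.bot_user_agent:
        score += 3
    if s.max_repeats >= 20:                         # hammering one page
        score += 3
    if s.pages >= 200:                              # inordinate page count per visit
        score += 2
    if s.pages >= 30 and s.duration_seconds < 60:   # far too fast to be human
        score += 2
    return score  # e.g. treat a score of 3 or more as "probably a bot"
```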
The newer bots that do a decent job of mimicking humans are the real worry when you think about it, since they have the power to render web analytics largely useless by filling it with garbage data.
Scenario 1: Imagine an SEO company that “proves” to its customer that its actions are resulting in more traffic to the customer’s site. The thing is, it’s mostly just their bots that have been crawling the customer’s site. Just about anything can be faked, and they have access to the baseline analytics, so they know what norms they need to mimic. Let’s say something seems weird, the customer decides to end the relationship with them, and hires you to do their SEO. The first company has the ability to slowly reduce the apparent traffic to your customer’s site, which will make the customer (and you) think you aren’t doing as good a job as the first company that was faking traffic.
Scenario 2: Someone asks you for a link exchange and they swear they have a lot of traffic. They provide the Google Analytics reports to back it up, and after the link exchange you’re seeing “solid” traffic from them, but in reality you’re sending them way more human visitors than they’re sending you…
Scenario 3: You just launched an online marketing campaign and you want to judge its effectiveness. The problem is your competitor has been hostile in the past. You don’t know whether or not you can trust your analytics. What if your competitor is sending fake bot traffic just to muddy your analytics? The truth is in there somewhere, but you’re not confident the stats you’re seeing aren’t manipulated in some way.
The one silver lining to all of this is that e-commerce is relatively unaffected. The bots can do many nasty things, but in the end they’re cheap dates and don’t put their money where their mouth is. Still, they can affect conversion ratios, so you need to look at absolute revenue – something they can’t affect without creating discrepancies between your actual revenue and what your analytics package reports, which would reveal their presence.
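A simple way to run that cross-check is to reconcile the revenue your analytics package reports against what your order system actually booked, day by day. Both inputs here are hypothetical exports (date, revenue), and the tolerance is an arbitrary example – anything beyond it is worth a closer look:

```python
import csv

def load_daily_revenue(path):
    # expected columns: date, revenue
    with open(path, newline="") as f:
        return {row["date"]: float(row["revenue"]) for row in csv.DictReader(f)}

def revenue_discrepancies(analytics_path, orders_path, tolerance=0.05):
    analytics = load_daily_revenue(analytics_path)   # what the analytics package reports
    orders = load_daily_revenue(orders_path)         # what the order system actually booked
    suspicious = []
    for date in sorted(set(analytics) | set(orders)):
        reported, actual = analytics.get(date, 0.0), orders.get(date, 0.0)
        if actual and abs(reported - actual) / actual > tolerance:
            suspicious.append((date, reported, actual))
    return suspicious
```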
We’ve been in a comfortable bubble where, for the most part, people have played fair. But that could change, and in some cases it looks like it already has changed. If your adversary does it right, you may never know your statistics are lying to you.