By Dennis. Last update: December 8, 2016.

google analytics spam guide

If you run your business by the numbers, you need to be able to trust them 100%.

But one look at your Google Analytics reports will show you that that’s easier said than done.

On any given day you’ll see tens to hundreds of visits from all kinds of strange places.

These aren’t real visitors. It’s Google Analytics spam, and it has become a big problem over the last few years.

So if you want to get rid of this spam and want to clean those fake visitors from your reports, this article will show you exactly how!

Get your free checklist that outlines all of the steps in this article, ideal to follow along or for future reference!

It’s been about 2 years since Google Analytics spam really became a problem, and the approach I use to fight it has evolved along the way.

Today I do the following for my clients:

  1. Set up multiple Google Analytics views
  2. Filter hostname spam
  3. Filter spam referrals
  4. Exclude known bots
  5. Create a spam-free segment

Let’s take a look at these steps in more detail.

Step 1. Set Up Multiple Google Analytics Views

By default, a new Google Analytics property comes with one view: All Website Data.

If you make changes to the settings, there is always the chance that something goes wrong. The data you’ve already got in your account won’t be affected, but the data that comes in after you make the changes will be modified.

Since there is no undo button, it’s a good idea to have a backup of your data.

google-analytics-create-new-views

So before you do anything, create 2 extra views. That gives you 3 views in total:

  • Main – your main view (you can rename All Website Data to this one)
  • Raw – a view without any changes or filters
  • Test – a view to test changes before you make them in the Main view

Google Analytics spam only exists in your reports. These are fake visitors that never land on your website. They use a loophole in the way Google Analytics works to fake visits from other websites. That’s why they are also called ghost spam.

So while it seems that you have visitors from big sites like apple.com or reddit.com, most of those aren’t real. Luckily, we can tell them apart from real visitors.

Step 2. Filter Hostname Spam

The first way to detect these fake visitors is via hostnames. In simple terms, your hostname is the name of your site.

Let’s look at the hostname report in Google Analytics:

google-analytics-spam--ghost-hostnames

A valid hostname is the domain of your site, which I blurred out in the example above. Besides that, the only other valid hostname in this example is checkout.shopify.com.

The only reason there might be a different domain is that you are using your Google Analytics tracking code with other tools.

If you’re using Google Analytics for ecommerce, the actual checkout is often on another domain (checkout.shopify.com), but those pages are loading your own Google Analytics code (that way you can track transactions). That’s why there is a different hostname.

The other hostnames in the example above, (not set), lifehacker.com, google.org and www.foxnews.com, are fakes.

Take a look at your own report in Google Analytics: Audience > Technology > Network > Hostname

google-analytics-spam-hostname-report

So instead of filtering out the fakes, we are only going to include valid hostnames in our reports. The rest can be ignored.

But you want to make sure to filter out only the spam, not the legit traffic!

Let’s take a look at which hostnames are valid:

  • your own domain (domain.com)
  • your own subdomains (blog.domain.com)
  • Content Delivery Networks (or CDN): Cloudflare or Akamai
  • Translation services: Google, Bing or Baidu
  • Shopping carts: Shopify or Lightspeed
  • Payment services: PayPal
  • Cache services: Google cache
  • Other tools that use your tracking code: landing page tools, email providers, etc.

It’s essential that you don’t filter out real traffic, so here are some examples of valid hostnames. It’s not a complete list, but it will tell you what to look for:

  • checkout.shopify.com (Shopify checkout pages)
  • yourshopifydomain.myshopify.com (your own Shopify domain)
  • translate.googleusercontent.com (Google Translate)
  • yourdomain.webshopapp.com (Lightspeed checkout pages)
  • develop.yourdomain.com (staging server)
  • dev.yourdomain.com (staging server)
  • yourdomain.us4.list-manage.com (MailChimp list settings)
  • fbrender.heyo.com (Facebook contest tool)
  • webcache.googleusercontent.com (Google Cache)
  • us4.campaign-archive.com (MailChimp archives)
  • cdn.yourdomain.com (a CDN service)
  • web.archive.org (users looking at old versions of your site via archive.org)
  • yourdomain.googleweblight.com (light version of your domain by Google)
  • yourwpenginedomain.wpengine.com (your WPEngine subdomain – WordPress only)
  • yourdomain.dev (staging server)
  • yourdomain.3dcartstores.com (your 3D cart subdomain)
  • translate.baiducontent.com (Baidu translate)
  • www.yourdomain.stfi.re (link tools)
  • www.youtube.com (if you use your tracking code on your YouTube channel)

Action time

You need to create a new filter on your Google Analytics view that only includes the traffic to the hostnames you specify.

I recommend starting with your Test view. Let it run for a week, then check your transactions and their value. These should be the same as in your Raw view, since we’re only filtering out fake traffic. Once you’ve verified that, you can create the same filter in your Main view.

Go to Admin > select the correct view > Filters > + Add Filter

Select Custom > Include > Filter Field: Hostname > Filter Pattern: see below > Save

google-analytics-filter-include-hostnames

In the Filter Pattern field you’re going to enter a combination of all the valid domains that you’ve found.

You have to do that in a special format, called a regular expression or regex.

Let me give an example to simplify it.

Example

I've discovered 2 good hostnames:
www.storegrowers.com
checkout.shopify.com

In the Filter Pattern field I'll enter: 
www\.storegrowers\.com|checkout\.shopify\.com

So that’s a backslash (\) in front of every dot and a pipe (|) in between domains.
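If you have more than a handful of valid hostnames, typing the pattern by hand gets error-prone. The same rule (a backslash in front of every dot, a pipe between domains) can be sketched in a few lines of Python. The helper name build_hostname_pattern is made up for this illustration, and the check at the end uses Python's re.search as a rough stand-in for how the filter matches, which is an assumption and not Google's actual implementation.

```python
import re

def build_hostname_pattern(hostnames):
    """Put a backslash in front of every dot and a pipe between hostnames."""
    return "|".join(h.replace(".", r"\.") for h in hostnames)

# The 2 good hostnames from the example above.
valid_hostnames = ["www.storegrowers.com", "checkout.shopify.com"]
hostname_pattern = build_hostname_pattern(valid_hostnames)
print(hostname_pattern)

# Rough local check: which hostnames would an Include filter keep?
# (re.search only approximates the matching done by Google Analytics.)
for host in ["www.storegrowers.com", "checkout.shopify.com", "www.foxnews.com", "(not set)"]:
    kept = re.search(hostname_pattern, host) is not None
    print(host, "->", "keep" if kept else "drop")
```

Paste the printed pattern into the Filter Pattern field of your Test view first, exactly as you would with a hand-written one.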

This will take care of a bunch of spam already, but not everything. Let’s look at step 3 to filter out the other spam in your Google Analytics account.

Step 3. Filter Referral Spam

Besides ghost hostnames, your Google Analytics reports are also full of ghost referrals.

These are websites that appear to send visitors to your site, but actually aren’t.

They play on the curiosity of website owners: it’s only natural to wonder what a site that links to you is all about, and to go visit it. (Sidenote: most of these websites actually don’t work anymore, so it’s unclear why they would bother with this shit)

google-analytics-spam-ghost-referrals

To keep these out of our reports, we’re going to set up filters that exclude them.

As I mentioned before, my approach has changed over the last couple of years. In the beginning I kept track of all of the domains that I found in my own reports, or those of clients. But that quickly became too much work to keep updated.

So my new approach is that instead of excluding the exact referrals, I try to look at the patterns in all of the referrals. Those won’t filter out everything, but they will get you 95% of the way there.

Analytics provider Piwik maintains a nicely updated Google Analytics spam list on GitHub with over 483 domains. I’ve rolled all of my domains into that list, so that’s what I’m using in this post.

To do this, you’ll create 2 new custom filters on your Views.

Action time

Go to Admin > select the correct view > Filters > + Add Filter

Select Custom > Exclude > Filter Field: Campaign Source > Filter Pattern: see below > Save

google-analytics-spam-referrals-new-filter

In the Filter Pattern field you’re going to enter a combination of all the spam referrals that you’ve found, again in the regex format described above.

There is a 255 character limit to the field, so you might have to create multiple similar filters.

Example

In my reports I've found 4 spam referral domains:
motherboard.vice.com
lifehacĸer.com
site-auditor.online
addons.mozilla.org

In the Filter Pattern field I'll enter:
motherboard\.vice\.com|lifehacĸer|site-auditor|addons\.mozilla\.org

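So that’s a backslash (\) in front of every dot and a pipe (|) in between domains. For lifehacĸer and site-auditor I’ve left out the ending, so the pattern matches those names whatever comes after them.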
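With a long list, such as the Piwik one, the 255 character limit means spreading the domains over several filters. One way to do the splitting is sketched below in Python. The function name split_into_filters and the example.com domains are placeholders for illustration, not real spam sources.

```python
def split_into_filters(domains, limit=255):
    """Group escaped domains into pipe-joined patterns of at most `limit` characters."""
    patterns = []
    current = ""
    for domain in domains:
        escaped = domain.replace(".", r"\.")
        candidate = escaped if not current else current + "|" + escaped
        if len(candidate) <= limit or not current:
            # It fits, or it is a single entry that is too long on its own.
            current = candidate
        else:
            # Adding this domain would exceed the limit: start a new filter.
            patterns.append(current)
            current = escaped
    if current:
        patterns.append(current)
    return patterns

# Placeholder domains, only to show the splitting.
spam_domains = ["spam-site-%d.example.com" % i for i in range(40)]
filter_patterns = split_into_filters(spam_domains)
for number, pattern in enumerate(filter_patterns, start=1):
    print("Filter %d: %d characters" % (number, len(pattern)))
```

Each printed pattern goes into its own Exclude filter on Campaign Source.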

Like I said before, you can add only the spam you see in your reports, or you can use the huge list of domains mentioned above.

I use a mix of filters for specific domains & things that keep popping up in reports across clients. I’ve also started excluding a couple of top-level domains that are responsible for most of the spam (.site, .xyz, .рф, .ru, .info, .top, .ua, .kz, .uz, .ga & .cf). I know that there is the possibility that a legit site from one of those domains will send me traffic, but I’m willing to take the chance.
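For the domain endings, a single pattern can cover the whole list. The Python sketch below shows one possible pattern. It is an illustrative assumption, not the exact filter used for clients: it ties the endings to the end of the source with $, so that a source like info.example.com is not caught by accident. The example sources are made up.

```python
import re

# Endings taken from the list above; the pattern itself is an illustrative assumption.
endings = ["site", "xyz", "рф", "ru", "info", "top", "ua", "kz", "uz", "ga", "cf"]
tld_pattern = r"\.(" + "|".join(endings) + r")$"
print(tld_pattern)

# Made-up sources to check what the pattern would exclude.
for source in ["free-traffic.xyz", "сайт.рф", "info.example.com", "www.storegrowers.com"]:
    excluded = re.search(tld_pattern, source) is not None
    print(source, "->", "exclude" if excluded else "keep")
```

As with the other filters, try a pattern like this on your Test view before trusting it.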

Again, you’ll have to add this filter to each view separately (leave your Raw view untouched). Start with your Test view, then roll it out to your Main view if things look correct.

Step 4. Exclude Known Bots

Google doesn’t really seem to care about these spam issues, otherwise I wouldn’t have to write a detailed blog post on how to deal with Google Analytics spam.

But they do have a small feature that helps out.

At the view level, Google Analytics offers an option to filter out all known bots. Go to Admin > View Settings and check the Exclude all hits from known bots and spiders option.

google-analytics-spam-exclude-known-bots

Step 5. Create a Spam-Free Segment

If you create new filters, your data will only be filtered from the moment you add them. So even if you exclude spam referrals, your historical data still contains the spam.

To look at your historic data without the spam, you can create a segment that excludes these known referrals.

You can do this by re-using the regex code you created for your spam referrals & hostnames.

Action time

To create a new segment, click + Add Segment above the chart in Google Analytics > + New Segment

Then Exclude the Sessions that have a Source that matches the regex code you’ve created above.

You can add an additional filter for the hostname spam.

google-analytics-spam-create-new-segment
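Before building the segment, it can help to sanity-check both patterns against a few rows, for example from an exported report. The Python sketch below mimics the segment logic under that assumption: a session is kept only if its hostname matches the valid hostname pattern and its source does not match the spam pattern. The sample rows and field names are invented, and re.search only approximates how Google Analytics matches.

```python
import re

# The two patterns from the examples in this article.
hostname_pattern = r"www\.storegrowers\.com|checkout\.shopify\.com"
spam_source_pattern = r"motherboard\.vice\.com|lifehacĸer|site-auditor|addons\.mozilla\.org"

# Invented sample rows; a real export will have different columns.
sessions = [
    {"source": "google", "hostname": "www.storegrowers.com"},
    {"source": "site-auditor.online", "hostname": "www.storegrowers.com"},
    {"source": "(direct)", "hostname": "www.foxnews.com"},
    {"source": "newsletter", "hostname": "checkout.shopify.com"},
]

def is_spam_free(session):
    """True for a valid hostname combined with a source that is not on the spam list."""
    valid_host = re.search(hostname_pattern, session["hostname"]) is not None
    spam_source = re.search(spam_source_pattern, session["source"]) is not None
    return valid_host and not spam_source

clean_sessions = [s for s in sessions if is_spam_free(s)]
for s in clean_sessions:
    print(s["source"], s["hostname"])
```

If a row you know to be real traffic gets dropped here, fix the pattern before you use it in a segment or a filter.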

Note on referral exclusion lists

To close off, I want to cover another technique that’s often suggested to tackle Google Analytics spam: the referral exclusion list.

The Referral Exclusion List is a feature in Google Analytics that lets you add domains that you want to exclude from your referral reports.

While that seems ideal at first, there is a catch. If you add domains to that list, those visits don’t get excluded at all: they simply get counted as direct traffic to your website. That only makes the spam invisible, so it’s definitely not a good option.

That’s it for this post. By now you should be able to get that annoying spam out of your Analytics. If you have any questions or have an alternative approach, let me know in the comments!

About the author

Dennis

Dennis is the main guy behind Store Growers. He's never had a job that he didn't invent himself and loves that freedom.
In writing articles, creating courses or working with ecommerce clients he has one goal: to create more freedom for online store owners.

6 responses to “Google Analytics Spam Removal Guide”

  1. You just informed everyone that they should take lifehacker.com or vice.com and remove it as a referral source. At least that’s how it seems to someone who doesn’t fully know the situation at hand with those specific referral spam issues. Those two stemmed from the language spam ‘hacker’ which may likely be over however this article doesn’t mention anything about language spam removal.

    I’d suggest editing that section to be about continual referral spam sites – not one time outbreaks since those will likely never show their face again. In addition here’s the regex for removing language spam:

    Create New Filter -> Exclude -> Language Settings
    Regex: .{15,}|\s[^\s]*\s|\.|,|\!|\/

    This code will block any language with an excessive character count or with characters that don’t match any normal language definition.

    Although personally, I’m of the opinion that if you don’t do business outside of the US – make a US only filter which by default will eliminate Russian traffic (one of the main culprits of the analytics spam issue).
