Hitting the Google Analytics sampling threshold? Can you trust your numbers?

For a while I've wanted to write this post about Google Analytics sampling. You know, the dreaded message that appears at the top of your reports:

This message shows up when you work with a dataset that contains more than 500,000 visits or more than 1,000,000 items (keywords/URLs/etc.). Above that threshold Google takes a sample of all those visits to calculate the numbers for your reports. But what is acceptable? In this example Google uses 30.62% of all visits to guess what the other 69.38% did on my site...
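
To make that concrete, here is a rough sketch of the extrapolation (hypothetical numbers; Google's exact algorithm isn't public, so this only illustrates the principle):

    # Rough illustration of how a sampled report scales a metric back up.
    # Hypothetical numbers; Google's exact algorithm is not public.
    sampling_rate = 0.3062         # GA read 30.62% of all visits
    visits_in_sample = 4593        # visits to some page, counted in the sample

    reported_visits = visits_in_sample / sampling_rate
    print(round(reported_visits))  # -> 15000, the number the report shows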

Example 1: very large site, 1% sample size

Imagine you want to analyse a page for a cool new product you launched a while ago. You open up the "All Pages" report and search for that page:

Well, the numbers look a bit odd, don't they? No bounces, no entrances, no exits, and all visitors looked at the page exactly once per visit (pageviews match unique pageviews exactly). Let's raise the sample size to "Higher precision":

Now I see these numbers:

Suddenly we have entrances, and a bounce rate and an exit rate, for exactly the same page. I'm glad to see 321 people found their way to this page through an external source. But what would happen if I could raise the sample size to 1,000,000? I'll never know.

Example 2: medium site, 30% sample size

Let's repeat the exercise from the example above. I searched for a specific URL and saw this:

Now we raise the sample size to "Higher precision" again:

As you can see, the numbers are now much closer together.

What sample size is acceptable?

The big question here is: where do you stop trusting the numbers? Google says that "needle in a haystack" analysis is difficult when you hit the sampling threshold, and I agree. A 30% or 60% sample size seems pretty reliable; 1% does not. Somewhere in between lies the absolute minimum. Perhaps a good statistician can shed some light on this?
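
To put a rough number on it myself: a minimal sketch that computes the 95% margin of error for a rate metric such as bounce rate, assuming GA's sample behaves like a simple random sample (which is a simplification):

    import math

    def margin_of_error(p, n, z=1.96):
        """95% margin of error (normal approximation) for a proportion p
        measured on a simple random sample of n visits."""
        return z * math.sqrt(p * (1 - p) / n)

    # Hypothetical: a 40% bounce rate measured on a 1% and a 30% sample
    # of 500,000 visits.
    for n in (5000, 150000):
        print("n=%d: 40%% +/- %.2f%%" % (n, 100 * margin_of_error(0.40, n)))
    # n=5000:   40% +/- 1.36%
    # n=150000: 40% +/- 0.25%

Note that at site level even a 1% sample gives a small margin of error; the real damage, as example 1 shows, is per page, where the effective sample can shrink to a handful of visits.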

Possible solutions

To overcome these problems there are a few solutions:

  • Download your data through the Google Analytics API and store it in your own data warehouse (if you have one). That gives you more reliable day-to-day analyses (see the sketch below this list).
  • Upgrade Google Analytics to its paid version: Premium.
  • Create multiple trackers per subfolder with the setCookiePath command. That way you have an account per subfolder that won't hit the sampling threshold as quickly as the main profile does.
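
For the first option, the trick is to keep each API query underneath the sampling threshold by requesting one day at a time and stitching the results together yourself. A minimal sketch, assuming you already have an authorized service object from the google-api-python-client (the profile ID, metrics and dimensions are placeholders):

    from datetime import date, timedelta

    def fetch_unsampled(service, profile_id, start, end):
        """Query the Core Reporting API one day at a time, so each
        response stays below the sampling threshold."""
        day = start
        while day <= end:
            d = day.isoformat()
            result = service.data().ga().get(
                ids='ga:' + profile_id,        # placeholder profile ID
                start_date=d,
                end_date=d,
                metrics='ga:visits,ga:pageviews',
                dimensions='ga:pagePath').execute()
            # The response flags sampling, so you can verify each slice
            # really is unsampled before you store it.
            if result.get('containsSampledData'):
                raise ValueError('%s is still sampled, split it further' % d)
            yield d, result.get('rows', [])
            day += timedelta(days=1)

    # usage: rows per day for March 2012
    # for d, rows in fetch_unsampled(service, '12345678',
    #                                date(2012, 3, 1), date(2012, 3, 31)):
    #     store(d, rows)                     # store() is yours to write

Keep in mind that metrics like unique visitors can't simply be summed across days (see René's comment below).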

My questions

Does any statistician want to comment? What is the lowest acceptable sample size?
Have any web analysts found or used other solutions?
How do you deal with sampling?

Comments
  • René Stuifzand

    This is certainly a problem with big websites. One paid solution is Analytics Canvas, which makes use of the Google Analytics API. When it requests the data, it keeps the date range of each query small enough that no sampling occurs, and adds that data to its database. This won't work with unique visitors, though.

    Read more about it here: http://www.analyticscanvas.com/google-analytics-sampling-data-and-confidence-interval/

  • Martijn Beijk

    Another solution, of course, would be to use a tool that does not sample unless you tell it to, and that lets you define your own sampling rate. :-)

    • http://andrescholten.net/ André Scholten

      Ha, that's also a good one. That's why I want to know what amount of sampling is acceptable. Once you know that, you can say "above X visitors you shouldn't use GA anymore".

  • http://blog.jongbelegen.net/ zjuul

    Good sample size statistics should include a margin of error and a confidence interval.
    For an ultra-short read-up and a calculator, check http://www.raosoft.com/samplesize.html

    • http://andrescholten.net/ André Scholten

      Nice calculator, but doing the math every time I analyze something can be very time consuming. I think Google should tell you when the confidence level is too low.
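
      For anyone who wants to script it instead of opening the calculator every time: the formula behind it is short. A sketch using the standard sample size formula with finite population correction (an approximation, and it assumes a simple random sample):

        import math

        def required_sample(population, margin=0.05, z=1.96, p=0.5):
            """Visits needed for a given margin of error at 95% confidence,
            with finite population correction (the same math as the
            calculator linked above, give or take rounding)."""
            n0 = z ** 2 * p * (1 - p) / margin ** 2
            return math.ceil(n0 / (1 + (n0 - 1) / population))

        # e.g. a property with 500,000 visits, 1% margin of error:
        print(required_sample(500000, margin=0.01))  # -> 9424, about 2%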

  • http://twitter.com/gerronmulder Gerron Mulder

    It depends on the size of the segment you would like to analyze. Imagine if the above site had just one page: a sample of 500,000 would be highly accurate (every sample selection would include that page). There is no single answer, but what you would like to know is the confidence level. In the old interface Google used to show the confidence level / margin of error in the data table ( http://cutroni.com/blog/wp-content/uploads/cm-capture-7.jpg ). I guess that in the new interface you would have to calculate it per row yourself.

    A simple solution for recurring analyses might be to create a new profile and filter out the traffic that does not convert, so you have a low-volume profile with a lot of information.

    • http://andrescholten.net/ André Scholten

      Good option, but what would that filter look like? Exclude visitors that have 0 conversions ;)

      • http://twitter.com/gerronmulder Gerron Mulder

        An include filter for e-commerce data, an include filter for funnel URIs, an AdWords-only profile, an event-based funnel, a shopping-cart-only profile, etc. It depends on where and how you make money. Anything is possible thanks to the persistent traffic sources within the cookie. You can do 'exclude 0 conversions' if you include the funnel endpoint URL.

        • http://andrescholten.net/ André Scholten

          I see what you mean, thanks for the addition.

  • http://www.adrianspeyer.com/ Adrian

    I would suggest you check out Piwik.org. It's free and you can import your Google data.

    • http://andrescholten.net/ André Scholten

      The problem with huge sites that far exceed the 500,000-visit limit is that you need quite a server to keep Piwik fast. It's not a solution you can implement easily, but it could be an option if you want to invest time and money.

      • http://www.adrianspeyer.com/ Adrian

        Hi André. I think there is a misperception about Piwik's performance and ease of use/setup. Piwik can easily handle 500k visits a month, and it can be implemented just as easily as on a small/medium-traffic site. Sure, it might take a more robust hosting option (like a dedicated server), but if getting all the data correctly is important, as well as having access to it, spending $100 a month is better than $100k for Google Premium. There are also some details on the requirements at Piwik, nothing that hard to do: http://piwik.org/docs/optimize/ I should note that I participate in the Piwik project, have built some plugins, and currently manage over a dozen installs. It's really worth checking out, if nothing more than as a Google "backup".

        • http://andrescholten.net/ André Scholten

          I had it installed once, I think I will have to look at it again. Thanks for the clarification.

  • delinkbuilder

    I think I'm having this problem as well with my website http://www.delinkbuilder.nl. I get a lot of visitors, but I don't trust the numbers because they're really too high and most visitors don't stay on for long. Is there any help I can get from you guys, maybe a better program than Analytics to solve this?

    • http://andrescholten.net/ André Scholten

      It could be an iframe that is causing some tracking trouble. Is there a specific page that has weird numbers?