What are your criteria to stop an A/B Test?

Jeanbaptiste Alarcon · 20th May 2016, 06:59 AM

Well met CRO Warriors!

Stopping A/B Tests too early is the most common and most damaging mistake made. And there is no cookie-cutter answer that works for all statistical methods of A/B Testing.

So I’ve got one question and an article for you:

What are YOUR criteria to stop an A/B Test?
A short version of my article on the concepts you need to understand to make an informed decision about when it’s safe to stop an A/B Test (link to the original, longer version at the bottom).

I mulled over this topic for quite a while. It wasn’t a good idea to tackle it from a statistical point of view, because, well, it’s complicated. Few people actually care whether their A/B testing tool is frequentist or Bayesian, and if you dig around a bit in the different tools, you’ll find they never use exactly the same statistical engine.

Plus I really didn’t like the idea of just giving numbers for you to blindly apply (or discard because they don’t feel right for you).

Instead I decided to try and explain the different concepts that would help you stop tests safely (and not get useless, most likely imaginary results).

We’ll cover the following:

Significance level
Sample size
Duration
Variability of data

Note: None of these elements are stopping rules on their own. But having a better grasp of them will allow you to make better decisions.

I. Significance level
When your A/B Testing tool tells you something along the lines of « your variation has a 95% chance of outperforming/beating your control », it’s giving you what most tools call the significance (or confidence) level.

Looked at the other way around, it roughly means: « if there were no real difference between your variation and your control, a result like this would still show up about 5% of the time (1 in 20), as a pure fluke ».

You want 95% at minimum. Think about what it actually means. If you stop at 80%, you accept a 20% risk (1 in 5) of declaring a winner when the difference is just noise: a false positive.

You’re testing to make data-driven decisions, not slightly-better-than-flipping-a-coin ones.

BUT—having 95% significance level is NOT a sufficient condition to stop an A/B Test.
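
To make the number less of a black box, here is a minimal sketch of one common way to get it: a two-sided two-proportion z-test, reporting 1 minus the p-value. This is an assumption for illustration only. As said above, tools never use exactly the same statistical engine, so your tool's figure may be computed differently. The function name and parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def significance_level(control_visitors, control_conversions,
                       variant_visitors, variant_conversions):
    """Return 1 - p for a two-sided two-proportion z-test (pooled variance).

    Illustrative only: real A/B testing tools may use other engines.
    """
    rate_control = control_conversions / control_visitors
    rate_variant = variant_conversions / variant_visitors
    # Pooled conversion rate under the "no real difference" assumption.
    pooled = (control_conversions + variant_conversions) / (
        control_visitors + variant_visitors)
    std_error = sqrt(pooled * (1 - pooled)
                     * (1 / control_visitors + 1 / variant_visitors))
    if std_error == 0:
        return 0.0  # no conversions (or only conversions): nothing to compare
    z = (rate_variant - rate_control) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return 1 - p_value


if __name__ == "__main__":
    # 10.0% vs 11.0% on 1,000 visitors each: well short of 95%.
    print(round(significance_level(1000, 100, 1000, 110), 3))
    # 10.0% vs 15.0% on 1,000 visitors each: above 95%.
    print(round(significance_level(1000, 100, 1000, 150), 3))
```

Note that the same one-point difference in conversion rate can be far from 95% or well past it depending on how many visitors you have, which is why the next sections matter.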

II. Sample size
Unless you’re testing a particular segment of visitors, make sure your sample is representative of your overall audience in composition and proportions. Be wary of unusual traffic sources that could be skewing your data.

Example: sending your newsletter during your test, which creates a spike of traffic from visitors who are more likely to react positively to any change you make, since they already trusted/appreciated you enough to subscribe.

Your sample must also be large enough that it’s not vulnerable to the natural variability of the data, i.e. if you don’t have enough measurements, outliers will have a strong impact on your overall results.
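
If you want a rough idea of what "large enough" looks like, the sketch below uses the textbook sample size formula for comparing two proportions. Everything specific here is an assumption added for illustration, not something from the article: the 80% power default, the two-sided 5% alpha, and the idea of fixing in advance the smallest relative lift you care about. The function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def visitors_per_variation(baseline_rate, relative_lift,
                           alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH variation.

    baseline_rate: current conversion rate, e.g. 0.05 for 5%.
    relative_lift: smallest lift worth detecting, e.g. 0.20 for +20%.
    alpha, power:  assumed defaults (95% significance, 80% power).
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Halving the lift you want to detect roughly quadruples the sample.
    print(visitors_per_variation(0.05, 0.20))
    print(visitors_per_variation(0.05, 0.10))
```

This only covers the "large enough" half. Whether the sample is representative is something you still have to check by looking at your traffic sources.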

III. Duration
You should test for full weeks at a time. We recommend you test for 2-3 weeks, or 1 (or 2) business cycles.

Why? You already know that, for example, social networks and emails have optimal days (even hours) for posting or sending.

Meaning that times and days influence people’s behaviour. The same goes for your conversion rates: if you looked at conversions by day in Google Analytics, you’d see that Mondays convert differently than Thursdays, for example.

Test for full weeks. 2 or 3 is good, or 1-2 business cycles, so your sample includes people who just discovered you, some who already know you, etc.
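
The "full weeks" rule is easy to turn into arithmetic. A small sketch, assuming you already have a target number of visitors and a rough daily traffic figure (both hypothetical inputs), and using a 2-week floor to match the recommendation above:

```python
from math import ceil


def full_weeks_to_run(required_visitors_total, visitors_per_day, min_weeks=2):
    """Round the test duration UP to whole weeks, never below min_weeks."""
    days_needed = ceil(required_visitors_total / visitors_per_day)
    weeks_needed = ceil(days_needed / 7)  # never stop mid-week
    return max(weeks_needed, min_weeks)


if __name__ == "__main__":
    # 17 days of traffic needed -> run 3 full weeks, not 17 days.
    print(full_weeks_to_run(16_316, 1_000))
    # Enough traffic in 3 days -> still run the 2-week minimum.
    print(full_weeks_to_run(3_000, 1_000))
```

Rounding up rather than down means every weekday is counted the same number of times in both variations.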

IV. Variability of data
If your significance level and/or the conversion rates of your variations are still fluctuating notably, keep your test running.

Two phenomena to consider here:

Regression to the mean: This is what we talked about earlier: the more data you record, the closer you get to the “true value”. This is why your tests fluctuate so much at first: you have few measurements, so outliers have a considerable impact.
The novelty effect: When people react to your change just because it’s new. It will fade with time.

This is also why the significance level isn’t enough on its own. During a test, you’ll most likely reach 95% several times before you can actually stop your test.

As we already mentioned, you’ll see these large fluctuations at the beginning of your tests because outliers have a big impact on the overall conversion rate while you don’t yet have enough data to approach the « true » value.
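
You can see this for yourself with a quick simulation. The sketch below runs A/A tests (both variations share the same true conversion rate, so any "winner" is a fluke), checks the significance level once a day, and compares how often 95% is crossed at some point during the test with how often it is crossed on the last day only. All the numbers (traffic, rate, number of runs, the seed) are assumptions picked for illustration, and the significance calculation is the same simple z-test assumption as before, not what any particular tool does.

```python
import random
from math import sqrt
from statistics import NormalDist


def significance(n_a, conv_a, n_b, conv_b):
    """1 - p for a two-sided two-proportion z-test (illustrative engine)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_error == 0:
        return 0.0
    z = (conv_b / n_b - conv_a / n_a) / std_error
    return 1 - 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_peeking(runs=300, days=21, daily_visitors=100,
                     rate=0.10, seed=1):
    """Return (share of A/A tests hitting 95% on ANY day,
               share hitting 95% on the LAST day)."""
    rng = random.Random(seed)
    hit_on_any_day = 0
    hit_on_last_day = 0
    for _run in range(runs):
        visitors = conv_a = conv_b = 0
        crossed = False
        level = 0.0
        for _day in range(days):
            visitors += daily_visitors
            # Both variations convert at the same true rate.
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            level = significance(visitors, conv_a, visitors, conv_b)
            if level >= 0.95:
                crossed = True  # a daily peek would have stopped here
        hit_on_any_day += crossed
        hit_on_last_day += level >= 0.95
    return hit_on_any_day / runs, hit_on_last_day / runs


if __name__ == "__main__":
    any_day, last_day = simulate_peeking()
    print(f"95% reached on some day: {any_day:.0%}")
    print(f"95% reached on last day: {last_day:.0%}")
```

Under these assumptions the first share should come out noticeably higher than the second, even though there is nothing to find. That gap is the cost of stopping the first time the tool shows 95%.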

To sum up, before stopping an A/B Test, consider the following:

Is your significance level equal to or above 95%?
Is your sample large enough and representative of your overall audience in composition and proportions?
Have you run your test for the appropriate length of time?
Have your significance and conversion rate curves flattened out?

Only after taking all of those into account can you stop a test. Don’t skip them, don’t lose money …
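
For those who like to keep the checklist in a script, here is one possible sketch. The thresholds are assumptions for illustration: 14 days as the minimum, and "flattened out" read as "the last 7 daily significance readings stay within 2 points of each other". Whether your sample is representative can't be checked by a formula, so that part stays manual. All names are hypothetical.

```python
def safe_to_stop(significance, visitors_per_variation, required_visitors,
                 days_run, recent_significance,
                 min_days=14, max_swing=0.02):
    """Return (verdict, details) for the stopping checklist.

    recent_significance: daily significance readings, most recent last.
    min_days, max_swing: assumed thresholds, tune them to your context.
    """
    last_week = recent_significance[-7:]
    checks = {
        "significance >= 95%": significance >= 0.95,
        "sample large enough": visitors_per_variation >= required_visitors,
        "enough full weeks": days_run >= min_days and days_run % 7 == 0,
        "curves flattened": (len(last_week) == 7
                             and max(last_week) - min(last_week) <= max_swing),
    }
    return all(checks.values()), checks


if __name__ == "__main__":
    verdict, details = safe_to_stop(
        significance=0.97,
        visitors_per_variation=9_000,
        required_visitors=8_158,
        days_run=21,
        recent_significance=[0.96, 0.97, 0.97, 0.965, 0.97, 0.968, 0.97],
    )
    print(verdict)
    for name, passed in details.items():
        print(f"  {name}: {passed}")
```

The point of returning the details alongside the verdict is that a single failed check tells you what to wait for, instead of just "not yet".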

>> As promised, here’s the link to the original article: longer, more detailed and with silly GIFs. <<

Alright, back to you now!

PS: Tell me if the content was useful for you and, if not, why, plus what you would have needed.

Hearn · 23rd May 2016, 03:47 AM

Great approach. Will bookmark this for further research. Thanks.

Jeanbaptiste Alarcon · 23rd May 2016, 04:29 AM

Thank you for the kind words, Hearn

(If you're interested in this topic, I wrote a 10,000-word ebook on A/B Testing mistakes that I can PM you for further research^^)
