Split testing question - how is this possible?

31 replies
I've been seeing some unexpected results from split testing, so I performed a simple test just to make sure the system is working right. I split up my traffic randomly into 3 buckets - all the users saw the same page, but I set them up as 3 separate pages in the system.

So once I got enough data in here, the results for all 3 pages should be identical - because the traffic is the same and the pages are identical. Now here are the results so far:

Page 1: 118 sales and a 5.31% conversion rate
Page 2: 110 sales and a 5.17% conversion rate
Page 3: 123 sales and a 5.60% conversion rate

Looking at this, Page 3 is the clear winner - but obviously it isn't, since the pages are all identical.

So my primary question is: if the data is this wild after 350 sales, just how long do I have to wait before I see statistically valid and meaningful numbers?

We are doing a good number of sales, but at this rate it looks like I'll have to wait for 1,000 sales, which seems insane - surely I must be doing something wrong. I've got a lot of things I want to test, and at that rate it will take forever.

Anyone?

My only guess is that this is the type of thing you see when testing pages that really don't produce a statistically relevant difference in response (which makes sense in this case, since they are the same page). When you have 2 or 3 pages that are just too similar, the numbers will bounce around in a "meaningless" way for a while until you have tons of data, at which point they should all even out??

But if so, how do you test for small but meaningful increases? I mean, if you create a new page that converts 5% better than your control, that can add up to a lot of money over time. But it seems like you'd have to wait for thousands and thousands of sales before you knew the results of a 5% difference were statistically valid ... ?

(One thing I know with 110% certainty is that the tracking and split test #s are correct.)
#question #split
  • Profile picture of the author KristiDaniels
    So your winning version has 123 sales on roughly 2,195 views (approximate, since you didn't give the exact number).

    And your losing version has 110 sales on roughly 2,126 views (again approximate, since you gave only a percentage rather than the actual number of views).

    So yes. That works out to a difference of about 0.63 standard errors (a z of 0.63), which is completely statistically insignificant. You need a sigma of at least 1.0 to be even 85% confident in your results.

    Are you really running a split test with no way of calculating your results? What is the purpose of that? How will you know when you are done?

    It looks like whatever you are testing isn't very significant. Why not call it a tie and test something that matters? I wouldn't bother with testing something that didn't show a statistically significant result after over 300 actions.
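    If you want to see the arithmetic, here's a rough Python sketch of the same calculation (the view counts are my estimates from your percentages, so treat it purely as an illustration):

    from math import sqrt

    sales_a, views_a = 123, 2195   # "winning" page (views estimated)
    sales_b, views_b = 110, 2126   # "losing" page (views estimated)

    p_a = sales_a / views_a
    p_b = sales_b / views_b

    # Standard error of the difference between two proportions
    se = sqrt(p_a * (1 - p_a) / views_a + p_b * (1 - p_b) / views_b)
    z = (p_a - p_b) / se
    print(f"difference = {p_a - p_b:.2%}, z = {z:.2f}")   # z comes out around 0.63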
  • Profile picture of the author KristiDaniels
    I see this kind of question a lot. What kinds of split testing tools does everyone use? Don't they calculate the sigma and the CI? If not, then how do you know when your split test has statistically significant results? Do you just plug it into a spreadsheet like I just did?
    • Profile picture of the author dsiomtw
      We like to build everything in house, so this is just phase 1 - a basic split testing system. The next functionality we obviously need to build is something that will tell us when the results are statistically valid.

      Right now we are doing it in a very rudimentary way - we split our traffic into 3 buckets, with new users randomly going into 1 of the 3 buckets. People in the first 2 buckets see the same control page, and the users in the 3rd bucket see the test page. The logic is that the test is "done" when the results from buckets 1 and 2 are close to identical (since those users are seeing the same page).

      The problem we are having over and over is that it's just taking too long and requiring too many sales to see any statistically valid differences. I understand this could be indicating that we are testing the wrong things that aren't making much of a difference, and we need to focus on testing larger changes, but at the same time I'm still wondering about the proper way to test small changes. I know in my heart that there are small changes that can result in, for example, a 5-10% increase in conversions, but such small improvements seem to require a ton of data before you know they are statistically valid.

      Obviously it would be better if you had a new test page that converted 25% better, since it wouldn't take nearly as much data to determine that the result was statistically valid. On the flip side, our experience so far is that it's easier and more likely to come up with a handful of small 5-10% improvements than one big change that results in a 25%+ increase in conversions.

      I've searched the net and found various formulas, but we haven't gotten them to work properly yet. If you could point me in the right direction in that regard it would be much appreciated. I got some good info a while back from the guy who developed splittester.com (I think his name is Brian) but just couldn't get it working right. He sells a pretty cool Excel spreadsheet rather inexpensively, but we are trying to figure out the formulas so we can program them into our web-based stats and analysis systems...
      • Profile picture of the author JohnMcCabe
        Originally Posted by dsiomtw View Post

        Right now we are doing it in a very rudimentary way - we split our traffic into 3 buckets, with new users randomly going into 1 of the 3 buckets. People in the first 2 buckets see the same control page, and the users in the 3rd bucket see the test page. The logic is that the test is "done" when the results from buckets 1 and 2 are close to identical (since those users are seeing the same page).

        The problem we are having over and over is that it's just taking too long and requiring too many sales to see any statistically valid differences. I understand this could be indicating that we are testing the wrong things that aren't making much of a difference, and we need to focus on testing larger changes, but at the same time I'm still wondering about the proper way to test small changes. I know in my heart that there are small changes that can result in, for example, a 5-10% increase in conversions, but such small improvements seem to require a ton of data before you know they are statistically valid.
        Maybe it's late at night here, but I don't see your logic. By waiting for the results of buckets 1 and 2 to be nearly identical, you're building in a long run. According to the Law of Large Numbers, given a sufficiently large number of trials you will eventually approach the expected mean. In your case, given a sufficient number of sales, the results of buckets 1 and 2 will approach each other - which is exactly what you seem to be seeing.

        Try running a simple simulation program that uses random numbers to simulate a coin flip. Heads, you made a sale. Tails, you didn't. In theory, you should end up with 50% heads and 50% tails. But in practical runs, you will see wide variations - extended runs of both consecutive heads and consecutive tails.
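        Here's a quick Python sketch of that kind of simulation, just as an illustration (1,000 "visitors" at a true 50% rate):

        import random

        flips = [random.random() < 0.5 for _ in range(1000)]   # True = "sale"
        heads = sum(flips)
        print(f"{heads} 'sales' out of {len(flips)} flips ({heads / len(flips):.1%})")

        # Longest streak of identical outcomes - streaks run longer than most people expect
        longest = current = 1
        for prev, cur in zip(flips, flips[1:]):
            current = current + 1 if cur == prev else 1
            longest = max(longest, current)
        print(f"longest streak: {longest}")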

        Your test in your original post is running pretty much as I would predict it would. Minor variations, but no significance. Expecting things to be exactly equal just isn't reasonable.

        What you want is a probability - a confidence level. Saying a result is significant at 85% confidence is really saying that you would expect the same result from the same test about 85% of the time. Getting a different result 15% of the time would be entirely normal.
  • Profile picture of the author KristiDaniels
    Standard deviation is a formula built into any decent spreadsheet. I know Microsoft Excel and OpenOffice.org Calc both have it built in.

    It is also built into every programming language created after Fortran 4.
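    For example, a couple of lines of Python (the 0/1 outcomes are made up):

    from statistics import pstdev
    from math import sqrt

    outcomes = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # 1 = sale (made-up data)
    p = sum(outcomes) / len(outcomes)
    print(pstdev(outcomes))     # population standard deviation of the 0/1 outcomes
    print(sqrt(p * (1 - p)))    # the same thing for 0/1 data: sqrt(p*(1-p))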
  • Profile picture of the author dsiomtw
    Thanks for the info, we are making some progress.

    When would you consider a test "done"? It seems like you would want the CI to be at least 90 or 95%, eh?
  • Profile picture of the author Kevin_Hutto
    I agree with Kristi - those numbers are insignificant. I use a spreadsheet too, but I know there is a free site that lets you input your data and tells you whether you have a winner and whether the data is statistically relevant. The site is free, but I can't remember the name of it. Another good place to go to learn about testing is marketingexperiments.com - they have lots of info about A/B testing, etc.
  • Profile picture of the author dsiomtw
    The site is splittester.com. The same guy sells a newer and more advanced excel spreadsheet for $50 that I have. We are trying to reverse engineer everything and figure out the formulas to determine sigma, CI, etc. so we can build it all into our web-based stats systems so we don't have to manually use spreadsheets all the time...
    • Profile picture of the author milkyway
      Originally Posted by dsiomtw View Post

      We are trying to reverse engineer everything and figure out the formulas to determine sigma, CI, etc. so we can build it all into our web-based stats systems so we don't have to manually use spreadsheets all the time...
      Maybe I'm misunderstanding you, but are you really reverse-engineering the statistical formulas? Why not just go to the local library (or a book shop) and pick up a beginner's statistics book?

      milkyway
  • Profile picture of the author dsiomtw
    Thanks for the insights. So at what CI would one consider a test "done" ? Obviously it depends on just how certain you want to be, but is it realistic to wait until it's 95% or is that overkill?
    • Profile picture of the author theemperor
      A lot has been said here about statistics - but I think a lot also needs to be said about gut instinct, because relying solely on numbers and formulas can be foolish. And I say this having studied Maths at university (while doing my best to avoid the statistics lectures)!

      Those results might be significant in a different context, but in this one I have to agree that it's much of a muchness and can be explained as pure luck / chance.
    • Profile picture of the author JohnMcCabe
      Originally Posted by dsiomtw View Post

      Thanks for the insights. So at what CI would one consider a test "done" ? Obviously it depends on just how certain you want to be, but is it realistic to wait until it's 95% or is that overkill?
      It depends on what the stakes are.

      In theory, you could be 95% confident that one of the options in a test would raise your gross profit per sale from $20 to $20.01, but the net effect on you is pretty tiny. If I'm seeing a trend toward a significant, but unimportant, test result, I'll likely settle for 85% confidence.

      The bigger the stakes, or the bigger the consequences of getting a test wrong, the more confident you want to be in the test result.
  • Profile picture of the author Anonymous Affiliate
    Originally Posted by dsiomtw View Post

    Page 1: 118 sales and a 5.31% conversion rate
    Page 2: 110 sales and a 5.17% conversion rate
    Page 3: 123 sales and a 5.60% conversion rate

    Looking at this, Page 3 is the clear winner - but obviously it isn't, since the pages are all identical.
    I am big on split testing and I've done A LOT of it on my own products, and I agree with you that sometimes I get different CTR ratios using exactly the same freaking sales page.

    In your example, you've tested around 6,500 visitors delivering around 350 sales. That should be enough to draw pretty valid stats, but as you said they all saw the same page so... It makes you wonder about the validity of split testing, doesn't it?
    • Profile picture of the author JohnMcCabe
      Originally Posted by Anonymous Affiliate View Post

      I am big on split testing and I've done A LOT of it on my own products, and I agree with you that sometimes I get different CTR ratios using exactly the same freaking sales page.

      In your example, you've tested around 6,500 visitors delivering around 350 sales. That should be enough to draw pretty valid stats, but as you said they all saw the same page so... It makes you wonder about the validity of split testing, doesn't it?
      Split testing is valid. Some data interpretations are not.

      In this case, the result of the test is a mean conversion rate of 5.36% +/- ~0.25%. There really was no "test", as all three hypotheses were the same.

      If you ran the test for an unimaginably long time, all else staying equal, the three values would approach the mean. After 5 million samples (visitors), you might see something like:

      Bucket #1 = 5.35% conversion
      Bucket #2 = 5.36% conversion
      Bucket #3 = 5.355% conversion

      Flip a coin ten times and record the results. Repeat 10,000 times.

      After 10,000 datasets, you'll find a distribution around the value 50% heads and 50% tails. But you'll also find sets with all heads and all tails, and everything in between.
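      If you'd rather see it than take my word for it, here's a quick Python sketch of that exact experiment:

      import random
      from collections import Counter

      counts = Counter()
      for _ in range(10000):
          heads = sum(random.random() < 0.5 for _ in range(10))
          counts[heads] += 1

      for heads in range(11):
          print(f"{heads:2d} heads: {counts[heads]:5d} sets")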

      If you run an A/A test, getting perfectly identical results is actually one of the least probable outcomes...
      • Profile picture of the author dsiomtw
        Thanks for all the discussion.

        We are just learning this stuff as we go. We were seeing some unexpected results from our tests so I just wanted to do the A/A test to try to verify that our system was working correctly.

        I figured after thousands of clicks and 300+ sales the numbers would be a lot closer. Apparently that is not the case, which shows how much I know about statistics.

        We've tracked down some formulas and code to compute CI and have built them into our split testing system. The results are pretty similar to splittester.com (and the $50 spreadsheet he sells), so this is a step in the right direction.
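        In case it helps anyone else, here's a simplified Python sketch of the kind of thing we built (I'm not claiming these are the exact formulas splittester.com uses, and the visitor counts are the rough estimates from earlier in the thread):

        from math import sqrt, erf

        def split_confidence(conv_a, vis_a, conv_b, vis_b):
            """Rough one-sided confidence that the better page really is better."""
            p_a, p_b = conv_a / vis_a, conv_b / vis_b
            se = sqrt(p_a * (1 - p_a) / vis_a + p_b * (1 - p_b) / vis_b)
            z = abs(p_a - p_b) / se
            return 0.5 * (1 + erf(z / sqrt(2)))   # normal CDF at z

        # Page 3 vs Page 2 from my first post (visitor counts estimated above)
        print(f"{split_confidence(123, 2195, 110, 2126):.1%}")   # roughly 73% - nowhere near conclusive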

        I obviously don't understand much of this yet, but now I'm really curious and I'm going to be reading up on this stuff. Wish I took statistics in school. Here's another shocker on an A/A test:

        Page A: 18,345 visitors and 3203 opt-ins (17.46%)
        Page B: 18,467 visitors and 3012 opt-ins (16.31%)

        That is a pretty huge difference if you ask me after so many visitors and opt-ins, for 2 identical pages!

        This is the same page being tested at the same time with the same traffic being split randomly 50/50. I understand it will take TONS of data for them to ever approach identical numbers, but this is way more of a difference than I would have expected.

        According to these numbers Page A is doing 7% better than page B after 35,000+ visitors. You would think that would be a statistically valid result. This sure does change how I view testing... it ain't as cut and dry as I thought it was going to be.

        After a few big things, my plan was that, on a long-term basis, we would test a lot of small things here and there that might each increase conversions by 5-10%, for example. But it seems like you need a metric poop ton of data to know if a 5% difference is statistically valid...

        I'm still somewhat confused because I thought a big part of split testing WAS about testing small changes that add 5% here and 5% there, that when combined all add up to a huge difference vs. your original page.

        But if you need 500,000 visitors, 10s of 1000s of opt-ins and/or 1000s of sales before you know if a 5% difference is valid, this doesn't seem practical in any way shape or form.
        • Profile picture of the author Ironman77
          As far as I remember from statistics, you need around 1,000 data points to have 97% certainty.
        • Profile picture of the author JohnMcCabe
          My comments in blue...

          Originally Posted by dsiomtw View Post

          I obviously don't understand much of this yet, but now I'm really curious and I'm going to be reading up on this stuff. Wish I took statistics in school. Here's another shocker on an A/A test:

          Page A: 18,345 visitors and 3203 opt-ins (17.46%)
          Page B: 18,467 visitors and 3012 opt-ins (16.31%)

          That is a pretty huge difference if you ask me after so many visitors and opt-ins, for 2 identical pages!

          It's not a huge difference, just over 1%.

          This is the same page being tested at the same time with the same traffic being split randomly 50/50. I understand it will take TONS of data for them to ever approach identical numbers, but this is way more of a difference than I would have expected.

          According to these numbers Page A is doing 7% better than page B after 35,000+ visitors. You would think that would be a statistically valid result. This sure does change how I view testing... it ain't as cut and dry as I thought it was going to be.

          You have to remember that you are testing something against itself. Any difference you see can be chalked up to random chance. If you duplicated this with another 37,000 visitors, it would not be a surprise if the results were just the opposite. Your A/A test is yielding exactly what it should.

          I'm still somewhat confused because I thought a big part of split testing WAS about testing small changes that add 5% here and 5% there, that when combined all add up to a huge difference vs. your original page.

          But if you need 500,000 visitors, 10s of 1000s of opt-ins and/or 1000s of sales before you know if a 5% difference is valid, this doesn't seem practical in any way shape or form.
          You will start seeing those changes that make a difference once you actually change something. In some instances, you can get significance with high confidence in 50-100 results.

          You will also run many tests where your changes don't result in any appreciable difference. It doesn't mean the test was defective, just that in this case the two factors were more or less equal in effect.

          In your opt-in test, the one and only inference you can draw is that for that page, the opt-in rate will likely settle in at around 16.9%. That's it.

          Again, confidence levels are more or less probabilities. If one of your tests is significant with 95% confidence, you are saying that your probability of picking the wrong option is only 1 in 20. 19 times out of 20, picking the winner of the test is the right choice.
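          To put some rough numbers on "how many results do I need", here's a back-of-the-envelope Python sketch using the standard normal-approximation sample-size formula. The 17% baseline rate, 95% confidence and 80% power are just assumptions for illustration:

          from math import sqrt, ceil

          def visitors_per_page(base_rate, relative_lift, z_alpha=1.96, z_beta=0.84):
              """Rough visitors needed per page (95% confidence, 80% power by default)."""
              p1 = base_rate
              p2 = base_rate * (1 + relative_lift)
              p_bar = (p1 + p2) / 2
              n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                    + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
                   / (p2 - p1) ** 2)
              return ceil(n)

          for lift in (0.05, 0.10, 0.25):
              print(f"{lift:.0%} lift on a 17% base rate: "
                    f"~{visitors_per_page(0.17, lift):,} visitors per page")
          # Roughly ~31,000 per page for a 5% lift, ~8,000 for 10%, ~1,350 for 25%.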
  • Profile picture of the author dsiomtw
    Originally Posted by JohnMcCabe View Post

    Page A: 18,345 visitors and 3203 opt-ins (17.46%)
    Page B: 18,467 visitors and 3012 opt-ins (16.31%)

    That is a pretty huge difference if you ask me after so many visitors and opt-ins, for 2 identical pages!

    It's not a huge difference, just over 1%.
    1% difference?

    17.46 is 7% more than 16.31. That's a pretty big difference isn't it?? It may only be about a 1 percentage point increase, but it's still a 7% increase in opt-ins.

    (In this A/A scenario it's obviously meaningless, but if this was a real test it wouldn't be meaningless in my mind if one page got 7% more opt-ins than another page)
  • Profile picture of the author KristiDaniels
    Using a confidence interval to call a test "done" isn't something I have ever done. It doesn't make any sense when split testing ad copy (although I know most testing tools do exactly that).

    When you are testing things that make a ton of difference, you can often see that before a confidence interval can even be calculated. If you have 10 impressions on A with 8 optins and 10 impressions on B with zero optins, I'm not going to run B anymore even though there is simply no way to even calculate a confidence interval.

    On the other hand, if you really had results like you stated above in a real A/B test, then even if you ever found the winner... who cares? The winner is only the tiniest bit better than the loser. Why does it matter that you are 99.9% confident that it actually is better? You just burned up thousands of impressions getting confident with your results. You could have used those impressions testing something else that might actually matter.

    Sigma and CI are useful for switch points in testing, but I would never waste testing time waiting for a CI of any particular number. Think through your business goals rather than trying to make a statistics teacher happy.

    And especially think through whether you want to do this inhouse when it isn't a core competency (it isn't a competency at all if you don't even know how to calculate a standard deviation). Focus on where YOU can add value with your inhouse work and outsource the work where stats geniuses have spent their lives geeking out with the coolest and best multivariate solution. There are dozens of those out there (several free). The enterprise level solutions are incredible once you get to that point.
    • Profile picture of the author JohnMcCabe
      Originally Posted by dsiomtw View Post

      1% difference?

      17.46 is 7% more than 16.31. That's a pretty big difference isn't it?? It may only be about a 1 percentage point increase, but it's still a 7% increase in opt-ins.

      (In this A/A scenario it's obviously meaningless, but if this was a real test it wouldn't be meaningless in my mind if one page got 7% more opt-ins than another page)
      I think the bulb is starting to glow here...

      If you were testing two different options, that 7% difference might be pretty big - on a sample this size, enough for me to make the larger one my control to test something else against. And I might consider running both options against an Option C, just in case.

      If you really want to do this in-house, someone on your team needs to be grounded in the fundamentals. I would suggest that you or someone on your team sign up to audit an introductory statistics class at your local university, or at least get a good introductory textbook and study it.
    • Profile picture of the author JohnMcCabe
      Kristi, I agree with almost everything you say here. Where we part ways is your example of abandoning a test with only 8 results on a total of 20 trials. I'm a bit more conservative on that score. I'd likely let the test run until I had 50 or 100 opt-ins total and then check the stats.

      Flip a fair coin 10 times and you'll still see 8 or more heads about 5% of the time. Get 8/10 heads ten times in a row, though, and I'm going to want to examine that coin...

      Originally Posted by KristiDaniels View Post

      Using a confidence interval to call a test "done" isn't something I have ever done. It doesn't make any sense when split testing ad copy (although I know most testing tools do exactly that).

      When you are testing things that make a ton of difference, you can often see that before a confidence interval can even be calculated. If you have 10 impressions on A with 8 optins and 10 impressions on B with zero optins, I'm not going to run B anymore even though there is simply no way to even calculate a confidence interval.

      On the other hand, if you really had results like you stated above in a real A/B test, then even if you ever found the winner... who cares? The winner is only the tiniest bit better than the loser. Why does it matter that you are 99.9% confident that it actually is better? You just burned up thousands of impressions getting confident with your results. You could have used those impressions testing something else that might actually matter.

      Sigma and CI are useful for switch points in testing, but I would never waste testing time waiting for a CI of any particular number. Think through your business goals rather than trying to make a statistics teacher happy.

      And especially think through whether you want to do this inhouse when it isn't a core competency (it isn't a competency at all if you don't even know how to calculate a standard deviation). Focus on where YOU can add value with your inhouse work and outsource the work where stats geniuses have spent their lives geeking out with the coolest and best multivariate solution. There are dozens of those out there (several free). The enterprise level solutions are incredible once you get to that point.
  • Profile picture of the author paraschopra
    Originally Posted by dsiomtw View Post

    Page 1: 118 sales and a 5.31% conversion rate
    Page 2: 110 sales and a 5.17% conversion rate
    Page 3: 123 sales and a 5.60% conversion rate
    I'll do the quick maths for you. 118 sales and a 5.31% conversion rate means around 2,200 visitors. Now with 2,200 visitors, doing some stats (which is simply treating conversions as a binomial variable), you get a conversion rate of 5.31% with a standard error of 0.47%. Taking 1.96 times the standard error gives you the 95% confidence interval.

    So in your case the 95% confidence interval ranges from 4.37% to 6.24%, a range which overlaps with all the other (identical) variations you have. So no surprises here. The standard error is sqrt(p*(1-p)/n), where p is the conversion rate and n is the number of visitors, so the confidence interval narrows as you collect more visitors (in proportion to 1/sqrt(n)).

    If you want, I can share a spreadsheet for calculating confidence intervals, ranges and other metrics for a split test. (You don't really need to pay $50 for such sheets; these statistics functions in Excel are pretty basic.)
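    Here is roughly what that looks like in Python (the visitor counts are estimated from your sales and percentages):

    from math import sqrt

    pages = {
        "Page 1": (118, 2222),  # sales, estimated visitors
        "Page 2": (110, 2128),
        "Page 3": (123, 2196),
    }

    for name, (sales, visitors) in pages.items():
        p = sales / visitors
        se = sqrt(p * (1 - p) / visitors)          # standard error
        low, high = p - 1.96 * se, p + 1.96 * se   # 95% confidence interval
        print(f"{name}: {p:.2%}  (95% CI {low:.2%} to {high:.2%})")
    # All three intervals overlap heavily - no winner.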

    Incidentally, I wrote a post on this topic today on my startup's blog: "How reliable are your split test results?" (The forum won't let me post the link to the post here, so go to the website in my signature and then see our blog)

    Email me at paras{at}wingify.com if you need help on split test results interpretation.
    • Profile picture of the author JohnMcCabe
      paraschopra, I read the blog post you mentioned. Nice explanation.

      I like your observation that when one alternative starts to beat our favored alternative, the reaction is often to question the validity of the test or calculation rather than trust the result of the test.

      People are funny that way...

      Very nice first post.
  • Profile picture of the author dsiomtw
    There's something I'm still missing here. A follow-up on the A/A opt-ins test:

    Page A: 20,897 visitors and 3618 opt-ins (17.31%)
    Page B: 21,028 visitors and 3454 opt-ins (16.43%)

    Two identical pages - and one is converting 5% better than the other.

    Earlier John McCabe said:

    You have to remember that you are testing something against itself. Any difference you see can be chalked up to random chance. If you duplicated this with another 37,000 visitors, it would not be a surprise if the results were just the opposite. Your A/A test is yielding exactly what it should.
    I don't understand what you're saying here. Even though I'm comparing 2 identical pages, statistics is telling me that there's a 99% chance one of them converts 5% better than the other.

    Suppose they WERE different pages and I got the exact same results, with the same 99% CI. In this case I would believe that one page was truly 5% "better" than the other.

    Why does the fact that I'm doing an A/A test invalidate the statistical results?

    What is the point of CI if you can get a 99% CI after 40,000+ visitors and it tells you that 1 of 2 identical pages converts 5% better than the other? If this can happen, how can you rely on CI in any circumstance?

    (And I know that this isn't just a 1-out-of-100 "fluke" because it happened recently on another test, and the odds of that would be 1 in 10,000, i.e. pretty improbable.)

    What am I missing here?
    • Profile picture of the author JohnMcCabe
      What you are missing is that one is not converting 5% better than the other. The difference is only 0.88%.

      You are saying 3618 is bigger than 3454, so one is "better" than the other. I'm saying that if you let this "test" run out to infinity, that 0.88% difference in conversion rate will approach 0. Look at the conversion ratios, not the raw numbers.

      Since the two alternatives are exactly the same, you can chalk the <1% difference in conversion rates up to random chance. If everything truly is identical, there is no other explanation.

      You can run this A/A test forever and the differences will shift back and forth, but the result will be meaningless because you can't explain or control or even affect the variation.
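      If you want to convince yourself, here's a quick Python simulation sketch, with numbers roughly matching your opt-in test, of how often two truly identical pages end up 5% or more apart just by chance:

      import random

      TRUE_RATE = 0.17   # both "pages" really convert at the same rate
      VISITORS = 21000   # per page, roughly matching your opt-in test
      RUNS = 1000        # takes a few seconds to run

      big_gaps = 0
      for _ in range(RUNS):
          a = sum(random.random() < TRUE_RATE for _ in range(VISITORS))
          b = sum(random.random() < TRUE_RATE for _ in range(VISITORS))
          rate_a, rate_b = a / VISITORS, b / VISITORS
          if abs(rate_a - rate_b) / min(rate_a, rate_b) >= 0.05:
              big_gaps += 1

      # Typically only a couple percent of A/A runs show a 5%+ "winner" -
      # rare, but it does happen, which is all that happened here.
      print(f"{big_gaps / RUNS:.1%} of A/A runs showed a 5%+ 'winner'")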

      Try running a true A/B test with a significant variable, like a different headline or bonus offer, and see what you get...
  • Profile picture of the author DogScout
    Flip a coin a thousand times and seeing 7% more heads than tails is not out of the question. People at casinos bet on that happening to them every day. At a million flips the difference should sneak under 1%.

    When you stop testing identical pages and start seeing 50-800% differences, now you are learning something. But under a 10-20% variation, the certainty factor is way under 95% until you hit numbers that matter (over a mil).

    Personally, trying to squeeze out a 7% increase with different pages is not productive. In any split testing I have done, I have usually not moved until I see a 100% difference. Although it has not been unusual for the placement of a comma alone to change the opt-in rate by as much as 500%!
  • Profile picture of the author dsiomtw
    What you are missing is that one is not converting 5% better than the other. The difference is only 0.88%.
    John I have to disagree with your logic on this one. 17.31% is 5% more opt-ins than 16.43%. I don't understand why you keep saying 1%. If Page A has an opt in rate of 17.31% and Page B has an opt-in rate of 16.43%, that means Page A gets 5% more optins than Page B i.e. meaningful. You wouldn't say that Page A gets 1% more optins than Page B (hardly meaningful).

    (To put it another way, say you have 2 offer pages that convert at 17.31% and 16.43%. If you run with the one that converts at 17.31% you will make 5.36% more money, not 0.88% more.)

    Dogscout, what you are saying jibes very closely with my experience. My assumption, however, is that you're much more likely to find a handful of changes that each increase your conversions by 10% than to find one change that increases your conversions by 50-100% or more.

    Are you suggesting that split testing for small 10% incremental improvements is not possible/practical without millions of visitors? Most of what you said jibes with my real-world experience, but at the same time, this last part doesn't jibe with what I see other people doing.

    You said you only replace your control when you have a 100% increase, i.e. double the response? No offense, but that sounds a little absurd. If I gave you my squeeze page right now, I absolutely guarantee you that there is no way you or anyone else could possibly double the response (without changing the offer altogether).

    BUT I am 110% positive it could be increased by 10%, 20%, etc. Why wouldn't you try to make the "small" changes and increase it by 20%, which is very meaningful? Surely there must be a way to find out that a 20% increase is statistically valid without needing millions of visitors.

    BTW guys just so you know, I do plenty of real world testing. These A/A tests I'm talking about are just examples to illustrate a sticking point I have i.e. conventional statistics telling me that 1 of 2 identical pages is 99% likely to convert 5-10% better than another identical page.
    • Profile picture of the author JohnMcCabe
      Originally Posted by dsiomtw View Post

      John I have to disagree with your logic on this one. 17.31% is 5% more opt-ins than 16.43%. I don't understand why you keep saying 1%. If Page A has an opt in rate of 17.31% and Page B has an opt-in rate of 16.43%, that means Page A gets 5% more optins than Page B i.e. meaningful. You wouldn't say that Page A gets 1% more optins than Page B (hardly meaningful).

      (To put it another way, say you have 2 offer pages that convert at 17.31% and 16.43%. If you run with the one that converts at 17.31% you will make 5.36% more money, not 0.88% more.)
      Okay, I'll try this one more time.

      You want to measure the number of people who opt in compared to the total number of people who visit, right?

      It's a simple ratio: Optins / Visitors = Conversion Rate

      Conversion Rate (A) = 17.31%
      Conversion Rate (B) = 16.43%

      Difference in Conversion Rates (A-B) = 0.88 percentage points

      % change in Conversion Rate, (A-B)/B = ~5%

      If you were actually measuring two different things, this might or might not be significant. It's quite possible that the difference (or at least part of it) is due to random factors. Your A/A test demonstrates that.

      As I said in an earlier post, I'd be inclined to adopt option A as my control version. But I wouldn't totally dismiss B out of hand.

      One way to continue the test without giving up potential gains in performance is to send the majority of your traffic to the version you believe will perform better. Send, say, 90% to version A and 10% to version B and let the numbers pile up.
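      Mechanically that part is easy; a tiny Python sketch of the idea (the names are made up):

      import random

      def pick_version(weight_a=0.9):
          """Send roughly 90% of new visitors to version A, 10% to version B."""
          return "A" if random.random() < weight_a else "B"

      # Quick sanity check of the split
      counts = {"A": 0, "B": 0}
      for _ in range(10000):
          counts[pick_version()] += 1
      print(counts)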

      On a more practical level, the smart move would be to pick the one you think is superior (A in this case) and find something else to test. If you are right, you get the benefit. If you are wrong, and the results are a statistical anomaly, the penalty isn't huge unless you are talking traffic in the hundreds of thousands or millions of visitors.

      Did I do a better job explaining myself this time?
      • Profile picture of the author Groovy99
        When I am split testing, I am looking to change a variable that will hit a home run, not just get me to second base. So initially I consider anything within 5% or so to be the same result and no big change. I want to see a jump in response rates larger than that.

        If, after a short period of time (less than a week) of making changes and swinging for the fences, I am not hitting a home run, I then circle back and go after the 5% increases...
      • Profile picture of the author dsiomtw
        I totally understand what you're saying, but what I'm saying is that Page A and B are IDENTICAL.

        And after 20,000+ visitors to each, "Page A" is generating 5% more opt-ins than Page B, with a 99% CI. That seems like a pretty massive anomaly to me.

        I'm sure by the time I send a million visitors to each the number would be almost identical, but I guess I just figured that after 20,000+ visitors to each "bucket" the numbers would be a lot closer than a 5% difference.

        I know it sounds kind of silly to talk about A/A tests, but I'm trying to wrap my mind around how much deviation is to be considered "normal" ... it's A LOT more than I would have thought.

        I guess the big confusion on my end is that it seems like in order to know if a small increase of say 5-10% is statistically valid you need huge amounts of data.

        I understand that the smaller the difference the more data you need to know if it's a real difference vs. random, but from a business standpoint a 10% increase in anything is HUGE but it seems almost impossible or at least impractical to test for a 10% increase if you need millions of data points.

        Groovy99 - that's exactly what we're doing. We've already hit the homerun, we're now looking for the singles and doubles.
        • Profile picture of the author JohnMcCabe
          Again, my comments in blue...

          Originally Posted by dsiomtw View Post

          I totally understand what you're saying, but what I'm saying is that Page A and B are IDENTICAL.

          Which is why the only explanation that makes sense is that there are random factors at work.

          And after 20,000+ visitors to each, "Page A" is generating 5% more opt-ins than Page B, with a 99% CI. That seems like a pretty massive anomaly to me.

          When dealing with the Law of Large Numbers, 20K isn't that large. You could start over and run the same test, and find the results reversed.

          I'm sure by the time I send a million visitors to each the number would be almost identical, but I guess I just figured that after 20,000+ visitors to each "bucket" the numbers would be a lot closer than a 5% difference.

          It could have been. Or the difference could have been wider. If you look at simulations of sporting events, you'll see that they'll often run the computer simulations 100,000 times and average out the results.

          I know it sounds kind of silly to talk about A/A tests, but I'm trying to wrap my mind around how much deviation is to be considered "normal" ... it's A LOT more than I would have thought.

          I guess the big confusion on my end is that it seems like in order to know if a small increase of say 5-10% is statistically valid you need huge amounts of data.

          That's why you establish the confidence level you want before you start. If you have a 95% confidence level, it means that there is still a 1 in 20 chance you could pick the wrong option. Rather than running the test sample into the millions, figure out how many repetitions you need to reach your chosen confidence level, then repeat the test from scratch a few times.

          Instead of running 20,000+ visitors to each page, run 1,000 to each page 20 times and look at the differences. For an A/A test you should see something approaching a normal distribution, with each version "winning" roughly half of the runs.
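          Here's a rough Python sketch of that idea, assuming both identical pages really convert at 17%:

          import random

          TRUE_RATE = 0.17   # assumed true opt-in rate for both identical pages
          BATCH = 1000
          RUNS = 20

          a_wins = 0
          for _ in range(RUNS):
              a = sum(random.random() < TRUE_RATE for _ in range(BATCH))
              b = sum(random.random() < TRUE_RATE for _ in range(BATCH))
              print(f"A: {a / BATCH:.1%}   B: {b / BATCH:.1%}")
              a_wins += a > b
          print(f"A 'won' {a_wins} of {RUNS} runs")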

          When you are testing landing pages, you don't need the precision required of, say, a drug study. If you guess wrong, you leave a few bucks on the table. If they guess wrong, people die.


          I understand that the smaller the difference the more data you need to know if it's a real difference vs. random, but from a business standpoint a 10% increase in anything is HUGE but it seems almost impossible or at least impractical to test for a 10% increase if you need millions of data points.

          I think part of the reason we couldn't get on the same page is that we were looking at different things. I was looking at the conversion ratio; you were focusing on the number of conversions. I saw a difference of less than 1 percentage point in the ratios; you saw a 5% change in the raw number of opt-ins. The fact that it was an A/A test, running 2 identical pages against each other, meant that the only reason for variation was some combination of random factors.

          Instead of running one test looking for mathematical certainty that doesn't exist, run sets of tests to see if they repeat. If you can be reasonably sure that the result will repeat 19 out of 20 times, run with it.


          Groovy99 - that's exactly what we're doing. We've already hit the homerun, we're now looking for the singles and doubles.