The other day, I was speaking to a friend of mine who lamented that he didn’t really believe in A/B testing. He had noticed over the years that his company had plenty of successful test results, but he had never been able to see the tests’ effects on the overall metrics they care about. He was expecting to see inflection points around the implementation of successful experiments. Quite often, companies run A/B tests that show positive results, and yet when overall metrics are examined, the impact of those tests is invisible. In this post, I’ll discuss the reasons this happens and ways to avoid or explain it.
Confusing Statistics
Statistics can be remarkably counterintuitive, and it’s quite easy to reach the wrong conclusions about what the results mean. Each A/B testing tool uses a statistical engine for its results. Some use frequentist methods, some use Bayesian, and others use more esoteric models. Often, they display the statistical confidence (or p-value) and the percent improvement right next to each other, making it very easy to believe that they are related. It is tempting to read the results as “We are 95% confident that there is a 10% improvement,” when what they actually say is closer to “We saw a 10% improvement, and there is a 95% chance that this test is better than the baseline.” Under the second reading, the true improvement might be much smaller than 10%, and there is even a 5% chance that it wasn’t an improvement at all.
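To make that concrete, here is a minimal sketch with made-up numbers (20,000 visitors per arm, a 5% baseline conversion rate, and a variant at 5.5%), using a plain two-proportion z-test rather than any particular tool’s engine. The observed relative lift is 10% and the result clears the 95% bar, yet the confidence interval for the lift stretches from barely above zero to nearly 19%:

```python
import numpy as np
from scipy import stats

# Hypothetical results: 20,000 visitors per arm, 5.0% vs 5.5% conversion
n_a, n_b = 20_000, 20_000
conv_a, conv_b = 1_000, 1_100

p_a, p_b = conv_a / n_a, conv_b / n_b
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = (p_b - p_a) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

# 95% confidence interval for the absolute difference, expressed as a relative lift
lo = ((p_b - p_a) - 1.96 * se) / p_a
hi = ((p_b - p_a) + 1.96 * se) / p_a

print(f"observed relative lift: {(p_b - p_a) / p_a:.1%}")    # 10.0%
print(f"p-value: {p_value:.3f}")                              # ~0.025
print(f"95% CI for the relative lift: {lo:.1%} to {hi:.1%}")  # ~1.3% to ~18.7%
```

The headline number and the significance stamp are both “true,” but the plausible range of outcomes is far wider than the single 10% figure suggests.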
Many A/B testing tools benefit from you having successful tests: a string of wins makes you look like a hero, and you keep paying for the tool. They are not incentivized to do A/B testing rigorously or to be honest with the statistics. The end result is that it is very easy to misinterpret your results, and your tests may not be as successful as you think.
Solution: Use a tool with a statistical range
Use a tool that gives you a statistical range of likely outcomes rather than a single value. This discussion gets far more complicated, and I’ll save the frequentist-vs-Bayesian debate for another post. The short version: Bayesian approaches report a probability range, giving you a much more intuitive way to understand the results of your experiment. If a test’s improvements are exaggerated, you will erode trust in the long-term results of A/B testing.
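As an illustration of what such a range looks like, here is a generic Beta-Binomial sketch with hypothetical counts (my own toy example, not GrowthBook’s actual model): instead of a single “10% lift at 95% confidence” figure, you get a chance-to-beat-baseline and a credible interval for the lift.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical results: 1,000/20,000 conversions vs 1,100/20,000
samples_a = rng.beta(1 + 1_000, 1 + 19_000, size=100_000)  # posterior for A's rate
samples_b = rng.beta(1 + 1_100, 1 + 18_900, size=100_000)  # posterior for B's rate

lift = (samples_b - samples_a) / samples_a
lo, hi = np.percentile(lift, [2.5, 97.5])

print(f"chance B beats A: {(samples_b > samples_a).mean():.1%}")
print(f"95% credible interval for the relative lift: {lo:.1%} to {hi:.1%}")
```

Reporting the whole interval makes it much harder to walk away believing the lift is exactly 10%.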
The Peeking Problem
The “peeking problem” is that the more often you check your tests for significance, the more likely you are to see a falsely significant result. If you’re constantly watching a test until it “goes significant,” and then acting on it by stopping the test, you’re likely reacting to random chance, not necessarily an actual improvement.
You have not captured meaning; you have only captured random noise. Instead of asking whether the difference is significant at a predetermined point in the future, you are asking whether the difference is significant at least once during the test. These are completely different questions. If you called the experiment by waiting for it to reach significance, you may see no effect on your overall metrics at all: you “peeked” and acted on random noise.
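A quick simulation makes the difference concrete. This sketch (assumed numbers, a plain two-proportion z-test) runs A/A tests in which both arms have the same true conversion rate, so every “significant” result is a false positive. Checking once at a fixed sample size keeps the false positive rate near the advertised 5%; checking after every batch of visitors and stopping at the first significant result inflates it dramatically:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_rate = 0.05       # same true conversion rate in both arms (an A/A test)
n_per_arm = 20_000
batch = 1_000          # peek after every 1,000 visitors per arm
n_sims = 1_000

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    return 2 * (1 - stats.norm.cdf(abs((p_b - p_a) / se)))

single_look_fp = 0
peeking_fp = 0
for _ in range(n_sims):
    a = rng.random(n_per_arm) < true_rate
    b = rng.random(n_per_arm) < true_rate
    # Fixed horizon: look only once, at the predetermined sample size
    if p_value(a.sum(), n_per_arm, b.sum(), n_per_arm) < 0.05:
        single_look_fp += 1
    # Peeking: declare a winner the first time p < 0.05 at any interim check
    if any(p_value(a[:n].sum(), n, b[:n].sum(), n) < 0.05
           for n in range(batch, n_per_arm + 1, batch)):
        peeking_fp += 1

print(f"false positives with a single look: {single_look_fp / n_sims:.1%}")  # around 5%
print(f"false positives with peeking:       {peeking_fp / n_sims:.1%}")      # far higher
```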
Solution: Predefine your sample size
There are two solutions. One is to estimate the required sample size for the experiment up front, stop the test when you reach it, and draw conclusions from that data (though this approach has its own problems). The other is to use a statistical approach that is robust to peeking: Bayesian statistics, which suffers less from this problem because it always reports a probabilistic range; sequential analysis; or a multiple-testing correction applied each time the results are examined.
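If you go the fixed-sample-size route, the estimate is a standard power calculation. Here is a rough sketch using statsmodels, with hypothetical inputs: a 5% baseline conversion rate, a 10% relative lift you want to be able to detect, 95% significance, and 80% power.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05          # current conversion rate (hypothetical)
mde = 0.10               # minimum detectable effect, as a relative lift
effect = proportion_effectsize(baseline, baseline * (1 + mde))

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,          # significance level
    power=0.8,           # chance of detecting the lift if it is really there
    alternative="two-sided",
)
print(f"visitors needed per arm: {int(round(n_per_arm)):,}")
```

Run this before the test starts, commit to the number, and only read the results once you have reached it.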
Micro Optimizations
One of our growth teams spent a lot of time and effort coming up with experiment ideas for a section of our site, which they correctly surmised could be significantly improved. What they failed to take into account was that the traffic to that section was very low. Even if they were extremely successful, the overall impact would be low.
Hypothetically, even if they increased the conversion rate by 10%, given that this section represented only 10% of the overall converting traffic, they would’ve only increased revenue by 1% overall, which is likely not noticeable. It is very easy to focus on that 10% win and celebrate it, while not acknowledging that the raw numbers are low.
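The arithmetic behind that hypothetical example is worth spelling out, because it applies to any test idea you are sizing up:

```python
section_share = 0.10   # share of overall conversions coming from this section
section_lift = 0.10    # relative improvement measured in the test

overall_lift = section_share * section_lift
print(f"overall impact: {overall_lift:.1%}")  # 1.0% -- easily lost in the noise
```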
Solution: Better prioritization
Be honest about your traffic numbers, estimate the possible impact of each idea, and sort your backlog by that estimate. There are many ways to prioritize test ideas, each with its pros and cons, but it’s important to choose one.
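One simple (and deliberately simplified) way to do that sort, sketched below with entirely made-up page names and numbers, is to rank ideas by estimated absolute impact: traffic reach times baseline conversion rate times an optimistic relative lift.

```python
# Hypothetical backlog of test ideas
ideas = [
    {"name": "checkout form", "weekly_traffic": 50_000, "baseline_cr": 0.040, "expected_lift": 0.05},
    {"name": "pricing page",  "weekly_traffic": 20_000, "baseline_cr": 0.020, "expected_lift": 0.10},
    {"name": "blog footer",   "weekly_traffic":  2_000, "baseline_cr": 0.010, "expected_lift": 0.20},
]

for idea in ideas:
    # Estimated extra conversions per week if the test hits its optimistic lift
    idea["est_gain"] = idea["weekly_traffic"] * idea["baseline_cr"] * idea["expected_lift"]

for idea in sorted(ideas, key=lambda i: i["est_gain"], reverse=True):
    print(f"{idea['name']:<14} ~{idea['est_gain']:.0f} extra conversions/week")
```

The exciting-sounding idea with the biggest relative lift often lands at the bottom of this list once traffic is taken into account.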
You could also use a system like Growth Book, which has these features built in, including our Impact Score. Make sure your team is aware of the current state of your traffic and conversion numbers (if that is your goal), so they can focus on high-impact sections of your product.
Real Impact Gets Lost in the Noise
In the previous section, we talked about a 1% win being lost in the noise. But 1% improvements are great! If you can stack consistent small wins, they can have a real impact on your overall metrics, even though no single win will show up as a noticeable change in a top-level metric, particularly if you’re testing frequently.
Solution: Look for long-term trends
There are no great solutions for this. Make sure that people looking at the overall numbers are aware that it’s hard to see small signals amid the noise of larger metrics. It doesn’t mean your tests aren't having an effect. If possible, look for longer trends or ways to isolate the cumulative effects. Small changes can compound and have large effects.
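As a back-of-the-envelope illustration of the compounding, a dozen shipped 1% wins multiply out to roughly a 13% overall lift:

```python
wins = 12            # number of shipped improvements
lift_per_win = 0.01  # 1% relative lift each

cumulative = (1 + lift_per_win) ** wins - 1
print(f"cumulative lift from {wins} wins of {lift_per_win:.0%} each: {cumulative:.1%}")
```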
Optimizing Non-Leading Indicators
At my previous company, we noticed that users who engaged with multiple content types converted and retained at higher rates than those who engaged with only one type. We used this engagement as a leading indicator for lifetime value. We then spent some time trying to get all users to engage with multiple types of content. We achieved this goal, but the actual conversion and revenue numbers decreased as a result (we pushed users into a pattern they didn’t want). We fell into the classic “correlation is not causation” problem, and our chosen proxy metric was meaningless.
Many companies want to optimize for metrics that are not directly measurable, or are measurable but require time frames that are impractical to test. In this situation, companies use proxy metrics as leading indicators for their actual goals. If you improve a proxy metric and fail to see movements in your goal metrics, it is likely that your assumptions about your proxy metrics are wrong.
Solution: Measure against your actual goals
Measure and test directly against the goals you’re trying to achieve if you can. If your goal is revenue, you may need to use a platform like Growth Book, which supports many more metric types than just binomial. If you have to use a proxy metric, quickly test the assumptions behind it and make sure it’s causally linked to your goal, not just correlated with it.
Not Enough Tests
Typically, only 10% to 30% of A/B tests are successful, and the chance that any one of them is a big win is even lower. If you’re running 2 tests a month, or 24 tests a year, you may have about 5 of them succeed on average. If only 10% of those are very successful, you may not have any big wins in a year. This can be the downfall of growth teams and experimentation programs. If you’ve been running a program for a while and you’re not seeing results, it could be that you’re not running enough tests.
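Here is that arithmetic sketched out, using assumed rates (a 20% win rate and 10% of wins being big). Under those assumptions, 24 tests a year gives an expected five or so wins but less than a 40% chance of even one big win, while doubling the test velocity pushes that above 60%, which is the argument for the solution below.

```python
win_rate = 0.20        # assumed share of tests that succeed
big_win_share = 0.10   # assumed share of wins that are "big"
p_big = win_rate * big_win_share

for tests_per_year in (24, 48):
    expected_wins = tests_per_year * win_rate
    p_at_least_one_big = 1 - (1 - p_big) ** tests_per_year
    print(f"{tests_per_year} tests/year: ~{expected_wins:.0f} wins expected, "
          f"{p_at_least_one_big:.0%} chance of at least one big win")
```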
Solution: Test more
Test more. If your odds of success per test are low and you’re able to test more often, your chances of getting a successful result go up. Just make sure you have an honest way to prioritize your tests and that you’re focusing on high-impact areas.
Trust is a critical part of successful A/B testing programs, and a great way to earn trust is to show results. Be honest with your statistics, prioritize well, and test as much as you can. Hopefully, by being aware of the problems above and implementing the solutions, you can make your experimentation program more successful and see its impact on overall metrics.