The other day I was speaking to a friend of mine who lamented that he didn’t really believe in A/B testing. Over the years he had noticed that his company had plenty of successful test results, yet couldn’t see the effects of those tests on the overall metrics they cared about. He was expecting to see inflection points around the implementation of successful experiments. Quite often companies run A/B tests that show positive results, and yet when overall metrics are examined, the impact of those tests is invisible. In this post I’ll talk about why this happens, and ways to avoid or explain it.
Confusing statistics
Statistics can be remarkably counterintuitive, and it is easy to come to the wrong conclusions about what results mean. Each A/B testing tool uses some statistical engine for its results. Some use Frequentist methods, some use Bayesian, others use more esoteric models. Often they show the confidence level (or p-value) and the percent improvement right next to each other, making it very easy to believe the two are related. It is tempting to read this as “We are 95% confident that there is a 10% improvement,” when the correct interpretation is closer to “We saw a 10% improvement, and there is a 95% chance that this test is better than the baseline.” With the second reading, the actual improvement might be much smaller, and there is even a 5% chance that it wasn’t an improvement at all.
Many A/B testing tools benefit when your tests look successful: a string of wins makes you look like a hero and keeps you paying for the tool. They are not incentivized to do A/B testing rigorously or to be honest with the statistics. The end result is that it is very easy to misinterpret your results, and your tests may not be as successful as you think.
Solution:
Use a tool that gives you a statistical range of likely outcomes rather than a single value. This discussion gets far more complicated, and I’ll save the Frequentist vs. Bayesian debate for another post, but the short version is that Bayesian approaches that report a probability range give you a much more intuitive way to understand the results of your experiment. If a test’s improvement is exaggerated, you will erode trust in the long-term results of A/B testing.
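As a concrete illustration, here is a minimal sketch of that kind of read-out, assuming a simple Beta-Binomial model and made-up conversion counts (my own example, not any particular tool’s engine):

```python
# A sketch of a Bayesian read-out: instead of one "10% lift" number, sample
# from Beta posteriors and report a credible interval for the relative lift.
# The visitor and conversion counts below are hypothetical.
import numpy as np

rng = np.random.default_rng(42)

control_visitors, control_conversions = 10_000, 500   # 5.0% conversion
variant_visitors, variant_conversions = 10_000, 550   # 5.5% conversion

# Uniform Beta(1, 1) prior updated with observed successes and failures
control = rng.beta(1 + control_conversions,
                   1 + control_visitors - control_conversions, size=100_000)
variant = rng.beta(1 + variant_conversions,
                   1 + variant_visitors - variant_conversions, size=100_000)

# Relative lift of the variant over the control, sampled from the posteriors
lift = (variant - control) / control

print(f"Chance the variant beats the control: {np.mean(lift > 0):.1%}")
print(f"Median lift: {np.median(lift):.1%}")
lo, hi = np.percentile(lift, [2.5, 97.5])
print(f"95% credible interval for the lift: [{lo:.1%}, {hi:.1%}]")
```

With these made-up numbers, the chance of beating the control comes out around 95%, but the credible interval for the lift stretches from slightly negative to well above 10%, which is exactly the nuance a single headline number hides.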
Peeking problem
Another problem with A/B testing statistics is the “peeking problem.” Simply put, the more often you check your tests for significance, the more likely it is that a test will show you an incorrect significant result. If you constantly watch a test until it “goes significant” and then stop it, you are likely seeing random chance, not necessarily an actual improvement. You have not captured meaning, only random noise. Instead of asking whether the difference is significant at a predetermined point in the future, you are asking whether the difference is significant at least once during the test. These are completely different questions. If you called an experiment by watching for the moment it reached significance, you may not see any effect overall: you “peeked” and only saw random noise.
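To see how much this matters, here is a small simulation with made-up traffic numbers and a made-up checking schedule. Both variations share the same conversion rate, so every “significant” result is a false positive, yet stopping at the first p < 0.05 flags far more than the nominal 5% of experiments:

```python
# Simulating the peeking problem: A and B have the SAME conversion rate, so
# any significant result is a false positive. All numbers are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_rate = 0.05        # identical conversion rate for both variations
batch = 500             # visitors added to each variation between peeks
n_batches = 40          # the peeker checks significance 40 times
n_experiments = 1_000

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test (normal approximation)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - stats.norm.cdf(abs(z)))

peeking_fp = 0   # significant at least once during the test
fixed_fp = 0     # significant at the predetermined end of the test

for _ in range(n_experiments):
    a = rng.binomial(1, true_rate, batch * n_batches).cumsum()
    b = rng.binomial(1, true_rate, batch * n_batches).cumsum()

    for i in range(1, n_batches + 1):
        n = i * batch
        if p_value(a[n - 1], n, b[n - 1], n) < 0.05:
            peeking_fp += 1
            break   # the peeker declares a winner and stops the test here

    n = batch * n_batches
    fixed_fp += p_value(a[-1], n, b[-1], n) < 0.05

print(f"False positive rate with peeking: {peeking_fp / n_experiments:.1%}")       # well above 5%
print(f"False positive rate at a fixed horizon: {fixed_fp / n_experiments:.1%}")   # roughly 5%
```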
Solution:
There are two solutions. One is to estimate the sample size needed for the experiment up front, stop the test when you reach that sample size, and only then draw conclusions from the data. This has its own problems, not least that you have to guess the effect size in advance and resist acting on the results early. The other is to use statistics that are less susceptible to peeking: Bayesian approaches, which always report a probabilistic range (read more here if you’re interested), sequential analysis, or a multiple-testing correction applied each time the results are examined.
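For the first approach, here is a minimal sketch of the standard two-proportion sample size calculation, assuming a hypothetical baseline rate and target lift:

```python
# Decide up front how many visitors you need per variation to detect a given
# relative lift, then only evaluate the test once you get there.
# The baseline rate and target lift below are hypothetical.
import math
from scipy import stats

def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Visitors needed per variation for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# e.g. a 5% baseline conversion rate and a hoped-for 10% relative lift
print(sample_size_per_arm(0.05, 0.10))   # roughly 31,000 visitors per variation
```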
Micro optimizations
One of our growth teams spent a lot of time and effort coming up with experiment ideas for a section of our site which they correctly surmised could be improved a lot. What they failed to take into account was that the traffic to that section was very low. Even if they were extremely successful, the overall impact would be small. Hypothetically, even if they increased the section’s conversion rate by 10%, given that the section only represented 10% of the overall converting traffic, they would have increased revenue by only 1% overall, which is likely not noticeable. It is very easy to focus on the 10% win and celebrate it, while never acknowledging that the raw numbers are small.
Solution:
Prioritize better. Be honest with your traffic numbers, come up with an estimate of possible impact for each idea, and sort by that estimate. There are many ways to prioritize test ideas, all with their pros and cons, but it’s important to use one. You could also use a system like Growth Book, which has these features built in, including our Impact Score. Make sure your team is aware of the current state of your traffic and conversion numbers (if that is your goal), so they can focus on high-impact sections of your product.
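As a rough illustration of what such a prioritization could look like (a hand-rolled sketch with made-up numbers, not Growth Book’s actual Impact Score):

```python
# Rank test ideas by the share of converting traffic they touch times the lift
# you optimistically expect, so a 10% lift on 10% of traffic (a ~1% overall
# gain) sorts below smaller lifts on bigger surfaces. Numbers are made up.
ideas = [
    # (idea, share of converting traffic affected, optimistic relative lift)
    ("Redesign niche settings page", 0.10, 0.10),
    ("Simplify main checkout flow",  0.80, 0.03),
    ("New homepage hero copy",       0.60, 0.02),
]

def estimated_impact(traffic_share, expected_lift):
    """Rough upper bound on the overall metric change if the test wins."""
    return traffic_share * expected_lift

for name, share, lift in sorted(ideas, key=lambda i: estimated_impact(i[1], i[2]),
                                reverse=True):
    print(f"{name}: ~{estimated_impact(share, lift):.1%} potential overall impact")
```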
Lost in the noise
In the previous section, we talked about a 1% win being lost in the noise. But 1% improvements are great! If you can keep stacking consistent small wins, they can have a real impact on your overall metrics, even though you’re unlikely to see a noticeable change in a top-level metric from any single test, particularly if you’re testing frequently.
Solution:
There are no great solutions for this. Make sure that people looking at the overall numbers are aware that it’s hard to see small signals in the noise of larger metrics, and that this doesn’t mean your tests are not having an effect. If possible, look for longer-term trends or ways to isolate the cumulative effects. Small changes work like compound interest and can have large effects over time.
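A quick back-of-the-envelope example of that compounding, with made-up numbers:

```python
# Ten 1% wins compound to a roughly 10.5% improvement overall, even though
# no single win is visible at the top level. Numbers are hypothetical.
wins_per_year = 10        # successful tests shipped in a year
lift_per_win = 0.01       # 1% improvement from each win

cumulative = (1 + lift_per_win) ** wins_per_year - 1
print(f"Cumulative lift after {wins_per_year} wins: {cumulative:.1%}")  # ~10.5%
```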
Optimizing non-leading indicators
At my previous company, we noticed that users who engaged with more than one type of content converted and retained at a higher rate than everyone else. We used this engagement as a leading indicator for lifetime value, and then spent some time trying to get all users to engage with multiple types of content. We were successful in this goal, but conversion and revenue numbers actually decreased as a result (we pushed users into a pattern they didn’t want). We had fallen into the classic “correlation is not causation” trap, and our chosen proxy metric was meaningless.
Many companies want to optimize for metrics which are not directly measurable, or are measurable but require time frames which are impractical to test. In this situation companies use proxy metrics as leading indicators for their actual goals. If you improve a proxy metric, and fail to see movements in your goal metrics, it is likely that your assumptions about your proxy metrics are wrong.
Solution:
If you can, measure and test directly against the goals you’re trying to achieve. If your goal is revenue, you may need to use a platform like Growth Book which supports many more metric types than just binomial. If you have to use a proxy metric, test the assumptions behind it quickly and make sure the proxy is causally linked to your goal, not just correlated with it.
Not enough tests
Success rates for A/B tests are typically between 10% and 30%, and the chance that any one test is a big win is even lower. If you’re running 2 tests a month, or 24 tests a year, only about 5 of them may be successful on average. If 10% of those successes are big wins, you may not see a single big win in a year. This can be the downfall of growth teams and experimentation programs. If you’ve been running a program for a while and you’re not seeing results, it may be that you’re not running enough tests.
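The arithmetic, with illustrative rates consistent with the ranges above:

```python
# Expected wins per year depend directly on how many tests you run.
# The rates below are illustrative, not measured.
tests_per_year = 24       # 2 tests per month
success_rate = 0.20       # roughly 10-30% of tests succeed
big_win_rate = 0.10       # fraction of successes that are big wins

expected_wins = tests_per_year * success_rate
expected_big_wins = expected_wins * big_win_rate
print(f"Expected wins per year: {expected_wins:.1f}")          # ~4.8
print(f"Expected big wins per year: {expected_big_wins:.1f}")  # ~0.5
```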
Solution:
Test more. If your odds of success are low and you’re able to test more often, your chances of getting successful results go up. Just make sure you have an honest way to prioritize your tests and are focusing on high-impact areas.
Trust is a critical part of a successful A/B testing program, and a great way to earn it is to show results. Be honest with your statistics, prioritize well, and test as much as you can. Hopefully, by implementing the solutions above, or at least being aware of the problems, you can make your experimentation program more successful and see its impact on your overall metrics.