Experimentation has many parts: choosing the hypothesis, implementing the variations, assignment, results analysis, documentation, and so on, and each of these has its own nuances that can make it complicated. Usually, choosing the metric (or metrics) that determines whether your hypothesis is correct is straightforward. However, there are situations where choosing metrics is problematic and surfaces classic issues of Goodhart’s Law.
Goodhart’s Law says that when a measure becomes a target, it ceases to be a good measure. It is a close relative of Campbell’s Law, the Cobra Effect, and, more generally, perverse incentives. Plainly put, these laws describe conditions in which choosing a metric has unintended negative effects.
The classic example, and the etymology behind the “Cobra Effect,” comes from colonial India where, as the story goes, there was a problem with too many cobras. To solve the issue, the British government placed a bounty on dead cobras, which prompted locals to start breeding cobras to turn in for the bounty.
A more recent example comes from the US News & World Report Best College Rankings. One factor in the ranking is selectivity: how many students applied versus how many were accepted. This measure encouraged some schools to game the numbers by counting every postcard expressing interest as an “application” and by rejecting more students from fall admission to boost their ranking.¹
These same unintended negative effects can apply to A/B testing. If you tell your teams to improve a single metric, they can be incentivized to game it. As an example, consider the case where you want to increase pages per visit. One idea that would absolutely work is to paginate your content into smaller and smaller pages (you can see real examples of this on many sites). You risk destroying your user experience and annoying your users, but that one metric will almost certainly go up.
Facebook is a classic example of this phenomenon. Facebook is famous for hyper-focusing on growth metrics (e.g. active users and user engagement). Teams have these growth metrics as their primary performance indicators, and their bonuses depend on improving them, so they have little incentive to improve unrelated metrics that would give a more complete picture. Facebook does have teams that measure other impacts of its work, such as its Integrity teams and internal researchers, but because of these conflicts of interest the company routinely fails to meaningfully address the concerns they raise. Even when decisions are escalated, growth is put above all else. This has led to a product that certainly has high engagement, but that alienates large segments of the population and spreads misinformation that has caused real harm in the world (see January 6th² or Myanmar³).
Assuming you have chosen your teams’ metrics carefully to avoid the problems above, and your teams are good at not gaming them, there are still other unintended negative effects to watch out for when choosing metrics for A/B testing.
Correlation, Causation, and Proxy Metrics
A few years ago we ran an interesting experiment on a freemium content site, where our north star metric was revenue. The hypothesis came from an insight from our data team, who noticed that users who consumed multiple types of content converted to paid at a much higher rate than average. With this correlation in mind, our product teams set out to push more users to consume multiple kinds of content, in the hope of increasing revenue. For these tests we measured both the proxy (multiple content use) and the goal (revenue). The results were fascinating, and counterintuitive.
What we found after many experiments was that pushing users toward actions they weren’t naturally inclined to take not only failed to increase revenue, it decreased pages per visit and actually reduced revenue. We did succeed, however, in increasing the metric of multiple content use. What these experiments actually proved was that the correlation was not causal. If we had used only the proxy metric of multiple content use and trusted its causal relationship to revenue, we might have declared these tests a success.
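As a rough illustration of what that dual-metric analysis can look like, here is a minimal sketch in Python. The metric names, data, and effect sizes are hypothetical, not from the actual experiment; the point is simply that the proxy and the goal metric are evaluated side by side for the same test.

```python
import numpy as np
from scipy import stats

def analyze_variant(control, treatment):
    """Compare a treatment arm to control on both the proxy and the goal metric.

    `control` and `treatment` are dicts of per-user arrays:
      - 'multi_content': 1 if the user consumed multiple content types, else 0 (proxy)
      - 'revenue': revenue attributed to the user during the test window (goal)
    """
    results = {}
    for metric in ("multi_content", "revenue"):
        c, t = control[metric], treatment[metric]
        lift = (t.mean() - c.mean()) / c.mean()
        # Welch's t-test as a simple default; revenue is heavy-tailed, so in
        # practice you might prefer a bootstrap or a nonparametric test.
        _, p_value = stats.ttest_ind(t, c, equal_var=False)
        results[metric] = {"lift": lift, "p_value": p_value}
    return results

# Hypothetical data shaped like the experiment described above: the proxy
# moves up while the goal metric stays flat or drifts down.
rng = np.random.default_rng(0)
control = {
    "multi_content": rng.binomial(1, 0.10, 50_000).astype(float),
    "revenue": rng.exponential(0.50, 50_000),
}
treatment = {
    "multi_content": rng.binomial(1, 0.14, 50_000).astype(float),
    "revenue": rng.exponential(0.48, 50_000),
}
print(analyze_variant(control, treatment))
```

Looking at only the first row of that output would tell a success story; looking at both rows tells the real one.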
Pushing on metrics can also destroy the original correlation. During the experiments we looked back at the original correlation, and it had disappeared. We had pushed so many unqualified users into using multiple content types, without the accompanying increase in revenue, that the original correlation was no longer significant.
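A quick way to see this effect is to recompute the proxy-to-goal correlation on a pre-experiment cohort and on a cohort exposed to the nudges. The sketch below uses simulated, hypothetical data (cohort sizes, rates, and variable names are all assumptions) to show how the correlation can vanish once unqualified users are pushed into the proxy behavior.

```python
import numpy as np
from scipy import stats

def proxy_goal_correlation(multi_content, converted):
    """Point-biserial correlation between the proxy (multiple content use)
    and the goal (conversion to paid). Both arguments are 0/1 arrays."""
    return stats.pearsonr(multi_content, converted)

rng = np.random.default_rng(1)

# Before: users self-select into multiple content types, and those users
# convert at a higher rate, so proxy and goal are correlated.
multi_before = rng.binomial(1, 0.10, 20_000)
converted_before = rng.binomial(1, np.where(multi_before == 1, 0.08, 0.02))

# During: the nudges push extra, unqualified users into multiple content
# types without changing how likely they are to convert.
multi_during = rng.binomial(1, 0.20, 20_000)
converted_during = rng.binomial(1, 0.03, 20_000)

print("before:", proxy_goal_correlation(multi_before, converted_before))
print("during:", proxy_goal_correlation(multi_during, converted_during))
```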
Another common reason for using proxy metrics is that they are all you have. You want to be testing against the metrics you need, not just the metrics you happen to have. Think about which metrics would actually prove the hypothesis; if you find you cannot test against them, you have a problem. Using proxy metrics without knowing their effect on the goal metric means you cannot determine causal relationships, and it limits the effectiveness of your experiments. Make sure your experimentation platform is not the limiting factor in running correct experiments, and run tests with the right metrics.
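One lightweight way to enforce this is to make the goal and guardrail metrics part of the experiment definition itself, so a test cannot be launched on a proxy alone. The structure below is a hypothetical sketch, not any particular platform’s API; the field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentConfig:
    """Hypothetical experiment definition: the goal metric and guardrails are
    declared up front, not just the easy-to-move proxy."""
    name: str
    hypothesis: str
    goal_metrics: list[str]                                       # what success actually means
    proxy_metrics: list[str] = field(default_factory=list)        # leading indicators only
    guardrail_metrics: list[str] = field(default_factory=list)    # must not regress

multi_content_nudge = ExperimentConfig(
    name="multi-content-nudge",
    hypothesis="Nudging users toward multiple content types increases revenue",
    goal_metrics=["revenue_per_visitor"],
    proxy_metrics=["multi_content_usage_rate"],
    guardrail_metrics=["pages_per_visit", "bounce_rate"],
)
```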
Conclusion
The lesson for metric selection from Goodhart’s Law is to be thoughtful with your metrics. Avoid proxy metrics unless you are confident they are causally linked to your goal. Be careful about the implications of your goal metrics, and consider what other metrics you should measure to detect negative effects. Finally, not all experiment results are straightforward to analyze; careful consideration is needed to reach the right conclusion for your product, which is quite often nuanced. With these thoughts in mind, hopefully you can avoid some of the pitfalls of metric-driven decision-making.
[1] G. S. Morson and M. Schapiro, Oh what a tangled web schools weave: The college rankings game (2017) Chicago Tribune
[2] C. Timberg, E. Dwoskin and R. Albergotti, Inside Facebook, Jan. 6 violence fueled anger, regret over missed warning signs (2021) Washington Post
[3] J. Clayton, Rohingya sue Facebook for $150bn over Myanmar hate speech (2021) BBC News