Holdouts in GrowthBook: The Gold Standard for Measuring Cumulative Impact


Many successful product teams iterate quickly, running simultaneous experiments and launching new features weekly. Measuring the overall effect of this work is critical to understanding a team's impact and to setting product direction. However, actually measuring this cumulative impact can be quite difficult.

Holdouts in GrowthBook provide a simple way to keep a true control group across multiple features and measure long-run cumulative impact. It’s the gold standard way to answer the question: “What did all of this shipping actually do to my key metric?”


Why holdouts matter

Cumulative impact is important to measure.

Knowing that your experimentation program is helping you ship winning features and block losing ones informs your product direction. Knowing which teams are driving the most impact tells you what's working and what isn't. Teams that are moving the needle may deserve more investment to keep driving their goals upward. Teams that struggle to have a significant impact may have hit diminishing returns, may need a new direction, or may be working on a product mature enough that gains are harder to come by.

Cumulative impact is hard to measure.

Looking at the overall trend in your goal metrics is not enough. Seasonality and forces beyond your control can drive movements in your goal metrics and mislead you. And with constant shipping across product teams, attributing lift to any individual team is nearly impossible.

Other approaches try to sum up the effects of individual experiments and apply some bias correction, like the one in our own Insights section. Summed individual experiment impacts almost always overstate the final effect, due to selection bias, diminishing returns over time, and cannibalizing interactions between experiments. This isn't just theoretical: Airbnb documented a naive sum overstating impact by 2x compared with a holdout, with bias-corrected estimates still overstating it by 1.3x.

Holdouts as the solution.

A well-run holdout exposes a stable baseline of users to none of your new features for a period of time, then compares them to the general population. Because a holdout can run for a long time on a small percentage of traffic, you capture longer-run effects. It also stacks all of your features and experiments into one test, capturing cumulative and interactive effects. And because it relies on the same reliable statistics and inference as your experiments, a holdout is the gold standard for measuring cumulative, long-run impact.
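As a rough illustration of the comparison you make at the end of a holdout, here is a minimal sketch of a two-sample z-test on a goal metric. The names and data shapes (`GroupStats`, `cumulativeLift`) are hypothetical; this is not GrowthBook's statistics engine.

```typescript
// Illustrative only: compare a goal metric between the holdout group
// and the general population at the end of the measurement window.
interface GroupStats {
  n: number; // number of users in the group
  mean: number; // mean of the goal metric (e.g., revenue per user)
  variance: number; // sample variance of the goal metric
}

function cumulativeLift(holdout: GroupStats, general: GroupStats) {
  // Absolute difference in means: what all that shipping actually did
  const lift = general.mean - holdout.mean;

  // Standard error of a difference between two independent means
  const se = Math.sqrt(
    general.variance / general.n + holdout.variance / holdout.n
  );

  return {
    lift,
    relativeLift: lift / holdout.mean,
    zScore: lift / se,
    // 95% confidence interval for the absolute lift
    ci95: [lift - 1.96 * se, lift + 1.96 * se],
  };
}
```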

How Holdouts work in GrowthBook

At a high level:

As you launch new features and experiments, all new traffic is first checked for holdout membership before users see any new feature or experiment values.

When an experiment goes live, the holdout group is completely excluded while the general population is randomized into one condition or another. Once an experiment is shipped, all users in the general population receive the shipped variant.

This means that the holdout measures the cumulative impact of using your product, which includes all the false starts and the test period for the experiments that didn’t ship, because that is a true record of what actually happened in the past quarter.

Only once the holdout is ended will users in the holdout group receive any shipped features.
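To make the flow concrete, here is a minimal sketch of the diversion logic, assuming holdout membership is decided by a deterministic hash of the user ID. This illustrates the concept, not GrowthBook's actual implementation; `hashToUnit`, `inHoldout`, and `evaluateFeature` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Deterministically map a user ID to [0, 1) so holdout membership is
// stable across sessions and feature evaluations.
function hashToUnit(userId: string, seed: string): number {
  const digest = createHash("sha256").update(seed + userId).digest();
  return digest.readUInt32BE(0) / 0x100000000;
}

const HOLDOUT_TRAFFIC = 0.05; // e.g., withhold 5% of users

function inHoldout(userId: string): boolean {
  return hashToUnit(userId, "h1-holdout") < HOLDOUT_TRAFFIC;
}

// The holdout check runs before any feature or experiment is evaluated:
// holdout users always see the control value, while everyone else gets
// the experiment or shipped value.
function evaluateFeature<T>(userId: string, controlValue: T, newValue: T): T {
  return inHoldout(userId) ? controlValue : newValue;
}
```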

Using your Holdout

Facebook and Twitter product teams ran 6-month holdouts for all their features, withholding 5% or less of traffic, then used the cumulative impact in reporting and to check whether they had set their product direction correctly. They then released the holdout and started a new one for the next 6-month period.

Other teams at Twitter also used long-run, low-traffic holdouts on a bundle of critical features to ensure those features were continuing to provide value.

Product teams at Twitter would run a holdout for half a year, adding new features to it over those 6 months. Then they would use the following quarter to get a reliable, long-run measure of their cumulative impact.

So, a year would look like this:

| Timeframe | Holdout Status |
| --- | --- |
| Q1 | h1-holdout (active) |
| Q2 | h1-holdout (active) |
| Q3 | h2-holdout (active); h1-holdout (measurement only) |
| Q4 | h2-holdout (active) |
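One way to encode that staggered cadence, sketched with a hypothetical `schedule` object (quarters, holdout names, and statuses taken from the table above):

```typescript
type HoldoutStatus = "active" | "measurement-only";

// The staggered cadence from the table above: each holdout diverts
// traffic for two quarters, then stays frozen for one more quarter of
// measurement while the next holdout starts.
const schedule: Record<string, Record<string, HoldoutStatus>> = {
  Q1: { "h1-holdout": "active" },
  Q2: { "h1-holdout": "active" },
  Q3: { "h2-holdout": "active", "h1-holdout": "measurement-only" },
  Q4: { "h2-holdout": "active" },
};

// Which holdouts should divert new traffic in a given quarter?
function activeHoldouts(quarter: string): string[] {
  return Object.entries(schedule[quarter] ?? {})
    .filter(([, status]) => status === "active")
    .map(([name]) => name);
}
```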

Get started

Read more about holdouts in our Knowledge Base and see our documentation for help running your first holdout.

Want to give GrowthBook a try?

In under two minutes, GrowthBook can be set up and ready for feature flagging and A/B testing, whether you use our cloud or self-host.