Many successful product teams iterate quickly, running simultaneous experiments and launching new features weekly. Measuring the overall effect of these tests is critical to understanding each team's impact and to setting product direction. However, actually measuring this cumulative impact can be quite difficult.
Holdouts in GrowthBook provide a simple way to keep a true control group across multiple features and measure long-run cumulative impact. It’s the gold standard way to answer the question: “What did all of this shipping actually do to my key metric?”
Why holdouts matter
Cumulative impact is important to measure.
Knowing whether your experimentation program is helping you ship winning features and avoid losing ones helps set your product direction. Knowing which teams are driving the most impact shows you what's working and what isn't. Teams successfully moving the needle may deserve more investment to keep driving their goals upward. A team that struggles to have a significant impact may have hit diminishing returns, may need a new direction, or may be working on a product mature enough that gains are harder to achieve.
Cumulative impact is hard to measure.
Looking at the overall trend in your goal metrics is not enough. Seasonality and forces beyond your control can drive goal metric movements and mislead you. And with constant shipping across product teams, attributing lift to any individual team can be nearly impossible.
Other approaches try to sum up the effects of individual experiments and apply some bias correction, like the one in our own Insights section. Summed individual impacts almost always overstate the final effect, due to selection bias, diminishing returns over time, and cannibalizing interactions between experiments. This isn't just theoretical: Airbnb documented a naive sum overstating impact by 2x compared with a holdout, with bias-corrected estimates still overstating it by 1.3x.
Holdouts as the solution.
A well-run holdout keeps a stable baseline of users away from all of your new features for a period of time, then compares them to the general population. Because a holdout can run for longer on a small percentage of traffic, it captures longer-run effects. It also stacks all of your features and experiments into one test, capturing cumulative and interactive effects. And because it relies on the same randomization and statistical inference as a regular experiment, a holdout is the gold standard for measuring cumulative, long-run impact.
How Holdouts work in GrowthBook
At a high level:
- Holdout group: A small percentage of traffic (usually users) is diverted away from new features, experiments, and bandits.
- General population: Everyone else, experimenting and shipping as usual. A small subset of the general population serves as a measurement group to compare against the holdout group (see the sketch below for a conceptual picture of this split).
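To make the split concrete, here is a minimal conceptual sketch in TypeScript of deterministic bucketing. It is illustrative only: the hash function, the bucket boundaries, and names like `assignBucket` are assumptions for this example, not GrowthBook's actual internals. The key property is that the same user ID always lands in the same bucket, so holdout membership is stable across sessions and features.

```ts
// Illustrative only: a stable string hash so a given user always lands in
// the same bucket. GrowthBook's real bucketing internals may differ.
type Bucket = "holdout" | "measurement" | "general";

function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Hypothetical split: 5% holdout, a matching 5% measurement slice of the
// general population, and 90% that simply ships as usual.
function assignBucket(userId: string, seed = "h1-holdout"): Bucket {
  const point = fnv1a(`${seed}:${userId}`) % 10000; // 0..9999
  if (point < 500) return "holdout";      // never sees new features
  if (point < 1000) return "measurement"; // comparison slice
  return "general";
}
```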
As you launch new features and experiments, every user is first checked against the holdout before being served new feature or experiment values.
When an experiment goes live, the holdout group is completely excluded while the general population is randomized across its variations. Once an experiment is shipped, all users in the general population receive the shipped variant.
This means the holdout measures the cumulative impact of using your product, including all of the false starts and the test periods of experiments that didn't ship, because that is a true record of what actually happened in the past quarter.
Only once the holdout is ended will users in the holdout group receive any shipped features.
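Continuing the sketch above, the evaluation order might look like the following. Again, this is a simplified illustration of the behavior just described, not the SDK's real implementation: the holdout check runs before any feature or experiment logic, live experiments randomize only the general population, and shipped features roll out to everyone outside the holdout.

```ts
// Simplified illustration of the evaluation order; reuses assignBucket and
// fnv1a from the earlier sketch. Not the SDK's real implementation.
type FeatureState<T> =
  | { status: "experimenting"; control: T; variations: T[] }
  | { status: "shipped"; control: T; value: T };

function evaluateFeature<T>(userId: string, feature: FeatureState<T>): T {
  // 1. The holdout check runs first: holdout users always get control.
  if (assignBucket(userId) === "holdout") return feature.control;

  // 2. Live experiments randomize the general population across variations
  //    (a real system would seed the hash per experiment).
  if (feature.status === "experimenting") {
    const idx = fnv1a(`exp:${userId}`) % (feature.variations.length + 1);
    return idx === 0 ? feature.control : feature.variations[idx - 1];
  }

  // 3. Shipped features roll out to everyone outside the holdout.
  return feature.value;
}
```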
Using your Holdout
Facebook and Twitter product teams ran 6-month holdouts across all of their features, withholding 5% or less of traffic, and used the cumulative impact in reporting and to check that they had set product direction correctly. They then released the holdout and started a new one for the next 6-month period.
Other teams at Twitter also used long-run, low-traffic holdouts on a bundle of critical features, to ensure those features kept providing value.
- Define the population size: Pick a sample large enough to measure your cumulative impact (a rough sizing sketch follows this list), but beware that a larger holdout means less traffic for your day-to-day experiments and fewer users on the latest set of features.
- Define the active period length (half a quarter to a full quarter): Pick a period long enough to accumulate some wins.
- During the active period: Ship normally. Keep adding experiments and launching features. The holdout quietly accumulates evidence.
- Analysis period (2–4 weeks): Stop adding new changes, let effects settle, and measure cumulative impact with our automatic lookback windows, which restrict the analysis to just this period.
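For the sizing step, a standard two-sample power calculation gives a rough lower bound on how many users you need in the holdout and measurement groups. This back-of-the-envelope sketch assumes a conversion-rate metric, a two-sided alpha of 0.05, and 80% power; `usersPerGroup` is a hypothetical helper, not a GrowthBook API.

```ts
// Back-of-the-envelope sizing with the standard two-sample normal
// approximation: alpha = 0.05 (two-sided), 80% power. Assumes a
// conversion-rate metric; usersPerGroup is a hypothetical helper.
function usersPerGroup(baselineRate: number, relativeLift: number): number {
  const z = 1.96 + 0.84; // z_{0.975} + z_{0.80}
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const delta = p2 - p1;
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil((z * z * variance) / (delta * delta));
}

// e.g. detecting a 2% relative lift on a 20% baseline conversion rate:
console.log(usersPerGroup(0.2, 0.02)); // ≈ 158,000 users per group
```

Even a modest 2% relative lift on a 20% baseline needs on the order of 158,000 users in each group, which is why holdouts typically run for months on a small but stable slice of traffic.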
Product teams at Twitter would run a holdout for half a year, adding new features to it over the course of those 6 months. Then they would use the following quarter to get a reliable, long-run measure of their cumulative impact.
So, a year would look like this:
| Timeframe | Holdout Status |
|---|---|
| Q1 | h1-holdout (active) |
| Q2 | h1-holdout (active) |
| Q3 | h2-holdout (active); h1-holdout (measurement only) |
| Q4 | h2-holdout (active) |
Tips & Trade-offs
- Project-scope your Holdout: If you want to measure the impact of a given team’s set of features, have that team work within one or more GrowthBook Projects and have the Holdout automatically apply to their features and experiments.
- Be wary of the user experience: A small group won't see new features, so keep the percentage small and the period finite.
- Be ready to keep feature flags in code: Holdouts require feature flags to stick around through the analysis period, so prepare your workflows for longer-lived flags.
- Metrics: Favor durable outcomes (revenue, retention, engagement) and use lookbacks for clean analysis windows, so that you only measure impact once all experiments have had a chance to bed in.
Get started
- Create your first holdout in the app (Experiments → Holdouts) and scope it to a project you want to measure impact within.
- Pick 2–4 durable metrics that your team is hoping to improve over the long run.
Read more about holdouts in our Knowledge Base and see our documentation to help run your first holdout.