A practical checklist for running A/B tests you can trust
From predictive model accuracy at Facebook and experiment design at X (formerly Twitter) to building the experimentation platform used by Dropbox, Sony, and Upstart at GrowthBook, I've spent the last six years shaping how some of the largest tech companies measure success and ship features.
Across companies, industries, and scales, I’ve seen the same pattern repeat: experimentation rarely fails because teams don’t understand A/B testing mechanics. It fails because experiments are poorly designed—unclear goals, misaligned metrics, weak baselines, flawed randomization, or decisions made without a plan for ambiguous results.
The teams that get the most value from experimentation aren’t running more tests. They’re running better ones. They’re deliberate about what they’re trying to learn and disciplined about how results turn into decisions.
This article distills the most reliable experiment design practices I’ve learned from years of work in the field. If you already know how A/B testing works and want results you can trust—and act on—these seven steps are a strong place to start.
(For a deeper technical walkthrough, see GrowthBook’s Experimentation Best Practices.)
1. Define the Goal Clearly
Every experiment should answer a specific question.
Start by writing down the problem you’re trying to solve in plain language. Is it activation? Retention? Conversion efficiency?
A good test of clarity is whether you can write a concrete hypothesis, such as:
“Users who complete the new onboarding flow will reach the activation milestone 10% more often than users in the existing flow.”
Clear goals prevent experiments from drifting into vague “did anything change?” territory.
In practice: Teams at Dropbox use tightly framed hypotheses to avoid shipping changes that move surface-level engagement but fail to improve long-term collaboration or retention.
2. Choose the Right Success Metrics
Once the goal is clear, metrics follow.
Every experiment should have:
- One primary metric that defines success
- A set of secondary metrics for context
- Guardrail metrics to catch unintended harm
Focusing on too many metrics creates confusion and inflates false positives when they’re all evaluated simultaneously (see GrowthBook’s guidance on multiple testing corrections). Tracking too few hides important tradeoffs.
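To make the multiple-testing point concrete, here is a minimal sketch of a Benjamini–Hochberg (false discovery rate) correction applied to a batch of secondary-metric p-values. The metric names and p-values are purely illustrative, and this is a generic implementation, not GrowthBook’s built-in correction.

```python
# Minimal Benjamini-Hochberg (FDR) step-up correction for a batch of
# secondary-metric p-values. Metric names and values are illustrative.

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return {metric: True/False} for which metrics survive the FDR correction."""
    ranked = sorted(pvalues.items(), key=lambda kv: kv[1])  # ascending p-values
    m = len(ranked)
    max_passing_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= alpha * rank / m:          # BH step-up threshold for this rank
            max_passing_rank = rank
    return {metric: rank <= max_passing_rank
            for rank, (metric, _) in enumerate(ranked, start=1)}

secondary_pvalues = {
    "pages_per_session": 0.004,
    "time_to_first_action": 0.03,
    "support_tickets": 0.20,
    "search_usage": 0.04,
}
print(benjamini_hochberg(secondary_pvalues))
# Only pages_per_session survives; the other "wins" are treated as noise.
```

In this made-up example, several secondary metrics look significant on their own, but only one survives the correction—exactly the kind of tradeoff that gets hidden when metrics are evaluated one at a time.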
Use secondary metrics to deepen your understanding of what drives the primary metric. They also help you periodically sanity-check the primary metric itself, confirming it is well-defined and still pointing at your business goals.
In practice: Teams at Khan Academy use experimentation to iterate on learning experiences while remaining deeply thoughtful about how success is measured in an educational context.
3. Know Your Baseline
You can’t interpret change without knowing where you started.
Before launching an experiment:
- Understand current performance
- Measure normal variance
- Calibrate expectations for realistic lift
A change from 4% to 5% conversion is only meaningful if you know how stable 4% really is.
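As a rough illustration, the sketch below estimates how much a conversion rate moves on its own from day to day, using made-up historical data; a real baseline check would pull from your own event data and a representative time window.

```python
# Rough baseline check: how stable is the "4%" conversion rate day to day?
# The daily rates below are made-up illustrative data.
from statistics import mean, stdev

daily_conversion = [0.041, 0.038, 0.044, 0.039, 0.047, 0.036, 0.042,
                    0.040, 0.045, 0.037, 0.043, 0.039, 0.046, 0.041]

baseline = mean(daily_conversion)
day_to_day_sd = stdev(daily_conversion)

print(f"baseline conversion: {baseline:.2%}")
print(f"day-to-day standard deviation: {day_to_day_sd:.2%}")
# If the normal range already stretches most of the way to 5%, a single
# 4% -> 5% reading could be seasonality or noise rather than real lift.
print(f"typical range: {baseline - 2 * day_to_day_sd:.2%} "
      f"to {baseline + 2 * day_to_day_sd:.2%}")
```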
In practice: One GrowthBook customer—a large European marketplace—moved away from before-and-after analysis after realizing they couldn’t separate real lift from seasonality. Establishing proper baselines made results interpretable and decisions easier.
4. Understand Leading vs. Lagging Indicators
Not all metrics respond at the same speed.
- Leading indicators provide fast feedback and are often better suited for short-term experiments.
- Lagging indicators validate long-term impact and strategic alignment.
High-performing teams use both, but they’re intentional about which metric actually determines success.
Optimizing only for lagging indicators slows learning. Ignoring them risks local optimization.
5. Define the Experiment Population and Randomization Strategy
Decide who should be included in the experiment—and exclude everyone else.
Best practices include:
- Randomizing users as close to the experience as possible
- Ensuring assignment persists across sessions (see the sketch after this list)
- Using a true control group
- Keeping designs simple when traffic is limited
If you don’t have enough users, avoid multi-variant tests.
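For persistent assignment, one common pattern is to hash a stable user ID together with an experiment key, so the same user always lands in the same variant no matter how many sessions they start. The sketch below is a simplified illustration of that idea, not GrowthBook’s actual bucketing implementation.

```python
# Simplified deterministic bucketing: the same user ID and experiment key
# always map to the same variant, so assignment persists across sessions.
# This is an illustrative sketch, not GrowthBook's actual hashing scheme.
import hashlib

def assign_variant(user_id: str, experiment_key: str,
                   variants=("control", "treatment")) -> str:
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF          # uniform-ish value in [0, 1]
    index = min(int(bucket * len(variants)), len(variants) - 1)
    return variants[index]

# The same inputs always produce the same assignment.
assert assign_variant("user-123", "new-onboarding") == \
       assign_variant("user-123", "new-onboarding")
print(assign_variant("user-123", "new-onboarding"))
```

Because assignment depends only on the user ID and the experiment key, no session state or lookup table is needed to keep a returning user in the same arm.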
In practice: One GrowthBook customer, a major European retailer, was running underpowered tests. They moved from partial traffic to testing on 100% of visitors—dramatically reducing time to confidence and revealing insights that challenged long-held assumptions.
If you’re using feature flags to control exposure, GrowthBook’s approach to running experiments with feature flags is designed specifically for this kind of setup.
6. Validate Your Setup Before You Trust Results
You can’t analyze what you can’t connect.
Before launching real experiments, confirm that:
- Exposure data joins cleanly with outcome data
- Identifiers are consistent
- Metrics are computed correctly
Then run an A/A test—two identical variants with no visible change.
In practice: Teams operating at scale use A/A tests to catch instrumentation and analysis issues early. If multiple uncorrelated metrics “win” in a no-change test, or multiple A/A tests fail with clear issues, something is broken. GrowthBook strongly recommends this as a validation step (A/A testing documentation).
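As a rough illustration of what that validation can look like, the sketch below joins a hypothetical exposures table to a hypothetical outcomes table, checks that identifiers line up, and runs a two-proportion z-test on an A/A split. The file names, column names, and variant labels are assumptions, not a prescribed schema.

```python
# A/A sanity check: confirm identifiers join cleanly and that two identical
# variants don't "differ". File, column, and variant names are hypothetical.
import math
import pandas as pd

exposures = pd.read_csv("exposures.csv")   # columns: user_id, variant ("A1" or "A2")
outcomes = pd.read_csv("outcomes.csv")     # columns: user_id, converted (0/1)

# Identifier check: outcome events with no matching exposure usually mean
# inconsistent IDs or a tracking gap.
matched = outcomes["user_id"].isin(exposures["user_id"]).mean()
print(f"share of outcome events with a matching exposure: {matched:.1%}")

joined = exposures.merge(outcomes, on="user_id", how="left")
joined["converted"] = joined["converted"].fillna(0)

grouped = joined.groupby("variant")["converted"].agg(["sum", "count"])
p1, n1 = grouped.loc["A1", "sum"] / grouped.loc["A1", "count"], grouped.loc["A1", "count"]
p2, n2 = grouped.loc["A2", "sum"] / grouped.loc["A2", "count"], grouped.loc["A2", "count"]

# Two-proportion z-test; with no real change, significant results should be rare.
pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value
print(f"A/A z = {z:.2f}, p = {p_value:.3f}")
```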
7. Decide How Long to Run the Experiment
Ending experiments early increases false positives. Letting them run forever slows learning.
Plan duration in advance based on:
- Expected variance
- Minimum detectable effect
- Available traffic
If you need flexibility, approaches like sequential testing can help—but only if you understand the tradeoffs.
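As a starting point, the standard two-proportion sample-size formula can turn a baseline rate, a minimum detectable effect, and daily traffic into an expected duration. The sketch below assumes a two-sided 5% significance level and 80% power; the baseline, MDE, and traffic figures are illustrative.

```python
# Rough duration plan for a two-variant test on a conversion metric.
# Assumes a 5% significance level (two-sided) and 80% power; the baseline,
# minimum detectable effect, and traffic numbers are illustrative.
import math

def required_days(baseline_rate, relative_mde, daily_users,
                  z_alpha=1.96, z_beta=0.84):
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    # Standard sample-size formula for comparing two proportions.
    n_per_arm = ((z_alpha + z_beta) ** 2 *
                 (p1 * (1 - p1) + p2 * (1 - p2)) /
                 (p2 - p1) ** 2)
    total_users = 2 * math.ceil(n_per_arm)
    return math.ceil(total_users / daily_users)

# Example: 4% baseline conversion, aiming to detect a 10% relative lift,
# with 8,000 eligible users per day split 50/50.
print(required_days(baseline_rate=0.04, relative_mde=0.10, daily_users=8000))
```

Running the numbers up front also makes the earlier advice concrete: if the answer comes back as months of traffic, simplify the design or test a bigger change rather than hoping significance arrives early.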
Bonus: Plan for All Outcomes
Only 10–30% of experiments produce a clear winner. That’s normal.
High-performing teams plan for this reality before launching:
- Low-cost features may ship on directional evidence
- High-cost features require stronger confidence
- Neutral results still generate valuable learning
Experiments aren’t always about maximizing win rates. In some cases, they prevent huge losses. In other cases, their primary value is learning about user behavior.
Final Thought
Experimentation isn’t about proving you’re right. It’s about discovering what’s true.
Every experiment—even a neutral one—teaches you something about your users and your assumptions. Teams that stay curious, document learnings, and iterate deliberately are the ones that compound results over time.
That’s what turns experimentation into a real competitive advantage.
FAQ: Experimentation & A/B Testing in Practice
How do you decide whether an A/B test result is actionable?
When the results all point to the same decision, even after accounting for uncertainty. If you would still ship with the effect at the bottom end of the confidence interval, and you’ve collected a reasonable amount of data, ship.
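One simple way to operationalize this is to compute a confidence interval for the lift and ask whether you would still ship at its lower bound. The sketch below uses illustrative conversion counts and a normal-approximation 95% interval.

```python
# "Would we still ship at the bottom of the interval?" -- a quick check.
# Conversion counts below are illustrative.
import math

control_conv, control_n = 820, 20000
treat_conv, treat_n = 905, 20000

p_c, p_t = control_conv / control_n, treat_conv / treat_n
diff = p_t - p_c
se = math.sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treat_n)
low, high = diff - 1.96 * se, diff + 1.96 * se   # 95% CI for the absolute lift

print(f"lift: {diff:.2%}  (95% CI: {low:.2%} to {high:.2%})")
# If the change is cheap and you'd still ship with a lift near `low`
# (even if that's roughly zero), the result is actionable.
```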
Why are so many A/B test results inconclusive?
Because most product changes simply don’t meaningfully change behavior. Neutral results often reveal what users don’t care about, guiding better future experiments.
How long should an experiment run?
Long enough to reach sufficient statistical power—not until a metric looks good.
When should you ship a result that isn’t statistically significant?
For low-risk, low-cost changes with stable guardrails. High-risk features need stronger confidence.
What’s the biggest mistake teams make with experimentation?
Treating experimentation as validation instead of learning.