What is A/B Testing? The Complete Guide to Data-Driven Decision Making

Stop guessing what works and start knowing: A/B testing transforms hunches into data-driven decisions, turning every change on your website from a risky bet into a calculated experiment that can boost conversions without spending more on traffic.

A/B testing is simple in concept. Split your users, show them different experiences, and measure what happens.

In practice, A/B testing for product teams is rarely that clean. Real products have real constraints in tracking, assignment, and metric definition, quickly making a straightforward test complicated.

While low-velocity teams can absorb the occasional isolated mistake, high-volume experimentation requires mastering the fundamentals: at scale, small flaws compound into bad product decisions held with high confidence. Fortunately, these failure modes are well understood and avoidable.

What is A/B Testing

A/B testing, sometimes called split testing, is a randomized experiment in which multiple versions of something are shown to different groups simultaneously. Each group is measured against a defined metric to determine which performs better.

By randomly assigning units to each version, you control for external factors like seasonality, changes in traffic mix, and broader market conditions, so any difference in outcomes can be attributed to your change and nothing else.
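In practice, assignment is usually implemented as a deterministic hash rather than a coin flip, so the same user always sees the same variant across sessions. A minimal sketch of the idea (the function name and weights are illustrative, not any particular SDK's API):

```python
import hashlib

def assign_variant(user_id: str, experiment_key: str, weights=(0.5, 0.5)) -> int:
    """Deterministically map a user to a variant bucket.

    Hashing user_id + experiment_key gives each user a stable, roughly
    uniform value in [0, 1]; the weights carve that range into variants.
    """
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for variant, weight in enumerate(weights):
        cumulative += weight
        if bucket < cumulative:
            return variant
    return len(weights) - 1  # guard against floating-point edge cases

# The same user always lands in the same bucket for a given experiment:
print(assign_variant("user-42", "new-checkout"))
```

Because the hash includes the experiment key, the same user can land in different buckets for different experiments, which keeps experiments independent of each other.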

What Does A/B Testing Look Like in Practice

In product development, an A/B test runs alongside your normal release process. Rather than shipping a change to everyone at once, you expose a subset of your users to the new experience while the rest continue seeing the existing one. Both groups run simultaneously, and you measure the difference.

  1. Define a hypothesis including the metric you're testing against.
  2. Randomly split your audience into groups, each exposed to a different version.
  3. Analyze the difference between groups using a statistical framework.
  4. Ship the winning variant, or go back to the drawing board with what you learned.

Without that structure, you're left comparing against historical data. Consider a team that ships a new feature and watches new signups drop 8% over the following two weeks. They blame the release and roll it back, but signups stay flat. It turns out the drop was a seasonal dip that would have happened regardless of what was shipped, and now the team has spent a week in firefighting mode reverting a change that had nothing to do with the decline.

Or consider a team deciding between two redesigns of the same checkout flow. Rather than debating which one to ship, they test both against the current experience simultaneously. One variant performs similarly to the control. The other increases completed purchases by 12%. Without the test, that call comes down to whoever argues most convincingly in the design review.

Why Does A/B Testing Matter

For product teams, the value of A/B testing isn't just finding winning variants. It's making consequential decisions about how your product works based on what users actually do, rather than what your team thinks they'll do. 

It's also one of the few tools that gives teams the ability to push back on the HiPPO (the highest paid person's opinion) with something more than a gut feeling of their own. When the data says otherwise, it says so for everyone in the room.

The Critical Difference: A/B Testing vs. Intuition

Without A/B testing, product decisions tend to default to a familiar set of inputs: intuition, precedent, and whoever argues most convincingly. With A/B testing, those decisions are grounded in a more reliable input: what users actually did when shown each option.

What are the Benefits of A/B Testing in 2026

For modern product teams, the benefits of A/B testing go well beyond finding a winning variant. In 2026, with AI accelerating the pace of product development and raising the bar for what teams can ship, the cost of making bad product decisions has never been higher. Done consistently and rigorously, experimentation touches how teams make decisions, allocate resources, and understand their users.

1. Get More Value from Your Existing Traffic

Customer acquisition costs have climbed as much as 60% since 2023.

Getting more value out of the traffic you already have is increasingly a business necessity, and A/B testing is how you do it systematically.

2. Reduce the Risk of Rolling Out Major Changes

Every product change carries risk. A change can perform worse than expected for a variety of reasons: a bug that only surfaces under certain conditions, user behavior that didn't match your assumptions, or a change that worked well for one segment while degrading the experience for another. Without feature experimentation, you find out about these issues after the fact, when it has already reached your entire user base.

By feature flagging and exposing a change to a subset of users first, you limit the damage if something goes wrong. A variant that damages an important metric affects 10% of your traffic, not 100%. If it performs well, you can roll it out knowing what to expect. If it doesn't, you can roll it back before most of your users ever see it.

3. Speed Up Product Decision Making 

Product decisions are slow when they rely on opinion. Design reviews stretch into hours as stakeholders debate, and the person with the most seniority often wins, not because they're right, but because they're the loudest voice in the room.

Product experimentation changes how those conversations go. When you have data on how users actually behaved, the debate shifts from "I think" to "here's what we know." As one PM put it: "A/B testing turned our three-hour design debates into 30-minute data reviews."

That speed compounds over time. Teams that can make and validate product decisions faster than their competitors ship more, learn more, and course-correct before small mistakes become expensive ones.

4. Develop a Deeper Understanding of Your Users 

Every experiment tells you something about your users, whether it wins or loses. A variant that underperforms is still evidence. It tells you what your users don't respond to, which is often just as useful as knowing what they do.

Over time, that body of evidence becomes more valuable than any single test result. Teams that maintain a searchable archive of past experiments (GrowthBook does this automatically) stop asking "Didn't we already test this?" and start forming better hypotheses from the outset. This process builds a richer understanding of their users and how they actually behave, leading to better prioritization as the most impactful initiatives become clearer.

5. Uncover Surprising Insights

Not every valuable idea looks valuable before it's tested. A Microsoft engineer once ran a quick A/B test on a low-priority change to how Bing displayed ad headlines (an idea that had sat untouched for over six months). The test showed a 12% increase in revenue, which translated to more than $100 million annually in the US alone. It turned out to be the best revenue-generating idea in Bing's history and it almost never got tested at all.

These insights only surface when you have an A/B testing framework that makes it easy to ship any product change as a controlled experiment.

6. Build a Competitive Advantage

The teams that consistently outperform their competitors aren't necessarily the ones with the best ideas. They're the ones who can validate ideas faster and learn from failures.

Netflix is a well-documented example. The company runs experiments across virtually every aspect of its product, optimizing everything from thumbnails to recommendation algorithms to ensure that data (rather than opinion) drives decisions. That commitment to experimentation at scale is part of what allows a company of that size to keep iterating as fast as it does.

The more consistently you test, the better your decisions get, and the harder that advantage is for competitors to close.

Who Should Use A/B Testing (and Who Shouldn’t)

Most teams can benefit from A/B testing in some form. But the teams that get the most out of it tend to share a few things in common: enough volume to reach statistically meaningful results, the technical infrastructure to instrument changes correctly, and decisions that are frequent enough to make a testing practice worthwhile.

A/B testing isn't the right tool in every situation. There are a few conditions where it will either produce unreliable results or simply won't be worth the investment.

What Can You A/B Test?

Most teams start experimenting with the most visible parts of their product and stop there. The reality is that if a change can be measured and randomly assigned, it can be tested. That applies as much to a ranking algorithm or a model prompt as it does to a button label or a checkout flow, and the most sophisticated experimentation programs treat almost every product change as a candidate for a controlled experiment.

User-Facing Product Experiences

Changes to what users see and interact with directly are often the easiest to instrument, the most straightforward to design a clean experiment around, and the most immediately connected to the metrics product teams care about.

Copy and Messaging

The words you use to describe your product, explain a feature, or prompt an action affect how users respond in ways that are hard to predict without testing. This includes headlines, body copy, error messages, empty states, and tooltips. Copy that works well in one context often fails in another, which makes experimentation more reliable than intuition.

Visual Design Elements

Colors, typography, imagery, iconography, and visual hierarchy all affect how users perceive and engage with a product. These elements are worth testing on high-traffic acquisition surfaces where visual choices directly affect first impressions and conversion.

Social Proof and Trust Signals

The placement, format, and type of social proof affects how users evaluate whether to take action. Testimonials, review counts, trust badges, and case study callouts are all worth testing at high-stakes moments in the user journey, like pricing pages or checkout flows, where trust is a meaningful factor in the decision.

Calls to Action

Button text, placement, size, and visual weight all affect whether users take the action you want. The difference between "Start free trial" and "Get started" may seem trivial, but it can produce measurable differences in click-through and conversion rates.

Forms and Data Collection

The number of fields, their order, their labels, and how validation errors are presented all affect completion rates. For teams with signup flows, checkout processes, or any other form-gated experience, this is a productive area for experimentation.

Layout and Navigation

How you organize and present information affects how users move through a product and what they do next. Single versus multi-column layouts, card versus list views, menu structure, and the placement of key actions relative to supporting content are structural decisions that are harder to get right through intuition alone.

Onboarding Flows

What happens in a user's first few sessions shapes everything that comes after. Changes to the number of steps, the order of actions, or the point at which users are asked to commit to something can have measurable downstream effects on activation and retention metrics.

Pricing and Packaging Display

How you present pricing affects conversion without changing the underlying price. Tier ordering, anchoring, and the framing of free versus paid features are all worth testing for any team with a monetization surface, though the effects can take time to manifest.

Backend and Infrastructure

The most impactful experiments a product team can run are often invisible to users. A change to a ranking algorithm or a model prompt can affect user behavior just as much as a redesigned interface, and without a controlled experiment, the effect is nearly impossible to isolate.

Infrastructure and Performance

Performance improvements are generally good for users, but testing them as controlled experiments lets you quantify exactly how much they matter for the metrics you care about. Knowing which specific infrastructure investments moved conversion by 3% and which didn't gives teams a more reliable basis for deciding where to invest next.

Default Settings and Configurations

Most users never change defaults, which means the state you ship with has an outsized effect on how a feature gets used. Testing different default configurations is low-cost to implement and can meaningfully affect adoption and engagement.

Notification Timing and Content

Both the notification you send and what it says affect whether users engage with it. Testing send timing, message length, and the specific action you're prompting can improve open rates and click-through without increasing notification volume.

Product Features and Functionality

Beyond how a feature looks, you can test how it behaves. The results often reveal that users interact with the functionality in ways that don't align with the original design assumptions, which is useful information regardless of which variant wins.

Search and Discovery

Search ranking, autocomplete behavior, and filtering defaults all affect whether users find what they're looking for. Search is often a high-intent surface where small improvements in relevance or presentation directly affect conversion or engagement.

Algorithms and Ranking

Ranking and recommendation algorithms affect every user simultaneously, which makes them worth testing carefully. Small changes to the underlying logic can produce meaningful differences in engagement and retention that aren't visible until you measure them.

AI and ML Models

AI and ML models are particularly hard to evaluate without controlled experiments. A model that scores better on benchmarks doesn't always perform better in production, which makes A/B testing AI systems the most reliable way to know for sure. Performance, quality, and speed are all important to test, and even slight changes to system prompts warrant in-depth testing.

Growth and Acquisition Surfaces

Growth and acquisition surfaces are where most teams first encounter A/B testing, and for good reason. The metrics are clear, the feedback loops are short, and the tests are relatively cheap to run compared to changes deeper in the product.

Email Campaigns

Subject lines, send timing, message length, preview text, and calls to action all affect whether users open, click, and convert. Email is one of the more forgiving surfaces for experimentation because tests are cheap to run and results come in quickly, making it a good starting point before moving into more complex product surfaces.

Ad creative, copy, targeting parameters, and landing page destinations all affect cost per acquisition and return on ad spend. Testing these systematically rather than relying on platform optimization alone gives teams more control over what's actually driving performance and makes it easier to apply what you learn across campaigns.

Landing Pages

Landing pages connect acquisition and product, which makes them worth testing carefully. Headline copy, hero imagery, social proof placement, form length, and page structure all affect conversion, and improvements here affect the efficiency of every upstream acquisition channel.

Mobile App Stores (ASO)

App store listings are a testable surface that many teams overlook. Screenshots, preview videos, descriptions, and icon design all affect install rates, and both the App Store and Google Play offer native tools for running controlled tests on these elements.

Internal Tools and Systems

Most teams think of A/B testing as something you do on user-facing surfaces. Internal tooling is worth the same rigor. The workflows your team uses, the interfaces they navigate, and the systems that handle billing and support all affect business outcomes in measurable, improvable ways.

Billing Systems

When and how you charge users affects conversion, retention, and revenue in ways that aren't always intuitive. Charge timing, trial length, grace periods, and dunning flows are all worth testing, and the effects can be substantial even when the changes seem minor.

Customer Success

The interfaces and workflows your support team uses directly affect both resolution times and the experience customers receive on the other end. Testing different queue structures, response templates, or escalation flows can surface improvements that are invisible from the outside but meaningful to the people doing the work and the customers they're helping.

Dashboard and Reporting Interfaces

How data is presented to internal users affects the decisions they make. Testing different visualizations, metric groupings, or alert thresholds can improve how quickly teams identify issues and act on them.

Internal Search and Navigation

How employees find information and move through internal tools affects productivity in ways that are easy to underestimate. Testing search ranking, navigation structure, and information hierarchy in internal tools follows the same principles as product experimentation, just with a different user base.

Workflow and Process Design

Internal processes are testable too. Whether it's the order of steps in an approval flow, the default assignee for a task, or the trigger conditions for an automated action, small changes to how work moves through a system can have measurable effects on speed and accuracy.

Different Types of A/B Tests

Not all experiments are structured the same way. The standard A/B test is the right tool for most situations, but there are different types of A/B tests for different situations.

A/A Test

An A/A test runs two identical variants against each other. The purpose isn't to find a winner but to confirm your experimentation infrastructure is working correctly. Check a range of metrics to confirm that data is flowing correctly and that users are being assigned to each variant in the expected ratio. Expect roughly 1 in 20 metrics to show a statistically significant difference at a 95% confidence level; that's the false positive rate working as designed.
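You can see that 1-in-20 rate directly by simulating A/A tests: run many pairs of identical variants through a standard two-proportion z-test and count how often a "significant" difference appears. A self-contained sketch with synthetic data:

```python
import math
import random

def two_proportion_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (normal approximation)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # 2 * upper normal tail

random.seed(7)
runs, n, true_rate = 500, 2000, 0.10
false_positives = 0
for _ in range(runs):
    a = sum(random.random() < true_rate for _ in range(n))
    b = sum(random.random() < true_rate for _ in range(n))  # identical "variant"
    if two_proportion_pvalue(a, n, b, n) < 0.05:
        false_positives += 1
print(false_positives / runs)  # close to 0.05, as expected
```

If an A/A simulation (or a real A/A test) shows significant differences much more often than 5%, something upstream is broken: assignment, tracking, or the statistical machinery itself.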

A/B/n test

An A/B/n test extends the standard A/B test to include multiple variants tested simultaneously against a single control, letting you evaluate several hypotheses in one experiment rather than running them sequentially. Each additional variant requires more units to reach significance, so sample size requirements scale with the number of variants. If you have enough traffic, multi-variant tests are a great way to accelerate learning.

Multivariate Test

A multivariate test changes multiple elements simultaneously and tests combinations of them. If you're testing two headlines and two button colors, a multivariate test runs all four combinations to understand not just which elements perform better individually, but how they interact. The tradeoff is that you need considerably more traffic than a standard A/B test, because the population is split across every combination.
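To make the traffic requirement concrete: the arms of a multivariate test are the cross product of the elements under test, so each added element multiplies the number of arms. (The headline and color values below are made up for illustration.)

```python
from itertools import product

headlines = ["Start your free trial", "See it in action"]  # hypothetical copy options
button_colors = ["green", "blue"]

# Each combination becomes its own arm, so traffic is split four ways.
variants = list(product(headlines, button_colors))
for i, (headline, color) in enumerate(variants):
    print(f"Arm {i}: headline={headline!r}, button={color!r}")
```

Two headlines and two colors already split traffic four ways; adding a third element with two options would split it eight ways, which is why multivariate tests demand so much more volume.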

Holdouts

A holdout test withholds a feature from a group of users after it has been fully rolled out to everyone else. The holdout group continues to see the old experience, which lets you measure the long-term effect on retention and engagement that takes time to manifest. A new onboarding flow might look neutral in a two-week test but show meaningful differences in retention at 90 days. Holdouts are also useful for measuring the cumulative effect of many experiments running simultaneously. By comparing the holdout group to the fully treated population over 3–6 months, you can measure the combined effect of all your experiments.

Statistical Approaches to A/B Testing

Most modern experimentation platforms, like GrowthBook, give you a choice between Bayesian and frequentist statistics. Both are good options, but understanding the differences can help you decide which approach is best for you.

Bayesian Statistics

Bayesian statistics handles hypothesis testing by expressing results as probabilities. Instead of a binary significant/not-significant decision, you get a probability distribution: what's the chance variant B is better than variant A, and by how much? This makes results easier to interpret and communicate to non-technical stakeholders. Bayesian methods can also incorporate prior beliefs about the metric being tested, helping avoid over-interpreting results from small samples.
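A minimal sketch of the Bayesian approach for conversion rates, using a Beta-Binomial model with a flat prior and Monte Carlo sampling. The conversion counts are made up for illustration, and this is the textbook model rather than any platform's exact implementation:

```python
import random

# Hypothetical observed data: control converted 480/10,000, variant 540/10,000.
conv_a, n_a = 480, 10_000
conv_b, n_b = 540, 10_000

# With a flat Beta(1, 1) prior, each rate's posterior is Beta(successes+1, failures+1).
# Monte Carlo: draw from both posteriors and count how often B beats A.
random.seed(1)
samples = 20_000
b_wins = sum(
    random.betavariate(conv_b + 1, n_b - conv_b + 1)
    > random.betavariate(conv_a + 1, n_a - conv_a + 1)
    for _ in range(samples)
)
print(f"P(B > A) = {b_wins / samples:.1%}")  # a direct probability statement
```

The output is the kind of statement stakeholders find intuitive: "there's a roughly 97% chance the variant's true rate is higher," rather than a pass/fail significance verdict.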

Benefits of Bayesian Statistics

  - Results are expressed as intuitive probabilities ("there's a 95% chance B beats A") that are easy to communicate to non-technical stakeholders.
  - Priors can stabilize estimates and reduce over-interpretation when samples are small.
  - Results can be read continuously as data accumulates, rather than only at a fixed endpoint.

Drawbacks of Bayesian Statistics

  - Choosing a prior is a judgment call, and a poorly chosen prior can bias results.
  - Less familiar to stakeholders and reviewers trained on p-values and significance thresholds.

Frequentist Statistics

Frequentist statistics is the more traditional approach to hypothesis testing. It calculates the probability of observing your results if there were no real difference between variants. That probability is the p-value, and it gets compared against a predetermined significance threshold, typically 0.05.
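A minimal sketch of the same kind of comparison in frequentist terms, using a two-proportion z-test with a normal approximation (the conversion counts are made up):

```python
import math

def z_test_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value: how likely is a difference this large if there's no real effect?"""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # 2 * upper normal tail

p = z_test_pvalue(480, 10_000, 540, 10_000)
print(f"p = {p:.3f}")  # compare against the 0.05 threshold before declaring a winner
```

With these synthetic numbers the p-value lands just above 0.05, so a strict frequentist analysis would decline to declare a winner even though the variant looks ahead.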

Benefits of Frequentist Statistics

  - The established standard in science and industry, so results are easy to defend.
  - No prior to choose: the analysis depends only on the observed data.
  - Well-understood extensions, like sequential testing, for peeking at results safely.

Drawbacks of Frequentist Statistics

  - P-values are widely misinterpreted as "the probability the variant is better," which they are not.
  - Checking results early and acting on them ("peeking") inflates false positives unless you use sequential corrections.
  - The binary significant/not-significant framing can obscure effect size and uncertainty.

Concepts Shared by Both Bayesian and Frequentist Statistics

Despite their differences, Bayesian and frequentist statistics share many common concepts: randomized assignment, clearly defined metrics, sample size and power considerations, and an interval (credible or confidence) that quantifies the uncertainty around the estimated effect.

Which Statistical Approach Should You Use?

Use Bayesian when you want probability-based results or have well-established priors that can reduce uncertainty in smaller samples.

Use Frequentist when results need to meet an established statistical standard, or when you want to enable sequential testing. 

Step-By-Step A/B Testing Process

How you plan and run a test determines whether the results can actually be trusted. Here’s the step-by-step process from developing a hypothesis all the way through to implementing a winning variant. 

Step 1: Research and Identify Opportunities

Good experiments start with a clear understanding of where the opportunity is. For product development teams, that usually means looking at where users drop off, where engagement is lower than expected, or where there's a meaningful gap between how a feature was designed to be used and how it's actually used.

Start with quantitative data like funnel drop-off rates and feature adoption rates to identify potential opportunities, then use qualitative data like user interviews, support tickets, and session recordings to better understand the situation.

How to Prioritize Experiments

Not every problem is worth testing. The best starting point is your team's current roadmap and goals. If you're focused on improving activation this quarter, test things that affect activation. Experiments that don't connect to what your team is actively solving are a distraction, however interesting the hypothesis.

Before committing, use an objective scoring system or prioritization framework like ICE to evaluate each opportunity:

  - Impact: How much will this move the metric if it works?
  - Confidence: How strong is the evidence behind the hypothesis?
  - Ease: How much effort will it take to build and run the test?

Step 2: Form a Strong Hypothesis

A good hypothesis forces you to be specific about what you're changing, why you expect it to work, and how you'll know if it did.

A weak hypothesis sounds like: "Let's try a shorter onboarding flow." 

A strong one sounds like: "Reducing the onboarding flow from five steps to three will increase 7-day activation because users are dropping off at step three."

Here are a few more examples for weak and strong hypotheses. 

| Weak Hypothesis | Strong Hypothesis |
| --- | --- |
| The recommendation widget will increase sales. | Adding a personalized recommendation widget below the product description will increase average order value because users who see relevant product suggestions will add more items to their cart before checkout. |
| We will surface errors more clearly. | Replacing generic error messages with specific guidance on how to fix the issue will increase form completion rates, because users currently abandon forms at the error state without understanding what went wrong. |
| We think the pricing page is confusing. | Replacing the feature comparison table with a use-case based pricing guide will increase trial conversion, because users in exit surveys say they can't determine which plan is right for them. |
| Dark mode would probably improve engagement. | Adding a dark mode option to the dashboard will increase daily active usage among power users because power users spend more than 4 hours per day in the product and have requested this feature in support tickets. |

Use this structure as a starting point for writing your own hypotheses:

[Specific change] will cause [measurable effect] because [reasoning based on research].

Step 3: Design Your Experiment

Most of the work in running a good experiment happens before you launch. The decisions you make at the experimental design stage will determine how useful your experiment is.

Define Your Measurement Criteria

Before you build anything, be clear on what you're measuring and why. Your primary metric should flow directly from your hypothesis. It's the specific effect you expect to see. If your hypothesis is that reducing onboarding steps will improve 7-day activation, then 7-day activation is your primary metric. 

Here’s what each metric might be for our onboarding experiment example. 

Hypothesis: Reducing the onboarding flow from five steps to three will increase 7-day activation rate by 15%, because users are dropping off at step three.

Primary Metric: 7-day activation rate

Secondary Metrics: 30-day retention rate; time to complete the onboarding flow; number of core features activated within 7 days

Guardrail Metrics: invited users (the number of referrals should at least stay the same compared with the control); support tickets related to onboarding confusion; error rate on onboarding steps

Calculate Your Required Sample Size

The best way to ensure good decision making with experiments is to know how much data you need up front. Running an experiment without a sample size calculation is one way to end up not knowing whether you can trust your results or when to end an experiment. Most modern experimentation platforms include a power calculator. You'll need four inputs:

  - Baseline value of your primary metric (e.g. the current conversion rate)
  - Minimum detectable effect: the smallest change worth detecting
  - Significance level: the acceptable false positive rate, typically 5%
  - Statistical power: the chance of detecting a true effect, typically 80%

The calculator will tell you how many units you need per variant. Divide by your average daily volume of that unit to get your required duration. The unit might be daily active users, daily email sends, or accounts, depending on what you're randomizing on. Make sure you're calculating based only on the population that meets your targeting criteria, not your total user base.

Many tests should run for at least two full business cycles, typically two weeks minimum, to account for day-of-week behavior patterns even if you reach your sample size sooner.
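If you don't have a calculator handy, the standard normal-approximation formula for a two-proportion test can be sketched in a few lines. The baseline and lift figures below are illustrative:

```python
import math
from statistics import NormalDist

def required_sample_per_variant(baseline, mde_rel, alpha=0.05, power=0.8):
    """Approximate per-variant sample size for a two-proportion test.

    baseline: current conversion rate (e.g. 0.05 for 5%)
    mde_rel:  minimum detectable effect, relative (e.g. 0.10 for a +10% lift)
    """
    p1, p2 = baseline, baseline * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2
    return math.ceil(n)

# Detecting a 10% relative lift on a 5% baseline takes roughly 31,000 users per variant.
n = required_sample_per_variant(0.05, 0.10)
print(n)
```

Notice how quickly the requirement grows as the baseline or the detectable effect shrinks; halving the minimum detectable effect roughly quadruples the required sample.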

Designing for Trustworthy Results

Careful implementation is crucial to running a clean causal experiment and learning what you actually set out to learn.

Step 4: Set Up Your Experiment and Validate the Implementation

Before you launch your experiment, validate that your experiment is configured correctly. Problems caught here are easy to fix. Problems caught after two weeks of bad data are not.

Step 5: Launch and Monitor

Once your experiment is live, your job is mostly to leave it alone. The temptation to check results early is real, especially when there's pressure to ship, but acting on interim results is one of the most common ways teams produce conclusions they can't trust.

Monitor only for:

  - Implementation bugs and broken tracking
  - Sample ratio mismatch (variants receiving traffic in unexpected proportions)
  - Severe regressions in guardrail metrics

Everything else can wait until the test reaches its required sample size. If you need the flexibility to act on results before that point, enable sequential testing.

Step 6: Analyze Results Properly

When your experiment reaches its required sample size, resist the urge to declare a winner immediately. Good analysis goes beyond the binary question of whether the variant beats the control.

Some metrics require additional waiting time even after the experiment ends. For example, if you're measuring 7-day activation, you need to wait seven days after the last user was exposed before you can analyze that metric. Build this into your timeline upfront.

When it’s time to analyze the results:

  - Check for sample ratio mismatch and data quality issues before interpreting anything
  - Look at guardrail and secondary metrics, not just the primary metric
  - Segment results cautiously; slicing after the fact inflates false positives
  - Record the outcome and what you learned, even if the variant lost

Step 7: Implement and Iterate

Every experiment produces an outcome worth acting on, even when your hypothesis is proven wrong.

Advanced A/B Testing Strategies 

Once the fundamentals are in place, these advanced techniques can create additional value as your program matures and the questions you're trying to answer get harder.

CUPED 

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique that uses pre-experiment metric data to improve the accuracy of your results. By accounting for pre-existing differences between users before the experiment starts, it reduces the noise in your estimates, meaning you can detect smaller effects with the same traffic, or reach the same level of confidence faster.

GrowthBook's implementation extends CUPED with post-stratification, which uses user attributes like country or plan tier to further reduce variance by isolating the treatment effect from natural differences between groups. The more correlated your pre-experiment data and attributes are with the metric you're measuring, the more variance reduction you'll see.

The main requirement is that you have pre-experiment data for the metric you're testing. It works best for metrics that are frequently observed (engagement rates, session counts, revenue) and is less effective for new users or rare events where there's little pre-experiment history to draw on.

Example: Netflix reported CUPED reduced variance by roughly 40% for some key engagement metrics. Microsoft reported it was equivalent to adding 20% more traffic for a majority of metrics on one product team.
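The core adjustment is only a few lines. A sketch on synthetic data, where each user's in-experiment metric is correlated with their pre-experiment value of the same metric:

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

random.seed(3)
n = 5000
pre = [random.gauss(10, 3) for _ in range(n)]        # pre-experiment metric (synthetic)
post = [0.8 * p + random.gauss(0, 1) for p in pre]   # correlated in-experiment metric

# theta = cov(post, pre) / var(pre). Subtracting theta * (pre - mean(pre))
# removes the variance explained by pre-experiment behavior without shifting the mean.
m_pre, m_post = mean(pre), mean(post)
cov = sum((x - m_pre) * (y - m_post) for x, y in zip(pre, post)) / (n - 1)
theta = cov / variance(pre)
adjusted = [y - theta * (x - m_pre) for x, y in zip(pre, post)]

print(variance(post), variance(adjusted))  # adjusted variance is dramatically smaller
```

In a real experiment you'd apply the same adjustment within each variant before comparing them; the stronger the correlation between pre- and in-experiment values, the bigger the variance reduction.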

Quantile Testing

Most A/B tests compare means across variants, which works well when the effect is evenly distributed across users. Quantile testing compares percentiles instead, making it the right tool when you care about what's happening at the extremes. A change that improves average page load time by 50ms might look neutral on a mean test while actually fixing a severe performance problem affecting your slowest 1% of users.

The main consideration is sample size. Extreme quantiles (P99, P99.9) require large samples to produce reliable estimates. It also works best when you have a clear hypothesis about which part of the distribution you're trying to move.

Example: An engineering team testing a backend optimization uses a P99 latency metric to confirm the change reduced worst-case load times by 7ms, even though the mean improvement was too small to detect.
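A small simulation makes the difference between mean and quantile views concrete. The latency distributions below are synthetic: the control has a 1.5% chance of a roughly 2-second stall that the variant mostly eliminates, while typical requests are barely affected:

```python
import random

def quantile(xs, q):
    """Nearest-rank percentile of a sample."""
    xs = sorted(xs)
    return xs[int(q * (len(xs) - 1))]

random.seed(5)
control = [random.gauss(200, 20) + (2000 if random.random() < 0.015 else 0)
           for _ in range(20_000)]
variant = [random.gauss(200, 20) + (2000 if random.random() < 0.001 else 0)
           for _ in range(20_000)]

p50_diff = quantile(control, 0.50) - quantile(variant, 0.50)
p99_diff = quantile(control, 0.99) - quantile(variant, 0.99)
print(round(p50_diff), round(p99_diff))  # the median barely moves; P99 collapses
```

A mean or median comparison would call this change nearly neutral; the P99 comparison shows it fixed a severe problem for the slowest users.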

Multi-Armed Bandits

A multi-armed bandit is an adaptive experiment that shifts traffic toward better-performing variants as data comes in, rather than maintaining a fixed split throughout. Unlike a standard A/B test, which waits until the end to declare a winner, a bandit continuously reallocates traffic based on which variant is performing best on a single decision metric. GrowthBook uses Thompson sampling, a Bayesian algorithm that balances exploration (testing all variants) with exploitation (sending more traffic to the best performer).

Bandits work best when you have a clear single metric to optimize, five or more variants to test, and care more about minimizing exposure to poor-performing variants than understanding why each one performed the way it did. They're less suited to situations with long feedback loops, multiple goal metrics, or where statistical rigor matters more than speed.

Example: An ecommerce team testing five different product page layouts uses a bandit to automatically shift traffic toward the best-performing variant. This allows them to quickly capitalize on a winner during time-sensitive, days-long promotions like a Black Friday sale, while also reducing the number of users exposed to lower-converting layouts.
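A minimal Thompson sampling loop for conversion-style metrics looks like this. The true rates are synthetic and unknown to the algorithm, and this is a sketch of the general technique rather than GrowthBook's exact implementation:

```python
import random

random.seed(11)
true_rates = [0.040, 0.050, 0.065]  # hidden conversion rates per variant
wins = [0, 0, 0]                    # conversions observed per arm
losses = [0, 0, 0]                  # non-conversions observed per arm

for _ in range(30_000):
    # Draw a plausible rate from each arm's Beta posterior; play the best draw.
    sampled = [random.betavariate(wins[i] + 1, losses[i] + 1) for i in range(3)]
    arm = max(range(3), key=lambda i: sampled[i])
    if random.random() < true_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1

traffic = [wins[i] + losses[i] for i in range(3)]
print(traffic)  # traffic concentrates on the highest-converting arm
```

Early on the posteriors overlap and traffic is spread across all arms (exploration); as evidence accumulates, the draws from the best arm win more often and it absorbs most of the traffic (exploitation).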

Cluster Experiments

Most experiments randomize at the user level, but some products require randomization at a coarser level of granularity. In B2B software, for example, you might need everyone at a company to see the same experience. Showing different variants to different users within the same organization would create confusion and contaminate results. Cluster experiments solve this by randomizing at the group level (the organization, the school, the household) while still analyzing outcomes at the individual level.

The main challenge is that cluster-level randomization reduces your effective sample size. You're randomizing across a smaller number of clusters than individual users, which means you need more clusters to reach significance. GrowthBook supports cluster experiments natively; its statistics engine handles the complexity of analyzing at a different level than you randomize.

Example: A B2B SaaS team testing a new dashboard layout randomizes at the organization level so every user within a company sees the same variant, then analyzes individual user engagement to measure impact.
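One common way to quantify that reduction is the design effect, 1 + (m - 1) * ICC, where m is the average cluster size and ICC is the intra-cluster correlation. A quick sketch with illustrative numbers:

```python
def effective_sample_size(n_users, avg_cluster_size, icc):
    """Approximate effective sample size under cluster randomization.

    icc: intra-cluster correlation, i.e. how similar users within the
    same cluster behave (0 = fully independent, 1 = identical).
    """
    design_effect = 1 + (avg_cluster_size - 1) * icc
    return n_users / design_effect

# 50,000 users spread across organizations averaging 100 users each:
# even a modest ICC of 0.05 cuts the effective sample by a factor of ~6
print(round(effective_sample_size(50_000, 100, 0.05)))  # ~8,403
```

In other words, 50,000 users behave statistically more like 8,000 independent ones, which is why cluster experiments need many clusters, not just many users.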

Full-Funnel Testing

Most experiments measure a single metric at a single point in the user journey. Full-funnel testing measures the effect of a change across multiple stages, from initial conversion through to retention, revenue, and long-term engagement. This matters because a change that looks positive at the top of the funnel can have neutral or negative downstream effects that a single-metric test would miss entirely.

The main requirement is having metrics instrumented across the full user journey and enough traffic to detect meaningful differences at each stage. It also requires patience — downstream metrics like 30-day retention take time to manifest, which means full-funnel tests run longer than standard conversion tests.

Example: A team testing 7-day versus 14-day free trial lengths measures not just trial starts but 30-day conversion to paid, finding that the longer trial increased signups but reduced urgency to convert, producing a net negative revenue impact.

Long-Term Holdouts

Individual experiments measure the impact of a single change. Long-term holdouts measure the cumulative impact of all your changes over time. A small group of users is withheld from new features and experiments for an extended period, typically a quarter, while the rest of the product moves forward. Comparing the holdout group to the general population reveals the true long-term value of everything you shipped, including any unexpected interactions between features that individual tests couldn't detect.

The main tradeoff is that a small percentage of users (typically around 5%) experience a degraded product for the duration. 

Example: A product team runs a quarterly holdout and discovers that the cumulative lift from five experiments, each with a 1% lift, is only 3% relative to the holdout group because of diminishing returns.
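The example above is easy to sanity-check: if five 1% lifts stacked cleanly, the compounded expectation would be about 5.1%, so measuring 3% against the holdout means roughly two points of expected lift never materialized. A quick sketch:

```python
def compounded_lift(lifts):
    """Naive expectation if per-experiment lifts stacked multiplicatively."""
    total = 1.0
    for lift in lifts:
        total *= 1 + lift
    return total - 1

naive = compounded_lift([0.01] * 5)
print(f"Naive compounded expectation: {naive:.1%}")  # 5.1%
print("Measured against the holdout: 3.0%")
```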

Incorporating Research

A/B tests tell you what happened, but they can’t tell you why. Combining quantitative experiment results with supplemental research (user interviews, session recordings, surveys, usability testing) gives you both. A variant that wins on conversion but generates support tickets is a signal worth investigating. A variant that loses might reveal through user interviews that the hypothesis was right but the execution was wrong.

Supplemental research is most valuable at two points: before an experiment, to sharpen the hypothesis, and after an inconclusive or surprising result, to understand what the data couldn't tell you and to help generate your next hypothesis.

Example: A team runs an experiment on a new onboarding flow that produces an inconclusive result. User interviews reveal that users understood the new flow better but felt uncertain about committing without seeing the product first, leading to a new hypothesis worth testing.

Common A/B Testing Mistakes (and How to Avoid Them)

Every experimentation program makes mistakes. Learning to recognize common experimentation pitfalls is the first step to not repeating them.

1. Running Experiments Without Big Enough Samples

An underpowered experiment is one that doesn't have enough units to reliably detect the effect you're looking for. It happens when teams skip the power analysis and launch tests without knowing whether their population size or expected traffic is sufficient. Without enough data, you'll either get an inconclusive result or one that looks real but isn't stable enough to act on.

Example: Running tests on pages with fewer than 1,000 visitors per week.

Solution: Focus on the highest-traffic pages or make bolder changes that require smaller samples to detect.
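To make "big enough" concrete, a standard two-proportion power calculation (with an illustrative 5% baseline, 10% relative lift, and the usual alpha = 0.05 at 80% power) looks like this:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, rel_mde, alpha=0.05, power=0.80):
    """Users needed per variant to detect a relative lift (rel_mde) on a
    conversion rate of `baseline`, using a two-sided two-proportion z-test."""
    p1, p2 = baseline, baseline * (1 + rel_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    pooled = (p1 + p2) / 2
    n = ((z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
          + z_power * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
         / (p2 - p1) ** 2)
    return ceil(n)

# Detecting a 10% relative lift on a 5% baseline conversion rate
print(sample_size_per_variant(0.05, 0.10))  # ~31,000 per variant
```

At roughly 31,000 users per variant, a page with 1,000 visitors per week would need over a year to finish a single test, which is exactly why low-traffic surfaces push you toward bolder changes.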

2. Testing Changes That Are Too Small

A change that's too small to produce a detectable effect is a change that's too small to test. It happens when teams focus on incremental tweaks (like a slightly different button shade or a minor copy change) rather than changes that are likely to meaningfully affect user behavior. When these tests do reach significance, the effect size is often too small to justify implementing, and every underpowered test on a trivial change is a test you didn't run on something that could actually move your metrics.

Example: Testing the order of navigation menu dropdowns when users don't understand what your product does.

Solution: Match the boldness to your traffic volume. Smaller sample sizes need bigger swings.

3. Stopping Tests Early

Stopping a test before it reaches its required sample size is one of the most common ways teams produce results they can't trust. It happens when interim results look promising, and there's pressure to ship. The numbers seem to confirm the hypothesis, so stopping feels justified. The problem is that early data is noisier than final data, and a result that looks significant at day five may look very different at day twenty. Stopping early inflates your false positive rate, meaning you'll ship changes that don't actually work.

Example: Ending tests as soon as the p-value hits 0.05.

Solution: Predetermine the sample size and duration, and stick to them. If you need the flexibility to monitor continuously, enable sequential testing.
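A small simulation makes the peeking problem visible. Under an assumed null of no real difference between A and B, checking a z-test once at a fixed horizon holds the false positive rate near 5%, while stopping at the first p < 0.05 across repeated interim checks inflates it well beyond that (batch sizes and check counts here are purely illustrative):

```python
import math
import random

def z_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

RATE, BATCH, CHECKS = 0.10, 200, 10  # A and B share the SAME true rate

def run(peek):
    """Simulate one experiment under the null; True means a false positive."""
    a = b = na = nb = 0
    for _ in range(CHECKS):
        a += sum(random.random() < RATE for _ in range(BATCH)); na += BATCH
        b += sum(random.random() < RATE for _ in range(BATCH)); nb += BATCH
        if peek and z_pvalue(a, na, b, nb) < 0.05:
            return True  # stopped early on a "significant" result
    return z_pvalue(a, na, b, nb) < 0.05

random.seed(1)
trials = 300
print("False positives, fixed horizon:", sum(run(False) for _ in range(trials)) / trials)
print("False positives, peeking:      ", sum(run(True) for _ in range(trials)) / trials)
```

Exact numbers vary by seed, but the peeking rate typically lands several times higher than the nominal 5%, which is precisely the problem sequential testing corrections are designed to fix.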

4. Ignoring External Factors

External factors are events or conditions outside your product that affect user behavior during an experiment. It happens when teams run tests during atypical periods like a seasonal sale, a major product launch, or a news cycle, without accounting for how those conditions might skew results. A winning variant during an unusual period may reflect the context more than the change itself, and applying those results year-round can lead to poor decisions.

Example: Testing during Black Friday and assuming results can be replicated year-round.

Solution: Note external factors and retest important changes during normal periods before permanently implementing them.

5. Shopping Metrics for Significant Results

Metric shopping is when teams run an experiment against many metrics and report whichever ones show significance after the fact. It happens when teams don't define their primary metric before the experiment starts, leaving the door open to interpret results selectively once the data comes in. The more metrics you test, the more likely you are to find a false positive by chance, and a result that emerges from fishing through metrics is not a result you can act on confidently.

Example: Testing 20 metrics, hoping one shows significance.

Solution: Choose your primary metric before starting. Treat others as directional insights rather than stable conclusions.
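The math behind the risk is simple. Assuming the 20 metrics were independent, the chance of at least one spurious "winner" follows directly from the per-metric false positive rate, and a Bonferroni correction is one standard fix:

```python
alpha, metrics = 0.05, 20

# Chance of at least one spurious "winner" if each metric is tested at 5%
p_at_least_one = 1 - (1 - alpha) ** metrics
print(f"P(at least one false positive across {metrics} metrics): {p_at_least_one:.0%}")

# A Bonferroni correction keeps the family-wise error rate near alpha
corrected = alpha / metrics
print(f"Bonferroni-corrected significance threshold per metric: {corrected:.4f}")
```

With 20 metrics, there's about a 64% chance at least one shows significance by chance alone, which is why a pre-registered primary metric matters.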

6. Shipping a Winner Without Checking Segment Performance

Aggregate results can hide meaningful differences in how different groups of users respond to a change. It happens when teams declare a winner based on overall performance without breaking results down by segment. A new checkout flow might increase conversion overall but frustrate returning users who've built habits around the old one, or perform well on desktop while degrading the experience on mobile. Shipping without checking may help some users at the expense of others.

Example: The overall winner performs worse for important segments.

Solution: Always analyze results by key segments, like new versus returning users or mobile versus desktop, before implementing.

7. Ignoring Implementation Cost

Not all winning variants are worth shipping. It happens when teams evaluate experimental outcomes purely by metric lift, without accounting for the actual cost of building and maintaining the change. A variant that requires significant refactoring, introduces dependencies on other systems, or creates an ongoing maintenance burden may not be worth implementing even if the results are strong. The lift needs to justify not just the initial build but the long-term cost of owning the change.

Example: A lead categorization model can improve onboarding success, but implementing it requires rebuilding the underlying data model and all downstream dependencies.

Solution: Factor implementation cost into your hypothesis prioritization.

8. Optimizing for Short-Term Metrics

A test can show an increase in conversion while masking longer-term damage. It happens when teams optimize for metrics that look good in a two-week experiment window without considering what happens to users afterward. Dark patterns that trick users into taking actions they might not otherwise take can boost immediate conversion rates while increasing refund rates, reducing retention, and eroding brand trust over time. A confused user and a genuinely converted user can look identical in the short term.

Example: An EdTech team's test that shortened a mandatory course tutorial showed a 20% gain in time to first lesson completion, but decreased the final exam pass rate by 7%.

Solution: Use guardrail metrics to catch downstream damage. If a winning variant hurts retention or drives up support volume, it's not a real win.

9. Running One Test and Moving On

Experimentation compounds over time, but only if teams treat it as a continuous practice rather than a series of isolated events. The mistake happens when shipping a winner feels like the finish line. The variant performed better, the change ships, and the team moves on to the next project without asking what they learned or what to test next. A single experiment answers a single question, but the real magic comes from using that answer to sharpen the next hypothesis, and the next one after that.

Example: A new recommendation algorithm increases platform engagement, but no one investigates which content types drove the increase to further refine it.

Solution: Create a regular testing cadence with iteration built in.

10. Not Documenting Results

Without a record of what was tested, what the hypothesis was, and what the outcome was, institutional knowledge disappears when people leave, and teams waste time re-running experiments that have already been answered. It happens when documentation is treated as an afterthought rather than part of the process.

Example: A variant everyone was confident would win loses. A few months later, a different team has the same idea and runs the same test.

Solution: Maintain a searchable experiment archive and share learnings broadly.

How to Build a Culture of Experimentation

Winning a single A/B test is straightforward. The harder work is building an organization where evidence, not opinion, drives decisions consistently across teams, product areas, and levels of seniority.

An experimentation culture means controlled experiments are the default way of resolving uncertainty. Companies with a strong experimentation culture recognize that their win rate is only around 20%, and that they are terrible at predicting which features will win or lose. That insight creates humility and a determination to test everything. They recognize that the results of a test carry more weight than the instinct of the most senior person in the room, and that losing an experiment is useful information rather than a failure.

What a Strong Experimentation Culture Looks Like in Practice

In his book Experimentation Works, Harvard Business School Professor Stefan Thomke identifies seven attributes that characterize organizations where experimentation is genuinely embedded. They're worth understanding not as a checklist but as a description of what the mature state actually looks like.

  1. A Learning Mindset: Experimentation is treated as a continuous process, not a one-time validation. Most experiments won't produce dramatic results, and teams that have internalized this don't treat inconclusive results as wasted effort. They treat them as the cost of learning.
  2. Rewards Consistent With Values: Teams are rewarded for running good experiments, not just winning ones. When compensation is tied to metrics that make experimentation difficult, or when people are punished for null results, the culture quietly dies regardless of what leadership says about it.
  3. Humility: In a true experimentation organization, even the most senior person's assumptions get tested. Leadership's job shifts from making top-down calls to creating the conditions for good experiments and accepting what they find.
  4. Experiments Have Integrity: Strict guidelines govern how experiments are designed and run. This means pre-registered hypotheses, appropriate sample sizes, and agreed-upon metrics before the experiment starts, not after the results come in.
  5. Tools are Trusted: Experimentation only works if people trust what it produces. If teams routinely question the validity of results or find workarounds to avoid acting on them, the infrastructure exists, but the culture doesn't.
  6. Exploration and Exploitation are Balanced: There's an inherent tension between running experiments to learn and shipping product to grow. Organizations that only exploit what they already know stop learning. Organizations that only explore never ship. Senior leadership has to manage that balance deliberately.
  7. Leadership Actively Promotes It: Companies tend to become less innovative as they grow, as the distance between senior leadership and the teams doing the work increases. Experimentation cultures require leaders who actively champion the practice, not just endorse it in all-hands presentations.

Where Does Your Team Sit on the Experimentation Maturity Model?

Thomke and his colleagues describe five stages of experimentation maturity. These are a great way to honestly assess where your organization is and what it would take to move forward.

Stage 1: Awareness

At the awareness stage, leadership values experimentation, but there are no processes, tools, or infrastructure in place. Decisions are still mostly based on experience and intuition. If your team occasionally runs an experiment when a decision is particularly contested, this is probably where you are.

Stage 2: Belief

At the belief stage, leadership accepts that a more disciplined approach is needed and starts investing in tools and dedicated teams. The impact on day-to-day decision-making is still minimal, but the direction is set.

Stage 3: Commitment

At the commitment stage, experimentation becomes core to how the team operates. Some product decisions and roadmap calls now require data from experiments, and the impact on business outcomes is becoming measurable. 

Stage 4: Diffusion

At the diffusion stage, large-scale experimentation is recognized as necessary, and formal standards are rolled out across the organization, supported by tooling and training. Individual teams are no longer the bottleneck.

Stage 5: Embeddedness

At the embeddedness stage, experimentation is fully democratized. Teams design and run their own experiments without central oversight, results are shared automatically across the organization, and the institutional memory of past experiments actively informs new ones.

When Harvard Business School's Baker Research Services compared the stock performance of companies with strong experimentation cultures against the S&P 500 over ten years, those companies outperformed the index by a wide margin. The group included Amazon, Etsy, Facebook, Google, Microsoft, Booking Holdings, and Netflix, organizations that had spent years building the infrastructure and culture for large-scale experimentation.

The Future of A/B Testing

Experimentation is evolving fast. The tools and techniques available today look very different from what existed five years ago, and the next five years will likely bring even more significant shifts. A few trends worth paying attention to:

AI-Powered Testing

AI coding tools are accelerating development velocity in ways that are changing how product teams need to think about experimentation. When engineers can ship features faster, the volume of changes hitting production increases. More features shipping faster means more opportunities for something to hurt retention, conversion, or engagement before anyone catches it. Gradual rollouts and a rigorous experimentation practice matter more as shipping velocity increases.

There's also an entirely new category of things to test. Teams building AI-powered features like recommendation systems, content generation tools, and AI tutors face a challenge that standard A/B testing wasn't designed for. LLMs are non-deterministic: the same input doesn't always produce the same output, and measuring quality requires different metrics than measuring clicks or conversions. Testing whether one model prompt produces better learning outcomes than another requires an experimentation platform that can handle that kind of measurement.

GrowthBook's approach is to accelerate every step of the experimentation lifecycle from directly within the tools developers already use. AI integrations built into the platform include automatic results summaries, hypothesis validation before a test launches, similar experiment detection using vector embeddings, metric definition generation, and SQL generation for data exploration. MCP integration lets you connect your own tools and agents directly to GrowthBook via the MCP server.

Learn more about how to test AI with this practical guide.

Real-Time Personalization

Traditional A/B testing delivers the same experience to everyone in a variant. The next evolution is moving beyond fixed variants toward delivering the optimal experience for each individual user in real time, based on their behavior, context, and predicted response. Multi-armed bandits are an early version of this idea, but the direction is toward much more granular personalization.

Causal Inference

As experimentation programs mature, teams are increasingly using advanced statistical methods to understand cause and effect more precisely, particularly in situations where traditional randomized experiments are difficult or impossible to run. Techniques like difference-in-differences, synthetic control, and instrumental variables are becoming more accessible to product and data teams.

Cross-Channel Orchestration

Many experimentation programs are siloed by channel. A web team runs web experiments, a mobile team runs mobile experiments, and the combined effect of changes across both is rarely measured. The direction is toward experimentation infrastructure that can orchestrate and measure tests across web, mobile, email, and other touchpoints simultaneously.

Privacy-First Experimentation

Privacy regulations and the deprecation of third-party tracking are forcing experimentation platforms to adapt. The approaches gaining traction are those that minimize data movement, work with aggregated rather than individual-level data, and can operate within strict compliance requirements. Platforms that support self-hosting are well-positioned for this shift.

How to Get Started with A/B Testing

Getting started with A/B testing doesn't require a mature experimentation platform or a dedicated data science team, but the setup decisions you make early will either accelerate or constrain your program as it grows.

Get Your Instrumentation Right

Before you can run reliable experiments, you need confidence that the metrics you care about are being tracked correctly and consistently. This means checking that your event logging is complete, that events fire consistently across platforms and devices, and that your data pipeline is reliable.

Skipping this step is one of the most common reasons early experimentation programs produce results no one trusts. A test result is only as good as the data behind it, and discovering instrumentation gaps after a test has run is a frustrating way to learn that lesson.

Start With One Test

Pick a high-traffic surface where you have a clear hypothesis and a metric you can measure. Don't start with the most complex change or the most ambitious idea. Start with something where the feedback loop is short, the instrumentation is straightforward, and you have a reasonable chance of seeing a result. Early wins help build organizational buy-in. Run the test for long enough to reach your required sample size, analyze the results honestly, and document what you learned regardless of the outcome.

Chances are, your team already has hypotheses worth testing. Look at support tickets, session recordings, and drop-off points in your funnel. If you want external inspiration, resources like GoodUI.org and the Baymard Institute publish evidence-based UX patterns that can serve as a starting point for simple but effective test ideas.

Find a Leadership Sponsor

Experimentation programs that stick have a senior champion: someone with enough organizational influence to protect the team's time, push back when results are inconvenient, and make the case for investing in the infrastructure. Without one, a single bad test result or a quarter of inconclusive experiments is enough to kill the program before it gets traction.

It’s OK if your sponsor isn’t technical, but they need to believe that making decisions based on evidence is worth the investment, and be willing to say so publicly when the HiPPO (the highest-paid person's opinion) in the room disagrees with the data.

Go Deeper

One of the most useful things you can do early is learn from teams that have already built mature experimentation programs.

Join the Community

Having access to people who have already solved the problems you're facing is one of the most underrated resources in experimentation, and the experimentation community loves to share knowledge and help each other out. 

How to Choose an A/B Testing Platform

The experimentation platform you choose now will shape your experimentation program for years. It determines what you can test, how fast you can move, and how much you can trust your results. And because experimentation infrastructure becomes deeply embedded in your codebase and data pipelines over time, switching platforms is expensive and disruptive enough that most teams avoid it, so it's worth getting right the first time.

Technical Fit and Developer Experience

The platform needs to work with how your team already builds. A tool that requires significant engineering lift to integrate, or doesn't support your tech stack, will create friction from day one and limit who can actually run experiments.

Statistical Rigor and Data Ownership

The platform needs to produce results you can actually trust, and that means being transparent about how the statistics work.

Security, Compliance, and Deployment Options

Security and compliance requirements vary widely across industries, but the cost of getting this wrong is high regardless. Data residency issues, compliance violations, and PII exposure can all stem from choosing a platform that wasn't built with these constraints in mind.

Accessibility and Collaboration

Experimentation only scales when the whole team can participate. A platform that creates friction for non-technical users will limit how often experiments get run and who benefits from the results.

Pricing and Total Cost of Ownership

With experimentation platforms, the sticker price is rarely the full cost. How a platform charges you shapes how much you can experiment, and the wrong pricing model can quietly constrain your program as it grows.

Why GrowthBook

GrowthBook is the open-source feature flagging and experimentation platform built for product and engineering teams. It's used by over 2,900 companies, from early-stage startups running their first experiments to enterprises processing billions of feature flag evaluations per day. Here’s why you should consider using GrowthBook:

You can start for free and scale from there. The free tier gives you everything you need to run your first experiments, while the Enterprise plan adds advanced features like holdouts and the governance tools that mature programs need.

Start A/B Testing the Right Way

A/B testing doesn't replace judgment, but it gives judgment something solid to work with. The teams that get the most out of it aren't running the most experiments. They're asking sharper questions, defining better metrics, and building enough rigor into their process that results can actually be trusted.

That's harder to build than it sounds. But the organizations that do it stop having the same arguments about what to ship. They stop reverting changes based on noise. They stop leaving product decisions to whoever made the most compelling case in the last meeting.

In 2026, the companies pulling ahead are the ones replacing guesswork with evidence.

Want to give GrowthBook a try?

In under two minutes, GrowthBook can be set up and ready for feature flagging and A/B testing, whether you use our cloud or self-host.