A/B Testing in the Age of AI


Prologue: Is A/B Testing Here to Stay?

In the age of AI, there is a growing debate about how it will transform the professions and skills we rely on today. Some view AI as a game changer, capable of completely reshaping the workforce: certain professions and skill sets may disappear entirely, while new, as-yet-unknown roles will emerge. Others argue that AI’s impact will be more evolutionary than revolutionary: the same professions will remain, but AI will accelerate and enhance the work we already do, enabling people to accomplish more in less time.

This debate naturally extends to the realm of A/B testing: will experimentation remain necessary at all? Some suggest that experimentation could become fully automated, potentially making roles like product managers, analysts, developers, and designers redundant. Others contend that while AI will fundamentally reshape these roles, it will not eliminate them. From this perspective, AI’s most significant contribution lies in speed: it can increase the volume of ideas that require testing and accelerate the analysis of results. In effect, AI has the potential to dramatically compress product development cycles, allowing teams to iterate faster and more efficiently.

From this vantage point, A/B testing is far from disappearing; it is evolving. In this blog, we explore how AI is reshaping A/B testing: highlighting the areas already transformed, those on the verge of change, and those likely to remain largely unchanged.

Already Here: How AI Powers the Building Blocks of A/B Testing

A/B testing is the standard approach for determining whether new product versions genuinely outperform existing features. A typical A/B test comprises four main stages: hypothesis generation, where the proposed change and its expected impact are defined; experiment planning, which includes setting up the test conditions and determining an appropriate sample size; data collection and analysis; and finally, drawing conclusions and sharing the results across the organization. AI usage can already be found across the different stages of this lifecycle. 

Hypothesis Generation: Defining What to Test

Keeping track of what has already been done is essential for generating strong hypotheses. Past experiments provide critical context, helping teams avoid redundant or low-value tests and focus on ideas with real potential. Yet systematically tracking prior experiments remains a major challenge for analysts. As experiment volume grows, it quickly exceeds human cognitive capacity, and documentation becomes harder to navigate, especially as teams scale and members frequently join or leave.

This is precisely the kind of problem where AI excels. Platforms like GrowthBook leverage AI to help teams build efficiently on prior experiments by surfacing what has worked, identifying opportunities for new features, and even creating new feature flags and experiments directly in the platform. Crucially, these insights are not based on generic ideas; they are grounded in the company’s own data and experimental history, producing tailored solutions for the specific user population of the product.

Activating this AI support is as natural as talking with a teammate. In GrowthBook, analysts can simply ask what has worked in previous experiments, what has failed, and what to do next. Beyond suggesting new test ideas, the platform evaluates hypothesis quality against organization-defined criteria and helps prevent duplicate experiments by surfacing similar tests related to the current hypothesis.
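
To make this concrete, here is a minimal sketch of how duplicate-hypothesis detection could work in principle, using plain text similarity. It is not GrowthBook’s implementation; the hypotheses and the threshold are illustrative assumptions:

```python
# Hypothetical sketch: flag potentially duplicate hypotheses by text similarity.
# This is NOT GrowthBook's implementation -- just one generic way to surface
# past experiments that resemble a new hypothesis.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_hypotheses = [
    "Larger checkout button increases purchase conversion",
    "Showing shipping cost earlier reduces cart abandonment",
    "Personalized homepage banner lifts click-through rate",
]
new_hypothesis = "A bigger checkout CTA will improve purchase conversion"

vectorizer = TfidfVectorizer().fit(past_hypotheses + [new_hypothesis])
past_vecs = vectorizer.transform(past_hypotheses)
new_vec = vectorizer.transform([new_hypothesis])

scores = cosine_similarity(new_vec, past_vecs)[0]
for hypothesis, score in sorted(zip(past_hypotheses, scores), key=lambda x: -x[1]):
    if score > 0.2:  # arbitrary threshold for "worth reviewing"
        print(f"{score:.2f}  {hypothesis}")
```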

Planning: Setting Up the Test

Once you know what you are going to check, the next step is to design the test. The main goal at this stage is to determine the test duration, which is directly driven by the required sample size. Sample size calculation is essential to ensure the experiment has enough statistical power to detect an effect when one truly exists.

Importantly, sample size planning is tightly linked to the data and the required confidence levels. The required sample size is driven by the expected improvement, the significance level (often 0.05), and the statistical power (often 80%). While this planning is largely data driven, AI can still add value by helping teams manage, standardize, and document metrics across the organization by generating clear, consistent definitions and descriptions.
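
As a rough illustration, the standard normal-approximation formula for a two-proportion test fits in a few lines; the baseline rate, minimum detectable effect, and daily traffic below are assumptions chosen for the example:

```python
# Minimal sketch: per-variant sample size for a two-proportion test,
# using the standard normal-approximation formula. The baseline rate and
# minimum detectable effect are illustrative assumptions.
from scipy.stats import norm

baseline = 0.10          # current conversion rate
mde = 0.01               # minimum absolute improvement we want to detect
alpha = 0.05             # significance level
power = 0.80             # statistical power

p1, p2 = baseline, baseline + mde
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

n_per_variant = ((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2
print(f"~{int(round(n_per_variant)):,} users per variant")

# If, say, ~20k eligible users/day are split 50/50, the test duration is
# roughly n_per_variant / 10_000 days.
```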

Analysis: Data Acquisition and Evaluation

Once data has been collected, AI can add value across the entire analysis pipeline, from data extraction to generating insights. For example, GrowthBook allows users to create SQL queries directly from plain-text descriptions, execute them, and visualize the results. But the impact of AI in this platform goes far beyond query generation. By leveraging information linked to the tests, such as the hypothesis and metrics, AI can produce a full analysis of the results. This includes generating a summary that can be attached to the experiment, with the content and style of the summary controlled through prompts that specify how the results should be described.
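
For intuition, the core calculation behind such a summary is often a straightforward comparison of conversion rates. The sketch below runs a plain two-proportion z-test on made-up per-variant counts; it is not GrowthBook’s analysis engine:

```python
# Illustrative sketch (not GrowthBook's analysis engine): a plain two-proportion
# z-test on made-up per-variant totals, the kind of result an AI summary is built on.
import numpy as np
from scipy.stats import norm

control_conv, control_n = 1_180, 12_000     # hypothetical counts
variant_conv, variant_n = 1_320, 12_050

p_c = control_conv / control_n
p_v = variant_conv / variant_n
p_pool = (control_conv + variant_conv) / (control_n + variant_n)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / variant_n))

z = (p_v - p_c) / se
p_value = 2 * (1 - norm.cdf(abs(z)))
lift = (p_v - p_c) / p_c

print(f"control {p_c:.3%}, variant {p_v:.3%}, lift {lift:+.1%}, p = {p_value:.4f}")
```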

Beyond straightforward hypothesis testing, AI can also enable deeper exploration through segmentation analyses, helping teams understand where and for whom effects occur. AI-assisted exploration can also uncover unexpected patterns or secondary signals (such as mouse movement or click behavior) that might otherwise go unnoticed.
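
A simple version of such a segmentation breakdown might look like the following sketch, using hypothetical user-level data and a single device segment:

```python
# Hedged sketch: per-segment lift from hypothetical user-level experiment data,
# the kind of breakdown AI-assisted segmentation can automate.
import pandas as pd

# User-level rows: variant assignment, a segment column, and a conversion flag.
df = pd.DataFrame({
    "variant":   ["control", "treatment"] * 4,
    "device":    ["mobile", "mobile", "desktop", "desktop"] * 2,
    "converted": [0, 1, 1, 1, 0, 0, 1, 0],
})

rates = (
    df.groupby(["device", "variant"])["converted"]
      .mean()
      .unstack("variant")
)
rates["abs_lift"] = rates["treatment"] - rates["control"]
print(rates)
```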

Documenting and Sharing: Turning Results into Knowledge

To derive impact from an A/B test, it is essential to clearly communicate the results. In the era of AI, analysts no longer need to struggle with interpreting findings or deciding how to present them to stakeholders. Language models can handle much of this task when provided with information about the experiment and the actual data.
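
As a rough sketch, the task reduces to assembling the experiment’s metadata and results into a prompt and asking a language model to draft the summary. The example below assumes the OpenAI Python client; the model name, prompt, and experiment fields are illustrative, and any LLM API would work:

```python
# Hedged sketch, assuming the OpenAI Python client; any LLM API would work.
# The model name, prompt structure, and experiment fields are illustrative only.
from openai import OpenAI

experiment = {
    "hypothesis": "A larger checkout button will increase purchase conversion",
    "primary_metric": "purchase conversion",
    "control_rate": "9.83%",
    "variant_rate": "10.95%",
    "relative_lift": "+11.4%",
    "p_value": 0.0044,
}

prompt = (
    "Write a short, stakeholder-friendly summary of this A/B test. "
    "State the hypothesis, the result, and a recommended next step.\n\n"
    f"{experiment}"
)

client = OpenAI()  # expects OPENAI_API_KEY in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```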

Within Reach: Accelerating A/B Testing Automation & Learning with AI

AI enhancements are already influencing various components of A/B testing. In the near future, we believe AI has the potential to go beyond individual components, integrating the entire A/B testing lifecycle into an end-to-end process and enabling a deeper, more causal understanding of why effects occur and where errors originate.

Unifying the experimentation lifecycle

AI already supports key parts of the experimentation lifecycle. The next step is to integrate these capabilities into a unified, automated workflow. In practice, this could range from describing analysis goals in natural language to AI proactively proposing experiment ideas. For example, GrowthBook’s Weblens allows teams to upload a website URL and receive data-driven experiment recommendations.

In the future, AI-driven systems could autonomously generate product variants and run experiments end to end, while analysts retain oversight to ensure product quality, correct user allocation, and sound interpretation of results. This shift has the potential to significantly reduce the friction that commonly exists between product, engineering, and analytics teams.

Today, analysts are rarely responsible for implementing product changes or launching experiments, which often leads to misalignment, such as missing tracking events or users being allocated but never exposed to a variant. These issues can require substantial rework or, in the worst case, invalidate the experiment entirely. By centralizing the experimentation workflow within an AI-driven system, many of these coordination failures can be prevented, resulting in faster execution, cleaner data, and more reliable insights.

Automatically diagnosing experimental issues

AI can improve experiment validity not only by reducing operational friction, but also by automating validity checks and helping debug experiments when issues arise. For example, today a common validity check is Sample Ratio Mismatch (SRM), which verifies that the actual allocation of users matches the planned allocation. Beyond SRM, it is advisable to periodically conduct A/A tests, which compare the control version against itself. Ideally, no significant differences should emerge; if they do, it may indicate that some aspect of the software or testing environment is unintentionally affecting outcomes.
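
Implementing the SRM check itself is straightforward, for example as a chi-square goodness-of-fit test of observed against planned allocation; the counts and threshold below are illustrative:

```python
# Minimal SRM check: chi-square goodness-of-fit of observed vs planned allocation.
# Counts are illustrative; a real check would pull them from assignment logs.
from scipy.stats import chisquare

observed = [50_312, 48_701]          # users actually assigned to control / treatment
planned_ratio = [0.5, 0.5]           # intended 50/50 split
total = sum(observed)
expected = [r * total for r in planned_ratio]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:                   # a conventional, conservative SRM threshold
    print(f"Possible SRM: p = {p_value:.2e} -- investigate before trusting results")
else:
    print(f"No SRM detected (p = {p_value:.3f})")
```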

So, what can AI contribute to these validation checks? Quite a lot. While implementing SRM and A/A tests is relatively straightforward, diagnosing the source of a problem when one arises is far more complex. Tracing the root cause often requires detailed data exploration, which can be guided by AI tools. In more advanced settings, AI can even proactively detect potential issues by continuously monitoring differences in allocation or user characteristics (e.g., identifying a higher proportion of bots in one group). This capability allows teams to catch and resolve problems earlier, reducing wasted time and resources.
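
One such monitoring check might compare the share of bot traffic between arms, for instance with a chi-square test of independence; the counts below are illustrative assumptions:

```python
# Hedged sketch: check whether bot traffic is balanced across arms with a
# chi-square test of independence. Counts are illustrative assumptions.
from scipy.stats import chi2_contingency

#                  bots    humans
contingency = [[   920,  49_080],   # control
               [ 1_540,  48_460]]   # treatment

stat, p_value, dof, expected = chi2_contingency(contingency)
if p_value < 0.01:
    print(f"Bot share differs between arms (p = {p_value:.2e}) -- investigate filtering")
else:
    print(f"Bot share looks balanced (p = {p_value:.3f})")
```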

Learning about your product

Understanding why an effect occurred is not only important when errors arise; it becomes even more critical when significant results are observed. Beyond simply deciding which features to ship or retire, companies are deeply interested in understanding their users and their needs. By uncovering what drove the impact in a test, companies can make more informed product decisions, identify opportunities for improvement, and tailor experiences that truly resonate with their users.

Achieving this goal today often requires substantial manual effort, from building dashboards and running follow-up analyses to iteratively exploring data to uncover the drivers of observed changes. AI can dramatically accelerate this process by enabling learning across user segments and other potential explanatory variables. While tools such as automated segmentation analysis already address part of this need, AI’s potential extends much further. In the near future, it is expected to reveal complex segment interactions, detect seasonal patterns, and analyze historical user behavior, providing a deeper understanding of the people who use our product.

Beyond Our Grasp: AI as a Replacement for Humans in A/B Testing

The existing and emerging AI-based practices naturally raise a broader question: how far can automation go? In theory, AI could fully automate the experimentation process. In such a “human-free” scenario, humans would no longer be needed in two key roles. First, they might not be required as users, as their behavior could be accurately modeled and simulated. Second, they might no longer be necessary as decision-makers, with AI autonomously generating, running, and evaluating experiments. From our perspective, however, both assumptions remain far from reality, at least for the foreseeable future. Let’s explore why.

No Need for Humans as Users?

The case for automation

Human behavior can be modeled computationally, which raises the possibility of using synthetic users to test and iterate on product changes. In principle, such agents could enable experiments to be run, evaluated, and refined without involving real users.

Why humans still matter

Human behavior is deeply contextual, shaped by emotions, social norms, cultural influences, and continuously evolving motivations. These nuances are difficult to capture fully in any model, and existing datasets inevitably reflect only a partial view of human decision-making. Moreover, products are ultimately designed for, and evaluated by, humans, not abstract agents or simulations. As a result, even the most sophisticated models must ultimately be validated against real human responses.

No Need for Humans as Decision-Makers?

The case for automation

If AI could autonomously generate hypotheses, run experiments, evaluate outcomes, and draw conclusions to guide subsequent tests, human intervention might become unnecessary. In such a scenario, each experiment would naturally flow into the next, creating a continuous, fully automated experimentation workflow.

Why humans still matter

Although this vision is tempting, allowing AI algorithms to operate entirely without human oversight is unlikely in the near future; too much is at stake. While AI can assist in running experiments, organizations are unlikely to relinquish human judgment, which safeguards revenue growth, user experience, and alignment with broader business objectives.

This cautious approach is already evident in A/B testing. Fully automated methods, such as reinforcement learning and multi-armed bandits for user allocation, have existed for years. Despite their advantages, these methods are never allowed to run without human supervision. Instead, they typically complement rather than replace classical A/B testing.
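
For reference, the sketch below shows a minimal Thompson sampling loop for a two-variant Bernoulli bandit, simulated with assumed conversion rates; it illustrates the general technique rather than any particular platform’s implementation:

```python
# Minimal Thompson sampling sketch for a two-variant Bernoulli bandit --
# an example of the automated allocation methods mentioned above,
# not any platform's implementation. Conversion rates are assumed for simulation.
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.10, 0.12]            # unknown in practice; assumed here for simulation
successes = np.ones(2)               # Beta(1, 1) priors for each variant
failures = np.ones(2)

for _ in range(10_000):
    # Sample a conversion rate for each variant from its posterior, pick the best.
    samples = rng.beta(successes, failures)
    arm = int(np.argmax(samples))
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

traffic_share = (successes + failures - 2) / 10_000
print(f"posterior means: {successes / (successes + failures)}")
print(f"traffic share:   {traffic_share}")
```

Over time the loop shifts most traffic toward the better-performing variant, which is precisely why such methods are kept under human supervision rather than left to run unattended.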

This highlights a broader reality: even if AI eventually handles the entire product development lifecycle autonomously, human analysis and creativity will remain crucial for evaluating AI-generated ideas, monitoring product updates, and interpreting results and insights. Human involvement in A/B testing and product decision-making is therefore unlikely to disappear; instead, it will transform: analysts will spend less time on hands-on execution and more on supervising, guiding, ideating, and shaping AI-driven processes.

Bottom line

There is no doubt that AI is transforming A/B testing as we know it. What remains open to debate is the extent of that transformation. In this piece, we’ve shared our perspective on what has already changed and what is most likely to evolve in the near future.

Today, AI is already helping teams generate stronger hypotheses, monitor and interpret metrics, automate large parts of the analysis workflow, and communicate results more effectively. Looking ahead, AI is likely to further connect the different phases of the experimentation lifecycle, enhance debugging and validation capabilities, and strengthen segmentation analysis, unlocking deeper and more nuanced product insights.

By reducing friction, accelerating learning cycles, and lowering the cost of running and analyzing experiments, AI empowers analysts and product teams to learn faster and make better-informed decisions every day.

We invite you to join us at GrowthBook as we continue building the next generation of experimentation, where A/B testing meets the power of AI.

Want to give GrowthBook a try?

In under two minutes, GrowthBook can be set up and ready for feature flagging and A/B testing, whether you use our cloud or self-host.