How We Built WebLens: Creating an AI-Powered Hypothesis Generator

Learn how we built WebLens, our AI-powered hypothesis generator, to enhance A/B testing by efficiently analyzing websites and proposing actionable optimization strategies.

· 6 min read
How We Built WebLens: Creating an AI-Powered Hypothesis Generator

Background

With the advent of generative AI, companies across all industries are leveraging these models to enhance productivity and improve product usability.

At GrowthBook we have the unique opportunity to apply generative AI to the challenges of experimentation and A/B testing as an open-source company.

After careful consideration of the various applications of generative AI within the experimentation and A/B testing space, we decided to focus on creating an AI-powered hypothesis generator. This project not only allowed us to quickly prototype but also provided a rich roadmap for further innovation as we explore the potential of large language models (LLMs) in online controlled experiments.

An AI-powered hypothesis generator

Problem: New users often struggle to identify opportunities for optimizing their websites, while experienced users may need fresh ideas for new experiments.

Solution: Our AI-powered hypothesis generator analyzes websites and suggests actionable changes to achieve desired outcomes. It can generate DOM mutations in JSON format, making it easier to create new experiments in GrowthBook. For example, on an e-commerce site, the generator might recommend repositioning the checkout button or simplifying the navigation menu to boost conversion rates.

Overview

In our first iteration of the AI-powered hypothesis generator, we focused on a straightforward approach using existing technology. Below, we outline our process, the challenges we encountered, and how we plan to improve in future iterations.

Hypothesis generation steps

To accomplish the task of generating hypotheses for a given web page, there were a few discrete steps that the application would need to accomplish:

  1. Analyze the web page by scraping its contents.
  2. Prompt an LLM for hypotheses using the scraped data and additional context.
  3. If feasible, prompt the LLM to generate DOM mutations in a format compatible with our Visual Editor.
  4. Onboard the generated hypothesis and visual changes into GrowthBook.

Hypothesis generator tech stack

The hypothesis generator is built on a Next.js App supported by several services:

System Architecture for WebLens

Here's a visual overview of our hypothesis generator architecture

Scraping the page

To provide the LLM with the context it needs to form hypotheses, we needed a way to serialize a web page into data that it could understand, that is: text and images.

We were lucky to partner with another YC company, Firecrawl for web scraping. Initially, we used Playwright, but switched to Firecrawl for its simpler API and reduced resource utilization.

Firecrawl helps us:

Prompting the LLM

Now for fun part - we had to learn how to prompt the LLM to get accurate and novel results. There were a number of challenges that arose in this area of the project.

In the beginning, we experimented with OpenAI GPT 4, Anthropic Claude 2.1, and Google Gemini Pro. We also made sure to acquaint ourselves with established prompting techniques and leveraged those that seemed to improve results.

In the end we decided to go with Google Gemini Pro chiefly for its large context window of 1 million tokens. The difference in quality of hypotheses was negligible but we thought OpenAI’s GPT-4 did slightly better.

Prompt engineering

We experimented with a variety of novel prompting techniques (as well as a bit of sorcery and positive thinking) in order to output hypotheses of suitable quality. Here are a few of the techniques that we found useful:

Context window limits

Context window limits quickly became a challenge when trying to provide full web page scrape data to the LLM. This was a hard problem to compromise - there was no way around providing the full payload of scrape data. For example, it could be possible that crucial <style> tags were included near the bottom of the page. In general, there could be important details anywhere throughout the DOM that could affect the LLMs ability to formulate reliable hypotheses.

We came up with ideas on how to approach this problem in the long-term, but for now, we decided to take advantage of Google Gemini’s large context window to allow us to provide a nearly-full scrape of a webpage (with irrelevant markup removed) and still have enough token space left over for our prompt and the resultant hypotheses.

Generating DOM mutations

To take things one step further, we wanted to add the ability to use a generated hypothesis to prompt the LLM to generate DOM Mutations in a JSON format used by our Visual Editor. These mutations could then be used to render live previews in the browser.

We encountered some amazing results along with some pretty disastrous ones while experimenting with this. In the end, we had to narrow the focus of our prompt to focus only on modifying the text copy of very specific elements on a page, so that the live previews were reliably good. We also implemented iterative prompting to ensure the generated mutations were usable, testing them out on a virtual DOM and suggesting fixes to the LLM when possible.

Future iterations could improve accuracy and power by refining this process.

Scalability

To ensure reliability and support increased traffic from platforms like Product Hunt and Hacker News, we used Upstash’s QStash for a message queue. This system provides features like payload deduplication, retrying with exponential backoff, and a DLQ.

On the frontend, we used Supabase’s Realtime feature to notify clients immediately when steps are completed.

Challenges

Context window limits

One of the major challenges we faced was dealing with context window limits. Scraped HTML pages can range from 100k to 300k tokens, while most models' context windows are under 100k, with Google Gemini Pro being an exception at 1 million tokens. While large context windows simplify the developer experience by allowing us to create a single, comprehensive prompt for inference, they also come with the risk of the model focusing on irrelevant details.

Long context windows offer both advantages and disadvantages. On the plus side, they make it easier for developers by accommodating all necessary data in one prompt. This is particularly useful for use cases like ours, where raw HTML doesn't align well with techniques like Resource Augmented Generation (RAG). However, larger prompts can lead to unexpected inferences or hallucinations by the model.

Despite these challenges, our tests showed that Gemini Pro consistently produced relevant and innovative hypotheses from the scraped data. Looking ahead, we plan to enhance this process by using machine learning to categorize web pages and break them down into common UI groupings, such as "hero section" or "checkout CTA" for e-commerce sites.

By creating a taxonomy of web pages, we can use machine learning to extract essential details from each UI grouping, including screen captures and raw HTML. This approach reduces the amount of HTML from 300k+ tokens to less than 10k per grouping. Such optimization will enable faster, more accurate, and higher-quality inference across a broader range of models, even those with smaller context window limits.

Naming

We had a tough time coming up with a name for the tool - at one point considering “Lenticular Labs.” We ultimately chose WebLens to emphasize the tool's ability to analyze websites through a lens, enhancing focus. Plus, we secured the domain weblens.ai.

Conclusion

We hope you enjoyed reading this high-level overview of how we came to build the hypothesis generator at GrowthBook. Our current implementation is straightforward and utilizes tools that are readily available today, yet is able to generate novel insights given any web page on the internet today. Give it a shot with your website of choice at https://weblens.ai and let us know your thoughts via our community slack.

Read next

Want to give GrowthBook a try?

In under two minutes, GrowthBook can be set up and ready for feature flagging and A/B testing, whether you use our cloud or self-host.