Fact Table Query Optimization

Back in October 2023, GrowthBook 2.5 added support for Fact Tables. This allowed you to write SQL once and re-use it for many related metrics. For example, an Orders fact table being used for Revenue, Average Order Value, and Purchase Rate metrics.

However, behind-the-scenes, we were still treating these as independent metrics. We weren't taking advantage of the fact that they shared a common SQL definition.

With the release of GrowthBook 2.7 in January 2024, we added some huge SQL performance optimizations for our Enterprise customers to better take advantage of this shared nature. Read on for a deep dive on how we did this and the resulting gains we achieved.

Understanding Experiment Queries

The SQL that GrowthBook generates to analyze experiment results is complex, often exceeding 10 sub-queries (CTEs) and 200 lines total. Here's a simplified view of some of the steps involved:

Get all users who were exposed to the experiment and which variation they saw
Roughly filter the raw metric table (e.g. by the experiment date range)
Join [1] and [2] to get all valid metric conversions we should include
Aggregate [3] on a per-user level
Aggregate [4] on a per-variation level

An important thing to note is that these queries are metric-specific. If you add 10 metrics to an experiment, we would generate and run 10 unique SQL queries.

Abstracting out the Common Parts

The queries above can be expensive, especially for companies with huge amounts of data. Reducing the amount of duplicate work can result in big savings, especially for usage-based data warehouses like BigQuery or Snowflake.

It's pretty clear that step 1 above is always going to be identical for every single metric in an experiment, whether or not they share the same Fact Table. That's why in GrowthBook 2.5, we released Pipeline Mode to take advantage of this by running that part of the query once and creating an optimized temp table that subsequent metric queries could use.

Now that we have Fact Tables, we can take this optimization even further. If multiple metrics from the same Fact Table are added to an experiment, it's pretty easy to see that step 2 will be the same*. What's not so obvious is that, with a little tweaking, the rest of the steps (3-5) can also become largely identical, paving the way for some huge performance gains. We don't need temp tables, we can just run a single query that operates over multiple metrics at the same time.

Filters

I said above that step 2 (roughly filtering metrics) would be the same for all metrics in a fact table. This isn't 100% true because of Filters. We let you add arbitrary WHERE clauses to a metric to limit the rows that it includes. This lets you, for example, define both a Revenue and a Revenue from orders over $50 metric using the same Fact Table.

Filters are super powerful, but pose a problem when trying to combine metrics into a single query. If two metrics have different filters, we can't use a WHERE clause to do the filtering since it will apply to all of the metrics.

The solution turns out to be easy - the all powerful CASE WHEN statements. This lets each metric have its own mini WHERE clause without interfering with any other ones.

SELECT 
  amount as revenue,
  (CASE WHEN amount > 50 THEN amount ELSE NULL END) as revenue_over_50
FROM orders

Combining Metrics

This is a simplified view of the data at step 3, when we join the experiment data (variation) with the metric data (timestamp/value) based on userId.

userId	variation	timestamp	value
123	control	2024-01-18T00:01:02	$100.54
456	variation	2024-01-18T00:01:03	$75.43
123	control	2024-01-18T11:15:12	$10.54

Notice how each event has its own row (UserId 123 purchased twice and has 2 rows). To support multiple metrics, we can't have a single value column anymore. We need each metric to have its own column. We can just prefix these by the metric number. m0_value, m1_value, etc.. m0 might represent the revenue, m1 might represent the number of items in the order.

We do the same thing in step 4, aggregating by userId. The structure is identical, there will just now be one single row per userId and we will sum each prefixed value column.

SELECT
  userId,
  variation,
  SUM(m0_value) as m0_value,
  SUM(m1_value) as m1_value
FROM
  step_3
GROUP BY userId, variation

Step 5, aggregating by variation, is similar, but a single number per metric is no longer enough. We need multiple data points to calculate standard deviations and other more advanced stats. So we end up with something like this:

SELECT
  variation,
  COUNT(*) as users,
  -- Multiple prefixed columns for each metric
  SUM(m0_value) as m0_sum,
  SUM(POWER(m0_value, 2)) as m0_sum_squares,
  ...
FROM
  step_4
GROUP BY variation

Pulling Them Apart Again

The results we get back from the data warehouse can be very wide, with potentially hundreds of columns (each metric needs between 3 and 10 columns and there could be dozens in an experiment).

Our Python stats engine was written to process one metric at a time, so to avoid a massive refactor, we simply split this wide table back into many smaller datasets before processing. For example, to process m1, we would clone the dataset, remove all of the m0_, m2_, etc. columns, and rewrite the m1_ column names to remove the prefix. Now the result looks 100% identical to how it was before this optimization.

Ratio Metrics, CUPED, and More

All of the examples above show the simplest case. Combining metrics is even more powerful for advanced use cases.

Ratio metrics let you divide two other metrics, for example Average Order Value (revenue / orders). Previously, we would have to select 2 metric tables, one for the numerator and one for denominator, and join them together, which could get really expensive. Now, if both the numerator and denominator are in the same Fact Table, we can avoid this costly extra join entirely, making the query significantly faster and cheaper.

CUPED is an advanced variance reduction technique for experimentation. It involves looking at user behavior before they saw your experiment and using that to cancel out variance during the experiment. This makes metric queries more expensive since they now have to scan a wider date range. Because of this, users had to be really judicious about which metrics they enabled CUPED for. Now, since that expensive part of the query is shared between multiple metrics, it becomes feasible to just run CUPED for everything without having to think about the performance costs.

The story is similar for other advanced techniques like percentile capping (winsorization), timeseries analyses, and dimension drill-downs.

The Gains

So, was all of this work worth it? Absolutely.

BigQuery is the most popular data warehouse for our customers. BigQuery charges based on the amount of data scanned and compute used, so any performance improvements translates directly to cost savings for our users.

For a large company adding 100 metrics to an experiment, it's not uncommon for those to be split between only 5-10 fact tables, lets say 10 to be conservative.

Selecting additional columns from the same source table is effectively free from a performance point of view, so going from 100 narrow queries to 10 wide ones is a 90% cost reduction! When you factor in the savings from Ratio Metrics and CUPED, it's not unheard of to see an additional 2X cost decrease!

We're super excited about these cost savings for our users. The biggest determinant of success with experimentation is velocity - the more experiments you run, the more wins you will get. So anything we can do to reduce the cost and barrier of running more tests is well worth the investment.