“Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data”, Alex Deng, Ya Xu, Ron Kohavi, Toby Walker (2013-02):

Online controlled experiments are at the heart of making data-driven decisions at a diverse set of companies, including Amazon, eBay, Facebook, Google, Microsoft, Yahoo, and Zynga. Small differences in key metrics, on the order of fractions of a percent, may have very large business implications. At Bing it is not uncommon to see experiments that impact annual revenue by millions of dollars, even tens of millions of dollars, either positively or negatively.

With thousands of experiments being run annually, improving the sensitivity of experiments allows for more precise assessment of value, or equivalently running the experiments on smaller populations (supporting more experiments) or for shorter durations (improving the feedback cycle and agility).

We propose an approach (CUPED) that uses data from the pre-experiment period to reduce metric variability and hence achieve better sensitivity…The two Monte Carlo variance reduction techniques we consider here are stratification and control variates…This technique is applicable to a wide variety of key business metrics, and it is practical and easy to implement.
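The control-variate form of CUPED is simple enough to sketch in a few lines: given a covariate X from the pre-experiment period (e.g. each user's pre-period CTR), the adjusted metric is Y_cuped = Y − θ·(X − mean(X)) with θ = cov(X, Y) / var(X), which shrinks the variance by a factor of 1 − ρ², where ρ is the correlation between X and Y. A minimal simulation (the CTR means, standard deviations, and variable names below are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # simulated users

# Simulated per-user metric: pre-period CTR (covariate X) and
# experiment-period CTR (Y), correlated because user behavior persists.
pre = rng.normal(0.3, 0.1, n)           # pre-experiment CTR
post = pre + rng.normal(0.0, 0.05, n)   # experiment-period CTR

# CUPED adjustment with a control variate:
#   Y_cuped = Y - theta * (X - mean(X)),  theta = cov(X, Y) / var(X)
theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
post_cuped = post - theta * (pre - pre.mean())

# Fraction of variance removed equals corr(X, Y)^2;
# ~0.8 for these simulated parameters.
print(1 - np.var(post_cuped) / np.var(post))
```

Because the pre-period covariate is measured before treatment assignment, subtracting θ·(X − mean(X)) cannot bias the treatment-effect estimate; it only removes between-user variation that both arms share.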

The results on Bing’s experimentation system are very successful: we can reduce variance by about 50%, effectively achieving the same statistical power with only half of the users, or half the duration.

[Keywords: controlled experiment, variance, A/B testing, search quality evaluation, pre-experiment, power, sensitivity]

[Speed matters] §5.1, Slowdown Experiment in Bing: To show the impact of CUPED in a real experiment, we examine an experiment that tested the relationship between page load-time and user engagement on Bing. Delays on the order of hundreds of milliseconds are known to hurt user engagement (Kohavi et al 2009, §6.1.2). In this experiment, we deliberately delayed the server response to Bing queries by 250 milliseconds. The experiment first ran for two weeks on a small fraction of Bing users, and we observed an impact on click-through rate (CTR) that was borderline statistically-significant, i.e. the p-value was just slightly below our threshold of 0.05. To confirm that the treatment effect on this metric was real and not a false positive, a much larger experiment was run, which showed that this was indeed a real effect, with a p-value of 2 × 10⁻¹³.

We applied CUPED using CTR from the 2-week pre-period as the covariate. The result is impressive: the delta was statistically-significant from day 1! The top plot of Figure 2 shows the p-values over time in log scale. The black horizontal line is the 0.05 statistical-significance bar. The vanilla t-test trends slowly down and by the time the experiment was stopped in 2 weeks, it barely reached the threshold. When CUPED is applied, the entire p-value curve is below the bar. The bottom plot of Figure 2 compares the p-value curves when CUPED runs on only half the users. Even with half the users exposed to the experiment, CUPED results in a more sensitive test, allowing for more non-overlapping experiments to be run.
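The mechanism behind the "significant from day 1" result is that the CUPED-adjusted metric enters an ordinary two-sample test with a much smaller standard error. A hypothetical simulation of a small slowdown effect (the effect size, arm sizes, and helper `p_value` are my own assumptions for illustration; a normal approximation stands in for the t-distribution, which is fine at this n):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
n = 2_000  # users per arm (hypothetical; far smaller than Bing's traffic)

# Hypothetical per-user CTRs: pre-period behavior carries over into the
# experiment period; the treatment (a delay) slightly lowers CTR.
pre_c = rng.normal(0.30, 0.10, n)
pre_t = rng.normal(0.30, 0.10, n)
post_c = pre_c + rng.normal(0.000, 0.05, n)
post_t = pre_t + rng.normal(-0.005, 0.05, n)

def p_value(a, b):
    """Two-sample z-test p-value (normal approximation, adequate for large n)."""
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_raw = p_value(post_t, post_c)

# CUPED: theta estimated on pooled data, each arm adjusted by the pooled pre-mean.
pre = np.concatenate([pre_c, pre_t])
post = np.concatenate([post_c, post_t])
theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
adj_c = post_c - theta * (pre_c - pre.mean())
adj_t = post_t - theta * (pre_t - pre.mean())
p_cuped = p_value(adj_t, adj_c)

# CUPED typically yields a far smaller p-value for the same users and days.
print(p_raw, p_cuped)
```

Because the adjustment shrinks the per-user variance without touching the expected treatment delta, the same observed effect crosses the 0.05 bar with fewer users or fewer days, which is exactly the trade-off the Figure 2 curves illustrate.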