“A/B Interactions: A Call to Relax”, Microsoft Research, 2023-08-02:

…We’re going to show you why A/B interactions—the dreaded scenario where two or more tests interfere with each other—are not as common a problem as you might think. Don’t get us wrong, we’re not saying that you can completely let down your guard and ignore A/B interactions altogether. We’re just saying that they’re rare enough that you can usually run your tests without worrying about them.

Looking for A/B Interactions at Microsoft: In our previous experience with A/B tests at Microsoft, we found that A/B interactions were extremely rare. Similarly, researchers at Facebook found that A/B interactions were not a serious problem for their tests.

We recently carried out a new investigation of A/B interactions in a major Microsoft product group. In this product group, A/B tests are not isolated from each other, and each control-treatment assignment takes place independently. The data analysis proceeded as follows.

Within this product group, we looked at 4 major products, each of which runs hundreds of A/B tests per day on millions of users. For each product, we picked a single day, and looked at every pair of A/B tests that were running on that same day. For each pair, we calculated every metric for that product for every possible control or treatment assignment combination for the two tests in the pair. The results for metric y are shown here for a case where each test has one control and one treatment.

Table 3: Treatment effects for one A/B test, segmented by user control/treatment assignment in a different A/B test.
                 A/B test #2: C    A/B test #2: T    Treatment effect
A/B test #1: c   Y_{C,c}           Y_{T,c}           Δ_c = Y_{T,c} − Y_{C,c}
A/B test #1: t   Y_{C,t}           Y_{T,t}           Δ_t = Y_{T,t} − Y_{C,t}
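The cell means and segmented treatment effects in Table 3 can be computed directly from per-user data. Here is a minimal sketch on simulated data (names, sample sizes, and effect sizes are hypothetical; it assumes each user is independently assigned in both tests, as described above):

```python
import random

random.seed(0)

# Simulate users independently assigned in two overlapping A/B tests.
# Test #2 has a true effect of +0.1 on metric y; there is no interaction.
users = []
for _ in range(100_000):
    g1 = random.choice("ct")   # assignment in A/B test #1 (control/treatment)
    g2 = random.choice("CT")   # assignment in A/B test #2
    y = random.gauss(1.0, 1.0) + (0.1 if g2 == "T" else 0.0)
    users.append((g1, g2, y))

def mean_y(g1, g2):
    """Mean of metric y in one cell of Table 3, e.g. Y_{C,c} = mean_y('c', 'C')."""
    vals = [y for a, b, y in users if a == g1 and b == g2]
    return sum(vals) / len(vals)

# Treatment effect of test #2, segmented by the user's assignment in test #1:
delta_c = mean_y("c", "T") - mean_y("c", "C")   # Δ_c = Y_{T,c} − Y_{C,c}
delta_t = mean_y("t", "T") - mean_y("t", "C")   # Δ_t = Y_{T,t} − Y_{C,t}
print(delta_c, delta_t)   # both ≈ 0.1; similar values indicate no interaction
```

If the two segmented effects Δ_c and Δ_t agree (up to noise), the pair of tests shows no interaction on this metric.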

A chi-square test was performed to check if there was any difference between the two treatment effects. Because there were hundreds of thousands of A/B test pairs and metric combinations, hundreds of thousands of p-values were calculated. Under the null hypothesis of no A/B interactions, the p-values should be drawn from a uniform distribution, with 5% of the p-values satisfying p < 0.05, 0.1% of the p-values satisfying p < 0.001, etc. Accordingly, some were bound to be small, just by chance.
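This null behavior is easy to verify by simulation. The sketch below repeatedly simulates a pair of tests with no true interaction and tests Δ_t = Δ_c with a normal-approximation z-test; a chi-square test of this one-degree-of-freedom comparison is equivalent to the squared z. All sample sizes here are hypothetical:

```python
import math
import random

random.seed(1)

def interaction_pvalue(n_per_cell=500):
    """Simulate one A/B-test pair with no true interaction and return the
    two-sided p-value for the difference of the two segmented treatment
    effects (normal approximation; metric y ~ N(0, 1) in every cell)."""
    cells = {k: [random.gauss(0, 1) for _ in range(n_per_cell)]
             for k in ("Cc", "Tc", "Ct", "Tt")}
    mean = {k: sum(v) / n_per_cell for k, v in cells.items()}
    delta_c = mean["Tc"] - mean["Cc"]
    delta_t = mean["Tt"] - mean["Ct"]
    se = math.sqrt(4 / n_per_cell)      # each cell mean has variance 1/n
    z = (delta_t - delta_c) / se
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value

pvals = [interaction_pvalue() for _ in range(2000)]
frac_below_05 = sum(p < 0.05 for p in pvals) / len(pvals)
print(frac_below_05)   # ≈ 0.05 under the null, as described above
```

Roughly 5% of the simulated p-values fall below 0.05, matching the uniform-distribution behavior the post describes.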

The results: few or no interactions

Therefore, to check whether there were A/B interactions, we looked at the cumulative distribution function of the p-values, shown here for a single day for a specific product:

Figure 1: Cumulative distribution of p-values for A/B interaction tests.
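The visual check in Figure 1 can be backed by a number: the largest gap between the empirical CDF of the p-values and the uniform line F(p) = p, i.e. the Kolmogorov-Smirnov statistic. A minimal sketch on simulated p-values (the data here is hypothetical):

```python
import random

def ks_uniform_stat(pvals):
    """Kolmogorov-Smirnov statistic of a sample against Uniform(0, 1):
    the largest gap between the empirical CDF and the line F(p) = p."""
    ps = sorted(pvals)
    n = len(ps)
    return max(max(i / n - p, p - (i - 1) / n) for i, p in enumerate(ps, 1))

random.seed(2)
uniform_ps = [random.random() for _ in range(10_000)]        # null-like p-values
skewed_ps = [random.random() ** 4 for _ in range(10_000)]    # excess of small p-values
print(ks_uniform_stat(uniform_ps))   # small: close to uniform
print(ks_uniform_stat(skewed_ps))    # large: clear deviation from uniform
```

A small statistic is consistent with a uniform distribution (no interactions); a large one signals an excess of small p-values worth investigating.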

The graphs for all 4 products look similar; all are very close to a uniform distribution. We then looked for deviations from uniformity by checking for abnormally small p-values, using a Benjamini-Hochberg false discovery rate correction. For 3 of the products, we found none: all results were consistent with no A/B interactions. For one product, we did find a tiny number of abnormally small p-values, corresponding to 0.002%, or 1 in 50,000 A/B test pair metrics. We checked the detected interactions manually, and there were no cases where the two treatment effects in Table 3 were both statistically significant but moving in opposite directions. In every case, either the two treatment effects were in the same direction but differed in magnitude, or one of them was not statistically significant.
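For reference, the Benjamini-Hochberg step-up procedure used for this kind of screening can be sketched in a few lines (the function name is hypothetical; this is the standard procedure, not the team's exact implementation):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return the indices of hypotheses rejected by the Benjamini-Hochberg
    step-up procedure at false discovery rate alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank                      # largest rank passing its threshold
    return sorted(order[:k])

# Uniform-looking p-values (consistent with no interactions): nothing flagged.
print(benjamini_hochberg([0.21, 0.48, 0.77, 0.90, 0.33]))    # []
# One very small p-value among the rest: only that one flagged.
print(benjamini_hochberg([0.0001, 0.48, 0.77, 0.90, 0.33]))  # [0]
```

With hundreds of thousands of p-values, this procedure flags only those small enough to survive the multiple-comparisons threshold, rather than the 5% that fall below 0.05 by chance alone.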

…This is possible, but it raises the question of when we should worry about interaction effects. For most A/B tests at Microsoft, the purpose of the test is to produce a binary decision: whether or not to ship a feature. There are some cases where we care whether a treatment effect is 10% or 11%, but those cases are the minority. Usually, we just want to know whether key metrics are improving, degrading, or remaining flat. From that perspective, the scenario with small cross-A/B-test treatment effects is interesting in an academic sense, but not typically a problem for decision-making. [i.e. ‘bet on sparsity’ & additivity]