“False Discovery in A/B Testing”, Ron Berman, Christophe Van den Bulte (2021-12-30):

[previously] We investigate what fraction of all statistically-significant results in website A/B testing are actually null effects (ie. the false discovery rate, FDR).

Our data consist of 4,964 effects from 2,766 experiments conducted on a commercial A/B testing platform, Optimizely.

Using 3 different methods, we find that the FDR ranges between 28% and 37% for tests conducted at 10% statistical-significance and between 18% and 25% for tests at 5% statistical-significance (two-sided). These high FDRs stem mostly from the high fraction of true null effects, about 70%, rather than from low statistical power… A similarly high fraction of null effects has been observed on Microsoft’s Bing (Deng 2015), and our study generalizes this finding to a much greater set of experimenters, organizations, and industries.
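The arithmetic behind this claim can be sketched with the standard FDR decomposition, FDR = π₀·α / (π₀·α + (1 − π₀)·power): with ~70% true nulls, even moderate power leaves a large share of significant results false. A minimal illustration (not code from the paper; the ~47% power figure is an assumption chosen to land inside the paper’s reported ranges, not an estimate the authors report):

```python
def fdr(pi0, alpha, power):
    """Expected false discovery rate among significant results.

    pi0:   fraction of tested effects that are truly null
    alpha: significance threshold (two-sided)
    power: average probability of detecting a real effect
    """
    false_pos = pi0 * alpha          # null effects crossing the threshold
    true_pos = (1 - pi0) * power     # real effects detected
    return false_pos / (false_pos + true_pos)

# Hypothetical inputs: ~70% true nulls (as estimated in the paper),
# with an assumed average power of ~47%:
print(round(fdr(0.70, 0.05, 0.47), 3))  # 5% threshold → ~0.199
print(round(fdr(0.70, 0.10, 0.47), 3))  # 10% threshold → ~0.332
```

Both values fall inside the reported 18–25% and 28–37% ranges, showing how the 70% null fraction, rather than unusually low power, is enough to produce FDRs of this size.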

Using our estimates, we also assess the potential of various A/B test designs to reduce the FDR. The 2 main implications are that decision makers should expect 1 in 5 interventions achieving statistical-significance at 5% confidence to be ineffective when deployed in the field, and that analysts should consider using 2-stage designs with multiple variations rather than basic A/B tests.

[Keywords: statistics, design of experiments, decision analysis, inference, A/B testing, false discovery rate]

Figure 3: Histograms of effect sizes, z-Scores, and p-Values.