“Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments”, 2015-05-01:
[cf. Berman & Van den 2021] As A/B testing gains wider adoption in industry, more people are beginning to realize the limitations of traditional frequentist null hypothesis significance testing (NHST). The large number of search results for the query “Bayesian A/B testing” shows how quickly interest in the Bayesian perspective is growing. In recent years, some have even argued that Bayesian A/B testing should replace frequentist NHST and is strictly superior in all respects.
Our goal here is to dispel this myth by examining both the advantages and the issues of Bayesian methods. In particular, we propose an objective Bayesian A/B testing framework that we hope brings together the best of the Bayesian and frequentist approaches. Unlike traditional methods, it requires historical A/B test data in order to learn a prior objectively.
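The core idea can be illustrated with a minimal sketch (not the paper's Algorithm 1): under a two-point mixture prior, H0 says the treatment effect is exactly zero, while H1 says the effect is drawn from N(0, tau²), with P(H1) learned from historical experiments. Given an observed metric difference with a normal sampling distribution, the posterior probability of a real effect follows from Bayes' rule. The function name, `tau`, and the normal-slab choice are assumptions for illustration, not the paper's exact model.

```python
import math

def posterior_p_h1(delta, se, p_h1, tau):
    """Posterior probability of a non-zero effect under a two-point mixture
    prior (an illustrative sketch, not the paper's Algorithm 1):
      H0: effect = 0,            with prior 1 - p_h1
      H1: effect ~ N(0, tau^2),  with prior p_h1
    The observed difference delta is assumed ~ N(effect, se^2)."""
    def normal_pdf(x, var):
        return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

    m0 = normal_pdf(delta, se * se)              # marginal likelihood under H0
    m1 = normal_pdf(delta, se * se + tau * tau)  # marginal likelihood under H1
    return p_h1 * m1 / (p_h1 * m1 + (1 - p_h1) * m0)
```

With a historical prior of P(H1) = 20%, a difference near zero pulls the posterior below the prior, while a difference of five standard errors pushes it close to 1; this is where the learned prior, rather than a subjective one, does the work.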
We have successfully applied this method to Bing, using thousands of experiments to establish the priors.
…5. Empirical Results: We also applied Algorithm 1 to Bing experiment data. After data-quality checks, we found that for most metrics, except a few recently added ones, we typically had more than 2,000 historical observations.
After fitting the model, we found the prior P(H1) = P [the probability of a non-zero effect] ranges from as much as 70% to less than 1%. The ordering of P across metrics aligns well with our perception of how frequently each metric truly moved: for example, metrics like page loading time moved much more often than user-engagement metrics such as visits per user. For most metrics P is below 20%, because the scale of Bing experimentation allows us to test aggressively with ideas that have a low success rate. We also used P(Flat) in the P-Assessment, only examining metrics with P(Flat) < 20%, and found this very effective in controlling the false discovery rate (FDR).
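The P(Flat) filter above can be sketched as follows, assuming P(Flat) is the posterior probability of no movement, P(H0 | data). The reported FDR estimate uses standard local-FDR reasoning (the average P(Flat) of the flagged set); the function name and the example metric names are illustrative assumptions, not from the paper.

```python
def filter_by_p_flat(metrics, threshold=0.2):
    """Keep metrics whose posterior probability of no movement,
    P(Flat) = P(H0 | data), is below the threshold, and report the
    implied false discovery rate among the kept metrics: the average
    P(Flat) of the flagged set (local-FDR reasoning)."""
    flagged = {name: p for name, p in metrics.items() if p < threshold}
    fdr = sum(flagged.values()) / len(flagged) if flagged else 0.0
    return flagged, fdr

# Hypothetical posteriors for three metrics:
metrics = {"page_load_time": 0.02, "sessions_per_user": 0.45, "clicks": 0.15}
flagged, fdr = filter_by_p_flat(metrics)
```

Here `sessions_per_user` is screened out, and the expected FDR among the two flagged metrics is (0.02 + 0.15) / 2 = 8.5%, below the 20% threshold by construction.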