[cf. Feit & Berman 2018 / Berman & Van den Bulte 2021] We conduct a meta-analysis on over 6,700 large e-commerce experiments, mainly from the retail and travel sectors, grouping together common treatment types performed on websites.
We find that cosmetic changes have a far smaller impact on revenue than treatments grounded in behavioral psychology.
…We categorise roughly 2,600 experiments into 29 categories and measure several statistics for each, such as the average uplift. The full list of results is in §2.2.2. The test categories that perform best by average uplift are:
scarcity (stock pointers) +2.9% uplift
social proof (informing users of others’ behaviour) +2.3% uplift
urgency (countdown timers) +1.5% uplift
abandonment recovery (messaging to keep users on-site) +1.1% uplift
product recommendations (suggesting other products to purchase) +0.4% uplift
Most simple UI changes to websites are ineffective. For example:
colour (changing the colour of elements on a website) +0.0% uplift
buttons (modifying website buttons) −0.2% uplift
calls to action (changing the wording on a website to be more suggestive) −0.3% uplift
We find that 90% of experiments have an effect on revenue of less than 1.2%, whether positive or negative (see §3.2). However, we find that overall our clients benefit from A/B testing campaigns, some greatly (see §4.2).
Figure 3.1: Estimated overall effects of all A/B tests
…Most of the treatments we measured tend to fall in the [−1%, 1%] range for uplift. Reliably and confidently detecting a 1% uplift in conversion rate alone requires about 120,000 converters (purchasing visitors) in each variant, including the control; detecting a revenue uplift of the same size requires even more, because variation in per-order revenue adds variance. We detail how one arrives at this number in Appendix B. Only a small proportion of companies have enough traffic to measure uplifts of this size in a realistic time-frame.
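For intuition on that sample-size figure, here is a minimal power-calculation sketch in Python using statsmodels. The baseline conversion rate (5%), two-sided α = 0.05, and 95% power are illustrative assumptions, not the paper's stated settings; its own derivation is in its Appendix B:

```python
# Sample size to detect a 1% *relative* uplift in conversion rate.
# Assumptions (not from the paper): 5% baseline conversion rate,
# two-sided alpha = 0.05, 95% power.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.05             # assumed control conversion rate
treated = baseline * 1.01   # +1% relative uplift

h = proportion_effectsize(treated, baseline)   # Cohen's h for two proportions
visitors_per_variant = NormalIndPower().solve_power(
    effect_size=h, alpha=0.05, power=0.95, alternative='two-sided')
converters_per_variant = visitors_per_variant * baseline

print(f"visitors per variant:   {visitors_per_variant:,.0f}")    # ~2.5 million
print(f"converters per variant: {converters_per_variant:,.0f}")  # ~120,000
```

Under these assumptions the answer comes out to roughly 120,000 converters (about 2.5 million visitors) per variant, matching the figure quoted above; lower power or a laxer significance threshold shrinks it considerably.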
[Informative priors: most A/B experiments fail, and the successful ones have small effects, on the order of a few % at most. Note the implications: most successful (in the sense of p < 0.05) A/B tests will grossly overestimate the true effect size; detecting realistic effects may require very large sample sizes; and some categories of tests may well not be worth running at all. It would also be interesting to know how many of those null experiments were justified based on correlational data…]
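To make the overestimation point concrete, a quick Monte-Carlo sketch (the true uplift and standard error below are illustrative assumptions, not the paper's data): when the true effect is small relative to the noise, the subset of experiments that reach p < 0.05 report inflated estimates.

```python
# "Winner's curse": condition on statistical significance and the surviving
# estimates overstate the true effect. Illustrative numbers, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
true_uplift = 0.3   # true effect: +0.3% revenue uplift
se = 0.6            # standard error of each experiment's estimate (% points)

estimates = rng.normal(true_uplift, se, size=100_000)  # 100k simulated A/B tests
wins = (estimates / se) > 1.96                         # significant positive results

print(f"share of experiments 'winning':  {wins.mean():.1%}")               # ~7%
print(f"mean estimate among the winners: +{estimates[wins].mean():.2f}%")  # ~+1.4%
print(f"true uplift:                     +{true_uplift}%")
```

With these numbers, a significant result overstates the true uplift by nearly 5×, which is why base rates like those above are worth treating as informative priors when reading any single test.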