[cf. Milkman et al. 2022; criticism that effects were benchmarked against the wrong control group (their passive rather than their active controls)] Policy-makers are increasingly turning to behavioral science for insights about how to improve citizens’ decisions and outcomes. Typically, different scientists test different intervention ideas in different samples using different outcomes over different time intervals. The lack of comparability of such individual investigations limits their potential to inform policy. Here, to address this limitation and accelerate the pace of discovery, we introduce the megastudy—a massive field experiment in which the effects of many different interventions are compared in the same population on the same objectively measured outcome for the same duration.
In a megastudy targeting physical exercise among 61,293 members of an American fitness chain [24 Hour Fitness], 30 scientists from 15 different US universities worked in small independent teams to design a total of 54 different 4-week digital programmes (or interventions) encouraging exercise.
We show that 45% of these interventions significantly increased weekly gym visits by 9% to 27%; the top-performing intervention offered micro-rewards for returning to the gym after a missed workout. Only 8% of interventions induced behavior change that was statistically significant and measurable after the 4-week intervention ended. Conditioning on the 45% of interventions that increased exercise during the intervention period, we detected carry-over effects that were proportionally similar to those measured in previous research.
Forecasts by impartial judges failed to predict which interventions would be most effective, underscoring the value of testing many ideas at once and, therefore, the potential for megastudies to improve the evidentiary value of behavioral science.
…In the 4 weeks before joining our megastudy, participants’ mean number of weekly visits to the gym was 1.27 (s.d. = 1.48), and the mean percentage of participants who checked into the gym at least once in a given week was 47.7% (s.d. = 40.4%).
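For concreteness, baseline statistics of this kind could be computed from a participant-week panel along the following lines. This is a minimal sketch, not the study’s code; the column names (`participant_id`, `week`, `visits`) are assumptions.

```python
# Minimal sketch (not the authors' code) of the baseline statistics reported
# above, computed from a hypothetical panel with one row per participant-week
# and columns `participant_id`, `week` and `visits` (gym check-ins that week).
import pandas as pd

def baseline_stats(panel: pd.DataFrame) -> dict:
    """Mean/s.d. of weekly visits and of the weekly check-in indicator."""
    visits = panel["visits"]
    # Indicator for checking in at least once in a given week; averaging it
    # over participant-weeks gives the share reported as 47.7%.
    any_visit = (visits >= 1).astype(float)
    return {
        "mean_weekly_visits": visits.mean(),        # ~1.27 in the megastudy
        "sd_weekly_visits": visits.std(),           # ~1.48
        "pct_any_visit": 100 * any_visit.mean(),    # ~47.7%
        "sd_pct_any_visit": 100 * any_visit.std(),  # ~40.4%
    }
```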
Figure 1: Measured versus predicted changes in weekly gym visits induced by interventions. The measured change (blue) versus change predicted by third-party observers (gold) in weekly gym visits induced by each of the 53 experimental conditions in our megastudy compared with the placebo control condition during a 4-week intervention period. The error bars represent the 95% confidence intervals (see Extended Data Table 6 for the complete OLS regression results shown here in blue and the sample sizes for each condition; Supplementary Information 11 for more details about the prediction data shown in gold; and Supplementary Table 1 for full descriptions of each treatment condition in our megastudy). Sample weights were included in the pooled third-party prediction data to ensure equal weighting of each of our 3 participant samples (professors, practitioners and Prolific respondents). The superscripts a–e denote the different incentive amounts offered in different versions of the bonus for returning after missed workouts, higher incentives and rigidity rewarded conditions, which are described in Supplementary Table 1. In conditions with the same name, superscripts that come earlier in the alphabet indicate larger incentives.
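The OLS estimates and 95% confidence intervals in the caption compare each condition with the placebo control. Below is a hedged sketch of that kind of specification; the column names, the placebo reference label and the robust-covariance choice are illustrative assumptions, not the paper’s documented model.

```python
# Sketch of a condition-dummy OLS of the kind the caption describes: one
# coefficient per experimental condition, interpreted as the change in weekly
# gym visits relative to the placebo control, with 95% CIs.
import pandas as pd
import statsmodels.formula.api as smf

def estimate_treatment_effects(df: pd.DataFrame):
    """df: one row per participant, with `weekly_visits` (mean visits per week
    during the 4-week intervention) and `condition` ('placebo' = reference)."""
    model = smf.ols(
        "weekly_visits ~ C(condition, Treatment(reference='placebo'))",
        data=df,
    ).fit(cov_type="HC2")  # heteroskedasticity-robust SEs (an assumption)
    return model.params, model.conf_int(alpha=0.05)  # estimates and 95% CIs
```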
…Prediction accuracy: One could argue that the harder it is to predict the results of experiments, the more valuable the megastudy approach becomes: the more difficult it is to forecast ex ante which interventions will work, the harder it is to decide in advance which interventions to prioritize for testing, and the more useful it is to instead test a large number of treatment approaches at once.
To assess forecasting accuracy, we conducted a series of separate preregistered studies (see the ‘Data availability’ section) in which third-party observers were asked to predict the impact of 3 randomly selected interventions from our megastudy. We collected these data 14 months after conducting our megastudy. One study included 301 participants recruited from Prolific (who made a total of 903 predictions, or a mean of 17 predictions per treatment condition); another included 156 professors from the top 50 schools of public health as rated by US News & World Report in 2019 (who made a total of 468 predictions, or a mean of 9 predictions per treatment condition; a list of schools is provided in Supplementary Information 11); and a final study included 90 practitioners recruited from companies that specialize in applied behavioral science (who made a total of 270 predictions, or a mean of 5 predictions per treatment condition). See the ‘Prediction study participants’ section in the Methods for demographic information about the study participants.
We found no robust correlation (weighted pooled r = 0.02, p = 0.89) between these populations’ predicted treatment effects and the observed treatment effects (Prolific participants: r = 0.25, p = 0.07; professors: r = −0.07, p = 0.63; practitioners: r = −0.18, p = 0.19). Furthermore, predictions about the benefits of our interventions were, on average, 9.1× too optimistic (Figure 1b). Predictions of treatment effects for our secondary dependent variable (the likelihood of making a gym visit in a given week) were similarly inaccurate and are presented in Supplementary Information 11.
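To make the pooled statistic concrete, the sketch below shows one way a sample-weighted Pearson correlation and the optimism ratio could be computed; the data layout, the toy numbers and the exact form of the 9.1× calculation are assumptions, not the preregistered analysis.

```python
# Sketch (not the preregistered analysis code) of the two forecast-accuracy
# statistics: a weighted Pearson correlation between predicted and observed
# treatment effects, and a mean "optimism" ratio.
import numpy as np

def weighted_corr(x: np.ndarray, y: np.ndarray, w: np.ndarray) -> float:
    """Pearson correlation of x and y under non-negative weights w."""
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

# One row per prediction: the forecast, the matching condition's measured
# effect, and a weight of 1/n_s for a forecaster from a sample with n_s total
# predictions (professors: 468, practitioners: 270, Prolific: 903), so each
# sample contributes equally to the pooled estimate.
predicted = np.array([0.9, 1.2, 0.4, 0.7])     # toy forecasts
observed = np.array([0.10, 0.05, 0.12, 0.08])  # toy measured effects
weights = np.array([1 / 468, 1 / 270, 1 / 903, 1 / 903])

r = weighted_corr(predicted, observed, weights)
optimism = predicted.mean() / observed.mean()  # one route to a ~9.1x figure
```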
Taken together, these results highlight how difficult it is to predict ex ante the efficacy of interventions and why it is therefore so valuable that megastudies enable the synchronous testing of many different approaches to changing behavior.