Nudge interventions have quickly expanded from academic studies to large-scale implementation in so-called Nudge Units in governments. This provides an opportunity to compare interventions in research studies versus at scale. We assemble a unique data set of 126 RCTs covering over 23 million individuals, including all trials run by 2 of the largest Nudge Units in the United States. We compare these trials to a sample of nudge trials published in academic journals from 2 recent meta-analyses.
In papers published in academic journals, the average impact of a nudge is very large: an 8.7 percentage point take-up effect, a 33.5% increase over the average control. In the Nudge Unit trials, the average impact is still sizable and highly statistically significant, but smaller at 1.4 percentage points, an 8.1% increase (8.7/1.4 ≈ 6.2×).
We consider 5 potential channels for this gap: statistical power, selective publication, academic involvement, and differences in trial features and in nudge features. Publication bias in the academic journals, exacerbated by low statistical power, can account for the full difference in effect sizes. Academic involvement does not account for the difference. Different features of the nudges, such as in-person versus letter-based communication, likely reflecting institutional constraints, can partially explain the different effect sizes.
We conjecture that larger sample sizes and institutional constraints, which play an important role in our setting, are relevant in other at-scale implementations. Finally, we compare these results to the predictions of academics and practitioners. Most forecasters overestimate the impact for the Nudge Unit interventions, though nudge practitioners are almost perfectly calibrated.
Figure 4: Nudge treatment effects. This figure plots the treatment effect relative to control group take-up for each nudge. Nudges with extreme treatment effects are labeled for context.
…In this paper, we present the results of a unique collaboration with 2 of the major “Nudge Units”: BIT North America, operating at the level of US cities, and SBST/OES for the US Federal government. These 2 units kept a comprehensive record of all trials that they ran from inception in 2015 to July 2019, for a total of 165 trials testing 349 nudge treatments and a sample size of over 37 million participants. In a remarkable case of administrative transparency, each trial had a trial report, including in many cases a pre-analysis plan. The 2 units worked with us to retrieve the results of all the trials. Importantly, over 90% of these trials have not been documented in working paper or academic publication format. [emphasis added]
…Since we are interested in comparing the Nudge Unit trials to nudge papers in the literature, we aim to find broadly comparable studies in academic journals, without hand-picking individual papers. We lean on 2 recent meta-analyses summarizing over 100 RCTs across many different applications (Benartzi et al. 2017, and Hummel and Maedche 2019). We apply similar restrictions as in the Nudge Unit sample: we exclude lab or hypothetical experiments, non-RCTs, treatments with financial incentives, and default effects, and we require treatments with binary dependent variables. This leaves a final sample of 26 RCTs, including 74 nudge treatments with 505,337 participants. Before we turn to the results, we stress that the features of behavioral interventions in academic journals do not perfectly match the nudge treatments implemented by the Nudge Units, a difference to which we indeed return below. At the same time, overall, the interventions conducted by Nudge Units are fairly representative of the type of nudge treatments that are run by researchers.
What do we find? In the sample of 26 papers in the Academic Journals sample, we compute the average (unweighted) impact of a nudge across the 74 nudge interventions. We find that on average a nudge intervention increases take-up by 8.7 (s.e. = 2.5) percentage points, out of an average control take-up of 26.0 percentage points.
Turning to the 126 trials by Nudge Units, we estimate an unweighted impact of 1.4 percentage points (s.e. = 0.3), out of an average control take-up of 17.4 percentage points. While this impact is sizable and highly statistically significantly different from 0, it is about one-sixth the size of the estimated nudge impact in academic papers. What explains this large difference in the impact of nudges?
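The relative effects quoted above follow directly from the reported figures; a quick arithmetic check (no assumptions beyond the numbers in the text):

```python
# Relative effect = absolute take-up gain / average control take-up
academic_rel = 8.7 / 26.0    # ~0.335: the 33.5% relative increase
nudge_unit_rel = 1.4 / 17.4  # ~0.081: the 8.1% relative increase
gap = 8.7 / 1.4              # ~6.2: academic estimate is roughly 6x larger

print(round(academic_rel, 3), round(nudge_unit_rel, 3), round(gap, 1))
```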
We discuss 3 features of the 2 samples which could account for this difference. First, we document a large difference in the sample size and thus statistical power of the interventions. The median nudge intervention in the Academic Journals sample has a treatment arm sample size of 484 participants and a minimum detectable effect size (MDE, the effect size that can be detected with 80% power) of 6.3 percentage points. In contrast, the nudge interventions in the Nudge Units have a median treatment arm sample size of 10,006 participants and an MDE of 0.8 percentage points. Thus, the minimum detectable effect for the trials in the Academic Journals sample is nearly an order of magnitude larger. This illustrates a key feature of the “at scale” implementation: running trials in an administrative setting allows for a much larger sample size. Importantly, the smaller sample size for the Academic Journals papers could lead not just to noisier estimates, but also to upward-biased point estimates in the presence of publication bias.
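The MDE figures above can be roughly reproduced with the standard normal-approximation power formula for a two-arm comparison of proportions. The sketch below assumes a 5% two-sided test at 80% power and uses the median control take-up rates; the paper's own 6.3 and 0.8 point figures depend on each trial's specific control rate and arm sizes, so this only recovers the order of magnitude, not the exact numbers:

```python
from math import sqrt
from statistics import NormalDist

def mde_two_proportions(p, n_per_arm, alpha=0.05, power=0.80):
    """MDE ~= (z_{1-alpha/2} + z_{power}) * sqrt(2 * p * (1 - p) / n),
    the usual normal approximation for comparing two proportions."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_power = nd.inv_cdf(power)          # 0.84 for 80% power
    return (z_alpha + z_power) * sqrt(2 * p * (1 - p) / n_per_arm)

# Academic Journals sample: control take-up 26.0%, median arm n = 484
print(round(mde_two_proportions(0.26, 484), 3))
# Nudge Units sample: control take-up 17.4%, median arm n = 10,006
print(round(mde_two_proportions(0.174, 10_006), 3))
```

Even with these stylized inputs, the academic-sample MDE comes out several times larger than the Nudge Unit MDE, matching the power gap described in the text.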
A second difference, directly zooming into publication bias, is the evidence of selective publication of studies with statistically significant results (t > 1.96), versus studies that are not statistically significant (t < 1.96). In the sample of Academic Journals nudges, there are over 4× as many studies whose most statistically significant nudge has a t-statistic between 1.96 and 2.96 as studies whose most statistically significant nudge has a t-statistic between 0.96 and 1.96. Interestingly, the publication bias appears to operate at the level of the most statistically significant treatment arm within a paper. By comparison, we find no evidence of a discontinuity in the distribution of t-statistics for the Nudge Unit sample, consistent with the fact that the Nudge Unit registry contains the comprehensive sample of all studies run. We stress here that with “publication bias” we include not just whether a journal would publish a paper, but also whether a researcher would write up a study (the “file drawer” problem). In the Nudge Units sample, all these selective steps are removed, as we access all studies that were run.
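The bunching diagnostic described above can be sketched in a few lines. This is an illustrative implementation, not the paper's code, and the sample t-statistics below are made up for the example:

```python
def max_t_bunching_ratio(max_t_stats, cutoff=1.96, width=1.0):
    """Count papers whose most significant t-statistic falls just above
    the significance cutoff versus just below it. A ratio well above 1
    is the bunching pattern consistent with selective publication."""
    above = sum(cutoff <= t < cutoff + width for t in max_t_stats)
    below = sum(cutoff - width <= t < cutoff for t in max_t_stats)
    return above / below if below else float("inf")

# Hypothetical per-paper maximum t-statistics (not the paper's data):
sample = [0.4, 1.2, 2.1, 2.3, 2.0, 2.7, 1.5, 2.4, 3.5]
print(max_t_bunching_ratio(sample))  # 5 just above vs 2 just below -> 2.5
```

A comprehensive registry like the Nudge Unit sample should yield a ratio near 1, since no study is filtered out between being run and being recorded.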