“Generalizability of Heterogeneous Treatment Effect Estimates across Samples”, Alexander Coppock, Thomas J. Leeper, Kevin J. Mullinix (2018-11-16):

In experiments, the degree to which results generalize to other populations depends critically on the degree of treatment effect heterogeneity. We replicated 27 survey experiments (encompassing 101,745 individual survey responses) originally conducted on nationally representative samples using online convenience samples, finding very high correspondence despite obvious differences in sample composition. We contend this pattern is due to low treatment effect heterogeneity in these types of standard social science survey experiments.

The extent to which survey experiments conducted with non-representative convenience samples are generalizable to target populations depends critically on the degree of treatment effect heterogeneity. Recent inquiries have found a strong correspondence between sample average treatment effects estimated in nationally representative experiments and in replication studies conducted with convenience samples.

We consider here two possible explanations: low levels of effect heterogeneity or high levels of effect heterogeneity that are unrelated to selection into the convenience sample.

We analyze subgroup conditional average treatment effects using 27 original/replication study pairs (encompassing 101,745 individual survey responses) to assess the extent to which subgroup effect estimates generalize.

While there are exceptions, the overwhelming pattern that emerges is one of treatment effect homogeneity, providing a partial explanation for strong correspondence across both unconditional and conditional average treatment effect estimates.

Figure 1: Across-study correspondence of CATEs [conditional average treatment effect, difference-in-means]

Figure 1 displays scatterplots of the estimated CATEs, subgroup by subgroup. The relationship between the conditional average treatment effects in the original and Mechanical Turk versions of the studies is unequivocally positive for all demographic subgroups. Whereas previous analyses of these datasets showed strong correspondence of average treatment effects, this analysis shows that the same pattern holds at every level of age, gender, race, education, ideology, and partisanship that we measure.
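A subgroup CATE of the kind plotted in Figure 1 is just a difference-in-means computed within each demographic subgroup. A minimal sketch on simulated data (the variable names, subgroup labels, and the constant 0.3 effect are illustrative assumptions, not the authors' data):

```python
import random

random.seed(0)

# Hypothetical respondents: a treatment indicator, a demographic subgroup
# label, and a continuous outcome. All names here are illustrative.
respondents = [
    {"treated": random.random() < 0.5,
     "group": random.choice(["liberal", "moderate", "conservative"]),
     "outcome": random.gauss(0, 1)}
    for _ in range(1000)
]
# Impose a constant (homogeneous) treatment effect of 0.3, matching the
# paper's headline finding of low effect heterogeneity.
for r in respondents:
    if r["treated"]:
        r["outcome"] += 0.3

def cate(rows):
    """Difference-in-means estimate of the average treatment effect."""
    treated = [r["outcome"] for r in rows if r["treated"]]
    control = [r["outcome"] for r in rows if not r["treated"]]
    return sum(treated) / len(treated) - sum(control) / len(control)

for g in ["liberal", "moderate", "conservative"]:
    subgroup = [r for r in respondents if r["group"] == g]
    print(g, round(cate(subgroup), 2))
```

Because the simulated effect is homogeneous, each subgroup's estimate lands near 0.3 up to sampling noise, which is exactly the "tight clustering" pattern the paper reports.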

The figure also indicates whether the CATEs are statistically-significantly different from each other. Out of 393 opportunities, the difference-in-CATEs is statistically-significant 59×, or 15% of the time. In none of the 393 opportunities do the CATEs have different signs while both being statistically distinguishable from 0. There is also a close correspondence of statistical-significance tests for CATEs across study pairs. Of the 156 CATEs that were statistically-significantly different from no effect in the original, 118 are statistically-significantly different from no effect in the Mechanical Turk replication. Of the 237 CATEs that were statistically indistinguishable from no effect in the original, 158 were statistically indistinguishable from 0 in the Mechanical Turk version.
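The "significance match" rate quoted below follows directly from these counts (a trivial arithmetic check using only the figures in the text):

```python
# Tallies as reported: 156 original CATEs were significant (118 of which
# replicated as significant); 237 were null (158 of which replicated as null).
sig_both = 118
sig_orig_only = 156 - 118   # significant originally, null in replication
null_both = 158
null_orig_only = 237 - 158  # null originally, significant in replication

total = 156 + 237           # 393 CATE comparisons
match_rate = (sig_both + null_both) / total
print(f"{match_rate:.0%}")  # → 70%
```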

The overall “significance match” rate is therefore 70%. We must be careful, however, not to over-interpret conclusions based on this statistic, as it is confounded by the power of the studies. If the studies were infinitely powered, all estimates of non-0 CATEs in both versions of the study would be statistically-significant, and therefore, the match rate would be 100%. By contrast, if all studies were severely underpowered, almost all estimates would be non-statistically-significant, again implying a match rate of 100%. We therefore prefer evaluating correspondence across studies based on (error-corrected) regression slopes, since they directly operate on the estimates themselves rather than on arbitrary statistical-significance levels.
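The text does not spell out the exact error correction used for these slopes; one standard way to correct an across-study slope for sampling noise in both axes (the original and replication CATE estimates are both estimated, not observed) is an errors-in-variables fit such as Deming regression. A hedged sketch under the assumption of equal error variances (delta = 1), showing how the ordinary least-squares slope is attenuated toward 0 while the corrected slope recovers the true slope of 1:

```python
import math
import random

def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def deming_slope(x, y, delta=1.0):
    """Errors-in-variables (Deming) slope; delta is the assumed ratio of
    error variances, var(noise in y) / var(noise in x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return ((syy - delta * sxx
             + math.sqrt((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2))
            / (2 * sxy))

# Simulated CATE pairs: one true value per subgroup, measured twice with noise.
random.seed(1)
true_cates = [random.gauss(0, 1) for _ in range(500)]
x = [t + random.gauss(0, 0.5) for t in true_cates]  # "original" estimates
y = [t + random.gauss(0, 0.5) for t in true_cates]  # "replication" estimates

b_ols = ols_slope(x, y)        # attenuated by the estimation noise in x
b_deming = deming_slope(x, y)  # corrected; lands near the true slope of 1
print(round(b_ols, 2), round(b_deming, 2))
```

This illustrates why an uncorrected slope below 1 need not indicate genuine divergence between samples: noise in the x-axis estimates alone pulls the OLS slope down.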

The estimated slopes across CATEs are shown in Table 2: the slopes are all positive, ranging from 0.71 to 1.01. A true slope of 1 would indicate perfect correspondence of original and replication CATEs within demographic subgroups. All but one of the 95% CIs include 1, but the intervals are sometimes quite wide, so we resist “accepting the null” of perfect correspondence. The CI for the conservative group (just barely) excludes 1, which aligns with a common belief that conservatives on Mechanical Turk are especially idiosyncratic, though this view is challenged in ref. 21. Overall, we conclude that in this set of studies, the estimated CATEs within demographic subgroups are quite similar.

…We now have two basic findings to explain: Average treatment effects are the same in probability and nonprobability samples and so are CATEs. Which of our explanations (no heterogeneity or heterogeneity orthogonal to selection) can account for both findings?

To arbitrate between these explanations, we turn to within-study comparisons. Within a given study, we ask, are the CATEs that were estimated to be high in the original study also high in the Mechanical Turk version? Figure 2 shows that the answer tends to be no. The CATEs in the original study are mostly uncorrelated with the CATEs in the Mechanical Turk versions. Table 1 confirms what the visual analysis suggests: the within-study slopes are smaller than the across-study slopes, and they take both positive and negative signs.

Figure 2: Within-study correspondence of CATEs.

An inspection of the CATEs themselves reveals why. Most of the CATEs are tightly clustered around the overall average treatment effect in each study version. Put differently, the treatment effects within each study version appear to be mostly homogeneous. We conclude from this preliminary analysis that the main reason why we observe strong correspondence in average treatment effects is low treatment effect heterogeneity.

…As a result, the convenience samples we analyze provide useful estimates not only of the PATE but also of subgroup CATEs. The reason for this is that there appears to be little effect heterogeneity—as seen in the tight clustering of CATEs in each panel of Figure 2. Lacking such heterogeneity, any subgroup provides a reasonable estimate of not only the CATE but the PATE as well. In cases where some heterogeneity appears to be present, CATEs in each study pair rarely differ substantially from one another. Our results indicate that even descriptively unrepresentative samples constructed with no design-based justification for generalizability still tend to produce useful estimates not just of the SATE but also of subgroup CATEs that generalize quite well.

Important caveats are in order. First, we have not considered all possible survey experiments, let alone all possible experiments in other modes or settings. Our pairs of studies were limited to those conducted in an online mode on samples of US residents. However, this set of studies is also quite comprehensive, drawing from multiple social science disciplines, using a variety of experimental stimuli and outcome question formats. The studies are also drawn not just from published research (which we might expect to be subject to publication biases) but from a sample of experiments fielded by Time-Sharing Experiments for the Social Sciences (TESS).

…Perhaps the most controversial conclusion that could be drawn from the present research is that we should be more skeptical of extant claims of effect moderation. A common post hoc data analysis procedure is to examine whether subgroups differ in their apparent response to treatment. We find only limited evidence that such moderation occurs and, when it does, the differences in effect sizes across groups are small. The response to this evidence should not be that any convenience sample can be used to study any treatment without concern about generalizability (23), but rather that debates about generalizability and replication must focus on the underlying causes of replication and non-replication, most importantly among these the variation in treatment effects across experimental units.