Regression To A Mean Fallacies
Regression-to-a-mean is a general statistical phenomenon which leads to several widespread fallacies in analyzing & interpreting statistical results, such as ‘residual confounding’ and ‘Lord’s paradox’.
Regression to a mean: If you select something highly unusual in one way, it will probably be more usual in other ways, because stuff is usually more usual than unusual. This has many implications, such as the “regression fallacy”, but it also leads to additional errors, particularly when combined with measurement error:
Residual Confounding: “Statistically Controlling for Confounding Constructs Is Harder than You Think”, Westfall & Yarkoni 2016 (Phillips & Smith 1992/Phillips & Smith 1991/Smith et al 1992, Lawlor et al 2004a/Lawlor et al 2004b, Smith et al 2007/Fewell et al 2007/Smith 2011, Pingault et al 2021).¹ Example: Parker et al 2021.
“Impossibly hungry judges”, Lakens
A newer twist on residual confounding is to include a polygenic score, which typically measures a small fraction of all genetic influences on many outcomes & environmental measures (usually much less than half of the genetic variance), and declare that “[all] genetics have been controlled for” and proceed to interpret all remaining model coefficients as purely-environmental causal variables, and all remaining group differences as various kinds of societal bias or discrimination or environmental/nurture (eg. Cuevas et al 2021, Harden et al 2021, Engzell et al 2020, Lin 2020, Barnes et al 2019, Sauce et al 2022). An example of correctly using PGSes (using the Pingault et al 2021 method to extrapolate the known incompleteness of the PGS to full heritability) is Baldwin et al 2022.
“Evaluating the Effect of Inadequately Measured Variables in Partial Correlation Analysis”, Stouffer 1936
“Gifted Today But Not Tomorrow? Longitudinal Changes in Ability and Achievement in Elementary School”, Lohman & Korb 2006 (Challenges in gifted education in elementary or earlier: IQ scores are unstable and so regression to a mean implies that few children in G&T programs will grow up to be gifted); Genius Revisited Revisited (early childhood IQ is measured with great error, and so extremely-high-IQ elementary schools select much less high IQ adults and correspondingly unimpressive results, in contrast to later selection); “To Understand Regression From Parent to Offspring, Think Statistically”, Humphreys 1978
“Regression Fallacies in the matched groups experiment”, Thorndike 1942
“Control of Spurious Association and the Reliability of the Controlled Variable”, Kahneman 1965; “Nuisance Variables and the Ex Post Facto Design”, Meehl 1970
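A minimal simulation sketch of the residual-confounding problem in the items above, under assumed conditions: a single latent confounder drives both an exposure and an outcome, the exposure has no causal effect at all, and the analyst ‘controls’ for a noisy proxy of the confounder (for instance, a polygenic score capturing only part of the genetic variance). The function name `exposure_coefficient`, the unit variances, and the reliability values are illustrative assumptions, not taken from any of the cited papers.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def exposure_coefficient(reliability, n=50_000, seed=1):
    """OLS coefficient on the exposure after 'controlling' for a noisy proxy.

    reliability = share of the proxy's variance due to the true confounder
    (1.0 means the confounder is measured perfectly).
    """
    rng = random.Random(seed)
    proxy_noise_sd = ((1 - reliability) / reliability) ** 0.5
    confounder = [rng.gauss(0, 1) for _ in range(n)]
    exposure = [c + rng.gauss(0, 1) for c in confounder]
    outcome = [c + rng.gauss(0, 1) for c in confounder]  # exposure plays no role
    proxy = [c + rng.gauss(0, proxy_noise_sd) for c in confounder]

    # Two-predictor OLS (outcome ~ exposure + proxy) via the normal equations.
    s_ee, s_pp = covariance(exposure, exposure), covariance(proxy, proxy)
    s_ep = covariance(exposure, proxy)
    s_eo, s_po = covariance(exposure, outcome), covariance(proxy, outcome)
    return (s_eo * s_pp - s_po * s_ep) / (s_ee * s_pp - s_ep ** 2)


for r in (1.0, 0.5, 0.25):
    print(f"proxy reliability {r:.2f}: exposure coefficient {exposure_coefficient(r):+.3f}")
```

Under these particular assumptions the expected coefficient works out to (1 − r)/(2 − r) for proxy reliability r: it is zero only when the confounder is measured perfectly, and it grows as the proxy gets noisier, even though the exposure does nothing.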
Kelley’s Paradox (cf. Lord’s paradox): the Roman poet Terence noted that “When two do the same, it isn’t the same.”; when we have prior knowledge about score distributions, 2 identical scores may have different implications because they will be shrunk differently by regression to a mean, and this will be stronger the more extreme the scores are & the larger the prior differences. (If measurements are not corrected and their predictive accuracy is reduced by leaving them as ‘raw’, this may manifest in the real world as statistical discrimination as optimizing agents learn to implicitly correct and ‘discriminate’.) This frequently comes up in standardized testing & exams, where one of the first to point out the implications was Truman Lee Kelley:
Statistical Method, Kelley 1923; Interpretation of Educational Measurements, Kelley 1927; Fundamentals of Statistics, Kelley 1947
Longitudinal: “Interpreting regression toward the mean in developmental research”, Furby 1973; “Lord’s Paradox in a Continuous Setting and a Regression Artifact in Numerical Cognition Research”, Eriksson & Häggström 2014; “Allocation to groups: Examples of Lord’s paradox”, Wright 2019
“The relevance of group membership for personnel selection: A demonstration using Bayes’ theorem”, Miller 1994
“Kelley’s Paradox”, Wainer 2000
“Three Statistical Paradoxes in the Interpretation of Group Differences: Illustrated with Medical School Admission and Licensing Data”, Wainer & Brown 2006³
“Measurement Error, Regression to a mean, and Group Differences” (eg. selective attrition from college majors differs by group, and so a measurement like “has a bachelor’s degree” means different things by group)
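Kelley’s paradox can be made concrete with Kelley’s regressed true-score estimate, which shrinks an observed score toward the mean of the population it was drawn from, in proportion to the test’s unreliability. The sketch below uses made-up numbers (a score of 130, a reliability of 0.8, and group means of 100 and 115); only the formula itself is standard.

```python
def kelley_estimate(observed, group_mean, reliability):
    """Kelley's regressed true-score estimate: shrink toward the group mean."""
    return reliability * observed + (1 - reliability) * group_mean


# Hypothetical numbers: same observed score, same test reliability,
# but the two examinees come from populations with different means.
score, reliability = 130, 0.8
estimate_a = kelley_estimate(score, group_mean=100, reliability=reliability)
estimate_b = kelley_estimate(score, group_mean=115, reliability=reliability)
print(f"group mean 100: {estimate_a:.1f}; group mean 115: {estimate_b:.1f}")
```

With these inputs the two identical observed scores imply expected true scores of about 124 and 127. The gap widens as the reliability falls, as the score becomes more extreme, or as the group means move further apart.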
Winner’s curse: “The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis”, Smith & Winkler 2006 (decisions with different accuracies/measurement-error will suffer different regression to a mean and be biased towards the most-overestimated options)
“Predicting the Next Big Thing: Success as a Signal of Poor Judgment”, Denrell & Fang 2010 constructs a scenario where regression is so severe that being in the tail constitutes evidence for true below-average accuracy
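A small simulation sketch of the optimizer’s curse, assuming ten options whose true values are standard normal and whose estimates carry independent noise, with half of the options estimated precisely and half noisily. The option count, noise levels, and the name `optimizers_curse` are illustrative choices, not Smith & Winkler’s setup.

```python
import random


def optimizers_curse(trials=20_000, seed=2):
    """Pick the option with the highest noisy estimate, many times over.

    Returns (mean estimate of the pick, mean true value of the pick,
    share of picks that came from the noisier half of the options).
    """
    rng = random.Random(seed)
    noise_sds = [0.5] * 5 + [2.0] * 5  # illustrative: 5 precise, 5 noisy estimates
    total_estimate = total_true = noisy_picks = 0.0
    for _ in range(trials):
        true_values = [rng.gauss(0, 1) for _ in noise_sds]
        estimates = [v + rng.gauss(0, sd) for v, sd in zip(true_values, noise_sds)]
        best = max(range(len(estimates)), key=estimates.__getitem__)
        total_estimate += estimates[best]
        total_true += true_values[best]
        noisy_picks += noise_sds[best] > 1
    return total_estimate / trials, total_true / trials, noisy_picks / trials


estimated, realized, noisy_share = optimizers_curse()
print(f"estimated value of pick: {estimated:.2f}, realized: {realized:.2f}")
print(f"share of picks from the noisier options: {noisy_share:.0%}")
```

The chosen option’s estimate exceeds its realized value on average (the “postdecision surprise”), and most picks come from the noisily estimated options, since those are the ones most likely to be badly overestimated.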
Dunning-Kruger Effect: the famous Dunning-Kruger effect can be caused by regression to a mean from measurement error, due to floor/ceiling effects in measured performance vs expressed confidence; “backfire effects” can also be manufactured this way (Swire-Thompson et al 2020)
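As a rough sketch of how measurement error alone can produce a Dunning-Kruger-style pattern, the simulation below gives everyone a self-assessment that is exactly as accurate as the test itself: both are the same latent skill plus independent noise, converted to percentile ranks. The equal noise levels and the names `percentile_ranks` and `quartile_gaps` are assumptions for illustration, and floor/ceiling effects are not modeled separately.

```python
import random


def percentile_ranks(values):
    """Percentile rank (0-100) of each value within its own list."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = 100 * (position + 0.5) / len(values)
    return ranks


def quartile_gaps(n=20_000, seed=3):
    """Mean (test percentile, self-rated percentile) in the bottom and top test quartiles."""
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n)]
    test = percentile_ranks([s + rng.gauss(0, 1) for s in skill])
    self_rating = percentile_ranks([s + rng.gauss(0, 1) for s in skill])
    by_test = sorted(range(n), key=test.__getitem__)
    quarter = n // 4

    def mean_of(indices, xs):
        return sum(xs[i] for i in indices) / len(indices)

    bottom, top = by_test[:quarter], by_test[-quarter:]
    return {
        "bottom": (mean_of(bottom, test), mean_of(bottom, self_rating)),
        "top": (mean_of(top, test), mean_of(top, self_rating)),
    }


gaps = quartile_gaps()
for name, (test_pct, self_pct) in gaps.items():
    print(f"{name} quartile: test percentile {test_pct:.1f}, self-rated percentile {self_pct:.1f}")
```

The bottom quartile appears to ‘overestimate’ itself and the top quartile to ‘underestimate’ itself, although no one in the simulation has any bias in self-assessment.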
Placebo Effects: just regression to a mean (on Hróbjartsson et al 2010) after all?
‘Nocebo’ effects also suffer from this critique (eg. in sports statistics, the supposed harm of status like the “Madden curse” or the ‘Forbes cover’ effect)
A Primer on Regression Artifacts, Campbell & Kenny 1999
Replication Crisis: because systemic biases like p-hacking filter for the most extreme outliers⁴, published effect sizes will predictably decline over time, with additional replication, and with better methodologies
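A minimal sketch of this decline, assuming a single true effect of 0.2, a standard error of 0.2 per study, and a filter that ‘publishes’ only results with z > 1.96. These numbers and the function name `significance_filter` are hypothetical; real publication filters are messier.

```python
import random


def significance_filter(true_effect=0.2, standard_error=0.2, studies=20_000, seed=4):
    """Mean effect in 'published' (z > 1.96) studies vs. their unfiltered replications."""
    rng = random.Random(seed)
    published, replications = [], []
    for _ in range(studies):
        original = rng.gauss(true_effect, standard_error)
        if original / standard_error > 1.96:  # only significant results get published
            published.append(original)
            # An exact replication: same design, but reported whatever the result.
            replications.append(rng.gauss(true_effect, standard_error))
    return sum(published) / len(published), sum(replications) / len(replications)


published_mean, replication_mean = significance_filter()
print(f"true effect 0.20; published mean {published_mean:.2f}; replication mean {replication_mean:.2f}")
```

The published estimates are inflated simply because they had to clear the significance threshold, while the replications regress back to the true effect with no change in the underlying phenomenon.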
Baader-Meinhof Effect: Diaconis & Mosteller 1989 propose that one reason that a rare word may seem to abruptly appear repeatedly is simply that, for rare words which have gone unseen for an unusually long, over-due period, the wait until the next appearance will be more ordinary, and so it’ll re-appear ‘quickly’
See Also: Second product syndrome, Order statistics: The Probability of a Double Maximum, the James-Stein estimator (Efron & Morris 1977/Stigler 1990)
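The basic phenomenon underlying all of the items above can be sketched in a few lines: select the most extreme cases on one noisy measurement, then measure them again. The simulation assumes two equally noisy tests of the same latent trait with a test-retest correlation of 0.5; the name `simulate_retest` and all default values are illustrative.

```python
import random
import statistics


def simulate_retest(n=100_000, reliability=0.5, top_fraction=0.01, seed=0):
    """Select the top scorers on a noisy first test, then retest them.

    Each observed score is true score + independent noise, scaled so that
    observed scores have variance 1 and the two tests correlate at `reliability`.
    Returns (mean first-test score, mean retest score) of the selected group.
    """
    rng = random.Random(seed)
    true_sd = reliability ** 0.5
    noise_sd = (1 - reliability) ** 0.5
    people = []
    for _ in range(n):
        true_score = rng.gauss(0, true_sd)
        people.append((true_score + rng.gauss(0, noise_sd),
                       true_score + rng.gauss(0, noise_sd)))
    people.sort(key=lambda pair: pair[0], reverse=True)
    selected = people[: int(n * top_fraction)]
    first = statistics.fmean(pair[0] for pair in selected)
    second = statistics.fmean(pair[1] for pair in selected)
    return first, second


first, second = simulate_retest()
print(f"mean on selection test: {first:.2f}, mean on retest: {second:.2f}")
```

The selected group remains well above average on the retest, but only by about `reliability` times its original margin. Nothing happened to the group between the two tests; the selection step did all the work.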
This is part of why results in sociology/epidemiology/psychology are so unreliable: everything is correlated, but not only do researchers usually not control for genetics at all, they don’t even control for the things they think they control for! You have not controlled for SES by throwing in a discretized income variable measured in one year plus a discretized college degree variable. Variables which correlate with or predict some outcome such as poverty may be doing no more than correcting some measurement error (frequently, due to the heavy genetic loading of most outcomes, where “the environment is genetic”, correcting the omission of genetic information). This is why within-family designs are desirable even without worries about genetics: they hold constant shared-environment factors so you don’t need to measure or model them. Even a structural equation model (SEM) which explicitly incorporates measurement error may still have enough leakage to render ‘controlling’ misleading. Such confounding, where the highly-imperfect correlations drive pseudo-causal effects (which are just regression to a mean), is doubtless a reason why so many apparently-well-controlled & highly-replicable correlations fail in RCTs.↩︎
There are countless examples of incorrect interpretations of measured variables which are imperfectly correlated with their latent variables, requiring explicit correction for measurement error or range restriction, particularly when meta-analyzed (see Hunter & Schmidt 2004). For example, it is particularly common for researchers to claim that their favorite new trait (SES, personality, “emotional intelligence”, etc.) correlates more with an outcome than IQ, without noting that their sample has been selected to be high on IQ, or that their IQ measure has much more random error in it than their alternative (sometimes with an excuse about “second-order sampling error” like Allen et al 2020). It is then unsurprising yet uninformative if the raw correlation coefficient is larger, because the IQ correlate was biased towards zero much more heavily. This sort of argument-from-attenuated-variables is wrong, but doesn’t become a regression-to-the-mean fallacy until combined with something else.↩︎
The draft version is “Two statistical paradoxes in the interpretation of group differences: Illustrated with medical school admission and licensing data”.↩︎
Including but not limited to researcher malpractice; eg. the use of “genome-wide statistical-significance” to filter GWAS hits ensures a “winner’s curse”, and (contra critics) hits replicate as expected given their statistical power+regression-to-the-mean.↩︎