Now a crowdsourced project elsewhere. Seeking volunteers!



A medical reversal is when an existing treatment is found to be useless or harmful. Psychology has in recent years been racking up reversals: only 40-65% of its classic social results have replicated, in the weakest sense of finding ‘significant’ results in the same direction. (Even among those that replicated, the average effect found was half the originally reported effect.) Such errors are far less costly to society than medical errors, but it’s still pollution, so here’s the cleanup.


Psychology is not alone: medicine, cancer biology, and economics all have many irreplicable results. But it’d be wrong to write off psychology: we know about most of the problems here because of psychologists, and its subfields differ a lot in replication rate and effect-size shrinkage.

One reason psychology reversals are so prominent is that it’s an unusually ‘open’ field in terms of code and data sharing. A less scientific field would never have caught its own bullshit.

The following are empirical findings about empirical findings; they’re all open to re-reversal. Also it’s not that “we know these claims are false”: failed replications (or proofs of fraud) usually just challenge the evidence for a hypothesis, rather than affirm the opposite hypothesis. I’ve tried to ban myself from saying “successful” or “failed” replication, and to report the best-guess effect size rather than play the bad old Yes/No science game.


Figures correct as of March 2020; I will put some effort into keeping this current, but not that much.
Code for converting means to Cohen’s d and Hedges’ g is here.
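
For reference, a minimal sketch of the standard conversions (my own illustrative code, not necessarily the linked script):

```python
# Standardised mean differences from two groups' summary statistics.
# Standard textbook formulas; function names are my own.
import math

def cohens_d(m1, m2, sd1, sd2, n1, n2):
    """Mean difference divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

def hedges_g(m1, m2, sd1, sd2, n1, n2):
    """Cohen's d with the small-sample bias correction J."""
    j = 1 - 3 / (4 * (n1 + n2) - 9)          # approximate correction factor
    return j * cohens_d(m1, m2, sd1, sd2, n1, n2)

# e.g. cohens_d(105, 100, 15, 15, 30, 30) ≈ 0.33
```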



Social psychology

No good evidence for many forms of priming, automatic behaviour change from ‘related’ (often only metaphorically related) stimuli.


  • Questionable evidence for elderly priming, that hearing about old age makes people walk slower. The p-curve alone argues against the first 20 years of studies.


  • No good evidence for professor priming, improved (“+13%”) performance at trivia after picturing yourself as a professor vs as a thug.


  • No good evidence for the Macbeth effect, that moral aspersions induce literal physical hygiene.


  • No good evidence for money priming, that “images or phrases related to money cause increased faith in capitalism, and the belief that victims deserve their fate”, etc.


  • No good evidence of anything from the Stanford prison ‘experiment’. It was not an experiment: there were ‘demand characteristics’ and scripting of the abuse, constant experimenter intervention, and faked reactions from participants; as Zimbardo concedes, they began with a complete “absence of specific hypotheses”.


  • No good evidence from the famous Milgram experiments that 65% of people will inflict pain if ordered to. The experiments were riddled with researcher degrees of freedom, going off-script, and implausible agreement between very different treatments; “only half of the people who undertook the experiment fully believed it was real and of those, 66% disobeyed the experimenter”.


  • No good evidence that tribalism arises spontaneously following arbitrary groupings and scarcity, within weeks, and leads to inter-group violence. The “spontaneous” conflict among children at Robbers Cave was orchestrated by experimenters; tiny sample (maybe 70?); an exploratory study taken as inferential; no control group; there were really three experimental groups, that is, the experimenters had full power to set expectations and endorse deviance; results from their two other studies, with negative results, were not reported.


  • Lots of screen-time is not strongly associated with low wellbeing; it explains about as much of teen sadness as eating potatoes does, around 0.35% of the variance (see the conversion sketch after this list).


  • No good evidence that female-named hurricanes are more deadly than male-named ones. The original effect size was a 176% increase in deaths, driven entirely by four outliers; reanalysis using a greatly expanded historical dataset found a nonsignificant decrease in deaths from female-named storms.


  • At most weak evidence for the use of implicit bias testing for racism. Implicit bias scores poorly predict actual bias, r = 0.15. The operationalisations used to measure that predictive power are often unrelated to actual discrimination (e.g. ambiguous brain activations). Test-retest reliability is 0.44 for race, which is usually classed as “unacceptable”. This isn’t news; the original study also found very low test-criterion correlations.


  • The Pygmalion effect, that a teacher’s expectations about a student affect the student’s performance, is at most small, temporary, and inconsistent: r < 0.1, dissipating after weeks. Rosenthal’s original claims about massive IQ gains persisting for years are straightforwardly false (“The largest gain… 24.8 IQ points in excess of the gain shown by the controls.”), and rested on an invalid test battery. Jussim: “90%–95% of the time, students are unaffected by teacher expectations”.


  • At most weak evidence for stereotype threat suppressing girls’ maths scores, i.e. for the interaction between gender and stereotyping.


  • Questionable evidence for an increase in “narcissism” (leadership, vanity, entitlement) in young people over the last thirty years. The basic counterargument is that these studies misidentify an age effect as a cohort effect (the narcissism construct apparently decreases by about a standard deviation between adolescence and retirement): “every generation is Generation Me”.
    All such “generational” analyses are at best needlessly noisy approximations of social change, since generations are not discrete natural kinds, and since people at the supposed boundaries are indistinguishable.

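Several of the bullets above and below report effects as correlations. Here is a quick sketch of the standard conversions involved (my own illustrative code; the r = 0.059 figure is an assumption chosen to reproduce the 0.35% variance claim):

```python
# Translating a correlation r into variance explained and into Cohen's d.
# Standard formulas; the example values are illustrative.
import math

def variance_explained(r):
    return r ** 2                 # r = 0.059  ->  ~0.35% (the screen-time figure)

def r_to_d(r):
    # Assumes two equal-sized groups; textbook conversion.
    return 2 * r / math.sqrt(1 - r ** 2)   # r = 0.05  ->  d ≈ 0.10

print(f"{variance_explained(0.059):.2%}", round(r_to_d(0.05), 2))
```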


Positive psychology

  • No good evidence that taking a “power pose” lowers cortisol, raises testosterone, or increases risk tolerance.

    That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.

    After the initial backlash, the defence shifted to a subjective effect, a claim about “increased feelings of power”. Even then: weak evidence, and only for decreased “feelings of power” from contractive posture. My reanalysis is here.


  • Weak evidence for facial-feedback (that smiling causes good mood and pouting bad mood).


  • Reason to be cautious about mindfulness for mental health. Most studies are low quality and use inconsistent designs; heterogeneity is higher than for other mental health treatments; and there’s strong reason to suspect reporting bias: none of the 36 meta-analyses before 2016 mentioned publication bias. The hammer may fall.


  • No good evidence for Blue Monday, that the third week in January is the peak of depression or low affect ‘as measured by a simple mathematical formula developed on behalf of Sky Travel’. You’d need a huge sample, in the thousands, to detect such an effect reliably, and this has never been done (a rough power sketch follows this list).

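To make “in the thousands” concrete, here is a rough power calculation. The d = 0.1 effect size is my assumption for a small seasonal mood shift, not a figure from the Blue Monday literature:

```python
# How many people per group to detect a small effect (assumed d = 0.1)
# with 80% power at alpha = 0.05, two-sided.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.1, alpha=0.05, power=0.8)
print(round(n_per_group))   # ≈ 1571 per group, i.e. thousands of subjects overall
```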

Cognitive psychology

  • Good evidence against ego depletion, that willpower is limited in a muscle-like fashion.


  • Mixed evidence for the Dunning-Kruger effect. No evidence for the “Mount Stupid” misinterpretation.


  • Questionable evidence for a tiny “depressive realism” effect, of increased predictive accuracy or decreased cognitive bias among the clinically depressed.


  • Questionable evidence for the “hungry judge” effect, of massively reduced acquittals (d=2) just before lunch. Case order isn’t independent of acquittal probability (“unrepresented prisoners usually go last and are less likely to be granted parole”); favourable cases may take predictably longer and so get pushed until after the recess; the effect size is implausibly large on priors; and the proposed explanation relied on ego depletion (see above).


  • No good evidence for multiple intelligences (in the sense of statistically independent components of cognition). Gardner, the inventor: “Nor, indeed, have I carried out experiments designed to test the theory… I readily admit that the theory is no longer current. Several fields of knowledge have advanced significantly since the early 1980s.”


  • At most weak evidence for brain training (that is, “far transfer” from daily training games to fluid intelligence) in general, in particular from the Dual n-Back game.


  • In general, be highly suspicious of anything that claims a positive permanent effect on adult IQ. Even in children the absolute maximum is 4-15 points for a powerful single intervention (iodine supplementation during pregnancy in deficient populations).

  • See also the hydrocephaly claim under “Neuroscience”.

  • Otherwise, cognitive psychology has a good replication rate.


Developmental psychology

  • Some evidence for a tiny effect of growth mindset (thinking that skill is improvable) on attainment.


  • Evidence for a small marshmallow effect, that a 4-year-old’s ability to delay gratification predicts educational outcomes at 15 or beyond (Mischel).
    After controlling for the socioeconomic status of the child’s family, the marshmallow effect is r = 0.05, or d = 0.1: one-tenth of a standard deviation for an additional minute of delay, with nonsignificant p-values. And since it’s usually easier to get SES data…


  • No good evidence that tailoring teaching to students’ preferred learning styles has any effect on objective measures of attainment. There are dozens of these inventories, and really you’d have to look at each. (I won’t.)



Personality psychology

  • Pretty good? One lab’s systematic replications found that effect sizes shrank by 20%, though. See the comments for someone with a fundamental critique.

  • Anything by Hans Eysenck should be considered suspect, but in particular these 26 ‘unsafe’ papers (including the one which says that reading prevents cancer).


Behavioural science

  • The effect of “nudges” (clever design of defaults) may be exaggerated in general. One big review found average effects were six times smaller than billed. (Not saying there are no big effects.)

  • Here are a few cautionary pieces on whether, aside from the pure question of reproducibility, behavioural science is ready to steer policy.


Marketing

  • Brian Wansink accidentally admitted gross malpractice; fatal errors were found in 50 of his lab’s papers. These include flashy results about increased portion size massively reducing satiety.


Neuroscience

  • No good evidence that brains contain one mind per hemisphere. The corpus callosotomy studies which purported to show “two consciousnesses” inhabiting the same brain were badly overinterpreted.

  • Very weak evidence for the existence of high-functioning (IQ ~ 100) hydrocephalic people. The hypothesis begins from extreme prior improbability; the effect of massive brain-volume loss is claimed to be on average positive for cognition; and the case studies are often questionable, involving little detailed study of the brains (e.g. 1970s scanners were not capable of the precision claimed).


  • Readiness potentials seem to be actually causal, not diagnostic. So Libet’s studies also do not show what they purport to. We still don’t have free will (since random circuit noise can tip us when the evidence is weak), but in a different way.

  • No good evidence for left/right hemisphere dominance correlating with personality differences. No clear hemisphere dominance at all in this study.



Psychiatry

  • At most extremely weak evidence that psychiatric hospitals (of the 1970s) could not detect sane patients in the absence of deception.


Parapsychology

  • No good evidence for precognition, undergraduates improving memory test performance by studying after the test. This one is fun because Bem’s statistical methods were “impeccable” in the sense that they were what everyone else was using. He is Patient Zero in the replication crisis, and has done us all a great service. (Heavily reliant on a flat / frequentist prior; evidence of optional stopping; forking paths analysis.)


Evolutionary psychology

  • Weak evidence for romantic priming, that looking at attractive women increases men’s conspicuous consumption, time discounting, and risk-taking. Weak despite there being 43 independent confirmatory studies: one of the strongest cases of publication bias / p-hacking ever found.


  • Questionable evidence for the menstrual cycle version of the dual-mating-strategy hypothesis (that “heterosexual women show stronger preferences for uncommitted sexual relationships [with more masculine men]… during the high-fertility ovulatory phase of the menstrual cycle, while preferring long-term relationships at other points”). Studies are usually tiny (median n=34, mostly over one cycle). Funnel plot looks ok though.


  • No good evidence that large parents have more sons (Kanazawa); original analysis makes several errors and reanalysis shows near-zero effect. (Original effect size: 8% more likely.)


  • At most weak evidence that men’s strength in particular predicts opposition to egalitarianism.



Psychophysiology

  • At most very weak evidence that sympathetic nervous system activity predicts political ideology in a simple fashion. In particular, the measure used was subjects’ skin conductance reaction to threatening or disgusting visual prompts, which is noisy and questionable.



Behavioural genetics

  • No good evidence that 5-HTTLPR is strongly linked to depression, insomnia, PTSD, anxiety, and more. See also COMT and APOE for intelligence, BDNF for schizophrenia, 5-HT2a for everything…

  • Be very suspicious of any such “candidate gene” finding (post-hoc data mining showing large >1% contributions from a single allele). 0/18 replications in candidate genes for depression. 73% of candidates failed to replicate in psychiatry in general. One big journal won’t publish them anymore without several accompanying replications. A huge GWAS, n = 1 million: “We find no evidence of enrichment for genes previously hypothesized to relate to risk tolerance.”




[What I propose] is not a reform of significance testing as currently practiced in soft-psych. We are making a more heretical point… We are attacking the whole tradition of null-hypothesis refutation as a way of appraising theories… Most psychology using conventional H_0 refutation in appraising the weak theories of soft psychology… [is] living in a fantasy world of “testing” weak theories by feeble methods.

Paul Meehl (1990)



What now? When the next flashy WEIRD paper out of a world-class university arrives, will we swallow it?

Andrew Gelman and others suggest deflating all single-study effect sizes you encounter in the social sciences, without waiting for the subsequent shrinkage from publication bias, measurement error, data-analytic degrees of freedom, and so on. There is no uniform factor, but it seems sensible to divide novel effect sizes by a number between 2 and 100 (depending on the study’s sample size, method, measurement noise, and maybe its p-value if it’s really tiny)…
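
As a toy illustration of that heuristic (my own sketch; the choice of divisor is a judgment call, not something Gelman prescribes):

```python
# Crude effect-size deflation: divide a novel, single-study effect by a
# skepticism factor between 2 and 100. The factor here is a made-up heuristic.
def deflate(effect_size, divisor):
    assert 2 <= divisor <= 100, "divisor should express serious skepticism"
    return effect_size / divisor

# A flashy d = 0.8 from a small, noisy, p = 0.04 study:
print(deflate(0.8, 10))   # treat it as d ≈ 0.08 until replicated
```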



See also



Thanks to Andrew Gelman, Stuart Ritchie, Anne Scheel, Daniël Lakens, Gwern Branwen, and Nick Brown for pointers to effectively all of these.

All honour to the hundreds of data thug / methodological terrorist psychologists I’ve cited, who in the last decade began the hard work of cleaning up their field.




Tags: long-content, science, psychology, lists