“Many Labs 2: Investigating Variation in Replicability Across Sample and Setting”, Richard Klein, Michelangelo, Vianello, Fred, Hasselman, Byron, Adams, Reginald B. Adams Junior, Sinan Alper, Mark Aveyard, Jordan Axt, Mayowa Babalola, Štěpán Bahník, Mihaly Berkics, Michael J. Bernstein, Daniel Berry, Olga Bialobrzeska, Konrad Bocian, Mark Brandt, Robert Busching, Huajian Cai, Fanny Cambier, Katarzyna Cantarero, Cheryl Carmichael, Zeynep Cemalcilar, Jesse Chandler, Jen-Ho Chang, Armand Chatard, Eva CHEN, Winnee Cheong, David Cicero, Sharon Coen, Jennifer Coleman, Brian Collisson, Morgan Conway, Katherine Corker, Paul Curran, Fiery Cushman, Ilker Dalgar, William Davis, Maaike de Bruijn, Marieke de Vries, Thierry Devos, Canay Doğulu, Nerisa Dozo, Kristin Dukes, Yarrow Dunham, Kevin Durrheim, Matthew Easterbrook, Charles Ebersole, John Edlund, Alexander English, Anja Eller, Carolyn Finck, Miguel-Ángel Freyre, Mike Friedman, Natalia Frankowska, Elisa Galliani, Tanuka Ghoshal, Steffen Giessner, Tripat Gill, Timo Gnambs, Angel Gomez, Roberto Gonzalez, Jesse Graham, Jon Grahe, Ivan Grahek, Eva Green, Kakul Hai, Matthew Haigh, Elizabeth Haines, Michael Hall, Marie Heffernan, Joshua Hicks, Petr Houdek, Marije van, der Hulst, Jeffrey Huntsinger, Ho Huynh, Hans IJzerman, Yoel Inbar, Åse Innes-Ker, William Jimenez-Leal, Melissa-Sue John, Jennifer Joy-Gaba, Roza Kamiloglu, Andreas Kappes, Heather Kappes, Serdar Karabati, Haruna Karick, Victor Keller, Anna Kende, Nicolas Kervyn, Goran Knezevic, Carrie Kovacs, Lacy Krueger, German Kurapov, Jaime Kurtz, Daniel Lakens, Ljiljana Lazarevic, Carmel Levitan, Neil Lewis, Samuel Lins, Esther Maassen, Angela Maitner, Winfrida Malingumu, Robyn Mallett, Satia Marotta, Jason McIntyre, Janko Međedović, Taciano Milfont, Wendy Morris, Andriy Myachykov, Sean Murphy, Koen Neijenhuijs, Anthony Nelson, Felix Neto, Austin Nichols, Susan O’Donnell, Masanori Oikawa, Gabor Orosz, Malgorzata Osowiecka, Grant Packard, Rolando Pérez, Boban Petrovic, Ronaldo Pilati, Brad Pinter, Lysandra Podesta, Monique Pollmann, Anna Dalla Rosa, Abraham Rutchick, Patricio Saavedra, Airi Sacco, Alexander Saeri, Erika Salomon, Kathleen Schmidt, Felix Schönbrodt, Maciek Sekerdej, David Sirlopu, Jeanine Skorinko, Michael Smith, Vanessa Smith-Castro, Agata Sobkow, Walter Sowden, Philipp Spachtholz, Troy Steiner, Jeroen Stouten, Chris Street, Oskar Sundfelt, Ewa Szumowska, Andrew Tang, Norbert Tanzer, Morgan Tear, Jordan Theriault, Manuela Thomae, David Torres-Fernández, Jakub Traczyk, Joshua Tybur, Adrienn Ujhelyi, Marcel van Assen, Anna van ’t Veer, Alejandro, Vásquez-Echeverría Leigh, Ann Vaughn, Alexandra Vázquez, Diego Vega, Catherine Verniers, Mark Verschoor, Ingrid Voermans, Marek Vranka, Cheryl Welch, Aaron Wichman, Lisa Williams, Julie Woodzicka, Marta Wronska, Liane Young, John Zelenski, Brian Nosek2019-11-19 (scientific bias; backlinks; similar):
We conducted preregistered replications of 28 classic and contemporary published findings with protocols that were peer reviewed in advance to examine variation in effect magnitudes across sample and setting. Each protocol was administered to ~half of 125 samples and 15,305 total participants from 36 countries and territories.
Using conventional statistical-significance (p < 0.05), 15 (54%) of the replications provided evidence in the same direction and statistically-significant as the original finding. With a strict statistical-significance criterion (p < 0.0001), 14 (50%) provide such evidence reflecting the extremely high powered design. 7 (25%) of the replications had effect sizes larger than the original finding and 21 (75%) had effect sizes smaller than the original finding. The median comparable Cohen’s d effect sizes for original findings was 0.60 and for replications was 0.15. 16 replications (57%) had small effect sizes (< 0.20) and 9 (32%) were in the opposite direction from the original finding.
Across settings, 11 (39%) showed statistically-significant heterogeneity using the Q statistic and most of those were among the findings eliciting the largest overall effect sizes; only one effect that was near zero in the aggregate showed statistically-significant heterogeneity. Only one effect showed a Tau > 0.20 indicating moderate heterogeneity. 9 others had a Tau near or slightly above 0.10 indicating slight heterogeneity. In moderation tests, very little heterogeneity was attributable to task order, administration in lab versus online, and exploratory WEIRD versus less WEIRD culture comparisons.
Cumulatively, variability in observed effect sizes was more attributable to the effect being studied than the sample or setting in which it was studied.