“Reproducible Variability: Assessing Investigator Discordance across 9 Research Teams Attempting to Reproduce the Same Observational Study”, Anna Ostropolets, Yasser Albogami, Mitchell Conover, Juan M. Banda, William A. Baumgartner Jr, Clair Blacketer, Priyamvada Desai, Scott L. DuVall, Stephen Fortin, James P. Gilbert, Asieh Golozar, Joshua Ide, Andrew S. Kanter, David M. Kern, Chungsoo Kim, Lana Y. H. Lai, Chenyu Li, Feifan Liu, Kristine E. Lynch, Evan Minty, Maria Ines Neves, Ding Quan Ng, Tontel Obene, Victor Pera, Nicole Pratt, Gowtham Rao, Nadav Rappoport, Ines Reinecke, Paola Saroufim, Azza Shoaibi, Katherine Simon, Marc A. Suchard, Joel N. Swerdel, Erica A. Voss, James Weaver, Linying Zhang, George Hripcsak, Patrick B. Ryan (2023-02-24):

Objective: Observational studies can impact patient care but must be robust and reproducible. Non-reproducibility is primarily caused by unclear reporting of design choices and analytic procedures. This study aimed to: (1) assess how the study logic described in an observational study could be interpreted by independent researchers and (2) quantify the impact of variability in those interpretations on patient characteristics.

Materials & Methods: Nine teams of highly qualified researchers reproduced a cohort from a study by Albogami et al. (2021). The teams were provided with the clinical codes and access to the tools used to create cohort definitions, such that the only variable part was their logic choices. We executed the teams’ cohort definitions against the database and compared the number of subjects, patient overlap, and patient characteristics.

Results: On average, the teams’ interpretations fully aligned with the master implementation in 4 of 10 inclusion criteria, with at least 4 deviations per team. Cohort sizes ranged from one-third of the master cohort size to 10 times that size (2,159–63,619 subjects, compared to 6,196 subjects in the master cohort). Median agreement was 9.4% (interquartile range 15.3–16.2%). Each team’s cohort differed statistically significantly from the master implementation in at least 2 baseline characteristics, and most teams differed in at least 5.

Conclusion: Independent research teams attempting to reproduce a study from its free-text description alone produce different implementations that vary in population size and composition. Sharing analytic code supported by a common data model and open-source tools allows a study to be reproduced unambiguously, thereby preserving the initial design choices.

[Keywords: reproducibility, observational data, credibility, open science]

Figure 5: Differences in patient characteristics between the master implementation and the teams’ implementations, colored by the absolute standardized difference of means (SDM). White indicates SDM < 0.1.