[Twitter, commentary; cf. SSC on Wiseman & Schlitz] The low reproducibility rate in the social sciences has made researchers hesitant to accept published findings at face value. Despite the advent of initiatives to increase transparency in research reporting, the field still lacks tools to verify the credibility of research reports.
In the present paper, we describe methodologies that let researchers craft highly credible research and allow their peers to verify this credibility. We demonstrate the application of these methods in a multi-laboratory replication of Bem’s Experiment 1 (Bem 2011) on extrasensory perception (ESP), which was co-designed by a consensus panel including both proponents and opponents of Bem’s original hypothesis [15 for, 14 against].
In the study we applied direct data deposition in combination with born-open data and real-time research reports [pushed to a GitHub repo, n = 200 at a time] to extend transparency to protocol delivery and data collection. We also used piloting, checklists, laboratory logs and video-documented trial sessions to ascertain as-intended protocol delivery, and external research auditors to monitor research integrity.
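For concreteness, here is a minimal sketch of how a born-open, direct-deposition pipeline of this kind might work; the repository path, file layout and commit wording are hypothetical assumptions for illustration, not the project's actual implementation, which also included tamper-evident measures not shown here.

```python
# Minimal sketch (not the project's actual pipeline): after each completed
# batch of sessions, the new data file is committed and pushed to a public
# repository, so data accumulation is publicly timestamped as it happens
# ("born-open" data). Paths and file names are hypothetical.
import subprocess
from pathlib import Path

REPO = Path("/srv/esp-replication-data")  # local clone of the public repo (hypothetical)

def deposit(batch_file: Path, n_sessions: int) -> None:
    """Copy a finished batch file into the repository, commit it, and push."""
    target = REPO / "raw" / batch_file.name
    target.write_bytes(batch_file.read_bytes())
    subprocess.run(["git", "-C", str(REPO), "add", str(target)], check=True)
    subprocess.run(["git", "-C", str(REPO), "commit", "-m",
                    f"Deposit {batch_file.name} ({n_sessions} sessions)"], check=True)
    subprocess.run(["git", "-C", str(REPO), "push", "origin", "main"], check=True)
```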
We found 49.89% successful guesses [n = 37,836 trials], whereas Bem reported a 53.07% success rate [in n = 1,650, ~23× fewer trials], with the chance level being 50%.
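As a rough plausibility check (not the paper's primary mixed-model analysis), one can compare a hit count of this size against the 50% chance level with a simple binomial test; the hit count below is derived from the reported 49.89% and 37,836 trials.

```python
# Back-of-the-envelope check of the headline result against chance (50%).
# This is NOT the paper's primary analysis (a mixed-effects logistic
# regression); it simply illustrates the scale of the deviation.
from scipy.stats import binomtest

n_trials = 37836
n_hits = round(0.4989 * n_trials)   # ~18,876 hits, derived from the reported 49.89%

result = binomtest(n_hits, n_trials, p=0.5, alternative="two-sided")
print(f"hit rate = {n_hits / n_trials:.4f}, p = {result.pvalue:.3f}")
print(result.proportion_ci(confidence_level=0.95))
# With ~37.8k trials the 95% CI on the hit rate is roughly ±0.5 percentage
# points, so 49.89% is statistically indistinguishable from chance and far
# from Bem's reported 53.07%.
```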
Figure 2: The figure shows the results of the mixed-effects logistic regression (bottom) and the Bayesian parameter estimation robustness test (top). The density curve shows the posterior distribution derived from the Bayesian parameter estimation analysis, with the 90% highest density interval (HDI) overlaid in grey. The horizontal error bar represents the 99.75% confidence interval (CI) derived from the mixed-effects logistic regression in the primary analysis. Both are interval estimates for the probability of correct guesses in the population. The dashed vertical line represents the 0.5 chance-level probability of a correct guess; the dotted vertical line on the top represents the threshold of the region of practical equivalence (ROPE) used in the Bayesian parameter estimation (0.506), while the dotted vertical line on the bottom represents the threshold of the equivalence test (smallest effect size of interest, SESOI) used in the mixed-effects logistic regression (0.51). The figure indicates that both the Bayesian parameter estimation and the frequentist mixed model support the null model, with estimates very close to 50% and falling well below a 51% correct-guess probability.
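The decision logic summarised in this caption can be illustrated with a short sketch: an interval estimate (HDI or CI) for the correct-guess probability supports the null if it lies entirely below the relevant threshold (0.506 for the ROPE, 0.51 for the SESOI). The posterior samples below are placeholders, not the study's actual estimates.

```python
# Illustrative sketch of the ROPE/equivalence decision in Figure 2.
# Thresholds are taken from the caption; the posterior is a placeholder.
import numpy as np

def hdi(samples: np.ndarray, mass: float = 0.90) -> tuple[float, float]:
    """Narrowest interval containing `mass` of the posterior samples."""
    s = np.sort(samples)
    k = int(np.floor(mass * len(s)))
    widths = s[k:] - s[: len(s) - k]
    i = int(np.argmin(widths))
    return float(s[i]), float(s[i + k])

def supports_null(interval: tuple[float, float], threshold: float) -> bool:
    """Null supported if the whole interval lies below the effect threshold."""
    return interval[1] < threshold

# Placeholder posterior centred near chance, for illustration only.
posterior = np.random.default_rng(0).normal(0.499, 0.003, 50_000)
print(supports_null(hdi(posterior, 0.90), threshold=0.506))   # True
```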
Thus, Bem’s findings were not replicated in our study. In the paper, we discuss the implementation, feasibility and perceived usefulness of the credibility-enhancing methodologies used throughout the project… This conclusion is all the more important because the study design that we replicated was the one that yielded the highest effect size in the 2016 meta-analysis. The fact that this effect was irreproducible, even with the input of Daryl Bem and more than a dozen other parapsychological researchers during protocol planning, should make readers cautious about that and other similar meta-analytic findings, which are mainly drawn from studies conducted without preregistration or other best practices in experimental research. Due to its controversial claims, ESP research should probably apply the highest possible standards to reduce methodological bias and error, and to limit researcher and analyst degrees of freedom. However, our recommendations for increasing credibility apply not only to ESP research, but to the biomedical and social sciences in general. We should raise the standards of credible original research, and increase the standards for including studies in meta-analyses.
…2.3.7. External research audit: External audit refers to the delegation of the task of assessing certain aspects of research integrity to a trusted third party. An IT auditor and two research auditors, all independent of the laboratories involved in the study, took part in the project. The IT auditor was responsible for evaluating the integrity of the software and data deposition pipeline used in the project. The research auditors were responsible for evaluating protocol delivery and data integrity. These external auditors published reports about the project after data collection ended. Information about the auditors, their tasks and responsibilities, and their reports is accessible via OSF. The practice of external audit is common in interventional medicine research, especially pharmaceutical research. However, formal external audit is almost unheard of in psychological science. The closest thing in the field of psychology is the stage-2 registered report review, but whether and to what extent the scope of the stage-2 review includes a systematic audit is currently unknown. The total transparency approach used in this research provides an opportunity for anyone to verify the credibility of the findings, but reviewing all the open materials takes considerable effort and some expertise. Accordingly, some voices in the field have advocated for supplementing peer review with a formal research audit [40, 41]. Delegating the task of assessing certain aspects of research integrity to a trusted third party provides an added layer of assurance to those who do not review the materials themselves. This approach also allows materials that cannot be openly shared due to confidentiality (in our case, the recorded trial research sessions) to still be used to demonstrate protocol fidelity… The auditors are not authors of this paper.
…3.2.3. Sample and study characteristics: In total, 2,220 individuals participated in the study. Among these, 2,207 participants started the session before the study stopping rule was triggered. An additional 13 participants started the session after the stopping rule was met, but their data were not included in the analysis. Of those who started the session before the study stopping rule was triggered, 92 (4.17%) dropped out before providing valid data for the primary analysis (i.e. they declined participation, were ineligible, or stopped before the first erotic trial). Valid data for the primary analysis were contributed by 2,115 participants, who completed a total of 37,836 erotic trials. The age range of most (92.62%) participants was 18–29 years; 67.52% of participants identified as women, and 32.39% identified as men. The average score on the ESP belief item was 3.46 (s.d. = 1.09), and the average score of the sensation-seeking items was 2.71 (s.d. = 0.76). Both scales ranged from 1 to 5, with lower values indicating lower belief in ESP and lower sensation-seeking. Participants chose the left-side curtain in 49.08% of the trials (meaning that there was a slight right-side bias in participant choices), while the target side was left in 49.88% of the trials.
…There are some articles in the field of parapsychology which claim that the average hit rate at chance level in the sample as a whole is produced by a bimodal distribution of two distinct subgroups: unexpectedly lucky or talented individuals who consistently perform at higher than chance accuracy, and unexpectedly unlucky individuals who consistently perform at lower than chance accuracy [55]. This is often referred to as ‘positive psi’ and ‘negative psi’, and, since the performance is thought to be linked with belief in ESP, the consistent positive performers (and believers in psi) are called ‘sheep’, while the consistent negative performers (and ESP sceptics) are called ‘goats’… The difference between the theoretical and the observed distributions was EMD = 0.037. Visual inspection shows no substantial deviation between the two histograms, which does not indicate an uneven distribution of guess chance. We explored the sheep-goat hypothesis further in a set of post hoc exploratory analyses by calculating the correlation between the performance of individuals in their odd and their even experimental trials. If there are individuals who consistently guess below or above chance level, there should be a positive correlation between odd-trial and even-trial performance. The correlation was r = 0.026 (95% CI: −0.017, 0.069), which is very close to 0 and does not seem to support the sheep-goat hypothesis. To investigate the possibility of experimenter sheep-goat effects on performance, we also examined the relationship between the participant’s performance (successful guess rate) and the ASGS score of the experimenter present during the session. We built a linear mixed model predicting the successful guess rate of the participant with the ASGS total score of the experimenter as a fixed-effect predictor and a random intercept of the experimenter ID. The same analysis was done separately with the site-PI’s ASGS total score as the fixed predictor and site-PI ID as a random intercept. The parameter estimates corresponding to the effect of the ASGS score were very close to zero in both of these analyses, providing no support for the sheep-goat hypothesis (experimenter ASGS score effect estimate: −0.0003 [95% CI: −0.001, 0.001]; site-PI ASGS score effect estimate: −0.0002 [95% CI: −0.001, 0.001]).
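A minimal sketch of the odd-even split-half check described above, assuming a trial-level data frame with one row per erotic trial; the column names are hypothetical, not the study's actual variable names.

```python
# Sketch of the odd-even split-half check for the sheep-goat hypothesis:
# if some individuals consistently score above (or below) chance, their hit
# rate on odd trials should correlate positively with their hit rate on
# even trials. Column names below are hypothetical.
import pandas as pd
from scipy.stats import pearsonr

def odd_even_correlation(trials: pd.DataFrame) -> tuple[float, float]:
    """trials: columns 'participant_id', 'trial_index', 'hit' (0/1)."""
    trials = trials.assign(odd=trials["trial_index"] % 2 == 1)
    rates = (trials.groupby(["participant_id", "odd"])["hit"]
                   .mean()
                   .unstack("odd")      # columns: False (even), True (odd)
                   .dropna())           # keep participants with both halves
    r, p = pearsonr(rates[True], rates[False])
    return r, p
```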
…6.8. Cost-benefit analysis: Up to this point, we have focused on discussing the benefits and limitations of the credibility-enhancing tools. In this section, we will enrich this discussion by evaluating the costs and the perceived usefulness of these techniques. We hope this analysis will make it easier for the reader to decide which of these techniques to implement in their own research or institution.
Our study was supported by the Bial Foundation (grant no. 122/16) via a grant of €43,000 [~$47,000]. Roughly €30,000 of this budget was spent on salaries of the coordinating research team, roughly €7,000 was spent on contracts with a software developer and the three auditors, and the rest covered other costs such as conference attendance and publication fees. However, this grant only covered a portion of the total costs associated with the project. The project was made possible by the generous support of volunteer work and by explicit or implicit subsidies from the institutions of the researchers involved in the project; thus, estimating the total dollar cost of the research is difficult, especially given regional differences in labour costs. Instead, we provide an estimate of the work hours associated with implementing and carrying out each highlighted credibility-enhancing method, so readers can gauge the amount of labour each requires.
In Table 2, we provide a non-comprehensive summary of the benefits associated with each technique, along with work-hour estimates for two scenarios: (1) researchers implementing these techniques in an average single-site experimental psychology study, or (2) researchers implementing the same techniques in a larger-scale, multi-site experimental psychology study. We used this approach because the costs of some of the methods depend on the scale of the study. For example, laboratory logs have a time cost for each research session since the experimenter has to complete the log manually, the video-verified training depends on the number of experimenters involved in the study, and so forth. For the single-site study, we assumed 8 consensus design panel members, two experimenters running the study sessions, 100 data collection sessions, one research auditor, and one IT auditor. For the multi-site study, we assumed values close to those of the present study: 30 consensus design panel members, 30 experimenters, 1,000 data collection sessions, two research auditors, and one IT auditor. We did not log the work hours spent on these methods in our project, so these figures are retrospective estimates. Note also that the time estimates do not account for the learning process of implementing these techniques for the first time, or for the mistakes and restarts that might accompany the first-time implementation of any sophisticated system. Nor do they include the time spent finding the right programmer and auditors. Tables S4 & S5 in the Supplement give more details about the calculations behind the estimated work-hour costs.
These estimates indicate that by far the most expensive methodological tool in the toolkit is the consensus design procedure, which involves several hundred hours of work divided across assistants, coordinating researchers and panel members. It is also apparent from the time estimates that the automated tools, such as direct data deposition, born-open data, real-time research reports and tamper-evident software, scale very well, since they mainly require one-time implementation. Thus, their cost-benefit ratio may be more favourable for large-scale projects than that of the other tools in the toolkit.