APRIL 17TH, 2016
Understanding the replication crisis
Over the last few years, there have been many efforts to shed light on what has been termed the replication crisis of psychological science. Although this title suggests that the discussion is primarily about the reproducibility of psychological studies, it touches on many more aspects. Important questions are being asked: How reliable is (psychological) science in general? What factors influence the quality of science? What can we do to improve it? The discussion further raises questions about publication bias, questionable research practices, open science, and, more generally, the credibility of science itself. If one dives into the many discussions and comments on this topic, one can easily lose orientation.
As I have been taken aback by this whole discussion, and as it has had a massive impact on how I view science today, I tried to learn and understand more about the origins of the crisis. I have learned a lot by reading the key papers and subsequent commentaries (which are often only found on blogs). As the topics related to the replication crisis are widespread and oftentimes complicated, I have tried to summarize its main points in the following post, which also includes the most important links to the key publications. In short, this article should help readers understand the replication crisis from a general point of view and provide a reading list for those who want to learn about its origins and many aspects in more depth.
Origins of the crisis
Although the origins reach further back in time, questionable research practices gained broad public attention with the case of Diederik Stapel. The credibility of psychological science was shattered when his scientific fraud was discovered. In September 2011, the Levelt Committee found that Stapel had made up data for his experiments in at least 30 cases. He had fabricated whole data sets, which eventually helped him publish his results in top-ranked journals. The discovery of Stapel's misconduct sparked a wide-spread discussion on questionable research practices and the credibility of psychological science in general. Although such severe misconduct is presumably quite rare among scientists, there are good reasons to believe that not all science is as robust as one would like to think. To understand this assumption, we have to go back to the first publications in which scholars raised doubts that published results necessarily represent true findings.
In 1962, Jacob Cohen published a paper called "The statistical power of abnormal-social psychological research: A review". He observed that psychologists generally pay a lot of attention to the problem that a statistically significant result might be a false-positive finding (type-I or α-error), but ignore the complementary error that a study might fail to reject the null hypothesis when the predicted effect actually exists (type-II or β-error). Without going into too much detail (an easily digestible summary of the paper can be found here.), Cohen wanted to know what chance a researcher has of rejecting a false null hypothesis. He argued that if the typical effect size were d = .5, only about 50% of studies would replicate, even if the published literature reports 100% significant results. Moreover, Cohen assumed that d = .5 would be a medium effect size for the relationship between two constructs, or between an experimental variable and a perfectly reliable measure, whereas statistical tests are based on observed effects with unreliable measures. Based on these considerations, Cohen suggested that researchers would be optimistic if they conducted power analyses with an effect size of d = .5. It is hence surprising that most researchers manage to publish significant results all the time.
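Cohen's arithmetic is easy to check with a quick simulation (this is my own sketch, not from the paper; the sample size of 30 per group is an assumed, typical value for the literature of that era): with a true effect of d = .5, only about half of all two-sample t-tests reach significance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed settings: n = 30 per group, true effect d = .5, 10,000 simulated studies.
n, d, n_sim = 30, 0.5, 10_000
t_crit = 2.0017  # two-sided 5% critical value of the t-distribution, df = 2n - 2 = 58
hits = 0
for _ in range(n_sim):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d, 1.0, n)  # true effect of d standard deviations
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    t = (treatment.mean() - control.mean()) / (pooled_sd * np.sqrt(2 / n))
    hits += abs(t) > t_crit

power = hits / n_sim
print(f"simulated power ≈ {power:.2f}")  # close to Cohen's ~.48 estimate for medium effects
```

In other words, even when every tested effect is real, a literature built on studies like these should see only about half of its results replicate.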
Cohen concludes with the following recommendation: "Since power is a direct monotonic function of sample size, it is recommended that investigators use larger sample sizes than they customarily do. It is further recommended that research plans be routinely subjected to power analysis, using as convention the criteria of population effect size employed in this survey" (p. 153). Another interesting read with regard to this problem is a paper by John Ioannidis published in 2005: In a very simple analysis using simulation procedures, he demonstrates that it is more likely for a research claim to be false than true. Moreover, he argues that for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias.
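The gist of Ioannidis's argument fits in a single formula. If R is the pre-study odds that a tested relationship is real, the probability that a significant finding is actually true (the positive predictive value) is PPV = (1 − β)R / ((1 − β)R + α). The numbers below are illustrative assumptions, not values taken from the paper:

```python
def positive_predictive_value(alpha: float, power: float, r: float) -> float:
    """Probability that a significant finding reflects a true effect.

    r is the pre-study odds that a tested relationship is real
    (e.g. r = 0.1 means 1 true hypothesis per 10 tested).
    """
    return (power * r) / (power * r + alpha)

# With a 5% alpha, 50% power, and 1 true hypothesis per 10 tested
# (all assumed numbers), a significant result is no better than a coin flip:
print(positive_predictive_value(alpha=0.05, power=0.5, r=0.1))  # 0.5
```

Lower power or longer pre-study odds push the PPV below one half, which is the sense in which most published findings can be false.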
Despite these early concerns, no in-depth investigation of the potential consequences was conducted. In 2011, however (just before the discovery of Stapel's fraud), Prinz et al. (2011) published a paper showing that in-house attempts by drug companies often fail to replicate the results of prominent publications in top journals such as Science, Nature, or Cell.
Undisclosed flexibility in data collection and analysis
Also in 2011, two other papers gained popularity and can be regarded as further milestones: First, Daryl Bem published results from 9 experiments suggesting that college undergraduates have extrasensory perception abilities (a sort of sixth sense). At the time, many psychologists were shocked that commonly used statistical methods can support a hypothesis which should by all means be theoretically invalid.
Inspired by Bem's demonstration of ESP, Simmons, Nelson, & Simonsohn (2011) wrote a paper called "False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant". The authors demonstrated that with enough flexibility in data treatment and analysis, one can find anything in a study. In their highly interesting paper, they even showed that it is possible to analyze data in a way that yields evidence for the hypothesis that listening to a song makes participants younger. They further argue that a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. Using computer simulations and actual experiments, they demonstrated how easy it is to accumulate (and report) statistically significant evidence for a false hypothesis.
The replication project
2015 was an influential year with regard to the replication crisis. In a massive effort, the Open Science Collaboration under the direction of Brian Nosek replicated 100 psychological studies. The paper, which was published in Science, reported the results of 100 replications of experimental and correlational studies originally published in three high-ranked psychology journals, using high-powered designs and original materials when available. The results can be summarized as follows: 97% of the original studies had significant results, but only 36% of the replications did; 47% of the original effect sizes fell within the 95% confidence interval of the replication effect size. This project has sparked many discussions and debates about the current state of psychological science. In fact, it has even started a debate about whether it is possible to estimate the reproducibility of science at all. In a response, which was also published in Science, Gilbert et al. (2016) argue that the authors made statistical errors. They even argue that the data show that the reproducibility of psychological studies is quite high (almost indistinguishable from 100%).
The authors of the replication project answered with a comment of their own, arguing that Gilbert et al.'s optimistic view is likewise based on a statistical misconception. In fact, many authors of the project and even non-authors have endorsed the Open Science Collaboration (e.g. Sanjay Srivastava, Uri Simonsohn, Daniel Lakens, Simine Vazire, Andrew Gelman, David Funder, Rolf Zwaan, and Dorothy Bishop). It is particularly worthwhile to read a blog post by Brian Nosek, in which he tries to answer a point made by Gilbert et al. that many studies in the replication project cannot be regarded as replications. The evolving discussion, in my opinion, is quite important, as we need to ask ourselves what a successful replication actually is. An interesting take on the whole discussion is brought forward by Andrew Gelman, who proposes the following thought experiment:
"One helpful (I think) way to think about such an episode is to turn things around. Suppose the attempted replication experiment, with its null finding, had come first. A large study finding no effect. And then someone else runs a replication under slightly different conditions with a much smaller sample size and found statistical significance under non-preregistered conditions. Would we be inclined to believe it? I don't think so. At the very least, we'd have to conclude that any such phenomenon is fragile."
One problem with the replication project is that it replicated each original study only once. With regard to the idea of cumulative evidence, the results are therefore hard to evaluate. The Many Labs Project tries to solve this problem: in this project, several labs replicated each of several experiments. Simply put, the results show that some effects replicated and some did not. Apart from these efforts, there has been another incident that, in my opinion, reveals a much deeper problem with our current way of doing scientific research. It shows that even cumulative evidence, provided by meta-analyses and several replications, cannot be regarded as sufficient evidence for an effect. In 2015, Carter & McCullough published the following paper: "Publication bias and the limited strength model of self-control". Although a meta-analysis by Hagger et al. (2010) concluded that "ego depletion," a form of fatigue in self-control, was a real and robust phenomenon (d = .6), Carter and McCullough found strong indications of publication and analytic bias. In fact, they argue that it is not clear whether the true effect is any different from zero!
After the publication of the paper, several psychologists joined together to perform several preregistered replication studies of a standard ego-depletion paradigm. Although the manuscript is not yet public, it has been announced that the project found zero evidence of ego depletion. An independent preregistered replication also found no evidence for the phenomenon.
Conclusion and outlook
What can we learn from these papers and commentaries? Despite the alarming findings, we should not question the discipline itself. Rather, we should see the results of these efforts to uncover problems within our scientific practices as opportunities to move forward. But what steps need to be taken? What can we do ourselves? The following points compile a list of proposals that might help to strengthen the reproducibility of scientific studies:
- Identifying questionable research practices. This point is probably the most important one. I do not mean that we should try to detect these practices in every published paper and denounce the authors in public. It seems, however, that many common practices are not regarded as problematic and continue to be used in many instances. Learning about these practices and critically asking how they can be avoided is the key (This blogpost gives an overview of questionable research practices).
- Conducting power analyses before data collection. With regard to Cohen's classic paper, this is most important in order to reduce error rates. Many scholars have argued for such a step; however, it has not been implemented in common research practice to this day. It is time to change this.
- Preregistering studies and publishing the results even when they fail to support the alternative hypothesis. This is especially important in order to reduce publication bias. We need to have a record of studies that failed to support alternative hypotheses. It simply cannot be true that all studies always find significant results.
- Publishing the study's materials (data, analysis scripts, codebook, survey). This will help to make science more transparent and make replications easier. It might further reduce researchers' temptation to use questionable research practices. This is what open science is actually about. The infrastructure for publishing materials online is already available at the Open Science Framework.
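The power-analysis recommendation above can be sketched in a few lines. This is a normal-approximation formula for a two-sided, two-sample t-test (an exact calculation adds a participant or two); the defaults of α = .05 and 80% power are conventional assumptions, not prescriptions:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-sided, two-sample t-test.

    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{1-beta}) / d)^2.
    """
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

# A "medium" effect of d = .5 already requires about 63 participants per group,
# far more than the samples in much of the literature Cohen surveyed:
print(n_per_group(0.5))  # 63
print(n_per_group(0.2))  # a small effect needs 393 per group
```

Running such a calculation before data collection makes the trade-off explicit: small samples do not just risk missing effects, they guarantee that most studies of modest effects will fail.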
As the debate on the reproducibility of science goes on, these are important steps that need to be taken. So on a final note, I really recommend the following blog post by Andrew Gelman, in which he writes:
"...the solution to the "replication crisis" is not to replicate everything or to demand that every study be replicated. Rather, the solution is more careful measurement. Improved statistical analysis and replication should help indirectly in reducing the motivation for people to perform analyses that are sloppy or worse, and reducing the motivation for people to think of empirical research as a sort of gambling game where you gather some data and then hope to get statistical significance. Reanalysis of data and replication of studies should reduce the benefit of sloppy science and thus shift the cost-benefit equation in the right direction."