Much has been made of the reproducibility crisis in psychology and cognitive neuroscience. There are many partial culprits: small sample sizes, subjects drawn largely from a single WEIRD demographic (Western, Educated, Industrialized, Rich, and Democratic), convoluted processing pipelines, and a phenomenon that goes by several names:
- p-hacking or data dredging
- circular analysis or double dipping
- HARKing (Hypothesizing After Results are Known)
These three issues all share a common root: a conflation of exploratory and confirmatory analysis. In exploratory analysis, we’re trying to see what, if anything, the data tells us. In confirmatory analysis, we want to see if the data tells us what we think we already know. In other words, we’re trying to confirm a hypothesis, as opposed to exploring the data for new hypotheses.
Both types of analysis are essential to the scientific process. For many complex systems (like the brain) we can’t yet make good predictions. Sometimes you just need to observe and try to draw theories from these observations. At the same time, if you want to prove that a theory is true, you need to test it via an experiment.
The problem of reproducibility arises when exploratory work is presented as confirmatory. The p values you present in such a case are optimistic (it’s the multiple comparisons problem again). The sad truth is that you really only get one chance to confirm your hypothesis. You collect your data, choose your statistical techniques, and hope for the best. Sound difficult? It is.
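A quick way to see why exploratory p values are optimistic is to simulate the multiple comparisons problem. Under the null hypothesis, a p value is uniformly distributed on [0, 1], so we can skip the data entirely and draw p values directly. This is a minimal sketch; the function name and parameter choices are mine, not from any particular study:

```python
import random

random.seed(0)

def false_positive_rate(n_tests, alpha=0.05, n_experiments=10_000):
    """Estimate the probability that at least one of n_tests null tests
    comes out 'significant' at level alpha. Under the null hypothesis,
    each p value is uniform on [0, 1], so we simulate them directly."""
    hits = 0
    for _ in range(n_experiments):
        # One "experiment": run n_tests tests on pure noise and keep the best.
        p_values = [random.random() for _ in range(n_tests)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_experiments

# One pre-planned test: the false positive rate is the nominal 5%.
print(false_positive_rate(1))
# Twenty exploratory tests, reporting only the best one:
# analytically 1 - 0.95**20, i.e. roughly 64%.
print(false_positive_rate(20))
```

The point of the sketch: nothing about any individual test is wrong. The inflation comes purely from exploring twenty of them and reporting the winner as if it were the one planned test.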
The unfortunate thing is that if you want to get published and make a name for yourself, confirmatory work is all anyone is interested in. It’s understandable: there is something more satisfying about proving that something is true, as opposed to simply describing something. And so researchers are incentivized to try their best to do only confirmatory work.
But that’s not the only work that matters, and such a preference for confirmatory analysis has led psychology and neuroscience into the mess that we’re in with respect to reproducibility.
When this issue is brought up at conferences or in journals, there are two defensive reactions: researchers feel they are being accused of dishonesty, or they push back against the statistics themselves.
Both reactions are understandable. Often when we talk about these issues, we make it sound intentional – as if scientists are deliberately trying to deceive us with data dredging. Unintentional overfitting in the name of a strongly held hypothesis is much more likely. To be accused of being dishonest would put anybody on the defensive.
Furthermore, statistics can be confusing, and it's really hard to be an expert in your own field and an expert at statistics at the same time. It's good to ask what kind of test to use when, but banning an entire statistic (as when the journal Basic and Applied Social Psychology banned p values) is throwing the baby out with the bathwater.
So you’re probably wondering: what should we do about this? Is it reasonable to expect everyone to follow the insanely stringent standards of confirmatory analysis? Should we only ask scientific questions that can be answered in that way? I say no.
I have two recommendations:
- Share data. Data is very expensive to collect, which is why we have so little and try to squeeze as much out of it as possible. A testbed of data collected for common tasks (e.g. an N-back task) would help more people perform exploratory analysis and carefully plan the data they do collect.
- Value exploratory work. If we want people to be up front about exploration vs. confirmation, we need to treat exploratory work as what it is: a valuable tool for gaining insight into complex systems. Only publishing confirmatory work (a more general form of publication bias) is what got us into this reproducibility crisis in the first place.
Sharing data is gaining traction in human neuroscience. Maybe exploratory work is up and coming as well. And if you're committed to the confirmatory lifestyle, it can really help to pre-register your analyses; the Open Science Framework and AsPredicted both let you do that.