July 22, 2014

Reproducibility is not just for researchers

I'm currently taking a course in Reproducible Research, one of nine courses comprising Coursera's Data Science Specialization that launched earlier this year. "Reproducible research" is the term for a researcher openly providing their data and computer code (for both processing and analysis) in order to make it practical for others to reproduce their entire data analysis. It's not the same as study replication, the "gold standard" in the research world, in which an independent investigator tackles the same question with different data and methods. However, it does give others the opportunity to verify that the data and methods used by the researcher are sound, as well as build upon that work more easily. As such, it bridges the gap between full replication and no replication.

The course is taught by Dr. Roger Peng, a biostatistics professor from Johns Hopkins University and an editor of the excellent Simply Statistics blog. For a deeper introduction to the topic of reproducibility, you can watch Dr. Peng's YouTube playlist or read his article from the journal Science.

But reproducibility isn't just relevant for scientific researchers; it's relevant for anyone doing data analysis. From the course page:

The need for reproducibility is increasing dramatically as data analyses become more complex, involving larger datasets and more sophisticated computations. Reproducibility allows for people to focus on the actual content of a data analysis, rather than on superficial details reported in a written summary.

In other words, it provides more transparency to a data analysis and increases the transfer of knowledge to others who could learn from your data and methods.

Dr. Mikio Braun brings up another reason for reproducibility when discussing the "mainstreamification" of data analysis:

The general message is: Data analysis has become super easy. But has it? I think people want it to be, because they have understood what data analysis can do for them, but there is a real shortage in people who are good at it. So the usual technological solution is to write tools which empower more people to do it...

For a number of reasons, I don't think that you can "toolify" data analysis that easily. I wished it would be, but from my hard-won experience with my own work and teaching people this stuff, I'd say it takes a lot of experience to be done properly and you need to know what you're doing. Otherwise you will do stuff which breaks horribly once put into action on real data.

Dr. Braun's post matches my own experience: data analysis is incredibly easy to get wrong, and it's just as hard to know when you're getting it right. That makes reproducible research all the more important!

This brings to mind the Duke cancer case from a few years ago, in which two researchers from Duke University reported that they had developed a technique for personalized chemotherapy based on an analysis of a patient's genes. The research appeared in well-known, peer-reviewed journals, and patients were enrolled in clinical trials of the technique. Unfortunately for all involved, it turned out that the underlying data analysis was faulty, resulting in sub-optimal treatments for these patients. (60 Minutes and The Economist both did excellent reporting on the case.)

So, how did the errors come to light? The discovery was made by two researchers who attempted to reproduce the results numerous times and found glaring errors. As one of those researchers explains in a fascinating talk, it took them nearly 2,000 hours of work over the course of three years because they were consistently hampered by a lack of access to the complete data and code. During that time, the clinical trials even survived a scientific review committee tasked with verifying the legitimacy of the analysis!

This case makes clear that reproducible research is especially important when life-changing decisions depend upon that research, since it can catch errors in an analysis that might not otherwise get caught by a more traditional review process. But even if life-and-death decisions do not rest on your data analyses, reproducibility is becoming increasingly important if you want your analyses to be trusted. I expect this trend to accelerate as the public becomes more aware of how often poor data analyses are presented as fact.

And while there has not yet been a culture shift to reproducibility in the computational sciences, it is at least getting easier as a practical matter. For those using R for data analysis, the knitr package makes it easy to weave together your code and analysis into a single document, which can then be published on RPubs. For those using Python, a Jupyter notebook hosted on nbviewer can accomplish the same thing. This does not address the question of hosting the data itself, but I suspect that solutions like Academic Torrents will become more of the norm going forward.
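To make that concrete, here is a minimal sketch of what a knitr-style reproducible document might look like. The model is purely illustrative, and airquality is a dataset bundled with base R, so anyone can re-run the file unchanged:

    ---
    title: "Ozone and temperature"
    output: html_document
    ---

    ```{r model}
    # Fit a simple linear model on a dataset that ships with R,
    # so no external data files are needed to reproduce this
    data(airquality)
    fit <- lm(Ozone ~ Temp, data = airquality)
    summary(fit)$coefficients
    ```

    The estimated slope is `r round(coef(fit)["Temp"], 2)` ppb of ozone
    per degree Fahrenheit, recomputed from the raw data on every build.

Knitting this file (for example with knitr's knit2html function, or the Knit button in RStudio) executes the code and regenerates every number, table, and figure directly from the data, so a reader can verify each result or extend the analysis without guessing at undocumented steps.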

Of course, reproducibility is not a silver bullet. An analysis being reproducible is still not a guarantee that it's any good; it just means that flaws can be more easily discovered if someone takes the time to reproduce it. And typically, this will happen long after the original analysis has already been disseminated and accepted as "true."

However, it is at the very least a step in the right direction.
