Is software reproducibility possible and practical?

Posted by s.aragon on 8 February 2017 - 4:02pm

By Daniel S. Katz, Assistant Director for Scientific Software and Applications at NCSA

Reposted with the author's permission. This article was originally published in Daniel S. Katz's blog.

This blog is based on part of a talk I gave in January 2017, and the thinking behind it, in turn, is based on my view of a series of recent talks and blogs, and how they might fit together. The short summary is that general software reproducibility is hard at best, and may not be practical except in special cases.

Software reproducibility here means the ability for someone to replicate a computational experiment that was done by someone else, using the same software and data, and then to be able to change part of it (the software and/or the data) to better understand the experiment and its bounds.

I’m borrowing from Carole Goble (slide 12), who defines:

Repeat: the same lab runs the same experiment with the same set up
Replicate: an independent lab runs the same experiment with the same set up
Reproduce: an independent lab varies the experiment or set up
Reuse: an independent lab runs a different experiment

(I am aware that my choice of definitions for replicate and reproduce is very much a matter of dispute, but I nonetheless choose to use them this way. If you prefer the other choice, please feel free to switch the words in your mind as you read.)

Note that an interesting alternative way of looking at this is using the concept of confirmation depth, as proposed by David Soergel, which is meant to be a measure of the reproducibility of scientific research. It’s defined as: given two experiments that provide the same result, how many steps back from that result is the first commonality of materials or methods found? Conversely, how similar is the derivation of the inputs shared by the two experiments? The shallowest form of confirmation is peer review, while deeper forms, such as using different software, different approaches, or different inputs, give more confidence in results. Though their intents are somewhat different, confirmation depth thus overlaps Goble’s definitions, with Soergel’s shallow confirmation depth outside the scope of Goble’s definitions, and Goble’s reuse outside the scope of Soergel’s deepest confirmation depth.

Much like Mark Twain’s definition of classic books as those that people praise but don’t read (see Following the Equator, Chapter 25), reproducibility seems to be a goal mostly discussed in the abstract, but not actually practiced, though there are notable exceptions, such as Claerbout, Donoho, etc., as discussed by Lorena Barba, who is also one of a small number of researchers who are very seriously attempting to make their work reproducible. As Barba mentions, our culture and our institutions do not reward reproducibility; we generally don’t have incentives or practices that translate the high-level concept of reproducibility into actions that support actual reproducibility.

Reproducibility can be hard because of a unique situation. For example, data can be taken with a unique instrument or can be transient, meaning that the data cannot be recollected, so the starting point for reproducibility might have to assume the same data. Or perhaps a unique computer system was used, so that the calculation itself cannot be repeated. What’s more, given limited resources, reproducibility is considered less important than new research. For example, a computer run that took months is unlikely to be repeated, because generating a new result is seen as a better use of the computing resources than reproducing the old result. In the days when Moore’s Law applied to computer speeds, waiting a few years would allow these heroic calculations to be reproduced, though they rarely were.

But time is an important factor in software reproducibility. Konrad Hinsen has coined the term software collapse for the fact that software eventually stops working if it is not actively maintained. He says that software stacks used in computational science have a nearly universal multi-layer structure:

Project-specific software: whatever it takes to do a computation using software building blocks from the lower three levels: scripts, workflows, computational notebooks, small special-purpose libraries and utilities
Discipline-specific research software: tools and libraries that implement models and methods which are developed and used by research communities
Scientific infrastructure: libraries and utilities used for research in many different disciplines, such as LAPACK, NumPy, or Gnuplot
Non-scientific infrastructure: operating systems, compilers, and support code for I/O, user interfaces, etc.

where software in each layer builds on and depends on software in all layers below it, and any changes in any lower layer can cause it to collapse.
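As a rough illustration of this layering (a minimal sketch, not Hinsen's own formulation), the stack can be modeled as an ordered list in which a change to one layer puts every layer above it at risk. The function name `layers_at_risk` and the short layer labels are hypothetical, chosen only for this example.

```python
# Hypothetical model of the four-layer stack, ordered from top to bottom.
LAYERS = [
    "project-specific software",
    "discipline-specific research software",
    "scientific infrastructure",
    "non-scientific infrastructure",
]


def layers_at_risk(changed_layer):
    """Return the layers that sit above `changed_layer`.

    Each layer depends on all layers below it, so a change in one
    layer can break every layer that comes before it in LAYERS.
    """
    index = LAYERS.index(changed_layer)
    return LAYERS[:index]


# A change in scientific infrastructure (say, a new NumPy release)
# can affect both layers that build on it.
at_risk = layers_at_risk("scientific infrastructure")
print(at_risk)
```

The lowest layer is the most dangerous one to change in this model, since every other layer sits above it, while a change to the top layer puts nothing else at risk.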

Hinsen goes on to say that just addressing project-specific software (the top layer) isn’t enough to solve software collapse; the lower layers are still likely to change. And the options he suggests (and I’ve named) are similar to those available to house owners facing the risk of earthquakes:

Teardown – treat your home as having minimal value, subject to collapse at any time, and in case of collapse, start from scratch
Repair – whenever shaking foundations cause damage, do repair work before more serious collapse happens
Flexible – make your house or software robust against perturbations from below
Bedrock – choose stable foundations

Hinsen suggests that many researchers building new code for what they think is a single use choose teardown, while most active projects that are building code intended to last choose repair. While engineers know how to build flexible buildings to survive a given level of shaking, we don’t know how to do this in software (though people like Jessica Kerr and Rich Hickey are talking about potential practical and social solutions, and, as suggested by Greg Wilson, perhaps new computer science research could address it more fundamentally). The bedrock choice is possible, as demonstrated by the military and NASA, but it also dramatically limits innovation, so it’s probably not appropriate for research projects.

Much of the immediate inspiration for this blog post and Hinsen’s is a blog from C. Titus Brown called “Archivability is a disaster in the software world.” Titus talked about why using containers or VMs isn’t satisfactory: they are not robust themselves, and although they provide bitwise reproducibility, they aren’t scientifically useful because, as black boxes, their contents can’t really be remixed. Titus suggested we either run everything all the time, or accept that exact repeatability has a half-life.

Running everything all the time is related to Hinsen’s repair option; it just guarantees that you know when it’s time to make repairs. This could be done through continuous analysis (as proposed many times, for example by Howison at WSSSPE2 and in a preprint by Beaulieu-Jones and Greene), similar to continuous integration.
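To make the idea concrete, here is a minimal sketch of what a single continuous-analysis check might look like, assuming the analysis can be re-run as a function and its result reduced to a string. It is not the method from any of the cited proposals; `fingerprint`, `check_analysis`, and `toy_analysis` are hypothetical names invented for this example.

```python
import hashlib


def fingerprint(result):
    """Reduce a textual result to a SHA-256 hex digest for comparison."""
    return hashlib.sha256(result.encode("utf-8")).hexdigest()


def check_analysis(run_analysis, recorded_fingerprint):
    """Re-run an analysis and compare its output with the recorded one.

    Returns "ok" if the output is unchanged, "drifted" if it runs but
    gives a different result, and "broken: ..." if it no longer runs.
    """
    try:
        current = fingerprint(run_analysis())
    except Exception as error:
        return f"broken: {error!r}"
    return "ok" if current == recorded_fingerprint else "drifted"


def toy_analysis():
    """Stand-in for a real computation: the mean of a few numbers."""
    values = [1.0, 2.0, 3.0, 4.0]
    # Round before fingerprinting so tiny floating-point differences
    # between platforms do not register as a changed result.
    return f"{sum(values) / len(values):.6f}"


# Record the fingerprint once, when the result is first produced ...
recorded = fingerprint(toy_analysis())
# ... then re-run the check on a schedule, as a CI job would.
status = check_analysis(toy_analysis, recorded)
print(status)
```

A scheduled job running a check like this would not prevent software collapse, but it would report "drifted" or "broken" soon after a lower layer changed, which is when a repair is cheapest.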

Accepting that exact repeatability has a half-life has a nice architectural parallel: we don’t build houses to last forever, and that seems fine. But if we do accept this, perhaps we should consider costs and benefits (as is done for houses, where we don’t worry about earthquakes for sheds, but we do for hospitals and highways.)

For example, if software-enabled results could be made reproducible at no cost, we would do so. If the cost was 10x the cost of the original result, we probably wouldn’t. If the cost was equal to the cost of the original result, we probably still wouldn’t, though maybe in some cases, for particularly important results, we would. What about if the cost of making the work reproducible was an extra 10%? Or an extra 20%?

We need to balance both the cost of reproducibility vs. the lost opportunity of new research, and the cost of not having reproducibility vs. the lost opportunity of future reuse. This could be a specific question about any one experiment/result, or it could be a general question about the culture of science. Similarly, if it’s not practical to make everything reproducible, we could also use a cost/benefit ratio to determine what to do in any particular case.
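One very simple way to frame such a cost/benefit decision is sketched below. Everything in it is an assumption made for illustration: the rule itself (invest when the expected value of future reuse exceeds the extra cost), the function name `worth_making_reproducible`, and the numbers, which are not measurements of any real project.

```python
def worth_making_reproducible(original_cost, overhead_fraction, expected_reuse_value):
    """Toy decision rule, assumed for illustration only.

    original_cost        -- cost of producing the original result
    overhead_fraction    -- extra cost of reproducibility as a fraction
                            of the original cost (0.10 means 10% extra)
    expected_reuse_value -- estimated value of future reuse, in the
                            same units as original_cost
    """
    extra_cost = original_cost * overhead_fraction
    return expected_reuse_value > extra_cost


# Hypothetical numbers: a result that cost 100 units, with future
# reuse valued at 15 units, checked against the overheads discussed
# above (free, 10% extra, 20% extra, equal cost, 10x cost).
decisions = {
    overhead: worth_making_reproducible(100, overhead, 15)
    for overhead in (0.0, 0.10, 0.20, 1.0, 10.0)
}
print(decisions)
```

The hard part in practice is the input this sketch takes for granted: estimating the value of future reuse before anyone has reused the work.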

Going back to the original question that started this blog, “is software reproducibility possible and practical?” I think the answer is yes, over a short period, but more generally, the answer is no. As Barba and Brown and others are doing, we can provide reproducibility over a short period, though our incentives don’t align with this and it’s fairly difficult, so most researchers don’t. In the longer term, our systems do not really support reproducibility, mostly due to software collapse issues as defined by Hinsen, and the fact that containers can support repeatability and replicability, but not reproducibility or reusability. Additionally, the costs associated with full long-term reproducibility are not considered worthwhile for all software-based research. One potential way to get to reproducibility and reusability without dramatically reducing our ability to innovate is to overcome software collapse by using flexible or fuzzy APIs between underlying layers, though much computer science research is needed to enable this. This could also lower some of the costs, increasing the amount of work that could be made reproducible.

Thanks to Kyle E. Niemeyer, Sandra Gesing, Matt Turk, and Kyle Chard for their comments on an earlier draft of this blog, and of course, all of the people who I’ve named and whose work I’ve quoted, paraphrased, or linked to; any incorrect interpretations of their work are mine.