Archiving code and software shared with research: journal, author and re-user perspectives

Posted by s.aragon on 25 May 2018 - 9:31am

By Naomi Penfold, Nikoleta Glynatsi, Yo Yehudi, James Baker, Steve Crouch

This post is part of the Collaborations Workshop 2018 speed blogging series.

As science becomes more reproducible, how do we ensure that the work we do today can be interrogated in the future as a matter of historical record? We discuss this from the perspective of researchers, research software engineers (RSEs) and the publisher, and offer responses to five questions central to archiving code and software. We note first that our perspectives come from an expert group—not the average researcher.

Why archive and for whom?

People change. It’s not just that code evolves and packages change, in turn influencing the choice of methodology, but that people and their preferences change as well. Regardless of how research will be done in 10 years’ time, it’s critical that methods used to derive a result in the past can be inspected and interrogated. No matter how much the code changes, the results shouldn’t if you are doing things right.

The author and reuser are perhaps, and should be, indistinguishable: authors are themselves one potential future reuser of their own code, and should archive it with that in mind.

How do we make archived code and software meaningfully reusable?

Code should be readable, and at best meaningfully reusable. Even if the code can no longer be run, and even if its dependencies disappear, it should still be possible to read it on screen. This means that, as much as possible, the code should document and explain itself. Further, code should be archived with the real data that enables meaningful reuse of that code, or, if that is not possible, dummy data produced by the author.
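As a hypothetical sketch of what this might look like, the snippet below shows a small analysis script that documents itself through docstrings and named columns, and carries author-produced dummy data so it remains runnable (and readable) even when the real dataset is unavailable. The names (`DUMMY_DATA`, `mean_measurement`) are illustrative, not from any particular archive.

```python
import csv
import io
import statistics

# Author-provided dummy data, archived with the code so a future
# reader can run it even if the real dataset is no longer available.
DUMMY_DATA = """sample_id,measurement
s1,4.2
s2,5.1
s3,3.9
"""

def mean_measurement(csv_text):
    """Return the mean of the 'measurement' column of a CSV string.

    The docstring and explicit column names let a future reader
    understand the method even if the code can no longer be executed
    in their environment.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    return statistics.mean(float(row["measurement"]) for row in rows)

if __name__ == "__main__":
    print(f"mean measurement: {mean_measurement(DUMMY_DATA):.2f}")
```

Using only the standard library, as here, is itself a sustainability choice: the script stays readable and runnable with no external dependencies to disappear.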

Best practices should be reinforced through journal requirements. If journals recommend or require RSEs and other researchers who submit code to use sustainable (long-lived) dependencies, this has a positive and self-reinforcing knock-on effect for future researchers who reproduce and reuse this software-based research: it steers their technical decisions towards sustainable solutions in order to be published.

Researchers should lead on dependency management. RSEs and other researchers who submit code should attempt to understand which dependencies are stable (e.g. PyPI, MRAN) versus those with acknowledged problems (e.g. JavaScript). This should not be the basis for rejection by a journal, just something to be aware of; making it visible may contribute to the push for more sustainable practices.
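One concrete way to make dependencies visible, sketched here as an assumption about workflow rather than any journal's actual requirement, is to record exact version pins at submission time. The helper below uses only the Python standard library (`importlib.metadata`, Python 3.8+); the function name is illustrative.

```python
import importlib.metadata

def pinned_requirements(package_names):
    """Return exact 'name==version' pins for the installed packages,
    suitable for writing to a requirements.txt archived with the code.

    Packages that are not installed are recorded as a comment rather
    than silently dropped, so the dependency record stays honest.
    """
    pins = []
    for name in package_names:
        try:
            pins.append(f"{name}=={importlib.metadata.version(name)}")
        except importlib.metadata.PackageNotFoundError:
            pins.append(f"# {name}: not installed")
    return pins
```

Writing the result to a file alongside the submitted code gives reviewers and future reusers a precise record of the environment, whatever the ecosystem's stability.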

Journals should build workflows that support stable/backed-up software/languages (e.g. R, Python) without undermining authors who want to use experimental software approaches. If needed, researchers should even provide an ‘image’ with all the dependencies.
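Such an image might, for example, be described by a container recipe like the hedged sketch below. The base image, file names (`analysis.py`, `requirements.txt`) and entry point are all illustrative assumptions, not a prescribed journal workflow.

```dockerfile
# Hypothetical recipe for an image archiving a Python analysis
# together with all of its dependencies.
FROM python:3.10-slim

WORKDIR /analysis

# Install pinned dependencies first so the environment is reproducible.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the analysis code and data into the image.
COPY . .

CMD ["python", "analysis.py"]
```

An archived image of this kind lets a future reuser rerun the analysis without reconstructing the original environment by hand.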

Is there a role for peer review before archiving?

Code should be reviewed. We need to be able to clearly distinguish archived code that has been tested and found valid from code that was shared without review. This expert group believes that journals should review code before sharing and archiving it.

Whose responsibility and workload is this?

Authors must take responsibility for their code. Nevertheless, the archiving process could be adapted to the researcher’s situation and their engagement with the need to archive their code.

The publisher could adapt the support they offer according to how much effort the author has made to share reusable code. For example, if the researcher supplies well-documented and reproducible code they get service X+Y, but if they supply a zip file of scripts they get service X.

How can these processes help to change culture?

Journals and cultures of publication shape practice. If journals develop minimal requirements for archiving code and software, it could motivate researchers to follow better coding practices.

Processes for archiving software and code will evolve as researchers and publishers get used to them and as technological capabilities evolve. So, don’t aim for perfection right now. Instead, aim for rigid basic requirements (developed by journals and communities) with flexible implementation: for example, don’t impose a single service such as Zenodo, as not all researchers are comfortable with Zenodo needing write permission to their GitHub account. This recognises that asking researchers to conform to even a small degree of reproducibility for their code and software is still progress, and an achievement in itself that can be built on for greater future change.

To discuss minimum requirements for implementing some of these ideas at eLife, please contact Naomi.