Reproducible Research: Citing your execution environment using Docker and a DOI

Posted by s.sufi on 29 March 2016 - 3:15pm

By Robert Haines, Institute Fellow & Research Software Engineering Manager, IT Services, University of Manchester and Caroline Jay, Institute Fellow & Lecturer, School of Computer Science, University of Manchester.

As we move into a world where (hopefully) more and more people are trying to make their research as reproducible as possible, many of us are turning to Docker to distribute our research software in a form that is as accessible as possible to others. As we do so, we need to be able to cite the software environments that we execute, not just the source code itself.

In the IDInteraction project we are working on tools that allow people to use object tracking over a video to create models of human behaviour - a technique known as 'behavioural coding'. This process was previously done manually, and so these tools could be very useful to others, but what is the best way to make them available? Ensuring our code is open source is an important first step, but this isn't optimal for a researcher who doesn't have the technical expertise (or time) to build the software from scratch. In the rest of this post we describe our approach to making research software both easy to run and citable: packaging it as Docker images and archiving those images with a DOI.

Docker images and containers

Docker deals with two fundamental concepts: images and containers. You can think of an image as the immutable source of a container; you can start 1,000 containers from a single image and the image will remain unchanged. The container is the running environment, created from an image, with an extra layer on top that holds the particular state of that instance.

In broad terms, containers can be thought of as virtualization at the application level, rather than the more usual operating system level; your research software and all of its dependencies can be built into a single image, which can be instantiated as a container and run on any platform that supports Docker. Packaging your software in this way means that running it can be as easy as typing:

docker run image/name:tag
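
Research pipelines usually also need to hand the container some input data. A minimal sketch of such an invocation, assuming a placeholder image called image/name:tag and a hypothetical /data directory inside the container (neither is taken from the IDInteraction images):

```shell
# Run one pipeline stage in a throwaway container.
# Assumptions: the image name is a placeholder and /data is a hypothetical
# directory that the tool inside the image reads from and writes to.
run_stage() {
    image="$1"      # e.g. image/name:tag
    data_dir="$2"   # absolute path to a directory on the host

    # --rm removes the container when it exits; the image itself is unchanged.
    # -v mounts the host directory into the container at /data.
    docker run --rm -v "$data_dir:/data" "$image"
}

# Example: run_stage image/name:tag "$PWD/data"
```

Because the container is discarded after each run, anything worth keeping has to be written to the mounted directory.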

In the IDInteraction project we have done just this and have created Docker images for each stage of the processing pipeline for our recent paper on Automated Behavioural Coding, which has been accepted for presentation at CHI 2016 in San Jose, CA, USA.

Building and distributing images

Creating an image requires some knowledge of installing software in Linux and setting up the environment that your tools need, but once you have written this down in a Dockerfile it serves not only as the recipe for Docker to build your image but as a comprehensive description of the complete environment your software requires to be able to run.
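
As an illustration of what such a recipe might look like, the sketch below writes a small Dockerfile and wraps the build command in a function. The base image, the package names and the image name are all assumptions chosen for the example, not the contents of the IDInteraction Dockerfiles.

```shell
# Write a minimal, hypothetical Dockerfile into the current directory.
write_dockerfile() {
    cat > Dockerfile <<'EOF'
# Base image: an assumption, pin whichever release your tools were tested on.
FROM debian:bookworm

# Install the tools the analysis needs (package names are assumptions).
RUN apt-get update \
    && apt-get install -y --no-install-recommends mencoder r-base \
    && rm -rf /var/lib/apt/lists/*

# Hypothetical default command, run when the container starts.
CMD ["R", "--version"]
EOF
}

# Build an image from the Dockerfile in the current directory.
build_image() {
    docker build -t "$1" .
}

# Example: write_dockerfile && build_image image/name:tag
```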

Docker images are published in the Docker Hub. You can either push an image that you have built on your local machine to the Hub manually, or you can set up a hook so that the Hub will build your image for you each time you push a new version of the Dockerfile to GitHub.
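
The manual route might look something like the sketch below. The image names and the version number are placeholders, and it assumes you have already authenticated with `docker login`.

```shell
# Give a locally built image an explicit version tag and push it to the Hub.
# Assumes `docker login` has already been run for the target account.
publish_image() {
    local_image="$1"   # e.g. image/name:latest
    versioned="$2"     # e.g. image/name:1.0.0 (hypothetical version)

    # Only push if tagging succeeded.
    docker tag "$local_image" "$versioned" && docker push "$versioned"
}

# Example: publish_image image/name:latest image/name:1.0.0
```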

Our video processing, object tracking and data analysis images are all available in the Docker Hub. We don't just use software we've written ourselves either; we've installed standard tools such as mencoder (a video processing tool) and R with all the extras they require in our Docker images too. Docker really is the only dependency of our entire workflow.

Citing your Docker image with a DOI

If you want to cite something it needs to be archived somewhere in such a way as to be unchangeable and stored, ideally, in perpetuity. Docker Hub isn't the sort of repository that is designed with long term archiving in mind, so it doesn't have the option of assigning DOIs to your images. The other problem is that it is all too easy, on purpose it seems, to overwrite tagged images with newer versions. To publish your Docker images alongside the work they support you need to archive the actual versions you used elsewhere to be able to cite them reliably. Enter Zenodo.

Zenodo already supports archiving of software directly from GitHub when you create a release, and so you can always use this to get a DOI for your source code and Dockerfile. But the Docker image itself is what you execute to get your results, so you should cite this as well so that others can reuse the exact environment that you used in your work should they wish to. Thankfully Zenodo supports uploading archive files, such as tar and zip, and there is an easy way to save your Docker images in an archive:

docker save image/name:tag > archive-file.tar

At this point you probably want to compress the resulting tar file, as Docker images can be quite big and, at the time of writing, Zenodo has a 2GB limit. You can then upload the archive to Zenodo and, once you have added some metadata to your upload, you get a DOI that you can cite in your papers.
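
The save, compress and size-check steps can be wrapped up in one function. This is only a sketch: it assumes gzip and GNU coreutils (sha256sum, wc) are installed, the file names are placeholders, and the default limit is simply the 2GB figure expressed in bytes.

```shell
# Save a Docker image to a tar file, compress it, record a checksum and
# check the result against an upload limit.
# Assumes gzip and sha256sum (GNU coreutils) are available.
archive_image() {
    image="$1"                       # e.g. image/name:tag
    tar_file="$2"                    # e.g. archive-file.tar
    limit_bytes="${3:-2000000000}"   # upload limit in bytes (2 GB by default)

    # -o writes the archive to a file instead of stdout.
    docker save -o "$tar_file" "$image" || return 1

    # gzip replaces archive-file.tar with archive-file.tar.gz.
    gzip -f "$tar_file" || return 1
    archive="$tar_file.gz"

    # Record a checksum so that others can verify their download.
    sha256sum "$archive" > "$archive.sha256" || return 1

    size_bytes=$(wc -c < "$archive")
    if [ "$size_bytes" -gt "$limit_bytes" ]; then
        echo "$archive is $size_bytes bytes, over the $limit_bytes byte limit" >&2
        return 1
    fi
}

# Example: archive_image image/name:tag archive-file.tar
```

Uploading the .sha256 file alongside the archive gives readers a simple way to confirm that what they downloaded is what you deposited.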

For anyone wanting to use your image there is an equivalent command to load a saved image into Docker on their local machine:

docker load < archive-file.tar
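
Before loading an archive downloaded from Zenodo it is worth checking that it arrived intact. A minimal sketch, assuming a checksum file created with sha256sum was published next to the archive (the file names are placeholders). Docker can load gzip-compressed archives directly, so no separate decompression step is shown.

```shell
# Verify a downloaded archive against its checksum file, then load it.
# Run from the directory that contains the archive, because the checksum
# file refers to the archive by its file name.
load_archived_image() {
    archive="$1"         # e.g. archive-file.tar.gz
    checksum_file="$2"   # e.g. archive-file.tar.gz.sha256

    # Refuse to load an archive whose checksum does not match.
    sha256sum -c "$checksum_file" >/dev/null || {
        echo "checksum verification failed for $archive" >&2
        return 1
    }

    # -i reads the archive from a file instead of stdin.
    docker load -i "$archive"
}

# Example: load_archived_image archive-file.tar.gz archive-file.tar.gz.sha256
```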

We have archived our video processing, object tracking and data analysis Docker images in Zenodo in this way and cited them in our paper using the generated DOIs.

Summary

Using a combination of Docker and Zenodo you can provide others with access to the exact execution environment that you used in your research and cite your software environments as distinct outputs of your research with their own DOIs.

Acknowledgements

IDInteraction is funded by the Engineering and Physical Sciences Research Council through grant agreement number EP/M017133/1.