C4RR Talks

Abstracts of talks, sorted by the last name of the first author.


Singularity Containers for Reproducible Research

Michael Bauer, University of Michigan / Lawrence Berkeley National Lab

In scientific research it is imperative that the methods, and the data used by those methods, be preserved to allow for the reproduction of results. Unfortunately, some results are also dependent on the computing environment in which they are run. For example, a random number generator may depend not just on the seed used, but on the version of the random number generator library and the compiler used to generate the executable. Singularity is an open source container solution designed to mitigate these reproducibility problems. Singularity was initially developed at Lawrence Berkeley National Laboratory and is now maintained by collaborators from institutions across the globe. With Singularity a scientist can package the entire computing environment (compiler, libraries, data, etc.) used to generate their results into a single file that runs independently of the Linux distribution, without the need to create a virtual machine. This file can then be archived alongside a published research paper to allow others to independently verify the published results with the guarantee that they will be running the exact same executable as the researcher.
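As a sketch of what this looks like in practice, such an environment might be captured in a Singularity definition file roughly like the following (the base image, packages and paths are illustrative, not taken from the talk):

```
Bootstrap: docker
From: ubuntu:16.04

%post
    # Pin the exact compiler and libraries used to produce the results
    apt-get update && apt-get install -y gcc-5 libgsl-dev

%runscript
    # Entry point: run the archived analysis with any arguments passed in
    exec /opt/analysis/simulate "$@"
```

Depending on the Singularity version, the image is then built from the definition file (with `singularity build`, or `create` plus `bootstrap` in older releases), producing the single archivable file that can be run on any Linux host with the Singularity runtime installed.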


Docker Containers for Deep Learning Experiments

Paul K. Gerke, Diagnostic Image Analysis Group, Radboudumc Nijmegen

Deep learning is a powerful tool to solve problems in the area of image analysis. The dominant compute platform for deep learning is Nvidia’s proprietary CUDA, which can only be used together with Nvidia graphics cards. The nvidia-docker project allows exposing Nvidia graphics cards to docker containers and thus makes it possible to run deep learning experiments in docker containers.
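A minimal sketch of how this looks in practice, assuming a host with an NVIDIA GPU and driver installed (the experiment image name is hypothetical):

```
# nvidia-docker wraps `docker run`, mounting the host's driver and GPU
# device files into the container.

# Check that the GPU is visible from inside a container:
nvidia-docker run --rm nvidia/cuda nvidia-smi

# Run an experiment image the same way, sharing a data directory:
nvidia-docker run --rm -v /data:/data my-lab/experiment python train.py
```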

In our department, we use deep learning to solve problems in the area of medical image analysis and use docker containers to offer researchers a unified way to set up experiments on machines with inhomogeneous hardware configurations. By running experiments in docker containers, researchers can set up their own custom software environments, which often depend on the medical image modality that is being analyzed. Experiments can be archived in a docker registry and easily moved between computers. Differences in hardware configurations can be hidden through the system configuration of the base system. This way, container environments remain largely the same even across different computers.

Using graphics hardware from docker containers, however, also introduces extra complications: CUDA uses C-like code that is compiled to binaries that are not necessarily compatible between graphics cards. It is also possible, due to the lack of proper hardware virtualization, to crash the Nvidia driver on the base system which will affect all other containers running on the system.

Allowing researchers to define their own runtime environments for their experiments using containers made archiving of experiments more viable. Experiments do not depend on local system configurations anymore and can therefore be moved between systems and be expected to run. Using graphics hardware from docker containers introduces more complexity, but generally works and should make deep learning experiments more repeatable in the long run.

Slides.


Building moving castles: Scaling our analyses from laptops to supercomputers

Matthew Hartley, John Innes Centre, Tjelvar Olsson, John Innes Centre

To analyse our scientific data, we develop code and pipelines. We usually do this first on small datasets, using laptops and desktops. Our dream is to effortlessly take these small scale analyses and use them on large datasets on powerful HPC clusters or cloud computing environments. The complexity of the environments needed to run the code, and of managing data locations, makes this a challenge.

Docker provides an important piece of the solution to this problem, giving us portable environments. Unfortunately, we can't use Docker on HPC clusters for security reasons. Singularity, another container technology, helps by allowing us to use those environments on HPC systems.

We'll talk about how we've developed and are developing a set of interlocking tools for helping us bridge the gap between our small scale experimental analyses and large scale processing. These tools abstract away the problems of managing environments and data paths, along with integration with an HPC scheduler. We'll give a work in progress snapshot of how these tools are helping us so far, highlighting the underlying principles that we think are giving us the most value.

Slides.


Virtual Container Clusters: building reproducible cluster software environments

Josh Higgins, University of Huddersfield

Does putting code inside a container really make your experiment reproducible?

Scientific software often depends on a system foundation that provides libraries, middleware and other supporting software in order for the code to be executed as intended. With container virtualization, some of these things can be shipped along with the application - clearly reducing the barrier to reproducibility.

For example, MPI applications can be containerized and the container itself executed in place of the original program. However, whilst this containerized code may offer compatibility between different system environments, for execution as intended, it must still be run on a cluster that provides the correct interface for parallel MPI execution.

The VCC is a framework for building containers that ship the application along with the foundation required to execute it - whether that is an MPI cluster, big data processing framework, bioinformatics pipeline or other execution models. This gives us the ability to quickly create and teardown complex virtual environments tailored for a task or experiment. These can be used as both the primary execution environment and as the method for reproducibility, without requiring the underlying system to provide anything but the container runtime.

In this talk we will introduce the tool and the principles of operation, using example applications to demonstrate it from the point of view of the experiment creator and of a person who wants to reproduce an experiment. The tool has been utilised successfully within the University of Huddersfield, and this experience will also be presented: where we draw the line at how much of the system should be encapsulated in a container, how we adapted our computing resources to support the VCC, and how we convinced users to use it.


Deploying many diverse services as a single system using Containers

Stefan Idriceanu, The University of Manchester

To enhance the monitoring and maintenance of the Computational Shared Facility at the University of Manchester, new software was developed to graphically display and analyse running memory logs of the cluster nodes. A number of issues were encountered during the development and deployment of this application, in particular with the installation and configuration of a Grafana server alongside a log-parsing application on different machines that were running various operating systems. The diversity of components and technologies that were interacting on the same system created a large number of compatibility and dependency issues.

Using Docker made it easier to migrate the entire project from the development machines to any existing set-up, turning this from a long and exhausting process into a short and straightforward operation. Using containerisation to isolate the components allowed them to communicate with each other by port exposure.

Splitting the components into two different containers, one holding the Grafana web application, Carbon daemon, Whisper database and Statsd server, and the other holding the log-parsing application, achieved the required independence. Furthermore, configuration files could be automatically deployed by cloning them from a GitHub repository on the host machine and adding them as a data volume when running the docker container.
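As a sketch of this deployment (the repository URL, image names and ports are illustrative, not those of the actual project):

```
# Clone the configuration onto the host, then mount it into the container
git clone https://github.com/example/monitoring-config.git /opt/monitoring-config

# Container 1: the Grafana/Carbon/Whisper/Statsd stack, with config as a volume
docker run -d --name metrics -p 3000:3000 -p 8125:8125/udp \
    -v /opt/monitoring-config:/etc/monitoring \
    example/metrics-stack

# Container 2: the log parser, reaching the first container via its exposed ports
docker run -d --name log-parser --link metrics example/log-parser
```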

To conclude, Docker provides a lightweight form of virtualisation in contrast to a classic virtual machine; it shares the host's kernel and memory rather than requiring its own kernel and dedicated memory. This makes it an ideal platform for deploying many independent, but interconnected, services, as we have done here, but care must be taken when service management tools, such as systemctl and systemd, are shared by both the host and the containers.


Beyond the container - configuration management of high throughput workflows in a production environment

Nick James, Eagle Genomics

Reproducibility is an absolute requirement in commercial life science R&D. This requirement is getting harder to deliver as data sets get ever bigger and more complex. Docker enables reproducibility even at scale, but does not provide the complete answer in and of itself. Here I will present a real-world example of the "dockerisation" of an open source data analysis pipeline (shotgun metagenomics) based on the eHive artificial intelligence workflow engine from the European Bioinformatics Institute. Architectural components include configuration management, multi-layered versioning, continuous integration, federated logging and other goodies. Controlling reproducibility and reliability as a fundamental principle (rather than a bolt-on extra) allowed us to confidently perform aggressive optimisation, returning some quite startling performance figures. The system is in daily production use by global R&D teams and has directly contributed to the development of new products.

Slides.


Using Docker and Knitr to create reproducible and extensible publications

David Mawdsley, University of Manchester, Robert Haines, University of Manchester, Caroline Jay, University of Manchester

The current paradigm for publishing research involves demonstrating significant novelty and waiting a long time for peer review. We propose a containerisation system for research that makes both the analysis and manuscript easy to produce and extend, providing the starting point for a new, versioned publication model, which will allow early publication of results and their incremental extension. This transparent approach allows others to verify the results of a paper, and potentially modify or extend the analysis. Our approach uses Docker images to provide a reproducible analysis pipeline, together with Knitr, an R package that allows us to combine R and other code with LaTeX in an extensible and transparent way. This allows us to generate a reproducible paper, where calculations in the analysis code are automatically reflected in the manuscript text.
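As an illustration of the Knitr side of this approach, a manuscript fragment might look like the following (file names and variables are hypothetical); the chunk is re-executed on every build, so the number in the text always matches the analysis:

```
% paper.Rnw: LaTeX with embedded R chunks, rendered with knitr
<<analysis, echo=FALSE>>=
results <- read.csv("data/results.csv")  # hypothetical data file
m <- mean(results$score)
@

The mean score across participants was \Sexpr{round(m, 2)},
recomputed from the raw data each time the paper is built.
```

Building this inside a fixed Docker image pins the versions of R, Knitr and LaTeX, so the same PDF can be regenerated later.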

We will discuss some of the challenges in following this approach, including handling time-consuming parts of the analysis pipeline and linking Docker and Knitr into a fully reproducible (and straightforwardly executable) pipeline. We will conclude by outlining how such an approach could transform the research publication system, by allowing people to modify and extend the analysis and/or manuscript to create novel results, and an additional 'version' of the paper, which could be subject to lighter-weight peer review, and faster publication.

Slides.


Using Docker to add flexible container support to CyVerse UK

Alice Minotto, Earlham Institute, E. Van Den Bergh, EMBL-EBI, A. Eckes, Earlham Institute, R. P. Davey, Earlham Institute

Docker containers are typically organised in layers and are more lightweight than virtual machines (VMs). Unlike VMs, which comprise full operating systems (OS) and libraries, Docker containers can interact with the underlying libraries of the host OS and therefore have advantageous properties: for example, the disk space used will be far less than the size of a standalone VM. This makes it very fast to generate sharable and reproducible computational objects.

As part of CyVerse UK, we use the Docker framework to provide users with a virtually isolated environment that includes bioinformatics applications and all their required dependencies, so that it can be composed, run, and spun down all within the CyVerse Agave API lifecycle. Furthermore, Docker images are built from Dockerfiles, which have an intuitive syntax, so with minimal training users can contribute to widening the range of applications available on CyVerse, benefiting both user and developer communities. Lastly, in the case where users don’t wish to develop against the Agave API interfaces themselves, they can still benefit from publicly available Docker images curated by CyVerse UK to easily run applications within the infrastructure, or even on their own systems.
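For illustration, a Dockerfile for a hypothetical alignment app can be as simple as (image and tool names are illustrative):

```dockerfile
# Hypothetical bioinformatics tool image
FROM ubuntu:16.04

# Pin the tool and its dependencies in one place
RUN apt-get update && apt-get install -y samtools bwa

# Default command run by the app wrapper
ENTRYPOINT ["bwa"]
```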

Many bioinformatics applications are already “containerised” and can be downloaded from public registries. In the future this would make collaborations and application integration in the CyVerse infrastructure more immediate.

Docker universes are also available as part of the CyVerse UK HTCondor batch system, allowing jobs that would traditionally be run on bare metal servers to be run within Docker containers, yet this comes with some minor caveats to keep in mind. The use of volumes (or Data Volume Containers) is currently not enabled, as it would require giving permissions to specific folders, with related security concerns. Nevertheless, it is possible to achieve almost the same results by defining `transfer_input_files` in the Condor submit script.
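A sketch of a submit description file using the docker universe (the image and file names are hypothetical):

```
# HTCondor submit description file: run the job inside a Docker container
universe                = docker
docker_image            = cyverseuk/example-app:1.0
executable              = run_analysis.sh
# Stand-in for a mounted volume: Condor copies these files into the job sandbox
transfer_input_files    = reads.fastq, config.yml
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
```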

Slides.


Software “Best Before” Dates: posing questions about containers and digital preservation

James Mooney, The Bodleian Libraries, Oxford; David Gerrard, Cambridge University Library.

As part of the two-year Polonsky digital preservation research project - running jointly at the Bodleian Libraries (the University of Oxford) and Cambridge University Library - the libraries are researching and developing requirements for digital-preservation-specific services. Part of the project concerns incorporating digital preservation processes into existing workflows.

Our presentation is designed to highlight some of the digital preservation opportunities and concerns relating to the use of container software for research projects. The Software Sustainability Institute’s workshop enables us to both raise, and pose questions about, the following topics:

  • Will your software environment be reproducible in 5/10/20 years' time? Is that even important? If so, what steps are you taking to preserve the software? And how do you decide which pieces of the research need to be preserved, and what can be thrown away?
  • How are you handling dependency management relating to packages and external software repositories? Have you considered what happens when that external repository no longer exists?
  • How can technical system environment information be captured effectively?
  • Preservation models: should everything be containerised? Or is it better to store the components of your research individually (in preservation-friendly formats), alongside an execution environment? Or take a hybrid approach?
  • What makes for a good preservation container format and platform?
  • What are the strengths and weaknesses of Docker vs Singularity as preservation container formats? If Singularity is a more effective preservation format, what processes are there for exporting Docker to Singularity?
  • If you're relying on databases, are you considering extracting database data into a preservation format such as SIARD? And have you perhaps considered using the Database Preservation Toolkit?
  • Is it important to capture the context and provenance of your work? E.g. archiving collaboration discussions from email correspondence etc.

Slides.


Reproducible HPC in the cloud -- a Docker success story

Robert Nagler, RadiaSoft LLC

Parallel modeling of particle accelerators, x-ray beamlines and free electron lasers requires expert users, each of whom develops their own workflow and data management strategy, making reproducibility next to impossible. Compiling and installing the codes is difficult, as is using them correctly and post-processing the results. We present our system for containerizing custom codes for reproducibility and ease of use, including MPI-based parallelism. The sources and build environments are retained and documented in the image to allow easy replication and reproducibility. Users can run the same binaries via Jupyter, Vagrant, or Sirepo (our browser-based beam simulation app). We provide both incremental updates and complete builds via GitHub, Docker Hub, and Vagrant Cloud. We describe successful strategies for continuous deployment, testing, ease of use and how these enable reproducibility. Our approach makes it straightforward to build and distribute Docker containers with multiple scientific codes and, hence, to operate a JupyterHub/Jupyter server.


Creating executable research compendia to improve reproducibility in the geosciences

Daniel Nüst, Institute for Geoinformatics, University of Münster

The project Opening Reproducible Research tries to reduce the barrier to reproducible research by developing a convention and supporting tools for Executable Research Compendia (ERCs), which include (i) the article, (ii) data, (iii) code, and (iv) the runtime environment to reproduce the study. The ERC provides a well-structured container for the needs of journals (the ERC as the item under review), archives (suitable metadata and packaging formats), and researchers (literally everything needed to re-do an analysis is there). It relies on Docker to define and store the runtime environment. ERCs should be simple enough to be created manually and absorb best practices for organizing digital workspaces. Complementarily, an online creation service automatically creates ERCs, including the Dockerfile and Docker image, from typical user workspaces for less experienced users. A validation and manipulation service will allow (a) users to create an ERC for their workflows with minimal required input, (b) users to interact with published ERCs, e.g. (peer) review the contents, or manipulate parameters of the workflow and explore interactive graphics, and (c) platform providers (e.g. journals, data repositories, archives, universities) to integrate o2r building blocks to expand their procedures with executable containers. The reference implementation focuses on the geoscience domain and the R language.

We show which steps and aspects of publishing and properly archiving computational research with containers can or cannot be automated for a specific community of practice, and point to future challenges. We will share the concepts behind the ERC and the state of the o2r architecture, software and API.

Slides.


Applying Containers at The University of Melbourne for HPC, Virtual Laboratories and Training

David Perry, The University of Melbourne

This talk outlines our experiences in applying containers to research applications at The University of Melbourne, focusing on three areas.

  1. Exploring whether containers can ease the burden on HPC system administrators to install and maintain an ever-growing range of applications, including different permutations of version and toolchain. This is likely to be particularly relevant for software with legacy requirements, niche use cases, rapid release cycles, and where architecture-specific optimization is unnecessary.
  2. Containers as a means of achieving portability between domain-specific web portals (virtual laboratories) and HPC systems. Such portals typically have limited internal resources, but offloading to HPC is hampered by inconsistency between clusters and lack of researcher control over the software environment.
  3. Our experience developing DIT4C, a container-based web app that offers ready-to-use programming and data analysis environments. We focus on its application in researcher training sessions, where it aims to overcome the difficulties of installing software on researcher computers, allowing greater focus on the tool itself.


A DevOps Approach to Research

Sebastian Pölsterl, Institute of Cancer Research

In recent years, Docker has become an essential tool for software development. We demonstrate that Docker containers together with the GitLab platform can be a useful tool for researchers too. It enables them to easily catch problematic code, automate analysis workflows, archive results, and share their software and its dependencies across platforms.

While a Docker image bundles the whole development stack and enables its cross-platform sharing, it is often cumbersome and repetitive to build, run, and deploy an image. GitLab is a software development platform built on top of the Git version control system with built-in support for Docker. Using GitLab’s continuous integration pipelines, most tasks related to managing Docker images can be automated. In addition, utilising tools from software development, we can perform automatic code analysis to identify faulty or problematic code as early as possible.

We explain how to set up a Docker-powered project in GitLab and how to automate certain tasks to ease the development workflow:

  1. How to automatically build a new Docker image once a project has been updated.
  2. How to identify faulty and problematic code.
  3. How to use GitLab to automatically run experiments and archive their results.
  4. How to share images with other researchers using GitLab’s Docker registry.
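The steps above can be sketched in a `.gitlab-ci.yml` like the following (the test command and image tags are illustrative; `$CI_REGISTRY_IMAGE` and the other `$CI_*` names are predefined GitLab CI variables):

```yaml
# Rebuild, test and publish the project image on every push
stages:
  - build
  - test

build-image:
  stage: build
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker login -u gitlab-ci-token -p $CI_JOB_TOKEN $CI_REGISTRY
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

run-tests:
  stage: test
  script:
    # Hypothetical test command run inside the freshly built image
    - docker run --rm $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA pytest
```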

Slides.


oswitch: Docker based “virtual environments” for flexible and reproducible Bioinformatics analyses

Anurag Priyam, Queen Mary University of London

Docker allows creating and sharing immutable computing environments (containers) that are isolated from the underlying operating system. The immutable and isolated nature of containers is key to reproducible research. As a result, containers are being rapidly embraced by the scientific community, and many scientific tools are now also available as docker containers. However, using “readymade” containers as part of a standard workflow is challenging for several reasons. First, different software containers make varying assumptions about the location of data volumes. Secondly, files generated by a process inside the container end up being owned by the root user on the host operating system. Finally, the switch away from the host user’s shell and current working directory when running a container is a barrier to scripting as well as interactive usage.

We created oswitch to overcome these hurdles. When oswitch runs a container, all non-system-critical directories on the local filesystem become available inside the container and results of a computation performed inside the container are automatically available on the host system, with the right file permissions. Further, oswitch uses the host user’s shell (bash, zsh, fish) and the directory from which oswitch was invoked as the entry point into the container, thus minimising the mental context switch of entering and exiting a container. The net effect is a “virtual environment” containing data to be analysed and specific version of specific tools. Users can enter a virtual environment and run the containerised software interactively (e.g., for testing or a quick analysis), or as part of a pipeline just like locally installed tools.
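A hypothetical session sketching this workflow (the image name and files are illustrative, and the exact oswitch invocation may differ between versions):

```
# On the host, from the directory containing the data to analyse:
$ oswitch biodckr/blast        # enter a "virtual environment" for this image
# ...same directory, same shell prompt, now inside the container...
$ blastp -query proteins.fasta -db nr -out results.txt
$ exit
# Back on the host: results.txt is present and owned by the host user.
```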

Slides.


Reproducibility of Scientific Workflow in the Cloud using Container Virtualization

Rawaa Qasha, Newcastle University

Workflows are increasingly used in the life sciences as a means to capture, share, and publish computational analyses. However, a problem that has arisen with workflow systems is workflow decay, due to the lack of an adequate workflow description, the unavailability of the resources required to execute the workflow (such as data and services), or changes in the workflow's execution environment. The problem of workflow decay is recognised as an impediment to the reuse of workflows and the reproducibility of their results. The reproducibility of scientific workflows is therefore fundamental to reproducible e-Science.

In this work, we introduce an approach to facilitate the reproducibility of scientific workflows by integrating containers, system description and source control. Our approach allows workflows’ components to be systematically specified and automatically deployed on a range of clouds using container virtualization techniques. Moreover, we leverage this integration to effectively support the re-execution and reproducibility of scientific workflows in the Cloud and in local computing environments.

Firstly, Docker containers are used to automatically deploy the tasks of scientific workflow to support an isolated and portable execution environment for each task.

Secondly, Docker virtualisation and imaging are used to package the whole workflow in one image, or each individual task in a separate image, in order to offer the workflows and their tasks as portable and ready-to-use packages. The images contain the full software stack needed to deploy the workflow or the tasks. In addition, by integrating our description approach with Docker technology, the packaging is automated: the workflow and task images are created automatically at deployment time, so users are freed from creating and managing Docker images. As a consequence, these images can later be used to re-execute the workflow, or as building blocks to create new ones.

Slides.


Using Docker on a Cloud Compute Platform

Chris Richardson, University of Cambridge

Docker containers encapsulate a complete environment, so are ideal for computing in heterogeneous environments. The same container might run on a laptop, a desktop workstation or HPC systems. Another possibility is running docker containers on cloud instances. Cloud providers usually offer VM instances for you to run in. It is also possible to run docker containers within the VM, and many cloud providers have predefined VM images to do this.

In this talk, we will look at one cloud provider, Microsoft Azure, and see how it can be used with docker containers. Microsoft Azure also has Infiniband RDMA hosts available, so can efficiently run large-scale MPI codes. Performance across the network is generally very good, at least up to 8 nodes (128 cores).

Azure Batch Shipyard is a framework which can be used to supervise the running of large jobs.

I will demonstrate the steps which you have to take to get docker running on Azure, and discuss some of the problems you may encounter, such as accessing file storage, MPI linking, etc.


Reproducible high energy physics analyses

Diego Rodríguez Rodríguez, CERN

The revalidation, reuse and reinterpretation of data analyses require having access to the original virtual environments, datasets and software that were used to produce the original scientific result. The CERN Analysis Preservation pilot project is developing a set of tools that support particle physics researchers in preserving the knowledge around analyses so that capturing, sharing, reusing and reinterpreting data becomes easier. The original research environments are captured in the form of Docker images, but other container technologies such as Singularity are under investigation. Assuming the full preservation of the original environment and its analysis workflow steps, the REANA (Reusable Analysis) system launches container-based processes on the computing cloud (Kubernetes), allowing massive workflow task parallelization. The REANA system aims to support custom data analysis workflow engines, several shared storage systems (Ceph, EOS) and compute cloud infrastructures (OpenStack, HTCondor), targeting general research data analysis patterns in different scientific disciplines.

Slides.


Containers at the Dutch National e-Infrastructure

Jeroen Schot, SURFsara

SURF is one of the major partners in the Dutch National e-Infrastructure, which provides researchers in the Netherlands with access to, and support for, advanced IT facilities [1]. In this talk I will discuss two examples of the adoption of containers at the Dutch National e-Infrastructure: support for Singularity containers on our HPC and HTC clusters, and a new service for private Jupyter notebook environments based on Docker and Kubernetes.

We operate a number of different compute clusters such as the national supercomputer Cartesius and part of the Grid infrastructure. To make it easier for researchers to move their applications to and between these clusters we started with enabling Singularity support on our clusters in early 2017.

Jupyter notebooks are interactive documents that mix documentation and code. These notebook documents are accessible via the browser and easy to share with other researchers. As such they are becoming an important tool for reproducible science [2]. We started offering Jupyter notebook environments in 2016 and converted our setup to Docker containers and the Kubernetes container orchestrator in April 2017. In the future we plan to add additional services on top of Kubernetes, such as Elasticsearch databases and private Spark clusters.

[1]: The Dutch National e-Infrastructure for Research, J. Templon and J. Bot, International Symposium on Grids and Clouds, 2016.
[2]: Interactive notebooks: Sharing the code, H. Shen, Nature, 2014.

Slides.


Reproducible Analysis for Government

Matthew Upson, Government Digital Service

A key function of government statisticians is to produce official statistics for publication. Often these statistics have a direct impact on government policy, so it is imperative that they are accurate, timely, and importantly: reproducible. At any point in the future, we should be able to reproduce all the steps required to produce a statistic, but manual processes (common in many publications) can make this challenging. Using literate programming tools, version control, continuous integration, code coverage, and ultimately containers can greatly improve the speed and accuracy with which publications can be produced, whilst ensuring their reproducibility. In this presentation I will take you through the journey of how these tools have been implemented for statistics production for the first time in the UK Government.


HPC infrastructure for high energy density physics research

Arturas Venskus, First Light Fusion Limited, Richard King, First Light Fusion Limited

Numerical simulations play a crucial role in understanding and predicting the behaviour of high energy density physics [1-2]. Essential components of such simulations are hardware, software, mathematical models and the scientists who orchestrate everything. Orchestration involves the software implementation of new mathematical models, updating of software components, trialling of different ideas, comparison of results from different software versions, usage of new software packages that involve dependencies, and so on. Simulations need to run smoothly on PCs as well as on HPC, development should not introduce new regression bugs, and the workflow needs to be automated as much as possible to speed up the development cycle.

To meet all requirements, we implemented a continuous integration system [3] that produces Docker [4] images and stores them in an on-premise Docker registry. Images are accessible from a scientist’s PC and from the HPC cluster. HPC jobs are scheduled using Kubernetes [5] and a customised job submission bash script. The implemented HPC infrastructure plays an important role in result reproducibility and archiving, because Docker containers encapsulate all dependencies, ensuring exact reproduction.
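As an illustration, a simulation job of this kind might be described to Kubernetes roughly as follows (the image, registry and resource figures are hypothetical):

```yaml
# Minimal sketch of a batch simulation job submitted to Kubernetes
apiVersion: batch/v1
kind: Job
metadata:
  name: simulation-run-42
spec:
  template:
    spec:
      containers:
      - name: simulation
        # Pinned image produced by the CI system and stored in the registry
        image: registry.example.com/simulation:1.4.2
        command: ["mpirun", "-np", "16", "/opt/sim/run"]
        resources:
          requests:
            cpu: "16"
      restartPolicy: Never
```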

[1] R. Ramasamy, N. Bevis, J.W.S. Cook, N.A.Hawker, D. Huggins, M. Mason, N. Niasse, A. Venskus, D. Vassilev, T. Edwards, H. Doyle, B.J. Tully, Hytrac: A Numerical Code for Interface Tracking, First Light Fusion Ltd. (to be submitted)
[2] D. Vassilev, N. Niasse, R. Ramasamy, B. Tully, Efficient solution to multi-temperature Riemann problem coupled with front-tracking for gas dynamics simulations, First Light Fusion Ltd. (to be presented in IOP Plasma Physics Conference, 2017)
[3] Atlassian, Bamboo.
[4] Docker.
[5] Google, Kubernetes.