C4RR Lightning Talks

Abstracts of the lightning talks, sorted by the last name of the first author.

The lightning talks will take place in the afternoon of the second day, 28 June, starting at 15:00.

Using Containers to drive reproducibility best practices in Bioinformatics training

Mark Fernandes, Quadram Institute Bioscience

At Quadram Institute Bioscience, we have implemented an innovative training programme called ‘Bite-sized Bioinformatics’.

These are regular, short (60-90 minute) sessions in which a single-topic talk is followed by an applied practical exercise.

A key requirement is a means of quickly and easily deploying the training environments to users' machines both inside and outside the Institute. This enables learners to complete the practicals in their own time, at their own location and pace, outside of the sessions.

Alongside some use of VMs, we have increasingly used Docker containers. We contend that actively engaging with containers in a learning environment is powerful advocacy for their further use in research, since the strengths and limitations of the technology can be experienced first-hand.

The containers have also been used to introduce other good reproducibility practices, such as Jupyter Notebooks.
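As a minimal sketch of this pattern (not the Institute's actual setup: the image comes from the public Jupyter Docker Stacks, and the exercises directory and port mapping are assumptions), a single command can hand a learner a complete notebook environment:

```shell
# Launch a self-contained training environment with Jupyter pre-installed.
# jupyter/datascience-notebook is one public image; a course-specific
# image could be substituted. The exercises/ directory is hypothetical.
docker run --rm -p 8888:8888 \
  -v "$PWD/exercises":/home/jovyan/work \
  jupyter/datascience-notebook
# The container prints a tokenised URL; learners open it in a browser.
```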

Slides, video.

Containing infrastructure models for testing and scale

Tom Russell, Oxford University, Environmental Change Institute, ITRC Mistral project

NISMOD (National Infrastructure Systems MODel) is a system-of-systems model that couples simulation models of the energy, water, wastewater, transport and solid waste systems to evaluate long-term plans, risk and resilience under socioeconomic and climate scenarios.

Motivated partly by a scenario and decision space that looks amenable to an embarrassingly parallel scale-out, and partly by the need to isolate and test components, we are migrating models to build and test in containers under continuous integration. We are currently prototyping the model-running architecture, which should take advantage of the opportunities to scale out.
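As an illustrative sketch of this pattern (image, path and scenario names are hypothetical, not NISMOD's actual layout), each sector model is built and tested in isolation, and independent scenario runs become separate containers:

```shell
# Build one sector model in isolation and run its tests in the container.
docker build -t nismod/energy-model energy_model/
docker run --rm nismod/energy-model python -m pytest tests/

# The scenario/decision space is embarrassingly parallel, so independent
# runs can be launched as separate containers.
for scenario in low central high; do
  docker run --rm nismod/energy-model run_model --scenario "$scenario" &
done
wait
```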

Slides, video.

EBI Metagenomics: building shareable and reproducible workflows for metagenomic analysis

Maxim Scheremetjew, European Bioinformatics Institute (EMBL-EBI)

EBI metagenomics (EMG, https://www.ebi.ac.uk/metagenomics/) is a free-to-use hub for the analysis and exploration of metagenomic, metatranscriptomic, amplicon and assembly datasets. The resource provides rich functional and taxonomic analyses of user-submitted sequences, as well as analysis of publicly available metagenomic datasets that are held within the European Nucleotide Archive (ENA). 

Metagenomics is a rapidly changing field, where new analysis tools are frequently released. Following evaluation and validation, these tools can be integrated into existing analysis pipelines if they provide novel analyses or outperform existing components (for example, providing more accurate analyses or requiring substantially less compute). Easy integration of such components and reproducibility of results for validation are therefore important. However, these aspects are presently largely overlooked within the metagenomics community.

Here, we outline our work describing the EBI metagenomics analysis pipeline in a generic format, using the Common Workflow Language (CWL, http://www.commonwl.org/). This approach will allow us to integrate new tools and substitute existing components within our analysis pipeline, providing greater flexibility and helping to streamline the pipeline update process. In addition, the work will facilitate sharing of pipeline sub-workflows with collaborators and end users, and greatly simplify evaluation of potential new pipeline components.

At the same time, metagenomic datasets are growing ever larger, so it is important to address future analysis challenges. To this end, we have continued to refine the pipeline, aiming to improve scalability and portability. As part of this process, we have begun to investigate ways to deploy the pipeline on computing clouds within the ELIXIR hub, as well as on commercial clouds such as Amazon or Google. In addition to our CWL work, we also discuss ongoing work to deploy the analysis pipeline on the cloud through containerisation.
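One property that makes this combination portable is that a CWL description can be validated and executed with the reference runner, with container steps handled automatically. A minimal sketch (file names are placeholders, not EMG's actual workflow files):

```shell
pip install cwltool                # reference CWL runner
cwltool --validate pipeline.cwl    # check the workflow description
cwltool pipeline.cwl inputs.yml    # execute with a job-input file; steps
                                   # declaring a DockerRequirement run in
                                   # their stated container, so the same
                                   # description works on a laptop,
                                   # a cluster or a cloud VM
```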

Slides.

Fig2code: create runnable, reproducible figures

Robert Stojnic, Dataprogrammers

Fig2code is a free, open-source tool to create figures that contain all the information needed to run them. 

Our goal is to make it easy for data scientists of all skill levels to create runnable, reproducible figures with minimal effort.

Fig2code is based on Docker, but no knowledge of Docker is needed to use it. It abstracts away the Docker details so that users can focus on operations meaningful to data-analysis workflows.

The basic usage is: 'fig2code run myScript.R'. This examines the local environment and the R script, provisions a Docker container with the correct dependencies, runs the script inside the container, and records the output into an HTML file with the output figure, R script, Dockerfile and any other required files embedded. Data can be embedded, or replaced with SHA-256 checksums and distributed separately (e.g. through the Dat project).
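A sketch of the resulting workflow (only the 'fig2code run' command comes from the project; the data file name is hypothetical, and the verification step uses standard tooling rather than anything fig2code-specific):

```shell
# Produce a self-contained, runnable HTML figure from an R script.
fig2code run myScript.R

# If large data are shipped separately rather than embedded, recipients
# can check them against the recorded SHA-256 digests with standard tools.
sha256sum mydata.csv    # hypothetical data file
```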

The output HTML, by default, shows only the figure, so it is immediately viewable and distributable without any additional tools. However, the HTML also contains data sufficient for anyone to reproduce the figure. As such, it is a new way to create and distribute reproducible figures.

The project is at an early stage, and the poster will focus on its goals and roadmap. We look forward to feedback and invite potential collaborators to join us.

Project page: https://github.com/rstojnic/fig2code

Slides.