Improving reproducibility and sustainability with ReproPhylo
By Malcom Illingworth, Research Software Engineer.
The Software Sustainability Institute has been working with the School of Biological, Biomedical and Environmental Sciences at the University of Hull to help improve the sustainability of their ReproPhylo software suite. The ReproPhylo developers applied for consultancy from the Institute via the Open Call.
Root-knot nematodes are parasitic roundworms whose larvae infect plant roots. These tiny parasites have a devastating impact on agriculture, causing 5% of crops to be lost worldwide each year. The Evolutionary Biology Group in the School of Biological, Biomedical and Environmental Sciences at the University of Hull studies the genomes of these root-knot nematodes to understand how they have evolved and continue to evolve, their diversity, and the threat they pose to crops. This research applies both large-scale comparative genomics (comparing the genomes of different organisms) and phylogenomics (analysing genome data and evolutionary relationships). As part of their research, the group has developed ReproPhylo, a phylogenomics pipeline written in Python. ReproPhylo allows researchers to carry out phylogenomic analyses using pre-written or self-written commands.
ReproPhylo has also been designed to promote open and reproducible science. Because analyses are expressed as commands, every stage can be explicitly recorded and exactly reproduced if required. Input sequence data and generated intermediate data files (e.g. alignments or metadata) are tracked and held under version control so that the exact versions of files used for analysis can be identified. For each experiment, ReproPhylo writes a detailed, yet human-readable, graphical report. ReproPhylo can also create a ZIP archive of an experiment that can be uploaded to Figshare or other digital repositories. Researchers are encouraged to deploy ReproPhylo as a Docker container, which includes not just the experimental components but also the phylogenetics programs and their dependencies, ensuring that other researchers can run the same environment as the original researcher.
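The idea of identifying exact file versions and packaging an experiment can be illustrated with a minimal sketch. This is not ReproPhylo's own code or API: the function names sha256_of and archive_experiment and the MANIFEST.json file name are hypothetical, and checksums stand in here for the version control that ReproPhylo actually uses. The sketch records a SHA-256 digest for every file in an experiment directory and writes the files, plus the manifest, into a ZIP archive.

```python
import hashlib
import json
import zipfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large sequence files need not fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_experiment(experiment_dir, zip_path):
    """Zip every file under experiment_dir and add a checksum manifest.

    Hypothetical helper for illustration only. Assumes the directory does
    not already contain a file named MANIFEST.json at its top level.
    """
    experiment_dir = Path(experiment_dir)
    # Collect the file list before creating the archive, so the archive
    # itself is never included even if it lives in the same directory.
    files = sorted(p for p in experiment_dir.rglob("*") if p.is_file())
    manifest = {
        p.relative_to(experiment_dir).as_posix(): sha256_of(p) for p in files
    }
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for p in files:
            archive.write(p, p.relative_to(experiment_dir).as_posix())
        archive.writestr(
            "MANIFEST.json", json.dumps(manifest, indent=2, sort_keys=True)
        )
    return manifest
```

Anyone who later unpacks such an archive could recompute the digests and compare them with the manifest to confirm that the inputs and intermediate files are byte-for-byte the ones used in the original analysis.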
ReproPhylo uses Biopython and other open tools, and the code is hosted on GitHub, where it has been released into the public domain under the Creative Commons CC0 1.0 Universal dedication. Users are also encouraged to improve the user documentation, hosted on Google Docs.
The Institute performed a thorough analysis of the existing code using the PyCharm development environment and the Pylint analysis tool. The existing installation instructions were evaluated by following them to deploy the software both as a native install and as a Docker-based install. The existing tests were analysed and run using nosetests, and overall test coverage was measured using the coverage tool. A report was produced outlining how the code could be refactored to be more maintainable for future developers, how unit testing frameworks could be integrated into the code, and how overall test coverage could be improved. A full-day meeting was held at the University of Hull with the ReproPhylo developers to discuss the findings of the report, how to make the code more modular to support future third-party software tools, and the integration of ReproPhylo with the Travis CI continuous integration service.
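As a rough illustration of the kind of unit test the report had in mind, the sketch below tests a small helper using Python's standard unittest module. The gc_content function is a hypothetical example and is not taken from the ReproPhylo code base; the report's actual recommendations are not reproduced here.

```python
import unittest


def gc_content(sequence):
    """Return the fraction of G and C bases in a nucleotide sequence.

    Hypothetical helper used only to give the tests something to exercise.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


class GcContentTest(unittest.TestCase):
    def test_mixed_case_input(self):
        # Lower-case bases should be counted the same as upper-case ones.
        self.assertAlmostEqual(gc_content("acGT"), 0.5)

    def test_sequence_without_g_or_c(self):
        self.assertEqual(gc_content("ATAT"), 0.0)

    def test_empty_sequence_raises(self):
        with self.assertRaises(ValueError):
            gc_content("")
```

Tests written as unittest.TestCase classes can be run with "python -m unittest", and they are also discovered by nosetests. Running them under the coverage tool (for example "coverage run -m unittest" followed by "coverage report") shows which lines the tests exercise.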