Mentorship programme: Building workflows - research software development

Posted by j.laird on 1 September 2022 - 10:00am

[Image: mural of two hands reaching out for each other. Photo by Joshua Reddekopp on Unsplash]

By Monika Gonka, PhD student, University of York.

This blog post reflects on our Learning to Code mentorship programme as part of a Research Software Camp.

Reproducibility of bioinformatic analyses

Computational science has revolutionised the field of cancer research. Bioinformatic analyses can generate many new testable biological hypotheses and provide unprecedented insight into the onset of disease and its progression. However, recent discussions amongst researchers have identified issues with the reproducibility of published results. Machine-readable analyses should be easy to reproduce, right? It seems not always so [1,2].

Several reasons emerge, such as unclear rules on sharing code, undocumented assumptions, differences between computing environments and conflicting software dependencies. Bioinformatic software containers and automated workflows may help resolve these issues [3].

Software containerisation and multi-step workflows

Containers encapsulate units of software into independently deployable bits of code [4]. In this lightweight virtualisation technology, software and its dependencies are distributed in ‘images’, which allow analyses to be executed under the same computational conditions [5–7]. In recent years, packaging and containerisation systems such as Docker [8,9], Singularity [10], Conda [11], Bioconda [12] and BioContainers [13,14] have gained popularity among scientists because they make software portable across research groups while preserving controlled computing environments [14–16]. In addition, bioinformatic pipelines can be orchestrated via workflow managers such as CWL [17], Nextflow [18] or Snakemake [19]; there are now hundreds of available frameworks to choose from [20]. They are usually integrated with containers and save time by enabling re-entrancy (restarting the pipeline from the last successful step) and efficient resource distribution [7].
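
To make the idea of re-entrancy more concrete, here is a minimal Python sketch. It is not how Nextflow, Snakemake or any CWL runner is implemented; those tools track completed work in more robust ways. The function `run_pipeline`, the `.done` marker files and the toy `trim` and `align` steps are all hypothetical names invented for this illustration.

```python
import tempfile
from pathlib import Path


def run_pipeline(steps, workdir):
    """Run (name, function) steps in order, skipping completed ones.

    A hypothetical `<name>.done` marker file is written after each
    successful step, so a rerun resumes after the last success.
    Returns the names of the steps executed in this call.
    """
    workdir = Path(workdir)
    executed = []
    for name, func in steps:
        marker = workdir / f"{name}.done"
        if marker.exists():
            continue  # finished in an earlier run, so skip it
        func(workdir)  # an exception here stops the pipeline
        marker.touch()  # record success only after the step returns
        executed.append(name)
    return executed


def demo():
    """Fail on the first run, fix the input, then resume."""
    calls = []  # every step invocation, across both runs

    def trim(workdir):
        calls.append("trim")
        (workdir / "trimmed.txt").write_text("ACGT\n")

    def align(workdir):
        calls.append("align")
        if not (workdir / "reference.txt").exists():
            raise FileNotFoundError("reference.txt is missing")
        (workdir / "aligned.txt").write_text("aligned\n")

    steps = [("trim", trim), ("align", align)]
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        try:
            run_pipeline(steps, workdir)  # align fails: no reference yet
        except FileNotFoundError:
            pass
        (workdir / "reference.txt").write_text("ACGTACGT\n")
        resumed = run_pipeline(steps, workdir)  # trim is skipped
    return calls, resumed


if __name__ == "__main__":
    print(demo())
```

In this toy run the trimming step executes only once, even though the pipeline is started twice. Marker files are the simplest possible bookkeeping: they do not notice when an input file or a step's code has changed. Detecting such changes is part of what a dedicated workflow manager adds.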

These systems help us pursue the FAIR (Findability, Accessibility, Interoperability and Reusability) objectives in data stewardship [21]. Community members have devised various sets of guidelines to help us develop and use bioinformatic tools with software sustainability in mind [15,16,22–27]. Proper assembly of these containerised software pieces may help advance reproducibility in research [28].

Software engineering for researchers

Throughout the Software Sustainability Institute's 8-week mentorship programme, I explored the intricacies of software development using container systems. I reviewed the latest published literature and online discussions in the field and followed several tutorials, experimenting with different next-generation sequencing tools and workflow platforms. Keeping in mind the importance of reproducible analyses, I took special care to apply recommended software-engineering best practices when constructing bioinformatic pipelines.

Thanks to the programme, I improved my programming skills and hope they will help me in my multi-omic analyses of normal and mutant blood stem cells. I would like to express my gratitude to Adrian D'Alessandro, Mikhael Manurung and the Software Sustainability Institute Team - thank you for sharing your knowledge! I am really excited about my next steps in research software development.

1. Beaulieu-Jones, B. K. & Greene, C. S. Reproducibility of computational workflows is automated using continuous analysis. Nat Biotechnol 35, 342–346 (2017).

2. Simoneau, J. & Scott, M. S. In silico analysis of RNA-seq requires a more complete description of methodology. Nat Rev Mol Cell Biol 20, 451–452 (2019).

3. Schulz, W. L., Durant, T. J. S., Siddon, A. J. & Torres, R. Use of application containers and workflows for genomic data analysis. J Pathol Inform 7, 53 (2016).

4. Biocontainers: What is a Container? biocontainers-edu.readthedocs.io/en/latest/what_is_container.html, Accessed: 2022/07/15.

5. Docker: Use containers to Build, Share and Run your applications. www.docker.com/resources/what-container/, Accessed: 2022/07/15.

6. Silver, A. Software simplified. Nature 546, 173–174 (2017).

7. Wratten, L., Wilm, A. & Göke, J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 18, 1161–1168 (2021).

8. Docker. www.docker.com, Accessed: 2022/07/15.

9. Merkel, D. Docker: lightweight Linux containers for consistent development and deployment. Linux Journal 2014, (2014).

10. Kurtzer, G. M., Sochat, V. & Bauer, M. W. Singularity: Scientific containers for mobility of compute. PLoS One 12, e0177459 (2017).

11. Conda. conda.io, Accessed: 2022/07/15.

12. Grüning, B. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods 15, 475–476 (2018).

13. da Veiga Leprevost, F. et al. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics 33, 2580–2582 (2017).

14. Bai, J. et al. BioContainers Registry: Searching Bioinformatics and Proteomics Tools, Packages, and Containers. J Proteome Res 20, 2056–2061 (2021).

15. Gruening, B. et al. Recommendations for the packaging and containerizing of bioinformatics software. F1000Res 7, ISCB Comm J-742 (2018).

16. Leipzig, J. A review of bioinformatic pipeline frameworks. Brief Bioinform 18, 530–536 (2017).

17. Crusoe, M. R. et al. Methods included: standardizing computational reuse and portability with the Common Workflow Language. Commun ACM 65, 54–63 (2022).

18. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat Biotechnol 35, 316–319 (2017).

19. Mölder, F. et al. Sustainable data analysis with Snakemake. F1000Res 10, 33 (2021).

20. Amstutz, P., Mikheev, M., Crusoe, M. R., Tijanić, N., Lampa, S. et al. Existing Workflow systems. Common Workflow Language wiki, GitHub. https://s.apache.org/existing-workflow-systems, Updated: 2022/06/20, Accessed: 2022/07/15.

21. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016).

22. Kadri, S., Sboner, A., Sigaras, A. & Roy, S. Containers in Bioinformatics: Applications, Practical Considerations, and Best Practices in Molecular Pathology. J Mol Diagn 24, 442–454 (2022).

23. Brack, P. et al. Ten simple rules for making a software tool workflow-ready. PLoS Comput Biol 18, e1009823 (2022).

24. Taschuk, M. & Wilson, G. Ten simple rules for making research software more robust. PLoS Comput Biol 13, e1005412 (2017).

25. Lee, B. D. Ten simple rules for documenting scientific software. PLoS Comput Biol 14, e1006561 (2018).

26. Sandve, G. K., Nekrutenko, A., Taylor, J. & Hovig, E. Ten simple rules for reproducible computational research. PLoS Comput Biol 9, e1003285 (2013).

27. Nüst, D. et al. Ten simple rules for writing Dockerfiles for reproducible data science. PLoS Comput Biol 16, e1008316 (2020).

28. Perkel, J. M. Workflow systems turn raw data into scientific knowledge. Nature 573, 149–150 (2019).