
Keeping Track: Version Control for Reproducible Research


Author(s)

Hui Ling Wong

SSI fellow

Posted on 23 July 2025

Estimated read time: 8 min

Image: ‘Track Project History’ by Scriberia, from The Turing Way.

On the 22nd of May, the Imperial College London Research Software Engineering community hosted the first session of the Research Software Conversation series: Tools and Techniques for Modern Research in different domains. This event explored how version control helps to make research more reproducible.

The event opened with a short talk by Ms Hui Ling Wong, ‘Presentation Final Final Final: Version Tracking with Git’, which outlined the challenges version control addresses, demonstrated its role in research reproducibility and productivity, and offered a concise Hitchhiker’s Guide to Git for researchers. To complement her presentation, she also introduced a website she is developing to signpost researchers toward useful Git-related resources. This was followed by a round-table discussion, facilitated by Dr Irufan Ahmed (Senior Research Software Engineer), with Dr Chris Cantwell (Reader in Computational Engineering), Prof. Sylvain Laizet (Professor in Computational Fluid Mechanics), Prof. Rafael Palacios (Professor in Computational Aeroelasticity), and Dr Alvaro Cea (Research Associate).

Photo: a group of people in a lecture hall (by Saranjeet Bhogal).

The discussion was frank, insightful, and occasionally humorous, unfolding in an engaging fireside-chat atmosphere as panellists shared hard-earned lessons from decades of working with (and without) version control. A wide range of topics was explored, from the pivotal moments that drove adoption to how academic incentive structures often hinder, rather than promote, the adoption of good software practices. Though time ran short, the conversation yielded rich insights. The following sections are highlights from the discussion.

Photo: a group of people sitting in chairs (by Saranjeet Bhogal).

Git Happens: The Journey to Version Control

In research, the multitude of demands placed on researchers means that version control adoption is rarely driven by a commitment to best practices. Instead, it is forged in the fires of lost time, mounting frustration with the status quo, and a deep-seated conviction that there must be a better way. Those experiences turn version control into a vital tool for tackling the trials academia throws at researchers.

This journey was a resonant theme through the panel’s stories:

  • Sylvain reached a turning point when the growing complexity of his code made further progress untenable without version control and better software practices.
  • Chris watched Nektar++ grow beyond a departmental project. Dropping Subversion in favour of Git became the linchpin for empowering collaborators to work independently while safeguarding the code from wayward PhD projects.
  • Rafa’s aerospace industry experience, where he effectively became the version control system and endured the tedium of a folder-based system, left an indelible mark on him. When he returned to academia, it spurred him to learn Subversion and to begin challenging the prevailing mindset that treated PhD code as disposable, valuing only the resulting equations and theory.

Collectively, the panel was glad to consign to history the bygone days of tarballs of code and manually picking out changes. Underlying this sentiment was a core message from Hui Ling’s talk that had resonated with the panel: the code you write and the practices you adopt are not for your present self, but for your future self.

When Best Practices Meet Publish-or-Perish

Employing version control and ensuring reproducibility are fundamental to high-quality research; they should serve as the benchmark for all research. However, the relentless push for publications and career milestones within academia’s incentive system often pushes these best practices out of sight. The view that the machinery of academia places researchers under extraneous pressure working in direct opposition to these practices was succinctly summarised by one panellist in a single word: “everything”.

Over an academic career, researchers are judged first on publication counts and citation impact, and later on their ability to secure sizeable grants. Reproducibility, however, rarely features among these yardsticks. This omission trickles down into the culture of a research group, where a computational PhD student can complete their studies without ever using version control or submitting their code. Panellists agreed that only a top-down mandate from institutions, funding agencies, and journals has the potential to drive the lasting cultural shift needed to make reproducibility and good software practices standard.

Encouragingly, more journals are now asking authors to share the data sets and configuration files behind their papers and to make their code available. This requirement moves the field beyond simply showcasing impressive results to ensuring those findings can be independently verified and built upon.

Despite the gloomy outlook, the conversation had a levity to it, buoyed by a genuine sense of hope. At its core, the panel agreed that the real impetus behind the adoption of version control is the pursuit of research excellence. Tackling the most demanding (and often most rewarding) problems requires code that can handle significant complexity and can evolve. Version control and supporting tools become indispensable, with rigorous workflows and safeguards embedded into every stage of development. In other words, achieving the highest levels of insight and impact requires robust tooling and reproducibility baked in at the core.

Shifting the Academic Glacier (with a Spoon and Optimism)

Currently, computational researchers wade into complex code without a formal induction. Experimentalists, by contrast, receive thorough safety briefings before they handle any equipment. Poor code and software practices may not maim researchers or cause an accidental explosion, but they can subject them to needless frustration and waste invaluable time. To close this gap, we need to extend that same level of structured onboarding to every newcomer and support every member in the endeavour.

At a team level, newcomers should be inducted into version control and documentation best practices from the start, guarding against preventable errors and future pain. Coupling this with a culture of regular code reviews and occasional pair-programming sessions with a more senior member of the team will help onboard overwhelmed newcomers and dispel the initially daunting nature of software development.

At an institutional level, the introduction of Research Software Engineers (RSEs) has been a long-awaited game-changer. RSEs are the lab technicians of the computational research world – irreplaceable and capable of making all the difference. They deliver the dedicated support and expertise that experimentalists have long enjoyed but computational researchers have lacked. In Imperial’s Aeronautics Department, the RSE role is a recent yet already impactful addition. By introducing RSE roles in each department, we can transform reproducibility from an optional extra into the norm, while simultaneously improving productivity.

These actions can be complemented by debunking two insidious myths: that hosting code on GitHub is equivalent to open sourcing it, and that learning Git is as intimidating as many assume. Moreover, it should be reinforced that version control alone is not a silver bullet. Every repository should have a robust test suite incorporated into continuous integration pipelines. Automated testing enforces quality, catches issues early, enables pinpointing where a bug was introduced, and gives researchers the confidence to iterate rapidly.
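As a flavour of that last point, a test suite lets Git’s built-in bisect command binary-search the project history for the commit that introduced a bug. A minimal sketch, assuming a hypothetical test script run_tests.sh that exits non-zero on failure and a tag v1.0 known to be good:

    # Binary-search the commit history for the change that broke the tests.
    git bisect start
    git bisect bad HEAD              # the current commit fails the tests
    git bisect good v1.0             # a commit or tag known to pass them
    git bisect run ./run_tests.sh    # Git reruns the tests at each step
    git bisect reset                 # return to the original branch when done

With a few hundred commits between the good and bad points, this narrows the culprit down in fewer than ten automated steps.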

Outside the research sphere, rethinking how programming is taught to undergraduates is the next (albeit idealistic) big step. Currently, STEM undergraduates learn to program on toy problems, which do not adequately prepare them for the sprawling code bases they will encounter in practice. Consequently, they should be trained to 1) break problems down into modular, decoupled, reusable components; 2) navigate and understand large code bases; and 3) apply design patterns and software architecture principles. The emphasis on usability and maintainability will not only cultivate researchers who naturally think in terms of clean code but also equip them with transferable skills that are valuable in any industry.

Ultimately, lasting change will demand more than individual good intentions. Universities, funding bodies, and journals must set clear mandates, provide infrastructure, and reward reproducible practices. Every workshop we run, policy we influence, and RSE we hire chips away at the ice. If we pick up our spoons and act with optimism, we can transform best practices into the academic standard we aspire to.

 


Why should you care about reproducible code — and how to get started?


Author(s)

Diana I. Bocancea

Daniela Gawehns

Julian Lopez Gordillo

Sam Langton

Katinka Rus

Sally Hogenboom

Iris Spruit

Eduard Klapwijk

Posted on 5 November 2024

Estimated read time: 7 min

Figures looking at a green path

This blog was originally published on the Netherlands eScience Center Medium page.

On 23 April 2024, the first ‘National Research Software Day’ took place in Hilversum, the Netherlands. During the unconference part of the program, Diana Bocancea ran the session about the importance of reproducible code.

Despite the increased awareness regarding reproducibility in recent years, most research results are not computationally reproducible: they cannot be independently reproduced. The main reason for this is that in most cases, data and code are not shared publicly. But even when a researcher openly shares their data and code with the public, reviewers or research colleagues, their findings can rarely be reproduced in their entirety. Perhaps the code cannot be executed, only parts of the results are generated, or perhaps the results produced are totally different from the published study. Reproducibility can even be a challenge internally. As any programmer will know, just because your code runs perfectly today, it does not mean it will run perfectly in five years’ time (or even five days’ time!).

But why does writing reproducible code even matter, and how might you as a researcher get started on this journey toward reproducible research?

Benefits of working reproducibly

One reason is that it will make your life as a researcher easier! Many of the components that make a piece of research reproducible — well-documented, clearly written code, containerised environments, properly organized data — are also things that save a lot of time. These activities ensure that when you return to code six months later, the scripts still run, and you don’t have to spend three days debugging them. It also means that your code can be shared and reused by your colleagues, saving them time, and giving you credit (e.g., authorship) in the process. There are other reasons too, including reputational benefits and advantages during peer review. You can read more about ‘selfish’ reasons to make your research reproducible here.
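As a small illustration, pinning the exact package versions behind an analysis takes one extra command and makes those later reruns far more likely to succeed. A minimal sketch using Python’s venv and pip (conda or renv work similarly; the package names are only examples):

    # Create an isolated environment and record the exact versions used.
    python -m venv .venv
    source .venv/bin/activate
    pip install numpy pandas          # example dependencies
    pip freeze > requirements.txt     # pin exact versions alongside the code

    # Six months later (or on a colleague's machine), recreate it with:
    pip install -r requirements.txt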

What about the scientific community? We are currently in a situation where a large proportion of research is not reproducible. This situation threatens the integrity of scientific research, weakens our evidence base, and ultimately might lessen public trust in science. The main channel for scrutinising and sharing scientific results, peer-reviewed journals, is slowly adapting to this realisation. Increasingly, researchers are encouraged, if not expected, to provide their data, code, and other materials alongside the publication itself. In time, we could see reproducibility move from being an optional bonus to becoming a mandatory part of the research (and publication) process. Adapting to this change early will bring you all the benefits noted above (e.g., timesaving, code reuse) but will also prepare you for the future.

Reproducible tools as a contribution to science

On that note, the changing perspective on the importance of reproducibility is bringing new career paths with it. For example, beyond the fundamental tools that enable reproducible research (such as git for version control), other higher-level tools are appearing to address challenges particular to some scientific domains. Usually, they are aimed at solving well-known problems for researchers from a certain field, problems little known outside that niche. They might revolve around workflow management, experiment design, or the standardisation of certain procedures within the community. In many cases, the developers behind those software tools and resources are… the researchers themselves. They might have struggled with these issues in their own research and decided to take up the task of developing the tools that they wished they had (for example, extensive Python-based processing pipelines such as fmriprep in the neuroimaging field, and thousands of R packages, ranging from complex statistical modelling packages such as brms for Bayesian regression to packages that help you format manuscripts, such as papaja). In doing so, they shifted their focus from their original subject domain to the mission of making research within that domain reproducible. This typically takes the form of developing the software libraries that make that possible and integrating them with the standard software used within the domain.

The whole scientific community can benefit from such tools! Newer research can be built on top of them, without the need to solve common reproducibility issues from scratch. These software developments can be just as valuable a contribution to the research domain as other research findings, and as such, they should be recognised accordingly. And just as it is possible to publish your research findings, it should be possible to publish your code contributions when they are significant enough. A good example of this idea put into practice is the Journal of Open Source Software (JOSS), where the submitted code takes the main stage in the review process (as opposed to being relegated to “supplementary material”). Initiatives like JOSS showcase developments around reproducible research as a meaningful contribution to science and a viable career path, both of which are powerful incentives for researchers to get interested in the topic.

In time, with all these smaller and bigger changes, scientific research can become more trustworthy, more reliable, and in turn, more impactful.

How to get started

The inevitable question that follows is: how do you get started with reproducibility? One answer is training. Luckily, there are many training initiatives, both national and international, to help you get started. For example, many institutions organize Software Carpentry and Data Carpentry workshops that teach foundational coding and data science skills.

One way senior academics can make a difference, as group leaders, supervisors, and grant reviewers, is to give (junior) colleagues the time and incentives to value and practice reproducibility. For instance, supervisors could have all PhD students replicate and extend an existing analysis as part of their initial research. Reproducing an existing piece of work will familiarize the student with the common challenges that come with doing good science: it might entail finding and understanding a certain dataset (sometimes difficult even to get access to), as well as the software (e.g. scripts or packages) that was used to produce the results. Running the previous analysis, often on a different computer and at a later time (when software dependencies have likely changed), would check the computational reproducibility of the previous work and, in doing so, be a valuable learning experience for the student.
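In practice, a first pass at such a replication often boils down to a short sequence like the one below; the repository URL and entry-point script here are hypothetical, and real attempts frequently stall at one of these steps, which is exactly what makes the exercise instructive:

    # Fetch the published code and try to rerun the analysis as documented.
    git clone https://github.com/some-lab/published-analysis.git
    cd published-analysis
    python -m venv .venv && source .venv/bin/activate
    pip install -r requirements.txt    # if the authors pinned their dependencies
    python run_analysis.py             # hypothetical entry point from the README
    # Finally, compare the regenerated figures and tables with the paper.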

Group leaders benefit from reproducible workflows as they prevent (PhD) students from re-writing the same piece of software again and again. While learning the ropes is important for any junior scholar, it is not very efficient if every new generation of students re-writes code for basic operations or frequently used analysis methods.

In addition to the benefits for an academic career, researchers also increase their employability outside academia by learning digital skills (such as version control or programming reusable pieces of code) that are valued in many different (industry) jobs.

In modern science, computational methods are the norm in almost every discipline. Yet attempts at reproducibility are almost always unsuccessful due to missing materials and/or lack of skills. Part of this problem can be mitigated by learning how to produce reproducible code: how to write documentation, perform version control, and manage packages. Doing so will benefit you as a researcher, but also your colleagues, and the wider scientific community, because your (coding) efforts will become reusable. Increasing the use of reproducible workflows is in the interest of many stakeholders in academia — increasing the reproducibility of research is key for a broader change in how we do science.

Authors

Diana I. Bocancea

Amsterdam UMC

Daniela Gawehns

University Medical Center Groningen

Julian Lopez Gordillo

Naturalis Biodiversity Center

Sam Langton

Amsterdam UMC

Katinka Rus

Sally Hogenboom

Open Universiteit

Iris Spruit

Universiteit Leiden

Eduard Klapwijk

Erasmus School of Social and Behavioural Sciences

Image by The Turing Way.

 

Docker Introduction

Estimated read time: 1 min

This tutorial introduces the use of Docker containers with the goal of creating reproducible computational environments. Such environments are useful, for example, for ensuring reproducible research outputs.
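As a taste of what such an environment looks like, a short Dockerfile can pin the operating system and software stack for an analysis so that it runs identically on another machine. A minimal sketch, with hypothetical script and dependency files:

    # Write a Dockerfile that pins the environment, then build and run it.
    cat > Dockerfile <<'EOF'
    FROM python:3.11-slim
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    COPY analysis.py .
    CMD ["python", "analysis.py"]
    EOF
    docker build -t my-analysis .    # bake the environment into an image
    docker run --rm my-analysis      # run the analysis inside the container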

Go to tutorial

INTERSECT research software engineering training

Estimated read time: 1 min

INTERSECT provides a wide range of training material for RSEs, covering topics such as continuous integration, Git, collaboration, licensing, packaging, performance, reproducibility, and software engineering, among others.

Go to training material

R4E reproducibility workshops

Estimated read time: 1 min

R4E (Reproducibility for Everyone) is a community-led education initiative that offers workshops on open research practices. Click the link below to find workshops near you or to lead a workshop on making research and software reproducible.

Go to R4E website

They have also published resources around open research here.

Making software ready for publication

Estimated read time: 1 min

This course focuses on making software ready for publication, looking at software reproducibility and Git. 

Go to course

Code Reproducibility Training

Estimated read time: 1 min

This training programme is being developed as part of the ELIXIR-CONVERGE project. It is currently led by Alexia Cardona (ELIXIR-UK) and Nazeefa Fatima (ELIXIR-Norway).

They aim to create an extensive reproducibility training programme that equips learners with the core skills required to develop sustainable and reproducible code. The training materials will be developed for people with non-computational backgrounds.

Find out more and sign up for training courses
