If you are reading the Institute's blog, you are probably well aware that your life will be affected by legacy code at some point. Perhaps most relevant at the moment is the modelling of the Covid-19 outbreak. The private codes of research groups, previously only of interest to a narrow research community and the small number of individuals who maintain them, have come into the spotlight and under public scrutiny. There are hundreds, if not thousands, of legacy code bases that are critical to groundbreaking scientific research and government policy around the globe at this very moment.
Of course, being legacy does not mean these codes are faulty. Most have been painstakingly written and verified against empirical evidence. The more important point is how difficult it is to continue to verify, update, and extend this code and, by extension, how confident we can be that its results will remain reliable and reproducible as it changes over time. Undoubtedly, the software developers amongst us will have had the painful experience of modifying code and having everything break. This pain is exacerbated when the build system is a quagmire, there are no unit tests, and the only person who knows how to solve a problem is away on holiday.
The technical debt here is not just about the lost time of a few software engineers. As hinted immediately above, many critical systems in our information-driven societies are reliant on legacy software, so difficulties managing them can have a broad impact on citizens. The maintenance of legacy code is a necessary reality, so how can we do it well?
Perhaps the largest barrier to improving the maintainability of legacy code is finding the motivation, both individually and at group level, to actually do it. There needs to be a change in mindset. Research is driven by software, and software by research. Quickly iterating on research ideas requires code that is easy to extend, so that new ideas can be added without affecting the overall correctness of the system. That is, maintainable software, rather than being a waste of time, can lead to greater productivity.
Moreover, higher standards for research software are becoming the norm. Some domains, such as Computer Science through the ACM Artifact Evaluation scheme, now publish information about a paper's software alongside the paper itself, e.g. that the code is available, that it runs, and that the results in the paper are reproducible using it.
This raises software to the status of a publication output, and not having the software evaluated is seen as a mark against the paper. Ensuring code is maintainable makes evaluation easier, including packaging and distribution to reviewers and readers.
Poor performance is often a driver for updating software, and developers are often quick to reach for parallelism to fix it. Modern compilers and programming languages are extremely impressive at optimising code; however, much like developers, they too struggle to analyse unwieldy legacy code bases effectively. By simply restructuring and tidying the code, you can open up a lot of performance improvements “for free”.
Making the move
Now that you are properly motivated, here is what can be done to start the transition towards sustainable software development.
Capture the current output of the code and create tests that check that new outputs match the expected output. This is called regression testing and it will allow you to work on your code without fear of changing the desired outputs. These tests also serve as documentation for what your project currently does.
Document the current functionality of your code. This will create a reference for what the users of your code expect from it (and what would cause trouble if it went away). It will also give you a more holistic overview of your project. You don’t have to do it all at once. Start with a high-level overview and iterate to include the details. (PRO TIP: this is a great way to encourage expert users to join your project as developers.) Creating documentation can be a challenge, particularly figuring out what to include and how to organise the content. The Divio Documentation System offers some general guidelines for how to do this effectively.
Automate as much of the build process as possible. This lowers the barrier to entry and creates a record of the steps required to build your software. Oftentimes your future self will be the main beneficiary of these instructions. Most software does not run in isolation, so be sure to also keep track of the required system dependencies (compiler versions, libraries, etc). Containerisation technologies, such as Docker and Singularity, are a great way to manage this.
Survey scientific software that you consider to be of high quality. This will give you an idea of the practices used by other groups that you can adopt for your own project. Modern tools can automate a lot of the work for you (generating reference documentation, building and testing, etc.). The return on investment can be very large.
Consult a software quality checklist to fill in any of the gaps. Some options can be found here.
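To make the first step above more concrete, here is a minimal regression-testing sketch in Python. It is only an illustration under stated assumptions: `legacy_model` is a hypothetical stand-in for your own code, and the JSON baseline file and the helper names are choices made for this example rather than a prescribed layout. The idea is simply to store the current output once and then compare every later run against it.

```python
import json
import math
import tempfile
from pathlib import Path


def legacy_model(days, rate=0.2, initial=10.0):
    """Hypothetical stand-in for the legacy code: a toy growth model."""
    return [initial * math.exp(rate * d) for d in range(days)]


def capture_baseline(path, days):
    """Run the code once and store its current output as the reference."""
    path.write_text(json.dumps(legacy_model(days)))


def matches_baseline(path, days, rel_tol=1e-9):
    """Re-run the code and compare the new output with the stored reference."""
    expected = json.loads(path.read_text())
    actual = legacy_model(days)
    # Compare with a tolerance: floating-point results can differ slightly
    # between compilers, libraries, and platforms.
    return len(actual) == len(expected) and all(
        math.isclose(a, e, rel_tol=rel_tol) for a, e in zip(actual, expected)
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        baseline = Path(tmp) / "baseline.json"
        capture_baseline(baseline, days=5)  # done once, before refactoring
        print(matches_baseline(baseline, days=5))  # repeated after every change
```

In a real project the baseline file would be kept under version control next to the tests, and the comparison would run inside whatever test framework the project uses. The tolerance is a judgment call that depends on how sensitive your results are.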
Depending on how much freedom you have with the project, you may be able to make some of these changes as an individual developer. However, in the long term you will need committed stakeholders to make broad and lasting changes. Refer to the previous section on how to motivate and market to an outside audience. And if possible, contact your local Research Software Engineering (RSE) group. They can help and guide you through these transitions or even do them for you, which can also be a selling point.
Don’t slip back
Once we have made our code “maintainable” it’s important that we keep it that way. Continuing with good practice avoids regressions in the newly manageable code. Many of the techniques that we’ve used to improve the legacy code can be retained.
Write a clear style guide that reflects the style chosen when modernising the code, and make sure new developments follow it. Ideally, automate the code formatting, or at least the checking of formatting. For example, the Python tool black automatically formats code, freeing up valuable developer time and attention. A contribution guide makes it clear to new and established developers what the best practices are for this project. As with open-source software, you don’t have to start these documents from scratch. Instead, start from an open-licensed contributing guide and adapt it to your own project (for example, the Fatiando a Terra Contributing Guides are licensed CC-BY).
The tests that we added to validate the existing code should be extended to cover any new features that we add or modifications that we make. We should also keep our documentation up to date; there's no point retaining documentation for features that have been removed or omitting features that have been added. A good practice is to include a rule in your contributing guide that code contributions must have associated documentation updates.
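A rule like this is easier to follow when it is checked automatically. The sketch below is one possible way to flag contributions that change code without touching the documentation. The `src` and `docs` directory names and the function names are assumptions made for illustration, so adapt them to your own project layout.

```python
from pathlib import PurePosixPath

# Assumed (hypothetical) project layout: code under src/, documentation under docs/.
SOURCE_DIR = "src"
DOCS_DIR = "docs"


def top_level(path):
    """Return the first component of a relative path, e.g. 'src' for 'src/a.py'."""
    parts = PurePosixPath(path).parts
    return parts[0] if parts else ""


def missing_docs_update(changed_files):
    """Return True if source files changed but no documentation file did."""
    touched = {top_level(f) for f in changed_files}
    return SOURCE_DIR in touched and DOCS_DIR not in touched


if __name__ == "__main__":
    print(missing_docs_update(["src/model.py"]))                   # True
    print(missing_docs_update(["src/model.py", "docs/usage.md"]))  # False
```

The list of changed files could come from your version control system, for example from the output of `git diff --name-only`. A check like this is deliberately crude, since not every code change needs new documentation, so it probably works best as a reminder for reviewers rather than a hard gate.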
Besides the code, also document the motivation for the problem that you are trying to solve (the vision) and how you go about solving it (the methods). Keep that documentation up-to-date as you update your method and the project evolves beyond its initial scope. If the project documentation and vision exist only in a developer’s mind, the long-term relevance of the code isn’t assured. Developers move on, but if the project is meant to last, the documentation has to be there for new developers to proceed.
There is the old saying from the Boy Scouts: “always leave the campsite cleaner than you found it”. The same is true of software: “always leave the code cleaner than you found it”.
Improving the maintainability of legacy software is hard, we know! But there are things that you can do right now. Talk to your PIs about the importance of software maintainability: investing in it brings benefits in publication, reputation, efficiency and the happiness of those who use the software. In the meantime, start writing some documentation, however little: any contribution to the understanding of those who come after you is a good deed.
Finally, before investing heavily in improving a legacy code base, take a moment to consider whether it is worth maintaining. Also remember that modernising code isn’t the same thing as rewriting it in today’s trendy language.