How to write code like a scientist

Posted by s.aragon on 12 September 2017 - 9:37am

By Alice Harpole, University of Southampton

Coding is often seen as a tool to do science, rather than an intrinsic part of the scientific process. This often results in scientific code that is written in a rather unscientific way. In my experience as a PhD student, I've regularly read papers describing exciting new codes, only to find that there are a number of issues preventing me from looking at or using the code itself. The code is often not open source, which means I can’t download or use it. Code commonly has next to no documentation, so even if I can download it, it's very difficult to work out how it runs. Approaches to testing can be questionable, with an overreliance on replicating "standard" results and no unit tests to demonstrate that the individual parts of the code work as they should. This is not good science and goes against many of the principles of the scientific method followed in experimental science.

In the following sections, I shall look at how we might go about writing code in a more scientific way. This material is based on the talk I recently gave at EuroPython 2017 on Sustainable Scientific Software Development.

Scientific code

Scientific code has several properties which set it apart from code written for standard, non-scientific applications. It is written to investigate complex, unknown phenomena: we often don't know exactly what the output will look like when we run it. It is often developed over long periods of time (sometimes even over decades) with a great number of contributors. It is built by scientists, not software engineers. As such, many of the people writing it are self-taught, and whilst they may have a firm grasp of the science side of the code, they may not have such an extensive understanding of the intricacies of the machines they are using to run their experiments.

In experimental science, the results of an experiment are not trusted unless the experiment has been executed following the principles of the scientific method. This involves testing the apparatus so that any sources of error can be understood and quantified, and fully documenting the method used to conduct the experiment so that its results can be reproduced. The scientists conducting the experiment must take steps to demonstrate that their results are accurate, reproducible and reliable.

In computational science, we are essentially doing experiments and the computer is our apparatus. We should also follow the scientific method to demonstrate that our results can be trusted; conversely, results from codes without proper testing or documentation should not be trusted.

Development workflow

So we've decided that invoking the principles of the scientific method throughout our code's development is probably a good idea. How might we go about this? Fortunately, there are lots of tools out there that will help us do this and allow us to automate things so that, once set up, there is minimal extra effort required on our part.

Version control

For an experiment to be reproducible, it must be well documented. To do this, it helps if we keep a record of our progress. In standard experiments, this is often done using a lab notebook. In computational science, we can use version control, with commits documenting the development of the project. Version control also has the added benefit of aiding collaboration, as it helps prevent contributors from overwriting each other's changes. Branches allow you to hack without fear of breaking everything irreversibly and, if others are using your code, let you keep developing it whilst they work away on a clean master branch.
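One way to tie the "lab notebook" of commits to the results themselves is to stamp every output file with the commit that produced it. The following is only a minimal sketch of that idea. It assumes the code is run from inside a Git repository with the git command-line tool installed, and the function names current_commit and provenance_header are made up for illustration.

```python
import subprocess


def current_commit():
    """Return the hash of the checked-out Git commit, or None if unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or we are not inside a repository
        return None
    return result.stdout.strip()


def provenance_header(commit):
    """Format a comment line to write at the top of an output file."""
    return "# produced by commit {}".format(commit if commit else "unknown")


if __name__ == "__main__":
    print(provenance_header(current_commit()))
```

Note that the hash only identifies committed code: if the working directory contains uncommitted changes, the stamp will not capture them, so it is worth committing before any run whose results you intend to keep.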

Testing

As in experimental science, results should not be trusted unless the apparatus and method used to produce them (i.e., the software) have been demonstrated to work, and any limitations (e.g., numerical errors incurred from the choice of algorithm) are understood and quantified. Unfortunately, scientific codes can be hard to test as they are often very complex (making it hard to write tests) and investigate unknown behaviour (making it hard to know what to test against).
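Even when the full simulation has no known answer, its individual parts usually do. As a rough illustration, here is a sketch of two unit tests for a small numerical routine. The trapezoid function is a made-up example rather than part of any particular code. The first test checks a case with a known exact answer, and the second quantifies the numerical error by checking that it shrinks at the rate the algorithm is expected to deliver.

```python
import math


def trapezoid(f, a, b, n):
    """Approximate the integral of f over [a, b] using n equal-width panels."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return h * total


def test_exact_for_linear_function():
    # The trapezoid rule is exact for straight lines, up to rounding error.
    # The integral of 2x + 1 from 0 to 3 is 12.
    result = trapezoid(lambda x: 2 * x + 1, 0.0, 3.0, 7)
    assert math.isclose(result, 12.0, rel_tol=1e-12)


def test_second_order_convergence():
    # The rule is second-order accurate, so doubling the number of panels
    # should reduce the error by a factor of about 4.
    exact = 2.0  # integral of sin(x) from 0 to pi
    err_coarse = abs(trapezoid(math.sin, 0.0, math.pi, 50) - exact)
    err_fine = abs(trapezoid(math.sin, 0.0, math.pi, 100) - exact)
    assert 3.9 < err_coarse / err_fine < 4.1
```

Functions named test_* like these can be collected and run by a test runner such as pytest. The convergence test is the more "scientific" of the two: it does not just check one number, it checks that the error behaves the way the theory says it should.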

Once we've written a set of tests, continuous integration (using tools like Travis CI) can then be used to regularly run these tests for us automatically after every new version we create. This allows bugs to be spotted as they occur, so they can then be fixed before they become too entrenched in the code, or reappear in the future. Code coverage tools are useful to ensure that tests cover as much of the code as possible: if tests only cover 20% of the code, that is no guarantee that the other 80% will work.

Documentation

The goal of documentation is that someone else should be able to set up, use and understand your code without any extra help from you. For this to be possible, it's necessary to include comprehensive installation instructions that (ideally) cover all standard systems (i.e. not just for your institution's particular supercomputer). The code itself should be documented, with sensible function and variable names and liberal use of comments, so that if someone else wants to hack the code themselves they can have some idea of where to start. It's a good idea to have a user guide detailing how the code works as well as examples that demonstrate how to use it.
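As a small illustration of what documenting the code itself might look like, here is a sketch of a function with a descriptive name, units in the argument names and a docstring. The function kinetic_energy is a made-up example, and the NumPy-style docstring layout is just one common convention, not the only option.

```python
def kinetic_energy(mass_kg, speed_m_per_s):
    """Return the kinetic energy of a body in joules.

    Parameters
    ----------
    mass_kg : float
        Mass of the body in kilograms. Must be non-negative.
    speed_m_per_s : float
        Speed of the body in metres per second.

    Returns
    -------
    float
        Kinetic energy, 0.5 * m * v**2, in joules.

    Examples
    --------
    >>> kinetic_energy(2.0, 3.0)
    9.0
    """
    if mass_kg < 0:
        raise ValueError("mass_kg must be non-negative")
    return 0.5 * mass_kg * speed_m_per_s ** 2


if __name__ == "__main__":
    # Run the example in the docstring as a test.
    import doctest
    doctest.testmod()
```

The example in the docstring doubles as a test: Python's doctest module runs it and checks the output, so the documentation cannot silently drift out of date with the code.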

Writing documentation can often seem incredibly time consuming and therefore a waste of time; fortunately, there are tools that can help automate the process, such as Sphinx and Doxygen. For Python-based codes, Jupyter notebooks are a great way to present example cases; nbconvert can then generate static versions of these that can be included in the main documentation. Sites such as Read the Docs can host Sphinx-generated documentation, automatically updating it every time changes are pushed to the project's remote repository.

Distribution

As mentioned above, for code to be sustainable, we need to allow others to find it. The first step is making it open source. There are many situations where this will not be possible for perfectly good reasons: the code may deal with sensitive data, or it may be intellectual property. However, in many cases such restrictions do not exist and the benefits of making the code open source and freely available to other researchers far outweigh any possible disadvantages. Unfortunately, as the internet can be a pretty transient place, we can’t guarantee that if we post it online it will always remain in the same place. We shouldn't assume that GitHub, GitLab or Bitbucket will still exist in 20 years' time or that they will preserve the current addresses they assign to repositories indefinitely. Fortunately, we can get around this issue by assigning the project a DOI (a digital object identifier), which will act as a permanent address for it.

For results to be reproducible, users must be able to reproduce the runtime environment used to produce them. This is especially important if your code uses lots of external libraries, as you need to ensure that other users use the same versions of these libraries. This can be done by packaging the code (e.g., in a Docker container) or by providing the user with instructions to set up a conda environment.
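A lightweight complement to a container or conda environment is to record, alongside each set of results, which versions were actually used to produce them. The sketch below uses Python's standard importlib.metadata module (available from Python 3.8). The function name describe_environment and the package names passed to it are placeholders for illustration.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Return a dict describing the interpreter, platform and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Placeholder package names: list whatever your code depends on.
    print(json.dumps(describe_environment(["numpy", "scipy"]), indent=2))
```

Saving this output next to the results does not recreate the environment by itself, but it tells a later user exactly what they need to recreate.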

To help others use your code, you should try to make the installation process as painless as possible. This can be achieved by providing makefiles and by limiting your code's reliance on external libraries and material (particularly if these are not open source).

Conclusions

We need to make sure that scientific code is written in a sustainable, scientific way. We can do this by applying the scientific method to all stages of software development, and only trusting results from codes that are reproducible, tested and documented. We must also future-proof our software so that it is still possible to find and run it in the future: not doing so harms scientific progress.