The dark arts of particle physics software

Posted by a.hay on 4 October 2013 - 11:32am

By Christopher Tunnell, postdoctoral researcher at Nikhef (Nationaal instituut voor subatomaire fysica), The Netherlands.

This article is part of our series: a day in the software life, in which we will be asking researchers from all disciplines to discuss the tools that make their research possible.

Particle physics, the field which discovered the Higgs boson, unravelled the secrets of neutrinos, and seeks to understand dark matter, relies on lots of code. Accordingly, the field tends to attract those with a predilection for programming. The most endorsed language is C++, but many people use Python for I/O, scripting, and sometimes for data analysis. More modern approaches, such as GPU computing with CUDA, purely Python-based data analysis with numpy/scipy, or NoSQL databases, are found only in work at the fringes.

Most code is C++, a trend that began over ten years ago, and which will carry on for at least another decade given the timescales of these projects. It's ironic that we search for a new understanding of our universe by working with legacy code, a reality best summarised by my PhD supervisor: "If you do the software well, then what will the physicists do all day?"

The received wisdom is that computers are cheaper than programmers. Therefore, huge computing centres exist to satisfy the physicists' needs. It is also easier to get money for computers than programmers since it's a far easier sell to the funding agency. This is, for the most part, a good philosophy but it leaves open one big question: how do you demonstrate that your code works? It doesn't matter if your code manages to do its job in under an hour or even a minute if the results are wrong!

What I care about is making sure my results are correct and the analysis (which is to say, the code) is transparent, as you would expect from a physicist. And before I go further, let me admit that I've tried many things over the years but still think there is too much guesswork in our coding practices. Just as problematic is that correctness is a difficult thing to prove. I don't mean the computer science definition, but rather how best to ensure our code models the world in a way that agrees with the physics textbooks.

Lessons can be learned from industry: the MICE experiment, for example, uses unit tests, build tests, style guides, DVCS and so on (something the Software Sustainability Institute has been helping with). However, science makes predictions that you hope are wrong, and this makes writing a specification difficult. We might be building a detector that has never been built before, so the correct output is completely unknown (if it were known, we wouldn't need to do the experiment). This lack of specification makes it difficult to keep tests up to date, which hurts sustainability and means there is a tedious process to ensure code correctness. It isn't an easy problem to fix.
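One way to write tests when the "correct" output of a new detector is unknown is to check the pieces of physics that are known: textbook special cases and invariants. The following is only a minimal sketch of that idea, not code from MICE or any other experiment. The function names (invariant_mass, boost_z) and the numerical values are hypothetical, chosen purely for illustration.

```python
import math


def invariant_mass(p1, p2):
    """Invariant mass of two four-vectors (E, px, py, pz), natural units (c = 1)."""
    e = p1[0] + p2[0]
    px = p1[1] + p2[1]
    py = p1[2] + p2[2]
    pz = p1[3] + p2[3]
    m2 = e * e - (px * px + py * py + pz * pz)
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(m2, 0.0))


def boost_z(p, beta):
    """Lorentz boost of a four-vector along z with velocity beta (|beta| < 1)."""
    gamma = 1.0 / math.sqrt(1.0 - beta * beta)
    e, px, py, pz = p
    return (gamma * (e - beta * pz), px, py, gamma * (pz - beta * e))


def test_back_to_back_massless():
    # Textbook case: two massless particles of energy E, back to back, give m = 2E.
    p1 = (5.0, 0.0, 0.0, 5.0)
    p2 = (5.0, 0.0, 0.0, -5.0)
    assert math.isclose(invariant_mass(p1, p2), 10.0, rel_tol=1e-12)


def test_collinear_massless():
    # Textbook case: two collinear massless particles have zero invariant mass.
    p1 = (3.0, 0.0, 0.0, 3.0)
    p2 = (4.0, 0.0, 0.0, 4.0)
    assert invariant_mass(p1, p2) == 0.0


def test_boost_invariance():
    # Invariant: the mass must not depend on the reference frame.
    p1 = (5.0, 1.0, 2.0, 3.0)
    p2 = (4.0, -1.0, 0.5, 2.0)
    before = invariant_mass(p1, p2)
    for beta in (-0.9, -0.3, 0.0, 0.6):
        after = invariant_mass(boost_z(p1, beta), boost_z(p2, beta))
        assert math.isclose(before, after, rel_tol=1e-9)


if __name__ == "__main__":
    test_back_to_back_massless()
    test_collinear_massless()
    test_boost_invariance()
    print("all checks passed")
```

Tests written this way do not need a specification of the final result. They only assert what the textbooks already guarantee, so they should stay valid even as the analysis around them changes. A test runner such as pytest would pick up the test_ functions without the final block.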

Answering big questions requires big experiments, lots of people (between 50 and 3,000), and lots of time. Physicists tend to get involved in every aspect of the experiment, from writing electronics code (such as for FPGAs), to mathematical codes, to visualisation codes. The scope is enormous. I myself have written roughly a dozen packages within the past few years, which means maintaining them is an almost impossible task. Indeed, millions of lines of code are not uncommon for a particle physics experiment: the CMS experiment, which co-discovered the Higgs boson, has 8.5 million lines of code written in over 30 languages.

There are some well-designed open-source libraries that are universally used, such as Geant4 (for physics models) or ROOT (for file format and data analysis), that can be used by other communities. But these packages are showing their age. Most other code is only ever used internally. Different groups have differing views on open-source code, but for the most part, even if the code were open source, it is so specialised that nobody else could even run it. This isn't because people are bad at what they do. They're paid to do something other than develop code - namely, publishing physics papers - which takes precedence over coding.

Now we have come to what I and most other particle physicists do all day. We deal with legacy code. We try to get it to compile with all the legacy dependencies, we try to modify some unmaintained code to add some new feature for a new analysis, and we try to track down the bugs we introduced despite there being no low-level unit tests.

Sadly, it is not appreciated how much inefficiency and repetition result from not adopting better software engineering practices in large scientific collaborations. One common example is that if code is found to be unsustainable, then it can be collectively rewritten and tested to see if each contributor gets the same answer. This is a major bugbear for graduate students who work on larger experiments, since their labour is used to solve these problems. This complicates the issue because it produces even more code, and potentially crossed wires too.
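The "does each contributor get the same answer" check can at least be automated rather than done by hand. The sketch below is an assumption about how such a cross-check might look, not a description of any collaboration's actual procedure. The three kinetic_energy_* functions stand in for hypothetical independent rewrites of the same calculation, and the tolerance and input values are illustrative.

```python
import math


def kinetic_energy_a(p, m):
    """Contributor A: T = sqrt(p^2 + m^2) - m (natural units, c = 1)."""
    return math.sqrt(p * p + m * m) - m


def kinetic_energy_b(p, m):
    """Contributor B: algebraically equal form, T = p^2 / (sqrt(p^2 + m^2) + m)."""
    return p * p / (math.sqrt(p * p + m * m) + m)


def kinetic_energy_c(p, m):
    """Contributor C: non-relativistic approximation, T = p^2 / (2 m)."""
    return p * p / (2.0 * m)


def find_disagreements(reference, candidate, inputs, rel_tol=1e-9):
    """Return the inputs on which candidate differs from reference beyond rel_tol."""
    return [
        args
        for args in inputs
        if not math.isclose(reference(*args), candidate(*args), rel_tol=rel_tol)
    ]


PROTON_MASS = 0.938  # GeV, rounded; illustrative value
MOMENTA = [(0.5, PROTON_MASS), (1.0, PROTON_MASS), (10.0, PROTON_MASS)]  # (p, m) in GeV

if __name__ == "__main__":
    print("A vs B disagree on:", find_disagreements(kinetic_energy_a, kinetic_energy_b, MOMENTA))
    print("A vs C disagree on:", find_disagreements(kinetic_energy_a, kinetic_energy_c, MOMENTA))
```

Here A and B agree on every input, while the non-relativistic version C is flagged on all of them. A shared set of inputs and an agreed tolerance make the comparison repeatable, so it need not be redone by hand each time someone rewrites the code.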

In all fairness, the current programming model does work for the most part. Physicists have used these methods to come up with the most precise and well-tested theories that mankind has ever produced, such as QED and g-2. However, discoveries would be made much more quickly with a better software development process, which is the direction in which I want to continue my non-physics research. (This will be taking place in the Netherlands since I am now also associated with NIKHEF.)

In an ideal world, how would I like to use software? When I start new projects, I use distributed version control systems (e.g., git or mercurial) on either GitHub, Bitbucket or Launchpad. What's more, testing and sustainability are significantly easier in Python than in C++. I also use Python as a glue language for high-speed applications, and I'm always amazed by how much Read the Docs makes me want to write documentation, while build systems like Jenkins make me want to write unit tests, though Travis CI also looks promising in this regard. Both packages are a pleasure to work with, even for non-experts.

I am going to do something unorthodox in future articles: I strongly believe that we learn best from our own failings, so I am going to discuss code bases I've worked on and the problems I've discovered when the support requests were received. In summary, rather than just trying to discover new particles, we need to understand the principles of research and software development if we are to search for new particles in a more efficient manner.
