Lessons from a workshop on “Debugging Numerical Software”

Posted by s.aragon on 8 August 2018 - 10:57am

By Eike Mueller, University of Bath.

If you are anything like me, you write code which contains bugs (in my case: lots of bugs). You probably also use code or tools written by others, which might contain even more bugs (e.g. in compilers, scripts, external numerical libraries etc.). As a consequence, most of the time allocated for “programming” is actually spent on debugging. Yes, systematic testing and good software engineering practices will reduce the number of bugs in our code, but debugging can still be incredibly painful and time-consuming. Yet, strangely, it is rarely talked about as a process in its own right or taught systematically.

This simple observation motivated the organisation of the workshop Debugging Numerical Software, held in Bath in June 2018. The event was generously sponsored both by the Software Sustainability Institute and by the Institute of Mathematical Innovation in Bath. It was hosted jointly by James Grant, the central Research Software Engineer (RSE) in Bath, and me, a new Software Sustainability Institute Fellow. We must have hit a nerve and were slightly surprised by the popularity: all 40 places were filled and the feedback after the workshop was very positive. The interactive bug hunting session was particularly well received, even though it started as an experiment and we were not really sure how it would work out.

The aims of the workshop were manifold: we wanted to use it to explore common themes in debugging, share techniques and good practices, learn about debugging tools from industry experts and network with other developers. Throughout the event the idea was to engage everyone by making the sessions as interactive as possible, but we also invited several experienced external speakers to have a basis for discussions.

Structure

To get a better idea of how the workshop was laid out let’s go through the different parts:

Part I: Talks by invited speakers

All slides are available on the workshop website.

The five talks on the Monday afternoon were organised into half-hour sessions. After a whirlwind tour of common bugs (such as segfaults, MPI deadlocks, non-associative parallel reductions and memory leaks/dangling pointers), Phil Ridley [Arm] gave an overview of the DDT debugger, which is particularly suited for parallel codes. Following on from this, Chris Maynard [Met Office and University of Reading] described the issues with developing a massively parallel, cutting-edge weather forecast model. Besides a “separation of concerns” approach to keep the parallelisation system separate from the numerical science code, the Met Office puts a lot of emphasis on avoiding bugs in the first place by adopting sound software engineering practices, using tools such as Valgrind and routinely building the code with a range of different compilers. The latter is particularly important to address issues due to the sub-optimal support for the latest Fortran 2003 features and related compiler bugs.

Mike Ashworth [University of Manchester] introduced us to comparative debugging, which traces bugs by comparing the output to that of a known, correct reference implementation. This technique is particularly useful if it can be automated, as Mike demonstrated with the scripting capabilities of the TotalView debugger. As Lucian Anton [CRAY] convinced us, CRAY, as one of the leading supercomputer vendors, also develops a very useful suite of debugging tools, such as CRAY abnormal termination processing (ATP) and the Stack Trace Analysis Tool (STAT). Lucian also introduced the concept of a “performance bug”, i.e. an undesired slowdown of an otherwise correct code. Obviously, this can be an issue in time-critical applications such as operational weather forecast models. It can be addressed with CRAY’s perftool library.
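The core idea behind comparative debugging can be sketched in a few lines (a toy Python illustration with made-up function names, not TotalView's actual scripting interface): run the trusted and the suspect implementation on the same input and report where their outputs first diverge.

```python
import math

def reference_cumsum(xs):
    # Trusted, straightforward implementation used as the oracle.
    out, total = [], 0.0
    for x in xs:
        total += x
        out.append(total)
    return out

def suspect_cumsum(xs):
    # "Optimised" version under test, with a deliberate bug:
    # it skips the first element.
    out, total = [], 0.0
    for x in xs[1:]:
        total += x
        out.append(total)
    return out

def first_divergence(ref, sus, tol=1e-12):
    """Return the index of the first mismatch, or None if outputs agree."""
    for i, (r, s) in enumerate(zip(ref, sus)):
        if not math.isclose(r, s, abs_tol=tol):
            return i
    if len(ref) != len(sus):
        return min(len(ref), len(sus))
    return None

data = [1.0, 2.0, 3.0]
idx = first_divergence(reference_cumsum(data), suspect_cumsum(data))
print(idx)  # 0: the implementations disagree from the very first element
```

Localising the first point of divergence, rather than merely observing that the final answers differ, is what makes the technique so effective on long-running numerical codes.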

Finally, Mick Pont [Numerical Algorithms Group, NAG] shared his long experience of working with numerical codes. After dispelling some common misconceptions, he strongly advocated the use of a suitable debugger - something often dismissed by over-confident programmers who rely on their own analytical powers and simple print statements alone. Like Phil, Mick also highlighted issues with reproducibility which arise from the non-deterministic behaviour of codes containing vector instructions. This can be a problem especially where strict regulatory requirements apply, and with the continuing rise of SIMD processors, programmers will increasingly have to understand and deal with this kind of behaviour.

Part II: Live bug hunt

The highlight of the workshop was the interactive bug hunting session on the Tuesday morning. For this we had asked participants to submit documented bugs to a Git repository, which could be accessed by everyone. We had also reserved time on the University’s supercomputer to be able to debug parallel code. After each bug had been introduced by its submitter at the beginning of the session, we organised into small groups which tried to find the bug and ideally a solution. All groups collected notes which focused on the process rather than the outcome. At the end of the morning session teams reported back on their work.

The submitted bugs covered issues such as recursive MPI calls in Python, numerical instabilities in the evaluation of error functions, and problems with Fortran 2003 object-oriented features such as dynamic subroutine dispatch and unexpected memory behaviour (see the complete list at the end of this post).
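Instabilities in evaluating error functions are a classic case of catastrophic cancellation. A minimal Python sketch of the general problem (an illustration of the bug class, not the submitted Fortran code):

```python
import math

# Computing the upper tail of the error function as 1 - erf(x) cancels
# catastrophically for large x: erf(x) rounds to exactly 1.0 long before
# the tail is truly zero. The dedicated erfc(x) avoids the subtraction.
x = 10.0
print(1.0 - math.erf(x))  # 0.0 -- all tail information is lost
print(math.erfc(x))       # ~2.1e-45, the correct tail value
```

The general lesson: algebraically equivalent formulas are not numerically equivalent, and well-tested library routines often exist precisely to sidestep such cancellations.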

Some bugs were attacked with new techniques (here Mick Pont’s experience with the NAG compiler was invaluable) but remained unresolved: at the end of the session it was still not clear whether the odd behaviour was caused by a compiler bug, a suspicion which seemed to be supported by the behaviour of other vendors’ compilers. Another unusual line of attack was the use of CamFort (developed by Software Sustainability Institute Fellow Dominic Orchard) for tracking down memory issues in LAPACK calls. In fact, the problem turned out to be caused by confusion about the memory layout of passed temporaries.
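The kind of layout confusion behind that bug, row-major versus column-major (Fortran/LAPACK) storage, can be illustrated with a small self-contained Python sketch (a toy model of the mismatch, not the actual NAME or LAPACK code):

```python
def flatten_row_major(matrix):
    # C-style layout: rows stored contiguously.
    return [x for row in matrix for x in row]

def flatten_col_major(matrix):
    # Fortran-style layout: columns stored contiguously (what LAPACK expects).
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]

def unflatten_col_major(flat, rows, cols):
    # Interpret a flat buffer as a column-major rows x cols matrix.
    return [[flat[c * rows + r] for c in range(cols)] for r in range(rows)]

m = [[1, 2, 3],
     [4, 5, 6]]

# Correct round trip: column-major out, column-major back in.
assert unflatten_col_major(flatten_col_major(m), 2, 3) == m

# The bug: pass a row-major buffer to a routine expecting column-major.
garbled = unflatten_col_major(flatten_row_major(m), 2, 3)
print(garbled)  # [[1, 3, 5], [2, 4, 6]] -- silently scrambled, no crash
```

The nasty property of this bug class is that nothing crashes: the data is merely scrambled, so the error only surfaces downstream as wrong numbers.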

The hardest bugs usually proved to be related either to parallelisation, such as the unexpected behaviour of variables with the Fortran SAVE attribute in OpenMP implementations, or to modern Fortran 2003 object-oriented features, and the group presentations stimulated lively discussions.

Part III: Discussion session

The final part of the workshop consisted of small-group discussions based on topics partly suggested by the participants (see the list of final topics at the end). As might be expected, one of the suggestions for avoiding bugs in the first place was proper software engineering and testing. An interesting topic was “teaching debugging”. While it could be challenging to provide systematic tutorials, there might still be some basic ideas which are not obvious to the novice debugger, such as using a systematic and evidence-based approach. Bit-reproducibility proved to be a controversial topic, and there were alternative suggestions, such as checking the correctness of numerical codes by exploiting symmetry arguments instead. Certainly the non-associativity of floating point addition needs to be taken into account, in particular in parallel applications with global reductions.

One group attempted to develop a general flowchart for debugging, and parallel debugging techniques were explored, although it was agreed that there might be a big difference between debugging a code running on a handful of cores and on a supercomputer with hundreds of thousands of cores. Since this seems to be a common topic, a separate discussion was dedicated to dealing with codes that process arrays. Here compiler flags can help to identify out-of-bounds access, but for large problems it can be hard to check large arrays for correctness.
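As a hypothetical illustration of the symmetry idea (my own example, not one worked through at the workshop): the variance of a data set is invariant under shifting every value by a constant, so a variance routine can be tested against this symmetry. The textbook one-pass formula fails the check spectacularly:

```python
def naive_var(xs):
    # One-pass textbook formula E[x^2] - (E[x])^2: algebraically correct,
    # but numerically fragile due to catastrophic cancellation.
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

def two_pass_var(xs):
    # Numerically stable two-pass formula: subtract the mean first.
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

data = [1.0, 2.0, 3.0, 4.0]
shifted = [x + 1e9 for x in data]  # variance must be unchanged by the shift

print(two_pass_var(data), two_pass_var(shifted))  # 1.25 1.25
print(naive_var(data), naive_var(shifted))  # 1.25 and then a wildly wrong value
```

Such symmetry checks need no trusted reference implementation, which makes them attractive when no oracle is available.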

One of the aims of this session was also to generate material for speed blog posts and a white paper on debugging, which would be a useful resource for others. Unfortunately time was a bit short in the end and we found that there was much more to discuss.

Outcomes and lessons learned

Overall the most popular part of the workshop was the collaborative debugging session and it was particularly appreciated that all examples were real-world use cases. Maybe debugging can be made much more fun and successful if not done individually?

 

A few interesting observations from this and the discussions were:

 

  • The majority of submitted bugs occurred in Fortran code (perhaps simply because the majority of numerical software is still written in this language). Two bugs were related to the new object-oriented features in Fortran 2003.

  • Several of the issues could be traced to compiler bugs, especially ones involving Fortran 2003 features.

  • There was a lot of interest in debugging parallel codes.

  • Sometimes it was surprisingly hard to figure out whether a particular bug was caused by the compiler or intrinsic to the application code.

Some take-home messages

It might be hard to systematically teach debugging, but some common messages showed up throughout the workshop and might be useful for others:

  • Use different compilers and exploit the flags they provide, such as “-g” to add debugging information, different optimisation levels, compile-time warnings such as “-Wall”, and runtime checks such as array bounds checking (“-fcheck=bounds” in gfortran). Bear in mind that some compilers provide more informative diagnostics than others.

  • Explaining the bug to someone else (or yourself) is often the first step to finding/solving it.

  • Similarly, creating a minimal working example can be a significant step towards fixing a bug (and also makes it much easier to get help from others, such as library developers or compiler developers).

  • Learn to use a debugger such as gdb or Arm DDT (for parallel code). Valgrind can help to find memory leaks.

  • Be aware of non-deterministic behaviour due to non-associative floating point sums on modern vector processors. This is only going to get worse with widening of vector units.

  • Wherever possible, use existing, well tested libraries (but beware that those can also have bugs).

  • It is not uncommon for compilers to have bugs - this is particularly the case for less-used features such as the Fortran 2003 support for OOP.
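The non-associativity of floating-point addition mentioned above is easy to demonstrate in any language; in Python, for example:

```python
# Floating-point addition is not associative, so the grouping chosen by a
# parallel reduction (or a vectorised sum) can change the rounded result.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False

# A starker case: whether a small term survives depends on summation order.
print(sum([1e16, -1e16, 1.0]))  # 1.0
print(sum([1.0, 1e16, -1e16]))  # 0.0 -- the 1.0 is absorbed by 1e16
```

Any tolerance used when comparing results between runs, compilers or core counts therefore needs to account for this reordering effect.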

Where do we go from here?

Given the positive feedback from participants, an obvious question is how we can build on the event. Is it, for example, possible to develop more systematic training material or tutorials, perhaps as a Software Carpentry lesson? This might be challenging due to the wide variety of bugs. One comment from participants was that they would like to see demos of debugging tools, such as an in-depth tutorial on gdb or Valgrind applied to a particular problem. However, the most successful aspect seems to have been the interactive bug hunting session, which was an excellent opportunity to share experiences. In a future iteration we would definitely expand this aspect.

While this workshop was the first of its kind and run as an experiment, in the future we hope to iron out some of the organisational issues, such as having too many talks on the first day. Ideas and suggestions for improvements are:

  • have a demo-session for commonly used debuggers and include some training sessions

  • spend more time on the bug hunt session and allow participants to work on more than one bug

  • organise the invited talks around predefined topics

  • tailor bugs to discussion topics

  • provide a well-known working code before the workshop and ask people to introduce bugs, which are then tracked down at the workshop

Finally, we would like to thank everyone who helped to run this workshop: our sponsors, Nia and Juliet for administrative support from the Institute of Mathematical Innovation (Bath IMI), all PhD students who helped out, the invited speakers, bug submitters, and all participants who actively engaged throughout the event.

At the very end, a word of warning from one of our invited speakers:

“If you don’t find any bugs in your work, beware, something is really wrong!”

Tables

Table 1: Bugs covered in the interactive bug-hunting session

  1. Recursive MPI calls and pytest

  2. Bug/feature in Fortran90 array slicing

  3. Fortran dynamic dispatching

  4. Revealing a latent Fortran memory management bug in NAME

  5. Fortran 2003 OO solver API code

  6. Seg fault in lattice disorder Monte Carlo code

  7. Solving an equation involving error functions in a Fortran program

  8. Unexpected behaviour of Fortran OpenMP code

 

Table 2: Discussion topics

  1. Teaching Debugging: how can we do this?

  2. Strategies for avoiding bugs

  3. What is special about debugging numerical codes? Dealing with rounding errors/is bit-reproducibility desirable?

  4. Can we come up with a flowchart for finding bugs? / What are the best practices?

  5. Debugging parallel codes

  6. Debugging code with arrays