Bug hunting: Seg fault in lattice disorder Monte Carlo code (written in C)

Posted by s.aragon on 20 September 2018 - 8:37am

Image courtesy of Susanne Wallace

By Susanne Wallace, Anna Brown, Lewis Irvine, William Saunders and Paul Secular.

This is a speed blog written up as part of the Bath Numerical Debugging Workshop activities.

As part of the Bath debugging workshop we attempted to find the cause of a segfault in a lattice disorder Monte Carlo code written in C. We were working without a known solution, as recent unrelated additions to the code had fixed the bug without revealing why. We found the cause of the segfault by using memory checking and debugging tools to narrow down the location of the bug, then print statements to finish.

What we found was a relic from an older version of the code: a hard-coded file path that was no longer valid, which resulted in the program trying to write to a file that could not be opened. While the segfault occurred at the point of trying to write data to this file, it’s worth noting that the source of the problem was elsewhere, at the file path definition. This is a common complication when debugging.

Process

As the symptom of the bug was a segfault, we recognised that tools suited to finding memory-related bugs were a good starting point. After re-compiling with the ‘-g’ flag for debugging, we compared outputs from the following tools to identify the origin of the segfault:

valgrind
gdb (with back tracing, bt)

Our situation was further complicated because the segfault occurred every time on a Linux cluster when the code was compiled with icc, but not on a local Linux machine when it was compiled with gcc. To force the segfault to occur consistently and at the point of failure on the local machine, we also ran gdb with Electric Fence. This tool transparently replaces malloc and triggers a segfault immediately on a bad memory access (see https://elinux.org/Electric_Fence).

On both systems, GDB then produced a backtrace ending in the C standard library, corresponding to a segfaulting fprintf call. By moving back up the backtrace, we hypothesised that the segfault occurred because an fprintf call was attempting to write to a file pointer that was NULL, which we confirmed by printing the variable. GDB provided no insight as to why the pointer was NULL. In the same way, valgrind gave us the line number of the segfault but not the root cause. Hence these tools did not find the incorrect hard-coded filename in ‘eris-main.c’ itself, only the place where the subsequent calls to fprintf failed.

To get to the ultimate cause of the bug, we had to add some speculative print statements. In particular, we wanted to see the name of the file that failed to open. Invalid file names and/or paths are a very common cause of files failing to open. In fact, it was an invalid path (specifying a directory that did not exist) included in the filename that was responsible for the problem.
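
A speculative print of this kind might look something like the minimal sketch below. The helper name open_and_report is hypothetical and not taken from the original code; the idea is simply to print the exact string passed to fopen, together with the reason reported through errno (which fopen sets on POSIX systems such as Linux).

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: try to open `path` for writing and, on failure,
 * print the exact path and the reason reported via errno.
 * Returns the FILE pointer, or NULL if the open failed. */
FILE *open_and_report(const char *path)
{
    errno = 0;
    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        /* Printing the path itself is what exposes a stale hard-coded name. */
        printf("DEBUG: fopen(\"%s\") failed: %s\n", path, strerror(errno));
        fflush(stdout);
    }
    return fp;
}
```

For a path whose directory does not exist, the message would typically end in "No such file or directory", which points straight at the path rather than at the later fprintf.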

It is useful to note that our print statements didn’t always show up as expected right before the segfault, because the program crashed before the print buffer could be flushed. We dealt with this by flushing the buffer ourselves immediately after each print statement, using fflush(stdout);
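
One way to make sure the flush is never forgotten is to wrap the two calls together. This is only a sketch, and the wrapper name debug_print is hypothetical rather than something from the original code:

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical wrapper: printf-style debug output that is flushed
 * immediately, so the message survives a crash on the next line.
 * Returns the number of characters written, or a negative value on error. */
int debug_print(const char *fmt, ...)
{
    va_list args;
    va_start(args, fmt);
    int n = vprintf(fmt, args);
    va_end(args);
    fflush(stdout);   /* push buffered output out now, not at exit */
    return n;
}
```

A call such as debug_print("opening %s\n", filename); placed just before the suspect line should then appear even if the next statement crashes. Printing to stderr, which is typically unbuffered, is a common alternative.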

Debugging take-home messages

Use fflush(stdout); after print statements in order to flush the buffer and ensure the output is printed to the screen.
Invalid filenames/paths are a common cause of errors.
If possible, avoid adding additional unknowns. In this case, the segfault could have been reliably debugged on the cluster alone.

Tips for bug prevention

Do not hardcode paths in the middle of the code.
Write validation code to check that files have been opened successfully before attempting to write to them.
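
Both tips can be combined in a short sketch. Here the output directory is supplied by the caller (for example, from a command-line argument) rather than hard-coded, and the result of fopen is checked before any write. The names write_result, out_dir and energies.dat are hypothetical and do not come from the original code.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical example: write one value to <out_dir>/energies.dat.
 * Returns 0 on success, -1 if the path is too long or the file
 * cannot be opened or written. */
int write_result(const char *out_dir, double energy)
{
    char path[4096];
    int len = snprintf(path, sizeof path, "%s/energies.dat", out_dir);
    if (len < 0 || (size_t)len >= sizeof path) {
        fprintf(stderr, "output path too long\n");
        return -1;
    }

    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        /* Fail here, with the path in the message, instead of
         * segfaulting later inside fprintf. */
        fprintf(stderr, "cannot open '%s': %s\n", path, strerror(errno));
        return -1;
    }

    int ok = fprintf(fp, "%f\n", energy) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

With a check like this, a stale directory produces an error message naming the offending path at the point where the file is opened, rather than a segfault some distance away.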