Tips for sustainable software development on supercomputers

Posted by a.pawlik on 16 December 2013 - 9:21am

By Derek Groen, Software Sustainability Institute Fellow and Research Assistant, Centre for Computational Science, University College London.

This blog is already chock-full of useful tips for software development, and many of them apply to sustaining software on supercomputers as well. Here are a few more tips, aimed specifically at developing sustainable software for supercomputer environments.

1. Get to know your e-Infrastructure

Yes, that’s what the locals call it in Europe. The locals in the UK call it High-End Computing infrastructure and the locals in the US prefer Cyberinfrastructure. Familiarise yourself with at least your local supercomputer before you start. Check the technical specifications, read the user guide, arrange access and preferably examine how the machine performs with existing codes. And last but not least, try to find relevant libraries that are already installed there. That way you don’t have to waste time writing bespoke code or installing libraries of your own.
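As a small illustration of checking for installed libraries before writing or building anything yourself, here is a minimal Python sketch. It uses the standard library's ctypes.util.find_library to ask the system which shared libraries it can locate. The library names in the wish list are hypothetical examples, not recommendations, and what gets reported depends entirely on the machine.

```python
from ctypes.util import find_library


def find_installed(names):
    """Map each library name to what the system loader reports, or None."""
    return {name: find_library(name) for name in names}


if __name__ == "__main__":
    # Hypothetical wish list: replace with the libraries your code needs.
    wanted = ["m", "fftw3", "hdf5", "mpi"]
    for name, location in find_installed(wanted).items():
        status = f"found: {location}" if location else "not found"
        print(f"{name:8s} {status}")
```

On many machines a library only becomes visible after you load the matching environment module, so "not found" may simply mean the module is not loaded yet. The user guide remains the authoritative source.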

2. Pair simplicity with complexity

Supercomputers tend to have a more complicated architecture than your local machines, so installing software on them is going to be more difficult. Intuitively you may feel inclined to write more complicated code than usual, so that your program nicely wraps around all the intricacies of the machine. However, you may well find yourself sinking in a swamp of errors and incompatibilities, especially when the administrators have decided to update their operating system or, worse still, upgrade to a new supercomputer!

Complex software doesn’t work well with complex hardware, because complex hardware tends to make the software installed on it more difficult to maintain. If you’re not part of a huge company, and would still like to be able to use your software five years from now, it’s best to keep your code structured simply with a constrained set of dependencies.

3. Write bread-and-butter code: supercomputer compilers don’t always obey the Rules

Supercomputer compilers tend to be optimised for performance, and are frequently a little unstable on the fringes of your programming language. When you write supercomputing code, try to avoid implementing that ten-level dynamically typed object hierarchy in C++, or basing 50% of your communication routines on a feature that emerged in the MPI standard a year ago.

4. Brace for crashes, and embrace crashes

Similarly, supercomputers behave a little differently when things go wrong. When you test your program, you’ll save a lot of time and frustration by assuming that your program will almost surely crash. Prepare for crashes and hangs by capping your jobs tightly in wall-clock time, and enabling a reasonable level of verbosity. If crashes do occur, dig into the data. Don’t just look for the cause of the current crash, but inspect your output data for other inconsistencies and errors. This may save you a handful of additional crashes further down the line.
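The wall-clock cap itself is normally set in the job script you hand to the scheduler. As a complement, here is a minimal Python sketch of an in-program guard that stops cleanly before a self-imposed time budget runs out, logs each step at debug verbosity, and scans results for values that are not finite. All names here (WallClockBudget, run, find_suspect_values) are hypothetical, and the "solver step" is a stand-in for your own code.

```python
import logging
import math
import time

log = logging.getLogger("solver")


class WallClockBudget:
    """Tracks elapsed time against a self-imposed limit in seconds."""

    def __init__(self, limit_seconds, clock=time.monotonic):
        self.limit = limit_seconds
        self.clock = clock
        self.start = clock()

    def exhausted(self):
        return self.clock() - self.start >= self.limit


def run(steps, step_fn, budget):
    """Run up to `steps` iterations, stopping early once the budget is spent."""
    results = []
    for i in range(steps):
        if budget.exhausted():
            log.warning("budget spent after %d steps, stopping cleanly", i)
            break
        results.append(step_fn(i))
        log.debug("step %d -> %r", i, results[-1])
    return results


def find_suspect_values(values):
    """Return the indices of NaN or infinite entries in a sequence of floats."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    output = run(5, lambda i: 1.0 / (i + 1), WallClockBudget(60.0))
    print("suspect indices:", find_suspect_values(output))
```

Checking the whole output for suspect values, rather than only the step that crashed, is the "dig into the data" part: it can reveal the next problem before it costs you another job.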

5. Use the helpdesk, but avoid direct admin assistance

Helpdesks at supercomputing centres are responsive and useful. They are staffed by specialists who, in many cases, are expected to resolve issues within a very limited timeframe. So if you get stuck with an issue, you’ll help yourself greatly by contacting them. However, when it comes to installing new system-level software, such as your favourite flavour of MPI or a set of Java web service libraries, don’t expect the helpdesk team to be as enthusiastic.

Resource providers tend to regard installed software as a perpetual sink of energy and money, because every package they install is one they have to keep maintaining. You’ll find yourself making much faster progress if you either install it in your own home directory (when possible) or avoid using the software altogether.

6. Know when one supercomputer isn’t enough

Suppose you decided to develop that next-generation science solver by combining six existing programs: the first runs on a desktop and is interactive, the second consists of millions of tiny and independent tasks, the third is optimised to run on the latest NVIDIA/ATI graphics cards, the fourth is a straightforward parallel code, the fifth is a straightforward parallel code that uses privacy-sensitive data, and the sixth simply handles much more data than any of the others.

Now, what kind of computer should you pick to run this super-program? Clearly, there is no one-size-fits-all solution, which is exactly why we have so many different architectures out there to begin with. Fortunately, distributed computing is very much alive, and workflow tools such as Taverna and coupling tools such as MUSCLE make it easier to just link resources together to use them for a common purpose.