Heroes of software engineering - Miron Livny and HTCondor

Posted by s.hettrick on 30 September 2013 - 1:53pm

By Ian Cottam, IT Services Research Lead, The University of Manchester.

The next post in my series on heroes of software engineering focuses on Miron Livny and the men and women of his HTCondor software engineering team at the University of Wisconsin–Madison. This is my second post about a team rather than an individual, this time about the designers and engineers of a piece of software that has been around for something like a quarter of a century: HTCondor (known simply as Condor before a naming dispute in the USA last year). Before I tell you about HTCondor, let me have a small rant.

Software systems (and programming languages) are too often disparaged for being old. Come on people: we are talking about software sustainability! If it works and is great, don’t knock it. You don’t hear mathematicians going on about how set theory or calculus is getting long in the tooth.

Miron Livny, the lead for the HTCondor development, says [1]:

“In fact, the Condor kernel [...] has not changed at all since 1988.”

That is great software design and engineering: it has proved, demonstrably, to be both successful and sustainable.

A famous quote from Tony Hoare is about a programming language that was largely designed in the late 1950s (Algol 60 - my first programming language):

“Here is a language so far ahead of its time that it was not only an improvement on its predecessors but also on nearly all of its successors.”

I would like to thank Miron Livny and his software engineering team for producing a high-throughput computational workflow management system that is, in my opinion, not only an improvement on its predecessors, but also on its successors.

What is HTCondor?

HTCondor is “an open-source high-throughput computing software framework for coarse-grained distributed parallelization of computationally intensive tasks.” A bit of a mouthful! There is, of course, a dedicated website maintained by the development team where you can read the details.

What makes it a great engineering effort?

It is easy to install, and you should never underestimate how important that is for any system. It runs on almost anything (from laptop-class machines upwards). It can link a user’s PC with dedicated computational clusters and with cycles scavenged from machines mainly used for other purposes (at Manchester we use our teaching PC clusters overnight). And it does all of this transparently.

You can start with a Personal HTCondor, just running on the cores on your own PC. This is an excellent and cheap way to learn the system and debug your computational workflows (before launching thousands of jobs out into a bigger pool of machines).
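To give a flavour of what "learning the system" on a Personal HTCondor involves, here is a minimal sketch in Python that writes out a small submit description file. The keywords it emits (executable, arguments, output, error, log, queue) and the $(Process) macro are standard submit-file vocabulary; the helper name render_submit, the script name ./simulate.sh and the output file names are my own illustrative assumptions, not anything taken from the HTCondor team.

```python
def render_submit(executable, n_jobs, arguments="$(Process)"):
    """Return the text of a simple HTCondor submit description file.

    render_submit is a hypothetical helper for illustration only.
    $(Process) is expanded by HTCondor to 0, 1, ..., n_jobs - 1,
    so each queued job gets its own argument and output files.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    lines = [
        f"executable = {executable}",
        f"arguments  = {arguments}",
        "output     = job.$(Process).out",   # assumed file naming
        "error      = job.$(Process).err",   # assumed file naming
        "log        = jobs.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


# Start small on your own PC; raise n_jobs once the workflow is debugged.
submit_text = render_submit("./simulate.sh", 4)
print(submit_text)
```

You would save the text to a file and hand it to condor_submit. The point is that the same description works unchanged whether four jobs run on your own cores or thousands run in a large pool.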

Users’ jobs can easily flock (to use the HTCondor vernacular) transparently from one pool of machines to others. You can use this ability, for example, to go straight from a Personal HTCondor to a big, distributed pool. At Manchester, we also use it to link a non-HTCondor compute system to our main pool; a gateway machine (which can access all of a user’s data on the foreign system) is set up as a separate HTCondor pool that has no local resources of its own but has a flocking relationship with our main pool.
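The gateway arrangement is easier to picture with a toy model. The sketch below is emphatically not HTCondor's matchmaking algorithm; it only illustrates the idea that jobs try the local pool first and overflow to the pools they may flock to, in order. The names Pool and place_jobs and the slot counts are hypothetical. In a real installation the relationship is declared in the configuration (the FLOCK_TO and FLOCK_FROM settings) rather than in code.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A hypothetical pool with a fixed number of free execution slots."""
    name: str
    free_slots: int


def place_jobs(n_jobs, local, flock_to):
    """Toy placement: fill the local pool, then each flock target in order.

    Returns (placement, idle), where placement maps a pool name to the
    number of jobs sent there and idle is the number left waiting.
    """
    placement = {}
    remaining = n_jobs
    for pool in [local, *flock_to]:
        taken = min(remaining, pool.free_slots)
        if taken:
            placement[pool.name] = taken
        remaining -= taken
    return placement, remaining


# A gateway pool with no resources of its own: every job flocks onwards.
gateway = Pool("gateway", free_slots=0)
main = Pool("main", free_slots=100)
placement, idle = place_jobs(30, gateway, [main])
print(placement, idle)
```

Because the gateway offers no slots, all of its users' jobs end up in the main pool, which is exactly the effect we want.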

The basics of HTCondor are rock solid, and it is extended with optional software such as the excellent DAGMan: a workflow dependency system that uses the core HTCondor tools to schedule each job as and when the jobs it depends on have produced their data (somewhat like the UNIX make tool does for large compilations).
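The make-like idea behind this kind of dependency scheduling can be sketched in a few lines of Python using the standard library's graphlib module (Python 3.9 or later). This is an illustration of the principle only, not how DAGMan is implemented, and the job names and the function run_in_waves are made up for the example.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it must wait for (its parents).
# The job names are hypothetical; the shape is a simple "diamond".
workflow = {
    "prepare": set(),
    "simulate_a": {"prepare"},
    "simulate_b": {"prepare"},
    "collect": {"simulate_a", "simulate_b"},
}


def run_in_waves(graph):
    """Group jobs into waves; a job is released only when all of its
    parents have finished. Jobs in the same wave could run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        for job in ready:
            sorter.done(job)  # pretend the job ran and produced its data
    return waves


waves = run_in_waves(workflow)
for number, jobs in enumerate(waves, start=1):
    print(number, jobs)
```

Here the two simulate jobs become runnable together once prepare has finished, and collect waits for both of them, just as a link step in make waits for all of its object files.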

You don’t pay a penalty for parts of HTCondor that you don’t use or need (a good lesson from programming language design). Finally, although HTCondor is command-line based, it is easy to add your own GUI. See, for example, the DropAndCompute interface we have at Manchester.

The HTCondor pool at the University of Manchester has computed over 1,800 years of results for our researchers. Our users simply love this software system, which is the best way of saying thanks to Miron Livny and the men and women of his HTCondor software engineering team.

References

[1] Thain, Douglas; Tannenbaum, Todd; Livny, Miron (2005). "Distributed Computing in Practice: the Condor Experience". Concurrency and Computation: Practice and Experience 17 (2–4): 323–356. doi:10.1002/cpe.938.