The 20-line script that saves you hours of mind-numbing tedium

Posted by s.hettrick on 22 August 2014 - 2:00pm

By Simon Hettrick, Deputy Director.

Apart from a brief liaison during my undergraduate years, I am cursed with a complete lack of training in programming. When I face problems that are easily solved with some basic coding, I experience the beginner’s dilemma (not unlike the problem with automation): do I choose the frustration entailed in working out how to write a short program to do the work automatically, or the monotony of performing the same simple task a thousand times by hand?

Once you’ve taken a first step into coding, and you see how quickly and efficiently it can change your work, it’s difficult to stop. My epiphany occurred when someone renamed a hundred images for me using a single command line instruction. I was going to make those changes by hand, so this little trick saved me something like an hour of tedious work. In addition to these tricks, I find short programs provide the most compelling reason for researchers to learn a bit about coding. The heavyweight software packages are, of course, very important to research, but the 20-line script that saves you hours of mind-numbing tedium is the real hero of research.

There is no better example than the problem I am currently facing. I have about 10,000 files, each of which contains a job advert for a position in academia. The question is: how many of these jobs relate to software development? To find out, I can search the job adverts for terms that are likely to crop up in an advert for a software developer. I’ll be using various terms, but to take a simple example, let’s choose one: Software Developer.

When faced with this problem, I doubt that anyone born after the 20th Century would set about reading 10,000 job adverts to look for the term Software Developer. (And if this were your first thought, then you really should be looking at enrolling on a Software Carpentry course.)

A while back, I had a few simple problems to sort out and one of our developers recommended I write a program in bash (Bourne Again Shell). Bash is a funny language in which there are thousands of ways to solve the same problem, which makes it a little arcane, but it also means that you can usually find a solution that fits your style of thinking. It has a reputation for being a bit rough and ready, and I certainly think Python would be a cleaner language, but bash is what I know the most about, so that’s what I used.

A command called grep, which you can run from bash, will find a term in a file, so I can simply run grep on a file for a specific job advert and search for the term Software Developer. A result! I now need to run the same command for the other 9,999 job adverts. This is where the coding avalanche begins. You save a load of time doing one thing, but there’s plenty of room for improvement, so you add a new feature. If that feature’s going to work it’ll need a few other features to be added, so you add those too. And wouldn’t it be nice if it formatted the results properly… and suddenly it’s dark outside.
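For anyone who hasn’t met grep, the single-file search might look something like the sketch below. It is not the script I actually ran: the file name and the advert text are invented so that the example works on its own, and the choice to ignore upper and lower case is an assumption.

```shell
# Create a small stand-in advert so the example is self-contained.
# (The file name and its contents are made up for illustration.)
workdir=$(mktemp -d)
advert="$workdir/advert_0001.txt"
printf 'Research Associate\nWe are seeking a Software Developer to join the group.\n' > "$advert"

term="Software Developer"

# -F treats the term as plain text rather than a regular expression,
# -i ignores case, and -q prints nothing: only grep's exit status is used.
if grep -F -i -q -- "$term" "$advert"; then
    result="match"
else
    result="no match"
fi

echo "$advert: $result"
```

Dropping the -q option makes grep print the matching lines instead, which is handy when you want to check by eye that the term is being found where you expect.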

The latest avalanche began when I realised that I’m not going to type in the same bash command 10,000 times. Instead, I could set up a loop in which I run the grep command over each of the job adverts. I need some way of gathering the results, so I’ll output the name of each matching file into a results file, and whilst I’m there I’ll count the number of positive results. What the hell, I’ll also count the number of files I’ve looked at, and record the date on which I looked at them and – for posterity – the term that I was searching for. Now I have a load of result files, which is great, but I may as well write them to a CSV file so that I can open it in a spreadsheet – or something like R – to plot the results.
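The loop described above can be sketched in about twenty lines. Again, this is an illustration rather than my actual script: the directory layout, the file names (matches.txt and summary.csv), the CSV columns and the three tiny sample adverts are all assumptions made so that the example runs on its own.

```shell
# Hypothetical layout: one advert per .txt file in a single directory.
# Three made-up adverts stand in for the real 10,000.
workdir=$(mktemp -d)
adverts_dir="$workdir/adverts"
mkdir -p "$adverts_dir"
printf 'We need a Software Developer for our lab.\n' > "$adverts_dir/job_001.txt"
printf 'Lecturer in Medieval History.\n' > "$adverts_dir/job_002.txt"
printf 'Senior software developer (research computing).\n' > "$adverts_dir/job_003.txt"

term="Software Developer"
matches_file="$workdir/matches.txt"   # names of the adverts that contain the term
summary_csv="$workdir/summary.csv"    # one row per search, for a spreadsheet or R

files_checked=0
files_matched=0
: > "$matches_file"                   # start with an empty matches file

for advert in "$adverts_dir"/*.txt; do
    [ -e "$advert" ] || continue      # skip if the pattern matched no files
    files_checked=$((files_checked + 1))
    if grep -F -i -q -- "$term" "$advert"; then
        files_matched=$((files_matched + 1))
        basename "$advert" >> "$matches_file"
    fi
done

# Write the header only if the CSV is new or empty, then add this search as a row.
[ -s "$summary_csv" ] || echo "date,term,files_checked,files_matched" > "$summary_csv"
echo "$(date +%Y-%m-%d),\"$term\",$files_checked,$files_matched" >> "$summary_csv"

cat "$summary_csv"
```

Because each search appends one row to the CSV file, running the same loop with a different term builds up a table that can be opened directly in a spreadsheet or read into R.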

To anyone who develops code for a living, the program I wrote would be trivial in the extreme, but that doesn’t mean that it is not extremely useful: it saved me a lot of time. Like me, most researchers aren’t programmers and may well be put off writing code because they think the goal is always some polished software package with a fancy user interface. A lot of coding (for researchers at least) is simply about solving a tedious problem efficiently. So if you want to save some time in which to conduct your research, I suggest that you give coding a bash (or any other language...).

(And a proviso: I have to admit that it’s much easier throwing a program together when you’re surrounded by Research Software Engineers – a group of people whose level of expertise is matched only by their morbid curiosity about how the non-coding guy’s going to handle that program. To that end: thanks to Steve, John and Devasena.)
