Using Make for reproducible scientific analyses

Posted by s.hettrick on 25 November 2013 - 9:30am

By Ben Morris, PhD student in Evolution, Ecology and Organismal Biology at University of North Carolina

Reproduced with kind permission from the original post on Ben's website.

The two tools that I use most frequently to manage my research are Git and Make. I've been using Git for years, but Make for only the past year or so. Ironically, I learned about it at a Software Carpentry workshop at which I was one of the instructors. Since then, it's become a key weapon in my computational arsenal. In this post, I'll try to sell you on using Make for your own research projects.

What is Make?

Make is an automated build system. In other words, it's a software tool for automating the process of building one or more files from one or more other files they depend on. A Makefile contains a set of recipes that specify how all of your desired final products - figures, processed data files, documents, etc. - should be produced as well as their dependencies, or what files are needed to create them.

Many C and C++ programmers know Make as a tool for compiling code. This can often be messy, and may require other tools to get it to work correctly all the time - tools that generate very ugly looking, difficult to understand Makefiles. As a result, a lot of people assume that Make itself is complicated before learning more about it (I was one of them). However, Make is conceptually very simple, and producing a set of files from a set of other files is a general problem that scientists run into every day.

Learning Make was simple and using it netted me three major benefits:

Automation reduces cognitive load. When I come back to a project that has a Makefile, I don't need to think about which program I need to run with which file, and which options to get my desired result. I simply type make or make [filename] and it handles everything for me. Previously I might've had a script like do_everything.py that would... well... do everything. Having a Makefile means I don't even need to look for such a script. I know in advance how I'm going to repeat the analysis. It also means that other people can check out my code, see that there's a Makefile, and effortlessly repeat my analysis, often without even requiring instructions.
Make automatically handles dependencies. When you have a complicated workflow, changing one file means that other files downstream may also need to be updated. But which ones? Without Make, I need to know which files depend on the file I changed, and which files depend on those files, etc. Make will figure this out for you, saving you the time and effort of thinking it through yourself.
Makefiles provide a roadmap to your project. The best way to start a project is to map out your workflow - what input files do I start with, how do they need to be processed, what will be the end results. The more specific you can be with this roadmap, the better. This is precisely what a Makefile gives you - but in addition to simply documenting your project, it's also something that can be run.

After all, the spirit of laziness is the spirit of innovation!

An example

Here's a brief example of a hypothetical scientific workflow, and how it can be improved with Make.

My scenario: I have ten FASTA files containing DNA sequences. I want to align them using muscle, convert the alignment to another format, then produce a figure that visualises the alignments.

What I have: ten FASTA files (homo_sapiens.fasta, pan_troglodytes.fasta, gorilla_gorilla.fasta,...), a Python script (visualize_alignments.py), and muscle. My Python script requires the alignment to be in Phylip format, but muscle outputs in FASTA format.

To manually generate these figures, I would use muscle to align each FASTA file individually, use the Biopython library to convert those alignments from FASTA to Phylip, then call my Python script to generate the figure. This would take three individual steps per FASTA file for a total of 30 steps - and as I add more DNA sequences, the number of steps will go up. Instead I'll show you how to do everything with just a single word, make.

First, let's learn the anatomy of a Make recipe. A Makefile is a collection of recipes, each of which specifies how to create a single target:

target: dependencies...
	command

The target is the file to be produced, the dependencies are the files it needs, and the command is what turns those dependencies into the target.

Back to our example. Working backwards, my end result is a figure visualising the alignment for a given species. The target is the figure. To generate this figure, I first need the alignment files and the script to generate the figure. Once I have those, the command is simply python visualize_alignments.py [alignment file]. So the first recipe would look like this:

homo_sapiens_alignment.png: homo_sapiens.phy visualize_alignments.py
	python visualize_alignments.py homo_sapiens.phy

Note the target to the left of the colon, dependencies on the right, command below. The command needs to be indented with a tab - spaces won't work. Also, it might be strange to think of my figure depending on the script that produces it. What I really mean is that if my visualisation script changes - if I change the figure label, or colours, or something - I need to re-build the figure. More on that later.

Since we're doing this ten (or potentially many) times, we can make this recipe even more general by changing a few parts using some ugly but useful Make syntax:

%_alignment.png: %.phy visualize_alignments.py
	python visualize_alignments.py $<

You can see two changes here. First, in the target and dependencies, I've replaced homo_sapiens with %, the wildcard character. I'm telling Make that to make any file that ends in _alignment.png, I need a file of the same name that ends in .phy. I've also added $< in the command: this is shorthand for the first dependency, or in this example the .phy file. Think of the < as an arrow, pointing to the leftmost dependency.
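To make the pattern matching concrete, here is a minimal Python sketch of what happens when Make is asked for homo_sapiens_alignment.png. It is only an illustration of the idea, not how Make is implemented; the function names match_pattern and expand_rule are hypothetical.

```python
def match_pattern(pattern, name):
    """Return the text that '%' stands for if name fits pattern, else None."""
    prefix, suffix = pattern.split("%", 1)
    fits = (
        name.startswith(prefix)
        and name.endswith(suffix)
        and len(name) > len(prefix) + len(suffix)  # '%' must match something
    )
    if not fits:
        return None
    return name[len(prefix):len(name) - len(suffix)]


def expand_rule(target, target_pattern, dependency_patterns):
    """Fill in the dependencies of a pattern rule for one concrete target."""
    stem = match_pattern(target_pattern, target)
    if stem is None:
        return None  # this rule cannot build the requested target
    # Dependencies without a '%' (such as the script) are kept unchanged.
    return [dep.replace("%", stem, 1) for dep in dependency_patterns]


dependencies = expand_rule(
    "homo_sapiens_alignment.png",
    "%_alignment.png",
    ["%.phy", "visualize_alignments.py"],
)
print(dependencies)     # ['homo_sapiens.phy', 'visualize_alignments.py']
print(dependencies[0])  # the file that $< refers to: homo_sapiens.phy
```

The part of the filename matched by % is substituted into every dependency that contains a %, which is why the figure for each species ends up depending on that species' own .phy file.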

We've written one recipe. Now I'll add two more, one to convert the alignment from FASTA (.aln) to Phylip (.phy) using Biopython, and another to produce the alignment from the input file using muscle:

%_alignment.png: %.phy visualize_alignments.py
	python visualize_alignments.py $<

%.phy: %.aln
	python -c "import Bio.AlignIO as aio; aio.convert('$<','fasta','$@','phylip')"

%.aln: %.fasta
	muscle -in $< -out $@

One more syntax element has been added here: $@ stands for the name of the target.

What I have now is sufficient to generate figures individually. I can type make homo_sapiens_alignment.png and the entire workflow will be run, producing the figure. Here's where automatic dependency resolution becomes useful. If I run make homo_sapiens_alignment.png the figure will be created. If I run the same command a second time, nothing will happen. This is because Make sees that none of the dependencies have changed, and therefore nothing needs to be done. If I make a change to one file - change some of the plotting code in visualize_alignments.py, or manually edit the alignment file, for example - Make will re-run only the necessary portions of the workflow, instead of the entire thing!
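Make decides whether something has changed by comparing file modification times. The check can be sketched in Python roughly as follows. This is a simplified illustration with a hypothetical function name, needs_rebuild; it looks at one target only, whereas Make also works through the dependencies of each dependency.

```python
import os
import tempfile


def needs_rebuild(target, dependencies):
    """True if target is missing or older than at least one dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)


# Demonstration with empty stand-in files and hand-set timestamps.
with tempfile.TemporaryDirectory() as workdir:
    phy = os.path.join(workdir, "homo_sapiens.phy")
    png = os.path.join(workdir, "homo_sapiens_alignment.png")

    open(phy, "w").close()
    print(needs_rebuild(png, [phy]))  # True: the figure does not exist yet

    open(png, "w").close()
    os.utime(phy, (1000, 1000))
    os.utime(png, (2000, 2000))
    print(needs_rebuild(png, [phy]))  # False: the figure is newer than its input

    os.utime(phy, (3000, 3000))
    print(needs_rebuild(png, [phy]))  # True: the input changed after the figure
```

Because visualize_alignments.py is listed as a dependency of every figure, editing the script makes it newer than all of the figures, so each one is rebuilt the next time it is requested.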

Our workflow is still ten steps, one per figure - better than thirty, but not yet ideal. We can make one final improvement that will allow us to simply type make and generate all ten figures automatically. We'll create a rule called all at the very top:

all: $(patsubst %.fasta,%_alignment.png,$(wildcard *.fasta))

This patsubst function produces a list of files. For every file called [something].fasta (my input files), the list will contain [something]_alignment.png (the corresponding figure). By default, running make without the name of a target will run the recipe at the very top (in this case, all), and that recipe is now dependent on all of the figures we want to produce, so they'll all be created automatically if they need to be.
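For readers more comfortable with Python, the list that the all rule depends on could be computed along these lines. This is a rough equivalent for illustration only; the function name figures_for is hypothetical, and the results are sorted here purely to make the output predictable.

```python
import glob
import os
import tempfile


def figures_for(directory):
    """Name the figure that corresponds to each .fasta file in directory."""
    # Plays the role of $(wildcard *.fasta)
    fasta_paths = sorted(glob.glob(os.path.join(directory, "*.fasta")))
    figures = []
    for path in fasta_paths:
        name = os.path.basename(path)
        stem = name[: -len(".fasta")]
        # Plays the role of $(patsubst %.fasta,%_alignment.png,...)
        figures.append(stem + "_alignment.png")
    return figures


# Demonstration with empty stand-in files.
with tempfile.TemporaryDirectory() as workdir:
    for name in ["homo_sapiens.fasta", "pan_troglodytes.fasta",
                 "gorilla_gorilla.fasta", "visualize_alignments.py"]:
        open(os.path.join(workdir, name), "w").close()
    print(figures_for(workdir))
    # ['gorilla_gorilla_alignment.png', 'homo_sapiens_alignment.png',
    #  'pan_troglodytes_alignment.png']
```

Files that do not end in .fasta, such as the plotting script, never make it into the list, so adding a new species only requires dropping its FASTA file into the directory.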

Conclusion

While there are some ugly bits - the unfamiliar and alien syntax, particularly - Make is essentially quite simple. If you're new to Make, I would suggest avoiding the advanced syntax for now. Simply use filenames for your recipes, and you can add in more generality later by taking advantage of Make syntax elements one at a time. Remember: a Makefile is just a list of steps that need to be taken to turn your input files into the output you want. Used in this way, it is a powerful tool that can improve reproducibility and save you time, effort, and headaches.