Evolution: from farm to tree

Posted by s.hettrick on 28 August 2012 - 3:18pm

By Elisa Loza, Agent and scientific statistician, Rothamsted Research.

Evolutionary biology is the branch of biology that studies the evolutionary processes that give rise to the diversity of life on Earth. Understanding these processes matters in many areas of our lives: developing crop varieties that are resistant to pests (and, therefore, food security); finding cures for human diseases (for example, there is some similarity between the way cancer develops in one person and the way genes evolve over billions of years); and measuring the impacts of pollution and climate change by studying how microbial diversity in marine sediments or soil samples varies through time.

Modern evolutionary biology uses the information stored in a component shared by all living organisms on Earth: DNA. By comparing the DNA of a set of organisms we can understand the relationships that hold amongst them and, ultimately, organise our biological knowledge into a model of ancestry and descent. The model of how all living organisms have evolved from ancestral forms into their present state is usually visualised as a tree of life: the root represents the common ancestor of all life (perhaps a primitive form of cell), the branches indicate the evolutionary paths that ancestors took when diversifying to become new species, and the tips correspond to the organisms that are alive today [1,2]. In practice, we usually concentrate on a relevant portion of this tree of life (e.g. the sub-tree of the Fusarium genus: a large group of fungi that can cause disease in wheat) rather than attempting to study it all, although several research groups around the globe do work on the full tree.
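To make the tree picture concrete, here is a minimal Python sketch of a rooted tree with a root, internal ancestors and tips. The class, its methods and the organism names ("species A" and so on) are all hypothetical placeholders invented for illustration; they do not describe any real group of organisms or any particular phylogenetics library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node in a rooted tree: an ancestor if it has children, a tip if it has none."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def tips(self) -> List[str]:
        """Return the names of all tips (present-day organisms) below this node."""
        if not self.children:
            return [self.name]
        names: List[str] = []
        for child in self.children:
            names.extend(child.tips())
        return names

    def find(self, name: str) -> Optional["Node"]:
        """Return the node with the given name, or None if it is not in this sub-tree."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None


# Hypothetical toy tree: the root is the common ancestor, the leaves are the tips.
tree = Node("common ancestor", [
    Node("ancestor X", [Node("species A"), Node("species B")]),
    Node("species C"),
])

all_tips = tree.tips()
# Concentrating on a portion of the tree means working with one sub-tree only.
subtree_tips = tree.find("ancestor X").tips()

print(all_tips)
print(subtree_tips)
```

Picking out `ancestor X` and listing only its tips mirrors the idea of studying one sub-tree instead of the whole tree of life.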

Determining the tree that best represents the evolutionary relationships in a group of organisms requires sophisticated mathematical, statistical and computational tools. The demand for software and computational resources is increasing as data sets become more complex and larger in scale. At Rothamsted Research, I am currently conducting evolutionary analyses of DNA data extracted from agricultural soil samples. The observed data comprise tens of millions of short DNA sequences that originate from the genomes of the microorganisms that populate the soil. The vast majority of these microorganisms have never been isolated or studied in the laboratory, so we are faced with an enormous amount of incomplete, heterogeneous and uncharted data. The objective is to associate the soil DNA fragments with a set of reference organisms in order to say something about the organism from which each fragment originated and, more broadly, about the structure of the population in the sample. For instance, is the sample clearly dominated by one set of species? Or is there any evidence of a detrimental impact from sustained agricultural practices on the structure and function of soil microbial communities? The associations between the unidentified DNA fragments and the reference organisms are visualised in the form of an evolutionary (or phylogenetic) tree, so we require software tools that are able to compute large phylogenetic trees efficiently.
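Once each fragment has been associated with a reference group, a question such as "is the sample clearly dominated by one set of species?" becomes a simple counting exercise. The sketch below is a toy illustration only: the group labels, the made-up assignments and the 50% dominance threshold are all assumptions chosen for the example, not values or methods from the analyses described above.

```python
from collections import Counter
from typing import Dict, List, Optional


def relative_abundance(assignments: List[Optional[str]]) -> Dict[str, float]:
    """Share of assigned fragments per reference group; unassigned (None) are ignored."""
    counts = Counter(a for a in assignments if a is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: n / total for group, n in counts.items()}


def dominant_group(abundance: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the group whose share exceeds the (assumed) threshold, or None."""
    if not abundance:
        return None
    group, share = max(abundance.items(), key=lambda kv: kv[1])
    return group if share > threshold else None


# Made-up assignments for 12 fragments; None marks a fragment that could not be placed.
assignments = ["group A"] * 6 + ["group B"] * 3 + ["group C"] + [None] * 2

abundance = relative_abundance(assignments)
dominant = dominant_group(abundance)

print(abundance)
print(dominant)
```

In a real analysis the assignments would come from a phylogenetic placement tool, and the uncertainty of each placement would need to be taken into account before drawing conclusions about community structure.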

RAxML (Randomized Axelerated Maximum Likelihood) is an open-source software package for the rapid reconstruction of large phylogenetic trees. It was first developed by Alexandros Stamatakis (Heidelberg Institute for Theoretical Studies) and colleagues in 2002, and is currently at version 7.3.4 in its standard form [3,4] and at version 1.0.9 in its light version [5]. The evolutionary placement algorithm, implemented in the latest release of the standard RAxML, is able to associate 100,000 DNA fragments with a set of 4,874 reference organisms in less than 1.5 hours (when run in parallel on a multicore system with 32 cores and 64 GB of main memory [4]). This is an impressive performance! I find the Linux version of RAxML relatively simple to use and, because RAxML has been around for a while and is widely popular in the biological systematics community, there are several useful tutorials and pieces of documentation available [6,7,8,9]. RAxML is available as C source code, a Windows executable and a Mac OS X executable from the Exelixis Lab software web page [10].
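For readers who want to try the evolutionary placement algorithm, the following Python sketch assembles a command line for a multithreaded RAxML run. It is a sketch under stated assumptions: the flags reflect my understanding of the RAxML 7.x manual (`-f v` for placement, `-s` alignment, `-t` reference tree, `-m` model, `-n` run name, `-T` threads), the executable name depends on how RAxML was compiled, and the file names are hypothetical. Please check `raxmlHPC -h` for your installed version before relying on it.

```python
import shlex
from typing import List


def build_epa_command(
    alignment: str,
    reference_tree: str,
    run_name: str,
    threads: int = 32,
    model: str = "GTRGAMMA",
    binary: str = "raxmlHPC-PTHREADS",  # assumption: name of the Pthreads build
) -> List[str]:
    """Assemble (but do not run) a RAxML placement command as an argument list."""
    return [
        binary,
        "-f", "v",             # assumed: evolutionary placement of query sequences
        "-s", alignment,       # alignment with reference and query sequences
        "-t", reference_tree,  # tree containing the reference organisms only
        "-m", model,           # substitution model
        "-n", run_name,        # suffix used for the output files
        "-T", str(threads),    # number of threads
    ]


# Hypothetical file names, for illustration only.
cmd = build_epa_command("refs_plus_fragments.phy", "reference.tre", "soil_epa")
command_line = " ".join(shlex.quote(part) for part in cmd)
print(command_line)
```

The default of 32 threads simply mirrors the 32-core machine in the benchmark quoted above. The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.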