Software chasing sequence: a report from ISMB

Posted by s.hettrick on 25 October 2011 - 3:33pm

By Colin Semple, a Software Sustainability Institute Agent.

It has become a cliché to announce that biology is undergoing a revolution, driven by the rapid advance of new technologies for high-throughput sequencing (HTS) of DNA and RNA. A cursory glance at the dramatic increase in sequencing capacity (and the corresponding fall in costs) over the past couple of years reveals rates of improvement that outpace Moore's Law, the famous doubling of processing power every two years seen during the evolution of computer hardware. This is prompting biologists of almost every flavour to think bigger than ever before.

There has been, and continues to be, an explosion in very large and ambitious experiments involving the generation of large and often entirely novel datasets. In short, it is a very good time to be a biologist. It is also a very busy and rather scary time to be a bioinformaticist, as we attempt not to drown in the deluge of sequence data demanding computational processing and analysis. The related trials and tribulations of those of us in the firing line are often ignored, but certainly not at ISMB 2011, the largest yearly conference of the bioinformatics community.

Of course, like any large conference, ISMB 2011 covered a wide variety of areas, from population genomics to text mining, systems biology to bioimaging, and disease genetics to evolution. However, the impact of HTS technology could be seen almost everywhere. The keynote presentation by Bonnie Berger from MIT outlined new software approaches to bridge the gap between the massive data accumulation we see and the relatively modest rate of advance in computing power. Along with others, she suggests novel uses of data compression and the development of new algorithms to compute directly from compressed datasets. Bonnie's group is also experimenting with cutting-edge algorithms using spectral graph theory to extract meaningful signals from the avalanche of noisy sequence data. Addressing these challenges will be critical to realising the full potential of HTS in future applications such as personalised medical genomics.

The current state of play in software for HTS data also received a lot of attention, to the point where the topic has spawned an ISMB special interest group, HiTSeq, which has grown into an absorbing two-day session on HTS software. Last year's presentations covered generic tools for the more basic requirements of any HTS analysis: preprocessing and mapping reads to a reference genome, visualisation of the mapping results, and so on. The winner of the ISMB 2010 Killer App contest was an elegant Java tool, GenomeView, for such visualisation tasks. By contrast, the 2011 HiTSeq session mainly concerned more exotic algorithms, aimed at more specific, downstream analyses tailored to particular datasets.

Comrad is a good example: it aims to unravel the complex rearrangements of the human genome that are favoured by cancer cells. This is one of the holy grails of cancer biology, since a detailed understanding of how a cancer cell thrives will undoubtedly result in better ways to interrupt or even stop the growth of tumours. On the plus side, Comrad is reported to do a good job in some test cases and can operate with rather sparse (i.e. cheaply produced) sequencing data. On the downside, it seems to be a complex web of novel C++ programs, dependent tools and packages that users must obtain elsewhere, and Perl wrapper scripts to paper over the cracks in the sprawling implementation. Installing and testing the software is probably time-consuming, and users need to be familiar with command-line tinkering and have access to a Linux compute cluster.

To be fair, this situation is not at all unusual in higher-level bioinformatics analysis pipelines. It partly reflects software development racing to catch up with HTS technology and experiments, and partly reflects the fact that most bioinformaticists are interested in the biological research results rather than the software. It also underlines the point that many current HTS analyses remain out of the reach of the average biologist generating the data, and that the progress of bioinformatics leaves much in its wake for the attention and finesse of dedicated software engineers.