Long Beach, USA, 15-17 July 2012.
Report by Colin Semple, Agent and MRC Human Genetics Unit, Edinburgh.
Witnessing the coming of age of systems biology
This was the 20th anniversary of the original ISMB conference (http://www.iscb.org/ismb2012/), a meeting established by amiable computer scientists with an inkling that interesting things were happening in biology, and a hope that useful synergy could develop a new field of theoretical biology. The initial keynote, 'Seeing forward by looking back' (Lawrence Hunter, University of Colorado and Richard Lathrop, University of California at Irvine, USA), emphasised the successes of machine learning algorithms from the AI community in biological data analysis. The data intensive fields of structural biology, genomics and population genetics have all benefited and, as ever, algorithms based upon HMMs (Hidden Markov Models) litter the landscape. Less celebrated were the many mismatches and misunderstandings between biologists and computer scientists, although the conference organisers recaptured some of that awkwardness by providing the conference program in the form of an unwieldy MS Excel spreadsheet. We wrestled manfully (of some 1500 delegates I would estimate around 10% were female) with the Excel file as the familiar graphs displaying the exponential increase in sequence and structure databases appeared on the screen. And one might say this is where computational biology finds itself today: fumbling with formats on the periphery of huge, complex and expanding datasets. The aspiration for a predictive branch of theoretical biology, akin to theoretical physics, survives only as a niche sub-genre of science fiction, while the reality is a morass of heuristics, irreconcilable statistics, simplistic assumptions and biased data, tangled in a trillion lines of poorly written scripts and anchored only by the crushing weight of our own ignorance. But that may be a little pessimistic! At a meeting of this scale and breadth there are usually reasons for optimism and this was no exception.
It seems that the area of 'systems' and 'network' biology (previously misnamed as 'biochemistry' and 'physiology') is where the highest concentrations of computer scientists are active. It is an area where often so little is known that one algorithm's prediction is as good as any other's. An interesting talk by Mukund Thattai (Tata Institute of Fundamental Research, India) started from the challenging proposition that the ultimate test of network predictions is to use the information discovered to design and re-engineer those systems successfully. His work has focused upon a specific example of bacterial gene regulation (Rai et al, 2012, PLoS Comp Biol) where he has successfully predicted and re-engineered the behaviour of a gene regulatory network. While this is undeniably elegant work, the network in question consists of two genes showing a strong, stable and well understood molecular interaction, in the relatively simple environment of a bacterial cell. What happens at the other end of the scales of complexity and ambiguity? Jesse Gillis (University of British Columbia, Canada) has examined previously published mouse and yeast regulatory networks based upon analyses of large gene expression datasets - by pulling them apart to see what happens. His talk convincingly showed that key (and common) assumptions underlying the construction of such networks are wrong, and that the vast majority of predicted links between genes in the networks are unlikely to be biologically meaningful (see Gillis and Pavlidis, 2012, PLoS Comp Biol). He concludes that attempts to understand gene function through similar massaging of large datasets are misguided. So it seems that approaches that often work rather well in simple, bacterial systems do not perform well on high throughput data from eukaryotes.
The latest output from the DREAM (Dialogue on Reverse Engineering Assessment and Methods; http://www.the-dream-project.org/) project, which performs blind assessments of current regulatory network inference methods, was published online during the conference (Marbach et al, Jul 15 2012, Nat Methods) and goes some way to explaining this contrast. The methods' performance on bacterial gene expression data and simulated data was considered good (recovering 50% of known interactions) but fell off dramatically for eukaryotic (yeast) data, attaining only a fraction of the accuracy. The explanation is undoubtedly that gene regulation in bacteria (with a clear correlation between RNA levels of transcription factors and their targets) is much simpler than in eukaryotes. The only solution is therefore more in-depth studies of gene regulation in eukaryotes, and integration of the additional data into the models. On the bright side, I suppose this means that the algorithm development is ahead of the data production for once.
The conference reception was held in the Aquarium of the Pacific in Long Beach; given that it was billed as a 'journey of discovery through the world's largest ocean', it was a surprisingly dry journey. My single complimentary drink was rapidly dispatched as we happy geeks shuffled around blinking at a variety of oblivious ocean organisms, as the biology stared back.