Systems biology: hyperbole and hypotheses

By Colin Semple, MRC Human Genetics Unit, University of Edinburgh.

Systems biology has had no shortage of detractors, which is at least partly due to the many ways of defining it. Some definitions go as far as pseudo-mystical descriptions of holistic perspectives (e.g. see the Wikipedia definition), contrasted with traditional reductionist approaches to science.

Scientists are a skeptical bunch and vague definitions are bound to be regarded as a poor foundation for a new discipline - if indeed it is a new discipline. Many have complained that definitions such as this are neither helpful nor new: "The study of the interactions between the components of biological systems, and how these interactions give rise to the function and behavior of that system (for example, the enzymes and metabolites in a metabolic pathway)" (Wikipedia again).

Unless, of course, we are to ignore the past few hundred years of physiology and biochemistry. More often than not systems biology is used to describe computational attempts to meaningfully integrate large, complementary, genome-wide datasets. And since the most widely used technologies have, until quite recently, been used to measure gene expression across the genome, these attempts have often focused on identifying similarly expressed genes and inferring that these similarities represent common regulatory mechanisms.

Using more or less fancy algorithms, groups of similarly expressed genes are taken to indicate regulatory networks. The conceptual underpinnings of these networks are often cartoon diagrams of genes under the dominant control of a small number of transcription factors. Although we all prefer simple representations of reality, this seems to ignore several revolutions in the still rapidly evolving literature on eukaryotic gene regulation.
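To make the "similarly expressed genes" idea concrete, here is a minimal sketch of the simplest possible version: link any two genes whose expression profiles are strongly correlated. The gene names, expression values, the `coexpression_edges` helper and the 0.9 threshold are all invented for illustration; real methods are considerably more elaborate.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_edges(profiles, threshold=0.9):
    """Link every gene pair whose profiles correlate strongly (either sign)."""
    edges = set()
    for gene_a, gene_b in combinations(sorted(profiles), 2):
        if abs(pearson(profiles[gene_a], profiles[gene_b])) >= threshold:
            edges.add((gene_a, gene_b))
    return edges


# Hypothetical expression values across five samples (not real data).
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(coexpression_edges(toy_profiles))
```

Note that an edge here records only that two profiles look alike; it carries no information about why, which is exactly the gap the cartoon diagrams paper over.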

For instance, the past decade or two have shed light on the importance of post-transcriptional (particularly miRNA-mediated) regulation, the large variety of chromatin structures in which genes reside, and the roles of the expanding universe of noncoding RNA species. This must mean ever larger numbers of scenarios under which a pair of genes share a similar expression pattern, which translates into a larger number of hypotheses to test in the wet lab (using standard reductionist approaches - so much for holism).

In the best-case scenarios this may allow one to cherry-pick the most interesting computational leads for experimental validation, perhaps based upon prior hypotheses about the cells under study. In the worst cases it amounts to a hypothesis-free trawl through large, artifact- and error-strewn datasets armed with little more than wishful thinking and a Bonferroni correction.
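For readers who have not met it, the Bonferroni correction mentioned above is about as simple as multiple-testing control gets: divide the significance level by the number of tests. A minimal sketch, with invented p-values and a hypothetical `bonferroni_significant` helper:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the indices of tests that survive a Bonferroni correction."""
    if not p_values:
        return []
    # Each individual test must beat alpha divided by the number of tests.
    cutoff = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p <= cutoff]


# Hypothetical p-values from five tests (not real data).
toy_p_values = [0.0001, 0.004, 0.03, 0.2, 0.6]
print(bonferroni_significant(toy_p_values))
```

With five tests the per-test cutoff is 0.01, so the third p-value (0.03) no longer counts as significant. With tens of thousands of gene pairs the cutoff becomes correspondingly harsher, and it does nothing to repair artifacts in the underlying data.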

A few years ago it was already becoming clear that there was a stark contrast between "detailed quantitative circuit models on a small scale and cruder, genome-wide models on a large scale" (Kim et al, 2009, Science). There are well-established examples of detailed studies, such as the pioneering work of Eric Davidson and colleagues (long before anyone mentioned systems in this context), painstakingly investigating the intricacies of sea urchin regulatory networks experimentally over the past few decades.

The latter, genome-wide studies, which seek to predict real groups of functionally interacting genes from genome-wide expression data, have generally not been an enormous success, and this has continued. The DREAM (Dialogue on Reverse Engineering Assessment and Methods) project performs ongoing blind assessments of current regulatory network inference methods (Marbach et al, 2012, Nat Methods) and provides sobering reading. In the latest tests the methods' performance on bacterial gene expression data and simulated data was considered good (recovering ~50% of known interactions) but fell off dramatically for eukaryotic (yeast) data, attaining only a fraction of that accuracy.
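What "recovering ~50% of known interactions" means can be sketched in a few lines: compare the predicted regulator-target pairs with a reference set and count the overlap. This is a simplified illustration, not the DREAM scoring scheme itself; the edge lists and the `score_predictions` helper are invented for the example.

```python
def score_predictions(predicted, known):
    """Return (recall, precision) for predicted regulator -> target edges."""
    predicted, known = set(predicted), set(known)
    hits = predicted & known
    # Recall: the fraction of known interactions that were recovered.
    recall = len(hits) / len(known) if known else 0.0
    # Precision: the fraction of predictions that are known interactions.
    precision = len(hits) / len(predicted) if predicted else 0.0
    return recall, precision


# Hypothetical reference network and predictions (not real data).
known_edges = {("tf1", "g1"), ("tf1", "g2"), ("tf2", "g3"), ("tf2", "g4")}
predicted_edges = {("tf1", "g1"), ("tf2", "g3"), ("tf1", "g4")}

recall, precision = score_predictions(predicted_edges, known_edges)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy case half of the known interactions are recovered, and one of the three predictions is wrong. Both numbers matter: a method can recover many known edges simply by predicting a great many edges.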

The explanation is undoubtedly that gene regulation in bacteria (with a clear correlation between RNA levels of transcription factors and their targets) is much simpler than in eukaryotes. So what is the solution? We need to reconstruct regulatory networks, but we lack the time and effort required to do it reliably; ergo we need better performance from global analyses of high throughput data. Do we need to incorporate additional types of data? Or do we need to test new hypotheses altogether? The answer seems to be some of both.

The ENCODE consortium has allowed studies of gene regulation in eukaryotes in unparalleled depth, charting the physical structure of the genome (the chromatin) and, for instance, revealing which regions are accessible to gene regulatory complexes or not in a given cell type. The data comes in the form of scores of genome-wide variables measuring different facets of chromatin structure. Remarkable results have been achieved (Dong et al, 2012, Genome Biol) by the consortium integrating these additional data into models of transcription itself.

These detailed, quantitative models have >80% accuracy in predicting human gene expression based upon many chromatin variables. This success is, to some extent, a function of the detailed data generated but also reflects a clearer focus. Rather than asking which genes might share some (poorly defined) aspect of their regulation, a clear hypothesis is tested: do these chromatin variables explain most variation in gene expression that we observe? Both clearer hypotheses and larger, better targeted datasets are emerging from the new field of 'systems medicine' (previously called medicine), which aims to use high throughput data to inform medical diagnoses.
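The hypothesis stated above, that chromatin variables explain most of the observed variation in gene expression, has the appeal of being directly checkable: fit a model, then ask how much variance it explains. The sketch below uses a one-variable least-squares line purely as a stand-in; it is not the consortium's model, and the variable names and values are invented.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Hypothetical values for five genes (not real data): one chromatin
# signal per gene and the measured expression of that gene.
chromatin_signal = [0.0, 1.0, 2.0, 3.0, 4.0]
expression = [1.0, 3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(chromatin_signal, expression)
print(r_squared(chromatin_signal, expression, slope, intercept))
```

The design point is that the question has a numerical answer. A value near 1 supports the hypothesis, a value near 0 refutes it, and neither outcome depends on interpreting a network diagram.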

A recent study from Michael Snyder's lab (Chen et al, 2012, Cell) put its subject (Snyder himself) through every sequencing-based, biochemical and physiological measurement the authors could think of, collecting data over a period of several months. At an early stage a rare variant was discovered in Snyder's genome that was predicted to increase his risk of diabetes, although he lacked symptoms at the time. However, during the study they were able to inadvertently test this hypothesis, charting his increasing blood glucose levels (among many other measurements) and the early onset of diabetes, but also enabling effective medical treatment. It seems that what is good for patients may turn out to be good for the science too.