Better analyses for social insect genomics

Posted by s.aragon on 29 October 2018 - 9:28am

Solenopsis invicta fire ants on an Illumina MiSeq sequencing flowcell.
Image courtesy of Yannick Wurm.

By Yannick Wurm, Software Sustainability Institute Fellow

I spent the week of August 5th at the 18th Congress of the International Society for the Study of Social Insects in Guarujá, Brazil. This is a large quadrennial conference that brings together researchers from around the world who study ants, bees, wasps, termites and a few other animals. Part of my trip was funded by the Software Sustainability Institute, which lobbies for, and helps people do, better research through improved software. Hence this blog post.

The study of social insects has traditionally relied on approaches such as behavioural observation and taxonomic sampling, with genetic analyses becoming more common since the mid-2000s. A pleasant surprise at the conference was the recent increase in genome-wide molecular approaches, in which whole or partial genome or transcriptome sequences of many individuals are obtained to make specific comparisons within species, and sometimes between species.

This disruptive shift is largely due to the 50,000-fold drop in DNA sequencing costs over the past 10 years. See my student Émeline’s recent review on the genes and processes underpinning the evolution of social behaviour in ants.

With great power comes great responsibility

A major challenge for small research labs now wielding large genomic datasets is that it is easy to make a small mistake that has high costs. See the articles by Greg Miller, M. Gallego et al. and Steve Horvath.

In light of this, as part of a workshop on genomics approaches organised with Tim Linksvayer and Alex Mikheyev, I gave an overview of some of the lessons we can transfer from the worlds of “other” data sciences to our expanding world of social insect genomics. These include:

writing analysis code for humans;
respecting style guides for code (e.g., R style guide), and for how to structure a genomic analysis;
benefits of peer-reviewing code, and of peer-coding sessions;
using specific tools that increase productivity while decreasing risks (rmarkdown, fat machines, snakemake/nextflow);
benefits of visualising data in many different ways. Typically, when people learn to fit basic linear models, they learn the importance of visually inspecting diagnostic plots (e.g. Q-Q plots, residual plots). But when we end up performing tens of thousands of such analyses (e.g. one for each gene or one for each SNP), many forgo doing this;
the GeneValidator protein quality software and Sequenceserver BLAST comparison software that my lab makes to facilitate analysis and visualisation.
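To make the point about per-gene diagnostics concrete, here is a minimal sketch of one way to keep some of that inspection habit when fitting thousands of models. It is not taken from the talk. The gene names, the toy data and the choice of diagnostic (the largest residual measured in units of residual standard deviation) are all assumptions made for illustration. The idea is only to rank the fits so that the most suspicious ones can then be plotted and looked at by eye.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares fit of y ~ x; returns (slope, intercept)."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def max_abs_scaled_residual(x, y):
    """Largest residual of the fit, in units of residual standard deviation."""
    slope, intercept = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    spread = statistics.pstdev(residuals)
    if spread == 0:
        return 0.0  # perfect fit, nothing to flag
    return max(abs(r) for r in residuals) / spread


def rank_for_inspection(expression, x, top_n=3):
    """Return the names of the top_n genes with the most extreme residual."""
    scores = {gene: max_abs_scaled_residual(x, y) for gene, y in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical toy data: 200 well-behaved genes measured in 12 samples.
rng = random.Random(1)
x = [float(i) for i in range(12)]
expression = {
    f"gene_{i}": [2.0 + 0.5 * xi + rng.gauss(0, 0.3) for xi in x]
    for i in range(200)
}

# One hypothetical gene with a single aberrant sample.
expression["gene_outlier"] = [2.0 + 0.5 * xi for xi in x]
expression["gene_outlier"][6] += 25.0

suspects = rank_for_inspection(expression, x)
print(suspects)  # candidates whose diagnostic plots deserve a manual look
```

A single summary number is a crude stand-in for actually looking at a Q-Q or residual plot. Its only job here is to decide which handful of the many fits to plot first.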

My slides are available on SlideShare.

If you wish to discuss this post with us, send us an email or contact us on Twitter @SoftwareSaved.