Sam Brockington Lab, Dept. of Plant Sciences, University of Cambridge
I am interested in content mining to aid efficient synthesis of data and knowledge from the academic literature. Evolutionary biology spans thousands of journals, leaving relevant content scattered in unstructured, heterogeneous ways. Extracting the bigger picture from the morass of academic literature is my goal.
There are over 114 million scholarly documents available on the web, with millions more being added every year at an ever-increasing rate. They are unhelpfully scattered across tens of thousands of journals, and no one has access to them all. For my PhD I grappled with the use of fossils in phylogeny. I found the state of data availability in my field to be poor, and there were many different, interesting barriers to effective re-use of data. I spent a truly incredible amount of time scraping data out of PDFs and emailing authors (mostly unsuccessfully) for data.
This set me on a path to explore more efficient methods of information retrieval, extraction and re-use from published academic content. In my research I have explored how information in certain journals, beyond the title and abstract, is nearly invisible to all, even Google Scholar. I have also helped an international collaboration assess the state of public data sharing in phylogenetics; we found that less than 4% of phylogenetic studies published in 2010 publicly archived machine-readable data. I also work with the journal Research Ideas and Outcomes to make the processes and platforms of scientific publishing more re-use friendly. I blog regularly about my research and the barriers to research; the blog covers in more detail what I'm doing right now.