By Alasdair J G Gray, Lecturer in Computer Science, Heriot-Watt University
At the end of June, a group of individuals from across Europe came together in Leiden for the first FAIRport-ELIXIR Bring Your Own Data (BYOD) workshop, which was also sponsored by the Dutch Techcentre for Life Sciences. None of us quite knew what would happen but we were all excited that such an event was taking place. The result was better than we expected.
This first BYOD workshop brought together experts in Linked Data with experts in MycoBase and the Human Protein Atlas (HPA). The participants were evenly split between data providers, who had some, but not a lot of, RDF knowledge, and trainers, who were experts in semantic web technologies. The workshop’s aim was to give the data providers a mix of tutorial and hackathon that would make their data available in a more accessible and reusable manner, based on the Data FAIRport initiative, and using RDF. The goal was to develop showcases that would demonstrate the added value of interoperable data in answering questions across multiple resources.
The first day of the workshop was mainly devoted to how to publish data as interoperable RDF and to understanding the datasets brought by the data providers. Before the afternoon was over, we had split into two teams to work on the showcases, combining the experts’ knowledge of related datasets with either MycoBase or HPA. The day ended with a pleasant canal-side dinner in Leiden.
The next day started with the teams feverishly working on their ideas. There was a general buzz around the room as the experts called on each other’s knowledge to bring together working demonstrations. The day closed with a show-and-tell of what had been accomplished that day.
The MycoBase showcase focused on how to discover which compounds fermented. About 10,000 fungal strains in MycoBase were represented in RDF, resulting in 2.5 million triples. These were linked to the ChEMBL database by using the Open PHACTS Discovery Platform API to resolve chemical names present in MycoBase to their ChEMBL URIs. This allowed us to integrate the two datasets, pulling in key facts such as molecular weight, logP value and hydrogen bond count.
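To give a flavour of the linking step, here is a minimal Python sketch of resolving chemical names to URIs and emitting link triples. It is not the code written at the workshop. The lookup table stands in for a call to a name-resolution service, and every URI, the predicate and the function names are hypothetical placeholders chosen for illustration.

```python
# Hypothetical name-to-URI table standing in for a name-resolution service.
# The URIs are placeholders, not real ChEMBL identifiers.
EXAMPLE_LOOKUP = {
    "citric acid": "http://example.org/compound/C1",
    "ethanol": "http://example.org/compound/C2",
}

# Hypothetical predicate linking a strain to a compound.
EXAMPLE_PREDICATE = "http://example.org/vocab/relatedCompound"


def resolve_name(name, lookup):
    """Return the URI for a chemical name, or None if it cannot be resolved."""
    return lookup.get(name.strip().lower())


def link_strains(strain_compounds, lookup, predicate=EXAMPLE_PREDICATE):
    """Turn (strain URI, chemical name) pairs into N-Triples link statements.

    Returns the triples plus the names that could not be resolved, so that
    the unresolved names can be checked by hand.
    """
    triples = []
    unresolved = []
    for strain_uri, name in strain_compounds:
        compound_uri = resolve_name(name, lookup)
        if compound_uri is None:
            unresolved.append(name)
        else:
            triples.append(f"<{strain_uri}> <{predicate}> <{compound_uri}> .")
    return triples, unresolved
```

Keeping the unresolved names separate matters in practice, since free-text chemical names rarely all match on the first attempt.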
The Human Protein Atlas (HPA) team worked on two showcases. The first involved discovering the pathways in which a given HPA protein occurs, sourced from WikiPathways. The second involved linking the genes present in FANTOM5 and included a resolution step involving the Bio2RDF version of EntrezGene. These connections were possible due to lengthy modelling discussions and the development of an RDF-generating script that converted part of the HPA relational database into an RDF representation.
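For readers who have not seen a relational-to-RDF conversion, the following Python sketch shows the general idea using an in-memory SQLite table. It is an illustration only, not the script the HPA team wrote: the table, column names, base URI and the one-predicate-per-column mapping are all assumptions made for the example, and a real conversion would follow the model agreed in the modelling discussions.

```python
import sqlite3
from urllib.parse import quote

# Hypothetical base URI for the generated resources and predicates.
BASE = "http://example.org/hpa/"


def escape_literal(value):
    """Escape a value for use inside a double-quoted N-Triples literal."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def rows_to_ntriples(conn, table, key_column, base=BASE):
    """Emit one triple per non-null, non-key column of every row.

    The table name is interpolated into the query, so it must come from
    trusted code rather than from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    lines = []
    for row in cursor:
        record = dict(zip(columns, row))
        key = quote(str(record[key_column]), safe="")
        subject = f"<{base}{table}/{key}>"
        for column, value in record.items():
            if column == key_column or value is None:
                continue
            predicate = f"<{base}vocab/{column}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


def demo():
    """Build a tiny made-up table and convert it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein (id TEXT PRIMARY KEY, gene TEXT, tissue TEXT)")
    conn.execute("INSERT INTO protein VALUES ('P1', 'GENE_A', NULL)")
    return rows_to_ntriples(conn, "protein", "id")
```

Every value here becomes a plain literal. Linking to other resources means emitting URIs for some columns instead, which is where most of the modelling effort goes.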
Overall the workshop was a great success. The data providers felt they had learned a great deal about RDF and were happy with the progress that had been made. While it was recognised that modelling the data in RDF was hard, the interoperability possibilities were nonetheless a great incentive. The trainers were pleased with the ad hoc training approach, although they had some key suggestions for training material at the next BYOD workshop. The facilitators also played a key role in ensuring that there was an appropriate amount of tutorial time for us all, as well as making sure we all caught our planes. Both teams left with a vow to continue work on their showcases to completion and to produce a paper about their work.
For me, the key measure of success for the workshop will be if the data providers are now able to find their way into the world of semantic data publishing without further workshops. Only time will tell.