Interact to Interoperate
By Alejandra Gonzalez-Beltran, Software Sustainability Institute Fellow.
This year I was invited to attend the workshop “Interoperability of Metadata Standards in Cross-Domain Science, Health, and Social Science Applications”, which was held at the beginning of October 2018 at Schloss Dagstuhl, Leibniz Center for Informatics, in Wadern, Germany. I am grateful to the organisers for the invitation and to the Software Sustainability Institute Fellowship for funding my attendance.
This was my second time in Dagstuhl: I attended another workshop there in 2015, also in the area of interoperability and data standards. I thoroughly enjoyed visiting Dagstuhl on both occasions, meeting interesting people in person for the first time (including people I had previously met only virtually) and meeting some colleagues again.
This time, there were 24 attendees, including experts in metadata standards as well as research specialists involved in three pilot projects from the Data Integration Initiative, which is run by CODATA (Committee on Data of the International Science Council). These pilot projects provided our motivation and goal to achieve interoperability among the different metadata standards.
But you may wonder: What is interoperability? What are metadata standards? Which ones were considered during the workshop? What are the pilot projects about? How can interoperability help the pilot projects? What is the experience of attending a Dagstuhl workshop?
I will address these questions (and more!) in this blog post.
What is interoperability?
Computer systems can interoperate if they can “exchange and make use of information”. When considering data/metadata interoperability, we are referring to the ability to exchange data between computer systems and data stores. This ability can be split into two levels: syntactic interoperability and semantic interoperability (if you consider interoperability in a broader sense, there are frameworks that classify interoperability into more layers, such as technical and organisational).
Syntactic interoperability refers to how the data is packaged, through the use of common data formats and communication protocols. Semantic interoperability goes further, as in addition to syntactic interoperability, it requires the systems to interpret the data in a meaningful way.
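To make the two levels concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the field names, the `SHARED_VOCABULARY` dictionary and the `interpret` function are hypothetical and do not come from any of the standards discussed in this post. The first step (parsing JSON) stands for syntactic interoperability; the second step (looking up what a field means in an agreed vocabulary) stands for semantic interoperability.

```python
import json

# System A sends a record in a common format (JSON).
message = json.dumps({"station": "S1", "temp": 21.5})

# Syntactic interoperability: system B can parse the shared format,
# but nothing yet says what "temp" means or which unit it uses.
record = json.loads(message)

# Semantic interoperability: both systems agree on a shared vocabulary.
# This vocabulary is hypothetical and only serves the illustration.
SHARED_VOCABULARY = {
    "temp": {"concept": "air_temperature", "unit": "degree_Celsius"},
}


def interpret(record, vocabulary):
    """Attach the agreed meaning to every field the vocabulary defines."""
    interpreted = {}
    for key, value in record.items():
        if key in vocabulary:
            term = vocabulary[key]
            interpreted[term["concept"]] = {"value": value, "unit": term["unit"]}
    return interpreted


result = interpret(record, SHARED_VOCABULARY)
```

In this sketch the field "station" is parsed correctly but left uninterpreted, because the vocabulary does not define it: the data was exchanged, but its meaning was not.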
From the perspective of making data FAIR, or Findable, Accessible, Interoperable and Reusable (for more details check out our community-driven Nature Scientific Data article entitled “FAIR guiding principles for scientific data management and stewardship”), interoperability is “the ability of data or tools from non-cooperating resources to integrate or work together with minimal effort”. This definition adds the idea of data integration as one of the specific uses of information. In the FAIR principles, we established the following criteria for the data or metadata to be interoperable:
(meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
(meta)data use vocabularies that follow FAIR principles
(meta)data include qualified references to other (meta)data
These items encompass semantic interoperability, which also includes the syntactic layer, at the data and metadata levels. The first two items focus on making the data unambiguous and comprehensible, and the third focuses on making the links between the data meaningful.
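As a rough illustration of the three criteria above, the following Python sketch builds a small JSON-LD-style record that uses schema.org terms and then lists its qualified references, that is, links whose relation type is stated explicitly. The record content, the example.org identifier and the `qualified_references` helper are assumptions made for this example, not part of the FAIR principles themselves.

```python
# A small metadata record written as a Python dictionary in JSON-LD style.
# "@context" names the shared vocabulary (here schema.org), and "isBasedOn"
# is a schema.org property that states HOW this dataset relates to another.
# The dataset name and the example.org identifier are made up.
record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example air quality measurements",
    "isBasedOn": {"@id": "https://example.org/dataset/raw-sensor-readings"},
}


def qualified_references(record):
    """Return (relation, target) pairs for every link that names its target by "@id"."""
    references = []
    for key, value in record.items():
        if isinstance(value, dict) and "@id" in value:
            references.append((key, value["@id"]))
    return references


links = qualified_references(record)
```

The point of the third criterion is visible in the pair that is returned: the link is not just a bare identifier but comes with the relation ("isBasedOn") that gives it meaning.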
What are metadata standards?
Data interoperability is only possible through the use of metadata standards, as they provide a common understanding of data. But what do we mean by “metadata”? The usual catchy definition is that metadata is “data about the data”. More precisely, it is structured information that describes a resource. This includes things such as how the resource can be located or accessed, and how it was produced.
So, enabling data exchange requires agreed metadata. There are also many advantages if the metadata standards are open.
What metadata standards were considered in the workshop?
At the Dagstuhl workshop, there were representatives of the following metadata standards, with some people representing more than one standard:
Data Tag Suite (DATS)
Some standards from the World Wide Web Consortium (W3C), including the Data Catalog Vocabulary (DCAT), the Semantic Sensor Network (SSN) ontology, the RDF Data Cube vocabulary, and the Provenance Ontology (PROV-O).
I am involved in two of those standards: the Data Tag Suite (or DATS) model for dataset descriptions and the Data Catalog Vocabulary (DCAT).
DATS started as the underlying model for the DataMed prototype search engine for datasets, whose objective was to do for datasets what PubMed had done for the scientific literature. The slides I presented on DATS are available in Zenodo.
In addition to a general introduction about DATS, my slides focused on showing how DATS deals with generic vs domain-specific datasets, structure vs semantics, and study level vs variable level information. These were the dimensions over which we analysed all the metadata specifications. We developed DATS as a series of schemas (specifically, JSON schemas) representing the structure. Regarding the semantics, we have annotated the information with multiple vocabularies (in the form of JSON-LD context files), namely: schema.org, a set of ontologies from the Open Biological and Biomedical Ontology (OBO) Foundry, and the W3C Data Catalog Vocabulary. In a separate blog post, I will focus on the W3C DCAT vocabulary, so watch this space!
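The split between structure (JSON schemas) and semantics (JSON-LD context files) can be sketched in a few lines of Python. Please treat this as a toy illustration only: `DATASET_SCHEMA` and `DATASET_CONTEXT` are hypothetical and heavily simplified, they are not the actual DATS schemas or context files, and the two helper functions cover only a tiny fraction of what real JSON Schema validation and JSON-LD processing do.

```python
# Hypothetical, heavily simplified schema: NOT the actual DATS dataset schema.
DATASET_SCHEMA = {
    "type": "object",
    "required": ["title", "creators"],
    "properties": {
        "title": {"type": "string"},
        "creators": {"type": "array"},
    },
}

# Hypothetical JSON-LD-style context mapping field names to schema.org terms.
DATASET_CONTEXT = {
    "@context": {
        "title": "https://schema.org/name",
        "creators": "https://schema.org/creator",
    }
}

# Mapping from the JSON type names used above to Python types.
JSON_TYPES = {"string": str, "array": list, "object": dict}


def check_structure(instance, schema):
    """Return a list of problems; handles only 'required' and simple 'type' rules."""
    problems = []
    for field in schema.get("required", []):
        if field not in instance:
            problems.append(f"missing required field: {field}")
    for field, rule in schema.get("properties", {}).items():
        if field in instance and not isinstance(instance[field], JSON_TYPES[rule["type"]]):
            problems.append(f"wrong type for field: {field}")
    return problems


def annotate(instance, context):
    """Replace each known key with the vocabulary IRI given by the context.

    This is a simplification of JSON-LD expansion, kept short for the example.
    """
    mapping = context["@context"]
    return {mapping.get(key, key): value for key, value in instance.items()}


dataset = {"title": "Example dataset", "creators": ["A. Researcher"]}
problems = check_structure(dataset, DATASET_SCHEMA)
annotated = annotate(dataset, DATASET_CONTEXT)
```

Keeping the two files separate means the same structural description can be annotated with more than one vocabulary, simply by swapping the context.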
At the workshop, I also did a demo of the FAIRsharing web application, which interrelates data and metadata standards, databases and data policies. In FAIRsharing, you can see how the different metadata standards are related to each other, and how they are implemented in databases and recommended in data policies (for more information about FAIRsharing, see our pre-print on bioRxiv, which has been accepted for publication in Nature Biotechnology).
What are the CODATA Data Integration Initiative pilot projects about?
The pilot projects we worked on deal with the following global challenges:
Disaster risk reduction. This pilot is investigating how data integration could support the Sendai Framework, which is a 15-year, voluntary, non-binding agreement signed by UN member states that recognises that the State has the primary responsibility to work towards reducing disaster risk, working with other stakeholders. It is led by Public Health England in partnership with the Integrated Research on Disaster Risk.
Infectious diseases. This pilot considers Ebola as the primary case and is looking at supporting both research and humanitarian efforts. It is led by the Infectious Diseases Data Observatory (IDDO).
Resilient cities. This pilot is looking at the integration of geospatial data, air quality measurements, the location of hospitals, and so on, to enable better governance and investment decisions that would lead to more resilient cities. The partners in this pilot are Resilience Brokers and the city of Medellín in Colombia.
These pilot projects provided the motivation for the work at the workshop, giving us concrete interoperability issues to discuss around the data sources that would be useful for addressing these important challenges.
What is the experience of attending a Dagstuhl workshop?
Schloss Dagstuhl is a very special place, “where Computer Science meets”, whose aim is “furthering world class research in computer science by facilitating communication and interaction between researchers.” This centre for Computer Science is located in the Dagstuhl castle, which was built in the 1760s and is composed of a seven-axis manor house and an adjoining chapel in baroque style. Since its construction, the castle has had many residents, from Counts and Barons to Franciscan nuns and elderly people. Finally, it became an “Informatics Monastery”, with the first workshop happening in 1990. If you want to know more about the history of the Dagstuhl castle, you can grab a booklet about it when you visit, or check it out on their website.
Given that it is quite an isolated place, workshops there provide lots of opportunities to interact with other attendees. The meetings usually run from Monday to Friday. During the day, there are presentations and breakout group activities. Also, at lunch breaks, we usually went out for walks in the beautiful surrounding area. In the evenings, people meet up in the wine cellar for more social activities, while eating cheese. There we usually played Werewolf, Cards Against Humanity or other fun games.
In parallel with our “Interoperability of Metadata Standards” workshop, there was another one on Automating Data Science, where some of the topics covered were data wrangling, predictive modelling, exploratory data analysis, inductive querying, probabilistic programming and visual analytics. I am also very interested in these topics, and it was good to have some opportunities to chat and exchange contact details with participants of the other group. However, as both our workshops had a packed schedule, it was not feasible to have a common session. For the future, it would be great to put together a workshop bringing together data scientists and metadata experts with the goal of exploiting metadata specifications for automating data science. This would provide more opportunities to “interact to interoperate”.
It so happens that the Software Sustainability Institute (SSI) Collaborations Workshop for 2019 (CW19) includes interoperability as one of its topics. Recently, Raniere Silva wrote an SSI blog post on “Interoperability and its importance for research software”, so check it out for more information. The other topics for CW19 are documentation, training and sustainability. If you are interested in research software and/or some of these topics, I hope to meet you at CW19.