By Julie Sullivan, at InterMine, with an introduction by Raniere Silva, Community Officer, Software Sustainability Institute.
Licences in the open source world are a big challenge. In 2015, GitHub reported that only around 20% of the repositories had a licence assigned to them. Julie Sullivan and the InterMine team found similar figures for data on their platform that had clear licence information. Julie wrote the following blog post where she addresses the challenges InterMine faces when displaying licence information to users.
In our ongoing effort to make the InterMine system more FAIR, we have started working on improving the accessibility of data licences, retaining licence information supplied by the data sources integrated into InterMine, and making it available to humans via our web application, and to machines via queries.
The absence of a licence means that nobody can legally use, copy, reproduce or distribute your software.
The same is true for data. If you want to make your data open, you need to:
Publish your data.
Apply a suitable open data licence.
Without a licence, users don’t know how they may use and re-use your data. Choosing a licence is not always easy, but there are already open licences designed for data, developed by Open Data Commons and Creative Commons.
Data licences in InterMine
InterMine provides a library of data parsers for 26 popular data sets, e.g. NCBI and UniProt. We went through each of these core InterMine data sources and recorded its data licence. During this process, we identified three cases, detailed below.
It is good to have information on how to reuse the data, but these URLs might change. Also, in some cases the wording was vague or confusing, and the page itself was hard to find.
For example, one data provider has the statement "This work by our lab is licensed under …". What does "this work" mean? Software? Data? Both? It wasn’t clear. Another data provider offers their data "free of all copyright restrictions". How do we represent that?
Case 3: Data source had no information about how data can be reused (11.5%)
Example: Experimental data which has no data licence.
In cases where no data licence is listed and there is no information about how the data can be reused, we have emailed the data providers and asked for clarification.
We have to find a way to provide data licensing information even though these data are inconsistent. And regardless of how popular data licences become in the future, due to the integrative nature of InterMine, we’ll always have to handle all three cases.
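To make the problem concrete, here is a minimal sketch of how one record type could cover all three situations. It is an illustration only, not InterMine's data model: the class, field and category names are hypothetical, the example sources are made up, and the first two categories are our assumed reading of the cases (a formal licence with a URL, and a looser reuse statement on the provider's site).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LicenceStatus(Enum):
    """Hypothetical categories mirroring the cases discussed above."""
    FORMAL_LICENCE = "formal licence"    # assumed: a recognised licence with a URL
    REUSE_STATEMENT = "reuse statement"  # assumed: provider's own wording or page
    UNKNOWN = "unknown"                  # no reuse information found


@dataclass(frozen=True)
class DataSourceLicence:
    """Licence details recorded for one data source (illustrative fields)."""
    source_name: str
    licence_url: Optional[str] = None  # URL of a recognised licence
    terms_url: Optional[str] = None    # URL of the provider's reuse page
    statement: Optional[str] = None    # verbatim wording from the provider

    @property
    def status(self) -> LicenceStatus:
        # A formal licence URL wins over any looser statement.
        if self.licence_url:
            return LicenceStatus.FORMAL_LICENCE
        if self.terms_url or self.statement:
            return LicenceStatus.REUSE_STATEMENT
        return LicenceStatus.UNKNOWN


def describe(record: DataSourceLicence) -> str:
    """Return a one-line, human-readable summary for display."""
    if record.status is LicenceStatus.FORMAL_LICENCE:
        return f"{record.source_name}: licensed under {record.licence_url}"
    if record.status is LicenceStatus.REUSE_STATEMENT:
        where = record.terms_url or "no URL recorded"
        return f"{record.source_name}: see provider's reuse terms ({where})"
    return f"{record.source_name}: no licence information available"


if __name__ == "__main__":
    records = [
        DataSourceLicence(
            "Example Source A",
            licence_url="https://creativecommons.org/licenses/by/4.0/",
        ),
        DataSourceLicence(
            "Example Source B",
            statement="free of all copyright restrictions",
        ),
        DataSourceLicence("Example Source C"),
    ]
    for record in records:
        print(describe(record))
```

Keeping the provider's verbatim statement alongside any URL means that vague wording is preserved rather than interpreted, and the "unknown" case stays explicit instead of being silently dropped.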
What’s the best way to present these data in InterMine so that data consumers can easily understand how they can re-use data?
Only provide the URL of the official data licence, as recommended by voiD, the “Vocabulary Of Interlinked Datasets”.