Being FAIR – data licences in InterMine
Licences in the open source world are a big challenge. In 2015, GitHub reported that only around 20% of the repositories had a licence assigned to them. Julie Sullivan and the InterMine team found similar figures for data on their platform that had clear licence information. Julie wrote the following blog post where she addresses the challenges InterMine faces when displaying licence information to users.
This post was originally published at the InterMine blog.
In our ongoing effort to make the InterMine system more FAIR, we have started working on improving the accessibility of data licences, retaining licence information supplied by the data sources integrated into InterMine, and making it available to humans via our web application, and to machines via queries.
Open data licences
If you want to make your software open you need to:
Publish your software in a public space.
Apply a suitable free and open source licence.
The absence of a licence means that nobody can legally use, copy, reproduce or distribute your software.
The same goes for data. If you want to make your data open, you need to:
Publish your data.
Apply a suitable open data licence.
Without a licence, users don’t know how they may use and re-use your data. Choosing a licence is not always easy, but Open Data Commons and Creative Commons have already developed open licences designed exclusively for data.
Data licences in InterMine
InterMine provides a library of data parsers for 26 popular data sets, e.g. NCBI and UniProt. We went through each of these core InterMine data sources and recorded the data licence for each. During this process, we identified the 3 cases detailed below.
[Figure: pie chart showing that 34.6% of data sources had licences, 53.8% had some licence information, and 11.5% had no licensing information at all.]
Case 1: Data source had a data licence (34.6%)
Perfect. Ideally, all data sets would have licensed data!
Case 2: Data source had some information about how data can be reused (53.8%)
It is good to have information on how to reuse the data, but the URLs of these pages might change. Also, in some cases the wording was vague or confusing, and the page itself was hard to find.
For example, one data provider has the statement "This work by our lab is licensed under …". What does "this work" mean? Software? Data? Both? It wasn’t clear. Another data provider offers their data "free of all copyright restrictions". How do we represent that?
Case 3: Data source had no information about how data can be reused (11.5%)
Example: Experimental data which has no data licence.
In cases where no data licence is listed and there is no information about how the data can be reused, we have emailed the data providers and asked for clarification.
We have to find a way to provide data licensing information even though this information is inconsistent across sources. And regardless of how popular data licences become in the future, due to the integrative nature of InterMine, we’ll always have to handle all three cases.
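To make the three cases concrete, here is a minimal sketch of how a data source record could be sorted into them. The class, field, and function names are hypothetical and chosen only for illustration; this is not InterMine's actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LicenceCase(Enum):
    HAS_LICENCE = "Case 1: data licence"
    HAS_REUSE_INFO = "Case 2: some reuse information"
    NO_INFO = "Case 3: no information"


@dataclass
class DataSource:
    # Hypothetical record, not InterMine's real schema.
    name: str
    licence_url: Optional[str] = None     # URL of a formal data licence
    reuse_info_url: Optional[str] = None  # URL of a page describing reuse terms


def classify(source: DataSource) -> LicenceCase:
    """Return which of the three cases a data source falls into."""
    if source.licence_url:
        return LicenceCase.HAS_LICENCE
    if source.reuse_info_url:
        return LicenceCase.HAS_REUSE_INFO
    return LicenceCase.NO_INFO
```

A formal licence takes priority over a looser reuse statement, so a source that has both is still counted under Case 1.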
What’s the best way to present these data in InterMine so that data consumers can easily understand how they can re-use data?
Option 1: Only provide a URL to the official data licence, as recommended by VoID, the “Vocabulary of Interlinked Datasets”.
The URL will not change, e.g. http://creativecommons.org/licenses/by/4.0/
Easy to ascertain permissiveness.
Easy to compare across data sets.
Option 2: Provide a URL to the data licence OR to more information.
The URL might change.
Useful because people can get details on allowed usage, even if there is no data licence.
Option 3: Provide the licence text and URL.
Would provide more information immediately to users where there isn’t a licence.
Danger of being inaccurate or out of date.
Users would not have to leave InterMine to see what’s allowed.
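The "easy to compare across data sets" point for the URL-only approach can be illustrated with a short sketch: when every data set carries a stable licence URL, comparing them reduces to grouping by that URL. The data set names below are placeholders and the function name is hypothetical; only the CC BY 4.0 URL comes from the text above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CC_BY_4 = "http://creativecommons.org/licenses/by/4.0/"
CC0_1 = "http://creativecommons.org/publicdomain/zero/1.0/"  # illustrative second licence


def group_by_licence(data_sets: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group data set names by their licence URL."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, licence_url in data_sets:
        groups[licence_url].append(name)
    return dict(groups)


# Placeholder data sets, not real InterMine sources.
example = [
    ("data-set-a", CC_BY_4),
    ("data-set-b", CC0_1),
    ("data-set-c", CC_BY_4),
]
grouped = group_by_licence(example)
```

Free-text licence statements, as in the other two approaches, would not group this cleanly, because two providers can describe the same terms in different words.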