Making data open: resources, gaps and incentives

Posted by s.aragon on 24 May 2017 - 3:00pm

By Naomi Penfold, eLife; Penny Andrews, University of Sheffield; M. H. Beals, Loughborough University; Rosie Higman, University of Cambridge; Callum Iddon, Science and Technology Facilities Council; Cyril Pernet, University of Edinburgh; Diana Suleimenova, Brunel University London.

This post is part of the Collaborations Workshop 2017 speed blogging series.

What resources already exist and what’s needed next?

Data sharing relies on having somewhere that the data can be accessed, typically a repository. Some researchers are lucky enough to have a university repository; others have to rely on external resources, such as Zenodo or the disciplinary repositories listed at re3data. Identifying the most suitable place to host data is a trivial but necessary first step.

It is also worth noting that open data does not mean just posting your research dataset online with your publication. The FORCE11 community advocates for open and FAIR data: Findable, Accessible, Interoperable, and Reusable. Understanding all the best practices and resources available to help achieve these goals can be intimidating for a researcher who wants to start sharing their data.

It can be confusing to know how best to share data: what formats are best? Which data should be shared? How best can data be managed so that it can be shared later? Both the Digital Curation Centre and DataCite provide great resources to support this. There are also tools available to help access, and make the best use of, data that is already available in semi-open states (open, but difficult to access, particularly in machine-readable form). Below is a (not at all exhaustive) list of some key tools and resources to help any researcher to get started with open research data:

Decide which research data to preserve [1]
Plan for managing and sharing your research data [2]
No data is truly open without the right licence: learn how to license your data [3]
Find the appropriate repository for your data at re3data.org
Learn how to cite and link to datasets in your research articles [4] and easily format your DOI citations
Track the impact of your open research data [5]
Find and reuse other researchers’ open data via DataCite
Clean up your own and others’ messy data using OpenRefine
Extract data from PDFs using ScraperWiki and ChemDataExtractor
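
To make the citation step in the list above a little more concrete, here is a minimal sketch in Python of turning a dataset's basic metadata into a citation string with a resolvable DOI link. It assumes a "Creators (Year): Title. Publisher. DOI" layout, which is one common pattern for dataset citations; check the guidance in [4] and your repository's own recommendation before relying on it. The DatasetRecord class, the format_citation function and the example record (including its DOI) are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Minimal, hypothetical metadata record for a deposited dataset."""
    creators: list[str]
    year: int
    title: str
    publisher: str
    doi: str


def format_citation(record: DatasetRecord) -> str:
    """Return 'Creators (Year): Title. Publisher. https://doi.org/<DOI>'.

    The layout is an assumption, not an official standard.
    """
    creators = "; ".join(record.creators)
    doi = record.doi.strip()
    # Accept a bare DOI, a 'doi:' prefix, or a resolver URL, and normalise it.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return (
        f"{creators} ({record.year}): {record.title}. "
        f"{record.publisher}. https://doi.org/{doi}"
    )


# Made-up record: the names, title and DOI below are placeholders.
example = DatasetRecord(
    creators=["Researcher, A.", "Colleague, B."],
    year=2017,
    title="Example survey responses",
    publisher="Example Repository",
    doi="doi:10.1234/example.5678",
)
print(format_citation(example))
```

Writing the DOI as a full https://doi.org/ link means a reader can click straight through to the dataset's landing page, whichever repository hosts it.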

Literacy gap

These resources need to be adopted by different disciplinary communities in a way that suits each of them, and their adoption also relies on researchers having the skills to use them. Many researchers teach themselves coding skills or learn some basics through Software Carpentry and Data Carpentry, but this relies on enthusiasts doing this in their spare time. Meanwhile, there remain many researchers in academia who understandably struggle with basic data management, such as keeping regular back-ups. When setting out best practices for open data, whether they are tools, frameworks or standards, it is important to recognise these disparities in data and software literacy. Standards and frameworks should be accessible at different skill levels and allow researchers to develop their skills and move "up" the framework.

Workflows mid-project

It is often stated that it is important to introduce best practices early on in a project. However, one of the key obstacles to widening the range of academics involved in Open Data, both in its use and its creation, is the dynamic and fast-moving landscape of tools, standards and expectations. It is rare for a principal investigator or sole researcher to be both at the stage of their career where they are producing data suitable for open dissemination and at the true "start" of a project. Even those entering the field as postgraduates often come with an assortment of materials from previous study or supervisors that must somehow be integrated into their new project's workflow. A key factor in the dissemination of Open Data practices may therefore lie in the ability to adapt methods to existing workflows and datasets. This may mean incremental or iterative improvement of metadata, documentation and format as the project progresses, rather than halting work to reformat existing materials. In other cases, it may mean providing tools that automatically reformat data created using existing proprietary or eclectic storage practices, and that highlight where additional information is required. In any case, any new practice must be made immediately actionable to learners lest it become lost in the "for my next project" mental file.
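
As an illustration of what "highlighting where additional information is required" could look like in practice, here is a minimal Python sketch that checks a set of dataset descriptions against a short list of expected metadata fields and reports only the gaps, so that records can be improved a little at a time. The field list, the function names and the example records are assumptions made up for this sketch; a real project would take its required fields from its repository or funder.

```python
# Hypothetical list of fields a project wants every dataset description to have.
REQUIRED_FIELDS = ("title", "creator", "licence", "format", "description")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or blank in one record."""
    gaps = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            gaps.append(field)
    return gaps


def gap_report(records: dict) -> dict:
    """Map each dataset name to its missing fields, skipping complete records."""
    report = {}
    for name, record in records.items():
        gaps = missing_fields(record)
        if gaps:
            report[name] = gaps
    return report


# Made-up records standing in for files inherited part-way through a project.
inherited = {
    "interviews_2015.xlsx": {
        "title": "Interview transcripts",
        "creator": "A. Researcher",
    },
    "survey_clean.csv": {
        "title": "Survey responses",
        "creator": "A. Researcher",
        "licence": "CC BY 4.0",
        "format": "text/csv",
        "description": "Cleaned responses, one row per participant.",
    },
}

for name, gaps in gap_report(inherited).items():
    print(f"{name}: still needs {', '.join(gaps)}")
```

Because the report lists only what is missing, it can be re-run as the project progresses and should shrink over time, which fits the incremental approach described above better than a one-off reformatting exercise.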

Incentives and compliance

Even with sufficient knowledge and understanding of best practices in research data sharing, few researchers are likely to adopt these practices without appropriate incentivisation. How to achieve this remains an open question, and it is not a foregone conclusion that there will be a single mechanism to incentivise all data creators to practise Open Data. Nor is there a single stakeholder responsible for the incentives. That said, funding bodies hold the obvious stick and carrot: several funders of scientific research have for many years requested that researchers share their data. The Engineering and Physical Sciences Research Council has been singled out as a funder that goes beyond a policy statement, and demands compliance in order for the individual to receive another grant in the future (see "Research data management and openness: The role of data sharing in developing institutional policies and practices" by Rosie Higman and Stephen Pinfield). Success is also contingent on the community, contributors and users alike. Should we be demanding that open data users eat their own dog food and contribute to the open data pool too? Or does this conflict with the definition of open data published by the Open Data Institute: “Open Data is data anyone can access, use and share”?

There is a dilemma between taking the carrot or the stick approach to getting researchers to share non-publication research outputs such as data and code. For the Open Access movement, it appears that only the stick has worked, with threats from funders, including the Higher Education Funding Council for England, Research Councils UK, Wellcome and the Bill & Melinda Gates Foundation, that money will no longer be forthcoming if researchers do not comply. We are at an earlier stage in the Open Data movement, and it may be possible to take a different route when it comes to advocacy and incentives to share. Incentives can seem abstract: beyond the life sciences the “it saves lives” argument doesn't always work. Punishing researchers with fewer resources, e.g. those in the Global South, for not sharing or for sharing lower-quality data is unlikely to help achieve the social justice goals of Open. Further, research data managers and librarians feel uncomfortable with the role of “policing” compliance with policies around Open and sharing, preferring an advocacy role. However, as funders begin to enforce their Open Data policies, responsibility for monitoring and incentivising data sharing has to fall somewhere.

Certainly, a means to surface the value that comes from the extra effort and time taken to document, structure and share data effectively is much needed. Moreover, adequate attention needs to be paid to concerns about Open Data: will a better-resourced group perform my next research project before I get the chance (the leapfrog scoop)? Will my Open Data be distorted in media for which there is no appropriate channel for debate? Is it ever possible to publish data about individuals, given that identification is possible even when data are anonymised, and that closed data can still be accessed by anyone able to mimic researcher credentials?

Where next?

Regardless of self-study efforts, or even of approaching the local curation expert for advice, enacting Open Data remains intimidating where there is a skills gap, or a lack of appreciation of the individual researcher’s situation and expectations. If Open Data is to be promoted, then the significant issues regarding research assessment need to be addressed at the same time.

References

[1] DCC (2014). ‘Five steps to decide what data to keep: a checklist for appraising research data v.1’. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/how-guides
[2] Jones, S. (2011). ‘How to Develop a Data Management and Sharing Plan’. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/how-guides
[3] Ball, A. (2014). ‘How to License Research Data’. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/how-guides
[4] Ball, A. & Duke, M. (2015). ‘How to Cite Datasets and Link to Publications’. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/how-guides
[5] Ball, A. & Duke, M. (2015). ‘How to Track the Impact of Research Data with Metrics’. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/how-guides