By Neil Chue Hong, Director.
This guide explains how software fits with the EPSRC policy framework for research data.
Why write this guide?
From 1 May 2015, organisations that receive EPSRC funding, and their researchers, are expected to comply with the EPSRC policy framework on research data. This sets out EPSRC’s principles and expectations concerning the management and provision of access to EPSRC-funded research data, in particular the principle that “research data is a public good produced in the public interest and should be made freely and openly available with as few restrictions as possible in a timely and responsible manner” (from the EPSRC principles).
This guide has been written to clarify how these expectations relate to research software. It explains how access to research software may be provided in line with the policy, and provides examples of common situations and how they can be dealt with. If you have further questions about the EPSRC research data policy, please get in touch with your EPSRC contact or the person responsible for EPSRC matters in your institution's research support office.
What is research data?
Research data is defined by EPSRC as recorded factual material commonly retained by, and accepted in, the scientific community as necessary to validate research findings; although the majority of such data is created in digital format, all research data is included irrespective of the format in which it is created.
Note that EPSRC does not expect every piece of data produced during a project to be retained – decisions about what to keep should be taken on a case-by-case basis. There is, however, a clear expectation that data which underpins published research outputs will, by default, be kept.
Does research data include software?
This depends on the research being carried out. As noted in the definition of research data above, the deciding factor is whether the software is necessary to validate research findings, such as those published in a journal paper. The examples at the end of this guide illustrate when your software should be preserved and made accessible. As a “rule of thumb”, if your journal paper does not include sufficient detail for others to unambiguously replicate your work, you should share your code as part of your research data.
Additionally, even if you don’t need to preserve software, it is good practice to make available the software and adequate documentation to enable others to more easily validate your research findings, and to access and reuse your research data.
Who should make the decision about what research data should be preserved?
Each research organisation will have specific policies and associated processes to determine which publicly funded research data will be stored and managed, and how. Normally it will be the PI of the research project and/or the Head of Department/School who decides what research data should be preserved and made available. It is important to recognise that not all research data can or should be freely shared – ethical, legal or confidentiality issues may constrain who may have access.
What about software which has not been produced by my project, but is required to validate my research results?
Research organisations are not expected to assume responsibility for the preservation and management of software not produced within their own organisation. However, it is prudent, both in terms of providing access to your research data and in terms of enabling your own future research, to take reasonable steps to assure the continued availability of the software you use. This may include taking a copy of open-source software and preserving it if the licence allows, or using commercial software where a multi-year support agreement is available. The requirement to preserve research data for 10 years from the date of last access to the data by a third party is a compelling reason to use open data formats and open-source software.
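One reasonable first step towards assuring continued availability is to record exactly which versions of third-party software an analysis relied on, so that the same versions can be located or archived later. The following is a minimal sketch in Python, not part of the EPSRC policy; the function name describe_environment and the package names passed to it are illustrative assumptions.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Return a dictionary recording the interpreter and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # "numpy" and "scipy" are placeholders; list whatever your analysis imports.
    report = describe_environment(["numpy", "scipy"])
    print(json.dumps(report, indent=2, sort_keys=True))
```

Saving this output alongside the research data does not preserve the software itself, but it tells a future reader which versions to look for.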
What licence should I choose for my data and software?
Following the principle that publicly funded research data should be made openly available with as few restrictions as possible, you should consider applying an appropriate open licence to the data and software generated by your research. The Digital Curation Centre and Software Sustainability Institute have written guides to help you license your research data and choose an open-source licence for your software.
Where should I deposit my research data and software?
EPSRC does not provide a publications repository, research data repository, or software repository. Researchers are expected to use institutional or subject-based repositories available to them. It may be appropriate to use third-party repositories as well. It is important that deposited objects can be referenced and accessed via a persistent identifier (e.g. a DOI) and that appropriately structured metadata describing the objects is easily discoverable. A good way to make data and software discoverable is to cite it in research publications, and to include the persistent identifier in the citation.
We have published guides to help you choose a code repository, understand software preservation, cite software, and write a software management plan.
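To illustrate how a persistent identifier and basic metadata can travel together in a citation, here is a minimal sketch in Python. The field names, the citation layout, the author, and the DOI are all made-up placeholders rather than a format required by EPSRC or by any repository; check your repository's and publisher's own guidance for the layout they expect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftwareRecord:
    """Minimal descriptive metadata for a deposited piece of software."""
    authors: tuple
    year: int
    title: str
    version: str
    publisher: str
    doi: str


def format_citation(record):
    """Build a citation string that ends with the resolvable DOI link."""
    authors = ", ".join(record.authors)
    return (
        f"{authors} ({record.year}). {record.title} "
        f"(version {record.version}) [Software]. "
        f"{record.publisher}. https://doi.org/{record.doi}"
    )


# Hypothetical record: every value below is a placeholder.
record = SoftwareRecord(
    authors=("Researcher, A.",),
    year=2015,
    title="Example analysis scripts",
    version="1.0.0",
    publisher="Example Institutional Repository",
    doi="10.1234/example.5678",
)
print(format_citation(record))
```

Keeping the metadata in a structured form like this, rather than only as free text, makes it easier to reuse the same details in the repository record and in the paper.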
Analysing research data using third-party software
Amy has recorded the measurements from her long-running chemistry experiment in an electronic lab notebook, and used the R and MATLAB software packages to analyse her results and produce graphs which are included in her published paper.
Since R and MATLAB are both commonly used software packages, Amy is not required to preserve the software as long as the metadata describing her research data is sufficient, and her paper explains the techniques she used. It may be useful for Amy to deposit the R/MATLAB scripts that she used to analyse her results in a repository and link to them in her paper, because this will let others reuse her data and methods more easily and it is not an onerous task to complete.
Building scripts to support a workflow
Brian has written a script which converts data from one format to another to allow him to interface two separate codes which use different input and output formats. This script is used in research work, which results in some publications.
Brian is not expected to make the script available, as long as he has made the data that underpins the research work available and he has provided the metadata that describes it, including the formats. Even so, in this case it is of benefit to both Brian and other researchers for him to simply make the script available under an open licence. This is particularly the case if the amount of code is small and there is no expectation that Brian will support the script after release.
Creating new software as part of a research project
Colin has written a piece of software which implements a new algorithm for calculating a statistical index on a pre-existing dataset, and has published this algorithm in a paper along with results benchmarking it against other implementations of the statistical index.
As the paper both describes the algorithm and compares it to other work, it is important that Colin deposits the software and makes it accessible. To enable validation of the results in the paper, it will also be important for others to have access to the pre-existing dataset, which ideally will have a DOI and be openly accessible under a Creative Commons Attribution licence.
Dealing with commercially confidential objects
Diya is undertaking research which simulates the airflow over a vehicle chassis, and has created an improved version of a commercial software model provided by an industry partner. She has then published a paper with the permission of the industry partner which broadly describes the revised model and presents the results of applying this model to a test dataset.
In the case of ‘commercially confidential’ research data (in this case the airflow model), where a business organisation has a legitimate interest, it is not expected that the improved version produced by Diya would be made openly available. However, it would be reasonable to investigate making the revised model available subject to a suitable, legally enforceable, non-disclosure agreement to enable other researchers to verify the results published in the paper.
Exploiting software with commercial potential
In the course of her EPSRC-funded research Erin has written some code which she believes has real commercial potential in its own right. She has written up the work and wishes to publish, but the results can only be validated by the code and Erin does not wish to jeopardise its commercial potential by disclosing it.
Erin should seek the advice of her university’s commercialisation support office because under EPSRC’s standard grant conditions the university owns, and has the responsibility for exploiting, the intellectual property arising from EPSRC research grants. Because it is acceptable for there to be a delay in publication while arrangements are made to protect valuable IP, if the support office agrees with Erin, they should ensure that suitable protection is put in place before the paper is published. Once the paper is published, it is important that the code is available to anyone who wishes to validate Erin’s research.
Faced with enormous amounts of generated data
Feng is working on a large theoretical physics experiment which uses a piece of software to generate simulated data for an event. Each event data set consists of a very large amount of data, but a scientifically equivalent data set can be recreated as long as the initial parameters are identical.
In some cases, it may not be possible or cost effective to preserve research data. For example, in the case of simulated data or outputs of models, it may be more effective to preserve the means to recreate the data by preserving the generating code and environment, rather than preserving the data themselves. Provided that the ability to validate published research findings is not fundamentally compromised, a deliberate decision to dispose of research data at an appropriate time is acceptable in these cases. (From Clarifications of EPSRC expectations on research data management).
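The idea of preserving the means to recreate data, rather than the data themselves, can be sketched as follows. This is a toy illustration in Python under stated assumptions, not Feng's actual software: the simulate_event function, the parameter names, and the values are all hypothetical. What is kept is the set of initial parameters plus a checksum of the output, so that a regenerated data set can be checked against the original.

```python
import hashlib
import json
import random


def simulate_event(params):
    """Generate a toy event data set from its initial parameters alone."""
    rng = random.Random(params["seed"])  # private generator, no global state
    return [
        rng.gauss(params["mean"], params["sigma"])
        for _ in range(params["n_samples"])
    ]


def fingerprint(values):
    """Return a SHA-256 digest so regenerated data can be compared."""
    payload = json.dumps(values).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical initial parameters for one event.
params = {"seed": 2015, "mean": 0.0, "sigma": 1.0, "n_samples": 1000}
original = fingerprint(simulate_event(params))

# This small record is what gets preserved instead of the large data set.
record = {"parameters": params, "sha256": original}

# Later: regenerate from the preserved parameters and compare checksums.
regenerated = fingerprint(simulate_event(record["parameters"]))
print(regenerated == original)
```

A checksum comparison like this tests for bit-for-bit identity, which can only be relied on when the same code runs in the same environment. That is one reason to preserve the environment along with the generating code; where only scientific equivalence is needed, a comparison of summary statistics may be more appropriate.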
Many thanks to SSI Fellow Stephen Eglen for inspiring this guide and providing the “rule of thumb”, and to Ben Ryan from EPSRC for providing feedback on the guide and contributing the example on software with commercial potential.