By Daniel S. Katz, University of Illinois Urbana-Champaign, Robert Haines, Research IT, University of Manchester, David Perez-Suarez, Research IT Services, UCL, Alexander Struck, Humboldt-Universität zu Berlin
Nowadays, software is used in most research, but how that software is created and used, and what it depends on, are not well understood. The importance of such knowledge varies with the motivation of the reader. On one side, we could be interested in the impact of the software: how many times it has been used, and by whom. This type of analysis could come, for example, from funding bodies and organisations seeking to reward the creation of software and help its sustainability, from institutions that hire the people behind that software, or from the software authors themselves, to understand the needs of their users or simply to get credit for their work. Another motivation may be to understand the research being carried out with a particular piece of software or set of tools, either for purely academic purposes (e.g., by historians and scholars of science) or with a commercial perspective (e.g., by intellectual property teams at universities seeking to monetise the software). Other purposes have to do with reproducibility and provenance: for example, how do we know which calculations need to be repeated if a bug is discovered in a particular version of a piece of software?
Looking for software in research is one of many topics that are all related to the problem of understanding the role of software in research in general, and also of understanding the role of specific software in specific research. Other topics include software citation, software contributor roles, and software metadata. Two open questions in these topics are: how does one get credit for the research software they write or contribute to, and how does one give credit to someone else for the research software they use? A related question is: what should be cited when software is used? Perhaps this should be papers about the specific software, the specific software itself (e.g., by its DOI), or the software project in general (e.g., by its GitHub repository). To some extent, these questions are addressed by the recent Software Citation Principles paper [Smith 2016]. Another question, related to metadata (which is needed for citation), is: how do we know who has contributed to a software project? The difficulty in answering this is compounded by the fact that not all contributors to a project may be captured by commits to a repository. Project CRediT (http://docs.casrai.org/CRediT) attempted to create a taxonomy of contributor roles for all research outputs but, in practice, focused on roles related to papers. And even if all contributors are known, some projects have a very large number of them, and the project may prefer that it is named rather than all the individual contributors, as is done (and explained) by the yt project and by other projects such as ROOT.
The following sections present a set of methods, services, and ideas for finding the software used in research through different means: research outputs generally, papers, software itself, and data.
Methods for finding software
Some efforts have been made to survey the use of software in research by asking researchers directly how they find software, most recently among postdocs in the USA [Nangia 2017a] and earlier in the UK [Hettrick 2014]. Answers to questions like those in these surveys often mention programming languages, packages, or (commercial) stand-alone products. Despite the low return rates of such surveys, they have provided a first glimpse of general software use. Interviews have also been identified as a method for providing such insights.
Research outputs themselves may provide information about the software used in the research. In many cases, research software is mentioned in articles or conference proceedings. Research articles may also contain keywords mentioning the use of software; a comprehensive keyword vocabulary is key here. Some initial manual efforts have been made to identify software used in research outputs, as detailed in the subsections “In papers” and “In data” below.
Another place to find software in research is a Current Research Information System (CRIS), which in some countries every researcher is required to use. However, it remains to be seen how these systems will be used to document the development (or use) of software. Additionally, some funding agencies compile databases of funding applications, awards, and reports. Such databases, individually or collectively, could be used to generate a state-of-the-art report about the software in use for solving a particular research problem. Finally, in an ideal world, lab notebooks would be published next to the article and dataset; these notebooks would be another source of information about software used in research.
In papers
Some papers cite software in their references, in the same way that they cite other papers, but this is currently quite rare. Even when it is done, the software may not be picked up, or may be stored inconsistently, by citation indices, which may lack an internal software type. Nevertheless, papers are a rich source of data for software usage metrics, which can be mined from the text. This is non-trivial, however, and requires a lot of manual effort. Howison and Bullard undertook a study of 90 randomly selected biology papers, searching for and noting any references to software within them [Howison 2016]; Nangia and Katz performed a similar task for three months of Nature articles [Nangia 2017b].
Fully automated methods cannot yet match human accuracy on this task. Previous efforts have attempted to find mentions of software within free text by looking for keywords, such as “software” or “GitHub”, and then searching nearby for words that look like names (i.e., Capitalized or CamelCased words). A full machine learning approach could be developed, but building a comprehensive training corpus would require monumental manual effort, many times that required for the two studies referenced above.
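As a concrete illustration of the keyword heuristic just described, the following Python sketch scans free text for trigger words and collects nearby Capitalized or CamelCased tokens as candidate software names. The trigger list, window size, and regular expression are illustrative choices of ours, not taken from the studies cited above.

```python
import re

# Trigger words and window size are illustrative, not from the cited studies.
TRIGGERS = {"software", "package", "github", "toolkit"}

# Matches CamelCased words (e.g. "GitHub") or Capitalized words (e.g. "Bowtie").
NAME_PATTERN = re.compile(r"[A-Z][A-Za-z0-9]*[a-z][A-Z][A-Za-z0-9]*|[A-Z][a-z]+")

def candidate_mentions(text, window=5):
    """Return candidate software names found within `window` tokens of a trigger word."""
    tokens = text.split()
    candidates = set()
    for i, tok in enumerate(tokens):
        if tok.strip(".,();").lower() in TRIGGERS:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for nearby in tokens[lo:hi]:
                word = nearby.strip(".,();")
                if word.lower() not in TRIGGERS and NAME_PATTERN.fullmatch(word):
                    candidates.add(word)
    return candidates

print(candidate_mentions(
    "Alignments were produced with the Bowtie software and uploaded to GitHub."
))  # → {'Bowtie'}
```

Even this toy version shows why the approach is noisy: sentence-initial capitalized words and organisation names within the window would also be flagged, which is why manual curation has been needed so far.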
In software and software repositories
Software itself is perhaps the ultimate source of metrics about software use. Any dependencies that a piece of code relies upon must be formally described, possibly at two levels: in the code itself (‘import’ in Python, ‘#include’ in C) and in the language’s package manifest file (‘xyz.gemspec’ in Ruby). This means that following the package tree from a particular root will give all the other packages, within that packaging environment, that are used by that software. Limitations of this method are that some languages (e.g., C, Fortran) lack central package repositories, and that if a piece of software uses multiple languages, this must be detected and multiple searches performed.
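At the first of these levels, extracting code-level dependencies can be quite mechanical for some languages. As a minimal sketch, Python's standard-library ast module can list the modules a source file imports; parsing manifest files (gemspecs, setup.py, and so on) would be a separate second step.

```python
import ast

def imported_modules(source):
    """Return the top-level module names imported by a Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Guard against relative imports such as "from . import x".
            modules.add(node.module.split(".")[0])
    return modules

code = "import numpy as np\nfrom astropy.io import fits\nimport os\n"
print(sorted(imported_modules(code)))  # → ['astropy', 'numpy', 'os']
```

Mapping the resulting module names back to packages in a repository such as PyPI is the step that lets one walk the dependency tree described above.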
Software repositories generally operate at a higher level than code, and are used to store and reference complete software packages. They are often domain-specific and may or may not store metadata about software dependencies. Examples of such repositories are the Astrophysics Source Code Library (ASCL) and the Digital Research Tools (DiRT) Directory.
Code repositories, such as GitHub, offer further opportunities for detecting software use. In addition to providing direct access to a project's source code, package manifests, and CITATION files, they also track other indicators of usage, such as forks and stars. Both are indicators that software is being used by others: following the fork tree will, for example, show where others have taken a software package and extended it, and looking at the list of people who have “starred” a repository may lead to further uses of it.
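As a sketch of how such indicators might be collected programmatically, the GitHub REST API's repository endpoint (GET /repos/{owner}/{repo}) reports fork and star counts in its JSON response. The field names below match that endpoint; the repository name and counts in the offline demonstration are made up.

```python
import json
from urllib.request import urlopen

def usage_indicators(repo_metadata):
    """Extract fork and star counts from a GitHub repository JSON record."""
    return {
        "forks": repo_metadata["forks_count"],       # field name in the API response
        "stars": repo_metadata["stargazers_count"],  # field name in the API response
    }

def fetch_indicators(owner, repo):
    """Fetch a repository record from the GitHub API (requires network access)."""
    with urlopen(f"https://api.github.com/repos/{owner}/{repo}") as resp:
        return usage_indicators(json.load(resp))

# Offline demonstration with a response-shaped record; the counts are invented.
sample = {"full_name": "example/project", "forks_count": 120, "stargazers_count": 850}
print(usage_indicators(sample))  # → {'forks': 120, 'stars': 850}
```

Walking the fork tree or the stargazer list would use further endpoints of the same API, subject to its rate limits.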
Building on top of all of this are tools that collate information from multiple sources to provide software usage data in one place. Depsy and libraries.io are examples of this.
In data
For some research institutions and publishers, the term “research data” includes the software responsible for handling the data. DataCite is one organisation that assigns identifiers to datasets, and it offers an API (or perhaps even a data dump) that could help identify such research outputs. Some research data repositories hold data sets that include software. Unfortunately, meta-repositories like re3data.org do not classify research data repositories by whether they contain software; with increasing awareness of software development, such a classifier might be added in the future.
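As an illustration, the DataCite REST API can be queried for records registered with resource type “Software”. The endpoint and parameter names below reflect the public API as we understand it and should be checked against the current DataCite documentation before use.

```python
from urllib.parse import urlencode

DATACITE_API = "https://api.datacite.org/dois"  # public DataCite REST endpoint

def software_query_url(query, page_size=25):
    """Build a DataCite query URL restricted to records of type Software."""
    params = {"resource-type-id": "software", "query": query, "page[size]": page_size}
    return f"{DATACITE_API}?{urlencode(params)}"

print(software_query_url("climate model"))
```

Fetching that URL returns a JSON list of matching DOIs, each of which points at a registered piece of software or software-bearing dataset.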
We are convinced that software is essential to research today, but we also believe that we need evidence of this, both to demonstrate it to others and to understand the details of software use and how it varies over time, across fields, and in other ways. The first step towards providing this evidence and understanding is simply finding the software, and we hope this blog post is informative in providing examples of some initial methods for doing so.
[Hettrick 2014] Hettrick, S.J., Antonioletti, M., Carr, L., Chue Hong, N., Crouch, S., De Roure, D., et al. (2014). UK Research Software Survey 2014. Zenodo. https://doi.org/10.5281/zenodo.14809 Retrieved: Sep 06, 2017
[Howison 2016] Howison, J. and Bullard, J. (2016). Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science and Technology, 67: 2137–2155. https://doi.org/10.1002/asi.23538
[Nangia 2017a] Nangia, U. and Katz, D.S. (2017). Surveying the U.S. National Postdoctoral Association Regarding Software Use and Training in Research. figshare. https://doi.org/10.6084/m9.figshare.5328442.v3 Retrieved: Sep 06, 2017
[Nangia 2017b] Nangia, U. and Katz, D.S. (2017). Understanding Software in Research: Initial Results from Examining Nature and a Call for Collaboration. Accepted at WSSSPE5.2; preprint: https://arxiv.org/abs/1706.06527 Retrieved: Sep 06, 2017
[Smith 2016] Smith, A.M., Katz, D.S., Niemeyer, K.E., and the FORCE11 Software Citation Working Group (2016). Software citation principles. PeerJ Computer Science 2:e86. https://doi.org/10.7717/peerj-cs.86