Hacking software citation implementation: The Citation File Format Hack Day at RSE18

Posted by s.aragon on 2 October 2018 - 9:09am
Image courtesy of futureatlas.com

By Stephan Druskat, Humboldt-Universität zu Berlin (ORCID 0000-0003-4925-7248); Jurriaan H. Spaaks, Netherlands eScience Center (ORCID 0000-0002-7064-4069); and Alexander Struck, Cluster of Excellence Image Knowledge Gestaltung (ORCID 0000-0002-1173-9228)

In order to enable attribution and credit for Research Software Engineers, and other developers of and contributors to research software, software must be made citable, and must be cited. One of the obstacles to correct and comprehensive software citation is the lack, or suboptimal discoverability, of relevant metadata. While, for instance, papers provide their metadata quite obviously (i.e., title, authors, containing publication, publication date, etc.), software hardly ever does.

One strategy to uncover citation-relevant metadata for research software is to let software creators provide them in CITATION files. In this context, Robin Wilson (in his blogpost “Encouraging citation of software - Introducing CITATION files”) suggested the use of plaintext files which may also include BibTeX snippets. One problem with such files is, however, that in the highly automated software citation workflow, plaintext files are an unreliable source for correct, comprehensive software citation metadata. During a discussion session at the Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE) 5.1, a machine-readable standard format for CITATION files was therefore suggested to replace the free form plaintext files.

Enter CFF

The Citation File Format (CFF) has been developed to implement this standard. CFF is written in simply-structured YAML files called CITATION.cff (see the example below). It retains a high degree of human-readability and -writability to preserve the recoverability of metadata from a CITATION.cff file by humans, while reliably providing correct metadata for other actors in the software citation workflow, such as repositories, indexers, converters, publication platforms, etc.

cff-version: 1.0.3
message: If you use this software, please cite it as below.
authors:
  - family-names: Druskat
    given-names: Stephan
    orcid: https://orcid.org/0000-0003-4925-7248
title: My Research Tool
version: 1.0.4
doi: 10.5281/zenodo.1234
date-released: 2017-12-18

The Citation File Format has a number of required fields to ensure that it is self-documenting (message) and implements the Software Citation Principles (https://doi.org/10.7717/peerj-cs.86, principles 1., 2., 3., 5., 6.). Additionally, it allows the definition of arbitrary secondary references for software, such as software, context, or algorithm papers. It is compatible with other metadata formats, such as the JSON-LD implementation of the multi-purpose software metadata model CodeMeta for machine exchange.
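To make the idea of required fields concrete, here is a minimal Python sketch that checks a CITATION.cff mapping for missing keys. The list of keys is our assumption for this sketch: it is the set of keys in the example above minus `doi`, so please check it against the version of the specification you target. The function name `missing_keys` is hypothetical, and the example file is written out by hand as a Python dictionary rather than parsed from YAML.

```python
# Keys assumed to be required for this sketch (the example's keys minus `doi`);
# verify against the CFF specification version you actually use.
REQUIRED_KEYS = (
    "cff-version",
    "message",
    "authors",
    "title",
    "version",
    "date-released",
)


def missing_keys(citation: dict) -> list:
    """Return the assumed-required top-level keys absent from the mapping."""
    return [key for key in REQUIRED_KEYS if key not in citation]


# The example file from above, written out by hand as a Python mapping.
example = {
    "cff-version": "1.0.3",
    "message": "If you use this software, please cite it as below.",
    "authors": [
        {
            "family-names": "Druskat",
            "given-names": "Stephan",
            "orcid": "https://orcid.org/0000-0003-4925-7248",
        }
    ],
    "title": "My Research Tool",
    "version": "1.0.4",
    "doi": "10.5281/zenodo.1234",
    "date-released": "2017-12-18",
}

print(missing_keys(example))                        # nothing missing
print(missing_keys({"title": "My Research Tool"}))  # everything but the title
```

A check like this could run before a release, so that an incomplete file is caught before it reaches a repository or indexer.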

CFF is a “front-end”, human-friendly format, and a number of tools exist for it, e.g., for creation, manipulation, and conversion (see this list of tools in the CFF README). It is increasingly being adopted by researchers and institutions, such as the Netherlands eScience Center – the Netherlands’ central institution for Research Software Engineering – which uses it to power software citation in its Research Software Directory.

Hacking CFF

On 5 September 2018, 26 software citation hackers convened at the Citation File Format hack day in the Murray Learning Centre at the University of Birmingham to learn about, discuss, and work on the Citation File Format (CFF) and its tooling. The hack day was co-located with the Third Conference of Research Software Engineers, which was made possible by the immensely helpful #RSE18 team (Louise Brown, Claire Wyatt, Catherine Jones) and their University of Birmingham collaborators (Debbie Carter, John Owen, Andrew Edmondson), whom we’d like to thank very much, as well as the Software Sustainability Institute, who generously funded the hack day through a special collaborations grant.

The hack day ran from 9 am to 5 pm. After a brief introduction to CFF (based on the RSE18 talk “‘YOU HAVE 0 CREDIT – PLEASE INSERT C̶O̶I̶N̶ FILE’: The Citation File Format”), Jurriaan H. Spaaks (eScience Research Engineer at the Netherlands eScience Center) talked about “How the Netherlands eScience Center uses CFF to promote software citation”. The Center provides software citation metadata in CITATION.cff files for their Research Software Directory.

After a round of brief introductions from everyone, participants presented their hack ideas – some original, some based on issues reported to the main Citation File Format repository – formed groups, and got hacking. The resulting hacks were presented to the plenary at the end of the workshop. Together, we have made very diverse hacks on different levels: tooling implementation, policy, licensing, documentation, software citation workflow issues, and more. This is the magic behind hack events, and this one wasn’t different: a diverse group of people coming together, bringing their different sets of skills to the table, and collaborating on advancing a certain thing, just getting some work done, and creating all different sorts of output, be it code, text, knowledge, policy, or something else.

Hacks

Report more formally on experiences with CFF at the Netherlands eScience Center

Jurriaan H. Spaaks and Jason Maassen have started working on a blog post detailing their adoption of CFF to provide and re-use citation metadata for the software listed in the Netherlands eScience Center’s Research Software Directory. Once the post is finalised, it will be submitted to the Software Sustainability Institute’s blog.

Clarify the relation and interfaces between CFF and CodeMeta

CodeMeta is a crosswalk table and format that “improve[s] how [different] resources [for research software] can talk to each other.” As a general software metadata exchange format, CodeMeta can also be used to record the metadata needed for software citation. Therefore, the relation between CFF and CodeMeta should be clarified, and potentially useful interfaces for an exchange between the two should be identified. In the context of the FORCE11 Software Citation Implementation Working Group, Daniel S. Katz (as Group Leader) and Stephan Druskat (as member) started the discussion that led to internal communication within the working group about the differences between the two formats and an assessment of their potential uses.
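To illustrate what an interface between the two formats might look like, here is a small Python sketch that maps a handful of CFF keys onto CodeMeta/schema.org terms. This is our own ad hoc mapping for illustration, not an official crosswalk and not the output of the hack day. The function name `cff_to_codemeta`, the choice of target terms, and the CodeMeta 2.0 context URL are assumptions of this sketch.

```python
import json


def cff_to_codemeta(citation: dict) -> dict:
    """Map a few CITATION.cff keys onto CodeMeta terms (illustrative only)."""
    authors = []
    for person in citation.get("authors", []):
        entry = {
            "@type": "Person",
            "givenName": person.get("given-names"),
            "familyName": person.get("family-names"),
        }
        # Assumption: use the ORCID URL as the person's identifier.
        if "orcid" in person:
            entry["@id"] = person["orcid"]
        authors.append(entry)

    codemeta = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": citation.get("title"),
        "version": citation.get("version"),
        "datePublished": citation.get("date-released"),
        "author": authors,
    }
    # Assumption: express the DOI as a resolvable URL in `identifier`.
    if "doi" in citation:
        codemeta["identifier"] = "https://doi.org/" + citation["doi"]
    return codemeta


cff = {
    "title": "My Research Tool",
    "version": "1.0.4",
    "doi": "10.5281/zenodo.1234",
    "date-released": "2017-12-18",
    "authors": [
        {
            "family-names": "Druskat",
            "given-names": "Stephan",
            "orcid": "https://orcid.org/0000-0003-4925-7248",
        }
    ],
}

print(json.dumps(cff_to_codemeta(cff), indent=2))
```

Even a mapping this small shows where the formats differ: CFF's `message`, for instance, has no obvious counterpart here, which is exactly the kind of difference the discussion set out to document.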

Solve the chicken/egg dilemma for DOIs

This has been a very important hack, not only for CFF, but for any effort affected by the following dilemma. The Software Citation Principles state that a software version should be uniquely identifiable via a unique, persistent, and machine-actionable identifier such as a DOI. DOIs are usually assigned to releases of a software version, and the assignment can be automated, e.g., via the Zenodo integration for GitHub releases. With automated assignment, however, the DOI only becomes known once the release has been made, which precludes the citation metadata in a CITATION.cff file from being updated in time for the release. The dilemma can be solved by manually reserving a DOI at Zenodo, updating the CITATION.cff file, and then manually making the release including the up-to-date metadata. However, the manual steps involved make this approach prone to error, so Toby Hodges, Patricia Herterich, Cerys Lewis, David Perez-Suarez, and Robin Dasler started investigating the feasibility of automating the process. The group tested whether they could reserve DOIs on Zenodo through its API and add the pre-reserved DOI to a .zenodo.json file before the release is pushed from GitHub to Zenodo. In this workflow, the DOI would also be added to the CITATION.cff file. However, it turned out that they couldn’t get Zenodo to parse the pre-reserved DOI; instead, it always created a new DOI for releases.

In lieu of the GitHub-Zenodo integration implementing a similar feature, one way to take this forward would be to implement a service that pre-reserves a DOI, creates the .zenodo.json and CFF file for the GitHub repository, and then pushes all this through the API to Zenodo, uploading the release tarball as a “standard” file upload instead of using the GitHub integration.
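One small building block of such a service is the step that writes a reserved DOI into the CITATION.cff file before the release is made. The Python sketch below covers only that step, and it assumes the DOI has already been reserved elsewhere. The function name `set_doi` and the DOI `10.5281/zenodo.5678` are placeholders of ours, and the sketch edits the file as plain text rather than through a YAML library.

```python
import re


def set_doi(cff_text: str, doi: str) -> str:
    """Return cff_text with its top-level `doi:` key set to the given DOI."""
    new_line = f"doi: {doi}"
    # Match only unindented `doi:` lines, so that DOIs nested further down
    # (e.g. inside a list of references) are left untouched.
    pattern = re.compile(r"^doi:.*$", flags=re.MULTILINE)
    if pattern.search(cff_text):
        # A function is used as replacement so the DOI is inserted literally.
        return pattern.sub(lambda _match: new_line, cff_text, count=1)
    # No top-level DOI yet: append one, keeping a single trailing newline.
    return cff_text.rstrip("\n") + "\n" + new_line + "\n"


before = (
    "cff-version: 1.0.3\n"
    "title: My Research Tool\n"
    "version: 1.0.4\n"
    "doi: 10.5281/zenodo.1234\n"
    "date-released: 2017-12-18\n"
)

# Placeholder DOI standing in for one reserved ahead of the release.
print(set_doi(before, "10.5281/zenodo.5678"))
```

Working on the text rather than re-serialising the YAML keeps comments and key order in the file intact, at the cost of handling only the simple case of an unindented `doi:` line.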

Find the Chicken/Egg Dilemma for DOIs project on GitHub.

Implement CFF support for R

Jan Philipp Dietrich implemented an R package that offers read and write support for CITATION.cff files. Furthermore, it provides tools for the extraction of citation information from R packages, thus extending the citation function from the utils package. The package is available from the CRAN of the Potsdam Institute for Climate Impact Research.

Find the CFF Support for R project on GitHub.

Optimise developer documentation

Oliver Strickson not only optimised the documentation for CFF and the repositories in the Citation File Format GitHub organisation, but also checked and updated LICENSE files where necessary, and led a discussion about licensing the format itself, which in turn led to a separate hack giving the format a licence.

Create a generic CFF reader in Python

Peter Hill and Jennifer Radtke have created a Python library for reading CITATION.cff files. The great thing about the module is that it also provides a class (i.e., a data model) for the citation metadata, which allows the library to be used in the backend of Python applications reading, creating, manipulating and exporting CFF files.
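To give an idea of what a data model for citation metadata can look like, here is a minimal Python sketch using dataclasses. It is not the API of the library created at the hack day. The class names `Person` and `Citation`, the method names, and the ad hoc reference format are all our assumptions, and the input is a plain dictionary rather than a parsed YAML file.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    family_names: str
    given_names: str
    orcid: Optional[str] = None


@dataclass
class Citation:
    cff_version: str
    message: str
    title: str
    version: str
    date_released: str
    authors: List[Person] = field(default_factory=list)
    doi: Optional[str] = None

    @classmethod
    def from_dict(cls, data: dict) -> "Citation":
        """Build a Citation from a mapping that uses CFF's hyphenated keys."""
        authors = [
            Person(
                family_names=author["family-names"],
                given_names=author["given-names"],
                orcid=author.get("orcid"),
            )
            for author in data.get("authors", [])
        ]
        return cls(
            cff_version=data["cff-version"],
            message=data["message"],
            title=data["title"],
            version=data["version"],
            date_released=data["date-released"],
            authors=authors,
            doi=data.get("doi"),
        )

    def short_reference(self) -> str:
        """Format a one-line reference; the format itself is ad hoc."""
        names = "; ".join(f"{p.family_names}, {p.given_names}" for p in self.authors)
        year = self.date_released[:4]
        return f"{names} ({year}). {self.title} (version {self.version})."


citation = Citation.from_dict(
    {
        "cff-version": "1.0.3",
        "message": "If you use this software, please cite it as below.",
        "authors": [{"family-names": "Druskat", "given-names": "Stephan"}],
        "title": "My Research Tool",
        "version": "1.0.4",
        "date-released": "2017-12-18",
    }
)
print(citation.short_reference())
```

Separating the data model from reading and writing is what lets the same classes sit behind a reader, an editor, or an exporter.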

Find the CFF reader in Python project on GitHub.

Give CFF a licence

After Oliver Strickson flagged the issue while he was working on updating and improving developer documentation and license information for the different CFF projects, we discussed whether we should license the format itself, in contrast to the specifications document, which is licensed under CC BY-SA 4.0. After some discussion, we concluded that if the format can be licensed at all, it should be licensed under a maximally liberal licence, which should mostly prevent the creation of new projects under the same name. The Apache License, Version 2.0 was a candidate, but its licence text redistribution requirement and the fact that it is mainly used for software led to its dismissal. Instead we tentatively opted for CC BY 4.0 and included it in the main CFF repository’s README, but would like to ask the community for more informed input on whether this is a viable solution.

Submit CFF as a standard to fairsharing.org

Alexander Struck has submitted the Citation File Format as a standard to fairsharing.org, a resource on data and metadata standards, inter-related to databases and policies. This entry will make the format more visible and discoverable.

We are still awaiting DOI assignment. We ask interested parties to join the CFF team for future work.

How to assign DOIs to GAP releases?

Slightly out of the scope of the hack day, but well within the realm of Software Citation, Robin Dasler, Pete Arnold, Magnus Hagdorn and Alexander Konovalov discussed how DOIs can be assigned to Groups, Algorithms, Programming (GAP), a system for computational discrete algebra.

The discussion will be documented, and progress will be reported, in this GitHub issue.

Drag-drop web front end for a CFF editor

Matt Walker and Ana Costa Conrado developed a prototype for a drag’n’drop-enabled front end for a web application that can work on CFF files. This could tie in nicely with, e.g., the Python-based back end from the respective hack mentioned above. The drag’n’drop interface already works with single files, but can also be made to work with, e.g., whole directories, so that an entire software database could be read out with it.

Find the Drag-drop Web Front End for a CFF Editor project on GitHub.

Discover software published in JOSS which does not have a CITATION.cff file

JOSS – the Journal of Open Source Software – is a journal for software packages. During the Software Sustainability Institute’s Collaborations Workshop 2018, Neil Chue Hong et al. worked on retrieving a dataset and analysing how software has been cited across it (The Code and Data Citation Counter project). Neil came to the hack day to use a portion of this dataset, software published in JOSS, to find out which of the published projects already have a CITATION.cff file, and to investigate how the remainder of the set can be incentivised to add a CITATION.cff file to their project, e.g., by creating automated pull requests with pre-populated CFF files. A preliminary analysis of published projects in JOSS discovered that 14 of the GitHub repositories linked from JOSS papers had citation files of some sort, with 2 being CITATION.cff files.
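The kind of tally described above can be sketched in a few lines of Python. This is not the analysis code used at the hack day. The function name `classify`, the rule for what counts as a citation file “of some sort”, and the repository names and file listings are all hypothetical.

```python
from collections import Counter


def classify(filenames):
    """Classify a repository by the citation file found in its file listing."""
    # The file name CITATION.cff is matched exactly, including case.
    if "CITATION.cff" in filenames:
        return "cff"
    # Assumption: any other file whose base name is "citation" (in any case)
    # counts as a citation file of some sort.
    if any(name.lower().split(".")[0] == "citation" for name in filenames):
        return "other"
    return "none"


# Hypothetical repositories and top-level file listings, for illustration only.
repos = {
    "example/alpha": ["README.md", "CITATION.cff", "setup.py"],
    "example/beta": ["README.md", "CITATION"],
    "example/gamma": ["README.md", "LICENSE"],
    "example/delta": ["CITATION.md", "src"],
}

summary = Counter(classify(files) for files in repos.values())
print("CITATION.cff:", summary["cff"])
print("other citation file:", summary["other"])
print("no citation file:", summary["none"])
```

Repositories in the last two groups would be the candidates for an automated pull request with a pre-populated CFF file.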

Find the Discover software published in JOSS which does not have a CITATION.cff file project on GitHub.

Learning and exploring

Learning, exploration, and discussion around the Citation File Format and software citation in general were also valid hacks. Our impression, supported by feedback, was that those who had not worked on a “product-oriented” hack used the opportunity to do just that during the day.

Conclusion

The hack day saw overwhelming interest and enthusiasm from all participants. Software citation is a pressing issue: it must be properly implemented in order to award Research Software Engineers and other creators of research software their due credit, and it is a requirement for linkage, discovery, reproducibility, and provenance analysis of research software. The hack day seems to have shown that the Citation File Format is a suitable starting point for the software citation workflow, and we hope to run similar events in the future.