Hacking the future of software citation

Posted by s.aragon on 9 November 2017 - 10:22am

By Stephan Druskat (Humboldt-Universität zu Berlin, Germany)

On 26 October 2017, the Force11 Software Citation Implementation Working Group and Force11 Hackathon took place at the Force2017 Conference in Berlin, Germany, led by Neil Chue Hong (Software Sustainability Institute), Lars Holm Nielsen (Zenodo) and Martin Fenner (DataCite). Participants spent a full day exchanging ideas, discussing, planning, and hacking towards implementations of the software citation workflow.

Software citation is at the very heart of the process of creating recognition for software as a scholarly product, and finding and implementing a working solution for this issue is key to attribution and credit for the creators of scientific software. At the same time, making software citable helps to unlock its potential for sustainability: it boosts the software’s accessibility and fosters its persistence by encouraging the tagging of published versions with unique identifiers, such as DOIs. Software citation thus also becomes a natural path to the reproducibility of research results in the context of open science, where not only the data but also the complete toolchain employed in a research endeavour is made openly available.

Software citation is still hard

However, software citation is hard, for diverse reasons. It is not supported by tools (e.g., BibTeX does not have a dedicated software type), and it is not supported by current practices: journals may not encourage software items in reference lists; researchers and research software engineers do not make their software citable; and researchers perceive software as a black box for citation purposes. As a result, software citation practices diverge wildly, from not naming the software at all to citing websites, papers, concepts, etc., instead of the software versions themselves.

The change of scholarly culture needed to tackle the latter issue has been kickstarted by the Force11 Software Citation Working Group, which took up its work in early 2015 and completed it in late 2016 by publishing the Software Citation Principles paper [1].

The former issue - the lack of tooling for implementations of the software citation principles and suitable workflows - is being tackled as well: the Force11 Software Citation Implementation Working Group has been set up to endorse the software citation principles, develop guidelines for implementing them, help implement them, and test specific implementations. The hackathon at Force2017 touched on the majority of these aspects.

Hacking software citation = discussing software citation

The hackathon was attended by around 30 people from a number of different projects, including Zenodo, DataCite, Crossref, the Citation File Format, eLife, JATS4R, Fidus Writer, swMath, Software Heritage, Software Sustainability Institute, Substance.io, and others. Interestingly, though perhaps naturally, it featured a lot more discussion than you would expect from a typical hackathon (it was my first, so it’s hard for me to tell). In the spirit of one of the working group’s self-assigned tasks - to collect feedback for further iterations of the software citation principles - many of the questions discussed and issues identified concerned authorship, because authorship of software is dynamic and changes over time:

1. How do agents such as repositories or archives, including for example Software Heritage and swMath, deal with authorship? Do these agents decide authorship? Where and how is authorship information extracted?
2. Who should - more generally - determine authorship: The governing instance (“owners”) of a software? The people who reference the software (“users”)? The machine-extractable facts (e.g., “contributors” to a software as per a commit history or similar)?
3. Should any one instance of a software (e.g., a specific version) be assigned only one single identifier (“deduplication”)? Or should different agents be able to access that instance via different identifiers?

Ultimately, the more philosophical questions (e.g., 2 and 3) need to be addressed and answered by the community, and should arguably find their way back into the software citation principles.

Hacking software citation = hacking the software citation workflow

Following the discussion, the proposed hacks - previously collected in the working group’s issue tracker on GitHub as well as on site - were presented and embedded in a model of the software citation workflow, which approximates the following.

A code repository produces a metadata file in a format that can be populated automatically (e.g., codemeta.json); information in the file, for example author names, is only assumed (via, e.g., an ORCID iD) but not checked.
codemeta.json can be used to produce different exportable formats.
A service generates references in different styles, powered by, e.g., the Citation Style Language.
Repositories can extract the information from the metadata file (codemeta.json).
Publishers are able to create software citations in JATS XML from the metadata file.
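
To make the first steps of this workflow more concrete, here is a minimal sketch in Python of how a codemeta.json record could be turned into a formatted reference and a BibTeX entry. It is not code from the hackathon: the record is hand-written with placeholder values (including the DOI and the ORCID iD), the function names are hypothetical, and the reference style is a simple ad hoc one rather than a real Citation Style Language style. Because BibTeX has no dedicated software type, the sketch falls back to @misc.

```python
import json

# Hand-written example record. Property names follow CodeMeta/schema.org
# conventions; all values (names, DOI, ORCID iD) are placeholders.
CODEMETA_JSON = """
{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "name": "Example Tool",
  "version": "1.2.0",
  "datePublished": "2017-10-26",
  "identifier": "https://doi.org/10.5281/zenodo.0000000",
  "author": [
    {"@type": "Person", "givenName": "Jane", "familyName": "Doe",
     "@id": "https://orcid.org/0000-0000-0000-0000"},
    {"@type": "Person", "givenName": "John", "familyName": "Roe"}
  ]
}
"""


def format_authors(authors):
    """Render authors as 'Family, G.' separated by commas."""
    names = []
    for person in authors:
        initials = "".join(part[0] + "." for part in person["givenName"].split())
        names.append(f'{person["familyName"]}, {initials}')
    return ", ".join(names)


def to_reference(meta):
    """Build a plain-text reference in a simple, ad hoc style."""
    year = meta["datePublished"][:4]
    return (
        f'{format_authors(meta["author"])} ({year}). {meta["name"]} '
        f'(Version {meta["version"]}) [Computer software]. {meta["identifier"]}'
    )


def to_bibtex(meta, key="example_tool"):
    """Build a BibTeX @misc entry, since BibTeX lacks a software type."""
    authors = " and ".join(
        f'{p["familyName"]}, {p["givenName"]}' for p in meta["author"]
    )
    fields = {
        "author": authors,
        "title": meta["name"],
        "year": meta["datePublished"][:4],
        "note": f'Version {meta["version"]}',
        "url": meta["identifier"],
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{key},\n{body}\n}}"


if __name__ == "__main__":
    record = json.loads(CODEMETA_JSON)
    print(to_reference(record))
    print(to_bibtex(record))
```

Note that the sketch takes the author information at face value, just as the workflow model does: nothing checks that the ORCID iD actually belongs to the named person.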

Apart from the concrete workflow hacks, there were also at least two catch-all proposals: How are people using software and citing it? Is CodeMeta usable for the lifecycle, and can its documentation be optimized?

The above outline clearly highlights the role of CodeMeta as the emerging standard for capturing research software metadata. As a Rosetta stone for software metadata that provides comparability via a crosswalk table as well as an implementation in JSON/JSON-LD, it has secured potential support from a large number of agents in different roles within the software citation ecosystem (repositories, publishers, harvesters). Implementation efforts are appropriately concentrating around it, and it will arguably be at the heart of a software citation workflow based on the software citation principles. Efforts to align other formats with CodeMeta, such as JATS XML and the Citation File Format, strengthen its approach significantly.

Therefore, a number of hacks completed at the hackathon focused on CodeMeta:

Issues in the current schema have been identified (e.g., GitHub issues #166, #167, #169, #171, #174) and solutions are currently being discussed;
Crosswalk properties have been adapted from the DataCite Metadata schema 4.1;
Crosswalk properties for the Citation File Format have been added to the CodeMeta crosswalk table, which makes the user-centred, standardized format for CITATION files a suitable low-threshold source format for CodeMeta JSON files.
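
As an illustration of what such a crosswalk enables, the following Python sketch converts a small Citation File Format record into a CodeMeta-style dictionary. The key mapping is an illustrative subset written for this post and is not copied from the actual crosswalk table; the function name and all values are hypothetical, and the CFF record is given as an already parsed dictionary to avoid a YAML dependency.

```python
import json

# Illustrative subset of a CFF -> CodeMeta key mapping (an assumption,
# not the authoritative crosswalk table).
CFF_TO_CODEMETA = {
    "title": "name",
    "version": "version",
    "date-released": "datePublished",
    "repository-code": "codeRepository",
    "license": "license",
    "doi": "identifier",
}

# Same idea for the keys of a single author entry.
CFF_AUTHOR_TO_CODEMETA = {
    "given-names": "givenName",
    "family-names": "familyName",
    "orcid": "@id",
}


def cff_to_codemeta(cff):
    """Translate the mapped CFF keys; unmapped keys are silently dropped."""
    codemeta = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
    }
    for cff_key, codemeta_key in CFF_TO_CODEMETA.items():
        if cff_key in cff:
            codemeta[codemeta_key] = cff[cff_key]
    authors = []
    for person in cff.get("authors", []):
        entry = {"@type": "Person"}
        for cff_key, codemeta_key in CFF_AUTHOR_TO_CODEMETA.items():
            if cff_key in person:
                entry[codemeta_key] = person[cff_key]
        authors.append(entry)
    if authors:
        codemeta["author"] = authors
    return codemeta


# Placeholder record, as it might look after parsing a CITATION file.
EXAMPLE_CFF = {
    "cff-version": "1.0.3",
    "title": "Example Tool",
    "version": "1.2.0",
    "date-released": "2017-10-26",
    "authors": [
        {"given-names": "Jane", "family-names": "Doe",
         "orcid": "https://orcid.org/0000-0000-0000-0000"},
    ],
}

if __name__ == "__main__":
    print(json.dumps(cff_to_codemeta(EXAMPLE_CFF), indent=2))
```

A table-driven conversion like this keeps the mapping itself as data, so updating it when the crosswalk changes would not require touching the conversion logic.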

The challenges for software citation need to be attacked from different sides, though, and thus other hacks included, for example, the implementation of support for JATS element-citation in the DataCite Content Negotiation, and the addition of an export of Crossref and DataCite DOI metadata to JATS element-style citations in the bolognese DOI metadata conversion library.
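
To give an impression of the target format, here is a rough Python sketch that builds a JATS element-citation for a piece of software from a small metadata dictionary. It does not reproduce the output of DataCite Content Negotiation or bolognese: the choice of child elements (data-title, version, pub-id) is an assumption about how a software citation might be tagged, and the function name, input keys, and values are hypothetical.

```python
import xml.etree.ElementTree as ET


def software_element_citation(meta):
    """Return a JATS <element-citation> for software as an XML string."""
    citation = ET.Element("element-citation", {"publication-type": "software"})
    group = ET.SubElement(citation, "person-group", {"person-group-type": "author"})
    for author in meta["authors"]:
        name = ET.SubElement(group, "name")
        ET.SubElement(name, "surname").text = author["family"]
        ET.SubElement(name, "given-names").text = author["given"]
    ET.SubElement(citation, "year").text = meta["year"]
    ET.SubElement(citation, "data-title").text = meta["title"]
    ET.SubElement(citation, "version").text = meta["version"]
    ET.SubElement(citation, "pub-id", {"pub-id-type": "doi"}).text = meta["doi"]
    return ET.tostring(citation, encoding="unicode")


# Placeholder metadata; the DOI is not a real identifier.
EXAMPLE_META = {
    "authors": [{"family": "Doe", "given": "Jane"}],
    "year": "2017",
    "title": "Example Tool",
    "version": "1.2.0",
    "doi": "10.5281/zenodo.0000000",
}

if __name__ == "__main__":
    print(software_element_citation(EXAMPLE_META))
```

Unlike a formatted reference string, an element-style citation keeps every part of the reference in its own element, which leaves the rendering to the publisher.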

In conclusion, the hackathon has been successful in highlighting open questions with regard to the software citation principles, while at the same time starting to clear the path towards their implementation. It has been amazing to see how much work - both theoretical and practical - can be achieved by simply getting interested parties in the same room and letting them focus on a specific issue for a few hours! Hopefully, more events like this will follow.

Everybody can contribute to the future of software citation, and you can start today by registering as a member of the Force11 Software Citation Implementation Working Group!

References

[1] Smith AM, Katz DS, Niemeyer KE, FORCE11 Software Citation Working Group. (2016) Software citation principles. PeerJ Computer Science 2:e86