Towards Software Non-Creator-Instigated Identification (NCI) and Citation

Posted by n.chuehong on 13 March 2019 - 9:00am

Image courtesy of Kazareith.

By Daniel S. Katz, Daina Bouquin and Neil Chue Hong.

This blog post was originally published in Daniel S. Katz's blog.

Identification of software is essential to a number of important issues, such as citation, provenance, and reproducibility. Here, we are focusing on issues related to citation. Identification can be thought of as a subset of naming. Some important questions are therefore: How do we name things? How do we know how things are named? And who gets to name things?

When we name something, we give it identity. Identity allows us to use the name for a unique object or class of objects, and as a result allows distinction and classification. It also enables the connection of the thing being named to the wider sense of how it is understood: other metadata in many cases. Naming something may also give it legitimacy, depending on who is naming the object(s).

Some types of entities, for example, people, are named by their creators (their parents). People can also, at some point, name themselves. We can call this creator identification and self-identification. These names are legally stored in government systems, and informally stored in the collective consciousness of those who know or know about a person. In some sense, papers published through publishers are also identified by their creators, though here, while the creator starts the registration process, the publisher plays a key role as well and actually assigns the name, which is attached to the paper itself and also stored in the DOI system. To include this type of naming, we’ll call this category creator-instigated identification or CII.

Other types of entities are named by their discoverer, such as new atomic elements, new organisms, and geographic or astronomical objects or features. These names are stored by organizations set up within a society for this purpose, and in some cases, the concept of a “discoverer” is specific to that society. Many examples of this occurred in colonialism, where geographic features in the Americas were “discovered” and (re)named by and in Western European society. We can call this non-creator-instigated identification or NCI.

The difference between CII and NCI can be used to enforce a particular meaning for the same entities. For instance, Hispanic, a descriptor invented by the US federal government, has a very different meaning than Latino, a name that the group it describes claimed for itself. There may still be confusion between two CII terms whose perceived meaning overlaps, for instance “climate change” and “global warming”, or between two NCI terms where there is not yet agreement on which term should be used, for instance hahnium and dubnium.

Software identification

If we think about software, we have a mix of situations. In the first, the software creator can identify software by giving it a name, and the specific version can be identified in the context of text by the name along with some version information, such as a version number, a release date, or a software repository commit hash. This identifier can be formally stored in the software repository or more informally stored in the software ecosystem (added to the software itself, listed on a web page, mentioned in the text of a paper, etc.). A creator who wants to identify a permanently archived version of the software can use an archival repository (e.g., Zenodo, figshare) to generate a unique identifier, which will be stored in the DOI system. These are all examples of creator-instigated identification (CII). The FORCE11 Software Citation Principles were written with these examples in mind, suggesting the use of archival repositories to make citations long-lived and dereferenceable to an archived version of the cited software.

The other identification situation, non-creator-instigated identification (NCI), also clearly applies to software. A user of a piece of software developed by others might want to procure an identifier from a taxonomic system with legitimacy in their field, to allow use of the software to be recorded. If the software is open source, they could generate an identifier for it by submitting that software to an archive, but probably should not, since this act of archiving is tied to authorship, even though the act of identification does not need to be. And for closed source software, archiving the software might not even be possible. So what can be done?

At least two different approaches to this exist, for two different disciplines. In life sciences, the SciCrunch registry allows the registration of RRIDs for a number of types of resources (both physical and digital objects), including software. And in astronomy, there’s ASCL, which started as a repository for software, and in 2010 changed to be a registry as well. (When creators actually deposit software in ASCL, the software is assigned a DOI in addition to an ASCL ID; here we focus on ASCL as a registry, where it assigns an ASCL ID to software that is registered by both creators and non-creators.) There may be other such registries as well, since this issue of how to obtain an identifier for something someone else created is not a problem just in these disciplines.

However, the fact that there are at least two independent registries for software leads to potential problems. For example, a math library that is used in both astronomy and life sciences might be registered in both. Which record should someone working in a different discipline, for example, high energy physics, use? It’s not clear that there is a general answer that holds across fields, though there may be a preferred answer within high energy physics.

Who’s in charge?

From the field of library science, one relevant concept is authority control, which is a process associated with organizing bibliographic information. For instance, library catalogs use a single, distinct spelling of a name or a numeric identifier for each topic. These unambiguous concepts are defined through community consensus and subsequently incorporated into an “authority file” that is maintained with updates and logical linkages to other files. This network of linkages represents the authoritative structure of the information and enables consistency throughout and between catalogs, in addition to supporting faceted searching.

However, the problem then becomes: who decides what is “authoritative”? One option is a creator, but given that we are talking about non-creator identification, we need a different answer in this case. Another option is a discipline-specific registry, which is where we seem to be in life sciences and astronomy. A third option is an institutional registry, which again doesn’t make sense for NCI. Finally, there could be a single authority across all software, but this is also problematic, as it’s hard to imagine a “software registry authority” being accepted by the software community in the near future. Essentially, the go-to approach does not apply to NCI software.

From an information science point of view, it would be nice if all registries had the same APIs and used the same schema, which would enable them to be federated, and this federation could support a global authority. But even this would not fully address other complexities that multiple registries introduce to software identification, such as deduplication / reconciliation, ownership and legitimacy of claim, and updating and synchronisation of metadata. However, for the purpose of this blog, where we are concerned with citation, having just one identifier makes citation and counting citations easy. Multiple identifiers can also be OK; they just make counting citations harder.
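To make the counting problem concrete, here is a minimal Python sketch of how citations might be tallied when one piece of software is known under several identifiers. It assumes someone has already built an alias table that maps each identifier to a single canonical name, which is exactly the reconciliation work that is hard in practice. All identifiers, registry names, and software names below are hypothetical placeholders, not real records.

```python
from collections import Counter

# Hypothetical alias table: each identifier maps to one canonical software name.
# Building and maintaining such a table is the hard (reconciliation) part.
ALIASES = {
    "registry-a:0001": "examplelib",
    "registry-b:R_42": "examplelib",
    "archive:10.0000/example.1": "examplelib",
    "registry-a:0002": "othertool",
}


def count_citations(cited_ids, aliases):
    """Count citations per software package, merging known identifiers.

    Identifiers missing from the alias table are counted under their own
    name, so unreconciled records stay visible instead of being dropped.
    """
    counts = Counter()
    for identifier in cited_ids:
        counts[aliases.get(identifier, identifier)] += 1
    return counts


# Hypothetical list of identifiers found in the reference lists of papers.
citations = [
    "registry-a:0001",
    "registry-b:R_42",
    "archive:10.0000/example.1",
    "registry-a:0002",
    "registry-c:unknown",
]
totals = count_citations(citations, ALIASES)
for name, n in totals.items():
    print(name, n)
```

With a single identifier per package the alias table is unnecessary. With several, the count is only as good as the table, and any identifier nobody has reconciled ends up as its own separate entry.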

Issues with multiple records

In addition to the possibility of having a single piece of software with records in multiple registries, software can have many different records, from registries, repositories, version control systems, and papers. There are examples of software packages with multiple types of records, for example, a record in a registry and another record in a separate archival repository, where those records do not reference each other. For example, AstroPy has both an ASCL record and a Zenodo record, where the ASCL record does not link to the Zenodo record and vice versa. For other software, though, the ASCL record may link to other proxies for the software (e.g. papers, websites) and the Zenodo record may have multiple versions that are not represented in ASCL.
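As a rough illustration of what checking for missing cross-references could look like, the Python sketch below groups records by the software they describe and reports pairs where neither record links to the other. The Record structure, its field names, and every identifier are assumptions made up for this example. They do not reflect the actual schema of ASCL, Zenodo, or any other registry or repository.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Record:
    """Hypothetical, simplified view of a registry or repository record."""
    record_id: str
    software: str  # assumes records are already matched to a software name
    links: set = field(default_factory=set)  # ids of other things it links to


def unlinked_pairs(records):
    """Return pairs of record ids for the same software where neither
    record links to the other."""
    by_software = {}
    for record in records:
        by_software.setdefault(record.software, []).append(record)

    pairs = []
    for group in by_software.values():
        for a, b in combinations(group, 2):
            if b.record_id not in a.links and a.record_id not in b.links:
                pairs.append((a.record_id, b.record_id))
    return pairs


# Hypothetical records: the first pair does not cross-reference, the second does.
records = [
    Record("registry:ex-1", "examplelib", {"paper:ex"}),
    Record("archive:ex-1", "examplelib"),
    Record("registry:ot-1", "othertool", {"archive:ot-1"}),
    Record("archive:ot-1", "othertool"),
]
missing = unlinked_pairs(records)
print(missing)
```

The sketch sidesteps the real difficulty by assuming that records have already been matched to the same software name. Without a shared schema or identifier, that matching step is itself a reconciliation problem.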

The issue of having multiple representations of a piece of software, each being cited, is worth discussing on its own, but registries add an extra abstraction layer and it’s not clear how these records should be curated or cross-referenced. For instance, many institutions curate bibliographies. How should bibliographers deal with registry records? Should a registry record be included in an institutional bibliography? An ASCL record for software may just contain links to papers that are already included in a bibliography. In cases like this the ASCL record could be seen as a duplicate record, but the ASCL record can also gain citations independent from the papers it links out to, which makes these records challenging to reconcile.

Moving forward

Perhaps these issues and others can be examined by the FORCE11 Software Citation Implementation Working Group’s Repository Best Practice Taskforce.

Examples of some other open questions about NCI software include:

Under what specific circumstances would it be appropriate to create a record in a registry for software that you did not create?
When creating a software registry record, how should we determine authorship and determine how to describe the software? (Note that the authors generally cannot be determined without their participation, and for many projects, including commercial software, the authors should probably be listed as the project name or company name. In talks, Dan has referred to this as software creation metadata, as it is associated with the creation of the software and can best be determined by the creators.)
How do we determine what should be associated with, or linked to/from, a software registry record? (e.g., papers, manuals, websites)

There are also open questions about how to address the software’s provenance and relationships between registry records and registries themselves:

How do we define the roles of the person who registers a piece of software, the person who created the software, and the person who manages the registry? And what do we do when they conflict?
Should software creators be able to edit registry records that were made by others? What are the potential intellectual property implications?
How should we document software versions and dependencies in registry records?
How should registry records be curated and cross-referenced?

The goal of this blog has been to expose distinctions between creator-instigated and non-creator-instigated software identification. While the former is well-understood and well-served by existing practices and tools, we’ve aimed to highlight the partial solutions that exist for the latter, and the need for work to formalize and standardize this area.

Thanks to Alice Allen, Tom Gillespie, and Anita Bandrowski for ideas and comments on this blog and a preceding document.