Software Metrics—why and how?

By Neil Chue Hong, Software Sustainability Institute; Daniel S. Katz, University of Illinois Urbana-Champaign; Thomas Kluyver, University of Southampton; David Mawdsley, University of Manchester; Patrick McSweeney, University of Southampton; and Geraint Palmer, Cardiff University.

This post is part of the Collaborations Workshops 2017 speed blogging series.

Software is important to research. Whether you think software is a primary product of research or not—or indeed not yet—it’s clear that a lot of researchers rely on a lot of pieces of software. From short, ill-planned, thrown-together temporary scripts to solve a specific problem, through an abundance of complex spreadsheets analysing collected data, to the collaborative and well-structured efforts of hundreds of software engineers and millions of lines of code behind international projects such as the Large Hadron Collider and the Square Kilometre Array, there are few areas of research where software does not have a fundamental role. Given that reliance, it’s surprising that there hasn’t been more research or guidance to help researchers understand the software itself, from its usage to its impact, and from its quality, reusability and sustainability to how much it should be rewarded.

Why measurement is useful

Providing quantitative indicators of the characteristics of software allows us, in principle, to make an objective evaluation of it. This assumes that we have a metric (or metrics) that measures what we want it to measure, and that the metrics aren't vulnerable to manipulation.

In this blog post, we discuss some of the software metrics that may be useful, and what they might tell us about a piece of software, as well as some of the things we would like to measure but don’t yet know how to.

The importance associated with a metric leads to the risk of gaming: the temptation of individuals to "chase the metric," rather than focus on producing good software.  This is a  form of Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."

Despite the imperfect nature of metrics, they allow us to compare software rapidly and can be a useful tool in evaluating the usage and quality of software.  

Why measure software

At a high level, there are two main reasons we measure software: to understand (or decide on) its usage, and to reward its contributors.

Understanding software usage

When faced with a choice between a number of specialised software packages for use in research, what factors determine how we make the decision? For example, when implementing neural networks in Python, should I try PyBrain or scikit-learn? Which choice meets the particular needs of my project? (Do I need speed, specific functionality, ease of use?) Some form of software metric would help in this decision.

Other users and uses of these software metrics go beyond carrying out the research itself: when reviewing a research paper that makes use of a particular software package, how can I be confident that this package is appropriate for the work, or whether there were better options that should have been explored? For article references, the number of citations, keywords, and the quality and reputation of the journal can do this job, but another form of metric is required for software citation. This may also be a concern for funding bodies, governments, and other institutions to which scientific work is accountable.

As researchers and academics there are other considerations too, where the use of software metrics can be beneficial: which projects should I spend time contributing to and imparting knowledge and skills to? Which software is appropriate for use in education, and will my teaching using this software be valuable for my students?

Rewarding software contributors

Software developers in academia have additional motivations for measuring their software. Academia is a very merit-driven working environment, and how that merit is recognised affects the developer's value in the organisation. This can be as simple as determining whether the developer is eligible for promotion, or as complex as the social problem of how developers are perceived by their peers. Different disciplines have very different attitudes to the value of technical staff. In areas where technical staff feel underappreciated, metrics are crucial for demonstrating to their colleagues why developers are valuable. Obviously numbers aren't the whole story, but they are an important piece of the puzzle.

Aside from the personal aspect, metrics can be used to drive good behaviours. Because you get what you measure, at least to some extent, it is important that we associate personal reward with metrics that result in good research software. Rewarded metrics can motivate a research software engineer to write those few extra tests or to spend a few extra hours tidying up or enhancing the documentation. Ultimately, good research software should make for better and more repeatable research.

It is worth mentioning that the downside of measuring can be seen all around us in research. The way impact has been measured in the UK REF (the Research Excellence Framework used in 2014 for assessing the quality of research in UK higher education institutions) has had a huge impact on how researchers and academics are perceived by their institutions. Combined with stiff competition and fixed-term contracts, this leaves researchers in a precarious position: at the next contract break, they can be replaced by someone with higher numbers. However, the downside of not measuring would be academics with no motivation to publish the outputs of their research, and thus no advancement of the research area.

Actual measurements

There are a number of tools that already produce metrics about software. GitHub stars are a very convenient indicator of popularity for many open source projects, but projects that aren’t hosted on GitHub don’t have any comparable metric. There’s also a risk that people might read too much into popularity metrics if other ways of assessing a project aren’t as convenient.

Debian and Ubuntu run an opt-in reporting system known as 'popularity contest,' or popcon. This shows what proportion of the systems reporting data have a given package installed. However, this is specific to these platforms, to system-managed packages, and to users who have opted in to share data, so it may not be useful for domain-specific research software with a small target audience.

Libraries.io is another effort to assess open-source software more generally. It assesses a project's popularity by how many other projects make use of it, combining this with factors based on analysing the project itself (does it have a license? has it released multiple versions?) into a metric it calls 'SourceRank.'
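As an illustration, a SourceRank-style score can be sketched as a simple sum of project-health signals. The signals and weights below are made up for illustration; they are not Libraries.io's actual formula:

```python
import math

# A toy sketch of a SourceRank-style score: sum simple project-health
# signals. Signals and weights here are hypothetical, for illustration.
def sourcerank_like(project):
    """Score a project from a dict of simple health/popularity signals."""
    score = 0
    score += 1 if project.get("has_license") else 0
    score += 1 if project.get("has_readme") else 0
    score += 1 if project.get("releases", 0) > 1 else 0
    # Popularity on a log scale: each order of magnitude of dependent
    # projects adds one point, so one huge project can't dwarf the rest.
    score += int(math.log10(project.get("dependents", 0) + 1))
    return score

print(sourcerank_like({"has_license": True, "releases": 3, "dependents": 1200}))  # prints 5
```

Log-scaling the dependent count is one way to keep a single wildly popular dependency from drowning out the basic hygiene signals.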

Depsy is targeted at research software, trying to produce metrics similar to the citation counts of academic papers. Like Libraries.io, it captures data from software dependencies,  but it also scrapes academic papers for citations or mentions of the software in question. Depsy explicitly tries to address the question of academic rewards for software, even portioning out the impact of a piece of software to its committers to assemble impact scores for developers.
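Depsy's idea of portioning a package's impact out to its committers can be sketched as a proportional split over commit counts. This is an illustrative simplification; Depsy's actual formula may weight contributions differently:

```python
# A toy sketch of splitting a package's impact score across committers
# in proportion to their commits (not Depsy's actual formula).
def split_impact(impact, commits_by_author):
    """Return each author's share of `impact`, weighted by commit count."""
    total = sum(commits_by_author.values())
    return {author: impact * n / total
            for author, n in commits_by_author.items()}

print(split_impact(10.0, {"alice": 30, "bob": 10}))  # {'alice': 7.5, 'bob': 2.5}
```

Summing an author's shares across every package they contribute to would then give a developer-level impact score.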

In addition, a number of properties about software that we would like to measure are not currently measurable. These properties can be thought of as properties of the overall scientific ecosystem, or properties of the software ecosystem, or properties of specific software.

One example of a property belonging to the overall scientific ecosystem is how a library that is used in a number of software packages contributes to scientific papers, which could be measured by transitive credit. More generally, understanding software impact is very difficult. Information and library scientists have worked for many years and made progress in understanding the impact of one research paper on another, but this is not currently possible for software, since software citations are not common in research papers, and more importantly, software citations are not common in other software. We don't even really have a metadata structure for publishing software that permits citations. For example, if one publishes software in Zenodo, there is no metadata for citations of other software.
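The transitive-credit idea can be sketched with a toy dependency graph: a paper passes a fraction of its credit to each software package it relies on, which passes fractions on to its own dependencies in turn. The graph and credit fractions below are made up for illustration:

```python
# A toy sketch of transitive credit: propagate a fraction of a paper's
# credit down its software dependency graph. Graph and weights are
# hypothetical.
def transitive_credit(node, credit, deps, weights, out=None):
    """Accumulate credit for `node`, then pass weighted shares downstream."""
    out = {} if out is None else out
    out[node] = out.get(node, 0.0) + credit
    for dep in deps.get(node, []):
        transitive_credit(dep, credit * weights[(node, dep)], deps, weights, out)
    return out

deps = {"paper": ["pkgA"], "pkgA": ["libB"]}
weights = {("paper", "pkgA"): 0.2, ("pkgA", "libB"): 0.5}
print(transitive_credit("paper", 1.0, deps, weights))
```

Here the paper keeps its full credit, the package it uses directly receives 0.2, and the library underneath receives 0.2 × 0.5 = 0.1. Making this real would require exactly the citation metadata the paragraph above notes is missing.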

For properties related to the software ecosystem, we need collaboration between different elements of the ecosystem.  For example, we might want to know which tools are used together, which would have to be tracked through more than one tool, or at a system level.  In either case, some agreement would be needed about how to do this tracking and how to collect the data.

An example of a property belonging to the specific software is to track how often a software package is actually used, which would require some runtime measurement and data collection, leading to both practical overhead and privacy concerns. This is done by software such as GridFTP, BoneJ, and PsychoPy.  Similarly, we could measure how often software is built, potentially by adding a curl command in the build script, as is done by OpenQBMM.
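The build-counting trick can be sketched in Python rather than a curl call. The counter URL below is a placeholder, not a real endpoint, and the key design point is that a network failure must never break the build itself:

```python
# A hypothetical build-count hook, analogous to the curl-in-build-script
# approach mentioned above. COUNTER_URL is a placeholder endpoint.
import urllib.request

COUNTER_URL = "https://example.org/build-counter"

def notify_build(version, timeout=5):
    """Ping the counter; report success, but swallow network failures
    so a dead or slow counter can never fail the build."""
    try:
        urllib.request.urlopen(f"{COUNTER_URL}?version={version}", timeout=timeout)
        return True
    except OSError:
        return False

notify_build("1.2.3")
```

Any real deployment would also need an opt-out and a privacy notice, given the concerns the paragraph above raises.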

Wrap-up

In this short trip through the world of software metrics, it's clear we've only touched on their use. What do you use to choose between the software packages you use, or the software you're going to spend time contributing to? How do you want the software you produce to be measured and rewarded? Answers in the comments below!

Posted by s.aragon on 9 May 2017 - 10:43am

What is the importance/the impact of a research finding?

As a caveat, I prefer to program using Java and not to rely on third party libraries or software. I develop using a modern IDE (I tend to use Netbeans, but will use Eclipse from time to time) and am most happy with the Maven approach to building and managing libraries.

When starting out on a new project I will tend to draw on my own generic library (https://github.com/agdturner/generic) as well as any higher level abstracted library with particular use for the type of project I am doing. Often I will make use of my grids (https://github.com/agdturner/grids) and vector (https://github.com/agdturner/vector) libraries.

For various projects I have made use of third party software. Some of these, for example GeoTools (http://www.geotools.org/), rely on a large number of other third party libraries. The licences for all the software need to be compatible, and the dependency needs to be worth having, rather than writing the code myself, to warrant including the software in the solution I am developing. I prefer actively developed and/or very stable software that does what I want it to do, or something very close. I will contribute to the development of that third party offering if that seems like the best thing to do.

I use some software that is not open source and whose code base I am unlikely to contribute to (e.g. Google Docs). Often these are just the best tools I have found, and they help me with things that are not part of the automated workflows that are vital for reproducibility. For a quick one-off job that is not part of a scientific workflow per se, I might use other, more common software which has few third party dependencies.

So, how would I like my software to be evaluated and rewarded? Well, I think it should be rewarded more the more useful it is. So if findings rely on it and those findings have impact, then it would be good to translate that into a reward and recognition metric. Resources could fuel further development, though this should perhaps be steered by research projects. GeoTools was originally developed as research software, but now it supports a lot more than that and is a fairly successful open development.

Thanks for the post and for the questions. I look forward to reading some other replies. Sorry I've not really given any clear answers.

Submitted by Adrian Jackson on 10 May 2017 - 6:32pm


No answers I'm afraid, but plenty of questions...

Outputs. It would be useful if we could link outputs to software properly (i.e. to papers, patents, etc.). However, this is not straightforward, and I wonder where you draw the line on which software to acknowledge. Clearly, if my research depends on an application or software package (i.e. large amounts of my results were generated from that package) then it should be referenced somehow in my output. But that software will have relied on other things: other libraries, operating systems, compilers, command line tools, and so on. Should they also be referenced? And if I've bought and paid for some software, do I still have to reference it?

Usage. Simple download stats can be a useful starting point, although usage is more meaningful; it's hard to collect, though, and you get into questions over whether counting the number of runs of an application is useful, whether you should be counting the number of threads/processes/cores used, whether you should care about the length of a run, and so on. A large number of downloads can show some interest in a software package, but you may have an application that's only downloaded a handful of times and is still used by a large number of people (i.e. installed on a large, multi-user system). I guess I'd be inclined to have a hybrid metric that combines some estimation of usage/widespread adoption (i.e. downloads, stars, etc.) with mentions in papers, web pages, and blog posts, as a starting point for measuring software.

Algorithms or implementations?  Clearly software packages or applications that are widely used as a tool for science need recognition in some way.  But a lot of these packages are implementing algorithms created/defined/refined by others, so where should the reward/recognition lie?  With the implementers following someone else's recipe or with the original cooks?

Choices. Often there are multiple choices for the same functionality: for instance, the BLAS numerical libraries. It is important both that this general functionality is available and that there are optimised versions that reduce my runtime and improve the amount of science I can get from a given amount of compute time. Would this be acknowledged and tracked as BLAS, or as all the different implementations of BLAS that people use?
