Software Metrics—why and how?

By Neil Chue Hong, Software Sustainability Institute; Daniel S. Katz, University of Illinois Urbana-Champaign; Thomas Kluyver, University of Southampton; David Mawdsley, University of Manchester; Patrick McSweeney, University of Southampton; and Geraint Palmer, Cardiff University.

This post is part of the Collaborations Workshops 2017 speed blogging series.

Software is important to research. Whether you think software is a primary product of research or not—or indeed not yet—it’s clear that a lot of researchers rely on a lot of pieces of software. From short, ill-planned, thrown-together temporary scripts to solve a specific problem, through an abundance of complex spreadsheets analysing collected data, to the collaborative and well-structured efforts of hundreds of software engineers and millions of lines of code behind international projects such as the Large Hadron Collider and the Square Kilometre Array, there are few areas of research where software does not have a fundamental role. Given that reliance, it’s surprising that there hasn’t been more research or guidance to help researchers understand the software itself, from its usage to its impact, and from its quality, reusability and sustainability to how much it should be rewarded.

Why measurement is useful

Providing quantitative indicators of the characteristics of software allows us, in principle, to make an objective evaluation of it. This assumes that we have a metric (or metrics) that measures what we want it to measure, and that the metrics aren't vulnerable to manipulation.

In this blog post, we discuss some of the software metrics that may be useful, and what they might tell us about a piece of software, as well as some of the things we would like to measure but don’t yet know how to.

The importance associated with a metric leads to the risk of gaming: the temptation of individuals to "chase the metric," rather than focus on producing good software.  This is a  form of Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."

Despite the imperfect nature of metrics, they allow us to compare software rapidly and can be a useful tool in evaluating the usage and quality of software.  

Why measure software

At a high level, there are two main reasons we measure software: to understand (or decide on) its usage, and to reward its contributors.

Understanding software usage

When faced with a choice between a number of specialised software packages for use in research, what factors determine how we make the decision? For example, when implementing neural networks in Python, should I try PyBrain or scikit-learn? Which choice meets the particular needs of my project? (Do I need speed, specific functionality, ease of use?) Some form of software metric would help in this decision.

Other users and uses of these software metrics go beyond carrying out the research itself: when reviewing a research paper that makes use of a particular software package, how can I be confident that this package is appropriate for the work, or whether there were better options that should have been explored? For article references, the number of citations, keywords, and the quality and reputation of the journal can do this job, but another form of metric is required for software citation. This may also be a concern for funding bodies, governments, and other institutions to which scientific work is accountable.

As researchers and academics there are other considerations too, where the use of software metrics can be beneficial: which projects should I spend time contributing to and imparting knowledge and skills to? Which software is appropriate for use in education, and will my teaching using this software be valuable for my students?

Rewarding software contributors

Software developers in academia have additional motivations for measuring their software. Academia is a very merit-driven working environment, and how that merit is recognised affects the developer's value in the organisation. This can be as simple as determining whether the developer is eligible for promotion, or as complex as the social problem of how developers are perceived by their peers. Different disciplines have very different attitudes to the value of technical staff. In areas where technical staff feel underappreciated, metrics are crucial for demonstrating to their colleagues why developers are valuable. Obviously numbers aren't the whole story, but they are an important piece of the puzzle.

Aside from the personal aspect, metrics can be used to drive good behaviours. Because you get what you measure, at least to some extent, it is important that we associate personal reward with metrics that result in good research software. Rewarded metrics can motivate a research software engineer to write those few extra tests or to spend a few extra hours tidying up or enhancing the documentation. Ultimately, good research software should make for better and more repeatable research.

It is worth mentioning that the downside of measuring can be seen all around us in research. The way impact has been measured in the UK REF (the Research Excellence Framework used in 2014 for assessing the quality of research in UK higher education institutions) has had a huge impact on how researchers and academics are perceived by their institutions. Combined with stiff competition and fixed-term contracts, this leaves researchers in a precarious position: at the next contract break, they can be replaced by someone with higher numbers. However, the downside of not measuring would be academics with no motivation to publish the outputs of their research, and thus no advancement of the research area.

Actual measurements

There are a number of tools that already produce metrics about software. GitHub stars are a very convenient indicator of popularity for many open source projects, but projects that aren’t hosted on GitHub don’t have any comparable metric. There’s also a risk that people might read too much into popularity metrics if other ways of assessing a project aren’t as convenient.

Debian and Ubuntu run an opt-in reporting system known as 'popularity contest,' or popcon. This shows what proportion of the systems reporting data have a given package installed. However, this is specific to these platforms, to system-managed packages, and to users who have opted in to share data, so it may not be useful for domain-specific research software with a small target audience.

Libraries.io is another effort to assess open-source software more generally. It assesses a project's popularity by how many other projects make use of it, combining this with factors based on analysing the project itself (does it have a license? has it released multiple versions?) into a metric it calls 'SourceRank.'
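As an illustration, a SourceRank-style score can be sketched as a simple sum of project-health signals. The signals and weights below are made up for illustration; they are not Libraries.io's actual formula:

```python
import math

# A toy sketch of a SourceRank-style score: sum simple project-health
# signals. Signals and weights here are hypothetical, for illustration.
def sourcerank_like(project):
    """Score a project from a dict of simple health/popularity signals."""
    score = 0
    score += 1 if project.get("has_license") else 0
    score += 1 if project.get("has_readme") else 0
    score += 1 if project.get("releases", 0) > 1 else 0
    # Popularity on a log scale: each order of magnitude of dependent
    # projects adds one point, so one huge project can't dwarf the rest.
    score += int(math.log10(project.get("dependents", 0) + 1))
    return score

print(sourcerank_like({"has_license": True, "releases": 3, "dependents": 1200}))  # prints 5
```

Log-scaling the dependent count is one way to keep a single wildly popular dependency from drowning out the basic hygiene signals.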

Depsy is targeted at research software, trying to produce metrics similar to the citation counts of academic papers. Like Libraries.io, it captures data from software dependencies,  but it also scrapes academic papers for citations or mentions of the software in question. Depsy explicitly tries to address the question of academic rewards for software, even portioning out the impact of a piece of software to its committers to assemble impact scores for developers.
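Depsy's idea of portioning a package's impact out to its committers can be sketched as a proportional split over commit counts. This is an illustrative simplification; Depsy's actual formula may weight contributions differently:

```python
# A toy sketch of splitting a package's impact score across committers
# in proportion to their commits (not Depsy's actual formula).
def split_impact(impact, commits_by_author):
    """Return each author's share of `impact`, weighted by commit count."""
    total = sum(commits_by_author.values())
    return {author: impact * n / total
            for author, n in commits_by_author.items()}

print(split_impact(10.0, {"alice": 30, "bob": 10}))  # {'alice': 7.5, 'bob': 2.5}
```

Summing an author's shares across every package they contribute to would then give a developer-level impact score.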

In addition, a number of properties about software that we would like to measure are not currently measurable. These properties can be thought of as properties of the overall scientific ecosystem, or properties of the software ecosystem, or properties of specific software.

One example of a property belonging to the overall scientific ecosystem is how a library that is used in a number of software packages contributes to scientific papers, which could be measured by transitive credit. More generally, understanding software impact is very difficult. Information and library scientists have worked for many years and made progress in understanding the impact of one research paper on another, but this is not currently possible for software, since software citations are not common in research papers, and more importantly, software citations are not common in other software. We don't even really have a metadata structure for publishing software that permits citations. For example, if one publishes software in Zenodo, there is no metadata for citations of other software.
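The transitive-credit idea can be sketched with a toy dependency graph: a paper passes a fraction of its credit to each software package it relies on, which passes fractions on to its own dependencies in turn. The graph and credit fractions below are made up for illustration:

```python
# A toy sketch of transitive credit: propagate a fraction of a paper's
# credit down its software dependency graph. Graph and weights are
# hypothetical.
def transitive_credit(node, credit, deps, weights, out=None):
    """Accumulate credit for `node`, then pass weighted shares downstream."""
    out = {} if out is None else out
    out[node] = out.get(node, 0.0) + credit
    for dep in deps.get(node, []):
        transitive_credit(dep, credit * weights[(node, dep)], deps, weights, out)
    return out

deps = {"paper": ["pkgA"], "pkgA": ["libB"]}
weights = {("paper", "pkgA"): 0.2, ("pkgA", "libB"): 0.5}
print(transitive_credit("paper", 1.0, deps, weights))
```

Here the paper keeps its full credit, the package it uses directly receives 0.2, and the library underneath receives 0.2 × 0.5 = 0.1. Making this real would require exactly the citation metadata the paragraph above notes is missing.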

For properties related to the software ecosystem, we need collaboration between different elements of the ecosystem.  For example, we might want to know which tools are used together, which would have to be tracked through more than one tool, or at a system level.  In either case, some agreement would be needed about how to do this tracking and how to collect the data.

An example of a property belonging to the specific software is to track how often a software package is actually used, which would require some runtime measurement and data collection, leading to both practical overhead and privacy concerns. This is done by software such as GridFTP, BoneJ, and PsychoPy.  Similarly, we could measure how often software is built, potentially by adding a curl command in the build script, as is done by OpenQBMM.
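The build-counting trick can be sketched in Python rather than a curl call. The counter URL below is a placeholder, not a real endpoint, and the key design point is that a network failure must never break the build itself:

```python
# A hypothetical build-count hook, analogous to the curl-in-build-script
# approach mentioned above. COUNTER_URL is a placeholder endpoint.
import urllib.request

COUNTER_URL = "https://example.org/build-counter"

def notify_build(version, timeout=5):
    """Ping the counter; report success, but swallow network failures
    so a dead or slow counter can never fail the build."""
    try:
        urllib.request.urlopen(f"{COUNTER_URL}?version={version}", timeout=timeout)
        return True
    except OSError:
        return False

notify_build("1.2.3")
```

Any real deployment would also need an opt-out and a privacy notice, given the concerns the paragraph above raises.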

Wrap-up

In this short trip through the world of software metrics, it's clear we've only touched on their use. What do you use to choose between the software packages you use, or the software you're going to spend time contributing to? How do you want the software you produce to be measured and rewarded? Answers in the comments below!

Posted by s.aragon on 9 May 2017 - 10:43am

What is the importance/the impact of a research finding?

As a caveat, I prefer to program using Java and not to rely on third party libraries or software. I develop using a modern IDE (I tend to use Netbeans, but will use Eclipse from time to time) and am most happy with the Maven approach to building and managing libraries.

When starting out on a new project I will tend to draw on my own generic library (https://github.com/agdturner/generic) as well as any higher level abstracted library with particular use for the type of project I am doing. Often I will make use of my grids (https://github.com/agdturner/grids) and vector (https://github.com/agdturner/vector) libraries.

For various projects I have made use of third party software. Some of these, for example GeoTools (http://www.geotools.org/), rely on a large number of other third party libraries. The licences for all the software need to be compatible, and the dependency needs to be worth having, rather than writing the code myself, to warrant including the software in the solution I am developing. I prefer actively developed and/or very stable software that does what I want it to do, or something very close. I will contribute to the development of that third party offering if that seems like the best thing to do.

I use some software that is not open source and whose code base I am unlikely to contribute to (e.g. Google Docs). Often these are just the best tools I have found, and they help me with things that are not part of the automated workflows that are vital for reproducibility. For a quick one-off job that is not part of a scientific workflow per se, I might use other, more common software which has few third party dependencies.

So, how would I like my software to be evaluated and rewarded? Well, I think it should be rewarded more the more useful it is. So if findings rely on it and those findings have impact, then it would be good to translate that into a reward and recognition metric. Resources could fuel further development, though this should perhaps be steered by research projects. GeoTools was originally developed as research software, but now it supports a lot more than that and is a fairly successful open development.

Thanks for the post and for the questions. I look forward to reading some other replies. Sorry I've not really given any clear answers.

Submitted by Adrian Jackson on 10 May 2017 - 6:32pm


No answers I'm afraid, but plenty of questions...

Outputs. It would be useful if we could link outputs to software properly (i.e. to papers, patents, etc.). However, this is not straightforward, and I wonder where you draw the line on which software to acknowledge. Clearly, if my research depends on an application or software package (i.e. large amounts of my results were generated from that package) then it should be referenced somehow in my output. But that software will have relied on other things: other libraries, operating systems, compilers, command line tools, and so on. Should they also be referenced? And if I've bought and paid for some software, do I still have to reference it?

Usage. Simple download stats can be a useful starting point, although usage is more meaningful; it's hard to collect, though, and you get into questions over whether counting the number of runs of an application is useful, whether you should be counting the number of threads/processes/cores used, whether you should care about the length of a run, and so on. A large number of downloads can show some interest in a software package, but you may have an application that's only downloaded a handful of times and is still used by a large number of people (i.e. installed on a large, multi-user system). I guess I'd be inclined to have a hybrid metric that combines some estimation of usage/widespread adoption (i.e. downloads, stars, etc.) with mentions in papers, web pages, and blog posts, as a starting point for measuring software.

Algorithms or implementations?  Clearly software packages or applications that are widely used as a tool for science need recognition in some way.  But a lot of these packages are implementing algorithms created/defined/refined by others, so where should the reward/recognition lie?  With the implementers following someone else's recipe or with the original cooks?

Choices. Often there are multiple choices for the same functionality: for instance, the BLAS numerical libraries. It is important both that this general functionality is available and that there are optimised versions that reduce my runtime and improve the amount of science I can get from a given amount of compute time. Would this be acknowledged and tracked as BLAS, or as all the different implementations of BLAS that people use?
