How many researchers rely on software? Want to know?

Posted by s.hettrick on 28 May 2014 - 2:20pm

By Simon Hettrick, Deputy Director.

At conferences and presentations, we are regularly asked two questions. The first is about how we plan to sustain ourselves (my advice to anyone setting up a new institute is not to include the word sustainability in its title). The second question is far more difficult to answer: how many researchers use software?

We talk of the research software community as the subset of the research community that relies on software for its research. According to the Higher Education Statistics Agency (HESA), in 2012/2013 there were almost 700,000 researchers in the UK (counting staff associated with research and postgraduate students), and we believe that many of them would be unable to conduct their research without software. It’s a big community then… but exactly how big? To our knowledge, no one has determined the size of the research software community – so we’ve decided to be the first. (If you would like to help, see the end of this post.)

What is the question?

One of the big problems is that it is difficult to frame the question we need to answer. If someone develops code that implements an algorithm to generate results, then most people would happily include that person in the research software community. At the opposite end of the spectrum, someone using a word processor to write a paper would be discounted, because they’re not using the software to generate results. What about a spreadsheet? There are some beautifully complex spreadsheets in research that use a whole host of bespoke macros, and there are some that are simply glorified calculators. And the debate doesn’t end there. In fact, if you pose the question “How many researchers rely on software to generate their results?” to a roomful of researchers and software developers, you will end up in a long argument about the definitions of, at the very minimum, five of those words.

Rather than get bogged down in semantics, we’re going to take a flexible approach. We will still be starting with the question “How many researchers rely on software to generate their results?”, but we will adapt that question as the research continues and we gain a greater understanding of the nuances of the research software community. We might have to lose some edge cases by focussing on a hard definition of some terms, or we might have to broaden the scope due to less tightly constrained data than we would like. Either way, this flexible approach will let us move closer to the answer without going all Groundhog Day over the question.

Approaches

We’re going to investigate a few approaches to estimate the size of the research software community and see how well their results agree. Possibly the most accurate method is to ask the research community about their software usage (obviously, not by asking everyone, but by conducting sample surveys and relying on the central limit theorem to put error bounds on the estimate). Data mining might help identify the people who rely on software via the grants they apply for, the papers they write and the jobs they advertise. We can study reports about research software to identify clues about the size of the community, and then build on those results. Finally, we can take some of the better-studied groups – like bioinformatics, say – determine the community size and extrapolate across the research community as a whole.
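
To make the survey approach concrete, here is a minimal sketch of how a sample result might be scaled up to the whole research community. It assumes a simple random sample and uses the normal approximation that the central limit theorem justifies. The function name and the survey figures in the usage lines are hypothetical, chosen purely for illustration; only the population of roughly 700,000 comes from the HESA figure quoted earlier.

```python
import math


def estimate_community_size(sample_size, software_reliant, population, z=1.96):
    """Scale a survey proportion up to a population, with a normal-approximation interval.

    Assumes a simple random sample; z=1.96 corresponds to roughly 95% confidence.
    Returns (point_estimate, lower_bound, upper_bound) as head counts.
    """
    if sample_size <= 0 or not 0 <= software_reliant <= sample_size:
        raise ValueError("need 0 <= software_reliant <= sample_size and sample_size > 0")

    # Proportion of respondents who say they rely on software.
    p_hat = software_reliant / sample_size

    # Standard error of a sample proportion.
    std_err = math.sqrt(p_hat * (1 - p_hat) / sample_size)

    # Clamp the interval so it stays a valid proportion.
    lower = max(0.0, p_hat - z * std_err)
    upper = min(1.0, p_hat + z * std_err)

    return p_hat * population, lower * population, upper * population


# Hypothetical survey: 400 respondents, 300 of whom rely on software.
# The 700,000 population is the approximate HESA figure for UK researchers.
estimate, low, high = estimate_community_size(400, 300, 700_000)
print(f"Estimated community size: {estimate:,.0f} (between {low:,.0f} and {high:,.0f})")
```

A real survey would rarely be a simple random sample, so in practice the estimate would probably need weighting by discipline or career stage, and the interval above should be read as a best case.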

We will use an open approach when conducting this research. After all, it seems counterintuitive to try to study a community without including input from that community in the study. We will post regular updates about our ideas, the research we’re conducting and the conclusions that it is leading us to, and we hope that the community will comment on our methods, provide feedback and pass on information. We’re also in the market for partners, so if you are interested in studying the research software community, please let me know.

How can you help?

There are a number of ways in which you can help us right now. Please feel free to comment on our proposed approaches (comment below or let us know by email), or suggest other ideas that might help put a number on the research software community. If you know of reports that discuss research software in relation to the UK research community, or if you know details about your own part of the research software community (like the bioinformatics example discussed above), then please let us know.

What's next?

The next step is to conduct some preliminary work on each of the approaches to test their viability. We’ll post an update on these soon, so watch this space and keep an eye on our Twitter if you want to know more.