First steps towards understanding the size of the research software community

Posted by s.hettrick on 10 July 2014 - 10:20am

By Simon Hettrick, Deputy Director.

In an earlier post, I discussed our plans for investigating the number of researchers who rely on software. We’ve spent the last month looking at the feasibility of some of our ideas. In this post, I’ll present our findings about one of these approaches and some of the problems that we’ve encountered. If you’ve ever wondered what happens when a clueless physicist starts to dabble in social science, then this is the post for you.

First of all, a quick recap. Anecdotally, it seems that the number of researchers who rely on software for their research is, pretty much, everyone. There are few researchers who don’t use software in some way, even when we discount things like word processing and other non-result-generating software. But without a study, we lack the evidence to make this case convincingly. And it’s not just about the size of the community, it’s also about demographics. Seemingly simple questions are unanswerable without knowing the make-up of the research software community. How much EPSRC funding is spent on researchers who rely on software? Is that greater, proportionally speaking, than the AHRC?

We had a few ideas about how to determine the size of the research software community. The one that has progressed furthest is the conceptually most straightforward idea: simply ask the research community. This is my first foray into social science, an unusual position for a physicist to find himself in, and I can report that it’s easy enough to write a questionnaire, but fiendishly difficult to identify the people to whom that questionnaire should be sent.

The UK research community appears to comprise around 250,000 researchers (a number revised down since my last post; more about that in a later post). The power of polling means that it’s not necessary to survey everyone: we just need a representative sample of 400 participants. A few hundred responses doesn’t sound that difficult until you realise that this will be an electronic survey, and these tend to gain response rates of around 10%. At least we’ve got a hard number: we need to send out the questionnaire to 4000 researchers.
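For readers who want to check the arithmetic, here is a minimal sketch. The post does not state the confidence level or margin of error behind the figure of 400, so the 95% confidence and 5% margin used here are assumptions, as are the function names. With those assumptions, the standard sample-size formula with a finite population correction gives a figure just under 400, and dividing the target by the response rate gives the number of invitations.

```python
import math


def required_sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Cochran's formula with a finite population correction.

    margin, z and p are assumed values (5% margin of error, 95%
    confidence, worst-case proportion); the post does not specify them.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))


def invitations_needed(target_responses, response_rate_percent):
    """Number of questionnaires to send to expect the target number back."""
    return math.ceil(target_responses * 100 / response_rate_percent)


if __name__ == "__main__":
    print(required_sample_size(250_000))  # minimum sample under the assumptions
    print(invitations_needed(400, 10))    # invitations for 400 responses at 10%
```

Rounding the minimum up to 400 leaves a small safety margin, and the same function shows how little the required sample changes if the population estimate moves.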

A term that I have learned recently: sample frame. This is the name given to the community one is attempting to survey (in my case the entire research community). My 4000-person sample needs to be randomly selected from my sample frame to be representative, which means I need a sample frame that includes the details of everyone in the research community. Wait a second… that means I need the names and email addresses of all UK researchers. This is where things start to get difficult.
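Once a sample frame exists, drawing the sample itself is the easy part. The sketch below assumes the frame is simply a list of email addresses; the addresses and the `draw_sample` name are made up for illustration. It draws a simple random sample without replacement, so every researcher in the frame has the same chance of being selected.

```python
import random


def draw_sample(frame, sample_size, seed=None):
    """Draw a simple random sample, without replacement, from the frame."""
    # Remove duplicates first so nobody has more than one chance of selection.
    unique_frame = sorted(set(frame))
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(unique_frame, sample_size)


if __name__ == "__main__":
    # Hypothetical frame: placeholder addresses standing in for real researchers.
    frame = [f"researcher{i}@example.ac.uk" for i in range(250_000)]
    sample = draw_sample(frame, 4000, seed=2014)
    print(len(sample))
```

Recording the seed alongside the results would let someone else reproduce exactly the same selection later.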

Funnily enough, the names and email addresses of 250,000 people are not straightforwardly accessed. I’ve been talking to national bodies that do this kind of work, like HESA (the Higher Education Statistics Agency) and HEIDI (Higher Education Information Database for Institutions), university HR departments, and even a professional surveying organisation. It’s not possible to emulate the approach the national bodies take to collecting data, because they use embedded data collectors in every university and, understandably, these people can’t be hijacked. It is possible to add new questions for these embedded data collectors to ask, but this is a process that takes around three years and a lot of lobbying. University HR departments are somewhat cagey about handing out lists of their staff contact details. This has me conflicted: it makes my work more difficult, but as a university employee it also saves me from being incessantly spammed. Professional surveying organisations will take on the work of finding a sample frame, but the costs are considerable – some reasonable multiple of £10,000. After quite a few phone calls, it turns out that we’re on our own when it comes to creating the sample frame for our survey.

This stage in a project is always fairly challenging. Of all our methods for determining the size of the research software community, the surveying option seems the most likely to produce a result on which we can depend. Doing the sample frame work ourselves means taking staff off other projects and getting them to invest time in a problem in which no one at the Institute is an expert. Since we have no guarantee of results, we risk shifting resources away from better-known problems and onto something that might produce nothing. Lots of work for potentially no payback? Sounds like research.

The sample frame problem is not insurmountable, because the vast majority of researchers’ details are made publicly available on the staff pages of their university’s website. We could spend years trawling these pages by hand, or we could write a scraper to do it automatically. This solves our sample frame problem (give or take a bit of development wizardry, but we’re good at that), but introduces a brand new problem of ethics. If we’re scraping websites, does that mean we’re spammers?
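To give a flavour of what such a scraper involves, here is a minimal sketch of the parsing step only. It is not the Institute's scraper: it assumes, purely for illustration, that a staff page lists contact details as `mailto:` links, and the class name, function name and sample HTML are all made up. It uses only Python's standard library and does not fetch anything from the web.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect the addresses found in <a href="mailto:..."> links."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop the scheme and any ?subject=... suffix.
                address = value[len("mailto:"):].split("?")[0].strip()
                if address and address not in self.emails:
                    self.emails.append(address)


def extract_emails(html):
    """Return the unique mailto addresses in an HTML string, in page order."""
    collector = MailtoCollector()
    collector.feed(html)
    return collector.emails


if __name__ == "__main__":
    # Hypothetical staff page fragment.
    page = """
    <ul>
      <li><a href="mailto:a.researcher@example.ac.uk">A. Researcher</a></li>
      <li><a href="mailto:b.physicist@example.ac.uk?subject=Hello">B. Physicist</a></li>
      <li><a href="/staff/c-historian">C. Historian</a></li>
    </ul>
    """
    print(extract_emails(page))
```

A real scraper would also have to download the pages, cope with every university laying out its staff pages differently, and respect each site's terms of use, which is where most of the development effort would go.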

Along with dusting off my long-forgotten knowledge of statistics, and learning the new skill of survey design, I now need to worry about ethics. The ethics committee never used to care what I did when I worked with lasers – as long as I didn’t point them at anyone – but this research needs clearance before it can continue.

It turns out that ethical approval is easier to achieve than I expected. It comes down to the balance between the benefit of the research and the potential for harm. If you’re not working with vulnerable people (children, for example), you’re not messing with people psychologically (by hiding the real purpose of your study, for example) and you’re not storing sensitive information (about health, for example), then you simply need to make a reasonable case that your research will lead to a benefit. In our case, the risks are low, and the benefits are significant. The study will support software in research and, like it says on our T-shirts: better software, better research.

Getting ethics approval is one thing, but ensuring the happiness of the people who receive our survey request out of the blue is quite another. The survey is, of course, voluntary, the respondents won’t receive further emails, there will be a procedure to ensure that people’s names are quickly removed from our files, and there will be an email address to allow people to provide feedback. Apart from that, there is little we can do except run a series of progressively larger surveys and gauge the response as we go.

We’re now at the stage where we have begun collection of the contact details that we need. There will follow a period of trialling the survey with groups, like our Fellowship, who represent researchers. Once we are happy that the survey is ready, we will poll five randomly selected universities from the Russell Group. If that goes well, there are plans to expand the survey to all higher education organisations in the UK.

In my next post, I’ll look at the other ways of determining the size of the research community and talk about how we wiped out 500,000 researchers – figuratively speaking.