Research Data Management as a national service

Posted by g.law on 4 June 2020 - 7:31am

Photo by Thomas Jensen on Unsplash

By Alastair Downie, Head of IT at The Gurdon Institute, University of Cambridge

This blog post was first published on the IT and Research Data Management in the Gurdon Institute blog.

The volume of data stored in research institutions is growing, and the rate at which it is growing is accelerating.

Modern research practices and equipment are generating huge amounts of data. In science there are many desktop devices that can create terabytes of data every day – they’re becoming more prolific, more compact and more affordable, and they’re popping up everywhere. This is great of course – more data allows the research community to answer bigger questions more quickly – but the flood of data is creating a management problem.

Driven mostly by university libraries, a lot of great work has been done to establish and encourage best practice in the management of data relating to published research, but this accounts for only a small percentage of all data that is currently held in research institutions – the very tip of the iceberg. I’ll come back to published data in a moment, but for the time being I’d like to focus on the rest of the data – the bulk of the iceberg: live data (that which is being collected, processed and analysed as part of current research work that is not yet ready to be published), or ‘old’ data (that relates to completed or published research, but has not itself been published or made accessible to the broader community).

The Gurdon Institute (a medium-sized research department within the University of Cambridge undertaking research into cancer biology and developmental biology) has a filestore that now accommodates over 200 million files, on approximately 3.5 petabytes of hardware. In common with most other institutions, we strive to embrace the principles of Open Science and Data Re-Use, and to ensure that our data is kept securely and remains accessible for at least ten years. It’s worth noting here that the typical life expectancy of storage hardware is around five to seven years, so a ten-year retention period requires two rounds of infrastructure investment over the course of the data lifecycle.
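The arithmetic behind "two rounds of infrastructure investment" can be checked with a minimal sketch. It assumes the simplest possible model, in which each generation of hardware is bought once and lasts exactly its stated life expectancy; the function name purchase_rounds is illustrative, not taken from any real tool.

```python
import math


def purchase_rounds(retention_years: float, hardware_life_years: float) -> int:
    """Number of hardware generations needed to keep data for the retention period.

    Assumes (simplistically) that each generation lasts exactly
    hardware_life_years and is then replaced in full.
    """
    return math.ceil(retention_years / hardware_life_years)


# Ten-year retention against the five-to-seven-year life expectancy quoted above.
for life in (5, 6, 7):
    print(f"{life}-year hardware life: {purchase_rounds(10, life)} purchase rounds")
```

Under this model every lifetime in the five-to-seven-year range gives two rounds for a ten-year retention period.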

There are hundreds of research institutions around the country. The city of Cambridge alone is shared by two universities, the Sanger Institute, MRC LMB, Babraham Institute and many others. And all of these institutions, all over the country, are building and managing their own storage systems. Within each of these institutions there are individual departments that are building and managing their own local storage systems. At huge expense, all of these organisations are locked into cycles of buying, building and maintaining unique systems, despite all sharing a common goal. There are no published standards or best practice models that I’m aware of, and I dare say that the quality of provision across all of the institutions is inconsistent.

All of our research institutions are building separate but similar systems, to provide similar services to a single community.

Published data lives in a separate infrastructure all of its own – at another conference recently I heard someone suggest that there are tens of thousands of repositories around the world. I don’t know how many there are in the UK, but I think it’s a safe bet that most of the hundreds of institutions in the country have at least one repository running on local hardware, independent of their live data storage. And all of these repositories ALSO share a common goal – to curate and to disseminate published data and resources for the benefit of the entire research community.

It seems completely nuts to me that so much expensive resource is being multiplied all over the country, to provide exactly the same service to the same community – it’s hugely extravagant and wasteful. But that’s only part of the problem. Another problem lies in the traditional (or popular) methods of organising data within these filestores.

One of our research group leaders came to speak to me a while ago and told me that she felt she was ‘losing control of her data’ – she knew the files she wanted to find were somewhere in our filestore, but she just couldn’t find them. There’s a lot of great advice and guidance available about using robust file naming conventions, and well-organised directory structures, so I started describing some strategies, but she stopped me and said: “I’m pretty sure we’re doing all that!”. I looked over her shoulder as she logged in to the filestore and, sure enough, I found that she is using naming conventions, and she does have a good directory structure – the problem is that the directory structure is now more than 20 levels deep, and their filenames have become super-complex, reflecting the number and the complexity of the systems that have created the data. It seems that filenames have changed from being uselessly simple ten years ago, to being unhelpfully cryptic today. And the process of trying to find data created by someone who left the lab years ago, by rummaging through directories and speculatively double-clicking on files, is becoming increasingly hopeless. This highlights the value of a good metadata-driven repository system of course, with its massively more effective search tools.
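For anyone who wants to see how deep their own directory structures have become, here is a minimal sketch using only the Python standard library. The function name deepest_paths and the idea of reporting the few deepest directories are my own assumptions for illustration; this is not a tool used at the Gurdon Institute.

```python
import os


def deepest_paths(root: str, limit: int = 5) -> list[tuple[int, str]]:
    """Return up to `limit` (depth, path) pairs for the deepest directories under root.

    Depth is counted in directory levels below root (root itself is depth 0).
    """
    root = os.path.abspath(root)
    found = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == os.curdir else len(rel.split(os.sep))
        found.append((depth, dirpath))
    # Deepest directories first.
    found.sort(key=lambda item: item[0], reverse=True)
    return found[:limit]
```

Calling deepest_paths on the top of a group's share (the path is whatever your own filestore uses) lists the directories that are hardest to reach by browsing. Walking a filestore with hundreds of millions of files this way would be slow, so it is better suited to one group's area than to a whole institution.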

So. Spending and effort and resources are being duplicated needlessly, and the systems we’re building don’t even serve their purpose very well. And given the deluge of data that’s now being generated, I just don’t think this approach is sustainable. We’re not quite at breaking point yet but the trajectory is clear, and I think now is a good time to start thinking about a different approach. I’ve become interested in the idea that we should replace all of this – or at least as much as possible – with a single, joined-up, national infrastructure for research data management.

My proposal is that a single infrastructure (comprising at least three very large data centres) should accommodate all research data, for all disciplines and institutions, throughout the entire data lifecycle – from creation to publication – irrespective of how researchers move around between institutions.

One of my responsibilities is to help research groups migrate their data out of the Gurdon Institute and into their new institutions, and it really is a painful task. It’s time-consuming and can be very complicated, because every institution that has its own infrastructure also has its own policies and capacities, which might not be compatible with the structure or the volume of data that has to be moved. And it shouldn’t be necessary. In the joined-up infrastructure proposed above, the data stays put while people move around it.

Jane Smith would move from a university in Aberdeen to another in Manchester, and as soon as she steps into her new office, she could log into the same system, using the same credentials, and access the same data. The same benefit applies to operation and analysis, and collaboration and publication – there’s no need for the data to be moved in order to satisfy all of these functions – they can all be undertaken and managed remotely.

I would propose that the storage platform be presented through a streamlined, lightweight repository interface using an ORCID login. It would automatically harvest metadata from the file, from the researcher’s ORCID profile, and from a minimal number of keywords required in the submission form – the aim being to make the initial submission process for newly-created, live data almost as easy as saving a file on the researcher’s own computer. A persistent identifier would be assigned to each new dataset immediately, and a revision history started. Researchers would access their data using their ORCID credentials, while other stakeholders, collaborators, and eventually the whole community would access it via the persistent identifier – whether shared by the researcher, published in a paper, or referenced in the research documentation.
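To make the submission idea a little more concrete, here is a minimal sketch of what such a deposit might record. Everything in it is an assumption for illustration: the Dataset and Revision classes, the submit and revise functions, the shape of the orcid_profile dictionary and the "example-pid:" identifier scheme are all hypothetical, and no real ORCID or persistent-identifier service is called.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class Revision:
    sha256: str        # checksum harvested from the file contents
    size_bytes: int    # size harvested from the file contents
    deposited_at: str  # UTC timestamp of the deposit (ISO 8601)


@dataclass
class Dataset:
    identifier: str    # placeholder persistent identifier, not a real DOI
    orcid: str
    creator_name: str
    keywords: list[str]
    filename: str
    revisions: list[Revision] = field(default_factory=list)


def harvest_file_metadata(path: Path) -> Revision:
    """Collect the metadata that can be read from the file itself."""
    data = path.read_bytes()  # fine for a sketch; large files would be streamed
    return Revision(
        sha256=hashlib.sha256(data).hexdigest(),
        size_bytes=len(data),
        deposited_at=datetime.now(timezone.utc).isoformat(),
    )


def submit(path: Path, orcid_profile: dict, keywords: list[str]) -> Dataset:
    """Create a dataset record with an identifier and its first revision."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    dataset = Dataset(
        identifier=f"example-pid:{uuid.uuid4()}",  # hypothetical scheme
        orcid=orcid_profile["orcid"],
        creator_name=orcid_profile["name"],
        keywords=list(keywords),
        filename=path.name,
    )
    dataset.revisions.append(harvest_file_metadata(path))
    return dataset


def revise(dataset: Dataset, path: Path) -> Dataset:
    """Record a new revision of the file under the same identifier."""
    dataset.revisions.append(harvest_file_metadata(path))
    return dataset
```

A researcher would call submit once with the file, their profile and a keyword or two, and revise whenever the file changes. The identifier stays the same across revisions, which is what would let collaborators keep using one reference while the data is still live.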

Every research discipline has different requirements and expectations of storage platforms and repository systems, and one size will not fit all. But this is where I think there are some interesting opportunities. My proposal is that we should build this as a basic, discipline-agnostic, universally-relevant storage platform – something that’s equally useful to all researchers in all disciplines – but design it in such a way that the developers and commercial partners can build and grow a library of interface skins that will provide specialist toolsets for those different research communities. I’m keen to stress here that this is not a solution for science disciplines only – this is a system for every researcher, in every faculty, in every institution – science and arts and humanities alike. All operating within their own familiar or specialist environments, but underneath those interfaces, all sharing a single, common infrastructure.

And this highlights one of the great additional benefits of a shared infrastructure for storage – it creates a framework for rapid deployment of other innovations to the entire community – all disciplines, in all institutions, simultaneously.

The benefits of a single, joined-up approach are clear:

• Single platform for everyone
• Democratised access to resources
• Common experience, processes, quality and culture
• Easy cross-discipline collaboration
• Economy of scale
• Reduced local complexity and support burden
• Easy retrieval via persistent identifier, repository search or documentation search
• Reduction of duplication and movement of data
• Quick, easy and standardised publication of any dataset, irrespective of size
• Rapid deployment of innovation to the entire community

But the challenges – technological and cultural – are equally clear. I’m sure you can think of a hundred reasons why something like this would probably never work, but if there is a will, then all of these obstacles could be overcome by clever people. I’m guessing we can continue to adapt and stretch our current practices as best we can for maybe 10-15 years. But a system like this will not evolve out of current practices – I think it will need to be deliberately built by a bold and forward-thinking governing body, and that could take the best part of ten years to achieve.

There have been some high-level discussions in the past within the academic community about a national approach to RDM. I was chatting recently with a colleague from Cambridge, now retired, who chaired a national committee about 15 years ago to consider the potential benefits of a national data management service (without actually proposing any particular model). Their conclusion was that the idea was good, but 10 years ahead of its time because there was insufficient interest or support within the community for it to gain traction. But here we are, 15 years later, and now there is a very active and energetic community that is very engaged with data management practices and policies, and I think we do have a real chance to build the conversation to a level that will start to influence research councils, funding bodies, and other policymakers.

I’d love to hear your thoughts. And if you agree that our current approaches to research data management are unsustainable, then please feel free to share/discuss this idea in your own communities.