A checklist for using the Cloud in research

The Impact

Researchers now have a straightforward checklist to help them decide whether cloud computing can help them with their work.

The RAPPORT project investigated the use of cloud computing across a range of research projects and a variety of cloud types. That has allowed the team to create Best Practice checklists for researchers, breaking down the different aspects of cloud use that have to be considered.

Perhaps surprisingly, the team found that in most cases there is no barrier to using cloud computing in High Performance Computing (HPC). But there’s still plenty to consider, and each project has its own peculiarities, so the RAPPORT checklist was put together to help researchers make their own call on the best way forward. A second checklist targets funders, helping them to make their own decisions on where cloud computing is a useful tool.

These checklists step through everything from security to licensing issues, and the true costs of using cloud – including storage and data transfer costs – compared to an in-house cluster. They ensure users consider the dependability of the cloud, plus any legal and ethical issues, and give technical help on how exactly to port an application to the cloud (and back out again, if need be).

“It’s a simplified model that allows researchers to just step through and decide whether it makes sense to even start.” – Neil Chue Hong, director of the Software Sustainability Institute.

The problem

Cloud computing is the buzzword of the moment, so it can be hard to tell where it’s useful and where it might actually be counterproductive. There have been two popular viewpoints, neither necessarily based on much evidence.

On the one hand, the general feeling in the research community has tended to be that cloud computing, and virtualised environments in general, don’t really suit the very specific needs of HPC. An in-house data centre can be tailored to meet the precise memory and interconnect requirements of the code being run, without the overheads needed to run a virtualised environment.

But on the other, cloud is almost seen as a panacea for anyone struggling to get access to compute power. It’s touted as cheap, available and ever-expandable, so why not use it for research?

The solution

The six-month RAPPORT (Robust Application PORTing for HPC in the Cloud) project was funded under JISC/EPSRC’s Pilot Projects in Cloud Computing for Research programme. The aim was to identify whether cloud computing could actually be practical in research situations and if so, in what form.

The Software Sustainability Institute (Mike Jackson, Neil Chue Hong) worked with the London e-Science Centre (Jeremy Cohen, John Darlington), based within the Department of Computing at Imperial College, London, the Imperial HPC Service (Matt Harvey), and researchers from three different research domains within Imperial College: Bioinformatics (Sarah Butcher, Mark Woodbridge, Ioannis Filippis), Digital Humanities (Brian Fuchs), and High Energy Physics (David Colling, Daniela Bauer). These domains were chosen to give the widest possible range of software types.

The Bioinformatics group looked at two commonly used open-source software tools called MrBayes and RAxML, plus GenomeThreader, a closed-source tool that’s available free. If these could be run on the cloud, that would remove the need for researchers to purchase dedicated clusters, reduce contention, and also give access to extra resources whenever they were needed. RAPPORT showed that it was possible to run these tools on the cloud, but that clouds were better suited to longer-running processes, as there was a time cost to starting a run. In some standard use cases, the memory available on standard small cloud instances was insufficient for the applications.
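The point about start-up cost can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and the five-minute start-up delay are assumptions, not figures reported by RAPPORT. It simply shows how a fixed delay shrinks as a share of total time when the run itself gets longer.

```python
def startup_overhead_fraction(startup_minutes: float, run_minutes: float) -> float:
    """Share of total wall-clock time spent waiting for the run to start."""
    total = startup_minutes + run_minutes
    return startup_minutes / total


if __name__ == "__main__":
    # Assumed start-up delay in minutes; a placeholder, not a measured value.
    startup = 5.0
    for run in (10.0, 60.0, 600.0):
        share = startup_overhead_fraction(startup, run)
        print(f"{run:>5.0f} min run: {share:.1%} of wall-clock time is start-up")
```

Under these assumed numbers, a ten-minute job spends a third of its wall-clock time starting up, while a ten-hour job spends under one percent.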

Within Digital Humanities, RAPPORT worked with a project called Dynamic Variorum Editions (DVE). The project uses a Document Workflows (DocWF) application for batch processing historical multi-lingual documents with OCR (optical character recognition) software. DVE aims to build a framework capable of identifying and tracking references to the Greco-Roman world on an internet scale – involving work on around five million books, or around 500 terabytes of data.

The large number of independent OCR runs that have to be carried out on document scans gives the application an embarrassingly parallel nature. The team felt that scaling to very large numbers of machines, using the Hadoop framework to manage the farming out and execution of jobs, ought to be possible on cloud nodes. However, licensing issues might require alternative OCR software to be used.
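To illustrate what "embarrassingly parallel" means here, the following is a minimal local sketch, not the DocWF or Hadoop setup used by the project. It uses a thread pool from the Python standard library as a stand-in for a cluster of cloud nodes, and `fake_ocr` and `run_batch` are hypothetical names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_ocr(scan_name: str) -> str:
    """Hypothetical stand-in for one OCR run on a single document scan."""
    return f"text extracted from {scan_name}"


def run_batch(scan_names, workers=4):
    """Farm independent OCR runs out to a pool of workers.

    No run depends on the result of another, so they can execute in any
    order; map() still returns the results in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_ocr, scan_names))


if __name__ == "__main__":
    scans = [f"scan_{i:04d}.tif" for i in range(8)]
    for line in run_batch(scans):
        print(line)
```

Because each scan is processed in isolation, adding more workers (or, in the project's case, more cloud nodes) needs no change to the per-document logic.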

The High Energy Physics team was interested in a more theoretical sense – the department has a large local computing infrastructure that covers its current needs. Cloud, however, could potentially offer flexibility and efficiency in resource usage.

The software considered by RAPPORT included a Monte Carlo code generating simulated events based on CMS (Compact Muon Solenoid) experiments at the Large Hadron Collider, and code used to analyse pre-processed data from the CMS detector. Both of these applications are high-throughput, data intensive and parallelisable. The software stack is made available in a custom virtual machine image called the CernVM. However, it proved difficult to understand how this VM (virtual machine) image could be deployed on to a cloud platform like Eucalyptus.

Each new dataset produced by CMS creates a huge demand on the Large Hadron Collider's (LHC) global computing grid, the Worldwide LHC Computing Grid (WLCG), so the ability to cloudburst to additional resources would be a definite boon to researchers. However, the challenge of transferring large amounts of input and output data required innovative solutions: accessing data directly from the existing High Energy Physics storage resource without staging it to cloud storage.

The software from all three domains was tested on a range of cloud platforms: private clouds (the London e-Science Cloud and RAPPORT’s own cloud), the National Grid Service’s community cloud, and Amazon’s EC2 public cloud. In all instances the team opted for a basic Infrastructure as a Service (IaaS) offering.

Some of the software presented considerable challenges, due to more complex underlying applications and structures. In some cases, too, it was clear that the transfer and storage of data on a public cloud would be too costly and time-consuming to be viable, but private cloud infrastructure was still a possibility.
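A rough calculation shows why data transfer can rule out a public cloud. The sketch below takes the figure of around 500 terabytes quoted earlier for the DVE project and estimates how long it would take to move that much data. The link speed, the use of decimal units, and the assumption that the link is fully saturated the whole time are all simplifications made for this example, and `transfer_days` is a hypothetical helper.

```python
def transfer_days(terabytes: float, gigabits_per_second: float) -> float:
    """Days needed to move a dataset over a fully utilised network link.

    Uses decimal units (1 TB = 10**12 bytes, 1 Gbit = 10**9 bits) and
    ignores protocol overhead, so real transfers would take longer.
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (gigabits_per_second * 1e9)
    return seconds / 86_400  # seconds in a day


if __name__ == "__main__":
    # 500 TB is the approximate DVE figure; 1 Gbit/s is an assumed link speed.
    print(f"{transfer_days(500, 1):.1f} days")
```

With these assumptions the transfer alone takes about 46 days, before any storage charges or processing time are counted.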

In the end, cloud came out neither as a great solution nor a disaster for HPC – it’s just a very useful tool in the right circumstances. The team identified a whole spectrum of benefits and disadvantages, and it became clear that researchers need a way to consider each research project on its own merits. By breaking down the project and assessing the different types of requirement separately, it is possible to tell whether using the cloud will be useful.