Making High Performance Computing resources more accessible

The Impact

Bioinformatics researchers using SPRINT (Simple Parallel R INTerface) now have access to quality resources and support, helping them to use the High Performance Computing services their research depends on.

SPRINT aims to help researchers carry out more in-depth, complex analysis of high-throughput, post-genomic data using High Performance Computing (HPC), while still working with their preferred statistical programming language, R. Thanks to SPRINT, these researchers now have access to excellent training materials and documentation to help them in this work. The parallel functionality provided by SPRINT is also applicable to anyone who uses R to analyse very large datasets.

A local installation service within two departments at the University of Edinburgh has also eased access to SPRINT, encouraging uptake of the software and helping more scientists to benefit from it. In the longer term, it is hoped this easier access will also boost contributions from users, allowing developers to keep improving the framework.

The Problem

Microarray analysis allows the simultaneous measurement of thousands to millions of genes or sequences across tens to thousands of different samples. The analysis of the resulting data pushes the limits of the existing computing infrastructure available to the bioinformatics community.

SPRINT was therefore developed to help bioinformatics researchers analyse this data on HPC resources, making it possible to analyse datasets that were previously too large to tackle and to run analyses that were thought too time-consuming to perform.

The community was accustomed to working with the programming language R, but the language wasn’t immediately suitable for use in an HPC environment. The SPRINT project created a prototype framework to allow the addition of parallelised functions to R, making it easier to use on HPC systems. The Simple Parallel R INTerface (SPRINT) is a wrapper around these parallelised functions.

However simple the framework, the bioinformatics community is naturally more interested in science than in learning to program, so it became clear that there was a need to work with users, to make SPRINT as simple and straightforward to use as possible.

The Solution

The Software Sustainability Institute worked closely with the SPRINT team to improve user engagement, and to make better resources and support available for users.

User documentation and training materials were developed, and a training course on the use of SPRINT on the HECToR supercomputer was run for bio-statisticians in Oxford in 2011.

The Institute also funded a local installation service for both the Centre for Cardiovascular Research and the Institute of Evolutionary Biology at the University of Edinburgh, and trained staff at both institutions in its use.

The Institute did a considerable amount of work on the SPRINT code to make it more robust, and updated the software to process next-generation sequencing data using the Hamming distance function. Applied to data at the Institute of Evolutionary Biology (IEB), the new software has produced so much data, so quickly, that the IEB now needs to develop filtering methods to cope – a highly positive result, and one that the SPRINT team continues to help with.

The latest release of SPRINT software (v1.0.2) includes the Hamming distance function developed by the Software Sustainability Institute.
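For readers unfamiliar with the measure, the Hamming distance between two sequences of equal length is the number of positions at which they differ. The following is a minimal serial sketch in Python, not SPRINT's own R implementation or interface; the function names `hamming_distance` and `pairwise_hamming` and the example reads are illustrative assumptions. It shows why the calculation suits parallelisation and why the output grows so quickly: comparing n sequences takes n(n-1)/2 independent comparisons and fills an n-by-n matrix.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming(seqs):
    """Build the symmetric matrix of Hamming distances between all sequences.

    Each pair is independent of every other pair, which is what makes this
    calculation a natural candidate for parallel execution (hypothetical
    helper for illustration, not part of SPRINT).
    """
    n = len(seqs)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = hamming_distance(seqs[i], seqs[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


# Made-up example reads, for illustration only.
reads = ["GATTACA", "GACTATA", "GATTACA"]
distances = pairwise_hamming(reads)
for row in distances:
    print(row)
```

Even in this toy form, doubling the number of sequences roughly quadruples both the work and the size of the result, which is consistent with the need for filtering methods described above.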

Established in 2008, SPRINT is a collaborative project between EPCC and the Division of Pathway Medicine at the University of Edinburgh.