PhotPipe: Software solutions for the distributed batch processing of data from the Gaia satellite

Gaia artist's impresion
Gaia artist's impression. Credits: ESA/ATG MEDIALAB;
Background Image: ESO/S. Brunier, June 2013

By Francesca De Angeli, in collaboration with Marco Riello, Gregory Holland, Patrick Burgess and Paul Osborne.

This article is part of our series: A day in the software life, in which researchers from all disciplines discuss the tools that make their research possible.

On 19 December 2013, at 09:12:19 UTC, a spacecraft containing the Gaia satellite was launched from Europe's Spaceport in French Guiana. The Gaia satellite reached its stable operational orbit around L2 (approximately 1.5 million km from the Earth) about one month later. Since then, a continuous stream of data has been downloaded for further processing on ground. This data includes broad-band photometry and low-resolution spectra for all sources brighter than magnitude 20 and high-resolution spectra for sources brighter than magnitude 16.

The Gaia focal-plane assembly is the largest ever developed for a space application, with 106 CCDs, a total of almost 1,000 million pixels, and a physical dimension of 1.0 m × 0.4 m. The sheer size volume of the data collected by the various instruments on board Gaia demanded the development of efficient compression strategies and ad hoc software design solutions. As a result, the data stream to be processed on ground is extremely dense with information and highly complex. Processing is distributed between six data processing centres (DPCs) around Europe.

Each DPC is focussed on one aspect of the data processing and results are produced and exchanged at the end of each processing cycle every year or so. During each cycle, any one DPC will have available the latest results from the other DPC's operations during the previous cycles. Modules relying on data produced by other DPCs can often only become active after a few cycles.

This blog entry focusses on the reduction of the photometry and low-resolution spectroscopy data performed at the Institute of Astronomy of the University of Cambridge (IoA). This is carried out using an internally developed software system called PhotPipe.

The large data volume produced by Gaia (26 billion transits or observations per year), the complexity of its data stream, and the self-calibrating approach pose unique challenges for scalability, reliability, and robustness of both the software pipelines and the operational infrastructure. The Gaia team at the IoA adopted Hadoop and Map/Reduce as its infrastructure core technologies in 2010 prior to Hadoop's initial release in 2011. This has proven a winning choice allowing the IoA DPC excellent processing throughput and reliability to the point that other DPCs have started following our footsteps.

With time the DPC has gradually evolved to use Map/Reduce on Apache Hadoop for simple, data intensive workflows and Apache Spark for investigative work. The Hadoop MapReduce framework allows for rapid development of parallel algorithms based around a simple model: transform, or "map", a data record into an intermediate type, group the intermediate types together based on some criteria and then process, or "reduce", each group to produce the output. The framework provides for reliably and efficiently distributing the execution of such algorithms across a processing cluster to be executed in parallel by hundreds or thousands of processes; while the Spark data processing engine allows for more complex distributed processing models to be defined and executed in a similar way. Use of these community developed Open Source projects allows the team to focus more effort on the issues specific to Gaia data processing.

Sky map
An all-sky view of stars in our Galaxy–the Milky Way–and neighbouring galaxies,
based on the first year of observations from ESA's Gaia satellite,
from July 2014 to September 2015
Credits: ESA/Gaia/DPAC

About 55 GB of data is delivered to the IoA daily. Additionally, at the start of each cycle, integrated products from other DPCs are received. This amounted to 26 TB for the first cycle, but it is bound to increase as new systems come into operations. After being downloaded from the satellite, the data is decompressed and undergoes an initial treatment stage resulting in more than 200 of data types delivered to the IoA on a regular basis. An automatic data-handling system records new deliveries and stores useful metadata into a database. The data is thus transferred to the Hadoop cluster ready to be imported. This process converts the input data from the exchange format into an internal data model, optimised for processing. The modelling of the complex data stream uses a Domain Specific Language (DSL) and compiler developed ad hoc by the IoA DPC team.

The scientific modules are implemented in the processing pipeline as a series of "Processing Elements". These can be assembled into a workflow or "Recipe" that can be easily reconfigured without recompiling the software.  Recipes are defined using yet another specifically designed DSL, that allows a functional programming style more amenable to scientific work and easily parallelised.

The large number of different instrumental configurations affecting the data complicates the photometric calibration of the data and effectively creates a huge number of different instruments—each of those needs to be calibrated to the same internal photometric system. As mentioned before, Gaia is self-calibrating, i.e. no input catalogue is used. This is because no single catalogue of sufficient accuracy and size covering the vast magnitude range covered by Gaia exists. This implies that the internal reference system needs to be initialised using Gaia data itself in an iterative process.

These endeavours have recently enabled the release of the first Gaia catalogue, containing astrophysical positions and broad-band magnitudes for all sources with acceptable formal standard errors on positions and epoch photometric data of selected RR Lyrae and Cepheid variable stars. An additional catalogue of sources in common between the Tycho-2 Catalogue and Gaia is included, providing the full five-parameter astrometric solution—positions, parallaxes, and proper motions.

Future releases of the catalogue will contain  improved broad-band photometry and the addition of colours and low-resolution spectra for all sources thus creating what would be the most complete and accurate photometric catalogue to date.

For more details on the general UK involvement in the project, visit the website.

Posted by s.aragon on 25 October 2016 - 11:30am