Migrating from Excel to Python and MongoDB in LUX-ZEPLIN

LUX-Zepelin water tankBy Mike Jackson, Software Architect

In Using Excel for data storage and analysis in LUX-ZEPLIN, I summarised how Excel is both used and managed within the LUX-ZEPLIN (LZ) project and recommendations for improvements. In this second of two blog posts, I describe how LZ could migrate their data within Excel to MongoDB with supporting software, in Python, for computation and presentation. I also describe a proof-of-concept which extracts data from Excel, populates MongoDB with this data, and computes the radiogenic backgrounds expected from a subset of the possible sources of contamination.

As a reminder, the BG table is an Excel spreadsheet, with 43 sheets, used by LZ to calculate radiogenic backgrounds, and the WS Backgrounds Table is a sheet within the BG table which summarises the radiogenic backgrounds expected during the lifetime of the experiment from each source of contamination.

Migrating from Excel to MongoDB and Python

Excel combines data, computation and presentation. For example, a cell with a formula in Excel is a combination of data and computation, in effect a tiny program. The migration plan was based around migrating from the BG table into a solution reflecting a common decomposition of software into:

  • Data: stored in a database management system, accessed via code.

  • Domain logic, or computation, implemented as code.

  • Presentation: summary tables, reports, and, in future, a web front-end, created via code.

The goal of the migration plan was to get data out of the BG table and into a database management system, and to implement what is needed in terms of data access, update and computation functionality to deliver the same data and computation as is provided within the BG table. This would then provide a core around which a fully-functional web- and service-based infrastructure, envisaged by LZ, could then be evolved.

For consistency with LZ's Information Repository, the target operating system was any Linux flavour, the database management system was MongoDB and the implementation language was Python (either 2 or 3).

The key steps of the plan were to:

  1. Identify raw data in the BG table (cells with values) and design JSON documents for this raw data (JSON is the document data format used within MongoDB).

  2. Identify derived data in the BG table (cells with formulae) and design documents for this derived data. This does not imply that these need to be stored within MongoDB. Rather, it is intended as a representation for the data both consumed and produced by the computation code..

  3. Document dependencies between raw data and derived data in the BG table.

  4. Design data access code to access raw data from MongoDB, which hides the specifics of MongoDB from computation and presentation code.

  5. Design computation code to produce the derived data, code which may use both raw and derived data.

  6. Design presentation code to produce summary tables and reports, code which may use both raw and derived data.

  7. Export raw data from the BG table into MongoDB.

  8. Implement data access, computation and presentation code.

  9. Perform a code review of all the code, with reference to both the raw and derived data and the BG table formulae, to check for bugs introduced during implementation.

  10. Implement validation code to compare data held in the BG table to both the raw data, held within MongoDB, and derived data, created by the computation code, to check that the raw data has been imported correctly and that the derived data has been computed correctly, to detect any bugs introduced during migration or implementation.

  11. Concurrently use both the BG table and MongoDB and the code (e.g. if data in the BG table is updated, then the corresponding data in MongoDB is also updated), regularly rerunning the validation code from step 10. This allows LZ researchers to:

    1. Develop experience in using MongoDB and the code.

    2. Understand how to do tasks, currently done using the BG table, using MongoDB and the code.

    3. Identify and fix bugs.

  12. Deprecate the BG table.

Steps 4, 5, 6, and 8 could be combined and replaced by a rapid prototyping of data access, computation and presentation code.

Proof-of-concept

Both LZ and myself considered it useful to check whether the migration plan was feasible by rapidly-prototyping a proof-of-concept of the key components. The steps in the development of the proof-of-concept, in effect a subset of the migration plan, were as follows:

  1. Identify raw data in the BG table and design JSON documents for this raw data.

  2. Identify derived data in the BG table and design documents for this derived data.

  3. Export raw data from the BG table into MongoDB.

  4. Identify all raw and derived data required to generate a subset of the WS Backgrounds Table with data relating to two components plus one non-component source of contamination (the liquid Xenon in the detector). This can be done by tracking all dependencies from the WS Backgrounds Table to other sheets, and, ultimately back to cells which contain values only.

  5. Rapidly-prototype data access code to access raw data from MongoDB.

  6. Rapidly-prototype computation code to produce the derived data.

  7. Rapidly-prototype presentation code to produce an HTML summary table of the subset of the WS Backgrounds Table.

  8. Rapidly-prototype two sample regression tests to validate derived data:

    1. Read a sheet in the BG table containing the subset of the WS Backgrounds Table and compare its data to that computed by the computation code). The sheet in the BG table is created by copying the WS Backgrounds Table sheet and removing rows until the desired subset of data remains.

    2. Read a JSON document containing the subset of the WS Backgrounds Table previously computed by the computation code, and compare its data to that computed by the computation code. This regression test can be used to validate successive versions of the computation code.

I developed the proof-of-concept using the Anaconda Python 4.3.0 scientific computing stack. I chose Anaconda as it provides a number of useful Python packages including the Pandas Python Data Analysis Library and the py.test test framework; it is easy to install; it is recommended by Software Carpentry; and, it has proven to be of great use to me in numerous projects over the past few years. Rather than install MongoDB natively, I used Docker, into which I loaded a MongoDB image available via DockerHub, which allowed me to deploy MongoDB very rapidly.

To import raw data from the BG table into MongoDB, I wrote Python scripts that use Pandas classes and functions for reading Excel (of note are pandas.ExcelFile, pandas.read_excel, and pandas.DataFrame) to scrape the data out of the BG table and insert it, using the Python pymongo package, into MongoDB. The scripts are configured by a JSON document which specifies which sheets, rows and columns are to be extracted. I plan to convert these scripts into a Python package, hosted within our softwaresaved GitHub project, at a later date. Before these scripts could be run, myriad changes to the BG table were needed to get it into a form to allow the data to be easily extracted. The resources recommended by Data Carpentry, including Wickham’s Tidy Data and White et al.’s Nine simple ways to make it easier to (re)use your data, were invaluable.

There are many resources that provide advice on migrating data from Excel to databases or files of comma-separated values (CSV) and using Python or R for data analysis. Unfortunately, there seems to be little in the way of resources that describe how to handle formulae. How does one go about migrating these into Python or R? What about spreadsheets where the formulae are co-located, or even deeply embedded within, tables of raw data? The techniques I adopted will be described in the blog post, Tips for porting formulae from Excel into code. Implementing LZ’s formulae in Python did not require any advanced mathematical functions and were implemented using built-in Python operators and functions in its math module. pymongo was used to access data from MongoDB (as most of the computation code is only aware of the pymongo Collection class and only uses its find, find_one and distinct methods, this could be wrapped in an abstraction layer to hide the use of MongoDB from the computation code).

The Pandas function for converting tabular data into HTML (pandas.DataFrame.to_html) enabled me to rapidly prototype code to create an HTML representation of the WS Backgrounds Table.

Looking ahead

The migration plan and proof-of-concept are now in the hands of LUX-ZEPLIN. I look forward to reporting on their experiences in a future blog post.

Acknowledgements

My colleagues in EPCC’s data science team - Amy Krause, Ally Hume and Eilidh Troup - provided invaluable input in both creating the migration plan and the plan for the proof-of-concept.

Posted by s.aragon on 2 November 2017 - 11:42am