By Mike Jackson, Software Architect
The LUX-ZEPLIN (LZ) project are building one of the largest and most sensitive dark matter detectors ever constructed. I’ve been providing consultancy, as part of an Institute open call project, on how LZ can migrate their data storage and analysis software from Microsoft Excel to a database management system-centred solution. In the first of two blog posts, I summarise how Excel is both used and managed within LZ and recommendations for improvements.
As described in my blog post at the outset of the consultancy, Shining a light on dark matter, LZ partners at University College London and University of Coimbra, maintain LZ's backgrounds control software. At the heart of the backgrounds control software is a Microsoft Excel spreadsheet (termed the “BG table”). While fit for purpose in the experiment’s early design and procurement stage, Excel is now reaching its limits in terms of sustainability, its ability to interface with other software in the experiment (for example, analysis software that interprets dark matter data), and the interface with other users in the collaboration such that they can view and interactively query both current and historical data (for example tracking changes over time). This has motivated the intention to move to a solution which uses a database management system and supporting software.
Looking inside the BG table
The BG table is the largest Excel spreadsheet I’ve ever seen. It consists of 43 sheets, including: a materials sheet, with results of the screening campaign; one sheet for each of the 25 types of component in the detector; sheets with data related to neutron emissions for each material; sheets with simulation data; and sheets for other sources of contamination e.g. from liquid Xenon in the detector, the laboratory housing the detector, and astronomical events. The data from all these sheets are used to derive data within a sheet, the WS Backgrounds Table, which summarises the radiogenic backgrounds expected during the lifetime of the experiment from each source of contamination. Most of the sheets have numerous tables, and most have many formulae, with both inter-table and inter-sheet dependencies.
Following a detailed inspection of the BG table, recommendations were made to improve formatting, layout and documentation and to ease migration from Excel to a database-centred solution. These covered cell formatting; table, sheet and column layout; reducing the level of manual copying of cell contents that exist elsewhere in the table; ensuring consistent naming conventions for tables, sheets and columns; and, fully documenting all steps when manually importing or copying data.
Updating the BG table’s data
Data used to populate, and update, the BG table can come from many sources but which fall into two classes:
Data from simulation software, including C, C++, Python and bash code.
Simulation results and scientific data published in books, papers, reports, slides, LZ’s TWiki, LZ's ELOG notebooks, the LZ Information Repository (LZIR) which holds screening data, Google sheets, and e-mails from project collaborators.
Data is always imported manually, by copy-and-paste. Sometimes data needs to be processed before or after import e.g. to remove columns from a table copied from a Google Sheet or to change the units of values from ppm or ppb (g/g) to mBq/kg.
The frequency with which data is updated can vary. For example, materials screening and component geometry data is updated more than once a week, results from the simulation of radioactive decays within a realistic detector geometry 2-3 times a year, and inputs for non-WIMP analyses even less frequently.
Most of the data import processes are undocumented, or are held within e-mails.
Using the BG table’s data
The BG table both provides data for both analysis software and for approval reports for project partners who wish to purchase materials/components. The data is exported manually.
Managing the BG table
The canonical version of the BG table is held under version control within LZ’s GitLab repository, with a read-only copy of the latest version being provided via Microsoft OneDrive. A LaTeX version with a one page summary of the WS Backgrounds Table is also available.
A LZ TWiki page dedicated to the BG table records information on each version, including its date, Git commit hash, nature of the changes, who requested these changes, and links to supporting documentation. The TWiki page also records how researchers can request both changes and approval reports.
Alex Lindote is responsible for changes to the BG table, managing updates to the canonical version in GitLab and the TWiki page. Researchers can request changes to the table by e-mailing Alex via an e-mail or contacting him via LZ’s Slack chat room. Researchers need to tell Alex what needs to be changed, the reason for the change, and a report or ELOG log entry justifying the change.
Changes to material assays, geometries of components, simulation results, or any value that has an impact on backgrounds values (i.e. the computed WS Backgrounds Table data) all motivate a commit of a new version of the BG table. Very minor changes that have no impact on the backgrounds values do not motivate a new commit but are held off until a more significant change is requested.
From the foregoing, a number of specific recommendations were proposed.
Firstly, LZ should fix the issues identified with the BG table. On a related note, Excel spreadsheets can be viewed as analogous to code, where the cells with values are the data and the cells with formulae are the code. In the same way that regular reviews of code can identify bugs and issues, so undertaking regular reviews of Excel spreadsheets can identify issues with their design and bugs in values and formulae, reducing the risk of errors seeping into papers based on their data, for example.
It was recommended that LZ document all manual processes e.g. copying values within the BG table, importing data into the BG table, exporting data from the BG table, and managing versions of the BG table. The documentation should include: the nature of any inbound or outbound data, where it originates from or is destined to, any pre- or post-processing of data required, and the frequency of these activities. Furthermore, the documented should be in a location accessible to all LZ project members, for example the LZ TWiki. Complementing this, it was recommended that LZ set up an e-mail alias and a browsable e-mail archive, accessible to project members, to which researchers would e-mail requests for updates to the BG table, rather than e-mailing Alex directly. Together, these would help to increase the BG table's bus factor to be greater than 1 (Alex!).
It was recommended that LZ redesign the BG table, so that its data is structured in a way that makes automation easier, that makes the data more readily exportable into non-proprietary data formats (e.g. plain-text files with comma-separated values or tab-separated values, or JSON documents) that other tools (that may not be able to read Excel spreadsheets) can use, and that also makes future migration to a database management system easier. In particular, it was recommended that LZ consult the resources recommended by Data Carpentry, including Wickham’s Tidy Data and White et al.’s Nine simple ways to make it easier to (re)use your data.
Finally, it was recommended that LZ automate frequently-occurring manual processes, where possible, to both reduce the risk of errors or inconsistencies arising during manual data handling, and to make the BG table easier to use and maintain. This includes, for example: replacing manual copying of data within the BG table with cell references; writing scripts to extract data from Google Sheets and simulation data files, to process this data and then import it into the BG table; writing a script to generate the LaTeX version of the WS Backgrounds Summary Table; and, writing scripts to auto-generate template approval reports.
Four of these recommendations are those that the Institute would recommend for any software project:
Carry out regular code reviews, to identify issues and bugs earlier, and to share knowledge about the software between developers.
Document all manual processes, to avoid critical project knowledge being solely held within any one person’s head, helping to keep a project’s bus factor above 1.
Set up e-mail aliases or lists and browsable e-mail archives, to provide a shared resource of project-specific communications and to ensure that other project members can address any issues.
Automate frequently-occurring manual processes, where possible, to make project tasks more efficient and to reduce the risks of manual steps introducing errors.
In the concluding blog post, I’ll describe how LZ could migrate the data within the BG table to a database management system and supporting software for computation and presentation. I’ll also describe a proof-of-concept which extracts data from BG table, populates MongoDB with this data, and computes a subset of the WS Backgrounds Table.