This guide focuses on data handling practices across the dataset lifecycle. It is not a full guide to organisational carbon accounting; instead, it provides simple actions that teams and individuals can apply in their day-to-day work.
When tracking progress, it helps to have a metric. Here, we use a metric that measures impact “per unit of work”: for example, the energy or cost “per dataset refresh” (that is, per update to the dataset) or “per GB processed”. Quantifying impact in this way makes it possible to monitor how energy use and cost per unit of work change as green data practices are applied. We will also use three guiding principles when working with data: “Store less, move less, and compute less”.
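As a minimal sketch of a “per unit of work” metric, the snippet below computes energy per GB processed for two runs and the relative change between them. The `RunRecord` fields, the function names and the numbers are all illustrative assumptions; how the energy figure is measured or estimated is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One pipeline run; every field here is an illustrative assumption."""
    label: str
    energy_kwh: float    # energy measured or estimated for the run
    gb_processed: float  # volume of data the run handled


def energy_per_gb(run: RunRecord) -> float:
    """Return the energy used per GB processed, in kWh/GB."""
    if run.gb_processed <= 0:
        raise ValueError("gb_processed must be positive")
    return run.energy_kwh / run.gb_processed


def relative_change(before: RunRecord, after: RunRecord) -> float:
    """Fractional change in kWh/GB from `before` to `after` (negative means improvement)."""
    base = energy_per_gb(before)
    return (energy_per_gb(after) - base) / base


# Made-up numbers, for illustration only.
baseline = RunRecord("refresh before changes", energy_kwh=2.0, gb_processed=40.0)
improved = RunRecord("refresh after changes", energy_kwh=1.2, gb_processed=40.0)
print(f"{energy_per_gb(baseline):.3f} kWh/GB -> {energy_per_gb(improved):.3f} kWh/GB")
print(f"change: {relative_change(baseline, improved):+.0%}")
```

The same pattern works for cost per dataset refresh: replace the numerator with cost and the denominator with the number of refreshes.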
Store less
Throughout the research process, one might end up storing multiple copies of a dataset in different locations. It is helpful to take a step back and assess whether all these copies are really needed and whether some could be cleaned up. Below are some steps to make sure that we are not using up resources to store redundant data:
- You could start by creating an inventory of the datasets used in your research work, for example, in a simple spreadsheet that records key details such as owner, size, location and last access date (details of this kind are often referred to as data provenance). This exercise will help identify datasets that are being stored, whether by default or otherwise, but are rarely used or have become outdated. Once such datasets are identified, consider whether resources should be spent on continuing to store them.
- Once the datasets have been recorded in an inventory, they can be further categorised by usage, for example, as actively used, occasionally used, or rarely used. Data that are rarely used could be archived or deleted where possible, with the reason documented each time. This process helps establish a lifecycle for the datasets, making it easier to track their status.
- When you survey your datasets, you might come across duplicate copies. To avoid duplicating datasets, it might be useful to choose a single location to store the dataset that you and your collaborators can access. This could involve using hosting services like GitHub (up to 100 MB per file), Zenodo (up to 50 GB per record by default), or other cloud storage services. GitHub and Zenodo additionally support versioning. GitHub repositories can also be integrated with Zenodo, allowing releases to be automatically archived and assigned a Digital Object Identifier (DOI). If your analyses repeatedly use the same portion of a dataset (for example, particular columns or a specific time frame), it might be useful to publish a validated, versioned subset of that dataset. This will enable reuse while also avoiding repeated downloads from the parent dataset.
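The categorisation and duplicate checks described in the steps above can be sketched with the Python standard library. This is only an illustration: the function names and the 30-day and 365-day thresholds are assumptions, and a real inventory would probably live in a spreadsheet or a small database.

```python
import hashlib
from collections import defaultdict
from datetime import date
from pathlib import Path


def categorise(last_access: date, today: date,
               active_days: int = 30, occasional_days: int = 365) -> str:
    """Label a dataset by how recently it was accessed (thresholds are assumptions)."""
    age = (today - last_access).days
    if age <= active_days:
        return "actively used"
    if age <= occasional_days:
        return "occasionally used"
    return "rarely used"


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under `root` whose content is byte-for-byte identical."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_duplicates(Path("data"))` on a hypothetical `data` directory returns groups of identical files. Hashing reads every byte, so on very large collections it may be worth comparing file sizes first and hashing only files whose sizes match.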
Move less
Frequently moving datasets between systems or locations, such as between cloud storage and local machines, creates a carbon footprint that is often not readily apparent. For example, downloading a very large dataset from cloud storage to your local machine can demand significant time and local storage capacity. Instead of moving the data for analysis, explore running the analysis where the data already sits, for example, by connecting to a remote server using SSH, using cloud-based notebooks, or submitting jobs directly on HPC systems. When you do need to export data, first try exporting only a sample to see whether that is sufficient.
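One way to export only a sample is to stream the first few rows and only the columns you need, so that the full dataset is never loaded or copied. The sketch below assumes the data are in CSV form and would be run where the data already sits; `export_sample` and the column names are hypothetical. Note that the first rows of a file are not a random sample, which may matter for some analyses.

```python
import csv
import io
from itertools import islice
from typing import Iterable, TextIO


def export_sample(source: TextIO, target: TextIO,
                  columns: Iterable[str], max_rows: int) -> int:
    """Copy only `columns` from the first `max_rows` data rows of a CSV stream.

    Rows are streamed one at a time, so the full dataset is never held in memory.
    Returns the number of data rows written.
    """
    columns = list(columns)
    reader = csv.DictReader(source)
    # extrasaction="ignore" drops any columns that were not requested.
    writer = csv.DictWriter(target, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    written = 0
    for row in islice(reader, max_rows):
        writer.writerow(row)
        written += 1
    return written


# In-memory stand-ins for a large remote file and a small local export.
full = io.StringIO(
    "timestamp,demand_mw,site\n"
    "2024-01-01T00:00,310,A\n"
    "2024-01-01T00:15,305,A\n"
    "2024-01-01T00:30,298,A\n"
)
sample = io.StringIO()
rows_written = export_sample(full, sample, ["timestamp", "demand_mw"], max_rows=2)
print(rows_written)
print(sample.getvalue())
```

With real files, `source` and `target` would be handles opened with `open(..., newline="")`, as the `csv` module documentation recommends.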
Compute less
Tracking and monitoring compute usage in data processing can help identify areas that can be made more efficient. For example:
- When performing analysis, build reusable preprocessing steps. This can be achieved by creating a shared, version-controlled pipeline for preprocessing the dataset, for example, using workflow tools such as Nextflow, Snakemake, or Apache Airflow. Moreover, once the dataset has been preprocessed, store this version for future reuse.
- If several researchers/developers are working on a project that uses a given dataset, you could provide a small sample for development, testing, prototyping, and debugging. This could be mentioned in the project's README or contributing guidelines. Furthermore, you could recommend that full-scale jobs be run only after the pipeline has been tested on the provided sample dataset.
- Use incremental workflows so that only what has changed is recomputed. It is also better to process the work in chunks, so that after a failure the pipeline can resume from the failed chunk rather than restart from scratch. For example, workflow tools such as targets (an R package for managing reproducible pipelines that run only the parts that have changed) or Nextflow (a workflow tool that supports resuming pipelines and tracking completed steps) ensure that up-to-date steps are skipped and only necessary computations are performed.
- Consider scheduling analyses that are not time-sensitive to run during hours when electricity generation is cleaner (this is called “temporal shifting”). If the work policy permits, you could also opt for “spatial shifting” by running jobs on servers in regions with lower carbon emissions. If neither kind of shifting is possible, you might at least avoid periods when carbon emissions are on the higher side. For example, tools such as the UK Carbon Intensity API or website can be used to check when electricity is generated with lower emissions.
- In general, if multiple collaborators are working on the same project, it is advisable to use version control through a collaborative hosting platform such as GitHub or GitLab. Features such as issues (in GitHub) help track overall project progress, which promotes transparency and reduces duplicated effort.
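The idea behind incremental, chunked workflows can be illustrated in a few lines: each chunk's result is cached alongside a fingerprint of its input, and a chunk is recomputed only when that fingerprint changes. This is a simplified sketch with hypothetical names (`run_incremental`, the cache layout), not a description of how targets or Nextflow work internally, and it assumes chunk names are safe to use as file names.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_incremental(chunks: dict[str, bytes],
                    process: Callable[[bytes], str],
                    cache_dir: Path) -> list[str]:
    """Process each chunk unless a cached result for identical input exists.

    Returns the names of the chunks that were actually (re)computed.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    recomputed = []
    for name, payload in chunks.items():
        fingerprint = hashlib.sha256(payload).hexdigest()
        marker = cache_dir / f"{name}.json"
        if marker.exists():
            cached = json.loads(marker.read_text())
            if cached["fingerprint"] == fingerprint:
                continue  # input unchanged: skip the computation
        result = process(payload)
        # Written only after success, so a failed chunk is retried on the next run.
        marker.write_text(json.dumps({"fingerprint": fingerprint, "result": result}))
        recomputed.append(name)
    return recomputed


def count_lines(payload: bytes) -> str:
    """Stand-in for an expensive preprocessing step."""
    return str(payload.count(b"\n"))


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    chunks = {"2023": b"a,b\n1,2\n", "2024": b"a,b\n3,4\n"}
    print(run_incremental(chunks, count_lines, cache))  # ['2023', '2024']
    chunks["2024"] = b"a,b\n3,4\n5,6\n"
    print(run_incremental(chunks, count_lines, cache))  # ['2024']
```

Because a marker file is written only after a chunk succeeds, rerunning after a failure picks up at the failed chunk and leaves completed chunks untouched.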
Real example
Electricity demand data
Electricity demand time series data are available from a distribution network operator for a period of 6 years. These data are at 15-minute resolution, which means there are 4*24 = 96 data points per day; hence, 96*365*6 = 210240 data points for the 6 years (ignoring leap days). A data science team of 5 individuals is required to predict the day-ahead electricity demand using this historical dataset. The data science team begins working on this; however, each individual creates a local copy of the data and performs data preprocessing (data cleaning, handling missing values and outliers, and creating features) on their local machines. This results in:
- Duplicated data storage
- Repeated computation
- Inconsistent assumptions
- Slow iteration
Green data practices can fix most of these issues. First, the parent dataset can be stored in a single location accessible to the data science team, such as a GitHub repository. Furthermore, before starting individual work, the team could meet to create a list of tasks (for example, using GitHub Issues) to track their work and avoid duplicate effort. They can then work on individual feature branches and eventually merge them into the main branch, sharing the work and updates across the team. The preprocessing step should be prioritised when working on the issues, so that it is done once and shared rather than repeated by each team member. With a shared, version-controlled repository, the data are stored once and the preprocessing is run once, which makes the analysis process more energy-efficient. The team could also agree on contributing guidelines for this project and add them to the repository for future developers.
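To make the example concrete, a shared preprocessing module in the team's repository might start as small as the sketch below. It checks the expected number of 15-minute readings and fills missing intervals by carrying the last known value forward. The function names are hypothetical, and forward-filling is only a placeholder assumption: the point is that the team agrees on one gap-handling rule and implements it once.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=15)
# 4 readings per hour * 24 hours = 96 readings per day.
POINTS_PER_DAY = timedelta(days=1) // INTERVAL


def expected_points(days: int) -> int:
    """Number of 15-minute readings expected over `days` full days."""
    return POINTS_PER_DAY * days


def fill_gaps(readings: dict[datetime, float],
              start: datetime, end: datetime) -> dict[datetime, float]:
    """Return a complete 15-minute series on [start, end).

    Missing intervals take the last known value (forward fill, an assumed rule).
    """
    filled: dict[datetime, float] = {}
    last_value = None
    stamp = start
    while stamp < end:
        if stamp in readings:
            last_value = readings[stamp]
        if last_value is None:
            raise ValueError(f"no reading at or before {stamp} to carry forward")
        filled[stamp] = last_value
        stamp += INTERVAL
    return filled


print(expected_points(365 * 6))  # 210240, ignoring leap days

# One day of made-up data with the 00:15 reading missing.
day_start = datetime(2024, 1, 1)
raw = {day_start + i * INTERVAL: 300.0 + i for i in range(POINTS_PER_DAY) if i != 1}
clean = fill_gaps(raw, day_start, day_start + timedelta(days=1))
print(len(raw), len(clean))
```

Keeping this logic in one reviewed module means all five team members build features on the same cleaned series, which addresses both the repeated computation and the inconsistent assumptions listed above.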
Takeaway message
- Guiding principles: Store less, move less, compute less.
- Create an inventory of your datasets to be able to track their usage (size, owner, last access, duplicates).
- Using a “per unit of work” metric will help track progress.
- Document your decisions either in the README or in the contributing guidelines, to help not only your collaborators but also your future self.
Further reading
Acknowledgements
This guide was written by Aman Singh and reviewed by William Haese-Hill.