Don't be afraid of synthetic data

Posted by j.laird on 29 May 2020 - 9:00am Diagram of a use-case asking if results from synthetic datasets are equivalent to results from real datasets

Diagram of a use-case asking if results from synthetic datasets are equivalent to results from real datasets

Image by Will Hulme

By Software Sustainability Institute Fellow Will Hulme.

Datasets are often restricted for valid reasons such as patient confidentiality in health research. However, this can pose problems for the reproducibility and scrutiny of the research. Synthetic datasets can be used to overcome some of the problems that arise from working with restricted data, and more researchers need to be brave enough to start using them.

Problems with restricted data

Transparency and reproducibility are fundamental principles of scientific observation and discovery, ensuring that science is “show me” not “trust me”. But as health researchers, we also have to consider other pesky principles like informed consent, minimising harm and confidentiality. It’s why we don’t remove organs from unwilling prisoners. It’s why we induce tumours in mice instead of humans. And it’s why we prevent unrestricted access to sensitive research data.

Not sharing data might sometimes be justified, but it creates problems. It can prevent scrutiny of the differences between the research that was planned, conducted, and reported, leaving the evidence-base vulnerable to questionable research practices and conflicts of interest—anything from unconscious p-hacking to fraud. It can frustrate efforts to reuse data for original, independent analyses. Sometimes controlled access to patient-level data is possible, but the process is often slow, expensive, and onerous.

A possible solution

A partial solution to these problems is the use of synthetic datasets. A synthetic dataset mimics certain useful statistical properties of a real dataset but contains no information that could be used to re-identify individual patients, so the dataset can safely be shared.

Synthetic datasets are perhaps most often thought of as supporting the use-case described in the figure above: create a synthetic dataset from the real dataset, share it publicly, analyse the synthetic dataset as if it were the real dataset, and then, erm, hope that the analysis produces results that reliably replicate the results that would have been derived from real dataset.

In this use-case, we need high confidence that the synthetic results and the real results match, without actually being able to verify it. But I’m sceptical that, in all but the simplest or most contrived scenarios, methods for generating synthetic datasets are good enough to guarantee this. We might have some theoretical guarantees on how closely the synthetic dataset mimics the statistical properties of the real dataset in certain types of data, but we don’t have any guarantees on how closely the synthetic results mimic the real results for any arbitrary research question or statistical model.

Without verifying the synthetic results against the real results should we trust them? Should we use them to inform treatment or policy decisions? I don’t think so but I would love to see examples where this has been done. Still, perhaps we should be prioritising improvements in responsible access to sensitive datasets rather than fobbing researchers off with datasets that we can’t guarantee will give reliable results.

When to use synthetic data

There are, however, many other opportunities to use synthetic datasets in spite of - because of! - this unreliability. Let’s consider a few examples.

Ever tried to make progress on a project when you’ve been waiting months for access to the dataset? Synthetic datasets — Pow! Delays in data access or data collection are often the most disruptive bottleneck in a research project. Releasing freely-accessible synthetic versions of controlled datasets has the potential to stop infeasible studies early, reducing inefficiency, waste, and unnecessary re-identification risk. You can use synthetic datasets to write and test analysis scripts, resolve computational issues, design tables and figures… the list goes on. A script can be run immediately on the real dataset once access is granted, or else run by someone else who does have access. Sure, some changes may be necessary once the real dataset is available, but most major design hurdles will already have been jumped.

Ever felt guilty for tweaking your study design because the results weren’t what you wanted? Synthetic datasets — Bosh! Designs developed using synthetic datasets can be revised iteratively without concerns about data-dredging or post-hoc hypothesising since you know not to trust the results. This limits wriggle-room for you to bias your methods towards favourable results, consciously or unconsciously. Your study design should be agnostic to the results, after all. Synthetic datasets (or simulated datasets if data collection hasn’t happened yet and you need to generate data from scratch) can be used to pre-register studies in a manner much more precise than with vague allusions to “multiple imputation” or “non-parametric tests”, as the script will unambiguously encode the methodological design that converts the data into evidence.

Ever tried to adapt someone else’s code for your own use when you didn’t have access to the dataset it was intended for? Synthetic datasets — Kablamo! Even if you didn’t use synthetic dataset in the development of your study, if you can’t share your dataset then preparing a synthetic dataset afterwards facilitates the reusability of your analysis scripts. This helps other groups to repurpose or build on the study’s methods and increases the impact of your research.

Ever needed a big, messy ‘true-to-life’ dataset for teaching or training? Synthetic datasets — Doof! This doesn’t need much elaboration. Small or toy datasets are incredibly useful, but they have their limits. There’s only so much you can learn about complex datasets from iris and Gapminder.

Don’t be afraid to use synthetic data

It’s bonkers that synthetic datasets are not used more often to increase research integrity and reduce research waste.

I’m looking at you, writers of the phrase “the dataset for this study contains confidential patient-level data and cannot be shared” without making any attempt whatsoever to mitigate the problems of research transparency and reusability that this raises. I’m looking at you, PIs who obsess over being scooped, but aren’t concerned that your research isn’t reproducible. I’m looking at you, custodians of every single birth cohort, disease or treatment registry, electronic health record, trial dataset, or administrative database on the planet who make researchers wait over a year for access to your data.

We can do so much more to improve research transparency and efficiency using synthetic datasets without worrying whether they can be trusted to reliably replace real datasets entirely. The methods, tools and tutorials are already here. Use them!