Synthetic data

Synthetic data are artificially generated, fictitious data. Instead of modifying an existing dataset to make it less identifiable, a completely new dataset is generated, containing fictitious individuals and values. These data may be generated, in part or in full, from artificial sources such as statistical models or random generators. “Synthetic data” therefore does not denote a particular data type or file format; it is a category of data defined by the techniques used to create them.

When synthetic data are generated for the purpose of protecting personal data, sensitive values in the original dataset are replaced with values generated from a statistical model. Synthetic data can be created in many ways – for example, based on rules or by using trained machine learning models – and for a range of purposes, including privacy protection, data validation, and software testing.

[Figure: The synthetic data generation process]

Considerations for synthetic data based on personal information

Synthetic datasets that are generated from original data containing personal or sensitive information are often referred to, somewhat paradoxically, as “synthetic personal data”. Creating such data requires additional safeguards.

A key concern with synthetic personal data is the risk of re-identification. In some cases, synthetic data may be so realistic that it becomes possible to re-identify individuals from the real data used to train the model.

To reduce the re-identification risk, you should:

  • Document re-identification risk assessments using measures such as k-anonymity and quantify the differences from the original dataset.
  • Consider how outliers may affect the re-identification risk.
  • Re-evaluate your requirements for data fidelity to the original dataset. High fidelity to the original dataset can increase the risk of re-identification and may not be necessary or even desirable.
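As an illustration of the first point, the k-anonymity of a dataset can be estimated by counting how many records share each combination of quasi-identifiers: k is the size of the smallest such group. The sketch below uses only the Python standard library; the records and quasi-identifier names are hypothetical.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records sharing
    the same combination of quasi-identifier values. A higher k
    means a lower re-identification risk for those attributes."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical synthetic records with two quasi-identifiers
synthetic = [
    {"age_band": "30-39", "region": "north", "income": 41000},
    {"age_band": "30-39", "region": "north", "income": 38500},
    {"age_band": "40-49", "region": "south", "income": 52000},
    {"age_band": "40-49", "region": "south", "income": 47800},
    {"age_band": "40-49", "region": "south", "income": 50100},
]

print(k_anonymity(synthetic, ["age_band", "region"]))  # prints 2
```

Here the smallest group (30-39, north) contains two records, so k = 2; a record that forms a group of its own (k = 1) would be an outlier worth special attention.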

You should also consider the following:

  • Folder structure: If the original data are sensitive and cannot be shared, consider providing an empty placeholder file or a synthetic dataset with low fidelity.
  • Provision of sample data: When access to a dataset is restricted, a risk-free sample dataset can help users understand its structure and content before placing a request for full access.
  • Metadata and codebooks: You can improve the reusability of synthetic survey data by describing the variables in a standard-format codebook, rather than in a generic text file.
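As a minimal sketch of the last point, a codebook can be written as a simple CSV file with one row per variable. The variable names, file name, and column layout below are illustrative assumptions, not a prescribed standard.

```python
import csv

# Hypothetical codebook entries for a synthetic survey dataset:
# one row per variable, with type, allowed values, and a description.
codebook = [
    {"variable": "age_band", "type": "categorical",
     "values": "30-39; 40-49; 50-59", "description": "Respondent age group"},
    {"variable": "income", "type": "integer",
     "values": ">= 0", "description": "Annual income, synthetic values"},
]

with open("codebook.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["variable", "type", "values", "description"]
    )
    writer.writeheader()
    writer.writerows(codebook)
```

For real deposits, a community metadata standard (where one exists for your discipline) is preferable to an ad hoc layout like this one.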

When should I use synthetic data?

  • As documentation: Synthetic data can be used as documentation when you share restricted data that contain personal information. This can help recipients explore the contents and determine which variables, or how many observations, they might need from the actual dataset.
  • For exploratory analysis: Synthetic data can be used to test statistical relationships without accessing the actual dataset. This requires that the synthetic dataset is statistically similar to the real data: the variable distributions should resemble the actual distributions, and correlations and other dependencies between variables should be preserved.
  • As dummy data: Synthetic data may also be used as “dummy data” to develop or test methods or code without accessing real data. This type of synthetic data is typically generated using strictly generative tools. In such cases, the synthetic dataset does not need to be statistically similar to the real data, only structurally similar (i.e., containing the same variable names and data types). If the data mimic anything statistically, it might be in the form of generalizable distributions – such as a normal distribution within a population.
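Generating structurally similar dummy data of this kind needs little more than a schema mapping each variable name to a value generator. The sketch below uses only the Python standard library; the variable names and value ranges are invented for illustration.

```python
import random
import string

random.seed(42)  # reproducible dummy data

# Schema of the "real" dataset: variable name -> generator for that type.
# These names and ranges are illustrative, not taken from any real dataset.
schema = {
    "participant_id": lambda: "".join(random.choices(string.ascii_uppercase, k=6)),
    "age": lambda: random.randint(18, 90),
    "score": lambda: round(random.gauss(50, 10), 1),  # roughly normal, as generic dummy values
}

def make_dummy_rows(n):
    """Generate n rows that match the schema structurally,
    with no statistical relationship to any real data."""
    return [{name: gen() for name, gen in schema.items()} for _ in range(n)]

rows = make_dummy_rows(100)
print(sorted(rows[0]))  # same variable names as the real data
```

Because the values carry no information about real individuals, data like these can be shared freely alongside test code.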

How do I create synthetic data?

Creating synthetic data requires specialized software tools. These tools use advanced algorithms and statistical models to generate datasets that preserve the statistical characteristics of the original data, while protecting sensitive information. The general process involves:

  • Data preparation: Prepare the original dataset by identifying and managing missing values, cleaning the data, and ensuring they are formatted correctly for modelling.
  • Model training: Train a statistical or machine learning model on the original data. The model learns the underlying patterns and distributions in the data.
  • Data generation: Use the trained model to generate a new dataset that reflects the statistical properties of the original dataset but contains entirely fictitious values.
  • Evaluation and validation: Evaluate the quality of the synthetic data by comparing their statistical properties with those of the original dataset to ensure that both privacy and usability are preserved.
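The four steps above can be sketched in a few lines of Python. This toy example "trains" only a per-column model of means and standard deviations (dedicated tools fit far richer models), and all names and values are hypothetical.

```python
import random
import statistics

random.seed(0)

# Step 1 - data preparation: a cleaned, numeric "original" dataset
# (random values standing in for real sensitive data).
original = {
    "height_cm": [random.gauss(170, 8) for _ in range(500)],
    "weight_kg": [random.gauss(72, 10) for _ in range(500)],
}

# Step 2 - model training: fit a simple per-column model
# (mean and standard deviation of each variable).
model = {col: (statistics.mean(v), statistics.stdev(v))
         for col, v in original.items()}

# Step 3 - data generation: sample entirely new values from the model.
synthetic = {col: [random.gauss(mu, sd) for _ in range(500)]
             for col, (mu, sd) in model.items()}

# Step 4 - evaluation: compare summary statistics of the two datasets.
for col in original:
    diff = abs(statistics.mean(original[col]) - statistics.mean(synthetic[col]))
    print(f"{col}: mean difference {diff:.2f}")
```

Note that sampling each column independently, as here, discards correlations between variables; real synthetic-data tools model such dependencies explicitly, which is exactly why the evaluation step matters.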

Examples of tools are described in the Tools section. You can also read more about synthetic data in a research article referenced in the Resources section.

Do you want to know more?

The study below provides an introduction to synthetic data by explaining what synthetic data are, why they may be useful, and how to use them.

  • Jordon, J., Szpruch, L., Houssiau, F., Bottarelli, M., Cherubin, G., Maple, C., Cohen, S. N. & Weller, A. (2022). Synthetic Data – what, why and how? arXiv:2205.03257.