Synthetic data

Synthetic data are artificially generated, fictitious data. Instead of modifying an existing dataset to make it less identifiable, a completely new dataset is generated, containing fictitious individuals and values. These data may be generated, in part or in full, from artificial sources such as statistical models or random generators. “Synthetic data” therefore does not denote a particular data type or file format; it is a category of data defined by the techniques used to create them.

When synthetic data are generated for the purpose of protecting personal data, sensitive values in the original dataset are replaced with values generated from a statistical model. Synthetic data can be created in many ways – for example, based on rules or by using trained machine learning models – and for a range of purposes, including privacy protection, data validation, and software testing.

[Figure: The synthetic data generation process]

Considerations for synthetic data based on personal information

Synthetic datasets that are generated from original data containing personal or sensitive information are often referred to, somewhat paradoxically, as “synthetic personal data”. Creating such data requires additional safeguards.

A key concern with synthetic personal data is the risk of re-identification. In some cases, synthetic data may be so realistic that it becomes possible to re-identify individuals from the real data used to train the model.

To reduce the re-identification risk, you should:

  • Document re-identification risk assessments using measures such as k-anonymity and quantify the differences from the original dataset.
  • Consider how outliers may affect the re-identification risk.
  • Re-evaluate your requirements for data fidelity to the original dataset. High fidelity to the original dataset can increase the risk of re-identification and may not be necessary or even desirable.
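As an illustration of the first point, the k-anonymity of a dataset can be estimated by counting how many records share each combination of quasi-identifiers: k is the size of the smallest such group. The sketch below uses only the Python standard library; the records and quasi-identifier names are hypothetical.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records sharing
    the same combination of quasi-identifier values. A higher k
    means a lower re-identification risk for those attributes."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical synthetic records with two quasi-identifiers
synthetic = [
    {"age_band": "30-39", "region": "north", "income": 41000},
    {"age_band": "30-39", "region": "north", "income": 38500},
    {"age_band": "40-49", "region": "south", "income": 52000},
    {"age_band": "40-49", "region": "south", "income": 47800},
    {"age_band": "40-49", "region": "south", "income": 50100},
]

print(k_anonymity(synthetic, ["age_band", "region"]))  # prints 2
```

Here the smallest group (30-39, north) contains two records, so k = 2; a record that forms a group of its own (k = 1) would be an outlier worth special attention.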

You should also consider the following:

  • Folder structure: If the original data are sensitive and cannot be shared, consider providing an empty placeholder file or a synthetic dataset with low fidelity.
  • Provision of sample data: When access to a dataset is restricted, a risk-free sample dataset can help users understand its structure and content before placing a request for full access.
  • Metadata and codebooks: You can improve the reusability of synthetic survey data by describing the variables in a standard-format codebook, rather than in a generic text file.
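As a minimal sketch of the last point, a codebook can be written as a simple CSV file with one row per variable. The variable names, file name, and column layout below are illustrative assumptions, not a prescribed standard.

```python
import csv

# Hypothetical codebook entries for a synthetic survey dataset:
# one row per variable, with type, allowed values, and a description.
codebook = [
    {"variable": "age_band", "type": "categorical",
     "values": "30-39; 40-49; 50-59", "description": "Respondent age group"},
    {"variable": "income", "type": "integer",
     "values": ">= 0", "description": "Annual income, synthetic values"},
]

with open("codebook.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["variable", "type", "values", "description"]
    )
    writer.writeheader()
    writer.writerows(codebook)
```

For real deposits, a community metadata standard (where one exists for your discipline) is preferable to an ad hoc layout like this one.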

When should I use synthetic data?

  • As documentation: Synthetic data can be used as documentation when you share restricted data that contain personal information. This can help recipients explore the contents and determine which variables, or how many observations, they might need from the actual dataset.
  • For exploratory analysis: Synthetic data can be used to test statistical relationships without accessing the actual dataset. This requires that the synthetic dataset is statistically similar to the real data: the variable distributions should resemble the actual distributions, and correlations and other dependencies between variables should be preserved.
  • As dummy data: Synthetic data may also be used as “dummy data” to develop or test methods or code without accessing real data. This type of synthetic data is typically generated using strictly generative tools. In such cases, the synthetic dataset does not need to be statistically similar to the real data, only structurally similar (i.e., containing the same variable names and data types). If the data mimic anything statistically, it might be in the form of generalizable distributions – such as a normal distribution within a population.
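Generating structurally similar dummy data of this kind needs little more than a schema mapping each variable name to a value generator. The sketch below uses only the Python standard library; the variable names and value ranges are invented for illustration.

```python
import random
import string

random.seed(42)  # reproducible dummy data

# Schema of the "real" dataset: variable name -> generator for that type.
# These names and ranges are illustrative, not taken from any real dataset.
schema = {
    "participant_id": lambda: "".join(random.choices(string.ascii_uppercase, k=6)),
    "age": lambda: random.randint(18, 90),
    "score": lambda: round(random.gauss(50, 10), 1),  # roughly normal, as generic dummy values
}

def make_dummy_rows(n):
    """Generate n rows that match the schema structurally,
    with no statistical relationship to any real data."""
    return [{name: gen() for name, gen in schema.items()} for _ in range(n)]

rows = make_dummy_rows(100)
print(sorted(rows[0]))  # same variable names as the real data
```

Because the values carry no information about real individuals, data like these can be shared freely alongside test code.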

How do I create synthetic data?

Creating synthetic data requires specialized software tools. These tools use advanced algorithms and statistical models to generate datasets that preserve the statistical characteristics of the original data, while protecting sensitive information. The general process involves:

  • Data preparation: Prepare the original dataset by identifying and managing missing values, cleaning the data, and ensuring they are formatted correctly for modelling.
  • Model training: Train a statistical or machine learning model on the original data. The model learns the underlying patterns and distributions in the data.
  • Data generation: Use the trained model to generate a new dataset that reflects the statistical properties of the original dataset but contains entirely fictitious values.
  • Evaluation and validation: Evaluate the quality of the synthetic data by comparing their statistical properties with those of the original dataset to ensure that both privacy and usability are preserved.
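The four steps above can be sketched in a few lines of Python. This toy example "trains" only a per-column model of means and standard deviations (dedicated tools fit far richer models), and all names and values are hypothetical.

```python
import random
import statistics

random.seed(0)

# Step 1 - data preparation: a cleaned, numeric "original" dataset
# (random values standing in for real sensitive data).
original = {
    "height_cm": [random.gauss(170, 8) for _ in range(500)],
    "weight_kg": [random.gauss(72, 10) for _ in range(500)],
}

# Step 2 - model training: fit a simple per-column model
# (mean and standard deviation of each variable).
model = {col: (statistics.mean(v), statistics.stdev(v))
         for col, v in original.items()}

# Step 3 - data generation: sample entirely new values from the model.
synthetic = {col: [random.gauss(mu, sd) for _ in range(500)]
             for col, (mu, sd) in model.items()}

# Step 4 - evaluation: compare summary statistics of the two datasets.
for col in original:
    diff = abs(statistics.mean(original[col]) - statistics.mean(synthetic[col]))
    print(f"{col}: mean difference {diff:.2f}")
```

Note that sampling each column independently, as here, discards correlations between variables; real synthetic-data tools model such dependencies explicitly, which is exactly why the evaluation step matters.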

Examples of tools are described in the Tools section. You can also read more about synthetic data in a research article referenced in the Resources section.

Do you want to know more?

The study below provides an introduction to synthetic data by explaining what synthetic data are, why they may be useful, and how to use them.

  • Jordon, J., Szpruch, L., Houssiau, F., Bottarelli, M., Cherubin, G., Maple, C., Cohen, S. N. & Weller, A. (2022). Synthetic Data – what, why and how? arXiv:2205.03257.