Guidance with examples

It is difficult to provide universally applicable rules for handling personal data in research. In most cases, the specific circumstances determine how data should be managed, and many situations therefore require individual assessment. The purpose of this page is to give an overview of some key concepts and concrete examples of what counts as personal data and how such data may be handled in different scenarios.

The examples are based on SND’s practical experience of supporting researchers with questions related to personal data management. This guidance is intended as a support tool for making assessments, but it does not replace local regulations at your university or research organization.

The risk of re-identification – the core challenge

One aspect that always needs to be considered is the risk of re-identification, sometimes referred to as indirect or backdoor identification. Re-identification of personal data means that an individual can be identified by combining different types of data with so-called indirect identifiers – for example, occupation, age, or geographical location. This makes it possible, through supplementary information, to link a person to data in a dataset, even when directly identifying details such as name or personal identity number are absent. If an individual can be identified by using such supplementary information, the dataset as a whole is regarded as personal data. Even if the risk of re-identification is low, it still needs to be assessed, and ideally the assessment should be documented.

How much information is needed to re-identify an individual?

A well-known example of re-identification comes from the United States, where researcher Latanya Sweeney demonstrated in her article k-Anonymity: A Model for Protecting Privacy (PDF) that 87 per cent of the US population could be uniquely identified using only postcode, gender, and date of birth.
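The uniqueness behind this result can be illustrated with a few lines of code. The sketch below uses entirely made-up records and counts how many rows share each combination of indirect identifiers; a row with a combination of its own (a class of size 1) is a candidate for re-identification, and the smallest class size is the dataset's k in the k-anonymity sense.

```python
from collections import Counter

# Hypothetical records: indirect identifiers only (postcode, gender, birth year).
records = [
    ("41319", "F", 1958),
    ("41319", "F", 1958),
    ("41319", "M", 1958),
    ("52500", "M", 1960),
    ("52500", "M", 1960),
    ("11145", "F", 1975),
]

# Size of each equivalence class: rows sharing the same identifier combination.
class_sizes = Counter(records)

# A row is unique if no other row shares its combination; such rows are the
# prime candidates for re-identification via supplementary information.
unique_rows = [r for r in records if class_sizes[r] == 1]

# The dataset is k-anonymous for k equal to the smallest class size.
k = min(class_sizes.values())

print(len(unique_rows))  # 2 rows stand out on their own
print(k)                 # k = 1, so the dataset offers no anonymity guarantee
```

The same counting logic scales to real survey data: as more indirect identifiers are combined, class sizes shrink and the share of unique rows grows quickly.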

To reduce the risk, data can sometimes be processed, for example through recoding and grouping of variables, such as dividing age into intervals or using broader geographical categories. Although the dataset may still contain sensitive personal data, these measures can reduce the risk of re-identification. You can read more about how to reduce this risk under Methods in the Handbook for data containing personal information.

Safeguards – how data are stored and managed

When research data contain personal information, appropriate safeguards must be put in place. What types of safeguards are required will vary depending on the sensitivity of the data and the level of privacy risk they may pose to individuals.

  • Sensitive personal data require extensive safeguards, such as procedures for confidentiality review before any disclosure, secure storage with access control, and ensuring that only metadata are openly shared, while the actual data files are made accessible upon request.
  • Personal data with minimal need for protection may, in some cases, be shared openly. In such instances, it may be sufficient to inform users that the data could fall under GDPR, and that they are responsible for identifying a legal basis and purpose under GDPR for any further processing.

Safeguards can be both technical (secure storage, access management) and administrative (confidentiality review, user agreements). Most higher education institutions today offer some form of secure storage for data containing personal information.

Assessment of privacy risks and appropriate safeguards must always be carried out on a case-by-case basis, within the specific research context. For support in classifying the sensitivity of your data and determining how any potential disclosure should be managed, you should consult your local research data support, Data Protection Officer, or university legal counsel.

Transparency, information, and documentation

A fundamental privacy-protection measure is to inform research participants about what personal data are collected and how they will be processed. Without such information, individuals cannot exercise their rights under the GDPR. For researchers, informing participants is also central from an ethical perspective: informed consent is only valid if participants have received accurate and comprehensible information in advance. It is therefore advisable to explain already at the data collection stage that research data may be preserved and made available in repositories to enable new research or reviews.

It is also important to document the considerations and measures you take to protect research data. Documentation helps you keep track of how data have been managed and can be crucial in the event of an audit or personal data breach. This documentation can be incorporated into a data management plan or be set out in more detail through a data protection impact assessment (DPIA). If you are uncertain about what you are required to do, you should contact your local research data support service.

Templates for consent and documentation

On the website of the Swedish Ethical Review Authority, you can find support templates for creating informed consent forms for participants in research studies. The Swedish Authority for Privacy Protection (IMY) provides a template for assessing the need for a DPIA (in Swedish), which serves both as a decision-making tool and a form of documentation. IMY also has information in English on impact assessments and prior consultation, and refers to the EDPB's Guidelines on Data Protection Impact Assessment (DPIA).

Examples of personal data and how they may be handled

Indirect personal data

Even research data that do not contain directly identifying personal details can be indirect personal data, if the information can be traced to an individual by means of supplementary data. To determine what level of safeguards is required, it is necessary to assess the potential privacy risks that may arise when the data are managed or shared. If the dataset contains information that may be regarded as highly sensitive, it should be handled through secure storage and only shared after a confidentiality review. This could, for example, include indirect health information, political opinions, or information about a large number of individuals. If the privacy risks are low, it may be sufficient to inform users that the dataset could include personal data and that anyone downloading it must comply with GDPR.

Scenario

A research group conducts an electoral survey using questionnaires sent out by regular post. The original data include directly identifying information such as names, personal identity numbers, and street addresses. These details are permanently deleted after the data collection phase, but information about gender, age, occupation, and place of residence remains.

In the dataset, there is only one male respondent from a municipality of around 2,500 inhabitants, aged 65 and working as a priest. This person can easily be identified using supplementary information such as data from Statistics Sweden or other online search services, and sensitive information about political sympathies could then be linked to him.

The dataset is therefore considered to contain sensitive personal data and must be handled with appropriate safeguards. As a researcher, you cannot share such data openly.

Pseudonymized personal data with a key code

When accompanied by a key code, a pseudonymized dataset can still point to an individual, and the information must therefore be regarded as personal data. Such data must always be placed behind an access request barrier and handled with restrictions.

Scenario

A research group conducts a panel study on lifestyle and political opinions, collecting data through online questionnaires. The original data, which include directly identifying information such as e-mail addresses, are stored securely.

A separate dataset is created without e-mail addresses, but with a matching serial number. Indirect identifiers such as age, education, income, and place of residence are recoded into categories broad enough to prevent the identification of individual participants in the dataset.

However, the new dataset must still be considered personal data, since the serial number functions as a key code that can link the new dataset back to the original data, which contain directly identifying information. This means that an individual could potentially be identified using the key code, and the datasets are therefore, taken together, regarded as personal data. Such data must always be placed behind an access request barrier and handled with restrictions.

Pseudonymized personal data without a key code

Even if the key code or other supplementary information is kept separately from the dataset by another organization, the dataset is still regarded as indirect personal data, similar to the example above with pseudonymized data with a key code. As a rule, such data must be placed behind an access request barrier and cannot be shared openly.

Scenario

A researcher conducts an interview study with patients who have undergone chemotherapy. The data contain personal identity numbers, which are replaced with serial numbers in a new, separate dataset that only includes information about the participants’ health and their experiences of the care they have received.

The original data, with personal identity numbers and matching serial numbers, are stored separately by the healthcare region where the interviewees received treatment and are not directly accessible to the researcher.

Even though the new dataset does not contain directly identifying information, it must still be regarded as sensitive personal data. In this case, it does not matter that the researcher does not have access to the key code, since there remains a risk – however small – that individuals could be identified if the dataset were to be combined with the key code. As a rule, such data must be placed behind an access request barrier and cannot be shared openly.


Different versions of the same dataset

It is not uncommon for researchers to create different versions of the same dataset for different purposes. If any of the datasets contain direct or indirect personal data, the same principles generally apply as in the examples with and without a key code – even if the new versions do not include a directly matching serial number. This is because, in many cases, it is possible to re-identify individuals by comparing different versions of the same dataset.

Even if background variables have been removed or recoded, the combination of remaining responses may still be unique enough to link back to the corresponding individual in the original, unaltered dataset. In this way, individuals can be identified even if the material does not have a serial number or an obvious key code.

Scenario

A researcher has a dataset that does not contain any directly identifying information but does include indirect identifiers. The researcher wants to share the data openly and therefore creates a new version where age and income are recoded into broad categories that make it impossible to identify individuals. In addition, information about place of residence and income is completely removed.

The dataset still contains about ten multiple-choice questions on political opinions, each with five response options. The combinations of these responses can create unique patterns that make it possible, for example with the help of statistical software, to match the processed version against the original dataset. In this way, it becomes possible to identify which row in the original data corresponds to a specific respondent.
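The matching step can be sketched concretely. The records below are invented; the sketch shows that a row whose answer pattern is unique in the released version can be linked straight back to its row in the original data, while rows that share a pattern remain ambiguous.

```python
from collections import Counter

# Hypothetical original rows: a row id plus answers to ten questions (1-5).
original = {
    "row_1": (1, 3, 5, 2, 2, 4, 1, 5, 3, 2),
    "row_2": (2, 2, 4, 1, 5, 3, 3, 1, 2, 4),
    "row_3": (1, 3, 5, 2, 2, 4, 1, 5, 3, 2),  # same pattern as row_1
}

# The "open" version keeps the answers but drops all background variables.
released = list(original.values())

pattern_counts = Counter(original.values())

def match_back(pattern):
    """Link a released row back to the original if its pattern is unique."""
    if pattern_counts[pattern] != 1:
        return None  # ambiguous: at least two respondents share this pattern
    return next(rid for rid, p in original.items() if p == pattern)

print(match_back(released[1]))  # row_2 is re-identified
print(match_back(released[0]))  # None: row_1 and row_3 are indistinguishable
```

With ten questions and five options each there are nearly ten million possible patterns, so in a typical survey most respondents have a pattern of their own, which is exactly why removing background variables alone is not enough.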

This means that details from the processed version can be linked back to the unaltered material, creating a risk of re-identification. The data must therefore still be regarded as personal data. For this reason, they should not be published openly but instead made available behind an access request barrier and handled with clear restrictions.


Geographical coordinates

Geographical coordinates are common in many types of research data. If coordinates can be linked to a property, and that property in turn can be linked to an individual (the property owner), the information may constitute indirect personal data.

To identify which individual is associated with a property, information usually has to be requested from Lantmäteriet, the Swedish mapping, cadastral and land registration authority. In many cases, such information does not represent a major intrusion into personal privacy – for example, data on bird migration routes over a property. Other types of information, however, may be more sensitive, such as the presence of pests, since this could affect the value of the property and, in turn, an individual’s financial situation.

Scenario 1

A research project collects data on plant diseases. The coordinates identify individual properties and show where pests have been found. Even though no names appear in the dataset, it can still be indirectly linked to a specific property owner and must therefore be regarded as personal data. Since the information could have financial consequences for the property owner, the data must be handled with specific restrictions.

Scenario 2

A research project collects data on pollination by tracking the movements of bumblebees within a particular area. The area covers several properties, and in some cases individual property owners could be identified through Lantmäteriet’s databases. The material is therefore classified as personal data.

However, the data do not present any risk of privacy, physical, or financial harm to the individuals concerned – bumblebee flight routes do not affect property owners in a way that could negatively impact them. On this basis, it is considered possible to share the data openly, provided that the researcher informs anyone downloading the dataset that it may contain personal data.

References, citations, and bibliographic studies

Authors conducting bibliographic studies and citing other sources, such as books or articles, must refer to sources clearly and accurately. This means that research data may include the names of authors or other creators, which are considered personal data. In bibliometric studies, where data on publications, citations, and authorship are analyzed, this type of personal data forms the very basis of the research.

Providing references is both a legal obligation under copyright law and an element of good research practice. Author lists, citations, and references are, by their nature, intended to be disseminated and are essential for scientific publishing.

On this basis, the dissemination of author lists, citations, and references is considered to fall under the exemption for academic expression in Article 85(2) of the GDPR. The processing of personal data that this entails is therefore not regarded by SND as being subject to the full requirements of the GDPR.

Scenario

A research group wants to study publishing patterns in environmental science in the Nordic countries over the past 20 years. To do this, they collect data from international databases of scientific articles. The material includes information about article titles, author lists, citations, and the journals in which the articles were published.

Since the research is based on analyzing who published what, when, and where, the material contains personal data in the form of author names. However, these details are a necessary and integral part of the research. Without them, it would not be possible to study, for example, collaboration patterns between researchers, citation frequencies, or the development of the field over time. The data are therefore covered by the exemption for academic expression and may be disseminated and shared openly.

Previously published personal data

Research data may sometimes include personal data that have already been published, for example when a researcher collects data from social media. The fact that the data have been published before does not change their character: they are still personal data that can be traced to an individual.

It can be difficult to assess the degree of privacy intrusion, and an evaluation often needs to be made on a case-by-case basis. The assessment may differ depending on how much information is used in the research and how likely it is that the information can be linked back to an individual. People who post on social media may not expect their data to be used for research purposes, even if they have chosen to share the information. The evaluation is also affected by whether the data have been collected from closed forums or from openly accessible sources. In many cases, such data should only be made accessible on request and after review.

Scenario 1

A researcher studies right-wing populist opinion trends on Twitter/X over time, collecting data in the form of public statements made by users on the platform. Even though the posts are public, they contain personal data that can be traced to individual users, for example through account names, profile pictures, or quotations.

The research value is high, but processing the data also poses risks to the individuals whose opinions and statements are analyzed. For example, mapping their political views involves processing sensitive personal data under GDPR. A careful assessment of the risks is therefore required, taking into account both the purpose of the research and the individual’s right to privacy.

In this case, it may be appropriate for the data to be shared only upon request and after specific review, and for the material to be anonymized or aggregated as far as possible before being made accessible to other researchers.

Scenario 2

A PhD student wants to analyse posts from a closed Facebook group about childbirth experiences. Even though members published the information themselves, they perceived the forum as private. The data must therefore still be regarded as sensitive personal data. The dataset should only be made accessible on request and after a confidentiality review, not published openly.

Personal information about third parties in research data

Research data can sometimes contain information about a person other than the intended research participant, for example when a participant mentions a politician, a manager, or a relative. Such information is also considered personal data and falls under the GDPR.

The level of privacy risk can vary greatly depending on the context. Information about public figures in neutral contexts may be considered harmless and can be shared openly. Information about private individuals in sensitive contexts, on the other hand, could cause an invasion of privacy and should be handled with restrictions.

Scenario 1

In a workplace study, a participant talks about their manager and mentions the manager’s name in connection with negative comments. The manager has not taken part in the research but can still be identified since the workplace is relatively small. This means the information is personal data that must be safeguarded and cannot be shared openly.

Scenario 2

In an attitude survey, a large number of people are asked what they think about various Swedish politicians. The data collection was carried out by a private polling company, which deleted all direct identifiers before transferring the data to the research group. The only data available to the researchers are the attitudes expressed, together with the respondents’ gender and age. From this perspective, the data may be regarded as anonymous.

However, the names of the politicians commented on by the respondents are included, which means the data must still be considered personal data. This personal information is not regarded as a significant invasion of individual privacy, since it relates to public figures.

This means that the dataset can be shared openly, provided it is made clear that it contains personal data. Safeguards such as ensuring proper context and preventing misuse of the material may be advisable, but the overall risk level is considered low.

Data on criminal offences

Information about criminal offences is not classified as sensitive personal data under the GDPR, but it is nevertheless considered to require special protection and is regulated separately in Article 10.

Examples of such data include case numbers in court rulings or information about roles in legal proceedings (e.g., “the complainant”). Survey data concerning criminal behaviour, such as drug use, are also covered. Such information should always be handled with restrictions.

Scenario

A research group conducts a survey on health and lifestyle in which participants are asked about cannabis use. An affirmative response means that a person indirectly admits to committing a criminal offence. This makes the data particularly sensitive and in need of protection. The dataset must therefore be safeguarded and cannot be shared openly; it should only be made available after a confidentiality review and behind an access request barrier.