Guidance with examples

It is difficult to provide universally applicable rules for handling personal data in research. In most cases, the specific circumstances determine how data should be managed, and many situations therefore require individual assessment. The purpose of this page is to give an overview of some key concepts and concrete examples of what counts as personal data and how such data may be handled in different scenarios.

The examples are based on SND’s practical experience of supporting researchers with questions related to personal data management. This guidance is intended as a support tool for making assessments, but it does not replace local regulations at your university or research organization.

The risk of re-identification – the core challenge

One aspect that always needs to be considered is the risk of re-identification, sometimes referred to as indirect or backdoor identification. Re-identification of personal data means that an individual can be identified by combining different types of data with so-called indirect identifiers – for example, occupation, age, or geographical location. This makes it possible, through supplementary information, to link a person to data in a dataset, even when directly identifying details such as name or personal identity number are absent. If an individual can be identified by using such supplementary information, the dataset as a whole is regarded as personal data. Even if the risk of re-identification is low, it still needs to be assessed, and ideally the assessment should be documented.

How much information is needed to re-identify an individual?

A well-known example of re-identification comes from the United States, where researcher Latanya Sweeney demonstrated in her article k-Anonymity: A Model for Protecting Privacy (PDF) that 87 per cent of the US population could be uniquely identified using only postcode, gender, and date of birth.
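The uniqueness behind this result can be illustrated with a few lines of code. The sketch below uses entirely made-up records and counts how many rows share each combination of indirect identifiers; a row with a combination of its own (a class of size 1) is a candidate for re-identification, and the smallest class size is the dataset's k in the k-anonymity sense.

```python
from collections import Counter

# Hypothetical records: indirect identifiers only (postcode, gender, birth year).
records = [
    ("41319", "F", 1958),
    ("41319", "F", 1958),
    ("41319", "M", 1958),
    ("52500", "M", 1960),
    ("52500", "M", 1960),
    ("11145", "F", 1975),
]

# Size of each equivalence class: rows sharing the same identifier combination.
class_sizes = Counter(records)

# A row is unique if no other row shares its combination; such rows are the
# prime candidates for re-identification via supplementary information.
unique_rows = [r for r in records if class_sizes[r] == 1]

# The dataset is k-anonymous for k equal to the smallest class size.
k = min(class_sizes.values())

print(len(unique_rows))  # 2 rows stand out on their own
print(k)                 # k = 1, so the dataset offers no anonymity guarantee
```

The same counting logic scales to real survey data: as more indirect identifiers are combined, class sizes shrink and the share of unique rows grows quickly.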

To reduce the risk, data can sometimes be processed, for example through recoding and grouping of variables, such as dividing age into intervals or using broader geographical categories. Although the dataset may still contain sensitive personal data, these measures can reduce the risk of re-identification. You can read more about how to reduce this risk under Methods in the Handbook for data containing personal information.

Safeguards – how data are stored and managed

When research data contain personal information, appropriate safeguards must be put in place. What types of safeguards are required will vary depending on the sensitivity of the data and the level of privacy risk they may pose to individuals.

  • Sensitive personal data require extensive safeguards, such as procedures for confidentiality review before any disclosure, secure storage with access control, and ensuring that only metadata are openly shared, while the actual data files are made accessible upon request.
  • Personal data with minimal need for protection may, in some cases, be shared openly. In such instances, it may be sufficient to inform users that the data could fall under GDPR, and that they are responsible for identifying a legal basis and purpose under GDPR for any further processing.

Safeguards can be both technical (secure storage, access management) and administrative (confidentiality review, user agreements). Most higher education institutions today offer some form of secure storage for data containing personal information.

Assessment of privacy risks and appropriate safeguards must always be carried out on a case-by-case basis, within the specific research context. For support in classifying the sensitivity of your data and determining how any potential disclosure should be managed, you should consult your local research data support, Data Protection Officer, or university legal counsel.

Transparency, information, and documentation

A fundamental privacy-protection measure is to inform research participants about what personal data are collected and how they will be processed. Without such information, individuals cannot exercise their rights under the GDPR. For researchers, informing participants is also central from an ethical perspective: informed consent is only valid if participants have received accurate and comprehensible information in advance. It is therefore advisable to explain already at the data collection stage that research data may be preserved and made available in repositories to enable new research or reviews.

It is also important to document the considerations and measures you take to protect research data. Documentation helps you keep track of how data have been managed and can be crucial in the event of an audit or personal data breach. This documentation can be incorporated into a data management plan or be set out in more detail through a data protection impact assessment (DPIA). If you are uncertain about what you are required to do, you should contact your local research data support service.

Templates for consent and documentation

On the website of the Swedish Ethical Review Authority, you can find support templates for creating informed consent forms for participants in research studies. The Swedish Authority for Privacy Protection (IMY) provides a template for assessing the need for a DPIA (in Swedish), which serves both as a decision-making tool and a form of documentation. IMY also has information in English on impact assessments and prior consultation, and refers to the EDPB's Guidelines on Data Protection Impact Assessment (DPIA).

Examples of personal data and how they may be handled

Indirect personal data

Even research data that do not contain directly identifying personal details can be indirect personal data, if the information can be traced to an individual by means of supplementary data. To determine what level of safeguards is required, it is necessary to assess the potential privacy risks that may arise when the data are managed or shared. If the dataset contains information that may be regarded as highly sensitive, it should be handled through secure storage and only shared after a confidentiality review. This could, for example, include indirect health information, political opinions, or information about a large number of individuals. If the privacy risks are low, it may be sufficient to inform users that the dataset could include personal data and that anyone downloading it must comply with GDPR.

Scenario

A research group conducts an electoral survey using questionnaires sent out by regular post. The original data include directly identifying information such as names, personal identity numbers, and street addresses. These details are permanently deleted after the data collection phase, but information about gender, age, occupation, and place of residence remains.

In the dataset, there is only one male respondent from a municipality of around 2,500 inhabitants, aged 65 and working as a priest. This person can easily be identified using supplementary information such as data from Statistics Sweden or other online search services, and sensitive information about political sympathies could then be linked to him.

The dataset is therefore considered to contain sensitive personal data and must be handled with appropriate safeguards. As a researcher, you cannot share such data openly.

Pseudonymized personal data with a key code

When accompanied by a key code, a pseudonymized dataset can still point to an individual, and the information must therefore be regarded as personal data. Such data must always be placed behind an access request barrier and handled with restrictions.

Scenario

A research group conducts a panel study on lifestyle and political opinions, collecting data through online questionnaires. The original data, which include directly identifying information such as e-mail addresses, are stored securely.

A separate dataset is created without e-mail addresses, but with a matching serial number. Indirect identifiers such as age, education, income, and place of residence are recoded into categories broad enough to prevent the identification of individual participants in the dataset.

However, the new dataset must still be considered personal data, since the serial number functions as a key code that can link the new dataset back to the original data, which contain directly identifying information. This means that an individual could potentially be identified using the key code, and the datasets are therefore, taken together, regarded as personal data. Such data must always be placed behind an access request barrier and handled with restrictions.

Pseudonymized personal data without a key code

Even if the key code or other supplementary information is kept separately from the dataset by another organization, the dataset is still regarded as indirect personal data, similar to the example above with pseudonymized data with a key code. As a rule, such data must be placed behind an access request barrier and cannot be shared openly.

Scenario

A researcher conducts an interview study with patients who have undergone chemotherapy. The data contain personal identity numbers, which are replaced with serial numbers in a new, separate dataset that only includes information about the participants’ health and their experiences of the care they have received.

The original data, with personal identity numbers and matching serial numbers, are stored separately by the healthcare region where the interviewees received treatment and are not directly accessible to the researcher.

Even though the new dataset does not contain directly identifying information, it must still be regarded as sensitive personal data. In this case, it does not matter that the researcher does not have access to the key code, since there remains a risk – however small – that individuals could be identified if the dataset were to be combined with the key code. As a rule, such data must be placed behind an access request barrier and cannot be shared openly.


Different versions of the same dataset

It is not uncommon for researchers to create different versions of the same dataset for different purposes. If any of the datasets contain direct or indirect personal data, the same principles generally apply as in the examples with and without a key code – even if the new versions do not include a directly matching serial number. This is because, in many cases, it is possible to re-identify individuals by comparing different versions of the same dataset.

Even if background variables have been removed or recoded, the combination of remaining responses may still be unique enough to link back to the corresponding individual in the original, unaltered dataset. In this way, individuals can be identified even if the material does not have a serial number or an obvious key code.

Scenario

A researcher has a dataset that does not contain any directly identifying information but does include indirect identifiers. The researcher wants to share the data openly and therefore creates a new version where age and income are recoded into broad categories that make it impossible to identify individuals. In addition, information about place of residence and income is completely removed.

The dataset still contains about ten multiple-choice questions on political opinions, each with five response options. The combinations of these responses can create unique patterns that make it possible, for example with the help of statistical software, to match the processed version against the original dataset. In this way, it becomes possible to identify which row in the original data corresponds to a specific respondent.
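The matching step can be sketched concretely. The records below are invented; the sketch shows that a row whose answer pattern is unique in the released version can be linked straight back to its row in the original data, while rows that share a pattern remain ambiguous.

```python
from collections import Counter

# Hypothetical original rows: a row id plus answers to ten questions (1-5).
original = {
    "row_1": (1, 3, 5, 2, 2, 4, 1, 5, 3, 2),
    "row_2": (2, 2, 4, 1, 5, 3, 3, 1, 2, 4),
    "row_3": (1, 3, 5, 2, 2, 4, 1, 5, 3, 2),  # same pattern as row_1
}

# The "open" version keeps the answers but drops all background variables.
released = list(original.values())

pattern_counts = Counter(original.values())

def match_back(pattern):
    """Link a released row back to the original if its pattern is unique."""
    if pattern_counts[pattern] != 1:
        return None  # ambiguous: at least two respondents share this pattern
    return next(rid for rid, p in original.items() if p == pattern)

print(match_back(released[1]))  # row_2 is re-identified
print(match_back(released[0]))  # None: row_1 and row_3 are indistinguishable
```

With ten questions and five options each there are nearly ten million possible patterns, so in a typical survey most respondents have a pattern of their own, which is exactly why removing background variables alone is not enough.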

This means that details from the processed version can be linked back to the unaltered material, creating a risk of re-identification. The data must therefore still be regarded as personal data. For this reason, they should not be published openly but instead made available behind an access request barrier and handled with clear restrictions.


Geographical coordinates

Geographical coordinates are common in many types of research data. If coordinates can be linked to a property, and that property in turn can be linked to an individual (the property owner), the information may constitute indirect personal data.

To identify which individual is associated with a property, information usually has to be requested from Lantmäteriet, the Swedish mapping, cadastral and land registration authority. In many cases, such information does not represent a major intrusion into personal privacy – for example, data on bird migration routes over a property. Other types of information, however, may be more sensitive, such as the presence of pests, since this could affect the value of the property and, in turn, an individual’s financial situation.

Scenario 1

A research project collects data on plant diseases. The coordinates identify individual properties and show where pests have been found. Even though no names appear in the dataset, it can still be indirectly linked to a specific property owner and must therefore be regarded as personal data. Since the information could have financial consequences for the property owner, the data must be handled with specific restrictions.

Scenario 2

A research project collects data on pollination by tracking the movements of bumblebees within a particular area. The area covers several properties, and in some cases individual property owners could be identified through Lantmäteriet’s databases. The material is therefore classified as personal data.

However, the data do not present any risk of privacy, physical, or financial harm to the individuals concerned – bumblebee flight routes do not affect property owners in a way that could negatively impact them. On this basis, it is considered possible to share the data openly, provided that the researcher informs anyone downloading the dataset that it may contain personal data.

References, citations, and bibliographic studies

Authors conducting bibliographic studies and citing other sources, such as books or articles, must refer to sources clearly and accurately. This means that research data may include the names of authors or other creators, which are considered personal data. In bibliometric studies, where data on publications, citations, and authorship are analyzed, this type of personal data forms the very basis of the research.

Providing references is both a legal obligation under copyright law and an element of good research practice. Author lists, citations, and references are, by their nature, intended to be disseminated and are essential for scientific publishing.

On this basis, the dissemination of author lists, citations, and references is considered to fall under the exemption for academic expression in Article 85(2) of the GDPR. The processing of personal data that this entails is therefore not regarded by SND as being subject to the full requirements of the GDPR.

Scenario

A research group wants to study publishing patterns in environmental science in the Nordic countries over the past 20 years. To do this, they collect data from international databases of scientific articles. The material includes information about article titles, author lists, citations, and the journals in which the articles were published.

Since the research is based on analyzing who published what, when, and where, the material contains personal data in the form of author names. However, these details are a necessary and integral part of the research. Without them, it would not be possible to study, for example, collaboration patterns between researchers, citation frequencies, or the development of the field over time. The data are therefore covered by the exemption for academic expression and may be disseminated and shared openly.

Previously published personal data

Research data may sometimes include personal data that have already been published, for example when a researcher collects data from social media. The fact that the data have been published before does not change their character: they are still personal data that can be traced to an individual.

It can be difficult to assess the degree of privacy intrusion, and an evaluation often needs to be made on a case-by-case basis. The assessment may differ depending on how much information is used in the research and how likely it is that the information can be linked back to an individual. People who post on social media may not expect their data to be used for research purposes, even if they have chosen to share the information. The evaluation is also affected by whether the data have been collected from closed forums or from openly accessible sources. In many cases, such data should only be made accessible on request and after review.

Scenario 1

A researcher studies right-wing populist opinion trends on Twitter/X over time, collecting data in the form of public statements made by users on the platform. Even though the posts are public, they contain personal data that can be traced to individual users, for example through account names, profile pictures, or quotations.

The research value is high, but processing the data also poses risks to the individuals whose opinions and statements are analyzed. For example, mapping their political views involves processing sensitive personal data under GDPR. A careful assessment of the risks is therefore required, taking into account both the purpose of the research and the individual’s right to privacy.

In this case, it may be appropriate for the data to be shared only upon request and after specific review, and for the material to be anonymized or aggregated as far as possible before being made accessible to other researchers.

Scenario 2

A PhD student wants to analyse posts from a closed Facebook group about childbirth experiences. Even though members published the information themselves, they perceived the forum as private. The data must therefore still be regarded as sensitive personal data. The dataset should only be made accessible on request and after a confidentiality review, not published openly.

Personal information about third parties in research data

Research data can sometimes contain information about a person other than the intended research participant, for example when a participant mentions a politician, a manager, or a relative. Such information is also considered personal data and falls under the GDPR.

The level of privacy risk can vary greatly depending on the context. Information about public figures in neutral contexts may be considered harmless and can be shared openly. Information about private individuals in sensitive contexts, on the other hand, could cause an invasion of privacy and should be handled with restrictions.

Scenario 1

In a workplace study, a participant talks about their manager and mentions the manager’s name in connection with negative comments. The manager has not taken part in the research but can still be identified since the workplace is relatively small. This means the information is personal data that must be safeguarded and cannot be shared openly.

Scenario 2

In an attitude survey, a large number of people are asked what they think about various Swedish politicians. The data collection was carried out by a private polling company, which deleted all direct identifiers before transferring the data to the research group. The only data available to the researchers are the attitudes expressed, together with the respondents’ gender and age. From this perspective, the data may be regarded as anonymous.

However, the names of the politicians commented on by the respondents are included, which means the data must still be considered personal data. This personal information is not regarded as a significant invasion of individual privacy, since it relates to public figures.

This means that the dataset can be shared openly, provided it is made clear that it contains personal data. Safeguards such as ensuring proper context and preventing misuse of the material may be advisable, but the overall risk level is considered low.

Data on criminal offences

Information about criminal offences is not classified as sensitive personal data under the GDPR, but it is nevertheless considered to require special protection and is regulated separately in Article 10.

Examples of such data include case numbers in court rulings or information about roles in legal proceedings (e.g., “the complainant”). Survey data concerning criminal behaviour, such as drug use, are also covered. Such information should always be handled with restrictions.

Scenario

A research group conducts a survey on health and lifestyle in which participants are asked about cannabis use. An affirmative response means that a person indirectly admits to committing a criminal offence. This makes the data particularly sensitive and in need of protection. The dataset must therefore be safeguarded and cannot be shared openly; it should only be made available after a confidentiality review and behind an access request barrier.