
Document data
Documenting research data involves describing the research activities undertaken, how data are structured and organized, and the decisions made during the research process. What documentation is required in a project depends on the research field. The basic principle is that documentation should include the information needed by you or someone else (potentially from another scientific discipline) to analyze or understand the data collected in the project.
One way to gather this information is to use a data management plan, where you describe, for example, the folder structure of the project, how files are named, and what distinguishes different file versions. You can also note any additional documentation you need to support your memory and workflow routines. If you make occasional deviations from the data management plan, you should document these as well. However, if you frequently deviate from the plan, it may be worth revising it instead of continuously documenting exceptions.
In projects with multiple team members, it is beneficial to assign someone responsibility for ensuring that your documentation guidelines are followed.
If you are unsure about what needs to be documented, you can contact your local research data support team.
What should be documented?
Documentation can relate to different aspects of the project, and data materials can be documented at varying levels of detail. Some key aspects to describe are:
- How and why data have been collected, created, or modelled
- How different data files and versions are organized
- What changes have been made between different versions of data files
- The meaning of codes, abbreviations, variable names, etc.
- Which software and software versions were used for data processing and analysis
- Legal, ethical, or other restrictions that limit data reuse
- Whether (and how) data have been reused in other research projects.
You know best what level of detail is needed, given your expertise in the nature of your data. In short, you should document anything that may be crucial for understanding and analyzing the data. This is particularly important if new members join the project, if data will be analyzed later in the same or another research project, or if research results need to be verified, meaning that it must be possible to reproduce your research.
To avoid missing details, documentation should be a constant part of the research process. The best way to capture all relevant information – such as what you have done with the data, what decisions you have made, and what definitions you have used – is to document it immediately. Ideally, your documentation should be structured and continuously updated, but even an unstructured file with all the relevant information is better than having no documentation at all. The worst documentation is the one that does not exist.
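One item from the list above, the software and software versions used, can often be captured automatically rather than written down by hand. The following is a minimal sketch for a Python-based workflow, not a prescribed method. The function name `describe_environment` and the package names `numpy` and `pandas` are illustrative assumptions, so substitute the packages your project actually uses.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Return the Python version, operating system, and versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing, so the log stays complete.
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # "numpy" and "pandas" are placeholder package names for illustration.
    print(json.dumps(describe_environment(["numpy", "pandas"]), indent=2))
```

Saving this output next to the data each time an analysis is run gives a dated record of the software environment without extra manual work.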
Documentation software
Depending on the tools you use for data processing and analysis, there are different ways to document your work. Some analysis software includes built-in documentation functions that automatically track actions and version history with comments. Others have features that, while not explicitly designed for documentation, can still be used for this purpose, though they may require some additional work. Many analysis programs lack built-in documentation capabilities, and in such cases, documentation must be maintained in a separate file.
Examples of documentation options in software
Built-in functions: SPSS (used for survey data analysis) and Dedoose (for qualitative data analysis) include features for documenting variables.
Add-ins: Colectica for Excel is an add-in that enables variable documentation for observation or survey data.
Making use of existing features: Transana (for qualitative analysis of text, image, audio, and video files) and Kinovea (for motion analysis) include comment functions that can be used for data documentation. Comments can be exported, but you may need to compile the documentation manually.
Documentation for data reuse
When a project concludes and data are archived and potentially made accessible, a final version of the documentation should be compiled. This final documentation must be complete and comprehensible.
If the methodology behind a dataset is described in an Open Access research article, a secondary user of the dataset can be expected to have access to the article, which can serve as documentation. A preprint of the article can be included as a documentation file and later replaced with the published version.
However, if the methodology is not described in an openly accessible document, a secondary user will only have access to the dataset (including documentation files) and the data description. In such cases, the data description or documentation must include a detailed explanation of how the study was conducted; for example, how experiments were designed and conducted.
Simply referring to a published article or report is not considered sufficient documentation. Even if an Open Access article describes how data were collected or created, it is advisable to include a README file in the dataset, explaining how the contents of the data files relate to what is described in the article.
There are no strict requirements for how documentation should be formatted when sharing data, but it must contain enough information for data to be understood and reused by a secondary user. The level of detail required varies between projects and disciplines. You do not need to share everything, but you should consider which documentation files contain essential information that others need to analyze the data correctly.
Documentation files can include:
- Variable lists explaining variable names, codes, and abbreviations
- Questionnaires or survey forms
- Interview guides and protocols
- Codebooks and coding lists
- Links to articles or other publications
- Methodological descriptions or technical reports
- Information on data processing methods
- Syntaxes for derived variables
- Final project reports
- Instructions for custom software needed to process the data
- Field notes or logbooks
- Information on legal, ethical, or other restrictions that limit data reuse.
One way to summarize documentation is to write a README file. Many templates are available, such as the README file template developed by Cornell University.
A README file can contain extensive metadata and a description of research methodology, or it can simply describe the content and structure of the dataset. In the latter case, the README should also include a brief description of each file or folder in the dataset and their contents.
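The file-by-file description mentioned above can be drafted with a short script that walks the dataset folder and pairs each file with a description you supply. This is only a sketch under assumed conventions: the function name `build_file_listing`, the heading text, and the `TODO` placeholder are hypothetical and not part of any README standard.

```python
from pathlib import Path


def build_file_listing(dataset_dir, descriptions):
    """Return README text listing every file under dataset_dir with its description.

    descriptions maps relative file paths (with forward slashes) to short notes.
    Files without a note get a TODO marker so that gaps are easy to spot.
    """
    root = Path(dataset_dir)
    lines = ["FILES IN THIS DATASET", ""]
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        note = descriptions.get(rel, "TODO: describe this file")
        lines.append(f"{rel}: {note}")
    return "\n".join(lines) + "\n"
```

For example, `build_file_listing("my_dataset", {"data/survey.csv": "Survey responses"})` returns text that can be pasted into the README and completed by hand; `my_dataset` and the file name are placeholders.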
For tabular data, the README should include a variable list with full names and definitions of all variables in the data file. Other essential information includes measurement units and definitions for codes or symbols used to represent missing data.
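A variable list for a tabular file can be started from the file's own header row. The sketch below assumes a CSV file with one header line and treats empty cells and the code `NA` as missing; both are assumptions to adjust to your data. The function name `variable_list_template` and the column names of the template are hypothetical.

```python
import csv
import io


def variable_list_template(lines, missing_codes=("", "NA")):
    """Build one template row per variable from CSV input (a file object or lines).

    Full name, definition, and unit are left empty for the researcher to fill in.
    The count of missing values is computed from the data.
    """
    reader = csv.DictReader(lines)
    rows = list(reader)
    template = []
    for name in reader.fieldnames or []:
        values = [row.get(name) for row in rows]
        n_missing = sum(
            1 for v in values if v is None or v.strip() in missing_codes
        )
        template.append({
            "variable": name,
            "full_name": "",
            "definition": "",
            "unit": "",
            "missing_codes": "; ".join(repr(c) for c in missing_codes),
            "n_missing": n_missing,
        })
    return template


if __name__ == "__main__":
    # Made-up sample data, used only to show the shape of the output.
    sample = io.StringIO("id,age,income\n1,34,52000\n2,NA,\n")
    for entry in variable_list_template(sample):
        print(entry)
```

The resulting rows can be written out with `csv.DictWriter` and completed manually. The script only supplies the skeleton, while the definitions and units still have to come from you.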
Keep in mind that those wishing to reuse your research data may come from different disciplines. It is therefore helpful to ensure that the documentation is understandable to a broad audience. Defining abbreviations and providing methodological explanations – even if they are common in your field – can facilitate reuse by researchers from other domains.