Folder structure, file names, and versioning

To ensure that the research materials you work with are well-organized, it is essential to have a well-planned folder structure. Ideally, this structure should be set up at the start of the research project before data collection begins. It is also helpful if files are arranged in a way that is intuitive even to someone unfamiliar with the project, so that they can easily understand the information.

When starting a project:

  • Determine the guidelines for versioning, folder structure, and file naming in the project
  • Update and document any changes or deviations from these guidelines
  • If multiple people are involved in the project, assign someone the responsibility of ensuring that the guidelines are followed.

Folder structure

A well-organized folder structure makes it easy for all team members to locate files. It also serves as a template for how data should be stored and organized within the project.

A useful tip is to include a .txt file (a so-called README file) in the top-level folder, describing the structure and the approach taken regarding file names and versioning. If any changes are made to the folder structure later, these should be documented in the README file. In a project with several phases of data collection, you may create a separate folder for each round of data collection, using consistent names that describe the collected data, the collection context, and the date.

Folders should:

  • follow a structured hierarchy of folders and subfolders that reflect the project’s organization and workflow
  • have clear, descriptive names that are not longer than necessary
  • have unique names (avoid giving the same name to both a folder and a subfolder).
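
As an illustration only, the sketch below uses Python’s standard library to create a hypothetical folder hierarchy and a top-level README.txt; the project name, folder names, and README wording are assumptions, not a prescribed standard.

    from pathlib import Path

    # Hypothetical example project; adjust names and layout to your own project.
    project = Path("speechdata")
    folders = [
        "documentation",
        "data/collection1_interviews_2023-03",
        "data/collection2_interviews_2023-09",
        "analysis",
        "output",
    ]

    for folder in folders:
        (project / folder).mkdir(parents=True, exist_ok=True)

    # Top-level README describing the structure, file naming, and versioning rules.
    readme_lines = [
        "Project: speechdata (hypothetical example)",
        "documentation/  project documentation and guidelines",
        "data/           one subfolder per collection round (content_context_date)",
        "analysis/       scripts and intermediate files",
        "output/         final datasets and results",
        "File naming: <subject>_<content>_<version>_<status>.<ext>",
        "Versioning: new version number (v01, v02, ...) or ISO date per processing step",
    ]
    (project / "README.txt").write_text("\n".join(readme_lines) + "\n", encoding="utf-8")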

File names

A research project can quickly generate a large number of files, so it is important to decide on a file-naming convention in advance. This is particularly important when multiple people are creating and naming files.

A file name should:

  • be unique not only within its folder but ideally across the entire project. If a file is moved out of its folder, its name should indicate where it belongs
  • provide an idea of its content
  • be relatively short
  • include the version number in the name.
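
As a sketch only – the pattern and its fields are an assumption, not a fixed standard – a small helper like the one below can build file names from their components, which helps keep naming consistent when several people create files.

    def make_filename(subject: str, content: str, version: int,
                      status: str, extension: str) -> str:
        """Build a project-wide unique file name, e.g. speaker1_glossary_v01_clean.doc."""
        return f"{subject}_{content}_v{version:02d}_{status}.{extension}"

    print(make_filename("speaker1", "glossary", 1, "clean", "doc"))
    # -> speaker1_glossary_v01_clean.doc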

Versioning  

Research work often involves processing data files multiple times. Once data have been collected, they typically need to be cleaned and refined before the final dataset is created.

The simplest way to track such changes is to create different versions (local working copies) of the data files. This means that data files are not overwritten; instead, the results of each processing stage are saved as new files. The original version of the data usually comes from the initial data collection, and each subsequent version receives a new version number (e.g., v01, v02, v03) or is marked with the creation date in ISO format (e.g., 2023-08-28). By following a clear versioning practice, you can quickly locate the most recent version of a data file, document different versions, and understand where a specific file fits within the workflow.
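
A minimal sketch of this practice, with hypothetical file names: the current version is copied to a new file with the version number increased by one, and the old file is left untouched.

    import re
    import shutil
    from pathlib import Path

    def next_version(path: Path) -> Path:
        """Return the same file name with the version number vNN increased by one."""
        match = re.search(r"_v(\d+)", path.stem)
        if match is None:
            raise ValueError(f"No version number found in {path.name}")
        new_stem = path.stem.replace(
            f"_v{match.group(1)}", f"_v{int(match.group(1)) + 1:02d}")
        return path.with_name(new_stem + path.suffix)

    current = Path("speaker1_glossary_v01_clean.doc")
    new_copy = next_version(current)      # speaker1_glossary_v02_clean.doc
    shutil.copy2(current, new_copy)       # keep the old version, continue working in the copy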

Imagine you have the following data files in a project:

  • PeterS_glossary_17Jun.doc
  • speaker1_words_final.doc
  • speaker1_final2.doc
  • speakerPS_words_clean.doc

What are you looking at here? In what order were the files collected? What do they contain? How do they relate to one another? Is Peter S the same person as speaker1?

Now assume that the files were named:

  • speaker1_glossary_v00_orig.doc
  • speaker1_glossary_v01_clean.doc
  • speaker1_glossary_v02_clean.doc
  • speaker1_glossary_v03_clean_final.doc

or:

  • speaker1_glossary_2023-03-11_orig.doc
  • speaker1_glossary_2023-03-15_clean.doc
  • speaker1_glossary_2023-03-15_clean_v2.doc
  • speaker1_glossary_2023-03-16_clean_final.doc

Now you can see that all files contain speaker1’s reading of the words in a glossary. The files are different versions: the original file at the top, followed by cleaned versions of the original, and the final version at the bottom.

Version control

By keeping a log of changes in which you document when and how each file version was created, you make it easier to restore files from backup versions. The log also documents how the material has been processed, which is useful if someone later questions the project’s data or results, or if you need to go back and change the sequence of analyses.

Version control means that files, together with the changes and updates made to them, are gathered in a central data structure called a “repository” or “repo”. It offers a more sophisticated way of managing versions of data. Every change to a data file is documented, and changes to several files can be packaged together. Multiple users can collaborate on changes, merge concurrent edits, and resolve any conflicts, and each user can create a local copy in which to test changes.

Common open-source tools for version control of source code and text-based data files (.txt, .csv, .md) are Git and Subversion. Version control of application-specific files (e.g., .xlsx) may require a commercial solution.

Cloud-based platforms that offer hosting and collaboration tools – known as “code repositories” – include GitHub, Bitbucket, and GitLab. A code repository can also be stored locally, either on your own computer or managed by a Git/Subversion server for use by your research group.
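
As an illustration only – assuming Git is installed and the data files are text-based – the basic repository workflow can be scripted, here wrapped in Python’s subprocess module: create the repository once, then record each change to the data files as a commit with a descriptive message. The file name and the commit message are made up for the example.

    import subprocess

    def git(*args: str) -> None:
        """Run a git command in the current directory and stop on errors."""
        subprocess.run(["git", *args], check=True)

    git("init")                                # create the repository (run once)
    git("add", "speaker1_glossary_clean.csv")  # stage a changed data file (hypothetical name)
    git("commit", "-m", "Remove empty rows from speaker1 glossary")  # record the change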

Programmatic versioning

An alternative to versioning through file names is to write a script or program that reads the original data file and applies the processing steps to it. This is a common approach in statistical analysis software such as Stata, R, and SAS.
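
A minimal sketch of the idea in Python, assuming a hypothetical CSV file with a column named “word” and pandas installed; the cleaning step itself is only an example. The original file is never edited, and rerunning the script reproduces the cleaned version from it.

    import pandas as pd

    # Read the original data file (hypothetical name); the original is never modified.
    raw = pd.read_csv("speaker1_glossary_v00_orig.csv")

    # Example processing step: drop completely empty rows and normalize the word column.
    clean = raw.dropna(how="all").copy()
    clean["word"] = clean["word"].str.strip().str.lower()

    # Write the result as a new, versioned file; rerunning the script recreates it.
    clean.to_csv("speaker1_glossary_v01_clean.csv", index=False)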

You should document your code carefully so that other users can understand the processing steps. You can also publish the script along with the data; note that some high-ranking journals require that the analysis code be made accessible.