Databases and spreadsheets

Although databases and spreadsheet software serve different functions, they are often used in similar ways to collect and store data (i.e., in rows and columns). This section focuses on data created using basic spreadsheet software (Excel) or more advanced statistical analysis programs (SPSS, STATA, etc.).

Databases

Databases can be categorized based on their structure.

Spreadsheets are the simplest form of database and typically consist of one or more sheets containing tabulated data. Spreadsheets may also include formulas, images, charts, and tables, as well as additional values, such as a column total calculated using different formulas.

Databases generally include more than just values and metadata. Users often interact with databases via forms used for data entry or calculations. These forms can be considered part of the database, even though they are separate from the stored values. Queries, results, and reports may also be viewed as ‘non-data components’ of the database. If these elements are essential for understanding or reusing the material, they should be stored alongside the database.

Types of databases

Hierarchical databases use a tree structure to store data in a parent-child relationship. Each record has a ‘parent’ that may have one or more ‘children,’ which in turn may have their own ‘children.’ This system is fast at retrieving individual records and allows easy addition or deletion of information, but it is memory-intensive and makes it difficult to link and mark relationships between ‘children’ in different parts of the structure.

Rectangular databases are structured like spreadsheets, with data arranged in horizontal rows and vertical columns, where different values or attributes are assigned to the object being studied.

Relational databases organize data into separate tables based on shared attributes. These tables are linked using key fields. A combination of one or more key fields can create a unique key.

Object-oriented databases store complex objects, such as multimedia files and CAD objects. Data are not typically stored in the database itself but as attributes within objects. When an application queries an object-oriented database, it may not search the database directly but instead instruct the object to execute a specific routine and return a result.

Key considerations

Spreadsheets can be shared in common formats (.xlsx, .ods) or exported as delimited text files (.csv, .tsv). These file formats are well known and widely supported, so compatibility issues are rare. In practice, the way data are structured within the file and how it is documented are more likely to affect its reusability. For delimited text files (.csv, .tsv), each file should contain a single logical table, with the first row serving as variable names or column headers, followed by data entries in subsequent rows.

Database files are less commonly used than spreadsheets, and many secondary users may find database files difficult to handle. To improve reusability, it is often helpful to export database contents to delimited text (.csv), provided that important relationships between tables are not lost.

If a database is shared with the intention of preserving the entire database structure (rather than just the content of its tables), it is advisable to create a text-based database dump. This allows secondary users to reconstruct the entire database from the exported text. These files typically use SQL syntax (.sql). To avoid compatibility issues between different database software, the SIARD (.siard) standard has been developed, storing a database dump in a technology-neutral XML format.

As an alternative to a database dump, databases can also be shared in a binary database format that can be quickly reused in many tools. A well-documented and open example of this is the SQLite standard (.db, .db3, .sqlite).

Statistical data often consist of extended tabular data where variables, formulas, and other elements are accompanied by metadata. Unfortunately, most statistical software formats are proprietary. Exporting statistical data to delimited text or spreadsheet formats can make data analysis more challenging for secondary users, as metadata are then separated from the actual data. However, SPSS (.sav, .por) and STATA (.dta) formats are well known and supported by some open-source software.

It is possible to share the same dataset in multiple formats. For example, a table from an .sav file can also be provided as a delimited text file (.csv). If formulas, scripts, or other software-specific syntax are included in statistical data files, these can be exported as separate text files to help secondary users review the processing workflow, even if they cannot open the proprietary file formats.

Best practices:

If possible, use controlled vocabularies and established keyword lists when entering data in databases and spreadsheets.
Be consistent and use meaningful, easily understandable names for tables/sheets and rows/columns. Keep in mind that tables and spreadsheets may not always be stored within the same file.
Avoid using formatting and page layout features to indicate the significance of values in a spreadsheet. If such layout features are necessary, save the formatted version separately, e.g., in PDF/A format, to preserve the visual appearance, as formatting is often lost when exporting tabular data to text-based formats.

What should be preserved?

The core components of databases and spreadsheets are the data tables/sheets, along with documentation and metadata that describe their contents and relationships.

Essential elements for preservation include:

Values: Cell headers and their associated values. A file may contain multiple sheets or tables.

Images: Figures, charts, and tables should be stored separately. It is possible to link to external files to show relationships between different elements. Clearly document file locations and use descriptive names to indicate their contents.

Relationships: In databases (and sometimes spreadsheets), documenting relationships between tables/sheets is crucial for maintaining data integrity.

Formulas, queries, and macros: If a file contains complex formulas or queries, these should be preserved in a separate text file so that the spreadsheet’s functionality can be reconstructed in the future.

Comments and annotations: Ensure that any comments or notes are retained. These should be saved in a separate file, clearly indicating which file and cell each comment belongs to.

Hidden or protected data: Some spreadsheets contain hidden or protected cells. Identify these to ensure that no information is lost.

Recommended file formats for sharing

Microsoft Excel, formally Office Open XML Workbook format (.xlsx),
OpenDocument Spreadsheet (.ods)
Delimited text (.csv, .tsv)
SQL syntax in a text file (.sql)
SIARD (.siard)
SQLite (.db, .db3, .sqlite)
SPSS (.sav, .por)
STATA (.dta)
R (.rdata, .rda)

For more information on file formats for text, see the ARIADNE guide Databases and spreadsheets: A guide to good practiceOpens in a new tab.The guides have been developed by SND and translated into English in cooperation with the EU-funded infrastructure ARIADNEOpens in a new tab. ARIADNE is responsible for updating the English guides and keeping them accessible.