Replication package: The Prevalence of Code Review Guidelines for GUI-Based Testing in Open-Source
https://doi.org/10.5281/zenodo.18664471
Replication package for the study "The Prevalence of Code Review Guidelines for GUI-Based Testing in Open-Source"
Summary of files included:
results.xls contains the final results of mapping code review guidelines to observed code review comments in pull requests.
repositories_all.zip includes a CSV file listing all identified GitHub repositories that meet our search criteria.
repositories-top100.xlsx provides an overview of the top 100 (by star count) repositories that are considered in the study.
GitHub-Crawler-code-only includes the scripts used to gather all relevant data from GitHub.
GitHub-Crawler-with-data includes the scripts used to gather all relevant data from GitHub, with intermediate results from the crawling process, such as all pull requests and metadata for each repository. (~3.6 GB uncompressed)
CodeBook.xlsx provides a codebook that defines the guidelines and the rules for mapping comments to these guidelines.
Coding_with_IRR_metrics.zip contains the coding of a random sample of 200 PRs by an independent coder, including inter-rater reliability (IRR) metrics.
How to Read the Results
The file results.xls contains all pull requests (PRs) from 84 repositories that modify GUI-based test files. The columns in the main sheet are as follows:
Repository: The name of the GitHub repository.
PR_id: The identifier of a PR, unique only within its repository. The same ID may appear in several repositories because each repository numbers its PRs sequentially.
URL: A link that allows direct access to the PR on GitHub.
Guidelines: Our mapping of code review comments to previously proposed guidelines. If multiple guidelines are identified in a PR, they are separated by the pipe symbol "|".
Notes: Researcher notes regarding the mapping of comments to guidelines. These notes are used to highlight specific aspects of the PR that led to the mapping to a guideline.
Checked: The letter "y" indicates the PR was reviewed by both authors who performed the mapping. A "d" marks PRs that contain only emojis, which we do not consider as a code review.
Comments on File Changes (Threads): Code review comments that directly address file changes (called threads in GitHub). These are the comments used for the mapping to guidelines. Each thread consists of the path to the changed file, the comments from reviewers, and a horizontal line that marks the end of the thread (a visual aid for analysis).
Author of PR: The GitHub user who created the PR. This is relevant to determine whether a code review comment was made by the PR author, for example, to highlight a specific part of the change.
General PR Comments: General comments about the PR that do not target specific file changes. These comments were not used for the mapping to guidelines, but are included in this spreadsheet for additional context.
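The column conventions above can be sketched in a few lines of Python. This is only an illustration, not part of the replication package: it assumes each spreadsheet row has already been loaded as a dictionary keyed by the column names listed above, that empty cells are None or an empty string, and the names PullRequest, split_guidelines, and parse_row are hypothetical.

```python
from __future__ import annotations

from typing import NamedTuple


class PullRequest(NamedTuple):
    # PR_id alone is not unique, so the key combines Repository and PR_id.
    key: tuple[str, int]
    # Guideline codes found in the PR, e.g. ["G3.9", "G5.6"].
    guidelines: list[str]


def split_guidelines(cell: str | None) -> list[str]:
    """Split a "Guidelines" cell on the pipe symbol and drop empty parts."""
    if not cell:
        return []
    return [part.strip() for part in str(cell).split("|") if part.strip()]


def parse_row(row: dict) -> PullRequest | None:
    """Return a PullRequest, or None for rows marked "d" (emoji-only PRs)."""
    if str(row.get("Checked") or "").strip().lower() == "d":
        return None
    key = (str(row["Repository"]), int(row["PR_id"]))
    return PullRequest(key, split_guidelines(row.get("Guidelines")))
```

Using the pair (Repository, PR_id) as the key avoids mixing up PRs from different repositories that happen to share the same sequential number.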
The other sheets in the results.xls file are:
Stats: An overview of the number of PRs that were analyzed by two authors.
Rules: These are guidelines on how to map specific comments. The rules were developed through discussions between the two authors who performed the mapping and were documented to ensure consistent mapping throughout the various sessions. These rules were also used to create the first version of the codebook, which is included in the CodeBook.xlsx file.
Sessions: Protocols for the sessions conducted for mapping guidelines by the two authors.
G5.6: This sheet lists the different concerns found under guideline G5.6, along with the number of PRs in which each concern appears. It was used to subdivide G5.6 into more specialized guidelines.
To analyze or replicate the results, the columns "Guidelines", "Comments on File Changes (Threads)", and "Author of PR" are the most important. The "Guidelines" column indicates which code review guidelines were identified in the PR, while the "Comments on File Changes (Threads)" column provides the specific comments that led to the mapping. Suggested changes in the "Comments on File Changes (Threads)" column are not always easy to read in the spreadsheet, especially for larger changes or minor formatting changes. In unclear cases, use the URL column to open the PR on GitHub and review the comments in their original formatting. Use the "Author of PR" column to determine whether a comment was made by the PR author or by a reviewer. If the PR author makes the first comment of a thread, the comment is there to highlight specific file changes for the reviewer (G3.9).
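As a starting point for such an analysis, the following sketch counts in how many PRs each guideline appears. It is an assumption-laden illustration rather than the authors' analysis script: rows are assumed to be dictionaries keyed by the column names of the main sheet, empty cells are assumed to be None or an empty string, and the function name guideline_prevalence and the sample rows are hypothetical.

```python
from collections import Counter


def guideline_prevalence(rows):
    """Count the number of PRs in which each guideline code appears.

    Rows whose "Checked" value is "d" (emoji-only PRs) are skipped,
    because they are not considered code reviews.
    """
    counts = Counter()
    for row in rows:
        if (row.get("Checked") or "").strip().lower() == "d":
            continue
        cell = row.get("Guidelines") or ""
        # A set ensures a code repeated within one cell counts once per PR.
        codes = {code.strip() for code in cell.split("|") if code.strip()}
        counts.update(codes)
    return counts


# Hypothetical sample rows, for illustration only.
sample = [
    {"Repository": "a/x", "PR_id": 1, "Guidelines": "G3.9|G5.6", "Checked": "y"},
    {"Repository": "b/y", "PR_id": 1, "Guidelines": "G5.6", "Checked": ""},
    {"Repository": "b/y", "PR_id": 2, "Guidelines": "", "Checked": "d"},
]
prevalence = guideline_prevalence(sample)
```

One way to obtain such rows is pandas.read_excel on results.xls followed by to_dict("records"); reading the legacy .xls format with pandas requires the xlrd package, and empty cells then arrive as NaN and would need to be replaced first.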
Coding of Guidelines
If you want to perform coding (a qualitative data analysis strategy) on code review comments to map them to guidelines (codes), you can use the codebook provided in CodeBook.xlsx. This codebook defines the guidelines and provides rules for mapping comments to these guidelines.
In particular, for each guideline (code) it provides:
a description of the guideline
an explanation of when to use the code
an explanation of when not to use the code
examples of comments for the "when to use" case
All four versions of the codebook and a changelog are included in the spreadsheet, which shows how the codebook has been improved over time.

Blekinge Institute of Technology