Issues with SZZ: An empirical assessment of the state of practice of defect prediction data collection

11/20/2019
by Steffen Herbold, et al.

Defect prediction research relies heavily on published data sets that are shared between researchers. The SZZ algorithm is the de facto standard for collecting defect labels for this kind of data and is used by most public data sets. Problems with the SZZ algorithm may therefore have a strong indirect impact on almost the entire state of the art of defect prediction. Recent research uncovered potential problems in different parts of the SZZ algorithm. In this article, we provide an extensive empirical analysis of the defect labels created with the SZZ algorithm. We combined manual validation with adopted or improved heuristics for the collection of defect data to establish ground truth data for bug-fixing commits, and we improved the heuristic for identifying the inducing changes of defects as well as the assignment of bugs to releases. We conducted an empirical study on 398 releases of 38 Apache projects and found that only half of the bug-fixing commits determined by SZZ are actually bug fixing. Moreover, if a six-month time frame is used in combination with SZZ to determine which bugs affect a release, one file is incorrectly labeled as defective for every file that is correctly labeled as defective. In addition, two defective files are missed. We also explored the impact of the relatively small set of features that are available in most defect prediction data sets, as multiple publications indicate that, e.g., churn-related features are important for defect prediction. We found that the difference made by using more features is negligible.
