The Impact of Dormant Defects on Defect Prediction: a Study of 19 Apache Projects

05/26/2021
by Davide Falessi, et al.
Defect prediction models can help prioritize testing, analysis, or code review activities, and have been the subject of substantial effort in academia and of some applications in industrial contexts. A necessary precondition for creating a defect prediction model is the availability of defect data from the history of projects. If this data is noisy, the resulting defect prediction model could be unreliable. One cause of noise in defect datasets is the presence of "dormant defects", i.e., defects discovered several releases after their introduction. A dormant defect can cause a class to be labeled as defect-free while it is actually defective; such a class is said to be "snoring". In this paper, we investigate the impact of snoring on classifiers' accuracy and the effectiveness of a possible countermeasure, i.e., dropping too-recent data from the training set. We analyze the accuracy of 15 machine learning defect prediction classifiers on data from more than 4,000 defects and 600 releases of 19 open source projects from the Apache ecosystem. Our results show that, on average across projects: (i) the presence of dormant defects decreases the recall of defect prediction classifiers, and (ii) removing from the training set the classes that are labeled as not defective in the last release significantly improves the accuracy of the classifiers. In summary, this paper provides insights on how to create defect datasets that mitigate the negative effect of dormant defects on defect prediction.
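The countermeasure in result (ii) can be illustrated with a minimal sketch. It is not the authors' implementation: the `Row` record layout, the field names, and the function `drop_recent_nondefective` are hypothetical, and the sketch assumes that "removing the classes that in the last release are labeled as not defective" means dropping only the last release's rows with a not-defective label, since those are the labels most likely to hide a dormant defect.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Row:
    """Hypothetical training record: one row per (release, class) pair."""
    release: int        # release index, higher means more recent
    class_name: str
    defective: bool     # label known at dataset creation time


def drop_recent_nondefective(rows: List[Row]) -> List[Row]:
    """Drop rows from the most recent release that are labeled not defective.

    Assumption: these labels are the least trustworthy, because a defect
    introduced recently may not have been discovered yet (it may be dormant).
    Rows labeled defective, and all rows from older releases, are kept.
    """
    if not rows:
        return []
    last_release = max(row.release for row in rows)
    return [
        row for row in rows
        if not (row.release == last_release and not row.defective)
    ]


# Illustrative data only; not taken from the study.
training = [
    Row(1, "Parser", False),
    Row(1, "Lexer", True),
    Row(2, "Parser", True),
    Row(2, "Lexer", False),   # last release, not defective: possibly snoring
]
filtered = drop_recent_nondefective(training)
```

In this example `filtered` keeps both rows of release 1 and the defective `Parser` row of release 2, and drops the `Lexer` row of release 2. Whether such rows should also be dropped from earlier releases is a design choice that the abstract does not settle.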

research
03/31/2020

On the Need of Removing Last Releases of Data When Using or Validating Defect Prediction Models

To develop and train defect prediction models, researchers rely on datas...
research
12/15/2018

A Large-Scale Study of Call Graph-based Impact Prediction using Mutation Testing

In software engineering, impact analysis involves predicting the softwar...
research
04/29/2009

Quality Classifiers for Open Source Software Repositories

Open Source Software (OSS) often relies on large repositories, like Sour...
research
01/31/2018

The Impact of Class Rebalancing Techniques on the Performance and Interpretation of Defect Prediction Models

Defect prediction models that are trained on class imbalanced datasets (...
research
08/20/2022

Learning to predict test effectiveness

The high cost of the test can be dramatically reduced, provided that the...
research
05/29/2021

Investigating the Significance of Bellwether Effect to Improve Software Effort Estimation

Bellwether effect refers to the existence of exemplary projects (called ...
research
05/22/2023

Relabel Minimal Training Subset to Flip a Prediction

Yang et al. (2023) discovered that removing a mere 1 often lead to the f...