An Empirical Study on the Effectiveness of Data Resampling Approaches for Cross-Project Software Defect Prediction

06/16/2022
by Kwabena Ebo Bennin, et al.

Cross-project defect prediction (CPDP), where data from different software projects are used to predict defects, has been proposed as a way to provide training data for software projects that lack historical data. Evaluations of CPDP models using the Nearest Neighbour (NN) Filter approach have shown promising results in recent studies. A key challenge with defect-prediction datasets is class imbalance, that is, highly skewed datasets in which non-buggy modules far outnumber buggy modules. In the past, data resampling approaches have been applied to within-project defect prediction models to help alleviate the negative effects of class imbalance in the datasets. To address the class imbalance issue in CPDP, the authors assess the impact of data resampling approaches on CPDP models after the NN Filter is applied. The impact on prediction performance of five oversampling approaches (MAHAKIL, SMOTE, Borderline-SMOTE, Random Oversampling, and ADASYN) and three undersampling approaches (Random Undersampling, Tomek Links, and One-Sided Selection) is investigated, and the results are compared to approaches without data resampling. The authors examined six defect prediction models on 34 datasets extracted from the PROMISE repository. The authors' results show that data resampling has a significant positive effect on CPDP performance, suggesting that software quality teams and researchers should consider applying data resampling approaches to improve recall (pd) and g-measure. However, if the goal is to improve precision and reduce the false alarm rate (pf), then data resampling approaches should be avoided.
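The pipeline described in the abstract (filter the cross-project training data with an NN Filter, resample it, then score predictions with pd, pf, and g-measure) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes an NN Filter that keeps, for each target-project module, its k nearest source-project modules by Euclidean distance (k = 10 is an assumed default), uses Random Oversampling as the simplest of the listed resampling approaches, and assumes g-measure is the harmonic mean of pd and (1 - pf). The function names `nn_filter`, `random_oversample`, and `pd_pf_g` are hypothetical.

```python
import math
import random


def nn_filter(source_X, target_X, k=10):
    """Return sorted indices of source rows that are among the k nearest
    neighbours (Euclidean distance) of at least one target row.
    The value of k and the distance metric are assumptions."""
    selected = set()
    for t in target_X:
        ranked = sorted(range(len(source_X)),
                        key=lambda i: math.dist(source_X[i], t))
        selected.update(ranked[:k])
    return sorted(selected)


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes
    have the same number of rows (labels: 1 = buggy, 0 = non-buggy)."""
    rng = random.Random(seed)
    buggy = [i for i, label in enumerate(y) if label == 1]
    clean = [i for i, label in enumerate(y) if label == 0]
    minority, majority = sorted((buggy, clean), key=len)
    if not minority:
        # Nothing to duplicate when one class is absent.
        return list(X), list(y)
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]


def pd_pf_g(y_true, y_pred):
    """Return (pd, pf, g-measure), where pd is recall on buggy modules,
    pf is the false alarm rate, and g-measure is assumed to be the
    harmonic mean of pd and (1 - pf)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    pd = tp / (tp + fn) if tp + fn else 0.0
    pf = fp / (fp + tn) if fp + tn else 0.0
    denom = pd + (1 - pf)
    g = 2 * pd * (1 - pf) / denom if denom else 0.0
    return pd, pf, g
```

In this sketch the filtered and resampled rows would then be passed to any classifier; resampling is applied only to the training data, never to the target project's data used for evaluation.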

research
01/24/2019

Transfer-Learning Oriented Class Imbalance Learning for Cross-Project Defect Prediction

Cross-project defect prediction (CPDP) aims to predict defects of projec...
research
01/31/2018

The Impact of Class Rebalancing Techniques on the Performance and Interpretation of Defect Prediction Models

Defect prediction models that are trained on class imbalanced datasets (...
research
02/24/2022

Investigating the Use of One-Class Support Vector Machine for Software Defect Prediction

Early software defect identification is considered an important step tow...
research
07/12/2022

The Untold Impact of Learning Approaches on Software Fault-Proneness Predictions

Software fault-proneness prediction is an active research area, with man...
research
03/31/2020

On the Need of Removing Last Releases of Data When Using or Validating Defect Prediction Models

To develop and train defect prediction models, researchers rely on datas...
research
04/02/2021

A Comparison of Similarity Based Instance Selection Methods for Cross Project Defect Prediction

Context: Previous studies have shown that training data instance selecti...