Data Balancing Improves Self-Admitted Technical Debt Detection

by Murali Sridharan, et al.

A high imbalance exists between technical debt and non-technical debt source code comments. Such imbalance affects Self-Admitted Technical Debt (SATD) detection performance, and existing literature lacks empirical evidence on the choice of balancing technique. In this work, we evaluate the impact of multiple balancing techniques, including Data level, Classifier level, and Hybrid, for SATD detection in Within-Project and Cross-Project setups. Our results show that the Data level balancing technique SMOTE, or the Classifier level Ensemble approaches Random Forest and XGBoost, are reasonable choices depending on whether the goal is to maximize Precision, Recall, F1, or AUC-ROC. We compared our best-performing model with the previous SATD detection benchmark (a cost-sensitive Convolutional Neural Network). Interestingly, the top-performing XGBoost with SMOTE sampling improved the Within-Project F1 score by 10% but fell short in the Cross-Project setup by 9%. This supports the higher generalization capability of deep learning in Cross-Project SATD detection, yet when working within individual projects, classical machine learning algorithms can deliver better performance. We also evaluate and quantify the impact of duplicate source code comments on SATD detection performance. Finally, we employ SHAP and discuss the interpreted SATD features. We have included the replication package and shared a web-based SATD prediction tool with the balancing techniques in this study.
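To make the Data level idea concrete, here is a minimal sketch of SMOTE-style oversampling in plain Python. It is not the authors' implementation and does not reproduce their pipeline: the function names `smote_oversample` and `balance`, the default `k=3`, and the use of plain numeric lists as feature vectors are all assumptions made for illustration. The core idea is that each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority neighbours.

```python
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic vectors by interpolating between a sampled
    minority point and one of its k nearest minority neighbours.
    (Hypothetical helper, simplified from the SMOTE idea.)"""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class until both classes have equal size.
    (Hypothetical helper; assumes a binary labelling.)"""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    n_new = max(0, n_majority - len(minority))
    new = smote_oversample(minority, n_new, k, seed) if n_new else []
    return X + new, y + [minority_label] * len(new)
```

In a classification workflow, oversampling of this kind is normally applied to the training split only, so that evaluation data keeps its original class distribution. For real experiments, a maintained library implementation is a safer choice than this sketch.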

