DebtFree: Minimizing Labeling Cost in Self-Admitted Technical Debt Identification using Semi-Supervised Learning

01/25/2022
by   Huy Tu, et al.

Keeping track of and managing Self-Admitted Technical Debts (SATDs) is important for maintaining a healthy software project. The current active-learning SATD recognition tool requires manual inspection of 24% [...] on average to reach 90% [...] SATDs. Human experts must therefore read almost five times as many comments as there are SATDs, which indicates the inefficiency of the tool. Moreover, human experts are still prone to error: 95% [...] were actually true positives. To solve these problems, we propose DebtFree, a two-mode framework based on unsupervised learning for identifying SATDs. In mode1, when the existing training data is unlabeled, DebtFree starts with an unsupervised learner that automatically pseudo-labels the programming comments in the training data. In contrast, in mode2, where labels are available for the training data, DebtFree starts with a pre-processor that identifies the comments most likely to be SATDs in the test dataset. Then, our machine learning model is employed to assist human experts in manually identifying the remaining SATDs. Our experiments on 10 software projects show that both models yield a statistically significant improvement in effectiveness over the state-of-the-art automated and semi-automated models. Specifically, DebtFree can reduce the labeling effort by 99% [...] (labeled training data) while improving the current active learner's F1 by almost 100% in relative terms.


