Data Complexity: A New Perspective for Analyzing the Difficulty of Defect Prediction Tasks

05/05/2023
by Xiaohui Wan, et al.

Defect prediction is crucial for software quality assurance and has been extensively researched over recent decades. However, prior studies rarely focus on data complexity in defect prediction tasks, and even more rarely try to understand the difficulty of these tasks from the perspective of data complexity. In this paper, we conduct an empirical study to estimate the hardness of over 33,000 instances, employing a set of measures to characterize the inherent difficulty of instances and the characteristics of defect datasets. Our findings indicate that: (1) instance hardness in both classes displays a right-skewed distribution, with the defective class exhibiting a more scattered distribution; (2) class overlap is the primary factor influencing instance hardness and can be characterized through feature, structural, instance, and multiresolution overlap; (3) no universal preprocessing technique is applicable to all datasets, and preprocessing may not consistently reduce data complexity; fortunately, dataset complexity measures can help identify suitable techniques for specific datasets; (4) integrating data complexity information into the learning process can enhance an algorithm's learning capacity. In summary, this empirical study highlights the crucial role of data complexity in defect prediction tasks and provides a novel perspective for advancing research in defect prediction techniques.
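
The abstract does not name the specific hardness measures used in the study. As a rough illustration of what an instance-level hardness estimate can look like, the sketch below uses k-Disagreeing Neighbors (kDN), a measure from the instance-hardness literature that scores each instance by the fraction of its k nearest neighbors carrying a different class label. The choice of kDN, the function name `kdn_hardness`, the Euclidean distance, and the toy data are all assumptions made for illustration, not the authors' method or data.

```python
import math

def kdn_hardness(X, y, k=3):
    """k-Disagreeing Neighbors: for each instance, the fraction of its
    k nearest neighbors (Euclidean, excluding itself) with a different label.
    Illustrative sketch only; not taken from the paper."""
    n = len(X)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of instances")
    scores = []
    for i in range(n):
        # Rank all other instances by distance to instance i, keep the k closest.
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        disagree = sum(1 for j in neighbors if y[j] != y[i])
        scores.append(disagree / k)
    return scores

# Made-up toy data: two metric features per module, label 1 = defective, 0 = clean.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
     (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
hardness = kdn_hardness(X, y, k=3)
```

In this toy set, the defective instance at (0.05, 0.05) sits inside the cluster of clean instances, so all of its nearest neighbors disagree with its label and it receives the maximum score. That is the kind of instance-level class overlap the abstract identifies as the primary factor behind hardness.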

research
12/04/2022

Characterizing instance hardness in classification and regression problems

Some recent pieces of work in the Machine Learning (ML) literature have ...
research
11/16/2022

Features for the 0-1 knapsack problem based on inclusionwise maximal solutions

Decades of research on the 0-1 knapsack problem led to very efficient al...
research
09/04/2023

Which algorithm to select in sports timetabling?

Any sports competition needs a timetable, specifying when and where team...
research
09/05/2018

Online local pool generation for dynamic classifier selection: an extended version

Dynamic Classifier Selection (DCS) techniques have difficulty in selecti...
research
03/07/2022

ILDAE: Instance-Level Difficulty Analysis of Evaluation Data

Knowledge of questions' difficulty level helps a teacher in several ways...
research
05/19/2022

Let the Model Decide its Curriculum for Multitask Learning

Curriculum learning strategies in prior multi-task learning approaches a...