Impact of Discretization Noise of the Dependent Variable on Machine Learning Classifiers in Software Engineering

02/12/2022
by Gopi Krishnan Rajbahadur, et al.

Researchers usually discretize a continuous dependent variable into two target classes by introducing an artificial discretization threshold (e.g., the median). However, such discretization may introduce noise (i.e., discretization noise) because data points close to the artificial threshold have ambiguous class loyalty. Previous studies do not provide a clear directive on how discretization noise affects classifiers or on how to handle such noise. In this paper, we propose a framework that helps researchers and practitioners systematically estimate the impact of discretization noise on classifiers, both on various performance measures and on the interpretation of the classifiers. Through a case study of 7 software engineering datasets, we find that: 1) discretization noise affects the different performance measures of a classifier differently for different datasets; 2) although the interpretation of the classifiers is affected by discretization noise on the whole, the top 3 most important features are not. Therefore, we suggest that practitioners and researchers use our framework to understand the impact of discretization noise on the performance of the classifiers they build and to estimate the amount of discretization noise that should be discarded from the dataset to avoid its negative impact.
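The discretization step described in the abstract can be illustrated with a minimal sketch. This is not the authors' framework: the function name `discretize_with_band`, the `band_fraction` parameter, and the sample values are all assumptions made for illustration. It binarizes a continuous variable at its median and flags the points closest to the threshold, which are the ones whose class membership is most ambiguous.

```python
import numpy as np

def discretize_with_band(y, band_fraction=0.0):
    """Binarize y at its median and flag points nearest the threshold.

    band_fraction is a hypothetical knob (not from the paper): the fraction
    of points, ranked by distance to the median, that are treated as
    potential discretization noise and marked for removal.
    """
    y = np.asarray(y, dtype=float)
    threshold = np.median(y)
    # Class 1 if strictly above the median, class 0 otherwise.
    labels = (y > threshold).astype(int)
    n_drop = int(round(band_fraction * len(y)))
    # Rank points by distance to the threshold; the closest are most ambiguous.
    order = np.argsort(np.abs(y - threshold), kind="stable")
    keep = np.ones(len(y), dtype=bool)
    keep[order[:n_drop]] = False
    return labels, keep, threshold

# Illustrative values only, e.g. a continuous quality or effort metric.
y = np.array([1.0, 2.0, 4.5, 5.0, 5.2, 9.0, 12.0])
labels, keep, threshold = discretize_with_band(y, band_fraction=0.3)
```

Training one classifier on all points and another on only the points where `keep` is true, then comparing their performance measures, is one simple way to probe how sensitive a classifier is to the points near the threshold.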

