Missing Data Imputation for Classification Problems

02/25/2020
by Arkopal Choudhury, et al.

Imputing missing data is a common task in classification problems where the training feature matrix has missing entries. A widely used solution is based on the k-nearest neighbor (kNN) approach, a lazy learning technique. However, most previous work on missing data does not take the class label of the classification problem into account. In addition, existing kNN imputation methods measure distance with variants of the Minkowski distance, which does not work well with heterogeneous data. In this paper, we propose a novel iterative kNN imputation technique based on the class-weighted grey distance between the missing datum and all the training data. Grey distance works well in heterogeneous data with missing instances. The distance is weighted by mutual information (MI), a measure of the relevance of each feature to the class label, so that the imputation of the training data is directed towards improving classification performance. The class-weighted grey kNN imputation algorithm demonstrates improved performance compared with other kNN imputation algorithms, as well as standard imputation algorithms such as MICE and missForest, in both imputation and classification problems. These problems are based on simulated scenarios and UCI datasets with various rates of missingness.
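To make the idea of a feature-weighted grey distance concrete, the following is a minimal sketch and not the authors' algorithm. It assumes numeric features already scaled to [0, 1], uses the standard grey relational coefficient with a distinguishing coefficient rho = 0.5, performs a single imputation pass rather than iterating, and takes the feature weights as an input (in the paper's setting these would come from mutual information with the class label). The names grey_grades and impute_once are hypothetical.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]  # None marks a missing value


def grey_grades(target: Row, candidates: Sequence[Row],
                weights: Sequence[float], rho: float = 0.5) -> List[float]:
    """Weighted grey relational grade of each candidate with respect to target.

    Only features observed in both rows contribute. A higher grade means
    the candidate is more similar to the target.
    """
    # Absolute differences on jointly observed features (None otherwise).
    diffs = [[abs(t - c) if t is not None and c is not None else None
              for t, c in zip(target, cand)] for cand in candidates]
    observed = [d for row in diffs for d in row if d is not None]
    if not observed:
        return [0.0] * len(candidates)
    d_min, d_max = min(observed), max(observed)
    grades = []
    for row in diffs:
        num = den = 0.0
        for d, w in zip(row, weights):
            if d is None:
                continue
            # Grey relational coefficient; it is 1 when all differences are 0.
            coef = 1.0 if d_max == 0 else (d_min + rho * d_max) / (d + rho * d_max)
            num += w * coef
            den += w
        grades.append(num / den if den > 0 else 0.0)
    return grades


def impute_once(data: Sequence[Row], weights: Sequence[float],
                k: int = 3, rho: float = 0.5) -> List[List[Optional[float]]]:
    """One (non-iterative) pass of grade-weighted kNN imputation.

    `weights` are assumed to be feature-relevance scores supplied by the
    caller, for example mutual information estimates (an assumption here).
    """
    filled = [list(r) for r in data]
    for i, row in enumerate(data):
        for p, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which feature p is observed.
            donors = [r for j, r in enumerate(data) if j != i and r[p] is not None]
            if not donors:
                continue  # nothing to borrow from; leave the entry missing
            grades = grey_grades(row, donors, weights, rho)
            top = sorted(zip(grades, donors), key=lambda gd: gd[0], reverse=True)[:k]
            total = sum(g for g, _ in top)
            if total > 0:
                filled[i][p] = sum(g * d[p] for g, d in top) / total
            else:
                filled[i][p] = sum(d[p] for _, d in top) / len(top)
    return filled
```

For example, calling impute_once on a small list of rows with equal weights fills each None with a grade-weighted average of that feature over the k most similar rows. Repeating the pass on the filled matrix until the imputed values stabilize would be one way to approximate the iterative scheme the abstract describes, though the exact update rule is not given there.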


research
06/04/2017

Evolving imputation strategies for missing data in classification problems with TPOT

Missing data has a ubiquitous presence in real-life applications of mach...
research
12/19/2019

Robust Multi-Output Learning with Highly Incomplete Data via Restricted Boltzmann Machines

In a standard multi-output classification scenario, both features and la...
research
05/26/2023

Confidence-Based Feature Imputation for Graphs with Partially Known Features

This paper investigates a missing feature imputation problem for graph l...
research
02/24/2020

Clustering and Classification with Non-Existence Attributes: A Sentenced Discrepancy Measure Based Technique

For some or all of the data instances a number of independent-world clus...
research
10/22/2021

Missing the Point: Non-Convergence in Iterative Imputation Algorithms

Iterative imputation is a popular tool to accommodate missing data. Whil...
research
09/23/2020

Using Undersampling with Ensemble Learning to Identify Factors Contributing to Preterm Birth

In this paper, we propose Ensemble Learning models to identify factors c...
research
01/02/2020

Using Data Imputation for Signal Separation in High Contrast Imaging

To characterize circumstellar systems in high contrast imaging, the fund...
