Chains of Autoreplicative Random Forests for missing value imputation in high-dimensional datasets

01/02/2023
by Ekaterina Antonenko, et al.

Missing values are a common problem in data science and machine learning. Removing instances with missing values can adversely affect the quality of further data analysis. The problem is exacerbated when features far outnumber instances, so that a high proportion of instances is affected. Such a scenario is common in many important domains; for example, single nucleotide polymorphism (SNP) datasets provide a large number of features over a genome for a relatively small number of individuals. To preserve as much information as possible prior to modeling, a rigorous imputation scheme is acutely needed. While Denoising Autoencoders are a state-of-the-art method for imputation in high-dimensional data, they still require enough complete cases for training, which are often unavailable in real-world problems. In this paper, we consider missing value imputation as a multi-label classification problem and propose Chains of Autoreplicative Random Forests. Using multi-label Random Forests instead of neural networks works well for low-sampled data because there are fewer parameters to optimize. Experiments on several SNP datasets show that our algorithm effectively imputes missing values based only on information from the dataset and exhibits better performance than standard algorithms that do not require any additional information. In this paper, the algorithm is implemented specifically for SNP data, but it can easily be adapted for other cases of missing value imputation.
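
To make the multi-label framing concrete, the sketch below trains a single multi-output Random Forest to reproduce its own input features and uses its predictions to fill the gaps. It is a simplified illustration only, not the authors' implementation: it omits the chaining step entirely, and the function name `autoreplicative_impute`, the `MISSING` marker, the `mask_rate` parameter, the mode-filled training targets, and the extra random masking are all assumptions made for this example. It relies on scikit-learn's `RandomForestClassifier`, which accepts a two-dimensional label matrix.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

MISSING = -1  # hypothetical marker for a missing genotype


def autoreplicative_impute(X, mask_rate=0.2, n_estimators=100, seed=0):
    """Fill MISSING entries of an integer-coded matrix X (samples x features)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    missing = X == MISSING

    # Targets must be complete, so start from a per-column mode fill
    # (an assumption of this sketch, not taken from the paper).
    Y = X.copy()
    for j in range(X.shape[1]):
        observed = X[~missing[:, j], j]
        mode = np.bincount(observed).argmax() if observed.size else 0
        Y[missing[:, j], j] = mode

    # Inputs keep the MISSING marker and receive extra random masking, so the
    # forest learns to reconstruct hidden entries from the visible ones.
    X_in = X.copy()
    X_in[rng.random(X.shape) < mask_rate] = MISSING

    # One forest, one label per feature: multi-output classification.
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    forest.fit(X_in, Y)

    # Predict from the original inputs and overwrite only the gaps.
    pred = forest.predict(X).astype(X.dtype)
    return np.where(missing, pred, X)


# Toy data: 60 individuals, 8 perfectly correlated genotype columns in {0, 1, 2}.
rng = np.random.default_rng(1)
X_true = np.repeat(rng.integers(0, 3, size=(60, 1)), 8, axis=1)
X_obs = X_true.copy()
X_obs[rng.random(X_true.shape) < 0.1] = MISSING  # hide roughly 10% of entries
X_imp = autoreplicative_impute(X_obs)
```

Observed entries pass through unchanged and only the gaps are taken from the forest's predictions. The toy columns are perfectly correlated purely to keep the example small; real SNP data would show weaker, more local dependence between features.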

research
04/23/2023

Missing Values and Imputation in Healthcare Data: Can Interpretable Machine Learning Help?

Missing values are a fundamental problem in data science. Many datasets ...
research
11/15/2019

Imputing missing values with unsupervised random trees

This work proposes a non-iterative strategy for missing value imputation...
research
12/19/2019

Robust Multi-Output Learning with Highly Incomplete Data via Restricted Boltzmann Machines

In a standard multi-output classification scenario, both features and la...
research
06/03/2021

Semi-supervised Conditional Density Estimation for Imputation and Classification of Incomplete Instances

Incomplete instances with various missing attributes in many real-world ...
research
09/25/2020

Online Missing Value Imputation and Correlation Change Detection for Mixed-type Data via Gaussian Copula

Most data science algorithms require complete observations, yet many dat...
research
04/28/2023

Counterfactual Explanation with Missing Values

Counterfactual Explanation (CE) is a post-hoc explanation method that pr...
research
06/10/2023

Machine Learning Based Missing Values Imputation in Categorical Datasets

This study explored the use of machine learning algorithms for predictin...
