Random Forest Missing Data Algorithms

01/19/2017
by Fei Tang, et al.
Random forest (RF) missing data algorithms are an attractive approach for dealing with missing data. They can handle mixed types of missing data, adapt to interactions and nonlinearity, and have the potential to scale to big data settings. Many RF imputation algorithms currently exist, but there is relatively little guidance about their efficacy, which motivated us to study their performance. Using a large, diverse collection of data sets, we assessed the performance of various RF algorithms under different missing data mechanisms. The algorithms included proximity imputation, on-the-fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting, the latter class representing a generalization of a promising new imputation algorithm called missForest. Performance was measured by the ability to impute data accurately. Our findings reveal RF imputation to be generally robust, with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data were missing not at random.
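
The evaluation idea described in the abstract (hide known values under a chosen missing data mechanism, impute them, and score the imputations against the hidden truth) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' experimental setup: the two-variable Gaussian data, the specific MCAR/MAR/MNAR rules, the 30% missingness rate, and all function names are hypothetical. In particular, the two imputers are deliberately simple stand-ins (mean imputation and a one-variable linear regression), not random forests, so the sketch shows only the shape of the protocol.

```python
import math
import random


def make_data(n=2000, rho=0.8, seed=0):
    """Draw n pairs (x, y) of standard normals with correlation rho."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        rows.append((x, y))
    return rows


def missing_mask(rows, mechanism, rate=0.3, seed=1):
    """Return booleans marking which y values are hidden (x is always observed).

    MCAR: missingness ignores the data.
    MAR:  missingness depends only on the observed x.
    MNAR: missingness depends on the hidden value y itself.
    Assumes rate <= 0.5 so that 2 * rate is a valid probability.
    """
    rng = random.Random(seed)
    mask = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = 2 * rate if x > 0 else 0.0
        elif mechanism == "MNAR":
            p = 2 * rate if y > 0 else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        mask.append(rng.random() < p)
    return mask


def mean_impute(rows, mask):
    """Predict the mean of the observed y values for every row."""
    observed = [y for (_, y), hidden in zip(rows, mask) if not hidden]
    mu = sum(observed) / len(observed)
    return [mu for _ in rows]


def regression_impute(rows, mask):
    """Predict y from x with a least-squares line fit on observed rows only.

    A stand-in for a model-based imputer; it is not a random forest.
    """
    observed = [(x, y) for (x, y), hidden in zip(rows, mask) if not hidden]
    mx = sum(x for x, _ in observed) / len(observed)
    my = sum(y for _, y in observed) / len(observed)
    sxy = sum((x - mx) * (y - my) for x, y in observed)
    sxx = sum((x - mx) ** 2 for x, _ in observed)
    slope = sxy / sxx
    return [my + slope * (x - mx) for x, _ in rows]


def rmse_on_missing(rows, mask, preds):
    """Root mean squared error over the hidden entries only."""
    errs = [(p - y) ** 2 for (_, y), hidden, p in zip(rows, mask, preds) if hidden]
    return math.sqrt(sum(errs) / len(errs))


def evaluate(rho=0.8):
    """Score both imputers under each mechanism at correlation rho."""
    rows = make_data(rho=rho)
    results = {}
    for mechanism in ("MCAR", "MAR", "MNAR"):
        mask = missing_mask(rows, mechanism)
        results[mechanism] = {
            "mean": rmse_on_missing(rows, mask, mean_impute(rows, mask)),
            "regression": rmse_on_missing(rows, mask, regression_impute(rows, mask)),
        }
    return results


if __name__ == "__main__":
    for mechanism, scores in evaluate().items():
        print(mechanism, {name: round(v, 3) for name, v in scores.items()})
```

Calling `evaluate` with different values of `rho` gives a rough feel for how an imputer that uses the other variable depends on correlation. Swapping an RF-based imputer in for `regression_impute` would leave the rest of the protocol unchanged.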

research
05/04/2011

MissForest - nonparametric missing value imputation for mixed-type data

Modern data acquisition based on high-throughput technology is often fac...
research
04/23/2020

Influence of parallel computing strategies of iterative imputation of missing data: a case study on missForest

Machine learning iterative imputation methods have been well accepted by...
research
04/30/2020

Multiple imputation using chained random forests: a preliminary study based on the empirical distribution of out-of-bag prediction errors

Missing data are common in data analyses in biomedical fields, and imput...
research
08/21/2021

A computational study on imputation methods for missing environmental data

Data acquisition and recording in the form of databases are routine oper...
research
07/06/2020

Multiple Imputation with Massive Data: an Application to the Panel Study of Income Dynamics

Multiple imputation (MI) is a popular and well-established method for ha...
research
07/05/2022

Data Integrity Error Localization in Networked Systems with Missing Data

Most recent network failure diagnosis systems focused on data center net...
