Who wins the Miss Contest for Imputation Methods? Our Vote for Miss BooPF

by   Burim Ramosaj, et al.
Universität Ulm

Missing data is an expected issue when large amounts of data is collected, and several imputation techniques have been proposed to tackle this problem. Beneath classical approaches such as MICE, the application of Machine Learning techniques is tempting. Here, the recently proposed missForest imputation method has shown high imputation accuracy under the Missing (Completely) at Random scheme with various missing rates. In its core, it is based on a random forest for classification and regression, respectively. In this paper we study whether this approach can even be enhanced by other methods such as the stochastic gradient tree boosting method, the C5.0 algorithm or modified random forest procedures. In particular, other resampling strategies within the random forest protocol are suggested. In an extensive simulation study, we analyze their performances for continuous, categorical as well as mixed-type data. Therein, MissBooPF, a combination of the stochastic gradient tree boosting method together with the parametrically bootstrapped random forest method, appeared to be promising. Finally, an empirical analysis focusing on credit information and Facebook data is conducted.


page 1

page 2

page 3

page 4


A cautionary tale on using imputation methods for inference in matched pairs design

Imputation procedures in biomedical fields have turned into statistical ...

MissForest - nonparametric missing value imputation for mixed-type data

Modern data acquisition based on high-throughput technology is often fac...

Influence of parallel computing strategies of iterative imputation of missing data: a case study on missForest

Machine learning iterative imputation methods have been well accepted by...

Dimensionality reduction with missing values imputation

In this study, we propose a new statical approach for high-dimensionalit...

MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data

A major challenge in embedding or visualizing clinical patient data is t...

On the consistency of a random forest algorithm in the presence of missing entries

This paper tackles the problem of constructing a non-parametric predicto...

Winning Models for GPA, Grit, and Layoff in the Fragile Families Challenge

In this paper, we discuss and analyze our approach to the Fragile Famili...

1 Introduction

In quantitative science, partially observed data are essential components during the data collection process. Excluding observations with missing values from further analyses could lead to serious information losses. Imputation is one possible solution to tackle this issue by imputing missing values with reasonable estimates and conduct the statistical analyzes as if there has not been any missing values. So far, several imputation methods have been proposed ranging from simple mean imputation to more complex and advanced imputation techniques, see, e.g.

[9]. In contrast to constant accreditation like mean or median imputation, where dependency structures within the data set are not fully taken into consideration, regression and classification based techniques have prevailed in practice. For example the R-package mice assumes that the data generating process originates from a parametric distribution and imputed values are generated in a cyclic fashion using Gibbs Sampling resulting into chained equations, cf. [17]

for details. However, extending its usage to more complex data structures, especially when the assumption of a parametric probability distribution is not met, the


package can lose in imputation accuracy. Recursive partitioning methods, especially regression and classification trees are able to deal with mixed-type data sets while dissociating from the parametric assumption of the data generating process. Advantages such as the capability of dealing with collinearity effects, detecting complex interaction effects, process high dimensional data sets, rare overfitting problems and increased prediction accuracy made them popular in theory and practice. One famous approach of regression and classification trees is the CART algorithm developed in

[2]. Several modifications have been proposed such as randomized split selection, bootstrap aggregation (bagging) and boosting (see [5]). In particular, the incorporation of randomness in the CART algorithm has shown favorable prediction accuracy effects, resulting into the random forest algorithm (cf. [3]).

In order to use the benefits of CART-based decision trees,

[13, 14] proposed and implemented the R package missForest for predicting missing values using the random forest algorithm. As analyzed in [20] and [13], the missForest algorithm has beaten currently available imputation methods as MICE or -nearest neighbor with regard to accuracy. The latter has been measured in terms of normalized root mean squared error and the proportion of false classification. Especially, under the missing completely at random scheme, favorable imputation results could be achieved. However, a proper analysis of the imputation accuracy under various missing mechanisms was not conducted. Therefore, one part of our simulation study covers the analysis of imputation schemes under the missing completely at random as well as the missing at random mechanism. Furthermore, we were interested in potential modifications towards existing CART-based decision trees such that prediction accuracy can be increased. For this reason, several methods have been taken into consideration, that have not yet been compared to the missForest: C5.0, stochastic gradient tree boosting as well as modifications of the random forest method. The latter benefits by introducing bagging procedures during tree construction, which is limited to simple resampling strategies such as with and without replacement. However, these resampling strategies might be too costly in terms of computational memory and time, when the training data is too large. Therefore, [21]

proposed an efficient sampling schedule that does not use the whole training data. In case of binary response variable,

[22] analyzed the effect of the class distribution on the tree induction. Since the random forest is constructed with the help of iterative bootstrapping (non-parametrically using simple resampling procedures such as with and without replacement), it is also worth to investigate the effect of other resampling strategies. Here, certain asymptotic models or parametric bootstrap strategies are of specific interest as e.g. proposed by [8, 19, 7] or [12] for continuous MANOVA. Therein, computational time cost and memory can be saved during tree construction, since training data information is compressed into the estimation of bootstrap parameters. Appropriately, we will propose several resampling methods and modify the protocol of the random forest algorithm which may result in increased imputation accuracy.

2 Missing Mechanism

Let , be independent and identically distributed

- dimensional random vectors with common distribution

. The marginal distributions , are assumed to be either continuous or finitely discrete; corresponding to continuous or categorical outcome variables, respectively. According to this difference, we divide into continuous and categorical components, respectively. Let denote the corresponding data matrix and the matrix indicating whether , , is observed or not . Further, let and be the observed and missing parts of . In [10] the missing mechanism is defined through a probabilistic model, where depends on some unknown parameter

and the random matrix

. It is said to be

  • Missing completely at random (MCAR), if

  • Missing at random (MAR), if

  • Missing not at random (MNAR), if

In quantitative science, mainly two approaches have been established to handle missing data. Imputation and data adjusted methods. For the latter the missing mechanism is incorporated in the likelihood of the data analysis model resulting into parametric or non-parametric MLE-type methods, see e.g. [16, 11, 18, 6, 1]. In contrast, imputation resp. multiple imputation assigns each missing value a reasonable estimate such that partially observed data is completed and the analysis can be conducted as if there has not been any missing values. Latter incorporates the uncertainty of the imputation procedure into sample estimates by predicting several values. [9] proved that sample estimates remain unaffected in case of an MCAR mechanism. However, for regression and classification trees, MCAR can result into biased estimates, as [15] have counteracted. Therefore, increasing imputation accuracy by imputing missing values close to the real data under MCAR or MAR mechanisms is an essential gain in later statistical inference. In a comparative simulation study, we want to illuminate the predictive accuracy of various imputation techniques under these missing mechanisms, by artificially inserting missing values in synthetic and empirical data sets.

3 Methodology

In this section, we give a brief introduction of CART-based methods by enlightening the random forest, the C5.0 as well as the stochastic gradient tree boosting method. For the MICE method, we refer to [17].

Tying to the previous notation, we denote with the column-per- muted data matrix for a permutation . Furthermore, let be the index set of observed components of its -th column and denote the corresponding -dimensional sub-vector of observed components as . Accordingly, we define as the set of indices, where the components of are missing and the sub-vector with missing components in . In addition, let be the permuted data matrix without the random vector and all entries, where has observed components. Hence, we set , as the permuted data matrix without but with all other entries where has missing components. Then, the general algorithm of [13] for imputing missing values within the class of decision-tree based regression and classification algorithms can be summarized as

Input : Data matrix with missing values.
Output : Imputed data matrix .
  1. [label=0:]

  2. Sort the columns of the data matrix based on the missing rate in ascending order resulting into a permutation of the outcome variables.

  3. Initially impute mean resp. mode values for missing continuous resp. categorical realizations of random variables.

  4. Starting with the first permuted column , separate the permuted data matrix into four parts using the previous notation:

  5. Train the chosen regression resp. classification tree method using the sub-matrix as covariates and as response variable.

  6. Impute the missing values of using the trained regression resp. classification model with as covariate values. Then move to the next variable and repeat steps and until all variables have been treated.

  7. As long as the -error of the newly and previously imputed data set has not increased for the first time, return to step .

Algorithm 1 Missing Value Imputation.

Thus, the algorithm turns the missing value problem into a regression or classification problem, depending on the scale of the response variable. In this way, the CART-based procedure tries to overcome the initial issue of an unsupervised learning problem, by training the algorithm on observed values and predicting missing values afterwards. Recall that this does not necessarily lead to an MAR-respecting mechanism, since being observed is defined through

, i.e. may have a missing value in a variable other than . CART-based ensemble methods such as the random forest or the stochastic gradient tree boosting algorithm, however, benefit from the construction of several base learners and their later combination by majority vote or averaging. Their combination and construction of trees nevertheless differ as described below:

  1. Random Forest:
    For each decision tree, non-parametric bootstrap sampling is performed selecting observations either with or without replacement from the training set. In case of selecting without replacement, of the observations are selected for training. Furthermore, feature subspacing at each terminal node is conducted shrinking the potential number of split variables to . For classification, is chosen, whereas for regression, is selected. The choices are motivated by a simulation study conducted in [3] and the trade-off between computational time and prediction accuracy (cf. [14]).

  2. Stochastic Gradient Tree Boosting:

    The trees are constructed in sequential order by minimizing prediction error with respect to a loss function, say

    . It is chosen depending on the nature of the variable: For continuous outcomes, the squared error loss is used while for categorical outcomes, either the Bernoulli loss or the multinomial loss is used subject to level size. Minimization is conducted using the gradient descent method. The subsequent tree is fitted to a bootstrap sample (without replacement) of of size and its corresponding pseudo-residuals of the previous (additive) tree as training signal. The final decision tree is then constructed by taking the weighted sum of all trees.

  3. C5.0:
    The predecessor of the classification algorithm C5.0 has been considered as the best among data mining algorithms by the IEEE International Conference on Data Mining (ICDM) in December . Since the implemented version of the algorithm in R is only capable of solving classification tasks, in our scheme of work, C5.0 is used for imputing categorical outcomes only. The construction of trees is similar to the general CART method. In contrast to the random forest method, C5.0 does not conduct feature subspacing and it is able to construct multiway split trees while using the Shannon entropy as impurity measure. Motivated by [23], the algorithm is able to conduct boosting similar to the stochastic gradient tree boosting. In addition, pessimistic pruning is executed and decision trees can be transformed into rulesets (cf. [28]).

Stochastic gradient tree boosting and C5.0, however, have not yet been analyzed within the current missing framework and their imputation accuracy using the scheme of algorithm . Although the stochastic gradient tree boosting method and the C5.0 differ technically, their approach is based on the same boosting principle. Since its technical details are rather undocumented, missing values using the C5.0 algorithm are imputed similarly to the stochastic gradient tree boosting for categorical response variables.

To explicitly explain the differences between the approaches denote with a collection of decision trees with random seed , specifying the nature of each constructed tree and input vector . Then for the random forest algorithm imputes via

where is the trained ensemble of decision trees on and denotes the modus of a vector . For the stochastic gradient tree boosting, imputation is conducted as follows:

where estimates the class probability

in case of a categorical variable with

. The weights resp. , with are chosen by iteratively minimizing

with respect to , (cf. [23]). The prediction of missing values using the C5.0 algorithm is conducted in a similar fashion to the stochastic gradient tree boosting, where the difference lies in the construction of each single tree by using the Shannon entropy and pessimistic pruning. Although the stochastic gradient tree boosting and the random forest algorithm both make use of regression and classification trees, they mainly differ in the way the trees are constructed. Since stochastic gradient tree boosting fits subsequent trees on pseudo-residuals, in principle, it is not possible to implement this algorithm in a parallel fashion. Furthermore, their focus in achieving higher prediction accuracy is different. Recalling the decomposition of the mean squared error loss function evaluated in into

the random forest aims to achieve its decrease in mean squared error by the construction of (de-correlated) trees resulting into the decrease through variance reduction:

Here, is the pairwise tree correlation at . On the other hand, stochastic gradient tree boosting aims to minimize mean squared error by decreasing potential bias. This is realized through fitting subsequent trees not to the training signal , , but to the pseudo-residuals obtained from the previous (additive) tree model. However, both methods require to find an optimal value of the tree size within the missing scheme. In addition, the stochastic gradient tree boosting minimizes a loss function , which in practice is conducted using gradient descent method. Hence, the choice of the step-size in the direction of the steepest descent clearly affects prediction accuracy.

4 New Resampling Proposal

In this section, we propose modifications of the resampling technique within the random forest method. We extend the possibility of constructing CART-based ensemble trees from simple random sampling to parametric sampling. Since we request to treat both, continuous and categorical variables, several sampling strategies are considered:

  1. Stratification: Assuming that the data set consists of ordinal or nominal variables, a stratified sampling procedure with replacement is conducted. This way, we aim to represent low frequent levels of the response variable into the training phase of the learning algorithm accordingly. The latter would, e.g. not be the case when the resample is generated from a multinomial distribution (results not shown).

  2. Multivariate Normal Distribution:

    If the data matrix consists of only continuous realizations, then the bootstrap sample in the random forest procedure is substituted by a so-called asymptotic-model based (or parametric) bootstrap sample generated from . Here, the mean vector and the covariance matrix are estimated by their empirical counterparts and calculated from , .

  3. Multivariate Kernel: Similar to the latter case we additionally propose to draw the resample from a kernel estimator for multivariate data. In its general form it is given by

    where is the bandwidth matrix and . The fitting quality towards the sample is more an issue of the choice of the bandwidth matrix rather than the choice of the kernel function (cf. [24]). Hence, we choose the Gaussian kernel . The selection of is usually conducted by minimizing the mean integrated squared error. Since this minimization problem is generally intractable, several classes of bandwidth matrices have been proposed, see e.g. [25]. We simply follow the normal scale rule given by

    As proven in [26], this is the plug-in estimate in case of a normal density under regularity conditions.

5 Simulation

In this section we investigate the performance of the algorithms described in Sections 3 and 4 using synthetic as well as empirical data. Performance is measured by means of the normalized root mean squared error (NRMSE) for continuous variables and proportion of false classification (PFC) for categorical variables, i.e.

Here, with and denotes the -th component of the row vector before missing values have been inserted artificially. Correspondingly, is its imputed value. Additionally, we identify with the mean value of the sequence , .

First, PFC resp. NRMSE is measured for synthetic data consisting of only categorical resp. continuous variables. Thereafter, mixed-type data from empirical studies are analyzed. For each of the first two settings a total number of observations were generated with variables. For the case of consisting of categorical realizations of random vectors only, the variables where split into nominal variables with levels and ordinal variables with , . Missing values are artificially generated under the MCAR and MAR mechanism using a missing rate of and with Monte Carlo runs. Hence, the absolute number of missing observations is not defined through , but through , i.e. is the total number of missing observations.

For every simulation set-up, the one-sided Brunner-Munzel test [29] is conducted in favor of the resulting ’best’ procedure, due to its robustness towards heterogeneous settings. Since the random forest as well as stochastic gradient tree boosting are ensemble methods with tree structures as base learners, the choices of hyper-parameters such as tree size and step-size are highly influential on the final prediction accuracy. Due to an additional discretization process for continuous outcomes using the CART algorithm, an increased number of base learners will remarkably increase computational time cost. Hence, following the empirical analysis of [5] for the stochstic gradient tree boosting method and [13] for the random forest method, we set the number of base learners for the random forest to and for the stochastic gradient tree boosting to . Furthermore, the step-size of the gradient descent procedure within the stochastic gradient tree boosting is set to . In contrast to the findings of [3] and [14], we set the number of split variables to for both, regression and classification tasks. A similar approach was also followed by [13]. Finally, note that the p-values of each Brunner-Munzel test are encoded as , and over each boxplot.

5.1 Categorical variables only

We create correlated multivariate (ordinal) variables with the help of the R function rmvord (cf. [27]). The generation is based on thresholding a multivariate normal distribution. However, variables are treated as nominal outcomes and their feature presentation is transformed into elements of . Setting the pairwise correlation to , with , we create ordinal variables and treat the first seven variables in the data matrix as nominal variables. Each of them have four levels: either or for nominal, resp. ordinal variables. In order to measure the effect of stratified sampling, two synthetic data sets have been generated:

  • (D1) In the first setting, the relative frequency is simulated using a Dirichlet prior of order with equal concentration parameters .

  • (D2) For the second data set, a Dirichlet prior of order with unequal concentration parameters given by is used for generating relative frequencies.

Figure 1: PFC for data sets (D1) and (D2) under both missing mechanims with various missing rates for the stochastic gradient tree boosting (S.GBM), the C5.0, the random forest with stratified sampling (Strat. RF), the (classical) random forest (RF) and the mice method (MICE).

The simulation results for MCAR and MAR are displayed in Figure 1. They reveal that the novel random forest method with stratified sampling achieved comparable results to the usual random forest method with simple resampling under all twelve simulation settings. Hence, no additional gain in imputation accuracy could be achieved by changing the resampling method to stratification. Although the stochastic gradient tree boosting (S.GBM) method does not distinguish itself compared to random forest (RF) under data set (D1), imputation performance increased if level frequencies are not equally present. Contrary, S.GBM resulted in slightly higher impuation accuracy results compared to the random forest in data set (D2) under the MCAR and MAR scheme. The results were significant at the level using the one-sided Brunner-Munzel test in favor of the S.GBM. For example, under the MAR sheme with missing rate in data set (D2), the mean PFC could be reduced by around using the S.GBM. The effect in PFC decrease in the same data set was larger when the missing mechanism was changed to an MAR-respecting procedure leading to a decrease of around . The C5.0 method and MICE performed worse for both data sets and missing mechanisms and are clearly dominated by the other three approaches.

5.2 Continuous variables only

We generate synthetic data containing only continuous realizations of random variables with observations originating from the following -variate distributions:

  • (D3) A multivariate normal -distribution with parameter and covariance matrix . Here, denotes the matrix of ones and the identity. Note, that this implies a constant correlation coefficient of for two different variables , .

  • Each observation in the data set is generated from for , where and is chosen such that and for , . We assume further that are either distributed with (D4) resp. (D5)degrees of freedom or log-normally distributed with location parameter and scale parameter (D6) resp. (D7).

Figure 2: Imputation error measured by NRMSE under the MCAR scheme of the stochastic gradient tree boosting (S.GBM), the random forest with parametric boostrapping (Norm RF), the random forest with kernel sampling (Kernel RF), the (classical) random forest (RF) and the mice method (MICE) on data sets (D3)(D7) with various missing rates () .
Figure 3: Imputation error measured by NRMSE under the MAR scheme of the stochastic gradient tree boosting (S.GBM), the random forest with parametric boostrapping (Norm RF), the random forest with kernel sampling (Kernel RF), the (classical) random forest (RF) and the mice method (MICE) on data sets (D3)(D7) with various missing rates ().

The results are displayed in Figure 2 and Figure 3 for the MCAR and MAR mechanism, respectively. Imputing missing values in continuous outcomes using the S.GBM resulted in less accurate imputation results compared to all random forest methods. These findings were consistent through all data sets with continuous outcomes and under all simulation settings. From the RF methods, a random forest with resampling from a multivariate Gaussian kernel (Kernel RF), showed the best overall results among all five methods compared in this paper. In particular, even when the assumption of multivariate normality was not met and substituted by heavy-tailed or non-symmetric distributions, the Kernel RF yielded favorable imputation accuracy. Only in case of MCAR and data sets (D3)(D6) the RF method based on parametric bootstrapping from an estimated multivariate normal distribution (Normal RF) showed comparable results. In all other settings, the Kernel RF resulted in the lowest imputation error. These findings were also significant at the level using the one-sided Brunner-Munzel test. For example, mean NRMSE could be reduced by in data set (D7) under the MCAR mechanism with missing rate when using Kernel RF. In case of a heavy-tailed distribution like in data set (D5), mean NRMSE could be reduced by up to under the MAR mechanism with missing rate. Within each missing mechanism and simulation set-up, imputation accuracy decreased slightly with increasing missing rates . However, it seems that the novel RF methods (Kernel and Normal RF) are slightly less affected by the missing rate than simple resampling within the random forest protocol. Comparable to the categorical cases, MICE yielded high imputation errors under both missing mechanisms.

For mixed-type data, we finally combine the resulting best procedures from the two previous sections, i.e. we apply the S.GBM and the Kernel RF method in cases where classification resp. regression tasks are required to be solved. The resulting imputation algorithm (Miss BooPF) is analyzed in detail in the sequel.

5.3 Mixed-type data

In this section, we consider data matrices consisting of both, continuous and categorical outcomes. In addition to the generated synthetic data sets, we were interested in the accuracy of imputation techniques on real-life examples. Therefore, two data sets have been considered, which we summarize in the following:

German Credit Data

was acquired from the UCI Machine Learning Repository. The primary goal of the data set was to classify credit applicants as credit-worthy or not by considering a set of attributes. The data set consists of

variables from customers applying for credit at a German credit granting institution. Three variables have been identified as continuous outcomes, whereas are of categorical nature (ten ordinal and seven nominal variables are present). Although missing values are not present, the level outcome of attributes (property) and (telephone) might be misleading. We interpret the feature characteristic ’unknown/no property’ and ’none’ as potential feature levels.

myPersonality.com was a Facebook application developed by David Stillwell from the University of Cambridge. In collaboration with more than 200 researchers worldwide, psycho-demographic profiles of more than million Facebook users were analyzed. The application provides psychological questionnaires and real psychometric tests while recording Facebook profiles from users across the world. For our simulation study, Facebook profiles of German and British users were analyzed. We extracted variables such as gender, age, relationship status, interests, locality, number of friends, timezone, political attitude, religion, IQ score, life satisfaction, employment status and account duration. Some features such as political attitude and religion had diverse characteristics with a same or similar meaning. Therefore, we grouped political attitude into levels apathetic, conservative, democratic, liberal and others. The variable religion was grouped into Christian, Muslim and Others.

Figure 4: Imputation error measured by PFC of the combined imputation procedure using the gradient tree boosting and the parametrically bootstrapped random forest with kernel re-sampling (missBooPF) as well as the usual random forest method (missForest) on the German Credit Data and Facebook Data under various missing rates () and mechanisms.
Figure 5: Imputation error measured by NRMSE of the combined imputation procedure using the gradient tree boosting and the parametrically bootstrapped random forest with kernel re-sampling (missBooPF) as well as the usualy random forest method (missForest) on the German Credit Data and Facebook Data under various missing rates () and missing mechanisms.

The simulation design for mixed-type data remains the same as before. Under the two missing mechanisms MCAR and MAR with various missing rates the best procedures from Section 5.1 and 5.2 are chosen and applied on the German Credit Data as well as on the myPersonality.com data set. In particular, in case of classification, S.GBM is chosen and for regression, we took advantage of the parametric version of the random forest using the kernel method with the normal scale rule (Kernel RF). We denote the combination as Miss BooPF and compare its imputation accuracy with that of the missForest method.

The results are displayed in Figure 4 (categorical outcomes) and Figure 5 (continuous outcomes) and are similar to the one obtained from synthetic data generation. In particular, the Miss BooPF increases the imputation accuracy considerably in most scenarios, where its advantage is slightly more pronounced in the MCAR cases. Considering the Facebook data set, mean NRMSE could be reduced by around under the MCAR mechanism with missing rate. For the German Credit data set, mean PFC could be reduced by up to under the MAR mechanism with missing rate.

Finally, note that the effect of the Kernel RF procedure is less distinct due to the relatively low number of continuous outcomes in both data sets.

6 Conclusion

We proposed several modifications of the random forest method and investigated its performance for imputing missing values following the proposal of [13, 14]. The key idea was to change the bootstrap step within the random forest by means of other resampling techniques. Imputation accuracy was measured for MCAR and MAR under various settings. In particular, continuous, categorical as well as mixed-type data were analyzed.

In case of only continuous variables, the proposed parametric random forest method based on kernel sampling (Kernel RF) yielded favorable results in terms of higher imputation accuracy. This result could be observed under the MCAR and MAR conditions across the various missing rates of , and . Even under the normal scale rule for choosing an optimal bandwidth matrix, violations towards the assumptions of normality did not affect the imputation accuracy dramatically. Instead, it improved existing imputation procedures such as mice and missForest in the synthetic data set-up as well as in the empirical analysis for mixed-type data. For categorical variables, the stochastic gradient tree boosting (S.GBM) performed comparably well and yielded better results than the missForest imputation method when level frequencies are not equally present. The same observation was met for other competitors such as mice or random forest procedures based on other resampling strategies such as stratified sampling. Based on these findings we proposed to apply a mixture of the Kernel RF and the S.GBM depending on the regression resp. classification task. The resulting Miss BooPF imputation method is applicable to all kind of data. In particular, its applicability for mixed-type data was exemplified by additionally imputing data from empirical studies on credit information and Facebook data. Again Miss BooPF yielded improved results for both, continuous as well as categorical outcomes.

Due to the increased imputation accuracy of the Kernel RF method, we plan to study its practical performance in pure regression problems together with theoretical analyses (such as consistency in metric spaces).


We are thankful to David Stillwell from Cambridge University and Michal Kosinsky from the Stanford Graduate School for providing us the Facebook data. We appreciate the initial support of the Daimler AG represented by the research department ’Computer Vision and Camera Systems’

Appendix A.

Missing values have been inserted artificially under consideration of the various missing mechanisms. For the MCAR, MAR and MNAR condition, a special kind of mechanism has been implemented, which will be described in the following:

  1. Missing Completely at Random. We replace values randomly with missing values. For every variable , , we assumed that , where is the overall missing rate based on .

  2. Missing at Random.

    We implement this mechanism by building dependency structures across missing values of subsequent variables using the logistic regression. First, randomly select

    as the initial index and assume that , where is the overall missing rate. The missing values for the subsequent variable are inserted using the observed components of as covariate values within a logistic regression model. The response variables are randomly generated in an upstream step to estimate model parameters. Therefore, let be the sub-vector of observed components of . We construct a training response by generating for all and set as the training sample on which the logistic regression will be conducted. If is the predicted probability of , then for the observations with the - smallest values of , we set . The process is continued in a pairwise fashion until all variables have been treated.

  3. Missing Not at Random. We implement a specific type of the MNAR mechanism by constructing censored data. For every column vector , , sort the observations in increasing order and randomly select an observational point , of the ordered sequence . We denote with the resulting order-permutation. Depending on whether lies above or below its resp. quantile, missing values are inserted, i.e. for

    with , and such that , .


  • [1] Amro, L. and Pauly, M. Diagnosing bootstrap success. Journal of Statistical Computation and Simulation, Taylor and Francis, Vol. 87, No. 6, p. 1148-1159, (2017)
  • [2] Breiman, L. and Friedman, J. and Stone, C. J. and Olshen, Richard A. Classification and regression trees. CRC press, (1984)
  • [3] Breiman, L. Random forests. Machine Learning, Vol. 45, No. 1, p. 5-32, (2001)
  • [4] Hapfelmeier, A. and Hothorn, T. and Ulm, K. and Strobl, C. A new variable importance measure for random forests with missing data. Statistics and Computing, Vol. 24, No. 1, p. 21-34, (2014)
  • [5] Hastie, T. and Tibshirani, R. and Friedman, J.

    Overview of supervised learning.

    Springer, p. 9-41, (2009)
  • [6] Konietschke, F. and Harrar, S. W. and Lange, K. and Brunner, E. Ranking procedures for matched pairs with missing data - Asymptotic theory and a small sample approximation. Computational Statistics & Data Analysis, Vol. 56, No. 5, p. 1090-1102, (2012)
  • [7] Konietschke, F. and Bathke, A.C. and Harrar, S.W. and Pauly, M. Parametric and nonparametric bootstrap methods for general MANOVA.

    Journal of Multivariate Analysis, Vol. 140, p. 291-301, (2015)

  • [8] Krishnamoorthy, K and Lu, F.

    A parametric bootstrap solution to the MANOVA under heteroscedasticity.

    Journal of Statistical Computation and Simulation, Vol. 80, No. 8, p. 873-887, (2010)
  • [9] Little, R. and Rubin, D. B. Statistical analysis with missing data. JJohn Wiley & Sons, (2014)
  • [10] Rubin, D. B. Inference and missing data. Biometrika, Vol. 63, No. 3, p. 581-592, (1976)
  • [11] Schafer J.L. Analysis of incomplete multivariate data. CRC press, (1997)
  • [12] Smaga L. Bootstrap methods for multivariate hypothesis testing. Communications in Statistics-Simulation and Computation , (2017)
  • [13] Stekhoven, D. J. and Bühlmann, P. MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics , Vol. 28, No. 1 p. 112-118, (2011)
  • [14] Stekhoven, D. J. Using the missForest Package. Lecture Notes, University of Zürich, (2011)
  • [15] Strobl, C. and Boulesteix, A. and Augustin, T. Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, Vol. 52, No. 1, p. 483-501, (2007)
  • [16] Vach, W. Missing values: statistical theory and computational practice. Computational Statistics, p. 345-354, (1994)
  • [17] Van Buuren, Stef and others. Multiple imputation of multilevel data. Handbook of advanced multilevel analysis, p. 173-196, (2011)
  • [18] Xu, J. and Harrar, S. W. Accurate mean comparisons for paired samples with missing data: An application to a smoking-cessation trial. Handbook of advanced multilevel analysis, Vol. 54, No. 2, p. 281-295, (2012)
  • [19] Xu, L.W. and Yang, F.Q. and Qin, S. and others. A parametric bootstrap approach for two-way ANOVA in presence of possible interactions with unequal variances. Journal of Multivariate Analysis, Vol. 115, p. 172-180, (2013)
  • [20] Waljee, A. K and others. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open, Vol. 3, (2013)
  • [21] Provost, F. and Jensen, D. and Oates, T. Efficient Progressive Sampling. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Vol. 3, p. 23-32, (1999)
  • [22] Weiss, G. M. and Provost, F. Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction.

    Journal of Artificial Intelligence Research, Vol. 23, p. 315-354, (2003)

  • [23] Friedman, J. H.

    Stochastic Gradient Boosting.

    Computational Statistics & Data Analysis, Vol. 38, p. 367-378, (2002)
  • [24] Friedman, J. H. Density Estimation Including Examples. Wiley StatsRef: Statistics Reference Online, (2016)
  • [25] Wand, M. P. and Jones, M. C. Multivariate Plug-in Bandwidth Selection. Computational Statistics, Vol. 9, p. 97-116, (1994)
  • [26] Chacón, J. E. and Duong, T. and Wand, M. P. Asymptotics for General Multivariate Kernel Density Derivative Estimators. Statistica Sinica, Vol. 21, p. 807-840, (2011)
  • [27] Kaiser, S. and Träger D. and Leisch, F. Generating Correlated Ordinal Random Variables. Department of Statistics, University of Munich, Technical Reports, Vol. 94, (2011)
  • [28] Bujlow, T. and Riaz, T. and Pedersen, J. M A method for classification of network traffic based on C5.0 Machine Learning Algorithm. International Conference on Computing, Networking and Communications. IEEE Press, p.237-241, (2012)
  • [29] Brunner, E. and Munzel, U. The Nonparametric Behrens-Fisher Problem: Asymptotic Theory and a Small-Sample Approximation. Biometrical Journal, Vol. 42, p.17-25, (2000)