1 Introduction
In quantitative science, partially observed data are essential components during the data collection process. Excluding observations with missing values from further analyses could lead to serious information losses. Imputation is one possible solution to tackle this issue by imputing missing values with reasonable estimates and conduct the statistical analyzes as if there has not been any missing values. So far, several imputation methods have been proposed ranging from simple mean imputation to more complex and advanced imputation techniques, see, e.g.
[9]. In contrast to constant accreditation like mean or median imputation, where dependency structures within the data set are not fully taken into consideration, regression and classification based techniques have prevailed in practice. For example the Rpackage mice assumes that the data generating process originates from a parametric distribution and imputed values are generated in a cyclic fashion using Gibbs Sampling resulting into chained equations, cf. [17]for details. However, extending its usage to more complex data structures, especially when the assumption of a parametric probability distribution is not met, the
micepackage can lose in imputation accuracy. Recursive partitioning methods, especially regression and classification trees are able to deal with mixedtype data sets while dissociating from the parametric assumption of the data generating process. Advantages such as the capability of dealing with collinearity effects, detecting complex interaction effects, process high dimensional data sets, rare overfitting problems and increased prediction accuracy made them popular in theory and practice. One famous approach of regression and classification trees is the CART algorithm developed in
[2]. Several modifications have been proposed such as randomized split selection, bootstrap aggregation (bagging) and boosting (see [5]). In particular, the incorporation of randomness in the CART algorithm has shown favorable prediction accuracy effects, resulting into the random forest algorithm (cf. [3]).In order to use the benefits of CARTbased decision trees,
[13, 14] proposed and implemented the R package missForest for predicting missing values using the random forest algorithm. As analyzed in [20] and [13], the missForest algorithm has beaten currently available imputation methods as MICE or nearest neighbor with regard to accuracy. The latter has been measured in terms of normalized root mean squared error and the proportion of false classification. Especially, under the missing completely at random scheme, favorable imputation results could be achieved. However, a proper analysis of the imputation accuracy under various missing mechanisms was not conducted. Therefore, one part of our simulation study covers the analysis of imputation schemes under the missing completely at random as well as the missing at random mechanism. Furthermore, we were interested in potential modifications towards existing CARTbased decision trees such that prediction accuracy can be increased. For this reason, several methods have been taken into consideration, that have not yet been compared to the missForest: C5.0, stochastic gradient tree boosting as well as modifications of the random forest method. The latter benefits by introducing bagging procedures during tree construction, which is limited to simple resampling strategies such as with and without replacement. However, these resampling strategies might be too costly in terms of computational memory and time, when the training data is too large. Therefore, [21]proposed an efficient sampling schedule that does not use the whole training data. In case of binary response variable,
[22] analyzed the effect of the class distribution on the tree induction. Since the random forest is constructed with the help of iterative bootstrapping (nonparametrically using simple resampling procedures such as with and without replacement), it is also worth to investigate the effect of other resampling strategies. Here, certain asymptotic models or parametric bootstrap strategies are of specific interest as e.g. proposed by [8, 19, 7] or [12] for continuous MANOVA. Therein, computational time cost and memory can be saved during tree construction, since training data information is compressed into the estimation of bootstrap parameters. Appropriately, we will propose several resampling methods and modify the protocol of the random forest algorithm which may result in increased imputation accuracy.2 Missing Mechanism
Let , be independent and identically distributed
 dimensional random vectors with common distribution
. The marginal distributions , are assumed to be either continuous or finitely discrete; corresponding to continuous or categorical outcome variables, respectively. According to this difference, we divide into continuous and categorical components, respectively. Let denote the corresponding data matrix and the matrix indicating whether , , is observed or not . Further, let and be the observed and missing parts of . In [10] the missing mechanism is defined through a probabilistic model, where depends on some unknown parameterand the random matrix
. It is said to be
Missing completely at random (MCAR), if

Missing at random (MAR), if

Missing not at random (MNAR), if
In quantitative science, mainly two approaches have been established to handle missing data. Imputation and data adjusted methods. For the latter the missing mechanism is incorporated in the likelihood of the data analysis model resulting into parametric or nonparametric MLEtype methods, see e.g. [16, 11, 18, 6, 1]. In contrast, imputation resp. multiple imputation assigns each missing value a reasonable estimate such that partially observed data is completed and the analysis can be conducted as if there has not been any missing values. Latter incorporates the uncertainty of the imputation procedure into sample estimates by predicting several values. [9] proved that sample estimates remain unaffected in case of an MCAR mechanism. However, for regression and classification trees, MCAR can result into biased estimates, as [15] have counteracted. Therefore, increasing imputation accuracy by imputing missing values close to the real data under MCAR or MAR mechanisms is an essential gain in later statistical inference. In a comparative simulation study, we want to illuminate the predictive accuracy of various imputation techniques under these missing mechanisms, by artificially inserting missing values in synthetic and empirical data sets.
3 Methodology
In this section, we give a brief introduction of CARTbased methods by enlightening the random forest, the C5.0 as well as the stochastic gradient tree boosting method. For the MICE method, we refer to [17].
Tying to the previous notation, we denote with the columnper muted data matrix for a permutation . Furthermore, let be the index set of observed components of its th column and denote the corresponding dimensional subvector of observed components as . Accordingly, we define as the set of indices, where the components of are missing and the subvector with missing components in . In addition, let be the permuted data matrix without the random vector and all entries, where has observed components. Hence, we set , as the permuted data matrix without but with all other entries where has missing components. Then, the general algorithm of [13] for imputing missing values within the class of decisiontree based regression and classification algorithms can be summarized as

[label=0:]

Sort the columns of the data matrix based on the missing rate in ascending order resulting into a permutation of the outcome variables.

Initially impute mean resp. mode values for missing continuous resp. categorical realizations of random variables.

Starting with the first permuted column , separate the permuted data matrix into four parts using the previous notation:

Train the chosen regression resp. classification tree method using the submatrix as covariates and as response variable.

Impute the missing values of using the trained regression resp. classification model with as covariate values. Then move to the next variable and repeat steps and until all variables have been treated.

As long as the error of the newly and previously imputed data set has not increased for the first time, return to step .
Thus, the algorithm turns the missing value problem into a regression or classification problem, depending on the scale of the response variable. In this way, the CARTbased procedure tries to overcome the initial issue of an unsupervised learning problem, by training the algorithm on observed values and predicting missing values afterwards. Recall that this does not necessarily lead to an MARrespecting mechanism, since being observed is defined through
, i.e. may have a missing value in a variable other than . CARTbased ensemble methods such as the random forest or the stochastic gradient tree boosting algorithm, however, benefit from the construction of several base learners and their later combination by majority vote or averaging. Their combination and construction of trees nevertheless differ as described below:
Random Forest:
For each decision tree, nonparametric bootstrap sampling is performed selecting observations either with or without replacement from the training set. In case of selecting without replacement, of the observations are selected for training. Furthermore, feature subspacing at each terminal node is conducted shrinking the potential number of split variables to . For classification, is chosen, whereas for regression, is selected. The choices are motivated by a simulation study conducted in [3] and the tradeoff between computational time and prediction accuracy (cf. [14]). 
Stochastic Gradient Tree Boosting:
The trees are constructed in sequential order by minimizing prediction error with respect to a loss function, say
. It is chosen depending on the nature of the variable: For continuous outcomes, the squared error loss is used while for categorical outcomes, either the Bernoulli loss or the multinomial loss is used subject to level size. Minimization is conducted using the gradient descent method. The subsequent tree is fitted to a bootstrap sample (without replacement) of of size and its corresponding pseudoresiduals of the previous (additive) tree as training signal. The final decision tree is then constructed by taking the weighted sum of all trees. 
C5.0:
The predecessor of the classification algorithm C5.0 has been considered as the best among data mining algorithms by the IEEE International Conference on Data Mining (ICDM) in December . Since the implemented version of the algorithm in R is only capable of solving classification tasks, in our scheme of work, C5.0 is used for imputing categorical outcomes only. The construction of trees is similar to the general CART method. In contrast to the random forest method, C5.0 does not conduct feature subspacing and it is able to construct multiway split trees while using the Shannon entropy as impurity measure. Motivated by [23], the algorithm is able to conduct boosting similar to the stochastic gradient tree boosting. In addition, pessimistic pruning is executed and decision trees can be transformed into rulesets (cf. [28]).
Stochastic gradient tree boosting and C5.0, however, have not yet been analyzed within the current missing framework and their imputation accuracy using the scheme of algorithm . Although the stochastic gradient tree boosting method and the C5.0 differ technically, their approach is based on the same boosting principle. Since its technical details are rather undocumented, missing values using the C5.0 algorithm are imputed similarly to the stochastic gradient tree boosting for categorical response variables.
To explicitly explain the differences between the approaches denote with a collection of decision trees with random seed , specifying the nature of each constructed tree and input vector . Then for the random forest algorithm imputes via
where is the trained ensemble of decision trees on and denotes the modus of a vector . For the stochastic gradient tree boosting, imputation is conducted as follows:
where estimates the class probability
in case of a categorical variable with
. The weights resp. , with are chosen by iteratively minimizingwith respect to , (cf. [23]). The prediction of missing values using the C5.0 algorithm is conducted in a similar fashion to the stochastic gradient tree boosting, where the difference lies in the construction of each single tree by using the Shannon entropy and pessimistic pruning. Although the stochastic gradient tree boosting and the random forest algorithm both make use of regression and classification trees, they mainly differ in the way the trees are constructed. Since stochastic gradient tree boosting fits subsequent trees on pseudoresiduals, in principle, it is not possible to implement this algorithm in a parallel fashion. Furthermore, their focus in achieving higher prediction accuracy is different. Recalling the decomposition of the mean squared error loss function evaluated in into
the random forest aims to achieve its decrease in mean squared error by the construction of (decorrelated) trees resulting into the decrease through variance reduction:
Here, is the pairwise tree correlation at . On the other hand, stochastic gradient tree boosting aims to minimize mean squared error by decreasing potential bias. This is realized through fitting subsequent trees not to the training signal , , but to the pseudoresiduals obtained from the previous (additive) tree model. However, both methods require to find an optimal value of the tree size within the missing scheme. In addition, the stochastic gradient tree boosting minimizes a loss function , which in practice is conducted using gradient descent method. Hence, the choice of the stepsize in the direction of the steepest descent clearly affects prediction accuracy.
4 New Resampling Proposal
In this section, we propose modifications of the resampling technique within the random forest method. We extend the possibility of constructing CARTbased ensemble trees from simple random sampling to parametric sampling. Since we request to treat both, continuous and categorical variables, several sampling strategies are considered:

Stratification: Assuming that the data set consists of ordinal or nominal variables, a stratified sampling procedure with replacement is conducted. This way, we aim to represent low frequent levels of the response variable into the training phase of the learning algorithm accordingly. The latter would, e.g. not be the case when the resample is generated from a multinomial distribution (results not shown).

Multivariate Normal Distribution:
If the data matrix consists of only continuous realizations, then the bootstrap sample in the random forest procedure is substituted by a socalled asymptoticmodel based (or parametric) bootstrap sample generated from . Here, the mean vector and the covariance matrix are estimated by their empirical counterparts and calculated from , . 
Multivariate Kernel: Similar to the latter case we additionally propose to draw the resample from a kernel estimator for multivariate data. In its general form it is given by
where is the bandwidth matrix and . The fitting quality towards the sample is more an issue of the choice of the bandwidth matrix rather than the choice of the kernel function (cf. [24]). Hence, we choose the Gaussian kernel . The selection of is usually conducted by minimizing the mean integrated squared error. Since this minimization problem is generally intractable, several classes of bandwidth matrices have been proposed, see e.g. [25]. We simply follow the normal scale rule given by
As proven in [26], this is the plugin estimate in case of a normal density under regularity conditions.
5 Simulation
In this section we investigate the performance of the algorithms described in Sections 3 and 4 using synthetic as well as empirical data. Performance is measured by means of the normalized root mean squared error (NRMSE) for continuous variables and proportion of false classification (PFC) for categorical variables, i.e.
Here, with and denotes the th component of the row vector before missing values have been inserted artificially. Correspondingly, is its imputed value. Additionally, we identify with the mean value of the sequence , .
First, PFC resp. NRMSE is measured for synthetic data consisting of only categorical resp. continuous variables. Thereafter, mixedtype data from empirical studies are analyzed. For each of the first two settings a total number of observations were generated with variables. For the case of consisting of categorical realizations of random vectors only, the variables where split into nominal variables with levels and ordinal variables with , . Missing values are artificially generated under the MCAR and MAR mechanism using a missing rate of and with Monte Carlo runs. Hence, the absolute number of missing observations is not defined through , but through , i.e. is the total number of missing observations.
For every simulation setup, the onesided BrunnerMunzel test [29] is conducted in favor of the resulting ’best’ procedure, due to its robustness towards heterogeneous settings. Since the random forest as well as stochastic gradient tree boosting are ensemble methods with tree structures as base learners, the choices of hyperparameters such as tree size and stepsize are highly influential on the final prediction accuracy. Due to an additional discretization process for continuous outcomes using the CART algorithm, an increased number of base learners will remarkably increase computational time cost. Hence, following the empirical analysis of [5] for the stochstic gradient tree boosting method and [13] for the random forest method, we set the number of base learners for the random forest to and for the stochastic gradient tree boosting to . Furthermore, the stepsize of the gradient descent procedure within the stochastic gradient tree boosting is set to . In contrast to the findings of [3] and [14], we set the number of split variables to for both, regression and classification tasks. A similar approach was also followed by [13]. Finally, note that the pvalues of each BrunnerMunzel test are encoded as , and over each boxplot.
5.1 Categorical variables only
We create correlated multivariate (ordinal) variables with the help of the R function rmvord (cf. [27]). The generation is based on thresholding a multivariate normal distribution. However, variables are treated as nominal outcomes and their feature presentation is transformed into elements of . Setting the pairwise correlation to , with , we create ordinal variables and treat the first seven variables in the data matrix as nominal variables. Each of them have four levels: either or for nominal, resp. ordinal variables. In order to measure the effect of stratified sampling, two synthetic data sets have been generated:

(D1) In the first setting, the relative frequency is simulated using a Dirichlet prior of order with equal concentration parameters .

(D2) For the second data set, a Dirichlet prior of order with unequal concentration parameters given by is used for generating relative frequencies.
The simulation results for MCAR and MAR are displayed in Figure 1. They reveal that the novel random forest method with stratified sampling achieved comparable results to the usual random forest method with simple resampling under all twelve simulation settings. Hence, no additional gain in imputation accuracy could be achieved by changing the resampling method to stratification. Although the stochastic gradient tree boosting (S.GBM) method does not distinguish itself compared to random forest (RF) under data set (D1), imputation performance increased if level frequencies are not equally present. Contrary, S.GBM resulted in slightly higher impuation accuracy results compared to the random forest in data set (D2) under the MCAR and MAR scheme. The results were significant at the level using the onesided BrunnerMunzel test in favor of the S.GBM. For example, under the MAR sheme with missing rate in data set (D2), the mean PFC could be reduced by around using the S.GBM. The effect in PFC decrease in the same data set was larger when the missing mechanism was changed to an MARrespecting procedure leading to a decrease of around . The C5.0 method and MICE performed worse for both data sets and missing mechanisms and are clearly dominated by the other three approaches.
5.2 Continuous variables only
We generate synthetic data containing only continuous realizations of random variables with observations originating from the following variate distributions:

(D3) A multivariate normal distribution with parameter and covariance matrix . Here, denotes the matrix of ones and the identity. Note, that this implies a constant correlation coefficient of for two different variables , .

Each observation in the data set is generated from for , where and is chosen such that and for , . We assume further that are either distributed with (D4) resp. (D5)degrees of freedom or lognormally distributed with location parameter and scale parameter (D6) resp. (D7).
The results are displayed in Figure 2 and Figure 3 for the MCAR and MAR mechanism, respectively. Imputing missing values in continuous outcomes using the S.GBM resulted in less accurate imputation results compared to all random forest methods. These findings were consistent through all data sets with continuous outcomes and under all simulation settings. From the RF methods, a random forest with resampling from a multivariate Gaussian kernel (Kernel RF), showed the best overall results among all five methods compared in this paper. In particular, even when the assumption of multivariate normality was not met and substituted by heavytailed or nonsymmetric distributions, the Kernel RF yielded favorable imputation accuracy. Only in case of MCAR and data sets (D3) – (D6) the RF method based on parametric bootstrapping from an estimated multivariate normal distribution (Normal RF) showed comparable results. In all other settings, the Kernel RF resulted in the lowest imputation error. These findings were also significant at the level using the onesided BrunnerMunzel test. For example, mean NRMSE could be reduced by in data set (D7) under the MCAR mechanism with missing rate when using Kernel RF. In case of a heavytailed distribution like in data set (D5), mean NRMSE could be reduced by up to under the MAR mechanism with missing rate. Within each missing mechanism and simulation setup, imputation accuracy decreased slightly with increasing missing rates . However, it seems that the novel RF methods (Kernel and Normal RF) are slightly less affected by the missing rate than simple resampling within the random forest protocol. Comparable to the categorical cases, MICE yielded high imputation errors under both missing mechanisms.
For mixedtype data, we finally combine the resulting best procedures from the two previous sections, i.e. we apply the S.GBM and the Kernel RF method in cases where classification resp. regression tasks are required to be solved. The resulting imputation algorithm (Miss BooPF) is analyzed in detail in the sequel.
5.3 Mixedtype data
In this section, we consider data matrices consisting of both, continuous and categorical outcomes. In addition to the generated synthetic data sets, we were interested in the accuracy of imputation techniques on reallife examples. Therefore, two data sets have been considered, which we summarize in the following:
German Credit Data
was acquired from the UCI Machine Learning Repository. The primary goal of the data set was to classify credit applicants as creditworthy or not by considering a set of attributes. The data set consists of
variables from customers applying for credit at a German credit granting institution. Three variables have been identified as continuous outcomes, whereas are of categorical nature (ten ordinal and seven nominal variables are present). Although missing values are not present, the level outcome of attributes (property) and (telephone) might be misleading. We interpret the feature characteristic ’unknown/no property’ and ’none’ as potential feature levels.myPersonality.com was a Facebook application developed by David Stillwell from the University of Cambridge. In collaboration with more than 200 researchers worldwide, psychodemographic profiles of more than million Facebook users were analyzed. The application provides psychological questionnaires and real psychometric tests while recording Facebook profiles from users across the world. For our simulation study, Facebook profiles of German and British users were analyzed. We extracted variables such as gender, age, relationship status, interests, locality, number of friends, timezone, political attitude, religion, IQ score, life satisfaction, employment status and account duration. Some features such as political attitude and religion had diverse characteristics with a same or similar meaning. Therefore, we grouped political attitude into levels apathetic, conservative, democratic, liberal and others. The variable religion was grouped into Christian, Muslim and Others.
The simulation design for mixedtype data remains the same as before. Under the two missing mechanisms MCAR and MAR with various missing rates the best procedures from Section 5.1 and 5.2 are chosen and applied on the German Credit Data as well as on the myPersonality.com data set. In particular, in case of classification, S.GBM is chosen and for regression, we took advantage of the parametric version of the random forest using the kernel method with the normal scale rule (Kernel RF). We denote the combination as Miss BooPF and compare its imputation accuracy with that of the missForest method.
The results are displayed in Figure 4 (categorical outcomes) and Figure 5 (continuous outcomes) and are similar to the one obtained from synthetic data generation. In particular, the Miss BooPF increases the imputation accuracy considerably in most scenarios, where its advantage is slightly more pronounced in the MCAR cases. Considering the Facebook data set, mean NRMSE could be reduced by around under the MCAR mechanism with missing rate. For the German Credit data set, mean PFC could be reduced by up to under the MAR mechanism with missing rate.
Finally, note that the effect of the Kernel RF procedure is less distinct due to the relatively low number of continuous outcomes in both data sets.
6 Conclusion
We proposed several modifications of the random forest method and investigated its performance for imputing missing values following the proposal of [13, 14]. The key idea was to change the bootstrap step within the random forest by means of other resampling techniques. Imputation accuracy was measured for MCAR and MAR under various settings. In particular, continuous, categorical as well as mixedtype data were analyzed.
In case of only continuous variables, the proposed parametric random forest method based on kernel sampling (Kernel RF) yielded favorable results in terms of higher imputation accuracy. This result could be observed under the MCAR and MAR conditions across the various missing rates of , and . Even under the normal scale rule for choosing an optimal bandwidth matrix, violations towards the assumptions of normality did not affect the imputation accuracy dramatically. Instead, it improved existing imputation procedures such as mice and missForest in the synthetic data setup as well as in the empirical analysis for mixedtype data. For categorical variables, the stochastic gradient tree boosting (S.GBM) performed comparably well and yielded better results than the missForest imputation method when level frequencies are not equally present. The same observation was met for other competitors such as mice or random forest procedures based on other resampling strategies such as stratified sampling. Based on these findings we proposed to apply a mixture of the Kernel RF and the S.GBM depending on the regression resp. classification task. The resulting Miss BooPF imputation method is applicable to all kind of data. In particular, its applicability for mixedtype data was exemplified by additionally imputing data from empirical studies on credit information and Facebook data. Again Miss BooPF yielded improved results for both, continuous as well as categorical outcomes.
Due to the increased imputation accuracy of the Kernel RF method, we plan to study its practical performance in pure regression problems together with theoretical analyses (such as consistency in metric spaces).
Acknowledgement
We are thankful to David Stillwell from Cambridge University and Michal Kosinsky from the Stanford Graduate School for providing us the Facebook data. We appreciate the initial support of the Daimler AG represented by the research department ’Computer Vision and Camera Systems’
Appendix A.
Missing values have been inserted artificially under consideration of the various missing mechanisms. For the MCAR, MAR and MNAR condition, a special kind of mechanism has been implemented, which will be described in the following:

Missing Completely at Random. We replace values randomly with missing values. For every variable , , we assumed that , where is the overall missing rate based on .

Missing at Random.
We implement this mechanism by building dependency structures across missing values of subsequent variables using the logistic regression. First, randomly select
as the initial index and assume that , where is the overall missing rate. The missing values for the subsequent variable are inserted using the observed components of as covariate values within a logistic regression model. The response variables are randomly generated in an upstream step to estimate model parameters. Therefore, let be the subvector of observed components of . We construct a training response by generating for all and set as the training sample on which the logistic regression will be conducted. If is the predicted probability of , then for the observations with the  smallest values of , we set . The process is continued in a pairwise fashion until all variables have been treated. 
Missing Not at Random. We implement a specific type of the MNAR mechanism by constructing censored data. For every column vector , , sort the observations in increasing order and randomly select an observational point , of the ordered sequence . We denote with the resulting orderpermutation. Depending on whether lies above or below its resp. quantile, missing values are inserted, i.e. for
with , and such that , .
References
 [1] Amro, L. and Pauly, M. Diagnosing bootstrap success. Journal of Statistical Computation and Simulation, Taylor and Francis, Vol. 87, No. 6, p. 11481159, (2017)
 [2] Breiman, L. and Friedman, J. and Stone, C. J. and Olshen, Richard A. Classification and regression trees. CRC press, (1984)
 [3] Breiman, L. Random forests. Machine Learning, Vol. 45, No. 1, p. 532, (2001)
 [4] Hapfelmeier, A. and Hothorn, T. and Ulm, K. and Strobl, C. A new variable importance measure for random forests with missing data. Statistics and Computing, Vol. 24, No. 1, p. 2134, (2014)

[5]
Hastie, T. and Tibshirani, R. and Friedman, J.
Overview of supervised learning.
Springer, p. 941, (2009)  [6] Konietschke, F. and Harrar, S. W. and Lange, K. and Brunner, E. Ranking procedures for matched pairs with missing data  Asymptotic theory and a small sample approximation. Computational Statistics & Data Analysis, Vol. 56, No. 5, p. 10901102, (2012)

[7]
Konietschke, F. and Bathke, A.C. and Harrar, S.W. and Pauly, M. Parametric and nonparametric bootstrap methods for general MANOVA.
Journal of Multivariate Analysis, Vol. 140, p. 291301, (2015)

[8]
Krishnamoorthy, K and Lu, F.
A parametric bootstrap solution to the MANOVA under heteroscedasticity.
Journal of Statistical Computation and Simulation, Vol. 80, No. 8, p. 873887, (2010)  [9] Little, R. and Rubin, D. B. Statistical analysis with missing data. JJohn Wiley & Sons, (2014)
 [10] Rubin, D. B. Inference and missing data. Biometrika, Vol. 63, No. 3, p. 581592, (1976)
 [11] Schafer J.L. Analysis of incomplete multivariate data. CRC press, (1997)
 [12] Smaga L. Bootstrap methods for multivariate hypothesis testing. Communications in StatisticsSimulation and Computation , (2017)
 [13] Stekhoven, D. J. and Bühlmann, P. MissForest  nonparametric missing value imputation for mixedtype data. Bioinformatics , Vol. 28, No. 1 p. 112118, (2011)
 [14] Stekhoven, D. J. Using the missForest Package. Lecture Notes, University of Zürich, (2011)
 [15] Strobl, C. and Boulesteix, A. and Augustin, T. Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, Vol. 52, No. 1, p. 483501, (2007)
 [16] Vach, W. Missing values: statistical theory and computational practice. Computational Statistics, p. 345354, (1994)
 [17] Van Buuren, Stef and others. Multiple imputation of multilevel data. Handbook of advanced multilevel analysis, p. 173196, (2011)
 [18] Xu, J. and Harrar, S. W. Accurate mean comparisons for paired samples with missing data: An application to a smokingcessation trial. Handbook of advanced multilevel analysis, Vol. 54, No. 2, p. 281295, (2012)
 [19] Xu, L.W. and Yang, F.Q. and Qin, S. and others. A parametric bootstrap approach for twoway ANOVA in presence of possible interactions with unequal variances. Journal of Multivariate Analysis, Vol. 115, p. 172180, (2013)
 [20] Waljee, A. K and others. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open, Vol. 3, (2013)
 [21] Provost, F. and Jensen, D. and Oates, T. Efficient Progressive Sampling. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Vol. 3, p. 2332, (1999)

[22]
Weiss, G. M. and Provost, F. Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction.
Journal of Artificial Intelligence Research, Vol. 23, p. 315354, (2003)

[23]
Friedman, J. H.
Stochastic Gradient Boosting.
Computational Statistics & Data Analysis, Vol. 38, p. 367378, (2002)  [24] Friedman, J. H. Density Estimation Including Examples. Wiley StatsRef: Statistics Reference Online, (2016)
 [25] Wand, M. P. and Jones, M. C. Multivariate Plugin Bandwidth Selection. Computational Statistics, Vol. 9, p. 97116, (1994)
 [26] Chacón, J. E. and Duong, T. and Wand, M. P. Asymptotics for General Multivariate Kernel Density Derivative Estimators. Statistica Sinica, Vol. 21, p. 807840, (2011)
 [27] Kaiser, S. and Träger D. and Leisch, F. Generating Correlated Ordinal Random Variables. Department of Statistics, University of Munich, Technical Reports, Vol. 94, (2011)
 [28] Bujlow, T. and Riaz, T. and Pedersen, J. M A method for classification of network traffic based on C5.0 Machine Learning Algorithm. International Conference on Computing, Networking and Communications. IEEE Press, p.237241, (2012)
 [29] Brunner, E. and Munzel, U. The Nonparametric BehrensFisher Problem: Asymptotic Theory and a SmallSample Approximation. Biometrical Journal, Vol. 42, p.1725, (2000)