1 General introduction
1.1 Random forests with missing values: a contemporary challenge
Random forests and recursive trees are widely used in applied statistics and computer science. The popularity of recursive trees relies on several factors: their easy interpretability, the fact that they can be used for both regression and classification tasks, the small number of hyper-parameters to be tuned and finally, their non-parametric nature that allows their use to infer arbitrarily complex relations between the input and the output space. A random forest combines several randomized trees, improving the prediction accuracy at a cost of a slight lost in interpretation. This technique is easily parallelizable which has made it one of the most popular tools for handling high dimensional data sets. It has been successfully involved in various practical problems, including chemioinformatics, ecology, 3D object recognition, bioinformatics and econometrics. biau2016random present a detailed list of applications as well as a review on random forests. In the present work we have focused on the ability of random forests to deal with missing values.
Three groups of algorithms.
Approaches to handle missing values regarding the use of random forests could be classified in three groups. In thefirst group
, we classify the algorithms that impute the missing values without using ad-hoc techniques, likenearest neighbor imputation (-imputation) troyanskaya2001missing or multiple imputation chained equations (MICE) van2006fully, for example. Once performed this imputation step, the regular random forest algorithm is implemented on the completed data set. In this paper, we will not pay much attention to this first group of algorithms as the imputation step is unrelated with the later use of the random forest algorithm which is finally, only treated as a black box. The second group is composed of the algorithms that use the random forest structure (directly or through some extra method like proximity matrices) to impute the missing values. They typically use iterated updates of the imputations to refine them. Logically, we decided to store the methods proposed by Breiman2003,ishioka2013imputation,stekhoven2011missforest in this group. The third group
consists in built-in methodologies that include the incomplete observations directly in the construction of the recursive trees (without imputation steps) to compute the random forest’s estimation. Are included in this group: surrogate splits breiman1984CART, missing incorporated in attributes (MIA) twala2008good or the algorithm proposed by gomezmendez2020consistency, which is the main focus of this paper.
1.2 Literature on missing values treatment with random forests
Handling missing values through different algorithms has recently received much attention with several simulation studies comparing the performance of distinct methods feelders1999handling,farhangfar2008impact,rieger2010random,hapfelmeier2012recursive,josse2019consistency. In this paper, we propose a comparison study of techniques of the last two groups since they appear to be the most used and precise in practice. Of special focus for this present paper is the study of the practical performance of the new approach presented in gomezmendez2020consistency, which has the advantage to be consistent under an MCAR mechanism and a generalized additive model.
The idea of handling missing values with random forests algorithms is not new whether it be for a “filling the gaps” task, a learning or a prediction task with corrupted data. In this section, we describe the current state of the art in the study of this problem. The study developed by feelders1999handling is one of the first to present a result on missing values using recursive trees. This work compares the error rate between surrogate splits in a single decision tree, and the imputation procedure presented by schafer1997analysis,schafer1998multiple. Two data sets are considered in the study, given by the so-calledWaveform data set (presented by breiman1984CART) and the Prima Indian Datasets
(available at the UCI machine learning repository). The percentage of missing values varies between 10% and 45% for the first data set, where the missing values are introduced accordingly to an MCAR mechanism. On the other hand, 10% of the observations present missing values in the second data set. feelders1999handling concludes that the imputation procedure yields significantly less missclassification error rate.
The work of farhangfar2008impact presents a comparison of classification methods like support vector machines,-imputation and the decision tree method C4.5 quinlan1993 applied to missing data, and imputation algorithms, including single imputation and MICE. In total 15 data sets are considered, inducing missing values with up to 50% of the observations per variable being missing. The authors conclude that the application of MICE leads to better results in most of the instances.
rieger2010random compare surrogate splits introduced in the conditional inference forests hothorn2006unbiased with -imputation, with no clear advantage for either of the methods. This study considers classification and regression problems with three different correlation structures and seven schemes for missing values. However, the percentage of missing values is kept constant.
hapfelmeier2012recursive present a comparison study using trees built with the CART criterion, conditional inference trees and their corresponding random forests, focusing on surrogate splits to handle missing values and the use of MICE to impute missing values. The authors consider 12 real life data sets, half of them for regression and the other half for classification, 8 of the data sets already present missing values while in the rest 4 data sets missing values are induced. Arguing that rieger2010random found similar results for MCAR and MAR mechanisms, the missing values are solely introduced according to an MCAR missing-data mechanism, considering missing rates between 0% (benchmark) to 40%. Their results do not show a clear improvement by using multiple imputation, with MICE even producing inferior results when missing values are limited in number and are not arbitrary spread across the data. The results also show a similar result between trees constructed with the CART criterion and conditional inference trees.
The simulation developed by josse2019consistency studies the performance of several methods to handle missing values in regression tasks. In this study, three different regression functions are considered, one of them being linear, one quadratic and the third being the so-called “friedman1” friedman1991multivariate. They consider the MCAR mechanism and censoring, inducing up to 20% of observations with missing values in the first variable. The authors compare conditional inference trees, trees and random forests based on the CART criterion and XGBoost chen2016xgboost. The algorithms to handle missing values include MIA, surrogate splits for both trees built with the CART criterion and conditional inference trees, block propagation (only implemented in XGBoost), surrogate splits, mean-imputation and EM imputation dempster1977maximum. The authors clearly favors the usage of MIA for tree-based methods, while block propagation could also be a good method.
Breiman2003 proposes an algorithm based on random forests to impute the missing values making use of the observed values and the proximity matrix. ishioka2013imputation presents an improvement which instead of considering only the observed values in the imputation procedure, it considers nearest neighbors for continuous variables and all the observations for categorical variables. ishioka2013imputation compares these two approaches and-imputation introducing missing values completely at random in the Spam data set and considering missing data rates from 5% to 60%, concluding that the proposed approach outperforms -imputation and the previous method proposed by Breiman2003. However, the same author comments that other missing-data mechanisms should be considered.
stekhoven2011missforest introduce the missForest algorithm which imputes the missing values iteratively considering it as a regression problem in which the imputation of the current variable is done using all the other variables. In the simulation study presented, they consider classification and regression problems in 7 different data sets where 10%, 20% or 30% of the values are removed completely at random, concluding that missForest outperforms -imputation and MICE. Furthermore, the authors ensure that the full potential of missForest is deployed when data includes interactions or non-linear relations between variables of unequal scales and different types.
In Section 5, we compare the performance of most of these techniques with our new proposal taken from gomezmendez2020consistency when we let the percentage of missing values to vary from 0% to 95%.
The rest of the paper is organized as follows. In Section 2, we give the general background on random forests algorithms. We formally introduce the CART criterion and the missing-data mechanisms. In Section 3, we describe our proposal to build recursive trees including the missing values and we give upper bounds on its computational complexity. In Section 4, we describe the other methods that we use as benchmark in the simulation study explained in detail in Section 4.2. In Section 5, we presents and describe the results. Finally Section 6 presents the conclusions.
2 Settling the concepts
2.1 Random forests built upon the CART criterion
Throughout this article, we assume to have access to a training data set
where the response variablesare real-valued and the input variables belong to some space . The objective is to use the data to construct a learning model, also called learner, predictor or estimator, that estimates the regression function .
The random forest is made of a set of regression trees that are later aggregated all together with a simple mean idea. Each branch of the tree will be random (in its construction process) and represents a partition of the input space in smaller regions. Moving along the path of the tree corresponds to a choice of one of the possible regions. To construct this partition of the input space, the trees are built in a recursive way (hence the name of recursive trees). The root of the tree corresponds to the whole input space . Then, recursively, a region is chosen and is split into two smaller regions. This process is continued until some stopping rule is applied. At each step of the tree construction, the partition performed over a cell (or equivalently its corresponding node) is determined by maximizing some split-criterion. The present work focuses in the so-called CART split criterion. We first introduce some important notations.
denotes a general node (or cell).
holds for the number of points in .
The notation denotes a cut in , where
is a direction, , and
is the position of the cut in the th direction, between the limits of .
is the set of all possible cuts in node .
A cell is split into two cells denoted and .
(resp. , ) is the empirical mean of the response variable for the indexes such that belongs to the cell (resp. , ).
Then, the CART split criterion for a generic cell is defined as
with the convention .
As mentioned above, a random forest is a predictor consisting of randomized trees. The randomization is introduced in two different parts of the tree construction. Prior to the construction of each tree, observations are extracted at random with (or without) replacement from the learning data set . Only these observations are taken into account in the tree construction. Then, at each cell a split (or cut) is performed by maximizing the split criterion over a number mtry of directions , chosen uniformly at random. The tree construction is stopped when each final node contains less or equal than nodesize points. Hence, the parameters of this algorithm are:
, which is the number of trees in the forest.
, which is the number of observations in each tree.
, which is the number of directions (features) chosen, candidates to be split. We denote by
the features selected in each step.
, which is the maximum number of observations for a node to be a final cell.
The randomization introduced in the trees (independent from the original source of randomness in the sample
) is represented in a symbolic random variable. To each tree – randomized with the random variable – there is associated a predicted value at a query point , denoted as . The different trees are constructed by the same procedure but with independent randomization, so the random variables are i.i.d. with common law . In our choice of the construction rules, consists in the observations selected for the tree and the candidate variables to split at each step. Finally, the th tree’s estimation at point is defined as
where is the set of the observations selected prior to the construction of the th tree, is the unique final cell that contains , and is the number of observations which belong to the cell . The average of the trees forms the random forest’s estimation given by
It is known from the work of breiman2001random that the random forest does not overfit when tends to infinity. This makes the parameter only restricted by computational power.
2.2 Missing-data mechanisms
The concept of missing-data mechanism (introduced by rubin1976inference) establishes the relationship between missingness and data. Before introducing the missing-data mechanisms, let us define a new variable, called the missing-data indicator
We assume throughout this work that the response has no missing values which makes unnecessary to define an indicator of missing variable for . Then, the mechanisms are fully characterized by the information of the conditional distribution of given . There are three possible missing-data mechanisms.
Missing Completely at Random (MCAR).
We say that the variable is MCAR if is independent from . In other words, under the MCAR assumption, a coordinate
has some probability to be missing in the sample and this probability does not depend on the value ofnor the response variable .
Missing at Random (MAR).
Let us define the set , thus
is the vector formed by the variables that are always observed. The variableis MAR if
That is, the probability of being missing only depends on observed data.
Missing Not at Random (MNAR).
If the probability of missingness depends on unobserved values we say that is MNAR.
3 A new approach
It is of special interest to study the performance of the algorithm presented by gomezmendez2020consistency who have proven its consistency for an MCAR mechanism and a generalized additive model, being one of the first results on the consistency of random forests with missing values. This algorithm adapts the original CART criterion to manage missing data directly in the construction of the regression trees. We now recall this proposal.
3.1 Description of the algorithm
Assume for now that we have a partition of the input space. When there are missing values in the data set, there is uncertainty on the region to which each observation belongs. Thus, the original CART criterion becomes intractable since the quantities , , , , , , , and can not be computed. The proposed approach keeps the form of the CART criterion and makes use of adapted imputations for the intractable parts which allows the computation of a modified version of the CART criterion. Unlike most of the imputation techniques, the imputation step is not performed independently of the evaluation of the CART criterion but is integrated to its later optimization. While a cut is selected by maximizing the original CART criterion, now a couple (cut, imputation) is chosen at each split in the creation of the random tree. The idea is that, for a cut, the observations with missing values are assigned to the child node that maximizes this modified CART criterion. At the end, the missing observations will belong to a final node of the tree, which can give an “imputation” of the missing values as a region of the input space, which in turn would be translated into a “cloud” of possible regions for the missing values when the random forest is considered. For seek of clarity the proposal has been divided in two parts, Algorithm 1 describes the steps for the construction of the random forest with missing entries and Algorithm 2 which establishes the steps to find the best cut and assignation of missing data. At the beginning of the construction of each tree it is necessary to initialize the list of current not-final cells as well as the induced partitions of the input variables and the target variable and , respectively. Along with the initialization of these sets, the final partition of the input space and the corresponding partitions of the input variables and the target variable are initialized. When a cell satisfies the criteria to be a final cell, it is removed from and added to , the input variables and the target variables belonging to are also removed from and , and added to and , respectively. In the sequel, we denote as the number of points belonging to , whose value in the direction has been observed and the number of observations assigned to cell whose value in the direction is missing.
3.2 Complexity of the Algorithm
At first sight, it looks that the number of possible assignations defined by in Algorithm 2 are all the combinations for the assignations of the missing observations. This would lead to exponential complexity in the number of calculations of the CART criterion and then to an intractable algorithm. However, the number of possible candidate assignations to maximize the CART criterion is lower than this quantity. To see this, fix a cell and a cut in , to keep a simple notation let (resp. ) be the mean of the response variable for the points belonging to the left (right) node and observed in the direction . Suppose without lost of generality that and denote by the set of indexes of the observations assigned to the cell whose direction is missing, without lost of generality assume that . Because and maximizing the CART criterion implies to make as different as possible the average of the target on the left node from the average of the target on the right node, then observations with the lowest values should be assigned to the left node and the observations with the largest values should be assigned to the right node. Therefore, there exists such that assigning and maximizes the CART criterion. Hence, the set with all possible assignations has a cardinality of and there is only a linear number of assignations to be considered. Now, we can calculate the complexity of the algorithm.
Let be the number of operations needed to calculate the CART criterion for a given cut and assignation of missing values. It makes sense to consider the complexity of our algorithm with respect to since every other random forest algorithm that we compare to also make a certain number of evaluation of the CART criterion. Consider the Algorithm 2 and note that and . Thus, the number of necessary operations to the get the best cut and assignation is
On the other hand, denote by the set of non-final nodes of a tree, it is clear that , where the equality holds when each final cell contains just one observation. Now, let be the number of operations needed to build a regression tree with our approach, then
3.3 On simplifications of the algorithm
In this section we discuss a remark that leads to further simplification of our algorithm in order to reach an algorithmic complexity of the same order than MIA. Recall from Section 3.2 that the algorithmic complexity of our proposal is of the order times the calculation required to compute one single CART criterion and that MIA algorithm (in ) is quicker by a linear factor.
From the CART criterion written in Equation 1, we see that the criterion is convex with respect to the variable that counts the number of observations assigned into , see Figure 1 for a graphical representation of this fact. This simple remark leads us to opt for a dichotomy strategy for the optimal assignation of the missing variables. Indeed, at first step, we assign half of the missing variables to the left (in ) and half to the right (in ). The corresponding assignation is then given by a (called pivot for the dichotomy) such that and , assuming once again that without lost of generality. Such configuration gives a CART criterion resumed (abusively) in the notation . We search for the optimal assignation through calculations of the local gradient given by
If the value is positive (resp. negative), then the optimal is bigger or equal (resp. smaller or equal ) to . Say (for example), that . Then, we place the new pivot at the point and compute . Once again, the sign of the gradient tells us in which sub interval of the optimal assignation is. We, then repeat this simple searching procedure until ending with an interval containing only two points. The biggest value of the two gives the optimal assignation of our variables. This dichotomy procedure allows to consider only computations of the local gradient and a simple comparison at the end. This allows us to replace the complexity of the algorithm by
which is comparable to the MIA algorithm complexity up to a logarithmic factor. This optimization of the procedure is particularly important when the algorithm deals with a very corrupted data set where the missing entrees could represent a significant proportion of the all data.
4 Experimental details
4.1 Benchmark methods
Many methods proposed in the literature to handle missing data using random forests operate through imputation in a recursive way. They start by using the original training data set to fill the blank spaces in a roughly way. For example, using the median of the observed values in the specific direction. We denote this new data set as .
The imputed data set is used to build a random forest. Then, some structures of the forest are exploited, like the so-called proximity matrix, improving the imputation and resulting in a new data set . The procedure continues by iterations until some stopping rule is applied. These stopping rules trigger, for example, when the update in imputed variables becomes negligible or when a fix number of iterations is achieved. More formally, let us define
where is the imputation of at time , and let . To properly introduce the methods considered in the simulation study, we need to define the connectivity between two points in a tree and the proximity matrix of the forest. Let be 1 if and only if and belong to the same final cell (we say that and are connected in the tree ) in the tree designed with and the parameter . Otherwise, . Finally, the proximity between between and in the random forest , is defined as
Analogously, we define the proximity between and at time (i.e. using the data set ). We also define as the indexes where is missing, and as the indexes where is observed.
We consider three different approaches which impute missing values through random forests. These methods correspond to the algorithms presented respectively in Breiman2003,ishioka2013imputation,stekhoven2011missforest. We refer to those as Breiman’s approach, Ishioka’s approach and missForest approach, respectively. These algorithms impute the missing values through iterative improvements and hence belong to the second category. We also consider MIA, which is an algorithm that handle the missing values directly in the construction of the trees, assigning all the missing values to the same cell. As simple baselines we consider median-imputation of the missing values and listwise deletion (i.e. removing the observations with missing values) before the construction of the random forest.
4.1.1 With imputation
In this section, we describe the algorithms that belong to the second group (with iterative imputation) that are included in the upcoming simulations.
If is a continuous variable, is the weighted mean of the observed values in , where the weights are defined by the proximity matrix of the previous random forest, that is
On the other hand, if is a categorical variable, is given by
That is, is the class that maximizes the sum of the proximity considering the observed values in the class.
If is a continuous variable, is the weighted mean of the nearest neighbors, according to the proximity matrix, over all the values, both imputed and observed. The
closest values are chosen to make more robust the method and avoid values which are outliers.
For categorical variables, it is not necessary to see only the closest values because the outliers of will have few attention. Meanwhile the proximity with missing values should have more attention, especially when the missing rate is high. Hence, if is a categorical variable, is given by
This algorithm treats the imputation as a regression problem by itself, where the target variable is the input variable with missing values. MissForest predicts the missing values using a random forest trained on the observed parts of the data set. More formally for an arbitrary variable we can separate the data set into four parts:
the observed parts of the variable , denoted as ;
the missing values of the variable , denoted as ;
the variables other than and the observations whose indexes belong to , denoted by ;
the variables other than and the observations whose indexes belong to , denoted by .
To begin, the original training data set is used to fill the blank spaces in a roughly way, for example, with the median of the observed values in the variable. Then, for each variable a random forest is trained with target and predictors , then the missing values are imputed with the prediction of using the random forest.
4.1.2 Without imputation
Missing Incorporated in Attributes.
The Missing Incorporated in Attributes (MIA) consists in keeping all the missing values together when a split is performed. That is, missing values are assigned together to the child node that maximizes the CART criterion (or any other considered criterion). Thus, the splits with this approach assign the values according to one of the following rules:
4.2 Description of the parameters
The regression function considered in this study is the so-called “friedman1” friedman1991multivariate, which has been used in previous simulation studies friedman1991multivariate,breiman1996bagging,rieger2010random,josse2019consistency,friedberg2020local, given by
Our simulation study is based on the previous work of rieger2010random, with the following characteristics:
We simulate uniformly distributed on and introduce missing values in and , considering 7 different missing-data mechanisms.
For each missing-data mechanism we create 100 training data sets, each one with 200 observations.
In 20% of the data is missing, in the amount is 10%, and in there is 20% again.
We also create a testing data set with 2000 observations without missing values.
This amount of data is to have an appropriate approximation to the mean squared error (MSE)
and the bias
Note that these expressions are conditioned on the training sample and thus they are random variables which take a different value for each one of the 100 training data sets.
A random forest is built for each training data set and each missing-data mechanism as well as for the data sets without missing values (which are used as benchmark).
We use trees, which has been seen by simulation to be sufficient to stabilize the error in the case of the complete data sets.
For the rest of parameters we use the default values in the regression mode of the R package randomForests.
The parameter is set to .
We have sampled without replacement, so is set to .
And nodesize is set to 5.
These parameters are the same for the median-imputation and MIA approaches. For Breiman’s approach, Ishioka’s approach and missForest we initialize the algorithms with the median-imputation and consider the same parameters as before for the construction of the regression trees. The number of iterations is set to 10 and random forests are built with 100 trees in each iteration, corresponding to the default values of the package missForest in R. We now describe the missing-data mechanisms, which are based on those presented by rieger2010random.
4.3 Missing-data mechanisms
Missing at Random
Methods for generating missing values at random are more complicated. The choice of the locations that are replaced by missing values in the “missing” variable now depends on the value of a second variable, called the “determining” variable. Therefore, the values of the “determining” variable now have influence on whether a value in the “missing” variable is missing or not. For the “determining” variable is , while is used as the “determining” variable for and .
Missing Completely at Random (MCAR).
We select as many locations as desired sampled out of the observations and replace them by NA.
Creation of ranks (MAR1).
The probability for a missing value in a certain location in the “missing” variable is computed by dividing the rank of the location in the “determining” variable by . The locations for NA in the “missing” variable are then sampled with the resulting probability vector.
Creation of two groups (MAR2).
We divide the data set in two groups defined by the “determining” variable. A value belongs to the first group if the value in the “determining” variable is greater than or equal to the median of the “determining” variable, otherwise it belongs to the second group. An observation has a missing value with probability of 0.9 for the first group (0.1 for the second group) divided by the number of members in therespective group. The locations for NA in the “missing” variable are then sampled with the resulting probability vector.
Dexter truncation (MAR3).
The observations with the biggest values in the “determining” variable have the “missing” variable replaced by NA until the desired fraction of NA has been achieved.
Symmetric truncation (MAR4).
This method is similar to the previous one but we replace by NA the values in the “missing” variable in the observations with the biggest and the smallest values in the “determining” variable.
Missing depending on (Depy).
The missing values depend on the value of the response, the probability is 0.1 for observations where , otherwise it is 0.4. The locations for NA in the “missing” variable are then sampled with the resulting probability vector.
Missing Not at Random
Logit modelling (LOG).
In this method the probability for NA no longer depends on a single “determining” variable but in all the other variables. It is modeled as
Therefore, the probability of missingness depends on variables with observed values and variables with missing values.
Additionally, we consider the data sets with no missing values as a benchmark, which is denoted as “COMP”.
5.1 Change of missing-data mechanism
Figures 3 and 2 present the average MSE and the average bias over all the training data sets for each of the approaches considered. In green there is listwise deletion, those approaches that implement some imputation in the data set (Breiman’s approach, Ishioka’s approach and missForest) are in blue and those approaches that handle missing values directly in the construction of the trees (MIA and the our proposal) are in red. Listwise deletion (denoted as “NoRows”) generates the largest errors and the estimates with more bias for all the mechanisms. Hence, it could be taken as a bound of the minimum expected performance for a method that attempts to estimate the regression function with missing values. We observe that missForest consistently generates estimators with the lowest errors regardless of the missing-data mechanism. Furthermore, we observe that our proposed method outperforms MIA, Breiman’s approach and Ishioka’s approach and can achieve similar errors as missForest. It is worthy to observe that even a simple approach as imputing with the median can outperforms most of the methods considered or with similar behavior in several missing-data mechanisms (see MAR1, MAR2, MAR3, LOG). In terms of bias, we observe that the algorithms that impute the missing values before the construction of the random forest tend to generate less biased results, while MIA tends to generate the second more biased estimators (with listwise deletion generating the most biased estimations). For the MCAR case, all the methods considered tend to be unbiased and with similar MSE, except for missForest and listwise deletion with the lowest and highest MSE, respectively.111Codes to reproduce our results can be found in: https://github.com/IrvingGomez/RandomForestsSimulations
5.2 Increasing the rate of missingness
Using the 100 training data sets with no missing values, we calculate the importance of the variables with the R package randomForests, by percentage of increase in mean squared error and by increase in node purity breiman2001random, Breiman2003. Figures 5 and 5 show the violin plots for these measurements of importance. We can observe a consistently order for the variables with missing values in both measures of importance, where is considered more important than and . Hence, we decided to change the fraction of missingness in to vary between 5%, 10%, 20%, 40%, 60%, 80%, 90% and 95%, without changing the percentage of missingness in and . In this part of the study we do not consider anymore listwise deletion.
Figure 6 presents the average MSE varying the percentage of missing data and considering the MAR1 mechanism222Analogous figures for the MSE and bias for all the missing-data mechanisms can be found in the supplementary material at https://irvinggomez.com/publication/supplementary_random_forests_simulation/Supplementary_RandomForestsSimulation.pdf. Important differences between the performance of the methods become clear with the increasing percentage of missingness. Especially when this value is over 60% there is an order on the performance of the methods. In this cases missForest, our proposal and MIA represent the methods with less MSE. Moreover, we can see the advantage of searching for the best assignation when the percentage becomes large with our proposal outperforming all the other algorithms. While all the methods present a similar MSE when the percentage of missing values is smaller than 40%.
The differences between the missing-data mechanisms are also reflected when we increase the percentage of missing values. For example, the top three algorithms (missForest, our approach and MIA) present a MSE between 7.95 and 8.19 when the percentage of missingness is 90% and the introduction of missing values is done completely at random (MCAR). On the other hand, for the same percentage of missing values and the same algorithms we observe a deterioration in terms of the MSE which varies between 9.05 and 12.86 when we consider the DEPY scenario. In this case we consistently observe that our approach and missForest outperform the other methods, regardless of the percentage of missing values, while there is no clear advantage for the rest of the algorithms over the others.
Figure 7 presents the average bias varying the percentage of missing data and considering the MAR1 mechanism. When we include the bias in the study, the differences between methods and missing-data mechanisms become more evident. We observe that this missing-data mechanism introduced a bias in the methods. MIA and median-imputation create the most biased estimators, while Breiman’s approach creates the less biased. Considering other missing-data mechanisms, we observe an unbiased behavior for the MCAR and the LOG cases, with some deterioration when the percentage of missing values affects up to 60% of the observations.
5.3 Prediction of new observations with missing entries
A challenge in machine learning is to be able to compute a solution or a prediction to a certain problem when the given information is somewhat incomplete. In our specific context, our proposal do not have to rely on imputations of the missing entries to be able to calculate a prediction for a new data point. The fact that most of the existing techniques use imputation to construct the random forest tend to provide a prediction that is highly dependent on those imputations. In the following, we explain a way to perform the testing phase without having to impute the missing entries of the new data point.
At this stage, we assume that the training phase is finished and that one have access to the random forest with the assignations associated to each tree.
The prediction is computed tree by tree and is averaged over the different trees in the random forest.
A tree prediction is performed looking for a (pseudo-random) final cell in the tree that is the most likely to contain the query point . This is done in a recursive manner by going down in the tree following a procedure that we describe now.
Assume that we are in the cell which has a cut that splits it into and and that the assignation vector associated to that cut is given by . If the direction is observed then, as usual, is assigned to the left if , otherwise it is assigned to the right. Let us now assume that the direction is missing, in this case we need to look at the vector to assign the query point. Once again, let be the number of points with a missing value in the direction let be (resp. ) the number of such points assigned to the left (right) node. At this step, if (when no missing values where observed in the direction in the cell during the training phase) we stop the descent and predict the value of by its mean value in the cell . Otherwise, and we can compute (resp. ) the probability for a missing observation to belong to the left (resp. right) node, given that it belongs to cell .
The next cell is, then, stochastically selected to the left or to the right with respective probabilities and .
Our algorithm keeps track of the assignations at each step so that computing the probabilities and is direct. We can think of the same procedure for the other random forest techniques that do perform a certain kind of assignation. For example, MIA algorithm is the most suited for a comparable treatment for incomplete data points in the testing phase. Nevertheless, the “assignation information” is not accessible in the existing codes for MIA. For the case of the algorithms that do perform imputation of the missing values, it is not clear how an imputation has to be done for the testing phase. As discussed in gomezmendez2020consistency, these algorithms are really dependent on the way the imputation is performed and then no theoretical guaranties are known for their consistency.
In the following, we give some simulations of this testing phase letting the proportion of missing values to vary.
In the training phase:
For each data set and each of the 7 missing-data mechanisms (MCAR, MAR1, MAR2, MAR3, MAR4, LOG y DEPY) we introduce missing values in the variables , and . The proportion of missing values are 20% for , 10% for and 60% for the variable .
The random forests are constructed following the lines of Section 4.2
For the testing phase:
For each missing-data mechanism, we let the percentage of missing values for to vary from 0% to 95% and let the other two proportions of missing values for and unchanged (at 20% and 10%, resp.)
The graphics in Figure 8 represent the MSE on those 2000 observation where 100 random forests are computed for each missing-data mechanism.
6 Discussion and conclusions
We developed a simulation study comparing 7 distinct approaches based on random forests to perform regression with missing entries. The algorithm proposed in this work handle missing values directly in the split criterion considered in the construction of the trees, assigning the each missing value to the node that maximizes the split criterion. Three of these algorithms use the random forests to impute the missing values and create completed data sets. Two of these group of algorithms rely on the computation of the proximity matrix and improve the imputations iteratively. The third algorithm, corresponding to missForest performs imputation through the implementation of random forests where the imputation of the missing values is treated as a regression problem by itself. We also considered MIA which, similar to our proposed algorithm, handle missing values directly in the construction of the trees. Finally, as simple benchmarks we considered the median-imputation and listwise deletion.
With no surprise, listwise deletion was the approach with the worst performance, this method should be avoided unless the percentage of observations with missing values is so low that they can be deleted without a severe harmful. For the rest of the algorithms, we observed differences between distinct techniques. These differences become more evident when the percentage of missing values increases, especially when it is over 60%, while these differences are diluted for small values of this percentage (less than 40%). Moreover, the behavior of the methods appear to be dependent on the missing-data mechanism. Intensive computer algorithms seem to perform particularly well for large percentage of missing values. In particular, we see that simple techniques as median-imputation have similar performance to more complicated algorithms when the percentage of missing values is under 40% which makes its use sufficient in practice for few missing value contexts.
We studied the complexity of the proposal in Section 3.2 and show in Section 3.3 how a bisection technique can be performed to find the best assignation of the missing values. With these simplifications, we proved that the complexity of the algorithm is comparable to the MIA algorithm complexity up to a logarithmic factor, making possible to apply the algorithm for real applications.
Finally, in Section 5.3 we explain a process that allows the prediction of a new observation with missing entries. These procedure uses the assignation of the missing values during the training phase to estimate the probabilities of belonging to each cell. Hence, the new observation can be assigned stochastically to the cells of each tree using these probabilities. In the end, the prediction of the random forests is simply the average of the prediction of each tree. We observed through the simulation study that these technique seems to be robust to the missing-data mechanism and that it could be applied even for data sets with several missing values. Moreover, it can be applied with any other technique that assigns the missing observations, as long as the information of these assigantions is kept.