1 Introduction
As volumes of data increase, they are harder to curate and clean. They may come from the aggregation of various sources (e.g. merging multiple databases) and contain variables of different natures (e.g. different sensors). Such heterogeneous data collection can lead to many missing values: samples only come with a fraction of the features observed. Though there is a vast literature on treating missing values, the classical methods have been designed with the aim of estimating some parameters and their variance in the presence of missing values in a single data set. In contrast, only few methods are available in for supervisedlearning settings where the aim is to predict a target variable given input variables. Rather than generative models, these settings only require discriminative (or conditional) models. Also, they separate training and testing.
Aside from the aggregation of multiple sources, missing values can also appear for a variety of reasons: for sensor data, missing data can be arise from device failures; on the contrary, informative missing values can be found in poll data for instance where participants may not to answer sensitive questions related to unpopular opinions. In medical studies, some measurements may be impractical on patients in a critical state, in which case the presence of missing values can be related to the variable of interest, target of the prediction (patient status). These various scenarios lead to different missing value mechanisms.
The classical literature on missing data, led by rubin1976inference, defines missingdata mechanisms based on the relationship between missingness and observed values: if they are independent, the mechanism is said to be Missing Completely At Random (MCAR); if the missingness only depends on the observed values, then it is Missing At Random (MAR); otherwise it is Missing Not At Random (MNAR). However, adapting this nomenclature to supervised learning, to cater for variables target of the prediction, has been seldom discussed. In addition, the standard missingdata framework is formulated at the level of dataset, i.e. multiple observations, and not a single one. It leads to several ambiguities (seaman2013meant; 1811.04161; mohan2018graphical), and poses problems for outofsample formulations used in supervised learning.
Many methods are available to deal with missing values (josse2018). Listwise deletion, i.e. removing incomplete observations, is may enable to train the model on complete data. Yet it may not suffice for supervised learning, as the test set may also contain incomplete data. Hence the model should handle missing data. A popular solution suitable with any existing learning algorithm is to impute missing values, that is replacing the missing entries with plausible values to get a completed data set. The widespread practice of imputing with the mean of the variable on the observed entries is known to have serious drawbacks (little2002statistical) as it distorts the joint and marginal distributions of the data which induces bias in estimators. Interestingly, the performance of mean imputation has never been really assessed when aiming at predicting an output. In practice, users resort to different strategies such as imputing separately the training and test sets or imputing them jointly. More elaborate strategies rely on using expectation maximization (EM) to fit a model on incomplete data (dempster1977maximum; little2002statistical; little1992regression). However, such a model cannot readily by applied to new incomplete data. In addition, the EM relies on strong parametric assumptions. Alternatively, some learning algorithms, such as those based on decision trees, can be adapted to readily handle missing values.
In this paper, we study the classic tools of missing data in the context of supervised learning. We start in Section 2 by setting the notations and by briefly summarizing the missingdata literature. Our first contribution, detailed in Section 3, is to suggest a formalism for missing data adapted to supervised learning; we write existing methods with this formalism and show how to make predictions on a test set with missing values. Section 4 presents our main contribution which consists in studying the consistency of two approaches to estimate the prediction function with missing values. The first theorem states that, given an optimal predictor for the completelyobserved data, a consistent procedure can be built by predicting on a test set where missing entries are replaced by multiple imputation. The second approach, which is the most striking and has important consequences in practice, shows that mean imputation prior to learning is consistent for supervised learning. This is as far as we know the first result justifying this very convenient practice of handling missing values. In Section 5, we then analyze decision trees as they greedy and discrete natures enables adapting them to handle directly missing values. We compare various missingdata methods for trees: surrogate splits, the default in Classification and Regression Tree (CART, DBLP:books/wa/BreimanFOS84), probabilistic splits, the default in C4.5 (quinlan2014c4), block propagation, the method used in xgboost (chen2016xgboost) and lightgbm (ke2017lightgbm), a method called “missing incorporated in attribute” (MIA, Twala:2008:GMC:1352941.1353228) and also conditional trees (Hothorn:2006:JCGS). Theoretical analysis of toy examples justify some empirical results observed in kapelner2015prediction, one of the few papers that studied trees with missing values for supervised learning. We recommend MIA as a way to exploit the missing pattern in the estimation. Finally, Section 6 compares the different tree methods on simulated data with missing values. We also show the benefits of an approach often used in practice consisting in “adding the mask”
, i.e. adding binary variables that code for the missingness of each variables as new covariates.
2 Definitions, problem setting, prior art
Notation
Throughout the paper,
letters refer to vectors; CAPITAL letters refer to realvalued or vectorvalued random variables, while lowercase letters are realisations. In addition, as usual, for any two variables
and of joint density ,2.1 Supervised learning
Supervised learning is typically focused on learning to predict a target from inputs , where the pair is considered as random, drawn from a distribution . Formally, the goal is to find a function , that minimizes given a cost function , called the loss (vapnik1999overview). The best possible prediction function is known as the Bayes rule, given by
(1) 
and its error rate is the Bayes rate (devroye2013probabilistic). A learning procedure is used to create the function from a set of training pairs . is therefore itself a function of , and we will sometimes write it or simply
. There are many different learning procedures, including random forests
(breiman2001random)(cortes1995support). A learning procedures that, given an infinite amount of data, yields a function that achieves the Bayes rate is said to be Bayes consistent. In other words, is Bayes consistent ifIn a classification setting, is drawn from a finite set of discrete values, and the cost is typically the zeroone loss: . In a regression setting, is drawn from continuous values in and is assumed to satisfy . A common cost is then the square loss, . Considering the zeroone loss or the square loss, the Bayesoptimal function , that minimizes the expected loss, satisfies (see e.g. sec 1.5.5 of bishop2006prml for the square loss in regression, and rosasco2004loss for classification losses).
Note that the learning procedure has access to a finite sample, , and not to the distribution hence, it can only use the empirical risk, , rather than the expected risk. A typical learning procedure is therefore the empirical risk minimization defined as the following optimization problem
A new data set is then needed to estimate the generalization error rate of the resulting function .
2.2 Prior art on missing values
In this section, we review the different missing data mechanisms. We then summarize the main methods to handle missing values: imputation methods and likelihood based ones. Most of this prior art to deal with missing values is based on a single data set with no distinction between training and test set.
2.2.1 Missing data mechanisms
To follow the historical definitions which does not give to the response a particular role, we temporarily consider as part of the input vector , though we assume that has no missing values. rubin1976inference defines three missing data mechanisms and fundamental results for working with likelihood models in the presence of missing data. These are defined by considering the realised data set , as one realisation from a distribution in . The missingness is encoded as an indicator matrix where, for all and , if is observed, and if is missing.
The definitions of rubin1976inference can naturally be adapted to the i.i.d. setting. All rows are assumed to be sampled i.i.d. from a distribution in where, marginally, and . The goal in statistical inference is to estimate the parameter . This is usually done by maximizing the likelihood , which is well defined when the are fully observed. Here, the likelihood is integrated over the missing values, resulting in
(full likelihood) 
where denotes the observed values for any realisation of (seaman2013meant), and the Dirac measure. The parameter is generally not considered as of interest. An easier quantity would be
(likelihood of observed data) 
ignoring the missingdata mechanism. To leave the difficult term, i.e. the missing values mechanism, out of the expectation, rubin1976inference introduces an ad hoc assumption, called Missing At Random (MAR), which is that for all , for all , for all ,
and states the following result.
Theorem 2.1 (Theorem in rubin1976inference).
Let such that for all , . Assuming (a) MAR, (b) , is proportional to with respect to , so that the inference for can be obtained by maximizing the likelihood which ignores the mechanism.
MAR has a stronger version, more intuitive: Missing Completely At Random (MCAR). In its simplest and strongest form, it states that (the model’s density is ). At the other end of the spectrum, if it is not possible to ignore the mechanism, the corresponding model is called Missing Not At Random (MNAR).
There is little literature on missing data mechanism for supervised learning or discriminative models. kapelner2015prediction formalise the problem by separating the role of the response , factorising the likelihood as . Note that they do not write
. They justify this factorisation with the – somewhat causal – consideration that the missing values are part of the features which precede the response. The need to represent the response variable in the factorization show that it may be useful to extend the traditional mechanisms for a supervisedlearning setting: the link between the mechanism and the output variable can have a significant impact on the results.
lecturenote and arel2018can noticed that as long as does not depend on , it is possible to estimate regression coefficients without bias even with listwise deletion and MNAR values. ding2010investigation generalise the MAR assumption with the following nomenclature : the missing mechanism can marginally depend on the target (), on the features that are always observed () or on the features that can be missing ().2.2.2 Imputation prior to analysis
Most statistical models and machine learning procedures are not designed for incomplete data. It is therefore useful to
impute them to get a completed data set that can be analysed by any procedure, e.g. supervisedlearning methods. To impute data, joint modeling(JM) approaches capture the joint distribution across features
(little2002statistical). A simple example of jointmodeling imputation is to assume a Gaussian distribution of the data, to estimate the mean vector and covariance matrix from the incomplete data (using an EM algorithm). Missing entries can then be imputed with their conditional expectation knowing the observed data and the estimated parameters. More powerful methods can be based on lowrank models
(hastie2015matrix; josse2016denoiser), or deep learning approaches such as denoising autoencoders (DAEs,
vincent2008extracting; gondara2018mida) and generative adversarial networks (li2017generative; yoon2018gain). Another popular approach to impute data is called fully conditional specification (FCS) also known as imputation with conditional equation (ICE) (vanbuuren2018flexible). It also assumes a joint distribution for the data, but defines it implicitly by the conditional distributions of each variable. This approach is popular because it is flexible and can easily handle variables of a different nature such as ordinal, categorical, numerical, etc. A powerful example of this class is missforest, using on iterative imputation of each variable by random forests (stekhoven2011missforest).The role of the dependent variable and whether or not to include it in the imputation model has been a rather controversial point. Indeed, it is quite counterintuitive to include it when the aim is to apply a conditional model on the imputed data set to predict the outcome . Nevertheless, it is recommended as it can provide information for imputing covariates (allison1999missing, p.57). Sterneb2393 illustrated the point for the simple case of a bivariate Gaussian data () with a positive structure of correlation and missing values on . Imputing using only
is not appropriate when the aim is to estimate the parameters of the linear regression model of
given .One important issue with “single” imputation, i.e. predicting only one for each missing entries, is that it forgets that some values were missing and considers imputed values and observed values in the same way. It leads to underestimation of the variance of the estimated parameters (little2002statistical). One solution, to incorporate the uncertainty of the prediction of values is to use multiple imputation (MI, rubin:1987) where many plausible values are generated for each missing entries, leading to many imputed data sets. Then, MI consists in applying an analysis on each imputed data sets and combining the results. Although many procedures to generate multiple imputed data sets (murray2018) are available, here again, the case of discriminatory models is only rarely considered, with the notable exception of model_select_mice who use a variable selection procedure on each imputed data set and propose to keep the variables selected in all imputed data sets to construct the final model (see also, mirl).
2.2.3 EM algorithm
Imputation leads to twostep methods that are generic in the sense that any analysis can be performed from the same imputed data set. On the contrary, the EM (expectation maximization) algorithm proceeds directly in one step (dempster1977maximum). It can thus be better suited to a specific problem but requires the development of a dedicated algorithm.
The EM algorithm can be used in missingdata settings to compute maximum likelihood estimates from an incomplete data set. Indeed, with the assumptions of Theorem 2.1 (MAR settings), maximizing the observed likelihood gives principle estimation of parameters . The loglikelihood of the observed data is
Starting from an initial parameter , the algorithm alternates the two following steps,
(Estep)  
(Mstep) 
The wellknown property of the EM algorithm states that at each step , the observed loglikelihood increases, although there is no guarantee to find the global maximum. In Appendix B.2 we include an example of an EM algorithm to estimate the parameters of a bivariate Gaussian distribution from an incomplete data.
3 Supervisedlearning procedures with missing data on train and test set
Supervised learning typically assumes that the data are i.i.d. In particular, an outofsample observation is supposed to be drawn from the same distribution as the original sample. Hence it has the same missingdata mechanism. An appropriate method should be able to predict on new data with missing values. Here we discuss how to adapt classic missingdata techniques to machine learning settings, and vice versa.
Notations
Following rubin1976inference, mohan2018graphical and yoon2018gain, we define the incomplete feature vector as if , and otherwise. As is a cartesian product, belongs to the space . We have
where is the termbyterm product, with the convention that, for all onedimensional , As such, when the data are real, can be seen as a mixed categorical and continuous variable, taking values in . The observed training set, which is available for statistical analysis, is then defined as .
3.1 Outofsample imputation
Using missingvalue imputation in a supervised learning setting is not straightforward as it requires to impute new, outofsample, test data, where the target is unavailable.
A simple strategy is to impute the training data with an imputation model (whose parameters are estimated via on the training set). Denoting as the imputed data set, a predictive model can then be learned using and (whose parameters are estimated as on the training set). Finally, on the test set, the covariates must be imputed with the same imputation model (using ) and the dependent variable predicted using the imputed test set and the estimated learning model (using ).
This approach is easy to implement for univariate imputation methods that consider each feature separately, for instance with mean imputation: the mean of each column is learned on the training set, and any new observation can be imputed by . The imputation with joint Gaussian model on – with parameters are learned by the EM algorithm on the training set– is also appropriate as one can impute the test set using the conditional expectations of the missing features given the observed features (and without a ) and the estimated parameters.
For more general imputation methods, two issues hinder outofsample imputation. First, many available imputation methods are “blackboxes” that take as an input an incomplete data and output a completed data: they do not separate estimation of model parameters from using them to complete the data. This is the case for many implementations of iterative conditional imputation such as MICE (vanbuuren2018flexible) or missForest (stekhoven2011missforest). It is also difficult for powerful imputers presented in Section 2.2.2 such as lowrank matrix completion, as they cannot be easily marginalised on alone.
As most existing implementations cannot impute a new data set with the same imputation model, some analysts resort to performing separate imputation of the training set and the test set. But the smaller the test set, the more suboptimal this strategy is, and it completely fails in the case where only one observation has to be predicted. Another option is to consider semisupervised settings, where the test set is available at train time: grouped imputation can then simultaneously impute the train and the test set (kapelner2015prediction), while the predictive model learn is subsequently learned on the training set only.
3.2 EM and outofsample prediction
The likelihood framework (Section 2.2.1) enables predicting new observation, though it has not been much discussed. jiang2018logistic
consider a special case of this approach for a logistic regression and by assuming a Gaussian model on the covariates
.Let the assumptions of Theorem (2.1) be verified (MAR settings). Model parameters can then be estimated by maximizing the observed loglikelihood with an EM algorithm (Section 2.2.3). The corresponding MLE can be used for outofsample prediction with missing values. More precisely, for a fixed missing indicator , we write the observed values and
the missing values. Given this model, the probability of
as a function of the observed values only, can be related to that on a fullyobserved data set:(2) 
It is then possible to approximate the expectation with Monte Carlo sampling from the distribution . Such a sampling is easy in simple models, e.g. using Schur’s complements for Gaussian distributions in linear regression settings. But in more complex settings, such as logistic regression, there is no explicit solution and one option is Metropolis Hasting Monte Carlo.
3.3 Empirical risk minimisation with missing data
The two previous approaches that we discussed are specifically designed to fix the missing data issue: imputation or specifying a parametric model and maximizing the observed likelihood. However, in supervised learning settings, the goal is rather to build a prediction function that minimizes an expected risk. Empirical risk minimization, the workhorse of machine learning, can be adapted to deal with missing data.
Recall that in missingdata settings, we do not have access to but rather to . Therefore, given a class of functions from to , we aim at minimizing the empirical risk on this class, that is
(3) 
Unfortunately, the halfdiscrete nature of , makes the problem difficult. Indeed, many learning algorithms do not work with mixed data types, such as , but rather require a vector space. This is true in particular for gradientbased algorithms. As a result, the optimization problem (3) is hard to solve with typical learning tools.
Another point of view can be adopted for losses which leads to Bayesoptimal solutions such that . As there are at most admissible missing patterns, the Bayes estimate can be rewritten as
(4) 
where with all the indices such that . This formulation highlights the combinatorial issues: solving (3) may require to estimate different submodels, that is appearing in (4) for each , which grows exponentially with the number of variables.
Modifying existing algorithms or creating new ones to deal with the optimization problem (3) is in general a difficult task due to the numerous possible missing data patterns. We will see in Section 5 that decision trees are particularly well suited to address this problem.
Remark 3.1.
Note that in practice, not all patterns may be possible in the training and test sets. For instance, if there is only complete data in the train set, the only submodel of interest is for , which boils down to the regular supervised learning scenario on a complete data. However, the train and test set are assumed to be drawn from the same data distribution. Hence, we expect to observe similar patterns of missingness in train and test sets. If this is not the case, it corresponds to a distributional shift, and should be tackled with dedicated methods (see, e.g., sugiyama2017dataset). This may happen for instance, when a study conducted on past data led to operational recommendations, making the practitioners measure systematically the variables of interest.
4 Bayesrisk consistency of imputation procedures
Here we show theoretically that simple imputation procedures can lead to Bayesoptimal prediction when the goal is to estimate the function in the presence of missing data on covariates (in both train and test sets), and with no additional assumptions on the data distribution.
We start by studying the error rate of a predictor optimal for the complete data on test data with missing values, using several imputation strategies: unconditional mean, conditional mean and multiple imputation.
We then consider the full problem of tackling missing values in the train and the test set. We study a classical approach, described in Section 3.1, which consists first in imputing the training set, learning on the imputed data, and predicting on a test set which has been imputed with the same method. Although mean imputation of variables is one of the most widely used approaches, it is highly criticised in the classic literature for missing data (little2002statistical). Indeed, it leads to a distortion of the data distribution and consequently statistics calculated on the imputed data table are biased. A simple example is the correlation coefficient between two variables, which is biased towards zero if the missing data are imputed by the mean. However, in a supervisedlearning setting the aim is not to compute statistics representative of the data set, but to minimize a prediction risk by estimating a regression function. For this purpose, we show in Section 4.2 that mean imputation may be completely appropriate and leads to consistent estimation of the prediction function. This result is remarkable and extremely useful in practice.
4.1 Testtime imputation
Here we consider that we have an optimal (Bayesconsistent) predictor for the complete data, i.e. , and we show that when there is missing data in the test set, in MAR settings, multiple imputation with can give the optimal prediction, i.e. Bayes consistency for incomplete data. In the case of MCAR values, i.e. where the complete data is a random subsample from the sample, the function can be obtained for instance by “listwise deletion” in the train set: fitting a supervisedlearning procedure on data for which the samples with missing data have been removed.
4.1.1 Testtime conditional multiple imputation is consistent
Let us first make explicit the multiple imputation procedure for prediction. For a given vector , we let be the missing indicator and write for observed values and the missing values. We then draw the missing values from their distribution conditional on and compute the regression function on these completed observations. The resulting multiple imputation function is given by:
(5) 
Note that this expression is similar to the expression Equation 2 given for EM but assuming that we know the true nonparametric distribution of the data.
Proposition 4.1.
Consider the regression model
where takes values in , for all subset , (MAR mechanism) and where is a centred noise. Then the multiple imputation procedure described above is consistent, that is, for all ,
The proof is given in Appendix A.
4.1.2 Simple mean imputation is not consistent
Given the success of multiple imputation, it is worth checking that simple imputation is not sufficient. We show with two simple examples that indeed, single imputation on the test set is not consistent even in MAR setting.
We first show, that (unconditional) mean imputation is not consistent, if the learning algorithm has been trained on the complete cases only.
Example 4.1.
In one dimension, consider the following simple example,
with an independent centered Gaussian noise. Here, , and the regression function satisfies
(6) 
In the oracle setting where the distribution of is known, ”plugging in” the mean imputation of yields the prediction
(7) 
In this example, mean imputation is not optimal: when is missing, the prediction obtained by mean imputation is , whereas the optimal prediction (the one which minimizes the square loss) is as seen in (6).
Inspecting (6) and (7) reveals that the poor performance of mean imputation are due to the fact that . The nonlinear relation between and breaks mean imputation. This highlights the fact that the imputation method should be chosen in accordance with the learning algorithm that will be applied later on. This is related to the concept of congeniality (meng1994multiple) defined in multiple imputation.
4.1.3 Conditional mean imputation is consistent if there are deterministic relations between input variables
We now consider conditional mean imputation, using information of other observed variables to impute. Conditional mean imputation may work in situations where there is redundancy between variables, as highlighted in Example 4.2. However, we give a simple example below that shows that using it to impute the test may not give the Bayes prediction rate.
Example 4.2.
Consider the following regression problem with two identical input variables:
The Bayesoptimal predictor is then given by
Single imputation with the mean of in this function leads to
whereas, imputing by the mean of its conditional on gives
as .
If there is no deterministic link between variables, conditional mean imputation fails to recover the regression function, in the case where the regression function is not linear (see Example 4.2, where is replaced by ).
4.1.4 Pathological case: missingness is a covariate
Example 4.3 shows a situation in which any imputation methods single or multiple fail, since missingness contains information about the response variable .
Example 4.3.
Consider the following regression model,
Here,
(8) 
Unconditional mean imputation prediction is given by
whereas, the regression function satisfies
In this case, the presence of missing values is informative in itself, and having access to the complete data set (all values of ) does not provide enough information. Such scenario advocates for considering the missingness as an additional input variable. Indeed, in such situation, single and multiple imputation fail to recover the targeted regression function, without adding a missingness pattern to the input variables.
4.2 Mean imputation at train and test time is consistent
We now show that learning on the mean imputed training data, imputing the test set with the means (of the variables on the training data), and predicting is optimal if the missing data are MAR and if the capacity of the learning algorithm is large enough.
Theorem 4.1.
Consider the input vector which has a continuous density on , the response
such that exists, and a missingness pattern on variable such that and such that the function
is continuous. Assume furthermore that is a centered noise independent of . The imputed data is defined as . If the approximation capacity of the learning algorithm is high enough, it will predict, for all ,
Letting
the mean imputation prediction is equal to the Bayes function almost surely, that is
The proof is given in Appendix A. Theorem 4.1 confirms that it is preferable to use the same imputation for the train and the test set. Indeed, the learning algorithm can learn the imputed value (here the mean) and use that information to detect that the entry was initially missing. If the imputed value changes from train set to test set (for example, if instead of imputing the test set with the mean of the variables of the train set, one imputes by the mean of the variables on the test set), the learning algorithm will probably perform poorly, since the imputed data distribution would be different between train and test sets.
Note that the precise imputed value does not matter if the learner capacity is high enough. By default, the mean is not a bad choice even if it only preserves the first order statistic (mean) of the sample. Theorem 4.1 remains valid when missing values occur for variables under the assumption that conditional on and if for every pattern , the functions
are continuous.
Almost sure consistency.
Note that the equality between the mean imputation learner and the Bayes function holds almost surely but not for every . Indeed, under the setting of Theorem 4.1, let , for any such that
In this case, and
which is different, in general, from
Therefore, on the event , the two functions and differ. Since is a zero probability event, the equality
does not hold pointwise (as shown above) but hold almost surely (as stated in Theorem 4.1). However, a simple way to obtain the pointwise consistency, i.e. the fact that the imputation function tends to for any , is to impute missing data by values that are out of the range of the true distribution.
5 Decision trees: an example of empirical risk minimization with missing data
Decision trees offer a natural way for empirical risk minimization with missing values. This is in part due to their ability to handle the halfdiscrete nature of .
We first present the different approaches available to handle missing values in treebased methods in Sections 5.2 and 5.3. We then compare them theoretically in Section 5.4 and showing the interest of using the “missing incorporated in attribute” approach whether the missing values are MCAR or informative.
5.1 Tree construction with CART
The algorithm CART (Classification And Regression Trees, DBLP:books/wa/BreimanFOS84) is one of the most popular tree algorithm, originally designed for complete data set. Each leaf of the tree defines an interval on each variable . On , the algorithm finds the best split , where a split is defined by the choice of a feature along which the split is performed and the position of the split. Writing the set of all possible splits in the cell , the best split is defined as the solution of the following optimization problem
(9) 
For each cell , the problem (9) can be rewritten as
(10) 
where is the set of piecewiseconstant functions on and for . Therefore the optimization problem (10) amounts to solving a least square problem on the subclass of functions . Thus, by minimizing the mean squared error, the CART procedure targets the quantity . In the presence of missing values, this criterion must be adapted and several ways to do so have been proposed, as detailed below.
5.2 Splitting criterion discarding missing values
The most popular option is to select the split only on the available cases for each variable:
(11)  
As the missing values were not used to construct the criterion, it is still necessary to specify to which cell they are sent. Indeed, the solution of discarding missing data at each step would lead to a drastic reduction of the data set. The different methods to propagate missing data down the tree are detailed below.
Surrogate splits
Once the best split is chosen, surrogate splits search for a split on another variable that induces a data partition close to the original one. More precisely, let be the selected split. To send down the tree observations with no th variable, a new stump, i.e., a tree with one cut, is fitted to the response , using variables . The split which minimizes the misclassification error is selected, and observations are split accordingly. Those that lack both variables and are split with the second best, , and so on until the proposed split has a worse misclassification error than the blind rule of sending all remaining missing values to the same daughter, the most populated one. In the predicting phase, the training surrogates are kept. They are the default method in the rpart implementation (therneau1997introduction). Surrogate method is expected to be appropriate when there are relationship between the covariates.
Probabilistic splits
Another possibility is to propagate missing observations according to a Bernoulli distribution
, where (resp. ) is the number of points already on the left (resp. right). This is the default method in the C4.5 algorithm (quinlan2014c4).Block propagation
The third choice is to send all incomplete observations as a block, to a side chosen by minimizing the error. This is the method in xgboost (chen2016xgboost) and lightgbm (ke2017lightgbm).
Note that Hothorn:2006:JCGS proposed conditional trees, which adapt the criterion (11) to missing values. Indeed, this criterion implies a selection bias: it leads to underselecting the variables with many missing values due to the multiple comparison effects (StrBouAug:2007:CSDA). As a result, it favors variables were many splits are available, and therefore those with fewer missing values. Conditional trees are based on the calculation of a linear statistic of association between and each feature ,
on the observed feature. Then, its distribution under the null hypothesis of independence between
and is estimated by permutation and the variable with the smallest value is selected. Note that the improvement, illustrated in Appendix B.1, is meant to be on the selection of the variables but does not ensure that it improves the prediction performance. Once the variables have been selected, Hothorn:2006:JCGS use surrogate splits to propagate the missing entries. One potential drawback of this approach is the use of a linear statistic for association.5.3 Splitting criterion with missing values: MIA
The second important class of methods uses missing values to compute the criterion for each split and thus best split location. Its most common instance is “missing incorporated in attribute” (MIA, Twala:2008:GMC:1352941.1353228). More specifically, MIA selects
(12) 
where with

is the set of all functions piecewise constant on a partition of the form , for any .

is the set of all functions piecewise constant on a partition of the form , for any .

is the set of all functions piecewise constant on a partition of the form , for any .
This means that the missing values are treated like a category by the algorithm, they are simply distinct from real numbers which is appropriate to handle the space . It is a greedy algorithm to minimize a square loss between and a function of and consequently targets the quantity (4) which separate into terms. However, it is not exhaustive: at each step, the tree can cut for each variable according to missing or non missing and selects this cut when it is relevant, i.e. when it minimizes the prediction error. The final leaves can correspond to a cluster of missing values patterns (observations with missing values on the two first variables for instance and any missing patterns for the other variables). MIA is thought to be a good method to apply when missing pattern is informative, as this procedure allows to cut with respect to missing/ non missing and uses missing data to compute the best splits. Note this latter property implies that the MIA approach does not require a different method to propagate missing data down the tree.
Remark 5.1.
Implicit imputation: Whether it is in the case where the missing values are propagated in the available case method (Section 5.2), or incorporated in the split choice in MIA, missing values are assigned either to the left or the right interval. Consequently, handling missing values in a tree can be seen as implicit imputation by an interval.
5.4 Theoretical comparison of CART versus MIA
We now compare theoretically the positions of the splitting point at the root and the prediction errors on simple examples with MCAR values. Proposition 5.4 computes the splitting position of MIA and CART, and highlights that splitting position of MIA varies even for MCAR missing data. Proposition 5.1 then compares the risk of the different splitting strategies: probabilistic split, block propagation, surrogate split, and MIA. We prove that MIA and surrogate splits are the two best strategies, one of which may be better than the other depending on the dependence structure of covariables. Recall that is the value of the splitting criterion at computed on the complete data only.
propositionpropcrit Let . Consider the regression model
where is the missingness pattern on . Let be the value of the splitting MIA criterion computed at , where stands for the side where missing values are sent. Therefore,
The proof is given in Appendix A. Proposition 5.4 shows that the split given by optimizing the CART criterion does not depend on the percentage of missing values since the pattern is independent of . A numerical solution to equation (13) is displayed in Figure 1. When, is equal to 0, we retrieve . When increases, the threshold does not correspond anymore to the one calculated using observed values only as it is influenced by the missing entries even in the MCAR setting. Indeed, with MIA the threshold is selected as the one minimizing the prediction error. Hence MIA optimize both the threshold and the side of the split on which it sends all the missing entries such that the prediction error is the smallest. This is important as it allows to propagate a new observation in the test set with missing values. Recall that the quadratic risk of a function is defined as
Proposition 5.1 enables us to compare the risk of a single split performed with the different strategies. It highlights that even in the simple case of MCAR, MIA gives more accurate predictions than block propagation or probabilistic split.
Proposition 5.1.
Consider the regression model