Log In Sign Up

On the consistency of supervised learning with missing values

In many application settings, the data are plagued with missing features. These hinder data analysis. An abundant literature addresses missing values in an inferential framework, where the aim is to estimate parameters and their variance from incomplete tables. Here, we consider supervised-learning settings where the objective is to best predict a target when missing values appear in both training and test sets. We analyze which missing-values strategies lead to good prediction. We show the consistency of two approaches to estimating the prediction function. The most striking one shows that the widely-used mean imputation prior to learning method is consistent when missing values are not informative. This is in contrast with inferential settings as mean imputation is known to have serious drawbacks in terms of deformation of the joint and marginal distribution of the data. That such a simple approach can be consistent has important consequences in practice. This result holds asymptotically when the learning algorithm is consistent in itself. We contribute additional analysis on decision trees as they can naturally tackle empirical risk minimization with missing values. This is due to their ability to handle the half-discrete nature of variables with missing values. After comparing theoretically and empirically different missing-values strategies in trees, we recommend using the missing incorporated in attributes method as it can handle both non-informative and informative missing values.


The Missing Indicator Method: From Low to High Dimensions

Missing data is common in applied data science, particularly for tabular...

Benchmarking missing-values approaches for predictive models on health databases

BACKGROUND: As databases grow larger, it becomes harder to fully control...

Fairness without Imputation: A Decision Tree Approach for Fair Prediction with Missing Values

We investigate the fairness concerns of training a machine learning mode...

Imputing Missing Values in the Occupational Requirements Survey

The U.S. Bureau of Labor Statistics allows public access to much of the ...

Joints in Random Forests

Decision Trees (DTs) and Random Forests (RFs) are powerful discriminativ...

Conservative Policy Construction Using Variational Autoencoders for Logged Data with Missing Values

In high-stakes applications of data-driven decision making like healthca...

A fuzzified BRAIN algorithm for learning DNF from incomplete data

Aim of this paper is to address the problem of learning Boolean function...

1 Introduction

As volumes of data increase, they are harder to curate and clean. They may come from the aggregation of various sources (e.g. merging multiple databases) and contain variables of different natures (e.g. different sensors). Such heterogeneous data collection can lead to many missing values: samples only come with a fraction of the features observed. Though there is a vast literature on treating missing values, the classical methods have been designed with the aim of estimating some parameters and their variance in the presence of missing values in a single data set. In contrast, only few methods are available in for supervised-learning settings where the aim is to predict a target variable given input variables. Rather than generative models, these settings only require discriminative (or conditional) models. Also, they separate training and testing.

Aside from the aggregation of multiple sources, missing values can also appear for a variety of reasons: for sensor data, missing data can be arise from device failures; on the contrary, informative missing values can be found in poll data for instance where participants may not to answer sensitive questions related to unpopular opinions. In medical studies, some measurements may be impractical on patients in a critical state, in which case the presence of missing values can be related to the variable of interest, target of the prediction (patient status). These various scenarios lead to different missing value mechanisms.

The classical literature on missing data, led by rubin1976inference, defines missing-data mechanisms based on the relationship between missingness and observed values: if they are independent, the mechanism is said to be Missing Completely At Random (MCAR); if the missingness only depends on the observed values, then it is Missing At Random (MAR); otherwise it is Missing Not At Random (MNAR). However, adapting this nomenclature to supervised learning, to cater for variables target of the prediction, has been seldom discussed. In addition, the standard missing-data framework is formulated at the level of dataset, i.e. multiple observations, and not a single one. It leads to several ambiguities (seaman2013meant; 1811.04161; mohan2018graphical), and poses problems for out-of-sample formulations used in supervised learning.

Many methods are available to deal with missing values (josse2018). Listwise deletion, i.e. removing incomplete observations, is may enable to train the model on complete data. Yet it may not suffice for supervised learning, as the test set may also contain incomplete data. Hence the model should handle missing data. A popular solution suitable with any existing learning algorithm is to impute missing values, that is replacing the missing entries with plausible values to get a completed data set. The widespread practice of imputing with the mean of the variable on the observed entries is known to have serious drawbacks (little2002statistical) as it distorts the joint and marginal distributions of the data which induces bias in estimators. Interestingly, the performance of mean imputation has never been really assessed when aiming at predicting an output. In practice, users resort to different strategies such as imputing separately the training and test sets or imputing them jointly. More elaborate strategies rely on using expectation maximization (EM) to fit a model on incomplete data (dempster1977maximum; little2002statistical; little1992regression). However, such a model cannot readily by applied to new incomplete data. In addition, the EM relies on strong parametric assumptions. Alternatively, some learning algorithms, such as those based on decision trees, can be adapted to readily handle missing values.

In this paper, we study the classic tools of missing data in the context of supervised learning. We start in Section 2 by setting the notations and by briefly summarizing the missing-data literature. Our first contribution, detailed in Section 3, is to suggest a formalism for missing data adapted to supervised learning; we write existing methods with this formalism and show how to make predictions on a test set with missing values. Section 4 presents our main contribution which consists in studying the consistency of two approaches to estimate the prediction function with missing values. The first theorem states that, given an optimal predictor for the completely-observed data, a consistent procedure can be built by predicting on a test set where missing entries are replaced by multiple imputation. The second approach, which is the most striking and has important consequences in practice, shows that mean imputation prior to learning is consistent for supervised learning. This is as far as we know the first result justifying this very convenient practice of handling missing values. In Section 5, we then analyze decision trees as they greedy and discrete natures enables adapting them to handle directly missing values. We compare various missing-data methods for trees: surrogate splits, the default in Classification and Regression Tree (CART, DBLP:books/wa/BreimanFOS84), probabilistic splits, the default in C4.5 (quinlan2014c4), block propagation, the method used in xgboost (chen2016xgboost) and lightgbm (ke2017lightgbm), a method called “missing incorporated in attribute” (MIA, Twala:2008:GMC:1352941.1353228) and also conditional trees (Hothorn:2006:JCGS). Theoretical analysis of toy examples justify some empirical results observed in kapelner2015prediction, one of the few papers that studied trees with missing values for supervised learning. We recommend MIA as a way to exploit the missing pattern in the estimation. Finally, Section 6 compares the different tree methods on simulated data with missing values. We also show the benefits of an approach often used in practice consisting in “adding the mask”

, i.e. adding binary variables that code for the missingness of each variables as new covariates.

2 Definitions, problem setting, prior art


Throughout the paper,

letters refer to vectors; CAPITAL letters refer to real-valued or vector-valued random variables, while lower-case letters are realisations. In addition, as usual, for any two variables

and of joint density ,

2.1 Supervised learning

Supervised learning is typically focused on learning to predict a target from inputs , where the pair is considered as random, drawn from a distribution . Formally, the goal is to find a function , that minimizes given a cost function , called the loss (vapnik1999overview). The best possible prediction function is known as the Bayes rule, given by


and its error rate is the Bayes rate (devroye2013probabilistic). A learning procedure is used to create the function from a set of training pairs . is therefore itself a function of , and we will sometimes write it or simply

. There are many different learning procedures, including random forests


or support vector machines

(cortes1995support). A learning procedures that, given an infinite amount of data, yields a function that achieves the Bayes rate is said to be Bayes consistent. In other words, is Bayes consistent if

In a classification setting, is drawn from a finite set of discrete values, and the cost is typically the zero-one loss: . In a regression setting, is drawn from continuous values in and is assumed to satisfy . A common cost is then the square loss, . Considering the zero-one loss or the square loss, the Bayes-optimal function , that minimizes the expected loss, satisfies (see e.g. sec 1.5.5 of bishop2006prml for the square loss in regression, and rosasco2004loss for classification losses).

Note that the learning procedure has access to a finite sample, , and not to the distribution hence, it can only use the empirical risk, , rather than the expected risk. A typical learning procedure is therefore the empirical risk minimization defined as the following optimization problem

A new data set is then needed to estimate the generalization error rate of the resulting function .

2.2 Prior art on missing values

In this section, we review the different missing data mechanisms. We then summarize the main methods to handle missing values: imputation methods and likelihood based ones. Most of this prior art to deal with missing values is based on a single data set with no distinction between training and test set.

2.2.1 Missing data mechanisms

To follow the historical definitions which does not give to the response a particular role, we temporarily consider as part of the input vector , though we assume that has no missing values. rubin1976inference defines three missing data mechanisms and fundamental results for working with likelihood models in the presence of missing data. These are defined by considering the realised data set , as one realisation from a distribution in . The missingness is encoded as an indicator matrix where, for all and , if is observed, and if is missing.

The definitions of rubin1976inference can naturally be adapted to the i.i.d. setting. All rows are assumed to be sampled i.i.d. from a distribution in where, marginally, and . The goal in statistical inference is to estimate the parameter . This is usually done by maximizing the likelihood , which is well defined when the are fully observed. Here, the likelihood is integrated over the missing values, resulting in

(full likelihood)

where denotes the observed values for any realisation of (seaman2013meant), and the Dirac measure. The parameter is generally not considered as of interest. An easier quantity would be

(likelihood of observed data)

ignoring the missing-data mechanism. To leave the difficult term, i.e. the missing values mechanism, out of the expectation, rubin1976inference introduces an ad hoc assumption, called Missing At Random (MAR), which is that for all , for all , for all ,

and states the following result.

Theorem 2.1 (Theorem in rubin1976inference).

Let such that for all , . Assuming (a) MAR, (b) , is proportional to with respect to , so that the inference for can be obtained by maximizing the likelihood which ignores the mechanism.

MAR has a stronger version, more intuitive: Missing Completely At Random (MCAR). In its simplest and strongest form, it states that (the model’s density is ). At the other end of the spectrum, if it is not possible to ignore the mechanism, the corresponding model is called Missing Not At Random (MNAR).

There is little literature on missing data mechanism for supervised learning or discriminative models. kapelner2015prediction formalise the problem by separating the role of the response , factorising the likelihood as . Note that they do not write

. They justify this factorisation with the – somewhat causal – consideration that the missing values are part of the features which precede the response. The need to represent the response variable in the factorization show that it may be useful to extend the traditional mechanisms for a supervised-learning setting: the link between the mechanism and the output variable can have a significant impact on the results.

lecturenote and arel2018can noticed that as long as does not depend on , it is possible to estimate regression coefficients without bias even with listwise deletion and MNAR values. ding2010investigation generalise the MAR assumption with the following nomenclature : the missing mechanism can marginally depend on the target (), on the features that are always observed () or on the features that can be missing ().

2.2.2 Imputation prior to analysis

Most statistical models and machine learning procedures are not designed for incomplete data. It is therefore useful to

impute them to get a completed data set that can be analysed by any procedure, e.g. supervised-learning methods. To impute data, joint modeling

(JM) approaches capture the joint distribution across features


. A simple example of joint-modeling imputation is to assume a Gaussian distribution of the data, to estimate the mean vector and covariance matrix from the incomplete data (using an EM algorithm). Missing entries can then be imputed with their conditional expectation knowing the observed data and the estimated parameters. More powerful methods can be based on low-rank models

(hastie2015matrix; josse2016denoiser)

, or deep learning approaches such as denoising autoencoders (DAEs,

vincent2008extracting; gondara2018mida) and generative adversarial networks (li2017generative; yoon2018gain). Another popular approach to impute data is called fully conditional specification (FCS) also known as imputation with conditional equation (ICE) (vanbuuren2018flexible). It also assumes a joint distribution for the data, but defines it implicitly by the conditional distributions of each variable. This approach is popular because it is flexible and can easily handle variables of a different nature such as ordinal, categorical, numerical, etc. A powerful example of this class is missforest, using on iterative imputation of each variable by random forests (stekhoven2011missforest).

The role of the dependent variable and whether or not to include it in the imputation model has been a rather controversial point. Indeed, it is quite counter-intuitive to include it when the aim is to apply a conditional model on the imputed data set to predict the outcome . Nevertheless, it is recommended as it can provide information for imputing covariates (allison1999missing, p.57). Sterneb2393 illustrated the point for the simple case of a bivariate Gaussian data () with a positive structure of correlation and missing values on . Imputing using only

is not appropriate when the aim is to estimate the parameters of the linear regression model of

given .

One important issue with “single” imputation, i.e. predicting only one for each missing entries, is that it forgets that some values were missing and considers imputed values and observed values in the same way. It leads to underestimation of the variance of the estimated parameters (little2002statistical). One solution, to incorporate the uncertainty of the prediction of values is to use multiple imputation (MI, rubin:1987) where many plausible values are generated for each missing entries, leading to many imputed data sets. Then, MI consists in applying an analysis on each imputed data sets and combining the results. Although many procedures to generate multiple imputed data sets (murray2018) are available, here again, the case of discriminatory models is only rarely considered, with the notable exception of model_select_mice who use a variable selection procedure on each imputed data set and propose to keep the variables selected in all imputed data sets to construct the final model (see also, mirl).

2.2.3 EM algorithm

Imputation leads to two-step methods that are generic in the sense that any analysis can be performed from the same imputed data set. On the contrary, the EM (expectation maximization) algorithm proceeds directly in one step (dempster1977maximum). It can thus be better suited to a specific problem but requires the development of a dedicated algorithm.

The EM algorithm can be used in missing-data settings to compute maximum likelihood estimates from an incomplete data set. Indeed, with the assumptions of Theorem 2.1 (MAR settings), maximizing the observed likelihood gives principle estimation of parameters . The log-likelihood of the observed data is

Starting from an initial parameter , the algorithm alternates the two following steps,


The well-known property of the EM algorithm states that at each step , the observed log-likelihood increases, although there is no guarantee to find the global maximum. In Appendix B.2 we include an example of an EM algorithm to estimate the parameters of a bivariate Gaussian distribution from an incomplete data.

3 Supervised-learning procedures with missing data on train and test set

Supervised learning typically assumes that the data are i.i.d. In particular, an out-of-sample observation is supposed to be drawn from the same distribution as the original sample. Hence it has the same missing-data mechanism. An appropriate method should be able to predict on new data with missing values. Here we discuss how to adapt classic missing-data techniques to machine learning settings, and vice versa.


Following rubin1976inference, mohan2018graphical and yoon2018gain, we define the incomplete feature vector as if , and otherwise. As is a cartesian product, belongs to the space . We have

where is the term-by-term product, with the convention that, for all one-dimensional , As such, when the data are real, can be seen as a mixed categorical and continuous variable, taking values in . The observed training set, which is available for statistical analysis, is then defined as .

3.1 Out-of-sample imputation

Using missing-value imputation in a supervised learning setting is not straightforward as it requires to impute new, out-of-sample, test data, where the target is unavailable.

A simple strategy is to impute the training data with an imputation model (whose parameters are estimated via on the training set). Denoting as the imputed data set, a predictive model can then be learned using and (whose parameters are estimated as on the training set). Finally, on the test set, the covariates must be imputed with the same imputation model (using ) and the dependent variable predicted using the imputed test set and the estimated learning model (using ).

This approach is easy to implement for univariate imputation methods that consider each feature separately, for instance with mean imputation: the mean of each column is learned on the training set, and any new observation can be imputed by . The imputation with joint Gaussian model on – with parameters are learned by the EM algorithm on the training set– is also appropriate as one can impute the test set using the conditional expectations of the missing features given the observed features (and without a ) and the estimated parameters.

For more general imputation methods, two issues hinder out-of-sample imputation. First, many available imputation methods are “black-boxes” that take as an input an incomplete data and output a completed data: they do not separate estimation of model parameters from using them to complete the data. This is the case for many implementations of iterative conditional imputation such as MICE (vanbuuren2018flexible) or missForest (stekhoven2011missforest). It is also difficult for powerful imputers presented in Section 2.2.2 such as low-rank matrix completion, as they cannot be easily marginalised on alone.

As most existing implementations cannot impute a new data set with the same imputation model, some analysts resort to performing separate imputation of the training set and the test set. But the smaller the test set, the more suboptimal this strategy is, and it completely fails in the case where only one observation has to be predicted. Another option is to consider semi-supervised settings, where the test set is available at train time: grouped imputation can then simultaneously impute the train and the test set (kapelner2015prediction), while the predictive model learn is subsequently learned on the training set only.

3.2 EM and out-of-sample prediction

The likelihood framework (Section 2.2.1) enables predicting new observation, though it has not been much discussed. jiang2018logistic

consider a special case of this approach for a logistic regression and by assuming a Gaussian model on the covariates


Let the assumptions of Theorem (2.1) be verified (MAR settings). Model parameters can then be estimated by maximizing the observed log-likelihood with an EM algorithm (Section 2.2.3). The corresponding MLE can be used for out-of-sample prediction with missing values. More precisely, for a fixed missing indicator , we write the observed values and

the missing values. Given this model, the probability of

as a function of the observed values only, can be related to that on a fully-observed data set:


It is then possible to approximate the expectation with Monte Carlo sampling from the distribution . Such a sampling is easy in simple models, e.g. using Schur’s complements for Gaussian distributions in linear regression settings. But in more complex settings, such as logistic regression, there is no explicit solution and one option is Metropolis Hasting Monte Carlo.

3.3 Empirical risk minimisation with missing data

The two previous approaches that we discussed are specifically designed to fix the missing data issue: imputation or specifying a parametric model and maximizing the observed likelihood. However, in supervised learning settings, the goal is rather to build a prediction function that minimizes an expected risk. Empirical risk minimization, the workhorse of machine learning, can be adapted to deal with missing data.

Recall that in missing-data settings, we do not have access to but rather to . Therefore, given a class of functions from to , we aim at minimizing the empirical risk on this class, that is


Unfortunately, the half-discrete nature of , makes the problem difficult. Indeed, many learning algorithms do not work with mixed data types, such as , but rather require a vector space. This is true in particular for gradient-based algorithms. As a result, the optimization problem (3) is hard to solve with typical learning tools.

Another point of view can be adopted for losses which leads to Bayes-optimal solutions such that . As there are at most admissible missing patterns, the Bayes estimate can be rewritten as


where with all the indices such that . This formulation highlights the combinatorial issues: solving (3) may require to estimate different submodels, that is appearing in (4) for each , which grows exponentially with the number of variables.

Modifying existing algorithms or creating new ones to deal with the optimization problem (3) is in general a difficult task due to the numerous possible missing data patterns. We will see in Section 5 that decision trees are particularly well suited to address this problem.

Remark 3.1.

Note that in practice, not all patterns may be possible in the training and test sets. For instance, if there is only complete data in the train set, the only submodel of interest is for , which boils down to the regular supervised learning scenario on a complete data. However, the train and test set are assumed to be drawn from the same data distribution. Hence, we expect to observe similar patterns of missingness in train and test sets. If this is not the case, it corresponds to a distributional shift, and should be tackled with dedicated methods (see, e.g., sugiyama2017dataset). This may happen for instance, when a study conducted on past data led to operational recommendations, making the practitioners measure systematically the variables of interest.

4 Bayes-risk consistency of imputation procedures

Here we show theoretically that simple imputation procedures can lead to Bayes-optimal prediction when the goal is to estimate the function in the presence of missing data on covariates (in both train and test sets), and with no additional assumptions on the data distribution.

We start by studying the error rate of a predictor optimal for the complete data on test data with missing values, using several imputation strategies: unconditional mean, conditional mean and multiple imputation.

We then consider the full problem of tackling missing values in the train and the test set. We study a classical approach, described in Section 3.1, which consists first in imputing the training set, learning on the imputed data, and predicting on a test set which has been imputed with the same method. Although mean imputation of variables is one of the most widely used approaches, it is highly criticised in the classic literature for missing data (little2002statistical). Indeed, it leads to a distortion of the data distribution and consequently statistics calculated on the imputed data table are biased. A simple example is the correlation coefficient between two variables, which is biased towards zero if the missing data are imputed by the mean. However, in a supervised-learning setting the aim is not to compute statistics representative of the data set, but to minimize a prediction risk by estimating a regression function. For this purpose, we show in Section 4.2 that mean imputation may be completely appropriate and leads to consistent estimation of the prediction function. This result is remarkable and extremely useful in practice.

4.1 Test-time imputation

Here we consider that we have an optimal (Bayes-consistent) predictor for the complete data, i.e. , and we show that when there is missing data in the test set, in MAR settings, multiple imputation with can give the optimal prediction, i.e. Bayes consistency for incomplete data. In the case of MCAR values, i.e. where the complete data is a random subsample from the sample, the function can be obtained for instance by “listwise deletion” in the train set: fitting a supervised-learning procedure on data for which the samples with missing data have been removed.

4.1.1 Test-time conditional multiple imputation is consistent

Let us first make explicit the multiple imputation procedure for prediction. For a given vector , we let be the missing indicator and write for observed values and the missing values. We then draw the missing values from their distribution conditional on and compute the regression function on these completed observations. The resulting multiple imputation function is given by:


Note that this expression is similar to the expression Equation 2 given for EM but assuming that we know the true nonparametric distribution of the data.

Proposition 4.1.

Consider the regression model

where takes values in , for all subset , (MAR mechanism) and where is a centred noise. Then the multiple imputation procedure described above is consistent, that is, for all ,

The proof is given in Appendix A.

4.1.2 Simple mean imputation is not consistent

Given the success of multiple imputation, it is worth checking that simple imputation is not sufficient. We show with two simple examples that indeed, single imputation on the test set is not consistent even in MAR setting.

We first show, that (unconditional) mean imputation is not consistent, if the learning algorithm has been trained on the complete cases only.

Example 4.1.

In one dimension, consider the following simple example,

with an independent centered Gaussian noise. Here, , and the regression function satisfies


In the oracle setting where the distribution of is known, ”plugging in” the mean imputation of yields the prediction


In this example, mean imputation is not optimal: when is missing, the prediction obtained by mean imputation is , whereas the optimal prediction (the one which minimizes the square loss) is as seen in (6).

Inspecting (6) and (7) reveals that the poor performance of mean imputation are due to the fact that . The non-linear relation between and breaks mean imputation. This highlights the fact that the imputation method should be chosen in accordance with the learning algorithm that will be applied later on. This is related to the concept of congeniality (meng1994multiple) defined in multiple imputation.

4.1.3 Conditional mean imputation is consistent if there are deterministic relations between input variables

We now consider conditional mean imputation, using information of other observed variables to impute. Conditional mean imputation may work in situations where there is redundancy between variables, as highlighted in Example 4.2. However, we give a simple example below that shows that using it to impute the test may not give the Bayes prediction rate.

Example 4.2.

Consider the following regression problem with two identical input variables:

The Bayes-optimal predictor is then given by

Single imputation with the mean of in this function leads to

whereas, imputing by the mean of its conditional on gives

as .

If there is no deterministic link between variables, conditional mean imputation fails to recover the regression function, in the case where the regression function is not linear (see Example 4.2, where is replaced by ).

4.1.4 Pathological case: missingness is a covariate

Example 4.3 shows a situation in which any imputation methods single or multiple fail, since missingness contains information about the response variable .

Example 4.3.

Consider the following regression model,



Unconditional mean imputation prediction is given by

whereas, the regression function satisfies

In this case, the presence of missing values is informative in itself, and having access to the complete data set (all values of ) does not provide enough information. Such scenario advocates for considering the missingness as an additional input variable. Indeed, in such situation, single and multiple imputation fail to recover the targeted regression function, without adding a missingness pattern to the input variables.

4.2 Mean imputation at train and test time is consistent

We now show that learning on the mean imputed training data, imputing the test set with the means (of the variables on the training data), and predicting is optimal if the missing data are MAR and if the capacity of the learning algorithm is large enough.

Theorem 4.1.

Consider the input vector which has a continuous density on , the response

such that exists, and a missingness pattern on variable such that and such that the function

is continuous. Assume furthermore that is a centered noise independent of . The imputed data is defined as . If the approximation capacity of the learning algorithm is high enough, it will predict, for all ,


the mean imputation prediction is equal to the Bayes function almost surely, that is

The proof is given in Appendix A. Theorem 4.1 confirms that it is preferable to use the same imputation for the train and the test set. Indeed, the learning algorithm can learn the imputed value (here the mean) and use that information to detect that the entry was initially missing. If the imputed value changes from train set to test set (for example, if instead of imputing the test set with the mean of the variables of the train set, one imputes by the mean of the variables on the test set), the learning algorithm will probably perform poorly, since the imputed data distribution would be different between train and test sets.

Note that the precise imputed value does not matter if the learner capacity is high enough. By default, the mean is not a bad choice even if it only preserves the first order statistic (mean) of the sample. Theorem 4.1 remains valid when missing values occur for variables under the assumption that conditional on and if for every pattern , the functions

are continuous.

Almost sure consistency.

Note that the equality between the mean imputation learner and the Bayes function holds almost surely but not for every . Indeed, under the setting of Theorem 4.1, let , for any such that

In this case, and

which is different, in general, from

Therefore, on the event , the two functions and differ. Since is a zero probability event, the equality

does not hold pointwise (as shown above) but hold almost surely (as stated in Theorem 4.1). However, a simple way to obtain the pointwise consistency, i.e. the fact that the imputation function tends to for any , is to impute missing data by values that are out of the range of the true distribution.

5 Decision trees: an example of empirical risk minimization with missing data

Decision trees offer a natural way for empirical risk minimization with missing values. This is in part due to their ability to handle the half-discrete nature of .

We first present the different approaches available to handle missing values in tree-based methods in Sections 5.2 and 5.3. We then compare them theoretically in Section 5.4 and showing the interest of using the “missing incorporated in attribute” approach whether the missing values are MCAR or informative.

5.1 Tree construction with CART

The algorithm CART (Classification And Regression Trees, DBLP:books/wa/BreimanFOS84) is one of the most popular tree algorithm, originally designed for complete data set. Each leaf of the tree defines an interval on each variable . On , the algorithm finds the best split , where a split is defined by the choice of a feature along which the split is performed and the position of the split. Writing the set of all possible splits in the cell , the best split is defined as the solution of the following optimization problem


For each cell , the problem (9) can be rewritten as


where is the set of piecewise-constant functions on and for . Therefore the optimization problem (10) amounts to solving a least square problem on the subclass of functions . Thus, by minimizing the mean squared error, the CART procedure targets the quantity . In the presence of missing values, this criterion must be adapted and several ways to do so have been proposed, as detailed below.

5.2 Splitting criterion discarding missing values

The most popular option is to select the split only on the available cases for each variable:


As the missing values were not used to construct the criterion, it is still necessary to specify to which cell they are sent. Indeed, the solution of discarding missing data at each step would lead to a drastic reduction of the data set. The different methods to propagate missing data down the tree are detailed below.

Surrogate splits

Once the best split is chosen, surrogate splits search for a split on another variable that induces a data partition close to the original one. More precisely, let be the selected split. To send down the tree observations with no th variable, a new stump, i.e., a tree with one cut, is fitted to the response , using variables . The split which minimizes the misclassification error is selected, and observations are split accordingly. Those that lack both variables and are split with the second best, , and so on until the proposed split has a worse misclassification error than the blind rule of sending all remaining missing values to the same daughter, the most populated one. In the predicting phase, the training surrogates are kept. They are the default method in the rpart implementation (therneau1997introduction). Surrogate method is expected to be appropriate when there are relationship between the covariates.

Probabilistic splits

Another possibility is to propagate missing observations according to a Bernoulli distribution

, where (resp. ) is the number of points already on the left (resp. right). This is the default method in the C4.5 algorithm (quinlan2014c4).

Block propagation

The third choice is to send all incomplete observations as a block, to a side chosen by minimizing the error. This is the method in xgboost (chen2016xgboost) and lightgbm (ke2017lightgbm).

Note that Hothorn:2006:JCGS proposed conditional trees, which adapt the criterion (11) to missing values. Indeed, this criterion implies a selection bias: it leads to underselecting the variables with many missing values due to the multiple comparison effects (StrBouAug:2007:CSDA). As a result, it favors variables were many splits are available, and therefore those with fewer missing values. Conditional trees are based on the calculation of a linear statistic of association between and each feature ,

on the observed feature. Then, its distribution under the null hypothesis of independence between

and is estimated by permutation and the variable with the smallest -value is selected. Note that the improvement, illustrated in Appendix B.1, is meant to be on the selection of the variables but does not ensure that it improves the prediction performance. Once the variables have been selected, Hothorn:2006:JCGS use surrogate splits to propagate the missing entries. One potential drawback of this approach is the use of a linear statistic for association.

5.3 Splitting criterion with missing values: MIA

The second important class of methods uses missing values to compute the criterion for each split and thus best split location. Its most common instance is “missing incorporated in attribute” (MIA, Twala:2008:GMC:1352941.1353228). More specifically, MIA selects


where with

  • is the set of all functions piecewise constant on a partition of the form , for any .

  • is the set of all functions piecewise constant on a partition of the form , for any .

  • is the set of all functions piecewise constant on a partition of the form , for any .

This means that the missing values are treated like a category by the algorithm, they are simply distinct from real numbers which is appropriate to handle the space . It is a greedy algorithm to minimize a square loss between and a function of and consequently targets the quantity (4) which separate into terms. However, it is not exhaustive: at each step, the tree can cut for each variable according to missing or non missing and selects this cut when it is relevant, i.e. when it minimizes the prediction error. The final leaves can correspond to a cluster of missing values patterns (observations with missing values on the two first variables for instance and any missing patterns for the other variables). MIA is thought to be a good method to apply when missing pattern is informative, as this procedure allows to cut with respect to missing/ non missing and uses missing data to compute the best splits. Note this latter property implies that the MIA approach does not require a different method to propagate missing data down the tree.

Remark 5.1.

Implicit imputation: Whether it is in the case where the missing values are propagated in the available case method (Section 5.2), or incorporated in the split choice in MIA, missing values are assigned either to the left or the right interval. Consequently, handling missing values in a tree can be seen as implicit imputation by an interval.

5.4 Theoretical comparison of CART versus MIA

We now compare theoretically the positions of the splitting point at the root and the prediction errors on simple examples with MCAR values. Proposition 5.4 computes the splitting position of MIA and CART, and highlights that splitting position of MIA varies even for MCAR missing data. Proposition 5.1 then compares the risk of the different splitting strategies: probabilistic split, block propagation, surrogate split, and MIA. We prove that MIA and surrogate splits are the two best strategies, one of which may be better than the other depending on the dependence structure of covariables. Recall that is the value of the splitting criterion at computed on the complete data only.

propositionpropcrit Let . Consider the regression model

where is the missingness pattern on . Let be the value of the splitting MIA criterion computed at , where stands for the side where missing values are sent. Therefore,

  1. The best split given by the CART criterion (11) is .

  2. The best splits and given by the MIA criterion (12), assuming that all missing values are sent to the left node (resp. to the right node), satisfy



    and .

Figure 1: , the split chosen by the MIA criterion, as a function of the fraction of missing values on , assuming values are sent left.

The proof is given in Appendix A. Proposition 5.4 shows that the split given by optimizing the CART criterion does not depend on the percentage of missing values since the pattern is independent of . A numerical solution to equation (13) is displayed in Figure 1. When, is equal to 0, we retrieve . When increases, the threshold does not correspond anymore to the one calculated using observed values only as it is influenced by the missing entries even in the MCAR setting. Indeed, with MIA the threshold is selected as the one minimizing the prediction error. Hence MIA optimize both the threshold and the side of the split on which it sends all the missing entries such that the prediction error is the smallest. This is important as it allows to propagate a new observation in the test set with missing values. Recall that the quadratic risk of a function is defined as

Proposition 5.1 enables us to compare the risk of a single split performed with the different strategies. It highlights that even in the simple case of MCAR, MIA gives more accurate predictions than block propagation or probabilistic split.

Proposition 5.1.

Consider the regression model