1 Introduction
This article addresses prediction problems where covariate information is missing during model construction and is also missing in future observations for which we are obligated to generate a forecast. Our aim is to innovate a nonparametric statistical learning extension which incorporates missingness into both the training and the forecasting phases. In the spirit of nonparametric learning, we wish to incorporate the missingness in both phases automatically, without the need for prespecified modeling.
We limit our focus to treebased statistical learning, which has demonstrated strong predictive performance and has consequently received considerable attention in recent years. Stateoftheart algorithms include Random Forests
(RF, Breiman, 2001b), stochastic gradient boosting
(Friedman, 2002), and Bayesian Additive and Regression Trees (BART, Chipman et al., 2010), the algorithm of interest in this study. Popular implementations of these methods do not incorporate covariate missingness natively without relying on either imputation or a complete case analysis of observations with no missing information.Previous simulations and real data set applications have indicated that BART is capable of achieving excellent predictive accuracy. Unlike most competing techniques, BART
is composed of a probability model, rather than a procedure that is purely “algorithmic”
(Breiman, 2001a). BARTpresents an alternative approach to statistical learning for those comfortable with the Bayesian framework. This framework provides certain advantages, such as builtin estimates of uncertainty in the form of credible intervals as well as the ability to incorporate prior information on covariates
(Bleich et al., 2013). However, no means for incorporating missing data in BART has been published to date. Our goal here is to develop a principled way of adapting BART’s machinery to incorporate missing data that takes advantage of the Bayesian framework.Our proposed method, henceforth named BARTm, modifies the recursive partitioning scheme during construction of the decision trees to incorporate missing data into splitting rules. By relying on the MetropolisHastings algorithm embedded in BART, our method attempts to send missing data to whichever of the two daughter nodes increases overall model likelihood. Missingness itself also becomes a valid splitting criterion.
During model construction, taking advantage of this modified set of splitting rules does not require imputation, which relies on assumptions that cannot be easily verified. Our approach is equally viable for continuous and nominal covariate data and both selection and pattern mixture models. The latter models do not assume that missing data is necessarily free of information; the data may have gone missing for a reason crucial to the response function and therefore crucial to our forecast. BARTm is able to exploit this relationship when appropriate.
Since missingness is handled natively within the algorithm, BARTm can generate predictions on future data with missing entries as well. Additionally, BART
’s Bayesian framework also naturally provides estimates of uncertainty in the form of credible intervals. The amount of uncertainty increases with the amount of information lost due to missingness; thereby missingness is appropriately incorporated into the standard error of the prediction. Also, our proposed procedure has negligible impact on the runtime during both model construction and prediction phases.
In Sections 2.1  2.3, we provide a framework for statistical learning with missingness with a focus on decision trees. We give a brief overview of the BART algorithm in Section 2.4 and explain the internals of BARTm in Section 3. We then demonstrate BARTm’s predictive performance on generated models in Section 4 as well as real data with a variety of missing data scenarios in Section 5. We conclude with Section 6. BARTm can be found in the R package bartMachine which is available on CRAN (Kapelner and Bleich, 2013).
2 Background
2.1 A Framework for Missing Data in Statistical Learning
Consider covariates , a continuous response ^{2}^{2}2When is binary, we use a similar framework with an appropriate link function encapsulating . and an unknown function where . We denote as the noise in the response unexplained by . The goal of statistical learning is to use the training set, which consists of observations drawn from the population , to produce an estimate, , the best guess of . This function estimate can then be used to generate predictions on future test observations with an unknown response. We denote these future observations as which we assume are likewise drawn from the same population as the training set.
Missingness is one of the scourges of data analysis, plaguing statistical learning by causing missing entries in both the training matrix as well as missing entries in the future records, . In the statistical learning context, the training set is defined by observations which do not exhibit missingness in their response, . Records with missing responses cannot be used to build models for estimation of .^{3}^{3}3“Imputing missing values in the response” for the new is equivalent to “prediction” and is the primary goal of statistical learning. Thus, “missingness” considered in this paper is missingness only in and . We denote missingness in the features of which suffer from missingness as
, binary vectors where 1 indicates missing and 0 indicates present, and covariates that are present with
. The main goal of statistical learning with missingness is to estimate .We now frame missing data models in statistical learning using the canonical framework of selection and patternmixture models (Little, 1993). Conditional on , selection models factor the full data likelihood as
(1) 
where and are parameter vectors and are assumed distinct. The first term on the right hand side reflects that the marginal likelihood for the response is independent of missingness. The second term on the right conventionally conditions on . In the forecasting paradigm, missingness is assumed independent of the response because is often yet to be realized and thus its unknown value should not influence , the missingness of the previously realized covariates.
Conditional on , pattern mixture models partition the full data likelihood as
(2) 
where and are parameter vectors and again assumed distinct. The difference between the above and Equation 1 is the marginal likelihood of the response is now a function of . This means there can be different response models under different patterns of missingness in the covariates.
In both selection and patternmixture paradigms, the term on the right is the missing data mechanism (MDM), which traditionally is the mechanism controlling missingness in the response. In our framework however, the MDM controls missingness only in : the covariates (and parameters ) create missingness within themselves which inevitably needs to be incorporated during model construction and forecasting. Thus, the MDM is conceptually equivalent in both the selection and pattern mixture paradigms.
The conceptual difference between the selection and pattern mixture models in the statistical learning framework can be envisioned as follows. Imagine the full covariates are realized but due to the MDM, is latent and we instead observe and . In the selection paradigm, is realized only from the full covariates via . However, in the patternmixture paradigm, both and intermix to create many collated response models corresponding to different points in space. Thus, under our assumptions, selection models are a subset of pattern mixture models. Note that patternmixture models are chronically underidentified and difficult to explicitly model in practice. We address why our proposed method is wellsuited to handle prediction problems under this framework in Section 3.
We now present Little and Rubin (2002)’s taxonomy of MDM’s which are traditionally framed in the selection model paradigm but here apply to both paradigms: (1) missing completely at random (MCAR), (2) missing at random (MAR) and (3) not missing at random (NMAR). MCAR is a mechanism that generates missingness in covariate without regard to the value of itself nor the values and missingness of any other covariates, denoted . MCAR is exclusively determined by exogenous parameter(s) . The MAR mechanism generates missingness without regard to , its own value, but can depend on values of other attributes as well as . The NMAR mechanism features the additional dependence on the value of itself as well as unobserved covariates.^{4}^{4}4Explicit dependence on unobserved covariates was not explored as MDM’s in this paper. We summarize these mechanisms in Table 1. In our framework, each of the covariates with missingness, denoted as ’s, are assumed to have their own MDM. Thus, the full MDM for the whole covariate space, , can be arbitrarily convoluted, exhibiting combinations of MCAR, MAR and NMAR among its covariates and each MDM relationship may be highly nonlinear with complicated interactions.
MDM  

MCAR  
MAR  
NMAR  (does not simplify) 
We conclude this section by emphasizing that in the nonparametric statistical learning framework where predictive performance is the objective, there is no need for explicit inference of (which may have unknown structure and arbitrary, possibly infinite, dimension). Instead, the algorithm performs “blackbox” estimation of the data generating process such that the output estimates the function. Thus, if we can successfully estimate this conditional expectation function directly, then accurate forecasts can be obtained. This is the approach that BARTm takes.
2.2 Strategies for Incorporating Missing Data
A simple strategy for incorporating missingness into model building is to simply ignore the observations in that contain at least one missing measurement. This is called “listwise deletion” or “complete case analysis.” It is well known that complete case analysis will be unbiased for MCAR and MAR selection models where missingness does not depend on the response when the target of estimation is . However, when forecasting, the data analyst must additionally be guaranteed that has no missing observations, since it is not possible to generate forecasts for these cases.
By properly modeling missingness, incomplete cases can be used and more information about becomes available, potentially yielding higher predictive performance. One popular strategy is to guess or “impute” the missing entries. These guesses are then used to “fill in” the holes in and . The imputed is then used as if it were the real covariate data when constructing and the imputed is then used as if it were the real covariate data during forecasting. To carry out imputation, the recommended strategy is to properly model the predictive distribution and then draws from the model are used to fill in the missing entries. Multiple imputation involves imputing many times and averaging over the results from each imputation (Rubin, 1978). In statistical learning, a prediction could be calculated by averaging the predictions from many ’s built from many imputed ’s and then further averaging over many imputed ’s. In practice, having knowledge of both the missing data mechanism and each probability model is very difficult and has usually given way to nonparametric methods such as nearest neighbors (Troyanskaya et al., 2001) for continuous covariates and saturated multinomial modeling (Schafer, 1997) for categorical covariates. The widely used R package randomForest (Liaw and Wiener, 2002) imputes via “hotdecking” (Little and Rubin, 2002).
A more recent approach, MissForest (Stekhoven and Bühlmann, 2012)
, fits nonparametric imputation models for any combination of continuous and categorical input data, even when the response is unobserved. In this unsupervised procedure (i.e., no response variable needed), initial guesses for the imputed values are made. Then, for each attribute with missingness, the observed values of that attribute are treated as the response and a
RF model is fit using the remaining attributes as predictors. Predictions for the missing values are made via the trained RF and serve as updated imputations. The process proceeds iteratively through each attribute with missingness and then repeats until a stopping criterion is achieved. The authors argue that their procedure intrinsically constitutes multiple imputation due to Random Forest’s averaging over many unpruned decision trees. The authors also state that their method will perform particularly well when “the data include complex interactions or nonlinear relations between variables of unequal scales and different type.” Although no explicit reference is given to Little and Rubin (2002)’s taxonomy in their work, we expect MissForest to perform well in situations generally wellsuited for imputation, namely, the MCAR and MAR selection models discussed in Section 2.1. MissForest would not be suited for NMAR MDMs as imputation values for can only be modeled from in their implementation. Additionally, implementing MissForest would not be recommended for patternmixture scenarios because imputation is insufficient to capture differing response patterns.Since BART is composed primarily of a sumofregressiontrees model, we now review strategies for incorporating missing data in treebased models.
2.3 Missing data in Binary Decision Trees
Binary decision trees are composed of a set of connecting nodes. Internal nodes contain a splitting rule, for example, x < c, where x is the splitting attribute and c is the splitting value. An observation that satisfies the rule is passed to the left daughter node otherwise it is passed to the right daughter node. This partitioning proceeds until an observation reaches a terminal node. Terminal nodes (also known as leaves) do not have splitting rules and instead have leaf values. When an observation “lands” in a terminal node it is assigned the leaf value of the terminal node in which it has landed. In regression trees, this leaf value is a real number, and is the estimate of the response for the given covariates. Thus, regression trees are a nonparametric fitting procedure where the estimate of is a partition of predictor space into various hyperrectangles. Regression trees are wellknown for their ability to approximate complicated response surfaces containing both nonlinearities and interaction effects.
There are many different ways to build decision trees. Many classic approaches rely on a greedy procedure to choose the best splitting rule at each node based on some predetermined criterion. Once the construction of the tree is completed, the tree is then pruned back to prevent overfitting.
Previous efforts to handle missingness in trees include surrogate variable splitting (Therneau and Atkinson, 1997), “Missing Incorporated in Attributes” (MIA, Twala et al., 2008, section 2) and many others (see Ding and Simonoff, 2010 and Twala, 2009). MIA, the particular focus for this work, is a procedure that natively uses missingness when greedily constructing the rules for the decision tree’s internal nodes. We summarize the procedure in Algorithm 1 and we explain how the expanded set of rules is injected into the BART procedure in Section 3.
There are many advantages of the MIA approach. First, MIA has the ability to model complex MAR and NMAR relationships, as evidenced in both Twala et al. (2008) and the results of Sections 4 and 5. Since missingness is integrated into the splitting rules, forecasts can be made without imputing when contains missingness.
Another strong advantage of MIA is the ability to split on feature missingness (line 3 of Algorithm 1). This splitting rule choice allows for the tree to better capture pattern mixture models where missingness directly influences the response model. Generally speaking, imputation ignores pattern mixture models; missingness is only viewed as holes to be filledin and forgotten.
Due to these benefits as well as conceptual simplicity, we chose to implement MIAwithinBART, denoted “BARTm”, when enhancing BART to handle missing data.
2.4 Bart
Bayesian Additive Regression Trees is a combination of many regression trees estimated via a Bayesian model. Imagine the true response function can be approximated by the sum of trees with additive normal and homoskedastic noise:^{5}^{5}5BART can also be adapted for classification problems by using a probit link function and a data augmentation approach relying on latent variables (Albert and Chib, 1993).
(3) 
The notation, , denotes both structure and splitting rules () as well as leaf values (🍂).
BART can be distinguished from other purely algorithmic ensembleoftrees models by its full Bayesian model, consisting of both a set of independent priors and likelihoods. Its posterior distribution is estimated via Gibbs Sampling (Geman and Geman, 1984) with a MetropolisHastings step (Hastings, 1970).
There are three regularizing priors within the BART
model which are designed to prevent overfitting. The first prior, placed on the tree structure is designed to prevent trees from growing too deep, thereby limiting the complexity that can be captured by a single tree. The second prior is placed on the leaf value parameters (the predicted values found in the terminal nodes) and is designed to shrink the leaf values towards the overall center of the response’s empirical distribution. The third prior is placed on the variance of the noise
and is designed to curtail overfitting by introducing noise into the model if it begins to fit too snugly. Our development of BARTmuses the default hyperparameters recommended in the original work
(Chipman et al., 2010). For those who do not favor a pure Bayesian approach, these priors can be thought of as tuning parameters.In addition to the regularization priors, BART imposes an agnostic prior on the splitting rules within the decision tree branches. First, for a given branch, the splitting attribute is uniformly drawn from the set of variables available at the branch. The splitting value is then selected by drawing uniformly from the available values conditional on the splitting attribute . Selecting attributes and values from a uniform discrete distribution represents a digression from the approach used in decision tree algorithms of greedily choosing splits based on some splitting criteria. Extending this prior allows for BART to incorporate MIA, which is discussed in Section 3.
To generate draws from the posterior distribution, each tree is fit iteratively, holding the other
trees constant, by using only the portion of the response left unfitted. To sample trees, changes to the tree structure are proposed then accepted or rejected via a MetropolisHastings step. The tree proposals are equallylikely alterations: growing a leaf by adding two daughter nodes, pruning two twin leaves (rendering their parent node into a leaf), or changing a splitting rule. Following the tree sampling, the posterior for the leaf value parameters are Gibbs sampled. The likelihood of the predictions in each node is assumed to be normal. Therefore, the normalnormal conjugacy yields the canonical posterior normal distribution. After sampling all tree changes and terminal node parameters, the variance parameter
is Gibbs sampled. By model assumption, the likelihood for the errors is normal and the conjugacy with the inversegamma prior yields the canonical posterior inversegamma.Usually around 1,000 MetropoliswithinGibbs iterations are run as “burnin” until converges (by visual inspection). Another 1,000 or so are sampled to obtain “burnedin” draws from the posterior, which define the BART model. Forecasts are then obtained by dropping the observations of down the collection of sampled trees within each burnedin Gibbs sample. A point prediction is generated by summing the posterior leaf values across the trees as in Equation 3. Credible intervals
, which are intervals containing a desired percentage (e.g. 95%) of the posterior probability mass for a Bayesian parameter of interest, can be computed via the desired empirical quantiles over the burnedin samples.
3 Missing Incorporated in Attributes within Bart
Implementing BARTm is straightforward. We previously described the prior on the splitting rules within the decision tree branches as being discrete uniform on the possible splitting attributes and discrete uniform on the possible splitting values. To account for Lines 1 and 2 in the MIA procedure (Algorithm 1), the splitting attribute and split value are proposed as explained in Section 2.4, but now we additionally propose a direction (left or right with equal probability) for records to be sent when the records have with missing values in . A possible splitting rule would therefore be “ c and move left if is missing.” To account for Line 3 in the algorithm, splitting on missingness itself, we create dummy vectors of length for each of the attributes with missingness, denoted , which assume the value 1 when the entry is missing and 0 when the entry is present. We then augment the original training matrix together with these dummies and use the augmented training matrix, , as the training data in the BARTm algorithm. Once again, the prior on splitting rules is the same as the original BART but now with the additional consideration that the direction of missingness is equally likely left or right conditional on the splitting attribute and value. But why should this algorithm yield good predictive performance under the framework discussed in Section 2.1?
We expect BARTm to exhibit greater predictive performance over MIA in classical decision trees for two reasons. First, BARTm’s sumoftrees model offers much greater fitting flexibility of interactions and nonlinearities compared to a single tree; this flexibility will explore models that the data analyst may not have thought of. Additionally, due to the greedy nature of decision trees, once a split is chosen, the direction in which missingness is sent cannot be reversed. BARTm can alter its trees by pruning and regrowing nodes or changing splitting rules. These proposed modifications to the trees are accepted or rejected stochastically using the MetropolisHastings machinery depending on how strongly the proposed move increases the model’s likelihood.
We hypothesize that BARTm’s stochastic search for splitting rules allows observations with missingness to be grouped with observations having similar response values. Due to the MetropolisHastings step, only splitting rules and corresponding groupings that increase overall model likelihood will be accepted. In essence, BARTm is “feeling around” predictor space for a location where the missing data would most increase the overall marginal likelihood. For selection models, since splitting rules can depend on any covariate including the covariate with missing data, it should be possible to generate successful groupings for the missing data under both MAR^{6}^{6}6As a simple example, suppose there are two covariates and and a MAR mechanism where is increasingly likely to go missing for large values of . BARTm can partition this data in two steps to increase overall likelihood: (1) A split on a large value of and then (2) a split on . and NMAR^{7}^{7}7Suppose an NMAR mechanism where is more likely to be missing for large values of . BARTm can select splits of the form “ and is missing” with large. Here, the missing data is appropriately kept with larger values of and overall likelihood should be increased. mechanisms. When missingness does not depend on any other covariates, it should be more difficult to find appropriate ways to partition the missing data,^{8}^{8}8Due to the regularization prior on the depths of the trees in BART coupled with the fact that all missing data must move to the same daughter node, the trees do not grow deep enough to create sufficiently complex partitioning schemes to handle the MCAR mechanism. and we hypothesize that BARTm will be least effective for selection models with MCAR MDMs. Additionally, we hypothesize that BARTm has potential to perform well on pattern mixture models due to the partitioning nature of the regression tree. BARTm can partition the data based on different patterns of missingness by creating splits based on missingness itself. Then, underneath these splits, different submodels for the different patterns can be constructed. If missingness is related to the response, there is a good chance BARTm will find it and exploit it, yielding accurate forecasts.
Another motivation for adapting MIA to BART arises from computational concerns. BART is a computationally intensive algorithm, but its runtime increases negligibly in the number of covariates (see Chipman et al., 2010, Section 6). Hence, BARTm leaves the computational time virtually unchanged with the addition of the new missingness dummy covariates. Another possible strategy would be to develop an iterative imputation procedure using BART similar to that in Stekhoven and Bühlmann (2012) or a model averaging procedure using a multiple imputation framework, but we believe these approaches would be substantially more computationally intensive.
4 Generated Data Simulations
4.1 A Simple Pattern Mixture Model
We begin with an illustration of BARTm’s ability to directly estimate and additionally provide uncertainty intervals. We consider the following nonlinear response surface:
(4)  
where and . Note that the pattern mixture model is induced by missingness in . Under this missingness pattern, the response is offset by , a draw from a normal distribution. Figure 0(a) displays the sample of the response from the model colored by to illustrate the separation of the two response patterns. We choose the following jointly NMAR MDM for and which were chosen to be simple for the sake of ensuring that the illustration is clear. The next section features more realistic mechanisms.
(5)  
If the BARTm model assumptions hold and is successfully able to estimate , then the true is highly likely to be contained within a 95% credible interval for the prediction. We first check to see whether BARTm can capture the correct response when has missing entries but does not. Predicting on should give for the prediction. Figure 0(b) illustrates that BARTm captures the expected value within its 95% credible interval.
Next we explore how well BARTm estimates the conditional expectation when missingness occurs within the new observation . We examine how BARTm handles missingness in attribute by predicting on where the “” denotes missingness. By Equation 5, is missing 30% of the time if
itself is greater than 0. By evaluating the moments of a truncated normal distribution, it follows that
BARTm should guess . Figure 0(c) indicates that BARTm’s credible interval captures this expected value. Note the larger variance of the posterior distribution relative to Figure 0(b) reflecting the higher uncertainty due to going missing. This larger interval is a great benefit of MIA. As the trees are built during the Gibbs sampling, the splitting rules on are accompanied by a protocol for missingness: missing data will flow left or right in the direction that increases model likelihood and this direction is chosen with randomness. Thus, when is predicted with missing, missing records flow left and right over the many burnedin Gibbs samples creating a wider distribution of predicted values, and thus a wider credible interval. This is an important point — BARTm can give a rough estimate of how much information is lost when values in new records become missing by looking at the change in the standard error of a predicted value.^{9}^{9}9Note that if BART’s hyperparameters are considered “tuning parameters,” the credible intervals endpoints are not directly interpretable. However, the relative lengths of the intervals can be trusted whereby they signify different levels of forecast confidence to the practitioner.We next consider how BARTm performs when is missing by predicting on . By Equation 5, BARTm should guess (which follows directly from the properties of the conditional distribution of bivariate normal distribution, recalling that ). When is missing, there is a different response pattern, and the response is shifted up by . Since , BARTm should predict approximately 10.32. The credible interval found in Figure 0(d) indicates that BARTm’s credible interval again covers the conditional expectation.
Finally, we consider the case where and are simultaneously missing. Predicting on has a conditional expectation of . Once again, the posterior draws displayed in Figure 0(e) indicate that BARTm reasonably estimates the conditional expectation. Note that the credible interval here is wider than in Figure 0(d) due to the additional missingness of .
4.2 Selection Model Performance
In order to gauge BARTm’s outofsample predictive performance on selection models and to evaluate the improvement over modelbuilding on complete cases, we construct the same model as Equation 4 withholding the offset (which previously induced the pattern mixture). Thus . We imposed three scenarios illustrating performance under the following missingness mechanisms. The first is MCAR; is missing with probability . The second is MAR; is missing according to a nonlinear probit model depending on the other two covariates:
(6) 
The last is NMAR; goes missing according to a similar nonlinear probit model this time depending on itself and :
(7) 
For each simulation, we set the number of training observations to and simulate 500 times. Additionally, each simulation is carried out with different levels of missing data, approximately percent of rows have at least one missing covariate entry. For the MCAR dataset, the corresponding and for both the MAR and NMAR datasets it was and .
We record results for four different scenarios: (1) and contain missingness (2) contains missingness and is devoid of missing data^{10}^{10}10In this case, is generated without the MDM to maintain a constant number of rows. (3) only complete cases of are used to build the model but contains missingness and (4) only complete cases of are used to build the model and is devoid of missing data.
We make a number of hypotheses about the relationship between the predictive performance of using incomplete cases (all observations) compared to the complete case performance. As we discussed in Section 3, BARTm should be able model the expectation of the marginal likelihood in selection models, thus we expect models built with incomplete cases to predict better than models that were built with only the complete cases. The complete case models suffer from performance degradation for two main reasons (1) these models are built with a smaller sample size and hence their estimate of is higher in bias and variance (2) the lack of missingness during the training phase does not allow the model to learn how to properly model the missingness, resulting in the missing data being filtered randomly from node to node during forecasting. These hypotheses are explored in Figure 2 by comparing the solid blue and solid red lines.
Further, during forecasting, we expect samples with incomplete cases to have worse performance than the full samples (devoid of missingness) simply because missingness is equivalent to information loss. However, for the NMAR model, we expect prediction performance on without missingness to eventually, as the amount of missingness increases, be beaten by the predictive performance on with missingness. Eventually there will be so much missingness in that (1) the trained model on missingness will only be able to create models by using and expect in the future and (2) the trained model on complete cases will never observe the response of the function where went missing. These hypotheses are explored in Figure 2 by comparing the solid lines to the dashed lines within the same color.
The results for the four scenarios under the three MDM’s comport with our hypotheses. The solid red line is uniformly higher than the solid blue line confirming degradation for completecase model forecasting on data with missingness. The dotted lines are lower than their solid counterparts indicating that providing more covariate information yields higher predictive accuracy. The one exception is for NMAR. After the number of rows with missingness is more than 40%, forecasts on completecases only begin to perform worse than the forecast data with missingness for models built with missingness (BARTm).
In conclusion, for this set of simulations, BARTm performs better than BART models that ignore missingness in the training phase. The next section demonstrates BARTm’s performance in a real data set and compares its performance to a nonparametric statistical learning procedure that relies on imputation.
5 Real Data Example
The Boston Housing data (BHD) measures 14 features about housing in the census tracts in Boston in 1970. For model building, the response variable is usually taken to be the median home value. For this set of simulations, we evaluate the performance of three procedures (1) BARTm (2) RF with and imputed via MissForest and (3) BART with and imputed via MissForest.^{11}^{11}11We assume a priori that will have missing data. Thus, the completecase comparisons a la Section 4.2 were not possible. We gauge outofsample predictive performance as measured by the oosRMSE for the three procedures on the simulation scenarios described in Table 2.
Scenario  Description 

Selection Model MCAR  rm, crim, lstat, nox and tax are each missing w.p. 
Selection Model MAR  rm and crim are missing according to the following models: 
Selection Model NMAR  rm and crim are missing according to the following models: 
Pattern Mixture  The MAR selection model above and two offsets: 
(1) if , the response is increased by  
(2) if , the response is decreased by 
Similar to Section 4.2, each simulation is carried out with different levels of missing data, approximately percent of rows have at least one missing covariate entry. For the MCAR scenario, the corresponding , for the MAR scenario and pattern mixture scenario, and is constant at 3 and for the NMAR scenario and is constant at 3. Similar to Section 4.1, we induce a pattern mixture model by creating a normally distributed offset based on missingness (we create two such offsets here). Here, we choose to be 25% of the range in and to be . These values are arbitrarily set for illustration purposes. It is important to note that the performance gap of BARTm versus RF with imputation can be arbitrarily increased by making larger.
For each scenario and each level of missing data, we run 500 simulations. In each simulation, we first draw missingness via the designated scenario found in Table 2. Then, we randomly partition 80% of the 506 BHD observations (now with missingness) as and the remaining 20% as . We build all three models (BARTm, RF with MissForest and BART with MissForest) on , forecast on and record the oosRMSE. Thus, we integrate over idiosyncrasies that could be found in a single draw from the MDM and idiosyncrasies that could be found in a single traintest partition. When using MissForest during training, we impute values for the missing entries in using columnbinded together. To obtain forecasts, we impute the missing values in using rowbinded together then predict using the bottom rows (i.e. those corresponding to the imputed test data). Note that we use MissForest in both RF and BART to ensure that the difference in predictive capabilities of BART and RF are not driving the results.
For the MCAR selection model, we hypothesize that the MissForestbased imputation procedures will outperform BARTm due to the conceptual reasons discussed in Section 3. For the MAR selection model, we hypothesize similar performance between BARTm and both MissForestbased imputation procedures, as both MIA and imputation are designed to perform well in this scenario. In the NMAR selection model and pattern mixture model, we hypothesize that BARTm will outperform both MissForestbased imputation procedures, as MissForest (1) cannot make use of the values in the missingness columns it is trying to impute and (2) cannot construct different submodels based on missingness. Although imputation methods are not designed to handle these scenarios, it is important to run this simulation to ensure that BARTm, which is designed to succeed in these scenarios, has superior outofsample predictive performance.
The results displayed in Figure 3 largely comport with our hypotheses. MissForestbased methods perform better on the MCAR selection model scenario (Figure 2(a)) and BARTm is stronger in the NMAR scenario (Figure 2(c)) and pattern mixture scenario (Figure 2(d)). It is worth noting that in the MAR selection model scenario (Figure 2(b)), BARTm begins to outperform the imputationbased methods once the percentage of missing data becomes greater than 20%. The performance of the imputationbased algorithms degrades rapidly here, while BARTm’s performance remains fairly stable, even with 70% of the rows having at least one missing entry. In conclusion, BARTm will generally perform better than MissForest because it is not “limited” to what can be imputed from the data onhand. This advantage generally grows with the amount of missingness.
6 Discussion
We propose a means of incorporating missing data into statistical learning for prediction problems where missingness may appear during both the training and forecasting phases. Our procedure, BARTm, implements “missing incorporated in attributes” (MIA), a technique recently explored for use in decision trees, into Bayesian Additive Regression Trees, a newly developed treebased statistical learning algorithm for classification and regression. MIA natively incorporates missingness by sending missing observations to one of the two daughter nodes. Due to the Bayesian framework and the MetropolisHastings sampling, missingness is incorporated into splitting rules which are chosen to increase overall model likelihood. This innovation allows missingness itself to be used as a legitimate value within splitting criteria, resulting in no need for imputing in the training or new data and no need to drop incomplete cases.
For the simulations explored in this article, BARTm’s performance was superior to models built using complete cases, especially when missingness appeared in the test data as well. Additionally, BARTm provided higher predictive performance on the MAR selection model relative to MissForest, a nonparametric imputation technique. We also observe promising performance on NMAR selection models and pattern mixture models in simulations. To the best of our knowledge, there is no clear evidence of other techniques that will exhibit uniformly better predictive performance in both selection and pattern mixture missingness models. Additionally, BARTm’s Bayesian nature provides informative credible intervals reflecting uncertainty when the forecasting data has missing covariates.
Although the exploration in this article was focused on regression, we have observed BARTm performing well in binary classification settings. BARTm for both classification and regression is implemented in the R package bartMachine.
Due to MIA’s observed promise, we recommend it as a viable strategy to handle missingness in other treebased statistical learning methods. Future work should also consider exploration of methods that combine imputation with MIA appropriately, in order to enhance predictive performance for MCAR missing data mechanisms.
Supplementary Materials
 Computer Code:

Simulated results, tables, and figures can be replicated via the scripts located at http://github.com/kapelner/bartMachine (the home of the CRAN package bartMachine) in the missing_data_paper folder.
Acknowledgements
We thank Richard Berk, Dana Chandler, Ed George, Dan Heitjan, Shane Jensen, and José Zubizaretta for helpful discussions. We thank the anonymous reviewers for their helpful comments. Adam Kapelner acknowledges support from the National Science Foundation’s Graduate Research Fellowship Program.
References
 Albert and Chib (1993) JH Albert and S Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679, June 1993.
 Bleich et al. (2013) J Bleich, A Kapelner, ST Jensen, and EI George. Using bart for variable selection. ArXiv, 2013.
 Breiman (2001a) L Breiman. Statistical Modeling: The Two Cultures. Statistical Science, 16(3):199–231, 2001a.
 Breiman (2001b) L Breiman. Random forests. Machine learning, pages 5–32, 2001b.
 Breiman et al. (1984) L Breiman, JH Friedman, RA Olshen, and CJ Stone. Classification and regression trees. 1984.
 Chipman et al. (2010) HA Chipman, EI George, and RE McCulloch. BART: Bayesian Additive Regressive Trees. The Annals of Applied Statistics, 4(1):266–298, 2010.
 Ding and Simonoff (2010) Y Ding and JS Simonoff. An investigation of missing data methods for classification trees applied to binary response data. The Journal of Machine Learning Research, 11:131–170, 2010.
 Friedman (2002) JH Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.
 Geman and Geman (1984) S Geman and D Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transaction on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.

Hastings (1970)
WK Hastings.
Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57(1):97–109, 1970.  Kapelner and Bleich (2013) A Kapelner and J Bleich. bartmachine: Machine learning with bayesian additive regression trees. ArXiv eprints, 2013.
 Liaw and Wiener (2002) A Liaw and M Wiener. Classification and regression by randomforest. R News, 2(3):18–22, 2002.
 Little (1993) RJA Little. Patternmixture models for multivariate incomplete data. Journal of the American Statistical Association, 88(421):125–134, 1993.
 Little and Rubin (2002) RJA Little and DB Rubin. Statistical analysis with missing data. Wiley, second edition, 2002.
 Rubin (1978) DB Rubin. Multiple imputations in sample surveysa phenomenological Bayesian approach to nonresponse. Technical report, Proc. of the Survey Research Methods Section, ASA, 1978.
 Schafer (1997) JL Schafer. Analysis of Incomplete Multivariate Data. Chapman & Hall / CRC, 1997.
 Stekhoven and Bühlmann (2012) DJ Stekhoven and P Bühlmann. MissForest–nonparametric missing value imputation for mixedtype data. Bioinformatics, 28:112–8, 2012.
 Therneau and Atkinson (1997) TM Therneau and EJ Atkinson. An introduction to recursive partitioning using the rpart routines. Technical report, Mayo Foundation, 1997.
 Troyanskaya et al. (2001) O Troyanskaya, M Cantor, and G Sherlock. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, 2001.

Twala (2009)
B Twala.
An empirical comparison of techniques for handling incomplete data
using decision trees.
Applied Artificial Intelligence
, 23(5):1–35, 2009.  Twala et al. (2008) B Twala, M Jones, and DJ Hand. Good methods for coping with missing data in decision trees. Pattern Recognition Letters, 29(7):950–956, 2008.