1 Introduction
Decision trees for classification and regression tasks have a long history in ML and AI (Quinlan, 1986; Breiman et al., 1984). Despite the remarkable successes of deep learning and the enormous attention it attracts, trees and forests are still the preferred off-the-shelf model when in need of robust and interpretable learning on scarce data that is possibly heterogeneous (mixed continuous-discrete) in nature and features missing values (Chen and Guestrin, 2016; Devos et al., 2019; Prokhorenkova et al., 2018). In this work we specifically focus on the last property, noting that while trees are widely regarded as flawlessly handling missing values, there is no unique way to properly deal with missingness in trees when it comes to tree induction from data (learning time) or reasoning about partial configurations of the world (deployment time).
Numerous strategies and approaches have been explored in the literature in this regard (Saar-Tsechansky and Provost, 2007; Gavankar and Sawarkar, 2015; Twala et al., 2008). However, most of these are heuristic in nature (Twala et al., 2008), tailored towards some specific tree induction algorithm (Chen and Guestrin, 2016; Prokhorenkova et al., 2018), or make strong distributional assumptions about the data, such as the feature distribution factorizing completely (e.g., mean or median imputation (Rubin, 1976)) or according to the tree structure (Quinlan, 1993). Although many works have compared the most prominent strategies in empirical studies (Batista and Monard, 2003; Saar-Tsechansky and Provost, 2007), there is no clear winner, and ultimately the adoption of a particular strategy in practice boils down to its availability in the ML libraries employed.

In this work, we tackle handling missing data in trees at both learning and deployment time from a principled probabilistic perspective. We propose to decouple tree induction from learning the joint feature distribution, and we leverage tractable density estimators to flexibly and accurately model it. Then we exploit tractable marginal inference to efficiently compute the expected predictions (Khosravi et al., 2019)
of tree models. In essence, the expected prediction of a tree given a sample with missing values can be thought of as implicitly imputing all possible completions at once, reweighting each complete sample by its probability. In this way, we can improve the performance of already learned trees, e.g., by XGBoost (Chen and Guestrin, 2016), by making their predictions robust under missing data at deployment time. Moreover, we show how expected predictions can also be leveraged to deal with missing data at learning time by efficiently training trees over the expected version of commonly used training losses (e.g., MSE). As our preliminary experiments suggest, this probabilistic perspective delivers better performance than common imputation schemes or the default ways of dealing with missing data in popular decision tree implementations. Lastly, this opens several interesting research avenues, such as devising probabilistically principled tree structure induction and robustness against different kinds of missingness mechanisms.
2 Background
We use uppercase letters ($X$) for random variables (RVs) and lowercase letters ($x$) for their assignments. Analogously, we denote sets of RVs in bold uppercase ($\mathbf{X}$) and their assignments in bold lowercase ($\mathbf{x}$). We denote the set of all possible assignments to $\mathbf{X}$ as $\mathsf{val}(\mathbf{X})$. We denote a partial assignment to RVs $\mathbf{X}^o \subseteq \mathbf{X}$ as $\mathbf{x}^o$ and a possible completion to it as $\mathbf{x}^c$, that is, an assignment to RVs $\mathbf{X}^c = \mathbf{X} \setminus \mathbf{X}^o$.

Decision trees Given a set of input RVs $\mathbf{X}$ (features) and an RV $Y$ (target) having values in $\mathsf{val}(Y)$, a decision tree is a parameterized mapping $f_{\mathcal{T},\boldsymbol{\theta}} \colon \mathsf{val}(\mathbf{X}) \to \mathsf{val}(Y)$ characterized by a pair $(\mathcal{T}, \boldsymbol{\theta})$, where $\mathcal{T}$ is a rooted tree structure and $\boldsymbol{\theta}$ is a set of parameters attached to the leaves of $\mathcal{T}$. Every non-leaf node $n$ in $\mathcal{T}$, also called a decision node, is labeled by an RV $X_n \in \mathbf{X}$. For a decision node $n$, the set of outgoing edges $\mathcal{E}_n$ partitions $\mathsf{val}(X_n)$, the set of values of RV $X_n$, into a set of disjoint sets $\{\mathcal{V}_e\}_{e \in \mathcal{E}_n}$, and defines a set of corresponding decision tests $\{\mathbb{1}[x_n \in \mathcal{V}_e]\}_{e \in \mathcal{E}_n}$. A decision path $p_\ell$ is a collection of adjacent edges from the root of $\mathcal{T}$ to leaf $\ell$. Given the above, the mapping encoded in a decision tree can be written as:
$$f_{\mathcal{T},\boldsymbol{\theta}}(\mathbf{x}) = \sum_{\ell \in \mathcal{L}_{\mathcal{T}}} \mathbb{1}[\mathbf{x} \models p_\ell]\, \theta_\ell \qquad (1)$$
where $\mathcal{L}_{\mathcal{T}}$ denotes the set of leaves of $\mathcal{T}$ and $\mathbb{1}[\mathbf{x} \models p_\ell]$ is an indicator function that is equal to 1 if $\mathbf{x}$ "reaches" leaf $\ell$ and 0 otherwise; formally, $\mathbb{1}[\mathbf{x} \models p_\ell] = \prod_{(n,e) \in p_\ell} \mathbb{1}[x_n \in \mathcal{V}_e]$, where $x_n$ is the assignment for RV $X_n$ in $\mathbf{x}$. The parameters attached to the leaves in $\mathcal{T}$ here represent a hard prediction, i.e., $\theta_\ell = y_\ell$ for some value $y_\ell \in \mathsf{val}(Y)$ associated to leaf $\ell$. Our derivations will also hold when $f_{\mathcal{T},\boldsymbol{\theta}}$ encodes a soft predictor, e.g., for $k$-class classification, $\mathsf{val}(Y) = \{y_1, \ldots, y_k\}$. In that case, we consider a parameter vector $\boldsymbol{\theta}_\ell = (\theta_\ell^{1}, \ldots, \theta_\ell^{k})$ for each leaf $\ell$ comprising conditional probabilities $\theta_\ell^{j} = p(y_j \mid \mathbf{x} \models p_\ell)$ for $j = 1, \ldots, k$.

In the following, we will assume RVs to be discrete. This is to simplify notation and does not hinder generality: our derivations can easily be extended to mixed discrete and continuous RVs by replacing summations with integrations when needed. Note that in the discrete case, a decision node labeled by RV $X_n$ having $k$ different states, i.e., $\mathsf{val}(X_n) = \{v_1, \ldots, v_k\}$, will define $k$ decision tests, one per assignment: the test for the edge associated to value $v_i$ will be $\mathbb{1}[x_n = v_i]$ for $i = 1, \ldots, k$.
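To make Equation 1 concrete, here is a minimal Python sketch with an illustrative encoding of our own (not from any library): a tree is a list of leaves, each pairing a path — the per-feature value sets its decision tests accept — with a leaf parameter.

```python
def tree_predict(leaves, x):
    """Evaluate Equation 1: f(x) = sum over leaves of 1[x reaches leaf] * theta.

    Each leaf is a (path, theta) pair, where `path` maps a feature name to the
    set of values V_e accepted by the decision tests along the path.
    """
    total = 0.0
    for path, theta in leaves:
        # Indicator 1[x |= p_l]: product of per-edge tests 1[x_n in V_e].
        if all(x[var] in values for var, values in path.items()):
            total += theta
    return total

# Toy tree over binary features A and B:
# A=0 -> 1.0 ; A=1, B=0 -> 2.0 ; A=1, B=1 -> 3.0
leaves = [
    ({"A": {0}}, 1.0),
    ({"A": {1}, "B": {0}}, 2.0),
    ({"A": {1}, "B": {1}}, 3.0),
]
```

On a complete sample exactly one indicator fires, so the sum simply returns that leaf's parameter.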
Decision forests Single tree models are aggregated in ensembles called forests (Breiman, 1996). One of the most common ways to build a forest of trees is to combine them in a weighted additive ensemble of the form
$$F(\mathbf{x}) = \sum_{i=1}^{m} \omega_i\, f_{\mathcal{T}_i,\boldsymbol{\theta}_i}(\mathbf{x}) \qquad (2)$$
This is the case for ensembling techniques like bagging (Breiman, 1996), random forests (Breiman, 2001), and gradient boosting (Friedman, 2001).

Decision trees for missing data Several ways have been explored to deal with missing values for decision trees, both at training and inference (test) time (Saar-Tsechansky and Provost, 2007). One of the most common approaches goes under the name of predictive value imputation (PVI) and resorts to replacing missing values before performing inference or tree induction. Among the simplest treatments of missing values in PVI, mean, median, and mode imputation are practical and cheap common techniques (Rubin, 1976; Breiman, 2001); however, they make strong distributional assumptions, such as total independence of the feature RVs. More sophisticated (and expensive) PVI techniques cast imputation as prediction from observed features. Among these are multiple imputation with chained equations (MICE) (Buuren and Groothuis-Oudshoorn, 2010) and the use of surrogate splits as popularized by CART (Breiman et al., 1984).
Somewhat analogous to PVI methods, the missing value treatment of XGBoost (Chen and Guestrin, 2016) learns to predict which branch to take for a missing feature at inference time by improving some gain criterion on the observed data for that feature. While this approach has proven successful in many real-world scenarios with missing data, it requires data to be missing at learning time, and it may overfit to the observed missingness pattern.
Unlike the above imputation schemes, the approach introduced in C4.5 (Quinlan, 1993) replaces imputation with reweighting: the prediction associated to an instance is reweighted by the product of the probabilities of the missing RVs in it. While C4.5 is more distribution aware, the probabilities acting as weights are only empirical estimates from the training data, and reweighting is limited to the missing attributes appearing in a path. This assumes that the true distribution over $\mathbf{X}$ factorizes exactly as the tree structure, which is hardly the case since the tree structure is induced to minimize some predictive loss over $Y$.
Several empirical studies have shown that there is no clear winner among the aforementioned approaches under different distributional and missingness assumptions (Batista and Monard, 2003; Saar-Tsechansky and Provost, 2007). In practice, the adoption of a particular strategy depends on the specific tree learning or inference algorithm selected, and on the availability of its implemented routines. In the next section we introduce a principled, probabilistic, and tree-agnostic way of treating missing values at deployment time, and extend it to deal with missingness at learning time in Section 4.
3 Expected Predictions of Decision Trees
From a probabilistic perspective, we would like a missing value treatment to be aware of the full distribution over RVs $\mathbf{X}$ without committing to restrictive distributional assumptions. If we have access to the joint distribution $p(\mathbf{X})$, then clearly the best way to deal with missing values at inference time would be to impute all possible completions at once, weighting them by their probabilities according to $p$, thereby generalizing both the C4.5 and PVI treatments. This is what the expected prediction estimator delivers.

Definition 1 (Expected prediction).
Given a predictive model $f$, a distribution $p(\mathbf{X})$ over features $\mathbf{X}$ and a partial assignment $\mathbf{x}^o$ for RVs $\mathbf{X}^o \subseteq \mathbf{X}$, the expected prediction of $f$ w.r.t. $p$ is:
$$\mathbb{E}_{\mathbf{x}^c \sim p(\mathbf{X}^c \mid \mathbf{x}^o)}\left[f(\mathbf{x}^o, \mathbf{x}^c)\right] = \sum_{\mathbf{x}^c \in \mathsf{val}(\mathbf{X}^c)} f(\mathbf{x}^o, \mathbf{x}^c)\, p(\mathbf{x}^c \mid \mathbf{x}^o) \qquad (3)$$
where $\mathbf{X}^c = \mathbf{X} \setminus \mathbf{X}^o$ and $(\mathbf{x}^o, \mathbf{x}^c) \in \mathsf{val}(\mathbf{X})$.
Computing expected predictions is theoretically appealing also because the delivered estimator is consistent under both the MCAR and MAR missingness mechanisms, if $f$ has been trained on complete data and is Bayes optimal (Josse et al., 2019). As one would expect, however, computing Equation 3 exactly for arbitrary pairs of $f$ and $p$ is NP-hard (Khosravi et al., 2019). Recently, Khosravi et al. (2019) identified a class of expressive density estimators and accurate predictive models that allows for poly-time computation of the expected predictions of the latter w.r.t. the former. Specifically, probabilistic circuits (PCs) (Choi et al., 2020) with certain structural restrictions can be used as tractable density estimators to compute the expected predictions exactly for regression and to approximate them for classification, from simple models such as linear and logistic regression to their generalization as circuits (Liang and Van den Broeck, 2019). Here we extend those results to compute expected predictions for both classification and regression trees exactly and efficiently, under milder distributional assumptions for $p$.

Proposition 1 (Expected predictions for decision trees).
Given a decision tree encoding $f_{\mathcal{T},\boldsymbol{\theta}}$, a distribution $p(\mathbf{X})$, and a partial assignment $\mathbf{x}^o$, the expected prediction of $f_{\mathcal{T},\boldsymbol{\theta}}$ w.r.t. $p$ can be computed as follows:
$$\mathbb{E}_{\mathbf{x}^c \sim p(\mathbf{X}^c \mid \mathbf{x}^o)}\left[f_{\mathcal{T},\boldsymbol{\theta}}(\mathbf{x}^o, \mathbf{x}^c)\right] = \sum_{\ell \in \mathcal{L}_{\mathcal{T}}} \theta_\ell\, p(\mathbf{x}_\ell \mid \mathbf{x}^o) \qquad (4)$$
where $\mathbf{X}_\ell$ is the set of RVs appearing in path $p_\ell$ and $\mathbf{x}_\ell$ is the assignment to the RVs in $\mathbf{X}_\ell$ that evaluates $\mathbb{1}[\mathbf{x} \models p_\ell]$ to 1.
We refer to Appendix A for detailed derivations. Note that we can readily extend Equation 4 to forests of trees (cf. Equation 2) by linearity of expectations.
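To build intuition for Equation 4, the following sketch computes the expected prediction of a toy tree under a fully factorized distribution. The factorized $p(\mathbf{X})$ is only a stand-in chosen because its marginals are trivial to compute; the paper employs probabilistic circuits instead, and all names here are illustrative.

```python
def expected_prediction(leaves, marginals, x_obs):
    """Equation 4: E[f | x_obs] = sum over leaves of theta_l * p(x_l | x_obs).

    `leaves` is a list of (path, theta) pairs, where `path` maps a feature to
    the set of values its decision tests accept; `marginals[var][v]` is the
    marginal probability of value v (a fully factorized stand-in for a PC).
    """
    total = 0.0
    for path, theta in leaves:
        prob = 1.0
        for var, values in path.items():
            if var in x_obs:
                # Observed feature: the decision test is a hard 0/1 check.
                prob *= 1.0 if x_obs[var] in values else 0.0
            else:
                # Missing feature: marginalize over the accepted values.
                prob *= sum(marginals[var][v] for v in values)
        total += theta * prob
    return total

# Toy tree: A=0 -> 1.0 ; A=1, B=0 -> 2.0 ; A=1, B=1 -> 3.0
leaves = [({"A": {0}}, 1.0), ({"A": {1}, "B": {0}}, 2.0), ({"A": {1}, "B": {1}}, 3.0)]
marginals = {"A": {0: 0.5, 1: 0.5}, "B": {0: 0.25, 1: 0.75}}
```

With A observed as 1 and B missing, the estimate is 2·0.25 + 3·0.75 = 2.75: every completion is implicitly imputed at once and reweighted by its probability.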
As the proposition suggests, we can tractably compute the exact expected predictions of a decision tree if the number of its leaves is polynomial in the input size and we can compute $p(\mathbf{x}_\ell \mid \mathbf{x}^o)$ in poly-time for each leaf $\ell$. The first condition generally holds in practice, as trees are kept low-depth to avoid overfitting, especially in forests, while the second one can easily be satisfied by employing a probabilistic model guaranteeing tractable marginalization, as we need to marginalize over the RVs not in $\mathbf{X}^o$. Among suitable candidates are Gaussian distributions and their mixtures for continuous data, and smooth and decomposable PCs (Choi et al., 2020), which are deep versions of classical mixture models. We employ PCs in our experiments as they are potentially more expressive than shallow mixtures and can seamlessly model mixed discrete-continuous distributions.

Figure 1: Average test RMSE (y-axis, the lower, the better) on the Insurance data for different percentages of missing values (x-axis) when missingness is only at deployment time for a forest of 5 trees (left) or both at learning and deployment time for a single tree learned with XGBoost (right). Each experiment setting is repeated 10 times; we report the average error and its standard deviation.
4 Expected Parameter Learning of Trees
Expected predictions provide a principled way to deal with missing values at inference time. In the following, we extend them to learn the parameters of a predictive model from incomplete data so as to minimize the expectation of a certain loss w.r.t. a generative model at hand. We call this learning scenario expected loss minimization.
Definition 2 (Expected loss minimization).
Given a dataset $\mathcal{D}$ over RVs $\mathbf{X} \cup \{Y\}$ containing missing values for RVs $\mathbf{X}$, a density estimator $p(\mathbf{X})$ trained on $\mathcal{D}$ by maximum likelihood, and a per-sample loss function $\mathcal{L}(f(\mathbf{x}), y)$, we want to find the set of parameters $\boldsymbol{\theta}$ of the predictive model $f_{\mathcal{T},\boldsymbol{\theta}}$ that minimizes the expected loss defined as follows:

$$\mathcal{L}_{\mathbb{E}}(\boldsymbol{\theta}) = \sum_{(\mathbf{x}^o, y) \in \mathcal{D}} \mathbb{E}_{\mathbf{x}^c \sim p(\mathbf{X}^c \mid \mathbf{x}^o)}\left[\mathcal{L}\left(f_{\mathcal{T},\boldsymbol{\theta}}(\mathbf{x}^o, \mathbf{x}^c),\, y\right)\right]$$

Again, we harness the ability of the density estimator $p$ to accurately model the distribution over RVs $\mathbf{X}$ and to minimize the loss over $\mathcal{D}$ as if it were trained on all possible completions for each partial configuration $\mathbf{x}^o$.
For commonly used per-sample losses, the optimal set of parameters for single decision trees can be efficiently and independently computed in closed form. This is for instance the case for the $\ell_2$ loss, also known as mean squared error (MSE), defined as $\mathcal{L}(f(\mathbf{x}), y) = (f(\mathbf{x}) - y)^2$, which we will use in our experiments.
Proposition 2 (Expected parameters of MSE loss).
Given a decision tree structure $\mathcal{T}$ and a training set $\mathcal{D}$, the set of parameters $\boldsymbol{\theta}$ that minimizes $\mathcal{L}_{\mathbb{E}}$, the expected prediction loss for MSE, can be found by

$$\theta_\ell = \frac{\sum_{(\mathbf{x}^o, y) \in \mathcal{D}} y\, p(\mathbf{x}_\ell \mid \mathbf{x}^o)}{\sum_{(\mathbf{x}^o, y) \in \mathcal{D}} p(\mathbf{x}_\ell \mid \mathbf{x}^o)}$$

for each leaf $\ell \in \mathcal{L}_{\mathcal{T}}$.
The above equation for optimal leaf parameters can be extended to forests of trees where each tree is learned independently, e.g., via bagging. For other scenarios involving forests such as boosting refer to Appendix B.
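The closed-form fit of Proposition 2 can be sketched in a few lines, again using a fully factorized $p(\mathbf{X})$ as a stand-in for the tractable density estimator (all names illustrative): each leaf parameter becomes a probability-weighted average of the targets of the samples that may reach it.

```python
def fit_leaf_params(paths, marginals, data, eps=1e-12):
    """Sketch of Proposition 2: theta_l = sum_i y_i p(x_l|x_i) / sum_i p(x_l|x_i).

    `paths` lists the per-leaf feature constraints, `data` holds (x_obs, y)
    pairs with possibly missing features, and `marginals` is a fully
    factorized stand-in for the tractable density estimator p(X).
    """
    def path_prob(path, x_obs):
        # p(x_l | x_obs): hard tests on observed features, marginals otherwise.
        prob = 1.0
        for var, values in path.items():
            if var in x_obs:
                prob *= 1.0 if x_obs[var] in values else 0.0
            else:
                prob *= sum(marginals[var][v] for v in values)
        return prob

    thetas = []
    for path in paths:
        den = sum(path_prob(path, x_obs) for x_obs, _ in data)
        num = sum(y * path_prob(path, x_obs) for x_obs, y in data)
        thetas.append(num / (den + eps))  # eps guards leaves no sample can reach
    return thetas
```

A sample with the leaf's split feature missing contributes fractionally to that leaf, in proportion to the probability mass of the completions that reach it.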
Furthermore, a regularization term may be added to the expected loss to counter overfitting in a regression scenario, e.g., by penalizing the leaf parameter magnitudes; this still yields closed-form solutions (see Appendix A). Next, we will show the effect of tree parameter learning via expected losses on tree structures that have been induced by popular algorithms such as XGBoost (Chen and Guestrin, 2016); that is, we will fine-tune their parameters to optimality given their structures and a tractable density estimator for $p(\mathbf{X})$. Investigating how to blend expected loss learning into classical top-down tree induction schemes is an interesting avenue we are currently exploring.
5 Experiments
In this section, we provide preliminary experiments to answer the following questions: (Q1) Do expected predictions at deployment time improve predictions over common techniques to deal with missingness for trees? (Q2) Does expected loss minimization improve predictions when missing values are present also at learning time?
Setup We employ the Insurance dataset, in which we want to predict the yearly medical insurance costs of a patient based on their personal data (refer to Appendix C for more information about the dataset). We consider two scenarios: when data is missing only at deployment time, and when it is missing at learning time as well. In both cases, we assume data to be MCAR: given complete data, we make each feature go missing with a fixed probability, for 10 independent trials. For each setting, we learn a probabilistic circuit on the training data as well as a decision tree or forest using the ubiquitous XGBoost.
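The MCAR masking described above can be sketched as follows (illustrative code, not the experimental harness actually used): each feature value is dropped independently of everything else with a fixed probability.

```python
import random

def mcar_mask(rows, p_miss, seed=0):
    """Drop each feature value independently with probability p_miss (MCAR).

    `rows` is a list of dicts mapping feature names to values; a missing
    feature is simply absent from the returned dict.
    """
    rng = random.Random(seed)  # fixed seed per trial for reproducibility
    return [{f: v for f, v in row.items() if rng.random() >= p_miss}
            for row in rows]
```

Because the drop probability depends on nothing but a coin flip, the missingness mechanism is MCAR by construction.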
Methods For XGBoost we employ the default parameters. As a simple baseline we use median imputation, estimating the per-feature imputations on the observed portion of the training set. We employ expected predictions over the trees learned by XGBoost to deal with missing data at deployment time. Lastly, we use the expected loss to fine-tune the XGBoost trees and use them for expected predictions at deployment time, which we denote as "ExpLoss + Expected Prediction". We measure performance by the average test root mean squared error (RMSE).
Missing only at deployment time Figure 1 (left) summarizes our results for Q1. Expected prediction outperforms XGBoost and median imputation. Notably, the reason XGBoost performs poorly is that it has not seen any missing values at learning time, in which case the "default" branch it uses for missing values always points to the first child. Additionally, median imputation makes the strong assumption that all features are fully independent, which would explain why expected prediction using PCs does better.
Missing during both learning and deployment Figure 1 (right) summarizes our results for Q2. In this scenario, expected predictions perform on par, up to a point, with the way XGBoost treats missing values at deployment time when they are generated by the same missingness mechanism it has been trained on. However, both methods are significantly outperformed by fine-tuning the tree parameters via expected loss minimization. We leave for future work investigating what happens with missingness mechanisms that differ between learning and deployment time, or when we adopt other ensembling techniques such as bagging and random forests.
6 Conclusion
In this work, we introduced expected predictions and expected loss minimization for decision trees and forests as a principled probabilistic way to handle missing data both at training and deployment time, while being agnostic to the tree structure or the way it has been learned. We are currently investigating how to exploit this methodology to extend tree induction schemes under different missing value mechanisms and derive consistency guarantees for the learned estimators.
Acknowledgments
This work is partially supported by NSF grants #IIS1943641, #IIS1633857, #CCF1837129, DARPA XAI grant #N660011724032, UCLA Samueli Fellowship, and gifts from Intel and Facebook Research. The authors would like to thank Steven Holtzen for initial discussions about expected prediction for decision trees.
References

G. Batista and M. C. Monard (2003). An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence 17 (5-6), pp. 519–533.
L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen (1984). Classification and regression trees. CRC Press.
L. Breiman (1996). Bagging predictors. Machine Learning 24 (2), pp. 123–140.
L. Breiman (2001). Random forests. Machine Learning 45 (1), pp. 5–32.
S. van Buuren and K. Groothuis-Oudshoorn (2010). MICE: multivariate imputation by chained equations in R. Journal of Statistical Software, pp. 1–68.
T. Chen and C. Guestrin (2016). XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16.
Y. Choi, A. Vergari, and G. Van den Broeck (2020). Lecture notes: probabilistic circuits: representation and inference.
L. Devos, W. Meert, and J. Davis (2019). Fast gradient boosting decision trees with bit-level data structures. In Proceedings of ECML PKDD, Springer.
J. H. Friedman (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.
S. Gavankar and S. Sawarkar (2015). Decision tree: review of techniques for missing values at training, testing and compatibility. In 2015 3rd International Conference on Artificial Intelligence, Modelling and Simulation (AIMS), pp. 122–126.
J. Josse, N. Prost, E. Scornet, and G. Varoquaux (2019). On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931.
P. Khosravi, Y. Choi, Y. Liang, A. Vergari, and G. Van den Broeck (2019). On tractable computation of expected predictions. In Advances in Neural Information Processing Systems, pp. 11167–11178.
P. Khosravi, Y. Liang, Y. Choi, and G. Van den Broeck (2019). What to expect of classifiers? Reasoning about logistic regression with missing features. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI).
Y. Liang and G. Van den Broeck (2019). Learning logistic circuits. In Proceedings of the 33rd Conference on Artificial Intelligence (AAAI).
L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin (2018). CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pp. 6638–6648.
J. R. Quinlan (1993). C4.5: programs for machine learning. Elsevier.
J. R. Quinlan (1986). Induction of decision trees. Machine Learning 1 (1), pp. 81–106.
Y. Rozenholc, T. Mildenberger, and U. Gather (2010). Combining regular and irregular histograms by penalized likelihood. Computational Statistics & Data Analysis 54 (12), pp. 3313–3323.
D. B. Rubin (1976). Inference and missing data. Biometrika 63 (3), pp. 581–592.
M. Saar-Tsechansky and F. Provost (2007). Handling missing values when applying classification models. Journal of Machine Learning Research 8, pp. 1623–1657.
B. E. T. H. Twala, M. C. Jones, and D. J. Hand (2008). Good methods for coping with missing data in decision trees. Pattern Recognition Letters 29 (7), pp. 950–956.
Appendix A Proofs
Proposition 1 (Expected predictions for decision trees).
Given a decision tree encoding $f_{\mathcal{T},\boldsymbol{\theta}}$, a distribution $p(\mathbf{X})$, and a partial assignment $\mathbf{x}^o$, the expected prediction of $f_{\mathcal{T},\boldsymbol{\theta}}$ w.r.t. $p$ can be computed as follows:

$$\mathbb{E}_{\mathbf{x}^c \sim p(\mathbf{X}^c \mid \mathbf{x}^o)}\left[f_{\mathcal{T},\boldsymbol{\theta}}(\mathbf{x}^o, \mathbf{x}^c)\right] = \sum_{\ell \in \mathcal{L}_{\mathcal{T}}} \theta_\ell\, p(\mathbf{x}_\ell \mid \mathbf{x}^o)$$

where $\mathbf{X}_\ell$ is the set of RVs appearing in path $p_\ell$ and $\mathbf{x}_\ell$ is the assignment to the RVs in $\mathbf{X}_\ell$ that evaluates $\mathbb{1}[\mathbf{x} \models p_\ell]$ to 1.
Proof.
Let $\mathbf{X}_\ell$ be the set of RVs appearing in the decision nodes in path $p_\ell$. Then the following holds:

$$\mathbb{E}_{\mathbf{x}^c \sim p(\mathbf{X}^c \mid \mathbf{x}^o)}\left[f_{\mathcal{T},\boldsymbol{\theta}}(\mathbf{x}^o, \mathbf{x}^c)\right] = \sum_{\mathbf{x}^c} p(\mathbf{x}^c \mid \mathbf{x}^o) \sum_{\ell \in \mathcal{L}_{\mathcal{T}}} \theta_\ell\, \mathbb{1}[(\mathbf{x}^o, \mathbf{x}^c) \models p_\ell] = \sum_{\ell \in \mathcal{L}_{\mathcal{T}}} \theta_\ell \sum_{\mathbf{x}^c} p(\mathbf{x}^c \mid \mathbf{x}^o)\, \mathbb{1}[(\mathbf{x}^o, \mathbf{x}^c) \models p_\ell] = \sum_{\ell \in \mathcal{L}_{\mathcal{T}}} \theta_\ell\, p(\mathbf{x}_\ell \mid \mathbf{x}^o)$$

∎
Before proving Proposition 2, let us first introduce a useful lemma.
Lemma 1 (Expected squared predictions).
Given a decision tree structure encoding $f_{\mathcal{T},\boldsymbol{\theta}}$, a distribution $p(\mathbf{X})$ and a partial assignment $\mathbf{x}^o$, the expected squared prediction of $f_{\mathcal{T},\boldsymbol{\theta}}$ w.r.t. $p$ can be computed as follows:

$$\mathbb{E}_{\mathbf{x}^c \sim p(\mathbf{X}^c \mid \mathbf{x}^o)}\left[f_{\mathcal{T},\boldsymbol{\theta}}^2(\mathbf{x}^o, \mathbf{x}^c)\right] = \sum_{\ell \in \mathcal{L}_{\mathcal{T}}} \theta_\ell^2\, p(\mathbf{x}_\ell \mid \mathbf{x}^o)$$
Proposition 2 (Expected parameters of MSE loss).
Given a decision tree structure $\mathcal{T}$ and a training set $\mathcal{D}$, the set of parameters $\boldsymbol{\theta}$ that minimizes $\mathcal{L}_{\mathbb{E}}$, the expected prediction loss for MSE, can be found by

$$\theta_\ell = \frac{\sum_{(\mathbf{x}^o, y) \in \mathcal{D}} y\, p(\mathbf{x}_\ell \mid \mathbf{x}^o)}{\sum_{(\mathbf{x}^o, y) \in \mathcal{D}} p(\mathbf{x}_\ell \mid \mathbf{x}^o)}$$

for each leaf $\ell \in \mathcal{L}_{\mathcal{T}}$.
Proof.
First, the expected MSE loss can be expressed as the following:

$$\mathcal{L}_{\mathbb{E}}(\boldsymbol{\theta}) = \sum_{(\mathbf{x}^o, y) \in \mathcal{D}} \mathbb{E}_{\mathbf{x}^c \sim p(\mathbf{X}^c \mid \mathbf{x}^o)}\left[\left(f_{\mathcal{T},\boldsymbol{\theta}}(\mathbf{x}^o, \mathbf{x}^c) - y\right)^2\right] = \sum_{(\mathbf{x}^o, y) \in \mathcal{D}} \left( \mathbb{E}\left[f_{\mathcal{T},\boldsymbol{\theta}}^2\right] - 2y\,\mathbb{E}\left[f_{\mathcal{T},\boldsymbol{\theta}}\right] + y^2 \right)$$

To optimize this loss, we consider its partial derivative w.r.t. a leaf parameter $\theta_\ell$. Using Equation 3 and the fact that the gradient is a linear operator, we have:

$$\frac{\partial}{\partial \theta_\ell}\, \mathbb{E}\left[f_{\mathcal{T},\boldsymbol{\theta}}\right] = p(\mathbf{x}_\ell \mid \mathbf{x}^o)$$

Similarly, the partial derivative of the expected squared prediction in Lemma 1 is:

$$\frac{\partial}{\partial \theta_\ell}\, \mathbb{E}\left[f_{\mathcal{T},\boldsymbol{\theta}}^2\right] = 2\,\theta_\ell\, p(\mathbf{x}_\ell \mid \mathbf{x}^o)$$

Therefore, the partial derivative of the expected MSE loss w.r.t. a leaf parameter $\theta_\ell$ can be computed as follows:

$$\frac{\partial \mathcal{L}_{\mathbb{E}}}{\partial \theta_\ell} = 2 \sum_{(\mathbf{x}^o, y) \in \mathcal{D}} \left(\theta_\ell - y\right) p(\mathbf{x}_\ell \mid \mathbf{x}^o)$$

Then its gradient w.r.t. the parameter vector $\boldsymbol{\theta}$, with one entry $\theta_\ell$ per leaf $\ell \in \mathcal{L}_{\mathcal{T}}$, can be written in matrix notation as:

$$\nabla_{\boldsymbol{\theta}}\, \mathcal{L}_{\mathbb{E}} = 2\,A\,\boldsymbol{\theta} - 2\,\mathbf{b}, \qquad A = \mathrm{diag}\!\left(\textstyle\sum_{(\mathbf{x}^o, y) \in \mathcal{D}} p(\mathbf{x}_\ell \mid \mathbf{x}^o)\right), \qquad b_\ell = \textstyle\sum_{(\mathbf{x}^o, y) \in \mathcal{D}} y\, p(\mathbf{x}_\ell \mid \mathbf{x}^o)$$

Hence, by setting $\nabla_{\boldsymbol{\theta}}\, \mathcal{L}_{\mathbb{E}} = \mathbf{0}$ we can easily retrieve that the optimal parameter vector is:

$$\boldsymbol{\theta}^* = A^{-1}\mathbf{b}, \qquad \text{i.e.,} \qquad \theta_\ell^* = \frac{\sum_{(\mathbf{x}^o, y) \in \mathcal{D}} y\, p(\mathbf{x}_\ell \mid \mathbf{x}^o)}{\sum_{(\mathbf{x}^o, y) \in \mathcal{D}} p(\mathbf{x}_\ell \mid \mathbf{x}^o)}$$
Regularization.
During parameter learning, it is common to also add a regularization term to the total loss to reduce overfitting. In our case, we use the $\ell_2$ regularizer $\lambda \|\boldsymbol{\theta}\|_2^2$. Now, we want to minimize the following loss:

$$\mathcal{L}_{\mathbb{E}}^{reg}(\boldsymbol{\theta}) = \mathcal{L}_{\mathbb{E}}(\boldsymbol{\theta}) + \lambda \|\boldsymbol{\theta}\|_2^2 \qquad (7)$$

where $\lambda$ is the regularization hyperparameter. By repeating the steps from above, we can easily see that the parameters that minimize $\mathcal{L}_{\mathbb{E}}^{reg}$ are:

$$\theta_\ell = \frac{\sum_{(\mathbf{x}^o, y) \in \mathcal{D}} y\, p(\mathbf{x}_\ell \mid \mathbf{x}^o)}{\lambda + \sum_{(\mathbf{x}^o, y) \in \mathcal{D}} p(\mathbf{x}_\ell \mid \mathbf{x}^o)}$$

∎
Appendix B Expected Parameters Beyond Single Trees
In this section, we extend the expected parameter tuning beyond single tree models. The learning scenarios include forests, bagging, random forests, and gradient tree boosting.
B.1 Forests
In this section, instead of a single tree $f_{\mathcal{T},\boldsymbol{\theta}}$, we are given a forest $F$ (cf. Equation 2), and want to minimize the following loss instead:

$$\mathcal{L}_{\mathbb{E}}(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_m) = \sum_{(\mathbf{x}^o, y) \in \mathcal{D}} \mathbb{E}_{\mathbf{x}^c \sim p(\mathbf{X}^c \mid \mathbf{x}^o)}\left[\left(F(\mathbf{x}^o, \mathbf{x}^c) - y\right)^2\right]$$
Proposition 3 (Expected parameters of forests MSE loss).
Given the training set $\mathcal{D}$ and the forest $F$, the set of parameters $\boldsymbol{\theta}$ that minimizes $\mathcal{L}_{\mathbb{E}}$ can be found by solving for $\boldsymbol{\theta}$ in the following linear system of equations:

$$A\,\boldsymbol{\theta} = \mathbf{b} \qquad (8)$$

where $A$ is an $N \times N$ matrix and $\boldsymbol{\theta}$ and $\mathbf{b}$ are $N$-dimensional vectors, with $N$ the total number of leaves in the forest.
Note that we usually learn a forest tree by tree and do not have all the tree structures initially; moreover, the above algorithm grows quadratically in the total number of leaves, which is less desirable. As a result, we also want to explore other scenarios such as bagging or boosting.
B.2 Bagging and Random Forests
In both bagging of trees and random forests, we learn the trees independently and average their predictions; hence we can also do the expected parameter tuning for each tree independently, w.r.t. a generative model learned on the bootstrap sample of the training dataset on which the tree has been induced.
B.3 Gradient Tree Boosting
In this section, we adapt gradient tree boosting to the expected prediction framework. Before moving on to boosting of trees, we introduce Lemma 2, which computes the expected prediction of the product of two trees.
Lemma 2 (Expected tree times tree).
Given two trees $f_{\mathcal{T}_1,\boldsymbol{\theta}}$ and $f_{\mathcal{T}_2,\boldsymbol{\gamma}}$, a distribution $p(\mathbf{X})$ and a partial assignment $\mathbf{x}^o$, the expected product of the two trees w.r.t. $p$ can be computed as follows:

$$\mathbb{E}_{\mathbf{x}^c \sim p(\mathbf{X}^c \mid \mathbf{x}^o)}\left[f_{\mathcal{T}_1,\boldsymbol{\theta}}(\mathbf{x}^o, \mathbf{x}^c)\, f_{\mathcal{T}_2,\boldsymbol{\gamma}}(\mathbf{x}^o, \mathbf{x}^c)\right] = \sum_{\ell \in \mathcal{L}_{\mathcal{T}_1}} \sum_{\ell' \in \mathcal{L}_{\mathcal{T}_2}} \theta_\ell\, \gamma_{\ell'}\, p(\mathbf{x}_\ell, \mathbf{x}_{\ell'} \mid \mathbf{x}^o)$$

where $p(\mathbf{x}_\ell, \mathbf{x}_{\ell'} \mid \mathbf{x}^o)$ is the probability of jointly satisfying both paths, which is zero whenever $p_\ell$ and $p_{\ell'}$ impose inconsistent decision tests.
Proof.
The proof follows similarly from the proof of Lemma 1. The main difference is that the leaves $\ell$ and $\ell'$ come from two different trees, so $\mathbb{1}[\mathbf{x} \models p_\ell]\,\mathbb{1}[\mathbf{x} \models p_{\ell'}]$ is not necessarily equal to $0$ for $\ell \neq \ell'$, so we cannot cancel those terms. ∎
Note that the result of Lemma 2 can easily be extended to multiplying two forests.
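Under a fully factorized stand-in distribution, the expected product of two trees in Lemma 2 can be sketched as follows (illustrative names; the paper uses PCs): paths from the two trees are combined by intersecting their per-feature constraints, so the cross term of two incompatible paths correctly contributes zero.

```python
def expected_product(leaves_f, leaves_g, marginals, x_obs):
    """Sketch of Lemma 2: E[f*g | x_obs] = sum_{l,l'} theta_l gamma_l' p(x_l, x_l' | x_obs).

    Trees are lists of (path, parameter) pairs; `marginals` is a fully
    factorized stand-in for the density estimator, so the joint probability
    of two paths factorizes over the intersected per-feature constraints.
    """
    def joint_prob(path_a, path_b):
        prob = 1.0
        for var in set(path_a) | set(path_b):
            # Intersect the value sets the two paths impose on `var`.
            values = (path_a.get(var, set(marginals[var]))
                      & path_b.get(var, set(marginals[var])))
            if var in x_obs:
                prob *= 1.0 if x_obs[var] in values else 0.0
            else:
                prob *= sum(marginals[var][v] for v in values)
        return prob

    return sum(t_a * t_b * joint_prob(p_a, p_b)
               for p_a, t_a in leaves_f for p_b, t_b in leaves_g)
```

For two copies of the same single-split tree, only the diagonal terms survive, recovering the expected squared prediction of Lemma 1.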
During gradient tree boosting we learn our forest in an additive manner. At each step, given the already learned forest $F$, we add a new tree $f_{\mathcal{T},\boldsymbol{\theta}}$ that minimizes a sum of losses of the form $\mathcal{L}(F(\mathbf{x}) + f_{\mathcal{T},\boldsymbol{\theta}}(\mathbf{x}), y)$. We adapt this to the expected prediction framework as follows:
Definition 3 (Boosting expected loss minimization).
In addition to Definition 2, we are also given a fixed forest $F$; we want to find the set of parameters $\boldsymbol{\theta}$ of the tree $f_{\mathcal{T},\boldsymbol{\theta}}$ that minimizes the expected loss defined as follows:

$$\mathcal{L}_{\mathbb{E}}(\boldsymbol{\theta}) = \sum_{(\mathbf{x}^o, y) \in \mathcal{D}} \mathbb{E}_{\mathbf{x}^c \sim p(\mathbf{X}^c \mid \mathbf{x}^o)}\left[\mathcal{L}\left(F(\mathbf{x}^o, \mathbf{x}^c) + f_{\mathcal{T},\boldsymbol{\theta}}(\mathbf{x}^o, \mathbf{x}^c),\, y\right)\right]$$
Proposition 4 (Expected parameters of Boosted MSE loss).
Given the training set $\mathcal{D}$, the forest $F$ and the new tree structure $\mathcal{T}$, the set of parameters $\boldsymbol{\theta}$ that minimizes the boosting expected loss can be found by

$$\theta_\ell = \frac{\sum_{(\mathbf{x}^o, y) \in \mathcal{D}} \left( y\, p(\mathbf{x}_\ell \mid \mathbf{x}^o) - \mathbb{E}_{\mathbf{x}^c \sim p(\mathbf{X}^c \mid \mathbf{x}^o)}\left[F(\mathbf{x}^o, \mathbf{x}^c)\, \mathbb{1}[(\mathbf{x}^o, \mathbf{x}^c) \models p_\ell]\right] \right)}{\sum_{(\mathbf{x}^o, y) \in \mathcal{D}} p(\mathbf{x}_\ell \mid \mathbf{x}^o)}$$

for each leaf $\ell \in \mathcal{L}_{\mathcal{T}}$, where the expectation in the numerator can be computed via Lemma 2.
Appendix C More Experiment Info
Dataset     Train   Valid   Test   Features
Insurance     936     187    215         36

Description of the dataset.
In the Insurance dataset (https://www.kaggle.com/mirichoi0218/insurance), the goal is to predict the yearly medical insurance costs of patients given other attributes such as age, gender, and whether they smoke or not.
Preprocessing Steps
We preserve the original train and test splits, if present, for each dataset. Additionally, we merge any given validation set with the test set.
The probabilistic circuit learning implementation that we use does not yet support continuous features, so we discretize the continuous features as follows. First, we try to automatically detect the optimal number of (irregular) bins through adaptive binning, employing a penalized likelihood scheme as in (Rozenholc et al., 2010). If the number of bins found in this way exceeds ten, we instead employ an equal-width binning scheme, capping the bin number at ten. Once the data is discrete, we employ one-hot encoding.
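The fallback equal-width scheme can be sketched as follows (an illustrative helper, not the discretization code actually used):

```python
def equal_width_bins(values, n_bins=10):
    """Assign each value to one of `n_bins` equal-width bins over [min, max].

    The top edge is clamped so the maximum falls in the last bin; constant
    columns (max == min) all map to bin 0.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid division by zero for constants
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

The resulting integer bin indices can then be one-hot encoded as described above.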
Other Settings
For XGBoost, we use "reg:squarederror", which corresponds to the MSE loss. Max depth is set to 5, and we use $\ell_2$ regularization where applicable.
When learning XGBoost trees from missing values, some of the leaves become reachable only if a certain feature is missing, and are never reached on fully observed data. We ignore those leaves in our expected prediction framework.