Classification And Regression Tree (CART) analysis (Breiman et al., 1984) is a well-established statistical learning technique that has been adopted by numerous other fields for its model interpretability, scalability to large data sets, and connection to rule-based decision making (Loh, 2014). CART builds a model by recursively partitioning the instance space and labeling each partition with either a predicted category (in classification) or a real value (in regression). Despite their widespread use, CART models often have lower predictive performance than other statistical learning models, such as kernel methods and ensemble techniques (Caruana and Niculescu-Mizil, 2006). Among the latter, boosting methods were developed to train an ensemble of weak learners (often CART models) iteratively into a high-performance predictive model, albeit with a loss of model interpretability. In particular, gradient boosting methods (Friedman, 2001) iteratively optimize an ensemble's prediction to increasingly match the labeled training data. Historically, these two categories of approaches, CART and gradient boosting, have been studied separately, connected primarily through CART models being used as the weak learners in boosting. This paper investigates a deeper and surprising connection between full interaction models like CART and additive models like gradient boosting, showing that the resulting models exist upon a spectrum. In particular, this paper makes the following contributions:
- We introduce tree-structured boosting (TSB), a new mechanism for creating a hierarchical ensemble model that recursively partitions the instance space, forming a perfect binary tree of weak learners. Each path from the root node to a leaf represents the outcome of a gradient boosted stumps (GBS) ensemble for a particular partition of the instance space.
- We prove that TSB generates a continuum of single-tree models with accuracy between CART and GBS, controlled via a single tunable parameter. In effect, TSB bridges between CART and GBS, identifying never-before-seen connections between additive and full interaction models.
- We verify this result empirically, showing that this hybrid combination of CART and GBS can outperform either approach individually in terms of accuracy and/or interpretability while building a single tree. Our experiments also provide insight into the continuum of models revealed by TSB.
2 Connections between CART and Boosting
Assume we are given a training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where each $d$-dimensional instance $\mathbf{x}_i \in \mathcal{X}$ has a corresponding label $y_i \in \mathcal{Y}$, drawn i.i.d. from an unknown distribution $\mathcal{D}$. In a classification setting, $\mathcal{Y}$ is a finite set of categories; in regression, $\mathcal{Y} \subseteq \mathbb{R}$. The goal is to learn a function $F : \mathcal{X} \rightarrow \mathcal{Y}$ that will perform well in predicting the label on new examples drawn from $\mathcal{D}$. CART analysis recursively partitions $\mathcal{X}$, with $F$ assigning a single label in $\mathcal{Y}$ to each partition. In this manner, there is full interaction between each component of the model. Different branches of the tree are trained with disjoint subsets of the data, as shown in Figure 1.
In contrast, boosting iteratively trains an ensemble of weak learners $h_1, \ldots, h_T$ (often CART models), such that the model $F_T(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})$ is a weighted sum of the weak learners' predictions with weights $\alpha_1, \ldots, \alpha_T$; in classification, the sign of $F_T(\mathbf{x})$ gives the prediction. Each boosted weak learner is trained with a different weighting of the entire data set, unlike CART, repeatedly emphasizing mispredicted instances to induce diversity (Figure 1). Gradient boosting with decision stumps or simple regressors creates a pure additive model, since each new ensemble member serves to reduce the residual of previous members (Friedman et al., 1998; Friedman, 2001). Interaction terms can be included in the overall ensemble by using more complex weak learners, such as deeper trees.
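To make the additive, residual-fitting view concrete, the following minimal sketch (illustrative only — the stump fitter and data are assumptions, not the paper's implementation) repeatedly fits a least-squares regression stump to the current residuals and checks that the training error never increases:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0) + 0.1 * rng.standard_normal(200)

def fit_stump(X, r):
    """Least-squares regression stump fit to the residuals r."""
    best = None
    for j in range(X.shape[1]):
        for thr in X[:, j]:
            left = X[:, j] <= thr
            if left.all() or not left.any():
                continue
            pred = np.where(left, r[left].mean(), r[~left].mean())
            err = ((r - pred) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, thr, r[left].mean(), r[~left].mean())
    _, j, thr, lv, rv = best
    return lambda X: np.where(X[:, j] <= thr, lv, rv)

F = np.zeros(len(y))          # current additive model's predictions
errors = []
for t in range(5):
    h = fit_stump(X, y - F)   # each new stump reduces the previous residual
    F = F + h(X)
    errors.append(((y - F) ** 2).mean())

# Training error is non-increasing as stumps are added.
assert all(a >= b - 1e-12 for a, b in zip(errors, errors[1:]))
```

Because each stump predicts the mean residual on each side of its split, adding the best stump can never increase the training squared error, which is the additive-model behavior described above.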
As shown by Valdes et al. (2016), classifier ensembles with decision stumps $h_1, \ldots, h_T$ as the weak learners can be trivially rewritten as a complete binary tree of depth $T$, where the decision made at each internal node at depth $t$ is given by $h_t$, and the prediction at each leaf is given by the weighted sum $\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})$ of the stump outcomes along its path. Intuitively, each path through the tree represents the same ensemble, but one that tracks the unique combination of predictions made by each member.
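This equivalence can be checked directly in a few lines of Python (a sketch with arbitrary, hypothetical stump parameters and weights): enumerating every combination of stump outcomes builds the complete binary tree, and walking the tree gives exactly the ensemble's weighted vote.

```python
import itertools

# A decision stump: threshold test on one feature, returning +1 or -1.
# (Stump parameters and weights here are arbitrary, for illustration.)
def make_stump(feature, threshold):
    return lambda x: 1.0 if x[feature] > threshold else -1.0

stumps = [make_stump(0, 0.5), make_stump(1, -1.0), make_stump(0, 2.0)]
alphas = [0.7, 0.4, 0.2]

def ensemble_predict(x):
    """Boosted-stump prediction (in classification, take its sign)."""
    return sum(a * h(x) for a, h in zip(alphas, stumps))

# The same ensemble as a complete binary tree of depth T: each root-to-leaf
# path fixes an outcome (+1/-1) for every stump, and the leaf stores the
# corresponding weighted sum.
leaf_value = {
    outcomes: sum(a * o for a, o in zip(alphas, outcomes))
    for outcomes in itertools.product((-1.0, 1.0), repeat=len(stumps))
}

def tree_predict(x):
    """Walk the tree: at depth t, branch on stump t's outcome."""
    path = tuple(h(x) for h in stumps)  # the unique path x follows
    return leaf_value[path]

x = (1.0, 0.0)
assert ensemble_predict(x) == tree_predict(x)
```

Every path evaluates the same $T$ stumps in the same order, so the tree is just a cached enumeration of the ensemble's possible outcome combinations.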
2.1 Tree-Structured Boosting
This interpretation of boosting lends itself to the creation of a tree-structured ensemble learner that bridges between CART and gradient boosting. The idea behind tree-structured boosting (TSB) is to grow the ensemble recursively, introducing diversity through the addition of different sub-ensembles after each new weak learner. At each step, TSB first trains a weak learner on the training set under the current instance weights, and then creates a new sub-ensemble for each of the weak learner's outputs. Each sub-ensemble is subsequently trained on the full training set, but instances corresponding to the respective branch are more heavily weighted during training, yielding diverse sub-ensembles (Figure 1, middle). This process proceeds recursively until the depth limit is reached. Critically, this approach identifies clear connections between CART and gradient boosted stumps (GBS): as the re-weighting ratio is varied, tree-structured boosting produces a spectrum of models with accuracy between CART and GBS at the two extremes.
The complete TSB approach is detailed as Algorithm 1. The parameter $\lambda$ used in step 8 of the algorithm provides the connection between CART and GBS: TSB converges to CART as $\lambda \rightarrow 0$ and converges to GBS as $\lambda \rightarrow \infty$. Theoretical analysis of TSB, given in the supplemental material, shows how TSB bridges between CART and GBS.
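A minimal recursive sketch of this procedure for regression follows. It is an illustrative reading of Algorithm 1, not a transcription: the stump fitter and, in particular, the re-weighting factors ($1 + \lambda$ for a branch's own instances, $\lambda$ for the rest, then renormalization) are assumptions chosen to exhibit the stated limits — at $\lambda = 0$ each child sees only its own partition (CART-like), while for large $\lambda$ the weights stay essentially uniform (GBS-like).

```python
import numpy as np

def fit_stump(X, r, w):
    """Weighted least-squares regression stump fit to residuals r."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            left = X[:, j] <= thr
            if left.all() or not left.any():
                continue
            if w[left].sum() == 0 or w[~left].sum() == 0:
                continue
            lv = np.average(r[left], weights=w[left])
            rv = np.average(r[~left], weights=w[~left])
            err = np.sum(w * (r - np.where(left, lv, rv)) ** 2)
            if best is None or err < best[0]:
                best = (err, j, thr, lv, rv)
    return best[1:]

def tsb(X, y, w, F, lam, depth):
    """Grow one TSB node: a stump plus one sub-ensemble per branch."""
    j, thr, lv, rv = fit_stump(X, y - F, w)
    left = X[:, j] <= thr
    F = F + np.where(left, lv, rv)       # every path contains this stump
    node = {"feature": j, "threshold": thr, "left_value": lv, "right_value": rv}
    if depth > 1:
        for side, mask in (("left_child", left), ("right_child", ~left)):
            # Assumed branch-specific re-weighting: instances in the
            # branch's partition are scaled by 1 + lam, the rest by lam.
            w_child = w * np.where(mask, 1.0 + lam, lam)
            node[side] = tsb(X, y, w_child / w_child.sum(), F, lam, depth - 1)
    return node

# Example with hypothetical data: grow a depth-3 TSB tree.
rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 2))
y = np.where(X[:, 0] > 0.5, 1.0, -1.0)
tree = tsb(X, y, np.full(60, 1 / 60), np.zeros(60), lam=1.0, depth=3)
```

Note that both children inherit the same running estimate $F$; only their instance weights differ, which is exactly what produces the diverse sub-ensembles described above.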
Algorithm 1 (TSB). Inputs: training data; instance weights (default: uniform); a scalar defining the additive expansion to estimate; node depth; max height; node domain (default: the full instance space); prediction function. Outputs: the root node of a hierarchical ensemble.
In a first experiment, we use real-world data to evaluate the classification error of TSB for different values of $\lambda$. We then examine the behavior of the instance weights as $\lambda$ varies in a second experiment.
3.1 Assessment of TSB Model Performance versus CART and GBS
In this experiment, we use four life science data sets from the UCI repository (Lichman, 2013): Breast Tissue, Indian Liver Patient Dataset (ILPD), SPECTF Heart Disease, and Wisconsin Breast Cancer. These data sets are all binary classification tasks and contain only numeric attributes with no missing values. We measure the classification error as the value of $\lambda$ increases. In particular, we assess 10 equidistant error points corresponding to the in-sample and out-of-sample errors of the generated TSB trees, and plot the transient behavior of the classification errors as functions of $\lambda$. The goal is to illustrate the trajectory of the classification errors of TSB, which is expected to approximate the performance of CART as $\lambda \rightarrow 0$, and to converge asymptotically to GBS as $\lambda \rightarrow \infty$.
To ensure a fair comparison, we assessed the classification accuracy of CART and GBS for different depth and learning rate values under 5-fold cross-validation. We concluded that a tree/ensemble depth of 10 offered near-optimal accuracy, and so use it for all algorithms. The binary classification was carried out using the negative binomial log-likelihood as the loss function, similar to LogitBoost (Friedman, 2001), which requires an additional learning rate (shrinkage) factor in Algorithm 1.
|Data Set|# Instances|# Attributes|TSB Learning Rate|
|---|---|---|---|
|Wisconsin Breast Cancer|569|30|0.7|
For each data set, the experimental results were averaged over 20 trials of 10-fold cross-validation, using 90% of the samples for training and the remaining 10% for testing in each experiment. The error bars in the plots denote the standard error at each sample point.
The results are presented in Figure 2, showing that the classification error of TSB approaches the CART and GBS errors in the two limits of $\lambda$. As expected, increasing $\lambda$ generally reduces overfitting. However, note that for each data set except ILPD, the lowest test error is achieved by a TSB model between the extremes of CART and GBS. This reveals that hybrid TSB models can outperform either CART or GBS alone.
3.2 Effect of $\lambda$ on the Instance Weights
In a second experiment, we use a synthetic binary-labeled data set to graphically illustrate the behavior of the instance weights as functions of $\lambda$. The synthetic data set consists of 100 two-dimensional points, 58 of which belong to the red class and the remaining 42 to the green class, as shown in Figure 3. The learning rate was chosen to be 0.1 based on classification accuracy, as in the previous experiment. We recorded the instance weights produced by TSB at different values of $\lambda$.
Figure 3 shows a heatmap linearly interpolating the weights associated with each instance for the disjoint region defined by one of the four leaf nodes of the trained tree; the chosen leaf node corresponds to a logical function of the two input attributes.
When $\lambda = 0$, the weights take binary normalized values that produce a sharp differentiation of the surface defined by the leaf node, similar to the behavior of CART, as illustrated in Figure 3(a). As $\lambda$ increases, the weights become more diffuse (Figures 3(b) and (c)), until $\lambda$ becomes sufficiently large. At that point, the weights approximate the initial uniform values, as anticipated by the theory. Consequently, the ensembles along each path to a leaf are trained using equivalent instance weights, and are therefore identical and equivalent to GBS.
We have shown that tree-structured boosting reveals intrinsic connections between additive models (GBS) and full interaction models (CART). As the parameter $\lambda$ varies from $0$ to $\infty$, the models produced by TSB vary between CART and GBS, respectively. This has been shown both theoretically and empirically. Notably, the experiments revealed that a hybrid model between these two extremes of CART and GBS can outperform either alone.
- Breiman et al. (1984) L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and regression trees. Wadsworth & Brooks, Monterey, CA, 1984.
- Caruana and Niculescu-Mizil (2006) R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Intl. Conf. on Mach. Learn., pages 161–168, Pittsburgh, 2006.
- Friedman et al. (1998) J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Stat., 28(2):337–407, 2000.
- Friedman (2001) J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Stat., 29(5):1189–1232, 2001.
- Lichman (2013) M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
- Loh (2014) W.-Y. Loh. Fifty years of classification and regression trees. Intl. Stat. Review, 82(3):329–348, 2014.
- Valdes et al. (2016) G. Valdes, J. M. Luna, E. Eaton, C. B. Simone II, L. H. Ungar, and T. D. Solberg. MediBoost: a patient stratification tool for interpretable decision making in the era of precision medicine. Scientific Reports, 6, 2016.
Appendix A Proof Sketches of Key Lemmas
This section shows that TSB is equivalent to CART when $\lambda = 0$ and equivalent to GBS as $\lambda \rightarrow \infty$, thus establishing a continuum between CART and GBS. We include proof sketches for the four lemmas used to prove our main result in Theorem 1.
TSB maintains a perfect binary tree of depth $D$ with $2^D - 1$ internal nodes, each of which corresponds to a weak learner. Each weak learner $h_t$ along the path from the root node to a leaf prediction node $l$ induces two disjoint partitions of $\mathcal{X}$, namely $R_t^{+}$ and $R_t^{-}$, so that $R_t^{+} \cup R_t^{-} = \mathcal{X}$ and $R_t^{+} \cap R_t^{-} = \emptyset$. Let $\{R_t\}$ be the corresponding set of partitions along that path to $l$, where each $R_t$ is either $R_t^{+}$ or $R_t^{-}$. We can then define the partition of $\mathcal{X}$ associated with $l$ as $R_l = \bigcap_t R_t$. TSB predicts a label for each $\mathbf{x} \in R_l$ via the ensemble consisting of all weak learners along the path to $l$. To focus each branch of the tree on corresponding instances, thereby constructing diverse ensembles, TSB maintains a set of weights over all training data; let $w_i^{(t,l)}$ denote the weight of instance $i$ used when training a weak learner at depth $t$ on the path to leaf $l$. Notice that, like CART, TSB learns a single tree-structured model, but with better accuracy given its proven connection with GBS.
We train the tree as follows. At each boosting step $t$, we have a current estimate $F_{t-1}$ of the function, corresponding to a perfect binary tree of height $t-1$. We seek to improve this estimate by replacing each of the leaf prediction nodes $l$ with additional weak learners $h_t^{(l)}$ with corresponding weights $\gamma_t^{(l)}$, growing the tree by one level. This yields a revised estimate of the function at each terminal node as
$$F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \sum_{l} \gamma_t^{(l)}\, h_t^{(l)}(\mathbf{x})\, \mathbb{1}\!\left[\mathbf{x} \in R_l\right] \qquad (1)$$
where $\mathbb{1}[\cdot]$ is a binary indicator function that is 1 if its predicate is true, and 0 otherwise. Since the partitions $R_l$ are disjoint, Equation (1) is equivalent to separate functions
$$F_t^{(l)}(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \gamma_t^{(l)}\, h_t^{(l)}(\mathbf{x}), \qquad \mathbf{x} \in R_l,$$
one for each leaf's corresponding ensemble. The goal is to minimize the weighted loss over the data,
$$\sum_{l} \sum_{i=1}^{n} w_i^{(t,l)}\, L\!\left(y_i,\; F_{t-1}(\mathbf{x}_i) + \gamma_t^{(l)}\, h_t^{(l)}(\mathbf{x}_i)\right) \qquad (2)$$
by choosing $\gamma_t^{(l)}$ and the $h_t^{(l)}$'s at each leaf. Taking advantage again of the independence of the leaves, Equation (2) is minimized by independently minimizing the inner summation for each leaf $l$, i.e.,
$$\left(h_t^{(l)}, \gamma_t^{(l)}\right) = \arg\min_{h,\,\gamma} \sum_{i=1}^{n} w_i^{(t,l)}\, L\!\left(y_i,\; F_{t-1}(\mathbf{x}_i) + \gamma\, h(\mathbf{x}_i)\right). \qquad (3)$$
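This leaf-wise decomposition is easy to verify numerically for squared-error loss (a small sanity check with hypothetical data): the per-leaf optimal constant adjustment is the weighted mean of that leaf's residuals, and because the leaves are disjoint, those per-leaf minimizers jointly minimize the total loss.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=12)
F_prev = rng.normal(size=12)        # previous estimate F_{t-1}(x_i)
w = rng.uniform(0.1, 1.0, size=12)  # instance weights
leaf = np.array([0] * 5 + [1] * 7)  # membership in two disjoint leaves

residual = y - F_prev

# Minimizing the inner sum for each leaf separately: for squared error,
# the optimal constant per leaf is the weighted mean of its residuals.
gammas = [np.average(residual[leaf == l], weights=w[leaf == l]) for l in (0, 1)]

def joint_loss(g0, g1):
    """Total weighted squared loss when leaf 0 adds g0 and leaf 1 adds g1."""
    return np.sum(w * (residual - np.where(leaf == 0, g0, g1)) ** 2)

# Because the leaves are disjoint, the per-leaf minimizers are also the
# joint minimizer: no perturbation does better.
for _ in range(200):
    d0, d1 = rng.normal(size=2)
    assert joint_loss(*gammas) <= joint_loss(gammas[0] + d0, gammas[1] + d1) + 1e-9
```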
Next, we focus on deriving TSB where the weak learners are binary regression trees with least squares as the loss function $L$. Following Friedman (2001), we first estimate the negative unconstrained gradient at each data instance, which for least squares is equivalent to the residual $r_i = y_i - F_{t-1}(\mathbf{x}_i)$. Then, we can determine the optimal parameters for $h_t^{(l)}$ by solving
$$\left(h_t^{(l)}, \gamma_t^{(l)}\right) = \arg\min_{h,\,\gamma} \sum_{i=1}^{n} w_i^{(t,l)} \left(r_i - \gamma\, h(\mathbf{x}_i)\right)^2. \qquad (4)$$
Gradient boosting solves Equation (4) by first fitting $h$ to the residuals $r_i$, then solving for the optimal $\gamma$. For details on gradient boosting, see Friedman (2001). Adapting TSB to the classification setting, for example using logistic regression base learners and the negative binomial log-likelihood as the loss function, follows directly from Friedman (2001) by using the gradient boosting procedure for classification in place of regression.
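The fit-then-line-search step has a simple closed form under weighted least squares, which the following sketch checks against direct search (the residuals and stump outputs are hypothetical; only the line-search algebra is being illustrated):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
w = np.full(n, 1 / n)                 # instance weights w_i
r = rng.normal(size=n)                # residuals r_i = y_i - F_{t-1}(x_i)
h = rng.choice([-1.0, 1.0], size=n)   # outputs of an already-fitted stump

# Weighted least-squares line search has a closed form:
#   gamma* = (sum_i w_i r_i h_i) / (sum_i w_i h_i^2)
gamma = np.sum(w * r * h) / np.sum(w * h ** 2)

def step_loss(g):
    """Weighted squared loss after taking a step of size g along h."""
    return np.sum(w * (r - g * h) ** 2)

# No other step size does better than the closed-form optimum.
assert all(step_loss(gamma) <= step_loss(g) + 1e-12 for g in np.linspace(-2, 2, 81))
```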
If all instance weights remained constant, this approach would build a perfect binary tree of height $D$ in which each path from the root to a leaf represents the same ensemble, and so it would be exactly equivalent to gradient boosting. To focus each branch of the tree on its corresponding instances, thereby constructing diverse ensembles, the weights are updated separately for each of a node's two children: instances inside the corresponding partition have their weights multiplied by a larger factor than instances outside the partition, with the ratio between the two factors controlled by the parameter $\lambda$. The update rule for the weight of each instance in the two partitions induced by the node's weak learner is given by Equation (5),
where a normalization constant rescales the updated weights to form a distribution. The initial weights are typically uniform.
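The limiting behavior is easy to see with a concrete, assumed choice of factors — multiplying in-partition weights by $1 + \lambda$ and out-of-partition weights by $\lambda$, then normalizing. These factors are an illustration chosen to exhibit the two extremes described in the text, not a transcription of Equation (5):

```python
import numpy as np

def update_weights(w, in_partition, lam):
    """Assumed TSB-style re-weighting: boost the branch's own instances.

    Instances inside the branch's partition are scaled by 1 + lam, the
    rest by lam, then the weights are renormalized to a distribution.
    """
    w_new = w * np.where(in_partition, 1.0 + lam, lam)
    return w_new / w_new.sum()

w0 = np.full(6, 1 / 6)                      # uniform initial weights
inside = np.array([True, True, False, False, False, False])

# lam = 0: all weight collapses onto the partition (CART-like behavior).
w_cart = update_weights(w0, inside, 0.0)
assert np.allclose(w_cart[~inside], 0.0)

# lam large: weights stay (near) uniform, so both children train the
# same ensemble (GBS-like behavior).
w_gbs = update_weights(w0, inside, 1e6)
assert np.allclose(w_gbs, w0, atol=1e-6)
```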
The weight of instance $i$ at leaf $l$ at the $t$-th boosting iteration is given by
where $R_1, \ldots, R_t$ is the sequence of partitions along the path from the root to $l$.
This lemma can be shown by induction based on Equation (5).
Given the weight distribution formula (6) for instance $i$ at leaf $l$ at the $t$-th boosting iteration, the following limits hold,
where $R_l$ is the intersection of the partitions along the path from the root to $l$.
Both parts follow directly by taking the corresponding limits of Lemma 1.
The optimal simple regressor that minimizes the loss function (3) at the $t$-th iteration at node $l$ is given by
The TSB update rule is given by . If is defined as,
with constant , then is constant, with
The proof is by induction on , building upon (10). We can show that each is constant and so is constant, and therefore the lemma holds under the given update rule.
Building upon these four lemmas, our main theoretical result is presented in the following theorem, and explained in the subsequent two remarks:
where $w_i$ denotes the initial weight for the $i$-th training sample.
The simple regressor given by (12) calculates a weighted average of the difference between the random output variables and the previous estimate of the function within each of the disjoint regions defined by the tree. This formally defines the behavior of the CART algorithm.
The simple regressor given by (13) calculates a weighted average of the difference between the random output variables and the previous estimate of the function, given by a piece-wise constant function defined over the overlapping region determined by the latest stump. This formally defines the behavior of the GBS algorithm.