Tree-Structured Boosting: Connections Between Gradient Boosted Stumps and Full Decision Trees

11/18/2017 · José Marcio Luna, et al. · University of Pennsylvania · UC San Francisco · University of Maryland Medical Center

Additive models, such as those produced by gradient boosting, and full interaction models, such as classification and regression trees (CART), are widely used algorithms that have been investigated largely in isolation. We show that these models exist along a spectrum, revealing previously unknown connections between the two approaches. This paper introduces a novel technique called tree-structured boosting for creating a single decision tree, and shows that this method can produce models equivalent to CART or gradient boosted stumps at the extremes by varying a single parameter. Although tree-structured boosting is designed primarily to provide both the model interpretability and the predictive performance needed for high-stakes applications like medicine, it can also produce decision trees, represented by hybrid models between CART and boosted stumps, that can outperform either of these approaches.

1 Introduction

Classification And Regression Tree (CART) analysis Breiman et al. (1984) is a well-established statistical learning technique, which has been adopted by numerous other fields for its model interpretability, scalability to large data sets, and connection to rule-based decision making Loh (2014). CART builds a model by recursively partitioning the instance space, labeling each partition with either a predicted category (in the case of classification) or a real value (in the case of regression). Despite their widespread use, CART models often have lower predictive performance than other statistical learning models, such as kernel methods and ensemble techniques Caruana and Niculescu-Mizil (2006). Among the latter, boosting methods were developed as a means to train an ensemble of weak learners (often CART models) iteratively into a high-performance predictive model, albeit with a loss of model interpretability. In particular, gradient boosting methods Friedman (2001) focus on iteratively optimizing an ensemble’s prediction to increasingly match the labeled training data. Historically these two categories of approaches, CART and gradient boosting, have been studied separately, connected primarily through CART models being used as the weak learners in boosting. This paper investigates a deeper and surprising connection between full interaction models like CART and additive models like gradient boosting, showing that the resulting models exist along a spectrum. In particular, this paper includes the following contributions:

  • We introduce tree-structured boosting (TSB) as a new mechanism for creating a hierarchical ensemble model that recursively partitions the instance space, forming a perfect binary tree of weak learners. Each path from the root node to a leaf represents the outcome of a gradient boosted stumps (GBS) ensemble for a particular partition of the instance space.

  • We prove that TSB generates a continuum of single-tree models with accuracy between CART and GBS, controlled via a single tunable parameter. In effect, TSB bridges between CART and GBS, identifying previously unidentified connections between additive and full interaction models.

  • This result is verified empirically, showing that this hybrid combination of CART and GBS can outperform either approach individually in terms of accuracy and/or interpretability while building a single tree. Our experiments also provide insight into the continuum of models revealed by TSB.

2 Connections between CART and Boosting

Assume we are given a training set of d-dimensional instances, each with a corresponding label, drawn i.i.d. from an unknown distribution. In a classification setting the labels are discrete categories; in regression they are real-valued. The goal is to learn a function that will perform well in predicting the label on new examples drawn from the same distribution. CART analysis recursively partitions the instance space, assigning a single label to each partition. In this manner, there is full interaction between each component of the model. Different branches of the tree are trained with disjoint subsets of the data, as shown in Figure 1.
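As a concrete, simplified illustration of the two families discussed here, the following sketch uses scikit-learn (our choice for illustration, not something used in the paper) to fit a CART-style tree and an ensemble of gradient boosted stumps on the same synthetic data.

```python
# Illustrative only: off-the-shelf stand-ins for the two model families,
# a single deep tree (full interaction) vs. gradient boosted stumps (additive).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

cart = DecisionTreeClassifier(max_depth=10).fit(X_tr, y_tr)
gbs = GradientBoostingClassifier(max_depth=1, n_estimators=100,
                                 learning_rate=0.1).fit(X_tr, y_tr)

print("CART test accuracy:", accuracy_score(y_te, cart.predict(X_te)))
print("GBS  test accuracy:", accuracy_score(y_te, gbs.predict(X_te)))
```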

In contrast, boosting iteratively trains an ensemble of weak learners such that the model is a weighted sum of the weak learners’ predictions. (In classification, the sign of this sum gives the predicted class; CART models are often used as the weak learners.) Each boosted weak learner is trained with a different weighting of the entire data set, unlike CART, repeatedly emphasizing mispredicted instances to induce diversity (Figure 1). Gradient boosting with decision stumps or simple regressors creates a purely additive model, since each new ensemble member serves to reduce the residuals of the previous members Friedman et al. (1998); Friedman (2001). Interaction terms can be included in the overall ensemble by using more complex weak learners, such as deeper trees.
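The additive structure is easiest to see in a minimal least-squares gradient-boosting loop with depth-1 regression trees (stumps) as the weak learners; the function names and default parameters below are ours, for illustration only.

```python
# Minimal gradient boosted stumps for squared loss: F(x) = F0 + nu * sum_t h_t(x).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbs(X, y, n_rounds=50, nu=0.1):
    F0 = float(y.mean())                   # constant initial model
    F = np.full(len(y), F0)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - F                  # negative gradient of the squared loss
        h = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
        F = F + nu * h.predict(X)          # purely additive update
        stumps.append(h)
    return F0, stumps

def predict_gbs(F0, stumps, X, nu=0.1):    # use the same nu as in fit_gbs
    return F0 + nu * sum(h.predict(X) for h in stumps)
```

Each new stump only shifts the running prediction by a piecewise-constant amount in one variable, so the final model is a sum of single-variable functions unless deeper weak learners are used.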

As shown by Valdes et al. (2016), a classifier ensemble with decision stumps as the weak learners can be trivially rewritten as a complete binary tree whose depth equals the number of stumps: the decision made at each internal node at a given depth is given by the corresponding stump, and the prediction at each leaf is given by the weighted sum of the stump outputs along the path to that leaf. Intuitively, each path through the tree represents the same ensemble, but one that tracks the unique combination of predictions made by each member.
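A small sketch of this unrolling, in our own notation: each stump's per-side contribution (with any ensemble weight already folded in) is accumulated along every path, so routing an instance through the resulting complete binary tree reproduces the ensemble's weighted sum.

```python
# Unroll a stump ensemble into a complete binary tree of depth T = len(stumps):
# level t tests stump t, and each leaf stores the sum accumulated along its path.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Stump:
    feature: int
    threshold: float
    left_value: float      # weighted contribution when x[feature] <= threshold
    right_value: float     # weighted contribution when x[feature] >  threshold

@dataclass
class Node:
    stump: Optional[Stump] = None   # internal node: which stump to test
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: float = 0.0              # leaf: accumulated ensemble prediction

def unroll(stumps, depth=0, partial_sum=0.0):
    if depth == len(stumps):
        return Node(value=partial_sum)
    s = stumps[depth]
    return Node(stump=s,
                left=unroll(stumps, depth + 1, partial_sum + s.left_value),
                right=unroll(stumps, depth + 1, partial_sum + s.right_value))

def tree_predict(node, x):
    while node.stump is not None:
        s = node.stump
        node = node.left if x[s.feature] <= s.threshold else node.right
    return node.value
```

Every root-to-leaf path applies the same stumps in the same order; only the recorded outcomes differ, exactly as described above.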

Figure 1: CART, tree-structured boosting, and standard GBS, each given four training instances (blue and red points). The size of each point depicts its weight when used to train the adjacent node.

2.1 Tree-Structured Boosting

This interpretation of boosting lends itself to the creation of a tree-structured ensemble learner that bridges between CART and gradient boosting. The idea behind tree-structured boosting (TSB) is to grow the ensemble recursively, introducing diversity through the addition of different sub-ensembles after each new weak learner. At each step, TSB first trains a weak learner on the training set with the current instance weights, and then creates a new sub-ensemble for each of the weak learner’s outputs. Each sub-ensemble is subsequently trained on the full training set, but instances corresponding to the respective branch are more heavily weighted during training, yielding diverse sub-ensembles (Figure 1, middle). This process proceeds recursively until the depth limit is reached. Critically, this approach identifies clear connections between CART and gradient boosted stumps (GBS): as the re-weighting ratio is varied, tree-structured boosting produces a spectrum of models with accuracy between CART and GBS at the two extremes.

The complete TSB approach is detailed as Algorithm 1. The parameter λ used in step 8 of the algorithm provides the connection between CART and GBS: TSB converges to CART as λ → 0 and to GBS as λ → ∞. Theoretical analysis of TSB, given in the supplemental material, shows how TSB bridges between CART and GBS.

Algorithm 1: Tree-Structured Boosting (TSB)
Inputs: training data; instance weights (default: uniform); the parameter λ; node depth (default: the root); maximum tree height; node domain (default: the full instance space); prediction function (default: an initial constant estimate)
Outputs: the root node of a hierarchical ensemble
1:  If the maximum height has been reached, return a prediction node that predicts a weighted average computed with the instance weights
2:  Create a new subtree root to hold a weak learner
3:  Compute the negative gradients of the loss
4:  Fit the weak learner to the negative gradients, where a scalar multiplier defines the additive expansion used to update the current function estimate
5:  Let the two partitions be those induced by the fitted weak learner
6:  Solve for the scalar multiplier
7:  Update the current function estimate
8:  Update the left and right subtree instance weights, and normalize them
9:  If the termination criterion is not met, compute the left subtree recursively
10:  If the termination criterion is not met, compute the right subtree recursively
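To make the recursion concrete, here is a minimal Python sketch of Algorithm 1 for least-squares regression with stump weak learners. It is a reconstruction from the description above rather than the authors' code: in particular, the weight-update factors (1 + λ for in-branch instances, λ for out-of-branch instances, followed by renormalization) and the choice to have each leaf predict its path-accumulated value are our assumptions, chosen to be consistent with the stated limits (λ → 0 behaves like CART, λ → ∞ like GBS).

```python
# Sketch of tree-structured boosting (TSB) for least-squares regression.
# ASSUMPTIONS (not taken from the paper): the weight-update factors and the
# leaf prediction rule noted above; lam is assumed to be strictly positive.
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class TSBNode:
    """Internal node: (feature, threshold, left, right); leaf node: value."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.value = left, right, value


def tsb_fit(X, y, w=None, F=None, path_value=None, depth=0, max_height=4,
            lam=1.0, nu=0.1):
    n = len(y)
    w = np.full(n, 1.0 / n) if w is None else w          # default: uniform weights
    if F is None:                                        # root: constant initial estimate
        path_value = np.average(y, weights=w)
        F = np.full(n, path_value)
    if depth == max_height:                              # step 1: prediction (leaf) node
        return TSBNode(value=path_value)
    residuals = y - F                                    # step 3: negative gradients
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals, sample_weight=w)
    feat, thr = stump.tree_.feature[0], stump.tree_.threshold[0]
    if feat < 0:                                         # degenerate stump: no split found
        return TSBNode(value=path_value)
    left_id, right_id = stump.tree_.children_left[0], stump.tree_.children_right[0]
    gamma_left = stump.tree_.value[left_id][0][0]        # weighted mean residual per side
    gamma_right = stump.tree_.value[right_id][0][0]
    in_left = X[:, feat] <= thr                          # step 5: induced partitions
    F = F + nu * stump.predict(X)                        # step 7: shared function update

    def grow(mask, gamma):                               # step 8 + recursion (steps 9-10)
        w_child = w * (lam + mask)                       # assumed update: (1 + lam) in, lam out
        w_child = w_child / w_child.sum()                # normalize to a distribution
        return tsb_fit(X, y, w_child, F, path_value + nu * gamma,
                       depth + 1, max_height, lam, nu)

    return TSBNode(feature=feat, threshold=thr,
                   left=grow(in_left, gamma_left), right=grow(~in_left, gamma_right))


def tsb_predict(node, x):
    """Route x down the tree; each leaf stores its path-accumulated prediction."""
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value
```

A classification variant in the spirit of the experiments below would replace the squared-loss residuals with gradients of the negative binomial log-likelihood, as in LogitBoost.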

3 Experiments

In a first experiment, we use real-world data to evaluate the classification error of TSB for different values of λ. We then examine the behavior of the instance weights as λ varies in a second experiment.

3.1 Assessment of TSB Model Performance versus CART and GBS

In this experiment, we use four life science data sets from the UCI repository Lichman (2013): Breast Tissue, Indian Liver Patient Dataset (ILPD), SPECTF Heart Disease, and Wisconsin Breast Cancer. These data sets are all binary classification tasks and contain only numeric attributes with no missing values. We measure the classification error as the value of λ increases. In particular, we assess 10 equidistant values of λ, recording the in-sample and out-of-sample errors of the generated TSB trees, and plot the behavior of the classification errors as functions of λ. The goal is to illustrate the trajectory of the classification errors of TSB, which is expected to approximate the performance of CART as λ → 0 and to converge asymptotically to GBS as λ → ∞.

To ensure a fair comparison, we assessed the classification accuracy of CART and GBS for different depth and learning-rate values over 5-fold cross-validation. We concluded that a tree/ensemble depth of 10 offered near-optimal accuracy, and so used it for all algorithms. The binary classification was carried out via Algorithm 1 using the negative binomial log-likelihood as the loss function, similar to LogitBoost Friedman (2001), which requires an additional learning rate (shrinkage) factor.
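This protocol can be sketched as follows, reusing the hypothetical tsb_fit/tsb_predict helpers from the earlier sketch (a least-squares stand-in; the experiments reported here instead use the negative binomial log-likelihood loss) and assuming labels coded as -1/+1.

```python
# Sketch of the evaluation loop: repeated 10-fold CV over a grid of lambda values.
import numpy as np
from sklearn.model_selection import RepeatedKFold

def sweep_lambda(X, y, lambdas, max_height=10, nu=0.3, n_repeats=20):
    cv = RepeatedKFold(n_splits=10, n_repeats=n_repeats, random_state=0)
    errors = {lam: [] for lam in lambdas}
    for train, test in cv.split(X):
        for lam in lambdas:
            root = tsb_fit(X[train], y[train], max_height=max_height, lam=lam, nu=nu)
            preds = np.array([tsb_predict(root, x) for x in X[test]])
            errors[lam].append(np.mean(np.sign(preds) != y[test]))   # 0/1 error
    # mean error and standard error per lambda value
    return {lam: (np.mean(e), np.std(e) / np.sqrt(len(e))) for lam, e in errors.items()}
```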

Data Set # Instances # Attributes TSB Learning Rate
Breast Tissue 106 9 0.3
ILPD 583 9 0.3
SPECTF 80 44 0.3
Wisconsin Breast Cancer 569 30 0.7
Synthetic 100 2 0.1
Table 1: Data Set Specifications

For each data set, the experimental results were averaged over 20 trials of 10-fold cross-validation, using 90% of the samples for training and the remaining 10% for testing in each trial. The error bars in the plots denote the standard error at each sample point.

The results are presented in Figure 2, showing that the classification error of TSB approaches the CART and GBS errors in the limits of small and large λ, respectively. As expected, increasing λ generally reduces overfitting. However, note that for each data set except ILPD, the lowest test error is achieved by a TSB model between the extremes of CART and GBS. This reveals that hybrid TSB models can outperform either CART or GBS alone.

Figure 2: In-sample and out-of-sample classification errors for different values of λ on (a) Breast Tissue, (b) ILPD, (c) SPECTF, and (d) Breast Cancer. All plots share the same legend and vertical axis label. (Best viewed in color.)

3.2 Effect of λ on the Instance Weights

In a second experiment, we use a synthetic binary-labeled data set to graphically illustrate the behavior of the instance weights as functions of λ. The synthetic data set consists of 100 two-dimensional points, of which 58 belong to the red class and the remaining 42 belong to the green class, as shown in Figure 3. The learning rate was chosen to be 0.1 based on classification accuracy, as in the previous experiment. We recorded the instance weights produced by TSB at different values of λ.

Figure 3 shows a heatmap that linearly interpolates the weights associated with each instance within the disjoint region defined by one of the four leaf nodes of the trained tree; the chosen leaf corresponds to the region in the upper-left corner of each plot.

When λ is near zero, the weights take essentially binary normalized values that produce a sharp differentiation of the surface defined by the leaf node, similar to the behavior of CART, as illustrated in Figure 3(a). As λ increases, the weights become more diffuse in Figures 3(b) and (c), until λ becomes large. At that point, the weights approximate their initial values, as anticipated by the theory. Consequently, the ensembles along each path to a leaf are trained using equivalent instance weights, and therefore are identical and equivalent to GBS.
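The visualization itself can be reproduced along the following lines: interpolate the per-instance weights recorded at the chosen leaf over a grid covering the two-dimensional data and render them as a heatmap. The helper below is a generic sketch; extracting the leaf's weights from a trained tree is assumed to be done separately.

```python
# Heatmap of per-instance weights w over 2-D instances X, via linear interpolation.
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import griddata

def plot_weight_heatmap(X, w, resolution=200):
    gx, gy = np.meshgrid(
        np.linspace(X[:, 0].min(), X[:, 0].max(), resolution),
        np.linspace(X[:, 1].min(), X[:, 1].max(), resolution))
    surface = griddata(X, w, (gx, gy), method="linear")   # NaN outside the convex hull
    plt.imshow(surface, origin="lower", aspect="auto",
               extent=(gx.min(), gx.max(), gy.min(), gy.max()))
    plt.colorbar(label="instance weight")
    plt.scatter(X[:, 0], X[:, 1], c="white", edgecolors="black", s=15)
    plt.show()
```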

Figure 3: Heatmap of the instance weights for one leaf partition (the upper-left region of each plot) at each of the instances (red and green points) as the value of λ increases across panels (a)–(d).

4 Conclusions

We have shown that tree-structured boosting reveals intrinsic connections between additive models (GBS) and full interaction models (CART). As the parameter λ varies from zero to infinity, the models produced by TSB vary between CART and GBS, respectively. This has been shown both theoretically and empirically. Notably, the experiments revealed that a hybrid model between the two extremes of CART and GBS can outperform either of these alone.

References

  • Breiman et al. [1984] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and regression trees. Wadsworth & Brooks, Monterey, CA, 1984.
  • Caruana and Niculescu-Mizil [2006] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Intl. Conf. on Mach. Learn., pages 161–168, Pittsburgh, 2006.
  • Friedman et al. [1998] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Stat., 28(2):337–407, 2000.
  • Friedman [2001] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Stat., 29(5):1189–1232, 2001.
  • Lichman [2013] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
  • Loh [2014] W.-Y. Loh. Fifty years of classification and regression trees. Intl. Stat. Review, 82(3):329–348, 2014.
  • Valdes et al. [2016] G. Valdes, J. M. Luna, E. Eaton, C. B. Simone II, L. H. Ungar, and T. D. Solberg. MediBoost: a patient stratification tool for interpretable decision making in the era of precision medicine. Scientific Reports, 6, 2016.

Appendix A Proof Sketches of Key Lemmas

This section shows that TSB is equivalent to CART as λ → 0 and equivalent to GBS as λ → ∞, thus establishing a continuum between CART and GBS. We include proof sketches for the four lemmas used to prove our main result in Theorem 1.

TSB maintains a perfect binary tree whose internal nodes each correspond to a weak learner. Each weak learner along the path from the root to a leaf prediction node induces two disjoint partitions of the instance space, one for each of its outputs. The partitions chosen along the path to a leaf, one per weak learner, intersect to define the region of the instance space associated with that leaf. TSB predicts a label for each instance in that region via the ensemble consisting of all weak learners along the path to the leaf. To focus each branch of the tree on its corresponding instances, thereby constructing diverse ensembles, TSB maintains a set of weights over all training data; each leaf at a given depth has its own weights, which are used to train the next weak learner at that leaf. Note that, like CART, TSB learns a single tree-structured model, but with better accuracy given its proven connection with GBS.

We train the tree as follows. At each boosting step, we have a current estimate of the function corresponding to a perfect binary tree of the current height. We seek to improve this estimate by replacing each of the leaf prediction nodes with an additional weak learner and corresponding weights, growing the tree by one level. This yields a revised estimate of the function at each terminal node as

(1)

where the binary indicator function is 1 if its predicate is true and 0 otherwise. Since the leaf partitions are disjoint, Equation (1) is equivalent to separate functions, one for each leaf's corresponding ensemble. The goal is to minimize the loss over the data

(2)

by choosing the weak learners and their scalar multipliers at each leaf. Taking advantage again of the independence of the leaves, Equation (2) is minimized by independently minimizing the inner summation at each leaf, i.e.,

(3)

Note that (3) can be solved efficiently via gradient boosting Friedman [2001] of each leaf's ensemble in a level-wise manner through the tree.

Next, we focus on deriving TSB where the weak learners are binary regression trees with least squares as the loss function. Following Friedman [2001], we first estimate the negative unconstrained gradient at each data instance, which under squared loss is equivalent to the residual. Then, we can determine the optimal parameters of the weak learner by solving

(4)

Gradient boosting solves Equation (4) by first fitting the weak learner to the residuals, then solving for the optimal multiplier. For details on gradient boosting, see Friedman [2001]. Adapting TSB to the classification setting, for example using logistic regression base learners and the negative binomial log-likelihood as the loss function, follows directly from Friedman [2001] by using the gradient boosting procedure for classification in place of regression.
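In code, one weighted gradient-boosting step under squared loss looks like the following (our notation): fit a stump to the residuals, then choose the multiplier by a closed-form least-squares line search.

```python
# One weighted gradient-boosting step for squared loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_step(X, y, F, w, nu=0.1):
    residuals = y - F                                          # negative gradient
    h = DecisionTreeRegressor(max_depth=1).fit(X, residuals, sample_weight=w)
    hx = h.predict(X)
    gamma = np.sum(w * residuals * hx) / np.sum(w * hx ** 2)   # argmin_g sum w*(r - g*h)^2
    return h, gamma, F + nu * gamma * hx                       # shrunken additive update
```

For squared loss the fitted stump already outputs the weighted mean residual on each side of its split, so the line search returns gamma = 1; the explicit step matters for other losses such as the negative binomial log-likelihood.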

If all instance weights remained constant, this approach would build a perfect binary tree in which each path from the root to a leaf represents the same ensemble, and so it would be exactly equivalent to gradient boosting of the weak learners. To focus each branch of the tree on its corresponding instances, thereby constructing diverse ensembles, the weights are updated separately for each of a node's two children: instances in the child's corresponding partition have their weights multiplied by a larger factor than instances outside that partition, with the ratio between the two factors controlled by the parameter λ. The update rule for the weights under the two partitions induced by the node's weak learner is given by

(5)

where a normalization constant ensures the weights form a distribution. The initial weights are typically uniform.
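The exact multiplicative factors in update rule (5) are not recoverable from this text; the sketch below assumes in-partition weights are multiplied by 1 + λ and out-of-partition weights by λ before renormalization, which reproduces the limits discussed in the lemmas that follow: λ → 0 zeroes the out-of-partition weights (CART-like), while λ → ∞ leaves the normalized weights unchanged (GBS-like).

```python
# Assumed form of the weight update (5); lam is taken to be strictly positive.
import numpy as np

def update_weights(w, in_partition, lam):
    w_new = w * (lam + in_partition.astype(float))   # (1 + lam) inside, lam outside
    return w_new / w_new.sum()                       # normalize back to a distribution

w0 = np.full(6, 1 / 6)
mask = np.array([True, True, False, False, False, True])
print(update_weights(w0, mask, lam=1e-6))   # ~[1/3, 1/3, 0, 0, 0, 1/3]: CART-like
print(update_weights(w0, mask, lam=1e6))    # ~[1/6] * 6: unchanged, GBS-like
```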

Lemma 1

The weight of a training instance at a leaf after a given number of boosting iterations is given by

(6)

where the partitions appearing in (6) are those along the path from the root to that leaf.

  • This lemma can be shown by induction based on Equation (5).

Lemma 2

Given the weight distribution formula (6) for an instance at a leaf at a given boosting iteration, the following limits hold as λ → 0 and as λ → ∞, respectively:

(7)
(8)

where the relevant region is the intersection of the partitions along the path from the root to the leaf.

  • Both parts follow directly by taking the corresponding limits of Lemma 1.

Lemma 3

The optimal simple regressor that minimizes the loss function (3) at a given iteration and node is given by

(9)
  • For a given region at a given boosting iteration, the simple regressor has the form

    (10)

    with a constant value on each side of the split. We take the derivative of the loss function (3) in each of the two regions induced by the regressor, and solve for where the derivative equals zero, obtaining (9).

Lemma 4

Consider the TSB update rule for the function estimate. If the estimate at a node is defined as a sum of constant terms as in (10), then the estimate itself is constant, as given by

(11)
  • The proof is by induction, building upon (10): each term can be shown to be constant, so their sum is constant, and therefore the lemma holds under the given update rule.

Building upon these four lemmas, our main theoretical result is presented in the following theorem, and explained in the subsequent two remarks:

Theorem 1

Given the TSB optimal simple regressor (9) that minimizes the loss function (3), the following limits hold with respect to the parameter λ of the weight update rule (5):

(12)
(13)

where the weights appearing in the limit (13) are the initial weights of the training samples.

  • The limit (12) follows from applying (7) from Lemma 2 to (9) from Lemma 3, together with the constant estimate (11) from Lemma 4. Similarly, the limit (13) follows from applying (8) from Lemma 2 to (9) from Lemma 3.

Remark 1

The simple regressor given by (12) calculates a weighted average of the difference between the output variables and the previous estimate of the function within the disjoint regions defined by the partitions along the path. This formally characterizes the behavior of the CART algorithm.

Remark 2

The simple regressor given by (13) calculates a weighted average of the difference between the output variables and the previous estimate of the function, which is a piecewise-constant function defined over the overlapping region determined by the latest stump. This formally characterizes the behavior of the GBS algorithm.

Remark 3

Without loss of generality, let us replace the loss function in (3) with the squared error. Substituting the limiting regressors given in Theorem 1, the criterion reduces to the mean squared error within each region as λ → 0 and to fitting the residuals as λ → ∞, which correspond to the loss functions optimized by CART and GBS, respectively.