Gradient Boosting Machine (GBM) is a powerful supervised learning algorithm that combines multiple weak learners into an ensemble with excellent predictive performance. GBM works very well for a number of tasks like spam filtering, online advertising, fraud detection, anomaly detection, and computational physics (e.g., the Higgs Boson discovery), and has routinely featured as a top algorithm in Kaggle competitions and the KDDCup. GBM can naturally handle heterogeneous datasets (highly correlated data, missing data, categorical data, etc.). It is also quite easy to use, with several publicly available implementations: scikit-learn, R gbm, LightGBM, XGBoost, TF Boosted Trees, etc.
In spite of the practical success of GBM, there is a considerable gap in its theoretical understanding. The traditional interpretation of GBM is to view it as a form of steepest descent in functional space [23, 14]. While this interpretation serves as a good starting point, such a framework lacks rigorous non-asymptotic convergence guarantees, especially when compared to the growing body of literature on first-order convex optimization.
In the convex optimization literature, Nesterov's acceleration is a successful technique for speeding up the convergence of first-order methods. In this work, we show how to incorporate Nesterov momentum into the gradient boosting framework in order to obtain an accelerated gradient boosting machine.
1.1 Our contributions
We propose the first accelerated gradient boosting algorithm that comes with strong theoretical guarantees and can be used with any type of weak learner. In particular:
In a number of numerical experiments with weak tree learners (one of the most popular types of GBM), we confirm the effectiveness of AGBM.
Apart from the theoretical contributions, we pave the way for speeding up some practical applications of GBMs, which currently require a large number of boosting iterations. For example, GBMs with boosted trees for multi-class problems are commonly implemented as a number of one-vs-rest learners, resulting in more complicated decision boundaries and potentially a larger number of required boosting iterations. Additionally, it is common practice to build many very weak learners for problems where it is easy to overfit. Such large ensembles result not only in slow training, but also in slower inference. AGBM can potentially benefit all these applications.
1.2 Related Literature
Convergence Guarantees for GBM: After GBM was first introduced by Friedman et al., several works established its guaranteed convergence, without explicitly stating the convergence rate [8, 23]. Subsequently, when the loss function is both smooth and strongly convex, an exponential convergence rate was proved—more precisely, $O(\log(1/\varepsilon))$ iterations are sufficient to ensure that the training loss is within $\varepsilon$ of its optimal value. A later study of the primal-dual structure of GBM demonstrated that in fact only $O(\log(1/\varepsilon))$ iterations are needed; however, the constants in the rate were non-standard and less intuitive. This result was recently improved upon by works that showed a similar convergence rate but with more transparent constants, such as the smoothness and strong convexity constants of the loss function, as well as the density of the weak learners. Additionally, if the loss function is assumed to be smooth and convex (but not necessarily strongly convex), $O(1/\varepsilon)$ iterations are known to be sufficient. We refer the reader to the works cited therein for a more detailed literature review of the theoretical results on GBM convergence.
Accelerated Gradient Methods: For optimizing a smooth convex function, Nesterov showed that the standard gradient descent algorithm can be made much faster, resulting in the accelerated gradient descent method. While gradient descent requires $O(1/\varepsilon)$ iterations to reach an $\varepsilon$-suboptimal point, accelerated gradient methods only require $O(1/\sqrt{\varepsilon})$. In general, this rate of convergence is optimal and cannot be improved upon. Since its introduction in 1983, mainstream interest in Nesterov's accelerated method started around 15 years ago; yet even today most researchers struggle to find basic intuition as to what is really going on in accelerated methods. This lack of intuition about the estimation-sequence proof technique used by Nesterov has motivated many recent works trying to explain the acceleration phenomenon [33, 37, 16, 19, 15, 1, 5]. Some have recently attempted to give a physical explanation of acceleration techniques by studying the continuous-time interpretation of accelerated gradient descent via dynamical systems [33, 37, 16].
Accelerated Greedy Coordinate and Matching Pursuit Methods: Recent works discussed how to accelerate matching pursuit and greedy coordinate descent algorithms. Their methods, however, require a random step and are hence only 'semi-greedy', which does not fit the boosting framework.
2 Gradient Boosting Machine
We consider a supervised learning problem with $n$ training examples $(x_i, y_i)$, $i = 1, \dots, n$, such that $x_i$ is the feature vector of the $i$-th example and $y_i$ is a label (in a classification problem) or a continuous response (in a regression problem). In the classical version of GBM, we assume we are given a base class of learners $\mathcal{B}$ and that our target function class is the linear combination of such base learners (denoted by $\mathrm{lin}(\mathcal{B})$). Let $\mathcal{B} = \{b_\tau(x)\}$ be a family of learners parameterized by $\tau$. The prediction corresponding to a feature vector $x$ is given by an additive model of the form
$$f(x) = \sum_{m=1}^{M} \beta_m\, b_{\tau_m}(x),$$
where $b_{\tau_m}$ is a weak learner and $\beta_m$ is its corresponding additive coefficient. Here, $\beta_m$ and $\tau_m$ are chosen in an adaptive fashion in order to improve the data-fidelity, as discussed below. Examples of learners commonly used in practice include wavelet functions, support vector machines, and classification and regression trees. We assume the set of weak learners is scalable, namely that the following assumption holds: if $b_\tau \in \mathcal{B}$, then $\lambda b_\tau \in \mathcal{B}$ for any scalar $\lambda$. Assumption 2 holds for most sets of weak learners we are interested in. Indeed, scaling a weak learner is equivalent to modifying its coefficient, so it does not change the structure of $\mathcal{B}$.
The goal of GBM is to obtain a good estimate of the function $f$ that approximately minimizes the empirical loss
$$\min_{f \in \mathrm{lin}(\mathcal{B})} \; L(f) := \sum_{i=1}^{n} \ell\big(y_i, f(x_i)\big), \qquad (2)$$
where $\ell(y_i, f(x_i))$ is a measure of the data-fidelity for the $i$-th sample for the loss function $\ell$.
2.1 Best Fit Weak Learners
The original version of GBM, presented in Algorithm 1, can be viewed as minimizing the loss function in (2) by applying an approximate steepest descent algorithm. GBM starts from a null function $f^0 \equiv 0$ and at each iteration computes the pseudo-residual $r^m$ (namely, the negative gradient of the loss function with respect to the predictions made so far):
$$r_i^m = -\left.\frac{\partial \ell(y_i, p)}{\partial p}\right|_{p = f^m(x_i)}. \qquad (3)$$
Then a weak learner that best fits the current pseudo-residual in terms of the least-squares loss is computed as follows:
$$\tau_m = \arg\min_{\tau} \sum_{i=1}^{n} \left(r_i^m - b_\tau(x_i)\right)^2. \qquad (4)$$
This weak learner is added to the model with a coefficient found via a line search. As the iterations progress, GBM leads to a sequence of functions $\{f^m\}$ (where $\{f^m\}$ is a shorthand for the set $\{f^0, f^1, \ldots, f^M\}$). The usual intention of GBM is to stop early—before one is close to a minimum of Problem (2)—with the hope that such a model will lead to good predictive performance [14, 11, 38, 6].
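To make the two steps above concrete, here is a minimal sketch of this loop in Python, under simplifying assumptions: the loss is least squares (so the pseudo-residual is simply $y - f^m(X)$), the weak learners are regression stumps fit by exhaustive search, the line search is replaced by a fixed step-size, and each stump is represented only by its predictions on the training set (a real implementation would store the split so it can predict on unseen data). All names are illustrative.

```python
import numpy as np

def fit_stump(X, r):
    """Exhaustively fit a regression stump to the residual r.

    Returns the stump's predictions on the training set; a real
    implementation would also store (feature, threshold, leaf values).
    """
    n, p = X.shape
    best_sse, best_pred = np.inf, None
    for j in range(p):
        order = np.argsort(X[:, j])
        for cut in range(1, n):
            left, right = order[:cut], order[cut:]
            pred = np.empty(n)
            pred[left] = r[left].mean()
            pred[right] = r[right].mean()
            sse = ((r - pred) ** 2).sum()
            if sse < best_sse:
                best_sse, best_pred = sse, pred
    return best_pred

def gbm(X, y, n_rounds=30, eta=0.5):
    """GBM sketch: least-squares loss, fixed step-size (no line search)."""
    f = np.zeros(len(y))
    for _ in range(n_rounds):
        r = y - f                      # pseudo-residual of 1/2 (y - f)^2
        f = f + eta * fit_stump(X, r)  # add the best-fit weak learner
    return f
```

The fixed step-size stands in for the line search of the original algorithm; see the Appendix for why this substitution is common in practice.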
Perhaps the most popular learners are classification and regression trees (CART), resulting in Gradient Boosted Decision Tree (GBDT) models. These are the models that we use in our numerical experiments. At the same time, we would like to highlight that our algorithm is not tied to a particular type of weak learner and is a general algorithm.
|$(x_i, y_i)$||The features and the label of the $i$-th sample.|
|$X$||The feature matrix for all training data.|
|$b_\tau$ (function)||Weak learner parameterized by $\tau$.|
|$b_\tau(X)$||A vector of predictions $[b_\tau(x_i)]_{i=1}^n$.|
|$f^m$ (function)||Ensemble of weak learners at the $m$-th iteration.|
|$f(X)$||A vector of $[f(x_i)]_{i=1}^n$ for any function $f$.|
|$g^m, h^m$ (functions)||Auxiliary ensembles of weak learners at the $m$-th iteration.|
|$r^m$||Pseudo-residual at the $m$-th iteration.|
|$c^m$||Corrected pseudo-residual at the $m$-th iteration.|
3 Accelerated Gradient Boosting Machine (AGBM)
Given the success of accelerated gradient descent as a first-order optimization method, it seems natural to attempt to accelerate GBMs. As a warm-up, we first look at how to obtain an accelerated boosting algorithm when our class of learners is strong (complete) and can exactly fit any pseudo-residual. This assumption is quite unrealistic, but it serves to illuminate the connection between boosting and first-order optimization. We then describe our actual algorithm, which works for any class of weak learners.
3.1 Boosting with strong learners
In this subsection, we assume the class of learners $\mathcal{B}$ is strong, i.e. for any pseudo-residual $r \in \mathbb{R}^n$, there exists a learner $b \in \mathcal{B}$ such that $b(x_i) = r_i$ for $i = 1, \dots, n$.
Of course, the entire point of boosting is that the learners are weak and thus the class is not strong, so this is not a realistic assumption. Nevertheless, this section provides the intuition behind the development of AGBM.
In GBM we compute the pseudo-residual in (3) as the negative gradient of the loss function over the predictions made so far. A gradient descent step in functional space would try to find $b \in \mathcal{B}$ such that $b(x_i) = r_i^m$ for $i = 1, \dots, n$. Here $\eta$ denotes the step-size of our algorithm. Since our class of learners is rich, we can choose $b$ to exactly satisfy the above equation.
Thus GBM (Algorithm 1) then has the following update:
$$f^{m+1} = f^m + \eta\, r^m,$$
where $r^m$ denotes the learner that exactly fits the pseudo-residual, i.e. $r^m(x_i) = r_i^m$. In other words, GBM performs exactly functional gradient descent when the class of learners is strong, and so it converges at a rate of $O(1/m)$. Akin to the above argument, we can perform functional accelerated gradient descent, which has the accelerated rate of $O(1/m^2)$. In the accelerated method, we maintain three model ensembles: $f$, $g$, and $h$, of which $f$ is the only model finally used to make predictions at inference time. The ensemble $h$ is the momentum sequence and $g$ is a weighted average of $f$ and $h$ (refer to Table 1 for a list of all notations used). These sequences are updated as follows for a step-size $\eta$ and $\theta_m = 2/(m+2)$:
$$g^m = (1-\theta_m)\, f^m + \theta_m\, h^m, \qquad f^{m+1} = g^m + \eta\, b_{\tau_m}, \qquad h^{m+1} = h^m + \frac{\eta}{\theta_m}\, b_{\tau_m}, \qquad (5)$$
where $b_{\tau_m}$ satisfies
$$b_{\tau_m}(x_i) = r_i^m := -\left.\frac{\partial \ell(y_i, p)}{\partial p}\right|_{p = g^m(x_i)} \quad \text{for } i = 1, \dots, n. \qquad (6)$$
Note that the pseudo-residual $r^m$ is computed w.r.t. $g^m$ instead of $f^m$. The update above can be rewritten as
$$f^{m+1} = f^m + \theta_m \left(h^m - f^m\right) + \eta\, r^m.$$
If $\theta_m = 1$, we see that we recover the standard functional gradient descent with step-size $\eta$. For $\theta_m \in (0, 1)$, there is an additional momentum in the direction of $h^m - f^m$.
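The strong-learner updates in (5) can be simulated directly in prediction space: when the class is strong, the "learner" added at each step is exactly the pseudo-residual vector. The sketch below assumes the least-squares loss, for which $r^m = y - g^m(X)$; the step-size and names are illustrative.

```python
import numpy as np

def accelerated_functional_gd(y, n_rounds=50, eta=0.1):
    """Functional accelerated GD with a strong learner class:
    the learner added at each step is exactly the pseudo-residual.
    f, g, h are represented by their training-set prediction vectors."""
    n = len(y)
    f = np.zeros(n)
    h = np.zeros(n)
    for m in range(n_rounds):
        theta = 2.0 / (m + 2)
        g = (1 - theta) * f + theta * h   # momentum-averaged ensemble
        r = y - g                         # pseudo-residual at g (least squares)
        f = g + eta * r                   # gradient step from g
        h = h + (eta / theta) * r         # momentum-sequence update
    return f
```

Even with a conservative step-size, the $O(1/m^2)$ rate drives the training predictions close to $y$ within a few dozen iterations.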
3.2 Boosting with weak learners
In this subsection, we consider the general case without assuming that the class of learners is strong. Indeed, the class of learners is usually quite simple, and it is very likely that for a given residual there is no learner in $\mathcal{B}$ that fits it exactly. We call this case boosting with weak learners. Our task then is to modify (5) to obtain a truly accelerated gradient boosting machine.
First, the update to the $f$ sequence mirrors that in (5), with the exact fit replaced by a best-approximating weak learner. In particular, we compute the pseudo-residual $r^m$ w.r.t. $g^m$ as in (6) and find a parameter $\tau_{m,1}$ such that
$$\tau_{m,1} = \arg\min_{\tau} \sum_{i=1}^{n} \left(r_i^m - b_\tau(x_i)\right)^2.$$
Secondly, and more crucially, the update to the momentum model $h$ is decoupled from the update to the $f$ sequence. We use an error-corrected pseudo-residual $c^m$ instead of directly using $r^m$. Suppose that at iteration $m-1$ a weak learner $b_{\tau_{m-1,2}}$ was added to $h$. Then the error-corrected residual is defined inductively as follows: $c^0 = r^0$ and, for $m \ge 1$,
$$c_i^m = r_i^m + \frac{m+1}{m+2}\left(c_i^{m-1} - b_{\tau_{m-1,2}}(x_i)\right),$$
and then we compute
$$\tau_{m,2} = \arg\min_{\tau} \sum_{i=1}^{n} \left(c_i^m - b_\tau(x_i)\right)^2, \qquad h^{m+1} = h^m + \frac{\eta}{\theta_m}\, b_{\tau_{m,2}}.$$
Thus at each iteration two weak learners are computed: $b_{\tau_{m,1}}$, which approximates the residual $r^m$, and $b_{\tau_{m,2}}$, which approximates the error-corrected residual $c^m$. Note that if our class of learners is complete, then $b_{\tau_{m,1}}(x_i) = r_i^m$, $c_i^m = r_i^m$, and $b_{\tau_{m,2}}(x_i) = c_i^m$. This reverts back to our accelerated gradient boosting algorithm for strong learners described in (5).
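The two-learner update can be sketched compactly, assuming the least-squares loss and using best single-feature linear fits as a stand-in for the weak learner class (trees would be used in practice). The error-correction recursion follows the inductive definition above; all names and parameters are illustrative.

```python
import numpy as np

def best_linear_feature(X, r):
    """Weak learner: the single scaled feature best fitting residual r."""
    errs = []
    for j in range(X.shape[1]):
        xj = X[:, j]
        a = xj @ r / (xj @ xj)           # least-squares coefficient
        errs.append(((r - a * xj) ** 2).sum())
    j = int(np.argmin(errs))
    xj = X[:, j]
    return (xj @ r / (xj @ xj)) * xj     # predictions on the training set

def agbm(X, y, n_rounds=60, eta=0.1):
    """Two-learner AGBM sketch (least-squares loss), prediction-space view."""
    n = len(y)
    f = np.zeros(n)
    h = np.zeros(n)
    c_prev = prev_fit2 = None
    for m in range(n_rounds):
        theta = 2.0 / (m + 2)
        g = (1 - theta) * f + theta * h
        r = y - g                                   # pseudo-residual at g
        f = g + eta * best_linear_feature(X, r)     # first weak learner
        if c_prev is None:
            c = r                                   # c^0 = r^0
        else:                                       # error-corrected residual
            c = r + (m + 1) / (m + 2) * (c_prev - prev_fit2)
        prev_fit2 = best_linear_feature(X, c)       # second weak learner
        h = h + (eta / theta) * prev_fit2           # corrected momentum step
        c_prev = c
    return f
```

The second fit (to the corrected residual $c^m$) is what keeps the momentum sequence from accumulating the approximation error of the first fit.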
4 Convergence Analysis of AGBM
We first formally define the assumptions required and then outline the computational guarantees for AGBM.
4.1 Assumptions and Definitions
Let us introduce some standard regularity/continuity conditions on the loss function that we require in our analysis. We denote by $\partial \ell(y, p)/\partial p$ the derivative of the bivariate loss function $\ell(y, p)$ w.r.t. the prediction $p$. We say that $\ell$ is $\sigma$-smooth if for any $y$ and predictions $p_1$ and $p_2$, it holds that
$$\ell(y, p_1) \le \ell(y, p_2) + \frac{\partial \ell(y, p_2)}{\partial p}(p_1 - p_2) + \frac{\sigma}{2}(p_1 - p_2)^2.$$
We say $\ell$ is $\mu$-strongly convex (with $\mu > 0$) if for any $y$ and predictions $p_1$ and $p_2$, it holds that
$$\ell(y, p_1) \ge \ell(y, p_2) + \frac{\partial \ell(y, p_2)}{\partial p}(p_1 - p_2) + \frac{\mu}{2}(p_1 - p_2)^2.$$
Note that $\mu \le \sigma$ always. Smoothness and strong convexity mean that the function $\ell$ is (respectively) upper and lower bounded by quadratic functions. Intuitively, smoothness implies that the gradient does not change abruptly and hence $\ell$ is never 'sharp'. Strong convexity implies that $\ell$ always has some 'curvature' and is never 'flat'.
The notion of Minimal Cosine Angle (MCA), introduced in earlier work, plays a central role in our convergence analysis of GBM. MCA measures how well the best weak learner approximates a desired residual. Let $r \in \mathbb{R}^n$ be a vector. The Minimal Cosine Angle is defined as the worst-case similarity between $r$ and the output of the best-fit learner:
$$\Theta := \min_{r \in \mathbb{R}^n} \; \max_{\tau} \; \cos\!\left(r,\, b_\tau(X)\right),$$
where $b_\tau(X)$ is the vector of predictions $[b_\tau(x_i)]_{i=1}^n$.
The quantity $\Theta \in (0, 1]$ measures how "dense" the learners are in the prediction space. For strong learners (as in Section 3.1), the prediction space is complete and $\Theta = 1$. For a complex class of learners such as deep trees, we expect the prediction space to be dense and $\Theta \approx 1$. For a simpler class such as tree stumps, $\Theta$ would be much smaller. We refer to prior work for a further discussion of $\Theta$.
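The MCA is a worst case over all residuals and is hard to compute exactly, but a rough Monte-Carlo estimate illustrates the intuition that richer learner classes have larger $\Theta$. The sketch below is purely illustrative: it approximates the max over the class by a finite set of candidate prediction vectors and the min by sampling random residuals.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity up to sign (the learner class is scalable)."""
    return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def empirical_mca(learner_preds, n, n_trials=200, seed=0):
    """Monte-Carlo estimate of the MCA: the worst (over sampled residuals r)
    best cosine similarity achievable by the given prediction vectors."""
    rng = np.random.RandomState(seed)
    worst = 1.0
    for _ in range(n_trials):
        r = rng.randn(n)
        best = max(cosine(r, b) for b in learner_preds)
        worst = min(worst, best)
    return worst
```

Enlarging the candidate set can only increase the per-residual best fit, so the estimate for a richer class is at least that of a subclass, mirroring the behavior of $\Theta$.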
4.2 Computational Guarantees
We are now ready to state the main theoretical result of our paper. Consider the Accelerated Gradient Boosting Machine (Algorithm 2). Suppose $\ell$ is $\sigma$-smooth, and suppose the step-size $\eta$ and the momentum parameters $\theta_m$ are set as prescribed functions of $\sigma$ and the MCA $\Theta$ introduced in Definition 4.1. Then for all $M \ge 0$, we have $L(f^M) - L(f^*) = O(1/M^2)$, with a constant depending on $\eta$, $\Theta$, and the distance from the initial ensemble to an optimal one $f^*$.
Here we only give an outline; the full proof can be found in the Appendix (Section B). We use the potential-function based analysis of accelerated methods (cf. [36, 37]). Recall that $\theta_m = 2/(m+2)$. For the proof, we introduce a sequence of auxiliary ensembles derived from $f^m$, $g^m$, and $h^m$.
Let us define a sequence of potentials $V_m$ that combines the suboptimality $L(f^m) - L(f^*)$ with the distance between the momentum ensemble $h^m$ and an optimal ensemble $f^*$, measured on the training predictions.
Typical proofs of accelerated algorithms show that the potential $V_m$ is a decreasing sequence. In boosting, we use the weak learner that best fits the pseudo-residual of the loss. This can guarantee sufficient decay of the first term of $V_m$, related to the loss $L$. However, there is no guarantee that the same weak learner also provides sufficient decay of the second term, as we do not know the optimal ensemble $f^*$ a priori. That is the major challenge in the development of AGBM.
We instead show that the potential decreases up to an error term $e_m$:
$$V_{m+1} \le V_m + e_m,$$
where $e_m$ is an error term depending on the corrected residuals (see Lemma B for the exact definition of $e_m$ and the proof of the claim). By telescoping, it holds that
$$V_M \le V_0 + \sum_{m=0}^{M-1} e_m.$$
Finally, a careful analysis of the error term (Lemma B) shows that $e_m \le 0$ for any $m$ under our choice of step-size. Therefore $V_M \le V_0$, which furnishes the proof after substituting the definitions of $V_M$ and $\theta_M$. ∎
5 Extensions and Variants
In this section we study two more practical variants of AGBM. First, we see how to restart the algorithm to take advantage of the strong convexity of the loss function. Then we study a straightforward approach to accelerated GBM, which we call the vanilla accelerated gradient boosting machine (VAGBM), a variant of a recently proposed algorithm that, however, comes without any theoretical guarantees.
5.1 Restart and Linear Convergence
It is common to show a linear rate of convergence for GBM methods by additionally assuming that the loss function is $\mu$-strongly convex. It is then relatively straightforward to recover an accelerated linear rate of convergence by restarting Algorithm 2.
Consider Accelerated Gradient Boosting with Restarts with Option 1 (Algorithm 3). Suppose that $\ell$ is $\sigma$-smooth and $\mu$-strongly convex, and that the step-size and momentum parameter are chosen as in Theorem 4.2. Then the suboptimality $L(f_T) - L(f^*)$ decreases geometrically in the number of restarts $T$, where $L(f^*)$ is the optimal loss.
The loss function $\ell$ is $\mu$-strongly convex, which implies that
$$\frac{\mu}{2}\, \big\| f(X) - f^*(X) \big\|^2 \le L(f) - L(f^*).$$
Substituting this into Theorem 4.2 shows that each restart phase shrinks the suboptimality by a constant factor. Recalling the length of each phase and the values of $\eta$ and $\theta_m$ gives the required statement. ∎
The restart strategy in Option 1 requires knowledge of the strong-convexity constant $\mu$. Alternatively, one can use an adaptive restart strategy (Option 2), which is known to have good empirical performance. Theorem 5.1 shows that $O\big(\sqrt{\sigma/\mu}\,\log(1/\varepsilon)\big)$ weak learners are sufficient to obtain an error of $\varepsilon$ using AGBMR (Algorithm 3). In contrast, standard GBM (Algorithm 1) requires $O\big((\sigma/\mu)\log(1/\varepsilon)\big)$ weak learners. Thus AGBMR is significantly better than GBM only if the condition number $\sigma/\mu$ is large. When $\ell$ is the least-squares loss, $\sigma = \mu$ and we would see no advantage of acceleration. However, for more complicated loss functions with $\sigma/\mu \gg 1$ (e.g. the logistic loss or the exp loss), AGBMR might produce an ensemble that is significantly better (e.g. obtains lower training loss) than that of GBM for the same number of weak learners.
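The effect of restarting is easiest to see on a toy strongly convex quadratic in $\mathbb{R}^2$ rather than in function space. The sketch below runs accelerated updates with the same $\theta_m = 2/(m+2)$ schedule and resets the momentum every `period` iterations; the problem and parameters are illustrative.

```python
import numpy as np

def agd(grad, x0, eta, n_iters):
    """Plain accelerated gradient descent (no strong-convexity knowledge)."""
    x = x0.copy()
    h = x0.copy()
    for m in range(n_iters):
        theta = 2.0 / (m + 2)
        g = (1 - theta) * x + theta * h
        d = grad(g)
        x = g - eta * d                 # gradient step from the average
        h = h - (eta / theta) * d       # momentum-sequence step
    return x

def agd_restart(grad, x0, eta, n_restarts, period):
    """Restarted AGD: resetting the momentum every `period` iterations
    recovers a linear rate when the objective is strongly convex."""
    x = x0.copy()
    for _ in range(n_restarts):
        x = agd(grad, x, eta, period)
    return x
```

Each restart phase contracts the distance to the optimum by a constant factor, which compounds into geometric (linear) convergence across phases.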
5.2 A Vanilla Accelerated Gradient Boosting Method
A natural question to ask is whether, instead of adding two learners at each iteration, we can get away with adding only one. Below we show what such an algorithm would look like and argue that it may not always converge.
Algorithm 4 is equivalent to the recently developed accelerated gradient boosting algorithms of [2, 10]. Unfortunately, it may not always converge to an optimum and may even diverge. This is because the learner from Step (2) is only an approximate fit to the pseudo-residual, meaning that we take only an approximate gradient descent step. While this is not an issue in the non-accelerated version, in Step (2) of Algorithm 4 the momentum term pushes the sequence to take a large step along the approximate gradient direction. This exacerbates the effect of the approximation error and can lead to an additive accumulation of error, as shown in prior work. In Section 6.1, we see that this is not just a theoretical concern: Algorithm 4 also diverges in practice in some situations. Our corrected residual in Algorithm 2 was crucial to the theoretical proof of convergence in Theorem 4.2. One extension could be to introduce the error-corrected residual in Step (5) of Algorithm 4, just as in Algorithm 2.
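For completeness, here is what the single-learner variant looks like in the same simplified setting as the earlier sketches (least-squares loss, single-feature linear weak learners): the one approximate fit is reused for both the gradient step and the momentum step, which is exactly where the error accumulation described above can arise. Names and parameters are illustrative.

```python
import numpy as np

def best_linear_feature(X, r):
    """Weak learner: the single scaled feature best fitting residual r."""
    errs = [((r - (X[:, j] @ r / (X[:, j] @ X[:, j])) * X[:, j]) ** 2).sum()
            for j in range(X.shape[1])]
    j = int(np.argmin(errs))
    return (X[:, j] @ r / (X[:, j] @ X[:, j])) * X[:, j]

def vagbm(X, y, n_rounds=30, eta=0.05):
    """Single-learner accelerated boosting sketch: the SAME approximate
    fit drives both f and the momentum sequence h, so the fitting error
    is amplified by the large eta/theta factor and may accumulate."""
    n = len(y)
    f = np.zeros(n)
    h = np.zeros(n)
    for m in range(n_rounds):
        theta = 2.0 / (m + 2)
        g = (1 - theta) * f + theta * h
        b = best_linear_feature(X, y - g)   # one weak learner per round
        f = g + eta * b
        h = h + (eta / theta) * b           # momentum reuses the same fit
    return f
```

On easy problems with a small step-size this variant can behave well, but, as discussed above, there is no guarantee; AGBM's second, corrected fit is what removes this failure mode.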
6 Numerical Experiments
In this section, we present the results of computational experiments and discuss the performance of AGBM with trees as weak learners. Subsection 6.1 demonstrates that the algorithm described in Section 5.2 may diverge numerically; Subsection 6.2 shows the training and testing performance of GBM and AGBM with different parameters; and Subsection 6.3 compares the performance of GBM and AGBM with the best tuned parameters. The code for the numerical experiments will also be open-sourced.
Datasets: Table 2 summarizes the basic statistics of the LIBSVM datasets that were used. For each dataset, we randomly choose 80% of the samples as the training set and use the remainder as the testing set.
[Table 2: for each dataset, its task, number of samples, and number of features.]
AGBM with CART trees: In our experiments, all algorithms use CART trees as the weak learners. For classification problems, we use the logistic loss, and for regression problems, we use the least-squares loss. To reduce the computational cost, for each split and each feature, we consider 100 quantiles (instead of potentially all $n$ values). These strategies are commonly used in implementations of GBM like [7, 30].
6.1 Evidence that VAGBM May Diverge
Figure 1 shows the training loss versus the number of trees for the housing dataset, for VAGBM and for AGBM with different momentum parameters $\theta$, at two step-sizes. The $x$-axis is the number of trees added to the ensemble (recall that our AGBM algorithm adds two trees to the ensemble per iteration, so the number of boosting iterations of VAGBM and AGBM differs). As we can see, when the step-size is large, the training loss of VAGBM diverges very quickly, while our AGBM with a proper momentum parameter converges. When the step-size gets smaller, the training loss of VAGBM may decay faster than that of our AGBM at the beginning, but it gets stuck and never converges to the true optimal solution; eventually the training loss of VAGBM may even diverge. On the other hand, our theory guarantees that AGBM always converges to the optimal solution.
6.2 AGBM Sensitivity to the hyperparameters
In this section we look at how the two parameters $\eta$ and $\theta$ affect the performance of AGBM. Figure 2 shows the training loss and the testing loss versus the number of trees for the a1a dataset with two different step-sizes (recall that AGBM adds two trees per iteration). When the step-size is large (with the logistic loss, there is a largest step-size that guarantees convergence), the training loss decays very quickly, and the traditional GBM can converge even faster than our AGBM at the beginning. But the testing performance suffers, demonstrating that such fast convergence (due to the learning rate) can result in severe overfitting. In this case, our AGBM with a proper momentum parameter has slightly better testing performance. When the step-size becomes smaller, the testing performance of all algorithms becomes more stable, though training becomes slower. AGBM with a proper momentum parameter may require fewer iterations/trees to reach good training/testing performance.
6.3 Experiments with Fine Tuning
In this section we look at the testing performance of GBM, VAGBM and AGBM on six datasets with hyperparameter tuning. We consider depth-limited trees as weak learners, and we stop splitting early when the gain falls below a small threshold (roughly of the same order for these datasets). The hyperparameters and the ranges over which we tuned them are:
step size ($\eta$): separate grids for the least-squares and logistic losses;
number of trees;
momentum parameter $\theta$ (only for AGBM).
For each dataset, we randomly choose a training dataset and use the remainder as the final testing dataset. We use cross-validation on the training dataset to tune the hyperparameters. Instead of going through all possible hyperparameter combinations, we utilize randomized search (RandomizedSearchCV in scikit-learn). As AGBM has more parameters (namely $\theta$), we ran proportionally more iterations of random search for AGBM. Table 3 presents the performance of GBM, VAGBM and AGBM with the tuned parameters. As we can see, the accelerated methods (AGBM and VAGBM) in general require fewer iterations to reach similar or slightly better testing performance than GBM. Compared with VAGBM, AGBM adds two trees per iteration, which can be more expensive, but its performance can be more stable; for example, the testing error of VAGBM on the housing dataset is much larger than that of AGBM.
[Table 3: for each dataset and method (GBM, VAGBM, AGBM), the training loss, testing loss, number of iterations, and number of trees.]
7 Conclusion
In this paper, we proposed a novel Accelerated Gradient Boosting Machine (AGBM), proved its rate of convergence, and introduced a computationally inexpensive practical variant of AGBM that takes advantage of the strong convexity of the loss function and achieves linear convergence. Finally, we demonstrated with a number of numerical experiments the effectiveness of AGBM over conventional GBM in obtaining a model with good training and/or testing data fidelity.
Appendix A Additional Discussions
Below we include some additional discussions which could not fit into the main paper but which nevertheless help to understand the relevance of our results when applied to frameworks typically used in practice.
A.1 Line Search in Boosting
Traditionally, the analysis of gradient boosting methods has focused on algorithms which use a line search to select the step-size (e.g. Algorithm 1). The analysis of gradient descent suggests that this is not necessary: using a fixed step-size of $1/\sigma$, where the loss is $\sigma$-smooth, is sufficient. Our accelerated Algorithm 2 also adopts this fixed step-size strategy. In fact, even the standard boosting libraries (XGBoost and TFBT) typically use a fixed (but tuned) step-size and avoid an expensive line search.
A.2 Use of Hessian
Popular boosting libraries such as XGBoost and TFBT compute the Hessian and perform a Newton boosting step instead of a gradient boosting step. Since the Newton step may not be well defined (e.g. if the Hessian is degenerate), an additional Euclidean regularizer is also added. This has been shown to improve performance and reduce the need for a line search for the coefficient sequence [34, 32]. For LogitBoost (i.e. when $\ell$ is the logistic loss), a trust-region Newton method has been demonstrated to significantly improve convergence. Leveraging similar results from second-order methods for convex optimization (e.g. [27, 17]) and adapting accelerated second-order methods would be an interesting direction for future work.
A.3 Out-of-sample Performance
Throughout this work we focus only on minimizing the empirical training loss (see Formula (2)). In reality, what we really care about is the out-of-sample error of the resulting ensemble. A number of regularization tricks, such as i) early stopping, ii) pruning [7, 30], iii) smaller step-sizes, and iv) dropout, are usually employed in practice to prevent over-fitting and improve generalization. Since AGBM requires far fewer iterations than GBM to achieve the same training loss, it outputs a much sparser set of learners. We believe this is partially the reason for its better out-of-sample performance. However, a joint theoretical study of the out-of-sample error along with the empirical error is much needed. It would also shed light on the effectiveness of the numerous ad-hoc regularization techniques currently employed.
Appendix B Proof of Theorem 4.1
This section proves the major theoretical result of the paper:
Theorem 4.1. Consider the Accelerated Gradient Boosting Machine (Algorithm 2). Suppose $\ell$ is $\sigma$-smooth, and suppose the step-size $\eta$ and the momentum parameters $\theta_m = 2/(m+2)$ are set as in Section 4. Then for all $M \ge 0$, the training loss satisfies $L(f^M) - L(f^*) = O(1/M^2)$.
Let us start with some new notation. We define two scalar constants whose specific values are needed only in Lemma B; elsewhere only their existence matters. With this notation, the definitions of the sequences $f^m$, $g^m$, and $h^m$ from Algorithm 2 can be written in a simplified form.
The sequence $h^m$ is in fact closely tied to the sequence $f^m$, as we show in the next lemma. For notational convenience, we write $f^m(X)$ for the vector of training predictions, and similarly for the other ensembles, throughout the proof.
Then we have the claimed identity, where the third equality is due to the definitions above. ∎
Lemma B establishes a sufficient decay of the loss function.
Recall that $\tau_{m,1}$ is chosen to best fit the pseudo-residual $r^m$ in the least-squares sense.
Since the class of learners is scalable (Assumption 2), we have
$$\min_{\lambda} \left\| r^m - \lambda\, b_{\tau_{m,1}}(X) \right\|^2 = \left\| r^m \right\|^2 \left(1 - \cos^2\!\big(r^m,\, b_{\tau_{m,1}}(X)\big)\right) \le \left(1 - \Theta^2\right) \left\| r^m \right\|^2,$$
where the last inequality is by the definition of $\Theta$, and the equality is due to the simple fact that for any two vectors $a$ and $b$,
$$\min_{\lambda} \|a - \lambda b\|^2 = \|a\|^2 \left(1 - \cos^2(a, b)\right).$$
Now recall that $f^{m+1} = g^m + \eta\, b_{\tau_{m,1}}$ and that $r^m$ is the negative gradient of the loss at the predictions $g^m(X)$. Since the loss function is $\sigma$-smooth and the step-size is sufficiently small, $L(f^{m+1})$ is bounded by $L(g^m)$ minus a quantity proportional to $\Theta^2 \|r^m\|^2$, where the final inequality follows from (8). This furnishes the proof of the lemma. ∎
Lemma B below is a basic fact about convex functions and is commonly used in the convergence analysis of accelerated methods.
For any convex loss $\ell$, any $y$, and any predictions $p_1$ and $p_2$,
$$\ell(y, p_1) \ge \ell(y, p_2) + \frac{\partial \ell(y, p_2)}{\partial p}(p_1 - p_2).$$