1 Introduction
Gradient Boosting Machine (GBM) [14] is a powerful supervised learning algorithm that combines multiple weak learners into an ensemble with excellent predictive performance. GBM works very well for a number of tasks such as spam filtering, online advertising, fraud detection, anomaly detection, and computational physics (e.g., the Higgs boson discovery), and it has routinely featured as a top algorithm in Kaggle competitions and the KDD Cup [7]. GBM can naturally handle heterogeneous datasets (highly correlated data, missing data, categorical data, etc.). It is also quite easy to use, with several publicly available implementations: scikit-learn [29], R gbm [31], LightGBM [18], XGBoost [7], TF Boosted Trees [30], etc.
In spite of the practical success of GBM, there is a considerable gap in its theoretical understanding. The traditional interpretation of GBM is to view it as a form of steepest descent in functional space [23, 14]. While this interpretation serves as a good starting point, such a framework lacks rigorous non-asymptotic convergence guarantees, especially when compared to the growing body of literature on first-order convex optimization.
In the convex optimization literature, Nesterov's acceleration is a successful technique to speed up the convergence of first-order methods. In this work, we show how to incorporate Nesterov momentum into the gradient boosting framework in order to obtain an accelerated gradient boosting machine.
1.1 Our contributions
We propose the first accelerated gradient boosting algorithm that comes with strong theoretical guarantees and can be used with any type of weak learner. In particular:

We propose a variant of AGBM that takes advantage of strong convexity of the loss function and achieves linear convergence (Section 5). We also list the conditions (on the loss function) under which AGBMs would be beneficial.
With a number of numerical experiments with weak tree learners (one of the most popular types of GBMs), we confirm the effectiveness of AGBM.
Apart from the theoretical contributions, we pave the way for speeding up some practical applications of GBMs that currently require a large number of boosting iterations. For example, GBMs with boosted trees for multiclass problems are commonly implemented as a number of one-vs-rest learners, resulting in more complicated boundaries [12] and potentially a larger number of required boosting iterations. Additionally, it is common practice to build many very weak learners for problems where it is easy to overfit. Such large ensembles result not only in slow training, but also in slower inference. AGBMs can potentially be beneficial for all these applications.
1.2 Related Literature
Convergence Guarantees for GBM: After GBM was first introduced by Friedman et al. [14], several works established its guaranteed convergence without explicitly stating the convergence rate [8, 23]. Subsequently, when the loss function is both smooth and strongly convex, [3] proved an exponential convergence rate, more precisely that $O(\exp(1/\varepsilon^2))$ iterations are sufficient to ensure that the training loss is within $\varepsilon$ of its optimal value. [35] then studied the primal-dual structure of GBM and demonstrated that in fact only $O(\log(1/\varepsilon))$ iterations are needed. However, the constants in their rate were non-standard and less intuitive. This result was recently improved upon by [11] and [22], who showed a similar convergence rate but with more transparent constants such as the smoothness and strong convexity constants of the loss function, as well as the density of weak learners. Additionally, if the loss function is assumed to be smooth and convex (but not necessarily strongly convex), [22] also showed that $O(1/\varepsilon)$ iterations are sufficient. We refer the reader to [35], [11] and [22] for a more detailed literature review of the theoretical results on GBM convergence.
Accelerated Gradient Methods: For optimizing a smooth convex function, [25] showed that the standard gradient descent algorithm can be made much faster, resulting in the accelerated gradient descent method. While gradient descent requires $O(1/\varepsilon)$ iterations, accelerated gradient methods only require $O(1/\sqrt{\varepsilon})$. In general, this rate of convergence is optimal and cannot be improved upon [26]. Although the method was introduced in 1983, the mainstream research community's interest in Nesterov's accelerated method started only around 15 years ago; yet even today most researchers struggle to find basic intuition as to what is really going on in accelerated methods. This lack of intuition about the estimation-sequence proof technique used by [26] has motivated many recent works trying to explain the acceleration phenomenon [33, 37, 16, 19, 15, 1, 5]. Some have recently attempted to give a physical explanation of acceleration techniques by studying the continuous-time interpretation of accelerated gradient descent via dynamical systems [33, 37, 16].
Accelerated Greedy Coordinate and Matching Pursuit Methods: Recently, [20] and [21] discussed how to accelerate matching pursuit and greedy coordinate descent algorithms, respectively. Their methods, however, require a random step and are hence only 'semi-greedy', which does not fit in the boosting framework.
2 Gradient Boosting Machine
We consider a supervised learning problem with $n$ training examples $(x_i, y_i)$, $i = 1, \ldots, n$, such that $x_i \in \mathbb{R}^p$ is the feature vector of the $i$-th example and $y_i$ is a label (in a classification problem) or a continuous response (in a regression problem). In the classical version of GBM [14], we assume we are given a base class of learners $\mathcal{B}$ and that our target function class is the linear combination of such base learners (denoted by $\mathrm{lin}(\mathcal{B})$). Let $\mathcal{B} = \{b_\tau(x)\}$ be a family of learners parameterized by $\tau \in \mathcal{T}$. The prediction corresponding to a feature vector $x$ is given by an additive model of the form:
$$f(x) := \sum_{m=1}^{M} \beta_m b_{\tau_m}(x) \in \mathrm{lin}(\mathcal{B}), \tag{1}$$
where $b_{\tau_m}(x) \in \mathcal{B}$ is a weak learner and $\beta_m$ is its corresponding additive coefficient. Here, $\beta_m$ and $\tau_m$ are chosen in an adaptive fashion in order to improve the data fidelity, as discussed below. Examples of learners commonly used in practice include wavelet functions, support vector machines, and classification and regression trees [13]. We assume the set of weak learners is scalable, namely that the following assumption holds.
Assumption 2: If $b(\cdot) \in \mathcal{B}$, then $\lambda b(\cdot) \in \mathcal{B}$ for any $\lambda > 0$.
Assumption 2 holds for most of the sets of weak learners we are interested in. Indeed, scaling a weak learner is equivalent to modifying the coefficient of the weak learner, so it does not change the structure of $\mathcal{B}$.
The goal of GBM is to obtain a good estimate of the function $f$ that approximately minimizes the empirical loss:
$$\min_{f \in \mathrm{lin}(\mathcal{B})} \Big\{ L(f) := \sum_{i=1}^{n} \ell(y_i, f(x_i)) \Big\}, \tag{2}$$
where $\ell(y_i, f(x_i))$ is a measure of the data fidelity for the $i$-th sample for the loss function $\ell$.
2.1 Best Fit Weak Learners
The original version of GBM by [14], presented in Algorithm 1, can be viewed as minimizing the loss function by applying an approximate steepest descent algorithm to Problem (2). GBM starts from a null function $f^0 \equiv 0$ and at each iteration computes the pseudo-residual $r^m$ (namely, the negative gradient of the loss function with respect to the predictions so far, $f^m$):
$$r_i^m = -\frac{\partial \ell(y_i, f^m(x_i))}{\partial f^m(x_i)}, \quad i = 1, \ldots, n. \tag{3}$$
Then a weak learner that best fits the current pseudo-residual in terms of the least squares loss is computed as follows:
$$\tau_m = \arg\min_{\tau \in \mathcal{T}} \sum_{i=1}^{n} \big(r_i^m - b_\tau(x_i)\big)^2. \tag{4}$$
This weak learner is added to the model with a coefficient found via a line search. As the iterations progress, GBM leads to a sequence of functions $\{f^m\}_{m \in [M]}$ (where $[M]$ is a shorthand for the set $\{1, \ldots, M\}$). The usual intention of GBM is to stop early, before one is close to a minimum of Problem (2), with the hope that such a model will lead to good predictive performance [14, 11, 38, 6].
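The loop described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' Algorithm 1: it assumes a single real-valued feature, the least squares loss $\frac{1}{2}(y - f)^2$ (so the pseudo-residual is simply $y - f$), regression stumps as weak learners, and a fixed step-size `eta` in place of the line search. The names `fit_stump`, `squared_loss` and `gbm` are hypothetical helpers introduced for this example.

```python
def fit_stump(x, r):
    """Least-squares regression stump on a 1-D feature; returns a predictor."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    total = sum(r)
    # Start from the "no split" stump, which predicts the mean everywhere.
    best_gain, best = total * total / n, (float("inf"), total / n, total / n)
    left_sum = 0.0
    for k in range(1, n):
        left_sum += r[order[k - 1]]
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # cannot split between identical feature values
        right_sum = total - left_sum
        # Maximizing this quantity is equivalent to minimizing the squared error.
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if gain > best_gain:
            best_gain = gain
            best = ((lo + hi) / 2, left_sum / k, right_sum / (n - k))
    threshold, left, right = best
    return lambda v: left if v <= threshold else right


def squared_loss(y, pred):
    return 0.5 * sum((yi - pi) ** 2 for yi, pi in zip(y, pred))


def gbm(x, y, n_iter=20, eta=0.5):
    """Fixed-step GBM sketch; returns the fitted stumps and training predictions."""
    pred = [0.0] * len(x)  # start from the null function
    stumps = []
    for _ in range(n_iter):
        # Pseudo-residual: negative gradient of 1/2 (y - f)^2 with respect to f.
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)  # best least-squares fit to the residual
        stumps.append(stump)
        pred = [pi + eta * stump(xi) for pi, xi in zip(pred, x)]
    return stumps, pred


x = [i / 10 for i in range(20)]
y = [xi * xi + 1.0 for xi in x]
stumps, pred = gbm(x, y)
```

For a different loss, only the line computing `resid` would change; the fitting step stays a least-squares fit to the pseudo-residual.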
Perhaps the most popular set of learners are classification and regression trees (CART) [4], resulting in Gradient Boosted Decision Tree models (GBDTs). These are the models that we use in our numerical experiments. At the same time, we would like to highlight that our algorithm is general and is not tied to a particular type of weak learner.
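Table 1: List of notation.
Parameter | Dimension | Explanation
$(x_i, y_i)$ | $\mathbb{R}^p \times \mathbb{R}$ | The features and the label of the $i$-th sample.
$X$ | $\mathbb{R}^{p \times n}$ | $X = [x_1, \ldots, x_n]$ is the feature matrix for all training data.
$b_\tau(x)$ | function | Weak learner parameterized by $\tau$.
$b_\tau(X)$ | $\mathbb{R}^n$ | A vector of predictions $[b_\tau(x_i)]_i$.
$f^m(x)$ | function | Ensemble of weak learners at the $m$-th iteration.
$f(X)$ | $\mathbb{R}^n$ | A vector of $[f(x_i)]_i$ for any function $f$.
$g^m(x), h^m(x)$ | functions | Auxiliary ensembles of weak learners at the $m$-th iteration.
$r^m$ | $\mathbb{R}^n$ | Pseudo-residual at the $m$-th iteration.
$c^m$ | $\mathbb{R}^n$ | Corrected pseudo-residual at the $m$-th iteration.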
3 Accelerated Gradient Boosting Machine (AGBM)
Given the success of accelerated gradient descent as a first-order optimization method, it seems natural to attempt to accelerate GBMs. As a warm-up, we first look at how to obtain an accelerated boosting algorithm when our class of learners is strong (complete) and can exactly fit any pseudo-residuals. This assumption is quite unreasonable but will serve to illustrate the connection between boosting and first-order optimization. We then describe our actual algorithm, which works for any class of weak learners.
3.1 Boosting with strong learners
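In this subsection, we assume the class of learners $\mathcal{B}$ is strong, i.e., for any pseudo-residual $r \in \mathbb{R}^n$, there exists a learner $b(x) \in \mathcal{B}$ such that
$$b(x_i) = r_i, \quad \forall i \in [n].$$
Of course, the entire point of boosting is that the learners are weak and thus the class is not strong, so this is not a realistic assumption. Nevertheless, this section provides the intuition for how to develop AGBM.
In GBM we compute the pseudo-residual $r^m$ in (3) to be the negative gradient of the loss function over the predictions so far. A gradient descent step in functional space would try to find $f^{m+1}$ such that for $i \in \{1, \ldots, n\}$
$$f^{m+1}(x_i) = f^m(x_i) - \eta \frac{\partial \ell(y_i, f^m(x_i))}{\partial f^m(x_i)} = f^m(x_i) + \eta r_i^m.$$
Here $\eta$ is the step-size of our algorithm. Since our class of learners is rich, we can choose $b_{\tau_m}(x) \in \mathcal{B}$ to exactly satisfy the above equation.
Thus GBM (Algorithm 1) has the following update:
$$f^{m+1} = f^m + \eta b_{\tau_m},$$
where $b_{\tau_m}(x_i) = r_i^m$. In other words, GBM performs exactly functional gradient descent when the class of learners is strong, and so it converges at a rate of $O(1/m)$. Akin to the above argument, we can perform functional accelerated gradient descent, which has the accelerated rate of $O(1/m^2)$. In the accelerated method, we maintain three model ensembles: $f$, $g$, and $h$, of which $f(x)$ is the only model that is finally used to make predictions at inference time. Ensemble $h(x)$ is the momentum sequence and $g(x)$ is a weighted average of $f$ and $h$ (refer to Table 1 for the list of all notation used). These sequences are updated as follows for a step-size $\eta$ and $\theta_m = 2/(m+2)$:
$$\begin{aligned} g^m &= (1 - \theta_m) f^m + \theta_m h^m \\ f^{m+1} &= g^m + \eta b_{\tau_m} && \text{(primary model)} \\ h^{m+1} &= h^m + \frac{\eta}{\theta_m} b_{\tau_m} && \text{(momentum model)} \end{aligned} \tag{5}$$
where $b_{\tau_m}(x)$ satisfies for $i \in [n]$
$$b_{\tau_m}(x_i) = -\frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)}. \tag{6}$$
Note that the pseudo-residual is computed w.r.t. $g$ instead of $f$. The update above can be rewritten as
$$f^{m+1} = f^m + \eta b_{\tau_m} + \theta_m (h^m - f^m).$$
If $\theta_m = 0$, we see that we recover the standard functional gradient descent with step-size $\eta$. For $\theta_m \in (0, 1]$, there is an additional momentum in the direction of $(h^m - f^m)$.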
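A small numerical sketch may help make the strong-learner update concrete. It is written under explicit assumptions that are not fixed by the text here: the logistic loss $\log(1 + e^{-yf})$ with labels in $\{-1, +1\}$, the schedule $\theta_m = 2/(m+2)$, and a "strong" learner that returns the pseudo-residual exactly, so the ensembles are represented simply by their prediction vectors. The function names are hypothetical.

```python
import math


def neg_grad_logistic(y, f):
    """Negative derivative of log(1 + exp(-y * f)) w.r.t. f, for y in {-1, +1}."""
    return y / (1.0 + math.exp(y * f))


def accelerated_step(f, h, y, m, eta):
    """One accelerated functional gradient step with an exactly fitting learner."""
    theta = 2.0 / (m + 2)  # assumed momentum schedule
    # g is a weighted average of the primary model f and the momentum model h.
    g = [(1 - theta) * fi + theta * hi for fi, hi in zip(f, h)]
    # Strong learner: b matches the pseudo-residual computed at g exactly.
    b = [neg_grad_logistic(yi, gi) for yi, gi in zip(y, g)]
    f_next = [gi + eta * bi for gi, bi in zip(g, b)]
    h_next = [hi + (eta / theta) * bi for hi, bi in zip(h, b)]
    return f_next, h_next, b, theta


y = [1, -1, 1]
f = h = [0.0, 0.0, 0.0]
for m in range(5):
    f, h, b, theta = accelerated_step(f, h, y, m, eta=1.0)
```

Expanding `g` in the line that computes `f_next` gives the rewritten form of the update: a plain gradient step from `f` plus a momentum term `theta * (h - f)`.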
3.2 Boosting with weak learners
In this subsection, we consider the general case without assuming that the class of learners is strong. Indeed, the class of learners $\mathcal{B}$ is usually quite simple and it is very likely that for any $\tau \in \mathcal{T}$, it is impossible to exactly fit the residual $r^m$. We call this case boosting with weak learners. Our task then is to modify (5) to obtain a truly accelerated gradient boosting machine.
First, the update to the $f$ sequence in (5) is replaced with a weak learner which best approximates $r^m$, in the same way as (4). In particular, we compute the pseudo-residual $r^m$ w.r.t. $g$ as in (6) and find a parameter $\tau_{m,1}$ such that
$$\tau_{m,1} = \arg\min_{\tau \in \mathcal{T}} \sum_{i=1}^{n} \big(r_i^m - b_\tau(x_i)\big)^2.$$
Secondly, and more crucially, the update to the momentum model $h$ is decoupled from the update to the $f$ sequence. We use an error-corrected pseudo-residual $c^m$ instead of directly using $r^m$. Suppose that at iteration $m-1$, a weak learner $b_{\tau_{m-1,2}}$ was added to $h^{m-1}$. Then the error-corrected residual is defined inductively as follows: for $i \in \{1, \ldots, n\}$
$$c_i^m = r_i^m + \frac{m+1}{m+2}\big(c_i^{m-1} - b_{\tau_{m-1,2}}(x_i)\big),$$
and then we compute
$$\tau_{m,2} = \arg\min_{\tau \in \mathcal{T}} \sum_{i=1}^{n} \big(c_i^m - b_\tau(x_i)\big)^2.$$
Thus at each iteration two weak learners are computed: $b_{\tau_{m,1}}(x)$, which approximates the residual $r^m$, and $b_{\tau_{m,2}}(x)$, which approximates the error-corrected residual $c^m$. Note that if our class of learners is complete then $b_{\tau_{m,1}}(x_i) = r_i^m$, $c^m = r^m$, and $b_{\tau_{m,2}} = b_{\tau_{m,1}}$. This would revert back to our accelerated gradient boosting algorithm for strong learners described in (5).
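The two-learner scheme can be sketched as follows. This is a hedged illustration rather than the authors' Algorithm 2: it assumes a single real-valued feature, the least squares loss $\frac{1}{2}(y - f)^2$, regression stumps, the schedule $\theta_m = 2/(m+2)$, a correction weight of $(m+1)/(m+2)$ on the previous fitting error, and a momentum update scaled by $\gamma\eta/\theta_m$. All of these specifics, and the helper names, are assumptions made for the example; only training-set predictions are tracked.

```python
def fit_stump(x, r):
    """Least-squares regression stump on a 1-D feature; returns a predictor."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    total = sum(r)
    best_gain, best = total * total / n, (float("inf"), total / n, total / n)
    left_sum = 0.0
    for k in range(1, n):
        left_sum += r[order[k - 1]]
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # cannot split between identical feature values
        right_sum = total - left_sum
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if gain > best_gain:
            best_gain = gain
            best = ((lo + hi) / 2, left_sum / k, right_sum / (n - k))
    threshold, left, right = best
    return lambda v: left if v <= threshold else right


def squared_loss(y, pred):
    return 0.5 * sum((yi - pi) ** 2 for yi, pi in zip(y, pred))


def agbm(x, y, n_iter=20, eta=0.5, gamma=0.5):
    """AGBM-style updates for 1/2 (y - f)^2; returns training predictions of f."""
    n = len(x)
    f, h = [0.0] * n, [0.0] * n
    c_prev = b2_prev = None
    for m in range(n_iter):
        theta = 2.0 / (m + 2)  # assumed momentum schedule
        g = [(1 - theta) * fi + theta * hi for fi, hi in zip(f, h)]
        r = [yi - gi for yi, gi in zip(y, g)]  # pseudo-residual computed at g
        b1 = fit_stump(x, r)  # first learner: fits r and updates f
        f = [gi + eta * b1(xi) for gi, xi in zip(g, x)]
        if m == 0:
            c = list(r)  # no earlier fitting error to correct
        else:
            w = (m + 1) / (m + 2)  # assumed correction weight
            c = [ri + w * (ci - bi) for ri, ci, bi in zip(r, c_prev, b2_prev)]
        b2 = fit_stump(x, c)  # second learner: fits c and updates h
        b2_vals = [b2(xi) for xi in x]
        h = [hi + (gamma * eta / theta) * bi for hi, bi in zip(h, b2_vals)]
        c_prev, b2_prev = c, b2_vals
    return f


x = [i / 10 for i in range(20)]
y = [xi * xi + 1.0 for xi in x]
f_final = agbm(x, y)
```

The term `ci - bi` is the part of the previous corrected residual that the second stump failed to fit; carrying it forward is what distinguishes this scheme from simply reusing `b1` for the momentum model.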
4 Convergence Analysis of AGBM
We first formally define the assumptions required and then outline the computational guarantees for AGBM.
4.1 Assumptions
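Let us introduce some standard regularity/continuity constraints on the loss function that we require in our analysis. We denote by $\frac{\partial \ell(y, f)}{\partial f}$ the derivative of the bivariate loss function $\ell$ w.r.t. the prediction $f$. We say that $\ell$ is $\sigma$-smooth if for any $y$ and predictions $f$ and $\hat{f}$, it holds that
$$\ell(y, f) \le \ell(y, \hat{f}) + \frac{\partial \ell(y, \hat{f})}{\partial f}(f - \hat{f}) + \frac{\sigma}{2}(f - \hat{f})^2.$$
We say $\ell$ is $\mu$-strongly convex (with $\mu > 0$) if for any $y$ and predictions $f$ and $\hat{f}$, it holds that
$$\ell(y, f) \ge \ell(y, \hat{f}) + \frac{\partial \ell(y, \hat{f})}{\partial f}(f - \hat{f}) + \frac{\mu}{2}(f - \hat{f})^2.$$
Note that $\mu \le \sigma$ always. Smoothness and strong convexity mean that the function is (respectively) upper and lower bounded by quadratic functions. Intuitively, smoothness implies that the gradient does not change abruptly and hence $\ell$ is never 'sharp'. Strong convexity implies that $\ell$ always has some 'curvature' and is never 'flat'.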
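For a twice-differentiable loss, these two constants can be read off from the second derivative in the prediction: an upper bound on it gives a smoothness constant and a positive lower bound gives a strong-convexity constant. The sketch below checks this numerically for the logistic loss $\log(1 + e^{-yf})$ with $y \in \{-1, +1\}$; the grid and the helper name `logistic_curvature` are assumptions made for this example.

```python
import math


def logistic_curvature(y, f):
    """Second derivative in f of log(1 + exp(-y * f)), for y in {-1, +1}."""
    s = 1.0 / (1.0 + math.exp(-y * f))  # sigmoid of the margin y * f
    return s * (1.0 - s)


grid = [k / 10 for k in range(-100, 101)]  # predictions from -10 to 10
sigma_est = max(logistic_curvature(1, f) for f in grid)  # largest curvature seen
mu_est = min(logistic_curvature(1, f) for f in grid)  # smallest curvature seen
```

The largest curvature is attained at $f = 0$ and equals $1/4$, so the logistic loss is smooth with constant $1/4$. The smallest curvature shrinks towards zero as $|f|$ grows, so the loss is strongly convex only on a bounded range of predictions. For the least squares loss $\frac{1}{2}(y - f)^2$ the second derivative is identically 1, so both constants coincide.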
The notion of Minimal Cosine Angle (MCA) introduced in [22] plays a central role in our convergence rate analysis of GBM. MCA measures how well the weak learner $b_{\hat{\tau}}(X)$ approximates the desired residual $r$. Let $c \in \mathbb{R}^n$ be a vector. The Minimal Cosine Angle (MCA) is defined as the similarity between $c$ and the output of the best-fit learner $b_{\hat{\tau}}(X)$:
$$\Theta := \min_{c \in \mathbb{R}^n} \cos\big(c, b_{\hat{\tau}}(X)\big), \tag{7}$$
where $b_{\hat{\tau}}(X) \in \mathbb{R}^n$ is a vector of predictions $[b_{\hat{\tau}}(x_i)]_i$.
The quantity $\Theta \in (0, 1]$ measures how "dense" the learners are in the prediction space. For strong learners (in Section 3.1), the prediction space is complete, and $\Theta = 1$. For a complex space of learners such as deep trees, we expect the prediction space to be dense and that $\Theta \approx 1$. For a simpler class such as tree stumps, $\Theta$ would be much smaller. Refer to [22] for a discussion of $\Theta$.
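To build intuition for the quantity inside the minimum, the sketch below computes the cosine between one given vector and the predictions of its best-fit weak learner. It does not compute the MCA itself, which would require minimizing over all vectors. The setting is an assumption made for illustration: a single real-valued feature and least-squares regression stumps; `fit_stump` and `cosine_to_best_fit` are hypothetical helper names.

```python
import math


def fit_stump(x, r):
    """Least-squares regression stump on a 1-D feature; returns a predictor."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    total = sum(r)
    best_gain, best = total * total / n, (float("inf"), total / n, total / n)
    left_sum = 0.0
    for k in range(1, n):
        left_sum += r[order[k - 1]]
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # cannot split between identical feature values
        right_sum = total - left_sum
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if gain > best_gain:
            best_gain = gain
            best = ((lo + hi) / 2, left_sum / k, right_sum / (n - k))
    threshold, left, right = best
    return lambda v: left if v <= threshold else right


def cosine_to_best_fit(x, c):
    """Cosine of the angle between c and the predictions of its best-fit stump."""
    stump = fit_stump(x, c)
    pred = [stump(xi) for xi in x]
    dot = sum(ci * pi for ci, pi in zip(c, pred))
    norm = math.sqrt(sum(ci * ci for ci in c)) * math.sqrt(sum(pi * pi for pi in pred))
    return dot / norm if norm > 0 else 0.0


x = [0, 1, 2, 3]
step_like = cosine_to_best_fit(x, [1, 1, -2, -2])  # a step function in x
oscillating = cosine_to_best_fit(x, [1, -1, 1, -1])  # alternates sign along x
```

A vector that is itself a step function in the feature is fitted exactly and yields a cosine of 1, whereas an oscillating vector is poorly approximated by a single split and yields a noticeably smaller cosine. The MCA is governed by such worst-case directions.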
4.2 Computational Guarantees
We are now ready to state the main theoretical result of our paper. Consider Accelerated Gradient Boosting Machine (Algorithm 2). Suppose is smooth, the stepsize and the momentum parameter , where is the MCA introduced in Definition 4.1. Then for all , we have:
Proof Sketch.
Here we only give an outline; the full proof can be found in the Appendix (Section B). We use the potential-function-based analysis of accelerated methods (cf. [36, 37]). Recall that . For the proof, we introduce the following vector sequence of auxiliary ensembles:
The sequence is in fact closely tied to the sequence as we demonstrate in the Appendix (Lemma B). Let be any function which obtains the optimal loss (2)
Let us define the following sequence of potentials:
Typical proofs of accelerated algorithms show that the potential is a decreasing sequence. In boosting, we use the weak learner that fits the pseudo-residual of the loss. This can guarantee sufficient decay to the first term of related to the loss . However, there is no guarantee that the same weak learner can also provide sufficient decay to the second term, as we do not know the optimal ensemble a priori. That is the major challenge in the development of AGBM.
We instead show that the potential decreases at least by :
where is an error term depending on (see Lemma B for the exact definition of and proof of the claim). By telescoping, it holds that
Finally a careful analysis of the error term (Lemma B) shows that for any . Therefore,
which furnishes the proof by letting and substituting the value of . ∎
5 Extensions and Variants
In this section we study two more practical variants of AGBM. First, we see how to restart the algorithm to take advantage of strong convexity of the loss function. Then we study a straightforward approach to accelerated GBM, which we call the vanilla accelerated gradient boosting machine (VAGBM). It is a variant of the algorithm recently proposed in [2], which, however, comes without any theoretical guarantees.
5.1 Restart and Linear Convergence
It is more common to show a linear rate of convergence for GBM methods by additionally assuming that the function is strongly convex (e.g. [22]). It is then relatively straightforward to recover an accelerated linear rate of convergence by restarting Algorithm 2.
Consider Accelerated Gradient Boosting with Restarts with Option 1 (Algorithm 3) . Suppose that is smooth and strongly convex. If the stepsize and the momentum parameter , then for any and optimal loss ,
Proof.
The loss function is strongly convex, which implies that
Substituting this in Theorem 4.2 gives us that
Recalling that , , and gives us the required statement. ∎
The restart strategy in Option 1 requires knowledge of the strong-convexity constant . Alternatively, one can also use an adaptive restart strategy (Option 2), which is known to have good empirical performance [28]. Theorem 5.1 shows that weak learners are sufficient to obtain an error of using AGBMR (Algorithm 3). In contrast, standard GBM (Algorithm 1) requires weak learners. Thus AGBMR is significantly better than GBM only if the condition number is large, i.e. . When is the least squares loss, we would see no advantage from acceleration. However, for more complicated functions with (e.g. logistic loss or exp loss), AGBMR might result in an ensemble that is significantly better (e.g. obtaining lower training loss) than that of GBM for the same number of weak learners.
5.2 A Vanilla Accelerated Gradient Boosting Method
A natural question to ask is whether, instead of adding two learners at each iteration, we can get away with adding only one. Below we show what such an algorithm would look like and argue that it may not always converge.
Following the updates in Equation (5), we can get a direct acceleration of GBM by using the weak learner fitting the gradient. This leads to Algorithm 4.
Algorithm 4 is equivalent to the recently developed accelerated gradient boosting machines algorithm [2, 10]. Unfortunately, it may not always converge to an optimum or may even diverge. This is because the weak learner from Step (2) is only an approximate fit to the pseudo-residual, meaning that we only take an approximate gradient descent step. While this is not an issue in the non-accelerated version, in Algorithm 4 the momentum term pushes the sequence to take a large step along the approximate gradient direction. This exacerbates the effect of the approximate direction and can lead to an additive accumulation of error, as shown in [9]. In Section 6.1, we see that this is not just a theoretical concern, but that Algorithm 4 also diverges in practice in some situations. Our corrected residual in Algorithm 2 was crucial to the theoretical proof of convergence in Theorem 4.2. One extension could be to introduce the momentum parameter $\gamma$ in step (5) of Algorithm 4, just as in Algorithm 2.
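The single-learner variant discussed above can be sketched as follows. This is an illustration under assumptions, not a transcription of Algorithm 4: it assumes a single real-valued feature, the least squares loss $\frac{1}{2}(y - f)^2$, regression stumps, and the schedule $\theta_m = 2/(m+2)$; the same stump fitted to the pseudo-residual at the averaged model is used for both the primary and the momentum update. The helper names are hypothetical.

```python
def fit_stump(x, r):
    """Least-squares regression stump on a 1-D feature; returns a predictor."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    total = sum(r)
    best_gain, best = total * total / n, (float("inf"), total / n, total / n)
    left_sum = 0.0
    for k in range(1, n):
        left_sum += r[order[k - 1]]
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # cannot split between identical feature values
        right_sum = total - left_sum
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if gain > best_gain:
            best_gain = gain
            best = ((lo + hi) / 2, left_sum / k, right_sum / (n - k))
    threshold, left, right = best
    return lambda v: left if v <= threshold else right


def squared_loss(y, pred):
    return 0.5 * sum((yi - pi) ** 2 for yi, pi in zip(y, pred))


def vagbm(x, y, n_iter=20, eta=0.5):
    """Vanilla accelerated boosting sketch: one stump per iteration, no correction."""
    n = len(x)
    f, h = [0.0] * n, [0.0] * n
    for m in range(n_iter):
        theta = 2.0 / (m + 2)  # assumed momentum schedule
        g = [(1 - theta) * fi + theta * hi for fi, hi in zip(f, h)]
        r = [yi - gi for yi, gi in zip(y, g)]  # pseudo-residual computed at g
        stump = fit_stump(x, r)  # only an approximate fit to r
        vals = [stump(xi) for xi in x]
        f = [gi + eta * bi for gi, bi in zip(g, vals)]
        # The fitting error r - vals is amplified here by the factor 1 / theta.
        h = [hi + (eta / theta) * bi for hi, bi in zip(h, vals)]
    return f


x = [i / 10 for i in range(20)]
y = [xi * xi + 1.0 for xi in x]
f_final = vagbm(x, y)
```

Because `1 / theta` grows linearly with the iteration count, any mismatch between the stump and the pseudo-residual enters the momentum model with an increasing weight, which is the mechanism behind the error accumulation described in the text.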
6 Numerical Experiments
In this section, we present the results of computational experiments and discuss the performance of AGBM with trees as weak learners. Subsection 6.1 demonstrates that the algorithm described in Section 5.2 may diverge numerically; Subsection 6.2 shows training and testing performance for GBM and AGBM with different parameters; and Subsection 6.3 compares the performance of GBM and AGBM with best-tuned parameters. The code for the numerical experiments will also be open-sourced.
Datasets: Table 2 summarizes the basic statistics of the LIBSVM datasets that were used. For each dataset, we randomly choose 80% as the training dataset and the remainder as the testing dataset.
Table 2: Basic statistics of the LIBSVM datasets used.
Dataset | Task | # samples | # features
a1a | classification | 1605 | 123
w1a | classification | 2477 | 300
diabetes | classification | 768 | 8
german | classification | 1000 | 24
housing | regression | 506 | 13
eunite2001 | regression | 336 | 16
AGBM with CART trees: In our experiments, all algorithms use CART trees as the weak learners. For classification problems, we use the logistic loss function, and for regression problems, we use the least squares loss. To reduce the computational cost, for each split and each feature, we consider 100 quantiles (instead of potentially all values). These strategies are commonly used in implementations of GBM like [7, 30].
6.1 Evidence that VAGBM May Diverge
Figure 1 shows the training loss versus the number of trees for the housing dataset with two step-sizes $\eta$, for VAGBM and for AGBM with different values of the parameter $\gamma$. The x-axis is the number of trees added to the ensemble (recall that our AGBM algorithm adds two trees to the ensemble per iteration, so the numbers of boosting iterations of VAGBM and AGBM differ). As we can see, when $\eta$ is large, the training loss of VAGBM diverges very fast, while our AGBM with a proper parameter $\gamma$ converges. When $\eta$ gets smaller, the training loss of VAGBM may decay faster than that of our AGBM at the beginning, but it gets stuck and never converges to the true optimal solution. Eventually the training loss of VAGBM may even diverge. On the other hand, our theory guarantees that AGBM always converges to the optimal solution.
[Figure 1: training loss versus number of trees, two panels.]
6.2 AGBM Sensitivity to the hyperparameters
In this section we look at how the two parameters $\eta$ and $\gamma$ affect the performance of AGBM. Figure 2 shows the training loss and the testing loss versus the number of trees for the a1a dataset with two different step-sizes (recall that AGBM adds two trees per iteration). When the step-size $\eta$ is large (with logistic loss, the largest step-size that guarantees convergence is $\eta = 1/\sigma = 4$), the training loss decays very fast, and traditional GBM can converge even faster than our AGBM at the beginning. However, the testing performance suffers, demonstrating that such fast convergence (due to the learning rate) can result in severe overfitting. In this case, our AGBM with a proper parameter $\gamma$ has slightly better testing performance. When the step-size becomes smaller, the testing performance of all algorithms becomes more stable, though the training becomes slower. AGBM with a proper $\gamma$ may require fewer iterations/trees to reach good training/testing performance.
[Figure 2: training loss and testing loss versus number of trees.]
6.3 Experiments with Fine Tuning
In this section we look at the testing performance of GBM, VAGBM and AGBM on six datasets with hyperparameter tuning. We consider depth trees as weak learners. We stop splitting early when the gain is smaller than (roughly for these datasets). The hyperparameters we tuned and their ranges are:
step size (): for least squares loss and for logistic loss;

number of trees: ;

momentum parameter (only for AGBM): .
For each dataset, we randomly choose 80% as the training dataset and use the remainder as the final testing dataset. We use cross-validation on the training dataset to tune the hyperparameters. Instead of going through all possible hyperparameters, we utilize randomized search (RandomizedSearchCV in scikit-learn). As AGBM has more parameters (namely the momentum parameter), we ran proportionally more iterations of random search for AGBM. Table 3 presents the performance of GBM, VAGBM and AGBM with the tuned parameters. As we can see, the accelerated methods (AGBM and VAGBM) in general require fewer iterations to reach similar or slightly better testing performance than GBM. Compared with VAGBM, AGBM adds two trees per iteration, which can be more expensive, but the performance of AGBM can be more stable. For example, the testing error of VAGBM on the housing dataset is much larger than that of AGBM.
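Table 3: Performance of GBM, VAGBM and AGBM with the tuned parameters.
Dataset | Method | Training | Testing | # iter | # trees
a1a | GBM | 0.2187 | 0.3786 | 97 | 97
a1a | VAGBM | 0.2454 | 0.3661 | 33 | 33
a1a | AGBM | 0.1994 | 0.3730 | 33 | 66
w1a | GBM | 0.0262 | 0.0578 | 84 | 84
w1a | VAGBM | 0.0409 | 0.0578 | 32 | 32
w1a | AGBM | 0.0339 | 0.0552 | 47 | 94
diabetes | GBM | 0.297 | 0.462 | 87 | 87
diabetes | VAGBM | 0.271 | 0.462 | 24 | 24
diabetes | AGBM | 0.297 | 0.458 | 47 | 94
german | GBM | 0.244 | 0.505 | 54 | 54
german | VAGBM | 0.288 | 0.514 | 51 | 51
german | AGBM | 0.305 | 0.485 | 35 | 70
housing | GBM | 0.2152 | 4.6603 | 93 | 93
housing | VAGBM | 0.5676 | 5.8090 | 73 | 73
housing | AGBM | 0.215 | 4.5074 | 35 | 70
eunite2001 | GBM | 36.73 | 270.1 | 64 | 64
eunite2001 | VAGBM | 28.99 | 245.2 | 58 | 58
eunite2001 | AGBM | 26.74 | 245.4 | 24 | 48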
7 Conclusion
In this paper, we proposed a novel Accelerated Gradient Boosting Machine (AGBM), proved its rate of convergence, and introduced a computationally inexpensive practical variant of AGBM that takes advantage of strong convexity of the loss function and achieves linear convergence. Finally, we demonstrated with a number of numerical experiments the effectiveness of AGBM over conventional GBM in obtaining a model with good training and/or testing data fidelity.
Appendix A Additional Discussions
Below we include some additional discussions which could not fit into the main paper but which nevertheless help to understand the relevance of our results when applied to frameworks typically used in practice.
A.1 Line Search in Boosting
Traditionally, the analysis of gradient boosting methods has focused on algorithms which use line search to select the step-size (e.g. Algorithm 1). Analysis of gradient descent suggests that this is not necessary: using a fixed step-size of $1/\sigma$, where $\ell$ is $\sigma$-smooth, is sufficient [22]. Our accelerated Algorithm 2 also adopts this fixed step-size strategy. In fact, even the standard boosting libraries (XGBoost and TFBT) typically use a fixed (but tuned) step-size and avoid an expensive line search.
A.2 Use of Hessian
Popular boosting libraries such as XGBoost [7] and TFBT [30] compute the Hessian and perform a Newton boosting step instead of gradient boosting. Since the Newton step may not be well defined (e.g. if the Hessian is degenerate), an additional Euclidean regularizer is also added. This has been shown to improve performance and reduce the need for a line search for the parameter sequence [34, 32]. For LogitBoost (i.e. when the loss is the logistic loss), [34] demonstrate that the trust-region Newton's method can indeed significantly improve the convergence. Leveraging similar results in second-order methods for convex optimization (e.g. [27, 17]) and adapting accelerated second-order methods [24] would be an interesting direction for future work.
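As a small illustration of a regularized Newton step, the sketch below computes the value assigned to one tree leaf from the first and second derivatives of the logistic loss over the samples in that leaf, using the familiar form $-G/(H + \lambda)$, where $G$ and $H$ are the summed gradients and Hessians and $\lambda$ is the Euclidean regularization weight. It is a hand-written sketch that does not call any library; the function names and the default `lam=1.0` are assumptions made for this example.

```python
import math


def logistic_grad_hess(y, f):
    """First and second derivatives in f of log(1 + exp(-y * f)), y in {-1, +1}."""
    s = 1.0 / (1.0 + math.exp(y * f))  # equals 1 - sigmoid(y * f)
    return -y * s, s * (1.0 - s)


def newton_leaf_value(ys, fs, lam=1.0):
    """Regularized Newton step for the samples that fall into one leaf."""
    pairs = [logistic_grad_hess(y, f) for y, f in zip(ys, fs)]
    grad_sum = sum(g for g, _ in pairs)
    hess_sum = sum(h for _, h in pairs)
    # lam keeps the step well defined even when the summed Hessian is near zero.
    return -grad_sum / (hess_sum + lam)


leaf_value = newton_leaf_value([1, 1, -1], [0.0, 0.0, 0.0], lam=1.0)
```

With `lam = 0` and a constant Hessian this reduces to a plain averaged gradient step, which is one way to see the gradient boosting update as a special case.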
A.3 Out-of-sample Performance
Throughout this work we focus only on minimizing the empirical training loss (see Formula (2)). In reality, what we really care about is the out-of-sample error of our resulting ensemble. A number of regularization tricks such as i) early stopping [38], ii) pruning [7, 30], iii) smaller step-sizes [30], and iv) dropout [30] are usually employed in practice to prevent overfitting and improve generalization. Since AGBM requires far fewer iterations than GBM to achieve the same training loss, it outputs a much sparser set of learners. We believe this is partially the reason for its better out-of-sample performance. However, a joint theoretical study of the out-of-sample error along with the empirical error is much needed. It would also shed light on the effectiveness of the numerous ad hoc regularization techniques currently employed.
Appendix B Proof of Theorem 4.1
This section proves our major theoretical result in the paper:
Theorem 4.1 Consider Accelerated Gradient Boosting Machine (Algorithm 2). Suppose is smooth, the stepsize and the momentum parameter . Then for all , we have:
∎
Let us start with some new notation. Define scalar constants and . We mostly only need ; the specific values of and are needed only in Lemma B. Then define
then the definitions of the sequences , , and from Algorithm 3 can be simplified as:
The sequence is in fact closely tied to the sequence as we show in the next lemma. For notational convenience, we define and similarly throughout the proof.
Proof.
Observe that
Then we have
where the third equality is due to the definition of . ∎
Lemma B presents the fact that there is sufficient decay of the loss function:
Proof.
Recall that is chosen such that
Since the class of learners is scalable (Assumption 2), we have
(8) 
where the last inequality is because of the definition of , and the second equality is due to the simple fact that for any two vectors and ,
Now recall that and that . Since the loss function is smooth and stepsize , it holds that
where the final inequality follows from (8). This furnishes the proof of the lemma. ∎
Lemma B is a basic fact about convex functions, and it is commonly used in the convergence analysis of accelerated methods.
For any function and ,