Gradient Boosting Machine (GBM) is an extremely powerful supervised learning algorithm that is widely used in practice. GBM routinely features as a leading algorithm in machine learning competitions such as Kaggle and the KDDCup. In this work, we propose Accelerated Gradient Boosting Machine (AGBM) by incorporating Nesterov's acceleration techniques into the design of GBM. The difficulty in accelerating GBM lies in the fact that weak (inexact) learners are commonly used, and therefore the errors can accumulate in the momentum term. To overcome this, we design a "corrected pseudo residual" and fit the best weak learner to this corrected pseudo residual in order to perform the z-update. Thus, we are able to derive novel computational guarantees for AGBM. This is the first GBM-type algorithm with a theoretically justified accelerated convergence rate. Finally, we demonstrate with a number of numerical experiments the effectiveness of AGBM over conventional GBM in obtaining a model with good training and/or testing data fidelity.


1 Introduction

Gradient Boosting Machine (GBM) is a powerful supervised learning algorithm that combines multiple weak-learners into an ensemble with excellent predictive performance. GBM works very well for a number of tasks such as spam filtering, online advertising, fraud detection, anomaly detection, and computational physics (e.g., the Higgs Boson discovery), and has routinely featured as a top algorithm in Kaggle competitions and the KDDCup [7]. GBM can naturally handle heterogeneous datasets (highly correlated data, missing data, categorical data, etc.). It is also quite easy to use, with several publicly available implementations: scikit-learn [29], R gbm [31], LightGBM [18], XGBoost [7], TF Boosted Trees [30], etc.

In spite of the practical success of GBM, there is a considerable gap in its theoretical understanding. The traditional interpretation of GBM is to view it as a form of steepest descent in functional space [23, 14]. While this interpretation serves as a good starting point, the framework lacks rigorous non-asymptotic convergence guarantees, especially when compared to the growing body of literature on first order convex optimization.

In convex optimization literature, Nesterov’s acceleration is a successful technique to speed up the convergence of first-order methods. In this work, we show how to incorporate Nesterov momentum into the gradient boosting framework in order to obtain an accelerated gradient boosting machine.

1.1 Our contributions

We propose the first accelerated gradient boosting algorithm that comes with strong theoretical guarantees and can be used with any type of weak learner. In particular:

• We propose a novel accelerated gradient boosting algorithm (AGBM) (Section 3) and prove (Section 4) that it reduces the empirical loss at a rate of $O(1/m^2)$ after $m$ iterations, improving upon the $O(1/m)$ rate obtained by traditional gradient boosting methods.

• We propose a variant of AGBM, taking advantage of strong convexity of the loss function, which achieves linear convergence (Section 5). We also list the conditions (on the loss function) under which AGBMs would be beneficial.

• With a number of numerical experiments with weak tree learners (one of the most popular types of GBMs), we confirm the effectiveness of AGBM.

Apart from these theoretical contributions, we pave the way for speeding up some practical applications of GBMs which currently require a large number of boosting iterations. For example, GBMs with boosted trees for multi-class problems are commonly implemented as a number of one-vs-rest learners, resulting in more complicated boundaries [12] and potentially a larger number of required boosting iterations. Additionally, it is common practice to build many very weak learners for problems where it is easy to overfit. Such large ensembles result not only in slow training, but also in slower inference. AGBMs can potentially be beneficial for all these applications.

1.2 Related Literature

Convergence Guarantees for GBM: After being first introduced by Friedman et al. [14], several works established its guaranteed convergence, without explicitly stating the convergence rate [8, 23]. Subsequently, when the loss function is both smooth and strongly convex, [3] proved an exponential convergence rate—more precisely that iterations are sufficient to ensure that the training loss is within of its optimal value. [35] then studied the primal-dual structure of GBM and demonstrated that in fact only iterations are needed. However the constants in their rate were non-standard and less intuitive. This result was recently improved upon by [11] and [22], who showed a similar convergence rate but with more transparent constants such as the smoothness and strong convexity constant of the loss function, as well as the density of weak learners. Additionally, if the loss function is assumed to be smooth and convex (but not necessarily strongly convex), [22] also showed that iterations are sufficient. We refer the reader to [35], [11] and [22] for a more detailed literature review of the theoretical results of GBM convergence.

Accelerated Gradient Methods: For optimizing a smooth convex function, [25] showed that the standard gradient descent algorithm can be made much faster, resulting in the accelerated gradient descent method. While gradient descent requires $O(1/\varepsilon)$ iterations, accelerated gradient methods only require $O(1/\sqrt{\varepsilon})$. In general, this rate of convergence is optimal and cannot be improved upon [26]. Although Nesterov's accelerated method was introduced in 1983, the mainstream research community's interest in it only started around 15 years ago; yet even today most researchers struggle to find basic intuition as to what is really going on in accelerated methods. This lack of intuition about the estimation sequence proof technique used by [26] has motivated many recent works trying to explain the acceleration phenomenon [33, 37, 16, 19, 15, 1, 5]. Some have recently attempted to give a physical explanation of acceleration techniques by studying the continuous-time interpretation of accelerated gradient descent via dynamical systems [33, 37, 16].

Accelerated Greedy Coordinate and Matching Pursuit Methods: Recently, [20] and  [21] discussed how to accelerate matching pursuit and greedy coordinate descent algorithms respectively. Their methods however require a random step and are hence only ‘semi-greedy’, which does not fit in the boosting framework.

Accelerated GBM: Recently, [2] and [10] proposed accelerated versions of GBM by directly incorporating Nesterov's momentum into GBM; however, no theoretical justification was provided. Furthermore, as we argue in Section 5.2, their proposed algorithm may not converge to the optimum.

We consider a supervised learning problem with $n$ training examples $(x_i, y_i)$, $i = 1, \dots, n$, such that $x_i$ is the feature vector of the $i$-th example and $y_i$ is a label (in a classification problem) or a continuous response (in a regression problem). In the classical version of GBM [14], we assume we are given a base class of learners $\mathcal{B}$ and that our target function class is the linear combination of such base learners (denoted by $\operatorname{lin}(\mathcal{B})$). Let $\mathcal{B} = \{b_\tau(x)\}_{\tau \in \mathcal{T}}$ be a family of learners parameterized by $\tau \in \mathcal{T}$. The prediction corresponding to a feature vector $x$ is given by an additive model of the form:

$$ f(x) := \sum_{m=1}^{M} \beta_m b_{\tau_m}(x) \in \operatorname{lin}(\mathcal{B}), \tag{1} $$

where $b_{\tau_m}(x) \in \mathcal{B}$ is a weak-learner and $\beta_m$ is its corresponding additive coefficient. Here, $\beta_m$ and $\tau_m$ are chosen in an adaptive fashion in order to improve the data-fidelity as discussed below. Examples of learners commonly used in practice include wavelet functions, support vector machines, and classification and regression trees [13].

We assume the set of weak learners is scalable, namely that the following assumption holds. If $b \in \mathcal{B}$, then its rescaling $\lambda b$ also belongs to $\mathcal{B}$ for any scaling factor $\lambda$. Assumption 2 holds for most of the sets of weak learners we are interested in. Indeed, scaling a weak learner is equivalent to modifying the coefficient of the weak learner, so it does not change the structure of $\mathcal{B}$.

The goal of GBM is to obtain a good estimate of the function $f$ that approximately minimizes the empirical loss:

$$ L^\star = \min_{f \in \operatorname{lin}(\mathcal{B})} \Big\{ L(f) := \sum_{i=1}^{n} \ell(y_i, f(x_i)) \Big\} \tag{2} $$

where $\ell(y_i, f(x_i))$ is a measure of the data-fidelity for the $i$-th sample for the loss function $\ell$.

2.1 Best Fit Weak Learners

The original version of GBM by [14], presented in Algorithm 1, can be viewed as minimizing the loss function by applying an approximate steepest descent algorithm to Problem (2). GBM starts from a null function $f^0 \equiv 0$ and at each iteration computes the pseudo-residual $r^m$ (namely, the negative gradient of the loss function with respect to the predictions so far $f^m$):

$$ r^m_i = -\frac{d\,\ell(y_i, f^m(x_i))}{d f^m(x_i)}. \tag{3} $$

Then a weak-learner that best fits the current pseudo-residual in terms of the least squares loss is computed as follows:

$$ \tau_m = \arg\min_{\tau \in \mathcal{T}} \sum_{i=1}^{n} \big(r^m_i - b_\tau(x_i)\big)^2. \tag{4} $$

This weak-learner is added to the model with a coefficient found via a line search. As the iterations progress, GBM leads to a sequence of functions $\{f^m\}_{m \in [M]}$ (where $[M]$ is a shorthand for the set $\{1, \dots, M\}$). The usual intention of GBM is to stop early, before one is close to a minimum of Problem (2), with the hope that such a model will lead to good predictive performance [14, 11, 38, 6].

Perhaps the most popular set of learners are classification and regression trees (CART) [4], resulting in Gradient Boosted Decision Tree models (GBDTs). These are the models that we use in our numerical experiments. At the same time, we would like to highlight that our algorithm is general and is not tied to a particular type of weak learner.
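To make the loop in (3) and (4) concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the least-squares loss $\ell(y, f) = \frac{1}{2}(y - f)^2$ (so the pseudo-residual is simply $y_i - f^m(x_i)$), one-dimensional features, and regression stumps as the weak learners. It also swaps the line search for a fixed step-size `eta`. The helper names `fit_stump` and `gbm` are hypothetical.

```python
import math

def fit_stump(xs, targets):
    """Least-squares regression stump on 1-D inputs; returns a predictor."""
    best = None
    for thr in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        left_mean = sum(left) / len(left)
        # If nothing falls to the right, the stump degenerates to a constant.
        right_mean = sum(right) / len(right) if right else left_mean
        sse = sum((t - (left_mean if x <= thr else right_mean)) ** 2
                  for x, t in zip(xs, targets))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    _, thr, left_mean, right_mean = best
    return lambda x: left_mean if x <= thr else right_mean

def gbm(xs, ys, num_iters=40, eta=0.5):
    """Gradient boosting with a fixed step-size (assumption: no line search)."""
    preds = [0.0] * len(xs)              # f^0 is the null function
    ensemble, losses = [], []
    for _ in range(num_iters):
        # Pseudo-residual (3) for the least-squares loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        # Best-fit weak learner (4).
        stump = fit_stump(xs, residuals)
        ensemble.append(stump)
        preds = [p + eta * stump(x) for p, x in zip(preds, xs)]
        losses.append(0.5 * sum((y - p) ** 2 for y, p in zip(ys, preds)))
    return ensemble, losses

xs = [i / 20 for i in range(20)]
ys = [math.sin(6 * x) for x in xs]
ensemble, losses = gbm(xs, ys)
```

With this loss and `eta` in $(0, 1]$, each added stump cannot increase the training loss, which is the behaviour the steepest-descent view predicts.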

3 Accelerated Gradient Boosting Machine (AGBM)

Given the success of accelerated gradient descent as a first order optimization method, it seems natural to attempt to accelerate the GBMs. As a warm-up, we first look at how to obtain an accelerated boosting algorithm when our class of learners is strong (complete) and can exactly fit any pseudo-residuals. This assumption is quite unreasonable but will serve to understand the connection between boosting and first order optimization. We then describe our actual algorithm which works for any class of weak learners.

3.1 Boosting with strong learners

In this subsection, we assume the class of learners $\mathcal{B}$ is strong, i.e. for any pseudo-residual $r \in \mathbb{R}^n$, there exists a learner $b \in \mathcal{B}$ such that

$$ b(x_i) = r_i, \quad \forall\, i \in [n]. $$

Of course the entire point of boosting is that the learners are weak and thus the class is not strong, therefore this is not a realistic assumption. Nevertheless this section will provide the intuitions on how to develop AGBM.

In GBM we compute the pseudo-residual $r^m$ in (3) to be the negative gradient of the loss function over the predictions so far. A gradient descent step in functional space would try to find $f^{m+1}$ such that for $i \in [n]$

$$ f^{m+1}(x_i) = f^m(x_i) + \eta\, r^m_i. $$

Here $\eta$ is the step-size of our algorithm. Since our class of learners is rich, we can choose $f^{m+1}$ to exactly satisfy the above equation.

Thus GBM (Algorithm 1) has the following update:

$$ f^{m+1} = f^m + \eta\, b^m, $$

where $b^m \in \mathcal{B}$ satisfies $b^m(x_i) = r^m_i$ for $i \in [n]$. In other words, GBM performs exactly functional gradient descent when the class of learners is strong, and so it converges at a rate of $O(1/m)$. Akin to the above argument, we can perform functional accelerated gradient descent, which has the accelerated rate of $O(1/m^2)$. In the accelerated method, we maintain three model ensembles: $f$, $g$, and $h$, of which $f$ is the only model that is finally used to make predictions at inference time. Ensemble $h$ is the momentum sequence and $g$ is a weighted average of $f$ and $h$ (refer to Table 1 for a list of all notations used). These sequences are updated as follows for a step-size $\eta$ and $\theta_m = 2/(m+2)$:

$$ \begin{aligned} g^m &= (1-\theta_m) f^m + \theta_m h^m \\ f^{m+1} &= g^m + \eta\, b^m && \text{(primary model)} \\ h^{m+1} &= h^m + \frac{\eta}{\theta_m}\, b^m && \text{(momentum model)} \end{aligned} \tag{5} $$

where $b^m$ satisfies for $i \in [n]$

$$ b^m(x_i) = -\frac{d\,\ell(y_i, g^m(x_i))}{d g^m(x_i)}. \tag{6} $$

Note that the pseudo-residual is computed w.r.t. $g$ instead of $f$. The update above can be rewritten as

$$ f^{m+1} = f^m + \eta\, b^m + \theta_m (h^m - f^m). $$

If $\theta_m = 0$, we see that we recover the standard functional gradient descent with step-size $\eta$. For $\theta_m \in (0, 1]$, there is an additional momentum in the direction of $(h^m - f^m)$.

The three sequences $f$, $g$, and $h$ match exactly those used in typical accelerated gradient descent methods (see [26, 36] for details).
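The strong-learner updates in (5) and (6) can be tried directly on the vector of predictions, since a strong learner reproduces any pseudo-residual exactly. The sketch below is an illustration under assumptions, not part of the paper: it uses the log-cosh loss $\ell(y, f) = \log\cosh(f - y)$, chosen here only because it is convex and 1-smooth with a minimizer at $f = y$, and the function names are hypothetical.

```python
import math

def loss(ys, f):
    # Log-cosh loss (assumed for illustration): convex, 1-smooth, minimised at f = y.
    return sum(math.log(math.cosh(fi - yi)) for yi, fi in zip(ys, f))

def neg_grad(ys, f):
    # Pseudo-residual: negative derivative of log cosh(f - y) w.r.t. f.
    return [-math.tanh(fi - yi) for yi, fi in zip(ys, f)]

def gradient_descent(ys, num_iters, eta=1.0):
    f = [0.0] * len(ys)
    for _ in range(num_iters):
        b = neg_grad(ys, f)                       # strong learner fits r exactly
        f = [fi + eta * bi for fi, bi in zip(f, b)]
    return f

def accelerated_descent(ys, num_iters, eta=1.0):
    f = [0.0] * len(ys)
    h = [0.0] * len(ys)
    for m in range(num_iters):
        theta = 2.0 / (m + 2)
        g = [(1 - theta) * fi + theta * hi for fi, hi in zip(f, h)]
        b = neg_grad(ys, g)                       # residual w.r.t. g, as in (6)
        f = [gi + eta * bi for gi, bi in zip(g, b)]          # primary model
        h = [hi + (eta / theta) * bi for hi, bi in zip(h, b)]  # momentum model
    return f

ys = [20.0, -15.0, 8.0]
gd_loss = loss(ys, gradient_descent(ys, 6))
agd_loss = loss(ys, accelerated_descent(ys, 6))
```

On this toy target, whose entries lie far from the starting point, six accelerated steps reach a lower loss than six plain gradient steps, because the momentum sequence $h$ moves by $\eta/\theta_m$ rather than $\eta$ per iteration.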

3.2 Boosting with weak learners

In this subsection, we consider the general case without assuming that the class of learners is strong. Indeed, the class of learners $\mathcal{B}$ is usually quite simple and it is very likely that for any $\tau \in \mathcal{T}$, it is impossible to exactly fit the residual $r^m$. We call this case boosting with weak learners. Our task then is to modify (5) to obtain a truly accelerated gradient boosting machine.

The full details are summarized in Algorithm 2 but we will highlight two key differences from (5).

First, the update to the $f$ sequence is replaced with a weak-learner which best approximates $r^m$, similar to (4). In particular, we compute the pseudo-residual $r^m$ w.r.t. $g$ as in (6) and find a parameter $\tau_{m,1}$ such that

$$ \tau_{m,1} = \arg\min_{\tau \in \mathcal{T}} \sum_{i=1}^{n} \big(r^m_i - b_\tau(x_i)\big)^2. $$

Secondly, and more crucially, the update to the momentum model $h$ is decoupled from the update to the $f$ sequence. We use an error-corrected pseudo-residual $c^m$ instead of directly using $r^m$. Suppose that at iteration $m-1$, a weak-learner $b_{\tau_{m-1,2}}$ was added to $h^{m-1}$. Then the error-corrected residual is defined inductively as follows: for $i \in [n]$

$$ c^m_i = r^m_i + \frac{m+1}{m+2}\big(c^{m-1}_i - b_{\tau_{m-1,2}}(x_i)\big), $$

and then we compute

$$ \tau_{m,2} = \arg\min_{\tau \in \mathcal{T}} \sum_{i=1}^{n} \big(c^m_i - b_\tau(x_i)\big)^2. $$

Thus at each iteration two weak learners are computed: $b_{\tau_{m,1}}$, which approximates the residual $r^m$, and $b_{\tau_{m,2}}$, which approximates the error-corrected residual $c^m$. Note that if our class of learners is complete then $c^m = r^m$ and $b_{\tau_{m,1}} = b_{\tau_{m,2}}$. This would revert back to our accelerated gradient boosting algorithm for strong learners described in (5).
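The two-learner iteration described above can be sketched as follows. This is a hedged illustration, not Algorithm 2 verbatim: it assumes the least-squares loss, one-dimensional features, regression stumps as weak learners, the momentum update $h^{m+1} = h^m + (\gamma\eta/\theta_m)\, b_{\tau_{m,2}}$ used in the analysis, and the convention that the correction term vanishes at $m = 0$. The values `eta=1.0` and `gamma=0.1` are arbitrary choices and the helper names are hypothetical.

```python
import math

def fit_stump(xs, targets):
    """Least-squares stump on 1-D inputs; returns its fitted values on xs."""
    best = None
    for thr in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right) if right else lm
        fitted = [lm if x <= thr else rm for x in xs]
        sse = sum((t - v) ** 2 for t, v in zip(targets, fitted))
        if best is None or sse < best[0]:
            best = (sse, fitted)
    return best[1]

def sq_loss(ys, preds):
    return 0.5 * sum((y - p) ** 2 for y, p in zip(ys, preds))

def agbm(xs, ys, num_iters=30, eta=1.0, gamma=0.1):
    n = len(xs)
    f, h = [0.0] * n, [0.0] * n
    c_prev, b2_prev = [0.0] * n, [0.0] * n   # correction is zero at m = 0
    loss_g, loss_f = [], []
    for m in range(num_iters):
        theta = 2.0 / (m + 2)
        g = [(1 - theta) * fi + theta * hi for fi, hi in zip(f, h)]
        r = [y - gi for y, gi in zip(ys, g)]          # pseudo-residual w.r.t. g
        b1 = fit_stump(xs, r)                         # learner 1 fits r
        f = [gi + eta * bi for gi, bi in zip(g, b1)]
        ratio = (m + 1) / (m + 2)
        c = [ri + ratio * (cp - bp)                   # error-corrected residual
             for ri, cp, bp in zip(r, c_prev, b2_prev)]
        b2 = fit_stump(xs, c)                         # learner 2 fits c
        h = [hi + (gamma * eta / theta) * bi for hi, bi in zip(h, b2)]
        c_prev, b2_prev = c, b2
        loss_g.append(sq_loss(ys, g))
        loss_f.append(sq_loss(ys, f))
    return f, loss_g, loss_f

xs = [i / 20 for i in range(20)]
ys = [math.sin(6 * x) for x in xs]
f_final, loss_g, loss_f = agbm(xs, ys)
```

The sketch records the loss at both $g^m$ and $f^{m+1}$ so that the per-iteration decrease from $g^m$ to $f^{m+1}$ can be inspected; whether the overall sequence converges quickly depends on the choice of `gamma`.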

4 Convergence Analysis of AGBM

We first formally define the assumptions required and then outline the computational guarantees for AGBM.

4.1 Assumptions

Let us introduce some standard regularity/continuity constraints on the loss function that we require in our analysis. We denote by $\frac{\partial \ell(y, f)}{\partial f}$ the derivative of the bivariate loss function $\ell$ w.r.t. the prediction $f$. We say that $\ell$ is $\sigma$-smooth if for any $y$ and predictions $f_1$ and $f_2$, it holds that

$$ \ell(y, f_1) \le \ell(y, f_2) + \frac{\partial \ell(y, f_2)}{\partial f}(f_1 - f_2) + \frac{\sigma}{2}(f_1 - f_2)^2. $$

We say $\ell$ is $\mu$-strongly convex (with $\mu \ge 0$) if for any $y$ and predictions $f_1$ and $f_2$, it holds that

$$ \ell(y, f_1) \ge \ell(y, f_2) + \frac{\partial \ell(y, f_2)}{\partial f}(f_1 - f_2) + \frac{\mu}{2}(f_1 - f_2)^2. $$

Note that $\mu \le \sigma$ always. Smoothness and strong-convexity mean that the function is (respectively) upper and lower bounded by quadratic functions. Intuitively, smoothness implies that the gradient does not change abruptly and hence $\ell$ is never ‘sharp’. Strong-convexity implies that $\ell$ always has some ‘curvature’ and is never ‘flat’.
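As a worked illustration of these two constants (not taken from the paper), consider the two losses used later in the experiments. For a twice-differentiable loss, $\sigma$-smoothness and $\mu$-strong convexity amount to bounding the second derivative in $f$ from above by $\sigma$ and from below by $\mu$. The scaling $\frac{1}{2}(y - f)^2$ and the label convention $y \in \{-1, +1\}$ are assumptions made here.

```latex
% Least-squares loss: \ell(y, f) = \tfrac{1}{2}(y - f)^2
\frac{\partial^2 \ell(y, f)}{\partial f^2} = 1
\quad\Longrightarrow\quad \sigma = \mu = 1 .

% Logistic loss with y \in \{-1, +1\}: \ell(y, f) = \log\bigl(1 + e^{-y f}\bigr)
\frac{\partial^2 \ell(y, f)}{\partial f^2} = p\,(1 - p),
\qquad p = \frac{1}{1 + e^{-y f}} \in (0, 1)
\quad\Longrightarrow\quad \sigma = \tfrac{1}{4}, \quad \mu = 0 .
```

The logistic loss is therefore smooth but not strongly convex over all of $\mathbb{R}$, since $p(1-p)$ tends to zero as $|f|$ grows.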

The notion of Minimal Cosine Angle (MCA) introduced in [22] plays a central role in our convergence rate analysis of GBM. MCA measures how well the best-fit weak-learner approximates the desired residual $r$. Let $r \in \mathbb{R}^n$ be a vector. The Minimal Cosine Angle (MCA) is defined as the similarity between $r$ and the output of the best-fit learner $b_\tau(X)$:

$$ \Theta := \min_{r \in \mathbb{R}^n} \max_{\tau \in \mathcal{T}} \cos\big(r, b_\tau(X)\big), \tag{7} $$

where $b_\tau(X) \in \mathbb{R}^n$ is the vector of predictions $[b_\tau(x_i)]_{i \in [n]}$.

The quantity $\Theta$ measures how “dense” the learners are in the prediction space. For strong learners (in Section 3.1), the prediction space is complete, and $\Theta = 1$. For a complex space of learners such as deep trees, we expect the prediction space to be dense and that $\Theta \approx 1$. For a simpler class such as tree-stumps, $\Theta$ would be much smaller. Refer to [22] for a discussion of $\Theta$.

4.2 Computational Guarantees

We are now ready to state the main theoretical result of our paper. Consider Accelerated Gradient Boosting Machine (Algorithm 2). Suppose $\ell$ is $\sigma$-smooth, the step-size satisfies $\eta \le 1/\sigma$, and the momentum parameter $\gamma$ is suitably bounded in terms of $\Theta$, the MCA introduced in Definition 4.1. Then for all $M \ge 0$, we have:

$$ L(f^M) - L(f^\star) \le \frac{1}{2\eta\gamma(M+1)^2}\,\|f^\star(X)\|_2^2 . $$
Proof Sketch.

Here we only give an outline; the full proof can be found in the Appendix (Section B). We use the potential-function based analysis of accelerated methods (cf. [36, 37]). Recall that $\theta_m = \frac{2}{m+2}$. For the proof, we introduce the following vector sequence of auxiliary ensembles:

$$ \hat h^0(X) = 0, \qquad \hat h^{m+1}(X) = \hat h^m(X) + \frac{\eta\gamma}{\theta_m}\, r^m. $$

The sequence $\hat h^m(X)$ is in fact closely tied to the sequence $h^m(X)$, as we demonstrate in the Appendix (Lemma B). Let $f^\star$ be any function which obtains the optimal loss (2):

$$ f^\star \in \arg\min_{f \in \operatorname{lin}(\mathcal{B})} \Big\{ L(f) := \sum_{i=1}^{n} \ell(y_i, f(x_i)) \Big\}. $$

Let us define the following sequence of potentials:

$$ V^m(f^\star) = \begin{cases} \frac{1}{2}\,\|f^\star(X) - \hat h^0(X)\|^2 & \text{if } m = 0, \\ \frac{\eta\gamma}{\theta_{m-1}^2}\big(L(f^m) - L^\star\big) + \frac{1}{2}\,\|f^\star(X) - \hat h^m(X)\|^2 & \text{otherwise.} \end{cases} $$

Typical proofs of accelerated algorithms show that the potential $V^m(f^\star)$ is a decreasing sequence. In boosting, we use the weak-learner that fits the pseudo-residual of the loss. This can guarantee sufficient decay of the first term of $V^m(f^\star)$, which is related to the loss $L(f^m)$. However, there is no such guarantee that the same weak-learner can also provide sufficient decay of the second term, as we do not know the optimal ensemble $f^\star$ a priori. That is the major challenge in the development of AGBM.

We instead show that the potential decreases up to an error term $\delta_m$:

$$ V^{m+1}(f^\star) \le V^m(f^\star) + \delta_m, $$

where $\delta_m$ is an error term (see Lemma B for its exact definition and a proof of the claim). By telescoping, it holds that

$$ \frac{\eta\gamma}{\theta_m^2}\big(L(f^{m+1}) - L(f^\star)\big) \le V^{m+1}(f^\star) \le \sum_{j=0}^{m} \delta_j + \frac{1}{2}\,\|f^\star(X) - \hat h^0(X)\|^2. $$

Finally, a careful analysis of the error term (Lemma B) shows that $\sum_{j=0}^{m} \delta_j \le 0$ for any $m \ge 0$. Therefore,

$$ L(f^{m+1}) - L(f^\star) \le \frac{\theta_m^2}{2\eta\gamma}\,\|f^\star(X)\|^2, $$

which furnishes the proof by letting $m = M - 1$ and substituting the value of $\theta_m$. ∎

Theorem 4.2 implies that to get a function $f^M$ such that the error $L(f^M) - L(f^\star) \le \varepsilon$, we need a number of iterations $M = O(1/\sqrt{\varepsilon})$, treating $\eta$, $\gamma$ and $\|f^\star(X)\|_2$ as constants. In contrast, standard gradient boosting machines, as proved in [22], require $M = O(1/\varepsilon)$. This means that for small values of $\varepsilon$, AGBM (Algorithm 2) can require far fewer weak learners than GBM (Algorithm 1).
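The iteration count implied by the bound $L(f^M) - L(f^\star) \le \|f^\star(X)\|_2^2 / (2\eta\gamma(M+1)^2)$ can be computed directly. The helper below is a small illustrative sketch; the function names and the example values of `eta`, `gamma` and the squared norm are hypothetical and not taken from the paper.

```python
import math

def agbm_bound(num_iters, eta, gamma, fstar_sq_norm):
    """Right-hand side of the bound after num_iters iterations."""
    return fstar_sq_norm / (2.0 * eta * gamma * (num_iters + 1) ** 2)

def agbm_iterations(eps, eta, gamma, fstar_sq_norm):
    """Smallest M for which the bound is at most eps."""
    m = math.sqrt(fstar_sq_norm / (2.0 * eta * gamma * eps)) - 1.0
    return max(0, math.ceil(m))

# Hypothetical values, for illustration only.
eta, gamma, fstar_sq_norm = 1.0, 0.05, 10.0
m_coarse = agbm_iterations(1e-2, eta, gamma, fstar_sq_norm)
m_fine = agbm_iterations(1e-4, eta, gamma, fstar_sq_norm)
```

Tightening `eps` by a factor of 100 raises the required number of iterations by roughly a factor of 10, which is the $1/\sqrt{\varepsilon}$ scaling discussed above.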

5 Extensions and Variants

In this section we study two more practical variants of AGBM. First we see how to restart the algorithm to take advantage of strong convexity of the loss function. Then we will study a straight-forward approach to accelerated GBM, which we call vanilla accelerated gradient boosting machine (VAGBM), a variant of the recently proposed algorithm in [2], however without any theoretical guarantees.

5.1 Restart and Linear Convergence

It is more common to show a linear rate of convergence for GBM methods by additionally assuming that the function $\ell$ is $\mu$-strongly convex (e.g. [22]). It is then relatively straightforward to recover an accelerated linear rate of convergence by restarting Algorithm 2.

Consider Accelerated Gradient Boosting with Restarts with Option 1 (Algorithm 3). Suppose that $\ell$ is $\sigma$-smooth and $\mu$-strongly convex. If the step-size satisfies $\eta \le 1/\sigma$ and the momentum parameter $\gamma$ is chosen as in Theorem 4.2, then for any $p$ and optimal loss $L(f^\star)$,

$$ L(\tilde f^{p+1}) - L(f^\star) \le \frac{1}{2}\big(L(\tilde f^{p}) - L(f^\star)\big). $$
Proof.

The loss function $\ell$ is $\mu$-strongly convex, which implies that

$$ \frac{\mu}{2}\,\|f(X) - f^\star(X)\|_2^2 \le L(f) - L(f^\star). $$

Substituting this in Theorem 4.2 gives us that

$$ L(f^M) - L(f^\star) \le \frac{1}{\mu\eta\gamma(M+1)^2}\big(L(f^0) - L(f^\star)\big). $$

Recalling that each restart begins from $f^0 = \tilde f^{p}$, returns $\tilde f^{p+1} = f^M$, and runs for a number of iterations $M$ large enough that $\mu\eta\gamma(M+1)^2 \ge 2$ gives us the required statement. ∎

The restart strategy in Option 1 requires knowledge of the strong-convexity constant $\mu$. Alternatively, one can also use an adaptive restart strategy (Option 2), which is known to have good empirical performance [28]. Theorem 5.1 shows that on the order of $\sqrt{\sigma/\mu}\,\log(1/\varepsilon)$ weak learners (up to factors depending on $\Theta$) are sufficient to obtain an error of $\varepsilon$ using AGBMR (Algorithm 3). In contrast, standard GBM (Algorithm 1) requires on the order of $(\sigma/\mu)\log(1/\varepsilon)$ weak learners. Thus AGBMR is significantly better than GBM only if the condition number is large, i.e. $\sigma/\mu \gg 1$. When $\ell$ is the least-squares loss, we would see no advantage of acceleration. However, for more complicated functions with $\sigma/\mu \gg 1$ (e.g. logistic loss or exp loss), AGBMR might result in an ensemble that is significantly better (e.g. obtaining lower training loss) than that of GBM for the same number of weak learners.
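The counting argument behind the restart scheme can be sketched numerically. The helper below is an illustration under assumptions rather than a transcription of Algorithm 3: it picks the inner length $M$ as the smallest value making the factor $1/(\mu\eta\gamma(M+1)^2)$ at most $1/2$, assumes each restart then halves the optimality gap, and counts two weak learners per AGBM iteration. The function name and example constants are hypothetical.

```python
import math

def restart_schedule(eps, initial_gap, eta, gamma, mu):
    """Return (inner iterations per restart, number of restarts, total weak learners)."""
    # Smallest M such that 1 / (mu * eta * gamma * (M + 1)^2) <= 1/2.
    inner = max(0, math.ceil(math.sqrt(2.0 / (mu * eta * gamma)) - 1.0))
    # Each restart halves the gap, so log2(initial_gap / eps) restarts suffice.
    restarts = max(0, math.ceil(math.log2(initial_gap / eps)))
    # AGBM adds two weak learners per iteration.
    return inner, restarts, 2 * inner * restarts

# Hypothetical constants: eta = 1/sigma with sigma = 1/4, small mu and gamma.
inner, restarts, total = restart_schedule(
    eps=1e-3, initial_gap=1.0, eta=4.0, gamma=0.1, mu=0.01)
```

The inner length grows like $1/\sqrt{\mu\eta\gamma}$, so with $\eta = 1/\sigma$ the total count scales with $\sqrt{\sigma/\mu}$ times the logarithmic number of restarts.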

5.2 A Vanilla Accelerated Gradient Boosting Method

A natural question to ask is whether, instead of adding two learners at each iteration, we can get away with adding only one. Below we show what such an algorithm would look like and argue that it may not always converge.

Following the updates in Equation (5), we can get a direct acceleration of GBM by using the weak learner fitting the gradient. This leads to Algorithm 4.

Algorithm 4 is equivalent to the recently developed accelerated gradient boosting machines algorithm [2, 10]. Unfortunately, it may not always converge to an optimum or may even diverge. This is because $b_{\tau_m}$ from Step (2) is only an approximate fit to $r^m$, meaning that we only take an approximate gradient descent step. While this is not an issue in the non-accelerated version, in Step (2) of Algorithm 4, the momentum term pushes the $h$ sequence to take a large step along the approximate gradient direction. This exacerbates the effect of the approximate direction and can lead to an additive accumulation of error as shown in [9]. In Section 6.1, we see that this is not just a theoretical concern, but that Algorithm 4 also diverges in practice in some situations. Our corrected residual $c^m$ in Algorithm 2 was crucial to the theoretical proof of convergence in Theorem 4.2. One extension could be to introduce $\gamma$ in step (5) of Algorithm 4 just as in Algorithm 2.
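For comparison with the two-learner scheme, a single-learner variant can be sketched as below. Algorithm 4 is not reproduced in this section, so the sketch assumes it simply applies the updates in (5) with one best-fit weak learner reused for both $f$ and $h$. It further assumes the least-squares loss, one-dimensional features, and regression stumps; the helper names are hypothetical.

```python
import math

def fit_stump(xs, targets):
    """Least-squares stump on 1-D inputs; returns its fitted values on xs."""
    best = None
    for thr in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right) if right else lm
        fitted = [lm if x <= thr else rm for x in xs]
        sse = sum((t - v) ** 2 for t, v in zip(targets, fitted))
        if best is None or sse < best[0]:
            best = (sse, fitted)
    return best[1]

def sq_loss(ys, preds):
    return 0.5 * sum((y - p) ** 2 for y, p in zip(ys, preds))

def vagbm(xs, ys, num_iters=20, eta=1.0):
    n = len(xs)
    f, h = [0.0] * n, [0.0] * n
    loss_g, loss_f = [], []
    for m in range(num_iters):
        theta = 2.0 / (m + 2)
        g = [(1 - theta) * fi + theta * hi for fi, hi in zip(f, h)]
        r = [y - gi for y, gi in zip(ys, g)]       # pseudo-residual w.r.t. g
        b = fit_stump(xs, r)                       # single inexact learner
        f = [gi + eta * bi for gi, bi in zip(g, b)]
        # The same inexact fit is amplified by 1/theta with no error correction.
        h = [hi + (eta / theta) * bi for hi, bi in zip(h, b)]
        loss_g.append(sq_loss(ys, g))
        loss_f.append(sq_loss(ys, f))
    return f, loss_g, loss_f

xs = [i / 20 for i in range(20)]
ys = [math.sin(6 * x) for x in xs]
f_final, loss_g, loss_f = vagbm(xs, ys)
```

Each step still decreases the loss relative to $g^m$, but nothing controls how far the momentum sequence $h$ drifts when the fit `b` differs from `r`, which is the accumulation of error discussed above.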

6 Numerical Experiments

In this section, we present the results of computational experiments and discuss the performance of AGBM with trees as weak-learners. Subsection 6.1 demonstrates that the algorithm described in Section 5.2 may diverge numerically; Subsection 6.2 shows training and testing performance for GBM and AGBM with different parameters; and Subsection 6.3 compares the performance of GBM and AGBM with best tuned parameters. The code for the numerical experiments will also be open-sourced.

Datasets: Table 2 summarizes the basic statistics of the LIBSVM datasets that were used. For each dataset, we randomly choose 80% as the training dataset and the remainder as the testing dataset.

AGBM with CART trees: In our experiments, all algorithms use CART trees as the weak learners. For classification problems, we use the logistic loss function, and for regression problems, we use the least squares loss. To reduce the computational cost, for each split and each feature, we consider 100 quantiles (instead of potentially all $n$ values). These strategies are commonly used in implementations of GBM like [7, 30].

6.1 Evidence that VAGBM May Diverge

Figure 1 shows the training loss versus the number of trees for the housing dataset with two step-sizes $\eta$, for VAGBM and for AGBM with different parameters $\gamma$. The $x$-axis is the number of trees added to the ensemble (recall that AGBM adds two trees to the ensemble per iteration, so the number of boosting iterations of VAGBM and AGBM differs for the same number of trees). As we can see, when $\eta$ is large, the training loss for VAGBM diverges very fast, while AGBM with a proper parameter $\gamma$ converges. When $\eta$ gets smaller, the training loss for VAGBM may decay faster than that of AGBM at the beginning, but it gets stuck and never converges to the true optimal solution. Eventually the training loss of VAGBM may even diverge. On the other hand, our theory guarantees that AGBM always converges to the optimal solution.

6.2 AGBM Sensitivity to the hyperparameters

In this section we look at how the two parameters $\eta$ and $\gamma$ affect the performance of AGBM. Figure 2 shows the training loss and the testing loss versus the number of trees for the a1a dataset with two different step-sizes $\eta$ (recall that AGBM adds two trees per iteration). When the step-size is large (with logistic loss, the largest step-size that guarantees convergence is $\eta = 1/\sigma = 4$), the training loss decays very fast, and the traditional GBM can converge even faster than AGBM at the beginning. But the testing performance suffers, demonstrating that such fast convergence (due to the learning rate) can result in severe overfitting. In this case, AGBM with a proper parameter $\gamma$ has slightly better testing performance. When the step-size becomes smaller, the testing performance of all algorithms becomes more stable, though training becomes slower. AGBM with a proper $\gamma$ may require fewer iterations/trees to reach good training/testing performance.

6.3 Experiments with Fine Tuning

In this section we look at the testing performance of GBM, VAGBM and AGBM on six datasets with hyperparameter tuning. We use trees of a fixed depth as weak-learners and stop splitting early when the gain falls below a small threshold. The hyper-parameters we tuned, and their ranges, are:

• step size (): for least squares loss and for logistic loss;

• number of trees: ;

• momentum parameter (only for AGBM): .

For each dataset, we randomly choose 80% as the training dataset, and the remainder is used as the final testing dataset. We use cross validation on the training dataset to tune the hyperparameters. Instead of going through all possible hyper-parameters, we utilize randomized search (RandomizedSearchCV in scikit-learn). As AGBM has more parameters (namely $\gamma$), we ran proportionally more iterations of random search for AGBM. Table 3 presents the performance of GBM, VAGBM and AGBM with the tuned parameters. As we can see, the accelerated methods (AGBM and VAGBM) in general require fewer iterations to get similar or slightly better testing performance than GBM. Compared with VAGBM, AGBM adds two trees per iteration, which can be more expensive, but the performance of AGBM can be more stable. For example, the testing error of VAGBM on the housing dataset is much larger than that of AGBM.

7 Conclusion

In this paper, we proposed a novel Accelerated Gradient Boosting Machine (AGBM), proved its rate of convergence and introduced a computationally inexpensive practical variant of AGBM that takes advantage of strong convexity of the loss function and achieves linear convergence. Finally, we demonstrated with a number of numerical experiments the effectiveness of AGBM over conventional GBM in obtaining a model with good training and/or testing data fidelity.

Below we include some additional discussions which could not fit into the main paper but which nevertheless help to understand the relevance of our results when applied to frameworks typically used in practice.

A.1 Line search in Boosting

Traditionally the analysis of gradient boosting methods has focused on algorithms which use line search to select the step-size $\eta$ (e.g. Algorithm 1). Analysis of gradient descent suggests that this is not necessary: using a fixed step-size of $1/\sigma$, where $\ell$ is $\sigma$-smooth, is sufficient [22]. Our accelerated Algorithm 2 also adopts this fixed step-size strategy. In fact, even the standard boosting libraries (XGBoost and TFBT) typically use a fixed (but tuned) step-size and avoid an expensive line search.

A.2 Use of Hessian

Popular boosting libraries such as XGBoost [7] and TFBT [30] compute the Hessian and perform a Newton boosting step instead of gradient boosting. Since the Newton step may not be well defined (e.g. if the Hessian is degenerate), an additional Euclidean regularizer is also added. This has been shown to improve performance and reduce the need for a line-search for the parameter sequence [34, 32]. For LogitBoost (i.e. when $\ell$ is the logistic loss), [34] demonstrate that a trust-region Newton's method can indeed significantly improve the convergence. Leveraging similar results in second-order methods for convex optimization (e.g. [27, 17]) and adapting accelerated second-order methods [24] would be an interesting direction for future work.

A.3 Out-of-sample Performance

Throughout this work we focus only on minimizing the empirical training loss (see Formula (2)). In reality what we really care about is the out-of-sample error of our resulting ensemble $f^M$. A number of regularization tricks such as i) early stopping [38], ii) pruning [7, 30], iii) smaller step-sizes [30], iv) dropout [30] etc. are usually employed in practice to prevent over-fitting and improve generalization. Since AGBM requires far fewer iterations than GBM to achieve the same training loss, it outputs a much sparser set of learners. We believe this is partially the reason for its better out-of-sample performance. However, a joint theoretical study of the out-of-sample error along with the empirical error is much needed. It would also shed light on the effectiveness of the numerous ad-hoc regularization techniques currently employed.

Appendix B Proof of Theorem 4.1

This section proves our major theoretical result in the paper:

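Theorem 4.1 Consider Accelerated Gradient Boosting Machine (Algorithm 2). Suppose $\ell$ is $\sigma$-smooth, and the step-size $\eta \le 1/\sigma$ and the momentum parameter $\gamma$ are chosen as in the main text. Then for all $M \ge 0$, we have:

$$ L(f^M) - L(f^\star) \le \frac{1}{2\eta\gamma(M+1)^2}\,\|f^\star(X)\|_2^2 . $$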

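Let us start with some new notation. Define the scalar constant $s := \gamma/\Theta^2$, together with a second scalar constant $t$; the specific values of $s$ and $t$ are needed only in Lemma B. Then define

$$ \alpha_m := \frac{\eta\gamma}{\theta_m} = \frac{\eta s \Theta^2}{\theta_m}, $$

so that the definitions of the sequences $\{r^m\}$, $\{c^m\}$, and $\{\hat h^m(X)\}$ from Algorithm 2 can be simplified as:

$$ \begin{aligned} \theta_m &= \frac{2}{m+2} \\ r^m &= -\left[\frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)}\right]_{i=1,\dots,n} \\ c^m &= r^m + \frac{\alpha_{m-1}}{\alpha_m}\big(c^{m-1} - b_{\tau_{m-1,2}}(X)\big) \\ \hat h^{m+1}(X) &= \hat h^m(X) + \alpha_m r^m. \end{aligned} $$

The sequence $\hat h^m(X)$ is in fact closely tied to the sequence $h^m(X)$, as we show in the next lemma. For notational convenience, we set $c^{-1} = 0$ and similarly $b_{\tau_{-1,2}}(X) = 0$ throughout the proof.

$$ \hat h^{m+1}(X) = h^{m+1}(X) + \alpha_m\big(c^m - b_{\tau_{m,2}}(X)\big). $$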
Proof.

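Observe that

$$ \hat h^{m+1}(X) = \sum_{j=0}^{m} \alpha_j r^j \quad \text{and that} \quad h^{m+1}(X) = \sum_{j=0}^{m} \alpha_j b_{\tau_{j,2}}(X). $$

Then we have

$$ \begin{aligned} \hat h^{m+1}(X) - h^{m+1}(X) &= \sum_{j=0}^{m} \alpha_j \big(r^j - b_{\tau_{j,2}}(X)\big) \\ &= \sum_{j=0}^{m} \alpha_j \Big(r^j - \frac{\alpha_{j-1}}{\alpha_j} b_{\tau_{j-1,2}}(X)\Big) - \alpha_m b_{\tau_{m,2}}(X) \\ &= \sum_{j=0}^{m} \alpha_j \Big(c^j - \frac{\alpha_{j-1}}{\alpha_j} c^{j-1}\Big) - \alpha_m b_{\tau_{m,2}}(X) \\ &= \sum_{j=0}^{m} \big(\alpha_j c^j - \alpha_{j-1} c^{j-1}\big) - \alpha_m b_{\tau_{m,2}}(X) \\ &= \alpha_m\big(c^m - b_{\tau_{m,2}}(X)\big), \end{aligned} $$

where the third equality is due to the definition of $c^m$. ∎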

Lemma B presents the fact that there is sufficient decay of the loss function:

$$ L(f^{m+1}) \le L(g^m) - \frac{\eta\Theta^2}{2}\,\|r^m\|^2. $$
Proof.

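Recall that $\tau_{m,1}$ is chosen such that

$$ \tau_{m,1} = \arg\min_{\tau \in \mathcal{T}} \|b_\tau(X) - r^m\|^2. $$

Since the class of learners $\mathcal{B}$ is scalable (Assumption 2), we have

$$ \begin{aligned} \|b_{\tau_{m,1}}(X) - r^m\|^2 &= \min_{\tau \in \mathcal{T}} \min_{\sigma \in \mathbb{R}} \|\sigma b_\tau(X) - r^m\|^2 \\ &= \|r^m\|^2 \Big(1 - \max_{\tau \in \mathcal{T}} \cos\big(r^m, b_\tau(X)\big)^2\Big) \\ &\le \|r^m\|^2 \big(1 - \Theta^2\big), \end{aligned} \tag{8} $$

where the last inequality is because of the definition of $\Theta$, and the second equality is due to the simple fact that for any two vectors $a$ and $b$, $\min_{\sigma \in \mathbb{R}} \|\sigma b - a\|^2 = \|a\|^2\big(1 - \cos(a, b)^2\big)$.

Now recall that $r^m_i = -\frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)}$ and that $f^{m+1} = g^m + \eta\, b_{\tau_{m,1}}$. Since the loss function $\ell$ is $\sigma$-smooth and the step-size satisfies $\eta \le 1/\sigma$, it holds that

$$ \begin{aligned} L(f^{m+1}) &= \sum_{i=1}^{n} \ell\big(y_i, f^{m+1}(x_i)\big) \\ &\le \sum_{i=1}^{n} \ell\big(y_i, g^m(x_i) + \eta\, b_{\tau_{m,1}}(x_i)\big) \\ &\le \sum_{i=1}^{n} \Big(\ell(y_i, g^m(x_i)) + \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)}\big(\eta\, b_{\tau_{m,1}}(x_i)\big) + \frac{\sigma}{2}\big(\eta\, b_{\tau_{m,1}}(x_i)\big)^2\Big) \\ &\le \sum_{i=1}^{n} \Big(\ell(y_i, g^m(x_i)) + \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)}\big(\eta\, b_{\tau_{m,1}}(x_i)\big) + \frac{\eta}{2}\big(b_{\tau_{m,1}}(x_i)\big)^2\Big) \\ &= \sum_{i=1}^{n} \Big(\ell(y_i, g^m(x_i)) - r^m_i\big(\eta\, b_{\tau_{m,1}}(x_i)\big) + \frac{\eta}{2}\big(b_{\tau_{m,1}}(x_i)\big)^2\Big) \\ &= L(g^m) - \eta\,\langle r^m, b_{\tau_{m,1}}(X)\rangle + \frac{\eta}{2}\,\|b_{\tau_{m,1}}(X)\|^2 \\ &= L(g^m) + \frac{\eta}{2}\,\|b_{\tau_{m,1}}(X) - r^m\|^2 - \frac{\eta}{2}\,\|r^m\|^2 \\ &\le L(g^m) - \frac{\Theta^2\eta}{2}\,\|r^m\|^2, \end{aligned} $$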

where the final inequality follows from (8). This furnishes the proof of the lemma. ∎

Lemma B is a basic fact about convex functions, and it is commonly used in the convergence analysis of accelerated methods.

For any function $f$ and $m \ge 0$,

$$ L(g^m) + \theta_m \langle r^m, h^m(X) - f(X)\rangle \le \theta_m L(f) + (1 - \theta_m) L(f^m). $$
Proof.

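For any function $f$, it follows from the convexity of the loss function $\ell$ that

$$ \begin{aligned} L(g^m) + \langle r^m, g^m(X) - f(X)\rangle &= \sum_{i=1}^{n} \Big(\ell(y_i, g^m(x_i)) + \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)}\big(f(x_i) - g^m(x_i)\big)\Big) \\ &\le \sum_{i=1}^{n} \ell(y_i, f(x_i)) = L(f). \end{aligned} \tag{9} $$

Substituting $f = f^m$ in (9), we get

$$ L(g^m) + \langle r^m, g^m(X) - f^m(X)\rangle \le L(f^m). \tag{10} $$

Also recall that $g^m = (1 - \theta_m) f^m + \theta_m h^m$. This can be rewritten as

$$ \theta_m\big(g^m(X) - h^m(X)\big) = (1 - \theta_m)\big(f^m(X) - g^m(X)\big). \tag{11} $$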

Putting (9), (10), and (11) together:

 L(gm)+θm\inp∗rmhm(X)−f(X) = L(gm)+θm\inp∗rmgm(X)−f(X)+θm\inp∗rmhm(X)−gm(X) = θm[L(gm)+\inp∗rmgm