# Accelerated proximal boosting

Gradient boosting is a prediction method that iteratively combines weak learners to produce a complex and accurate model. From an optimization point of view, the learning procedure of gradient boosting mimics a gradient descent on a functional variable. This paper proposes to build upon the proximal point algorithm when the empirical risk to minimize is not differentiable. In addition, the novel boosting approach, called accelerated proximal boosting, benefits from Nesterov's acceleration in the same way as gradient boosting [Biau et al., 2018]. Advantages of leveraging proximal methods for boosting are illustrated by numerical experiments on simulated and real-world data. In particular, we exhibit a favorable comparison over gradient boosting regarding convergence rate and prediction accuracy.


## 1 Introduction

Boosting is a celebrated machine learning technique, both in statistics and data science. In broad outline, boosting combines simple models (called weak learners) to build a more complex and accurate model. This assembly is performed iteratively, taking into account the performance of the model built at the previous iteration. The way this information is taken into account leads to several variants of boosting, the most famous being AdaBoost [Freund and Schapire, 1997] and gradient boosting [Friedman, 2001].

The reasons for the success of boosting are twofold: i) from the statistical point of view, boosting is an additive model with an iteratively growing complexity. In this sense, boosting lies between ensemble methods and high-capacity models. In practice, it combines the best of both worlds by reducing both the variance and the bias of the risk; ii) from the data science perspective, boosting is a model whose fitting is computationally cheap. At the same time, it can quickly produce highly complex models, and is thus able to perform accurately on difficult learning tasks. As an ultimate feature, the iterative process makes finding the frontier between under- and overfitting quite easy. In particular, gradient boosting combined with decision trees (often referred to as gradient tree boosting) is currently regarded as one of the best off-the-shelf learning techniques in data challenges.

As explained by Biau et al. [2018], gradient boosting has its roots in Freund and Schapire's work on combining classifiers, which resulted in the AdaBoost algorithm [Schapire, 1990, Freund, 1995, Freund and Schapire, 1996, 1997]. Later, Friedman and his colleagues developed a novel boosting procedure inspired by the numerical optimization literature, nicknamed gradient boosting [Friedman et al., 2000, Friedman, 2001, 2002]. Such a connection of boosting between statistics and optimization had already been stated in several previous analyses by Breiman [Breiman, 1997, 1998, 1999, 2000, 2004] and reviewed as functional optimization [Mason et al., 2000b, a, Meir and Rätsch, 2003, Bühlmann and Hothorn, 2007]: boosting can be seen as an optimization procedure (similar to gradient descent), aimed at minimizing an empirical risk over the set of linear combinations of weak learners. In this respect, a few theoretical studies prove the convergence, from an optimization point of view, of boosting procedures [Zhang, 2002, 2003, Wang et al., 2015] and particularly of gradient boosting [Temlyakov, 2012, Biau and Cadre, 2017]. Yet, the topic of optimization (or statistical) convergence is beyond the scope of this paper and, as a consequence, will not be covered for the proposed method.

It is quite surprising that, in gradient boosting (and its variants), the number of weak learners controls both the number of optimization steps performed to minimize the empirical risk and the statistical complexity of the final predictor. This early stopping feature of gradient boosting can be seen as an iterative regularization mechanism used to prevent overfitting [Lin et al., 2016]. As a consequence, beyond the numerical learning procedure of gradient boosting, its statistical performance deeply relies on the optimization algorithm employed, especially as early stopping operates jointly with another regularization mechanism: the control of the model complexity.

That being said, one may wonder whether gradient descent is really a good option. Following this direction, several alternatives have been proposed, such as using the Frank-Wolfe algorithm instead of a gradient descent [Wang et al., 2015], incorporating second-order information [Chen and Guestrin, 2016], and applying Nesterov's acceleration [Biau et al., 2018]. All these variants rely on differentiable loss functions. The contribution of the work described here is to go a step beyond accelerated gradient boosting [Biau et al., 2018] by proposing a procedure for learning boosted models with non-differentiable loss functions, together with a potential acceleration feature.

To go into details, Section 2 reviews boosting with respect to the empirical risk minimization principle and illustrates the flaw of the current learning procedure in a simple non-differentiable case: least absolute deviations. Then, some background on non-smooth optimization is provided in Section 3, where we explain the main contribution of this paper: adapting the proximal point algorithm [Nesterov, 2004] (and its Nesterov acceleration) to boosting. The proposed method is nicknamed accelerated proximal boosting. A second contribution is the derivation of the weight of each weak learner for accelerated descents (including accelerated gradient boosting [Biau et al., 2018]). Finally, the numerical study described in Section 4 sheds light on the advantages and limitations of the proposed boosting procedure.

## 2 Problem and notation

Let 𝒳 be an arbitrary input space and 𝒴 ⊆ ℝ an output space. Given a pair of random variables (X, Y) ∈ 𝒳 × 𝒴, supervised learning aims at explaining Y given X, thanks to a function f: 𝒳 → ℝ. In this context, f may represent several quantities depending on the task at hand, the most notable examples being the conditional expectation and the conditional quantiles of Y given X for regression, as well as the regression function for binary classification. Often, this target function is a minimizer of the risk E[ℓ(Y, f(X))] over all integrable functions f, where ℓ is a suitable convex loss function (respectively the square loss and the pinball loss in the regression examples previously mentioned).

Since the distribution of (X, Y) is generally unknown, the minimization of the risk is out of reach. One rather deals with its empirical version instead. Let (X₁, Y₁), …, (Xₙ, Yₙ) be a training sample, iid according to the distribution of (X, Y), and F a class of functions integrable with respect to the distribution of X. In this work, we consider estimating f thanks to an additive model, that is f = ∑_{t=0}^{T} w_t g_t for an unknown integer T and unknown sequences (w_t) ⊂ ℝ and (g_t) ⊂ F, by solving the following optimization problem:

    minimize_{f ∈ span F} C(f),   (P1)

where

    C(f) = E_n[ℓ(Y, f(X))] = (1/n) ∑_{i=1}^{n} ℓ(Y_i, f(X_i))

is the empirical risk and span F = {∑_{t=0}^{T} w_t g_t : T ∈ ℕ, w_t ∈ ℝ, g_t ∈ F} is the set of all linear combinations of functions in F (ℕ being the set of non-negative integers).

As a simple example, let us consider the regression model Y = f(X) + ε, where ε is normally distributed and independent of X. We aim at solving:

    minimize_{f ∈ span F} E_n[|Y − f(X)|],

with F being the set of regression trees of bounded depth.

Two boosting machines are learned (with the number of iterations T fixed): a traditional one with a subgradient-type method, and another one with the proposed proximal-based procedure. Figure 1 depicts the predictions of both models (left) and the training error along the iterations (right).

From an optimization perspective, it appears clearly that the subgradient method fails to minimize the empirical risk (the prediction is far from the data and the training error gets stuck above a positive level), while the proximal-based procedure constantly improves the objective. The subgradient method faces a convergence flaw, in all likelihood due to the non-differentiability of the absolute value function. This simple example illustrates, inside the boosting paradigm, a well-known fact in numerical optimization: proximal-based algorithms prevail over subgradient techniques.
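This contrast can be reproduced on a toy version of the problem. The sketch below is our own construction, not the paper's experiment: it optimizes the prediction vector z directly (skipping the weak-learner fitting step) and compares a fixed-step subgradient method with the proximal point method on the empirical absolute loss (1/n) ∑ |y_i − z_i|, whose proximal operator is separable and shrinks each coordinate toward y_i by at most γ/n.

```python
# Toy comparison on minimizing (1/n) * sum_i |y_i - z_i| over z in R^n.
# A sketch only: the weak-learner fitting of boosting is deliberately omitted.

def lad(z, y):
    """Empirical least-absolute-deviations risk."""
    return sum(abs(yi - zi) for yi, zi in zip(y, z)) / len(y)

def subgrad_step(z, y, step):
    """One fixed-step subgradient step (moves each coordinate by a constant amount)."""
    n = len(y)
    # A subgradient of (1/n)|y_i - z_i| w.r.t. z_i is -sign(y_i - z_i)/n.
    return [zi + step * ((1 if yi > zi else -1 if yi < zi else 0) / n)
            for yi, zi in zip(y, z)]

def prox_step(z, y, gamma):
    """One proximal point step: the prox of gamma * (1/n) * sum_i |y_i - .|
    is separable and moves each coordinate toward y_i by at most gamma/n."""
    n = len(y)
    out = []
    for yi, zi in zip(y, z):
        d = zi - yi
        if d > gamma / n:
            out.append(zi - gamma / n)
        elif d < -gamma / n:
            out.append(zi + gamma / n)
        else:
            out.append(yi)  # within the shrinkage band: lands exactly on y_i
    return out

y = [1.0, -2.0, 3.0, 0.5]
z_sub = [0.0] * 4
z_prox = [0.0] * 4
for _ in range(200):
    z_sub = subgrad_step(z_sub, y, step=0.1)   # oscillates near the kinks
    z_prox = prox_step(z_prox, y, gamma=0.1)   # monotone decrease, exact fit here
```

With these (arbitrary) values, the proximal iterates reach the data exactly, while the fixed-step subgradient iterates keep hovering around it, mirroring the plateau observed in Figure 1.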

## 3 Algorithm

There is an ambiguity in Problem (P1): it is a functional optimization problem but, in practice, we do not necessarily have the mathematical tools to apply standard optimization procedures (in particular concerning differentiation of C). For this reason, C is often regarded as a function from ℝⁿ to ℝ, considering that it depends on f only through the values f(X₁), …, f(Xₙ). To make this remark more precise, let, for all z ∈ ℝⁿ, C̃(z) = (1/n) ∑_{i=1}^{n} ℓ(Y_i, z_i). Then, it is enough to remark that, for any f, C(f) = C̃((f(X₁), …, f(Xₙ))), where (f(X₁), …, f(Xₙ)) is the vector of values of f computed on the training sample.

Having this remark in mind helps solve Problem (P1), for instance considering that differentiating C with respect to f is roughly equivalent to differentiating C̃ with respect to the values f(Xᵢ) (for all observed Xᵢ), thus taking in fact the gradient of C̃. Doing so, the only requirement is to maintain a functional variable while applying standard vectorial optimization procedures. This requires approximating some vectors by a function, which is at the heart of functional optimization methods such as the ones used in boosting [Mason et al., 2000b].

From now on, all necessary computations on C with respect to f can be forwarded to C̃. For instance, if ℓ is differentiable with respect to its second argument, we can define, for all f, the functional gradient of C at f as the gradient of C̃ evaluated at (f(X₁), …, f(Xₙ)). On the contrary, if ℓ is not differentiable, we may consider a subgradient of C at f, defined as any subgradient of C̃ at (f(X₁), …, f(Xₙ)).
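As an illustration, take the squared loss ℓ(y, u) = (y − u)²/2 (our illustrative choice) and write C̃(z) = (1/n) ∑ ℓ(Yᵢ, zᵢ) for the vectorized risk. The sketch below shows that the i-th coordinate of the gradient of C̃ only involves the pair (Yᵢ, zᵢ):

```python
# Vectorized view of the empirical risk, sketched for the squared loss
# l(y, u) = (y - u)^2 / 2 (the choice of loss is ours, for illustration).

def C_tilde(z, y):
    """Empirical risk seen as a function of the vector z = (f(X_1), ..., f(X_n))."""
    n = len(y)
    return sum((yi - zi) ** 2 / 2 for yi, zi in zip(y, z)) / n

def grad_C_tilde(z, y):
    """Gradient of C_tilde; its i-th coordinate only involves (Y_i, z_i)."""
    n = len(y)
    return [(zi - yi) / n for yi, zi in zip(y, z)]

z = [0.2, -0.5, 1.0]   # candidate predictions f(X_i)
y = [0.0, 1.0, 2.0]    # observed labels Y_i
g = grad_C_tilde(z, y)
```

In boosting, the vector −g is then approximated by a weak learner to recover a functional update.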

In the forthcoming sections, a common first-order optimization algorithm is reviewed. Then, it is explained how to build different procedures for solving Problem (P1), according to the properties of the loss function ℓ.

### 3.1 Accelerated proximal gradient method

Let us assume for a while that we want to minimize the function F = g + h, where g is convex and differentiable (with L-Lipschitz continuous gradient, L > 0), and h is convex and lower semi-continuous. Then the iterative procedure defined by choosing any x₀ = v₀ and by setting for all t ∈ ℕ:

    x_{t+1} = prox_{γh}(v_t − γ∇g(v_t))
    v_{t+1} = x_{t+1} + α_{t+1}(x_{t+1} − x_t),

where the step size γ ∈ (0, 1/L] and the extrapolation parameter α_{t+1} will be made precise thereafter, is known to converge to a minimizer of F [Nesterov, 2004]. The rate of convergence depends on the choice of α_{t+1}: if α_{t+1} = 0 for all t, then the previous procedure boils down to the well-known proximal gradient method, which converges in O(1/t). More formally, assuming that F has a minimizer x*, then F(x_t) − F(x*) = O(1/t). On the other hand, if one chooses the sequence (α_t) defined recursively by:

    β₀ = 0
    β_{t+1} = (1 + √(1 + 4β_t²)) / 2,   t ∈ ℕ
    α_{t+1} = (β_t − 1) / β_{t+1},   t ∈ ℕ,   (1)

then the convergence rate becomes O(1/t²). This is in the spirit of the acknowledged acceleration proposed by Nesterov [1983], and generalized to the composite setting by Beck and Teboulle [2009].
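As a concrete illustration, here is a minimal sketch of the accelerated proximal gradient method on the scalar composite problem F(x) = ½(x − b)² + λ|x| (the problem instance is our choice), whose minimizer is known in closed form (soft-thresholding of b), with the β/α sequence of Equation (1):

```python
# Accelerated proximal gradient sketch on F(x) = 0.5*(x - b)^2 + lam*|x|.
# Here g(x) = 0.5*(x - b)^2 has 1-Lipschitz gradient, so any gamma in (0, 1] works.

def soft_threshold(u, tau):
    """prox of tau * |.| evaluated at u."""
    if u > tau:
        return u - tau
    if u < -tau:
        return u + tau
    return 0.0

def accelerated_proximal_gradient(b, lam, gamma, n_iter):
    """Gradient step on g, prox step on h, plus Nesterov's extrapolation
    with the beta/alpha sequence of Equation (1)."""
    x_prev = x = v = 0.0
    beta = 0.0
    for _ in range(n_iter):
        # forward (gradient) step on g, backward (prox) step on h
        x_prev, x = x, soft_threshold(v - gamma * (v - b), gamma * lam)
        # Nesterov's extrapolation
        beta_next = (1.0 + (1.0 + 4.0 * beta * beta) ** 0.5) / 2.0
        alpha = (beta - 1.0) / beta_next
        beta = beta_next
        v = x + alpha * (x - x_prev)
    return x

x_star = soft_threshold(2.0, 0.5)   # closed-form minimizer: 1.5
x_hat = accelerated_proximal_gradient(b=2.0, lam=0.5, gamma=0.3, n_iter=500)
```

The iterates converge to the closed-form solution; with γ = 1 the first gradient step would already land on b, so a smaller step is used to make the iterative behavior visible.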

Depending on the properties of the objective function to minimize, the procedure described before leads to two simple algorithms (each coming with its accelerated version):

• the gradient descent:

    x_{t+1} = v_t − γ∇g(v_t),

which minimizes a single function g as soon as it is convex and differentiable with Lipschitz-continuous gradient;

• the proximal point algorithm:

    x_{t+1} = prox_{γh}(v_t) = v_t − γ[(1/γ)(v_t − prox_{γh}(v_t))],

which minimizes a single function h, only required to be convex and lower semi-continuous (in that case, there is no restriction on the step size γ, except being positive).
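The proximal point iteration can be made concrete on the scalar nonsmooth function h(x) = |x − c| (an illustrative choice of ours): each step moves the iterate toward c by at most γ, for any positive step size, and the objective decreases monotonically.

```python
# Proximal point sketch on the nonsmooth function h(x) = |x - c|.
# No Lipschitz-gradient condition is needed: any gamma > 0 is allowed.

def prox_abs_shifted(u, c, gamma):
    """prox_{gamma * h}(u) for h(x) = |x - c|: move toward c by at most gamma."""
    if u - c > gamma:
        return u - gamma
    if u - c < -gamma:
        return u + gamma
    return c  # within distance gamma of c: land exactly on the minimizer

c, gamma = 3.0, 0.5
xs = [10.0]
for _ in range(30):
    xs.append(prox_abs_shifted(xs[-1], c, gamma))

values = [abs(x - c) for x in xs]   # objective values along the iterations
```

The objective values form a nonincreasing sequence, and the iterates reach the minimizer c exactly, in line with the descent property discussed next.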

Without acceleration (i.e. with α_t = 0 for all t), the proximal gradient method (as well as its two children) has the asset of being a descent method: at each iteration, the objective function monotonically decreases, meaning that F(x_{t+1}) ≤ F(x_t), with convergence rate at least O(1/t) (O(1/t²) with Nesterov's acceleration). In particular, this is true when minimizing a single convex and lower semi-continuous function h, even if it is not differentiable.

This has to be put in contrast with the subgradient method: x_{t+1} = x_t − γ_t u_t, where γ_t > 0 and u_t is any subgradient of h at x_t. This procedure, which is very similar to gradient descent but with the gradient replaced by any subgradient, has a convergence rate in O(1/√t) in the best case (when the step size γ_t is well chosen) [Nesterov, 2004]. In addition, this rate is optimal: it cannot be improved without extra assumptions on h [Nesterov, 2004, Theorem 3.2.1]. This means that there does not exist an acceleration scheme for this approach. This remark motivates the use of procedures different from the subgradient method when minimizing a non-differentiable function, such as the proximal point algorithm described above.

### 3.2 Gradient boosting

Let us assume that span F contains the constant functions on 𝒳. Then, a simple procedure to approximately solve Problem (P1) is gradient boosting, described in Algorithm 1 [Mason et al., 2000a, Friedman, 2001]. It builds the requested additive model in an iterative fashion, by imitating a gradient (or subgradient, if ℓ is not differentiable with respect to its second argument) method. At each iteration t, Algorithm 1 finds a function g_{t+1} ∈ F that approximates the opposite of a subgradient of C (also called pseudo-residues) and adds it to the model with a weight νγ_{t+1}, where ν ∈ (0, 1] is a shrinkage coefficient (also called learning rate) and γ_{t+1} is the optimal step size given the direction g_{t+1}. At the end of the procedure, the proposed estimator of f is f_T = ∑_{t=0}^{T} w_t g_t (the weights w_t are detailed in Section 3.4).

Algorithm 1 requires a number of iterations T, which acts on two regularization mechanisms. The first one is statistical (T controls the complexity of the subspace in which f_T lies) and the second one is numerical (T controls the precision to which the empirical risk is minimized). The shrinkage coefficient ν tunes the balance between these two regularization mechanisms.

Algorithm 1 has two variants according to the way the subgradient of C is approximated (respectively by correlation or by least squares). The first one closely relates to AdaBoost [Mason et al., 2000a], while the second one is officially known as gradient boosting [Friedman, 2001].

Let us remark that the line search (Line 4 of Algorithm 1) simply scales the weak learner by a constant factor. However, when the class F is a set of regression trees, g_{t+1} is a piecewise constant function. In this case, it is common to perform a line search sequentially for each leaf of the decision tree [Friedman, 2001] (called a multiple line search). As a consequence, each level of the piecewise constant function is scaled with its own factor.
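For the least absolute deviations loss, the multiple line search has a closed form: the optimal constant for a leaf is the median of the residuals falling into it. A minimal sketch, with hand-picked residuals and leaf assignments standing in for a fitted tree:

```python
# Multiple line search sketch for the least-absolute-deviations criterion:
# for each leaf, the constant c minimizing sum_i |r_i - c| is the median of
# the residuals r_i in that leaf (illustrative values below, not a real tree).

def median(v):
    s = sorted(v)
    m = len(s) // 2
    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2.0

def per_leaf_steps(residuals, leaf_ids):
    """Optimal constant per leaf for the LAD criterion."""
    by_leaf = {}
    for r, leaf in zip(residuals, leaf_ids):
        by_leaf.setdefault(leaf, []).append(r)
    return {leaf: median(v) for leaf, v in by_leaf.items()}

residuals = [1.0, 2.0, 100.0, -1.0, -3.0]
leaf_ids  = [0,    0,   0,     1,    1]
steps = per_leaf_steps(residuals, leaf_ids)
```

Note how the outlier 100.0 in leaf 0 does not drag the leaf value away from the bulk of its residuals, in contrast to a least-squares (mean) fit.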

### 3.3 Boosting with non-differentiable loss functions

When the function ℓ is not differentiable with respect to its second argument, gradient boosting just uses a subgradient instead of the gradient. This is, of course, convenient but, as explained previously, far from leading to interesting convergence properties. For this reason, we propose a new procedure for non-differentiable loss functions ℓ, which consists in adapting the proximal point algorithm [Nesterov, 2004] to functional optimization.

For any z ∈ ℝⁿ, let r_γ(z) = (1/γ)(prox_{γC̃}(z) − z), where γ > 0 is a parameter. The simple idea underlying the proposed algorithm, nicknamed accelerated proximal boosting, is to replace the negative subgradient direction by r_γ(z), remarking that z + γ r_γ(z) = prox_{γC̃}(z) is exactly the iteration update of the proximal point method. In addition, this iterative procedure can be sped up by applying Nesterov's acceleration, reviewed in Section 3.1.

The accelerated proximal boosting procedure is described in Algorithm 2. It is very similar to Algorithm 1, except that the pseudo-residues are now given by a proximal operator instead of a subgradient, and that Nesterov’s acceleration is on.

After approximating the direction of optimization (also called pseudo-residues) by g_{t+1} ∈ F, the iterate update (Line 11 in Algorithm 2) becomes (see Section 3.1):

    f_{t+1} = f_t + α_t(f_t − f_{t−1}) + νγ_{t+1} g_{t+1}.   (2)

Similarly to regular gradient boosting (Algorithm 1), the estimator returned at the end of Algorithm 2 can be written f_T = ∑_{t=0}^{T} w_t g_t, where the weights w_t are now slightly more complicated (this is explained in the next section).
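The whole procedure can be sketched in a few lines. The toy implementation below is our simplification, not the paper's setup: depth-1 stumps fitted by exhaustive search on a one-dimensional feature, the least absolute deviations loss, fixed ν and γ, and no line search. It combines the separable prox of the LAD risk with Nesterov's extrapolation.

```python
# Accelerated proximal boosting sketch (LAD loss, depth-1 stump weak learners).
# The function returns the training predictions of the final model.

def prox_lad(z, y, gamma):
    """prox of gamma * (1/n) * sum_i |y_i - z_i|: shrink each coordinate
    toward y_i by at most gamma/n (separable soft-shrinkage)."""
    n = len(y)
    return [zi - max(-gamma / n, min(gamma / n, zi - yi))
            for zi, yi in zip(z, y)]

def fit_stump(x, r):
    """Least-squares depth-1 regression stump approximating the pseudo-residues r."""
    best = None
    for s in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((ri - (ml if xi <= s else mr)) ** 2 for xi, ri in zip(x, r))
        if best is None or err < best[0]:
            best = (err, s, ml, mr)
    _, s, ml, mr = best
    return lambda xi: ml if xi <= s else mr

def mae(z, y):
    return sum(abs(a - b) for a, b in zip(z, y)) / len(y)

def accelerated_proximal_boost(x, y, gamma=1.0, nu=0.5, n_iter=200):
    n = len(y)
    f_prev = [0.0] * n   # f_{t-1} evaluated on the sample
    f_cur = [0.0] * n    # f_t evaluated on the sample
    beta = 0.0
    for _ in range(n_iter):
        beta_next = (1.0 + (1.0 + 4.0 * beta * beta) ** 0.5) / 2.0
        alpha = (beta - 1.0) / beta_next
        beta = beta_next
        v = [fc + alpha * (fc - fp) for fc, fp in zip(f_cur, f_prev)]
        p = prox_lad(v, y, gamma)
        residues = [(pi - vi) / gamma for pi, vi in zip(p, v)]  # prox direction
        stump = fit_stump(x, residues)                          # approximate it in F
        f_prev = f_cur
        f_cur = [vi + nu * gamma * stump(xi) for vi, xi in zip(v, x)]
    return f_cur

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.5, 1.0, 5.0, 5.5, 6.0]
fit = accelerated_proximal_boost(x, y)
```

On this tiny dataset the training MAE decreases well below the error of the null model, without ever differentiating the absolute value.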

Let us remark that the idea of applying Nesterov's acceleration to boosting originally appeared in [Biau et al., 2018] for a gradient-type procedure. Even though Biau et al. [2018] did not suggest applying such an acceleration scheme when ℓ is not differentiable, this idea appears natural. However, it is not entirely relevant, since it contradicts optimization theory (see Section 3.1): the subgradient method cannot be accelerated. This flaw clearly motivates using proximal-based methods for non-differentiable boosting, as proposed in Algorithm 2.

### 3.4 Weights with Nesterov’s acceleration

As an additive model, it is of interest to express f_T with respect to the base learners g_t and their weights w_t: f_T = ∑_{t=0}^{T} w_t g_t. On the one hand, in the non-accelerated case (α_t = 0 for all t), the update rule of Algorithm 1 is simply f_{t+1} = f_t + νγ_{t+1} g_{t+1}. Therefore the weights are defined by w_0 = 1 and w_t = νγ_t for all positive integers t.

On the other hand, when Nesterov's acceleration is on (α_t is defined by Equation (1)), the update rule becomes (see Line 11 in Algorithm 2):

    f_{t′+1} = (1 + α_{t′}) f_{t′} − α_{t′} f_{t′−1} + νγ_{t′+1} g_{t′+1},

for all positive integers t′. Let us denote, for each iteration t′, f_{t′} = ∑_{t=0}^{t′} w_t^{(t′)} g_t the expansion of f_{t′}. Then

    f_{t′+1} = ∑_{t=0}^{t′−1} ((1 + α_{t′}) w_t^{(t′)} − α_{t′} w_t^{(t′−1)}) g_t + (1 + α_{t′}) w_{t′}^{(t′)} g_{t′} + νγ_{t′+1} g_{t′+1}.

First, we see that the weights of g_{t′} and g_{t′+1} in the expansion of f_{t′+1} are respectively:

    w_{t′}^{(t′+1)} = (1 + α_{t′}) w_{t′}^{(t′)}
    w_{t′+1}^{(t′+1)} = νγ_{t′+1}.

Second, for each t < t′, the weight of g_t in the expansion of f_{t′+1} is defined by:

    w_t^{(t′+1)} = (1 + α_{t′}) w_t^{(t′)} − α_{t′} w_t^{(t′−1)}.

Therefore, considering that weights take the value 0 before being defined, i.e. w_t^{(t′)} = 0 for t′ < t, we have:

    w_t^{(t′+1)} − w_t^{(t′)} = α_{t′}(w_t^{(t′)} − w_t^{(t′−1)}) = (∏_{k=t}^{t′} α_k)(w_t^{(t)} − w_t^{(t−1)}) = (∏_{k=t}^{t′} α_k) w_t^{(t)}.

It follows that:

    w_t^{(t′+1)} = w_t^{(t′)} + (∏_{k=t}^{t′} α_k) w_t^{(t)} = w_t^{(t)} + ∑_{j=t}^{t′} (∏_{k=t}^{j} α_k) w_t^{(t)} = (1 + ∑_{j=t}^{t′} ∏_{k=t}^{j} α_k) w_t^{(t)}.

Then, working out the boundary cases and remarking that, for all t ≥ 1, w_t^{(t)} = νγ_t (while w_0^{(0)} = 1 for the initial constant g_0), we can conclude that the weights of f_T are:

    w_0 = 1
    w_1 = νγ_1
    w_t = (1 + ∑_{j=t}^{T−1} ∏_{k=t}^{j} α_k) νγ_t,   ∀t ∈ {2, …, T−1}
    w_T = νγ_T.

From a computational perspective, it may be more efficient to update the weights, at each iteration t, according to the following recursion:

    w_0^{(0)} = 1
    w_1^{(0)} = νγ_1
    w_1^{(1)} = νγ_1
    w_j^{(t+1)} = (w_j^{(t)} − w_j^{(t−1)})(1 + α_t) + w_j^{(t−1)},   ∀j ∈ {1, …, t}
    w_{t+1}^{(t+1)} = νγ_{t+1}.   (3)
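The weight recursion above can be checked against the accelerated update itself: running the updates on scalar "weak learners" and expanding the final model with the recursively computed weights must give the same value. A small self-contained sketch (the α sequence follows Equation (1); here the model starts from f₀ = 0 rather than a fitted constant, so the leading weight w₀ = 1 of the initializer is omitted):

```python
# Sanity check: the weight recursion reproduces the accelerated update
# f_{t+1} = (1 + a_t) f_t - a_t f_{t-1} + nu * gamma_{t+1} * g_{t+1},
# so the final model is indeed sum_t w_t g_t (scalar g's for simplicity).

def alphas(T):
    """Nesterov sequence of Equation (1)."""
    beta, out = 0.0, []
    for _ in range(T):
        beta_next = (1.0 + (1.0 + 4.0 * beta * beta) ** 0.5) / 2.0
        out.append((beta - 1.0) / beta_next)
        beta = beta_next
    return out

def run(T, nu, gammas, gs):
    """Accelerated updates on scalar predictions; returns f_T."""
    a = alphas(T)
    f_prev, f = 0.0, nu * gammas[0] * gs[0]   # f_0 = 0, then first learner
    for t in range(1, T):
        f_prev, f = f, (1 + a[t]) * f - a[t] * f_prev + nu * gammas[t] * gs[t]
    return f

def weights(T, nu, gammas):
    """Weight recursion: w^{(t+1)}_j from w^{(t)}_j and w^{(t-1)}_j."""
    a = alphas(T)
    w_old = [0.0] * T   # weights of f_{t-1}
    w = [0.0] * T       # weights of f_t
    w[0] = nu * gammas[0]
    for t in range(1, T):
        w_new = [(w[j] - w_old[j]) * (1 + a[t]) + w_old[j] for j in range(T)]
        w_new[t] = nu * gammas[t]   # weight of the newly added learner
        w_old, w = w, w_new
    return w

T, nu = 8, 0.4
gammas = [1.0, 0.8, 1.2, 0.5, 1.0, 0.9, 1.1, 0.7]
gs = [2.0, -1.0, 0.5, 3.0, -2.0, 1.5, 0.25, -0.75]
f_direct = run(T, nu, gammas, gs)
w_list = weights(T, nu, gammas)
f_weighted = sum(w * g for w, g in zip(w_list, gs))
```

Both computations agree up to floating-point rounding, and the last weight is νγ_T, as in the closed form above.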

## 4 Numerical analysis

In Section 3, proximal boosting (Algorithm 2) has been introduced in a fairly general way. However, following the success of gradient boosting, the empirical results presented in this section only relate to a least-squares approximation of the pseudo-residues, with decision trees of bounded depth as base learners (class F) and a multiple line search.

In the whole section, the four methods involved in the numerical comparison are nicknamed:

Gradient (slow): gradient boosting (Algorithm 1) [Friedman, 2001];

Gradient (fast): accelerated gradient boosting [Biau et al., 2018];

Proximal (slow): Algorithm 2 without acceleration (α_t = 0, for all t);

Proximal (fast): Algorithm 2 with Nesterov's acceleration (α_t defined by Equation (1)).

### 4.1 Impact of parameters on algorithm behaviors

This section aims at numerically illustrating, based on synthetic data, the performance of our proximal boosting algorithm, and at highlighting the benefits of coupling Nesterov's acceleration scheme with proximal boosting. For this purpose, two synthetic models are studied (see description below), both coming from [Biau et al., 2016, 2018]. The other models considered in [Biau et al., 2018] have also been studied, but results are not reported because they are very similar to those of the two models we focus on.

Regression: the additive regression model of Biau et al. [2016, 2018].

Classification: the binary classification model with covariate interactions of Biau et al. [2016, 2018], whose definition involves the indicator function.

The first model covers an additive regression problem, while the second covers a binary classification task with covariate interactions. In both cases, we consider an input random variable X, the covariates of which are either uniformly distributed (uncorrelated design) or normally distributed with zero mean and a given covariance matrix (correlated design). In these synthetic models, an additive and independent noise (centered and normally distributed) is embodied by the random variable ε.

Four different losses are considered (see Table 1 for a brief description): least squares and least absolute deviations for regression; exponential and hinge for classification. Computations for the corresponding (sub)gradients and proximal operators are detailed in Appendix A. Note that we also considered other kinds of losses, such as the pinball loss for regression and the logistic loss for classification (see Table 1). Nevertheless, since the numerical behaviors are respectively very close to those of the least absolute deviations and exponential cases, the results are not reported.

In the following numerical experiments, the random sample generated from each model is divided into a training set (50%) used to fit the method and a validation set (50%). The performance of the methods is appraised thanks to several curves representing the training and validation losses along the iterations with which the algorithms are run.

#### 4.1.1 Learning rate

This subsection tackles the impact of the learning rate parameter ν on the relative performances of both proximal and accelerated proximal methods. Throughout, the proximal step γ is kept fixed. The convergence rates for several values of ν are illustrated (i) in the case of the regression model, for the least squares loss in Figure 2 and for the least absolute deviations loss in Figure 3; (ii) in the case of the classification model, for the exponential loss in Figure 4 and for the hinge loss in Figure 5.

Some general observations can be drawn, which hold true for all losses and for both correlated and uncorrelated designs: as one might expect, a higher learning rate leads to a faster convergence rate. In addition, accelerated boosting always converges faster than the vanilla variant, but shows unpleasant behaviors for high values of the learning rate. Those behaviors depend on the loss and can be: divergence (least squares and least absolute deviations), plateaus (exponential) and numerous oscillations (hinge).

As far as we can observe, the minimum validation error achieved depends on the shrinkage factor ν, and the value of ν attaining it also depends on whether acceleration is activated.

#### 4.1.2 Proximal step

In this subsection, we study the effect of the proximal step parameter γ on the performance of both proximal and accelerated proximal boosting methods. Throughout, the learning rate ν is kept fixed. The convergence rates for several values of γ are illustrated, in the case of the regression model, for the least squares loss in Figure 6 and for the least absolute deviations loss in Figure 7. In the case of the classification model, numerical results for the exponential and hinge losses are depicted respectively in Figure 8 and Figure 9. In all cases, we can observe that the non-accelerated methods have not converged after 5000 iterations.

For the least squares loss, γ does not seem to have an impact on proximal boosting, which yields performances similar to gradient boosting. For the accelerated variants, there are only a few differences when γ varies, and accelerated proximal boosting seems to achieve a lower training error than accelerated gradient boosting for some values of γ.

Concerning least absolute deviations, γ induces large differences in the training curves, both for regular and accelerated proximal boosting. Besides, (accelerated) proximal boosting achieves a lower training error than (accelerated) gradient boosting for large values of γ. These observations carry over to the exponential loss, except that the improvements of proximal boosting occur for small values of γ and are less pronounced.

Finally, proximal boosting with the hinge loss depends a lot on the parameter γ, but always decreases the training error faster than gradient boosting. For the accelerated version of proximal boosting, unpleasant behaviors may occur, and γ should be chosen adequately.

To conclude, the convergence rate as well as the generalization ability of proximal boosting depend on the shrinkage coefficient ν. In addition, the parameter γ (which only exists for proximal boosting) has a noticeable impact on the decrease of the training error. There is no general rule for choosing appropriate values of these parameters, but with such values, (accelerated) proximal boosting demonstrates favorable performances compared to (accelerated) gradient boosting.

### 4.2 Comparison in real-world cases

This section aims at assessing the generalization ability and the size of the final model for the proposed approach (see Section 3.3), in comparison to known variants. Intuitively, Algorithm 2 is expected to behave better than gradient-type boosting when the loss function is not differentiable. Benefits are expected on the generalization ability (proximal methods are able to minimize C, even with ℓ non-differentiable, as much as one likes) and on the number of iterations (or weak learners) necessary for producing accurate predictions (weak learners are more likely to minimize C if they are based on proximal methods).

Comparison is based on datasets (available on the UCI Machine Learning Repository) whose characteristics are described in Table 2. The first two are univariate regression datasets, while the others relate to binary classification problems. In both situations, the sample is split into a training set (50%), a validation set (25%) and a test set (25%). The parameters of the methods (number of weak learners T, learning rate ν and proximal step γ) are selected as minimizers of the loss computed on the validation set for models fitted on the training set. Then, models are refitted on the training and validation sets with the selected parameters. Finally, the generalization capability of the methods is estimated by computing the loss (and the misclassification rate for classification models) on the test set. These quantities are reported thanks to statistics computed on 20 random splits of the datasets.

The losses considered in these experiments are least squares, least absolute deviations and pinball for the regression problems, as well as exponential and hinge for the classification tasks (see Table 1 for a quick definition and Appendix A for the details). For usual regression (least squares loss), the four methods described above are also compared to random forests [Breiman, 2001], with a number of trees selected in accordance with the evaluation procedure described above.

#### 4.2.1 Regression problems

Test losses for the least squares (left), least absolute deviations (middle) and pinball (right) losses are described in Figure 10, along with the number of weak learners selected in Figure 11. One can observe that, with respect to both test losses and size of the final predictor, all the boosting methods yield comparable results for the least squares loss (which is differentiable), with better performance than random forests. However, regarding the least absolute deviations and pinball losses, proximal approaches achieve similar or better accuracy with equally sized or smaller final models.

In addition, these results confirm an observation stated in Biau et al. [2018]: accelerated boosting (with gradient and proximal-based directions) produces models roughly as accurate as usual boosting, but with far fewer weak learners.

For the two non-differentiable loss functions considered here (least absolute deviations and pinball losses), the good performance of subgradient-based boosting can be explained as follows: in this numerical experiment, we are interested in the test loss, which does not require minimizing the training loss entirely. Thus, even though subgradient techniques may not converge, the decrease of the empirical loss may be sufficient for achieving good generalization performance.

#### 4.2.2 Classification problems

Losses (left) and misclassification rates (right) computed on the test dataset are depicted in Figure 12 for the exponential (top) and hinge (bottom) losses. One can observe that for the exponential loss, which is a differentiable function, all methods achieve comparable errors and accuracies, up to a slight loss for the accelerated variants. This may be explained by the limited set of values offered to the learning rate ν in our experimental setting.

However, for the non-differentiable hinge loss, proximal-based boosting clearly outperforms gradient-based techniques, concerning both the loss value and the misclassification rate. This is in agreement with what was expected.

In addition, Figure 13 shows that the selected numbers of weak learners are generally greater for proximal boosting (but decrease dramatically when Nesterov's acceleration is activated). This unfavorable comparison for proximal boosting can be explained by the selected values of the shrinkage coefficient ν, which differ across methods.

## 5 Conclusion

This paper has introduced a novel boosting algorithm, along with an accelerated variant, which are appealing for non-differentiable loss functions ℓ. The main idea is to use a proximal-based direction of optimization, coupled with Nesterov's acceleration (as already introduced to boosting by Biau et al. [2018]).

Numerical experiments on synthetic data show a significant impact of the newly introduced parameter γ, but also improvements over regular gradient boosting for adequate values of γ. Moreover, in real-world situations, the proposed proximal boosting algorithm achieves comparable or better accuracies than gradient boosting and random forests, depending on the loss employed and the dataset. Finally, it does not appear that proximal boosting needs noticeably fewer weak learners than gradient boosting but, in any case, the size of the final model can be dramatically reduced by activating Nesterov's acceleration.

We believe that the connection between boosting and functional optimization deserves much further investigation. In particular, advances in optimization theory can spread to boosting, just as the recently revisited Frank-Wolfe algorithm has impacted boosting [Jaggi, 2013, Wang et al., 2015]. This may also hold true for non-differentiable and non-convex optimization (see for instance [Ochs et al., 2014]).

## Acknowledgement

The authors are thankful to Gérard Biau and Jalal Fadili for enlightening discussions.

## References

• Beck and Teboulle [2009] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
• Biau and Cadre [2017] G. Biau and B. Cadre. Optimization by gradient boosting. arXiv:1707.05023 [cs, math, stat], 2017.
• Biau et al. [2016] G. Biau, A. Fischer, B. Guedj, and J.D. Malley. COBRA: A combined regression strategy. Journal of Multivariate Analysis, 146:18–28, 2016.
• Biau et al. [2018] G. Biau, B. Cadre, and L. Rouvière. Accelerated Gradient Boosting. arXiv:1803.02042 [cs, stat], 2018.
• Breiman [1997] L. Breiman. Arcing the Edge. Technical Report 486, Statistics Department, University of California, Berkeley, 1997.
• Breiman [1998] L. Breiman. Arcing classifier (with discussion and a rejoinder by the author). The Annals of Statistics, 26(3):801–849, 1998.
• Breiman [1999] L. Breiman. Prediction Games and Arcing Algorithms. Neural Computation, 11(7):1493–1517, 1999.
• Breiman [2000] L. Breiman. Some Infinite Theory for Predictor Ensembles. Technical Report 577, Statistics Department, University of California, Berkeley, 2000.
• Breiman [2001] L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
• Breiman [2004] L. Breiman. Population theory for boosting ensembles. The Annals of Statistics, 32(1):1–11, 2004.
• Bühlmann and Hothorn [2007] P. Bühlmann and T. Hothorn. Boosting Algorithms: Regularization, Prediction and Model Fitting. Statistical Science, 22(4):477–505, 2007.
• Chen and Guestrin [2016] T. Chen and C. Guestrin. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, New York, NY, USA, 2016. ACM.
• Freund [1995] Y. Freund. Boosting a Weak Learning Algorithm by Majority. Information and Computation, 121(2):256–285, 1995.
• Freund and Schapire [1996] Y. Freund and R.E. Schapire. Experiments with a New Boosting Algorithm. In Proceedings of the Thirteenth International Conference on International Conference on Machine Learning, San Francisco, CA, USA, 1996.
• Freund and Schapire [1997] Y. Freund and R.E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
• Friedman [2001] J. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
• Friedman [2002] J. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, February 2002.
• Friedman et al. [2000] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407, 2000.
• Jaggi [2013] M. Jaggi. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In Proceedings of The 30th International Conference on Machine Learning, pages 427–435, 2013.
• Lin et al. [2016] J. Lin, L. Rosasco, and D.-X. Zhou. Iterative Regularization for Learning with Convex Loss Functions. Journal of Machine Learning Research, 17(77):1–38, 2016.
• Mason et al. [2000a] L. Mason, J. Baxter, P.L. Bartlett, and M. Frean. Boosting Algorithms as Gradient Descent. In S.A. Solla, T.K. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems, pages 512–518. MIT Press, 2000a.
• Mason et al. [2000b] L. Mason, J. Baxter, P.L. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 221–246. The MIT Press, 2000b.
• Meir and Rätsch [2003] R. Meir and G. Rätsch. An Introduction to Boosting and Leveraging. In Advanced Lectures on Machine Learning, Lecture Notes in Computer Science, pages 118–183. Springer, Berlin, Heidelberg, 2003.
• Nesterov [1983] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27, 1983.
• Nesterov [2004] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
• Ochs et al. [2014] P. Ochs, Y. Chen, T. Brox, and T. Pock. iPiano: Inertial Proximal Algorithm for Nonconvex Optimization. SIAM Journal on Imaging Sciences, 2014.
• Schapire [1990] R.E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
• Temlyakov [2012] V.N. Temlyakov. Greedy expansions in convex optimization. Proceedings of the Steklov Institute of Mathematics, 2012.
• Wang et al. [2015] C. Wang, Y. Wang, W. E, and R. Schapire. Functional Frank-Wolfe Boosting for General Loss Functions. arXiv:1510.02558 [cs, stat], 2015.
• Zhang [2002] T. Zhang. A General Greedy Approximation Algorithm with Applications. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1065–1072. MIT Press, 2002.
• Zhang [2003] T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49(3):682–691, March 2003.

## Appendix A Implementation details

As explained previously, given a loss function ℓ, gradient and proximal boosting aim at minimizing the risk functional

 C(f) = (1/n) ∑_{i=1}^n ℓ(Y_i, f(X_i)) = D(z_n(f))

for f in the linear span of a class of weak learners g : 𝒳 → ℝ, where z_n(f) = (f(X_1), …, f(X_n)) and the loss ℓ(y, z) measures the cost incurred by predicting z when the answer is y.

In the forthcoming subsections, implementation details are given for six popular losses: least squares, least absolute deviations and pinball losses (regression), as well as exponential, logistic and hinge losses (binary classification). In the latter case, with labels in {−1, 1}, the predicted label of a point x is 1 if f(x) ≥ 0 and −1 otherwise.
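To make the notation concrete, here is a minimal sketch (in Python with NumPy, not taken from the paper's implementation) of the empirical risk D(z_n(f)) for the logistic loss together with the sign-based prediction rule; the function names are illustrative.

```python
import numpy as np

def empirical_risk(y, scores):
    """D(z_n(f)) = (1/n) * sum_i log(1 + exp(-Y_i * f(X_i))) for labels in {-1, +1}."""
    return np.mean(np.log1p(np.exp(-y * scores)))

def predict_label(scores):
    """Predicted label of a point: +1 if f(x) >= 0, -1 otherwise."""
    return np.where(scores >= 0, 1, -1)

y = np.array([1, -1, 1, -1])              # binary labels
scores = np.array([2.0, -1.5, 0.0, 0.5])  # values f(X_i) of the current model
risk = empirical_risk(y, scores)
labels = predict_label(scores)            # the last point is misclassified
```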

For each loss, we lay out the following information:

Definition:

the mapping (y, z) ↦ ℓ(y, z) of the loss function.

Initial estimator:

the constant function f_0 minimizing the empirical risk over constant predictions.

Line search:

the optimal step size w along the direction of optimization g followed at the current iterate f.

Proximal operator:

for all z ∈ ℝ^n and γ > 0, prox_{γD}(z) = argmin_{u ∈ ℝ^n} {γ D(u) + (1/2)‖u − z‖²}, computed componentwise since D is separable.

First, for the exponential and the logistic losses, the line search and the proximal operator have no closed-form solution, but their solutions are known to be roots of certain equations. In that case, we perform one or several steps of the Newton-Raphson method to obtain an approximation of the desired quantity.
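For instance, for the logistic loss, each component u of the proximal operator solves the first-order condition u − z_i − (γ/n)·Y_i·(1 − σ(Y_i u)) = 0, with σ the sigmoid. A minimal Newton-Raphson sketch (the function name, initialization at z, and the iteration count are illustrative choices, not prescribed by the paper):

```python
import numpy as np

def prox_logistic_1d(z, y, gamma_over_n, n_steps=10):
    """One component of prox_{gamma * D} for the logistic loss, by Newton-Raphson:
    solve u - z - (gamma/n) * y * (1 - sigmoid(y * u)) = 0 for u."""
    u = z  # initialize at the input point
    for _ in range(n_steps):
        s = 1.0 / (1.0 + np.exp(-y * u))                # sigmoid(y * u)
        residual = u - z - gamma_over_n * y * (1.0 - s)  # optimality residual
        curvature = 1.0 + gamma_over_n * s * (1.0 - s)   # >= 1, so the step is stable
        u -= residual / curvature
    return u
```

Since the curvature term is bounded below by 1, the Newton step is well defined everywhere and a handful of iterations typically suffices.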

Second, when using decision trees as base learners, it is common to perform a separate line search for each leaf of the tree. In that case, the line search may take a simpler form than the one given below.

### a.1 Least squares

Definition:

ℓ(y, z) = (y − z)²/2.

Initial estimator:

f_0 = (1/n) ∑_{i=1}^n Y_i, the empirical mean of the sample (Y_1, …, Y_n).

Line search:

w = ∑_{i=1}^n (Y_i − f(X_i)) g(X_i) / ∑_{i=1}^n g(X_i)², where f is the current iterate and g the direction of optimization.

Proximal operator:

for all i, prox_{γD}(z)_i = (z_i + (γ/n) Y_i) / (1 + γ/n).
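Assuming the squared loss ℓ(y, z) = (y − z)²/2, so that D(u) = (1/(2n)) ∑_i (Y_i − u_i)², the proximal operator has the componentwise closed form (z_i + (γ/n) Y_i)/(1 + γ/n). The sketch below (illustrative names, not the paper's code) checks it numerically against the defining minimization:

```python
import numpy as np

def prox_least_squares(z, y, gamma):
    """Closed-form prox of D(u) = (1/(2n)) * sum_i (Y_i - u_i)^2, componentwise."""
    n = len(y)
    return (z + (gamma / n) * y) / (1.0 + gamma / n)

def prox_objective(u, z, y, gamma):
    """gamma * D(u) + (1/2) * ||u - z||^2, the function prox minimizes."""
    n = len(y)
    return gamma / (2 * n) * np.sum((y - u) ** 2) + 0.5 * np.sum((u - z) ** 2)

y = np.array([1.0, -2.0, 3.0])
z = np.array([0.5, 0.5, 0.5])
p = prox_least_squares(z, y, 2.0)  # should beat any perturbed candidate
```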

### a.2 Least absolute deviations

Definition:

ℓ(y, z) = |y − z|.

Initial estimator:

f_0 is the empirical median of the sample (Y_1, …, Y_n).

Line search:

w is a weighted median of the ratios r_i with weights |g(X_i)| (over the indices i such that g(X_i) ≠ 0), where for all i, r_i = (Y_i − f(X_i)) / g(X_i), f being the current iterate and g the direction of optimization.

Proximal operator:

for all i, prox_{γD}(z)_i = Y_i + sign(z_i − Y_i) max(|z_i − Y_i| − γ/n, 0), a soft-thresholding toward Y_i.
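Assuming ℓ(y, z) = |y − z|, the componentwise prox is a soft-thresholding toward Y_i with threshold γ/n; a quick numeric illustration (names are illustrative):

```python
import numpy as np

def prox_lad(z, y, gamma):
    """Prox of D(u) = (1/n) * sum_i |Y_i - u_i|: shrink z_i toward Y_i by gamma/n."""
    n = len(y)
    d = z - y
    return y + np.sign(d) * np.maximum(np.abs(d) - gamma / n, 0.0)

# With gamma/n = 0.5: the first component is shrunk by 0.5,
# the second lies within the threshold and snaps onto its target Y_i = 0.
out = prox_lad(np.array([3.0, -0.1]), np.array([0.0, 0.0]), 1.0)
```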

### a.3 Pinball

Definition:

for τ ∈ (0, 1), ℓ(y, z) = τ(y − z)_+ + (1 − τ)(z − y)_+, where (·)_+ denotes the positive part.

Initial estimator:

f_0 is the empirical τ-quantile of the sample (Y_1, …, Y_n).

Line search:

the objective is convex and piecewise linear in w, so the minimum is attained at one of the breakpoints w = (Y_i − f(X_i)) / g(X_i) (for g(X_i) ≠ 0), where f is the current iterate and g the direction of optimization.

Proximal operator:

for all i, prox_{γD}(z)_i = Y_i + max(z_i − Y_i − γ(1 − τ)/n, 0) − max(Y_i − z_i − γτ/n, 0).
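Assuming the pinball loss written with positive parts as above, the componentwise prox shrinks z_i toward Y_i asymmetrically: by γ(1 − τ)/n from above and by γτ/n from below. An illustrative sketch:

```python
import numpy as np

def prox_pinball(z, y, gamma, tau):
    """Prox of D(u) = (1/n) * sum_i [tau*(Y_i-u_i)_+ + (1-tau)*(u_i-Y_i)_+],
    componentwise: asymmetric soft-thresholding toward Y_i."""
    n = len(y)
    d = z - y
    return (y + np.maximum(d - gamma * (1 - tau) / n, 0.0)
              - np.maximum(-d - gamma * tau / n, 0.0))
```

For τ = 1/2 both thresholds equal γ/(2n), recovering a (rescaled) least-absolute-deviations prox.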

### a.4 Exponential loss

Definition:

for y ∈ {−1, 1}, ℓ(y, z) = e^{−yz}.

Initial estimator:

f_0 = (1/2) log(p̂ / (1 − p̂)), where p̂ is the proportion of labels Y_i equal to 1.

Line search:

no closed-form solution.

w is the root of ∑_{i=1}^n Y_i g(X_i) e^{−Y_i(f(X_i) + w g(X_i))} = 0, approximated by Newton-Raphson steps.

Proximal operator:

no closed-form solution; each component is approximated by Newton-Raphson steps.

### a.5 Logistic loss

Definition:

for y ∈ {−1, 1}, ℓ(y, z) = log(1 + e^{−yz}).

Initial estimator:

f_0 = log(p̂ / (1 − p̂)), where p̂ is the proportion of labels Y_i equal to 1.

Line search:

no closed-form solution.

w is the root of ∑_{i=1}^n Y_i g(X_i) / (1 + e^{Y_i(f(X_i) + w g(X_i))}) = 0, approximated by Newton-Raphson steps.

Proximal operator:

no closed-form solution; each component is approximated by Newton-Raphson steps.

### a.6 Hinge loss

Definition:

for y ∈ {−1, 1}, ℓ(y, z) = max(0, 1 − yz).

Initial estimator:

f_0 = 1 if p̂ ≥ 1/2 and f_0 = −1 otherwise (the majority label), where p̂ is the proportion of labels Y_i equal to 1.

Line search:

the objective is convex and piecewise linear in w, so the minimum is attained at one of the breakpoints w = (1 − Y_i f(X_i)) / (Y_i g(X_i)) (for g(X_i) ≠ 0).
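Since the hinge objective in the step size is convex and piecewise linear, its minimum (when attained) lies at a breakpoint w = (1 − Y_i f(X_i))/(Y_i g(X_i)). The following sketch (illustrative name, assuming this characterization) evaluates the objective at every breakpoint and keeps the best:

```python
import numpy as np

def hinge_line_search(y, f, g):
    """Minimize (1/n) * sum_i max(0, 1 - Y_i * (f_i + w * g_i)) over w by
    scanning the breakpoints of the piecewise-linear objective."""
    def obj(w):
        return np.mean(np.maximum(0.0, 1.0 - y * (f + w * g)))
    mask = g != 0
    breakpoints = (1.0 - y[mask] * f[mask]) / (y[mask] * g[mask])
    candidates = np.concatenate(([0.0], breakpoints))  # keep w = 0 as a fallback
    values = [obj(w) for w in candidates]
    return float(candidates[int(np.argmin(values))])
```

The cost is quadratic in n as written; sorting the breakpoints and updating the slope incrementally would bring it down to O(n log n).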