Functional Frank-Wolfe Boosting for General Loss Functions

by Chu Wang, et al.

Boosting is a generic learning method for classification and regression. Yet, as the number of base hypotheses grows, boosting can suffer a deterioration of test performance. Overfitting is an important and ubiquitous phenomenon, especially in regression settings. To avoid overfitting, we consider using ℓ_1 regularization. We propose a novel Frank-Wolfe type boosting algorithm (FWBoost) applied to general loss functions. By using the exponential loss, the FWBoost algorithm can be rewritten as a variant of AdaBoost for binary classification. FWBoost algorithms have exactly the same form as existing boosting methods, in terms of making calls to a base learning algorithm, but with a different weight update. This direct connection between boosting and Frank-Wolfe yields a new algorithm that is as practical as existing boosting methods but with new guarantees and rates of convergence. Experimental results show that the test performance of FWBoost is not degraded with more rounds of boosting, which is consistent with the theoretical analysis.





1 Introduction

Boosting is a generic learning method for classification and regression. From a statistical perspective, boosting is often viewed as a method for empirical minimization of an appropriate loss function in a greedy fashion [1, 2, 3, 4]. It is designed to iteratively select, at each step, the base hypothesis from the base hypothesis space H that leads to the largest reduction of empirical risk. The family of combined hypotheses being considered is the set of ensembles F(x) = Σ_t α_t h_t(x) with α_t ∈ ℝ and h_t ∈ H. Boosting procedures have drawn much attention in the machine learning and statistics communities due to their superior empirical performance ever since the first practical boosting algorithm, AdaBoost [5]. Many other generalizations and extensions have since been proposed [2, 4, 6, 1].

For classification, the margin theory [7] explains AdaBoost’s resistance to overfitting, although it can overfit when the base hypotheses are too complex relative to the size of the training set or when the learning algorithm is unable to achieve large margins. Yet none of this theory applies to regression, where overfitting is ubiquitous [8]. Various methods have been proposed to avoid overfitting. For example, the early stopping framework has been considered for different loss functions [9, 10]. Friedman [11] proposed stochastic gradient boosting. Duchi and Singer [12] studied penalties for AdaBoost based on the ℓ_1, ℓ_2, and ℓ_∞ norms of the predictors. A different way to avoid overfitting (and obtain rates of convergence) is to restrict the ℓ_1 norm of the weights of the composite base hypotheses. For classification problems, this point of view is taken up in [13, 14, 15]. As an alternative, shrinkage has been applied [16]. It is surprising that, apart from shrinkage, there are few regularized boosting algorithms for regression problems. There are extensive studies on minimizing the Lasso loss function (ℓ_1-penalized loss) [17, 18, 19], yet most of them cannot be applied to general loss functions as in ensemble learning (a.k.a. boosting) with predictors of the form F(x) = Σ_t α_t h_t(x). Zhao and Yu [20] proposed the Boosted Lasso (BLasso) algorithm, which ties boosting to the ℓ_1-penalized Lasso method with an emphasis on tracing the regularization paths.

In this paper, we study algorithms for loss minimization subject to an explicit constraint on the ℓ_1 norm of the weights. This approach yields a bound on the generalization error that depends only on the constraint and is independent of the number of iterations. In Section 3, we propose a novel Frank-Wolfe type boosting algorithm (FWBoost) for general loss functions. By using the exponential loss for binary classification, the FWBoost algorithm reduces to an AdaBoost-like algorithm (AdaBoost.FW), derived as a coordinate descent method in Section 4. By making a direct connection between boosting and Frank-Wolfe, the FWBoost algorithms have exactly the same form as existing boosting methods, with the same number of calls to a base learner, but with new guarantees and rates of convergence. Experimental results in Section 5 show that the test performance of FWBoost is not degraded with more rounds of boosting, which is consistent with the theoretical analysis in Section 3.

2 Preliminaries

In this section, we briefly overview the gradient boosting algorithm and Frank-Wolfe algorithms.

Gradient Boosting

We assume that the samples (x, y) are independently drawn from the same but unknown distribution D, where x is an instance from the domain X and the univariate y can be continuous (regression problem) or discrete (classification problem). During training, a learning algorithm receives a training set S = {(x_1, y_1), …, (x_m, y_m)}. The goal is to estimate the function/hypothesis F: X → ℝ that minimizes the expected value of some specified loss function ℓ(F(x), y): min_F E[ℓ(F(x), y)]. The hypothesis space being considered is the set of ensembles of base hypotheses from the base hypothesis space H: F = {Σ_t α_t h_t : α_t ∈ ℝ, h_t ∈ H}.

Gradient boosting [2, 3] tries to find an approximation F̂ that minimizes the empirical risk R̂(F) = (1/m) Σ_i ℓ(F(x_i), y_i) on the training set. It does so by iteratively building up the solution in a greedy fashion: F_t = F_{t−1} + α_t h_t.

input : objective f, compact convex domain D, initial x_0 ∈ D
for t = 0 to T do
       s_t := argmin_{s ∈ D} ⟨s, ∇f(x_t)⟩
       x_{t+1} := (1 − γ_t) x_t + γ_t s_t, with γ_t = 2/(t + 2)
end for
Algorithm 1 Frank-Wolfe

However, there is no simple way to exactly solve the problem of choosing, at each step, the best h_t and weight α_t for an arbitrary loss function, since the base hypothesis space is usually large or infinite. A typical strategy is to apply functional gradient descent. For the purpose of minimizing the empirical risk, we only care about the value of F at the points x_1, …, x_m. Thus we can view R̂(F) as a function of the vector of values (F(x_1), …, F(x_m)) and then calculate the negative gradient of R̂: −g_{t,i} = −∂ℓ(F_{t−1}(x_i), y_i)/∂F(x_i). A base hypothesis h_t is chosen to most closely approximate the negative gradient, and a step size is found by line search.
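To make this recipe concrete, here is a minimal numpy sketch of functional gradient descent for the squared loss (an illustration, not the paper's code): the negative gradient is just the residual vector, the base learner is a decision stump, and the shrinkage factor `eta` is an assumed choice.

```python
import numpy as np

def fit_stump(x, r):
    """Best threshold stump h(z) = a*1[z > s] + b*1[z <= s] minimizing ||h - r||^2."""
    best = None
    for s in np.unique(x):
        left, right = r[x <= s], r[x > s]
        b = left.mean() if left.size else 0.0
        a = right.mean() if right.size else 0.0
        err = np.sum((np.where(x > s, a, b) - r) ** 2)
        if best is None or err < best[0]:
            best = (err, s, a, b)
    _, s, a, b = best
    return lambda z: np.where(z > s, a, b)

def gradient_boost(x, y, T=50, eta=0.5):
    """Greedy additive modeling: each round fits a stump to the residuals."""
    F = np.zeros_like(y, dtype=float)
    for _ in range(T):
        residual = y - F               # negative gradient of the squared loss
        h = fit_stump(x, residual)     # base hypothesis approximating -g_t
        F = F + eta * h(x)             # greedy additive update
    return F
```

With the squared loss, each round simply refits the current residuals; this is the L2-boosting special case of the general recipe.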

Frank-Wolfe Algorithms.

The Frank-Wolfe algorithm [21] (also known as the conditional gradient method) is one of the simplest and most popular iterative first-order methods for general constrained optimization problems. Given a continuously differentiable function f and a compact convex domain D, the objective is

min_{x ∈ D} f(x).    (1)

At each iteration, the algorithm searches for the minimizer of the linearized version of the optimization problem under the same constraint and then takes a descent step from the current position towards this minimizer. By confining the step size to [0, 1], the algorithm automatically keeps the current position in the feasible region. It is known that the convergence of Algorithm 1 satisfies f(x_t) − f(x*) = O(1/t), where x* is the solution to (1) [21, 22].
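The iteration above can be sketched on a concrete instance. The following minimal numpy example (an illustration, not from the paper) runs Frank-Wolfe on a least-squares objective over an ℓ_1 ball, whose linear minimization oracle has a closed form:

```python
import numpy as np

def frank_wolfe(A, b, r, T=200):
    """Minimize f(x) = ||Ax - b||^2 over the l1 ball {x : ||x||_1 <= r}."""
    n = A.shape[1]
    x = np.zeros(n)                        # feasible start (0 is in the ball)
    for t in range(T):
        grad = 2 * A.T @ (A @ x - b)       # gradient of f at x
        i = np.argmax(np.abs(grad))        # LMO: best signed vertex r*e_i
        s = np.zeros(n)
        s[i] = -r * np.sign(grad[i])
        gamma = 2.0 / (t + 2)              # standard FW step size
        x = (1 - gamma) * x + gamma * s    # convex combination stays feasible
    return x
```

The linear subproblem over the ℓ_1 ball is solved by inspecting a single coordinate, which is why Frank-Wolfe is attractive when projections are expensive but linear minimization is cheap.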

There are several variants of the classic Frank-Wolfe algorithm, including the line-search and fully corrective variants; see [22] for a recent overview. Another important variant is the use of away steps, as described in [23]. The idea is that at each iteration we may remove an old atom with bad performance instead of adding a new one, which can improve the sparsity of the iterates [24].

3 Functional Frank-Wolfe Boosting for General Loss Functions

In order to prevent boosting algorithms from overfitting, we consider the constrained version of regularized empirical risk minimization:

min_F R̂(F)  subject to  ‖α‖_1 ≤ λ,  where F = Σ_t α_t h_t.    (2)
We will use H_c to denote the subset of H whose norm is bounded by c. Bounding the norm of the base hypotheses is essential: otherwise, if H is closed under scalar multiplication, we could always shrink α and enlarge h to remove the effect of the constraint ‖α‖_1 ≤ λ. It is clear that λ and c are dual parameters and the intrinsic degree of freedom is 1. Thus, without loss of generality, we consider the base hypotheses in H_1.
Since the base hypothesis space is usually large or infinite, we take a functional view, treating the empirical risk R̂ as a functional that takes another function F as input, rather than viewing our objective, as in Eq. (2), as a function of the set of parameters α representing the weights over all the base hypotheses. Define the hypothesis space as F_λ = {Σ_t α_t h_t : ‖α‖_1 ≤ λ, h_t ∈ H_1}. We use F_λ^T to denote the hypothesis space after T-step boosting.

The regularized empirical risk minimization can be written analogously to Eq. (1) in functional space with compact convex domain F_λ as:

min_{F ∈ F_λ} R̂(F).    (3)
We will see that this perspective overcomes possible computational difficulties encountered by optimizing over the set of parameters in coordinate descent for regularized empirical risk minimization and is widely applicable to general loss functions. In fact, by applying Frank-Wolfe algorithms in functional space, it allows the regularized minimization of many loss functions to be reduced to a sequence of ordinary classification problems or least-squares regression problems.

3.1 Functional Frank-Wolfe

For the purpose of optimizing Eq. (3), we only care about the value of F at x_1, …, x_m. Thus we can view R̂(F) as a function of the vector of values (F(x_1), …, F(x_m)). To simplify notation, we use F and this vector interchangeably for any function F when no confusion arises. By analogy of R̂, F and F_λ in Eq. (3) to f, x and D in Eq. (1), respectively, we derive the basic Frank-Wolfe boosting (FWBoost) framework in Algorithm 2.

input : m examples S = {(x_i, y_i)} and constant λ
Initialize F_0 = 0
for t = 0 to T do
       Calculate the negative gradient −g_t: −g_{t,i} = −∂ℓ(F_t(x_i), y_i)/∂F(x_i), i = 1, …, m
       (a) Solve the subproblem h_t = argmax_{h ∈ H_1} Σ_i (−g_{t,i}) h(x_i)
       Update F_{t+1} := (1 − γ_t) F_t + γ_t λ h_t, with γ_t = 2/(t + 2) or chosen by line search
end for
output : F_{T+1}
Algorithm 2 FWBoost: a generic functional Frank-Wolfe algorithm

Let α^(t) denote the parameter vector of F_t at step t, so that F_t = Σ_j α_j^(t) h_j. From step t to t + 1, the update on α is α_j^(t+1) = (1 − γ_t) α_j^(t) for j ≤ t and α_{t+1}^(t+1) = γ_t λ. If ‖α^(t)‖_1 ≤ λ, we have ‖α^(t+1)‖_1 ≤ (1 − γ_t) λ + γ_t λ = λ. Since ‖α^(0)‖_1 = 0 ≤ λ, it is clear that F_t always stays in F_λ.

Recall that in the original Frank-Wolfe algorithm, the linearized subproblem is solved over the whole feasible region, which corresponds to F_λ in the learning setting. Here we are able to use λH_1 instead of F_λ in line (a). This is because the feasible region is determined by linear constraints and the objective function is also linear; it therefore attains its maximum at the vertices, which are exactly the functions λh with h ∈ H_1. The step length can be chosen in several ways, such as line search or the fully corrective variant, as discussed in Section 2.
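The weight bookkeeping described above is easy to check numerically. This small sketch (illustrative, with an assumed step size γ_t = 2/(t + 2)) verifies that the FWBoost update keeps ‖α‖_1 ≤ λ at every round:

```python
# Each FW step scales all existing weights by (1 - gamma) and gives the
# newly selected base hypothesis weight gamma * lam, so the l1 budget
# ||alpha||_1 <= lam is maintained automatically.
def fw_weight_updates(lam=2.0, T=100):
    alpha = []                        # weights of the base hypotheses so far
    for t in range(T):
        gamma = 2.0 / (t + 2)
        alpha = [(1 - gamma) * a for a in alpha]
        alpha.append(gamma * lam)     # weight of the new atom lam * h_t
    return alpha

alpha = fw_weight_updates()
assert sum(abs(a) for a in alpha) <= 2.0 + 1e-9
```

Note that after the first round (γ_0 = 1) the total weight equals λ exactly, and every later step is a convex combination, so the constraint is never active as an explicit projection.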

3.2 Using Classification and Regression for General Loss Functions

To solve the central subproblem h_t = argmax_{h ∈ H_1} Σ_i (−g_{t,i}) h(x_i), we need to find a function in H_1 which, viewed as a vector, has the orientation most similar to −g_t. If each h is constrained to take values in {−1, +1}, the problem of maximizing Σ_i (−g_{t,i}) h(x_i) on each round is equivalent to an ordinary classification problem. To be more specific, let ỹ_i = sign(−g_{t,i}) and w_i = |g_{t,i}| / Σ_j |g_{t,j}|. Then by a standard technique, we have

Σ_i (−g_{t,i}) h(x_i) = (Σ_j |g_{t,j}|) (1 − 2 Σ_i w_i 1[h(x_i) ≠ ỹ_i]),

which means that the maximization subproblem is equivalent to minimizing the weighted classification error. Thus, in this fashion, boosting with a general loss can be reduced to a sequence of classification problems. The resulting practical algorithm is described in Algorithm 3 as FWBoost_C.
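As a sanity check of this reduction, the following brute-force numpy snippet (illustrative only) enumerates all h ∈ {−1, +1}^m on a tiny sample and confirms that the maximizer of Σ_i (−g_i) h(x_i) is exactly the hypothesis with zero weighted classification error:

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=8)                  # a gradient vector on m = 8 points
y_tilde = np.sign(-g)                   # pseudo-labels
w = np.abs(g) / np.abs(g).sum()         # weights proportional to |g_i|

best_corr, best_err = None, None
for bits in range(2 ** 8):              # enumerate all h in {-1,+1}^8
    h = np.array([1 if bits >> i & 1 else -1 for i in range(8)])
    corr = h @ (-g)                     # correlation objective
    err = w @ (h != y_tilde)            # weighted classification error
    if best_corr is None or corr > best_corr:
        best_corr, best_err = corr, err

# the correlation maximizer matches y_tilde exactly, so its weighted error is 0
assert best_err == 0.0
```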

Alternatively, for a general base hypothesis space taking real values, if H is closed under scalar multiplication (which is true for commonly used base hypotheses like regression trees, splines or stumps), we can instead directly minimize the Euclidean distance ‖h − (−g_t)‖_2 and then rescale h into H_1. Finding such an h is itself a least-squares regression problem with the real-valued negative gradient −g_t as the response, as described in Algorithm 4 as FWBoost_R.

Replace line (a) in Algorithm 2 with
       Train base classifier h_t using labels ỹ_i = sign(−g_{t,i}) and weights w_i ∝ |g_{t,i}|
Algorithm 3 FWBoost_C
Replace line (a) in Algorithm 2 with
       Fit base hypothesis h to the residuals −g_t by least-squares regression on the training set: h = argmin_{h ∈ H} Σ_i (h(x_i) + g_{t,i})², then rescale h into H_1
Algorithm 4 FWBoost_R
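One round of the FWBoost_R subproblem might be sketched as follows (a toy illustration: a linear base learner stands in for trees or stumps, and the rescaling step assumes the sup-norm on the sample is used as the norm on H):

```python
import numpy as np

def fwboost_r_step(X, neg_grad):
    """Fit a base hypothesis to -g_t by least squares, then rescale into H_1."""
    coef, *_ = np.linalg.lstsq(X, neg_grad, rcond=None)
    h = X @ coef                          # fitted values of the base hypothesis
    scale = np.max(np.abs(h))             # sup-norm on the training sample
    return h / scale if scale > 0 else h  # valid when H is closed under scaling

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neg_grad = np.array([0.5, -1.0, 2.0])
h = fwboost_r_step(X, neg_grad)
assert np.max(np.abs(h)) <= 1.0 + 1e-12
```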

This is computationally attractive for non-parametric learning problems with a large or infinite number of base hypotheses and general loss functions. First, the computational effort in Frank-Wolfe boosting is exactly the same as in AdaBoost or gradient boosting: refit the residuals, with no other complex optimization procedure needed. This keeps the method efficient even for large-scale datasets. Second, it does not need to identify whether a newly added base hypothesis is the same as one already included in a previous step (as is done, e.g., in BLasso [20]), which is awkward in practice, especially for a large and complex base hypothesis space. It is therefore free from searching through previously added base hypotheses, and the computational cost per iteration does not increase.

Note that if we solve the subproblem exactly and the FW direction turns out not to be a descent direction, then we can conclude that we are already at the global minimum. More details are provided in the supplementary material.

Guélat and Marcotte [23] developed Frank-Wolfe with away steps to directly address the sparsity problem. Here we provide the away-step variant of the Frank-Wolfe boosting method, which solves the constrained regression problem while maintaining the sparsity of the ensemble through away steps that may remove an old base hypothesis with bad performance. The detailed procedure is given in Algorithm 5.

input : m examples S = {(x_i, y_i)} and constant λ
Set F_0 = 0
for t = 0 to T do
       Calculate the negative gradient −g_t
       Solve the subproblem h_t = argmax_{h ∈ H_1} Σ_i (−g_{t,i}) h(x_i); define the FW direction d_FW = λ h_t − F_t
       Solve the subproblem h′_t = argmin over active hypotheses h of Σ_i (−g_{t,i}) h(x_i); define the away direction d_A = F_t − λ h′_t
       if ⟨−g_t, d_FW⟩ ≥ ⟨−g_t, d_A⟩ then
              Perform the FW step: set d = d_FW and γ_max = 1
       else
              Perform the away step: set d = d_A and γ_max = α_{h′}/(1 − α_{h′}), where α_{h′} is the current weight of h′_t
       end if
       Choose γ_t ∈ [0, γ_max] by line search and update F_{t+1} := F_t + γ_t d; if γ_t = γ_max in an away step, drop h′_t from the ensemble
end for
output : F_{T+1}
Algorithm 5 Frank-Wolfe gradient boosting with away steps
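For intuition, here is a self-contained numpy sketch of away-step Frank-Wolfe on an ℓ_1 ball with a quadratic objective (an illustration of the mechanism, not the paper's Algorithm 5 verbatim): the active set holds the vertices currently carrying weight in the convex combination, and away steps move mass off the worst one, which keeps the iterate sparse.

```python
import numpy as np

def away_fw(b, r, T=200):
    """Away-step FW for min ||x - b||^2 over the l1 ball {x : ||x||_1 <= r}."""
    n = b.size
    vert = lambda i, s: r * s * np.eye(n)[i]     # signed vertex r*s*e_i
    active = {(0, 1.0): 1.0}                     # vertex (i, s) -> convex weight
    x = vert(0, 1.0)                             # start at a vertex
    for _ in range(T):
        grad = 2 * (x - b)
        i = int(np.argmax(np.abs(grad)))         # best FW vertex
        s = -1.0 if grad[i] > 0 else 1.0
        d_fw = vert(i, s) - x                    # Frank-Wolfe direction
        aw = max(active, key=lambda v: grad @ vert(*v))
        d_aw = x - vert(*aw)                     # away direction
        if -grad @ d_fw >= -grad @ d_aw:         # compare the two gaps
            d, gmax, is_fw = d_fw, 1.0, True
        else:
            w_aw = active[aw]
            if w_aw >= 1.0:
                break                            # only one vertex: cannot move away
            d, gmax, is_fw = d_aw, w_aw / (1.0 - w_aw), False
        dd = float(d @ d)
        if dd == 0.0:
            break                                # zero direction: at the optimum
        gam = min(max(-(grad @ d) / (2 * dd), 0.0), gmax)  # exact line search
        x = x + gam * d
        if is_fw:
            active = {v: (1 - gam) * w for v, w in active.items()}
            active[(i, s)] = active.get((i, s), 0.0) + gam
        else:
            active = {v: (1 + gam) * w for v, w in active.items()}
            active[aw] -= gam
        active = {v: w for v, w in active.items() if w > 1e-12}  # drop atoms
    return x, active
```

Weights of dropped atoms are pruned from the active set, mirroring the "remove an old base hypothesis" step of the boosting variant.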

3.3 Theoretical Analysis

In this section, we provide theoretical guarantees for empirical risk and generalization of functional Frank-Wolfe boosting for general loss functions.

3.3.1 Rademacher Complexity Bounds

We use Rademacher complexity as a standard tool to capture the richness of a family of functions in the analysis of voting methods [25].

Let G be a family of functions mapping from Z to ℝ, and let S = (z_1, …, z_m) be a fixed i.i.d. sample with elements in Z. The empirical Rademacher complexity of G with respect to S is defined as

R̂_S(G) = E_σ [ sup_{g ∈ G} (1/m) Σ_{i=1}^m σ_i g(z_i) ],

where the σ_i’s are independent uniform random variables taking values in {−1, +1}. We also define the risk and empirical risk of F as R(F) = E[ℓ(F(x), y)] and R̂(F) = (1/m) Σ_i ℓ(F(x_i), y_i), respectively. By Talagrand’s lemma and general Rademacher complexity uniform-convergence bounds [26], we have the following Rademacher complexity bound for general loss functions. The bound is in terms of the empirical risk, the sample size m, and the complexity of F_λ.
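The definition can be estimated by Monte Carlo for a toy class. The following sketch (illustrative, with an assumed class of 21 threshold functions) averages the supremum correlation with random signs:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 50
x = np.sort(rng.uniform(size=m))
# toy base class: threshold functions h_s(x) = sign(x - s), one per row
H = np.array([np.where(x > s, 1.0, -1.0) for s in np.linspace(0.0, 1.0, 21)])

def rademacher(H, trials=2000):
    """Monte Carlo estimate of E_sigma[ sup_h (1/m) sum_i sigma_i h(x_i) ]."""
    m = H.shape[1]
    sups = []
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=m)
        sups.append(np.max(H @ sigma) / m)   # sup over the finite class
    return float(np.mean(sups))

R = rademacher(H)
```

For a finite class of this size, Massart's lemma bounds the quantity by √(2 log |H| / m) ≈ 0.35, and the estimate comes out well below 1.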

Theorem 1.

Let F be a set of real-valued functions. Assume the loss function ℓ is L_ℓ-Lipschitz continuous with respect to its first argument and bounded, i.e. ℓ(F(x), y) ≤ B for all F ∈ F and (x, y). For any δ > 0, with probability at least 1 − δ over a sample S of size m, the following inequality holds for all F ∈ F:

R(F) ≤ R̂(F) + 2 L_ℓ R̂_S(F) + 3B √(log(2/δ) / (2m)).

We next give an upper bound for the Rademacher complexity of the regularized function space after T-step boosting [25]: the complexity of the combined class is the same as that of the base class, multiplied by the constant λ. The proof is left in the supplement.

Theorem 2.

For any sample S of size m, R̂_S(F_λ^T) ≤ λ R̂_S(H_1 ∪ (−H_1)). Further, if the assumption that H is closed under scalar multiplication holds, R̂_S(F_λ^T) = λ R̂_S(H_1).

Based on Theorems 1 and 2, we can see that the risk bound depends on the regularization constant λ and is independent of the number of boosting iterations. Unlike vanilla gradient boosting (even with shrinkage and subsampling to prevent overfitting), the generalization error will not increase even if the algorithm is boosted forever. The Rademacher complexity bounds for regression with squared loss are given in the supplementary material.

3.4 Bounds on Empirical Risk

In this subsection we analyze the convergence rate of Frank-Wolfe boosting on the training set. In the Frank-Wolfe subproblem, we linearize the objective function to obtain a descent direction. To analyze the training error of our algorithm, we first study the second derivative of the loss function in the feasible region. If |ℓ''(f, y)| ≤ C for f in the feasible region, then by Taylor’s theorem ℓ(f + δ, y) ≤ ℓ(f, y) + ℓ'(f, y) δ + (C/2) δ², which gives enough control of the distance between ℓ and its linearization. Noticing that boundedness of the second derivative is preserved under summation, this property passes to the empirical risk R̂. This leads us to the following condition, which we require for our convergence analysis.

Assumption 1 (Smooth Empirical Risk) For the empirical risk function R̂, there exists a constant C_R̂ that depends on the loss ℓ and the predictor space F_λ, such that for any F, G ∈ F_λ and γ ∈ [0, 1], we have R̂((1 − γ)F + γG) ≤ R̂(F) + γ⟨∇R̂(F), G − F⟩ + (C_R̂/2) γ².

This assumption is valid for most commonly used losses on a bounded domain. First, notice that in our constrained boosting setting, for any predictor F ∈ F_λ, Hölder’s inequality gives |F(x)| ≤ ‖α‖_1 · max_t |h_t(x)| ≤ λ. This specifies the range over which Assumption 1 needs to hold. For the squared loss ℓ(f, y) = (f − y)², ℓ'' = 2, so the second-derivative bound C can be chosen as 2. For the exponential loss ℓ(f, y) = e^{−yf} with y ∈ {−1, +1}, ℓ'' = e^{−yf} ≤ e^λ, so C can be chosen as e^λ. For the logistic loss ℓ(f, y) = log(1 + e^{−yf}), ℓ'' = σ(−yf)(1 − σ(−yf)) ≤ 1/4, so C can be chosen as 1/4.
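These second-derivative bounds are easy to verify numerically on the interval |f| ≤ λ. The snippet below (illustrative, with an assumed budget λ = 2) checks the stated constants for the three losses:

```python
import numpy as np

lam = 2.0                              # assumed l1 budget, so |F(x)| <= lam
f = np.linspace(-lam, lam, 1001)       # grid over the feasible predictions

# squared loss  l(f, y) = (f - y)^2         ->  l'' = 2
sq = np.full_like(f, 2.0)
# exponential loss  l(f, y) = exp(-y f)     ->  l'' = exp(-y f) <= exp(lam)
ex = np.exp(-f)                        # y = +1; y = -1 is the mirror image
# logistic loss  l(f, y) = log(1 + e^{-yf}) ->  l'' = s(1 - s) <= 1/4
sig = 1.0 / (1.0 + np.exp(f))
lg = sig * (1.0 - sig)

assert np.all(sq <= 2.0 + 1e-12)
assert np.all(ex <= np.exp(lam) + 1e-9)
assert np.all(lg <= 0.25 + 1e-12)
```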

Theorem 3.

Under Assumption 1, for a convex loss ℓ, suppose we exactly solve the linearized subproblem at each iteration and the step size is chosen as γ_t = 2/(t + 2). Then the training error of the Frank-Wolfe boosting algorithm is bounded: R̂(F_T) − R̂(F*) ≤ C/(T + 2), where F* is the optimal solution to (3) and the constant C depends on the predictor space F_λ, the loss function ℓ, and the initial empirical loss R̂(F_0).

In the learning setting, the subproblem is solved by fitting the residual with least-squares regression or by finding the best classifier. However, the base learning algorithm sometimes only finds an approximate maximizer rather than the global maximizer. Nevertheless, if the problem is solved not too inaccurately at each step, we have the following error bound. The detailed proofs of all theorems in this section are left to the supplementary material.

Theorem 4.

Under the same assumptions and notation, for a positive constant ε, if the linearized subproblem is solved with tolerance ε at every round, that is, Σ_i (−g_{t,i}) h_t(x_i) ≥ max_{h ∈ H_1} Σ_i (−g_{t,i}) h(x_i) − ε, then we have the following empirical risk bound: R̂(F_T) − R̂(F*) ≤ C/(T + 2) + λε.

4 Frank-Wolfe Boosting for classification

AdaBoost is one of the most popular and important methods for the classification problem, where y ∈ {−1, +1}. It is well known that AdaBoost can be derived from the functional gradient descent method using the exponential loss [1, 27]. The margin theory predicts AdaBoost’s resistance to overfitting provided that large margins can be achieved and the base hypotheses are not too complex. Yet overfitting can certainly happen when the base hypotheses cannot achieve large margins, when the noise is overwhelming, or when the space of base hypotheses is too complex. Following the same path, we apply the Frank-Wolfe boosting algorithm to the exponential loss with the ℓ_1 constraint and derive a classification method with a reweighting procedure similar to AdaBoost’s, but with a tunable parameter λ to balance empirical risk and generalization.

Specifically, define the exponential loss ℓ(f, y) = e^{−yf} and a base classifier space H taking values in {−1, +1}. Following the same logic as in Algorithm 3, we first calculate the negative gradient at each round: −g_{t,i} = y_i e^{−y_i F_t(x_i)}. On round t, the goal is to find h_t maximizing Σ_i (−g_{t,i}) h(x_i), which is equivalent to minimizing the weighted classification error with weight vector D_t(i) ∝ e^{−y_i F_t(x_i)}. After h_t has been chosen, the step size γ_t can be selected using the methods already discussed, such as line search. After setting F_{t+1} = (1 − γ_t) F_t + γ_t λ h_t, we are ready to update the weights: D_{t+1}(i) ∝ D_t(i)^{1−γ_t} e^{−γ_t λ y_i h_t(x_i)}.

With the exponential loss, the generic FWBoost algorithm for classification can be rewritten as a reweighting procedure like AdaBoost, given in Algorithm 6, where D_t denotes the vector (D_t(1), …, D_t(m)). AdaBoost.FW is computationally efficient even with a large or infinite hypothesis space, whereas a similar but non-practical counterpart, suitable only for a small number of base hypotheses, is mentioned in [28].

input : m examples S = {(x_i, y_i)} and constant λ
Initialize: D_1(i) = 1/m for i = 1, …, m
for t = 1 to T do
       Train weak hypothesis h_t using distribution D_t, and get step size γ_t (e.g. by line search)
       Choose D_{t+1}(i) := D_t(i)^{1−γ_t} e^{−γ_t λ y_i h_t(x_i)} / Z_t for i = 1, …, m, with Z_t a normalization factor
       F_{t+1} := (1 − γ_t) F_t + γ_t λ h_t
end for
output : sign(F_{T+1})
Algorithm 6 AdaBoost.FW: a functional Frank-Wolfe algorithm for classification
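A minimal simulation of this reweighting scheme (illustrative only: stump base learners, an assumed step size γ_t = 2/(t + 2), and toy one-dimensional data) shows the convex-combination update driving the training error to zero:

```python
import numpy as np

def adaboost_fw(x, y, lam=5.0, T=100):
    """AdaBoost.FW-style loop: exponential loss + Frank-Wolfe update."""
    F = np.zeros(len(y))
    for t in range(T):
        D = np.exp(-y * F); D /= D.sum()          # current example weights
        best = None
        for s in x:                               # base learner: best stump
            for sgn in (1.0, -1.0):
                h = sgn * np.where(x > s, 1.0, -1.0)
                edge = np.sum(D * y * h)          # weighted correlation
                if best is None or edge > best[0]:
                    best = (edge, h)
        h = best[1]
        gamma = 2.0 / (t + 2)                     # assumed FW step size
        F = (1 - gamma) * F + gamma * lam * h     # convex-combination update
    return F

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 40)
y = np.where(x > 0.2, 1.0, -1.0)
F = adaboost_fw(x, y)
assert np.mean(np.sign(F) != y) == 0.0
```

Because F_{t+1} is a convex combination of F_t and λh_t, the exponential weights admit the multiplicative update D_{t+1}(i) ∝ D_t(i)^{1−γ_t} e^{−γ_t λ y_i h_t(x_i)}, which the loop above realizes implicitly by recomputing D from F.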

5 Experimental Results

In this section, we evaluate the proposed methods on UCI datasets [29]. To randomize the experiments, in each run we select 50% of the data as the training set and use the remaining 50% as the test set. Furthermore, in each run, the parameters of each method are tuned by 5-fold cross-validation on the training set. This random split was repeated 20 times and the results are averaged over the 20 runs.

Least-squares Regression

To demonstrate the effectiveness of our proposed method in avoiding overfitting while achieving reasonably good training error, we compare FWBoost_C and FWBoost_R with existing state-of-the-art boosting algorithms (with regularization), including vanilla Gradient Boosting [2], Gradient Boosting with shrinkage [16], Gradient Boosting with early stopping [16], Gradient Boosting with subsampling [11] and BLasso [20] for least-squares regression problems.

The experimental results are shown in Fig. 1. The first row is the empirical risk averaged over 20 runs and the second row is the averaged MSE on the test set. For early stopping and BLasso, if the algorithm stops before reaching the maximum iteration, the MSE is set to its last round’s value. Fig. 1 shows that the FWBoost algorithms reduce the empirical risk considerably quickly. In all the examples where gradient boosting overfits after early iterations, the FWBoost algorithms achieve the smallest averaged test risk among the regularization methods. This is consistent with the theoretical analysis: no degradation in the performance of FWBoost is observed with more rounds of boosting. In addition, it is difficult to observe backward steps of the BLasso algorithm because it usually stops at early iterations; it behaves similarly to the early stopping method.

Figure 1: Comparison of boosting methods on UCI datasets: (a) Auto MPG, (b) Housing, (c) Concrete Slump. The x-axis is the number of boosting iterations. The first row is the averaged empirical risk and the second row is the MSE on the test set.
Binary Classification

In order to demonstrate the effectiveness of AdaBoost.FW, we compare it with AdaBoost on UCI datasets that cause AdaBoost to overfit. The first example is the heart-disease dataset, presented in [27] as an example of overfitting. Decision stumps are used as weak learners. While the training error decreases, the test error of AdaBoost may go up after several iterations. AdaBoost.FW, on the other hand, does not suffer from overfitting and achieves smaller test error.

Figure 2: Comparison of boosting methods on UCI datasets: (a) Heart disease, (b) Planning relax, (c) Wholesale customers. The x-axis is the number of boosting iterations. The first (second) row is the averaged training (test) error over 20 runs.

6 Conclusion

In this paper, we considered regularized loss minimization for ensemble learning as a means of avoiding overfitting. We proposed a novel Frank-Wolfe type boosting algorithm for general loss functions and analyzed its empirical risk and generalization. With the exponential loss for binary classification, the FWBoost algorithm can be rewritten as an AdaBoost-like reweighting algorithm. Furthermore, it is computationally efficient, since the computational effort in FWBoost is exactly the same as refitting residuals in AdaBoost or gradient boosting. We also deployed the away-step variant of Frank-Wolfe to improve the sparsity of the resulting ensembles.


  • [1] Leo Breiman. Prediction games and arcing algorithms. Neural computation, 11(7):1493–1517, 1999.
  • [2] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
  • [3] Llew Mason, Jonathan Baxter, Peter L Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. Advances in Neural Information Processing Systems, pages 221–246, 1999.
  • [4] Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.
  • [5] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • [6] Robert E Schapire. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, pages 149–171. Springer, 2003.
  • [7] Robert E Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of statistics, pages 1651–1686, 1998.
  • [8] Adam J Grove and Dale Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI, pages 692–699, 1998.
  • [9] Peter Bühlmann. Consistency for L2 boosting and matching pursuit with trees and tree-type basis functions. 2002.
  • [10] Tong Zhang and Bin Yu. Boosting with early stopping: convergence and consistency. Annals of Statistics, pages 1538–1579, 2005.
  • [11] Jerome H Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.
  • [12] John Duchi and Yoram Singer. Boosting with structural sparsity. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 297–304. ACM, 2009.
  • [13] Gábor Lugosi and Nicolas Vayatis. On the bayes-risk consistency of regularized boosting methods. Annals of Statistics, pages 30–55, 2004.
  • [14] Yongxin T Xi, Zhen J Xiang, Peter J Ramadge, and Robert E Schapire. Speed and sparsity of regularized boosting. In International Conference on Artificial Intelligence and Statistics, pages 615–622, 2009.
  • [15] Chunhua Shen, Hanxi Li, and Nick Barnes. Totally corrective boosting for regularized risk minimization. arXiv preprint arXiv:1008.5188, 2010.
  • [16] Trevor Hastie, Robert Tibshirani, Jerome Friedman, T Hastie, J Friedman, and R Tibshirani. The elements of statistical learning, volume 2. Springer, 2009.
  • [17] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
  • [18] Michael R Osborne, Brett Presnell, and Berwin A Turlach. On the lasso and its dual. Journal of Computational and Graphical statistics, 9(2):319–337, 2000.
  • [19] Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20(6):2807–2832, 2010.
  • [20] Peng Zhao and Bin Yu. Boosted lasso. Technical report, DTIC Document, 2004.
  • [21] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110, 1956.
  • [22] Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 427–435, 2013.
  • [23] Jacques Guélat and Patrice Marcotte. Some comments on wolfe’s ‘away step’. Mathematical Programming, 35(1):110–119, 1986.
  • [24] Kenneth L Clarkson. Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):63, 2010.
  • [25] Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pages 1–50, 2002.
  • [26] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.
  • [27] Robert E Schapire and Yoav Freund. Boosting: Foundations and algorithms. MIT press, 2012.
  • [28] Paul Grigas, Robert Freund, and Rahul Mazumder. The frank-wolfe algorithm: New results, and connections to statistical boosting. Workshop on Optimization and Big Data, University of Edinburgh, 2013.
  • [29] M. Lichman. UCI machine learning repository, 2013.