1 Introduction
Boosting is a generic learning method for classification and regression. From a statistical perspective, boosting is often viewed as a method for greedy empirical minimization of an appropriate loss function [1, 2, 3, 4]. It is designed to iteratively select, from a base hypothesis space H, the base hypothesis that leads to the largest reduction of empirical risk at each step. The family of combined hypotheses being considered is the set of ensembles of base hypotheses f = Σ_t α_t h_t with α_t ∈ ℝ and h_t ∈ H. Boosting procedures have drawn much attention in the machine learning and statistics communities due to their superior empirical performance ever since the first practical boosting algorithm, AdaBoost [5]. Many other generalizations and extensions have since been proposed [1, 2, 4, 6].
For classification, the margin theory [7] explains AdaBoost's resistance to overfitting, although it can overfit when the base hypotheses are too complex relative to the size of the training set or when the learning algorithm is unable to achieve large margins. Yet this theory does not apply to regression, where overfitting is ubiquitous [8]. Various methods have been proposed to avoid overfitting. For example, the early stopping framework has been considered for different loss functions [9, 10]. Friedman [11] proposed stochastic gradient boosting. Duchi and Singer [12] studied penalties for AdaBoost based on several norms of the predictors. A different way to avoid overfitting (and obtain rates of convergence) is through regularization of the weights of the component base hypotheses. For classification problems, this point of view is taken up in [13, 14, 15]. As an alternative, shrinkage has been applied [16]. It is surprising that there are few regularized boosting algorithms for regression problems other than shrinkage. There are extensive studies on minimizing the Lasso loss function (ℓ1-penalized loss) [17, 18, 19]. Yet most of these studies cannot be applied to general loss functions as in ensemble learning (a.k.a. boosting) with predictors of the form f = Σ_t α_t h_t. Zhao and Yu [20] proposed the Boosted Lasso (BLasso) algorithm, which ties boosting to the ℓ1-penalized Lasso method with an emphasis on tracing the regularization paths.
In this paper, we study algorithms for loss minimization subject to an explicit constraint on the ℓ1 norm of the weights. This approach yields a bound on the generalization error that depends only on the constraint and is independent of the number of iterations. In Section 3, we propose a novel Frank-Wolfe type boosting algorithm (FWBoost) for general loss functions. By using the exponential loss for binary classification, the FWBoost algorithm can be reduced to an AdaBoost-like algorithm (AdaBoost.FW), derived as a coordinate descent method in Section 4. By making a direct connection between boosting and Frank-Wolfe, the FWBoost algorithms have exactly the same form as existing boosting methods, with the same number of calls to the base learner, but with new guarantees and an O(1/T) rate of convergence. Experimental results in Section 5 show that the test performance of FWBoost does not degrade with more rounds of boosting, which is consistent with the theoretical analysis in Section 3.
2 Preliminaries
In this section, we briefly review gradient boosting and Frank-Wolfe algorithms.
Gradient Boosting
We assume that the samples (x, y) are independently drawn from the same but unknown distribution D, where x is an instance from a domain X and the univariate response y can be continuous (a regression problem) or discrete (a classification problem). During training, a learning algorithm receives a training set S = {(x_1, y_1), ..., (x_n, y_n)}. The goal is to estimate the function/hypothesis f that minimizes the expected value of some specified loss function ℓ(f(x), y). The hypothesis space being considered is the set of ensembles of base hypotheses from the base hypothesis space H: F = {Σ_t α_t h_t : α_t ∈ ℝ, h_t ∈ H}. Gradient boosting [2, 3] tries to find an approximation that minimizes the empirical risk R_emp(f) = (1/n) Σ_i ℓ(f(x_i), y_i) on the training set. It does so by iteratively building up the solution in a greedy fashion: f_t = f_{t−1} + α_t h_t.
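As a purely illustrative sketch of this greedy stagewise fit, the following assumes the squared loss and a brute-force depth-1 regression stump as a hypothetical base learner; neither choice is fixed by the text above:

```python
import numpy as np

def fit_stump(X, r):
    """Least-squares fit of a depth-1 stump to the residual vector r."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            left = X[:, j] <= thr
            if left.all():
                continue                      # degenerate split
            pred = np.where(left, r[left].mean(), r[~left].mean())
            err = np.sum((r - pred) ** 2)
            if best is None or err < best[0]:
                best = (err, j, thr, r[left].mean(), r[~left].mean())
    if best is None:                          # constant-feature fallback
        m = r.mean()
        return lambda Z: np.full(len(Z), m)
    _, j, thr, a, b = best
    return lambda Z: np.where(Z[:, j] <= thr, a, b)

def gradient_boost(X, y, loss_grad, T=50, step=0.1):
    """Generic gradient boosting: fit the negative gradient, take a small step."""
    F = np.zeros(len(y))                      # current predictions f_t(x_i)
    ensemble = []
    for _ in range(T):
        g = -loss_grad(F, y)                  # negative functional gradient at the samples
        h = fit_stump(X, g)                   # base learner approximates the negative gradient
        ensemble.append((step, h))
        F = F + step * h(X)
    return lambda Z: sum(a * h(Z) for a, h in ensemble)
```

With the squared loss, `loss_grad(F, y) = F - y`, so each round simply refits the residuals.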
However, there is no simple way to exactly solve the problem of choosing the best (α_t, h_t) at each step for an arbitrary loss function, since the base hypothesis space is usually large or infinite. A typical strategy is to apply functional gradient descent. For the purpose of minimizing the empirical risk, we only care about the values of f at the sample points x_1, ..., x_n. Thus we can view R_emp as a function of the vector of values (f(x_1), ..., f(x_n)) and then calculate its negative gradient. A base hypothesis h_t is chosen to most closely approximate the negative gradient, and a step size is found by line search.
Frank-Wolfe Algorithms.
The Frank-Wolfe algorithm [21] (also known as the conditional gradient method) is one of the simplest and most popular iterative first-order methods for general constrained optimization problems. Given a continuously differentiable function f and a compact convex domain D, the objective is
min_{x ∈ D} f(x).    (1)
At each iteration, the algorithm searches for the minimizer of the linearized version of the optimization problem under the same constraint, and then descends along the direction from the current position towards this minimizer. By confining the step size to [0, 1], the algorithm automatically keeps the current position in the feasible region. It is known that the iterates of Algorithm 1 satisfy f(x_t) − f(x*) = O(1/t), where x* is the solution to (1) [21, 22].
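The classic scheme can be sketched in a few lines. The ℓ1-ball linear minimization oracle below is one illustrative choice of domain D (for a linear objective over the ℓ1 ball, the minimum is attained at a signed coordinate vertex); the names and step-size rule γ_t = 2/(t+2) are the standard ones, not something fixed by the text:

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, T=200):
    """Classic Frank-Wolfe: x_{t+1} = x_t + gamma_t (s_t - x_t), gamma_t = 2/(t+2)."""
    x = x0.copy()
    for t in range(T):
        s = lmo(grad(x))              # s_t = argmin_{s in D} <grad f(x_t), s>
        gamma = 2.0 / (t + 2)
        x = x + gamma * (s - x)       # convex combination keeps x feasible
    return x

def l1_lmo(g, beta=1.0):
    """Linear minimization oracle for the l1 ball of radius beta:
    the minimizer of a linear function over the ball is a signed vertex."""
    j = np.argmax(np.abs(g))
    s = np.zeros_like(g)
    s[j] = -beta * np.sign(g[j])
    return s
```

For example, minimizing f(x) = ½‖x − c‖² with c = (2, 0) over the unit ℓ1 ball converges to the boundary point (1, 0).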
There are several variants of the classic Frank-Wolfe algorithm, including the line-search and fully corrective variants; see [22] for a recent overview. Another important variant is the use of away-steps, as described in [23]. The idea is that at each iteration we may potentially remove an old atom with bad performance instead of adding a new one, which can improve the sparsity of the iterates [24].
3 Functional Frank-Wolfe Boosting for General Loss Functions
In order to prevent boosting algorithms from overfitting, we consider the constrained version of regularized empirical risk minimization:
min R_emp(Σ_t α_t h_t)  subject to  Σ_t |α_t| ≤ β, h_t ∈ H.    (2)
We will use H_1 to denote the subset of H whose sup-norm is bounded by 1. Bounding the norm of the base hypotheses is essential, because otherwise, if H is closed under scalar multiplication, we could always shrink the weights α_t and enlarge the base hypotheses h_t to remove the effect of the constraint β. It is clear that the bound on the base hypotheses and the bound β on the weights are two dual parameters and the intrinsic degree of freedom is 1. Thus, without loss of generality, we consider base hypotheses in H_1. Since the base hypothesis space is usually large or infinite, we take a functional view: we treat the empirical risk R_emp as a functional which takes another function as input, rather than viewing our objective, as in Eq. (2), as a function of a set of parameters representing the weights over all the base hypotheses. Define the regularized hypothesis space as F_β = {Σ_t α_t h_t : Σ_t |α_t| ≤ β, h_t ∈ H_1}. We use F_β,T to denote the hypothesis space after T steps of boosting.
The regularized empirical risk minimization can be written analogously to Eq. (1) in functional space, with compact convex domain F_β, as:
min_{f ∈ F_β} R_emp(f).    (3)
We will see that this perspective overcomes possible computational difficulties encountered when optimizing over the set of weight parameters by coordinate descent for regularized empirical risk minimization, and that it is widely applicable to general loss functions. In fact, applying Frank-Wolfe algorithms in functional space allows the regularized minimization of many loss functions to be reduced to a sequence of ordinary classification problems or least-squares regression problems.
3.1 Functional Frank-Wolfe
For the purpose of optimizing Eq. (3), we only care about the values of f at the sample points x_1, ..., x_n. Thus we can view R_emp as a function of the vector of values (f(x_1), ..., f(x_n)). To simplify notation, we use f and this vector interchangeably for any function f when no confusion arises. By analogy of R_emp, f and F_β in Eq. (3) to f, x and D in Eq. (1), respectively, we derive the basic Frank-Wolfe boosting (FWBoost) framework in Algorithm 2.
Let α_j^(t) denote the coefficient of the j-th base hypothesis in f_t after step t. From step t to step t+1, the updates are α_j^(t+1) = (1 − γ_t) α_j^(t) for j ≤ t and α_{t+1}^(t+1) = γ_t β. If Σ_j |α_j^(t)| ≤ β, we have Σ_j |α_j^(t+1)| ≤ (1 − γ_t) β + γ_t β = β. Since γ_t ∈ [0, 1], it is clear that f_t always stays in F_β.
Recall that in the original Frank-Wolfe algorithm, the linearized subproblem is solved over the whole feasible region, which corresponds to F_β in the learning setting. Here we are able to search over the scaled base hypotheses βh, h ∈ H_1, instead of all of F_β in line (a). This is because the feasible region is determined by linear constraints and the objective of the subproblem is also linear; therefore the subproblem attains its optimum at the vertices of the feasible region, which are exactly the scaled base hypotheses. The step length can be chosen in several ways, such as line search or the totally corrective variant, as discussed in Section 2.
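A schematic rendering of this functional Frank-Wolfe loop follows. It is our illustrative sketch, not the paper's Algorithm 2: the squared loss, the β and T values, and the exhaustive decision-stump oracle standing in for the base learner are all assumptions made here:

```python
import numpy as np

def fwboost(X, y, loss_grad, base_oracle, beta=5.0, T=100):
    """Functional Frank-Wolfe boosting sketch: the FW vertex is beta * h_t."""
    F = np.zeros(len(y))                    # f_0 = 0, a feasible starting point
    ensemble, weights = [], []
    for t in range(T):
        g = -loss_grad(F, y)                # negative gradient of empirical risk at f_t
        h = base_oracle(X, g)               # approx argmax_h <g, h(X)>: one base-learner call
        gamma = 2.0 / (t + 2)
        F = (1 - gamma) * F + gamma * beta * h(X)
        weights = [(1 - gamma) * w for w in weights] + [gamma * beta]
        ensemble.append(h)
    return ensemble, weights                # sum |weights| <= beta by construction

def stump_oracle(X, g):
    """Hypothetical exact oracle over {-1,+1}-valued decision stumps."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            h = np.where(X[:, j] <= thr, -1.0, 1.0)
            for s in (1.0, -1.0):
                corr = np.dot(g, s * h)
                if best is None or corr > best[0]:
                    best = (corr, j, thr, s)
    _, j, thr, s = best
    return lambda Z: s * np.where(Z[:, j] <= thr, -1.0, 1.0)
```

Note how the update rescales all old weights by (1 − γ_t), which is exactly what keeps the ℓ1 norm of the coefficients at most β at every round.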
3.2 Using Classification and Regression for General Loss Functions
To solve the central subproblem max_{h ∈ H_1} ⟨g_t, h⟩, where g_t denotes the negative gradient of the empirical risk at f_t, we need to find a function in H_1 which, as a vector, has the most similar orientation to g_t. If each h is constrained to take values in {−1, +1}, the problem of maximizing ⟨g_t, h⟩ on each round is equivalent to an ordinary classification problem. To be more specific, let ỹ_i = sign(g_{t,i}) and w_i = |g_{t,i}| / Σ_j |g_{t,j}|. Then by a standard technique, we have ⟨g_t, h⟩ = ‖g_t‖_1 (1 − 2 Σ_i w_i 1[h(x_i) ≠ ỹ_i]),
which means that the maximization subproblem is equivalent to minimizing the weighted classification error. Thus, in this fashion, the subproblem can be reduced to a sequence of classification problems. The resulting practical algorithm is described in Algorithm 3 as FWBoost_C.
Alternatively, for a general base hypothesis space taking values in ℝ, if H is closed under scalar multiplication (which is true for commonly used base hypotheses like regression trees, splines or stumps), we can instead directly minimize the Euclidean distance between h and the negative gradient g_t and then rescale the solution to lie in βH_1. Finding such an h is itself a least-squares regression problem with the real-valued negative gradient as the response, as described in Algorithm 4 as FWBoost_R.
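A minimal sketch of this regression-based oracle, assuming a linear base class fitted by least squares and rescaled to sup-norm β over the training points (the linear class, the rescaling convention, and all names are our illustrative assumptions):

```python
import numpy as np

def regression_oracle(X, g, beta=1.0):
    """FWBoost_R-style base-learner step (a sketch): fit the negative
    gradient g by least squares, then rescale the fitted function so its
    sup-norm on the sample is beta, i.e. a vertex of the feasible region."""
    w, *_ = np.linalg.lstsq(X, g, rcond=None)   # h(x) = <w, x>
    preds = X @ w
    scale = beta / max(np.max(np.abs(preds)), 1e-12)
    return preds * scale, w * scale
```

The rescaled fit keeps the same orientation as the least-squares solution, which is all the Frank-Wolfe step needs.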
This is computationally attractive for nonparametric learning problems with a large or infinite number of base hypotheses and general loss functions. First, the computational effort in Frank-Wolfe boosting is exactly the same as in AdaBoost or gradient boosting, refitting residuals with no other complex optimization procedure needed, which keeps the method efficient even for large-scale datasets. Second, it does not need to identify whether a newly added base hypothesis is the same as one already included in previous steps (as done, e.g., in BLasso [20]), which is awkward in practice, especially for a large and complex base hypothesis space. It is therefore free from the need to search through previously added base hypotheses, and the computational complexity of each iteration does not increase.
Note that if we solve the subproblem exactly and the Frank-Wolfe direction turns out not to be a descent direction, then we can conclude that we are already at the global minimum. More details are provided in the supplementary material.
Guélat and Marcotte [23] developed Frank-Wolfe with away-steps to directly address the sparsity problem. Here we provide the away-step variant of the Frank-Wolfe boosting method, which directly solves the constrained regression problem while maintaining the sparsity of the ensemble through away-steps that potentially remove an old base hypothesis with bad performance. The detailed algorithm is written as follows.
3.3 Theoretical Analysis
In this section, we provide theoretical guarantees on the empirical risk and the generalization error of functional Frank-Wolfe boosting for general loss functions.
3.3.1 Rademacher Complexity Bounds
We use Rademacher complexity as a standard tool to capture the richness of a family of functions in the analysis of voting methods [25].
Let G be a family of functions mapping from Z to [a, b], and let S = (z_1, ..., z_n) be a fixed i.i.d. sample set with elements in Z. The empirical Rademacher complexity of G with respect to S is defined as
R̂_S(G) = E_σ [ sup_{g ∈ G} (1/n) Σ_i σ_i g(z_i) ],
where the σ_i's are independent uniform random variables taking values in {−1, +1}. We also define the risk and empirical risk of f as R(f) = E[ℓ(f(x), y)] and R_emp(f) = (1/n) Σ_i ℓ(f(x_i), y_i), respectively. By Talagrand's lemma and general Rademacher complexity uniform-convergence bounds [26], we have the following Rademacher complexity bound for general loss functions. The bound is in terms of the empirical risk, the sample size and the complexity of F_β.
Theorem 1.
Let F be a set of real-valued functions. Assume the loss function ℓ is L-Lipschitz continuous with respect to its first argument and bounded by B. For any δ > 0, with probability at least 1 − δ over a sample S of size n, the following inequality holds for all f ∈ F: R(f) ≤ R_emp(f) + 2L R̂_S(F) + 3B √(log(2/δ)/(2n)).
We next show an upper bound on the Rademacher complexity of the regularized function space after T steps of boosting [25]: namely, the complexity, as measured by Rademacher complexity, is the same for the combined class as for the base class, times the constant β. The proof is left in the supplement.
Theorem 2.
For any sample S of size n, R̂_S(F_β,T) ≤ β R̂_S(H_1 ∪ −H_1). Further, if the assumption that H is closed under scalar multiplication holds, R̂_S(F_β,T) = β R̂_S(H_1).
Based on Theorems 1 and 2, we can see that the risk bound depends on the regularization constant β and is independent of the number of boosting iterations. Unlike vanilla gradient boosting (even with shrinkage and subsampling to prevent overfitting), the generalization error will not increase even if the algorithm is boosted forever. The Rademacher complexity bounds for regression are given in the supplementary material.
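The quantity driving these bounds can also be estimated numerically. The Monte Carlo sketch below approximates the empirical Rademacher complexity of a finite hypothesis set represented by its prediction vectors on S; it is an illustration of the definition above, not part of the paper's machinery:

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_S(H) = E_sigma sup_h (1/n) sum_i sigma_i h(x_i).
    `preds` is a (num_hypotheses, n) array of each hypothesis's values on S."""
    rng = np.random.default_rng(seed)
    n = preds.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # uniform Rademacher signs
        total += np.max(preds @ sigma) / n        # sup over the hypothesis set
    return total / n_draws
```

For instance, the class {h} containing a single constant hypothesis has complexity near 0, while the symmetric class {h, −h} has strictly larger complexity, consistent with the role of H_1 ∪ −H_1 in Theorem 2.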
3.4 Bounds on Empirical Risk
In this subsection we analyze the convergence rate of Frank-Wolfe boosting on the training set. In the Frank-Wolfe subproblem, we linearize the objective function to obtain a descent direction. To analyze the training error of our algorithm, we first study the second derivative of the loss function in the feasible region. If ℓ''(f, y) ≤ σ for f in the feasible region, then by Taylor's theorem ℓ(f + δ, y) ≤ ℓ(f, y) + ℓ'(f, y) δ + (σ/2) δ², which gives us sufficient control of the gap between ℓ and its linearization. Noticing that boundedness of the second derivative is preserved under summation, this property passes to the empirical risk R_emp. This leads us to the following condition, which we require for our convergence analysis.
Assumption 1 (Smooth Empirical Risk). For the empirical risk function R_emp, there exists a constant σ, depending on the loss ℓ and the predictor space F_β, such that for any f, f' ∈ F_β we have R_emp(f') ≤ R_emp(f) + ⟨∇R_emp(f), f' − f⟩ + (σ/2) ‖f' − f‖².
This assumption is valid for most commonly used losses on bounded predictors. First, notice that in our constrained boosting setting, for any predictor f = Σ_t α_t h_t ∈ F_β, Hölder's inequality gives |f(x)| ≤ Σ_t |α_t| · max_t ‖h_t‖_∞ ≤ β. This specifies the range on which Assumption 1 needs to hold. For the squared loss ℓ(f, y) = (f − y)²/2, ℓ'' = 1, so σ can be chosen as 1. For the exponential loss ℓ(f, y) = e^{−yf} with y ∈ {−1, +1}, ℓ'' = e^{−yf} ≤ e^β, so σ can be chosen as e^β. For the logistic loss ℓ(f, y) = log(1 + e^{−yf}), ℓ'' ≤ 1/4, so σ can be chosen as 1/4.
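These curvature constants can be sanity-checked by finite differences. The sketch below assumes the (f − y)²/2 convention for the squared loss and β = 2, both arbitrary choices made here for illustration:

```python
import numpy as np

# Finite-difference check of the curvature constants sigma on |f| <= beta,
# y in {-1, +1}. beta = 2 is an arbitrary choice for this sketch.
beta = 2.0
f = np.linspace(-beta, beta, 10001)

def second_derivative(loss, f, y, eps=1e-4):
    """Central finite-difference estimate of d^2 loss / d f^2."""
    return (loss(f + eps, y) - 2 * loss(f, y) + loss(f - eps, y)) / eps ** 2

squared = lambda f, y: 0.5 * (f - y) ** 2         # claimed sigma = 1
exponential = lambda f, y: np.exp(-y * f)         # claimed sigma = e^beta
logistic = lambda f, y: np.log1p(np.exp(-y * f))  # claimed sigma = 1/4

curvature = {
    name: max(second_derivative(l, f, y).max() for y in (-1.0, 1.0))
    for name, l in [("squared", squared),
                    ("exponential", exponential),
                    ("logistic", logistic)]
}
```

Each maximum observed second derivative stays below the corresponding σ (up to finite-difference noise), matching the constants above.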
Theorem 3.
Under Assumption 1, for a convex loss ℓ, suppose we solve the linearized subproblem exactly at each iteration and the step size is chosen as γ_t = 2/(t + 2). Then the training error of the Frank-Wolfe boosting algorithm is bounded as R_emp(f_T) − R_emp(f*) ≤ C/(T + 2), where f* is the optimal solution to (3) and the constant C depends on the predictor space F_β, the loss function ℓ and the initial empirical loss R_emp(f_0).
In the learning setting, the subproblem is solved by fitting the residual with least-squares regression or by finding the best classifier. However, the base learning algorithm sometimes finds only an approximate maximizer rather than the global maximizer. Nevertheless, if we assume the subproblem is solved not too inaccurately at each step, we obtain the following error bound. The detailed proofs of all the theorems in this section are left to the supplementary material.
Theorem 4.
Under the same assumptions and notation, for a positive constant ε, if the linearized subproblem is solved with additive tolerance ε at each iteration, that is, ⟨g_t, h_t⟩ ≥ max_{h ∈ H_1} ⟨g_t, h⟩ − ε,
then we have the following empirical risk bound: R_emp(f_T) − R_emp(f*) ≤ C/(T + 2) + O(ε).
4 Frank-Wolfe Boosting for Classification
AdaBoost is one of the most popular and important methods for classification problems, where y ∈ {−1, +1}. It is well known that AdaBoost can be derived from the functional gradient descent method with the exponential loss [1, 27]. The margin theory predicts AdaBoost's resistance to overfitting provided that large margins can be achieved and the base hypotheses are not too complex. Yet overfitting can certainly happen when the base hypotheses are unable to achieve large margins, when the noise is overwhelming, or when the space of base hypotheses is too complex. Following the same path, we apply the Frank-Wolfe boosting algorithm to the exponential loss with constrained regularization and derive a classification method with a reweighting procedure similar to AdaBoost's, but with a tunable parameter β to balance empirical risk and generalization.
Specifically, define the exponential loss function ℓ(f, y) = e^{−yf} and a base classifier space H taking values in {−1, +1}. Following the same logic as in Algorithm 3, we first calculate the negative gradient at each round, whose i-th coordinate is y_i e^{−y_i f_t(x_i)}. On round t, the goal is to find an h_t whose predictions are maximally correlated with this negative gradient, which is equivalent to minimizing the weighted classification error with weight vector w_i ∝ e^{−y_i f_t(x_i)}. After h_t has been chosen, the step size γ_t can be selected using the methods already discussed, such as line search. After setting f_{t+1} = (1 − γ_t) f_t + γ_t β h_t, we are ready to update the weights: w_i^(t+1) ∝ e^{−y_i f_{t+1}(x_i)}.
With the exponential loss, the generic FWBoost algorithm for classification can thus be rewritten as a reweighting procedure like AdaBoost, shown in Algorithm 6, where w denotes the weight vector (w_1, ..., w_n). AdaBoost.FW is computationally efficient with a large or infinite hypothesis space, while a similar but impractical counterpart suitable for a small number of base hypotheses is mentioned in [28].
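A sketch of the resulting reweighting loop follows. This is our illustrative rendering of an AdaBoost.FW-style procedure, not the paper's Algorithm 6: the weighted decision stump as base learner, the fixed step size γ_t = 2/(t+2), and the parameter values are assumptions made here:

```python
import numpy as np

def adaboost_fw(X, y, base_oracle, beta=5.0, T=50):
    """AdaBoost-like reweighting form of FWBoost for the exponential loss (sketch).
    y in {-1,+1}; base_oracle returns an h minimizing the weighted error."""
    n = len(y)
    F = np.zeros(n)
    hs, alphas = [], []
    for t in range(T):
        w = np.exp(-y * F)
        w = w / w.sum()                    # distribution over examples, as in AdaBoost
        h = base_oracle(X, y, w)           # weighted-error base learner call
        gamma = 2.0 / (t + 2)
        F = (1 - gamma) * F + gamma * beta * h(X)
        alphas = [(1 - gamma) * a for a in alphas] + [gamma * beta]
        hs.append(h)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))

def weighted_stump(X, y, w):
    """Hypothetical base learner: exhaustive weighted decision stump."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1.0, -1.0):
                pred = s * np.where(X[:, j] <= thr, -1.0, 1.0)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, s)
    _, j, thr, s = best
    return lambda Z: s * np.where(Z[:, j] <= thr, -1.0, 1.0)
```

The only differences from vanilla AdaBoost are the (1 − γ_t) shrinkage of all previous coefficients and the cap β on their ℓ1 norm.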
5 Experimental Results
In this section, we evaluate the proposed methods on UCI datasets [29]. To randomize the experiments, in each run we select 50% of the data as training examples; the remaining 50% is used as the test set. Furthermore, in each run, the parameters of each method are tuned by 5-fold cross validation on the training set. This random split was repeated 20 times and the results are averaged over the 20 runs.
Least-squares Regression
To demonstrate the effectiveness of the proposed method in avoiding overfitting while achieving reasonably good training error, we compare FWBoost_C and FWBoost_R with existing state-of-the-art boosting algorithms (with regularization) for least-squares regression, including vanilla gradient boosting [2], gradient boosting with shrinkage [16], gradient boosting with early stopping [16], gradient boosting with subsampling [11] and BLasso [20].
The experimental results are shown in Fig. 1. The first row is the empirical risk averaged over 20 runs and the second row is the averaged MSE on the test set. For early stopping and BLasso, if the algorithm stops before reaching the maximum number of iterations, the MSE is set to its last round's value. Fig. 1 shows that the FWBoost algorithms reduce the empirical risk considerably quickly. In all the examples where gradient boosting overfits after early iterations, the FWBoost algorithms achieve the smallest averaged test risk among the compared regularization methods. This is consistent with the theoretical analysis: no degradation in the performance of FWBoost is observed with more rounds of boosting. In addition, it is difficult to observe backward steps of the BLasso algorithm because it usually stops at early iterations; it behaves similarly to the early stopping method.



Binary Classification
To demonstrate the effectiveness of AdaBoost.FW, we compare it with AdaBoost on UCI datasets that cause AdaBoost to overfit. The first example is the heart-disease dataset, presented in [27] as an example of overfitting. Decision stumps are used as weak learners. While reducing the training error, the test error of AdaBoost may go up after several iterations. AdaBoost.FW, on the other hand, does not suffer from overfitting and achieves a smaller test error.



6 Conclusion
In this paper, we considered regularized loss minimization for ensemble learning as a means of avoiding overfitting. We proposed a novel Frank-Wolfe type boosting algorithm for general loss functions and analyzed its empirical risk and generalization. With the exponential loss for binary classification, the FWBoost algorithm can be rewritten as an AdaBoost-like reweighting algorithm. Furthermore, it is computationally efficient, since the computational effort in FWBoost is exactly the same as in AdaBoost or gradient boosting to refit the residuals. We also deployed an important variant of Frank-Wolfe algorithms with away-steps to improve the sparsity of boosting algorithms.
References
 [1] Leo Breiman. Prediction games and arcing algorithms. Neural computation, 11(7):1493–1517, 1999.
 [2] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
 [3] Llew Mason, Jonathan Baxter, Peter L Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. Advances in Neural Information Processing Systems, pages 221–246, 1999.

 [4] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.
 [5] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

 [6] Robert E Schapire. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, pages 149–171. Springer, 2003.
 [7] Robert E Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, pages 1651–1686, 1998.
 [8] Adam J Grove and Dale Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI, pages 692–699, 1998.
 [9] Peter Lukas Bühlmann. Consistency for L2 boosting and matching pursuit with trees and tree-type basis functions. Technical report, 2002.
 [10] Tong Zhang and Bin Yu. Boosting with early stopping: convergence and consistency. Annals of Statistics, pages 1538–1579, 2005.
 [11] Jerome H Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.
 [12] John Duchi and Yoram Singer. Boosting with structural sparsity. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 297–304. ACM, 2009.
 [13] Gábor Lugosi and Nicolas Vayatis. On the bayesrisk consistency of regularized boosting methods. Annals of Statistics, pages 30–55, 2004.

 [14] Yongxin T Xi, Zhen J Xiang, Peter J Ramadge, and Robert E Schapire. Speed and sparsity of regularized boosting. In International Conference on Artificial Intelligence and Statistics, pages 615–622, 2009.
 [15] Chunhua Shen, Hanxi Li, and Nick Barnes. Totally corrective boosting for regularized risk minimization. arXiv preprint arXiv:1008.5188, 2010.
 [16] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning, volume 2. Springer, 2009.
 [17] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 [18] Michael R Osborne, Brett Presnell, and Berwin A Turlach. On the lasso and its dual. Journal of Computational and Graphical statistics, 9(2):319–337, 2000.
 [19] Shai ShalevShwartz, Nathan Srebro, and Tong Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20(6):2807–2832, 2010.
 [20] Peng Zhao and Bin Yu. Boosted lasso. Technical report, DTIC Document, 2004.
 [21] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2):95–110, 1956.
 [22] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 427–435, 2013.
 [23] Jacques Guélat and Patrice Marcotte. Some comments on Wolfe's 'away step'. Mathematical Programming, 35(1):110–119, 1986.
 [24] Kenneth L Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):63, 2010.
 [25] Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pages 1–50, 2002.
 [26] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.
 [27] Robert E Schapire and Yoav Freund. Boosting: Foundations and algorithms. MIT press, 2012.
 [28] Paul Grigas, Robert Freund, and Rahul Mazumder. The Frank-Wolfe algorithm: New results, and connections to statistical boosting. Workshop on Optimization and Big Data, University of Edinburgh, 2013.
 [29] M. Lichman. UCI machine learning repository, 2013.