Fully-Corrective Gradient Boosting with Squared Hinge: Fast Learning Rates and Early Stopping

04/01/2020 ∙ by Jinshan Zeng, et al.

Boosting is a well-known method for improving the accuracy of weak learners in machine learning. However, its theoretical generalization guarantees are missing in the literature. In this paper, we propose an efficient boosting method with theoretical generalization guarantees for binary classification. The three key ingredients of the proposed boosting method are: a) the fully-corrective greedy (FCG) update in the boosting procedure, b) a differentiable squared hinge (also called truncated quadratic) function as the loss function, and c) an efficient alternating direction method of multipliers (ADMM) algorithm for the associated FCG optimization. The squared hinge loss not only inherits the robustness of the well-known hinge loss for classification with outliers, but also brings benefits for computational implementation and theoretical justification. Under a sparseness assumption, we derive a fast learning rate of order O((m/log m)^{-1/4}) for the proposed boosting method, which can be further improved to O((m/log m)^{-1/2}) if an additional noise assumption is imposed, where m is the sample size. Both derived learning rates are the best among existing generalization results for boosting-type methods in classification. Moreover, an efficient early stopping scheme is provided for the proposed method. A series of toy simulations and real-data experiments are conducted to verify the developed theory and demonstrate the effectiveness of the proposed method.


I Introduction

Boosting [9] is a powerful learning scheme that combines multiple weak prediction rules to produce a strong learner with the underlying intuition that one can obtain accurate prediction by combining “rough” ones. It has been successfully used in numerous learning tasks such as regression, classification, ranking and recognition [32]. The gradient descent view of boosting [11, 12], or gradient boosting, provides a springboard to understand and improve boosting via connecting boosting with a two-step stage-wise fitting of additive models corresponding to various loss functions.

There are commonly four ingredients of gradient boosting: a set of weak learners, a loss function, an update scheme and an early stopping strategy. The weak learner issue focuses on selecting a suitable set of weak learners, which regulates the properties of the estimator to be found; typical examples are decision trees [17], neural networks [2] and kernels [24]. The loss function issue concerns choosing an appropriate loss function to enhance the learning performance. Besides the classical exponential loss in AdaBoost [9], other widely used loss functions are the logistic loss in Logit-Boosting [11], the least squares loss in L2-Boosting [6] and the hinge loss in HingeBoost [14]. The update scheme refers to how a new estimator is derived iteratively from the selected weak learners. According to the gradient descent view, there are numerous iterative schemes for boosting [12]; among these, the five most commonly used are the original boosting iteration [9], regularized boosting via shrinkage (RSBoosting) [12], regularized boosting via truncation (RTBoosting) [45], ε-Boosting [17] and re-scaled boosting (RBoosting) [41]. Noting that boosting is doomed to over-fit [6], the early stopping issue concerns how to terminate the learning process to avoid over-fitting. Popular strategies for obtaining a stopping rule of high quality include the Akaike information criterion (AIC) [6], norm-based complexity restrictions [2] and adaptive termination rules.

The learning performance of L2-Boosting has been rigorously verified in regression [2, 1]. In fact, under some sparseness assumption on the regression function, fast learning rates have been established for numerous variants of the original boosting [2, 1], stated in terms of the sample size m. However, for classification, where L2-Boosting does not perform so well in practice, tight classification risk estimates for boosting and its variants are lacking. For example, the known classification risk bounds for AdaBoost [4] and for a variant of Logit-Boosting [45] decay only slowly with m. There are mainly two reasons for such slow learning rates. The first is that the original update scheme of boosting leads to a slow numerical convergence rate [26, 28], so that numerous boosting iterations are required to achieve a prescribed accuracy. The second is that widely used loss functions such as the exponential loss and logistic loss do not admit a truncation (or clipping) operator like (25) below, which is needed to obtain tight uniform bounds for the derived estimator.

The aim of the present paper is to derive tight classification risk bounds for boosting-type algorithms by selecting an appropriate iteration scheme and loss function. Specifically, we adopt the widely used fully-corrective greedy (FCG) update scheme and the squared hinge (also called truncated quadratic) function, i.e., $\phi(t) = (\max\{1-t, 0\})^2$ for any $t \in \mathbb{R}$, as the loss function. The FCG update scheme has been successfully used in [35, 33, 22], mainly owing to its fast numerical convergence rate. Inspired by the square-type inequality [25], the squared hinge loss has been exploited to ease the computational implementation and is regarded as an improvement over the classical hinge loss [21, 27, 23]. By taking advantage of the special form of the squared hinge loss, we develop an alternating direction method of multipliers (ADMM) algorithm [15, 13] for efficiently finding the optimal coefficients in the FCG optimization subproblem. More importantly, a tight classification risk bound is derived in the statistical learning framework [8], provided the algorithm is stopped early at an appropriate iteration.

In a nutshell, our contributions can be summarized as follows.

Algorithmic side: We propose a novel variant of boosting to improve its performance for binary classification. The FCG update scheme and squared hinge loss are utilized in the new variant to accelerate the numerical convergence rate and reduce the classification risk.

Theoretical side: We derive fast learning rates for the proposed algorithm in binary classification. Under a regular sparseness assumption, the derived learning rate achieves the order O((m/log m)^{-1/4}), which is a new record for boosting classification. If an additional noise condition is imposed, the learning rate can be further improved to O((m/log m)^{-1/2}).

Numerical side: We conduct a series of experiments including the toy simulations, UCI-benchmark data experiments and a real-world earthquake intensity classification experiment to show the feasibility and effectiveness of the proposed algorithm. Our numerical results show that the proposed variant of boosting is at least comparable with the state-of-the-art methods.

The rest of this paper is organized as follows. In Section II, we introduce the proposed boosting method in detail. In Section III, we provide the theoretical generalization guarantees of the proposed method. A series of toy simulations are conducted in Section IV to illustrate the feasibility of the suggested method, and some real-data experiments are provided in Section V to demonstrate the effectiveness of the proposed method. All the proofs are provided in Section VI. We conclude this paper in Section VII.

II Proposed method

In this section, after presenting the classical boosting, we introduce our variant in detail.

II-A Boosting

Boosting can be regarded as one of the most important methods in machine learning for classification and regression [31]. The original versions of boosting proposed by [30] and [9] were not adaptive and could not take full advantage of the weak learners. Later, an adaptive boosting algorithm called AdaBoost was introduced by [10] to alleviate many practical difficulties of the earlier versions of boosting. The gradient descent view of boosting [11] then connects boosting with the well-known greedy-type algorithms [38] equipped with different loss functions. In light of this perspective, numerous variants of boosting have been proposed to improve its learning performance [17].

Given a data set $D = \{(x_i, y_i)\}_{i=1}^m$ of size $m$, boosting starts with a set of weak learners $\mathcal{G}_n = \{g_j\}_{j=1}^n$ of size $n$ and a loss function $\phi$. Mathematically, it formulates the learning problem as finding a function $f$ to minimize the following empirical risk

$$\mathcal{E}_D(f) := \frac{1}{m} \sum_{i=1}^m \phi\big(y_i f(x_i)\big), \qquad f \in \mathrm{span}(\mathcal{G}_n), \qquad (1)$$

where $\mathrm{span}(\mathcal{G}_n)$ represents the function space spanned linearly by $\mathcal{G}_n$. If $\mathcal{E}_D$ is Fréchet differentiable, gradient boosting first finds a $g_{k+1} \in \mathcal{G}_n$ such that

$$g_{k+1} \in \arg\max_{g \in \mathcal{G}_n} \big|\nabla \mathcal{E}_D(f_k)(g)\big|, \qquad (2)$$

where $\nabla \mathcal{E}_D(f_k)(g)$ denotes the value of the linear functional $\nabla \mathcal{E}_D(f_k)$ at $g$. Then, it finds a step size $\beta_{k+1} \in \mathbb{R}$ such that

$$f_{k+1} = f_k + \beta_{k+1}\, g_{k+1}, \qquad \beta_{k+1} \in \arg\min_{\beta \in \mathbb{R}} \mathcal{E}_D(f_k + \beta\, g_{k+1}). \qquad (3)$$

In this way, gradient boosting yields a set of estimators $\{f_k\}_{k \ge 0}$ iteratively and stops the algorithm early according to the bias-variance trade-off [45] to obtain the final estimator $f_K$, where $K$ is the terminal number of iterations.
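To make the generic update (1)-(3) concrete, the following minimal Python sketch performs one gradient-boosting step over a finite dictionary, assuming a differentiable margin-based loss; the function names and the scalar line search are illustrative choices rather than the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def empirical_risk(F, y, loss):
    # F: current margins f(x_i); loss: callable phi(t) applied elementwise
    return np.mean(loss(y * F))

def boosting_step(G, y, F, loss, dloss):
    """One greedy gradient-boosting step.
    G: (m, n) matrix with G[i, j] = g_j(x_i) (dictionary evaluated on the data)
    F: (m,) current estimator values f_k(x_i)
    loss, dloss: phi and its derivative phi'."""
    m = len(y)
    # Gradient of the empirical risk with respect to the margins:
    # d/dF_i (1/m) sum_i phi(y_i F_i) = (1/m) y_i phi'(y_i F_i)
    grad = y * dloss(y * F) / m
    # Step (2): pick the weak learner most correlated with the gradient functional.
    j = int(np.argmax(np.abs(G.T @ grad)))
    # Step (3): one-dimensional line search for the step size beta.
    obj = lambda beta: empirical_risk(F + beta * G[:, j], y, loss)
    beta = minimize_scalar(obj).x
    return j, beta, F + beta * G[:, j]
```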

According to the above description, the choices of the weak learners, the loss function, the update scheme and the early stopping strategy all play important roles in the practical implementation of gradient boosting. The studies in [1, 22, 12, 6, 2, 41, 45, 24] discussed the importance of these four issues and accordingly presented numerous variants of boosting.

II-B Fully-corrective greedy update scheme

There are roughly two approaches to improving the learning performance of boosting: variance-based methods and bias-based methods. The former focus on controlling the structure (e.g., the ℓ1 norm of the coefficients) of the derived boosting estimator and thereby reduce the variance of boosting for a fixed number of iterations, while the latter are devoted to accelerating the numerical convergence rate and stopping the iteration procedure early. Among the existing variants of boosting, RSBoosting [12], RTBoosting [45] and ε-Boosting [17] are typical variance-based methods, while RBoosting [41] is a bias-based method. The problem, however, is that variance-based methods frequently require a large number of iterations to achieve a desired accuracy, while bias-based methods may suffer from the same problem unless additional boundedness assumptions are imposed. An intuitive experiment is provided in Fig. 1. From Fig. 1(b), it follows that numerous iterations are required for these existing boosting methods to select even a small number of weak learners.

Fig. 1: Comparison of boosting schemes. (a) Test error curves for different types of boosting methods; (b) curves of the number of selected weak learners. The detailed experimental settings can be found in Simulation 4 in Sec. IV-A.

Different from the above variants, which may repeatedly select the same weak learners during the iterative procedure, the fully-corrective greedy (FCG) update scheme proposed in [22] finds an optimal combination of all the weak learners selected so far. In particular, let $g_{j_1}, \ldots, g_{j_k}$ be the weak learners selected up to the current iteration; fully-corrective greedy boosting (FCGBoosting) builds an estimator via the following minimization:

$$f_k = \arg\min_{f \in \mathrm{span}\{g_{j_1}, \ldots, g_{j_k}\}} \mathcal{E}_D(f). \qquad (4)$$

It should be pointed out that FCGBoosting is similar to the orthogonal greedy algorithm in approximation theory [38], the fully-corrective variant of the Frank-Wolfe method in optimization [20, Algorithm 4], and orthogonal matching pursuit in signal processing [39]. The advantages of FCGBoosting lie in the sparseness of the derived estimator and the fast numerical convergence rate without any compactness assumption [33].

II-C Squared hinge loss

Since the gradient descent viewpoint connects gradient boosting with various loss functions, numerous loss functions have been employed in boosting to enhance its performance. Among these, the exponential loss in AdaBoost, the logistic loss in LogitBoost and the square loss in L2-Boosting are the most popular ones. In the classification setting, the consistency of AdaBoost and LogitBoost has been proved in [45, 4], but with relatively slow learning rates. In this paper, we equip boosting with the squared hinge loss to improve the learning performance, both in theory and in experiments.

As shown in [25], the squared hinge loss is of quadratic type, and thus theoretically behaves similarly to the square loss and commonly better than the other typical loss functions, including the exponential loss, logistic loss and hinge loss. Furthermore, learning with the squared hinge loss usually respects the margin principle [23] and thus practically performs better than the square loss for classification. Selecting the loss function as the squared hinge loss, i.e., $\phi(t) = (\max\{1-t, 0\})^2$, in FCGBoosting, we obtain a new variant of boosting, summarized in Algorithm 1.

   Input: training sample set $D = \{(x_i, y_i)\}_{i=1}^m$ and a dictionary (weak learner) set $\mathcal{G}_n = \{g_j\}_{j=1}^n$.
  Initialization: $f_0 = 0$.
   for $k = 1, 2, \ldots$
     select a weak learner $g_{j_k} \in \mathcal{G}_n$ maximizing $|\nabla \mathcal{E}_D(f_{k-1})(g)|$,
     let $\mathcal{G}^k = \{g_{j_1}, \ldots, g_{j_k}\}$, and update
$$f_k = \arg\min_{f \in \mathrm{span}(\mathcal{G}^k)} \mathcal{E}_D(f), \qquad \phi(t) = (\max\{1-t, 0\})^2. \qquad (5)$$
End until the stopping criterion is satisfied.
Algorithm 1 FCGBoosting with the squared hinge loss
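A minimal Python sketch of Algorithm 1 is given below, assuming the squared hinge loss and solving the fully-corrective step (5) with a generic smooth optimizer instead of the paper's ADMM (discussed next); all function and variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def sq_hinge(t):
    # squared hinge: phi(t) = max(1 - t, 0)^2
    return np.maximum(1.0 - t, 0.0) ** 2

def sq_hinge_grad(t):
    # phi'(t) = -2 * max(1 - t, 0)
    return -2.0 * np.maximum(1.0 - t, 0.0)

def fcg_boosting(G, y, K):
    """Fully-corrective greedy boosting with the squared hinge loss.
    G: (m, n) dictionary matrix with G[i, j] = g_j(x_i); y in {-1, +1}; K >= 1 iterations."""
    m, n = G.shape
    F = np.zeros(m)            # current estimator values f_k(x_i)
    selected = []              # indices of the selected weak learners
    for _ in range(K):
        # Greedy selection: weak learner most correlated with the risk gradient.
        grad = y * sq_hinge_grad(y * F) / m
        j = int(np.argmax(np.abs(G.T @ grad)))
        if j not in selected:
            selected.append(j)
        Gk = G[:, selected]
        # Fully-corrective step (5): refit all coefficients of the selected learners.
        risk = lambda a: np.mean(sq_hinge(y * (Gk @ a)))
        alpha = minimize(risk, np.zeros(len(selected)), method="L-BFGS-B").x
        F = Gk @ alpha
    return selected, alpha
```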

Notice that the FCG step (5) in Algorithm 1 is a smooth convex optimization problem, so a natural algorithmic candidate is the gradient descent (GD) method. However, as shown in Fig. 2, GD needs many iterations to guarantee convergence, which may be inefficient for the proposed boosting method, since problem (5) in the FCG step must be solved at every boosting iteration and there are usually many such iterations. Instead, we use the alternating direction method of multipliers (ADMM), owing to its high efficiency and fast convergence in practice [13, 15, 19] (as also shown in Fig. 2). The convergence of the suggested ADMM algorithm (presented as Algorithm 2 in Appendix A) and its rate of convergence have been established in the existing literature (see, e.g., [13, 15, 19]).

Fig. 2: Comparison of the efficiency of ADMM and GD for problem (5). The samples were generated according to Section IV-A with 30% uniform random noise. The dictionary matrix was formed by the Gaussian kernel dictionary with width 0.1 and dictionary size 15. The computational time of ADMM is 0.034 seconds, while that of GD is 0.53 seconds. It can be observed that ADMM generally converges faster and reaches a lower cost function value on the concerned optimization problem (5).
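The paper's ADMM solver is presented as Algorithm 2 in Appendix A and is not reproduced here; the following Python sketch shows one natural ADMM splitting for problem (5) with the squared hinge loss, in which the coefficient update is a least-squares solve and the auxiliary update has a closed-form proximal step. The penalty parameter, iteration count and ridge term are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def admm_fcg_step(Gk, y, rho=1.0, iters=100):
    """ADMM sketch for the fully-corrective subproblem (5):
        min_alpha (1/m) sum_i phi(y_i * (Gk @ alpha)_i),  phi(t) = max(1 - t, 0)^2.
    Splitting: introduce z = B @ alpha with B = diag(y) @ Gk, so the loss acts on z."""
    m, k = Gk.shape
    B = y[:, None] * Gk                      # B[i, :] = y_i * Gk[i, :]
    BtB = B.T @ B + 1e-10 * np.eye(k)        # tiny ridge for numerical stability
    alpha = np.zeros(k)
    z = np.zeros(m)
    u = np.zeros(m)                          # scaled dual variable
    c = 2.0 / m                              # derivative scale of (1/m) * phi
    for _ in range(iters):
        # alpha-update: least-squares fit of B @ alpha to (z - u)
        alpha = np.linalg.solve(BtB, B.T @ (z - u))
        # z-update: prox of (1/m)*phi at v = B @ alpha + u (closed form for squared hinge)
        v = B @ alpha + u
        z = np.where(v >= 1.0, v, (rho * v + c) / (rho + c))
        # dual update
        u = u + B @ alpha - z
    return alpha
```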

III Generalization Error Analysis

In learning theory [8, 37], the samples $D = \{(x_i, y_i)\}_{i=1}^m$ with $x_i \in \mathcal{X}$ and $y_i \in \{-1, 1\}$ are drawn independently according to an unknown distribution $\rho$ on $\mathcal{X} \times \{-1, 1\}$. Binary classification algorithms produce a classifier $\mathrm{sign}(f)$ for some real-valued function $f$, whose generalization ability is measured by the misclassification error $\mathcal{R}(\mathrm{sign}(f))$, defined through the marginal distribution $\rho_X$ of $\rho$ and the conditional probability $\eta(x) = \Pr\{Y = 1 \mid x\}$ at $x \in \mathcal{X}$. The Bayes rule minimizes the misclassification error, where the Bayes decision function $f_c$ takes the value $1$ if $\eta(x) \ge 1/2$ and $-1$ otherwise. Since the Bayes error is independent of the classifier, the performance of a classifier $\mathrm{sign}(f)$ can be measured by the excess misclassification error $\mathcal{R}(\mathrm{sign}(f)) - \mathcal{R}(f_c)$. For the estimator $f_k$ derived in Algorithm 1, it therefore suffices to bound $\mathcal{R}(\mathrm{sign}(f_k)) - \mathcal{R}(f_c)$. With this in mind, we first present a sparseness assumption on $f_c$.
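For the reader's convenience, these standard quantities can be written out explicitly as follows; this is a hedged restatement in conventional learning-theory notation, and the symbols may differ slightly from the paper's.

```latex
% Misclassification error, Bayes decision function, and excess risk (standard notation)
\[
  \mathcal{R}(f) = \Pr\{\operatorname{sign}(f(X)) \neq Y\}
                 = \int_{\mathcal{X}} \Pr\{Y \neq \operatorname{sign}(f(x)) \mid x\}\, d\rho_X(x),
  \qquad \eta(x) = \Pr\{Y = 1 \mid x\}.
\]
\[
  f_c(x) = \begin{cases} \;\;\,1, & \eta(x) \ge 1/2,\\ -1, & \text{otherwise},\end{cases}
  \qquad
  \text{excess misclassification error: } \ \mathcal{R}(\operatorname{sign}(f)) - \mathcal{R}(f_c).
\]
```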

Assumption 1.

There exists an such that

(6)

for some positive constants .

Assumption 1 requires that $f_c$ can be sparsely approximated by the set of weak learners, with an approximation error decaying at some polynomial rate. Such an assumption is standard in the analysis of boosting algorithms and has been adopted in a large body of literature [45, 4, 2, 38, 26, 1, 33, 24, 28, 41]. Under this assumption, we can derive the following learning rate for FCGBoosting.

Theorem 1.

Let $\mathcal{G}_n = \{g_j\}_{j=1}^n$ be a set of weak learners with $\|g_j\|_\infty \le 1$ for each $j$. Under Assumption 1, if $k$ and $n$ are chosen to be of appropriate orders in $m$, then for any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds

where $\tilde{C}$ is a positive constant independent of $m$ or $\delta$.
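For orientation, a hedged reading of the bound, with the order taken from the rate stated in the abstract, is:

```latex
% Hedged reading of Theorem 1: the exact constants and the conditions on k and n
% follow the full paper; the order of the rate is taken from the abstract.
\[
  \mathcal{R}\big(\operatorname{sign}(f_k)\big) - \mathcal{R}(f_c)
  \;\lesssim\; \Big(\frac{m}{\log m}\Big)^{-1/4},
\]
% where the suppressed factor depends only on delta and the constants in Assumption 1.
```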

The proof of this theorem is presented in Section VI. The theorem provides an early stopping rule for the proposed boosting method under the assumption that the Bayes decision function can be well approximated by combinations of the weak learners. From Theorem 1, an optimal $k$ should be set in the order of a fractional power of $m/\log m$, which shows that the number of selected weak learners is significantly smaller than $m$ and $n$. It should be noted that the learning rate derived in Theorem 1 is of the same order as that of FCGBoosting with the square loss [2] under the same setting. To further improve the learning rate, the following Tsybakov noise condition [40] is generally required.

Assumption 2.

Let $0 \le q \le \infty$. There exists a positive constant $c_q$ such that

$$\rho_X\big(\{x \in \mathcal{X} : |2\eta(x) - 1| \le t\}\big) \le c_q\, t^q \qquad \text{for all } t > 0.$$

The Tsybakov noise assumption measures the size of the set of points whose labels are corrupted with high noise in the labeling process, and it always holds for $q = 0$ with $c_q = 1$. It is a standard noise assumption in classification and has been adopted in [36, 43, 25, 44] to derive fast learning rates for classification algorithms. Under the Tsybakov noise assumption, the learning rate can be improved as follows.

Theorem 2.

Let $\mathcal{G}_n = \{g_j\}_{j=1}^n$ be a set of weak learners with $\|g_j\|_\infty \le 1$ for each $j$. Under Assumption 1 and Assumption 2 with exponent $q$, if $k$ and $n$ are chosen to be of appropriate orders in $m$, then for any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds

where $\tilde{C}$ is a positive constant independent of $m$ or $\delta$.

The proof of this theorem is also postponed to Section VI. Note that when $q = 0$, the learning rate established in Theorem 2 reduces to that of Theorem 1, while as $q \to \infty$, the learning rate obtained in Theorem 2 approaches $O\big((m/\log m)^{-1/2}\big)$, which is a new record for boosting-type methods in the classification setting.

IV Toy Simulations

In this section, we present a series of toy simulations to demonstrate the feasibility and effectiveness of FCGBoosting. All numerical experiments were carried out in the Matlab R2015b environment running Windows 8 on an Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz.

IV-A Experimental settings

The simulation settings are similar to those in [44] and are described as follows.

Samples: In the simulations, the training samples were generated as follows. Let $g^*$ be a nonlinear Bayes rule, and let the inputs be drawn i.i.d. according to the uniform distribution on the input domain with sample size $m$. We then labeled the samples lying in the epigraph of $g^*$ as the positive class and the rest as the negative class; that is, a sample receives label $+1$ if it lies on or above the graph of $g^*$ and $-1$ otherwise. Besides the uniformly random noise generally considered in regression, we mainly focused on outlier noise in our simulations, that is, noisy samples lying in a region far from the Bayes boundary (see Fig. 3). We considered different widths (i.e., tol) and noise ratios (i.e., ratio) of the banded region.

Fig. 3: The generated training samples with uniform random noise (panel (a)) and outlier noise (panel (b)) used in the simulations. The red points are labeled as the "+1" class, while the blue points are labeled as the "-1" class. In the title of panel (b), tol=0.3 indicates that the difference between the Bayes rule and the upper (lower) bound of the band is 0.3, while ratio=0.4 indicates that the noise ratio in both the left-lower and right-upper regions is 0.4. The total noise level in this case is 17.4%.
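The paper's exact nonlinear Bayes rule is not stated above, so the following Python sketch uses a hypothetical boundary g*(t) = 0.5 + 0.3 sin(2πt) as a stand-in, and reads the outlier-noise rule as label flipping inside the region farther than tol from the boundary; both choices are illustrative assumptions.

```python
import numpy as np

def make_toy_data(m, tol=0.3, ratio=0.4, seed=0):
    """Toy data in the spirit of Sec. IV-A (hypothetical Bayes boundary g_star)."""
    rng = np.random.default_rng(seed)
    g_star = lambda t: 0.5 + 0.3 * np.sin(2 * np.pi * t)   # hypothetical nonlinear Bayes rule
    X = rng.uniform(size=(m, 2))                            # inputs drawn uniformly
    y = np.where(X[:, 1] >= g_star(X[:, 0]), 1, -1)         # +1 in the epigraph, -1 below
    # Outlier noise: flip the labels of a fraction `ratio` of the points lying
    # farther than `tol` from the Bayes boundary.
    dist = X[:, 1] - g_star(X[:, 0])
    outlier_zone = np.abs(dist) > tol
    flip = outlier_zone & (rng.uniform(size=m) < ratio)
    y[flip] = -y[flip]
    return X, y

X_train, y_train = make_toy_data(1000)
```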

Implementation and Evaluation: We implemented four simulations to illustrate the effect of the parameters and to show the effectiveness of the proposed boosting method. For each simulation, we repeated the experiments a number of times and recorded the test error, defined as the ratio of the number of misclassified test labels to the test sample size. The first simulation illustrates the effect of the number of iterations $k$, which is generally exploited for setting the stopping rule of the proposed method. The second demonstrates the effect of the dictionary size $n$. The third shows the feasibility and effectiveness of the squared hinge loss, i.e., $\phi(t) = (\max\{1-t,0\})^2$, by comparing it with other loss functions, including the square loss $\phi(t) = (1-t)^2$, the hinge loss $\phi(t) = \max\{1-t,0\}$ and the cubed hinge loss $\phi(t) = (\max\{1-t,0\})^3$ [21]. The final simulation shows the advantage of the fully-corrective update scheme by comparing it with popular existing update schemes used in gradient boosting. Since the performance of boosting-type methods also depends on the dictionary type, we considered four types of dictionaries, formed by the Gaussian kernel, the polynomial kernel, and the neural network kernels with sigmoid and rectified linear unit (ReLU) activations, respectively; henceforth they are called Gauss, Polynomial, Sigmoid and ReLU for short. The parameters of the ADMM algorithm (i.e., Algorithm 2) used in the FCG optimization step were set empirically, and its maximal number of iterations was set to 100.
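A minimal Python sketch of how such dictionaries can be formed as columns of a matrix G (with G[i, j] = g_j(x_i)) is given below; the choice of centers, widths and degrees is illustrative, not the paper's exact configuration.

```python
import numpy as np

def build_dictionary(X, centers, kind="gauss", width=0.1, degree=3):
    """Build a dictionary matrix G with G[i, j] = g_j(x_i).
    Each column is one weak learner anchored at a center point (illustrative choices)."""
    diff = X[:, None, :] - centers[None, :, :]          # pairwise differences
    dist2 = np.sum(diff ** 2, axis=2)
    inner = X @ centers.T
    if kind == "gauss":                                  # Gaussian kernel learners
        return np.exp(-dist2 / (2 * width ** 2))
    if kind == "polynomial":                             # polynomial kernel learners
        return (1.0 + inner) ** degree
    if kind == "sigmoid":                                # neural-network (sigmoid) learners
        return np.tanh(inner)
    if kind == "relu":                                   # neural-network (ReLU) learners
        return np.maximum(inner, 0.0)
    raise ValueError(f"unknown dictionary type: {kind}")

# Usage: take the training points themselves as centers, as is common for kernel dictionaries.
# G = build_dictionary(X_train, X_train, kind="gauss", width=0.1)
```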

IV-B Simulation Results

In this part, we report the experimental results and present some discussions.

Simulation 1: On the effect of the number of iterations $k$. In Algorithm 1, the number of iterations $k$ is a very important algorithmic parameter, which is generally used as the stopping rule of boosting-type methods. By Theorems 1 and 2, a moderately large $k$ is required to achieve the optimal generalization performance. To illustrate the effect of $k$, we randomly generated training and test samples of equal size. We considered both noise types for the training samples, i.e., uniformly random noise with a noise level of 30%, and outlier noise with tol = 0.3 and ratio = 0.4 (in this case, the level of outlier noise is 17.4%), as described in Section IV-A. Moreover, we considered four different dictionaries formed by the Gaussian kernel, the polynomial kernel, the neural network kernel with sigmoid activation and the neural network kernel with ReLU activation, respectively, all of the same size. We varied $k$ over a grid of 11 values and recorded the associated test error. The curves of the test error are shown in Fig. 4.

From Fig. 4, the trends of the test error for the different dictionaries are generally similar; that is, as $k$ increases, the test error generally first decreases and then becomes stable. This phenomenon is mainly because when $k$ is small the selected model may be under-fitting, so increasing $k$ improves the generalization ability. More specifically, in both the uniform and outlier noise cases, Fig. 4 shows that a moderate number of iterations is generally sufficient. This to some extent verifies our main theorems (i.e., Theorems 1 and 2), which show that a moderately large $k$ suffices. Motivated by this experiment, in practice the maximal number of iterations for the proposed boosting method can be chosen empirically from a small set of candidate values (five in our experiments) via cross validation. Comparing the different dictionaries, the generalization performance of ReLU is slightly better than that of the other three dictionaries in both noise cases.

Fig. 4: Effect of the number of iterations $k$. (a) Test error curves of the four dictionaries with respect to $k$ under 30% uniform random noise; (b) the corresponding curves under 17.4% outlier noise with tol = 0.3 and ratio = 0.4.

Simulation 2: On the effect of the dictionary size $n$. Given a dictionary type, the size of the dictionary set generally reflects its approximation ability. In particular, according to Assumption 1, one prerequisite for boosting-type methods is that the underlying learner should be well approximated by the chosen dictionary set. However, a larger dictionary usually brings a higher computational cost, so it is meaningful to investigate the optimal size of the dictionary set. In this experiment, the training and test samples were generated in the same way as in Simulation 1. Instead of varying the number of iterations $k$, we varied the size $n$ of the dictionary set over a range up to the training sample size $m$. For each $n$, the number of iterations $k$ was chosen from five candidate values via cross validation. The curves of the test error with respect to the dictionary size are shown in Fig. 5.

From Fig. 5, the dictionary size has little effect on the generalization performance for all dictionary types and in both noise settings, provided $n$ is moderately large relative to $m$. This is also supported by our main theorems (i.e., Theorems 1 and 2), by which $n$ should be of a certain polynomial order in $m$ to achieve the optimal learning rates. Specifically, this experiment shows that the generalization performance of the proposed method does not vary much as $n$ varies over the considered range. In our later experiments, we empirically set $n = m$ in consideration of both generalization performance and computational cost. Regarding the performance of the different dictionaries, we observed that ReLU and Gauss are slightly better than Polynomial and Sigmoid.

Fig. 5: Effect of the dictionary size $n$. (a) Test error curves of the four dictionaries with respect to $n$ under 30% uniform random noise; (b) the corresponding curves under 17.4% outlier noise with tol = 0.3 and ratio = 0.4.

Simulation 3: On the comparison among different losses. In this experiment, we compare the performance of the fully-corrective greedy boosting method with different loss functions, including the square loss, the hinge loss, the suggested squared hinge loss and the cubed hinge loss. Note that the hinge loss is non-differentiable, while both the squared hinge and cubed hinge losses are differentiable. As demonstrated in the literature [27, 23, 21], the differentiability of the squared hinge loss brings many benefits to both computational implementation and theoretical analysis. In this experiment, we aim to exhibit similar benefits brought by the squared hinge loss. Specifically, the training and test samples were generated in a similar way as described in Simulation 1. Moreover, we considered different noise levels for both uniform random noise and outlier noise. For each case, we repeated the experiments and recorded the averages of the test errors. The test errors for the different losses are presented in Table I and Table II.

From Tables I and II, the performance of the suggested squared hinge loss is commonly slightly better than that of the other three loss functions. Comparing the performance of the different dictionaries, Tables I and II show that Gauss and ReLU frequently outperform Polynomial and Sigmoid, which was also observed in the previous experiments. These results show the effectiveness of the suggested squared hinge loss in comparison with the other loss functions.

Dictionary | 20% uniform random | 30% uniform random | 40% uniform random
Gauss | 0.0239 / 0.0265 / 0.0295 / 0.0283 | 0.0418 / 0.0419 / 0.0419 / 0.0431 | 0.0851 / 0.0879 / 0.0882 / 0.0891
Polynomial | 0.0248 / 0.0265 / 0.0276 / 0.0289 | 0.0425 / 0.0403 / 0.0436 / 0.0490 | 0.0879 / 0.0929 / 0.0935 / 0.0929
Sigmoid | 0.0524 / 0.0693 / 0.0479 / 0.0271 | 0.0597 / 0.0768 / 0.0589 / 0.0433 | 0.0922 / 0.0925 / 0.0895 / 0.0963
Relu | 0.0219 / 0.0394 / 0.0266 / 0.0288 | 0.0335 / 0.0503 / 0.0397 / 0.0453 | 0.0810 / 0.0864 / 0.0850 / 0.1
TABLE I: Comparison of the test errors of the different losses under different levels of uniform random noise; each cell lists the four compared losses' errors for the given dictionary and noise level. The best results among the different losses are marked in bold.
Dictionary | 8.51% outlier noise | 12.83% outlier noise | 17.31% outlier noise
Gauss | 0.0125 / 0.0136 / 0.0126 / 0.0145 | 0.0171 / 0.0195 / 0.0255 / 0.0237 | 0.0450 / 0.0548 / 0.0498 / 0.0714
Polynomial | 0.0157 / 0.0184 / 0.0170 / 0.0172 | 0.0245 / 0.0238 / 0.0327 / 0.0332 | 0.0608 / 0.0550 / 0.0737 / 0.0866
Sigmoid | 0.0554 / 0.0584 / 0.0514 / 0.0152 | 0.0512 / 0.0634 / 0.0487 / 0.0317 | 0.0629 / 0.0701 / 0.0657 / 0.0770
Relu | 0.0129 / 0.0134 / 0.0149 / 0.0141 | 0.0156 / 0.0212 / 0.0211 / 0.0272 | 0.0380 / 0.0385 / 0.0405 / 0.0745
TABLE II: Comparison of the test errors of the different losses under different levels of outlier noise with the same tol and varying ratio, where the associated noise levels are 8.51%, 12.83% and 17.31%, respectively; each cell lists the four compared losses' errors for the given dictionary and noise level. The best results among the different losses are marked in bold.

Simulation 4: On the comparison among different update schemes. In this experiment, we compared the fully-corrective update with most of the existing update schemes, namely the original boosting scheme (OrigBoosting for short) in [10], regularized boosting with shrinkage (RSBoosting) in [12], regularized boosting with truncation (RTBoosting) in [45], forward stagewise boosting (ε-Boosting) in [18], and the rescaled boosting (RBoosting) suggested in the recent paper [41], all applied to empirical risk minimization with the squared hinge loss over the Gaussian-type dictionary. The optimal width of the Gaussian kernel was determined via cross validation over a candidate set. Specifically, the training and test samples were generated according to Section IV-A, with equal numbers of training and test samples and 30% uniform random noise in the training samples. The size of the generated dictionary was set equal to the training sample size, as in the previous simulations. The maximal number of iterations for the proposed FCGBoosting was set to 500, while a larger maximal number of iterations was allowed for the other boosting methods. For each trial, we recorded the optimal test error with respect to the number of iterations, together with the associated training error and the number of selected dictionary elements. The averages of the optimal test error, training error and number of selected dictionary elements over 10 repetitions are presented in Table III.

As shown in Table III, all boosting methods with the optimal number of dictionary elements perform almost the same in terms of generalization ability as measured by the test error, and under these optimal scenarios all the boosting methods are generally well fitted from the perspective of training error. As demonstrated by Table III and Fig. 1(a), the most significant advantage of the adopted fully-corrective update scheme is that the number of dictionary elements selected by FCGBoosting is generally far smaller than that of the existing methods such as OrigBoosting, RSBoosting, RTBoosting, ε-Boosting and RBoosting. In particular, from Table III, the average number of dictionary elements for the proposed FCGBoosting is only 12.6, which is close to the theoretical order suggested in Theorem 1. Moreover, from Fig. 1(b), most of the partially-corrective greedy boosting methods select new dictionary elements slowly after a certain number of iterations, while their generalization performance also improves very slowly.

Boosting type | OrigBoosting [10] | RSBoosting [12] | RTBoosting [45] | ε-Boosting [18] | RBoosting [41] | FCGBoosting (this paper)
Test error | 0.0238 | 0.0256 | 0.0239 | 0.0256 | 0.0221 | 0.0229
Training error | 0.3071 | 0.3074 | 0.3069 | 0.3077 | 0.3078 | 0.3076
Dictionary no. | 103.8 | 140.1 | 72.2 | 313.1 | 120.8 | 12.6
TABLE III: Comparisons among different types of boosting methods.

V Real data experiments

In this section, we show the effectiveness of the proposed method via a series of experiments on 11 UCI data sets covering various areas, and an earthquake intensity classification dataset.

V-A UCI Datasets

Samples. All data sets are from https://archive.ics.uci.edu/ml/datasets.html. The sizes of the data sets are listed in Table IV. For each data set, the samples were split into training, validation and test sets.

Competitors. We evaluated the effectiveness of the proposed boosting method by comparing it with the baselines and five state-of-the-art methods: two typical support vector machine (SVM) methods with radial basis function (SVM-RBF) and polynomial (SVM-Poly) kernels, respectively; a fast polynomial kernel based method for classification recently proposed in [44], called FPC; random forests (RF) [5]; and AdaBoost [10]. We used the well-known libsvm toolbox (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) to implement the SVM methods. For the proposed method, we again considered the four dictionaries Gauss, Polynomial, Sigmoid and ReLU.

Implementation. For the proposed boosting method, the parameters, initialization and maximal number of iterations of the ADMM method used in the FCG step were set as in the simulations; the stopping criterion of the suggested method was a maximal number of boosting iterations, chosen from five candidate values via cross validation; and the size of the dictionary set was set as the number of training samples $m$. These empirical settings are generally adequate, as shown in the previous simulations.

For both SVM-RBF and SVM-Poly, the parameters involved in libsvm were determined via a grid search over a region in logarithmic scale, while for SVM-Poly, the kernel parameter was selected from the interval [1, 10] via a grid search with 10 candidates, i.e., $\{1, 2, \ldots, 10\}$. The kernel parameter of FPC was selected in the same way as for SVM-Poly.

For RF, the number of trees was determined via a grid search with 10 candidates. For AdaBoost, the number of trees was set to 100. For each data set, we ran the experiments several times for all algorithms and recorded the average test accuracy, defined as the percentage of correctly classified labels.
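The experiments above were run with libsvm in Matlab; purely as an illustration of the same tuning protocol, a scikit-learn sketch with assumed candidate grids might look as follows.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Illustrative grid search in the spirit of the protocol above; the exact search
# ranges are assumptions, not the paper's values.
rbf_grid = {
    "C": np.logspace(-3, 3, 7),        # logarithmic grid for the penalty parameter
    "gamma": np.logspace(-3, 3, 7),    # logarithmic grid for the RBF kernel width
}
poly_grid = {
    "C": np.logspace(-3, 3, 7),
    "degree": list(range(1, 11)),      # polynomial kernel parameter from {1, ..., 10}
}

svm_rbf = GridSearchCV(SVC(kernel="rbf"), rbf_grid, cv=5)
svm_poly = GridSearchCV(SVC(kernel="poly"), poly_grid, cv=5)
# svm_rbf.fit(X_train, y_train); accuracy = svm_rbf.score(X_test, y_test)
```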

Data sets Data size #Attributes
heart 270 14
breast_cancer 683 9
biodeg 783 42
banknote_authentication 1,372 4
seismic_bumps 2,584 18
musk2 6,598 166
HTRU2 17,898 8
MAGIC_Gamma_Telescope 19,020 10
occupancy 20,560 5
default_of_credit_card_clients 30,000 24
Skin_NonSkin 245,057 3
TABLE IV: Sizes of the UCI data sets. In the subsequent tables, each data set is abbreviated by the first word of its name.

Experimental results. The experimental results on the UCI data sets are reported in Table V. From Table V, the proposed boosting method with different dictionaries performs slightly differently. In general, the proposed boosting method with the Gaussian, Polynomial and ReLU dictionaries performs slightly better than with the other dictionaries, as also observed in the previous experiments. Compared with the other state-of-the-art methods, the proposed boosting method with the optimal dictionary usually performs better: it performs best on 9 data sets and slightly worse than the best results on the other 2 data sets. If we compare the proposed boosting method with the existing methods using the same dictionary, say, Boost-Gauss vs. SVM-RBF and Boost-Poly vs. SVM-Poly (or FPC), it can be observed that the adopted boosting scheme frequently improves the accuracy of these weak learners.

Data sets Boost-Gauss Boost-Poly Boost-Sigmoid Boost-ReLU SVM-RBF SVM-Poly FPC RF AdaBoost Baseline
heart 87.93 86.31 86.21 89.93 84.21 89.64 84.14 89.43 89.44 81.36
breast 97.75 96.95 97.57 97.10 97.19 96.84 96.78 96.81 96.34 96.20
biodeg 96.86 99.56 98.35 97.71 96.42 99.50 98.60 97.09 98.44 84.64
banknote 100 100 99.72 99.78 98.07 97.72 98.15 98.99 99.17 95.81
seismic 96.44 96.44 96.44 96.44 93.84 93.59 93.68 92.88 96.40 88.00
musk2 100 99.67 99.88 99.76 91.11 92.82 99.08 96.56 98.85 90.30
HTRU2 98.93 98.92 98.90 99.00 97.53 97.42 97.26 97.88 98.98 99.00
MAGIC 85.11 86.00 85.35 87.49 85.69 86.00 85.10 86.90 82.67 86.34
occupancy 98.80 98.52 98.51 98.76 98.63 98.95 98.77 99.14 99.55 97.16
default 82.56 83.27 81.07 82.36 81.60 82.10 80.51 81.01 81.67 82.00
Skin 98.67 99.21 98.26 99.75 98.80 99.06 98.83 99.94 99.14 98.09
TABLE V: Test accuracies (in percentages) of different algorithms on the UCI data sets, where the first four columns present the results of the proposed FCGBoosting over four different types of dictionaries. The best and second-best results are marked in red and blue, respectively.

V-B Earthquake Intensity Classification

In this experiment, we considered the U.S. Earthquake Intensity Database, downloaded from https://www.ngdc.noaa.gov/hazard/intintro.shtml. This database contains more than 157,000 reports on over 20,000 earthquakes that affected the United States from 1638 through 1985. The main features of each record are the geographic latitudes and longitudes of the epicentre and of the "reporting city" (or locality) where the Modified Mercalli Intensity (MMI) was observed, the magnitude (as a measure of seismic energy), and the hypocentral depth (positive downward) in kilometers from the surface, while the output label is the MMI, an integer varying from 1 to 12. An illustration of the generation procedure of each earthquake record is shown in Fig. 6.

Fig. 6: An illustration of the earthquake intensity data.

To convert this multi-class task into the binary classification setting considered in this paper, we set the labels lying in 1 to 4 as the positive class and the labels lying in 5 to 12 as the negative class, mainly according to the damage extent of the earthquake suggested by the referred website. Moreover, we removed incomplete records with missing labels. After this preprocessing, 8,173 effective records remain. The settings of this experiment were similar to those for the UCI data sets. The classification accuracies of all algorithms are shown in Table VI.
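A minimal preprocessing sketch in Python along these lines is shown below; the file name and column names are hypothetical placeholders, not the database's actual schema.

```python
import pandas as pd

# Illustrative preprocessing in the spirit of the description above.
df = pd.read_csv("earthquake_intensity.csv")
df = df.dropna(subset=["MMI"])                      # remove records with missing labels
y = df["MMI"].apply(lambda v: 1 if v <= 4 else -1)  # MMI 1-4 -> positive, 5-12 -> negative
X = df[["epicenter_lat", "epicenter_lon", "city_lat", "city_lon", "magnitude", "depth_km"]]
```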

From Table VI, the proposed boosting method with a suitable dictionary is generally better than the other state-of-the-art methods, including the two SVM methods, random forest, AdaBoost and FPC. Moreover, among the used dictionaries, the proposed boosting method with the polynomial kernel performs best in this experiment.

Algorithm FCGBoost-Gauss FCGBoost-Poly FCGBoost-Sigmoid FCGBoost-ReLU SVM-RBF SVM-Poly FPC RF AdaBoost
Test Acc. (%) 78.93 80.48 79.27 80.38 80.37 73.92 80.16 74.51 75.80
TABLE VI: Test accuracies (in percentages) on the earthquake intensity data set, where the first four columns present the results of the proposed FCGBoosting over four different types of dictionaries. The best and second-best results are marked in red and blue, respectively.

VI Proofs

In this section, we prove Theorem 1 and Theorem 2 by developing a novel concentration inequality associated with the squared hinge loss, a fast numerical convergence rate for FCGBoosting, and some standard error analysis techniques from [37, 25]. Throughout the proofs, we omit certain subscripts for simplicity and write $\phi(t) = (\max\{1-t, 0\})^2$ for the squared hinge loss.
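The following elementary properties of the squared hinge loss are used repeatedly below; they are standard facts recorded here for convenience.

```latex
% Elementary properties of the squared hinge loss phi(t) = (max{1 - t, 0})^2
\[
  \phi(t) = \big(\max\{1-t,\,0\}\big)^2, \qquad
  \phi'(t) = -2\max\{1-t,\,0\},
\]
% phi is convex and continuously differentiable, and phi' is 2-Lipschitz:
\[
  |\phi'(s) - \phi'(t)| \le 2\,|s-t| \quad \text{for all } s, t \in \mathbb{R},
\]
% so phi is a smooth loss of quadratic type in the sense used in [25].
```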

VI-A Concentration inequality with squared hinge loss

Denote by $\mathcal{E}(f) = \mathbb{E}\big[\phi(Y f(X))\big]$ and $\mathcal{E}_D(f) = \frac{1}{m}\sum_{i=1}^m \phi(y_i f(x_i))$ the expected risk and the empirical risk, respectively. Let $f_\phi$ be the regression function minimizing $\mathcal{E}(f)$. Since $\phi$ is the squared hinge loss, it can be found in [3] that

$$f_\phi(x) = 2\eta(x) - 1. \qquad (7)$$

Our aim is to derive a learning rate for the generalization error $\mathcal{E}(f_k) - \mathcal{E}(f_\phi)$. Noting that the squared hinge loss is of quadratic type, we have the variance-expectation relation from [3, Lemma 7] (see also [25])

(8)

where $L^2_{\rho_X}$ denotes the space of square integrable functions endowed with its natural norm. For a set of functions $\mathcal{F}$ and $\varepsilon > 0$, denote by $\mathcal{N}_2(\varepsilon, \mathcal{F})$ and $\mathcal{N}_\infty(\varepsilon, \mathcal{F})$ the $\varepsilon$-covering numbers of $\mathcal{F}$ under the empirical $\ell^2$ norm and the uniform norm, respectively. The following concentration inequality is the main tool in our analysis.

Theorem 3.

Let $\mathcal{F}$ be a set of uniformly bounded functions. Then for arbitrary $\varepsilon > 0$ and $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds

(9)

It should be mentioned that a similar concentration inequality for the square loss was proved in [16, Theorem 11.4]. In [42], a more general concentration inequality associated with the uniform covering number was presented for any bounded loss. Since we do not impose any boundedness assumption on the hypothesis space, it is difficult to derive a uniform covering number estimate for the hypothesis space of FCGBoosting. Under this circumstance, a concentration inequality such as the one presented in Theorem 3 is highly desirable.

Let be a set of functions . For and ,

and

Denote

By the definition of , one has

(10)

One of the most important stepping stones of our proof is the following relation between the variance and the expectation, which can be found in [3, Lemma 7 and Table 1]:

(11)

To derive another tool, we recall a classical concentration inequality shown in the following lemma [16, Theorem 11.6].

Lemma 1.

Let be a set of functions and be i.i.d.

-valued random variables. Assume

, , and . Then

Based on Lemma 1, we can easily derive the following bound.

Lemma 2.

For arbitrary and ,

Proof:

Since , for , it follows from Lemma 1 with and that with confidence

for all , there holds

For arbitrary and , if , we apply the formula

(12)

and obtain

If , the above estimate also holds trivially. Then for arbitrary and , it follows from the above estimate that

This completes the proof of Lemma 2. ∎

The third tool of our proof is a simplified version of Talagrand's inequality, which can be easily deduced from Theorem 7.5 and Lemma 7.6 in [37].

Lemma 3.

Let , and be constants such that and for all . Then, for any , and , with confidence , there holds

With these tools, we are now in the position to prove Theorem 3.

Proof:

For arbitrary , we have and . Let . Then for arbitrary , there exists an such that . Then, we get from (11) that

(13)

and

(14)

Then it follows from Lemma 3 with the above choices, Lemma 2, (13) and (14) that, with the stated confidence, there holds

For arbitrary , set and . It follows from (10) that , with confidence , there holds

For arbitrary , we have

Then,

Noting further that the elementary inequality yields

This proves (9) and completes the proof of Theorem 3. ∎

VI-B Numerical convergence without boundedness assumption

In this part, we show the fast numerical convergence rate of FCGBoosting without imposing any boundedness assumption. Our proof is motivated by [33] and exploits the special properties of the squared hinge loss. The following numerical convergence rate is another main tool in our proof.

Proposition 1.

For arbitrary , we have

(15)

It can be seen from [2] and Proposition 1 that the numerical convergence rates of FCGBoosting are the same for the square loss and the squared hinge loss. To prove the above proposition, we need the following lemma, which was proved in [33, Lemma B.2].

Lemma 4.

Let and let be a sequence such that for all . Let be a positive scalar and be a positive integer such that . Then .

Based on this lemma, we can estimate the desired upper bound. Similar results can be found in [33, Theorem 2.7] for general smooth loss functions; we provide the proof here for the sake of completeness.

Proof of Proposition 1.

Let be an arbitrary function in . For , let be the support of , which implies

(16)

By Algorithm 1,

(17)

By the Lipschitz continuity of $\phi'$ with Lipschitz constant 2, [33, Lemma B.1] (see also [29]) shows

for any and . Let , then

(18)

Since , and , there holds