1 Introduction
Boosting methods combine a set of weak classifiers (weak learners) to form a strong classifier. Boosting has been extensively studied [3, 4] and applied to a wide range of applications due to its robustness and efficiency (e.g., real-time object detection [5, 6, 7]). Despite the fact that most classification tasks are inherently multi-class problems, the majority of boosting algorithms are designed for binary classification. A popular approach to multi-class boosting is to split the multi-class problem into multiple binary classification problems. A simple example is the one-vs-all approach. The well-known error-correcting output coding (ECOC) methods [8] belong to this category. AdaBoost.ECC [9], AdaBoost.MH [10] and AdaBoost.MO [10] can all be viewed as examples of the ECOC approach. The second approach is to directly formulate multi-class classification as a single learning task, based on pairwise model comparisons between different classes. Shen and Hao's direct formulation for multi-class boosting (referred to as MultiBoost) is such an example [1]. From the perspective of optimization, MultiBoost can be seen as an extension of the binary column generation boosting framework [11, 4] to the multi-class case. Our work here builds upon MultiBoost.
As in most existing multi-class boosting methods, in MultiBoost [1] different classes share the same set of weak learners, which leads to a sparse solution of the model parameters and hence slow convergence. To solve this problem, in this work we propose a novel formulation for multi-class boosting (referred to as MultiBoost^cw) that uses separate weak learner sets; namely, each class uses its own set of weak learners. Compared to MultiBoost, MultiBoost^cw converges much faster, generally has better generalization performance, and does not introduce additional training time. Note that AdaBoost.MO proposed in [10] also uses a different set of weak classifiers for each class. However, AdaBoost.MO is based on ECOC and its code matrix is specified before learning; hence, unlike in AdaBoost.ECC, the dependence between the fixed code matrix and the generated binary classifiers is not explicitly taken into account. In contrast, our MultiBoost^cw is based on the direct formulation of multi-class boosting, which leads to fundamentally different optimization strategies. More importantly, as shown in our experiments, our MultiBoost^cw is much more scalable than AdaBoost.MO, although both enjoy faster convergence than most other multi-class boosting methods.
In MultiBoost [1], sophisticated optimization tools such as Mosek or L-BFGS-B [12] are needed to solve the resulting optimization problem at each boosting iteration, which is not very scalable. Here we propose a fast coordinate descent algorithm (FCD) for solving the optimization problem that arises at each boosting iteration of MultiBoost^cw. Coordinate descent (CD) methods choose one variable at a time and efficiently solve the resulting single-variable subproblem. CD has been applied to many large-scale optimization problems. For example, Yuan et al. [13] made comprehensive empirical comparisons of methods for l1-regularized linear classification and concluded that CD methods are very competitive for solving large-scale problems. In the formulation of MultiBoost (and also in our MultiBoost^cw), the number of variables is the product of the number of classes and the number of weak learners, which can be very large (especially when the number of classes is large). Therefore CD methods may be a better choice for fast optimization of multi-class boosting. Our FCD method is specially tailored to the optimization of MultiBoost^cw. We are able to obtain a closed-form solution for each variable update, so the optimization can be extremely fast. The proposed FCD is easy to implement and no sophisticated optimization toolbox is required.
Main contributions 1) We propose a novel multi-class boosting method (MultiBoost^cw) that uses class-specific weak learners. Unlike MultiBoost, which shares a single set of weak learners across all classes, our method uses a separate set of weak learners for each class. We generate k weak learners in each boosting iteration (k being the number of classes), one weak learner for each class. With this mechanism, we are able to achieve much faster convergence.
2) Similar to MultiBoost [1], we employ column generation to implement the boosting training. We derive the Lagrange dual problem of the new multi-class boosting formulation, which enables us to design fully corrective multi-class algorithms using the primal-dual optimization technique.
3) We propose an FCD method for fast training of MultiBoost^cw. We obtain an analytical solution for each variable update in coordinate descent. We use the Karush-Kuhn-Tucker (KKT) conditions to derive an effective stopping criterion and to construct working sets of violated variables for faster optimization. We show that FCD covers both fully corrective optimization (updating all variables) in multi-class boosting and fast stagewise optimization as in standard AdaBoost (updating only the newly added variables).
Notation Let us assume that we have k classes. A weak learner is a function that maps an example x to a binary output in {-1, +1}. We denote a weak learner by h(.), with h in H, where H is the space of all weak learners and n is the number of weak learners. We define column vectors H_r(x) in R^n as the outputs of the weak learners associated with the r-th class on example x. Let w_r in R^n denote the weak learners' coefficients for class r. Then the strong classifier for class r is F_r(x) = w_r^T H_r(x). We need to learn k strong classifiers, one for each class. Given a test datum x, the classification rule is y* = argmax_r F_r(x). 1 is a vector with all elements being one; its dimension should be clear from the context.
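To make the notation concrete, the following is a minimal sketch (in Python/NumPy, with hypothetical variable and function names) of how the per-class strong classifiers and the classification rule above can be evaluated; weak_learner_sets[r] stands in for the weak learners behind H_r(x) and coefs[r] for w_r.

import numpy as np

def predict(x, weak_learner_sets, coefs):
    """Evaluate the k per-class strong classifiers and return the predicted label.

    weak_learner_sets: list of length k; weak_learner_sets[r] is a list of
        binary weak learners (callables returning -1 or +1) for class r.
    coefs: list of length k; coefs[r] is the non-negative coefficient vector
        w_r (one entry per weak learner in weak_learner_sets[r]).
    """
    scores = []
    for learners_r, w_r in zip(weak_learner_sets, coefs):
        # H_r(x): outputs of the weak learners associated with class r
        h_r = np.array([h(x) for h in learners_r], dtype=float)
        # F_r(x) = w_r^T H_r(x)
        scores.append(float(np.dot(w_r, h_r)))
    # classification rule: pick the class with the largest confidence
    return int(np.argmax(scores))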
2 Our Approach
We show how to formulate the multi-class boosting problem in the large-margin learning framework. Analogous to MultiBoost, we can define the multi-class margins associated with a training example (x_i, y_i) as
(1) 
for r = 1, ..., k, r != y_i. Intuitively, rho_{i,r} is the difference between the classification score of the correct model and that of a "wrong" model. We want to make this margin as large as possible. MultiBoost^cw with the exponential loss can be formulated as:
(2) 
Here rho_{i,r} is defined in (1). We have also introduced a shorthand symbol for notational convenience. The parameter nu controls the complexity of the learned model.
The model parameters are w_1, ..., w_k.
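For concreteness, a plausible form of the margin (1) and the primal problem (2) under the notation above is sketched below in LaTeX; this is a hedged reconstruction, and the exact regularization convention of the original formulation may differ.

% Hedged sketch of (1) and (2); m denotes the number of training examples.
\rho_{i,r} = F_{y_i}(\mathbf{x}_i) - F_r(\mathbf{x}_i)
           = \mathbf{w}_{y_i}^\top H_{y_i}(\mathbf{x}_i) - \mathbf{w}_r^\top H_r(\mathbf{x}_i),
           \qquad r \neq y_i ,
\min_{\mathbf{w}_1,\dots,\mathbf{w}_k \,\geq\, 0}\;
    \sum_{i=1}^{m} \sum_{r \neq y_i} \exp(-\rho_{i,r})
    \;+\; \nu \sum_{r=1}^{k} \mathbf{1}^\top \mathbf{w}_r .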
Minimizing (2) encourages the confidence score of the correct label of a training example to be larger than the confidence scores of the other labels. We define Y as the set of labels: Y = {1, 2, ..., k}. The discriminant function we need to learn is F(x, r) = w_r^T H_r(x). The class label prediction for an unknown example x is obtained by maximizing F(x, r) over r in Y, which means finding the class label with the largest confidence: y* = argmax_{r in Y} F(x, r).
MultiBoost^cw is an extension of MultiBoost [1] for multi-class classification. The only difference is that in MultiBoost, different classes share the same set of weak learners, whereas in MultiBoost^cw each class is associated with its own separate set of weak learners. We show that MultiBoost^cw learns a more compact model than MultiBoost.
Column generation for MultiBoost^cw To implement boosting, we need to derive the dual problem of (2). Similar to [1], the dual problem of (2) can be written as (3), in which r is the index of the class labels and u_{i,r} is the dual variable associated with one constraint in (2):
(3a)  
(3b)  
(3c) 
Following the idea of column generation [4], we divide the original problem (2) into a master problem and a subproblem, and solve them alternately. The master problem is a restricted version of (2) which only considers the weak learners generated so far. The subproblem generates k weak learners (one for each of the k classes) by finding the most violated constraint of each class in the dual (3), and adds them to the master problem at each iteration. The subproblem for finding the most violated constraints can be written as:
(4) 
The column generation procedure for MultiBoost^cw is described in Algorithm 1. Essentially, we repeat the following two steps until convergence: 1) We solve the master problem (2), restricted to the working set of weak learners generated so far for each class, to obtain the primal solution w_1, ..., w_k. We then obtain the dual solution u from the primal solution using the KKT conditions:
(5) 
2) With the dual solution u, we solve the subproblem (4) to generate k weak learners, one per class, and add each of them to the working weak learner set of the corresponding class. In MultiBoost^cw, k weak learners are thus generated in each iteration (one for each class), while in MultiBoost, only one weak learner is generated at each column generation iteration and is shared by all classes. As shown in [1], the subproblem for finding the most violated constraint in the dual of MultiBoost is:
(6) 
At each column generation iteration of MultiBoost, (6) is solved to generate one weak learner. Note that solving (6) requires searching over all classes to find the single best weak learner; thus its computational cost is the same as solving (4) for all classes in MultiBoost^cw. This is the reason why MultiBoost^cw does not introduce additional training cost compared to MultiBoost. In general, the solution of MultiBoost is highly sparse [1], which we also observe in our empirical study. The weak learner generated by solving (6) is actually targeted at one particular class, so forcing all classes in MultiBoost to use this weak learner leads to a very sparse solution. The sparsity indicates that one weak learner is usually only useful for predicting a very small number of classes (typically only one), and useless for most other classes. In this sense, forcing different classes to use the same set of weak learners may not be necessary, and it usually leads to slow convergence. In contrast, by using a separate weak learner set for each class, MultiBoost^cw tends to have a dense solution. With k weak learners generated at each iteration, MultiBoost^cw converges much faster.
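To summarize the training procedure, the following is a minimal Python-style sketch of the column generation loop described above. The helpers solve_master, dual_from_primal and find_most_violated are hypothetical stand-ins for solving the restricted master problem (2), applying the KKT mapping (5), and solving the per-class subproblem (4), respectively.

def multiboost_cw(train_data, num_classes, max_iters,
                  solve_master, dual_from_primal, find_most_violated):
    """Sketch of column-generation training with class-wise weak learner sets.

    Each boosting iteration adds one weak learner per class (k in total) by
    finding the most violated dual constraint of that class, then re-solves
    the restricted master problem over all weak learners generated so far.
    """
    working_sets = [[] for _ in range(num_classes)]  # one weak learner set per class
    coefs = [[] for _ in range(num_classes)]         # w_r, grows with the working set
    dual = None  # assume the helper treats None as uniform initial dual weights
    for _ in range(max_iters):
        # Subproblem (4): one most-violated weak learner per class.
        for r in range(num_classes):
            h_new = find_most_violated(train_data, dual, class_index=r)
            working_sets[r].append(h_new)
            coefs[r].append(0.0)                     # new variable starts at zero
        # Master problem (2) restricted to the current working sets
        # (solved, e.g., with the FCD method described below).
        coefs = solve_master(train_data, working_sets, coefs)
        # Dual solution recovered from the primal via the KKT conditions (5).
        dual = dual_from_primal(train_data, working_sets, coefs)
    return working_sets, coefs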
Fast coordinate descent To further speed up training, we propose a fast coordinate descent method (FCD) for solving the primal MultiBoost^cw problem at each column generation iteration. The details of FCD are presented in Algorithm 2. The high-level idea is simple. FCD works iteratively: at each iteration (working-set iteration), we compute the violated value of the KKT conditions for each variable of the model parameters, construct a working set of violated variables (denoted V), and then pick variables from V for update (one variable at a time). We also use the violated values to define a stopping criterion. Our FCD is a mix of sequential and stochastic coordinate descent: in the first working-set iteration, variables are picked sequentially for update (cyclic CD); in later working-set iterations, variables are picked randomly (stochastic CD). In the sequel, we present the details of FCD. First, we describe how to update one variable by solving a single-variable subproblem. For notational simplicity, we define delta_i as the orthogonal label coding vector of example i: its r-th element is 1(r = y_i), where 1(.) is the indicator function that returns 1 if its argument is true and 0 otherwise.
The symbol for the tensor product denotes the tensor product of vectors. MultiBoost^cw in (2) can be equivalently written as:
(7)
We assume that binary weak learners are used here: h(.) in {-1, +1}. Let w denote the concatenation of w_1, ..., w_k, and let H denote the corresponding stacked vector of weak learner outputs; H_j denotes the j-th dimension of H, and the remaining dimensions are those of H excluding the j-th. In the constraints of (7), the output associated with the j-th dimension takes only three possible values: {-1, 0, +1}. For the j-th dimension, we define the index set I_s as the set of constraint indices on which this output equals s, for s in {-1, 0, +1}. Similarly, w_j denotes the j-th variable of w, and the remaining variables are those of w excluding the j-th. Let f(w) be the objective function of the optimization (7). f(w) can be decomposed as:
(8) 
Here we have defined:
(9a)  
(9b) 
In the variable update step, one variable is picked at a time for updating while the other variables are kept fixed; thus we need to minimize f(w) in (8) with respect to w_j, which is a single-variable minimization. It can be written as:
(10) 
The derivative of the objective function in (10) with respect to w_j is:
(11) 
By setting (11) to zero and taking the bound constraint w_j >= 0 into account, we obtain the analytical solution of the optimization in (10) (the objective in (10) is convex in w_j):
(12) 
When nu is large, (12) can be approximately simplified as:
(13) 
With the analytical solution in (12), the update of each dimension of w can be performed extremely efficiently. The main requirement for obtaining the closed-form solution is the use of discrete weak learners.
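For intuition on why a closed form exists, the following LaTeX sketch shows a simplified version of the single-variable problem, with the regularization term dropped and with A and B denoting the positive constants that collect the terms defined in (9a) and (9b); this is an illustrative assumption, and the actual solution (12) also accounts for nu and the exact constants of the formulation.

% Simplified single-variable subproblem (regularization omitted):
\min_{w_j \geq 0}\; A\, e^{-w_j} + B\, e^{w_j} + \mathrm{const}, \qquad A, B > 0 .
% Setting the derivative to zero gives  -A e^{-w_j} + B e^{w_j} = 0,
% i.e.  e^{2 w_j} = A / B ; projecting onto the constraint  w_j >= 0  yields
w_j^{\star} = \max\!\Big( 0,\; \tfrac{1}{2}\,\ln\frac{A}{B} \Big) .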
We use the KKT conditions to construct a set of violated variables and to derive a meaningful stopping criterion. For the optimization of MultiBoost^cw in (7), the KKT conditions are necessary and also sufficient for optimality. The Lagrangian of (7) introduces a non-negative multiplier for each constraint w_j >= 0. According to the KKT conditions, w_j is optimal for (10) if and only if it satisfies df(w)/dw_j >= 0 and w_j df(w)/dw_j = 0. Considering complementary slackness: if w_j > 0, we have df(w)/dw_j = 0; if w_j = 0, we have df(w)/dw_j >= 0. The optimality conditions can be written as:
(14) 
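A plausible form of (14), consistent with the complementary-slackness argument above, is the standard per-coordinate condition for bound-constrained minimization, sketched in LaTeX below.

% Per-coordinate optimality for  min f(w)  s.t.  w >= 0:
\frac{\partial f(\mathbf{w})}{\partial w_j} = 0 \quad \text{if } w_j > 0,
\qquad\qquad
\frac{\partial f(\mathbf{w})}{\partial w_j} \geq 0 \quad \text{if } w_j = 0 .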
For notational simplicity, we define a column vector u as in (15). With the optimality conditions (14), we define pi_j in (16) as the violated value of the j-th variable of the solution w:
(15) 
(16) 
At each working-set iteration of FCD, we compute the violated values pi_j and construct a working set V of violated variables; then we randomly (except in the first iteration) pick one variable from V for update. We repeat the picking |V| times, where |V| is the number of elements of V. V is defined as
(17) 
where epsilon is a tolerance parameter. Analogous to [14] and [13], with the definition of the variable violated values in (16), we can define the stopping criterion as:
(18) 
where epsilon can be the same tolerance parameter as in the working-set definition (17). The stopping condition (18) states that FCD terminates once the largest violated value is smaller than the threshold. We can see that using the KKT conditions is effectively using gradient information. An inexact solution is acceptable at each column generation iteration, so we place a maximum number of working-set iterations (tau in Algorithm 2) on FCD to prevent unnecessary computation. We need to compute u before obtaining the violated values, but computing u directly from (15) is expensive. Fortunately, we are able to update u efficiently after each single-variable update, which avoids recomputing (15). The vector u in (15) can be equivalently written as:
(19) 
The update of u after one variable update is then:
(20) 
With the definition of u in (19), the quantities needed for one variable update can be computed efficiently from u, avoiding the expensive computations in (9a) and (9b); they can be equivalently defined as:
(21) 
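The following is a minimal Python-style sketch of the FCD working-set loop described above, under the stated assumptions. The helpers kkt_violation, closed_form_update and refresh_u are hypothetical stand-ins for the violated value (16), the analytical update (12) and the incremental update of u in (20).

import random

def fcd(w, u, new_vars, kkt_violation, closed_form_update, refresh_u,
        eps=1e-3, max_ws_iters=5):
    """Sketch of the fast coordinate descent (FCD) inner solver.

    w            : current coefficient vector (all classes stacked), as a list.
    u            : cached per-constraint loss values, cf. (15)/(19).
    new_vars     : indices of the variables added at this column generation round.
    max_ws_iters : tau in Algorithm 2; tau = 1 recovers the stagewise setting.
    """
    working_set = list(new_vars)                 # initialized with the new variables
    for ws_iter in range(max_ws_iters):
        if ws_iter == 0:
            picks = working_set                  # first pass: sequential (cyclic) CD
        else:
            # Later passes: working set of violated variables via the KKT conditions.
            violations = {j: kkt_violation(j, w, u) for j in range(len(w))}
            if max(violations.values()) < eps:   # stopping criterion (18)
                break
            working_set = [j for j, v in violations.items() if v >= eps]
            picks = [random.choice(working_set) for _ in working_set]  # stochastic CD
        for j in picks:
            w_old = w[j]
            w[j] = closed_form_update(j, w, u)   # analytical solution (12)
            u = refresh_u(u, j, w_old, w[j])     # cheap incremental update (20)
    return w, u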
Some discussion of FCD (Algorithm 2) follows: 1) Stagewise optimization is a special case of FCD. Compared to totally corrective optimization, which considers all variables of w for update, the stagewise setting only considers the newly added variables. We initialize the working set with the newly added variables, and in the first working-set iteration we update them sequentially. If the maximum number of working-set iterations is set to 1 (tau = 1 in Algorithm 2), FCD becomes a stagewise algorithm. Thus FCD is a generalized algorithm with totally corrective update and stagewise update as special cases. In the stagewise setting, a large value of the regularization parameter nu is usually implicitly enforced, so we can use the simplified analytical solution (13) for the variable update.
2) Randomly picking one variable for update without any guidance leads to slow local convergence. When the solution gets close to optimality, usually only very few variables need updating, and most picks do not "hit". In column generation (CG), the value of w is initialized with the solution of the last CG iteration, and this initialization is already fairly close to optimality. Therefore the slow local convergence of stochastic coordinate descent (CD) is more severe in column-generation-based boosting. Here we use the KKT conditions to iteratively construct a working set of violated variables, and only the variables in the working set need to be updated. This strategy leads to faster CD convergence.
3 Experiments
We evaluate our method MultiBoost^cw on several UCI datasets and a variety of multi-class image classification applications, including digit recognition, scene recognition, and traffic sign recognition. We compare MultiBoost^cw against MultiBoost [1] with the exponential loss, and against three other popular multi-class boosting algorithms: AdaBoost.ECC [9], AdaBoost.MH [10] and AdaBoost.MO [10]. We use FCD as the solver for MultiBoost^cw, and L-BFGS-B [12] for MultiBoost. We also perform further experiments to evaluate FCD in detail. For all experiments, the best regularization parameter nu for MultiBoost^cw and MultiBoost is selected from a predefined range of candidate values; the tolerance parameter epsilon in FCD is set to a fixed small value. We use MultiBoost^cw-1 to denote MultiBoost^cw with the stagewise setting of FCD, which uses only one working-set iteration (tau = 1 in Algorithm 2). In MultiBoost^cw-1, we fix nu to a large value. All experiments are run 5 times. We compare the testing error, the total training time and the solver time on all datasets. The results show that our MultiBoost^cw and MultiBoost^cw-1 converge much faster than the other methods, use less training time than MultiBoost, and achieve the best testing error on most datasets.
AdaBoost.MO [10] (Ada.MO) has a convergence rate similar to that of our method, but it is much slower than our method and becomes intractable for large-scale datasets. We run Ada.MO on some UCI datasets and on MNIST; results are shown in Fig. 1 and Fig. 2. We set a maximum training time (1000 seconds) for Ada.MO; all other methods stay below this maximum time on those datasets. If the maximum time is reached, we report the results of the finished iterations.


Figure 1: Results on 2 UCI datasets: VOWEL and ISOLET. CW and CW1 are our methods (MultiBoost^cw and MultiBoost^cw-1); CW1 uses the stagewise setting. The number after each method name is the mean value, with standard deviation, at the last iteration. Our methods converge much faster and achieve competitive test accuracy. The total training time and the solver time of our methods are both less than those of MultiBoost [1].
UCI datasets: we use 2 UCI multi-class datasets: VOWEL and ISOLET. For each dataset, we randomly select 75% of the data for training and use the rest for testing. Results are shown in Fig. 1.



Handwritten digit recognition: we use 3 handwritten digit datasets: MNIST, USPS and PENDIGITS. For MNIST, we randomly sample 1000 examples from each class for training, and use the original test set of 10,000 examples. For USPS and PENDIGITS, we randomly select 75% of the data for training and use the rest for testing. Results are shown in Fig. 2.
Image datasets: we use 3 image datasets: PASCAL07, LabelMe and CIFAR-10. For PASCAL07, we use the 5 types of features provided in [15]. For LabelMe, we use the LabelMe-12-50k subset (http://www.ais.uni-bonn.de/download/datasets.html) and generate GIST features. For these two datasets, we use only those images that have a single class label, with 70% of the data for training and the rest for testing. For CIFAR-10 (http://www.cs.toronto.edu/~kriz/cifar.html), we construct 2 datasets: one uses GIST features and the other uses raw pixel values. We use the provided test set and the 5 provided training sets for the 5 runs. Results are shown in Fig. 3.


Scene recognition: we use 2 scene image datasets: Scene15 [16] and SUN [17]. For Scene15, we randomly select 100 images per class for training and use the rest for testing. We generate histograms of code words as features; the code book size is 200. Each image is divided into 31 sub-windows in a spatial hierarchy, and we generate a histogram in each sub-window, so the histogram feature dimension is 6200. For the SUN dataset, we construct a subset of the original dataset containing 25 categories. For each category, we use the top 200 images, and randomly select 80% of the data for training and the rest for testing. We use the HOG features described in [17]. Results are shown in Fig. 4.
Traffic sign recognition: we use the GTSRB traffic sign dataset (http://benchmark.ini.rub.de/). There are 43 classes and more than 50,000 images. We use the 3 types of HOG features provided with the dataset, giving 6052 features in total. We randomly select 100 examples per class for training and use the original test set. Results are shown in Fig. 5.
3.1 FCD evaluation
We perform further experiments to evaluate FCD with different parameter settings, and compare it to the L-BFGS-B [12] solver. We use 3 datasets in this section: VOWEL, USPS and SCENE15. We run FCD with different settings of the maximum number of working-set iterations (tau in Algorithm 2) to evaluate how this setting affects the performance of FCD. We also run the L-BFGS-B [12] solver on the same optimization problem (2) as FCD. The remaining settings are kept the same in all cases. Results are shown in Fig. 6. For L-BFGS-B, we use the default convergence setting to obtain a moderately accurate solution. The number after "FCD" in the figure is the setting of tau in Algorithm 2. The results show that the stagewise case (tau = 1) of FCD is the fastest, as expected. With a moderate setting of tau, the objective value of the optimization (2) converges much faster with our method than with L-BFGS-B. Thus a small setting of tau is sufficient to achieve a very accurate solution, while at the same time converging faster and requiring less running time than L-BFGS-B.


4 Conclusion
In this work, we have presented a novel multi-class boosting method. Based on the dual problem, boosting is implemented using the column generation technique. Different from most existing multi-class boosting methods, we train a separate weak learner set for each class, which results in much faster convergence.
Experiments on a wide range of datasets demonstrate that the proposed multi-class boosting achieves competitive test accuracy compared with other existing multi-class boosting methods, yet it converges much faster; and, thanks to the proposed efficient coordinate descent method, training our method is much faster than training MultiBoost [1].
Acknowledgement. This work was supported by ARC grants LP120200485 and FT120100969.
References
[1] Shen, C., Hao, Z.: A direct formulation for totally-corrective multi-class boosting. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2011)
[2] Paisitkriangkrai, S., Shen, C., van den Hengel, A.: Sharing features in multi-class boosting via group sparsity. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2012)
[3] Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics 26 (1998) 1651-1686
[4] Shen, C., Li, H.: On the dual formulation of boosting algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 2216-2231
[5] Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vision 57 (2004) 137-154
[6] Wang, P., Shen, C., Barnes, N., Zheng, H.: Fast and robust object detection using asymmetric totally-corrective boosting. IEEE Trans. Neural Networks & Learn. Syst. 23 (2012) 33-46
[7] Paisitkriangkrai, S., Shen, C., Zhang, J.: Fast pedestrian detection using a cascade of boosted covariance features. IEEE Trans. Circuits & Syst. for Video Tech. 18 (2008) 1140-1151
[8] Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. J. Artif. Int. Res. 2 (1995) 263-286
[9] Guruswami, V., Sahai, A.: Multiclass learning, boosting, and error-correcting codes. In: Proc. Annual Conf. Computational Learning Theory, New York, NY, USA, ACM (1999) 145-155
[10] Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37 (1999) 297-336
[11] Demiriz, A., Bennett, K.P., Shawe-Taylor, J.: Linear programming boosting via column generation. Mach. Learn. 46 (2002) 225-254
[12] Zhu, C., Byrd, R.H., Lu, P., Nocedal, J.: Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Software 23 (1997)
[13] Yuan, G.X., Chang, K.W., Hsieh, C.J., Lin, C.J.: A comparison of optimization methods and software for large-scale l1-regularized linear classification. J. Mach. Learn. Res. 11 (2010) 3183-3234
[14] Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 9 (2008) 1871-1874
[15] Guillaumin, M., Verbeek, J., Schmid, C.: Multimodal semi-supervised learning for image classification. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2010)
[16] Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. Volume 2. (2006) 2169-2178
[17] Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2010) 3485-3492