The original linear support vector machine (SVM) aims to find a hyperplane that separates a collection of data points. Since it was proposed incortes1995support , it has been widely used for binary classifications, such as texture classification kim2002support , gene expression data analysis brown2000knowledge heisele2001face , to name a few. Mathematically, given a set of samples in -dimensional space and each attached with a label , the linear SVM learns a hyperplane from the training samples. A new data point can be categorized into the “” or “” class by inspecting the sign of .
The binary-class SVM (B-SVM) has been generalized to multicategory classifications to tackle problems that have data points belonging to more than two classes. The initially proposed multi-class SVM (M-SVM) methods construct several binary classifiers, such as “one-against-all’bottou1994comparison , “one-against-one” knerr1990single , and “directed acyclic graph SVM” platt2000large . These M-SVMs may suffer from data imbalance, namely, some classes have much fewer data points than others, which can result in inaccurate predictions. One alternative is to put all the data points together in one model, which results in the so-called “all-together” M-SVMs. The “all-together” M-SVMs train multi-classifiers by considering a single large optimization problem. An extensive comparison of different M-SVMs can be found in hsu2002comparison .
In this paper, we consider both B-SVM and M-SVM. More precisely, we consider the binary-class huberized SVM (B-HSVM) (see (2) below) for B-SVM and the “all-together” multi-class HSVM (M-HSVM) (see (3) below) for M-SVM. The advantage of HSVM over classic SVM with hinge loss is the continuous differentiability of its loss function, which enables the use of the “fastest” first-order method: Proximal Gradient (PG) method nesterov2007gradient ; BeckTeboulle2009 (see the overview in section 2.1). We demonstrate that the PG method is in general much faster than existing methods for (regularized) B-SVM and M-SVM while yielding comparable prediction accuracies. In addition, extensive numerical experiments are done to compare our method to state-of-the-art ones for both B-SVM and M-SVM on synthetic and also benchmark datasets. Statistical comparison is also performed to show the difference between the proposed method and other compared ones.
1.1 Related work
B-HSVM appears to be first considered111Strictly speaking, we add the term in (2). This modification is similar to that used in keerthi2006modified , and the extra term usually does not affect the prediction but makes PG method converge faster. by Wang et al in wang2008hybrid . They demonstrate that B-HSVM can perform better than the original unregularized SVM and also -regularized SVM (i.e., (1) with ) for microarray classification.
A closely related model to B-HSVM is the elastic net regularized SVM wang2006doubly
where is the hinge loss function. The -norm regularizer has the ability to perform continuous shrinkage and automatic variable selection at the same time tibshirani1996shrinkageLasso , and the
-norm regularizer can help to reduce the variance of the estimated coefficients and usually results in satisfactory prediction accuracy, especially in the case where there are many correlated variables. The elastic net inherits the benefits of bothand -norm regularizers and can perform better than either of them alone, as demonstrated in zou-hastie2005elastic-net .
Note that (1) uses non-differentiable hinge loss function while B-HSVM uses differentiable huberized hinge loss function. The differentiability makes B-HSVM relatively easier to solve. A path-following algorithm was proposed by Wang et. al wang2008hybrid for solving B-HSVM. Their algorithm is not efficient in particular for large-scale problems, since it needs to track the disappearance of variables along a regularization path. Recently, Yang and Zou yangefficient proposed a Generalized Coordinate Descent (GCD) method, which was, in most cases, about 30 times faster than the path-following algorithm. However, the GCD method needs to compute the gradient of the loss function of B-HSVM after each coordinate update, which makes the algorithm slow.
M-HSVM has been considered in li2010huberized , which generalizes the work wang2008hybrid on B-HSVM to M-HSVM and also makes a path-following algorithm. However, their algorithm could be even worse since it also needs to track the disappearance of variables along a regularization path and M-HSVM often involves more variables than those of B-HSVM. Hence, it is not suitable for large-scale problems either.
Similar to M-HSVM, several other models have been proposed to train multiple classifiers by solving one single large optimization problem to handle the multi-category classification, such as the -norm regularized M-SVM in wang20071 and the -norm regularized M-SVM in zhang2008variable . Again, these models use the non-differentiable hinge loss function and are relatively more difficult than M-HSVM to solve. There are also methods that use binary-classifiers to handle multicategory classification problems including “one-against-all” bottou1994comparison , “one-against-one” knerr1990single , and “directed acyclic graph SVM” platt2000large . The work galar2011overview makes a thorough review of methods using binary-classifiers for multi-category classification problems and gives extensive experiments on various applications. In a follow up and more recent paper Gal.et.al.15 the same authors present a dynamic classifier weighting method which deals with the limitations introduced by the non-competent classifiers in the one-versus-one classification strategy. In order to dynamically weigh the outputs of the individual classifiers, they use the nearest neighbor of every class from the instance that needs to be classified. Furthermore in Kra.et.al.14 the authors propose a new approach for building multi-class classifiers based on the notion of data-clustering in the feature space. For the derived clusters they construct one-class classifiers which can be combined to solve complex classification problems.
One key advantage of our algorithm is its scalability with the training sample size, and thus it is applicable for large-scale SVMs. While preparing this paper, we note that some other algorithms are also carefully designed for handling the large-scale SVMs, including the block coordinate Frank-Wolfe method lacoste2012block and the stochastic alternating direction method of multiplier ouyang2013stochastic . In addition, Graphics Processing Unit (GPU) computing has been utilized in li2013parallel
to run multiple training tasks in parallel to accelerate the cross validation procedure. Furthermore, different variants of SVMs have been proposed for specific applications such as the Value-at-Risk SVM for stability to outlierstsyurmasto2014value , the structural twin SVM to contain prior domain knowledge qi2013structural , the hierarchical SVM for customer churn prediction chen2012hierarchical
and the ellipsoidal SVM for outlier detection in wireless sensor networkszhang2013distributed . Finally in CzaTab14 a two-ellipsoid kernel decomposition is proposed for the efficient training of SVMs. In order to avoid the use of SOCP techniques, introduced by the ellipsoids, the authors transform the data using a matrix which is determined from the sum of the classes’ covariances. With that transformation it is possible to use classical SVM, and their method can be incorporated into existing SVM libraries.
We develop an efficient PG method to solve the B-HSVM
and the “all-together” M-HSVM
In (2), is the label of , and
is the huberized hinge loss function which is continuously differentiable. In (3), is the -th label, , if and otherwise, denotes the vector with all one’s, and denotes the -th column of . The constraints and in (3) are imposed to eliminate redundancy in and also are necessary to make the loss function
Fisher-consistent lee2004multicategory .
We choose the PG methodology because it requires only first-order information and converges fast. As shown in nesterov2007gradient ; BeckTeboulle2009 , it is an optimal first-order method. Note that the objectives in (2) and (3) have non-smooth terms and are not differentiable. Hence, direct gradient or second-order methods such as the interior point method are not applicable.
We speed up the algorithm by using a two-stage method, which detects the support of the solution at the first stage and solves a lower-dimensional problem at the second stage. For large-scale problems with sparse features, the two-stage method can achieve more than 5-fold acceleration. We also analyze the convergence of PG method under fairly general settings and get similar results as those in nesterov2007gradient ; schmidt2011convergence .
In addition, we compare the proposed method to state-of-the-art ones for B-SVM and M-SVM on both synthetic and benchmark datasets. Extensive numerical experiments demonstrate that our method can outperform other compared ones in most cases. Statistical tests are also performed to show significant differences between the compared methods.
1.3 Structure of the paper
The rest of the paper is organized as follows. In section 2, we overview the PG method and then apply it to (2) and (3). In addition, assuming strong convexity, we show linear convergence of the PG method under fairly general settings. Numerical results are given in section 3. Finally, section 4 concludes the paper.
2 Algorithms and convergence analysis
In this section, we first give an overview of the PG method. Then we apply it to (2) and (3). We complete this section by showing that under strong convexity assumptions the PG method possesses linear convergence.
2.1 Overview of the PG method
Consider convex composite optimization problems in the form of
where is a convex set, is a differentiable convex function with Lipschitz continuous gradient (that is, there exists such that , for all ), and is a proper closed convex function. The PG method for solving (4) iteratively updates the solution by
and for some nonnegative . The extrapolation technique can significantly accelerate the algorithm.
When is the Lipschitz constant of , it can easily be shown that , and thus this method is a kind of majorization minimization, as illustrated in Figure 1. With appropriate choice of and , the sequence converges to the optimal value . Nesterov nesterov2007gradient , and Beck and Teboulle BeckTeboulle2009 showed, independently, that if and is taken as the Lipschitz constant of , then converges to with the rate , namely, In addition, using carefully designed , they were able to show that the convergence rate can be increased to , which is the optimal rate of first-order methods for general convex problems NesterovConvexBook2004 .
2.2 PG method for binary-class HSVM
Then (2) can be written as
For convenience, we use the following notation in the rest of this section
The function defined above is differentiable and convex, and its gradient is Lipschitz continuous with constant
The proof of this proposition involves standard arguments and can be found in the appendix.
Now, we are ready to apply PG to (2). Define the proximal operator for a given extended real-valued convex function on by
Replacing and in (5) with and , respectively, we obtain the update
where is a component-wise shrinkage operator defined by .
In (9), we dynamically update by
where is a pre-selected constant, is defined in (8) and is the smallest integer such that the following condition is satisfied
where , and for some weight . The inequality in (11) is necessary to make the algorithm convergent (see Lemma 2 in the appendix). It guarantees sufficient decrease, and the setting will make it satisfied. In the case where becomes unnecessarily large, a smaller can speed up the convergence. To make the overall objective non-increasing, we re-update by setting in (9) whenever . This non-increasing monotonicity is very important, since we observe that the PG method may be numerically unstable without this modification. In addition, it significantly accelerates the PG method; see Table 1 below.
Algorithm 1 summarizes our discussion. We name it as B-PGH.
2.3 PG method for multi-class HSVM
In this section we generalize the PG method for solving multi-class classification problems. Denote Let
Then we can write (3) as
Similar to Proposition 1, we can show that is Lipschitz continuous with constant
The proof is essentially the same as that of Proposition 1, and we do not repeat it.
where is chosen in the same way as (11).
Problem (14) is relatively easy and has a closed form solution. Let and be the vector consisting of the first components of , where
denotes the identity matrix. Substitutingto (14) gives the solution
Hence, the update in (14) can be explicitly written as
where is defined in the above equation.
Problem (15) can be further decomposed into independent small problems. Each of them involves only one row of and can be written in the following form
where and is the -th row-vector of the matrix for the -th small problem. The coexistence of the non-smooth term and the equality constraint makes (17) a difficult optimization problem to solve. Next we describe a new efficient yet simple method for solving (17) using its dual problem, defined by
Since (17) is strongly convex, is concave and continuously differentiable. Hence, the solution of (18) can be found by solving the single-variable equation . It is easy to verify that and , so the solution lies between and , where and respectively denote the minimum and maximum components of . In addition, note that is piece-wise linear about , and the breakpoints are at , . Hence must be in for some , where is the sorted vector of in the increasing order. Therefore, to solve (18), we first obtain , then search the interval that contains by checking the sign of at the breakpoints, and finally solve within that interval. Algorithm 2 summarizes our method for solving (18).
2.4 Convergence results
Instead of analyzing the convergence of Algorithms 1 and 3, we do the analysis of the PG method for (4) with general and and regard Algorithms 1 and 3 as special cases. Throughout our analysis, we assume that is a differentiable convex function and is Lipschitz continuous with Lipschitz constant . We also assume that is strongly convex222Without strong convexity of , we only have sublinear convergence as shown in BeckTeboulle2009 . with constant , namely,
for any and , where denotes the domain of .
Similar results have been shown by Nesterov nesterov2007gradient and Schmidt et al schmidt2011convergence . However, our analysis fits to more general settings. Specifically, we allow dynamical update of the Lipschitz parameter and an acceptable interval of the parameters . On the other hand, nesterov2007gradient ; schmidt2011convergence either require to be fixed to the Lipschitz constant of or require specific values for the ’s. In addition, neither of nesterov2007gradient ; schmidt2011convergence do the re-update step as in line 7 of Algorithm 1 or line 13 of Algorithm 3.
We tested the PG method under different settings on synthetic datasets, generated in the same way as described in section 3.1.1. Our goal is to demonstrate the practical effect that each setting has on the overall performance of the PG method. Table 1 summarizes the numerical results, which show that PG method under our settings converges significantly faster than that under the settings of nesterov2007gradient ; schmidt2011convergence .
To analyze the PG method under fairly general settings, we use the so-called Kurdyka-Łojasiewicz (KL) inequality, which has been widely applied in non-convex optimization. Our results show that the KL inequality can also be applied and simplify the analysis in convex optimization. Extending the discussion of the KL inequality is beyond the scope of this paper and therefore we refer the interested readers to xu2013bcd ; lojasiewicz1993geometrie ; kurdyka1998gradients ; bolte2007lojasiewicz ; xu-yin-nonconvex and the references therein.
|Problems||Our settings||Settings of nesterov2007gradient||Settings of schmidt2011convergence|
|(3000, 300, 30)||34||0.06||1.0178e-1||135||0.22||1.0178e-1||475||0.75||1.0178e-1|
|(2000, 20000, 200)||91||4.34||8.0511e-2||461||21.77||8.0511e-2||2000||99.05||8.0511e-2|
Our main result is summarized as follows.
The proof of this theorem is given in the Appendix due to its technical complexity. Using the results of Theorem 2.1, we establish the convergence results of Algorithms 1 and 3 in the following corollary.
If one of and vanishes, we only have sublinear convergence by some appropriate choice of . The results can be found in BeckTeboulle2009 .
2.5 Two-stage accelerated method for large-scale problems
Most cost of Algorithm 1 at iteration occurs in the computation of and the evaluation of . Let where are defined in (7). To compute , we need to have and , which costs in total. Evaluating needs which costs . To save the computing time, we store both and so that can be obtained by . This way, we only need to compute once during each iteration, and the total computational cost is where is the total number of iterations.
As is large and the solution of (2) is sparse, namely, only a few features are relevant, we can further reduce the cost of Algorithm 1 by switching from the original high-dimensional problem to a lower-dimensional one. More precisely, we first run Algorithm 1 with and until the support of remains almost unchanged. Then we reduce (2) to a lower-dimensional problem by confining to the detected support, namely, all elements out of the detected support are kept zero. Finally, Algorithm 1 is employed again to solve the lower-dimensional problem. The two-stage method for (2) is named as B-PGH-2, and its solidness rests on the following lemma, which can be shown in a similar way as the proof of Lemma 5.2 in HaleYinZhang2007 .
Suppose the cardinality of the solution support is . Then the total computational cost of the two-stage method B-PGH-2 is , where are the numbers of iterations in the first and second stages, respectively. Numerically, we found that could be detected in several iterations, namely, is usually small. When , B-PGH-2 can be significantly faster than B-PGH, as demonstrated by our experiments in Section 3. In the same way, Algorithm 3 can be accelerated by a two-stage method. We omit the analysis since it can be derived by following the same steps.
3 Numerical Experiments
In the first part of this section, we compare B-PGH, described in Algorithm 1, with two very recent binary-class SVM solvers using ADMM ye2011efficient and GCD333Our algorithm and ADMM are both implemented in MATLAB, while the code of GCD is written in R. To be fair, we re-wrote the code of GCD and implemented it in MATLAB. yangefficient on both synthetic and real data. ADMM solves model (1) whereas both B-PGH and GCD solve model (2). In the second part, we compare the performance of five different multi-class SVMs which are: the model defined by (3), the -regularized M-SVM in wang20071 , the -regularized M-SVM in zhang2008variable , the “one-vs-all” (OVA) method bottou1994comparison , and the Decision Directed Acyclic Graph (DDAG) method platt2000large . We use M-PGH444The paper li2010huberized uses a path-following method to solve (3). However, its code is not publicly available., described in Algorithm 3, to solve (3) and Sedumi sturm1999using for the and -regularized M-SVMs. Sedumi is called through CVX grant2008cvx . All the tests were performed on a computer having an i7-620m CPU and 3-GB RAM and running 32-bit Windows 7 and Matlab 2010b.
3.1 Binary-class SVM
For B-PGH, the parameters and were set to and , respectively. We chose
where and . The starting point was chosen to be a zero vector. We stop B-PGH if both of the following conditions are satisfied during three consecutive iterations
where and . The stopping tolerance was set to for B-PGH and GCD and for ADMM since was too stringent. For B-PGH-2, we took at the first stage and at the second stage. The penalty parameters for ADMM were set to and as suggested in ye2011efficient . All the other parameters for ADMM and GCD were set to their default values.
3.1.1 Synthetic data
We generated samples in with one half in the “” class and the other half in the “” class. Specifically, each sample in the “
” class was generated according to the Gaussian distributionand each sample in the “” class according to . The mean vector
and the covariance matrix
where is the matrix of size with all one’s, is the matrix with all zero’s, and is an identity matrix of size . The first variables are relevant for classification and the rest ones being noise. This simulated data was also used in wang2008hybrid ; ye2011efficient . We tested the speed of the algorithms on different sets of dimension and the correlation . The smoothing parameter for B-PGH and GCD was set to in this subsection, and is set in (2). Table 2 lists the average running time (sec) of 10 independent trials. For each run, we averaged the time over 25 different pairs of . From the results, we can see that B-PGH was consistently faster than GCD and over 50 times faster than ADMM. In addition, the two-stage accelerated algorithm B-PGH-2 (see section 2.5) was significantly faster than B-PGH when was large and .
For , the time for ADMM is over one half hour.
|(500, 50, 5)||0.0060||0.0081||0.0059||0.0071||0.0116||0.0192||0.6354||0.8641|
|(2000, 100, 10)||0.0176||0.0256||0.0173||0.0218||0.0848||0.1469||2.4268||3.5340|
|(50, 300, 30)||0.0126||0.0162||0.0099||0.0113||0.0338||0.0409||0.6819||1.1729|
|(100, 500, 50)||0.0179||0.0242||0.0117||0.0152||0.0727||0.0808||1.4879||2.5482|
|(200, 2000, 100)||0.0720||0.1227||0.0301||0.0446||0.5653||0.4735||7.9985||12.998|
|(2000, 20000, 200)||5.7341||8.5379||1.1543||1.7531||32.721||30.558||——||——|
We also tested the prediction accuracy and variable selection of the algorithms. The problem dimension was set to . The optimal pair of was selected from a large grid by 10-fold cross validation. We use for the number of selected relevant variables and for the number of selected noise variables. In addition, we use “accu.” for the prediction accuracy. The average results of 500 independent runs corresponding to and are shown in Table 3. During each run, the algorithms were compared on a test set of 1000 samples. From the table, we can see that B-PGH achieved similar accuracy to that by GCD, and they both performed better than ADMM, especially in the case of . In addition, B-PGH-2 obtained similar solutions as B-PGH.
Classification accuracies and variable selections of B-PGH, GCD and ADMM. All results are the averages of 500 independent runs, each of which tests on a 1000 tesing set. The numbers in the parentheses are the corresponding standard errors.
3.1.2 Medium-scale real data
In this subsection, we compare B-PGH, GCD and ADMM on medium-scale real data (see Table 4). The first seven datasets are available from the LIBSVM dataset555http://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets and the last three from Tom Mitchell’s neuroinformatics research group666http://www.cs.cmu.edu/tom/fmri.html. Both rcv1 and realsim have large feature dimensions but only about nonzeros. colon, duke and leuk are datasets of gene expression profiles for colon cancer, breast cancer and leukemia, respectively. The original dataset of colon consists of 62 samples, and we randomly chose 30 of them for training and the rest for tesing. gisette
is a hand-writing digit recognition problem from NIPS 2003 Feature Selection Challenge. The training set forgisette is a random subset of the original 6000 samples, and the testing set contains all of the original 1000 samples. rcv1 is a collection of manually cetegorized news wires from Reuters. Both the training and tesing instances for sub-rcv1 are randomly sub-sampled from the original training and tesing samples. realsim contains UseNet articles from four discussion groups, for simulated auto racing, simulated aviation, real autos and real aviation. The original dataset of realsim has 72,309 samples, and we randomly sub-sampled 500 instances for training and 1,000 instances for tesing. fMRIa, fMRIb, and fMRIc are functional MRI (fMRI) data of brain activities when the subjects are presented with pictures and text paragraphs.
We fixed since the algorithms appeared not sensitive to and . The optimal ’s were tuned by 10-fold cross-validation on training sets. The smoothing parameter was set to for both B-PGH and GCD. All the other settings are the same as those in the previous test. The results are shown in Table 5. For comparison, we also report the prediction accuracies by LIBLINEAR fan2008liblinear with -regularized -loss. From the results, we can see that B-PGH is significantly faster than GCD and ADMM, and it also gives the best prediction accuracy. B-PGH-2 was fastest and achieved the same accuracy as B-PGH except for gisette. We want to mention that GCD can give the same accuracy as B-PGH but it needs to run much longer time, and ADMM can rarely achieve the same accuracy even if it runs longer since it solves a different model.
3.1.3 Statistical comparison
We also performed the statistical comparison of B-PGH, GCD, and ADMM. Following demvsar2006statistical , we did the Wilcoxon signed-ranks test777The Wilcoxon signed-ranks test is in general better than the paired -test as demonstrated in demvsar2006statistical . wilcoxon1945individual and Friedman test friedman1937use ; friedman1940comparison to see if the differences of the compared methods are significant. The former test is for pair-wise comparison and the latter one for multiple comparison. Specifically, for two different methods, denote as the difference of their score (e.g., prediction accuracy) on the -th dataset and rank ’s based on their absolute value. Let
where is the number of datasets, , and
Table 6 shows the pair-wise -values and the corresponding -values of the five compared methods. At -, we find that there is no significant differences between B-PGH and B-PGH-2 and neither between GCD and ADMM, and any other pair of methods make significant difference.
The Friedman statistic can be calculated by
where is the number of compared methods, denotes the average rank of the -th method, and is the rank of the -th method on the -th dataset. Table 7 shows the ranks of the compared methods according to their prediction accuracies, where average ranks are applied to ties. The -value indicates significant difference among the five methods at -.
We further proceeded with a post-hoc test through Holm’s step-down procedure holm1979simple . Assuming the -values for compared classifiers to the control one are ordered as , Holm’s step-down procedure starts from the largest one and compares it to , where is the target -value. If