Proximal gradient method for huberized support vector machine

11/30/2015 ∙ by Yangyang Xu, et al. ∙ Siemens AG Rice University 0

The Support Vector Machine (SVM) has been used in a wide variety of classification problems. The original SVM uses the hinge loss function, which is non-differentiable and makes the problem difficult to solve in particular for regularized SVMs, such as with ℓ_1-regularization. This paper considers the Huberized SVM (HSVM), which uses a differentiable approximation of the hinge loss function. We first explore the use of the Proximal Gradient (PG) method to solving binary-class HSVM (B-HSVM) and then generalize it to multi-class HSVM (M-HSVM). Under strong convexity assumptions, we show that our algorithm converges linearly. In addition, we give a finite convergence result about the support of the solution, based on which we further accelerate the algorithm by a two-stage method. We present extensive numerical experiments on both synthetic and real datasets which demonstrate the superiority of our methods over some state-of-the-art methods for both binary- and multi-class SVMs.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The original linear support vector machine (SVM) aims to find a hyperplane that separates a collection of data points. Since it was proposed in

cortes1995support , it has been widely used for binary classifications, such as texture classification kim2002support , gene expression data analysis brown2000knowledge

, face recognition

heisele2001face , to name a few. Mathematically, given a set of samples in -dimensional space and each attached with a label , the linear SVM learns a hyperplane from the training samples. A new data point can be categorized into the “” or “” class by inspecting the sign of .

The binary-class SVM (B-SVM) has been generalized to multicategory classifications to tackle problems that have data points belonging to more than two classes. The initially proposed multi-class SVM (M-SVM) methods construct several binary classifiers, such as “one-against-all’

bottou1994comparison , “one-against-one” knerr1990single , and “directed acyclic graph SVM” platt2000large . These M-SVMs may suffer from data imbalance, namely, some classes have much fewer data points than others, which can result in inaccurate predictions. One alternative is to put all the data points together in one model, which results in the so-called “all-together” M-SVMs. The “all-together” M-SVMs train multi-classifiers by considering a single large optimization problem. An extensive comparison of different M-SVMs can be found in hsu2002comparison .

In this paper, we consider both B-SVM and M-SVM. More precisely, we consider the binary-class huberized SVM (B-HSVM) (see (2) below) for B-SVM and the “all-together” multi-class HSVM (M-HSVM) (see (3) below) for M-SVM. The advantage of HSVM over classic SVM with hinge loss is the continuous differentiability of its loss function, which enables the use of the “fastest” first-order method: Proximal Gradient (PG) method nesterov2007gradient ; BeckTeboulle2009 (see the overview in section 2.1). We demonstrate that the PG method is in general much faster than existing methods for (regularized) B-SVM and M-SVM while yielding comparable prediction accuracies. In addition, extensive numerical experiments are done to compare our method to state-of-the-art ones for both B-SVM and M-SVM on synthetic and also benchmark datasets. Statistical comparison is also performed to show the difference between the proposed method and other compared ones.

1.1 Related work

B-HSVM appears to be first considered111Strictly speaking, we add the term in (2). This modification is similar to that used in keerthi2006modified , and the extra term usually does not affect the prediction but makes PG method converge faster. by Wang et al in wang2008hybrid . They demonstrate that B-HSVM can perform better than the original unregularized SVM and also -regularized SVM (i.e., (1) with ) for microarray classification.

A closely related model to B-HSVM is the elastic net regularized SVM wang2006doubly


where is the hinge loss function. The -norm regularizer has the ability to perform continuous shrinkage and automatic variable selection at the same time tibshirani1996shrinkageLasso , and the

-norm regularizer can help to reduce the variance of the estimated coefficients and usually results in satisfactory prediction accuracy, especially in the case where there are many correlated variables. The elastic net inherits the benefits of both

and -norm regularizers and can perform better than either of them alone, as demonstrated in zou-hastie2005elastic-net .

Note that (1) uses non-differentiable hinge loss function while B-HSVM uses differentiable huberized hinge loss function. The differentiability makes B-HSVM relatively easier to solve. A path-following algorithm was proposed by Wang et. al wang2008hybrid for solving B-HSVM. Their algorithm is not efficient in particular for large-scale problems, since it needs to track the disappearance of variables along a regularization path. Recently, Yang and Zou yangefficient proposed a Generalized Coordinate Descent (GCD) method, which was, in most cases, about 30 times faster than the path-following algorithm. However, the GCD method needs to compute the gradient of the loss function of B-HSVM after each coordinate update, which makes the algorithm slow.

M-HSVM has been considered in li2010huberized , which generalizes the work wang2008hybrid on B-HSVM to M-HSVM and also makes a path-following algorithm. However, their algorithm could be even worse since it also needs to track the disappearance of variables along a regularization path and M-HSVM often involves more variables than those of B-HSVM. Hence, it is not suitable for large-scale problems either.

Similar to M-HSVM, several other models have been proposed to train multiple classifiers by solving one single large optimization problem to handle the multi-category classification, such as the -norm regularized M-SVM in wang20071 and the -norm regularized M-SVM in zhang2008variable . Again, these models use the non-differentiable hinge loss function and are relatively more difficult than M-HSVM to solve. There are also methods that use binary-classifiers to handle multicategory classification problems including “one-against-all” bottou1994comparison , “one-against-one” knerr1990single , and “directed acyclic graph SVM” platt2000large . The work galar2011overview makes a thorough review of methods using binary-classifiers for multi-category classification problems and gives extensive experiments on various applications. In a follow up and more recent paper the same authors present a dynamic classifier weighting method which deals with the limitations introduced by the non-competent classifiers in the one-versus-one classification strategy. In order to dynamically weigh the outputs of the individual classifiers, they use the nearest neighbor of every class from the instance that needs to be classified. Furthermore in the authors propose a new approach for building multi-class classifiers based on the notion of data-clustering in the feature space. For the derived clusters they construct one-class classifiers which can be combined to solve complex classification problems.

One key advantage of our algorithm is its scalability with the training sample size, and thus it is applicable for large-scale SVMs. While preparing this paper, we note that some other algorithms are also carefully designed for handling the large-scale SVMs, including the block coordinate Frank-Wolfe method lacoste2012block and the stochastic alternating direction method of multiplier ouyang2013stochastic . In addition, Graphics Processing Unit (GPU) computing has been utilized in li2013parallel

to run multiple training tasks in parallel to accelerate the cross validation procedure. Furthermore, different variants of SVMs have been proposed for specific applications such as the Value-at-Risk SVM for stability to outliers

tsyurmasto2014value , the structural twin SVM to contain prior domain knowledge qi2013structural , the hierarchical SVM for customer churn prediction chen2012hierarchical

and the ellipsoidal SVM for outlier detection in wireless sensor networks

zhang2013distributed . Finally in CzaTab14 a two-ellipsoid kernel decomposition is proposed for the efficient training of SVMs. In order to avoid the use of SOCP techniques, introduced by the ellipsoids, the authors transform the data using a matrix which is determined from the sum of the classes’ covariances. With that transformation it is possible to use classical SVM, and their method can be incorporated into existing SVM libraries.

1.2 Contributions

We develop an efficient PG method to solve the B-HSVM


and the “all-together” M-HSVM


In (2), is the label of , and

is the huberized hinge loss function which is continuously differentiable. In (3), is the -th label, , if and otherwise, denotes the vector with all one’s, and denotes the -th column of . The constraints and in (3) are imposed to eliminate redundancy in and also are necessary to make the loss function

Fisher-consistent lee2004multicategory .

We choose the PG methodology because it requires only first-order information and converges fast. As shown in nesterov2007gradient ; BeckTeboulle2009 , it is an optimal first-order method. Note that the objectives in (2) and (3) have non-smooth terms and are not differentiable. Hence, direct gradient or second-order methods such as the interior point method are not applicable.

We speed up the algorithm by using a two-stage method, which detects the support of the solution at the first stage and solves a lower-dimensional problem at the second stage. For large-scale problems with sparse features, the two-stage method can achieve more than 5-fold acceleration. We also analyze the convergence of PG method under fairly general settings and get similar results as those in nesterov2007gradient ; schmidt2011convergence .

In addition, we compare the proposed method to state-of-the-art ones for B-SVM and M-SVM on both synthetic and benchmark datasets. Extensive numerical experiments demonstrate that our method can outperform other compared ones in most cases. Statistical tests are also performed to show significant differences between the compared methods.

1.3 Structure of the paper

The rest of the paper is organized as follows. In section 2, we overview the PG method and then apply it to (2) and (3). In addition, assuming strong convexity, we show linear convergence of the PG method under fairly general settings. Numerical results are given in section 3. Finally, section 4 concludes the paper.

2 Algorithms and convergence analysis

In this section, we first give an overview of the PG method. Then we apply it to (2) and (3). We complete this section by showing that under strong convexity assumptions the PG method possesses linear convergence.

2.1 Overview of the PG method

Consider convex composite optimization problems in the form of


where is a convex set, is a differentiable convex function with Lipschitz continuous gradient (that is, there exists such that , for all ), and is a proper closed convex function. The PG method for solving (4) iteratively updates the solution by



and for some nonnegative . The extrapolation technique can significantly accelerate the algorithm.

When is the Lipschitz constant of , it can easily be shown that , and thus this method is a kind of majorization minimization, as illustrated in Figure 1. With appropriate choice of and , the sequence converges to the optimal value . Nesterov nesterov2007gradient , and Beck and Teboulle BeckTeboulle2009 showed, independently, that if and is taken as the Lipschitz constant of , then converges to with the rate , namely, In addition, using carefully designed , they were able to show that the convergence rate can be increased to , which is the optimal rate of first-order methods for general convex problems NesterovConvexBook2004 .

Figure 1: Simple illustration of PG method: is a majorization approximation of at , which is an extrapolated point of and . The new iterate is the minimizer of .

2.2 PG method for binary-class HSVM

We first write the B-HSVM problem (2) into the general form of (4). Let

Then (2) can be written as


For convenience, we use the following notation in the rest of this section

Proposition 1

The function defined above is differentiable and convex, and its gradient is Lipschitz continuous with constant


The proof of this proposition involves standard arguments and can be found in the appendix.

Now, we are ready to apply PG to (2). Define the proximal operator for a given extended real-valued convex function on by

Replacing and in (5) with and , respectively, we obtain the update


where is a component-wise shrinkage operator defined by .

In (9), we dynamically update by


where is a pre-selected constant, is defined in (8) and is the smallest integer such that the following condition is satisfied


where , and for some weight . The inequality in (11) is necessary to make the algorithm convergent (see Lemma 2 in the appendix). It guarantees sufficient decrease, and the setting will make it satisfied. In the case where becomes unnecessarily large, a smaller can speed up the convergence. To make the overall objective non-increasing, we re-update by setting in (9) whenever . This non-increasing monotonicity is very important, since we observe that the PG method may be numerically unstable without this modification. In addition, it significantly accelerates the PG method; see Table 1 below.

Algorithm 1 summarizes our discussion. We name it as B-PGH.

1:  Input: sample-label pairs ; parameters .
2:  Initialization: choose , ; compute from (8) and choose and ; set .
3:  while Not converged do
4:     Let for some ;
5:     Update according to (10) and according to (9);
6:     if  then
7:         Re-update according to (9) with ;
8:     end if
9:     Let .
10:  end while
Algorithm 1 Proximal gradient method for B-HSVM (B-PGH)

2.3 PG method for multi-class HSVM

In this section we generalize the PG method for solving multi-class classification problems. Denote Let

Then we can write (3) as


Similar to Proposition 1, we can show that is Lipschitz continuous with constant


The proof is essentially the same as that of Proposition 1, and we do not repeat it.

Now we are ready to apply the PG method to (3) or equivalently (12). We update iteratively by solving the -subproblem


and -subproblem


where is chosen in the same way as (11).

Problem (14) is relatively easy and has a closed form solution. Let and be the vector consisting of the first components of , where

denotes the identity matrix. Substituting

to (14) gives the solution

Hence, the update in (14) can be explicitly written as


where is defined in the above equation.

Problem (15) can be further decomposed into independent small problems. Each of them involves only one row of and can be written in the following form


where and is the -th row-vector of the matrix for the -th small problem. The coexistence of the non-smooth term and the equality constraint makes (17) a difficult optimization problem to solve. Next we describe a new efficient yet simple method for solving (17) using its dual problem, defined by


Since (17) is strongly convex, is concave and continuously differentiable. Hence, the solution of (18) can be found by solving the single-variable equation . It is easy to verify that and , so the solution lies between and , where and respectively denote the minimum and maximum components of . In addition, note that is piece-wise linear about , and the breakpoints are at , . Hence must be in for some , where is the sorted vector of in the increasing order. Therefore, to solve (18), we first obtain , then search the interval that contains by checking the sign of at the breakpoints, and finally solve within that interval. Algorithm 2 summarizes our method for solving (18).

1:  Input: with in -dimensional space and .
2:  Let and sort as ; set .
3:  while  do
4:     If , set ; else .
5:  end while
6:  Solve within and output the solution .
Algorithm 2 Exact method for solving (18)

After determining , we can obtain the solution of (17) by setting . Algorithm 3 summarizes the main steps of the PG method for efficiently solving (3). We name it as M-PGH.

1:  Input: sample-label pairs with each ; parameters .
2:  Initialization: choose , ; compute in (13) and choose and ; set .
3:  while Not converged do
4:     Let for some ;
5:     Choose in the same way as (10);
6:     Update by (16);
7:     for  do
8:         Set to the -th row of
9:         ;
10:         Solve (18) by Algorithm 2 with input and let be the output;
11:         Set the -th row of to be ;
12:     end for
13:     if  then
14:         Re-update and according to (16) and (15) with ;
15:     end if
16:     Let .
17:  end while
Algorithm 3 Proximal gradient method for M-HSVM (M-PGH)

2.4 Convergence results

Instead of analyzing the convergence of Algorithms 1 and 3, we do the analysis of the PG method for (4) with general and and regard Algorithms 1 and 3 as special cases. Throughout our analysis, we assume that is a differentiable convex function and is Lipschitz continuous with Lipschitz constant . We also assume that is strongly convex222Without strong convexity of , we only have sublinear convergence as shown in BeckTeboulle2009 . with constant , namely,

for any and , where denotes the domain of .

Similar results have been shown by Nesterov nesterov2007gradient and Schmidt et al schmidt2011convergence . However, our analysis fits to more general settings. Specifically, we allow dynamical update of the Lipschitz parameter and an acceptable interval of the parameters . On the other hand, nesterov2007gradient ; schmidt2011convergence either require to be fixed to the Lipschitz constant of or require specific values for the ’s. In addition, neither of nesterov2007gradient ; schmidt2011convergence do the re-update step as in line 7 of Algorithm 1 or line 13 of Algorithm 3.

We tested the PG method under different settings on synthetic datasets, generated in the same way as described in section 3.1.1. Our goal is to demonstrate the practical effect that each setting has on the overall performance of the PG method. Table 1 summarizes the numerical results, which show that PG method under our settings converges significantly faster than that under the settings of nesterov2007gradient ; schmidt2011convergence .

To analyze the PG method under fairly general settings, we use the so-called Kurdyka-Łojasiewicz (KL) inequality, which has been widely applied in non-convex optimization. Our results show that the KL inequality can also be applied and simplify the analysis in convex optimization. Extending the discussion of the KL inequality is beyond the scope of this paper and therefore we refer the interested readers to xu2013bcd ; lojasiewicz1993geometrie ; kurdyka1998gradients ; bolte2007lojasiewicz ; xu-yin-nonconvex and the references therein.

Problems Our settings Settings of nesterov2007gradient Settings of schmidt2011convergence
#iter time obj. #iter time obj. #iter time obj.
(3000, 300, 30) 34 0.06 1.0178e-1 135 0.22 1.0178e-1 475 0.75 1.0178e-1
(2000, 20000, 200) 91 4.34 8.0511e-2 461 21.77 8.0511e-2 2000 99.05 8.0511e-2
Table 1: Performance of the PG method with different settings: schmidt2011convergence sets for all ; neither of nesterov2007gradient ; schmidt2011convergence makes non-increasing. The data used in these tests is generated in the same way as described in section 3.1.1, and here we set the correlation parameter .

Our main result is summarized as follows.

Theorem 2.1

Let be the sequence generated by (5) with and for some such that (24) holds. In addition, we make nonincreasing by re-updating with in (5) whenever . Then -linearly converges to the unique minimizer of (4), namely, there exist positive constants and such that


The proof of this theorem is given in the Appendix due to its technical complexity. Using the results of Theorem 2.1, we establish the convergence results of Algorithms 1 and 3 in the following corollary.

Corollary 1

Let and be the sequences generated by Algorithms 1 and 3 with and . Then and -linearly converge to the unique solutions of (2) and (3), respectively.

Remark 1

If one of and vanishes, we only have sublinear convergence by some appropriate choice of . The results can be found in BeckTeboulle2009 .

2.5 Two-stage accelerated method for large-scale problems

Most cost of Algorithm 1 at iteration occurs in the computation of and the evaluation of . Let where are defined in (7). To compute , we need to have and , which costs in total. Evaluating needs which costs . To save the computing time, we store both and so that can be obtained by . This way, we only need to compute once during each iteration, and the total computational cost is where is the total number of iterations.

As is large and the solution of (2) is sparse, namely, only a few features are relevant, we can further reduce the cost of Algorithm 1 by switching from the original high-dimensional problem to a lower-dimensional one. More precisely, we first run Algorithm 1 with and until the support of remains almost unchanged. Then we reduce (2) to a lower-dimensional problem by confining to the detected support, namely, all elements out of the detected support are kept zero. Finally, Algorithm 1 is employed again to solve the lower-dimensional problem. The two-stage method for (2) is named as B-PGH-2, and its solidness rests on the following lemma, which can be shown in a similar way as the proof of Lemma 5.2 in HaleYinZhang2007 .

Lemma 1

Let be the sequence generated by Algorithm 1 with and starting from any . Suppose is the unique solution of (2) with . Let


where is the th component of . Then and , where denotes the support of . In addition, and for all but at most finite iterations.

Suppose the cardinality of the solution support is . Then the total computational cost of the two-stage method B-PGH-2 is , where are the numbers of iterations in the first and second stages, respectively. Numerically, we found that could be detected in several iterations, namely, is usually small. When , B-PGH-2 can be significantly faster than B-PGH, as demonstrated by our experiments in Section 3. In the same way, Algorithm 3 can be accelerated by a two-stage method. We omit the analysis since it can be derived by following the same steps.

3 Numerical Experiments

In the first part of this section, we compare B-PGH, described in Algorithm 1, with two very recent binary-class SVM solvers using ADMM ye2011efficient and GCD333Our algorithm and ADMM are both implemented in MATLAB, while the code of GCD is written in R. To be fair, we re-wrote the code of GCD and implemented it in MATLAB. yangefficient on both synthetic and real data. ADMM solves model (1) whereas both B-PGH and GCD solve model (2). In the second part, we compare the performance of five different multi-class SVMs which are: the model defined by (3), the -regularized M-SVM in wang20071 , the -regularized M-SVM in zhang2008variable , the “one-vs-all” (OVA) method bottou1994comparison , and the Decision Directed Acyclic Graph (DDAG) method platt2000large . We use M-PGH444The paper li2010huberized uses a path-following method to solve (3). However, its code is not publicly available., described in Algorithm 3, to solve (3) and Sedumi sturm1999using for the and -regularized M-SVMs. Sedumi is called through CVX grant2008cvx . All the tests were performed on a computer having an i7-620m CPU and 3-GB RAM and running 32-bit Windows 7 and Matlab 2010b.

3.1 Binary-class SVM

For B-PGH, the parameters and were set to and , respectively. We chose

where and . The starting point was chosen to be a zero vector. We stop B-PGH if both of the following conditions are satisfied during three consecutive iterations


where and . The stopping tolerance was set to for B-PGH and GCD and for ADMM since was too stringent. For B-PGH-2, we took at the first stage and at the second stage. The penalty parameters for ADMM were set to and as suggested in ye2011efficient . All the other parameters for ADMM and GCD were set to their default values.

3.1.1 Synthetic data

We generated samples in with one half in the “” class and the other half in the “” class. Specifically, each sample in the “

” class was generated according to the Gaussian distribution

and each sample in the “” class according to . The mean vector

and the covariance matrix

where is the matrix of size with all one’s, is the matrix with all zero’s, and is an identity matrix of size . The first variables are relevant for classification and the rest ones being noise. This simulated data was also used in wang2008hybrid ; ye2011efficient . We tested the speed of the algorithms on different sets of dimension and the correlation . The smoothing parameter for B-PGH and GCD was set to in this subsection, and is set in (2). Table 2 lists the average running time (sec) of 10 independent trials. For each run, we averaged the time over 25 different pairs of . From the results, we can see that B-PGH was consistently faster than GCD and over 50 times faster than ADMM. In addition, the two-stage accelerated algorithm B-PGH-2 (see section 2.5) was significantly faster than B-PGH when was large and .

For , the time for ADMM is over one half hour.

(500, 50, 5) 0.0060 0.0081 0.0059 0.0071 0.0116 0.0192 0.6354 0.8641
(2000, 100, 10) 0.0176 0.0256 0.0173 0.0218 0.0848 0.1469 2.4268 3.5340
(50, 300, 30) 0.0126 0.0162 0.0099 0.0113 0.0338 0.0409 0.6819 1.1729
(100, 500, 50) 0.0179 0.0242 0.0117 0.0152 0.0727 0.0808 1.4879 2.5482
(200, 2000, 100) 0.0720 0.1227 0.0301 0.0446 0.5653 0.4735 7.9985 12.998
(2000, 20000, 200) 5.7341 8.5379 1.1543 1.7531 32.721 30.558 —— ——
Table 2: Running time (in seconds) of B-PGH, GCD and ADMM. Each result is the average of 10 independent trials. For each run, the time is averaged over 25 different pairs of .

We also tested the prediction accuracy and variable selection of the algorithms. The problem dimension was set to . The optimal pair of was selected from a large grid by 10-fold cross validation. We use for the number of selected relevant variables and for the number of selected noise variables. In addition, we use “accu.” for the prediction accuracy. The average results of 500 independent runs corresponding to and are shown in Table 3. During each run, the algorithms were compared on a test set of 1000 samples. From the table, we can see that B-PGH achieved similar accuracy to that by GCD, and they both performed better than ADMM, especially in the case of . In addition, B-PGH-2 obtained similar solutions as B-PGH.

accu.(%) accu.(%)
B-PGH 20.0(0.1) 0.1(0.4) 100(0.000) 19.9(0.3) 7.3(3.9) 86.6(0.011)
B-PGH-2 20.0(0.1) 0.1(0.3) 100(0.000) 19.9(0.4) 7.5(4.1) 86.6(0.011)
GCD 20.0(0.1) 0.1(0.3) 100(0.000) 19.9(0.4) 7.4(3.8) 86.4(0.011)
ADMM 19.0(0.9) 2.9(1.8) 100(0.000) 17.2(1.5) 23.1(6.0) 85.7(0.013)
Table 3:

Classification accuracies and variable selections of B-PGH, GCD and ADMM. All results are the averages of 500 independent runs, each of which tests on a 1000 tesing set. The numbers in the parentheses are the corresponding standard errors.

3.1.2 Medium-scale real data

In this subsection, we compare B-PGH, GCD and ADMM on medium-scale real data (see Table 4). The first seven datasets are available from the LIBSVM dataset555 and the last three from Tom Mitchell’s neuroinformatics research group666 Both rcv1 and realsim have large feature dimensions but only about nonzeros. colon, duke and leuk are datasets of gene expression profiles for colon cancer, breast cancer and leukemia, respectively. The original dataset of colon consists of 62 samples, and we randomly chose 30 of them for training and the rest for tesing. gisette

is a hand-writing digit recognition problem from NIPS 2003 Feature Selection Challenge. The training set for

gisette is a random subset of the original 6000 samples, and the testing set contains all of the original 1000 samples. rcv1 is a collection of manually cetegorized news wires from Reuters. Both the training and tesing instances for sub-rcv1 are randomly sub-sampled from the original training and tesing samples. realsim contains UseNet articles from four discussion groups, for simulated auto racing, simulated aviation, real autos and real aviation. The original dataset of realsim has 72,309 samples, and we randomly sub-sampled 500 instances for training and 1,000 instances for tesing. fMRIa, fMRIb, and fMRIc are functional MRI (fMRI) data of brain activities when the subjects are presented with pictures and text paragraphs.

Dataset #training #testing #features Type
australian 200 490 14 Dense
colon 30 32 2,000 Dense
duke 32 10 7,129 Dense
gisette 500 1,000 5,000 Dense
leuk 38 34 7,129 Dense
sub-rcv1 500 1,000 47,236 Sparse
sub-realsim 500 1,000 20,958 Sparse
fMRIa 30 10 1715 Dense
fMRIb 30 10 1874 Dense
fMRIc 30 10 1888 Dense
Table 4: The size and data type of the tested real datasets

We fixed since the algorithms appeared not sensitive to and . The optimal ’s were tuned by 10-fold cross-validation on training sets. The smoothing parameter was set to for both B-PGH and GCD. All the other settings are the same as those in the previous test. The results are shown in Table 5. For comparison, we also report the prediction accuracies by LIBLINEAR fan2008liblinear with -regularized -loss. From the results, we can see that B-PGH is significantly faster than GCD and ADMM, and it also gives the best prediction accuracy. B-PGH-2 was fastest and achieved the same accuracy as B-PGH except for gisette. We want to mention that GCD can give the same accuracy as B-PGH but it needs to run much longer time, and ADMM can rarely achieve the same accuracy even if it runs longer since it solves a different model.

accu(%) supp time accu(%) supp time accu(%) supp time accu(%) supp time accu(%)
australian 87.4 11 0.01 87.4 11 0.01 86.7 10 0.02 86.9 14 1.08 85.7
colon 84.4 89 0.04 84.4 89 0.05 84.4 89 0.38 84.4 118 1.48 81.3
duke 90 118 0.20 90 118 0.10 90 112 0.93 90 171 3.11 80
gisette 92.9 977 1.99 92.7 946 1.94 92.6 959 17.61 92.8 1464 218.5 91.7
leuk 91.2 847 0.19 91.2 846 0.15 82.4 716 3.10 82.4 998 2.35 91.2
sub-rcv1 84.8 1035 0.06 84.8 1035 0.05 83.3 1035 3.46 82.3 1776 2.61 80.1
sub-realsim 92.9 1134 0.04 92.9 1134 0.02 91.9 1134 2.96 92.8 1727 1.61 90.9
fMRIa 90 141 0.07 90 141 0.06 70 130 5.31 70 203 0.57 70
fMRIb 100 1098 0.11 100 1108 0.07 90 180 2.26 100 1767 0.03 70
fMRIc 100 1827 0.10 100 1825 0.08 80 1324 2.05 90 1882 0.06 50
Table 5: Comparison results of B-PGH, GCD, ADMM and LIBLINEAR on real data. The best prediction accuracy for each dataset is highlighted in bold and the best running time (sec) in italics.

3.1.3 Statistical comparison

We also performed the statistical comparison of B-PGH, GCD, and ADMM. Following demvsar2006statistical , we did the Wilcoxon signed-ranks test777The Wilcoxon signed-ranks test is in general better than the paired -test as demonstrated in demvsar2006statistical . wilcoxon1945individual and Friedman test friedman1937use ; friedman1940comparison to see if the differences of the compared methods are significant. The former test is for pair-wise comparison and the latter one for multiple comparison. Specifically, for two different methods, denote as the difference of their score (e.g., prediction accuracy) on the -th dataset and rank ’s based on their absolute value. Let


where is the number of datasets, , and

Table 6 shows the pair-wise -values and the corresponding -values of the five compared methods. At -, we find that there is no significant differences between B-PGH and B-PGH-2 and neither between GCD and ADMM, and any other pair of methods make significant difference.

The Friedman statistic can be calculated by


where is the number of compared methods, denotes the average rank of the -th method, and is the rank of the -th method on the -th dataset. Table 7 shows the ranks of the compared methods according to their prediction accuracies, where average ranks are applied to ties. The -value indicates significant difference among the five methods at -.

We further proceeded with a post-hoc test through Holm’s step-down procedure holm1979simple . Assuming the -values for compared classifiers to the control one are ordered as , Holm’s step-down procedure starts from the largest one and compares it to , where is the target -value. If