Proximal gradient method for huberized support vector machine

The Support Vector Machine (SVM) has been used in a wide variety of classification problems. The original SVM uses the hinge loss function, which is non-differentiable and makes the problem difficult to solve in particular for regularized SVMs, such as with ℓ_1-regularization. This paper considers the Huberized SVM (HSVM), which uses a differentiable approximation of the hinge loss function. We first explore the use of the Proximal Gradient (PG) method to solving binary-class HSVM (B-HSVM) and then generalize it to multi-class HSVM (M-HSVM). Under strong convexity assumptions, we show that our algorithm converges linearly. In addition, we give a finite convergence result about the support of the solution, based on which we further accelerate the algorithm by a two-stage method. We present extensive numerical experiments on both synthetic and real datasets which demonstrate the superiority of our methods over some state-of-the-art methods for both binary- and multi-class SVMs.

Authors

• 27 publications
• 3 publications
• 16 publications
06/02/2021

Improvement over Pinball Loss Support Vector Machine

Recently, there have been several papers that discuss the extension of t...
11/30/2015

Alternating direction method of multipliers for regularized multiclass support vector machines

The support vector machine (SVM) was originally designed for binary clas...
06/12/2020

Safety-guaranteed Reinforcement Learning based on Multi-class Support Vector Machine

Several works have addressed the problem of incorporating constraints in...
01/31/2014

A Unifying Framework in Vector-valued Reproducing Kernel Hilbert Spaces for Manifold Regularization and Co-Regularized Multi-view Learning

This paper presents a general vector-valued reproducing kernel Hilbert s...
05/30/2020

Solution Path Algorithm for Twin Multi-class Support Vector Machine

The twin support vector machine and its extensions have made great achie...
02/14/2012

Smoothing Multivariate Performance Measures

A Support Vector Method for multivariate performance measures was recent...
08/22/2017

A Deterministic Nonsmooth Frank Wolfe Algorithm with Coreset Guarantees

We present a new Frank-Wolfe (FW) type algorithm that is applicable to m...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The original linear support vector machine (SVM) aims to find a hyperplane that separates a collection of data points. Since it was proposed in

cortes1995support , it has been widely used for binary classifications, such as texture classification kim2002support , gene expression data analysis brown2000knowledge heisele2001face , to name a few. Mathematically, given a set of samples in -dimensional space and each attached with a label , the linear SVM learns a hyperplane from the training samples. A new data point can be categorized into the “” or “” class by inspecting the sign of .

The binary-class SVM (B-SVM) has been generalized to multicategory classifications to tackle problems that have data points belonging to more than two classes. The initially proposed multi-class SVM (M-SVM) methods construct several binary classifiers, such as “one-against-all’

bottou1994comparison , “one-against-one” knerr1990single , and “directed acyclic graph SVM” platt2000large . These M-SVMs may suffer from data imbalance, namely, some classes have much fewer data points than others, which can result in inaccurate predictions. One alternative is to put all the data points together in one model, which results in the so-called “all-together” M-SVMs. The “all-together” M-SVMs train multi-classifiers by considering a single large optimization problem. An extensive comparison of different M-SVMs can be found in hsu2002comparison .

In this paper, we consider both B-SVM and M-SVM. More precisely, we consider the binary-class huberized SVM (B-HSVM) (see (2) below) for B-SVM and the “all-together” multi-class HSVM (M-HSVM) (see (3) below) for M-SVM. The advantage of HSVM over classic SVM with hinge loss is the continuous differentiability of its loss function, which enables the use of the “fastest” first-order method: Proximal Gradient (PG) method nesterov2007gradient ; BeckTeboulle2009 (see the overview in section 2.1). We demonstrate that the PG method is in general much faster than existing methods for (regularized) B-SVM and M-SVM while yielding comparable prediction accuracies. In addition, extensive numerical experiments are done to compare our method to state-of-the-art ones for both B-SVM and M-SVM on synthetic and also benchmark datasets. Statistical comparison is also performed to show the difference between the proposed method and other compared ones.

1.1 Related work

B-HSVM appears to be first considered111Strictly speaking, we add the term in (2). This modification is similar to that used in keerthi2006modified , and the extra term usually does not affect the prediction but makes PG method converge faster. by Wang et al in wang2008hybrid . They demonstrate that B-HSVM can perform better than the original unregularized SVM and also -regularized SVM (i.e., (1) with ) for microarray classification.

A closely related model to B-HSVM is the elastic net regularized SVM wang2006doubly

 minb,w1nn∑i=1[1−yi(b+x⊤iw)]++λ1∥w∥1+λ22∥w∥22, (1)

where is the hinge loss function. The -norm regularizer has the ability to perform continuous shrinkage and automatic variable selection at the same time tibshirani1996shrinkageLasso , and the

-norm regularizer can help to reduce the variance of the estimated coefficients and usually results in satisfactory prediction accuracy, especially in the case where there are many correlated variables. The elastic net inherits the benefits of both

and -norm regularizers and can perform better than either of them alone, as demonstrated in zou-hastie2005elastic-net .

Note that (1) uses non-differentiable hinge loss function while B-HSVM uses differentiable huberized hinge loss function. The differentiability makes B-HSVM relatively easier to solve. A path-following algorithm was proposed by Wang et. al wang2008hybrid for solving B-HSVM. Their algorithm is not efficient in particular for large-scale problems, since it needs to track the disappearance of variables along a regularization path. Recently, Yang and Zou yangefficient proposed a Generalized Coordinate Descent (GCD) method, which was, in most cases, about 30 times faster than the path-following algorithm. However, the GCD method needs to compute the gradient of the loss function of B-HSVM after each coordinate update, which makes the algorithm slow.

M-HSVM has been considered in li2010huberized , which generalizes the work wang2008hybrid on B-HSVM to M-HSVM and also makes a path-following algorithm. However, their algorithm could be even worse since it also needs to track the disappearance of variables along a regularization path and M-HSVM often involves more variables than those of B-HSVM. Hence, it is not suitable for large-scale problems either.

Similar to M-HSVM, several other models have been proposed to train multiple classifiers by solving one single large optimization problem to handle the multi-category classification, such as the -norm regularized M-SVM in wang20071 and the -norm regularized M-SVM in zhang2008variable . Again, these models use the non-differentiable hinge loss function and are relatively more difficult than M-HSVM to solve. There are also methods that use binary-classifiers to handle multicategory classification problems including “one-against-all” bottou1994comparison , “one-against-one” knerr1990single , and “directed acyclic graph SVM” platt2000large . The work galar2011overview makes a thorough review of methods using binary-classifiers for multi-category classification problems and gives extensive experiments on various applications. In a follow up and more recent paper Gal.et.al.15 the same authors present a dynamic classifier weighting method which deals with the limitations introduced by the non-competent classifiers in the one-versus-one classification strategy. In order to dynamically weigh the outputs of the individual classifiers, they use the nearest neighbor of every class from the instance that needs to be classified. Furthermore in Kra.et.al.14 the authors propose a new approach for building multi-class classifiers based on the notion of data-clustering in the feature space. For the derived clusters they construct one-class classifiers which can be combined to solve complex classification problems.

One key advantage of our algorithm is its scalability with the training sample size, and thus it is applicable for large-scale SVMs. While preparing this paper, we note that some other algorithms are also carefully designed for handling the large-scale SVMs, including the block coordinate Frank-Wolfe method lacoste2012block and the stochastic alternating direction method of multiplier ouyang2013stochastic . In addition, Graphics Processing Unit (GPU) computing has been utilized in li2013parallel

to run multiple training tasks in parallel to accelerate the cross validation procedure. Furthermore, different variants of SVMs have been proposed for specific applications such as the Value-at-Risk SVM for stability to outliers

tsyurmasto2014value , the structural twin SVM to contain prior domain knowledge qi2013structural , the hierarchical SVM for customer churn prediction chen2012hierarchical

and the ellipsoidal SVM for outlier detection in wireless sensor networks

zhang2013distributed . Finally in CzaTab14 a two-ellipsoid kernel decomposition is proposed for the efficient training of SVMs. In order to avoid the use of SOCP techniques, introduced by the ellipsoids, the authors transform the data using a matrix which is determined from the sum of the classes’ covariances. With that transformation it is possible to use classical SVM, and their method can be incorporated into existing SVM libraries.

1.2 Contributions

We develop an efficient PG method to solve the B-HSVM

 minb,w1nn∑i=1ϕH(yi(b+x⊤iw))+λ1∥w∥1+λ22∥w∥22+λ32b2, (2)

and the “all-together” M-HSVM

 minb,W1nn∑i=1J∑j=1aijϕH(bj+x⊤iwj)+λ1∥W∥1+λ22∥W∥2F+λ32∥b∥2, s.t.We=0,e⊤b=0. (3)

In (2), is the label of , and

 ϕH(t)=⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩0, for t>1,(1−t)22δ, for 1−δ

is the huberized hinge loss function which is continuously differentiable. In (3), is the -th label, , if and otherwise, denotes the vector with all one’s, and denotes the -th column of . The constraints and in (3) are imposed to eliminate redundancy in and also are necessary to make the loss function

 ℓM=1nn∑i=1J∑j=1aijϕH(bj+x⊤iwj)

Fisher-consistent lee2004multicategory .

We choose the PG methodology because it requires only first-order information and converges fast. As shown in nesterov2007gradient ; BeckTeboulle2009 , it is an optimal first-order method. Note that the objectives in (2) and (3) have non-smooth terms and are not differentiable. Hence, direct gradient or second-order methods such as the interior point method are not applicable.

We speed up the algorithm by using a two-stage method, which detects the support of the solution at the first stage and solves a lower-dimensional problem at the second stage. For large-scale problems with sparse features, the two-stage method can achieve more than 5-fold acceleration. We also analyze the convergence of PG method under fairly general settings and get similar results as those in nesterov2007gradient ; schmidt2011convergence .

In addition, we compare the proposed method to state-of-the-art ones for B-SVM and M-SVM on both synthetic and benchmark datasets. Extensive numerical experiments demonstrate that our method can outperform other compared ones in most cases. Statistical tests are also performed to show significant differences between the compared methods.

1.3 Structure of the paper

The rest of the paper is organized as follows. In section 2, we overview the PG method and then apply it to (2) and (3). In addition, assuming strong convexity, we show linear convergence of the PG method under fairly general settings. Numerical results are given in section 3. Finally, section 4 concludes the paper.

2 Algorithms and convergence analysis

In this section, we first give an overview of the PG method. Then we apply it to (2) and (3). We complete this section by showing that under strong convexity assumptions the PG method possesses linear convergence.

2.1 Overview of the PG method

Consider convex composite optimization problems in the form of

 minu∈Uξ(u)≡ξ1(u)+ξ2(u), (4)

where is a convex set, is a differentiable convex function with Lipschitz continuous gradient (that is, there exists such that , for all ), and is a proper closed convex function. The PG method for solving (4) iteratively updates the solution by

 uk=argminu∈U Q(u,^uk−1) (5)

where

 Q(u,^uk−1)=ξ1(^uk−1)+⟨∇ξ1(^uk−1),u−^uk−1⟩+Lk2∥u−^uk−1∥2+ξ2(u),

and for some nonnegative . The extrapolation technique can significantly accelerate the algorithm.

When is the Lipschitz constant of , it can easily be shown that , and thus this method is a kind of majorization minimization, as illustrated in Figure 1. With appropriate choice of and , the sequence converges to the optimal value . Nesterov nesterov2007gradient , and Beck and Teboulle BeckTeboulle2009 showed, independently, that if and is taken as the Lipschitz constant of , then converges to with the rate , namely, In addition, using carefully designed , they were able to show that the convergence rate can be increased to , which is the optimal rate of first-order methods for general convex problems NesterovConvexBook2004 .

2.2 PG method for binary-class HSVM

We first write the B-HSVM problem (2) into the general form of (4). Let

 ⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩fi(b,w)=ϕH(yi(b+x⊤iw)), for i=1,⋯,n,f(b,w)=1n∑ni=1fi(b,w),g(b,w)=λ1∥w∥1+λ22∥w∥2+λ32b2.

Then (2) can be written as

 minb,wF(b,w)≡f(b,w)+g(b,w). (6)

For convenience, we use the following notation in the rest of this section

 u=(b;w),vi=(yi;yixi). (7)
Proposition 1

The function defined above is differentiable and convex, and its gradient is Lipschitz continuous with constant

 Lf=1nδn∑i=1y2i(1+∥xi∥2). (8)

The proof of this proposition involves standard arguments and can be found in the appendix.

Now, we are ready to apply PG to (2). Define the proximal operator for a given extended real-valued convex function on by

 proxh(v)=argminu12∥u−v∥2+h(u).

Replacing and in (5) with and , respectively, we obtain the update

 ⎧⎪ ⎪⎨⎪ ⎪⎩bk=Lk^bk−1−∇bf(^uk−1)Lk+λ3,wk=1Lk+λ2Sλ1(Lk^wk−1−∇wf(^uk−1)), (9)

where is a component-wise shrinkage operator defined by .

In (9), we dynamically update by

 Lk=min(ηnkLk−1,Lf), (10)

where is a pre-selected constant, is defined in (8) and is the smallest integer such that the following condition is satisfied

 f(proxhk(vk−1))≤f(^uk−1)+⟨∇f(^uk−1),proxhk(vk−1)−^uk−1⟩+Lk2∥∥proxhk(vk−1)−^uk−1∥∥2, (11)

where , and for some weight . The inequality in (11) is necessary to make the algorithm convergent (see Lemma 2 in the appendix). It guarantees sufficient decrease, and the setting will make it satisfied. In the case where becomes unnecessarily large, a smaller can speed up the convergence. To make the overall objective non-increasing, we re-update by setting in (9) whenever . This non-increasing monotonicity is very important, since we observe that the PG method may be numerically unstable without this modification. In addition, it significantly accelerates the PG method; see Table 1 below.

Algorithm 1 summarizes our discussion. We name it as B-PGH.

2.3 PG method for multi-class HSVM

In this section we generalize the PG method for solving multi-class classification problems. Denote Let

 G(b,W)= λ1∥W∥1+λ22∥W∥2F+λ32∥b∥2, U= {(b,W):We=0,e⊤b=0}.

Then we can write (3) as

 minU∈UH(U)≡ℓM(U)+G(U) (12)

Similar to Proposition 1, we can show that is Lipschitz continuous with constant

 Lm=Jnδn∑i=1(1+∥xi∥2). (13)

The proof is essentially the same as that of Proposition 1, and we do not repeat it.

Now we are ready to apply the PG method to (3) or equivalently (12). We update iteratively by solving the -subproblem

 bk=argmine⊤b=0⟨∇bℓM(^Uk−1),b−^bk−1⟩+Lk2∥b−^bk−1∥2+λ32∥b∥2 (14)

and -subproblem

 Wk=argminWe=0⟨∇WℓM(^Uk−1),W−^Wk−1⟩+Lk2∥W−^Wk−1∥2+λ1∥W∥1+λ22∥W∥2, (15)

where is chosen in the same way as (11).

Problem (14) is relatively easy and has a closed form solution. Let and be the vector consisting of the first components of , where

denotes the identity matrix. Substituting

to (14) gives the solution

 ¯bk=1λ3+Lk(P⊤P)−1P⊤(Lk^bk−1−∇bℓM(^Uk−1)).

Hence, the update in (14) can be explicitly written as

 bk=P¯bk, (16)

where is defined in the above equation.

Problem (15) can be further decomposed into independent small problems. Each of them involves only one row of and can be written in the following form

 minw12∥w−z∥2+λ∥w∥1,% s.t. e⊤w=0, (17)

where and is the -th row-vector of the matrix for the -th small problem. The coexistence of the non-smooth term and the equality constraint makes (17) a difficult optimization problem to solve. Next we describe a new efficient yet simple method for solving (17) using its dual problem, defined by

 maxσγ(σ)≡12∥Sλ(z−σ)−z∥2+λ∥Sλ(z−σ)∥1+σe⊤Sλ(z−σ). (18)

Since (17) is strongly convex, is concave and continuously differentiable. Hence, the solution of (18) can be found by solving the single-variable equation . It is easy to verify that and , so the solution lies between and , where and respectively denote the minimum and maximum components of . In addition, note that is piece-wise linear about , and the breakpoints are at , . Hence must be in for some , where is the sorted vector of in the increasing order. Therefore, to solve (18), we first obtain , then search the interval that contains by checking the sign of at the breakpoints, and finally solve within that interval. Algorithm 2 summarizes our method for solving (18).

After determining , we can obtain the solution of (17) by setting . Algorithm 3 summarizes the main steps of the PG method for efficiently solving (3). We name it as M-PGH.

2.4 Convergence results

Instead of analyzing the convergence of Algorithms 1 and 3, we do the analysis of the PG method for (4) with general and and regard Algorithms 1 and 3 as special cases. Throughout our analysis, we assume that is a differentiable convex function and is Lipschitz continuous with Lipschitz constant . We also assume that is strongly convex222Without strong convexity of , we only have sublinear convergence as shown in BeckTeboulle2009 . with constant , namely,

 ξ2(u)−ξ2(v)≥⟨gv,u−v⟩+μ2∥u−v∥2,

for any and , where denotes the domain of .

Similar results have been shown by Nesterov nesterov2007gradient and Schmidt et al schmidt2011convergence . However, our analysis fits to more general settings. Specifically, we allow dynamical update of the Lipschitz parameter and an acceptable interval of the parameters . On the other hand, nesterov2007gradient ; schmidt2011convergence either require to be fixed to the Lipschitz constant of or require specific values for the ’s. In addition, neither of nesterov2007gradient ; schmidt2011convergence do the re-update step as in line 7 of Algorithm 1 or line 13 of Algorithm 3.

We tested the PG method under different settings on synthetic datasets, generated in the same way as described in section 3.1.1. Our goal is to demonstrate the practical effect that each setting has on the overall performance of the PG method. Table 1 summarizes the numerical results, which show that PG method under our settings converges significantly faster than that under the settings of nesterov2007gradient ; schmidt2011convergence .

To analyze the PG method under fairly general settings, we use the so-called Kurdyka-Łojasiewicz (KL) inequality, which has been widely applied in non-convex optimization. Our results show that the KL inequality can also be applied and simplify the analysis in convex optimization. Extending the discussion of the KL inequality is beyond the scope of this paper and therefore we refer the interested readers to xu2013bcd ; lojasiewicz1993geometrie ; kurdyka1998gradients ; bolte2007lojasiewicz ; xu-yin-nonconvex and the references therein.

Our main result is summarized as follows.

Theorem 2.1

Let be the sequence generated by (5) with and for some such that (24) holds. In addition, we make nonincreasing by re-updating with in (5) whenever . Then -linearly converges to the unique minimizer of (4), namely, there exist positive constants and such that

 ∥uk−u∗∥≤Cτk, ∀k≥0. (19)

The proof of this theorem is given in the Appendix due to its technical complexity. Using the results of Theorem 2.1, we establish the convergence results of Algorithms 1 and 3 in the following corollary.

Corollary 1

Let and be the sequences generated by Algorithms 1 and 3 with and . Then and -linearly converge to the unique solutions of (2) and (3), respectively.

Remark 1

If one of and vanishes, we only have sublinear convergence by some appropriate choice of . The results can be found in BeckTeboulle2009 .

2.5 Two-stage accelerated method for large-scale problems

Most cost of Algorithm 1 at iteration occurs in the computation of and the evaluation of . Let where are defined in (7). To compute , we need to have and , which costs in total. Evaluating needs which costs . To save the computing time, we store both and so that can be obtained by . This way, we only need to compute once during each iteration, and the total computational cost is where is the total number of iterations.

As is large and the solution of (2) is sparse, namely, only a few features are relevant, we can further reduce the cost of Algorithm 1 by switching from the original high-dimensional problem to a lower-dimensional one. More precisely, we first run Algorithm 1 with and until the support of remains almost unchanged. Then we reduce (2) to a lower-dimensional problem by confining to the detected support, namely, all elements out of the detected support are kept zero. Finally, Algorithm 1 is employed again to solve the lower-dimensional problem. The two-stage method for (2) is named as B-PGH-2, and its solidness rests on the following lemma, which can be shown in a similar way as the proof of Lemma 5.2 in HaleYinZhang2007 .

Lemma 1

Let be the sequence generated by Algorithm 1 with and starting from any . Suppose is the unique solution of (2) with . Let

 Q(b,w)=f(b,w)+λ22∥w∥2+λ32b2,hi(w)=wi−1Lf+λ2∇wiQ(u),

and

 I={i:|∇wiQ(u∗)|<λ1}, E={i:|∇wiQ(u∗)|=λ1},

where is the th component of . Then and , where denotes the support of . In addition, and for all but at most finite iterations.

Suppose the cardinality of the solution support is . Then the total computational cost of the two-stage method B-PGH-2 is , where are the numbers of iterations in the first and second stages, respectively. Numerically, we found that could be detected in several iterations, namely, is usually small. When , B-PGH-2 can be significantly faster than B-PGH, as demonstrated by our experiments in Section 3. In the same way, Algorithm 3 can be accelerated by a two-stage method. We omit the analysis since it can be derived by following the same steps.

3 Numerical Experiments

In the first part of this section, we compare B-PGH, described in Algorithm 1, with two very recent binary-class SVM solvers using ADMM ye2011efficient and GCD333Our algorithm and ADMM are both implemented in MATLAB, while the code of GCD is written in R. To be fair, we re-wrote the code of GCD and implemented it in MATLAB. yangefficient on both synthetic and real data. ADMM solves model (1) whereas both B-PGH and GCD solve model (2). In the second part, we compare the performance of five different multi-class SVMs which are: the model defined by (3), the -regularized M-SVM in wang20071 , the -regularized M-SVM in zhang2008variable , the “one-vs-all” (OVA) method bottou1994comparison , and the Decision Directed Acyclic Graph (DDAG) method platt2000large . We use M-PGH444The paper li2010huberized uses a path-following method to solve (3). However, its code is not publicly available., described in Algorithm 3, to solve (3) and Sedumi sturm1999using for the and -regularized M-SVMs. Sedumi is called through CVX grant2008cvx . All the tests were performed on a computer having an i7-620m CPU and 3-GB RAM and running 32-bit Windows 7 and Matlab 2010b.

3.1 Binary-class SVM

For B-PGH, the parameters and were set to and , respectively. We chose

 ωk−1=min(tk−1−1tk,√Lk−1Lk),

where and . The starting point was chosen to be a zero vector. We stop B-PGH if both of the following conditions are satisfied during three consecutive iterations

 Fk−1−Fk1+Fk−1≤tol,∥uk−1−uk∥1+∥uk−1∥≤tol, (20)

where and . The stopping tolerance was set to for B-PGH and GCD and for ADMM since was too stringent. For B-PGH-2, we took at the first stage and at the second stage. The penalty parameters for ADMM were set to and as suggested in ye2011efficient . All the other parameters for ADMM and GCD were set to their default values.

3.1.1 Synthetic data

We generated samples in with one half in the “” class and the other half in the “” class. Specifically, each sample in the “

” class was generated according to the Gaussian distribution

and each sample in the “” class according to . The mean vector

 μ=(1,⋯,1s,0,⋯,0p−s)⊤,

and the covariance matrix

 Σ=[ρ1s×s+(1−ρ)Is×s0s×(p−s)0(p−s)×sI(p−s)×(p−s)],

where is the matrix of size with all one’s, is the matrix with all zero’s, and is an identity matrix of size . The first variables are relevant for classification and the rest ones being noise. This simulated data was also used in wang2008hybrid ; ye2011efficient . We tested the speed of the algorithms on different sets of dimension and the correlation . The smoothing parameter for B-PGH and GCD was set to in this subsection, and is set in (2). Table 2 lists the average running time (sec) of 10 independent trials. For each run, we averaged the time over 25 different pairs of . From the results, we can see that B-PGH was consistently faster than GCD and over 50 times faster than ADMM. In addition, the two-stage accelerated algorithm B-PGH-2 (see section 2.5) was significantly faster than B-PGH when was large and .

We also tested the prediction accuracy and variable selection of the algorithms. The problem dimension was set to . The optimal pair of was selected from a large grid by 10-fold cross validation. We use for the number of selected relevant variables and for the number of selected noise variables. In addition, we use “accu.” for the prediction accuracy. The average results of 500 independent runs corresponding to and are shown in Table 3. During each run, the algorithms were compared on a test set of 1000 samples. From the table, we can see that B-PGH achieved similar accuracy to that by GCD, and they both performed better than ADMM, especially in the case of . In addition, B-PGH-2 obtained similar solutions as B-PGH.

3.1.2 Medium-scale real data

In this subsection, we compare B-PGH, GCD and ADMM on medium-scale real data (see Table 4). The first seven datasets are available from the LIBSVM dataset555http://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets and the last three from Tom Mitchell’s neuroinformatics research group666http://www.cs.cmu.edu/tom/fmri.html. Both rcv1 and realsim have large feature dimensions but only about nonzeros. colon, duke and leuk are datasets of gene expression profiles for colon cancer, breast cancer and leukemia, respectively. The original dataset of colon consists of 62 samples, and we randomly chose 30 of them for training and the rest for tesing. gisette

is a hand-writing digit recognition problem from NIPS 2003 Feature Selection Challenge. The training set for

gisette is a random subset of the original 6000 samples, and the testing set contains all of the original 1000 samples. rcv1 is a collection of manually cetegorized news wires from Reuters. Both the training and tesing instances for sub-rcv1 are randomly sub-sampled from the original training and tesing samples. realsim contains UseNet articles from four discussion groups, for simulated auto racing, simulated aviation, real autos and real aviation. The original dataset of realsim has 72,309 samples, and we randomly sub-sampled 500 instances for training and 1,000 instances for tesing. fMRIa, fMRIb, and fMRIc are functional MRI (fMRI) data of brain activities when the subjects are presented with pictures and text paragraphs.

We fixed since the algorithms appeared not sensitive to and . The optimal ’s were tuned by 10-fold cross-validation on training sets. The smoothing parameter was set to for both B-PGH and GCD. All the other settings are the same as those in the previous test. The results are shown in Table 5. For comparison, we also report the prediction accuracies by LIBLINEAR fan2008liblinear with -regularized -loss. From the results, we can see that B-PGH is significantly faster than GCD and ADMM, and it also gives the best prediction accuracy. B-PGH-2 was fastest and achieved the same accuracy as B-PGH except for gisette. We want to mention that GCD can give the same accuracy as B-PGH but it needs to run much longer time, and ADMM can rarely achieve the same accuracy even if it runs longer since it solves a different model.

3.1.3 Statistical comparison

We also performed the statistical comparison of B-PGH, GCD, and ADMM. Following demvsar2006statistical , we did the Wilcoxon signed-ranks test777The Wilcoxon signed-ranks test is in general better than the paired -test as demonstrated in demvsar2006statistical . wilcoxon1945individual and Friedman test friedman1937use ; friedman1940comparison to see if the differences of the compared methods are significant. The former test is for pair-wise comparison and the latter one for multiple comparison. Specifically, for two different methods, denote as the difference of their score (e.g., prediction accuracy) on the -th dataset and rank ’s based on their absolute value. Let

 z=T−14N(N+1)√124N(N+1)(2N+1), (21)

where is the number of datasets, , and

 R+=∑di>0rank(di)+12∑di=0rank(di), R−=∑di<0rank(di)+12∑di=0rank(di).

Table 6 shows the pair-wise -values and the corresponding -values of the five compared methods. At -, we find that there is no significant differences between B-PGH and B-PGH-2 and neither between GCD and ADMM, and any other pair of methods make significant difference.

The Friedman statistic can be calculated by

 χ2F=12NK(K+1)[K∑j=1ARj−K(K+1)24], (22)

where is the number of compared methods, denotes the average rank of the -th method, and is the rank of the -th method on the -th dataset. Table 7 shows the ranks of the compared methods according to their prediction accuracies, where average ranks are applied to ties. The -value indicates significant difference among the five methods at -.

We further proceeded with a post-hoc test through Holm’s step-down procedure holm1979simple . Assuming the -values for compared classifiers to the control one are ordered as