# An Efficient ADMM-Based Algorithm to Nonconvex Penalized Support Vector Machines

Support vector machines (SVMs) with sparsity-inducing nonconvex penalties have received considerable attention for their ability to perform classification and variable selection simultaneously. However, it is quite challenging to solve the nonconvex penalized SVMs due to their nondifferentiability, nonsmoothness and nonconvexity. In this paper, we propose an efficient ADMM-based algorithm for the nonconvex penalized SVMs. The proposed algorithm covers a large class of commonly used nonconvex regularization terms including the smoothly clipped absolute deviation (SCAD) penalty, minimax concave penalty (MCP), log-sum penalty (LSP) and capped-ℓ1 penalty. The computational complexity analysis shows that the proposed algorithm enjoys low computational cost. Moreover, the convergence of the proposed algorithm is guaranteed. Extensive experimental evaluations on five benchmark datasets demonstrate the superior performance of the proposed algorithm over three other state-of-the-art approaches.


## I Introduction

It is well-known that SVMs can perform automatic variable selection by adding a sparsity-inducing penalty (regularizer) to the loss function [1, 2]. Typically, sparsity-inducing penalties can be divided into two categories: convex penalties and nonconvex penalties. The ℓ1 penalty is the most famous convex penalty and has been widely used for variable selection [3, 1]. Commonly used nonconvex penalties include the ℓp penalty with 0 < p < 1, the smoothly clipped absolute deviation (SCAD) penalty [4], the log penalty [5], the minimax concave penalty (MCP) [6], the log-sum penalty (LSP) [7], and the capped-ℓ1 penalty [8]. It has been shown in the literature that nonconvex penalties outperform the convex ones and enjoy better statistical properties [9]; theoretically, SVMs with elaborately designed nonconvex penalties can asymptotically unbiasedly estimate the large nonzero parameters while shrinking the estimates of zero-valued parameters to zero [10]. Consequently, the nonconvex penalized SVMs conduct variable selection and classification simultaneously; they are more robust to outliers and are able to yield a compact classifier with high accuracy.

Although nonconvex penalized SVMs are appealing, they are rather hard to optimize due to the nondifferentiability of the hinge loss function and the nonconvexity and nonsmoothness introduced by the nonconvex regularization term. Existing solutions to nonconvex penalized SVMs [11, 2] are computationally inefficient, and they are limited to a few nonconvex penalties. Besides that, other popular approaches cannot be applied to the nonconvex penalized hinge loss function since they typically require the loss function to be differentiable [12, 13].

In this paper, we focus on solving the standard support vector machine with a general class of nonconvex penalties including the SCAD penalty, MCP, LSP and capped-ℓ1 penalty. Mathematically, given a training set {(x_i, y_i)}_{i=1}^n, where x_i ∈ ℝ^d and y_i ∈ {−1, +1}, the nonconvex penalized SVMs minimize the following penalized hinge loss function:

$$\min_{\{w,b\}}\ \frac{1}{n}\sum_{i=1}^{n}\left[1-y_i\left(w^\top x_i+b\right)\right]_{+}+P(w),\tag{1}$$

where the pair (w, b) is the decision variable with w ∈ ℝ^d and b ∈ ℝ. P(w) = Σ_{i=1}^d p_λ(w_i) is the regularization term with a tuning parameter λ > 0, and p_λ is one of the nonconvex regularizers listed in Table I. Here and throughout this paper, [·]_+ represents max(·, 0); and ⊤ denotes the transposition of matrices and column vectors.
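As a concrete illustration, objective (1) can be evaluated in a few lines. This is a minimal sketch under our own naming; the ℓ1 penalty in the example merely stands in for one of the nonconvex p_λ of Table I:

```python
import numpy as np

def penalized_hinge_objective(w, b, X, y, penalty):
    """Evaluate objective (1): average hinge loss plus a separable penalty.

    X is the (n, d) feature matrix, y holds labels in {-1, +1}, and
    `penalty` is an entrywise callable standing in for p_lambda.
    """
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))  # [1 - y_i(w^T x_i + b)]_+
    return hinge.mean() + np.sum(penalty(w))

# With w = 0 and b = 0 every margin is zero, so the average hinge loss is 1.
lam = 0.1
obj = penalized_hinge_objective(
    np.zeros(3), 0.0,
    np.array([[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]),
    np.array([1.0, -1.0]),
    lambda t: lam * np.abs(t),  # l1 stand-in for a nonconvex p_lambda
)
```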

To address problem (1), we propose an efficient algorithm by incorporating the framework of alternating direction method of multipliers (ADMM) [14]. The main contributions of this paper can be summarized as follows.

1. We find that by reasonably reformulating problem (1) and applying the framework of ADMM, nonconvex penalized SVMs can be solved by optimizing a series of subproblems. In addition, each subproblem admits a closed-form solution and is easy to solve.

2. More importantly, we find that the main computational burden of the ADMM procedure lies in the update of w, which requires computing the inverse of a d×d matrix. This costs O(d³) flops (floating point operations) when the order of d is bigger than n. We propose an efficient scheme to update w via the Sherman-Morrison formula [15] and Cholesky factorization, attaining an improvement by a factor of (d/n)² over the naive method in this case. Furthermore, we optimize the iteration scheme so that the computationally expensive part is calculated only once.

3. We present a detailed computational complexity analysis and show that the optimized algorithm is highly computationally efficient.

4. In addition, we also present detailed convergence analysis of the proposed ADMM algorithm.

5. Extensive experimental evaluations on five LIBSVM benchmark datasets demonstrate the outstanding performance of the proposed algorithm. In comparison with three other state-of-the-art algorithms, the proposed algorithm runs faster while attaining high prediction accuracy.

The rest of this paper is organized as follows. Section II reviews the related work. Section III presents the derivation procedure and studies the computational complexity of the proposed algorithm. Section IV shows the convergence analysis. Section V details and discusses the experimental results. Finally, we conclude this paper in Section VI.

## II Related Work

Many studies have been devoted to nonconvex penalized SVMs due to their superior performance in various applications arising from academia and industry. Liu et al. [10] developed an ℓq-norm (0 < q < 1) penalized SVM based on margin maximization and approximation. Zhang et al. [11] combined the SVM with the smoothly clipped absolute deviation (SCAD) penalty and obtained a compact classifier with high accuracy; in order to solve the SCAD-penalized SVM efficiently, they proposed a successive quadratic algorithm (SQA) which converts the non-differentiable and non-convex optimization problem into an easily solved linear equation system. Zhang et al. [2] established a unified theory for SCAD- and MCP-penalized SVMs in the high-dimensional setting. Laporte et al. [16] proposed a general framework for feature selection in learning to rank using SVMs with nonconvex regularizations such as the log penalty, MCP and the ℓp norm with 0 < p < 1. Recently, Zhang et al. [17] established a unified theory for a general class of nonconvex penalized SVMs in the high-dimensional setting. Liu et al. [18] showed that a class of nonconvex learning problems are equivalent to general quadratic programs and proposed a reformulation-based technique named mixed integer programming-based global optimization (MIPGO).

Apart from the work discussed above, many methods for optimization problems with a general class of nonconvex regularizers [12, 19, 20, 13] have been developed. Nevertheless, these methods cannot be applied to the optimization problem studied in this paper. In [19], Hong et al. analyzed the convergence of ADMM for solving certain nonconvex consensus and sharing problems; however, they require the nonconvex regularization term to be smooth, which violates the nonsmoothness of the penalty functions considered in this paper. Later, Wang et al. [20] analyzed the convergence of ADMM for minimizing a nonconvex and possibly nonsmooth objective function with coupled linear constraints. Unfortunately, their analysis cannot be applied to the nonconvex penalized hinge loss function since they require the objective to be differentiable. Gong et al. [12] proposed the General Iterative Shrinkage and Thresholding (GIST) algorithm to solve the nonconvex optimization problem for a large class of nonconvex penalties. Recently, Jiang et al. [13] proposed two proximal-type variants of ADMM to solve structured nonconvex and nonsmooth problems. Nevertheless, the algorithms proposed in [12] and [13] are unable to solve the nonconvex penalized hinge loss function because they also require the loss function to be differentiable.

## III Algorithm for Nonconvex Penalized SVMs

In this section, we derive the solution of nonconvex penalized SVMs based on the framework of ADMM [14]. By introducing auxiliary variables and reformulating the original optimization problem, the nonconvex penalized SVMs can be solved by iterating a series of subproblems with closed-form solutions. Moreover, detailed computational complexity analysis of the proposed algorithm is presented in this section.

### III-A Derivation Procedure

In order to apply the framework of ADMM, we first introduce auxiliary variables to handle the nondifferentiability of problem (1).

Let ξ_i = [1 − y_i(w⊤x_i + b)]_+ (i = 1, …, n) and ξ = (ξ_1, …, ξ_n)⊤. Then the unconstrained problem (1) can be rewritten in the equivalent form

$$\min_{\{w,b,\xi\}}\ \frac{1}{n}\mathbf{1}^\top\xi+P(w),\quad \text{s.t.}\ \ Y(Xw+b\mathbf{1})\succeq \mathbf{1}-\xi,\ \ \xi\succeq \mathbf{0},\tag{2}$$

where X = (x_1, …, x_n)⊤ ∈ ℝ^{n×d} and Y is a diagonal matrix with y_i on the i-th diagonal element, i.e., Y = diag(y_1, …, y_n). In what follows, **1** is an n-column vector of 1s, **0** is an n-column vector of 0s, and ⪰ denotes element-wise ≥.

Note that, using variable splitting (w = z) and introducing another slack variable s, problem (2) can be converted to the following equivalent constrained problem:

$$\min_{\{w,b,\xi,s,z\}}\ \frac{1}{n}\mathbf{1}^\top\xi+P(z),\quad \text{s.t.}\ \ w=z,\ \ Y(Xw+b\mathbf{1})+\xi=s+\mathbf{1},\ \ \xi\succeq\mathbf{0},\ \ s\succeq\mathbf{0},\tag{3}$$

where z ∈ ℝ^d and s ∈ ℝ^n.

Hence, the corresponding surrogate Lagrangian function of (3) is

$$L_0(w,b,z,\xi,s,\gamma,\tau)=\frac{1}{n}\mathbf{1}^\top\xi+P(z)+\langle\gamma,\,w-z\rangle+\langle\tau,\,Y(Xw+b\mathbf{1})+\xi-s-\mathbf{1}\rangle,\tag{4}$$

where γ ∈ ℝ^d and τ ∈ ℝ^n are the dual variables corresponding to the first and second linear constraints of (3), respectively. ⟨·,·⟩ represents the standard inner product in Euclidean space. Note that we call L_0 a "surrogate Lagrangian function" since it does not involve the set of constraints {ξ ⪰ 0, s ⪰ 0}. The projections onto these two simple linear constraints can be easily calculated by basic algebra and projections onto the 1-dimensional nonnegative set [0, +∞).

Let H = YX and note that Y**1** = y; thus the scaled-form surrogate augmented Lagrangian function can be written as

$$\begin{aligned}
L(w,b,z,\xi,s,u,v)&=L_0(w,b,z,\xi,s,\gamma,\tau)+\frac{\rho_1}{2}\|w-z\|_2^{2}+\frac{\rho_2}{2}\|Hw+by+\xi-s-\mathbf{1}\|_2^{2}\\
&=\frac{1}{n}\mathbf{1}^\top\xi+P(z)+\frac{\rho_1}{2}\|w-z+u\|_2^{2}+\frac{\rho_2}{2}\|Hw+by+\xi-s-\mathbf{1}+v\|_2^{2}+\text{constant},
\end{aligned}\tag{5}$$

where u = γ/ρ1 and v = τ/ρ2 are the scaled dual variables. The constants ρ1 and ρ2 are penalty parameters with ρ1 > 0 and ρ2 > 0.

The resulting ADMM procedure starts with an initial iterate (w^(0), b^(0), z^(0), ξ^(0), s^(0), u^(0), v^(0)); for each iteration count k = 0, 1, 2, …, the scaled-form ADMM procedure can be expressed as

$$\begin{aligned}
w^{(k+1)}&=\operatorname*{argmin}_{w}\ L(w,b^{(k)},z^{(k)},\xi^{(k)},s^{(k)},u^{(k)},v^{(k)}), &(6)\\
b^{(k+1)}&=\operatorname*{argmin}_{b}\ L(w^{(k+1)},b,z^{(k)},\xi^{(k)},s^{(k)},u^{(k)},v^{(k)}), &(7)\\
z^{(k+1)}&=\operatorname*{argmin}_{z}\ L(w^{(k+1)},b^{(k+1)},z,\xi^{(k)},s^{(k)},u^{(k)},v^{(k)}), &(8)\\
\xi^{(k+1)}&=\operatorname*{argmin}_{\xi\succeq\mathbf{0}}\ L(w^{(k+1)},b^{(k+1)},z^{(k+1)},\xi,s^{(k)},u^{(k)},v^{(k)}), &(9)\\
s^{(k+1)}&=\operatorname*{argmin}_{s\succeq\mathbf{0}}\ L(w^{(k+1)},b^{(k+1)},z^{(k+1)},\xi^{(k+1)},s,u^{(k)},v^{(k)}), &(10)\\
u^{(k+1)}&=u^{(k)}+\left(w^{(k+1)}-z^{(k+1)}\right), &(11)\\
v^{(k+1)}&=v^{(k)}+\left(Hw^{(k+1)}+b^{(k+1)}y+\xi^{(k+1)}-s^{(k+1)}-\mathbf{1}\right). &(12)
\end{aligned}$$

Considering problem (6), we can obtain its closed-form solution via ∂L/∂w = 0, that is,

$$w^{(k+1)}=\left(\rho_1 I_d+\rho_2 H^\top H\right)^{-1}\left[\rho_1\left(z^{(k)}-u^{(k)}\right)+\rho_2 H^\top\left(s^{(k)}+\mathbf{1}-\xi^{(k)}-v^{(k)}-b^{(k)}y\right)\right],\tag{13}$$

where I_d denotes the d×d identity matrix.

Note that (13) requires calculating the inverse of a d×d matrix. The computational cost is especially high for the large-d case. Therefore, we further investigate an efficient solution for the update of w according to the values of n and d.

Let ρ = ρ1/ρ2 and f^(k) = ρ(z^(k) − u^(k)) + H⊤(s^(k) + **1** − ξ^(k) − v^(k) − b^(k)y); then Equation (13) can be equivalently converted to

$$w^{(k+1)}=\left(\rho I_d+H^\top H\right)^{-1}f^{(k)}.\tag{14}$$

If d is more than n in order, we can apply the Sherman-Morrison formula [15] to solve (14). Therefore, we have

$$w^{(k+1)}=\frac{f^{(k)}}{\rho}-\frac{H^\top\left(U^{-1}\left(L^{-1}\left(Hf^{(k)}\right)\right)\right)}{\rho^{2}},\tag{15}$$

where L and U are the Cholesky factorization of the positive definite matrix I_n + HH⊤/ρ, i.e., LU = I_n + HH⊤/ρ. Here, I_n is the n×n identity matrix.

For the case when n ≥ d, observe that the matrix ρI_d + H⊤H is positive definite; then we can obtain the solution of w^(k+1) via

$$w^{(k+1)}=U^{-1}\left(L^{-1}f^{(k)}\right),\tag{16}$$

where L and U are the Cholesky factorization of the matrix ρI_d + H⊤H, i.e., LU = ρI_d + H⊤H.

Consequently, equation (13) can be equivalently converted to

$$w^{(k+1)}=\begin{cases}U^{-1}\left(L^{-1}f^{(k)}\right), & \text{if } n\ge d,\\[4pt] \dfrac{f^{(k)}}{\rho}-\dfrac{H^\top\left(U^{-1}\left(L^{-1}\left(Hf^{(k)}\right)\right)\right)}{\rho^{2}}, & \text{otherwise},\end{cases}\tag{17}$$

where L and U are the Cholesky factorization of ρI_d + H⊤H if n ≥ d, and the Cholesky factorization of I_n + HH⊤/ρ otherwise.
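The two-branch update (17) with a cached factorization can be sketched as follows. This is an illustrative implementation under our own naming; we use a lower-triangular Cholesky factor C with CC⊤ equal to the matrix being factored, where the paper writes the factorization as LU:

```python
import numpy as np
from numpy.linalg import cholesky, solve

def factor(H, rho):
    """Cache the Cholesky factor used by Eq. (17); H = YX is n x d."""
    n, d = H.shape
    if n >= d:
        return cholesky(rho * np.eye(d) + H.T @ H)  # rho*I_d + H^T H
    return cholesky(np.eye(n) + (H @ H.T) / rho)    # I_n + H H^T / rho

def w_update(f, H, rho, C):
    """Eq. (17): direct solve when n >= d, Sherman-Morrison otherwise."""
    n, d = H.shape
    if n >= d:
        return solve(C.T, solve(C, f))              # Eq. (16)
    t = solve(C.T, solve(C, H @ f))                 # (I_n + HH^T/rho)^{-1} (H f)
    return f / rho - (H.T @ t) / rho**2             # Eq. (15)
```

Both branches return the same vector; the point of the fat-data branch is that only an n×n matrix is ever factored.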

###### Proposition 1.

For the case when d > n, the computational cost of the reformulated w-update by Equation (17) is O(dn²) flops, giving rise to an improvement by a factor of (d/n)² over the naive w-update by Equation (13).

###### Proof.

This proof exploits no structure in H, i.e., our generic method works for any matrix H. For convenience, this proof neglects the superscripts of each variable.

For the reformulated w-update by Equation (17), we can first form Hf at a cost of O(nd) flops. Then forming HH⊤ costs O(n²d) flops, followed by the calculation of L and U via Cholesky factorization at a cost of O(n³) flops. After that, we can form H⊤(U^{−1}(L^{−1}(Hf))) through two matrix-vector multiplications and two back-solve steps at a cost of O(nd + n²) flops. Since it costs O(d) flops to form f/ρ and to perform the final subtraction, the overall cost of forming w is O(n²d) flops.

In terms of the naive update by Equation (13), we can first form H⊤H at a cost of O(nd²) flops. Because d is more than n in order, forming the inverse of ρ1I_d + ρ2H⊤H costs O(d³) flops. Considering that d > n, the cost of forming w is dominated by this inversion. Thus, the naive method for calculating w costs O(d³) flops in total.

Since d > n, the reformulated method obtains an improvement by a factor of d³/(n²d) = (d/n)² over the naive method. This completes the proof of Proposition 1. ∎

By letting ∂L/∂b = 0, we obtain the solution of Equation (7), that is,

$$b^{(k+1)}=\frac{y^\top\left(s^{(k)}+\mathbf{1}-Hw^{(k+1)}-\xi^{(k)}-v^{(k)}\right)}{y^\top y}.\tag{18}$$
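As a quick numerical sanity check, (18) should satisfy the first-order condition ∂L/∂b = 0. The snippet below verifies this on randomly generated data; all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, rho2 = 5, 3, 1.7
H = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))   # labels in {-1, +1}
w = rng.normal(size=d)
xi, s, v = rng.normal(size=(3, n))

# Eq. (18): the minimizer of the quadratic-in-b part of L
b = y @ (s + 1 - H @ w - xi - v) / (y @ y)

# first-order condition dL/db = rho2 * y^T (Hw + b*y + xi - s - 1 + v)
grad = rho2 * (y @ (H @ w + b * y + xi - s - 1 + v))
```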

In addition, note that Equation (8) is equivalent to optimizing the following problem:

$$z^{(k+1)}=\operatorname*{argmin}_{z}\ \frac{1}{2}\left\|z-\left(w^{(k+1)}+u^{(k)}\right)\right\|_2^{2}+\frac{1}{\rho_1}P(z).\tag{19}$$

Based on the observation that P(z) = Σ_{i=1}^d p_λ(z_i), we can get the solution of problem (19) by solving d independent univariate optimization problems. Let ψ^(k+1) = w^(k+1) + u^(k); then we can obtain the solution of the i-th entry of variable z in the (k+1)-th iteration, that is,

$$z_i^{(k+1)}=\operatorname*{argmin}_{z_i}\ \frac{1}{2}\left(z_i-\psi_i^{(k+1)}\right)^{2}+\frac{1}{\rho_1}p_\lambda(z_i),\qquad \forall\, i\in[1,d].\tag{20}$$

It has been shown in [12] that this subproblem admits a closed-form solution for many commonly used nonconvex penalties. The closed-form solutions of (20) for four commonly used nonconvex regularizers, including the LSP, SCAD penalty, MCP and capped-ℓ1 penalty, are shown in Appendix A.
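For instance, for the capped-ℓ1 penalty the univariate subproblem (20) can be solved by comparing one candidate minimizer per branch of the penalty, in the spirit of [12]. A sketch, with names and parameterization ours and θ denoting the capping level:

```python
import numpy as np

def prox_capped_l1(psi, t, lam, theta):
    """Entrywise solution of subproblem (20) for the capped-l1 penalty
    p(z) = lam * min(|z|, theta), with step t = 1/rho_1.

    On {|z| >= theta} the penalty is constant, so the best point there is
    sign(psi)*max(|psi|, theta); on {|z| <= theta} the problem is an l1
    proximal step clipped to theta. We evaluate the objective at both
    candidates and keep the better one.
    """
    def obj(z):
        return 0.5 * (z - psi) ** 2 + t * lam * np.minimum(np.abs(z), theta)

    z1 = np.sign(psi) * np.maximum(np.abs(psi), theta)                       # capped branch
    z2 = np.sign(psi) * np.minimum(theta, np.maximum(np.abs(psi) - t * lam, 0.0))
    return np.where(obj(z1) <= obj(z2), z1, z2)
```

A large |ψ| falls in the flat region and is left untouched, while a small |ψ| is soft-thresholded, which is exactly the unbiasedness-plus-sparsity behavior discussed in the introduction.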

The closed-form solution of Equation (9) can be obtained by performing ∂L/∂ξ = 0, followed by the projection onto the 1-dimensional nonnegative set [0, +∞), that is,

$$\xi^{(k+\frac{1}{2})}=s^{(k)}+\mathbf{1}-v^{(k)}-Hw^{(k+1)}-b^{(k+1)}y-\frac{1}{n\rho_2}\mathbf{1},\tag{21}$$
$$\xi^{(k+1)}=\max\left(\xi^{(k+\frac{1}{2})},\,\mathbf{0}\right).\tag{22}$$

Similarly, the solution of (10) can be calculated through ∂L/∂s = 0. Therefore, we can perform a two-step update as follows.

$$s^{(k+\frac{1}{2})}=Hw^{(k+1)}+b^{(k+1)}y+\xi^{(k+1)}-\mathbf{1}+v^{(k)},\tag{23}$$
$$s^{(k+1)}=\max\left(s^{(k+\frac{1}{2})},\,\mathbf{0}\right).\tag{24}$$
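Putting updates (13)-(24) together, the whole ADMM procedure can be sketched as below. This is a simplified illustration, not the paper's Algorithm 1 verbatim: it hard-codes the n ≥ d branch of (17), runs a fixed number of iterations instead of a stopping criterion, and takes the z-subproblem solver (20) as a callable:

```python
import numpy as np

def admm_svm(X, y, prox, rho1=1.0, rho2=1.0, iters=100):
    """ADMM iterations (6)-(12) for the reformulated problem (3).

    `prox(psi, t)` should return the entrywise minimizer of
    (1/2)(z - psi)^2 + t * p_lambda(z), i.e. subproblem (20); any of the
    closed-form nonconvex proximal maps can be plugged in.
    """
    n, d = X.shape
    H = y[:, None] * X                                  # H = YX
    w, b = np.zeros(d), 0.0
    z, xi, s = np.zeros(d), np.zeros(n), np.zeros(n)
    u, v = np.zeros(d), np.zeros(n)
    rho = rho1 / rho2
    C = np.linalg.cholesky(rho * np.eye(d) + H.T @ H)   # cached factorization
    for _ in range(iters):
        f = rho * (z - u) + H.T @ (s + 1 - xi - v - b * y)
        w = np.linalg.solve(C.T, np.linalg.solve(C, f))           # Eq. (16)
        b = y @ (s + 1 - H @ w - xi - v) / (y @ y)                # Eq. (18)
        z = prox(w + u, 1.0 / rho1)                               # Eq. (20)
        xi = np.maximum(s + 1 - v - H @ w - b * y - 1 / (n * rho2), 0)  # (21)-(22)
        s = np.maximum(H @ w + b * y + xi - 1 + v, 0)             # (23)-(24)
        u = u + (w - z)                                           # Eq. (11)
        v = v + (H @ w + b * y + xi - s - 1)                      # Eq. (12)
    return w, b
```

With the soft-thresholding operator as `prox`, this reduces to the ℓ1-penalized SVM, which is convenient for testing.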

### III-B Algorithm and Computational Cost Analysis

The procedure for solving nonconvex penalized SVMs via ADMM is shown in Algorithm 1. It mainly consists of two parts: the pre-computation stage (lines 1-8) and the iteration stage (lines 9-21).

In Algorithm 1, the primal and dual variables are initialized first at line 1, followed by the calculation of the constant variables such as H = YX at line 2. Since Y is a diagonal matrix, line 2 can be carried out at a total cost of O(nd) flops. Note that the parameters ρ1 and ρ2 remain unchanged throughout the ADMM procedure. Thus we can carry out the Cholesky factorization, chosen according to the values of n and d, once, and then use this cached factorization in subsequent solve steps. In Algorithm 1, we first form an intermediate matrix, of size d×d or n×n according to the values of n and d, and then factor it (lines 3-8). According to the analysis arising in the proof of Proposition 1, forming this matrix and then factoring it costs O(dn²) flops when the order of d is more than n. Meanwhile, if d is on the order of or less than n, lines 3-8 can be carried out at a cost of O(d²n) flops. Therefore, the overall cost of the pre-computation stage is O(d²n) flops if n ≥ d, and O(dn²) flops otherwise.

After the pre-computation stage, Algorithm 1 iterates the ADMM procedure until the pre-defined stopping criterion is satisfied (lines 9-21). For the w-update, f^(k) can first be obtained via line 10 at a cost of O(nd) flops. Then, if the order of d is more than n, w^(k+1) can be formed at a cost of O(nd) flops according to the analysis arising in the proof of Proposition 1; otherwise, it takes O(d²) flops to form w^(k+1) via two back-solve steps. Since d² ≤ nd when n ≥ d, the w-update costs O(nd) flops in any case. In terms of the update of z, it has been shown that we can get the solution of z^(k+1) by solving d independent univariate optimization problems, each of which admits a closed-form solution; therefore, this step can be carried out at a cost of O(d) flops. Moreover, line 12 and lines 14-20 can easily be carried out at a cost of O(n + d) flops in total. Since n + d ≤ 2nd, it takes O(nd) flops per iteration.

As a result, we can see that the overall computational cost of Algorithm 1 is

$$\begin{cases}O(d^{2}n)+O(dn)\times \#\text{iterations}, & \text{if } n\ge d,\\[2pt] O(dn^{2})+O(dn)\times \#\text{iterations}, & \text{otherwise}.\end{cases}\tag{25}$$

The computational complexity shown in (25) demonstrates the efficiency of the proposed algorithm. In addition, note that the analysis above does not consider possible sparse structure in the feature matrix; when exploiting sparsity, the overall computational complexity of Algorithm 1 can be further decreased. Meanwhile, it has been shown in [14] that ADMM can converge to modest accuracy, sufficient for many applications, within a few tens of iterations. The experimental results also demonstrate this point: we find that Algorithm 1 always converges within only a few tens of iterations to a reasonable result when the parameters ρ1, ρ2 and λ are appropriately tuned.
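The paper does not spell out its stopping criterion. One common choice, shown here purely as an assumed example, is the primal/dual residual test of Boyd et al. [14] applied to the w = z constraint:

```python
import numpy as np

def converged(w, z, z_old, u, rho1, eps_abs=1e-4, eps_rel=1e-3):
    """A hedged example of a stopping test, not the paper's exact rule.

    Primal residual r = w - z; dual residual t = rho1 * (z - z_old);
    tolerances follow the generic ADMM criterion of Boyd et al. [14].
    """
    r = np.linalg.norm(w - z)                 # primal residual
    t = rho1 * np.linalg.norm(z - z_old)      # dual residual
    d = w.size
    eps_pri = np.sqrt(d) * eps_abs + eps_rel * max(np.linalg.norm(w),
                                                   np.linalg.norm(z))
    eps_dual = np.sqrt(d) * eps_abs + eps_rel * rho1 * np.linalg.norm(u)
    return r <= eps_pri and t <= eps_dual
```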

## IV Convergence Analysis

In this section, we present the detailed convergence analysis of the proposed algorithm. To present the analysis, we first slightly modify the scheme for updating z, that is,

$$z^{(k+1)}=\operatorname*{argmin}_{z}\ L(w^{(k+1)},b^{(k+1)},z,\xi^{(k)},s^{(k)},u^{(k)},v^{(k)})+\frac{\beta}{2}\left\|z-z^{(k)}\right\|^{2},\tag{26}$$

where β > 0 is small. If β = 0, (26) reduces to (8); and if β is very small, (26) is very close to (8). After that, we give the convergence analysis following the proof framework built in [20], which is also used in [21, 22]. However, it is worth noting that our work is not a simple extension of [20]. As mentioned before, [20] cannot be applied to solve the nonconvex penalized hinge loss function since it requires the loss function to be differentiable.

Before giving the convergence analysis, we need the following two assumptions.

###### Assumption 1.

For any , .

###### Assumption 2.

The augmented Lagrangian function L is bounded below, that is, inf L > −∞.

Now we introduce several definitions and properties needed in the analysis.

###### Definition 1.

We say a function f is strongly convex with constant δ > 0 if the function f(·) − (δ/2)∥·∥² is convex.

If a function f is strongly convex with constant δ, the following fact holds: let x* be a minimizer of f. Then,

$$f(x)-f(x^{*})\ \ge\ \frac{\delta}{2}\left\|x-x^{*}\right\|^{2}.\tag{27}$$
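This fact follows directly from the definition of strong convexity together with the optimality of x*: for a δ-strongly convex f, every subgradient inequality at x* reads

$$f(x)\ \ge\ f(x^{*})+\langle \xi,\,x-x^{*}\rangle+\frac{\delta}{2}\left\|x-x^{*}\right\|^{2},\qquad \xi\in\partial f(x^{*}),$$

and since x* is a minimizer we may take ξ = 0 ∈ ∂f(x*), which gives (27).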

To simplify the presentation, we use

$$D^{(k)}:=\left(w^{(k)},b^{(k)},z^{(k)},\xi^{(k)},s^{(k)},u^{(k)},v^{(k)}\right).$$

Now, we are prepared to present the convergence analysis of our algorithm.

###### Lemma 3.

Let {D^(k)}_{k≥0} be generated by our algorithm; then,

$$\begin{aligned}
&L(w^{(k)},b^{(k)},z^{(k)},\xi^{(k)},s^{(k)},u^{(k)},v^{(k)})\\
&\quad\ge L(w^{(k+1)},b^{(k+1)},z^{(k+1)},\xi^{(k+1)},s^{(k+1)},u^{(k)},v^{(k)})+\frac{\nu}{2}\left\|D^{(k+1)}-D^{(k)}\right\|^{2},
\end{aligned}\tag{28}$$

where

$$\nu:=\min\left\{\rho_1+\rho_2\sigma_{\min}(H^\top H),\ \rho_2,\ \rho_2\|y\|^{2},\ \beta\right\}.$$
###### Proof.

Noting that L is strongly convex with constant ρ1 + ρ2σ_min(H⊤H) with respect to w, the minimization in (6), together with (27), directly yields

$$\begin{aligned}
&L(w^{(k)},b^{(k)},z^{(k)},\xi^{(k)},s^{(k)},u^{(k)},v^{(k)})\\
&\quad\ge L(w^{(k+1)},b^{(k)},z^{(k)},\xi^{(k)},s^{(k)},u^{(k)},v^{(k)})+\frac{\rho_1+\rho_2\sigma_{\min}(H^\top H)}{2}\left\|w^{(k+1)}-w^{(k)}\right\|^{2}.
\end{aligned}\tag{29}$$

Similarly, we can obtain the following inequalities

$$\begin{aligned}
&L(w^{(k+1)},b^{(k)},z^{(k)},\xi^{(k)},s^{(k)},u^{(k)},v^{(k)})\\
&\quad\ge L(w^{(k+1)},b^{(k+1)},z^{(k)},\xi^{(k)},s^{(k)},u^{(k)},v^{(k)})+\frac{\rho_2\|y\|^{2}}{2}\left\|b^{(k+1)}-b^{(k)}\right\|^{2},
\end{aligned}\tag{30}$$

and

$$\begin{aligned}
&L(w^{(k+1)},b^{(k+1)},z^{(k+1)},\xi^{(k)},s^{(k)},u^{(k)},v^{(k)})\\
&\quad\ge L(w^{(k+1)},b^{(k+1)},z^{(k+1)},\xi^{(k+1)},s^{(k)},u^{(k)},v^{(k)})+\frac{\rho_2}{2}\left\|\xi^{(k+1)}-\xi^{(k)}\right\|^{2},
\end{aligned}\tag{31}$$

and

$$\begin{aligned}
&L(w^{(k+1)},b^{(k+1)},z^{(k+1)},\xi^{(k+1)},s^{(k)},u^{(k)},v^{(k)})\\
&\quad\ge L(w^{(k+1)},b^{(k+1)},z^{(k+1)},\xi^{(k+1)},s^{(k+1)},u^{(k)},v^{(k)})+\frac{\rho_2}{2}\left\|s^{(k+1)}-s^{(k)}\right\|^{2}.
\end{aligned}\tag{32}$$

Noting that z^(k+1) is the minimizer of the β-strongly convex subproblem (26) with respect to z, we obtain

$$\begin{aligned}
&L(w^{(k+1)},b^{(k+1)},z^{(k)},\xi^{(k)},s^{(k)},u^{(k)},v^{(k)})\\
&\quad\ge L(w^{(k+1)},b^{(k+1)},z^{(k+1)},\xi^{(k)},s^{(k)},u^{(k)},v^{(k)})+\frac{\beta}{2}\left\|z^{(k+1)}-z^{(k)}\right\|^{2}.
\end{aligned}\tag{33}$$

Summing (29)-(33) yields

$$\begin{aligned}
&L(w^{(k)},b^{(k)},z^{(k)},\xi^{(k)},s^{(k)},u^{(k)},v^{(k)})\\
&\quad\ge L(w^{(k+1)},b^{(k+1)},z^{(k+1)},\xi^{(k+1)},s^{(k+1)},u^{(k)},v^{(k)})+\frac{\nu}{2}\left\|D^{(k+1)}-D^{(k)}\right\|^{2},
\end{aligned}\tag{34}$$

where

$$\nu:=\min\left\{\rho_1+\rho_2\sigma_{\min}(H^\top H),\ \rho_2,\ \rho_2\|y\|^{2},\ \beta\right\}.\qquad\blacksquare$$

###### Lemma 4.

If Assumption 1 holds, then

$$\left\|v^{(k+1)}-v^{(k)}\right\|^{2}\le c_{1}\left\|D^{(k+1)}-D^{(k)}\right\|^{2}+c_{2}\left\|D^{(k+2)}-D^{(k+1)}\right\|^{2}