# Support Vector Machine Classifier via L_0/1 Soft-Margin Loss

The support vector machine (SVM) has attracted great attention over the last two decades due to its extensive applications, and numerous optimization models have been proposed as a result. In contrast to these models, in this paper we introduce a new model equipped with an L_0/1 soft-margin loss (dubbed L_0/1-SVM), which captures the nature of binary classification well. Many of the existing convex and non-convex soft-margin losses can be viewed as surrogates of the L_0/1 soft-margin loss. Despite the discrete nature of L_0/1, we establish the existence of a global minimizer of the new model and reveal the relationship among its minimizers and KKT/P-stationary points. These theoretical properties allow us to take advantage of the alternating direction method of multipliers. In addition, the L_0/1-support vector operator is introduced as a filter to prevent outliers from being support vectors during the training process. Hence, the method is expected to be relatively robust. Finally, numerical experiments demonstrate that the proposed method performs better than several other leading SVM methods, with much shorter computational time and far fewer support vectors. As the data size grows, its advantage becomes more evident.


## 1 Introduction

The support vector machine (SVM) was first introduced by Vapnik and Cortes [1] and has since been widely applied in machine learning, statistics, pattern recognition and so forth. The basic idea is to find a hyperplane in the input space that separates the training data set. In this paper, we consider a binary classification problem that can be described as follows. Suppose we are given a training set

$$\{(x_i,y_i) : i=1,2,\cdots,m\},$$

where $x_i\in\mathbb{R}^n$ are the input vectors and $y_i\in\{-1,1\}$ are the output labels. The purpose of SVM is to train a hyperplane $\langle w,x\rangle+b=0$ with $w\in\mathbb{R}^n$ and $b\in\mathbb{R}$ from the given training set. For any new input vector $x$, we predict the corresponding label as $1$ if $\langle w,x\rangle+b>0$ and $-1$ otherwise. To find the optimal hyperplane, two cases must be distinguished: linearly separable and linearly inseparable training data. If the training data can be linearly separated in the input space, then the unique optimal hyperplane can be obtained by solving a convex quadratic programming (QP) problem:

$$\min_{w\in\mathbb{R}^n,\,b\in\mathbb{R}}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(\langle w,x_i\rangle+b)\ge 1,\ i\in N_m, \tag{1}$$

where $N_m:=\{1,2,\cdots,m\}$. Here, $|\langle w,x_i\rangle+b|/\|w\|$ provides the distance between the $i$th sample and the hyperplane. The above model is termed the hard-margin SVM because it requires all samples to be classified correctly. When the training data are linearly inseparable in the input space, the popular approach is to allow violations of the constraints in problem (1) and to penalize such violations in the objective function, namely,

$$\min_{w\in\mathbb{R}^n,\,b\in\mathbb{R}}\ \frac{1}{2}\|w\|^2+C\sum_{i=1}^{m}\ell\big(1-y_i(\langle w,x_i\rangle+b)\big), \tag{2}$$

where $C>0$ is a penalty parameter and $\ell$ is a loss function that penalizes some incorrectly classified samples and leaves the others untouched. Therefore, the above model allows misclassified samples, and is thus known as the soft-margin SVM. Clearly, different soft-margin loss functions yield different soft-margin SVM models. Generally speaking, soft-margin loss functions fall into two categories based on the convexity of $\ell$.

### 1.1 Convex Soft-Margin Losses

Since a large number of convex soft-margin loss functions have been proposed for soft-margin SVM problems, we review only some popular ones. Throughout, $t$ denotes the argument $1-y_i(\langle w,x_i\rangle+b)$ of $\ell$ in (2).

• Hinge loss function: $\ell_{\rm hinge}(t):=\max\{0,t\}$. It is non-differentiable at $t=0$ and unbounded. SVM with hinge loss ($\ell_{\rm hinge}$-SVM) was first proposed by Vapnik and Cortes [1], aiming at penalizing only the samples with $t>0$.

• Pinball loss function: $\ell_{\rm pinball}(t):=\max\{t,-\tau t\}$ with $0\le\tau\le1$, which is still non-differentiable at $t=0$ and unbounded. SVM with this loss function ($\ell_{\rm pinball}$-SVM) was proposed in [2, 3] to penalize all samples. A quadratic programming solver embedded in Matlab can be used to solve the SVM with pinball loss function [3].

• Hybrid Huber loss function: it is differentiable everywhere but still unbounded. This function was first introduced in [4], while SVM with this loss was first proposed in [5] and can be solved by the proximal gradient method [6].

• Square loss function: $\ell_{\rm square}(t):=t^2$, a differentiable but unbounded function. SVM with square loss ($\ell_{\rm square}$-SVM) can be found in [7, 8].

• Some other convex loss functions: the insensitive zone pinball loss [3], the exponential loss function [9] and log loss function [10].

Since these functions are convex, the corresponding SVM models are not difficult to solve. However, convexity often induces unboundedness, which removes the robustness of these loss functions to outliers in the training data. To overcome this drawback, the authors of [11, 12] set an upper bound and force the loss to stop increasing beyond a certain point. In doing so, the original convex loss functions become non-convex.

### 1.2 Non-Convex Soft-Margin Losses

Again, since a large number of non-convex soft-margin losses have been studied and a full review is beyond our scope, we present only some of them.

• Ramp loss function: $\ell^{\mu}_{\rm ramp}(t):=\max\{0,t\}-\max\{0,t-\mu\}$ with $\mu>0$, which is non-differentiable at $t=0$ and $t=\mu$ but bounded between $0$ and $\mu$. It does not penalize the case $t\le0$, while it pays a linear penalty when $0<t<\mu$ and a fixed penalty $\mu$ when $t\ge\mu$. This makes the function robust to outliers. The authors of [13] investigated SVM with ramp loss ($\ell_{\rm ramp}$-SVM).

• Truncated pinball loss function (the pinball loss truncated on its left side): it is non-differentiable at two points and unbounded. The penalty is fixed on the truncated left side and is linear otherwise. SVM with this loss is studied in [14].

• Asymmetrical truncated pinball loss function (the pinball loss truncated on both sides): this function is non-differentiable at its breakpoints but bounded. The penalty is fixed on each of the two truncated sides and is linear otherwise. SVM with this loss comes from [15].

• Sigmoid loss function: a differentiable and bounded (between 0 and 1) function. It penalizes all samples. SVM with this loss (sigmoid-SVM) can be seen in [16].

• Some other non-convex loss functions: the normalized sigmoid cost loss function [17].

Compared with convex soft-margin losses, most non-convex loss functions are less sensitive to feature noise or outliers due to their boundedness. However, non-convexity makes the corresponding SVM models harder to solve. In summary, a soft-margin loss should be chosen according to three basic principles [18], [19]: (i) it should capture the discrete nature of binary classification; (ii) it should be bounded, so as to be robust to feature noise and outliers; (iii) the SVM model based on it should be easy to solve.

### 1.3 ℓ0/1 Soft-Margin Loss

Taking the above principles into consideration, we now introduce the 0-1 ($\ell_{0/1}$ for short) soft-margin loss defined as

$$\ell_{0/1}(t)=\begin{cases}1, & t>0,\\ 0, & t\le0.\end{cases}$$

The $\ell_{0/1}$ soft-margin loss is the most natural loss function for binary classification [20], [21]. Its properties are summarized below.

• It is discontinuous at $t=0$, which captures the discrete nature of binary classification (correctness or incorrectness) [22].

• It is lower semi-continuous and nonconvex by the definition in [23]. Since it takes only the values 0 and 1, sparsity and robustness are guaranteed. In fact, it does not count the samples with $t\le0$, which leads to sparsity, while it returns 1 otherwise, which ensures robustness to outliers.

• It is differentiable everywhere except at $t=0$. However, it has the subdifferential

$$\partial\ell_{0/1}(0)=\mathbb{R}_+:=\{t\in\mathbb{R}:t\ge0\}$$

and zero gradients elsewhere (see Lemma 2.1), which makes the computation tractable.
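To make the contrast between the 0/1 loss and its surrogates concrete, the following is a minimal NumPy sketch. The hinge and ramp formulas are the standard textbook forms and are assumptions as far as this text is concerned; the function names and the sample values of `t` are purely illustrative.

```python
import numpy as np

def hinge_loss(t):
    """Standard hinge loss max{0, t} (assumed form); unbounded as t grows."""
    return np.maximum(0.0, t)

def ramp_loss(t, mu=1.0):
    """Standard ramp loss max{0, t} - max{0, t - mu} (assumed form); capped at mu."""
    return np.maximum(0.0, t) - np.maximum(0.0, t - mu)

def zero_one_loss(t):
    """The 0/1 soft-margin loss: 1 if t > 0, and 0 if t <= 0."""
    return (np.asarray(t, dtype=float) > 0).astype(float)

# t plays the role of 1 - y_i(<w, x_i> + b); t = 10 mimics a far outlier.
t = np.array([-2.0, 0.0, 0.5, 10.0])
print(hinge_loss(t))     # penalty keeps growing with t
print(ramp_loss(t))      # penalty is capped at mu
print(zero_one_loss(t))  # penalty is either 0 or 1
```

The outlier at `t = 10` contributes 10 to the hinge loss but only 1 to the 0/1 loss, which is the boundedness property discussed above.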

### 1.4 L0/1-SVM

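For ease of reading, we present some notation here. Let $\|u\|$ and $\|u\|_0$ be the Euclidean norm and the zero norm of $u$, where the latter counts the number of non-zero elements of $u$. Denote $A:=\mathrm{Diag}(y)X\in\mathbb{R}^{m\times n}$ with $X:=[x_1,\cdots,x_m]^\top$ and $y:=(y_1,\cdots,y_m)^\top$, where $\mathrm{Diag}(y)$ is a diagonal matrix whose diagonal elements are the elements of $y$. For a positive integer $m$ and a vector $u\in\mathbb{R}^m$, denote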

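$$\begin{aligned} N_m &:= \{1,2,\cdots,m\}, & \mathbf{1} &:= (1,1,\cdots,1)^\top\in\mathbb{R}^m,\\ \mathbb{R}^m_+ &:= \{u\in\mathbb{R}^m: u_i\ge0,\ i\in N_m\}, & |u| &:= (|u_1|,\cdots,|u_m|)^\top,\\ u_+ &:= (\max\{u_1,0\},\cdots,\max\{u_m,0\})^\top. \end{aligned}$$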

These notations indicate

$$L_{0/1}(u):=\|u_+\|_0=\sum_{i=1}^{m}\ell_{0/1}(u_i), \tag{3}$$

which returns the number of positive elements in $u$. We call (3) the $L_{0/1}$ soft-margin loss. Now, replacing $\ell$ by $\ell_{0/1}$ in (2) and using the above notation allows us to rewrite model (2) in matrix form,

$$\min_{w\in\mathbb{R}^n,\,b\in\mathbb{R}}\ f(w;b):=\frac{1}{2}\|w\|^2+C\|(\mathbf{1}-(Aw+by))_+\|_0. \tag{4}$$

We call this model $L_{0/1}$-SVM. The objective function is lower semicontinuous, non-differentiable and non-convex, so it is difficult to solve directly with most existing optimization algorithms. Although the discrete nature of the zero norm makes the above model NP-hard, the $L_{0/1}$-SVM model is an ideal SVM model because it guarantees as few misclassified samples as possible for binary classification. Therefore, this paper is built around this model.
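As an illustration of the matrix form (4), the sketch below evaluates the objective on a toy data set. It assumes $A=\mathrm{Diag}(y)X$, which is what makes the $i$th entry of $\mathbf{1}-(Aw+by)$ equal to $1-y_i(\langle w,x_i\rangle+b)$ from (2); the function name and the data are hypothetical.

```python
import numpy as np

def l01_svm_objective(w, b, X, y, C):
    """Evaluate f(w; b) = 0.5 * ||w||^2 + C * ||(1 - (A w + b y))_+||_0."""
    A = y[:, None] * X               # assumption: A = Diag(y) X
    u = 1.0 - (A @ w + b * y)        # u_i = 1 - y_i(<w, x_i> + b)
    return 0.5 * (w @ w) + C * np.count_nonzero(u > 0)

# Three samples in R^2; the third one falls inside the margin.
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.1, 0.0]])
y = np.array([1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])
print(l01_svm_objective(w, 0.0, X, y, C=1.0))
```

Only the number of positive entries of `u` enters the objective, not their size, so moving the third sample further to the wrong side would not change the value.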

### 1.5 Contributions

In this paper, we first study the theoretical properties of the $L_{0/1}$-SVM model and then design a new efficient and robust algorithm to solve it. The main contributions of the paper can be summarized as follows.

• We prove the existence of a global minimizer of $L_{0/1}$-SVM, which has not been thoroughly studied in prior works. Based on the explicit expressions of the subdifferential and proximal operator of the loss (3), we introduce two types of optimality conditions for the problem: KKT and P-stationary points. We then unravel the relationships among a global/local minimizer and these two types of points. This result is essential to our algorithmic design later on.

• We adopt the well-known alternating direction method of multipliers (ADMM) to solve the $L_{0/1}$-SVM problem, and the resulting method is dubbed LADMM. We show that if the sequence generated by the proposed method converges, then it must converge to a P-stationary point. To the best of our knowledge, this is the first method that aims at solving (4) directly rather than its surrogate model (2). The novelty of the method lies in using the $L_{0/1}$-support vector operator as a filter to prevent outliers from being support vectors during the training process.

• We compare LADMM with four other leading methods for solving SVM problems on synthetic and real data sets. Extensive numerical experiments demonstrate that our proposed method achieves better performance in terms of higher prediction accuracy, a smaller number of support vectors and shorter computational time.

This paper is organized as follows. In Section 2, we give the explicit expressions of three subdifferentials of the $L_{0/1}$ soft-margin loss and derive its proximal operator. Section 3 presents the main theoretical contributions: we show the existence of a global minimizer of problem (4) and investigate the relationships among a global/local minimizer and the KKT/P-stationary points of the $L_{0/1}$-SVM problem. In Section 4, we introduce the $L_{0/1}$-support vector operator and design the algorithm based on the optimality conditions established in the previous section. Numerical experiments, including a comparison with other solvers, and concluding remarks are given in the last two sections.

## 2 Subdifferential and Proximal Operator

To analyze the properties of the $L_{0/1}$ soft-margin loss, we first introduce the necessary background on the subdifferential and the proximal operator of $L_{0/1}$.

### 2.1 L0/1 Subdifferential

From [24, Definition 8.3], for a proper and lower semicontinuous function $f$ on $\mathbb{R}^m$, the regular, limiting and horizon subdifferentials are defined respectively as

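$$\begin{aligned} \hat\partial f(u) &= \Big\{v\in\mathbb{R}^m:\ \liminf_{z\to u,\ z\ne u}\frac{f(z)-f(u)-\langle v,z-u\rangle}{\|z-u\|}\ge0\Big\},\\ \partial f(u) &= \limsup_{z\xrightarrow{f}u}\hat\partial f(z)=\big\{v\in\mathbb{R}^m:\ \exists\, z^j\xrightarrow{f}u,\ v^j\in\hat\partial f(z^j)\ \text{with}\ v^j\to v\big\},\\ \partial^\infty f(u) &= \limsup_{\sigma\downarrow0,\ z\xrightarrow{f}u}\sigma\hat\partial f(z)=\big\{v\in\mathbb{R}^m:\ \exists\, z^j\xrightarrow{f}u,\ v^j\in\hat\partial f(z^j),\ \sigma_j\downarrow0\ \text{with}\ \sigma_jv^j\to v\big\}, \end{aligned}$$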

where $\sigma\downarrow0$ means $\sigma>0$ and $\sigma\to0$, and $z\xrightarrow{f}u$ means both $z\to u$ and $f(z)\to f(u)$. If the function is convex, then the limiting subdifferential is also known as the subgradient.

###### Lemma 2.1

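The regular, limiting and horizon subdifferentials of $\|u_+\|_0$ at $u\in\mathbb{R}^m$ enjoy the following property,

$$\Omega(u) := \hat\partial\|u_+\|_0=\partial\|u_+\|_0=\partial^\infty\|u_+\|_0 = \left\{v\in\mathbb{R}^m:\ v_i\begin{cases}\ge0, & u_i=0,\\ =0, & u_i\ne0,\end{cases}\ i\in N_m\right\}.$$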

We use a simple example to illustrate the three subdifferentials of $\ell_{0/1}$ in the one-dimensional case. As shown in Figure 1, the red lines denote some elements of $\partial\ell_{0/1}(0)$. In fact, all lines through the origin with nonnegative slope make up the subdifferential $\partial\ell_{0/1}(0)$.

Our next result is about proximal operator, which will be very useful in designing the algorithm in Section 4.

### 2.2 L0/1 Proximal Operator

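By [25, Definition 12.23], the proximal operator of $f$, associated with a parameter $\alpha>0$, at a point $s\in\mathbb{R}$, is defined by

$$\mathrm{Prox}_{\alpha f}(s)=\operatorname*{argmin}_{u\in\mathbb{R}}\ \alpha f(u)+\frac{1}{2}(u-s)^2. \tag{6}$$

The following lemma states that the proximal operator admits a closed-form solution when $f=\ell_{0/1}$.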

###### Lemma 2.2 (One-dimensional case)

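For an $\alpha>0$, the proximal operator of $\ell_{0/1}$ at $s\in\mathbb{R}$ is given by

$$\mathrm{Prox}_{\alpha\ell_{0/1}}(s):=\begin{cases}0, & 0\le s<\sqrt{2\alpha},\\ 0\ \text{or}\ s, & s=\sqrt{2\alpha},\\ s, & s>\sqrt{2\alpha}\ \text{or}\ s<0.\end{cases} \tag{7}$$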

It is worth mentioning that the proximal operator is not unique if $s=\sqrt{2\alpha}$ in (7). To guarantee uniqueness, hereafter we always choose the proximal operator to be zero whenever it is not unique. With this convention, the proximal operator of $\ell_{0/1}$ is rewritten as

$$\mathrm{Prox}_{\alpha\ell_{0/1}}(s)=\begin{cases}0, & 0\le s\le\sqrt{2\alpha},\\ s, & \text{otherwise}.\end{cases} \tag{8}$$

The proximal operator of $\ell_{0/1}$ is shown in Figure 2, where the red line denotes the proximal operator.
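As a sanity check on the threshold $\sqrt{2\alpha}$ in (7) and (8), one can compare the two candidate minimizers of (6) directly. The following is only a sketch of that comparison, with $\varphi_s$ introduced here as shorthand for the objective in (6) with $f=\ell_{0/1}$; it is not the authors' proof.

$$\begin{aligned} \varphi_s(u) &:= \alpha\,\ell_{0/1}(u)+\tfrac{1}{2}(u-s)^2,\\ s\le0:&\quad \varphi_s(s)=0\ \text{is the smallest possible value, so the minimizer is } u=s,\\ s>0:&\quad \min_{u\le0}\varphi_s(u)=\varphi_s(0)=\tfrac{1}{2}s^2,\qquad \inf_{u>0}\varphi_s(u)=\varphi_s(s)=\alpha,\\ &\quad \tfrac{1}{2}s^2<\alpha\ \Longleftrightarrow\ s<\sqrt{2\alpha}\ \Rightarrow\ u=0,\qquad \tfrac{1}{2}s^2>\alpha\ \Longleftrightarrow\ s>\sqrt{2\alpha}\ \Rightarrow\ u=s. \end{aligned}$$

At $s=\sqrt{2\alpha}$ both candidates give the same value $\alpha$, which is exactly the tie that (8) resolves in favor of zero.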

Based on the one-dimensional case, we can derive the proximal operator of $L_{0/1}$. The proof is similar to that of Lemma 2.2 and is thus omitted.

###### Lemma 2.3 (Multi-dimensional case)

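For an $\alpha>0$, the proximal operator of $L_{0/1}$ at $s\in\mathbb{R}^m$ is given by

$$\mathrm{Prox}_{\alpha L_{0/1}}(s)=\begin{bmatrix}\mathrm{Prox}_{\alpha\ell_{0/1}}(s_1)\\ \vdots\\ \mathrm{Prox}_{\alpha\ell_{0/1}}(s_m)\end{bmatrix}. \tag{9}$$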
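Because (9) applies the scalar rule (8) entry by entry, the proximal operator vectorizes in one line. The following NumPy sketch uses the tie-breaking convention of (8); the function name `prox_l01` is illustrative.

```python
import numpy as np

def prox_l01(s, alpha):
    """Entrywise proximal operator of alpha * L_{0/1}, following (8) and (9).

    Entries with 0 <= s_i <= sqrt(2 * alpha) are set to zero; all other
    entries (negative or large positive) are left unchanged.
    """
    s = np.asarray(s, dtype=float)
    zero_out = (s >= 0) & (s <= np.sqrt(2.0 * alpha))
    return np.where(zero_out, 0.0, s)

# With alpha = 0.5 the threshold sqrt(2 * alpha) equals 1.
print(prox_l01([-1.0, 0.5, 1.0, 1.5], alpha=0.5))
```

Unlike soft thresholding, entries that survive are not shrunk, and negative entries are never modified.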

To proceed further, we consider the following problem

$$\min_{u\in\mathbb{R}^m}\ f_C(u):=h(u)+C\|u_+\|_0, \tag{10}$$

where $h$ is a smooth convex function whose gradient is Lipschitz continuous, and $C>0$ is given. To characterize a global solution of the above problem, as in [26], we introduce an auxiliary problem,

$$\min_{u\in\mathbb{R}^m}\ f_\gamma(u,z) := C\|u_+\|_0+h(z)+\langle\nabla h(z),u-z\rangle+\frac{1}{2\gamma}\|u-z\|^2,$$

for some $\gamma>0$ and fixed $z\in\mathbb{R}^m$, where $\nabla h(z)$ is the gradient of $h$ at $z$. This problem allows us to obtain a result related to the proximal operator of $L_{0/1}$.

###### Lemma 2.4

For any given , we have following results.

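• If $u^*$ is the global optimal solution to the auxiliary problem (2.2) for any fixed $\gamma>0$ and $z\in\mathbb{R}^m$, then it holds

$$u^*=\mathrm{prox}_{\gamma CL_{0/1}}(z-\gamma\nabla h(z)).$$

• If $u^*$ is a global optimal solution to (10), then it is also a global optimal solution to the auxiliary problem (2.2) with $z=u^*$ and any $\gamma>0$ not exceeding the reciprocal of the Lipschitz constant of $\nabla h$, namely,

$$f_C(u^*)=f_\gamma(u^*,u^*)\le f_\gamma(u,u^*),\quad \forall\, u\in\mathbb{R}^m.$$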

This lemma suffices to show that a global optimal solution to (10) must satisfy a fixed-point equation, which is stated in the following theorem. Its proof is straightforward and is omitted here.

###### Theorem 2.1

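If $u^*$ is a global optimal solution to (10), then for any given $\gamma>0$ not exceeding the reciprocal of the Lipschitz constant of $\nabla h$ it satisfies

$$u^*=\mathrm{prox}_{\gamma CL_{0/1}}(u^*-\gamma\nabla h(u^*)). \tag{12}$$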

## 3 Optimality Conditions of L0/1-SVM

This section establishes the existence of optimal solutions of $L_{0/1}$-SVM and two types of first-order optimality conditions: KKT points and P-stationary points.

### 3.1 Existence of L0/1-SVM Minimizer

###### Theorem 3.1

Assume is finite-valued. Then the solution set of (4) is bounded and its global minimizer exists.

We observe that may be an optimal solution (trivial solution) to (4), which possibly incorrectly predict the corresponding label for some new input vector because . However, for any , it follows from that

$$f(\mathbf{0};b)=C\|(\mathbf{1}-by)_+\|_0=C\min\{m_+,m_-\},$$

where $m_+$ and $m_-$ denote the number of positive and negative labels in $y$. Based on the above equation, any optimal solution satisfying

$$f(w;b)<C\min\{m_+,m_-\}$$

is a non-trivial optimal solution to (4).

### 3.2 First-Order Optimality Condition

In this subsection, we discuss the first-order optimality conditions for problem (4). To this end, we introduce a variable $u\in\mathbb{R}^m$ to equivalently reformulate (4) as

$$\min_{w\in\mathbb{R}^n,\,b\in\mathbb{R},\,u\in\mathbb{R}^m}\ \frac{1}{2}\|w\|^2+C\|u_+\|_0\quad \text{s.t.}\quad u+Aw+by=\mathbf{1}.$$

The Lagrangian function of the above problem is

$$L(w,b,u,\lambda)=\frac{1}{2}\|w\|^2+C\|u_+\|_0+\langle\lambda,u+Aw+by-\mathbf{1}\rangle,$$

where $\lambda\in\mathbb{R}^m$ is the Lagrange multiplier, based on which we introduce the well-known Karush-Kuhn-Tucker (KKT) point of problem (3.2).

###### Definition 3.1 (KKT point of (3.2))

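For a given $C>0$, we say that $(w^*,b^*,u^*)$ is a KKT point of problem (3.2) if there is a multiplier vector $\lambda^*\in\mathbb{R}^m$ such that

$$\begin{cases} w^*+A^\top\lambda^*=0,\\ \langle y,\lambda^*\rangle=0,\\ u^*+Aw^*+b^*y=\mathbf{1},\\ 0\in C\,\partial\|u^*_+\|_0+\lambda^*. \end{cases} \tag{15}$$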

The following result reveals the relationship between a local minimizer and a KKT point of (3.2).

###### Theorem 3.2

For a given $C>0$, $(w^*,b^*,u^*)$ is a local minimizer of (3.2) if and only if it is a KKT point.

Now let us define some notation:

$$B:=[A\ \ y]\in\mathbb{R}^{m\times(n+1)},\qquad H:=\begin{bmatrix} I_{n\times n} & 0\\ 0 & 0\end{bmatrix}B^+, \tag{16}$$

where $B^+$ is the generalized inverse of $B$. With this notation, (3.2) can be equivalently rewritten as

$$\min_{u\in\mathbb{R}^m}\ \frac{1}{2}\|H(u-\mathbf{1})\|^2+C\|u_+\|_0, \tag{17}$$

which is an unconstrained non-convex optimization problem. Based on (17), we will derive the proximal stationary point of (3.2), which is useful as a stopping criterion for the algorithm proposed later.

###### Definition 3.2 (P-stationary point of (3.2))

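For a given $C>0$, we say $(w^*,b^*,u^*)$ is a proximal stationary (P-stationary) point of problem (3.2) if there is a multiplier vector $\lambda^*\in\mathbb{R}^m$ and a constant $\gamma>0$ such that

$$\begin{cases} w^*+A^\top\lambda^*=0,\\ \langle y,\lambda^*\rangle=0,\\ u^*+Aw^*+b^*y=\mathbf{1},\\ \mathrm{prox}_{\gamma CL_{0/1}}(u^*-\gamma\lambda^*)=u^*. \end{cases} \tag{18}$$

We now reveal the relationship between a global minimizer and a P-stationary point of (3.2). Before doing so, let

$$\gamma_H:=1/\lambda_{\max}(H^\top H),$$

where $\lambda_{\max}(H^\top H)$ denotes the maximum eigenvalue of $H^\top H$.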

###### Theorem 3.3

Assume has a full column rank. For a given , if is a global minimizer of (3.2) then it is a P-stationary point with .

Note that having a full column rank means m n. However, numerical experiments will demonstrate that our proposed algorithm also works for the cases of mn in terms of finding a P-stationary point. To end this section, we also unravel the relationship between a P-stationary point and a KKT point of (3.2).

###### Theorem 3.4

For a given , if is a P-stationary point with of (3.2), then it is also KKT point.

The above two theorems state that a global minimizer of (3.2) is a P-stationary point which is also a KKT point. Most importantly, we could use the P-stationary point as a termination rule in terms of guaranteeing the local optimality of a point generated by the algorithm proposed in next section.
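Since P-stationarity is meant to serve as a termination rule, it can be useful to see how the four conditions in (18) translate into a numerical residual. The sketch below is one possible way to do this, not the authors' stopping test; the function names, the choice of taking the maximum of the four violations, and the tiny example are all assumptions.

```python
import numpy as np

def prox_l01(s, alpha):
    """Entrywise proximal operator of alpha * L_{0/1}, following (8)."""
    s = np.asarray(s, dtype=float)
    return np.where((s >= 0) & (s <= np.sqrt(2.0 * alpha)), 0.0, s)

def p_stationarity_residual(w, b, u, lam, A, y, C, gamma):
    """Largest violation among the four conditions in (18)."""
    r_w = np.linalg.norm(w + A.T @ lam)                  # w + A^T lam = 0
    r_b = abs(y @ lam)                                   # <y, lam> = 0
    r_c = np.linalg.norm(u + A @ w + b * y - 1.0)        # u + A w + b y = 1
    r_p = np.linalg.norm(prox_l01(u - gamma * lam, gamma * C) - u)
    return max(r_w, r_b, r_c, r_p)

# Hand-built point that satisfies (18); it is not claimed to be optimal.
y = np.array([1.0, -1.0])
A = y[:, None] * np.array([[1.0], [1.0]])   # assumption: A = Diag(y) X
w, b, lam = np.zeros(1), 1.0, np.zeros(2)
u = 1.0 - A @ w - b * y                      # enforces the linear constraint
print(p_stationarity_residual(w, b, u, lam, A, y, C=1.0, gamma=0.5))
```

In practice one would compare this residual with a small tolerance rather than with zero.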

## 4 Algorithmic Design

In this section, we introduce the concept of the $L_{0/1}$-support vector operator and describe how ADMM can be applied to solve the $L_{0/1}$-SVM problem (3.2).

### 4.1 L0/1-Support Vector Operator

In SVMs, the optimal hyperplane is actually determined by only a small portion of the training samples. These samples are called support vectors. It is well known that soft-margin loss functions have zero subdifferentials at non-support vectors [14, 13, 28, 29]. In other words, to select support vectors, one could find the samples at which the loss function has nonzero subdifferentials. However, this approach is not suitable for the $\ell_{0/1}$ soft-margin loss, since $\partial\ell_{0/1}(0)=\mathbb{R}_+$ and $\partial\ell_{0/1}(t)=\{0\}$ elsewhere. This indicates that samples with $t\ne0$ always have zero subdifferentials and samples with $t=0$ can also have a zero subgradient due to $0\in\partial\ell_{0/1}(0)$, which may lead to an empty set of support vectors. To overcome this drawback, we introduce a novel selection scheme, the $L_{0/1}$-support vector operator, to choose the samples that serve as support vectors.

###### Definition 4.1 (L0/1-support vector operator)

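For a given $\alpha>0$, the $L_{0/1}$-support vector operator is defined by

$$T_\alpha(z) := \{i\in N_m:\ [\mathrm{prox}_{\alpha L_{0/1}}(z)]_i=0\}. \tag{19}$$

Hereafter, we let $z_T$ (resp. $A_T$) be the sub-vector (resp. sub-matrix) containing the elements of $z$ (resp. rows of $A$) indexed by $T$. Let $T:=T_\alpha(z)$ and its complementary set be $\overline{T}:=N_m\setminus T$. It follows from Definition 4.1 and (8) that

$$u=\mathrm{prox}_{\alpha L_{0/1}}(z)\quad\Longleftrightarrow\quad \begin{bmatrix}u_T\\ u_{\overline{T}}-z_{\overline{T}}\end{bmatrix}=0. \tag{20}$$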

The above equivalence will help us to design the algorithm that we are ready to outline as below.
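By (8), an entry of the proximal operator vanishes exactly when $0\le z_i\le\sqrt{2\alpha}$, so the index set in (19) can be computed without forming the proximal operator at all. A small NumPy sketch follows; the function name is illustrative, and indices are 0-based in code whereas $N_m$ is 1-based in the text.

```python
import numpy as np

def support_vector_operator(z, alpha):
    """Return T_alpha(z) from (19) as an array of 0-based indices.

    By (8), [prox_{alpha L_{0/1}}(z)]_i = 0 iff 0 <= z_i <= sqrt(2 * alpha).
    """
    z = np.asarray(z, dtype=float)
    return np.flatnonzero((z >= 0) & (z <= np.sqrt(2.0 * alpha)))

z = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
T = support_vector_operator(z, alpha=0.5)       # threshold sqrt(2 * 0.5) = 1
T_bar = np.setdiff1d(np.arange(z.size), T)      # complementary index set
print(T, T_bar)
```

Samples with very large $z_i$ fall outside $T_\alpha(z)$, which is the filtering effect on outliers that the operator is designed for.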

The augmented Lagrangian function associated with the model (3.2) can be written as

$$L_\sigma(w,b,u,\lambda)=\frac{1}{2}\|w\|^2+C\|u_+\|_0+\langle\lambda,\varpi\rangle+\frac{\sigma}{2}\|\varpi\|^2, \tag{21}$$

where $\lambda$ is the Lagrangian multiplier, $\sigma>0$ is a given parameter and

$$\varpi:=u+Aw+by-\mathbf{1}.$$

We apply ADMM to the augmented Lagrangian function. Given the $k$th iterate $(w^k,b^k,u^k,\lambda^k)$, the framework takes the following form

$$\begin{aligned} u^{k+1} &= \operatorname*{argmin}_{u\in\mathbb{R}^m}\ L_\sigma(w^k,b^k,u,\lambda^k),\\ w^{k+1} &= \operatorname*{argmin}_{w\in\mathbb{R}^n}\ L_\sigma(w,b^k,u^{k+1},\lambda^k)+\frac{\sigma}{2}\|w-w^k\|^2_{D_k},\\ b^{k+1} &= \operatorname*{argmin}_{b\in\mathbb{R}}\ L_\sigma(w^{k+1},b,u^{k+1},\lambda^k),\\ \lambda^{k+1} &= \lambda^k+\eta\sigma\varpi^{k+1}, \end{aligned} \tag{22}$$

where $\eta>0$ is referred to as the dual step size and $\varpi^{k+1}:=u^{k+1}+Aw^{k+1}+b^{k+1}y-\mathbf{1}$. Here,

$$\|w-w^k\|^2_{D_k}=\langle w-w^k,D_k(w-w^k)\rangle$$

is the so-called proximal term and $D_k$ is symmetric. Note that if $D_k$ is positive semidefinite, then the above framework is the standard semi-proximal ADMM [30]. However, the authors of [31, 32, 33] have also investigated ADMM with indefinite proximal terms, namely with $D_k$ indefinite. The basic principle for choosing $D_k$ is to guarantee the convexity of the $w$-subproblem of (22). Since $L_\sigma$ here is strongly convex with respect to $w$, $D_k$ can be chosen as a negative semidefinite matrix. The flexibility in selecting $D_k$ allows us to design a very efficient algorithm when support vectors are used.

We now describe how each subproblem of (22) can be addressed efficiently and how the support vectors can be used to reduce the computational cost.

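(i) Updating $u^{k+1}$. By (19), we denote

$$z^k:=\mathbf{1}-Aw^k-b^ky-\lambda^k/\sigma,\qquad T_k:=T_{C/\sigma}(z^k). \tag{23}$$

Then the $u$-subproblem of (22) is reformulated as

$$u^{k+1}=\operatorname*{argmin}_{u\in\mathbb{R}^m}\ C\|u_+\|_0+\frac{\sigma}{2}\|u-z^k\|^2=\mathrm{Prox}_{\frac{C}{\sigma}L_{0/1}}(z^k),$$

which, combined with (20), results in

$$u^{k+1}_{T_k}=0,\qquad u^{k+1}_{\overline{T}_k}=z^k_{\overline{T}_k}. \tag{24}$$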

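(ii) Updating $w^{k+1}$. We always choose

$$D_k=-A_{\overline{T}_k}^\top A_{\overline{T}_k}, \tag{25}$$

which enables us to derive the $w$-subproblem of (22) as

$$\begin{aligned} w^{k+1} &= \operatorname*{argmin}_{w\in\mathbb{R}^n}\ \frac{1}{2}\|w\|^2+\frac{\sigma}{2}\|Aw-v^k\|^2+\frac{\sigma}{2}\|w-w^k\|^2_{-A_{\overline{T}_k}^\top A_{\overline{T}_k}}\\ &= \operatorname*{argmin}_{w\in\mathbb{R}^n}\ \frac{1}{2}\|w\|^2+\frac{\sigma}{2}\|Aw-v^k\|^2-\frac{\sigma}{2}\|A_{\overline{T}_k}w-A_{\overline{T}_k}w^k\|^2, \end{aligned} \tag{26}$$

where $v^k:=-(u^{k+1}+b^ky-\mathbf{1}+\lambda^k/\sigma)$. Moreover,

$$\begin{aligned} v^k_{\overline{T}_k} &= -\big(u^{k+1}_{\overline{T}_k}+b^ky_{\overline{T}_k}-\mathbf{1}+\lambda^k_{\overline{T}_k}/\sigma\big)\\ &= -\big(z^k_{\overline{T}_k}+b^ky_{\overline{T}_k}-\mathbf{1}+\lambda^k_{\overline{T}_k}/\sigma\big)\\ &= A_{\overline{T}_k}w^k, \end{aligned}$$

where the second and third equations follow from (24) and (23). Now we rewrite (26) as

$$\begin{aligned} w^{k+1} &= \operatorname*{argmin}_{w\in\mathbb{R}^n}\ \frac{1}{2}\|w\|^2+\frac{\sigma}{2}\|Aw-v^k\|^2-\frac{\sigma}{2}\|A_{\overline{T}_k}w-v^k_{\overline{T}_k}\|^2\\ &= \operatorname*{argmin}_{w\in\mathbb{R}^n}\ \frac{1}{2}\|w\|^2+\frac{\sigma}{2}\|A_{T_k}w-v^k_{T_k}\|^2. \end{aligned} \tag{27}$$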

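To solve (27), we need to find the solution to the equation

$$\underbrace{\big(I+\sigma A_{T_k}^\top A_{T_k}\big)}_{=:P_k}\,w=\sigma A_{T_k}^\top v^k_{T_k}. \tag{28}$$

Note that $A_{T_k}\in\mathbb{R}^{|T_k|\times n}$, where $|T_k|$ is the cardinality of $T_k$. Then (28) can be addressed efficiently by the following rules:

• If $n\le|T_k|$, one can solve (28) directly through

$$w^{k+1}=\sigma P_k^{-1}A_{T_k}^\top v^k_{T_k}. \tag{29}$$

• If $n>|T_k|$, the matrix inversion lemma enables us to calculate the inverse as

$$P_k^{-1}=I-\sigma A_{T_k}^\top Q_k^{-1}A_{T_k},\qquad Q_k:=I+\sigma A_{T_k}A_{T_k}^\top. \tag{30}$$

Then we update $w^{k+1}$ as

$$\begin{aligned} w^{k+1} &= \sigma A_{T_k}^\top v^k_{T_k}-\sigma A_{T_k}^\top Q_k^{-1}\sigma A_{T_k}A_{T_k}^\top v^k_{T_k}\\ &= \sigma A_{T_k}^\top v^k_{T_k}-\sigma A_{T_k}^\top Q_k^{-1}(Q_k-I)v^k_{T_k}\\ &= \sigma A_{T_k}^\top Q_k^{-1}v^k_{T_k}. \end{aligned} \tag{31}$$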
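The two rules above solve the same linear system (28), once in the $n\times n$ form and once in the $|T_k|\times|T_k|$ form. The following NumPy sketch implements both branches; the function name and the random test data are illustrative, and the branch condition $n\le|T_k|$ is an assumption about where to switch.

```python
import numpy as np

def solve_w(A_T, v_T, sigma):
    """Solve (I + sigma * A_T^T A_T) w = sigma * A_T^T v_T, i.e. (28).

    A_T holds the rows of A indexed by T_k and v_T the matching entries of v^k.
    """
    s, n = A_T.shape
    if n <= s:
        # (29): work with the n x n matrix P_k directly.
        P = np.eye(n) + sigma * A_T.T @ A_T
        return np.linalg.solve(P, sigma * A_T.T @ v_T)
    # (31): work with the smaller s x s matrix Q_k instead.
    Q = np.eye(s) + sigma * A_T @ A_T.T
    return sigma * A_T.T @ np.linalg.solve(Q, v_T)

rng = np.random.default_rng(0)
A_T = rng.standard_normal((3, 50))      # few support vectors, many features
v_T = rng.standard_normal(3)
w = solve_w(A_T, v_T, sigma=2.0)
P = np.eye(50) + 2.0 * A_T.T @ A_T
print(np.allclose(P @ w, 2.0 * A_T.T @ v_T))
```

Using `np.linalg.solve` rather than forming an explicit inverse is a design choice of this sketch; either branch only ever factorizes a matrix of size $\min\{n,|T_k|\}$.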

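(iii) Updating $b^{k+1}$. By letting $r^k:=-(u^{k+1}-\mathbf{1}+Aw^{k+1}+\lambda^k/\sigma)$, it follows from the $b$-subproblem in (22) that

$$\begin{aligned} b^{k+1} &= \operatorname*{argmin}_{b\in\mathbb{R}}\ \frac{\sigma}{2}\|u^{k+1}-\mathbf{1}+Aw^{k+1}+by\|^2+\langle\lambda^k,by\rangle\\ &= \operatorname*{argmin}_{b\in\mathbb{R}}\ \frac{\sigma}{2}\|by-r^k\|^2\\ &= \langle y,r^k\rangle/\|y\|^2=\langle y,r^k\rangle/m. \end{aligned} \tag{32}$$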

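(iv) Updating $\lambda^{k+1}$. According to (15) and Lemma 2.1, $u^*$ and $\lambda^*$ satisfy the relation $-\lambda^*\in C\,\partial\|u^*_+\|_0$, namely $\lambda^*_i=0$ if $u^*_i\ne0$. Based on this, we update the Lagrangian multiplier in the following way:

$$\begin{cases} \lambda^{k+1}_{T_k}=\lambda^k_{T_k}+\eta\sigma\varpi^{k+1}_{T_k},\\ \lambda^{k+1}_{\overline{T}_k}=0. \end{cases} \tag{33}$$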

We now summarize the framework of the algorithm in Algorithm 1. We call the method LADMM, an abbreviation for $L_{0/1}$-SVM solved by ADMM.
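The four updates (23) to (33) can be chained into a single iteration, as in the NumPy sketch below. This is a sketch under stated assumptions and not a reproduction of Algorithm 1: the function name `ladmm_step`, the handling of an empty working set, and the definitions of `v` and `r` (obtained here by completing the square in the $w$- and $b$-subproblems) are assumptions, and no stopping rule or parameter tuning is included.

```python
import numpy as np

def ladmm_step(w, b, lam, A, y, C, sigma, eta):
    """One pass through updates (23)-(33); returns (w, b, u, lam)."""
    m, n = A.shape
    # (23): z^k and the working set T_k = T_{C/sigma}(z^k).
    z = 1.0 - A @ w - b * y - lam / sigma
    in_T = (z >= 0) & (z <= np.sqrt(2.0 * C / sigma))
    # (24): u is zero on T_k and equals z^k elsewhere.
    u = np.where(in_T, 0.0, z)
    # (27)-(31): the w-update only uses rows of A indexed by T_k.
    v = -(u + b * y - 1.0 + lam / sigma)       # assumed definition of v^k
    A_T, v_T = A[in_T], v[in_T]
    s = A_T.shape[0]
    if s == 0:
        w_new = np.zeros(n)                     # (27) reduces to min 0.5*||w||^2
    elif n <= s:
        P = np.eye(n) + sigma * A_T.T @ A_T
        w_new = np.linalg.solve(P, sigma * A_T.T @ v_T)
    else:
        Q = np.eye(s) + sigma * A_T @ A_T.T
        w_new = sigma * A_T.T @ np.linalg.solve(Q, v_T)
    # (32): closed-form b-update; ||y||^2 = m because y_i is +1 or -1.
    r = -(u - 1.0 + A @ w_new + lam / sigma)   # assumed definition of r^k
    b_new = (y @ r) / m
    # (33): dual update on T_k, multiplier reset to zero elsewhere.
    resid = u + A @ w_new + b_new * y - 1.0
    lam_new = np.where(in_T, lam + eta * sigma * resid, 0.0)
    return w_new, b_new, u, lam_new

# Toy usage: four points in R^2 with labels +1/-1 (hypothetical data).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
A = y[:, None] * X                               # assumption: A = Diag(y) X
w, b, lam = np.zeros(2), 0.0, np.zeros(4)
for _ in range(20):
    w, b, u, lam = ladmm_step(w, b, lam, A, y, C=1.0, sigma=1.0, eta=1.0)
print(w, b)
```

A practical implementation would wrap this step in a loop that stops once a P-stationarity residual based on (18) falls below a tolerance.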

###### Remark 4.1

We have some comments on Algorithm 1 regarding to the computational complexity. Note that in each step, updating dominates the whole computation, which needs solve a linear equation system through (29) or (31). If , then the computational complexities of calculating and are and with , respectively. If , then the computational complexities of calculating and are and with , respectively. Overall the whole complexity in each step is . Therefore, if there are few number of support vectors, namely is very small, then the complexity is very low, which allows us to do large scale computations.

The following theorem shows that if the sequence generated by LADMM converges, then it must converge to a P-stationary point of (3.2).

###### Theorem 4.1

Let be the limit point of the sequence generated by LADMM. Then