1 Introduction
The support vector machine (SVM) was first introduced by Vapnik and Cortes [1]
and has since been widely applied in machine learning, statistics, pattern recognition, and related fields. The basic idea is to find a hyperplane in the input space that separates the training data. In this paper, we consider a binary classification problem described as follows. Suppose we are given a training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$,
where $\mathbf{x}_i \in \mathbb{R}^n$ are the input vectors and $y_i \in \{-1, +1\}$ are the output labels. The purpose of SVM is to train a hyperplane $\langle \mathbf{w}, \mathbf{x} \rangle + b = 0$, with $\mathbf{w} \in \mathbb{R}^n$ and $b \in \mathbb{R}$, from the given training set. For any new input vector $\mathbf{x}$, we predict the corresponding label as $+1$ if $\langle \mathbf{w}, \mathbf{x} \rangle + b > 0$ and $-1$ otherwise. In order to find the optimal hyperplane, two cases must be distinguished: linearly separable and linearly inseparable training data. If the training data can be linearly separated in the input space, then the unique optimal hyperplane is obtained by solving the convex quadratic programming (QP) problem
$$\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{s.t.} \quad y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \ge 1, \quad i = 1, \dots, m, \tag{1}$$
where $\|\cdot\|$ denotes the Euclidean norm. Here, $y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b)/\|\mathbf{w}\|$ gives the distance between the $i$-th sample and the hyperplane. The above model is termed the hard-margin SVM because it requires all samples to be classified correctly.
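To make the hard-margin formulation (1) concrete, the following sketch solves it for a tiny separable data set with a generic convex-optimization modeling library; the data, the cvxpy solver choice, and all variable names are illustrative assumptions rather than part of the original text.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (hypothetical): m = 4 samples in R^2.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape

w = cp.Variable(n)
b = cp.Variable()

# Hard-margin SVM (1): minimize (1/2)||w||^2  s.t.  y_i(<w, x_i> + b) >= 1.
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()

# Predict the label of a new input as the sign of <w, x> + b.
x_new = np.array([1.5, 1.0])
print(np.sign(w.value @ x_new + b.value))
```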
When the training data are linearly inseparable in the input space, the popular approach is to allow violations of the constraints in problem (1) and to penalize such violations in the objective function, namely,
$$\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{m} \ell\big(1 - y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b)\big), \tag{2}$$
where $C > 0$ is a penalty parameter and $\ell(\cdot)$
is a loss function that penalizes the incorrectly classified samples while leaving the correctly classified ones untouched. Therefore, the above model allows misclassified samples and is thus known as the soft-margin SVM. Clearly, different soft-margin loss functions yield different soft-margin SVM models. Generally speaking, soft-margin loss functions can be grouped into two categories based on the convexity of $\ell$.
1.1 Convex Soft-Margin Losses
Since a large number of convex soft-margin loss functions have been proposed to deal with soft-margin SVM problems, we only review some popular ones.

Hinge loss function: $\ell_{\mathrm{hinge}}(t) := \max\{t, 0\}$, which is non-differentiable at $t = 0$ and unbounded. The SVM with the hinge loss was first proposed by Vapnik and Cortes [1], aiming at penalizing only the samples with $y_i(\langle\mathbf{w}, \mathbf{x}_i\rangle + b) < 1$.
Since these functions are convex, their corresponding SVM models are not difficult to handle. However, convexity often goes hand in hand with unboundedness, which deprives these loss functions of robustness to outliers in the training data. To overcome this drawback, the authors of [11, 12] set an upper bound and force the loss to stop increasing beyond a certain point; a small numerical illustration of this capping effect is given below. In doing so, the originally convex loss functions become non-convex.
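As a minimal numerical sketch of this capping idea (not taken from [11, 12]), the snippet below compares the unbounded hinge loss with a truncated version on a few margin values, including one outlier-like value; the cap parameter `mu` is an assumed illustration.

```python
import numpy as np

def hinge(t):
    # Hinge loss max(t, 0): grows without bound as the violation t increases.
    return np.maximum(t, 0.0)

def truncated_hinge(t, mu=2.0):
    # Capped (ramp-type) loss: stops increasing once t exceeds mu, hence bounded.
    return np.minimum(np.maximum(t, 0.0), mu)

# t_i = 1 - y_i(<w, x_i> + b); the last value mimics a badly mislabeled outlier.
t = np.array([-1.0, 0.5, 1.5, 11.0])
print(hinge(t))            # [ 0.   0.5  1.5 11. ]  the outlier dominates
print(truncated_hinge(t))  # [ 0.   0.5  1.5  2. ]  the outlier's influence is capped
```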
1.2 Non-Convex Soft-Margin Losses
Again, since a large number of non-convex soft-margin losses have been studied, a complete review is beyond our scope, and we only present some of them.

Ramp loss function: $\ell_{\mathrm{ramp}}(t) := \min\{\max\{t, 0\}, \mu\}$ with a truncation parameter $\mu > 0$, which is non-differentiable at $t = 0$ and $t = \mu$ but bounded between $0$ and $\mu$. It does not penalize the case $t \le 0$, pays a linear penalty when $0 < t < \mu$, and a fixed penalty $\mu$ when $t \ge \mu$. This makes the function robust to outliers. The SVM with the ramp loss was investigated in [13].

Truncated pinball loss function (the pinball loss truncated on its left side): it involves two parameters, is non-differentiable at two points, and is unbounded. The penalty is fixed at a constant level once the argument falls below a negative threshold and is linear otherwise. The SVM with this loss can be found in [14].

Asymmetrically truncated pinball loss function (the pinball loss truncated on both sides): it also involves two parameters and is non-differentiable at several points, but it is bounded. The penalty is fixed at one constant level for sufficiently large arguments and at another for sufficiently negative arguments, and is linear in between. The SVM with this loss was proposed in [15].

Sigmoid loss function: a differentiable function bounded between 0 and 1. It penalizes all samples. The SVM with this loss (sigmoid-SVM) can be found in [16].

Some other non-convex loss functions have also been proposed, for example the normalized sigmoid cost loss function [17].
Compared with convex soft-margin losses, most non-convex loss functions are less sensitive to feature noise or outliers thanks to their boundedness. On the other hand, non-convexity leads to computational difficulties when solving the corresponding SVM models. In summary, the basic principles for choosing a soft-margin loss are threefold [18], [19]: (i) it should capture the discrete nature of binary classification; (ii) it should preferably be bounded, so as to be robust to feature noise or outliers; (iii) the SVM model based on it should be easy to compute.
1.3 The $\ell_{0/1}$ Soft-Margin Loss
Taking the above principles into consideration, we now introduce the 0/1 ($\ell_{0/1}$ for short) soft-margin loss, defined as
$$\ell_{0/1}(t) := \begin{cases} 1, & t > 0, \\ 0, & t \le 0. \end{cases}$$
The $\ell_{0/1}$ soft-margin loss function is the most natural loss function for binary classification [20], [21]. Its properties are summarized below.

It is discontinuous at $t = 0$, which captures the discrete nature of binary classification (correctness or incorrectness) [22].

It is lower semicontinuous and non-convex by the definition in [23]. Since it takes only the values 0 and 1, both sparsity and robustness are guaranteed. In fact, it does not count the samples with $t \le 0$, which leads to sparsity, while it returns 1 otherwise, which ensures robustness to outliers.

It is differentiable everywhere except at $t = 0$. However, it has the subdifferential $[0, +\infty)$ at $t = 0$
and zero gradients elsewhere, see Lemma 2.1, which makes the computation tractable.
1.4 $L_{0/1}$-SVM
For the sake of easing the reading, we present some notation here. Let $\|\mathbf{z}\|$ and $\|\mathbf{z}\|_0$ be the Euclidean norm and the zero norm of $\mathbf{z}$, the latter counting the number of nonzero elements of $\mathbf{z}$. Denote $A := DX \in \mathbb{R}^{m \times n}$ with $X := [\mathbf{x}_1, \dots, \mathbf{x}_m]^{\top}$ and $D := \mathrm{Diag}(\mathbf{y})$, where $\mathrm{Diag}(\mathbf{y})$ is a diagonal matrix whose diagonal elements are the elements of $\mathbf{y}$. For a positive integer $m$ and a vector $\mathbf{z} \in \mathbb{R}^m$, denote $[m] := \{1, \dots, m\}$, $\mathbf{e} := (1, \dots, 1)^{\top} \in \mathbb{R}^m$ and $\mathbf{z}_+ := (\max\{z_1, 0\}, \dots, \max\{z_m, 0\})^{\top}$.
These notations indicate
$$\|\mathbf{z}_+\|_0 = \sum_{i=1}^{m} \ell_{0/1}(z_i), \tag{3}$$
which returns the number of all positive elements in $\mathbf{z}$. We call (3) the $\ell_{0/1}$ soft-margin loss. Now, replacing $\ell$ by $\ell_{0/1}$ in (2) and using the above notation allows us to rewrite model (2) in the matrix form
$$\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 + C\,\|(\mathbf{e} - A\mathbf{w} - b\mathbf{y})_+\|_0. \tag{4}$$
We call this model $L_{0/1}$-SVM. Its objective function is lower semicontinuous, non-differentiable and non-convex, and is therefore difficult to solve directly by most existing optimization algorithms. Although the discrete nature of the zero norm makes the above model NP-hard, the $L_{0/1}$-SVM model is an ideal SVM model because it guarantees as few misclassified samples as possible for binary classification. Therefore, we carry out this paper along with this model.
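As a concrete reading of (3) and (4), the sketch below evaluates the $L_{0/1}$-SVM objective for a candidate pair $(\mathbf{w}, b)$; the dense NumPy representation and the function name are assumptions made purely for illustration.

```python
import numpy as np

def l01_svm_objective(w, b, X, y, C):
    """Evaluate 0.5*||w||^2 + C*||(e - Aw - b*y)_+||_0 with A = Diag(y) X,
    i.e. C times the number of samples i with 1 - y_i(<w, x_i> + b) > 0."""
    margins = 1.0 - y * (X @ w + b)          # entries of e - Aw - b*y
    return 0.5 * np.dot(w, w) + C * np.count_nonzero(margins > 0)
```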
1.5 Contributions
In this paper, we first study the theoretical properties of the $L_{0/1}$-SVM model and then design a new efficient and robust algorithm to solve it. The main contributions of the paper can be summarized as follows.

We prove the existence of a global minimizer of $L_{0/1}$-SVM, which has not been thoroughly studied in prior work. Based on the explicit expressions of the subdifferential and the proximal operator of the loss (3), we introduce two types of optimality conditions for the problem: KKT points and P-stationary points. We then unravel the relationships among a global/local minimizer and these two kinds of points. This result is essential to our later algorithmic design.

We adopt the well-known alternating direction method of multipliers (ADMM) to solve the $L_{0/1}$-SVM problem; the method is therefore dubbed $L_{0/1}$-ADMM. We show that if the sequence generated by the proposed method converges, then it must converge to a P-stationary point. To the best of our knowledge, this is the first method that aims at solving (4) directly rather than its surrogate model (2). The novelty of the method lies in using the support vector operator as a filter to prevent outliers from becoming support vectors during the training process.

We compare $L_{0/1}$-ADMM with four other leading existing methods for solving SVM problems on synthetic and real data sets. Extensive numerical experiments demonstrate that our proposed method achieves better performance in terms of higher prediction accuracy, a smaller number of support vectors, and shorter computational time.
This paper is organized as follows. In Section 2, we give the explicit expressions of three subdifferentials of the $\ell_{0/1}$ soft-margin loss and derive its proximal operator. Section 3 presents the main theoretical contributions: we show the existence of a global minimizer of problem (4) and investigate the relationships among a global/local minimizer and the KKT/P-stationary points of the $L_{0/1}$-SVM problem. In Section 4, we introduce the support vector operator and design the algorithm based on the optimality conditions established in the previous section. Numerical experiments, including comparisons with other solvers, and concluding remarks are given in the last two sections.
2 Subdifferential and Proximal Operator
To analyze the properties of the $\ell_{0/1}$ soft-margin loss, we need to introduce the necessary background on the subdifferential and the proximal operator of $\ell_{0/1}$.
2.1 Subdifferential
From [24, Definition 8.3], for a proper and lower semicontinuous function $f$, the regular, limiting and horizon subdifferentials at $\bar{\mathbf{x}}$ are defined respectively as
$$\hat{\partial} f(\bar{\mathbf{x}}) := \Big\{ \mathbf{v} : \liminf_{\mathbf{x} \to \bar{\mathbf{x}},\, \mathbf{x} \ne \bar{\mathbf{x}}} \frac{f(\mathbf{x}) - f(\bar{\mathbf{x}}) - \langle \mathbf{v}, \mathbf{x} - \bar{\mathbf{x}} \rangle}{\|\mathbf{x} - \bar{\mathbf{x}}\|} \ge 0 \Big\},$$
$$\partial f(\bar{\mathbf{x}}) := \Big\{ \mathbf{v} : \exists\, \mathbf{x}^k \xrightarrow{f} \bar{\mathbf{x}},\ \mathbf{v}^k \in \hat{\partial} f(\mathbf{x}^k),\ \mathbf{v}^k \to \mathbf{v} \Big\},$$
$$\partial^{\infty} f(\bar{\mathbf{x}}) := \Big\{ \mathbf{v} : \exists\, \mathbf{x}^k \xrightarrow{f} \bar{\mathbf{x}},\ \mathbf{v}^k \in \hat{\partial} f(\mathbf{x}^k),\ \lambda^k \downarrow 0,\ \lambda^k \mathbf{v}^k \to \mathbf{v} \Big\},$$
where $\mathbf{x}^k \xrightarrow{f} \bar{\mathbf{x}}$ means $\mathbf{x}^k \to \bar{\mathbf{x}}$ and $f(\mathbf{x}^k) \to f(\bar{\mathbf{x}})$, and $\lambda^k \downarrow 0$ means $\lambda^k > 0$ and $\lambda^k \to 0$. If the function $f$ is convex, then the limiting subdifferential coincides with the subdifferential in the sense of convex analysis.
Lemma 2.1
The regular, limiting and horizon subdifferentials of $\ell_{0/1}$ at $t \in \mathbb{R}$ enjoy the following property:
$$\hat{\partial}\ell_{0/1}(t) = \partial\ell_{0/1}(t) = \partial^{\infty}\ell_{0/1}(t) = \begin{cases} [0, +\infty), & t = 0, \\ \{0\}, & t \ne 0. \end{cases}$$
We use a simple example to illustrate the three subdifferentials of $\ell_{0/1}$. Consider the one-dimensional case at $t = 0$. As shown in Figure 1, the red lines denote some elements of $\partial\ell_{0/1}(0)$; in fact, all right slashes crossing the origin constitute the subdifferential $\partial\ell_{0/1}(0)$.
Our next result concerns the proximal operator, which will be very useful in designing the algorithm in Section 4.
2.2 Proximal Operator
By [25, Definition 12.23], the proximal operator of $\lambda\|(\cdot)_+\|_0$, associated with a parameter $\lambda > 0$, at a point $\mathbf{a} \in \mathbb{R}^m$, is defined by
$$\mathrm{prox}_{\lambda\|(\cdot)_+\|_0}(\mathbf{a}) := \mathop{\arg\min}_{\mathbf{v} \in \mathbb{R}^m} \ \lambda\,\|\mathbf{v}_+\|_0 + \frac{1}{2}\|\mathbf{v} - \mathbf{a}\|^2. \tag{6}$$
The following lemma states that the proximal operator admits a closed-form solution when $m = 1$.
Lemma 2.2 (One-dimensional case)
For an $a \in \mathbb{R}$, the proximal operator of $\lambda\ell_{0/1}$ at $a$ is given by
$$\mathrm{prox}_{\lambda\ell_{0/1}}(a) = \begin{cases} a, & a \le 0 \ \text{or} \ a > \sqrt{2\lambda}, \\ \{0, a\}, & a = \sqrt{2\lambda}, \\ 0, & 0 < a < \sqrt{2\lambda}. \end{cases} \tag{7}$$
It is worth mentioning that the proximal operator may not be unique when $a = \sqrt{2\lambda}$ in (7). To guarantee uniqueness, hereafter we always choose the proximal operator to be zero whenever it is not unique. Because of this, the proximal operator of $\lambda\ell_{0/1}$ is rewritten as
$$\mathrm{prox}_{\lambda\ell_{0/1}}(a) = \begin{cases} a, & a \le 0 \ \text{or} \ a > \sqrt{2\lambda}, \\ 0, & 0 < a \le \sqrt{2\lambda}. \end{cases} \tag{8}$$
The proximal operator of $\lambda\ell_{0/1}$ is shown in Figure 2, where the red line denotes the proximal operator.
Based on the one-dimensional case, we can derive the proximal operator of $\lambda\|(\cdot)_+\|_0$. The proof is similar to that of Lemma 2.2 and is thus omitted.
Lemma 2.3 (Multi-dimensional case)
For an $\mathbf{a} \in \mathbb{R}^m$, the proximal operator of $\lambda\|(\cdot)_+\|_0$ at $\mathbf{a}$ is given elementwise by
$$\big[\mathrm{prox}_{\lambda\|(\cdot)_+\|_0}(\mathbf{a})\big]_i = \mathrm{prox}_{\lambda\ell_{0/1}}(a_i) = \begin{cases} a_i, & a_i \le 0 \ \text{or} \ a_i > \sqrt{2\lambda}, \\ 0, & 0 < a_i \le \sqrt{2\lambda}, \end{cases} \qquad i \in [m]. \tag{9}$$
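The closed forms (7)-(9) amount to an elementwise thresholding rule. The sketch below assumes the proximal operator is defined as in (6), with the tie at $\sqrt{2\lambda}$ resolved to zero as stated above.

```python
import numpy as np

def prox_l01(a, lam):
    """Elementwise proximal operator of lam*||(.)_+||_0, following (8)-(9):
    entries with 0 < a_i <= sqrt(2*lam) are set to 0, all others are kept."""
    a = np.asarray(a, dtype=float)
    out = a.copy()
    out[(a > 0.0) & (a <= np.sqrt(2.0 * lam))] = 0.0
    return out

# Example: with lam = 0.5 the threshold is sqrt(2*lam) = 1.
print(prox_l01([-0.3, 0.4, 1.0, 1.7], 0.5))   # [-0.3  0.   0.   1.7]
```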
To proceed further, we consider the following problem
$$\min_{\mathbf{v} \in \mathbb{R}^m} \ \lambda\,\|\mathbf{v}_+\|_0 + f(\mathbf{v}), \tag{10}$$
where $f$ is a smooth convex function whose gradient is Lipschitz continuous with a Lipschitz constant $L_f > 0$, and $\lambda > 0$ is given. To characterize the global solutions of the above problem, in the same way as [26], we introduce an auxiliary problem,
$$\min_{\mathbf{v} \in \mathbb{R}^m} \ \lambda\,\|\mathbf{v}_+\|_0 + f(\mathbf{z}) + \langle \nabla f(\mathbf{z}), \mathbf{v} - \mathbf{z} \rangle + \frac{1}{2\alpha}\|\mathbf{v} - \mathbf{z}\|^2 \tag{11}$$
for some $\alpha > 0$ and a fixed $\mathbf{z} \in \mathbb{R}^m$, where $\nabla f$ is the gradient of $f$. This problem allows us to acquire a result related to the proximal operator of $\lambda\|(\cdot)_+\|_0$.
Lemma 2.4
This lemma suffices to show that a global optimal solution to (10) must satisfy a fixed-point equation, which is established in the following theorem; its proof is straightforward and is omitted here.
Theorem 2.1
If $\mathbf{v}^*$ is a global optimal solution to (10), then for any given $\alpha \in (0, 1/L_f)$ it satisfies
$$\mathbf{v}^* = \mathrm{prox}_{\alpha\lambda\|(\cdot)_+\|_0}\big(\mathbf{v}^* - \alpha\nabla f(\mathbf{v}^*)\big). \tag{12}$$
3 Optimality Conditions of $L_{0/1}$-SVM
This section establishes the existence of optimal solutions of $L_{0/1}$-SVM and two types of first-order optimality conditions: KKT points and P-stationary points.
3.1 Existence of an $L_{0/1}$-SVM Minimizer
Theorem 3.1
Assume the penalty parameter $C$ is finite. Then the solution set of (4) is bounded and a global minimizer of (4) exists.
We observe that $(\mathbf{w}, b) = (\mathbf{0}, b)$ may be an optimal solution (a trivial solution) to (4), which possibly predicts the label of some new input vector incorrectly because the decision value $\langle \mathbf{w}, \mathbf{x} \rangle + b$ no longer depends on $\mathbf{x}$. However, for any $b \in \mathbb{R}$, the objective value of (4) at $(\mathbf{0}, b)$ is at least $C\min\{m_+, m_-\}$,
where $m_+$ and $m_-$ denote the number of positive and negative labels in $\mathbf{y}$. Based on this, any optimal solution whose objective value is smaller than $C\min\{m_+, m_-\}$
is a nontrivial optimal solution to (4).
3.2 First-Order Optimality Condition
In this subsection, we discuss the first-order optimality conditions for problem (4). To this end, we introduce a variable $\mathbf{u} \in \mathbb{R}^m$ to equivalently reformulate (4) as
$$\min_{\mathbf{w}, b, \mathbf{u}} \ \frac{1}{2}\|\mathbf{w}\|^2 + C\,\|\mathbf{u}_+\|_0 \qquad \text{s.t.} \quad \mathbf{u} + A\mathbf{w} + b\mathbf{y} = \mathbf{e}.$$
The Lagrangian function of the above problem is
$$L(\mathbf{w}, b, \mathbf{u}; \boldsymbol{\lambda}) := \frac{1}{2}\|\mathbf{w}\|^2 + C\,\|\mathbf{u}_+\|_0 + \langle \boldsymbol{\lambda}, \mathbf{u} + A\mathbf{w} + b\mathbf{y} - \mathbf{e} \rangle,$$
where $\boldsymbol{\lambda} \in \mathbb{R}^m$ is the Lagrange multiplier, based on which we introduce the well-known Karush-Kuhn-Tucker (KKT) point of problem (3.2).
Definition 3.1 (KKT point of (3.2))
For a given $C > 0$, we say that $(\mathbf{w}^*, b^*, \mathbf{u}^*)$ is a KKT point of problem (3.2) if there is a multiplier vector $\boldsymbol{\lambda}^* \in \mathbb{R}^m$ such that
$$\mathbf{w}^* + A^{\top}\boldsymbol{\lambda}^* = \mathbf{0}, \qquad \langle \mathbf{y}, \boldsymbol{\lambda}^* \rangle = 0, \qquad -\boldsymbol{\lambda}^* \in C\,\partial\|\mathbf{u}^*_+\|_0, \qquad \mathbf{u}^* + A\mathbf{w}^* + b^*\mathbf{y} = \mathbf{e}. \tag{15}$$
The following result reveals the relationship between a local minimizer and a KKT point of (3.2).
Theorem 3.2
For a given $C > 0$, $(\mathbf{w}^*, b^*, \mathbf{u}^*)$ is a local minimizer of (3.2) if and only if it is a KKT point.
Now let us define some notation,
(16)
where $(\cdot)^{\dagger}$ denotes the generalized inverse. These notations allow us to equivalently rewrite (3.2) as
(17)
which is an unconstrained non-convex optimization problem. Based on (17), we derive the proximal stationary point of (3.2); this point will serve as a stopping criterion for the algorithm proposed later.
Definition 3.2 (P-stationary point of (3.2))
For a given $C > 0$, we say $(\mathbf{w}^*, b^*, \mathbf{u}^*)$ is a proximal stationary (P-stationary) point of problem (3.2) if there exist a multiplier vector $\boldsymbol{\lambda}^*$ and a constant $\sigma > 0$ such that
(18)
We now reveal the relationship between a global minimizer and a P-stationary point of (3.2). Before doing so, we introduce a threshold constant for $\sigma$ defined through the maximum eigenvalue of $A^{\top}A$.
Theorem 3.3
Assume $A$ has full column rank. For a given $C > 0$, if $(\mathbf{w}^*, b^*, \mathbf{u}^*)$ is a global minimizer of (3.2), then it is a P-stationary point with $\sigma$ chosen according to the threshold above.
Note that $A$ having full column rank means $m \ge n$. However, numerical experiments will demonstrate that our proposed algorithm also works in the case $m < n$ in terms of finding a P-stationary point. To end this section, we also unravel the relationship between a P-stationary point and a KKT point of (3.2).
Theorem 3.4
For a given $C > 0$, if $(\mathbf{w}^*, b^*, \mathbf{u}^*)$ is a P-stationary point of (3.2) with some $\sigma > 0$, then it is also a KKT point.
The above two theorems state that a global minimizer of (3.2) is a P-stationary point, which in turn is also a KKT point. Most importantly, we can use the P-stationary point as a termination rule to certify the local optimality of a point generated by the algorithm proposed in the next section.
4 Algorithmic Design
In this section, we introduce the concept of the support vector operator and describe how ADMM can be applied to solve the $L_{0/1}$-SVM problem (3.2).
4.1 Support Vector Operator
In SVMs, the optimal hyperplane is actually determined by only a small portion of the training samples. These samples are called support vectors. It is well known that soft-margin loss functions have zero subdifferentials at non-support vectors [14, 13, 28, 29]. In other words, to select support vectors, one could look for the samples at which the loss function has a nonzero subdifferential. However, this approach is not suitable for the $\ell_{0/1}$ soft-margin loss, since by Lemma 2.1 its subdifferential is $\{0\}$ everywhere except at the origin, where it still contains $0$. This indicates that samples with $1 - y_i(\langle\mathbf{w}, \mathbf{x}_i\rangle + b) < 0$ always have zero subdifferentials and samples with $1 - y_i(\langle\mathbf{w}, \mathbf{x}_i\rangle + b) > 0$ also have zero subdifferentials, which would probably lead to an empty set of support vectors. To overcome this drawback, we introduce a novel selection scheme, the
support vector operator, to choose the samples that serve as support vectors.
Definition 4.1 (support vector operator)
For a given vector, the support vector operator is defined by
(19) 
Hereafter, for an index set $T \subseteq [m]$, we let $\mathbf{z}_T$ (resp. $A_T$) be the subvector (resp. submatrix) containing the elements of $\mathbf{z}$ (resp. the rows of $A$) indexed by $T$, and we let $\overline{T} := [m] \setminus T$ denote the complementary set of $T$. Combining Definition 4.1 with (8) then leads to
(20) 
The above equivalence will help us design the algorithm, which we are ready to outline below.
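Because the explicit formula (19) is not reproduced above, the sketch below shows one natural reading that is consistent with (8) and (20): the operator collects the indices whose entries the proximal operator maps to zero, so that entries exceeding the threshold (typically large violations caused by outliers) are filtered out. The function name and the threshold form $\sqrt{2\lambda}$ are assumptions rather than the authors' exact definition.

```python
import numpy as np

def support_vector_operator(z, lam):
    """Hypothetical index-selection rule consistent with (8): return the set
    T = { i : 0 < z_i <= sqrt(2*lam) }, i.e. the entries that the proximal
    operator of lam*||(.)_+||_0 would set to zero."""
    z = np.asarray(z, dtype=float)
    return np.flatnonzero((z > 0.0) & (z <= np.sqrt(2.0 * lam)))

# Entries above the threshold (e.g. 5.0 below) are excluded, which is how
# outliers are prevented from becoming support vectors.
print(support_vector_operator([-0.2, 0.3, 0.9, 5.0], 0.5))   # [1 2]
```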
4.2 Framework of ADMM
The augmented Lagrangian function associated with model (3.2) can be written as
(21)
where $\boldsymbol{\lambda} \in \mathbb{R}^m$ is the Lagrange multiplier and $\sigma > 0$ is a given penalty parameter.
We take advantage of ADMM to minimize the augmented Lagrangian function. Given the $k$-th iterate, the framework takes the following form
(22)
where the step length used in the multiplier update is referred to as the dual step size. Here,
the additional quadratic term in (22) is the so-called proximal term, which is induced by a symmetric matrix. Note that if this matrix is positive semidefinite, then the above framework is the standard semi-proximal ADMM [30]. However, the authors of [31, 32, 33] have also investigated ADMM with indefinite proximal terms. The basic principle for choosing the proximal matrix is to guarantee the convexity of the corresponding subproblem of (22). Since the augmented Lagrangian here is strongly convex with respect to $\mathbf{w}$, the proximal matrix can even be chosen to be negative semidefinite. This flexibility allows us to design a very efficient algorithm when support vectors are used.
4.3 $L_{0/1}$-ADMM
We now describe how each subproblem of (22) can be addressed efficiently, as well as how the support vectors can be used to reduce the computational cost.
(i) Updating $\mathbf{u}^{k+1}$. By (19), we denote
(23)
Then the $\mathbf{u}$-subproblem of (22) reduces to a proximal step of $\|(\cdot)_+\|_0$,
which, combined with (20), results in
(24) 
(ii) Updating $\mathbf{w}^{k+1}$. We always choose the proximal matrix as
(25)
which enables us to derive the $\mathbf{w}$-subproblem of (22) as
(26)
Moreover, by a direct calculation using (24) and (23), we can rewrite (26) as
(27)  
To solve (27), we need to find the solution to the equation
(28) 
Note that $A_{T_k} \in \mathbb{R}^{|T_k| \times n}$, where $|T_k|$ is the cardinality of $T_k$. Then (28) can be addressed efficiently by the following rules:

If $|T_k| \ge n$, one could just solve (28) through
(29) 
If $|T_k| < n$, the matrix inversion lemma enables us to calculate the inverse as
(30) Then we update $\mathbf{w}^{k+1}$ as
(31)
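Rules (29)-(31) solve the same linear system either directly or through the matrix inversion lemma, depending on the number of active support vectors. Since the exact matrices in (23)-(28) are not reproduced above, the sketch below assumes a system of the generic form $(I + \sigma A_{T}^{\top} A_{T})\mathbf{w} = \mathbf{r}$ and only illustrates when each rule is cheaper.

```python
import numpy as np

def solve_w_system(A_T, r, sigma):
    """Solve (I + sigma * A_T^T A_T) w = r, where A_T stacks the support-vector
    rows (an assumed form of (28)).  If |T| >= n, solve the n x n system
    directly as in (29); otherwise use the matrix inversion lemma
    (I + sigma A^T A)^{-1} = I - sigma A^T (I + sigma A A^T)^{-1} A,
    so that only a |T| x |T| system is solved, as in (30)-(31)."""
    t_size, n = A_T.shape
    if t_size >= n:
        return np.linalg.solve(np.eye(n) + sigma * A_T.T @ A_T, r)
    small = np.eye(t_size) + sigma * A_T @ A_T.T
    return r - sigma * A_T.T @ np.linalg.solve(small, A_T @ r)
```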
(iii) Updating $b^{k+1}$. It follows from the $b$-subproblem in (22) that
(32)  
(iv) Updating $\boldsymbol{\lambda}^{k+1}$. According to (15) and Lemma 2.1, $\boldsymbol{\lambda}^*$ and $\mathbf{u}^*$ satisfy the relation $\lambda^*_i = 0$ whenever $u^*_i \ne 0$. Based on this, we update the Lagrange multiplier in the following way:
(33) 
We now summarize the framework of the algorithm in Algorithm 1. We call the method $L_{0/1}$-ADMM, an abbreviation for $L_{0/1}$-SVM solved by ADMM.
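Putting steps (i)-(iv) together, a bare-bones version of the iteration can be sketched as follows. It drops the proximal term and the support-vector reduction, and the parameter names ($\sigma$, $\eta$, the stopping rule) are illustrative assumptions, so it should be read as a generic ADMM sketch for the reformulated model rather than as Algorithm 1 itself.

```python
import numpy as np

def l01_admm_sketch(X, y, C=1.0, sigma=1.0, eta=1.0, max_iter=200, tol=1e-6):
    """Simplified illustration of an ADMM scheme of the form (22) applied to the
    reformulation of (4) with constraint u + Aw + b*y = e, where A = Diag(y) X.
    This is a generic sketch, not the authors' L0/1-ADMM."""
    y = np.asarray(y, dtype=float)
    m, n = X.shape
    A = y[:, None] * X                       # A = Diag(y) X
    e = np.ones(m)
    w, b = np.zeros(n), 0.0
    u, lam = np.zeros(m), np.zeros(m)        # auxiliary variable and multiplier

    def prox_l01(a, t):
        # elementwise proximal operator of t*||(.)_+||_0, cf. (8)-(9)
        out = a.copy()
        out[(a > 0.0) & (a <= np.sqrt(2.0 * t))] = 0.0
        return out

    for _ in range(max_iter):
        # u-step: proximal operator of (C/sigma)*||(.)_+||_0, cf. (24)
        u = prox_l01(e - A @ w - b * y - lam / sigma, C / sigma)
        # w-step: strongly convex quadratic, a linear system as in (28)
        rhs = sigma * A.T @ (e - u - b * y - lam / sigma)
        w = np.linalg.solve(np.eye(n) + sigma * A.T @ A, rhs)
        # b-step: one-dimensional quadratic (y^T y = m), cf. (32)
        b = y @ (e - u - A @ w - lam / sigma) / m
        # multiplier step with dual step size eta, cf. (33)
        res = u + A @ w + b * y - e
        lam = lam + eta * sigma * res
        if np.linalg.norm(res) < tol * np.sqrt(m):
            break
    return w, b
```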
Remark 4.1
We have some comments on Algorithm 1 regarding its computational complexity. Note that in each iteration, updating $\mathbf{w}^{k+1}$ dominates the whole computation, since it requires solving a linear system through (29) or (31). If $|T_k| \ge n$, the dominant cost is that of forming and solving the $n \times n$ system in (29); if $|T_k| < n$, it is that of forming and solving the $|T_k| \times |T_k|$ system in (31). Therefore, if there are only a few support vectors, namely $|T_k|$ is very small, then the per-iteration complexity is very low, which allows us to perform large-scale computations.
The following theorem shows that if the sequence generated by $L_{0/1}$-ADMM converges, then it must converge to a P-stationary point of (3.2).
Theorem 4.1
Let $(\mathbf{w}^*, b^*, \mathbf{u}^*; \boldsymbol{\lambda}^*)$ be a limit point of the sequence generated by $L_{0/1}$-ADMM. Then it is a P-stationary point of (3.2).