Support vector machine (SVM) was first introduced by Vapnik and Cortes and has since been widely applied in machine learning, statistics, pattern recognition, and so forth. The basic idea is to find a hyperplane in the input space that separates the training data set. In this paper, we consider a binary classification problem that can be described as follows. Suppose we are given a training set $\{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in \mathbb{R}^n$ are the input vectors and $y_i \in \{-1, +1\}$ are the output labels. The purpose of SVM is to train a hyperplane $\langle w, x \rangle + b = 0$ with $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ from the given training set. For any new input vector $x$, we predict the corresponding label as $+1$ if $\langle w, x \rangle + b > 0$ and $-1$ otherwise. To find the optimal hyperplane, there are two possible cases: linearly separable and linearly inseparable training data. If the training data can be linearly separated in the input space, then the unique optimal hyperplane can be obtained by solving a convex quadratic programming (QP) problem:
$$\min_{w \in \mathbb{R}^n,\, b \in \mathbb{R}}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(\langle w, x_i \rangle + b) \ge 1,\ i = 1, \dots, m, \tag{1}$$
where $\|\cdot\|$ is the Euclidean norm. Here, $|\langle w, x_i \rangle + b|/\|w\|$ gives the distance between the $i$th sample and the hyperplane. The above model is termed hard-margin SVM because it requires correct classification of all samples. For training data that are linearly inseparable in the input space, the popular approach is to allow violations of the constraints in problem (1) and to penalize such violations in the objective function, namely,
$$\min_{w \in \mathbb{R}^n,\, b \in \mathbb{R}}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \ell\big(1 - y_i(\langle w, x_i \rangle + b)\big), \tag{2}$$
where $C > 0$ is a penalty parameter and $\ell : \mathbb{R} \to \mathbb{R}_+$
is a loss function that penalizes incorrectly classified samples and leaves the others alone. The above model therefore allows misclassified samples and is thus known as soft-margin SVM. Clearly, different soft-margin loss functions yield different soft-margin SVM models. Generally speaking, soft-margin loss functions fall into two categories according to the convexity of $\ell$.
1.1 Convex Soft-Margin Losses
Since a large number of convex soft-margin loss functions have been proposed for soft-margin SVM problems, we review only some popular ones.
Hinge loss function: $\ell_{\rm hinge}(t) = \max\{0, t\}$, which is non-differentiable at $t = 0$ and unbounded. SVM with the hinge loss (hinge-SVM) was first proposed by Vapnik and Cortes, aiming at penalizing only the samples with $1 - y_i(\langle w, x_i \rangle + b) > 0$.
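As a concrete illustration of the hinge-loss model (2), here is a minimal subgradient-descent sketch in NumPy. This is not the algorithm developed in this paper; the step size, epoch count, and toy data are illustrative assumptions, and the loss argument follows the convention $t = 1 - y(\langle w, x \rangle + b)$.

```python
import numpy as np

def hinge(t):
    """Hinge loss max{0, t}, evaluated at t = 1 - y * (w @ x + b)."""
    return np.maximum(0.0, t)

def train_hinge_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i hinge(1 - y_i*(w@x_i + b))."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        viol = 1.0 - y * (X @ w + b) > 0            # samples with positive loss
        gw = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        gb = -C * y[viol].sum()
        w -= lr * gw
        b -= lr * gb
    return w, b

# toy linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_hinge_svm(X, y)
assert np.all(np.sign(X @ w + b) == y)
```

On separable data such as the above, the learned hyperplane classifies all training samples correctly; the unboundedness of the hinge loss, however, lets a single outlier dominate the objective.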
Since those functions are convex, their corresponding SVM models are not difficult to deal with. However, convexity often comes with unboundedness, which deprives those loss functions of robustness to outliers in the training data. To overcome this drawback, the authors of [11, 12] set an upper bound and force the loss to stop increasing beyond a certain point. In doing so, the original convex loss functions become non-convex.
1.2 Non-Convex Soft-Margin Losses
Again, a large number of non-convex soft-margin losses have been studied; a full review is beyond our scope, so we present only some of them.
Ramp loss function: $\ell_{\rm ramp}(t) = \min\{\max\{0, t\},\, \mu\}$ with $\mu > 0$, which is non-differentiable at $t = 0$ and $t = \mu$ but bounded between $0$ and $\mu$. It does not penalize the case $t \le 0$, pays a linear penalty when $0 < t < \mu$ and a fixed penalty $\mu$ when $t \ge \mu$. This makes the function robust to outliers. SVM with the ramp loss (ramp-SVM) has been investigated in the literature.
Truncated pinball loss function (pinball loss truncated on the left side): $\ell(t) = t$ for $t \ge 0$, $\ell(t) = -\tau t$ for $-s < t < 0$, and $\ell(t) = \tau s$ for $t \le -s$, with $\tau > 0$ and $s > 0$. It is non-differentiable at $t = 0$ and $t = -s$ and unbounded. The penalty is fixed at $\tau s$ for $t \le -s$ and is linear otherwise. SVM with this loss has been studied in the literature.
Asymmetrically truncated pinball loss function (pinball loss truncated on both sides): both linear pieces of the pinball loss are capped, so the function is non-differentiable at its kink points but bounded. The penalty is fixed at the upper cap for sufficiently large $t$ and at the lower cap for sufficiently negative $t$, and is linear otherwise. SVM with this loss has also been studied.
Sigmoid loss function: a differentiable function bounded between 0 and 1. It penalizes all samples. SVM with this loss (sigmoid-SVM) has been studied in the literature.
Other non-convex losses include the normalized sigmoid cost loss function.
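For intuition about boundedness, the hinge, ramp, and sigmoid losses can be evaluated side by side. The parameter names `cap` (ramp ceiling) and `a` (sigmoid slope) are illustrative assumptions, not the paper's notation, and the loss argument again follows the convention $t = 1 - y(\langle w, x \rangle + b)$.

```python
import numpy as np

def hinge(t):
    # unbounded: grows linearly as t -> +inf
    return np.maximum(0.0, t)

def ramp(t, cap=1.0):
    # hinge clipped at `cap`: fixed penalty once t >= cap, hence bounded
    return np.minimum(np.maximum(0.0, t), cap)

def sigmoid_loss(t, a=1.0):
    # smooth and bounded in (0, 1); assigns a small penalty to every sample
    return 1.0 / (1.0 + np.exp(-a * t))

t = np.array([-5.0, 0.0, 0.5, 100.0])   # last entry mimics an outlier
print(hinge(t))    # outlier incurs loss 100
print(ramp(t))     # outlier capped at 1
```

The outlier contributes a loss of 100 under the hinge loss but at most the cap under the bounded losses, which is precisely the robustness property discussed above.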
Compared with convex soft-margin losses, most non-convex loss functions are less sensitive to feature noise or outliers owing to their boundedness. On the other hand, non-convexity complicates the computation of the corresponding SVM models. In summary, three principles guide the choice of a soft-margin loss: (i) it should capture the discrete nature of binary classification; (ii) it should be bounded so as to be robust to feature noise or outliers; (iii) the resulting SVM model should be computationally tractable.
1.3 $L_{0/1}$ Soft-Margin Loss
Taking the above principles into consideration, we now introduce the 0/1 ($L_{0/1}$ for short) soft-margin loss defined as
$$\ell_{0/1}(t) = \begin{cases} 1, & t > 0, \\ 0, & t \le 0. \end{cases} \tag{3}$$
It is discontinuous at $t = 0$, which captures the discrete nature (correctness or incorrectness) of binary classification.
It is lower semi-continuous and non-convex. Since it takes only the values 0 and 1, sparsity and robustness are guaranteed: it assigns zero loss to the samples with $t \le 0$, which leads to sparsity, while it returns 1 otherwise, which caps the penalty and ensures robustness to outliers.
It is differentiable everywhere except at $t = 0$, where it has subdifferential
$$\partial \ell_{0/1}(0) = [0, +\infty),$$
and zero gradients elsewhere (see Lemma 2.1), which makes the computation tractable.
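In code, the $L_{0/1}$ loss is a one-liner; a sketch applying the scalar definition componentwise:

```python
import numpy as np

def l01(t):
    """0/1 soft-margin loss: unit penalty whenever t > 0
    (the margin constraint is violated), zero otherwise."""
    return (np.asarray(t, dtype=float) > 0).astype(float)

t = np.array([-3.0, 0.0, 1e-9, 50.0])
print(l01(t))  # [0. 0. 1. 1.]
```

Note that summing these values over the samples simply counts the misclassified ones, regardless of how far any outlier lies from the hyperplane.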
For ease of reading, we present some notation here. Let $\|z\|$ and $\|z\|_0$ be the Euclidean norm and the zero norm of $z$, the latter counting the number of non-zero elements of $z$. Denote $A := DX$ with $X := (x_1, \dots, x_m)^{\top}$ and $D := {\rm Diag}(y)$, where ${\rm Diag}(y)$ is a diagonal matrix whose diagonal elements are the elements of $y$. For a positive integer $m$ and a vector $z \in \mathbb{R}^m$, denote $z_+ := (\max\{z_1, 0\}, \dots, \max\{z_m, 0\})^{\top}$.
These notations indicate that problem (2) with the $L_{0/1}$ loss can be equivalently written as
$$\min_{w \in \mathbb{R}^n,\, b \in \mathbb{R}}\ \frac{1}{2}\|w\|^2 + C\,\big\|\big(\mathbf{1} - Aw - by\big)_+\big\|_0. \tag{4}$$
We call this model $L_{0/1}$-SVM. The objective function is lower semicontinuous, non-differentiable, and non-convex, and is difficult to solve directly with most existing optimization algorithms. Although the discrete nature of the zero norm makes the above model NP-hard, the $L_{0/1}$-SVM model is an ideal SVM model because it guarantees as few misclassified samples as possible in binary classification. We therefore develop this paper around this model.
In this paper, we first study the theoretical properties of the $L_{0/1}$-SVM model and then design a new, efficient, and robust algorithm to solve it. The main contributions of the paper can be summarized as follows.
We prove the existence of a global minimizer of $L_{0/1}$-SVM, which has not been thoroughly studied in prior work. Based on the explicit expressions of the subdifferential and proximal operator of the loss (3), we introduce two types of optimality conditions for the problem: KKT points and P-stationary points. We then unravel the relationships among a global/local minimizer and these two kinds of points. This result is essential to our subsequent algorithmic design.
We adopt the well-known alternating direction method of multipliers (ADMM) to solve the $L_{0/1}$-SVM problem; the resulting method is dubbed LADMM. We show that if the sequence generated by the proposed method converges, then it must converge to a P-stationary point. To the best of our knowledge, this is the first method that aims to solve (4) directly rather than a surrogate model of the form (2). The novelty of the method lies in using the $L_{0/1}$-support vector operator as a filter to prevent outliers from becoming support vectors during the training process.
We compare LADMM with four other leading methods for solving SVM problems on synthetic and real data sets. Extensive numerical experiments demonstrate that our proposed method achieves better performance in terms of higher prediction accuracy, a smaller number of support vectors, and shorter computational time.
This paper is organized as follows. In Section 2, we give explicit expressions for three subdifferentials of the $L_{0/1}$ soft-margin loss and derive its proximal operator. Section 3 presents the main theoretical contributions: we show the existence of a global minimizer of problem (4) and investigate the relationships among a global/local minimizer and the KKT/P-stationary points of the $L_{0/1}$-SVM problem. In Section 4, we introduce the $L_{0/1}$-support vector operator and design the algorithm based on the optimality conditions established in the previous section. Numerical experiments, including comparisons with other solvers, and concluding remarks are given in the last two sections.
2 Subdifferential and Proximal Operator
To analyze the properties of the $L_{0/1}$ soft-margin loss, we need to introduce the necessary background on the subdifferential and the proximal operator of $\ell_{0/1}$.
From [24, Definition 8.3], for a proper and lower semicontinuous function $f$ and a point $\bar z$ with $f(\bar z)$ finite, the regular, limiting, and horizon subdifferentials are defined respectively as
$$\widehat{\partial} f(\bar z) := \Big\{ v :\ f(z) \ge f(\bar z) + \langle v, z - \bar z \rangle + o(\|z - \bar z\|) \Big\},$$
$$\partial f(\bar z) := \Big\{ v :\ \exists\, z^k \xrightarrow{f} \bar z,\ v^k \in \widehat{\partial} f(z^k),\ v^k \to v \Big\},$$
$$\partial^{\infty} f(\bar z) := \Big\{ v :\ \exists\, z^k \xrightarrow{f} \bar z,\ v^k \in \widehat{\partial} f(z^k),\ \lambda^k \downarrow 0,\ \lambda^k v^k \to v \Big\},$$
where $z^k \xrightarrow{f} \bar z$ means both $z^k \to \bar z$ and $f(z^k) \to f(\bar z)$. If the function is convex, then the limiting subdifferential coincides with the subdifferential of convex analysis.
The regular, limiting, and horizon subdifferentials of $\ell_{0/1}$ at a point enjoy the following property,
We use a simple example to illustrate the three subdifferentials of $\ell_{0/1}$ in the one-dimensional case. As shown in Figure 1, the red lines denote some elements of the subdifferential at the origin. In fact, all rays of positive slope through the origin together comprise this subdifferential.
Our next result concerns the proximal operator, which will be very useful for designing the algorithm in Section 4.
2.2 Proximal Operator
By [25, Definition 12.23], the proximal operator of $\ell_{0/1}$, associated with a parameter $\lambda > 0$, at a point $a \in \mathbb{R}$, is defined by
$${\rm prox}_{\lambda \ell_{0/1}}(a) := \mathop{\rm argmin}_{t \in \mathbb{R}}\ \lambda\, \ell_{0/1}(t) + \frac{1}{2}(t - a)^2.$$
The following lemma states that the proximal operator admits a closed-form solution in the one-dimensional case.
Lemma 2.2 (One-dimensional case)
For any $a \in \mathbb{R}$ and $\lambda > 0$, the proximal operator of $\lambda\ell_{0/1}$ at $a$ is given by
$${\rm prox}_{\lambda \ell_{0/1}}(a) = \begin{cases} \{a\}, & a \le 0 \ \text{or}\ a > \sqrt{2\lambda}, \\ \{0\}, & 0 < a < \sqrt{2\lambda}, \\ \{0, \sqrt{2\lambda}\}, & a = \sqrt{2\lambda}. \end{cases} \tag{7}$$
It is worth mentioning that the proximal operator may not be single-valued: at $a = \sqrt{2\lambda}$ in (7), both $0$ and $a$ are minimizers. To guarantee uniqueness, hereafter we always choose the proximal operator to be zero whenever it is not unique. With this convention, the proximal operator of $\ell_{0/1}$ is rewritten as
$${\rm prox}_{\lambda \ell_{0/1}}(a) = \begin{cases} a, & a \le 0 \ \text{or}\ a > \sqrt{2\lambda}, \\ 0, & 0 < a \le \sqrt{2\lambda}. \end{cases}$$
The proximal operator of $\ell_{0/1}$ is shown in Figure 2, where the red line denotes its graph.
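Under this tie-breaking convention, the closed form of Lemma 2.2 can be sketched and checked against brute-force grid minimization; `lam` plays the role of $\lambda$.

```python
import numpy as np

def prox_l01(a, lam):
    """prox of lam * l01 at a: argmin_t lam*(t > 0) + 0.5*(t - a)**2.
    Returns a when a <= 0 or a > sqrt(2*lam); returns 0 when
    0 < a <= sqrt(2*lam) (the tie at a = sqrt(2*lam) resolved to 0)."""
    return 0.0 if 0.0 < a <= np.sqrt(2.0 * lam) else a

# quick check against brute-force minimization on a grid
lam = 0.5
for a in (-1.0, 0.3, 0.9, 1.5):
    grid = np.linspace(-3, 3, 60001)
    vals = lam * (grid > 0) + 0.5 * (grid - a) ** 2
    best = grid[np.argmin(vals)]
    assert abs(prox_l01(a, lam) - best) < 1e-3
```

The hard threshold at $\sqrt{2\lambda}$ is what later lets the algorithm "zero out" moderate violations while leaving large (outlier-driven) values untouched.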
Based on the one-dimensional case, we can derive the proximal operator in the multi-dimensional setting. The proof is similar to that of Lemma 2.2 and is thus omitted.
Lemma 2.3 (Multi-dimensional case)
For any $a \in \mathbb{R}^m$ and $\lambda > 0$, the proximal operator of $\lambda\|(\cdot)_+\|_0$ at $a$ is given componentwise by
$$\big[{\rm prox}_{\lambda \|(\cdot)_+\|_0}(a)\big]_i = \begin{cases} a_i, & a_i \le 0 \ \text{or}\ a_i > \sqrt{2\lambda}, \\ 0, & 0 < a_i \le \sqrt{2\lambda}. \end{cases}$$
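Because the loss is separable across components, the multi-dimensional proximal operator simply applies the scalar rule coordinatewise; a sketch:

```python
import numpy as np

def prox_l01_vec(a, lam):
    """Componentwise proximal operator of lam * ||(.)_+||_0:
    zero out the components lying in (0, sqrt(2*lam)], keep the rest."""
    a = np.asarray(a, dtype=float).copy()
    thresh = np.sqrt(2.0 * lam)
    a[(a > 0) & (a <= thresh)] = 0.0
    return a

print(prox_l01_vec([-2.0, 0.5, 1.4, 3.0], lam=1.0))  # [-2.  0.  0.  3.]
```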
To proceed further, we consider the following problem
where $g$ is a smooth convex function whose gradient is Lipschitz continuous with a Lipschitz constant, and the remaining data are given. To characterize the global solutions of the above problem, we introduce an auxiliary problem,
for some $\eta > 0$ and a fixed point, where $\nabla g$ denotes the gradient of $g$. This problem allows us to acquire a result related to the proximal operator of $\ell_{0/1}$.
This lemma shows that a global optimal solution to (10) must satisfy a fixed-point equation, which is established in the following theorem; its proof is straightforward and omitted here.
If a point is a globally optimal solution to (10), then for any given $\eta > 0$ it satisfies
3 Optimality Conditions of $L_{0/1}$-SVM
This section establishes the existence of optimal solutions of $L_{0/1}$-SVM and two types of first-order optimality conditions: KKT points and P-stationary points.
3.1 Existence of an $L_{0/1}$-SVM Minimizer
Assume the objective function is finite-valued. Then the solution set of (4) is bounded and its global minimizer exists.
We observe that a trivial solution may be optimal for (4), and such a solution may predict the label of a new input vector incorrectly. However, it follows that
where the two quantities denote the numbers of positive and negative labels in the training set, respectively. Based on the above equation, any optimal solution satisfying
is a non-trivial optimal solution to (4).
3.2 First-Order Optimality Condition
The Lagrangian function of the above problem is
where the additional vector is the Lagrange multiplier; based on this, we introduce the well-known Karush-Kuhn-Tucker (KKT) points of problem (3.2).
Definition 3.1 (KKT point of (3.2))
For a given parameter, we say that a point is a KKT point of problem (3.2) if there exists a multiplier vector such that
The following result reveals the relationship between a local minimizer and a KKT point of (3.2).
For a given parameter, a point is a local minimizer of (3.2) if and only if it is a KKT point.
Now let us define some notation
where the superscript denotes the generalized inverse of the corresponding matrix. These notations allow us to equivalently rewrite (3.2) as
which is an unconstrained non-convex optimization problem. Based on (17), we will derive the proximal stationary point of (3.2); this point serves as a stopping criterion for the algorithm proposed later.
Definition 3.2 (P-stationary point of (3.2))
For a given parameter, we say a point is a proximal stationary (P-stationary) point of problem (3.2) if there exist a multiplier vector and a constant $\eta > 0$ such that
We now reveal the relationship between a global minimizer and a P-stationary point of (3.2). Before that, let
denote the maximum eigenvalue of the corresponding matrix.
Assume the data matrix has full column rank. For a given parameter, if a point is a global minimizer of (3.2), then it is a P-stationary point for appropriately chosen $\eta$.
Note that full column rank requires $m \ge n$. However, numerical experiments will demonstrate that our proposed algorithm also works in the case $m < n$ in terms of finding a P-stationary point. To end this section, we also unravel the relationship between a P-stationary point and a KKT point of (3.2).
For a given parameter, if a point is a P-stationary point of (3.2) with some $\eta > 0$, then it is also a KKT point.
The above two theorems state that a global minimizer of (3.2) is a P-stationary point, which in turn is a KKT point. Most importantly, we can use the P-stationary point as a termination rule that guarantees the local optimality of the points generated by the algorithm proposed in the next section.
4 Algorithmic Design
In this section, we introduce the concept of the $L_{0/1}$-support vector operator and describe how ADMM can be applied to solve the $L_{0/1}$-SVM problem (3.2).
4.1 $L_{0/1}$-Support Vector Operator
In SVMs, the optimal hyperplane is determined by only a small portion of the training samples; these samples are called support vectors. It is well known that soft-margin loss functions have zero subdifferentials at non-support vectors [14, 13, 28, 29]. In other words, to select support vectors, one could find the samples at which the loss function has a nonzero subdifferential. However, this approach is not suitable for the $L_{0/1}$ soft-margin loss: since the loss is piecewise constant, samples on either side of the discontinuity have zero subdifferentials, which could lead to an empty set of support vectors. To overcome this drawback, we introduce a novel selection scheme, the $L_{0/1}$-support vector operator, to choose the samples that serve as support vectors.
Definition 4.1 ($L_{0/1}$-support vector operator)
For a given point, the $L_{0/1}$-support vector operator is defined by
This leads to
The above equivalence will help us design the algorithm, which we are now ready to outline.
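As a hypothetical sketch (the exact threshold and argument in Definition 4.1 may differ), a selector consistent with the scalar proximal formula of Lemma 2.2 would pick the indices whose components the prox maps to zero:

```python
import numpy as np

def l01_support_set(v, lam):
    """Hypothetical sketch of an L01-style support-vector selector:
    return the indices whose components lie in (0, sqrt(2*lam)],
    i.e. exactly the components that the proximal operator of the
    0/1 loss maps to zero. Components far beyond the threshold
    (outlier-like violations) fall outside the interval and are
    filtered out, so outliers never become support vectors."""
    v = np.asarray(v, dtype=float)
    return np.flatnonzero((v > 0) & (v <= np.sqrt(2.0 * lam)))

v = np.array([-0.3, 0.2, 1.0, 9.0])   # 9.0 mimics an outlier
print(l01_support_set(v, lam=1.0))    # [1 2]
```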
4.2 Framework of ADMM
The augmented Lagrangian function associated with the model (3.2) can be written as
where the new vector is the Lagrangian multiplier, the new scalar is a given penalty parameter, and
We take advantage of ADMM to minimize the augmented Lagrangian. Given the $k$th iterate, the framework takes the following form
where the scalar parameter is referred to as the dual step size. Here,
is the so-called proximal term, with a symmetric matrix. Note that if this matrix is positive semidefinite, then the above framework is the standard semi-proximal ADMM. However, the authors of [31, 32, 33] have also investigated ADMM with indefinite proximal terms. The basic principle for choosing the proximal matrix is to guarantee the convexity of the corresponding subproblem of (22). Since the augmented Lagrangian here is strongly convex with respect to the relevant block, the matrix can even be chosen negative semidefinite. This flexibility allows us to design a very efficient algorithm when support vectors are used.
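To fix ideas about the alternating template in (22), here is the same ADMM pattern on a simpler, standard instance, $\ell_1$-regularized least squares, whose nonsmooth block's prox is soft-thresholding. This only illustrates the ADMM framework, not the LADMM algorithm itself; all parameter values are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, k):
    """Prox of k*||.||_1 (the nonsmooth block's proximal map)."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam=0.01, rho=1.0, iters=300):
    """Generic ADMM: min 0.5*||Ax - b||^2 + lam*||z||_1  s.t.  x = z."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    inv = np.linalg.inv(A.T @ A + rho * np.eye(n))   # factor once, reuse
    Atb = A.T @ b
    for _ in range(iters):
        x = inv @ (Atb + rho * (z - u))              # smooth-block update
        z = soft_threshold(x + u, lam / rho)         # prox (nonsmooth block)
        u = u + x - z                                # multiplier update
    return z

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
x_true = np.array([1.0, 0.0, 0.0, -2.0, 0.0])
b = A @ x_true
assert np.abs(admm_lasso(A, b) - x_true).max() < 0.05
```

In LADMM, the soft-thresholding step is replaced by the hard-thresholding proximal operator of the $L_{0/1}$ loss, which is where the support-vector filtering enters.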
We now describe how each subproblem of (22) can be solved efficiently, and how support vectors can be used to reduce the computational cost.
(i) Updating . By (19), we denote
Then the -subproblem of (22) is reformulated as
which, combined with (20), results in
(ii) Updating . We always choose
which enables us to derive the -subproblem of (22) as
where . Moreover,
To solve (27), we need to find the solution to the equation
Note that the size of the system is governed by the cardinality of the support set. Then (28) can be solved efficiently by the following rules:
If , one could just solve (28) through
Otherwise, the matrix inversion lemma (Sherman-Morrison-Woodbury identity) enables us to calculate the inverse as
Then we update as
(iii) Updating . By letting , it follows from -subproblem in (22) that
We now summarize the framework of the method in Algorithm 1 and call it LADMM, an abbreviation for $L_{0/1}$-SVM solved by ADMM.
We offer some comments on the computational complexity of Algorithm 1. Note that in each iteration, updating the main block dominates the whole computation, requiring the solution of a linear system through (29) or (31), whose cost depends on which of the two cases above applies. Overall, if there are few support vectors, i.e., the cardinality of the support set is very small, then the per-iteration complexity is very low, which allows us to perform large-scale computations.
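The saving from the matrix inversion lemma in step (ii) is easy to verify numerically via the standard Sherman-Morrison-Woodbury identity; the matrix shapes below are illustrative stand-ins for the support-vector submatrix, not the paper's exact matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.5
m, n = 3, 8            # few support vectors (m) vs. feature dimension (n)
A = rng.standard_normal((m, n))

# direct n x n inverse
direct = np.linalg.inv(np.eye(n) / eta + A.T @ A)
# Woodbury: only an m x m inverse is needed
small = np.linalg.inv(np.eye(m) + eta * A @ A.T)
woodbury = eta * np.eye(n) - eta**2 * (A.T @ small @ A)

assert np.allclose(direct, woodbury)
```

When the number of selected support vectors is much smaller than the feature dimension, inverting the small system is the dominant saving behind the low per-iteration cost discussed above.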
The following theorem shows that if the sequence generated by LADMM converges, then it must converge to a P-stationary point of (3.2).
Let be the limit point of the sequence generated by LADMM. Then