Cortes and Vapnik first proposed the support vector machine (SVM) for binary classification in 1995 [2]. The SVM classifier builds a linear or nonlinear maximum-margin decision boundary to separate the data points of two classes. In most practical scenarios, the nonlinear SVM model is preferred. Given a data set $T=\{(x_i, y_i)\}_{i=1}^{m}$, where $y_i \in \{-1, +1\}$ is the label corresponding to $x_i$, the nonlinear SVM problem is formulated as
where $C > 0$ is a penalty parameter and the decision surface is defined by $f(x) = w^{\top}\phi(x) + b$, where $\phi$ is a proper feature map intended to make the mapped higher-dimensional points linearly separable in the feature space. The loss function $\ell$ assigns the penalty for incorrectly classified data points. The 0-1 loss function is a natural choice of $\ell$ because it captures the discrete nature of binary classification; namely, it directly minimizes the number of misclassified instances:
here $\ell_{0/1}(t) = 1$ if $t > 0$ and $\ell_{0/1}(t) = 0$ otherwise. However, the non-convexity and discontinuity of this loss make problems with the 0-1 loss NP-hard to optimize. For decades, rather than confronting the challenge of directly solving the original 0-1 loss SVM, many researchers have resorted to two alternative strategies: replacing the 0-1 loss function with surrogate loss functions, or executing approximate algorithms that exploit the binary-valued character of the 0-1 loss function.
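As a concrete illustration (ours, not from the paper), the empirical 0-1 risk simply counts the margin violations $1 - y_i f(x_i) > 0$:

```python
import numpy as np

def zero_one_loss(y, scores):
    """Number of misclassified points under the soft-margin 0-1 loss.

    A point (x_i, y_i) with decision value f(x_i) is counted when
    1 - y_i * f(x_i) > 0, i.e. when it violates the margin
    y_i * f(x_i) >= 1.
    """
    return int(np.sum(1.0 - y * scores > 0))

y = np.array([+1, -1, +1, -1])
f = np.array([1.2, -0.5, -0.3, -2.0])   # decision values f(x_i)
print(zero_one_loss(y, f))  # 2 points violate the margin
```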
In the SVM community, a tremendous amount of work has been devoted to the first strategy, especially to SVMs with convex surrogates, owing to their ease of use in practical applications and their convenience for theoretical analysis. However, no solid evidence guarantees how well these convex surrogates approximate the 0-1 loss itself, even though an asymptotic relation does exist. The most widely studied surrogates include the hinge loss, the squared hinge loss, the least squares loss, the pinball loss and the log loss. On the other hand, non-convex surrogates such as the ramp loss and the truncated pinball soft-margin loss,
among others, still attract considerable attention in scenarios that require high insensitivity to outliers, and the corresponding non-convex SVMs exhibit relatively better classification accuracy and fewer support vectors (SVs). For the majority of nonlinear SVMs with convex surrogates, the optimization problems can be studied systematically by the dual kernel-based technique combined with the KKT conditions. Unfortunately, this method does not enjoy similar success with many non-convex surrogates. For the 0-1 loss, the ideal candidate for an SVM loss, we find the Lagrangian dual approach hard to apply directly: its core object, the Lagrangian dual function of the 0-1 loss SVM, cannot be expressed explicitly because the 0-1 loss function is non-smooth and non-convex. The failure of this popular technique, together with the lack of other effective routes for directly solving this NP-hard problem, has made exactly solving the 0-1 loss SVM a decades-long challenge in the SVM community.
As to the second strategy, some approximate algorithms for nonlinear SVMs employ the sigmoid function or a sigmoidal transfer function to simulate the 0-1 loss, exploiting their similar numerical characteristics. In addition, Orsenigo and Vercellis proposed multivariate classification trees based on minimum-feature discrete support vector machines, in which an approximate technique solves the mixed integer programming (MIP) model at each node of the trees. It must be acknowledged that these approximate algorithms focus mainly on computational techniques rather than a systematic solving framework with solid theoretical analysis, such as establishing an optimality condition and designing a corresponding algorithm that adheres more closely to the SVM problem itself.
Most recently, Wang et al. have taken a major step with their systematic work on directly solving the linear $L_{0/1}$-SVM, their abbreviation for the linear SVM with 0-1 loss. To the best of our knowledge, they were the first to establish optimality conditions for both the global and local solutions of the linear $L_{0/1}$-SVM, and they then devised a highly efficient algorithm, $L_{0/1}$-ADMM, to find an approximate solution based on these conditions. Given that the fundamental and theoretical results on directly solving the SVM with 0-1 loss remain far from fully developed, the relatively systematic work of Wang and his collaborators on the linear $L_{0/1}$-SVM is a meaningful advance on this topic. We therefore expect their inspiring viewpoint to guide us in building a similar systematic framework for the nonlinear $L_{0/1}$-SVM.
A powerful and popular way to handle nonlinear SVMs is the kernel method. Generally, most nonlinear SVM problems integrate the kernel technique through the Lagrangian dual function. However, as noted above, to the best of our knowledge the explicit formulation of the Lagrangian dual function of the $L_{0/1}$-SVM is inaccessible, because the 0-1 loss function is neither convex nor continuous. This predicament pushes researchers to explore other paths that still exploit the kernel trick effectively without involving the Lagrangian dual problem. One such path is to approximate the kernel function by a proper feature map. The class of random features, a technique for speeding up kernel methods in large-scale problems, belongs to this kind: it avoids the Lagrangian dual of the $L_{0/1}$-SVM by using a feature map to approximately decompose the kernel function, and thereby transforms the nonlinear SVM problem into a linear one.
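To sketch the random-features idea mentioned above (a minimal illustration of random Fourier features for the Gaussian kernel; the feature dimension D and bandwidth gamma below are our own choices, not values from the paper):

```python
import numpy as np

def random_fourier_features(X, D, gamma, rng):
    """Map X (m x n) to D random Fourier features whose inner products
    approximate the Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    m, n = X.shape
    # Frequencies drawn from the spectral density of the Gaussian kernel.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = random_fourier_features(X, D=2000, gamma=0.5, rng=rng)
K_exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
K_approx = Z @ Z.T
print(np.max(np.abs(K_exact - K_approx)))  # small approximation error
```

With the explicit map $z(x)$ in hand, a nonlinear problem can be handed to a linear solver, which is exactly the shortcut described in the text.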
In this article, based on the nonparametric representation theorem, we establish a nonlinear $L_{0/1}$-SVM model, called $L_{0/1}$-KSVM. The new model tactfully incorporates the kernel technique without resorting to the Lagrangian dual function. More importantly, the proposed model also successfully extends the systematic solving framework of the linear model: the corresponding optimality condition and algorithm are explored in a way that parallels the completed linear framework. The main contributions are summarized as follows.
(i) A new formulation of the nonlinear kernel $L_{0/1}$-SVM, called $L_{0/1}$-KSVM, is presented. The proposed $L_{0/1}$-KSVM simultaneously incorporates the kernel technique and inherits the advanced, systematic solving framework of the linear $L_{0/1}$-SVM. Moreover, we prove that $L_{0/1}$-KSVM degenerates into a slightly modified version of the original linear $L_{0/1}$-SVM, obtained by adding an extra regularization term to its objective function.
(ii) A precise definition of the support vector in $L_{0/1}$-KSVM is provided via the proximal constraints of the P-stationary point, and an important conclusion about the SVs is established: the SVs are located only on the two parallel support surfaces on either side of the final decision surface. This geometrical character of the SVs echoes the fact that SVMs with 0-1 loss have far fewer SVs than SVM peers with traditional convex losses.
(iii) In the experimental part, with the linear $L_{0/1}$-SVM and six other classic nonlinear SVMs as performance baselines, the strong sparsity and reasonable robustness of the proposed $L_{0/1}$-KSVM are validated. $L_{0/1}$-KSVM produces far fewer SVs, in stark contrast to all the experimental counterparts, while achieving decent predictive accuracy compared with the six classic nonlinear SVMs.
The rest of the paper is organized as follows. Section 2 gives a broad outline of the linear $L_{0/1}$-SVM and then presents the two foundations for modelling the nonlinear $L_{0/1}$-KSVM: the kernel method and the representation theorem. The whole framework of the nonlinear SVM with 0-1 soft-margin loss, called $L_{0/1}$-KSVM and the topic of this paper, is investigated explicitly in Section 3. Lastly, Section 4 demonstrates by experiments the advantages of $L_{0/1}$-KSVM in strong sparsity and robustness, compared with the linear $L_{0/1}$-SVM and the six other classic nonlinear SVMs.
2 Preliminary knowledge
First, we briefly review the linear $L_{0/1}$-SVM, its model and its optimality condition; the remaining part then presents the kernel method and the representation theorem, both of which are vital to establishing our final model $L_{0/1}$-KSVM. For convenience, some notations are listed in Table 1.
$\|u\|_0$: the number of non-zero elements in $u$
2.1 Linear $L_{0/1}$-SVM
We now give a brief review of the $L_{0/1}$-SVM of Wang and his collaborators. First, they rewrote the SVM with 0-1 loss in the following matrix form
and the hyperplane or decision function has the form $f(x) = \langle w, x\rangle + b$. Notice that u here is obtained under the condition that the feature map is the identity.
$\lambda_{\max}(\cdot)$ represents the maximum eigenvalue of a matrix.
The theorem states that a P-stationary point must be a local minimizer under certain conditions; the P-stationary point was then used in  as a termination rule for the iterates generated by $L_{0/1}$-ADMM, the algorithm designed to find a numerical solution of (1). Additionally, they creatively introduced a new characterization of the SV and verified that the SVs lie only on the two parallel support hyperplanes, which is expected to guarantee relative robustness and high efficiency owing to strong insensitivity to outliers.
As shown in , the $L_{0/1}$-SVM has rather shorter computational time and far fewer SVs, both theoretically and experimentally, compared with other leading linear SVM classifiers. However, we point out some barriers to extending the whole modelling framework from the linear form to the nonlinear form.
(i) Theoretically, the dimension of the nonlinear or kernel-based representation is always larger than the number of training points. Hence the full-column-rank requirement in Theorem 1 of  will no longer be satisfied once a nonlinear mapping or the kernel trick is introduced in the usual way to handle nonlinear SVM problems.
(ii) Algorithmically, $L_{0/1}$-ADMM, designed to solve the linear $L_{0/1}$-SVM, needs the data points to be expressed explicitly in the iteration of w; this is hindered because in most cases we do not know the mapped points themselves but only their dot products.
To address these challenges, we search for a new nonlinear model that fits the whole linear framework just reviewed; the following subsection presents the essential tools needed to construct it.
2.2 Kernel Method and Representation Theorem
The kernel method has the kernel function in its core role, and the representation theorem is also closely bound to the kernel function, which is the basic element for constructing the Hilbert space in which the representation theorem applies.
When no separating hyperplane exists for linearly inseparable data points, an ingenious solution is to transfer the original data to a new separable domain, which is always a dot product space, under a proper map $\phi$. Since the image space is high dimensional, sometimes even infinite dimensional, and most nonlinear SVM algorithms only need the dot products of the new domain rather than the mapped points themselves, a technique called the kernel method is introduced to alleviate the computational burden. The kernel function, which plays the key role in this method, is defined as follows:
Definition 2 (Kernel function)
For a given binary function $k(\cdot, \cdot)$, if there exists a map $\phi$ from the input set to a dot product space such that $k(x, z) = \langle \phi(x), \phi(z) \rangle$ for all $x, z$, then $k$ is usually called a kernel function and $\phi$ is called its feature map.
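To make the definition concrete, here is a small numerical illustration (ours, not from the paper): the homogeneous quadratic kernel $k(x, z) = \langle x, z\rangle^2$ on $\mathbb{R}^2$ admits the explicit feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and the two sides of the defining identity agree:

```python
import numpy as np

def poly_kernel(x, z):
    """Homogeneous quadratic kernel k(x, z) = <x, z>^2."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit feature map on R^2 satisfying <phi(x), phi(z)> = k(x, z)."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z), float(np.dot(phi(x), phi(z))))  # both equal 1.0
```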
Next, based on the kernel function, we introduce a special Hilbert space, the reproducing kernel Hilbert space (RKHS), which is the completion of the inner product space spanned by the functions $k(\cdot, x)$; the corresponding dot product satisfies the reproducing property $\langle k(\cdot, x), k(\cdot, z)\rangle = k(x, z)$.
In the following part, we further review a representation theorem, which is the cornerstone for constructing our proposed nonlinear $L_{0/1}$-SVM. In mathematics there are many closely related variants of the representation theorem. One of them, the Riesz representation theorem, establishes an important connection between a Hilbert space and its continuous dual space. Wahba applied the Riesz representation theorem to show that the solutions of certain risk minimization problems, involving an empirical risk term and a quadratic regularizer, can be expressed as expansions in terms of the training examples. Schölkopf et al. further generalized the theorem to a larger class of regularizers and empirical risk terms, and their article explicitly gives the nonparametric formulation of the general theorem as follows:
Theorem 2 (Nonparametric Representation Theorem)
Suppose we are given a nonempty set $\mathcal{X}$, a positive definite real-valued kernel $k$ on $\mathcal{X} \times \mathcal{X}$, a training sample $(x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \mathbb{R}$, a strictly monotonically increasing real-valued function $g$ on $[0, \infty)$, an arbitrary cost function $c$, and a class of functions
$$F = \Big\{ f \in \mathbb{R}^{\mathcal{X}} \;\Big|\; f(\cdot) = \sum_{i=1}^{\infty} \beta_i k(\cdot, z_i),\ \beta_i \in \mathbb{R},\ z_i \in \mathcal{X},\ \|f\| < \infty \Big\}.$$
Here, $\|\cdot\|$ is the norm in the RKHS $H_k$ associated with $k$, i.e. for any $z_i \in \mathcal{X}$, $\beta_i \in \mathbb{R}$,
$$\Big\| \sum_{i=1}^{\infty} \beta_i k(\cdot, z_i) \Big\|^2 = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \beta_i \beta_j k(z_i, z_j).$$
Then any $f \in F$ minimizing the regularized risk functional
$$c\big((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))\big) + g(\|f\|)$$
admits a representation of the form
$$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i),$$
where $\alpha_i \in \mathbb{R}$ are the coefficients of $f$ in the RKHS $H_k$.
3 Nonlinear $L_{0/1}$-KSVM

In this section, we formally propose a nonlinear $L_{0/1}$-SVM, termed $L_{0/1}$-KSVM, based on the kernel method and Theorem 2; its degeneration property, first-order optimality condition and solving algorithm are then explored explicitly.
3.1 Modeling of $L_{0/1}$-KSVM
Given a data set composed of input and output pairs, the nonlinear kernel $L_{0/1}$-SVM classifier, termed $L_{0/1}$-KSVM, is obtained by solving the following non-convex problem:
where the matrix $VKV$ is formed from the kernel matrix $K$ associated with the chosen kernel function on the modified input data set. The decision surface of the $L_{0/1}$-KSVM classifier is evaluated at the transformed point corresponding to each test point. Here, $C > 0$ is the penalty parameter.
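As an illustration of how such a matrix can be assembled (our own sketch, assuming the Gaussian kernel and $V = \mathrm{diag}(y_1, \ldots, y_m)$, so that $(VKV)_{ij} = y_i y_j K(x_i, x_j)$; these choices are assumptions, not taken verbatim from the paper):

```python
import numpy as np

def gaussian_gram(X, gamma):
    """Kernel matrix K with K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Hypothetical tiny data set; V = diag(y) gives (V K V)[i, j] = y_i y_j K[i, j].
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
y = np.array([+1.0, -1.0, +1.0])
K = gaussian_gram(X, gamma=0.5)
V = np.diag(y)
D = V @ K @ V
print(np.allclose(D, D.T), np.allclose(np.diag(K), 1.0))  # True True
```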
Therefore, the regularization problem related to the nonlinear kernel SVM with 0-1 loss takes the form:
At the end of this subsection, a significant proposition is introduced to reveal the degeneration relationship between the newly-built nonlinear $L_{0/1}$-KSVM and a slightly modified version of the linear $L_{0/1}$-SVM.
Proposition 1 (Degeneration of $L_{0/1}$-KSVM to $L_{0/1}$-SVM)
3.2 First-Order Optimality Condition
In this subsection, the first-order optimality condition for the $L_{0/1}$-KSVM problem (3.1) is explored; the whole process is similar to the work done in . To proceed, we first define the pillar of the first-order optimality condition, the proximal stationary point of $L_{0/1}$-KSVM:
Definition 3 (P-stationary point of $L_{0/1}$-KSVM)
For a given $C > 0$, we say that a point is a proximal stationary (P-stationary) point of $L_{0/1}$-KSVM if there exists a constant $\gamma > 0$ such that
Based on the P-stationary point, the optimality condition of $L_{0/1}$-KSVM is stated in the following theorem:
For the $L_{0/1}$-KSVM problem (3.1), the following relations hold.
(i) Assume  is invertible. For a given $C > 0$, if  is a global minimizer of $L_{0/1}$-KSVM, then it is a P-stationary point of $L_{0/1}$-KSVM with , where  is the smallest eigenvalue of .
(ii) For a given $C > 0$, if  is a P-stationary point of $L_{0/1}$-KSVM with , then it is a local minimizer of $L_{0/1}$-KSVM.
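For intuition about the proximal operator behind P-stationarity, here is a scalar sketch (our own derivation for the loss $t \mapsto C \cdot \mathbf{1}(t > 0)$; the vector case acts componentwise, and the variable names are ours): minimizing $C \cdot \mathbf{1}(u > 0) + (u - s)^2 / (2\gamma)$ over $u$ keeps $u = s$ unless $0 < s \le \sqrt{2\gamma C}$, in which case jumping to $u = 0$ is cheaper.

```python
import math

def prox_zero_one(s, C, gamma):
    """Proximal operator of t -> C * 1(t > 0) at the point s:
    argmin_u  C * (u > 0) + (u - s)**2 / (2 * gamma).

    If s <= 0, u = s incurs no loss. If s > 0, either keep u = s
    (cost C) or jump to u = 0 (cost s**2 / (2 * gamma)); the
    break-even point is s = sqrt(2 * gamma * C).
    """
    if 0.0 < s <= math.sqrt(2.0 * gamma * C):
        return 0.0
    return s

print(prox_zero_one(-1.0, C=1.0, gamma=1.0))  # -1.0 (kept)
print(prox_zero_one(0.5, C=1.0, gamma=1.0))   # 0.0  (thresholded)
print(prox_zero_one(3.0, C=1.0, gamma=1.0))   # 3.0  (kept)
```

This hard-thresholding behaviour is what produces the sparse, margin-located SVs discussed in the next subsection.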
In the following subsection, the classical ADMM algorithm is employed to find a numerical solution of our nonlinear $L_{0/1}$-SVM problem; the resulting method is termed $L_{0/1}$-ADMM. The SVs of $L_{0/1}$-KSVM will be defined and used as the selected working set in the updating of all sub-problems, which speeds up the whole iterative process and lets the termination condition be satisfied more rapidly.
3.3 $L_{0/1}$-ADMM Algorithm via Selection of Working Set
3.3.1 $L_{0/1}$-KSVM Support Vectors
Suppose that  is a P-stationary point, hence a local minimizer of $L_{0/1}$-KSVM. From the second equation of (8), we define the index set
and . Then and . By (9), we have
This leads to
Then from definition of and (11), we have
Combining with in (11) yields
Taking (12) into the function of decision surface of -KSVM, we derive
Following the concept of SV in , we define the corresponding training points as the $L_{0/1}$-KSVM SVs, with the set above as the index set of SVs.
Again, from for , we obtain
That is to say, the $L_{0/1}$-KSVM SVs must lie on the support surfaces. As far as we know, only the hard-margin SVM and the $L_{0/1}$-SVM have such a characteristic. This geometrical property is evidence that the SVs of $L_{0/1}$-KSVM are arranged orderly and are probably sparse; moreover, it demonstrates the idealness of the 0-1 loss in SVM algorithms.
3.3.2 $L_{0/1}$-ADMM Algorithm
First, to the best of our knowledge, a closed-form solution of $L_{0/1}$-KSVM is hard to obtain because of its non-convexity, though the existence of solutions can be verified by a technique similar to that in . In this subsection, the ADMM algorithm is applied to the augmented problem of (3.1) to search for a numerical solution:
where  is the Lagrangian multiplier and  is the penalty parameter. Given the $k$-th iterate, the algorithm framework can be described as
where  is referred to as the dual step-size. Here,
where the third equation follows from the definition of the proximal operator with  and . Define the working set
and the complementary set ; then we obtain
(ii) Updating .
is introduced during the updating process, thus we derive the -subproblem of (15) as
and it can be simplified as
(iii) Updating . Let , then based on (15), we have
3.3.3 $L_{0/1}$-ADMM Convergence and Complexity Analysis
First, considering that directly solving the SVM with 0-1 loss is an NP-hard problem, we admit that only limited materials are available to guide the analysis of the solution system. Moreover, the ADMM algorithm, our chosen path for computing $L_{0/1}$-KSVM, has an intrinsic difficulty in convergence analysis because of its multi-step, multi-factor structure compared with other popular schemes such as gradient algorithms. Furthermore, the working set selected in the updating of all sub-problems of $L_{0/1}$-ADMM varies in each iteration, which exacerbates the instability of the whole calculation process. Worse still, the algorithm must now address a non-convex and discontinuous problem, rather than optimization problems with a more benign class of non-convex losses, such as those treated in . Still, we obtain a convergence conclusion for the sequence created by the ADMM algorithm, similar to that for the linear case in :
Suppose that  is the limit point of the sequence generated by $L_{0/1}$-ADMM; then  is a local optimal solution of $L_{0/1}$-KSVM.
Next we present the computational complexity analysis of $L_{0/1}$-ADMM.
Updating by (16) needs the complexity .
The main cost involved in computing  by (17) is forming the kernel matrix , which takes complexity about , where  is the dimension of the data points.
Similarly to the update of ,  is the most expensive computation in (21) for deriving . Its complexity is , provided  has been calculated first.
Overall, the whole computational complexity in each step of -ADMM in Algorithm 1 is
An obvious fact is that the kernel matrix imposes a heavy burden on computational time, which is caused by the curse of kernelization and remains an open, major challenge in kernel SVM training.
4 Numerical experiments
The experiments on our proposed algorithm $L_{0/1}$-ADMM for $L_{0/1}$-KSVM (abbreviated in the tables below) are divided into two parts, corresponding to two sets of baselines: the linear $L_{0/1}$-SVM and six other leading nonlinear SVM classifiers. Ten UCI data sets, whose detailed information is listed in Table 2, are selected; all experiments are conducted in MATLAB (2019a) on a laptop with 32GB of memory and an Intel Core i7 2.7GHz CPU. In addition, all features in each data set are rescaled to a common range, and the Gaussian kernel is chosen in all experiments.
where tol is the tolerance level and
Our evaluation of classification performance is based on 10-fold cross validation: each data set is randomly split into ten parts, one of which is used for testing while the remaining nine are used for training. We choose the mean accuracy (mACC), the mean number of support vectors (mNSV) and the mean CPU time as the performance criteria for all models. Let  be the test samples; the testing accuracy is defined by
where is the SVs set of -KSVM and = .
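The evaluation protocol above can be sketched as follows (our own minimal illustration of the sign-agreement accuracy criterion and a 10-fold split; the variable names are ours):

```python
import numpy as np

def accuracy(y_true, scores):
    """Fraction of test points whose predicted sign matches the label."""
    return float(np.mean(np.sign(scores) == y_true))

def ten_fold_indices(m, rng):
    """Randomly split indices 0..m-1 into 10 roughly equal folds."""
    perm = rng.permutation(m)
    return np.array_split(perm, 10)

rng = np.random.default_rng(0)
folds = ten_fold_indices(50, rng)      # each fold serves once as the test part
y = np.array([+1, -1, +1, -1])
f = np.array([0.7, -0.2, -0.1, -1.5])  # decision values on the test part
print(len(folds), accuracy(y, f))      # 10 folds; 3 of 4 signs correct -> 0.75
```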
4.1 Comparisons between Linear $L_{0/1}$-SVM and $L_{0/1}$-KSVM
The optimal parameters  and  in the two ADMM algorithms are selected through standard 10-fold cross validation over the same range. In addition, as to the other parameter options in our $L_{0/1}$-ADMM, we set  and choose , ,  as the initial points; the maximum iteration number is  and the tolerance level is tol. In the linear algorithm, , and ; the maximum iteration number and the tolerance level are  and tol respectively. The linear and nonlinear $L_{0/1}$-SVM results can be found in Table 3. We conclude that $L_{0/1}$-KSVM performs clearly better than its linear formulation on the majority of the 10 data sets in terms of mACC and mNSV, with the bold numbers indicating superiority. Every data set except wdb attains a higher mACC with our proposed nonlinear model; the data set mon raises its accuracy by a surprising 25%, and wav reduces its error rate by nearly 4.2%. As to mNSV, $L_{0/1}$-KSVM also greatly outperforms its linear peer except on hea. In particular, on the data sets hyp, two and ech, the number of SVs occurring in the nonlinear formulation is less than one fifth of that of its linear opponent. This stronger sparsity is a little surprising, considering that most convex nonlinear SVMs yield more SVs than their corresponding linear forms. We conjecture that the higher accuracy of $L_{0/1}$-KSVM and the special geometrical distribution of SVs in the 0-1 loss SVM are responsible for this unusual phenomenon. However, the curse of kernelization hinders the computational efficiency of $L_{0/1}$-ADMM, which costs much more time than its linear counterpart.