1 Introduction
Cortes and Vapnik first proposed the support vector machine (SVM) for binary classification in 1995 [1], and it has since become a popular modelling and prediction tool for many small and moderate-size machine learning, statistics and pattern recognition problems
[2]. The SVM classifier aims to build a linear or nonlinear maximum-margin decision boundary separating the data points of two classes; in most practical scenarios the nonlinear SVM model is the preferred one [3]. Given a data set $\{(x_i, y_i)\}_{i=1}^{m}$, where $y_i \in \{-1, +1\}$ is the label corresponding to $x_i$, the nonlinear SVM problem can be formulated as
$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\ell\big(1 - y_i(\langle w, \phi(x_i)\rangle + b)\big),$$
where $C > 0$ is a penalty parameter, the decision surface is defined by $\langle w, \phi(x)\rangle + b = 0$, and $\phi$ is a proper feature map expected to make the new higher-dimensional points linearly separable in feature space. The loss function $\ell$ penalizes the incorrectly classified data points. The 0-1 loss is a natural option for $\ell$ since it captures the discrete nature of binary classification, namely, it is designed to minimize the number of misclassified instances:
$$\ell_{0/1}(t) = \begin{cases} 1, & t > 0,\\ 0, & t \le 0, \end{cases}$$
where $t_i = 1 - y_i(\langle w, \phi(x_i)\rangle + b)$. However, its nonconvexity and discontinuity make problems with the 0-1 loss NP-hard to optimize [4][5][6][7]. For decades, rather than confronting the challenge of directly solving the original 0-1 loss SVM, many researchers have adopted one of two alternative strategies: replacing the 0-1 loss with a surrogate loss function, or designing approximate algorithms that exploit the binary-valued character of the 0-1 loss function.
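As a minimal numerical illustration (the names y, f and zero_one_loss are ours, not notation from the paper), counting the 0-1 loss over a sample amounts to counting sign disagreements between labels and decision values:

```python
import numpy as np

def zero_one_loss(y, scores):
    """Count misclassified points: x_i counts as misclassified when y_i * f(x_i) <= 0."""
    y = np.asarray(y, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return int(np.sum(y * scores <= 0))

# The last two points disagree in sign with their labels.
y = np.array([1, -1, 1, -1])
f = np.array([0.8, -0.3, -0.2, 0.5])
print(zero_one_loss(y, f))  # prints 2
```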
In the SVM community, a tremendous amount of work has been devoted to the first strategy, especially to SVMs with convex surrogates, owing to their ease of use in practical applications and their convenience for theoretical analysis. However, there is no solid guarantee of how well these convex surrogates approximate the 0-1 loss itself, even though asymptotic consistency results do exist [8]. The most widely studied surrogates include the hinge loss [1], the squared hinge loss [9], the least squares loss [10], the pinball loss [11] and the log loss [12]. On the other hand, nonconvex surrogates such as the ramp loss [13][14] and the truncated pinball soft-margin loss [15] still attract considerable attention in scenarios requiring strong insensitivity to outliers. These nonconvex SVMs also exhibit relatively better performance in classification accuracy and in reducing the number of support vectors (SVs). For the majority of nonlinear SVMs with convex surrogates, the optimization problems can be studied systematically by the dual kernel-based technique combined with the KKT conditions. Unfortunately, this method does not enjoy similar success for many nonconvex surrogates. With respect to the 0-1 loss, the ideal candidate for the SVM loss, the Lagrangian dual approach is hard to implement directly: its core object, the Lagrangian dual function of the 0-1 loss SVM, cannot be expressed explicitly due to the nonsmoothness and nonconvexity of the 0-1 loss function. The failure of this popular technique, together with the absence of other effective routes to this NP-hard problem, has left the task of exactly solving the 0-1 loss SVM an open challenge in the SVM community for decades.
Regarding the second strategy, some approximate algorithms for nonlinear SVMs employ the sigmoid function [16] or a sigmoidal transfer function [17] to mimic the 0-1 loss, exploiting their similar numerical characteristics. In addition, Orsenigo and Vercellis [18] proposed an algorithm of multivariate classification trees based on minimum-feature discrete support vector machines, in which an approximate technique is executed to solve the mixed integer programming (MIP) model at each node of the trees. It must be acknowledged that these approximate algorithms focus mainly on computational techniques rather than on a systematic solving framework with solid theoretical analysis, such as establishing optimality conditions and designing corresponding algorithms that adhere closely to the SVM problem itself. Most recently, Wang et al. [19] made a significant step with their systematic work on directly solving the linear SVM with 0-1 loss. To the best of our knowledge, they were the first to establish optimality conditions for both global and local solutions of the linear SVM, and they also devised a remarkably efficient ADMM algorithm to find approximate solutions based on these conditions. Given that the fundamental and theoretical achievements on directly solving the SVM with 0-1 loss are still far from fully developed, their relatively systematic work on the linear SVM is a meaningful step on this topic. We therefore expect that their inspiring viewpoint can lead us to build a similar systematic framework for the nonlinear SVM.
A powerful and popular way to deal with nonlinear SVMs is the kernel method [10][20]. Generally, most nonlinear SVM problems integrate the kernel technique into the Lagrangian dual function. However, as noted above, the explicit formulation of the Lagrangian dual function of the SVM is inaccessible here, since the 0-1 loss function lacks both convexity and continuity. This predicament pushes researchers to explore other paths that still use the kernel trick effectively but without involving the Lagrangian dual problem. One such path is to approximate the kernel function by a proper feature map. The class of random features [21][22], a technique for speeding up kernel methods on large-scale problems, belongs to this kind: it avoids the Lagrangian dual of the SVM by using a feature map to approximately decompose the kernel function, thereby transforming the nonlinear SVM problem into a linear one.
In this article, based on the nonparametric representation theorem [31], we establish a new nonlinear SVM model, called KSVM. The KSVM model incorporates the kernel technique directly, without resorting to the Lagrangian dual function. More importantly, the proposed model successfully extends the systematic solving framework of the linear model in [19]: the corresponding optimality condition and algorithm are explored in a similar fashion. The main contributions are summarized as follows.
(i) A new formulation of the nonlinear kernel SVM, called KSVM, is presented. The proposed KSVM simultaneously incorporates the kernel technique and inherits the systematic solving framework of the linear SVM. Moreover, we prove that KSVM degenerates into a slightly modified version of the original linear SVM when an extra regularization term is added to the latter's objective function.
(ii) A precise definition of the support vector in KSVM is provided via the proximal constraints of the P-stationary point, and an important conclusion about SVs is established: SVs are located only on the pair of parallel support surfaces associated with the final decision surface. This geometrical character of the SVs echoes the fact that SVMs with 0-1 loss have far fewer SVs than SVM peers with traditional convex losses.
(iii) The experiments, which take the linear SVM and six other classic nonlinear SVMs as performance baselines, validate the strong sparsity and reasonable robustness of the proposed KSVM. KSVM produces markedly fewer SVs than all of its experimental counterparts, while achieving decent prediction accuracy compared with the six classic nonlinear SVMs.
The rest of the paper is organized as follows. Section 2 gives a broad outline of the SVM and then presents the two foundations for modelling the nonlinear KSVM: the kernel method and the representation theorem. The whole framework of the nonlinear SVM with 0-1 soft margin, called KSVM and the topic of this paper, is investigated in Section 3. Lastly, Section 4 demonstrates the advantages of KSVM in sparsity and robustness through experiments against the linear SVM and six other classic nonlinear SVMs.
2 Preliminary knowledge
First, we briefly review the linear SVM model and its optimality condition; we then present the kernel method and the representation theorem, which are vital to establishing our final KSVM model. For convenience, some notations are listed in Table 1.
Notation  Description  

X  
y  
V  
e  
, here  
u  
The number of nonzero elements in u 
2.1 Linear SVM
We now give a brief review of the linear SVM of Wang et al. [19]. First, they rewrote the SVM with 0-1 loss in the following matrix form
(1) 
and the hyperplane or decision function takes the corresponding form. Notice that u here is obtained under the condition that the feature map is the identity. The optimality condition of (1) was given in Theorem 1, which rests on an important concept, the P-stationary point. We revisit them respectively in the following part:
Definition 1
[19] [P-stationary point of (1)] For a given , we say is a proximal stationary (P-stationary) point of (1) if there exist a Lagrangian multiplier vector and a constant such that
(2) 
where
(3) 
Here, .
Theorem 1
The theorem states that a P-stationary point must be a local minimizer under certain conditions; [19] then used the P-stationary point as a termination rule for the iterates generated by ADMM, the algorithm designed to find a numerical solution of (1). Additionally, they gave a new characterization of the SV and verified that SVs lie only on a pair of parallel support hyperplanes, which is expected to guarantee relative robustness and high efficiency thanks to strong insensitivity to outliers.
In [19], the SVM achieves rather shorter computational time and far fewer SVs, both theoretically and experimentally, compared to other leading linear SVM classifiers. However, several barriers arise when extending the whole modelling framework from the linear form to the nonlinear form.
(i) Theoretically, the dimension of the nonlinear or kernel representation is usually larger than the number of training points. Hence the full-column-rank requirement in Theorem 1 of [19] will not be satisfied once a nonlinear mapping or the kernel trick is introduced to handle nonlinear SVM problems.
(ii) Algorithmically, the ADMM designed for the linear SVM needs the data points to appear explicitly in the iteration for w; this is hindered in the nonlinear case, because in most cases we know only the dot products of the mapped points, not the mapped points themselves.
To address these challenges, we seek a new nonlinear model that fits the whole linear framework just reviewed; the following subsection presents the essential foundations for constructing it.
2.2 Kernel Method and Representation Theorem
The kernel function plays the central role in the kernel method, and the representation theorem is also closely bound to the kernel function, which is the basic element for constructing the Hilbert space in which the representation theorem applies.
When no separating hyperplane exists for linearly inseparable data points, an ingenious solution is to transfer the original data to a new, separable domain, which is typically a dot product space under a proper map. The new domain is high dimensional, sometimes even infinite dimensional; moreover, most nonlinear SVM algorithms need only the dot products in the new domain, not the mapped points themselves. Therefore, a technique called the kernel method is introduced to alleviate the computational burden. The kernel function, which plays the key role in this method, is defined as follows:
Definition 2 (Kernel function)
[23] Given a binary function for which there exists a map into a dot product space such that the kernel identity holds for all inputs, the function is usually called a kernel function and the map is called its feature map.
The most frequently used kernel functions include the polynomial kernel [24], the Gaussian kernel [25][26], and the sigmoid kernel [27].
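As a small, hedged sketch of the Gaussian kernel just mentioned (the function name and the bandwidth parameter sigma are our own notation), the full kernel matrix between two sample sets can be computed as:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - z_j||^2 / (2 * sigma^2))."""
    X, Z = np.atleast_2d(X), np.atleast_2d(Z)
    sq_dist = (np.sum(X**2, axis=1)[:, None]
               + np.sum(Z**2, axis=1)[None, :]
               - 2.0 * X @ Z.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))
```

The np.maximum call guards against tiny negative squared distances produced by floating-point cancellation.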
Next, based on the kernel function, we introduce a special Hilbert space, the reproducing kernel Hilbert space (RKHS) [28], which is the completion of the inner product space whose elements are generated by the kernel sections; the corresponding dot product can be derived from the reproducing property.
In the following part, we further review a representation theorem in the RKHS, which is the cornerstone of constructing our proposed nonlinear SVM. In mathematics, there are many closely related variations of the representation theorem. One of them is the Riesz representation theorem [29], which establishes an important connection between a Hilbert space and its continuous dual space. Wahba [30] applied the Riesz representation theorem in the RKHS setting, showing that the solutions of certain risk minimization problems, involving an empirical risk term and a quadratic regularizer, can be expressed as expansions in terms of the training examples. Schölkopf et al. [31] further generalized the theorem to a larger class of regularizers and empirical risk terms; that article explicitly gives the nonparametric formulation of the general theorem as follows:
Theorem 2 (Nonparametric Representation Theorem)
[31] Suppose we are given a nonempty set, a positive definite real-valued kernel, a training sample, a strictly monotonically increasing real-valued function on $[0, \infty)$, an arbitrary cost function, and a class of functions defined on this set.
Here, is the norm in the RKHS associated with , i.e. for any ,
(4) 
Then any minimizing the regularized risk functional
(5) 
admits a representation of the form
where are coefficients of in the RKHS .
3 KSVM
In this section, we formally propose a nonlinear SVM termed KSVM, based on the kernel method and Theorem 2; the degeneration property of KSVM, its first-order optimality condition and the solving algorithm are then explicitly explored.
3.1 Modeling of KSVM
Given a data set composed of input-output pairs, the nonlinear kernel SVM classifier, termed KSVM, is obtained by solving the following nonconvex problem:
s.t. 
where = VKV and the kernel matrix is associated with the chosen kernel function on the modified input data set. The decision surface of the KSVM classifier is evaluated at the transformed point corresponding to each test point.
Next we make some explanatory notes on the modeling of KSVM through Theorem 2. In (5), we specify the regularizer and the cost function accordingly. Here, the constant is the penalty parameter.
Therefore, the regularization problem related to the nonlinear kernel SVM with 0-1 loss has the form:
(7) 
Since, by Theorem 2, the solution of (7) admits a kernel expansion, we plug it back into (7) and apply the norm identity (4); the final formulation (3.1) of KSVM is then obtained.
At the end of this subsection, a proposition is introduced to reveal the degeneration relationship between our newly built nonlinear KSVM and a slightly modified version of the linear SVM.
Proposition 0 (Degeneration of KSVM to SVM)
3.2 FirstOrder Optimality Condition
In this subsection, the first-order optimality condition for the KSVM problem (3.1) is explored; the whole process parallels the work in [19]. To proceed, we first define the cornerstone of the first-order optimality condition, the proximal stationary point of KSVM:
Definition 3 (P-stationary point of KSVM)
For a given , we say is a proximal stationary (P-stationary) point of KSVM if there exists a constant such that
(8) 
where
(9) 
Here, .
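The displayed proximal operator in (9) did not survive extraction. As a hedged reconstruction, the following sketch implements the hard-thresholding form known for the scalar 0-1 loss in the linear setting of [19]; the threshold sqrt(2*gamma) is our assumption, obtained by comparing the cost gamma of keeping a positive entry against the quadratic cost of snapping it to zero:

```python
import numpy as np

def prox_zero_one(s, gamma):
    """Elementwise proximal operator of t -> gamma * 1{t > 0}.

    argmin_t gamma * 1{t > 0} + 0.5 * (t - s)^2:
    keeping t = s costs gamma when s > 0, while snapping to t = 0 costs
    s^2 / 2, so zero wins exactly when 0 < s <= sqrt(2 * gamma).
    """
    s = np.asarray(s, dtype=float)
    out = s.copy()
    out[(s > 0) & (s <= np.sqrt(2.0 * gamma))] = 0.0
    return out
```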
Based on Pstationary point, the optimality condition of KSVM is concluded as the following theorem:
Theorem 3
For the KSVM problem (3.1), the following relations hold.
(i) Assume is invertible. For a given , if is a global minimizer of KSVM, then it is a P-stationary point of KSVM with , where is the smallest eigenvalue of .
(ii) For a given , if is a P-stationary point of KSVM with , then it is a local minimizer of KSVM.
In the following subsection, the classical ADMM algorithm is employed to find a numerical solution of our nonlinear SVM problem. The SVs of KSVM will be defined and used to select the working set in the updates of all subproblems, which speeds up the whole iteration and allows the termination condition to be satisfied more rapidly.
3.3 ADMM Algorithm via Selection of Working Set
3.3.1 KSVM Support Vector
Suppose is a P-stationary point, and hence a local minimizer of KSVM. From the second equation of (8), we define the index set
(10) 
and . Then and . By (9), we have
This leads to
(11) 
Then from definition of and (11), we have
i.e.,
Combining with in (11) yields
(12) 
Taking (12) into the function of decision surface of KSVM, we derive
Following the concept of SV in [1], we define the points indexed by this set as the KSVM SVs, so the set above becomes the index set of SVs.
Again, from for , we obtain
(13) 
That is to say, the KSVM SVs must lie on the support surfaces. As far as we know, only the hard-margin SVM and the 0-1 loss SVM share this characteristic. This geometrical property suggests that the SVs in KSVM are arranged orderly and, most likely, sparsely; moreover, it demonstrates the suitability of the 0-1 loss for SVM algorithms.
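Assuming the support-surface condition (13) reads y_i f(x_i) = 1 (the standard margin convention; the exact equation did not survive extraction), flagging the SVs of a trained classifier reduces to a tolerance check on the decision values:

```python
import numpy as np

def support_vector_mask(y, f_vals, tol=1e-6):
    """Flag training points lying on the support surfaces y_i * f(x_i) = 1."""
    y = np.asarray(y, dtype=float)
    f_vals = np.asarray(f_vals, dtype=float)
    return np.abs(y * f_vals - 1.0) <= tol
```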
3.3.2 ADMM Algorithm
First, to the best of our knowledge, a closed-form solution of KSVM is intractable due to its nonconvexity, though the existence of a solution can be verified using techniques similar to those in [19]. In this subsection, the ADMM algorithm is applied to the augmented problem of (3.1) to search for a numerical solution:
(14)  
where is the Lagrangian multiplier and is the penalty parameter. Given the th iteration , the algorithm framework can be described as
(15) 
where is referred to as the dual step size. Here,
is the so-called proximal term, and is a symmetric matrix properly chosen to guarantee the convexity of the subproblems of (15).
(i) Updating : The u-subproblem in (15) is equivalent to the following problem
where the third equation follows from the definition of the proximal operator with and . Define the working set
(16) 
and its complementary set ; then we obtain
(17) 
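Since the exact subproblems of (15) did not survive extraction, the following is only a schematic of the three-block alternation (a quadratic primal update, a proximal update, then a dual ascent step) on a convex toy problem, with the l1 norm standing in for the 0-1 term; the toy objective and all names are our own:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, the convex stand-in for the 0-1 prox."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def toy_admm(c, gamma=1.0, rho=1.0, iters=300):
    """Minimize 0.5 * ||x - c||^2 + gamma * ||z||_1 subject to x = z.

    The update order mirrors (15): primal block x, primal block z via its
    proximal operator, then a multiplier (dual) ascent step.
    """
    c = np.asarray(c, dtype=float)
    x, z, lam = np.zeros_like(c), np.zeros_like(c), np.zeros_like(c)
    for _ in range(iters):
        x = (c + rho * z - lam) / (1.0 + rho)            # quadratic subproblem
        z = soft_threshold(x + lam / rho, gamma / rho)   # proximal subproblem
        lam = lam + rho * (x - z)                        # dual update
    return x
```

At convergence x equals the soft-thresholded c, the known minimizer of the toy objective.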
3.3.3 ADMM Convergence and Complexity Analysis
First, considering that directly solving the SVM with 0-1 loss is an NP-hard problem, we admit that only limited material is available to guide the analysis of the solution framework. Moreover, the ADMM algorithm, our chosen path to compute KSVM, is intrinsically difficult to analyze for convergence, owing to its multi-step, multi-factor structure compared with other popular schemes such as gradient algorithms. Further, the working set selected in the updates of all subproblems varies at each iteration, which exacerbates the instability of the whole computation. Even worse, the algorithm must now address a nonconvex and discontinuous problem, rather than optimization problems with a more benign class of nonconvex losses, as in [33][34][35][36][37]. Still, for the sequence created by the ADMM algorithm, we obtain a convergence conclusion similar to the linear case in [19]:
Theorem 4
Suppose a limit point of the sequence generated by ADMM exists; then it is a locally optimal solution of KSVM.
Next we present the computational complexity analysis of ADMM.

Updating by (16) needs the complexity .

The main term involved in computing by (17) is the kernel matrix , whose complexity is about , where is the dimension of the data points .

Similar to the update of , the most expensive computation in (21) to derive is . Its complexity follows, supposing we have calculated first.
Overall, the whole computational complexity in each step of ADMM in Algorithm 1 is
An obvious fact is that the kernel matrix brings a heavy burden on computational time. This is caused by the curse of kernelization [39], which remains a big open challenge in kernel SVM training problems.
4 Numerical experiments
The experiments related to our proposed ADMM algorithm on KSVM (abbreviated in the tables below) are divided into two parts, corresponding to two baselines for performance comparison: the linear SVM and six other leading nonlinear SVM classifiers. Ten UCI data sets, whose detailed information is listed in Table 2, are selected; all experiments are conducted in MATLAB (2019a) on a laptop with 32 GB of memory and an Intel Core i7 2.7 GHz CPU. In addition, all features in each data set are scaled to a common range, and the Gaussian kernel is chosen in all experiments.
Data sets  Numbers()  Features()  Rate() 

Breast (bre)  699  9  34/66 
Echo(ech)  131  10  33/67 
Heartstatlog (hea)  270  13  56/44 
Housevotes(hou)  435  16  61/39 
Hypothyroid (hyp)  3163  25  5/95 
Monk3(mon)  432  6  50/50 
PimaIndian (pim)  768  8  35/65 
TwoNorm (two)  7400  20  50/50 
Waveform (wav)  5000  21  33/67 
WDBC (wdb)  198  34  37/63 
Theorem 3 allows us to take the P-stationary point as a stopping criterion: we terminate the algorithm when the iterate closely satisfies the conditions in (8), namely,
where tol is the tolerance level and
Our evaluation of classification performance is based on 10-fold cross validation: each data set is randomly split into ten parts, one of which is used for testing and the remaining nine for training. We therefore choose the mean accuracy (mACC), the mean number of support vectors (mNSV) and the mean CPU time (mCPU) as the performance criteria for all models. Letting be the test samples, the testing accuracy is defined by
where is the SVs set of KSVM and = .
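A minimal sketch of the evaluation protocol just described, assuming only that the ten folds are disjoint random splits and that ACC is the fraction of correctly predicted test labels (function names are ours):

```python
import numpy as np

def ten_fold_indices(n, seed=0):
    """Randomly split indices 0..n-1 into ten disjoint folds for cross validation."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), 10)

def accuracy(y_true, y_pred):
    """Fraction of test labels predicted correctly (the ACC criterion)."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```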
4.1 Comparisons between linear SVM and KSVM
The optimal parameters and in the two ADMM algorithms are selected through the standard 10-fold cross validation over the same range.
In addition, as to the other parameter options in the nonlinear ADMM, we set and choose , , as the initial points; the maximum iteration number is and the tolerance level is tol.
In the linear ADMM algorithm, , and are set accordingly; the maximum iteration number and the tolerance level are and tol respectively.
The linear and nonlinear SVM results are reported in Table 3, where bold numbers indicate superiority. We conclude that KSVM performs clearly better than its linear counterpart on the majority of the 10 data sets in terms of mACC and mNSV. All data sets except wdb attain a higher mACC with the proposed nonlinear model; on mon the accuracy increases by a surprising 25%, and on wav the error rate drops by nearly 4.2%. As for mNSV, KSVM also greatly outperforms its linear peer except on hea. In particular, on the data sets hyp, two and ech, the number of SVs in the nonlinear formulation is less than one fifth of that of its linear opponent.
The stronger sparsity is a little surprising, considering that most convex nonlinear SVMs produce more SVs than their corresponding linear forms. We conjecture that the higher accuracy of KSVM and the special geometric distribution of SVs in the 0-1 loss SVM are responsible for this unusual phenomenon. However, the curse of kernelization harms the computational efficiency of the nonlinear ADMM, which costs much more time than its linear counterpart.

Data set  Category  mACC(%)  mNSV  mCPU(seconds)

bre  SVM  0.9568  15.55  0.0465 
KSVM  0.9677  9.82  0.6453  
ech 
SVM  0.9042  23.35  0.0381 
KSVM  0.9186  7.76  0.0601  
hea  SVM  0.8252  18.67  0.0349 
KSVM  0.8374  26.76  0.0912  
hou 
SVM  0.9458  27.85  0.0297 
KSVM  0.9584  23.93  0.2323  
hyp 
SVM  0.9796  139.93  0.1253 
KSVM  0.9816  25.07  20.7255  
mon  SVM  0.7272  83.55  0.0178 
KSVM  0.9723  78.73  0.2546  
pim  SVM  0.7647  267.18  0.1710 
KSVM  0.7723  176.54  0.8658  
two  SVM  0.9770  309.77  0.3979 
KSVM  0.9776  16.81  116.4893  
wav 
SVM  0.8539  740.53  0.3969 