
Nonlinear Kernel Support Vector Machine with 0-1 Soft Margin Loss

by   Ju Liu, et al.
NetEase, Inc

Recent advances on the linear support vector machine with the 0-1 soft margin loss (L_0/1-SVM) show that the 0-1 loss problem can be solved directly. However, its theoretical and algorithmic requirements prevent us from directly extending the linear solving framework to the nonlinear kernel form; one major obstacle is the absence of an explicit expression for the Lagrangian dual function of L_0/1-SVM. In this paper, by applying the nonparametric representation theorem, we propose a nonlinear model for the support vector machine with 0-1 soft margin loss, called L_0/1-KSVM, which cleverly incorporates the kernel technique and, more importantly, inherits the systematic solving framework of the linear task. Its optimality condition is explored theoretically, and a working-set-selection alternating direction method of multipliers (ADMM) algorithm is introduced to obtain a numerical solution. Moreover, we present the first closed-form definition of the support vector (SV) of L_0/1-KSVM. Theoretically, we prove that all SVs of L_0/1-KSVM lie only on the parallel decision surfaces. The experiments also show that L_0/1-KSVM has far fewer SVs while maintaining a decent prediction accuracy, compared with its linear peer L_0/1-SVM and six other nonlinear benchmark SVM classifiers.




1 Introduction

Vapnik and Cortes first proposed the support vector machine (SVM) for binary classification in 1995 [1], and it has since become a very popular modelling and prediction tool for many small- and moderate-size machine learning, statistics and pattern recognition problems [2]. The SVM classifier aims to build a linear or nonlinear maximum-margin decision boundary that separates the data points of two classes. Usually, the nonlinear SVM model is the preferred one in practical scenarios [3]. Given a data set {(x_i, y_i), i = 1, …, m}, where y_i ∈ {−1, +1} is the label corresponding to x_i, the formulation of the nonlinear SVM problem is

min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^m ℓ(1 − y_i(⟨w, φ(x_i)⟩ + b)),

where C > 0 is a penalty parameter, the decision surface is defined by ⟨w, φ(x)⟩ + b = 0, and φ is a proper feature map expected to make the new higher-dimensional points φ(x_i) linearly separable in the feature space. The loss function ℓ(·) represents the penalty for incorrectly classified data points. The 0-1 loss is a natural option for ℓ because it captures the discrete nature of binary classification; namely, it directly minimizes the number of misclassified instances:

ℓ_{0/1}(u) = 1 if u > 0, and ℓ_{0/1}(u) = 0 if u ≤ 0,

where u = 1 − y_i(⟨w, φ(x_i)⟩ + b). However, its non-convexity and discontinuity make problems with the 0-1 loss NP-hard to optimize [4][5][6][7]. For decades, rather than confronting the challenge of directly solving the original 0-1 loss SVM, many researchers have turned to two alternative strategies: replacing the 0-1 loss with surrogate loss functions, or executing approximate algorithms based on the binary-valued characteristic of the 0-1 loss.
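To make the contrast concrete, the following sketch compares the 0-1 loss with the hinge surrogate on a few hypothetical margin values (the arrays here are illustrative, not from the paper):

```python
import numpy as np

def zero_one_loss(u):
    """0-1 loss: 1 if the margin is violated (u > 0), else 0, elementwise."""
    return (u > 0).astype(float)

def hinge_loss(u):
    """Hinge surrogate: max(0, u), grows linearly with the violation."""
    return np.maximum(0.0, u)

# u_i = 1 - y_i * f(x_i): positive entries correspond to misclassified
# points or points inside the margin (illustrative values only)
u = np.array([-0.5, 0.2, 1.7, -2.0])
print(zero_one_loss(u).sum())  # 2.0: number of violations
print(hinge_loss(u).sum())     # 1.9: total violation magnitude
```

The 0-1 loss counts violations regardless of their size, while the hinge loss weighs large violations (such as outliers) much more heavily.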

In the SVM community, a tremendous amount of work has been devoted to the first strategy, especially to SVMs with convex surrogates, for their ease of use in practical applications and their convenience for theoretical analysis. However, no solid evidence guarantees how well these convex surrogates approximate the 0-1 loss itself, even though they do exhibit asymptotic relevance [8]. The most widely studied surrogates include the hinge loss [1], the squared hinge loss [9], the least squares loss [10], the pinball loss [11] and the log loss [12]. On the other hand, non-convex surrogates such as the ramp loss [13][14] and the truncated pinball soft-margin loss [15] still attract much attention in scenarios requiring high insensitivity to outliers. These non-convex SVMs exhibit relatively better classification accuracy and fewer SVs. For most nonlinear SVMs with convex surrogates, the optimization problems can be studied systematically by the dual kernel-based technique combined with the KKT conditions. Unluckily, this method cannot guarantee similar success for many non-convex surrogates. With respect to the 0-1 loss, the ideal candidate for the SVM loss, the Lagrangian dual approach is hard to implement directly: the Lagrangian dual function of the 0-1 loss SVM cannot be expressed explicitly due to the non-smoothness and non-convexity of the 0-1 loss. The failure of this popular technique, together with the absence of other effective routes to solve this NP-hard problem directly, has made precisely solving the 0-1 loss SVM a decades-long challenge in the SVM community.

As to the second strategy, some approximate algorithms for nonlinear SVMs employ a sigmoid function [16] or a sigmoidal transfer function [17] to simulate the 0-1 loss, owing to their similar numerical characteristics. In addition, Orsenigo and Vercellis [18] proposed an algorithm of multivariate classification trees based on minimum-feature discrete support vector machines, where an approximate technique solves the mixed integer programming (MIP) model at each node of the trees. It must be acknowledged that these approximate algorithms mainly focus on ad hoc techniques rather than a systematic solving framework with solid theoretical analysis, such as proposing an optimality condition and designing a corresponding algorithm that adheres closely to the SVM problem.

Most recently, Wang et al. [19] made a big step with their systematic work on directly solving the linear L_0/1-SVM, their abbreviation for the linear SVM with 0-1 loss. To the best of our knowledge, they were the first to establish the optimality conditions for both global and local solutions of the linear L_0/1-SVM; a significantly efficient algorithm, L_0/1-ADMM, was then devised to find an approximate solution based on these conditions. Given that fundamental and theoretical results on directly solving the SVM with 0-1 loss remain underdeveloped, their relatively systematic work on the linear L_0/1-SVM is a meaningful trial on this topic. We therefore expect their inspiring viewpoint to guide us in building a similar systematic framework for the nonlinear L_0/1-SVM.

A powerful and popular way to deal with nonlinear SVMs is the kernel method [10][20]. Generally, most nonlinear SVM problems integrate the kernel technique into the Lagrangian dual function. However, as noted above, the explicit formulation of the Lagrangian dual function of L_0/1-SVM is inaccessible, since the 0-1 loss lacks convexity and continuity. This predicament pushes researchers to explore other paths that can still use the kernel trick effectively without involving the Lagrangian dual problem. One way is simply to approximate the kernel function by a proper feature map. The class of random features [21][22], a technique for speeding up kernel methods in large-scale problems, belongs to this kind: it avoids searching for the Lagrangian dual of L_0/1-SVM by using a feature map to approximately decompose the kernel function, thereby transforming the nonlinear SVM problem into a linear one.

In this article, based on the nonparametric representation theorem [31], we establish a nonlinear L_0/1-SVM model, called L_0/1-KSVM. The new model tactfully incorporates the kernel technique without using the Lagrangian dual function. More importantly, it successfully extends the systematic solving framework of the linear model in [19]; that is, the corresponding optimality condition and algorithm are explored in a similar way. The main contributions are summarized as follows.
(i) A new formulation of the nonlinear kernel L_0/1-SVM, called L_0/1-KSVM, is presented. It simultaneously involves the kernel technique and inherits the advanced, systematic solving framework of L_0/1-SVM. Moreover, we prove that L_0/1-KSVM can degenerate into a slightly modified edition of the original linear L_0/1-SVM, obtained by adding an extra regularization term to its objective function.
(ii) A precise definition of the support vector in L_0/1-KSVM is provided via the proximal constraints of the P-stationary point, and an important conclusion about SVs is established: SVs are only located on the parallel decision surfaces on either side of the final decision surface. This geometrical character echoes the fact that SVMs with 0-1 loss have far fewer SVs than SVM peers with traditional convex losses.
(iii) The experiments, which take L_0/1-SVM and six other classic nonlinear SVMs as performance baselines, validate the strong sparsity and reasonable robustness of the proposed L_0/1-KSVM. Its far smaller number of SVs is in stark contrast to all experimental counterparts, while its prediction accuracy remains decent compared with the six classic nonlinear SVMs.

The rest of the paper is organized as follows. Section 2 gives a broad outline of L_0/1-SVM and then presents the two bases for modelling the nonlinear L_0/1-KSVM: the kernel method and the representation theorem. The whole framework of the nonlinear SVM with 0-1 soft margin loss, called L_0/1-KSVM and the topic of this paper, is explicitly investigated in Section 3. Lastly, Section 4 demonstrates by experiments the advantages of L_0/1-KSVM in strong sparsity and robustness compared with L_0/1-SVM and six other classic nonlinear SVMs.

2 Preliminary knowledge

Firstly, we briefly review the linear L_0/1-SVM model and its optimality condition; the following part then presents the kernel method and the representation theorem, which are vital for establishing our final model L_0/1-KSVM. For convenience, some notations are listed in Table 1.

Notation   Description
u_+        The vector with (u_+)_i = max(u_i, 0)
‖u‖_0      The number of non-zero elements in u
Table 1: List of notations.

2.1 Linear L_0/1-SVM

Now we give a brief review of the L_0/1-SVM of Wang et al. [19]. Firstly, they rewrote the SVM with 0-1 loss in the following matrix form

min_{w,b,u} (1/2)‖w‖² + C‖u_+‖_0  subject to  u + Aw + by = 1,    (1)

where A stacks the labelled training points, and the hyperplane or decision function has the form f(x) = ⟨w, x⟩ + b. Notice that u here is obtained under the condition that the feature map is the identity map.

The optimality condition of (1) is displayed in Theorem 1, which builds on an important concept, the P-stationary point. We review them in turn:

Definition 1

[19] [P-stationary point of (1)] For a given C > 0, we say (w*, b*, u*) is a proximal stationary (P-stationary) point of (1) if there exist a Lagrangian multiplier vector λ* and a constant γ > 0 such that




Here, Prox denotes the proximal operator of the 0-1 loss term.

Theorem 1

[19] For problem (1), the following relations hold.

(i)  Assume A has full column rank. For a given C > 0, if (w*, b*, u*) is a global minimizer of (1), then it is a P-stationary point of (1) for suitable γ, whose bound involves the maximum eigenvalue of the associated data matrix.

(ii) For a given C > 0, if (w*, b*, u*) is a P-stationary point of (1) with γ > 0, then it is a local minimizer of (1).

The theorem states that a P-stationary point must be a local minimizer under certain conditions, so [19] used the P-stationary point as a termination rule for the iterates generated by L_0/1-ADMM, the algorithm designed to find a numerical solution of (1). Additionally, they introduced a new characterization of the SV and verified that SVs only lie on the parallel decision hyperplanes ⟨w, x⟩ + b = ±1, which is expected to guarantee relative robustness and high efficiency thanks to strong insensitivity to outliers.

In [19], L_0/1-SVM achieves rather shorter computational time and far fewer SVs, both theoretically and experimentally, compared with other leading linear SVM classifiers. However, we point out some barriers to extending the whole modelling framework from the linear form to the nonlinear form.
(i) Theoretically, the dimension of the nonlinear or kernel representation is always larger than the number of training points. Hence the full-column-rank requirement in Theorem 1 of [19] will not be satisfied once we introduce a nonlinear mapping or the kernel trick to handle nonlinear SVM problems.
(ii) Algorithmically, L_0/1-ADMM, designed to solve the linear L_0/1-SVM, needs the data points expressed explicitly in the iteration of w; this is hindered because in most cases we do not know the mapped points themselves but only their dot products.

To address these challenges, we search for a new nonlinear model that fits the whole linear framework just reviewed; the following subsection presents the essential basis for constructing it.

2.2 Kernel Method and Representation Theorem

The kernel method has the kernel function as its core, and the representation theorem is also closely bound to the kernel function, which is the basic element for constructing the Hilbert space in which the representation theorem applies.

When no separating hyperplane exists for linearly inseparable data points, an ingenious solution is to transfer the original data to a new separable domain H via a proper map φ, where H is a dot product space. H is high-dimensional, sometimes even infinite-dimensional; moreover, most nonlinear SVM algorithms only need the dot products in the new domain, not the mapped points themselves. Therefore, a technique called the kernel method is introduced to alleviate the computational burden in H. The kernel function, playing the key role in this method, is defined as follows:

Definition 2 (Kernel function)

[23] For a given binary function k : X × X → R, if there exists a map φ from X to a dot product space H such that k(x, x′) = ⟨φ(x), φ(x′)⟩ for all x, x′ ∈ X, then k is called a kernel function and φ is called its feature map.

The most frequently used kernel functions include the polynomial kernel [24] k(x, x′) = (⟨x, x′⟩ + c)^d, the Gaussian kernel [25][26] k(x, x′) = exp(−‖x − x′‖²/(2σ²)), and the sigmoid kernel [27] k(x, x′) = tanh(κ⟨x, x′⟩ + c).
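As an illustration, these three kernels can be coded directly; the parameter names (c, d, sigma, kappa) below follow common conventions rather than the paper's notation:

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=3):
    """Polynomial kernel (<x, z> + c)^d; c and d are conventional parameters."""
    return (x @ z + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=1.0, c=-1.0):
    """Sigmoid kernel tanh(kappa * <x, z> + c)."""
    return np.tanh(kappa * (x @ z) + c)

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
print(gaussian_kernel(x, x))  # 1.0: any point has maximal similarity to itself
```

Each function returns a scalar similarity between two points; the Gaussian kernel is the one used in the experiments of Section 4.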

Next, based on the kernel function, we introduce a special Hilbert space, the reproducing kernel Hilbert space (RKHS) [28], which is the completion of the inner-product space with all functions k(x, ·) as its elements. The corresponding dot product satisfies the reproducing property ⟨k(x, ·), k(x′, ·)⟩ = k(x, x′).

In the following part, we further review a representation theorem in the RKHS, which is the cornerstone for constructing our proposed nonlinear L_0/1-SVM. In mathematics, there are many closely related variations of the representation theorem. One of them is the Riesz representation theorem [29], which establishes an important connection between a Hilbert space and its continuous dual space. Wahba [30] applied the Riesz representation theorem to the RKHS, showing that the solutions of certain risk minimization problems, involving an empirical risk term and a quadratic regularizer, can be expressed as expansions in terms of the training examples. Schölkopf et al. further generalized the theorem to a larger class of regularizers and empirical risk terms in [31], where the nonparametric formulation of the general theorem is explicitly given as follows:

Theorem 2 (Nonparametric Representation Theorem)

[31] Suppose we are given a nonempty set X, a positive definite real-valued kernel k on X × X, a training sample (x_1, y_1), …, (x_m, y_m) ∈ X × R, a strictly monotonically increasing real-valued function g on [0, ∞), an arbitrary cost function c : (X × R²)^m → R ∪ {∞}, and a class of functions

F = { f : f(·) = Σ_i β_i k(z_i, ·), β_i ∈ R, z_i ∈ X, ‖f‖ < ∞ }.

Here, ‖·‖ is the norm in the RKHS associated with k, i.e. for any z_i ∈ X, β_i ∈ R,

‖Σ_i β_i k(z_i, ·)‖² = Σ_i Σ_j β_i β_j k(z_i, z_j).    (4)

Then any f ∈ F minimizing the regularized risk functional

c((x_1, y_1, f(x_1)), …, (x_m, y_m, f(x_m))) + g(‖f‖)

admits a representation of the form

f(·) = Σ_{i=1}^m α_i k(x_i, ·),

where α_i ∈ R are the coefficients of f in the RKHS.
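A function of the form guaranteed by Theorem 2 is fully determined by its coefficients and the training points, so evaluating it at a new point needs only kernel values. A minimal sketch (the coefficients below are arbitrary placeholders, not learned values):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

def evaluate_expansion(alpha, X_train, x, kernel):
    """Evaluate f(x) = sum_i alpha_i * k(x_i, x), the representer form."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train))

X_train = np.array([[0.0], [1.0], [2.0]])
alpha = np.array([0.5, -1.0, 0.5])  # placeholder coefficients
fx = evaluate_expansion(alpha, X_train, np.array([1.0]), gaussian_kernel)
```

This is exactly the mechanism that lets a kernel model avoid the explicit feature map: only kernel evaluations against the training points are required.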

3 L_0/1-KSVM

In this section, we formally propose a nonlinear L_0/1-SVM, termed L_0/1-KSVM, based on the kernel method and Theorem 2; its degeneration property, first-order optimality condition and solving algorithm are then explicitly explored.

3.1 Modeling of L_0/1-KSVM

Given a data set composed of input-output pairs {(x_i, y_i), i = 1, …, m}, the nonlinear kernel L_0/1-SVM classifier, termed L_0/1-KSVM, is obtained by solving the following non-convex problem:


where the kernel matrix K is associated with the chosen kernel function on the modified input data set. The decision surface of the L_0/1-KSVM classifier evaluates the kernel expansion at the transformed test point.

Next we make some explanatory notes about the modeling of L_0/1-KSVM through Theorem 2. In (5), we specify the cost function and regularizer as


Here, C > 0 is the penalty parameter.
Therefore, the regularization problem related to the nonlinear kernel SVM with 0-1 loss has the form:


Since, by Theorem 2, (7) has a solution of the form f(·) = Σ_i α_i k(x_i, ·), we plug it back into (7) and apply the norm equality (4); the final formulation (3.1) of L_0/1-KSVM is then obtained.

At the end of this subsection, a significant proposition reveals the degeneration relationship between the newly-built nonlinear L_0/1-KSVM and a slightly modified edition of L_0/1-SVM.

Proposition 1 (Degeneration of L_0/1-KSVM to L_0/1-SVM)

The nonlinear L_0/1-KSVM (3.1) with a linear kernel can degenerate to a slightly modified edition of the linear L_0/1-SVM (1), namely (1) with an extra regularization term added to its objective function.

The proof of Proposition 1 is in A.1.

3.2 First-Order Optimality Condition

In this subsection, the first-order optimality condition for the L_0/1-KSVM problem (3.1) is explored; the process parallels the work in [19]. To proceed, we first define the pillar for constructing the first-order optimality condition, the proximal stationary point of L_0/1-KSVM:

Definition 3 (P-stationary point of L_0/1-KSVM)

For a given C > 0, we say a point is a proximal stationary (P-stationary) point of L_0/1-KSVM if there exist a Lagrangian multiplier vector and a constant γ > 0 such that




Here, Prox denotes the proximal operator of the 0-1 loss term, as in Definition 1.

Based on the P-stationary point, the optimality condition of L_0/1-KSVM is concluded in the following theorem:

Theorem 3

For the L_0/1-KSVM problem (3.1), the following relations hold.

(i)  Assume the kernel-based coefficient matrix is invertible. For a given C > 0, if a point is a global minimizer of L_0/1-KSVM, then it is a P-stationary point of L_0/1-KSVM for suitable γ, whose bound involves the smallest eigenvalue of that matrix.

(ii) For a given C > 0, if a point is a P-stationary point of L_0/1-KSVM with γ > 0, then it is a local minimizer of L_0/1-KSVM.

The proof of Theorem 3 is in A.2.

In the following subsection, the classical ADMM algorithm is employed to find a numerical solution of our nonlinear L_0/1-SVM problem; we term the resulting method KSVM-ADMM. The SV of L_0/1-KSVM will be defined and used to select the working set in the updates of all sub-problems, which undoubtedly speeds up the whole iteration and makes the termination condition satisfied more rapidly.

3.3 KSVM-ADMM Algorithm via Selection of Working Set

3.3.1 L_0/1-KSVM Support Vector

Suppose we have a P-stationary point, hence a local minimizer of L_0/1-KSVM. From the second equation of (8), we define the index set


and its complementary set. By (9), we have

This leads to


Then, from the definition of the index set and (11), we have


Combining with (11) yields


Taking (12) into the function of the decision surface of L_0/1-KSVM, we derive

Following the concept of SV in [1], we define the training points indexed by this set as the L_0/1-KSVM SVs; it thus becomes the index set of SVs.

Again, for the indices in the SV set, we obtain


That is to say, the L_0/1-KSVM SVs must lie on the support surfaces. As far as we know, only the hard-margin SVM and the L_0/1-SVM have such a characteristic. This geometrical property is evidence that the SVs in L_0/1-KSVM are arranged orderly and probably sparsely; moreover, it also demonstrates the idealness of the 0-1 loss in SVM algorithms.

3.3.2 KSVM-ADMM Algorithm

Firstly, to the best of our knowledge, a closed-form solution of L_0/1-KSVM is hard to obtain because of its non-convexity, though its existence can be verified with techniques similar to those in [19]. In this subsection, the ADMM algorithm is applied to the augmented problem of (3.1) to search for a numerical solution:


where the Lagrangian multiplier and the penalty parameter enter the augmented Lagrangian. Given the k-th iteration, the algorithm framework can be described as


where the dual step-size governs the multiplier update. Here,

the so-called proximal term contains a symmetric matrix properly chosen to guarantee the convexity of the corresponding subproblem of (15).
(i) Updating u: The u-subproblem in (15) is equivalent to the following problem

where the third equality follows from the definition of the proximal operator. Define the working set


and its complementary set; then we obtain


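The hard-thresholding character of this proximal step can be sketched as follows, assuming the componentwise closed form derived for the linear case in [19] carries over, with `lam` standing in for the lumped penalty/step-size constant:

```python
import numpy as np

def prox_01(z, lam):
    """Componentwise proximal operator of lam * ||(.)_+||_0 (hard thresholding).

    Entries in (0, sqrt(2 * lam)] are zeroed; all others are kept. This is
    the closed form derived for the linear case in [19]; we assume the same
    form applies here, with lam a lumped penalty/step-size constant.
    """
    t = np.sqrt(2.0 * lam)
    out = z.copy()
    out[(z > 0) & (z <= t)] = 0.0
    return out

z = np.array([-1.0, 0.3, 2.0])
print(prox_01(z, lam=0.5))  # threshold = 1.0, so 0.3 is zeroed: [-1.  0.  2.]
```

The entries that survive (or are zeroed) by this step are exactly what the working set tracks from iteration to iteration.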
(ii) Updating the expansion coefficients.


An auxiliary quantity is introduced during the updating process, and thus we derive the corresponding subproblem of (15) as


and it can be simplified as


We then update the expansion coefficients by applying the Conjugate Gradient (CG) method [32] to (20) for efficiency.
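Since (20) is a linear system with a symmetric positive definite matrix, a generic CG solver suffices; the sketch below is a textbook implementation, not specialized to the matrix in (20):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A by the CG method [32]."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter or n):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)  # residual ||Ax - b|| is driven below tol
```

In exact arithmetic CG terminates in at most as many iterations as there are distinct eigenvalues of the matrix, which is the fact used in the complexity analysis below.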

(iii) Updating the multiplier. Based on (15), we have


Summarizing the updates (16) and (17), the solution of (20) and (21), we obtain Algorithm 1, which we call KSVM-ADMM, an abbreviation for L_0/1-KSVM solved by ADMM.

  Initialize the variables and the multiplier. Choose the parameters and set k = 0.
  while the halting condition does not hold and k is below the maximum iteration number do
     Update the working set as in (16).
     Update u by (17).
     Update the expansion coefficients by the CG method on (20).
     Update the multiplier by (21).
     Set k = k + 1.
  end while
  return the solution to (3.1).
Algorithm 1: KSVM-ADMM for solving L_0/1-KSVM (3.1)
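The structure of Algorithm 1 can be sketched as a generic three-block loop; the sub-problem updates are passed in as callables because their exact closed forms, (16)-(17), (20) and (21), are model-specific (the contraction stubs in the demo are placeholders, not the real updates):

```python
import numpy as np

def ksvm_admm(update_u, update_v, update_lam, u0, v0, lam0,
              max_iter=100, tol=1e-6):
    """Generic three-block ADMM loop mirroring Algorithm 1.

    The callables stand in for the model-specific updates: the working-set
    prox step (16)-(17), the CG solve of (20), and the multiplier step (21).
    Iteration stops when successive changes fall below tol.
    """
    u, v, lam = u0, v0, lam0
    for _ in range(max_iter):
        u_new = update_u(u, v, lam)
        v_new = update_v(u_new, v, lam)
        lam = update_lam(u_new, v_new, lam)
        change = max(np.linalg.norm(u_new - u), np.linalg.norm(v_new - v))
        u, v = u_new, v_new
        if change < tol:
            break
    return u, v, lam

# Demo with contraction stubs (placeholders, not the real updates):
u, v, lam = ksvm_admm(lambda u, v, l: 0.5 * u,
                      lambda u, v, l: 0.5 * v,
                      lambda u, v, l: l,
                      np.ones(3), np.ones(3), np.zeros(3))
```

Separating the loop from the update rules highlights the ADMM design choice: each block is easy on its own, and only the alternation couples them.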

3.3.3 KSVM-ADMM Convergence and Complexity Analysis

Firstly, since directly solving the SVM with 0-1 loss is an NP-hard problem, we admit that only limited material is available to guide the analysis of the solution system. Moreover, the ADMM algorithm, our chosen path to compute L_0/1-KSVM, has an intrinsic drawback for convergence analysis because of its multi-step, multi-factor structure compared with other popular methods such as gradient algorithms. Further, the working set selected in the updates of all sub-problems varies in each iteration, which undoubtedly exacerbates the instability of the whole calculation process. Even worse, the algorithm must address a non-convex and discontinuous problem, rather than optimization problems with a general class of friendlier non-convex losses, as in [33][34][35][36][37]. Still, for the sequence generated by the ADMM algorithm, we obtain a convergence conclusion similar to the linear case in [19]:

Theorem 4

Suppose we have a limit point of the sequence generated by KSVM-ADMM; then it is a locally optimal solution of L_0/1-KSVM.

The proof of Theorem 4 can be found in A.3.

Next we present the computational complexity analysis of KSVM-ADMM.

  • Updating the working set by (16) requires only componentwise operations and is cheap relative to the kernel computations below.

  • The main term in computing u by (17) is the kernel matrix K, whose construction takes about O(m²n) operations, where m is the number and n the dimension of the data points.

  • To update the expansion coefficients, forming the linear system in (20) dominates the cost. Moreover, the CG method needs a number of iterations bounded by the number of distinct eigenvalues of the system matrix [38]. The per-step complexity of this update follows accordingly.

  • Similar to the coefficient update, the kernel-matrix product is the most expensive computation in (21) for deriving the multiplier, supposing K has been calculated beforehand.

Overall, the whole computational complexity of each step of KSVM-ADMM in Algorithm 1 is dominated by the construction of, and multiplication with, the kernel matrix.

An obvious fact is that the kernel matrix brings a heavy burden on computational time; this is caused by the curse of kernelization [39] and remains a big open challenge in kernel SVM training problems.

4 Numerical experiments

The experiments on our proposed algorithm KSVM-ADMM for L_0/1-KSVM are divided into two parts, corresponding to two performance baselines: the linear L_0/1-SVM and six other leading nonlinear SVM classifiers. Ten UCI data sets, whose detailed information is listed in Table 2, are selected for all experiments, which are run in MATLAB (2019a) on a laptop with 32GB of memory and an Intel Core i7 2.7GHz CPU. In addition, all features in each data set are scaled to a common range, and the Gaussian kernel is chosen in all experiments.

Data set             Number (m)   Features (n)   Rate (+/−)
Breast (bre)         699          9              34/66
Echo (ech)           131          10             33/67
Heartstatlog (hea)   270          13             56/44
Housevotes (hou)     435          16             61/39
Hypothyroid (hyp)    3163         25             5/95
Monk3 (mon)          432          6              50/50
PimaIndian (pim)     768          8              35/65
TwoNorm (two)        7400         20             50/50
Waveform (wav)       5000         21             33/67
WDBC (wdb)           198          34             37/63
Table 2: Descriptions of 10 UCI data sets.

Theorem 3 allows us to take the P-stationary point as a stopping criterion: we terminate the algorithm when the current iterate approximately satisfies the conditions in (8), namely,

where tol is the tolerance level and

Our evaluation of classification performance is based on 10-fold cross validation; that is, each data set is randomly split into ten parts, one of which is used for testing while the remaining nine are used for training. We therefore report the mean accuracy (mACC), the mean number of support vectors (mNSV) and the mean CPU time (mCPU) as the performance criteria for all models. On the test samples, the testing accuracy is defined by

where the index set involved is the SV set of L_0/1-KSVM.
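The 10-fold protocol and the accuracy criterion can be sketched as follows; the classifier passed in is a stub (majority vote) standing in for the trained L_0/1-KSVM decision rule:

```python
import numpy as np

def ten_fold_accuracy(X, y, train_and_predict, n_folds=10, seed=0):
    """Mean test accuracy over a random 10-fold split: nine folds train,
    one fold tests, exactly as in the protocol described above."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    accs = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        y_pred = train_and_predict(X[train], y[train], X[test])
        accs.append(np.mean(y_pred == y[test]))
    return float(np.mean(accs))

def majority_stub(X_train, y_train, X_test):
    """Stub classifier: always predicts the majority training label."""
    label = 1 if np.sum(y_train == 1) >= np.sum(y_train == -1) else -1
    return np.full(len(X_test), label)

X = np.zeros((40, 2))
y = np.array([1] * 30 + [-1] * 10)
acc = ten_fold_accuracy(X, y, majority_stub)  # 0.75: fraction of majority labels
```

Averaging fold accuracies with equal fold sizes is equivalent to pooling all test predictions, which is why the stub recovers the majority-class fraction exactly.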

4.1 Comparisons between linear L_0/1-SVM and L_0/1-KSVM

The optimal parameters in the two ADMM algorithms are picked out through standard 10-fold cross validation over the same range. In addition, as to the other parameter options in KSVM-ADMM, we fix the initial points, the maximum iteration number and the tolerance level tol; in the algorithm L_0/1-ADMM, the corresponding parameters, maximum iteration number and tolerance level are set in the same manner. The linear and nonlinear L_0/1-SVM results can be found in Table 3. We conclude that L_0/1-KSVM performs clearly better than its linear formulation on the majority of the 10 data sets in terms of mACC and mNSV, with the bold numbers indicating superiority. Every data set except wdb attains a higher mACC with our proposed nonlinear model; the data set mon raises its accuracy by a surprising 25%, and wav reduces its error rate by nearly 4.2%. As to mNSV, L_0/1-KSVM also greatly outperforms its linear peer except on hea. In particular, the number of SVs in the nonlinear formulation is less than a fifth of that of its linear opponent on the data sets hyp, two and ech. This stronger sparsity is a small surprise, considering that most convex nonlinear SVMs produce more SVs than their corresponding linear forms. We surmise that the higher accuracy of L_0/1-KSVM and the special geometrical distribution of SVs in the 0-1 loss SVM are responsible for this unusual phenomenon. However, the curse of kernelization hinders the computational efficiency of KSVM-ADMM, which costs much more time than L_0/1-ADMM.

Data set   Category     mACC     mNSV     mCPU (seconds)
bre        L_0/1-SVM    0.9568   15.55    0.0465
           L_0/1-KSVM   0.9677   9.82     0.6453
ech        L_0/1-SVM    0.9042   23.35    0.0381
           L_0/1-KSVM   0.9186   7.76     0.0601
hea        L_0/1-SVM    0.8252   18.67    0.0349
           L_0/1-KSVM   0.8374   26.76    0.0912
hou        L_0/1-SVM    0.9458   27.85    0.0297
           L_0/1-KSVM   0.9584   23.93    0.2323
hyp        L_0/1-SVM    0.9796   139.93   0.1253
           L_0/1-KSVM   0.9816   25.07    20.7255
mon        L_0/1-SVM    0.7272   83.55    0.0178
           L_0/1-KSVM   0.9723   78.73    0.2546
pim        L_0/1-SVM    0.7647   267.18   0.1710
           L_0/1-KSVM   0.7723   176.54   0.8658
two        L_0/1-SVM    0.9770   309.77   0.3979
           L_0/1-KSVM   0.9776   16.81    116.4893
wav        L_0/1-SVM    0.8539   740.53   0.3969