The least squares support vector machine (LSSVM) was introduced by Suykens [Suykens1999] and has become a powerful learning technique for classification and regression. It has been used successfully in many real-world pattern recognition problems, such as disease diagnosis [Duygu2011], fault detection [Long2014], image classification [Yang2015], solving partial differential equations [Mehrkanoon2015] and visual tracking [Gao2016]. LSSVM minimizes the least squares errors on the training samples. Compared with other SVMs, LSSVM is based on equality constraints rather than inequality constraints, so its solution is obtained in closed form by solving a system of linear equations instead of iteratively solving a quadratic programming (QP) problem. Training an LSSVM is therefore simpler than training other SVMs.
However, LSSVM has two main drawbacks. First, it is sensitive to outliers: outliers have large support values (the values of the Lagrange multipliers), so they influence the construction of the decision function more than other samples do. Second, the solution of LSSVM lacks sparseness, which limits the method on large-scale problems.
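Both the closed-form training and the two drawbacks can be illustrated with a minimal sketch. It assumes the standard function-estimation form of LSSVM, in which the multipliers and the bias solve one bordered linear system. The names `rbf_kernel`, `lssvm_fit`, `C` and `sigma`, and the 2*sigma**2 kernel scaling, are illustrative choices rather than notation taken from this paper.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """Gaussian kernel matrix; the 2*sigma**2 scaling is an assumption."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def lssvm_fit(K, y, C):
    """Solve [[0, 1^T], [1, K + I/C]] [b; alpha] = [0; y] in one shot."""
    m = len(y)
    A = np.zeros((m + 1, m + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(m) / C
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]            # support values alpha, bias b

# Toy regression data with one gross outlier at x = 0 (index 10).
x = np.linspace(-1.0, 1.0, 21)
y = np.sin(np.pi * x)
y[10] += 5.0
K = rbf_kernel(x[:, None], x[:, None], sigma=1.0)
alpha, b = lssvm_fit(K, y, C=10.0)
residual = y - (K @ alpha + b)        # training errors of the fitted model
largest = int(np.argmax(np.abs(alpha)))
```

In this formulation each support value equals `C` times the training error of its sample, so the sample with the largest error (here the outlier) receives the largest support value, and in general no support value is exactly zero.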
To reduce the sensitivity of LSSVM to outliers, Suykens et al. [Suykens2002] proposed the weighted LSSVM (W-LSSVM) model, which puts small weights on less important samples or outliers to reduce their influence on the model. Other weight-setting strategies have also been proposed; see [Valyon2003, You2011]. Theoretical analyses and experimental results indicate that these methods are robust to outliers. However, they must first solve the original LSSVM in order to set the weights, so none of them is suitable for training on large-scale problems. Another approach to robustness uses non-convex loss functions. Based on the truncated least squares loss function, Wang et al. [KuainiWang2014] and Yang et al. [XiaoweiYang2014] presented the robust LSSVM (R-LSSVM) model. Experimental results show that the R-LSSVM model significantly reduces the effect of outliers. However, the solutions of R-LSSVM produced by Yang’s and Wang’s algorithms both lack sparseness, and both algorithms need to pre-compute the whole kernel matrix and a matrix inverse, so they are time-consuming on large-scale data sets. They cannot even handle data sets containing more than 10,000 training samples on common computers.
There are also methods that promote the sparsity of LSSVM. Suykens et al. [Suykens2000, J.A.K.Suykens2002] proposed a pruning algorithm that iteratively removes a small fraction of samples (5%) with the smallest support values to impose sparseness. In this pruning algorithm, LSSVM must be retrained on the reduced training set in each iteration, which leads to a large computational cost. The fixed-size least squares support vector machine (FS-LSSVM) [Suykens2002] is another sparse algorithm. In this algorithm, some support vectors (SVs), referred to as prototype vectors, are fixed in advance and are then replaced iteratively by samples randomly selected from the training set, based on the quadratic Rényi entropy criterion. However, in each iteration this method computes the entropy only of the samples selected in the working set rather than of the whole data set, which may lead to sub-optimal solutions. Jiao et al. [Jiao2007] presented the fast sparse approximation for LSSVM (FSA-LSSVM), in which an approximated decision function is built iteratively by adding basis functions from a kernel-based dictionary one by one until the criterion is satisfied. This algorithm obtains sparse classifiers at a rather low cost. However, the experimental results in [sszhou2016] show that, under a very sparse setting, FSA-LSSVM does not perform well on some training data sets. Zhou [sszhou2016] proposed the pivoting Cholesky of primal LSSVM (PCP-LSSVM), an iterative method based on incomplete pivoting Cholesky factorization of the kernel matrix. Theoretical analyses and experimental results indicate that PCP-LSSVM can obtain acceptable test accuracy with an extremely sparse solution.
In this paper, we aim to obtain a sparse solution of the R-LSSVM model so as to overcome the two drawbacks of LSSVM simultaneously. The new algorithm solves R-LSSVM in the primal space, as Zhou [sszhou2016] did for LSSVM. Our main contributions can be summarized as follows:
By introducing an equivalent form of the truncated least squares loss function, we show that R-LSSVM is equivalent to a re-weighted LSSVM model, which explains the robustness of R-LSSVM.
We show that the representer theorem also holds for the non-convex loss function, and propose the primal R-LSSVM model, which has a sparse solution if the kernel matrix is low rank.
We propose the sparse R-LSSVM algorithm, which obtains a sparse solution of R-LSSVM by applying a low-rank approximation of the kernel matrix. The complexity of the new algorithm is lower than that of the existing non-sparse R-LSSVM algorithms.
Extensive experiments demonstrate that the proposed algorithm can process large-scale problems efficiently.
The rest of the paper is organized as follows. Section 2 briefly describes R-LSSVM and its existing algorithms. In Section 3, the robustness of R-LSSVM is interpreted from a re-weighted viewpoint. In Section 4, the primal R-LSSVM and its smooth version are discussed, the novel sparse algorithm is proposed, and the convergence and complexity of the new algorithm are analyzed. Section 5 presents experiments that show the efficiency of the proposed algorithm. Section 6 concludes the paper.
2 Robust LSSVM model and the existing algorithms
In this section, we briefly summarize the R-LSSVM and the existing algorithms.
2.1 Robust LSSVM
Consider a training set with pairs of samples , where are the input data and or are the output targets corresponding to the inputs for classification or regression problems. The classical LSSVM model is described as follows:
where is the regularization parameter, is the normal of the hyperplane, is the bias, is a map that takes the input into a high-dimensional feature space so that nonlinear learning problems can be handled, and is the least squares loss with being the prediction error.
By replacing in (1) with the truncated least squares loss :
where is the truncation parameter, which controls the errors of the outliers. Fig. 2 plots the in (2) with , the least squares loss and the difference between them . It is clear that the losses of the outliers (samples with larger errors) are bounded by , which reduces the effect of the outliers in R-LSSVM. We investigate the robustness of R-LSSVM from a re-weighted viewpoint in Section 3.
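The relation between the three curves just described can be checked numerically. The sketch below assumes the truncated loss caps the squared error at the square of the truncation parameter. That cap, the names `tau` and `excess_over_cap`, and the sample errors are illustrative assumptions.

```python
def least_squares_loss(xi):
    """Plain squared error."""
    return xi * xi

def truncated_ls_loss(xi, tau):
    """Assumed form of the truncated loss: the squared error is capped at tau**2."""
    return min(xi * xi, tau * tau)

def excess_over_cap(xi, tau):
    """Convex term whose subtraction from xi**2 yields the truncated loss."""
    return max(0.0, xi * xi - tau * tau)

tau = 2.0
errors = [-5.0, -2.0, -0.5, 0.0, 1.0, 3.0]
# Each row: error, least squares loss, truncated loss, difference between them.
table = [
    (xi, least_squares_loss(xi), truncated_ls_loss(xi, tau), excess_over_cap(xi, tau))
    for xi in errors
]
```

Under this assumed form the truncated loss is a convex function minus a convex function, and samples with errors beyond `tau` all contribute the same bounded amount.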
2.2 Existing algorithms for R-LSSVM
R-LSSVM can then be transformed into a difference of convex (DC) program:
Wang et al. [KuainiWang2014] and Yang et al. [XiaoweiYang2014] solve the DC program (5) by the Concave-Convex Procedure (CCCP). Although their methods differ, both reduce to iteratively solving the following linear equations (6).
where is the positive semi-definite kernel matrix satisfying , ,
is a identity matrix,, , and is the value of at the -th iteration satisfying
where , is the -th row of the kernel matrix .
By iteratively solving (6) with respect to and until convergence, the output decision function is .
One limitation of these two algorithms is that the solution lacks sparseness, because the coefficient matrix of (6) is a nonsingular, symmetric, dense matrix and the right-hand side vector is dense. Hence both algorithms train slowly and cannot handle large-scale problems efficiently.
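A CCCP loop of this kind can be sketched as follows. This is an illustration under assumptions, not a reproduction of either published algorithm or of system (6): it assumes the loss caps the squared error at `tau**2`, linearizes the subtracted convex part at the current errors, and solves a dense LSSVM system whose right-hand side is shifted by `s`. The function name and toy data are hypothetical.

```python
import numpy as np

def rlssvm_cccp(K, y, C, tau, max_iter=50, tol=1e-8):
    """Dense CCCP sketch for a truncated-loss LSSVM (assumed formulation)."""
    m = len(y)
    L = np.linalg.cholesky(K + np.eye(m) / C)    # the matrix never changes

    def solve_H(rhs):
        return np.linalg.solve(L.T, np.linalg.solve(L, rhs))

    ones = np.ones(m)
    eta = solve_H(ones)
    s = np.zeros(m)                              # s = 0 gives the plain LSSVM
    for _ in range(max_iter):
        nu = solve_H(y - s)
        b = (ones @ nu) / (ones @ eta)           # enforces sum(alpha) = 0
        alpha = nu - b * eta
        residual = alpha / C + s                 # equals y - (K @ alpha + b)
        s_new = np.where(np.abs(residual) > tau, residual, 0.0)
        if np.linalg.norm(s_new - s) <= tol * (1.0 + np.linalg.norm(s)):
            break
        s = s_new
    return alpha, b

# Toy data: a line with two gross outliers, linear kernel.
x = np.linspace(-1.0, 1.0, 21)
truth = 2.0 * x + 1.0
y = truth.copy()
y[3] += 10.0
y[15] -= 10.0
K = np.outer(x, x)
a_plain, b_plain = rlssvm_cccp(K, y, C=100.0, tau=3.0, max_iter=1)  # one pass = LSSVM
a_rob, b_rob = rlssvm_cccp(K, y, C=100.0, tau=3.0)
err_plain = np.max(np.abs(K @ a_plain + b_plain - truth))
err_rob = np.max(np.abs(K @ a_rob + b_rob - truth))
```

Every iteration solves with the same dense m-by-m matrix and returns a dense multiplier vector, which is the cost and sparsity limitation described above.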
3 Robustness of R-LSSVM from a re-weighted viewpoint
Wang et al. [KuainiWang2014] illustrate the robustness of R-LSSVM only through experiments. Yang et al. [XiaoweiYang2014] explain it through the relationship between the solutions of R-LSSVM and W-LSSVM [J.A.K.Suykens2002]. In this section, we show the robustness of R-LSSVM from a re-weighted viewpoint [Feng2016].
By the representer theorem in section 4.1, R-LSSVM can be translated into the following model in primal space without the implicit feature map :
can be expressed as
Any stationary point of R-LSSVM (9) can be obtained by solving an iteratively re-weighted LSSVM as follows:
where is the value of -th iteration of the weight .
where . Since is nonconvex, only a stationary point of the preceding minimization problem can be expected. Let be one of the stationary points of (9). By the analysis above, there exists such that is the solution of (14). On the other hand, if is any stationary point of (14), then also solves (9). Hence, we can solve (14) iteratively by the alternating direction method (ADM) [He2012] as follows:
Since denotes the prediction error, similar to the robustness analysis in [Feng2016], the larger is, the more likely the instance pair is to be an outlier. From (12) and (13), we observe that when is sufficiently large for an outlier instance , the corresponding weight in (13) is 0. That is, the truncated least squares loss function reduces the influence of samples that are far away from their true targets. This explains the robustness of R-LSSVM from the re-weighted viewpoint.
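The alternating scheme can be illustrated on a deliberately tiny problem. The sketch below replaces the LSSVM model with a single location parameter `mu` so that the weighted least squares step is just a weighted mean. The 0/1 weight rule (weight 1 while the error stays within `tau`, weight 0 beyond it) is an assumed reading of the weight update, and all names and data are illustrative.

```python
def truncation_weights(residuals, tau):
    """Assumed 0/1 rule: a sample keeps weight 1 only while |error| <= tau."""
    return [1.0 if abs(r) <= tau else 0.0 for r in residuals]

def reweighted_location(y, tau, max_iter=100):
    """Alternate between updating weights and solving a weighted least squares fit.

    Targets a stationary point of sum_i min((y_i - mu)**2, tau**2).
    """
    mu = sum(y) / len(y)                      # plain least squares start
    weights = [1.0] * len(y)
    for _ in range(max_iter):
        new_weights = truncation_weights([yi - mu for yi in y], tau)
        total = sum(new_weights)
        if total == 0.0:                      # every sample flagged: keep estimate
            break
        new_mu = sum(w * yi for w, yi in zip(new_weights, y)) / total
        converged = new_weights == weights and abs(new_mu - mu) < 1e-12
        mu, weights = new_mu, new_weights
        if converged:
            break
    return mu, weights

samples = [1.0, 1.2, 0.8, 1.1, 0.9, 10.0]     # the last value is an outlier
mu_hat, w_hat = reweighted_location(samples, tau=2.0)
```

The outlier ends with weight 0 and no longer affects the estimate, which mirrors the robustness argument made for R-LSSVM.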
4 Sparse R-LSSVM algorithm
In this section, we give the primal R-LSSVM and propose the sparse algorithm to obtain the sparse solution of the R-LSSVM.
4.1 Primal R-LSSVM
If the loss function is convex, as in the LSSVM model (1), then by duality theory the optimal solution can be represented as
where . If the loss function is nonconvex, strong duality does not hold, so (17) cannot be obtained by duality. However, by the representer theorem in [Scholkopf2001, Shai2014], it is easy to prove that (17) also holds.
where is the same as (7) with .
However, the computation of is not simple, since is non-differentiable at some points. Inspired by the idea in [ShuishengZhou2013], we smooth by the entropy penalty function. Let
then we have whenever . is the smooth approximation of , and the upper bound of the difference between and is . In practice, if we set sufficiently large such as , the difference between them can be neglected. Fig. 2 shows the comparison between and the smoothed truncated least squares loss function with .
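The effect of such a smoothing can be checked numerically. The sketch below assumes the truncated loss caps the squared error at `tau**2` and smooths the kink with the entropy-type function (1/p) log(1 + exp(p z)), which exceeds max(0, z) by at most log(2)/p. Whether this matches the paper's smoothed loss term by term is an assumption, as are the names and the value of `p`.

```python
import math

def softplus(z, p):
    """(1/p) * log(1 + exp(p*z)), evaluated without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(p * z))) / p

def truncated_ls_loss(xi, tau):
    """Assumed truncated loss: squared error capped at tau**2."""
    return min(xi * xi, tau * tau)

def smoothed_truncated_loss(xi, tau, p):
    """Smooth surrogate: xi**2 minus a smoothed max(0, xi**2 - tau**2)."""
    return xi * xi - softplus(xi * xi - tau * tau, p)

tau, p = 1.0, 100.0
grid = [i / 50.0 for i in range(-150, 151)]          # errors in [-3, 3]
gaps = [truncated_ls_loss(x, tau) - smoothed_truncated_loss(x, tau, p) for x in grid]
max_gap = max(gaps)
bound = math.log(2.0) / p
```

The largest gap occurs at the kink, where the error equals `tau` in magnitude, and shrinks in proportion to 1/p, so a large `p` makes the two losses practically indistinguishable.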
Yang et al. [XiaoweiYang2014] also adopt a smoothing procedure, but their method has to tune the smoothing parameter to get the best effect, which complicates parameter selection. In comparison, our smoothing strategy based on the entropy penalty function does not require tuning such a parameter; we only need to set a large value for in (22).
4.2 Sparse solution for Primal R-LSSVM
At first sight, (23) seems more complicated than (6). However, the coefficient matrix of (6) is a nonsingular, symmetric, dense matrix, which leads to a non-sparse solution of (6). In comparison, the coefficient matrix of (23) may be low rank if the related kernel matrix is low rank or is approximated by a low-rank matrix. In this situation, (23) may have a sparse solution, which partly overcomes the limitation of the previous methods.
Now we discuss the sparse solution of (23) when the kernel matrix can be approximated by a low-rank matrix.
The Nyström approximation is one of the most popular methods for obtaining a low-rank approximation of the kernel matrix (see [Williams2001, Petros2005, Zhang2010, Si2016] and the references therein). The choice of low-rank approximation method is not the focus of this paper. For simplicity, we employ Zhou’s pivoting Cholesky factorization method [sszhou2016]. Let corresponding to the indices of the landmark points, be the sub-matrix of whose elements are for and , and be defined similarly. By the pivoting Cholesky factorization method in [sszhou2016], we can obtain the full column rank matrix satisfying as the best rank- Nyström type approximation of under the trace norm, and throughout the process only the selected columns and the diagonal of the kernel matrix are needed. If is obtained by some other low-rank approximation method [Williams2001, Petros2005, Zhang2010, Si2016], let and the following analysis is the same.
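A generic greedy pivoted Cholesky factorization of this type can be sketched as follows. It is a textbook version, not the code of [sszhou2016]. The function name, the callable `get_column` and the random test matrix are illustrative assumptions.

```python
import numpy as np

def pivoted_cholesky(get_column, diag, rank, tol=1e-10):
    """Greedy pivoted Cholesky: returns P (m x r) and pivot indices with K ~ P P^T.

    Only the diagonal of K and the columns at the chosen pivots are touched.
    """
    d = np.array(diag, dtype=float)           # residual diagonal
    P = np.zeros((d.shape[0], rank))
    pivots = []
    for j in range(rank):
        i = int(np.argmax(d))                 # largest remaining diagonal entry
        if d[i] <= tol:
            break
        col = get_column(i)
        P[:, j] = (col - P[:, :j] @ P[i, :j]) / np.sqrt(d[i])
        d = np.maximum(d - P[:, j] ** 2, 0.0)
        pivots.append(i)
    return P[:, :len(pivots)], pivots

# Toy check on an exactly rank-3 positive semi-definite matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 3))
K = A @ A.T
P2, B2 = pivoted_cholesky(lambda i: K[:, i], np.diag(K), rank=2)
P3, B3 = pivoted_cholesky(lambda i: K[:, i], np.diag(K), rank=3)
# Nystrom approximation built from the same two landmark columns.
nystrom2 = K[:, B2] @ np.linalg.solve(K[np.ix_(B2, B2)], K[B2, :])
```

In this toy check the rank-2 factor reproduces the Nyström approximation built from the same landmark columns, the rank-3 factor recovers the matrix, and the rows of the factor at the pivot indices form a lower triangular block.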
where is an identity matrix. By permuting the rows of matrix , we get , where is a full rank matrix (and is lower triangular if is obtained as in [sszhou2016], hence is computed with cost instead of ), and consists of the remaining rows of . Correspondingly, let , then we have
is the sparse solution of (25), where
So the sparse R-LSSVM (SR-LSSVM) algorithm is obtained by iteratively updating as follows:
4.3 Sparse R-LSSVM algorithm
From the above analysis, our SR-LSSVM algorithm is listed as Algorithm 1.
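The overall idea of Sections 4.1 to 4.3 can be sketched under simplifying assumptions. This is not Algorithm 1 itself. The sketch assumes a low-rank factor `P` with landmark indices `pivots` is already available, uses the non-smoothed loss capped at `tau**2` with a CCCP-style update instead of the smoothed loss, and recovers multipliers supported on the landmarks by solving with the landmark rows of `P`. All names and the toy data are hypothetical.

```python
import numpy as np

def sparse_rlssvm(P, pivots, y, C, tau, max_iter=50, tol=1e-8):
    """Primal robust fit on low-rank features, then a landmark-only multiplier vector."""
    m, r = P.shape
    Z = np.hstack([P, np.ones((m, 1))])          # last column carries the bias
    reg = np.eye(r + 1) / C
    reg[r, r] = 0.0                              # the bias is not penalized
    M = Z.T @ Z + reg                            # (r+1) x (r+1), fixed across iterations
    s = np.zeros(m)
    for _ in range(max_iter):
        theta = np.linalg.solve(M, Z.T @ (y - s))
        residual = y - Z @ theta
        s_new = np.where(np.abs(residual) > tau, residual, 0.0)
        if np.linalg.norm(s_new - s) <= tol * (1.0 + np.linalg.norm(s)):
            break
        s = s_new
    beta, b = theta[:r], theta[r]
    alpha = np.zeros(m)
    alpha[pivots] = np.linalg.solve(P[pivots, :].T, beta)   # P_B^T alpha_B = beta
    return alpha, b

# Toy data: a line with two gross outliers; kernel k(x, z) = x*z + 1 has rank 2.
x = np.linspace(-1.0, 1.0, 21)
truth = 2.0 * x + 1.0
y = truth.copy()
y[3] += 10.0
y[15] -= 10.0
A = np.column_stack([x, np.ones_like(x)])
K = A @ A.T
B = [0, 20]                                       # hypothetical landmark indices
Lb = np.linalg.cholesky(K[np.ix_(B, B)])
P = np.linalg.solve(Lb, K[B, :]).T                # Nystrom factor, K = P P^T here
alpha, b = sparse_rlssvm(P, B, y, C=100.0, tau=3.0)
pred = K @ alpha + b
```

Each iteration costs a product with the m-by-r factor and an (r+1)-by-(r+1) solve, and the returned multiplier vector has at most r nonzero entries.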
After obtaining the optimal and by Algorithm 1, the decision function for regression is:
For classification, the decision function is . We give some comments about Algorithm 1.
Comment 2. In equation (28), if we set
then and , so . In step 3 of Algorithm 1, we compute instead of , and the cost of step 3 is decreased further. The output can only be calculated at the last round by .
Comment 3. To promote computational efficiency, Equation (28) can be rewritten as:
where , , is the sparse solution of primal LSSVM, is the index set of nonzero elements of , consists of several rows of , and the indices of these rows in correspond to the elements in , is a vector comprised of the nonzero elements of .
Steps 2 and 3 in Algorithm 1 can then be replaced with the following:
Step 2’: Compute and . Set ;
Comment 4. In Algorithm 1, the parameter limits the upper bound of the loss function. should not be set too large or too small, since an improper results in poor generalization performance. To overcome the sensitivity of the loss function to , we can tune as follows. First, set a somewhat larger , such as , where . Then add the following step between step 3 and step 4 in Algorithm 1: reduce if is small until , where is the minimum of that we set.
4.4 Convergence and Complexity analysis
CCCP is globally or locally convergent; see [Yuille2003, Tao2014, Bharath2012]. Similar to the convergence proof of DCA (DC Algorithm) for general DC programs in [PHAM1997], we have the following lemma.
If the optimal value of the problem (18) is finite, and the infinite sequences and are bounded, then every limit point of the sequence is a generalized KKT point of .
Obviously, the objective functions of (18) and (9) are bounded below. Assume the prediction error variable is bounded, which is reasonable in real applications; then is bounded by (22). So and are also bounded because of the boundedness of , and in (28) and (29). By Lemma 2, we get the following theorem.
For Algorithm 1, the computational costs of step 1 and step 2 are both [sszhou2016]. The complexity of iteratively solving step 3 is , where is the total number of iterations of SR-LSSVM. So the overall complexity of the algorithm is . If we use the technique in Comment 3 to compute , the complexity of step in Algorithm 1 is reduced to , which is the complexity of step . In comparison, the computational complexities of Wang’s and Yang’s R-LSSVM algorithms in [KuainiWang2014] and [XiaoweiYang2014] are both , where is the number of iterations of their algorithms. Our method therefore has lower computational complexity than the existing approaches.
Parallel computing potential. In Algorithm 1, some calculations are cheap, so serial computing is enough for them. For the costly calculations, however, parallel computing can further improve efficiency. The main computational cost of Algorithm 1 comes from computing , which can be implemented in parallel. For example, can be partitioned by rows into chunks satisfying , so , which can be calculated efficiently by a parallel matrix multiplication algorithm, where is the th block of the matrix .
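The row-chunk idea can be sketched as follows, assuming the costly quantity is a Gram-type product of a tall m-by-r factor with its own transpose. That choice of product, the function name and the thread pool are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def gram_in_chunks(P, n_chunks, max_workers=None):
    """Compute P^T P as a sum of per-chunk products over row blocks of P."""
    blocks = np.array_split(P, n_chunks, axis=0)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each block product is independent, so the blocks can be processed concurrently.
        partial_products = list(pool.map(lambda block: block.T @ block, blocks))
    return np.sum(partial_products, axis=0)

rng = np.random.default_rng(1)
P_demo = rng.standard_normal((1000, 5))
G_chunked = gram_in_chunks(P_demo, n_chunks=4)
G_direct = P_demo.T @ P_demo
```

Only the small r-by-r partial products need to be collected and summed, so the same decomposition carries over to processes or machines that each hold one row block.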
5 Numerical experiments and discussions
To examine the validity of the proposed algorithm, we compare our SR-LSSVM with R-LSSVM-W [KuainiWang2014] (Wang’s algorithm for R-LSSVM), R-LSSVM-Y [XiaoweiYang2014] (Yang’s algorithm for R-LSSVM), the classical LSSVM, W-LSSVM [J.A.K.Suykens2002], FS-LSSVM [Suykens2002], which is run in the LS-SVMlab v1.8 software [Brabanter2011_lssvmtoolbox] (codes are available at http://www.esat.kuleuven.be/sista/lssvmlab/), and the SVMs (C-SVC for classification and -SVR for regression) implemented in the LIBSVM software (codes are available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/) on medium-size data sets. For large-scale problems, we compare the proposed algorithm only with sparse algorithms, namely PCP-LSSVM [sszhou2016] (codes and article can be downloaded from http://web.xidian.edu.cn/sszhou/paper.html), FS-LSSVM [Suykens2002], Cholesky with side information (CSI) [Bach_csi2005] (codes are available at http://www.di.ens.fr/fbach/csi/index.html) and C-SVC for classification or -SVR for regression, since the other methods cannot be applied in this case.
All computations are implemented in Matlab R2014a on Windows 8. All experiments are run on a PC with an Intel Core i5-4210U CPU and a maximum of 8 GB of memory available for all processes.
We fix the values of the smoothing parameter in SR-LSSVM and the stopping criterion , respectively. For all the data sets, we use cross-validation and grid search to find the best values of the parameters and , where is the parameter in the Gaussian kernel function , and is the smoothing parameter in R-LSSVM-Y.
For R-LSSVM-W and R-LSSVM-Y, the running times in our experiments are much shorter than those reported in [KuainiWang2014, XiaoweiYang2014] for the same data sets, and the total complexity is reduced from [KuainiWang2014, XiaoweiYang2014] to , because the coefficient matrix of (6) is decomposed by Cholesky factorization once and this decomposition is reused unchanged in every loop.
5.1 Classification experiments
In this section, we test one synthetic classification data set and several benchmark classification data sets to illustrate the effectiveness of SR-LSSVM. For the benchmark data sets, each attribute of the samples is normalized into , and the data sets are separated into two groups: medium-size data sets and large-scale data sets. All of them are downloaded from [lib]. The experimental results on the Adult data set show why we separate the data sets into two groups. Finally, we test the robustness of the proposed algorithm on large-scale data with outliers using the Cod-RNA data set. Outliers are generated by the following procedure: we choose the 30% of samples that are farthest from the decision hyperplane, then randomly sample 1/3 of them and flip their labels.
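The outlier-generation procedure can be sketched as below. The text does not say which classifier supplies the decision hyperplane, so the sketch takes precomputed decision values as an input. The function name, the seed handling and the toy data are assumptions.

```python
import numpy as np

def flip_far_labels(y, decision_values, far_fraction=0.3, flip_fraction=1.0 / 3.0, seed=0):
    """Flip the labels of a random subset of the samples farthest from the hyperplane."""
    y_noisy = np.asarray(y).copy()
    m = len(y_noisy)
    n_far = int(round(far_fraction * m))
    # Indices of the n_far samples with the largest absolute decision value.
    far_idx = np.argsort(-np.abs(decision_values))[:n_far]
    n_flip = int(round(flip_fraction * n_far))
    rng = np.random.default_rng(seed)
    flip_idx = rng.choice(far_idx, size=n_flip, replace=False)
    y_noisy[flip_idx] = -y_noisy[flip_idx]
    return y_noisy, flip_idx

# Toy labels in {-1, +1} with hypothetical decision values from some reference model.
rng_demo = np.random.default_rng(42)
scores = rng_demo.standard_normal(100)
labels = np.where(scores >= 0, 1, -1)
noisy_labels, flipped = flip_far_labels(labels, scores)
n_changed = int(np.sum(noisy_labels != labels))
```

Flipping one third of the farthest 30% of the samples changes 10% of the labels, which matches the outlier ratio reported for the benchmark experiments.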
5.1.1 Synthetic classification dataset experiment
To compare the robustness and sparseness of the four algorithms LSSVM, W-LSSVM, R-LSSVM-Y and SR-LSSVM, we conduct an experiment on a linear binary classification data set with 60 training samples and 100 testing samples. Fig. 3 shows the experimental results. To simulate outliers, we add 4 training samples labeled with the wrong classes. They are marked as ’’ and ’’ for the positive and negative classes respectively. Through grid search, the best parameter values obtained for this data set are .
Fig. 3 illustrates that the decision lines of LSSVM and W-LSSVM change greatly after outliers are added, and that these two methods have lower accuracies than SR-LSSVM and R-LSSVM-Y. In contrast, the decision boundaries of SR-LSSVM and R-LSSVM-Y are almost unchanged, and the accuracies of these two approaches remain stable before and after adding outliers. So SR-LSSVM is insensitive to outliers. Moreover, almost all of the training samples are SVs for LSSVM, W-LSSVM and R-LSSVM-Y. By contrast, SR-LSSVM uses only 2 support vectors on the data sets both with and without outliers. So the proposed algorithm yields sparse solutions, which accelerates training on large-scale problems.
5.1.2 Medium-scale benchmark classification datasets experiments
Pendigits is a pen-based recognition of handwritten digits data set to classify the digits 0 to 9. We only classify the digit 3 versus 4 here.
Protein is a multi-class data set with 3 classes. Here a binary classification problem is trained to separate class 1 from 2.
Satimage consists of 6 classes. Here the task of classifying class 1 versus class 6 is trained.
USPS is a multi-class data set with 10 classes. Here a binary classification problem is trained to separate class 1 from class 2.
Comparison of the numbers of iterations, training time (seconds), mean number of support vectors (denoted by nSVs) and accuracies (%) of different algorithms on benchmark classification data sets with outliers (10%). The standard deviations are given in brackets. ’-’ means this parameter is not used by this method. The best values are highlighted in bold.
Table 1 reports the data information, optimal parameters and experimental results for the medium-scale classification data sets with outliers. The best results are highlighted in bold. In Table 1, we set for SR-LSSVM and FS-LSSVM for all data sets except Splice (). For C-SVC, the parameter . Each algorithm is run independently 10 times to obtain unbiased results.
With regard to accuracy, Table 1 shows that the proposed SR-LSSVM achieves higher accuracy than the compared approaches on most data sets. As to training time, our method is faster than the other approaches except C-SVC. C-SVC trains quickly on some medium-scale data sets, but on larger data sets such as Protein and Mushroom it is slower than SR-LSSVM. In addition, the accuracy of C-SVC is lower than that of SR-LSSVM.
In terms of sparseness, SR-LSSVM and FS-LSSVM need far fewer support vectors than the other approaches; in other words, these two methods yield sparse solutions. However, the accuracy of FS-LSSVM is lower than that of SR-LSSVM, and FS-LSSVM takes more time than SR-LSSVM on all data sets. C-SVC also displays sparsity, but its number of support vectors is much larger than those of SR-LSSVM and FS-LSSVM, partly because there are outliers in the training set.
With respect to the number of iterations needed to solve the nonconvex R-LSSVM program, SR-LSSVM needs fewer iterations than R-LSSVM-W and R-LSSVM-Y to converge.
5.1.3 Adult data set experiments
To investigate the performance of each algorithm on data sets of different sizes, we randomly choose 4000, 8000, 10000, 15000, 20000 and all 32561 training samples from the training set of the Adult data set [lib]. The test set size is 16281.
Fig. 4 shows the experimental results of all the approaches on the data sets with outliers. The horizontal axis of the figure uses a logarithmic scale. As to accuracy on these data sets, SR-LSSVM, C-SVC and FS-LSSVM in general perform better than the other methods, and our method SR-LSSVM performs best. In addition, Fig. 4 shows that on medium-scale training sets, especially those with fewer than 8000 samples, every algorithm runs fast. However, if the training set size exceeds 20000, LSSVM, W-LSSVM, R-LSSVM-W and R-LSSVM-Y cannot run on our computer because of insufficient memory. So for the large-scale benchmark data sets, we do not compare SR-LSSVM with LSSVM, W-LSSVM, R-LSSVM-W and R-LSSVM-Y. Moreover, Fig. 4 also shows that the training time of C-SVC increases rapidly as the number of training samples grows.
5.1.4 Large-scale benchmark classification datasets experiments
Table 2 reports the data information, optimal parameters and experimental results for the large-scale data sets with outliers (10%). We compare our SR-LSSVM with some other sparse algorithms. For the Skin-nonskin data set, we randomly select 2/3 of the data as training samples and use the rest as testing samples; for the others, we use the default setting in [lib]. Each algorithm is run independently 5 times on every data set to obtain unbiased results. The best results are highlighted in bold.