Sparse Least Squares Low Rank Kernel Machines

01/29/2019 ∙ Manjing Fang et al. ∙ The University of Sydney ∙ University of Reading

A general framework of least squares support vector machines with low rank kernels, referred to as LR-LSSVM, is introduced in this paper. The special structure of low rank kernels with a controlled model size brings sparsity as well as computational efficiency to the proposed model. Meanwhile, a two-step optimization algorithm with three different criteria is proposed, and various experiments are carried out using the example of the so-called robust RBF kernel to validate the model. The experimental results show that the performance of the proposed algorithm is comparable or superior to several existing kernel machines.


I Introduction

With the proliferation of big data in scientific and business research, practical nonlinear modeling calls for sparse models and more efficient algorithms. Kernel machines (KMs) have attracted great attention since the support vector machine (SVM), a well-known linear binary classification model built on the principle of risk minimization, was introduced in the early 1990s [1]. In fact, KMs extend the SVM by implementing a linear model in the so-called high dimensional feature space under a feature mapping implicitly determined by a Mercer kernel function. Both SVM and KMs have also been applied to regression problems [2]. Commonly used kernels are the radial basis function (RBF) kernel, the polynomial kernel, and the Fisher kernel [3], among others. As one of the most well-known members of the KM family, the SVM has the advantages of good generalization and insensitivity to overfitting [4].

To date, the Gaussian RBF kernel is the most common choice for SVM in practice. SVM with the RBF kernel has been widely used and has shown superior prediction performance in many areas, such as text categorization [5], image recognition [6], bioinformatics [7], credit scoring [8], time series forecasting [9], and weather forecasting [10]. Text categorization, or text classification, is the task of assigning documents to predefined categories. SVM and KMs work well for this task because the high dimensional text or dense concept representation can easily be mapped into a latent feature space where a linear prediction model is learned with an appropriately chosen kernel function [11]. Experimental results indicate that SVM with the RBF kernel outperforms other classification methods [5]. The superior performance of SVM with the RBF kernel on high dimensional small datasets has also been demonstrated in remote sensing [12], by carefully choosing feature mappings.

The performance of SVM largely depends on the kernel type, and it has been shown that the RBF kernel support vector machine is capable of outperforming other classifiers in various classification scenarios [6, 5, 7]. Nonetheless, in practical nonlinear modeling, SVM with the standard Gaussian RBF kernel has a non-negligible limitation in separating some nonlinear decision boundaries. Thus, the analysis and optimization of the RBF kernel have gained much more popularity than before. The result given in [13] demonstrates that, after introducing an information-geometric data-dependent method to modify a kernel (e.g., the RBF kernel), the performance of SVM is considerably improved. Yu et al. [14] enhance the kernel metrics by adding regularization to kernel machines (e.g., the RBF kernel SVM).

One of the advantages of the standard SVM model is its model sparsity determined by the so-called support vectors; however, the sparsity cannot be pre-determined, and the support vectors have to be learned from the training data by solving a computationally demanding quadratic programming optimization problem [15]. Substantial progress has been made in developing computationally efficient algorithms for SVM models. One example is the introduction of a least squares version of the support vector machine (LSSVM) [16]. Instead of the margin constraints in the standard SVM, LSSVM introduces equality constraints in the model formulation, and the resulting quadratic programming problem can be solved by a set of linear equations [16]. However, LSSVM loses the sparseness offered by the original SVM method, which leads to a kernel model that evaluates the kernel function over all training data and is therefore inferior to the standard SVM model at inference time for large scale data learning. To maintain the sparsity offered by the standard SVM and the equality constraints of LSSVM, researchers considered extending LSSVM with the Ramp loss function to produce sparse models at extra computational cost; see [17]. This strategy has been extended to a more general insensitive loss function in [18]. Recently, Zhu et al. [19] proposed a way to select effective patterns from training datasets for fast support vector regression learning. However, there is no extension to classification problems yet.

The need to deal with large scale datasets motivates exploring new approaches for sparse models under the broad framework of both SVM and KMs. Chen [20] proposed a method for building a sparse kernel model by extending the so-called orthogonal least squares (OLS) algorithm [21] with kernel techniques. The OLS-assisted sparse kernel model offers an efficient learning procedure and demonstrates good performance, particularly in nonlinear system identification. The OLS algorithm relies on a greedy sequential selection of the kernel regressors under an orthogonality requirement, which imposes extra computational cost. Based on the so-called significant vectors, Gao et al. [22] proposed a more straightforward way to learn the significant regressors from the training data for kernel regression modelling. This type of approach has its roots in the relevance vector machine (RVM) [23]. The RVM is implemented under the Bayesian learning framework of kernel machines and has inference performance comparable to the standard SVM with dramatically fewer kernel terms, offering great sparsity.

Almost all the aforementioned modeling methods build models by learning or extracting key data points or patterns from the entire training dataset. Recently, the authors proposed a new type of low rank kernel model based on the so-called simplex basis functions (SBF) [15], successfully building a sparse and fast modeling algorithm and thus lowering the computational cost of LSSVM. The model size is no longer determined by the given training data, while the key patterns are learned straightaway. We further explore this idea and extend it to the so-called robust radial basis functions. The main contributions of this paper are summarized as follows:

  1. Given that the aforementioned models learn data patterns under the regression setting, this paper focuses on the classification setting with a controlled or pre-defined model size;

  2. The kernel function proposed in this paper takes the form of a composition of basic basis components which are adaptive to the training data. This composite form opens the door to a fast closed form solution, avoiding the issue of kernel matrix inversion in the case of large scale datasets;

  3. A new criterion is proposed for the final model selection in terms of pattern parameters of location and scale; and

  4. A two-step optimization algorithm is proposed to simplify the learning procedure.

The rest of this paper is organized as follows. In Section II, we present brief background on several related models. Section III proposes our robust RBF kernel function and its classification model. Section IV describes the artificial and real-world datasets and conducts several experiments to demonstrate the performance of the model and algorithm, and Section V concludes the paper.

II Background and Notation

In this section, we start by introducing the necessary notation for presenting our model and algorithm. We mainly consider binary classification problems. For multi-class classification problems, as usual, the commonly used heuristic approaches of “one-vs-all” or “one-vs-one” can be adopted.

Given a training dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of data points, $\mathbf{x}_i \in \mathbb{R}^d$ is the feature vector and $y_i \in \{-1, +1\}$ is the label of the $i$-th data point, respectively.

KM methods have been used as universal approximators in data modeling. The core idea of KMs is to implement a linear model in a high dimensional feature space $\mathcal{F}$ by using a feature mapping defined as [1]

$$ \boldsymbol{\phi}: \mathbb{R}^d \rightarrow \mathcal{F}, \qquad \mathbf{x} \mapsto \boldsymbol{\phi}(\mathbf{x}), $$

which induces a Mercer kernel function in the input space,

$$ k(\mathbf{x}, \mathbf{x}') = \left\langle \boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}(\mathbf{x}') \right\rangle, $$

where $\langle \cdot, \cdot \rangle$ is the inner product on the feature space $\mathcal{F}$.

In general, an affine linear model of KMs is defined as

$$ f(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + b, \qquad (1) $$

where $b$ is the bias parameter and $\mathbf{w}$ is the parameter vector of high, most likely infinite, dimensionality. It is infeasible to solve for the parameter vector $\mathbf{w}$ directly. Instead, the so-called kernel trick transforms the infinite dimensional problem into a finite dimensional one by relating the parameters to the data as

$$ \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i\, \boldsymbol{\phi}(\mathbf{x}_i). \qquad (2) $$

A learning algorithm then focuses on solving for the parameters $\alpha_1, \ldots, \alpha_N$ and $b$ under an appropriate learning criterion.

For the sake of convenience, define

$$ \boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_N]^\top, \qquad \mathbf{y} = [y_1, \ldots, y_N]^\top, \qquad \mathbf{k}(\mathbf{x}) = [k(\mathbf{x}, \mathbf{x}_1), \ldots, k(\mathbf{x}, \mathbf{x}_N)]^\top. $$

Then, under (2), model (1) can be expressed in terms of the new parameters as^1

$$ f(\mathbf{x}) = (\boldsymbol{\alpha} \odot \mathbf{y})^\top \mathbf{k}(\mathbf{x}) + b, \qquad (3) $$

where $\odot$ means the component-wise product of two vectors.

^1 If we are considering a regression problem, there is no need to include the label vector $\mathbf{y}$ in model (3).

All KM algorithms involve the so-called kernel matrices, as defined below:

$$ \mathbf{K} = \left[ k(\mathbf{x}_i, \mathbf{x}_j) \right]_{i,j=1}^{N} $$

and

$$ \boldsymbol{\Omega} = \mathbf{K} \odot \left( \mathbf{y} \mathbf{y}^\top \right), \qquad \text{i.e.,} \quad \Omega_{ij} = y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j). $$

Both $\mathbf{K}$ and $\boldsymbol{\Omega}$ are symmetric matrices of size $N \times N$.
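For concreteness, the following NumPy sketch assembles both matrices for a Gaussian RBF kernel on a toy dataset; the kernel choice, variable names and data here are illustrative only and not part of the proposed method.

```python
import numpy as np

def rbf_kernel_matrix(X, width=1.0):
    """Gaussian RBF kernel matrix, K[i, j] = exp(-||x_i - x_j||^2 / (2 * width^2))."""
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.clip(sq_dists, 0.0, None) / (2.0 * width**2))

# Toy data: N points in d dimensions with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(rng.random(20) < 0.5, -1.0, 1.0)

K = rbf_kernel_matrix(X, width=1.0)   # N x N kernel matrix
Omega = K * np.outer(y, y)            # label-weighted kernel matrix, Omega_ij = y_i y_j K_ij

assert np.allclose(K, K.T) and np.allclose(Omega, Omega.T)   # both symmetric, size N x N
```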

In the following subsections, the standard SVM, LSSVM and the sparse least squares support vector machine using simplex basis functions (LSSVM-SBF) [15] are outlined.

II-A C-SVM

The standard support vector machine (C-SVM) imposes the so-called maximal margin criterion, inducing a kernel model whose parameters $\boldsymbol{\alpha}$ (and $b$) can be obtained by solving the following dual Lagrangian problem:

$$ \max_{\boldsymbol{\alpha}} \ \mathbf{1}^\top \boldsymbol{\alpha} - \frac{1}{2} \boldsymbol{\alpha}^\top \boldsymbol{\Omega}\, \boldsymbol{\alpha} \quad \text{s.t.} \quad \mathbf{y}^\top \boldsymbol{\alpha} = 0, \ \ 0 \le \alpha_i \le C, \ i = 1, \ldots, N, \qquad (4) $$

where $\mathbf{1}$ is the vector with all ones of appropriate dimension and $C > 0$ is the box penalty parameter. The bias parameter $b$ can be easily calculated from the support vectors [1].

The margin criterion guarantees that the resulting kernel model (3) is sparse, as only those parameters corresponding to the support vectors are non-zero. However, when $N$ is large, solving the convex quadratic programming problem (4) to identify such support vectors is very time consuming.

II-B LSSVM

To reduce the computational complexity of the standard SVM, the least squares support vector machine introduces equality constraints.

The standard LSSVM is formulated as the following programming problem:

$$ \min_{\mathbf{w}, b, \mathbf{e}} \ \frac{1}{2}\|\mathbf{w}\|^2 + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 \qquad (5) $$
$$ \text{s.t.} \quad y_i\left( \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i) + b \right) = 1 - e_i, \quad i = 1, \ldots, N, $$

where $\gamma > 0$ is a penalty parameter.

With the given equality constraints, the Lagrangian multiplier method produces a kernel model (3) such that the parameters $\boldsymbol{\alpha}$ and $b$ are given by the following set of closed form linear equations:

$$ \begin{bmatrix} 0 & \mathbf{y}^\top \\ \mathbf{y} & \boldsymbol{\Omega} + \gamma^{-1} \mathbf{I}_N \end{bmatrix} \begin{bmatrix} b \\ \boldsymbol{\alpha} \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}, \qquad (6) $$

where $\mathbf{I}_N$ is the identity matrix of size $N \times N$. However, the computational hurdle lies in the massive matrix inverse in (6), which has complexity of order $O(N^3)$.
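A minimal sketch of the direct solve of (6), assuming the matrices from the previous sketch are available; it is intended only to make the $O(N^3)$ baseline explicit and is not an optimized solver.

```python
import numpy as np

def lssvm_solve_dense(Omega, y, gamma=100.0):
    """Solve the LSSVM linear system (6):
        [ 0   y^T              ] [ b     ]   [ 0 ]
        [ y   Omega + I/gamma  ] [ alpha ] = [ 1 ]
    Cost is O(N^3) in the number of training points N."""
    N = len(y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]          # b, alpha

# Example, reusing Omega, y and K from the previous sketch:
# b, alpha = lssvm_solve_dense(Omega, y, gamma=100.0)
# decision values on the training set, cf. (3):
# f = K @ (alpha * y) + b
```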

II-C LSSVM-SBF

Despite the closed form solution offered by LSSVM, the model has two main limitations. First, calculating the matrix inverse is computationally demanding, and second, the model is non-sparse, which means that it has to evaluate all possible pairs of system inputs, making the model infeasible for large-sized datasets. Alternatively, we have proposed a novel kernel method referred to as LSSVM-SBF [15], which overcomes these two issues by introducing a symmetric structure into a specially designed kernel function based on the so-called low rank Simplex Basis Function (SBF) kernel.

The SBF is defined as

$$ g_m(\mathbf{x}) = \max\!\left( 0,\ 1 - \sum_{j=1}^{d} \lambda_{m,j}\, \left| x_j - c_{m,j} \right| \right), \quad m = 1, \ldots, M, \qquad (7) $$

where $\mathbf{c}_m = [c_{m,1}, \ldots, c_{m,d}]^\top$ and $\boldsymbol{\lambda}_m = [\lambda_{m,1}, \ldots, \lambda_{m,d}]^\top$ are the center vector of the $m$th SBF, which adjusts its location, and the shape vector of the $m$th SBF, which adjusts its shape, respectively. The proposed new kernel in [15] is defined as

$$ k(\mathbf{x}, \mathbf{x}') = \sum_{m=1}^{M} g_m(\mathbf{x})\, g_m(\mathbf{x}'), \qquad (8) $$

in which the SBF kernel uses only $M$ basis functions; $M$ is the pre-defined model size.

It has been proved in [15] that, under the kernel (8) with the SBF (7), the resulting model is piecewise locally linear with respect to the input. Each local linear piece is determined by the index set of the SBFs that are active (i.e., take non-zero values) at the input; see [15] for the explicit expression.

With the low rank kernel structure defined in (8), the kernel matrix becomes $\boldsymbol{\Omega} = \mathbf{D}_y \mathbf{G} \mathbf{G}^\top \mathbf{D}_y$, where $\mathbf{D}_y = \operatorname{diag}(\mathbf{y})$ and $\mathbf{G} = [\mathbf{g}_1, \ldots, \mathbf{g}_M] \in \mathbb{R}^{N \times M}$, with $\mathbf{g}_m = [g_m(\mathbf{x}_1), \ldots, g_m(\mathbf{x}_N)]^\top$, i.e., the vector of basis function values at the training inputs. By the matrix inversion lemma, the closed form solution (6) for $\boldsymbol{\alpha}$ and $b$ can be rewritten as, see [15],

$$ b = \frac{\mathbf{y}^\top \mathbf{S}\, \mathbf{1}}{\mathbf{y}^\top \mathbf{S}\, \mathbf{y}}, \qquad \boldsymbol{\alpha} = \mathbf{S} \left( \mathbf{1} - b\, \mathbf{y} \right), \qquad (10) $$

where

$$ \mathbf{S} = \left( \boldsymbol{\Omega} + \gamma^{-1} \mathbf{I}_N \right)^{-1} = \gamma\, \mathbf{I}_N - \gamma^2\, \mathbf{D}_y \mathbf{G} \left( \mathbf{I}_M + \gamma\, \mathbf{G}^\top \mathbf{G} \right)^{-1} \mathbf{G}^\top \mathbf{D}_y. $$

The new solution (10) only involves a matrix inverse of size $M \times M$, which is superior to (6), where a system of size $(N+1) \times (N+1)$ has to be solved.
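A minimal sketch of how the low rank structure is exploited computationally: it solves the bordered LSSVM system by block elimination and the matrix inversion (Woodbury) lemma, so that only an $M \times M$ matrix is ever factorized. The helper below is an illustrative reconstruction consistent with (10), not code from [15].

```python
import numpy as np

def lr_lssvm_solve(G, y, gamma=100.0):
    """Solve the LSSVM system when Omega = diag(y) G G^T diag(y) has rank at most M.

    G : (N, M) matrix of basis function values, G[i, m] = g_m(x_i).
    Returns (b, alpha); only an M x M matrix is factorized."""
    N, M = G.shape
    Gy = G * y[:, None]                       # diag(y) @ G
    small = np.eye(M) + gamma * (G.T @ G)     # the only matrix that gets factorized

    def apply_S(v):
        # S v = (Omega + I/gamma)^{-1} v via the matrix inversion (Woodbury) lemma.
        t = np.linalg.solve(small, Gy.T @ v)
        return gamma * v - gamma**2 * (Gy @ t)

    s_one = apply_S(np.ones(N))
    s_y = apply_S(y)
    b = (y @ s_one) / (y @ s_y)               # block elimination enforces y^T alpha = 0
    alpha = s_one - b * s_y
    return b, alpha

# Example on random low rank data (M << N):
rng = np.random.default_rng(1)
N, M = 200, 4
G = rng.random((N, M))
y = np.where(rng.random(N) < 0.5, -1.0, 1.0)
b, alpha = lr_lssvm_solve(G, y, gamma=50.0)
```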

III The Proposed Model and Its Algorithm

From subsection II-C, we have found that the special choice of the low rank SBF kernel defined in (7) and (8) brings model efficiency. To extend the idea of using low rank kernels, in this section we propose a general framework for a fast algorithm and validate it with several examples.

We would like to emphasize that our idea of using low rank kernels is inspired by the original low rank kernel approximations such as the Nyström approximation [24]. However, the standard low rank kernel methods aim to approximate a given kernel function, while our approach learns the (basis) functions and constructs a kernel with a composite structure in order to enable fast algorithms.

III-A The Low Rank Kernels and Models

Consider $M$ learnable “basis” functions

$$ g_m(\mathbf{x}; \boldsymbol{\theta}_m), \quad m = 1, \ldots, M, \qquad (11) $$

with adaptable parameters $\boldsymbol{\theta}_m$ ($m = 1, \ldots, M$). In the case of the SBF in (7), we have $\boldsymbol{\theta}_m = (\mathbf{c}_m, \boldsymbol{\lambda}_m)$, i.e., $2dM$ parameters in total.

As another example, we will consider the so-called robust RBF

$$ g_m(\mathbf{x}) = \exp\!\left( -\sum_{j=1}^{d} \lambda_{m,j}\, \left| x_j - c_{m,j} \right| \right), \quad m = 1, \ldots, M. \qquad (12) $$

Similar to the SBF, while $c_{m,j}$ determines the location of $g_m$ in the $j$th dimensional direction, $\lambda_{m,j} > 0$ restricts the sharpness of $g_m$ in the $j$th dimension. In fact, the SBF (7) can be regarded as the first order approximation of the robust RBF in terms of its exponent. We expect the robust RBF to have better modeling capability.
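A minimal NumPy sketch of the two basis function families as written in (7) and (12); the function names, array shapes and vectorization here are illustrative rather than part of the proposed algorithm.

```python
import numpy as np

def robust_rbf(X, C, Lam):
    """Robust RBF basis values, G[i, m] = exp(-sum_j Lam[m, j] * |X[i, j] - C[m, j]|).

    X : (N, d) inputs; C : (M, d) centres; Lam : (M, d) positive shape parameters."""
    absdiff = np.abs(X[:, None, :] - C[None, :, :])        # (N, M, d)
    return np.exp(-np.sum(Lam[None, :, :] * absdiff, axis=2))

def simplex_bf(X, C, Lam):
    """SBF basis values: first order approximation of the robust RBF, truncated at zero."""
    absdiff = np.abs(X[:, None, :] - C[None, :, :])
    return np.maximum(0.0, 1.0 - np.sum(Lam[None, :, :] * absdiff, axis=2))

# The induced low rank kernel (13) between two sets of points X1 and X2 would be
# robust_rbf(X1, C, Lam) @ robust_rbf(X2, C, Lam).T
```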

More generally, each learnable basis function $g_m(\mathbf{x}; \boldsymbol{\theta}_m)$ can be a deep neural network. We leave this for future study.

Given a set of learnable basis functions (11), define a finite dimensional feature mapping

$$ \mathbf{g}(\mathbf{x}) = \left[ g_1(\mathbf{x}), \ldots, g_M(\mathbf{x}) \right]^\top. $$

This feature mapping naturally induces the following learnable low rank kernel

$$ k(\mathbf{x}, \mathbf{x}') = \mathbf{g}(\mathbf{x})^\top \mathbf{g}(\mathbf{x}') = \sum_{m=1}^{M} g_m(\mathbf{x})\, g_m(\mathbf{x}'). \qquad (13) $$

Consider the “linear” model $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{g}(\mathbf{x}) + b$ and define the following low rank LSSVM (LR-LSSVM):

$$ \min_{\mathbf{w}, b, \mathbf{e}} \ \frac{1}{2}\|\mathbf{w}\|^2 + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 \qquad (14) $$
$$ \text{s.t.} \quad y_i\left( \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i) + b \right) = 1 - e_i, \quad i = 1, \ldots, N. $$

The LR-LSSVM problem takes the same form as the standard LSSVM (5); however, our low rank kernel carries a composite structure and is learnable through its adaptable parameters. In the following subsections, we propose a two-step alternating algorithm to solve the LR-LSSVM.

III-B Solving LR-LSSVM with Fixed Feature Mappings

When all the feature mappings are fixed, problem (14) reduces to the standard LSSVM. Denote $\mathbf{g}(\mathbf{x}_i) = [g_1(\mathbf{x}_i), \ldots, g_M(\mathbf{x}_i)]^\top$ and consider the Lagrangian function

$$ L(\mathbf{w}, b, \mathbf{e}, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i\left( \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i) + b \right) - 1 + e_i \right], $$

where $\alpha_i$ are the Lagrange multipliers for all the equality constraints. We now optimize out $\mathbf{w}$, $b$ and $\mathbf{e}$ to give

$$ \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i\, \mathbf{g}(\mathbf{x}_i) = \mathbf{G}^\top \mathbf{D}_y \boldsymbol{\alpha}, \qquad (15) $$
$$ \mathbf{y}^\top \boldsymbol{\alpha} = 0, \qquad (16) $$
$$ \alpha_i = \gamma\, e_i, \quad i = 1, \ldots, N, \qquad (17) $$

where

$$ \mathbf{G} = \left[ g_m(\mathbf{x}_i) \right] \in \mathbb{R}^{N \times M}, \qquad \mathbf{D}_y = \operatorname{diag}(y_1, \ldots, y_N). \qquad (18) $$

Furthermore, setting the partial derivative with respect to each Lagrange multiplier to zero gives

$$ y_i\left( \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i) + b \right) = 1 - e_i, \quad i = 1, \ldots, N. \qquad (19) $$

Taking (15) into (19), together with (16) and (17), we eliminate $\mathbf{w}$ and $\mathbf{e}$ and obtain a linear system in $b$ and $\boldsymbol{\alpha}$ of the same form as (6), with $\boldsymbol{\Omega} = \mathbf{D}_y \mathbf{G} \mathbf{G}^\top \mathbf{D}_y$.

After a long algebraic manipulation, the solution for the dual problem is given in closed form. Denote by $\widetilde{\mathbf{G}}$ the matrix with one row of all zeros on the top of the matrix $\mathbf{D}_y \mathbf{G}$; then the solution can be expressed as

(20)

Applying the matrix inversion formula to (20) results in exactly the same solution as (10). Once $\boldsymbol{\alpha}$ and $b$ are worked out, the final model can be written as

$$ f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i y_i\, \mathbf{g}(\mathbf{x}_i)^\top \mathbf{g}(\mathbf{x}) + b. \qquad (21) $$

Define

$$ \boldsymbol{\beta} = \left[ \beta_1, \ldots, \beta_M \right]^\top = \mathbf{G}^\top \mathbf{D}_y \boldsymbol{\alpha} = \sum_{i=1}^{N} \alpha_i y_i\, \mathbf{g}(\mathbf{x}_i), $$

which can be calculated after $\boldsymbol{\alpha}$ is known; then (21) can be expressed in the sparse form of size $M$:

$$ f(\mathbf{x}) = \boldsymbol{\beta}^\top \mathbf{g}(\mathbf{x}) + b = \sum_{m=1}^{M} \beta_m\, g_m(\mathbf{x}) + b. \qquad (22) $$
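A minimal sketch of how the sparse form (22) is used for prediction, assuming the `robust_rbf` helper from the earlier sketch is in scope; only $M$ basis evaluations are needed per test point.

```python
import numpy as np

def fit_sparse_coefficients(G_train, y, alpha):
    """beta = G^T diag(y) alpha, i.e. beta_m = sum_i alpha_i * y_i * g_m(x_i)."""
    return G_train.T @ (alpha * y)

def predict(X_new, C, Lam, beta, b):
    """Evaluate the sparse model (22), f(x) = beta^T g(x) + b, and return class labels."""
    G_new = robust_rbf(X_new, C, Lam)   # only M basis evaluations per test point
    f = G_new @ beta + b
    return np.sign(f), f
```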

III-C Training Learnable Low Rank Kernels

Given $\boldsymbol{\alpha}$ and $b$, which are solved for by the closed-form solution in the first step, we estimate the kernel parameters $(\mathbf{c}_m, \boldsymbol{\lambda}_m)$ ($m = 1, \ldots, M$) using a gradient descent algorithm. The algorithm seeks to maximize the magnitude of the model outputs, which leads to an overall larger distance from the model outputs to the existing decision boundary. Taking the robust RBF functions (12) as an example, this objective function can be expressed as

$$ J_1 = \sum_{i=1}^{N} \left| f(\mathbf{x}_i) \right|. \qquad (23) $$

Another (targeted) objective function gives similar results to (23).

Denote $\boldsymbol{\theta}_m = (\mathbf{c}_m, \boldsymbol{\lambda}_m)$. Given the objective function above, we have

$$ \frac{\partial J_1}{\partial \boldsymbol{\theta}_m} = \sum_{i=1}^{N} \operatorname{sign}\!\big( f(\mathbf{x}_i) \big)\, \frac{\partial f(\mathbf{x}_i)}{\partial \boldsymbol{\theta}_m}, \qquad (24) $$

in which

$$ \frac{\partial f(\mathbf{x}_i)}{\partial \boldsymbol{\theta}_m} = \beta_m\, \frac{\partial g_m(\mathbf{x}_i)}{\partial \boldsymbol{\theta}_m}, \qquad (25) $$

where the required partial derivatives

$$ \frac{\partial g_m(\mathbf{x})}{\partial c_{m,j}} \quad \text{and} \quad \frac{\partial g_m(\mathbf{x})}{\partial \lambda_{m,j}} \qquad (26) $$

are calculated by, for $j = 1, \ldots, d$,

$$ \frac{\partial g_m(\mathbf{x})}{\partial c_{m,j}} = \lambda_{m,j}\, \operatorname{sign}(x_j - c_{m,j})\, g_m(\mathbf{x}), \qquad (27) $$
$$ \frac{\partial g_m(\mathbf{x})}{\partial \lambda_{m,j}} = -\,\left| x_j - c_{m,j} \right|\, g_m(\mathbf{x}), \qquad (28) $$

where $g_m(\mathbf{x})$ is defined in (12).

Meanwhile, we should also consider the positivity constraints on the shape parameter vectors $\boldsymbol{\lambda}_m$, and thus we have the following constrained normalized gradient update, for $m = 1, \ldots, M$,

$$ \boldsymbol{\theta}_m \leftarrow \Pi\!\left[ \boldsymbol{\theta}_m + \eta\, \frac{\partial J_1 / \partial \boldsymbol{\theta}_m}{\left\| \partial J_1 / \partial \boldsymbol{\theta}_m \right\|} \right], \qquad (29) $$

where $\eta > 0$ is a preset learning rate and $\Pi$ projects the shape parameters $\boldsymbol{\lambda}_m$ back onto positive values. By applying (24) to (29) to each robust RBF unit in turn, while keeping $\boldsymbol{\alpha}$, $b$ and the other RBF units fixed at their current values, we update all RBF kernels.
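To make this step concrete, the sketch below performs one normalized gradient update of all robust RBF units for the absolute value objective (23), using the derivative expressions (24)-(29) above; the per-unit normalization and the simple positivity projection on the shape parameters are illustrative assumptions rather than the exact recipe of the paper.

```python
import numpy as np

def robust_rbf_gradient_step(X, C, Lam, beta, b, eta=0.05, eps=1e-12):
    """One normalized (sub)gradient step on J1 = sum_i |f(x_i)| for all M robust RBF
    units, with f(x) = sum_m beta_m * g_m(x) + b; beta and b are held fixed."""
    absdiff = np.abs(X[:, None, :] - C[None, :, :])            # (N, M, d)
    G = np.exp(-np.sum(Lam[None, :, :] * absdiff, axis=2))     # (N, M) basis values
    f = G @ beta + b
    s = np.sign(f)                                             # subgradient of |f(x_i)|

    # Common factor s_i * beta_m * g_m(x_i), shape (N, M).
    w = s[:, None] * G * beta[None, :]
    # dJ1/dc_{m,j}   =  sum_i w[i,m] * lam_{m,j} * sign(x_{i,j} - c_{m,j})
    grad_C = np.einsum('im,imj->mj', w,
                       Lam[None, :, :] * np.sign(X[:, None, :] - C[None, :, :]))
    # dJ1/dlam_{m,j} = -sum_i w[i,m] * |x_{i,j} - c_{m,j}|
    grad_Lam = -np.einsum('im,imj->mj', w, absdiff)

    # Per-unit normalized updates; keep the shape parameters strictly positive.
    C_new = C + eta * grad_C / (np.linalg.norm(grad_C, axis=1, keepdims=True) + eps)
    Lam_new = np.maximum(
        Lam + eta * grad_Lam / (np.linalg.norm(grad_Lam, axis=1, keepdims=True) + eps), eps)
    return C_new, Lam_new
```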

III-D Initialization of Robust Radial Basis Functions

As shown in (22), the model requires a preset kernel model size $M$ and a set of initial kernel parameters $\boldsymbol{\theta}_m$, $m = 1, \ldots, M$. In the case of robust RBFs, both $\mathbf{c}_m$ and $\boldsymbol{\lambda}_m$ need to be initialized. The initial center vectors $\mathbf{c}_m$ can be obtained using a clustering algorithm. We adopt a $k$-medoids algorithm to find the robust RBF centers, since it is more robust to unbalanced distributions of data. It divides the data points into $M$ subsets $S_1, \ldots, S_M$ and iteratively adjusts the center of each subset until convergence, minimizing the clustering objective

$$ \sum_{m=1}^{M} \sum_{\mathbf{x}_i \in S_m} d(\mathbf{x}_i, \mathbf{c}_m), \qquad (30) $$

where $d(\cdot, \cdot)$ is a dissimilarity measure and the center $\mathbf{c}_m$ of each subset is constrained to be a member of that subset. As for the initial values of the shape parameters $\boldsymbol{\lambda}_m$, we preset them to a predetermined constant for all basis functions, e.g., all ones.
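A compact k-medoids sketch for the centre initialization; this is a plain alternating variant (assign, then re-pick medoids) rather than the full PAM algorithm, and the Manhattan dissimilarity is an illustrative choice.

```python
import numpy as np

def k_medoids(X, M, n_iter=50, seed=0):
    """Simple alternating k-medoids: assign points to the nearest medoid, then pick
    the member of each cluster that minimizes the within-cluster dissimilarity."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    D = np.sum(np.abs(X[:, None, :] - X[None, :, :]), axis=2)   # pairwise L1 distances
    medoids = rng.choice(N, size=M, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for m in range(M):
            members = np.where(labels == m)[0]
            if members.size == 0:
                continue
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[m] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return X[medoids], medoids

# Initial centres for the robust RBFs; shape parameters start at a constant, e.g. 1:
# C0, _ = k_medoids(X_train, M=4)
# Lam0 = np.ones_like(C0)
```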

III-E The Overall Algorithm and Its Complexity

Algorithm 1^2 summarizes the overall procedure of LR-LSSVM using the example of the robust RBF kernel. The algorithm starts with the k-medoids clustering algorithm of Section III-D for initializing the robust RBF centres; then the fast LSSVM solution of Section III-B is computed and the gradient algorithm of Section III-C or III-F is applied, and the two steps alternate for a predefined number of iterations. A simple complexity analysis indicates that the overall computational complexity is dominated by the gradient descent algorithm for training the learnable basis functions, which, for a fixed model size $M$, grows only linearly with the number of training samples $N$, scaled by the iteration number. Many examples in Section IV show that a small model size $M$ gives competitive model prediction performance; in this sense, the newly proposed algorithm scales linearly with the size of the training set. The lower complexity benefits from the special structure of the low rank kernel functions. It should be pointed out again that the proposed framework contains the SBF model in [15] as a special case, and that the framework admits more generic extensions, for example using deep neural networks as the learnable kernel functions.

^2 The algorithm can be easily adapted to any learnable kernels.

0:  Input: Dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$; model size $M$; regularization parameter $\gamma$; initial shape parameter $\lambda_0$; iteration numbers.
0:  Output: The obtained model parameters $b$, $\boldsymbol{\beta}$, and $(\mathbf{c}_m, \boldsymbol{\lambda}_m)$ for $m = 1, \ldots, M$.
1:  Apply the k-medoids clustering algorithm to initialize $\mathbf{c}_m$ ($m = 1, \ldots, M$). Set all elements of $\boldsymbol{\lambda}_m$ to the constant $\lambda_0$.
2:  for each outer iteration do
3:     Form $\mathbf{G}$ from the dataset and the current kernel parameters $(\mathbf{c}_m, \boldsymbol{\lambda}_m)$ for $m = 1, \ldots, M$;
4:     Construct $\widetilde{\mathbf{G}}$ by adding one row of zeros on the top of the matrix $\mathbf{D}_y\mathbf{G}$;
5:     Update $\boldsymbol{\alpha}$ and $b$ according to the closed form solution (10);
6:     for each inner iteration do
7:        Apply (24) to (29) to adjust $(\mathbf{c}_m, \boldsymbol{\lambda}_m)$, $m = 1, \ldots, M$;
8:     end for
9:  end for
Algorithm 1 The proposed LR-LSSVM algorithm with the robust RBF kernel
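Putting the steps together, a minimal driver following the structure of Algorithm 1, assuming the helper functions from the earlier sketches (`k_medoids`, `robust_rbf`, `lr_lssvm_solve`, `fit_sparse_coefficients`, `robust_rbf_gradient_step`) are in scope; the loop counts and learning rate are illustrative defaults.

```python
import numpy as np

def train_lr_lssvm(X, y, M=4, gamma=100.0, lam0=1.0,
                   n_outer=10, n_inner=5, eta=0.05):
    """Two-step alternating training of LR-LSSVM with the robust RBF kernel."""
    C, _ = k_medoids(X, M)                          # initialize centres by k-medoids
    Lam = np.full_like(C, lam0)                     # constant initial shape parameters
    for _ in range(n_outer):
        G = robust_rbf(X, C, Lam)                   # basis values for the current kernel
        b, alpha = lr_lssvm_solve(G, y, gamma)      # closed form LSSVM step, cf. (10)
        beta = fit_sparse_coefficients(G, y, alpha) # sparse coefficients for model (22)
        for _ in range(n_inner):                    # gradient updates of the kernel parameters
            C, Lam = robust_rbf_gradient_step(X, C, Lam, beta, b, eta)
    # One extra solve so that (beta, b) match the final kernel parameters
    # (a detail not spelled out in Algorithm 1).
    G = robust_rbf(X, C, Lam)
    b, alpha = lr_lssvm_solve(G, y, gamma)
    beta = fit_sparse_coefficients(G, y, alpha)
    return {"C": C, "Lam": Lam, "beta": beta, "b": b}
```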

III-F The Differentiable Objective Functions

The objective defined in (23) is non-differentiable. For the purpose of maximizing the magnitude of the model outputs, we propose the following squared objective, which is differentiable:

$$ J_2 = \sum_{i=1}^{N} f(\mathbf{x}_i)^2. \qquad (31) $$

Then, according to (21), we can write (31) in matrix form as

$$ J_2 = \left\| \mathbf{G}\boldsymbol{\beta} + b\,\mathbf{1} \right\|^2. \qquad (32) $$

It is not hard to prove that

$$ \frac{\partial J_2}{\partial \mathbf{G}} = 2\left( \mathbf{G}\boldsymbol{\beta} + b\,\mathbf{1} \right) \boldsymbol{\beta}^\top, \qquad (33) $$

and the chain rule gives

$$ \frac{\partial J_2}{\partial \theta_{m,j}} = \operatorname{tr}\!\left[ \left( \frac{\partial J_2}{\partial \mathbf{G}} \right)^{\!\top} \frac{\partial \mathbf{G}}{\partial \theta_{m,j}} \right], \qquad (34) $$

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, $\theta_{m,j}$ means either $c_{m,j}$ or $\lambda_{m,j}$, and the non-zero entries of $\partial \mathbf{G} / \partial \theta_{m,j}$ are obtained by the element-wise products in (27) and (28). Combining (33) and (34) gives

$$ \frac{\partial J_2}{\partial \theta_{m,j}} = 2\,\beta_m \sum_{i=1}^{N} f(\mathbf{x}_i)\, \frac{\partial g_m(\mathbf{x}_i)}{\partial \theta_{m,j}}, \qquad (35) $$

where the partial derivatives of $g_m$ are given by (27) and (28); this gradient replaces (24) in the normalized update (29).

IV Experimental Studies

IV-A Example 1: Synthetic Dataset

For the synthetic data set in [25], the dimension of the input space is $d = 2$, and the training and test sample sets are of size 250 and 1000, respectively. In this example, three types of models are constructed to compare classification performance using the misclassification rate metric. For the LSSVM with Gaussian RBF kernel models, the steepness parameter is set in the range 0.5-3 with step 0.5, while the shrinkage parameter is set to 5000 in all cases. For the LR-LSSVM-SBF model, the parameters are preset to . For our proposed LR-LSSVM-robust RBF models with the absolute value, squared and targeted objective functions, the parameters are set to ; and respectively.

Fig. 1: Experimental results for the synthetic dataset: (a) decision boundary of the Gaussian SVM (); (b) decision boundary of LSSVM-SBF; (c) decision boundary of LR-LSSVM using the absolute value objective function; (d) decision boundary of LR-LSSVM using the squared objective function.

From the classification results shown in TABLE I, we can see that the proposed LR-LSSVM-robust RBF and LR-LSSVM-SBF models consistently perform best, with misclassification rates of around 8%, while the Gaussian RBF kernel models perform fairly poorly in this case. In Fig. 1, we can see that the decision boundary of LSSVM with the Gaussian RBF kernel is relatively curved and nonlinear, whereas those of the SBF and robust RBF models are piecewise linear.

Model                          Testing Misclassification Rate (%)   Model Size
LSSVM-Gaussian ()              11.40                                250
LSSVM-Gaussian ()              9.20                                 250
LSSVM-Gaussian ()              10.40                                250
LSSVM-Gaussian ()              10.10                                250
LSSVM-Gaussian ()              10.10                                250
LSSVM-Gaussian ()              9.80                                 250
LSSVM-SBF                      8.30                                 4
Proposed Model (abs obj.)      8.00                                 3
Proposed Model (square obj.)   8.30                                 3
Proposed Model (target obj.)   8.00                                 3
TABLE I: The misclassification rate on the synthetic data
Models                         Titanic                   Diabetes                  German Credit
                               MR (%)        Model Size  MR (%)        Model Size  MR (%)        Model Size
RBF                            23.3 ± 1.3    4           24.3 ± 2.3    15          24.7 ± 2.4    8
Adaboost with RBF              22.6 ± 1.2    4           26.5 ± 1.9    15          27.5 ± 2.5    8
AdaBoostReg                    22.6 ± 1.2    4           23.8 ± 1.8    15          24.3 ± 2.1    8
LPReg-AdaBoost                 24.0 ± 4.4    4           24.1 ± 1.9    15          24.8 ± 2.2    8
QPReg-AdaBoost                 22.7 ± 1.1    4           25.4 ± 2.2    15          25.3 ± 2.1    8
SVM with RBF kernel            22.4 ± 1.0    n/a         23.5 ± 1.7    n/a         23.6 ± 2.1    n/a
LSSVM-SBF                      22.5 ± 0.8    2           23.5 ± 1.7    5           24.9 ± 1.9    3
Proposed Model (abs obj.)      22.3 ± 0.8    2           23.8 ± 1.7    5           25.6 ± 2.3    2
Proposed Model (square obj.)   22.6 ± 1.5    3           23.5 ± 2.0    4           24.7 ± 1.9    2
Proposed Model (target obj.)   22.4 ± 0.8    2           24.7 ± 2.0    5           25.6 ± 2.4    2
TABLE II: The misclassification rate (MR, mean ± standard deviation, in %) and model size on the Titanic, Diabetes and German Credit datasets

IV-B Example 2: Titanic Dataset

For the Titanic data set in [26], there are 100 realizations, each with 150 training samples and 2051 test samples, respectively. The original data has an input dimension of 3. We compare the prediction accuracy of various AdaBoost-based models and the LR-LSSVM models over the test samples. For the LR-LSSVM-SBF model, the parameters are set to , while for the proposed models with the absolute value, squared and targeted objective functions, the parameters are set as , , , , ; , , , , and , , , , respectively.

The results of the proposed models are shown in TABLE II (columns 2 & 3), together with the first six other results quoted from [26] and the seventh result quoted from [15]. Generally, LR-LSSVM-SBF and the proposed LR-LSSVM models with the robust RBF kernel outperform the other models, and all the LR-LSSVM models are sparse with only 2 terms (except for the model with the squared objective function). Also, we can observe that the LR-LSSVM models with the absolute value and targeted objective functions have similar prediction results. Overall, the proposed models with the absolute value and targeted objective functions perform best, with the lowest misclassification rate and standard deviation, and since the final model size of the robust RBF kernels is only 2, it is easy for the models to explain the data.

IV-C Example 3: Diabetes Dataset

For the diabetes data set in [26], there are 100 groups of training and test samples, with the size of the training set equal to 468 and the size of the test set equal to 300. The input space of this example is of dimension 8. Similar to the structure of the Titanic example, here we use ten different models for comparison and the same measurement metric of average misclassification rate. For the LR-LSSVM-SBF model, the parameters are set to , while for the proposed models with the absolute value, squared and targeted objective functions, the parameters are set as ; , , , , and , , , , respectively.

The modeling results in TABLE II (columns 4 & 5) show that the proposed LR-LSSVM-robust RBF models with the absolute value and squared objective functions are competitive among the ten models, with classification accuracy ranking near the top. Moreover, it can be seen that the SBF kernel and the proposed robust RBF kernel bring sparsity into the LR-LSSVM models, which considerably speeds up computation.

IV-D Example 4: German Credit Dataset

Similarly, the German credit dataset in [26] has 100 realizations of training and test sets. Each realization contains 700 training samples and 300 test samples. The original data has 20 features. We evaluate the misclassification rates of our proposed models with the various objective functions and of the LR-LSSVM-SBF model, along with six other models. For the parameters of the LR-LSSVM-SBF model, we set , , , , while for the proposed LR-LSSVM-robust RBF models with the absolute value, squared and targeted objective functions, the parameters are set to for all three cases.

The results of the four models are listed in TABLE II (columns 6 & 7), together with the first six other results quoted from [26]. For this dataset, both LR-LSSVM-SBF and LR-LSSVM-robust RBF do not perform as well as they do on the previous datasets. However, the prediction accuracy together with the standard deviation is still comparable. Additionally, it can be seen that the model sizes of the four models are relatively small compared to those of the other models.

IV-E Summary

Overall, we can notice that the proposed squared objective model performs well on higher dimensional datasets, which include the diabetes and German credit examples in our demonstration, whereas the proposed absolute value and targeted objective models are more suitable for low dimensional inputs, as in the synthetic and Titanic datasets. Moreover, there is no apparent relation between the input dimension and the chosen model size, since across the result tables we can observe that the final selected $M$ is relatively arbitrary in general.

V Conclusions

In this paper we have presented a general framework for the fast LR-LSSVM algorithm and then extended this idea to the novel robust RBF kernel. After initialising the proposed kernel parameters with k-medoids clustering, the training algorithm alternates between the fast least squares closed form solution for $\boldsymbol{\alpha}$ and $b$ and the gradient descent sub-algorithm for the kernel parameters. For the gradient descent step, three criteria are offered: two non-differentiable (absolute value and targeted) and one differentiable (squared) objective functions, with the squared objective working better in the case of high dimensional inputs and the other two targeting low dimensional data. Finally, to demonstrate the effectiveness of the proposed algorithm, a simple synthetic dataset as well as several real-world datasets are validated in comparison with other known approaches.

References

  • [1] B. Schölkopf and A. J. Smola, Learning with Kernels.   MIT Press, 2002.
  • [2] C. Bishop, Pattern Recognition and Machine Learning.   Springer, 2006.
  • [3] T. S. Jaakkola and D. Haussler, “Exploiting generative models in discriminative classifiers,” Advances in Neural Information Processing Systems, pp. 487–493, 1998.
  • [4] F. Lotte, M. Congedo, A. Lécuyer, F. Lamarche, and B. Arnaldi, “A review of classification algorithms for EEG-based brain-computer interfaces,” J. of Neural Engineering, vol. 4, no. 2, pp. R1–13, 2007.
  • [5] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in Lecture Notes in Computer Science.   Springer, 1998, vol. 1398, pp. 137–142.
  • [6] E. Gumus, N. Kilic, A. Sertbas, and O. N. Ucan, “Evaluation of face recognition techniques using PCA, Wavelets and SVM,” Expert Systems with Applications, vol. 37, no. 9, pp. 6404–6408, 2010.
  • [7] M. Pirooznia, J. Y. Yang, M. Q. Yang, and Y. Deng, “A comparative study of different machine learning methods on microarray gene expression data,” BMC Genomics, vol. 9, no. Suppl 1, 2008.
  • [8] J. Min and Y. Lee, “Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters,” Expert Systems with Applications, vol. 28, no. 4, pp. 603–614, 2005.
  • [9] L. Cao, “Support vector machines experts for time series forecasting,” Neurocomputing, vol. 51, pp. 321–339, 2003.
  • [10] N. Sharma, P. Sharma, D. Irwin, and P. Shenoy, “Predicting solar generation from weather forecasts using machine learning,” in Proc of ICSGC, 2011.
  • [11] V. D. Sànchez A, “Advanced support vector machines and kernel methods,” Neurocomputing, vol. 55, no. 1-2, pp. 5–20, 2003.
  • [12] M. Fauvel, J. Chanussot, and J. A. Benediktsson, “Evaluation of kernels for multiclass classification of hyperspectral remote sensing data,” in Proc of IEEE ICASSP, 2006.
  • [13] S. Amari and S. Wu, “Improving support vector machine classifiers by modifying kernel functions,” Neural Networks, vol. 12, no. 6, pp. 783–789, 1999.
  • [14] K. Yu, W. Xu, and Y. Gong, “Deep learning with kernel regularization for visual recognition,” in NIPS, vol. 21, 2009, pp. 1889–1896.
  • [15] X. Hong, H. Wei, and J. Gao, “Sparse least squares support vector machine using simplex basis function,” IEEE Transactions on Cybernetics, 2018, submitted (No. CYB-E-2018-06-1246).
  • [16] J. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, pp. 293–300, 1999.
  • [17] D. Liu, Y. Shi, Y. Tian, and X. Huang, “Ramp loss least squares support vector machine,” J. of Computational Science, vol. 14, pp. 61–68, 2016.
  • [18] Y. Ye, J. Gao, Y. Shao, C. Li, and Y. Jin, “Robust support vector regression with general quadratic non-convex ε-insensitive loss,” ACM Trans. on Knowledge Discovery from Data, 2019, submitted.
  • [19] F. Zhu, J. Gao, C. Xu, J. Yang, and D. Tao, “On selecting effective patterns for fast support vector regression training,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3610–3622, 2018.
  • [20] S. Chen, “Local regularization assisted orthogonal least squares regression,” Neurocomputing, vol. 69, pp. 559–585, 2006.
  • [21] S. Chen, C. Cowan, and P. Grant, “Orthogonal least squares learning algorithm for radial basis function networks,” IEEE Transactions on Neural Networks, vol. 2, no. 2, pp. 302–309, 1991.
  • [22] J. Gao, D. Shi, and X. Liu, “Critical vector learning to construct sparse kernel regression modelling,” Neural Networks, vol. 20, no. 7, pp. 791–798, 2007.
  • [23] M. Tipping, “Sparse Bayesian learning and the relevance vector machine,” J. of Machine Learning Research, vol. 1, pp. 211–244, 2001.
  • [24] C. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” in Proc of NIPS, 2001, pp. 682–688.
  • [25] B. D. Ripley, Pattern Recognition and Neural Networks.   Cambridge University Press, 1996.
  • [26] G. Rätsch, T. Onoda, and K.-R. Müller, “Soft margins for AdaBoost,” Machine Learning, vol. 42, no. 3, pp. 287–320, 2001.