I Introduction
With the proliferation of big data in scientific and business research, practitioners of nonlinear modeling increasingly wish to build sparse models with more efficient algorithms. Kernel machines (KMs) have attracted great attention since the support vector machine (SVM), a well-known linear binary classification model based on the principle of structural risk minimization, was introduced in the early 1990s [1]. KMs extend the SVM by implementing linearity in the so-called high dimensional feature space under a feature mapping implicitly determined by a Mercer kernel function. Both SVMs and KMs have also been applied to regression problems [2].
Commonly used kernels include the radial basis function (RBF) kernel, the polynomial kernel, and the Fisher kernel [3]. As one of the best-known members of the KM family, the SVM has the advantages of good generalization and insensitivity to overfitting [4]. To date, the Gaussian RBF kernel is the most common choice for the SVM in practice. The SVM with RBF kernel has been widely used, with superior prediction performance in many areas such as text categorization [5], image recognition [6], bioinformatics [7], credit scoring [8], time series forecasting [9], and weather forecasting [10].
Text categorization (or text classification) assigns documents to predefined categories. The SVM and KMs work well for this task because the high dimensional text or dense concept representation can easily be mapped, with an appropriately chosen kernel function, into a latent feature space where a linear prediction model is learned [11]. Experimental results indicate that the SVM with RBF kernel outperforms other classification methods [5]. Its superior performance on high dimensional small datasets has also been demonstrated in remote sensing [12] by carefully choosing feature mappings. The performance of the SVM depends largely on the kernel type, and the RBF kernel SVM has been shown to outperform other classifiers in various classification scenarios [5, 6, 7]. Nonetheless, in practical nonlinear modeling, the SVM with the standard Gaussian RBF kernel has a nonnegligible limitation in separating some nonlinear decision boundaries. Thus, RBF kernel optimization has gained much more attention than before. The result in [13] demonstrates that, after introducing an information-geometric data-dependent method to modify a kernel (e.g., the RBF kernel), the performance of the SVM is considerably improved. Yu et al. [14] enhance the kernel metrics by adding regularization into kernel machines (e.g., the RBF kernel SVM).
One advantage of the standard SVM model is its sparsity, determined by the so-called support vectors; however, the sparsity cannot be predetermined, and the support vectors have to be learned from the training data by solving a computationally demanding quadratic programming problem [15]. Considerable progress has been made on computationally efficient algorithms for SVM models. One example is the least squares support vector machine (LSSVM) [16]. Instead of the margin constraints of the standard SVM, the LSSVM introduces equality constraints in the model formulation, so the resulting quadratic programming problem can be solved via a set of linear equations [16].
However, the LSSVM loses the sparsity offered by the original SVM, leading to a kernel model that evaluates all possible pairs of data in the kernel function and is therefore inferior to the standard SVM in inference for large scale data learning. To retain the sparsity of the standard SVM together with the equality constraints of the LSSVM, researchers have extended the LSSVM to the ramp loss function, producing sparse models at extra computational cost; see [17]. This strategy has been further extended to a more general insensitive loss function in [18]. Recently, Zhu et al. [19] proposed a way to select effective patterns from training datasets for fast support vector regression learning; however, there is no extension to classification problems yet. The need to deal with large scale datasets motivates exploring new approaches to sparse models under the broad framework of both the SVM and KMs. Chen [20] proposed a method for building a sparse kernel model by extending the so-called orthogonal least squares (OLS) algorithm [21] with kernel techniques. The OLS-assisted sparse kernel model offers an efficient learning procedure, demonstrating particularly good performance in nonlinear system identification; however, the OLS algorithm relies on a greedy sequential selection of the kernel regressors under an orthogonality requirement, which imposes extra computational cost. Based on the so-called significant vectors, Gao et al. [22] proposed a more straightforward way to learn the significant regressors from the training data for kernel regression modelling. This type of approach has its roots in the relevance vector machine (RVM) [23]. The RVM is implemented under the Bayesian learning framework of kernel machines and achieves inference performance comparable to the standard SVM with dramatically fewer kernel terms, offering great sparsity.
Almost all of the aforementioned methods build models by learning or extracting key data points or patterns from the entire training dataset. Recently, the authors proposed a new type of low rank kernel model based on so-called simplex basis functions (SBF) [15], successfully building a sparse and fast modeling algorithm that lowers the computational cost of the LSSVM. The model size is no longer determined by the given training data, while the key patterns are learned straightaway. We further explore this idea and extend it to so-called robust radial basis functions. The main contributions of this paper are summarized as follows:

Whereas the aforementioned models learn data patterns in the regression setting, this paper focuses on the classification setting with a controlled or predefined model size;

The kernel function proposed in this paper takes the form of a composition of basic basis components which are adaptive to the training data. This composite form opens the door to a fast closed form solution, avoiding the issue of kernel matrix inversion for large scale datasets;

A new criterion is proposed for the final model selection in terms of pattern parameters of location and scale; and

A two-step optimization algorithm is proposed to simplify the learning procedure.
The rest of this paper is organized as follows. Section II presents brief background on several related models. Section III proposes our robust RBF kernel function and its classification model. Section IV describes the artificial and real-world datasets and conducts several experiments to demonstrate the performance of the model and algorithm, and Section V concludes the paper.
II Background and Notation
In this section we introduce the notation needed to present our model and algorithm. We mainly consider binary classification problems. For multiclass classification problems, as usual, the commonly used heuristic of “one-vs-all” or “one-vs-one” can be adopted.
We are given a training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of data points, $\mathbf{x}_i \in \mathbb{R}^d$ is the feature vector, and $y_i \in \{-1, +1\}$ is the label of the $i$th data point, respectively.
KM methods have been used as universal approximators in data modeling. The core idea of KMs is to implement a linear model in a high dimensional feature space $\mathcal{H}$ by using a feature mapping defined as [1]
$$\phi: \mathbf{x} \mapsto \phi(\mathbf{x}) \in \mathcal{H},$$
which induces a Mercer kernel function in the input space,
$$K(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle,$$
where $\langle \cdot, \cdot \rangle$ is the inner product on the feature space $\mathcal{H}$.
In general, an affine linear model of KMs is defined as
$$f(\mathbf{x}) = \mathbf{w}^{\top}\phi(\mathbf{x}) + b, \qquad (1)$$
where $b$ is the bias parameter and $\mathbf{w}$ is the parameter vector of high, most likely infinite, dimensionality. It is infeasible to solve for $\mathbf{w}$ directly. Instead, the so-called kernel trick transforms the infinite dimensional problem into a finite dimensional one by relating the parameters to the data as
$$\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \phi(\mathbf{x}_i). \qquad (2)$$
A learning algorithm then focuses on solving for the parameters $\{\alpha_i\}_{i=1}^{N}$ and $b$ under an appropriate learning criterion.
For the sake of convenience, define
$$\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_N]^{\top}, \quad \mathbf{y} = [y_1, \ldots, y_N]^{\top}, \quad \mathbf{k}(\mathbf{x}) = [K(\mathbf{x}, \mathbf{x}_1), \ldots, K(\mathbf{x}, \mathbf{x}_N)]^{\top}.$$
Then, under (2), model (1) can be expressed in terms of the new parameters as^{1}
$$f(\mathbf{x}) = (\boldsymbol{\alpha} \odot \mathbf{y})^{\top}\mathbf{k}(\mathbf{x}) + b, \qquad (3)$$
where $\odot$ denotes the componentwise product of two vectors.
^{1} If we are considering a regression problem, there is no need to include $\mathbf{y}$ in model (3).
All KM algorithms involve the so-called kernel matrix, defined below:
$$\mathbf{K} = \left[K(\mathbf{x}_i, \mathbf{x}_j)\right]_{i,j=1}^{N}$$
and
$$\boldsymbol{\Omega} = \left[y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)\right]_{i,j=1}^{N}.$$
Both $\mathbf{K}$ and $\boldsymbol{\Omega}$ are symmetric matrices of size $N \times N$.
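As a concrete illustration, both the kernel matrix and its label-weighted counterpart can be assembled in a few lines. This is a hedged sketch using a Gaussian RBF kernel; the function and variable names are ours, not from the paper:

```python
import numpy as np

def kernel_matrices(X, y, gamma=1.0):
    """Build K[i, j] = k(x_i, x_j) for a Gaussian RBF kernel and the
    label-weighted Omega[i, j] = y_i * y_j * K[i, j]; both are symmetric N x N."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * d2)
    Omega = np.outer(y, y) * K
    return K, Omega

X = np.random.default_rng(0).normal(size=(6, 2))
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
K, Omega = kernel_matrices(X, y)
```

Any Mercer kernel could replace the RBF here; the symmetry and unit diagonal below are specific to the RBF choice.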
In the following subsections, the standard SVM, the LSSVM, and the sparse least squares support vector machine using simplex basis functions (LSSVM-SBF) [15] are outlined.
II-A C-SVM
The standard support vector machine (C-SVM) imposes the so-called maximal margin criterion, inducing a kernel model whose parameters $\boldsymbol{\alpha}$ (and $b$) can be obtained by solving the following dual Lagrangian problem:
$$\max_{\boldsymbol{\alpha}} \; \mathbf{1}^{\top}\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^{\top}\boldsymbol{\Omega}\boldsymbol{\alpha} \quad \text{s.t.} \quad \mathbf{y}^{\top}\boldsymbol{\alpha} = 0, \; 0 \le \alpha_i \le C, \qquad (4)$$
where $\mathbf{1}$ is the vector of all ones of appropriate dimension. The parameter $b$ can easily be calculated from the support vectors [1].
II-B LSSVM
To reduce the computational complexity of the standard SVM, the least squares support vector machine introduces equality constraints.
The standard LSSVM is formulated as the following programming problem:
$$\min_{\mathbf{w}, b, \mathbf{e}} \; \frac{1}{2}\mathbf{w}^{\top}\mathbf{w} + \frac{\gamma}{2}\sum_{i=1}^{N} e_i^2 \qquad (5)$$
$$\text{s.t.} \quad y_i\left(\mathbf{w}^{\top}\phi(\mathbf{x}_i) + b\right) = 1 - e_i, \quad i = 1, \ldots, N,$$
where $\gamma$ is a penalty parameter.
With the given equality constraints, the Lagrangian multiplier method produces a kernel model (3) whose parameters $\boldsymbol{\alpha}$ and $b$ are given by the following set of closed form linear equations:
$$\begin{bmatrix} 0 & \mathbf{y}^{\top} \\ \mathbf{y} & \boldsymbol{\Omega} + \gamma^{-1}\mathbf{I} \end{bmatrix} \begin{bmatrix} b \\ \boldsymbol{\alpha} \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}, \qquad (6)$$
where $\mathbf{I}$ is the identity matrix of size $N \times N$.
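For reference, the linear system (6) can be solved directly with a generic dense solver. This is a minimal sketch; the bordered-system layout follows the standard LSSVM classifier formulation, and all names are our own:

```python
import numpy as np

def lssvm_solve(Omega, y, gamma=10.0):
    """Solve the bordered LSSVM system
    [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]
    for the bias b and dual coefficients alpha."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]
```

Forming and factorizing this dense $(N+1)\times(N+1)$ system is precisely what incurs the cubic cost discussed next.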
However, the computational hurdle lies in the massive matrix inverse in (6), which has complexity of order $O(N^3)$.
II-C LSSVM-SBF
Despite the closed form solution obtained by the LSSVM, the model has two main limitations. First, calculating the matrix inverse is computationally demanding; second, the model is nonsparse, which means it has to compute all possible pairs of system inputs, making it infeasible for large datasets. Alternatively, we have proposed a novel kernel method referred to as LSSVM-SBF [15], which overcomes both issues by introducing symmetric structure into a specially designed kernel function based on the so-called low rank simplex basis function (SBF) kernel.
The SBF is defined as
(7) 
where $\mathbf{c}_i$ and $\boldsymbol{\lambda}_i$ are the center vector of the $i$th SBF, which adjusts its location, and the shape vector of the $i$th SBF, which adjusts its shape, respectively. The proposed new kernel in [15] is defined as
(8) 
in which the SBF kernel uses only $M$ basis functions, where $M$ is the predefined model size.
It has been proved in [15] that, under the kernel (8) with the SBF (7), the resulting model is piecewise locally linear with respect to the input as
Here we have defined
(9)  
where is the index set of , satisfying condition , and
III The Proposed Model and Its Algorithm
From Subsection II-C, we have seen that the special choice of the low rank SBF kernel defined in (7) and (8) brings model efficiency. To extend the idea of using low rank kernels, in this section we propose a general framework for a fast algorithm and validate it with several examples.
We emphasize that our idea of using low rank kernels is inspired by low rank kernel approximations such as the Nyström approximation [24]. However, standard low rank kernel methods aim to approximate a given kernel function, whereas our approach learns the (basis) functions and constructs a kernel with composite structure in order to enable fast algorithms.
III-A The Low Rank Kernels and Models
Consider learnable “basis” functions
(11)
with adaptable parameters ($i = 1, \ldots, M$). In the case of the SBF in (7), the adaptable parameters are the center and shape vectors $\{\mathbf{c}_i, \boldsymbol{\lambda}_i\}_{i=1}^{M}$.
As another example, we consider the so-called robust RBF
(12)
Similarly to the SBF, while $c_{i,k}$ determines the location of the $i$th basis function in the $k$th input dimension, $\lambda_{i,k}$ restricts its sharpness in the $k$th dimension. In fact, the SBF (7) can be regarded as a first order approximation of the robust RBF. We expect the robust RBF to have better modeling capability.
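To make the relationship concrete, the following sketch encodes one plausible reading of (7) and (12): the robust RBF as $\exp(-\sum_k \lambda_k |x_k - c_k|)$ and the SBF as its truncated first order expansion $\max(0, 1 - \sum_k \lambda_k |x_k - c_k|)$. These exact forms are our assumption, reconstructed from the first-order-approximation remark above:

```python
import numpy as np

def sbf(x, c, lam):
    """Simplex basis function: max(0, 1 - sum_k lam_k * |x_k - c_k|)."""
    return max(0.0, 1.0 - float(np.sum(lam * np.abs(x - c))))

def robust_rbf(x, c, lam):
    """Robust RBF: exp(-sum_k lam_k * |x_k - c_k|); c sets location, lam sharpness."""
    return float(np.exp(-np.sum(lam * np.abs(x - c))))
```

Since $1 - u \le e^{-u}$ for $u \ge 0$, under this reading the SBF lower-bounds the robust RBF everywhere and the two agree at $\mathbf{x} = \mathbf{c}$.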
More generally, each learnable basis function can be a deep neural network. We leave this for further study.
Given a set of learnable basis functions (11), define the finite dimensional feature mapping
$$\boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \ldots, \phi_M(\mathbf{x})]^{\top}.$$
This feature mapping naturally induces the following learnable low rank kernel
$$K(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^{\top}\boldsymbol{\phi}(\mathbf{x}') = \sum_{i=1}^{M} \phi_i(\mathbf{x})\phi_i(\mathbf{x}'). \qquad (13)$$
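The composite structure means the Gram matrix of (13) factorizes through an $N \times M$ matrix of basis responses, so its rank is at most $M$. A sketch with assumed names; any basis function of the form above can be plugged in:

```python
import numpy as np

def low_rank_gram(X, centers, lams, basis):
    """Gram matrix of the learnable low rank kernel (13):
    Phi[i, m] = phi_m(x_i) and K = Phi @ Phi.T, hence rank(K) <= M."""
    Phi = np.array([[basis(x, c, l) for c, l in zip(centers, lams)] for x in X])
    return Phi @ Phi.T, Phi
```

The $O(NM)$ factor matrix, rather than the full $N \times N$ Gram matrix, is what the fast solution in the next subsection exploits.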
Consider the “linear” model $f(\mathbf{x}) = \mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}) + b$ and define the following low rank LSSVM (LR-LSSVM):
(14)  
s.t. 
The LR-LSSVM problem takes the same form as the standard LSSVM (5); however, our low rank kernel carries a composite structure and is learnable with adaptable parameters. In the following subsections, we propose a two-step alternating algorithm to solve the LR-LSSVM.
III-B Solving the LR-LSSVM with Fixed Feature Mappings
When all the feature mappings are fixed, problem (14) reduces to the standard LSSVM. Consider the Lagrangian function
where the $\alpha_i$ are the Lagrange multipliers for the equality constraints. We now optimize out $\mathbf{w}$, $b$, and $\mathbf{e}$ to give
(15)  
(16)  
(17) 
where
(18) 
Furthermore, setting the partial derivative with respect to each Lagrange multiplier to zero gives
(19) 
After a lengthy algebraic manipulation, the solution of the dual problem is given by
Denote the matrix formed by adding one row of all zeros on top of the matrix above; then the solution can be expressed as
(20) 
Applying the matrix inversion formula to (20) results in exactly the same solution as (10). Once $\boldsymbol{\alpha}$ and $b$ are obtained, the final model can be written as
(21) 
Define
which can be calculated once the dual solution is known; then (21) can be expressed in a sparse form of size $M$:
(22) 
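In this sparse form, prediction touches only the $M$ learned basis functions rather than all $N$ training points. A sketch, where `beta` denotes the collapsed weight vector of (22) (a name we introduce):

```python
import numpy as np

def predict(x, centers, lams, beta, b, basis):
    """Evaluate the size-M sparse model (22): f(x) = sum_m beta_m * phi_m(x) + b."""
    phi = np.array([basis(x, c, l) for c, l in zip(centers, lams)])
    return float(beta @ phi + b)
```

Inference cost is $O(Md)$ per query, independent of the training set size $N$.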
III-C Training the Learnable Low Rank Kernels
Given the dual parameters solved in closed form in the first step, we estimate the kernel parameters ($i = 1, \ldots, M$) using a gradient descent algorithm. The algorithm seeks to maximize the magnitude of the model outputs, which pushes the model outputs further from the existing decision boundary. Taking the robust RBF functions (12) as an example, this objective function can be expressed as
(23)
Another objective function is
which gives similar results to (23).
Denote . Given the objective function above, we have
(24) 
in which
(25) 
where
(26) 
which are calculated by, for ,
(27) 
(28) 
where is defined in (12).
Meanwhile, we must also respect the positivity constraints on the shape parameter vector; thus we have the following constrained normalized gradient descent algorithm: for ,
(29) 
where $\eta$ is a preset learning rate. By applying (24)–(29) to each robust RBF unit in turn, while keeping all other RBF units fixed at their current values, we update all the RBF kernels.
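One step of the constrained normalized gradient update can be sketched as follows; the clipping used here to enforce positivity of the shape parameters is our assumption, and the paper's exact projection in (29) may differ:

```python
import numpy as np

def shape_step(lam, grad, eta=0.1, eps=1e-8):
    """Normalized gradient step for a shape vector lam > 0:
    move along grad / ||grad||, then clip to keep every component positive."""
    lam_new = lam + eta * grad / (np.linalg.norm(grad) + eps)
    return np.maximum(lam_new, eps)
```

Normalizing the gradient keeps the step length equal to the learning rate regardless of gradient scale, which simplifies tuning across units.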
III-D Initialization of the Robust Radial Basis Functions
As shown in (22), the model requires a preset kernel model size $M$ and a set of initial kernel parameters. In the case of robust RBFs, both the centers and the shape parameters need to be initialized. The initial center vectors can be obtained using a clustering algorithm. We adopt a k-medoids algorithm to find the robust RBF centers, since it is more robust to unbalanced data distributions. It divides the data points into $M$ subsets and iteratively adjusts the center of each subset until convergence, minimizing the clustering objective given by
(30) 
where the center of each subset is a member of that subset. As for the initial values of the shape parameters, we preset them to a predetermined constant for all basis functions, e.g., all ones.
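A plain k-medoids pass for the center initialization might look like this. This is a hedged sketch with Euclidean distance; the paper's distance measure in (30) and update schedule may differ:

```python
import numpy as np

def k_medoids(X, M, iters=20, seed=0):
    """Partition X into M clusters whose centers (medoids) are data points,
    alternately reassigning points and picking the point with the smallest
    total within-cluster distance as the new medoid."""
    rng = np.random.default_rng(seed)
    medoids = X[rng.choice(len(X), size=M, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for m in range(M):
            pts = X[labels == m]
            if len(pts) == 0:
                continue  # keep the old medoid for an empty cluster
            intra = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2).sum(axis=1)
            medoids[m] = pts[intra.argmin()]
    return medoids
```

Because medoids are constrained to be actual data points, they resist being dragged toward sparse regions by outliers, which is the robustness property motivating this choice.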
III-E The Overall Algorithm and Its Complexity
Algorithm 1^{2} summarizes the overall procedure of the LR-LSSVM, using the robust RBF kernel as an example. The algorithm starts with the k-medoids clustering algorithm of Section III-D to initialize the robust RBF centres; then the fast LSSVM solution of Section III-B is computed and the gradient descent algorithm of Section III-C or III-F is applied, alternating for a predefined number of iterations. A simple complexity analysis indicates that the overall computational cost is dominated by the gradient descent algorithm for training the learnable basis functions, scaled by the number of iterations. Many examples in Section IV show that a small model size gives competitive prediction performance; in this sense, the newly proposed algorithm has low complexity, which benefits from the special structure of the low rank kernel functions. It should be pointed out again that the proposed framework contains the SBF model of [15] as a special case and admits more generic extensions, for example using deep neural networks as learnable kernel functions.
^{2} The algorithm can easily be adapted to any learnable kernels.
III-F The Differentiable Objective Functions
The objective defined in (23) is nondifferentiable. To maximize the magnitude of the model outputs with a differentiable criterion, we propose the following squared objective: for ,
(31) 
It is not hard to prove that
(33) 
and the chain rule gives
(34) 
where the first operator denotes the trace of a matrix, the second placeholder denotes either of the two parameter blocks, and $\odot$ denotes the matrix elementwise product. Combining (33) and (34) gives
(35) 
IV Experimental Studies
IV-A Example 1: Synthetic Dataset
For the synthetic dataset in [25], the dimension of the input space is 2, and the training and test sets contain 250 and 1000 samples respectively. In this example, three types of models are constructed to compare classification performance using the misclassification rate. For the LSSVM with Gaussian RBF kernel models, the steepness is set in the range 0.5–3 with step 0.5, while the shrinkage is set to 5000 in all cases. For the LR-LSSVM-SBF model, the parameters are preset to . For our proposed LR-LSSVM-robust-RBF models with the absolute value, squared, and targeted objective functions, the parameters are set to ; and respectively.
From the classification results in TABLE I, we find that the proposed LR-LSSVM-robust-RBF and LR-LSSVM-SBF models dominate throughout, with misclassification rates of around 8%, while the Gaussian RBF kernel models perform fairly poorly in this case. In Fig. 1, we can see that the decision boundary of the LSSVM with Gaussian RBF kernel is relatively curved and nonlinear, whereas those of the SBF and robust RBF models are piecewise linear.
Model  Testing Misclassification Rate (%)  Model Size  

LSSVMGaussian ()  11.40%  250 
LSSVMGaussian ()  9.20%  250 
LSSVMGaussian ()  10.40%  250 
LSSVMGaussian ()  10.10%  250 
LSSVMGaussian ()  10.10%  250 
LSSVMGaussian ()  9.80%  250 
LSSVMSBF  8.30%  4 
Proposed Model (abs obj.)  8.00%  3 
Proposed Model (square obj.)  8.30%  3 
Proposed Model (target obj.)  8.00%  3 
Models  Titanic  Diabetes  German Credit  
Misclassification Rate (%)  Model Size  Misclassification Rate (%)  Model Size  Misclassification Rate (%)  Model Size  
RBF  23.3 1.3  4  24.32.3  15  24.7 2.4  8 
Adaboost with RBF  22.6 1.2  4  26.51.9  15  27.5 2.5  8 
AdaBoostReg  22.6 1.2  4  23.81.8  15  24.3 2.1  8 
LPRegAdaBoost  24.0 4.4  4  24.11.9  15  24.8 2.2  8 
QPRegAdaBoost  22.7 1.1  4  25.42.2  15  25.3 2.1  8 
SVM with RBF kernel  22.4 1.0  not available  23.51.7  not available  23.6 2.1  not available 
LSSVMSBF  22.5 0.8  2  23.51.7  5  24.9 1.9  3 
Proposed Model (abs obj.)  22.3 0.8  2  23.81.7  5  25.6 2.3  2 
Proposed Model (square obj.)  22.6 1.5  3  23.52.0  4  24.7 1.9  2 
Proposed Model (target obj.)  22.4 0.8  2  24.72.0  5  25.6 2.4  2 
IV-B Example 2: Titanic Dataset
The Titanic dataset in [26] has 100 realizations, each with 150 training samples and 2051 test samples. The original data has input dimension 3. We compare the prediction accuracy over the test samples of various Adaboost-based models and the LR-LSSVM models. For the LR-LSSVM-SBF model, the parameters are set to , while for the proposed models with the absolute value, squared, and targeted objective functions, the parameters are set as , , , , ; , , , , and , , , , respectively.
The results of the proposed models are shown in TABLE II (columns 2 & 3), together with the first six results quoted from [26] and the seventh quoted from [15]. Generally, the LR-LSSVM-SBF and the proposed LR-LSSVM models with robust RBF kernel outperform the other models, and all the LR-LSSVM models are sparse with only 2 terms (except for the model with the squared loss function). We also observe that the LR-LSSVM models with the absolute value and targeted objective functions have similar prediction results. Overall, the proposed models with the absolute value and targeted objective functions perform best, with the lowest misclassification rate and standard deviation; since the final model size of the robust RBF kernels is only 2, the models explain the data easily.
IV-C Example 3: Diabetes Dataset
The diabetes dataset in [26] has 100 pairs of training and test sets, with 468 training samples and 300 test samples each. The input dimension of this example is 8. Similarly to the Titanic dataset, we compare ten different models using the average misclassification rate. For the LR-LSSVM-SBF model, the parameters are set to , while for the proposed models with the absolute value, squared, and targeted objective functions, the parameters are set as ; , , , , and , , , , respectively.
The modeling results in TABLE II (columns 4 & 5) show that the proposed LR-LSSVM-robust-RBF models with the absolute value and squared objective functions are competitive among the ten models, with classification accuracy almost ranking at the top. Moreover, the SBF kernel and the proposed robust RBF kernel bring sparsity into the LR-LSSVM models, which considerably speeds up computation.
IV-D Example 4: German Credit Dataset
Similarly, the German credit dataset in [26] has 100 realizations of training and test sets; each realization contains 700 training samples and 300 test samples, and the original data has 20 features. We evaluate the misclassification rate of our proposed models with various objective functions and the LR-LSSVM-SBF model along with six other models. For the parameters of the LR-LSSVM-SBF model, we set , , , , while for the proposed LR-LSSVM-robust-RBF models with the absolute value, squared, and targeted objective functions, the parameters are set identically for all three cases.
The results of the four models are listed in TABLE II (columns 6 & 7), together with the first six results quoted from [26]. For this dataset, both LR-LSSVM-SBF and LR-LSSVM-robust-RBF do not perform as well as on the previous datasets; however, their prediction accuracy and standard deviation remain comparable. Additionally, it can be seen that the model size of the four models is relatively small compared to the other models.
IV-E Summary
Overall, the proposed squared objective model performs well on higher dimensional datasets (the diabetes and German credit examples in our demonstration), whereas the proposed absolute value and targeted objective models are better suited to low dimensional input (the synthetic and Titanic datasets in our case). Moreover, there is no apparent relation between the input dimension and the chosen model size: across the result tables, the final selected model size appears relatively arbitrary.
V Conclusions
In this paper we have generalized a widely applicable framework for a fast LR-LSSVM algorithm and extended the idea to a novel robust RBF kernel. After initializing the proposed kernel parameters with k-medoids clustering, the training procedure alternates between a fast least squares closed form solution and gradient descent sub-algorithms. For the gradient descent step, three criteria are offered: two nondifferentiable (absolute value and targeted) and one differentiable (squared) objective functions, with the squared objective working better for high dimensional input and the others targeting low dimensional data. Finally, to demonstrate the effectiveness of the proposed algorithm, we validated it on a simple synthetic dataset as well as several real-world datasets in comparison with other known approaches.
References
 [1] B. Schölkopf and A. J. Smola, Learning with Kernels. MIT Press, 2002.
 [2] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
 [3] T. S. Jaakkola and D. Haussler, “Exploiting generative models in discriminative classifiers,” Advances in Neural Information Processing Systems, pp. 487–493, 1998.
 [4] F. Lotte, M. Congedo, A. Lécuyer, F. Lamarche, and B. Arnaldi, “A review of classification algorithms for EEGbased braincomputer interfaces,” J. of Neural Engineering, vol. 4, no. 2, pp. R1–13, 2007.
 [5] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in Lecture Notes in Computer Science. Springer, 1998, vol. 1398, pp. 137–142.

 [6] E. Gumus, N. Kilic, A. Sertbas, and O. N. Ucan, “Evaluation of face recognition techniques using PCA, Wavelets and SVM,” Expert Systems with Applications, vol. 37, no. 9, pp. 6404–6408, 2010.
 [7] M. Pirooznia, J. Y. Yang, M. Q. Yang, and Y. Deng, “A comparative study of different machine learning methods on microarray gene expression data,” BMC Genomics, vol. 9, no. Suppl 1, 2008.
 [8] J. Min and Y. Lee, “Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters,” Expert Systems with Applications, vol. 28, no. 4, pp. 603–614, 2005.
 [9] L. Cao, “Support vector machines experts for time series forecasting,” Neurocomputing, vol. 51, pp. 321–339, 2003.
 [10] N. Sharma, P. Sharma, D. Irwin, and P. Shenoy, “Predicting solar generation from weather forecasts using machine learning,” in Proc of ICSGC, 2011.
 [11] V. D. Sànchez A, “Advanced support vector machines and kernel methods,” Neurocomputing, vol. 55, no. 12, pp. 5–20, 2003.
 [12] M. Fauvel, J. Chanussot, and J. A. Benediktsson, “Evaluation of kernels for multiclass classification of hyperspectral remote sensing data,” in Proc of IEEE ICASSP, 2006.
 [13] S. Amari and S. Wu, “Improving support vector machine classifiers by modifying kernel functions,” Neural Networks, vol. 12, no. 6, pp. 783–789, 1999.

 [14] K. Yu, W. Xu, and Y. Gong, “Deep learning with kernel regularization for visual recognition,” in NIPS, vol. 21, 2009, pp. 1889–1896.
 [15] X. Hong, H. Wei, and J. Gao, “Sparse least squares support vector machine using simplex basis function,” IEEE Transactions on Cybernetics, submitted, 2018.
 [16] J. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, pp. 293–300, 1999.
 [17] D. Liu, Y. Shi, Y. Tian, and X. Huang, “Ramp loss least squares support vector machine,” J. of Computational Science, vol. 14, pp. 61–68, 2016.
 [18] Y. Ye, J. Gao, Y. Shao, C. Li, and Y. Jin, “Robust support vector regression with general quadratic nonconvex insensitive loss,” ACM Trans. on Knowledge Discovery from Data, submitted, 2019.
 [19] F. Zhu, J. Gao, C. Xu, J. Yang, and D. Tao, “On selecting effective patterns for fast support vector regression training,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3610–3622, 2018.
 [20] S. Chen, “Local regularization assisted orthogonal least squares regression,” NeuroComputing, vol. 69, pp. 559–585, 2006.

 [21] S. Chen, C. Cowan, and P. Grant, “Orthogonal least squares learning algorithm for radial basis function networks,” IEEE Transactions on Neural Networks, vol. 2, no. 2, pp. 302–309, 1991.
 [22] J. Gao, D. Shi, and X. Liu, “Critical vector learning to construct sparse kernel regression modelling,” Neural Networks, vol. 20, no. 7, pp. 791–798, 2007.
 [23] M. Tipping, “Sparse Bayesian learning and the relevance vector machine,” J. of Machine Learning Research, vol. 1, pp. 211–244, 2001.
 [24] C. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” in Proc of NIPS, 2001, pp. 682–688.
 [25] B. D. Ripley, Pattern Recognition and Neural Networks. Cambridge University Press, 1996.
 [26] G. Rätsch, T. Onoda, and K.-R. Müller, “Soft margins for AdaBoost,” Machine Learning, vol. 42, no. 3, pp. 287–320, 2001.