With the proliferation of big data in scientific and business research, practical nonlinear modeling calls for sparse models built with efficient algorithms. Kernel machines (KMs) have attracted great attention since the support vector machine (SVM), a well-known linear binary classification model based on the principle of structural risk minimization, was introduced in the early 1990s . In fact, KMs extend the SVM by implementing linear models in a so-called high dimensional feature space under a feature mapping implicitly determined by a Mercer kernel function. Both SVM and KMs have also been applied to regression problems . Commonly used kernels include the radial basis function (RBF) kernel, the polynomial kernel, and the Fisher kernel . As one of the most well-known members of the KM family, the SVM has the advantages of good generalization and insensitivity to overfitting .
To date, the Gaussian RBF kernel has been the most common choice for SVM in practice. SVM with the RBF kernel has been widely used and has shown superior prediction performance in many areas such as text categorization , image recognition , bioinformatics , credit scoring , time series forecasting , and weather forecasting . Text categorization, or text classification, is the task of classifying documents into predefined categories. SVM and KMs work well for this task because the high dimensional text or dense concept representation can easily be mapped into a latent feature space where a linear prediction model is learned with an appropriately chosen kernel function. Experimental results indicate that SVM with the RBF kernel outperforms other classification methods . The superior performance of SVM with the RBF kernel on high dimensional small datasets has also been demonstrated in remote sensing , by carefully choosing feature mappings.
The performance of SVM largely depends on the kernel type, and it has been shown that the RBF kernel SVM is capable of outperforming other classifiers in various classification scenarios [6, 5, 7]. Nonetheless, in practical nonlinear modeling, SVM with the standard Gaussian RBF kernel has a non-negligible limitation in separating some nonlinear decision boundaries. Thus, the analysis of RBF kernel optimization has attracted increasing attention. The results given in  demonstrate that introducing an information-geometric data-dependent method to modify a kernel (e.g., the RBF kernel) considerably improves the performance of SVM. Yu et al.  enhance the kernel metrics by adding regularization into kernel machines (e.g., the RBF kernel SVM).
One of the advantages of the standard SVM model is its model sparsity, determined by the so-called support vectors; however, the sparsity cannot be pre-determined, and the support vectors have to be learned from the training data by solving a computationally demanding quadratic programming problem . Considerable progress has been made in developing computationally efficient algorithms for SVM models. One example is the least squares support vector machine (LSSVM) . Instead of the margin constraints in the standard SVM, LSSVM introduces equality constraints in the model formulation, so that the resulting optimization problem can be solved by a set of linear equations . However, LSSVM loses the sparsity offered by the original SVM: the resulting kernel model evaluates the kernel function over all training data and is therefore inferior to the standard SVM in inference for large scale data learning. To retain the sparsity of the standard SVM together with the equality constraints of LSSVM, researchers have extended LSSVM with the ramp loss function, producing sparse models at extra computational cost; see . This strategy has been extended to the more general insensitive loss function in . Recently, Zhu et al.  proposed a way to select effective patterns from training datasets for fast support vector regression learning. However, this approach has not yet been extended to classification problems.
The need to deal with large scale datasets motivates exploring new approaches to sparse models under the broad framework of both SVM and KMs. Chen  proposed a method for building a sparse kernel model by combining an extension of the orthogonal least squares (OLS) algorithm  with kernel techniques. The OLS-assisted sparse kernel model offers an efficient learning procedure and has demonstrated good performance in nonlinear system identification. However, the OLS algorithm relies on a greedy sequential selection of kernel regressors under an orthogonality requirement, which imposes extra computational cost. Based on so-called significant vectors, Gao et al.  proposed a more straightforward way to learn the significant regressors from training data for kernel regression modelling. This type of approach has its roots in the relevance vector machine (RVM) . The RVM is implemented under the Bayesian learning framework for kernel machines and achieves inference performance comparable to the standard SVM with dramatically fewer kernel terms, offering great sparsity.
Almost all the aforementioned modeling methods build models by learning or extracting key data points or patterns from the entire training dataset. Recently, the authors proposed a new type of low rank kernel model based on so-called simplex basis functions (SBF) , yielding a sparse model and a fast algorithm that lowers the computational cost of LSSVM. The model size is no longer determined by the given training data, and the key patterns are learned directly. We further explore this idea and extend it to so-called robust radial basis functions. The main contributions of this paper are summarized as follows:
Whereas the aforementioned models learn data patterns under the regression setting, this paper focuses on the classification setting with a controlled or pre-defined model size;
The kernel function proposed in this paper takes the form of a composition of basic basis components that are adaptive to the training data. This compositional form opens the door to a fast closed-form solution, avoiding the kernel matrix inversion issue for large scale datasets;
A new criterion is proposed for the final model selection in terms of pattern parameters of location and scale; and
A two-step optimization algorithm is proposed to simplify the learning procedure.
The rest of this paper is organized as follows. In Section II, we present the brief background on several related models. Section III proposes our robust RBF kernel function and its classification model. Section IV describes the artificial and real-world datasets and conducts several experiments to demonstrate the performance of the model and algorithm and Section V concludes the paper.
II Background and Notation
In this section, we introduce the notation necessary for presenting our model and algorithm. We mainly consider binary classification problems. For multi-class classification problems, the commonly used heuristic approach of “one-vs-all” or “one-vs-one” can be adopted as usual.
We are given a training dataset {(x_i, y_i)}, i = 1, …, N, where N is the number of data points, x_i is the feature vector, and y_i ∈ {−1, +1} is the label of the i-th data point.
KM methods have been used as universal approximators in data modeling. The core idea of KMs is to implement a linear model in a high dimensional feature space H by using a feature mapping φ, which induces a Mercer kernel function in the input space, k(x, x′) = ⟨φ(x), φ(x′)⟩, where ⟨·, ·⟩ is the inner product on the feature space H.
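As a concrete illustration of this identity, a Gaussian RBF kernel value can be computed directly in the input space without ever forming the feature map explicitly. The minimal sketch below is our own example (the width parameter `gamma` is an assumed parameterization, not the paper's notation); it also builds the symmetric Gram matrix used throughout kernel methods.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel k(x, z) = exp(-gamma * ||x - z||^2).
    Evaluates the inner product <phi(x), phi(z)> in the (infinite-
    dimensional) feature space without forming phi explicitly."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

def gram_matrix(X, kernel):
    """N x N symmetric kernel (Gram) matrix over a dataset X."""
    return [[kernel(xi, xj) for xj in X] for xi in X]
```

Note that `gram_matrix` returns a symmetric matrix with unit diagonal for the RBF kernel, since k(x, x) = 1 for every x.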
In general, an affine linear model of KMs is defined as f(x) = ⟨w, φ(x)⟩ + b, where b is the bias parameter and w is a parameter vector of high, most likely infinite, dimensionality. It is infeasible to solve for the parameter vector w directly. Instead, the so-called kernel trick transforms the infinite dimensional problem into a finite dimensional one by relating the parameters to the data, so that the model is expressed entirely through kernel evaluations on the training points.
A learning algorithm will focus on solving for parameters under an appropriate learning criterion.
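To make the kernel trick concrete, a minimal prediction routine (our own sketch; the coefficient names `alpha` and `b` and the kernel argument are generic placeholders) evaluates the model purely through kernel values, never touching the high dimensional weight vector:

```python
def km_predict(x, train_X, alpha, b, kernel):
    """Kernel machine prediction via the kernel trick:
    f(x) = sum_i alpha_i * k(x, x_i) + b.
    The (possibly infinite-dimensional) weight vector is never formed."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, train_X)) + b
```

Any Mercer kernel can be plugged in; with a linear kernel this reduces to an ordinary affine model.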
For the sake of convenience, define
where ⊙ denotes the component-wise product of two vectors.
All KM algorithms involve the so-called kernel matrix, defined below
Both are symmetric matrices of size N × N.
In the following subsections, the standard SVM, the LSSVM, and the sparse least squares support vector machine using simplex basis functions (LSSVM-SBF)  are outlined.
The standard support vector machine (C-SVM) imposes the so-called maximal margin criterion, inducing a kernel model whose parameters α (and b) can be obtained by solving the following dual Lagrangian problem
where 1 denotes the vector of all ones of appropriate dimension. The parameter b can easily be calculated from the support vectors.
To reduce the computational complexity of the standard SVM, the least squares support vector machine replaces the margin (inequality) constraints with equality constraints.
The standard LSSVM is formulated as the following programming problem
where γ > 0 is a penalty parameter.
Given the equality constraints, the Lagrange multiplier method produces a kernel model (3) whose parameters α and b are given by the following set of closed-form linear equations
where I is the identity matrix of size N × N. However, the computational hurdle lies in the massive matrix inversion in (6), which has complexity of order O(N³).
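For illustration, the LSSVM solution can be obtained from a single (N+1) × (N+1) linear solve. The sketch below is our own minimal implementation under the common block formulation [[0, 1ᵀ], [1, K + I/γ]][b; α] = [0; y] (the exact block layout of (6) may differ); plain Gaussian elimination is used, which makes the O(N³) cost explicit.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def lssvm_fit(K, y, gamma):
    """Solve the LSSVM system for the bias b and coefficients alpha.
    Assumed block system:  [0   1^T        ] [b    ]   [0]
                           [1   K + I/gamma] [alpha] = [y]
    Cost is O(N^3), dominated by solving the (N+1)x(N+1) system."""
    n = len(y)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [K[i][j] + (1.0 / gamma if i == j else 0.0)
                          for j in range(n)])
    sol = solve_linear(A, [0.0] + list(y))
    return sol[0], sol[1:]
```

In practice a Cholesky-based solver would be used instead of naive elimination, but the cubic scaling in N is the same.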
Despite the closed-form solution obtained by LSSVM, the model has two main limitations. First, computing the matrix inverse is computationally demanding; second, the model is non-sparse, which means it has to evaluate the kernel over all training inputs, making it infeasible for large datasets. To overcome these two issues, we previously proposed a novel kernel method, referred to as LSSVM-SBF , which introduces a symmetric structure into a specially designed kernel function based on the so-called low rank simplex basis function (SBF) kernel.
The SBF is defined as
where, for the j-th SBF, the center vector adjusts the location and the shape vector adjusts the shape. The new kernel proposed in  is defined as
in which the SBF kernel uses only M basis functions, where M is the pre-defined model size.
Here we have defined
where is the index set of , satisfying condition , and
with , i.e., the vector of basis function values at the training inputs.
III The Proposed Model and Its Algorithm
As shown in subsection II-C, the special choice of the low rank SBF kernel defined in (7) and (8) brings model efficiency. To extend the idea of using low rank kernels, in this section we propose a general framework for fast algorithms and validate it with several examples.
We would like to emphasize that our idea of using low rank kernels is inspired by classical low rank kernel approximations such as the Nyström approximation . However, the standard low rank kernel methods aim to approximate a given kernel function, whereas our approach learns the (basis) functions and constructs a kernel with a composite structure in order to enable fast algorithms.
III-A The Low Rank Kernels and Models
Consider learnable “basis” functions
with adaptable parameters (). In the case of the SBF in (7), the adaptable parameters are the centers and shape vectors of the basis functions.
As another example, we will consider the so-called robust RBF
Similar to the SBF, the center parameter determines the location of the basis function in each dimension, while the shape parameter restricts its sharpness. In fact, the SBF (7) can be regarded as a first-order approximation of the robust RBF. We expect the robust RBF to have better modeling capability.
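Since the display (12) is not reproduced here, the sketch below uses an assumed Laplacian-type form for the robust RBF, φ(x) = exp(−Σ_i λ_i |x_i − c_i|); this is our own illustrative assumption. With this form, the first-order expansion exp(−t) ≈ max(0, 1 − t) recovers an SBF-style piecewise linear basis function, matching the relationship described above.

```python
import math

def robust_rbf(x, center, lam):
    """Robust RBF basis function (assumed form, for illustration):
    phi(x) = exp(-sum_i lam_i * |x_i - c_i|).
    center adjusts the location per dimension; lam controls sharpness."""
    return math.exp(-sum(l * abs(xi - ci)
                         for xi, ci, l in zip(x, center, lam)))

def sbf(x, center, lam):
    """Simplex basis function: first-order approximation of the robust
    RBF around its center, via exp(-t) ~ max(0, 1 - t)."""
    t = sum(l * abs(xi - ci) for xi, ci, l in zip(x, center, lam))
    return max(0.0, 1.0 - t)
```

Near the center both functions almost coincide, while far from the center the SBF clips to zero and the robust RBF decays smoothly.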
More generally, each learnable basis function
can be a deep neural network. We will leave this for further study.
Given a set of learnable basis functions (11), define a finite dimensional feature mapping
This feature mapping naturally induces the following learnable low rank kernel
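Concretely, with a finite feature map φ(x) = (φ_1(x), …, φ_M(x)), the induced kernel is the finite sum k(x, z) = Σ_j φ_j(x) φ_j(z), so the N × N kernel matrix factors as ΦΦᵀ and has rank at most M. A minimal sketch (with arbitrary placeholder basis functions, not the paper's):

```python
def low_rank_kernel(x, z, basis_funcs):
    """k(x, z) = sum_j phi_j(x) * phi_j(z): the inner product of the
    finite-dimensional feature maps, so the Gram matrix has rank <= M."""
    return sum(phi(x) * phi(z) for phi in basis_funcs)

def feature_matrix(X, basis_funcs):
    """N x M matrix Phi with Phi[i][j] = phi_j(x_i); then K = Phi Phi^T."""
    return [[phi(x) for phi in basis_funcs] for x in X]
```

Because K never has to be formed or inverted directly, algorithms can work with the thin N × M factor Φ instead.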
Consider the “linear” model and define the following low rank LSSVM (LR-LSSVM)
The LR-LSSVM problem takes the same form as the standard LSSVM (5); however, our low rank kernel carries a compositional structure and is learnable through its adaptable parameters. In the following subsections, we propose a two-step alternating algorithm to solve the LR-LSSVM.
III-B Solving LR-LSSVM with Fixed Feature Mappings
When all the feature mappings are fixed, problem (14) reduces to the standard LSSVM. Consider the Lagrangian function
where the Lagrange multipliers correspond to all the equality constraints. We now optimize out the primal variables to give
Furthermore, setting the partial derivative with respect to each Lagrange multiplier to zero gives
After some algebraic manipulation, the solution to the dual problem is given by
Denote the matrix formed by stacking one row of all zeros on top of the matrix ; then the solution can be expressed as
which can be calculated once  is known; (21) can then be expressed in a sparse form of size
III-C Training the Learnable Low Rank Kernels
With the parameters solved by the closed-form solution in the first step, we estimate the kernel parameters () using a gradient descent algorithm. The algorithm seeks to maximize the magnitude of the model outputs, which pushes the outputs further from the existing decision boundary. Taking the robust RBF functions (12) as an example, this objective function can be expressed as
Another objective function is
which gives results similar to (23).
Denote . Given the objective function above, we have
which are calculated by, for ,
where is defined in (12).
Meanwhile, we must also respect the positivity constraints on the shape parameter vector; thus, we have the following constrained normalized gradient descent algorithm: for ,
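One update of such a constrained normalized gradient descent step can be sketched as follows (our own illustration; the unit-norm convention and the clipping threshold `eps` are assumptions, since the display above is not reproduced):

```python
import math

def normalized_gradient_step(params, grad, lr, positive_mask, eps=1e-6):
    """One normalized gradient descent step with positivity projection.
    The gradient is scaled to unit length so the step size is controlled
    purely by lr; entries flagged in positive_mask (the shape parameters)
    are clipped to stay strictly positive."""
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0
    new = [p - lr * g / norm for p, g in zip(params, grad)]
    return [max(v, eps) if pos else v for v, pos in zip(new, positive_mask)]
```

The projection keeps the iterate feasible at every step, so no separate constraint-handling machinery is required.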
III-D Initialization of the Robust Radial Basis Functions
As shown in (22), the model requires a preset kernel model size and a set of initial kernel parameters , . In the case of robust RBFs, both  and  need to be initialized. The initial center vectors can be obtained using a clustering algorithm. We propose a k-medoids algorithm to find the robust RBF centers, since it is more robust to unbalanced distributions of data. It divides the data points into subsets and iteratively adjusts the center of each subset until convergence while minimizing the clustering objective given by
where the center of each subset is a member of that subset. As for the initial values of the shape parameters, we preset them to a predetermined constant for all basis functions, e.g., all ones.
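A minimal k-medoids routine in the alternating (PAM-like) style can be sketched as below; this is our own implementation, and the naive first-k initialization and fixed iteration cap are simplifications. Unlike k-means, every center is an actual data point, which is what makes the initialization robust to outliers and unbalanced data.

```python
def k_medoids(X, k, dist, iters=50):
    """Simple alternating k-medoids clustering. Medoids are always
    actual data points from X."""
    medoids = X[:k]  # naive init: first k points
    for _ in range(iters):
        # Assignment step: attach each point to its nearest medoid.
        clusters = [[] for _ in range(k)]
        for x in X:
            j = min(range(k), key=lambda j: dist(x, medoids[j]))
            clusters[j].append(x)
        # Update step: best in-cluster representative (a real point).
        new = [min(c, key=lambda m: sum(dist(m, x) for x in c))
               if c else medoids[j] for j, c in enumerate(clusters)]
        if new == medoids:
            break
        medoids = new
    return medoids
```

Any distance can be supplied; an L1 distance pairs naturally with the absolute-value structure of the robust RBF.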
III-E The Overall Algorithm and Its Complexity
Algorithm 1 (which can easily be adapted to any learnable kernel) summarizes the overall procedure of LR-LSSVM, using the robust RBF kernel as an example. The algorithm starts with the k-medoids clustering initialization of the robust RBF centers (Section III-D); then the fast LSSVM solution of Section III-B and the gradient descent updates of Section III-C or III-F are applied alternately for a predefined number of iterations. A simple complexity analysis indicates that the overall computational complexity is dominated by the gradient descent algorithm for training the learnable basis functions, scaled by the iteration number. Many examples in Section IV show that a small model size gives competitive prediction performance; in this sense, the newly proposed algorithm has low complexity, which benefits from the special structure of the low rank kernel functions. It should be pointed out again that the proposed framework contains the SBF model in  as a special case, and that the framework admits more generic extensions, for example using deep neural networks as learnable kernel functions.
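The overall two-step procedure can be sketched as the following alternating loop, with stub callables standing in for the closed-form LSSVM solve (Section III-B) and the kernel-parameter gradient update (Section III-C or III-F); the function names and signatures are our own placeholders, not the paper's pseudocode.

```python
def lr_lssvm_train(init_kernel_params, solve_lssvm, update_kernel,
                   n_iters=10):
    """Schematic LR-LSSVM training loop.
    solve_lssvm(kernel_params) -> model        (closed-form step)
    update_kernel(kernel_params, model) -> kernel_params (gradient step)
    The two steps are applied alternately for a fixed iteration count."""
    params = init_kernel_params
    model = solve_lssvm(params)
    for _ in range(n_iters):
        params = update_kernel(params, model)
        model = solve_lssvm(params)
    return params, model
```

With toy stubs the loop behaves as a simple fixed-point iteration, which is all the scheduler needs to demonstrate.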
III-F The Differentiable Objective Functions
The objective defined in (23) is non-differentiable. For the purpose of maximizing the magnitude of the model outputs, we propose the following squared objective, which is differentiable: for ,
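Assuming the magnitude objective has the form Σ_i |f(x_i)| (the display (23) itself is not reproduced above, so this form is our assumption), the differentiable surrogate simply squares each output. Both can be sketched as:

```python
def abs_objective(outputs):
    """Non-differentiable magnitude objective: sum_i |f(x_i)|."""
    return sum(abs(v) for v in outputs)

def squared_objective(outputs):
    """Differentiable surrogate: sum_i f(x_i)^2; smooth at f(x_i) = 0,
    where the absolute value has a kink."""
    return sum(v * v for v in outputs)
```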
IV Experimental Studies
IV-A Example 1: Synthetic Dataset
For the synthetic data set in , the dimension of the input space is , and the training and test sets contain 250 and 1000 samples, respectively. In this example, three types of models are constructed and compared on classification performance using the misclassification rate. For the LSSVM models with Gaussian RBF kernel, the steepness is set in the range 0.5–3 with step 0.5, while the shrinkage is set to 5000 throughout. For the LR-LSSVM-SBF model, the parameters are preset to . For our proposed LR-LSSVM-robust-RBF models with the absolute value, squared, and targeted objective functions, the parameters are set to ;  and , respectively.
From the classification results shown in TABLE I, we find that the proposed LR-LSSVM-robust-RBF and LR-LSSVM-SBF models dominate, with misclassification rates of around 8%, while the Gaussian RBF kernel models perform fairly poorly in this case. In Fig. 1, the decision boundary of LSSVM with the Gaussian RBF kernel is relatively curved and nonlinear, whereas those of the SBF and robust RBF are piecewise linear.
| Model | Testing Misclassification Rate (%) | Model Size |
|---|---|---|
| Proposed Model (abs obj.) | 8.00 | 3 |
| Proposed Model (square obj.) | 8.30 | 3 |
| Proposed Model (target obj.) | 8.00 | 3 |
| Model | Titanic: Miscl. Rate (%) | Model Size | Diabetes: Miscl. Rate (%) | Model Size | German: Miscl. Rate (%) | Model Size |
|---|---|---|---|---|---|---|
| RBF | 23.3 ± 1.3 | 4 | 24.3 ± 2.3 | 15 | 24.7 ± 2.4 | 8 |
| AdaBoost with RBF | 22.6 ± 1.2 | 4 | 26.5 ± 1.9 | 15 | 27.5 ± 2.5 | 8 |
| AdaBoostReg | 22.6 ± 1.2 | 4 | 23.8 ± 1.8 | 15 | 24.3 ± 2.1 | 8 |
| LPReg-AdaBoost | 24.0 ± 4.4 | 4 | 24.1 ± 1.9 | 15 | 24.8 ± 2.2 | 8 |
| QPReg-AdaBoost | 22.7 ± 1.1 | 4 | 25.4 ± 2.2 | 15 | 25.3 ± 2.1 | 8 |
| SVM with RBF kernel | 22.4 ± 1.0 | not available | 23.5 ± 1.7 | not available | 23.6 ± 2.1 | not available |
| LSSVM-SBF | 22.5 ± 0.8 | 2 | 23.5 ± 1.7 | 5 | 24.9 ± 1.9 | 3 |
| Proposed Model (abs obj.) | 22.3 ± 0.8 | 2 | 23.8 ± 1.7 | 5 | 25.6 ± 2.3 | 2 |
| Proposed Model (square obj.) | 22.6 ± 1.5 | 3 | 23.5 ± 2.0 | 4 | 24.7 ± 1.9 | 2 |
| Proposed Model (target obj.) | 22.4 ± 0.8 | 2 | 24.7 ± 2.0 | 5 | 25.6 ± 2.4 | 2 |
IV-B Example 2: Titanic Dataset
The Titanic data set in  has 100 realizations, each with 150 training samples and 2051 test samples. The original data has input dimension . We compare the prediction accuracy of various AdaBoost-based models and the LR-LSSVM models on the test samples. For the LR-LSSVM-SBF model, the parameters are set to , while for the proposed models with the absolute value, squared, and targeted objective functions, the parameters are set to , , , , ; , , , , ; and , , , , , respectively.
The results are listed in TABLE II (columns 2 & 3). Generally, LR-LSSVM-SBF and the proposed LR-LSSVM models with the robust RBF kernel outperform the other models, and all the LR-LSSVM models are sparse with only 2 terms (except for the model with the squared objective function). We also observe that the LR-LSSVM models with the absolute value and targeted objective functions give similar prediction results. Overall, the proposed models with the absolute value and targeted objective functions perform best, with the lowest misclassification rate and standard deviation; since the final model size of the robust RBF kernels is only 2, the models are easy to interpret.
IV-C Example 3: Diabetes Dataset
The diabetes data set in  has 100 pairs of training and test sets, with 468 training samples and 300 test samples in each pair. The input dimension of this example is . Following the same structure as the Titanic example, we compare ten different models using the average misclassification rate. For the LR-LSSVM-SBF model, the parameters are set to , while for the proposed models with the absolute value, squared, and targeted objective functions, the parameters are set to ; , , , , ; and , , , , , respectively.
The modeling results in TABLE II (columns 4 & 5) show that the proposed LR-LSSVM-robust-RBF models with the absolute value and squared objective functions are competitive among the ten models, with classification accuracy almost at the top. Moreover, the SBF kernel and the proposed robust RBF kernel bring sparsity into the LR-LSSVM models, which considerably speeds up computation.
IV-D Example 4: German Credit Dataset
Similarly, the German credit dataset in  has 100 realizations of training and test sets. Each realization contains 700 training samples and 300 test samples, and the original data has 20 features. We evaluate the misclassification rates of our proposed models with the various objective functions and of the LR-LSSVM-SBF model, along with six other models. For the LR-LSSVM-SBF model, we set , , , , while for the proposed LR-LSSVM-robust-RBF models with the absolute value, squared, and targeted objective functions, the parameters are set to  for all three cases.
The results of the four models are listed in TABLE II (columns 6 & 7), together with the six other results quoted from . On this data set, both LR-LSSVM-SBF and LR-LSSVM-robust-RBF do not perform as well as they do on the previous data sets. However, the prediction accuracy and standard deviation are still comparable. Additionally, it can be seen that the model sizes of the four models are relatively small compared to the other models.
Overall, we notice that the proposed squared objective model performs well on high dimensional datasets, which include the diabetes and German credit examples in our demonstration, whereas the proposed absolute value and targeted objective models are more suitable for low dimensional inputs, namely the synthetic and Titanic datasets in our cases. Moreover, the result tables suggest no clear relation between the input dimension and the chosen model size, since the finally selected model size appears fairly arbitrary in general.
V Conclusion
In this paper, we have generalized a widely applicable framework for the fast LR-LSSVM algorithm and extended the idea to a novel robust RBF kernel. After initializing the proposed kernel parameters with k-medoids clustering, the training procedure alternates between the fast least squares closed-form solution and the gradient descent sub-algorithms. For the gradient descent step, three criteria are offered: two non-differentiable (absolute value and targeted) objective functions and one differentiable (squared) objective function, with the squared objective working better for high dimensional inputs and the other two targeting low dimensional data. Finally, to demonstrate the effectiveness of the proposed algorithm, a simple synthetic data set as well as several real-world data sets are used for validation against other known approaches.
-  B. Schölkopf and A. J. Smola, Learning with Kernels. MIT Press, 2002.
-  C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
-  T. S. Jaakkola and D. Haussler, “Exploiting generative models in discriminative classifiers,” Advances in Neural Information Processing Systems, pp. 487–493, 1998.
-  F. Lotte, M. Congedo, A. Lécuyer, F. Lamarche, and B. Arnaldi, “A review of classification algorithms for EEG-based brain-computer interfaces,” J. of Neural Engineering, vol. 4, no. 2, pp. R1–13, 2007.
-  T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in Lecture Notes in Computer Science. Springer, 1998, vol. 1398, pp. 137–142.
-  E. Gumus, N. Kilic, A. Sertbas, and O. N. Ucan, “Evaluation of face recognition techniques using PCA, Wavelets and SVM,” Expert Systems with Applications, vol. 37, no. 9, pp. 6404–6408, 2010.
-  M. Pirooznia, J. Y. Yang, M. Q. Yang, and Y. Deng, “A comparative study of different machine learning methods on microarray gene expression data,” BMC Genomics, vol. 9, no. Suppl 1, 2008.
-  J. Min and Y. Lee, “Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters,” Expert Systems with Applications, vol. 28, no. 4, pp. 603–614, 2005.
-  L. Cao, “Support vector machines experts for time series forecasting,” Neurocomputing, vol. 51, pp. 321–339, 2003.
-  N. Sharma, P. Sharma, D. Irwin, and P. Shenoy, “Predicting solar generation from weather forecasts using machine learning,” in Proc of ICSGC, 2011.
-  V. D. Sànchez A, “Advanced support vector machines and kernel methods,” Neurocomputing, vol. 55, no. 1-2, pp. 5–20, 2003.
-  M. Fauvel, J. Chanussot, and J. A. Benediktsson, “Evaluation of kernels for multiclass classification of hyperspectral remote sensing data,” in Proc of IEEE ICASSP, 2006.
-  S. Amari and S. Wu, “Improving support vector machine classifiers by modifying kernel functions,” Neural Networks, vol. 12, no. 6, pp. 783–789, 1999.
-  K. Yu, W. Xu, and Y. Gong, “Deep learning with kernel regularization for visual recognition,” in NIPS, vol. 21, 2009, pp. 1889–1896.
-  X. Hong, H. Wei, and J. Gao, “Sparse least squares support vector machine using simplex basis function,” IEEE Transactions on Cybernetics, vol. XX, pp. Submission No. CYB–E–2018–06–1246, 2018.
-  J. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, pp. 293–300, 1999.
-  D. Liu, Y. Shi, Y. Tian, and X. Huang, “Ramp loss least squares support vector machine,” J. of Computational Science, vol. 14, pp. 61–68, 2016.
-  Y. Ye, J. Gao, Y. Shao, C. Li, and Y. Jin, “Robust support vector regression with general quadratic non-convex -insensitive loss,” ACM Trans. on Knowledge Discovery from Data, vol. XX, p. submitted, 2019.
-  F. Zhu, J. Gao, C. Xu, J. Yang, and D. Tao, “On selecting effective patterns for fast support vector regression training,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3610–3622, 2018.
-  S. Chen, “Local regularization assisted orthogonal least squares regression,” Neurocomputing, vol. 69, pp. 559–585, 2006.
-  S. Chen, C. Cowan, and P. Grant, “Orthogonal least squares learning algorithm for radial basis function networks,” IEEE Transactions on Neural Networks, vol. 2, no. 2, pp. 302–309, 1991.
-  J. Gao, D. Shi, and X. Liu, “Critical vector learning to construct sparse kernel regression modelling,” Neural Networks, vol. 20, no. 7, pp. 791–798, 2007.
-  M. Tipping, “Sparse Bayesian learning and the relevance vector machine,” J. of Machine Learning Research, vol. 1, pp. 211–244, 2001.
-  C. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” in Proc of NIPS, 2001, pp. 682–688.
-  B. D. Ripley, Pattern Recognition and Neural Networks. Cambridge University Press, 1996.
-  G. Rätsch, T. Onoda, and K.-R. Müller, “Soft margins for AdaBoost,” Machine Learning, vol. 42, no. 3, pp. 287–320, 2001.