Kernel methods for classification and regression (and Support Vector Machines (SVMs) in particular) require selection of a kernel. Kernel Learning (KL) algorithms such as those found in xu2010simple ; sonnenburg2010shogun ; yang2011efficient automate this task by finding the kernel which optimizes an achievable metric such as the soft margin (for classification). The set of kernels over which the algorithm can optimize, however, strongly influences the performance and robustness of the resulting classifier or predictor.
To understand how the choice of this set influences performance and robustness, three properties were proposed in JMLR to characterize the set: tractability, density, and universality. Specifically, the set is tractable if it is convex (or, preferably, a linear variety), implying the KL problem is solvable using, e.g., the algorithms in rakotomamonjy_2008 ; jain_2012 ; lanckriet_2004 ; qiu2005multiple ; gonen2011multiple . The set has the density property if, for any tolerance and any positive kernel, there exists a kernel in the set which approximates that positive kernel to within the tolerance. The density property implies the kernel will perform well on untrained data (robustness or generalizability). The set has the universal property if every kernel in the set is universal, ensuring the classifier/predictor will perform arbitrarily well on large sets of training data.
In JMLR , the Tessellated Kernels (TKs) were shown to have all three properties, the first known such class of kernels. This work was based on a general framework for using positive matrices to parameterize positive kernels (as opposed to positive kernel matrices as in lanckriet_2004 ; qiu2005multiple ; ni2006learning ). Unfortunately, however, the algorithms proposed in JMLR were either based on SemiDefinite Programming (SDP) (thereby limiting the amount of training data) or used a randomized linear basis for the kernels (implying loss of density). Thus, while the algorithms in JMLR outperformed all other methods (including deep learning) as measured by Test Set Accuracy (TSA), the computation times were not competitive. Furthermore, the results in JMLR did not encompass the problem of regression.
In this paper, we extend the TK framework proposed in JMLR to the problem of regression. The KL problem in regression has been studied using SDP in qiu2005multiple ; ni2006learning and Quadratic Programming (QP) in e.g. rakotomamonjy_2008 ; jain_2012 . However, none of these previous works considered a set of kernels with both the tractability and the density property. By generalizing the Tessellated KL framework proposed in JMLR to the regression problem, we demonstrate significant increases in performance, as measured by Mean Square Error (MSE), when compared to the results in rakotomamonjy_2008 ; jain_2012 ; qiu2005multiple .
In addition, we show that the SDP-based algorithm of JMLR for classification, extended here to regression, can be decomposed into a pair of primal and dual sub-problems, similar to the approach taken in rakotomamonjy_2008 ; jain_2012 . Furthermore, we show that one of these sub-problems (an SDP) admits an analytic solution using the Singular Value Decomposition (SVD), an approach which allows us to consider higher dimensional feature spaces and more complex TKs. In addition, the other sub-problem is a convex QP and may be solved efficiently, with an achieved complexity in the number of data points which is reported in Section 6. We use a two-step algorithm which alternates between the two sub-problems and show that termination of this iteration is equivalent to global optimality. The resulting algorithm, then, does not require the use of SDP and, when applied to several standard test cases, is shown to retain the favorable TSA of JMLR for classification, while offering improved MSE for regression, and competitive computation times as compared to other KL and deep learning algorithms.
2 An Ideal Set of Kernels for KL in Classification and Regression
Consider a generalized representation of the KL problem, which encompasses both classification and regression where (using the representer theorem scholkopf2001generalized ) the learned function is of the form .
is the loss function and is defined for SVM binary classification and SVM regression as and , respectively, where
The properties of the classifier/predictor, , resulting from Optimization Problem 1 will depend on the properties of the set , which is presumed to be a subset of the convex cone of all positive kernels. To understand how influences the tractability of the optimization problem and the resulting fit, we consider three properties of the set, .
We say a set of kernel functions, , is tractable if it can be represented using a countable basis.
The set of kernels is tractable if there exists a countable set such that, for any , there exists where for some .
Note that the elements of this countable basis need not be positive kernel functions. The tractable property is required for the KL problem to be tractable using algorithms for convex optimization.
Universal kernel functions always have positive definite (full rank) kernel matrices, implying that for arbitrary data , there exists a function , such that for all . Conversely, if a kernel is not universal, then there exists a data set such that for any , there exists some such that . This ensures that SVMs using universal kernels can always benefit from additional training data, whereas non-universal kernels may saturate.
A kernel is said to be universal on the compact metric space if it is continuous and there exists an inner-product space and feature map, such that and where the unique Reproducing Kernel Hilbert Space (RKHS), with associated norm is dense in f where .
The following definition extends the universal property to a set of kernels.
A set of kernel functions has the universal property if every kernel function is universal.
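The link between universality, full-rank kernel matrices, and exact fitting of arbitrary labels can be checked on a toy problem. The following is a minimal sketch, not part of the original analysis: the three data points, the labels, the Gaussian bandwidth parameter `gamma`, and the helper names are all assumptions chosen for illustration. It fits arbitrary labels exactly with a Gaussian kernel (which is universal) and shows that the Gram matrix of the linear kernel (which is not) is rank deficient on the same data.

```python
import math

def gaussian_gram(xs, gamma=1.0):
    """Gram matrix of the Gaussian kernel exp(-gamma * (x - y)^2) on 1-D points."""
    return [[math.exp(-gamma * (a - b) ** 2) for b in xs] for a in xs]

def linear_gram(xs):
    """Gram matrix of the linear kernel k(x, y) = x * y on 1-D points."""
    return [[a * b for b in xs] for a in xs]

def solve(A, b):
    """Solve A c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    c = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * c[j] for j in range(i + 1, n))
        c[i] = (M[i][n] - s) / M[i][i]
    return c

xs = [0.0, 1.0, 2.0]      # hypothetical training inputs
ys = [1.0, -1.0, 1.0]     # arbitrary labels to be matched exactly

# Gaussian kernel: the Gram matrix is invertible, so f(x_i) = y_i is achievable.
K = gaussian_gram(xs)
coeffs = solve(K, ys)
fitted = [sum(K[i][j] * coeffs[j] for j in range(3)) for i in range(3)]
max_residual = max(abs(f - y) for f, y in zip(fitted, ys))

# Linear kernel: the Gram matrix has rank one, so every 2x2 minor vanishes
# and labels such as (1, -1, 1) cannot be matched by f(x) = w * x.
L = linear_gram(xs)
minor = L[1][1] * L[2][2] - L[1][2] * L[2][1]

print(max_residual, minor)
```

The Gaussian fit leaves a residual at the level of floating-point round-off, whereas the vanishing minor shows that the linear kernel has already saturated on three points.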
The third property is density, which distinguishes the TK class from other sets of kernel functions with the universal property. For instance, consider a set containing a single Gaussian kernel function, which is clearly not ideal for kernel learning. The set containing a single Gaussian is tractable (it has only one element) and every member of the set is universal. However, it is not dense.
Considering SVM for classification, the KL problem determines the kernel for which we may obtain the maximum separation in the kernel-associated feature space. Increasing this separation distance makes the resulting classifier more robust (generalizable) boehmke2019hands . The density property, then, ensures that the resulting KL algorithm will be maximally robust (generalizable) in the sense of separation distance.
Likewise, considering SVMs for regression, the KL problem finds the kernel which permits the “flattest” smola2004tutorial function in feature space. In this case, the density property ensures that the resulting KL algorithm will be maximally robust (generalizable) in the sense of flatness.
These arguments motivate the following definition of the pointwise density property.
The set of kernels is said to be pointwise dense if for any positive kernel, , any set of data , and any , there exists such that .
3 A General Framework for Representation of Tractable Kernel Sets
Here we define a framework for constructing classes of tractable positive kernel functions and illustrate this approach on the class of General Polynomial Kernels.
Let be any bounded measurable function and be a positive semidefinite matrix . Then
is a positive kernel function.
Let be any bounded measurable function on compact and . Then the set of kernel functions
For a given , the map is linear. Specifically,
and thus by Definition 1 is tractable.
3.1 The Class of General Polynomial Kernels is Tractable
The class of General Polynomial Kernels (GPKs) is defined as the set of all polynomials, each of which is a positive kernel.
The GPK class is not universal, but is tractable, as per the following lemma.
Let be the vector of monomials of degree or less. From JMLR , we have that a polynomial of degree is a positive polynomial kernel if and only if there exists some such that . Now for any finite-dimensional subset of , let be the maximum degree over this subset and define . Then Lemma 6 implies that is tractable. ∎
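The parameterization of polynomial kernels by positive matrices can be illustrated numerically. The sketch below assumes the form k(x, y) = Z(x)^T P Z(y), with Z the vector of monomials and P a positive semidefinite matrix, which is our reading of the characterization cited from JMLR; the scalar inputs, the degree, and the matrices `A1` and `A2` are hypothetical. It checks that the resulting kernel matrix has nonnegative quadratic forms and that the map from P to the kernel matrix is linear, which is what makes the set tractable.

```python
import random

def monomials(x, degree):
    """Vector of monomials [1, x, ..., x^degree] for a scalar input x."""
    return [x ** p for p in range(degree + 1)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def gram(xs, P, degree):
    """Kernel matrix K[i][j] = Z(x_i)^T P Z(x_j) (assumed form of the kernel)."""
    Z = [monomials(x, degree) for x in xs]
    return matmul(matmul(Z, P), transpose(Z))

def quad_form(K, c):
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

degree = 2
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]           # hypothetical data points

# Any matrix of the form A A^T is positive semidefinite.
A1 = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, -1.0, 0.5]]
A2 = [[0.2, 0.0, 0.0], [0.0, 0.3, 0.0], [1.0, 0.0, 0.1]]
P1 = matmul(A1, transpose(A1))
P2 = matmul(A2, transpose(A2))

K1 = gram(xs, P1, degree)
K2 = gram(xs, P2, degree)
K_sum = gram(xs, add(P1, P2), degree)

# Linearity in P: the kernel matrix of P1 + P2 is the sum of the kernel matrices.
n = len(xs)
linearity_error = max(abs(K_sum[i][j] - K1[i][j] - K2[i][j])
                      for i in range(n) for j in range(n))

# Positivity: c^T K c >= 0 for randomly drawn coefficient vectors c.
rng = random.Random(0)
min_quad = min(quad_form(K1, [rng.uniform(-1, 1) for _ in xs]) for _ in range(200))

print(linearity_error, min_quad)
```

Random sampling of quadratic forms is only a spot check; positivity itself follows from the factorization P = A A^T.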
4 Tessellated Kernels: Tractable, Dense and Universal
In this section, we define the class of TK kernels and show it is tractable, dense, and universal.
4.1 Tessellated Kernels
Again, let be the vector of monomials of degree . Define , the indicator function for the positive orthant, and the following choice of as
where means for all . We now define the set of TK kernels for as
Kernels in the TK class are “Tessellated” in the sense that each datapoint defines a vertex which bisects each dimension of the domain of the resulting classifier/predictor - resulting in a tessellated partition of the feature space.
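The sense in which data points partition the domain can be seen in the simplest case. The following sketch assumes a one-dimensional, degree-zero instance in which the feature vector is N(z, x) = [I(z >= x), 1 - I(z >= x)] and the kernel is the integral of N(z, x)^T P N(z, y) over z in [a, b]; this specific form, the interval, and the matrix `P` are assumptions made for illustration and are not taken verbatim from the definitions above. Under these assumptions the kernel has a piecewise-linear closed form whose slope changes exactly where x crosses y.

```python
def tk0_kernel(x, y, P, a=0.0, b=1.0):
    """Closed form of the assumed degree-zero kernel on [a, b]."""
    hi, lo = max(x, y), min(x, y)
    return (P[0][0] * (b - hi)            # z >= x and z >= y
            + P[0][1] * max(y - x, 0.0)   # x <= z < y
            + P[1][0] * max(x - y, 0.0)   # y <= z < x
            + P[1][1] * (lo - a))         # z < x and z < y

def tk0_numeric(x, y, P, a=0.0, b=1.0, steps=20000):
    """Midpoint-rule approximation of the same integral, as a cross-check."""
    h = (b - a) / steps
    total = 0.0
    for s in range(steps):
        z = a + (s + 0.5) * h
        nx = (1.0, 0.0) if z >= x else (0.0, 1.0)
        ny = (1.0, 0.0) if z >= y else (0.0, 1.0)
        total += h * sum(nx[i] * P[i][j] * ny[j] for i in range(2) for j in range(2))
    return total

P = [[2.0, -0.5], [-0.5, 1.0]]    # hypothetical positive definite matrix

closed = tk0_kernel(0.3, 0.7, P)
numeric = tk0_numeric(0.3, 0.7, P)

# Slope in x on either side of the "vertex" at x = y = 0.7.
slope_left = (tk0_kernel(0.4, 0.7, P) - tk0_kernel(0.3, 0.7, P)) / 0.1
slope_right = (tk0_kernel(0.9, 0.7, P) - tk0_kernel(0.8, 0.7, P)) / 0.1

print(closed, numeric, slope_left, slope_right)
```

With this choice of `P` the slope jumps from 1.5 to -2.5 at the data point, so a predictor built from such kernels is piecewise linear with breakpoints at the training data.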
4.2 The Set of TK Kernels is Tractable
However, we will expand on this result by specifying the basis for the set of TK kernels, which will then be used in Section 5.
Suppose that for , and . We define the finite set . Let be some ordering of and define where . Now let be as defined in Eqn. (2) for some and where is as defined in Eqn. (6). If we partition then we have,
where are defined as
where is the vector of ones, is defined elementwise as , and is defined as
4.3 The TK Class is Dense
The density property differentiates the set of TK kernels from other sets of kernel functions (e.g. a linear combination of Gaussian kernels of fixed bandwidths).
From JMLR we have that the set of TK kernels satisfies the pointwise density property.
For any kernel matrix and any finite set , there exists a and such that if , then .
In JMLR an analytical solution, , was found for the optimal trace-constrained kernel matrix that maximized the separation distance between two classes of points in the feature space. It was shown in this work that when has an equal number of positive and negative labels, contains an equal number of positive and negative elements - illustrating the importance of using kernels which are not pointwise positive (Gaussians are pointwise positive).
To illustrate the density property, then, we show how optimal GPK and TK kernels yield kernel matrices which approximate the analytic solution, , of the optimal kernel matrix problem for a given set of data and labels , while Gaussian kernels do not. Specifically, we consider the following optimization problem.
In these problems, the sets will be: - the sum of Gaussians with bandwidths ; - the GPKs of degree ; and - the TK kernels of degree . More precisely, for bandwidths , we define
Consider a spiral data set with 20 samples, using equal numbers of positive and negative labels. Fig. 1 shows the achieved objective value of Problem (7) for , , and as a function of the number of bandwidths (top axis - in ), polynomial degree (bottom axis - in , and ). The -axes of the plots are scaled to show equal numbers of decision variables. As expected, the case saturates with an objective value significantly larger than the lower bound. The cases and , meanwhile have almost no error at degree .
4.4 TK Kernels are Universal
Finally, we discuss the universality property of the class of TK kernels, which ensures that every TK function can fit the training data well.
The following theorem from JMLR shows that any TK kernel with is necessarily universal.
This theorem implies that even if we use the subset of TK kernels defined by , this subset is still universal.
5 A New Algorithm for KL in Classification and Regression using TKs
In this section, we express the KL optimization problem for both classification and regression and break this optimization problem into two sub-problems which allow us to express the problem in primal and dual form. For convenience, we define the feasible sets for the sub-problems as
The common part of the objective is
while the unique parts of the objective are
Then the KL optimization problem () for TK kernels ( being elementwise multiplication) is as follows for classification and regression, respectively.
Primal Formulation: We can formulate the primal problem () as
where for classification and regression, respectively,
Dual Formulation: Alternatively, we have the dual formulation ().
where for classification and regression. Likewise, for classification and regression, respectively,
For , , if and only if: solve ; solves ; and solves .
For any minmax optimization problem with objective function , we have
and strong duality holds () if and are both convex and one is compact, is convex for every and is concave for every , and the function is continuous fan1953minimax . In our case, these conditions hold for both classification and regression where . Hence if solves and solves , then solves and
Conversely, suppose , , then
Hence if , then and hence and solve and , respectively. ∎
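The role of the convexity and concavity conditions in the minimax argument can be seen on two toy objectives. The sketch below is illustrative only: the functions `saddle` and `no_saddle` and the grids are hypothetical and unrelated to the KL objective. The first function is convex in x and concave in y, so the min-max and max-min values coincide; the second is convex in both arguments, so the conditions fail and a strict gap appears.

```python
def minimax_values(f, xs, ys):
    """Return (min_x max_y f, max_y min_x f) evaluated over finite grids."""
    min_max = min(max(f(x, y) for y in ys) for x in xs)
    max_min = max(min(f(x, y) for x in xs) for y in ys)
    return min_max, max_min

def saddle(x, y):
    # Convex in x for every y, concave in y for every x.
    return x * x + x * y - y * y

def no_saddle(x, y):
    # Convex in x, but also convex (not concave) in y.
    return (x - y) ** 2

grid = [i / 100 for i in range(-100, 101)]   # [-1, 1]
unit = [i / 100 for i in range(0, 101)]      # [0, 1]

mm_saddle, Mm_saddle = minimax_values(saddle, grid, grid)
mm_gap, Mm_gap = minimax_values(no_saddle, unit, unit)

# Weak duality (max-min <= min-max) holds in both cases.
print(mm_saddle, Mm_saddle)   # equal: strong duality
print(mm_gap, Mm_gap)         # strict gap: the conditions are violated
```

In the first case both values equal zero, attained at the saddle point at the origin; in the second, the min-max value is 0.25 while the max-min value is zero.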
With the kernel held fixed, the first sub-problem is a Quadratic Program (QP). General purpose QP solvers as applied to this problem have a worst-case complexity which scales polynomially in the number of data points ye1989extension . This computational complexity may be improved, however, by noting that the problem formulation is compatible with the representation defined in LibSVM for QPs derived from SVM. In this case, the algorithm in LibSVM LibSVM can reduce the computational burden somewhat. This improved performance is illustrated in Figure 3, which shows the achieved complexity. Note that for the 2-step algorithm proposed in this manuscript, solving the QP is significantly slower than computing the Singular Value Decomposition (SVD) required for the second sub-problem, which is defined in the following subsection. However, solving the QP is also significantly faster than solving the large SDP, as described in lanckriet_2004 , qiu2005multiple , and JMLR . This complexity comparison will be further discussed in Section 6.
With the QP variables held fixed, the second sub-problem is an SDP. Fortunately, however, this SDP is structured so as to admit an analytic solution using the SVD. To solve this sub-problem we minimize the objective term from Eq. (8) which, as per Corollary 8, is linear in the matrix variable and can be formulated as
and , and can be found in Corollary 8.
The following theorem gives an analytic solution for using the SVD.
Let be the SVD of symmetric and be the right singular vector corresponding to the minimum singular value of . Then solves .
Recall has the form .
Denote the minimum singular value of as . Then for any feasible , by fang1994inequalities we have
Now consider . is feasible since , and . Furthermore,
as desired. ∎
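The analytic solution can be checked by brute force on a small instance. The sketch below assumes the sub-problem has the form: minimize trace(D P) over positive semidefinite P with trace(P) = q. This form, the 2x2 matrix `D`, and the value of `q` are assumptions for illustration. Because `D` is chosen symmetric positive semidefinite, its singular values and eigenvalues coincide, so a closed-form 2x2 eigendecomposition is used here in place of a general SVD routine. The candidate P* = q v v^T is compared against randomly drawn feasible matrices.

```python
import math
import random

def min_eig_pair(D):
    """Smallest eigenvalue and a unit eigenvector of a symmetric 2x2 matrix."""
    a, b, c = D[0][0], D[0][1], D[1][1]
    lam = (a + c) / 2 - math.hypot((a - c) / 2, b)
    if b != 0.0:
        v = (b, lam - a)          # satisfies (D - lam I) v = 0
    else:
        v = (1.0, 0.0) if a <= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return lam, (v[0] / norm, v[1] / norm)

def trace_product(D, P):
    """trace(D P) for 2x2 matrices."""
    return sum(D[i][j] * P[j][i] for i in range(2) for j in range(2))

def random_feasible(q, rng):
    """Random PSD matrix with trace q, built as q * B B^T / trace(B B^T)."""
    B = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(2)]
    G = [[sum(B[i][k] * B[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]
    t = G[0][0] + G[1][1]
    return [[q * g / t for g in row] for row in G]

D = [[2.0, 1.0], [1.0, 1.0]]   # hypothetical symmetric PSD data matrix
q = 3.0                        # hypothetical trace bound

lam_min, v = min_eig_pair(D)
P_star = [[q * v[i] * v[j] for j in range(2)] for i in range(2)]
best = trace_product(D, P_star)

rng = random.Random(1)
others = [trace_product(D, random_feasible(q, rng)) for _ in range(1000)]

print(best, q * lam_min, min(others))
```

The candidate attains q times the smallest eigenvalue, and no sampled feasible matrix does better, consistent with the trace inequality used in the proof.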
Note that the size of the SVD problem in is , which increases with the number of features, which is typically relatively small. As a result, we observe that the step of Algorithm 1 is typically less computationally intense than the step.
6 Complexity and Scalability of the New TK Kernel Learning Algorithm
We consider the computational complexity of Algorithm 1 in terms of the number of data points used to learn the TK kernel function and the size of the positive matrix which parameterizes the kernel. The experimentally observed scaling for classification and for regression can be seen in Fig. 3. These results are lower than the value reported in JMLR for binary classification. The values for classification and regression are both estimated using the Combined Cycle Power Plant (CCPP) data set in tufekci2014prediction ; kaya2012local , which contains 4 features. In the case of classification, labels with value greater than or equal to the median were assigned to one class, and those less than the median were assigned to the other. Note that to study scalability in the size of the matrix, we varied the number of features in the dataset, thereby incrementing the size of the matrix.
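An empirical scaling exponent of this kind is typically read off as the slope of a log-log fit of run time against problem size. The following is a minimal sketch of that estimate; the problem sizes, the constant, and the exponent 2.5 used to generate the synthetic timings are hypothetical and are not the values measured for Algorithm 1.

```python
import math

def fit_power_law(sizes, times):
    """Least-squares fit of log(time) = log(c) + p * log(size); returns (c, p)."""
    lx = [math.log(s) for s in sizes]
    ly = [math.log(t) for t in times]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    p = sxy / sxx
    c = math.exp(my - p * mx)
    return c, p

# Synthetic timings generated from a known power law (hypothetical values).
sizes = [500, 1000, 2000, 4000, 8000]
times = [1e-6 * m ** 2.5 for m in sizes]

c, p = fit_power_law(sizes, times)
print(c, p)
```

On measured data the fit would be applied to mean run times over repeated trials, and the recovered slope is only meaningful over the range of sizes actually tested.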
Aside from improved scalability, the overall time required for Algorithm 1 is significantly reduced when compared with the algorithm in JMLR , improving by two orders of magnitude in some cases. This is illustrated for classification using four data sets in Table 1. This improved complexity is likely due to the lower overhead associated with QP and the SVD.
|Method||Liver UCI||Cancer mangasarian1990pattern||Heart UCI||Pima UCI|
|SDP||95.75 ± 2.68||636.17 ± 25.43||221.67 ± 29.63||1211.66 ± 27.01|
|TKL||1.10 ± 0.24||8.20 ± 0.36||3.35 ± 0.26||12.66 ± 0.44|
We report the mean computation time (in seconds), along with standard deviation, for 30 trials comparing the SDP algorithm in JMLR and the new TKL algorithm on several data sets. All tests are run on a computer with an Intel i7-5960X CPU at 3.00 GHz with 128 GB of RAM.
7 Accuracy of the New TK Kernel Learning Algorithm for Regression
As expected, for classification, the accuracy of the new TK kernel learning algorithm (TKL) is identical to that reported in the analysis in JMLR .
For regression, we evaluate the accuracy of TKL when compared to other state of the art machine learning algorithms. Because the set of TK kernels is dense, for classification (as shown in JMLR ), TKL outperforms all existing algorithms with respect to TSA. For regression, the appropriate metric is Mean Square Error (MSE). The algorithms used in our comparison are as follows.
[TKL] Algorithm 1 with , and we scale the data so that , and then select , where and are chosen by 5-fold cross-validation;
[SimpleMKL] We use SimpleMKL rakotomamonjy_2008 with a standard selection of Gaussian and polynomial kernels with bandwidths arbitrarily chosen between .5 and 10 and polynomial degrees one through three - yielding approximately kernels. We set as in TKL and is chosen by 5-fold cross-validation;
[Neural Net] We use a 3 layer neural network with 50 hidden layers using MATLAB's (feedforwardnet) implementation and stopped learning after the error in a validation set decreased sequentially 50 times.
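The selection of hyperparameters by 5-fold cross-validation with MSE as the score can be sketched as follows. This example is illustrative only: it swaps in a simple Nadaraya-Watson kernel smoother in place of TKL or SimpleMKL, and the synthetic data, the candidate bandwidths, and all function names are hypothetical.

```python
import math
import random

def mse(y_true, y_pred):
    """Mean square error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and split into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nw_predict(x_train, y_train, x, bandwidth):
    """Nadaraya-Watson estimate at x with Gaussian weights (stand-in model)."""
    w = [math.exp(-((x - xi) / bandwidth) ** 2 / 2) for xi in x_train]
    return sum(wi * yi for wi, yi in zip(w, y_train)) / sum(w)

def cv_select(xs, ys, candidates, folds):
    """Return the candidate with the lowest mean held-out MSE, plus all scores."""
    scores = {}
    for h in candidates:
        errors = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(xs)) if i not in held]
            xt = [xs[i] for i in train]
            yt = [ys[i] for i in train]
            preds = [nw_predict(xt, yt, xs[i], h) for i in held_out]
            errors.append(mse([ys[i] for i in held_out], preds))
        scores[h] = sum(errors) / len(errors)
    return min(scores, key=scores.get), scores

rng = random.Random(0)
xs = [i / 59 for i in range(60)]                                  # synthetic inputs
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.1) for x in xs]  # noisy targets

folds = kfold_indices(len(xs), 5, rng)
candidates = [0.05, 0.1, 0.2, 0.5, 1.0]
best, scores = cv_select(xs, ys, candidates, folds)
print(best, scores[best])
```

The same loop applies to any pair of hyperparameters by replacing the candidate list with a grid of pairs and the stand-in smoother with the learner of interest.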
In Table 2, we see the average MSE on the test set for these three approaches as applied to randomly selected regression benchmark data sets. For each data set, Table 2 gives the dimension of the data, the number of training data points, and the number of testing data points, in that order. In all cases except Forest, [TKL] had a lower (or comparable) computation time than both SimpleMKL and Neural Net. In all cases, the MSE for TKL was the lowest of the three methods, although on Forest the margin over SimpleMKL was small (2.05 versus 2.07). These results illustrate the importance of the density property.
To further illustrate the importance of the density property and the TKL framework for practical regression problems, we used elevation data from becker2009global to learn a TK kernel and associated SVM predictor representing the surface of the Grand Canyon in Arizona. This data set is particularly challenging due to the variety of geographical features. The result of the TKL algorithm can be seen in Figure 2(d).
|Data Set||Method||MSE||Time||Data Set||Method||MSE||Time|
|CCPP tufekci2014prediction ; kaya2012local||TKL||9.70||1463.8||Abalone UCI||TKL||3.43||522.5|
|features = 4, training = 8000||SimpleMKL||13.77||26097.1||features = 8, training = 4000||SimpleMKL||4.28||1185.3|
|testing = 1568||Neural Net||15.00||850.4||testing = 177||Neural Net||8.72||483.4|
|Airfoil UCI||TKL||1.46||92.1||Forest cortez2007data||TKL||2.05||7.6|
|features = 5, training = 1300||SimpleMKL||3.63||1025.0||features = 10, training = 457||SimpleMKL||2.07||0.8|
|testing = 203||Neural Net||4.28||61.3||testing = 50||Neural Net||6.40||117.7|
We have extended the TK kernel learning framework to regression problems and proposed a faster algorithm for TK kernel learning which can be used for both classification and regression. The set of TK kernels is tractable, dense, and universal, implying that KL algorithms based on TK kernels are more robust, which results in higher TSA for classification and lower MSE for regression. These three properties, combined with the improved computational complexity of the new algorithm, have resulted in a kernel learning framework which achieves lower MSE and lower (or comparable) computation time when compared to both SimpleMKL and neural networks.
While machine learning algorithms have become very accurate in recent years, they perform poorly when faced with changes in the underlying process. As evidenced by Covid19, predictive models based on ML algorithms can be brittle mims_2020 . The density property of the TK class ensures that the models generated using the algorithms described in this manuscript will be more robust to such changes in environment. Naturally, however, over-reliance on predictive models, without understanding of the process, can lead to negative outcomes, even if the models are robust.
-  J.J. Becker, D.T. Sandwell, W.H.F. Smith, J. Braud, B. Binder, J.L. Depner, D. Fabre, J. Factor, S. Ingalls, S.H. Kim, et al. Global bathymetry and elevation data at 30 arc seconds resolution: Srtm30_plus. Marine Geodesy, 32(4):355–371, 2009.
-  B. Boehmke and B.M. Greenwell. Hands-On Machine Learning with R. CRC Press, 2019.
-  C-C. Chang and C-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
-  B.K. Colbert and M.M. Peet. A convex parametrization of a new class of universal kernel functions. Journal of Machine Learning Research, 21(45):1–29, 2020.
-  P. Cortez and A. Morais. A data mining approach to predict forest fires using meteorological data. 2007.
-  D. Dua and C. Graff. UCI machine learning repository, 2017.
-  K. Fan. Minimax theorems. Proceedings of the National Academy of Sciences of the United States of America, 39(1):42, 1953.
-  Y. Fang, K.A. Loparo, and X. Feng. Inequalities for the trace of matrix product. IEEE Transactions on Automatic Control, 39(12):2489–2490, 1994.
-  M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 2011.
-  A. Jain, S. Vishwanathan, and M. Varma. SPF-GMKL: generalized multiple kernel learning with a million kernels. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, 2012.
-  H. Kaya, P. Tüfekci, and F.S. Gürgen. Local and global learning methods for predicting power of a combined gas & steam turbine. In Proceedings of the international conference on emerging trends in computer and electronics engineering icetcee, pages 13–18, 2012.
-  G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 2004.
-  O.L. Mangasarian, R. Setiono, and W.H. Wolberg. Pattern recognition via linear programming: Theory and application to medical diagnosis. 1990.
-  C. Mims. AI isn’t magical and won’t help you reopen your business. Wall Street Journal, May 2020.
-  K. Ni, S. Kumar, and T. Nguyen. Learning the kernel matrix for superresolution. In 2006 IEEE Workshop on Multimedia Signal Processing, pages 441–446, 2006.
-  S. Qiu and T. Lane. Multiple kernel learning for support vector regression. Computer Science Department, The University of New Mexico, Albuquerque, NM, USA, Tech. Rep, page 1, 2005.
-  A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 2008.
-  B. Recht. Convex Modeling with Priors. PhD thesis, Massachusetts Institute of Technology, 2006.
-  B. Schölkopf, R. Herbrich, and A.J. Smola. A generalized representer theorem. International conference on computational learning theory, pages 416–426, 2001.
-  A.J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and computing, 14(3):199–222, 2004.
-  S. Sonnenburg, G. Rätsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. De Bona, A. Binder, C. Gehl, and V. Franc. The shogun machine learning toolbox. Journal of Machine Learning Research, 11(60):1799–1802, 2010.
-  P. Tüfekci. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power & Energy Systems, 60:126–140, 2014.
-  Z. Xu, R. Jin, H. Yang, I. King, and M.R. Lyu. Simple and efficient multiple kernel learning by group lasso. In Proceedings of the 27th international conference on machine learning, pages 1175–1182, 2010.
-  H. Yang, Z. Xu, J. Ye, I. King, and M.R. Lyu. Efficient sparse generalized multiple kernel learning. IEEE Transactions on neural networks, 22(3):433–446, 2011.
-  Y. Ye and E. Tse. An extension of karmarkar’s projective algorithm for convex quadratic programming. Mathematical programming, 44(1-3):157–179, 1989.