A Convex Parametrization of a New Class of Universal Kernel Functions for use in Kernel Learning

11/15/2017
by Brendon K. Colbert, et al.
Arizona State University

We propose a new class of universal kernel functions which admit a linear parametrization using positive semidefinite matrices. These kernels are generalizations of the Sobolev kernel and are defined by piecewise-polynomial functions. The class of kernels is termed "tessellated" as the resulting discriminant is defined piecewise with hyper-rectangular domains whose corners are determined by the training data. The kernels have scalable complexity, but each instance is universal in the sense that its hypothesis space is dense in L_2. Using numerical testing, we show that for the soft margin SVM, this class can eliminate the need for Gaussian kernels. Furthermore, we demonstrate that when the ratio of the number of training data to features is high, this method will significantly outperform other kernel learning algorithms. Finally, to reduce the complexity associated with SDP-based kernel learning methods, we use a randomized basis for the positive matrices to integrate with existing multiple kernel learning algorithms such as SimpleMKL.


1 Introduction

This paper addresses the problem of automated selection of an optimal kernel function. Kernel functions implicitly define a linear parametrization of nonlinear candidate maps from features to labels. The ‘kernel trick’ allows optimization of fit and regularity in this hypothesis space without explicit representation of the space itself. The kernel selection process, then, is critical for determining the class of hypothesis functions and, as a result, is a well-studied topic, with generalized kernels including polynomials, Gaussians, and many variations of the Radial Basis Function class. In addition, specialized kernels include string kernels [17, 7], graph kernels [9], and convolution kernels [13, 4]. The kernel selection process heavily influences the accuracy of the resulting fit, and hence significant research has gone into the optimization of kernel functions so as to select the hypothesis space which most accurately represents the underlying physical process.

Recently, a number of kernel learning algorithms have been proposed. For support vector machines, the methods proposed in this paper are heavily influenced by the SDP approach of Lanckriet et al. [15], which directly imposed kernel matrix positivity using a linear subspace of candidate kernel functions. There have been several extensions of the SDP approach, including the hyperkernel method of [20]. However, because of the complexity of semidefinite programming, more recent work has focused on gradient methods for non-convex parametrizations and on convex LP-type parametrizations of positive linear combinations of candidate kernels, as in SimpleMKL [22] or the several variations in [24]. These methods rely on kernel set operations (addition, multiplication, convolution) to generate large numbers of parameterized kernel functions [5]. When the parametrization is non-convex, gradient-based methods find local minima; such methods include GMKL [14]. Other variations include LMKL [11] (gating), polynomial kernel combinations, and the alignment and centered-alignment MKL of, e.g., [6]. For kernel learning, regularization is particularly important, and interesting approaches to this problem include the group sparsity metric in [25] and the enclosing-ball approach in [8]. See, e.g., [12] for a comprehensive review of multiple kernel learning algorithms.

In this paper, we consider the class of “Universal Kernels” formalized in [19]. A kernel is defined as universal if its associated Reproducing Kernel Hilbert Space (RKHS) is suitably dense in the space of continuous functions on any compact subset of the feature space. That is, for a given compact set and target function, we consider the distance from the target to the hypothesis space generated by the kernel (measured, e.g., in the uniform or $L_2$ sense), and we would like to establish that this distance can be made arbitrarily small, i.e., density of the hypothesis space.

The most well-known example of a universal kernel is the Gaussian (generalized in [26]). However, most other common kernels are not universal, including, significantly, the polynomial class of kernels. In this paper, we propose a new class of universal kernel functions which are defined by polynomials and admit a convex parametrization. Specifically, we consider the class of kernels defined as
$$K(x, y) = \int_X N(z, x)^T P \, N(z, y)\, dz,$$
which is a positive semidefinite kernel for any $P \succeq 0$. We will show that, for a suitable choice of the function $N$, this leads to kernel functions which are generalizations of the class of Sobolev kernels [21] and which therefore have a hypothesis space dense in $L_2$. In contrast to the Gaussian kernels, however, our universal class of “tessellated” kernels has a linear parametrization, need not be pointwise positive, and is piecewise-polynomial, making it significantly more robust and useful in combination with kernel-learning algorithms. Specifically, we show how this class of kernels can be rigorously incorporated into the SDP kernel learning framework as well as into SimpleMKL (albeit using a randomized set of positive matrices in the latter case). In the numerical results, we show the potential of these kernels for unlimited learning by examining cases where the ratio of training data to features is high. In this case, all other kernel learning methods saturate, while the universal kernel approach is able to estimate the discriminant with seemingly arbitrary accuracy.

2 Posing the kernel learning problem as an SDP

Suppose we are given a set of $m$ training data points with features $x_i \in \mathbb{R}^n$ and labels $y_i \in \{-1, 1\}$, $i = 1, \dots, m$. For a given regularization parameter $C > 0$ and hypothesis space of functions $\mathcal{H}$, we define the 1-norm soft margin problem as
$$\min_{f \in \mathcal{H},\, b,\, \xi \ge 0} \ \frac{1}{2}\|f\|_{\mathcal{H}}^2 + C \sum_{i=1}^m \xi_i \quad \text{s.t.} \quad y_i \big( f(x_i) + b \big) \ge 1 - \xi_i, \quad i = 1, \dots, m.$$

The kernel learning problem is then the problem of minimizing this objective over a set of hypothesis spaces, where we take the set of valid hypothesis spaces to be the reproducing kernel Hilbert spaces generated by positive kernels, defined as follows.

Definition 1.

We say a function $k : X \times X \to \mathbb{R}$ is a positive kernel if
$$\int_X \int_X f(x)\, k(x, y)\, f(y)\, dx\, dy \ \ge \ 0$$
for any function $f \in L_2[X]$.

Define $\mathcal{K}$ to be the set of all positive semidefinite kernels. Then the representer theorem states that, for a given kernel $k$, the 1-norm soft margin problem may be formulated as

(1) $\displaystyle \min_{\alpha,\, b,\, \xi \ge 0} \ \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j\, k(x_i, x_j) + C \sum_{i=1}^{m} \xi_i$

(2) $\displaystyle \text{subject to} \quad y_i \Big( \sum_{j=1}^{m} \alpha_j\, k(x_j, x_i) + b \Big) \ge 1 - \xi_i, \qquad i = 1, \dots, m.$

By forming the Lagrangian dual of this problem, we then get the following equivalent problem:
$$\min_{k \in \mathcal{K}} \ \max_{0 \le \alpha_i \le C,\ \sum_i \alpha_i y_i = 0} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j\, y_i y_j\, k(x_i, x_j),$$
where the outer minimization over $k \in \mathcal{K}$ is the kernel learning problem. Using duality and a Schur-complement argument, this problem can be reformulated as an SDP.

Provided that $\mathcal{K}$ can be parameterized using a finite number of real variables, and that the associated constraints can be represented as SDP constraints, this problem can be efficiently solved using well-developed interior-point methods [1] with implementations such as MOSEK [2]. In Lanckriet et al. [15], the authors proposed the linear parametrization
$$k(x, y) = \sum_i \mu_i\, k_i(x, y),$$
where the $k_i$ are chosen a priori as known admissible kernels, such as the Gaussian kernel $k_i(x, y) = e^{-\|x - y\|^2 / 2\sigma_i^2}$, where the bandwidth $\sigma_i$ must be chosen a priori, or the polynomial kernel, where again the degree must be chosen a priori. Several methods similar to that of Lanckriet have since been produced, a more computationally efficient variant being SimpleMKL [22], which tightens the constraints and reformulates the problem using the parameterized set of kernels
$$\Big\{ k = \sum_i \mu_i\, k_i \ : \ \mu_i \ge 0,\ \sum_i \mu_i = 1 \Big\}.$$
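For a fixed kernel or a fixed combination of candidate kernels, the inner soft-margin problem above is a standard SVM and can be solved with off-the-shelf tools. The following is a minimal sketch of that building block, assuming scikit-learn and a hand-picked (not learned) combination of Gaussian and polynomial candidate kernels; it is an illustration, not the paper's toolchain, which solves the full SDP with MOSEK.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

# Hypothetical training data (m points, n features) with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = np.where(X[:, 0] > X[:, 1], 1, -1)

# A fixed linear combination of a priori candidate kernels; a kernel learning
# algorithm would optimize the weights mu instead of fixing them by hand.
mu = [0.7, 0.3]
K = mu[0] * rbf_kernel(X, X, gamma=1.0) + mu[1] * polynomial_kernel(X, X, degree=2)

# Soft-margin SVM with a precomputed Gram matrix.
clf = SVC(C=1.0, kernel="precomputed").fit(K, y)
print("training accuracy:", clf.score(K, y))
```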

In this paper, we propose a new parametrization of positive kernels which can be represented using SDP constraints and which does not require an a priori choice of kernel functions. Furthermore, we will show how this new class of kernel functions can be integrated with efficient LP-based kernel learning algorithms such as SimpleMKL [22].

3 Using positive matrices to parameterize positive kernels

In this section, we propose a general framework for using positive matrices to parameterize positive kernels. We then focus on the special case of tessellated kernel functions and show that the associated hypothesis space is suitably dense, so that these kernels may be used in lieu of classical RBF kernels such as the Gaussian.

The following result is based on a parametrization of positive integral operators initially proposed in [23].

Theorem 2.

For any function $N(z, x) \in \mathbb{R}^q$ and any positive semidefinite matrix $P \in \mathbb{R}^{q \times q}$, $P \succeq 0$,
$$K(x, y) = \int_X N(z, x)^T P\, N(z, y)\, dz$$
is a positive kernel. Moreover, the associated hypothesis space admits an explicit representation in terms of $N$ and $P$, which we use in Section 5.

Proof.

Substituting the definition of $K$ into Definition 1, we have
$$\int_X \int_X f(x)\, K(x, y)\, f(y)\, dx\, dy = \int_X g(z)^T P\, g(z)\, dz \ \ge\ 0,$$
where $g(z) := \int_X N(z, x)\, f(x)\, dx$ and the inequality follows from $P \succeq 0$.

For this class of kernels, the kernel constraint in the SDP becomes a linear equality constraint on the real-valued decision variables (the elements of $P$), with the additional constraint $P \succeq 0$.

Thus, for any function $N$, we have the following kernel learning SDP.

(3)

Note that the integrals $\int_X N_k(z, x_i)\, N_l(z, x_j)\, dz$ are calculated a priori. While this integration does not affect the computational complexity of the algorithm, it does impose an implicit restriction on the class of admissible functions $N$ and the space $X$: the integrals must be computable for any given set of data $\{x_i\}$.
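To make the construction concrete, the following sketch numerically approximates a kernel of this form for a hypothetical choice of N (an indicator of the orthant z <= x multiplied by a few monomials, chosen purely for illustration, not the paper's exact construction) and verifies that the resulting Gram matrix is positive semidefinite; the domain [0, 1]^2 and the sample sizes are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def N(z, x):
    # Hypothetical N(z, x): indicator of z <= x (componentwise) times a short monomial vector.
    ind = 1.0 if np.all(z <= x) else 0.0
    return ind * np.array([1.0, z[0], z[1], x[0], x[1]])

q = 5
A = rng.standard_normal((q, q))
P = A @ A.T                                   # any P = A A^T is positive semidefinite

X = rng.random((15, 2))                       # data in [0, 1]^2
Zs = rng.random((3000, 2))                    # Monte Carlo samples of z over X = [0, 1]^2

# Gram matrix G_ij ~ (1/M) * sum_z N(z, x_i)^T P N(z, x_j); PSD because each summand is.
F = np.array([[N(z, x) for x in X] for z in Zs])      # shape (M, m, q)
G = np.einsum("zik,kl,zjl->ij", F, P, F) / len(Zs)

print("min eigenvalue of Gram matrix:", np.linalg.eigvalsh(G).min())  # >= 0 up to rounding
```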

As a special case, consider $N(z, x) = Z_d(z, x)$, where $Z_d(z, x)$ is the vector of monomials of degree $d$ or less in the variables $z$ and $x$. It is trivial to show that any polynomial is separable; that is, $Z_d(z, x)$ can be factored into the product of a polynomial matrix in $z$ and a polynomial vector in $x$. Then the kernel becomes
$$K(x, y) = Z_d(x)^T Q\, Z_d(y)$$
for some $Q \succeq 0$, which is a polynomial in $x$ and $y$. Furthermore, any positive polynomial kernel of degree $d$ must have a representation of this form. Hence the choice $N(z, x) = Z_d(z, x)$ could equivalently be reduced to $N(x) = Z_d(x)$. However, as this illustration shows, polynomial kernels suffer from the fact that the hypothesis space is contained in the span of the monomials $Z_d$, the finite-dimensional space of polynomials of degree $d$ or less. That is, polynomial kernels can never learn a hypothesis space of rank greater than the length of $Z_d$. In the following section, we propose a choice of $N$ for which the kernel forms a hypothesis space which is infinite-dimensional and suitably dense in $L_2$, and for which the integrals can be efficiently computed.
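This rank limitation can be checked numerically: for any positive semidefinite coefficient matrix Q, the Gram matrix Phi Q Phi^T built from monomial features Phi has rank at most the number of monomials, no matter how much data is used. A small sketch, assuming scikit-learn's PolynomialFeatures for the monomial vector (an illustration, not the paper's code):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.random((200, 2))                               # 200 samples, 2 features

Phi = PolynomialFeatures(degree=2).fit_transform(X)    # monomials of degree <= 2 (6 columns)
q = Phi.shape[1]

A = rng.standard_normal((q, q))
Q = A @ A.T                                            # arbitrary PSD coefficient matrix
G = Phi @ Q @ Phi.T                                    # polynomial-kernel Gram matrix

print(q, np.linalg.matrix_rank(G))                     # rank of G never exceeds q
```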

4 A class of tessellated kernel functions

Recall from the previous section that we are searching for kernels of the form
$$K(x, y) = \int_X N(z, x)^T P\, N(z, y)\, dz,$$
where the function $N$ is determined a priori. Furthermore, the class of admissible $N$ is limited to those functions for which the integral can be evaluated in closed form.

A natural choice for $N$ would be the polynomials defined on a hypercube, for which integration is trivial. However, as shown in the previous section, the hypothesis space formed by the polynomial class of kernels is finite-dimensional. For this reason, we propose the following class of kernels, based on the class of semi-separable kernel functions. As proposed in [10], a semi-separable kernel takes one form when its first argument is less than its second and a different form otherwise.

In this paper, we consider a definition of $N$ for which we recover a generalization of this class of semi-separable kernels. In this case, the scalar inequality is generalized to membership in a set which is defined by inequalities but does not itself represent an inequality. Roughly speaking, these sets will tessellate $X$ with tile corners defined by the data points $x_i$. Specifically, we use the partial order induced by the positive orthant, writing $z \le x$ if $x - z \in \mathbb{R}^n_+$. Because the ordering defined by the positive orthant is only partial, we replace the inequality with the set $S(x) := \{ z \in X : z \le x \}$ and its complement $S(x)^c := X \setminus S(x)$. Now, for any given set $S$, define the indicator function
$$I_S(z) = \begin{cases} 1, & z \in S, \\ 0, & \text{otherwise}. \end{cases}$$

Our proposed function $N(z, x)$ can now be compactly represented using these indicator functions together with $Z_d(z, x)$, which, recall, is the vector of monomials of degree $d$ or less. Now, given $x, y \in X$, the integrand $N(z, x)^T P\, N(z, y)$ takes a different polynomial form depending on which of the regions obtained by intersecting $S(x)$, $S(y)$, and their complements contains $z$; we refer to these intersections below.

Lemma 3.

The union of the regions defined above is $X$; furthermore, the regions are pairwise disjoint.

Proof.

Disjointness of most pairs is immediate from the definitions. For the remaining pair, membership in both regions would require a point $z$ to satisfy two mutually exclusive inequalities, which is a contradiction. The claim on the union holds by definition. ∎

Lemma 4.
Proof.

Since the sets are disjoint, the proof is straightforward.

If we partition the decision variable $P$ into blocks and expand the representation of $K$ in terms of these decision variables, we recover an explicit expression for the kernel. Note that while the second, third, and fourth terms of this expression analytically reduce to polynomials (and can be calculated a priori), the limits of integration in the first term depend on $x$ and $y$. This implies that the kernel function is piecewise polynomial, with pieces indexed by the componentwise ordering of $x$ and $y$, where each piece is a polynomial and therefore separable. This also implies that, unless the feature space is low-dimensional, it may not be reasonable to compute all of the pieces a priori. Instead, these may be computed ad hoc based on the training data. This issue will be discussed in the following section, wherein we discuss properties of the proposed class of kernel functions.

5 Properties of the tessellated class of kernel functions

Let us begin by recalling that, for any $N$ and any $P \succeq 0$,
$$K(x, y) = \int_X N(z, x)^T P\, N(z, y)\, dz$$
is a positive kernel, and that the associated hypothesis space is (the closure of) the span of the functions $K(x, \cdot)$. Now recall that, for the tessellated kernel, $N$ is built from the monomial vector $Z_d$ and the indicator functions of the sets $S(x)$ and $S(x)^c$. In this case, our hypothesis space consists of piecewise-polynomial functions.

Now let us consider the simplest case, where $d = 0$, there is a single feature, and $X$ is an interval. In this simplest case, the hypothesis space is simply the Sobolev space of continuous functions $f$ such that $f' \in L_2$ (which is dense in $L_2$). To see this, let $f$ with $f' \in L_2$ be arbitrary; choosing the representing function in the integral to match $f'$ and applying the fundamental theorem of calculus recovers $f$. Stokes' theorem implies that this argument can then be inductively extended to multiple dimensions. Thus the class of kernel functions proposed in this paper and defined by the positive orthant is a generalization of the Sobolev kernels implied by the fundamental theorem of calculus. This analysis also indicates that the density property does not depend on the number of terms in $Z_d$; that is, the hypothesis space is dense in $L_2$ even for $d = 0$. Moreover, unlike Gaussian kernels, with a hypothesis space characterized by the positive orthant in [18], the kernels themselves are not restricted to be pointwise positive.
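As an illustration of this simplest case, the following sketch evaluates a degree-zero kernel of the proposed integral form with a single feature, taking the two components of N to be the indicators of z <= x and z > x; the domain X = [0, 1] and the 2 x 2 matrix P are assumptions for illustration and are not claimed to match the paper's exact construction. With P = I the kernel reduces to min(x, y) + 1 - max(x, y) = 1 - |x - y|.

```python
import numpy as np

def kernel_d0(x, y, P):
    """Degree-zero kernel K(x, y) = int_0^1 N(z, x)^T P N(z, y) dz on X = [0, 1],
    with N(z, x) = [1(z <= x), 1(z > x)]; the four indicator products integrate in
    closed form to expressions in min(x, y) and max(x, y)."""
    mn, mx = min(x, y), max(x, y)
    return (P[0, 0] * mn              # int 1(z<=x) 1(z<=y) dz = min(x, y)
            + P[0, 1] * (x - mn)      # int 1(z<=x) 1(z> y) dz = x - min(x, y)
            + P[1, 0] * (y - mn)      # int 1(z> x) 1(z<=y) dz = y - min(x, y)
            + P[1, 1] * (1 - mx))     # int 1(z> x) 1(z> y) dz = 1 - max(x, y)

# Any P = A A^T is positive semidefinite, so the Gram matrix below is PSD as well.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
P = A @ A.T
xs = rng.random(20)
G = np.array([[kernel_d0(xi, xj, P) for xj in xs] for xi in xs])
print("min eigenvalue:", np.linalg.eigvalsh(G).min())   # >= 0 up to rounding
```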

Figure 1: The domains of definition of the piecewise-polynomial kernel function. Circles are training data.

From another perspective, the hypothesis space is spanned by the learned kernel evaluated at the training data, where, recall, the kernel is piecewise polynomial with pieces determined by the orthant ordering. In the simplest case, the elements of the hypothesis space are therefore defined separately on each tile determined by the training data. Then, to approximate any function with a given Lipschitz constant to a prescribed accuracy, one may take a suitably fine, finite grid of points intersected with the domain and define the approximating function inductively in each dimension; the resulting error bound indicates pointwise convergence. A similar argument can be made for approximation in the $L_2$-norm. Essentially, then, the training data tessellates the space, with the function defined separately on each tile and a new tile being defined for each training datum. This geometry is illustrated in Figure 1.

6 Implementation and complexity analysis

In this paper, we have proposed a new class of kernel functions defined by piecewise polynomials. For any choice of $N$, using the dual formulation in Eqn. (3), the problem of learning the kernel matrices can be formulated as an SDP. If $q$ is the length of $Z_d$ and $m$ is the number of training data, the complexity of the resulting SDP scales with $m$ approximately as shown in Figure 3, which is similar to the complexity of other methods such as the hyperkernel approach in [20]. These scaling results are for training data randomly generated by two standard 2-feature example problems (circle and spiral; see Figure 2) for several choices of the degree $d$, where $d$ defines the length of $Z_d$, the vector of all monomials in 2 variables of degree $d$ or less. Note that the length of $Z_d$ grows combinatorially with the degree $d$ and the number of features $n$.
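For reference, the number of monomials of degree at most d in n variables is the binomial coefficient C(n + d, d), so the size of P grows quickly with both d and n; if Z_d contains monomials in both z and x, the effective number of variables doubles. A small sketch of this count (an illustration, not tied to the paper's exact definition of Z_d):

```python
from math import comb

def n_monomials(n_vars: int, degree: int) -> int:
    # Number of monomials of degree <= `degree` in `n_vars` variables: C(n_vars + degree, degree).
    return comb(n_vars + degree, degree)

for d in (0, 1, 2, 3):
    # 2 features; if Z_d depends on both z and x, use n_vars = 4 instead.
    print(d, n_monomials(2, d), n_monomials(4, d))
```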

For a large number of features and high degree, the size of $P$ will become unmanageably large. Note, however, that, as indicated in the previous section, the hypothesis space is dense in $L_2$ even when $d = 0$; in this case, we have only 4 decision variables. Furthermore, in the case of large numbers of features, a random basis for the positive matrices can be selected. This basis can then be integrated directly into existing kernel learning methods such as SimpleMKL, as discussed in the following section.
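A minimal sketch of this randomized-basis idea: each randomly generated PSD matrix P_i defines one candidate Gram matrix, and the list of candidates is handed to an MKL solver (SimpleMKL in the paper). The kernel function used here is assumed to be one of the proposed form, e.g. the kernel_d0 sketch from Section 5; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_psd(q):
    # A random positive semidefinite matrix P = A A^T.
    A = rng.standard_normal((q, q))
    return A @ A.T

def candidate_gram_matrices(xs, kernel, n_kernels=300, q=2):
    """One Gram matrix per randomly generated P_i; `kernel(x, y, P)` is assumed to be a
    kernel of the proposed integral form (e.g. kernel_d0 above)."""
    grams = []
    for _ in range(n_kernels):
        P = random_psd(q)
        grams.append(np.array([[kernel(xi, xj, P) for xj in xs] for xi in xs]))
    return grams   # an MKL solver then learns nonnegative weights over these candidates

# Example usage (assuming kernel_d0 from the earlier sketch is in scope):
# Gs = candidate_gram_matrices(rng.random(50), kernel_d0, n_kernels=300)
```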

(a) Circle with T [n=50]
(b) Circle with S [n=50]
(c) Spiral with T [n=150]
(d) Spiral with S [n=150]
Figure 2: Discriminant surfaces for the circle and spiral separators using the tessellated kernel [T] compared with SimpleMKL [S]; the number of training data is indicated in each panel label.
(a) Complexity Scaling for Identification of Circle
(b) Complexity Scaling for Identification of Spiral
Figure 3: Log-log plot of computation time vs. the number of training data for 2-feature kernel learning.

7 Accuracy and comparison with existing methods

In this section, we evaluate the proposed class of kernels in isolation, combined with, and compared to the SimpleMKL [22] kernel learning algorithm. We use the soft-margin problem with the regularization parameter determined by 5-fold cross-validation and compare the following methods: a) for the tessellated kernel, the same parameter choice is used in all cases except Ionosphere; b) for SimpleMKL, we use the standard kernel selection of combined Gaussian and polynomial kernels, with bandwidths arbitrarily chosen between 0.5 and 10 and polynomial degrees one through three; c) to illustrate the effect of combining the proposed kernel with SimpleMKL, we randomly generated a sequence of 300 positive semidefinite matrices and used the resulting tessellated kernels as the SimpleMKL library of kernels; and d) we combined the SimpleMKL library of kernels from b) with the 300 randomly generated tessellated kernels from c).

(a) Average Test Set Accuracy on the Liver Dataset vs. Number of Training Data for Proposed Method Compared to SimpleMKL
(b) Semilog plot of residual error on generated 2D spiral data vs. number of training data for proposed method compared to SimpleMKL. Residual error is defined as 1-TSA where TSA is the Test Set Accuracy.

In all evaluations of Test Set Accuracy (TSA), the data is partitioned into 80% training data and 20% testing data, and this partition is repeated 30 times to obtain 30 sets of training and testing data. In Table 1, we report the average TSA for these four approaches as applied to several randomly selected benchmark data sets from the UCI Machine Learning Repository. In all cases, the tessellated kernel met or, in some cases, significantly exceeded the accuracy of SimpleMKL. Note in addition that, as was discussed in [15], the introduction of Gaussians into the tessellated SDP formulation (a) occasionally yields a slight improvement in accuracy.
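A sketch of this evaluation protocol for a precomputed kernel, assuming scikit-learn; the Gram matrix K_full would be built from whichever kernel is being evaluated (this illustrates the 30-repetition 80/20 protocol and is not the paper's code):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def average_tsa(K_full, y, n_repeats=30, C=1.0, seed=0):
    # K_full: precomputed m x m Gram matrix on the full data set; y: labels in {-1, +1}.
    idx = np.arange(len(y))
    accs = []
    for r in range(n_repeats):
        tr, te = train_test_split(idx, test_size=0.2, random_state=seed + r)
        clf = SVC(C=C, kernel="precomputed").fit(K_full[np.ix_(tr, tr)], y[tr])
        accs.append(clf.score(K_full[np.ix_(te, tr)], y[te]))
    return np.mean(accs), np.std(accs)   # average TSA and its standard deviation
```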

In addition to the standard battery of tests, we performed a secondary analysis to demonstrate the advantages of the tessellated kernel class when the ratio of training data to number of features is high. For this analysis, we use the liver data set (6 features) and the spiral discriminant [16] with 2 features (we also briefly examine the unit circle). For the spiral case, Figure 3(b) shows a semilog plot of the residual error (= 1 - TSA) as the number of training data increases, compared with SimpleMKL. The key feature observed in this plot is that the accuracy of the SimpleMKL method saturates when the number of training data is large. The tessellated kernel, however, continues to improve, ultimately yielding more than an order of magnitude improvement. Note, however, that such unlimited performance is possible only because the training data is analytically generated. By contrast, in Figure 3(a), we see convergence of both SimpleMKL and the tessellated kernel, although in this case the tessellated kernel is significantly more accurate. Finally, in Figure 2, we see the learned discriminant surfaces for the spiral and the circle as compared to the SimpleMKL surfaces.

Data Set     Method            Accuracy             Time               Data      Features
Liver        Tessellated       71.69 ± 4.43 *       93.26 ± 3.69       m = 346   n = 6
             SimpleMKL         65.51 ± 5.10          2.61 ± 0.42
             SimpleMKL Tess.   70.58 ± 4.69          8.37 ± 0.30
             Combined          70.53 ± 4.79         14.70 ± 0.76
Cancer       Tessellated       96.96 ± 1.38 *      755.21 ± 91.48      m = 684   n = 9
             SimpleMKL         96.55 ± 1.34         14.74 ± 1.33
             SimpleMKL Tess.   96.89 ± 1.43         45.84 ± 4.28
             Combined          96.89 ± 1.42         65.08 ± 10.52
Heart        Tessellated       83.52 ± 4.55        104.18 ± 4.54       m = 271   n = 13
             SimpleMKL         83.70 ± 4.77          3.09 ± 0.19
             SimpleMKL Tess.   84.38 ± 4.34 *       55.48 ± 2.67
             Combined          83.64 ± 4.54         13.23 ± 2.70
Pima         Tessellated       75.74 ± 3.43        1967.1 ± 64.30      m = 769   n = 8
             SimpleMKL         76.00 ± 3.33         19.04 ± 2.33
             SimpleMKL Tess.   76.75 ± 2.81 *       34.65 ± 23.28
             Combined          76.57 ± 2.72         96.20 ± 30.42
Ionosphere   Tessellated       92.07 ± 3.06          5.73 ± 0.20       m = 352   n = 34
             SimpleMKL         92.16 ± 2.78 *       26.24 ± 2.78
             SimpleMKL Tess.   87.65 ± 2.88          8.28 ± 0.16
             Combined          92.16 ± 2.78 *       50.77 ± 2.98

Table 1: TSA comparison for algorithms a), b), c), and d). For each data set, the average TSA, the standard deviation of the TSA, and the time to compute are shown; the maximum TSA for each data set is marked with an asterisk. Here m is the size of the data set and n is the number of features.

8 Conclusion

In this paper, we have proposed a new class of universal kernel functions. This class is a generalization of the Sobolev kernel and has a linear parametrization using positive matrices. The kernels have scalable complexity, and any instance is universal in the sense that the hypothesis space is dense in $L_2$, giving it comparable performance and properties to Gaussian kernels. However, unlike the Gaussian, the tessellated kernel does not require a set of bandwidths to be chosen a priori. We have demonstrated the effectiveness of the kernel on several datasets from the UCI repository. We have shown that the computational complexity is comparable to other SDP-based kernel learning methods. Furthermore, by using a randomized basis for the positive matrices, we have shown that the tessellated class can be readily integrated with existing multiple kernel learning algorithms such as SimpleMKL, yielding similar results with less computational complexity. In most cases, either the optimal tessellated kernel or the MKL-learned suboptimal tessellated kernel will outperform or match an MKL approach using Gaussian and polynomial kernels with respect to Test Set Accuracy. Furthermore, when the ratio of training data to number of features is high, the class of tessellated kernels shows almost unlimited potential for learning, as opposed to existing methods, which ultimately saturate. Finally, we note that this universal class of kernels can be trivially extended to matrix-valued kernels for use in, e.g., multi-task learning [3].

References

  • [1] F. Alizadeh, J.-P. Haeberly, and M. Overton. Primal-dual interior-point methods for semidefinite programming: convergence rates, stability and numerical results. SIAM Journal on Optimization, 8(3):746–768, 1998.
  • [2] MOSEK ApS. The MOSEK optimization toolbox for MATLAB manual. Version 7.1 (Revision 28)., 2015.
  • [3] A. Caponnetto, C. Micchelli, M. Pontil, and Y. Ying. Universal multi-task kernels. Journal of Machine Learning Research, 9(Jul):1615–1646, 2008.
  • [4] M. Collins and N. Duffy. Convolution kernels for natural language. In Advances in neural information processing systems, pages 625–632, 2002.
  • [5] C. Cortes, M. Mohri, and A. Rostamizadeh. Learning non-linear combinations of kernels. In Advances in neural information processing systems, pages 396–404, 2009.
  • [6] C. Cortes, M. Mohri, and A. Rostamizadeh. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13(March):795–828, 2012.
  • [7] E. Eskin, J. Weston, W. Noble, and C. Leslie. Mismatch string kernels for SVM protein classification. In Advances in neural information processing systems, pages 1441–1448, 2003.
  • [8] K. Gai, G. Chen, and C.-S. Zhang. Learning kernels with radiuses of minimum enclosing balls. In Advances in neural information processing systems, pages 649–657, 2010.
  • [9] T. Gärtner, P. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines, pages 129–143. 2003.
  • [10] I. Gohberg, S. Goldberg, and M. Kaashoek. Classes of linear operators, volume 63. Birkhäuser, 2013.
  • [11] M. Gönen and E. Alpaydin. Localized multiple kernel learning. In Proceedings of the International Conference on Machine learning, pages 352–359, 2008.
  • [12] M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268, 2011.
  • [13] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, University of California in Santa Cruz, July 1999.
  • [14] A. Jain, S. Vishwanathan, and M. Varma. SPG-GMKL: generalized multiple kernel learning with a million kernels. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, pages 750–758, 2012.
  • [15] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5(Jan):27–72, 2004.
  • [16] K. Lang. Learning to tell two spirals apart. In Proceedings of the Connectionist Models Summer School, 1988.
  • [17] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2(Feb):419–444, 2002.
  • [18] C. Micchelli. Algebraic aspects of interpolation. In Proceedings of Symposia in Applied Mathematics, volume 36, pages 81–102, 1986.
  • [19] C. Micchelli, Y. Xu, and H. Zhang. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651–2667, 2006.
  • [20] C. S. Ong, A. J. Smola, and R. C. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6(Jul):1043–1071, 2005.
  • [21] V. Paulsen and M. Raghupathi. An introduction to the theory of reproducing kernel Hilbert spaces, volume 152. Cambridge University Press, 2016.
  • [22] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9(Nov):2491–2521, 2008.
  • [23] B. Recht. Convex Modeling with Priors. PhD thesis, Massachusetts Institute of Technology, 2006.
  • [24] S. Sonnenburg, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder, C. Gehl, V. Franc, et al. The SHOGUN machine learning toolbox. Journal of Machine Learning Research, 11(Jun):1799–1802, 2010.
  • [25] N. Subrahmanya and Y. Shin. Sparse multiple kernel learning for signal processing applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):788–798, 2010.
  • [26] E. Zanaty and A. Afifi. Support vector machines (SVMs) with universal kernels. Applied Artificial Intelligence, 25(7):575–589, 2011.