Kernel methods represent a well-established learning paradigm that is able to capture the complex nonlinear patterns underlying data. In kernel methods, learning is implicitly performed in a high-dimensional (even infinite-dimensional) nonlinear feature space, called a Reproducing Kernel Hilbert Space (RKHS), via the kernel trick: a vector x in the input space is projected into the RKHS using a nonlinear mapping φ, and linear learning is then performed on the nonlinear features φ(x). The mapping φ is implicitly characterized by the kernel function associated with the RKHS, so that inner products in the RKHS can be evaluated without ever computing φ explicitly.
Classical kernel methods perform single-layer feature learning: the input x is transformed into features φ(x), which are used as the final features on which learning is carried out. Motivated by the recent success of deep neural networks, which perform hierarchical ("deep") representation learning that is able to capture low-level, mid-level and high-level features, we aim to study "deep" kernel methods that learn several layers of stacked nonlinear transformations.
Specifically, we propose a Stacked Kernel Network (SKN) that interleaves several layers of nonlinear transformations and linear mappings. Starting from the input x, a nonlinear feature map φ_1 is applied to project x into a D_1-dimensional (D_1 could be infinite) RKHS H_1. Then we use a linear mapping W_1 to project φ_1(x) into a d_1-dimensional linear space: h_1 = W_1 φ_1(x). Let w_j be the j-th row vector of W_1; according to the definition of an RKHS, w_j · φ_1(x) can be computed as f_j(x), where f_j is a function in this RKHS. To this end, the representation h_1 of x can be written as h_1 = (f_1(x), …, f_{d_1}(x)), where the j-th element of h_1 is f_j(x). Then, treating h_1 as input, we apply the above procedure again: projecting h_1 into another RKHS H_2, followed by a linear projection into another linear space, yielding the representation h_2. Repeating this process L times, we obtain an SKN with L hidden layers. An SKN contains multiple layers of nonlinear representations which could be infinite-dimensional. This grants the SKN vast representational power to capture the complex patterns behind data. On the other hand, after each nonlinear layer a linear mapping is applied to confine the size of the model, so that the model capacity does not get out of control.
Figure 1 shows the architecture of SKN. Similar to a deep neural network, it contains multiple hidden layers. The striking difference is that in SKN each hidden unit is parameterized by an RKHS function, while in a DNN the units are parameterized by weight vectors. The RKHS functions could be infinite-dimensional, which makes them arguably more expressive than finite-dimensional vectors. As a result, SKN could possess more representational power than DNN.
Given the multiple layers of RKHS functions in an SKN, learning them is very challenging. First, we need to seek explicit representations of these functions. We propose three ways. First, motivated by the representer theorem, we parameterize an RKHS function f as a linear combination of kernel functions anchored over the training data {x_i}: f(x) = Σ_i α_i k(x_i, x). Second, we shrink the domain of the functions from the entire RKHS to the image of the nonlinear feature map φ: each function takes the form f(x) = k(w, x) and is parameterized by a learnable vector w. Third, gaining insight from random Fourier features, an RKHS function can be approximated as f(x) ≈ wᵀz(x), where z(x) is the random Fourier feature transformation of x and w is a parameter vector. We use the backpropagation algorithm to learn SKNs under these three representations. Evaluations on various datasets demonstrate the effectiveness of SKN.
The major contributions of this paper are:
We propose Stacked Kernel Network, which learns multiple layers of RKHS representations.
We study three ways of representing the RKHS functions in SKN, to address the computational issues of learning RKHS functions in high-dimensional, even infinite-dimensional, spaces.
We design a convolutional architecture of SKN, called the Stacked Kernel Convolutional Network (SKCN), to solve visual recognition tasks.
We demonstrate the effectiveness of SKN and SKCN in experiments.
The rest of the paper is organized as follows. Section 2 introduces the Stacked Kernel Network and Section 3 presents experimental results. Section 4 reviews related work and Section 5 concludes the paper.
In this section, we introduce the architecture of the Stacked Kernel Network (SKN) and three ways of representing the RKHS functions in SKN.
2.1 Kernel methods
Kernel methods perform learning in a reproducing kernel Hilbert space (RKHS) of functions. This RKHS represents a high-dimensional feature space that is more expressive than the input space, since it is able to capture the nonlinear patterns behind data. The RKHS is associated with a kernel function, and inner products in the RKHS can be computed by evaluating the kernel in the lower-dimensional input space (known as the kernel trick). Well-established kernel methods include the support vector machine, kernel principal component analysis, kernel independent component analysis and Gaussian processes, to name a few.
Kernel methods feature three prominent concepts: (1) a feature map φ that maps a data point x in the input space to an element φ(x) of an inner product space (the feature vector); (2) a kernel k that takes two data points and returns a real number; (3) an RKHS H, which is a Hilbert space of functions. Their relations are as follows. First, a feature map defines a kernel: let φ be a feature map; then k(x, x') = ⟨φ(x), φ(x')⟩ is a kernel. Second, a kernel defines feature maps: for every kernel k, there exists a Hilbert space H and a feature map φ such that k(x, x') = ⟨φ(x), φ(x')⟩. Third, an RKHS defines a kernel: every RKHS H of functions defines a unique kernel k, called the reproducing kernel of H. Fourth, a kernel defines an RKHS: for every kernel k, there exists a unique RKHS with reproducing kernel k.
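To make the feature-map/kernel correspondence concrete, the following sketch (in Python with NumPy; the particular kernel and inputs are illustrative, not part of the method) checks that the explicit feature map of the degree-2 homogeneous polynomial kernel reproduces the kernel evaluation:

```python
import numpy as np

def poly2_feature_map(x):
    # Explicit feature map for the degree-2 polynomial kernel
    # k(x, y) = (x . y)^2 on R^2: phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def poly2_kernel(x, y):
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

lhs = np.dot(poly2_feature_map(x), poly2_feature_map(y))  # inner product in feature space
rhs = poly2_kernel(x, y)                                  # kernel evaluation in input space
assert np.isclose(lhs, rhs)
```

The same identity holds for any kernel, except that for kernels like the RBF the feature space is infinite-dimensional, which is exactly why the kernel trick matters.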
2.2 Deep neural networks
We briefly review deep neural networks (DNNs), which inspire us to construct the Stacked Kernel Network. A DNN contains one input layer, several hidden layers and one output layer. Each layer has a set of units. The units in adjacent layers are inter-connected, with each connection associated with a weight parameter. To achieve nonlinearity, a nonlinear activation function is applied at each hidden unit. Commonly used activation functions include sigmoid, tanh and rectified linear.
2.3 Stacked Kernel Network
Classic kernel methods learn a single layer of nonlinear features, which may not be expressive enough to accommodate complex data. Inspired by the hierarchical representation learning in deep neural networks, we propose the Stacked Kernel Network, which learns multiple layers of nonlinear features based on RKHSs. Figure 1 shows the architecture of SKN. Similar to a DNN, it consists of an input layer, an output layer and L hidden layers. Each hidden layer is associated with an RKHS, and the hidden units therein are parameterized by functions in this RKHS. This is the key difference from a DNN, where the hidden units are parameterized by weight vectors. Next, we present the detailed construction procedure of SKN. We start by defining the first hidden layer. Let x be the input feature vector. We pick d_1 functions f_1, …, f_{d_1} from an RKHS H_1 to map x into a d_1-dimensional latent space, where the j-th dimension is f_j(x). Then, on top of the first hidden layer, we can use functions from another RKHS H_2 to define the second hidden layer. Repeating this process L times, an SKN with L hidden layers characterized by functions from L RKHSs is obtained. The L-th hidden layer is utilized to produce the outputs.
By stacking multiple RKHSs together, the SKN is highly expressive. To better understand this, we present an equivalent architecture in Figure 2. Recall that an RKHS function f is equivalently defined as follows: given an input vector x, first transform it into φ(x) using the reproducing kernel feature map φ; then f(x) can be defined as f(x) = wᵀφ(x), where w contains linear coefficients. Based upon this definition, the SKN can be represented by interleaving nonlinear projections and linear projections: given the latent representation h_l at layer l, we first transform it into φ_{l+1}(h_l) using the nonlinear feature map φ_{l+1}, then use a linear projection matrix W_{l+1} to map it into the latent representation at layer l+1: h_{l+1} = W_{l+1} φ_{l+1}(h_l). The nonlinear features φ_{l+1}(h_l) could be infinite-dimensional. An SKN contains L layers of such features, which lead to substantial representational power. In between two adjacent layers of nonlinear features, a layer of finite-dimensional linear features is placed. This ensures the size of the SKN is properly controlled.
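The interleaving of nonlinear feature maps and linear projections can be sketched as follows. As a stand-in for the (possibly infinite-dimensional) map φ, we use a finite explicit feature map of a degree-2 polynomial kernel (all pairwise products of coordinates); the widths and random weights are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_map(h):
    # Finite stand-in for the RKHS feature map phi: the explicit map of a
    # degree-2 polynomial kernel, i.e. all pairwise products of coordinates.
    return np.outer(h, h).ravel()

def skn_forward(x, weights):
    # Interleave nonlinear feature maps and linear projections:
    # h_l = W_l phi_l(h_{l-1}).
    h = x
    for W in weights:
        h = W @ feature_map(h)
    return h

d = 4  # width of each linear layer (hypothetical)
x = rng.standard_normal(3)
weights = [rng.standard_normal((d, 3 * 3)),   # layer 1: phi maps R^3 -> R^9
           rng.standard_normal((d, d * d))]   # layer 2: phi maps R^4 -> R^16
h = skn_forward(x, weights)
assert h.shape == (d,)
```

Each linear projection brings the representation back to a small, fixed width, which is what keeps the model size under control between nonlinear layers.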
2.4 Representing RKHS functions in SKN
An SKN is parameterized by L layers of RKHS functions. While expressive, these functions present great challenges for learning. Unlike the weights in a DNN, which are naturally represented as finite-dimensional vectors, the RKHS functions can be infinite-dimensional, making their storage and computation troublesome. To address this issue, we first seek explicit representations of these functions that facilitate learning. We investigate three ways: a data-dependent nonparametric representation, a data-independent parametric representation and an approximate representation based on random Fourier features.
2.4.1 Nonparametric representation
In kernel methods, the most common way to represent an RKHS function is based on the representer theorem: given a regularized risk functional c({(x_i, y_i, f(x_i))}_{i=1}^N) + g(‖f‖), where {(x_i, y_i)} are the training data, ‖f‖ denotes the Hilbert norm of f and g is an increasing function, the minimizer of this functional admits the following form:

f(x) = Σ_{i=1}^N α_i k(x_i, x),

where k is the kernel function associated with the RKHS and {α_i} are coefficients. Since this representation depends on the data, it is referred to as the nonparametric representation.
Specifically, given the hidden states {h_l^{(i)}}_{i=1}^N of all N training samples at layer l, the RKHS function associated with the j-th hidden unit at layer l+1 is defined as:

f_j(h) = Σ_{i=1}^N α_{ij} k_{l+1}(h_l^{(i)}, h),

where k_{l+1} is the kernel function associated with the RKHS H_{l+1}. For a data sample x, the activation value of hidden unit j at layer l+1 is f_j(h_l). Figure 3 shows the SKN architecture under the nonparametric representation. The inputs of each hidden unit at layer l+1 are the representations of all training samples at layer l, creating a huge network.
The advantage of the nonparametric representation is that the function set contains the globally optimal solution, though this optimum may not be reached due to the non-convexity of SKN. The drawback is that the number of parameters grows linearly with N, which is not scalable to large datasets.
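A single nonparametric layer can be sketched as below, assuming an RBF kernel; the data sizes, widths and coefficient values are illustrative. The layer output is simply the kernel matrix between the current batch and the stored training representations, multiplied by the coefficient matrix:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # k(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs of rows.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def nonparametric_layer(H, H_train, alpha):
    # Hidden unit j computes f_j(h) = sum_i alpha[i, j] k(h_train_i, h),
    # so the whole layer is the kernel matrix times the coefficients.
    return rbf_kernel(H, H_train) @ alpha

rng = np.random.default_rng(0)
N, d_in, d_out = 50, 8, 5                   # training-set size and layer widths (hypothetical)
H_train = rng.standard_normal((N, d_in))    # hidden states of all training samples at layer l
alpha = rng.standard_normal((N, d_out))     # learnable coefficients, one column per unit
H = rng.standard_normal((3, d_in))          # a mini-batch
out = nonparametric_layer(H, H_train, alpha)
assert out.shape == (3, d_out)
```

Note that the coefficient matrix has N rows, which is why the parameter count of this representation grows linearly with the training-set size.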
2.4.2 Parametric representation
In light of the large computational complexity of the nonparametric representation, we investigate a parametric counterpart. The basic idea is: instead of searching for the optimal solution in the entire RKHS, we restrict the learning to a subset of the RKHS whose functions have convenient parametric forms. Specifically, we choose the subset to be the image of the reproducing kernel map:

{f = k(w, ·) : w},

where k is the kernel function associated with the RKHS and w is a learnable parameter vector, initialized using k-means clustering on the input samples and trained by gradient descent. Specifically, the RKHS function associated with the j-th hidden unit in layer l+1 is:

f_j(h) = k_{l+1}(w_j, h),

where w_j is the parameter vector of this function. Given the hidden states h_l at layer l, the activation value of this unit is k_{l+1}(w_j, h_l).
The advantage of the parametric representation is that it is independent of the data. The disadvantage is that the optimal solution may not be contained in this subset. Note that if we choose k to be the radial basis function (RBF) kernel, then the parametric SKN specializes to a deep RBF network.
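As a concrete sketch of the parametric representation with an RBF kernel (in which case the layer is exactly a deep-RBF-network layer), with illustrative widths and randomly initialized centers in place of k-means ones:

```python
import numpy as np

def parametric_rbf_layer(H, W, gamma=0.5):
    # Unit j computes f_j(h) = k(w_j, h) = exp(-gamma * ||w_j - h||^2):
    # the learnable parameters are the centers W, one row per hidden unit.
    sq = (H**2).sum(1)[:, None] + (W**2).sum(1)[None, :] - 2 * H @ W.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
d_in, d_out = 8, 5                        # layer widths (hypothetical)
W = rng.standard_normal((d_out, d_in))    # one learnable vector w_j per hidden unit
H = rng.standard_normal((3, d_in))        # a mini-batch of hidden states
out = parametric_rbf_layer(H, W)
assert out.shape == (3, d_out)
```

Unlike the nonparametric layer, the parameter count here depends only on the layer widths, not on the training-set size.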
|  | Nonparametric | Parametric | RFF |
| --- | --- | --- | --- |
| Contains optimum | Yes | May not | No |
| Depends on data | Yes | No | No |
| Subset of RKHS | Yes | Yes | No |
2.4.3 Random Fourier feature representation
In addition to the nonparametric and parametric representations, we also investigate an approximate representation based on random Fourier features (RFFs). Given a shift-invariant kernel k(x, x') = κ(x − x'), it can be approximated with RFFs:

k(x, x') ≈ z(x)ᵀz(x'),

where z(·) is the RFF transformation. z(·) is generated in the following way: (1) compute the Fourier transform p(ω) of the kernel κ; (2) draw D i.i.d. samples ω_1, …, ω_D from p(ω) and D i.i.d. samples b_1, …, b_D from the uniform distribution on [0, 2π]; (3) let z(x) = √(2/D) (cos(ω_1ᵀx + b_1), …, cos(ω_Dᵀx + b_D))ᵀ.
Given an RKHS function f, we approximate it using RFFs: f(x) ≈ wᵀz(x), where w is a weight vector.
The RKHS function associated with the j-th hidden unit in layer l+1 is then f_j(h) = w_jᵀz(h), where w_j is the parameter vector of this function. Given the hidden states h_l at layer l, the activation value of this unit is w_jᵀz(h_l).
Figure 4 shows the SKN architecture under the RFF representation, where multiple layers of RFF transformations and linear projections are interleaved. The weight parameters {ω_i} in the RFF transformation are sampled from the Fourier transform of the kernel and are fixed during training. The learnable parameters are the weights {w_j}.
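The RFF construction and its kernel approximation can be sketched as follows, assuming the RBF kernel exp(-γ‖x − y‖²), whose Fourier transform is a Gaussian; the dimensions and the error tolerance are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rff(d_in, D, gamma=0.5):
    # For the RBF kernel exp(-gamma ||x - y||^2), the Fourier transform is
    # Gaussian, so omega ~ N(0, 2*gamma*I); b ~ Uniform[0, 2*pi].
    omega = rng.normal(scale=np.sqrt(2 * gamma), size=(d_in, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return lambda X: np.sqrt(2.0 / D) * np.cos(X @ omega + b)

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

d_in, D = 5, 4000
z = make_rff(d_in, D)
x, y = rng.standard_normal(d_in), rng.standard_normal(d_in)
approx = float(z(x[None, :]) @ z(y[None, :]).T)  # z(x)^T z(y)
exact = rbf_kernel(x, y)
assert abs(approx - exact) < 0.1                 # error shrinks as D grows
```

With the map z fixed, a hidden unit is just the linear form w_jᵀz(h), so an RFF layer is an ordinary fixed nonlinear transformation followed by a learnable linear layer.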
2.4.4 Comparison of three representations
Table 1 presents a comparison of the three representations. Functions under the nonparametric representation contain the globally optimal function in the RKHS; functions under the random Fourier feature representation only approximate functions in the RKHS. While the nonparametric representation contains the globally optimal solution, its number of parameters grows with the training-set size N. The parametric and RFF representations are computationally efficient; however, they are unlikely to reach the global optimum. The number of parameters in all three representations grows with the number of layers and the number of units in each layer. For each hidden unit, the parametric representation has as many parameters as the layer's input dimension and the RFF representation has D parameters (the dimension of the random features), both of which are much smaller than N.
2.5 Stacked Kernel Convolutional Network
In order to apply SKN to visual recognition problems, we aim to extend the plain Stacked Kernel Network to an advanced architecture with strong representational ability for visual recognition tasks. Inspired by the recent success of the Convolutional Neural Network (CNN), which extracts features patch-wise from the input feature space through the convolution operation and obtains strong representations of the corresponding local areas, we design a convolutional architecture for SKN, called the Stacked Kernel Convolutional Network (SKCN).
Specifically, as shown in Figure 5(a), starting from the first kernel convolutional layer, let X be the input feature map; each patch extracted from X is denoted as p. Similar to SKN, a nonlinear feature map φ_1 is first applied to project p into a D_1-dimensional RKHS H_1; then a linear mapping is used to project φ_1(p) into a d_1-dimensional linear space. According to the definition of an RKHS, we pick d_1 functions from H_1 to parameterize the hidden units (Figure 5(b)). To this end, the representation of p can be written as (f_1(p), …, f_{d_1}(p)), where the j-th element is f_j(p). Then, on top of the first kernel convolutional layer, we can use functions from another RKHS to define the second layer. Repeating this process L times, an SKCN with L kernel convolutional layers characterized by functions from L RKHSs is obtained. The L-th kernel convolutional layer is utilized to produce the outputs.
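A single kernel convolutional layer under the parametric RBF representation can be sketched as below; the image size, patch size, stride (1) and number of units are illustrative, and a single channel is assumed for simplicity:

```python
import numpy as np

def kernel_conv_layer(img, W, gamma=0.5, ksize=3):
    # Slide a ksize x ksize window over a single-channel image; each hidden
    # unit j responds to patch p with k(w_j, p) = exp(-gamma * ||w_j - p||^2).
    H, Wd = img.shape
    out_h, out_w = H - ksize + 1, Wd - ksize + 1
    out = np.empty((out_h, out_w, W.shape[0]))
    for i in range(out_h):
        for j in range(out_w):
            p = img[i:i + ksize, j:j + ksize].ravel()
            out[i, j] = np.exp(-gamma * np.sum((W - p) ** 2, axis=1))
    return out

rng = np.random.default_rng(0)
n_units = 6                               # hidden units per patch (hypothetical)
W = rng.standard_normal((n_units, 9))     # one learnable vector per unit, flattened 3x3
img = rng.standard_normal((8, 8))
fmap = kernel_conv_layer(img, W)
assert fmap.shape == (6, 6, n_units)
```

Replacing the per-patch kernel evaluation with an RFF transform followed by a linear map would give the corresponding RFF-SKCN layer.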
In the following experiments, we apply the parametric representation and the random Fourier feature representation to the SKCN architecture, yielding P-SKCN and RFF-SKCN respectively.
[Table 2: classification accuracy of DNN and the SKN variants; columns: Dataset, RFF-SKN (RBF), P-SKN (RBF), NP-SKN (RBF), P-SKN (poly), NP-SKN (poly)]
3 Experiments
In this section, we present experimental results, in which we observe that (1) "deep" SKN/SKCN outperforms single-layer SKN/SKCN, and (2) SKN outperforms DNNs and SKCN outperforms CNNs.
3.1 Datasets
We compare SKN with DNN on six datasets. (1) MNIST. It contains images of handwritten digits, represented as 784-dimensional vectors of raw pixels. The training set has 60,000 images and the test set has 10,000 images. (2) ImageNet-10. This dataset contains 10 categories of ImageNet images, each category with 600 images. Images are represented with 128-dimensional convolutional neural network features extracted from a pre-trained ConvNet model. (3) PenDigits. This is a multi-class dataset with 16 integer attributes and 10 classes, created by collecting 250 samples from each of 44 writers. (4) SatImage. It is generated from Landsat multi-spectral scanner image data, containing 4435 training samples and 2000 test samples belonging to 6 classes. The feature dimension is 36. (5) Segment. The data examples were drawn randomly from a database of 7 outdoor image categories; the images were hand-segmented to create a label for every pixel. (6) Vowel. This dataset consists of 11 classes, each with 90 10-dimensional samples.
We compare SKCN with CNN on two datasets. (1) MNIST. It is the same MNIST dataset mentioned above; here we reshape the 784-dimensional vectors into 28×28 images with 1 grayscale channel. (2) CIFAR-10. It consists of 32×32 images, each with 3 color channels (RGB), from 10 classes. The training and test sets contain 50,000 and 10,000 images respectively.
3.2 Experimental setup
For SKN, we experimented with three representations of the RKHS functions: the nonparametric representation (NP-SKN), the parametric representation (P-SKN) and the random Fourier feature representation (RFF-SKN), and two kernel functions: the radial basis function (RBF) kernel and the polynomial kernel (poly). The polynomial kernel is not applicable in RFF-SKN since it is not shift-invariant. The scale parameter of the RBF kernel was tuned using 5-fold cross validation. The two parameters of the polynomial kernel were tuned by grid search (with intervals of 0.5 and 1, respectively). We compare with deep neural networks where the activation function is set to rectified linear (relu), using the same hyperparameters throughout. SKN was trained by stochastic gradient descent on the cross-entropy loss; the batch size was set to 128 and the learning rate was held fixed. For RFF-SKN, the number of random Fourier features was set to 5000. We implemented SKN using TensorFlow. These experiments were performed on Linux machines with 32 4.0GHz CPU cores and 256GB RAM.
For SKCN, we experimented with two representations of the RKHS functions: the parametric representation (P-SKCN) and the random Fourier feature representation (RFF-SKCN). For P-SKCN we use the polynomial kernel, and for RFF-SKCN we use the RBF kernel. To trade off runtime against accuracy, we set the number of random Fourier features in RFF-SKCN to 2000. The parameters of the polynomial kernel and the scale parameter of the RBF kernel were tuned in the same way as in P-SKN. We compare our methods with CNN under the same hyperparameter settings throughout. For instance, in Table 3 we configure both CNN and SKCN as follows: (1) network architecture: conv1-pool1-norm1 (layer 1), conv2-norm2-pool2 (layer 2), conv3-norm3-pool3 (layer 3), conv4-norm4-pool4 (layer 4); (2) kernel size: 5×5 (layer 1), 5×5 (layer 2), 3×3 (layer 3), 3×3 (layer 4); (3) pooling: 2×2 max-pooling; (4) dropout: with 50% or without; (5) data augmentation: none. The only setting that differs between CNN and RFF-SKCN is the activation function: CNN (relu), RFF-SKCN (none). In the comparison experiments, we use stochastic gradient descent to minimize the cross-entropy loss with a batch size of 128 and an adaptive learning rate that decays in a staircase fashion. We use the local response normalization provided in TensorFlow to normalize the outputs of each layer. These experiments were performed on Linux machines with Tesla K80 GPUs and 256GB RAM.
Table 2 shows the classification accuracy of DNN and SKN with different representations of the RKHS functions and different kernel functions. The accuracy of NP-SKN with more than 2 layers is not available since the model is too large and takes too much time to converge. From this table, we make the following observations. (1) P-SKN (with the polynomial kernel or the RBF kernel) outperforms DNN. For instance, on Vowel, P-SKN with the RBF kernel achieves an accuracy of 65.43% while DNN achieves 61.75%. The hidden units in SKN are parameterized by infinite-dimensional RKHS functions, which are more expressive than the finite-dimensional weight vectors in DNN. Thus SKN is more capable of capturing the complex patterns behind data and achieves higher classification accuracy. (2) In most cases, SKN with multiple hidden layers outperforms SKN with a single hidden layer. For example, on the Segment dataset, P-SKN (poly) with 3 layers achieves an accuracy of 99.02% while the 1-layer version achieves 97.80%. This demonstrates that a "deep" SKN is better than a single-layer kernel method: stacking multiple layers of RKHS functions improves the representational power of kernel methods and results in better performance. However, the number of layers cannot be too large, as that leads to overfitting and hurts generalization performance on the test set. For instance, on the SatImage dataset, P-SKN (RBF) with 4 layers performs worse than that with 2 layers. (3) P-SKN performs better than NP-SKN and RFF-SKN. For NP-SKN, the number of parameters grows linearly with the data size, which easily leads to overfitting. For RFF-SKN, the RFF-represented functions are not true functions in the RKHS, but rather approximations, and hence suffer an approximation error (overall, though, its performance is very close to that of P-SKN). In contrast, the functions under the parametric representation are exactly from the RKHS and their number of parameters does not depend on the data size.
In addition, we design an experiment to examine the sensitivity of SKNs and DNNs to network width. Figure 7 shows how accuracy changes with respect to the number of hidden units on the MNIST dataset. For DNN, we experimented with different activation functions, including relu, relu6, softplus, softsign, tanh and sigmoid. As can be seen, overall, SKNs are not sensitive to the number of hidden units: the accuracy remains stable as it increases from 10 to 2500. In contrast, DNNs are very sensitive to it. For each DNN variant, the best accuracy is achieved at an intermediate number of hidden units: a smaller network is not expressive enough and a larger one leads to overfitting.
Table 3 lists the experimental results of CNN and SKCN. We make two major observations from this table. (1) RFF-SKCN outperforms CNN: SKCN achieves higher accuracy on most datasets with the same number of layers. For instance, with 3 layers on the CIFAR-10 dataset, RFF-SKCN achieves 79.06% with dropout and 77.77% without, while CNN only attains 74.79% and 73.24% respectively. (2) In most cases, SKCN with multiple layers outperforms SKCN with a single layer.
Next, we show the detailed performance of CNN and RFF-SKCN with different numbers of layers on the CIFAR-10 dataset in Figures 6(a)-6(d). First, from Figure 6(a) and Figure 6(b) we observe that RFF-SKCN exhibits considerably higher test accuracy and generalizes better than the baseline CNN architecture, which indicates that the infinite-dimensional RKHS functions are more expressive than the finite-dimensional filters in CNN. Second, although RFF-SKCN outperforms CNN, from Figure 6(c) and Figure 6(d) we observe that RFF-SKCN takes nearly twice as long to run a single iteration. This is mainly because the RFF representation needs to train a larger weight matrix than CNN; as a result, the learned parameters in RFF-SKCN can be more expressive than those in CNN.
4 Related work
Several studies have been performed to bridge kernel methods and deep learning. One line of work aims to define a "deep" kernel by first successively composing the same nonlinear transformation φ multiple times over the input x, i.e. φ(φ(⋯φ(x))), then defining the kernel as the inner product of such compositions. Using the kernel trick, these compositions can be replaced with kernel functions. However, if the representation is used not for defining a kernel but for making predictions (which is the case in our work), it is unclear how to deal with these implicit compositions. Another study proposes local deep kernel learning, where the local feature space is represented with a hierarchy of nonlinear transformations guided by a tree structure. A third approach uses a deep neural network (DNN) to define a kernel function: given two input data points x and x', they are first fed into a DNN g, generating latent representations g(x) and g(x'), which are then fed into a kernel; overall, the kernel is defined as k(g(x), g(x')). The major difference between these works and ours is that they utilize a DNN or nested nonlinear transformations to define a single kernel function, while our work defines a network with multiple layers of RKHS functions. Deep Gaussian processes form a generative model that stacks multiple layers of Gaussian processes, while our work is a discriminative model that stacks multiple layers of RKHS functions. Convolutional kernel networks use a convolutional neural network to approximate the kernel map, while our work uses random Fourier features as building blocks of deep networks.
5 Conclusions
In this paper, we propose a "deep" kernel method, the Stacked Kernel Network, which learns a hierarchy of RKHS functions. SKN consists of multiple layers of interleaved nonlinear and linear projections. The nonlinear projection is carried out by the reproducing kernel feature map associated with an RKHS, and the resultant features can be infinite-dimensional. An SKN is equipped with multiple such feature maps, which bring high representational power. To keep the model size of SKN under control, immediately after each nonlinear transformation a linear projection maps the infinite-dimensional nonlinear space to a finite-dimensional linear space. In the end, an SKN is composed of multiple hidden layers, each associated with an RKHS, and each unit therein is parameterized by a function in that RKHS. We investigate three ways to represent the RKHS functions to make their learning tractable: a data-dependent nonparametric representation based on the representer theorem, a data-independent parametric representation and an approximate representation based on random Fourier features. Experiments on various datasets demonstrate the effectiveness of SKN.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
-  F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of machine learning research, 3(Jul):1–48, 2002.
-  Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
-  J. Bergstra, G. Desjardins, P. Lamblin, and Y. Bengio. Quadratic polynomials learn better image features. Technical Report 1337, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, 2009.
-  C. J. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
-  M. E. Celebi, H. A. Kingravi, and P. A. Vela. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert systems with applications, 40(1):200–210, 2013.
-  K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv, 2014.
-  Y. Cho and L. K. Saul. Kernel methods for deep learning. In Advances in neural information processing systems, pages 342–350, 2009.
-  A. Damianou and N. Lawrence. Deep gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
-  X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Aistats, volume 15, page 275, 2011.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
-  A. Krizhevsky and G. Hinton. Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 40, 2010.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems, pages 2627–2635, 2014.
-  V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
-  M. J. Orr et al. Introduction to radial basis function networks, 1996.
-  A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2007.
-  C. E. Rasmussen. Gaussian processes for machine learning. 2006.
-  B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer, 2001.
-  B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583–588. Springer, 1997.
-  B. Schölkopf and A. J. Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT press, 2002.
-  J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge university press, 2004.
-  M. Varma. Local deep kernel learning for efficient non-linear svm prediction.
-  L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th international conference on machine learning (ICML-13), pages 1058–1066, 2013.
-  A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. arXiv preprint arXiv:1511.02222, 2015.