1 Introduction
Kernel methods [21, 22] represent a well-established learning paradigm that is able to capture the complex nonlinear patterns underlying data. In kernel methods, learning is implicitly performed in a high-dimensional (even infinite-dimensional) nonlinear feature space, called a Reproducing Kernel Hilbert Space (RKHS) [21]: a vector $x$ in the input space is projected into the RKHS using a nonlinear mapping $\phi$, and linear learning is then performed on the nonlinear features $\phi(x)$. The mapping $\phi$ is implicitly characterized by a kernel function $k(x, x') = \langle \phi(x), \phi(x') \rangle$ associated with the RKHS, which is known as the kernel trick [21].
Classical kernel methods perform single-layer feature learning: the input $x$ is transformed into $\phi(x)$, which is used as the final feature representation on which learning is carried out. Motivated by the recent success of deep neural networks, which perform hierarchical ("deep") representation learning [3] that captures low-level, middle-level and high-level features, we aim to study "deep" kernel methods that learn several layers of stacked nonlinear transformations.
Specifically, we propose a Stacked Kernel Network (SKN) that interleaves several layers of nonlinear transformations and linear mappings. Starting from the input $x$, a nonlinear feature map $\phi^{(1)}$ is applied to project $x$ into a $d_1$-dimensional ($d_1$ could be infinite) RKHS $\mathcal{H}_1$. Then we use a linear mapping $W^{(1)}$ to project $\phi^{(1)}(x)$ into a $p_1$-dimensional linear space: $h^{(1)} = W^{(1)} \phi^{(1)}(x)$. Let $w_j^{(1)}$ be the $j$th row vector of $W^{(1)}$; according to the definition of RKHS, $\langle w_j^{(1)}, \phi^{(1)}(x) \rangle$ can be computed as $f_j^{(1)}(x)$, where $f_j^{(1)}$ is a function in this RKHS. To this end, the representation of $x$ in the linear space can be written as $h^{(1)}$, where the $j$th element of $h^{(1)}$ is $f_j^{(1)}(x)$. Then, treating $h^{(1)}$ as input, we apply the above procedure again: projecting $h^{(1)}$ into another RKHS $\mathcal{H}_2$, followed by a linear projection into another linear space, yielding the representation $h^{(2)}$. Repeating this process $L$ times, we obtain a SKN with $L$ hidden layers. SKN contains multiple layers of nonlinear representations, which could be infinite-dimensional. This grants SKN vast representational power to capture the complex patterns behind data. On the other hand, after each nonlinear layer, a linear mapping is applied to confine the size of the model, so that the model capacity does not get out of control.
Figure 1 shows the architecture of SKN. Similar to a deep neural network, it contains multiple hidden layers. The striking difference is that in SKN each hidden unit is parameterized by an RKHS function, while in a DNN the units are parameterized by weight vectors. The RKHS functions could be infinite-dimensional and are arguably more expressive than finite-dimensional vectors. As a result, SKN could possess more representational power than DNN.
Given the multiple layers of RKHS functions in SKN, learning them is very challenging. First of all, we need to seek explicit representations of these functions. We propose three ways. First, motivated by the representer theorem [21], we parameterize an RKHS function $f$ as a linear combination of kernel functions anchored over the training data $\{x_i\}_{i=1}^n$: $f(\cdot) = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$. Second, we shrink the domain of the functions from the entire RKHS to the image of the nonlinear feature map [21]: $f(\cdot) = k(w, \cdot)$, where each function is parameterized by a learnable vector $w$. Third, gaining insight from random Fourier features [17], an RKHS function can be approximated with $f(x) \approx w^\top z(x)$, where $z(x)$ is the random Fourier feature transformation of $x$ and $w$ is a parameter vector. We use the backpropagation algorithm to learn SKNs under these three representations. Evaluations on various datasets demonstrate the effectiveness of SKN.
The major contributions of this paper are:


We propose Stacked Kernel Network, which learns multiple layers of RKHS representations.

We study three ways of representing the RKHS functions in SKN, to address the computational issue of learning RKHS functions that are high-dimensional, even infinite-dimensional.

We design a convolutional architecture of SKN, called the Stacked Kernel Convolutional Network (SKCN), to solve visual recognition tasks.

We demonstrate the effectiveness of SKN and SCKN in experiments.
The rest of the paper is organized as follows. Section 2 introduces the Stacked Kernel Network and Section 3 presents experimental results. Section 4 reviews related works and Section 5 concludes the paper.
2 Method
In this section, we introduce the architecture of the Stacked Kernel Network (SKN) and three ways of representing the RKHS functions in SKN.
2.1 Kernel methods
Kernel methods [21] perform learning in a reproducing kernel Hilbert space (RKHS) [21] of functions. The RKHS represents a high-dimensional feature space that is more expressive than the input space, since it is able to capture the nonlinear patterns behind data. The RKHS is associated with a kernel function $k$, and inner products in the RKHS can be computed by evaluating $k$ in the lower-dimensional input space (known as the kernel trick). Well-established kernel methods include the support vector machine [5], kernel principal component analysis [20], kernel independent component analysis [2] and Gaussian processes [18], to name a few.
Kernel methods feature three prominent concepts: (1) a feature map $\phi: \mathcal{X} \to \mathcal{H}$ that maps a data point in the input space $\mathcal{X}$ to an element of an inner product space $\mathcal{H}$ (the feature vector); (2) a kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that takes two data points and returns a real number; (3) an RKHS, which is a Hilbert space of functions $f: \mathcal{X} \to \mathbb{R}$. Their relations are as follows. First, a feature map defines a kernel: let $\phi$ be a feature map, then $k(x, x') = \langle \phi(x), \phi(x') \rangle$ is a kernel. Second, a kernel defines feature maps: for every kernel $k$, there exists a Hilbert space $\mathcal{H}$ and a feature map $\phi$ such that $k(x, x') = \langle \phi(x), \phi(x') \rangle$. Third, an RKHS defines a kernel: every RKHS $\mathcal{H}$ of functions defines a unique kernel $k$, called the reproducing kernel of $\mathcal{H}$. Fourth, a kernel defines an RKHS: for every kernel $k$, there exists a unique RKHS with reproducing kernel $k$.
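The relation between the feature map and the kernel can be checked numerically. The sketch below is an illustrative example (not code from the paper): it uses the degree-2 homogeneous polynomial kernel on 2-D inputs, whose explicit feature map is known in closed form.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 homogeneous polynomial
    # kernel on 2-D inputs: k(x, y) = (x . y)^2.
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k(x, y):
    # Kernel evaluation performed in the 2-D input space.
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Kernel trick: the inner product in feature space equals the
# kernel evaluated in the input space.
assert np.isclose(np.dot(phi(x), phi(y)), k(x, y))
print(k(x, y))  # (1*3 + 2*(-1))^2 = 1.0
```

For richer kernels such as the RBF kernel, $\phi$ is infinite-dimensional and cannot be written out, which is precisely why the kernel trick is needed.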
2.2 Deep neural networks
We briefly review deep neural networks (DNNs), which inspire us to construct the Stacked Kernel Network. A DNN contains one input layer, several hidden layers and one output layer. Each layer has a set of units. The units between adjacent layers are interconnected, and each connection is associated with a weight parameter. To achieve nonlinearity, a nonlinear activation function is applied at each hidden unit. Commonly used activation functions include sigmoid, tanh and rectified linear.
2.3 Stacked Kernel Network
Classic kernel methods learn a single layer of nonlinear features, which may not be expressive enough to accommodate complex data. Inspired by the hierarchical representation learning in deep neural networks, we propose the Stacked Kernel Network, which learns multiple layers of nonlinear features based on RKHSs. Figure 1 shows the architecture of SKN. Similar to a DNN, it consists of an input layer, an output layer and hidden layers. Each hidden layer is associated with an RKHS, and the hidden units therein are parameterized by functions in this RKHS. This is the key difference with DNN, where the hidden units are parameterized by weight vectors. Next, we present the detailed construction procedure of SKN. We start by defining the first hidden layer. Let $x$ be the input feature vector. We pick $p_1$ functions $\{f_j^{(1)}\}_{j=1}^{p_1}$ from an RKHS $\mathcal{H}_1$ to map $x$ into a $p_1$-dimensional latent space, where the $j$th dimension is $f_j^{(1)}(x)$. Then, on top of the first hidden layer, we can use functions from another RKHS $\mathcal{H}_2$ to define the second hidden layer. Repeating this process $L$ times, a SKN with $L$ hidden layers, characterized by functions from $L$ RKHSs, is obtained. The $L$th hidden layer is utilized to produce the outputs.
By stacking multiple RKHSs together, the SKN is highly expressive. To better understand this, we present an equivalent architecture in Figure 2. Recall that an RKHS function $f$ is equivalently defined as follows: given an input vector $x$, first transform it into $\phi(x)$ using the reproducing kernel feature map $\phi$, then $f(x)$ can be defined as $\langle w, \phi(x) \rangle$, where $w$ contains linear coefficients. Based upon this definition, the SKN can be represented by interleaving nonlinear projections and linear projections: given the latent representation $h^{(l-1)}$ at layer $l-1$, we first transform it into $\phi^{(l)}(h^{(l-1)})$ using the nonlinear feature map $\phi^{(l)}$, then use a linear projection matrix $W^{(l)}$ to map it into the latent representation at layer $l$: $h^{(l)} = W^{(l)} \phi^{(l)}(h^{(l-1)})$. The nonlinear features could be infinite-dimensional. A SKN contains $L$ layers of such features, which lead to substantial representational power. In between two adjacent layers of nonlinear features, a layer of finite-dimensional linear features is placed. This ensures that the size of the SKN is properly controlled.
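The interleaved construction can be sketched in a few lines. The snippet below is a hypothetical illustration: it substitutes a finite-dimensional degree-2 polynomial feature expansion for the (possibly infinite-dimensional) reproducing-kernel map $\phi$, so that the forward pass $h^{(l)} = W^{(l)} \phi(h^{(l-1)})$ can be computed explicitly; all sizes and random weights are for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def poly2_features(h):
    # Finite-dimensional stand-in for the reproducing-kernel feature
    # map phi: the input plus all degree-2 monomials. In SKN proper,
    # phi may be infinite-dimensional.
    return np.concatenate([h, np.outer(h, h)[np.triu_indices(len(h))]])

def skn_forward(x, weights):
    # Interleave nonlinear feature maps and linear projections:
    # h^(l) = W^(l) phi(h^(l-1)).
    h = x
    for W in weights:
        h = W @ poly2_features(h)
    return h

# A 2-hidden-layer SKN on a 4-D input, with layer widths 3 and 2.
x = rng.standard_normal(4)
d1 = len(poly2_features(x))               # feature dim after first map
W1 = rng.standard_normal((3, d1))
d2 = len(poly2_features(np.zeros(3)))     # feature dim after second map
W2 = rng.standard_normal((2, d2))

h = skn_forward(x, [W1, W2])
print(h.shape)  # (2,)
```

The linear projections $W^{(l)}$ are what keep each layer's output finite-dimensional, even when the intermediate nonlinear features are very large.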
2.4 Representing RKHS functions in SKN
A SKN is parameterized with $L$ layers of RKHS functions. While expressive, these functions present great challenges for learning. Unlike the weights in a DNN, which are naturally represented as finite-dimensional vectors, the RKHS functions can be infinite-dimensional, so their storage and computation are troublesome. To address this issue, we first seek explicit representations of these functions that facilitate learning. We investigate three ways: a data-dependent nonparametric representation, a data-independent parametric representation and an approximate representation based on random Fourier features.
2.4.1 Nonparametric representation
In kernel methods, the most common way to represent an RKHS function is based on the representer theorem [19]: given a regularized risk functional $c((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))) + \Omega(\|f\|_{\mathcal{H}})$, where $\{(x_i, y_i)\}_{i=1}^n$ are the training data, $\|f\|_{\mathcal{H}}$ denotes the Hilbert norm of $f$ and $\Omega$ is an increasing function, the minimizer of this functional admits the following form:

$$f(\cdot) = \sum_{i=1}^n \alpha_i k(x_i, \cdot) \quad (1)$$

where $k$ is the kernel function associated with the RKHS and $\{\alpha_i\}_{i=1}^n$ are coefficients. This representation depends on the data, hence it is referred to as the nonparametric representation.
Specifically, given the hidden states $\{h_i^{(l-1)}\}_{i=1}^n$ of all training data at layer $l-1$, the RKHS function associated with the $j$th hidden unit at layer $l$ is defined as:

$$f_j^{(l)}(\cdot) = \sum_{i=1}^n \alpha_{ji}^{(l)} k^{(l)}(h_i^{(l-1)}, \cdot) \quad (2)$$

where $k^{(l)}$ is the kernel function associated with the RKHS $\mathcal{H}_l$. For data sample $m$, the activation value of hidden unit $j$ at layer $l$ is $\sum_{i=1}^n \alpha_{ji}^{(l)} k^{(l)}(h_i^{(l-1)}, h_m^{(l-1)})$. Figure 3 shows the SKN architecture under the nonparametric representation. The inputs of each hidden unit at layer $l$ are the representations of all samples at layer $l-1$, creating a huge network.
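A minimal sketch of a forward pass through one nonparametric layer, following Eq. (2). This is illustrative only: the RBF kernel choice, the shapes and the random coefficients are assumptions, not the paper's exact setup.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # Pairwise RBF kernel k(a, b) = exp(-gamma * ||a - b||^2).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def nonparametric_layer(H_prev, H_train_prev, alpha):
    # Each hidden unit j computes sum_i alpha[j, i] * k(h_i, .),
    # anchored on all n training samples' layer-(l-1) states.
    K = rbf_kernel(H_train_prev, H_prev)   # (n, batch)
    return alpha @ K                       # (p, batch) activations

rng = np.random.default_rng(0)
n, d, p = 100, 8, 5                        # n training anchors, p units
H_train = rng.standard_normal((n, d))      # layer-(l-1) training states
alpha = rng.standard_normal((p, n)) / n    # coefficients alpha_{ji}
batch = rng.standard_normal((16, d))
print(nonparametric_layer(batch, H_train, alpha).shape)  # (5, 16)
```

Note that every forward pass touches all $n$ anchors, which is the scalability bottleneck discussed next.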
The advantage of the nonparametric representation is that the function set contains the globally optimal solution, though this optimum may not be achieved due to the nonconvexity of SKN. The drawback is that the number of parameters grows linearly with the training set size $n$, which is not scalable to large datasets.
2.4.2 Parametric representation
In light of the large computational complexity of the nonparametric representation, we investigate a parametric counterpart. The basic idea is: instead of searching for the optimal solution in the entire RKHS, we restrict the learning to a subset of the RKHS in which the functions have nice parametric forms. Specifically, we choose the subset to be the image of the reproducing kernel map:

$$f(\cdot) = k(w, \cdot) \quad (3)$$

where $k$ is the kernel function associated with the RKHS and $w$ is a learnable parameter vector, initialized using k-means clustering [6] over the input samples and trained by gradient descent. Specifically, the RKHS function associated with the $j$th hidden unit in layer $l$ is:

$$f_j^{(l)}(\cdot) = k^{(l)}(w_j^{(l)}, \cdot) \quad (4)$$

where $w_j^{(l)}$ is the parameter vector of this function. Given the hidden states $h^{(l-1)}$ at layer $l-1$, the activation value of this unit is $k^{(l)}(w_j^{(l)}, h^{(l-1)})$.
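One parametric layer, as in Eq. (4), reduces to evaluating the kernel between each learnable anchor $w_j$ and the incoming hidden state. The sketch below uses an RBF kernel and random anchors for brevity (the paper initializes them with k-means); shapes are illustrative.

```python
import numpy as np

def parametric_layer(H, W, gamma=0.5):
    # Unit j computes k(w_j, h) with a learnable anchor w_j, i.e. the
    # RKHS function is restricted to the image of the kernel map.
    # With an RBF kernel this is exactly an RBF-network layer.
    sq = ((W[:, None, :] - H[None, :, :]) ** 2).sum(-1)  # (p, batch)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
H = rng.standard_normal((16, 8))     # batch of layer-(l-1) states
W = rng.standard_normal((5, 8))      # p = 5 learnable anchors w_j
out = parametric_layer(H, W)
print(out.shape)  # (5, 16)
```

Unlike the nonparametric layer, the parameter count here is independent of the training set size: each unit stores a single vector $w_j$.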
The advantage of the parametric representation is that it is independent of the data. The disadvantage is that the optimal solution may not be contained in this subset. Note that if we choose $k$ to be the radial basis function (RBF) kernel, then the parametric SKN specializes to a deep RBF network [16].

                         Nonparametric  Parametric  RFF
Contains optimum         Yes            May not     No
Depends on data          Yes            No          No
Subset of RKHS           Yes            Yes         No
# parameters (per unit)  $n$            $d$         $D$
2.4.3 Random Fourier feature representation
In addition to the nonparametric and parametric representations, we also investigate an approximate representation based on random Fourier features (RFFs) [17]. Given a shift-invariant kernel $k(x, x') = k(x - x')$, it can be approximated with RFFs:

$$k(x, x') \approx z(x)^\top z(x') \quad (5)$$

where $z(\cdot)$ is the RFF transformation. $z(\cdot)$ is generated in the following way: (1) compute the Fourier transform $p(\omega)$ of the kernel $k$; (2) draw $D$ i.i.d. samples $\omega_1, \ldots, \omega_D$ from $p$ and $D$ i.i.d. samples $b_1, \ldots, b_D$ from the uniform distribution on $[0, 2\pi]$; (3) let $z(x) = \sqrt{2/D}\,[\cos(\omega_1^\top x + b_1), \ldots, \cos(\omega_D^\top x + b_D)]^\top$. Given an RKHS function $f$, we approximate it using RFFs:
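The quality of the RFF approximation is easy to verify numerically. The sketch below (parameter values are illustrative) builds $z(\cdot)$ for the RBF kernel $k(x, y) = \exp(-\gamma \|x - y\|^2)$, whose Fourier transform is a Gaussian, and compares $z(x)^\top z(y)$ against the exact kernel value.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, gamma = 5, 5000, 0.5

# For the RBF kernel exp(-gamma ||x - y||^2), the Fourier transform
# is a Gaussian with covariance 2 * gamma * I; omega is drawn from it.
Omega = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

def z(x):
    # Random Fourier feature transformation z(x).
    return np.sqrt(2.0 / D) * np.cos(Omega @ x + b)

x = rng.standard_normal(d)
y = rng.standard_normal(d)
exact = np.exp(-gamma * np.sum((x - y) ** 2))
approx = z(x) @ z(y)
print(abs(exact - approx))  # small, and shrinks as D grows
```

The error decays roughly as $O(1/\sqrt{D})$, which is why a few thousand random features suffice in the experiments.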
$$f(x) \approx w^\top z(x) \quad (6)$$
The RKHS function associated with the $j$th hidden unit in layer $l$ is:

$$f_j^{(l)}(\cdot) \approx (w_j^{(l)})^\top z^{(l)}(\cdot) \quad (7)$$

where $w_j^{(l)}$ is the parameter vector of this function. Given the hidden states $h^{(l-1)}$ at layer $l-1$, the activation value of this unit is $(w_j^{(l)})^\top z^{(l)}(h^{(l-1)})$.
Figure 4 shows the SKN architecture under the RFF representation, where multiple layers of RFF transformations and linear projections are interleaved. The parameters $\{\omega_i\}_{i=1}^D$ in the RFF transformation are sampled from the Fourier transform of the kernel and are fixed during training. The learnable parameters are the weights $w_j^{(l)}$.
2.4.4 Comparison of three representations
Table 1 presents a comparison of the three representations. Functions under the nonparametric representation contain the globally optimal function in the RKHS, but the number of parameters grows with the training data size $n$. Functions under the random Fourier feature representation only approximate those in the RKHS. The parametric and RFF representations are computationally efficient; however, they have little chance of reaching the global optimum. The number of parameters in all three representations grows with the number of layers and the number of units in each layer. For each hidden unit, the parametric representation has $d$ parameters (the input dimension of the layer) and the RFF representation has $D$ parameters (the dimension of the random features), both of which are much smaller than $n$.
2.5 Stacked Kernel Convolutional Network
In order to apply SKN to visual recognition problems, we aim to extend the plain Stacked Kernel Network to an architecture with strong representational ability in visual recognition tasks. Inspired by the recent success of the Convolutional Neural Network [13], which extracts features patch-wise from the input feature map via the convolution operation and obtains strong representations of the corresponding local areas, we design a convolutional architecture for SKN, called the Stacked Kernel Convolutional Network (SKCN).
Specifically, as shown in Figure 5(a), starting from the first kernel convolutional layer, let $X$ be the input feature map and let $x$ denote a patch extracted from $X$. Similar to SKN, a nonlinear feature map $\phi^{(1)}$ is first applied to project $x$ into a $d_1$-dimensional RKHS $\mathcal{H}_1$, and then a linear mapping $W^{(1)}$ projects $\phi^{(1)}(x)$ into a $p_1$-dimensional linear space: $h^{(1)} = W^{(1)} \phi^{(1)}(x)$. According to the definition of RKHS, we pick $p_1$ functions from $\mathcal{H}_1$ in place of $W^{(1)}$ to parameterize the hidden units (Figure 5(b)). To this end, the representation of $x$ can be written as $h^{(1)}$, where the $j$th element of $h^{(1)}$ is $f_j^{(1)}(x)$. Then, on top of the first kernel convolutional layer, we can use functions from another RKHS to define the second layer. Repeating this process $L$ times, a SKCN with $L$ kernel convolutional layers, characterized by functions from $L$ RKHSs, is obtained. The $L$th kernel convolutional layer is utilized to produce the outputs.
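A kernel convolutional layer can be sketched by extracting patches and evaluating a parametric RKHS function at every spatial location. The following is an illustrative single-channel implementation (stride 1, no padding, RBF kernel, random anchors), not the paper's exact SKCN code.

```python
import numpy as np

def extract_patches(img, size):
    # Slide a size x size window over a single-channel square image
    # (stride 1, no padding) and flatten each patch.
    H, W = img.shape
    return np.array([img[i:i + size, j:j + size].ravel()
                     for i in range(H - size + 1)
                     for j in range(W - size + 1)])

def kernel_conv_layer(img, anchors, gamma=0.5, size=3):
    # Each output channel j evaluates k(w_j, patch) at every location,
    # replacing the linear filter of a CNN with an RKHS function
    # (parametric representation, RBF kernel).
    P = extract_patches(img, size)                    # (locations, size*size)
    sq = ((anchors[:, None, :] - P[None, :, :]) ** 2).sum(-1)
    out = np.exp(-gamma * sq)                         # (channels, locations)
    side = img.shape[0] - size + 1
    return out.reshape(len(anchors), side, side)

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
anchors = rng.standard_normal((4, 9))   # 4 channels, 3x3 patches
print(kernel_conv_layer(img, anchors).shape)  # (4, 6, 6)
```

Stacking such layers, optionally with pooling in between, gives the SKCN architecture; the RFF variant would replace the kernel evaluation with $w^\top z(\text{patch})$.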
In the following experiments, we apply the parametric representation and the random Fourier feature representation to the SKCN architecture, and name the resulting models PSKCN and RFFSKCN respectively.
Dataset  DNN (relu)  PSKN (poly)  NPSKN (poly)
  1L  2L  3L  4L  1L  2L  3L  4L  1L  2L
MNIST  97.19  98.67  98.60  98.33  97.92  98.71  98.57  98.54  95.83  96.11 
ImageNet10  95.70  97.35  97.35  97.20  97.20  97.65  97.45  97.45  97.25  97.20 
PenDigits  97.11  98.53  98.50  98.15  98.68  98.82  97.77  96.03  95.37  97.88 
SatImage  89.85  91.50  90.95  85.45  91.30  91.95  92.15  92.00  90.00  90.20 
Segment  96.86  98.25  98.50  98.25  97.80  98.82  99.02  98.63  95.47  96.86 
Vowel  52.59  57.49  61.75  60.25  57.14  63.11  59.37  57.42  62.34  63.20 
Dataset  RFFSKN (RBF)  PSKN (RBF)  NPSKN (RBF)  
1L  2L  3L  4L  1L  2L  3L  4L  1L  2L  
MNIST  97.97  97.99  98.08  98.48  98.37  98.12  97.55  97.21  96.88  96.56 
ImageNet10  97.30  97.35  97.20  96.35  97.30  97.55  97.45  97.35  97.20  96.90 
PenDigits  98.20  98.04  97.84  98.24  98.50  99.06  98.95  98.89  95.81  98.11 
SatImage  90.70  90.35  90.35  87.70  91.60  92.00  90.80  89.75  90.30  90.75 
Segment  98.27  98.24  98.43  98.04  98.83  98.83  98.04  97.25  97.25  97.84 
Vowel  59.09  62.04  63.87  65.35  61.68  63.38  64.74  65.43  60.39  61.04 
3 Experiments
In this section, we present experimental results, where we observe that (1) "deep" SKN/SKCN outperforms single-layer SKN/SKCN and (2) SKN outperforms DNNs while SKCN outperforms CNNs.
3.1 Datasets
We compare SKN with DNN on six datasets. (1) MNIST [13]: images of handwritten digits represented as 784-dimensional vectors of raw pixels; the training set has 60,000 images and the test set has 10,000 images. (2) ImageNet10: 10 categories of ImageNet images, each category with 600 images; images are represented by 128-dimensional convolutional neural network features extracted from a pretrained ConvNet model [7]. (3) PenDigits: a multiclass dataset with 16 integer attributes and 10 classes, created by collecting 250 samples from each of 44 writers. (4) SatImage: generated from Landsat multispectral scanner image data, containing 4,435 training samples and 2,000 test samples belonging to 6 classes; the feature dimension is 36. (5) Segment: data examples drawn randomly from a database of 7 outdoor image categories; the images were hand-segmented to create a label for every pixel. (6) Vowel: 11 classes, each with 90 10-dimensional samples.
We compare SKCN with CNN on two datasets. (1) MNIST: the same dataset as above, with the 784-dimensional vectors reshaped into 28x28 images with 1 grayscale channel. (2) CIFAR10 [11]: 32x32 images, each with 3 color channels (RGB), from 10 classes; the training and test sets contain 50,000 and 10,000 images respectively.
3.2 Experimental setup
For SKN, we experimented with three representations of the RKHS functions: the nonparametric representation (NPSKN), the parametric representation (PSKN) and the random Fourier feature representation (RFFSKN), and with two kernel functions: the radial basis function (RBF) kernel and the polynomial kernel (poly). The polynomial kernel is not applicable in RFFSKN since it is not shift-invariant. The scale parameter of the RBF kernel was tuned using 5-fold cross validation. The two parameters of the polynomial kernel were tuned with intervals of 0.5 and 1 respectively. We compare with deep neural networks whose activation function is set to rectified linear (relu), always using the same hyperparameters. SKN was trained by stochastic gradient descent using the cross-entropy loss, with a batch size of 128. For RFFSKN, the number of random Fourier features was set to 5,000. We implemented SKN using TensorFlow [1]. These experiments were performed on Linux machines with 32 4.0GHz CPU cores and 256GB RAM.
For SKCN, we experimented with two representations of the RKHS functions: the parametric representation (PSKCN) and the random Fourier feature representation (RFFSKCN). For PSKCN we use the polynomial kernel and for RFFSKCN we use the RBF kernel. To trade off runtime against accuracy, we set the number of random Fourier features in RFFSKCN to 2,000. The kernel parameters were tuned in the same way as for PSKN. We compare our methods with CNN, always using the same hyperparameter settings. For instance, in Table 3 both CNN and SKCN use: (1) network architecture: conv1-pool1-norm1 (layer 1), conv2-norm2-pool2 (layer 2), conv3-norm3-pool3 (layer 3), conv4-norm4-pool4 (layer 4); (2) kernel sizes: 5x5 (layer 1), 5x5 (layer 2), 3x3 (layer 3), 3x3 (layer 4); (3) strides: 1 for all layers; (4) padding: same-padding; (5) pooling: 2x2 max-pooling; (6) dropout [24]: with 50% or without; (7) data augmentation: none. The only setting that differs between CNN and RFFSKCN is the activation function: CNN uses relu while RFFSKCN uses none. In these comparison experiments, we use stochastic gradient descent to minimize the cross-entropy loss with a batch size of 128 and an adaptive learning rate with staircase decay. We use the local response normalization provided in TensorFlow to normalize the outputs of each layer. These experiments were performed on Linux machines with Tesla K80 GPUs and 256GB RAM.
3.3 Results
Table 2 shows the classification accuracy of DNN and SKN with different representations of the RKHS functions and different kernel functions. The accuracy of NPSKN with more than 2 layers is not available since the model is too large and takes too much time to converge. From this table, we make the following observations. (1) PSKN (with either the polynomial or the RBF kernel) outperforms DNN. For instance, on Vowel, PSKN with the RBF kernel achieves an accuracy of 65.43% while the accuracy achieved by DNN is 61.75%. The hidden units in SKN are parameterized by infinite-dimensional RKHS functions, which are more expressive than the finite-dimensional weight vectors in DNN. Thus SKN is better able to capture the complex patterns behind data and achieves higher classification accuracy. (2) In most cases, SKN with multiple hidden layers outperforms SKN with a single hidden layer. For example, on the Segment dataset, PSKN (poly) with 3 layers achieves an accuracy of 99.02% while that with 1 layer achieves 97.80%. This demonstrates that "deep" SKN is better than a single-layer kernel method: stacking multiple layers of RKHS functions together improves the representational power of kernel methods and results in better performance. However, the number of layers cannot be too large, as this leads to overfitting and hurts generalization performance on the test set. For instance, on the SatImage dataset, PSKN (RBF) with 4 layers performs worse than that with 2 layers. (3) The performance of PSKN is better than NPSKN and RFFSKN. For NPSKN, the number of parameters grows linearly with the data size, which easily leads to overfitting. For RFFSKN, the RFF-represented functions are not true functions in the RKHS but rather approximations, which suffer an approximation error (though overall its performance is very close to PSKN). In contrast, the functions under the parametric representation are exactly from the RKHS and their parameters do not depend on the data size.
Dataset  CNN  RFFSKCN  PSKCN
  1L  2L  3L  4L  1L  2L  3L  4L  1L  2L  3L  4L
MNIST  99.26  99.47  99.43  99.51  99.31  99.63  99.62  99.58  99.30  99.49  99.27  99.27 
MNIST (drop)  99.38  99.54  99.50  99.56  99.38  99.60  99.63  99.58  99.10  99.40  99.37  99.41 
CIFAR10  70.03  72.83  73.24  74.86  69.36  75.97  77.77  77.59  67.77  73.33  73.41  72.19 
CIFAR10 (drop)  71.21  75.60  74.79  75.88  72.45  76.82  79.06  77.13  67.84  74.71  76.79  75.31 
Besides, we design an experiment to examine the sensitivity of SKN and DNNs to the number of hidden units. Figure 7 shows how accuracy changes w.r.t. the number of hidden units on the MNIST dataset. For DNN, we experimented with different activation functions including relu [15], relu6 [12], softplus [10], softsign [4], tanh and sigmoid. As can be seen, overall, SKNs are not sensitive to the number of hidden units: the accuracy remains stable as it increases from 10 to 2500. In contrast, DNNs are very sensitive to it. For each DNN variant, the best accuracy is achieved at a middle-ground number of hidden units: a smaller network is not expressive enough and a larger one leads to overfitting.
Table 3 lists the experimental results of CNN and SKCN. We make two major observations from this table. (1) RFFSKCN outperforms CNN: SKCN achieves higher accuracy on most datasets with the same number of layers. For instance, with 3 layers on CIFAR10, RFFSKCN achieves 79.06% with dropout and 77.77% without dropout, while CNN only achieves 74.79% and 73.24% respectively. (2) In most cases, SKCN with multiple layers outperforms SKCN with a single layer.
Next, we show the detailed performance of CNN and RFFSKCN with different numbers of layers on the CIFAR10 dataset in Figures 6(a) to 6(d). First, from Figures 6(a) and 6(b) we observe that RFFSKCN exhibits considerably higher test accuracy and generalizes better than the baseline CNN architecture, which indicates that the infinite-dimensional RKHS functions are more expressive than the finite-dimensional filters in CNN. Second, although RFFSKCN outperforms CNN, from Figures 6(c) and 6(d) we observe that RFFSKCN takes nearly twice as long to run a single iteration. This is mainly because the RFF representation requires training larger weight matrices than CNN; as a result, the learned parameters in RFFSKCN can be more expressive than those of CNN.
4 Related work
Several studies have been performed to bridge kernel methods and deep learning.
[8] aim to define a "deep" kernel by successively composing the same nonlinear transformation multiple times over the input, then defining the kernel on the composed representations. Using the kernel trick, these transformations can be replaced with kernel functions. However, if the representation is not used for defining a kernel but rather for making predictions (which is the case in our work), it is unclear how to deal with these transformations. [23] propose local deep kernel learning, where the local feature spaces are represented with a hierarchy of nonlinear transformations guided by a tree structure. [25] use a deep neural network (DNN) to define a kernel function: given two input data points $x$ and $x'$, they are first fed into a DNN $g$, generating latent representations $g(x)$ and $g(x')$, which are then fed into a kernel function $k$; overall, the kernel is defined as $k(g(x), g(x'))$. The major difference between these works and ours is that they utilize a DNN or nested nonlinear transformations to define a single kernel function, while our work defines a network with multiple layers of RKHS functions. [9] propose a generative model that stacks multiple layers of Gaussian processes, while our work is a discriminative model that stacks multiple layers of RKHS functions. [14] use a convolutional neural network to approximate the kernel map, while our work uses random Fourier features as building blocks of deep networks.
5 Conclusion
In this paper, we propose a "deep" kernel method, the Stacked Kernel Network, which learns a hierarchy of RKHS functions. SKN consists of multiple layers of interleaved nonlinear and linear projections. The nonlinear projection is carried out by the reproducing kernel feature map associated with an RKHS, and the resulting features could be infinite-dimensional. A SKN is equipped with multiple such feature maps, which bring in high representational power. To keep the model size of SKN under control, immediately after each nonlinear transformation a linear projection is applied to map the infinite-dimensional nonlinear space into a finite-dimensional linear space. In the end, a SKN is composed of multiple hidden layers, each associated with an RKHS, and each unit therein is parameterized by a function in that RKHS. We investigate three ways to represent the RKHS functions to make their learning tractable: a data-dependent nonparametric representation based on the representer theorem, a data-independent parametric representation and an approximate representation based on random Fourier features. Experiments on various datasets demonstrate the effectiveness of SKN.
References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3(Jul):1–48, 2002.
[3] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[4] J. Bergstra, G. Desjardins, P. Lamblin, and Y. Bengio. Quadratic polynomials learn better image features. Technical Report 1337, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, 2009.
[5] C. J. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[6] M. E. Celebi, H. A. Kingravi, and P. A. Vela. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1):200–210, 2013.
[7] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv, 2014.
[8] Y. Cho and L. K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.
[9] A. Damianou and N. Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
[10] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, volume 15, page 275, 2011.
[11] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[12] A. Krizhevsky and G. Hinton. Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 40, 2010.
[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[14] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems, pages 2627–2635, 2014.
[15] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[16] M. J. Orr et al. Introduction to radial basis function networks, 1996.
[17] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2007.
[18] C. E. Rasmussen. Gaussian processes for machine learning. 2006.
[19] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer, 2001.
[20] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583–588. Springer, 1997.
[21] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[22] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[23] M. Varma. Local deep kernel learning for efficient non-linear SVM prediction.
[24] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.
[25] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. arXiv preprint arXiv:1511.02222, 2015.