1 Introduction
“How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?” questioned MacKay (1998). It was the late 1990s, and researchers had grown frustrated with the many design choices associated with neural networks (regarding architecture, activation functions, and regularisation) and the lack of a principled framework to guide these choices.
Gaussian processes had recently been popularised within the machine learning community by Neal (1996), who had shown that Bayesian neural networks with infinitely many hidden units converged to Gaussian processes with a particular kernel (covariance) function. Gaussian processes were subsequently viewed as flexible and interpretable alternatives to neural networks, with straightforward learning procedures. Where neural networks used finitely many highly adaptive basis functions, Gaussian processes typically used infinitely many fixed basis functions. As argued by MacKay (1998), Hinton et al. (2006), and Bengio (2009), neural networks could automatically discover meaningful representations in high-dimensional data by learning multiple layers of highly adaptive basis functions. By contrast, Gaussian processes with popular kernel functions were typically used as simple smoothing devices.
Recent approaches (e.g., Wilson, 2014; Wilson and Adams, 2013; Lloyd et al., 2014; Yang et al., 2015) have demonstrated that one can develop more expressive kernel functions, which are indeed able to discover rich structure in data without human intervention. Such methods effectively use infinitely many adaptive basis functions. The relevant question then becomes not which paradigm (e.g., kernel methods or neural networks) replaces the other, but whether we can combine the advantages of each approach. Indeed, deep neural networks provide a powerful mechanism for creating adaptive basis functions, with inductive biases which have proven effective for learning in many application domains, including visual object recognition, speech perception, language understanding, and information retrieval (Krizhevsky et al., 2012; Hinton et al., 2012; Socher et al., 2011; Kiros et al., 2014; Xu et al., 2015).
In this paper, we combine the nonparametric flexibility of kernel methods with the structural properties of deep neural networks. In particular, we use deep feedforward fully-connected and convolutional networks, in combination with spectral mixture covariance functions (Wilson and Adams, 2013), inducing points (Quiñonero-Candela and Rasmussen, 2005), structure exploiting algebra (Saatchi, 2011), and local kernel interpolation (Wilson and Nickisch, 2015; Wilson et al., 2015), to create scalable expressive closed form covariance kernels for Gaussian processes. As a nonparametric method, the information capacity of our model grows with the amount of available data, but its complexity is automatically calibrated through the marginal likelihood of the Gaussian process, without the need for regularization or cross-validation (Rasmussen and Ghahramani, 2001; Rasmussen and Williams, 2006; Wilson, 2014). The flexibility and automatic calibration provided by the nonparametric layer typically provide a high standard of performance, while reducing the need for extensive hand tuning from the user.
We further build on the ideas in KISS-GP (Wilson and Nickisch, 2015) and extensions (Wilson et al., 2015), so that our deep kernel learning model can scale linearly with the number of training instances $n$, instead of $\mathcal{O}(n^3)$ as is standard with Gaussian processes, while retaining a fully nonparametric representation. Our approach also scales as $\mathcal{O}(1)$ per test point, allowing for very fast prediction times. Because KISS-GP creates an approximate kernel from a user-specified kernel for fast computations, independently of a specific inference procedure, we can view the resulting kernel as a scalable deep kernel. We demonstrate the value of this scalability in the experimental results section, where it is the large datasets that provide the greatest opportunities for our model to discover expressive statistical representations.
We begin by reviewing related work in section 2, and providing background material on Gaussian processes in section 3. In section 4 we derive scalable closed form deep kernels, and describe how to perform efficient automatic learning of these kernels through the Gaussian process marginal likelihood. In section 5, we show substantially improved performance over standard Gaussian processes, expressive kernel learning approaches, and deep neural networks, on a wide range of datasets. We also examine the structure of the kernels to gain new insights into our modelling problems.
2 Related Work
Given the intuitive value of combining kernels and neural networks, it is encouraging that various distinct forms of such combinations have been considered in different contexts.
The Gaussian process regression network (Wilson et al., 2012) replaces all weight connections in a Bayesian neural network with Gaussian processes, allowing the authors to model input dependent correlations between multiple tasks. Alternatively, Damianou and Lawrence (2013) replace every activation function in a Bayesian neural network with a Gaussian process transformation, in an unsupervised setting. While promising, both models are very task-specific, and require sophisticated approximate Bayesian inference which is much more demanding than what is required by standard Gaussian processes or deep learning models, and typically does not scale beyond a few thousand training points. Similarly, Salakhutdinov and Hinton (2008) combine deep belief networks (DBNs) with Gaussian processes, showing improved performance over standard GPs with RBF kernels, in the context of semi-supervised learning. However, their model relies heavily on unsupervised pretraining of DBNs, with the GP component unable to scale beyond a few thousand training points. Likewise, Calandra et al. (2014) combine a feedforward neural network transformation with a Gaussian process, showing an ability to learn sharp discontinuities. However, similar to many other approaches, the resulting model can only scale to at most a few thousand data points.
In a frequentist setting, Yang et al. (2014) combine convolutional networks, with parameters pretrained on ImageNet, with a scalable Fastfood (Le et al., 2013) expansion for the RBF kernel applied to the final layer. The resulting method is scalable and flexible, but the network parameters generally must first be trained separately from the Fastfood features, and the combined model remains parametric, due to the parametric expansion provided by Fastfood. Careful attention must still be paid to training procedures, regularization, and manual calibration of the network architecture. In a similar manner, Huang et al. (2015) and Snoek et al. (2015) have combined deep architectures with parametric Bayesian models. Huang et al. (2015) pursue an unsupervised pretraining procedure using deep autoencoders, showing improved performance over GPs using standard kernels. Snoek et al. (2015) show promising performance on Bayesian optimisation tasks, for tuning the parameters of a deep neural network.
Our approach is distinct in that we combine deep feedforward and convolutional architectures with spectral mixture covariances (Wilson and Adams, 2013), inducing points, Kronecker and Toeplitz algebra, and local kernel interpolation (Wilson and Nickisch, 2015; Wilson et al., 2015), to derive expressive and scalable closed form kernels, which can be trained jointly with a unified supervised objective, as part of a nonparametric Gaussian process framework, without requiring approximate Bayesian inference. Moreover, the simple joint learning procedure in our approach can be applied in general settings. Indeed we show that the proposed model outperforms state-of-the-art standalone deep learning architectures and Gaussian processes with advanced kernel learning procedures on a wide range of datasets, demonstrating its practical significance. We achieve scalability while retaining nonparametric model structure by leveraging the very recent KISS-GP approach (Wilson and Nickisch, 2015) and extensions in Wilson et al. (2015) for efficiently representing kernel functions, to produce scalable deep kernels.
3 Gaussian Processes
We briefly review the predictive equations and marginal likelihood for Gaussian processes (GPs), and the associated computational requirements, following the notational conventions in Wilson et al. (2015). See, for example, Rasmussen and Williams (2006) for a comprehensive discussion of GPs.
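We assume a dataset $\mathcal{D}$ of $n$ input (predictor) vectors $X = \{x_1, \dots, x_n\}$, each of dimension $D$, which index an $n \times 1$ vector of targets $y = (y(x_1), \dots, y(x_n))^\top$. If $f(x) \sim \mathcal{GP}(\mu, k_\gamma)$, then any collection of function values $f$ has a joint Gaussian distribution,
$$f = f(X) = [f(x_1), \dots, f(x_n)]^\top \sim \mathcal{N}(\mu, K_{X,X}), \tag{1}$$
with a mean vector, $\mu_i = \mu(x_i)$, and covariance matrix, $(K_{X,X})_{ij} = k_\gamma(x_i, x_j)$, determined from the mean function and covariance kernel of the Gaussian process. The kernel, $k_\gamma$, is parametrized by $\gamma$. Assuming additive Gaussian noise, $y(x) \mid f(x) \sim \mathcal{N}(y(x); f(x), \sigma^2)$, the predictive distribution of the GP evaluated at the $n_*$ test points indexed by $X_*$, is given by
$$\begin{aligned} f_* \mid X_*, X, y, \gamma, \sigma^2 &\sim \mathcal{N}(\mathbb{E}[f_*], \mathrm{cov}(f_*)), \\ \mathbb{E}[f_*] &= \mu_{X_*} + K_{X_*,X}[K_{X,X} + \sigma^2 I]^{-1} y, \\ \mathrm{cov}(f_*) &= K_{X_*,X_*} - K_{X_*,X}[K_{X,X} + \sigma^2 I]^{-1} K_{X,X_*}. \end{aligned} \tag{2}$$
$K_{X_*,X}$, for example, is an $n_* \times n$ matrix of covariances between the GP evaluated at $X_*$ and $X$. $\mu_{X_*}$ is the $n_* \times 1$ mean vector, and $K_{X,X}$ is the $n \times n$ covariance matrix evaluated at training inputs $X$. All covariance (kernel) matrices implicitly depend on the kernel hyperparameters $\gamma$.
GPs with RBF kernels correspond to models which have an infinite basis expansion in a dual space, and have compelling theoretical properties: these models are universal approximators, and have prior support to within an arbitrarily small epsilon band of any continuous function (Micchelli et al., 2006). Indeed the properties of the distribution over functions induced by a Gaussian process are controlled by the kernel function. For example, the popular RBF kernel,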
$$k_{\text{RBF}}(x, x') = \exp\left(-\frac{1}{2} \frac{\|x - x'\|^2}{\ell^2}\right), \tag{3}$$
encodes the inductive bias that function values closer together in the input space, in the Euclidean sense, are more correlated. The complexity of the functions in the input space is determined by the interpretable lengthscale hyperparameter $\ell$. Shorter lengthscales correspond to functions which vary more rapidly with the inputs $x$.
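As a rough illustration of Eqs. (2) and (3), the following NumPy sketch evaluates the RBF kernel and the GP predictive mean and covariance under a zero mean function. The function names (`rbf_kernel`, `gp_predict`), the toy data, and the hyperparameter values are assumptions made for this example and are not taken from the original implementation.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    """RBF kernel of Eq. (3): exp(-0.5 * ||x - x'||^2 / lengthscale^2)."""
    sq_dists = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)

def gp_predict(X, y, X_star, lengthscale=1.0, noise_var=1e-2):
    """Predictive mean and covariance of Eq. (2), assuming a zero mean function."""
    K = rbf_kernel(X, X, lengthscale) + noise_var * np.eye(len(X))
    K_star = rbf_kernel(X_star, X, lengthscale)             # n_* x n
    K_star_star = rbf_kernel(X_star, X_star, lengthscale)   # n_* x n_*
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # (K + s^2 I)^{-1} y
    V = np.linalg.solve(L, K_star.T)                        # L^{-1} K_{X,X_*}
    mean = K_star @ alpha
    cov = K_star_star - V.T @ V
    return mean, cov

# Toy 1-D regression problem (illustrative data only).
rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 30)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
X_star = np.linspace(-3.0, 3.0, 50)[:, None]
mean, cov = gp_predict(X, y, X_star, lengthscale=1.0, noise_var=0.01)
```

Reducing `lengthscale` makes off-diagonal kernel entries decay faster, which is the sense in which shorter lengthscales give more rapidly varying functions.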
The structure of our data is discovered through learning interpretable kernel hyperparameters. The marginal likelihood of the targets $y$, the probability of the data conditioned only on kernel hyperparameters $\gamma$, provides a principled probabilistic framework for kernel learning:
$$\log p(y \mid \gamma, X) \propto -\left[ y^\top (K_\gamma + \sigma^2 I)^{-1} y + \log |K_\gamma + \sigma^2 I| \right], \tag{4}$$
where we have used $K_\gamma$ as shorthand for $K_{X,X}$ given $\gamma$. Note that the expression for the log marginal likelihood in Eq. (4) pleasingly separates into automatically calibrated model fit and complexity terms (Rasmussen and Ghahramani, 2001). Kernel learning can be achieved by optimizing Eq. (4) with respect to $\gamma$.
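The separation into model fit and complexity terms can be made concrete with a short sketch. The code below is an illustrative assumption rather than the authors' implementation: it evaluates the log marginal likelihood of a zero-mean GP through a Cholesky factorisation (including the constant and the factor of one half that the proportionality in Eq. (4) omits) and compares a few hand-picked lengthscales on toy data.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise_var):
    """Return (total, fit, complexity) for a zero-mean GP.

    fit        = -0.5 * y^T (K + s^2 I)^{-1} y
    complexity = -0.5 * log|K + s^2 I|
    total additionally includes the constant -0.5 * n * log(2 * pi).
    """
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))   # O(n^3) time, O(n^2) memory
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))            # equals -0.5 * log det
    total = fit + complexity - 0.5 * n * np.log(2.0 * np.pi)
    return total, fit, complexity

def rbf(x1, x2, lengthscale):
    """1-D RBF kernel matrix."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

# Toy data and a small grid of candidate lengthscales (illustrative values).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 40)
y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(40)
grid = [0.05, 0.2, 0.5, 1.0, 3.0]
scores = [log_marginal_likelihood(rbf(x, x, ell), y, 0.01)[0] for ell in grid]
best_lengthscale = grid[int(np.argmax(scores))]
```

A grid search is used here only for brevity; in practice the hyperparameters would be optimised with gradients of the same objective.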
The computational bottleneck for inference is solving the linear system $(K_\gamma + \sigma^2 I)^{-1} y$, and for kernel learning is computing the log determinant $\log |K_\gamma + \sigma^2 I|$ in the marginal likelihood. The standard approach is to compute the Cholesky decomposition of the $n \times n$ matrix $K_{X,X}$, which requires $\mathcal{O}(n^3)$ operations and $\mathcal{O}(n^2)$ storage. After inference is complete, the predictive mean costs $\mathcal{O}(n)$, and the predictive variance costs $\mathcal{O}(n^2)$, per test point $x_*$.
4 Deep Kernel Learning
In this section we show how we can construct kernels which encapsulate the expressive power of deep architectures, and how to learn the properties of these kernels as part of a scalable probabilistic Gaussian process framework.
Specifically, starting from a base kernel $k(x_i, x_j \mid \theta)$ with hyperparameters $\theta$, we transform the inputs (predictors) $x$ as
$$k(x_i, x_j \mid \theta) \rightarrow k(g(x_i, w), g(x_j, w) \mid \theta, w), \tag{5}$$
where $g(x, w)$ is a nonlinear mapping given by a deep architecture, such as a deep convolutional network, parametrized by weights $w$. The popular RBF kernel (Eq. (3)) is a sensible choice of base kernel $k(x_i, x_j \mid \theta)$. For added flexibility, we also propose to use spectral mixture base kernels (Wilson and Adams, 2013):
$$k_{\text{SM}}(x, x' \mid \theta) = \sum_{q=1}^{Q} a_q \frac{|\Sigma_q|^{1/2}}{(2\pi)^{D/2}} \exp\left(-\frac{1}{2} \|\Sigma_q^{1/2}(x - x')\|^2\right) \cos\langle x - x', 2\pi\mu_q \rangle. \tag{6}$$
The parameters of the spectral mixture kernel $\theta = \{a_q, \Sigma_q, \mu_q\}$ are mixture weights, bandwidths (inverse lengthscales), and frequencies. The spectral mixture (SM) kernel, which forms an expressive basis for all stationary covariance functions, can discover quasi-periodic stationary structure with an interpretable and succinct representation, while the deep learning transformation captures nonstationary and hierarchical structure.
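A minimal NumPy sketch of the spectral mixture kernel of Eq. (6) follows. It assumes diagonal $\Sigma_q$ (stored as a vector of bandwidths per component), which is a simplification chosen for this example; the function name `sm_kernel` and all parameter values are likewise illustrative.

```python
import numpy as np

def sm_kernel(X1, X2, weights, bandwidths, freqs):
    """Spectral mixture kernel of Eq. (6) with diagonal Sigma_q (an assumption).

    weights:    (Q,)   mixture weights a_q
    bandwidths: (Q, D) diagonal entries of Sigma_q (inverse squared lengthscales)
    freqs:      (Q, D) frequencies mu_q
    """
    D = X1.shape[1]
    tau = X1[:, None, :] - X2[None, :, :]          # pairwise differences, (n1, n2, D)
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for a_q, s_q, mu_q in zip(weights, bandwidths, freqs):
        norm = np.sqrt(np.prod(s_q)) / (2.0 * np.pi) ** (D / 2.0)
        envelope = np.exp(-0.5 * np.sum(s_q * tau**2, axis=-1))
        carrier = np.cos(2.0 * np.pi * (tau @ mu_q))
        K += a_q * norm * envelope * carrier
    return K

# Two-component example in one dimension (illustrative parameter values).
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(20, 1))
weights = np.array([1.0, 0.5])
bandwidths = np.array([[1.0], [4.0]])
freqs = np.array([[0.0], [0.5]])
K_sm = sm_kernel(X, X, weights, bandwidths, freqs)
```

With a single component at zero frequency the kernel reduces to a scaled RBF kernel, which is a convenient sanity check on any implementation.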
We use the deep kernel of Eq. (5) as the covariance function of a Gaussian process to model data $\mathcal{D}$. Conditioned on all kernel hyperparameters, we can interpret our model as applying a Gaussian process with base kernel $k_\theta$ to the final hidden layer of a deep network. Since a GP with (RBF or SM) base kernel corresponds to an infinite basis function representation, our network effectively has a hidden layer with an infinite number of hidden units. The overall model is shown in Figure 1.
We emphasize, however, that we jointly learn all deep kernel hyperparameters, $\gamma = \{w, \theta\}$, which include $w$, the weights of the network, and $\theta$, the parameters of the base kernel, by maximizing the log marginal likelihood of the Gaussian process (see Eq. (4)). Indeed compartmentalizing our model into a base kernel and deep architecture is for pedagogical clarity. When applying a Gaussian process one can use our deep kernel, which operates as a single unit, as a drop-in replacement for e.g., standard ARD or Matérn kernels (Rasmussen and Williams, 2006), since learning and inference follow the same procedures.
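The construction of Eq. (5) can be sketched in a few lines: a network maps inputs to features, and the base kernel is applied to those features. The example below is a hedged illustration only. The small fully-connected architecture, the random (untrained) weights, and the names `mlp_features` and `deep_kernel` are assumptions for this sketch; in the model described here the weights are learned jointly with the base kernel hyperparameters.

```python
import numpy as np

def mlp_features(X, params):
    """g(x, w): a small fully-connected ReLU network with a linear output layer."""
    h = X
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)   # rectified linear hidden layers
    W, b = params[-1]
    return h @ W + b                     # final-layer features

def rbf_kernel(Z1, Z2, lengthscale):
    """RBF base kernel evaluated on feature vectors."""
    sq = np.sum((Z1[:, None, :] - Z2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def deep_kernel(X1, X2, params, lengthscale):
    """Eq. (5): apply the base kernel to the network outputs g(x, w)."""
    return rbf_kernel(mlp_features(X1, params), mlp_features(X2, params), lengthscale)

# Hypothetical 8-16-16-2 architecture with random weights, for illustration only.
rng = np.random.default_rng(0)
sizes = [8, 16, 16, 2]
params = [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
X = rng.standard_normal((25, 8))
K_deep = deep_kernel(X, X, params, lengthscale=1.0)
```

Because the result is still a valid covariance matrix, it can be passed to the same predictive and marginal likelihood routines as any standard kernel.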
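For kernel learning, we use the chain rule to compute derivatives of the log marginal likelihood $\mathcal{L}$ with respect to the deep kernel hyperparameters:
$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial K_\gamma} \frac{\partial K_\gamma}{\partial \theta}, \qquad \frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial K_\gamma} \frac{\partial K_\gamma}{\partial g(x, w)} \frac{\partial g(x, w)}{\partial w}.$$
The implicit derivative of the log marginal likelihood with respect to our $n \times n$ data covariance matrix $K_\gamma$ is given by
$$\frac{\partial \mathcal{L}}{\partial K_\gamma} = \frac{1}{2} \left( K_\gamma^{-1} y y^\top K_\gamma^{-1} - K_\gamma^{-1} \right), \tag{7}$$
where we have absorbed the noise covariance $\sigma^2 I$ into our covariance matrix, and treat it as part of the base kernel hyperparameters $\theta$. $\partial K_\gamma / \partial \theta$ are the derivatives of the deep kernel with respect to the base kernel hyperparameters (such as lengthscale), conditioned on the fixed transformation of the inputs $g(x, w)$. Similarly, $\partial K_\gamma / \partial g(x, w)$ are the implicit derivatives of the deep kernel with respect to $g(x, w)$, holding $\theta$ fixed. The derivatives with respect to the weight variables, $\partial g(x, w) / \partial w$, are computed using standard backpropagation.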
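To illustrate Eq. (7) and the chain rule, the sketch below computes the gradient of the log marginal likelihood with respect to an RBF lengthscale and checks it against a central finite difference. It covers only the base kernel hyperparameter path; the network weight path would add a backpropagation step. All names and numerical values are assumptions made for this example.

```python
import numpy as np

def log_marginal(K, y):
    """L = -0.5 * [y^T K^{-1} y + log|K| + n log(2 pi)], with noise absorbed into K."""
    sign, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + len(y) * np.log(2.0 * np.pi))

def dL_dK(K, y):
    """Eq. (7): 0.5 * (K^{-1} y y^T K^{-1} - K^{-1})."""
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ y
    return 0.5 * (np.outer(alpha, alpha) - K_inv)

def kernel_and_grad(x, lengthscale, noise_var):
    """RBF kernel matrix plus noise, and its derivative w.r.t. the lengthscale."""
    sq = (x[:, None] - x[None, :]) ** 2
    K_rbf = np.exp(-0.5 * sq / lengthscale**2)
    K = K_rbf + noise_var * np.eye(len(x))
    dK_dl = K_rbf * sq / lengthscale**3
    return K, dK_dl

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 15)
y = np.sin(x) + 0.1 * rng.standard_normal(15)
ell, s2 = 0.8, 0.05

K, dK_dl = kernel_and_grad(x, ell, s2)
# Chain rule: dL/dl = sum_ij (dL/dK)_ij * (dK/dl)_ij.
grad_analytic = np.sum(dL_dK(K, y) * dK_dl)

# Central finite difference as an independent check.
eps = 1e-5
L_plus = log_marginal(kernel_and_grad(x, ell + eps, s2)[0], y)
L_minus = log_marginal(kernel_and_grad(x, ell - eps, s2)[0], y)
grad_numeric = (L_plus - L_minus) / (2.0 * eps)
```

The explicit inverse is used here for readability on a tiny problem; it would not be appropriate at the scale targeted in this paper.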
For scalability, we replace all instances of $K_\gamma$ with the KISS-GP covariance matrix (Wilson and Nickisch, 2015; Wilson et al., 2015)
$$K_\gamma \approx M K^{\text{deep}}_{U,U} M^\top := K_{\text{KISS}}, \tag{8}$$
where $M$ is a sparse matrix of interpolation weights, containing only 4 non-zero entries per row for local cubic interpolation, and $K_{U,U}$ is a covariance matrix created from our deep kernel, evaluated over $m$ latent inducing points $U = [u_i]_{i=1 \dots m}$. We place the inducing points over a regular multidimensional lattice, and exploit the resulting decomposition of $K_{U,U}$ into a Kronecker product of Toeplitz matrices for extremely fast matrix vector multiplications (MVMs), without requiring any grid structure in the data inputs $X$ or the transformed inputs $g(x, w)$. Because KISS-GP operates by creating an approximate kernel which admits fast computations, and is independent from a specific inference and learning procedure, we can view the KISS approximation applied to our deep kernels as a standalone kernel, $k_{\text{KISS}}$, which can be combined with Gaussian processes or other kernel machines for scalable learning.
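The structure of Eq. (8) can be seen in a one-dimensional sketch. To keep the code short, it swaps in local linear interpolation (two non-zero weights per row) instead of the local cubic interpolation described above, stores $M$ densely, and applies a plain RBF kernel directly to the inputs rather than a deep kernel. These simplifications, and the name `linear_interp_weights`, are assumptions for illustration.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """1-D RBF kernel matrix."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def linear_interp_weights(x, grid):
    """Interpolation matrix M with at most two non-zero entries per row.

    Stored densely here for readability; a sparse format would be used in practice.
    """
    h = grid[1] - grid[0]
    idx = np.clip(np.floor((x - grid[0]) / h).astype(int), 0, len(grid) - 2)
    frac = (x - grid[idx]) / h
    M = np.zeros((len(x), len(grid)))
    rows = np.arange(len(x))
    M[rows, idx] = 1.0 - frac
    M[rows, idx + 1] = frac
    return M

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, 200))   # unstructured inputs
U = np.linspace(0.0, 10.0, 400)            # inducing points on a regular grid
M = linear_interp_weights(x, U)
K_UU = rbf(U, U)                           # Toeplitz, since U is regularly spaced
K_kiss = M @ K_UU @ M.T
K_exact = rbf(x, x)
max_abs_err = np.max(np.abs(K_kiss - K_exact))
```

The inputs `x` have no grid structure; only the inducing points do, which is what gives $K_{U,U}$ its Toeplitz form in this one-dimensional case.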
                            RMSE                                                                          Runtime (s)
Datasets    n          d    GP-RBF       GP-SM        GP-best      DNN          DKL-RBF      DKL-SM       DNN     DKL-RBF  DKL-SM
Gas         2,565      128  0.21±0.07    0.14±0.08    0.12±0.07    0.11±0.05    0.11±0.05    0.09±0.06    7.43    7.80     10.52
Skillcraft  3,338      19   1.26±3.14    0.25±0.02    0.25±0.02    0.25±0.00    0.25±0.00    0.25±0.00    15.79   15.91    17.08
SML         4,137      26   6.94±0.51    0.27±0.03    0.26±0.04    0.25±0.02    0.24±0.01    0.23±0.01    1.09    1.48     1.92
Parkinsons  5,875      20   3.94±1.31    0.00±0.00    0.00±0.00    0.31±0.04    0.29±0.04    0.29±0.04    3.21    3.44     6.49
Pumadyn     8,192      32   1.00±0.00    0.21±0.00    0.20±0.00    0.25±0.02    0.24±0.02    0.23±0.02    7.50    7.88     9.77
PoleTele    15,000     26   12.6±0.3     5.40±0.3     4.30±0.2     3.42±0.05    3.28±0.04    3.11±0.07    8.02    8.27     26.95
Elevators   16,599     18   0.12±0.00    0.090±0.001  0.089±0.002  0.099±0.001  0.084±0.002  0.084±0.002  8.91    9.16     11.77
Kin40k      40,000     8    0.34±0.01    0.19±0.02    0.06±0.00    0.11±0.01    0.05±0.00    0.03±0.01    19.82   20.73    24.99
Protein     45,730     9    1.64±1.66    0.50±0.02    0.47±0.01    0.49±0.01    0.46±0.01    0.43±0.01    142.8   154.8    144.2
KEGG        48,827     22   0.33±0.17    0.12±0.01    0.12±0.01    0.12±0.01    0.11±0.00    0.10±0.01    31.31   34.23    61.01
CTslice     53,500     385  7.13±0.11    2.21±0.06    0.59±0.07    0.41±0.06    0.36±0.01    0.34±0.02    36.38   44.28    80.44
KEGGU       63,608     27   0.29±0.12    0.12±0.00    0.12±0.00    0.12±0.00    0.11±0.00    0.11±0.00    39.54   42.97    41.05
3Droad      434,874    3    12.86±0.09   10.34±0.19   9.90±0.10    7.36±0.07    6.91±0.04    6.91±0.04    238.7   256.1    292.2
Song        515,345    90   0.55±0.00    0.46±0.00    0.45±0.00    0.45±0.02    0.44±0.00    0.43±0.01    517.7   538.5    589.8
Buzz        583,250    77   0.88±0.01    0.51±0.01    0.51±0.01    0.49±0.00    0.48±0.00    0.46±0.01    486.4   523.3    769.7
Electric    2,049,280  11   0.230±0.000  0.053±0.000  0.053±0.000  0.058±0.002  0.050±0.002  0.048±0.002  3458    3542     4881
Table 1: RMSE and runtime on UCI regression datasets of size $n$ and input dimension $d$; RMSE values are reported ± one standard deviation. The GP "best" column denotes the best-performing kernel according to Yang et al. (2015) (note that often the best-performing kernel is GP-SM). Following Yang et al. (2015), as exact Gaussian processes are intractable on the large data used here, the Fastfood finite basis function expansions are used for approximation in GP (RBF, SM, best). We verified on the smaller datasets that exact GPs with RBF kernels provide comparable performance to the Fastfood expansions. For the smaller datasets we used a fully-connected DNN with a [d-1000-500-50-2] architecture, and for the larger datasets we used a [d-1000-1000-500-50-2] architecture. We consider scalable deep kernel learning (DKL) with RBF and SM base kernels. For the SM base kernel, the number of mixture components $Q$ is chosen according to the number of training instances, with a separate setting for larger datasets.
For inference we solve $K_{\text{KISS}}^{-1} y$ using linear conjugate gradients (LCG), an iterative procedure for solving linear systems which only involves matrix vector multiplications (MVMs). The number of iterations required for convergence to within machine precision is $j \ll n$, and in practice $j$ depends on the conditioning of the KISS-GP covariance matrix rather than the number of training points $n$. For estimating the log determinant in the marginal likelihood we follow the approach described in Wilson and Nickisch (2015) with extensions in Wilson et al. (2015).
KISS-GP training scales as $\mathcal{O}(n + h(m))$ (where $h(m)$ is typically close to linear in $m$), versus conventional scalable GP approaches which require $\mathcal{O}(m^2 n + m^3)$ (Quiñonero-Candela and Rasmussen, 2005) computations and need $m \ll n$ for tractability, which results in severe deteriorations in predictive performance. The ability to have large $m$ allows KISS-GP to have near-exact accuracy in its approximation (Wilson and Nickisch, 2015), retaining a nonparametric representation, while providing linear scaling in $n$ and $\mathcal{O}(1)$ time per test point prediction (Wilson et al., 2015). We empirically demonstrate this scalability and accuracy in our experiments of section 5.
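The key property of linear conjugate gradients, that it touches the covariance matrix only through matrix vector products, is shown in the sketch below. The solver is a textbook implementation written for this example, and the dense RBF matrix stands in for the structured KISS-GP matrix, whose fast MVM would replace the `mvm` callable; neither is taken from the original code.

```python
import numpy as np

def conjugate_gradients(mvm, b, tol=1e-10, max_iters=None):
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    r = b - mvm(x)
    rs_old = r @ r
    if rs_old == 0.0:
        return x, 0
    p = r.copy()
    max_iters = len(b) if max_iters is None else max_iters
    iters = 0
    for iters in range(1, max_iters + 1):
        Ap = mvm(p)
        step = rs_old / (p @ Ap)
        x = x + step * p
        r = r - step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x, iters

# Dense stand-in for a covariance matrix with absorbed noise (illustrative values).
x_in = np.linspace(0.0, 10.0, 300)
K = np.exp(-0.5 * (x_in[:, None] - x_in[None, :]) ** 2)
noise_var = 0.1
y = np.sin(x_in)
solution, n_iters = conjugate_gradients(lambda v: K @ v + noise_var * v, y)
```

The iteration count returned by the solver is governed by the spectrum of the matrix, which is why it can be far smaller than the number of training points.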
5 Experiments
We evaluate the proposed deep kernel learning method on a wide range of regression problems, including a large and diverse collection of regression tasks from the UCI repository (section 5.1), orientation extraction from face patches (section 5.2), magnitude recovery of handwritten digits (section 5.3), and step function recovery (section 5.4). We show that the proposed algorithm substantially outperforms Gaussian processes with expressive kernel learning approaches, and deep neural networks, without any significant increases in computational overhead.
All experiments were performed on a Linux machine with eight 4.0GHz CPU cores and 32GB RAM. We implemented DNNs based on Caffe (Jia et al., 2014), a general deep learning platform, and KISS-GP (Wilson and Nickisch, 2015; Wilson et al., 2015) leveraging GPML (Rasmussen and Nickisch, 2010; www.gaussianprocess.org/gpml).
For our deep kernel learning model, we first train a deep neural network using SGD with the squared loss objective and rectified linear activation functions. After the neural network has been pretrained, a KISS-GP model is fitted using the top-level features of the DNN model as inputs. Using this pretraining initialization, our joint deep kernel learning (DKL) model of section 4 is then trained by optimizing all the hyperparameters of our deep kernel, by backpropagating derivatives through the marginal likelihood of the Gaussian process (see Eq. (4)).
5.1 UCI regression tasks
We consider a large set of UCI regression problems of varying sizes and properties. Table 1 reports test root mean squared error (RMSE) for 1) many scalable Gaussian process kernel learning methods based on Fastfood (Yang et al., 2015); 2) standalone deep neural networks (DNNs); and 3) our proposed combined deep kernel learning (DKL) model using both RBF and SM base kernels.
For smaller datasets we used a fully-connected neural network with a d-1000-500-50-2 architecture; for larger datasets we used a d-1000-1000-500-50-2 architecture. (We found [d-1000-1000-500-50] architectures provide a similar level of performance, but scalable Kronecker algebra is most effective if the network maps into low-dimensional spaces.)
Table 1 shows that on most of the datasets, our DKL method strongly outperforms not only Gaussian processes with the standard RBF kernel, but also the best-performing kernels selected from a wide range of alternative kernel learning procedures (Yang et al., 2015).
We further compared DKL to standalone deep neural networks which have the exact same architecture as the DNN component of DKL. By combining KISS-GP with DNNs as part of a joint DKL procedure, we obtain consistently better results than standalone deep learning over all 16 datasets. Moreover, using a spectral mixture base kernel (Eq. (6)) to create a deep kernel provides notable additional performance improvements. It is interesting to observe that by effectively learning the salient features from raw data, plain DNNs generally achieve competitive performance compared to expressive Gaussian processes. Combining the complementary advantages of these approaches into scalable deep kernels consistently brings substantial additional performance gains.
We next investigate the runtime of DKL. Table 1, right panel, compares DKL with a standalone DNN in terms of runtime for evaluating the objective and derivatives (i.e., one forward and backpropagation pass for DNN; one computation of the marginal likelihood and all relevant derivatives for DNN-KISS-GP). We see that in addition to improving accuracy, combining KISS-GP with DNNs for deep kernels introduces only negligible runtime costs: KISS-GP imposes an additional runtime of about 10% of (one order of magnitude less than) the runtime a DNN typically requires. Overall, these results show the general applicability and practical significance of our scalable DKL approach.
5.2 Face orientation extraction
We now consider the task of predicting the orientation of a face extracted from a gray-scale image patch, explored in Salakhutdinov and Hinton (2008). We investigate our DKL procedure for efficiently learning meaningful representations from high-dimensional, highly structured image data.
The Olivetti face data set contains ten 64×64 images of forty different people, for 400 images total. Following Salakhutdinov and Hinton (2008), we constructed datasets of 28×28 images by randomly rotating (uniformly over a fixed range of angles), cropping, and subsampling the original 400 images. We then randomly select 30 people uniformly and collect their images as training data, while using the images of the remaining 10 people as test data. Figure 2 shows randomly sampled examples from the training and test data.
Datasets  GP     DBN-GP  CNN   DKL
Olivetti  16.33  6.42    6.34  6.07
MNIST     1.25   1.03    0.59  0.53
Table 2: RMSE on the Olivetti and MNIST datasets.
For training DKL on the Olivetti face patches we used a convolutional network consisting of 2 convolutional layers followed by 4 fully-connected layers, mapping a face patch to a 2-dimensional feature vector, with an SM base kernel. We describe this convolutional architecture in detail in the appendix.
Table 2 shows the RMSE of the predicted face orientations using four models. The DBN-GP model, proposed by Salakhutdinov and Hinton (2008), first extracts features from raw data using a Deep Belief Network (DBN), and then applies a Gaussian process with an RBF kernel. However, their approach could only handle up to a few thousand labelled datapoints, due to the $\mathcal{O}(n^3)$ complexity of standard Gaussian processes. The remaining data were modeled through unsupervised learning of a DBN, leaving the large amount of available labels unused.
Our proposed deep kernel methods, by contrast, scale linearly with the size of training data, and are capable of directly modeling the full labeled data to accurately recover salient patterns. Figure 2, right panel, shows that the deep kernel discovers features essential for orientation prediction, while filtering out irrelevant factors such as identities and scales.
Figure 3, left panel, further validates the benefit of scaling to large data. As more training data are used, our model continues to increase in accuracy. Indeed, it is the large datasets that will provide the greatest opportunities for our model to discover expressive statistical representations.
In Figure 4 we show the spectral density (the Fourier transform) of the base kernels learned through our deep kernel learning method. The expressive spectral mixture (SM) kernel discovers a structure with two peaks in the frequency domain. The RBF kernel is only able to use a single Gaussian in the spectral domain, centred at the origin. In an attempt to capture the significant mass at the peak away from the origin, the RBF kernel spectral density spreads itself across the whole frequency domain, missing the important local correlations near the other peak, thus erroneously discarding much of the network features as white noise, since a broad spectral peak corresponds to a short lengthscale. This result provides intuition for why spectral mixture base kernels generally perform much better than RBF base kernels, despite the flexibility of the deep architecture.
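The relationship between lengthscale and spectral width can be checked numerically. The sketch below is an illustration under stated assumptions: it works in one dimension, parameterises the SM density directly in the frequency domain (weights, peak locations, and variances, which corresponds to Eq. (6) up to a reparameterisation of the weights and bandwidths), and uses made-up parameter values rather than the learned kernels of Figure 4.

```python
import numpy as np

def rbf_spectral_density(s, lengthscale):
    """Fourier transform of exp(-0.5 * tau^2 / l^2): a Gaussian centred at s = 0."""
    return (np.sqrt(2.0 * np.pi) * lengthscale
            * np.exp(-2.0 * np.pi**2 * lengthscale**2 * s**2))

def sm_spectral_density(s, weights, means, variances):
    """Symmetrised Gaussian mixture with peaks at +/- mu_q and variance v_q."""
    def gauss(s, mu, var):
        return np.exp(-0.5 * (s - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return sum(0.5 * w * (gauss(s, mu, v) + gauss(s, -mu, v))
               for w, mu, v in zip(weights, means, variances))

def numerical_spectrum(kernel_fn, s, tau_max=60.0, n=120001):
    """Riemann-sum approximation of the Fourier transform of a stationary kernel."""
    tau = np.linspace(-tau_max, tau_max, n)
    d_tau = tau[1] - tau[0]
    k = kernel_fn(tau)
    return np.array([np.sum(k * np.cos(2.0 * np.pi * si * tau)) * d_tau for si in s])

s = np.linspace(0.0, 2.0, 9)

# RBF kernel with lengthscale 0.5: analytic density versus numerical transform.
rbf_analytic = rbf_spectral_density(s, 0.5)
rbf_numeric = numerical_spectrum(lambda t: np.exp(-0.5 * t**2 / 0.5**2), s)

# One SM component with a peak at frequency 1 and spectral variance 0.01.
sm_analytic = sm_spectral_density(s, [1.0], [1.0], [0.01])
sm_numeric = numerical_spectrum(
    lambda t: np.exp(-2.0 * np.pi**2 * 0.01 * t**2) * np.cos(2.0 * np.pi * 1.0 * t), s)
```

The RBF density has standard deviation $1/(2\pi\ell)$ in frequency, so covering a peak far from the origin forces a small $\ell$, whereas the SM component can place a narrow peak there directly.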
We further see the benefit of an SM base kernel in Figure 5, where we show the learned covariance matrices constructed from the whole deep kernels (composition of base kernel and deep architecture) for RBF and SM base kernels. The covariance matrix is evaluated on a set of test inputs, where we randomly sample 400 instances from the test set and sort them in terms of the orientation angles of the input faces. We see that the deep kernels with both RBF and SM base kernels discover that faces with similar rotation angles are highly correlated, concentrating their largest entries on the diagonal (i.e., face pairs with similar orientations). Deep kernel learning with an SM base kernel captures these correlations more strongly than the RBF base kernel, which is somewhat more diffuse.
In Figure 5, right panel, we also show the learned covariance matrix for an RBF kernel with a standard Gaussian process applied to the raw data inputs. We see that the entries are very diffuse. In essence, through deep kernel learning, we can learn a metric where faces with similar rotation angles are highly correlated, and thus overcome the fundamental limitations of a Euclidean distance metric (used by standard kernels), where similar rotation angles are not particularly correlated, regardless of what hyperparameters are learned with Euclidean kernels.
We next measure the scalability of our model. Figure 3, middle panel, shows the runtimes in seconds, as a function of training instances, for evaluating the objective and any relevant derivatives. We see that, with the scalable KISS-GP, the joint model achieves a roughly linear asymptotic scaling, with a slope of 1. In Figure 3, right panel, we show how the total training time (i.e., the time for CNN pretraining plus the time for DKL with CNN architecture joint training) changes as the data size $n$ varies. In addition to the linear scaling which is necessary for modeling large data, the added time in combining KISS-GP with CNNs is reasonable, especially considering the gains in performance and expressive power.
5.3 Digit magnitude extraction
We map images of handwritten digits to a single real value that is as close as possible to the integer represented by the digit in the image, as in Salakhutdinov and Hinton (2008). The MNIST digit dataset contains 60,000 training and 10,000 test images of ten handwritten digits (0 to 9). We used a convolutional neural network with an architecture similar to LeNet (LeCun et al., 1998) (detailed in the appendix). Table 2 shows that a CNN performs considerably better than GP and DBN-GP, and DKL (with CNN architecture) further improves over CNN.
5.4 Step function recovery
We have so far considered RMSE for comparison to alternative methods where posterior predictive distributions are not readily available, or on problems where RMSE has historically been used as a benchmark. However, an advantage of DKL over standalone deep architectures is the ability to naturally produce a posterior predictive distribution, which is especially useful in applications such as reinforcement learning and Bayesian optimisation. In Figure 6, we consider an example where we use DKL to learn the posterior predictive distribution for a step function with many challenging discontinuities. This problem is particularly difficult for conventional Gaussian process approaches, due to strong smoothness assumptions intrinsic to popular kernels. GPs with SM kernels improve upon RBF kernels, but neither can properly adapt to the many sharp changes in covariance structure. By contrast, the DKL-SM model accurately encodes the discontinuities of the function, and has reasonable uncertainty over the whole domain.
6 Discussion
We have explored scalable deep kernels, which combine the structural properties of deep architectures with the nonparametric flexibility of kernel methods. In particular, we transform the inputs of a base kernel with a deep architecture, and then leverage local kernel interpolation, inducing points, and structure exploiting algebra (e.g., Kronecker and Toeplitz methods) for a scalable kernel representation. These scalable kernels can then be combined with Gaussian process inference and learning procedures at both training and test time. Moreover, we use spectral mixture covariances as a base kernel, which provides a significant additional boost in representational power. Overall, our scalable deep kernels can be used in place of standard kernels, following the same inference and learning procedures, but with benefits in expressive power and efficiency. We show on a wide range of experiments the general applicability and practical significance of our approach, consistently outperforming scalable GPs with expressive kernels, and standalone DNNs.
A major challenge in developing expressive kernel learning approaches is the Euclidean and absolute distance based metrics which are pervasive in most families of kernel functions, such as the ARD and Matérn kernels. Indeed, although intuitive in some cases, one cannot expect Euclidean and absolute distance as measures of similarity to be generally applicable, and they are especially problematic in high dimensional input spaces (Aggarwal et al., 2001). Modern approaches attempt to learn a flexible parametric family, for example, through weighted combinations of known kernels (e.g., Gönen and Alpaydın, 2011), but are still fundamentally limited to these standard notions of distance. As we have seen in the Olivetti faces examples, our approach allows for the whole functional form of the metric to be learned in a flexible manner, through expressive transformations of the input space. We expect such metric learning to be particularly valuable in high dimensional classification problems, which we view as a promising direction for future research. We hope that this work will help bring together research on neural networks and kernel methods, to inspire many new models and unifying perspectives which combine the complementary advantages of these approaches.
References
 Aggarwal et al. (2001) Aggarwal, C. C., Hinneburg, A., and Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. Springer.
 Bengio (2009) Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning.
 Calandra et al. (2014) Calandra, R., Peters, J., Rasmussen, C. E., and Deisenroth, M. P. (2014). Manifold Gaussian processes for regression. arXiv preprint arXiv:1402.5876.
 Damianou and Lawrence (2013) Damianou, A. and Lawrence, N. (2013). Deep Gaussian processes. In Artificial Intelligence and Statistics.
 Gönen and Alpaydın (2011) Gönen, M. and Alpaydın, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268.
 Hinton et al. (2012) Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82–97.
 Hinton et al. (2006) Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.
 Huang et al. (2015) Huang, W., Zhao, D., Sun, F., Liu, H., and Chang, E. (2015). Scalable Gaussian process regression using deep neural networks. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3576–3582. AAAI Press.
 Jia et al. (2014) Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
 Kiros et al. (2014) Kiros, R., Salakhutdinov, R., and Zemel, R. (2014). Unifying visual-semantic embeddings with multimodal neural language models. TACL.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
 Le et al. (2013) Le, Q., Sarlos, T., and Smola, A. (2013). Fastfood - computing Hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, pages 244–252.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
 Lloyd et al. (2014) Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. (2014). Automatic construction and natural-language description of nonparametric regression models. In Association for the Advancement of Artificial Intelligence (AAAI).
 MacKay (1998) MacKay, D. J. (1998). Introduction to Gaussian processes. In Bishop, C. M., editor, Neural Networks and Machine Learning, chapter 11, pages 133–165. Springer-Verlag.
 Micchelli et al. (2006) Micchelli, C. A., Xu, Y., and Zhang, H. (2006). Universal kernels. The Journal of Machine Learning Research, 7:2651–2667.
 Neal (1996) Neal, R. (1996). Bayesian Learning for Neural Networks. Springer Verlag.
 Quiñonero-Candela and Rasmussen (2005) Quiñonero-Candela, J. and Rasmussen, C. (2005). A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959.
 Rasmussen and Ghahramani (2001) Rasmussen, C. E. and Ghahramani, Z. (2001). Occam’s razor. In Neural Information Processing Systems (NIPS).
 Rasmussen and Nickisch (2010) Rasmussen, C. E. and Nickisch, H. (2010). Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research (JMLR), 11:3011–3015.
 Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian processes for Machine Learning. The MIT Press.
 Saatchi (2011) Saatchi, Y. (2011). Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge.
 Salakhutdinov and Hinton (2008) Salakhutdinov, R. and Hinton, G. (2008). Using deep belief nets to learn covariance kernels for Gaussian processes. Advances in Neural Information Processing Systems, 20:1249–1256.
 Snoek et al. (2015) Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Ali, M., and Adams, R. P. (2015). Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning.
 Socher et al. (2011) Socher, R., Huang, E., Pennington, J., Ng, A., and Manning, C. (2011). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems 24, pages 801–809.

 Wilson (2014) Wilson, A. G. (2014). Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. PhD thesis, University of Cambridge. http://www.cs.cmu.edu/~andrewgw/andrewgwthesis.pdf.
 Wilson and Adams (2013) Wilson, A. G. and Adams, R. P. (2013). Gaussian process kernels for pattern discovery and extrapolation. International Conference on Machine Learning (ICML).

 Wilson et al. (2015) Wilson, A. G., Dann, C., and Nickisch, H. (2015). Thoughts on massively scalable Gaussian processes. arXiv preprint 1511.01870. http://arxiv.org/abs/1511.01870.
 Wilson et al. (2012) Wilson, A. G., Knowles, D. A., and Ghahramani, Z. (2012). Gaussian process regression networks. In International Conference on Machine Learning (ICML), Edinburgh. Omnipress.
 Wilson and Nickisch (2015) Wilson, A. G. and Nickisch, H. (2015). Kernel interpolation for scalable structured Gaussian processes (KISS-GP). International Conference on Machine Learning (ICML).
 Xu et al. (2015) Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. ICML.
 Yang et al. (2014) Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., and Wang, Z. (2014). Deep fried convnets. arXiv preprint arXiv:1412.7149.
 Yang et al. (2015) Yang, Z., Smola, A. J., Song, L., and Wilson, A. G. (2015). À la carte - learning fast kernels. Artificial Intelligence and Statistics.
Appendix A Appendix
A.1 Convolutional network architecture
Table 3 lists the architecture of the convolutional networks used in the tasks of face orientation extraction (section 5.2) and digit magnitude extraction (section 5.3). The CNN architecture is derived from LeNet (LeCun et al., 1998), which was designed for digit classification, and is adapted to the above tasks by adding one or two fully-connected layers for feature transformation.
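| Layer | conv1 | pool1 | conv2 | pool2 | full3 | full4 | full5 | full6 |
|---|---|---|---|---|---|---|---|---|
| Kernel size | 5×5 | 2×2 | 5×5 | 2×2 | - | - | - | - |
| Stride | 1 | 2 | 1 | 2 | - | - | - | - |
| Channels | 20 | 20 | 50 | 50 | 1000 | 500 | 50 | 2 |

Table 3: Convolutional network architecture. pool1 and pool2 are max pooling layers. A ReLU layer is placed after full3 and full4.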
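To make the layer specification in Table 3 concrete, the following sketch traces feature-map shapes through the network. Two details are assumptions that this appendix does not state: a single-channel 28×28 input, and unpadded ("valid") convolutions and pooling windows. The helper names `conv_out` and `trace_shapes` are hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded ('valid') convolution or pooling window."""
    return (size - kernel) // stride + 1

# (name, kind, kernel size, stride, output channels), transcribed from Table 3.
LAYERS = [
    ("conv1", "conv", 5, 1, 20),
    ("pool1", "pool", 2, 2, 20),
    ("conv2", "conv", 5, 1, 50),
    ("pool2", "pool", 2, 2, 50),
    ("full3", "full", None, None, 1000),
    ("full4", "full", None, None, 500),
    ("full5", "full", None, None, 50),
    ("full6", "full", None, None, 2),
]

def trace_shapes(input_size=28, input_channels=1):
    """Return (name, channels, height, width) after each layer.

    input_size and input_channels are assumed values, not taken from the paper.
    """
    size, channels = input_size, input_channels
    shapes = []
    for name, kind, kernel, stride, out_channels in LAYERS:
        if kind == "full":
            size = 1  # fully-connected layers flatten the spatial dimensions
        else:
            size = conv_out(size, kernel, stride)
        channels = out_channels
        shapes.append((name, channels, size, size))
    return shapes

for name, c, h, w in trace_shapes():
    print(f"{name}: {c} x {h} x {w}")
```

Under these assumptions, pool2 produces 50 feature maps of size 4×4, so full3 receives 800 flattened inputs, and the final layer, full6, outputs a 2-dimensional feature vector.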