1 Introduction
Large datasets provide great opportunities to learn rich statistical representations, for accurate predictions and new scientific insights into our modeling problems. Gaussian processes are promising for large data problems, because they can grow their information capacity with the amount of available data, in combination with automatically calibrated model complexity (Rasmussen and Ghahramani, 2001; Wilson, 2014).
From a Gaussian process perspective, all of the statistical structure in data is learned through a kernel function. Popular kernel functions, such as the RBF kernel, provide smoothing and interpolation, but cannot learn representations necessary for long range extrapolation (Rasmussen and Williams, 2006; Wilson, 2014)
. With smoothing kernels, we can only use the information in a large dataset to learn about noise and lengthscale hyperparameters, which tell us only how quickly correlations in our data vary with distance in the input space. If we learn a short lengthscale hyperparameter, then by definition we will only make use of a small amount of training data near each testing point. If we learn a long lengthscale, then we could subsample the data and make similar predictions.
Therefore to fully use the information in large datasets, we must build kernels with great representational power and useful learning biases, and scale these approaches without sacrificing this representational ability. Indeed many recent approaches have advocated building expressive kernel functions (e.g., Rasmussen and Williams, 2006; Gönen and Alpaydın, 2011; Wilson and Adams, 2013; Wilson, 2014; Lloyd et al., 2014; Yang et al., 2015), and emerging research in this direction takes inspiration from deep learning models (e.g., Wilson et al., 2012; Damianou and Lawrence, 2013; Calandra et al., 2014). However, the scalability, general applicability, and interpretability of such approaches remain a challenge. Recently, Wilson et al. (2016)
proposed simple and scalable deep kernels for singleoutput regression problems, with promising performance on many experiments. But their approach does not allow for stochastic training, multiple outputs, deep architectures with many output features, or classification. And it is on classification problems, in particular, where we typically have high dimensional input vectors, with little intuition about how these vectors should correlate, and therefore most want to
learn a flexible nonEuclidean similarity metric (Aggarwal et al., 2001).In this paper, we introduce inference procedures and propose a new deep kernel learning model which enables (1) classification and nonGaussian likelihoods; (2) multitask learning^{1}^{1}1We follow the GP convention where multitask learning involves a function mapping a single input to multiple correlated
output responses (class probabilities, regression responses, etc.). Unlike NNs which naturally have correlated outputs by sharing hidden basis functions (and multitask can have a more specialized meaning), most GP models perform multiple binary classification, ignoring correlations between output classes. Even applying a GP to NN features for deep kernel learning does not naturally produce multiple correlated outputs.
; (3) stochastic gradient minibatch training; (4) deep architectures with many output features; (5) additive covariance structures; and (5) greatly enhanced scalability.We propose to use additive base kernels corresponding to Gaussian processes (GPs) applied to subsets of output features of a deep neural architecture. We then linearly mix these Gaussian processes, inducing correlations across multiple output variables. The result is a deep probabilistic neural network, with a hidden layer composed of additive sets of infinite basis functions, linearly mixed to produce correlated output variables. All parameters of the deep architecture and base kernels are
jointly learned through a marginal likelihood objective, having integrated away all GPs. For scalability and nonGaussian likelihoods, we derive stochastic variational inference (SVI) which leverages local kernel interpolation, inducing points, and structure exploiting algebra, and a hybrid sampling scheme, building on Wilson and Nickisch (2015), Wilson et al. (2015), Titsias (2009), Hensman et al. (2013), and Nickson et al. (2015). The resulting approach, SVDKL, has a complexity of for inducing points and input dimensions, versus the standard for efficient stochastic variational methods.We achieve good predictive accuracy and scalability over a wide range of classification tasks, while retaining a straightforward, general purpose, and highly practical probabilistic nonparametric representation, with code available at https://people.orie.cornell.edu/andrew/code.
2 Background
Throughout this paper, we assume we have access to vectorial inputoutput pairs , where each is related to through a Gaussian process and observation model. For example, in regression, one could model , where is a latent vector of independent Gaussian processes , and is a noise covariance matrix.
The computational bottleneck in working with Gaussian processes typically involves computing and over an covariance matrix evaluated at training inputs . Standard procedure is to compute the Cholesky decomposition of , which incurs computations and storage, after which predictions cost per test point. Gaussian processes are thus typically limited to at most a few thousand training points. Many promising approaches to scalability have been explored, for example, involving randomized methods (Rahimi and Recht, 2007; Le et al., 2013; Yang et al., 2015) , and low rank approximations (Silverman, 1985; QuiñoneroCandela and Rasmussen, 2005). Wilson and Nickisch (2015) recently introduced the KISSGP approximate kernel matrix , which admits fast computations, given the exact kernel matrix evaluated on a latent multidimensional lattice of inducing inputs , and , a sparse interpolation matrix. Without requiring any grid structure in , decomposes into a Kronecker product of Toeplitz matrices, which can be approximated by circulant matrices (Wilson et al., 2015). Exploiting such structure in combination with local kernel interpolation enables one to use many inducing points, resulting in nearexact accuracy in the kernel approximation, and inference. Unfortunately, this approach does not typically apply to dimensional inputs (Wilson et al., 2015).
Moreover, the Gaussian process marginal likelihood does not factorize, and thus stochastic gradient descent does not ordinarily apply. To address this issue,
Hensman et al. (2013) extended the variational approach from Titsias (2009) and derived a stochastic variational GP posterior over inducing points for a regression model which does have the required factorization for stochastic gradient descent. Hensman et al. (2015b), Hensman et al. (2015a), and Dezfouli and Bonilla (2015)further combine this with a sampling procedure for estimating nonconjugate expectations. These methods have
sampling complexity which becomes prohibitive where many inducing points are desired for accurate approximation. Nickson et al. (2015) consider Kronecker structure in the stochastic approximation of Hensman et al. (2013) for regression, but do not leverage local kernel interpolation or sampling.To address these limitations, we introduce a new deep kernel learning model for multitask classification, minibatch training, and scalable kernel interpolation which does not require low dimensional input spaces. In this paper, we view scalability and flexibility as two sides of one coin: we most want the flexible models on the largest datasets, which contain the necessary information to discover rich statistical structure. We show that the resulting approach can learn very expressive and interpretable kernel functions on large classification datasets, containing millions of training points.
3 Deep Kernel Learning for Multitask Classification
We propose a new deep kernel learning approach to account for classification and nonGaussian likelihoods, multiple correlated outputs, additive covariances, and stochastic gradient training.
We propose to build a probabilistic deep network as follows: 1) a deep nonlinear transformation
, parametrized by weights , is applied to the observed input variable , to produce features at the final layer , ; 2) Gaussian processes, with base kernels , are applied to subsets of these features, corresponding to an additive GP model (e.g., Durrande et al., 2011). The base kernels can thus act on relatively low dimensional inputs, where local kernel interpolation and learning biases such as similarities based on Euclidean distance are most natural; 3) these GPs are linearly mixed by a matrix , and transformed by an observation model, to produce the output variables . The mixing of these variables through produces correlated multiple outputs, a multitask property which is uncommon in Gaussian processes or SVMs. The structure of this network is illustrated in Figure 1. Critically, all of the parameters in the model (including base kernel hyperparameters) are trained through optimizing a marginal likelihood, having integrated away the Gaussian processes, through the variational inference procedures described in section 4.For classification, we consider a special case of this architecture. Let be the number of classes, and we have data , where is a oneshot encoding of the class label. We use the softmax observation model:
(1) 
where is a vector of independent Gaussian processes followed by a linear mixing layer ; and is the indicator vector with the th element being 1 and the rest 0.
For the th Gaussian process in the additive GP layer, let be the latent functions on the input data features. By introducing a set of latent inducing variables indexed by inducing inputs , we can write (e.g., QuiñoneroCandela and Rasmussen, 2005)
(2) 
Substituting the local interpolation approximation of Wilson and Nickisch (2015) into Eq. (5), we find ; it therefore follows that . In section 4 we exploit this deterministic relationship between and , governed by the sparse matrix , to derive a particularly efficient stochastic variational inference procedure.
4 Structure Exploiting Stochastic Variational Inference
Exact inference and learning in Gaussian processes with a nonGaussian likelihood is not analytically tractable. Variational inference is an appealing approximate technique due to its automatic regularization to avoid overfitting, and its ability to be used with stochastic gradient training, by providing a factorized approximation to the Gaussian process marginal likelihood. We develop our stochastic variational method equipped with a fast sampling scheme for tackling any intractable marginalization.
Let be the collection of the inducing variables of the additive GPs. We assume a variational posterior over the inducing variables . By Jensen’s inequality we have
(3) 
where we have omitted the mixing weights for clarity. The KL divergence term can be interpreted as a regularizer encouraging the approximate posterior to be close to the prior . We aim at tightening the marginal likelihood lower bound which is equivalent to minimizing the KL divergence from to the true posterior.
Since the likelihood function typically factorizes over data instances: , we can optimize the lower bound with stochastic gradients. In particular, we specify for the independent GPs, and iteratively update the variational parameters and the kernel and deep network parameters using a noisy approximation of the gradient of the lower bound on minibatches of the full data. Henceforth we omit the index for clarity.
Unfortunately, for general nonGaussian likelihoods the expectation in Eq (7) is usually intractable. We develop a sampling method for tackling this intractability which is highly efficient with structured reparameterization, local kernel interpolation, and structure exploiting algebra.
Using local kernel interpolation, the latent function is expressed as a deterministic local interpolation of the inducing variables (section 3). This result allows us to work around any difficult approximate posteriors on which typically occur in variational approaches for GPs. Instead, our sampler only needs to account for the uncertainty on . The direct parameterization of yields a straightforward and efficient sampling procedure. The latent function samples (indexed by ) are then computed directly through interpolation .
As opposed to conventional meanfield methods, which assume a diagonal variational covariance matrix, we use the Cholesky decomposition for reparameterizing in order to preserve structures within the covariance. Specifically, we let , resulting in the following sampling procedure:
Each step of the above standard sampler has complexity of , where is the number of inducing points. Due to the matrix vector product, this sampling procedure becomes prohibitive in the presence of many inducing points, which are required for accuracy on large datasets with multidimensional inputs – particularly if we have an expressive kernel function (Wilson and Nickisch, 2015).
We scale up the sampler by leveraging the fact that the inducing points are placed on a grid (taking advantage of both Toeplitz and circulant structure), and additionally imposing a Kronecker decomposition on , where is the input dimension of the base kernel. With the fast Kronecker matrixvector products, we reduce the above sampling cost of to . Our approach thus greatly improves over previous stochastic variational methods which typically scale with complexity, as discussed shortly.
Note that the KL divergence term between the two Gaussians in Eq (7) has a closed form without the need for Monte Carlo estimation. Computing the KL term and its derivatives, with the Kronecker method, is . With samples of and a minibatch of data points of size , we can estimate the marginal likelihood lower bound as
(4) 
and the derivatives w.r.t the model hyperparameters and the variational parameters can be taken similarly. We provide the detailed derivation in the supplement.
Although a small body of pioneering work has developed stochastic variational methods for Gaussian processes, our approach distinctly provides the above representationpreserving variational approximation, and exploits algebraic structure for significant advantages in scalability and accuracy. In particular, a similar variational lower bound as in Eq (7) was proposed in Titsias (2009); Hensman et al. (2013) for a sparse GP, which were extended to nonconjugate likelihoods, with the intractable integrals estimated using Gaussian quadrature as in the KLSPGP Hensman et al. (2015a) or univariate Gaussian samples as in the SAVIGP Dezfouli and Bonilla (2015). Hensman et al. (2015b) estimates nonconjugate expectations with a hybrid Monte Carlo sampler (denoted as MCGP). The computations in these approaches can be costly, with complexity, due to a complicated variational posterior over as well as the expensive operations on the full inducing point matrix. In addition to its increased efficiency, our sampling scheme is much simpler, without introducing any additional tuning parameters. We empirically compare with these methods and show the practical significance of our algorithm in section 5.
Variational methods have also been used in GP regression for stochastic inference (e.g., (Nickson et al., 2015; Hensman et al., 2013)), and some of the most recent work in this area applied variational autoencoders (Kingma and Welling, 2013) for coupled variational updates (aka back constraints) Dai et al. (2015); Bui and Turner (2015). We note that these techniques are orthogonal and complementary to our inference approach, and can be leveraged for further enhancements.
5 Experiments
We evaluate our proposed approach, stochastic variational deep kernel learning (SVDKL), on a wide range of classification problems, including an airline delay task with over 5.9 million data points (section 5.1), a large and diverse collection of classification problems from the UCI repository (section 5.2), and image classification benchmarks (section 5.3). Empirical results demonstrate the practical significance of our approach, which provides consistent improvements over standalone DNNs, while preserving a GP representation, and dramatic improvements in speed and accuracy over modern state of the art GP models. We use classification accuracy when comparing to DNNs, because it is a standard for evaluating classification benchmarks with DNNs. However, we also compute the negative log probability (NLP) values (supplement), which show similar trends.
All experiments were performed on a Linux machine with eight 4.0GHz CPU cores, one Tesla K40c GPU, and 32GB RAM. We implemented deep neural networks with Caffe
(Jia et al., 2014).Model Training
For our deep kernel learning model, we used deep neural networks which produce dimensional toplevel features. Here is the number of classes. We place a Gaussian process on each dimension of these features. We used RBF base kernels. The additive GP layer is then followed by a linear mixing layer . We initialized
to be an identity matrix, and optimized in the joint learning procedure to recover crossdimension correlations from data.
We first train a deep neural network using SGD with the softmax loss objective, and rectified linear activation functions. After the neural network has been pretrained, we fit an additive KISSGP layer, followed by a linear mixing layer, using the toplevel features of the deep network as inputs. Using this pretraining
initialization, our joint SVDKL model of section 3 is then trained through the stochastic variational method of section 4 which jointly optimizes all the hyperparametersof the deep kernel (including all network weights), as well as the variational parameters, by backpropagating derivatives through the proposed marginal likelihood lower bound of the additive Gaussian process in section
4. In all experiments, we use a relatively large minibatch size (specified according to the full data size), enabled by the proposed structure exploiting variational inference procedures. We achieve good performance setting the number of samples in Eq. 4 for expectation estimation in variational inference, which provides additional confirmation for a similar observation in (Kingma and Welling, 2013).5.1 Airline Delays
We first consider a large airline dataset consisting of flight arrival and departure details for all commercial flights within the US in 2008. The approximately 5.9 million records contain extensive information about the flights, including the delay in reaching the destination. Following Hensman et al. (2015a), we consider the task of predicting whether a flight was subject to delay based on 8 features (e.g., distance to be covered, day of the week, etc).
Classification accuracy
Table 1 reports the classification accuracy of 1) KLSPGP Hensman et al. (2015a)
, a recent scalable variational GP classifier as discussed in section
4; 2) standalone deep neural network (DNN); 3) DNN+, a standalone DNN with an extra fullyconnected hidden layer with , defined as in Figure 1; 4) DNN+GP which is a GP applied to a pretrained DNN (with same architecture as in 2); and 5) our stochastic variational DKL method (SVDKL) (same DNN architecture as in 2). For DNN, we used a fullyconnected architecture with layers d1000100050050c.^{2}^{2}2We obtained similar results with other DNN architectures (e.g., d100010005005020c).The DNN component of the SVDKL model has the exact same architecture. The SVDKL joint training was conducted using a large minibatch size of 50,000 to reduce the variance of the stochastic gradient. We can use such a large minibatch in each iteration (which is daunting for regular GP even as a whole dataset) due to the efficiency of our inference strategy within each minibatch, leveraging structure exploiting algebra.
From the table we see that SVDKL outperforms both the alternative variational GP model (KLSPGP) and the standalone deep network. DNN+GP outperforms standalone DNNs, showing the nonparametric flexibility of kernel methods. By combining KISSGP with DNNs as part of a joint SVDKL procedure, we obtain better results than DNN and DNN+GP. Besides, both the plain DNN and SVDKL notably improve on standalone GPs, indicating a superior capacity of deep architectures to learn representations from large but finite training sets, despite the asymptotic approximation properties of Gaussian processes. By contrast, adding an extra hidden layer, as in DNN+, does not improve performance.
Figure 2 further studies how performance changes as data size increases. We observe that the proposed SVDKL classifier trained on 1/50 of the data already can reach a competitive accuracy as compared to the KLSPGP model trained on the full dataset. As the number of the training points increases, the SVDKL and DNN models continue to improve. This experiment demonstrates the value of expressive kernel functions on large data problems, which can effectively capture the extra information available as seeing more training instances. Furthermore, SVDKL consistently provides better performance over the plain DNN, through its nonparametric flexibility.
Scalability
We next measure the scalability of our variational DKL in terms of the number of inducing points in each GP. Figure 2 shows the runtimes in seconds, as a function of , for evaluating the objective and any relevant derivatives. We compare our structure exploiting variational method with the scalable variational inference in KLSPGP, and the MCMCbased variational method in MCGP Hensman et al. (2015b). We see that our inference approach is far more efficient than previous scalable algorithms. Moreover, when the number of inducing points is not too large (e.g., ), the added time for SVDKL over DNN is reasonable (e.g., 0.39s vs. 0.27s), especially considering the gains in performance and expressive power. Figure 2 shows the runtime scaling of different variational methods as grows. We can see that the runtime of our approach increases only slowly in a wide range of (), greatly enhancing the scalability over the other methods. This empirically validates the improved time complexity of our new inference method as presented in section 4.
We next investigate the total training time of the models. Table 1, right panel, lists the time cost of training KLSPGP, DNN, and SVDKL; and Figure 2 shows how the training time of SVDKL and DNN changes as more training data is presented. We see that on the full dataset DKL, as a GP model, saves over 60% time as compared to the modern state of the art KLSPGP, while at the same time achieving over an 18% improvement in predictive accuracy. Generally, the training time of SVDKL increases slowly with growing data sizes, and has only modest additional overhead compared to standalone architectures, justified by improvements in performance, and the general benefits of a nonparametric probabilistic representation. Moreover, the DNN was fully trained on a GPU, while in SVDKL the base kernel hyperparameters and variational parameters were optimized on a CPU. Since most updates of the SVDKL parameters are computed in matrix forms, we believe the already modest time gap between SVDKL and DNNs can be almost entirely closed by deploying the whole SVDKL model on GPUs.
Datasets  n  d  c  Accuracy  Total Training Time (h)  

KLSPGP Hensman et al. (2015a)  DNN  DNN+  DNN+GP  SVDKL  KLSPGP  DNN  SVDKL  
Airline  5,934,530  8  2  0.675  0.7730.001  0.7220.002  0.77460.001  0.7810.001  11  0.53  3.98 
one standard deviation. Since the code of KLSPGP is not publicly available we directly show the results from
Hensman et al. (2015a).5.2 UCI Classification Tasks
The second evaluation of our proposed algorithm (SVDKL) is conducted on a number of commonly used UCI classification tasks of varying sizes and properties. Table 2 lists the classification accuracy of SVM, DNN, DNN+ (a standalone DNN with an extra fullyconnected hidden layer with , defined as in Figure 1), DNN+GP (a GP trained on the top level features of a trained DNN without the extra hidden layer), and SVDKL (same architecture as DNN).
The plain DNN, which learns salient features effectively from raw data, gives notably higher accuracy compared to an SVM, the mostly widely used kernel method for classification problems. We see that the extra layer in DNN+GP can sometimes harm performance. By contrast, nonparametric flexibility of DNN+GP consistently improves upon DNN. And SVDKL, by training a DNN through a GP marginal likelihood objective, consistently provides further enhancements (with particularly notable performance on the Connect4 and Covtype datasets).
Datasets  n  d  c  Accuracy  

SVM  DNN  DNN+  DNN+GP  SVDKL  
Adult  48,842  14  2  0.8490.001  0.8520.001  0.8450.001  0.8530.001  0.8570.001 
Connect4  67,557  42  3  0.7730.002  0.8050.005  0.8040.009  0.8110.005  0.8410.006 
Diabetes  101,766  49  25  0.8690.000  0.8890.001  0.8900.001  0.8910.001  0.8960.001 
Covtype  581,012  54  7  0.7960.006  0.8240.004  0.8110.002  0.8270.006  0.8600.006 
5.3 Image Classification
We next evaluate the proposed scalable SVDKL procedure for efficiently handling highdimensional highlystructured image data. We used a minibatch size of 5,000 for stochastic gradient training of SVDKL. Table 3 compares SVDKL with the most recent scalable GP classifiers. Besides KLSPGP, we also collected the results of the MCGP Hensman et al. (2015b) which uses a hybrid Monte Carlo sampler to tackle nonconjugate likelihoods, SAVIGP Dezfouli and Bonilla (2015) which approximates with a univariate Gaussian sampler, as well as the distributed GP latent variable model (denoted as DGPLVM) Gal et al. (2014). We see that on the respective benchmark tasks, SVDKL improves over all of the above scalable GP methods by a large margin. We note that these datasets are very challenging for conventional GP methods.
We further compare SVDKL to standalone convolutional neural networks, and GPs applied to fixed pretrained CNNs (CNN+GP). On the first three datasets in Table
3, we used the reference CNN models implemented in Caffe; and for the SVHN dataset, as no benchmark architecture is available, we used the CIFAR10 architecture which turned out to perform quite well. As we can see, the SVDKL model outperforms CNNs and CNN+GP on all datasets. By contrast, the extra hidden hidden layer CNN+ does not consistently improve performance over CNN.Datasets  n  d  c  Accuracy  

MCGP Hensman et al. (2015b)  SAVIGP Dezfouli and Bonilla (2015)  DGPLVM Gal et al. (2014)  KLSPGP Hensman et al. (2015a)  CNN  CNN+  CNN+GP  SVDKL  
MNISTBinary  60K  2828  2  —  —  —  0.978  0.9934  0.8838  0.9938  0.9940 
MNIST  60K  2828  10  0.9804  0.9749  0.9405  —  0.9908  0.9909  0.9915  0.9920 
CIFAR10  50K  33232  10  —  —  —  —  0.7592  0.7618  0.7633  0.7704 
SVHN  73K  33232  10  —  —  —  —  0.9214  0.9193  0.9221  0.9228 
Classification accuracy on the image classification benchmarks. MNISTBinary is the task to differentiate between odd and even digits on the MNIST dataset. We followed the standard trainingtest set partitioning of all these datasets. We have collected recently published results of a variety of scalable GPs. For CNNs, we used the respective benchmark architectures (or with slight adaptations) from Caffe. CNN+ is a standalone CNN with
fully connected extra hidden layer. See the text for more details, including a comparison with ResNets on CIFAR10.We also provide brief evaluations of SVDKL for ResNets and ImageNet, as we believe such exploration will be a promising direction for future research.
ResNet Comparison: Based on one of the best public implementations on Caffe, the ResNet20 has 0.901 accuracy on CIFAR10, and SVDKL (with this ResNet base architecture) improves to 0.910.
ImageNet: We randomly selected 20 categories of images with an AlexNet variant as the base NN (Krizhevsky et al., 2012), which has an accuracy of 0.6877, while SVDKL achieves 0.7067 accuracy.
5.3.1 Interpretation
In Figure 3 we investigate the deep kernels learned on the MNIST dataset by randomly selecting 4 classes and visualizing the covariance matrices of respective dimensions. The covariance matrices are evaluated on the set of test inputs, sorted in terms of the labels of the input images. We see that the deep kernel on each dimension effectively discovers the correlations between the images within the corresponding class. For instance, in the data points between 2k3k (i.e., images of digit 2) are strongly correlated with each other, and carry little correlation with the rest of the images. Besides, we can also clearly observe that the rest of the data points also form multiple “blocks”, rather than being crammed together without any structure. This validates that the DKL procedure and additive GPs do capture the correlations across different dimensions.
To further explore the learnt dependencies between the output classes and the additive GPs serving as the bases, we visualized the weights of the mixing layer () in Fig. 3
, enabling the correlated multioutput (multitask) nature of the model. Besides the expected high weights along the diagonal, we find that class 9 is also highly correlated with dimension 0 and 6, which is consistent with the visual similarity between digit “9” and “0”/“6”. Overall, the ability to
interpret the learned deep covariance matrix as discovering an expressive similarity metric across data instances is a distinctive feature of our approach.6 Discussion
We introduced a scalable Gaussian process model which leverages deep learning, stochastic variational inference, structure exploiting algebra, and additive covariance structures. The resulting deep kernel learning approach, SVDKL, allows for classification and nonGaussian likelihoods, multitask learning, and minibatch training. SVDKL achieves superior performance over alternative scalable GP models and standalone deep networks on many significant benchmarks.
Several fundamental themes emerge from the exposition: (1) kernel methods and deep learning approaches are complementary, and we can combine the advantages of each approach; (2) expressive kernel functions are particularly valuable on large datasets; (3) by viewing neural networks through the lens of metric learning, deep learning approaches become more interpretable.
Deep learning is able to obtain good predictive accuracy by automatically learning structure which would be difficult to a priori feature engineer into a model. In the future, we hope deep kernel learning approaches will be particularly helpful both for characterizing uncertainty and for interpreting these learned features, leading to new scientific insights into our modelling problems.
Acknowledgements: We thank NSF IIS1563887, ONR N000141410684, N000141310721, N000141512791, and ADeLAIDE FA875016C0130001 grants. We also thank anonymous reviewers for helpful comments.
References
 Aggarwal et al. [2001] C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance metrics in high dimensional space. Springer, 2001.
 Bui and Turner [2015] T. D. Bui and R. E. Turner. Stochastic variational inference for Gaussian process latent variable models using back constraints. Black Box Learning and Inference NIPS workshop, 2015.
 Calandra et al. [2014] R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold Gaussian processes for regression. arXiv preprint arXiv:1402.5876, 2014.

Chang and Lin [2011]
C.C. Chang and C.J. Lin.
Libsvm: a library for support vector machines.
ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.  Dai et al. [2015] Z. Dai, A. Damianou, J. González, and N. Lawrence. Variational autoencoded deep Gaussian processes. arXiv preprint arXiv:1511.06455, 2015.
 Damianou and Lawrence [2013] A. Damianou and N. Lawrence. Deep Gaussian processes. Artificial Intelligence and Statistics, 2013.
 Dezfouli and Bonilla [2015] A. Dezfouli and E. V. Bonilla. Scalable inference for Gaussian process models with blackbox likelihoods. In Advances in Neural Information Processing Systems, pages 1414–1422, 2015.
 Durrande et al. [2011] N. Durrande, D. Ginsbourger, and O. Roustant. Additive kernels for Gaussian process modeling. arXiv preprint arXiv:1103.4023, 2011.
 Gal et al. [2014] Y. Gal, M. van der Wilk, and C. Rasmussen. Distributed variational inference in sparse Gaussian process regression and latent variable models. In Advances in Neural Information Processing Systems, pages 3257–3265, 2014.

Gönen and Alpaydın [2011]
M. Gönen and E. Alpaydın.
Multiple kernel learning algorithms.
Journal of Machine Learning Research
, 12:2211–2268, 2011.  Hensman et al. [2013] J. Hensman, N. Fusi, and N. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2013.
 Hensman et al. [2015a] J. Hensman, A. Matthews, and Z. Ghahramani. Scalable variational Gaussian process classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 351–360, 2015a.
 Hensman et al. [2015b] J. Hensman, A. G. Matthews, M. Filippone, and Z. Ghahramani. MCMC for variationally sparse Gaussian processes. In Advances in Neural Information Processing Systems, pages 1648–1656, 2015b.
 Jia et al. [2014] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 Kingma and Welling [2013] D. P. Kingma and M. Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
 Le et al. [2013] Q. Le, T. Sarlos, and A. Smola. Fastfoodcomputing Hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, pages 244–252, 2013.
 Lloyd et al. [2014] J. R. Lloyd, D. Duvenaud, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Automatic construction and NaturalLanguage description of nonparametric regression models. In Association for the Advancement of Artificial Intelligence (AAAI), 2014.
 Nickson et al. [2015] T. Nickson, T. Gunter, C. Lloyd, M. A. Osborne, and S. Roberts. Blitzkriging: Kroneckerstructured stochastic Gaussian processes. arXiv preprint arXiv:1510.07965, 2015.
 QuiñoneroCandela and Rasmussen [2005] J. QuiñoneroCandela and C. Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959, 2005.
 Rahimi and Recht [2007] A. Rahimi and B. Recht. Random features for largescale kernel machines. In Neural Information Processing Systems, 2007.
 Rasmussen and Ghahramani [2001] C. E. Rasmussen and Z. Ghahramani. Occam’s razor. In Neural Information Processing Systems (NIPS), 2001.
 Rasmussen and Williams [2006] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for Machine Learning. The MIT Press, 2006.
 Silverman [1985] B. W. Silverman. Some aspects of the spline smoothing approach to nonparametric regression curve fitting. Journal of the Royal Statistical SocietyB, 47(1):1–52, 1985.
 Titsias [2009] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.
 Wilson [2014] A. G. Wilson. Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. PhD thesis, University of Cambridge, 2014.
 Wilson and Adams [2013] A. G. Wilson and R. P. Adams. Gaussian process kernels for pattern discovery and extrapolation. International Conference on Machine Learning (ICML), 2013.
 Wilson and Nickisch [2015] A. G. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISSGP). International Conference on Machine Learning (ICML), 2015.
 Wilson et al. [2012] A. G. Wilson, D. A. Knowles, and Z. Ghahramani. Gaussian process regression networks. In International Conference on Machine Learning (ICML), Edinburgh, 2012.
 Wilson et al. [2015] A. G. Wilson, C. Dann, and H. Nickisch. Thoughts on massively scalable Gaussian processes. arXiv preprint arXiv:1511.01870, 2015. https://arxiv.org/abs/1511.01870.
 Wilson et al. [2016] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. Artificial Intelligence and Statistics, 2016.
 Yang et al. [2015] Z. Yang, A. J. Smola, L. Song, and A. G. Wilson. A la carte  learning fast kernels. Artificial Intelligence and Statistics, 2015.
Appendix A Negative Log Probability (NLP) Results
Tables 4, 5, and 6 show the negative log probability values on different tasks. Generally we observed similar trends as from the classification accuracy results.
Datasets  n  d  c  NLP  
KLSPGP  DNN  DNN+GP  SVDKL  
Airline  5,934,530  8  2  0.61  0.4740.001  0.4730.001  0.4610.001 
Datasets  n  d  c  NLP  

DNN  DNN+GP  SVDKL  
Adult  48,842  14  2  0.3160.003  0.3140.003  0.3120.003 
Connect4  67,557  42  3  0.4950.003  0.4780.003  0.4490.002 
Diabetes  101,766  49  25  0.4040.001  0.3960.002  0.3850.002 
Covtype  581,012  54  7  0.4350.004  0.4290.005  0.3650.004 
Datasets  n  d  c  NLP  

MCGP  KLSPGP  CNN  CNN+GP  SVDKL  
MNISTBinary  60K  2828  2  —  0.069  0.020  0.019  0.018 
MNIST  60K  2828  10  0.064  —  0.028  0.028  0.028 
CIFAR10  50K  33232  10  —  —  0.738  0.738  0.734 
SVHN  73K  33232  10  —  —  0.309  0.310  0.308 
Appendix B Stochastic Variational Inference for Deep Kernel Learning Classification
Recall the SVDKL classification model
(5) 
Let . We assume a variational posterior over the inducing variables
(6) 
By Jensen’s inequality we have
(7) 
In the following we omit the GP index when there is no ambiguity. Due to the deterministic mapping, we can obtain latent function samples from the samples of :
(8) 
To sample from , we use the Cholesky decomposition for reparameterizing in order to preserve structures within the covariance. Specifically, we let . This results in the following sampling procedure for :
We further scale up the sampler by leveraging the fact that the inducing points are placed on a grid, and imposing Kronecker decomposition on , where is the input dimension of the base kernel. With the fast Kronecker matrixvector products, the sampling cost is . Note that
With the samples, then for any , we have
(9) 
Next we give the derivation of the objective lower bound and its derivatives in detail. In the following we denote for clarity.
Computation of the marginal likelihood lower bound
The expectation term of objective lower bound Eq (7) can be computed straightforwardly following Eq (9). The KL term has a closed form (we omit the GP index ):
(10) 
With the Kronecker product representation of the covariance matrices, all the above matrix operations can be conducted efficiently:
(11) 
where is the number of inducing points in dimension (we have ); ; and .
Derivatives w.r.t the base kernel hyperparameters
Derivatives w.r.t other model parameters
Derivatives w.r.t the variational parameters
We only show the derivatives w.r.t the variational covariance parameters . The derivatives w.r.t the variational means can be derived similarly.
(1) The derivative of the softmax expectation term of input w.r.t the th element of , denoted as for clarity, is given by
where is the th row of the interpolation matrix (i.e., the interpolation vector of input ); and . Note that for , we can directly write down the derivatives w.r.t the whole matrix which is efficient for computing:
(2) The derivative of the KL term is (index omitted):
where . Note that the Kronecker factor matrices are small (with size ) and thus the above computations are fast.