1 Introduction
Gaussian processes (GPs) provide a prior over functions and allow finding complex regularities in data. The ability of GPs to adjust the complexity of the model to the size of the data makes them appealing for big datasets. Unfortunately, standard methods for GP regression and classification scale as $\mathcal{O}(n^3)$ with the number $n$ of training instances and cannot be applied when $n$ exceeds several thousand.
Numerous approximate inference methods have been proposed in the literature. Many of these methods are based on the concept of inducing inputs (Quiñonero-Candela and Rasmussen (2005), Snelson and Ghahramani (2006), Williams and Seeger (2000)). These methods build a smaller set of $m$ points that serve to approximate the true posterior of the process and reduce the complexity to $\mathcal{O}(nm^2 + m^3)$. Titsias (2009) proposed to treat the values of the Gaussian process at the inducing inputs as latent variables and derived a variational inference procedure to approximate the posterior distribution of these variables. Hensman et al. (2013) and Hensman et al. (2015) extended this framework by using stochastic optimization to scale up the method and generalized it to classification problems.
Inducing-input methods make it possible to use Gaussian processes on datasets containing millions of examples. However, these methods are still limited in the number of inducing inputs they can use (usually at most a few thousand). A small number of inducing inputs limits the flexibility of the models that can be learned with these methods and does not allow learning expressive kernel functions (Wilson et al. (2014)). Wilson and Nickisch (2015) proposed the KISS-GP framework, which exploits the Kronecker product structure in covariance matrices for inducing inputs placed on a multidimensional grid in the feature space. KISS-GP has complexity $\mathcal{O}(n + D m^{1 + 1/D})$, where $D$ is the dimensionality of the feature space and $m$ is the number of inducing inputs. Note, however, that $m$ is the number of points in a $D$-dimensional grid and grows exponentially with $D$, which makes the method impractical when the number of features exceeds four.
In this paper, we propose TT-GP, a method that can use billions of inducing inputs and is applicable to a much wider range of datasets than KISS-GP. We achieve this by combining the kernel interpolation and Kronecker algebra of KISS-GP with a scalable variational inference procedure. We restrict the family of variational distributions from Hensman et al. (2013) to have parameters in compact formats. Specifically, we use the Kronecker product format for the covariance matrix $\Sigma$ and the Tensor Train format (Oseledets (2011)) for the expectation $\mu$ of the variational distribution over the values $u$ of the process at the inducing inputs.
Nickson et al. (2015) showed that using the Kronecker format for $\Sigma$ does not substantially affect the predictive performance of GP regression, while allowing for computational gains. The main contribution of this paper is combining the Kronecker format for $\Sigma$ with the TT-format for $\mu$, which, together with an efficient inference procedure, allows us to train GP models with billions of inducing inputs.
Unlike KISS-GP, the proposed method has linear complexity with respect to the dimensionality $D$ of the feature space, which means that we can apply TT-GP to datasets that are both large and high-dimensional. Note, however, that TT-GP constructs a grid of inducing inputs in the feature space and tries to infer the values of the process at all points of the grid. High-dimensional real-world datasets are believed to lie on low-dimensional manifolds in the feature space, and it is impractical to try to recover the complex nonlinear transformation that a Gaussian process defines on the whole feature space. Thus, we use TT-GP on raw features only for datasets with dimensionality up to roughly ten.
For feature spaces of higher dimensionality we propose to use kernels based on parametric projections, which can be learned from data. Wilson et al. (2016a) and Wilson et al. (2016b) demonstrated the efficiency of Gaussian processes with kernels based on deep neural networks. They used subsets of outputs of a DNN as inputs for a Gaussian process. As the authors were using KISS-GP, they were limited to additive kernels combining multiple low-dimensional Gaussian processes. We found that DNN-based kernels are very efficient in combination with TT-GP. These kernels allow us to train TT-GP models on high-dimensional datasets, including computer vision tasks. Moreover, unlike existing deep kernel learning methods, TT-GP does not require any changes in the GP model and allows deep kernels that produce embeddings of moderately high dimensionality.
2 Background
2.1 Gaussian Processes
A Gaussian process is a collection of random variables, any finite number of which have a joint normal distribution. A GP $f$ on $\mathbb{R}^D$ is fully defined by its mean function $m(x)$ and covariance function $k(x, x')$. For every finite set of points $X = \{x_1, \ldots, x_n\}$ the vector of values $f = (f(x_1), \ldots, f(x_n))^T$ follows a normal distribution $\mathcal{N}(m, K)$, where $m_i = m(x_i)$ and $K$ is the covariance matrix with $K_{ij} = k(x_i, x_j)$. Below we use the notation $K(X, Z)$ for the matrix of pairwise values of the covariance function on points from sets $X$ and $Z$.
Consider a regression problem. The dataset consists of $n$ objects $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^D$ and target values $y = (y_1, \ldots, y_n)^T$. We assume that the data is generated by a latent zero-mean Gaussian process $f$ plus independent Gaussian noise:
\[ y_i = f_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2), \qquad i = 1, \ldots, n, \quad (1) \]
where $f_i = f(x_i)$ is the value of the process at data point $x_i$ and $\sigma_n^2$ is the noise variance.
Assume that we want to predict the values of the process at a set of test points $X_*$. As the joint distribution of $f_* = f(X_*)$ and $y$ is Gaussian, we can analytically compute the conditional distribution $p(f_* \mid y) = \mathcal{N}(f_* \mid \mu_*, \Sigma_*)$ with tractable formulas for $\mu_*$ and $\Sigma_*$. The complexity of computing $\mu_*$ and $\Sigma_*$ is $\mathcal{O}(n^3)$, since it involves the inverse of the covariance matrix $K(X, X) + \sigma_n^2 I$.

Covariance functions usually have a set of hyperparameters $\theta$. For example, the RBF kernel
\[ k(x, x') = \sigma_f^2 \exp\left( -\frac{\|x - x'\|^2}{2 l^2} \right) \]
has two hyperparameters $\sigma_f^2$ and $l$. In order to fit the model to the data, we can maximize the marginal likelihood of the process with respect to these parameters. For GP regression this marginal likelihood is tractable and can be computed in $\mathcal{O}(n^3)$ operations.
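As a concrete illustration of these $\mathcal{O}(n^3)$ computations, the following NumPy sketch runs exact GP regression on a toy one-dimensional problem; the data and hyperparameter values are made up for illustration and are not from the paper.

```python
import numpy as np

# Toy sketch of exact GP regression with an RBF kernel.
def rbf(X, Z, sigma_f=1.0, ell=0.5):
    # k(x, x') = sigma_f^2 * exp(-||x - x'||^2 / (2 * ell^2))
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return sigma_f ** 2 * np.exp(-d2 / (2.0 * ell ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(50, 1))              # training inputs
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(50)
X_star = np.linspace(-2.0, 2.0, 100)[:, None]         # test inputs

noise = 0.1 ** 2
K_nn = rbf(X, X) + noise * np.eye(50)                 # the O(n^3) bottleneck
K_sn = rbf(X_star, X)

# Posterior p(f_* | y) = N(mu_star, Sigma_star).
mu_star = K_sn @ np.linalg.solve(K_nn, y)
Sigma_star = rbf(X_star, X_star) - K_sn @ np.linalg.solve(K_nn, K_sn.T)
```

The two `np.linalg.solve` calls against the dense $n \times n$ matrix are exactly the operations that make this approach infeasible beyond a few thousand points.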
2.2 Inducing Inputs
A number of approximate methods have been developed to scale up Gaussian processes. Hensman et al. (2013) proposed a variational lower bound for the Gaussian process marginal likelihood that factorizes over observations. We re-derive this bound here.
Consider a set of $m$ inducing inputs $Z = \{z_1, \ldots, z_m\}$ in the feature space and latent variables $u = (f(z_1), \ldots, f(z_m))^T$ representing the values of the Gaussian process at these points. Consider the augmented model $p(y, f, u) = p(y \mid f)\, p(f \mid u)\, p(u)$ with
\[ p(y \mid f) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid f_i, \sigma_n^2), \quad p(f \mid u) = \mathcal{N}(f \mid K_{nm} K_{mm}^{-1} u, \tilde K), \quad p(u) = \mathcal{N}(u \mid 0, K_{mm}), \quad (2) \]
where $\tilde K = K_{nn} - K_{nm} K_{mm}^{-1} K_{mn}$, $K_{nn} = K(X, X)$, $K_{mm} = K(Z, Z)$, and $K_{nm} = K_{mn}^T = K(X, Z)$.
The standard variational lower bound is given by
\[ \log p(y) \ge \mathbb{E}_{q(f, u)} \log \frac{p(y, f, u)}{q(f, u)}, \quad (3) \]
where $q(f, u) = p(f \mid u)\, q(u)$ is the variational distribution over the latent variables. Consider the following family of variational distributions:
\[ q(u) = \mathcal{N}(u \mid \mu, \Sigma), \quad (4) \]
where $\mu$ and $\Sigma$ are variational parameters. Then the marginal distribution over $f$ can be computed analytically:
\[ q(f) = \int p(f \mid u)\, q(u)\, du = \mathcal{N}\big(f \mid K_{nm} K_{mm}^{-1} \mu,\; K_{nn} - K_{nm} K_{mm}^{-1} (K_{mm} - \Sigma) K_{mm}^{-1} K_{mn}\big). \quad (5) \]
We can then rewrite (3) as
\[ \log p(y) \ge \sum_{i=1}^{n} \mathbb{E}_{q(f_i)} \log p(y_i \mid f_i) - \mathrm{KL}\big(q(u) \,\|\, p(u)\big). \quad (6) \]
Note that the lower bound (6) factorizes over observations, so stochastic optimization can be applied to maximize it with respect to the kernel hyperparameters $\theta$, the variational parameters $\mu$ and $\Sigma$, and other parameters of the model, e.g. the noise variance $\sigma_n^2$. In the regression case we can rewrite (6) in closed form:
\[ \log p(y) \ge \sum_{i=1}^{n} \Big( \log \mathcal{N}\big(y_i \mid k_i^T K_{mm}^{-1} \mu, \sigma_n^2\big) - \frac{1}{2\sigma_n^2} \tilde K_{ii} - \frac{1}{2\sigma_n^2} k_i^T K_{mm}^{-1} \Sigma K_{mm}^{-1} k_i \Big) - \mathrm{KL}\big(q(u) \,\|\, p(u)\big), \quad (7) \]
where $k_i$ is the $i$-th column of the matrix $K_{mn}$ and $\tilde K_{ii}$ is the $i$-th diagonal entry of $\tilde K$.

At prediction time we can use the variational distribution $q(u)$ as a substitute for the posterior $p(u \mid y)$.
The complexity of computing the bound (7) is $\mathcal{O}(nm^2 + m^3)$. Hensman et al. (2015) propose to use Gauss-Hermite quadratures to approximate the expectation term in (6) for the binary classification problem, obtaining the same computational complexity. This complexity allows the use of Gaussian processes in tasks with millions of training samples, but these methods are limited to a small number of inducing inputs $m$, which hurts predictive performance and does not allow learning expressive kernel functions.
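The regression bound (7) can be sketched numerically in a few lines; the 1D data, the inducing-input grid, and the hyperparameters below are all illustrative assumptions, and for $q(u)$ we plug in the optimal Gaussian of Titsias (2009), which lets us sanity-check the bound against the exact log marginal likelihood.

```python
import numpy as np

# Hedged sketch of the bound (7) on a made-up 1D regression problem.
rng = np.random.default_rng(1)
n, m, sigma2 = 200, 20, 0.1

def rbf(A, B, ell=0.5):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2.0 * ell ** 2))

X = rng.uniform(-2.0, 2.0, n)
y = np.sin(3.0 * X) + np.sqrt(sigma2) * rng.standard_normal(n)
Z = np.linspace(-2.0, 2.0, m)                   # inducing inputs

K_mm = rbf(Z, Z) + 1e-6 * np.eye(m)             # small jitter for stability
K_mn = rbf(Z, X)                                # columns k_i
K_mm_inv = np.linalg.inv(K_mm)

# Optimal variational parameters (Titsias, 2009) -- one possible q(u).
B = K_mm + K_mn @ K_mn.T / sigma2
Sigma = K_mm @ np.linalg.solve(B, K_mm)
mu = K_mm @ np.linalg.solve(B, K_mn @ y) / sigma2

A = K_mm_inv @ K_mn                             # columns K_mm^{-1} k_i
mean_i = A.T @ mu                               # k_i^T K_mm^{-1} mu
Ktilde_ii = 1.0 - np.sum(K_mn * A, axis=0)      # k(x_i, x_i) = 1 for this RBF
trace_i = np.sum(A * (Sigma @ A), axis=0)       # k_i^T Kmm^{-1} Sigma Kmm^{-1} k_i

log_lik = -0.5 * n * np.log(2 * np.pi * sigma2) \
          - 0.5 * np.sum((y - mean_i) ** 2) / sigma2
kl = 0.5 * (np.trace(K_mm_inv @ Sigma) + mu @ K_mm_inv @ mu - m
            + np.linalg.slogdet(K_mm)[1] - np.linalg.slogdet(Sigma)[1])
elbo = log_lik - 0.5 * np.sum(Ktilde_ii) / sigma2 \
       - 0.5 * np.sum(trace_i) / sigma2 - kl

# Exact log marginal likelihood of the full GP for comparison.
K_full = rbf(X, X) + sigma2 * np.eye(n)
exact = -0.5 * y @ np.linalg.solve(K_full, y) \
        - 0.5 * np.linalg.slogdet(K_full)[1] - 0.5 * n * np.log(2 * np.pi)
```

With a dense enough 1D grid of inducing inputs the bound sits just below the exact marginal likelihood, which is what makes it a usable training objective.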
2.3 KISS-GP
Saatçi (2012) noted that covariance matrices computed at points on a multidimensional grid in the feature space can be represented as a Kronecker product if the kernel function factorizes over dimensions:
\[ k(x, x') = k_1(x^1, x'^1) \cdot k_2(x^2, x'^2) \cdot \ldots \cdot k_D(x^D, x'^D). \quad (8) \]
Note that many popular covariance functions, including RBF, belong to this class. The Kronecker structure of covariance matrices allows efficient inference for full Gaussian processes with inputs on a grid.
Wilson and Nickisch (2015) proposed to place the inducing inputs on a grid $Z = Z^1 \times Z^2 \times \ldots \times Z^D$, where $Z^d$ is a set of $m_d$ points along dimension $d$. The number of inducing inputs is then given by $m = \prod_{d=1}^{D} m_d$.
Let the covariance function satisfy (8). Then the covariance matrix $K_{mm}$ can be represented as a Kronecker product over dimensions, $K_{mm} = K^1 \otimes K^2 \otimes \ldots \otimes K^D$, where $K^d = k_d(Z^d, Z^d)$ is the covariance matrix of the $d$-th kernel factor computed at the grid points along dimension $d$.
Kronecker products allow efficient computation of the matrix inverse and determinant:
\[ (A \otimes B)^{-1} = A^{-1} \otimes B^{-1}, \qquad \det(A \otimes B) = \det(A)^{m_B} \det(B)^{m_A}, \]
where $A \in \mathbb{R}^{m_A \times m_A}$ and $B \in \mathbb{R}^{m_B \times m_B}$.
Another major idea of KISS-GP is to use interpolation to approximate $K_{mn}$. Considering the inducing inputs as interpolation points for the function $k(\cdot, z_j)$, we can write
\[ K_{mn} \approx K_{mm} W, \qquad k_i \approx K_{mm} w_i, \quad (9) \]
where the matrix $W$ contains the coefficients of interpolation and $w_i$ is its $i$-th column. The authors of KISS-GP suggest using cubic convolutional interpolation (Keys (1981)), in which case each interpolation vector can be represented as a Kronecker product over dimensions, $w_i = w_i^1 \otimes \ldots \otimes w_i^D$, with only four nonzero entries per factor.
Wilson and Nickisch (2015) combine these ideas with the subset-of-regressors (SoR) approximation (Silverman (1985)) in the KISS-GP method, yielding $\mathcal{O}(n + D m^{1 + 1/D})$ computational complexity. This complexity allows using KISS-GP with a large number (possibly greater than $n$) of inducing inputs. Note, however, that $m$ grows exponentially with the dimensionality $D$ of the feature space, and the method becomes impractical when $D > 4$.
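The interpolation idea in (9) can be sketched in one dimension. For brevity the toy code below uses linear (two-point) interpolation rather than the cubic (four-point) scheme of Keys (1981) used by KISS-GP, and the grid and lengthscale are made-up values.

```python
import numpy as np

# One-dimensional sketch of k_i ≈ K_mm w_i with sparse interpolation weights.
grid = np.linspace(-1.0, 1.0, 11)                  # inducing inputs on a grid

def rbf1(a, b, ell=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def linear_weights(x, grid):
    # Interpolation weights for scalar x: only two nonzeros.
    w = np.zeros(len(grid))
    j = int(np.clip(np.searchsorted(grid, x) - 1, 0, len(grid) - 2))
    t = (x - grid[j]) / (grid[j + 1] - grid[j])
    w[j], w[j + 1] = 1.0 - t, t
    return w

K_mm = rbf1(grid, grid)
x = 0.13                                           # an arbitrary input point
w = linear_weights(x, grid)
k_exact = rbf1(np.array([x]), grid)[0]             # true row of K_nm
k_interp = w @ K_mm                                # interpolated row
err = np.max(np.abs(k_exact - k_interp))
```

Because each weight vector has a constant number of nonzeros per dimension, multiplying by $W$ costs $\mathcal{O}(n)$ rather than $\mathcal{O}(nm)$.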
2.4 Tensor Train Decomposition
The Tensor Train (TT) decomposition, proposed in Oseledets (2011), allows compact storage of tensors (multidimensional arrays of data), large matrices, and vectors. For tensors, matrices, and vectors in the TT-format, linear algebra operations can be implemented efficiently.
Consider a $d$-dimensional tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times \ldots \times n_d}$. $\mathcal{A}$ is said to be in the Tensor Train format if
\[ \mathcal{A}(i_1, i_2, \ldots, i_d) = G_1[i_1]\, G_2[i_2] \ldots G_d[i_d], \quad (10) \]
where $G_k[i_k] \in \mathbb{R}^{r_{k-1} \times r_k}$ and $r_0 = r_d = 1$. The matrices $G_k[i_k]$ are called TT-cores, and the numbers $r_k$ are called TT-ranks of the tensor $\mathcal{A}$.
In order to represent a vector in the TT-format, it is reshaped into a multidimensional tensor (possibly with zero padding), and then the format (10) is used. We will use the TT-format for the vector $\mu$ of expectations of the values of the Gaussian process at points placed on a multidimensional grid. In this case, $\mu$ is naturally represented as a $D$-dimensional tensor.

For matrices, the TT-format is given by
\[ M(i_1, \ldots, i_d;\, j_1, \ldots, j_d) = G_1[i_1, j_1]\, G_2[i_2, j_2] \ldots G_d[i_d, j_d], \]
where $G_k[i_k, j_k] \in \mathbb{R}^{r_{k-1} \times r_k}$. Note that the Kronecker product format is a special case of a TT-matrix with all TT-ranks equal to $1$.
Let $\mu, \nu \in \mathbb{R}^{m}$ be vectors in the TT-format with TT-ranks not greater than $r$, and let $A \in \mathbb{R}^{m \times m}$ be represented as a Kronecker product $A = A^1 \otimes \ldots \otimes A^D$ with $A^d \in \mathbb{R}^{m_0 \times m_0}$ and $m = m_0^D$. Then the quadratic form $\nu^T A \mu$ can be computed in $\mathcal{O}\big(D m_0 r^2 (m_0 + r)\big)$ operations, and the product $A \mu$ can be computed in $\mathcal{O}(D m_0^2 r^2)$ operations without increasing the TT-ranks. We will need these two operations below. See Oseledets (2011) for a detailed description of the TT-format and efficient algorithms implementing linear algebra operations in it.
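A TT-vector of length $m_0^D$ is just a list of $D$ cores, and inner products with Kronecker (rank-one TT) vectors reduce to a cheap sweep over the cores. The sketch below uses arbitrary random cores and sizes; the full materialization is exponential in $D$ and appears only as a correctness check.

```python
import numpy as np

# A TT-vector (10) with D cores G_k of shape (r_{k-1}, m0, r_k),
# and a fast inner product with a Kronecker vector w^1 ⊗ ... ⊗ w^D.
rng = np.random.default_rng(3)
D, m0, r = 4, 5, 3
ranks = [1, r, r, r, 1]
cores = [0.5 * rng.standard_normal((ranks[k], m0, ranks[k + 1]))
         for k in range(D)]

def tt_full(cores):
    # Materialize the full length-m0^D vector (for checking only).
    v = cores[0]
    for G in cores[1:]:
        v = np.tensordot(v, G, axes=([-1], [0]))
    return v.reshape(-1)

def tt_dot_kron(cores, ws):
    # <mu, w^1 ⊗ ... ⊗ w^D> in O(D * m0 * r^2) operations.
    res = np.ones((1, 1))
    for G, w in zip(cores, ws):
        res = res @ np.einsum('i,aib->ab', w, G)
    return res[0, 0]

ws = [rng.standard_normal(m0) for _ in range(D)]
w_full = ws[0]
for w in ws[1:]:
    w_full = np.kron(w_full, w)

fast = tt_dot_kron(cores, ws)
slow = w_full @ tt_full(cores)
```

The per-core sweep never touches the $m_0^D$-dimensional object, which is exactly why grids with billions of points remain tractable.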
3 TT-GP
In the previous section we described several methods for GP regression and classification. All these methods have different limitations: standard methods are limited to small-scale datasets, KISS-GP requires small dimensionality of the feature space, and other methods based on inducing inputs are limited to a small number of such points. In this section, we propose the TT-GP method, which can be used with big datasets and can incorporate billions of inducing inputs. Additionally, TT-GP allows training expressive deep kernels to work with structured data (e.g. images).
3.1 Variational Parameters Approximation
In Section 2.2 we derived the variational lower bound of Hensman et al. (2013). We place the inducing inputs on a multidimensional grid in the feature space and assume that the covariance function satisfies (8). Let the number of inducing inputs in each dimension be $m_0$; the total number of inducing inputs is then $m = m_0^D$. As shown in Section 2.3, in this case the matrix $K_{mm}$ can be expressed as a Kronecker product over dimensions. Substituting the approximation (9) into the lower bound (7), we obtain
\[ \log p(y) \ge \sum_{i=1}^{n} \Big( \log \mathcal{N}\big(y_i \mid w_i^T \mu, \sigma_n^2\big) - \frac{1}{2\sigma_n^2} \tilde K_{ii} - \frac{1}{2\sigma_n^2} w_i^T \Sigma w_i \Big) - \mathrm{KL}\big(q(u) \,\|\, p(u)\big), \quad (11) \]
where $\tilde K_{ii} = k(x_i, x_i) - w_i^T K_{mm} w_i$.
Note that $K_{mm}^{-1}$ and $\det K_{mm}$ can be computed with $\mathcal{O}(D m_0^3)$ operations due to the Kronecker product structure. Now the most computationally demanding terms are those containing the variational parameters $\mu$ and $\Sigma$.
Let us restrict the family of variational distributions (4). Let $\Sigma$ be a Kronecker product over dimensions, $\Sigma = \Sigma^1 \otimes \ldots \otimes \Sigma^D$, and let $\mu$ be in the TT-format with TT-rank $r$ ($r$ is a hyperparameter of our method). Then, according to Section 2.4, we can compute the lower bound (11) with $\mathcal{O}\big(n D r^2 + D m_0 r^2 (m_0 + r) + D m_0^3\big)$ complexity.
The proposed TT-GP method thus has linear complexity with respect to the dimensionality $D$ of the feature space, despite the exponential growth of the number of inducing inputs. The lower bound (11) can be maximized with respect to the kernel hyperparameters $\theta$, the TT-cores of $\mu$, and the Kronecker factors of $\Sigma$. Note that stochastic optimization can be applied, as the bound (11) factorizes over data points.
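The dominant structured operation in the KL term of (11), the quadratic form $\mu^T K_{mm}^{-1} \mu$ with a Kronecker $K_{mm}$ and a TT-format $\mu$, can be sketched as follows. Sizes and random cores are illustrative; the dense check is exponential in $D$ and only feasible at toy scale.

```python
import numpy as np

# Hedged sketch of mu^T Kmm^{-1} mu with Kmm = K^1 ⊗ ... ⊗ K^D and mu in TT.
rng = np.random.default_rng(4)
D, m0, r = 3, 4, 2
ranks = [1, r, r, 1]
cores = [0.5 * rng.standard_normal((ranks[k], m0, ranks[k + 1]))
         for k in range(D)]

def spd(k):
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

Ks = [spd(m0) for _ in range(D)]               # Kronecker factors of Kmm

# Apply Kmm^{-1} factor-wise to the TT-cores (TT-ranks are unchanged).
inv_cores = [np.einsum('ij,ajb->aib', np.linalg.inv(K), G)
             for K, G in zip(Ks, cores)]

def tt_dot(cx, cy):
    # Inner product of two TT-vectors, swept core by core.
    S = np.ones((1, 1))
    for Gx, Gy in zip(cx, cy):
        S = np.einsum('ab,aic,bid->cd', S, Gx, Gy)
    return S[0, 0]

quad_fast = tt_dot(cores, inv_cores)

# Dense check: materialize everything (exponential cost).
def tt_full(cores):
    v = cores[0]
    for G in cores[1:]:
        v = np.tensordot(v, G, axes=([-1], [0]))
    return v.reshape(-1)

K_full = Ks[0]
for K in Ks[1:]:
    K_full = np.kron(K_full, K)
mu_full = tt_full(cores)
quad_dense = mu_full @ np.linalg.solve(K_full, mu_full)
```

The factor-wise inverse and the core-by-core sweep are what keep the KL term linear in $D$ even though $\mu$ represents an $m_0^D$-dimensional vector.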
The TT-format has been successfully applied to different machine learning tasks, e.g. compressing neural networks (Novikov et al. (2015)) and estimating the log-partition function in probabilistic graphical models (Novikov et al. (2014)). We explore the properties of approximating the variational mean $\mu$ in the TT-format in Section 4.1.

3.2 Classification
In this section we describe a generalization of the proposed method to multiclass classification. In this case the dataset consists of features $x_1, \ldots, x_n$ and target values $y_1, \ldots, y_n \in \{1, \ldots, C\}$, where $C$ is the number of classes.
Consider $C$ Gaussian processes $f^1, \ldots, f^C$, one per class. We place the inducing inputs on a grid in the feature space and share them between all processes. Each process has its own set of latent variables: $f^c$ representing its values at the data points, and $u^c$ representing its values at the inducing inputs. We use the following model:
\[ p(y, f^{1:C}, u^{1:C}) = \prod_{i=1}^{n} p(y_i \mid f_i) \prod_{c=1}^{C} p(f^c \mid u^c)\, p(u^c), \]
where $f_i = (f_i^1, \ldots, f_i^C)^T$ is the vector of the values of all processes at data point $x_i$, $p(f^c \mid u^c)$ and $p(u^c)$ are defined as in (2), and the likelihood is the softmax
\[ p(y_i \mid f_i) = \frac{\exp(f_i^{y_i})}{\sum_{c=1}^{C} \exp(f_i^c)}. \]

We use variational distributions of the form
\[ q(u^1, \ldots, u^C) = \prod_{c=1}^{C} \mathcal{N}(u^c \mid \mu^c, \Sigma^c), \]
where all $\mu^c$ are represented in the TT-format with TT-ranks not greater than $r$ and all $\Sigma^c$ are represented as Kronecker products over dimensions. Similarly to (6), we obtain
\[ \log p(y) \ge \sum_{i=1}^{n} \mathbb{E}_{q(f_i)} \log p(y_i \mid f_i) - \sum_{c=1}^{C} \mathrm{KL}\big(q(u^c) \,\|\, p(u^c)\big). \quad (12) \]
The second term in (12) can be computed analytically as a sum of KL-divergences between normal distributions. The first term is intractable, and we approximate it by a lower bound. We can rewrite
\[ \mathbb{E}_{q(f_i)} \log p(y_i \mid f_i) = \mathbb{E}_{q(f_i^{y_i})} f_i^{y_i} - \mathbb{E}_{q(f_i)} \log \sum_{c=1}^{C} \exp(f_i^c), \quad (13) \]
where $q(f_i) = \prod_{c=1}^{C} q(f_i^c)$ and $q(f_i^c) = \mathcal{N}\big(f_i^c \mid m_i^c, (s_i^c)^2\big)$ is the marginal distribution of $f_i^c$ under $q$.
The first term in (13) is tractable, while the second term has to be approximated. Bouchard (2007) discusses several lower bounds on expectations of this type. Below we derive the one we use in TT-GP.
Concavity of the logarithm implies $\log x \le \frac{x}{\alpha} + \log \alpha - 1$ for any $\alpha > 0$. Taking the expectation of both sides of this inequality and minimizing with respect to $\alpha$, we obtain $\mathbb{E} \log x \le \log \mathbb{E} x$, and hence
\[ \mathbb{E}_{q(f_i)} \log \sum_{c=1}^{C} \exp(f_i^c) \le \log \sum_{c=1}^{C} \mathbb{E}_{q(f_i^c)} \exp(f_i^c) = \log \sum_{c=1}^{C} \exp\Big( m_i^c + \frac{(s_i^c)^2}{2} \Big). \quad (14) \]
Substituting (14) into (13) and then into (12), we obtain a tractable lower bound for the multiclass classification task, which can be maximized with respect to the kernel hyperparameters $\theta$, the TT-cores of $\mu^c$, and the Kronecker factors of $\Sigma^c$. The complexity of the method is $C$ times higher than in the regression case.
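A quick Monte Carlo sanity check of the bound (14); the means and variances below are arbitrary made-up values.

```python
import numpy as np

# Check E log sum_c exp(f_c) <= log sum_c exp(m_c + s_c^2 / 2)
# for independent Gaussians f_c ~ N(m_c, s_c^2).
rng = np.random.default_rng(5)
C = 5
m = rng.standard_normal(C)          # arbitrary means m_i^c
s = rng.uniform(0.1, 1.0, C)        # arbitrary standard deviations s_i^c

f = m + s * rng.standard_normal((200000, C))
lhs = np.mean(np.log(np.sum(np.exp(f), axis=1)))     # Monte Carlo estimate
rhs = np.log(np.sum(np.exp(m + 0.5 * s ** 2)))       # closed-form bound (14)
```

The right-hand side uses only the marginal means and variances of $q(f_i^c)$, which is what makes the classification bound tractable.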
3.3 Deep kernels
Wilson et al. (2016b) and Wilson et al. (2016a) showed the efficiency of expressive kernel functions based on deep neural networks for Gaussian processes on a variety of tasks. The proposed TT-GP method is naturally compatible with this idea.
Consider a covariance function $k$ satisfying (8) and a neural network (or, in fact, any parametric transform) $net(\cdot)$. We can define a new kernel as
\[ k_{deep}(x, x') = k\big(net(x), net(x')\big). \]
We can train the neural network weights by maximizing the GP marginal likelihood lower bound, the same way we normally train the kernel hyperparameters $\theta$. This way, the network learns a multidimensional embedding of the data, and the GP makes predictions from this embedding.
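A deep kernel of this form can be sketched in a few lines; the two-layer network, its random weights, and the RBF base kernel below are illustrative assumptions rather than the architectures used in the experiments.

```python
import numpy as np

# k_deep(x, x') = k(net(x), net(x')) with a tiny made-up MLP projection.
rng = np.random.default_rng(6)
W1, b1 = 0.3 * rng.standard_normal((16, 8)), np.zeros(16)
W2, b2 = 0.3 * rng.standard_normal((2, 16)), np.zeros(2)

def net(X):
    H = np.maximum(X @ W1.T + b1, 0.0)      # ReLU hidden layer
    return H @ W2.T + b2                    # low-dimensional embedding

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ell ** 2))

def k_deep(X, Z):
    return rbf(net(X), net(Z))

X = rng.standard_normal((20, 8))
K = k_deep(X, X)
```

Because $k_{deep}$ is a valid kernel applied to embeddings, the resulting Gram matrix stays symmetric positive semidefinite, so it can be dropped into the GP machinery unchanged, with the grid of inducing inputs living in the embedding space.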
Wilson et al. (2016b) trained additive deep kernels combining one-dimensional GPs on different outputs of a neural network. Training Gaussian processes on multiple outputs of a neural network is impractical in their framework, because the complexity of the GP part of their model grows exponentially with the input dimensionality.
With the methods of Hensman et al. (2013) and Hensman et al. (2015), training Gaussian processes on multiple outputs of a neural network is also not straightforward. Indeed, with these methods we can only use up to a few thousand inducing inputs. While with standard RBF kernels the positions of the inputs of the GP are fixed and we can place the inducing inputs near the data, with deep kernels the positions of the inputs of the GP (the outputs of the DNN) change during training. It is thus not clear how to set the inducing inputs in the latent feature space, other than placing them on a multidimensional grid, which means that the complexity of such methods would grow exponentially with dimensionality.
TT-GP, on the other hand, allows us to train Gaussian processes on multiple DNN outputs because of its ability to work efficiently with inducing inputs placed on multidimensional grids.
4 Experiments
In this section we first explore how well we can approximate the variational expectation $\mu$ in the TT-format with small TT-ranks. We then compare the proposed TT-GP method with SVI-GP (Hensman et al. (2013)) on regression tasks and with KLSP-GP (Hensman et al. (2015)) on binary classification tasks using standard RBF kernel functions. Finally, we test the ability of our method to learn expressive deep kernel functions and compare it with SV-DKL (Wilson et al. (2016b)). For TT-GP we use our implementation available at https://github.com/izmailovpavel/TTGP, which is based on the t3f library (Novikov et al. (2018)). For SVI-GP and KLSP-GP we used the implementations provided in GPflow (Matthews et al. (2016)).
4.1 Expectation approximation
In this section we provide a numerical justification of using the Tensor Train format for the mean $\mu$ of the variational distribution. We use the Powerplant dataset from the UCI repository, which consists of 9568 objects with 4 features. We place inducing inputs on a grid with $m_0$ points per dimension, for a total of $m = m_0^4$ inducing inputs. We train the standard SVI-GP method from the GPflow library (Matthews et al. (2016)) with free-form representations of $\mu$ and $\Sigma$, and then try to approximate the learned vector $\mu$ with a TT-vector with small TT-ranks.
(Figure 1: MSE of the TT approximation of $\mu$ as a function of the TT-rank.)
Figure 1 shows the dependence between the TT-ranks and the approximation accuracy. For large enough TT-ranks we can approximate the true values of $\mu$ within machine precision. Note that for the largest TT-ranks the number of parameters in the TT representation already exceeds the number of entries in the tensor being approximated. For moderate TT-ranks an accurate approximation can still be achieved.
(Figure 2: (a) the true variational mean $\mu$; (b) its TT approximation.)
Figure 2 shows the true variational mean $\mu$ and its approximation with a small TT-rank. We can see that the approximation captures the structure of the true variational mean.
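The kind of approximation used in this experiment can be reproduced with a small TT-SVD sketch: reshape a vector into a tensor and sequentially truncate SVDs. The smooth test vector below is made up and merely stands in for a learned $\mu$.

```python
import numpy as np

# TT-SVD: compress a length-256 vector reshaped to a 4x4x4x4 tensor.
def tt_svd(tensor, max_rank):
    shape = tensor.shape
    cores, r_prev, M = [], 1, tensor
    for k in range(len(shape) - 1):
        M = M.reshape(r_prev * shape[k], -1)
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(U[:, :r].reshape(r_prev, shape[k], r))
        M = s[:r, None] * Vt[:r]          # carry the remainder forward
        r_prev = r
    cores.append(M.reshape(r_prev, shape[-1], 1))
    return cores

def tt_full(cores):
    v = cores[0]
    for G in cores[1:]:
        v = np.tensordot(v, G, axes=([-1], [0]))
    return v.reshape(-1)

t = np.linspace(-3.0, 3.0, 256)
v = np.exp(-t ** 2)                       # a smooth stand-in for mu
T = v.reshape(4, 4, 4, 4)

def rel_err(max_rank):
    approx = tt_full(tt_svd(T, max_rank))
    return np.linalg.norm(v - approx) / np.linalg.norm(v)

errs = [rel_err(R) for R in (1, 2, 4, 16)]
```

For smooth vectors the relative error drops rapidly with the allowed rank, mirroring the behavior observed for the learned variational means.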
Table 1: comparison of TT-GP with SVI-GP on regression and KLSP-GP on binary classification tasks with RBF kernels.

Dataset | SVI-GP / KLSP-GP | TT-GP
Name (n, D) | acc. | t (s) | acc. | t (s) | d
Powerplant
Protein
YearPred
Airline
svmguide1 (n = 3089, D = 4)
EEG (n = 11984, D = 14)
covtype bin (n = 465K, D = 54)

Here t is the time per pass over the data (epoch) in seconds and, where provided, d is the dimensionality of the linear embedding. For KLSP-GP on Airline we report results from the original paper, where the accuracy is given as a plot; detailed information about the experiment setup and exact results is not available.
4.2 Standard RBF Kernels
For testing our method with standard RBF covariance functions we used a range of classification and regression tasks from the UCI and LIBSVM archives, as well as the Airline dataset, which is popular for testing scalable GP models (Hensman et al. (2013), Hensman et al. (2015), Wilson et al. (2016b), Cutajar et al. (2016)).
For the Airline dataset we provide the results reported in the original paper (Hensman et al. (2015)). For our experiments, we use a cluster of Intel Xeon E5-2698B v3 CPUs.
For the YearPred, EEG and covtype datasets we used a low-dimensional linear embedding inside the RBF kernel for TT-GP, as the number of features makes it impractical to place inducing inputs on a grid in the original feature space.
Table 1 shows the results on the different regression and classification tasks. TT-GP achieves better predictive quality on all datasets except EEG. We also note that the method achieves good predictive performance with a linear embedding, which makes it practical for a wide range of datasets.
4.3 Deep Kernels
4.3.1 Representation learning
We first explore the representation our model learns on the small Digits¹ dataset containing 1797 images of handwritten digits. We used TT-GP with a kernel based on a small fully-connected neural network with two hidden layers, whose output layer produces a two-dimensional embedding. We trained the model to classify the digits into 10 classes. Fig. 3(a) shows the learned embedding. We also trained the same network standalone, adding another layer with 10 outputs and softmax activations; the embedding for this network is shown in Fig. 3(b).

¹http://scikitlearn.org/stable/auto_examples/datasets/plot_digits_last_image.html

(Figure 3: learned two-dimensional embeddings: (a) DNN with TT-GP; (b) plain DNN.)
Table 2: DNN architectures used in the deep kernel experiments. Here F denotes a fully-connected layer, C(c) a convolutional layer with c filters, and P max-pooling; ReLU stands for rectified linear unit and BN means batch normalization (Ioffe and Szegedy (2015)).

Dataset | Architecture
Airline | F-ReLU-F-ReLU-F-ReLU-F-ReLU-F
CIFAR-10 | C(128)-BN-ReLU-C(128)-BN-ReLU-P-C(256)-BN-ReLU-C(256)-BN-ReLU-P-C(256)-BN-ReLU-C(256)-BN-ReLU-P-F-BN-ReLU-F-BN-ReLU-F
MNIST | C-ReLU-P-C-ReLU-P-F-ReLU-F

Table 3: results of the experiments with deep kernels.

Dataset | SV-DKL acc. | DNN acc. | DNN t (s) | TT-GP acc. | TT-GP t (s)
Airline
CIFAR-10
MNIST
We can see that the standalone DNN with a linear classifier is unable to learn a good two-dimensional embedding. On the other hand, using a flexible GP classifier that is capable of learning nonlinear transformations, our model groups objects of the same class into compact regions.
4.3.2 Classification tasks
To test our model with deep kernels we used the Airline, CIFAR-10 (Krizhevsky (2009)) and MNIST (LeCun et al. (1998)) datasets. The corresponding DNN architectures are shown in Table 2. For the CIFAR-10 dataset we also use standard data augmentation techniques: random cropping of parts of the image, horizontal flipping, and random adjustments of brightness and contrast. In all experiments we also add a BN layer without trainable mean and variance after the DNN output layer to project the outputs into the region where the inducing inputs are placed. We place the inducing inputs on a regular grid in each dimension and use the same TT-rank for $\mu$ on all three datasets. For the experiments with convolutional neural networks, we used an Nvidia Tesla K80 GPU to train the model.
Table 3 shows the results of the experiments for TT-GP with DNN kernels and for SV-DKL. Note that the comparison is not entirely fair on the CIFAR-10 and MNIST datasets, as we did not use the same architecture and preprocessing as Wilson et al. (2016b), because we could not find the exact specifications of these models. On the Airline dataset we used the same architecture and preprocessing as SV-DKL, and TT-GP achieves a higher accuracy on this dataset.
We also provide results of standalone DNNs for comparison. We used the same networks that were used in the TT-GP kernels, with the last linear layers replaced by layers with $C$ outputs and softmax activations. Overall, our model achieves good predictive performance, improving on the standalone DNN on Airline and MNIST.
We train all models from random initialization without pretraining. We also tried using pretrained DNNs as initialization for the kernel of our TT-GP model, which sometimes leads to faster convergence but does not improve the final accuracy.
5 Discussion
We proposed TT-GP, a method for scalable inference in Gaussian process models for regression and classification. The proposed method is capable of using billions of inducing inputs, which is impossible for existing methods. This allows us to improve performance over the state of the art, both with standard and deep kernels, on several benchmark datasets. Further, we believe that our model provides a more natural way of learning deep kernel functions than existing approaches, since it does not require any specific modifications of the GP model and allows working with high-dimensional DNN embeddings.
Our preliminary experiments showed that TT-GP is inferior in terms of uncertainty quantification compared to existing methods. We suspect that the reason is the restrictive Kronecker structure of the covariance matrix $\Sigma$. We hope to alleviate this limitation by using the Tensor Train format for $\Sigma$ together with corresponding approximations of its determinant.
As a promising direction for future work we consider training TT-GP with deep kernels incrementally, using the variational approximation of the posterior distribution as a prior for new data. We also find it interesting to use the low-dimensional embeddings learned by our model for transfer learning. Finally, we are interested in applying the proposed method to structured prediction, where TT-GP could scale up GPstruct approaches (Bratieres et al. (2015)) and allow the use of deep kernels.

Acknowledgements
Alexander Novikov was supported by the Russian Science Foundation grant 171101027. Dmitry Kropotov was supported by Samsung Research, Samsung Electronics.
References

Bouchard (2007) G. Bouchard. Efficient bounds for the softmax function and applications to approximate inference in hybrid models. In NIPS 2007 Workshop for Approximate Bayesian Inference in Continuous/Hybrid Systems, 2007.
 Bratieres et al. (2015) S. Bratieres, N. Quadrianto, and Z. Ghahramani. GPstruct: Bayesian structured prediction using Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(7):1514–1520, 2015.
 Cutajar et al. (2016) K. Cutajar, E. V. Bonilla, P. Michiardi, and M. Filippone. Practical learning of deep Gaussian processes via random Fourier features. arXiv preprint arXiv:1610.04386, 2016.

Hensman et al. (2013) J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 282–290. AUAI Press, 2013.
 Hensman et al. (2015) J. Hensman, A. G. de G. Matthews, and Z. Ghahramani. Scalable variational Gaussian process classification. In AISTATS, 2015.
 Ioffe and Szegedy (2015) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pages 448–456, 2015.
 Keys (1981) R. Keys. Cubic convolution interpolation for digital image processing. IEEE transactions on acoustics, speech, and signal processing, 29(6):1153–1160, 1981.
 Krizhevsky (2009) A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
 LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Matthews et al. (2016) A. G. de G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. LeónVillagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. arXiv preprint 1610.08733, October 2016.
 Nickson et al. (2015) T. Nickson, T. Gunter, C. Lloyd, M. A. Osborne, and S. Roberts. Blitzkriging: Kronecker-structured stochastic Gaussian processes. arXiv preprint arXiv:1510.07965, 2015.
 Novikov et al. (2014) A. Novikov, A. Rodomanov, A. Osokin, and D. Vetrov. Putting MRFs on a tensor train. In International Conference on Machine Learning, pages 811–819, 2014.
 Novikov et al. (2015) A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442–450, 2015.

Novikov et al. (2018) A. Novikov, P. Izmailov, V. Khrulkov, M. Figurnov, and I. Oseledets. Tensor Train decomposition on TensorFlow (t3f). arXiv preprint, 2018.
 Oseledets (2011) I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
 Quiñonero-Candela and Rasmussen (2005) J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
 Rasmussen and Williams (2006) C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
 Saatçi (2012) Y. Saatçi. Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge, 2012.
 Silverman (1985) B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–52, 1985.
 Snelson and Ghahramani (2006) E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. Advances in Neural Information Processing Systems, 18:1257, 2006.
 Titsias (2009) M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, volume 5, pages 567–574, 2009.
 Williams and Seeger (2000) C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Proceedings of the 13th International Conference on Neural Information Processing Systems, pages 661–667. MIT Press, 2000.
 Wilson and Nickisch (2015) A. G. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775–1784, 2015.
 Wilson et al. (2014) A. G. Wilson, E. Gilboa, J. P. Cunningham, and A. Nehorai. Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems, pages 3626–3634, 2014.
 Wilson et al. (2016a) A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 370–378, 2016a.
 Wilson et al. (2016b) A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586–2594, 2016b.