Gaussian processes (GPs) provide a prior over functions and allow finding complex regularities in data. The ability of GPs to adjust the complexity of the model to the size of the data makes them appealing to use for big datasets. Unfortunately, standard methods for GP regression and classification scale as with the number of training instances and can not be applied when exceeds several thousands.
Numerous approximate inference methods have been proposed in the literature. Many of these methods are based on the concept of inducing inputs (Quiñonero-Candela and Rasmussen (2005), Snelson and Ghahramani (2006), Williams and Seeger (2000)). These methods build a smaller set of points that serve to approximate the true posterior of the process and reduce the complexity to . Titsias (2009) proposed to consider the values of the Gaussian process at the inducing inputs as latent variables and derived a variational inference procedure to approximate the posterior distribution of these variables. Hensman et al. (2013) and Hensman et al. (2015) extended this framework by using stochastic optimization to scale up the method and generalized it to classification problems.
Inducing input methods allow to use Gaussian processes on datasets containing millions of examples. However, these methods are still limited in the number of inducing inputs they can use (usually up to ). Small number of inducing inputs limits the flexibility of the models that can be learned with these methods, and does not allow to learn expressive kernel functions (Wilson et al. (2014)). Wilson and Nickisch (2015) proposed KISS-GP framework, which exploits the Kronecker product structure in covariance matrices for inducing inputs placed on a multidimensional grid in the feature space. KISS-GP has complexity , where is the dimensionality of the feature space. Note however, that is the number of points in a -dimensional grid and grows exponentially with , which makes the method impractical when the number of features is larger than .
In this paper, we propose TT-GP method, that can use billions of inducing inputs and is applicable to a much wider range of datasets compared to KISS-GP. We achieve this by combining kernel interpolation and Kronecker algebra of KISS-GP with a scalable variational inference procedure. We restrict the family of variational distributions from Hensman et al. (2013) to have parameters in compact formats. Specifically, we use Kronecker product format for the covariance matrix and Tensor Train format (Oseledets (2011)) for the expectation of the variational distribution over the values of the process at inducing inputs .
Nickson et al. (2015) showed that using Kronecker format for does not substantially affect the predictive performance of GP regression, while allowing for computational gains. The main contribution of this paper is combining the Kronecker format for with TT-format for , which, together with efficient inference procedure, allows us to efficiently train GP models with billions of inducing inputs.
Unlike KISS-GP the proposed method has linear complexity with respect to dimensionality
of the feature space. It means that we can apply TT-GP to datasets that are both large and high-dimensional. Note however, that TT-GP is constructing a grid of inducing inputs in the feature space, and tries to infer the values of the process in all points in the grid. High-dimensional real-world datasets are believed to lie on small-dimensional manifolds in the feature space, and it is impractical to try to recover the complex non-linear transformation that a Gaussian Process defines on the whole feature space. Thus, we use TT-GP on raw features for datasets with dimensionality up to. For feature spaces with higher dimensionality we propose to use kernels based on parametric projections, which can be learned from data.
demonstrated efficiency of Gaussian processes with kernels based on deep neural networks. They used subsets of outputs of a DNN as inputs for a Gaussian process. As the authors were using KISS-GP, they were limited to using additive kernels, combining multiple low dimensional Gaussian processes. We found that DNN-based kernels are very efficient in combination with TT-GP. These kernels allows us to train TT-GP models on high-dimensional datasets including computer vision tasks. Moreover, unlike the existing deep kernel learning methods, TT-GP does not require any changes in the GP model and allows deep kernels that produce embeddings of dimensionality up to.
2.1 Gaussian Processes
where , and is the covariance matrix with . Below we will use notation for the matrix of pairwise values of covariance function on points from sets and .
Consider a regression problem. The dataset consists of objects , and target values . We assume that the data is generated by a latent zero-mean Gaussian process plus independent Gaussian noise:
where is the value of the process at data point and
is the noise variance.
Assume that we want to predict the values of the process at a set of test points
. As the joint distribution ofand is Gaussian, we can analytically compute the conditional distribution with tractable formulas for and . The complexity of computing and is since it involves calculation of the inverse of the covariance matrix .
Covariance functions usually have a set of hyper-parameters . For example, the RBF kernel
has two hyper-parameters and . In order to fit the model to the data, we can maximize the marginal likelihood of the process with respect to these parameters. In case of GP regression this marginal likelihood is tractable and can be computed in operations.
2.2 Inducing Inputs
A number of approximate methods were developed to scale up Gaussian processes. Hensman et al. (2013) proposed a variational lower bound that factorizes over observations for Gaussian process marginal likelihood. We rederive this bound here.
Consider a set of inducing inputs in the feature space and latent variables representing the values of the Gaussian process at these points. Consider the augmented model
The standard variational lower bound is given by
where is the variational distribution over latent variables. Consider the following family of variational distributions
where and are variational parameters. Then the marginal distribution over can be computed analytically
We can then rewrite (3) as
Note, that the lower bound (6) factorizes over observations and thus stochastic optimization can be applied to maximize this bound with respect to both kernel hyper-parameters and variational parameters and , as well as other parameters of the model e.g. noise variance . In case of regression we can rewrite (6) in the closed form
where is the -th column of matrix and
At prediction time we can use the variational distribution as a substitute for posterior
The complexity of computing the bound (7) is . Hensman et al. (2015) proposes to use Gauss-Hermite quadratures to approximate the expectation term in (6) for binary classification problem to obtain the same computational complexity . This complexity allows to use Gaussian processes in tasks with millions of training samples, but these methods are limited to use small numbers of inducing inputs , which hurts the predictive performance and doesn’t allow to learn expressive kernel functions.
Saatçi (2012) noted that the covariance matrices computed at points on a multidimensional grid in the feature space can be represented as a Kronecker product if the kernel function factorizes over dimensions
Note, that many popular covariance functions, including RBF, belong to this class. Kronecker structure of covariance matrices allows to perform efficient inference for full Gaussian processes with inputs on a grid.
Wilson and Nickisch (2015) proposed to set inducing inputs on a grid:
The number of inducing inputs is then given by .
Let the covariance function satisfy (8). Then the covariance matrix can be represented as a Kronecker product over dimensions
Kronecker products allow efficient computation of matrix inverse and determinant:
where , .
Another major idea of KISS-GP is to use interpolation to approximate . Considering inducing inputs as interpolation points for the function we can write
where contains the coefficients of interpolation, and is it’s -th column. Authors of KISS-GP suggest using cubic convolutional interpolation (Keys (1981)), in which case the interpolation weights can be represented as a Kronecker product over dimensions
Wilson and Nickisch (2015) combine these ideas with SOR (Silverman (1985)) in the KISS-GP method yielding computational complexity. This complexity allows to use KISS-GP with a large number (possibly greater than ) of inducing inputs. Note, however, that grows exponentially with the dimensionality of the feature space and the method becomes impractical when .
2.4 Tensor Train Decomposition
Tensor Train (TT) decomposition, proposed in Oseledets (2011)
, allows to efficiently store tensors (multidimensional arrays of data), large matrices, and vectors. For tensors, matrices and vectors in the TT-format linear algebra operations can be implemented efficiently.
Consider a -dimensional tensor . is said to be in the Tensor Train format if
where . Matrices are called TT-cores, and numbers are called TT-ranks of tensor .
In order to represent a vector in TT-format, it is reshaped to a multidimensional tensor (possibly with zero padding) and then format (10) is used. We will use TT-format for the vector of expectations of the values of the Gaussian process in points placed on a multidimensional grid. In this case, is naturally represented as a -dimensional tensor.
For matrices TT format is given by
where . Note, that Kronecker product format is a special case of the TT-matrix with TT-ranks .
Let be vectors in TT-format with TT-ranks not greater than . Let and be represented as a Kronecker product
and the same for . Let . Then the computational complexity of computing the quadratic form is and the complexity of computing is . We will need these two operations below. See Oseledets (2011) for a detailed description of TT format and efficient algorithms implementing linear algebraic operations with it.
In the previous section we described several methods for GP regression and classification. All these methods have different limitations: standard methods are limited to small-scale datasets, KISS-GP requires small dimensionality of the feature space, and other methods based on inducing inputs are limited to use a small number of these points. In this section, we propose the TT-GP method that can be used with big datasets and can incorporate billions of inducing inputs. Additionally, TT-GP allows for training expressive deep kernels to work with structured data (e.g. images).
3.1 Variational Parameters Approximation
In section 2.2 we derived the variational lower bound of Hensman et al. (2013). We will place the inducing inputs on a multidimensional grid in the feature space and we will assume the covariance function satisfies (8). Let the number of inducing inputs in each dimension be . Then the total number of inducing inputs is . As shown in Section 2.3, in this case matrix can be expressed as a Kronecker product over dimensions. Substituting the approximation (9) into the lower bound (7), we obtain
Note that and can be computed with operations due to the Kronecker product structure. Now the most computationally demanding terms are those containing variational parameters and .
Let us restrict the family of variational distributions (4). Let be a Kronecker product over dimensions, and be in the TT-format whith TT-rank ( is a hyper-parameter of our method). Then, according to section 2.4, we can compute the lower bound (11) with complexity.
The proposed TT-GP method has linear complexity with respect to dimensionality of the feature space, despite the exponential growth of the number of inducing inputs. Lower bound (11) can be maximized with respect to kernel hyper-parameters , TT-cores of , and Kronecker multipliers of . Note that stochastic optimization can be applied, as the bound (11) factorizes over data points.
TT format was successfully applied for different machine learning tasks, e.g. for compressing neural networks (Novikov et al. (2015)
) and estimating log-partition function in probabilistic graphical models (Novikov et al. (2014)). We explore the properties of approximating the variational mean in TT format in section 4.1.
In this section we describe a generalization of the proposed method for multiclass classification. In this case the dataset consists of features and target values , where is the number of classes.
Consider Gaussian processes taking place in . Each process corresponds to it’s own class. We will place inducing inputs on a grid in the feature space, and they will be shared between all processes. Each process has it’s own set of latent variables representing the values of the process at data points , and inducing inputs . We will use the following model
where is the vector consisting of the values of all processes at data point , and are defined as in (2) and
We will use variational distributions of the form
where , all are represented in TT-format with TT-ranks not greater than and all are represented as Kronecker products over dimensions. Similarly to (6), we obtain
The second term in (12) can be computed analytically as a sum of KL-divergences between normal distributions. The first term is intractable. In order to approximate the first term we will use a lower bound. We can rewrite
The first term in (13) is obviously tractable, while the second term has to be approximated. Bouchard (2007) discusses several lower bounds for expectations of this type. Below we derive one of these bounds, which we use in TT-GP.
Concavity of logarithm implies . Taking expectation of both sides of the inequality and minimizing with respect to , we obtain
Substituting (14) back into (12) we obtain a tractable lower bound for multiclass classification task, that can be maximized with respect to kernel hyper-parameters , TT-cores of and Kronecker factors of . The complexity of the method is times higher, than in regression case.
3.3 Deep kernels
Wilson et al. (2016b) and Wilson et al. (2016a) showed the efficiency of using expressive kernel functions based on deep neural networks with Gaussian processes on a variety of tasks. The proposed TT-GP method is naturally compatible with this idea.
Consider a covariance function satisfying (8) and a neural network (or in fact any parametric transform) . We can define a new kernel as follows
We can train the neural network weights through maximization of GP marginal likelihood, the same way, as we normally train kernel hyper-parameters . This way, the network learns a multidimensional embedding for the data, and GP is making the prediction working with this embedding.
Wilson et al. (2016b) trained additive deep kernels combining one-dimensional GPs on different outputs of a neural network. Training Gaussian processes on multiple outputs of a Neural network is impractical in their framework, because the complexity of the GP part of their model grows exponentially with the input dimensionality.
With methods of Hensman et al. (2013) and Hensman et al. (2015) training Gaussian processes on multiple outputs of a neural network also isn’t straightforward. Indeed, with these methods we can only use up to – inducing inputs. While with standard RBF kernels the positions of inputs of the GP are fixed and we can place the inducing inputs near the data, with deep kernels the positions of the inputs of the GP (outputs of the DNN) change during training to match the positions of inducing inputs. It is thus not clear how to set the inducing inputs in the latent feature space, other than placing them on a multidimensional grid, which means that the complexity of such methods would grow exponentially with dimensionality.
On the other hand, TT-GP allows us to train Gaussian processes on multiple DNN outputs because of it’s ability to efficiently work with inducing inputs placed on multidimensional grids.
In this section we first explore how well can we approximate variational expectations in TT format with small ranks. Then, we compare the proposed TT-GP method with SVI-GP (Hensman et al. (2013)) on regression tasks and KLSP-GP (Hensman et al. (2015)) on binary classification tasks using standard RBF kernel functions. Then, we test the ability of our method to learn expressive deep kernel functions and compare it with SV-DKL (Wilson et al. (2016b)). For TT-GP we use our implementation available at https://github.com/izmailovpavel/TTGP, which is based on the t3f library (Novikov et al. (2018)). For SVI-GP and KLSP-GP we used the implementations provided in GPfLow (Matthews et al. (2016)).
4.1 Expectation approximation
In this section we provide a numerical justification of using Tensor Train format for the mean of the variational distribution. We use the Powerplant dataset from UCI. This dataset consists of objects with features. We place inducing inputs per dimension and form a grid, which gives us a total of inducing inputs. We train the standard SVI-GP method from GPflow library (Matthews et al. (2016)) with free form representations for and . Then we try to approximate the learned vector with a TT-vector with small TT-ranks.
Figure 1 shows the dependence between TT-ranks and approximation accuracy. For TT-rank greater than we can approximate the true values of within machine precision. Note that for TT-rank the amount of parameters in the TT representation already exceeds the number of entries in the tensor that we are approximating. For moderate TT-ranks an accurate approximation can still be achieved.
Figure 2 shows the true variational mean and it’s approximation for TT-rank . We can see that captures the structure of the true variational mean.
|Dataset||SVI-GP / KLSP-GP||TT-GP|
is the time per one pass over the data (epoch) in seconds; where provided,is the dimensionality of linear embedding.
for KLSP-GP on Airline we provide results from the original paper where the accuracy is given as a plot, and detailed information about experiment setup and exact results is not available.
4.2 Standard RBF Kernels
For testing our method with standard RBF covariance functions we used a range of classification and regression tasks from UCI and LIBSVM archives and the Airline dataset, that is popular for testing scalable GP models (Hensman et al. (2013), Hensman et al. (2015), Wilson et al. (2016b), Cutajar et al. (2016)).
For Airline dataset we provide results reported in the original paper (Hensman et al. (2015)). For our experiments, we use a cluster of Intel Xeon E5-2698B v3 CPUs having cores and GB of RAM.
For YearPred, EEG and covtype datasets we used a -dimensional linear embedding inside the RBF kernel for TT-GP, as the number of features makes it impractical to set inducing inputs on a grid in a -dimensional space in this case.
Table 1 shows the results on different regression and classification tasks. We can see, that TT-GP is able to achieve better predictive quality on all datasets except EEG. We also note that the method is able to achieve good predictive performance with linear embedding, which makes it practical for a wide range of datasets.
4.3 Deep Kernels
4.3.1 Representation learning
We first explore the representation our model learns for data on the small Digits111http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html dataset containing images of handwritten digits. We used a TT-GP with a kernel based on a small fully-connected neural network with two hidden layers with neurons each and neurons in the output layer to obtain a
-dimensional embedding. We trained the model to classify the digits toclasses corresponding to different digits. Fig. 3 (a) shows the learned embedding. We also trained the same network standalone, adding another layer with outputs and softmax activations. The embedding for this network is shown in fig. 3,b.
|(a) DNN with TT-GP|
|(b) Plain DNN|
|CIFAR-10||C(, 128)-BN-ReLU-C(, 128)-BN-ReLU-P()-C(, 256)-BN-ReLU-|
|C(, 256)-BN-ReLU-P()-C(, 256)-BN-ReLU-|
|MNIST||C(, )-ReLU-P()-C(, )-ReLU-P()-F()-ReLU-F()|
) means max-pooling with2015)).
We can see that the stand-alone DNN with linear classifiers is unable to learn a good -dimensional embedding. On the other hand, using a flexible GP classifier that is capable of learning non-linear transormations, our model groups objects of the same class into compact regions.
4.3.2 Classification tasks
To test our model with deep kernels we used Airline, CIFAR-10 (Krizhevsky (2009)) and MNIST (LeCun et al. (1998)) datasets. The corresponding DNN architectures are shown in Table 2. For CIFAR-10 dataset we also use standard data augmentation techniques with random cropping of parts of the image, horizontal flipping, randomly adjusting brightness and contrast. In all experiments we also add a BN without trainable mean and variance after the DNN output layer to project the outputs into the region where inducing inputs are placed. We use inducing inputs per dimension placed on a regular grid from to and set TT-ranks of to
for all three datasets. For experiments with convolutional neural networks, we used Nvidia Tesla K80 GPU to train the model.
Table 3 shows the results of the experiments for our TT-GP with DNN kernel and SV-DKL. Note, that the comparison is not absolutely fair on CIFAR-10 and MNIST datasets, as we didn’t use the same exact architecture and preprocessing as Wilson et al. (2016b) because we couldn’t find the exact specifications of these models. On Airline dataset we used the same exact architecture and preprocessing as SV-DKL and TT-GP achieves a higher accuracy on this dataset.
We also provide results of stand-alone DNNs for comparison. We used the same networks that were used in TT-GP kernels with the last linear layers replaced by layers with outputs and softmax activations. Overall, we can see, that our model is able to achieve good predictive performance, improving the results of standalone DNN on Airline and MNIST.
We train all the models from random initialization without pretraining. We also tried using pretrained DNNs as initialization for the kernel of our TT-GP model, which sometimes leads to faster convergence, but does not improve the final accuracy.
We proposed TT-GP method for scalable inference in Gaussian process models for regression and classification. The proposed method is capable of using billions of inducing inputs, which is impossible for existing methods. This allows us to improve the performance over state-of-the-art both with standard and deep kernels on several benchmark datasets. Further, we believe that our model provides a more natural way of learning deep kernel functions than the existing approaches since it doesn’t require any specific modifications of the GP model and allows working with high-dimensional DNN embeddings.
Our preliminary experiments showed that TT-GP is inferior in terms of uncertainty quantification compared to existing methods. We suspect that the reason for this is restricting Kronecker structure for the covariance matrix . We hope to alleviate this limitation by using Tensor Train format for and corresponding approximations to it’s determinant.
As a promising direction for future work we consider training TT-GP with deep kernels incrementally, using the variational approximation of posterior distribution as a prior for new data. We also find it interesting to try using the low-dimensional embeddings learned by our model for transfer learning. Finally, we are interested in using the proposed method for structured prediction, where TT-GP could scale up GPstruct approaches (Bratieres et al. (2015)) and allow using deep kernels.
Alexander Novikov was supported by the Russian Science Foundation grant 17-11-01027. Dmitry Kropotov was supported by Samsung Research, Samsung Electronics.
Efficient bounds for the softmax function and applications to
approximate inference in hybrid models.
NIPS 2007 workshop for approximate Bayesian inference in continuous/hybrid systems, 2007.
- Bratieres et al. (2015) S. Bratieres, N. Quadrianto, and Z. Ghahramani. Gpstruct: Bayesian structured prediction using gaussian processes. IEEE transactions on pattern analysis and machine intelligence, 37(7):1514–1520, 2015.
- Cutajar et al. (2016) K. Cutajar, E. V. Bonilla, P. Michiardi, and M. Filippone. Practical learning of deep gaussian processes via random fourier features. arXiv preprint arXiv:1610.04386, 2016.
Hensman et al. (2013)
J. Hensman, N. Fusi, and N. D. Lawrence.
Gaussian processes for big data.
Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 282–290. AUAI Press, 2013.
- Hensman et al. (2015) J. Hensman, A. G. de G. Matthews, and Z. Ghahramani. Scalable variational gaussian process classification. In AISTATS, 2015.
- Ioffe and Szegedy (2015) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 448–456, 2015.
- Keys (1981) R. Keys. Cubic convolution interpolation for digital image processing. IEEE transactions on acoustics, speech, and signal processing, 29(6):1153–1160, 1981.
- Krizhevsky (2009) A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
- LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Matthews et al. (2016) A. G. de G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. arXiv preprint 1610.08733, October 2016.
- Nickson et al. (2015) T. Nickson, T. Gunter, C. Lloyd, M. A. Osborne, and S. Roberts. Blitzkriging: Kronecker-structured stochastic gaussian processes. arXiv preprint arXiv:1510.07965, 2015.
- Novikov et al. (2014) A. Novikov, A. Rodomanov, A. Osokin, and D. Vetrov. Putting MRFs on a tensor train. In International Conference on Machine Learning, pages 811–819, 2014.
- Novikov et al. (2015) A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442–450, 2015.
Novikov et al. (2018)
A. Novikov, P. Izmailov, V. Khrulkov, M. Figurnov, and I. Oseledets.
Tensor train decomposition on tensorflow (t3f).arXiv preprint, 2018.
- Oseledets (2011) I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
- Quiñonero-Candela and Rasmussen (2005) J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
- Rasmussen and Williams (2006) C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning, 2006.
- Saatçi (2012) Y. Saatçi. Scalable inference for structured Gaussian process models. PhD thesis, University of Cambridge, 2012.
- Silverman (1985) B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–52, 1985.
- Snelson and Ghahramani (2006) E. Snelson and Z. Ghahramani. Sparse gaussian processes using pseudo-inputs. Advances in neural information processing systems, 18:1257, 2006.
- Titsias (2009) M. K. Titsias. Variational learning of inducing variables in sparse gaussian processes. In AISTATS, volume 5, pages 567–574, 2009.
- Williams and Seeger (2000) C. K. I. Williams and M. Seeger. Using the nyström method to speed up kernel machines. In Proceedings of the 13th International Conference on Neural Information Processing Systems, pages 661–667. MIT press, 2000.
- Wilson and Nickisch (2015) A. G. Wilson and H. Nickisch. Kernel interpolation for scalable structured gaussian processes (kiss-gp). In International Conference on Machine Learning, pages 1775–1784, 2015.
- Wilson et al. (2014) A. G. Wilson, E. Gilboa, J. P. Cunningham, and A. Nehorai. Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems, pages 3626–3634, 2014.
- Wilson et al. (2016a) A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 370–378, 2016a.
- Wilson et al. (2016b) A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586–2594, 2016b.