Sparse Orthogonal Variational Inference for Gaussian Processes

10/23/2019 · by Jiaxin Shi, et al.

We introduce a new interpretation of sparse variational approximations for Gaussian processes using inducing points, which can lead to more scalable algorithms than previous methods. It is based on decomposing a Gaussian process as a sum of two independent processes: one in the subspace spanned by the inducing basis and the other in the orthogonal complement to this subspace. We show that this formulation recovers existing approximations and at the same time allows us to obtain tighter lower bounds on the marginal likelihood and to derive new stochastic variational inference algorithms. We demonstrate the efficiency of these algorithms in several Gaussian process models ranging from standard regression to multi-class classification using (deep) convolutional Gaussian processes and report state-of-the-art results on CIFAR-10 with purely GP-based models.

1 Introduction

Gaussian processes (GPs) (Rasmussen and Williams, 2006) are nonparametric models for representing distributions over functions. As a generalization of multivariate Gaussian distributions to infinite dimensions, the simplicity and elegance of these models have led to their wide adoption in uncertainty estimation for machine learning, including supervised learning (Williams and Rasmussen, 1996; Williams and Barber, 1998), sequential decision making (Srinivas et al., 2010), model-based planning (Deisenroth and Rasmussen, 2011), and unsupervised data analysis (Lawrence, 2005; Damianou et al., 2016).

Despite the successful application of these models, they suffer from O(N^3) computation and O(N^2) storage requirements given N training data points, which has motivated a large body of research on sparse GP methods (Csato and Opper, 2002; Lawrence et al., 2002; Seeger et al., 2003; Quiñonero-Candela and Rasmussen, 2005a; Titsias, 2009; Hensman et al., 2013; Bui et al., 2017). GPs have also been unfavourably compared to deep learning models for lacking representation learning capabilities.

Figure 1: SOLVE-GP. The GP prior is decomposed into a process in the span of the inducing basis and an orthogonal residual process, each approximated with its own set of inducing variables.

Sparse variational GP (SVGP) methods (Titsias, 2009; Hensman et al., 2013, 2015a) based on variational learning of inducing points have shown promise in addressing these limitations. Such methods make no change to the prior distribution of the GP model but enforce sparse structures in the posterior approximation through variational inference. This reduces the cost to O(NM^2 + M^3) computation and O(NM + M^2) storage with M ≪ N inducing points. Moreover, this approach allows us to perform mini-batch training by sub-sampling data points. Successful application of such methods has allowed scalable GP models to be trained on billions of data points (Salimbeni and Deisenroth, 2017). These advances in inference methods have also led to more flexibility in model design. A recent convolutional GP model (van der Wilk et al., 2017) encodes translation invariance by summing over GPs that take image patches as inputs. The inducing points, which can be interpreted as image patches in this model, play a role similar to that of convolutional filters in neural networks. Their work demonstrated that it is possible to implement representation learning in GP models. Further extensions of such models into deep hierarchies (Blomqvist et al., 2018; Dutordoir et al., 2019) significantly boosted the performance of GPs applied to natural images.

As these works suggest, currently the biggest challenge in this area still lies in scalable inference. Especially for the widely used SVGP methods, the computational cost is cubic with respect to the number of inducing points, making it difficult to improve the flexibility of posterior approximations. For example, state-of-the-art models like deep convolutional GPs only used 384 inducing points for inference in each layer to get a manageable running time (Dutordoir et al., 2019).

In this paper, we introduce a new variational inference framework, called SOLVE-GP, which allows increasing the number of inducing points we can use given a fixed computational budget. Our framework is based on the observation that SVGP methods can be reinterpreted using an orthogonal decomposition of the GP prior, illustrated in Fig. 1. By introducing another set of inducing variables for the GP that lives in the orthogonal complement, we can increase the number of inducing points at a much lower additional computational cost. For instance, doubling the number of inducing points leads to a 2-fold increase in computational cost with our method, compared to the 8-fold increase for the original SVGP method. We show that our framework is equivalent to a structured covariance approximation for SVGP defined over the union of the two sets of inducing points. Interestingly, under this interpretation the SOLVE-GP framework can be seen as a generalization of the recently proposed decoupled-inducing-points method (Salimbeni et al., 2018). As the decoupled method often comes with a complex dual formulation in RKHS, our framework provides a simpler derivation and a more intuitive understanding of it.

We conducted experiments on convolutional GPs and their deep variants. To the best of our knowledge, we are the first to train a purely GP-based model, without any neural network components, that achieves over 80% test accuracy on CIFAR-10. No data augmentation was used to obtain these results. Besides classification, we also evaluated our method on a range of regression datasets with sizes ranging from tens of thousands to millions of data points. Our results show that SOLVE-GP is often competitive with the 4x more expensive SVGP counterpart that uses the same total number of inducing points, and outperforms SVGP when given the same computational budget.

2 Background

Here, we briefly review Gaussian processes and sparse variational GP methods. A GP is an uncountable collection of random variables indexed by a real-valued vector x, of which any finite subset has a multivariate Gaussian distribution. It is defined by a mean function m(x) and a covariance (kernel) function k(x, x'):

f ~ GP(m(x), k(x, x')).

Let X = [x_1, ..., x_N]^T be (the matrix containing) the training data points and f = [f(x_1), ..., f(x_N)]^T denote the corresponding function values. Similarly we denote the test data points by X_* and their function values by f_*. Then, assuming a zero mean function for simplicity, the joint distribution over (f, f_*) is given by:

(f, f_*) ~ N( 0, [ K_{XX}  K_{XX_*} ; K_{X_*X}  K_{X_*X_*} ] ),

where K_{XX} is an N × N kernel matrix with its (i, j)th entry given by k(x_i, x_j), and similarly for K_{XX_*} and K_{X_*X_*}. In practice we often observe the training function values through some noisy labels y, generated by the likelihood function p(y | f). For regression, the likelihood usually models independent Gaussian observation noise: p(y | f) = N(y | f, σ^2 I). In this situation the exact posterior distribution can be computed in closed form:

p(f_* | y) = N( f_* | K_{X_*X}(K_{XX} + σ^2 I)^{-1} y,  K_{X_*X_*} − K_{X_*X}(K_{XX} + σ^2 I)^{-1} K_{XX_*} ).    (1)

As seen from Eq. (1), exact GP prediction involves the inverse of the N × N matrix K_{XX} + σ^2 I, which requires O(N^3) computation. For large datasets, it is clear that we need to avoid the cubic complexity by resorting to approximations.
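To make the cost concrete, here is a minimal NumPy sketch of the exact posterior in Eq. (1); the RBF kernel, the helper name `rbf`, and the toy data are illustrative choices for this sketch, not taken from the paper's code.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """k(a, b) = variance * exp(-||a - b||^2 / (2 * lengthscale^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))           # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
Xs = np.linspace(-3, 3, 100)[:, None]          # test inputs
sigma2 = 0.1**2

Kxx = rbf(X, X) + sigma2 * np.eye(len(X))      # K_XX + sigma^2 I
Ksx = rbf(Xs, X)                               # K_X*X
Kss = rbf(Xs, Xs)

L = np.linalg.cholesky(Kxx)                    # the O(N^3) step sparse methods avoid
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean = Ksx @ alpha                             # posterior mean at X*
V = np.linalg.solve(L, Ksx.T)
cov = Kss - V.T @ V                            # posterior covariance at X*
```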

Inducing points have played a central role in previous works on scalable GP inference. The general idea is to summarize f with a small number of variables u = [f(z_1), ..., f(z_M)]^T, where Z = [z_1, ..., z_M] is a set of parameters called inducing points that live in the input space. The augmented joint distribution over (f, u) is p(f, u) = p(f | u) p(u), where p(u) = N(0, K_{uu}) and K_{uu} denotes the kernel matrix of the inducing points, with the (i, j)th entry given by k(z_i, z_j). There is a long history of developing sparse approximations for GPs by making different independence assumptions in the conditional distribution p(f | u) to reduce the computational cost (Quiñonero-Candela and Rasmussen, 2005b). However, these methods modify the GP prior and tend to suffer from degeneracy and overfitting problems.

Sparse variational GP (SVGP) methods, first proposed in Titsias (2009) and later extended to minibatch training and nonconjugate likelihoods (Hensman et al., 2013, 2015a), provide an elegant solution to these problems. By reformulating posterior inference as variational inference and restricting the variational distribution to be q(f, u) = p(f | u) q(u), the variational lower bound obtained from minimizing KL[q(f, u) || p(f, u | y)] simplifies to:

L_SVGP = E_{q(f)}[log p(y | f)] − KL[q(u) || p(u)],    (2)

where q(f) = ∫ p(f | u) q(u) du. For GP regression the bound has a collapsed form obtained by solving for the optimal q(u) and plugging it into (2) (Titsias, 2009):

L_Titsias = log N(y | 0, Q_{XX} + σ^2 I) − (1 / 2σ^2) tr(K_{XX} − Q_{XX}),    (3)

where Q_{XX} = K_{Xu} K_{uu}^{-1} K_{uX}. Computing this objective requires O(NM^2) operations, in contrast to the O(N^3) complexity of exact inference. The inducing points Z can be learned as variational parameters by maximizing the lower bound. More generally, if we do not collapse q(u) we obtain the uncollapsed bound (2), which is suitable for minibatch training and non-Gaussian likelihoods (Hensman et al., 2013, 2015a).
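The collapsed bound in Eq. (3) can be sketched in a few lines of NumPy, reusing `rbf`, X, y, and sigma2 from the snippet above and assuming SciPy is available; Z is an illustrative set of M inducing points. For clarity the N × N matrix Q_{XX} is formed explicitly, whereas an efficient implementation would use the matrix inversion lemma to reach the O(NM^2) cost stated above.

```python
from scipy.stats import multivariate_normal

M = 10
Z = np.linspace(-3, 3, M)[:, None]             # illustrative inducing points

Kuu = rbf(Z, Z) + 1e-6 * np.eye(M)             # jitter for numerical stability
Kuf = rbf(Z, X)
Luu = np.linalg.cholesky(Kuu)
A = np.linalg.solve(Luu, Kuf)                  # Luu^{-1} K_uf
Qff = A.T @ A                                  # Q_XX = K_Xu K_uu^{-1} K_uX

bound = (multivariate_normal.logpdf(y, mean=np.zeros(len(X)),
                                    cov=Qff + sigma2 * np.eye(len(X)))
         - 0.5 / sigma2 * (np.trace(rbf(X, X)) - np.trace(Qff)))
```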

3 SOLVE-GP

Despite the success of SVGP methods, their O(M^3) scaling with the number of inducing points makes it difficult for the flexibility of the posterior approximation to grow with the dataset size. In this section, we present a new framework, called Sparse OrthogonaL Variational infErence for Gaussian Processes (SOLVE-GP), to address this challenge. Our framework allows the use of a second set of inducing points at a lower computational cost than the standard SVGP methods.

3.1 Reinterpreting SVGP

We start by reinterpreting SVGP methods using a simple reparameterization, which will then lead us to possible ways of improving the approximation. First we notice that the covariance of the conditional distribution p(f | u) = N(K_{Xu} K_{uu}^{-1} u, K_{XX} − K_{Xu} K_{uu}^{-1} K_{uX}) does not depend on u. (Note that kernel matrices like K_{uu} depend on Z rather than u; the subscript only indicates that this is the covariance matrix of u.) Since the Gaussian distribution belongs to the location-scale family, samples from p(f | u) can be reparameterized as

f = K_{Xu} K_{uu}^{-1} u + f_⊥,    f_⊥ ~ N(0, K_{XX} − K_{Xu} K_{uu}^{-1} K_{uX}).    (4)

The reason for denoting the zero-mean component as f_⊥ shall become clear later. Now we can reparameterize the augmented prior distribution p(f, u) = p(f | u) p(u) as

u ~ N(0, K_{uu}),    f_⊥ ~ N(0, K_{XX} − K_{Xu} K_{uu}^{-1} K_{uX}),    f = K_{Xu} K_{uu}^{-1} u + f_⊥,    (5)

and the joint distribution of the GP model becomes

p(y, u, f_⊥) = p(y | K_{Xu} K_{uu}^{-1} u + f_⊥) p(u) p(f_⊥).    (6)

Posterior inference for f in the original model then turns into inference for u and f_⊥. If we approximate the above GP model by considering a factorised approximation q(u, f_⊥) = q(u) p(f_⊥), where q(u) is a variational distribution and p(f_⊥) is the prior distribution of f_⊥ that also appears in Eq. (6), we arrive at the standard SVGP method. To see this, note that minimizing KL[q(u) p(f_⊥) || p(u, f_⊥ | y)] is equivalent to maximizing the variational lower bound

E_{q(u) p(f_⊥)}[log p(y | K_{Xu} K_{uu}^{-1} u + f_⊥)] − KL[q(u) || p(u)],

which is the SVGP objective (Eq. (2)) under the reparameterization in Eq. (4).
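A quick numerical sanity check of this reparameterization, reusing the objects from the previous snippets: sampling u and f_⊥ independently as in Eq. (4) and combining them reproduces the prior covariance of f. The sample size below is an arbitrary illustrative choice.

```python
Kxx_prior = rbf(X, X)
Kfu = Kuf.T
Qxx = Kfu @ np.linalg.solve(Kuu, Kuf)          # Q_XX = K_Xu K_uu^{-1} K_uX
C = Kxx_prior - Qxx                            # covariance of f_perp

S = 20000
u_s = np.linalg.cholesky(Kuu) @ rng.standard_normal((M, S))
Lc = np.linalg.cholesky(C + 1e-6 * np.eye(len(X)))
fperp_s = Lc @ rng.standard_normal((len(X), S))
f_s = Kfu @ np.linalg.solve(Kuu, u_s) + fperp_s

emp_cov = np.cov(f_s)                          # approaches K_XX as S grows
print(np.max(np.abs(emp_cov - Kxx_prior)))     # small Monte Carlo error
```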

Under this interpretation of the standard SVGP method, it becomes clear that we can modify the form of the variational distribution to improve the accuracy of the posterior approximation. There are two natural options: (i) keep p(f_⊥) as part of the approximation and alter q(u) so that it has some dependence on f_⊥, and (ii) keep f_⊥ independent from u, and replace p(f_⊥) with a more structured variational distribution q(f_⊥). Both options lead to new bounds and more accurate approximations than the standard method, though we will defer the discussion of (i) to appendix A and focus on (ii) because it is amenable to large-scale training, as we will show next.

3.2 Orthogonal Decomposition

As suggested in section 3.1, we consider improving the variational distribution for f_⊥. However, the complexity of inferring f_⊥ is the same as for f and thus cubic in N. Resolving the problem requires a better understanding of the reparameterization we used in section 3.1.

The key observation here is that the reparameterization of f into u and f_⊥ in Eq. (5) corresponds to an orthogonal decomposition in the function space. Recall that the GP prior is a distribution over functions f: X → R. Consider the subspace spanned by the kernel basis functions indexed by the inducing points Z:

V = span{ k(z_1, ·), ..., k(z_M, ·) }.

Samples from the GP prior can be decomposed (Cheng and Boots, 2016; Hensman et al., 2017) as

f(·) = f_∥(·) + f_⊥(·),    f_∥ ∈ V,    f_⊥ ⊥ V.    (7)

Since f_∥ ∈ V, we let f_∥(·) = Σ_{m=1}^M a_m k(z_m, ·). By ⟨f_⊥, k(z_m, ·)⟩ = 0 for all m, where ⟨·, ·⟩ is an inner product satisfying the reproducing property ⟨f, k(z, ·)⟩ = f(z), we can solve for the coefficients: a = K_{uu}^{-1} u. Interestingly, we can check that f_⊥ is itself a sample from a GP with a zero mean function and covariance function k_⊥(x, x') = k(x, x') − k(x, Z) K_{uu}^{-1} k(Z, x'), where k(x, Z) = [k(x, z_1), ..., k(x, z_M)]. Similarly we can show that f_∥ is a sample from another GP, and we denote these two independent GPs as f_∥ and f_⊥:

f_∥ ~ GP(0, k(x, Z) K_{uu}^{-1} k(Z, x')),    (8)
f_⊥ ~ GP(0, k(x, x') − k(x, Z) K_{uu}^{-1} k(Z, x')).    (9)

Marginalizing out the two GPs at the training points X, it is easy to show that

f_∥(X) = K_{Xu} K_{uu}^{-1} u ~ N(0, K_{Xu} K_{uu}^{-1} K_{uX}),    (10)
f_⊥(X) ~ N(0, K_{XX} − K_{Xu} K_{uu}^{-1} K_{uX}).    (11)

This is exactly the decomposition we used in section 3.1, and the fact that f_⊥ in Eq. (5) denotes the function values of the orthogonal process at X becomes clear.
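A small sketch of the orthogonal kernel k_⊥ in Eq. (9), assuming the `rbf` kernel from before; as expected from the decomposition, it vanishes (up to jitter) whenever one of its arguments lies in Z.

```python
def k_perp(A, B, Z, k=rbf):
    """Orthogonal kernel k_perp(a, b) = k(a, b) - k(a, Z) K_uu^{-1} k(Z, b)."""
    Kzz = k(Z, Z) + 1e-6 * np.eye(len(Z))
    return k(A, B) - k(A, Z) @ np.linalg.solve(Kzz, k(Z, B))

print(np.max(np.abs(k_perp(Z, X, Z))))         # ~0: the orthogonal process is zero at Z
```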

3.3 SOLVE-GP Lower Bound

The orthogonal decomposition described in the previous section gives new insights into improving the variational distribution for f_⊥. Specifically, we can introduce a second set of inducing points O = [o_1, ..., o_{M_2}] with inducing variables v_⊥ = f_⊥(O) to approximate the orthogonal process f_⊥, as illustrated in Fig. 1. We call this second set the orthogonal inducing points. The joint model distribution is then

p(y, u, v_⊥, f_⊥) = p(y | K_{Xu} K_{uu}^{-1} u + f_⊥) p(f_⊥ | v_⊥) p(v_⊥) p(u).    (12)

First notice that standard SVGP methods correspond to using the variational distribution q(u, v_⊥, f_⊥) = q(u) p(v_⊥) p(f_⊥ | v_⊥). To obtain better approximations we can replace the prior factor p(v_⊥) with a tunable variational factor q(v_⊥):

q(u, v_⊥, f_⊥) = q(u) q(v_⊥) p(f_⊥ | v_⊥).    (13)

This gives the SOLVE-GP variational lower bound:

L_SOLVE-GP = E_{q(u) q(v_⊥) p(f_⊥ | v_⊥)}[log p(y | K_{Xu} K_{uu}^{-1} u + f_⊥)] − KL[q(u) || p(u)] − KL[q(v_⊥) || p(v_⊥)].    (14)

To intuitively understand the improvement over the standard SVGP methods, we derive a collapsed bound for GP regression using this approach and compare it to the Titsias (2009) bound. Setting q(v_⊥) = N(m_v, S_v), plugging in the optimal q(u), and simplifying (see appendix B), gives the collapsed bound

L = log N(y | C_{Xv} C_{vv}^{-1} m_v, Q_{XX} + σ^2 I) − (1 / 2σ^2) tr(Σ_⊥) − KL[q(v_⊥) || p(v_⊥)],    (15)

where C_{XX} = K_{XX} − Q_{XX} is the kernel matrix of the orthogonal process on the training inputs, and similarly for the other C matrices (e.g., C_{vv} = k_⊥(O, O) and C_{Xv} = k_⊥(X, O)); Σ_⊥ is the covariance of the orthogonal predictive distribution q(f_⊥) = ∫ p(f_⊥ | v_⊥) q(v_⊥) dv_⊥.

This bound can now be tighter than the Titsias (2009) bound. For example, notice that when q(v_⊥) is equal to the prior p(v_⊥) = N(0, C_{vv}), i.e., m_v = 0 and S_v = C_{vv}, the bound in (15) reduces to the one in (3). Another interesting special case arises when the variational distribution has the same covariance matrix as the prior (i.e., S_v = C_{vv}), while the mean m_v is learnable. Then the bound becomes

L = log N(y | C_{Xv} C_{vv}^{-1} m_v, Q_{XX} + σ^2 I) − (1 / 2σ^2) tr(K_{XX} − Q_{XX}) − (1/2) m_v^T C_{vv}^{-1} m_v.    (16)

Here we see that the second set of inducing variables mostly determines the mean prediction over y, which is zero in the Titsias (2009) bound (Eq. (3)).

In the general setting, Eq. (14) can be maximized using minibatch training at O(M^3 + M_2^3) cost per gradient update, plus terms that are linear in the minibatch size. This is because the M_2 × M_2 matrix C_{vv} = K_{OO} − K_{Ou} K_{uu}^{-1} K_{uO} can be formed and inverted cheaply once the Cholesky decomposition of K_{uu} has been pre-computed. For the function values f_* at test data points X_*, the predictive density given by our approximate posterior can be found in appendix C.
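The collapsed bound in Eq. (15), as reconstructed above, can be sketched directly in NumPy; the orthogonal inducing points O and the variational parameters (m_v, S_v) below are illustrative, and the snippet reuses `rbf`, `k_perp`, X, y, Z, sigma2, Qff, and multivariate_normal from the earlier snippets.

```python
M2 = 10
O = rng.uniform(-3, 3, size=(M2, 1))           # illustrative orthogonal inducing points
m_v = np.zeros(M2)
S_v = np.eye(M2) * 0.1

Cvv = k_perp(O, O, Z) + 1e-6 * np.eye(M2)
Cxv = k_perp(X, O, Z)
Cxx = k_perp(X, X, Z)

mu_perp = Cxv @ np.linalg.solve(Cvv, m_v)      # mean of q(f_perp)
B = Cxv @ np.linalg.inv(Cvv)
Sigma_perp = Cxx - B @ Cxv.T + B @ S_v @ B.T   # covariance of q(f_perp)

# KL( N(m_v, S_v) || N(0, C_vv) )
Lv = np.linalg.cholesky(Cvv)
alpha_v = np.linalg.solve(Lv, m_v)
kl_v = 0.5 * (np.trace(np.linalg.solve(Cvv, S_v)) + alpha_v @ alpha_v
              - M2 + np.linalg.slogdet(Cvv)[1] - np.linalg.slogdet(S_v)[1])

bound_solve = (multivariate_normal.logpdf(y, mean=mu_perp,
                                          cov=Qff + sigma2 * np.eye(len(X)))
               - 0.5 / sigma2 * np.trace(Sigma_perp) - kl_v)
```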

3.4 SOLVE-GP as Structured Covariance Approximation

Our method introduces another set of inducing points to improve the variational approximation. One natural question to ask is: how does this compare to the standard SVGP algorithm with the inducing points chosen to be the union of the two sets? In this section we answer this question by interpreting our method as using a structured covariance in the variational approximation for SVGP, which allows a larger set of inducing points at the same computational cost.

To obtain this structured covariance matrix, we need to express our variational approximation w.r.t. the original GP. Let v = f(O) denote the function outputs of the original process at the orthogonal inducing points. We have the following relationship between v and v_⊥:

v = K_{Ou} K_{uu}^{-1} u + v_⊥.    (17)

By change-of-variable we compute the joint variational distribution over u and v that corresponds to the factorized q(u) q(v_⊥):

q(u, v) = q(u) q_{v_⊥}(v − K_{Ou} K_{uu}^{-1} u).    (18)

Using Gaussian identities, with q(u) = N(m_u, S_u) and q(v_⊥) = N(m_v, S_v), we can show that q(u, v) is a Gaussian distribution with mean [m_u; K_{Ou} K_{uu}^{-1} m_u + m_v] and covariance matrix

[ S_u                        S_u K_{uu}^{-1} K_{uO}                                  ]
[ K_{Ou} K_{uu}^{-1} S_u     K_{Ou} K_{uu}^{-1} S_u K_{uu}^{-1} K_{uO} + S_v ].

Interestingly, despite being an (M + M_2) × (M + M_2) matrix, it can be inverted with O(M^3 + M_2^3) computation, which gives a 4x speed-up over a fully parameterized multivariate Gaussian distribution for q(u, v) when M_2 = M.
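A sketch of the implied joint distribution q(u, v) under the reconstruction above; m_u and S_u are illustrative parameters of q(u), reusing Kuu, O, m_v, and S_v from the earlier snippets. The block structure makes it clear why only the M × M and M_2 × M_2 blocks ever need to be factorized.

```python
m_u = np.zeros(M)
S_u = np.eye(M) * 0.5                          # illustrative q(u) parameters

Kou = rbf(O, Z)
W = Kou @ np.linalg.inv(Kuu)                   # K_Ou K_uu^{-1}

mean_uv = np.concatenate([m_u, W @ m_u + m_v])
cov_uv = np.block([[S_u,      S_u @ W.T],
                   [W @ S_u,  W @ S_u @ W.T + S_v]])

# cov_uv never needs to be inverted directly: by the block structure, working with
# S_u and S_v separately costs O(M^3 + M2^3) instead of O((M + M2)^3).
```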

4 Extensions

A direct extension of SOLVE-GP is to introduce more than two sets of inducing points by repeatedly applying the orthogonal decomposition; however, this adds more complexity to the implementation. Below we show that the SOLVE-GP framework can be easily extended to different GP models where the standard SVGP method applies.

Inter-domain Inducing Points and Convolutional GP.

Similar to SVGP methods, SOLVE-GP can deal with inter-domain inducing points (Lázaro-Gredilla and Figueiras-Vidal, 2009), which lie in a different domain from the input space. The inducing variables u, which we used to represent outputs of the GP at the inducing points, are now defined as u = [g(z_1), ..., g(z_M)]^T, where g is a different function from f that takes inputs in the domain of the inducing points. In convolutional GPs (van der Wilk et al., 2017), the input domain is the space of images, while the inducing points live in the space of image patches. The convolutional GP function is defined as

f(x) = Σ_{p=1}^P w_p g(x^[p]),    g ~ GP(0, k_g),    (19)

where x^[p] is the pth patch of the image x and {w_p} are the weights assigned to different patches. In SOLVE-GP, we can choose either Z, O, or both to be inter-domain, as long as we can compute the covariances between u, v_⊥, and f. When applied to convolutional GP models, we set both Z and O to be collections of image patches. Examples of the covariance matrices we need for this model include K_{uu} and the cross-covariance between u and f, which can be computed as

[K_{uu}]_{mm'} = k_g(z_m, z_{m'}),    (20)
Cov(g(z_m), f(x_n)) = Σ_{p=1}^P w_p k_g(z_m, x_n^[p]),    (21)

and similarly for the covariances involving the orthogonal inducing patches O.
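A sketch of these inter-domain covariances for a convolutional GP, assuming an RBF patch-response kernel (the `rbf` helper above); the image size, patch size, number of inducing patches, and uniform patch weights are illustrative, not the configuration used in the experiments.

```python
def extract_patches(img, patch=5):
    """Return all patch x patch patches of a 2-D image, flattened into rows."""
    H, W = img.shape
    ps = [img[i:i + patch, j:j + patch].ravel()
          for i in range(H - patch + 1) for j in range(W - patch + 1)]
    return np.stack(ps)                        # (P, patch*patch)

img = rng.standard_normal((28, 28))            # toy "image"
patches = extract_patches(img)                 # P patches of the image x
w = np.ones(len(patches)) / len(patches)       # patch weights w_p

Zp = rng.standard_normal((16, 25))             # 16 inducing patches (5x5, flattened)
Kuu_conv = rbf(Zp, Zp)                         # covariance between inducing patches
kuf_x = rbf(Zp, patches) @ w                   # Cov(u, f(x)) = sum_p w_p k_g(z_m, x^[p])
```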
Deep GP.

We show that we can integrate SOLVE-GP with the popular doubly stochastic variational inference algorithm for deep GPs (Salimbeni and Deisenroth, 2017). The joint distribution of an L-layer deep GP model with inducing variables in all layers is

p(y, {f^l, u^l}_{l=1}^L) = p(y | f^L) Π_{l=1}^L p(f^l | u^l, f^{l−1}) p(u^l),

where we define f^0 = X and f^l is the output of the lth-layer GP. The doubly stochastic algorithm applies SVGP methods to each layer, conditioned on samples from the variational distribution in the previous layer. The variational distribution over {f^l, u^l} is q = Π_{l=1}^L p(f^l | u^l, f^{l−1}) q(u^l). This gives a similar objective as in the single-layer case (Eq. (2)):

L_DGP = E_{q(f^L)}[log p(y | f^L)] − Σ_{l=1}^L KL[q(u^l) || p(u^l)],    (22)

where q(f^L) is the marginal of the final-layer outputs under the variational distribution, estimated by ancestral sampling through the layers. Extending this using SOLVE-GP is straightforward by introducing orthogonal inducing variables v_⊥^l for all layers, which gives the lower bound:

L = E_{q(f^L)}[log p(y | f^L)] − Σ_{l=1}^L ( KL[q(u^l) || p(u^l)] + KL[q(v_⊥^l) || p(v_⊥^l)] ),    (23)

where the expression of q(f^L) under the SOLVE-GP parameterization is given in appendix D.
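The doubly stochastic scheme underlying Eqs. (22)–(23) can be sketched as layer-by-layer sampling; the snippet below assumes single-output layers and the `rbf` kernel from earlier, and the per-layer inducing inputs and variational parameters in `layers` are hypothetical placeholders, not the paper's configuration.

```python
def svgp_layer_sample(H, Z_l, m_l, S_l, rng, k=rbf):
    # Sample f_l(H) from q(f_l) = \int p(f_l | u_l) q(u_l) du_l, independently per input.
    Kzz = k(Z_l, Z_l) + 1e-6 * np.eye(len(Z_l))
    Khz = k(H, Z_l)
    A = np.linalg.solve(Kzz, Khz.T).T                  # K_hZ K_ZZ^{-1}
    mean = A @ m_l
    var = (np.diag(k(H, H)) - np.sum(A * Khz, axis=1)
           + np.sum((A @ S_l) * A, axis=1))
    return mean + np.sqrt(np.maximum(var, 1e-12)) * rng.standard_normal(len(H))

layers = [(np.linspace(-3, 3, 8)[:, None], np.zeros(8), 0.1 * np.eye(8))
          for _ in range(2)]                           # two hypothetical GP layers
H = X                                                  # inputs to the first layer
for Z_l, m_l, S_l in layers:
    H = svgp_layer_sample(H, Z_l, m_l, S_l, rng)[:, None]
# Monte Carlo estimates of E_q[log p(y | f^L)] average this sampling over many passes.
```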

5 Related Work

Many approximate algorithms have been proposed to overcome the computational limitations of GPs. The simplest of these are based on sub-sampling data points, and include naive subset-of-data training (Rasmussen and Williams, 2006) as well as the Nyström approximation to the kernel matrix (Williams and Seeger, 2001). Better sparse approximations can be constructed by learning a set of inducing points to summarize the entire dataset. As introduced in section 2, these works can be divided into sparse approximations to the GP prior (SoR, DTC, FITC, etc.) (Quiñonero-Candela and Rasmussen, 2005b), and sparse variational methods (Titsias, 2009; Hensman et al., 2013, 2015a).

Recently there have been many attempts to reduce the complexity of inference when using a large set of inducing points. A notable line of work (Wilson and Nickisch, 2015; Evans and Nair, 2018; Gardner et al., 2018) involves imposing grid structures on the locations of Z to perform fast structure-exploiting computations. However, to get such computational benefits Z needs to be fixed due to the structure constraints, and the approach often suffers from the curse of dimensionality in the input space. Another direction for allowing the use of more inducing points is the decoupled method, first proposed in Cheng and Boots (2017), where two different sets of inducing points are used for modeling the mean and the covariance function. This gives linear complexity in the number of mean inducing points, which allows using many more of them. Despite the increasing interest in decoupled inducing points (Havasi et al., 2018; Salimbeni et al., 2018), the method has not been well understood due to its complexity. We found that our SOLVE-GP framework is closely connected to a recent development of decoupled methods: the orthogonal decoupled variational GP (ODVGP) (Salimbeni et al., 2018), as explained next.

Connection with decoupled inducing points.

If we set the covariance and additional mean-only inducing points in ODVGP (Salimbeni et al., 2018) to be Z and O, respectively, their approach becomes equivalent to using the following variational distribution over (u, v): a Gaussian with mean [m_u; K_{Ou} K_{uu}^{-1} m_u + m_v] and covariance matrix

[ S_u                        S_u K_{uu}^{-1} K_{uO}                                    ]
[ K_{Ou} K_{uu}^{-1} S_u     K_{Ou} K_{uu}^{-1} S_u K_{uu}^{-1} K_{uO} + C_{vv} ].

By comparing this covariance matrix to the one in section 3.4, we can see that we generalize their method by introducing S_v, which replaces the original residual C_{vv} = K_{OO} − K_{Ou} K_{uu}^{-1} K_{uO}, so that we allow more flexible covariance modeling while still keeping the block structure that facilitates cheap inversion. This implies that ODVGP is a special case of SOLVE-GP obtained by restricting q(v_⊥) to have the same covariance as the prior p(v_⊥).

Besides inducing points, another way to construct sparse approximations is by examining the weight-space representation of GPs, i.e., Bayesian linear regression in the kernel feature space (Rasmussen and Williams, 2006). Relevance vector machines (Tipping, 2001) use finite basis functions as features, while sparse spectrum GPs (Lázaro-Gredilla et al., 2010) use random Fourier features. These two methods approximate the prior distribution. It is also possible to use the weight-space representation to approximate the posterior distribution of GPs by variational inference (Shi et al., 2019).

6 Experiments

Since ODVGP is a special case of SOLVE-GP, throughout this section we describe methods by the sizes of their inducing sets: SVGP by the number of inducing points, and ODVGP and SOLVE-GP by the number of standard inducing points together with the number of additional (mean-only or orthogonal, respectively) inducing points.

6.1 1D Regression

Figure 2: Posterior processes on the Snelson dataset: (a) SVGP with 5 inducing points, (b) ODVGP with 5 inducing points plus 100 additional mean inducing points, (c) SOLVE-GP with 5 inducing points plus 5 orthogonal inducing points, (d) SVGP with 10 inducing points. Shaded bands show predictive intervals of a few standard deviations around the mean. The learned locations of Z are shown at the bottom of each panel; blue and dark triangles correspond to O in ODVGP and SOLVE-GP, respectively.
Figure 3: Changes of test RMSE and predictive log likelihood during training on (a) Protein and (b) HouseElectric.

We begin by illustrating our method on Snelson's 1D regression problem (Snelson and Ghahramani, 2006) with 100 training points and minibatch size 20. We compare the following methods: SVGP with 5 and 10 inducing points, ODVGP with 5 inducing points plus 100 additional mean inducing points, and SOLVE-GP with 5 inducing points plus 5 orthogonal inducing points.

The results are plotted in Fig. 2. First we can see that 5 inducing points are insufficient to summarize the training set: SVGP with 5 inducing points cannot fit the data well and underestimates the variance in regions beyond the training data. Increasing the number of inducing points to 10 fixes the issues, but requires 8x more computation than using 5 inducing points (in practice the cost is negligible in this toy problem, but we are analyzing the theoretical complexity). The decoupled formulation provides a cheaper alternative, and we have tried ODVGP with 100 additional inducing points for modeling the mean function. Comparing Fig. 2(a) and Fig. 2(b), we can see that this results in a much better fit for the mean function. However, the model still overestimates the predictive variance. As ODVGP is a special case of the SOLVE-GP framework, we can improve over it in terms of covariance modeling. As seen in Fig. 2(c), adding 5 orthogonal inducing points closely approximates the results of SVGP with 10 inducing points, with only a 2-fold increase in time complexity relative to SVGP with 5.

6.2 (Deep) Convolutional Gaussian Process

One class of applications that benefits from the SOLVE-GP framework is the training of large, hierarchical GP models where the true posterior distribution is difficult to approximate with a small number of inducing points. Convolutional GPs (van der Wilk et al., 2017) and their deep variants (Blomqvist et al., 2018; Dutordoir et al., 2019) are such models. There, the inducing points are feature detectors, much like CNN filters, and play a critical role in predictive performance. As introduced in section 4, it is straightforward to apply SOLVE-GP to these models.

Convolutional GP.

We train convolutional GP models on the CIFAR-10 dataset, using GPs with TICK kernels (Dutordoir et al., 2019) to define the patch response functions. We compare SVGP with 1K and 2K inducing points, SOLVE-GP with 1K standard plus 1K orthogonal inducing points, and an SVGP variant with more inducing points chosen so that its running time on GPU is similar to that of the SOLVE-GP algorithm. Results are shown in Table 1. Clearly SOLVE-GP outperforms SVGP with 1K inducing points. Under a fair comparison in terms of running time, it also outperforms the time-matched SVGP. SOLVE-GP also performs on par with the 4x more expensive SVGP with 2K inducing points, which is very encouraging. This indicates that the structured covariance approximation is fairly accurate even for this large, non-conjugate model.

Deep Convolutional GPs.

We further extend SOLVE-GP to deep convolutional GPs using the techniques described in section 4. We experiment with 2-layer and 3-layer models that have 1K inducing points in the output layer and 384 inducing points in each lower layer. Results are summarized in Table 3. These models are already very expensive on a single GPU, as shown by the time per iteration. SOLVE-GP allows us to double the number of inducing points in each layer with only a 2-fold increase in computation. This gives superior performance in both accuracy and test predictive likelihood. The double-size SVGP results took a week to run and are included only for comparison purposes.

As shown above, on both single-layer and deep convolutional GPs, we improve the state-of-the-art results for CIFAR-10 classification by 3-4 percentage points. This leads to more than 80% accuracy on CIFAR-10 with a purely GP-based model, without any neural network components, closing the gap between GP/kernel regression and the CNN baselines presented in Novak et al. (2019); Arora et al. (2019). Note that all the results are obtained without data augmentation.

Method                  Test Acc   Test LL   Time
SVGP (1K)               66.07%     -1.59     0.241 s/iter
SVGP (time-matched)     67.18%     -1.54     0.380 s/iter
SOLVE-GP (1K + 1K)      68.19%     -1.51     0.370 s/iter
SVGP (2K)*              68.06%     -1.48     0.474 s/iter
Table 1: Convolutional GPs for CIFAR-10 classification. The asterisk marks the 4x more expensive SVGP with twice as many inducing points; the time-matched SVGP uses a larger inducing set chosen to match SOLVE-GP's running time. Previous SOTA is 64.6% by SVGP with 1K inducing points (van der Wilk et al., 2017).
Dataset: Kin40k, Protein, KeggDirected, KEGGU, 3dRoad, Song, Buzz, HouseElectric
N: 25,600 29,267 31,248 40,708 278,319 329,820 373,280 1,311,539
d: 8 9 20 27 3 90 77 9
SVGP 0.0938(0.0056) -0.9628(0.0124) 0.9673(0.0111) 0.6784(0.0085) -0.6981(0.0051) -1.1934(0.0019) -0.0793(0.0040) 1.3036(0.0044)
SVGP (time-matched) 0.1287(0.0067) -0.9490(0.0116) 0.9442(0.0133) 0.6734(0.0100) -0.6744(0.0056) -1.1927(0.0018) -0.0786(0.0041) 1.3040(0.0069)
ODVGP 0.1372(0.0061) -0.9558(0.0116) -0.1988(0.1499) 0.1054(0.0739) -0.6644(0.0062) -1.1932(0.0016) -0.0783(0.0026) 1.3170(0.0052)
ODVGP (larger additional set) 0.1444(0.0040) -0.9460(0.0108) -0.1364(0.1416) 0.1091(0.0747) -0.6568(0.0067) -1.1929(0.0016) -0.0789(0.0029) 1.3188(0.0086)
SOLVE-GP 0.1868(0.0050) -0.9429(0.0110) 0.9730(0.0073) 0.6804(0.0074) -0.6587(0.0034) -1.1918(0.0019) -0.0711(0.0033) 1.3332(0.0058)
SVGP* 0.1374(0.0057) -0.9402(0.0112) 0.9071(0.0071) 0.6648(0.0099) -0.6689(0.0049) -1.1924(0.0018) -0.0788(0.0039) 1.3036(0.0056)
Table 2: Test log likelihood values on the regression datasets (standard errors in parentheses; higher is better). The asterisk marks the 4x more expensive SVGP with twice as many inducing points; the other variants are described in section 6.3.
(a) 2-layer model
Method     SVGP (384, 1K)    SOLVE-GP (384+384, 1K+1K)    SVGP (768, 2K)*
Test Acc   76.35%            77.8%                        77.46%
Test LL    -1.04             -0.98                        -0.98
Time       0.392 s/iter      0.657 s/iter                 1.104 s/iter

(b) 3-layer model
Method     SVGP (384, 384, 1K)    SOLVE-GP (384+384, 384+384, 1K+1K)    SVGP (768, 768, 2K)*
Test Acc   78.76%            80.3%                        80.33%
Test LL    -0.88             -0.79                        -0.82
Time       0.418 s/iter      0.752 s/iter                 1.246 s/iter

Table 3: Deep convolutional GPs for CIFAR-10 classification. Configurations list the number of inducing points per layer (lower layers first, output layer last); the asterisk marks the double-size SVGP. Previous SOTA is 76.17% by a 3-layer model with 384 inducing points in all layers (Dutordoir et al., 2019).

6.3 Regression Benchmarks

Besides the classification experiments, we evaluate our method on 10 regression datasets, with sizes ranging from tens of thousands to millions of data points. Results of exact GP regression have been reported on these datasets using distributed training (Wang et al., 2019). The settings follow Wang et al. (2019) and are described in detail in appendix E. We implemented SVGP, ODVGP, and SOLVE-GP with matched inducing-set sizes, as well as an SVGP variant with more inducing points that has roughly the same training time per iteration on GPU as the SOLVE-GP objective. An attractive property of ODVGP is that, by restricting the covariance of q(v_⊥) to be the same as the prior covariance, it can use a far larger additional set, because the complexity is linear in its size and sub-sampling of the orthogonal inducing points can be used for each gradient update. Thus for a fair comparison we also include an ODVGP variant with a much larger additional set, where in each iteration a subset of size 1024 is sampled from the orthogonal inducing points to estimate the gradient. Other experiment details can be found in appendix E.

We report the predictive log likelihoods on test data in Table 2. Due to space limits we leave the results on two small datasets (Elevators, Bike) to appendix F. We can see that the performance of SOLVE-GP is competitive with the 4x more expensive double-size SVGP. Perhaps surprisingly, while SOLVE-GP uses a less flexible covariance in the variational distribution, it often outperforms the double-size SVGP. We believe this is due to optimization difficulties introduced by the large covariance matrix; we shall analyze this on the HouseElectric dataset later. On most datasets, using a large number of additional inducing points for modeling the mean function did improve performance, as shown by the comparison between the two ODVGP variants. More flexible covariance modeling seems to be more essential, though, as SOLVE-GP outperforms both ODVGP variants on all datasets except 3dRoad.

In Fig. 3 we plot the change of test RMSE and test log likelihood during training on Protein and HouseElectric. Interestingly, on both datasets the ODVGP variant with the larger additional set gets very good performance at the beginning, then slowly converges to less competitive results. The beginning stage is likely where the additional inducing points give good predictions but are not in the best configuration for maximizing the training lower bound. This phenomenon is also observed on Elevators and Kin40k. We believe such mismatch between the training lower bound and predictive performance is caused by fixing the covariance matrix of q(v_⊥) to the prior covariance. On HouseElectric, the double-size SVGP does not improve over the smaller SVGP variants and is outperformed by SOLVE-GP. As previously mentioned, this might be due to difficulties in optimizing large covariance matrices. To verify this, we tried the "whitening" trick (Murray and Adams, 2010; Hensman et al., 2015b) that is often used to improve the optimization of SVGP methods by reducing the correlation in the posterior distribution of u. As expected, the result of the double-size SVGP then becomes similar to SOLVE-GP, outperforming all other methods.
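For reference, a minimal sketch of the whitening parameterization mentioned above, assuming q(u) is expressed through a whitened variable w with u = L_uu w, where L_uu L_uu^T = K_uu; the variational parameters m_w and L_w are illustrative, and Kuu and M come from the earlier snippets.

```python
L_uu = np.linalg.cholesky(Kuu)
m_w, L_w = np.zeros(M), np.eye(M)          # q(w) = N(m_w, L_w L_w^T), prior on w is N(0, I)
m_u_impl = L_uu @ m_w                      # implied mean of q(u)
S_u_impl = L_uu @ (L_w @ L_w.T) @ L_uu.T   # implied covariance of q(u)
# Optimizing (m_w, L_w) against a standard-normal prior is typically better conditioned
# than optimizing (m_u, S_u) against N(0, K_uu) directly.
```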

7 Conclusion

We proposed SOLVE-GP, a new variational inference framework for GPs using inducing points that unifies and generalizes previous sparse variational methods. It increases the number of inducing points we can use for a fixed computational budget, which allows us to improve the performance of large, hierarchical GP models at a manageable computational cost. Future work includes experiments on challenging datasets like ImageNet and investigating other ways to improve the variational distribution, as mentioned in section 3.1.

Acknowledgements

We thank Alex Matthews and Yutian Chen for helpful suggestions on improving the paper.

Bibliography

  • Arora et al. (2019) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.
  • Blomqvist et al. (2018) Kenneth Blomqvist, Samuel Kaski, and Markus Heinonen. Deep convolutional Gaussian processes. arXiv preprint arXiv:1810.03052, 2018.
  • Bui et al. (2017) Thang D. Bui, Josiah Yan, and Richard E. Turner. A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research, 18(104):1–72, 2017.
  • Cheng and Boots (2016) Ching-An Cheng and Byron Boots. Incremental variational sparse Gaussian process regression. In Advances in Neural Information Processing Systems, pages 4410–4418, 2016.
  • Cheng and Boots (2017) Ching-An Cheng and Byron Boots. Variational inference for Gaussian process models with linear complexity. In Advances in Neural Information Processing Systems, pages 5184–5194, 2017.
  • Csato and Opper (2002) L. Csato and M. Opper. Sparse online Gaussian processes. Neural Computation, 14:641–668, 2002.
  • Damianou et al. (2016) Andreas C. Damianou, Michalis K. Titsias, and Neil D. Lawrence. Variational inference for latent variables and uncertain inputs in Gaussian processes. Journal of Machine Learning Research, 17(42):1–62, 2016.
  • Deisenroth and Rasmussen (2011) Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, pages 465–472, 2011.
  • Dutordoir et al. (2019) Vincent Dutordoir, Mark van der Wilk, Artem Artemev, Marcin Tomczak, and James Hensman. Translation insensitivity for deep convolutional Gaussian processes. arXiv preprint arXiv:1902.05888, 2019.
  • Evans and Nair (2018) Trefor Evans and Prasanth Nair. Scalable Gaussian processes with grid-structured eigenfunctions (GP-GRIEF). In International Conference on Machine Learning, pages 1416–1425, 2018.
  • Gardner et al. (2018) Jacob Gardner, Geoff Pleiss, Ruihan Wu, Kilian Weinberger, and Andrew Wilson. Product kernel interpolation for scalable Gaussian processes. In Artificial Intelligence and Statistics, pages 1407–1416, 2018.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Artificial Intelligence and Statistics, pages 249–256, 2010.
  • Havasi et al. (2018) Marton Havasi, José Miguel Hernández-Lobato, and Juan José Murillo-Fuentes. Deep Gaussian processes with decoupled inducing inputs. arXiv preprint arXiv:1801.02939, 2018.
  • Hensman et al. (2013) James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
  • Hensman et al. (2015a) James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In Artificial Intelligence and Statistics, pages 351–360, 2015a.
  • Hensman et al. (2015b) James Hensman, Alexander G Matthews, Maurizio Filippone, and Zoubin Ghahramani. MCMC for variationally sparse Gaussian processes. In Advances in Neural Information Processing Systems, pages 1648–1656, 2015b.
  • Hensman et al. (2017) James Hensman, Nicolas Durrande, and Arno Solin. Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18(151):1–151, 2017.
  • Hernández-Lobato et al. (2011) Daniel Hernández-Lobato, José M Hernández-Lobato, and Pierre Dupont. Robust multi-class Gaussian process classification. In Advances in Neural Information Processing Systems, pages 280–288, 2011.
  • Lawrence et al. (2002) N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: the informative vector machine. In Advances in Neural Information Processing Systems. MIT Press, 2002.
  • Lawrence (2005) Neil Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6(Nov):1783–1816, 2005.
  • Lázaro-Gredilla and Figueiras-Vidal (2009) Miguel Lázaro-Gredilla and Anibal Figueiras-Vidal. Inter-domain Gaussian processes for sparse inference using inducing features. In Advances in Neural Information Processing Systems, pages 1087–1095, 2009.
  • Lázaro-Gredilla et al. (2010) Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Aníbal R Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1865–1881, 2010.
  • Murray and Adams (2010) Iain Murray and Ryan P Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems, pages 1732–1740, 2010.
  • Novak et al. (2019) Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. In International Conference on Learning Representations, 2019.
  • Quiñonero-Candela and Rasmussen (2005a) J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005a.
  • Quiñonero-Candela and Rasmussen (2005b) Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005b.
  • Rasmussen and Williams (2006) Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • Salimbeni and Deisenroth (2017) Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pages 4588–4599, 2017.
  • Salimbeni et al. (2018) Hugh Salimbeni, Ching-An Cheng, Byron Boots, and Marc Deisenroth. Orthogonally decoupled variational Gaussian processes. In Advances in Neural Information Processing Systems, pages 8711–8720, 2018.
  • Seeger et al. (2003) M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Ninth International Workshop on Artificial Intelligence. MIT Press, 2003.
  • Shi et al. (2019) Jiaxin Shi, Mohammad Emtiyaz Khan, and Jun Zhu. Scalable training of inference networks for Gaussian-process models. In International Conference on Machine Learning, pages 5758–5768, 2019.
  • Snelson and Ghahramani (2006) Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.
  • Srinivas et al. (2010) Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning, pages 1015–1022, 2010.
  • Tipping (2001) Michael E Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1(Jun):211–244, 2001.
  • Titsias (2009) Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.
  • van der Wilk et al. (2017) Mark van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional Gaussian processes. In Advances in Neural Information Processing Systems, pages 2849–2858, 2017.
  • Wang et al. (2019) Ke Alexander Wang, Geoff Pleiss, Jacob R Gardner, Stephen Tyree, Kilian Q Weinberger, and Andrew Gordon Wilson. Exact Gaussian processes on a million data points. arXiv preprint arXiv:1903.08114, 2019.
  • Williams and Barber (1998) Christopher KI Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
  • Williams and Rasmussen (1996) Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for regression. In Advances in Neural Information Processing Systems, pages 514–520, 1996.
  • Williams and Seeger (2001) Christopher KI Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.
  • Wilson and Nickisch (2015) Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775–1784, 2015.


Appendix A Tighter Sparse Variational Bounds for GP Regression

As mentioned in section 3.1, another direction to improve the variational distribution in SVGP is to add some dependence between u and f_⊥. The best possible approximation of this family is obtained by setting the conditional factor over u to the exact posterior conditional p(u | f_⊥, y). The corresponding collapsed bound for GP regression can be derived by analytically marginalising out u from the joint model in Eq. (6),

p(y | f_⊥) = N(y | f_⊥, Q_{XX} + σ^2 I),    (24)
p(f_⊥) = N(0, K_{XX} − Q_{XX}),    (25)

and then forcing the approximation q(f_⊥) = p(f_⊥):

L = E_{p(f_⊥)}[log N(y | f_⊥, Q_{XX} + σ^2 I)].    (26)

This bound has a closed form:

L = log N(y | 0, Q_{XX} + σ^2 I) − (1/2) tr[(Q_{XX} + σ^2 I)^{-1}(K_{XX} − Q_{XX})].    (27)

Applying the matrix inversion lemma to (Q_{XX} + σ^2 I)^{-1}, we have an equivalent form that can be directly compared with Eq. (3):

L = log N(y | 0, Q_{XX} + σ^2 I) − (1 / 2σ^2) tr(K_{XX} − Q_{XX}) + (1 / 2σ^2) tr[K_{Xu}(σ^2 K_{uu} + K_{uX} K_{Xu})^{-1} K_{uX}(K_{XX} − Q_{XX})],    (28)

where the first two terms recover Eq. (3) and the last term is non-negative, showing that this is a tighter bound than the Titsias (2009) bound. This bound is not amenable to large-scale datasets because of the O(N^2) storage requirement and the computational time dominated by the matrix multiplication K_{uX}(K_{XX} − Q_{XX}). However, it is still of theoretical interest and can be applied to medium-sized regression datasets, just like the SGPR algorithm based on the Titsias (2009) bound.
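Under the reconstruction above, the bound in Eq. (27) can be sketched by reusing rbf, X, y, sigma2, Qff, and multivariate_normal from the earlier snippets; the dense formulation is kept for clarity rather than efficiency.

```python
Kxx_full = rbf(X, X)
Qn = Qff + sigma2 * np.eye(len(X))
bound_tight = (multivariate_normal.logpdf(y, mean=np.zeros(len(X)), cov=Qn)
               - 0.5 * np.trace(np.linalg.solve(Qn, Kxx_full - Qff)))
# bound_tight >= the Titsias bound computed earlier, since the correction term in
# Eq. (28) is non-negative.
```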

Appendix B The Collapsed SOLVE-GP Lower Bound

We derive the collapsed SOLVE-GP lower bound in Eq. (15) by seeking the optimal q(u), which is independent of v_⊥. First we rearrange the terms in the uncollapsed SOLVE-GP bound (Eq. (14)) as

L = E_{q(u)}[ E_{q(v_⊥) p(f_⊥ | v_⊥)}[log p(y | K_{Xu} K_{uu}^{-1} u + f_⊥)] ] − KL[q(u) || p(u)] − KL[q(v_⊥) || p(v_⊥)].    (29)

Let μ_⊥ and Σ_⊥ denote the mean and covariance matrix of the variational predictive distribution of the orthogonal process at the training inputs:

q(f_⊥) = ∫ p(f_⊥ | v_⊥) q(v_⊥) dv_⊥ = N(μ_⊥, Σ_⊥).    (30)

We can compute them as

μ_⊥ = C_{Xv} C_{vv}^{-1} m_v,    (31)
Σ_⊥ = C_{XX} − C_{Xv} C_{vv}^{-1} C_{vX} + C_{Xv} C_{vv}^{-1} S_v C_{vv}^{-1} C_{vX}.    (32)

In the first term of Eq. (29) we can simplify the expectation over q(v_⊥) and p(f_⊥ | v_⊥) as:

E_{q(v_⊥) p(f_⊥ | v_⊥)}[log N(y | K_{Xu} K_{uu}^{-1} u + f_⊥, σ^2 I)]    (33)
= E_{q(f_⊥)}[log N(y | K_{Xu} K_{uu}^{-1} u + f_⊥, σ^2 I)]    (34)
= E_{q(f_⊥)}[ −(N/2) log(2πσ^2) − (1 / 2σ^2) ||y − K_{Xu} K_{uu}^{-1} u − f_⊥||^2 ]    (35)
= −(N/2) log(2πσ^2) − (1 / 2σ^2) ( ||y − K_{Xu} K_{uu}^{-1} u − μ_⊥||^2 + tr(Σ_⊥) )    (36)
= log N(y − μ_⊥ | K_{Xu} K_{uu}^{-1} u, σ^2 I) − (1 / 2σ^2) tr(Σ_⊥).    (37)

Plugging into Eq. (29), maximizing over q(u) in closed form, and rearranging the terms, we arrive at the collapsed bound in Eq. (15).