Sparse Orthogonal Variational Inference for Gaussian Processes (SOLVE-GP)
We introduce a new interpretation of sparse variational approximations for Gaussian processes using inducing points, which can lead to more scalable algorithms than previous methods. It is based on decomposing a Gaussian process as a sum of two independent processes: one in the subspace spanned by the inducing basis and the other in the orthogonal complement to this subspace. We show that this formulation recovers existing approximations and at the same time allows us to obtain tighter lower bounds on the marginal likelihood and new stochastic variational inference algorithms. We demonstrate the efficiency of these algorithms in several Gaussian process models ranging from standard regression to multiclass classification using (deep) convolutional Gaussian processes, and report state-of-the-art results on CIFAR-10 with purely GP-based models.
Gaussian processes (GPs) (Rasmussen and Williams, 2006) are nonparametric models for representing distributions over functions. As a generalization of multivariate Gaussian distributions to infinite dimensions, the simplicity and elegance of these models has led to their wide adoption in uncertainty estimation for machine learning, including supervised learning (Williams and Rasmussen, 1996; Williams and Barber, 1998), sequential decision making (Srinivas et al., 2010), model-based planning (Deisenroth and Rasmussen, 2011), and unsupervised data analysis (Lawrence, 2005; Damianou et al., 2016). Despite the successful application of these models, they suffer from O(N³) computation and O(N²) storage requirements given N training data points, which has motivated a large body of research on sparse GP methods (Csató and Opper, 2002; Lawrence et al., 2002; Seeger et al., 2003; Quiñonero-Candela and Rasmussen, 2005a; Titsias, 2009; Hensman et al., 2013; Bui et al., 2017). GPs have also been unfavourably compared to deep learning models for lacking representation learning capabilities.
Sparse variational GP (SVGP) methods (Titsias, 2009; Hensman et al., 2013, 2015a), based on variational learning of inducing points, have shown promise in addressing these limitations. Such methods make no change to the prior distribution of the GP model but enforce sparse structure in the posterior approximation through variational inference. This gives O(NM² + M³) computation and O(NM + M²) storage with M ≪ N inducing points. Moreover, this approach allows us to perform minibatch training by subsampling data points. Successful applications of such methods have allowed scalable GP models to be trained on billions of data points (Salimbeni and Deisenroth, 2017). These advances in inference methods have also led to more flexibility in model design. A recent convolutional GP model (van der Wilk et al., 2017)
encodes translation invariance by summing over GPs that take image patches as inputs. The inducing points, which can be interpreted as image patches in this model, play a role similar to that of convolutional filters in neural networks. This work demonstrated that it is possible to implement representation learning in GP models. Further extensions of such models into deep hierarchies (Blomqvist et al., 2018; Dutordoir et al., 2019) have significantly boosted the performance of GPs applied to natural images.

As these works suggest, the biggest challenge in this area still lies in scalable inference. In particular, for the widely used SVGP methods the computational cost is cubic in the number of inducing points, making it difficult to improve the flexibility of the posterior approximation. For example, state-of-the-art models like deep convolutional GPs used only 384 inducing points for inference in each layer to keep the running time manageable (Dutordoir et al., 2019).
In this paper, we introduce a new variational inference framework, called SOLVE-GP, which increases the number of inducing points we can use for a given computational budget. Our framework is based on the observation that SVGP methods can be reinterpreted through an orthogonal decomposition of the GP prior, illustrated in Fig. 1. By introducing a second set of inducing variables for the GP that lives in the orthogonal complement, we can increase the number of inducing points at a much lower additional computational cost. For instance, doubling the number of inducing points leads to a 2-fold increase in computational cost with our method, compared to an 8-fold increase for the original SVGP method. We show that our framework is equivalent to a structured covariance approximation for SVGP defined over the union of the two sets of inducing points. Interestingly, under this interpretation the SOLVE-GP framework can be seen as a generalization of the recently proposed decoupled-inducing-points method (Salimbeni et al., 2018). As the decoupled method often comes with a complex dual formulation in an RKHS, our framework provides a simpler derivation and a more intuitive understanding of it.
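The cost comparison above follows from the cubic scaling of the matrix factorizations involved. A back-of-the-envelope sketch (ours, not from the paper; M is an illustrative size):

```python
M = 512                      # inducing points per set (illustrative)
svgp = M**3                  # dominant cost of SVGP with M points
svgp_2m = (2 * M)**3         # SVGP over the union of 2M points
solve_gp = 2 * M**3          # two independent M x M factorizations

assert svgp_2m == 8 * svgp   # 8-fold increase for SVGP
assert solve_gp == 2 * svgp  # 2-fold increase for SOLVE-GP
```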
We conducted experiments on convolutional GPs and their deep variants. To the best of our knowledge, we are the first to train a purely GP-based model, without any neural network components, that achieves over 80% test accuracy on CIFAR-10. No data augmentation was used to obtain these results. Besides classification, we also evaluated our method on a range of regression datasets with sizes ranging from tens of thousands to millions of data points. Our results show that SOLVE-GP is often competitive with the 4x more expensive SVGP counterpart that uses the same total number of inducing points, and that it outperforms SVGP given the same computational budget.
Here, we briefly review Gaussian processes and sparse variational GP methods. A GP is an uncountable collection of random variables indexed by a real-valued vector x, of which any finite subset has a multivariate Gaussian distribution. It is defined by a mean function m(x) and a covariance function k(x, x′):

f ~ GP(m(x), k(x, x′)).

Let X = [x_1, …, x_N]ᵀ be (the matrix containing) the training data points and f = f(X) denote the corresponding function values. Similarly we denote the test data points by X_* and their function values by f_* = f(X_*). Then, assuming m(x) = 0, the joint distribution over [f, f_*] is given by:

[f, f_*] ~ N(0, [[K_ff, K_f*], [K_*f, K_**]]),

where K_ff is an N × N kernel matrix with its (i, j)-th entry k(x_i, x_j), and similarly K_f* = k(X, X_*), K_** = k(X_*, X_*). In practice we often observe the training function values through some noisy labels y, generated by the likelihood function p(y | f). For regression, the likelihood usually models independent Gaussian observation noise: p(y | f) = N(y | f, σ²I). In this situation the exact posterior distribution can be computed in closed form:

p(f_* | y) = N(f_* | K_*f (K_ff + σ²I)⁻¹ y, K_** − K_*f (K_ff + σ²I)⁻¹ K_f*).   (1)

As seen from Eq. (1), exact GP prediction involves the inverse of the N × N matrix K_ff + σ²I, which requires O(N³) computation. For large datasets, it is clear that we need to avoid this cubic complexity by resorting to approximations.
Inducing points have played a central role in previous works on scalable GP inference. The general idea is to summarize f with a small number of variables u = f(Z), where Z = [z_1, …, z_M]ᵀ is a set of parameters called inducing points in the input space. The augmented joint distribution over [f, u] is p(f, u) = p(f | u) p(u), where p(u) = N(0, K_uu) and K_uu denotes the kernel matrix of inducing points with the (i, j)-th entry corresponding to k(z_i, z_j). There is a long history of developing sparse approximations for GPs by making different independence assumptions for the conditional distribution p(f | u) to reduce the computational cost (Quiñonero-Candela and Rasmussen, 2005b). However, these methods modify the GP prior and tend to suffer from degeneracy and overfitting problems.
Sparse variational GP methods (SVGP), first proposed in Titsias (2009) and later extended to minibatch training and non-conjugate likelihoods (Hensman et al., 2013, 2015a), provide an elegant solution to these problems. By reformulating posterior inference as variational inference and restricting the variational distribution to be q(f, u) = p(f | u) q(u), the variational lower bound for minimizing KL(q(f, u) ‖ p(f, u | y)) simplifies to:

L = E_{q(f,u)}[log p(y | f)] − KL(q(u) ‖ p(u)).   (2)
For GP regression the bound has a collapsed form obtained by solving for the optimal q(u) and plugging it into (2) (Titsias, 2009):

L = log N(y | 0, Q_ff + σ²I) − (1/(2σ²)) tr(K_ff − Q_ff),   (3)

where Q_ff = K_fu K_uu⁻¹ K_uf. Computing this objective requires O(NM² + M³) operations, in contrast to the O(N³) complexity of exact inference. The inducing points Z can be learned as variational parameters by maximizing the lower bound. More generally, if we do not collapse q(u) we obtain the uncollapsed bound (2), which is suitable for minibatch training and non-Gaussian likelihoods (Hensman et al., 2013, 2015a).
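The collapsed bound can be checked numerically: for any choice of Z it lower-bounds the exact log marginal likelihood. A small sketch under an assumed RBF kernel (our own illustration, not from the paper):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def rbf(A, B, ls=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2 / ls**2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, 50)                 # N = 50 training inputs
Z = np.linspace(-3, 3, 8)                  # M = 8 inducing points
y = np.sin(X) + 0.1 * rng.standard_normal(50)
s2 = 0.1                                   # noise variance sigma^2

Kff = rbf(X, X)
Kfu = rbf(X, Z)
Kuu = rbf(Z, Z) + 1e-8 * np.eye(8)
Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)    # Nystrom matrix Q_ff

bound = mvn.logpdf(y, np.zeros(50), Qff + s2 * np.eye(50)) \
        - np.trace(Kff - Qff) / (2 * s2)   # collapsed bound, Eq. (3)
exact = mvn.logpdf(y, np.zeros(50), Kff + s2 * np.eye(50))
assert bound <= exact                      # never exceeds log p(y)
```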
Despite the success of SVGP methods, their cubic complexity in M makes it difficult for the flexibility of the posterior approximation to grow with the dataset size. In this section, we present a new framework called Sparse OrthogonaL Variational infErence for Gaussian Processes (SOLVE-GP) to address this challenge. Our framework allows the use of another set of inducing points at a lower computational cost than the standard SVGP methods.
We start by reinterpreting SVGP methods using a simple reparameterization, which will then lead us to possible ways of improving the approximation. First we notice that the covariance of the conditional distribution p(f | u) = N(f | K_fu K_uu⁻¹ u, K_ff − Q_ff) does not depend on u.¹ Since the Gaussian distribution belongs to the location-scale family, samples from p(f | u) can be reparameterized as

f = K_fu K_uu⁻¹ u + f_⊥,   f_⊥ ~ N(0, K_ff − Q_ff).   (4)

The reason for denoting the zero-mean component as f_⊥ shall become clear later. Now we can reparameterize the augmented prior distribution p(f, u) as

u ~ N(0, K_uu),   f_⊥ ~ N(0, K_ff − Q_ff),   f = K_fu K_uu⁻¹ u + f_⊥,   (5)

and the joint distribution of the GP model becomes

p(y, u, f_⊥) = p(y | K_fu K_uu⁻¹ u + f_⊥) p(u) p(f_⊥).   (6)

¹Note that kernel matrices like K_uu depend on Z rather than u; the subscript only indicates that this is the covariance matrix of u.
Posterior inference for f in the original model then turns into inference for u and f_⊥. If we approximate the above GP model by considering a factorized approximation q(u, f_⊥) = q(u) p(f_⊥), where q(u) is a variational distribution and p(f_⊥) is the prior distribution of f_⊥ that appears also in Eq. (6), we arrive at the standard SVGP method. To see this, note that minimizing KL(q(u, f_⊥) ‖ p(u, f_⊥ | y)) is equivalent to maximizing the variational lower bound

L = E_{q(u) p(f_⊥)}[log p(y | K_fu K_uu⁻¹ u + f_⊥)] − KL(q(u) ‖ p(u)),

which is the SVGP objective (Eq. (2)) under the reparameterization in Eq. (4).
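The equivalence relies only on the location-scale reparameterization, which can be verified by simulation: samples built as K_fu K_uu⁻¹ u + f_⊥ from the two independent Gaussians should reproduce the prior covariance K_ff. A small Monte Carlo check (ours, illustrative):

```python
import numpy as np

def rbf(A, B, ls=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2 / ls**2)

rng = np.random.default_rng(2)
X = np.linspace(-2, 2, 4)                  # 4 "training" inputs
Z = np.array([-1.0, 0.5, 1.5])             # 3 inducing points
Kff, Kfu = rbf(X, X), rbf(X, Z)
Kuu = rbf(Z, Z) + 1e-9 * np.eye(3)
A = Kfu @ np.linalg.inv(Kuu)               # K_fu K_uu^{-1}
Cperp = Kff - A @ Kfu.T                    # covariance of f_perp

# draw u ~ N(0, K_uu) and f_perp ~ N(0, Cperp) independently
Lu = np.linalg.cholesky(Kuu)
Lp = np.linalg.cholesky(Cperp + 1e-6 * np.eye(4))
S = 200_000
u = Lu @ rng.standard_normal((3, S))
fp = Lp @ rng.standard_normal((4, S))
f = A @ u + fp                             # reparameterized samples of f
emp = f @ f.T / S                          # empirical covariance
assert np.max(np.abs(emp - Kff)) < 0.03    # matches the prior covariance
```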
Under this interpretation of the standard SVGP method, it becomes clear that we can modify the form of the variational distribution q(u, f_⊥) to improve the accuracy of the posterior approximation. There are two natural options: (i) keep q(u) as part of the approximation and alter the factor for f_⊥ so that it has some dependence on u, and (ii) keep f_⊥ independent of u, and replace p(f_⊥) with a more structured variational distribution q(f_⊥). Both options lead to new bounds and more accurate approximations than the standard method, though we will defer the discussion of (i) to appendix A and focus on (ii), because it is amenable to large-scale training, as we will show next.
As suggested in section 3.1, we consider improving the variational distribution for f_⊥. However, naively inferring f_⊥ has the same complexity as inferring f and is thus cubic in N. Resolving this problem requires a better understanding of the reparameterization we used in section 3.1.
The key observation here is that the reparameterization of f into K_fu K_uu⁻¹ u and f_⊥ in Eq. (5) corresponds to an orthogonal decomposition in the function space. Recall that the GP prior is a distribution over functions f(·). Consider the subspace spanned by the kernel basis functions indexed by the M inducing points:

V = span{k(z_1, ·), …, k(z_M, ·)}.

Samples from the GP prior can be decomposed (Cheng and Boots, 2016; Hensman et al., 2017) as

f(·) = f_∥(·) + f_⊥(·),   f_∥ ∈ V,   f_⊥ ⊥ V.   (7)

Since f_∥ ∈ V, we let f_∥(·) = Σ_m a_m k(z_m, ·). By ⟨f_⊥, k(z_m, ·)⟩ = 0, where ⟨·, ·⟩ is the inner product with the reproducing property ⟨f, k(x, ·)⟩ = f(x), we can solve for the coefficients: a = K_uu⁻¹ f(Z). Interestingly, we can check that f_∥ is itself a sample from a GP with a zero mean function and covariance function k(x, Z) K_uu⁻¹ k(Z, x′). Similarly we can show that f_⊥ is a sample from another GP, and we denote these two independent GPs as

f_∥ ~ GP(0, k(x, Z) K_uu⁻¹ k(Z, x′)),   (8)
f_⊥ ~ GP(0, k(x, x′) − k(x, Z) K_uu⁻¹ k(Z, x′)).   (9)

Marginalizing the two GPs at the training points X, it is easy to show that

f_∥(X) = K_fu K_uu⁻¹ u ~ N(0, Q_ff),   (10)
f_⊥(X) ~ N(0, K_ff − Q_ff).   (11)

This is exactly the decomposition we used in section 3.1, and it becomes clear that f_⊥ denotes the function values of the orthogonal process.
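A quick numerical sanity check of the decomposition (ours, not from the paper): the orthogonal kernel in Eq. (9) vanishes whenever one argument is an inducing point, so the orthogonal process carries exactly the variation that u = f(Z) cannot represent:

```python
import numpy as np

def rbf(A, B, ls=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2 / ls**2)

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, 25)
Z = np.linspace(-3, 3, 6)
Kuu = rbf(Z, Z) + 1e-12 * np.eye(6)

# k_perp(Z, X) = k(Z, X) - k(Z, Z) K_uu^{-1} k(Z, X) should be zero
Cperp_ZX = rbf(Z, X) - rbf(Z, Z) @ np.linalg.solve(Kuu, rbf(Z, X))
assert np.max(np.abs(Cperp_ZX)) < 1e-6
# hence f_perp(z) = 0 for all z in Z: the two processes are complementary
```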
The orthogonal decomposition described in the previous section gives new insights for improving the variational distribution for f_⊥. Specifically, we can introduce a second set of inducing variables u_⊥ = f_⊥(O) to approximate the orthogonal process f_⊥, as illustrated in Fig. 1. We call this second set O = [o_1, …, o_{M_⊥}]ᵀ the orthogonal inducing points. The joint model distribution is then

p(y, u, f_⊥, u_⊥) = p(y | K_fu K_uu⁻¹ u + f_⊥) p(u) p(f_⊥ | u_⊥) p(u_⊥).   (12)

First notice that standard SVGP methods correspond to using the variational distribution q(u) p(f_⊥ | u_⊥) p(u_⊥). To obtain better approximations, we can replace the prior factor p(u_⊥) with a tunable variational factor q(u_⊥):

q(u, f_⊥, u_⊥) = q(u) p(f_⊥ | u_⊥) q(u_⊥).   (13)

This gives the SOLVE-GP variational lower bound:

L = E_q[log p(y | K_fu K_uu⁻¹ u + f_⊥)] − KL(q(u) ‖ p(u)) − KL(q(u_⊥) ‖ p(u_⊥)).   (14)
To intuitively understand the improvement over the standard SVGP methods, we derive a collapsed bound for GP regression using this approach and compare it to the Titsias (2009) bound. Setting q(u_⊥) = N(m_⊥, S_⊥), plugging in the optimal q(u), and simplifying (see appendix B) gives the collapsed bound

L = log N(y | μ_⊥, Q_ff + σ²I) − (1/(2σ²)) tr(Σ_⊥) − KL(q(u_⊥) ‖ p(u_⊥)),   (15)

where C_ff = K_ff − Q_ff is the kernel matrix of the orthogonal process on the training inputs, and similarly C_fo, C_oo for the other matrices; μ_⊥ = C_fo C_oo⁻¹ m_⊥ and Σ_⊥ = C_ff − C_fo C_oo⁻¹ (C_oo − S_⊥) C_oo⁻¹ C_of are the mean and covariance of the orthogonal predictive distribution q(f_⊥) = ∫ p(f_⊥ | u_⊥) q(u_⊥) du_⊥.

This bound can now be tighter than the Titsias (2009) bound. For example, notice that when q(u_⊥) is equal to the prior p(u_⊥) = N(0, C_oo), i.e., m_⊥ = 0 and S_⊥ = C_oo, the bound in (15) reduces to the one in (3). Another interesting special case arises when the variational distribution has the same covariance matrix as the prior (i.e., S_⊥ = C_oo), while the mean m_⊥ is learnable. Then the bound becomes

L = log N(y | C_fo C_oo⁻¹ m_⊥, Q_ff + σ²I) − (1/(2σ²)) tr(K_ff − Q_ff) − ½ m_⊥ᵀ C_oo⁻¹ m_⊥.   (16)

Here we see that the second set of inducing variables mostly determines the mean prediction over y, which is zero in the Titsias (2009) bound (Eq. (3)).
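As a consistency check of the collapsed bound in the form given above, setting q(u_⊥) to the prior N(0, C_oo) should recover the Titsias (2009) bound exactly. A sketch with an assumed RBF kernel and illustrative data (our own check, not code from the paper):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def rbf(A, B, ls=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2 / ls**2)

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, 40)
Z, O = np.linspace(-3, 3, 5), np.linspace(-2.5, 2.5, 5)
y = np.sin(X) + 0.1 * rng.standard_normal(40)
s2 = 0.1                                    # noise variance

Kuu = rbf(Z, Z) + 1e-8 * np.eye(5)
Kfu = rbf(X, Z)
Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
def cperp(A, B):                            # orthogonal kernel k_perp
    return rbf(A, B) - rbf(A, Z) @ np.linalg.solve(Kuu, rbf(Z, B))
Cff, Cfo = cperp(X, X), cperp(X, O)
Coo = cperp(O, O) + 1e-8 * np.eye(5)

def solve_gp_collapsed(m_perp, S_perp):     # Eq. (15) as written above
    B = np.linalg.solve(Coo, Cfo.T).T       # C_fo C_oo^{-1}
    mu = B @ m_perp
    Sigma = Cff - B @ (Coo - S_perp) @ B.T
    kl = 0.5 * (np.trace(np.linalg.solve(Coo, S_perp))
                + m_perp @ np.linalg.solve(Coo, m_perp) - len(m_perp)
                + np.linalg.slogdet(Coo)[1] - np.linalg.slogdet(S_perp)[1])
    return (mvn.logpdf(y, mu, Qff + s2 * np.eye(40))
            - np.trace(Sigma) / (2 * s2) - kl)

titsias = mvn.logpdf(y, np.zeros(40), Qff + s2 * np.eye(40)) \
          - np.trace(Cff) / (2 * s2)        # note tr(C_ff) = tr(K_ff - Q_ff)
# with q(u_perp) = p(u_perp), SOLVE-GP recovers Eq. (3)
assert np.isclose(solve_gp_collapsed(np.zeros(5), Coo), titsias)
```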
In the general setting, Eq. (14) can be maximized using minibatch training at a cost of roughly O(M³ + M_⊥³) per gradient update (plus terms linear in the minibatch size). This is because the inverse of the M_⊥ × M_⊥ matrix C_oo can be computed in O(M_⊥³) time once we have precomputed the Cholesky decomposition of K_uu. For the function values f_* at test data points X_*, the predictive density given by our approximate posterior can be found in appendix C.
Our method introduces another set of inducing points to improve the variational approximation. One natural question to ask is: how does this compare to the standard SVGP algorithm with the inducing points chosen to be the union of the two sets? In this section we answer this question by interpreting our method as using a structured covariance in the variational approximation for SVGP, one that allows a larger set of inducing points at the same computational cost.
To obtain this structured covariance matrix, we need to express our variational approximation w.r.t. the original GP. Let v = f(O) denote the function outputs of the original process at the orthogonal inducing points. We have the following relationship between v and u_⊥:

v = K_vu K_uu⁻¹ u + u_⊥,   (17)

where K_vu = k(O, Z). By change-of-variable we compute the joint variational distribution over u and v that corresponds to the factorized q(u) q(u_⊥) = N(m, S) N(m_⊥, S_⊥):

q(u, v) = N([m; K_vu K_uu⁻¹ m + m_⊥], Σ).   (18)

Using Gaussian identities we can show that q(u, v) is a Gaussian distribution with the above mean and covariance matrix

Σ = [[S, S K_uu⁻¹ K_uv], [K_vu K_uu⁻¹ S, K_vu K_uu⁻¹ S K_uu⁻¹ K_uv + S_⊥]].

Interestingly, despite Σ being an (M + M_⊥) × (M + M_⊥) matrix, it can be inverted with O(M³ + M_⊥³) computation, which gives a 4x speedup over a fully parameterized multivariate Gaussian distribution for q(u, v) when M_⊥ = M.
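The cheap inversion comes from the block structure: the Schur complement of the top-left block of Σ is exactly S_⊥, so Σ⁻¹ only requires S⁻¹ and S_⊥⁻¹. A small numerical verification (A stands in for K_vu K_uu⁻¹; all matrices are random stand-ins of our own making):

```python
import numpy as np

rng = np.random.default_rng(5)
M = 6
def rand_spd(n):
    W = rng.standard_normal((n, n))
    return W @ W.T + n * np.eye(n)

S, Sp = rand_spd(M), rand_spd(M)        # covariances of q(u), q(u_perp)
A = rng.standard_normal((M, M))         # stands in for K_vu K_uu^{-1}

Sigma = np.block([[S,      S @ A.T],
                  [A @ S,  A @ S @ A.T + Sp]])
Si, Spi = np.linalg.inv(S), np.linalg.inv(Sp)
# block inverse: the Schur complement of the top-left block is exactly Sp
Sigma_inv = np.block([[Si + A.T @ Spi @ A, -A.T @ Spi],
                      [-Spi @ A,            Spi]])
assert np.allclose(Sigma @ Sigma_inv, np.eye(2 * M), atol=1e-7)
```

Inverting the two M × M blocks costs 2·O(M³), versus O((2M)³) = 8·O(M³) for a dense parameterization, which is the 4x saving referred to above.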
A direct extension of SOLVE-GP is to introduce more than two sets of inducing points by repeatedly applying the orthogonal decomposition; however, this adds more complexity to the implementation. Below we show that the SOLVE-GP framework can be easily extended to the different GP models where the standard SVGP method applies.
Similar to SVGP methods, SOLVE-GP can deal with inter-domain inducing points (Lázaro-Gredilla and Figueiras-Vidal, 2009), which lie in a different domain from the input space. The inducing variables u, which we used to represent outputs of the GP at the inducing points, are now defined as u = g(Z), where g is a different function from f that takes inputs in the domain of the inducing points. In convolutional GPs (van der Wilk et al., 2017), the input domain is the space of images, while the inducing points are in the space of image patches. The convolutional GP function is defined as

f(x) = Σ_{p=1}^P w_p g(x^[p]),   (19)

where x^[p] is the p-th patch in the image x, g is the patch response function with kernel k_g, and {w_p} are the weights assigned to different patches. In SOLVE-GP, we can choose either Z, O, or both to be inter-domain as long as we can compute the covariances between u, u_⊥, and f. When applied to convolutional GP models, we set both Z and O to be collections of image patches. Examples of the covariance matrices we need for this model can be computed as

Cov[u_m, f(x)] = Σ_{p=1}^P w_p k_g(z_m, x^[p]),   (20)
Cov[f(x), f(x′)] = Σ_{p=1}^P Σ_{p′=1}^P w_p w_{p′} k_g(x^[p], x′^[p′]).   (21)
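A sketch of the inter-domain covariance computations in Eqs. (20)–(21), with random images and patches standing in for real data (the patch extractor, weights, and kernel choice are our own illustrative assumptions):

```python
import numpy as np

def rbf(a, b, ls=1.0):
    d2 = ((a[:, None, :] - b[None, :, :])**2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def patches(x, size=3):
    """All overlapping size x size patches of a square image, flattened."""
    H = x.shape[0]
    return np.array([x[i:i + size, j:j + size].ravel()
                     for i in range(H - size + 1) for j in range(H - size + 1)])

rng = np.random.default_rng(6)
x1, x2 = rng.standard_normal((5, 5)), rng.standard_normal((5, 5))
Z = rng.standard_normal((4, 9))        # 4 inducing patches (3x3, flattened)
P1, P2 = patches(x1), patches(x2)      # 9 patches per image
w = np.ones(len(P1)) / len(P1)         # patch weights w_p (uniform here)

# Cov[u_m, f(x)] = sum_p w_p k_g(z_m, x^[p])            -- Eq. (20)
Kuf_col = rbf(Z, P1) @ w               # one column of K_uf, shape (4,)
# Cov[f(x), f(x')] = sum_p sum_p' w_p w_p' k_g(., .)    -- Eq. (21)
kff_entry = w @ rbf(P1, P2) @ w        # one entry of K_ff
```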
We now show that SOLVE-GP integrates with the popular doubly stochastic variational inference algorithm for deep GPs (Salimbeni and Deisenroth, 2017). The joint distribution of an L-layer deep GP model with inducing variables in all layers is

p(y, {f^ℓ, u^ℓ}) = p(y | f^L) Π_{ℓ=1}^L p(f^ℓ | u^ℓ; f^{ℓ−1}) p(u^ℓ),

where we define f^0 = X and f^ℓ is the output of the ℓ-th-layer GP. The doubly stochastic algorithm applies SVGP methods to each layer, conditioned on samples from the variational distribution in the previous layer. The variational distribution over {f^ℓ, u^ℓ} is Π_ℓ p(f^ℓ | u^ℓ; f^{ℓ−1}) q(u^ℓ). This gives a similar objective as in the single-layer case (Eq. (2)):

L = E_q[log p(y | f^L)] − Σ_{ℓ=1}^L KL(q(u^ℓ) ‖ p(u^ℓ)).   (22)

Extending this using SOLVE-GP is straightforward: introducing orthogonal inducing variables u_⊥^ℓ for all layers gives the lower bound

L = E_q[log p(y | f^L)] − Σ_{ℓ=1}^L [KL(q(u^ℓ) ‖ p(u^ℓ)) + KL(q(u_⊥^ℓ) ‖ p(u_⊥^ℓ))],   (23)

where the expression of the variational distribution q is given in appendix D.
Many approximate algorithms have been proposed to overcome the computational limitations of GPs. The simplest of these are based on subsampling data points, including naive subset-of-data training (Rasmussen and Williams, 2006) and the Nyström approximation to the kernel matrix (Williams and Seeger, 2001). Better sparse approximations can be constructed by learning a set of inducing points to summarize the entire dataset. As introduced in section 2, these works can be divided into sparse approximations to the GP prior (SoR, DTC, FITC, etc.) (Quiñonero-Candela and Rasmussen, 2005b) and sparse variational methods (Titsias, 2009; Hensman et al., 2013, 2015a).
Recently there have been many attempts to reduce the cost of using a large set of inducing points. A notable line of work (Wilson and Nickisch, 2015; Evans and Nair, 2018; Gardner et al., 2018) imposes grid structure on the locations of Z to perform fast structure-exploiting computations. However, to get such computational benefits Z needs to be fixed due to the structure constraints, and the approach often suffers from the curse of dimensionality in the input space. Another direction for allowing the use of more inducing points is the decoupled method, first proposed in Cheng and Boots (2017), where two different sets of inducing points are used for modeling the mean and the covariance function. This gives linear complexity in the number of mean inducing points, which allows using many more of them. Despite the increasing interest in decoupled inducing points (Havasi et al., 2018; Salimbeni et al., 2018), the method has not been well understood due to its complexity. We found that our SOLVE-GP framework is closely connected to a recent development of decoupled methods, the orthogonal decoupled variational GP (ODVGP) (Salimbeni et al., 2018), as explained next.

If we set the covariance and additional mean inducing points in ODVGP (Salimbeni et al., 2018) to be Z and O, respectively, their approach becomes equivalent to using the following variational distribution:

q(u, v) = N([m; K_vu K_uu⁻¹ m + m_⊥], [[S, S K_uu⁻¹ K_uv], [K_vu K_uu⁻¹ S, K_vu K_uu⁻¹ S K_uu⁻¹ K_uv + C_oo]]).

By comparing this covariance matrix to ours, we can see that we generalize their method by introducing S_⊥, which replaces the original residual C_oo, so that we allow more flexible covariance modeling while still keeping the block structure that facilitates cheap inversion. This implies that ODVGP is a special case of SOLVE-GP obtained by restricting q(u_⊥) to have the same covariance as the prior.
Besides inducing points, another way to construct sparse approximations is by examining the weight-space representation of GPs, i.e., Bayesian linear regression in the kernel feature space (Rasmussen and Williams, 2006). Relevance vector machines (Tipping, 2001) use finite basis functions as features, while sparse spectrum GPs (Lázaro-Gredilla et al., 2010) use random Fourier features. These two methods approximate the prior distribution. It is also possible to use the weight-space representation to approximate the posterior distribution of GPs by variational inference (Shi et al., 2019).

Since ODVGP is a special case of SOLVE-GP, in what follows we reuse the notation (M, M_⊥) to refer to the corresponding numbers of inducing points in their algorithm.


We begin by illustrating our method on Snelson's 1D regression problem (Snelson and Ghahramani, 2006) with 100 training points and minibatch size 20. We compare the following methods: SVGP with 5 and 10 inducing points, ODVGP (M = 5, M_⊥ = 100), and SOLVE-GP (M = 5, M_⊥ = 5).
The results are plotted in Fig. 2. First we can see that 5 inducing points are insufficient to summarize the training set: SVGP (M = 5) cannot fit the data well and underestimates the variance in regions beyond the training data. Increasing M to 10 fixes these issues, but requires 8x more computation than using 5 inducing points.² The decoupled formulation provides a cheaper alternative, and we have tried ODVGP (M = 5, M_⊥ = 100), which has 100 additional inducing points for modeling the mean function. Comparing Fig. 2(a) and Fig. 2(b), we can see that this results in a much better fit for the mean function. However, the model still overestimates the predictive variance. As ODVGP is a special case of the SOLVE-GP framework, we can improve over it in terms of covariance modeling. As seen in Fig. 2(c), adding 5 orthogonal inducing points closely approximates the results of SVGP (M = 10), with only a 2-fold increase in time complexity relative to SVGP (M = 5).

One class of applications that benefits from the SOLVE-GP framework is the training of large, hierarchical GP models whose true posterior distribution is difficult to approximate with a small number of inducing points. Convolutional GPs (van der Wilk et al., 2017) and their deep variants (Blomqvist et al., 2018; Dutordoir et al., 2019) are such models. There, inducing points are feature detectors just like CNN filters, and they play a critical role in predictive performance. As introduced in section 4, it is straightforward to apply SOLVE-GP to these models.

²In practice the cost difference is negligible in this toy problem, but we are analyzing the theoretical complexity.
We train convolutional GP models on the CIFAR-10 dataset, using GPs with TICK kernels (Dutordoir et al., 2019) to define the patch response functions. We compare SVGP with 1K and 2K inducing points, SOLVE-GP (M = 1K, M_⊥ = 1K), and also an SVGP whose number of inducing points is chosen so that its running time on GPU is similar to that of the SOLVE-GP algorithm. Results are shown in Table 1. Clearly SOLVE-GP outperforms SVGP (1K). Under a fair comparison in terms of running time, it also outperforms the time-matched SVGP. SOLVE-GP also performs on par with the 4x more expensive SVGP (2K), which is very encouraging. This indicates that the structured covariance approximation is fairly accurate even for this large, non-conjugate model.
We further extend SOLVE-GP to deep convolutional GPs using the techniques described in section 4. We experiment with 2-layer and 3-layer models that have 1K inducing points in the output layer and 384 inducing points in each lower layer. Results are summarized in Table 3. These models are already very expensive on a single GPU, as shown by the time per iteration. SOLVE-GP allows us to double the number of inducing points in each layer with only a 2-fold increase in computation. This gives superior performance on both accuracy and test predictive likelihood. The double-size SVGP results took a week to run and are included only for comparison purposes.
As shown above, on both single-layer and deep convolutional GPs, we improve the state-of-the-art results for CIFAR-10 classification by 3–4 percentage points. This leads to more than 80% accuracy on CIFAR-10 with a purely GP-based model, without any neural network components, closing the gap between GP/kernel regression and the CNN baselines presented in Novak et al. (2019) and Arora et al. (2019). Note that all the results are obtained without data augmentation.
Table 1: Convolutional GP results on CIFAR-10.

Method                 Test Acc   Test LL   Time
SVGP (1K)              66.07%     -1.59     0.241 s/iter
SVGP (time-matched)    67.18%     -1.54     0.380 s/iter
SOLVE-GP (1K + 1K)     68.19%     -1.51     0.370 s/iter
SVGP (2K)*             68.06%     -1.48     0.474 s/iter
Table 2: Test predictive log-likelihoods on the regression benchmarks (mean (std)).

Dataset              Kin40k           Protein          KeggDirected     KEGGU            3dRoad           Song             Buzz             HouseElectric
N                    25,600           29,267           31,248           40,708           278,319          329,820          373,280          1,311,539
d                    8                9                20               27               3                90               77               9
SVGP                 0.0938 (0.0056)  0.9628 (0.0124)  0.9673 (0.0111)  0.6784 (0.0085)  0.6981 (0.0051)  1.1934 (0.0019)  0.0793 (0.0040)  1.3036 (0.0044)
SVGP (time-matched)  0.1287 (0.0067)  0.9490 (0.0116)  0.9442 (0.0133)  0.6734 (0.0100)  0.6744 (0.0056)  1.1927 (0.0018)  0.0786 (0.0041)  1.3040 (0.0069)
ODVGP                0.1372 (0.0061)  0.9558 (0.0116)  0.1988 (0.1499)  0.1054 (0.0739)  0.6644 (0.0062)  1.1932 (0.0016)  0.0783 (0.0026)  1.3170 (0.0052)
ODVGP (large M_⊥)    0.1444 (0.0040)  0.9460 (0.0108)  0.1364 (0.1416)  0.1091 (0.0747)  0.6568 (0.0067)  1.1929 (0.0016)  0.0789 (0.0029)  1.3188 (0.0086)
SOLVE-GP             0.1868 (0.0050)  0.9429 (0.0110)  0.9730 (0.0073)  0.6804 (0.0074)  0.6587 (0.0034)  1.1918 (0.0019)  0.0711 (0.0033)  1.3332 (0.0058)
SVGP (2M)*           0.1374 (0.0057)  0.9402 (0.0112)  0.9071 (0.0071)  0.6648 (0.0099)  0.6689 (0.0049)  1.1924 (0.0018)  0.0788 (0.0039)  1.3036 (0.0056)


Besides the classification experiments, we evaluate our method on 10 regression datasets, with sizes ranging from tens of thousands to millions of data points. Results of exact GP regression have been reported on these datasets with distributed training (Wang et al., 2019). The settings follow Wang et al. (2019) and are described in detail in appendix E. We implemented SVGP with M inducing points, ODVGP and SOLVE-GP with M inducing points plus M orthogonal (or additional mean) inducing points, as well as an SVGP with more inducing points chosen to have roughly the same training time per iteration on GPU as the SOLVE-GP objective. An attractive property of ODVGP is that, by restricting the covariance of q(u_⊥) to be the same as the prior covariance C_oo, it can use a far larger M_⊥, because the complexity is linear in M_⊥ and subsampling of the orthogonal inducing points can be used for each gradient update. Thus, for a fair comparison, we also include ODVGP with a much larger M_⊥, where in each iteration a subset of size 1024 is sampled from the orthogonal inducing points to estimate the gradient. Other experiment details can be found in appendix E.
We report the predictive log-likelihoods on test data in Table 2. Due to space limits, we leave the results on two small datasets (Elevators, Bike) to appendix F. We can see that the performance of SOLVE-GP is competitive with the 4x more expensive SVGP run on the combined set of inducing points. Perhaps surprisingly, while SOLVE-GP uses a less flexible covariance in the variational distribution, it often outperforms this larger SVGP. We believe this is due to optimization difficulties introduced by the larger covariance matrix; we analyze this on the HouseElectric dataset later. On most datasets, using a large number of additional inducing points for modeling the mean function did improve performance, as shown by the comparison between the two ODVGP variants. Flexible covariance modeling seems to be more essential, though, as SOLVE-GP outperforms the large-M_⊥ ODVGP on all datasets except 3dRoad.
In Fig. 3 we plot the test RMSE and test log-likelihoods during training on Protein and HouseElectric. Interestingly, on both datasets the large-M_⊥ ODVGP achieves very good performance at the beginning, then slowly converges to less competitive results. In the beginning stage the additional inducing points likely give good predictions but are not in the best configuration for maximizing the training lower bound. This phenomenon is also observed on Elevators and Kin40k. We believe such a mismatch between the training lower bound and predictive performance is caused by fixing the covariance matrix of q(u_⊥) to the prior covariance. On HouseElectric, the larger SVGP does not improve over the standard SVGP and is outperformed by SOLVE-GP. As previously mentioned, this might be due to difficulties in optimizing large covariance matrices. To verify this, we tried the "whitening" trick (Murray and Adams, 2010; Hensman et al., 2015b) that is often used to improve the optimization of SVGP methods by reducing the correlations in the posterior distribution of u. As expected, the result of the larger SVGP then becomes similar to SOLVE-GP, outperforming all other methods.
We proposed SOLVE-GP, a new variational inference framework for GPs using inducing points, which unifies and generalizes previous sparse variational methods. It increases the number of inducing points we can use for a fixed computational budget, which allows us to improve the performance of large, hierarchical GP models at a manageable computational cost. Future work includes experiments on challenging datasets like ImageNet and investigating other ways to improve the variational distribution, as mentioned in section 3.1.

We thank Alex Matthews and Yutian Chen for helpful suggestions on improving the paper.
Evans, T. and Nair, P. Scalable Gaussian processes with grid-structured eigenfunctions (GP-GRIEF). In International Conference on Machine Learning, pages 1416–1425, 2018.
Gardner, J., Pleiss, G., Wu, R., Weinberger, K. Q., and Wilson, A. G. Product kernel interpolation for scalable Gaussian processes. In Artificial Intelligence and Statistics, pages 1407–1416, 2018.
Lawrence, N. D. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6(Nov):1783–1816, 2005.
Murray, I. and Adams, R. P. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems, pages 1732–1740, 2010.

As mentioned in section 3.1, another direction to improve the variational distribution in SVGP is to add some dependence between u and f_⊥. The best possible approximation of this family is obtained by setting q(f_⊥ | u) to the exact posterior conditional p(f_⊥ | u, y). The corresponding collapsed bound for GP regression can be derived by analytically marginalizing out u from the joint model in Eq. (6),
p(y, f_⊥) = ∫ p(y | K_fu K_uu⁻¹ u + f_⊥) p(u) du · p(f_⊥)   (24)
= N(y | f_⊥, Q_ff + σ²I) p(f_⊥),   (25)

and then forcing the approximation q(f_⊥) = p(f_⊥):

L = E_{p(f_⊥)}[log N(y | f_⊥, Q_ff + σ²I)].   (26)

This bound has a closed form:

L = log N(y | 0, Q_ff + σ²I) − ½ tr((Q_ff + σ²I)⁻¹ (K_ff − Q_ff)).   (27)

Applying the matrix inversion lemma to (Q_ff + σ²I)⁻¹, we have an equivalent form that can be directly compared with Eq. (3):

L = log N(y | 0, Q_ff + σ²I) − (1/(2σ²)) tr(K_ff − Q_ff) + (1/(2σ⁴)) tr((K_uu + σ⁻² K_uf K_fu)⁻¹ K_uf (K_ff − Q_ff) K_fu),   (28)

where the first two terms recover Eq. (3) and the third term is non-negative, showing that this is a tighter bound than the Titsias (2009) bound. This bound is not amenable to large-scale datasets because of the O(N²) storage requirement and the computational time dominated by the matrix multiplication K_uf (K_ff − Q_ff) K_fu. However, it is still of theoretical interest and can be applied to medium-sized regression datasets, just like the SGPR algorithm that uses the Titsias (2009) bound.
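The equivalence of the two forms (27) and (28), and the resulting tightening over Eq. (3), can be verified numerically via the matrix inversion lemma (an illustrative check of ours with an assumed RBF kernel):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def rbf(A, B, ls=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2 / ls**2)

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, 30)
Z = np.linspace(-3, 3, 4)
y = np.sin(X) + 0.1 * rng.standard_normal(30)
s2 = 0.1
Kff, Kfu = rbf(X, X), rbf(X, Z)
Kuu = rbf(Z, Z) + 1e-8 * np.eye(4)
Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
Qn = Qff + s2 * np.eye(30)

form27 = mvn.logpdf(y, np.zeros(30), Qn) \
         - 0.5 * np.trace(np.linalg.solve(Qn, Kff - Qff))      # Eq. (27)
titsias = mvn.logpdf(y, np.zeros(30), Qn) \
          - np.trace(Kff - Qff) / (2 * s2)                     # Eq. (3)
extra = np.trace(np.linalg.solve(Kuu + Kfu.T @ Kfu / s2,
                                 Kfu.T @ (Kff - Qff) @ Kfu)) / (2 * s2**2)
assert np.isclose(form27, titsias + extra)   # Eq. (27) == Eq. (28)
assert form27 >= titsias                     # the extra term is non-negative
```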
We derive the collapsed SOLVE-GP lower bound in Eq. (15) by seeking the optimal q(u) while holding q(u_⊥) fixed. First we rearrange the terms in the uncollapsed SOLVE-GP bound (Eq. (14)) as

L = E_{q(u)}[E_{q(f_⊥)}[log p(y | K_fu K_uu⁻¹ u + f_⊥)]] − KL(q(u) ‖ p(u)) − KL(q(u_⊥) ‖ p(u_⊥)).   (29)

Let μ_⊥ and Σ_⊥ denote the mean and covariance matrix of the variational predictive distribution of the orthogonal process:

q(f_⊥) = ∫ p(f_⊥ | u_⊥) q(u_⊥) du_⊥ = N(f_⊥ | μ_⊥, Σ_⊥).   (30)

We can compute them as

μ_⊥ = C_fo C_oo⁻¹ m_⊥,   (31)
Σ_⊥ = C_ff − C_fo C_oo⁻¹ (C_oo − S_⊥) C_oo⁻¹ C_of.   (32)
In the first term we can simplify the expectation over and as:
(33)  
(34)  
(35)  
(36)  
(37) 
Plugging into Eq. (29) and rearranging the terms, we have
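The marginalization step in Eqs. (30)–(32) is the standard Gaussian identity: pushing q(u_⊥) = N(m_⊥, S_⊥) through the linear-Gaussian conditional replaces the prior covariance C_oo with S_⊥ inside the quadratic form. A quick check with random SPD stand-ins for the C matrices (ours, illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n, m = 7, 4
def rand_spd(k):
    W = rng.standard_normal((k, k))
    return W @ W.T + k * np.eye(k)

Cjoint = rand_spd(n + m)                       # joint orthogonal-process cov
Cff, Cfo, Coo = Cjoint[:n, :n], Cjoint[:n, n:], Cjoint[n:, n:]
S, mq = rand_spd(m), rng.standard_normal(m)    # q(u_perp) = N(mq, S)

B = Cfo @ np.linalg.inv(Coo)                   # C_fo C_oo^{-1}
mu = B @ mq                                    # Eq. (31)
# marginal cov = conditional cov + B S B^T  (linear-Gaussian identity)
Sigma_direct = (Cff - B @ Coo @ B.T) + B @ S @ B.T
Sigma_eq32 = Cff - B @ (Coo - S) @ B.T         # Eq. (32)
assert np.allclose(Sigma_direct, Sigma_eq32, atol=1e-8)
```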