Gaussian process (GP) models form a core part of probabilistic machine learning. Considerable research effort has been made into attacking three issues with GP models: how to compute efficiently when the number of data is large; how to approximate the posterior when the likelihood is not Gaussian and how to estimate covariance function parameter posteriors. This paper simultaneously addresses these, using a variational approximation to the posterior which is sparse in support of the function but otherwise free-form. The result is a Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian approximation over the function values and covariance parameters simultaneously, with efficient computations based on inducing-point sparse GPs. Code to replicate each experiment in this paper will be available shortly.READ FULL TEXT VIEW PDF
We present an approximate Bayesian inference approach for estimating the...
One of the most compelling features of Gaussian process (GP) regression ...
The use of Gaussian processes (GPs) is supported by efficient sampling
Standard sparse pseudo-input approximations to the Gaussian process (GP)...
We consider probabilistic multinomial probit classification using Gaussi...
Neural-net-induced Gaussian process (NNGP) regression inherits both the ...
We propose a parallelizable sparse inverse formulation Gaussian process
Benchmarks for GPflow
Gaussian processes in TensorFlow
Gaussian processes in TensorFlow
Gaussian process models are attractive for machine learning because of their flexible nonparametric nature. By combining a GP prior with different likelihoods, a multitude of machine learning tasks can be tackled in a probabilistic fashion . There are three things to consider when using a GP model: approximation of the posterior function (especially if the likelihood is non-Gaussian), computation, storage and inversion of the covariance matrix, which scales poorly in the number of data; and estimation (or marginalization) of the covariance function parameters. A multitude of approximation schemes have been proposed for efficient computation when the number of data is large. Early strategies were based on retaining a sub-set of the data [2, 3]. Snelson and Ghahramani  introduced an inducing point approach, where the model is augmented with additional variables, and Titsias  used these ideas in a variational approach. Other authors have introduced approximations based on the spectrum of the GP [6, 7], or which exploit specific structures within the covariance matrix [8, 9], or by making unbiased stochastic estimates of key computations . In this work, we extend the variational inducing point framework, which we prefer for general applicability (no specific requirements are made of the data or covariance function), and because the variational inducing point approach can be shown to minimize the KL divergence to the posterior process .
To approximate the posterior function and covariance parameters, Markov chain Monte-Carlo (MCMC) approaches provide asymptotically exact approximations.Murray and Adams  and Filippone et al.  examine schemes which iteratively sample the function values and covariance parameters. Such sampling schemes require computation and inversion of the full covariance matrix at each iteration, making them unsuitable for large problems. Computation may be reduced somewhat by considering variational methods, approximating the posterior using some fixed family of distributions [14, 15, 16, 17, 1, 18], though many covariance matrix inversions are generally required. Recent works [19, 20, 21] have proposed inducing point schemes which can reduce the computation required substantially, though the posterior is assumed Gaussian and the covariance parameters are estimated by (approximate) maximum likelihood. Table 1 places our work in the context of existing variational methods for GPs.
This paper presents a general inference scheme, with the only concession to approximation being the variational inducing point assumption. Non-Gaussian posteriors are permitted through MCMC, with the computational benefits of the inducing point framework. The scheme jointly samples the inducing-point representation of the function with the covariance function parameters; with sufficient inducing points our method approaches full Bayesian inference over GP values and the covariance parameters. We show empirically that the number of required inducing points is substantially smaller than the dataset size for several real problems.
|Williams & Barber [also 15, 18]||
|✗||Gaussian (assumed)||point estimate|
|Titsias ||Gaussian||✓||Gaussian (optimal)||point estimate|
|Chai ||softmax||✓||Gaussian (assumed)||point estimate|
|Nguyen and Bonilla ||any factorized||✗||Mixture of Gaussians||point estimate|
|Hensman et al. ||probit||✓||Gaussian (assumed)||point estimate|
|This work||any factorized||✓||free-form||free-form|
The model is set up as follows. We are presented with some data inputs and responses . A latent function is assumed drawn from a GP with zero mean and covariance function with (hyper-) parameters
. Consistency of the GP means that only those points with data are considered: the latent vectorrepresents the values of the function at the observed points , and has conditional distribution , where is a matrix composed of evaluating the covariance function at all pairs of points in . The data likelihood depends on the latent function values: . To make a prediction for latent function value test points , the posterior function values and parameters are integrated:
In order to make use of the computational savings offered by the variational inducing point framework , we introduce additional input points to the function and collect the responses of the function at that point into the vector . With some variational posterior , new points are predicted similarly to the exact solution
This makes clear that the approximation is a stochastic process in the same fashion as the true posterior: the length of the predictions vector is potentially unbounded, covering the whole domain.
To obtain a variational objective, first consider the support of under the true posterior, and for under the approximation. In the above, these points are subsumed into the prediction vector : from here we shall be more explicit, letting be the points of the process at , be the points of the process at and be a large vector containing all other points of interest111The vector here is considered finite but large enough to contain any point of interest for prediction. The infinite case follows Matthews et al. , is omitted here for brevity, and results in the same solution.. All of the free parameters of the model are then
, and using a variational framework, we aim to minimize the Kullback-Leibler divergence between the approximate and true posteriors:
where the conditional distributions for have been expanded to make clear that they are the same under the true and approximate posteriors, and and have been omitted for clarity. Straight-forward identities simplify the expression,
resulting in the variational inducing-point objective investigated by Titsias , aside from the inclusion of . This can be rearranged to give the following informative expression
Here is an intractable constant which normalizes the distribution and is independent of . Minimizing the KL divergence on the right hand side reveals that the optimal variational distribution is
For general likelihoods, since the optimal distribution does not take any particular form, we intend to sample from it using MCMC. This is feasible using standard methods since it is computable up to a constant, using computations. To sample effectively, the following are proposed.
Noting that the problem (6) appears similar to a standard GP for , albeit with an interesting ‘likelihood’, we make use of an ancillary augmentation , with . This results in the optimal variational distribution
Previously [12, 13] this parameterization has been used with schemes which alternate between sampling the latent function values (represented by or ) and the parameters . Our scheme uses HMC across and jointly, whose effectiveness is examined throughout the experiment section.
The first term in (6) is the expected log-likelihood. In the case of factorization across the data-function pairs, this results in one-dimensional integrals. For Gaussian or Poisson likelihood these integrals are tractable, otherwise they can be approximated by Gauss-Hermite quadrature. Given the current sample , the expectations are computed w.r.t. , with:
where the kernel matrices are computed similarly to , but over the pairs in respectively. From here, one can compute the expected likelihood and it is subsequently straight-forward to compute derivatives in terms of and .
To compute derivatives with respect to and
we use reverse-mode differentiation (backpropagation) of the derivative through the Cholesky matrix decomposition, transforminginto , and then . This is discussed by Smith , and results in a operation; an efficient Cython implementation is provided in the supplement.
A natural question is, what strategy should be used to select the inducing points ? In the original inducing point formulation , the positions were treated as parameters to be optimized. One could interpret them as parameters of the approximate prior covariance . The variational formulation  treats them as parameters of the variational approximation, thus protecting from over-fitting as they form part of the variational posterior. In this work, since we propose a Bayesian treatment of the model, we question whether it is feasible to treat in a Bayesian fashion.
Since and are auxiliary parameters, the form of their distribution does not affect the marginals of the model. The term has been defined by the consistency with the GP in order to preserve the posterior-process interpretation above (i.e. should be points on the GP), but we are free to choose . Omitting dependence on for clarity, and choosing w.l.o.g. , the bound on the marginal likelihood, similarly to (4) is given by
The bound can be maximized w.r.t by noting that the term only appears inside a (negative) KL divergence: . Substituting the optimal reduces (9) to
which can now be optimized w.r.t. . Since no entropy term appears for , the bound is maximized when the distribution becomes a Dirac’s delta. In summary, since we are free to choose a prior for which maximizes the amount of information captured by , the optimal distribution becomes . This formally motivates optimizing the inducing points .
For completeness we also include the derivative of the free form objective with respect to the inducing point positions. Substituting the optimal distribution into (4) to give and then differentiating we obtain
Since we aim to draw samples from , evaluating this free form inducing point gradient using samples seems plausible but challenging. Instead we use the following strategy.
1. Fit a Gaussian approximation to the posterior. We follow 
in fitting a Gaussian approximation to the posterior. The positions of the inducing points are initialized using k-means clustering of the data. The values of the latent function are represented by a mean vector (initialized randomly) and a lower-triangular matrixforms the approximate posterior covariance as . For large problems (such as the MNIST experiment), stochastic optimization using AdaDelta is used. Otherwise, LBFGS is used. After a few hundred iterations with the inducing points positions fixed, they are optimized in free-form alongside the variational parameters and covariance function parameters.
2. Initialize the model using the approximation. Having found a satisfactory approximation, the HMC strategy takes the optimized inducing point positions from the Gaussian approximation. The initial value of is drawn from the Gaussian approximation, and the covariance parameters are initialized at the (approximate) MAP value.
3. Tuning HMC The HMC algorithm has two free parameters to tune, the number of leapfrog steps and the step-length. We follow a strategy inspired by Wang et al. , where the number of leapfrog steps is drawn randomly from 1 to , and Bayesian optimization is used to maximize the expected square jump distance (ESJD), penalized by . Rather than allow an adaptive (but convergent) scheme as , we run the optimization for 30 iterations of 30 samples each, and use the best parameters for a long run of HMC.
4. Run tuned HMC to obtain predictions Having tuned the HMC, it is run for several thousand iterations to obtain a good approximation to . The samples are used to estimate the integral in equation (2). The following section investigates the effectiveness of the proposed sampling scheme.
This section illustrates the effectiveness of Hamiltonian Monte Carlo in sampling from . As already pointed out, the form assumed by the optimal variational distribution in equation (6
) resembles the joint distribution in a GP model with a non-Gaussian likelihood.
For a fixed , sampling is relatively straightforward, and this is possible to be done efficiently using HMC [13, 26, 27] or Elliptical Slice Sampling . A well tuned HMC has been reported to be extremely efficient in sampling the latent variables, and this motivates our effort into trying to extend this efficiency to the sampling of hyper-parameters as well. This is also particularly appealing due to the convenience offered by the proposed representation of the model.
The problem of drawing samples from the posterior distribution over has been investigated in detail in [12, 13]. In these works, it has been advocated to alternate between the sampling of and in a Gibbs sampling fashion and condition the sampling of on a suitably chosen transformation of the latent variables. For each likelihood model, we compare efficiency and convergence speed of the proposed HMC sampler with a Gibbs sampler where is sampled using HMC is sampled using the Metropolis-Hastings algorithm. To make the comparison fair, we imposed the mass matrix in HMC and the covariance in MH to be isotropic, and any parameters of the proposal were tuned using Bayesian optimization. Unlike in the proposed HMC sampler, for the Gibbs sampler we did not penalize the objective function of the Bayesian optimization for large numbers of leapfrog steps, as in this case HMC proposals on the latent variables are computationally cheaper than those on the hyper-parameters. We report efficiency in sampling from using Effective Sample Size (ESS) and Time Normalized (TN)-ESS. In the supplement we include convergence plots based on the Potential Scale Reduction Factor (PSRF) computed based on ten parallel chains; in these each chain is initialized from the VB solution and individually tuned using Bayesian optimization.
We first use the image dataset  to investigate the benefits of the approach over a Gaussian approximation, and to investigate the effect of changing the number of inducing points, as well as optimizing the inducing points under the Gaussian approximation. The data are 18 dimensional: we investigated the effect of our approximation using both ARD (one lengthscale per dimension) and an isotropic RBF kernel. The data were split randomly into 1000/1019 train/test sets; the log predictive density over ten random splits is shown in Figure 1.
Following the strategy outlined above, we fitted a Gaussian approximation to the posterior, with initialized with k-means. Figure 1 investigates the difference in performance when is optimized using the Gaussian approximation, compared to just using k-means for . Whilst our strategy is not guaranteed to find the global optimum, it is clear that it improves the performance.
The second part of Figure 1 shows the performance improvement of our sampling approach over the Gaussian approximation. We drew 10,000 samples, discarding the first 1000: we see a consistent improvement in performance once is large enough. For small , The Gaussian approximation appears to work very well. The supplement contains a similar Figure for the case where a single lengthscale is shared: there, the improvement of the MCMC method over the Gaussian approximation is smaller but consistent. We speculate that the larger gains for ARD are due to posterior uncertainty in the lengthscale parameters, which is poorly represented by a point in the Gaussian/MAP approximation.
The ESS and TN-ESS are comparable between HMC and the Gibbs sampler. In particular, for inducing points and the RBF covariance, ESS and TN-ESS for HMC are and and for the Gibbs sampler are and . For the ARD covariance, ESS and TN-ESS for HMC are and and for the Gibbs sampler are and . Convergence, however, seems to be faster for HMC, especially for the ARD covariance (see the supplement).
We apply our methods to Log Gaussian Cox processes : doubly stochastic models where the rate of an inhomogeneous Poisson process is given by a Gaussian process. The main difficulty for inference lies in that the likelihood of the GP requires an integral over the domain, which is typically intractable. For low dimensional problems, this integral can be approximated on a grid; assuming that the GP is constant over the width of the grid leads to a factorizing Poisson likelihood for each of the grid points. Whilst some recent approaches allow for a grid-free approach [20, 31], these usually require concessions in the model, such as an alternative link function, and do not approach full Bayesian inference over the covariance function parameters.
On the one-dimensional coal-mining disaster data. We held out 50% of the data at random, and using a grid of 100 points with 30 evenly spaced inducing points
, fitted both a Gaussian approximation to the posterior process with an (approximate) MAP estimate for the covariance function parameters (variance and lengthscale of an RBF kernel). With Gamma priors on the covariance parameters we ran our sampling scheme using HMC, drawing 3000 samples. The resulting posterior approximations are shown in Figure2, alongside the true posterior using a sampling scheme similar to ours (but without the inducing point approximation). The free-form variational approximation matches the true posterior closely, whilst the Gaussian approximation misses important detail. The approximate and true posteriors over covariance function parameters are shown in the right hand part of Figure 2, there is minimal discrepancy in the distributions.
Over 10 random splits of the data, the average held-out log-likelihood was for the Gaussian approximation and for the free-form MCMC variant; the average difference was , and the MCMC variant was always better than the Gaussian approximation. We attribute this improved performance to marginalization of the covariance function parameters.
Efficiency of HMC is greater than for the Gibbs sampler; ESS and TN-ESS for HMC are and and for the Gibbs sampler are and . Also, chains converge within few thousand iterations for both methods, although convergence for HMC is faster (see the supplement).
The advantages of the proposed approximation are prominent as the number of grid points become higher, an effect emphasized with increasing dimension of the domain. We fitted a similar model to the above to the pine sapling data .
We compared the sampling solution obtained using 225 inducing points on a 32 x 32 grid to the gold standard full MCMC run with the same prior and grid size. Figure 3 shows that the agreement between the variational sampling and full sampling is very close. However the variational method was considerably faster. Using a single core on a desktop computer required seconds to obtain 1 effective sample for a well tuned variational method whereas it took seconds for well tuned full MCMC. This effect becomes even larger as we increase the resolution of the grid to 64 x 64, which gives a better approximation to the underlying smooth function as can be seen in figure 3. It took seconds to obtain one effective sample for the variational method, but now gold standard MCMC comparison was computationally extremely challenging to run for even a single HMC step. This is because it requires linear algebra operations using flops with .
To do multi-class classification with Gaussian processes, one latent function is defined for each of the classes. The functions are defined a-priori independent, but covary a posteriori because of the likelihood. Chai  studies a sparse variational approximation to the softmax multi-class likelihood restricted to a Gaussian approximation. Here, following [32, 33, 34], we use a robust-max likelihood. Given a vector containing latent functions evaluated at the point
, the probability that the label takes the integer valueis
As Girolami and Rogers  discuss, the ‘soft’ probit-like behaviour is recovered by adding a diagonal ‘nugget’ to the covariance function. In this work, was fixed to , though it would also be possible to treat this as a parameter for inference. The expected log-likelihood is , where is the probability that the labelled function is largest, which is computable using one-dimensional quadrature. An efficient Cython implementation is contained in the supplement.
To investigate the proposed posterior approximation for the multivariate classification case, we turn to the toy data shown in Figure 4.
A toy multiclass problem. Left: the Gaussian approximation, colored points show the simulated data, lines show posterior probability contours at 0.3, 0.95, 0.99. Inducing points positions shows as black points. Middle: the free form solution with 10,000 posterior samples. The free-form solution is more conservative (the contours are smaller). Right: posterior samples forat the same position but across different latent functions. The posterior exhibits strong correlations and edges.
We drew 750 data points from three Gaussian distributions. The synthetic data was chosen to include non-linear decision boundaries and ambiguous decision areas. Figure4 shows that there are differences between the variational and sampling solutions, with the sampling solution being more conservative in general (the contours of 95% confidence are smaller). As one would expect at the decision boundary there are strong correlations between the functions which could not be captured by the Gaussian approximation we are using. Note the movement of inducing points away from k-means and towards the decision boundaries.
Efficiency of HMC and the Gibbs sampler is comparable. In the RBF case, ESS and TN-ESS for HMC are and and for the Gibbs sampler are and . In the ARD case, ESS and TN-ESS for HMC are and and for the Gibbs sampler are and . For both cases, the Gibbs sampler struggles to reach convergence even though the average acceptance rates are similar to those recommended for the two samplers individually.
The MNIST dataset is a well studied benchmark with a defined training/test split. We used 500 inducing points, initialized from the training data using k-means. A Gaussian approximation was optimized using minibatch-based optimization over the means and variances of
, as well as the inducing points and covariance function parameters. The accuracy on the held-out data was 98.04%, significantly improving on previous approaches to to classify these digits using GP models.
For binary classification, Hensman et al.  reported that their Gaussian approximation resulted in movement of the inducing point positions toward the decision boundary. The same effect appears in the multivariate case, as shown in Figure 5, which shows three of the 500 inducing points used in the MNIST problem. The three examples were initialized close to the many six digits, and after optimization have moved close to other digits (five and four). The last example still appears to be a six, but has moved to a more ‘unusual’ six shape, supporting the function at another extremity. Similar effects are observed for all inducing-point digits.
Having optimized the inducing point positions with the approximate , and estimate for , we used these optimal inducing points to draw samples from and . This did not result in an increase in accuracy, but did improve the log-density on the test set from -0.068 to -0.064. Evaluating the gradients for the sampler took approximately 0.4 seconds on a desktop machine, and we were easily able to draw 1000 samples. This dataset size has generally be viewed as challenging in the GP community and consequently there are not many published results to compare with. One recent work  reports a 94.05% accuracy using variational inference and a GP latent variable model.
We have presented an inference scheme for general GP models. The scheme significantly reduces the computational cost whilst approaching exact Bayesian inference, making minimal assumptions about the form of the posterior. The improvements in accuracy in comparison with the Gaussian approximation of previous works has been demonstrated, as has the quality of the approximation to the hyper-parameter distribution. Our MCMC scheme was shown to be effective for several likelihoods, and we note that the automatic tuning of the sampling parameters worked well over hundreds of experiments. This paper shows that MCMC methods are feasible for inference in large GP problems, addressing the unfair sterotype of ‘slow’ MCMC.
JH was supported by a fellowship from the Medical Research Council. AM and ZG acknowledge EPSRC grant EP/I036575/1, and a Google Focussed Research award. MF acknowledges EPSRC grant EP/L020319/1.
Slice sampling covariance hyperparameters of latent Gaussian models.In NIPS, pages 1732–1740, 2010.
Convergence of the samplers on the Image dataset is reported in fig. 7 and shows the evolution of the PSRF for the twenty slowest parameters for HMC and the Gibbs sampler in the case of RBF and ARD covariances. The figure shows that HMC consistently converges faster than the Gibbs sampler for both covariances, even when the ESS of the slowest variable is comparable.
Fig. 7 shows the convergence analysis on the coal dataset. In this case, HMC converges faster than the Gibbs sampler and efficiency is comparable.
Convergence of the samplers on the toy multi-class dataset is reported in fig. 9. HMC converges much faster than the Gibbs sampler even though efficiency measured through ESS is comparable.