Pseudo-Marginal Bayesian Inference for Gaussian Processes

10/02/2013 ∙ by Maurizio Filippone, et al. ∙ University of Glasgow

The main challenges that arise when adopting Gaussian Process priors in probabilistic modeling are how to carry out exact Bayesian inference and how to account for uncertainty on model parameters when making model-based predictions on out-of-sample data. Using probit regression as an illustrative working example, this paper presents a general and effective methodology based on the pseudo-marginal approach to Markov chain Monte Carlo that efficiently addresses both of these issues. The results presented in this paper show improvements over existing sampling methods for simulating from the posterior distribution over the parameters defining the covariance function of the Gaussian Process prior. This is particularly important as it offers a powerful tool to carry out full Bayesian inference of Gaussian Process based hierarchic statistical models in general. The results also demonstrate that Monte Carlo based integration over all model parameters is actually feasible in this class of models, providing a superior quantification of uncertainty in predictions. Extensive comparisons with state-of-the-art probabilistic classifiers confirm this assertion.




1 Introduction

Non-parametric or kernel based models represent a successful class of statistical modelling and prediction methods. To focus ideas, throughout the paper we employ the working example of predictive classification problems; the methodology presented, however, is applicable to hierarchic Bayesian models in general and to those employing Gaussian Process (GP) priors in particular. Important examples of kernel-based classifiers are the Support Vector Machine (SVM) [1, 2], the Relevance Vector Machine (RVM) [3, 4], and the Gaussian Process classifier [5]. Although these classifiers are based on different modeling assumptions and paradigms of statistical inference, they are characterized by a kernel function or covariance operator that allows one to build nonlinear classifiers able to tackle challenging problems [6, 7, 8, 9, 10, 11].

In order to allow these classifiers to be flexible, it is necessary to parameterize the kernel (or covariance) function by a set of so-called hyper-parameters. After observing a set of training data, the aim is to estimate or infer such hyper-parameters. In the case of SVMs, point estimates of hyper-parameters are obtained by optimizing a cross-validation error. This makes optimization viable only in the case of very few hyper-parameters, as grid search is usually employed, and is limited by the available amount of data. In GP classification, instead, the probabilistic nature of the model provides a means (usually after approximately integrating out latent variables) to obtain an approximate marginal likelihood that offers the possibility to optimize the hyper-parameters; this is known as type II Maximum Likelihood (ML) [12, 5]. Deterministic approximations for integrating out the latent variables include the Laplace Approximation (LA) [13], Expectation Propagation (EP) [14], Variational Bayes [15], Integrated Nested Laplace Approximations [16], and mean field approximations [17]; see [18, 19] for extensive assessments of the relative merits of different approximation schemes for GP classification.

From a fully Bayesian perspective, in a GP classifier one would like to be able to (i) infer all model parameters and latent variables from data and (ii) integrate out latent variables and hyper-parameters with respect to their posterior distribution when making predictions, thus accounting for their uncertainty; this, in particular, would effectively make the classifier parameter free. To date, the literature lacks a systematic way to efficiently tackle both of these questions. The main limitations are due to the fact that none of the quantities needed in the inference can be obtained in closed form because of analytical intractability. This requires the use of some form of approximation, and deterministic approximations have been proposed to integrate out latent variables only; in order to integrate out the hyper-parameters, most approaches propose quadrature methods [20, 16], which can only be employed in the case of a small number of hyper-parameters.

Recently, there have been a few attempts to carry out inference using stochastic approximations based on Markov chain Monte Carlo (MCMC) methods [21, 22, 23], the idea being to leverage asymptotic guarantees of convergence of Monte Carlo estimates to the true values. Unfortunately, employing MCMC methods for inferring latent variables and hyper-parameters is extremely challenging, and state-of-the-art methods for doing so are still inefficient and difficult to use in practice.

This paper aims to provide a straightforward-to-implement methodology that is effective in bridging this gap, by proposing an MCMC method that addresses most of the difficulties one faces when applying stochastic-simulation-based inference to GP models, such as those for discrete-label classification. The main issue in applying MCMC methods to carry out inference in GP classification is in sampling the hyper-parameters from the full posterior. This is due to the structure of the model, which makes latent variables and hyper-parameters strongly coupled a posteriori; as a result, chains are characterized by low efficiency and poor mixing. The key idea of the proposed methodology is to break the correlation in the sampling between latent variables and hyper-parameters by approximately integrating out the latent variables while retaining a correct MCMC procedure; namely, maintaining the exact posterior distribution over hyper-parameters as the invariant distribution of the chains, along with ergodicity properties. This can be achieved by means of the so-called Pseudo-Marginal (PM) approach to Monte Carlo sampling [24, 25]. This work shows that the use of the PM approach leads to remarkable efficiency in the sampling of hyper-parameters, thus making the fully Bayesian treatment viable and simple to implement and employ.

The importance of integrating out the hyper-parameters to achieve a sound quantification of uncertainty in predictions is well known and has been highlighted, for example, in [12, 26, 16, 27]; employing this in practice, however, is notoriously challenging. The main motivation for this work is to demonstrate that this marginalization can be done exactly, in a Monte Carlo sense, by building upon deterministic approximations already proposed in the GP and Machine Learning literature. This work reports a thorough empirical comparison in this direction, showing the ability of the fully Bayesian treatment to achieve a better quantification of uncertainty compared to the standard practice of optimizing the hyper-parameters in GP classification. Furthermore, the results report a comparison with a probabilistic version of the SVM classifier [28]. The results on GP classification support the argument that hyper-parameters should be integrated out to achieve a reliable quantification of uncertainty in applications, and this paper provides a practical means to achieve this.[1]

[1] The code to reproduce all the experiments is available at:

The paper is organized as follows: section 2 reviews the Gaussian Process approach to classification, and section 3 presents the proposed MCMC approach to obtain samples from the posterior distribution over both latent variables and hyper-parameters. Section 4 reports an assessment of the sampling efficiency achieved by the PM approach compared to other MCMC approaches, and section 5 reports a study on the performance of the fully Bayesian GP classifier compared to other probabilistic classifiers. Finally, section 6 concludes the paper.

2 Gaussian Process Classification

In this section, we briefly review GP classification based on a probit likelihood (see [5] for an extensive presentation of GPs). Let X = {x_1, …, x_n} be a set of n input vectors, each described by d covariates and associated with an observed univariate response y_i, where y = {y_1, …, y_n} and y_i ∈ {−1, +1}. Let f = {f_1, …, f_n} be a set of latent variables. From a generative perspective, GP classifiers assume that the class labels have a Bernoulli distribution with success probability given by a transformation of the latent variables:

p(y_i = 1 | f_i) = Φ(f_i).

Here Φ denotes the cumulative distribution function of the standard Gaussian density; based on this modeling assumption, the likelihood function is:

p(y | f) = ∏_{i=1}^{n} Φ(y_i f_i).

The latent variables f are given a zero-mean GP prior with covariance K:

f ~ N(0, K).
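As a concrete sketch of this generative model, the following snippet draws latent values from a zero-mean GP prior and labels from the probit link, then evaluates the probit log-likelihood. The exact functional form of the covariance and all variable names are illustrative assumptions, not taken verbatim from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def covariance(X, sigma=1.0, tau=1.0):
    """Isotropic covariance sigma * exp(-||x_i - x_j||^2 / tau^2).
    The exact functional form used in the paper is an assumption here."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return sigma * np.exp(-d2 / tau ** 2)

# Draw latent function values from the zero-mean GP prior f ~ N(0, K) ...
n, d = 20, 2
X = rng.uniform(size=(n, d))
K = covariance(X) + 1e-8 * np.eye(n)          # jitter for numerical stability
f = np.linalg.cholesky(K) @ rng.standard_normal(n)

# ... then labels from the probit model p(y_i = 1 | f_i) = Phi(f_i),
y = np.where(rng.uniform(size=n) < norm.cdf(f), 1.0, -1.0)

# and evaluate the probit log-likelihood log p(y | f) = sum_i log Phi(y_i f_i).
log_lik = norm.logcdf(y * f).sum()
```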
Let k(x_i, x_j | θ) be the function modeling the covariance between the latent variables evaluated at the input vectors, parameterized by a vector of hyper-parameters θ. In this paper we will adopt a covariance function defined as follows:

k(x_i, x_j | θ) = σ exp[ −(x_i − x_j)ᵀ A (x_i − x_j) ].

The parameter σ is the variance of the marginal prior distribution for each of the latent variables f_i. The matrix A, instead, defines the type of covariance between the values of the function at different input vectors. By defining the matrix A with a global parameter τ as follows,

A = (1/τ²) I,

an isotropic covariance function is obtained. Alternatively, A can be defined to assign a different parameter τ_r to each covariate:

A = diag(τ_1⁻², …, τ_d⁻²).

The latter choice yields the so-called Automatic Relevance Determination (ARD) prior [29]. In this formulation, the hyper-parameters τ_r can be interpreted as length-scale parameters. Let θ be a vector comprising σ and all the length-scale parameters, and K be the matrix whose entries are K_ij = k(x_i, x_j | θ).

The GP classification model is hierarchical, as y is conditioned on f, and f is conditioned on θ and the inputs X. In order to keep the notation uncluttered, in the remainder of this paper we will not report explicitly the conditioning on the inputs in any of the equations. We now briefly review the types of approximations that have been proposed in the literature to employ GP classifiers.

2.1 Deterministic approximations for integrating out latent variables

One of the difficulties encountered in GP classification is that, unlike GP regression, the prior on the latent variables and the likelihood do not form a conjugate pair; therefore, it is not possible to analytically integrate out the latent variables. As a consequence, it is not possible to directly sample from or optimize the distribution of the hyper-parameters given the labels, nor to directly evaluate predictive probabilities. This has motivated a large body of research that attempts to approximate the posterior distribution over the latent variables p(f | y, θ) with a Gaussian q(f | y, θ) in order to exploit conjugacy. By doing so, it is possible to analytically integrate out the latent variables to obtain an approximate marginal likelihood and to compute the predictive distribution for new data, as discussed in the following. The Gaussian approximation yields an approximate marginal likelihood that can then be optimized with respect to the hyper-parameters, or used to obtain samples from the approximate posterior distribution over the hyper-parameters using MCMC techniques. We now briefly discuss how this can be achieved.

To obtain an approximate predictive distribution, conditioned on a value of the hyper-parameters θ, we can compute:

p(y_* | y, θ) = ∫ p(y_* | f_*) p(f_* | f, θ) p(f | y, θ) df_* df.     (7)

Here θ can be a ML estimate that maximizes the approximate marginal likelihood or one sample from the approximate posterior over θ. For simplicity of notation, let K be the covariance matrix evaluated at the n training inputs, k_* the vector whose ith element is k(x_i, x_* | θ), and k_** = k(x_*, x_* | θ). Given the properties of multivariate normal variables, f_* | f, θ is distributed as N(k_*ᵀ K⁻¹ f, k_** − k_*ᵀ K⁻¹ k_*). Approximating p(f | y, θ) with a Gaussian q(f | y, θ) = N(f | μ_q, Σ_q) makes it possible to analytically perform the integration with respect to f in equation 7. In particular, the integration with respect to f yields p(f_* | y, θ) ≈ N(f_* | m_*, s_*²) with

m_* = k_*ᵀ K⁻¹ μ_q   and   s_*² = k_** − k_*ᵀ (K⁻¹ − K⁻¹ Σ_q K⁻¹) k_*.

The univariate integration with respect to f_* follows exactly in the case of a probit likelihood, as it is a convolution of a Gaussian and a cumulative Gaussian:

p(y_* = 1 | y, θ) ≈ ∫ Φ(f_*) N(f_* | m_*, s_*²) df_* = Φ( m_* / √(1 + s_*²) ).
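The probit-Gaussian convolution used in this last step has a well-known closed form, which can be checked numerically; the values below are purely illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def predictive_prob(m, s2):
    """Closed form of  integral Phi(f) N(f | m, s2) df  =  Phi(m / sqrt(1 + s2))."""
    return norm.cdf(m / np.sqrt(1.0 + s2))

# Verify the identity by numerical quadrature for arbitrary (illustrative) values.
m, s2 = 0.7, 2.3
numeric, _ = quad(lambda f: norm.cdf(f) * norm.pdf(f, loc=m, scale=np.sqrt(s2)),
                  -20.0, 20.0)
closed = predictive_prob(m, s2)
```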
We now briefly review two popular approximation methods for integrating out latent variables, namely the Laplace Approximation and Expectation Propagation.

2.1.1 Laplace Approximation

The Laplace Approximation (LA) is based on the assumption that the distribution of interest can be approximated by a Gaussian centered at its mode and with the same curvature. By analyzing the Taylor expansions of the logarithms of the target and approximating densities, the latter requirement is satisfied by imposing an inverse covariance for the approximating Gaussian equal to the negative Hessian of the logarithm of the target density [30]. For a given value of the hyper-parameters θ, define

Ψ(f) = log[p(y | f)] + log[p(f | θ)]

as the logarithm of the target density up to terms independent of f. Performing a Laplace approximation amounts to defining a Gaussian q(f | y, θ) = N(f | f̂, Σ), such that

f̂ = argmax_f Ψ(f)   and   Σ⁻¹ = −∇∇Ψ(f)|_{f = f̂}.     (10)

As it is not possible to directly solve the maximization problem in equation 10, an iterative procedure based on the following Newton-Raphson formula is usually employed:

f_new = (K⁻¹ + W)⁻¹ ( W f + ∇ log[p(y | f)] ),     (11)

iterated until convergence, where W = −∇∇ log[p(y | f)] is a diagonal matrix. The gradient and the Hessian of the log of the target density are:

∇Ψ(f) = ∇ log[p(y | f)] − K⁻¹ f   and   ∇∇Ψ(f) = −W − K⁻¹.

Note that if p(y | f) is log-concave, as in probit classification, Ψ(f) has a unique maximum. Practically, the Newton-Raphson update in equation 11 is implemented by employing Woodbury identities to avoid inverting K directly (see section 3.4 of [5] for full details). In such an implementation, one matrix factorization is needed at each iteration and no other O(n³) operations.
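The Newton-Raphson iteration just described can be sketched as follows, in the numerically stable form that avoids inverting K directly (following Rasmussen and Williams' Algorithm 3.1; data and variable names are illustrative).

```python
import numpy as np
from scipy.stats import norm

def laplace_mode(K, y, n_iter=100, tol=1e-12):
    """Newton iterations for the mode of p(f | y, theta) under a probit
    likelihood, in the numerically stable form of Rasmussen & Williams,
    Algorithm 3.1 (a sketch; one Cholesky factorization per iteration)."""
    n = K.shape[0]
    f = np.zeros(n)
    for _ in range(n_iter):
        z = y * f
        r = norm.pdf(z) / norm.cdf(z)       # phi(z) / Phi(z)
        grad = y * r                        # d/df log p(y | f)
        W = r * (z + r)                     # -d^2/df^2 log p(y | f) (diagonal)
        sW = np.sqrt(W)
        B = np.eye(n) + sW[:, None] * K * sW[None, :]
        L = np.linalg.cholesky(B)
        b = W * f + grad
        a = b - sW * np.linalg.solve(L.T, np.linalg.solve(L, sW * (K @ b)))
        f_new = K @ a                       # equals (K^-1 + W)^-1 (W f + grad)
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = f_new
    return f, W

# Tiny usage example on synthetic (illustrative) data.
rng = np.random.default_rng(1)
X = rng.uniform(size=(15, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)) + 1e-4 * np.eye(15)
y = np.where(rng.uniform(size=15) < 0.5, 1.0, -1.0)
f_hat, W_hat = laplace_mode(K, y)
```

At the mode, the stationarity condition ∇log p(y|f) = K⁻¹f holds, which gives a direct check of convergence.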

2.1.2 Expectation Propagation

The Expectation Propagation (EP) algorithm is based on the assumption that each individual term of the likelihood can be approximated by an unnormalized Gaussian:

p(y_i | f_i) ≈ Z̃_i N(f_i | μ̃_i, σ̃_i²).

Approximating each term in the likelihood by a Gaussian implies that the approximate likelihood, as a function of f, is multivariate Gaussian:

p(y | f) ≈ N(f | μ̃, Σ̃),

with (μ̃)_i = μ̃_i and Σ̃ = diag(σ̃_1², …, σ̃_n²).

Under this approximation, the posterior p(f | y, θ) is approximated by a Gaussian q(f | y, θ) = N(f | μ, Σ) with:

Σ = (K⁻¹ + Σ̃⁻¹)⁻¹   and   μ = Σ Σ̃⁻¹ μ̃.

The EP algorithm is characterized by the way the parameters Z̃_i, μ̃_i, and σ̃_i² are optimized. The EP algorithm loops through the factors approximating the likelihood, updating those three parameters for each factor in turn. First, the so-called cavity distribution is computed:

q_{−i}(f_i) ∝ N(f_i | μ_i, Σ_ii) / N(f_i | μ̃_i, σ̃_i²),

which is obtained by leaving out the ith factor from q(f | y, θ). Second, a revised Gaussian q_new(f_i), which closely approximates the product of the cavity distribution and the exact ith likelihood term, is sought. In particular, this is performed by minimizing the following Kullback-Leibler divergence:

KL( q_{−i}(f_i) p(y_i | f_i) ‖ q_new(f_i) ),

which in practice boils down to matching the moments of the two distributions. Third, once the mean and variance of q_new(f_i) are computed, it is possible to derive the updated parameters Z̃_i, μ̃_i, and σ̃_i² for the ith factor. The derivation of those equations is rather involved, and the reader is referred to [5] for full details; EP requires five operations in O(n³) at each iteration. Note that convergence of the EP algorithm is not guaranteed in general; however, for GP classification, no convergence issues have been reported in the literature. Furthermore, EP for GP classification has been reported to offer superior accuracy in approximations compared to other methods [18, 19].
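For the probit likelihood, the moment-matching step at the heart of EP has a closed form. The following sketch checks it against numerical integration; the formulas are the standard ones from the GP literature (not taken verbatim from this paper), and all values are illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def match_moments(y_i, mu, s2):
    """Mean and variance of the tilted distribution  Phi(y_i f) N(f | mu, s2),
    i.e. the moment-matching step of EP for a single probit site (a sketch
    using the standard closed forms from the GP literature)."""
    u = y_i * mu / np.sqrt(1.0 + s2)
    rho = norm.pdf(u) / norm.cdf(u)
    mean = mu + y_i * s2 * rho / np.sqrt(1.0 + s2)
    var = s2 - s2 ** 2 * rho * (u + rho) / (1.0 + s2)
    return mean, var

# Check against numerical integration of the (unnormalized) tilted density.
y_i, mu, s2 = 1.0, -0.4, 1.7
t = lambda f: norm.cdf(y_i * f) * norm.pdf(f, loc=mu, scale=np.sqrt(s2))
Z = quad(t, -20.0, 20.0)[0]
m_num = quad(lambda f: f * t(f), -20.0, 20.0)[0] / Z
v_num = quad(lambda f: f ** 2 * t(f), -20.0, 20.0)[0] / Z - m_num ** 2
m_ep, v_ep = match_moments(y_i, mu, s2)
```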

2.2 Fully Bayesian treatment

In a fully Bayesian treatment, the aim is to integrate out latent variables as well as hyper-parameters when making predictions:

p(y_* | y) = ∫ p(y_* | f_*) p(f_* | f, θ) p(f, θ | y) df_* df dθ.

Again, the integration with respect to f_* can be done analytically, whereas the integration with respect to the latent variables and hyper-parameters requires the posterior distribution p(f, θ | y). One way to tackle the intractability in characterizing p(f, θ | y) is to draw samples from it using MCMC methods, so that a Monte Carlo estimate of the predictive distribution can be used:

p(y_* | y) ≈ (1/N) ∑_{i=1}^{N} ∫ p(y_* | f_*) p(f_* | f^(i), θ^(i)) df_*,     (20)

where (f^(i), θ^(i)) denotes the ith sample from p(f, θ | y). This estimate will asymptotically converge to the exact expectation p(y_* | y).

3 MCMC Sampling From p(f, θ | y)

Sampling from the posterior over f and θ by joint proposals is not feasible; it is extremely unlikely to propose a set of latent variables and hyper-parameters that are compatible with each other and with the observed data. In order to draw samples from p(f, θ | y), it is therefore necessary to resort to a Gibbs sampler, whereby f and θ are updated in turn. We now briefly review the state of the art in Gibbs sampling techniques for GP models, and propose a new Gibbs sampler based on the PM approach.

3.1 Drawing samples from p(f | y, θ)

Efficient sampling of the latent variables can be achieved by means of Elliptical Slice Sampling (ELL-SS) [31]. ELL-SS is based on an adaptation of the Slice Sampling algorithm [32] to propose new values of the latent variables. ELL-SS has two very appealing properties: it requires no tuning, so that minimal user intervention is needed, and, once K is factorized, the complexity of each iteration is in O(n²). Recently, the efficiency of ELL-SS has been extensively demonstrated on several models involving GP priors [21].

Another way to efficiently sample the latent variables in GP models is by means of a variant of Hybrid Monte Carlo (HMC) [33, 34] in which the inverse mass matrix is set to the GP covariance K, as described in detail in [21]. This variant of HMC can be interpreted as a simplified version of Riemann manifold Hamiltonian Monte Carlo (RMHMC) [35], and it makes it possible to obtain samples from the posterior distribution over f in O(n²) once K is factorized. Owing to its simplicity, in the remainder of this paper we will use ELL-SS to sample from the posterior over the latent variables p(f | y, θ).
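A minimal sketch of one ELL-SS update, following Murray, Adams and MacKay's published algorithm, is given below; the probit usage example and all names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def elliptical_slice(f, L, log_lik, rng):
    """One Elliptical Slice Sampling update of f ~ N(0, K) given a
    log-likelihood function; L is the Cholesky factor of K.  A sketch of
    Murray, Adams & MacKay (2010): no tuning parameters are required."""
    nu = L @ rng.standard_normal(len(f))          # auxiliary draw from the prior
    log_u = log_lik(f) + np.log(rng.uniform())    # slice level
    angle = rng.uniform(0.0, 2.0 * np.pi)         # initial proposal angle
    lo, hi = angle - 2.0 * np.pi, angle
    while True:
        f_prop = f * np.cos(angle) + nu * np.sin(angle)
        if log_lik(f_prop) > log_u:
            return f_prop
        # shrink the bracket towards the current state and retry
        if angle < 0:
            lo = angle
        else:
            hi = angle
        angle = rng.uniform(lo, hi)

# Usage on the probit model: log p(y | f) = sum_i log Phi(y_i f_i)
rng = np.random.default_rng(2)
n = 10
K = np.exp(-np.subtract.outer(np.linspace(0, 1, n), np.linspace(0, 1, n)) ** 2)
K += 1e-8 * np.eye(n)
L = np.linalg.cholesky(K)
y = np.where(rng.uniform(size=n) < 0.5, 1.0, -1.0)
log_lik = lambda f: norm.logcdf(y * f).sum()
f = np.zeros(n)
samples = []
for _ in range(200):
    f = elliptical_slice(f, L, log_lik, rng)
    samples.append(f.copy())
samples = np.array(samples)
```

Note that the shrinking bracket guarantees termination: the angle 0 recovers the current state, which always lies above the slice level.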

3.2 Drawing samples from the posterior over θ employing reparameterization techniques

3.2.1 SA and AA parameterizations

In GP classification, efficiently sampling from the posterior distribution over the latent variables and the hyper-parameters is complex because of their strong coupling [21, 22, 26]. The result of this strong coupling is that fixing f induces a sharply peaked posterior over θ that makes the chain converge slowly and mix very poorly. This effect is illustrated in Figure 1. In particular, conditioning the sampling of θ on f corresponds to considering the standard parameterization of GP models, p(y | f) and p(f | θ), which is also known as Sufficient Augmentation (SA) [36].

A better parameterization can be given by introducing a set of transformed (whitened) latent variables ν [37], defined through f = Lν, L being the Cholesky factor of K. In this parameterization, which is also known as Ancillary Augmentation (AA) [36], the variables ν are constructed to be a priori independent of the hyper-parameters (using the Cholesky factor L is convenient, as it is needed also to evaluate the GP prior density). In the AA parameterization, θ is sampled from p(θ | y, ν). The effect of conditioning on ν rather than on f makes the conditional posterior over θ wider, as illustrated in Figure 1.
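The whitening relation is easy to verify empirically: the variables ν are standard normal regardless of the hyper-parameters, while f = Lν recovers the prior covariance K. The kernel and sizes below are illustrative.

```python
import numpy as np

# In the AA parameterization the latent variables are whitened: f = L @ nu,
# with L the Cholesky factor of K, so nu ~ N(0, I) a priori, independently
# of the hyper-parameters.  Empirically, f = L @ nu then has covariance K.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 6)
K = np.exp(-np.subtract.outer(x, x) ** 2 / 0.5) + 1e-10 * np.eye(6)
L = np.linalg.cholesky(K)

nu = rng.standard_normal((6, 20000))   # whitened draws, independent of K
f = L @ nu                             # correlated draws from N(0, K)
emp_cov = f @ f.T / 20000.0            # empirical covariance, approximately K
```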

3.2.2 The Surrogate data model

In the Surrogate (SURR) data model proposed in [22], a set of auxiliary latent variables g is introduced as a noisy version of f; in particular, p(g | f, θ) = N(g | f, S_θ). This construction yields a conditional for f of the form N(f | m, R), with R = S_θ − S_θ (S_θ + K)⁻¹ S_θ and m = R S_θ⁻¹ g. After decomposing R = D Dᵀ, the sampling of θ is then conditioned on the "whitened" variables η, defined as η = D⁻¹(f − m). The covariance S_θ is constructed by matching the posterior distribution over each of the latent variables individually (see [22] for further details). Figure 1 shows that the SURR parameterization is characterized by a conditional posterior over θ wider than in SA and slightly wider than in the AA parameterization.

3.3 Drawing samples from p(θ | y): the Pseudo-Marginal approach

The use of reparameterization techniques mitigates the problems due to the coupling of latent variables and hyper-parameters, but sampling efficiency for GP models is still an issue (for example, [7] reports simulations of ten parallel chains comprising five million samples each). Intuitively, the best strategy to break the correlation between latent variables and hyper-parameters in sampling from the posterior over the hyper-parameters would be to integrate out the latent variables altogether. As we discussed, this is not possible analytically, but here we present a strategy that uses an unbiased estimate of the marginal likelihood p(y | θ) to devise an MCMC approach that produces samples from the correct posterior distribution p(θ | y). For the sake of clarity, in this work we will focus on the Metropolis-Hastings algorithm with proposal π(θ′ | θ). We are interested in sampling from the posterior distribution

p(θ | y) ∝ p(y | θ) p(θ).

In order to do that, we would need to integrate out the latent variables:

p(y | θ) = ∫ p(y | f) p(f | θ) df,

and use this along with the prior p(θ) in the Hastings ratio:

z = [ p(y | θ′) p(θ′) π(θ | θ′) ] / [ p(y | θ) p(θ) π(θ′ | θ) ].
As already discussed, analytically integrating out f is not possible.

The results in [25, 24] show that we can plug into the Hastings ratio an estimate p̃(y | θ) of the marginal p(y | θ), and, as long as this estimate is unbiased, the sampler will draw samples from the correct posterior p(θ | y):

z̃ = [ p̃(y | θ′) p(θ′) π(θ | θ′) ] / [ p̃(y | θ) p(θ) π(θ′ | θ) ].

This result is remarkable, as it gives a simple recipe to be used in hierarchical models to tackle the problem of strong coupling between groups of variables when using MCMC algorithms.

Fig. 1: Comparison of the marginal posterior p(θ | y) with the conditional posterior over θ in the SA parameterization, in the AA parameterization, and in the parameterization used in the SURR method.

Figure 1 shows the effect of conditioning the sampling of θ on different transformations of the latent variables given by SA (blue line), AA (red line), and SURR (green line). The conditional variance for the three approaches is still far lower than the variance of the marginal posterior p(θ | y) that can be obtained by the PM approach. This motivates the use of the PM approach to effectively break the correlation between latent variables and hyper-parameters in an MCMC scheme.

Note that if the goal is quantifying uncertainty in the hyper-parameters only, and no predictions are needed, one could just iterate the sampling of θ, as this is done regardless of f. For predictions, instead, samples from the joint posterior p(f, θ | y) are needed in the Monte Carlo integral in equation 20, so both steps are necessary. We consider this a Gibbs sampler despite the fact that, in principle, interleaving of the two steps is not needed; one could obtain samples from the posterior distribution over f in a second stage, once samples from p(θ | y) are available. This would come at an extra cost, given that sampling f | y, θ requires the factorization of K for each MCMC sample θ^(i). Therefore, when predictions are needed, we prefer to interleave the two steps, and we still interpret the proposed sampling strategy as a Gibbs sampler.

3.3.1 Unbiased estimation of p(y | θ) using importance sampling

In order to obtain an unbiased estimator p̃(y | θ) of the marginal p(y | θ), we propose to employ importance sampling. We draw N_imp samples f^(i) from the approximating distribution q(f | y, θ), so that we can approximate the marginal by:

p̃(y | θ) ≈ (1 / N_imp) ∑_{i=1}^{N_imp} [ p(y | f^(i)) p(f^(i) | θ) ] / q(f^(i) | y, θ).     (25)

It is easy to verify that equation 25 yields an unbiased estimate of p(y | θ), as its expectation under q is the exact marginal p(y | θ). Therefore, this estimate can be used in the Hastings ratio to construct an MCMC approach that samples from the correct invariant distribution p(θ | y). Algorithm 1 sketches the MH algorithm that we propose to sample the hyper-parameters.
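The unbiasedness of this estimator is easiest to illustrate in a setting where everything is analytic, namely GP regression with Gaussian noise, which is used here purely for checking (an assumption of this sketch; the paper's setting is probit classification). When q equals the exact posterior, Bayes' rule implies every importance weight equals the true marginal, so the estimator has zero variance.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(4)
n = 5
x = np.linspace(0.0, 1.0, n)
K = np.exp(-np.subtract.outer(x, x) ** 2) + 1e-9 * np.eye(n)
s2 = 0.3                                # Gaussian noise variance (illustrative)
y = rng.standard_normal(n)

# In the conjugate (regression) case everything is analytic:
marginal = mvn(np.zeros(n), K + s2 * np.eye(n)).pdf(y)      # p(y | theta)
S = np.linalg.inv(np.linalg.inv(K) + np.eye(n) / s2)        # posterior cov
S = (S + S.T) / 2.0                                         # symmetrize
m = S @ y / s2                                              # posterior mean

# Importance sampling estimate of p(y | theta) with q = exact posterior:
# each weight p(y|f) p(f|theta) / q(f) equals the true marginal by Bayes'
# rule, so the estimator is unbiased with zero variance.
f_samples = rng.multivariate_normal(m, S, size=64)
w = np.array([mvn(f, s2 * np.eye(n)).pdf(y) * mvn(np.zeros(n), K).pdf(f)
              / mvn(m, S).pdf(f) for f in f_samples])
estimate = w.mean()
```

With the LA or EP approximation in place of the exact posterior, the weights are no longer constant, and the variance grows with the mismatch between q and the true posterior.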

Input: The current pair (θ, p̃(y | θ)), a routine to approximate p(f | y, θ) by q(f | y, θ), and the number of importance samples N_imp
Output: A new pair (θ, p̃(y | θ))

1: Draw θ′ from the proposal distribution π(θ′ | θ)
2: Approximate p(f | y, θ′) by q(f | y, θ′)
3: Draw N_imp samples f^(i) from q(f | y, θ′)
4: Compute p̃(y | θ′) using eq. 25
5: Compute A = min{ 1, [ p̃(y | θ′) p(θ′) π(θ | θ′) ] / [ p̃(y | θ) p(θ) π(θ′ | θ) ] }
6: Draw u from U[0, 1]
7: if A > u then return (θ′, p̃(y | θ′))
8: else return (θ, p̃(y | θ))
Algorithm 1: Pseudo-marginal MH transition operator to sample θ.
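The transition operator can be sketched end to end on a toy probit problem. For brevity this sketch uses the GP prior itself as the importance distribution, a valid special case of eq. 25 in which the weights reduce to p(y | f^(i)); the paper instead uses the far more efficient LA/EP Gaussian approximations. All data, priors, and tuning constants are illustrative.

```python
import numpy as np
from scipy.stats import gamma, norm

rng = np.random.default_rng(5)

# Toy probit data (illustrative only)
n = 12
x = np.linspace(0.0, 1.0, n)
y = np.where(x > 0.5, 1.0, -1.0)

def cov(tau):
    Km = np.exp(-np.subtract.outer(x, x) ** 2 / tau ** 2)
    return Km + 1e-8 * np.eye(n)

def log_marginal_estimate(tau, n_imp, rng):
    """Unbiased estimate of p(y | tau) by importance sampling.  Here the
    importance distribution is the GP prior itself (a simplification; the
    paper draws from a Gaussian approximation of the posterior), so the
    weights reduce to p(y | f^(i))."""
    L = np.linalg.cholesky(cov(tau))
    f = L @ rng.standard_normal((n, n_imp))
    log_w = norm.logcdf(y[:, None] * f).sum(axis=0)
    return np.logaddexp.reduce(log_w) - np.log(n_imp)   # log-mean-exp

# Pseudo-marginal MH on log(tau), with an illustrative Gamma(1, 1) prior on tau
log_tau = 0.0
log_p_hat = log_marginal_estimate(np.exp(log_tau), 64, rng)
chain = []
for _ in range(300):
    log_tau_prop = log_tau + 0.3 * rng.standard_normal()
    log_p_prop = log_marginal_estimate(np.exp(log_tau_prop), 64, rng)
    # Hastings ratio with the *estimated* marginals; the extra log_tau terms
    # are the Jacobian of the log reparameterization
    log_a = (log_p_prop + gamma.logpdf(np.exp(log_tau_prop), a=1.0) + log_tau_prop
             - log_p_hat - gamma.logpdf(np.exp(log_tau), a=1.0) - log_tau)
    if np.log(rng.uniform()) < log_a:
        log_tau, log_p_hat = log_tau_prop, log_p_prop
    chain.append(log_tau)
chain = np.array(chain)
```

Note the defining trick of the PM scheme: the estimate log_p_hat of the current state is carried along and reused, never recomputed, which is what keeps the invariant distribution exact.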

From the theory of importance sampling [38], the variance of the estimator is zero when q(f | y, θ) is proportional to p(y | f) p(f | θ), which is proportional to the posterior distribution over f that we do not know how to sample from in the first place. In our case, the more accurate the Gaussian approximation, the smaller the variance of the estimator. Given that EP has been reported to be more accurate than LA in approximating p(f | y, θ), it is reasonable to expect that EP will lead to a smaller estimator variance. This will be assessed in the next section.

The motivation for using an importance sampling estimator, rather than other simulation-based methods for estimating marginal likelihoods, is the following. Even though it is possible to sample f relatively efficiently, the estimation of marginal likelihoods from MCMC simulations is generally challenging [39, 40], and only guarantees of estimator consistency are available. Obtaining estimates based on samples from p(f | y, θ) would require some form of user intervention (assessment of convergence and estimation of efficiency) every time a new value of θ is proposed; this is clearly not practical or useful for the PM scheme. This work reports an extensive assessment of LA and EP for obtaining the Gaussian approximation q(f | y, θ) within the importance sampling estimate of p(y | θ).

3.3.2 Analysis of correctness

We show here why the proposed method yields an MCMC approach that produces samples from the correct invariant distribution p(θ | y). The easiest way to see this is by considering N_imp = 1; showing correctness for larger numbers of importance samples is similar but notationally heavier (see [25] for further details). By substituting the importance sampling estimate p̃(y | θ) = p(y | f) p(f | θ) / q(f | y, θ) into z̃ and rearranging the terms, we obtain

z̃ = [ p(y | f′) p(f′ | θ′) p(θ′) ] / [ p(y | f) p(f | θ) p(θ) ] × [ q(f | y, θ) π(θ | θ′) ] / [ q(f′ | y, θ′) π(θ′ | θ) ].

Isolating the terms in the square brackets allows us to interpret z̃ as a Hastings ratio with a joint proposal for θ and for the importance sample f given by

π(θ′, f′ | θ, f) = π(θ′ | θ) q(f′ | y, θ′).

The remaining term in z̃ indicates that the target distribution this approach is sampling from is

p̃(f, θ | y) ∝ p(y | f) p(f | θ) p(θ).

If we concentrate on θ, regardless of f, the target distribution is exactly what we are aiming to sample from, as it is proportional to the posterior p(θ | y). The extension to more than one importance sample follows from a similar argument, except that the approximating density q appears in the expression of the target distribution; however, this does not cause any problems, as marginalizing the latent variables leads to the same conclusion as in the case N_imp = 1. The analysis for N_imp = 1 also reveals an interesting similarity with the approach proposed in [23], where a joint update of θ and f was performed as follows: proposing θ′ | θ, proposing f′ | θ′, and accepting/rejecting the joint proposal (θ′, f′). In the PM approach, however, the transition kernel targets the desired marginal posterior irrespective of the number of importance samples N_imp.

4 Assessing importance distributions

In this section, we present simulations to assess the ability of the PM approach to characterize the marginal likelihood p(y | θ) in GP classification. First, we assess the quality of the estimate given by the importance sampler based on LA and EP on simulated data with respect to the number of importance samples. Second, we evaluate the efficiency of the sampler on simulated data with respect to the approximation used to draw importance samples and with respect to their number. Third, we compare the PM approach with the AA and SURR parameterizations, which are the most efficient sampling schemes proposed in the literature for sampling hyper-parameters in models involving GP priors [21]. In all the experiments, for both LA and EP we imposed a convergence criterion requiring the change in the squared norm of f to fall below a small fixed threshold.

4.1 Analysis of the variance of the estimator

Fig. 2: Plot of the PM estimate as a function of the length-scale τ; black solid lines represent the average over repetitions and dashed lines represent the 2.5th and 97.5th quantiles, for the two approximations (LA and EP) and different numbers of importance samples. The solid red line is the prior density.

In this section, we present an assessment of the variance of the estimator with respect to the global length-scale parameter τ in equation 5. In particular, we are interested in comparing the quality of the approximation given by the importance sampling approach based on the two different Gaussian approximations employed in this work. Given that the dimensionality of the space where the approximation is performed grows linearly with the number of input vectors n, we are also interested in studying the effect of n on the quality of the approximation. Based on these considerations, we simulated data from the GP classification model with an isotropic covariance function for two data-set sizes. We fixed the value of σ to the value used to generate the data, and we imposed a Gamma prior on the length-scale τ. We then computed the posterior over τ based on the importance sampling estimate of the marginal likelihood, for different values of τ, over repeated runs and with different numbers of importance samples N_imp. The results are reported in figure 2.

As expected, for larger numbers of importance samples, the estimates are more accurate. We can also see that EP leads to a smaller variance than LA, which is not surprising given that the approximation achieved by EP is more accurate [18, 19]. The experiment also suggests that there is little increase in the variance of the estimator for the larger data set.

4.2 Effect of the pseudo marginal on the efficiency of the sampler

TABLE I: Analysis of convergence and efficiency of an MH algorithm sampling the hyper-parameters using the PM approach, for isotropic and ARD covariances. The results show the dependency of the effective sample size (ESS), the speed of convergence (measured through the R̂ statistic at several points along the chains), and the acceptance rate with respect to the type of approximation (LA or EP) and the number of importance samples N_imp used to compute an unbiased estimate of the marginal likelihood p(y | θ).

In this section we report an analysis on simulated data showing how the choice of the approximation and the number of importance samples affect the efficiency in sampling from p(θ | y). We generated data sets from the GP classification model with different combinations of the number of input vectors n and the number of covariates d. The covariates were generated in the unit hypercube, and data were selected to have an equal number of input vectors in each class. We chose Gamma priors for the hyper-parameters σ and the length-scales. In the formulation of the GP classification model, all hyper-parameters have to be positive; for the sake of convenience, we reparameterized them by introducing the variables ψ_σ = log(σ) and ψ_τ = log(τ).

For each method, we ran parallel chains for a burn-in period followed by further sampling iterations; the convergence speed of the samplers was monitored using the Potential Scale Reduction Factor (PSRF, the R̂ statistic) as described in [41]. The chains were initialized from the prior, rather than using the procedure suggested in [41], to make the convergence test more challenging. Also, the correctness of the code was checked by using the idea presented in [42], which indirectly shows that the Markov chains indeed have p(θ | y) as their invariant distribution.

The proposal mechanism was the same for all the PM approaches for a given combination of n and d, so that it is meaningful to analyze the effect of N_imp on sampling efficiency, convergence speed, and acceptance rate. In particular, a large variance of the estimator of the marginal likelihood can eventually lead to the acceptance of a proposal θ′ because p̃(y | θ′) is largely overestimated, making it difficult for the chain to move away from there. In this case, the chain can get stuck and take several iterations before moving again; this effect has been reported in [24, 25]. To isolate the effect of N_imp and the type of approximation on sampling efficiency and acceptance rate, we tuned the chains using preliminary runs for EP, so as to achieve a suitable acceptance rate, and used the same proposal for LA and for the other values of N_imp.

The results are reported in table I for isotropic and ARD covariances. As a measure of efficiency, we used the minimum Effective Sample Size (ESS) [43] across the hyper-parameters. The table also reports the median R̂ achieved by the chains at several points along the run, which gives an idea of convergence as the iterations progress. Finally, table I reports the acceptance rate; a low acceptance rate compared to the one obtained by PM EP indicates that the chains are more likely to get stuck due to a large variance of the estimator of the marginal likelihood.
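The ESS measures how many effectively independent draws a correlated chain contains. One common autocorrelation-based estimator can be sketched as follows (the truncation convention below, summing lags until the first negative autocorrelation, is an assumption; the paper uses the estimator of [43]).

```python
import numpy as np

def effective_sample_size(x):
    """Effective sample size of a 1-D chain, n / (1 + 2 * sum of positive
    autocorrelations), truncating the sum at the first negative lag (a
    sketch of one common convention)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    rho_sum = 0.0
    for k in range(1, n):
        if acf[k] < 0:
            break
        rho_sum += acf[k]
    return n / (1.0 + 2.0 * rho_sum)

# i.i.d. draws have ESS close to n; a strongly autocorrelated AR(1) chain
# has a much lower ESS.
rng = np.random.default_rng(6)
iid = rng.standard_normal(5000)
ar = np.empty(5000)
ar[0] = 0.0
for t in range(1, 5000):
    ar[t] = 0.95 * ar[t - 1] + rng.standard_normal()
ess_iid = effective_sample_size(iid)
ess_ar = effective_sample_size(ar)
```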

The results indicate that sampling efficiency when employing EP to approximate the posterior distribution over f is higher than when employing the LA algorithm. It is striking that evaluating the PM with as few as one importance sample already seems to offer acceptable performance in terms of ESS compared to larger values of N_imp. However, a low acceptance rate when N_imp is small suggests that the corresponding chains can take several iterations before accepting any proposals.

[Table entries comparing the PM LA and PM EP schemes for isotropic and ARD covariances are not recoverable from the extraction.]
TABLE II: Average number of $O(n^3)$ operations required for each iteration of the PM approaches with the LA and EP approximations, and for each iteration of the AA and SURR parameterizations.

4.3 Comparison with reparameterization techniques

Table I also reports a comparison of the PM method with the AA and SURR sampling schemes, both run with a Metropolis-Hastings transition operator so that the results are meaningfully comparable. The proposal mechanism was tuned during the burn-in phase to achieve a suitable acceptance rate. Table I shows that the PM approach achieves faster convergence and higher sampling efficiency than the AA scheme, while the SURR method has convergence behavior and efficiency comparable to the AA scheme. This is somewhat different from the results presented in [22], where a component-wise slice sampling transition operator was employed: there, the SURR method achieved higher efficiency per covariance construction, likelihood evaluation, and running time than the AA method, whereas in the experiment reported here the ESS values are comparable.

Table II reports the average number of $O(n^3)$ operations needed for one iteration of the PM approach with the LA and EP approximations, and of the AA and SURR parameterizations. The table shows that EP is more expensive than the LA algorithm, and that the PM approaches require more operations than AA and SURR. Normalizing the ESS by the number of operations suggests, however, that the cost of obtaining independent samples with the PM approach is generally lower than with AA and SURR. In the PM approach we also tried stopping the approximation algorithms with a looser convergence criterion (results not reported); this yielded similar efficiency at much lower computational cost.

[Table entries comparing the AA, SURR, and PM schemes for isotropic and ARD covariances are not recoverable from the extraction.]
TABLE III: Analysis of convergence and efficiency of the AA and SURR parameterizations compared to the PM approach. Each Gibbs sampling update was repeated several times in order to highlight the effect of the parameterization alone. The PM approach is based on the EP approximation. The column next to the one reporting the ESS shows the ESS normalized by the number of $O(n^3)$ operations.

We conducted a further test of convergence with the aim of mitigating the effect of the random-walk exploration and highlighting the advantages offered by the different parameterizations. We ran the AA, SURR, and PM approaches repeating each step of the Gibbs sampler several times, so that the new sample drawn in each Gibbs sampling update can roughly be considered independent of the previous one. The behavior of the $\hat{R}$ statistic with respect to the number of iterations, reported in Table III, reveals some interesting features. All methods are characterized by fast convergence. In the case of the SURR and AA methods, however, efficiency is much lower than what the PM method achieves under the same repeated Gibbs sampling updates. This indicates that the parameterizations of the SURR and AA methods are not fully capable of breaking the correlation between hyper-parameters and latent variables. Finally, note that in the case of ARD covariances, the low efficiency of all methods is due to the random-walk type of exploration; in those cases, repeating the Gibbs sampling steps was not enough to ensure independence between consecutive samples.
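The logic of this test can be sketched on a toy model: a bivariate Gaussian in which one coordinate plays the role of the latent variables and the other the hyper-parameter, with the hyper-parameter updated by a random-walk MH step repeated `s_rep` times per Gibbs sweep. This is an illustrative assumption, not the paper's GP model; it shows that repeating the update approximates an exact conditional draw, so any residual autocorrelation reflects the coupling induced by the parameterization itself:

```python
import numpy as np

def gibbs(n_iter, s_rep, rho, rng):
    """Toy Gibbs sampler on a standard bivariate Gaussian with correlation
    rho. x | y is drawn exactly; y | x is updated by a random-walk MH step
    repeated s_rep times, mimicking the repeated Gibbs updates in the text."""
    x = y = 0.0
    ys = np.empty(n_iter)
    c2 = 1.0 - rho ** 2                      # conditional variance
    for i in range(n_iter):
        x = rho * y + np.sqrt(c2) * rng.normal()   # exact draw of x | y
        for _ in range(s_rep):                     # repeated MH updates of y | x
            prop = y + rng.normal()
            logr = -0.5 * ((prop - rho * x) ** 2 - (y - rho * x) ** 2) / c2
            if np.log(rng.uniform()) < logr:
                y = prop
        ys[i] = y
    return ys

def lag1(z):
    """Lag-1 autocorrelation of a chain (a crude stand-in for full ESS)."""
    z = z - z.mean()
    return (z[:-1] * z[1:]).mean() / z.var()

rng = np.random.default_rng(3)
a1 = lag1(gibbs(30000, 1, 0.95, rng))    # single MH step per sweep
a20 = lag1(gibbs(30000, 20, 0.95, rng))  # near-exact conditional draw
```

Repeating the update removes the autocorrelation contributed by the MH random walk, but a floor of roughly $\rho^2$ remains: exactly the residual coupling between the two blocks that no amount of within-block repetition can break, which is the effect the AA and SURR comparisons are designed to expose.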

4.4 Convergence speed and efficiency on real data

This section reports a comparison of classification performance on three UCI data sets [44], so as to verify the capability of the proposed approach to effectively carry out inference for the parameters of general GP classification problems. The Breast and Pima data sets are two-class classification problems. The Abalone data set has three classes; in this paper, we considered the task of inferring the parameters of a GP probit classifier for the two classes “M” and “F”. All covariates were transformed to have zero mean and unit standard deviation. We selected the isotropic covariance function in equation 5 and chose the following priors for its parameters: