Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models

The future predictive performance of a Bayesian model can be estimated using Bayesian cross-validation. In this article, we consider Gaussian latent variable models where the integration over the latent values is approximated using the Laplace method or expectation propagation (EP). We study the properties of several Bayesian leave-one-out (LOO) cross-validation approximations that in most cases can be computed with a small additional cost after forming the posterior approximation given the full data. Our main objective is to assess the accuracy of the approximative LOO cross-validation estimators. That is, for each method (Laplace and EP) we compare the approximate fast computation with the exact brute force LOO computation. Secondarily, we evaluate the accuracy of the Laplace and EP approximations themselves against a ground truth established through extensive Markov chain Monte Carlo simulation. Our empirical results show that the approach based upon a Gaussian approximation to the LOO marginal distribution (the so-called cavity distribution) gives the most accurate and reliable results among the fast methods.


1 Introduction

Bayesian cross-validation can be used to assess predictive performance. Vehtari and Ojanen (2012) provide an extensive review of theory and methods in Bayesian predictive performance assessment including decision theoretical assumptions made in Bayesian cross-validation. Gelman et al. (2014) provide further details on theoretical properties of leave-one-out cross-validation and information criteria, and Vehtari et al. (2016) provide practical fast computation in the case of Monte Carlo posterior inference. In this article, we present the properties of several Bayesian leave-one-out (LOO) cross-validation approximations for Gaussian latent variable models (GLVM) with factorizing likelihoods. Integration over the latent variables is performed with either the Laplace method or expectation propagation (EP). We show that for these methods leave-one-out cross-validation can be computed accurately with zero or a negligible additional cost after forming the full data posterior approximation.

Global (Gaussian) and factorizing variational approximations for latent variable inference are not considered in this paper. They have the same order of computational complexity as Laplace and EP, but with a larger pre-factor on the dominating $O(n^3)$ term, where $n$ is the number of observations (Nickisch and Rasmussen, 2008). EP may be expected to be the most accurate method (e.g. Nickisch and Rasmussen, 2008; Vanhatalo and Vehtari, 2010; Jylänki et al., 2011; Riihimäki et al., 2013) and Laplace to have the smallest computational overhead, so EP and Laplace may be considered the methods of choice for accuracy and speed, respectively. We expect that our overall results and conclusions for Laplace and EP carry over to the Gaussian variational approximation. For non-GLVM models such as generalized linear and deep generative models, the (factorized) Gaussian variational approximations scale to large datasets (Challis and Barber, 2013; Ranganath et al., 2014; Kingma and Welling, 2014; Rezende et al., 2014). It would be of interest to derive approximate LOO estimators for these models, but that is outside the scope of this paper.

We consider a prediction problem with an explanatory variable $x$ and an outcome variable $y$; the same notation is used interchangeably for scalar and vector-valued quantities. The observed data are denoted by $D = \{(x_i, y_i)\}_{i=1}^{n}$ and future observations by $(\tilde{x}, \tilde{y})$. We focus on GLVMs, where the observation model depends on a local latent value $f_i$ and possibly on some global parameters $\phi$, such as the scale of the measurement error process. The latent values $f = (f_1, \ldots, f_n)$ have a joint Gaussian prior which depends on covariates and hyperparameters $\theta$ (e.g., covariance function parameters for a Gaussian process). The posterior of the latent values is then

 p(f \mid D, \theta, \phi) \propto p(f \mid x, \theta) \prod_{i=1}^{n} p(y_i \mid f_i, \phi).   (1)

As a specific example we use Gaussian process (GP) models (reviewed, e.g., by Rasmussen and Williams, 2006), but the methods are applicable also to other GLVMs with the same factorizing form (e.g. Gaussian Markov random field models used in the R-INLA software (Lindgren and Rue, 2015)). Some of the presented methods are applicable more generally, requiring only a factorizing likelihood with terms $p(y_i \mid f_i, \phi)$ and a method to integrate over the marginal posteriors $p(f_i \mid D)$. The results presented in this paper can be generalized to cases where a likelihood term depends on more than one latent variable (e.g. Tolvanen et al., 2014) or the latent value prior is non-Gaussian (e.g. Seeger, 2008; Hernández-Lobato et al., 2008; Hernández-Lobato et al., 2010). For clarity we restrict our treatment to the case where each likelihood term depends on a single latent value.

We are interested in assessing the predictive performance of our models, either to report it to application experts or to perform model selection. For simplicity, in this paper we use only the logarithmic score, but the methods can also be used with application-specific utilities such as classification error. The logarithmic score is the standard scoring rule in Bayesian cross-validation (see Geisser and Eddy, 1979) and it has desirable properties for scientific inference (Bernardo and Smith, 1994; Gneiting and Raftery, 2007).

The predictive distribution for a future observation $\tilde{y}$ given future covariate values $\tilde{x}$ is

 p(\tilde{y} \mid \tilde{x}, D) = \int p(\tilde{y} \mid \tilde{f}, \phi)\, p(\tilde{f} \mid \tilde{x}, D, \theta)\, p(\phi, \theta \mid D)\, d\tilde{f}\, d\phi\, d\theta.   (2)

The expected predictive performance under the log score and the unknown true distribution $p_t(\tilde{x}, \tilde{y})$ of the future observation is

 \int p_t(\tilde{x}, \tilde{y}) \log p(\tilde{y} \mid \tilde{x}, D)\, d\tilde{x}\, d\tilde{y}.   (3)

This expectation can be approximated by re-using the observations and computing the leave-one-out Bayesian cross-validation estimate

 \mathrm{LOO} = \frac{1}{n} \sum_{i=1}^{n} \log p(y_i \mid x_i, D_{-i}),   (4)

where $D_{-i}$ denotes all the observations except the $i$th one. Here we consider only cases where $\tilde{x}$ is random from the same distribution as $x$. See Vehtari and Ojanen (2012) for discussion of fixed, shifted, deterministic, or constrained $\tilde{x}$.

In addition to estimating the expected log predictive density, it may be interesting to look at the individual terms $p(y_i \mid x_i, D_{-i})$. These terms, also called conditional predictive ordinates ($\mathrm{CPO}_i$), may reveal observations which are highly influential or not well explained by the model (see, e.g., Gelfand, 1996). The probability integral transform (PIT) values $P(y_i \mid x_i, D_{-i})$, where $P$ denotes the predictive cumulative distribution function, can be used to assess the calibration of the predictions (see, e.g., Gneiting et al., 2007).

The straightforward brute force implementation of leave-one-out cross-validation requires recomputing the posterior distribution $n$ times. Often leave-one-out cross-validation is replaced with $k$-fold cross-validation, which requires only $k$ recomputations of the posterior, with $k$ usually 10 or less. Although $k$-fold-CV is robust and would often be computationally feasible, there are several fast approximations for computing LOO with a negligible additional computational cost after forming the posterior with the full data.
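To make the brute-force baseline concrete, the following sketch computes the LOO estimate of Equation (4) by literally refitting $n$ times, for a toy conjugate model where each refit is cheap. The model, data, and the `loo_predictive` helper are illustrative stand-ins, not the GLVM setting of the paper.

```python
import numpy as np

# Brute-force LOO (Eq. 4) for a toy conjugate model y_i ~ N(mu, s2) with a
# N(0, t2) prior on mu: refit the posterior n times, each time leaving one
# observation out, and score the held-out point. All values are illustrative.
def norm_logpdf(x, m, v):
    return -0.5 * np.log(2 * np.pi * v) - 0.5 * (x - m) ** 2 / v

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=20)
s2, t2 = 1.0, 10.0

def loo_predictive(y_rest, y_held):
    # Conjugate posterior of mu given y_rest; the posterior predictive for
    # the held-out point is Gaussian with variance inflated by s2.
    post_var = 1.0 / (len(y_rest) / s2 + 1.0 / t2)
    post_mean = post_var * y_rest.sum() / s2
    return norm_logpdf(y_held, post_mean, post_var + s2)

loo = np.mean([loo_predictive(np.delete(y, i), y[i]) for i in range(y.size)])
```

The fast approximations discussed below aim to reproduce this quantity without the $n$ refits.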

Several studies have shown that the Laplace method and EP perform well (compared to the gold standard Markov chain Monte Carlo inference) for GLVMs with many log-concave likelihoods (e.g. Rue et al., 2009; Vanhatalo et al., 2010; Martins et al., 2013; Riihimäki and Vehtari, 2014). EP has also been shown to be close to Markov chain Monte Carlo inference for classification models (log-concave likelihood, but potentially highly skewed posterior) and non-log-concave likelihoods (e.g. Nickisch and Rasmussen, 2008; Vanhatalo and Vehtari, 2010; Jylänki et al., 2011; Cseke and Heskes, 2011; Riihimäki et al., 2013; Vanhatalo et al., 2013; Tolvanen et al., 2014). In this paper we also assess the accuracy of approximate LOO, using standard Markov chain Monte Carlo inference for LOO as our benchmark.

In practical data analysis work, it is useful to start with fast approximations and step by step check whether a computationally more expensive approach can improve the predictive accuracy. We propose the following three step approach:

1. Find the MAP estimate $(\hat{\theta}, \hat{\phi})$ using the Laplace method to approximately integrate over the latent values $f$.

2. Using $(\hat{\theta}, \hat{\phi})$ obtained in the previous step, use EP to integrate over the latent values $f$ and check whether the predictive performance improves substantially compared to using the Laplace method (we may also re-estimate $\hat{\theta}$ and $\hat{\phi}$).

3. Integrate over $\theta$ and $\phi$ and check whether integration over the parameters improves predictive performance.

Details of the computations involved are given in Sections 2 and 3. Based on these steps we can continue with the model that has the best predictive performance or the one that makes predictions fastest, or both. Often we also need to re-estimate models when data are updated or additional covariates become available, and then again a fast and accurate posterior approximation is useful. To follow the above approach, we need accurate predictive performance estimates for the Laplace method and EP.

The main contributions of this paper are:

• A unified presentation and thorough empirical comparison of methods for approximate LOO for Gaussian latent variable models with both log-concave and non-log-concave likelihoods and MAP and hierarchical approaches for handling hyperparameter inference (Section 3).

• The main conclusion from the empirical investigation (Section 4) is the observed superior accuracy/complexity tradeoff of Gaussian latent cavity distribution based LOO estimators. Although there are more accurate non-Gaussian approximations of the marginal posteriors, their use does not translate into substantial improvements in terms of LOO cross-validation accuracy and also introduces considerable instability. Using the widely applicable information criterion (WAIC) in the computation does not provide any benefits.

• The Laplace Gaussian cavity distribution (LA-LOO) (Section 3.5), although mentioned by Cseke and Heskes (2011), has not previously been used for LOO estimation. We prove the LOO consistency of LA-LOO using linear response theory (Appendix A).

• Truncated weights quadrature integration (Section 3.7) inspired by truncated importance sampling is a novel way to stabilize the quadrature used in some LOO computations.

2 Gaussian latent variable models

In this section, we briefly review the notation and methods for Gaussian latent variable models used in the rest of the article. We focus on Gaussian processes (see, e.g., Rasmussen and Williams, 2006), but most of the discussion also holds for other factorizing GLVMs. We consider models with a Gaussian prior on the latent values $f$ and a factorizing likelihood,

 p(f \mid D, \theta, \phi) = \frac{1}{Z} \prod_{i=1}^{n} p(y_i \mid f_i, \phi)\, p(f \mid X, \theta),   (5)

where $Z$ is a normalization factor equal to the marginal likelihood. For example, in the Gaussian process framework the multivariate Gaussian prior on the latent values is $\mathrm{N}(f \mid \mu, K)$, where $\mu$ is the prior mean and $K$ is a covariance matrix constructed by a covariance function, which characterizes the correlation between two points. In this paper, we assume that the prior mean is zero, but the results generalize to nonzero prior means as well.

2.1 Gaussian observation model

With a Gaussian observation model,

 p(y_i \mid f_i, \sigma^2) = \mathrm{N}(y_i \mid f_i, \sigma^2),   (6)

where $\sigma^2$ is the noise variance, the conditional posterior of the latent variables is a multivariate Gaussian,

 p(f \mid D, \theta, \phi) = \mathrm{N}(f \mid \mu, \Sigma), \quad \text{where} \quad \mu = K(K + \sigma^2 I)^{-1} y \quad \text{and} \quad \Sigma = (K^{-1} + \sigma^{-2} I)^{-1} = K - K(K + \sigma^2 I)^{-1} K.   (7)

The marginal posterior is simply $p(f_i \mid D, \theta, \phi) = \mathrm{N}(f_i \mid \mu_i, \Sigma_{ii})$, and the marginal likelihood can be computed analytically using properties of the multivariate Gaussian (see, e.g., Rasmussen and Williams, 2006).
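The closed-form posterior above is easy to compute directly. The following sketch does so for a squared-exponential kernel; the kernel, data, and noise level are illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np

# Latent posterior for a GP with Gaussian noise (Eq. 7):
#   mu = K (K + s2 I)^{-1} y,   Sigma = K - K (K + s2 I)^{-1} K.
# Squared-exponential kernel and data are illustrative choices.
def se_kernel(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
y = np.sin(x) + rng.normal(0.0, 0.3, size=x.size)
s2 = 0.3 ** 2

K = se_kernel(x, x)
A = K + s2 * np.eye(x.size)
mu = K @ np.linalg.solve(A, y)
Sigma = K - K @ np.linalg.solve(A, K)

# Marginal posteriors p(f_i | D) are N(mu_i, Sigma_ii).
marg_var = np.diag(Sigma).copy()
```

In practice a Cholesky factorization of `A` would be reused for both solves; the direct `solve` calls keep the sketch close to Equation (7).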

2.2 Non-Gaussian observation model

In the case of a non-Gaussian likelihood, the conditional posterior needs to be approximated. In this paper, we focus on expectation propagation (EP) and the Laplace method (LA), which form a multivariate Gaussian approximation of the joint latent posterior

 q(f \mid D, \theta, \phi) = \frac{1}{Z} p(f \mid X, \theta) \prod_{i=1}^{n} \tilde{t}_i(f_i),   (8)

where the $\tilde{t}_i(f_i)$ are (unnormalized) Gaussian approximations of the likelihood contributions. We use $q$ to denote approximative joint and marginal distributions in general; the specific approximation used in each case can be inferred from the context.

2.3 Expectation propagation

Expectation propagation (Opper and Winther, 2000; Minka, 2001) approximates independent non-Gaussian likelihood terms by unnormalized Gaussian form site approximations (aka pseudo-observations),

 p(y_i \mid f_i) \simeq \tilde{t}_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\Sigma}_i) = \tilde{Z}_i\, \mathrm{N}(f_i \mid \tilde{\mu}_i, \tilde{\Sigma}_i),   (9)

where $\tilde{Z}_i$, $\tilde{\mu}_i$, and $\tilde{\Sigma}_i$ are the parameters of the site approximations, or site parameters. The joint latent posterior approximation is then

 p(f \mid D, \phi, \theta) = \frac{1}{Z} p(f \mid X, \theta) \prod_i p(y_i \mid f_i, \phi) \approx \frac{1}{Z_{\mathrm{EP}}} p(f \mid X, \theta) \prod_i \tilde{t}_i(f_i) = q(f \mid D, \phi, \theta),   (10)

where $Z$ is the normalization constant, or the marginal likelihood, $Z_{\mathrm{EP}}$ is the EP approximation of the marginal likelihood, and $q(f \mid D, \phi, \theta)$ is a multivariate Gaussian posterior approximation.

EP updates the site approximations by iteratively improving the accuracy of the marginals. To update the $i$th site approximation, it is first removed from the marginal approximation to form a cavity distribution,

 q_{-i}(f_i) \propto q(f_i \mid D)\big/\tilde{t}_i(f_i),   (11)

where the marginal $q(f_i \mid D)$ is obtained analytically using properties of the multivariate Gaussian.

The cavity distribution is combined with the original likelihood term to form a more accurate marginal distribution called the tilted distribution:

 q_{-i}(f_i)\, p(y_i \mid f_i, \phi).   (12)

Minimizing the Kullback-Leibler divergence from the tilted distribution to the marginal approximation corresponds to matching the moments of the two distributions. Hence, for a Gaussian approximation, the zeroth, first, and second moments of the tilted distribution are computed, for example, using one-dimensional numerical integration. The site parameters are then updated so that the moments of the marginal approximation $q(f_i \mid D)$ match the moments of the tilted distribution. The new marginal approximation can be computed after a single site approximation has been updated (sequential EP) or after all the site approximations have been updated (parallel EP).
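A single site update can be sketched in a few lines: cavity by Gaussian division, tilted moments by quadrature, and a moment-matched site refit. The Bernoulli-logit likelihood, the grid quadrature, and all numbers below are illustrative choices, not the paper's implementation.

```python
import numpy as np

# One EP site update for a single marginal: form the cavity by Gaussian
# division (Eq. 11), compute zeroth/first/second tilted moments by 1-D grid
# quadrature, and refit the site so the new marginal matches those moments.
# The Bernoulli-logit likelihood and all numbers are illustrative.
def ep_site_update(m_marg, v_marg, m_site, v_site, y):
    # Cavity distribution: subtract the site in natural parameters.
    v_cav = 1.0 / (1.0 / v_marg - 1.0 / v_site)
    m_cav = v_cav * (m_marg / v_marg - m_site / v_site)
    # Moments of the tilted distribution q_cav(f) * p(y | f) on a grid.
    f = m_cav + np.sqrt(v_cav) * np.linspace(-8.0, 8.0, 2001)
    w = np.exp(-0.5 * (f - m_cav) ** 2 / v_cav) / (1.0 + np.exp(-y * f))
    m_tilt = (f * w).sum() / w.sum()
    v_tilt = ((f - m_tilt) ** 2 * w).sum() / w.sum()
    # New site parameters so that cavity * site reproduces the tilted moments.
    v_new = 1.0 / (1.0 / v_tilt - 1.0 / v_cav)
    m_new = v_new * (m_tilt / v_tilt - m_cav / v_cav)
    return m_cav, v_cav, m_new, v_new

m_cav, v_cav, m_site, v_site = ep_site_update(0.5, 1.0, 0.0, 10.0, y=1)
```

For a log-concave likelihood such as this one, the tilted variance is smaller than the cavity variance, so the new site variance stays positive; non-log-concave likelihoods can produce negative site variances, which full EP implementations must handle.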

2.4 Laplace approximation

The Laplace approximation is constructed from the second-order Taylor expansion of $\log p(f \mid D, \theta, \phi)$ around the mode $\hat{f}$, which gives a Gaussian approximation to the conditional posterior,

 q(f \mid D, \theta, \phi) = \mathrm{N}(f \mid \hat{f}, \hat{\Sigma}) \approx p(f \mid D, \theta, \phi),   (13)

where $\hat{\Sigma} = (K^{-1} + W)^{-1}$ is the negative inverse of the Hessian of the log posterior at the mode, with $W$ a diagonal matrix with elements $W_{ii} = \tilde{\Sigma}_i^{-1}$ (e.g., Rasmussen and Williams, 2006; Gelman et al., 2013), where

 \tilde{\Sigma}_i = -1 \Big/ \nabla_i \nabla_i \log p(y_i \mid f_i, \phi)\big|_{f_i = \hat{f}_i}.   (14)

From this joint Gaussian approximation we can analytically compute a Gaussian approximation of the marginal posterior $q(f_i \mid D, \theta, \phi)$ and an approximation of the marginal likelihood. The Laplace approximation can also be written as

 q(f \mid D, \theta, \phi) = \frac{1}{Z} p(f \mid X, \theta) \prod_{i=1}^{n} \tilde{t}_i(f_i),   (15)

where the $\tilde{t}_i(f_i) = \mathrm{N}(f_i \mid \tilde{\mu}_i, \tilde{\Sigma}_i)$ are Gaussian terms with

 \tilde{\mu}_i = \hat{f}_i + \tilde{\Sigma}_i \nabla_i \log p(y_i \mid f_i, \phi)\big|_{f_i = \hat{f}_i}.   (16)
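The mode $\hat{f}$ is typically found by Newton iteration. The following minimal sketch uses the standard numerically stabilized update (in the style of Rasmussen and Williams, 2006) for a Bernoulli-logit likelihood; the kernel, data, and iteration count are illustrative.

```python
import numpy as np

# Newton iteration for the mode f_hat of a GP with a Bernoulli-logit
# likelihood; the Laplace approximation (Eq. 13) is N(f | f_hat, Sigma_hat)
# with Sigma_hat = (K^{-1} + W)^{-1}, W the diagonal of negative second
# derivatives of the log likelihood. Kernel and data are illustrative.
def se_kernel(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(2)
x = np.linspace(-3.0, 3.0, 25)
y = np.where(rng.normal(np.tanh(x)) > 0.0, 1.0, -1.0)  # labels in {-1, +1}
t = (y + 1.0) / 2.0                                     # labels in {0, 1}

K = se_kernel(x, x) + 1e-8 * np.eye(x.size)
f = np.zeros(x.size)
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-f))
    grad = t - p                  # d/df log p(y | f)
    W = p * (1.0 - p)             # -d^2/df^2 log p(y | f)
    sW = np.sqrt(W)
    B = np.eye(x.size) + sW[:, None] * K * sW[None, :]
    b = W * f + grad
    a = b - sW * np.linalg.solve(B, sW * (K @ b))
    f = K @ a                     # stabilized Newton step
f_hat = f
Sigma_hat = K - K @ (sW[:, None] * np.linalg.solve(B, sW[:, None] * K))
```

The form `Sigma_hat = K - K sqrt(W) B^{-1} sqrt(W) K` avoids inverting `K` directly, which can be badly conditioned for smooth kernels.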

2.5 Marginal posterior approximations

Many leave-one-out approximation methods require explicit computation of the full posterior marginal approximations. We thus review alternative Gaussian and non-Gaussian approximations of the marginal posteriors following Cseke and Heskes (2011). The exact joint posterior can be written as (dropping the conditioning on $D$, $\theta$, and $\phi$ for brevity)

 p(f) \propto q(f) \prod_i \epsilon_i(f_i), \quad \text{with} \quad \epsilon_i(f_i) = p(y_i \mid f_i)\big/\tilde{t}_i(f_i),   (17)

where $\epsilon_i(f_i)$ is the ratio of the exact likelihood term and the site term approximating it. By integrating over the other latent variables, the marginal posterior can be written as

 p(f_i) \propto q(f_i)\, \epsilon_i(f_i)\, c_i(f_i), \quad \text{where} \quad c_i(f_i) = \int q(f_{-i} \mid f_i) \prod_{j \neq i} \epsilon_j(f_j)\, df_{-i}   (18)

and $f_{-i}$ denotes all latent variables except $f_i$. Local methods use only the term $\epsilon_i(f_i)$, which depends locally on $f_i$. Global methods additionally use an approximation of $c_i(f_i)$, which depends globally on all latent variables. Next we briefly review different marginal posterior approximations of this exact marginal (see Table 1 for a summary).

Gaussian approximations.

The simplest approximation is to use the Gaussian marginals $q(f_i)$, which are easily obtained from the joint Gaussian given by the Laplace approximation or expectation propagation; we call these LA-G and EP-G. Denoting the means and variances of the pseudo observations (defined by the site terms) by $\tilde{\mu}$ and $\tilde{\Sigma}$ respectively, the joint approximation has the same form as in the Gaussian case:

 q(f \mid D, \theta, \phi) = \mathrm{N}(\mu, \Sigma), \quad \text{with} \quad \mu = \Sigma \tilde{\Sigma}^{-1} \tilde{\mu} \quad \text{and} \quad \Sigma = (K^{-1} + \tilde{\Sigma}^{-1})^{-1},   (19)

where $\tilde{\Sigma}$ is a diagonal matrix with entries $\tilde{\Sigma}_i$. The marginal is then simply $q(f_i) = \mathrm{N}(f_i \mid \mu_i, \Sigma_{ii})$.

Non-Gaussian approximations using a local correction.

The simplest improvement to the Gaussian marginals is to include the local term $\epsilon_i(f_i)$ and to assume that the global term $c_i(f_i) \approx 1$. For EP the result is the tilted distribution $q_{-i}(f_i)\,p(y_i \mid f_i)$, which is obtained as a part of the EP algorithm (Opper and Winther, 2000). As only the local terms are used to compute the improvement, Cseke and Heskes (2011) refer to it as the local improvement and denote the locally improved EP marginal as EP-L.

For the Laplace approximation, Cseke and Heskes (2011) propose a similar local improvement LA-L, which can be written as $q(f_i)\,\epsilon_i(f_i)$, where the site approximation is based on the second order approximation of the log likelihood at the mode (see Section 2.4). In Section 3.5, we propose an alternative way to compute the equivalent marginal improvement using a tilted distribution $q_{-i}(f_i)\,p(y_i \mid f_i)$, where the cavity distribution $q_{-i}(f_i)$ is based on a leave-one-out formula derived using linear response theory (Appendix A). The local methods EP-L and LA-L can improve the marginal posterior approximation only at the observed $x_i$; the marginal posterior at a new $\tilde{x}$ is the usual Gaussian predictive distribution.

Non-Gaussian approximations using a global correction.

Global approximations also take into account the global term $c_i(f_i)$ by approximating the multidimensional integral in Equation (18), again using Laplace or EP. To obtain an approximation for the marginal distribution, the integral has to be evaluated at several values of $f_i$, and the computations can be time consuming unless some simplifications are used. Global methods can be used to obtain an improved non-Gaussian posterior marginal approximation also at a not yet observed $\tilde{x}$.

Using the Laplace approximation to evaluate $c_i(f_i)$ corresponds to an approach proposed by Tierney and Kadane (1986), and so we label the marginal improvement LA-TK. Rue et al. (2009) proposed an approach that can be seen as a compromise between the computationally intensive LA-TK and the local approximation LA-L. Instead of finding the conditional mode for each value of $f_i$, they evaluate the Taylor expansion around the conditional mean obtained from the joint approximation; this method is referred to as LA-CM. Cseke and Heskes (2011) propose the improvement LA-CM2, which adds a correction to account for the Taylor expansion not being done at the mode. To further reduce the computational effort, Rue et al. (2009) propose additional approximations with performance somewhere between LA-CM and LA-L. Rue et al. (2009) also discuss computationally efficient schemes for selecting the evaluation points $f_i$, and interpolation or parametric model fitting to estimate the marginal density at other values of $f_i$. Cseke and Heskes (2011) propose similar approaches for EP, with EP-FULL corresponding to LA-TK and EP-1STEP corresponding to LA-CM/LA-CM2. Cseke and Heskes (2011) also propose EP-FACT and LA-FACT, which use a factorized approximation to speed up the computation of the normalization terms.

The local improvements EP-L and LA-L are obtained practically for free, while all global approximations are significantly slower. See Appendix B for the computational complexities of the global approximations. Based on the results of Cseke and Heskes (2011), EP-L is inferior to the global approximations, but the difference is often small, while LA-L is often clearly worse than the global approximations. Based on the results of Cseke and Heskes (2011) and our own experiments, we chose LA-CM2 and EP-FACT as the global corrections in the experiments.

2.6 Integration over the parameters

To marginalize out the parameters $\theta$ and $\phi$ from the previously mentioned conditional posteriors, we can use the exact or approximated marginal likelihood to form the marginal posterior of the parameters,

 p(\theta, \phi \mid D) \propto p(y \mid X, \theta, \phi)\, p(\theta, \phi),   (20)

and use numerical integration to integrate over $\theta$ and $\phi$. Commonly used methods include various Monte Carlo algorithms (see the list of references in Vanhatalo et al., 2013) as well as deterministic procedures, such as the central composite design (CCD) method by Rue et al. (2009). Using stochastic or deterministic samples $(\theta_s, \phi_s)$, the marginal posterior can be approximated as

 p(f \mid D) \approx \sum_{s=1}^{S} p(f \mid D, \phi_s, \theta_s)\, w_s,   (21)

where $w_s$ is the weight of the sample $(\theta_s, \phi_s)$.

If the marginal posterior distribution is narrow, which can happen if $n$ is large and the number of parameters is small, then the effect of the integration over the parameters may be negligible, and we can use the Type II MAP estimate, that is, choose $(\hat{\theta}, \hat{\phi}) = \arg\max_{\theta, \phi}\, p(\theta, \phi \mid D)$.
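The weighted mixture in Equation (21) can be illustrated with toy numbers: given evaluation points, normalized weights, and Gaussian conditional posteriors, the marginal posterior is a mixture whose moments follow from the standard mixture formulas. Every density and grid below is an illustrative stand-in.

```python
import numpy as np

# Marginalizing the hyperparameters with weighted evaluation points (Eq. 21):
# the latent posterior becomes a mixture of the conditional posteriors with
# weights proportional to the prior-weighted marginal likelihood. The grid
# and all densities below are illustrative toy values.
theta = np.linspace(0.2, 2.0, 10)            # evaluation points theta_s
log_joint = -0.5 * (theta - 1.0) ** 2 / 0.1  # toy log p(y|theta) + log p(theta)
w = np.exp(log_joint - log_joint.max())      # subtract max for stability
w /= w.sum()                                 # normalized weights w_s

# Toy conditional posteriors p(f | D, theta_s) = N(m_s, v_s):
m_s = 0.5 * theta
v_s = 0.1 + 0.05 * theta

# First two moments of the mixture approximation to p(f | D):
m_mix = np.sum(w * m_s)
v_mix = np.sum(w * (v_s + m_s ** 2)) - m_mix ** 2
```

The mixture variance exceeds the average conditional variance whenever the conditional means disagree, which is exactly the extra uncertainty contributed by the hyperparameters.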

3 Leave-one-out cross-validation approximations

We start by reviewing the generic exact LOO equations, which are then used to provide a unifying view of the different approximations in the subsequent sections. We first review some special cases and then more generic approximations. The LOO approximations and their abbreviations are listed in Table 2. The computational complexities of the LOO approximations have been collected in Appendix B.

3.1 LOO from the full posterior

Consider the case where we have not yet seen the $i$th observation. Then using Bayes' rule we can add the information from the $i$th observation:

 p(f_i \mid D) = \frac{p(y_i \mid f_i)\, p(f_i \mid x_i, D_{-i})}{p(y_i \mid x_i, D_{-i})},   (22)

again dropping $\theta$ and $\phi$ for brevity. Correspondingly, we can remove the effect of the $i$th observation from the full posterior:

 p(f_i \mid x_i, D_{-i}) = \frac{p(f_i \mid D)\, p(y_i \mid x_i, D_{-i})}{p(y_i \mid f_i)}.   (23)

If we now integrate both sides over $f_i$ and rearrange the terms, we get

 p(y_i \mid x_i, D_{-i}) = 1 \Big/ \int \frac{p(f_i \mid D)}{p(y_i \mid f_i)}\, df_i.   (24)

In theory this gives the exact LOO result, but in practice we usually need to approximate the marginal posterior $p(f_i \mid D)$ and the integral over $f_i$. In the following sections we first discuss the hierarchical approach, and then the analytic, Monte Carlo, quadrature, WAIC, and Taylor series approaches for computing the conditional version of Equation (24). We then consider how the different marginal posterior approximations affect the result.
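Equation (24) can be verified numerically in a conjugate toy model where both sides are tractable: one latent value with two Gaussian observations, the right-hand side by one-dimensional quadrature and the left-hand side in closed form. All values are illustrative.

```python
import numpy as np

# Numerical check of Eq. (24): p(y_i | D_-i) = 1 / int p(f|D) / p(y_i|f) df,
# for a toy model f ~ N(0, t2), y_j | f ~ N(f, s2). Values are illustrative.
def norm_pdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

t2, s2 = 2.0, 0.5
y1, y2 = 1.3, 0.7

# Full posterior p(f | y1, y2) by conjugate Gaussian updating.
post_var = 1.0 / (1.0 / t2 + 2.0 / s2)
post_mean = post_var * (y1 + y2) / s2

# Right-hand side of Eq. (24) by quadrature over f on a uniform grid.
f = np.linspace(post_mean - 10.0, post_mean + 10.0, 20001)
integrand = norm_pdf(f, post_mean, post_var) / norm_pdf(y1, f, s2)
rhs = 1.0 / (integrand.sum() * (f[1] - f[0]))

# Exact LOO predictive p(y1 | y2) = N(y1 | m2, v2 + s2).
v2 = 1.0 / (1.0 / t2 + 1.0 / s2)
m2 = v2 * y2 / s2
lhs = norm_pdf(y1, m2, v2 + s2)
```

The integrand is well behaved here because the posterior has higher precision than the removed likelihood term; this is the same condition that keeps the importance weights of Section 3.6 stable.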

In some cases, we can compute $p(f_i \mid x_i, D_{-i})$ exactly or approximate it efficiently, and then we can compute the LOO predictive density for $y_i$,

 p(y_i \mid x_i, D_{-i}) = \int p(y_i \mid f_i)\, p(f_i \mid x_i, D_{-i})\, df_i.   (25)

Or, if we are interested in the predictive distribution of a new observation $\tilde{y}_i$ at $x_i$, we can compute

 p(\tilde{y}_i \mid x_i, D_{-i}) = \int p(\tilde{y}_i \mid f_i)\, p(f_i \mid x_i, D_{-i})\, df_i,   (26)

which is evaluated at different values of $\tilde{y}_i$, as $\tilde{y}_i$ is not fixed like $y_i$.

3.2 Hierarchical approximations

Instead of approximating the leave-one-out predictive density directly, for hierarchical models such as GLVMs it is often easier to first compute the leave-one-out predictive density conditional on the parameters $\theta$ and $\phi$, then compute the leave-one-out posterior of the parameters, and combine the results:

 p(y_i \mid x_i, D_{-i}) = \int p(y_i \mid x_i, D_{-i}, \theta, \phi)\, p(\theta, \phi \mid D_{-i})\, d\theta\, d\phi.   (27)

Sometimes the leave-one-out posterior of the hyperparameters is close to the full posterior, that is, $p(\theta, \phi \mid D_{-i}) \approx p(\theta, \phi \mid D)$. The joint leave-one-out posterior can then be approximated as

 p(f_i \mid x_i, D_{-i}) \approx \int p(f_i \mid x_i, D_{-i}, \theta, \phi)\, p(\theta, \phi \mid D)\, d\theta\, d\phi   (28)

(see, e.g., Marshall and Spiegelhalter, 2003). This approximation is a reasonable alternative if removing the $i$th observation has only a small impact on the parameter posterior but a larger impact on the latent posterior. Furthermore, if the posterior $p(\theta, \phi \mid D)$ is narrow, a Type II MAP point estimate of the parameters may produce predictions similar to integrating over the parameters,

 p(f_i \mid x_i, D_{-i}) \approx p(f_i \mid x_i, D_{-i}, \hat{\theta}, \hat{\phi}).   (29)

3.3 LOO with Gaussian likelihood

If both the prior and the likelihood are Gaussian, then we can compute $p(f_i \mid x_i, D_{-i}, \theta, \phi)$ analytically. Starting from the marginal posterior we can remove the contribution of the $i$th likelihood factor:

 p(f_i \mid x_i, D_{-i}, \theta, \phi) \propto \frac{p(f_i \mid D, \theta, \phi)}{p(y_i \mid f_i, \phi)} = \mathrm{N}(f_i \mid \mu_{-i}, v_{-i}),   (30)

where

 \mu_{-i} = v_{-i}\left(\Sigma_{ii}^{-1}\mu_i - \sigma^{-2} y_i\right), \qquad v_{-i} = \left(\Sigma_{ii}^{-1} - \sigma^{-2}\right)^{-1}.   (31)

These equations correspond to the cavity distribution equations in EP.

Sundararajan and Keerthi (2001) derived the leave-one-out predictive distribution directly from the joint posterior using the prediction equations and properties of partitioned matrices. This gives a numerically different but mathematically equivalent way to compute the leave-one-out posterior mean and variance:

 \mu_{-i} = y_i - \bar{c}_{ii}^{-1} g_i, \qquad v_{-i} = \bar{c}_{ii}^{-1} - \sigma^2,   (32)

where

 g_i = \left[(K + \sigma^2 I)^{-1} y\right]_i, \qquad \bar{c}_{ii} = \left[(K + \sigma^2 I)^{-1}\right]_{ii}.   (33)

Sundararajan and Keerthi (2001) also provided the equation for the LOO log predictive density:

 \log p(y_i \mid x_i, D_{-i}, \theta, \phi) = -\tfrac{1}{2}\log(2\pi) + \tfrac{1}{2}\log \bar{c}_{ii} - \tfrac{1}{2}\, g_i^2 \big/ \bar{c}_{ii}.   (34)

Instead of integrating over the parameters, Sundararajan and Keerthi (2001) used the result (and its gradient) to find a point estimate for the parameters maximizing the LOO log predictive density.
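As a sanity check, Equations (32)-(34) can be implemented in a few lines and compared against brute-force refitting; they should agree to numerical precision. The kernel and data below are illustrative.

```python
import numpy as np

# Analytic LOO for a GP with Gaussian noise (Eqs. 32-34): with
# C = K + s2 I, g_i = [C^{-1} y]_i and c_ii = [C^{-1}]_ii,
#   log p(y_i | x_i, D_-i) = -0.5 log(2 pi) + 0.5 log c_ii - 0.5 g_i^2 / c_ii.
# Kernel and data are illustrative.
def se_kernel(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 5.0, 15))
y = np.sin(x) + rng.normal(0.0, 0.2, size=x.size)
s2 = 0.2 ** 2

C_inv = np.linalg.inv(se_kernel(x, x) + s2 * np.eye(x.size))
g = C_inv @ y
c = np.diag(C_inv)
loo_lpd = -0.5 * np.log(2 * np.pi) + 0.5 * np.log(c) - 0.5 * g ** 2 / c

# Brute-force check: refit with the i-th observation left out.
def brute_lpd(i):
    m = np.arange(x.size) != i
    K11 = se_kernel(x[m], x[m]) + s2 * np.eye(x.size - 1)
    k1 = se_kernel(x[m], x[[i]])[:, 0]
    mean = k1 @ np.linalg.solve(K11, y[m])
    var = 1.0 - k1 @ np.linalg.solve(K11, k1) + s2
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y[i] - mean) ** 2 / var

brute = np.array([brute_lpd(i) for i in range(x.size)])
```

The analytic version needs one factorization of $C$, versus $n$ factorizations for the brute-force loop.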

3.4 LOO with expectation propagation

In EP, the leave-one-out marginal posterior of the latent variable is computed explicitly as a part of the algorithm. The cavity distribution (11) is formed from the marginal posterior approximation by removing the site approximation (pseudo observation) using (31) and can be used to approximate the LOO posterior

 p(f_i \mid x_i, D_{-i}, \theta, \phi) \approx q_{-i}(f_i).   (35)

The approximation for the LOO predictive density

 p(y_i \mid x_i, D_{-i}, \theta, \phi) \approx \int p(y_i \mid f_i)\, q_{-i}(f_i)\, df_i   (36)

is the same as the zeroth moment of the tilted distribution. Hence we obtain approximations for the LOO posterior and the LOO predictive density as free by-products of the EP algorithm. We denote this approach EP-LOO. For certain likelihoods (36) can be computed analytically, but otherwise quadrature methods with a controllable error tolerance are usually used.

The EP algorithm uses all the observations when converging to its fixed point, and thus the cavity distribution technically depends on the $i$th observation. Opper and Winther (2000) showed using linear response theory that the cavity distribution is leave-one-out consistent up to first order. Opper and Winther (2000) also showed experimentally in one case that the cavity distribution approximation is accurate. Cseke and Heskes (2011) did not consider LOO, but compared visually the tilted distribution marginal approximation EP-L to many global marginal posterior improvements. Based on these results, EP-L has some error in the shape of the marginal approximation if there is a strong prior correlation, but even then the zeroth moment, the LOO predictive density, is accurate. Our experiments provide much more evidence of the excellent accuracy of the EP-LOO approximation.

3.5 LOO with Laplace approximation

Using linear response theory, which Opper and Winther (2000) used to prove the LOO consistency of EP, we also prove the LOO consistency of the Laplace approximation (derivation in Appendix A). Hence, we obtain a good approximation for $p(f_i \mid x_i, D_{-i}, \theta, \phi)$ also as a free by-product of the Laplace method. Linear response theory can be used to derive two alternative ways to compute the cavity distribution $q_{-i}(f_i)$.

The Laplace approximation can be written as the Gaussian prior times a product of (unnormalized) Gaussian form site approximations. Cseke and Heskes (2011) define the LA-L marginal approximation as $q(f_i)\,\epsilon_i(f_i)$, from which the cavity distribution, that is, the leave-one-out distribution, follows as $q_{-i}(f_i) \propto q(f_i)/\tilde{t}_i(f_i)$. It can be computed using (31). We refer to this approach as LA-LOO. The LOO predictive density can then be obtained by numerical integration:

 p(y_i \mid x_i, D_{-i}, \theta, \phi) \approx \int q_{-i}(f_i)\, p(y_i \mid f_i, \phi)\, df_i.   (37)

An alternative form of the Laplace LOO cavity, derived using linear response theory, is

 p(f_i \mid x_i, D_{-i}, \theta, \phi) \approx \mathrm{N}\!\left(f_i \mid \hat{f}_i - v_{-i}\hat{g}_i,\; v_{-i}\right),   (38)

where $\hat{f}_i$ is the $i$th component of the posterior mode, $\hat{g}_i = \nabla_i \log p(y_i \mid f_i, \phi)\big|_{f_i = \hat{f}_i}$ is the derivative of the log likelihood at the mode, and

 v_{-i} = \left(\frac{1}{\Sigma_{ii}} - \frac{1}{\tilde{\Sigma}_i}\right)^{-1}.   (39)

If we consider having pseudo observations with means $\tilde{\mu}_i$ and variances $\tilde{\Sigma}_i$, then these resemble the exact LOO equations for a Gaussian likelihood given in Section 3.3.
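The equivalence of the two routes can be checked with scalar toy values: dividing the marginal at the mode by the Laplace site of Eq. (16) reproduces the cavity of Eqs. (38)-(39). All numbers below are illustrative.

```python
import numpy as np

# LA-LOO cavity two ways. Route 1 divides the marginal N(fhat_i, S_ii) by the
# Laplace site with variance St_i and mean fhat_i + St_i * ghat_i (Eq. 16);
# Route 2 applies Eqs. (38)-(39) directly. Numbers are illustrative; note
# St_i > S_ii is needed for a positive cavity variance.
fhat_i, S_ii, St_i, ghat_i = 0.8, 0.3, 0.9, -0.4

# Route 1: Gaussian division, i.e. subtraction in natural parameters (Eq. 31).
v1 = 1.0 / (1.0 / S_ii - 1.0 / St_i)
mu_site = fhat_i + St_i * ghat_i
m1 = v1 * (fhat_i / S_ii - mu_site / St_i)

# Route 2: cavity mean and variance from linear response theory.
v2 = 1.0 / (1.0 / S_ii - 1.0 / St_i)
m2 = fhat_i - v2 * ghat_i
```

Algebraically, substituting the site mean into Route 1 cancels the mode terms and leaves exactly $\hat{f}_i - v_{-i}\hat{g}_i$, which is what the assertion below confirms numerically.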

3.6 Importance sampling and weighting

A generic approach not restricted to GLVMs is based on obtaining Monte Carlo samples $(f^s, \phi^s, \theta^s)$ from the full posterior and approximating (24) as

 p(y_i \mid x_i, D_{-i}) \approx 1 \Big/ \frac{1}{S}\sum_{s=1}^{S} \frac{1}{p(y_i \mid f_i^s, \phi^s)},   (40)

where $x_i$ drops out of the likelihood term since $y_i$ is independent of $x_i$ given $f_i$ and $\phi$. This approach was first proposed by Gelfand et al. (1992) (see also Gelfand, 1996), and it corresponds to importance sampling (IS) where the full posterior is used as the proposal distribution. We refer to this approach as IS-LOO.

A more general importance sampling form is

 p(\tilde{y}_i \mid x_i, D_{-i}) \approx \frac{\sum_{s=1}^{S} p(\tilde{y}_i \mid f_i^s, \phi^s)\, w_i^s}{\sum_{s=1}^{S} w_i^s},   (41)

where the $w_i^s$ are importance weights with

 w_i^s = \frac{p(f_i^s \mid x_i, D_{-i})}{p(f_i^s \mid D)} \propto \frac{1}{p(y_i \mid f_i^s, \phi^s)}.   (42)

This form shows the importance weights explicitly and allows the computation of other leave-one-out quantities, such as the LOO predictive distribution. If the predictive density is evaluated at the observed value $y_i$, Equation (41) reduces to (40).
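The harmonic-mean form (40) is easy to exercise on a conjugate toy model where the exact LOO predictive is known, so the Monte Carlo estimate can be compared against it. The model, seed, and sample size are illustrative.

```python
import numpy as np

# IS-LOO (Eq. 40): posterior draws of f as the proposal, weights 1/p(y_i|f).
# Toy conjugate model f ~ N(0, t2), y_j | f ~ N(f, s2), so the exact LOO
# predictive is available in closed form. All values are illustrative.
def norm_pdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

rng = np.random.default_rng(4)
t2, s2 = 2.0, 0.5
y1, y2 = 1.3, 0.7

# Draws from the full posterior p(f | y1, y2).
post_var = 1.0 / (1.0 / t2 + 2.0 / s2)
post_mean = post_var * (y1 + y2) / s2
fs = rng.normal(post_mean, np.sqrt(post_var), size=200_000)

# Harmonic-mean estimate (Eq. 40).
is_loo = 1.0 / np.mean(1.0 / norm_pdf(y1, fs, s2))

# Exact LOO predictive p(y1 | y2) for comparison.
v2 = 1.0 / (1.0 / t2 + 1.0 / s2)
m2 = v2 * y2 / s2
exact = norm_pdf(y1, m2, v2 + s2)
```

In this example the weight variance is finite, so the estimate converges; the next paragraphs discuss what happens when it is not.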

The approximation (40) has the form of a harmonic mean, which is notoriously unstable (see, e.g., Newton and Raftery, 1994). However, the leave-one-out version is not as unstable as the harmonic mean estimator of the marginal likelihood, which takes the harmonic mean of the likelihood values and corresponds to using the joint posterior as the importance sampling proposal distribution for the joint prior.

For the Gaussian observation model, Vehtari (2001) and Vehtari and Lampinen (2002) used exact computation for $p(y_i \mid x_i, D_{-i}, \theta, \phi)$ and importance sampling only for the parameters. The integrated importance weights are then

 w_i^s \propto \frac{1}{p(y_i \mid x_i, D_{-i}, \theta^s, \phi^s)},   (43)

and the LOO predictive density is

 p(y_i \mid x_i, D_{-i}) \approx 1 \Big/ \frac{1}{S}\sum_{s=1}^{S} \frac{1}{p(y_i \mid x_i, D_{-i}, \theta^s, \phi^s)}.   (44)

The same marginalization approach can be used in the case of non-Gaussian observation models. Held et al. (2010) used the Laplace approximation, marginal improvements, and numerical integration to obtain an approximation for $p(y_i \mid x_i, D_{-i}, \theta, \phi)$ (see more in Section 3.7). Vanhatalo et al. (2013) use EP and the Laplace method for the marginalisation in the GPstuff toolbox. Li et al. (2014) considered generic latent variable models using Monte Carlo inference, and proposed to marginalise over the latent variables by obtaining additional Monte Carlo samples from the conditional posterior of $f_i$. Li et al. (2014) also proposed the name integrated IS and provided useful results illustrating the benefits of the marginalization. As we focus on EP and Laplace approximations for the latent inference, in our experiments we use IS only for the hyperparameters.

The variance of the estimate (40) depends on the variance of the importance weights. The full posterior marginal $p(f_i \mid D)$ is likely to be narrower and have thinner tails than the leave-one-out distribution $p(f_i \mid x_i, D_{-i})$. This may cause the importance weights to have high or infinite variance (Peruggia, 1997; Epifani et al., 2008), as rare samples from the low density region in the tails of the full posterior may have very large weights.

If the variance of the importance weights is finite, the central limit theorem holds (Epifani et al., 2008), and the corresponding estimates converge quickly. If the variance of the raw importance ratios is infinite but the mean exists, the generalized central limit theorem for stable distributions holds, and the convergence of the estimate is slower (Vehtari and Gelman, 2015).

Vehtari et al. (2016) propose to use Pareto smoothed importance sampling (PSIS) by Vehtari and Gelman (2015) for diagnostics and to regularize the importance weights in IS-LOO. Pareto smoothed importance sampling uses the empirical Bayes estimate of Zhang and Stephens (2009) to fit a generalized Pareto distribution to the tail of the weight distribution. By examining the estimated shape parameter $\hat{k}$ of the Pareto distribution, we are able to obtain sample-based estimates of the existence of the moments (Koopman et al., 2009). When the tail of the weight distribution is long, direct use of importance sampling is sensitive to one or a few of the largest values. To stabilize the estimates, Vehtari et al. (2016) propose replacing the largest weights with the expected values of the order statistics of the fitted generalized Pareto distribution. Vehtari et al. (2016) also apply optional truncation of very large weights, following the truncated importance sampling of Ionides (2008), to guarantee a finite and reduced variance of the estimate in all cases. Even if the raw importance weights do not have finite variance, the PSIS-LOO estimate still has finite variance, although at the cost of additional bias. Vehtari et al. (2016) demonstrate that this bias is likely to be small when the estimated Pareto shape parameter $\hat{k} < 0.7$.
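The smoothing step can be sketched as follows. This is an illustrative simplification only: we use SciPy's maximum-likelihood `genpareto.fit` in place of the Zhang and Stephens (2009) empirical Bayes estimate that PSIS actually specifies, and a fixed tail fraction in place of the PSIS tail-size rule.

```python
import numpy as np
from scipy import stats

def psis_smooth(log_w, tail_frac=0.2):
    """Sketch of Pareto smoothing of importance weights.

    Fits a generalized Pareto distribution to the largest weights and
    replaces them by expected order statistics of the fit (approximated
    here by inverse-CDF values at centered quantiles), with optional
    truncation at the largest raw weight.
    Returns (smoothed weights, estimated shape k_hat).
    """
    w = np.exp(log_w - np.max(log_w))        # stabilized raw weights
    order = np.argsort(w)
    m = max(5, int(tail_frac * w.size))      # crude tail sample size
    mu = w[order[-m]]                        # tail threshold (location)
    k, _, sigma = stats.genpareto.fit(w[order[-m:]] - mu, floc=0.0)
    q = (np.arange(1, m + 1) - 0.5) / m      # centered quantiles
    smoothed = mu + stats.genpareto.ppf(q, k, loc=0.0, scale=sigma)
    w_new = w.copy()
    w_new[order[-m:]] = np.minimum(smoothed, w.max())  # truncate
    return w_new, k
```

A large $\hat{k}$ returned by such a fit signals that the raw weights have a heavy tail and the plain IS estimate is unreliable.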

If the variance of the weights is finite, the effective sample size can be estimated as

 S_{\mathrm{eff}} = \frac{1}{\sum_{s=1}^{S} \tilde{w}_s^2}, (45)

where $\tilde{w}_s$ are the normalized weights (summing to one) (Kong et al., 1994). This estimate is noisy if the variance of the weights is large, but with smaller variances it provides an easily interpretable estimate of the efficiency of the importance sampling.
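Equation (45) is a one-liner; the sketch below takes unnormalized weights as input and normalizes them internally:

```python
import numpy as np

def effective_sample_size(w):
    """Eq. (45): S_eff = 1 / sum of squared normalized weights
    (Kong et al., 1994)."""
    w = np.asarray(w, dtype=float)
    w_tilde = w / w.sum()                 # normalize to sum to one
    return 1.0 / np.sum(w_tilde ** 2)
```

Equal weights give $S_{\mathrm{eff}} = S$, while a single dominant weight drives the estimate towards one.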

Importance weighting can also be used with deterministic evaluation points obtained, for example, from a grid or CCD, by re-weighting the weights in (21); see Held et al. (2010) and Vanhatalo et al. (2013). As the deterministic points are usually used in the low-dimensional case and the evaluation points are not far in the tails, the variance of the observed weights is usually smaller than with Monte Carlo. If the full posterior is a poor fit to a leave-one-out posterior, however, the problem remains that the tails are not well approximated, and LOO is biased towards the hierarchical approximation (28) that uses the full posterior of the parameters.

In the ideal case, all CCD evaluation points except the modal point would have equal weights. The CCD approach adjusts these weights based on the actual density, and importance weighting will further adjust them, making it possible that a small number of evaluation points have large weights. Although the CCD evaluation points are chosen deterministically, we can diagnose the reliability of CCD by investigating the distribution of the weights. If there is a small number of CCD points, we examine the effective sample size, and in cases where the number of points exceeds 280 (which happens when there are more than 11 parameters), we also estimate the Pareto shape parameter $\hat{k}$.

Held et al. (2010) proposed using numerical integration to approximate

 p(y_i \mid x_i, D_{-i}, \theta, \phi) \approx \frac{1}{\int \frac{q(f_i \mid D, \theta, \phi)}{p(y_i \mid f_i, \phi)} \, df_i}. (46)

We call this quadrature LOO (Q-LOO), as one-dimensional numerical integration methods are usually called quadrature. Given exact and accurate quadrature, this would provide an accurate result (e.g., if the true posterior is Gaussian, quadrature should give a result similar to the analytic solution apart from numerical inaccuracies). However, some error is introduced when the latent posterior is approximated with $q(f_i \mid D, \theta, \phi)$. The numerical integration of the ratio expression may also be numerically unstable if the tail of the likelihood term $p(y_i \mid f_i, \phi)$ decays faster than the tail of the approximation $q(f_i \mid D, \theta, \phi)$. For example, the probit likelihood, whose tail decays like a Gaussian with unit variance, leads to numerical instability if $q(f_i \mid D, \theta, \phi)$ is Gaussian with a variance below one.
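The tail condition can be seen concretely in a toy conjugate case. The sketch below (our own construction, not from any toolbox) evaluates (46) by brute-force quadrature for a Gaussian likelihood $N(y \mid f, s^2)$ and a Gaussian marginal approximation $N(f \mid m, v)$, and compares it with the closed-form computation through the cavity distribution; the integrand $q/p$ is integrable only when $v < s^2$, mirroring the instability discussed above.

```python
import numpy as np

def quadrature_loo_gaussian(y, m, v, s2, half_width=15.0, n=20001):
    """Q-LOO via Eq. (46) on a grid; requires v < s2 for stability."""
    f = np.linspace(m - half_width, m + half_width, n)
    dx = f[1] - f[0]
    log_q = -0.5 * np.log(2 * np.pi * v) - (f - m) ** 2 / (2 * v)
    log_lik = -0.5 * np.log(2 * np.pi * s2) - (y - f) ** 2 / (2 * s2)
    # integrand q(f) / p(y | f), integrated by a simple Riemann sum
    return 1.0 / (np.exp(log_q - log_lik).sum() * dx)

def cavity_loo_gaussian(y, m, v, s2):
    """Closed-form check: divide out the Gaussian site to get the cavity
    N(f | m_cav, v_cav), then p(y | D_-i) = N(y | m_cav, v_cav + s2)."""
    tau = 1.0 / v - 1.0 / s2       # cavity precision (needs v < s2)
    nu = m / v - y / s2
    m_cav, v_cav = nu / tau, 1.0 / tau
    var = v_cav + s2
    return np.exp(-0.5 * (y - m_cav) ** 2 / var) / np.sqrt(2 * np.pi * var)
```

In this conjugate case the two computations agree to numerical precision; when $v \geq s^2$ the quadrature version diverges as the grid is widened.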

Held et al. (2010) tested the Gaussian marginal approximation (LA-G) and two non-Gaussian improved marginal approximations (LA-CM and simplified LA-CM; see Section 2.5). All had problems with the tails, although less so the more accurate approximations. Held et al. proposed rerunning the failed LOO cases with actual removal of the data. As Held et al. had 13 to 56 failures in their experiments, the proposed approach would make LOO relatively expensive. In our experiments with the Gaussian marginal approximations LA-G/EP-G, we also had several severe failures with some data sets. However, with the non-Gaussian approximations LA-CM2/EP-FACT, we did not observe severe failures (see Section 4).

If we use the marginal approximations EP-L or LA-L based on the tilted distribution (see Table 1), the tail problem vanishes. Inserting the normalized tilted distribution into (46), the equation reduces to

 p(y_i \mid x_i, D_{-i}, \theta, \phi) \approx \int q_{-i}(f_i) \, p(y_i \mid f_i, \phi) \, df_i, (47)

which is the EP-LOO or LA-LOO predictive density estimate depending on which approximation is used.

We also present an alternative form of (46), which gives additional insight into the numerical stability when global marginal improvements are used. As discussed in Section 2.5, we can write the marginal approximation with a global improvement as

 q(f_i \mid D, \theta, \phi) \approx \frac{Z_q}{Z_p} \, q(f_i) \, \tilde{t}_i(f_i)^{-1} \, p(y_i \mid f_i, \phi) \, c_i(f_i), (48)

where $c_i(f_i)$ is a global correction term (see Eq. (18)). Replacing $q(f_i) \, \tilde{t}_i(f_i)^{-1}$ with the cavity distribution $q_{-i}(f_i)$ from EP-L or LA-L gives

 q(f_i \mid D, \theta, \phi) \approx \frac{Z_q}{Z_p} \, q_{-i}(f_i) \, p(y_i \mid f_i, \phi) \, c_i(f_i), (49)

which we can insert into (46) to obtain

 p(y_i \mid x_i, D_{-i}, \theta, \phi) \approx \frac{\int p(y_i \mid f_i, \phi) \, q_{-i}(f_i) \, c_i(f_i) \, df_i}{\int q_{-i}(f_i) \, c_i(f_i) \, df_i}. (50)

Here $q_{-i}(f_i) \, c_i(f_i)$ acts as a globally corrected leave-one-out posterior, and we can see that the stability will depend on $c_i(f_i)$. The correction term may have increasing tails, which is usually not a problem in the numerator, where the likelihood term damps them, but may be a problem in the denominator. In addition, evaluating $c_i(f_i)$ at a small number of points and using interpolation for the quadrature (as proposed by Rue et al., 2009) is sometimes unstable, which may increase the instability of (50). Depending on the details of the computation, (46) and (50) can produce the same result up to numerical accuracy, if the relevant terms cancel out numerically in Equation (46). This happens in our implementation with global marginal posterior improvements, and thus in Section 4 we do not report the results separately for (46) and (50).

Held et al. (2010) and Vanhatalo et al. (2013) use quadrature LOO in a hierarchical approximation, where the parameter level is handled using importance weighting (Section 3.6). Our experiments also use this approach. Alternatively, we could integrate over the parameters in the marginal and the likelihood separately and approximate LOO as

 p(y_i \mid x_i, D_{-i}) \approx \frac{1}{\int \frac{q(f_i \mid D)}{p(y_i \mid f_i, D)} \, df_i}. (51)

If the integration over $\theta$ and $\phi$ is made using Monte Carlo or deterministic sampling (e.g., CCD), then this is equivalent to using quadrature for the conditional terms and importance weighting for the parameter samples.

As the quadrature approach may also be applied beyond simple GLVMs, we propose an approach for stabilizing the general form. Inspired by the truncated importance sampling of Ionides (2008), we propose a modification of the quadrature approach that makes it more robust to approximation errors in the tails:

 p(y_i \mid x_i, D_{-i}, \theta, \phi) \approx \frac{\int p(y_i \mid f_i, \phi) \, p(f_i \mid D, \theta, \phi) \, \tilde{w}(f_i) \, df_i}{\int p(f_i \mid D, \theta, \phi) \, \tilde{w}(f_i) \, df_i}, (52)

where

 \tilde{w}(f_i) = \frac{1}{\max\left( p(y_i \mid f_i, \phi), \, c \right)}, (53)

where $c$ is a small positive constant. When $c = 0$, we recover the original equation (46). When $c$ is larger than the maximum value of $p(y_i \mid f_i, \phi)$, we obtain the full posterior predictive density $p(y_i \mid x_i, D, \theta, \phi)$.

With larger values of $c$ we avoid the possibility that the ratio explodes. In easy cases, where the integrand is already close to zero before the truncation level $c$ takes effect, the bias is negligible. In difficult cases, the estimate is biased towards the full posterior predictive density.
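The truncated weighting of (52)-(53) can be sketched on a grid as follows. This is our own illustration: a simple Riemann sum stands in for a proper quadrature rule, and the likelihood and latent posterior are assumed to be pre-evaluated on the grid.

```python
import numpy as np

def truncated_quadrature_loo(lik, post, f_grid, c):
    """Stabilized Q-LOO of Eqs. (52)-(53).

    lik:  p(y_i | f, phi) evaluated on f_grid
    post: p(f | D, theta, phi) evaluated on f_grid
    c:    truncation constant; c = 0 recovers Eq. (46), while a c above
          max(lik) gives the full posterior predictive density.
    """
    dx = f_grid[1] - f_grid[0]
    w = 1.0 / np.maximum(lik, c)          # truncated weights, Eq. (53)
    num = np.sum(lik * post * w) * dx     # numerator of Eq. (52)
    den = np.sum(post * w) * dx           # denominator of Eq. (52)
    return num / den
```

As $c$ grows from zero, the estimate moves smoothly from the LOO predictive density towards the full posterior predictive density, trading variance for bias.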

In truncated importance sampling, the truncation level is based on the average raw weight and the number of samples (see details in Ionides, 2008). Following this idea, we choose

 c^{-1} = c_0^{-1} \int_a^b \frac{p(f_i \mid D, \theta, \phi)}{p(y_i \mid f_i, \phi)}