1 Introduction
Bayesian cross-validation can be used to assess predictive performance. Vehtari and Ojanen (2012) provide an extensive review of theory and methods in Bayesian predictive performance assessment, including the decision-theoretic assumptions made in Bayesian cross-validation. Gelman et al. (2014) provide further details on theoretical properties of leave-one-out cross-validation and information criteria, and Vehtari et al. (2016) provide practical fast computation in the case of Monte Carlo posterior inference. In this article, we present the properties of several Bayesian leave-one-out (LOO) cross-validation approximations for Gaussian latent variable models (GLVMs) with factorizing likelihoods. Integration over the latent variables is performed with either the Laplace method or expectation propagation (EP). We show that for these methods leave-one-out cross-validation can be computed accurately with zero or negligible additional cost after forming the full-data posterior approximation.
Global (Gaussian) and factorizing variational approximations for latent variable inference are not considered in this paper. They have the same order of computational complexity as Laplace and EP but with a larger prefactor on the dominating $O(n^3)$ term, where $n$ is the number of observations (Nickisch and Rasmussen, 2008). EP may be expected to be the most accurate method (e.g. Nickisch and Rasmussen, 2008; Vanhatalo and Vehtari, 2010; Jylänki et al., 2011; Riihimäki et al., 2013) and Laplace to have the smallest computational overhead, so EP and Laplace may be considered the methods of choice for accuracy and speed, respectively. We expect that our overall results and conclusions for Laplace and EP carry over to the Gaussian variational approximation. For non-GLVM models such as generalized linear and deep generative models, the (factorized) Gaussian variational approximations scale to large datasets (Challis and Barber, 2013; Ranganath et al., 2014; Kingma and Welling, 2014; Rezende et al., 2014). It is of interest to derive approximate LOO estimators for these models, but that is outside the scope of this paper.
We consider a prediction problem with an explanatory variable $x$ and an outcome variable $y$. The same notation is used interchangeably for scalar and vector-valued quantities. The observed data are denoted by $D = \{(x_i, y_i)\}_{i=1}^n$ and future observations by $(x^*, y^*)$. We focus on GLVMs, where the observation model $p(y_i \mid f_i, \phi)$ depends on a local latent value $f_i$ and possibly on some global parameters $\phi$, such as the scale of the measurement error process. The latent values $f = (f_1, \ldots, f_n)$ have a joint Gaussian prior which depends on the covariates $X = (x_1, \ldots, x_n)$ and hyperparameters $\theta$ (e.g., covariance function parameters for a Gaussian process). The posterior of the latent vector is then

  p(f \mid D, \theta, \phi) \propto p(y \mid f, \phi)\, p(f \mid X, \theta).   (1)
As a specific example we use Gaussian process (GP) models (reviewed, e.g., by Rasmussen and Williams, 2006), but the methods are applicable also to other GLVMs which have the same factorizing form (e.g. Gaussian Markov random field models used in the R-INLA software (Lindgren and Rue, 2015)). Some of the presented methods are applicable more generally, requiring only a factorizing likelihood with terms $p(y_i \mid f_i, \phi)$ and a method to integrate over the marginal posteriors $p(f_i \mid D, \theta, \phi)$. The results presented in this paper can be generalized to the cases where a likelihood term depends upon more than one latent variable (e.g. Tolvanen et al., 2014) or the latent value prior is non-Gaussian (e.g. Seeger, 2008; Hernández-Lobato et al., 2008; Hernández-Lobato et al., 2010). For clarity we restrict our treatment to the case of one likelihood term with one latent value.
We are interested in assessing the predictive performance of our models to report this to application experts or to perform model selection. For simplicity, in this paper we use only the logarithmic score, but the methods can also be used with application-specific utilities such as classification error. The logarithmic score is the standard scoring rule in Bayesian cross-validation (see Geisser and Eddy, 1979) and it has desirable properties for scientific inference (Bernardo and Smith, 1994; Gneiting and Raftery, 2007).
The predictive distribution for a future observation $y^*$ given future covariate values $x^*$ is

  p(y^* \mid x^*, D) = \iint p(y^* \mid f^*, \phi)\, p(f^* \mid x^*, D, \theta, \phi)\, p(\theta, \phi \mid D)\, df^*\, d\theta\, d\phi.   (2)

The expected predictive performance using the log score and the unknown true distribution $p_t(x^*, y^*)$ of the future observation is

  \iint p_t(x^*, y^*) \log p(y^* \mid x^*, D)\, dx^*\, dy^*.   (3)
This expectation can be approximated by reusing the observations and computing the leave-one-out Bayesian cross-validation estimate

  \mathrm{LOO} = \frac{1}{n} \sum_{i=1}^n \log p(y_i \mid x_i, y_{-i}),   (4)

where $y_{-i}$ denotes all the other observations except $y_i$. Here we consider only cases with random $x_i$ from the same distribution as $x^*$. See Vehtari and Ojanen (2012) for discussion of fixed, shifted, deterministic, or constrained covariates.
In addition to estimating the expected log predictive density, it may be interesting to look at the individual terms $p(y_i \mid x_i, y_{-i})$. These terms, also called conditional predictive ordinates ($\mathrm{CPO}_i$), may reveal observations which are highly influential or not well explained by the model (see, e.g., Gelfand, 1996). The probability integral transform (PIT) values $\mathrm{PIT}_i = F(y_i \mid x_i, y_{-i})$, where $F$ is the predictive CDF, can be used to assess the calibration of the predictions (see, e.g., Gneiting et al., 2007).

The straightforward brute-force implementation of leave-one-out cross-validation requires recomputing the posterior distribution $n$ times. Often leave-one-out cross-validation is replaced with $k$-fold cross-validation, which requires only $k$ recomputations of the posterior, with $k$ usually 10 or less. Although $k$-fold-CV is robust and would often be computationally feasible, there are several fast approximations for computing LOO with a negligible additional computational cost after forming the posterior with the full data.
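As a concrete illustration of these diagnostics, the following sketch computes log-CPO and PIT values under the simplifying assumption that each LOO predictive distribution is Gaussian; the function name and interface are ours, not from the paper:

```python
import math

def gaussian_loo_diagnostics(y, loo_mean, loo_sd):
    """Per-observation log CPO and PIT values, assuming each LOO
    predictive distribution is Gaussian N(loo_mean[i], loo_sd[i]^2)."""
    log_cpo, pit = [], []
    for yi, m, s in zip(y, loo_mean, loo_sd):
        z = (yi - m) / s
        # log predictive density (CPO) of the held-out observation
        log_cpo.append(-0.5 * math.log(2 * math.pi * s**2) - 0.5 * z**2)
        # probability integral transform: predictive CDF at y_i
        pit.append(0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
    return log_cpo, pit
```

A well-calibrated model should give PIT values that are roughly uniform on (0, 1); unusually small CPO values flag poorly explained observations.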
Several studies have shown that the Laplace method and EP perform well (compared to the gold standard Markov chain Monte Carlo inference) for GLVMs with many log-concave likelihoods (e.g. Rue et al., 2009; Vanhatalo et al., 2010; Martins et al., 2013; Riihimäki and Vehtari, 2014). EP has also been shown to be close to Markov chain Monte Carlo inference for classification models (log-concave likelihood, but potentially highly skewed posterior) and non-log-concave likelihoods (e.g. Nickisch and Rasmussen, 2008; Vanhatalo and Vehtari, 2010; Jylänki et al., 2011; Cseke and Heskes, 2011; Riihimäki et al., 2013; Vanhatalo et al., 2013; Tolvanen et al., 2014). In this paper we also consider the accuracy of approximate LOO with standard Markov chain Monte Carlo inference for LOO as our benchmark.

In practical data analysis work, it is useful to start with fast approximations and check step by step whether a computationally more expensive approach can improve the predictive accuracy. We propose the following three-step approach:

Find the MAP estimate $(\hat{\theta}, \hat{\phi})$, using the Laplace method to approximately integrate over the latent values $f$.

Using $(\hat{\theta}, \hat{\phi})$ obtained in the previous step, use EP to integrate over the latent values and check whether the predictive performance improves substantially compared to using the Laplace method (we may also re-estimate $\theta$ and $\phi$).

Integrate over $\theta$ and $\phi$ and check whether integration over the parameters improves predictive performance.
Details of the computations involved are given in Sections 2 and 3. Based on these steps we can continue with the model that has the best predictive performance or the one that makes predictions fastest, or both. Often we also need to re-estimate models when data are updated or additional covariates become available, and then again a fast and accurate posterior approximation is useful. To follow the above approach, we need accurate predictive performance estimates for the Laplace method and EP.
The main contributions of this paper are:

A unified presentation and thorough empirical comparison of methods for approximate LOO for Gaussian latent variable models, with both log-concave and non-log-concave likelihoods, and with MAP and hierarchical approaches for handling hyperparameter inference (Section 3).

The main conclusion from the empirical investigation (Section 4) is the observed superior accuracy/complexity trade-off of LOO estimators based on the Gaussian latent cavity distribution. Although there are more accurate non-Gaussian approximations of the marginal posteriors, their use does not translate into substantial improvements in LOO cross-validation accuracy and also introduces considerable instability. Using the widely applicable information criterion (WAIC) in the computation does not provide any benefits.

Truncated-weights quadrature integration (Section 3.7), inspired by truncated importance sampling, is a novel way to stabilize the quadrature used in some LOO computations.
2 Gaussian latent variable models
In this section, we briefly review the notation and methods for Gaussian latent variable models used in the rest of the article. We focus on Gaussian processes (see, e.g., Rasmussen and Williams, 2006), but most of the discussion also holds for other factorizing GLVMs. We consider models with a Gaussian prior on the latent values and a factorizing likelihood,

  p(f \mid D, \theta, \phi) = \frac{1}{Z}\, p(f \mid X, \theta) \prod_{i=1}^n p(y_i \mid f_i, \phi),   (5)

where $Z$ is a normalization factor equal to the marginal likelihood $p(y \mid X, \theta, \phi)$. For example, in the Gaussian process framework the multivariate Gaussian prior on the latent values is $p(f \mid X, \theta) = \mathrm{N}(f \mid \mu, K)$, where $\mu$ is the prior mean and $K$ is a covariance matrix constructed by a covariance function $k(x, x' \mid \theta)$, which characterizes the correlation between two points. In this paper, we assume that the prior mean is zero, but the results generalize to non-zero prior means as well.
2.1 Gaussian observation model
With a Gaussian observation model,

  p(y_i \mid f_i, \sigma^2) = \mathrm{N}(y_i \mid f_i, \sigma^2),   (6)

where $\sigma^2$ is the noise variance, the conditional posterior of the latent variables is a multivariate Gaussian,

  p(f \mid D, \theta, \sigma^2) = \mathrm{N}\left(f \mid K(K + \sigma^2 I)^{-1} y,\; K - K(K + \sigma^2 I)^{-1} K\right).   (7)

The marginal posterior is simply $p(f_i \mid D, \theta, \sigma^2) = \mathrm{N}(f_i \mid \mu_i, \Sigma_{ii})$, where $\mu$ and $\Sigma$ are the mean and covariance in (7), and the marginal likelihood can be computed analytically using properties of the multivariate Gaussian (see, e.g., Rasmussen and Williams, 2006).
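Equation (7) can be implemented directly; the sketch below (our own helper, using a Cholesky factorization in the standard way, cf. Rasmussen and Williams, 2006) returns the posterior mean, posterior covariance, and log marginal likelihood:

```python
import numpy as np

def gp_regression_posterior(K, y, noise_var):
    """Exact latent posterior and log marginal likelihood for a GP with
    a Gaussian observation model and zero prior mean, following Eq. (7)."""
    n = len(y)
    Ky = K + noise_var * np.eye(n)              # K + sigma^2 I
    L = np.linalg.cholesky(Ky)                  # stable factorization
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    post_mean = K @ alpha                       # K (K + sigma^2 I)^{-1} y
    V = np.linalg.solve(L, K)
    post_cov = K - V.T @ V                      # K - K (K + sigma^2 I)^{-1} K
    # log N(y | 0, K + sigma^2 I) via the Cholesky factor
    log_ml = (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
              - 0.5 * n * np.log(2 * np.pi))
    return post_mean, post_cov, log_ml
```

The Cholesky-based solves avoid forming $(K + \sigma^2 I)^{-1}$ explicitly, which is both faster and numerically safer than a direct inverse.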
2.2 Non-Gaussian observation model
In the case of a non-Gaussian likelihood, the conditional posterior needs to be approximated. In this paper, we focus on expectation propagation (EP) and the Laplace method (LA), which form a multivariate Gaussian approximation of the joint latent posterior,

  q(f \mid D, \theta, \phi) = \frac{1}{Z_q}\, p(f \mid X, \theta) \prod_{i=1}^n \tilde{t}_i(f_i),   (8)

where the $\tilde{t}_i(f_i)$ are (unnormalized) Gaussian approximations of the likelihood contributions. We use $q$ to denote approximate joint and marginal distributions in general; the specific approximation used in each case can be inferred from the context.
2.3 Expectation propagation
Expectation propagation (Opper and Winther, 2000; Minka, 2001) approximates the independent non-Gaussian likelihood terms by unnormalized Gaussian-form site approximations (aka pseudo-observations),

  p(y_i \mid f_i) \approx \tilde{t}_i(f_i) = \tilde{Z}_i\, \mathrm{N}(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2),   (9)

where $\tilde{\mu}_i$ and $\tilde{\sigma}_i^2$ are the parameters of the site approximations, or site parameters. The joint latent posterior approximation is then

  q(f \mid D) = \frac{1}{Z_{EP}}\, p(f \mid X, \theta) \prod_{i=1}^n \tilde{t}_i(f_i) = \mathrm{N}(f \mid \mu, \Sigma),   (10)

where the normalization constant $Z_{EP}$ is the EP approximation to the marginal likelihood and $\mathrm{N}(f \mid \mu, \Sigma)$ is a multivariate Gaussian posterior approximation.
EP updates the site approximations by iteratively improving the accuracy of the marginals. To update the $i$th site approximation, it is first removed from the marginal approximation to form a cavity distribution,

  q_{-i}(f_i) \propto \frac{q(f_i \mid D)}{\tilde{t}_i(f_i)},   (11)

where the marginal $q(f_i \mid D) = \mathrm{N}(f_i \mid \mu_i, \Sigma_{ii})$ is obtained analytically using properties of the multivariate Gaussian.
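Since both the marginal and the site term are Gaussian, the division in (11) amounts to subtracting natural parameters (precision and precision-times-mean). A minimal sketch of this step, with names of our own choosing:

```python
def cavity_distribution(marg_mean, marg_var, site_mean, site_var):
    """Cavity q_{-i}(f_i): divide the Gaussian marginal by the Gaussian
    site term by subtracting natural parameters."""
    prec = 1.0 / marg_var - 1.0 / site_var
    if prec <= 0:
        raise ValueError("cavity variance is not positive")
    cav_var = 1.0 / prec
    cav_mean = cav_var * (marg_mean / marg_var - site_mean / site_var)
    return cav_mean, cav_var
```

Multiplying the returned cavity back with the site term recovers the original marginal, which is a convenient sanity check in an EP implementation.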
The cavity distribution is combined with the original likelihood term to form a more accurate marginal distribution called the tilted distribution,

  \hat{p}(f_i) = \frac{1}{\hat{Z}_i}\, q_{-i}(f_i)\, p(y_i \mid f_i).   (12)

Minimization of the Kullback-Leibler divergence from the tilted distribution to the marginal approximation corresponds to matching the moments of the distributions. Hence, for a Gaussian approximation, the zeroth, first and second moments of the tilted distribution are computed, for example, using one-dimensional numerical integration. The site parameters are updated so that the moments of the marginal approximation $q(f_i \mid D)$ match the moments of the tilted distribution $\hat{p}(f_i)$. The new $q(f \mid D)$ can be computed after a single site approximation has been updated (sequential EP) or after all the site approximations have been updated (parallel EP).

2.4 Laplace approximation
The Laplace approximation is constructed from the second-order Taylor expansion of $\log p(f \mid D, \theta, \phi)$ around the mode $\hat{f}$, which gives a Gaussian approximation to the conditional posterior,

  q(f \mid D, \theta, \phi) = \mathrm{N}(f \mid \hat{f}, \Sigma),   (13)

where $\Sigma = (K^{-1} + W)^{-1}$ is the inverse of the Hessian at the mode, with $W$ a diagonal matrix with elements (e.g., Rasmussen and Williams, 2006; Gelman et al., 2013)

  W_{ii} = -\nabla^2_{f_i} \log p(y_i \mid f_i) \big|_{f_i = \hat{f}_i}.   (14)

From this joint Gaussian approximation we can analytically compute an approximation of the marginal posterior $q(f_i \mid D, \theta, \phi)$ and the marginal likelihood. The Laplace approximation can also be written as

  q(f \mid D, \theta, \phi) \propto p(f \mid X, \theta) \prod_{i=1}^n \tilde{t}_i(f_i),   (15)

where the $\tilde{t}_i(f_i) \propto \mathrm{N}(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)$ are Gaussian terms with

  \tilde{\sigma}_i^2 = W_{ii}^{-1}, \qquad \tilde{\mu}_i = \hat{f}_i + \tilde{\sigma}_i^2\, \nabla_{f_i} \log p(y_i \mid f_i) \big|_{f_i = \hat{f}_i}.   (16)
2.5 Marginal posterior approximations
Many leave-one-out approximation methods require explicit computation of the full posterior marginal approximations. We thus review alternative Gaussian and non-Gaussian approximations of the marginal posteriors, following the article by Cseke and Heskes (2011). The exact joint posterior can be written as (dropping $X$, $\theta$ and $\phi$ for brevity)

  p(f \mid y) \propto q(f \mid y) \prod_{i=1}^n \varepsilon_i(f_i),   (17)

where $\varepsilon_i(f_i) = p(y_i \mid f_i) / \tilde{t}_i(f_i)$ is the ratio of the exact likelihood term and the site term approximating it. By integrating over the other latent variables, the marginal posterior can be written as

  p(f_i \mid y) \propto q(f_i \mid y)\, \varepsilon_i(f_i) \int q(f_{-i} \mid f_i, y) \prod_{j \neq i} \varepsilon_j(f_j)\, df_{-i},   (18)

where $f_{-i}$ represents all the other latent variables except $f_i$. Local methods use only $\varepsilon_i(f_i)$, which depends locally on $f_i$. Global methods additionally use an approximation of the integral term, which depends globally on all the latent variables. Next we briefly review different marginal posterior approximations of this exact marginal (see Table 1 for a summary).
Method | Improvement | Explanation
LA-G | — | Gaussian marginal $q(f_i \mid y)$ from the joint distribution $q(f \mid y)$
LA-L | local | tilted distribution $q(f_i \mid y)\, \varepsilon_i(f_i)$
LA-TK | global | Eq. (18), where the integral term is approximated using the Laplace approximation
LA-CM/CM2/FACT | global | Eq. (18), where the integral term is approximated using the Laplace approximation with simplifications
EP-G | — | Gaussian marginal $q(f_i \mid y)$ from the joint distribution $q(f \mid y)$
EP-L | local | tilted distribution $\hat{p}(f_i) \propto q_{-i}(f_i)\, p(y_i \mid f_i)$, where $q_{-i}(f_i)$ is obtained as a part of the EP method
EP-FULL | global | Eq. (18), where the integral term is approximated using EP
EP-1STEP/FACT | global | Eq. (18), where the integral term is approximated using EP with simplifications
Gaussian approximations.
The simplest approximation is to use the Gaussian marginals $q(f_i \mid y) = \mathrm{N}(f_i \mid \mu_i, \Sigma_{ii})$, which are easily obtained from the joint Gaussian produced by the Laplace approximation or expectation propagation; we call these LA-G and EP-G. Denoting the means and variances of the pseudo-observations (defined by the site terms) by $\tilde{\mu}$ and $\tilde{\sigma}^2$ respectively, the joint approximation has the same form as in the Gaussian case:

  q(f \mid D, \theta, \phi) = \mathrm{N}\left(f \mid K(K + \tilde{\Sigma})^{-1} \tilde{\mu},\; K - K(K + \tilde{\Sigma})^{-1} K\right),   (19)

where $\tilde{\Sigma}$ is a diagonal matrix with $\tilde{\Sigma}_{ii} = \tilde{\sigma}_i^2$. The marginal is then simply $q(f_i \mid y) = \mathrm{N}(f_i \mid \mu_i, \Sigma_{ii})$.
Non-Gaussian approximations using a local correction.
The simplest improvement to the Gaussian marginals is to include the local term $\varepsilon_i(f_i)$ and assume that the global term in (18) is approximately constant. For EP the result is the tilted distribution $\hat{p}(f_i)$, which is obtained as a part of the EP algorithm (Opper and Winther, 2000). As only the local terms are used to compute the improvement, Cseke and Heskes (2011) refer to it as the local improvement and denote the locally improved EP marginal as EP-L.

For the Laplace approximation, Cseke and Heskes (2011) propose a similar local improvement LA-L, which can be written as $q(f_i \mid y)\, \varepsilon_i(f_i)$, where the site approximation $\tilde{t}_i(f_i)$ is based on the second-order approximation of $\log p(y_i \mid f_i)$ (see Section 2.4). In Section 3.5, we propose an alternative way to compute the equivalent marginal improvement using a tilted distribution $q_{-i}(f_i)\, p(y_i \mid f_i)$, where the cavity distribution $q_{-i}(f_i)$ is based on a leave-one-out formula derived using linear response theory (Appendix A). The local methods EP-L and LA-L can improve the marginal posterior approximation only at the observed $x_i$; the marginal posterior at a new $\tilde{x}$ is the usual Gaussian predictive distribution.
Non-Gaussian approximations using a global correction.
Global approximations also take into account the global term by approximating the multidimensional integral in Equation (18), again using Laplace or EP. To obtain an approximation for the marginal distribution, the integral has to be evaluated at several values of $f_i$, and the computations can be time-consuming unless some simplifications are used. Global methods can also be used to obtain an improved non-Gaussian posterior marginal approximation at a not yet observed $\tilde{x}$.

Using the Laplace approximation to evaluate the integral corresponds to an approach proposed by Tierney and Kadane (1986), and so we label the marginal improvement LA-TK. Rue et al. (2009) proposed an approach that can be seen as a compromise between the computationally intensive LA-TK and the local approximation LA-L. Instead of finding the mode for each value of $f_i$, they evaluate the Taylor expansion around the conditional mean obtained from the joint approximation $q(f \mid y)$. The method is referred to as LA-CM. Cseke and Heskes (2011) propose the improvement LA-CM2, which adds a correction to take into account that the Taylor expansion is not made at the mode. To further reduce the computational effort, Rue et al. (2009) propose additional approximations with performance somewhere between LA-CM and LA-L. Rue et al. (2009) also discuss computationally efficient schemes for selecting the values of $f_i$ and interpolation or parametric model fitting to estimate the marginal density at other values of $f_i$. Cseke and Heskes (2011) propose similar approaches for EP, with EP-FULL corresponding to LA-TK, and EP-1STEP corresponding to LA-CM/LA-CM2. Cseke and Heskes (2011) also propose EP-FACT and LA-FACT, which use a factorized approximation to speed up the computation of the normalization terms.

The local improvements EP-L and LA-L are obtained practically for free, and all global approximations are significantly slower. See Appendix B for the computational complexities of the global approximations. Based on the results by Cseke and Heskes (2011), EP-L is inferior to the global approximations, but the difference is often small, and LA-L is often worse than the global approximations. Based on the results by Cseke and Heskes (2011) and our own experiments, we chose LA-CM2 and EP-FACT as the global corrections in the experiments.
2.6 Integration over the parameters
To marginalize out the parameters $\theta$ and $\phi$ from the previously mentioned conditional posteriors, we can use the exact or approximated marginal likelihood to form the marginal posterior of the parameters,

  p(\theta, \phi \mid D) \propto p(y \mid X, \theta, \phi)\, p(\theta, \phi),   (20)

and use numerical integration to integrate over $\theta$ and $\phi$. Commonly used methods include various Monte Carlo algorithms (see the list of references in Vanhatalo et al., 2013) as well as deterministic procedures, such as the central composite design (CCD) method by Rue et al. (2009). Using stochastic or deterministic samples $(\theta^s, \phi^s)$, the marginal posterior can be approximated as

  p(f_i \mid D) \approx \sum_{s=1}^S w_s\, q(f_i \mid D, \theta^s, \phi^s),   (21)

where $w_s$ is a weight for the sample $(\theta^s, \phi^s)$.

If the marginal posterior distribution is narrow, which can happen if $n$ is large and the dimensionality of $(\theta, \phi)$ is small, then the effect of the integration over the parameters may be negligible and we can use the Type II MAP estimate, that is, choose $(\hat{\theta}, \hat{\phi}) = \arg\max_{\theta, \phi}\, p(\theta, \phi \mid D)$.
3 Leave-one-out cross-validation approximations
We start by reviewing the generic exact LOO equations, which are then used to provide a unifying view of the different approximations in the subsequent sections. We first review some special cases and then more generic approximations. The LOO approximations and their abbreviations are listed in Table 2. The computational complexities of the LOO approximations have been collected in Appendix B.
Method | Based on
IS-LOO | importance sampling / importance weighting, Section 3.6
Q-LOO | quadrature integration, Section 3.7
TQ-LOO | truncated quadrature integration, Section 3.7
LA-LOO | same as Q-LOO with LA-L, Section 3.5
EP-LOO | same as Q-LOO with EP-L, obtained as a by-product of EP, Section 3.4
WAIC_G | matches the first two terms of the Taylor series expansion of LOO, Section 3.8
WAIC_V | matches the first three terms of the Taylor series expansion of LOO, Section 3.8
3.1 LOO from the full posterior
Consider the case where we have not yet seen the $i$th observation. Using Bayes' rule, we can add the information from the $i$th observation:

  p(f \mid y) \propto p(f \mid y_{-i})\, p(y_i \mid f_i),   (22)

again dropping $x$, $\theta$ and $\phi$ for brevity. Correspondingly, we can remove the effect of the $i$th observation from the full posterior:

  p(f \mid y_{-i}) \propto \frac{p(f \mid y)}{p(y_i \mid f_i)}.   (23)

If we now integrate both sides over $f$ and rearrange the terms, we get

  \frac{1}{p(y_i \mid y_{-i})} = \int \frac{p(f_i \mid y)}{p(y_i \mid f_i)}\, df_i.   (24)

In theory this gives the exact LOO result, but in practice we usually need to approximate $p(f_i \mid y)$ and the integral over $f_i$. In the following sections we first discuss the hierarchical approach, and then the analytic, Monte Carlo, quadrature, WAIC, and Taylor series approaches for computing the conditional version of Equation (24). We then consider how the different marginal posterior approximations affect the result.
In some cases, we can compute the leave-one-out posterior $p(f_i \mid y_{-i})$ exactly or approximate it efficiently, and then we can compute the LOO predictive density for $y_i$,

  p(y_i \mid y_{-i}) = \int p(y_i \mid f_i)\, p(f_i \mid y_{-i})\, df_i.   (25)

Or, if we are interested in the predictive distribution for a new observation $\tilde{y}_i$, we can compute

  p(\tilde{y}_i \mid y_{-i}) = \int p(\tilde{y}_i \mid f_i)\, p(f_i \mid y_{-i})\, df_i,   (26)

which is evaluated with different values of $\tilde{y}_i$, as it is not fixed like $y_i$.
3.2 Hierarchical approximations
Instead of approximating the leave-one-out predictive density directly, for hierarchical models such as GLVMs it is often easier to first compute the leave-one-out predictive density conditional on the parameters $(\theta, \phi)$, then compute the leave-one-out posteriors of the parameters, and combine the results,

  p(y_i \mid y_{-i}) = \iint p(y_i \mid y_{-i}, \theta, \phi)\, p(\theta, \phi \mid y_{-i})\, d\theta\, d\phi.   (27)

Sometimes the leave-one-out posterior of the hyperparameters is close to the full posterior, that is, $p(\theta, \phi \mid y_{-i}) \approx p(\theta, \phi \mid y)$. The joint leave-one-out posterior can then be approximated as

  p(f, \theta, \phi \mid y_{-i}) \approx p(f \mid y_{-i}, \theta, \phi)\, p(\theta, \phi \mid y)   (28)

(see, e.g., Marshall and Spiegelhalter, 2003). This approximation is a reasonable alternative if removing $y_i$ has only a small impact on the parameter posterior but a larger impact on the latent posterior. Furthermore, if the posterior $p(\theta, \phi \mid y)$ is narrow, a Type II MAP point estimate of the parameters may produce predictions similar to integrating over the parameters,

  p(y_i \mid y_{-i}) \approx p(y_i \mid y_{-i}, \hat{\theta}, \hat{\phi}).   (29)
3.3 LOO with Gaussian likelihood
If both the prior and the likelihood are Gaussian, then we can compute $p(f_i \mid y_{-i})$ analytically. Starting from the marginal posterior $p(f_i \mid y) = \mathrm{N}(f_i \mid \mu_i, \Sigma_{ii})$, we can remove the contribution of the $i$th factor in the likelihood:

  p(f_i \mid y_{-i}) = \mathrm{N}(f_i \mid \mu_{-i}, \sigma_{-i}^2),   (30)

where

  \sigma_{-i}^2 = \left( \Sigma_{ii}^{-1} - \sigma^{-2} \right)^{-1}, \qquad \mu_{-i} = \sigma_{-i}^2 \left( \Sigma_{ii}^{-1} \mu_i - \sigma^{-2} y_i \right).   (31)

These equations correspond to the cavity distribution equations in EP.

Sundararajan and Keerthi (2001) derived the leave-one-out predictive distribution directly from the joint posterior using the prediction equations and properties of partitioned matrices. This gives a numerically different but mathematically equivalent way to compute the leave-one-out predictive mean and variance:

  \mu_{-i} = y_i - \frac{[G y]_i}{G_{ii}}, \qquad v_{-i} = \frac{1}{G_{ii}},   (32)

where

  G = (K + \sigma^2 I)^{-1}   (33)

and $v_{-i}$ is the predictive variance of $y_i$ (including the noise variance). Sundararajan and Keerthi (2001) also provided the equation for the LOO log predictive density,

  \log p(y_i \mid y_{-i}, \theta, \sigma^2) = \frac{1}{2} \log G_{ii} - \frac{1}{2} \log(2\pi) - \frac{[G y]_i^2}{2 G_{ii}}.   (34)

Instead of integrating over the parameters, Sundararajan and Keerthi (2001) used this result (and its gradient) to find a point estimate for the parameters by maximizing the LOO log predictive density.
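The identities (32)-(34) mean that all $n$ LOO predictive distributions come from a single matrix inverse, with no refitting. The sketch below (our own helper; the test compares it against brute-force refitting) illustrates this:

```python
import numpy as np

def gaussian_loo(K, y, noise_var):
    """LOO predictive means, variances, and log densities for GP regression
    via the Sundararajan-Keerthi identities: one inverse, no refitting."""
    G = np.linalg.inv(K + noise_var * np.eye(len(y)))   # (K + sigma^2 I)^{-1}
    a = G @ y
    g = np.diag(G)
    loo_var = 1.0 / g                 # predictive variance of y_i
    loo_mean = y - a / g              # predictive mean of y_i
    lppd = -0.5 * np.log(2 * np.pi * loo_var) - 0.5 * (y - loo_mean)**2 / loo_var
    return loo_mean, loo_var, lppd
```

The $O(n^3)$ cost of the single inverse replaces $n$ refits that would each cost $O(n^3)$ themselves.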
3.4 LOO with expectation propagation
In EP, the leave-one-out marginal posterior of the latent variable is computed explicitly as a part of the algorithm. The cavity distribution (11) is formed from the marginal posterior approximation by removing the site approximation (pseudo-observation) using (31), and it can be used to approximate the LOO posterior,

  p(f_i \mid y_{-i}) \approx q_{-i}(f_i).   (35)

The approximation for the LOO predictive density,

  p(y_i \mid y_{-i}) \approx \int p(y_i \mid f_i)\, q_{-i}(f_i)\, df_i,   (36)

is the same as the zeroth moment $\hat{Z}_i$ of the tilted distribution. Hence we obtain approximations for $p(f_i \mid y_{-i})$ and $p(y_i \mid y_{-i})$ as a free by-product of the EP algorithm. We denote this approach EP-LOO. For certain likelihoods, (36) can be computed analytically; otherwise quadrature methods with a controllable error tolerance are usually used.

The EP algorithm uses all the observations when converging to its fixed point, and thus the cavity distribution technically depends on the observation $y_i$. Opper and Winther (2000) showed using linear response theory that the cavity distribution is leave-one-out consistent up to first order. Opper and Winther (2000) also showed experimentally in one case that the cavity distribution approximation is accurate. Cseke and Heskes (2011) did not consider LOO, but compared visually the tilted distribution marginal approximation EP-L to many global marginal posterior improvements. Based on these results, EP-L has some error in the shape of the marginal approximation if there is a strong prior correlation, but even then the zeroth moment, which is the LOO predictive density, is accurate. Our experiments provide much more evidence of the excellent accuracy of the EP-LOO approximation.
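For the probit likelihood, the zeroth moment (36) has the well-known closed form $\Phi\big(y_i \mu_{-i} / \sqrt{1 + \sigma_{-i}^2}\big)$ (see, e.g., Rasmussen and Williams, 2006). The sketch below (our own construction) evaluates (36) by simple trapezoidal quadrature, so the closed form can serve as a check:

```python
import math

def probit_tilted_zeroth_moment(y, cav_mean, cav_var, halfwidth=10.0, n=2001):
    """EP-LOO predictive density (36) for a probit likelihood: the zeroth
    moment of the tilted distribution, by trapezoidal quadrature."""
    std_norm_cdf = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    s = math.sqrt(cav_var)
    h = 2 * halfwidth * s / (n - 1)
    total = 0.0
    for k in range(n):
        f = cav_mean - halfwidth * s + k * h
        cav_pdf = (math.exp(-0.5 * (f - cav_mean)**2 / cav_var)
                   / math.sqrt(2 * math.pi * cav_var))
        w = 0.5 if k in (0, n - 1) else 1.0   # trapezoid end weights
        total += w * std_norm_cdf(y * f) * cav_pdf
    return total * h
```

Because the integrand decays like a Gaussian at the grid ends, the trapezoidal rule is essentially spectrally accurate here.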
3.5 LOO with Laplace approximation
Using linear response theory, which Opper and Winther (2000) used to prove the LOO consistency of EP, we also prove the LOO consistency of the Laplace approximation (derivation in Appendix A). Hence, we obtain a good approximation for $p(f_i \mid y_{-i})$ also as a free by-product of the Laplace method. Linear response theory can be used to derive two alternative ways to compute the cavity distribution $q_{-i}(f_i)$.

The Laplace approximation can be written in terms of the Gaussian prior times a product of (unnormalized) Gaussian-form site approximations. Cseke and Heskes (2011) define the LA-L marginal approximation as $q(f_i \mid y)\, \varepsilon_i(f_i)$, from which the cavity distribution, that is, the leave-one-out distribution, follows as $q_{-i}(f_i) \propto q(f_i \mid y) / \tilde{t}_i(f_i)$. It can be computed using (31). We refer to this approach as LA-LOO. The LOO predictive density can be obtained by numerical integration,

  p(y_i \mid y_{-i}) \approx \int p(y_i \mid f_i)\, q_{-i}(f_i)\, df_i.   (37)

An alternative way to compute the Laplace LOO, derived using linear response theory, is

  q_{-i}(f_i) = \mathrm{N}(f_i \mid \hat{f}_i - \sigma_{-i}^2 g_i,\; \sigma_{-i}^2),   (38)

where $\hat{f}_i$ is the posterior mode, $g_i = \nabla_{f_i} \log p(y_i \mid f_i) \big|_{f_i = \hat{f}_i}$ is the derivative of the log likelihood at the mode, and

  \sigma_{-i}^2 = \left( \Sigma_{ii}^{-1} - W_{ii} \right)^{-1}.   (39)

If we consider having pseudo-observations with means $\tilde{\mu}_i$ and variances $\tilde{\sigma}_i^2 = W_{ii}^{-1}$ as in (16), then these resemble the exact LOO equations for a Gaussian likelihood given in Section 3.3.
3.6 Importance sampling and weighting
A generic approach, not restricted to GLVMs, is based on obtaining Monte Carlo samples $f^s$ from the full posterior and approximating (24) as

  p(y_i \mid y_{-i}) \approx \left[ \frac{1}{S} \sum_{s=1}^S \frac{1}{p(y_i \mid f_i^s)} \right]^{-1},   (40)

where the other likelihood terms drop out since $y_i$ is independent of the other observations given $f_i$ and $\phi$. This approach was first proposed by Gelfand et al. (1992) (see also Gelfand, 1996), and it corresponds to importance sampling (IS) where the full posterior is used as the proposal distribution. We refer to this approach as IS-LOO.

A more general importance sampling form is

  p(\tilde{y}_i \mid y_{-i}) \approx \frac{\sum_{s=1}^S w_i^s\, p(\tilde{y}_i \mid f_i^s)}{\sum_{s=1}^S w_i^s},   (41)

where the $w_i^s$ are importance weights,

  w_i^s = \frac{1}{p(y_i \mid f_i^s)}.   (42)

This form shows the importance weights explicitly and allows the computation of other leave-one-out quantities such as the LOO predictive distribution. If the predictive density is evaluated at the observed value $y_i$, Equation (41) reduces to (40).
The approximation (40) has the form of a harmonic mean, which is notoriously unstable (see, e.g., Newton and Raftery, 1994). However, the leave-one-out version is not as unstable as the harmonic mean estimator of the marginal likelihood, which uses the harmonic mean of the full likelihood terms and corresponds to using the joint posterior as the importance sampling proposal distribution for the joint prior.

For the Gaussian observation model, Vehtari (2001) and Vehtari and Lampinen (2002) used exact computation for $p(y_i \mid y_{-i}, \theta^s, \phi^s)$ and importance sampling only for the parameters. The integrated importance weights are then

  w_i^s = \frac{1}{p(y_i \mid y_{-i}, \theta^s, \phi^s)},   (43)

and the LOO predictive density is

  p(y_i \mid y_{-i}) \approx \frac{\sum_{s=1}^S w_i^s\, p(y_i \mid y_{-i}, \theta^s, \phi^s)}{\sum_{s=1}^S w_i^s}.   (44)

The same marginalization approach can be used in the case of non-Gaussian observation models. Held et al. (2010) used the Laplace approximation, marginal improvements, and numerical integration to obtain an approximation for the conditional LOO terms (see more in Section 3.7). Vanhatalo et al. (2013) use EP and the Laplace method for the marginalization in the GPstuff toolbox. Li et al. (2014) considered generic latent variable models using Monte Carlo inference, and propose to marginalize by obtaining additional Monte Carlo samples from the conditional posterior of the latent values. Li et al. (2014) also proposed the name integrated IS and provided useful results illustrating the benefits of the marginalization. As we are focusing on EP and Laplace approximations for the latent inference, in our experiments we use IS only for the hyperparameters.
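A minimal sketch of the harmonic-mean estimator (40), computed in log space for numerical stability (names are ours; the test checks it against the exact LOO density in a conjugate normal-mean model):

```python
import numpy as np

def is_loo_log_density(post_samples, loglik_i):
    """IS-LOO estimate (40): log p(y_i | y_{-i}) as the negative log of the
    average inverse likelihood over full-posterior draws, using
    log-sum-exp to avoid overflow."""
    lw = -loglik_i(post_samples)            # log raw importance weights
    lw_max = lw.max()
    log_mean_w = lw_max + np.log(np.mean(np.exp(lw - lw_max)))
    return -log_mean_w
```

In the conjugate test model below the weights have finite variance, so the estimate converges quickly; in harder cases the PSIS diagnostics discussed next are needed.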
The variance of the estimate (40) depends on the variance of the importance weights. The full posterior marginal $p(f_i \mid y)$ is likely to be narrower and have thinner tails than the leave-one-out distribution $p(f_i \mid y_{-i})$. This may cause the importance weights to have high or infinite variance (Peruggia, 1997; Epifani et al., 2008), as rare samples from the low-density region in the tails of the full posterior may have very large weights.

If the variance of the importance weights is finite, the central limit theorem holds (Epifani et al., 2008), and the corresponding estimates converge quickly. If the variance of the raw importance ratios is infinite but the mean exists, the generalized central limit theorem for stable distributions holds, and the convergence of the estimate is slower (Vehtari and Gelman, 2015).

Vehtari et al. (2016) propose to use Pareto smoothed importance sampling (PSIS) by Vehtari and Gelman (2015) for diagnostics and to regularize the importance weights in IS-LOO. Pareto smoothed importance sampling uses the empirical Bayes estimate of Zhang and Stephens (2009) to fit a generalized Pareto distribution to the tail of the weight distribution. By examining the estimated shape parameter $\hat{k}$ of the Pareto distribution, we are able to obtain sample-based estimates of the existence of the moments (Koopman et al., 2009). When the tail of the weight distribution is long, a direct use of importance sampling is sensitive to one or a few of the largest values. To stabilize the estimates, Vehtari et al. (2016) propose to replace the largest weights by the expected values of the order statistics of the fitted generalized Pareto distribution. Vehtari et al. (2016) also apply optional truncation for very large weights, following the truncated importance sampling by Ionides (2008), to guarantee finite and reduced variance of the estimate in all cases. Even if the raw importance weights do not have finite variance, the PSIS-LOO estimate still has a finite variance, although at the cost of additional bias. Vehtari et al. (2016) demonstrate that this bias is likely to be small when the estimated Pareto shape parameter $\hat{k}$ is small.
If the variance of the weights is finite, then an effective sample size estimate can be computed as

  S_{\mathrm{eff}} = \frac{1}{\sum_{s=1}^S (\tilde{w}^s)^2},   (45)

where the $\tilde{w}^s$ are normalized weights (with a sum equal to one) (Kong et al., 1994). This estimate is noisy if the variance of the weights is large, but with smaller variances it provides an easily interpretable estimate of the efficiency of the importance sampling.
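Equation (45) is a one-liner in practice; the sketch below (our own helper) normalizes the raw weights before applying it:

```python
import numpy as np

def effective_sample_size(weights):
    """Importance-sampling effective sample size, Eq. (45):
    1 / sum of squared normalized weights (Kong et al., 1994)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                  # normalize so the weights sum to one
    return 1.0 / np.sum(w**2)
```

Equal weights give the nominal sample size $S$, while a single dominating weight drives the estimate towards one.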
Importance weighting can also be used with deterministic evaluation points obtained from, for example, a grid or CCD, by reweighting the weights in (21); see Held et al. (2010) and Vanhatalo et al. (2013). As the deterministic points are usually used in the low-dimensional case and the evaluation points are not far in the tails, the variance of the observed weights is usually smaller than with Monte Carlo. If the full posterior is a poor fit to some of the LOO posteriors, then the problem remains that the tails are not well approximated and LOO is biased towards the hierarchical approximation (28) that uses the full posterior of the parameters.

In the ideal case, the CCD evaluation points except the modal point would have equal weights. The CCD approach adjusts these weights based on the actual density, and importance weighting will further adjust them, making it possible that a small number of evaluation points have large weights. Although the CCD evaluation points have been chosen deterministically, we can diagnose the reliability of CCD by investigating the distribution of the weights. If there is a small number of CCD points, we examine the effective sample size, and in cases where the number of points exceeds 280 (which happens when there are more than 11 parameters), we also estimate the Pareto shape parameter $\hat{k}$.
3.7 Quadrature LOO
Held et al. (2010) proposed to use numerical integration to approximate (24):

  p(y_i \mid y_{-i}) \approx \left[ \int \frac{q(f_i \mid y)}{p(y_i \mid f_i)}\, df_i \right]^{-1}.   (46)

We call this quadrature LOO (Q-LOO), as one-dimensional numerical integration methods are usually called quadrature. Given an exact marginal posterior and accurate quadrature, this would give an accurate result (e.g., if the true posterior is Gaussian, quadrature should give a result similar to the analytic solution apart from numerical inaccuracies). However, some error will be introduced when the latent posterior is approximated with $q(f_i \mid y)$. The numerical integration of the ratio expression may also be numerically unstable if the tail of the likelihood term decays faster than the tail of the approximation $q(f_i \mid y)$. For example, with the probit likelihood, whose decaying tail behaves like a Gaussian with unit variance, the integrand remains bounded only if $q(f_i \mid y)$ is Gaussian with a variance below one.
Held et al. (2010) tested the Gaussian marginal approximation (LA-G) and two non-Gaussian improved marginal approximations (LA-CM and simplified LA-CM, see Section 2.5). All had problems with the tails, although less so with the more accurate approximations. Held et al. proposed to rerun the failed LOO cases with actual removal of the data. As Held et al. had 13 to 56 failures in their experiments, the proposed approach would make LOO relatively expensive. In our experiments with the Gaussian marginal approximations LA-G/EP-G, we also had several severe failures with some data sets. However, with the non-Gaussian approximations LA-CM2/EP-FACT, we did not observe severe failures (see Section 4).
If we use the marginal approximations EP-L or LA-L based on the tilted distribution (see Table 1), we can see that the tail problem vanishes. Inserting the normalized tilted distribution $\hat{p}(f_i) = \hat{Z}_i^{-1}\, q_{-i}(f_i)\, p(y_i \mid f_i)$ into (46), the likelihood terms cancel and the equation reduces to

  p(y_i \mid y_{-i}) \approx \hat{Z}_i = \int q_{-i}(f_i)\, p(y_i \mid f_i)\, df_i,   (47)

which is the EP-LOO or LA-LOO predictive density estimate, depending on which approximation is used.
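A minimal numerical check of (46) in the Gaussian special case, where the exact LOO density is available in closed form (the function name and grid settings are ours):

```python
import math

def quadrature_loo(marg_mean, marg_var, loglik, n=4001, halfwidth=10.0):
    """Q-LOO, Eq. (46): p(y_i|y_{-i}) ~ 1 / integral of q(f_i|y)/p(y_i|f_i),
    with a Gaussian marginal q and trapezoidal quadrature on a fixed grid."""
    s = math.sqrt(marg_var)
    h = 2 * halfwidth * s / (n - 1)
    total = 0.0
    for k in range(n):
        f = marg_mean + (k * h - halfwidth * s)
        log_q = (-0.5 * math.log(2 * math.pi * marg_var)
                 - 0.5 * (f - marg_mean)**2 / marg_var)
        w = 0.5 if k in (0, n - 1) else 1.0    # trapezoid end weights
        total += w * math.exp(log_q - loglik(f))
    return 1.0 / (total * h)
```

In the Gaussian test case the ratio integrand decays in both tails, so the quadrature is well behaved; with a thin-tailed likelihood the same integrand can grow without bound, which is exactly the instability discussed above.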
We also present an alternative form of (46), which gives additional insight into the numerical stability when the global marginal improvements are used. As discussed in Section 2.5, we can write the marginal approximation with a global improvement as

  q(f_i \mid y) \propto q_G(f_i \mid y)\, c_i(f_i),   (48)

where $q_G(f_i \mid y)$ is the Gaussian marginal and $c_i(f_i)$ is a global correction term (see Eq. (18)). Replacing the Gaussian marginal with the cavity-distribution form from EP-L or LA-L gives

  q(f_i \mid y) \propto q_{-i}(f_i)\, \tilde{t}_i(f_i)\, c_i(f_i),   (49)

which we can insert into (46) to obtain, up to normalization,

  p(y_i \mid y_{-i}) \approx \left[ \int q_{-i}(f_i)\, c_i(f_i)\, \frac{\tilde{t}_i(f_i)}{p(y_i \mid f_i)}\, df_i \right]^{-1}.   (50)

Here $q_{-i}(f_i)\, c_i(f_i)$ is a globally corrected leave-one-out posterior, and we can see that the stability will depend on the ratio $\tilde{t}_i(f_i) / p(y_i \mid f_i)$. The correction term may have increasing tails, which is usually not a problem in (48), but may be a problem in (50). In addition, the evaluation of the correction term at a small number of points combined with interpolation for the quadrature (as proposed by Rue et al., 2009) is sometimes unstable, which may increase the instability of (50). Depending on the details of the computation, (46) and (50) can produce the same result up to numerical accuracy if the relevant terms cancel out numerically in Equation (46). This happens in our implementation with the global marginal posterior improvements, and thus in Section 4 we do not report the results separately for (46) and (50).
Held et al. (2010) and Vanhatalo et al. (2013) use quadrature LOO in a hierarchical approximation, where the parameter level is handled using importance weighting (Section 3.6). Our experiments also use this approach. Alternatively, we could approximate $p(y_i \mid y_{-i})$ by integrating over the parameters in the marginal and in the likelihood separately and approximate LOO as

  p(y_i \mid y_{-i}) \approx \left[ \int \frac{\int q(f_i \mid y, \theta, \phi)\, p(\theta, \phi \mid y)\, d\theta\, d\phi}{\int p(y_i \mid f_i, \phi)\, p(\theta, \phi \mid y)\, d\theta\, d\phi}\, df_i \right]^{-1}.   (51)

If the integration over $\theta$ and $\phi$ is made using Monte Carlo or deterministic sampling (e.g. CCD), then this is equivalent to using quadrature for the conditional terms and importance weighting of the parameter samples.
Truncated weights quadrature.
As the quadrature approach may also be applied beyond simple GLVMs, we propose an approach for stabilizing the general form. Inspired by the truncated importance sampling of Ionides (2008), we propose a modification of the quadrature approach which makes it more robust to approximation errors in the tails:

  p(y_i \mid y_{-i}) \approx \frac{\int \tilde{w}(f_i)\, p(y_i \mid f_i)\, q(f_i \mid y)\, df_i}{\int \tilde{w}(f_i)\, q(f_i \mid y)\, df_i},   (52)

where

  \tilde{w}(f_i) = \min\left\{ \frac{1}{p(y_i \mid f_i)},\; \frac{1}{\delta} \right\}   (53)

and $\delta$ is a small positive constant. When $\delta \to 0$, we get the original equation (46). When $\delta$ is larger than the maximum value of $p(y_i \mid f_i)$, we get the posterior predictive density $\int p(y_i \mid f_i)\, q(f_i \mid y)\, df_i$. With larger values of $\delta$ we avoid the possibility that the ratio $q(f_i \mid y) / p(y_i \mid f_i)$ explodes. In easy cases, where the numerator $q(f_i \mid y)$ gets close to zero before the truncation becomes active, we get a negligible bias. In difficult cases, we have a bias towards the full posterior predictive density.
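The two limiting behaviors of the truncation can be checked numerically. The sketch below (our own construction, with a Gaussian marginal and trapezoidal quadrature) implements (52)-(53); the test verifies that a tiny $\delta$ reproduces the exact LOO density while a large $\delta$ gives the posterior predictive density:

```python
import math

def truncated_quadrature_loo(marg_mean, marg_var, lik, delta, n=4001, halfwidth=10.0):
    """TQ-LOO sketch, Eqs. (52)-(53): quadrature LOO with truncated weights
    w(f) = min(1/lik(f), 1/delta). delta -> 0 recovers Q-LOO (46); a delta
    above max lik(f) gives the full posterior predictive density."""
    s = math.sqrt(marg_var)
    h = 2 * halfwidth * s / (n - 1)
    num = den = 0.0
    for k in range(n):
        f = marg_mean + (k * h - halfwidth * s)
        q = (math.exp(-0.5 * (f - marg_mean)**2 / marg_var)
             / math.sqrt(2 * math.pi * marg_var))
        w = 1.0 / max(lik(f), delta)          # truncated weight, Eq. (53)
        t = 0.5 if k in (0, n - 1) else 1.0   # trapezoid end weights
        num += t * w * lik(f) * q
        den += t * w * q
    return num / den
```

Writing the estimate as a self-normalized ratio makes the truncation harmless in the numerator: only the denominator, where the ratio can explode, is stabilized.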
In truncated importance sampling, the truncation level is based on the average raw weight size and the number of samples (see details in Ionides, 2008). Following this idea we choose