    Bayesian inference using synthetic likelihood: asymptotics and adjustments

Implementing Bayesian inference is often computationally challenging in applications involving complex models, and sometimes calculating the likelihood itself can be difficult. Synthetic likelihood is one approach for carrying out inference when the likelihood is intractable, but it is straightforward to simulate from the model. The method constructs an approximate likelihood by taking a vector summary statistic as being multivariate normal, with the unknown mean and covariance matrix estimated by simulation for any given parameter value. Our article examines the asymptotic behaviour of Bayesian inference using a synthetic likelihood. If the summary statistic satisfies a central limit theorem, then the synthetic likelihood posterior is asymptotically normal under general conditions, with a distribution concentrating around a pseudo-true parameter value which is the true value when the model is correct. We compare the asymptotic behaviour of the synthetic likelihood posterior with that obtained from approximate Bayesian computation (ABC), and the two methods behave similarly under appropriate assumptions which allow correct uncertainty quantification. We also compare the computational efficiency of importance sampling ABC and synthetic likelihood algorithms, and give a general argument why synthetic likelihood is more efficient. Adjusted inference methods based on the asymptotic results are also suggested for use when a possibly misspecified form is assumed for the synthetic likelihood covariance matrix, such as diagonal or a factor model. This can be attractive to allow covariance matrix estimation using fewer model simulations when model simulation is expensive. The methods are illustrated in some simulated and real examples.

Authors

09/16/2018

Robust Bayesian Synthetic Likelihood via a Semi-Parametric Approach

Bayesian synthetic likelihood (BSL) is now a well established method for...
09/11/2019

Efficient Bayesian synthetic likelihood with whitening transformations

Likelihood-free methods are an established approach for performing appro...
10/03/2018

An easy-to-use empirical likelihood ABC method

Many scientifically well-motivated statistical models in natural, engine...
11/12/2020

On a Variational Approximation based Empirical Likelihood ABC Method

Many scientifically well-motivated statistical models in natural, engine...
09/21/2018

Parameter inference and model comparison using theoretical predictions from noisy simulations

When inferring unknown parameters or comparing different models, data mu...
12/16/2021

BayesFlow can reliably detect Model Misspecification and Posterior Errors in Amortized Bayesian Inference

Neural density estimators have proven remarkably powerful in performing ...
09/16/2021

Statistical Inference for Bayesian Risk Minimization via Exponentially Tilted Empirical Likelihood

The celebrated Bernstein von-Mises theorem ensures that credible regions...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Synthetic likelihood is a popular method used in likelihood-free inference when the likelihood is intractable, but it is possible to simulate from the model for any given parameter value. The method takes a vector summary statistic informative about the parameter and assumes multivariate normality for it, estimating the unknown mean and covariance matrix by simulation to obtain an approximate likelihood function.

This article makes three main contributions. First, we investigate the asymptotic properties of synthetic likelihood when the summary statistic satisfies a central limit theorem. Here we use the framework of Generalized Laplacian estimators (GLE’s) developed in Chernozhukov and Hong (2003), and show that under appropriate conditions the posterior density is asymptotically normal, and that it quantifies uncertainty accurately. The limiting posterior distribution is similar to that obtained in approximate Bayesian computation (ABC) approaches (Li and Fearnhead, 2018a; Frazier et al., 2018). The second main contribution is to show that an importance sampling Bayesian synthetic likelihood algorithm based on the posterior as proposal is more efficient computationally than a corresponding ABC algorithm with the same proposal. Although using the posterior as a proposal is impossible in general, it can be a guide to what happens with reasonable proposals of the kind used in practice, rather than considering naive choices such as the prior. Our results here are similar to results obtained by Price et al. (2018) when a rejection algorithm with the prior as a proposal is used for a toy normal example, but our argument is general and not tied to a particular model. We also show that the effect of estimating the mean and covariance matrix of the summary statistic based on samples is asymptotically negligible when is allowed to increase with sample size (at any rate). A third contribution of our work is to consider situations where a certain structure is assumed for the summary statistic covariance matrix (such as a diagonal matrix or a factor model) in order to facilitate the covariance matrix estimation in high-dimensional problems using a smaller value for

. This can be especially important for models where simulation of summary statistics is expensive. We use our asymptotic results to motivate sandwich type variance adjustments to account for the misspecification and implement these in some examples. The adjustments are also potentially useful in situations where the model for the original data

is misspecified, if there is some interest in inference for the pseudo-true parameter value in the model with the data generating density closest to the truth in a certain sense. For these adjustment methods, it is important that the summary statistic satisfies a central limit theorem for the asymptotic normality results for the posterior density to hold. This means in particular that these adjustments are not useful for correcting for the effects of violating the normal assumption for the summary statistic. Some related adjustment methods are considered by Müller (2013), although not in the context of synthetic likelihood or likelihood-free inference.

The synthetic likelihood was first introduced in Wood (2010), where it was used for approximate (non-Bayesian) likelihood inference. Price et al. (2018) recently discussed Bayesian implementations focusing on efficient computational methods. They also show that the synthetic likelihood scales more easily to high-dimensional problems and that it is easier to tune than competing approaches such as ABC. There is much recent development of innovative methodology for accelerating computations for synthetic likelihood and related methods (Meeds and Welling, 2014; Wilkinson, 2014; Gutmann and Corander, 2016; Ong et al., 2018a; An et al., 2018; Everitt, 2017; Ong et al., 2018b). However, there is also recent interest in weakening the assumptions on which the synthetic likelihood is based. One concern that is often expressed is the validity of the method when the normal assumption for the summary statistic is violated. This has led to several authors trying to use other kinds of surrogate likelihoods for summaries which are more flexible. For example, Fasiolo et al. (2016) consider extended saddlepoint approximations, Dutta et al. (2016)

consider a logistic regression approach for likelihood estimation, and

An et al. (2018) consider semiparametric density estimation with flexible marginals and a Gaussian copula dependence structure. Mengersen et al. (2013) and Chaudhuri et al. (2018) consider empirical likelihood approaches. An encompassing framework for many of these suggestions is the parametric Bayesian indirect likelihood (pBIL) of Drovandi et al. (2015). As mentioned above, the adjustments for misspecification developed here do not contribute to this literature on robustifying synthetic likelihood inferences to non-normality of the summary statistics, as they can only be justified when a central limit theorem holds for the summary statistic.

Section 2 describes some asymptotic theory for the Bayesian synthetic likelihood posterior, focusing on the situation where the summary statistic satisfies a central limit theorem. Some comparisons between asymptotic behaviour of synthetic likelihood and ABC posterior distributions are also considered. Section 3 considers the effect of estimating the summary statistic mean and covariance matrix in Bayesian computational algorithms. We compare the computational efficiency of importance sampling synthetic likelihood and ABC algorithms, and give a general argument why synthetic likelihood is more efficient. Section 4 considers some posterior adjustment methods motivated by the asymptotic analysis which aim to robustify inferences to misspecification of the summary statistic covariance matrix or the original model for the data. Section 5 considers some examples, and Section 6 gives some concluding discussion. All proofs can be found in the Appendix unless otherwise stated.

2 Asymptotic behaviour of Bayesian synthetic likelihood

2.1 Synthetic likelihood

In the synthetic likelihood approach, we start with a model for the data in terms of the parameter and the data are reduced to a vector summary statistic , with mean vector and covariance matrix . For an observed value of , the synthetic likelihood is usually defined to be , where denotes the multivariate normal density with mean vector and covariance matrix . This synthetic likelihood can be used as a substitute for the original intractable likelihood. In the following write for the corresponding approximate log-likelihood. For Bayesian inference based on a prior density , the synthetic likelihood posterior is defined as

 pSL(θ|Sobs) ∝p(θ)L(θ;Sobs). (1)

Although working directly with is difficult, since and are generally unknown, and can be estimated by Monte Carlo simulation. For a given , we can simulate replicates of the summary statistics, , where denotes the density of given , and then and are usually estimated by

 ^μ(θ)=1MM∑i=1S(i)^Σ(θ)=1M−1M∑i=1(S(i)−^μ(θ))(S(i)−^μ(θ))⊤,

with these estimates being plugged into instead of and to obtain an estimate . We write for the corresponding log-likelihood estimate.

In what follows, we consider the situation where we possibly misspecify and use a covariance matrix in the definition of , so that

 L(θ;Sobs) =ϕ(Sobs;μ(θ),C(θ)).

Let be a sample-based estimate of , and we define

 ^LM(θ;Sobs) =ϕ(Sobs;^μ(θ),^C(θ)).

Most simply, we might take and , where for a square matrix denotes the diagonal matrix with diagonal entries equal to those of . The reason for considering such a misspecification is that reliable estimation of may be possible with a much smaller value of , which is valuable when model simulations are expensive. Note that , , , , and the synthetic likelihood depend on , but for notational conciseness we leave this implicit in the notation.

in Metropolis-Hastings Markov chain Monte Carlo (MCMC) or importance sampling algorithms for exploring the posterior distribution, then these algorithms target the approximate posterior density

 pSLM(θ|Sobs) ∝p(θ)E(^LM(θ;Sobs)), (2)

assuming the expectation in (2) exists, and that the integral of the right-hand side of (2) is finite. This follows from standard pseudo-marginal arguments (Beaumont, 2003; Andrieu and Roberts, 2009; Tran et al., 2013) for appropriate proposal distributions in these algorithms, since

is a non-negative unbiased estimate of its own expectation. It is found in practice that the density (

2) depends only weakly on the value of , and hence is often chosen based on computational considerations (Price et al., 2018). Later we show that if tends to infinity with (at any rate), then the effects of estimating by are asymptotically negligible.

2.2 Asymptotic behaviour

Let denote the parameter space and be a prior density continuous on . We assume that is compact and convex, although this can be relaxed (see for example the remarks in Chernozhukov and Hong (2003), p. 303). For now we assume that and can be computed exactly and the effects of estimating these quantities by Monte Carlo methods are discussed later. We will denote the generating density for the summary statistics by ; if the specified model is correct, then this is the density of obtained from the model with true parameter value , but we do not assume this in general. Wherever appears in the development below, it is assumed to be distributed according to the true data generating density

unless otherwise stated. We denote convergence in probability by

, convergence in distribution by , and denote the gradient of a function by and its Hessian by . Assumptions (A1) and (A2) below imply the conditions of Lemma 1 and Lemma 2 respectively of Chernozhukov and Hong (2003).

1. For a continuous function having a unique maximum at some value in the interior of

 supθ∈Θ∣∣∣1nl(θ;S)−G(θ)∣∣∣p→0.
1. For some , and are twice differentiable in when .

2. For some positive definite ,

 1√n∇θl(θ0;S)d→N(0,Ω(θ0)),
3. is positive definite, where is the function given in (A1), and for some

 sup|θ−θ0|<δ∥∥∥1nHl(θ)−HG(θ)∥∥∥p→0,

where is the Hessian of and denotes the max norm.

Recall that is the synthetic likelihood posterior and write for the synthetic likelihood posterior mean,

 ¯θSL =∫θpSL(θ|S)dθ.

We denote by the maximum synthetic likelihood estimator.

Chernozhukov and Hong (2003) use some more general assumptions in their results which are similar to ours when (in their notation) and so that these quantities do not depend on . Theorem 2 of Chernozhukov and Hong (2003) then implies under our assumptions that

 √n(¯θSL−θ0)d→N(0,HG(θ0)−1Ω(θ0)HG(θ0)−1), (3)

so that the synthetic likelihood posterior mean is consistent for and asymptotically normal under assumptions (A1) and (A2). In the framework of Chernozhukov and Hong (2003)

, quite general Bayes estimators based on the synthetic likelihood posterior can be considered for loss functions satisfying certain conditions, but here we just consider the posterior mean. The result above also holds for

under similar conditions. See, for example Newey and McFadden (1986), where assumption (A1) implies consistency (Newey and McFadden, 1986, Theorem 2.1) and consistency with the assumption (A2) implies asymptotic normality (Newey and McFadden, 1986, Theorem 3.1). Consistency of is also considered in Wood (2010). Condition (A2) (ii) may not hold in general, but does hold when satisfies a central limit theorem and is discussed in Section 2.3. For ABC algorithms, asymptotic results in the literature concerned with the shape of the posterior distribution or the distribution of the posterior mean (Li and Fearnhead, 2018b, a; Frazier et al., 2018) also assume that a central limit theorem holds for summary statistics.

Theorem 1 of Chernozhukov and Hong (2003) is a Bernstein-von Mises type theorem which motivates a normal approximation to the synthetic likelihood posterior,

 N(¯θSL,−1nHG(¯θSL)−1), (4)

and the point estimate can be replaced in (4) by alternative estimates such as , for example. Comparison with (3

) shows that credible intervals based on this approximate posterior will give valid frequentist confidence intervals asymptotically if

.

Considering assumption (A1) above and let be the dimension of . Then

 1nl(θ;Sobs)= −d2nlog2π+d2nlogn−12nlog|nC(θ)| −12(Sobs−μ(θ))T{nC(θ)}−1(Sobs−μ(θ)).

Now assume that

1. , and have finite limits as which we write as , and , and and are positive definite matrices, with convergence to these limits uniform in .

2. .

Under (A3), apart from constants not depending on ,

 G(θ) =−12(˜μ(θ0)−˜μ(θ))T˜C(θ)−1(˜μ(θ0)−˜μ(θ)), (5)

and is maximized at . If for , then assumption (A1) above is satisfied with the true parameter, when the model for is correct.

2.3 Normality of the synthetic likelihood score

Assumptions (A1) and (A2) are straightforward to check, except for (A2) (ii), the requirement of asymptotic normality for the score of the synthetic likelihood. Now assume

1. .

If , then , but we do not assume this in what follows. Assuming (A4), Lemma 1 below states that (A2) (ii) holds. It is convenient to introduce some notation to make it easier to state the result. Let be a matrix valued function of a matrix argument, (i.e. is and is ). For any matrix , write for its vectorization, obtained by stacking columns. We define

 dfdX ≡dvec(f)dvec(X),

i.e.,

 [dvec(f)dvec(X)]i,j=∂vec(f)i∂vec(X)j,

so that is an matrix. If is a scalar and a column vector, then is a row vector, and .

Lemma 1.

If is differentiable at , and assumptions (A3) and (A4) hold, then

 1√n∇θl(θ0;S)d→ N(0,Ω(θ0)),

where

 Ω(θ0)= d˜μ(θ0)dθT˜Σ(θ0)−1˜C(θ0)˜Σ(θ0)−1d˜μ(θ0)dθ.

When , then

 Ω(θ0)= d˜μ(θ0)dθT˜C(θ0)−1d˜μ(θ0)dθ.

2.4 Correct frequentist coverage and comparison with the asymptotic behaviour of the ABC posterior

Equation (5) gives the form of for the synthetic likelihood. We observed earlier that the synthetic likelihood gives asymptotically correct frequentist coverage if . The following lemma gives an expression for and, together with Lemma 1, implies that if (i.e. when ), then and hence the Bayesian synthetic likelihood gives asymptotically correct frequentist coverage.

Lemma 2.

Suppose assumptions (A1)-(A3) hold, and let be given by (5). Then

 HG(θ0) =−d˜μ(θ0)dθ⊤C(θ0)−1d˜μ(θ0)dθ

The proof follows directly from Lemma A2 in the Appendix when we set so that . It is unsurprising that Lemma 2 shows that we have correct frequentist coverage because if satisfies a central limit theorem, then the normal model used to define the synthetic likelihood is asymptotically correct.

The results are similar to those obtained in the case of ABC - for the same summary statistics satisfying a central limit theorem, and under appropriate conditions, both ABC and synthetic likelihood posteriors are asymptotically normal with the same limiting covariance matrix. See, in particular, Proposition 1 of Li and Fearnhead (2018a) and Theorem 2 of Frazier et al. (2018). In our setup, for an asymptotically normal ABC limiting posterior distribution, it is required that the ABC tolerance is , although for point estimation can be chosen to be . Li and Fearnhead (2018a) also consider the effects of regression adjustment (Beaumont et al., 2002) and show that the use of these adjustments can allow correct uncertainty quantification with decreasing at slower rates than without adjustment. The results of Li and Fearnhead (2018a,b) and Frazier et al. (2018) are more general than those considered in our discussion, as they consider the behaviour of the posterior density and the posterior mean when the summary statistics satisfy a central limit theorem at a rate other than and the behaviour of ABC when the tuning parameters of the algorithm do not allow a limiting normal posterior distribution. However, we will only be concerned with the asymptotic behaviour of the posterior distribution when correct uncertainty quantification is desired, and satisfies a central limit theorem at the usual rate.

3 Effects of estimating μ(θ) and C(θ) and computational efficiency

The previous section examined the asymptotic behaviour of the synthetic likelihood posterior when it is unnecessary to estimate and . We now consider the situation where these quantities are estimated by Monte Carlo methods, based on samples, so that is replaced by within Bayesian inference algorithms. We demonstrate that in an importance sampling algorithm with a reasonable proposal, if goes to infinity with (at any rate), then the effects of replacing with are asymptotically negligible. Based on this, a comparison of computational efficiency between importance sampling ABC and synthetic likelihood algorithms with the posterior as proposal is also considered. Although it is not possible to use the posterior as a proposal in general, our results provide an insight for what happens with reasonable proposals of the kind used in practice rather than consideirng naive choices such as the prior. We show that the synthetic likelihood approach is more efficient than the ABC method, in terms of model simulations required to achieve a given effective sample size, when the dimension of the summary statistic is high. Price et al. (2018) obtained similar results using rejection sampling from the prior when the covariance matrix was not estimated, but our analysis is quite general and incorporates estimation of the covariance matrix, whereas their result is only for a specific model.

Algorithm 1 gives the version of ABC considered here and corresponds to importance sampling for drawing a weighted sample from an ABC posterior density using a proposal . The ABC posterior density is

 pϵ(θ|Sobs) ∝p(θ)pϵ(Sobs|θ), (6)

where

 pϵ(Sobs|θ) =∫p(S|θ)Kϵ(S−Sobs)dS, (7)

is the approximate ABC likelihood, and we write , with being a suitable kernel function, such as a density function for a standard Gaussian random vector. Note that as the bandwidth tends to zero, the approximate likelihood (7) tends to the density of at the observed value , . Observing that (7) is an expectation with respect to , a non-negative unbiased estimate of is for . It is valid to replace an intractable likelihood within an importance sampling scheme with a non-negative unbiased estimate (for further elaboration, see, for example, Tran et al. (2013)). Hence, with this estimate of the ABC likelihood (7), we can replace the usual importance weight for the proposal of with , for . The weights are normalized to sum to one after drawing samples. Algorithm 1 reduces to the basic rejection ABC algorithm (Pritchard et al., 1999) for a uniform kernel and . We also consider Algorithm 2, which is a synthetic likelihood algorithm which is similar to Algorithm 1 which generates a weighted sample from as follows. First, draw from , and then assign the weight to it, with the weights being normalized to sum to one after drawing samples.

Algorithm 1: Importance sampling ABC algorithm
1. For , simulate , , and let .

2. Output , , where

 ~wi =wi∑Nl=1wl,

as a weighted posterior sample from .

Algorithm 2: Importance sampling Bayesian synthetic likelihood algorithm
1. For , simulate , obtain a value for based on simulations from , and let .

2. Output , , where

 ~wi =wi∑Nl=1wl,

as a weighted posterior sample from .

Lemma 3 below shows that for a reasonable proposal, if we choose as a function of with going to infinity with (at any rate) then Algorithm 2 is asymptotically equivalent to the algorithm where is replaced by , so that the estimation of and has an asymptotically negligible effect. This means that Algorithms 1 and 2 asymptotically achieve the same uncertainty quantification as when is chosen as a function of so that in Algorithm 1 (Li and Fearnhead, 2018a; Frazier et al., 2018), and tends to infinity with in Algorithm 2.

Lemma 3.

Suppose that in Algorithm 2, is chosen as a function of so that tends to infinity with (at any rate). If is a proposal such that for the elements of are , and if the elements of are , then

 ^LM(θ;Sobs)L(θ;Sobs)p→1,

as and .

Note that Lemma 3 implies that the weights in Algorithm 2 tend to the values that would be obtained if and are known (i.e. the effects of estimation can be neglected asymptotically).

3.1 Computational efficiency of Bayesian synthetic likelihood and ABC

We now consider the computational efficiency of Algorithms 1 and 2 when the proposal is chosen to be the targeted posterior, in Algorithm 1, and in Algorithm 2. We use the effective sample size (ESS) for a sample of size as a basis for comparison, (Kong, 1992; Liu and Chen, 1995):

 ESS =NE(W)2E(W2),

where denotes one of the importance weights (unnormalized) for a draw from the proposal. Note that if so that , then , but if the importance weights are highly variable so that only a few proposals receive significant weight, then the ESS is much smaller.

Write for the volume of a unit ball in dimensions, and , , for the ball centred at of radius . For simplicity, consider the uniform kernel

 K(h) =V−1dI(h∈Bd(0,1)). (8)

We denote the ESS for ABC and synthetic likelihood in Algorithms 1 and 2 as and respectively. The following lemma obtains the asymptotic behaviour of and .

Lemma 4.

For Algorithms 1 and 2, using the kernel (8), as , and as .

For correct uncertainty quantification Li and Fearnhead (2018a) require that , so that in this case is . Each ABC sample requires only one summary statistic simulation, whereas each synthetic likelihood simulation requires . Dividing the ESS by the number of simulations required for each algorithm gives a useful measure of computational efficiency. For ABC,

 ESSABCN =o(n−d/2), (9)

whereas for the synthetic likelihood

 ESSBSLMN →1M, (10)

as . Suppose that say for some small . Then (10) is and comparing with (9), suggests that asymptotically synthetic likelihood is more efficient than ABC.

It is worth commenting on the example in Section 3 of Price et al. (2018) that compares rejection ABC and a rejection version of synthetic likelihood, where the model is normal and is constant and does not need to be estimated. They find that with the prior as proposal, ABC is more efficient when , equally efficient when , but less efficient than synthetic likelihood when . The essence of the example is that the sampling variability in estimating can be equated with the effect of the kernel (for a Gaussian kernel) in their toy normal model for a certain relationship between and . Our discussion above suggests that in general models, and with a reasonable proposal, the synthetic likelihood is preferable to the basic ABC algorithm asymptotically no matter what the dimension of the summary statistic. However, this greater computational efficiency is achieved through the strong normality assumption.

A rejection and importance sampling algorithm related to Algorithm 1 was analyzed in Li and Fearnhead (2018a). Under appropriate conditions they show that for a good proposal, and if is chosen as a function of so that , then the acceptance probability in their algorithm goes to asymptotically. However, while choosing suffices for good point estimation based on the ABC posterior mean, a choice of is needed for correct uncertainty quantification, and in this case the acceptance probability goes to zero. They also compare the efficiency of variants of rejection and importance sampling ABC with a good proposal, with and without regression adjustment, and the variant with regression adjustment allows to decrease to zero at a slower rate while still giving accurate uncertainty quantification. We leave a theoretical comparison of synthetic likelihood to regression adjusted versions of ABC to future work.

The asymptotic results of the previous section suggest the possibility of adjusting inferences to account for misspecification. We outline one approach, but there are other ways to do so. Suppose we have a sample , , approximately from , obtained by MCMC for example. As before let denote the synthetic likelihood posterior mean, let denote the synthetic likelihood posterior covariance, and write and for their sample estimates based on , . Consider the adjusted sample

 θA,q =¯θ+¯Γ¯Ω1/2¯Γ−1/2(θs−¯θ), (11)

, where is an estimate of . We discuss below how to obtain . We propose using (11) as an approximate sample from the posterior which will be similar to the original sample when the model is correctly specified but gives asymptotically valid frequentist inference about the pseudo-true parameter value when the model is misspecified.

The motivation for (11) is that if

is approximately drawn from the normal distribution

, then is approximately a draw from . Comparing with (3) and (4), if and , then this gives approximate frequentist validity to posterior credible intervals based on the adjusted posterior samples. We still need to describe how to obtain , which is an estimate of . We suggest two ways of doing so, one suitable for the case where we are prepared to assume that the original model for is well specified but the model for may not be, and another where the models for both and may be misspecified.

4.1 Approximating Cov(∇θl(θ0;S)) when the model for y is well specified

If we assume that the model for is correct, then a natural way to estimate is by Monte Carlo simulation from the model for . The procedure is given below.

Algorithm 3
1. For , draw , where recall that is the estimated synthetic likelihood posterior mean.

2. Approximate , where denotes the synthetic log-likelihood for . See the discussion in Section 5.2 for the approximation used in the examples.

3. Return

 ¯Ω=1J−1J∑j=1(g(j)−¯g)(g(j)−¯g)T,

where .

4.2 Approximating Cov(∇θl(θ0;S)) when both the model for y and for Σ(θ) may be incorrect

The procedure suggested in the previous subsection for approximating the gradient assumes that the model for is well specified. If this is untrue, we may still be able to estimate . In particular, if are iid, then we can obtain an estimate of by approximating the distribution of at using the bootstrap. The procedure in this case is to follow Algorithm 1, but replace step 1 with

1. for , sample with replacement to get a bootstrap sample with corresponding summary .

If the data are dependent it may still be possible to use the bootstrap (Kreiss and Paparoditis, 2011) but the implementation details are model dependent.

4.3 What the adjustments can and cannot do

The adjustments suggested above are intended to achieve asymptotically valid frequentist inference when is misspecified, or when the model for is misspecified but still satisfies a central limit theorem. The adjustment will not recover the posterior distribution that is obtained when the model is correctly specified. Asymptotically valid frequentist estimation based on the synthetic likelihood posterior mean for the misspecified synthetic likelihood is frequentist inference based on a point estimator of that is generally less efficient than in the correctly specified case. Matching posterior uncertainty after adjustment to the sampling variability of such an estimator does not recover the posterior uncertainty from the correctly specified situation. If is misspecified, but the model for is correct, we could try to estimate for the correctly specified case (i.e. we could try to estimate for the case where ) which would recover something which is similar asymptotically to the posterior in the correctly specified case. Since we only need to estimate at the single point that might be considered feasible and preferable to the adjustments considered above. However, we find that estimating in moderate or high dimensions even at a single point requires considerable computational effort, so that the misspecified form of often needs to be used to manage the overall computational demands. Section 5.2 discusses estimating . Hence, we only consider the adjustment to attain asymptotically correct frequentist estimation based on the misspecified in the examples below.

5 Examples

5.1 Toy example

Suppose we have non-negative count data , which we model as independent and identically distributed from a distribution. We will act as if the likelihood is intractable and base inference on the sample mean summary statistic , which is sufficient for in the Poisson model. However, we consider the situation where the data generating distribution is negative binomial so that the have mean and variance , deviating from the Poisson mean-variance relationship.

Under the Poisson model the synthetic likelihood has and . We consider a simulated dataset with , and we also consider deliberately misspecifying the variance model in the synthetic likelihood under the Poisson model as . As noted previously, the deliberate misspecification of may be of interest in problems with a high-dimensional as a way of reducing the number of simulated summaries needed to estimate with reasonable precision; for example, we might assume diagonal or based on a factor model. Under the misspecified variance model, the log synthetic likelihood is

 lS(θ) =−12log2π−12logθ2n−n(¯y−θ)2θ.

Since

 n−1lS(θ) →−(5−θ)2θ,

as , the pseudo-true parameter value is , which is the mean of the .

Figure 1 shows the estimated posterior densities obtained via a number of different approaches, when the prior for is . The narrowest green density is obtained from the synthetic likelihood with misspecified variance. This density is obtained using a run of 50,000 iterations of a Metropolis-Hastings MCMC algorithm using a normal random walk proposal. The red density is the exact posterior assuming the Poisson likelihood is correct (the posterior is

). The purple density is a kernel density estimate from adjusted synthetic likelihood samples, where we use the method of Section 3.1 for the adjustment in which the

model is assumed correct but the model may be misspecified. The figure shows that the adjustment gives a result very close to the exact posterior when the model is assumed correct. Finally, the light blue curve shows a kernel density estimate from adjusted synthetic likelihood samples, where we use the method of Section 3.2 based on the bootstrap without assuming that the model is correct. This posterior is more dispersed than the one obtained under the Poisson assumption, since the negative binomial generating density is overdispersed relative to the Poisson, and hence the observed is less informative about the pseudo-true parameter value than the Poisson model would have us believe.

5.2 Examples with high-dimensional S

This section explores the efficacy of our adjustment method when using a misspecified covariance in the presence of a high-dimensional summary statistic . In particular, in all the examples below we focus on using a shrinkage estimator of the covariance to reduce the number of simulations required to obtain an approximation of the posterior. Here, we use the shrinkage estimator of Warton (2008), which is also used in the synthetic likelihood context in Ong et al. (2018b). Based on independent observations we estimate the covariance matrix by

 ^Σγ =^D1/2(γ^C+(1−γ)I)^D1/2, (12)

where is the sample correlation matrix, is the diagonal matrix of component sample variances, and is a shrinkage parameter. The matrix is non-singular if even if , where is the dimension of the observations. This estimator shrinks the sample correlation matrix towards the identity. When (resp. ) there is no shrinkage (resp. a diagonal covariance is produced). Here we choose the value of to allow us to perform 1/10 of the simulations compared to standard synthetic likelihood for Bayesian inference. We are interested in the effect of the shrinkage on the synthetic likelihood approximation and if our methods can offer some kind of correction. In general, we will use very heavy shrinkage in order to stabilize covariance estimation, even for small , in the synthetic likelihood procedure. So we can think of the shrinkage estimator as the estimator of , and we link this estimator to our asymptotic framework by assuming that the shrinkage parameter is being chosen as a function of in such a way that it will tend to a limiting value that is not equal to .

In order to perform the adjustment, we must calculate or approximate the derivative of the synthetic log-likelihood (with shrinkage applied) at a point estimate of the parameter which is taken in our work to be the synthetic likelihood posterior mean , for different datasets simulated from the model at the point estimate. We develop a computationally efficient approach for estimating these derivatives as outlined below.

Our approach uses Gaussian process emulation of the approximate log-likelihood surface based on a pre-computed training sample. Given that we always wish to approximate the derivative at , we construct the training sample around it. We sample values using Latin hypercube sampling from the hypercube defined by , where denotes the th component of and we take

as the approximate posterior standard deviation of

. For each training sample, we can estimate the mean and shrinkage covariance and store these values. Denote the collection of training data as .

For a simulated statistic generated from the model at , the shrinkage synthetic log-likelihoods can be rapidly computed at each in the training data using the pre-stored information, denoted . Then, a Gaussian process (GP) regression model is fit based on the collection , with as the response and as the predictor. Here we use a zero-mean GP with squared exponential covariance function with different length scales for different components of . We then approximate the gradient of by computing the derivative of the smooth predicted mean function of the GP at . This can be shown to be equivalent to considering the bivariate GP of the original GP and its derivative, and performing GP prediction for the derivative value. It is possible to compute this estimate of the derivative explicitly, but it is simpler to implement using a finite difference approximation and we do so for our examples. In the examples below, we used training samples and datasets to construct the matrix . The amount of computing required to get a reliable adjustment estimate is clearly substantial.

In both examples we used 20,000 iterations of MCMC for standard and shrinkage BSL with a multivariate normal random walk proposal. In each case, the covariance of the random walk was set to an approximate posterior covariance obtained by pilot MCMC runs.

5.2.1 MA(2) example

The moving average (MA) time series model is a popular toy example in ABC-related research. We consider the following MA(2) model:

 yt=zt+θ1zt−1+θ2zt−2,

for , where , . The MA(2) model is invertible if is constrained to the space . We assume a uniform prior over this constrained space. Here the likelihood function is multivariate normal with , , , with all other covariances equal to . Here, the observed dataset is simulated with parameters and .

In this example we use the full dataset with as the summary statistic. The data here are normal, and it is unnecessary to rely on a central limit theorem. We denote the usual synthetic likelihood method using as BSL, and the shrinkage version as shrinkage BSL. For BSL, we use simulations per iteration. Choosing in (12) allows us to use only simulations per iteration in shrinkage BSL. An MCMC acceptance rate of 20% and 21% is obtained for BSL and shrinkage BSL, respectively, suggesting that the choice of is reasonable to allow for a 10-fold reduction in the number of simulations. Given that the likelihood is available in this example, we compare with the true posterior produced by a long run of MCMC (200,000 iterations). The BSL posterior and exact posterior are the same here. As mentioned previously, it is not expected that adjustment should recover the posterior for BSL from the shrinkage BSL approximation, since the adjustment aims at asymptotically correct frequentist estimation based on the shrinkage BSL point estimation, which is a different goal to recovery of the BSL posterior. However, comparison with the BSL posterior can be informative nevertheless about how much efficiency is lost compared to the BSL analysis.

Figure 2 summarizes the results and shows that the unadjusted shrinkage BSL posterior is dissimilar to the BSL posterior, having a different dependence structure and larger variances. However, the adjustment succeeds here in giving uncertainty quantification that is much closer to the BSL posterior. Figure 2: Adjustment results for the MA(2) example. The left figure shows contour plots for the true and shrinkage BSL posteriors, while the right figure shows contour plots for the true and adjusted BSL posteriors.

As a second example, we consider an individual-based model of a species called Fowler’s Toads (Anaxyrus fowleri) developed by Marchand et al. (2017), which was also analysed by An et al. (2018). Here we give very brief details, with more information in Marchand et al. (2017) and An et al. (2018).

The model assumes that a toad hides in its refuge site in the daytime and moves to a randomly chosen foraging place at night. GPS location data are collected on toads for days, i.e. the observation matrix is of dimension . For the synthetic data we use here, and . Then is summarised to sets comprising the relative moving distances for time lags of days. For instance, consists of the displacement information of lag day, .

Simulating from the model involves two distinct processes. For each toad, we first generate an overnight displacement, , then mimic the returning behaviour with a simplified model. The overnight displacement is assumed to belong to the Lévy-alpha stable distribution family, with stability parameter and scale parameter . The total returning probability is a constant , and if a return occurs on day , , then the return site is the same as the refuge site on day , where is selected randomly from with equal probability. Here we consider both simulated and real datasets. For the synthetically generated data we take , which is also favourable for the real data. We use a uniform prior over .

As in Marchand et al. (2017), the dataset of displacements is split into two components. If the absolute value of the displacement is less than 10 metres, it is assumed the toad has returned to its starting location. For the summary statistic, we consider the number of toads that returned. For the non-returns (absolute displacement greater than 10 metres), we calculate the log difference between adjacent

-quantiles with

and also the median. These statistics are computed separately for the four time lags. This results in 48 statistics in total. For standard BSL, we use simulations per MCMC iteration. In order to use only simulations per MCMC iteration, we use a shrinkage parameter of . For the simulated data, the MCMC acceptance rates are 16% and 21% for standard and shrinkage BSL, respectively. For the real data, the acceptance rates are both roughly 24%.

Figure 3 shows the results for the simulated data. The shrinkage BSL posterior sometimes underestimates the variance and has the wrong dependence structure compared to the standard BSL posterior. Adjustment produces uncertainty quantification that is closer to the standard BSL procedure, although the larger variances appearing for the adjusted shrinkage BSL indicate that there is a loss in efficiency in using frequentist inference based on the shrinkage BSL point estimate. The results for the real data (Figure 4) are qualitatively similar. There is less difference in the posteriors means between the standard and shrinkage BSL methods for the real data compared to the simulated data, and generally less variance inflation in the adjusted results for the real data compared to the simulated data. Figure 3: Adjustment results for the toad example based on the simulated data. The top row are bivariate contour plots of the standard and shrinkage BSL posteriors. The bottom row are bivariate contour plots of the standard and adjusted BSL posteriors. Figure 4: Adjustment results for the toad example based on the real data. The top row are bivariate contour plots of the standard and shrinkage BSL posteriors. The bottom row are bivariate contour plots of the standard and adjusted BSL posteriors.

6 Discussion

Our work examines the asymptotic behaviour of Bayesian inference using the synthetic likelihood when the summary statistic satisfies a central limit theorem. The synthetic likelihood asymptotically quantifies uncertainty similarly to ABC methods under appropriate algorithm settings and assumptions leading to correct uncertainty quantification. We also examine the effect of estimating the mean and covariance matrix in synthetic likelihood algorithms, as well as the computational efficiency of similar versions of importance sampling for Bayesian synthetic likelihood and ABC. Adjustments are also discussed for misspecification of the synthetic likelihood covariance, when such misspecification is useful for computational reasons. These adjustments may also be useful when the model for is misspecified, provided that inference on the pseudo-true parameter value has some interest.

It would be interesting to compare theoretically the computational efficiency of synthetic likelihood with regression-adjusted versions of ABC, in the same way that Li and Fearnhead (2018b)

compared ABC algorithms with and without regression adjustment. We mentioned that our adjustment methods do not apply when the summary statistics do not satisfy a central limit theorem, so they are not useful for accounting for non-normality of the summary statistics. Some approaches consider more complex parametric models than normal for addressing this issue, and the asymptotic framework developed here might be adapted for other parametric model approximations for the summaries. We leave these extensions to future work.

Acknowledgments

DN was supported by a Singapore Ministry of Education Academic Research Fund Tier 1 grant (R-155-000-189-114). CD was supported by an Australian Research Council’s Discovery Early Career Researcher Award funding scheme (DE160100741). The authors are grateful to Ziwen An preparing some computer code for the toad example.

Appendix

The following lemma is used in the proof of Lemma 1.

Lemma A1. If is differentiable at , then

 1√ndl(θ;S)dθ= 12√nvec((nC(θ))−1)⊤d(nC(θ))dθ+{√n(S−μ(θ))}⊤{nC(θ)}−1dμ(θ)dθ +1√nvec({nC(θ)}−1n(S−μ(θ))(S−μ(θ))T{nC(θ)−1})⊤d(nC(θ))dθ (13)

The proof involves differentiating . It is omitted because it is straightforward but tedious.

Proof of Lemma 1:

Noting that in (13) the first and third terms on the right-hand side go to zero for , we obtain that

 1√n∇θl(θ0;S)d→d˜μ(θ0)dθT˜C(θ0)−1Z,

where , and the result follows.

The following lemma gives an expression for that is used to prove Lemma 2.

Lemma A2. Suppose assumptions (A1)-(A3) hold, and let be given by (5). Then

 HG(θ) =HG1(θ)+HG2(θ),

where

 HG1(θ)= −d˜μ(θ)dθ⊤˜C(θ)−1d˜μ(θ)dθ+{(˜μ(θ0)−˜μ(θ))T⊗IP}d2˜μ(θ)dθdθ −d˜μ(θ)dθ⊤{(˜μ(θ0)−˜μ(θ))⊤˜C(θ)−1⊗˜C(θ)−1}d˜C(θ)dθ,
 HG2(θ)= 12{d˜C(θ)dθ}⊤{˜C(θ)−1(˜μ(θ0)−˜μ(θ))⊗IP}d˜C(θ)−1(˜μ(θ0)−˜μ(θ))dθ +12{d˜C(θ)dθ}⊤{IP⊗˜C(θ)−1(˜μ(θ0)−˜μ(θ)}d˜C(θ)−1(˜μ(θ0)−˜μ(θ))dθ +12{IP⊗vec(˜C(θ)−1(˜μ(θ0)−˜μ(θ))(˜μ(θ0)−˜μ(θ))⊤˜C(θ)−1)}d2˜C(θ)dθdθ,

and

 d˜C(θ)−1(˜