 # Accelerating pseudo-marginal Metropolis-Hastings by correlating auxiliary variables

Pseudo-marginal Metropolis-Hastings (pmMH) is a powerful method for Bayesian inference in models where the posterior distribution is analytical intractable or computationally costly to evaluate directly. It operates by introducing additional auxiliary variables into the model and form an extended target distribution, which then can be evaluated point-wise. In many cases, the standard Metropolis-Hastings is then applied to sample from the extended target and the sought posterior can be obtained by marginalisation. However, in some implementations this approach suffers from poor mixing as the auxiliary variables are sampled from an independent proposal. We propose a modification to the pmMH algorithm in which a Crank-Nicolson (CN) proposal is used instead. This results in that we introduce a positive correlation in the auxiliary variables. We investigate how to tune the CN proposal and its impact on the mixing of the resulting pmMH sampler. The conclusion is that the proposed modification can have a beneficial effect on both the mixing of the Markov chain and the computational cost for each iteration of the pmMH algorithm.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We are interested in approximating some probability distribution

using simulation methods based on Markov chain Monte Carlo (MCMC; Robert and Casella, 2004). In Bayesian inference, could represent the posterior distribution of the parameters and some auxiliary variables , which e.g. can be latent states in the model or missing data. Often, we are interested in the marginal posterior distribution of the parameters given by

 πθ(θ)=p(θ)p(y|θ)∫Θp(θ)p(y|θ)dθ=p(θ)pθ(y)p(y), (1)

where denotes the likelihood function and denotes the parameter prior distribution. The denominator is usually referred to as the marginal likelihood or the model evidence.

In principle, we can apply MCMC methods to sample directly from (1) using e.g. the Metropolis-Hastings (MH; Metropolis et al., 1953; Hastings, 1970) algorithm. However, in practice can be analytically intractable or too costly from a computationally perspective to evaluate point-wise. By introducing the auxiliary variables , we can sometimes mitigate these problems and obtain an algorithm which can be of practical use.

This approach is discussed by Andrieu and Roberts (2009), where the pseudo-marginal MH (pmMH) algorithm is introduced to sample from using a standard MH sampler. This approach can be seen as an exact approximation of the ideal algorithm, where can be recovered by marginalisation. A concrete example is the particle MH (PMH; Andrieu et al., 2010

), where the auxiliary variables are all the random variables generated in a run of a sequential Monte Carlo (SMC;

Del Moral et al., 2006

) algorithm. In this setting, the SMC algorithm is used to provide an unbiased estimate of the likelihood function. This is useful for inference in latent variables models such as state space models (SSMs;

Flury and Shephard, 2011; Pitt et al., 2012), mixture models (Fearnhead and Meligkotsidou, 2015)

and generalised linear mixed models

(Tran et al., 2014).

A typical problem in pmMH is that the resulting Markov chain can be highly autocorrelated, which corresponds to a sticky chain and poor exploration of the posterior. This is the result of that occasionally we obtain an estimate of the posterior which is much larger than its true value. This problem can typically be mitigated by increasing the number of auxiliary variables , which increase the accuracy of the estimate of the target. However, this increases the computational cost for each iteration of the algorithm, which can limit practical use of the pmMH algorithm for many interesting but challenging models. The trade-off between increasing and the number of iterations in the pmMH algorithm is studied by Pitt et al. (2012), Sherlock et al. (2015) and Doucet et al. (2015). They conclude that

should be selected such that the standard deviation in the log-target estimate is roughly between

and .

The main contribution of this paper is to introduce a positive correlation in the auxiliary variables between two consecutive iterations of the pmMH algorithm. Intuitively, this allows us to decrease and therefore the computational cost while keeping the mixing of the Markov chain constant. The correlation is introduced by changing the proposal for from the standard independent proposal to a random walk proposal in the auxiliary space .

The idea for this originates in the field of computer graphics, where a similar approach is used for rendering imagesby sampling valid light paths connecting the light sources in the scene to the camera using MCMC methods. In Kelemen et al. (2002), the authors propose the primary sample path space Metropolis light transport algorithm that operates directly on a set of uniform variates used in a second step to construct a light ray through the scene, using sequential importance sampling. Recently, Hachisuka et al. (2014) extended the original algorithm to include multiple importance sampling (MIS; Owen and Zhou, 2000; Veach and Guibas, 1997; Kronander and Schön, 2014) to improve the efficiency further.

The idea of introducing correlation in between iterations has previously been suggested (independently) in the original particle MCMC (PMCMC) paper by Andrieu et al. (2010) and in the connected discussions by Lee and Holmes (2010). Furthermore, a similar idea of introducing correlation in is proposed by Andrieu et al. (2012). The main difference in our contribution is the use of the Crank-Nicolson (CN; Beskos et al., 2008; Cotter et al., 2013; Hairer et al., 2014) proposal to update at every iteration of the algorithm. This in contrast with the standard pmMH proposal for , which samples new random variables independently at each iteration of the algorithm. The CN proposal is a natural choice as it is known to scale independently with the dimension of the space. Furthermore, we provide the reader with both a theoretical and numerical analysis of how the mixing and the computational cost is affected and propose an adaptive algorithm based on the theoretical results. Deligiannidis et al. (2015) also proposes (independently) to make use of the CN proposal in this setting.

We consider two different numerical examples to illustrate the properties of the proposed modifications to the pmMH algorithm. In the first example, we infer the parameters in IID data from the Gaussian distribution to analyse how to tune the CN proposal and show how it improves upon the idea proposed by

Lee and Holmes (2010). In the second example, we conduct inference of the log-volatility using a stochastic volatility (SV) model with leverage on real-world stock index data.

We continue this paper by introducing the proposed modifications in Section 2 and analyse its theoretical properties, in a simplified setting, in Section 3. We end the paper by some numerical illustrations of the proposed method in Section 4 from which we draw some general conclusions in Section 5.

## 2 Introducing correlation into the auxiliary variables

We would like to sample from the parameter posterior in (1). However, this is not possible using a direct implementation of the MH algorithm as we cannot evaluate point-wise. Instead, we introduce the auxiliary variables as independent standard Gaussian random variables, i.e. . We make use of the setup used by Andrieu and Roberts (2009) and Doucet et al. (2015). Furthermore, we make no notational distinction between a random variable and its realisation for brevity.

Let the potential function be defined such that

 E[exp(−Φθ(u))]=cπθ(θ),∀θ∈Θ,

and some constant . Hence, we can define the extended target distribution as

 ¯π(θ,u)=exp(−Φθ(u))cN(u;0,INu), (2)

where denotes a -dimensional multivariate standard Gaussian distribution. Note that the parameter posterior is the marginal of as

 ∫RNu¯π(θ,u)du=1cE[exp(−Φθ(u))]=πθ(θ),

as required in the exact approximation scheme used in the pmMH algorithm. Hence, we conclude that this is a valid extended target for sampling from the parameter posterior . A concrete example of a suitable potential function follows from the definition in (1). In this case, we have

 Φθ(u) =−(logˆpθ(y;u)+logp(θ)), exp(−Φθ(u)) =ˆpθ(y;u)p(θ), E[exp(−Φθ(u))] =pθ(y)p(θ)=p(y)=cπθ(θ),

where the constant corresponds to the marginal likelihood and denotes the non-negative and unbiased estimator of the likelihood constructed by the auxiliary variables. (The typical application of the pmMH algorithm makes use of importance sampling or particle filtering to construct such an estimator.)

From (2), we have that the target is high-dimensional in whenever is large, and that the extended target correspond to a change of measure from the Gaussian prior. In this setting, we know that the CN proposal is a suitable choice as it is dimension independent (Cotter et al., 2013). An intuition for this can be given by realising that the CN proposal is a discretisation of the Ornstein-Uhlenbeck stochastic differential equation, see e.g. Kloeden and Platen (1992). The main difference between using the CN proposal and a standard random walk proposal is its autoregressive nature, which results in the mean-reverting behaviour. As a consequence, the CN proposal has the standard Gaussian distribution as its limiting distribution, which suits us well in this setting. Therefore, we assume a proposal distribution for and with the form

 qθ,u({θ′,u′}|{θ,u}) =qθ(θ′|{θ,u})qu(u′|u) =N(θ′;μ(θ,u),Σ(θ,u))N(u′;√1−σ2uu,σ2uINu). (3)

This corresponds to using a standard Gaussian random walk for , where and denote a mean and covariance function possibly depending on the auxiliary variables in the previous step, respectively. Note that by introducing into , we can make use of gradient and Hessian information as proposed by Dahlin et al. (2015a) to improve the performance of the algorithm. The CN proposal for is parametrised by the step length , which is determined by the user. We return to discussing how to tune in Section 3. Note that (3) is in contrast with the standard independent pmMH proposal, which we recover for the choice .

The corresponding pmMH algorithm for sampling from (2) follows directly from the choice of proposal in (3). During iteration of the algorithm, we first sample from (3) to obtain the candidate parameter given . We then compute the acceptance probability by

 α({θ′,u′},{θk−1,uk−1})=1∧exp(Φθk−1(uk−1)−Φθ′(u′))qθ(θk−1|{θ′,u′})qθ(θ′|{θk−1,uk−1}), (4)

where we make use of the notation . This follows directly from the properties of the CN proposal (Cotter et al., 2013). In the last step, we accept the candidate parameter, i.e. set , with the probability given by (4). Otherwise, we reject the candidate parameter and set .

The resulting pmMH is presented in Algorithm 1. In Line 4, we need to evaluate the potential function in . In this paper, we make use of importance sampling and particle filtering in Section 4 to construct . Note that Algorithm 1 only differs from a standard pmMH algorithm in Lines 3 and 4, where we add a CN proposal for and evaluate the potential function . Hence, implementing the new algorithm only requires minor changes to the existing code.

In many models, we are required to make use of non-Gaussian random variables to generate samples from the proposal distributions in e.g. importance sampling and particle filtering algorithms. Furthermore, we require uniform random variables for the resampling step in the particle filter, see e.g. Doucet and Johansen (2011)

. These variables can be obtained by using a inverse cumulative distribution function (CDF) transformation also known as a quantile transformation. Firstly, we compute a uniform random variable by

, where denotes the CDF of the standard Gaussian distribution. Seconly, we compute by a quantile transform of using the inverse CDF for the distribution that we would like to simulate a random variable from. There exists other approaches e.g. accept-reject sampling that also can be useful for simulating random variables when the CDF cannot be inverted in closed-form. See e.g. Ross (2012) or Robert and Casella (2004) for more information.

## 3 Theoretical analysis

The aim of the theoretical analysis in this section is to give some guidance of how to tune to obtain a reasonable performance in the pmMH algorithm. More specifically, we are interested in determining the value of that maximises the mixing and to determine how this translates into a suitable acceptance rate in the algorithm. We begin by setting up the model for the analysis in Section 3.1, which depends on the multivariate random variable . We then reformulate the model to replace with a univariate random variable . This enables use to make use of an analysis of a discretised model (Gelman et al., 1996; Yang and Rodríguez, 2013) in Section 3.2 to tune the CN proposal to optimise the mixing in the Markov chain. In Section 3.3, we report the results of this analysis, which we return to In Section 4 where we compare with the optimal tuning of in some practical scenarios.

### 3.1 Setting up the model

To accomplish the aforementioned aim we will study the algorithm in a simplified setting. Firstly, since we are primarily interested in the effect of the correlation in the -variables, we will start out by making the simplifying assumption that we fix in the extended target . Hence, we would like to analyse sampling from

 (5)

using the CN proposal in (3). We believe that this is a reasonable proxy for the complete model (2) since in many cases only local moves are made in the variable. Following Pitt et al. (2012) and Doucet et al. (2015) we also assume that (which depend on

) follows a central limit theorem (CLT) and that it therefore can be accurately approximated as being Gaussian distributed:

 −Φ(u) ∼N(−σ2Φ2,σ2Φ), when u ∼N(u;0,INu). (6)

for some . This assumption follows from the properties of the log-likelihood estimator based on (sequential) importance sampling; see Pitt et al., 2012; Doucet et al., 2015 for further details. This case is of interested to us as we make use of this type of estimator in Section 4 for computing an estimate of the parameter posterior distribution.

We can readily check that assumption (6) implies that

 −Φ(u) ∼N(σ2Φ2,σ2Φ), when u ∼¯π(u), (7)

where is the -dimensional distribution defined in (5). Consequently, if we define the random variable

 z≜σΦ2−1σΦΦ(u)

it follows that the law of under is given by

 ˜π(z)=N(z;σΦ,1).

Note that is a one-dimensional variable, which means that we have reduced the high-dimensional target (5) to one of its one-dimensional marginals. We will study the properties of the CN proposal in this one-dimensional marginal space.

Indeed, assume that is distributed according to the target (5) (i.e., it can be viewed as the state of the Markov chain at stationarity) and that is simulated from the CN proposal in (3). Define and as above, as transformations of and

, respectively. From a bivariate CLT we can then obtain a bivariate Gaussian approximation for the vector

, suggesting that the CN proposal for corresponds to the one-dimensional proposal:

 ˜q(z′|z)=N(z′;√1−σ2z,σ2z), (8)

for , for some . Note that in general differs from and that it depends on the nonlinear transformation . However, it is clear that and and, intuitively, is a monotone function of . We investigate the dependence between these two variables numerically in Section 4.1.

Note that the resulting acceptance probability, for the reduced one-dimensional MH algorithm with target and proposal , is given by:

 ˜α(z,z′)=1∧exp(σΦz′−σΦz). (9)

### 3.2 Analysis by discretisation of the state space

We follow Gelman et al. (1996) and Yang and Rodríguez (2013) and analyse the mixing of the Markov chain using a state space discretisation approach. We know that the expectation of any integrable test function under the target distribution

 ˜π[φ]=E˜π[φ(z)]=∫Zφ(z)˜π(z)dz,

can be approximated by the ergodic theorem (Meyn and Tweedie, 2009; Tierney, 1994) as

 ˆ˜π[φ]=1KK∑k=1φ(zk),

using the samples obtained from the pmMH algorithm. Under the assumption of geometric ergodicity, we have that the error of the estimate obeys a CLT (Meyn and Tweedie, 2009; Tierney, 1994) given by

 √K(˜π[φ]−ˆ˜π[φ])d⟶N(0,ν),

where

denotes the asymptotic variance,

 ν=V˜π[φ]⋅[1+2∞∑τ=1ˆρτ(z)]≜% IACT(z), (10)

where and denote the variance of over and the integrated autocorrelation time (IACT), respectively. Here, denotes the lag- autocorrelation of .

To optimise the mixing, we would like to determine such that is as large as possible. To estimate the asymptotic variance, we partition the interval into bins of equal length . Hence, the center point of bin is given by . We can then define the transition matrix following Yang and Rodríguez (2013) by

 plm={˜q(zm|zl)˜α(zl,zm)Δ,for l,m∈{1,…,K},l≠m1−∑k≠lplk,for l=m, (11)

where the proposal (8) and acceptance probability (9) enters into . Hence, we can estimate the acceptance rate (the probability of a jump) by

 Pjump=L∑l=1πl(1−pll), (12)

where the stationary probability is calculated as

 πl=N(zl;σΦ,1).

Furthermore from the asymptotic results in Peskun (1973) and Kemeny and Snell (1976), we can estimate by

 ˆν(f,π,P)=φ⊤(2BZ−B−BA)φ, (13)

where and . Here, we introduce the fundamental matrix by and the limiting matrix by with . Figure 1: Optimal σu (left) and acceptance rate (right) for minimising the 1/ˆν(f,π,P).

### 3.3 Tuning the CN proposal for the univariate case

We make use of a modified version of the C-code provided by Yang and Rodríguez (2013) to implement the discretisation approach. We set , and . Here, we use to optimise the mixing of the posterior mean. We vary in the target and in the proposal in and , respectively. For each , we find the that maximises computed by (12) and calculated by (13). In Figure 1, we present the resulting is as a function of . We note that for as recommended by e.g. Doucet et al. (2015), we should have to obtain the optimal mixing and this corresponds to an acceptance rate of around to in the Markov chain for .

Note that these results hold for the case when can be described by the univariate Gaussian random variable . We return to numerically investigate how to optimal value of in the CN proposal for the univariate random variable relates to the corresponding value used in the CN proposal for the multivariate random variable in Section 4.

## 4 Numerical illustrations

In this section, we investigate the numerical properties of the proposed alterations to the pmMH algorithm. We are especially interested in how the mixing is affected by making use of the CN proposal instead of an independent proposal for . For this end with compare the mixing in two different models: (i) a Gaussian IID model with synthetic data and (ii) a nonlinear state space model with real-world data. The implementation details are presented in Appendix A, where we also describe how to estimate the value of the target distribution (log-likelihood) for each model.

We quantify the mixing by estimating the IACT using the empirical autocorrelation function. The estimate is computed by

 ˆIF(θKb:K)=1+2100∑τ=1ˆρτ(θKb:K), (14)

where denotes the empirical lag- autocorrelation of , and denotes the burn-in time.

### 4.1 Gaussian IID model

Consider the independent and identically (IID) distributed Gaussian model given by

 x0∼δ0,xt∼N(xt;μ,σ2v),yt|xt∼N(yt;xt,σ2e), (15)

with parameters and where denotes the Dirac measure placed at . We generate a realisation with observations using the parameters .

We begin by investigating how the correlation of the log-likelihood estimated using importance sampling (see Appendix A.1) depends on . In Figure 2, we present the correlation between two consecutive estimates of the log-likelihood (keeping fixed) when . We note that the correlation is almost linear when and can be described as . We also estimate the standard deviation in the log-likelihood and obtain . Hence, we conclude that this implies from Figure 1 that the optimal correlation in the log-likelihood estimator is approximately . Figure 2: The estimated correlation (dots) in ˆpθ(y;u) for the Gaussian IID model (15) as a function of σuand a linear regression (green). The linear approximation provides a reasonable approximation for much of the central part of the interval. Figure 3: A heatmap of the IACT for the Gaussian IID model (15) when varying α and σu in (16). The results presented are the median from 32 independent Monte Carlo runs.

We proceed by fixing and to their true values and would like to infer the parameter posterior of given the data using Algorithm 1. Here, the potential function corresponds to the log-likelihood estimator for the importance sampler as discussed in Appendix A.1. The aim is to compute the IACT over a grid of and compare with the optimal theoretical results from Section 3. We would also like to compare the proposed changes to pmMH with the suggestions discussed by Lee and Holmes (2010). For this end, we consider a generalisation of the CN proposal (3) in which we introduce so-called global moves. The mixture of local and global moves is a popular approach in e.g. computer graphics (Kelemen et al., 2002) to promote exploration of the entire target space. In our setup, we can introduce such a mixture into and obtain the mixture proposal given by

 (16)

where we make use of the original proposals for and from (3). Here, denotes the probability of a global move for , where we recover the proposal in (3) by and the proposal discussed by Lee and Holmes (2010) by and . We recover the standard pmMH proposal for when and/or .

In Figure 3, we present a heatmap of the median IACT using (16) when varying . The smallest IACT values are obtain by using with a small (or zero) value of . This range of corresponds well with the results in Figure 1 with , which results in the optimal . Furthermore, we note that using results in worse performance than if independently of the value of . Hence, we conclude that there is some benefit of using our proposed modification compared with the approach discussed by Lee and Holmes (2010) in this specific example. That is, using local moves seem to be more beneficial than global moves but there could be some merit to consider a mixture proposal in any case. However, we set in the remainder of this section to simplify the tuning and make the conclusions more precise regarding the choice of .

### 4.2 Stochastic volatility model with leverage

Consider the problem of modelling the volatility in the daily closing prices of the NASDAQ OMXS30 index, i.e. a weighted average of the most traded stocks at the Stockholm stock exchange. We extract the log-returns using Quandl111The data is available for download from: https://www.quandl.com/data/NASDAQOMX/OMXS30. for the period January 2, 2011 and January 2, 2014. To model the underlying volatility, we make use of a stochastic volatility model with leverage given by

 x0∼N(x0;μ,σ2v(1−ϕ2)2),[xt+1yt]∣∣∣xt∼N([xt+1yt];[μ+ϕ(xt−μ)0],[σ2vρρexp(xt)]), (17)

with parameters . In this model, we would like to infer the parameter posterior of given the data using Algorithm 1. Here, the potential function corresponds to the log-likelihood estimator using the particle filter as discussed in Appendix A.2. The aim is to again compute the IACT and compare with the optimal theoretical results from Section 3. Figure 4: The acceptance probability (upper) and the resulting IACT (middle and lower) for μ (orange), ϕ (purple), σv (magenta) and ρ (light green) when varying σu. The results presented are the median (line) and first and third quantiles (shaded area) from 32 independent Monte Carlo runs. Figure 5: Upper: the closing log-returns for the NASDAQ OMXS30 index between January 2, 2011 and January 2, 2014. Lower: the parameter trace (left), autocorrelation function (center) and resulting posterior estimate (right) for the SV model (17) for μ (orange), ϕ (purple), σv (magenta) and ρ (light green). We make use of σu=0.55 in the CN proposal (3) and the histograms are computed using the output from 32 independent Monte Carlo runs.

In Figure 4, we present the median acceptance rate and IACT for the four parameters in the model when varying . The maximum IACT (over the four variables) is minimised when . However, the variation in the IACT is significant and a range between and seems as suitable choices for . The standard deviation of the log-likelihood estimator (around the estimated posterior mean) is , which by Section 3 would imply an optimal of around . This is clearly not an appropriate choice for this model. This can be due to that the Gaussian assumption for is not fulfilled or that the standard deviation of the log-likelihood estimate varies with .

From Figure 4, we conclude that using results in a decrease of the IACT comparing with using an independent proposal for . The improvement in the mixing is about times. We present a specific case in Figure (5), where we fix . We conclude that the chains are mixing well and the posterior estimates are reasonable with the estimated posterior mean with standard deviations .

## 5 Conclusions and future work

The numerical illustrations in Section 4 indicate that we can obtain improvements in the mixing of the pmMH algorithm by introducing correlation in . In this paper, we consider using a CN proposal for introducing the correlation and it seems that selecting offers good performance in the settings that we have considered. Hence, implementing the proposed alterations to the pmMH algorithm only requires changing a small number of lines of code and hopefully does not introduce any additional tuning parameters for the user. We are currently working on an adaptive method to further alleviate the work of tuning for the user. The main benefit of our proposed modifications of the pmMH algorithm is that we can decrease the number of particles and obtain better mixing at the same time. Hence, this results in a significant decrease in the computational cost for Bayesian inference based on the pseudo-marginal framework.

Some important additional future work is to extended the theoretical analysis in Section 3 to a more realistic setting, where the influence of also is taking into the account. Also it would be interesting to make use of Hilbert curve resampling proposed by Gerber and Chopin (2015) or some tree methods discussed by Lee (2008) in the particle filter. This would allow for inference in state space models with a multivariate state process.

It is also possible to make use of to construct estimates of the gradient and Hessian of the log-posterior. This information can be included into the proposal for as discussed by Dahlin et al. (2015a, b) to further improve the mixing in the Markov chain. The gradient information can also be used to reduce the variance in a post-processing step as discussed by Mira et al. (2013).

The proposed alterations of the pmMH algorithm can also be useful in models where the likelihood is intractable as discussed by e.g. Jasra (2015) and Dahlin et al. (2015b). In this class of models, approximate Bayesian computations (ABC; Marin et al., 2012 can be used to approximate the log-likelihood using importance sampling and particle filtering. However, in practice these estimates suffer from a large variance with results in bad mixing for the Markov chain. This approach can possible result in a large decrease in the computational cost for using the pmMH algorithm for inference in models with intractable likelihoods.

## Acknowledgements

This work was supported by: the Swedish Foundation for Strategic Research (SSF) through grant IIS11-0081, the project Probabilistic modeling of dynamical systems (Contract number: 621-2013-5524) funded by the Swedish Research Council and CADICS, a Linnaeus Center also funded by the Swedish Research Council. The simulations were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) at Linköping University, Sweden.

## References

• Andrieu and Roberts  C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
• Andrieu et al.  C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
• Andrieu et al.  C. Andrieu, A. Doucet, and A. Lee. Discussion on constructing summary statistics for approximate Bayesian computation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):451–452, 2012.
• Beskos et al.  A. Beskos, G. Roberts, A. Stuart, and J. Voss. MCMC methods for diffusion bridges. Stochastics and Dynamics, 8(03):319–350, 2008.
• Cotter et al.  S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science, 28(3):424–446, 2013.
• Dahlin et al. [2015a] J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis-Hastings using gradient and Hessian information. Statistics and Computing, 25(1):81–92, 2015a.
• Dahlin et al. [2015b] J. Dahlin, F. Lindsten, and T. B. Schön. Quasi-Newton particle Metropolis-Hastings. In Proceedings of the 17th IFAC Symposium on System Identification (SYSID), Beijing, China, October 2015b.
• Dahlin et al. [2015c] J. Dahlin, M. Villani, and T. B. Schön. Efficient approximate Bayesian inference for models with intractable likelihoods. Pre-print, 2015c. arXiv:1506.06975v1.
• Del Moral et al.  P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.
• Deligiannidis et al.  G. Deligiannidis, A. Doucet, M. K. Pitt, and R. Kohn. The Correlated Pseudo-Marginal Method. Pre-print, 2015. arXiv:1511.04992v1.
• Doucet and Johansen  A. Doucet and A. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan and B. Rozovsky, editors,

The Oxford Handbook of Nonlinear Filtering

. Oxford University Press, 2011.
• Doucet et al.  A. Doucet, M. K. Pitt, G. Deligiannidis, and R. Kohn. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika, 102(2):295–313, 2015.
• Fearnhead and Meligkotsidou  P. Fearnhead and L. Meligkotsidou. Augmentation schemes for particle MCMC. Statistics and Computing (in press), pages 1–14, 2015.
• Flury and Shephard  T. Flury and N. Shephard. Bayesian inference based only on simulated likelihood: particle filter analysis of dynamic economic models. Econometric Theory, 27(5):933–956, 2011.
• Gelman et al.  A. Gelman, G. Roberts, and W. Gilks. Efficient Metropolis jumping rules. Bayesian statistics, 5:599–607, 1996.
• Gerber and Chopin  M. Gerber and N. Chopin. Sequential quasi Monte Carlo. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(3):509–579, 2015.
• Hachisuka et al.  T. Hachisuka, A. S. Kaplanyan, and C. Dachsbacher. Multiplexed Metropolis light transport. ACM Transactions on Graphics (TOG), 33(4):100, 2014.
• Hairer et al.  M. Hairer, A. M. Stuart, and S. J. Vollmer. Spectral gaps for a Metropolis-Hastings algorithm in infinite dimensions. The Annals of Applied Probability, 24(6):2455–2490, 2014.
• Hastings  W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
• Jasra  A. Jasra. Approximate Bayesian Computation for a Class of Time Series Models. International Statistical Review, 83(3):405–435, 2015.
• Kelemen et al.  C. Kelemen, L. Szirmay-Kalos, G. Antal, and F. Csonka. A simple and robust mutation strategy for the Metropolis light transport algorithm. Computer Graphics Forum, 21(3):531–540, 2002.
• Kemeny and Snell  J. G. Kemeny and J. L. Snell. Finite Markov chains. Springer, New York, 1976.
• Kloeden and Platen  P. E. Kloeden and E. Platen. Numerical solution of stochastic differential equations, volume 23. Springer, 4 edition, 1992.
• Kronander and Schön  J. Kronander and T. B. Schön. Robust auxiliary particle filters using multiple importance sampling. In Proceedings of the 2014 IEEE Statistical Signal Processing Workshop (SSP), Gold Coast, Australia, July 2014.
• Lee  A. Lee. Towards smooth particle filters for likelihood estimation with multivariate latent variables. PhD thesis, University of British Columbia, 2008.
• Lee and Holmes  A. Lee and C. Holmes. Discussion on Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):327–328, 2010.
• Malik and Pitt  S. Malik and M. K. Pitt. Particle filters for continuous likelihood evaluation and maximisation. Journal of Econometrics, 165(2):190–209, 2011.
• Marin et al.  J-M. Marin, P. Pudlo, C. P. Robert, and R. J. Ryder. Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180, 2012.
• Metropolis et al.  N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
• Meyn and Tweedie  S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, 2009.
• Mira et al.  A. Mira, R. Solgi, and D. Imparato. Zero variance Markov chain Monte Carlo for Bayesian estimators. Statistics and Computing, 23(5):653–662, 2013.
• Owen and Zhou  A. Owen and Y. Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135–143, 2000.
• Peskun  P. H. Peskun. Optimum Monte-carlo sampling using Markov chains. Biometrika, 60(3):607–612, 1973.
• Pitt et al.  M. K. Pitt, R. S. Silva, P. Giordani, and R. Kohn. On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. Journal of Econometrics, 171(2):134–151, 2012.
• Robert and Casella  C. P. Robert and G. Casella. Monte Carlo statistical methods. Springer, 2 edition, 2004.
• Ross  S. M. Ross. Simulation. Academic Press, 5 edition, 2012.
• Sherlock et al.  C. Sherlock, A. H. Thiery, G. O. Roberts, and J. S. Rosenthal. On the efficency of pseudo-marginal random walk Metropolis algorithms. The Annals of Statistics, 43(1):238–275, 2015.
• Tierney  L. Tierney. Markov chains for exploring posterior distributions. The Annals of Statistics, 22(4):1701–1728, 1994.
• Tran et al.  M.-N. Tran, M. Scharth, M. K. Pitt, and R. Kohn. Importance Sampling Squared for Bayesian Inference in Latent Variable Models. Pre-print, 2014. arXiv:1309.3339v3.
• Veach and Guibas  E. Veach and L. J. Guibas. Metropolis light transport. In Proceedings of the 24th International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 65–76, Los Angeles, CA, USA, August 1997.
• Yang and Rodríguez  Z. Yang and Carlos E Rodríguez. Searching for efficient Markov chain Monte Carlo proposal kernels. Proceedings of the National Academy of Sciences, 110(48):19307–19312, 2013.

## Appendix A Implementation details

In this appendix, we outline the implementation details of the numerical illustrations presented in Section 4. The implementations for estimating the log-likelihood are based on standard importance sampling and particle filtering. Extensive treatments of these methods are found in e.g. Doucet and Johansen  and Robert and Casella .

### a.1 Gaussian IID model

We make use of an importance sampler with to estimate the log-likelihood with the prior as the importance distribution. This results in that the log-likelihood can be estimated by

 logˆpθ(y;u) =T∑t=1log[N∑i=1w(i)t]−TlogN, (18)

where the weights are generated using the procedure outlined in Algorithm 2.

For pmMH, we use iterations (discarding the first as burn-in) in Algorithm 1 and initialise in the true parameter . The proposal for is a standard Gaussian random walk proposal (3) with and . Finally, we make use of the following prior

 p(μ)∼TN(0,1)(μ;0,1),

where denotes a standard Gaussian distribution truncated to the interval .

### a.2 Stochastic volatility model with leverage

For any SSM, we make use of a bootstrap particle filter (bPF; Doucet and Johansen, 2011) to estimate the log-likelihood. An SSM with latent states and observations is given by

 x0∼μθ(x0),xt+1|xt∼fθ(xt+1|xt),yt|xt∼gθ(yt|xt), (19)

where denotes the static unknown parameters. Here, we assume that it is possible to simulate from the distributions and and evaluate point-wise. A bPF to estimate is presented in Algorithm 3.

Hence, the log-likelihood is computed using the same expression (18) as for importance sampling but the particles are instead generated by sequential importance sampling with resampling. Note that we are required to sort the particles after the propagation step, see the discussion about smooth particle filter by Malik and Pitt  for more details. We make use of the probability transform to generate the uniform random variables required for the systematic resampling step. Hence, is a -variate Gaussian random variable, where is used directly in the propagation step and is used in the resampling step after a transformation into a uniform random variable. Here, we make use of particles.

For pmMH, we use iterations (discarding the first as burn-in) in Algorithm 1 and initialise the parameters at obtained using the approach discussed by Dahlin et al. [2015c]. The proposal for is a standard Gaussian random walk proposal (3) with . Here, we make use of the rules of thumb by Sherlock et al.  to select the covariance function. We extract an estimate of the Hessian using the method by Dahlin et al. [2015c] and set

Finally, we use the following prior distributions

 p(μ) ∼N(μ;0,22), p(ϕ) ∼TN(−1,1)(ϕ;0.9,0.052), p(σv) ∼G(σv;2.0,0.05), p(ρ) ∼N(ρ;−0.5,0.22),

where

denotes the Gamma distribution with mean

.