 # Unbiased Smoothing using Particle Independent Metropolis-Hastings

We consider the approximation of expectations with respect to the distribution of a latent Markov process given noisy measurements. This is known as the smoothing problem and is often approached with particle and Markov chain Monte Carlo (MCMC) methods. These methods provide consistent but biased estimators when run for a finite time. We propose a simple way of coupling two MCMC chains built using Particle Independent Metropolis-Hastings (PIMH) to produce unbiased smoothing estimators. Unbiased estimators are appealing in the context of parallel computing, and facilitate the construction of confidence intervals. The proposed scheme only requires access to off-the-shelf Particle Filters (PF) and is thus easier to implement than recently proposed unbiased smoothers. The approach is demonstrated on a Lévy-driven stochastic volatility model and a stochastic kinetic model.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

### 1.1 State-space models, problem statement and review

Let denote a discrete-time index. State-space models are defined by a latent Markov process and observation process , taking values in a measurable space and satisfying

 Xt+1|{Xt=x}∼f(⋅|x),Yt|{Xt=x}∼g(⋅|x),

for with . In the following we assume that and , and use , and to denote densities with respect to the corresponding Lebesgue measure. State inference given a realization of the observations for some fixed requires the posterior density

 π(x1:T):=p(x1:T|y1:T) ∝μ(x1)g(y1|x1)T∏t=2f(xt|xt−1)g(yt|xt),

and expectations w.r.t. to this density. This is known as smoothing in the literature. For non-linear non-Gaussian state-space models, this problem is complex as this posterior and its normalizing constant , often called somewhat abusively likelihood, are intractable. We provide here a means to obtain unbiased estimators of for some function .

Particle methods return asymptotically consistent estimators which are however biased for a finite number of particles. Similarly MCMC kernels, such as the iterated Conditional Particle Filter (i-CPF) and PIMH , can be used for approximating smoothing expectations consistently but are also biased for a finite number of iterations. Additionally, although theoretical bounds on the bias are available for particle , PIMH  and i-CPF [7, 2, 27] estimators, these bounds are usually not sharp and/or rely on strong mixing assumptions which are not satisfied by most realistic models. Unbiased estimators of computed on parallel machines can be combined into asymptotically valid confidence intervals as either the number of machines, the time budget, or both go to infinity .

Recently, it has been shown in [22, 27] that it is possible to obtain such unbiased estimators by combining the i-CPF algorithm with a debiasing scheme for MCMC algorithms proposed initially in  and further developed in . These unbiased smoothing schemes couple two i-CPF kernels using common random numbers and a coupled resampling scheme. After a brief review of particle methods and of PIMH, we propose in Section 2

an alternative methodology relying on coupling two PIMH kernels. The method is easily implementable as it does not require any modification of the PF algorithm. It can also be used in scenarios where simulation from the Markov transition kernel of the latent process involves a random number of random variables, whereas the methods proposed in

[22, 27] would not be directly applicable to these settings. Additionally it does not require being able to evaluate pointwise the transition density contrary to the coupled conditional backward sampling PF scheme of . Section 3 presents an analysis of the methodology when is large. In Section 4, the method is demonstrated on a Lévy-driven stochastic volatility model and a stochastic kinetic model.111Code to reproduce figures is provided at https://github.com/lolmid/coupled_pimh

### 1.2 Particle methods

Particle methods are often used to approximate smoothing expectations [13, 25]. Such methods rely on sampling, weighting and resampling a set of weighted particles , where denotes the value of the particle at iteration and its corresponding normalized weight, i.e. . Letting , , denote the proposal density at time and at time respectively, weighting occurs according to the following ‘incremental weights’:

 w1(x1) :=g(y1|x1)μ(x1)q1(x1)fort=1, wt(xt−1,xt) :=g(yt|xt)f(xt|xt−1)qt(xt|xt−1)fort≥2.

We assume that and for and all . Pseudo-code for a standard PF is presented in Algorithm 1 where we let , with

, denote the resampling distribution, a probability distribution on

where . We say that a resampling scheme is unbiased if . All standard resampling schemes -multinomial, residual and systematic- are unbiased . This PF procedure outputs an approximation of and an approximation of the smoothing distribution . Under weak assumptions, it can be shown that , resp. , is an asymptotically consistent (in ) estimator of , resp. of . However, whereas is unbiased (, Section 7.4.1), admits an asymptotic bias of order for a constant which is typically impossible to evaluate and for which only loose bounds are available under realistic assumptions . In the following by a call to the PF, , we mean a procedure which runs Algorithm 1 and returns (dependence on observations is notationally omitted) and a sample from the approximate smoothing distribution , i.e. output with probability .

### 1.3 PIMH method

An alternative way to estimate consists of using an MCMC scheme targeting . The PIMH algorithm achieves this by building a Markov chain on an extended space admitting a stationary distribution with marginal using the PF described in Algorithm 1 as proposal distribution . Algorithm 2 provides pseudo-code for sampling the PIMH kernel, where denotes the current state of this Markov chain and means .

Validity and convergence properties of PIMH rely on viewing it as an Independent Metropolis–Hastings (IMH) sampler on an extended space. For ease of presentation, we detail this construction for an unbiased resampling scheme satisfying additionally for all . The PF of Algorithm 1 implicitly defines a distribution over particle coordinates and ancestors. We use to denote a sample from , where , and the density of is given by

 ψ(ζ)=(N∏i=1q1(xi1))T∏t=2(r(a1:Nt−1|wt−1)N∏i=1q(xit|xait−1t−1)).

We let denote the index of the ancestor particle of at generation , which may be obtained deterministically from the ancestry, using with . From , for , we express the target of the resulting IMH sampler as

 ¯π(j,ζ) =π(xj1:T)NTψ(ζ)q1(xbj11)∏Tt=2r(bjt−1|wt−1)qt(xbjtt|xbjt−1t−1)

Running a PF and sampling from corresponds to sampling from the proposal and [1, Theorem 2] shows that where for a given we understand as a deterministic map from to . As a result, samples from can be obtained asymptotically through accepting proposals with probability . We can thus estimate by averaging over iterations . As shown in [1, Theorem 6], we can also estimate by averaging over iterations , hence reusing all the particle system used to generate the accepted proposal at iteration as . As such, although Algorithms 2 and 3 are stated in terms of proposing and accepting , it is possible to consider them instead as proposing and accepting . In , it is shown that is unnecessary. We only need to use an unbiased resampling scheme to obtain a valid PIMH scheme. This is achieved by defining an alternative target on such that under and also hold. It is also established in  that one can even used adaptive resampling procedures [13, 11].

If we denote by the error of the log-likelihood estimator, i.e. , the PIMH algorithm induces a Markov chain with transition kernel given by

 {1∧exp(z′−z)}g(dz′)+{1−α(z)}δz(dz′), (1)

where is the distribution of under the law of the particle filter and is the average acceptance probability from state . Although not emphasized notationally, both and are functions of . Through an abuse of notation, we denote the invariant density of the above chain with . Such a reparameterization has also been used in previous work to analyze the PIMH and the related particle marginal MH algorithm [31, 14].

## 2 Methodology

### 2.1 Unbiased MCMC via couplings

‘Exact estimation’ methods provide unbiased estimators of expectations with respect to the stationary distribution of a Markov chain using coupling techniques [18, 24]. In the following we use these tools to couple PIMH kernels and estimate smoothing expectations, after briefly reviewing the general approach.

Consider two valued Markov chains, and , each evolving marginally according to a kernel with stationary distribution and initialized from so that for all , where denotes equality in distribution. We couple these two chains so that for , where is an almost surely finite meeting time. In this case, we see that for non-negative integers we have the following telescoping-sum decomposition

 E[h(U(t))]= E[h(U(k))+t∑l=k+1(h(U(l))−h(U(l−1)))] = E[h(U(k))+t∧(τ−1)∑l=k+1(h(U(l))−h(~U(l−1)))].

As a result, under integrability conditions which will be made precise in the sequel, taking the limit as suggests an estimator with expectation . A way to construct such chains considered in [18, 24, 21, 29] relies on sampling independently and , and then successively sampling from such that with positive probability, ensuring that both chains evolve marginally according to . Furthermore, averaging over a range of values , for some , preserves unbiasedness, suggesting the following ‘time-averaged’ unbiased estimator of

 Hk:m=1m−k+1m∑l=kh(U(l))+ τ−1∑l=k+1min(1,l−km−k+1)(h(U(l))−h(~U(l−1))) :=MCMCk:m(h)+BCk:m(h). (2)

We view as the standard MCMC sample average up to time , discarding the first iterates as ‘burn-in’. The second term, , is the ‘bias correction’ term with if . The estimator requires only that two chains be simulated until meeting at time , after which, if , only one chain must be simulated up to . By [24, Proposition 3.1], the validity of the resulting estimators is guaranteed under the following assumptions. We discuss these assumptions for PIMH in Section 2.3.

###### Assumption 1.

Each chain is initialized marginally from a distribution , evolves according to a kernel , and is such that as . Furthermore, there exist constants and such that for all .

###### Assumption 2.

The two chains are such that the meeting time satisfies , for some constants and . The chains stay together after meeting, i.e. for all .

###### Proposition 3.

 Under Assumptions 1 and 2, the estimator

obtained by these coupled chains is unbiased, has finite variance and finite expected cost.

### 2.2 Coupled PIMH

To obtain unbiased estimators of , we use the framework detailed in Section 2.1 for the transition kernel of PIMH and the corresponding invariant distribution of PIMH defined on . This requires introducing a coupling of PIMH kernels. For IMH chains, a natural choice of initialization of the chain is from the proposal distribution. We adopt this in the following, sampling both and . However, in contrast to previous constructions [24, 21, 29], our method allows the chains to couple at time through re-using the initial value as a proposal for the first chain.

We summarize the resulting procedure in Algorithm 3. To simplify presentation of the algorithm, we use the convention . Finally, although the coupling scheme is framed in terms of PIMH, it is clear that this algorithm and estimators apply equally to any IMH algorithm.

At each iteration of Algorithm 3 the proposal can be accepted by one, both or neither chains with positive probability. The meeting time corresponds to the first time the proposal is accepted by both chains. We can also return another unbiased estimator of the form (2) with replaced by as we have previously seen that . The Rao-Blackwellised estimator will typically outperform significantly when we are interested in smoothing expectations of functions of states close to . For functions of states close to the origin, e.g. , we have with high probability if is moderate because of the particle degeneracy problem [13, 23, 25]. This is illustrated in Appendix A.3.

### 2.3 Validity and meeting times

The validity of our unbiased estimators is ensured if the following weak assumptions are satisfied.

###### Assumption 4.

There exist constants and such that for all .

###### Assumption 5.

There exist constants and such that for all .

###### Assumption 6.

The resampling scheme is unbiased and there exist finite constants such that and for .

###### Proposition 7.

Under Assumptions 4 and 6, resp. Assumptions 5 and 6, the estimator , resp. , of obtained from Algorithm 3 is unbiased and has finite variance and finite expected cost.

To establish the result for , note that Assumption 4 and our construction implies Assumption 1 for . Assumption 6 provides verifiable and sufficient conditions to ensure uniform ergodicity of PIMH [1,  Theorem 3]. Hence it follows from [24, Proposition 3.4] that the geometric bound on the tails of is satisfied. Additionally, Algorithm 3 ensures that the chains stay together for all so Assumption 2 is satisfied. A similar reasoning provides the result for .

We have the following precise description of the distribution of the meeting time for Algorithm 3. Let

denote the geometric distribution on the strictly positive integers with success probability

.

###### Proposition 8.

The meeting time satisfies and . Additionally, we have under Assumption 6.

We recall here that is the average acceptance probability from state . The geometric distribution result follows from Proposition 10 in Appendix. The rest of the proposition follows from

 P[τ=1] =∫α(z)g(dz) =∬{1∧exp(z′−z)}g(dz)g(dz′).

It entails trivially that . Noting that a.s. under (which is dependent on ) under Assumption 6, see e.g. , we obtain by dominated convergence, thus .

### 2.4 Unbiased filtering

A PF generates estimates of at all times . These estimates can be used to perform one step of coupled PIMH for pairs of Markov chains, each pair corresponding to one of the smoothing distributions . This requires some additional bookkeeping to keep track of the meeting times and unbiased estimators associated with each but can be used to unbiasedly estimate expectations with respect to all filtering distributions . This was not directly feasible with coupled i-CPF in . In particular, this allows us to estimate unbiasedly the predictive likelihood terms which can be used for a goodness-of-fit test .

## 3 Analysis

### 3.1 Meeting time: large sample approximation

We investigate here the distribution of the meeting time in the large sample regime, i.e. in the interesting scenarios where is large. Under strong mixing assumptions,  showed that letting for some

then the following Central Limit Theorem (CLT) holds:

as where for some . Empirically, this CLT appears to hold for many realistic models not satisfying these strong mixing assumptions [31, 14]. Under this CLT, using similar regularity conditions as in , it can be shown that the Markov kernel defined in (1) converges in some suitable sense towards the Markov kernel given by

 {1∧exp(z′−z)}gσ(z′)dz′+{1−ασ(z)}δz(dz′),

where and denotes the average acceptance probability in . For this limiting kernel222Note that is not uniformly ergodic whereas is under Assumption 6., the following result holds.

###### Proposition 9.

[14, Corollary 3] The invariant density of is and its average acceptance probability is given by

 ασ(z):=1−Φ⎛⎜⎝z+σ22σ⎞⎟⎠+e−zΦ⎛⎜⎝z−σ22σ⎞⎟⎠,

where

denotes the cumulative distribution function of the standard Normal distribution.

Under this large sample approximation, the probability can be written as

 E[P[τ=n|Z(0)]]=∫ασ(z)(1−ασ(z))n−1gσ(z)dz, (3)

thus , Erfc denoting the complementary error function. Similarly we see that the expected meeting time is given by

. This allows us to approximate numerically expectations, quantiles and probabilities of

as a function of

, the standard deviation of

under the law of the particle filter. Figure 1 displays and as a function of . We see that for rising to for . The expected meeting time is also the expected number of iterations to return an unbiased estimator for and grows relatively benignly with . Figure 1: P[τ=1] (left) and E[τ] (right) as a function of the standard deviation σ of logpN(y1:T).

We compare the distribution of the meeting times with its large sample approximation (3) on a stationary auto-regressive (AR) model and with and on a simulated dataset of . Estimates of were obtained empirically for a range of values of and compared with the predicted values based on the estimated variance of using runs of coupled PIMH. Varying between 10 and 110, was between 0.2 and 3.0 in this range. The tail probabilities of the meeting time over this range are shown in Figure 2. Confidence intervals for the estimates of are shown in Figure 2 with error bars indicating standard deviations. We see that there is a satisfactory agreement, with predicted tail probabilities closely matching confidence intervals for each value, and larger values of leading as expected to shorter meeting times on average. Figure 2: Empirical estimates of P[τ≥n] (with ±2standard errors) and comparison with the large sample approximation for toy example.

### 3.2 On the selection of N for large m

We provide here a heuristic for selecting

to minimize the variance of at fixed computational budget when both and are large. We will minimize the computational inefficiency defined by

 C[H]:=V[Hk:m]×N

as the running time is proportional to . For sufficiently large, we expect that with very high probability and so the time to obtain a single unbiased estimator is . We note that increasing typically leads to a decreasing asymptotic variance of the ergodic averages associated with the PIMH chain and from Proposition 8 a reduction in the bias correction, however at the cost of more computation. For large , we also expect that the dominant term of will arise from and will be essentially the asymptotic variance of given by divided by , where is the Integrated Autocorrelation Time (IACT) of for the PIMH kernel. For sufficiently large, we expect that and are approximately independent under the PF proposal. By a reasoning similar to the proof of [31, Lemma 4], will then be approximately proportional to defined in Eq. (11) in . This is illustrated in Appendix A.2 where we plot for a variety of test functions for a range of against over the corresponding range of and show that they are indeed approximately proportional. Minimizing w.r.t.  is then approximately equivalent to minimizing as is typically inversely proportional to . This minimization has already been carried out in  where it was found that the minimizing argument is . Practically, this means that one should select to ensure that the standard deviation of is equal approximately to this value. The resulting value of is expected to be close to the value of minimizing , which is approximately proportional to . This is verified in Figure 3 on the AR example of Section 3.1. Note that these guidelines do not apply to for a function of states close to as it is not true that and are approximately independent. Figure 3: IF(hi)/σ2 for h1=x1, h2=xT, h3=∑txt and h4=∑tx2t as a function of σ2.

## 4 Numerical experiments

We apply the methodology to two models where the transition density of the latent process is analytically intractable but simulation from it is possible. However, this involves sampling a random number of random variables. The unbiased smoother proposed in  relies on common random numbers and it is unclear how one could implement it in this context. The coupled conditional backward sampling PF scheme proposed in  does not apply as we cannot evaluate the transition density pointwise. In both scenarios, we use the bootstrap PF with multinomial resampling, that is and , and Assumption 6 is satisfied.

### 4.1 Stochastic kinetic model

We consider a stochastic kinetic model represented by a jump Markov process introduced in . Such models describe a system of chemical reactions in continuous time, with a reaction occurring under the collision of two species at random times. The discrete number of each species describes the state with jumps representing the change in a particular species. The discrete valued state comprises a

-vector of species at each time, where one of

reactions may occur at any random time, given by hazard functions for . The effect of such a reaction is described by a stoichiometry matrix , where the instantaneous change in the number of species for a certain reaction out of a possible different reactions is encoded in element . For the prokaryotic autoregulation model and parameterisation considered in , the state is a four dimensional vector evolving according to possible reactions, for which the stoichiometry matrix is given by

 S =⎛⎜ ⎜ ⎜⎝001000−100001−220−1−11001−100−11000000⎞⎟ ⎟ ⎟⎠, f(X,c) =(c1X4X3,c2(k−X4),c3X4,c4X1, c5X2(X2−1)/2,c6X3,c7X1,c8X2)′.

Coefficients and are parameters of the model, given by and . We collect noisy observations of the latent process at regular intervals of length , i.e.

 Yt=(10000120)XΔt+ϵt,ϵti.i.d∼N(0,I2).

We fix the initial condition .

To simulate synthetic data and to run the bootstrap PF, we sample the latent process using Gillespie’s direct method , whereby the time to the next event is exponential with rate and reaction occurs with probability . The estimated survival probabilities of the meeting time , with standard errors, computed using 500 independent runs of coupled PIMH are plotted in Figure 4, along with the probabilities obtained from the large sample approximation in Section 3.1, showing good agreement between the two. In Figure 5 we display the unbiased smoothing estimators obtained by averaging the unbiased estimators obtained over 500 independent runs for and the corresponding confidence intervals. Alternative choices of can lead to improved performance at fixed computational budget [24, 29]. Figure 4: Empirical estimates of P[τ≥n] (with ±2 standard errors) and comparison with those implied by the large sample approximation for stochastic kinetic model. Figure 5: Unbiased estimates of E(X1,t|y1:T) (red) (with ±3 pointwise standard errors) and X1,t (blue) for Markov jump process.

### 4.2 Lévy-driven stochastic volatility

Introduced in , Lévy-driven stochastic volatility models provide a flexible model for the log-returns of a financial asset. Letting denote the log-return process, we have

 Yt=μ+βVt+V1/2tϵt,ϵti.i.d.∼N(0,1),

where (termed the ‘actual volatility’) is treated as a stationary stochastic process. The latent state comprises the pair of actual and spot volatility . In the terminology of , the integrated volatility is the integral of the spot volatility and the actual volatility is an increment of the integrated volatility over some unit time. Initializing , the process evolves through sampling the following random variables and recursing the state according to

 K ∼Poisson(λξ2/ω2),C1:Ki.i.d.∼U[t−1,t], E1:K i.i.d.∼Exp(ξ/ω2),Wt=e−λWt−1+K∑j=1e−λ(t−Cj)Ej, Vt =1λ(Wt−1−Wt+K∑j=1Ej).

In particular, conditionally on , simulation of requires a random number of random numbers sampled at each iteration. The parameters denote the stationary mean and variance of the spot volatility respectively, describes the exponential decay of autocorrelations, denotes the risk premium for excess volatility and the drift of the log-return. In the following we perform unbiased smoothing of using data from the S&P 500 index used in . A summary of the data and parameter inference is included in Appendix A.4. (a) Empirical (±2 standard errors) and predicted tail probabilities implied by large-sample approximation P[τ≥n].