 # Central Limit Theorems for Coupled Particle Filters

In this article we prove a new central limit theorem (CLT) for coupled particle filters (CPFs). CPFs are used for the sequential estimation of the difference of expectations w.r.t. filters which are in some sense close. Examples include the estimation of the filtering distribution associated to different parameters (finite difference estimation) and filters associated to partially observed discretized diffusion processes (PODDP) and the implementation of the multilevel Monte Carlo (MLMC) identity. We develop new theory for CPFs and based upon several results, we propose a new CPF which approximates the maximal coupling (MCPF) of a pair of predictor distributions. In the context of ML estimation associated to PODDP with discretization Δ_l we show that the MCPF and the approach in Jasra et al. (2018) have, under assumptions, an asymptotic variance that is upper-bounded by an expression that is (almost) O(Δ_l), uniformly in time. The O(Δ_l) rate preserves the so-called forward rate of the diffusion in some scenarios which is not the case for the CPF in Jasra et al (2017).

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The filtering problem is ubiquitous in statistics, applied probability and applied mathematics, with far reaching applications in weather prediction, finance and engineering; see [4, 7] for example. In most cases of practical interest, the filter must be numerically approximated and a popular method for doing so is the particle filter (see e.g. [4, 8] and the references therein). The PF generates samples in parallel and uses a combination of sampling, importance sampling and resampling to approximate the filter. There is a substantial literature on the convergence of PFs (e.g. ) and in particular there are CLTs which allow one to understand the errors associated to estimation. Under assumptions, the associated asymptotic variance is bounded uniformly in time; see e.g. .

In this article, we are concerned with the filtering problem where one seeks to estimate the difference of expectations of two different but ‘close’ filters. As an example, if one observes data in discrete and regular time points, associated to an unobserved diffusion process. In many cases, one must time-discretize the diffusion process, if the transition density is not available up-to a non-negative unbiased estimator. In such scenarios it is well known that the cost of estimating the filter using a PF can be significantly reduced using a collapsing sum representation of the expectation associated to the filter with the most precise discretization and estimating each difference independently using a coupled particle filter. In other applications, one can approximate differences of expectations of filters with different parameter values as a type of finite difference approximations; see for instance

[16, 27].

The CPF developed in  (see also [16, 17, 21, 22, 27]

) is used in several applications as discussed above and various other contexts. It consists of a particle filter which runs on the product space of the two filters. The sampling operation often consists of simulating from a coupling of the Markov transitions of the hidden dynamics, which are often available in applications. Resampling proceeds by sampling the maximal coupling associated to the probability distributions of particle indices. The use of correlating the PFs is vital, for instance in ML applications (see

[13, 15, 21]), as it is this property which allows a variance reduction relative to a single PF. As has been noted by several authors, e.g. , unless the coupling of the Markov transitions is particularly strong, one expects that the maximally coupled resampling operation to ultimately decorrelate the pairs of particles exponentially fast in time. As a result, the benefits of running such algorithms may have a minimal effect for long time intervals.

In this article we consider four CPFs. The first is the case where the resampling is independent for each pair (independent resampliing CPF (IRCPF)), the second with the maximally coupled resampling (the maximally coupled resampling PF (MCRPF)). The third algorithm, which to our knowledge is new, is based upon a weak law of large numbers for the MCRPF on the product space. This result shows that the limiting coupling on product space does not correspond to the maximal coupling of the filter (or predictor). This coupling does not seem to have any optimality properties, so we suggest a new CPF, the MCPF which approximates the maximal coupling of the predictors. This algorithm requires that the Markov transition of the filter to have a density which is known pointwise and essentially samples from the maximal coupling of particle filters. The fourth algorithm in



is based on multinomial resampling which uses the same uniform random variable for each particle pair; we call this the Wasserstein CPF (WCPF). In general, all four algorithms can be used in each of the examples, with the constraint for the MCPF mentioned above. We remark that there are CPFs in

[14, 27], but they are not considered here as they require even more mathematical complexity for their analysis.

We prove a CLT for the first three algorithms (the CLT for the WCPF, when the state-dynamics are one dimensional is in ) associated to the difference of estimates of expectations of the same function w.r.t. the predictors, which is extended to multiple dimensions. The asymptotic variance expression is directly related to the properties of the limiting coupling one approximates. Under assumptions (of the type in ), for the PODDP with (Euler) discretization , we show that the MCPF (resp. WCPF) has an asymptotic variance that is upper-bounded by an expression that is (resp. , arbitrarily close to, but not equal to, zero), uniformly in time, which preserves the so-called forward rate of the diffusion in some scenarios. This is reassuring as it shows that filtering for difference estimation in the PODDP case can be effectively performed. This time and rate stability is associated to the fact that the limiting coupling on product spaces are associated to the optimal and Wasserstein couplings of the predictor/filter. For the IRCPF one does not recover the coupling rate of the diffusion process and this poor performance is well-known in the literature. In the case of the MCRPF we show even in a favourable case, that it can have an asymptotic variance, at time , that is and identify when one can expect the algorithm to work well. As was seen in the empirical results of [19, 21] the time and rate behaviour of the MCRPF in general does not seem to be as good as for the MCPF and WCPF. The assumptions used for our asymptotic variance results are also verified in a real example. Our CLTS are, to the best of our knowledge, the first results of these types for CPFs and require non-standard proofs. To summarize, the main results of the article are:

• Theorem 3.1 gives a WLLN for the MCRPF.

• Theorem 4.1 gives a CLT for the IRCPF, MRCPF & MCPF.

• Theorem 4.3 gives a general bound on the asymptotic variance for each of the methods, IRCPF, MRCPF, MCPF & WCPF.

• Propositions 5.2 and 5.3 give the time uniform bounds on the asymptotic variance for the MCPF and WCPF noted above (i.e. for PODDPs).

This paper is structured as follows. In Section 2 we give our notations, models and the motivating example of a PODDP with MLMC is given. In Section 3 the algorithms are presented. In Section 4 our theoretical results are stated. Our CLTs are given and a general bound on the asymptotic variance is provided. In Section 5 our results are applied to a pratical model in the context of using coupled particle filters for PODDP with MLMC. The article is summarized in Section 6. The appendix contains the proofs of our theoretical results.

## 2 Notation and Models

### 2.1 Notations

Let be a measurable space. For a given function we denote by the class of functions for which

 ∥φ∥v:=supx∈X|φ(x)|v(x)<+∞ .

When we write . If we write , for the bounded measurable and continuous, bounded measurable functions respectively. are the collection of twice continuously differentiable real-valued functions on . Let be a metric on then for , we say that if there exist a such that for every :

 |φ(x)−φ(y)|≤Cd(x,y)v(x)v(y).

If , we write and write for the Lipschitz constant. denotes the collection of probability measures on . We also denote, for , . For a measure on and a integrable, , the notation is used. Let be a non-negative kernel and be a measure then we use the notations and for integrable, For , the total variation distance is denoted . For the indicator is written and the dirac measure . For two measures on , the product measure is . For two measurable functions , on

, the tensor product of functions is

.

denotes the uniform distribution on the set

. is the

dimensional Gaussian distribution of mean

and covariance (if the subscript is dropped from ). and are used to denote probability and expectation w.r.t. the law of the specified algorithm - the context will be clear in each instant. and are used to denote convergence in distribution and probability respectively. In the context of the article, this is as .

### 2.2 Models

Let be a measurable space and a sequence of non-negative, bounded and measurable functions such that . Let and , be two sequences of Markov kernels, i.e. , . Define, for ,

 γsn(B)=∫Xn+1IB(xn)(n−1∏p=0Gsp(xp))ηs0(dx0)n∏p=1Msp(xp−1,dxp)

and

 ηsn(B)=γsn(B)γsn(1).

The objective is to consider Monte Carlo type algorithms which will approximate quantities, for any , such as

 ηfn(φ)−ηcn(φ) (1)

or

 ηfn(Gnφ)ηfn(Gn)−ηcn(Gnφ)ηcn(Gn). (2)

(1) corresponds to a predictor of a state-space model and (2) the filter. We focus explicitly on the predictor from herein.

###### Remark 2.1.

We can make the also depend on , which may be of importance in applictations. In the subsequent development, this is not done, but could be at a cost of slightly longer mathematical arguments and notational complications.

The major point is that one would like to approximate couplings of , say i.e. that for any and every

 ˇηn(B×X)=ηfn(B)ˇηn(X×B)=ηcn(B)

and consider approximating

 ˇηn(φ⊗1)−ˇηn(1⊗φ)

An explanation of why coupling the pairs is of interest has been given in the introduction and will be further illuminated in Section 2.3.

Throughout the article, it is assumed that there exists such that for any

 ˇη0(B×X)=ηf0(B)ˇη0(X×B)=ηc0(B)

and moreover for any there exists Markov kernels , such that for any , :

 ˇMn(B×X)(x,x′)=Mfn(B)(x)ˇMn(X×B)(x,x′)=Mcn(B)(x′).

### 2.3 Example

The following example is from  and there is some overlap with the presentation in that article. We start with a diffusion process:

 dZt = a(Zt)dt+b(Zt)dWt (3)

with , (th element denoted ), (th element denoted ), and a dimensional Brownian motion. The following assumptions are made and if referred to as (D). (D) is assumed explicitly. We set .

The coefficients , for . Also, and satisfy

• uniform ellipticity: is uniformly positive definite;

• globally Lipschitz: there is a such that for all , .

The data are observed at regular unit time-intervals (i.e. in discrete time) , . It is assumed that conditional on , is independent of all other random variables with density . Let

be the transition of the diffusion process (over unit time) and consider a discrete-time Markov chain

with initial distribution and transition . Here we are creating a discrete-time Markov chain that corresponds to the discrete-time skeleton of the diffusion process at a time lag of 1. Now we write instead of . Then we define, for

 γn(B):=∫Xn+1IB(xn)(n−1∏p=0Gp(xp))M(x∗,dx0)n∏p=1M(xp−1,dxp).

The predictor is which corresponds to the distribution associated to .

In many applications, one must time discretize the diffusion to use the model in practice. We suppose an Euler discretization with discretization , and write the associated transition kernel over unit time as . Note that, in practice, one may not be able to compute the density of the kernel as it is a compositon of Gaussians, however, one can certainly sample from in most cases. Hence we are interested in the Feynman-Kac model for

 γln(B):=∫Xn+1IB(xn)(n−1∏p=0Gp(xp))Ml(x∗,dx0)n∏p=1Ml(xp−1,dxp).

with associated predictor .

Below, we will explain why one may wish to compute, for , , . That is, as used above, relates to the predictor associated to the discretization and , the predictor with discretization . A natural coupling of and exists (e.g. ) so one also has a given and .

Before continuing, we note some results which will help in our discussion below. As established in [21, eq. (32)] one has for

 supAsupx∈X|Mfn(φ)(x)−Mcn(φ)(x)|≤CΔl (4)

where , is the norm. In addition, [21, Proposition D.1.] states that for , and any

 ∫X×X∥u−v∥pˇMn((x,y),d(u,v))≤C(|x−y|+Δ1/2l)p. (5)

When ( is the norm), the term is the so-called forward strong error rate. In the proof of [21, Theorem 4.3], it is shown that for any there exists a such that for , one has

 |ηn(φ)−ηln(φ)|≤CΔl. (6)

Note that this bound can be deduced from (4), but that may depend on ; this latter point is ignored for now.

#### 2.3.1 Multilevel Monte Carlo

Suppose one can exactly sample from for any . The Monte Carlo estimate of , set , is then of course , are i.i.d. from . One has the mean square error (MSE)

 E[(1NN∑i=1φ(xin)−ηn(φ))2]=Varηln[φ(X)]N+|ηn(φ)−ηln(φ)|2

where is the variance of w.r.t. . Then, for given, to target a mean square error of , by (6), one chooses (as one sets ). Then one must choose and we suppose the cost of simulating one sample is (see  for a justification), again ignoring . Then the cost of achieving a MSE of is hence .

Now, one has that for any , the multilevel identity [13, 15]

 ηLn(φ)=η0n(φ)+L∑l=1{[ηln−ηl−1n](φ)}.

Suppose that, for , it is possible to exactly sample a coupling of , denote it , so that one has

 ∫X×X∥x−y∥2ˇηln(d(x,y))≤CΔl

and the cost of such a simulation is . The rate has been taken from the strong error rate in (5). Now to estimate one simulates i.i.d. from for some and the approximation is . For this is repeated independently for estimating and the case is performed, independently, using the Monte Carlo method above. One can then easily show that the MSE of the estimate of is upper-bounded by (recall ) by

 Varηln[φ(X)]N0+CL∑l=1ΔlNl+|ηn(φ)−ηln(φ)|2.

Then, for given, to target a mean square error of , set , and (see ), then the MSE is . The overall computational effort is ; if is suitably small this is a significant reduction in computational effort relative to the MC method above.

The main point here is that, at present, there are no computational methods to perform the exact simulation mentioned and thus we focus on particle filters, developed in the literature. The estimates (variance/cost) above typically depend on and our objective is to consider if the variance of PF estimates can be uniformly in time and hence that the ML gain is retained uniformly in time. We focus on the asymptotic variance in the CLT (versus the finite sample variance) as this can be more straightforward to deal with.

## 3 Algorithms

### 3.1 Independent Pair Resampling

The first procedure we consider is as follows. Let , and and define the probability measure:

 ˇΦIn(μ)(B)=μ([Gn−1⊗Gn−1]ˇMn(B))μ(Gn−1⊗Gn−1).

Set then consider the joint probability measure on

 P(d(u0,…,un))=ˇη0(du0)n∏p=1ˇΦIp(ηfp−1⊗ηcp−1)(dup).

It is easily checked that the marginal of (resp. ) is (resp. ). Denote by the marginal of induced by this joint probability measure. Thus, if one could sample a trajectory of from one could easily approximate quantities such as (1) or (2) using Monte Carlo methods. In most practical applications of interest, this is not possible.

The particle approximation is taken as:

 P(d(u1:N0,…,u1:Nn))=(N∏i=1ˇη0(dui0))(n∏p=1N∏i=1ˇΦIp(ηN,fp−1⊗ηN,cp−1)(duip))

where for ,

 ηN,sp−1(dx)=1NN∑i=1δxi,sp−1(dx).

The key point is that in this algorithm, the resampled indices for a pair of particles are generated (conditionally) independently. Set

 ˇηN,Ip(du)=1NN∑i=1δuip(du).

As has been mentioned by many authors (e.g. ) one does not expect this procedure to effective, in the sense that would not provide an appropriate dependence between .

### 3.2 Maximally Coupled Resampling

Let , and and define the probability measure:

 ˇΦMn(μ)(B) = μ({Fn−1,μ,f∧Fn−1,μ,c}ˇMn(B))+(1−μ({Fn−1,μ,f∧Fn−1,μ,c}))× (μ⊗μ)({¯¯¯¯Fn−1,μ,f⊗¯¯¯¯Fn−1,μ,c}¯Mn(B))

where for

 ¯¯¯¯Fn−1,μ,f(x,y) = Fn−1,μ,f(x,y)−{Fn−1,μ,f(x,y)∧Fn−1,μ,c(x,y)}μ(Fn−1,μ,f−{Fn−1,μ,f∧Fn−1,μ,c}) ¯¯¯¯Fn−1,μ,c(x,y) = Fn−1,μ,c(x,y)−{Fn−1,μ,f(x,y)∧Fn−1,μ,c(x,y)}μ(Fn−1,μ,c−{Fn−1,μ,f∧Fn−1,μ,c}) Fn−1,μ,f(x,y) = ˇGn−1,μ,f(x)⊗1
 Fn−1,μ,c(x,y) = 1⊗ˇGn−1,μ,c(y) ˇGn−1,μ,f(x) = Gn−1(x)μ(Gn−1⊗1) ˇGn−1,μ,c(y) = Gn−1(y)μ(1⊗Gn−1)

and for and

 ¯Mn(B)((x,y),(u,v))=ˇMn(B)(x,v).

Now set and we define, recursively, for any ,

 ˇηMn(B)=ˇΦMn(ˇηMn−1)(B).

Note that by [21, Proposition A.1.] we have for

 ˇηMn(B×X)=ηfn(B)andˇηMn(X×B)=ηcn(B).

The algorithm used here is:

 P(d(u1:N0,…,u1:Nn))=(N∏i=1ˇη0(dui0))(n∏p=1N∏i=1ˇΦMp(ˇηN,Mn−1)(duip))

where for ,

 ˇηN,Mp−1(du)=1NN∑i=1δuip(du).

and we set for

 ηN,sp−1(dx)=1NN∑i=1δxi,sp−1(dx).

This procedure is as in  and adopted in, for instance, [17, 21]. The idea is to provide a local optimality procedure w.r.t. the resampling operation. This is in the sense that for any fixed and conditional on the information generated so far, one will maximize the probability that the resampled indices are equal.

For the MCRPF, we present a preliminary result, which will prove to be of interest. The following assumptions are used and note that in (A3.2) there is a metric on implicit in the assumption.

• For every , .

• For every , is Feller.

• is a locally compact and separable metric space.

The assumptions are adopted due to the complexity of the operator . As can be seen in Appendix C, where the proofs for the following result are given, it is non-trivial to work with . To weaken these assumptions would lead to further calculations, which would essentially confirm the same result.

###### Theorem 3.1.

Assume (A3.2-3.2). Then for any , we have

 ˇηN,Mn(φ)→PˇηMn(φ).

What is interesting here, is that on the product space the target is a coupling of , but the actual maximal coupling of is not sampled from. As is well known, the maximal coupling will maximize the probability that two random variables are equal, with specified marginals and is the optimal coupling of two probability measures w.r.t. the Wasserstein distance. If this former property is desirable from a practical perspective, then the above algorithm should not be used. The maximially coupled resampling operation is, for a finite number of samples (particles), the optimal (in the above sense) way to couple the resampling operation, but, may not lead to large sample ‘good’ couplings. This is manifested in  where the forward error rate (5) is lost for the diffusion problem in Section 2.3. As mentioned above, the limit is a coupling of , but in general there is no reason to suspect that it is optimal in any sense.

### 3.3 Maximal Coupling

We now present an algorithm which can sample (in the limit) from the maximal coupling of . We will assume that for the Markov kernels , as well as admit a density w.r.t. a finite measure . The densities are denoted and , and we assume that the densities can be evaluated numerically. To remove this latter requirement is left to future work.

Let , and and define the probability measure:

 ˇΦCn(μ,ν)(φ) = ∫X×XIB(x,y)[∫XFn−1,μ,f(u)∧Fn−1,ν,c(u)δ{u,u}(d(x,y))du+ 11−∫XFn−1,μ,f(u)∧Fn−1,ν,c(u)du¯¯¯¯Fn−1,μ,ν,f(x)¯¯¯¯Fn−1,ν,μ,c(y)dxdy]

where, for

 ¯¯¯¯Fn−1,μ,ν,f(x) = Fn−1,μ,f(x)−Fn−1,μ,f(x)∧Fn−1,ν,c(x) ¯¯¯¯Fn−1,ν,μ,c(x) = Fn−1,ν,c(x)−Fn−1,μ,f(x)∧Fn−1,ν,c(x) Fn−1,μ,f(x) = μ(Gn−1Mfn(⋅,x))μ(Gn−1) Fn−1,ν,c(x) = ν(Gn−1Mcn(⋅,x))ν(Gn−1).

Now set as the maximal coupling of and set for

 ˇηCn(B)=ˇΦCn(ηfn−1,ηcn−1)(B).

We have for

 ˇηCn(B×X)=ηfn(B)andˇηCn(X×B)=ηcn(B).

Moreover is the maximal coupling of .

The particle approximation used is:

 P(d(u1:N0,…,u1:Nn))=(N∏i=1ˇηC0(dui0))(n∏p=1N∏i=1ˇΦCp(ηN,fp−1,ηN,cp−1)(duip))

where for ,

 ηN,sp−1(dx)=1NN∑i=1δxi,sp−1(dx).

We set for , , ,

 Φsn(μ)(B)=μ(Gn−1Msn(B))μ(Gn−1).

We remark that as , and , , can be evaluated numerically we can sample from and , the latter of which is the maximal coupling of and , using the algorithm in . This is as follows:

1. Sample and . If , output , otherwise go-to 2..

2. Sample and , if , output , otherwise start 2. again.

This algorithm is a rejection sampler, which would add a random running time element per time-step, which may not be desirable for some applications. In the limiting case (i.e. sampling the maximal coupling of ) the expected number of steps to acceptance is

 1+∫X|ηcn(x)−ηfn(x)|dx2∫X[ηcn(x)−ηfn(x)]I{x∈X:ηcn(x)≥ηfn(x)}(x)dx=1+Rn.

One might expect that above is in applications. We expect to be close and possibly that is more diffuse than , for instance that , for some constant , hence one may find . In general, one might want to confirm this conjecture before implementing this approach.

### 3.4 Wasserstein Coupled Resampling

We describe the WCPF used in . For this case we restrict our attention to the case that

. It is explicitly assumed that the cumulative distribution function (CDF) and its inverse associated to the probability,

, ,

 ¯¯¯ηsn(B)=ηsn