The filtering problem is ubiquitous in statistics, applied probability and applied mathematics, with far-reaching applications in weather prediction, finance and engineering; see [4, 7] for example. In most cases of practical interest, the filter must be numerically approximated and a popular method for doing so is the particle filter (PF; see e.g. [4, 8] and the references therein). The PF generates samples in parallel and uses a combination of sampling, importance sampling and resampling to approximate the filter. There is a substantial literature on the convergence of PFs (e.g. ) and in particular there are CLTs which allow one to understand the errors associated with estimation. Under assumptions, the associated asymptotic variance is bounded uniformly in time; see e.g. .
In this article, we are concerned with the filtering problem where one seeks to estimate the difference of expectations of two different but ‘close’ filters. As an example, suppose one observes data at discrete and regular time points, associated with an unobserved diffusion process. In many cases, one must time-discretize the diffusion process if the transition density is not available up to a non-negative unbiased estimator. In such scenarios it is well known that the cost of estimating the filter using a PF can be significantly reduced by using a collapsing-sum representation of the expectation associated with the filter with the most precise discretization and estimating each difference independently using a coupled particle filter. In other applications, one can approximate differences of expectations of filters with different parameter values as a type of finite-difference approximation; see for instance [16, 27].
The coupled particle filter (CPF) is used in several applications as discussed above and in various other contexts. It consists of a particle filter which runs on the product space of the two filters. The sampling operation often consists of simulating from a coupling of the Markov transitions of the hidden dynamics, which are often available in applications. Resampling proceeds by sampling from the maximal coupling associated with the probability distributions of the particle indices. Correlating the PFs is vital, for instance in ML applications (see [13, 15, 21]), as it is this property which allows a variance reduction relative to a single PF. As has been noted by several authors, e.g. , unless the coupling of the Markov transitions is particularly strong, one expects the maximally coupled resampling operation to ultimately decorrelate the pairs of particles exponentially fast in time. As a result, the benefits of running such algorithms may be minimal over long time intervals.
In this article we consider four CPFs. The first is the case where the resampling is independent for each pair (the independent resampling CPF (IRCPF)); the second uses the maximally coupled resampling (the maximally coupled resampling PF (MCRPF)). The third algorithm, which to our knowledge is new, is based upon a weak law of large numbers for the MCRPF on the product space. This result shows that the limiting coupling on the product space does not correspond to the maximal coupling of the filter (or predictor). This coupling does not seem to have any optimality properties, so we suggest a new CPF, the MCPF, which approximates the maximal coupling of the predictors. This algorithm requires the Markov transition of the filter to have a density which is known pointwise and essentially samples from the maximal coupling of particle filters. The fourth algorithm in
is based on multinomial resampling which uses the same uniform random variable for each particle pair; we call this the Wasserstein CPF (WCPF). In general, all four algorithms can be used in each of the examples, subject to the constraint for the MCPF mentioned above. We remark that there are CPFs in [14, 27], but they are not considered here as they require even more mathematical complexity for their analysis.
We prove a CLT for the first three algorithms (the CLT for the WCPF, when the state-dynamics are one-dimensional, is in ) associated with the difference of estimates of expectations of the same function w.r.t. the predictors, which is extended to multiple dimensions. The asymptotic variance expression is directly related to the properties of the limiting coupling one approximates. Under assumptions (of the type in ), for the PODDP with (Euler) discretization , we show that the MCPF (resp. WCPF) has an asymptotic variance that is upper-bounded by an expression that is (resp. , arbitrarily close to, but not equal to, zero), uniformly in time, which preserves the so-called forward rate of the diffusion in some scenarios. This is reassuring as it shows that filtering for difference estimation in the PODDP case can be effectively performed. This time and rate stability is associated with the fact that the limiting couplings on the product space are associated with the optimal and Wasserstein couplings of the predictor/filter. For the IRCPF one does not recover the coupling rate of the diffusion process and this poor performance is well known in the literature. In the case of the MCRPF we show, even in a favourable case, that it can have an asymptotic variance, at time , that is , and we identify when one can expect the algorithm to work well. As was seen in the empirical results of [19, 21], the time and rate behaviour of the MCRPF in general does not seem to be as good as for the MCPF and WCPF. The assumptions used for our asymptotic variance results are also verified in a real example. Our CLTs are, to the best of our knowledge, the first results of these types for CPFs and require non-standard proofs. To summarize, the main results of the article are:
This paper is structured as follows. In Section 2 we give our notation and models, and present the motivating example of a PODDP with MLMC. In Section 3 the algorithms are presented. In Section 4 our theoretical results are stated: our CLTs are given and a general bound on the asymptotic variance is provided. In Section 5 our results are applied to a practical model in the context of using coupled particle filters for PODDP with MLMC. The article is summarized in Section 6. The appendix contains the proofs of our theoretical results.
2 Notation and Models
Let be a measurable space. For a given function we denote by the class of functions for which
When we write . If we write , for the bounded measurable and continuous, bounded measurable functions respectively. are the collection of twice continuously differentiable real-valued functions on . Let be a metric on then for , we say that if there exists a such that for every :
If , we write and write for the Lipschitz constant. denotes the collection of probability measures on . We also denote, for , . For a measure on and a integrable, , the notation is used. Let be a non-negative kernel and be a measure then we use the notations and for integrable, For , the total variation distance is denoted . For the indicator is written and the Dirac measure . For two measures on , the product measure is . For two measurable functions , on
, the tensor product of functions is.
denotes the uniform distribution on the set. is the
dimensional Gaussian distribution of mean and covariance (if the subscript is dropped from ). and are used to denote probability and expectation w.r.t. the law of the specified algorithm; the context will be clear in each instance. and are used to denote convergence in distribution and in probability respectively. In the context of the article, this is as .
Let be a measurable space and a sequence of non-negative, bounded and measurable functions such that . Let and , be two sequences of Markov kernels, i.e. , . Define, for ,
The objective is to consider Monte Carlo type algorithms which will approximate quantities, for any , such as
We can make the also depend on , which may be of importance in applications. In the subsequent development, this is not done, but it could be done at the cost of slightly longer mathematical arguments and notational complications.
The major point is that one would like to approximate couplings of , say i.e. that for any and every
and consider approximating
An explanation of why coupling the pairs is of interest has been given in the introduction and will be further illuminated in Section 2.3.
Throughout the article, it is assumed that there exists such that for any
and moreover for any there exists Markov kernels , such that for any , :
The following example is from  and there is some overlap with the presentation in that article. We start with a diffusion process:
with , (th element denoted ), (th element denoted ), and a dimensional Brownian motion. The following assumptions are made and are referred to as (D); whenever (D) is assumed, this is stated explicitly. We set .
The coefficients , for . Also, and satisfy
uniform ellipticity: is uniformly positive definite;
globally Lipschitz: there is a such that for all , .
The data are observed at regular unit time-intervals (i.e. in discrete time) , . It is assumed that conditional on , is independent of all other random variables with density . Let
be the transition of the diffusion process (over unit time) and consider a discrete-time Markov chain with initial distribution and transition . Here we are creating a discrete-time Markov chain that corresponds to the discrete-time skeleton of the diffusion process at a time lag of 1. Now we write instead of . Then we define, for
The predictor is which corresponds to the distribution associated to .
In many applications, one must time-discretize the diffusion to use the model in practice. We suppose an Euler discretization with discretization , and write the associated transition kernel over unit time as . Note that, in practice, one may not be able to compute the density of the kernel as it is a composition of Gaussians; however, one can certainly sample from in most cases. Hence we are interested in the Feynman-Kac model for
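As a minimal illustration, the Euler transition over unit time can be sketched as follows. This is a hypothetical sketch, not the paper's notation: the function name `euler_kernel`, the vector-valued state with diagonal (element-wise) noise, and the drift/diffusion arguments `a` and `sigma` are all illustrative assumptions.

```python
import numpy as np

def euler_kernel(x, a, sigma, delta, rng):
    """One unit of time of an Euler discretisation with step size delta.

    x     : current state (1-d numpy array)
    a     : drift coefficient, a(x) -> array like x
    sigma : diffusion coefficient, sigma(x) -> array like x (diagonal noise)
    """
    n_steps = int(round(1.0 / delta))  # number of Euler steps per unit time
    for _ in range(n_steps):
        dw = rng.normal(scale=np.sqrt(delta), size=x.shape)  # Brownian increment
        x = x + a(x) * delta + sigma(x) * dw
    return x
```

One can only sample from this kernel; evaluating its density would require integrating out the intermediate steps of the composition of Gaussians, which is exactly the difficulty noted above.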
with associated predictor .
Below, we will explain why one may wish to compute, for , , . That is, as used above, relates to the predictor associated to the discretization and , the predictor with discretization . A natural coupling of and exists (e.g. ) so one also has a given and .
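The natural coupling of the two discretization levels can be sketched by driving the fine and coarse Euler chains with the same Brownian path (a synchronous coupling, with each coarse increment the sum of two fine increments). All names here are illustrative; the element-wise noise is an assumption made for brevity.

```python
import numpy as np

def coupled_euler_kernel(x_f, x_c, a, sigma, delta, rng):
    """One unit of time of synchronously coupled Euler kernels.

    The fine chain x_f uses step delta; the coarse chain x_c uses step
    2*delta and is driven by the summed fine Brownian increments.
    """
    n_coarse = int(round(1.0 / (2.0 * delta)))
    for _ in range(n_coarse):
        dw1 = rng.normal(scale=np.sqrt(delta), size=x_f.shape)
        dw2 = rng.normal(scale=np.sqrt(delta), size=x_f.shape)
        # two fine steps
        x_f = x_f + a(x_f) * delta + sigma(x_f) * dw1
        x_f = x_f + a(x_f) * delta + sigma(x_f) * dw2
        # one coarse step driven by the summed increment
        x_c = x_c + a(x_c) * 2.0 * delta + sigma(x_c) * (dw1 + dw2)
    return x_f, x_c
```

Because both chains see the same randomness, the pair stays close in mean square, which is the property quantified by the strong error rate discussed below.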
Before continuing, we note some results which will help in our discussion below. As established in [21, eq. (32)] one has for
where , is the norm. In addition, [21, Proposition D.1.] states that for , and any
When ( is the norm), the term is the so-called forward strong error rate. In the proof of [21, Theorem 4.3], it is shown that for any there exists a such that for , one has
Note that this bound can be deduced from (4), but that may depend on ; this latter point is ignored for now.
2.3.1 Multilevel Monte Carlo
Suppose one can exactly sample from for any . The Monte Carlo estimate of , set , is then of course , where are i.i.d. from . One has the mean square error (MSE)
where is the variance of w.r.t. . Then, for given, to target a mean square error of , by (6), one chooses (as one sets ). One must then choose and we suppose the cost of simulating one sample is (see  for a justification), again ignoring . The cost of achieving an MSE of is hence .
Suppose that, for , it is possible to exactly sample a coupling of , denote it , so that one has
and the cost of such a simulation is . The rate has been taken from the strong error rate in (5). Now to estimate one simulates i.i.d. from for some and the approximation is . For this is repeated independently for estimating and the case is performed, independently, using the Monte Carlo method above. One can then easily show that the MSE of the estimate of is upper-bounded (recall ) by
Then, for given, to target a mean square error of , one sets , and (see ); the MSE is then . The overall computational effort is ; if is suitably small this is a significant reduction in computational effort relative to the MC method above.
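The multilevel construction described above can be sketched as follows, assuming black-box samplers for level 0 and for each coupled pair. The names `sample0`, `sample_pair` and `phi` are illustrative placeholders, not objects defined in the paper.

```python
import numpy as np

def mlmc_estimate(phi, sample0, sample_pair, L, N, rng):
    """Multilevel Monte Carlo estimator via the collapsing (telescoping) sum.

    phi         : test function
    sample0     : sample0(rng) draws from the coarsest (level-0) law
    sample_pair : sample_pair(l, rng) draws a coupled (fine, coarse) pair
                  from the level-l coupling, for l = 1, ..., L
    N           : list of sample sizes N[0], ..., N[L]
    """
    # level 0: ordinary Monte Carlo
    est = np.mean([phi(sample0(rng)) for _ in range(N[0])])
    # levels 1..L: independent coupled-difference corrections
    for l in range(1, L + 1):
        diffs = []
        for _ in range(N[l]):
            x_fine, x_coarse = sample_pair(l, rng)
            diffs.append(phi(x_fine) - phi(x_coarse))
        est += np.mean(diffs)
    return est
```

The variance gain comes entirely from the coupling inside `sample_pair`: if the fine and coarse samples are close, each difference has small variance and the levels need few samples.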
The main point here is that, at present, there are no computational methods to perform the exact simulation mentioned and thus we focus on particle filters, as developed in the literature. The estimates (variance/cost) above typically depend on and our objective is to consider whether the variance of PF estimates can be uniformly in time and hence whether the ML gain is retained uniformly in time. We focus on the asymptotic variance in the CLT (versus the finite sample variance) as this can be more straightforward to deal with.
3.1 Independent Pair Resampling
The first procedure we consider is as follows. Let , and and define the probability measure:
Set then consider the joint probability measure on
It is easily checked that the marginal of (resp. ) is (resp. ). Denote by the marginal of induced by this joint probability measure. Thus, if one could sample a trajectory of from one could easily approximate quantities such as (1) or (2) using Monte Carlo methods. In most practical applications of interest, this is not possible.
The particle approximation is taken as:
where for ,
The key point is that in this algorithm, the resampled indices for a pair of particles are generated (conditionally) independently. Set
As has been mentioned by many authors (e.g. ), one does not expect this procedure to be effective, in the sense that would not provide an appropriate dependence between .
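A minimal sketch of this (conditionally) independent pair resampling, assuming multinomial resampling with unnormalised weight vectors (function and argument names are illustrative):

```python
import numpy as np

def independent_pair_resample(w1, w2, rng):
    """Draw ancestor indices for the two particle systems independently:
    one multinomial draw per system, with no attempt to correlate them."""
    n = len(w1)
    i1 = rng.choice(n, size=n, p=w1 / w1.sum())
    i2 = rng.choice(n, size=n, p=w2 / w2.sum())
    return i1, i2
```

Even if the sampled pairs were perfectly correlated before resampling, these independent index draws break the pairing, which is why the IRCPF is not expected to preserve the coupling.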
3.2 Maximally Coupled Resampling
Let , and and define the probability measure:
and for and
Now set and we define, recursively, for any ,
Note that by [21, Proposition A.1.] we have for
The algorithm used here is:
where for ,
and we set for
This procedure is as in  and adopted in, for instance, [17, 21]. The idea is to provide a local optimality procedure w.r.t. the resampling operation. This is in the sense that for any fixed and conditional on the information generated so far, one will maximize the probability that the resampled indices are equal.
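For a single pair, the maximal coupling of the two index distributions can be sketched via the standard overlap/residual decomposition. This is an illustrative sketch: the function name and the assumption of unnormalised weight vectors are ours.

```python
import numpy as np

def maximally_coupled_indices(w1, w2, rng):
    """One draw from the maximal coupling of the two normalised weight
    vectors p and q: the indices agree with probability sum_i min(p_i, q_i)."""
    p, q = w1 / w1.sum(), w2 / w2.sum()
    m = np.minimum(p, q)          # overlap component
    alpha = m.sum()               # probability the indices agree
    if rng.uniform() < alpha:
        i = rng.choice(len(p), p=m / alpha)
        return i, i
    # otherwise sample the two residual distributions independently
    r1 = (p - m) / (1.0 - alpha)
    r2 = (q - m) / (1.0 - alpha)
    return rng.choice(len(p), p=r1), rng.choice(len(q), p=r2)
```

This is locally optimal in the sense described above: no coupling of the two index distributions can make the indices agree with probability larger than `alpha`.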
For the MCRPF, we present a preliminary result, which will prove to be of interest. The following assumptions are used and note that in (A3.2) there is a metric on implicit in the assumption.
For every , .
For every , is Feller.
is a locally compact and separable metric space.
The assumptions are adopted due to the complexity of the operator . As can be seen in Appendix C, where the proofs for the following result are given, it is non-trivial to work with . To weaken these assumptions would lead to further calculations, which would essentially confirm the same result.
What is interesting here is that on the product space the target is a coupling of , but the actual maximal coupling of is not sampled from. As is well known, the maximal coupling maximizes the probability that two random variables with specified marginals are equal, and is the optimal coupling of two probability measures w.r.t. the Wasserstein distance. If this former property is desirable from a practical perspective, then the above algorithm should not be used. The maximally coupled resampling operation is, for a finite number of samples (particles), the optimal (in the above sense) way to couple the resampling operation, but may not lead to large-sample ‘good’ couplings. This is manifested in , where the forward error rate (5) is lost for the diffusion problem in Section 2.3. As mentioned above, the limit is a coupling of , but in general there is no reason to suspect that it is optimal in any sense.
3.3 Maximal Coupling
We now present an algorithm which can sample (in the limit) from the maximal coupling of . We will assume that the Markov kernels , as well as , admit a density w.r.t. a finite measure . The densities are denoted and , and we assume that they can be evaluated numerically. Removing this latter requirement is left to future work.
Let , and and define the probability measure:
Now set as the maximal coupling of and set for
We have for
Moreover is the maximal coupling of .
The particle approximation used is:
where for ,
We set for , , ,
We remark that, as , and , , can be evaluated numerically, we can sample from and , the latter of which is the maximal coupling of and , using the algorithm in . This proceeds as follows:
Sample and . If , output ; otherwise go to step 2.
Sample and ; if , output ; otherwise repeat step 2.
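The two steps above can be sketched as the following rejection sampler for the maximal coupling of two densities. The function name and the black-box sampler/density arguments are illustrative placeholders.

```python
import numpy as np

def sample_maximal_coupling(sample_p, pdf_p, sample_q, pdf_q, rng):
    """Rejection sampler for the maximal coupling of densities p and q.

    sample_p(rng) / sample_q(rng) draw from p / q; pdf_p / pdf_q evaluate
    the densities pointwise (these must be computable, as assumed above).
    """
    # step 1: propose from p; accept the diagonal pair (x, x) on the overlap
    x = sample_p(rng)
    if rng.uniform() * pdf_p(x) <= pdf_q(x):
        return x, x
    # step 2: repeat until a proposal from q falls outside the overlap
    while True:
        y = sample_q(rng)
        if rng.uniform() * pdf_q(y) > pdf_p(y):
            return x, y
```

When the two densities coincide, step 1 always accepts and the sampler returns an equal pair; the closer the densities, the shorter the expected running time of step 2.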
This algorithm is a rejection sampler, which adds a random running-time element per time step; this may not be desirable for some applications. In the limiting case (i.e. sampling the maximal coupling of ) the expected number of steps to acceptance is
One might expect that the above is in applications. We expect to be close and possibly that is more diffuse than , for instance that , for some constant ; hence one may find . In general, one might want to confirm this conjecture before implementing this approach.