 # Unbiased Estimation of the Solution to Zakai's Equation

In the following article we consider the non-linear filtering problem in continuous time and, in particular, the solution to Zakai's equation, or the normalizing constant. We develop a methodology to produce finite-variance, almost surely unbiased estimators of the solution to Zakai's equation. That is, given access to only a first-order discretization of the solution to the Zakai equation, we present a method which can remove this discretization bias. The approach, under assumptions, is proved to have finite variance and is numerically compared to a particular multilevel Monte Carlo method.


## 1 Introduction

Let $(\Omega,\mathcal{F})$ be a measurable space. On $(\Omega,\mathcal{F})$ consider the probability measure $\mathbb{P}$ and a pair of stochastic processes $\{Y_t\}_{t\geq 0}$, $\{X_t\}_{t\geq 0}$, with $Y_t\in\mathbb{R}^{d_y}$, $X_t\in\mathbb{R}^{d_x}$, $(d_y,d_x)\in\mathbb{N}^2$, with $Y_0=0$ and $X_0=x_0\in\mathbb{R}^{d_x}$ given:

$$dY_t = h(X_t)\,dt + dB_t \qquad (1)$$

$$dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t \qquad (2)$$

where $h:\mathbb{R}^{d_x}\rightarrow\mathbb{R}^{d_y}$, $b:\mathbb{R}^{d_x}\rightarrow\mathbb{R}^{d_x}$, $\sigma:\mathbb{R}^{d_x}\rightarrow\mathbb{R}^{d_x\times d_x}$, with $\sigma$ non-constant and of full rank, and $\{B_t\}_{t\geq 0}$, $\{W_t\}_{t\geq 0}$ are independent standard Brownian motions of dimension $d_y$ and $d_x$ respectively. Let $\{\mathcal{F}_t\}_{t\geq 0}$ be a filtration on $(\Omega,\mathcal{F})$ such that $\{B_t\}_{t\geq 0}$ and $\{W_t\}_{t\geq 0}$ are independent standard $\{\mathcal{F}_t\}_{t\geq 0}$-Brownian motions. Let $T>0$ be an arbitrary real number and introduce the probability measure $\bar{\mathbb{P}}$, which is equivalent to $\mathbb{P}$ on $\mathcal{F}_T$, defined by the Radon-Nikodym derivative

$$Z_T := \frac{d\mathbb{P}}{d\bar{\mathbb{P}}} = \exp\left\{\int_0^T h(X_s)^*\,dY_s - \frac{1}{2}\int_0^T h(X_s)^* h(X_s)\,ds\right\}$$

with, under $\bar{\mathbb{P}}$, $\{X_t\}_{t\geq 0}$ following the dynamics (2) and, independently, $\{Y_t\}_{t\geq 0}$ a standard Brownian motion. We have the solution to the Zakai equation (e.g. [1, Theorem 3.2.4.]) for $\varphi\in\mathcal{B}_b(\mathbb{R}^{d_x})$ (the bounded, measurable, real-valued functions):

$$\gamma_t(\varphi) := \bar{\mathbb{E}}\left[\varphi(X_t)\exp\left\{\int_0^t h(X_s)^*\,dY_s - \frac{1}{2}\int_0^t h(X_s)^* h(X_s)\,ds\right\}\;\Big|\;\mathcal{Y}_t\right]$$

where $\{\mathcal{Y}_t\}_{t\geq 0}$ is the filtration generated by the process $\{Y_t\}_{t\geq 0}$. Our objective is to estimate $\gamma_t(\varphi)$, recursively in time, over some finite and regular time grid. The solution of Zakai's equation can be useful for model selection in statistics, or as the solution of a particular stochastic partial differential equation in applied mathematics.

In most cases of practical interest, one only has access to a finite time discretization of the data, and so one must often, correspondingly, discretize the functionals associated to the signal and observations. Several possibilities have been considered in the literature (e.g. [2, 3]); we use a first-order approach. Even given this discretization, one must often also time-discretize (2), for which we use the Euler method. Once one has reached this stage, the problem of numerically approximating Zakai's equation corresponds to that of approximating the normalizing constant of a high-frequency state-space model, for which there is now a rather mature collection of methods, for instance those based upon particle filters (PFs).

Given a state-space model for which one can sample from the hidden Markov chain and evaluate the conditional likelihood of an observation given the state, a particle filter provides consistent Monte Carlo estimates of the filter and unbiased estimates of the marginal likelihood. In the context of the model (1)-(2), after discretization, many particle filter approaches have been suggested in the literature [1, 5, 7]. We follow an approach which applies multilevel particle filters to the approximation of the filtering problem. The multilevel Monte Carlo (MLMC) method (e.g. [8, 9, 10]) is often used for problems where one is interested in the estimation of an expectation w.r.t. a probability law that has been discretized, for instance the law of a diffusion at some given time, which has been Euler discretized. The idea is to present a telescoping sum representation of the expectation under a given precise discretization, in terms of differences of expectations of increasingly coarse discretizations. If one can sample from appropriate couplings of the probability laws in the differences, then one can reduce the computational effort to achieve a pre-specified mean square error, versus simply approximating the expectation associated to the precise discretization by itself. Detailed reviews of these methods can be found in [9, 11].

The methodology of this article concerns the estimation of the solution of Zakai's equation and, in particular, an estimate that is almost surely unbiased, in that the discretization error is removed from the estimate. The approach that we present is based upon randomization-type unbiased methods (see also [15, 18]) combined with a multilevel methodology. More precisely, we start by presenting a multilevel identity for the approximation of the solution of Zakai's equation, which is biased in terms of the discretization error. We prove that, for a particular implementation, the computational effort required to obtain a mean square error of $\mathcal{O}(\epsilon^2)$, $\epsilon>0$, is reduced relative to the case where one does not use a multilevel strategy. Then, given access to high-frequency data, we show how this identity can be randomized to remove the discretization error of the multilevel method. We prove that our proposed estimators are unbiased and of finite variance. We also consider the computational effort to produce our estimate relative to using multilevel approaches, both theoretically and numerically. In particular, we demonstrate that to be within $\epsilon$ of the true solution to Zakai's equation (with high probability), the unbiased method requires a computational effort exceeding that of the multilevel method; thus there is an extra cost to pay for unbiasedness, relative to using multilevel methods. We remark, however, that the unbiased method is exceptionally amenable to parallel implementation, especially relative to the afore-mentioned multilevel approach, and this element is not considered in our mathematical analysis. In addition, our unbiased methodology provides a ground-truth estimate, which can be useful if one resorts to estimation methods which exhibit discretization bias.

This article is structured as follows. In Section 2 we provide a review of the methodology to be used. In Section 3 our method is presented. In Section 4 we show that our method produces unbiased and finite variance estimators. We also present a result associated to a multilevel estimator. In Section 5 our numerical results are presented. The appendix features technical results for the proofs of our theoretical results.

## 2 Review of Relevant Methodology

This section provides a review of the methodology to be used in this article and is structured as follows. We first describe our notation in Section 2.1. In Section 2.2 we describe the discretized model, which is the one that we will work with in practice. In Section 2.3 we review the multilevel Monte Carlo method, which will be used in this article and is an approach that can reduce the cost of estimation, relative to Monte Carlo, to achieve a given mean square error (MSE), particularly in problems which are subject to discretization. The next two Sections, 2.4 and 2.5, review methodology which can be used to implement MLMC for the class of problems considered in this article. The final Section 2.6 summarizes MLMC implemented via particle and coupled particle filters (CPFs); the multilevel particle filter (MLPF). Throughout the article we assume that all the random variables mentioned are well-defined on the measurable space $(\Omega,\mathcal{F})$.

### 2.1 Notations

Let $(\mathsf{X},\mathcal{X})$ be a measurable space. We write $\mathcal{B}_b(\mathsf{X})$ for the collection of bounded measurable functions on $\mathsf{X}$. $\mathrm{Lip}(\mathbb{R}^d)$ denotes the collection of real-valued functions on $\mathbb{R}^d$ that are Lipschitz w.r.t. $\|\cdot\|_2$ ($\|\cdot\|_2$ denotes the $\mathbb{L}_2$-norm of a vector). That is, $\varphi\in\mathrm{Lip}(\mathbb{R}^d)$ if there exists a $C<+\infty$ such that for any $(x,y)\in\mathbb{R}^d\times\mathbb{R}^d$, $|\varphi(x)-\varphi(y)|\leq C\|x-y\|_2$. We write $\|\varphi\|_{\mathrm{Lip}}$ for the Lipschitz constant of a function $\varphi$. For $\varphi\in\mathcal{B}_b(\mathsf{X})$, we write the supremum norm $\|\varphi\|_\infty = \sup_{x\in\mathsf{X}}|\varphi(x)|$. $\mathcal{P}(\mathsf{X})$ denotes the collection of probability measures on $(\mathsf{X},\mathcal{X})$. For a measure $\mu$ on $(\mathsf{X},\mathcal{X})$ and a $\varphi\in\mathcal{B}_b(\mathsf{X})$, the notation $\mu(\varphi) = \int_{\mathsf{X}}\varphi(x)\,\mu(dx)$ is used. $\mathcal{B}(\mathbb{R}^d)$ denotes the Borel sets on $\mathbb{R}^d$. $dx$ is used to denote the Lebesgue measure. Let $K:\mathsf{X}\times\mathcal{X}\rightarrow[0,\infty)$ be a non-negative operator and $\mu$ be a measure; then we use the notations $\mu K(dy) = \int_{\mathsf{X}}\mu(dx)\,K(x,dy)$ and, for $\varphi\in\mathcal{B}_b(\mathsf{X})$, $K(\varphi)(x) = \int_{\mathsf{X}}\varphi(y)\,K(x,dy)$. For $A\in\mathcal{X}$ the indicator function is written $\mathbb{I}_A(x)$. $\mathcal{N}_s(\mu,\Sigma)$ (resp. $\psi_s(x;\mu,\Sigma)$) denotes an $s$-dimensional Gaussian distribution (density evaluated at $x$) of mean $\mu$ and covariance $\Sigma$. If $s=1$ we omit the subscript. For a vector/matrix $X$, $X^*$ is used to denote the transpose of $X$. For $x\in\mathsf{X}$, $\delta_x(dy)$ denotes the Dirac measure of $x$, and if $x=(x_1,\dots,x_n)$ with $x_i\in\mathsf{X}$, we write $\delta_x(d(y_1,\dots,y_n)) = \prod_{i=1}^n \delta_{x_i}(dy_i)$. $\mathcal{U}_A$ is used to denote the uniform distribution on a set $A$. For a vector-valued function in $d$ dimensions (resp. a $d$-dimensional vector), $\varphi(x)$ (resp. $x$) say, we write the $i^{\text{th}}$ component ($i\in\{1,\dots,d\}$) as $\varphi^{(i)}(x)$ (resp. $x^{(i)}$). For a matrix $x$ we write the $(i,j)^{\text{th}}$ entry as $x^{(ij)}$. For $\mu\in\mathcal{P}(\mathsf{X})$ and a random variable $X$ on $(\mathsf{X},\mathcal{X})$ with distribution associated to $\mu$, we use the notation $X\sim\mu(\cdot)$.

### 2.2 Discretized Model

This section reviews the discretized model. To minimize certain technical difficulties, the following assumption, denoted (D2), is made throughout the paper:

• (D2) We have:

1. $\sigma$ is bounded, with $a(x) := \sigma(x)\sigma(x)^*$ uniformly elliptic.

2. $b^{(i)}$, $\sigma^{(ij)}$ and $h^{(i)}$ are bounded, with $b^{(i)},\sigma^{(ij)},h^{(i)}\in\mathrm{Lip}(\mathbb{R}^{d_x})$ for all indices $i,j$.

In practice, we will have to work with a discretization of the model in (1)-(2). We will assume access to a path of data $\{Y_t\}_{t\in[0,T]}$ which is observed at a high frequency.

Let $l\in\mathbb{N}_0$ be given and consider an Euler discretization of step-size $\Delta_l = 2^{-l}$, with $k\in\{1,2,\dots,2^l T\}$ and $\tilde{X}_0 = x_0$:

$$\tilde{X}_{k\Delta_l} = \tilde{X}_{(k-1)\Delta_l} + b(\tilde{X}_{(k-1)\Delta_l})\Delta_l + \sigma(\tilde{X}_{(k-1)\Delta_l})\left[W_{k\Delta_l} - W_{(k-1)\Delta_l}\right]. \qquad (3)$$

It should be noted that the Brownian motion in (3) is the same as in (2), under both $\mathbb{P}$ and $\bar{\mathbb{P}}$. Then, for $k\in\{0,1,\dots,2^l T-1\}$, define:

$$G_k^l(x_{k\Delta_l}) := \exp\left\{h(x_{k\Delta_l})^*\left(y_{(k+1)\Delta_l} - y_{k\Delta_l}\right) - \frac{\Delta_l}{2}\,h(x_{k\Delta_l})^* h(x_{k\Delta_l})\right\}$$

and note that, for any $(x_0,x_{\Delta_l},\dots,x_{T-\Delta_l})$,

$$Z_T^l(x_0,x_{\Delta_l},\dots,x_{T-\Delta_l}) := \prod_{k=0}^{2^l T-1} G_k^l(x_{k\Delta_l}) = \exp\left\{\sum_{k=0}^{2^l T-1}\left[h(x_{k\Delta_l})^*\left(y_{(k+1)\Delta_l} - y_{k\Delta_l}\right) - \frac{\Delta_l}{2}\,h(x_{k\Delta_l})^* h(x_{k\Delta_l})\right]\right\}$$

is simply a (first-order) discretization of $Z_T$. Then set, for $t\in\{1,2,\dots,T\}$,

$$\gamma_t^l(\varphi) := \bar{\mathbb{E}}\left[\varphi(X_t)\,Z_t^l(\tilde{X}_0,\tilde{X}_{\Delta_l},\dots,\tilde{X}_{t-\Delta_l})\,\big|\,\mathcal{Y}_t\right], \qquad \eta_t^l(\varphi) := \frac{\gamma_t^l(\varphi)}{\gamma_t^l(1)}.$$

For notational convenience, the time index $t=0$ is interpreted in the obvious way ($\gamma_0^l = \delta_{x_0}$). For $p\in\{0,1,\dots,T-1\}$ and $t\in(0,1)$ with $t\Delta_l^{-1}\in\mathbb{N}$, one can also set

$$\gamma_{p+t}^l(\varphi) := \bar{\mathbb{E}}\left[\varphi(X_{p+t})\,Z_p^l(\tilde{X}_0,\tilde{X}_{\Delta_l},\dots,\tilde{X}_{p-\Delta_l})\left(\prod_{k=0}^{t\Delta_l^{-1}-1} G_{p\Delta_l^{-1}+k}^l(\tilde{X}_{p+k\Delta_l})\right)\Big|\,\mathcal{Y}_{p+t}\right], \qquad \eta_{p+t}^l(\varphi) := \frac{\gamma_{p+t}^l(\varphi)}{\gamma_{p+t}^l(1)}$$

where we define $Z_0^l := 1$.
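As an illustration of the recursion (3) and of the computation of $\log Z_T^l$, consider the minimal sketch below. The scalar drift $b(x)=-x$, identity diffusion and observation function $h(x)=x$ are hypothetical stand-ins, and the data path `y` is simulated purely so that the code is self-contained.

```python
import numpy as np

def euler_path(x0, b, sigma, l, T, rng):
    """Simulate the Euler recursion (3) with step-size Delta_l = 2^{-l} on [0, T]."""
    delta = 2.0 ** (-l)
    n = int(T / delta)
    x = np.empty((n + 1, len(x0)))
    x[0] = x0
    for k in range(1, n + 1):
        dw = rng.normal(scale=np.sqrt(delta), size=len(x0))  # W_{k Delta} - W_{(k-1) Delta}
        x[k] = x[k - 1] + b(x[k - 1]) * delta + sigma(x[k - 1]) @ dw
    return x

def log_Z(x, y, h, l):
    """log Z_T^l: the sum of log G_k^l over the grid, for a path x and data y on the same grid."""
    delta = 2.0 ** (-l)
    out = 0.0
    for k in range(len(x) - 1):
        hk = h(x[k])
        out += hk @ (y[k + 1] - y[k]) - 0.5 * delta * hk @ hk
    return out

# Hypothetical one-dimensional example: b(x) = -x, sigma = identity, h(x) = x.
rng = np.random.default_rng(0)
l, T = 3, 2
x = euler_path(np.array([0.5]), lambda x: -x, lambda x: np.eye(1), l, T, rng)
y = np.concatenate([np.zeros((1, 1)),
                    np.cumsum(rng.normal(scale=np.sqrt(2.0 ** (-l)),
                                         size=(2 ** l * T, 1)), axis=0)])
log_weight = log_Z(x, y, lambda x: x, l)
```

Working on the log scale avoids overflow in the product of the $G_k^l$ when $2^l T$ is large.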

### 2.3 Multilevel Monte Carlo

In this section, to elaborate the methodology, we shall consider the estimation of $\eta_t^L(\varphi)$ for some fixed $(t,\varphi,L)$. If it is possible, the Monte Carlo estimate of $\eta_t^L(\varphi)$ constitutes sampling $(x_t^{L,i})_{i=1}^N$ i.i.d. from $\eta_t^L$ and forming the estimate:

$$\eta_{t,MC}^{L,N}(\varphi) := \frac{1}{N}\sum_{i=1}^N \varphi\left(x_t^{L,i}\right).$$

In order to understand the error in estimation, one can consider the MSE:

$$\mathbb{E}\left[\left(\eta_{t,MC}^{L,N}(\varphi) - \eta_t(\varphi)\right)^2\right]$$

where we note that the expectation operator is that under $\mathbb{P}$. Then one has

$$\mathbb{E}\left[\left(\eta_{t,MC}^{L,N}(\varphi) - \eta_t(\varphi)\right)^2\right] \leq 2\left(\mathbb{E}\left[\left(\eta_{t,MC}^{L,N}(\varphi) - \eta_t^L(\varphi)\right)^2\right] + \mathbb{E}\left[\left(\eta_t^L(\varphi) - \eta_t(\varphi)\right)^2\right]\right).$$

Now if, further, $\varphi\in\mathrm{Lip}(\mathbb{R}^{d_x})\cap\mathcal{B}_b(\mathbb{R}^{d_x})$, then using classical results in Monte Carlo estimation (noting (D2)) for the first expectation on the R.H.S., and classical results on the bias of the Euler method for the second expectation on the R.H.S., one has

$$\mathbb{E}\left[\left(\eta_{t,MC}^{L,N}(\varphi) - \eta_t(\varphi)\right)^2\right] \leq C\left(\frac{1}{N} + \Delta_L\right) \qquad (4)$$

where $C$ is a finite constant that may depend on $t$ and $\varphi$, but not on $N$ nor $L$. For a given $\epsilon>0$, and assuming access to data of a high enough frequency, one can choose $L$ so that $\Delta_L = \mathcal{O}(\epsilon^2)$ and choose $N = \mathcal{O}(\epsilon^{-2})$, so that the MSE is $\mathcal{O}(\epsilon^2)$. If the cost of simulation of one sample is $\mathcal{O}(\Delta_L^{-1})$, as is often the case when working with Euler discretizations, then the cost to achieve this MSE is $\mathcal{O}(\epsilon^{-4})$ (the cost does not take into account the parameter $t$, which we shall not consider).
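Sampling i.i.d. from $\eta_t^L$ is typically not possible; as a generic illustration of the estimator and of the cost arithmetic just described, the sketch below applies the same single-level Monte Carlo construction to the law of an Euler-discretized scalar diffusion at a fixed time. All model ingredients ($b$, $\sigma$, $\varphi$) are hypothetical.

```python
import numpy as np

def euler_terminal(x0, b, sigma, delta, t, n_samples, rng):
    """n_samples i.i.d. copies of a scalar Euler-discretized diffusion at time t."""
    x = np.full(n_samples, x0)
    for _ in range(int(t / delta)):
        x = x + b(x) * delta + sigma(x) * rng.normal(scale=np.sqrt(delta), size=n_samples)
    return x

# Cost accounting as above: for a target MSE of eps^2 take Delta_L = eps^2
# (bias term) and N = eps^{-2} (variance term); the total cost is then
# N * Delta_L^{-1} = eps^{-4} Euler steps.
eps = 0.1
delta_L = eps ** 2
N = int(round(eps ** -2))
rng = np.random.default_rng(1)
samples = euler_terminal(0.0, lambda x: -x, lambda x: np.ones_like(x), delta_L, 1.0, N, rng)
estimate = np.mean(np.sin(samples))       # the estimator (1/N) sum phi(x^i), phi = sin
cost = N / delta_L                        # = eps^{-4} = 10^4 here
```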

The MLMC method is associated to the telescoping sum identity:

$$\eta_t^L(\varphi) = \eta_t^0(\varphi) + \sum_{l=1}^L\left[\eta_t^l - \eta_t^{l-1}\right](\varphi) \qquad (5)$$

where we are using the short-hand notation $[\eta_t^l - \eta_t^{l-1}](\varphi) := \eta_t^l(\varphi) - \eta_t^{l-1}(\varphi)$. We now explain how (5) can be leveraged to reduce the cost of achieving an MSE of $\mathcal{O}(\epsilon^2)$.

The approach is to consider a method that can estimate $\eta_t^0(\varphi)$ and then, independently, $[\eta_t^1 - \eta_t^0](\varphi)$, and so on, until one independently estimates $[\eta_t^L - \eta_t^{L-1}](\varphi)$. The phrase "independently" must be understood conditionally upon the data. To estimate $\eta_t^0(\varphi)$, one can proceed just as above, taking $L=0$. That is, one generates $(x_t^{0,i})_{i=1}^{N_0}$ i.i.d. from $\eta_t^0$ and forms the estimate $\eta_{t,MC}^{0,N_0}(\varphi)$.

We now consider approximating $[\eta_t^l - \eta_t^{l-1}](\varphi)$ for $l\in\{1,\dots,L\}$ fixed. We consider a (random) probability measure, $\check{\eta}_t^{l,l-1}$, on $\mathbb{R}^{d_x}\times\mathbb{R}^{d_x}$ such that for every $A\in\mathcal{B}(\mathbb{R}^{d_x})$ we have, almost surely,

$$\check{\eta}_t^{l,l-1}(A\times\mathbb{R}^{d_x}) = \eta_t^l(A) \quad\text{and}\quad \check{\eta}_t^{l,l-1}(\mathbb{R}^{d_x}\times A) = \eta_t^{l-1}(A). \qquad (6)$$

We will also assume that $\check{\eta}_t^{l,l-1}$ has the property that

$$\mathbb{E}\left[\int_{\mathbb{R}^{d_x}\times\mathbb{R}^{d_x}} \|x - \check{x}\|_2^2\;\check{\eta}_t^{l,l-1}(d(x,\check{x}))\right] \leq C\,\Delta_l^{\beta} \qquad (7)$$

where $C$ is a finite constant that does not depend upon $l$, and $\beta$ is a positive constant; we do not discuss the existence of $\check{\eta}_t^{l,l-1}$ here, as it is used purely to illustrate the MLMC method, but such probabilities can exist. Note that the properties (6)-(7) are the key to the method to be described - without them, it may not be of any practical use. The condition (7) helps to achieve a variance reduction relative to the Monte Carlo estimate, as we will explain below. Now one proceeds by sampling $((x_t^{l,i},\check{x}_t^{l-1,i}))_{i=1}^{N_l}$ i.i.d. from $\check{\eta}_t^{l,l-1}$ and computing the estimate

$$\left[\eta_t^l - \eta_t^{l-1}\right]_{MC}^{N_l}(\varphi) := \frac{1}{N_l}\sum_{i=1}^{N_l}\left[\varphi\left(x_t^{l,i}\right) - \varphi\left(\check{x}_t^{l-1,i}\right)\right].$$

The approximation of (5) is taken as:

$$\eta_{t,MLMC}^{L,N_{0:L}}(\varphi) = \eta_{t,MC}^{0,N_0}(\varphi) + \sum_{l=1}^L\left[\eta_t^l - \eta_t^{l-1}\right]_{MC}^{N_l}(\varphi)$$

where we stress that, conditional upon the data, the random variables used at the different levels are all independent, and the notation $N_{0:L} = (N_0,\dots,N_L)$ is used. Now, letting $\varphi\in\mathrm{Lip}(\mathbb{R}^{d_x})\cap\mathcal{B}_b(\mathbb{R}^{d_x})$, one can again consider the MSE

$$\mathbb{E}\left[\left(\eta_{t,MLMC}^{L,N_{0:L}}(\varphi) - \eta_t(\varphi)\right)^2\right] \leq 2\left(\mathbb{E}\left[\left(\eta_{t,MLMC}^{L,N_{0:L}}(\varphi) - \eta_t^L(\varphi)\right)^2\right] + \mathbb{E}\left[\left(\eta_t^L(\varphi) - \eta_t(\varphi)\right)^2\right]\right) \leq 2\left(\mathbb{E}\left[\left(\eta_{t,MLMC}^{L,N_{0:L}}(\varphi) - \eta_t^L(\varphi)\right)^2\right] + C\,\Delta_L\right)$$

where we have used the bias result that was applied to obtain (4). Using standard results on sums of squares of random variables, along with the fact that, for $(l,q)\in\{1,\dots,L\}^2$ with $l\neq q$,

$$\mathbb{E}\left[\left(\left[\eta_t^l - \eta_t^{l-1}\right]_{MC}^{N_l}(\varphi) - \left[\eta_t^l - \eta_t^{l-1}\right](\varphi)\right)\left(\left[\eta_t^q - \eta_t^{q-1}\right]_{MC}^{N_q}(\varphi) - \left[\eta_t^q - \eta_t^{q-1}\right](\varphi)\right)\right] = 0$$

where we are using the conditional independence structure of the simulated random variables, and (7), one has

$$\mathbb{E}\left[\left(\eta_{t,MLMC}^{L,N_{0:L}}(\varphi) - \eta_t^L(\varphi)\right)^2\right] \leq C\sum_{l=0}^L\frac{\Delta_l^{\beta}}{N_l}$$

and thus

$$\mathbb{E}\left[\left(\eta_{t,MLMC}^{L,N_{0:L}}(\varphi) - \eta_t(\varphi)\right)^2\right] \leq C\left(\sum_{l=0}^L\frac{\Delta_l^{\beta}}{N_l} + \Delta_L\right)$$

where, throughout, $C$ is a finite constant that does not depend upon $l$ or $L$. Suppose that $\beta$ as in (7) is $1$, and assume that the cost of producing one sample from $\check{\eta}_t^{l,l-1}$ is $\mathcal{O}(\Delta_l^{-1})$. For a given $\epsilon>0$, using standard calculations (see e.g. the reviews [9, 11]), one can set $\Delta_L = \mathcal{O}(\epsilon^2)$ (achieving a bias of $\mathcal{O}(\epsilon)$) and $N_l = \mathcal{O}(\epsilon^{-2}\Delta_l(L+1))$, so that the MSE is $\mathcal{O}(\epsilon^2)$ and the cost to achieve this is $\mathcal{O}(\epsilon^{-2}\log(\epsilon)^2)$; a vast reduction over using the Monte Carlo method.
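For the law of an Euler-discretized diffusion, a coupling with the properties (6)-(7) can be realized by driving the fine and coarse paths with the same Brownian increments. The sketch below assembles the corresponding telescoping estimator (5) for a hypothetical scalar model, using the allocation $N_l\propto\epsilon^{-2}\Delta_l(L+1)$ discussed above; it illustrates the mechanism only, since sampling exactly from $\eta_t^l$ is not possible in the filtering context.

```python
import numpy as np

def euler_terminal(x0, b, sigma, l, dw):
    """Euler path at level l (step 2^{-l}) driven by the given Brownian increments dw."""
    delta = 2.0 ** (-l)
    x = x0
    for inc in dw:
        x = x + b(x) * delta + sigma(x) * inc
    return x

def mlmc_estimate(phi, x0, b, sigma, t, L, eps, rng):
    """Telescoping estimator (5): a plain average at level 0 plus coupled
    differences at levels 1..L, with N_l = ceil(eps^{-2} Delta_l (L + 1))."""
    est = 0.0
    for l in range(L + 1):
        delta = 2.0 ** (-l)
        N_l = int(np.ceil(eps ** -2 * delta * (L + 1)))
        acc = 0.0
        for _ in range(N_l):
            dw = rng.normal(scale=np.sqrt(delta), size=int(t / delta))
            xf = euler_terminal(x0, b, sigma, l, dw)
            if l == 0:
                acc += phi(xf)
            else:
                # coarse path driven by the pairwise sums of the fine increments,
                # which is what yields (7) (with beta = 1) for this coupling
                xc = euler_terminal(x0, b, sigma, l - 1, dw[0::2] + dw[1::2])
                acc += phi(xf) - phi(xc)
        est += acc / N_l
    return est

rng = np.random.default_rng(2)
est = mlmc_estimate(np.sin, 0.5, lambda x: -x, lambda x: 1.0, t=1.0, L=4, eps=0.05, rng=rng)
```

Because the fine and coarse paths share increments, the sample variance of the level-$l$ differences decays with $\Delta_l$, which is what allows $N_l$ to shrink geometrically across levels.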

### 2.4 Particle Filters

The main objective of this section is to present a recursive and online method for approximating expectations $\eta_t^l(\varphi)$, where $(l,\varphi)$ are fixed and $t$ is increasing. By 'online' we mean that the approximation method will only use a computational cost that is fixed for each $t$.

Particle filters are a simulation-based method that generates $N$ samples (or particles) in parallel. The algorithm consists of two major steps: sampling and resampling. The sampling mechanism comprises sampling the $N$ samples (conditionally) independently using the Euler dynamics. Then, as the Euler dynamics do not correspond to the true filter, one must correct for this fact, which is done using the operations of weighting and resampling. In this step all the samples interact with each other. PFs will produce estimates of $\eta_t^l(\varphi)$ which converge almost surely as $N$ grows. In addition, one will also have an (almost surely) unbiased estimate of $\gamma_t^l(\varphi)$ - note that there is still a discretization bias. See e.g. [1, 5] for examples.

The following concepts will be used in the algorithm to be described. Set, with $p\in\{0,1,\dots,T-1\}$,

$$G_p^l(x_{p:p+1-\Delta_l}) := \prod_{k=0}^{\Delta_l^{-1}-1} G_{p\Delta_l^{-1}+k}^l(x_{p+k\Delta_l})$$

where we use the notation $x_{p:p+1-\Delta_l} = (x_p, x_{p+\Delta_l},\dots,x_{p+1-\Delta_l})$. This quantity will be used to correct samples generated from the Euler dynamics into ones which can be used to approximate the filter. Let $\mathsf{E}_l := (\mathbb{R}^{d_x})^{\Delta_l^{-1}+1}$, and denote by $M^l$ the joint Markov transition of $(X_0,X_{\Delta_l},\dots,X_1)$ defined via the Euler discretization (3) and a Dirac mass on a point $x$: for $\varphi\in\mathcal{B}_b(\mathsf{E}_l)$,

$$M^l(\varphi)(x) := \int_{\mathsf{E}_l}\varphi(x_0,x_{\Delta_l},\dots,x_1)\,\delta_x(dx_0)\left[\prod_{k=1}^{\Delta_l^{-1}}\psi_{d_x}\left(x_{k\Delta_l};\,x_{(k-1)\Delta_l} + b(x_{(k-1)\Delta_l})\Delta_l,\;a(x_{(k-1)\Delta_l})\Delta_l\right)\right]d(x_{\Delta_l},\dots,x_1).$$

This Markov transition kernel is the one that will be used to sample the process in-between weighting and resampling operations. We remark that the presence of the Dirac mass is only used to keep consistency with the theoretical results upon which we rely; in practice one does not need the Dirac mass. The algorithm is presented in detail in Algorithm 1. In step 1. of the algorithm, we generate $N$ samples independently from the Euler dynamics. In step 2. each sample is propagated by sampling from the probability measure (9). This sampling encapsulates resampling and then sampling: first one computes the weight functions, and one selects a position $x_{p-1:p-\Delta_l}^{l,i}$, from which to move the sample, with probability equal to

$$\frac{G_{p-1}^l\left(x_{p-1:p-\Delta_l}^{l,i}\right)}{\sum_{j=1}^N G_{p-1}^l\left(x_{p-1:p-\Delta_l}^{l,j}\right)}.$$

The sample is then moved according to the Markov kernel $M^l$. If one wants to estimate $\eta_t^l(\varphi)$, $t\in\{1,\dots,T\}$, then the estimate

$$\eta_{t,PF}^{l,N}(\varphi) := \sum_{i=1}^N\frac{G_{t-1}^l\left(x_{t-1:t-\Delta_l}^{l,i}\right)}{\sum_{j=1}^N G_{t-1}^l\left(x_{t-1:t-\Delta_l}^{l,j}\right)}\,\varphi\left(x_t^{l,i}\right) \qquad (8)$$

is used, and can be computed after (the appropriate) step 2. (or step 1.) in Algorithm 1. The corresponding estimate for valid non-integer times is defined analogously. An almost surely unbiased estimate of $\gamma_t^l(\varphi)$, $t\in\{1,\dots,T\}$, is

$$\gamma_{t,PF}^{l,N}(\varphi) := \left[\prod_{p=0}^{t-2}\left(\frac{1}{N}\sum_{i=1}^N G_p^l\left(x_{p:p+1-\Delta_l}^{l,i}\right)\right)\right]\left(\frac{1}{N}\sum_{i=1}^N G_{t-1}^l\left(x_{t-1:t-\Delta_l}^{l,i}\right)\varphi\left(x_t^{l,i}\right)\right)$$

and is again computed after (the appropriate) step 2. (or step 1.) in Algorithm 1.
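A minimal sketch of the particle filter just described is given below, for a hypothetical scalar model (the ingredients $b$, $\sigma$, $h$ and the simulated data are stand-ins). It alternates Euler propagation over unit intervals with weighting by $G_p^l$ and multinomial resampling; the exact bookkeeping of Algorithm 1 (resample-then-move via $M^l$) is rearranged into the equivalent propagate-weight-resample form, and log-weights are used for numerical stability.

```python
import numpy as np

def particle_filter(y, b, sigma, h, x0, l, T, N, rng):
    """Sketch of a particle filter for the discretized model: Euler propagation
    over each unit interval, weighting by G_p^l, multinomial resampling.
    Returns the estimate (8) of eta_T^l(phi) for phi(x) = x, and the log of
    the unbiased estimate of gamma_T^l(1)."""
    delta = 2.0 ** (-l)
    m = int(1 / delta)                        # Euler steps per unit of time
    x = np.full(N, x0, dtype=float)
    log_gamma = 0.0
    for p in range(T):
        logw = np.zeros(N)
        for k in range(m):                    # propagate and accumulate log G_p^l
            hk = h(x)
            dy = y[p * m + k + 1] - y[p * m + k]
            logw += hk * dy - 0.5 * delta * hk * hk
            x = x + b(x) * delta + sigma(x) * rng.normal(scale=np.sqrt(delta), size=N)
        c = logw.max()
        w = np.exp(logw - c)
        log_gamma += c + np.log(w.mean())     # log of the factor (1/N) sum_i G_p^l
        if p < T - 1:                         # resample (not after the final weighting)
            x = x[rng.choice(N, size=N, p=w / w.sum())]
    eta_hat = np.sum((w / w.sum()) * x)       # estimate (8) with phi(x) = x
    return eta_hat, log_gamma

# Hypothetical scalar model and simulated data, purely for illustration.
rng = np.random.default_rng(3)
l, T, N = 3, 2, 200
y = np.concatenate([[0.0], np.cumsum(rng.normal(scale=np.sqrt(2.0 ** (-l)),
                                                size=2 ** l * T))])
eta_hat, log_gamma = particle_filter(y, lambda x: -x, lambda x: 1.0, lambda x: x,
                                     0.0, l, T, N, rng)
```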

### 2.5 Coupled Particle Filters

The main objective of this section is to present a recursive and online method for approximating expectations $[\eta_t^l - \eta_t^{l-1}](\varphi)$, where $(l,\varphi)$ are fixed and $t$ is increasing. More precisely, we will seek to approximate expectations w.r.t. probabilities $\check{\eta}_t^{l,l-1}$ as described in Section 2.3, with the properties (6)-(7). In general, such probabilities are quite complex, so we shall explain an approximation scheme that will correlate, or couple, the two steps in a particle filter. This approach will induce a sequence of targets $\check{\eta}_t^{l,l-1}$ which will possess the properties (6)-(7), but we will not discuss the details of these probabilities here.

The coupled particle filter will generate pairs of paths of the discretized diffusion, using $N$ samples simulated in parallel. The algorithm has two steps: coupled sampling and coupled resampling. The coupled sampling step constitutes sampling coupled Euler paths at the two levels of discretization. We describe, in Algorithm 2, the simulation of a Markov kernel $\check{P}^l$ on paths $(x_{\Delta_l}^l,\dots,x_1^l)$ and $(\check{x}_{\Delta_{l-1}}^{l-1},\dots,\check{x}_1^{l-1})$ (with initial points $(u,\check{v})$) which provides a coupling of the Euler discretizations. Let $\check{M}^l$ be a Markov kernel defined, for $\varphi\in\mathcal{B}_b(\mathsf{E}_l\times\mathsf{E}_{l-1})$ ($(u,\check{v})$ are the initial points of the kernel), by

$$\check{M}^l(\varphi)((u,\check{v})) := \int_{\mathsf{E}_l\times\mathsf{E}_{l-1}}\varphi\left(u^l,u^{l-1}\right)\delta_u\left(dx_0^l\right)\delta_{\check{v}}\left(d\check{x}_0^{l-1}\right)\check{P}^l\left(\left(x_0^l,\check{x}_0^{l-1}\right),\,d\left(\left(x_{\Delta_l}^l,\dots,x_1^l\right),\left(\check{x}_{\Delta_{l-1}}^{l-1},\dots,\check{x}_1^{l-1}\right)\right)\right)$$

where we have used the notation $u^l = (x_0^l,x_{\Delta_l}^l,\dots,x_1^l)$ and $u^{l-1} = (\check{x}_0^{l-1},\check{x}_{\Delta_{l-1}}^{l-1},\dots,\check{x}_1^{l-1})$. This is the coupled simulation that we will use. Again, the Dirac masses are only used for consistency and are not needed in practice.

The coupled particle filter will generate $N$ pairs of trajectories at the two levels of discretization, $l$ and $l-1$. The idea will be to resample these trajectories at integer times, so that the trajectory $x_{p:p+1-\Delta_l}^{l,i}$ is resampled using the probability distribution on $\{1,\dots,N\}$ given by

$$\frac{G_p^l\left(x_{p:p+1-\Delta_l}^{l,i}\right)}{\sum_{j=1}^N G_p^l\left(x_{p:p+1-\Delta_l}^{l,j}\right)}$$

and the trajectory $\check{x}_{p:p+1-\Delta_{l-1}}^{l-1,i}$ is resampled using the probability distribution on $\{1,\dots,N\}$ given by

$$\frac{G_p^{l-1}\left(\check{x}_{p:p+1-\Delta_{l-1}}^{l-1,i}\right)}{\sum_{j=1}^N G_p^{l-1}\left(\check{x}_{p:p+1-\Delta_{l-1}}^{l-1,j}\right)}$$

but that the sampling of the pair of indices on $\{1,\dots,N\}^2$ is not independent. The reason for this is that one would like to obtain a property such as (7) for the limiting distribution $\check{\eta}_t^{l,l-1}$, which is seldom possible if the resampling operation is independent between the pairs of trajectories. In Algorithm 3 we describe one way to achieve this (re)sampling for two generic probability mass functions on $\{1,\dots,N\}$; the coupling is called the maximal coupling.
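The maximal coupling of two probability mass functions $p,q$ on $\{1,\dots,N\}$ can be sampled as follows: with probability $\sum_i\min(p_i,q_i)$ draw a common index from the normalized overlap $\min(p,q)$, and otherwise draw the two indices independently from the normalized residuals. A minimal sketch (the particular weight vectors are hypothetical):

```python
import numpy as np

def maximal_coupling(p, q, rng):
    """Sample (i, j) with i ~ p, j ~ q and P(i = j) = sum_k min(p_k, q_k),
    the largest agreement probability any coupling of p and q can achieve."""
    m = np.minimum(p, q)
    alpha = m.sum()                           # probability the two indices agree
    if rng.uniform() < alpha:
        i = rng.choice(len(p), p=m / alpha)   # common draw from the overlap
        return i, i
    i = rng.choice(len(p), p=(p - m) / (1 - alpha))   # independent residual draws
    j = rng.choice(len(q), p=(q - m) / (1 - alpha))
    return i, j

rng = np.random.default_rng(4)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
pairs = [maximal_coupling(p, q, rng) for _ in range(5000)]
agree = np.mean([i == j for i, j in pairs])   # close to sum(min(p, q)) = 0.9
```

In the coupled particle filter, `p` and `q` would be the level-$l$ and level-$(l-1)$ resampling weights displayed above, so that a coupled pair of trajectories keeps a common ancestor index as often as possible.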