 # Bernoulli Race Particle Filters

When the weights in a particle filter are not available analytically, standard resampling methods cannot be employed. To circumvent this problem state-of-the-art algorithms replace the true weights with non-negative unbiased estimates. This algorithm is still valid but at the cost of higher variance of the resulting filtering estimates in comparison to a particle filter using the true weights. We propose here a novel algorithm that allows for resampling according to the true intractable weights when only an unbiased estimator of the weights is available. We demonstrate our algorithm on several examples.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Over the last 25 years particle filters have become a standard tool for optimal estimation in the context of general state space hidden Markov models (HMMs) with applications in ever more complex scenarios. HMMs are described by a latent (unobserved) process

taking values in a space and evolving according to the state transition density

 Xt∣(Xt−1=x)∼f(⋅∣x)

for and . The states can be observed through an observation process , taking values in a space with observation density

 Yt∣(Xt=x)∼g(⋅∣x).

Particle filters (PFs) sequentially approximate the joint distributions

 πt(x1:t) =p(x1:t∣y1:t) =g(y1∣x1)μ(x1)∏tk=2g(yk∣xk)f(xk∣xk−1)p(y1:t) =g(y1∣x1)μ(x1)∏tk=2ϖ(xk∣yk,xk−1)p(y1:t)

on the space at time by evolving a set of samples, termed particles, through time. We use here the shorthand notation . The quantity denotes the marginal likelihood of the model

 p(y1:t)=∫g(y1∣x1)μ(x1)t∏k=2ϖ(xk∣yk,xk−1)dx1:t (1)

which is usually intractable. The PF algorithm proceeds as follows. Given a set of particles at time the particles are propagated through . To adjust for the discrepancy between the distribution of the proposed states and the particles are weighted by

 wt(xt−1,xt) =ϖ(xt∣yt,xt−1)q(xt∣xt−1,yt),t≥2, (2) w1(x1) =g(y1∣x1)μ(x1)q(x1∣y1).

A subsequent resampling step according to these weights ensures that only promising particles survive. Here we consider only multinomial resampling, where we write

to denote the vector

consisting of

samples from a multinomial distribution with probabilities proportional to

. The algorithm is summarized in Algorithm 1.

In most applications, interest lies not in the distribution itself, but rather in the expectations

 I(h)=∫h(x1:t)πt(x1:t)dx1:t (3)

for some test function . Applying a PF we can estimate this integral, for a set of particle genealogies , by taking

 ˆIt,PF(h)=1NN∑i=1h(Xi1:t). (4)

When the PF weights (2) are not available analytically, standard resampling routines—steps 3 and 7 in Algorithm 1—cannot be performed. However, in many cases one might be able to construct an unbiased estimator of the resampling weights. liu1998sequential,rousset2006discussion,DelMoral2007,fearnhead2008particle show that using random but unbiased non-negative weights still yields a valid algorithm. This can be easily seen by considering a standard particle filter on an extended space. Yet, this flexibility does not come without cost. Replacing the true weights with a noisy estimate increases the variance for Monte Carlo estimates of type (4).

Here we introduce a new resampling scheme that allows for multinomial resampling from the true intractable weights while just requiring access to non-negative unbiased estimates. Thus, any PF estimate of type (4) will have the same variance as if the true weights were known. This algorithm relies on an extension of recent work in dughmi2017bernoulli where the authors consider an unrelated problem.

In the supplementary material we collect the proofs for Propositions 3, 4 and 5, Theorem 6 and additional simulation studies. Code to reproduce our results is available online1.

## 2 Particle Filters with Intractable Weights

Particle filters for state space models with intractable weights rely on the observation that replacing the true weights with a non-negative unbiased estimate is equivalent to a particle filter on an extended space. In this section we introduce some examples of models where the weights are indeed not available analytically and review briefly how the random weight particle filter (RWPF) can be applied in these instances.

### 2.1 Locally optimal proposal

Recall that in the setting of a PF for a state space model at time the proposal for which leads to the minimum one step variance, termed the locally optimal proposal , is

 q∗(xt∣xt−1,yt)=g(yt∣xt)f(xt∣xt−1)p(yt∣xt−1),

see, e.g. [Proposition 2]doucet2000sequential. Sampling from the locally optimal proposal is usually straightforward using a rejection sampler. The weights

 wt(xt−1,xt) =p(yt∣xt−1) =∫g(yt∣xt)f(xt∣xt−1)dxt, (5)

however, are intractable if the integral on the right-hand side does not have an analytical expression, thus prohibiting an exact implementation of this algorithm. Our algorithm will enable us to resample according to these weights.

### 2.2 Partially observed diffusions

Most research on particle filters with intractable weights has been carried out in the setting of partially observed diffusions, see e.g. fearnhead2008particle, which we will describe here briefly. For simplicity, consider the univariate diffusion222It is not necessary to restrict oneself to univariate diffusions, see e.g. fearnhead2008particle. process of the form

 dXt=a(Xt)dt+dBt,t∈[0,T], (6)

where denotes the drift function and is the time horizon. For a general diffusion constant speed as assumed in (6) can be obtained whenever the Lamperti transform is available. The diffusion is observed at discrete times with measurement error described by the density . In this model the resampling weights are often intractable since for most diffusion processes the transition density over the interval with length is not available analytically. However, as shown in rogers1985smooth,dacunha1986estimation,beskos2005exact the transition density over an interval of length can be expressed as

 fΔt(y∣x) =φ(y;x,Δt)exp(A(y)−A(x)) ×E[exp(−∫Δt0ϕ(Ws)ds)] (7)

where

is the density of a Normal distribution with mean

and variance , denotes a Brownian bridge from to and for a function with . The expectation on the right-hand side in (7) is with respect to the Brownian bridge and is usually intractable. Thus, for a particle filter to work in this example one either needs to implement a Bootstrap PF, which can sample the diffusion (6) exactly [see, e.g.][]beskos2005exact, or one constructs an unbiased estimator of the expectation on the right-hand side to employ the RWPF. Note that for a function and we have

 E[g(WU)λ∣Ws,0≤s≤t]=∫t0g(Ws)λtds.

Using this relationship we can use debiasing schemes such as the Poisson estimator Beskos2006 to find a non-negative unbiased estimator of the expectation of the exponential.

### 2.3 Random weight particle filter

Here we briefly review the RWPF and compare its asymptotic variance to that of a PF with known weights. Assume one does not have access to the weight in (2) but only to some non-negative unbiased estimate , where are auxiliary variables sampled from some density . Then we carry out the multinomial resampling in Algorithm 1 by taking , where is defined as in (2) with replaced by . This is equivalent to a standard particle filter on the extended space targeting at time the distribution

 ¯πt(x1:t,u1:t)∝t∏k=1^ϖ(xk∣yk,xk−1,uk)m(uk)

which satisfies

 ¯πt(x1:t,u1:t)=πt(x1:t)¯πt(u1:t∣x1:t) (8)

with

 ¯πt(u1:t∣x1:t)=t∏k=1^ϖ(xk∣yk,xk−1,uk)m(uk)ϖ(xk∣yk,xk−1).

A PF targeting the sequence of distributions directly can be interpreted as a Rao-Blackwellized PF doucet2000sequential of the PF on the extended space. Hence, the following is a direct consequence of [][Theorem 3]chopin2004central.

###### Proposition 1.

For any sufficiently regular333We refer the reader to chopin2004central for the mathematical details. real-valued test function , the exact weight PF (EWPF) and RWPF estimators of defined in (3) both satisfy a

-central limit theorem with asymptotic variances satisfying

.

## 3 Bernoulli Races

Assume now that the intractable weights can be written as follows

 wt(xt−1,xt)=g(yt∣xt)f(xt∣xt−1)q(xt∣xt−1,yt) =c(xt−1,xt,yt)b(xt−1,xt,yt), (9)

where for all , and that we are able to generate a coin flip with . We will assume that in (9) is analytically available for any . For brevity we will denote (9) as , for particles , dropping for now the dependence on .

The aim of this section is to develop an algorithm to perform multinomial sampling proportional to the weights , that is, sample from discrete distributions of the form

 p(i)=cibi∑Nk=1ckbk,i=1,…N (10)

where denote fixed constants whereas the are unknown probabilities. We assume we are able to sample coins In practice, the are selected to ensure that take values between 0 and 1. The Bernoulli race algorithm for multinomial sampling proceeds by first proposing from the distribution

 P(I=i)=ci∑Nk=1ck (11)

and then sampling . If , return , otherwise the algorithm restarts. Pseudo code describing this procedure is presented in Algorithm 2. If we take for all we recover the original Bernoulli race algorithm from dughmi2017bernoulli.

###### Proposition 2.

Algorithm 2 samples from the distribution

 p(i)=P(I=i∣ZI=1)=cibi∑Nk=1ckbk.
###### Proof.

Note that the probability of sampling and accepting is

 P(I=i,Zi=1)=bici∑kck.

It follows that observing has probability . Now for any

 p(i) =P(I=i∣ZI=1) =P(I=i,Zi=1)P(ZI=1) =bici∑kck/∑kbkck∑kck=bici∑kbkck.\qed

### 3.1 Efficient Implementation

This algorithm repeatedly proposes an a priori unknown amount of random variables from the multinomial distribution (

11). Naïve implementations of multinomial sampling are of complexity for one draw from the multinomial distribution. Standard resampling algorithms, however, can sample random variables at cost [see e.g.][]hol2006resampling but in this case the number of samples needs to be known beforehand. We show here that Bernoulli resampling can still be implemented with an average of operations.

To ensure that the Bernoulli resampling is fast we need cheap samples from the multinomial distribution with weights proportional to . We can achieve this with the Alias method walker1974new,walker1977efficient which requires preprocessing after which we can sample with complexity. Hence, the overall complexity depends on the number of calls to the Bernoulli factory per sample. Denote the number of coin flips that are required to accept a value for the th sample, where . The random variable

follows a geometric distribution with success probability

 ρN=P(ZI=1)=∑Nk=1ckbk∑Nk=1ck. (12)

The expected number of trials until a value is accepted is then

 E[Cj,N]=1ρN(j=1,…,N).

The complexity of the resampling algorithm depends on the values of the acceptance probabilities . For example, if all probabilities are identical, that is, , we expect coin flips, each of cost , which leads to overall order complexity. In practice, however, the success probabilities of our Bernoulli factories are all different. Assuming all are non-zero, the expected algorithmic complexity per particle is bounded by the inverse of smallest and largest Bernoulli probabilities. Let . Then,

 E[Cj,N]=∑kck∑kbkck≤∑kckb–∑kck=1b–

and with

 E[Cj,N]≥1¯¯b.

In practice, the complexity of the Bernoulli sampling algorithm depends on the behavior of as shown by the following central limit theorem.

###### Proposition 3.

Assume . Then we have the following central limit theorem for the average number of coin flips as

 √N(1NN∑j=1Cj,N−1ρN)d→N(0,1−ρρ2),

where denotes convergence in distribution.

In particular, Proposition 3 implies that the run-time concentrates around with fluctuations of order . Thus, the order of complexity depends on the order of and we have complexity if

 limsupN→∞∣∣ ∣∣∑Nk=1ck∑Nk=1ckbk∣∣ ∣∣<∞. (13)

As an example consider the case where are all identical. Then at time

 ρt,N=1NN∑k=1wt,k. (14)

An instance of this setting is the locally optimal proposal presented in Section 2.1. It is well known that as (14) will converge towards and indeed we see that the algorithm’s run-time will concentrate around a quantity of order .

###### Remark.

After the Alias table is constructed, the algorithm can be implemented in parallel. This can lead to considerable gains if the number of particles used in the particle filter is high. If , or if the constants denote a distribution for which a table as in the Alias method is already implemented before program execution, the above algorithm can be implemented entirely in parallel. murray2016parallel consider the case of multinomial resampling using a rejection sampler with uniform proposal on the set with known weights and show that this algorithm has parallel complexity

### 3.2 Estimating the Probability of Stopping

In our later applications we will be interested in evaluating the success probability

 ρN=∑Nk=1ckbk∑Nk=1ck.

Assume that we sample independent realizations from a multinomial distribution using the Bernoulli race algorithm described above and that are the geometric random variables that count the number of trials until the algorithm accepts a value and terminates. Then , where is a consistent estimator of since for all

and therefore by a weak law of large numbers

in probability and in probability by the continuous mapping theorem. Unfortunately, the estimator is not unbiased which would be desirable. However, this can be remedied by constructing the minimum variance unbiased estimator [see][Definition 7.1.1]hogg2005introduction for a geometric distribution. This result is well known, but we repeat it here for convenience.

###### Proposition 4.

For let denote independent samples from a geometric distribution with success probability , then the minimum variance unbiased estimator for is

 ^ρmvueN=N−1∑Nk=1Ck,N−1. (15)

In order to understand the asymptotic behavior of the estimator (15) we also provide the following central limit theorem.

###### Proposition 5.

For let denote independent samples from a geometric distribution with success probability , then

 √N(^ρmvueN−ρN)d→N(0,(1−ρ)ρ2).

## 4 Bernoulli Race Particle Filter

### 4.1 Algorithm Description

We now consider the application of the Bernoulli race algorithm to particle filter methods. The Bernoulli race can be employed as a multinomial resampling algorithm, Mult, in Algorithm 1. However, the clear advantage is that this algorithm can be implemented even if the true weights are not analytically available, but we do have access to non-negative unbiased estimates. A -valued unbiased estimator for can be converted into an unbiased coin flip by noting that where [see e.g. Lemma 2.1 in][]latuszynski2011simulating.

We will refer to such a particle filter as a Bernoulli race particle filter (BRPF).

### 4.2 Likelihood estimation

Even though the Bernoulli race resampling scheme enables us to resample according to the true weights, the normalizing constant or marginal likelihood (1) remains intractable. We show here how we can still obtain an unbiased estimator for the normalizing constant. We first recall that in particle filters an estimator of the normalizing constant is obtained by

 ^p(y1:T)=T∏t=1^p(yt∣y1:t−1)=T∏t=11NN∑k=1wt,k. (16)

This estimator is well-known to be unbiased [see][Chapter 7]del2004feynman. If the weights are not available this estimator cannot be employed. Fortunately, the quantity (16) comes up naturally when running the BRPF. Recall that the probability for the Bernoulli race to stop at a given iteration, i.e. to accept a value, is

 P(Cj,N=1)=∑Nk=1ct,kbt,k∑Nk=1ct,k=1N∑Nk=1wt,k1N∑Nk=1ct,k.

Thus, conditional on the weights , an unbiased estimator of (16) is given by virtue of (15):

 ^ρN,T=T∏t=11NN∑k=1ct,k⋅N−1∑Nk=1Ck,N−1. (17)
###### Theorem 6.

The estimator (17) is unbiased for , i.e. .

## 5 Applications

### 5.1 Locally optimal proposal

We start with a Gaussian state space model. In linear Gaussian state space models the Kalman Filter can be used to analytically evolve the system through time. Nevertheless, the aim here is a proof of concept of the BRPF. We assume the latent variables follow the Markov chain

 Xt=aXt−1+Vt

where we take and we observe these hidden variables through the observation equation

 Yt=Xt+Wt

with initialization and , . In this particular instance the locally optimal proposal is available analytically, but in most practical scenarios this will not be the case. For this reason sampling from the locally optimal proposal is implemented using a rejection sampler that proposes from the state equation. Coin flips for the weights (5) can be obtained by sampling from the model and computing

 Zt=1{U≤exp(−(yt−ξt)210)},U∼Unif[0,1].

This leads to the choice and

 bt,k=∫exp(−(yt−x)210)f(x∣xt−1)dxt−1.

Note that is defined such that

 supx,x′,ybt(x,x′,y)=1

Such a choice ensures that the acceptance probability in the Bernoulli race, Algorithm 2, is as large as possible. This can lead to a considerable speedup when using the Bernoulli resampling algorithm.

#### 5.1.1 Complexity and Run-time

Figure 1 shows the run-time for RWPF and the BRPF for the Gaussian state space model. The BRPF clearly has linear complexity in the number of particles. In a sequential implementation the Bernoulli race (orange) performs worse than the classical resampling scheme (blue), but the difference vanishes when implementing a parallel version of the Bernoulli race (green, with 32 cores). This demonstrates the speedup due to parallel sampling of the coin flips. Since we need many Bernoulli random variables each of which is computationally cheap this algorithm lends itself to a implementation using GPUs to further improve the performance of the Bernoulli race. Figure 1: Comparison of the run-time for RWPF and BRPF for Gaussian state space model. For BRPF we show a sequential and a parallel implementation with 32 cores. For all algorithms we show wall-clock time for number of particles N.

#### 5.1.2 Efficiency

As alluded to earlier, if Bernoulli resampling is performed, the variance for any Monte Carlo estimate (4) will be the same as if the true weights were known and one applies standard multinomial resampling. From Proposition 1 it follows that the asymptotic variance of any Monte Carlo estimate of type (4) will be smaller when applying a BRPF over a RWPF. While the variance for functions (4) in the BRPF coincides with the standard particle filter we do not have access to the same estimator for the normalising constant and instead need to use the methods from Section 4.2.

For these reasons we will compare the performance of the BRPF with the RWPF for a set of different test functions by comparing the variance of PF estimates (4). We then study the estimation of the normalizing constant separately. For the simulation we consider the functions

 h1(x1:T) =1TT∑t=1xt, h2(x1:T) =∥x1:T∥2, h3(x1:T) =xT, h4(x1:T) =(xT−¯xT)2,

where . The PF approximations (4) using and are estimators for the quantities and .

All estimates are based on 100 runs of each particle filter using particles. The results are collected in Table 1

. We denote the standard deviation of the test functions under the BRPF as

and for the RWPF. All measures indicate a reduction in the standard deviation. For the estimator of the function , the standard deviation is reduced by when compared to the RWPF.

We now investigate the estimates for the normalizing constant. Table 2 shows the standard deviation of the (log-)normalizing constant of a Gaussian state space model for time steps. In this setting it is not obvious why the Bernoulli race resampling estimate should outperform the estimate provided by the RWPF. In our case however, we find that the BRPF performs better.

### 5.2 Partially Observed Diffusion

We use the sine diffusion, a commonly used example in the context of partially observed diffusions, see e.g. fearnhead2008particle, given by the stochastic differential equation (SDE)

 dXs=sin(Xs)dt+dBs,s∈[0,15].

Here, denotes a Brownian motion and the drift function in (6) is Consequently, and with , the transition density is

 fΔt(x,y) =φ(y;x,Δt)exp(−cos(y)+cos(x)) ×E[exp(−∫Δt0ϕ(Ws)ds)]. (18)

We observe the state of the SDE through zero mean Gaussian noise with standard deviation 5 yielding weights

 w(xt−1,xt,yt)=φ(yt;xt,52)fΔt(xt∣xt−1)q(xt∣xt−1,yt),

As the proposal we take one step of the Euler–Maruyama scheme. Figure 2: Tracking performance for different particle filters; RWPF with 100 particles (RWPF100) and 1000 particles (RWPF1000) as well as a particle filter using Bernoulli resampling with 100 particles (BerPF100).

The weights are intractable because of the expectation on the right-hand side in (18). The construction of unbiased coin flips can often rely on similar techniques as the construction of unbiased estimators. We can construct an unbiased estimator for the transition density in the following way. Fix a Brownian bridge starting at and finishing at . Sample , then the Poisson estimator Beskos2006 is given by

 ^P(κ,U1…,Uκ)=exp{(λ−c)Δt}κ∏i=1{c−ϕ(WUi)}λ,

where is chosen such that the estimator is non-negative. The Poisson estimator is unbiased

 E[^P(κ,U1…,Uκ)]=E[exp(−∫Δt0ϕ(Ws)ds)]. (19) Figure 3: Kernel density estimate of p(x10∣y1:10) based on particle approximations from a RWPF with 100 particles (RWPF) a BRPF with 100 particles (Bernoulli) and a RWPF with 1000 particles.

For the purposes of implementing the BRPF, we need to construct a coin flip with success probability proportional to (19). We use the probability generating function approach which works analogously to the Poisson estimator. Sample , then

 Z=κ∏i=11{Vi≤{c−ϕ(WUi)}λ}

has success probability

 exp{(c−λ)Δt}E[exp(−∫Δt0ϕ(Ws)ds)].

Hence, to implement the BRPF we can choose

 ct,k =φ(yt;xt,52)φ(xt;xt−1,Δt)φ(yt;xt−1+Δtsin(xt−1),Δt) ×exp(−cos(xt)+cos(xt−1)−(c−λ)Δt) bt,k =exp{(c−λ)Δt}E[exp(−∫Δt0ϕ(Wks)ds)],

where denotes a Brownian bridge from to .

#### 5.2.1 Complexity and Run-time

As in the previous example, we observe the BRPF to be of order and slower than the RWPF. However, implementing the Bernoulli race in parallel yields almost the same performance in terms of run-time. The details can be found in the supplementary material.

#### 5.2.2 Efficiency

One run for the RWPF and the BRPF is shown in Figure 2, where we show the true state of the SDE as well as the noisy observations and the particle filter approximations. In this scenario precise state estimation is hampered by the high noise and resulting partial multimodality of the filtering distribution as shown in Figure 3. For , Figure 2 shows that the BRPF performs much better as it finds the true state earlier. The RWPF finds this trajectory only when the number of particles is increased (here we show also the case ). With the same test functions as above we compare both algorithms in Table 3. We observe gains for all functions, with the most significant gain for the conditional mean, . Again, the Bernoulli race will ordinarily be slower, but most of the difference in run-time vanishes when the Bernoulli race is implemented in parallel. Further details are provided in the supplementary material.

As a final test we use both particle filters for estimating the (-)normalizing constant. The results are listed in Table 4. For comparison we also employ a bootstrap particle filter using the exact algorithm beskos2005exact to propose from the model. We observed this implementation to be slower than the other two. The BRPF estimate (17) outperforms the RWPF and the bootstrap PF.

## 6 Conclusion

We have introduced the idea of Bernoulli races to resampling in particle filtering, utilizing the equivalence of unbiased estimation and the construction of unbiased coin-flips. This algorithm provides the first resampling scheme to allow for exact implementation of multinomial resampling when the weights are intractable, but unbiased estimates are available. We have shown that our algorithm has a complexity of order , like standard multinomial resampling, and demonstrated its advantages over alternative methods in a variety of settings. In doing so we have focused our attention to the resampling step. Further gains using our algorithm could be obtained by considering auxiliary particle filters and to resample only if the effective sample size drops beyond a certain threshold. We expect both to positively impact the performance of the BRPF.

Supplementary Material to Bernoulli Race Particle Filters

## Appendix A Proofs

When we say that is a geometric random variable with parameter, or success probability, we mean that

 P(T≥k)=(1−p)k,k≥1.
###### Proposition 3.

Assume . Then we have the following central limit theorem for the average number of coin flips as

 √N(1NN∑j=1Cj,N−1ρN)d→N(0,1−ρρ2),

where denotes convergence in distribution.

###### Proof.

The random variables form a triangular array

 {Zi,N;i=1,…N,N∈N}

of row wise independent random variables. Define and notice that as . In addition one can easily show that are bounded uniformly in and therefore

 1s3NN∑i=1E[Zi,N] =ρ3/2N(1−ρN)3NN3/2E[|Z1,N|3] ≤C√N→0,as N→∞,

that is the Lyapunov condition is satisfied and therefore

 1√NN∑i=1(Ci,N−1ρN)d→N(0,s2),

where

###### Proposition 4.

For let denote independent samples from a geometric distribution with success probability , then the minimum variance unbiased estimator for is

 ^ρmvueN=N−1∑Nk=1Ck,N−1. (20)
###### Proof.

This is a straightforward application of the Lehmann-Scheffé Theorem [Theorem 7.4.1]hogg2005introduction. Take the unbiased estimator , where denotes the indicator function of the set , and condition on the complete and sufficient statistic (of the coin flips) . A straightforward calculation yields

 E[1{C1,N=1}∣N∑k=1Ck,N=n] =P(C1=1,∑Nk=2Ck,N=n−1)P(∑Nk=1Ck,N=n) =p(n−2N−2)pN−1(1−p)n−N(n−1N−1)pN(1−p)n−N =N−1n−1

and thus

 ^ρmvueN=E[1{C1,N=1}∣N∑k=1Ck,N]=N−1∑Nk=1Ck,N−1.\qed
###### Proposition 5.

For let denote independent samples from a Geometric distribution with success probability , then

 √N(^ρmvueN−ρN)d→N(0,(1−ρ)ρ2).
###### Proof.

By Proposition 3 we have convergence

 √N(1NN∑i=1Ci,N−1ρN)d→N(0,1−ρρ2).