 # Asynchronous Anytime Sequential Monte Carlo

We introduce a new sequential Monte Carlo algorithm we call the particle cascade. The particle cascade is an asynchronous, anytime alternative to traditional particle filtering algorithms. It uses no barrier synchronizations which leads to improved particle throughput and memory efficiency. It is an anytime algorithm in the sense that it can be run forever to emit an unbounded number of particles while keeping within a fixed memory budget. We prove that the particle cascade is an unbiased marginal likelihood estimator which means that it can be straightforwardly plugged into existing pseudomarginal methods.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Particle filter based inference techniques require blocking barrier synchronization at resampling steps which limits parallel throughput and is costly in terms of memory. We introduce a new asynchronous particle filter algorithm that has statistical efficiency competitive with standard resampling algorithms, and has sufficiently higher particle throughput such that it is, on balance, more efficient per unit time. The approach uses locally-computed decision rules for each particle that do not require block synchronization of all particles, instead only requiring sharing summary statistics with particles that follow. In our algorithm each resampling point acts as a queue rather than a barrier: each particle chooses the number of its own offspring using by comparing its own weight to the weights of particles which previously reached the queue, updates its own weight, then proceeds without waiting.

An anytime algorithm is an algorithm which can be run continuously, generating progressively better solutions when afforded additional computation time. Traditional particle filter (PF) algorithms are not anytime in nature; all particles need to be propagated in lock-step to completion in order to compute expectations. Once a particle set runs to termination, inference cannot straightforwardly be continued by simply doing more computation. The naive strategy of running sequential Monte Carlo (SMC) again and merging the resulting sets of particles is suboptimal due to bias (see  for explanation). More complex methods (i.e. particle Metropolis Hastings and iterated conditional sequential Monte Carlo (iCSMC) ) for correctly merging particle sets produced by additional SMC runs are closer to anytime in nature but suffer from burstiness as big sets of particles are computed then emitted at once and, fundamentally, the inner-SMC loop of such algorithms still suffers the kind of excessive synchronization performance penalty that the particle cascade directly avoids. Our asynchronous SMC algorithm, the particle cascade, is anytime in nature. The particle cascade can be run indefinitely, without resorting to merging of particle sets, and with only a fixed (tunable) memory requirement.

### 1.1 Related work

Our algorithm shares a superficial similarity to Bernoulli branching numbers  and other search and exploration methods which have been used for particle filtering, where each particle samples some number of children to propagate to the next observation. Like the particle cascade (and in contrast to most traditional resampling algorithms), the total number of particles which exist at each generation is allowed to gradually increase and decrease. However, computing branching correction numbers is also generally a synchronous operation, requiring all particle weights to be known at each observation in order to choose an appropriate number of offspring; this also precludes use of an anytime algorithm.

Parallelizing the resampling step of sequential Monte Carlo methods has drawn increasing recent interest as the effort progresses to scale up algorithms to take advantage of high-performance computing systems and GPUs. A recent approach to removing the global collective resampling operation, quite different from the particle cascade method introduced here, can be found at .

Another recent method for running arbitrarily many particles within a fixed memory budget introduced in  focuses on keeping track of random seeds used to generate proposals, allowing particular particles to be deterministically “replayed”; a first pass through all particles computes the normalizing constant of the particle weights, and a second pass re-executes those which are chosen to continue to the next generation. However, the algorithm as presented there still relies on a synchronous resampling step, and lacks the anytime property of our approach.

## 2 Background

We begin by briefly reviewing particle filtering as generally formulated on state-space models. Suppose we have a non-Markovian dynamical system with latent random variables

and observed random variables described by the joint density

 p(Xn|X0:n−1,Y0:n−1) =f(Xn|X0:n−1) p(Yn|X0:n,Y0:n−1) =g(Yn|X0:n), (1)

where is drawn from some initial distribution , and and are conditional densities.

Given observed values , we approximate the posterior distribution with a weighted set of particles, with each particle denoted for . Particles are propagated forward from proposal densities and re-weighted at each observation :

 xkn|xkn−1 ∼q(⋅|xkn−1) (2) wkn =f(xkn|xk0:n−1)g(yn|xk0:n)q(xkn|xk0:n−1) (3) Wkn =Wkn−1wkn (4)

where is the weight associated with observation and is the weight of particle after observation . We assume that exact evaluation of is intractable, requiring only that the conditional terms likelihoods can be evaluated pointwise. In many complex dynamical systems, or in black-box simulation models, evaluation of may be prohibitively costly or even impossible. As long as we are capable of simulating from the system we can set our proposal distribution , in which case the particle weights are simply , eliminating the need to compute the conditional densities directly.

By normalizing the weights , defining

 ¯ωkn=Wkn∑Kj=1Wjn, (5)

we can approximate the posterior distribution with a weighted set of particles

 p(X0:N|y0:N) ≈K∑k=1¯ωkNδxk0:N(X0:N). (6)

In the very simple sequential importance sampling setup described here, the marginal likelihood can be estimated by

 ^p(y0:n)=1KK∑k=1Wkn. (7)

### 2.1 Resampling, degeneracy, synchronization

The algorithm described above suffers from a degeneracy problem wherein the normalized weights become mostly very close to zero for even moderately large . Traditionally this is combated by introducing a resampling step: as we progress from to , particles with high weights are duplicated and particles with low weights are discarded. Many difference schemes for resampling particles exist; see  for an overview, with discussion and theoretical results for several common approaches. One can think of a resampling scheme as a method for drawing the number of offspring particles that each particle will produce after stage . After resampling, all outgoing particles from to receive a new outgoing weight , and we have

 Wkn+1=Vkn+1wkn+1. (8)

In most traditional resampling schemes the outgoing weights of all particles are deterministically set to be equal, i.e. with ; a valid resampling scheme then must satisfy the unbiasedness condition

 E[Mkn+1]=K¯ωkn. (9)

Introducing this resampling step prevents all the probability mass in our approximation to the posterior from accumulating on a single particle. In this version of the algorithm, where a resampling step is added at every

, the marginal likelihood can be estimated by

 ^p(yn|y0:n−1)=1KK∑k=1wkn; (10)

it is well-known that the estimate of the marginal likelihood is unbiased .

### 2.2 Limitations

Our goal is to scale up to very large numbers of particles, using a parallel computing architecture where each particle is simulated as a separate process or thread. In order to resample at each we must compute the normalized weights , requiring us to wait until all individual particles have both finished forward simulation and computed their individual weight before any can proceed. While the forward simulation itself is trivially parallelizable, the weight normalization and resampling step is a synchronous, collective operation. In practice this can lead to significant underuse of computing resources in a multiprocessor environment, hindering our ability to scale up to large numbers of particles.

Memory limitations on finite computing hardware also limit the number of simultaneous particles we are capable of running in practice. All particles must move through the system together, and all must exist simultaneously; if the total memory requirements of particles is greater than the available system RAM, then a substantial overhead will be incurred from regularly swapping memory contents to disk.

The particle cascade algorithm we introduce addresses both these limitations: it does not require synchronization, and keeps only a bounded number of particles alive in the system at any given time. Instead of resampling, we will consider particle branching, where each particle can result in 0 or more offspring. These branching events happen asynchronously and mutually exclusively, i.e. they are processed one at a time.

### 3.1 Local branching decisions

At each stage of the particle filter, particles process observation . Without loss of generality, we can define an ordering on the particles in the order they arrive at . This order need not be independent of the state of the particles .

We keep track of the running average weight of the first particles to arrive at observation in an online manner:

 ¯¯¯¯¯¯Wkn =Wkn for k=1, (11) ¯¯¯¯¯¯Wkn =k−1k¯¯¯¯¯¯Wk−1n+1kWkn for k=2,3,…. (12)

The number of children of particle should depend on the weight of particle

relative to those of other particles. Particles with higher relative weight are more likely to be located in a high posterior probability part of the space, and should be allowed to spawn more child particles.

In the online asynchronous particle system as described here, we do not have access to the weights of future particles when processing . Instead we will compare to the current average weight among particles processed thus far. Specifically, the number of children, which we denote by , will depend on the ratio

 Rkn=Wkn¯¯¯¯¯¯Wkn. (13)

Each child of particle will be assigned a weight such that the total weight of all children has expectation .

There is a great deal of flexibility available in designing a scheme for choosing the number of child particles; we need only be careful to set appropriately. Informally, we would like to be large when is large. If is sampled in such a way that , then we set the outgoing weight . Alternatively, if we are using a scheme which deterministically guarantees , then we set .

A simple approach would be to sample independently conditioned on the weights. In such schemes we could draw each

from some simple distribution, e.g. a Poisson distribution with mean

, or a discrete distribution over the integers

. However, one issue that arises in such approaches where the number of children for each particle is conditionally independent, is that the variance of the total number of particles at each generation can grow without bound. Suppose we start the system with

particles. The number of particles at subsequent stages is given recursively as . We would like to avoid situations in which the number of particles becomes too large, or collapses to 1.

Instead, we will allow to depend on the number of children of previous particles at , in such a way that we can stabilize the total number of particles in each generation. Suppose that we wish for the number of particles to be stabilized around . After particles have been processed, we expect the total number of children produced at that point to be approximately , so that if the number is less than we should allow particle to produce more children, and vice versa. Similarly, if we already currently have more than children, we should allow particle to produce less children. We use a simple scheme which satisfies these criteria, where the number of particles is chosen at random when , and set deterministically when :

 (Mkn+1,Vkn+1)=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩(0,0) w.p. 1−Rkn, if Rkn<1;(1,¯¯¯¯¯¯Wkn) w.p. Rkn, if Rkn<1;(⌊Rkn⌋,Wkn⌊Rkn⌋)if% Rkn≥1 and ∑k−1j=1Mjn+1>min(K0,k−1);(⌈Rkn⌉,Wkn⌈Rkn⌉)if Rkn≥1 and ∑k−1j=1Mjn+1≤min(K0,k−1). (14)

We pause here to take note of the anytime nature of this algorithm — any given particle passing through the system needs only the previous weights in order to make its local branching decisions, not the previous particles themselves. Thus it is possible to run this algorithm for some fixed number of initial particles , inspect the output of the completed particles which have left the system, and then decide whether to continue inference by initializing additional particles.

### 3.2 Computing expectations and marginal likelihoods

Samples drawn from the particle cascade can be used to compute expectations in the same manner as samples from a standard particle filter; that is, given some function , we normalize weights analogously to before and approximate the posterior expectation by

 EX0:N|Y0:N[φ(X0:N)] ≈KN∑k=1¯ωkNφ(xk0:N). (15)

We can also use the particle cascade to define an estimator of the marginal likelihood ,

 ^p(y0:n) =1K0Kn∑k=1Wkn. (16)

The form of this estimate is fairly distinct from the standard SMC estimators in Section 2. In terms of predictive densities, one can think of as

 ^p(y0:n)=^p(y0)n∏i=1^p(yi|y0:i−1) (17)

where

 ^p(y0) =1K0K0∑k=1Wk0, ^p(yn|y0:n−1) =∑Knk=1Wkn∑Kn−1k=1Wkn−1 for n≥1. (18)

It is interesting to note that the incrementally updated statistics in the denominator of are very directly tied to the marginal likelihood estimate; that is, .

### 3.3 Theoretical properties, unbiasedness

Here we show that the marginal likelihood estimator defined in Eq. 16 is unbiased; i.e. . In a very general form, we can summarize the particle cascade algorithm as

• Initialization at : for sample and compute

• At each time , perform

1. Resampling step: resample to obtain

2. Forward simulation step: for sample , and set and

We denote by the space of bounded real-valued functions on a space . We make the following assumption on the resampling step.

Assumption R. For any we have and for any

 E⎡⎣Kn+1∑k=1˜Wkn φ(Xk,n+10:n)∣∣ ∣∣Fn⎤⎦=Kn∑k=1Wkn φ(Xk,n0:n) (19)

where denotes the natural filtration associated to all the random variables generated by the particle algorithm before resampling at time . We also denote by denotes the natural filtration associated to all the random variables generated by the particle algorithm just after the resampling step at time .

The resampling step of the particle cascade corresponds to

 E⎡⎣Kn+1∑k=1˜Wkn φ(Xk,n+10:n)∣∣ ∣∣Fn⎤⎦ =E[Kn∑k=1Mkn+1Vkn+1 φ(Xk,n0:n)∣∣ ∣∣Fn]=Kn∑k=1Wkn φ(Xk,n0:n), (20)

i.e. each particle has offspring of associated weight so that

Proposition 1. Assume Assumption R holds and is in for any . Then we have

 E[^p(y0:n)]=p(y0:n). (21)

Proof of Proposition 1. The proof follows from a backward induction. We have

 E[^p(y0:n)] =E[E[1K0Kn∑k=1Wkn∣∣ ∣∣˜Fn−1]] =E⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣1K0Kn∑k=1˜Wkn−1∫f(xn|Xk,n0:n−1)g(yn|Xk,n0:n−1,xn,y0:n−1)dxnp(yn|Xk,n0:n−1,y0:n−1)⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦ =E[E[1K0Kn∑k=1˜Wkn−1p(yn|Xk,n0:n−1,y0:n−1)∣∣ ∣∣Fn−1]] =E⎡⎣E⎡⎣1K0Kn−1∑k=1˜Wkn−2p(yn−1:n|Xk,n−10:n−2,y0:n−2)∣∣ ∣∣Fn−2⎤⎦⎤⎦ =E⎡⎣E⎡⎣1K0Kn−2∑k=1Wkn−2p(yn−1:n|Xk,n−20:n−2,y0:n−2)∣∣ ∣∣Fn−2⎤⎦⎤⎦ =p(y0:n).

## 4 Active bounding of memory usage

In an idealized computational environment, with infinite available memory, our implementation of the particle cascade could begin by launching (a very large number) particles simultaneously which then gradually propagate forward through the system. In practice, only some finite number of particles, probably much smaller than , can be simultaneously simulated efficiently. Furthermore, the initial particles are not truly launched all at once, but rather in a sequence, introducing a dependency in the order in which particles arrive at each observation .

While the resampling scheme in Eq. 14 is designed to stabilize the number of particles over time, we can still see an explosion in the number of particles . The degree to which the particle count becomes unstable depends on the extent to which the ordering of the particles is permuted as we progress to each . In Fig. 1 we compare a best-case situation where the ordering of particles at is completely independent of the ordering of particles at , to a worst-case situation where the ordering of particles is completely preserved from to . In practice, a naïve implementation of the incremental resampling scheme will have a very strong dependence in ordering across — a particle which is one of the first to reach stage is quite likely one of the first to reach stage as well. Figure 1: Number of particles Kn at each of n=0,…,49 for a one-dimensional linear Gaussian model, initialized with 100 particles. (left) When the order of the particles arriving at each n is subject to a random permutation, then the number of particles is reasonably stable, staying at or near 100. (right) When the order of the particles arriving at each n is completely deterministic, then the total number of particles quickly explodes, in this case exceeding 15000 by n=11.

Our implementation of the particle cascade addresses these issues by explicitly injecting randomness into the execution order of particles, and by imposing a machine-dependent hard cap on the number of simultaneous extant processes. This permits us to run our particle filter system indefinitely, for arbitrarily large initial particle counts , while consuming only a fixed computational budget.

Each particle in our implementation runs as an independent operating system process. In order to efficiently run a large number of particles, we impose a hard limit limit on the total number of particles which can simultaneously exist in the particle system; most of these will generally be sleeping processes. The ideal choice for this number will vary based on hardware capabilities, but in general should be made as large as possible.

Scheduling across particles is managed via a global first-in random-out process queue of length ; this can equivalently be conceptualized as a random-weight priority queue. Each particle corresponds to a single live process, augmented by a single additional control process which is responsible only for spawning additional initial particles (i.e. incrementing the initial particle count ). When any particle arrives at any likelihood evaluation , it computes its target number of child particles and outgoing particle weight . If it immediately terminates; otherwise it enters the queue. Once this particle either enters the queue or terminates, some other process continues execution — this process is chosen uniformly at random, and as such may be a sleeping particle at any stage , or it may instead be the control process which then launches a brand new particle. At any given time, there are some number of particles currently in the queue, and so the probability of resuming any particular individual particle, or of launching a new particle, is . If the particle released from the queue has exactly one child to spawn, it advances to the next observation and repeats the resampling process. If, however, a particle has more than one child particle to spawn, rather than launching all child particles at once it launches a single particle to simulate forward, decrements the total number of particles left to launch by one, and itself re-enters the queue.

In the event that the process count is fully saturated (i.e. the process queue is full), then we forcibly prevent particles from duplicating themselves and creating new children. If we release a particle from the queue which seeks to launch additional particles when the queue is full, we instead collapse all the remaining particles into a single particle; this single particle represents a “virtual” set of particles, but does not actually create a new process and requires no additional CPU or memory resources. We keep track of a particle count multiplier that we propagate forward along with the particle. All particles are initialized with , and then when a particle collapse takes place, update their multiplier at to .

This affects the way in which running weight averages are computed; suppose a new particle arrives with multiplier and weight . We incorporate all these values into the average weight immediately, and update taking into account the multiplicity, with

 ¯¯¯¯¯¯Wkn =k−1k+Ckn−1¯¯¯¯¯¯Wk−1n+Cknk+Ckn−1Wkn for k=2,3,…. (22)

This does not affect the computation of the ratio . We preserve the particle multiplier, until we reach the final ; then, after all forward simulation is complete, we re-incorporate the particle multiplicity when reporting the final particle weight . The system is initialized by seeding the system with a number of initial particles at , creating active initial processes.

The ideal choice for the process count constraint may vary across operating systems and hardware configurations; online optimization of this parameter remains an avenue for future work.

## 5 Experiments Figure 2: All results are reported over multiple independent replications, shown here as independent lines. (top) Convergence of estimates to ground truth vs. number of particles, shown as (left) MSE of marginal probabilities of being in each state for every observation n in the HMM, and (right) MSE of the latent expected position in the linear Gaussian state space model. (bottom) Convergence of marginal likelihood estimates to the ground truth value (marked by a red dashed line), for (left) the HMM, and (right) the linear Gaussian model.

We run a preliminary set of experiments on two simple state space models, each with

observations, with the goal of demonstrating the overall validity and utility of the particle cascade algorithm. Results are presented here on two simple models. The first is a hidden Markov model (HMM) with 10 latent discrete states, each with an associated Gaussian emission distribution; the second is a one-dimensional linear Gaussian model. In both models we can use an exact algorithm to compute posterior marginals at each

and compute the marginal likelihood .

These experiments are not designed to stress-test the particle cascade; rather, they are designed to show that performance of the particle branching scheme closely approximates that of the fully synchronous particle filter, even in a small-data small-complexity regime where we expect particle filter performance to be very good. In addition to comparing to a particle filter which resamples synchronously, we also compare to a worst-case particle filter in which we never resample, instead propagating particles forward deterministically with a single child particle at every . While the statistical (per-sample) efficiency of this approach is quite poor, it is fully parallelizable with no blocking operations in the algorithm at all, and thus provides a ceiling estimate of the raw sampling speed attainable in our overall implementation.

We also benchmark against what we believed to be the most practically competitive similar approach, iterated conditional SMC . Iterated conditional SMC corresponds to the particle Gibbs algorithm in the case where parameter values are known; by using a particle filter sweep as a step within a larger MCMC algorithm, iCSMC provides a statistically valid approach to sampling from a posterior distribution by repeatedly running sequential Monte Carlo sweeps each with a fixed number of particles. One downside to iCSMC is that it does not provide an estimate of the marginal likelihood.

On both these models we see the statistical efficiency of the particle cascade is approximately in line with the true particle filter, slightly outperforming the iCSMC algorithm and significantly outperforming the fully parallelized non-resampling approach. This suggests that the approximations made by computing weights at each based on only the previously observed particles, and the total particle count limit imposed by , do not have an adverse effect on overall performance. In Fig. 2 we plot convergence per particle to the true posterior distribution, as well as convergence in our estimate of the normalizing constant.

### 5.1 Performance and scalability Figure 3: (top) Comparative convergence rates between SMC alternatives including our new algorithm, and (bottom) estimation of marginal likelihood, by time. Results are shown for (left) the hidden Markov model, and (right) the linear Gaussian state space model.

Although values will be implementation-dependent, we are ultimately interested not in per-sample efficiency but rather in our rate of convergence over time. We record wall clock time for each algorithm for both of these models; the results for convergence of our estimates of values and marginal likelihood are shown in Fig. 3. These particular experiments were all run on Amazon EC2, in an 8-core environment with Intel Xeon E5-2680 v2 processors. The particle cascade provides a much faster and more accurate estimate of the marginal likelihood than the competing methods, in both models. Convergence in estimates of values is quick as well, faster than the iCSMC approach. We note that for very small numbers of particles, running a simple particle filter is faster than the particle cascade, despite the blocking nature of the resampling step. This is due to the overhead incurred by the particle cascade in sending an initial flurry of particles into the system before we see any particles progress to the end; this initial speed advantage diminishes as the number of samples increases. Furthermore, in stark contrast to the simple SMC method, there are no barriers to drawing more samples from the particle cascade indefinitely. On this fixed hardware environment, our implementation of SMC, which aggressively parallelizes all forward particle simulations, exhibits a dramatic loss of performance as the number of particles increases from to , to the point where simultaneously running particles is simply not possible in a feasible amount of time.

We are also interested in how the particle cascade scales up to larger hardware, or down to smaller hardware. A comparison across 5 different hardware configurations is shown in Fig. 4. Figure 4: Average time to draw a single complete particle on a variety of machine architectures. Queueing rather than blocking at each observation improves performance, and appears to improve relative performance even more as the available compute resources increase. Note that we see only average time per sample, here; this is not a measure of statistical efficiency. The high speed of the non-resampling algorithm is not sufficient to make it competitive with the other approaches.

## 6 Discussion

The particle cascade has broad applicability, appropriate for all SMC and particle filtering inference applications. For example, constructing an appropriate sequence of densities for SMC is possible in arbitrary probabilistic graphical models, including undirected graphical models; see e.g. the sequential decomposition approach of . We are particularly motivated by the SMC-based probabilistic programming systems that have recently appeared in the literature [9, 10]. In both references it was suggested that primary performance bottleneck in their inference algorithms was barrier synchronization, something we have done away with entirely. What is more, while particle MCMC methods are particularly appropriate when there is a clear boundary that can be exploited between between parameters of interest and nuisance state variables, in a growing number of applications in the probabilistic programming and SMC communities, parameters values are generated as part of the state trajectory itself, leaving no explicitly denominated latent parameter variables per se. The particle cascade is particularly relevant to such approaches.

Finally, an attractive property of this algorithm is that it yields an unbiased estimate of the marginal likelihood, and thus can be plugged directly into PIMH, SMC

, and other so-called pseudomarginal methods.

#### Acknowledgments

Yee Whye Teh’s research leading to these results has received funding from EPSRC (grant EP/K009362/1) and the ERC under the EU’s FP7 Programme (grant agreement no. 617411). Frank Wood is supported under DARPA PPAML. This material is based on research sponsored by DARPA through the U.S. Air Force Research Laboratory under Cooperative Agreement number FA8750-14-2-0004. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation heron. The views and conclusions contained herein are those of the authors and should be not interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, the U.S. Air Force Research Laboratory of the U.S. Government.