Efficient Learning of the Parameters of Non-Linear Models using Differentiable Resampling in Particle Filters

It has been widely documented that the sampling and resampling steps in particle filters cannot be differentiated. The reparameterisation trick was introduced to allow the sampling step to be reformulated into a differentiable function. We extend the reparameterisation trick to include the stochastic input to resampling therefore limiting the discontinuities in the gradient calculation after this step. Knowing the gradients of the prior and likelihood allows us to run particle Markov Chain Monte Carlo (p-MCMC) and use the No-U-Turn Sampler (NUTS) as the proposal when estimating parameters. We compare the Metropolis-adjusted Langevin algorithm (MALA), Hamiltonian Monte Carlo with different number of steps and NUTS. We consider two state-space models and show that NUTS improves the mixing of the Markov chain and can produce more accurate results in less computational time.



There are no comments yet.


page 19


Markov chain Monte Carlo algorithms with sequential proposals

We explore a general framework in Markov chain Monte Carlo (MCMC) sampli...

Identification of jump Markov linear models using particle filters

Jump Markov linear models consists of a finite number of linear state sp...

Particle Filtering for Stochastic Navier-Stokes Signal Observed with Linear Additive Noise

We consider a non-linear filtering problem, whereby the signal obeys the...

Estimating the Competitive Storage Model with Stochastic Trends in Commodity Prices

We propose a state-space model (SSM) for commodity prices that combines ...

Inference of Stochastic Disease Transmission Models Using Particle-MCMC and a Gradient Based Proposal

State-space models have been widely used to model the dynamics of commun...

Slice Sampling Particle Belief Propagation

Inference in continuous label Markov random fields is a challenging task...

Automatically Differentiable Random Coefficient Logistic Demand Estimation

We show how the random coefficient logistic demand (BLP) model can be ph...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

State-Space Models (SSMs) have been used to model dynamical systems in a wide range of research fields (see [doucet2001sequential] for numerous examples). SSMs are represented by two stochastic processes: and where indicates the hidden state which evolves according to a Markov process and is the observation (both at time ), such that


The initial latent state has initial density denoted . The SSM is parameterised by an unknown static parameter contained in the parameter space . The transition and observation densities are given by (1) and (2), respectively. In this paper we focus on Bayesian parameter estimation in SSMs using Particle Markov Chain Monte Carlo (p-MCMC), as first proposed in [andrieu_doucet_holenstein_2010]. This approach combines two Monte Carlo methods that use repeated sampling techniques to obtain numerical estimates of a target distribution , for which exact inference is intractable. The two methods are Markov Chain Monte Carlo (MCMC), as described in [robert_2015, ravenzwaaij_cassey_brown_2016, Roberts_2004], and Sequential Monte Carlo (SMC) i.e. a particle filter, as described in [gordon1993novel, arulampalam_maskell_gordon_clapp_2002, 15yearslater].

MCMC algorithms such as Metropolis-Hastings (M-H) often use random walk sampling within the proposal. Such proposals can struggle to enable the MCMC to reach the stationary distribution when estimating large numbers of parameters. A related issue can occur with Gibbs samplers when the correlation between parameters is high. These issues can result in the sampler getting stuck in local maxima within . Hamiltonian Monte Carlo (HMC), as described in [neal2012mcmc]

, is an approach that simulates from a problem-specific Hamiltonian system to generate the samples used in the MCMC. HMC has been seen to be effective when estimating parameters in models when the target distribution is complex or multi-modal but is sensitive to hyperparameters which have to be determined by the user. An adaptive version of HMC called the No-U-Turn Sampler (NUTS)

[NUTS] automates the selection of these hyperparameters. Probabilistic programming languages such as Stan [stan] and PyMC3 [pymc3] are tools that have been developed to allow users to define and make inferences about probabilistic models using NUTS.

Particle filters have been used in many areas of research, such as finance [filteringfinance], disease modelling [filteringdisease] and multiple target tracking [filteringmultipletargets] to infer time-dependent hidden states. The original contribution of [andrieu_doucet_holenstein_2010]

uses a particle filter to calculate an unbiased estimate of the often intractable likelihood for

. A M-H algorithm with a random walk proposal was used to sample from . Using such a proposal in p-MCMC will inherit the same issues as described above in the context of MCMC generically. To make use of more sophisticated proposals the gradient of the log-posterior of the parameter, , needs to be estimated. It has been noted in [diffPF_1, diffPF_2, diffPF_3] that the standard particle filter cannot be differentiated due to the stochastic nature of both the sampling and resampling steps that are inherently part of the particle filter.

As will be explained in more detail later in this paper (in Sections 3 and 4.1), the reparameterisation trick was proposed in [reparam]

to reformulate the sampling operation into a differentiable function by sampling a noise vector in advance and defining the likelihood for

as being a deterministic function of this sampled noise vector. However, resampling remains problematic since after resampling all weights are equal. More specifically, the gradients cannot be calculated since the new particles’ states are not differentiable w.r.t. the weights that are input to resampling. Recent work in machine learning has focused on how to modify the resampling step to make it differentiable such that gradients can be backpropagated in the context of Particle Filter Recurrent Neural Networks (PF-RNNs)

[diffPF_1]. A differentiable version of the PF was introduced in [diffPF_3]: [diffPF_3] goes on to use the differentiable PF to learn measurement and motion models end-to-end using back-propagation and algorithmic priors over the states and this is shown to improve the state estimation performance. A novel network architecture can be seen in [diffPF_5] that trains a particle transformer which then replaces traditional resampling. Soft resampling was introduced in [diffPF_4] and utilised in [diffPF_1] and considers an approximation which involves drawing from the distribution , with , representing a trade-off parameter. If = 1, regular resampling is used and if = 0 the algorithm performs subsampling. The new weights are calculated by


This gives non-zero estimates of the gradient because the dependency on the previous weights is maintained. By changing , this method trades resampling quality for biased gradient estimates. A fully differentiable particle filter is described in [diffPF_6] that resamples by using Optimal Transport ideas seen in [peyre2020computational]. A method called stop-gradient resampling is outlined in [diffresam_1]

which only modifies the messages used in backpropagation, is implemented in PyTorch and makes use of automatic differentiation libraries. In


, they utilise conditional normalisation flows to construct flexible probability distributions for differentiable particle filters. A comparison of differentiable filters can be seen in

[kloss2021train]. The approach described in this paper differs from this body of previous work since we focus on ensuring resampling is differentiable without having to change how resampling operates.

Outside of the Neural Network field, extensions of the original p-MCMC algorithm have focused on including gradient information when proposing new parameters. Reference

[poyiadjis2011particle] shows how to estimate the score (gradient) of the log-likelihood and the observed information matrices at in SSMs using particle filter methods. The two methods proposed run with computational complexity and , respectively. The first has a linear computational cost but the performance decreases over time. The second has a computational cost that increases quadratically with the number of particles . [nemeth2013particle] built on this work to compute these terms with computational complexity

and avoids the quadratically increasing variance caused by particle degeneracy. In

[particleLangevin, inproceedings, Second-order, Dahlin_2014] the authors utilise the previous work of [nemeth2013particle] to recursively estimate the score (gradient) of the log-likelihood at . References [particleLangevin] and [inproceedings] include Langevin Monte Carlo (LMC) methods seen in [girolami_calderhead_2011] whilst [Second-order] and [Dahlin_2014] include first- and second-order Hessian information about the posterior in the proposal. Use of the Hessian is shown to improve the mixing of the Markov chain at the stationary phase and decrease the length of burn-in. However, calculating a matrix of the second-order partial derivatives can become infeasible when the dimensionality, , becomes large. While [Second-order], [Dahlin_2014] do mention using HMC dynamics within p-MCMC, to the best of the authors’ knowledge, no implementation of this approach has been described in the literature up to now.

We aim to complement the recent literature seen in machine learning that addresses the problem of differentiating resampling. Our core contribution is to fix the random numbers used within the resampling step. This conditioning on the input to resampling, results in the subsequent particle derivative calculations being a function of the parent particle. The gradients can then be efficiently estimated and utilised within the framework of p-MCMC and specifically used to calculate gradient-based proposals for : more specifically, this allows us to use NUTS as the proposal. Another novel contribution, relative to the previous work on differentiable particle filters in the neural network community, is that we provide full Bayesian parameter estimates (including variances). This differs from the present literature on differentiable particle filtering which focuses exclusively on point-estimates of parameters. We also compare NUTS’ performance with that achieved by HMC and Metropolis-adjusted Langevin algorithm (MALA).

An outline of the paper is as follows: in Section 2 we describe a generic particle filter followed by a description of the difficulties associated with differentiating the sampling and resampling steps. We describe how to calculate the likelihood and gradients in Section 3, the methods to propagate the derivatives of the particle weights in Section 4 and how we extend the reparameterisation trick to include the stochastic input to resampling in section 5. We test the likelihood and gradient estimates, explain particle-HMC (p-HMC) and particle-NUTS (p-NUTS) and detail comparative numerical results in the context of two applications in Section 6. Concluding remarks are described in Section 7.

2 Particle filtering background

Assume we have considered time-steps, obtaining data at each increment of given by . The state sequence grows with time where has dimensions. The dynamics and likelihood are parameterised by (which has dimensions) such that


If is known, we can run a (conventional) particle filter.

2.1 Particle Filter

At every timestep , the particle filter draws samples (particles) from a proposal distribution, , which is parameterised by the sequence of states and measurements. The samples are seen as statistically independent and each represents a different hypothesis for the sequence of states of the system. The collection of samples of

represents the probability density function (pdf) of the dynamic system. The

th sample has an associated weight,

, which indicates the relative probability that the

th sample, , is the true state of the system. The weights at are set to be . The proposal distribution is constructed recursively as


such that we can pose an estimate with respect to the joint distribution,

, as follows:


This is an unbiased estimate, where (for 1)


and is a recursive formulation for the unnormalised weight, , with incremental weight


For =1


2.2 Choice of proposal

Three commonly used options for the proposal distribution are:

  1. Using the dynamic model as the proposal


    which simplifies the weight update to

  2. In certain situations it is possible to use the “optimal” proposal which is


    with weights updated according to


    Note that the term “optimal” in this context of the particle proposal means that the variance of the incremental particle weights at the current time-step is minimized. In fact, this variance is zero since the weight in (14) is independent of (as explained in [1336055731117]).

  3. Using the Unscented Transform, as explained in [kalmanFil].

2.3 Estimation with Respect to the Posterior

It is often the case that we wish to calculate estimates with respect to the posterior, , which we can calculate as follows:


Note that if ,


in line with (6), such that




are the normalised weights.

Equation (18) is a biased estimate since it is a ratio of estimates, in contrast with (6).

2.4 Resampling

As time evolves, the normalised weights will become increasingly skewed such that one of the weights given by (

19) becomes close to unity and the others approach zero. It is often suggested that monitoring the effective sample size, can be used to identify the need to resample, where


There are many resampling methods, some of which are outlined and evaluated in [resamplingmethods] but they all share the same purpose—stochastically replicate particles with higher weights whilst eliminating ones with lower weights. Multinomial resampling is commonly used and involves drawing from the current particle set times proportional to its weight. The associated distribution is defined by


To keep the total unnormalised weight constant (such that the approximation (16) is the same immediately before and after resampling), we assign each newly-resampled sample an unnormalised weight


Note this is such that the normalised weights after resampling are .

3 Calculating the likelihood and gradients

We pose the calculation of the likelihood of the parameter as the calculation of the approximation in (16), (ie the sum of unnormalised particle filter weights), with .

Differentiating the weights gives an approximation to the gradient of the likelihood222We note that this approach differs from that advocated in [inproceedings, Second-order, Dahlin_2014], which use a Fixed-Lag filter (with a user-specified lag) to approximate the derivatives. In contrast, we explicitly calculate the derivatives of the approximation to the likelihood.:


For numerical stability, it is typically preferable to propagate values in logs. Applying the Chain Rule to (

16) and (23) gives


Note that in (25), the weights outside the logs are normalised, while the weights inside are not. The log weights can be calculated recursively as






So, if we can differentiate the single measurement likelihood, transition model and proposal, we can calculate (an approximation to) the derivative of the log-likelihood for the next time step, thus recursively approximating the log-likelihood derivatives for each time step. While this is true, there are some challenges involved, which we now discuss in turn.

If the particle filter is using the transition model as the dynamics (as in (11)), the likelihood in the weight update does not explicitly depend on and we might initially suppose that . If this were the case, an induction argument using (27) would show that the weight derivatives were always zero and therefore give an approximation of zero for the gradient of the likelihood for . This seems intuitively incorrect. Indeed, the flaw in this reasoning is that, in fact, the likelihood (somewhat implicitly) does depend on since depends on . Applying the Chain Rule gives



is a random variable sampled from the proposal, we use the

reparameterisation trick[reparam]: we consider the derivative for a fixed random number seed. More precisely, let be the vector of standard random variables used when sampling from the proposal such that, if is known, then is a deterministic function (that can be differentiated) of . We then consider


where is considered fixed and, most importantly, (32) then involves differentials that can be calculated.

As a simple example, consider sampling from the dynamics with a random walk proposal and

being the standard deviation of the process noise. This is such that




which can be calculated recursively and then used to calculate (29).

More generally, the derivatives of the weight are non-zero and, to calculate these derivatives, we have to propagate the particle derivatives .

4 Calculating the derivatives

In order to propagate the derivatives of the particle weights we need to calculate:

  • The particle derivatives,

  • The derivatives of the proposal pdfs,

  • The derivatives of the prior log pdfs,

  • The derivatives of the single measurement likelihood log pdfs,


In this section, we show how to calculate these derivatives in turn.

4.1 Derivative of New Particle States

We now describe how to calculate . Suppose the proposal takes the following form:


where and

are functions of the old particle state, the measurement and the parameter. Such a generic description can articulate sampling from the prior, or defining a proposal using a Kalman filter with the predicted mean and covariance given by the motion model.

If we sample the proposal noise in advance, the new particle states can be written as a deterministic function


We would like to compute the derivative of this w.r.t the parameter. Care must be taken however, since is itself a function of (if a different was chosen, a different would have been sampled).


Note that in (41) is not the same as in (42) — see Appendix A for the distinction. Note also that the terms here are matrix-valued in general: is an matrix, and and are matrices.

Also note that for , implicitly depends on , which itself depends on . Hence we need the total derivative .

4.2 Derivative of Proposal

To differentiate the log proposal pdf, we note that we can write it as


where (dropping the fixed values and for notational convenience)


where we emphasise again that we assume the proposal is Gaussian. We then get


where we denote and for brevity. The derivatives of are given in Appendix C.

4.3 Derivative of the Prior

We now describe how to calculate . Let


where here we assume that the transition model has additive Gaussian noise that is independent of . Then


where we denote that and for brevity. Note that this makes clear that since the realisation of the sampled particles, , are, in general, dependent on , (51) includes . Also note that these derivatives of are evaluated at (the prior mean) and not at (the proposal mean) as was the case in (48).

4.4 Derivative of likelihood

We now describe how to calculate . Let


where we assume that the likelihood is Gaussian with a variance that is independent of . Then


where we denote and for brevity.

5 Resampling for a Differentiable Particle Filter

Unlike a standard particle filter, we also need to resample the weight derivatives


as well as the particle derivatives




be the normalised cumulative weights and the index sampled for particle be given by


where are independent for each particle and time step. Note that the particle indices are sampled according to a Categorical distribution giving a Multinomial resampler, where each index is resampled with probability proportional to its weight. Other resampling schemes are possible and could reduce the variance of any estimates.

The resampled weights are set to be the same as each other, while preserving the original sum:


From (60), it is clear that


In order to convert this to log weights, applying the Chain Rule gives


where are the normalised weights.

To get the particle gradient note that (where is differentiable),


Since except where




almost surely, so the derivative is obtained by taking the derivative of the parent particle.

It should be noted that because particle resampling adds potential discontinuities in the derivative calculation, the particle filter should be designed so resampling is required as little as possible.

Input: ,
1 Initialise: , , , for t = 1, …, T do
2       If necessary, resample , , , as described in Section 5. Sample the new particles and calculate the partial derivatives
from (40). Get the proposal mean, and covariance, for each particle , as well as their derivatives
seen in Section 4.1. Get the particle gradients using Subsection 4.1. Get the derivatives of the prior, proposal and likelihood using Subsections 4.2, 4.3, 4.4. Evaluate the new log weights and log weight derivatives using (26) and (27), respectively.
3 end for
Evaluate the final log likelihood, , and associated derivative, , using (16) and (25), respectively.
Algorithm 1 Particle Filter

6 Numerical Experiments

We test the estimates of the log-likelihood and the gradient w.r.t. for a simple model using Algorithm 1. The proposal is derived from a Kalman Filter with the prior mean, , and covariance, , given by the motion model. We show in Appendix B how to calculate the true proposal mean and covariance for each particle , as well as their derivatives.

We run Algorithm 1 and compute an estimate of the log-likelihood and associated gradient across a range of 500 values of , equally spaced from 1 - 4. The true value of =2, 2000 and =250 observations.

The log-likelihood and gradient of the log-likelihood, at each instance of , can be seen at Figure 1 (a) and Figure 1 (b), respectively. There is no obvious difference in the plots when comparing the estimates given from the Kalman Filter shown in black and the estimates produced by the particle filter when using the reparameterisation trick for resampling in red (see section 5) and multinomial resampling in blue (21). However, when looking at a zoomed in instance of these plots the log-likelihood and gradient of the log-likelihood (seen in the Figure 1 (c) and Figure 1 (d), respectively), it is evident that using the reparameterisation trick for resampling produces piecewise linear continuous estimates.

Figure 1: Plots of the log-likelihood (a), gradient of the log-likelihood w.r.t. (b), a zoomed in section of the log-likelihood plot (c) and associated gradient (d). All plots: The true values given from the Kalman Filter (black), particle filter using reparameterisation trick resampling (red) and multinomial resampling (blue).

6.1 Estimation of parameters

If we have a prior, , for which the likelihood, , or log-likelihood333The log-likelihood is likely to be more stable numerically. can be calculated, we can run p-MCMC to estimate . The gradient of the log-posterior of is given by


where is the gradient of the log-prior and is the gradient of the log-likelihood. If we know , it is possible to guide proposals to areas of higher probability within .

6.1.1 Hamiltonian Monte Carlo

HMC is a gradient based algorithm which uses Hamilton’s equations to generate new proposals. Since it uses gradients, it is better at proposing samples than a random-walk proposal. It was first developed in the late 1980s [DUANE1987216] and in the last decade it has become a popular approach when implementing MCMC [neal2012mcmc]. In the following section we give a high level conceptual overview of HMC and direct the reader to [betancourt2017conceptual] for a more thorough explanation. Hamilton’s equations are a pair of differential equations that describe a system in terms of its position and momentum where the potential of the system is defined by . Conceptually, they are a set of differential equations that govern the movement of a frictionless puck across the potential surface. HMC introduces a momentum vector which moves the puck at on a random trajectory to . Intuitively, the puck will slow down or go backwards when going up a slope keeping the proposed in regions of higher probability. The total energy or Hamiltonian of a system can be expressed as


and is comprised of the sum of the Kinetic energy , which is dependent on where in the parameter space the puck is, and the potential energy , which is independent on the momentum .

Hamilton’s equations describe how the system evolves as a function of time and are:

The joint density is


which means and are independent samples from the joint density so can be sampled from any distribution. For simplicity the distributed used is often chosen to be Gaussian and we make that choice here. Many numerical integration methods exist which discretise Hamilton’s equations and can be seen in [hmcintegrators] with the leapfrog method being the go-to method for HMC. Leapfrog is a symplectic method which means the Hamiltonian remains close to its initial value, though not equal to it exactly, as the system is simulated. This means samples are generated with a high accept/reject ratio so the target is explored efficiently. Leapfrog is also a reversible method which allows detailed balance to be maintained. Finally, Leapfrog is a low-order method which uses relatively few gradient evaluations per step and is therefore computationally cheap. Algorithm 1 is run every time a gradient evaluation is made within the Leapfrog numerical integrator.

The samples generated are governed by a predetermined number of steps of size , decided by the user. HMC is highly sensitive to the choice of these parameters, particularly . If is too large, computation time can be wasted as the trajectory can make a U-turn and end close to where it started and, if too small, the proposal can exhibit random-walk behaviour. P-HMC is outlined in Algorithm 2 with the notation of the log-likelihood and gradient of the parameter posterior in (67) changed to and , respectively.

6.1.2 No-U-Turn Sampler

NUTS [NUTS] is an extension of HMC and eliminates the need to specify by adaptively finding an optimal number of steps. NUTS builds a trajectory exploring states forwards and backwards in time by building a balanced tree, i.e. it takes forwards and backwards steps doubling in number each iteration. A Bernoulli trial is undertaken to decide the initial direction, and after the completion of the subtree a new trial is undertaken and a further

steps are taken. This process continues until the trajectory begins to double back on itself (ie the puck starts to make a “U-Turn”) at which point a state is sampled from a complete balanced tree. By sampling from a balanced tree the process is reversible and detailed balance is maintained. To avoid excessively large trees, which usually result from choosing too small a step-size, it is necessary to set a max-tree depth. To find a reasonable initial step-size we used the heuristic approach in

[NUTS] but we do not use dual averaging to automatically tune