1 Introduction
State-space models (SSMs) have been used to model dynamical systems in a wide range of research fields (see [doucet2001sequential] for numerous examples). An SSM is represented by two stochastic processes, $\{x_t\}_{t \geq 0}$ and $\{y_t\}_{t \geq 1}$, where $x_t$ indicates the hidden state, which evolves according to a Markov process, and $y_t$ is the observation (both at time $t$), such that
(1) $x_t \mid x_{t-1} \sim f_\theta(x_t \mid x_{t-1})$
(2) $y_t \mid x_t \sim g_\theta(y_t \mid x_t).$
The initial latent state $x_0$ has initial density denoted $\mu_\theta(x_0)$. The SSM is parameterised by an unknown static parameter $\theta$ contained in the parameter space $\Theta$. The transition and observation densities are given by (1) and (2), respectively. In this paper we focus on Bayesian parameter estimation in SSMs using Particle Markov Chain Monte Carlo (pMCMC), as first proposed in [andrieu_doucet_holenstein_2010]. This approach combines two Monte Carlo methods that use repeated sampling techniques to obtain numerical estimates of a target distribution $p(\theta \mid y_{1:T})$, for which exact inference is intractable. The two methods are Markov Chain Monte Carlo (MCMC), as described in [robert_2015, ravenzwaaij_cassey_brown_2016, Roberts_2004], and Sequential Monte Carlo (SMC), i.e. a particle filter, as described in [gordon1993novel, arulampalam_maskell_gordon_clapp_2002, 15yearslater].
MCMC algorithms such as Metropolis-Hastings (MH) often use random-walk sampling within the proposal. Such proposals can struggle to enable the MCMC to reach the stationary distribution when estimating large numbers of parameters. A related issue can occur with Gibbs samplers when the correlation between parameters is high. These issues can result in the sampler getting stuck in local maxima within $\Theta$. Hamiltonian Monte Carlo (HMC), as described in [neal2012mcmc], is an approach that simulates from a problem-specific Hamiltonian system to generate the samples used in the MCMC. HMC has been seen to be effective when estimating parameters in models where the target distribution is complex or multimodal, but it is sensitive to hyperparameters which have to be determined by the user. An adaptive version of HMC called the No-U-Turn Sampler (NUTS) [NUTS] automates the selection of these hyperparameters. Probabilistic programming languages such as Stan [stan] and PyMC3 [pymc3] are tools that have been developed to allow users to define and make inferences about probabilistic models using NUTS.
Particle filters have been used in many areas of research, such as finance [filteringfinance], disease modelling [filteringdisease] and multiple target tracking [filteringmultipletargets], to infer time-dependent hidden states. The original contribution of [andrieu_doucet_holenstein_2010]
uses a particle filter to calculate an unbiased estimate of the often intractable likelihood $p(y_{1:T} \mid \theta)$. An MH algorithm with a random-walk proposal was used to sample from $p(\theta \mid y_{1:T})$. Using such a proposal in pMCMC will inherit the same issues as described above in the context of MCMC generically. To make use of more sophisticated proposals, the gradient of the log-posterior of the parameter, $\nabla_\theta \log p(\theta \mid y_{1:T})$, needs to be estimated. It has been noted in [diffPF_1, diffPF_2, diffPF_3] that the standard particle filter cannot be differentiated due to the stochastic nature of both the sampling and resampling steps that are inherently part of the particle filter.
As will be explained in more detail later in this paper (in Sections 3 and 4.1), the reparameterisation trick was proposed in [reparam] to reformulate the sampling operation into a differentiable function by sampling a noise vector in advance and defining the likelihood for $\theta$ as a deterministic function of this sampled noise vector. However, resampling remains problematic since after resampling all weights are equal. More specifically, the gradients cannot be calculated since the new particles’ states are not differentiable w.r.t. the weights that are input to resampling. Recent work in machine learning has focused on how to modify the resampling step to make it differentiable such that gradients can be backpropagated in the context of Particle Filter Recurrent Neural Networks (PF-RNNs) [diffPF_1]. A differentiable version of the PF was introduced in [diffPF_3], which goes on to use the differentiable PF to learn measurement and motion models end-to-end using backpropagation and algorithmic priors over the states; this is shown to improve the state estimation performance. A novel network architecture can be seen in [diffPF_5], which trains a particle transformer that then replaces traditional resampling. Soft resampling was introduced in [diffPF_4] and utilised in [diffPF_1]; it considers an approximation which involves drawing particle indices from the mixture distribution $q(i) = \alpha \tilde{w}^i + (1 - \alpha)/N$, with $\alpha \in [0, 1]$ representing a trade-off parameter. If $\alpha = 1$, regular resampling is used and if $\alpha = 0$ the algorithm performs uniform subsampling. The new weights are calculated by
(3) $\tilde{w}^{i\prime} = \frac{\tilde{w}^i}{\alpha \tilde{w}^i + (1 - \alpha)/N}.$
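As an illustration (our own sketch, not code from any of the cited works), soft resampling with trade-off parameter alpha draws indices from the mixture of the weights and a uniform distribution, then reweights so the dependency on the original weights survives:

```python
import numpy as np

def soft_resample(particles, weights, alpha, rng):
    """Soft resampling: draw indices from q(i) = alpha*w_i + (1-alpha)/N,
    then reweight by w_i / q(i) so the gradient w.r.t. the weights survives.
    alpha=1 recovers regular multinomial resampling; alpha=0 is uniform subsampling."""
    n = len(weights)
    q = alpha * weights + (1.0 - alpha) / n
    idx = rng.choice(n, size=n, p=q)
    new_weights = weights[idx] / q[idx]   # importance correction, equation (3)
    new_weights /= new_weights.sum()      # renormalise
    return particles[idx], new_weights
```

With uniform input weights the mixture is uniform, so the corrected weights remain uniform, as expected.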
This gives non-zero estimates of the gradient because the dependency on the previous weights is maintained. By changing $\alpha$, this method trades resampling quality for biased gradient estimates. A fully differentiable particle filter is described in [diffPF_6] that resamples by using Optimal Transport ideas seen in [peyre2020computational]. A method called stop-gradient resampling is outlined in [diffresam_1]; it only modifies the messages used in backpropagation, is implemented in PyTorch and makes use of automatic differentiation libraries. In [diffresam_2], the authors utilise conditional normalising flows to construct flexible probability distributions for differentiable particle filters. A comparison of differentiable filters can be seen in [kloss2021train]. The approach described in this paper differs from this body of previous work since we focus on ensuring resampling is differentiable without having to change how resampling operates.
Outside of the neural network field, extensions of the original pMCMC algorithm have focused on including gradient information when proposing new parameters. Reference
[poyiadjis2011particle] shows how to estimate the score (gradient) of the log-likelihood and the observed information matrices at $\theta$ in SSMs using particle filter methods. The two methods proposed run with computational complexity $O(N)$ and $O(N^2)$, respectively. The first has a linear computational cost, but the performance decreases over time. The second has a computational cost that increases quadratically with the number of particles, $N$. [nemeth2013particle] built on this work to compute these terms with computational complexity $O(N)$ while avoiding the quadratically increasing variance caused by particle degeneracy. In [particleLangevin, inproceedings, Secondorder, Dahlin_2014] the authors utilise the previous work of [nemeth2013particle] to recursively estimate the score (gradient) of the log-likelihood at $\theta$. References [particleLangevin] and [inproceedings] include Langevin Monte Carlo (LMC) methods seen in [girolami_calderhead_2011], whilst [Secondorder] and [Dahlin_2014] include first- and second-order Hessian information about the posterior in the proposal. Use of the Hessian is shown to improve the mixing of the Markov chain at the stationary phase and decrease the length of burn-in. However, calculating a matrix of the second-order partial derivatives can become infeasible when the dimensionality of the parameter becomes large. While [Secondorder] and [Dahlin_2014] do mention using HMC dynamics within pMCMC, to the best of the authors’ knowledge, no implementation of this approach has been described in the literature up to now.
We aim to complement the recent literature seen in machine learning that addresses the problem of differentiating resampling. Our core contribution is to fix the random numbers used within the resampling step. This conditioning on the input to resampling results in the subsequent particle derivative calculations being a function of the parent particle. The gradients can then be efficiently estimated and utilised within the framework of pMCMC, and specifically used to calculate gradient-based proposals for $\theta$: more specifically, this allows us to use NUTS as the proposal. Another novel contribution, relative to the previous work on differentiable particle filters in the neural network community, is that we provide full Bayesian parameter estimates (including variances). This differs from the present literature on differentiable particle filtering, which focuses exclusively on point estimates of parameters. We also compare NUTS’ performance with that achieved by HMC and the Metropolis-adjusted Langevin algorithm (MALA).
An outline of the paper is as follows: in Section 2 we describe a generic particle filter, followed by a description of the difficulties associated with differentiating the sampling and resampling steps. We describe how to calculate the likelihood and gradients in Section 3, the methods to propagate the derivatives of the particle weights in Section 4, and how we extend the reparameterisation trick to include the stochastic input to resampling in Section 5. We test the likelihood and gradient estimates, explain particle-HMC (pHMC) and particle-NUTS (pNUTS) and detail comparative numerical results in the context of two applications in Section 6. Concluding remarks are given in Section 7.
2 Particle filtering background
Assume we have considered $T$ timesteps, obtaining data at each increment of $t$ given by $y_{1:T} = \{y_1, \dots, y_T\}$. The state sequence $x_{0:T}$ grows with time, where $x_t$ has $d_x$ dimensions. The dynamics and likelihood are parameterised by $\theta$ (which has $d_\theta$ dimensions) such that
(4) $p(x_{0:T}, y_{1:T} \mid \theta) = \mu_\theta(x_0) \prod_{t=1}^{T} f_\theta(x_t \mid x_{t-1})\, g_\theta(y_t \mid x_t).$
If $\theta$ is known, we can run a (conventional) particle filter.
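As a concrete running example (our own illustration, not a model from the paper), a scalar linear-Gaussian SSM with transition coefficient theta can be simulated as follows:

```python
import numpy as np

def simulate_ssm(theta, T, sigma_x=1.0, sigma_y=1.0, seed=0):
    """Simulate a hypothetical linear-Gaussian SSM:
       x_t = theta * x_{t-1} + sigma_x * eps_t,   eps_t ~ N(0, 1)
       y_t = x_t + sigma_y * nu_t,                nu_t ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = np.empty(T + 1)
    y = np.empty(T + 1)
    x[0] = rng.normal()                       # draw x_0 from the initial density
    for t in range(1, T + 1):
        x[t] = theta * x[t - 1] + sigma_x * rng.normal()
        y[t] = x[t] + sigma_y * rng.normal()
    return x, y[1:]                           # states x_{0:T}, observations y_{1:T}
```

A filter applied to `y` then targets the posterior over the hidden trajectory `x` for a given `theta`.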
2.1 Particle Filter
At every timestep $t$, the particle filter draws $N$ samples (particles) from a proposal distribution, $q(x_t \mid x_{t-1}, y_t, \theta)$, which is parameterised by the sequence of states and measurements. The samples are seen as statistically independent and each represents a different hypothesis for the sequence of states of the system. The collection of samples of $x_{0:t}$ represents the probability density function (pdf) of the dynamic system. The $i$th sample has an associated weight, $w_t^i$, which indicates the relative probability that the $i$th sample, $x_{0:t}^i$, is the true state of the system. The weights at $t = 0$ are set to be $w_0^i = 1$. The proposal distribution is constructed recursively as
(5) $q(x_{0:t} \mid y_{1:t}, \theta) = q(x_{0:t-1} \mid y_{1:t-1}, \theta)\, q(x_t \mid x_{t-1}, y_t, \theta)$
such that we can pose an estimate with respect to the joint distribution, $p(x_{0:t}, y_{1:t} \mid \theta)$, as follows:
(6) $\int h(x_{0:t})\, p(x_{0:t}, y_{1:t} \mid \theta)\, \mathrm{d}x_{0:t} \approx \frac{1}{N} \sum_{i=1}^{N} w_t^i\, h(x_{0:t}^i).$
This is an unbiased estimate, where (for $t \geq 1$)
(7) $w_t^i = \frac{p(x_{0:t}^i, y_{1:t} \mid \theta)}{q(x_{0:t}^i \mid y_{1:t}, \theta)}$
(8) $\;\;= w_{t-1}^i\, \alpha_t^i$
and (8) is a recursive formulation for the unnormalised weight, $w_t^i$, with incremental weight
(9) $\alpha_t^i = \frac{g_\theta(y_t \mid x_t^i)\, f_\theta(x_t^i \mid x_{t-1}^i)}{q(x_t^i \mid x_{t-1}^i, y_t, \theta)}.$
For $t = 1$,
(10) $w_1^i = \frac{g_\theta(y_1 \mid x_1^i)\, f_\theta(x_1^i \mid x_0^i)}{q(x_1^i \mid x_0^i, y_1, \theta)}.$
2.2 Choice of proposal
Three commonly used options for the proposal distribution are:

Using the dynamic model as the proposal,
(11) $q(x_t \mid x_{t-1}, y_t, \theta) = f_\theta(x_t \mid x_{t-1}),$
which simplifies the weight update to
(12) $w_t^i = w_{t-1}^i\, g_\theta(y_t \mid x_t^i).$
In certain situations it is possible to use the “optimal” proposal, which is
(13) $q(x_t \mid x_{t-1}, y_t, \theta) = p(x_t \mid x_{t-1}, y_t, \theta),$
with weights updated according to
(14) $w_t^i = w_{t-1}^i\, p(y_t \mid x_{t-1}^i, \theta).$
Note that the term “optimal” in this context of the particle proposal means that the variance of the incremental particle weights at the current timestep is minimised. In fact, this variance is zero since the weight in (14) is independent of $x_t^i$ (as explained in [1336055731117]).

Using the Unscented Transform, as explained in [kalmanFil].
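The first of these options is the simplest in practice. As a sketch (our own, using the hypothetical scalar linear-Gaussian model from earlier as the running example), one filtering step under the dynamic-model proposal (11) reduces the weight update to the single-measurement likelihood as in (12):

```python
import numpy as np

def normal_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def bootstrap_step(particles, log_weights, y_t, theta, sigma_x, sigma_y, rng):
    """One step with the dynamics as the proposal (11): propagate each particle
    through the transition model, then multiply its weight by the likelihood (12).
    Model assumed here (illustrative): x_t = theta*x_{t-1} + N(0, sigma_x^2),
    y_t = x_t + N(0, sigma_y^2). Weights are propagated in logs for stability."""
    particles = theta * particles + sigma_x * rng.normal(size=particles.shape)
    log_weights = log_weights + normal_logpdf(y_t, particles, sigma_y)
    return particles, log_weights
```

Because the transition density cancels in the incremental weight (9), only the likelihood term remains in the update.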
2.3 Estimation with Respect to the Posterior
It is often the case that we wish to calculate estimates with respect to the posterior, $p(x_{0:t} \mid y_{1:t}, \theta)$, which we can calculate as follows:
(15) $p(x_{0:t} \mid y_{1:t}, \theta) = \frac{p(x_{0:t}, y_{1:t} \mid \theta)}{p(y_{1:t} \mid \theta)}.$
Note that if $h(x_{0:t}) = 1$,
(16) $p(y_{1:t} \mid \theta) \approx \frac{1}{N} \sum_{i=1}^{N} w_t^i$
in line with (6), such that
(17) $\int h(x_{0:t})\, p(x_{0:t} \mid y_{1:t}, \theta)\, \mathrm{d}x_{0:t} \approx \frac{\frac{1}{N} \sum_{i=1}^{N} w_t^i\, h(x_{0:t}^i)}{\frac{1}{N} \sum_{j=1}^{N} w_t^j}$
(18) $\;\;= \sum_{i=1}^{N} \tilde{w}_t^i\, h(x_{0:t}^i),$
where
(19) $\tilde{w}_t^i = \frac{w_t^i}{\sum_{j=1}^{N} w_t^j}$
are the normalised weights.
2.4 Resampling
As time evolves, the normalised weights will become increasingly skewed such that one of the weights given by (19) becomes close to unity and the others approach zero. It is often suggested that monitoring the effective sample size, $N_{\mathrm{eff}}$, can be used to identify the need to resample, where
(20) $N_{\mathrm{eff}} = \frac{1}{\sum_{i=1}^{N} \left( \tilde{w}_t^i \right)^2}.$
There are many resampling methods, some of which are outlined and evaluated in [resamplingmethods], but they all share the same purpose: stochastically replicate particles with higher weights whilst eliminating ones with lower weights. Multinomial resampling is commonly used and involves drawing from the current particle set $N$ times, with each particle drawn with probability proportional to its weight. The associated distribution is defined by
(21) $P\!\left( a_t^i = j \right) = \tilde{w}_t^j.$
To keep the total unnormalised weight constant (such that the approximation (16) is the same immediately before and after resampling), we assign each newly-resampled sample an unnormalised weight
(22) $w_t^{i\prime} = \frac{1}{N} \sum_{j=1}^{N} w_t^j.$
Note this is such that the normalised weights after resampling are $1/N$.
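The effective sample size (20), the multinomial draw (21) and the weight reset (22) can be sketched as follows (our own illustration):

```python
import numpy as np

def ess(weights):
    """Effective sample size, equation (20), from (possibly unnormalised) weights."""
    w = weights / weights.sum()
    return 1.0 / np.sum(w ** 2)

def multinomial_resample(particles, weights, rng):
    """Multinomial resampling (21): draw N indices with probability proportional
    to the weights. Each new particle gets the unnormalised weight (22), the
    mean of the old weights, so the total unnormalised weight is preserved."""
    n = len(weights)
    idx = rng.choice(n, size=n, p=weights / weights.sum())
    new_weights = np.full(n, weights.sum() / n)
    return particles[idx], new_weights
```

A common rule of thumb (not prescribed by the text) is to resample when `ess(weights)` falls below some threshold such as `N / 2`.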
3 Calculating the likelihood and gradients
We pose the calculation of the likelihood of the parameter as the calculation of the approximation in (16) (i.e. the sum of unnormalised particle filter weights, divided by $N$), with $t = T$.
Differentiating the weights gives an approximation to the gradient of the likelihood (we note that this approach differs from that advocated in [inproceedings, Secondorder, Dahlin_2014], which use a Fixed-Lag filter (with a user-specified lag) to approximate the derivatives; in contrast, we explicitly calculate the derivatives of the approximation to the likelihood):
(23) $\nabla_\theta\, p(y_{1:t} \mid \theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta\, w_t^i.$
For numerical stability, it is typically preferable to propagate values in logs. Applying the Chain Rule to (16) and (23) gives
(24) $\nabla_\theta \log p(y_{1:t} \mid \theta) \approx \frac{\sum_{i=1}^{N} \nabla_\theta\, w_t^i}{\sum_{j=1}^{N} w_t^j}$
(25) $\;\;= \sum_{i=1}^{N} \tilde{w}_t^i\, \nabla_\theta \log w_t^i.$
Note that in (25), the weights outside the logs are normalised, while the weights inside are not. The log weights can be calculated recursively as
(26) $\log w_t^i = \log w_{t-1}^i + \log g_\theta(y_t \mid x_t^i) + \log f_\theta(x_t^i \mid x_{t-1}^i) - \log q(x_t^i \mid x_{t-1}^i, y_t, \theta)$
so
(27) $\nabla_\theta \log w_t^i = \nabla_\theta \log w_{t-1}^i + \Delta_t^i,$
where
(28) $\Delta_t^i = \nabla_\theta \log g_\theta(y_t \mid x_t^i) + \nabla_\theta \log f_\theta(x_t^i \mid x_{t-1}^i) - \nabla_\theta \log q(x_t^i \mid x_{t-1}^i, y_t, \theta).$
So, if we can differentiate the single-measurement likelihood, transition model and proposal, we can calculate (an approximation to) the derivative of the log-likelihood for the next time step, thus recursively approximating the log-likelihood derivatives for each time step. While this is true, there are some challenges involved, which we now discuss in turn.
If the particle filter is using the transition model as the proposal (as in (11)), the likelihood in the weight update does not explicitly depend on $\theta$ and we might initially suppose that $\nabla_\theta \log g_\theta(y_t \mid x_t^i) = 0$. If this were the case, an induction argument using (27) would show that the weight derivatives were always zero and therefore give an approximation of zero for the gradient of the likelihood for all $\theta$. This seems intuitively incorrect. Indeed, the flaw in this reasoning is that, in fact, the likelihood (somewhat implicitly) does depend on $\theta$ since $x_t^i$ depends on $\theta$. Applying the Chain Rule gives
(29) $\nabla_\theta \log g_\theta(y_t \mid x_t^i) = \frac{\partial \log g_\theta(y_t \mid x_t^i)}{\partial \theta} + \frac{\partial \log g_\theta(y_t \mid x_t^i)}{\partial x_t^i} \frac{\mathrm{d} x_t^i}{\mathrm{d} \theta}.$
Since $x_t^i$ is a random variable sampled from the proposal, we use the reparameterisation trick [reparam]: we consider the derivative for a fixed random number seed. More precisely, let $\epsilon_t^i$ be the vector of standard random variables used when sampling from the proposal such that, if $\epsilon_t^i$ is known, then $x_t^i = x_t^i(\theta, \epsilon_t^i)$ is a deterministic function (that can be differentiated) of $\theta$. We then consider
(30) $\nabla_\theta\, \mathbb{E}_{q}\!\left[ h(x_t^i) \right] = \nabla_\theta\, \mathbb{E}_{\epsilon}\!\left[ h\!\left( x_t^i(\theta, \epsilon_t^i) \right) \right]$
(31) $\;\;= \mathbb{E}_{\epsilon}\!\left[ \nabla_\theta\, h\!\left( x_t^i(\theta, \epsilon_t^i) \right) \right]$
(32) $\;\;= \mathbb{E}_{\epsilon}\!\left[ \frac{\partial h}{\partial x_t^i} \frac{\mathrm{d} x_t^i}{\mathrm{d} \theta} \right],$
where $\epsilon_t^i$ is considered fixed and, most importantly, (32) then involves differentials that can be calculated.
As a simple example, consider sampling from the dynamics with a random-walk proposal and $\theta = \sigma$ being the standard deviation of the process noise. This is such that
(33) $x_t^i = x_{t-1}^i + \sigma\, \epsilon_t^i, \qquad \epsilon_t^i \sim \mathcal{N}(0, 1),$
so
(34) $\frac{\mathrm{d} x_t^i}{\mathrm{d} \sigma} = \frac{\mathrm{d} x_{t-1}^i}{\mathrm{d} \sigma} + \epsilon_t^i,$
which can be calculated recursively and then used to calculate (29).
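The recursion in (33)–(34) can be sketched directly (our own illustration, with the noise vector fixed in advance as the reparameterisation trick requires):

```python
import numpy as np

def sample_with_derivative(x0, sigma, eps):
    """Random-walk proposal x_t = x_{t-1} + sigma*eps_t, equation (33), with the
    noise eps fixed in advance. Alongside each state, the derivative is
    accumulated via dx_t/dsigma = dx_{t-1}/dsigma + eps_t, equation (34)."""
    x, dx_dsigma = x0, 0.0
    xs, grads = [], []
    for e in eps:
        x = x + sigma * e            # propagate the particle
        dx_dsigma = dx_dsigma + e    # propagate its derivative
        xs.append(x)
        grads.append(dx_dsigma)
    return np.array(xs), np.array(grads)
```

Because the trajectory is linear in sigma for a fixed noise vector, the accumulated derivative matches a finite-difference check exactly.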
More generally, the derivatives of the weight are non-zero and, to calculate these derivatives, we have to propagate the particle derivatives $\mathrm{d} x_t^i / \mathrm{d} \theta$.
4 Calculating the derivatives
In order to propagate the derivatives of the particle weights we need to calculate:

The particle derivatives,
(35) $\frac{\mathrm{d} x_t^i}{\mathrm{d} \theta};$
The derivatives of the proposal log pdfs,
(36) $\nabla_\theta \log q(x_t^i \mid x_{t-1}^i, y_t, \theta);$
The derivatives of the prior log pdfs,
(37) $\nabla_\theta \log f_\theta(x_t^i \mid x_{t-1}^i);$
The derivatives of the single-measurement likelihood log pdfs,
(38) $\nabla_\theta \log g_\theta(y_t \mid x_t^i).$
In this section, we show how to calculate these derivatives in turn.
4.1 Derivative of New Particle States
We now describe how to calculate $\mathrm{d} x_t^i / \mathrm{d} \theta$. Suppose the proposal takes the following form:
(39) $q(x_t \mid x_{t-1}^i, y_t, \theta) = \mathcal{N}\!\left( x_t;\, \mu(x_{t-1}^i, y_t, \theta),\, \Sigma(x_{t-1}^i, y_t, \theta) \right),$
where $\mu$ and $\Sigma$ are functions of the old particle state, the measurement and the parameter. Such a generic description can articulate sampling from the prior, or defining a proposal using a Kalman filter with the predicted mean and covariance given by the motion model.
If we sample the proposal noise $\epsilon_t^i$ in advance, the new particle states can be written as a deterministic function
(40) $x_t^i = x_t^i\!\left( \theta, x_{t-1}^i, \epsilon_t^i \right).$
We would like to compute the derivative of this w.r.t. the parameter. Care must be taken, however, since $x_{t-1}^i$ is itself a function of $\theta$ (if a different $\theta$ was chosen, a different $x_{t-1}^i$ would have been sampled).
(41) $\frac{\mathrm{d} x_t^i}{\mathrm{d} \theta} = \frac{\mathrm{d}}{\mathrm{d} \theta}\, x_t^i\!\left( \theta, x_{t-1}^i, \epsilon_t^i \right)$
(42) $\;\;= \frac{\partial x_t^i}{\partial \theta} + \frac{\partial x_t^i}{\partial x_{t-1}^i} \frac{\mathrm{d} x_{t-1}^i}{\mathrm{d} \theta}$
(43) $\;\;= \frac{\partial x_t^i}{\partial \theta} + \frac{\partial x_t^i}{\partial x_{t-1}^i} \left( \frac{\partial x_{t-1}^i}{\partial \theta} + \frac{\partial x_{t-1}^i}{\partial x_{t-2}^i} \frac{\mathrm{d} x_{t-2}^i}{\mathrm{d} \theta} \right).$
Note that the total derivative $\mathrm{d} x_t^i / \mathrm{d} \theta$ in (41) is not the same as the partial derivative $\partial x_t^i / \partial \theta$ in (42) — see Appendix A for the distinction. Note also that the terms here are matrix-valued in general: $\partial x_t^i / \partial x_{t-1}^i$ is a $d_x \times d_x$ matrix, and $\partial x_t^i / \partial \theta$ and $\mathrm{d} x_{t-1}^i / \mathrm{d} \theta$ are $d_x \times d_\theta$ matrices.
Also note that for $t > 1$, $x_{t-1}^i$ implicitly depends on $\theta$, which itself determines how $x_t^i$ is sampled. Hence we need the total derivative $\mathrm{d} x_{t-1}^i / \mathrm{d} \theta$.
4.2 Derivative of Proposal
To differentiate the log proposal pdf, we note that we can write it as
(44) $\log q\!\left( x_t^i \mid x_{t-1}^i, y_t, \theta \right) = A + B,$
where (dropping the fixed values $x_{t-1}^i$ and $y_t$ for notational convenience)
(45) $A = -\frac{1}{2} \log \left| 2\pi \Sigma \right|$
(46) $B = -\frac{1}{2} \left( x_t^i - \mu \right)^{\mathsf{T}} \Sigma^{-1} \left( x_t^i - \mu \right),$
where we emphasise again that we assume the proposal is Gaussian. We then get
(48) $\nabla_\theta \log q = -\frac{1}{2} \operatorname{tr}\!\left( \Sigma^{-1} \frac{\mathrm{d} \Sigma}{\mathrm{d} \theta} \right) - \left( x_t^i - \mu \right)^{\mathsf{T}} \Sigma^{-1} \left( \frac{\mathrm{d} x_t^i}{\mathrm{d} \theta} - \frac{\mathrm{d} \mu}{\mathrm{d} \theta} \right) + \frac{1}{2} \left( x_t^i - \mu \right)^{\mathsf{T}} \Sigma^{-1} \frac{\mathrm{d} \Sigma}{\mathrm{d} \theta} \Sigma^{-1} \left( x_t^i - \mu \right),$
where we denote $\mu = \mu(x_{t-1}^i, y_t, \theta)$ and $\Sigma = \Sigma(x_{t-1}^i, y_t, \theta)$ for brevity. The derivatives of $\Sigma$ are given in Appendix C.
4.3 Derivative of the Prior
We now describe how to calculate $\nabla_\theta \log f_\theta(x_t^i \mid x_{t-1}^i)$. Let
(49) $\log f_\theta\!\left( x_t^i \mid x_{t-1}^i \right) = -\frac{1}{2} \log \left| 2\pi Q \right|$
(50) $\;\; -\frac{1}{2} \left( x_t^i - m \right)^{\mathsf{T}} Q^{-1} \left( x_t^i - m \right),$
where here we assume that the transition model has additive Gaussian noise that is independent of the state. Then
(51) $\nabla_\theta \log f_\theta = -\frac{1}{2} \operatorname{tr}\!\left( Q^{-1} \frac{\mathrm{d} Q}{\mathrm{d} \theta} \right) - \left( x_t^i - m \right)^{\mathsf{T}} Q^{-1} \left( \frac{\mathrm{d} x_t^i}{\mathrm{d} \theta} - \frac{\mathrm{d} m}{\mathrm{d} \theta} \right) + \frac{1}{2} \left( x_t^i - m \right)^{\mathsf{T}} Q^{-1} \frac{\mathrm{d} Q}{\mathrm{d} \theta} Q^{-1} \left( x_t^i - m \right),$
where we denote $m = m(x_{t-1}^i, \theta)$ and $Q = Q(\theta)$ for brevity. Note that this makes clear that since the realisations of the sampled particles, $x_t^i$, are, in general, dependent on $\theta$, (51) includes $\mathrm{d} x_t^i / \mathrm{d} \theta$. Also note that these derivatives are evaluated at $m$ (the prior mean) and not at $\mu$ (the proposal mean), as was the case in (48).
4.4 Derivative of likelihood
We now describe how to calculate $\nabla_\theta \log g_\theta(y_t \mid x_t^i)$. Let
(52) $\log g_\theta\!\left( y_t \mid x_t^i \right) = -\frac{1}{2} \log \left| 2\pi R \right|$
(53) $\;\; -\frac{1}{2} \left( y_t - h \right)^{\mathsf{T}} R^{-1} \left( y_t - h \right),$
where we assume that the likelihood is Gaussian with a variance that is independent of $\theta$. Then
(54) $\nabla_\theta \log g_\theta = \left( y_t - h \right)^{\mathsf{T}} R^{-1} \frac{\mathrm{d} h}{\mathrm{d} \theta},$
where we denote $h = h(x_t^i, \theta)$ and $R$ for brevity, and where $\frac{\mathrm{d} h}{\mathrm{d} \theta} = \frac{\partial h}{\partial \theta} + \frac{\partial h}{\partial x_t^i} \frac{\mathrm{d} x_t^i}{\mathrm{d} \theta}$.
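This total derivative can be checked against finite differences. The sketch below (our own, with a hypothetical scalar observation model $h(x, \theta) = \theta x$) shows the chain rule in (54) picking up both the explicit $\theta$-dependence of $h$ and the implicit dependence through the particle:

```python
import numpy as np

def loglik(y, x, theta, r):
    """Gaussian single-measurement log-likelihood with mean h(x, theta) = theta*x
    (an illustrative observation model) and fixed variance r."""
    h = theta * x
    return -0.5 * np.log(2 * np.pi * r) - 0.5 * (y - h) ** 2 / r

def dloglik_dtheta(y, x, theta, r, dx_dtheta):
    """Total derivative per (54): (y - h) / r * dh/dtheta, where
    dh/dtheta = dh/dtheta (explicit) + dh/dx * dx/dtheta (implicit)."""
    h = theta * x
    dh_dtheta = x + theta * dx_dtheta
    return (y - h) / r * dh_dtheta
```

Taking a particle that depends on the parameter, say $x(\theta) = \theta^2$ with $\mathrm{d}x/\mathrm{d}\theta = 2\theta$, the analytic gradient agrees with a central finite difference.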
5 Resampling for a Differentiable Particle Filter
Unlike a standard particle filter, we also need to resample the weight derivatives
(55) $\nabla_\theta \log w_t^i$
as well as the particle derivatives
(56) $\frac{\mathrm{d} x_t^i}{\mathrm{d} \theta}.$
Let
(57) $W_t^j = \sum_{k=1}^{j} \tilde{w}_t^k$
be the normalised cumulative weights and the index sampled for particle $i$ be given by
(58) $a_t^i = \min \left\{ j : W_t^j \geq u_t^i \right\},$
where $u_t^i \sim \mathcal{U}(0, 1)$ are independent for each particle and time step. Note that the particle indices are sampled according to a Categorical distribution, giving a Multinomial resampler, where each index is resampled with probability proportional to its weight. Other resampling schemes are possible and could reduce the variance of any estimates.
The resampled weights are set to be the same as each other, while preserving the original sum:
(59) $w_t^{i\prime} = \frac{1}{N} \sum_{j=1}^{N} w_t^j$
(60) $\log w_t^{i\prime} = \log \left( \frac{1}{N} \sum_{j=1}^{N} w_t^j \right).$
From (59) and (60), it is clear that
(61) $\nabla_\theta\, w_t^{i\prime} = \frac{1}{N} \sum_{j=1}^{N} \nabla_\theta\, w_t^j.$
In order to convert this to log weights, applying the Chain Rule gives
(62) $\nabla_\theta \log w_t^{i\prime} = \frac{\sum_{j=1}^{N} \nabla_\theta\, w_t^j}{\sum_{k=1}^{N} w_t^k}$
(63) $\;\;= \sum_{j=1}^{N} \tilde{w}_t^j\, \nabla_\theta \log w_t^j,$
where $\tilde{w}_t^j$ are the normalised weights.
To get the particle gradient, note that (where $W_t^j$ is differentiable w.r.t. $\theta$)
(64) $\frac{\mathrm{d} x_t^{i\prime}}{\mathrm{d} \theta} = \frac{\mathrm{d}}{\mathrm{d} \theta}\, x_t^{a_t^i}.$
Since $\frac{\mathrm{d} a_t^i}{\mathrm{d} \theta} = 0$ except where
(65) $W_t^{a_t^i} = u_t^i,$
then
(66) $\frac{\mathrm{d} x_t^{i\prime}}{\mathrm{d} \theta} = \frac{\mathrm{d} x_t^{a_t^i}}{\mathrm{d} \theta}$
almost surely, so the derivative is obtained by taking the derivative of the parent particle.
It should be noted that, because particle resampling adds potential discontinuities in the derivative calculation, the particle filter should be designed so that resampling is required as rarely as possible.
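The scheme in (57)–(58), with the uniforms fixed in advance, can be sketched as follows (our own illustration). Because the sampled index is piecewise constant in the parameter, each resampled particle simply inherits its parent's derivative, as in (66):

```python
import numpy as np

def resample_fixed_seed(particles, grads, weights, u):
    """Multinomial resampling with the uniforms u fixed in advance.
    The index a^i = min{ j : W_j >= u_i } of (58) is computed against the
    cumulative normalised weights (57); each resampled particle carries its
    parent's derivative (66), and the new weights follow (22)."""
    cum = np.cumsum(weights / weights.sum())
    idx = np.searchsorted(cum, u)                      # a^i per equation (58)
    new_w = np.full(len(weights), weights.sum() / len(weights))
    return particles[idx], grads[idx], new_w
```

Holding `u` fixed while the parameter (and hence the weights) varies slightly leaves `idx` unchanged except at the measure-zero crossings of (65), which is what makes the resampled trajectories piecewise differentiable.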
6 Numerical Experiments
We test the estimates of the log-likelihood and the gradient w.r.t. $\theta$ for a simple model using Algorithm 1. The proposal is derived from a Kalman Filter with the prior mean and covariance given by the motion model. We show in Appendix B how to calculate the true proposal mean and covariance for each particle $i$, as well as their derivatives.
We run Algorithm 1 and compute an estimate of the log-likelihood and associated gradient across a range of 500 values of $\theta$, equally spaced from 1 to 4. The true value is $\theta = 2$, with $N = 2000$ particles and $T = 250$ observations.
The log-likelihood and the gradient of the log-likelihood, at each instance of $\theta$, can be seen in Figure 1 (a) and Figure 1 (b), respectively. There is no obvious difference in the plots when comparing the estimates given by the Kalman Filter, shown in black, with the estimates produced by the particle filter when using the reparameterisation trick for resampling, in red (see Section 5), and multinomial resampling, in blue (21). However, when looking at a zoomed-in view of the log-likelihood and the gradient of the log-likelihood (seen in Figure 1 (c) and Figure 1 (d), respectively), it is evident that using the reparameterisation trick for resampling produces piecewise linear continuous estimates.
6.1 Estimation of parameters
If we have a prior, $p(\theta)$, for which the likelihood, $p(y_{1:T} \mid \theta)$, or log-likelihood (the log-likelihood is likely to be more stable numerically) can be calculated, we can run pMCMC to estimate $\theta$. The gradient of the log-posterior of $\theta$ is given by
(67) $\nabla_\theta \log p(\theta \mid y_{1:T}) = \nabla_\theta \log p(\theta) + \nabla_\theta \log p(y_{1:T} \mid \theta),$
where $\nabla_\theta \log p(\theta)$ is the gradient of the log-prior and $\nabla_\theta \log p(y_{1:T} \mid \theta)$ is the gradient of the log-likelihood. If we know $\nabla_\theta \log p(\theta \mid y_{1:T})$, it is possible to guide proposals to areas of higher probability within $\Theta$.
6.1.1 Hamiltonian Monte Carlo
HMC is a gradient-based algorithm which uses Hamilton’s equations to generate new proposals. Since it uses gradients, it is better at proposing samples than a random-walk proposal. It was first developed in the late 1980s [DUANE1987216] and in the last decade it has become a popular approach when implementing MCMC [neal2012mcmc]. In the following section we give a high-level conceptual overview of HMC and direct the reader to [betancourt2017conceptual] for a more thorough explanation. Hamilton’s equations are a pair of differential equations that describe a system in terms of its position $\theta$ and momentum $p$, where the potential of the system is defined by $U(\theta) = -\log p(\theta \mid y_{1:T})$. Conceptually, they are a set of differential equations that govern the movement of a frictionless puck across the potential surface. HMC introduces a momentum vector $p$ which moves the puck at $\theta$ on a random trajectory to a proposed $\theta'$. Intuitively, the puck will slow down or go backwards when going up a slope, keeping the proposed $\theta'$ in regions of higher probability. The total energy or Hamiltonian of the system can be expressed as
(68) $H(\theta, p) = U(\theta) + K(p)$
and is comprised of the sum of the kinetic energy, $K(p)$, which depends on the momentum, and the potential energy, $U(\theta)$, which depends on where in the parameter space the puck is and is independent of the momentum.
Hamilton’s equations describe how the system evolves as a function of time and are:
(69) $\frac{\mathrm{d} \theta}{\mathrm{d} t} = \frac{\partial H}{\partial p}, \qquad \frac{\mathrm{d} p}{\mathrm{d} t} = -\frac{\partial H}{\partial \theta}.$
The joint density is
(70) $p(\theta, p) \propto \exp\!\left( -H(\theta, p) \right) = \exp\!\left( -U(\theta) \right) \exp\!\left( -K(p) \right),$
which means $\theta$ and $p$ are independent under the joint density, so $p$ can be sampled from any distribution. For simplicity, the distribution used is often chosen to be Gaussian and we make that choice here. Many numerical integration methods exist which discretise Hamilton’s equations and can be seen in [hmcintegrators], with the leapfrog method being the go-to method for HMC. Leapfrog is a symplectic method, which means the Hamiltonian remains close to its initial value, though not exactly equal to it, as the system is simulated. This means samples are generated with a high accept/reject ratio so the target is explored efficiently. Leapfrog is also a reversible method, which allows detailed balance to be maintained. Finally, leapfrog is a low-order method which uses relatively few gradient evaluations per step and is therefore computationally cheap. Algorithm 1 is run every time a gradient evaluation is made within the leapfrog numerical integrator.
The samples generated are governed by a predetermined number of steps $L$ of size $\epsilon$, decided by the user. HMC is highly sensitive to the choice of these parameters, particularly $L$. If $L$ is too large, computation time can be wasted as the trajectory can make a U-turn and end close to where it started and, if too small, the proposal can exhibit random-walk behaviour. pHMC is outlined in Algorithm 2, with shorthand notation used for the log-likelihood and the gradient of the parameter posterior in (67).
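The leapfrog integrator itself is short; a minimal sketch (our own, for a generic log-posterior gradient, with an identity-mass Gaussian momentum so that $K(p) = p^{\mathsf{T}} p / 2$) is:

```python
import numpy as np

def leapfrog(theta, p, grad_log_post, eps, n_steps):
    """Leapfrog integration of Hamilton's equations with U = -log posterior,
    so dp/dt = grad_log_post(theta) and dtheta/dt = p. Returns the end point
    of the trajectory; in pHMC, grad_log_post would be estimated by the
    differentiable particle filter at every call."""
    p = p + 0.5 * eps * grad_log_post(theta)      # initial half-step for momentum
    for step in range(n_steps):
        theta = theta + eps * p                   # full-step for position
        if step < n_steps - 1:
            p = p + eps * grad_log_post(theta)    # full-step for momentum
    p = p + 0.5 * eps * grad_log_post(theta)      # final half-step for momentum
    return theta, p
```

The two properties cited above are easy to verify numerically on a standard Gaussian target: the Hamiltonian stays close to its initial value, and running the integrator forwards and then backwards (with negated momentum) returns to the starting point.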
6.1.2 NoUTurn Sampler
NUTS [NUTS] is an extension of HMC which eliminates the need to specify $L$ by adaptively finding an optimal number of steps. NUTS builds a trajectory exploring states forwards and backwards in time by building a balanced tree, i.e. it takes forwards and backwards steps doubling in number each iteration. A Bernoulli trial is undertaken to decide the initial direction and, after the completion of the subtree, a new trial is undertaken and a further set of steps, double the previous number, is taken. This process continues until the trajectory begins to double back on itself (i.e. the puck starts to make a “U-turn”), at which point a state is sampled from a complete balanced tree. By sampling from a balanced tree the process is reversible and detailed balance is maintained. To avoid excessively large trees, which usually result from choosing too small a step size, it is necessary to set a maximum tree depth. To find a reasonable initial step size we used the heuristic approach in [NUTS], but we do not use dual averaging to automatically tune the step size.