Non-linear and high-dimensional state space models arise in many areas of application, such as oceanography (Mattern et al., 2013), numerical weather prediction (Evensen, 1994) and epidemiology (He et al., 2010; Shaman and Karspeck, 2012), to mention a few. Here we consider models of the form
where is a non-linear function of the previous state , is process noise, and the observations are a linear function of the current state with additive Gaussian noise . This form of the non-linear state dynamics
is very general compared, e.g., to the case when the process noise is just additive, and allows for using blackbox simulation models, discretized stochastic differential equations, etc. For the observations we restrict our attention to the linear-Gaussian case, which is common in many applications. However, the method we propose can handle any observation likelihood for which there is a conjugate prior. One such case is a Poisson distributed observation likelihood (with a Gamma distributed prior).
When performing filtering we wish to recover the unknown states at time given all observations of the process up to time . The filtering distribution
can only be evaluated in closed form for a few specific models, such as the linear-Gaussian state space model (the Kalman filter). In more general cases the filtering distribution must be approximated. The particle filter is one way to do this by representing the filtering distribution with a set of weighted samples from the distribution.
In addition to being of significant interest on its own, the filtering problem is also intimately related to model identification via both maximum likelihood and Bayesian formulations (see, e.g., Schön et al. (2015)). Indeed, the data log-likelihood can be expressed as a sum of filtering expectations. Thus, even though we will restrict our attention to the filtering problem, we emphasize that the improvements offered by the new method are useful also for identification of models of the form (1).
The particle filter can, unlike the Kalman filter, handle highly non-linear models, but may also experience degeneracy of the particle weights. Degeneracy occurs when one particle has a weight close to one while the weights of all other particles are close to zero. The filtering distribution is then effectively represented by a single particle, which results in a very poor approximation. It has been shown that to avoid weight degeneracy the number of particles must increase exponentially with the state dimension (Snyder et al., 2015). Weight degeneracy is therefore a frequent issue for high-dimensional problems.
A range of different techniques for high-dimensional filtering have been previously developed. Methods like the ensemble Kalman filter (Evensen, 1994) can be used to solve the filtering problem if the system is mildly non-linear. For more difficult cases adaptation of the particle filter to higher dimensions is necessary. Some examples of particle filters for high-dimensional problems include Djurić and Bugallo (2013); Naesseth et al. (2015); Rebeschini and van Handel (2015); Robert and Künsch (2017). These methods all aim to approximate the particle filter algorithm in some sense to avoid degeneracy. The method we propose here is in a different vein—we will instead approximate the model (1) by adding artificial process noise. The filtering problem is then solved using a regular particle filter with the proposal chosen as a combination of the standard (bootstrap) proposal and the locally optimal proposal. This is related to "roughening", first introduced by Gordon et al. (1993), where artificial noise is added after resampling to spread the particles in an attempt to mitigate degeneracy. Here we refine this concept by proposing a specific proposal for the approximate model to improve the performance for high-dimensional models. Based on results by Snyder et al. (2015), we also provide insights on how approximating the model by adding noise can be seen as a bias-variance trade-off where the magnitude of the artificial process noise is a tuning parameter.
2 Background on the particle filter
The particle filter sequentially approximates the filtering distribution as where are random samples (particles), are their corresponding weights, is the Dirac delta function and is the number of particles. It is often impossible to sample from the filtering distribution directly, instead importance sampling is used where samples are drawn sequentially from a proposal distribution . The proposal can be any distribution from which it is possible to draw samples and for which whenever . To adjust for not sampling from a correction is introduced in the weight update. The unnormalized importance weights are given by
where is the normalized weight from the previous time step. The normalized importance weights are .
Each iteration in the filtering algorithm consists of three steps. First resampling is (possibly) performed according to the normalized weights from the previous time step, and the weights are set to . The particles are then propagated to the next time step using the proposal distribution . Finally the normalized importance weights are computed as described above. Further details on the particle filtering algorithm can be found e.g. in (Doucet et al., 2000).
2.1 The standard proposal
A common choice of proposal distribution is the transition density , referred to as the standard proposal. Inserting this proposal in (2) gives the unnormalized weights . If resampling is performed in every iteration this choice of proposal corresponds to the original version of the particle filter, the bootstrap filter (Gordon et al., 1993). Note that it is sufficient to be able to simulate from , exact evaluation of the expression is not required. Therefore the standard proposal can be used even for models like (1
) where, typically, no closed form expression is available for the transition density. Furthermore, the corresponding weights, given by the observation likelihood, can be easily evaluated. However, this choice of proposal is prone to weight degeneracy, in particular when the system is high-dimensional or when the observations are very informative (low observation noise) or contain outliers(Cappé et al., 2007). This degeneracy occurs when there is little overlap between the observation likelihood and the prior distribution of particles.
2.2 The locally optimal proposal
A possible remedy for the shortcomings of the standard proposal is to use a proposal which shifts the particles towards the observations by taking both the previous state and the current observation into account when propagating and reweighting the particles. One such choice is the locally optimal proposal, which is optimal in the sense that the variance of the importance weights is minimized when compared to other proposals depending only on and (Doucet et al., 2000). The locally optimal proposal propagates the particles according to
and then reweights using the importance weights . Unfortunately it is often not possible to use this proposal due to two major difficulties; it must be possible both to sample from the proposal and to evaluate . This integral can, in most cases, only be evaluated when is conjugate to .
The locally optimal proposal is in general not available for the model (1). One exception is the special case when the state dynamics are non-linear with additive Gaussian noise, that is when (Doucet et al., 2000). Assuming and are independent Gaussian noise with mean zero and covariances and respectively, (1) can be expressed using the densities
Both densities are Gaussian, so well-known relations give the proposal where
and for the corresponding weights we obtain
3 Particle filter with conjugate artificial process noise
The locally optimal proposal has minimal degeneracy compared to other proposals, and for high-dimensional systems it can improve upon the standard proposal by several orders of magnitude (Snyder et al., 2015). Unfortunately, as pointed out in the previous section the locally optimal proposal is in general not available for (1) and common approximations, e.g. based on local linearizations (Doucet et al., 2000), are not applicable when the transition density function is intractable. However, to still be able to leverage the benefits of the locally optimal proposal we propose a controlled approximation of (1) where artificial noise is added in an extra state update. The approximate model is given by
where is a parameter adjusting the magnitude of the artificial noise. We consider a linear-Gaussian observation model (7c), hence for conjugacy between (7b) and (7c) where is a covariance matrix. Note that by choosing we recover the original model (1).
To design a particle filter for (7) we must choose a proposal and derive the corresponding weights. If and are taken to be the system states, the model (7) suggests using a combination of the standard and the locally optimal proposal. First the particles are propagated according to the standard proposal (7a). Noting that the two latter equations (7b) and (7c) are linear-Gaussian the particles are then propagated according to the locally optimal proposal taking to be the previous state. Using (5) and (6) we obtain the combined proposal
where is the standard proposal (7a). The corresponding importance weights are
It is clear from (8) and (9) that choosing will recover the standard proposal. Equation (9) also indicates why this choice of proposal is beneficial for high-dimensional problems; adding artificial process noise () will make the covariance parameter larger which in turn will make the weights less degenerate.
One way to interpret the proposed strategy is that adding artificial process noise in (7) will introduce some extra movement to the particles. The first propagation (standard proposal) moves the initial particles according to the state dynamics. If we add noise () the optimal proposal then moves the propagated particles further, shifting them towards the observation according to the expression for the mean in (8).
Snyder et al. (2015) found that the difference in performance between the locally optimal and the standard proposal increases with the magnitude of the process noise. Effectively this means that when using the locally optimal proposal the Monte Carlo variance of e.g. estimates of the normalizing constant or test functions, such as the posterior mean or variance of the system state, can be reduced by adding more process noise. However, adding more artificial process noise in (7) will introduce more bias in our estimates, so ultimately our proposed method has a bias-variance trade-off where is the tuning parameter.
3.1 Choice of parameters
The parameter adjusts the magnitude of the noise and hence controls the bias-variance trade-off. A high value on implies that there is a lot of freedom in moving the particles in the second stage of the proposal which results in a lower Monte-Carlo variance, but at the cost of a higher model bias.
For the case of a linear-Gaussian observation model the covariance matrix
, describing the correlation structure, must also be specified. A simple choice is to use the identity matrix which corresponds to adding noise of the same magnitude to all states, but with no correlation between states. However, if some states are not observed they will not be affected by the artificial process noise—they will just be propagated blindly forwards according to the standard proposal. Another possible choice is to use the weighted sample covariance matrix. This choice will take the correlation structure of the states into account which can mitigate the impact of not observing some states. Each element of the weighted sample covariance matrixat time is given by
where , is the state dimension, is the normalized weight of particle , is the value of the dimension of particle and are the sample mean for dimension given by .
4 Numerical examples
To evaluate our proposed method we consider two examples; a linear-Gaussian state space model and the non-linear Lorenz’96 model (Lorenz, 1995). For both models we examine the 10-dimensional case where only half of the states are observed (in noise) at each time step. Two choices of covariance matrices for the artificial process noise are considered; the block-diagonal matrix with an identity matrix in the upper block and zeros in the lower block, and the weighted sample covariance matrix with elements given by (10). The first choice will add artificial process noise to the observed states only, with no correlation between the states, whereas the latter choice will add artificial process noise to all states and allows for correlation between the states.
To quantify the performance of our method we use two measures; the marginal log-likelihood of the data, , and the mean square error (MSE) of the state estimates averaged over all time steps and all dimensions. The likelihood of the data is of particular interest for parameter estimation and model checking, and the MSE of the state estimate is of interest for filtering applications. In our discussion of the performance we will also refer to the effective sample size (ESS) for the particle filter given by . If the ESS drops too low at some point the filter estimates will be degenerate since all particles will originate from a small number of ancestor particles.
4.1 Linear Gaussian model
We first consider a 10-dimensional linear state space model with Gaussian noise of the form
where is tridiagonal with value on the diagonal, on the first diagonal above and below giving a local dependence between the states, and zeros everywhere else. We observe half of the states, hence is
where the left half is an identity matrix and the right half is a zero matrix. To make the estimation problem harder we assume that the covariance of the measurement noise is two orders of magnitude smaller than the covariance of the process noise (vs ). Data for time steps was generated from (11) and particles were used to estimate the states with the particle filter given by (8) and (9).
For the linear-Gaussian state space model the optimal solution to the filtering problem is available from the Kalman filter, hence it is possible to compare the performance of our method with the best achievable performance. We will compare both with the true Kalman filter estimate and with the Kalman filter estimate for the approximate model (7).
Fig. 1 shows the log-likelihood and the MSE as a function of for (left) and (right). It is clear from the log-likelihood plots that the standard proposal () degenerates whereas with our proposed method it is possible to almost reach the log-likelihood and MSE for the true Kalman filter for certain ranges of . It is also evident that there is a bias-variance trade-off with a higher variance for low values on and a lower variance but bigger bias for larger values on .
Note that for small values of the negative bias in the estimate of is an effect of the increased Monte Carlo variance. It is well known that the particle filter estimate of is unbiased, which by Jensen’s inequality implies , where a large Monte Carlo variance tends to result in a large negative bias. By a log-normal central limit theorem
, where a large Monte Carlo variance tends to result in a large negative bias. By a log-normal central limit theorem(Bérard et al., 2014) it holds that the bias is roughly for large enough, where is the Monte Carlo variance.
For the sample covariance the highest log-likelihood and lowest MSE is obtained for similar values on , around . For the identity matrix on the other hand the range of values for giving the highest log-likelihood are lower than the range of giving the lowest MSE. As increases both choices of covariance matrices approaches the Kalman filter estimate for the approximate model (7) corresponding to the best we can do.
Estimated probabilities of degeneracy in terms of the ESS for varying. Left: Block-diagonal covariance matrix. Right: Weighted sample covariance matrix.
4.2 Lorenz’96 model
Next we consider the non-linear, possibly chaotic Lorenz’96 model which is often used for testing data assimilation algorithms (Lorenz, 1995). The d-dimensional Lorenz’96 model is defined in terms of coupled stochastic differential equations which, for the continuous state , are given by
where the first term is drift and the second term is diffusion. The model is cyclic, hence , and is assumed. is a forcing constant confining the volume in which the solution can move and, for a fixed dimension of the state space, it determines whether the system exhibits chaotic, decaying or periodic behavior (Karimi and Paul, 2010). We consider the 10-dimensional case and to obtain a highly non-linear model exhibiting chaotic behavior we use . For the diffusion term we choose and is a standard 10-dimensional Wiener process. The observations are linear-Gaussian and, like in (11), we observe only half of the states.
Data was generated for steps assuming observations are made with a timestep . The system can be discretized by considering the discrete state which, between observations, is propagated forward according to (4.2) using iterations of the Euler-Maruyama method for numerical solution of stochastic differential equations. Note that this system, unlike the previously considered linear-Gaussian system, is one example of (1) where no closed form expression for exists which makes the filtering problem particularly challenging. For the artificial process noise particle filter we use particles and the propagation of the states is a simulation forward in time using the Euler-Maruyama method for solving (4.2) numerically.
For small values of the particle filter tends to degenerate, resulting in very poor estimates of the log-likelihood (as low as ) and high MSE values (as high as 60). For clarity of presentation we therefore split the particle filter runs into two groups: one containing the degenerate cases, in which the ESS dropped below two at least once during the run, and one for the non-degenerate cases, in which the ESS stayed above two. Fig. 2 has been zoomed in on the non-degenerate runs, for (left) and (right). Plots showing all runs are given in Fig. 4 in the appendix.
From Fig. 2 we observe that indeed there is a bias-variance trade-off for the proposed method where a higher value of reduces the variance but increases the bias. A comparison of the log-likelihood plots for and shows that the latter choice obtains higher likelihood estimates which decay more slowly and with a lower spread on the estimates for high values on . Similarly, the MSE estimates also show less spread for the choice . This indicates that the sample covariance matrix might give an estimate which is more robust for varying values of for this model.
To examine the effect of on the degeneracy of the particle filter in more detail we show the estimated probabilities of degeneracy (as defined above in terms of the ESS) for varying in Fig. 3, for (left) and (right). These probabilities are estimated based on binning the range of into 100 equally sized bins and counting the number of degenerate runs in each bin. This explains the raggedness of the estimates, but the plots nevertheless give an idea of how the probability of degeneracy decreases with increasing (we expect that the true probabilities are monotonically decreasing).
The particle filter is a powerful inference method with strong convergence guarantees. However, for challenging cases such as high-dimensional models the well-known degeneracy issue of the particle filter can cause the Monte Carlo error to be significant, effectively rendering the standard particle filter useless. Furthermore, for many models of interest—in particular non-linear models of the form (1)—it is difficult to improve over the standard particle filter proposal due to the intractability of the transition density function. To alleviate this issue we have proposed to instead perform filtering for an approximate model where the approximation is such that it opens up for a more efficient particle filter proposal. This is in contrast with many existing approaches to approximate filtering, which are often based on approximating the inference algorithm itself. One motivation for this approach is that the model in most cases is an approximation to begin with, so adding a further approximation does not necessarily deteriorate the data analysis to any large degree. Another motivation is that an error analysis of the proposed procedure can focus on the model bias. Indeed, the inference method that we use is a regular particle filter (albeit for a non-standard model) and the properties of these methods are by know fairly well understood.
The proposed model approximation, and thus also the bias-variance trade-off of the resulting method, is controlled by the tuning parameter . From our numerical examples with two 10-dimensional state-space models it is clear that the introduction of a non-zero
can significantly improve the performance over the standard particle filter, both in terms of log-likelihood estimates and in terms of the filtering MSE. However, if we increase the dimension much further we expect that the proposed method will struggle as well. This is supported by the fact that even the locally optimal proposal will suffer from the curse of dimensionality(Snyder et al., 2015). Thus, to address very high-dimensional problems (thousands of dimensions, say) we most likely need to combine the proposed method with other approaches. Such combinations of techniques is an interesting topic for future research.
So far we have only investigated the empirical performance of the new method for varying . In practice we would of course like to know which to pick beforehand, or have an adaptive scheme for tuning online. Devising such rules-of-thumb or adaptive methods is a topic for future work. The same hold for the choice of base covariance which needs further investigation. We note, however, that our empirical results suggest that the method is fairly robust to selecting “too large”, whereas a too small resulted in very poor performance. A possible approach is therefore to start with a large which is then gradually decreased while monitoring the ESS.
Finally, we note that the proposed method is useful not only for filtering problems, but also for system identification. Several state-of-the-art methods for identification of non-linear state-space models are based on the log-likelihood estimate (see, e.g., Schön et al. (2015); Kantas et al. (2015)). Thus, the significant improvement in these estimates offered by the proposed method should have direct bearing also on non-linear system identification.
This research is financially supported by the Swedish Research Council via the project Learning of Large-Scale Probabilistic Dynamical Models (contract number: 2016-04278) and by the Swedish Foundation for Strategic Research via the projects Probabilistic Modeling and Inference for Machine Learning
Probabilistic Modeling and Inference for Machine Learning(contract number: ICA16-0015) and ASSEMBLE (contract number: RIT15-0012).
- Bérard et al. (2014) J. Bérard, P. Del Moral, and A. Doucet. A lognormal central limit theorem for particle approximations of normalizing constants. Electronic Journal of Probability, 19:28 pp., 2014.
- Cappé et al. (2007) O. Cappé, S. J. Godsill, and E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, May 2007.
- Djurić and Bugallo (2013) P. M. Djurić and M. F. Bugallo. Particle filtering for high-dimensional systems. In 2013 5th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 352–355, Dec 2013.
- Doucet et al. (2000) A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.
- Evensen (1994) G. Evensen. Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. Journal of Geophysical Research: Oceans, 99(C5):10143–10162, 1994.
- Gordon et al. (1993) N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F Radar and Signal Processing, 140(2):107, 1993.
- He et al. (2010) D. He, E. L. Ionides, and A. A. King. Plug-and-play inference for disease dynamics: measles in large and small populations as a case study. Journal of The Royal Society Interface, 7(43):271–283, 2010.
- Kantas et al. (2015) N. Kantas, A. Doucet, S. S. Singh, J. Maciejowski, and N. Chopin. On particle methods for parameter estimation in state-space models. Statistical Science, 30(3):328–351, 08 2015.
- Karimi and Paul (2010) A. Karimi and M. R. Paul. Extensive chaos in the Lorenz-96 model. Chaos: An Interdisciplinary Journal of Nonlinear Science, 20(4):043105, 2010.
- Lorenz (1995) E. Lorenz. Predictability: a problem partly solved. In Seminar on Predictability, 4-8 September 1995, volume 1, pages 1–18, Shinfield Park, Reading, 1995.
- Mattern et al. (2013) J. P. Mattern, M. Dowd, and K. Fennel. Particle filter-based data assimilation for a three-dimensional biological ocean model and satellite observations. Journal of Geophysical Research: Oceans, 118(5):2746–2760, 2013.
- Naesseth et al. (2015) C. Naesseth, F. Lindsten, and T. Schön. Nested sequential Monte Carlo methods. In Proceedings of the 32nd International Conference on Machine Learning, pages 1292–1301, Lille, France, 07–09 Jul 2015.
- Rebeschini and van Handel (2015) P. Rebeschini and R. van Handel. Can local particle filters beat the curse of dimensionality? The Annals of Applied Probability, 25(5):2809–2866, 10 2015.
- Robert and Künsch (2017) S. Robert and H. R. Künsch. Localizing the ensemble Kalman particle filter. Tellus A: Dynamic Meteorology and Oceanography, 69(1):1282016, 2017.
- Schön et al. (2015) T. B. Schön, F. Lindsten, J. Dahlin, J. Wågberg, C. A. Naesseth, A. Svensson, and L. Dai. Sequential Monte Carlo methods for system identification. IFAC-PapersOnLine, 48(28):775 – 786, 2015.
- Shaman and Karspeck (2012) J. Shaman and A. Karspeck. Forecasting seasonal outbreaks of influenza. Proceedings of the National Academy of Sciences, 109(50):20425–20430, 2012.
- Snyder et al. (2015) C. Snyder, T. Bengtsson, and M. Morzfeld. Performance bounds for particle filters using the optimal proposal. Monthly Weather Review, 143(11):4750–4761, 2015.
Appendix A Additional plots
Fig. 4 shows all runs for the 10-dimensional Lorenz’96 model. For small values on the filter degenerates for both choices of covariance matrices. When is increased the number of degenerate estimates gradually decreases.