 # Learning nonlinear state-space models using smooth particle-filter-based likelihood approximations

When classical particle filtering algorithms are used for maximum likelihood parameter estimation in nonlinear state-space models, a key challenge is that estimates of the likelihood function and its derivatives are inherently noisy. The key idea in this paper is to run a particle filter based on a current parameter estimate, but then use the output from this particle filter to re-evaluate the likelihood function approximation also for other parameter values. This results in a (local) deterministic approximation of the likelihood and any standard optimization routine can be applied to find the maximum of this local approximation. By iterating this procedure we eventually arrive at a final parameter estimate.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Consider the nonlinear state-space model

 xt|xt−1 ∼fθ(xt|xt−1), (1a) yt|xt ∼gθ(yt|xt), (1b)

where the hidden states and observations evolves over time

. We assume that the probability densities

and are parameterized by unknown parameters , which we are interested to learn from data. More specifically we are seeking the maximum likelihood estimate of from given observations , i.e., to solve

 ˆθ=argmaxθpθ(y1:T). (2)

We shall refer to as the likelihood function when considered a function of . It follows from (1) and the initial state density that

 pθ(y1:T)=∫p(x0)T∏t=1fθ(xt|xt−1)gθ(yt|xt)dx0:T. (3)

This can be evaluated analytically only for a few special cases. Approximations, such as the particle filter, are needed for the nonlinear case. For any given , the particle filter can be run to give a Monte Carlo based estimate of (3). It is well-known (Del Moral, 2004) that is unbiased, i.e., , where the expectation is over the stochasticity in the particle filter algorithm itself. Albeit unbiased,

is also stochastic (i.e., a different value is obtained every time the particle filter algorithm is run, since it is a Monte Carlo solution), often with a considerable variance and non-Gaussian characteristics, and standard optimization routines can therefore not be used to solve (

2). To this end, schemes using, optimization for noisy function evaluations have been applied (e.g., Wills and Schön 2017; Dahlin and Lindsten 2014).

We propose in this paper to iteratively run a particle filter with some current parameter estimate and scrutinize the particle filter algorithm to construct a deterministic
local approximation of the likelihood function around . In a sense we hereby manage to circumvent the stochasticity from the Monte Carlo procedure in a consistent way, and can allow for standard optimization routines to search for a new parameter value , for which we can run a new particle filter, etc. The idea is outlined as Algorithm 1. Related ideas have been proposed earlier (Le Gland, 2007), but they have to the best of the authors knowledge not yet been fully explored in the system identification context.

## 2 Background on particle filtering

The particle filter was originally proposed as a Monte Carlo solution to the state filtering problem (Gordon et al., 1993), i.e., to compute . It was soon (Kitagawa, 1996) realized that the particle filter could also be used to estimate the likelihood function for given parameter values, essentially by inserting the particles (Monte Carlo samples) into the integral in (3) and thereby obtain an estimate of . Thanks to this likelihood estimate, the particle filter can be used for system identification purposes.111There are several alternative ways in which the particle filter can be used for system identification, for example approaches based on the EM algorithm or Gibbs sampling. As mentioned, is unbiased, but it often has a heavy-tailed and asymmetric distribution with a non-negligible variance. Its exact properties depends on the particle filter settings and the model.

A textbook introduction to particle filters is given in, e.g., Schön and Lindsten (2017); Doucet et al. (2001). We summarize a rather general formulation of the particle filter in Algorithm 2, a version of the auxiliary article filter (Pitt and Shephard, 1999). We use to denote an almost arbitrary222The support of has to cover the support of . proposal distribution. Furthermore, are the resampling weights, the importance weights, and denotes the categorical distribution on the set with (possibly unnormalized) weights , and is the number of particles.

Let us be explicit about the stochastic elements of Algorithm 2: Particles are initially drawn from the initial distribution on line 2 (which we assume to be independent of ). Furthermore, the ancestor indices on line 2 are drawn with respect to the resampling weights , and finally the propagation of particles on line 2, where the new particles are drawn from the proposal .

Whereas , and in Algorithm 2 are given by the model specification and follows from the algorithm, the choices for , and are left to the user. The number of particles is usually taken as large as the computational budget permits, and two common choices for weights and proposals are

• the bootstrap particle filter (Gordon et al., 1993) with

 q(xt|xt−1,yt)=fθ(xt|yt)

and

 νnt=wnt,

and consequently . This choice is generic and requires very little from the user, but has inferior performance compared to

• the fully adapted particle filter (Pitt and Shephard, 1999), with

 q(xt|xt−1,yt)=pθ(xt|xt−1,yt)

and

 νnt=pθ(yt|xt−1).

This choice is superior to the bootstrap choice in terms of variance of the obtained approximation, but it is only available for a quite limited set of models. The literature on approximations to this choice is therefore rich (e.g., Naesseth et al. 2015; Doucet et al. 2000).

We will in this paper exploit the relatively large freedom that is available when it comes to choosing the proposal density and the resampling weights. By making a choice that only depends on the current parameter value  it is possible to evaluate the likelihood estimator (which we will denote by ) for different values of , while at the same time making use of the same realization of .

## 3 Related work

The use of the likelihood estimator as an objective function in optimization, in the search for a maximum likelihood estimate , has been subject to several studies: Doucet and Tadić (2003) differentiate

and use it in a stochastic gradient descent scheme, whereas

Dahlin and Lindsten (2014); Wills and Schön (2017) use an optimization scheme based on Gaussian processes. Malik and Pitt (2011) use a fixed random seed and run the particle filter for different . For a fixed random seed, is indeed deterministic, however with discontinuous ‘jumps’ due to different resampling decisions being made for different . To this end, Malik and Pitt proposes an approximative smoothing to obtain a continuous function.

The idea used in this paper, i.e., to make the ancestor indices and particles depending only on the current, or reference, parameter value (instead of ) has been analyzed by Le Gland (2007), but has (to the best of the authors knowledge) not yet been applied to the system identification problem. The work by Le Gland

concerns central limit theorems for the likelihood estimator

. The application of the results from Le Gland (2007) to this work is subject for further work. The work presented in this paper shares similarities with the authors recent work Svensson et al. (2017), which however is concerned with the different topic of Bayesian identification for state-space models with highly informative observations.

Other approaches to maximum likelihood estimation in nonlinear state-space models include the combination of the popular Expectation Maximization (EM) algorithm and particle filters

(Lindsten, 2013; Schön et al., 2011; Olsson et al., 2008).

## 4 The solution

The key idea in this paper is to choose and such that they are independent of . By such a choice, we note that all the random elements in Algorithm 2, , become independent of . If we then condition on a certain realization of , the estimate becomes a deterministic function in , and any standard optimization routine can subsequently be applied to solve (3) and find .

The strength of the particle filter, however, lies in the sequential build-up of the samples on the high-dimensional space , where the resampling operation provides important ‘feedback’ on which parts of the state space to be explored further. With an arbitrary choice of -independent resampling weights , this feature will be lost, and we may expect an extremely high variance in the obtained estimate. In fact, a particle filter with -independent resampling weights van be understood as importance sampling on the space , and we can in general not expect such an approach to be successful.

In order to obtain a deterministic function in , but avoid the bad consequences of a -independent resampling, we propose to let the resampling weights and proposal depend on some current parameter estimate , as, e.g.,

 q(xt|xt−1,yt)=fθk−1(xt|yt) (4a) and νnt=gθk−1(yt|xnt), (4b)

i.e., the bootstrap choice for (instead of ). If then is somewhat close to , we can expect the variance of the corresponding estimate of the likelihood function, which we denote by , not to be forbiddingly large.

However, if the current is far from , we cannot expect to be a particularly good estimator at the value , and in particular, not expect the maximum of to lie at . For this reason, we have to iterate the parameter values over until we arrive in the vicinity of . By inserting (4) into Algorithm 2, combined with an outer optimization loop as discussed, and re-arranging, we arrive at Algorithm 3. For numerical reasons, we work with the logarithm of the likelihood function. The conceptual idea is illustrated in Figure 1. Figure 1: The gray dots are log likelihood estimates obtained by running individual bootstrap particle filters with N=100 particles and parameter value θk−1. Conditioned on the particle system underlying each particle filter (gray dots), the likelihood function is approximated also for other θ values (red lines), which we can expect to be practically useful in the vicinity of θk−1. The idea of Algorithm 1 is as follows: Start with some θk−1 and sample a corresponding gray dot (particle_filter), and then apply a standard optimization scheme to find the maximum of the corresponding red line (log_likelihood). We save the result as θk, and start over again with a new particle filter for θk, etc.

### 4.1 Solving the maximization problem

On line 3 in Algorithm 3, we have formulated a step where is to be solved. Importantly, this is now a completely deterministic problem and it can be handled by any standard numerical optimization tool. We will in the experiments demonstrate this by simply applying the general-purpose optimization tool fminunc in Matlab333Similar functions in other languages are fminunc (Octave), scipy.optimize (Python), optim (R) and optimize (Julia)..

The particular structure of (defined implicitly in the function likelihood in Algorithm 3) could, however, possibly be utilized by a more tailored optimization scheme. Its structure can be written as

 ˆzθk−1(θ)=1NT∏t=1N∑n=1cnt ωnt(θ) fθ(xt|xantt−1)gθ(yt|xnt), (5)

where is a constant that is independent of , depends on but always fulfil , and and depends on the model. Whether this function exhibits any particular properties that could be exploited in optimization is subject to further research.

## 5 Analysis

We will now present a brief analysis of the proposed Algorithm 3. First, we conclude in Section 5.1 that the proposed scheme has desirable asymptotic properties as . Second, we make an attempt in Section 5.2 to understand the behavior also for finite , and third in Section 5.4 we discuss an alternative solution that would place the proposed method in the framework of stochastic gradient descent methods.

### 5.1 Convergence as N→∞

Asymptotically as in the particle filter, the proposed method becomes exact and converges (in principle) in one iteration. This can be realized as follows: The log-likelihood is estimated by

 logpθ(y1:T)≈T∑t=1log(N∑i=1wntp(yt|xnt)). (6)

Assuming that the proposal distribution used to generate the particles is everywhere non-zero, this log-likelihood approximation is consistent under weak conditions and converges point-wise in as (Doucet and Johansen, 2011; Del Moral, 2004). Thus, as long as the global solution to can be found, it is indeed the likelihood that has been maximized and found, which happens in a single iteration .

### 5.2 Convergence as k→∞ and finite N

It is indeed reassuring that our proposed scheme is consistent as the number of particles , as discussed in the previous section. For practical purposes, however, the behavior for a finite (which always is the case in an implementation) is probably more interesting.

We start by noting that for , it holds that (Del Moral, 2004). Note, however, that this does not imply

 E[argmaxθˆzθk−1(θ)]=argmaxθpθ(y), (7)

so we do not have a theoretical justification to simply average the obtained sequence to obtain .

To obtain some guidance on how to extract a final estimate from the obtained sequence , we can make the simplifying assumption that the error , viewed as a stochastic process with index variable , is stationary. In such a case, we can (under some technical assumptions) expect that is equal to the maximum mode of the distribution for . A proof sketch for this claim is found in Appendix A. This suggests that we should look at the maximum mode in the histogram for to find a good estimate of when we are using the method in practice (i.e., with a finite ). This will be illustrated in Example 2 and Figure 2(c).

### 5.3 Stability

Stability, i.e., that the sequence does not diverge, is another important property. We have not experienced any such issues with the proposed scheme. The key to a stable algorithm is that the solution to the maximization problem on line 3 in Algorithm 3 is not too far from . To motivate that this is likely to be the case, we note that while is unbiased for all , its variance tends to increase as the distance between and increases. It is also known that (rather than

) has approximately a Gaussian distribution, which implies that

has a bias in the order of . This motivates that for far from , the value of is likely to be small, and hence cause the algorithm not to deviate to much from the current iterate.

An alternative approach would be not to solve the problem, but only use to estimate a gradient around and take an (inevitable stochastic) gradient step. Indeed, this has already been proposed by Doucet and Tadić (2003). Designing step lengths based on stochastic approximation ideas (Robbins and Monro, 1951) yields the well-studied stochastic gradient descent method. Our practical experience, however, is that (stochastic) gradient steps have inferior performance compared to the proposed scheme for our problem, including slower convergence and severe stability issues.

## 6 Numerical experiments

We will in this section apply our proposed method to two simulated examples, in order to illustrate and evaluate it. First a common example form the literature will be considered, and comparisons to alternative methods made. Thereafter a more difficult example will be studied. The code for the examples is available via the first author’s homepage.

### 6.1 Example 1

In this first example, we consider measurements generated by the model

 xt+1 =0.5xt+bxt1+x2t+8cos(1.2t)+qwt, (8a) yt =0.05x2t+et, (8b)

where , , and . The true values of are , and this example (with and ) was also used to generate Figure 1. The proposed Algorithm 3 is implemented with , and employing the generic optimization routine fminunc in Matlab to solve the optimization problem on line 3 in Algorithm 3. The initial  is chosen randomly on the intervals and , respectively, and the entire example is repeated 100 times. Each example took approximately 6 seconds on a standard laptop, and the results are found in Figure 1(a). We compare with two alternative methods for maximum likelihood estimation in nonlinear state-space models: The results for particle stochastic approximation EM (PSAEM, Lindsten 2013) applied to the very same problem are reported in Figure 1(b). The results for the same problem with a stochastic optimization approach using the particle filter to estimate the likelihood and the Gaussian process (GP) to model the likelihood surface and its derivatives (Wills and Schön, 2017; Mahsereci and Hennig, 2015) are found in Figure 1(c). With the number of iterations chosen as in the figures, the computational load are of the same order of magnitude for all three methods.

From this, we conclude that our proposed method tend to converge faster than the alternatives (counting the number of iterations needed), but that each iteration is computationally more involved. The computational load of our algorithm is partly due to the use of a generic optimization routine (fminunc in Matlab), which makes no use of the particular structure of the objective function (likelihood in Algorithm 3), as discussed in Section 4.1.

### 6.2 Example 2

Now, consider the following state-space model

 xt+1 =xa+x2+bu+wt, wt ∼N(0,1), (9a) yt =x+et, et ∼N(0,1). (9b)

with and . One data set is generated with , and our method is applied, with different initializations, 100 times to find . This problem is significantly harder than Example 1 due to the location of in the denominator (and not the numerator) in (9a). As an illustration, independent particles filter were run to estimate the log-likelihood for different values of in Figure 2(a), from which we conclude that the likelihood estimate is rather noisy. This can be compared to Example 1 and the gray dots in Figure 1, where the likelihood estimation is clearly less challenging. Again, we use the proposed Algorithm 3 with fminunc in Matlab to solve the optimization problem on line 3. The results are shown in Figure 2(b) and 2(c). Despite the challenging likelihood estimation, our proposed method manages to eventually converge towards meaningful values, and following the guidance discussed in Section 5.2, we take the final estimates as the maximum of the histograms in Figure 2(c), , which corresponds well to the true parameters. (a) Log-likelihood estimates (vertical axis) for the model (9a) in Example 2 for different a (horizontal axis, true a=0.5) and b=−2, obtained with independent particle filters with N=100. As can be seen, the variance in ˆzθ is rather high in this example, which is to be compared with the gray dots in Figure 1, the corresponding plot for Example 1. We thus expect maximum likelihood estimation to be significantly more challenging in this example.

## 7 Conclusions

We have proposed a novel method, Algorithm 3, to find maximum likelihood estimates of unknown parameters in nonlinear state-space models. The method builds on the particle filter, and allows for parameter inference in any model where also a particle filter can be applied. In fact, the method can be understood as a particular choice of -independent proposal and resampling weights in the auxiliary particle filter.

One key reason for the promising results is probably that we propose to solve an optimization problem at each iteration, instead of only taking a gradient step or similar: heuristically this seems to lead to fast convergence, and has not caused any problems with instability. The theoretical analysis, however, becomes more involved. We have presented an attempt to such an analysis in Section

5.2, but all questions have not been answered. As mentioned, the work by Le Gland (2007) could potentially be useful in a more thorough analysis of the proposed method.

A tailored choice of optimization routine would be interesting for further work. Furthermore, the applicability of the proposed method for the particle marginal Metropolis-Hastings sampler (Andrieu et al., 2010) would be another interesting question.

## References

• Andrieu et al. (2010) Christophe Andrieu, Arnaud Doucet, and Roman Holenstein.

Particle Markov chain Monte Carlo methods.

Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
• Dahlin and Lindsten (2014) Johan Dahlin and Fredrik Lindsten. Particle filter-based Gaussian process optimisation for parameter inference. In Proceedings of the 19th IFAC World Congress, pages 8675–8680, Cape Town, South Africa, August 2014.
• Del Moral (2004) Pierre Del Moral. Feynman-Kac formulae: genealogical and interacting particle systems with applications. Springer, New York, NY, US, 2004.
• Doucet and Johansen (2011) Arnaud Doucet and Adam M. Johansen. A tutorial on particle filtering and smoothing: fifteen years later. In D. Crisan and B. Rozovsky, editors, Nonlinear Filtering Handbook, pages 656–704. Oxford University Press, Oxford, UK, 2011.
• Doucet and Tadić (2003) Arnaud Doucet and Vladislav B. Tadić. Parameter estimation in general state-space models using particle methods. Annals of the Institute of Statistical Mathematics, 55(2):409–422, 2003.
• Doucet et al. (2000) Arnaud Doucet, Simon J. Godsill, and Christophe Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.
• Doucet et al. (2001) Arnaud Doucet, Nando de Freitas, and Neil J. Gordon. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo methods in practice, pages 3–14. Springer, 2001.
• Gordon et al. (1993) Neil J. Gordon, David J. Salmond, and Adrian F.M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F - Radar and Signal Processing, pages 107–113, 1993.
• Kitagawa (1996) Genshiro Kitagawa. Monte carlo filter and smoother for non-gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1):1–25, 1996.
• Le Gland (2007) Francois Le Gland. Combined use of importance weights and resampling weights in sequential monte carlo methods. ESAIM: Proc., 19:85–100, 2007.
• Lindsten (2013) Fredrik Lindsten. An efficient stochastic approximation EM algorithm using conditional particle filters. In Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 6274–6278, Vancouver, BC, Canada, May 2013.
• Mahsereci and Hennig (2015) Maren Mahsereci and Philipp Hennig. Probabilistic line searches for stochastic optimization. In Advances in Neural Information Processing Systems 28 (NIPS), pages 181–189, Montréal, QC, Canada, December 2015.
• Malik and Pitt (2011) Sheheryar Malik and Michael K. Pitt. Particle filters for continuous likelihood evaluation and maximisation. Journal of Econometrics, 165(2):190–209, 2011.
• Naesseth et al. (2015) Christian A. Naesseth, Fredrik Lindsten, and Thomas B. Schön. Nested sequential Monte Carlo methods. In Proceedings of the 32nd

International Conference on Machine Learning (ICML)

, pages 1292–1301, Lille, France, July 2015.
• Olsson et al. (2008) Jimmy Olsson, Olivier Cappé, Randal Douc, and Éric Moulines. Sequential Monte Carlo smoothing with application to parameter estimation in nonlinear state-space models. Bernoulli, 14(1):155–179, 2008.
• Pitt and Shephard (1999) Michael K. Pitt and Neil Shephard. Filtering via simulation: auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999.
• Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
• Schön and Lindsten (2017) Thomas B. Schön and Fredrik Lindsten. Sequential Monte Carlo methods. 2017.
• Schön et al. (2011) Thomas B. Schön, Adrian Wills, and Brett Ninness. System identification of nonlinear state-space models. Automatica, 47(1):39–49, 2011.
• Svensson et al. (2017) Andreas Svensson, Thomas B. Schön, and Fredrik Lindsten. Learning of state-space models with highly informative observations: a tempered sequential monte carlo solution. Mechanical Systems and Signal Processing, 2017. Accepted for publication.
• Wills and Schön (2017) Adrian Wills and Thomas B. Schön. On the construction of probabilistic newton-type algorithms. In Proceedings of the 56th IEEE Conference on Decision and Control (CDC), Melbourne, Australia, December 2017.

## Appendix A Proof sketch

This is a sketch for a proof of the claim given in Section 5.2. Let be fixed and assume that is a stationary stochastic process with index set . For any and , let be a ball of radius centered at . For notational simplicity, let and note that this is a deterministic function of which is assumed to be Lipschitz continuous and attain its maximum for . Now, take sufficiently small so that . For any with we then have

 P(argmaxθ{ε(θ)+h(θ)}∈Bδ(θ′))=P(maxθ∈Bδ(θ′){ε(θ)+h(θ)}≥maxθ∈Θ{ε(θ)+h(θ)})≤P(maxθ∈Bδ(θ′)ε(θ)+minθ∈Bδ(ˆθ)h(θ)≥maxθ∈Θ{ε(θ)+h(θ)})=P(maxθ∈Bδ(ˆθ)ε(θ)+minθ∈Bδ(ˆθ)h(θ)≥maxθ∈Θ{ε(θ)+h(θ)})≤P(maxθ∈Bδ(ˆθ){ε(θ)+h(θ)}≥maxθ∈Θ{ε(θ)+h(θ)})=P(argmaxθ{ε(θ)+h(θ)}∈Bδ(ˆθ))

where the fourth line follows from the assumed stationarity of . Now, since is arbitrary, it follows that if admits a density w.r.t. Lebesgue measure, its density function is maximized at .