1 Introduction
State-space models are a widely utilised and flexible class of models considered in many scientific and engineering applications [15]. Generally, identification of the parameters of state-space models remains a challenging task [16, 19]. In this paper, nonlinear state-space models of the form
(1a)  $x_{t+1} \sim p(x_{t+1} \mid x_t, \theta)$,
(1b)  $y_t \sim p(y_t \mid x_t, \theta)$,
are considered, where $x_t$ and $y_t$ are the state and measurement at time step $t$, respectively, and $\theta$ are the model parameters. For clarity, the presence of an input is not shown; it is, however, allowed.
In this paper, given a sequence of measurements, $y_{1:T} = \{y_1, \dots, y_T\}$, the estimation of the state and model parameters is considered. Using a Bayesian approach, this amounts to obtaining the distribution $p(\theta, x_{0:T} \mid y_{1:T})$, given by
(2)  $p(\theta, x_{0:T} \mid y_{1:T}) = \dfrac{p(y_{1:T} \mid x_{0:T}, \theta)\, p(x_{0:T} \mid \theta)\, p(\theta)}{p(y_{1:T})}$,
where $x_{0:T} = \{x_0, \dots, x_T\}$ and $p(\theta)$ is a prior distribution over the parameters.
From $p(\theta, x_{0:T} \mid y_{1:T})$, distributions of both parameters and states can be obtained via marginalisation. For example, the marginal distribution $p(\theta \mid y_{1:T})$ is found via
(3)  $p(\theta \mid y_{1:T}) = \int p(\theta, x_{0:T} \mid y_{1:T}) \, \mathrm{d}x_{0:T}$.
The difficulty with both (2) and (3) is that they are generally intractable. This is due to both the arbitrary form of the distributions considered and the high-dimensional integrals involved. As such, these distributions must be approximated. Many differing approximations exist; one class of approaches utilises particle-based Monte Carlo methods, an example of which is Hamiltonian Monte Carlo (HMC) [18, 8].
In this paper, an alternative approach, known as variational inference (VI) [10, 2], is followed. The contribution of this paper is to use VI to approximate $p(\theta, x_{0:T} \mid y_{1:T})$. Using a Gaussian assumed density, and a suitable approximation for the integrals, this provides a deterministic method of approximating $p(\theta, x_{0:T} \mid y_{1:T})$ that can be efficiently performed using gradient-based optimisation with exact first- and second-order derivatives.
2 Variational Inference
Variational inference is a widely used method to approximate, potentially intractable, posterior distributions with a parametric density of an assumed form. The numeric values of this parametrisation are obtained by maximising a likelihood lower bound [2]. In this section, the use of VI in approximating $p(\theta, x_{0:T} \mid y_{1:T})$ is examined.
As $p(\theta, x_{0:T} \mid y_{1:T})$ is intractable, it is approximated using an assumed density, parameterised by $\lambda$, and denoted $q_\lambda(\theta, x_{0:T})$. The quality of this approximation is measured by
(4)  $\mathrm{KL}\big(q_\lambda(\theta, x_{0:T}) \,\big\|\, p(\theta, x_{0:T} \mid y_{1:T})\big) = \mathbb{E}_{q_\lambda}\left[ \log \frac{q_\lambda(\theta, x_{0:T})}{p(\theta, x_{0:T} \mid y_{1:T})} \right]$,
the Kullback–Leibler (KL) [13] divergence of $q_\lambda(\theta, x_{0:T})$ from the posterior. As it is desired that $q_\lambda$ is close to $p(\theta, x_{0:T} \mid y_{1:T})$, the aim is to find a $\lambda$, denoted $\lambda^*$, which minimises this KL divergence and is given by
(5)  $\lambda^* = \arg\min_\lambda \mathrm{KL}\big(q_\lambda(\theta, x_{0:T}) \,\big\|\, p(\theta, x_{0:T} \mid y_{1:T})\big)$.
However, as $p(\theta, x_{0:T} \mid y_{1:T})$ is intractable, this KL divergence cannot be directly minimised, or even evaluated. Instead, the concept of VI is used to, equivalently, maximise a likelihood lower bound.
Using conditional probability, this lower bound can be found by expressing the log-likelihood, $\log p(y_{1:T})$, as
(6)  $\log p(y_{1:T}) = \log p(y_{1:T}, \theta, x_{0:T}) - \log p(\theta, x_{0:T} \mid y_{1:T})$.
Through addition and subtraction of $\log q_\lambda(\theta, x_{0:T})$ to the right-hand side of (6), this leads to
(7)  $\log p(y_{1:T}) = \log \frac{p(y_{1:T}, \theta, x_{0:T})}{q_\lambda(\theta, x_{0:T})} + \log \frac{q_\lambda(\theta, x_{0:T})}{p(\theta, x_{0:T} \mid y_{1:T})}$.
As $\log p(y_{1:T})$ is independent of $\theta$ and $x_{0:T}$, the log-likelihood can alternatively be given by
(8)  $\log p(y_{1:T}) = \mathbb{E}_{q_\lambda}\left[ \log p(y_{1:T}) \right]$,
where $\mathbb{E}_{q_\lambda}[\cdot]$ denotes expectation with respect to $q_\lambda(\theta, x_{0:T})$.
Substituting (7) into the right-hand side of (8) arrives at
(9)  $\log p(y_{1:T}) = \mathbb{E}_{q_\lambda}\left[ \log \frac{p(y_{1:T}, \theta, x_{0:T})}{q_\lambda(\theta, x_{0:T})} \right] + \mathbb{E}_{q_\lambda}\left[ \log \frac{q_\lambda(\theta, x_{0:T})}{p(\theta, x_{0:T} \mid y_{1:T})} \right]$,
which will be expressed as
(10)  $\log p(y_{1:T}) = \mathcal{L}(\lambda) + \mathrm{KL}\big(q_\lambda(\theta, x_{0:T}) \,\big\|\, p(\theta, x_{0:T} \mid y_{1:T})\big)$,
where
(11)  $\mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda}\left[ \log p(y_{1:T}, \theta, x_{0:T}) - \log q_\lambda(\theta, x_{0:T}) \right]$.
From (10), and as any KL divergence is non-negative, $\mathcal{L}(\lambda)$ is a lower bound to the log-likelihood $\log p(y_{1:T})$.
Furthermore, from (10) we have
(12)  $\mathrm{KL}\big(q_\lambda(\theta, x_{0:T}) \,\big\|\, p(\theta, x_{0:T} \mid y_{1:T})\big) = \log p(y_{1:T}) - \mathcal{L}(\lambda)$,
and therefore
(13a)  $\lambda^* = \arg\min_\lambda \mathrm{KL}\big(q_\lambda(\theta, x_{0:T}) \,\big\|\, p(\theta, x_{0:T} \mid y_{1:T})\big)$
(13b)  $\phantom{\lambda^*} = \arg\min_\lambda \big( \log p(y_{1:T}) - \mathcal{L}(\lambda) \big)$
(13c)  $\phantom{\lambda^*} = \arg\max_\lambda \mathcal{L}(\lambda)$.
The important property of this is that $\mathcal{L}(\lambda)$ does not involve the intractable distribution $p(\theta, x_{0:T} \mid y_{1:T})$, rather only the assumed distribution $q_\lambda(\theta, x_{0:T})$, which is selected in a tractable form.
The desired approach is therefore to find a value for $\lambda$ that maximises $\mathcal{L}(\lambda)$ via
(14)  $\lambda^* = \arg\max_\lambda \mathcal{L}(\lambda)$
for an assumed density form. This provides an approximation of $p(\theta, x_{0:T} \mid y_{1:T})$ given by $q_{\lambda^*}(\theta, x_{0:T})$, as desired.
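The bound-maximising behaviour above can be checked on a toy conjugate model chosen purely for this illustration (the model, values, and function names below are this example's assumptions, not the paper's): for $\theta \sim \mathcal{N}(0,1)$ and $y \mid \theta \sim \mathcal{N}(\theta,1)$, the log evidence is available in closed form, and a Gaussian $q(\theta) = \mathcal{N}(m, s^2)$ gives a closed-form lower bound that is tight exactly at the true posterior.

```python
import math

def elbo(y, m, s2):
    """ELBO for theta ~ N(0,1), y | theta ~ N(theta,1), q(theta) = N(m, s2).

    E_q[log p(y|theta)] + E_q[log p(theta)] + H(q), all in closed form."""
    e_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((y - m) ** 2 + s2)
    e_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m ** 2 + s2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s2)
    return e_log_lik + e_log_prior + entropy

def log_evidence(y):
    # Marginalising theta gives p(y) = N(y; 0, 2).
    return -0.5 * math.log(2 * math.pi * 2.0) - y ** 2 / 4.0

y = 1.3
# The exact posterior is N(y/2, 1/2); the bound is tight there...
assert abs(elbo(y, y / 2, 0.5) - log_evidence(y)) < 1e-9
# ...and strictly below log p(y) for any other q.
for m, s2 in [(0.0, 1.0), (2.0, 0.1), (-1.0, 3.0)]:
    assert elbo(y, m, s2) < log_evidence(y)
```

Maximising the ELBO over $(m, s^2)$ therefore recovers the exact posterior in this conjugate case; in the nonlinear setting of this paper the bound is instead maximised numerically.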
However, as given, this approach is computationally intractable for any large quantity of time steps. This is because $q_\lambda(\theta, x_{0:T})$ is a full, dense distribution, the dimension of which continues to grow with the number of time steps $T$. Fortunately, by expressing $\mathcal{L}(\lambda)$ as
(15)  $\mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda}\left[ \log p(y_{1:T}, \theta, x_{0:T}) \right] + \mathbb{H}\big(q_\lambda(\theta, x_{0:T})\big)$,
where $\mathbb{H}(\cdot)$ denotes entropy, and utilising the Markovian nature of state-space models (see Appendix A for details) to give
(16)  $\mathbb{E}_{q_\lambda}\left[ \log p(y_{1:T}, \theta, x_{0:T}) \right] = \mathbb{E}_{q_\lambda(\theta, x_0)}\left[ \log p(\theta) + \log p(x_0 \mid \theta) \right] + \sum_{t=0}^{T-1} \mathbb{E}_{q_\lambda(\theta, x_t, x_{t+1})}\left[ \log p(x_{t+1} \mid x_t, \theta) + \log p(y_{t+1} \mid x_{t+1}, \theta) \right]$
and
(17)  $\mathbb{H}\big(q_\lambda(\theta, x_{0:T})\big) = \sum_{t=0}^{T-1} \mathbb{H}\big(q_\lambda(\theta, x_t, x_{t+1})\big) - \sum_{t=1}^{T-1} \mathbb{H}\big(q_\lambda(\theta, x_t)\big)$,
we have
(18)  $\mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda(\theta, x_0)}\left[ \log p(\theta) + \log p(x_0 \mid \theta) \right] + \sum_{t=0}^{T-1} \mathbb{E}_{q_\lambda(\theta, x_t, x_{t+1})}\left[ \log p(x_{t+1} \mid x_t, \theta) + \log p(y_{t+1} \mid x_{t+1}, \theta) \right] + \sum_{t=0}^{T-1} \mathbb{H}\big(q_\lambda(\theta, x_t, x_{t+1})\big) - \sum_{t=1}^{T-1} \mathbb{H}\big(q_\lambda(\theta, x_t)\big)$,
without introducing any approximations.
Therefore, calculation of $\mathcal{L}(\lambda)$ does not require the full distribution $q_\lambda(\theta, x_{0:T})$, rather only each pairwise joint $q_\lambda(\theta, x_t, x_{t+1})$, and as such it remains tractable to calculate for large quantities of time steps.
The focus of this paper correspondingly shifts from the intractable aim of obtaining $q_\lambda(\theta, x_{0:T})$ to the tractable aim of obtaining each $q_\lambda(\theta, x_t, x_{t+1})$.
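The entropy decomposition underpinning this pairwise representation can be verified numerically on a small linear-Gaussian Markov chain (the chain and its coefficients below are arbitrary choices for this illustration, not from the paper): the joint entropy over three states equals the sum of the pairwise entropies minus the overlap entropy.

```python
import numpy as np

def gauss_entropy(cov):
    """Differential entropy of N(mu, cov): 0.5 * log det(2*pi*e*cov)."""
    cov = np.atleast_2d(cov)
    n = cov.shape[0]
    return 0.5 * (n * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

# Linear-Gaussian Markov chain: x_{t+1} = a * x_t + w_t, w_t ~ N(0, q).
a, q, p0 = 0.8, 0.5, 1.0
p1 = a**2 * p0 + q
p2 = a**2 * p1 + q
# Joint covariance of (x0, x1, x2); e.g. cov(x0, x2) = a^2 * p0.
S = np.array([
    [p0,        a * p0,  a**2 * p0],
    [a * p0,    p1,      a * p1],
    [a**2 * p0, a * p1,  p2],
])
h_full = gauss_entropy(S)
h_pairs = gauss_entropy(S[:2, :2]) + gauss_entropy(S[1:, 1:])
h_overlap = gauss_entropy(S[1:2, 1:2])
# H(x0,x1,x2) = H(x0,x1) + H(x1,x2) - H(x1) for a Markov chain.
assert abs(h_full - (h_pairs - h_overlap)) < 1e-10
```

The identity follows from the chain rule of entropy together with the Markov property $H(x_2 \mid x_0, x_1) = H(x_2 \mid x_1)$, which is exactly the structure exploited in (17).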
3 Assumed Gaussian Distribution and Tractable Approximation
In this section, the chosen parametric assumed density and the subsequent formation of a standard-form optimisation problem are examined. When selecting the parametric form of $q_\lambda(\theta, x_t, x_{t+1})$, there are two conflicting goals. Firstly, the parametrisation chosen should be sufficiently flexible to represent $p(\theta, x_t, x_{t+1} \mid y_{1:T})$ well. Secondly, the parametrisation should be chosen to allow the optimisation to be efficiently performed.
In this paper, we have chosen to use a multivariate Gaussian parameterised as
(19)  $q_\lambda(\theta, x_t, x_{t+1}) = \mathcal{N}\left( \begin{bmatrix} \theta \\ x_t \\ x_{t+1} \end{bmatrix}; \mu_t, \Sigma_t \right)$,
where $\mu_t$ and $\Sigma_t$ are the joint mean and covariance for time step $t = 0, \dots, T-1$, and
(20)  $\Sigma_t = A_t^{\mathsf{T}} A_t$,
where $A_t$ are upper triangular. The parameter $\lambda$ is therefore given as
(21)  $\lambda = \{ \mu_0, A_0, \dots, \mu_{T-1}, A_{T-1} \}$
and related to $q_\lambda(\theta, x_t, x_{t+1})$ via
(22)  $q_\lambda(\theta, x_t, x_{t+1}) = \mathcal{N}\left( \begin{bmatrix} \theta \\ x_t \\ x_{t+1} \end{bmatrix}; \mu_t, A_t^{\mathsf{T}} A_t \right)$.
Importantly, parameterising each joint distribution using the Cholesky factor $A_t$ allows the components of $\mathcal{L}(\lambda)$ corresponding to each time step to be easily approximated using Gaussian quadrature as
(23a)  $\mathbb{E}_{q_\lambda(\theta, x_t, x_{t+1})}\left[ g(\theta, x_t, x_{t+1}) \right] \approx \sum_{j} w^j \, g(\sigma_t^j)$,
where $w^j$ is a weight, and the sigma points are denoted $\sigma_t^j$, where $\sigma_t^j$ is given by linear combinations of the joint mean $\mu_t$ and the columns of $A_t^{\mathsf{T}}$. The sigma points $\sigma_t^j$ being linear combinations of the elements of $\lambda$ is important, as it significantly simplifies the calculation of both first- and second-order derivatives used in the optimisation.
Furthermore, due to the Gaussian assumption, $\mathbb{H}\big(q_\lambda(\theta, x_t, x_{t+1})\big)$, and its derivatives, are available in closed form. With a suitable prior, $\mathbb{E}_{q_\lambda(\theta, x_0)}\left[ \log p(\theta) + \log p(x_0 \mid \theta) \right]$ is also available in closed form; alternatively, it can be similarly approximated using Gaussian quadrature.
This allows $\mathcal{L}(\lambda)$ to be approximated by $\hat{\mathcal{L}}(\lambda)$ as
(24)  $\mathcal{L}(\lambda) \approx \hat{\mathcal{L}}(\lambda)$,
where $\hat{\mathcal{L}}(\lambda)$ denotes (18) with each required expectation replaced by its Gaussian quadrature approximation (23a).
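A minimal sigma-point quadrature of this kind can be sketched as follows (this is a standard unscented-transform rule in the spirit of [11, 23]; the function, the `kappa` value, and the weight convention below are this example's assumptions, not the paper's implementation). Note how each sigma point is the mean plus a scaled column of a Cholesky factor, i.e. linear in the parameters:

```python
import numpy as np

def unscented_expectation(g, mu, cov, kappa=1.0):
    """Approximate E[g(x)] for x ~ N(mu, cov) with 2n+1 sigma points.

    Sigma points are linear combinations of the mean and the columns of
    a Cholesky factor of cov, so they are linear in the parameters."""
    n = mu.size
    L = np.linalg.cholesky(cov)          # cov = L @ L.T, L lower triangular
    scale = np.sqrt(n + kappa)
    points = [mu] + [mu + scale * L[:, i] for i in range(n)] \
                  + [mu - scale * L[:, i] for i in range(n)]
    weights = [kappa / (n + kappa)] + [0.5 / (n + kappa)] * (2 * n)
    return sum(w * g(p) for w, p in zip(weights, points))

mu = np.array([1.0, -1.0])
cov = np.array([[2.0, 0.5], [0.5, 1.0]])
# The rule is exact for quadratics: E[x^T x] = tr(cov) + mu^T mu = 3 + 2 = 5.
approx = unscented_expectation(lambda x: x @ x, mu, cov)
assert abs(approx - 5.0) < 1e-10
```

Because the sigma points match the mean and covariance of the Gaussian exactly, expectations of quadratic integrands are computed without error; higher-order rules such as the fifth-order cubature used in Section 4 extend this exactness to higher-degree polynomials.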
This parametrisation is, however, overparameterised, as the marginal distributions $q_\lambda(\theta, x_t)$ for $t = 1, \dots, T-1$ can be calculated from either $q_\lambda(\theta, x_{t-1}, x_t)$ or $q_\lambda(\theta, x_t, x_{t+1})$. Hence, the constraints defined by the feasible set
(25)  $\Lambda = \left\{ \lambda : \int q_\lambda(\theta, x_{t-1}, x_t) \, \mathrm{d}x_{t-1} = \int q_\lambda(\theta, x_t, x_{t+1}) \, \mathrm{d}x_{t+1}, \; t = 1, \dots, T-1 \right\}$,
where both sides are the marginal $q_\lambda(\theta, x_t)$ obtained from the adjacent joints, are required to ensure consistency.
However, as each $A_t$, being a Cholesky factor of a covariance matrix, is nonsingular, the constraint set can be simplified to equality constraints directly on the elements of $\lambda$,
(26)  $c(\lambda) = 0$,
where $c(\lambda)$ collects, for each $t = 1, \dots, T-1$, the difference between the mean and Cholesky factor of the $(\theta, x_t)$ marginal of $q_\lambda(\theta, x_{t-1}, x_t)$ and those of the $(\theta, x_t)$ marginal of $q_\lambda(\theta, x_t, x_{t+1})$.
Using this, an approximation of (14) is given by
(27a)  $\lambda^* = \arg\max_\lambda \hat{\mathcal{L}}(\lambda)$
(27b)  s.t. $\; c(\lambda) = 0$.
This constrained optimisation problem is of a standard form and can be efficiently solved, using exact first- and second-order derivatives and without further approximations, to find a local maximum. The calculation of the required derivatives can be performed using automatic differentiation [7] without manual effort. Due to the parametrisation chosen, this can be efficiently performed using standard tools. For the numerical examples provided, CasADi [1] has been used.
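To make the "standard form" concrete, the toy problem below uses SciPy's SLSQP solver as a stand-in for the CasADi/IPOPT setup used in the paper; the objective, constraint, and numbers are invented for illustration only. It maximises a smooth objective subject to an equality constraint by minimising its negation, supplying an exact gradient as the text advocates:

```python
import numpy as np
from scipy.optimize import minimize

# Maximise L(x) = -(x1 - 1)^2 - (x2 - 2)^2 subject to x1 - x2 = 0,
# posed as minimisation of the negated objective.
def neg_objective(x):
    return (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

def neg_objective_grad(x):
    # Exact first-order derivatives.
    return np.array([2.0 * (x[0] - 1.0), 2.0 * (x[1] - 2.0)])

constraint = {"type": "eq", "fun": lambda x: x[0] - x[1]}
result = minimize(neg_objective, x0=np.zeros(2), jac=neg_objective_grad,
                  method="SLSQP", constraints=[constraint])
# On the constraint line x1 = x2 = t, the minimiser of (t-1)^2 + (t-2)^2
# is t = 1.5, so the constrained maximiser is x = (1.5, 1.5).
assert result.success
assert np.allclose(result.x, [1.5, 1.5], atol=1e-6)
```

The paper's problem (27) has the same shape (smooth objective, smooth equality constraints, exact derivatives) but many more variables, which is why an interior-point solver with second-order information is used there.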
The result of this optimisation is numerical values for each
(28)  $q_{\lambda^*}(\theta, x_t, x_{t+1}) = \mathcal{N}\left( \begin{bmatrix} \theta \\ x_t \\ x_{t+1} \end{bmatrix}; \mu_t^*, (A_t^*)^{\mathsf{T}} A_t^* \right)$,
which approximates each $p(\theta, x_t, x_{t+1} \mid y_{1:T})$. Due to the Gaussian form selected, marginal and conditional distributions of both the parameters and states can be readily extracted as desired.
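Extracting such marginals and conditionals from the optimised Gaussian blocks is routine linear algebra. The helper below is an illustrative sketch with invented dimensions and values (not code from the paper); it conditions a joint Gaussian over a partitioned vector $(a, b)$ on $a$:

```python
import numpy as np

def condition_gaussian(mu, cov, n_a, a_value):
    """Given (a, b) ~ N(mu, cov) with a of dimension n_a, return the
    mean and covariance of b | a = a_value. Marginals are just slices."""
    mu_a, mu_b = mu[:n_a], mu[n_a:]
    S_aa = cov[:n_a, :n_a]
    S_ab = cov[:n_a, n_a:]
    S_bb = cov[n_a:, n_a:]
    gain = np.linalg.solve(S_aa, S_ab).T       # S_ba @ S_aa^{-1}
    cond_mu = mu_b + gain @ (a_value - mu_a)
    cond_cov = S_bb - gain @ S_ab
    return cond_mu, cond_cov

mu = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.6], [0.6, 2.0]])
m, P = condition_gaussian(mu, cov, 1, np.array([2.0]))
# b | a=2: mean = 1 + (0.6/1)*(2 - 0) = 2.2, var = 2 - 0.6^2/1 = 1.64
assert abs(m[0] - 2.2) < 1e-12 and abs(P[0, 0] - 1.64) < 1e-12
```

Applied to (28), the same formulas give, for example, the parameter marginal (a slice of $\mu_t^*$ and $\Sigma_t^*$) or a state distribution conditioned on a parameter value.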
The proposed approach is summarised in Algorithm 1.
4 Numerical Examples
In this section, we provide two numerical examples of the proposed approach and compare the results against HMC. The HMC results have been calculated using Stan [3] as detailed in [8].
4.1 Stochastic Volatility
In this example, estimation of $p(\theta, x_{0:T} \mid y_{1:T})$ for the following stochastic volatility model:
(29a)  $x_{t+1} = \phi x_t + v_t, \quad v_t \sim \mathcal{N}(0, \sigma_v^2)$,
(29b)  $y_t = \beta \exp\left( \tfrac{x_t}{2} \right) e_t, \quad e_t \sim \mathcal{N}(0, 1)$,
where $\theta = [\phi, \sigma_v, \beta]$, is considered using 726 simulated measurements.
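A short simulation of a stochastic volatility model of this general form can be written as follows; the parameterisation and parameter values below are illustrative assumptions for this sketch, not the values used to generate the paper's data:

```python
import numpy as np

def simulate_sv(T, phi, sigma_v, beta, seed=0):
    """Simulate x_{t+1} = phi * x_t + v_t with v_t ~ N(0, sigma_v^2),
    and y_t = beta * exp(x_t / 2) * e_t with e_t ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T + 1)   # states x_0, ..., x_T
    y = np.zeros(T)       # measurements y_1, ..., y_T
    for t in range(T):
        y[t] = beta * np.exp(x[t] / 2.0) * rng.standard_normal()
        x[t + 1] = phi * x[t] + sigma_v * rng.standard_normal()
    return x, y

x, y = simulate_sv(T=726, phi=0.97, sigma_v=0.15, beta=0.7)
assert x.shape == (727,) and y.shape == (726,)
```

The latent log-volatility $x_t$ enters the measurement only through its exponential scale, which is what makes the posterior non-Gaussian and this model a useful stress test for the Gaussian assumed density.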
For the proposed method, fifth-order cubature [9] is used to perform the Gaussian quadrature approximations, and the optimisation routine is run until convergence to a locally optimal point. The HMC results are obtained with half of the iterations discarded as warm-up.
In Figure 1, the marginal parameter distributions using both the proposed approach and HMC are shown. While the marginal distributions for two of the parameters closely match, the VI-based method is overconfident in the distribution of the third parameter and cannot closely match the non-Gaussian relationships between the parameters.
Despite this, the state distributions obtained using both methods closely match. Figure 2 highlights this for three differing time steps, which are representative of the results for other time steps.
In this example, the proposed method is demonstrated on a nonlinear and non-Gaussian example and is, generally, representative of the results obtained using HMC at a lower computational cost. This, however, comes with a trade-off: when the underlying distribution is non-Gaussian, the Gaussian assumption of the proposed method may produce overconfident results, as shown for one of the parameter distributions.
4.2 Inverted Pendulum
In this section, the proposed method is applied to a rotational inverted pendulum, or Furuta pendulum [6], using real data collected from a Quanser QUBE-Servo 2. The estimation of a four-dimensional state vector, six parameters, and a seven-dimensional covariance matrix associated with coupled additive Gaussian noise of the process and measurement models is considered using 375 measurements. For further details of the model, refer to Appendix B. For the proposed approach, a third-order unscented transform [11, 23] is used to perform the Gaussian quadrature required. A point estimate for this covariance matrix is also considered; this highlights the flexibility of the proposed approach, which allows for distributions of subsets of parameters alongside point estimates of others. In this example, we also exploit the additive form of the state-space model, which allows a Hessian to be efficiently approximated using only Jacobians. Identification problems for systems with additive Gaussian noise commonly use this approach [12, 20]. While the calculations for this example are not fully detailed in Section 3, they represent minor modifications of the resultant optimisation problem.
For the HMC results, distributions are obtained for all states and parameters. As the HMC results provide a distribution for each element of the noise covariance matrix, while the proposed method considers a point estimate, the results of the proposed method and HMC are expected to differ slightly.
In Figure 3, the parameter distributions using both the proposed approach and HMC are shown. This illustrates that, despite the approximations introduced, the proposed method has delivered results close to HMC. This has been achieved with a significant computational reduction: the proposed method required a few minutes to compute, while the HMC results required a few hours.
5 Conclusion
In this paper, a variational approach to estimating state and parameter distributions is presented for general nonlinear state-space models. The proposed method results in an optimisation problem of a standard form for which, due to the parametrisation selected, derivatives are readily available. While the proposed method assumes a Gaussian density, and is therefore limited compared to more general HMC approaches, the numerical examples illustrate that approximations suitable for many purposes are obtained on both simulated and real data. Compared to state-of-the-art HMC approaches, this is achieved at a significantly lower computational cost.
Future research directions for this work include blending the proposed approach with sequential Monte Carlo (SMC) algorithms along the lines of what has been done in e.g. [14, 17]. The motivation for this is to combine the computational efficiency of the proposed method and the theoretical guarantees of SMC to overcome the limitations imposed by the Gaussian assumed densities.
References
 [1] Joel A. E. Andersson, Joris Gillis, Greg Horn, James B. Rawlings, and Moritz Diehl. CasADi: a software framework for nonlinear optimization and optimal control. Mathematical Programming Computation, 11(1):1–36, July 2018.
 [2] D. Blei, A. Kucukelbir, and J. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
 [3] Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A Probabilistic Programming Language. Journal of Statistical Software, 76(1), 2017.

 [4] S. B. Chitralekha, J. Prakash, H. Raghavan, R. B. Gopaluni, and S. L. Shah. Comparison of Expectation-Maximization based parameter estimation using Particle Filter, Unscented and Extended Kalman Filtering techniques. IFAC Proceedings Volumes, 42(10):804–809, 2009.
 [5] Jarrad Courts, Adrian Wills, and Thomas B. Schön. Variational Nonlinear State Estimation. arXiv, 2020.
 [6] K. Furuta, M. Yamakita, and S. Kobayashi. Swing-up Control of Inverted Pendulum Using Pseudo-State Feedback. Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, 206(4):263–269, November 1992.
 [7] Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, January 2008.
 [8] Johannes Hendriks, Adrian Wills, Brett Ninness, and Johan Dahlin. Practical Bayesian System Identification using Hamiltonian Monte Carlo. arXiv, 2020.
 [9] Bin Jia, Ming Xin, and Yang Cheng. High-degree cubature Kalman filter. Automatica, 49(2):510–518, February 2013.
 [10] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2):183–233, November 1999.
 [11] Simon J. Julier and Jeffrey K. Uhlmann. New extension of the Kalman filter to nonlinear systems. In Ivan Kadar, editor, Signal Processing, Sensor Fusion, and Target Recognition VI. SPIE, July 1997.
 [12] Juho Kokkala, Arno Solin, and Simo Särkkä. Sigma-Point Filtering and Smoothing Based Parameter Estimation in Nonlinear Dynamic Systems. Journal of Advances in Information Fusion, 2015.
 [13] S. Kullback and R. A. Leibler. On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, March 1951.
 [14] Fredrik Lindsten, Jouni Helske, and Matti Vihola. Graphical model inference: Sequential Monte Carlo meets deterministic approximations. In Advances in Neural Information Processing Systems 31, pages 8190–8200. 2018.
 [15] L. Ljung. System Identification: Theory for the User. Prentice Hall information and system sciences series. Prentice Hall PTR, 1999.
 [16] Lennart Ljung. Perspectives on system identification. Annual Reviews in Control, 34(1):1–12, April 2010.
 [17] Christian A. Naesseth, Fredrik Lindsten, and David M. Blei. Markovian Score Climbing: Variational Inference with KL(p||q). In Advances in Neural Information Processing Systems 33. 2020.

 [18] Radford M. Neal. Handbook of Markov Chain Monte Carlo, chapter MCMC using Hamiltonian dynamics, pages 113–162. Chapman and Hall/CRC, 2011.
 [19] Brett Ninness. Some System Identification Challenges and Approaches. IFAC Proceedings Volumes, 42(10):1–20, 2009.
 [20] Simo Särkkä. Bayesian Filtering and Smoothing. Institute of Mathematical Statistics Textbooks. Cambridge University Press, 2013.
 [21] Thomas B. Schön, Adrian Wills, and Brett Ninness. System identification of nonlinear statespace models. Automatica, 47(1):39 – 49, 2011.
 [22] Michail D. Vrettas, Yuan Shen, and Dan Cornford. Derivations of Variational Gaussian Process Approximation Framework. Technical report, Aston University, March 2008.
 [23] E. A. Wan and R. Van Der Merwe. The unscented Kalman filter for nonlinear estimation. In Proceedings of the IEEE Adaptive Systems for Signal Processing, Communications, and Control Symposium. IEEE, October 2000.
Appendix A Decomposition of $\mathcal{L}(\lambda)$
In this section, the decomposition of $\mathcal{L}(\lambda)$ required to express $q_\lambda(\theta, x_{0:T})$ as pairwise joint distributions, rather than as a full joint distribution, is presented. This utilises the Markovian nature of state-space models in a fashion similar to related system identification approaches such as [21, 4, 20, 22, 5].
Firstly, $\mathcal{L}(\lambda)$ is expressed in the form of two integrals as
$\mathcal{L}(\lambda) = \int q_\lambda(\theta, x_{0:T}) \log p(y_{1:T}, \theta, x_{0:T}) \, \mathrm{d}\theta \, \mathrm{d}x_{0:T} - \int q_\lambda(\theta, x_{0:T}) \log q_\lambda(\theta, x_{0:T}) \, \mathrm{d}\theta \, \mathrm{d}x_{0:T}$.
The first integral is decomposed according to
$\log p(y_{1:T}, \theta, x_{0:T}) = \log p(\theta) + \log p(x_0 \mid \theta) + \sum_{t=0}^{T-1} \left[ \log p(x_{t+1} \mid x_t, \theta) + \log p(y_{t+1} \mid x_{t+1}, \theta) \right]$,
so that each term requires only the pairwise joint $q_\lambda(\theta, x_t, x_{t+1})$. While the second integral is decomposed according to
$q_\lambda(\theta, x_{0:T}) = q_\lambda(\theta, x_0) \prod_{t=0}^{T-1} q_\lambda(x_{t+1} \mid \theta, x_{0:t})$.
As
$q_\lambda(x_{t+1} \mid \theta, x_{0:t}) = q_\lambda(x_{t+1} \mid \theta, x_t)$,
this gives
$q_\lambda(\theta, x_{0:T}) = \dfrac{\prod_{t=0}^{T-1} q_\lambda(\theta, x_t, x_{t+1})}{\prod_{t=1}^{T-1} q_\lambda(\theta, x_t)}$,
and therefore,
$\mathbb{H}\big(q_\lambda(\theta, x_{0:T})\big) = \sum_{t=0}^{T-1} \mathbb{H}\big(q_\lambda(\theta, x_t, x_{t+1})\big) - \sum_{t=1}^{T-1} \mathbb{H}\big(q_\lambda(\theta, x_t)\big)$.
Together, this allows $\mathcal{L}(\lambda)$ to be calculated without requiring the full distribution $q_\lambda(\theta, x_{0:T})$, but rather only each $q_\lambda(\theta, x_t, x_{t+1})$, using
$\mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda(\theta, x_0)}\left[ \log p(\theta) + \log p(x_0 \mid \theta) \right] + \sum_{t=0}^{T-1} \mathbb{E}_{q_\lambda(\theta, x_t, x_{t+1})}\left[ \log p(x_{t+1} \mid x_t, \theta) + \log p(y_{t+1} \mid x_{t+1}, \theta) \right] + \sum_{t=0}^{T-1} \mathbb{H}\big(q_\lambda(\theta, x_t, x_{t+1})\big) - \sum_{t=1}^{T-1} \mathbb{H}\big(q_\lambda(\theta, x_t)\big)$.
Appendix B Inverted Pendulum Model
The state vector used to model the Furuta pendulum consists of the arm and pendulum angles and their respective angular velocities.
The continuous-time dynamics are given by the standard rigid-body equations of motion for the Furuta pendulum [6]. The model parameters are the pendulum mass, the rod and pendulum lengths, the rod and pendulum inertias, the motor resistance, the motor constant, and the damping coefficients for the rod and pendulum; the applied motor voltage acts as the input.
The process model used consists of a two-step Euler discretisation of these continuous-time dynamics over each sampling interval, subsequently disturbed by additive Gaussian process noise.
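The two-step Euler scheme simply halves the step size within each sampling interval. The sketch below is generic (the dynamics function and step size are placeholders for illustration, not the pendulum model):

```python
import math

def euler_two_step(f, x, u, dt):
    """Advance state x over one sample interval dt using two Euler
    steps of length dt/2 with the input u held constant."""
    h = dt / 2.0
    x_mid = [xi + h * fi for xi, fi in zip(x, f(x, u))]
    return [xi + h * fi for xi, fi in zip(x_mid, f(x_mid, u))]

# Example: scalar linear dynamics dx/dt = -x, exact solution e^{-dt} * x0.
f = lambda x, u: [-x[0]]
x_next = euler_two_step(f, [1.0], 0.0, 0.1)
# Two half-steps give (1 - 0.05)^2 = 0.9025, vs exp(-0.1) ~ 0.9048.
assert abs(x_next[0] - 0.9025) < 1e-12
```

Taking two smaller steps reduces the local discretisation error relative to a single Euler step over the full interval, at the cost of one extra evaluation of the dynamics per sample.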
Measurements are obtained from encoders on the arm and pendulum angles and from a current measurement on the motor. The resultant measurement model maps the state to these three measured quantities and is assumed to be disturbed by additive Gaussian measurement noise.