 # A flexible state space model for learning nonlinear dynamical systems

We consider a nonlinear state-space model with the state transition and observation functions expressed as basis function expansions. The coefficients in the basis function expansions are learned from data. Using a connection to Gaussian processes we also develop priors on the coefficients, for tuning the model flexibility and to prevent overfitting to data, akin to a Gaussian process state-space model. The priors can alternatively be seen as a regularization, and helps the model in generalizing the data without sacrificing the richness offered by the basis function expansion. To learn the coefficients and other unknown parameters efficiently, we tailor an algorithm using state-of-the-art sequential Monte Carlo methods, which comes with theoretical guarantees on the learning. Our approach indicates promising results when evaluated on a classical benchmark as well as real data.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Nonlinear system identification (Ljung, 1999, 2010; Sjöberg et al., 1995) aims to learn nonlinear mathematical models from data generated by a dynamical system. We will tackle the problem of learning nonlinear state-space models with only weak assumptions on the nonlinear functions, and make use of the Bayesian framework (Peterka, 1981) to encode prior knowledge and assumptions to guide the otherwise too flexible model.

Consider the (time invariant) state-space model

 xt+1 =f(xt,ut)+vt, vt∼N(0,Q), (1a) yt =g(xt,ut)+et, et∼N(0,R). (1b)

The variables are denoted as the state111 and are iid with respect to , and is thus Markov. , which is not observed explicitly, the input , and the output . We will learn the state transition function and the observation function as well as and from a set of training data of input-output signals . Figure 1: The Gaussian process as a modeling tool for an one-dimensional function f:R↦R. The prior distribution (upper left plot) is represented by the shaded blue color (the more intense color, the higher density), as well as 5 samples drawn from it. By combining the prior and the data (upper right plot), the posterior (lower plot) is obtained. The posterior mean basically interpolates between the data points, and adheres to the prior in regions where the data is not providing any information. This is clearly a desirable property when it comes to generalizing from the training data—consider the thought experiment of using a 2nd order polynomial instead. Further, the posterior also provides a quantification of the uncertainty present, high in data-scarce regions and low where the data provides knowledge about f(⋅).

Consider a situation when a finite-dimensional linear, or other sparsely parameterized model, is too rigid to describe the behavior of interest, but only a limited data record is available so that any too flexible model would overfit (and be of no help in generalizing to events not exactly seen in the training data). In such a situation, a systematic way to encode prior assumptions and thereby tuning the flexibility of the model can be useful. For this purpose, we will take inspiration from Gaussian processes (GPs, Rasmussen and Williams 2006) as a way to encode prior assumptions on and . As illustrated by Figure 1

, the GP is a distribution over functions which gives a probabilistic model for inter- and extrapolating from observed data. GPs have successfully been used in system identification for, e.g., response estimation, nonlinear ARX models and GP state-space models

(Pillonetto and De Nicolao, 2010; Kocijan, 2016; Frigola-Alcade, 2015).

To parameterize , we expand it using basis functions

 f(x)=m∑j=0w(j)ϕ(j)(x), (2)

and similarly for . The set of basis functions is denoted by , whose coefficients will be learned from data. By introducing certain priors on the basis function coefficients the connection to GPs will be made, based on a Karhunen-Loève expansion (Solin and Särkkä, 2014). We will thus be able to understand our model in terms of the well-established and intuitively appealing GP model, but still benefit from the computational advantages of the linear-in-parameter structure of (2). Intuitively, the idea of the priors is to keep ‘small unless data convinces otherwise’, or equivalently, introduce a regularization of .

To learn the model (1), i.e., determine the basis function coefficients , we tailor a learning algorithm using recent sequential Monte Carlo/particle filter methods (Schön et al., 2015; Kantas et al., 2015). The learning algorithm infers the posterior distribution of the unknown parameters from data, and come with theoretical guarantees. We will pay extra attention to the problem of finding the maximum mode of the posterior, or equivalent, regularized maximum likelihood estimation.

Our contribution is the development of a flexible nonlinear state-space model with a tailored learning algorithm, which together constitutes a new nonlinear system identification tool. The model can either be understood as a GP state-space model (generalized allowing for discontinuities, Section 3.2.3), or as a nonlinear state-space model with a regularized basis function expansion.

## 2 Related work

Important work using the GP in system identification includes impulse response estimation (Pillonetto and De Nicolao, 2010; Pillonetto et al., 2011; Chen et al., 2012), nonlinear ARX models (Kocijan et al., 2005; Bijl et al., 2016), Bayesian learning of ODEs (Calderhead et al., 2008; Wang and Barber, 2014; Macdonald et al., 2015) and the latent force model (Alvarez et al., 2013). In the GP state-space model (Frigola-Alcade, 2015) the transition function in a state-space model is learned with a GP prior, particularly relevant to this paper. A conceptually interesting contribution to the GP state-space model was made by Frigola et al. (2013), using a Monte Carlo approach (similar to this paper) for learning. The practical use of Frigola et al. (2013) is however very limited, due to its extreme computational burden. This calls for approximations, and a promising approach is presented by Frigola et al. (2014) (and somewhat generalized by Mattos et al. (2016)), using inducing points and a variational inference scheme. Another competitive approach is Svensson et al. (2016), where we applied the GP approximation proposed by Solin and Särkkä (2014) and used a Monte Carlo approach for learning (Frigola-Alcade (2015) covers the variational learning using the same GP approximation). In this paper, we extend this work by considering basis function expansions in general (not necessarily with a GP interpretation), introduce an approach to model discontinuities in , as well as including both a Bayesian and a maximum likelihood estimation approach to learning.

To the best of our knowledge, the first extensive paper on the use of a basis function expansion inside a state-space model was written by Ghahramani and Roweis (1998), who also wrote a longer unpublished version (Roweis and Ghahramani, 2000). The recent work by Tobar et al. (2015) resembles that of Ghahramani and Roweis (1998) on the modeling side, as they both use basis functions with locally concentrated mass spread in the state space. On the learning side, Ghahramani and Roweis (1998)

use an expectation maximization (EM,

Dempster et al. 1977

) procedure with extended Kalman filtering, whilst

Tobar et al. (2015) use particle Metropolis-Hastings (Andrieu et al., 2010). There are basically three major differences between Tobar et al. (2015) and our work. We will (i) use another (related) learning method, particle Gibbs, allowing us to take advantage of the linear-in-parameter structure of the model to increase the efficiency. Further, we will (ii) mainly focus on a different set of basis functions (although our learning procedure will be applicable also to the model used by Tobar et al. (2015)), and – perhaps most important – (iii) we will pursue a systematic encoding of prior assumptions further than Tobar et al. (2015), who instead assume

to be known and use ‘standard sparsification criteria from kernel adaptive filtering’ as a heuristic approach to regularization.

There are also connections to Paduart et al. (2010), who use a polynomial basis inside a state-space model. In contrast to our work, however, Paduart et al. (2010) prevent the model from overfitting to the training data not by regularization, but by manually choosing a low enough polynomial order and terminating the learning procedure prematurely (early stopping). Paduart et al. are, in contrast to us, focused on the frequency properties of the model and rely on optimization tools. An interesting contribution by Paduart et al. is to first use classical methods to find a linear model, which is then used to initialize the linear term in the polynomial expansion. We suggest to also use this idea, either to initialize the learning algorithm, or use the nonlinear model only to describe deviations from an initial linear state-space model.

Furthermore, there are also connections to our previous work (Svensson et al., 2015), a short paper only outlining the idea of learning a regularized basis function expansion inside a state-space model. Compared to Svensson et al. (2015), this work contains several extensions and new results. Another recent work using a regularized basis function expansion for nonlinear system identification is that of Delgado et al. (2015), however not in the state-space model framework. Delgado et al. (2015) use rank constrained optimization, resembling an -regularization. To achieve a good performance with such a regularization, the system which generated the data has to be well described by only a few number of the basis functions being ‘active’, i.e., have non-zero coefficients, which makes the choice of basis functions important and problem-dependent. The recent work by Mattsson et al. (2016) is also covering learning of a regularized basis function expansion, however for input-output type of models.

## 3 Constructing the model

We want the model, whose parameters will be learned from data, to be able to describe a broad class of nonlinear dynamical behaviors without overfitting to training data. To achieve this, important building blocks will be the basis function expansion (2) and a GP-inspired prior. The order of the state-space model (1) is assumed known or set by the user, and we have to learn the transition and observation functions and from data, as well as the noise covariance matrices and . For brevity, we focus on and , but the reasoning extends analogously to and .

### 3.1 Basis function expansion

The common approaches in the literature on black-box modeling of functions inside state-space models can broadly be divided into three groups: neural networks

(Bishop, 2006; Narendra and Li, 1996; Nørgård et al., 2000), basis function expansions (Sjöberg et al., 1995; Ghahramani and Roweis, 1998; Paduart et al., 2010; Tobar et al., 2015) and GPs (Rasmussen and Williams, 2006; Frigola-Alcade, 2015). We will make use of a basis function expansion inspired by the GP. There are several reasons for this: Firstly, a basis function expansion provides an expression which is linear in its parameters, leading to a computational advantage: neural networks do not exhibit this property, and the naïve use of the nonparametric GP is computationally very expensive. Secondly, GPs and some choices of basis functions allow for a straightforward way of including prior assumptions on and help generalization from the training data, also in contrast to the neural network.

We write the combination of the state-space model (1) and the basis function expansion (2) as

 xt+1 (3a) yt =⎡⎢ ⎢ ⎢ ⎢⎣w(1)g,1⋯w(m)g,1⋮⋮w(1)g,ny⋯w(m)g,ny⎤⎥ ⎥ ⎥ ⎥⎦C⎡⎢ ⎢ ⎢ ⎢⎣ϕ(1)g1(xt,ut)⋮ϕ(m)gnx(xt,ut)⎤⎥ ⎥ ⎥ ⎥⎦¯φg(xt,ut)+et. (3b)

There are several alternatives for the basis functions, e.g., polynomials (Paduart et al., 2010), the Fourier basis (Svensson et al., 2015), wavelets (Sjöberg et al., 1995), Gaussian kernels (Ghahramani and Roweis, 1998; Tobar et al., 2015) and piecewise constant functions. For the one-dimensional case (e.g., , ) on the interval , we will choose the basis functions as

 ϕ(j)(x)=1√Lsin(πj(x+L)2L). (4)

This choice, which is the eigenfunctions to the Laplace operator, enables a particularly convenient connection to the GP framework

(Solin and Särkkä, 2014) in the priors we will introduce in Section 3.2.1. This choice is, however, important only for the interpretability222Other choices of basis functions are also interpretable as GPs. The choice (4) is, however, preferred since it is independent of the choice of which GP covariance function to use. of the model. The learning algorithm will be applicable to any choice of basis functions.

#### 3.1.1 Higher state-space dimensions

The generalization to models with a state space and input dimension such that offers no conceptual challenges, but potentially computational ones. The counterpart to the basis function (4) for the space
is

 ϕ(j1,…,jnx+nu)(x)=nx+nu∏k=11√Lksin(πjk(xk+Lk)2Lk), (5)

(where is the th component of ), implying that the number of terms grows exponentially with . This problem is inherent in most choices of basis function expansions. For , the problem of learning can be understood as learning number of functions , cf. (3).

There are some options available to overcome the exponential growth with , at the cost of a limited capability of the model. Alternative 1 is to assume to be ‘separable’ between some dimensions, e.g., . If this assumption is made for all dimensions, the total number of parameters present grows quadratically (instead of exponentially) with . Alternative 2

is to use a radial basis function expansion

(Sjöberg et al., 1995), i.e., letting only be a function of some norm of , as . The radial basis functions give a total number of parameters growing linearly with . Both alternatives will indeed limit the space of functions possible to describe with the basis function expansion. However, as a pragmatic solution to the otherwise exponential growth in the number of parameters it might still be worth considering, depending on the particular problem at hand.

#### 3.1.2 Manual and data-driven truncation

To implement the model in practice, the number of basis functions has to be fixed to a finite value, i.e., truncated. However, fixing also imposes a harsh restriction on which functions that can be described. Such a restriction can prevent overfitting to training data, an argument used by Paduart et al. (2010) for using polynomials only up to 3rd order. We suggest, on the contrary, to use priors on to prevent overfitting, and we argue that the interpretation as a GP is a preferred way to tune the model flexibility, rather than manually and carefully tuning the truncation. We therefore suggest to choose as big as the computational resources allows, and let the prior and data decide which to be nonzero, a data-driven truncation.

Related to this is the choice of in (4): if is chosen too small, the state space becomes limited and thereby also limits the expressiveness of the model. On the other hand, if is too big, an unnecessarily large might also be needed, wasting computational power. To chose to have about the same size as the maximum of or seems to be a good guideline.

### 3.2 Encoding prior assumptions—regularization

The basis function expansion (3) provides a very flexible model. A prior might therefore be needed to generalize from, instead of overfit to, training data. From a user perspective, the prior assumptions should ultimately be formulated in terms of the input-output behavior, such as gains, rise times, oscillations, equilibria, limit cycles, stability etc. As of today, tools for encoding such priors are (to the best of the authors’ knowledge) not available. As a resort, we therefore use the GP state-space model approach, where we instead encode prior assumptions on as a GP. Formulating prior assumptions on is relevant in a model where the state space bears (partial) physical meaning, and it is natural to make assumptions whether the state is likely to rapidly change (non-smooth ), or state equilibria are known, etc. However, also the truly black-box case offers some interpretations: a very smooth corresponds to a locally close-to-linear model, and vice versa for a more curvy

, and a zero-mean low variance prior on

will steer the model towards a bounded output (if is bounded).

To make a connection between the GP and the basis function expansion, a Karhunen-Loève expansion is explored by Solin and Särkkä (2014). We use this to formulate Gaussian priors on the basis function expansion coefficients , and learning of the model will amount to infer the posterior , where is the prior and the likelihood. To use a prior and inferring the maximum mode of the posterior can equivalently be interpreted as regularized maximum likelihood estimation

 argminw(j) −logp(y1:T|w(j))+α|w(j)|2. (6)

#### 3.2.1 Smooth GP-priors for the functions

The Gaussian process provides a framework for formulating prior assumptions on functions, resulting in a non-parametric approach for regression. In many situations the GP allows for an intuitive generalization of the training data, as illustrated by Figure 1. We use the notation

 f(x)∼GP(m(x),κ(x,x′)) (7)

to denote a GP prior on , where is the mean function and the covariance function. The work by Solin and Särkkä (2014) provides an explicit link between basis function expansions and GPs based on the Karhunen-Loève expansion, in the case of isotropic333Note, this concerns only , which resides inside the state-space model. This does not restrict the input-output behavior, from to , to have an isotropic covariance. covariance functions, i.e., . In particular, if the basis functions are chosen as (4), then

 f(x)∼GP(0,κ(x,x′))⇔f(x)≈m∑j=0w(j)ϕ(j)(x), (8a) with444The approximate equality in (8a) is exact if m→∞ and L→∞, refer to Solin and Särkkä (2014) for details. w(j)∼N(0,S(λ(j))), (8b)

where is the spectral density of , and

is the eigenvalue of

. Thus, this gives a systematic guidance on how to choose basis functions and priors on . In particular, the eigenvalues of the basis function (4) are

 λ(j)=(πj2L)2, and λ(j1:nx+nu)=nx+nu∑k=1(πjk2Lk)2 (9)

for (5). Two common types of covariance functions are the exponentiated quadratic and Matérn class (Rasmussen and Williams, 2006),

 κeq(r) =sfexp(−r22l2), (10a) κM(r) =sf21−νΓ(ν)(√2νrl)νKν(√2νrl), (10b)

where , is a modified Bessel function, and and

are hyperparameters to be set by the user or to be marginalized out, see

Svensson et al. (2016) for details. Their spectral densities are

 Seq(s) =sf√2πl2exp(−π2l2s22), (11a) SM(s) =sf2π12Γ(ν+12)(2ν)νΓ(ν)l2ν(2νl2+s2)−(ν+12). (11b)

Altogether, by choosing the priors for as (8b), it is possible to approximately interpret , parameterized by the basis function expansion (2), as a GP. For most covariance functions, the spectral density tends towards when , meaning that the prior for large tends towards a Dirac mass at . Returning to the discussion on truncation (Section 3.1.2), we realize that truncation of the basis function expansion with a reasonably large therefore has no major impact to the model, but the GP interpretation is still relevant.

As discussed, finding the posterior mode under a Gaussian prior is equivalent to -regularized maximum likelihood estimation. There is no fundamental limitation prohibiting other priors, for example Laplacian (corresponding to -regularization: Tibshirani 1996). We use the Gaussian prior because of the connection to a GP prior on , and it will also allow for closed form expressions in the learning algorithm.

For book-keeping, we express the prior on as a Matrix normal (, Dawid 1981) distribution over . The distribution is parameterized by a mean matrix , a right covariance and a left covariance . The distribution can be defined by the property that if and only if , where is the Kronecker product. Its density can be written as

 (12)

By letting and be a diagonal matrix with entries , the priors (8b) are incorporated into this parametrization. We will let for conjugacy properties, to be detailed later. Indeed, the marginal variance of the elements in is then not scaled only by , but also . That scaling however is constant along the rows, and so is the scaling by the hyperparameter (10). We therefore suggest to simply use as tuning for the overall influence of the priors; letting gives a flat prior, or, a non-regularized basis function expansion.

#### 3.2.2 Prior for noise covariances

Apart from , the noise covariance matrix might also be unknown. We formulate the prior over as an inverse Wishart (, Dawid 1981) distribution. The

distribution is a distribution over real-valued positive definite matrices, which puts prior mass on all positive definite matrices and is parametrized by its number of degrees of freedom

and an positive definite scale matrix . The density is defined as

 IW(Q|ℓ,Λ)=|Λ|ℓ/2|Q|−(nx+ℓ+1)/22ℓnx/2Γnx(ℓ/2)exp(−12tr{Q−1Λ}), (13)

where is the multivariate gamma function. The mode of the distribution is . It is a common choice as a prior for covariance matrices due to its properties (e.g., Wills et al. 2012; Shah et al. 2014). When the distribution (12) is combined with the distribution (13) we obtain the distribution, with the following hierarchical structure

 MNIW(A,Q|M,V,Λ,ℓ)=MN(A|M,Q,V)IW(Q|ℓ,Λ). (14)

The distribution provides a joint prior for the and

matrices, compactly parameterizing the prior scheme we have discussed, and is also the conjugate prior for our model, which will facilitate learning. Figure 2: The idea of a piecewise GP: the interval [−2,−2] is divided by np=2 discontinuity points p1 and p2, and a GP is used to model a function on each of these segments, independently of the other segments. For practical use, the learning algorithm have to be able to also infer the discontinuity points from data.

#### 3.2.3 Discontinuous functions: Sparse singularities

The proposed choice of basis functions and priors is encoding a smoothness assumption of . However, as discussed by Juditsky et al. (1995) and motivated by Example 5.3, there are situations where it is relevant to assume that is smooth except at a few points. Instead of assuming an (approximate) GP prior for on the entire interval we therefore suggest to divide into a number of segments, and then assume an individual GP prior for each segment , independent of all other segments, as illustrated in Figure 2. The number of segments and the discontinuity points dividing them need to be learned from data, and an important prior is how the discontinuity points are distributed, i.e., the number

(e.g., geometrically distributed) and their locations

(e.g., uniformly distributed).

### 3.3 Model summary

We will now summarize the proposed model. To avoid notational clutter, we omit as well as the observation function (1b):

 xt+1 =np∑i=0Ai¯φ(xt)1pi≤xt

where is the indicator function parameterizing the piecewise GP, and was defined in (3). If the dynamical behavior of the data is close-to-linear, and a fairly accurate linear model is already available, this can be incorporated by adding the known linear function to the right hand side of (15a).

A good user practice is to sample parameters from the priors and simulate the model with those parameters, as a sanity check before entering the learning phase. Such a habit can also be fruitful for understanding what the prior assumptions mean in terms of dynamical behavior. There are standard routines for sampling from the as well as the distribution.

The suggested model can also be tailored if more prior knowledge is present, such as a physical relationship between two certain state variables. The suggested model can then be used to learn only the unknown part, as briefly illustrated by Svensson et al. (2015, Example IV.B).

## 4 Learning

We now have a state-space model with a (potentially large) number of unknown parameters

 θ≜{{Ai,Qi}npi=0,np,{pi}npi=1}, (16)

all with priors. ( is still assumed to be known, but the extension follows analogously.) Learning the parameters is a quite general problem, and several learning strategies proposed in the literature are (partially) applicable, including optimization (Paduart et al., 2010), EM with extended Kalman filtering (Ghahramani and Roweis, 1998) or sigma point filters (Kokkala et al., 2016), and particle Metropolis-Hastings (Tobar et al., 2015). We use another sequential Monte Carlo-based learning strategy, namely particle Gibbs with ancestor sampling (PGAS, Lindsten et al. 2014). PGAS allows us to take advantage of the fact that our proposed model (3) is linear in (given ), at the same time as it has desirable theoretical properties.

### 4.1 Sequential Monte Carlo for system identification

Sequential Monte Carlo (SMC) methods have emerged as a tool for learning parameters in state-space models (Schön et al., 2015; Kantas et al., 2015). At the very core when using SMC for system identification is the particle filter (Doucet and Johansen, 2011), which provides a numerical solution to the state filtering problem, i.e., finding . The particle filter propagates a set of weighted samples, particles, in the state-space model, approximating the filtering density by the empirical distribution for each . Algorithmically, it amounts to iteratively weighting the particles with respect to the measurement , resample among them, and thereafter propagate the resampled particles to the next time step . The convergence properties of this scheme have been studied extensively (see references in Doucet and Johansen (2011)).

When using SMC methods for learning parameters, a key idea is to repeatedly infer the unknown states with a particle filter, and interleave this iteration with inference of the unknown parameters , as follows:

 I. Use SMC to infer the states x1:T for given parameters θ. (17) II. Update the parameters θ to fit the states x1:T from the previous step.

There are several details left to specify in this iteration, and we will pursue two approaches for updating : one sample-based for exploring the full posterior , and one EM-based for finding the maximum mode of the posterior, or equivalently, a regularized maximum likelihood estimate. Both alternatives will utilize the linear-in-parameter structure of the model (15), and use the Markov kernel PGAS (Lindsten et al., 2014) to handle the states in Step I of (17).

The PGAS Markov kernel resembles a standard particle filter, but has one of its state-space trajectories fixed. It is outlined by Algorithm 1, and is a procedure to asymptotically produce samples from

, if repeated iteratively in a Markov chain Monte Carlo (MCMC,

Robert and Casella 2004) fashion.

### 4.2 Parameter posterior

The learning problem will be split into the iterative procedure (17). In this section, the focus is on a key to Step II of (17), namely the conditional distribution of given states and measurements . By utilizing the Markovian structure of the state-space model, the density can be written as the product

 p(x1:T,y1:T|θ)=p(x1)T−1∏t=1p(xt+1|xt,θ)p(yt|xt)=p(x1)T−1∏t=1p(xt+1|xt,θ)p(x1:T|θ)T∏t=1p(yt|xt)p(y1:T|x1:T). (18)

Since we assume that the observation function (1b) is known, is independent of , which in turn means that (18) is proportional to . Further, we assume for now that is also known, and therefore omit it. Let us consider the case without discontinuity points, . Since is assumed to be Gaussian, , we can with some algebraic manipulations (Gibson and Ninness, 2005) write

 logp(x1:T|A,Q)=−Tnx2log(2π)−T2logdet(Q)−12tr{Q−1(Φ−AΨT−ΨAT+AΣAT)}, (19)

with the (sufficient) statistics

 Φ=T∑t=1xt+1xTt+1,Ψ=T∑t=1xt+1¯φ(xt,ut)T,Σ=T∑t=1¯φ(xt,ut)¯φ(xt,ut)T. (20a)

The density (19) gives via Bayes’ rule and the prior distribution for from Section 3

 logp(A,Q)=logp(A|Q)+logp(Q)∝−12(nx+ℓ+m+1)logdet(Q)−12tr{Q−1(Λ+AV−1AT)}, (21)

the posterior

 logp(A,Q|x1:t)∝logp(x1:t|A,Q)+logp(A,Q)∝−12(nx+T+ℓ+m+1)logdetQ−12tr{Q−1(Λ+Φ−Ψ(Σ+V−1)−1ΨT+(A−Ψ(Σ+V−1)−1)Q−1(A−Ψ(Σ+V−1)−1)T)}. (22)

This expression will be key for learning: For the fully Bayesian case, we will recognize (22) as another distribution and sample from it, whereas we will maximize it when seeking a point estimate.

Remarks: The expressions needed for an unknown observation function are completely analogous. The case with discontinuity points becomes essentially the same, but with individual and statistics for each segment. If the right hand side of (15a) also contains a known function , e.g., if the proposed model is used only to describe deviations from a known linear model, this can easily be taken care of by noting that now , and thus compute the statistics (20) for instead of .

### 4.3 Inferring the posterior—Bayesian learning

There is no closed form expression for , the distribution to infer in the Bayesian learning. We thus resort to a numerical approximation by drawing samples from using MCMC. (Alternative, variational methods could be used, akin to Frigola et al. (2014)). MCMC amounts to constructing a procedure for ‘walking around’ in -space in such a way that the steps eventually, for large enough, become samples from the distribution of interest.

Let us start in the case without discontinuity points, i.e., . Since (21) is , and (19

) is a product of (multivariate) Gaussian distributions, (

22) is also an distribution (Wills et al., 2012; Dawid, 1981). By identifying components in (22), we conclude that

 p(θ|x1:T,y1:T)=MNIW(A,Q|Ψ(Σ+V−1)−1,(Σ+V−1)−1,Λ+Φ−Ψ(Σ+V−1)−1ΨT,ℓ+Tnx) (23)

We now have (23) for sampling given the states (cf. (17), step II), and Algorithm 1 for sampling the states given the model (cf. (17), step I). This makes a particle Gibbs sampler (Andrieu et al., 2010), cf. (17).

If there are discontinuity points to learn, i.e., is to be learned, we can do that by acknowledging the hierarchical structure of the model. For brevity, we denote by , and simply by . We suggest to first sample from , and next sample from . The distribution for sampling is the distribution (23), but conditional on data only in the relevant segment. The other distribution, , is trickier to sample from. We suggest to use a Metropolis-within-Gibbs step (Müller, 1991), which means that we first sample from a proposal (e.g., a random walk), and then accept it as

with probability

, and otherwise just set . Thus we need to evaluate . The prior is chosen by the user. The density can be evaluated using the expression (see Appendix A.1)

 p(x1:T|ξ)=np∏i=02nxTi/2(2π)Ti/2Γnx(l+N2)Γnx(l2)|V−1|nx/2|Σi+V−1|nx/2×|Λ|l/2|Λ+Φi+Ψi(Σi+V−1)−1ΨTi|l+N2 (24)

where etc. denotes the statistics (20) restricted to the corresponding segment, and is the number of data points in segment (). The suggested Bayesian learning procedure is summarized in Algorithm 2.

Our proposed algorithm can be seen as a combination of a collapsed Gibbs sampler and Metropolis-within-Gibbs, a combination which requires some attention to be correct (van Dyk and Jiao, 2014), see Appendix A.2 for details in our case. If the hyperparameters parameterizing and/or the initial states are unknown, it can be included by extending Algorithm 2 with extra Metropolis-within-Gibbs steps (see Svensson et al. (2016) for details).

### 4.4 Regularized maximum likelihood

A widely used alternative to Bayesian learning is to find a point estimate of maximizing the likelihood of the training data , i.e., maximum likelihood. However, if a very flexible model is used, some kind of mechanism is needed to prevent the model from overfit to training data. We will therefore use the priors from Section 3 as regularization for the maximum likelihood estimation, which can also be understood as seeking the maximum mode of the posterior. We will only treat the case with no discontinuity points, as the case with discontinuity points does not allow for closed form maximization, but requires numerical optimization tools, and we therefore suggest Bayesian learning for that case instead.

The learning will build on the particle stochastic approximation EM (PSAEM) method proposed by Lindsten (2013), which uses a stochastic approximation of the EM scheme (Dempster et al., 1977; Delyon et al., 1999; Kuhn and Lavielle, 2004). EM addresses maximum likelihood estimation in problems with latent variables. For system identification, EM can be applied by taking the states as the latent variables, (Ghahramani and Roweis (1998); another alternative would be to take the noise sequence as the latent variables, Umenberger et al. (2015)).

 The EM algorithm then amounts to iteratively (cf. (17)) computing the expectation (E-step) Q(θ,θ[k])=Ex1:T[logp(θ|x1:T,y1:T)|y1:T,θ[k]], (25a) and updating θ in the maximization (M-step) by solving θ[k+1]=argmaxθQ(θ,θ[k]), (25b)

In the standard formulation, is usually computed with respect to the joint likelihood density for and . To incorporate the prior (our regularization), we may consider the prior as an additional observation of , and we have thus replaced (19) by (22) in . Following Gibson and Ninness (2005), the solution in the M-step is found as follows: Since is positive definite, the quadratic form in (22) is maximized by

 A=Φ(Σ+V−1). (26a) Next, substituting this into (22), the maximizing Q is Q=1nx+Tnx+ℓ+m+1(Λ+Φ−Ψ(Σ+V−1)−1Ψ). (26b)

We thus have solved the M-step exactly. To compute the expectation in the E-step, approximations are needed. For this, a particle smoother (Lindsten and Schön, 2013) could be used, which would give a learning strategy in the flavor of Schön et al. (2011). The computational load of a particle smoother is, however, unfavorable, and PSAEM uses Algorithm 1 instead.

PSAEM also replaces and replace the -function (25a) with a Robbins-Monro stochastic approximation of ,

 Qk(θ)=(1−γk)Qk−1(θ)+γklogp(θ|x1:T[k],y1:T), (27)

where is a decreasing sequence of positive step sizes, with , and . I.e., should be chosen such that holds up to proportionality, and the choice has been suggested in the literature (Delyon et al., 1999, Section 5.1). Here, is a sample from an ergodic Markov kernel with as its invariant distribution, i.e., Algorithm 1. At a first glance, the complexity of appears to grow with because of its iterative definition. However, since belongs to the exponential family, we can write

 p(x1:T[k],y1:T|θ)=h(x1:T[k],y1:T)c(θ)exp(ηT(θ)t[k]), (28)

where is the statistics (20) of . The stochastic approximation  (27) thus becomes

 Qk(θ)∝logp(θ)+logc(θ)+ηT(θ)(γkt[k]+(1−γk)γk-1t[k−1]+…). (29)

Now, we note that if keeping track of the statistics , the complexity of does not grow with . We therefore introduce the following iterative update of the statistics

 Φk =(1−γk)Φk−1+γkΦ(x1:T[k]), (30a) Ψk =(1−γk)Ψk−1+γkΨ(x1:T[k]), (30b) Σk =(1−γk)Σk−1+γkΣ(x1:T[k]), (30c)

where refers to (20), etc. With this parametrization, we obtain as the solutions for the vanilla EM case by just replacing by , etc., in (26). Algorithm 3 summarizes.

### 4.5 Convergence and consistency

We have proposed two algorithms for learning the model introduced in Section 3. The Bayesian learning, Algorithm 2, will by construction (as detailed in Appendix A.2) asymptotically provide samples from the true posterior density (Andrieu et al., 2010). However, no guarantees regarding the length of the burn-in period can be given, which is the case for all MCMC methods, but the numerical comparisons in Svensson et al. (2016) and in Section 5.1 suggest that the proposed Gibbs scheme is efficient compared to its state-of-the-art alternatives. The regularized maximum likelihood learning, Algorithm 3, can be shown to converge under additional assumptions (Lindsten, 2013; Kuhn and Lavielle, 2004) to a stationary point of , however not necessarily a global maximum. The literature on PSAEM is not (yet) very rich, and the technical details regarding the additional assumptions remains to be settled, but we have not experienced any problems of non-convergence in practice.

### 4.6 Initialization

The convergence of Algorithm 2 is not relying on the initialization, but the burn-in period can nevertheless be reduced. One useful idea by Paduart et al. (2010) is thus to start with a linear model, which can be obtained using classical methods. To avoid Algorithm 3 from converging to a poor local minimum, Algorithm 2 can first be run to explore the ‘landscape’ and from that, a promising point for initialization of Algorithm 3 can be chosen.

For convenience, we assumed the distribution of the initial states, , to be known. This is perhaps not realistic, but its influence is minor in many cases. If needed, they can be included in Algorithm 2 by an additional Metropolis-within-Gibbs step, and in Algorithm 3 by including them in (22) and use numerical optimization tools. (a) Maximum likelihood estimation of our proposed model, without regularization; a useless model.

## 5 Experiments

We will give three numerical examples: a toy example, a classic benchmark, and thereafter a real data set from two cascaded water tanks. Matlab code for all examples is available via the first authors homepage.

### 5.1 A first toy example

Consider the following example from Tobar et al. (2015),

 xt+1 =10sinc(xt7)+vt, vt∼N(0,4), (31a) yt =xt+et, et∼N(0,4). (31b)

We generate observations, and the challenge is to learn , when and the noise variances are known. Note that even though is known, is still corrupted by a non-negligible amount of noise.

In Figure 3 (a) we illustrate the performance of our proposed model using basis functions on the form (4) when Algorithm 3 is used without regularization. This gives a nonsense result that is overfitted to data, since offers too much flexibility for this example. When a GP-inspired prior from an exponentiated quadratic covariance function (10a) with length scale and is considered, we obtain (b), that is far more useful and follows the true function rather well in regions were data is present. We conclude that we do not need to choose carefully, but can rely on the priors for regularization. In (c), we use the same prior and explore the full posterior by Algorithm 2

, obtaining information about uncertainty as a part of the learned model (illustrated by the a posteriori credibility interval), in particular in regions where no data is present.

In the next figure, (d), we replace the set of basis functions on the form (4) with Gaussian kernels to reconstruct the model proposed by Tobar et al. (2015). As clarified by Tobar (2016), the prior on the coefficients is a Gaussian distribution inspired by a GP, which makes a close connection to out work. We use Algorithm 2 for learning also in (d) (which is possible thanks to the Gaussian prior). In (e), on the contrary, the learning algorithm from Tobar et al. (2015), Metropolis-Hastings, is used, requiring more computation time. Tobar et al. (2015) spend a considerable effort to pre-process the data and carefully distribute the Gaussian kernels in the state space, see the bottom of (d).

### 5.2 Narendra-Li benchmark

The example introduced by Narendra and Li (1996) has become a benchmark for nonlinear system identification, e.g., The MathWorks, Inc. 2015; Pan et al. 2009; Roll et al. 2005;