## 1 Introduction

System identification of dynamical systems is a topic of great interest. In particular, system identification of linear dynamical systems has been studied extensively during the last six decades [ljung2010perspectives, pintelon2012system, sjoberg1995nonlinear], and this has led to the development of several useful toolboxes suited to this task [ninness2013unit, ljung1988system].

The identification problem for nonlinear dynamical systems is still a challenge. The presence of nonlinear transformations on the hidden variables leads to significant difficulty in directly applying conventional methods such as maximum likelihood (ML) estimation.

Previous work on ML-based estimation includes [ghahramani1999learning, caines1974maximum, schon2011system, lindsten2013efficient]. One notable solution to the problem of estimating parameters using ML-based estimation involves the use of Monte Carlo expectation maximization (MCEM) [wei1990monte] in combination with a particle smoother (PS) algorithm [schon2011system]. This method is offline: data are first collected and the unknown parameters are then estimated. Furthermore, it caters to a general class of discrete-time, nonlinear dynamical systems. For succinctness, we abbreviate this method as PSEM. Although offline sampling-based methods like PSEM are quite versatile, they are computationally expensive. The solution in [schon2011system] requires the number of particles to increase with the number of iterations of the optimization algorithm in order to attain convergence. To tackle this issue, another offline ML-based estimation algorithm was proposed in [lindsten2013efficient]. It combines Markovian stochastic approximation expectation maximization (SAEM) with a conditional particle filter (CPF) with ancestral sampling. The CPF-SAEM method reduces the estimation time, but it is not necessarily more accurate than PSEM.

Another line of work uses linear prediction error methods (PEM) for stochastic nonlinear models [abdalmoaty2019linear, abdalmoaty2020identification]. This approach relies on the statistics of the model, involves relatively inexpensive computations (since it does not require exact computation of the likelihood), and is nevertheless highly competitive with sequential Monte Carlo (SMC) based methods similar to [lindsten2013efficient]. A downside is that such estimators are statistically asymptotically inefficient.

An alternative approach to solving the estimation problem is to use a Bayesian framework, where practitioners make use of prior knowledge to improve the estimates [ninness2010bayesian]. This approach can help in obtaining an optimal estimate, e.g. using the conditional mean estimator (CME). However, the computations involve normalization constants that require solving analytically intractable integrals. A way to avoid this predicament is to use numerical approximation methods such as Markov chain Monte Carlo [ninness2010bayesian, kantas2015particle, andrieu2004particle].

The above methods are undoubtedly quite interesting; however, they have certain limitations. To elucidate, recall that an estimator is a map (function) from $\mathbb{R}^N$, where $N$ is the dimension of the data vector, to $\mathbb{R}^d$, where $d$ is the dimension of the parameter vector. The prevailing paradigm in parameter estimation is to construct this map only at the point in $\mathbb{R}^N$ which corresponds to the data vector for which the parameter estimate is desired. This is done at the time of estimation, making the computation time for training (the time it takes to construct the estimator) and the computation time for inference (the time it takes to map the given data vector to the parameter estimate) the same. Methods such as PSEM, CPF-SAEM and CME require the real data during the learning process; training time and inference time for such methods are the same. Moreover, such methods are susceptible to poor initialization and can get stuck in a poor local optimum. CME also requires careful tuning of the proposal densities. This motivates the search for an alternative framework that can provide estimates competitive with Bayesian methods like CME while also enabling faster inference.

In this paper, we make a fundamental change of perspective by separating the construction of the estimator from the inference step. By constructing the entire estimator beforehand, the inference step reduces to evaluating the estimator at the point in $\mathbb{R}^N$ corresponding to the data vector of interest, making this step very simple and computationally fast. The construction of the estimator is done in a sampling-based, Bayesian fashion. It involves using an extensive number of synthetic data sets generated by simulating models drawn from the model set of interest. For the estimator itself, we use a recurrent neural network (RNN) [rumelhart1986learning], leading to what one could call a deep Bayesian estimator. We will refer to this class of estimators as DeepBayes estimators.

The use of neural networks in system identification is not new. Feedforward neural networks have been used to construct predictors, i.e. to infer the relationship between past input-output data and predict future outputs given a new input [sjoberg1994neural]. For output prediction, there is also recent work in which deep feedforward neural networks are used [zancato2021novel]. Instead of output prediction, we address a different problem: estimating the parameters of a nonlinear dynamical system using deep RNNs.

### 1.1 Contributions

In this work, we tackle the problem of learning a DeepBayes estimator for parameter estimation in stochastic, nonlinear dynamical models by leveraging the modeling capability of deep RNNs. The proposed methodology is general and applies to any dynamical model that can be simulated. Here, we focus on discrete-time state-space models and construct parameter estimators for discrete-time dynamical models that can be simulated. Our main contributions are as follows: We propose the DeepBayes estimation procedure, where we employ deep RNNs for learning estimators. Using a mean squared error loss function, we demonstrate the performance of our proposed method and compare with state-of-the-art ML-based methods

[lindsten2013efficient, schon2011system] and the conditional mean estimator (CME) that uses the Metropolis-Hastings (MH) algorithm [hastings1970monte, chib1995understanding, robert1999metropolis]. We illustrate the advantages of our proposed approach through simulation experiments using nonlinear state-space models and on a real-world, nonlinear benchmark problem.

### 1.2 Paper outline

The paper is organized as follows: In Section 2, we briefly describe the problem that we wish to solve. In Section 3, a quick overview of deep RNNs is provided to set the stage for describing the architectures used. In Section 4, we introduce the DeepBayes estimators and provide theoretical motivation. Section 5 describes our experimental setup, training strategy, hyperparameter tuning and the results obtained using the DeepBayes approach. We also describe the implementation of the CME using MH and compare with the CPF-SAEM and PSEM methods. Finally, in Section 6, we provide concluding remarks related to our method and describe possible directions of future work.

### 1.3 Notations

In this paper, we use bold lowercase symbols to denote vectors and regular lowercase symbols to denote scalars, e.g. $\mathbf{x}$ represents a vector while $x_i$ represents the $i$-th component of $\mathbf{x}$. Uppercase symbols in bold font represent matrices. The symbol $\mathcal{M}$ is used to denote a model under consideration. The notation $x_{1:N}$ is used to compactly define a sequence of signal values, i.e. $x_{1:N}$ denotes the sequence $(x_1, \ldots, x_N)$ with the $t$-th signal value given by $x_t$. The symbol $\mathbb{E}$ represents the expectation operator, $\mathcal{L}$ denotes the Laplace transform, and $\odot$ denotes element-wise multiplication. Uniform and Gaussian distributions are represented by $\mathcal{U}$ and $\mathcal{N}$, respectively.

## 2 Problem formulation

We assume that we are provided with a data set of inputs and outputs of a dynamical system, consisting of $N$ values of the output signal $y_t$ and the input signal $u_t$. We assume that the model of interest has the following general structure

$$x_{t+1} = f_{\boldsymbol{\theta}}(x_t, u_t, w_t), \qquad y_t = g_{\boldsymbol{\theta}}(x_t, u_t, v_t), \qquad (1)$$

where $\boldsymbol{\theta}$ is an unknown parameter vector with a prior distribution $p(\boldsymbol{\theta})$, $x_t$ denotes a latent variable, and $w_t$ and $v_t$ are unobserved noise processes whose distributions have known forms and may be parameterized by $\boldsymbol{\theta}$, e.g. noise variances may be included in $\boldsymbol{\theta}$. The problem is to learn an estimator and obtain an estimate of the unknown parameter vector as

$$\hat{\boldsymbol{\theta}} = \hat{f}^{\mathcal{M}}_{\boldsymbol{\eta}}\big(y_{1:N}; \boldsymbol{\alpha}\big), \qquad (2)$$

where $y_{1:N}$ is the output signal of length $N$ given in the data set. The superscript indicates the dependence on the model set $\mathcal{M}$, and the subscript $\boldsymbol{\eta}$ indicates the set of free parameters in the estimator that are learned when the estimator is constructed, e.g. weights and biases of the deep RNN in use. The parameter $\boldsymbol{\alpha}$ indicates the tunable hyperparameters, such as the learning rate, the number of epochs, etc. However, it is worthwhile to note that our estimation approach could also be used with function approximators other than RNNs. Another important fact is that in (2) we have not included the input signal, since we consider the case where the input signal is fixed a priori, as is often the case in the process industry [ljung2010perspectives, zhu2001multivariable], or not present at all, as in time series forecasting [hewamalage2021recurrent].

## 3 Deep recurrent neural networks

We will employ RNNs having the structure shown in Fig. 1. Here, the input to the RNN at time step $t$ is $x_t$, where for generality we assume that $x_t$ can be vector valued, and $o_t$ is the output of the RNN.

The forward propagation in the case of a single RNN unit can be represented by the following equations

$$h_t = \tanh\big(\mathbf{W}_{hh}\, h_{t-1} + \mathbf{W}_{xh}\, x_t + \mathbf{b}_h\big), \qquad o_t = \sigma\big(\mathbf{W}_{ho}\, h_t + \mathbf{b}_o\big), \qquad (3)$$

where $o_t$ is the output at the $t$-th time step, $h_t$ is the hidden state of the RNN at the $t$-th time step, $\tanh$ denotes the hyperbolic tangent function and $\sigma$ denotes a nonlinear activation function such as softmax or rectified linear unit (ReLU) [goodfellow2016deep, Chap. 6]. The weight matrices $\mathbf{W}_{hh}, \mathbf{W}_{xh}, \mathbf{W}_{ho}$ and bias vectors $\mathbf{b}_h, \mathbf{b}_o$ are learnable parameters of the RNN and are shared across time steps when the RNN is unfolded, as shown in Fig. 1.

For classification tasks, a loss function such as cross-entropy is used for training, while for regression tasks one often uses the mean squared error. The parameters in (3) are optimized using backpropagation through time (BPTT) [rumelhart1986learning]. RNNs can also be stacked on top of one another to constitute a deep RNN [pascanu2014construct].

RNNs are quite versatile and support several processing modes: one-to-one, one-to-many, many-to-one and many-to-many [goodfellow2016deep, Chap. 10]. In many-to-one mode, RNNs are used for summarization of sequential data, e.g. in applications such as word sentiment analysis [tang2015document]. The RNN processes the entire sequence and then nonlinearly projects the hidden state at the final time step $h_N$ to the desired output space as

$$o = \sigma_o\big(\mathbf{W}_{o}\, h_N + \mathbf{b}_{o}\big),$$

where the matrix $\mathbf{W}_{o}$ and bias $\mathbf{b}_{o}$ are learned by the algorithm and $\sigma_o$ is an element-wise nonlinearity such as the ReLU function [goodfellow2016deep].

LSTM and GRU:

One notable issue with BPTT is the vanishing gradient problem in the computation of the gradients. RNN training algorithms may become inefficient owing to either exploding or vanishing gradients. Long short-term memory (LSTM) units try to solve this problem by imposing gates which control the information flow [hochreiterLongShortTermMemory1997]. Gated recurrent units (GRUs) provide some simplification over conventional LSTM units and have a smaller number of gates [choPropertiesNeuralMachine2014]. We refer the interested reader to [chen2016gentle, karpathy2015visualizing] for reviews and visualization insights concerning these architectures. In the proposed DeepBayes method, we use the LSTM and GRU architectures in many-to-one mode, as described in Section 5.
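A minimal PyTorch sketch of a many-to-one GRU estimator of the kind used in DeepBayes follows; the layer sizes and the dense head structure here are illustrative placeholders (Section 5 reports the tuned values):

```python
import torch
import torch.nn as nn

class GRUEstimator(nn.Module):
    """Many-to-one GRU: reads an output signal y_{1:N} and maps the
    hidden state at the final time step to a parameter estimate."""

    def __init__(self, n_input, n_hidden, n_layers, n_dense, n_params):
        super().__init__()
        self.gru = nn.GRU(n_input, n_hidden, n_layers, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(n_hidden, n_dense),
            nn.ReLU(),
            nn.Linear(n_dense, n_params),
        )

    def forward(self, y):                  # y: (batch, N, n_input)
        out, _ = self.gru(y)               # out: (batch, N, n_hidden)
        return self.head(out[:, -1, :])    # use only the final time step
```

Swapping `nn.GRU` for `nn.LSTM` yields the DeepBayes-LSTM variant with the same many-to-one read-out.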

## 4 Proposed approach

We assume that the user may a priori design an input signal to excite the system for identification purposes. The objective is then to learn an estimator for a given model structure and the chosen input signal. Given a model as in (1), we require a supervised learning setup to train the RNN in the proposed DeepBayes method. This requires a training data set.

Our idea for creating such a training data set is to first generate a large number of output signals using parameters selected from a set in the parameter space. We achieve this by first sampling $P$ parameter vectors $\boldsymbol{\theta}_p$, $p = 1, \ldots, P$, from the prior distribution $p(\boldsymbol{\theta})$. For each sampled $\boldsymbol{\theta}_p$, we then generate $M$ noise realizations. Lastly, we use these to generate output signals $y_{1:N}^{(p,m)}$, $m = 1, \ldots, M$, according to (1). In total, this gives $PM$ output signals, which constitute our synthetic training set

$$\mathcal{D}_{\text{syn}} = \left\{ \big(\boldsymbol{\theta}_p, \, y_{1:N}^{(p,m)}\big) : p = 1, \ldots, P, \; m = 1, \ldots, M \right\}.$$

Our goal is to learn an estimator by fitting an RNN to the synthetic data set $\mathcal{D}_{\text{syn}}$. This involves tuning the hyperparameters and learning the unknown parameters of the RNN. We do so by minimizing the mean squared error (MSE) loss function between the parameter estimates given by the estimator and the sampled parameter vectors. Thus, we compute

$$\hat{\boldsymbol{\eta}} = \arg\min_{\boldsymbol{\eta}} \frac{1}{PM} \sum_{p=1}^{P} \sum_{m=1}^{M} \left\| \boldsymbol{\theta}_p - \hat{f}^{\mathcal{M}}_{\boldsymbol{\eta}}\big(y_{1:N}^{(p,m)}\big) \right\|^2, \qquad (4)$$

where the empirical loss depends on $P$, the number of parameter realizations used, and $M$, the number of output signals generated per parameter realization. A schematic illustration of this method is given in Fig. 2.
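The data-generation and loss-evaluation steps above can be sketched as follows. Here `sample_prior` and `simulate` are placeholder callbacks standing in for the prior and the simulation of model (1); they are supplied by the user for the model class at hand:

```python
import numpy as np

def make_training_set(sample_prior, simulate, P, M, rng):
    """Build the synthetic training set: P parameter vectors drawn
    from the prior, M simulated output signals per parameter vector,
    giving P*M (signal, parameter) pairs in total."""
    data = []
    for _ in range(P):
        theta = sample_prior(rng)
        for _ in range(M):
            data.append((simulate(theta, rng), theta))  # fresh noise each call
    return data

def mse_loss(estimator, data):
    """Empirical MSE between estimates and sampled parameters, cf. (4)."""
    return np.mean([np.sum((theta - estimator(y)) ** 2) for y, theta in data])
```

Minimizing `mse_loss` over the estimator's free parameters (e.g. by gradient descent on an RNN) is exactly the learning problem (4).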

### 4.1 Theoretical motivation

The learning problem is defined in (4). If the function $\hat{f}^{\mathcal{M}}_{\boldsymbol{\eta}}$ is flexible enough, i.e. the number of hidden layers and the number of units in each layer are sufficiently large, and the numbers of samples $P$ and $M$ are sufficiently large, the obtained estimator will approximately be the Bayes estimator corresponding to the MSE loss function. To see this, let us denote the conditional distribution of the model outputs by $p(y_{1:N} \mid \boldsymbol{\theta})$. Then, under mild conditions on the moments of $p(y_{1:N} \mid \boldsymbol{\theta})$ and $p(\boldsymbol{\theta})$, we have

$$\frac{1}{PM} \sum_{p=1}^{P} \sum_{m=1}^{M} \left\| \boldsymbol{\theta}_p - \hat{f}^{\mathcal{M}}_{\boldsymbol{\eta}}\big(y_{1:N}^{(p,m)}\big) \right\|^2 \;\to\; \mathbb{E}\left[ \left\| \boldsymbol{\theta} - \hat{f}^{\mathcal{M}}_{\boldsymbol{\eta}}(y_{1:N}) \right\|^2 \right] \qquad (5)$$

as $P, M \to \infty$, where the expectation is with respect to the joint distribution $p(\boldsymbol{\theta})\, p(y_{1:N} \mid \boldsymbol{\theta})$. The minimizer of the right-hand side of (5) is the conditional mean estimator of $\boldsymbol{\theta}$, namely

$$\hat{f}^{*}(y_{1:N}) = \mathbb{E}\left[ \boldsymbol{\theta} \mid y_{1:N} \right]. \qquad (6)$$

If this function can be realized by the chosen RNN, then for $P$ and $M$ chosen large enough, the DeepBayes estimator will perform close to the conditional mean estimator defined in (6).

### 4.2 An example of our estimation procedure: Linear Toy Model

Suppose that $\mathcal{M}$ is a set of finite impulse response (FIR) models of order $n$, given as

$$y(t) = \sum_{k=1}^{n} \theta_k\, u(t-k) + e(t), \qquad (7)$$

where $e(t) \sim \mathcal{N}(0, \sigma^2)$ (independent over $t$) is unobserved measurement noise, and $\boldsymbol{\theta} = [\theta_1, \ldots, \theta_n]^\top$. Then, for $N$ observations, we can write the input signal in matrix form as:

$$\mathbf{U} = \begin{bmatrix} u(0) & u(-1) & \cdots & u(1-n) \\ u(1) & u(0) & \cdots & u(2-n) \\ \vdots & \vdots & \ddots & \vdots \\ u(N-1) & u(N-2) & \cdots & u(N-n) \end{bmatrix}. \qquad (8)$$

Using (8), we can write (7) as

$$\mathbf{y} = \mathbf{U}\,\boldsymbol{\theta} + \mathbf{e}, \qquad (9)$$

where $\mathbf{y} = [y(1), \ldots, y(N)]^\top$ and $\mathbf{e} = [e(1), \ldots, e(N)]^\top$. We assume that the unknown parameter vector $\boldsymbol{\theta}$ is a random vector drawn from the prior distribution $p(\boldsymbol{\theta})$. The simulated output signals are then given by

$$\mathbf{y}^{(p,m)} = \mathbf{U}\,\boldsymbol{\theta}_p + \mathbf{e}^{(p,m)}, \qquad (10)$$

where $p = 1, \ldots, P$ and $m = 1, \ldots, M$ denote indices for the realizations of $\boldsymbol{\theta}$ and the noise sequence, respectively. To gain some insight into our proposed procedure, let us constrain the estimator function to the set of linear maps and the prior to be zero-mean Gaussian. In this case, the estimator function can be represented by a matrix $\mathbf{A}$ to be learned by solving the optimization problem

$$\hat{\mathbf{A}} = \arg\min_{\mathbf{A}} \frac{1}{PM} \sum_{p=1}^{P} \sum_{m=1}^{M} \left\| \boldsymbol{\theta}_p - \mathbf{A}\,\mathbf{y}^{(p,m)} \right\|^2.$$

Now observe that this is a quadratic problem; using the vectorization operator, we can rewrite it as

$$\hat{\mathbf{a}} = \arg\min_{\mathbf{a}} \frac{1}{PM} \sum_{p=1}^{P} \sum_{m=1}^{M} \left\| \boldsymbol{\theta}_p - \big( (\mathbf{y}^{(p,m)})^\top \otimes \mathbf{I} \big)\, \mathbf{a} \right\|^2,$$

where $\otimes$ denotes the Kronecker product and $\mathbf{a} = \mathrm{vec}(\mathbf{A})$ denotes the vectorization of $\mathbf{A}$. Let

$$\boldsymbol{\Phi}^{(p,m)} = (\mathbf{y}^{(p,m)})^\top \otimes \mathbf{I} \quad \text{and} \quad \mathbf{R} = \frac{1}{PM} \sum_{p=1}^{P} \sum_{m=1}^{M} \big(\boldsymbol{\Phi}^{(p,m)}\big)^\top \boldsymbol{\Phi}^{(p,m)}.$$

The problem is then a quadratic form in $\mathbf{a}$, which has the closed-form solution

$$\hat{\mathbf{a}} = \mathbf{R}^{-1} \left( \frac{1}{PM} \sum_{p=1}^{P} \sum_{m=1}^{M} \big(\boldsymbol{\Phi}^{(p,m)}\big)^\top \boldsymbol{\theta}_p \right).$$

It then follows that

$$\hat{\mathbf{A}} = \left( \frac{1}{PM} \sum_{p,m} \boldsymbol{\theta}_p\, (\mathbf{y}^{(p,m)})^\top \right) \left( \frac{1}{PM} \sum_{p,m} \mathbf{y}^{(p,m)} (\mathbf{y}^{(p,m)})^\top \right)^{-1},$$

and for a test data set $\mathbf{y}$ the estimator is given by

$$\hat{\boldsymbol{\theta}} = \hat{\mathbf{A}}\,\mathbf{y}. \qquad (11)$$

In fact, due to the independence of the sampled realizations, a direct application of the law of large numbers yields

$$\hat{\mathbf{A}} \to \boldsymbol{\Sigma}_{\boldsymbol{\theta}}\, \mathbf{U}^\top \mathbf{Q}^{-1}$$

as $P, M \to \infty$, where

$$\mathbf{Q} = \mathbb{E}\big[ \mathbf{y}\,\mathbf{y}^\top \big] = \mathbf{U}\,\boldsymbol{\Sigma}_{\boldsymbol{\theta}}\,\mathbf{U}^\top + \sigma^2 \mathbf{I}, \qquad (12)$$

$\boldsymbol{\Sigma}_{\boldsymbol{\theta}}$ is the prior covariance of $\boldsymbol{\theta}$, and $\sigma^2$ is the variance of the noise process $e(t)$. Substituting in (11), we get

$$\hat{\boldsymbol{\theta}} = \boldsymbol{\Sigma}_{\boldsymbol{\theta}}\, \mathbf{U}^\top \big( \mathbf{U}\,\boldsymbol{\Sigma}_{\boldsymbol{\theta}}\,\mathbf{U}^\top + \sigma^2 \mathbf{I} \big)^{-1} \mathbf{y}.$$

In case the prior mean of $\boldsymbol{\theta}$ is different from zero, the required estimator has the affine form $\hat{\boldsymbol{\theta}} = \mathbf{A}\,\mathbf{y} + \mathbf{b}$. Analogous to (11), the estimator is then

$$\hat{\boldsymbol{\theta}} = \hat{\mathbf{A}}\,\mathbf{y} + \hat{\mathbf{b}}. \qquad (13)$$

It can be shown that, as $P, M \to \infty$,

$$\hat{\mathbf{A}} \to \boldsymbol{\Sigma}_{\boldsymbol{\theta}}\, \mathbf{U}^\top \bar{\mathbf{Q}}^{-1}, \qquad \hat{\mathbf{b}} \to \boldsymbol{\mu}_{\boldsymbol{\theta}} - \boldsymbol{\Sigma}_{\boldsymbol{\theta}}\, \mathbf{U}^\top \bar{\mathbf{Q}}^{-1} \mathbf{U}\,\boldsymbol{\mu}_{\boldsymbol{\theta}}, \qquad (14)$$

where, similar to (12), $\bar{\mathbf{Q}} = \mathbf{U}\,\boldsymbol{\Sigma}_{\boldsymbol{\theta}}\,\mathbf{U}^\top + \sigma^2 \mathbf{I}$ is the covariance of $\mathbf{y}$, and $\boldsymbol{\mu}_{\boldsymbol{\theta}}$ denotes the prior mean of $\boldsymbol{\theta}$. The required asymptotic estimator using (14) is obtained as:

$$\hat{\boldsymbol{\theta}} = \boldsymbol{\mu}_{\boldsymbol{\theta}} + \boldsymbol{\Sigma}_{\boldsymbol{\theta}}\, \mathbf{U}^\top \bar{\mathbf{Q}}^{-1} \big( \mathbf{y} - \mathbf{U}\,\boldsymbol{\mu}_{\boldsymbol{\theta}} \big). \qquad (15)$$

Since the prior on $\boldsymbol{\theta}$ is Gaussian, the obtained estimator is asymptotically equivalent to the conditional mean [pillonetto2014kernel]. This is illustrated in Section 5.1 using a simple simulation experiment.

We remark here that FIR models are quite simple, and in practice we will need to work with more complicated models that are nonlinear in the input and have nonlinear parameterizations. Furthermore, the prior distributions used are usually non-Gaussian and may have finite support, or support that is unbounded only on one side. In these cases, the posterior mean that we try to approximate will be nonlinear in the data. Therefore, we need a more flexible class of functions and parameterizations, instead of just the linear map considered in the FIR case, whose purpose was to illustrate the method in a simple setting.
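The FIR example can be checked numerically. The NumPy sketch below uses our own illustrative conventions (a zero-mean, unit-covariance Gaussian prior, a particular regressor indexing in `U`, and the names `A_hat` for the empirically fitted linear map and `A_star` for the asymptotic conditional-mean map); the gap between the two maps shrinks as $PM$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, P, M, sig2 = 2, 20, 5000, 10, 1.0

# Regressor matrix built from one fixed input realization, cf. (8)
# (zero initial conditions assumed for the illustrative indexing)
u = rng.standard_normal(N)
U = np.column_stack([np.concatenate([np.zeros(k), u[:N - k]]) for k in range(n)])

Theta = rng.standard_normal((P, n))       # theta_p ~ N(0, I)
Th = np.repeat(Theta, M, axis=0)          # each theta_p reused for M signals
Y = Th @ U.T + np.sqrt(sig2) * rng.standard_normal((P * M, N))  # cf. (10)

# Empirical linear estimator: A_hat = (sum theta y^T)(sum y y^T)^{-1}
A_hat = (Th.T @ Y) @ np.linalg.inv(Y.T @ Y)

# Asymptotic conditional-mean map for the Gaussian prior, cf. (12) and (15)
A_star = U.T @ np.linalg.inv(U @ U.T + sig2 * np.eye(N))

gap = np.max(np.abs(A_hat - A_star))      # shrinks as P*M grows
```

Rerunning with larger `P` drives `gap` toward zero, mirroring the law-of-large-numbers argument behind (14).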

## 5 Simulation experiments

In this section, we describe the experimental setup and training strategy for the proposed DeepBayes method and discuss the obtained results. We call the DeepBayes methods utilizing GRU and LSTM architectures DeepBayes-GRU and DeepBayes-LSTM, respectively. We use the RNN architectures in many-to-one mode, i.e. we train the RNNs to process a sequence of inputs and produce one output vector at the final time step, as described in Section 3.

We conduct a first experiment to demonstrate our proposed approach on the linear toy model, as described in Section 4.2. Next, we evaluate our method on two different variations of a popular, nonlinear state space model, and finally also on a real-world, coupled electric drive model [wigren2017coupled]. We compare the performance with state-of-the-art methods based on ML like PSEM [schon2011system], CPF-SAEM [lindsten2013efficient], and also with a conditional mean estimator using Metropolis-Hastings sampling (CME-MH) [ninness2010bayesian].

The ML-based methods and the Markov chains of the CME-MH are all initialized in the neighbourhood of the true values of the parameters. This is to improve convergence and avoid bad local solutions. We describe the details regarding the tuning of the MH algorithm in Appendix A. The code is available at: github.com/anubhabghosh/ParamEstimation.

We use the MSE of the parameter estimate to evaluate the performance of the DeepBayes estimator. Denoting the true value of the unknown parameter by $\boldsymbol{\theta}^{*}$, we generate the test data set $\mathcal{D}_{\text{test}}$ using the input signal and $\boldsymbol{\theta}^{*}$. $\mathcal{D}_{\text{test}}$ consists of $N_{\text{test}}$ output signals in total, and the MSE is computed using a Monte Carlo approximation according to

$$\text{MSE} = \frac{1}{N_{\text{test}}} \sum_{i=1}^{N_{\text{test}}} \left\| \boldsymbol{\theta}^{*} - \hat{f}^{\mathcal{M}}_{\hat{\boldsymbol{\eta}}}\big(y_{1:N}^{(i)}\big) \right\|^2, \qquad (16)$$

where $\hat{f}^{\mathcal{M}}_{\hat{\boldsymbol{\eta}}}$ denotes the estimator learned using the DeepBayes method, with optimized parameters given by $\hat{\boldsymbol{\eta}}$. We use the same test data set to evaluate the competing methods (PSEM, CPF-SAEM and CME-MH) using (16). Note that while the other methods require parameter initialization close to the true values of the parameters, DeepBayes has no such requirement.

### 5.1 Experiments on a linear toy model

We consider the FIR model defined in (9) with filter order $n = 2$. The measurement noise is Gaussian with a variance that is assumed known when the estimator is trained. We generate a test data set using the true parameter values, and generate the training data set using a Gaussian prior distribution on $\boldsymbol{\theta}$. The simulated output signals are then given by (10), where $\mathbf{U}$ is built from the input samples as in (8) and kept the same during both training and testing.

| $P \backslash M$ | 50 | 100 | 200 | 500 |
|---|---|---|---|---|
| 400 | 5.47 | 2.55 | 1.29 | 0.53 |
| 500 | 4.26 | 1.90 | 1.04 | 0.44 |
| 600 | 3.68 | 1.78 | 0.80 | 0.33 |
| 1000 | 2.07 | 1.12 | 0.52 | 0.20 |

To study how the choice of user parameters affects the estimator, we use a range of values for $P$, the number of realizations of the parameter vector, and $M$, the number of noise realizations per sampled parameter vector. For each estimator, we compute estimates using (13) and (15). The results are shown in Table 1. We can see that for increasing values of $P$ and $M$, the MSE between the estimates computed using (15) and (13) indeed decreases, which corroborates our theory.

### 5.2 Nonlinear state space models

Similar to [schon2011system, lindsten2013efficient], we consider the following discrete-time nonlinear time series model:

$$x_{t+1} = a\,x_t + b\,\frac{x_t}{1+x_t^2} + c\,\cos(1.2\,t) + w_t, \qquad y_t = d\,x_t^2 + e_t, \qquad (17)$$

where $w_t \sim \mathcal{N}(0, q)$ and $e_t \sim \mathcal{N}(0, r)$. We assume zero initial conditions and that $w_t$ and $e_t$ are independent over $t$ and of each other. The parameter vector to be identified is a subset of $(a, b, c, d, q, r)$. We consider two different variations of (17) and perform a comparative performance study of our DeepBayes estimation method. The first case is

(18)

where , and . The parameter vector to be identified is

and the prior on the parameter is such that the parameters are mutually independent and uniformly distributed according to

(19)

We assume the remaining parameters to be known, so they need not be identified. The true parameter vector in this case is fixed to its known value. The second case is

(20)

where the noise processes are as in (17). The parameter vector to be identified has a prior under which the parameters are mutually independent and uniformly distributed according to

(21)

In (21), $\epsilon$ is a very small, strictly positive real number, since $q$ and $r$ are used for modeling variances and must remain positive. The values of the other parameters are assumed to be known and need not be identified. The true parameter vector is fixed to its known value.

We choose the first case for estimating the variance parameters, as it provides familiar ground for comparison with existing ML-based methods [lindsten2013efficient, schon2011system]. For estimating a larger set of parameters, we use the simplified model set of the second case. We hope that the success of the DeepBayes method on (20) will help motivate its application to further, harder cases.
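For reference, the following sketch simulates a model of the form (17). The parameterization below, with the values $a = 0.5$, $b = 25$, $c = 8$, $d = 0.05$ used in the test, is the one commonly used for this benchmark in the cited works; it is an assumption of this sketch, not necessarily the exact parameterization of (18) and (20):

```python
import numpy as np

def simulate_ssm(theta, N, rng):
    """Simulate the benchmark nonlinear state-space model:
        x_{t+1} = a x_t + b x_t/(1+x_t^2) + c cos(1.2 t) + w_t,  w_t ~ N(0, q)
        y_t     = d x_t^2 + e_t,                                 e_t ~ N(0, r)
    with a zero initial condition."""
    a, b, c, d, q, r = theta
    x = 0.0
    y = np.empty(N)
    for t in range(N):
        y[t] = d * x**2 + np.sqrt(r) * rng.standard_normal()
        x = (a * x + b * x / (1.0 + x**2) + c * np.cos(1.2 * t)
             + np.sqrt(q) * rng.standard_normal())
    return y
```

Pairing this simulator with a prior sampler over the unknown entries of `theta` yields exactly the synthetic-data pipeline of Section 4.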

#### 5.2.1 Training strategy and hyperparameter tuning

Training details: An epoch is defined as one complete pass through the training set, i.e. when the training algorithm has 'seen' all the examples in the training set. The RNN architectures are trained using mini-batch gradient descent, with a learning rate that is decreased step-wise by a factor of 0.9 after every one-third of the total number of epochs. This helps prevent oscillations during the later stages of training. The optimizer used is Adam [kingma2014adam], and the training and inference algorithms have been written in PyTorch. We utilize the automatic differentiation library of PyTorch to compute the gradients for backpropagation [paszke2019pytorch].

Hyperparameter selection: Every RNN has a set of hyperparameters that need to be tuned to ensure the desired performance. For our training purposes, we choose to tune the following hyperparameters: the number of hidden units ($n_{hidden}$) in each layer, the number of hidden layers ($n_{layers}$) and the number of hidden nodes in the final dense layer ($n_{dense}$) prior to the output. For tuning the hyperparameters, we monitor the validation loss at every epoch and enforce an early-stopping criterion to avoid overfitting [goodfellow2016deep, Chap. 7]. Early stopping is a very popular regularization method, and introducing it into a training algorithm does not require any restrictions on the network structure or weights. Typically, it involves monitoring the change in validation-set performance for a specific number of iteration steps. The number of such iteration steps is often called the patience. If the performance does not improve within the patience, training is terminated and the model parameters and related metrics are recorded and saved. We describe our algorithm for early stopping in Appendix B.

| S.No. | $n_{layers}$ | $n_{hidden}$ | $n_{dense}$ | Tr. MSE | Val. MSE |
|---|---|---|---|---|---|
| 1 | 1 | 30 | 32 | 0.00649 | 0.00675 |
| 2 | 1 | 30 | 40 | 0.00569 | 0.00606 |
| 3 | 2 | 30 | 32 | 0.00505 | 0.00558 |
| 4 | 2 | 30 | 40 | 0.00554 | 0.00588 |
| 5 | 1 | 40 | 32 | 0.00534 | 0.00578 |
| 6 | 1 | 40 | 40 | 0.00561 | 0.00600 |
| 7 | 2 | 40 | 32 | 0.00520 | 0.00576 |
| 8 | 2 | 40 | 40 | 0.00517 | 0.00577 |
| 9 | 1 | 50 | 32 | 0.00612 | 0.00627 |
| 10 | 1 | 50 | 40 | 0.00595 | 0.00626 |
| 11 | 2 | 50 | 32 | 0.00484 | 0.00584 |
| 12 | 2 | 50 | 40 | 0.00443 | 0.00575 |
| 13 | 1 | 60 | 32 | 0.00517 | 0.00595 |
| 14 | 1 | 60 | 40 | 0.00511 | 0.00577 |
| 15 | 2 | 60 | 32 | 0.00373 | 0.00631 |
| 16 | 2 | 60 | 40 | 0.00334 | 0.00670 |

#### 5.2.2 Estimating the variance parameters

We generate the synthetic data set by uniformly sampling values of the variance parameters as defined in (19). We sample $P$ realizations of the parameter vector and generate $M$ output signals per realization. We split the data set into a training and a validation part. The parameters of the RNN are learned using the training set, and hyperparameter tuning is done using the validation set. The results of hyperparameter tuning for the DeepBayes-GRU method on this task are shown in Table 2. We choose the configuration from Table 2 that gives the smallest validation error. Then, we retrain the model using a larger training data set with more realizations and more output signals per realization. For testing, we fix the variance parameters to their true values and generate the output signals that form the test data set $\mathcal{D}_{\text{test}}$. We compute parameter estimates for each output signal in $\mathcal{D}_{\text{test}}$ and then compare with the true values of the variance parameters using (16).

We repeat the same grid-search procedure for hyperparameter selection and testing for the DeepBayes-LSTM method. The DeepBayes-LSTM method has more parameters than DeepBayes-GRU, which can lead to poorer generalization due to possible overfitting. We compare the results with the CPF-SAEM method [lindsten2013efficient] and the CME-MH method, both evaluated on the same test set. The results are shown in Table 3, where the DeepBayes-GRU method outperforms CPF-SAEM by leveraging the prior knowledge and modeling capability, and comes quite close to the CME-MH method. An example of the Markov chain traces for the parameters in the CME-MH method is shown in Fig. 3, which shows that the Markov chains indeed converged around the true values.

A box plot of the DeepBayes-GRU estimates of the variance parameters is shown in Fig. 4. Furthermore, Fig. 5 compares the individual estimates of the benchmark CME-MH method and DeepBayes-GRU on a number of randomly chosen output signals from the test set. We see from Fig. 5 that the estimates of the CME-MH and DeepBayes estimators are quite close even on individual output signals. This indicates that, in this example, the DeepBayes estimator successfully approximates the CME function.

| Metric | CPF-SAEM | CME-MH | DeepBayes-LSTM | DeepBayes-GRU |
|---|---|---|---|---|
| Test MSE | 0.00932 | 0.00774 | | |

#### 5.2.3 Estimating all the parameters

In this experiment, we seek to estimate all the unknown parameters of the model described in (20). The training data set is generated using a strategy similar to that described for the variances in Section 5.2.2. The hyperparameters for the GRU architecture are chosen using the performance on a validation set, similar to the setup in Table 2. An important remark here is that the performance of our estimation method depends on the size of the training set. Through experiments, we find that the influence of the number of parameter realizations $P$ is greater than that of the number of output signals per realization $M$. Table 4 demonstrates this behaviour: as $P$ increases with $M$ kept constant, the performance of the model improves significantly, with the best performance given by the training set having $P = 2000$.

Hence, we select this training set configuration and compare with the ML-based PSEM method described in [schon2011system]. The results are shown in Table 5. We also numerically compute the conditional mean estimate in (6) using the CME-MH method. All results are reported on the same test set, which is generated by fixing the parameters to the known true values and repeatedly simulating the model described in (20). The performance metric used is (16).

| Metric | P=500 | P=1000 | P=1500 | P=2000 |
|---|---|---|---|---|
| Test MSE | 0.00879 | 0.00326 | 0.00293 | 0.00265 |

| Metric | PSEM | CME-MH | DeepBayes-GRU |
|---|---|---|---|
| Test MSE | 0.00341 | 0.00261 | |

| Method | Training | Inference |
|---|---|---|
| PSEM | | |
| CME-MH | | |
| DeepBayes-GRU | | |

Although the ML-based methods are asymptotically optimal, we see from the experimental results in Table 5 that the DeepBayes method is capable of producing results better than the ML-based methods and even close to the CME-MH method. More importantly, this is achieved by training only on synthetically generated data, with no special initialization close to the true values (as opposed to the PSEM method). This hints at the greater applicability of such models to real-world scenarios, where true measurements are often scarce and mostly unavailable during training.

Another important advantage is that once these estimators are trained, inference is quick and efficient, without any re-training of the architectures. We show in Table 6 that the DeepBayes-GRU method ensures significant savings in inference time compared to the PSEM and CME-MH methods, while having fairly comparable training times. We remark that the use of deep RNNs in DeepBayes estimators calls for a GPU for faster training. The PSEM and CME-MH methods were implemented in MATLAB and run on a CPU with an Apple M1 chip and 4 processing cores, while the DeepBayes methods were implemented in PyTorch and run on a single NVIDIA Tesla P100 GPU card.

### 5.3 Real-world data

In this section, we describe experimental results on real-world data. Our system of interest is a pair of coupled electric drives. It consists of two electric motors coupled to a pulley through a conveyor belt, with the pulley held by a spring [wigren2017coupled]. This system is part of an open-source, nonlinear dynamical system benchmark. A continuous-time description of the system is given by the Wiener model

$$G(s) = \frac{k\,\alpha\,\omega_0^2}{(s+\alpha)\left(s^2 + 2\xi\omega_0 s + \omega_0^2\right)}, \qquad (22a)$$

$$z(t) = \mathcal{L}^{-1}\big[ G(s)\,U(s) \big](t), \qquad (22b)$$

$$y(t_k) = \left| z(t_k) \right| + e(t_k), \qquad (22c)$$

where $t$ denotes continuous time. Here, $z(t)$ denotes the pulley velocity measured as a voltage, and the input signal $u(t)$ (with Laplace transform $U(s)$) represents the sum of the voltages applied to both motors. The measurements are recorded at sampled time instants $t_k$, so that $y(t_k)$ denotes the measured, rectified noisy output signal and $e(t_k)$ denotes additive Gaussian noise with variance $\sigma_e^2$. The parameter $\alpha$ denotes the inverse time constant of the electric drives, $\xi$ denotes the damping related to the spring and belt, $\omega_0$ denotes the angular resonance frequency, and $k$ denotes the static gain. In [wigren2005recursive], a method for estimating the unknown parameters of this model through a recursive prediction error method (RPEM) was developed. The authors in [wigren2017coupled] consider three examples of input pseudo-random binary signals (PRBS) with different amplitude values. For our experiments, we only consider one of these PRBS input signals.

#### 5.3.1 Experimental setup in [wigren2017coupled]

In [wigren2017coupled], a measured signal of duration 10 seconds is used, with a sampling time of 20 ms. The sampling period for the input signal is 100 ms. The model structure can be written in state-space form as

$$\dot{\mathbf{x}}(t) = \mathbf{A}(\boldsymbol{\theta})\,\mathbf{x}(t) + \mathbf{B}(\boldsymbol{\theta})\,u(t), \qquad (23a)$$

$$y(t_k) = \left| \mathbf{C}\,\mathbf{x}(t_k) \right| + e(t_k). \qquad (23b)$$

We can write (23a) equivalently in transfer-function form as

$$G(s) = \frac{b_0}{s^3 + a_2 s^2 + a_1 s + a_0}, \qquad (24)$$

where $\mathbf{x}(t)$ collects the states of the linear block. The unknown parameter vector with a continuous-time parameterization is $\boldsymbol{\theta} = (a_0, a_1, a_2, b_0)$. Comparing (22a) and (24), we can see that:

$$a_2 = \alpha + 2\xi\omega_0, \qquad a_1 = \omega_0^2 + 2\xi\omega_0\alpha, \qquad a_0 = \alpha\,\omega_0^2, \qquad b_0 = k\,\alpha\,\omega_0^2. \qquad (25)$$

We also choose to model the unknown, additive output disturbance as white Gaussian noise with an unknown variance $\sigma_e^2$. An important point to note is that RPEM estimates the unknown parameters of a continuous-time model that has been discretized using the Euler method [wigren2005recursive]. As pointed out in [wigren2017coupled], this does not take into account the effect of sampling on the model. In our experimental setup, we use the continuous-time parameter estimates reported in [wigren2017coupled] as prior information for our DeepBayes method. We furthermore seek to overcome this limitation of the RPEM estimation by performing the discretization using a zero-order hold, thus ensuring that our obtained parameters are indeed representative of the identified result.
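Zero-order-hold discretization itself is a one-liner with SciPy. The sketch below builds a hypothetical third-order linear block in controllable canonical form from illustrative values of the inverse time constant, damping, resonance frequency and static gain (the true identified values are those reported in [wigren2017coupled]):

```python
import numpy as np
from scipy.signal import cont2discrete

# Illustrative (not identified) physical parameters
alpha, xi, w0, k = 1.0, 0.1, 10.0, 2.0

# Denominator coefficients of (s+alpha)(s^2 + 2*xi*w0*s + w0^2),
# placed in the last row of a controllable canonical form
a0 = alpha * w0**2
a1 = w0**2 + 2 * xi * w0 * alpha
a2 = alpha + 2 * xi * w0
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [-a0, -a1, -a2]])
B = np.array([[0.0], [0.0], [k * alpha * w0**2]])
C = np.array([[1.0, 0.0, 0.0]])
D = np.array([[0.0]])

Ts = 0.02                                   # 20 ms sampling time
Ad, Bd, Cd, Dd, _ = cont2discrete((A, B, C, D), Ts, method="zoh")
```

Unlike Euler discretization, the ZOH mapping is exact for piecewise-constant inputs, which is precisely the property exploited here.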

#### 5.3.2 Dataset generation process

For simulation purposes, we use the open-source Python Control Systems Library (https://python-control.readthedocs.io), which is similar to its MATLAB counterpart [murray2018python]. The dataset generation process for the coupled electric drives model involves two steps: choosing a reasonable prior from which to sample parameters, and simulating the model in (22) to generate output signals.

For the first step, we assume the prior distribution of our unknown parameters to be uniform, and seek a reasonable support for the prior. This is achieved using a set of parameters θ* obtained by solving the following nonlinear least squares problem:

$$\theta^{*} = \arg\min_{\theta} \sum_{k=1}^{N} \big(z(t_k) - \hat{z}(t_k; \theta)\big)^2,$$

where z(t_k) is the measured signal and ẑ(t_k; θ) is the simulated output signal of length N, obtained using the model and the PRBS input signal. In practice, the models are simulated using the command lsim in the control systems toolbox. We utilize the mapping between the parameters of the transfer functions in (22a) and (24), as shown in (25), to obtain the continuous-time parameters from the fitted transfer-function coefficients. Finally, to generate a synthetic dataset, we sample each parameter uniformly within approximately ±20% of θ*, and for each sampled realization we simulate the output signals as step responses of the model in (22). We assume that the noise variance σ_e² also has a uniform prior. We train our DeepBayes-GRU method on a synthetic dataset consisting of multiple parameter realizations, with several output signals simulated per realization. Hyperparameters such as the learning rate and early-stopping patience are tuned according to performance on a subset of the synthetic dataset, chosen as the validation set.
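A minimal sketch of this dataset generation step, assuming illustrative reference values standing in for θ* = (α, ξ, ω₀, k); scipy.signal.lsim is used here in place of the control toolbox's lsim, and the noise variance is an arbitrary placeholder:

```python
import numpy as np
from scipy.signal import lsim, TransferFunction

rng = np.random.default_rng(0)

# Illustrative reference values standing in for theta* = (alpha, xi, omega0, k);
# the actual values come from the nonlinear least-squares fit.
theta_star = np.array([4.0, 0.1, 9.0, 2.5])

def simulate_output(theta, u, t, noise_var):
    """Simulate the rectified, noisy output z for one parameter realization."""
    alpha, xi, omega0, k = theta
    # Transfer-function coefficients b / (s^3 + f1 s^2 + f2 s + f3)
    f1 = alpha + 2 * xi * omega0
    f2 = omega0**2 + 2 * xi * omega0 * alpha
    f3 = alpha * omega0**2
    b = k * alpha * omega0**2
    sys = TransferFunction([b], [1.0, f1, f2, f3])
    _, y, _ = lsim(sys, U=u, T=t)
    # Rectified output plus additive Gaussian measurement noise
    return np.abs(y) + rng.normal(0.0, np.sqrt(noise_var), size=y.shape)

t = np.arange(0, 10, 0.02)   # 10 s at a 20 ms sampling time
u = np.ones_like(t)          # step input for this sketch
dataset = []
for _ in range(5):           # a handful of realizations for illustration
    # Sample each parameter uniformly within +/-20% of theta*
    theta = rng.uniform(0.8 * theta_star, 1.2 * theta_star)
    z = simulate_output(theta, u, t, noise_var=0.05)
    dataset.append((theta, z))
```

Each `(theta, z)` pair then serves as one labeled training example for the estimator network.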

Table 7: Sum of squared error between the measured and simulated output signals.

| CME-MH | DeepBayes-GRU |
| --- | --- |
| 24.06 | |

#### 5.3.3 Results

We now describe the results of our method when applied to real-world data in Table 7. We compare our proposed DeepBayes-GRU method with the CME-MH method. Since we do not have ‘true’ parameter values, the quality of each estimator is assessed by computing the sum of squared errors between the measured output signal and the output signal simulated, without disturbance, using (23) and the estimated parameters in each case. We obtain the reference estimate using the CME-MH method by employing a variation of Algorithm 1 in which the log-likelihoods can be computed more easily. Considering the model in (22), given the parameters and the input, the noiseless output is deterministic, so each measured sample z(t_k) is Gaussian with mean equal to the rectified noiseless output and variance σ_e². Hence, the log-likelihood of the signal (consisting of all samples) can be computed in closed form as a sum of Gaussian log-densities. The resulting plot of the simulated output signals using the trained DeepBayes-GRU is shown in Fig. 6. On visual inspection, we find that the simulated output signal using DeepBayes-GRU closely matches the measured output signal. Also, from Table 7, we can see that the DeepBayes-GRU method performs quite satisfactorily compared to the CME-MH method.
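The closed-form log-likelihood used by this CME-MH variant can be sketched as follows, assuming independent Gaussian measurement noise on the rectified output; the function name and arguments are our own:

```python
import numpy as np

def log_likelihood(z, y_noiseless, noise_var):
    """Closed-form Gaussian log-likelihood of the measured signal z,
    assuming z_k ~ N(|y_k|, noise_var) independently across samples."""
    n = len(z)
    resid = z - np.abs(y_noiseless)
    return (-0.5 * n * np.log(2 * np.pi * noise_var)
            - 0.5 * np.sum(resid**2) / noise_var)
```

Because the noiseless output is deterministic given the parameters, evaluating this expression avoids the particle-filter likelihood approximation needed in the general case.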

## 6 Conclusion and Future Work

In this contribution, we propose the DeepBayes method, which leverages the modeling capability of deep RNNs for offline learning of estimators of nonlinear dynamical systems. We illustrate the asymptotic behaviour of our proposed estimators using a linear toy model. We then demonstrate that the estimation performance of DeepBayes is competitive with state-of-the-art sampling-based ML and Bayesian methods, with the distinct advantage of substantial savings in inference time and no initialization issues.

As part of future work, we would be interested in further investigating the effect of variable-length sequences. We would also like to investigate cases where the unknown parameters have different scales and analyse the relation between our method and recursive system identification. Another idea is to focus on learning prior distributions from data.

## Appendix

## Appendix A Numerically estimating the conditional mean estimate via Metropolis Hastings

We use a Markov chain Monte Carlo method to numerically approximate the conditional mean estimator given by (6) in Section 4.1. To build the Markov chain, we use the MH algorithm [hastings1970monte, chib1995understanding, robert1999metropolis], together with a particle filter implementation to approximate the log-likelihood [schon2011system]. A reparameterization of the MH proposal distribution is used in order to sample effectively within the support of the given prior. The MH algorithm is initialized using the true parameter, and the values of the proposal covariance are tuned to ensure convergence of the Markov chain. We also remark that we do not apply any thinning of the Markov chain, and we average the values of the parameter in the Markov chain after the burn-in period [robert1999metropolis].
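A minimal sketch of the random-walk MH recursion, with a generic log-target standing in for the particle-filter log-likelihood plus log-prior; the reparameterization step is omitted and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_hastings(log_target, theta0, prop_cov, n_iters=5000, burn_in=1000):
    """Random-walk Metropolis-Hastings; returns the post-burn-in sample mean
    as a numerical approximation of the conditional mean estimate."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    L = np.linalg.cholesky(np.atleast_2d(prop_cov))  # proposal covariance factor
    logp = log_target(theta)
    chain = []
    for _ in range(n_iters):
        prop = theta + L @ rng.standard_normal(theta.shape)  # Gaussian proposal
        logp_prop = log_target(prop)
        # Accept with probability min(1, exp(logp_prop - logp))
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = prop, logp_prop
        chain.append(theta)
    chain = np.array(chain)
    return chain[burn_in:].mean(axis=0), chain
```

For instance, with a Gaussian log-target centred at 2, the returned post-burn-in mean approaches 2 as the chain grows.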

## Appendix B Description of our early stopping algorithm

We describe the algorithm for early stopping in Algorithm 2. The algorithm monitors the relative change in the magnitude of the validation loss as a criterion to check convergence of the RNN training. If the convergence criterion is satisfied, the best model parameters and metrics are saved and training is terminated; otherwise, the model parameters at the end of training are saved.
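The convergence check of such an early-stopping scheme can be sketched as follows, assuming a relative-change criterion with a patience parameter; the threshold and patience values are illustrative, not the ones used in Algorithm 2:

```python
def check_early_stopping(val_losses, patience=3, tol=1e-3):
    """Return True when the relative change in validation loss has stayed
    below `tol` for `patience` consecutive epochs (a sketch of the criterion)."""
    if len(val_losses) < patience + 1:
        return False
    recent = val_losses[-(patience + 1):]
    # Relative change between consecutive validation-loss values
    rel_changes = [abs(recent[i + 1] - recent[i]) / max(abs(recent[i]), 1e-12)
                   for i in range(patience)]
    return all(c < tol for c in rel_changes)
```

In a training loop, this check would run once per epoch on the accumulated validation losses, terminating training as soon as it returns True.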