Deep State Space Models for Nonlinear System Identification

Deep state space models (SSMs) are an actively evolving class of generative temporal models developed in the deep learning community, and they have a close connection to classic SSMs. In this work six new deep SSMs are implemented and evaluated for the identification of established nonlinear dynamic system benchmarks. The models and their parameter learning algorithms are elaborated rigorously. Used as black-box identification models, deep SSMs can describe a wide range of dynamics due to the flexibility of deep neural networks. Additionally, the uncertainty of the system is modelled, so that one obtains a much richer representation and a whole class of systems to describe the underlying dynamics.




I Introduction

System identification is a well-established area of automatic control [3, 33, 52]. A wide range of identification methods have been developed for parametric and non-parametric models as well as for grey-box [25] or black-box models [46]. In contrast, the field of machine learning [6, 35], and especially deep learning [17, 27], has emerged as the new standard in many disciplines to model highly complex systems. A large number of deep learning based tools have been developed for a broad spectrum of applications. Deep learning can identify and capture patterns as a black-box model. It has been shown to be useful for high dimensional and nonlinear problems emerging in diverse areas such as image analysis [20, 29], time series modelling [26], speech recognition [11] and text classification [57]. This paper takes one step towards combining the areas of system identification and deep learning [32] by showing the usefulness of deep SSMs applied to nonlinear system identification. It helps to bridge the gap between the two fields and lets each learn from the other's advances.

Nowadays, a wide range of system identification algorithms for parametric models are available [31, 50]. Parametric models such as SSMs can include pre-existing knowledge about the system and its structure and can thereby obtain more precise identification results. Similar to hidden Markov models, SSMs can be more expressive than, e.g., autoregressive models due to their use of hidden states. For automatic control this is a popular model class and a variety of identification algorithms is available [44, 53].

In the deep learning community there have been recent advances in the development of deep SSMs, see e.g. [1, 4, 10, 12, 13, 16, 28, 40]. The class of deep SSMs has three main advantages. (1) It is highly flexible due to the use of Neural Networks (NNs) and it can capture a wide range of system dynamics. (2) Similar to SSMs it is more expressive than standard NNs because of the hidden variables. (3) In addition to the system dynamics, deep SSMs also capture the output uncertainty. These advantages have been exploited for the generation of handwritten text [30] and speech [38]. These examples have highly nonlinear dynamics and require capturing the uncertainty in order to generate new realistic sequences. The main contributions of this paper are:

  • Bring the communities of system identification and deep learning closer by giving insight into a deep learning model class and its learning algorithm through applying it to system identification problems. This extends the toolbox of possible black-box identification approaches with a new class of deep learning models. This paper complements the work in [2], where deterministic NNs are applied to nonlinear system identification.

  • The system identification community defines a clear separation between model structures and parameter estimation methods. In this paper the same distinction between model structures (Section II) and the learning algorithm to estimate the model parameters (Section III) is adopted, as a possible future guideline for deep learning.

  • Six deep SSMs are implemented and compared for nonlinear system identification (Section IV). The advantages of the models are highlighted in the experiments by showing that a maximum likelihood estimate is obtained and additionally the uncertainty is captured. Hence, a richer representation of the system dynamics is identified which is beneficial for example in robust control or system analysis.

II Deep State Space Models for Sequential Data

Deep learning is a highly active field with research in many directions. One active topic is sequence modeling, motivated by the temporal nature of the physical environment. A dynamic model is required to replicate the dynamics of the system. The model is a mapping from observed past inputs $u_t$ and outputs $y_t$ to predicted outputs $\hat{y}_t$. An SSM is obtained if the computations are performed via a latent variable $h_t$ that incorporates past information:

$$h_t = f_\theta(h_{t-1}, u_t, y_t), \tag{1a}$$
$$\hat{y}_t = g_\theta(h_t), \tag{1b}$$

where $\theta$ denotes the set of unknown parameters. If the functions $f_\theta$ and $g_\theta$ are described by deep mappings such as deep NNs, the resulting model is referred to as a deep SSM.
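The recursion in (1a) and (1b) can be sketched in a few lines. The snippet below is a minimal illustration and not the implementation used in the paper: the scalar state, the `tanh` transition, the linear output map and all parameter values are hypothetical choices made only to show the data flow.

```python
import math

def f_theta(h_prev, u, y, theta):
    """State update in the spirit of (1a): new state from old state, input and output."""
    return math.tanh(theta["a"] * h_prev + theta["b"] * u + theta["c"] * y)

def g_theta(h, theta):
    """Output map in the spirit of (1b): prediction computed from the latent state."""
    return theta["d"] * h

def run_ssm(inputs, outputs, theta, h0=0.0):
    """Run the recursion over a sequence and return the per-step predictions."""
    h, predictions = h0, []
    for u, y in zip(inputs, outputs):
        h = f_theta(h, u, y, theta)
        predictions.append(g_theta(h, theta))
    return predictions

# Hypothetical parameter values and data, chosen only for illustration.
theta = {"a": 0.5, "b": 1.0, "c": 0.2, "d": 2.0}
u_seq = [0.0, 1.0, 0.5, -0.5]
y_seq = [0.0, 0.3, 0.6, 0.1]
y_hat = run_ssm(u_seq, y_seq, theta)
```

Replacing the two hand-written functions by deep NNs gives the deep SSM structure discussed in the text.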

A second deep learning research direction is that of generative models involving model structures such as generative adversarial networks (GANs) [18] and Variational Autoencoders (VAEs) [24, 41], which are used to learn representations of the data and to generate new instances from the same distribution. For example, realistic images can be generated from these models [54]. Extending VAEs to sequential models [14] produces the subclass of deep SSMs studied in this paper. The building blocks for this type of model are Recurrent NNs (RNNs) and VAEs.

II-A Recurrent Neural Networks

RNNs are NNs suited to model sequences of variable length [17]. Models with external inputs $u_t$ and outputs $y_t$ at each time step are considered. RNNs make use of a hidden state $h_t$, similar to (1a) but without considering the outputs for the state update. A block diagram is given in Fig. 1, showing the similarity to classic SSMs. The figure highlights that the function parameters are learned by unfolding the RNN and using backpropagation through time [17]. Often a regularized L2-loss between the predicted output $\hat{y}_t$ and the true output $y_t$ is considered. The most successful types of RNNs for long term dependencies are Long Short-Term Memory (LSTM) networks [21] and Gated Recurrent Units (GRUs) [8], which yield empirically similar results [9]. GRUs are used within this paper due to their structural simplicity.
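To make the gated state update concrete, the following is a minimal scalar sketch of one common GRU formulation. It is an illustration under stated assumptions, not the networks used in the experiments: the parameter names in `p` are hypothetical, real GRUs use weight matrices instead of scalars, and other sign conventions for the update gate exist.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h_prev, u, p):
    """One scalar GRU update; the state depends on the input only, not on the output."""
    z = sigmoid(p["wz"] * u + p["uz"] * h_prev + p["bz"])          # update gate
    r = sigmoid(p["wr"] * u + p["ur"] * h_prev + p["br"])          # reset gate
    n = math.tanh(p["wn"] * u + p["un"] * (r * h_prev) + p["bn"])  # candidate state
    return (1.0 - z) * n + z * h_prev                              # gated interpolation

def unroll(inputs, p, h0=0.0):
    """Unfold the recurrence over an input sequence and collect the hidden states."""
    h, states = h0, []
    for u in inputs:
        h = gru_step(h, u, p)
        states.append(h)
    return states

# Hypothetical parameters: with all weights zero both gates equal 0.5 and the
# candidate state is 0, so the state simply halves at every step.
zero_params = {k: 0.0 for k in ("wz", "uz", "bz", "wr", "ur", "br", "wn", "un", "bn")}
states = unroll([1.0, -1.0], zero_params, h0=1.0)
```

Unfolding the loop in `unroll` is what backpropagation through time differentiates through when the parameters are learned.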

Fig. 1: Block diagram of the RNN for sequence modeling. Round blocks indicate probabilistic variables and rectangular blocks deterministic variables. Shaded blocks indicate observed variables. The black block indicates a one-step delay.

II-B Variational Autoencoders

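A VAE [24, 41] embeds a representation of the data distribution of $x$ in a low dimensional latent variable $z$ via an inference network (encoder). A decoder network uses $z$ to generate new data $\tilde{x}$ of approximately the same distribution as $x$. The conceptual idea of a VAE is visualized in Fig. 2 and can be viewed as a latent variable model. The dimension of $z$ is a hyperparameter.

In the VAE it is in general assumed that the data $x$ has a normal distribution. Therefore, the decoder is chosen accordingly as $p_\theta(x \mid z) = \mathcal{N}(x \mid \mu^{\text{dec}}, \sigma^{\text{dec}})$. The parameters for this distribution are given by $[\mu^{\text{dec}}, \sigma^{\text{dec}}] = \text{NN}_\theta^{\text{dec}}(z)$ as a deep NN with parameters $\theta$, input $z$ and outputs $\mu^{\text{dec}}$ and $\sigma^{\text{dec}}$. Hence, the generative model is characterized by the joint distribution $p_\theta(x, z) = p_\theta(x \mid z)\, p_\theta(z)$, where the multivariate normal distribution $p_\theta(z) = \mathcal{N}(z \mid \mu^{\text{prior}}, \sigma^{\text{prior}})$ is used as prior. The prior parameters are usually chosen to be $[\mu^{\text{prior}}, \sigma^{\text{prior}}] = [0, I]$.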

Fig. 2: Conceptual idea of the VAE.

For the data embedding in $z$, the distribution of interest is the posterior $p_\theta(z \mid x)$, which is intractable in general. Instead of solving for the posterior, it is approximated by a parametric distribution $q_\phi(z \mid x) = \mathcal{N}(z \mid \mu^{\text{enc}}, \sigma^{\text{enc}})$. The distribution parameters are encoded by a deep NN $[\mu^{\text{enc}}, \sigma^{\text{enc}}] = \text{NN}_\phi^{\text{enc}}(x)$. This network is optimized by variational inference [7, 22] of the variational parameters $\phi$, which are shared over all data points, using an amortized version [56].

There exists a connection between the VAE and linear dimension reduction methods such as PCA. In [43] it is shown that the PCA corresponds to a linear Gaussian model. Specifically, the VAE can be viewed as a nonlinear generalization of the probabilistic PCA.

II-C Combining RNNs and VAEs into deep SSMs

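RNNs can be viewed as a special case of classic SSMs with Dirac delta functions as state transition distribution [14], see (1a) for comparison. The VAE can be used to approximate the output distributions of the dynamics, see (1b). A temporal extension of the VAE is needed for the studied class of deep SSMs. The parameters of the VAE prior are updated sequentially with the output $h_t$ of an RNN. The state transition distribution is given by $p_\theta(z_t \mid h_t) = \mathcal{N}(z_t \mid \mu_t^{\text{prior}}, \sigma_t^{\text{prior}})$ with $[\mu_t^{\text{prior}}, \sigma_t^{\text{prior}}] = \text{NN}_\theta^{\text{prior}}(h_t)$. Compared with the VAE prior, the parameters are now not static but dependent on previous time steps, which describes the recurrent nature of the model. Similarly, the output distribution is given as $p_\theta(y_t \mid z_t) = \mathcal{N}(y_t \mid \mu_t^{\text{dec}}, \sigma_t^{\text{dec}})$ with $[\mu_t^{\text{dec}}, \sigma_t^{\text{dec}}] = \text{NN}_\theta^{\text{dec}}(z_t)$. The joint distribution of the deep SSM is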


Similar to the VAE, this expression describes the generative process. It can be further decomposed with a clear separation between the RNN and the VAE. This yields the simplest form within the studied class of deep SSMs, the so-called VAE-RNN [14]. The model consists of stacking a VAE on top of an RNN as shown in Fig. 3. Notice the clear separation between model parameter learning in the inference network with the available data and the output prediction in the generative network. The joint true posterior can be factorized according to the graphical model as


with the prior given by $p_\theta(z_t \mid h_t)$, which depends only on the recurrent state $h_t$. The approximate posterior can be chosen to mimic the same factorization
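For the VAE-RNN, one way to write out the joint distribution and the two posterior factorizations referred to as (2) to (4) is sketched below. The notation is assumed here ($u_t$ inputs, $y_t$ outputs, $z_t$ latent variables, $h_t$ the deterministic RNN state, $\theta$ and $\phi$ the generative and variational parameters), and the expressions are derived from the graphical model in Fig. 3 rather than quoted, so they should be read as a consistent sketch only.

```latex
\begin{align}
p_\theta(y_{1:T}, z_{1:T}, h_{1:T} \mid u_{1:T}, h_0)
  &= \prod_{t=1}^{T} p_\theta(y_t \mid z_t)\, p_\theta(z_t \mid h_t)\, p_\theta(h_t \mid h_{t-1}, u_t), \\
p_\theta(z_{1:T}, h_{1:T} \mid y_{1:T}, u_{1:T}, h_0)
  &= \prod_{t=1}^{T} p_\theta(z_t \mid y_t, h_t)\, p_\theta(h_t \mid h_{t-1}, u_t), \\
q_\phi(z_{1:T}, h_{1:T} \mid y_{1:T}, u_{1:T}, h_0)
  &= \prod_{t=1}^{T} q_\phi(z_t \mid y_t, h_t)\, p_\theta(h_t \mid h_{t-1}, u_t).
\end{align}
```

Here $p_\theta(h_t \mid h_{t-1}, u_t)$ is a Dirac delta at the RNN output, which is why the recurrent state carries no additional uncertainty in this model.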


There are multiple variations in this class of deep SSM, next to the VAE-RNN [14]. The ones considered in this paper are:

  • Variational RNN (VRNN) [10]: Based on VAE-RNN but the recurrence additionally uses the previous latent variable $z_{t-1}$ for the update of $h_t$.

  • VRNN-I [10]: Same as VRNN but a static prior is used in every time step.

  • Stochastic RNN (STORN) [4]: Based on the VRNN-I. In the inference network STORN uses additionally a forward running RNN with input , latent variable and output . Hence is characterized by .

Graphical models for these extensions are provided in Appendix A. For VRNN and VRNN-I an additional version using Gaussian mixtures as output distribution (VRNN-GMM) is studied. More methods are available in the literature, see e.g. [1, 13, 12, 16].

(a) Inference Network
(b) Generative Network
Fig. 3: Graphical model of the VAE-RNN model.

III Model Parameter Learning

III-A Cost Function for the VAE

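The parameter learning method of the deep SSMs is based on the same method as for VAEs. The VAE parameters are learned by maximum likelihood estimation with $N$ data points $\{x_i\}_{i=1}^{N}$. By performing variational inference with shared parameters $\phi$ for all data one obtains

$$\log p_\theta(x_i) = \log \int p_\theta(x_i, z)\, dz = \log \mathbb{E}_{q_\phi(z \mid x_i)}\left[\frac{p_\theta(x_i, z)}{q_\phi(z \mid x_i)}\right] \geq \mathbb{E}_{q_\phi(z \mid x_i)}\left[\log \frac{p_\theta(x_i, z)}{q_\phi(z \mid x_i)}\right], \tag{7}$$

where Jensen's inequality is used in (7). The right hand side is referred to as the evidence lower bound (ELBO) and can be rewritten using the Kullback-Leibler (KL) divergence

$$\mathcal{L}_i(\theta, \phi) = \mathbb{E}\left[\log p_\theta(x_i \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x_i) \,\|\, p_\theta(z)\right), \tag{8}$$

where the expectation is with respect to $q_\phi(z \mid x_i)$. The first term encourages the reconstruction of the data by the decoder. The KL divergence in the second term is a measure of closeness between the two distributions and can be interpreted as a regularization term: approximate posterior distributions far away from the prior are penalized. The total ELBO is then given by $\mathcal{L}(\theta, \phi) = \sum_{i=1}^{N} \mathcal{L}_i(\theta, \phi)$, which is maximized instead of the intractable log-likelihood $\sum_{i=1}^{N} \log p_\theta(x_i)$.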
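The two terms of the ELBO in (8) can be evaluated numerically. The sketch below assumes a diagonal Gaussian approximate posterior and a standard normal prior, for which the KL term has a closed form, and estimates the reconstruction term by Monte Carlo sampling. The function names and the `toy_decoder` are hypothetical stand-ins for the encoder and decoder networks, not part of any library.

```python
import math
import random

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form, summed over dimensions."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s) for m, s in zip(mu, sigma))

def gaussian_log_pdf(x, mu, sigma):
    """Log density of a univariate normal with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def elbo_estimate(x, enc_mu, enc_sigma, decoder, rng, n_samples=100):
    """Monte Carlo estimate of E_q[log p(x|z)] minus the closed-form KL term."""
    recon = 0.0
    for _ in range(n_samples):
        # Draw z from the approximate posterior q(z|x).
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(enc_mu, enc_sigma)]
        dec_mu, dec_sigma = decoder(z)
        recon += sum(gaussian_log_pdf(xi, mi, si)
                     for xi, mi, si in zip(x, dec_mu, dec_sigma))
    return recon / n_samples - kl_to_standard_normal(enc_mu, enc_sigma)

def toy_decoder(z):
    """Hypothetical decoder: mean 2*z and unit standard deviation."""
    return [2.0 * z[0]], [1.0]

estimate = elbo_estimate([1.0], [0.0], [1.0], toy_decoder, random.Random(0), n_samples=2000)
```

An approximate posterior equal to the prior gives a KL term of zero, so in that case the estimate consists of the reconstruction term alone.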

III-B Cost Function for Deep SSMs

A temporal extension of the VAE parameter learning is required for the studied deep SSMs. Again, amortized variational inference with ELBO maximization is used. A similar derivation for the total ELBO of the VAE as in (7) leads for the generic deep SSM to


where the expectation is w.r.t. the approximate distribution . The factorization of the true joint posterior distribution from (2) can be applied which yields a total ELBO as the sum over all time steps. Note that in this generic scheme could be factorized as , which requires a smoothing step since depends on all inputs and output from . If there would be a similar factorization for the approximate posterior as in (2), then one can obtain a similar expression as for the VAE in (8).

In the VAE-RNN a solution for parameter learning is obtained due to the clear separation between the RNN and the VAE. Note that here no smoothing step for the variational distribution is necessary since the states $z_t$ are independent given $h_t$, as can be seen by d-separation in Fig. 3. The same factorization as in (4) can be used. The total ELBO for the VAE-RNN can then be written as


where the expectation is taken w.r.t. the approximate posterior . Applying the posterior factorizations in (3) and (4) to the total ELBO in (10) and taking the expectation w.r.t yields


which is of the same form as for the VAE ELBO in (8) but with a temporal extension as summation over all time steps.
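As a hedged illustration of this temporal extension, assume the per-step objective takes the form $\sum_{t=1}^{T} \left( \mathbb{E}_{q}[\log p_\theta(y_t \mid z_t)] - \mathrm{KL}(q_\phi(z_t \mid y_t, h_t) \,\|\, p_\theta(z_t \mid h_t)) \right)$ with univariate Gaussian distributions. This notation is an assumption consistent with the description above, not a quotation. The sketch then simply accumulates reconstruction and KL terms over time; the function names are hypothetical.

```python
import math

def kl_gauss(m_q, s_q, m_p, s_p):
    """KL( N(m_q, s_q^2) || N(m_p, s_p^2) ) for univariate Gaussians."""
    return math.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2.0 * s_p ** 2) - 0.5

def sequence_elbo(recon_log_liks, posteriors, priors):
    """Sum the per-step ELBO terms over all time steps.

    recon_log_liks[t]: estimate of E_q[log p(y_t | z_t)]
    posteriors[t], priors[t]: (mean, std) of the approximate posterior and the prior of z_t
    """
    total = 0.0
    for log_lik, (m_q, s_q), (m_p, s_p) in zip(recon_log_liks, posteriors, priors):
        total += log_lik - kl_gauss(m_q, s_q, m_p, s_p)
    return total

# Two time steps with made-up numbers: the second posterior deviates from its prior.
elbo = sequence_elbo([-1.0, -2.0], [(0.0, 1.0), (1.0, 1.0)], [(0.0, 1.0), (0.0, 1.0)])
```

In contrast to the static VAE, the prior parameters differ from step to step because they are computed from the recurrent state.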

IV Numerical Experiments

All six models described in Section II are evaluated. The model hyperparameters are the size of the hidden state denoted by , the size of the GRU hidden state denoted by and the number of layers within the GRU networks . For STORN the dimension of is chosen equal to the one of . The VRNN-GMM uses 5 Gaussian mixtures in the output distribution.

For parameter learning as well as for hyperparameter and model selection the data is split into training data and validation data. A separate test data set is used for evaluating the final performance. The ADAM optimizer [23] with default parameters is used with early stopping and batch normalization [17]. The initial learning rate is decayed if the validation loss does not decrease for a specified number of epochs. The sequence length for training in mini-batches is considered as a design parameter.
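The interplay of learning rate decay and early stopping can be sketched independently of any deep learning framework. In the snippet below every number (initial rate, decay factor, both patience values) is a hypothetical placeholder, since the settings actually used are not stated here, and `train_schedule` is an illustrative helper rather than a library function.

```python
def train_schedule(val_losses, lr0, factor=0.5, patience=2, stop_patience=5):
    """Replay validation losses; return the learning rate used per epoch and the last epoch run."""
    lr, best, since_best = lr0, float("inf"), 0
    lrs = []
    for epoch, loss in enumerate(val_losses):
        lrs.append(lr)
        if loss < best:
            best, since_best = loss, 0      # improvement: reset the counter
        else:
            since_best += 1
            if since_best % patience == 0:
                lr *= factor                # decay after `patience` epochs without improvement
            if since_best >= stop_patience:
                return lrs, epoch           # early stopping
    return lrs, len(val_losses) - 1

# Made-up validation curve that stops improving after the second epoch.
losses = [1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 1.0]
lrs, last_epoch = train_schedule(losses, lr0=1e-3)
```

With this curve the rate is halved twice before training stops at epoch index 6, so the final entry of the replayed loss list is never used.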

Three experiments are conducted: (1) a linear Gaussian system, (2) the nonlinear Narendra-Li benchmark [36], and (3) the Wiener-Hammerstein process noise benchmark [45]. The first two experiments are considered to show the power of deep SSMs for uncertainty quantification with known true uncertainty, while the last experiment serves as a more complex real world example. The identified models are evaluated in open loop. The initial state is not estimated. The generated output sequences are compared with the true test data output. As performance metric, the root mean squared error (RMSE) is considered, $\text{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}(\hat{y}_t - y_t)^2}$. Here $\hat{y}_t$ is taken as the mean of the model output distribution, such that a fair comparison with maximum likelihood estimation methods can be made. To quantify the quality of the uncertainty estimate, the negative log-likelihood (NLL) per time step is used, $\text{NLL} = -\frac{1}{N}\sum_{t=1}^{N}\log \mathcal{N}(y_t \mid \mu_t, \sigma_t)$ with $\mu_t$ and $\sigma_t$ the mean and standard deviation of the model output distribution, describing how likely it is that the true data point falls in the model output distribution. PyTorch code is available on
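Both metrics are straightforward to compute from a predicted mean and standard deviation per time step. The following is a minimal sketch that assumes a scalar output and a Gaussian output distribution; the function names are illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between true outputs and predicted means."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)

def gaussian_nll_per_step(y_true, mu, sigma):
    """Average negative log-likelihood of the data under N(mu_t, sigma_t^2)."""
    total = 0.0
    for y, m, s in zip(y_true, mu, sigma):
        total += 0.5 * math.log(2.0 * math.pi * s * s) + 0.5 * ((y - m) / s) ** 2
    return total / len(y_true)

# Made-up data: two predictions that miss by 3 and 4.
error = rmse([0.0, 0.0], [3.0, 4.0])
nll = gaussian_nll_per_step([0.0], [0.0], [1.0])
```

The NLL penalizes both overconfident and overly conservative standard deviations, which is why it complements the RMSE when the uncertainty estimate is of interest.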

IV-A Toy Problem: Linear Gaussian System

Consider the following linear system with process noise and measurement noise


The models are trained and validated with 2 000 samples and tested on the same 5 000 samples. The same number of layers in the NNs is taken for all models but with different number of neurons per layer. A grid search for the selection of the best architecture is performed with

and . Here is chosen due to the simplicity of the experiment. For all models the architecture with the lowest RMSE value is presented.

The deep SSMs are compared with two methods. First, a linear model is identified by SSEST [34] as a gray-box model with

states. SSEST also estimates the output variance, which is used as comparison. Second, the true system matrices as best possible linear model are run in open loop without noise.

The results are presented in Table I, where the models are sorted from simple to more complex. For the deep SSMs the values are averaged over 50 identified models and for the comparison methods over 500 identifications, since these methods are computationally less expensive. The results indicate that the deep SSMs can reach an accuracy close to that of state-of-the-art methods. Note that SSEST assumes a linear model, whereas the deep SSMs fit a flexible, nonlinear model. The table also shows that the more complex the model is, the more accurate the result. Note that no fine tuning was necessary to obtain these results. A plot with the mean and a confidence interval based on the standard deviation for the test data and for STORN is given in Fig. 4. The figure shows that the dynamics are captured by STORN as well as by SSEST. Furthermore, the uncertainty is captured well, but it is conservatively overestimated. Judging by the NLL values, the uncertainty estimate is also slightly more conservative than that of SSEST.

| Model | RMSE | NLL | Selected architecture (grid search) |
|---|---|---|---|
| VAE-RNN | 1.562 | 1.951 | (80, 10) |
| VRNN-Gauss-I | 1.477 | 1.817 | (50, 5) |
| VRNN-Gauss | 1.471 | 1.848 | (80, 2) |
| VRNN-GMM-I | 1.448 | 1.798 | (70, 10) |
| VRNN-GMM | 1.432 | 1.792 | (50, 5) |
| STORN | 1.427 | 1.785 | (60, 5) |
| SSEST [34] | 1.412 | 1.775 | - |
| True linear model (noise free) | 1.398 | - | - |

TABLE I: Results for linear Gaussian system toy problem.
Fig. 4: Toy problem: results of open loop run for test data and STORN given with mean. The blue shaded area depicts the output uncertainty of the test data and the red shaded area accordingly for STORN. For SSEST only the mean is given.

IV-B Narendra-Li Benchmark

The dynamics of the Narendra-Li benchmark are given by [36] with additional measurement noise according to [47]. The benchmark is designed to be highly nonlinear, but it does not represent a real physical system. For more details, see the appendix.

This benchmark is evaluated for a varying number of training samples ranging from 2 000 to 60 000. For each identification 5 000 validation samples and the same 5 000 test samples are used. To choose the architecture a grid search is performed. This revealed that, both for small and large training sample sizes, it is better to have larger networks. Hence, for comparability, all models are run with the same hyperparameter values. No batch normalization is applied.

The results are given in Fig. 5 and show averaged RMSE and NLL values over 30 estimated models for varying model and training data size. Generally, more training data yields more accurate estimates, both in terms of RMSE and NLL. Beyond a certain amount of training data, the identification results stop improving. This plateau indicates that the chosen model is saturated. Larger models could be flexible enough to decrease the values even further. In particular, STORN outperforms the other models, which all show similar performance. This is due to the enhanced flexibility of STORN with its second recurrent network in the inference network. Hence, more accurate state representations can be learned.

The lowest RMSE values of each model are compared in Table II with results from the literature. Note that the comparison methods do not estimate the uncertainty, hence no NLL can be given. Table II also includes the number of samples required to obtain the given performance. The table indicates that the deep SSMs in general require more samples for learning than classic models. Among the deep SSMs, STORN comes closest to the RMSE values reported in the literature, although a gap remains. Note that gray-box models from the literature are compared with deep SSMs as black-box models, which can explain the performance gap.

A time evaluation of an open-loop run comparing the true dynamics with the ones identified with STORN is given in Fig. 6. Mean values and standard deviations are shown. The figure highlights two points. First, the complex and nonlinear dynamics are identified well by the deep SSM. Second, the uncertainty bounds are captured but are much more conservative than the true bounds. This is in line with empirical results in [19, 37], which show that variational inference based Bayesian methods perform less accurately than, for example, ensembling based methods.

Fig. 5: Narendra-Li benchmark: RMSE and NLL for varying number of training data points. VRNN-Gauss-I and VRNN-GMM-I are shown with same color as the original method but in dashed lines.
| Ref. | Model | RMSE | NLL | Samples |
|---|---|---|---|---|
| | VAE-RNN | 0.841 | 1.341 | 50 000 |
| | VRNN-Gauss-I | 0.890 | 1.309 | 60 000 |
| | VRNN-Gauss | 0.851 | 1.284 | 30 000 |
| | VRNN-GMM-I | 0.869 | 1.289 | 20 000 |
| | VRNN-GMM | 0.869 | 1.300 | 50 000 |
| | STORN | 0.639 | 1.197 | 60 000 |
| [55] | Multivariate adaptive regression splines | 0.46 | - | 2 000 |
| | Adaptive hinging hyperplanes | 0.31 | - | 2 000 |
| [47] | Model-on-demand | 0.46 | - | 50 000 |
| [42] | Direct weight optimization | 0.43 | - | 50 000 |
| [49] | Basis function expansion | 0.06 | - | 2 000 |

TABLE II: Results for the Narendra-Li benchmark.
Fig. 6: Narendra-Li benchmark: Time evaluation of true system and STORN with uncertainties.

IV-C Wiener-Hammerstein Process Noise Benchmark

The Wiener-Hammerstein benchmark with process noise [45] provides measured input-output data from an electric circuit. The system can be described by a nonlinear Wiener-Hammerstein model which sandwiches a nonlinearity between two linear dynamic systems. Additional process noise enters before the nonlinearity, which makes the benchmark particularly difficult. The aim is to identify the behavior of the circuit for new input data.
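The block structure described above can be illustrated with a small simulation. This is a sketch under simplifying assumptions and not a model of the benchmark circuit: the first-order filters, their poles, the `tanh` nonlinearity and the noise level are hypothetical choices that only reproduce the linear, nonlinear, linear ordering with process noise entering before the nonlinearity.

```python
import math
import random

def first_order_filter(signal, pole):
    """Stable first-order low-pass: s_k = pole * s_{k-1} + (1 - pole) * x_k."""
    out, s = [], 0.0
    for x in signal:
        s = pole * s + (1.0 - pole) * x
        out.append(s)
    return out

def simulate_wiener_hammerstein(u_seq, noise_std=0.0, seed=0):
    """Linear block, additive process noise, static nonlinearity, second linear block."""
    rng = random.Random(seed)
    w = first_order_filter(u_seq, pole=0.5)                 # first linear dynamic block
    w_noisy = [x + rng.gauss(0.0, noise_std) for x in w]    # process noise before the nonlinearity
    v = [math.tanh(x) for x in w_noisy]                     # static nonlinearity
    return first_order_filter(v, pole=0.8)                  # second linear dynamic block

clean = simulate_wiener_hammerstein([1.0, 1.0])
disturbed = simulate_wiener_hammerstein([1.0, 1.0], noise_std=0.3, seed=2)
```

Because the noise passes through the nonlinearity and the second filter, its effect on the output is no longer additive and Gaussian, which is what makes this setting harder than one with measurement noise only.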

The training data consist of 8 192 samples where the input is a faded multisine realization. The validation data are taken from the same data set but for a different realization. The test data set consists of 16 384 samples, one multisine realization and one swept sine. Preliminary tests indicate that a longer training sequence length yields more accurate results, hence a length of 2 048 points is used. This benchmark is evaluated for varying sizes of the deep SSM layers, with the remaining hyperparameters held constant.

The resulting RMSE values for the multisine and swept sine test sequences are presented in Fig. 7. The lowest RMSE values are compared in Table III with state-of-the-art methods from the literature. The values are presented as averages over 20 identified models. The plot indicates that the influence of is rather limited. Larger values and therefore larger NNs in general tend to result in more accurate identification results. Again, STORN yields the best results among the deep SSMs, while the very simple VAE-RNN also identifies this complex benchmark well. The jagged behaviour of the plot may arise because the chosen identification data set consists of only two realizations. Therefore the randomness over the multiple identifications originates mainly from the random initialization of the weights in the NNs.

Fig. 7: Wiener-Hammerstein benchmark: RMSE of test sets for multisine and swept sine for different values of , fixed and .
| Ref. | Model | RMSE [swept sine] | RMSE [multisine] |
|---|---|---|---|
| | VAE-RNN | 0.0495 | 0.0587 |
| | VRNN-Gauss-I | 0.0763 | 0.0755 |
| | VRNN-Gauss | 0.0817 | 0.0785 |
| | VRNN-GMM-I | 0.0660 | 0.0669 |
| | VRNN-GMM | 0.0760 | 0.0736 |
| | STORN | 0.0338 | 0.0509 |
| [5] | NOBF | 0.2 | 0.3 |
| [5] | NFIR | 0.05 | 0.05 |
| [5] | NARX | 0.05 | 0.05 |
| [51] | PNLSS | 0.022 | 0.038 |
| [15] | Best Linear Approx. | - | 0.035 |
| [15] | ML | - | 0.0162 |
| [48] | SMC | 0.014 | 0.015 |

TABLE III: Results for Wiener-Hammerstein benchmark.

V Conclusion and Future Work

This paper provides an introduction to deep SSMs as an extension to classic SSMs using highly flexible NNs. The studied model class and the parameter learning method based on variational inference and ELBO maximization are elaborated. Six model instances are then applied to three system identification problems in order to benchmark the potential of these models. The results indicate that the class of deep SSMs can be a competitive alternative to classic identification methods. Note that deep SSMs are black-box models, which only require a few hyperparameters to be tuned. The models in this benchmark study are not fine tuned to obtain the presented results. Therefore, the toolbox of possible nonlinear system identification methods is extended by a new black-box model class based on deep learning. The studied models have the additional advantage of estimating the uncertainty in the system dynamics through their probabilistic nature. In the experiments the uncertainty bounds appear conservative: they are wider than the true bounds and slightly wider than those of the established method SSEST. This conservative behavior is in line with the existing literature on variational inference for deep learning models.

This study only concerns a subclass of deep SSMs, namely models based on variational inference learning methods. Future work should study a broader class of deep SSMs and consider more nonlinear system identification benchmarks. An interesting continuation is to study, for the linear toy problem, the performance of a one-step-ahead predictor model with the Kalman filter as ground truth. Similarly, for nonlinear systems a comparison with the particle filter as ground truth can be considered for the one-step-ahead predictor model. Finally, it is of interest to use deep SSMs in automatic control, for example in model predictive control, and to elaborate on how to exploit the latent state variables.


  • [1] A. G. Alias Parth Goyal, A. Sordoni, M. Côté, N. R. Ke, and Y. Bengio (2017) Z-Forcing: Training Stochastic Recurrent Networks. In Advances in Neural Information Processing Systems 30, pp. 6713–6723. Cited by: §I, §II-C.
  • [2] C. Andersson, A. H. Ribeiro, K. Tiels, N. Wahlström, and T. B. Schön (2019-12) Deep Convolutional Networks in System Identification. In Proceedings of the 58th IEEE Conference on Decision and Control, Nice, France, pp. . Cited by: 1st item.
  • [3] K. J. Åström and P. Eykhoff (1971-03) System identification—A survey. Automatica 7 (2), pp. 123–162 (en). External Links: ISSN 0005-1098, Document Cited by: §I.
  • [4] J. Bayer and C. Osendorfer (2015-03) Learning Stochastic Recurrent Networks. arXiv:1411.7610. Note: Comment: Submitted to conference track of ICLR 2015 External Links: 1411.7610 Cited by: §I, 3rd item.
  • [5] J. Belz, T. Münker, T. O. Heinz, G. Kampmann, and O. Nelles (2017-07) Automatic Modeling with Local Model Networks for Benchmark Processes. 20th IFAC World Congress 50 (1), pp. 470–475 (en). External Links: ISSN 2405-8963, Document Cited by: TABLE III.
  • [6] C. M. Bishop (2006) Pattern recognition and machine learning. Springer-Verlag, New York (en). External Links: ISBN 978-0-387-31073-2, LCCN Q327 .B52 2006 Cited by: §I.
  • [7] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2017) Variational Inference: A Review for Statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877. External Links: ISSN 0162-1459, 1537-274X, Document, 1601.00670 Cited by: §II-B.
  • [8] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio (2014-10) On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, pp. 103–111. External Links: Document Cited by: §II-A.
  • [9] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014-12) Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555. Note: Comment: Presented in NIPS 2014 Deep Learning and Representation Learning Workshop External Links: 1412.3555 Cited by: §II-A.
  • [10] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio (2015) A Recurrent Latent Variable Model for Sequential Data. In Advances in Neural Information Processing Systems 28, pp. 2980–2988. Cited by: §I, 1st item, 2nd item.
  • [11] L. Deng, G. Hinton, and B. Kingsbury (2013-05) New types of deep neural network learning for speech recognition and related applications: an overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8599–8603. External Links: ISSN 2379-190X, Document Cited by: §I.
  • [12] O. Fabius and J. R. van Amersfoort (2015-06) Variational Recurrent Auto-Encoders. arXiv:1412.6581. External Links: 1412.6581 Cited by: §I, §II-C.
  • [13] M. Fraccaro, S. K. Sønderby, U. Paquet, and O. Winther (2016-12) Sequential neural models with stochastic layers. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, pp. 2207–2215. External Links: ISBN 978-1-5108-3881-9 Cited by: §I, §II-C.
  • [14] M. Fraccaro (2018) Deep Latent Variable Models for Sequential Data. Ph.D. Thesis, DTU Compute. Note: PhD Thesis Cited by: §II-C, §II.
  • [15] G. Giordano and J. Sjöberg (2018) Maximum Likelihood identification of Wiener-Hammerstein system with process noise. 18th IFAC Symposium on System Identification SYSID 2018 51 (15), pp. 401–406 (en). External Links: ISSN 2405-8963, Document Cited by: TABLE III.
  • [16] K. Goel and R. Vohra (2014-12) Learning Temporal Dependencies in Data Using a DBN-BLSTM. arXiv:1412.6093. External Links: 1412.6093 Cited by: §I, §II-C.
  • [17] I. Goodfellow, Y. Bengio, and A. C. Courville (2016) Deep Learning. MIT Press. Cited by: §I, §II-A, §IV.
  • [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680. Cited by: §II.
  • [19] F. K. Gustafsson, M. Danelljan, and T. B. Schön (2020-01) Evaluating Scalable Bayesian Deep Learning Methods for Robust Computer Vision. arXiv:1906.01620. Note: Comment: Code is available at External Links: 1906.01620 Cited by: §IV-B.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034. Cited by: §I.
  • [21] S. Hochreiter and J. Schmidhuber (1997-11) Long Short-Term Memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667, Document Cited by: §II-A.
  • [22] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999-11) An Introduction to Variational Methods for Graphical Models. Machine Learning 37 (2), pp. 183–233. External Links: ISSN 0885-6125, Document Cited by: §II-B.
  • [23] D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, (ICLR), San Diego, CA, USA. Note: Comment: Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015 External Links: 1412.6980 Cited by: §IV.
  • [24] D. P. Kingma and M. Welling (2014) Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, Canada. External Links: 1312.6114 Cited by: §II-B, §II.
  • [25] N. R. Kristensen, H. Madsen, and S. B. Jørgensen (2004-02) Parameter estimation in stochastic grey-box models. Automatica 40 (2), pp. 225–237 (en). External Links: ISSN 0005-1098, Document Cited by: §I.
  • [26] M. Längkvist, L. Karlsson, and A. Loutfi (2014-06) A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters 42, pp. 11–24 (en). External Links: ISSN 0167-8655, Document Cited by: §I.
  • [27] Y. LeCun, Y. Bengio, and G. Hinton (2015-05) Deep learning. Nature 521 (7553), pp. 436–444 (en). External Links: ISSN 1476-4687, Document Cited by: §I.
  • [28] L. Li, J. Yan, X. Yang, and Y. Jin (2019-08) Learning Interpretable Deep State Space Model for Probabilistic Time Series Forecasting. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, China, pp. 2901–2908 (en). External Links: Document, ISBN 978-0-9992411-4-1 Cited by: §I.
  • [29] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez (2017-12) A survey on deep learning in medical image analysis. Medical Image Analysis 42, pp. 60–88 (en). External Links: ISSN 1361-8415, Document Cited by: §I.
  • [30] M. Liwicki and H. Bunke (2005-08) IAM-OnDB - an on-line English sentence database acquired from handwritten text on a whiteboard. In Eighth International Conference on Document Analysis and Recognition (ICDAR’05), pp. 956–961 Vol. 2. External Links: ISSN 2379-2140, Document Cited by: §I.
  • [31] L. Ljung (1978-10) Convergence analysis of parametric identification methods. IEEE Transactions on Automatic Control 23 (5), pp. 770–783. External Links: ISSN 2334-3303, Document Cited by: §I.
  • [32] L. Ljung, C. Andersson, K. Tiels, and T. B. Schön (2020) Deep Learning and System Identification. 21st IFAC World Congress, pp. 8 (en). Cited by: §I.
  • [33] L. Ljung (1987) System identification: theory for the user. Prentice Hall, Englewood Cliffs, NJ (English). External Links: ISBN 978-0-13-881640-7 Cited by: §I.
  • [34] L. Ljung (2018) System identification toolbox: The Manual. 9th edition, The MathWorks Inc., Natick, MA, USA. Cited by: §IV-A, TABLE I.
  • [35] K. P. Murphy (2012) Machine learning: a probabilistic perspective. MIT Press, Cambridge, MA (en). External Links: ISBN 978-0-262-01802-9, LCCN Q325.5 .M87 2012 Cited by: §I.
  • [36] K. S. Narendra and S.-M. Li (1996) Neural networks in control systems. P. Smolensky, M. C. Mozer, and D. E. Rumelhart (Eds.), pp. 347–394. Cited by: §-C, §IV-B, §IV.
  • [37] Y. Ovadia, E. Fertig, B. Lakshminarayanan, S. Nowozin, D. Sculley, J. Dillon, J. Ren, Z. Nado, and J. Snoek (2019) Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems 32, pp. 13991–14002. Cited by: §IV-B.
  • [38] K. Prahallad, A. Vadapalli, N. K. Elluru, G. V. Mantena, B. Pulugundla, P. Bhaskararao, H. A. Murthy, S. J. King, V. Karaiskos, and A. W. Black (2013) The Blizzard Challenge 2013 - Indian Language Tasks. Cited by: §I.
  • [39] L. Rabiner and B. Juang (1986-01) An introduction to hidden Markov models. IEEE ASSP Magazine 3 (1), pp. 4–16. External Links: ISSN 1558-1284, Document Cited by: §I.
  • [40] S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski (2018) Deep State Space Models for Time Series Forecasting. In Advances in Neural Information Processing Systems 31, pp. 7785–7794. Cited by: §I.
  • [41] D. J. Rezende, S. Mohamed, and D. Wierstra (2014-01) Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning, pp. 1278–1286 (en). External Links: ISSN 1938-7228 Cited by: §II-B, §II.
  • [42] J. Roll, A. Nazin, and L. Ljung (2005-03) Nonlinear system identification via direct weight optimization. Automatica 41 (3), pp. 475–490 (en). External Links: ISSN 0005-1098, Document Cited by: TABLE II.
  • [43] S. T. Roweis (1998) EM Algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems 10, pp. 626–632. Cited by: §II-B.
  • [44] T. B. Schön, A. Wills, and B. Ninness (2011-01) System identification of nonlinear state-space models. Automatica 47 (1), pp. 39–49 (en). External Links: ISSN 0005-1098, Document Cited by: §I.
  • [45] M. Schoukens and J. P. Noel (2017-07) Wiener-Hammerstein benchmark with process noise. 20th IFAC World Congress 50 (1), pp. 448–453 (en). Cited by: §IV-C, §IV.
  • [46] J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P. Glorennec, H. Hjalmarsson, and A. Juditsky (1995-12) Nonlinear black-box modeling in system identification: a unified overview. Automatica 31 (12), pp. 1691–1724 (en). External Links: ISSN 0005-1098, Document Cited by: §I.
  • [47] A. Stenman (1999) Model on demand: algorithms, analysis and applications. Linköping Studies in Science and Technology Dissertation, Univ, Linköping (en). External Links: ISBN 978-91-7219-450-2 Cited by: §-C, §IV-B, TABLE II.
  • [48] A. Svensson, T. B. Schön, and F. Lindsten (2018-05) Learning of state-space models with highly informative observations: a tempered Sequential Monte Carlo solution. Mechanical Systems and Signal Processing 104, pp. 915–928. External Links: ISSN 0888-3270, Document, 1702.01618 Cited by: TABLE III.
  • [49] A. Svensson and T. B. Schön (2017-06) A flexible state–space model for learning nonlinear dynamical systems. Automatica 80, pp. 189–199 (en). External Links: ISSN 0005-1098, Document Cited by: TABLE II.
  • [50] A. Tarantola (2005-01) Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM (en). External Links: ISBN 978-0-89871-792-1 Cited by: §I.
  • [51] K. Tiels (2016) A polynomial nonlinear state-space Matlab toolbox. In Workshop on Nonlinear System Identification Benchmarks, Brussels, Belgium, pp. 28 (en). Cited by: TABLE III.
  • [52] M. Verhaegen and V. Verdult (2007) Filtering and System Identification: A Least Squares Approach. Cambridge University Press (en). Cited by: §I.
  • [53] M. Verhaegen (1994-01) Identification of the deterministic part of MIMO state space models given in innovations form from input-output data. Automatica 30 (1), pp. 61–74 (en). External Links: ISSN 0005-1098, Document Cited by: §I.
  • [54] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018-06) High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 8798–8807 (en). External Links: Document, ISBN 978-1-5386-6420-9 Cited by: §II.
  • [55] J. Xu, X. Huang, and S. Wang (2008-01) Adaptive Hinging Hyperplanes. IFAC Proceedings Volumes 41 (2), pp. 4036–4041 (en). External Links: ISSN 1474-6670, Document Cited by: TABLE II.
  • [56] C. Zhang, J. Bütepage, H. Kjellström, and S. Mandt (2019-08) Advances in Variational Inference. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 2008–2026. External Links: ISSN 1939-3539, Document Cited by: §II-B.
  • [57] X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems 28, pp. 649–657. Cited by: §I.