Model structures and fitting criteria for system identification with neural networks

11/29/2019 ∙ by Marco Forgione, et al.

This paper focuses on the identification of dynamical systems with tailor-made model structures, where neural networks are used to approximate uncertain components and domain knowledge is retained, if available. These model structures are fitted to measured data using different criteria including a computationally efficient approach minimizing a regularized multi-step ahead simulation error. In this approach, the neural network parameters are estimated along with the initial conditions used to simulate the output signal in small-size subsequences. A regularization term is included in the fitting cost in order to enforce these initial conditions to be consistent with the estimated system dynamics. Pitfalls and limitations of naive one-step prediction and simulation error minimization are also discussed.




1 Introduction

In recent years, deep learning has advanced at a tremendous pace and is now the core methodology behind cutting-edge technologies such as speech recognition, image classification and captioning, language translation, and autonomous driving [Schmidhuber, 2015]. These impressive achievements are attracting ever-increasing investments from both the private and the public sector, fueling further research in this field.

A good deal of the advancement in the deep learning area is in the public domain, both in terms of scientific publications and software tools. Nowadays, highly optimized and user-friendly deep learning frameworks are available [Paszke et al., 2017], often distributed under permissive open-source licenses. Using the high-level functionalities of a deep learning framework and following good practice, even a novice user can deal with standard machine learning tasks (once considered extremely hard) such as image classification with moderate effort. Under the hood, the machine learning task is automatically transformed into a relevant optimization problem and subsequently solved through efficient numerical routines.

An experienced practitioner can employ the same deep learning framework at a lower level to tackle non-standard learning problems, by defining customized models and objective functions to be optimized, and using operators such as neural networks as building blocks. The practitioner is free from the burden of writing optimization code from scratch for every particular problem, which would be tedious and error-prone. In fact, as a built-in feature, modern deep learning engines can compute the derivatives of a supplied objective function with respect to free tunable parameters by implementing the celebrated back-propagation algorithm [Rumelhart et al., 1988]. In turn, this enables convenient setup of any gradient-based optimization method.

An exciting, challenging—and yet largely unexplored—application field is system identification with tailor-made model structures and fitting criteria. In this context, neural networks can be used to describe uncertain components of the dynamics, while retaining structural (physical) knowledge, if available. Furthermore, the fitting criterion can be specialized to take into account the modeler’s ultimate goal, which could be prediction, failure detection, state estimation, control design, simulation, etc.

The choice of the cost function may also be influenced by computational considerations. In this paper, in particular, models are evaluated according to their simulation performance. In this setting, from a theoretical perspective, simulation error minimization is generally the best fitting criterion. However, computing the simulation error loss and its derivatives may be prohibitively expensive from a computational perspective for dynamical models involving neural networks. We show that multi-step simulation error minimization over batches of small-size subsequences extracted from the identification dataset provides models with high simulation performance, while keeping the computational burden of the fitting procedure acceptable. In the proposed method, the neural network parameters are jointly estimated with the initial conditions used to simulate the system in each subsequence. A regularization term is also included in the fitting criterion in order to enforce all these initial conditions to be consistent with the estimated system dynamics.

The use of neural networks in system identification has a long history, see, e.g., [Werbos, 1989, Chen et al., 1990]. Even though motivated by similar reasoning, these earlier works are hardly comparable given the huge gap in hardware/software technology. More recently, a few interesting approaches using modern deep learning tools and concepts in system identification have been presented. For instance, [Masti and Bemporad, 2018] introduces a technique to identify neural state-space model structures using deep autoencoders for state reconstruction, while [Gonzalez and Yu, 2018, Wang, 2017] discuss the use of Long Short-Term Memory (LSTM) recurrent neural networks for system identification. Compared to these recent contributions, our work focuses on using specialized model structures for the identification task at hand. In the machine learning community, neural networks have also been recently applied for approximating the solution of ordinary and partial differential equations, see, e.g., [Chen et al., 2018, Raissi et al., 2019]. With respect to these contributions, our aim is to find computationally efficient fitting strategies that are robust to measurement noise.

The rest of this paper is structured as follows. The overall setting and problem statement are outlined in Section 2. The neural dynamical model structures are introduced in Section 3 and criteria for fitting these model structures to training data are described in Section 4. Simulation results are presented in Section 5 and can be replicated using the codes available in the on-line repository mentioned there. Conclusions and directions for future research are discussed in Section 6.

2 Problem Setting

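We are given a dataset $\mathcal{D}$ consisting of $N$ input samples $U = \{u_0, u_1, \dots, u_{N-1}\}$ and output samples $Y = \{y_0, y_1, \dots, y_{N-1}\}$, gathered from an experiment on a dynamical system $\mathcal{S}$. The data-generating system $\mathcal{S}$ is assumed to have the discrete-time state-space representation

$$x_{k+1} = f(x_k, u_k), \qquad y^o_k = g(x_k), \tag{1}$$

where $x_k$ is the state at time $k$; $y^o_k$ is the noise-free output; $u_k$ is the input; $f$ and $g$ are the state and output mappings, respectively. The measured output $y_k$ is corrupted by a zero-mean noise $\eta_k$, i.e., $y_k = y^o_k + \eta_k$.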

The aim of the paper is twofold:

  • to introduce flexible neural model structures that are suitable to represent generic dynamical systems such as (1), allowing the modeler to embed domain knowledge to various degrees and to exploit neural networks’ universal approximation capabilities (see [Hornik et al., 1989]) to describe unknown model components;

  • to present robust and computationally efficient procedures to fit these neural model structures to the training dataset $\mathcal{D}$.

2.1 Full model structure hypothesis

Let us consider a model structure $\mathcal{M} = \{M(\theta),\ \theta \in \mathbb{R}^{n_\theta}\}$, where $M(\theta)$ represents a dynamical model parametrized by a real-valued vector $\theta \in \mathbb{R}^{n_\theta}$. We refer to neural model structures as structures where some components of the model $M(\theta)$ are described by neural networks.

Throughout the paper, we will make the following full model structure hypothesis: there exists a parameter $\theta^o$ such that the model $M(\theta^o)$ is a perfect representation of the true system $\mathcal{S}$, i.e., for every input sequence, $M(\theta^o)$ and $\mathcal{S}$ provide the same output. We denote this condition as $\mathcal{S} = M(\theta^o)$. Note that the parameter $\theta^o$ may not be unique. Indeed, deep neural networks have multiple equivalent representations obtained, for instance, by permuting neurons in a hidden layer. Let $\Theta^o$ be the set of parameters that provide a perfect system description, namely $\Theta^o = \{\theta : \mathcal{S} = M(\theta)\}$. Under the full model structure hypothesis, the ultimate identification goal is to find a parameter $\theta^o \in \Theta^o$.

Remark 1

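In practice, fitting is performed on a finite-length dataset $\mathcal{D}$ covering a finite number of experimental conditions. To this end, let us introduce the notation $\mathcal{S} \overset{\mathcal{D}}{=} M(\theta)$, meaning that the model $M(\theta)$ perfectly matches $\mathcal{S}$ on the dataset $\mathcal{D}$, and let us define the parameter set $\Theta^o_{\mathcal{D}} = \{\theta : \mathcal{S} \overset{\mathcal{D}}{=} M(\theta)\}$. By definition, $\Theta^o \subseteq \Theta^o_{\mathcal{D}}$. Thus, when fitting the model structure to a finite-length dataset $\mathcal{D}$, we aim to find a parameter $\theta \in \Theta^o_{\mathcal{D}}$ (but not necessarily in $\Theta^o$).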

3 Neural model structures

In this section, we introduce possible neural model structures for dynamical systems.

3.1 State-space structures

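A general state-space neural model structure has the form

$$x_{k+1} = \mathcal{N}_f(x_k, u_k; \theta), \qquad \hat y_k = \mathcal{N}_g(x_k; \theta), \tag{2}$$

where $\mathcal{N}_f$ and $\mathcal{N}_g$ are feedforward neural networks of compatible size parametrized by $\theta$. Such a general structure can be tailored for the identification task at hand. Examples are reported in the following paragraphs.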

Residual model

If a linear approximation of the system is available, an appropriate model structure is

$$x_{k+1} = A x_k + B u_k + \mathcal{N}_f(x_k, u_k; \theta), \qquad \hat y_k = C x_k + \mathcal{N}_g(x_k; \theta), \tag{3}$$

where $A$, $B$, and $C$ are matrices of compatible dimensions describing the linear system approximation. Even though model (3) is not more general than (2), it could be easier to train, as the neural networks $\mathcal{N}_f$ and $\mathcal{N}_g$ are supposed to capture only residual (nonlinear) dynamics.

Integral model

When fitting data generated by a continuous-time system, the following neural model with an integral term in the state equation can be used to encourage continuity of the solution:

$$x_{k+1} = x_k + \mathcal{N}_f(x_k, u_k; \theta), \qquad \hat y_k = \mathcal{N}_g(x_k; \theta). \tag{4}$$

This structure can also be interpreted as the forward Euler discretization scheme applied to an underlying continuous-time state-space model.
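As a minimal sketch of the integral structure, the snippet below advances a state by adding an increment at each step. The function `euler_increment` is a hypothetical hand-written stand-in for the trained network (a forward Euler increment for a simple stable scalar system), chosen only so that the example runs without any deep learning library.

```python
def integral_step(x, u, increment):
    """One update x_{k+1} = x_k + increment(x_k, u_k) of an integral model."""
    dx = increment(x, u)
    return [xi + dxi for xi, dxi in zip(x, dx)]


def simulate(x0, inputs, increment):
    """Roll the integral model forward over an input sequence."""
    states = [list(x0)]
    for u in inputs:
        states.append(integral_step(states[-1], u, increment))
    return states


# Hypothetical stand-in for the network: forward Euler increment dt * f(x, u)
# for the scalar system dx/dt = -x + u (an assumption made for illustration).
DT = 0.1


def euler_increment(x, u):
    return [DT * (-x[0] + u)]


trajectory = simulate([0.0], [1.0] * 200, euler_increment)
```

With a constant unit input, the simulated state approaches the steady-state value 1, as expected for this stand-in system.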

Fully-observed state model

If the system state is known to be fully observed, an effective representation is

$$x_{k+1} = \mathcal{N}_f(x_k, u_k; \theta), \qquad \hat y_k = x_k, \tag{5}$$

where only the state mapping neural network $\mathcal{N}_f$ is learned, while the output mapping is fixed to identity.

Physics-based model

Special network structures can be used to embed prior physical knowledge. For instance, let us consider a two degree-of-freedom mechanical system (e.g., a cart-pole system) with state $x$ consisting of two measured positions $p_1$, $p_2$ and two corresponding unmeasured velocities $v_1$, $v_2$, driven by an external force $F$. A physics-based model for this system is

$$\dot p_1 = v_1, \qquad \dot v_1 = \mathcal{N}_1(x, F; \theta), \qquad \dot p_2 = v_2, \qquad \dot v_2 = \mathcal{N}_2(x, F; \theta), \tag{6}$$

where the integral dynamics for positions are fixed in the parametrization, while the velocity dynamics are modeled by the neural networks $\mathcal{N}_1$ and $\mathcal{N}_2$, possibly sharing some of their innermost layers. For discrete-time identification, (6) could be discretized through the numerical scheme of choice.

3.2 Input/output structures

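When limited or no system knowledge is available, the following input/output (IO) model structure may be used:

$$y^o_k = \mathcal{N}_{IO}(y^o_{k-1}, \dots, y^o_{k-n_a}, u_{k-1}, \dots, u_{k-n_b}; \theta), \tag{7}$$

where $n_b$ and $n_a$ denote the input and output lags, respectively, and $\mathcal{N}_{IO}$ is a neural network of compatible size. For an IO model, the state can be defined in terms of inputs and (noise-free) outputs $y^o$ at past time steps, i.e.,

$$x_k = [\,y^o_{k-1}\ \dots\ y^o_{k-n_a}\ \ u_{k-1}\ \dots\ u_{k-n_b}\,]^\top. \tag{8}$$

This state evolves simply by shifting previous inputs and outputs over time, and appending the latest samples, namely:

$$x_{k+1} = [\,y^o_k\ \ y^o_{k-1}\ \dots\ y^o_{k-n_a+1}\ \ u_k\ \ u_{k-1}\ \dots\ u_{k-n_b+1}\,]^\top = f_{IO}(x_k, u_k, y^o_k), \tag{9}$$

where the IO state update function $f_{IO}$ has been introduced for notational convenience.

The IO model structure only requires specifying the dynamical orders $n_a$ and $n_b$. If these values are not known a priori, they can be chosen through cross-validation.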
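The shift-and-append update of the IO state can be sketched in a few lines. The function name `io_state_update`, the list-based layout (past outputs first, newest sample first in each block) and the lag names `n_a` (outputs) and `n_b` (inputs) are assumptions made for this illustration; both lags are assumed to be at least 1.

```python
def io_state_update(x, u_new, y_new, n_a, n_b):
    """Shift the regressor [y_{k-1..k-n_a}, u_{k-1..k-n_b}] one step ahead.

    The newest output and input are placed first in their respective blocks
    and the oldest sample of each block is dropped.
    """
    past_y, past_u = x[:n_a], x[n_a:n_a + n_b]
    next_y = [y_new] + past_y[:-1]
    next_u = [u_new] + past_u[:-1]
    return next_y + next_u


# n_a = 2, n_b = 2: x_k = [y_{k-1}, y_{k-2}, u_{k-1}, u_{k-2}]
x_k = [3.0, 2.0, 0.3, 0.2]
x_next = io_state_update(x_k, u_new=0.4, y_new=4.0, n_a=2, n_b=2)
```

Here `x_next` holds `[y_k, y_{k-1}, u_k, u_{k-1}]`, i.e., the regressor needed to compute the next output.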

4 Training neural models

In this section, we present practical algorithms to fit the model structures introduced in Section 3 to the identification dataset $\mathcal{D}$. For the sake of illustration, algorithms are detailed for IO model structures (7). The extension to state-space structures is then discussed in Subsection 4.2.

4.1 Training I/O neural models

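The network parameters $\theta$ may be obtained by minimizing a cost function such as

$$J(\theta) = \frac{1}{N} \sum_{k=0}^{N-1} \left\| \hat y_k - y_k \right\|^2, \tag{10}$$

where $\hat y_k$ is the model estimate at time $k$ and $y_k$ is the measured output. For a dynamical model, different estimates $\hat y_k$ can be considered in the cost (10), as discussed in the following paragraphs.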

One-step prediction

The one-step error loss $J^{\mathrm{pred}}$ is constructed by plugging into (10), as estimate, the one-step prediction $\hat y_k = \mathcal{N}_{IO}(x_k; \theta)$, where the regressor $x_k$ is constructed as in (8), but using measured past outputs instead of the (unknown) noise-free outputs, i.e., $x_k = [\,y_{k-1}\ \dots\ y_{k-n_a}\ \ u_{k-1}\ \dots\ u_{k-n_b}\,]^\top$, with $n_a$ and $n_b$ the output and input lags.

The gradient of the cost function $J^{\mathrm{pred}}$ with respect to $\theta$ can be readily obtained through back-propagation using modern deep learning frameworks. This enables straightforward implementation of an iterative gradient-based optimization scheme to minimize $J^{\mathrm{pred}}$. The resulting one-step prediction error minimization algorithm can be executed very efficiently on modern hardware since all time steps can be processed independently and thus in parallel, exploiting multiple CPU/GPU cores.

For noise-free data, one-step prediction error minimization usually provides accurate results. Indeed, under the full model structure hypothesis, the minimum of $J^{\mathrm{pred}}$ is equal to $0$ and is achieved by all parameters $\theta$ for which the model perfectly matches the system on the dataset. However, for noisy output observations, the estimate $\hat y_k$ directly depends on the noise affecting past outputs through the regressor $x_k$. The situation is reminiscent of the AutoRegressive with Exogenous input (ARX) linear predictor defined as

$$\hat y_k = a_1 y_{k-1} + \dots + a_{n_a} y_{k-n_a} + b_1 u_{k-1} + \dots + b_{n_b} u_{k-n_b} \tag{11}$$

and thoroughly studied in classic system identification [Ljung, 1999]. The minimizer of the ARX prediction error is generally biased, unless very specific (and not particularly realistic) noise conditions are satisfied. Historically, the ARX predictor has been introduced for computational convenience (the resulting fitting problem can indeed be solved through linear least squares) rather than for its robustness to noise. In our numerical examples, we observed similar bias issues when fitting neural model structures by minimizing $J^{\mathrm{pred}}$ on noisy datasets.
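The bias of one-step (ARX-type) fitting under output noise can be reproduced on a toy linear example. The sketch below is not taken from the paper: it assumes a first-order system $y^o_k = a\,y^o_{k-1} + b\,u_{k-1}$ with made-up values $a = 0.8$, $b = 1$, white Gaussian input and white Gaussian measurement noise, and solves the least-squares problem through its 2x2 normal equations.

```python
import random


def fit_first_order_arx(u, y):
    """Least-squares fit of y_k = a*y_{k-1} + b*u_{k-1} via 2x2 normal equations."""
    syy = syu = suu = sy_t = su_t = 0.0
    for k in range(1, len(y)):
        yp, up, target = y[k - 1], u[k - 1], y[k]
        syy += yp * yp
        syu += yp * up
        suu += up * up
        sy_t += yp * target
        su_t += up * target
    det = syy * suu - syu * syu
    a = (sy_t * suu - su_t * syu) / det
    b = (su_t * syy - sy_t * syu) / det
    return a, b


rng = random.Random(0)
A_TRUE, B_TRUE, N = 0.8, 1.0, 5000  # illustrative values (assumptions)
u = [rng.gauss(0.0, 1.0) for _ in range(N)]
y_clean = [0.0]
for k in range(1, N):
    y_clean.append(A_TRUE * y_clean[-1] + B_TRUE * u[k - 1])
y_noisy = [yc + rng.gauss(0.0, 1.0) for yc in y_clean]

a_clean, b_clean = fit_first_order_arx(u, y_clean)  # recovers a, b
a_noisy, b_noisy = fit_first_order_arx(u, y_noisy)  # a is pulled toward zero
```

On noise-free data the fit recovers the true coefficients, whereas on noisy data the autoregressive coefficient is systematically underestimated, because the noisy past output acts as a regressor.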

Open-loop simulation

In classic system identification for linear systems, the Output Error (OE) predictor

$$\hat y_k = a_1 \hat y_{k-1} + \dots + a_{n_a} \hat y_{k-n_a} + b_1 u_{k-1} + \dots + b_{n_b} u_{k-n_b}, \tag{12}$$

defined recursively in terms of previous simulated outputs, provides an unbiased model estimate under the full model structure hypothesis, at the cost of a higher computational burden. In fact, minimizing the OE residual requires solving a nonlinear optimization problem.

Inspired by these classic system identification results, in the neural modeling context we expect better noise robustness by minimizing the simulation error cost $J^{\mathrm{sim}}$ obtained by using as estimate in (10) the open-loop simulated output $\hat y_k = \mathcal{N}_{IO}(\hat x_k; \theta)$, with $\hat x_k$ defined recursively in terms of previous simulated outputs as

$$\hat x_{k+1} = f_{IO}(\hat x_k, u_k, \hat y_k), \tag{13}$$

where $f_{IO}$ is the IO state update function of (9).

In principle, the cost function $J^{\mathrm{sim}}$ and its gradient w.r.t. $\theta$ can also be computed using a back-propagation algorithm, just as for the one-step prediction loss. However, from a computational perspective, simulating over time has an intrinsically sequential nature and offers scarce opportunity for parallelization. Furthermore, back-propagation through a temporal sequence, also known in the literature as Back-Propagation Through Time (BPTT), has a computational cost that grows linearly with the sequence length [Williams and Zipser, 1995]. In practice, as will be illustrated in our numerical examples, minimizing the simulation error with a gradient-based method over the entire identification dataset may be inconvenient from a computational perspective.
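The structural difference between one-step prediction and open-loop simulation can be made concrete with a toy first-order model. In this sketch `model` is a hypothetical stand-in for the network and the measurements are made-up numbers: prediction uses the measured past output at every step, whereas simulation feeds back its own previous estimate and is therefore inherently sequential.

```python
def one_step_predictions(model, u, y_meas):
    """Predict each y_k from the *measured* y_{k-1} (independent over k)."""
    return [model(y_meas[k - 1], u[k - 1]) for k in range(1, len(y_meas))]


def open_loop_simulation(model, u, y0, n_steps):
    """Simulate recursively: each estimate is fed back to compute the next one."""
    y_hat = [y0]
    for k in range(1, n_steps):
        y_hat.append(model(y_hat[-1], u[k - 1]))
    return y_hat


def model(y_prev, u_prev):
    """Hypothetical first-order model standing in for the neural network."""
    return 0.5 * y_prev + u_prev


u = [1.0, 1.0, 1.0, 1.0]
y_meas = [0.0, 1.2, 1.4, 1.9]  # made-up noisy measurements
pred = one_step_predictions(model, u, y_meas)
sim = open_loop_simulation(model, u, y0=y_meas[0], n_steps=len(y_meas))
```

The list comprehension in `one_step_predictions` has no dependence between iterations, while the loop in `open_loop_simulation` cannot be parallelized over time; the measurement noise only affects the former beyond the initial sample.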

Multi-step simulation

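A natural trade-off between full simulation and one-step prediction is simulation over subsequences of the dataset with length $m < N$. The multi-step simulation error minimization algorithm presented here processes batches containing $q$ subsequences extracted from $\mathcal{D}$ in parallel to enable efficient implementation.

A batch is completely specified by a batch start vector $s \in \mathbb{N}^q$ defining the initial time instant of each subsequence. Thus, for instance, the $j$-th output subsequence in a batch contains the measured output samples $\{y_{s_j}, y_{s_j+1}, \dots, y_{s_j+m-1}\}$, where $s_j$ is the $j$-th element of $s$. For notational convenience, let us arrange the batch output subsequences in a three-dimensional tensor $\mathbf{y} \in \mathbb{R}^{q \times m \times n_y}$ whose elements are $\mathbf{y}_{j,t} = y_{s_j+t}$, with batch index $j = 0, \dots, q-1$ and time index $t = 0, \dots, m-1$. Similarly, let us arrange the batch input subsequences in a tensor $\mathbf{u} \in \mathbb{R}^{q \times m \times n_u}$ whose elements are $\mathbf{u}_{j,t} = u_{s_j+t}$. Here $n_y$ and $n_u$ denote the output and input dimensions.

The $m$-step simulation $\hat{\mathbf{y}}$ for all subsequences has the same tensor structure as $\mathbf{y}$ and is defined as

$$\hat{\mathbf{y}}_{j,t} = \mathcal{N}_{IO}(\hat{\mathbf{x}}_{j,t}; \theta), \tag{14a}$$

where the regressor $\hat{\mathbf{x}}_{j,t}$ is recursively obtained as

$$\hat{\mathbf{x}}_{j,t+1} = f_{IO}(\hat{\mathbf{x}}_{j,t}, \mathbf{u}_{j,t}, \hat{\mathbf{y}}_{j,t}) \tag{14b}$$

for $t = 0, \dots, m-2$. The initial regressor $\hat{\mathbf{x}}_{j,0}$ of each subsequence may be constructed by plugging past input and output measurements into (8), i.e., $\hat{\mathbf{x}}_{j,0} = [\,y_{s_j-1}\ \dots\ y_{s_j-n_a}\ \ u_{s_j-1}\ \dots\ u_{s_j-n_b}\,]^\top$. In this way, the measurement noise enters in the $m$-step simulation only at the initial time step of the subsequences, and therefore its effect is less severe than in the one-step prediction case.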
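The construction of a batch from a vector of start indices can be sketched as follows. The helper names, the nested-list layout standing in for tensors, and the `first` argument (which leaves room for the past samples needed by the initial regressor) are assumptions made for this illustration.

```python
import random


def extract_batch(signal, starts, m):
    """Return one length-m subsequence of `signal` per start index."""
    return [signal[s:s + m] for s in starts]


def random_batch_starts(n_samples, q, m, first=0, rng=random):
    """Draw q start indices such that every subsequence fits in the dataset."""
    return [rng.randrange(first, n_samples - m + 1) for _ in range(q)]


y = [0.1 * k for k in range(20)]       # toy output sequence, N = 20
u = [float(k % 2) for k in range(20)]  # toy input sequence
starts = random_batch_starts(len(y), q=4, m=5, first=2, rng=random.Random(1))
batch_y = extract_batch(y, starts, m=5)  # q rows, m samples each
batch_u = extract_batch(u, starts, m=5)
```

Row `j` of `batch_y` starts at sample `starts[j]` of the dataset, so element `(j, t)` equals `y[starts[j] + t]`.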

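A basic multi-step simulation error approach consists in minimizing the cost:

$$J^{\mathrm{ms}}(\theta) = \frac{1}{qm} \sum_{j=0}^{q-1} \sum_{t=0}^{m-1} \left\| \hat{\mathbf{y}}_{j,t} - \mathbf{y}_{j,t} \right\|^2, \tag{15}$$

where $q$ is the number of subsequences in the batch, $m$ is their length, and $\hat{\mathbf{y}}_{j,t}$, $\mathbf{y}_{j,t}$ are the simulated and measured outputs of subsequence $j$ at time index $t$. Such an approach outperforms one-step prediction error minimization in the presence of measurement noise.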

In this paper, we further improve the basic multi-step simulation method by considering also the initial condition of the subsequences as free variables to be tuned, along with the network parameters $\theta$. Specifically, we introduce an optimization variable $\bar Y = \{\bar y_0, \bar y_1, \dots, \bar y_{N-1}\}$ with the same size and structure as the measured output sequence $Y$. The initial condition for the batch is constructed as $\hat{\mathbf{x}}_{j,0} = [\,\bar y_{s_j-1}\ \dots\ \bar y_{s_j-n_a}\ \ u_{s_j-1}\ \dots\ u_{s_j-n_b}\,]^\top$, with $j = 0, \dots, q-1$, where $s_j$ is the start index of the $j$-th subsequence. By considering such an initial condition, the measurement noise does not enter in the model simulation. Thus, as in pure open-loop simulation error minimization, bias issues are circumvented.

Since we are estimating the initial conditions in addition to the neural network parameters, a price is paid in terms of an increased variance of the estimated model. In order to mitigate this effect, the variable $\bar Y$ used to construct the initial conditions can be enforced to represent the unknown, noise-free system output and thus to be consistent with (7). To this end, we introduce a regularization term penalizing the distance between the simulated output $\hat{\mathbf{y}}$ and $\bar{\mathbf{y}}$, where $\bar{\mathbf{y}}$ is a tensor with the same structure as the batch output tensor $\mathbf{y}$, but containing samples from $\bar Y$, i.e., $\bar{\mathbf{y}}_{j,t} = \bar y_{s_j+t}$.

Algorithm 1 details the steps required to train a dynamical neural model by multi-step simulation error minimization with initial state estimation. In Step 1, the neural network parameters $\theta$ and the “hidden” output variable $\bar Y$ are initialized to (small) random numbers and to the measured output $Y$, respectively. Then, at each iteration of the gradient-based training algorithm, the following steps are executed. Firstly, the batch start vector $s$ is selected (Step 2.1). The indexes in $s$ may be either (pseudo)randomly generated, or chosen deterministically. (For an efficient use of the identification dataset $\mathcal{D}$, $s$ has to be chosen in such a way that all samples are visited during training.) Then, the tensors $\mathbf{y}$, $\mathbf{u}$, $\bar{\mathbf{y}}$, and $\hat{\mathbf{x}}_{:,0}$ are populated with the corresponding samples (Step 2.2). Subsequently, $m$-step model simulation is performed (Step 2.3) and the cost function $J$ to be minimized is computed (Step 2.4). The cost in (16) takes into account both the fitting criterion (thus, the distance between $\hat{\mathbf{y}}$ and $\mathbf{y}$) and a regularization term penalizing the distance between $\hat{\mathbf{y}}$ and $\bar{\mathbf{y}}$. Such a regularization term aims at enforcing consistency of the hidden output $\bar Y$ with the model structure (7). A weighting constant $\alpha$ balances the two objectives. Lastly, the gradients of the cost with respect to the optimization variables $\theta$, $\bar Y$ are obtained through BPTT (Step 2.5) and the optimization variables are updated via gradient descent with learning rate $\lambda$ (Step 2.6). Improved variants of gradient descent such as RMSprop or Adam [Kingma and Ba, 2014] can be alternatively adopted at Step 2.6.

Remark 2

The computational cost of BPTT in $m$-step simulation is proportional to the subsequence length $m$ (and not to the dataset length $N$, which is the case for open-loop simulation). Furthermore, processing of the $q$ subsequences can be carried out independently, and thus in parallel on current hardware and software which support parallel computing. For these reasons, running multi-step simulation error minimization with $m \ll N$ is significantly faster than pure open-loop simulation error minimization.

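Inputs: identification dataset $\mathcal{D}$; number of iterations $n$; batch size $q$; length of subsequences $m$; learning rate $\lambda$; weight $\alpha$.

  1. initialize the neural network parameters $\theta$ to a random vector and the hidden output $\bar Y$ to $Y$;

  2. for $i = 0, \dots, n-1$ do

    2.1. select batch start indexes vector $s \in \mathbb{N}^q$;

    2.2. define tensors $\mathbf{y}_{j,t} = y_{s_j+t}$, $\mathbf{u}_{j,t} = u_{s_j+t}$, $\bar{\mathbf{y}}_{j,t} = \bar y_{s_j+t}$, and $\hat{\mathbf{x}}_{j,0} = [\,\bar y_{s_j-1}\ \dots\ \bar y_{s_j-n_a}\ \ u_{s_j-1}\ \dots\ u_{s_j-n_b}\,]^\top$, for $j = 0, \dots, q-1$ and $t = 0, \dots, m-1$;

    2.3. simulate according to $\hat{\mathbf{y}}_{j,t} = \mathcal{N}_{IO}(\hat{\mathbf{x}}_{j,t}; \theta)$, $\ \hat{\mathbf{x}}_{j,t+1} = f_{IO}(\hat{\mathbf{x}}_{j,t}, \mathbf{u}_{j,t}, \hat{\mathbf{y}}_{j,t})$;

    2.4. compute the cost

$$J = \frac{1}{qm} \sum_{j=0}^{q-1} \sum_{t=0}^{m-1} \left\| \hat{\mathbf{y}}_{j,t} - \mathbf{y}_{j,t} \right\|^2 + \alpha\, \frac{1}{qm} \sum_{j=0}^{q-1} \sum_{t=0}^{m-1} \left\| \hat{\mathbf{y}}_{j,t} - \bar{\mathbf{y}}_{j,t} \right\|^2; \tag{16}$$

    2.5. evaluate the gradients $\nabla_\theta J$ and $\nabla_{\bar Y} J$ at the current values of $\theta$ and $\bar Y$;

    2.6. update optimization variables $\theta$ and $\bar Y$:

$$\theta \leftarrow \theta - \lambda \nabla_\theta J, \qquad \bar Y \leftarrow \bar Y - \lambda \nabla_{\bar Y} J. \tag{17}$$

  Output: neural network parameters $\theta$.

Algorithm 1 Multi-step simulation error minimization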
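The cost evaluated at Steps 2.3 and 2.4 can be sketched in plain Python for a scalar first-order IO model. This is an illustration under stated assumptions, not the authors' implementation: `model` is a hypothetical stand-in for the network, the fit and consistency terms are both averaged over the batch samples, and the gradient and update steps (2.5 and 2.6), which the paper delegates to automatic differentiation in a deep learning framework, are deliberately left out.

```python
def regularized_multistep_cost(model, u, y, y_hidden, starts, m, alpha):
    """Fit term + alpha * consistency term, each averaged over q*m samples.

    Assumes a first-order IO model y_k = model(y_{k-1}, u_{k-1}) and start
    indices s >= 1, so that the initial regressor [y_hidden[s-1], u[s-1]]
    exists. The measurements y never enter the simulation itself.
    """
    fit, consistency = 0.0, 0.0
    for s in starts:
        y_prev = y_hidden[s - 1]  # free initial condition, not a measurement
        for t in range(m):
            y_sim = model(y_prev, u[s + t - 1])
            fit += (y_sim - y[s + t]) ** 2
            consistency += (y_sim - y_hidden[s + t]) ** 2
            y_prev = y_sim  # feed the simulated output back
    n_terms = len(starts) * m
    return fit / n_terms + alpha * consistency / n_terms


def model(y_prev, u_prev):
    """Hypothetical stand-in for the neural network."""
    return 0.9 * y_prev + 0.1 * u_prev


u = [1.0 if k % 10 < 5 else -1.0 for k in range(50)]
y_clean = [0.0]
for k in range(1, 50):
    y_clean.append(model(y_clean[-1], u[k - 1]))

# Noise-free data and a hidden output equal to the data: the cost vanishes.
cost_perfect = regularized_multistep_cost(
    model, u, y_clean, y_clean, starts=[1, 20, 40], m=8, alpha=1.0)
# A hidden output inconsistent with the model makes both terms positive.
y_shifted = [v + 0.1 for v in y_clean]
cost_shifted = regularized_multistep_cost(
    model, u, y_clean, y_shifted, starts=[1, 20, 40], m=8, alpha=1.0)
```

The second evaluation shows the role of the regularization term: a hidden output that does not satisfy the model dynamics is penalized even when it is only used to set the initial conditions.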

4.2 Training state-space neural models

The fitting methods presented for the IO structure are applicable to the state-space structures introduced in Section 3.1, with the considerations discussed below.

  • For the fully observed state model structure (5), adaptation of the one-step prediction error minimization method is straightforward. Indeed, the noisy measured state is directly used as regressor to construct the predictor, i.e., $\hat y_{k+1} = \mathcal{N}_f(y_k, u_k; \theta)$.

  • For model structures where the state is not fully observed, one-step prediction error minimization is not directly applicable as a one-step ahead prediction cannot be constructed in terms of the available measurements.

  • Simulation error minimization is directly applicable to state-space structures without special modifications, provided that it is feasible from a computational perspective.

  • Algorithm 1 for multi-step simulation error minimization is also generally applicable for state-space model structures. Instead of the hidden output variable $\bar Y$, a hidden state variable $\bar X$ representing the (noise-free) state at each time step must be optimized along with the network parameters $\theta$ through gradient descent. However, if the state is not fully observed, $\bar X$ cannot be initialized directly with measurements as was done in the IO case. A convenient initialization of $\bar X$ to be used in gradient-based optimization can come from an initial state estimator, or from exploiting physical knowledge. For instance, for the mechanical system in (6), a possible initialization for velocities is obtained through numerical differentiation of the measured position outputs.
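The velocity initialization mentioned in the last item can be sketched with finite differences. The function name and the specific scheme (central differences in the interior, one-sided differences at the two ends) are assumptions; the paper only states that numerical differentiation of the measured positions is used.

```python
def estimate_velocity(position, dt):
    """Central differences inside the record, one-sided at the two ends."""
    n = len(position)
    if n < 2:
        raise ValueError("need at least two samples")
    vel = [0.0] * n
    vel[0] = (position[1] - position[0]) / dt
    vel[-1] = (position[-1] - position[-2]) / dt
    for k in range(1, n - 1):
        vel[k] = (position[k + 1] - position[k - 1]) / (2.0 * dt)
    return vel


dt = 0.01
position = [2.0 * k * dt for k in range(100)]  # toy data: constant velocity 2
velocity_init = estimate_velocity(position, dt)
```

Since differentiation amplifies measurement noise, such an estimate is only meant as a starting point for the hidden state, which is then refined by the optimization.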

5 Numerical Example

The fitting algorithms for the model structures presented in this paper are tested on a simulated nonlinear RLC circuit. All computations are carried out on a PC equipped with an Intel i5-7300U 2.60 GHz processor and 32 GB of RAM. The software implementation is based on the PyTorch Deep Learning Framework [Paszke et al., 2017]. All the codes implementing the methodologies discussed in the paper and required to reproduce the results are available in an on-line repository. Other examples concerning the identification of a Continuously Stirred Tank Reactor (CSTR) and a cart-pole system are available in the same repository.

5.1 System description

We consider the nonlinear RLC circuit in Fig. 1 (left).

Figure 1: Nonlinear series RLC circuit used in the example (left) and nonlinear dependence of the inductance on the inductor current (right).

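The circuit behavior is described by the continuous-time state-space equation

$$\begin{bmatrix} \dot v_C \\ \dot i_L \end{bmatrix} = \begin{bmatrix} 0 & \frac{1}{C} \\ -\frac{1}{L(i_L)} & -\frac{R}{L(i_L)} \end{bmatrix} \begin{bmatrix} v_C \\ i_L \end{bmatrix} + \begin{bmatrix} 0 \\ \frac{1}{L(i_L)} \end{bmatrix} v_{in}, \tag{18}$$

where $v_{in}$ is the input voltage; $v_C$ is the capacitor voltage; and $i_L$ is the inductor current. The resistance $R$ and the capacitance $C$ (the latter of the order of nF) are fixed, while the inductance $L$ depends on $i_L$ as shown in Fig. 1 (right). Specifically,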

with . The identification dataset is built by discretizing (18) using a th-order Runge-Kutta method with a fixed step and simulating the system for . $N$ samples are gathered. The input $v_{in}$ is filtered white noise with bandwidth and standard deviation . An independent validation dataset is generated using as input filtered white noise with bandwidth and standard deviation .
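A fixed-step Runge-Kutta discretization of a series RLC circuit can be sketched as below. Several things here are assumptions made for illustration only: the classical fourth-order scheme, the input held constant within each step, a constant inductance (the paper's inductance depends on the current), and the numerical values of `R`, `L`, `C` and `DT`, which are not the paper's.

```python
def rlc_derivative(state, v_in, R, L, C):
    """Series RLC dynamics: state = (capacitor voltage, inductor current)."""
    v_c, i_l = state
    dv_c = i_l / C
    di_l = (v_in - R * i_l - v_c) / L
    return (dv_c, di_l)


def rk4_step(f, state, v_in, dt):
    """One classical 4th-order Runge-Kutta step with the input held constant."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))

    k1 = f(state, v_in)
    k2 = f(shifted(state, k1, dt / 2), v_in)
    k3 = f(shifted(state, k2, dt / 2), v_in)
    k4 = f(shifted(state, k3, dt), v_in)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


# Illustrative values only (assumptions, not the paper's circuit parameters).
R, L, C, DT = 1.0, 1e-3, 1e-3, 1e-4


def circuit(state, v_in):
    return rlc_derivative(state, v_in, R, L, C)


state = (0.0, 0.0)
for _ in range(1000):  # unit step response over 0.1 s
    state = rk4_step(circuit, state, 1.0, DT)
```

For a constant input the capacitor voltage settles at the input value and the current decays to zero, which gives a quick sanity check of the discretization.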

The performance of the estimated models is assessed in terms of the $R^2$ index computed using the open-loop simulated model output. As reference, a second-order linear OE model estimated on noise-free data using the System Identification Toolbox [Ljung, 1988] achieves an $R^2$ index of for $v_C$ and for $i_L$ on the identification dataset, and for $v_C$ and for $i_L$ on the validation dataset.
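Assuming the performance index is the usual coefficient of determination $R^2 = 1 - \sum_k (y_k - \hat y_k)^2 / \sum_k (y_k - \bar y)^2$, with $\bar y$ the sample mean of the true output, it can be computed as in the following sketch (the function name is hypothetical).

```python
def r2_index(y_true, y_sim):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((yt - ys) ** 2 for yt, ys in zip(y_true, y_sim))
    ss_tot = sum((yt - mean) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


perfect = r2_index([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # exact match
baseline = r2_index([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # constant mean predictor
```

An index of 1 corresponds to a perfect match, while a model that always outputs the mean of the data scores 0.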

For the neural model structures, we consider the cases of (a) noise-free measurements of $v_C$ and $i_L$; (b) noisy measurements of $v_C$ and $i_L$; (c) noisy measurements of $v_C$ only.

5.2 Algorithm setup

For gradient-based optimization, the Adam optimizer is used at Step 2.6 of Algorithm 1 to update the network parameters $\theta$ and the hidden variable $\bar Y$ (or $\bar X$ for state-space structures) used to construct the initial conditions. The learning rate $\lambda$ is adjusted through a rough trial and error, with taking values in the range , while the number of iterations $n$ is chosen large enough to reach a plateau in the cost function. In Algorithm 1, the weight $\alpha$ is always set to . We tested different values for the sequence length $m$ and adjusted the batch size $q$ such that .

Figure 2: True output (black) and simulated output (red) obtained by the state-space model trained using the one-step prediction error minimization approach in the presence of noise.
Figure 3: True output (black) and simulated output (red) obtained by the state-space model trained using the $m$-step simulation error minimization approach in the presence of noise.

5.3 Results

(a) Noise-free measurements of $v_C$ and $i_L$  
Since in this case the system state is supposed to be measured, we use the fully-observed state model structure (5). The neural network modeling the state mapping has a sequential feedforward structure with three input units ($v_C$, $i_L$, and $v_{in}$); a hidden layer with 64 linear units followed by ReLU nonlinearity; and two linear output units (the two components of the state equation to be learned). Having a noise-free dataset, we expect good results from one-step prediction error minimization. Thus, we fit the model using this approach over iterations with learning rate . The time required to train the network is 114 seconds. The fitted model describes the system dynamics with high accuracy. On both the identification and the validation datasets, the model $R^2$ index in open-loop simulation is above for $v_C$ and for $i_L$.
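The network architecture described above (three inputs, 64 hidden units with ReLU, two linear outputs) can be sketched without any deep learning library. The helper names, the weight initialization range and the list-based layout are assumptions; in the paper's PyTorch setting the same structure would be expressed with standard linear and ReLU layers.

```python
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one linear layer."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b


def linear(layer, x):
    """Affine map W x + b for a layer stored as (weights, biases)."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def state_network(params, v_c, i_l, v_in):
    """3 inputs -> 64 hidden units with ReLU -> 2 linear outputs."""
    hidden = relu(linear(params["hidden"], [v_c, i_l, v_in]))
    return linear(params["output"], hidden)


rng = random.Random(0)
params = {"hidden": init_layer(3, 64, rng), "output": init_layer(64, 2, rng)}
n_weights = sum(len(w) * len(w[0]) + len(b) for w, b in params.values())
out = state_network(params, v_c=1.0, i_l=0.5, v_in=2.0)
```

Such a network has 3·64 + 64 + 64·2 + 2 = 386 tunable parameters, which gives an idea of the size of the optimization problem in this example.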

(b) Noisy measurements of $v_C$ and $i_L$  
We consider the same identification problem above, with observations of $v_C$ and $i_L$ corrupted by an additive white Gaussian noise with standard deviation and , respectively. This corresponds to a Signal-to-Noise Ratio (SNR) of 20 dB and 13 dB on $v_C$ and $i_L$, respectively. Results for the one-step prediction error minimization approach on the validation dataset are shown in Fig. 2. The $R^2$ index in validation drops to for $v_C$ and for $i_L$. It is evident that noise has a severe impact on the performance of the one-step method.

In the presence of noise, better performance is expected from a multi-step simulation error minimization approach. Thus, we fit the same neural state-space model structure using Algorithm 1 over iterations, with , and randomly extracted batches of subsequences, each of length . The results are in line with expectations. Indeed, we recover similar performance as one-step prediction error minimization in the noise-free case ($R^2$ index of on $v_C$ and on $i_L$ on both the identification and the validation datasets). Time trajectories of the output are reported in Fig. 3. For the sake of visualization, only a portion of the validation dataset is shown. The total run time of Algorithm 1 is 182 seconds, about 60% more than the one-step prediction error minimization method.

Open-loop simulation error minimization is also tested. This method yields the same performance as $m$-step simulation error minimization in terms of the $R^2$ index of the fitted model. However, it takes about two hours to execute the iterations required to reach a cost function plateau.

(c) Noisy measurements of $v_C$ only  
We consider the case where only the voltage $v_C$ is measured and corrupted by an additive white Gaussian noise with standard deviation  V. The IO model structure in (7) is used with two output lags and two input lags. The neural network is characterized by four input units (corresponding to the two previous values of $v_C$ and the two previous values of $v_{in}$); a hidden layer with 64 linear units followed by ReLU nonlinearity; and a linear output unit representing the output value $v_C$. As in case (b), the one-step prediction error minimization approach delivers unsatisfactory results due to the presence of measurement noise. Thus, we fit the model using the multi-step method described in Algorithm 1 over iterations, with , , and . The total runtime of the algorithm is 192 seconds. The $R^2$ index of the fitted model is above on both the identification and the validation dataset, thus even larger than the index achieved by the OE model estimated on noise-free data.

6 Conclusions and follow-up

In this paper, we have presented neural model structures and fitting criteria for the identification of dynamical systems. A custom method minimizing a regularized multi-step simulation error criterion has been proposed and compared with one-step prediction error and simulation error minimization.

The main strengths of the presented framework are its versatility to describe complex non-linear systems, thanks to the neural network flexibility; its robustness to the measurement noise, thanks to the multi-step simulation error criterion with initial condition estimation; and the possibility to exploit parallel computing to train the network and optimize the initial conditions, thanks to the division of the dataset into small-size subsequences.

Current and future research activities are devoted to: (i) the formulation of proper fitting criteria and optimization algorithms for direct learning of continuous-time systems and systems described by partial differential equations, without introducing numerical discretization; (ii) the development of computationally efficient algorithms for estimation and control based on the neural dynamical models.


The authors are grateful to Dr. Giuseppe Sorgioso for the fruitful discussions on the properties of the back-propagation through time algorithm.


  • S. Chen, S. Billings, and P. Grant (1990) Non-linear system identification using neural networks. International Journal of Control 51 (6), pp. 1191–1214.
  • T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud (2018) Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583.
  • J. Gonzalez and W. Yu (2018) Non-linear system modeling using LSTM neural networks. IFAC-PapersOnLine 51 (13), pp. 485–489.
  • K. Hornik, M. Stinchcombe, and H. White (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2 (5), pp. 359–366.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • L. Ljung (1988) System identification toolbox. The Matlab user’s guide.
  • L. Ljung (1999) System identification: theory for the user. 2nd edition, Prentice Hall PTR, Upper Saddle River, NJ, USA. ISBN 0-13-656695-2.
  • D. Masti and A. Bemporad (2018) Learning nonlinear state-space models using deep autoencoders. In 2018 IEEE Conference on Decision and Control (CDC), pp. 3862–3867.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
  • M. Raissi, P. Perdikaris, and G. E. Karniadakis (2019) Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378, pp. 686–707.
  • D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al. (1988) Learning representations by back-propagating errors. Cognitive Modeling 5 (3), pp. 1.
  • J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural Networks 61, pp. 85–117.
  • Y. Wang (2017) A new concept using LSTM neural networks for dynamic system identification. In 2017 American Control Conference (ACC), pp. 5324–5329.
  • P. J. Werbos (1989) Neural networks for control and system identification. In Proceedings of the 28th IEEE Conference on Decision and Control, pp. 260–265.
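  • R. J. Williams and D. Zipser (1995) Gradient-based learning algorithms for recurrent networks and their computational complexity. In Backpropagation: Theory, Architectures, and Applications.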