1 Introduction
In recent years, deep learning has advanced at a tremendous pace and is now the core methodology behind cutting-edge technologies such as speech recognition, image classification and captioning, language translation, and autonomous driving [Schmidhuber, 2015]. These impressive achievements are attracting ever-increasing investments, both from the private and the public sector, fueling further research in this field. A good deal of the advancement in the deep learning area is in the public domain, both in terms of scientific publications and software tools. Nowadays, highly optimized and user-friendly deep learning frameworks are available [Paszke et al., 2017]
, often distributed under permissive open-source licenses. Using the high-level functionalities of a deep learning framework and following good practice, even a novice user can deal with standard machine learning tasks (once considered extremely hard), such as image classification, with moderate effort. Under the hood, the machine learning task is automatically transformed into a relevant optimization problem and subsequently solved through efficient numerical routines.

An experienced practitioner can employ the same deep learning framework at a lower level to tackle non-standard learning problems, defining customized models and objective functions to be optimized, and using operators such as neural networks as building blocks. The practitioner is freed from the burden of writing optimization code from scratch for every particular problem, which would be tedious and error-prone. In fact, as a built-in feature, modern deep learning engines can compute the derivatives of a supplied objective function with respect to free tunable parameters by implementing the celebrated back-propagation algorithm [Rumelhart et al., 1988]. In turn, this enables a convenient setup of any gradient-based optimization method.
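As a toy illustration of this built-in differentiation capability (our own example, not taken from the paper), PyTorch computes the gradient of a custom objective with respect to tunable parameters with a single `backward()` call:

```python
import torch

# Hypothetical affine model y_hat = theta[0]*u + theta[1] with a tunable parameter vector.
theta = torch.tensor([0.5, -0.2], requires_grad=True)

u = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([1.1, 1.9, 3.2])

y_hat = theta[0] * u + theta[1]          # model output
loss = torch.mean((y - y_hat) ** 2)      # custom objective (mean squared error)

loss.backward()                          # back-propagation fills theta.grad
# theta.grad now holds d(loss)/d(theta), computed automatically
```

Any gradient-based optimizer can then consume `theta.grad`; no derivative code is written by hand.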
An exciting, challenging, and yet largely unexplored application field is system identification with tailor-made model structures and fitting criteria. In this context, neural networks can be used to describe uncertain components of the dynamics, while retaining structural (physical) knowledge, if available. Furthermore, the fitting criterion can be specialized to take into account the modeler's ultimate goal, which could be prediction, failure detection, state estimation, control design, simulation, etc.
The choice of the cost function may also be influenced by computational considerations. In this paper, in particular, models are evaluated according to their simulation performance. In this setting, from a theoretical perspective, simulation error minimization is generally the best fitting criterion. However, computing the simulation error loss and its derivatives may be prohibitively expensive for dynamical models involving neural networks. We show that multi-step simulation error minimization over batches of small-size subsequences extracted from the identification dataset provides models with high simulation performance, while keeping the computational burden of the fitting procedure acceptable. In the proposed method, the neural network parameters are jointly estimated with the initial conditions used to simulate the system in each subsequence. A regularization term is also included in the fitting criterion in order to enforce all these initial conditions to be consistent with the estimated system dynamics.
The use of neural networks in system identification has a long history; see, e.g., [Werbos, 1989, Chen et al., 1990]. Even though motivated by similar reasoning, these earlier works are hardly comparable, given the huge gap in hardware/software technology. More recently, a few interesting approaches using modern deep learning tools and concepts in system identification have been presented. For instance, [Masti and Bemporad, 2018] introduces a technique to identify neural state-space model structures using deep autoencoders for state reconstruction, while [Gonzalez and Yu, 2018, Wang, 2017] discuss the use of Long Short-Term Memory (LSTM) recurrent neural networks for system identification. Compared to these recent contributions, our work focuses on using specialized model structures for the identification task at hand. In the machine learning community, neural networks have also recently been applied to approximate the solution of ordinary and partial differential equations; see, e.g., [Chen et al., 2018, Raissi et al., 2019]. With respect to these contributions, our aim is to find computationally efficient fitting strategies that are robust to measurement noise.

The rest of this paper is structured as follows. The overall setting and problem statement are outlined in Section 2. The neural dynamical model structures are introduced in Section 3, and criteria for fitting these model structures to training data are described in Section 4. Simulation results are presented in Section 5 and can be replicated using the code available at https://github.com/forgi86/sysid-neural-structures-fitting. Conclusions and directions for future research are discussed in Section 6.
2 Problem Setting
We are given a dataset D = {u(t), y(t)}, t = 0, 1, ..., N−1, consisting of input samples u(t) and output samples y(t), gathered from an experiment on a dynamical system S. The data-generating system S is assumed to have the discrete-time state-space representation

(1a)  x(t + 1) = f(x(t), u(t))
(1b)  y°(t) = g(x(t))

where x(t) is the state at time t; y°(t) is the noise-free output; u(t) is the input; and f and g are the state and output mappings, respectively. The measured output y(t) is corrupted by a zero-mean noise e(t), i.e., y(t) = y°(t) + e(t).
The aim of the paper is twofold:

- to introduce flexible neural model structures that are suitable to represent generic dynamical systems such as (1), allowing the modeler to embed domain knowledge to various degrees and to exploit neural networks' universal approximation capabilities (see [Hornik et al., 1989]) to describe unknown model components;

- to present robust and computationally efficient procedures to fit these neural model structures to the training dataset D.
2.1 Full model structure hypothesis
Let us consider a model structure M(θ), where M(θ) represents a dynamical model parametrized by a real-valued vector θ. We refer to neural model structures as structures where some components of the model are described by neural networks.

Throughout the paper, we will make the following full model structure hypothesis: there exists a parameter θ_o such that the model M(θ_o) is a perfect representation of the true system S, i.e., for every input sequence, S and M(θ_o) provide the same output. We denote this condition as M(θ_o) = S. Note that the parameter θ_o may not be unique. Indeed, deep neural networks have multiple equivalent representations obtained, for instance, by permuting neurons in a hidden layer. Let Θ = {θ : M(θ) = S} be the set of parameters that provide a perfect system description. Under the full model structure hypothesis, the ultimate identification goal is to find a parameter θ ∈ Θ.

Remark 1
In practice, fitting is performed on a finite-length dataset D covering a finite number of experimental conditions. To this aim, let us introduce the notation M(θ) =_D S, meaning that the model M(θ) perfectly matches S on the dataset D, and let us define the parameter set Θ_D = {θ : M(θ) =_D S}. By definition, Θ ⊆ Θ_D. Thus, when fitting the model structure to a finite-length dataset D, we aim to find a parameter θ ∈ Θ_D (but not necessarily in Θ).
3 Neural model structures
In this section, we introduce possible neural model structures for dynamical systems.
3.1 Statespace structures
A general state-space neural model structure has the form

(2a)  x(t + 1) = N_f(x(t), u(t); θ)
(2b)  y(t) = N_g(x(t); θ)

where N_f and N_g are feedforward neural networks of compatible size, parametrized by θ. Such a general structure can be tailored to the identification task at hand. Examples are reported in the following paragraphs.
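A structure of the kind in (2) can be sketched in a few lines of PyTorch; the class name, layer sizes, and activation below are our own illustrative choices, not a prescription from the paper:

```python
import torch
import torch.nn as nn

# Sketch of the general state-space structure (2): two feedforward networks,
# N_f for the state mapping and N_g for the output mapping.
class StateSpaceModel(nn.Module):
    def __init__(self, n_x, n_u, n_y, hidden=64):
        super().__init__()
        # N_f: x(t+1) = N_f(x(t), u(t); theta)
        self.N_f = nn.Sequential(nn.Linear(n_x + n_u, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_x))
        # N_g: y(t) = N_g(x(t); theta)
        self.N_g = nn.Sequential(nn.Linear(n_x, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_y))

    def forward(self, x, u):
        x_next = self.N_f(torch.cat((x, u), dim=-1))
        y = self.N_g(x)
        return x_next, y
```

Both mappings share the parameter vector θ (the union of the two networks' weights), and either network can be replaced or constrained to obtain the specialized structures below.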
Residual model
If a linear approximation of the system is available, an appropriate model structure is

(3a)  x(t + 1) = A x(t) + B u(t) + N_f(x(t), u(t); θ)
(3b)  y(t) = C x(t) + N_g(x(t); θ)

where A, B, and C are matrices of compatible dimensions describing the linear system approximation. Even though model (3) is not more general than (2), it could be easier to train, as the neural networks N_f and N_g are supposed to capture only residual (nonlinear) dynamics.
Integral model
When fitting data generated by a continuous-time system, the following neural model with an integral term in the state equation can be used to encourage continuity of the solution:

(4a)  x(t + 1) = x(t) + N_f(x(t), u(t); θ)
(4b)  y(t) = N_g(x(t); θ)

This structure can also be interpreted as the forward Euler discretization scheme applied to an underlying continuous-time state-space model.
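A minimal sketch of the integral structure (4) in PyTorch; the layer sizes and the Tanh activation are our own assumptions:

```python
import torch
import torch.nn as nn

# Integral (forward Euler) state update: x(t+1) = x(t) + N_f(x(t), u(t)).
class IntegralStateUpdate(nn.Module):
    def __init__(self, n_x, n_u, hidden=32):
        super().__init__()
        self.N_f = nn.Sequential(nn.Linear(n_x + n_u, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_x))

    def forward(self, x, u):
        # The skip connection makes the network model a state *increment*,
        # encouraging continuity of the state trajectory.
        return x + self.N_f(torch.cat((x, u), dim=-1))
```

With the network output initialized near zero, the model starts close to the identity map x(t+1) ≈ x(t), consistent with a small integration step.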
Fullyobserved state model
If the system state is known to be fully observed, an effective representation is

(5a)  x(t + 1) = N_f(x(t), u(t); θ)
(5b)  y(t) = x(t)

where only the state mapping neural network N_f is learned, while the output mapping is fixed to the identity.
Physicsbased model
Special network structures could be used to embed prior physical knowledge. For instance, let us consider a two degree-of-freedom mechanical system (e.g., a cart-pole system) with state consisting of two measured positions p_1, p_2 and two corresponding unmeasured velocities v_1, v_2, driven by an external force u. A physics-based model for this system is

(6)  dp_1/dt = v_1
     dp_2/dt = v_2
     dv_1/dt = N_v1(p_1, p_2, v_1, v_2, u; θ)
     dv_2/dt = N_v2(p_1, p_2, v_1, v_2, u; θ)

where the integral dynamics for the positions are fixed in the parametrization, while the velocity dynamics are modeled by neural networks, possibly sharing some of their innermost layers. For discrete-time identification, (6) could be discretized through the numerical scheme of choice.
3.2 Input/output structures
When limited or no system knowledge is available, the following input/output (IO) model structure may be used:

(7)  y°(t) = N(y°(t − 1), ..., y°(t − n_a), u(t − 1), ..., u(t − n_b); θ)

where n_a and n_b denote the output and input lags, respectively, and N is a neural network of compatible size. For an IO model, the state can be defined in terms of inputs and (noise-free) outputs at past time steps, i.e.,

(8)  x(t) = [y°(t − 1), ..., y°(t − n_a), u(t − 1), ..., u(t − n_b)]

This state evolves simply by shifting previous inputs and outputs over time and appending the latest samples, namely:

(9)  x(t + 1) = F(x(t), y°(t), u(t))

where the IO state update function F has been introduced for notational convenience.

The IO model structure only requires specifying the dynamical orders n_a and n_b. If these values are not known a priori, they can be chosen through cross-validation.
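The shift update (9) can be illustrated with a small NumPy sketch; the function name and the regressor ordering (newest samples first) are our own conventions:

```python
import numpy as np

# IO state (8) and shift update (9):
# x(t) = [y(t-1), ..., y(t-n_a), u(t-1), ..., u(t-n_b)]
def io_state_update(x, y_new, u_new, n_a, n_b):
    y_part, u_part = x[:n_a], x[n_a:]
    # shift in the newest output and input, drop the oldest samples
    y_part = np.concatenate(([y_new], y_part[:-1]))
    u_part = np.concatenate(([u_new], u_part[:-1]))
    return np.concatenate((y_part, u_part))

# n_a = n_b = 2: x = [y(t-1)=9, y(t-2)=8, u(t-1)=1, u(t-2)=2]
x = np.array([9.0, 8.0, 1.0, 2.0])
x_next = io_state_update(x, y_new=10.0, u_new=3.0, n_a=2, n_b=2)
# x_next is [10., 9., 3., 1.]
```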
4 Training neural models
In this section, we present practical algorithms to fit the model structures introduced in Section 3 to the identification dataset D. For the sake of illustration, the algorithms are detailed for the IO model structure (7). The extension to state-space structures is then discussed in Subsection 4.2.
4.1 Training I/O neural models
The network parameters θ may be obtained by minimizing a cost function such as

(10)  J(θ) = (1/N) Σ_{t=0}^{N−1} ‖y(t) − ŷ(t)‖²

where ŷ(t) is the model estimate at time t. For a dynamical model, different estimates can be considered in the cost (10), as discussed in the following paragraphs.
One-step prediction
The one-step error loss J_pred(θ) is constructed by plugging into (10), as estimate ŷ(t), the one-step prediction N(x̃(t); θ), where the regressor x̃(t) is constructed as in (8), but using measured past outputs instead of the (unknown) noise-free outputs, i.e., x̃(t) = [y(t − 1), ..., y(t − n_a), u(t − 1), ..., u(t − n_b)].
The gradient of the one-step cost function with respect to θ can be readily obtained through back-propagation using modern deep learning frameworks. This enables a straightforward implementation of an iterative gradient-based optimization scheme to minimize it. The resulting one-step prediction error minimization algorithm can be executed very efficiently on modern hardware, since all time steps can be processed independently and thus in parallel, exploiting multiple CPU/GPU cores.
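A hypothetical sketch of this parallel one-step fitting, for a scalar IO model with lags of one, on toy synthetic data; all sizes and optimizer settings are our own choices:

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
np.random.seed(0)

# One-step prediction error minimization for an IO structure with n_a = n_b = 1.
net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

# toy data from a made-up linear system: y(t) = 0.8*y(t-1) + 0.5*u(t-1)
u = np.random.randn(500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * y[t - 1] + 0.5 * u[t - 1]

# All one-step regressors [y(t-1), u(t-1)] are built up front, so every time
# step is processed in parallel within a single forward pass.
phi = torch.tensor(np.stack((y[:-1], u[:-1]), axis=1), dtype=torch.float32)
target = torch.tensor(y[1:], dtype=torch.float32).unsqueeze(1)

with torch.no_grad():
    loss_init = torch.mean((target - net(phi)) ** 2).item()

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = torch.mean((target - net(phi)) ** 2)   # cost (10) with one-step estimates
    loss.backward()
    opt.step()
```

No recursion over time appears in the loop body, which is what makes this criterion cheap to minimize.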
For noise-free data, one-step prediction error minimization usually provides accurate results. Indeed, under the full model structure hypothesis, the minimum of the one-step cost is zero and is achieved by all parameters that perfectly describe the system on the dataset. However, for noisy output observations, the estimate directly depends on the noise affecting past outputs through the regressor x̃(t). The situation is reminiscent of the AutoRegressive with eXogenous input (ARX) linear predictor defined as

(11)  ŷ(t | θ) = a_1 y(t − 1) + ... + a_{n_a} y(t − n_a) + b_1 u(t − 1) + ... + b_{n_b} u(t − n_b)

and thoroughly studied in classic system identification [Ljung, 1999]. The minimizer of the ARX prediction error is generally biased, unless very specific (and not particularly realistic) noise conditions are satisfied. Historically, the ARX predictor has been introduced for computational convenience (the resulting fitting problem can indeed be solved through linear least squares) rather than for its robustness to noise. In our numerical examples, we observed similar bias issues when fitting neural model structures by minimizing the one-step prediction error on noisy datasets.
Openloop simulation
In classic system identification for linear systems, the Output Error (OE) predictor

(12)  ŷ(t | θ) = b_1 u(t − 1) + ... + b_{n_b} u(t − n_b) − f_1 ŷ(t − 1 | θ) − ... − f_{n_a} ŷ(t − n_a | θ)

defined recursively in terms of previous simulated outputs, provides an unbiased model estimate under the full model structure hypothesis, at the cost of a higher computational burden. In fact, minimizing the OE residual requires solving a nonlinear optimization problem.

Inspired by these classic system identification results, in the neural modeling context we expect better noise robustness by minimizing the simulation error cost obtained by using as estimate in (10) the open-loop simulated output ŷ_sim(t), defined recursively in terms of previous simulated outputs as

(13)  ŷ_sim(t) = N(ŷ_sim(t − 1), ..., ŷ_sim(t − n_a), u(t − 1), ..., u(t − n_b); θ)

In principle, the simulation error cost and its gradient w.r.t. θ can also be computed using a back-propagation algorithm, just as for the one-step cost. However, from a computational perspective, simulating over time has an intrinsically sequential nature and offers scarce opportunity for parallelization. Furthermore, back-propagation through a temporal sequence, also known in the literature as Back-Propagation Through Time (BPTT), has a computational cost that grows linearly with the sequence length [Williams and Zipser, 1995]. In practice, as illustrated in our numerical examples, minimizing the simulation error with a gradient-based method over the entire identification dataset may be inconvenient from a computational perspective.
Multi-step simulation
A natural trade-off between full simulation and one-step prediction is simulation over subsequences of the dataset of length m ≤ N. The multi-step simulation error minimization algorithm presented here processes batches containing q subsequences extracted from D in parallel, to enable an efficient implementation.

A batch is completely specified by a batch start vector b defining the initial time instant of each subsequence. Thus, for instance, the j-th output subsequence in a batch contains the measured output samples y(b_j), y(b_j + 1), ..., y(b_j + m − 1), where b_j is the j-th element of b. For notational convenience, let us arrange the batch output subsequences in a three-dimensional tensor Y (batch index, time index, output channel) whose elements are Y[j, t] = y(b_j + t), with batch index j = 1, ..., q and time index t = 0, ..., m − 1. Similarly, let us arrange the batch input subsequences in a tensor U with elements U[j, t] = u(b_j + t).

The m-step simulation Ŷ for all subsequences has the same tensor structure as Y and is defined as

(14a)  Ŷ[j, t] = N(X̂[j, t]; θ)

where the regressor X̂[j, t] is recursively obtained as

(14b)  X̂[j, t + 1] = F(X̂[j, t], Ŷ[j, t], U[j, t])

for t = 0, ..., m − 1. The initial regressor X̂[j, 0] of each subsequence may be constructed by plugging past input and output measurements into (8), i.e., X̂[j, 0] = [y(b_j − 1), ..., y(b_j − n_a), u(b_j − 1), ..., u(b_j − n_b)]. In this way, the measurement noise enters the m-step simulation only at the initial time step of each subsequence, and therefore its effect is less severe than in the one-step prediction case.
A basic multi-step simulation error approach consists in minimizing the cost

(15)  J_m(θ) = (1 / (q m)) Σ_{j=1}^{q} Σ_{t=0}^{m−1} ‖Y[j, t] − Ŷ[j, t]‖²

Such an approach outperforms one-step prediction error minimization in the presence of measurement noise.
In this paper, we further improve the basic multi-step simulation method by considering also the initial conditions of the subsequences as free variables to be tuned, along with the network parameters θ. Specifically, we introduce an optimization variable ȳ with the same size and structure as the measured output sequence y. The initial condition for the batch is then constructed as X̄[j, 0] = [ȳ(b_j − 1), ..., ȳ(b_j − n_a), u(b_j − 1), ..., u(b_j − n_b)]. By considering such an initial condition, the measurement noise does not enter the model simulation. Thus, as in pure open-loop simulation error minimization, bias issues are circumvented.

Since we are estimating the initial conditions in addition to the neural network parameters, a price is paid in terms of an increased variance of the estimated model. In order to mitigate this effect, the variable ȳ used to construct the initial conditions can be enforced to represent the unknown, noise-free system output, and thus to be consistent with (7). To this aim, we introduce a regularization term penalizing the distance between Ŷ and Ȳ, where Ȳ is a tensor with the same structure as Y, but containing samples from ȳ, i.e., Ȳ[j, t] = ȳ(b_j + t).

Algorithm 1 details the steps required to train a dynamical neural model by multi-step simulation error minimization with initial state estimation. In Step 1, the neural network parameters θ and the "hidden" output variable ȳ are initialized to (small) random numbers and to the measured output y, respectively. Then, at each iteration of the gradient-based training algorithm, the following steps are executed. Firstly, the batch start vector b is selected (Step 2.1). The indexes in b may be either (pseudo)randomly generated or chosen deterministically. (For an efficient use of the identification dataset D, b has to be chosen in such a way that all samples are visited during training.) Then, the tensors Y, U, and Ȳ, together with the initial regressors X̄[·, 0], are populated with the corresponding samples of y, u, and ȳ (Step 2.2). Subsequently, m-step model simulation is performed (Step 2.3) and the cost function to be minimized is computed (Step 2.4). The cost in (16) takes into account both the fitting criterion (thus, the distance between Y and Ŷ) and a regularization term penalizing the distance between Ȳ and Ŷ. Such a regularization term aims at enforcing consistency of the hidden output ȳ with the model structure (7). A weighting constant α balances the two objectives. Lastly, the gradients of the cost with respect to the optimization variables θ and ȳ are obtained through BPTT (Step 2.5), and the optimization variables are updated via gradient descent with learning rate λ (Step 2.6). Improved variants of gradient descent such as RMSprop or Adam [Kingma and Ba, 2014] can alternatively be adopted at Step 2.6.
Remark 2
The computational cost of BPTT in m-step simulation is proportional to m (and not to N, as is the case for open-loop simulation). Furthermore, the processing of the q subsequences can be carried out independently, and thus in parallel on current hardware and software supporting parallel computing. For these reasons, running multi-step simulation error minimization with m ≪ N is significantly faster than pure open-loop simulation error minimization.
Inputs: identification dataset D; number of iterations n; batch size q; length of subsequences m; learning rate λ; regularization weight α.

1. Initialize the neural network parameters θ to a random vector and the hidden output ȳ to the measured output y.

2. For i = 1, ..., n:

   2.1. select the batch start index vector b;

   2.2. define the tensors Y, U, and Ȳ, and the initial regressors X̄[·, 0], from the samples of y, u, and ȳ;

   2.3. simulate Ŷ according to (14), for t = 0, ..., m − 1;

   2.4. compute the cost

        (16)  J(θ, ȳ) = (1 / (q m)) Σ_{j,t} ‖Y[j, t] − Ŷ[j, t]‖² + α (1 / (q m)) Σ_{j,t} ‖Ȳ[j, t] − Ŷ[j, t]‖²

   2.5. evaluate the gradients ∇_θ J and ∇_ȳ J at the current values of θ and ȳ;

   2.6. update the optimization variables:

        (17)  θ ← θ − λ ∇_θ J,    ȳ ← ȳ − λ ∇_ȳ J

Output: neural network parameters θ̂.
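Algorithm 1 can be sketched in PyTorch; the following is our own minimal reimplementation for a scalar IO model with n_a = n_b = 1 on synthetic data, where the toy system, all sizes, and rates are illustrative assumptions, not the paper's settings:

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
np.random.seed(0)

# Toy data from a made-up first-order system, with measurement noise.
N_data, q, m, alpha = 1000, 16, 32, 1.0
u = np.random.randn(N_data)
y = np.zeros(N_data)
for t in range(1, N_data):
    y[t] = 0.8 * y[t - 1] + 0.5 * u[t - 1]
y_meas = y + 0.1 * np.random.randn(N_data)

u_t = torch.tensor(u, dtype=torch.float32)
y_t = torch.tensor(y_meas, dtype=torch.float32)

# IO network N (n_a = n_b = 1) and hidden output variable, initialized to y (Step 1).
net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
y_hidden = y_t.clone().requires_grad_(True)

opt = torch.optim.Adam(list(net.parameters()) + [y_hidden], lr=1e-3)

for it in range(100):
    b = np.random.randint(1, N_data - m, size=q)            # batch starts (Step 2.1)
    Y = torch.stack([y_t[j:j + m] for j in b])              # measured outputs (Step 2.2)
    Y_bar = torch.stack([y_hidden[j:j + m] for j in b])     # hidden outputs
    U = torch.stack([u_t[j - 1:j + m - 1] for j in b])      # inputs u(b_j - 1 + t)

    y_prev = y_hidden[torch.as_tensor(b - 1)]               # initial condition from y_hidden
    sim = []
    for t in range(m):                                      # m-step simulation (Step 2.3)
        phi = torch.stack((y_prev, U[:, t]), dim=1)
        y_prev = net(phi).squeeze(1)
        sim.append(y_prev)
    Y_hat = torch.stack(sim, dim=1)

    # cost (16): fit term plus consistency regularization (Step 2.4)
    loss = torch.mean((Y - Y_hat) ** 2) + alpha * torch.mean((Y_bar - Y_hat) ** 2)
    if it == 0:
        loss_first = loss.item()

    opt.zero_grad()
    loss.backward()                                         # BPTT over m steps (Step 2.5)
    opt.step()                                              # gradient update (Step 2.6)
```

Note how the q subsequences are simulated together in a single batched forward pass per time step, which is where the parallelism discussed in Remark 2 comes from.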
4.2 Training statespace neural models
The fitting methods presented for the IO structure are applicable to the state-space structures introduced in Section 3.1, with the considerations discussed below.

- For the fully-observed state model structure (5), adaptation of the one-step prediction error minimization method is straightforward. Indeed, the noisy measured state is directly used as regressor to construct the predictor, i.e., x̂(t + 1) = N_f(x(t), u(t); θ).

- For model structures where the state is not fully observed, one-step prediction error minimization is not directly applicable, as a one-step-ahead prediction cannot be constructed in terms of the available measurements.

- Simulation error minimization is directly applicable to state-space structures without special modifications, provided that it is feasible from a computational perspective.

- Algorithm 1 for multi-step simulation error minimization is also generally applicable to state-space model structures. Instead of the hidden output variable ȳ, a hidden state variable x̄ representing the (noise-free) state at each time step must be optimized along with the network parameters θ through gradient descent. However, if the state is not fully observed, x̄ cannot be initialized directly with measurements as was done in the IO case. A convenient initialization of x̄ for gradient-based optimization can come from an initial state estimator, or from physical knowledge. For instance, for the mechanical system in (6), a possible initialization for the velocities is obtained through numerical differentiation of the measured position outputs.
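The velocity initialization mentioned above amounts to a numerical derivative of the position channel; the sampling time and the position signal below are made up for illustration:

```python
import numpy as np

# Hypothetical initialization of the hidden velocity states for a model like (6):
# differentiate the measured positions numerically (sampling time Ts assumed known).
Ts = 0.01
t = np.arange(0, 1, Ts)
p_meas = np.sin(2 * np.pi * t)      # stand-in for a measured position signal

v_init = np.gradient(p_meas, Ts)    # central differences in the interior points
```

The resulting v_init is only a starting point; Algorithm 1 then refines it jointly with the network parameters.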
5 Numerical Example
The fitting algorithms for the model structures presented in this paper are tested on a simulated nonlinear RLC circuit. All computations are carried out on a PC equipped with an Intel i5-7300U 2.60 GHz processor and 32 GB of RAM. The software implementation is based on the PyTorch deep learning framework [Paszke et al., 2017]. All the code implementing the methodologies discussed in the paper and required to reproduce the results is available in the online repository https://github.com/forgi86/sysid-neural-structures-fitting. Other examples concerning the identification of a Continuously Stirred Tank Reactor (CSTR) and a cart-pole system are available in the same repository.

5.1 System description
We consider the nonlinear RLC circuit in Fig. 1 (left).
The circuit behavior is described by the continuous-time state-space equation

(18)  dv_C/dt = i_L / C
      di_L/dt = (−v_C − R i_L + v_in) / L(i_L)

where v_in is the input voltage; v_C is the capacitor voltage; and i_L is the inductor current. The resistance R and the capacitance C are fixed, while the inductance L depends on the current i_L, as shown in Fig. 1 (right). The identification dataset is built by discretizing (18) using a fixed-step Runge–Kutta method and simulating the system over the experiment duration. The input v_in is filtered white noise. An independent validation dataset is generated using, as input, filtered white noise with different bandwidth and standard deviation.

The performance of the estimated models is assessed in terms of the R² index computed using the open-loop simulated model output. As a reference, a second-order linear OE model estimated on noise-free data using the System Identification Toolbox [Ljung, 1988] provides a baseline R² index on v_C and i_L for both the identification and the validation datasets.
We consider for the neural model structures the cases of (i) noise-free measurements of v_C and i_L; (ii) noisy measurements of v_C and i_L; (iii) noisy measurements of v_C only.
5.2 Algorithm setup
For gradient-based optimization, the Adam optimizer is used at Step 2.6 of Algorithm 1 to update the network parameters θ and the hidden variable ȳ (or x̄ for state-space structures) used to construct the initial conditions. The learning rate λ is adjusted through rough trial and error, while the number of iterations n is chosen large enough to reach a plateau in the cost function. In Algorithm 1, the weight α is kept at a fixed value. We tested different values of the sequence length m and adjusted the batch size q accordingly.
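The joint update of the network parameters and of the hidden variable can be expressed with Adam parameter groups; the sizes and learning rates below are illustrative only (the algorithm as stated uses a single rate λ, and per-group rates are merely an option this API offers):

```python
import torch
import torch.nn as nn

# Sketch: one Adam optimizer jointly updating the network parameters and the
# hidden variable used to construct the initial conditions.
net = nn.Linear(2, 1)
y_hidden = torch.zeros(1000, requires_grad=True)

optimizer = torch.optim.Adam([
    {'params': net.parameters(), 'lr': 1e-4},
    {'params': [y_hidden], 'lr': 1e-3},   # could equally share the network's rate
])
```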
5.3 Results
(i) Noise-free measurements of v_C and i_L
Since in this case the system state is supposed to be measured, we use the fully-observed state model structure (5).
The neural network N_f modeling the state mapping has a sequential feedforward structure with three input units (v_C, i_L, and v_in); a hidden layer with 64 linear units followed by a ReLU nonlinearity; and two linear output units, i.e., the two components of the state equation to be learned. Having a noise-free dataset, we expect good results from one-step prediction error minimization. Thus, we fit the model using this approach, running enough iterations to reach a plateau in the cost. The time required to train the network is 114 seconds. The fitted model describes the system dynamics with high accuracy: on both the identification and the validation datasets, the model achieves a high R² index in open-loop simulation for both v_C and i_L.

(ii) Noisy measurements of v_C and i_L
We consider the same identification problem above, with observations of v_C and i_L corrupted by additive white Gaussian noise. This corresponds to a Signal-to-Noise Ratio (SNR) of 20 dB and 13 dB on v_C and i_L, respectively.
Results for the onestep prediction error minimization approach on the validation dataset are shown in Fig. 2.
The R² index in validation drops markedly for both v_C and i_L. It is evident that noise has a severe impact on the performance of the one-step method.
In the presence of noise, better performance is expected from a multi-step simulation error minimization approach. Thus, we fit the same neural state-space model structure using Algorithm 1, with randomly extracted batches of q subsequences, each of length m. The results are in line with expectations: we recover performance similar to that of one-step prediction error minimization in the noise-free case, in terms of the R² index on both the identification and the validation datasets. Time trajectories of the output are reported in Fig. 3. For the sake of visualization, only a portion of the validation dataset is shown. The total run time of Algorithm 1 is 182 seconds, about 60% more than the one-step prediction error minimization method.
Open-loop simulation error minimization is also tested. This method yields the same performance as m-step simulation error minimization in terms of the R² index of the fitted model. However, it takes about two hours to execute the number of iterations required to reach a cost function plateau.
(iii) Noisy measurements of v_C only
We consider the case where only the voltage v_C is measured, with measurements corrupted by additive white Gaussian noise. The IO model structure in (7) is used with n_a = 2 and n_b = 2. The neural network N is characterized by four input units (corresponding to the two previous values of y and the two previous values of u); a hidden layer with 64 linear units followed by a ReLU nonlinearity; and a linear output unit representing the output value y°(t).
As in case (ii), the one-step prediction error minimization approach delivers unsatisfactory results due to the presence of measurement noise.
Thus, we fit the model using the multi-step method described in Algorithm 1, with batches of q subsequences of length m.
The total runtime of the algorithm is 192 seconds.
The R² index of the fitted model is high on both the identification and the validation datasets, even larger than the index achieved by the linear OE model estimated on noise-free data.
6 Conclusions and followup
In this paper, we have presented neural model structures and fitting criteria for the identification of dynamical systems. A custom method minimizing a regularized multi-step simulation error criterion has been proposed and compared with one-step prediction error and open-loop simulation error minimization.
The main strengths of the presented framework are its versatility in describing complex nonlinear systems, thanks to the flexibility of neural networks; its robustness to measurement noise, thanks to the multi-step simulation error criterion with initial condition estimation; and the possibility of exploiting parallel computing to train the network and optimize the initial conditions, thanks to the division of the dataset into small-size subsequences.
Current and future research activities are devoted to: (i) the formulation of proper fitting criteria and optimization algorithms for direct learning of continuous-time systems and systems described by partial differential equations, without introducing numerical discretization; (ii) the development of computationally efficient algorithms for estimation and control based on the neural dynamical models.
Acknowledgements
The authors are grateful to Dr. Giuseppe Sorgioso for the fruitful discussions on the properties of the backpropagation through time algorithm.
References
S. Chen, S. A. Billings, and P. M. Grant (1990). Non-linear system identification using neural networks. International Journal of Control, 51(6), pp. 1191–1214.
T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud (2018). Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583.
J. Gonzalez and W. Yu (2018). Non-linear system modeling using LSTM neural networks. IFAC-PapersOnLine, 51(13), pp. 485–489.
K. Hornik, M. Stinchcombe, and H. White (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), pp. 359–366.
D. P. Kingma and J. Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
L. Ljung (1988). System Identification Toolbox: The Matlab User's Guide.
L. Ljung (1999). System Identification: Theory for the User, 2nd edition. Prentice Hall PTR, Upper Saddle River, NJ, USA. ISBN 0136566952.
D. Masti and A. Bemporad (2018). Learning nonlinear state-space models using deep autoencoders. In 2018 IEEE Conference on Decision and Control (CDC), pp. 3862–3867.
A. Paszke et al. (2017). Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
M. Raissi, P. Perdikaris, and G. E. Karniadakis (2019). Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, pp. 686–707.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1988). Learning representations by back-propagating errors. Cognitive Modeling, 5(3), pp. 1.
J. Schmidhuber (2015). Deep learning in neural networks: an overview. Neural Networks, 61, pp. 85–117.
Y. Wang (2017). A new concept using LSTM neural networks for dynamic system identification. In 2017 American Control Conference (ACC), pp. 5324–5329.
P. J. Werbos (1989). Neural networks for control and system identification. In Proceedings of the 28th IEEE Conference on Decision and Control, pp. 260–265.
R. J. Williams and D. Zipser (1995). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Backpropagation: Theory, Architectures, and Applications, pp. 433–486.