1 Introduction
In recent years, deep learning has advanced at a tremendous pace and is now the core methodology behind cutting-edge technologies such as speech recognition, image classification and captioning, language translation, and autonomous driving (Schmidhuber, 2015; LeCun et al., 2015; Bengio and others, 2009). These impressive achievements are attracting ever-increasing investments from both the private and the public sector, fueling further research in this field.
A good deal of the advancement in the deep learning area is publicly accessible, both in terms of scientific publications and software tools. For instance, highly optimized and user-friendly deep learning frameworks are readily available (Paszke et al., 2017; Abadi and others, 2015), and are often distributed under permissive open-source licenses. Using the high-level functionalities of a deep learning framework and following good practice, even a novice user can tackle standard machine learning tasks (once considered extremely hard) such as image classification with moderate effort.
An experienced practitioner can employ the same deep learning framework at a lower level to tackle non-standard learning problems, by defining customized models and objective functions to be optimized, and using operators such as neural networks as building blocks. The practitioner is free from the burden of writing optimization code from scratch for every particular problem, which would be tedious and error-prone. In fact, as a built-in feature, modern deep learning engines can compute the derivatives of a supplied objective function with respect to its free tunable parameters by implementing the celebrated backpropagation algorithm (Rumelhart et al., 1988). In turn, this enables the convenient setup of any gradient-based optimization method.
An exciting, challenging—and yet partially unexplored—application field is system identification with tailor-made model structures and fitting criteria. In this context, neural networks can be used to describe uncertain components of the dynamics, while retaining structural (physical) knowledge, if available. Furthermore, the fitting criterion can be specialized to take into account the modeler's ultimate goal, which could be prediction, failure detection, state estimation, control design, simulation, etc.
The choice of the cost function may also be influenced by computational considerations. In this paper, in particular, models are assessed in terms of their simulation performance. In this setting, from a theoretical perspective, simulation error minimization is generally the best fitting criterion. However, evaluating the simulation error loss and its derivative may be prohibitively expensive from a computational perspective for dynamical models containing neural networks. Furthermore, simulation over time has an intrinsically sequential nature and offers limited opportunities for parallelization.
In this paper, we present two fitting algorithms that run significantly faster than full simulation error minimization, while still providing models with high simulation performance.
In the first approach, called truncated simulation error minimization, the neural dynamical model is simulated over (mini)batches of subsequences extracted from the training dataset. This allows parallelization of the model simulation over the different subsequences in a batch and also results in a reduced backpropagation cost with respect to a full open-loop simulation. Special care is taken to provide consistent initial conditions for all the simulated subsequences. In fact, these initial conditions are optimized along with the neural network parameters according to a dual objective. Specifically, the batch cost function takes into account both the distance between the simulated output and the measured output (the fitting objective) and the consistency of all the initial condition variables with the neural model equation (the initial state consistency objective). This cost function is iteratively minimized over the randomly extracted batches through a gradient-based optimization algorithm.
In the second approach, called soft-constrained integration, the neural dynamical model is enforced by a regularization term in the cost function penalizing the violation of a numerical ODE integration scheme applied to the system's (hidden) state variables. These state variables, together with the neural network parameters, are tuned with the dual objective of fitting the measured data and minimizing the penalty term associated with the violation of the numerical integration scheme. In the soft-constrained integration method, simulation through time is thus completely circumvented and the loss function splits up into independent contributions for each time step. This enables a fully parallel implementation of gradient-based optimization.
The use of neural networks in system identification has a long history. For instance, neural AutoRegressive Moving Average with eXogenous inputs models were discussed in Chen et al. (1990). Training was performed with the one-step prediction error method (Ljung, 1978) previously developed in the context of linear dynamical systems. In (Horne and Giles, 1995), several Recurrent Neural Network structures trained by BackPropagation Through Time are evaluated on a nonlinear system identification task. Although the overall reasoning in these earlier works is similar to ours, their results are hardly comparable with our current contribution, given the huge gap in hardware/software technology.
More recently, a few interesting approaches exploiting modern deep learning concepts and tools for system identification have been proposed. For instance, (Wang, 2017) and (Andersson et al., 2019) discuss the use of Long Short-Term Memory networks and 1-D Convolutional Neural Networks, respectively. Compared to these recent contributions, our work focuses on specialized neural model structures for the identification task at hand. Furthermore, we pose the identification problem directly in a continuous-time setting, which offers several advantages over the discrete-time counterpart (Garnier and Young, 2014; Piga, 2018). First, the majority of physical systems are naturally modeled in a continuous-time framework. Embedding physical knowledge in continuous-time model structures is thus more intuitive, and inspecting a continuous-time identified model is more insightful, as some parameters may retain a physical meaning. Second, continuous-time models can handle the case of non-uniformly sampled data. Last, continuous-time identification is generally immune from the numerical issues affecting discrete-time methodologies in the case of high sampling frequency.
The connection between deep learning and dynamical system theory is currently under intensive investigation, see e.g., (Weinan, 2017), and cross-fertilization is yielding substantial advances in both fields. On the one hand, modern deep learning architectures are interpreted as discrete-time approximations of an underlying continuous-time neural dynamical system. Exploiting this parallel, (Haber and Ruthotto, 2017) and (Ruthotto and Haber, 2019) analyze the stability properties of existing deep neural architectures through the lens of system-theoretic tools. Modified architectures guaranteeing stability by design are also proposed. On the other hand, contributions such as Raissi et al. (2019); Long et al. (2019); Rackauckas et al. (2020) showcase the potential of neural networks for data-driven modeling of dynamical systems described by ordinary and partial differential equations. With respect to these contributions, our aim is to devise computationally efficient fitting strategies for neural dynamical models that are robust to measurement noise.
The rest of this paper is structured as follows. The overall setting and problem statement are outlined in Section 2. The neural dynamical model structures are introduced in Section 3, and criteria for fitting these model structures to training data are described in Section 4. Simulation results are presented in Section 5 and can be replicated using the code available as online supplementary material. Conclusions and directions for future research are discussed in Section 6.
2 Problem Setting
We are given a dataset D consisting of input samples {u_0, u_1, …, u_{N−1}} and output samples {y_0, y_1, …, y_{N−1}}, gathered at time instants {t_0, t_1, …, t_{N−1}} from an experiment on a dynamical system S. The data-generating system is assumed to have the continuous-time state-space representation

(1a)  ẋ(t) = f(x(t), u(t))
(1b)  y°(t) = g(x(t), u(t))

where x(t) is the system state at time t; ẋ(t) denotes the time derivative of x(t); y°(t) is the noise-free output; u(t) is the system input; and f and g are the state and output mappings, respectively. The measured output y_k at time instant t_k is corrupted by a zero-mean noise e_k, i.e., y_k = y°(t_k) + e_k. We assume that the input u(t) can be reconstructed (or reasonably approximated) for all time instants from the samples {u_0, …, u_{N−1}}.
In this paper, we introduce flexible neural model structures that are suitable to represent generic dynamical systems such as (1), allowing the modeler to embed domain knowledge to various degrees and to exploit neural networks to describe unknown model components. Furthermore, we present fitting criteria and algorithms to train these neural dynamical model structures. Overall, we aim at fitting a neural network model that exploits the dynamic constraints and the different forms of available prior knowledge of the data-generating system (1).
3 Neural dynamical models
Let us consider a model structure M(θ), where M(θ) represents a dynamical model parametrized by a real-valued vector θ. We refer to neural model structures as structures where some components of the model are described by neural networks. In the following, we introduce possible neural model structures for dynamical systems.
3.1 General state-space model
A general state-space neural model structure has the form

(2a)  ẋ = N_f(x, u; θ)
(2b)  y = N_g(x, u; θ)

where N_f and N_g are feedforward neural networks of compatible size parametrized by θ. For notational simplicity, the time dependence of all signals in (2) is omitted.¹

¹ In the rest of the paper, the time dependence of signals will be specified only when necessary.
The general structure (2) can be tailored for the identification task at hand. Examples are illustrated in the remainder of this section.
3.2 Incremental model
If a linear approximation of the system is available, an appropriate model structure could be

(3a)  ẋ = A x + B u + N_f(x, u; θ)
(3b)  y = C x + N_g(x, u; θ)

where A, B, and C are matrices of compatible size describing the linear system approximation. For example, the values of these matrices can be estimated from the available training dataset using well-established algorithms for linear system identification (Ljung, 1999; Van Overschee and De Moor, 1994). Although model (3) is not more general than (2), it could be easier to train, as the neural networks N_f and N_g are supposed to capture only residual (nonlinear) dynamics.
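The incremental structure above can be sketched in a few lines of numpy. The matrices and network weights below are illustrative values only; in practice A, B, C would come from a linear identification step and the residual network would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear approximation (A, B, C) -- in practice estimated from data.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])

# Tiny random MLP playing the role of the residual network N_f(x, u; theta).
W1 = 0.1 * rng.standard_normal((16, 3))   # 3 inputs: x1, x2, u
b1 = np.zeros(16)
W2 = 0.1 * rng.standard_normal((2, 16))   # 2 outputs: residual state derivative
b2 = np.zeros(2)

def residual_net(x, u):
    z = np.concatenate([x, u])
    h = np.maximum(W1 @ z + b1, 0.0)      # ReLU hidden layer
    return W2 @ h + b2

def xdot(x, u):
    # Incremental model (3a): linear part plus neural residual.
    return A @ x + B @ u + residual_net(x, u)

x = np.array([0.1, -0.2])
u = np.array([1.0])
dx = xdot(x, u)
y = C @ x   # output map (3b), here without a residual output term for simplicity
```

The residual network only has to capture the mismatch between the linear approximation and the true dynamics, which is typically a smaller, easier-to-learn signal than the full state derivative.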
3.3 Fully observed state model
If the system state is known to be fully observed, the most convenient representation is

(4a)  ẋ = N_f(x, u; θ)
(4b)  y = x

where only the state mapping neural network N_f is learned, while the output mapping is fixed to the identity.
3.4 Physics-based model
Tailor-made architectures can be used to embed specific physical knowledge in the neural model structure.
For instance, let us consider the Cascaded Tanks System (CTS) schematized in Figure 1. The CTS is a fluid level control system consisting of two tanks with free outlets fed by a pump. Water is pumped from a bottom reservoir into the upper tank by a controlled pump. The water in the upper tank flows through a small opening into the lower tank, and from another small opening from the lower tank to the reservoir.
The system input u is the water flow from the bottom reservoir feeding the upper tank. The state variables x₁ and x₂ are the water levels in the upper and lower tanks, respectively.² As in the illustrative example further discussed in Section 5, we consider the case where only the second state x₂ is measured.

² With some abuse of notation, the subscripts in x₁ and x₂ simply denote the variable name and not a time index. The same notation will be used for the examples in Section 5.
The following dynamical model for the CTS can be derived from conservation laws and Bernoulli's principle (Schoukens and Noël, 2017):

(5a)  ẋ₁ = −k₁ √x₁ + k₄ u
(5b)  ẋ₂ = k₂ √x₁ − k₃ √x₂
(5c)  y = x₂

where k₁, k₂, k₃, and k₄ are fixed coefficients.
Based on this physical knowledge, an appropriate neural dynamical model for the CTS is

(6a)  ẋ₁ = N_f1(x₁, u; θ)
(6b)  ẋ₂ = N_f2(x₁, x₂; θ)
(6c)  y = x₂

capturing the information that: (i) the system dynamics can be described by a two-dimensional state-space model; (ii) the state x₂ is measured; (iii) the state x₁ does not depend on x₂; and (iv) the state x₂ does not depend directly on u, given the current value of x₁. This neural model takes advantage of the available process knowledge, while leaving enough representational capability to describe unmodeled effects such as fluid viscosity, nonlinearities of the actuators, and the water overflow that may happen when the tanks are completely filled.
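The structured state equation above can be sketched as two small networks with restricted inputs. The helper `make_mlp` and all weight values are illustrative assumptions; only the input wiring (which variables feed which network) reflects the structure in (6).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(n_in, n_hidden, n_out):
    """Return a tiny random MLP (one tanh hidden layer) as a closure."""
    W1 = 0.1 * rng.standard_normal((n_hidden, n_in))
    b1 = np.zeros(n_hidden)
    W2 = 0.1 * rng.standard_normal((n_out, n_hidden))
    b2 = np.zeros(n_out)
    def net(z):
        return W2 @ np.tanh(W1 @ z + b1) + b2
    return net

# Structured state equation (6): x1_dot sees only (x1, u), and
# x2_dot sees only (x1, x2) -- the input u reaches x2 only through x1.
net_f1 = make_mlp(2, 16, 1)
net_f2 = make_mlp(2, 16, 1)

def cts_xdot(x, u):
    x1, x2 = x
    dx1 = net_f1(np.array([x1, u]))[0]
    dx2 = net_f2(np.array([x1, x2]))[0]
    return np.array([dx1, dx2])

def cts_output(x):
    return x[1]  # y = x2: only the lower-tank level is measured

dx = cts_xdot(np.array([0.5, 0.3]), 1.0)
```

Restricting the inputs of each sub-network hard-codes the dependency structure of the physical model, so the learned dynamics satisfy assumptions (iii) and (iv) by construction.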
Embedding physical knowledge in neural model structures is a very active and promising trend in deep learning (Raissi et al., 2019). For instance, recent contributions propose specialized structures that are suitable to describe systems satisfying general physical principles such as energy conservation. In these cases, a physics-based neural network may be used to learn the system's Hamiltonian or Lagrangian function, instead of the individual components of its ODE representation as independent terms (Greydanus et al., 2019; Lutter et al., 2019).
In general, including domain knowledge in the model structure is useful to restrict the search space, while leaving enough representation capacity. Therefore, better generalization performance is expected from the trained models. Furthermore, the estimated models exactly satisfy (by design) the prior assumptions, and thus they are generally easier to diagnose, interpret, and exploit for advanced tasks such as state estimation, fault detection, and closed-loop control.
4 Training neural dynamical models
In this section, we present algorithms aimed at fitting the model structures introduced in Section 3 to the training dataset D.
For fixed values of the neural network parameters θ, for a given initial condition x₀, and under the model structure (2), the open-loop state simulation x̂(t) is the solution of the Cauchy problem

(7a)  ẋ(t) = N_f(x(t), u(t); θ)
(7b)  x(t₀) = x₀

and the simulated output is

(8)  ŷ(t) = N_g(x̂(t), u(t); θ).

Different ODE solution schemes (Quarteroni et al., 2010) may be applied to numerically solve problem (7). Formally,

(9)  x̂(·) = ODEINT(N_f, x₀, u(·))

will represent the solution of the Cauchy problem (7) obtained using a numerical scheme of choice (explicit or implicit, single-step or multi-step, single-stage or multi-stage) denoted as ODEINT.

The neural network parameters can be obtained by minimizing the simulation error norm

(10)  J(θ, x₀) = (1/N) Σ_{k=0}^{N−1} ‖y_k − ŷ(t_k)‖²

with respect to both the network parameters θ and the state initial condition x₀, with ŷ defined by Equations (8)-(9).
It is worth remarking that, when ODEINT is an explicit ODE solver, the full computational graph producing the cost function (10) can be constructed using standard differentiable blocks. For example, Figure 2 represents the computational graph obtained by applying a forward Euler scheme with a constant step size, assuming that the measurements in D are collected at the same rate. It is interesting to note the similarity between this computational graph and that of the residual network structure (He et al., 2016).
Therefore, in the case of an explicit ODE solver, the derivatives of the loss with respect to θ and x₀ can be obtained using standard backpropagation through the elementary solver steps. Thus, a procedure minimizing (10) can be implemented using available deep learning software.
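The forward Euler rollout and the simulation error loss (10) can be sketched as follows. This is a minimal numpy illustration (no autodiff); in the paper's setting the same computation would be expressed in PyTorch so that backpropagation through the solver steps is automatic. The function names are ours.

```python
import numpy as np

def odeint_euler(f, x0, u_seq, dt):
    """Forward Euler rollout: simulated states at the sampling instants."""
    x = np.array(x0, dtype=float)
    xs = [x.copy()]
    for u in u_seq[:-1]:
        x = x + dt * f(x, u)   # one explicit Euler step
        xs.append(x.copy())
    return np.stack(xs)

def simulation_error(f, g, x0, u_seq, y_seq, dt):
    """Mean squared simulation error, as in the loss (10)."""
    xs = odeint_euler(f, x0, u_seq, dt)
    y_sim = np.stack([g(x, u) for x, u in zip(xs, u_seq)])
    return np.mean((y_seq - y_sim) ** 2)

# Toy check on the scalar linear system x_dot = -x + u, with y = x.
f = lambda x, u: -x + u
g = lambda x, u: x
u_seq = np.ones(100)
dt = 0.01
xs = odeint_euler(f, np.zeros(1), u_seq, dt)
loss = simulation_error(f, g, np.zeros(1), u_seq, xs, dt)  # zero by construction
```

Note the sequential dependency inside `odeint_euler`: each step needs the previous state, which is exactly why full simulation offers so little parallelism.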
Recently, an alternative approach to differentiate through the ODE solution based on backward time integration of adjoint sensitivities has been proposed Chen et al. (2018). Following this approach, implicit ODE solvers may be adopted as well.
In either case, from a computational perspective, simulating over time has an intrinsically sequential nature and offers few opportunities for parallelization. Thus, in practice, minimizing the simulation error with a gradient-based method over the entire training dataset may be inconvenient or even infeasible in terms of memory allocation and computational time.
Remark 1
In many cases, optimizing the initial condition x₀ is not necessary in simulation error minimization, as this quantity may be available from physical considerations. For instance, in the cascaded tanks system presented above, the initial values of the tank levels may be known. Even when the initial condition is unknown, its effect may be negligible, as is the case for measurements collected from a fading memory system over a sufficiently long time horizon.
4.1 Truncated simulation error minimization
In order to reduce the computational burden and the wall-clock execution time of the full simulation error minimization approach previously discussed, the simulated output can be obtained by simulating the system state in (9) on several smaller portions of the training set D.
For an efficient implementation, the truncated simulation error minimization algorithm presented in this section processes batches containing q subsequences extracted from D. In principle, the simulations can be carried out simultaneously for all the subsequences in the batch by exploiting parallel computing.
A batch is completely specified by a batch start index vector s = (s₁, …, s_q), defining the initial sample of each subsequence, and a sequence length m, defining the number of samples contained in each subsequence, where each element of s satisfies 0 ≤ s_j ≤ N − m. Thus, for instance, the j-th output subsequence in a batch contains the measured output samples {y_{s_j}, y_{s_j+1}, …, y_{s_j+m−1}} (see Figure 3 for a graphical representation).
For notational convenience, we can arrange the batch data in the following tensors:

- output Y ∈ ℝ^{q×m×n_y}, where Y[j, k, :] = y_{s_j+k}
- input U ∈ ℝ^{q×m×n_u}, where U[j, k, :] = u_{s_j+k}
- relative time T ∈ ℝ^{q×m×1}, where T[j, k] = t_{s_j+k} − t_{s_j}

for j = 1, …, q and k = 0, …, m−1. Furthermore, we define the tensors

- simulated state X̂ ∈ ℝ^{q×m×n_x}
- simulated output Ŷ ∈ ℝ^{q×m×n_y}

that will be used to store simulated state and output values at the corresponding batch time instants.
Note that the third dimension of the output tensor Y corresponds to the output channel. With some abuse of notation, a tensor is addressed by only two indexes when we do not need to specify a particular channel. The same notation is used for the tensors U, X̂, and Ŷ.
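The batch tensors can be gathered from the training arrays with plain indexing, as in the following numpy sketch (the function name `extract_batch` and the sizes are ours, for illustration):

```python
import numpy as np

def extract_batch(y_data, u_data, t_data, start_idx, m):
    """Gather batch tensors of shape (q, m, channels) from the training arrays."""
    Y = np.stack([y_data[s:s + m] for s in start_idx])
    U = np.stack([u_data[s:s + m] for s in start_idx])
    # Relative time of each sample within its own subsequence.
    T = np.stack([t_data[s:s + m] - t_data[s] for s in start_idx])
    return Y, U, T

# Synthetic training arrays: N samples, one input and one output channel.
N, ny, nu = 1000, 1, 1
y_data = np.random.randn(N, ny)
u_data = np.random.randn(N, nu)
t_data = 0.01 * np.arange(N)

m, q = 64, 8
start_idx = np.random.randint(0, N - m + 1, size=q)  # batch start index vector s
Y, U, T = extract_batch(y_data, u_data, t_data, start_idx, m)
```

Each row of the tensors is an independent subsequence, which is what makes the simulations in a batch trivially parallelizable.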
For the subsequences j = 1, …, q, the state evolution xʲ(τ) in the relative time interval [0, T[j, m−1]] (which corresponds to the absolute time interval [t_{s_j}, t_{s_j+m−1}]) is the solution of the Cauchy problem

(11)  ẋʲ(τ) = N_f(xʲ(τ), u(t_{s_j} + τ); θ),  xʲ(0) = x₀ʲ

for a given initial condition x₀ʲ. The solution of the Cauchy problem (11) is numerically approximated by

(12a)  x̂ʲ(·) = ODEINT(N_f, x₀ʲ, u(·))

and the simulated output for the j-th subsequence is then given by

(12b)  ŷʲ(τ) = N_g(x̂ʲ(τ), u(t_{s_j} + τ); θ).

We can now arrange the simulated state and output at the measurement time instants in the tensors X̂ and Ŷ, respectively:

(13a)  X̂[j, k, :] = x̂ʲ(T[j, k])
(13b)  Ŷ[j, k, :] = ŷʲ(T[j, k])

for j = 1, …, q and k = 0, …, m−1.
As opposed to the full simulation error minimization case, the choice of appropriate initial conditions in the Cauchy problem (11) is here critical. Indeed, the effect of these initial conditions cannot in general be neglected, as the duration of a subsequence is typically much shorter than the whole dataset D. Furthermore, the system state is unlikely to be known a priori at all the different time instants. For this reason, we introduce an additional set of free variables X̄ = {x̄₀, x̄₁, …, x̄_{N−1}}, where x̄_k represents the (unmeasured) system state at the measurement time t_k. For a given batch, the set of subsequences' initial conditions is then constructed as x₀ʲ = x̄_{s_j} (with j = 1, …, q) and optimized along with the neural network parameters in order to minimize the fitting cost

(14)  J_fit = (1/(q m)) Σ_{j=1}^{q} Σ_{k=0}^{m−1} ‖Y[j, k, :] − Ŷ[j, k, :]‖².

It is important to remark that the simulated output Ŷ is a function of both the neural network parameters θ and the initial conditions x̄_{s_j}, j = 1, …, q.
Since the fitting cost defined above has the hidden-state variables as additional degrees of freedom with respect to the full simulation error minimization case, a price could be paid in terms of lack of generalization of the estimated model.
In order to reduce the degrees of freedom in the minimization of the loss J_fit, the variables used to construct the initial conditions for the Cauchy problem (11) can be enforced to represent the unknown system state, and thus to be consistent with the neural model structure (2a) (namely, the state sequence should satisfy the neural ODE equation (2a)). To this aim, we introduce a regularization term penalizing the distance between the state simulated through (12a) and the optimization variables in X̄. Specifically, the regularization term is defined as

(15)  J_reg = (1/(q m)) Σ_{j=1}^{q} Σ_{k=0}^{m−1} ‖X̄[j, k, :] − X̂[j, k, :]‖²

where, with a slight abuse of notation, X̄ here also denotes the tensor with the same size and structure as X̂, but populated with the hidden-state variables, i.e., X̄[j, k, :] = x̄_{s_j+k}.
The overall loss is then constructed as a weighted sum of the two objectives, namely:

(16)  J = J_fit + α J_reg

with regularization weight α > 0.
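For tensors already arranged as in (13)-(15), the total batch loss (16) reduces to two mean-squared terms, as in this numpy sketch (names and shapes are ours, for illustration):

```python
import numpy as np

def total_loss(Y, Y_sim, X_hidden, X_sim, alpha):
    """Weighted sum (16) of the fitting cost (14) and the consistency cost (15).

    Y, Y_sim:         measured / simulated outputs, shape (q, m, ny)
    X_hidden, X_sim:  hidden-state variables at the batch instants /
                      simulated states, shape (q, m, nx)
    """
    J_fit = np.mean((Y - Y_sim) ** 2)          # fitting objective (14)
    J_reg = np.mean((X_hidden - X_sim) ** 2)   # consistency objective (15)
    return J_fit + alpha * J_reg

# Toy check with known values: J_fit = 1, J_reg = 0, so J = 1 for alpha = 1.
q, m, ny, nx = 8, 64, 1, 2
Y = np.zeros((q, m, ny)); Y_sim = np.ones((q, m, ny))
Xh = np.zeros((q, m, nx)); Xs = np.zeros((q, m, nx))
J = total_loss(Y, Y_sim, Xh, Xs, alpha=1.0)
```

In a PyTorch implementation the same expression, written on tensors that require gradients, yields the derivatives with respect to both θ and the hidden states for free.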
Algorithm 1: truncated simulation error minimization.

Inputs: training dataset D; number of iterations n; batch size q; length of subsequences m; learning rate λ; regularization weight α.

1. Initialize the neural network parameters θ and the hidden state sequence X̄.
2. For each of the n iterations:
   2.1. select the batch start index vector s;
   2.2. populate the tensors Y, U, T and the set of initial conditions {x̄_{s_j}, j = 1, …, q};
   2.3. simulate the state x̂ʲ and output ŷʲ for j = 1, …, q;
   2.4. populate the tensors X̂ and Ŷ according to (13), for j = 1, …, q and k = 0, …, m−1;
   2.5. compute the cost J = J_fit + α J_reg;
   2.6. evaluate the gradients ∇_θ J and ∇_X̄ J at the current values of θ and X̄;
   2.7. update the optimization variables θ and X̄ with a gradient step.

Output: neural network parameters θ̂.
Algorithm 1 details the steps required by the proposed truncated simulation error method to train a dynamical neural model via gradient-based optimization.
In Step 1, the neural network parameters θ and the "hidden" state variables in X̄ are initialized. For instance, the initial values of θ and X̄ can be set to small random numbers. Alternatively, if a measurement/estimate of some of the system's state variables is available, it can be used to initialize certain components of X̄.
Then, at each iteration of the gradient-based training algorithm, the following steps are executed. Firstly, the batch start vector s is selected (Step 2.1). The indexes in s can be either (pseudo)randomly generated or chosen deterministically.³ Then, the tensors Y, U, and T are populated with the corresponding values in D, and the initial conditions x̄_{s_j}, with j = 1, …, q, are obtained from X̄ (Step 2.2). Subsequently, for each subsequence, the system state and output are simulated over the corresponding time interval, starting from the initial condition x̄_{s_j} (Step 2.3). Then, the values of the simulated state and output at the (relative) measurement times are collected in the tensors X̂ and Ŷ (Step 2.4) and used to construct the loss J (Step 2.5). Next, the gradients of the cost J with respect to the optimization variables θ and X̄ are obtained (Step 2.6), either by backpropagation through the solver steps or by exploiting the adjoint sensitivity method suggested in (Chen et al., 2018). Lastly, the optimization variables are updated via gradient descent with learning rate λ (Step 2.7). Improved variants of gradient descent such as Adam (Kingma and Ba, 2015) can alternatively be adopted at Step 2.7.

³ For an efficient use of the training data D, s has to be chosen in such a way that all samples are visited with equal frequency during the iterations of Algorithm 1.
Note that, at each iteration of the gradient descent algorithm, the cost J may depend on just a subset of the hidden state variables X̄. Thus, at each iteration, the gradient vector is non-zero only for those components, and only they are updated.
Remark 2
The computational cost of evaluating the gradient of J is proportional to the number of solver steps executed in (12a). In the case where the solver step equals the sampling interval, running truncated simulation error minimization over short subsequences is thus significantly faster than full simulation error minimization, which requires N solver steps. Furthermore, the computations for the q subsequences can be carried out independently, and thus parallel computing can be exploited.
In the next paragraph, we introduce an alternative method for fitting neural dynamical models that does not require simulation through time, and is thus even more convenient from a computational perspective, allowing full parallelization at the time-step level.
4.2 Soft-constrained integration
The optimization variables X̄ previously introduced for truncated simulation error minimization are regularized to be consistent with the fitted system dynamics through the cost (15). Therefore, X̄ implicitly provides another estimate of the unknown system state, and thus of the output for given θ. This estimate can be compared with the measured samples to define an alternative fitting objective. This suggests the following variant for the fitting term:

(17)  J_fit = (1/(q m)) Σ_{j=1}^{q} Σ_{k=0}^{m−1} ‖Y[j, k, :] − Ȳ[j, k, :]‖²

where Ȳ[j, k, :] = N_g(X̄[j, k, :], U[j, k, :]; θ).
Experimentally, using (17) instead of (14) as the fitting term does not significantly alter the properties of truncated simulation error minimization. Furthermore, the wall-clock execution time of Algorithm 1 with the modified fitting term (17) is still dominated by the step-by-step simulation (and backpropagation) required to compute the gradient of the regularization term (15). In order to formulate a faster learning algorithm, an alternative regularization term that promotes consistency of the hidden state variables without requiring time simulation should be devised.
The consistency-promoting regularizer considered in the soft-constrained integration method penalizes the violation of a numerical ODE integration scheme applied to the hidden state variables x̄_k, independently at each integration step. For instance, the forward Euler scheme can be enforced by means of the regularization term

(18)  J_cons = Σ_k ‖x̄_{k+1} − x̄_k − Δt N_f(x̄_k, u_k; θ)‖²

assuming for notational simplicity a constant step size Δt. If the regularization term (18) is reduced to a "small value" through optimization, then the hidden variables will (approximately) satisfy the forward Euler scheme.
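The key computational point is that the residual in (18) couples only consecutive pairs of hidden-state variables, with no rollout. The following numpy sketch (function name ours) makes this explicit:

```python
import numpy as np

def euler_consistency_penalty(x_bar, u_data, f, dt):
    """Penalty (18): violation of the forward Euler step at each sample.

    Every residual depends only on (x_bar[k], x_bar[k+1]), so the terms can be
    evaluated in parallel -- no sequential simulation through time is needed.
    """
    f_vals = np.stack([f(x, u) for x, u in zip(x_bar[:-1], u_data[:-1])])
    residual = x_bar[1:] - x_bar[:-1] - dt * f_vals
    return np.mean(residual ** 2)

# Toy check: hidden states generated exactly by Euler on x_dot = -x + u
# give a zero penalty by construction.
f = lambda x, u: -x + u
dt = 0.01
u_data = np.ones(50)
x_bar = np.zeros((50, 1))
for k in range(49):
    x_bar[k + 1] = x_bar[k] + dt * f(x_bar[k], u_data[k])
penalty = euler_consistency_penalty(x_bar, u_data, f, dt)
```

During training, `x_bar` would be a free optimization variable rather than a precomputed trajectory, and the penalty would be driven toward zero jointly with the fitting term.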
Algorithm 2: soft-constrained integration.

Inputs: training dataset D; number of iterations n; batch size q; length of subsequences m; learning rate λ; regularization weight α.

1. Initialize the neural network parameters θ and the hidden state sequence X̄.
2. For each of the n iterations:
   2.1. select the batch start index vector s;
   2.2. populate the tensors Y, U, T;
   2.3. compute the cost J = J_fit + α J_cons, with J_fit as in (17);
   2.4. evaluate the gradients ∇_θ J and ∇_X̄ J at the current values of θ and X̄;
   2.5. update the optimization variables θ and X̄ with a gradient step.

Output: neural network parameters θ̂.
Algorithm 2 details the training steps when the forward Euler numerical scheme is used to enforce consistency of the hidden state variables with the model dynamics. In Step 1, the neural network parameters θ and the sequence of hidden variables X̄ are initialized. Then, at each iteration of the gradient-based training algorithm, the following steps are executed. Firstly, the batch start vector s is selected (Step 2.1) and the tensors Y, U, T are populated with the corresponding samples in D (Step 2.2), similarly to Algorithm 1. Then, the loss J is computed (Step 2.3) as a weighted sum of the fitting cost (17) and the regularizer (18). Note that, unlike Algorithm 1, Algorithm 2 does not require step-by-step simulation. The gradients of J are obtained using standard backpropagation and used to perform a gradient-based minimization step (Steps 2.4 and 2.5).
The potential advantage of the proposed soft-constrained integration method over truncated simulation error minimization is twofold. Firstly, implicit integration schemes can be enforced with no additional computational burden with respect to explicit ones. For instance, the backward Euler integration scheme can be implemented simply by modifying the consistency term to

(19)  J_cons = Σ_k ‖x̄_{k+1} − x̄_k − Δt N_f(x̄_{k+1}, u_{k+1}; θ)‖²

while the Crank-Nicolson scheme corresponds to

(20)  J_cons = Σ_k ‖x̄_{k+1} − x̄_k − (Δt/2) (N_f(x̄_k, u_k; θ) + N_f(x̄_{k+1}, u_{k+1}; θ))‖².

Other implicit schemes such as the multi-step Backward Differentiation Formula (BDF) and Adams-Moulton (AM) methods, or multi-stage implicit Runge-Kutta (RK) methods, may be similarly implemented, leading to a potential increase in the accuracy of the ODE numerical solution (Quarteroni et al., 2010). Secondly, the resulting cost function splits up as a sum of independent contributions for each time step, thus enabling a fully parallel implementation of gradient-based optimization. This leads to significant computational advantages and a reduced wall-clock execution time.
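The reason implicit schemes come for free is visible in code: since the "next" state x̄_{k+1} is itself a free variable, no nonlinear equation has to be solved at each step. A numpy sketch of the residuals in (19) and (20) (function name ours, toy values for illustration):

```python
import numpy as np

def implicit_penalties(x_bar, u_data, f, dt):
    """Consistency residuals for implicit one-step schemes, in parallel.

    No nonlinear solve is needed: x_bar[k+1] is an optimization variable,
    not an unknown to be computed from x_bar[k].
    """
    fk = np.stack([f(x, u) for x, u in zip(x_bar[:-1], u_data[:-1])])  # f at k
    fk1 = np.stack([f(x, u) for x, u in zip(x_bar[1:], u_data[1:])])   # f at k+1
    dx = x_bar[1:] - x_bar[:-1]
    backward_euler = np.mean((dx - dt * fk1) ** 2)                # scheme (19)
    crank_nicolson = np.mean((dx - 0.5 * dt * (fk + fk1)) ** 2)   # scheme (20)
    return backward_euler, crank_nicolson

# Toy evaluation on an arbitrary (inconsistent) hidden-state trajectory.
f = lambda x, u: -x + u
dt = 0.01
u_data = np.ones(50)
x_bar = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
be, cn = implicit_penalties(x_bar, u_data, f, dt)
```

Swapping schemes is a one-line change to the residual, which makes experimenting with higher-accuracy integrators inexpensive.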
On the other hand, in the proposed soft-constrained integration method, the numerical scheme is only approximately satisfied at each solver step, and the degree of violation ultimately depends on the weighting constant α in the cost function. The tuning of this weight is thus more critical than in truncated simulation error minimization.
5 Case studies
The performance of the model structures and fitting algorithms introduced in this paper is tested on three case studies, considering the identification of a nonlinear RLC circuit; a Cascaded Tanks System (CTS); and an Electro-Mechanical Positioning System (EMPS).
Code availability
The software implementation is based on the Python programming language and the PyTorch Deep Learning Framework (Paszke et al., 2017).
All the code required to reproduce the results reported in the paper is available in the GitHub repository https://github.com/forgi86/sysid-neural-continuous
Dataset availability
The RLC circuit dataset is simulated and available together with the code in the online repository.
Experimental datasets for the CTS and the EMPS case studies are obtained from the website http://www.nonlinearbenchmark.org, which hosts a collection of public benchmarks widely used in system identification.
Metrics
For all case studies, the performance of the fitting algorithms is assessed in terms of the R² index of the model simulation:

R² = 1 − ( Σ_{k=0}^{N−1} (y_k − ŷ_k)² ) / ( Σ_{k=0}^{N−1} (y_k − ȳ)² )

where ȳ is the mean of the measured output samples.
For the CTS, the Root Mean Squared Error (RMSE) of the model simulation is also provided, as this is the performance index suggested in the description of the benchmark (Schoukens and Noël, 2017):

RMSE = √( (1/N) Σ_{k=0}^{N−1} (y_k − ŷ_k)² ).

The performance indexes are evaluated both on the training dataset and on a separate test dataset. In the case of systems with multiple output channels, the metrics are computed channel-wise.
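The two metrics can be computed channel-wise in a few lines of numpy (function names ours):

```python
import numpy as np

def r_squared(y_true, y_sim):
    """Coefficient of determination of the model simulation, per channel."""
    ss_res = np.sum((y_true - y_sim) ** 2, axis=0)
    ss_tot = np.sum((y_true - np.mean(y_true, axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_sim):
    """Root Mean Squared Error of the model simulation, per channel."""
    return np.sqrt(np.mean((y_true - y_sim) ** 2, axis=0))

# Sanity check: a perfect simulation gives R^2 = 1 and RMSE = 0.
y_true = np.arange(10.0).reshape(-1, 1)
r2_perfect = r_squared(y_true, y_true)
rmse_perfect = rmse(y_true, y_true)
```

Note that R² can be negative for a model that fits worse than the constant mean predictor, so it is a stricter score than it may first appear.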
Algorithm settings
In truncated simulation error minimization (Algorithm 1), the batch size q and sequence length m are chosen within a fixed integer range, while the regularization weight α in (16) is always set to 1.
In the soft-constrained integration method (Algorithm 2), we consider instead a single subsequence containing all the dataset samples, i.e., q = 1 and m = N. The weight constant α is tuned based on the simulation performance of the identified model on the training dataset.
In all the examples, the Adam optimizer is used for gradient-based optimization. The learning rate is adjusted through a rough trial-and-error, while the other optimizer parameters are left at their default values. The number of training steps is chosen sufficiently high to reach a cost function plateau.
The neural networks' weight parameters are initialized to random Gaussian variables with zero mean and a small standard deviation, while the bias terms are initialized to zero. The hidden state variables X̄ are initialized differently for the three examples, exploiting available process knowledge.
Hardware configuration
All computations are carried out on a PC equipped with an Intel i5-7300U 2.60 GHz processor and 32 GB of RAM.
5.1 Nonlinear RLC circuit
We consider the nonlinear RLC circuit in Figure 4 (left).
The circuit behavior is described by the continuous-time state-space equation

(21)  v̇_C = i_L / C
      i̇_L = ( v_in − R i_L − v_C ) / L(i_L)

where v_in is the input voltage; v_C is the capacitor voltage; and i_L is the inductor current. The resistance R and the capacitance C are fixed, while the inductance L depends on the current i_L according to the nonlinear characteristic shown in Figure 4 (right). This dependence is typically encountered in ferrite inductors operating in partial saturation (Di Capua et al., 2017).
In this case study, we assume both state variables v_C and i_L to be measured. A training dataset is built by simulating the system with a fixed step size. The input v_in is a filtered white noise signal. An independent test dataset is generated using as input a filtered white noise signal with a different bandwidth and standard deviation. In the training dataset, the observations of v_C and i_L are corrupted by additive zero-mean white Gaussian noise, corresponding to a Signal-to-Noise Ratio (SNR) of 20 dB and 13 dB on v_C and i_L, respectively.
5.1.1 Neural Model Structure
By considering the state vector $x = \begin{bmatrix} v_C & i_L \end{bmatrix}^\top$ and the input $u = v_{in}$, we adopt for this system the fully observed state model structure (4) introduced in Section 3, reported for convenience below:

\dot{x} = \mathcal{N}(x, u), \qquad y = x
This model structure embeds the knowledge that the system has a second-order state-space representation and that the whole state vector is measured.
The neural network $\mathcal{N}$ used in this example has a feedforward structure with three input units (corresponding to $v_C$, $i_L$, and $v_{in}$); a hidden layer with 64 linear units followed by a ReLU nonlinearity; and two linear output units, namely the components of the state equation to be learned.
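The forward pass of a network with this 3-64-2 architecture can be sketched in plain Python as follows; the random weights stand in for trained parameters and are for illustration only.

```python
import random

def mlp(x, W1, b1, W2, b2):
    """Feedforward pass: one hidden ReLU layer followed by a linear output layer."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
         for row, bi in zip(W1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + bi
            for row, bi in zip(W2, b2)]

random.seed(0)
n_in, n_hid, n_out = 3, 64, 2   # inputs (v_C, i_L, v_in) -> two state derivatives
W1 = [[random.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_hid)]
b1 = [0.0] * n_hid
W2 = [[random.gauss(0.0, 0.01) for _ in range(n_hid)] for _ in range(n_out)]
b2 = [0.0] * n_out
dx = mlp([0.5, -0.2, 1.0], W1, b1, W2, b2)
```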
5.1.2 Truncated Simulation Error Minimization
Algorithm 1 is executed with a constant learning rate for a fixed number of iterations, using batches of subsequences extracted from the training set. The hidden state variables are initialized to the values of the noisy output measurements. Model equations (11) are numerically integrated using the forward Euler scheme. The total runtime of Algorithm 1 is 142 seconds.
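The forward Euler simulation of a model subsequence amounts to the recursion $x_{k+1} = x_k + \Delta t\, f(x_k, u_k)$; a minimal sketch is given below, where the linear toy dynamics is an assumption standing in for the neural state equation.

```python
def simulate_euler(f, x0, u_seq, dt):
    """Simulate x_{k+1} = x_k + dt * f(x_k, u_k) along one subsequence."""
    xs = [list(x0)]
    for u in u_seq:
        dx = f(xs[-1], u)
        xs.append([xi + dt * dxi for xi, dxi in zip(xs[-1], dx)])
    return xs

# toy stable first-order dynamics standing in for the neural state equation
f = lambda x, u: [-x[0] + u]
traj = simulate_euler(f, [0.0], [1.0] * 100, dt=0.01)
```

In the actual algorithm, this simulation is differentiated through by the framework's automatic differentiation to obtain the gradient of the simulation error.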
Time trajectories of the true and model outputs are reported in Figure 5. For the sake of visualization, only a portion of the test dataset is shown. The fitted model describes the system dynamics with high accuracy: on both the training and the test datasets, the $R^2$ index in simulation is high for both $v_C$ and $i_L$. For the sake of comparison, a second-order linear Output Error model estimated using the System Identification Toolbox (Ljung and Singh, 2012) achieves noticeably lower $R^2$ indices for $v_C$ and $i_L$ on both the training and the test dataset.
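The $R^2$ fit index used throughout this section can be computed as sketched below (a standard coefficient-of-determination definition, stated here for completeness).

```python
def r_squared(y_true, y_sim):
    """Fit index R^2 = 1 - SS_res / SS_tot of a simulated signal."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((yt - ys) ** 2 for yt, ys in zip(y_true, y_sim))
    ss_tot = sum((yt - mean) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot
```

An $R^2$ of 1 indicates a perfect simulation, while 0 corresponds to predicting the mean of the measured signal.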
5.1.3 Full Simulation Error Minimization
Full simulation error minimization is also tested. This method yields the same performance as truncated simulation error minimization in terms of the $R^2$ index of the fitted model. However, the runtime required to reach a cost function plateau is about two hours.
5.1.4 OneStep Prediction Error Minimization
For the fully observed neural model structure, a straightforward fitting criterion may be defined by taking the noisy output measurements as state estimates and by minimizing the one-step prediction error loss of the AutoRegressive with eXogenous input (ARX) model structure:
(22)  J^{\rm pred} = \frac{1}{N-1} \sum_{k=0}^{N-2} \left\| \frac{y_{k+1} - y_k}{\Delta t} - \mathcal{N}(y_k, u_k) \right\|^2
Minimization of (22) corresponds to the training of a standard feedforward neural network with target $(y_{k+1} - y_k)/\Delta t$ and features $y_k$, $u_k$.
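Constructing the regression dataset for this one-step criterion is straightforward; the finite-difference target below reflects the continuous-time state equation, and the toy measurement values are illustrative.

```python
def one_step_dataset(y_meas, u_meas, dt):
    """Regression features (y_k, u_k) and finite-difference targets (y_{k+1}-y_k)/dt."""
    feats = [(y_meas[k], u_meas[k]) for k in range(len(y_meas) - 1)]
    targets = [(y_meas[k + 1] - y_meas[k]) / dt for k in range(len(y_meas) - 1)]
    return feats, targets

feats, targets = one_step_dataset([0.0, 0.1, 0.3], [1.0, 1.0, 1.0], dt=0.1)
```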
Since the neural network is fed with noisy input data and only the one-step-ahead prediction error is minimized, this approach is not robust to measurement noise. On this RLC example, the fully observed state neural model structure trained by minimizing (22) achieves a noticeably lower $R^2$ index for $v_C$ and $i_L$ on the test dataset (see time trajectories in Figure 6).
By repeating the fitting procedure on a noise-free RLC training dataset, one-step prediction error minimization recovers the same performance as simulation error minimization in terms of the $R^2$ index for both $v_C$ and $i_L$.
5.2 Cascaded Tanks System
We consider the CTS already introduced in Section 3 and described in detail in (Schoukens and Noël, 2017).
The training and test datasets contain 1024 points each, collected at a constant sampling time. In both datasets, the input is a multisine signal with the same power spectrum but a different realization. Input and output values are in volts, as they correspond to the actuator commands and the raw sensor readings, respectively.
The initial state of the system is not provided, but it is known to be the same for both datasets. Thus, as suggested in (Schoukens and Noël, 2017), we use the initial state estimated on the training dataset (a by-product of the proposed fitting procedures) as the initial state for model simulation on the test dataset.
5.2.1 Neural Model Structure
The neural model structure used for this system is
(23a)  \dot{x}_1 = \mathcal{N}_1(x_1, u)
(23b)  \dot{x}_2 = \mathcal{N}_2(x_1, x_2, u)
(23c)  y = x_2
Compared to (6), model (23) also includes a direct dependency on the input $u$ in the second state equation. This dependency is added to take into account that in this experimental setup, in case of water overflow from the upper tank, part of the overflowing water may go directly into the lower tank (Schoukens and Noël, 2017).
The neural networks $\mathcal{N}_1$ and $\mathcal{N}_2$ have two and three input units, respectively. Both networks have one hidden layer with 100 linear units followed by a ReLU nonlinearity, and a linear output unit.
5.2.2 Truncated Simulation Error Minimization
Algorithm 1 is executed with a constant learning rate for a fixed number of iterations, batch size, and subsequence length. The components of the hidden state variables associated with the unmeasured state $x_1$ are initialized to a constant initial guess, while the components associated with $x_2$ are initialized to the noisy output measurements. Model equations (9) are numerically integrated using the forward Euler scheme. The total runtime of Algorithm 1 is 533 seconds.
Time trajectories of the simulated and true output are reported in Figure 7. The achieved $R^2$ and RMSE indices on the training and test datasets compare favorably with state-of-the-art black-box nonlinear identification methods applied to this benchmark (Relan et al., 2017; Svensson and Schön, 2017; Birpoutsoukis et al., 2018). To the best of our knowledge, the best previously published result was obtained in (Svensson and Schön, 2017) using a state-space model with priors for basis function expansion inspired by Gaussian Processes, trained using a sequential Monte Carlo method. A higher performance was reported only in (Rogers et al., 2017) for grey-box models including an explicit physical description of the water overflow phenomenon.
Note that the largest discrepancies between the measured and simulated outputs occur towards the end of the test experiment, where the measured output reaches voltage levels not encountered in the training dataset. A mismatch can therefore be expected for a black-box nonlinear model such as a neural network.
5.2.3 Soft-Constrained Integration Method
Algorithm 2 is executed with the regularization term in (20) enforcing the Crank-Nicolson integration scheme, with a fixed regularization constant. For this small dataset, all time steps fit into memory and can be processed together. Thus, we consider a single batch with one subsequence containing the whole training dataset. Optimization is performed over a fixed number of iterations of the Adam algorithm with a constant learning rate. The total runtime of Algorithm 2 is 271 seconds, approximately half the runtime of Algorithm 1.
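The Crank-Nicolson regularization term penalizes, for each time step, the mismatch between consecutive hidden states and the trapezoidal average of the state derivatives. A scalar sketch of this penalty is given below; the constant-input toy dynamics is an assumption used only to exercise the function.

```python
def cn_penalty(x_bar, u, f, dt):
    """Sum of squared Crank-Nicolson residuals over the hidden state sequence."""
    pen = 0.0
    for k in range(len(x_bar) - 1):
        res = (x_bar[k + 1] - x_bar[k]
               - 0.5 * dt * (f(x_bar[k], u[k]) + f(x_bar[k + 1], u[k + 1])))
        pen += res ** 2
    return pen

# a hidden state sequence exactly consistent with dx/dt = u (here u = 1) has zero penalty
pen = cn_penalty([0.0, 0.1, 0.2], [1.0, 1.0, 1.0], lambda x, u: u, dt=0.1)
```

Minimizing this penalty jointly with the fitting loss drives the hidden states towards a trajectory consistent with the learned dynamics, without simulating the model explicitly.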
Time trajectories of the output are reported in Figure 8 for both the training and the test dataset. The $R^2$ and RMSE indices on the training and test datasets are in line with the ones achieved by truncated simulation error minimization.
5.3 Electro-Mechanical Positioning System
As a last case study, we consider the identification of the Electro-Mechanical Positioning System (EMPS) described in (Janot et al., 2019).
The system is a controlled prismatic joint, a common component of robots and machine tools. In the benchmark, the system input is the motor force (N) referred to the load side, and the measured output is the prismatic joint position (m). A physical state-space model for the system is
(24a)  \dot{p} = v
(24b)  M \dot{v} = u - F_v\, v - F_c\, \mathrm{sign}(v)
where $M$ (kg) is the joint mass, $F_v$ (Ns/m) is the dynamic friction coefficient, and $F_c$ (N) is the static friction force. The benchmark is challenging due to (i) the unknown friction behavior and (ii) the marginally stable (integral) system dynamics.
The identification and test datasets are constructed from closed-loop experiments performed with the same reference position trajectory. A force disturbance acts on the system in the test experiment only. The two datasets have the same duration (approximately 25 seconds) and are collected at a high sampling frequency. In this paper, the original EMPS signals are decimated by a constant factor, resulting in a coarser sampling time for each dataset.
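The decimation step can be sketched as simple downsampling; note that a complete pipeline would typically apply an anti-aliasing low-pass filter first, which this sketch omits.

```python
def decimate(signal, factor):
    """Keep one sample every `factor`; a real pipeline would low-pass filter first."""
    return signal[::factor]

reduced = decimate(list(range(100)), 20)
```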
5.3.1 Neural Model Structure
According to the physical model (24), the neural model structure used to fit the EMPS system is
(25a)  \dot{p} = v
(25b)  \dot{v} = \mathcal{N}(v, u)
(25c)  y = p
with state variables $p$ (position) and $v$ (velocity), and input $u$ (motor force).
The neural model structure (25) captures the physical knowledge that the system states are position and velocity; that the derivative of the position is the velocity; and that the velocity dynamics does not depend on the position. The neural network $\mathcal{N}$ is thus used to describe the velocity dynamics (24b), which could be rather complex due to the presence of static friction. Indeed, static friction is highly nonlinear and hard to describe with first-principles formulas. On the other hand, there is no need to use a black-box model to describe the position dynamics (24a): this equation simply states that the velocity is the time derivative of the position.
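This structured state derivative can be sketched as follows; the linear `net` is a hypothetical stand-in for the trained network $\mathcal{N}(v, u)$, used only so the example runs.

```python
def emps_derivative(x, u, net):
    """Structured state derivative: position rate is the velocity, velocity rate is learned."""
    p, v = x
    return [v, net(v, u)]

# hypothetical stand-in for the trained network N(v, u): input force minus viscous term
net = lambda v, u: u - 2.0 * v
dx = emps_derivative([0.0, 1.0], 0.5, net)
```

Only the second component of the derivative is learned; the first is fixed by physics.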
The neural network used for this benchmark has two input units; a hidden layer with 64 linear units followed by a ReLU nonlinearity; and one linear output unit.
5.3.2 Truncated Simulation Error Minimization
Algorithm 1 is executed with a constant learning rate for a fixed number of iterations, batch size, and subsequence length. The components of the hidden state variables associated with the position $p$ are initialized to the measured position sequence, while the components associated with the velocity $v$ are initialized to the forward-difference approximation of its time derivative. Model equations (9) are numerically integrated using the fourth-order Runge-Kutta scheme RK44 (Ralston, 1962). The total runtime of Algorithm 1 is seconds.