Continuous-time system identification with neural networks: model structures and fitting criteria

06/03/2020 ∙ by Marco Forgione, et al. ∙ IDSIA 0

This paper presents tailor-made neural model structures and two custom fitting criteria for learning dynamical systems. The proposed framework is based on a representation of the system behavior in terms of continuous-time state-space models. The sequence of hidden states is optimized along with the neural network parameters in order to minimize the difference between measured and estimated outputs, and at the same time to guarantee that the optimized state sequence is consistent with the estimated system dynamics. The effectiveness of the approach is demonstrated through three case studies, including two public system identification benchmarks based on experimental data.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In recent years, deep learning has advanced at a tremendous pace and is now the core methodology behind cutting-edge technologies such as speech recognition, image classification and captioning, language translation, and autonomous driving

(Schmidhuber, 2015; LeCun et al., 2015; Bengio and others, 2009). These impressive achievements are attracting ever increasing investments both from the private and the public sector, fueling further research in this field.

A good deal of the advancement in the deep learning area is publicly accessible, both in terms of scientific publications and software tools. For instance, highly optimized and user-friendly deep learning frameworks are readily available (Paszke et al., 2017; Abadi and others, 2015)

, and are often distributed under permissive open-source licenses. Using the high-level functionalities of a deep learning framework and following good practice, even a novice user can tackle

standardmachine learning tasks (once considered extremely hard) such as image classification with moderate effort.

An experienced practitioner can employ the same deep learning framework at a lower level to tackle non-standard learning problems, by defining customized models and objective functions to be optimized, and using operators such as neural networks as building blocks. The practitioner is free from the burden of writing optimization code from scratch for every particular problem, which would be tedious and error-prone. In fact, as a built-in feature, modern deep learning engines can compute the derivatives of a supplied objective function with respect to free tunable parameters by implementing the celebrated back-propagation algorithm (Rumelhart et al., 1988). In turn, this enables convenient setup of any gradient-based optimization method.

An exciting, challenging—and yet partially unexplored—application field is system identification with tailor-made model structures and fitting criteria. In this context, neural networks can be used to describe uncertain components of the dynamics, while retaining structural (physical) knowledge, if available. Furthermore, the fitting criterion can be specialized to take into account the modeler’s ultimate goal, which could be prediction, failure detection, state estimation, control design, simulation, etc.

The choice of the cost function may also be influenced by computational considerations. In this paper, in particular, models are assessed in terms of their simulation performance. In this setting, from a theoretical perspective, simulation error minimization is generally the best fitting criterion. However, evaluating the simulation error loss and its derivative may be prohibitively expensive from a computational perspective for dynamical models containing neural networks. Furthermore, simulation over time has an intrinsically sequential nature and offers limited opportunities for parallelization.

In this paper, we present two fitting algorithms whose runtime is significantly faster than full simulation error minimization, but still provide models with high simulation performance.

In the first approach, called truncated simulation error minimization, the neural dynamical model is simulated over (mini)batches of subsequences extracted from the training dataset. This allows parallelization of the model simulation over the different subsequences in a batch and also results in reduced back-propagation cost with respect to a full open-loop simulation. Special care is taken to provide consistent initial conditions for all the simulated subsequences. In fact, these initial conditions are optimized along with the neural network parameters according to a dual objective. Specifically, the batch cost function takes into account both the distance between the simulated output and the measured output —the fitting objective— and the consistency of all the initial condition variables with the neural model equation — the initial state consistency objective. This cost function is iteratively minimized over the randomly extracted batches through a gradient-based optimization algorithm.

In the second approach, called soft-constrained integration

, the neural dynamical model is enforced by a regularization term in the cost function penalizing the violation of a numerical ODE integration scheme applied to the system’s (hidden) state variables. These state variables, together with the neural network parameters, are tuned with the dual objective of fitting the measured data and minimizing the penalty term associated with the violation of the numerical integration scheme. In the soft-constrained integration method, simulation through time is thus completely circumvented and the loss function splits up into independent contribution for each time step. This enables a fully parallel implementation of gradient-based optimization.

The use of neural networks in system identification has a long history. For instance, neural AutoRegressive Moving Average with eXogenous inputs models were discussed in Chen et al. (1990). Training was performed with the one-step prediction error method (Ljung, 1978) previously developed in the context of linear dynamical systems. In (Horne and Giles, 1995), several Recurrent Neural Network structures trained by Back-Propagation Through Time are evaluated on a nonlinear system identification task. Although the overall reasoning in these earlier works is similar to ours, their results are hardly comparable with our current contribution, given the huge gap of hardware/software technology.

More recently, a few interesting approaches exploiting modern deep learning concepts and tools for system identification have been proposed. For instance, (Wang, 2017) and (Andersson et al., 2019) discuss the use of Long Short-Term Memory networks and 1-D Convolutional Neural Networks, respectively. Compared to these recent contributions, our work focuses on specialized neural model structures for the identification task at hand. Furthermore, we pose the identification problem directly in a continuous-time setting, which offers several advantages over the discrete-time counterpart (Garnier and Young, 2014; Piga, 2018). First, the majority of physical systems are naturally modeled in a continuous-time framework. Embedding physical knowledge in continuous-time model structures is thus more intuitive, and inspecting a continuous-time identified model is more insightful as some parameters may retain a physical meaning. Second, continuous-time models can handle the case of non-uniformly sampled data. Last, continuous-time identification is generally immune from the numerical issues affecting discrete-time methodologies in the case of high sampling frequency.

The connection between deep learning and dynamical system theory is currently under intensive investigation, see e.g., (Weinan, 2017) and cross-contamination is yielding substantial advances to both fields. On the one hand, modern deep learning architectures are interpreted as discrete-time approximations of an underlying continuous-time neural dynamical system. Exploiting this parallel, (Haber and Ruthotto, 2017) and (Ruthotto and Haber, 2019) analyze the stability properties of existing deep neural architectures through the lens of system theoretic tools. Modified architectures guaranteeing stability by design are also proposed. On the other hand, contributions such as Raissi et al. (2019); Long et al. (2019); Rackauckas et al. (2020)

showcase the potential of neural networks for data-driven modeling of dynamical systems described by ordinary and partial differential equations. With respect to these contributions, our aim is to devise computationally efficient fitting strategies for neural dynamical models that are robust to the measurement noise.

The rest of this paper is structured as follows. The overall settings and problem statement is outlined in Section 2. The neural dynamical model structures are introduced in Section 3 and criteria for fitting these model structures to training data are described in Section 4. Simulation results are presented in Section 5 and can be replicated using the codes available as on-line supplementary material. Conclusions and directions for future research are discussed in Section 6.

2 Problem Setting

We are given a dataset consisting of input samples and output samples , gathered at time instants from an experiment on a dynamical system . The data-generating system is assumed to have the continuous-time state-space representation


where is the system state at time ; denotes the time derivative of ; is the noise-free output; is the system input; , and are the state and output mappings, respectively. The measured output at time instant , , is corrupted by a zero-mean noise , i.e., . We assume that the input can be reconstructed (or reasonably approximated) for all time instants from the samples .

In this paper, we introduce flexible neural model structures that are suitable to represent generic dynamical systems as (1), allowing the modeler to embed domain knowledge to various degrees and to exploit neural networks to describe unknown model components. Furthermore, we present fitting criteria and algorithms to train these neural dynamical model structures. Overall, we aim at fitting a neural network model that exploits the dynamic constraints and different forms of available prior knowledge of the data-generating system (1).

3 Neural dynamical models

Let us consider a model structure , where

represents a dynamical model parametrized by a real-valued vector

. We refer to neural model structures as structures where some components of the model are described by neural networks. In the following, we introduce possible neural model structures for dynamical systems.

3.1 General state-space model

A general state-space neural model structure has form


where and are feedforward neural networks of compatible size parametrized by . For notation simplicity, the time dependence of all signals in (2) is omitted.111In the rest of the paper, the time dependence of signals will be specified only when necessary.

The general structure (2) can be tailored for the identification task at hand. Examples are illustrated in the remainder of this section.

3.2 Incremental model

If a linear approximation of the system is available, an appropriate model structure could be


where , , and are matrices of compatible size describing the linear system approximation. For example, the values of these matrices can be estimated from the available training dataset using well-established algorithms for linear system identification Ljung (1999); Van Overschee and De Moor (1994). Although model (3) is not more general than (2), it could be easier to train as the neural networks and are supposed to capture only residual (nonlinear) dynamics.

3.3 Fully-observed state model

If the system state is known to be fully observed, the most convenient representation is


where only the state mapping neural network is learned, while the output mapping is fixed to identity.

3.4 Physics-based model

Tailor-made architectures could be used to embed specific physical knowledge in the neural model structure.

For instance, let us consider the Cascaded Tanks System (CTS) schematized in Figure 1. The CTS is a fluid level control system consisting of two tanks with free outlets fed by a pump. Water is pumped from a bottom reservoir into the upper tank by a controlled pump. The water in the upper tank flows through a small opening into the lower tank, and from another small opening from the lower tank to the reservoir.


Figure 1: Schematics of the cascaded two-tank system.

The system input is the water flow from the bottom reservoir feeding the upper tank. The state variables and are the water level in the upper and lower tanks, respectively.222With some abuse of notation, the subscripts in and simply denote the variable name and not a time index. The same notation will be used for the examples in Section 5. As in the illustrative example further discussed in Section 5, we consider the case where only the second state is measured.

The following dynamical model for the CTS can be derived from conservation laws and Bernoulli’s principle (Schoukens and Noël, 2017):


where , , , and are fixed coefficients.

Based on this physical knowledge, an appropriate neural dynamical model for the CTS is


capturing the information that: () the system dynamics can be described by a two-dimensional state-space model; () the state is measured; () the state does not depend on ; and the state does not depend directly on , given the current value of . This neural model takes advantage of the available process knowledge, while leaving representational capabilities to describe unmodeled effects such as fluid viscosity, nonlinearities of the actuators, and water overflow that may happen when the tanks are completely filled.

Embedding physical knowledge in neural model structures is a very active and promising trend in deep learning (Raissi et al., 2019). For instance, recent contributions propose specialized structures that are suitable to describe systems satisfying general physical principles such a energy conservation. In these cases, a physics-based neural network may be used to learn the system’s Hamiltonian or Lagrangian function, instead of the individual components its ODE representation as independent terms (Greydanus et al., 2019; Lutter et al., 2019).

In general, including domain knowledge in the model structure is useful to restrict the search space, while leaving enough representation capacity. Therefore, a better generalization performance is expected from the trained models. Furthermore, the estimated models exactly satisfy (by design) the prior assumptions and thus they are generally easier to diagnose, interpret and exploit for advanced tasks such as state estimation, fault detection and closed-loop control.

4 Training neural dynamical models

In this section, we present algorithms aimed at fitting the model structures introduced in Section 3 to the training dataset .

For fixed values of neural network parameters , for given initial condition , and under the model structure (2), the open-loop state simulation is the solution of the Cauchy problem:


and the simulated output is


Different ODE solution schemes (Quarteroni et al., 2010), may be applied to numerically solve problem (7). Formally,


will represent the solution of the Cauchy problem (7) obtained using a numerical scheme of choice (explicit or implicit, single-step or multi-step, single-stage or multi-stage) denoted as ODEINT.

The neural network parameters can be obtained by minimizing the simulation error norm


with respect to both the network parameters and the state initial condition , with defined by Equations (8)-(9).

It is worth remarking that, when ODEINT is an explicit ODE solver, the full computational graph producing the cost function (10) can be constructed using standard differentiable blocks. For example, Figure 2 represents the computational graph obtained by applying a forward Euler scheme with constant step size , assuming that the measurements in are collected at the same rate . It is interesting to note the similarity between this computational graph and the one of the residual network structure (He et al., 2016).


Figure 2: Computational graph associated with the forward Euler ODE integration scheme with constant step size applied to system (2). Measurements are assumed to be equally spaced with the same constant rate .

Therefore, in the case of an explicit ODE solver, the derivatives of the loss with respect to and can be obtained using standard back-propagation through the elementary solver steps. Thus, a procedure minimizing can be implemented using available deep learning software.

Recently, an alternative approach to differentiate through the ODE solution based on backward time integration of adjoint sensitivities has been proposed Chen et al. (2018). Following this approach, implicit ODE solvers may be adopted as well.

In either case, from a computational perspective, simulating over time has an intrinsically sequential nature and offers scarce opportunities for parallelization. Thus, in practice, minimizing the simulation error with a gradient-based method over the entire training dataset may be inconvenient or even unfeasible in terms of memory allocation and computational time.

Remark 1

In many cases, optimizing the initial condition is not necessary in simulation error minimization as this quantity may be available from physical considerations. For instance, in the cascaded tanks system presented above, the initial values of the tank levels may be known. Even when the initial condition is unknown, its effect may be negligible, as in the case for measurements collected from a fading memory system on a sufficiently long time horizon .

4.1 Truncated simulation error minimization

In order to reduce the computational burden and the wall-clock execution time of the full simulation error minimization approach previously discussed, the simulated output can be obtained by simulating the system state in (9) on several smaller portions of the training set .

For efficient implementation, the truncated simulation error minimization algorithm presented in this section processes batches containing subsequences extracted from . In principle, the simulations can be carried out simultaneously for all the subsequences in the batch by exploiting parallel computing.

A batch is completely specified by a batch starting index vector defining the initial sample of each subsequence and a sequence duration defining the number of samples contained in each subsequence, where each element of satisfies . Thus, for instance, the -th output subsequence in a batch contains the measured output samples (see Figure 3 for graphical representation).


Figure 3: Representation of two subsequences of length extracted from the training dataset . and represent the starting indexes of subsequences and , respectively.

For notational convenience, we can arrange the batch data in the following tensors:

  • output , where

  • input , where

  • relative time , where ,

for and . Furthermore, we define the tensors

  • simulated state

  • simulated output ,

that will be used to store simulated state and output values at the corresponding batch time instants.

Note that the third dimension of the output tensor corresponds to the output channel. With some abuse of notation, the tensor is addressed by only two indexes because we do not need to specify a particular output channel. The same notation is used for tensors , , and .

For the subsequences , the state evolution in the relative time intervals (which corresponds to the absolute time interval ) is the solution of the Cauchy problem


for a given initial condition . The solution of the Cauchy problem (11) is numerically approximated by

and the simulated output for the -th subsequence is then given by

We can now arrange the simulated state and output at the measurement time instants in the tensors and , respectively:


for and .

As opposed to the full simulation error minimization case, the choice of appropriate initial conditions in the Cauchy problem (11) is here critical. Indeed, the effect of these initial conditions cannot in general be neglected as the duration of a subsequence is typically much shorter than the whole dataset . Furthermore, the system state is unlikely to be known a priori at all different time instants. For this reason, we introduce an additional set of free variables , where represents the (unmeasured) system state at the measurement time . For a given batch, the set of subsequences’ initial conditions is then constructed as (with ) and optimized along with the neural network parameters in order to minimize the fitting cost


It is important to remark that the simulated output is a function of both the neural network parameters and the initial conditions .

Since the fitting cost defined above has

additional degrees of freedom w.r.t. the full simulation error minimization case, a price could be paid in terms of lack of generalization of the estimated model.

In order to reduce the degrees of freedom in the minimization of the loss , the variable used to construct the initial conditions for the Cauchy problem (11) can be enforced to represent the unknown system state and thus to be consistent with the neural model structure (2a) (namely, the state sequence should satisfy the neural ODE equation (2a)). To this aim, we introduce a regularization term penalizing the distance between the state (simulated through (12b)) and the optimization variables in . Specifically, the regularization term is defined as


where is a tensor with the same size and structure as , but populated with samples from , i.e., .

The overall loss is then constructed as a weighted sum of the two objectives, namely:


with regularization weight .

Inputs: training dataset ; number of iterations ; batch size ; length of subsequences ; learning rate ; regularization weight .  

  1. [label=0., ref=0]

  2. initialize the neural network parameters and the hidden state sequence ;

  3. for do

    1. [label=2.0., ref=2.0]

    2. select batch start index vector ;

    3. populate tensors

      and set of initial conditions

    4. simulate state and output

      for ;

    5. populate tensors and as

      for and ;

    6. compute the cost

    7. evaluate the gradients and at the current values of and ;

    8. update optimization variables and :

  Output: neural network parameters .

Algorithm 1 Truncated simulation error minimization

Algorithm 1 details the steps required by the proposed truncated simulation error method to train a dynamical neural model via gradient-based optimization.

In Step 1, the neural network parameters and the “hidden” state variables in are initialized. For instance, the initial values of and can be set to small random numbers. Alternatively, if a measurement/estimate of some of the system’s state variables is available, it can be used to initialize certain components of .

Then, at each iteration of the gradient-based training algorithm, the following steps are executed. Firstly, the batch start vector is selected with (Step 2.1). The indexes in can be either (pseudo)randomly generated, or chosen deterministically.333For an efficient use of the training data , has to be chosen in such a way that all samples are visited with equal frequency during the iterations of Algorithm 1. Then, tensors , , and are populated with the corresponding values in and . The initial conditions , with are also obtained from (Step 2.2). Subsequently, for each subsequence , the system state and output are simulated within the time interval , starting from the initial condition (Step 2.3). Then, the values of the simulated state and output at the (relative) measurement times (with ) are collected in tensors and (Step 2.4), and used to construct the loss (Step 2.5). Next, the gradients of the cost with respect to the optimization variables , are obtained in (Step 2.6), either by back-propagation through the solver steps or exploiting the adjoint sensitivity method suggested in (Chen et al., 2018). Lastly, the optimization variables are updated via gradient descent with learning rate (Step 2.7). Improved variants of gradient descent such as Adam (Kingma and Ba, 2015) can be alternatively adopted at Step 2.7.

Note that, at each iteration of the gradient descent algorithm, the cost may depend on just a subset of hidden state variables . Thus, only for those components at each iteration the gradient vector is non-zero and an update is performed.

Remark 2

The computational cost of evaluating the gradient of is proportional to the number of solver steps executed in (12a). In the case the solver step is equal to the sampling intervals (which corresponds to solver steps), running truncated simulation error minimization with is thus significantly faster than full simulation error minimization. Furthermore, the computations for the subsequences can be carried out independently, and thus parallel computing can be exploited.

In the next paragraph, we introduce an alternative method for fitting neural dynamical models that does not require simulation through time, and thus is even more convenient from a computational perspective, allowing full parallelization at time step level.

4.2 Soft-constrained integration

The optimization variables previously introduced for truncated simulation error minimization are regularized to be consistent with the fitted system dynamics through the cost  (15). Therefore, implicitly provides another estimate of the unknown system state, and thus of the output for given . This estimate can be compared with the measured samples to define an alternative fitting objective. This suggests the following variant for the fitting term :


where .

Experimentally, using (17) instead of as fitting term does not significantly alter the properties of truncated simulation error minimization. Furthermore, the wall-clock execution time of Algorithm 1 with the modified fitting term (17) is still dominated by the -step simulation (and back-propagation) still required to compute the gradient of the regularization term in (15). In order to formulate a faster learning algorithm, an alternative regularization term promoting consistency of the hidden state variables, but not requiring time simulation should be devised.

The consistency-promoting regularizer considered in the soft-constrained integration method penalizes the violation of a numerical ODE integration scheme applied to the hidden state variables , independently at each integration step. For instance, the forward Euler scheme can be enforced by means of the regularization term


assuming for notation simplicity a constant step size . If the regularization term (18) is reduced to a “small value” through optimization, then the hidden variables will (approximately) satisfy the forward Euler scheme.

Inputs: training dataset ; number of iterations ; batch size ; length of subsequences ; learning rate ; regularization weight .  

  1. [label=0., ref=0]

  2. initialize the neural network parameters and the hidden state sequence ;

  3. for do

    1. [label=2.0., ref=2.0]

    2. select batch start index vector ;

    3. populate tensors

    4. compute the cost

      with ;

    5. evaluate the gradients and at the current values of and ;

    6. update optimization variables and :

  Output: neural network parameters .

Algorithm 2 Soft-constrained integration for the forward Euler scheme

Algorithm 2 details the training steps when the forward Euler numerical scheme is used to enforce consistency of the hidden state variables with the model dynamics. In Step 1, the neural network parameters , and the sequence of hidden variables are initialized. Then, at each iteration of the gradient-based training algorithm, the following steps are executed. Firstly, the batch start vector is selected with (Step 2.1) and the tensors , , are populated with the corresponding samples in (Step 2.2), similarly as in Algorithm 1

Then, the loss is computed (Step 2.3) as a weighted sum of the fitting cost in (17) and the regularizer  (18). Note that, unlike Algorithm 1, Algorithm 2 does not require -step simulation. The gradients of are obtained using standard back-propagation and used to perform a gradient-based minimization step (Steps 2.4 and 2.5).

The potential advantage of the proposed soft-constrained integration method over the truncated simulation error minimization is twofold. Firstly, implicit integration schemes can be enforced with no additional computational burden with respect to explicit ones. For instance, the backward Euler integration scheme can be implemented simply by modifying the consistency term to


while the Crank-Nicolson scheme corresponds to


Other implicit schemes such as the multi-step Backward Differentiation Formula (BDF) and Adams-Moulton (AM), or multi-stage implicit Runge Kutta (RK) methods may be similarly implemented, leading to a potential increase in the accuracy of the ODE numerical solution (Quarteroni et al., 2010). Secondly, the resulting cost function splits up as a sum of independent contributions for each time step, thus enabling the fully parallel implementation of gradient-based optimization. This leads to significant computational advantages and a reduced wall-clock execution time.

On the other hand, in the proposed soft-constrained integration method, the numerical scheme is only approximately satisfied at each solver step, and the degree of violation eventually depends on the weighting constant in the cost function. The tuning of the weight is thus more critical as compared to truncated simulation error minimization.

5 Case studies

The performance of the model structures and fitting algorithms introduced in this paper are tested on three cases studies considering the identification of a nonlinear RLC circuit; a Cascaded Tanks System (CTS); and an Elecro-Mechanical Positioning System (EMPS).

Code availability  
The software implementation is based on the Python programming language and the PyTorch Deep Learning Framework (Paszke et al., 2017). All the codes required to reproduce the results reported in the paper are available in the GitHub repository

Dataset availability  
The RLC circuit dataset is simulated and available together with the code in the on-line repository. Experimental datasets for the CTS and the EMPS case studies are obtained from the website, which hosts a collection of public benchmarks widely used in system identification.

For all case studies, the performance of the fitting algorithms is assessed in terms of the index of the model simulation:

where .

For the CTS, the Root Mean Squared Error (RMSE) of the model simulation is also provided, as this is the performance index suggested in the description of the benchmark (Schoukens and Noël, 2017):

The performance indexes are evaluated both on the training dataset and on a separate test dataset. In the case of systems with multiple output channels, the metrics are computed channel-wise.

Algorithm settings  
In truncated simulation error minimization (Algorithm 1), the batch size and sequence length are chosen within the integer range , while the regularization weight in (16) is always set to 1.

In the soft-constrained integration method (Algorithm 2), we consider instead a single subsequence containing all the dataset samples, i.e., and . The weight constant is tuned based on the simulation performance of the identified model in the training dataset.

In all the examples, the Adam optimizer is used for gradient-based optimization. The learning rate is adjusted through a rough trial-and-error within the range , while the other optimizer parameters are left to default. The number of training steps is chosen sufficiently high to reach a cost function plateau.

The neural networks’ weight parameters are initialized to random Gaussian variables with zero mean and standard deviation

, while the bias terms are initialized to zero. The hidden state variables are initialized differently for the three examples, exploiting available process knowledge.

Hardware configuration  
All computations are carried out on a PC equipped with an Intel i5-7300U 2.60 GHz processor and 32 GB of RAM.

5.1 Nonlinear RLC circuit

We consider the nonlinear RLC circuit in Figure 4 (left).

[width=0.35]fig/RLC/RLC.pdf [width=0.35]fig/RLC/RLC_characteristics.pdf
Figure 4: Nonlinear series RLC circuit used in the example (left) and nonlinear dependence of the inductance on the inductor current (right).

The circuit behavior is described by the continuous-time state-space equation


where is the input voltage; is the capacitor voltage; and is the inductor current. The circuit parameters and  nF are fixed, while the inductance depends on as shown in Figure  4 (right) and according to

with . This dependence is typically encountered in ferrite inductors operating in partial saturation (Di Capua et al., 2017).

In this case study, we assume both state variables and to be measured. A training dataset with samples is built by simulating the system for with a fixed step . The input

is a filtered white noise with bandwidth

and standard deviation . An independent test dataset is generated using as input a filtered white noise with bandwidth and standard deviation . In the training dataset, the observations of and are corrupted by an additive white Gaussian noise with zero mean and standard deviation and , respectively. This corresponds to a Signal-to-Noise Ratio (SNR) of 20 dB and 13 dB on and , respectively.

5.1.1 Neural Model Structure

By considering as state vector and input , we adopt for this system the fully observed state model structure (4) introduced in Section 3 reported for convenience below:

This model structure embeds the knowledge that the system has a second-order state-space representation and the whole state vector is measured.

The neural network used in this example has feed-forward structure with three input units (corresponding to , , and

); a hidden layer with 64 linear units followed by ReLU nonlinearity; and two linear output units—the components of the state equation to be learned.

5.1.2 Truncated Simulation Error Minimization

Algorithm 1 is executed with learning rate , number of iterations is , and batches containing subsequences, each one of size . The hidden state variables are initialized to the values of the noisy output measurements. Model equations (11) are numerically integrated using the forward Euler numerical scheme. The total run time of Algorithm 1 is 142 seconds.

Time trajectories of the true and model output are reported in Figure 5. For the sake of visualization, only a portion of the test dataset is shown. The fitted model describes the system dynamics with high accuracy. On both the training and the test datasets, the index in simulation is above for and for . For the sake of comparison, a second-order linear Output Error model estimated using the System Identification Toolbox (Ljung and Singh, 2012) achieves an index of for and for on the training dataset, and for and for on the test dataset.


Figure 5: RLC circuit: true output (black) and estimated output (red) obtained by the state-space model trained by truncated simulation error minimization.

5.1.3 Full Simulation Error Minimization

Full simulation error minimization is also tested. This method yields the same performance of truncated simulation error minimization in terms of index of the fitted model. However, the runtime required to execute iterations and reach a cost function plateau is about two hours.

5.1.4 One-Step Prediction Error Minimization

For the fully-observed neural model structure, a straightforward fitting criterion may be defined by taking the noisy output measurement as a state estimate and by minimizing the one-step prediction error loss of the AutoRegressive with eXogenous input (ARX) model structure:


Minimization of (22) corresponds to the training of a standard feedforward neural network with target and features , .

Since the neural network is fed with noisy input data and only the 1-step ahead is minimized, this approach is not robust to the measurement noise. On this RLC example, the fully-observed state neural model structure trained by minimizing achieves an index of for and for in the test dataset (see time trajectories in Figure 6).

By repeating the fitting procedure on a noise-free RLC training dataset, one-step prediction error minimization recovers the same performance of simulation error minimization ( index of and for and , respectively).


Figure 6: RLC circuit: true output (black) and estimated output (red) obtained by the state-space model trained by one-step prediction error minimization.

5.2 Cascaded Tanks System

We consider the CTS already introduced in Section 3 and described in details in (Schoukens and Noël, 2017).

The training and test datasets contain 1024 points each, collected at a constant sampling time . In both datasets, the input is a multisine signal with identical power spectrum, but different realization. Input and output values are in Volts as they correspond to the actuator commands and the raw sensor readings, respectively.

The initial state of the system is not provided, but it is known to be the same for both datasets. Thus, as suggested in (Schoukens and Noël, 2017), we use the initial state estimated on the training dataset (which is a by-product of the proposed fitting procedures) as initial state for model simulation in test.

5.2.1 Neural Model Structure

The neural model structure used for this system is


Compared to (6), model (23) also includes a direct dependency on in the second state equation. This dependency is added to take into account that in this experimental setup, in case of water overflow from the upper tank, part of the overflowing water may go directly in the lower tank (Schoukens and Noël, 2017).

The neural networks and have two and three input units, respectively. Both networks one hidden layer with 100 linear units followed by ReLU nonlinearity and a linear output unit.

5.2.2 Truncated Simulation Error Minimization

Algorithm 1 is executed with learning rate , number of iterations is , batch size , and subsequence length . The components of the hidden state variables associated to the state variable are initialized to , while the components associated to are initialized to the noisy output measurements . Model equations (9) are numerically integrated using the forward Euler numerical scheme. The total runtime of Algorithm 1 is 533 seconds.

Time trajectories of simulated and true output are reported in Figure 7. The achieved index is and on the training and on the test dataset, respectively. The RMSE index is  V and  V on the identification and on the test dataset, respectively. These results compare favorably with state-of-the-art black-box nonlinear identification methods applied to this benchmark (Relan et al., 2017; Svensson and Schön, 2017; Birpoutsoukis et al., 2018). To the best of our knowledge, the best previously published result was obtained in (Svensson and Schön, 2017) using a state-space model with priors for basis function expansion inspired by Gaussian Processes, and trained using a sequential Monte Carlo method. The authors of (Svensson and Schön, 2017) report an RMSE index of  V on the test dataset. A higher performance was reported only in (Rogers et al., 2017) for grey-box models including an explicit physical description of the water overflow phenomenon (RMSE index of  V).

Note that the largest discrepancies between measured and simulated output are noticeable towards the end of the test experiment, where the measured output is close to  V. This condition is not encountered in the training dataset and therefore a mismatch can be expected for a black-box nonlinear model such as a neural network.

[width=.45]fig/CTS/CTS_id_model_SS_128step.pdf [width=.45]fig/CTS/CTS_test_model_SS_128step.pdf
Figure 7: CTS: measured output (black) and model simulation (red) obtained by the neural model trained by truncated simulation error minimization.

5.2.3 Soft-constrained integration method

Algorithm 2 is executed with the regularization term in (20) enforcing the Crank-Nicolson integration scheme and a regularization constant . For this small dataset, all time steps fit into the memory and can be processed altogether in a batch. Thus, we consider batches with a single subsequence () containing the whole training dataset . Optimization is performed over iterations of the Adam algorithm, with learning rate . The total runtime of Algorithm 2 is 271 seconds, approximately half the runtime time of Algorithm 1.

Time trajectories of the output are reported in Figure 8 for both the training and the test dataset. The index of the model is and on the training and on test dataset, respectively. The RMSE index is  V and  V on the training and on the test datasets, respectively. The results are thus in line with the ones achieved by truncated simulation error minimization.

[width=.45]fig/CTS/CTS_id_model_SS_soft.pdf [width=.45]fig/CTS/CTS_test_model_SS_soft.pdf
Figure 8: CTS: measured output (black) and model simulation (red) obtained by the neural model trained using the soft-constrained integration method.

5.3 Electro-Mechanical Positioning System

As a last case study, we consider the identification of the Electro-Mechanical Positioning System (EMPS) described in (Janot et al., 2019).

The system is a controlled prismatic joint, which is a common component of robots and machine tools. In the benchmark, the system input is the motor force  (N) expressed in the load side and the measured output is the prismatic joint position  (m). A physical state-space model for the system is


where  (kg) is the joint mass,  (Ns/m) is the dynamic friction coefficient and  (N) is the static friction. The benchmark is challenging due to (i) the unknown friction behavior and (ii) the marginally stable (integral) system dynamics.

The identification and test dataset are constructed from closed-loop experiments performed with the same reference position trajectory. A force disturbance is acting on the system in the test experiment only. The two datasets have the same duration (approximately 25 seconds) and are collected at a sampling frequency of  kHz. In this paper, the original EMPS signals are decimated by a factor . Thus, each dataset contains points with sampling time  ms.

5.3.1 Neural Model Structure

According to the physical model (24), the neural model structure used to fit the EMPS system is


with state variables and ; and input .

The neural model structure (25) captures the physical knowledge that the system states are position and velocity; the derivative of position is velocity; and the velocity dynamics does not depend on the position. The neural network is thus used to describe the velocity dynamics (24b), which could be rather complex due to the presence of static friction. Indeed, static friction is highly nonlinear and hard to describe with first-principles formulas. On the other hand, there is no need to use a black-box model to describe the position dynamics (24a). In fact, this equation simply states that velocity is the time derivative of position.

The neural network used for this benchmark has 2 input units; 64 hidden linear units followed by ReLU nonlinearity; and one linear output unit.

5.3.2 Truncated simulation error method

Algorithm 1 is executed with learning rate , number of iterations is , batch size , and sequence length . The components of associated to are initialized to the measurement position sequence, while the components associated to are initialized to the forward difference approximation of its time derivative. Model equations (9) are numerically integrated using the fourth-order Runge-Kutta scheme RK44 (Ralston, 1962). The total runtime of Algorithm 1 is  seconds.

Time trajectories of the input and of the output are reported in Figure 9. The achieved index is above on both the identification and test datasets. By comparison, Janot et al. (2019) reports an index of for different linear models estimated on this benchmark.

[width=.45]fig/EMPS/EMPS_SS_id_model_SS_64step_RK.pdf [width=.45]fig/EMPS/EMPS_SS_test_model_SS_64step_RK.pdf
Figure 9: EMPS: measured position (top panels, black) and model simulation