This paper introduces dynoNet, a neural network architecture tailored for sequence modeling and dynamical system learning (a.k.a. system identification). The network is designed to process time series of arbitrary length and contains causal linear time-invariant (LTI) dynamical operators as building blocks. These LTI layers are parametrized in terms of rational transfer functions, and thus apply infinite impulse response (IIR) filtering to their input sequence. In the dynoNet architecture, the LTI layers are combined with static (i.e., memoryless) non-linearities which can be either elementary activation functions
applied channel-wise; fully connected feed-forward neural networks; or other differentiable operators (e.g, polynomials). Both the LTI and the static layers defining adynoNet are in general multi-input-multi-output (MIMO) and can be interconnected in an arbitrary fashion.
Overall, the dynoNet architecture can represent rich classes of non-linear, causal dynamic relations. Moreover, dynoNet networks can be trained end-to-end by plain back-propagation using standard deep learning (DL) software. Technically, this is achieved by introducing the LTI dynamical layer as a differentiable operator, endowed with a well-defined forward and backward behavior and thus compatible with reverse-mode automatic differentiation . Special care is taken to devise closed-form expressions for the forward and backward operations that are convenient from a computational perspective.
1.2 Related Works
To the best of our knowledge, LTI blocks with an IIR have never been considered as differentiable operators for back-propagation-based training to date. Among the layers routinely applied in DL, the 1-D Convolution  is the closest match. In particular, the 1D causal Convolution layer described in [2, 1] corresponds to the filtering of an input sequence through a causal Finite Impulse Response (FIR) dynamical system. The dynoNet architecture may be seen as a generalization of the causal 1D Convolutional Neural Network (CNN) enabling IIR filtering, owing to the description of the dynamical layers as rational transfer functions. This representation allows modeling long-term (actually infinite) time dependencies with a smaller number of parameters with respect to 1D Convolutional networks. Furthermore, filtering through rational transfer function can be implemented by means of recurrent linear difference equations. While this operation is not as highly parallelizable as FIR filtering, the total number of computations required is generally lower.
The dynoNet architecture has also analogies with Recurrent Neural Network (RNN)  architectures. As in RNNs, a dynamic dependency is built exploiting recurrence equations. However, in an RNN the basic computational unit is a neural cell (e.g., Elman, LSTM, GRU) that processes a single time step. The network’s computational graph is then built by repeating the same cell for all the steps of the timeseries. Processing long timeseries through an RNN is often computational expensive and presents limited opportunities for parallelization, due to the sequential nature of the computational graph. Conversely, in dynoNet
a lightweight (linear) recurrence equation is “baked into” the elementary LTI blocks, that naturally operate on time series in a vectorized fashion. While internally these layers require certain sequential operations (details are given in Section3), the overall computational burden is sensibly lower than the one of typical RNNs. Moreover, the computations performed by the static layers of a dynoNet are highly parallelizable, as they are independent for each time step. Therefore, more complex transformations may be included in the static layer.
Thus, compared to 1D Convolutional and Recurrent neural architectures, dynoNet is characterized by an intermediate level of computational complexity and representational power.
In the system identification literature, particular cases of the dynoNet architecture have been widely studied in the last decades within the so-called block-oriented modeling framework . In most of the contributions, shallow architectures based on single-input single-output (SISO) blocks are considered. For instance: the Wiener model is defined as the series connection of an LTI dynamical model followed by a static non-linearity ; the Hammerstein model is based on the reverse connection, with a static non-linearity followed by an LTI block ; the Wiener-Hammerstein (WH) model combines two SISO LTI blocks interleaved by a static SISO non-linearity in a sequential structure --; and the Hammerstein-Wiener (HW) has structure --. See Figure 1 for a visual representation of the aforementioned structures.
One notable exception is the deep architecture consisting in the repeated sequential connection of SISO blocks --- proposed in  and dubbed by its authors generalized Hammerstein-Wiener (Figure 2, left panel). Furthermore, the parallel Wiener-Hammerstein model  extends the classic WH model beyond the strictly SISO case. The parallel WH model has the same -- as the basic WH mentioned above. However, the first linear block is single-input-multi-output; the static non-linearity is multi-input-multi-output; and the second linear block is multi-input-single-output (Figure 2, right panel). Thus, the parallel WH model describes an input/output SISO dynamical system, but it leverages on an inner MIMO structure to provide additional flexibility.
From a DL perspective, the parallel WH extends the representation capabilities of the plain WH network by including several “neurons” in a single hidden layer, while the generalized HW model aims to the same result by stacking several layers, each one consisting of a single “neuron”. Interestingly, the universal approximation capability has been proven for the parallel WH structure[5, 15].
The dynoNet architecture encompasses all the previous block-oriented models as special cases. Other structures containing, e.g., multiple MIMO blocks and skip connection can be described within the dynoNet modeling framework. More importantly, the existing training methods for block-oriented models are custom-made for each specific architecture, requiring for instance analytic expressions of the Jacobian of the loss with respect to the training parameters. Conversely, the derivation of the differentiable dynamical layer allows us to train arbitrary dynoNet architectures using the plain back-propagation algorithm.
The entries of an -length vector are specified by subscript integer indices running from to , unless stated otherwise. The bold-face notation is reserved for real-valued -length vectors, generally representing time series with samples. For instance, is a -length vector with entries .
The time reversal of a -length vector is denoted as and defined as
The convolution between vectors and is defined as
The cross-correlation between vectors and is defined as
2 Linear Dynamical Operator
The input-output relation of an individual (SISO) dynamical layer in the dynoNet architecture is described by the dynamical rational operator as follows:
|where and are polynomials in the time delay operator (), i.e.,|
|and and are the input and output sequence values at time index .|
The filtering operation through in (4a) is equivalent to the input/output equation:
Based on the definitions of and , (5) is equivalent to the recurrence equation:
The tunable parameters of are the coefficients of the polynomials and . For convenience, these coefficients are collected in vectors and . Note that the first element of vector has index 1, while for all other vectors in this paper the starting index is 0.
In this paper, the operator is always initialized from rest, namely the values of and for are all taken equal to zero. Then, given an input sequence , (6) provides an univocal expression for the output sequence .
In practice, the operator in a dynoNet operates on finite-length sequences. Let us stack the input and output samples and in vectors and , respectively. With a slight abuse of notation, the filtering operation in (4) applied to is denoted as
The operation above is also equivalent to the convolution
where is a vector containing the first samples of the operator’s impulse response. The latter is defined as the output sequence generated by (6) for an input such that and .
In the MIMO case, the input and output at time are vectors of size and , respectively. The MIMO linear dynamical operator with input and output channels may be represented as a MIMO transfer function matrix whose element is a SISO rational transfer function such as (4). The components of the output sequence at time are defined as
The derivations for the dynamical layer are presented in the following in a SISO setting to avoid notation clutter. Extension to the MIMO case is straightforward and only requires repetition of the same operations for the different input/output channels. The computations for the different input/output channels are independent and therefore may be performed in parallel.
Note that the software implementation available in our on-line GitHub repository fully supports the MIMO case.
3 Dynamical Operator as a Deep Learning Layer
In this section, the forward and backward operations required to integrate the linear dynamical operator in a DL framework are derived. The computational cost of these operations as measured by the number of multiplications to be executed is also reported. Furthermore, the possibility of parallelizing these computations is analyzed.
In the rest of this paper, the linear dynamical operator interpreted as a differentiable layer for use in DL is also referred to as -block. In our software implementation, the -block is implemented in the PyTorch DL framework as a class extending torch.autograd.Function, based on the forward and backward operations derived in the following.
3.1 Forward Operations
The forward operations of a -block embedded in a computational graph are represented by solid arrows in Figure 3. In the forward pass, the block filters an input sequence through a dynamical system with structure (4) and parameters and . The block output is a vector containing the filtered sequence:
The input of the -block may be either the training input sequence or the result of previous operations in the computational graph, while the output is an intermediate step towards the computation of a scalar output . The exact operations leading to are not relevant in this discussion, and thus they are not further specified.
When the filtering operation (9) is implemented using (6), the computational cost of the -block forward pass corresponds to multiplications. These multiplications can be parallelized for the + different coefficients at a given time step, but need to be performed sequentially for the time samples due to the recurrent structure of (6).
3.2 Backward Operations
The backward operations are illustrated in Figure 3 with dashed arrows. In the backward pass, receives the vector containing the partial derivatives of the loss w.r.t. , namely:
Given , the -block has to compute the derivatives of the loss w.r.t. its differentiable inputs , , and . Overall, the backward operation has the following structure:
Application of the chain rule leads to:
The required sensitivities , can be obtained in closed-form through additional filtering operations (see , Section 10.3). Specifically, an expression for is derived by differentiating the left and hand side of Eq. (5) w.r.t . This yields:
Thus, can be computed by filtering the input vector through the linear filter . Furthermore, the following condition holds:
Exploiting expression (15) for , the -th component of is obtained as:
This operation corresponds to the dot product of with shifted version of the sensitivity .
Overall, the computation of requires: () filtering through , which entails multiplications and () the dot products defined in (16), totaling multiplications. As for the filtering, the operations have to be performed sequentially for the different time steps due to the recursive structure of (13). After completion of the filtering, the dot product operations may be performed in parallel.
Following the same rationale above, we obtain a closed-form expression for the sensitivities , , by differentiating the left and right hand side of Eq. (5) with respect to . This yields:
Then, can be obtained by filtering the output through the linear filter . Furthermore, the following condition holds:
The -th component of is obtained as:
Input time series
Application of the chain rule yields:
From (7), the following expression for holds:
Plugging the expression above for into (20), we obtain
By definition, the expression above corresponds to the following cross-correlation operation:
However, direct implementation of (22) requires a number of operations proportional to .
Since represents the impulse response of the filter , (22) can be implemented more efficiently by filtering the vector in reverse time through , and then reversing the result, i.e.,
Neglecting the flipping operations, the computational cost of the backward pass for is thus equivalent to the filtering of an -length vector through , namely multiplications.
The effectiveness of the dynoNet architecture is evaluated on system identification benchmarks publicly available at the website www.nonlinearbenchmark.org. All the codes required to reproduce the results in this section are available on the GitHub repository https://github.com/forgi86/dynonet.git
The results achieved on the Wiener-Hammerstein , the Bouc-Wen , and the Electro-Mecanical Positioning System (EMPS)  benchmarks are presented in this paper. Other examples are dealt with in the provided codes.
In the following examples, the dynoNet is trained by minimizing the mean square of the simulation error and using the Adam algorithm  for gradient-based optimization. The number of iterations is chosen sufficiently large to reach a cost function plateau. The learning rate is adjusted by a rough trial and error. All static non-linearities following the -blocks are modeled as feed-forward Neural Networks with a single hidden layer containing 20 neurons and hyperbolic tangent activation function. The numerator and denominator coefficients of the linear dynamical
-blocks are randomly initialized from a uniform distribution with zero mean and range, while the feed-forward neural network parameters are initialized according to PyTorch’s default strategy. Note that several settings are kept constant across the benchmarks to highlight that limited tuning is needed to obtain state-of-the art identification results using the proposed dynoNet architecture.
computations are performed on a desktop computer equipped with an AMD Ryzen 5 1600x 6-core processor and 32 GB of RAM.
The identified models are evaluated in terms of the and Root Mean Square Error (RMSE) indexes defined as:
where is the measured (true) output vector; is the dynoNet model’s open-loop simulated output vector; and is the mean value of , i.e. .
4.1 Electronic Circuit with Wiener-Hammerstein Structure
The experimental setup used in this benchmark is an electronic circuit that behaves by construction as a Wiener-Hammerstein system . Therefore, a simple dynoNet architecture corresponding to the WH model structure is adopted. Specifically, the dynoNet model has a sequential structure defined by a SISO -block with ; a SISO feed-forward neural network; and a final SISO -block with .
The model is trained over iterations of the Adam algorithm with learning rate , by minimizing the MSE on the whole training dataset ( samples). The total training time is 267 seconds. On the test dataset ( samples), the dynoNet model’s performance indexes are and mV. The measured output and simulated output on a portion of the test dataset are shown in Figure 4, together with the simulation error . While specialized training algorithms for WH systems may provide even superior results on this benchmark (the best published result  reports mV), the dynoNet model trained by plain back-propagation achieves remarkably good performance.
4.2 Bouc-Wen System
The Bouc-Wen is a nonlinear dynamical system describing hysteretic effects in mechanical engineering and commonly used to assess system identification algorithms. The example in this section is based on the synthetic Bouc-Wen benchmark described in . The training and test datasets of the benchmark are obtained by numerical simulation of the differential equations:
where (N) is the input force; (N) is the hysteretic force; (mm) is the output displacement; and all other symbols represent fixed coefficients. The input and output signals are available at a sampling frequency Hz. The output is corrupted by an additive band-limited Gaussian noise with bandwidth
Hz and standard deviationmm. The training and test datsaset for the benchmark are generated using as input independent random phase multisine sequences containing and samples, respectively.
We adopt for this benchmark a dynoNet architecture with two parallel branches. The first branch has a sequential structure containing: a -block with 1 input and 8 output channels; a feed-forward network with 8 input and 4 output channels; a -block with 4 input and 4 output channels; and a feed-forward neural network with 4 input and 1 output channel, while the second branch consists in a single SISO -block. The model output is the sum of the the two branches. All the -blocks in the first branch are third-order (), while the single -block in the second branch is second-order (). This model does not have a specific physical motivation and it is chosen to showcase the representational power of dynoNet. Furthermore, it does not correspond to any classic block-oriented structure previously considered in the system identification literature.
The model is trained over iterations with learning rate , by minimizing the MSE on the whole training dataset. On the test dataset, the model achieves a index of and a RMSE of mm. Time traces of the measured and simulated dynoNet output on a portion of the test dataset are shown in Figure 4. The results obtained by the dynoNet compare favorably with other general black-box identification methods applied to this benchmark. For instance, in  Non-linear Finite Impulse Response (NFIR); Auto Regressive with eXogenous input (NARX); and Orthornormal Basis Function (NOBF) model structures are tested on this benchmark. The best results are obtained with the NFIR structure ( mm). Superior results are achieved only in  using polynomial nonlinear state-space models trained with an algorithm tailored for the identification of hysteretic systems ( mm).
4.3 Electro-Mechanical Positioning System
The EMPS is a controlled prismatic joint, which is a common component of robots and machine tools. In the experimental benchmark described in , the system input is the motor force (N) and the measured output is the prismatic joint position (m). A physical model for the system is:
where (kg) is the joint mass and (N) is the friction affecting the system (comprising both viscous and Coloumb terms). From a system identification perspective, this benchmark is challenging due to the unknown (and possibly complex) friction characteristic, the marginally stable (integral) system dynamics, and actuator and sensor behavior affecting the measured data. Indeed,  remarks that a significant measurement bias is present.
We adopt for this benchmark a sequential dynoNet structure comprising: a third-order -block with 1 input and 20 output channels; a feed-forward neural network with 20 input and 1 output channel; and a final integrator block. In this architecture, the final integrator is used to model the integral system dynamics, while the other units have no specific physical meaning and are used as black-box model components.
By training this dynoNet model over iterations on a dataset with samples with learning rate , we obtain on the test dataset performance indexes and mm. As reference,  reports for the best linear model on this benchmark. Time traces of the measured and simulated output on the test dataset are shown in Figure 4.
We have introduced dynoNet, a neural network architecture tailored for time series processing and dynamical system learning. The core element of dynoNet is a linear infinite impulse response dynamical operator described by a rational transfer function. We have derived all the formulas required to integrate the dynamical operator in a back-propagation-based optimization engine for end-to-end training of deep networks. Furthermore, we have analyzed the computational cost of the forward and backward operations.
The proposed case studies have shown the effectiveness and flexibility of the presented methodologies against well-known system identification benchmarks.
Current and future research activities are devoted to the design of nonlinear state estimators and control strategies for systems modeled bydynoNet networks.
Appendix A Special cases of linear dynamical operators
In this section, two special cases of the linear dynamical operators, namely the finite impulse response and the second-order structures are analyzed. The first case is interesting as it corresponds to the Convolutional block of a standard 1D CNN, which is generalized by the dynoNet architecture. The latter is useful in practice as: () higher-order systems may always be described as the sequential connection of first- and second-order dynamics; () the coefficients of a second-order system can be readily re-parametrized in order to enforce stability of the dynamical blocks and thus of the whole dynoNet network.
a.2 Finite Impulse Response structure
A Finite Impulse Response (FIR) dynamical operator has structure
In the FIR structure, there are no denominator coefficients . Furthermore, the numerator coefficients correspond to the system’s non-zero impulse response coefficients. For these reasons, the formulas derived in Section 3 of the main paper and required to define the forward and backward behavior of a general -block simplify significantly in the FIR case.
The forward pass operation is equivalent to
which is equivalent to the convolution
Using (26) for the implementation, the forward operation of a FIR -block requires fully parallelizable multiplications.
As for the backward pass operations, the sensitivities of the block output with respect to the numerator coefficients are given by
Applying the chain rule, we obtain
The latter can also be written as
Thus, computing requires fully parallelizable multiplications.
As for the back-propagation operations with respect to the input time series , Equation (21) of the main paper still holds in the FIR case. Applying this equation to the FIR case, we obtain:
This operation also requires fully parallelizable multiplications.
The formulas presented above for the FIR structure are very similar to the ones used in 1D-CNNs. In most deep learning frameworks, however, a cross-correlation operation is implemented in the forward pass. Working out the math, the convolution operation appears than in the backward pass.
A significant limitation of the FIR structure is that a large number of coefficient is required to represent an LTI dynamics whose impulse response decays slowly. On the other hand, all operations can be performed in parallel as they are independent for each time step, owing to the non-recurrent structure. Furthermore, the FIR representation defines by construction a stable LTI dynamics, which could have numerical advantages in training.
In linear System Identification, special regularization techniques for FIR models based on a Gaussian prior on the impulse response coefficient have been developed . In particular, a Gaussian prior whose covariance is described by the so-called Diagonal/Correlated (DC) kernel is effective to describe the impulse response of a stable linear systems. The DC kernel has form
with , ,
. The hyperparameterrepresents the exponential decay along the diagonal, while describes the correlation across the diagonal (correlation between neighbouring impulse response coefficient). It would be interesting to extend the use of these priors in deep structured neural networks such as dynoNet.
a.3 Second-order structure
A second-order dynamical operator has structure
The second-order structure is interesting as () higher-order systems may always be described as the sequential connection of first- and second-order dynamics and () the coefficients of a second-order system can be readily re-parametrized in order to enforce stability of the block and thus of the entire network.
The second-order filter above is asymptotically stable if and only if the two roots of the characteristic polynomial
lie within the complex unit circle.
By applying the Jury stability criterion , it is possible to show that this property holds in the region of the coefficient space characterized by:
Furthermore, the two poles are: () real and distinct for ; () complex conjugate for ; and () real and coincident for . The regions of interest in the coefficient space are illustrated in Figure 5.
An intuitive stable parametrization for second-order dynamical layers is obtained by describing the denominator in terms of two complex conjugate (or coincident) poles:
with magnitude and phase . Next, in order to avoid interval constraints on and , one can further parametrize and in terms of unconstrained variables as follows:
denotes the sigmoid function and.
The overall transformation from to is then:
Adopting this parametrization, it is possible to train dynoNet networks that are stable by design. In practice, the trainable parameters and may be introduced in the computational graph as parents—through the differentiable transformation (32)—of the coefficients , of a second-order -block. Then, the variables , can optimized using standard unconstrained optimization with gradients computed by plain back-propagation. The learned denominator coefficients , will represent a second-order dynamics that is stable by construction.
The parametrization (32) excludes however the case of two distinct real poles. In order to allow this system structure, a slightly more complex parametrization spanning the whole stability region (30) for the coefficients and of the second-order structure is the following:
where are used as unconstrained optimization variables.
-  (2019) Deep convolutional networks in system identification. In 2019 IEEE 58th Conference on Decision and Control (CDC), Vol. , pp. 3670–3676. Cited by: §1.2.
-  (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: §1.2.
Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research 18 (1), pp. 5595–5637. Cited by: §1.1.
-  (2017) Automatic modeling with local model networks for benchmark processes. IFAC-PapersOnLine 50 (1), pp. 470–475. Cited by: §4.2.
-  (1985) Fading memory and the problem of approximating nonlinear operators with Volterra series. IEEE Transactions on Circuits and Systems 32 (11), pp. 1150–1161. Cited by: §1.2.
-  (2017) Polynomial state-space model decoupling for the identification of hysteretic systems. IFAC-PapersOnLine 50 (1), pp. 458–463. Cited by: §4.2.
-  F. Giri and E. Bai (Eds.) (2010) Block-oriented Nonlinear System Identification. Lecture Notes in Control and Information Sciences, Vol. 404. Cited by: §1.2.
-  (2016) LSTM: a search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28 (10), pp. 2222–2232. Cited by: §1.2.
-  (2019-04) Data Set and Reference Models of EMPS. In Nonlinear System Identification Benchmarks, EINDHOVEN, Netherlands. External Links: Cited by: §4.3, §4.3, §4.
-  (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §4.
-  (2009) Wiener-Hammerstein benchmark. In 15th IFAC Symposium on System Identification, Saint-Malo, France, July, 2009, Cited by: §4.1, §4.
-  (1999) System identification: theory for the user. 2 edition, Prentice Hall PTR, Upper Saddle River, NJ, USA. External Links: Cited by: §3.2.
-  (2016) Hysteretic benchmark with a dynamic nonlinearity. In Workshop on Nonlinear System Identification Benchmarks, pp. 7–14. Cited by: §4.2, §4.
-  (1995) Discrete-time control systems. Vol. 2, Prentice Hall Englewood Cliffs, NJ. Cited by: §A.3.
-  (1979) On representation and approximation of nonlinear systems. Biological Cybernetics 34 (1), pp. 49–52. Cited by: §1.2.
-  (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §1.1.
-  (2014) Kernel methods in system identification, machine learning and function estimation: A survey. Automatica 50 (3), pp. 657–682. Cited by: §A.2.
-  (2015) Parametric identification of parallel wiener–hammerstein systems. Automatica 51, pp. 111–122. Cited by: §1.2.
-  (2014) Identification of Wiener–Hammerstein systems by a nonparametric separation of the best linear approximation. Automatica 50 (2), pp. 628–634. Cited by: §4.1.
-  (2017) Time series classification from scratch with Deep Neural Networks: a strong baseline. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 1578–1585. Cited by: §1.2.
-  (2012) Generalised hammerstein–wiener system estimation and a benchmark application. Control Engineering Practice 20 (11), pp. 1097–1108. Cited by: §1.2.