1 Introduction
1.1 Contribution
This paper introduces dynoNet, a neural network architecture tailored for sequence modeling and dynamical system learning (a.k.a. system identification). The network is designed to process time series of arbitrary length and contains causal linear time-invariant (LTI) dynamical operators as building blocks. These LTI layers are parametrized in terms of rational transfer functions, and thus apply infinite impulse response (IIR) filtering to their input sequence. In the dynoNet architecture, the LTI layers are combined with static (i.e., memoryless) nonlinearities which can be either elementary activation functions applied channel-wise; fully connected feedforward neural networks; or other differentiable operators (e.g., polynomials). Both the LTI and the static layers defining a dynoNet are in general multi-input multi-output (MIMO) and can be interconnected in an arbitrary fashion.

Overall, the dynoNet architecture can represent rich classes of nonlinear, causal dynamic relations. Moreover, dynoNet networks can be trained end-to-end by plain backpropagation using standard deep learning (DL) software. Technically, this is achieved by introducing the LTI dynamical layer as a differentiable operator, endowed with a well-defined forward and backward behavior and thus compatible with reverse-mode automatic differentiation [3]. Special care is taken to devise closed-form expressions for the forward and backward operations that are convenient from a computational perspective.
Furthermore, a software implementation of the dynamical operator layer based on the PyTorch DL framework [16] is available on the GitHub repository https://github.com/forgi86/dynonet.git.
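To fix ideas, the sketch below composes a deliberately naive SISO IIR layer with a static feedforward nonlinearity in a Wiener-like structure. Class and variable names are ours and purely illustrative; the released package implements the operator with the custom forward/backward passes of Section 3 rather than by unrolling the recurrence in the autograd graph as done here.

```python
import torch
import torch.nn as nn

class SisoLTILayer(nn.Module):
    """Naive SISO IIR layer: unrolls the linear recurrence in the autograd graph.

    Illustrative sketch only; the released dynoNet package instead implements
    the operator with closed-form forward/backward passes (Section 3).
    """
    def __init__(self, n_b=2, n_a=2):
        super().__init__()
        self.b = nn.Parameter(0.01 * torch.randn(n_b + 1))  # b_0, ..., b_{n_b}
        self.a = nn.Parameter(0.01 * torch.randn(n_a))      # a_1, ..., a_{n_a}

    def forward(self, u):  # u: 1-D tensor with the input sequence
        y = []
        for t in range(u.shape[0]):
            # moving-average part: sum_k b_k u(t-k), zero initial conditions
            acc = sum(self.b[k] * u[t - k]
                      for k in range(self.b.numel()) if t - k >= 0)
            # autoregressive part: -sum_k a_k y(t-k)
            acc = acc - sum(self.a[k - 1] * y[t - k]
                            for k in range(1, self.a.numel() + 1) if t - k >= 0)
            y.append(acc)
        return torch.stack(y)

# Wiener-like composition: LTI block followed by a static nonlinearity
G = SisoLTILayer(n_b=2, n_a=2)
f_static = nn.Sequential(nn.Linear(1, 20), nn.Tanh(), nn.Linear(20, 1))
u = torch.randn(200)
y = f_static(G(u).unsqueeze(-1)).squeeze(-1)
y.sum().backward()  # gradients reach b, a and the network weights
```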
1.2 Related Works
To the best of our knowledge, LTI blocks with an infinite impulse response have never been considered to date as differentiable operators for backpropagation-based training. Among the layers routinely applied in DL, the 1D convolution [20] is the closest match. In particular, the 1D causal convolution layer described in [2, 1] corresponds to the filtering of an input sequence through a causal Finite Impulse Response (FIR) dynamical system. The dynoNet architecture may be seen as a generalization of the causal 1D Convolutional Neural Network (CNN) enabling IIR filtering, owing to the description of the dynamical layers as rational transfer functions. This representation allows modeling long-term (actually infinite) time dependencies with a smaller number of parameters with respect to 1D convolutional networks. Furthermore, filtering through a rational transfer function can be implemented by means of recurrent linear difference equations. While this operation is not as highly parallelizable as FIR filtering, the total number of computations required is generally lower.
The dynoNet architecture also has analogies with Recurrent Neural Network (RNN) architectures [8]. As in RNNs, a dynamic dependency is built exploiting recurrence equations. However, in an RNN the basic computational unit is a neural cell (e.g., Elman, LSTM, GRU) that processes a single time step. The network's computational graph is then built by repeating the same cell for all the steps of the time series. Processing long time series through an RNN is often computationally expensive and presents limited opportunities for parallelization, due to the sequential nature of the computational graph. Conversely, in dynoNet a lightweight (linear) recurrence equation is "baked into" the elementary LTI blocks, which naturally operate on time series in a vectorized fashion. While internally these layers require certain sequential operations (details are given in Section 3), the overall computational burden is considerably lower than the one of typical RNNs. Moreover, the computations performed by the static layers of a dynoNet are highly parallelizable, as they are independent for each time step. Therefore, more complex transformations may be included in the static layers. Thus, compared to 1D convolutional and recurrent neural architectures, dynoNet is characterized by an intermediate level of computational complexity and representational power.
In the system identification literature, particular cases of the dynoNet architecture have been widely studied in the last decades within the so-called block-oriented modeling framework [7]. In most of the contributions, shallow architectures based on single-input single-output (SISO) blocks are considered. For instance: the Wiener model is defined as the series connection of an LTI dynamical model followed by a static nonlinearity; the Hammerstein model is based on the reverse connection, with a static nonlinearity followed by an LTI block; the Wiener-Hammerstein (WH) model combines two SISO LTI blocks interleaved by a static SISO nonlinearity in a sequential structure; and the Hammerstein-Wiener (HW) model combines two static SISO nonlinearities interleaved by an LTI block in a sequential structure. See Figure 1 for a visual representation of the aforementioned structures.
One notable exception is the deep architecture consisting of the repeated sequential connection of SISO LTI and static nonlinear blocks proposed in [21] and dubbed by its authors generalized Hammerstein-Wiener (Figure 2, left panel). Furthermore, the parallel Wiener-Hammerstein model [18] extends the classic WH model beyond the strictly SISO case. The parallel WH model has the same sequential structure of two LTI blocks interleaved by a static nonlinearity as the basic WH mentioned above. However, the first linear block is single-input-multi-output; the static nonlinearity is multi-input-multi-output; and the second linear block is multi-input-single-output (Figure 2, right panel). Thus, the parallel WH model describes an input/output SISO dynamical system, but it leverages an inner MIMO structure to provide additional flexibility.
From a DL perspective, the parallel WH extends the representation capabilities of the plain WH network by including several "neurons" in a single hidden layer, while the generalized HW model aims at the same result by stacking several layers, each one consisting of a single "neuron". Interestingly, the universal approximation capability has been proven for the parallel WH structure [5, 15].

The dynoNet architecture encompasses all the previous block-oriented models as special cases. Other structures containing, e.g., multiple MIMO blocks and skip connections can be described within the dynoNet modeling framework. More importantly, the existing training methods for block-oriented models are custom-made for each specific architecture, requiring for instance analytic expressions of the Jacobian of the loss with respect to the training parameters. Conversely, the derivation of the differentiable dynamical layer allows us to train arbitrary dynoNet architectures using the plain backpropagation algorithm.
1.3 Notation
The entries of an $N$-length vector are specified by subscript integer indices running from $0$ to $N-1$, unless stated otherwise. The boldface notation $\mathbf{x}$ is reserved for real-valued $N$-length vectors, generally representing time series with $N$ samples. For instance, $\mathbf{x} \in \mathbb{R}^N$ is an $N$-length vector with entries $x_0, x_1, \dots, x_{N-1}$.
Time reversal

The time reversal of an $N$-length vector $\mathbf{x}$ is denoted as $\mathrm{flip}(\mathbf{x})$ and defined as

$$\mathrm{flip}(\mathbf{x})_t = x_{N-1-t}, \qquad t = 0, 1, \dots, N-1. \tag{1}$$
Convolution

The convolution between vectors $\mathbf{w} \in \mathbb{R}^N$ and $\mathbf{x} \in \mathbb{R}^N$ is defined as

$$(\mathbf{w} \ast \mathbf{x})_t = \sum_{\Delta=0}^{t} w_\Delta\, x_{t-\Delta}, \qquad t = 0, 1, \dots, N-1. \tag{2}$$
Cross-correlation

The cross-correlation between vectors $\mathbf{w} \in \mathbb{R}^N$ and $\mathbf{x} \in \mathbb{R}^N$ is defined as

$$(\mathbf{w} \star \mathbf{x})_t = \sum_{\Delta=0}^{N-1-t} w_\Delta\, x_{t+\Delta}, \qquad t = 0, 1, \dots, N-1. \tag{3}$$
2 Linear Dynamical Operator
The input-output relation of an individual (SISO) dynamical layer in the dynoNet architecture is described by the dynamical rational operator $G(q)$ as follows:

$$y(t) = G(q)\,u(t) = \frac{B(q)}{A(q)}\,u(t), \tag{4a}$$

where $B(q)$ and $A(q)$ are polynomials in the time delay operator $q^{-1}$ (i.e., $q^{-1}u(t) = u(t-1)$):

$$B(q) = b_0 + b_1 q^{-1} + \dots + b_{n_b} q^{-n_b}, \tag{4b}$$
$$A(q) = 1 + a_1 q^{-1} + \dots + a_{n_a} q^{-n_a}, \tag{4c}$$

and $u(t)$ and $y(t)$ are the input and output sequence values at time index $t$.
The filtering operation through $G(q)$ in (4a) is equivalent to the input/output equation:

$$A(q)\,y(t) = B(q)\,u(t). \tag{5}$$
Based on the definitions of $A(q)$ and $B(q)$, (5) is equivalent to the recurrence equation:

$$y(t) = -a_1 y(t-1) - \dots - a_{n_a} y(t-n_a) + b_0 u(t) + b_1 u(t-1) + \dots + b_{n_b} u(t-n_b). \tag{6}$$
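As a concrete sketch (with arbitrary second-order coefficients), scipy.signal.lfilter implements exactly the recurrence (6) when called with coefficient arrays [b_0, ..., b_{n_b}] and [1, a_1, ..., a_{n_a}], and can serve as a reference for the explicit loop:

```python
import numpy as np
from scipy.signal import lfilter

# Arbitrary example coefficients for a second-order operator G(q)
b = np.array([0.1, 0.2])        # b_0, b_1
a = np.array([1.0, -1.5, 0.7])  # leading 1, then a_1, a_2

u = np.random.randn(1000)

# Reference: lfilter implements the recurrence (6) directly
y_ref = lfilter(b, a, u)

# Explicit loop, starting from rest (u(t) = y(t) = 0 for t < 0)
y = np.zeros_like(u)
for t in range(len(u)):
    for k in range(len(b)):
        if t - k >= 0:
            y[t] += b[k] * u[t - k]
    for k in range(1, len(a)):
        if t - k >= 0:
            y[t] -= a[k] * y[t - k]

assert np.allclose(y, y_ref)
```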
Parameters
The tunable parameters of $G(q)$ are the coefficients of the polynomials $B(q)$ and $A(q)$. For convenience, these coefficients are collected in vectors $b = [b_0\; b_1\; \dots\; b_{n_b}]$ and $a = [a_1\; a_2\; \dots\; a_{n_a}]$. Note that the first element of vector $a$ has index 1, while for all other vectors in this paper the starting index is 0.
Initial condition
In this paper, the operator $G(q)$ is always initialized from rest, namely the values of $u(t)$ and $y(t)$ for $t < 0$ are all taken equal to zero. Then, given an input sequence $u(t)$, $t \geq 0$, (6) provides a univocal expression for the output sequence $y(t)$, $t \geq 0$.
Finite-length sequences

In practice, the operator $G(q)$ in a dynoNet operates on finite-length sequences. Let us stack the input and output samples $u(t)$ and $y(t)$, $t = 0, 1, \dots, N-1$, in vectors $\mathbf{u} \in \mathbb{R}^N$ and $\mathbf{y} \in \mathbb{R}^N$, respectively. With a slight abuse of notation, the filtering operation in (4) applied to $\mathbf{u}$ is denoted as $\mathbf{y} = G(q)\,\mathbf{u}$.

The operation above is also equivalent to the convolution

$$\mathbf{y} = \mathbf{g} \ast \mathbf{u}, \tag{7}$$

where $\mathbf{g} \in \mathbb{R}^N$ is a vector containing the first $N$ samples of the operator's impulse response. The latter is defined as the output sequence generated by (6) for an input $u$ such that $u(0) = 1$ and $u(t) = 0$ for $t \neq 0$.
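A quick numerical check of the equivalence (7), with arbitrary coefficients: the impulse response $\mathbf{g}$ is obtained by filtering a unit impulse, and truncated convolution with $\mathbf{g}$ reproduces IIR filtering over the first $N$ samples:

```python
import numpy as np
from scipy.signal import lfilter

b = np.array([0.1, 0.2])
a = np.array([1.0, -1.5, 0.7])
N = 500

# First N samples of the impulse response: response of (6) to a unit impulse
delta = np.zeros(N)
delta[0] = 1.0
g = lfilter(b, a, delta)

u = np.random.randn(N)
y_filt = lfilter(b, a, u)           # IIR filtering
y_conv = np.convolve(g, u)[:N]      # convolution with g, truncated to N samples
assert np.allclose(y_filt, y_conv)
```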
MIMO extension

In the MIMO case, the input $u(t) \in \mathbb{R}^p$ and output $y(t) \in \mathbb{R}^m$ at time $t$ are vectors of size $p$ and $m$, respectively. The MIMO linear dynamical operator with $p$ input and $m$ output channels may be represented as a MIMO transfer function matrix $\mathcal{G}(q)$ whose element $\mathcal{G}_{ij}(q)$ is a SISO rational transfer function such as (4). The components $y_i(t)$ of the output sequence at time $t$ are defined as

$$y_i(t) = \sum_{j=1}^{p} \mathcal{G}_{ij}(q)\, u_j(t), \qquad i = 1, \dots, m. \tag{8}$$

The derivations for the dynamical layer are presented in the following in a SISO setting to avoid notation clutter. Extension to the MIMO case is straightforward and only requires repetition of the same operations for the different input/output channels. The computations for the different input/output channels are independent and therefore may be performed in parallel. Note that the software implementation available in our online GitHub repository fully supports the MIMO case.
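In code, (8) amounts to summing SISO filtering operations over the input channels. A numpy sketch, with arbitrary channel counts and coefficients:

```python
import numpy as np
from scipy.signal import lfilter

p, m, N = 2, 3, 400                      # input channels, output channels, samples
rng = np.random.default_rng(0)

# One (b, a) coefficient pair per (output, input) channel combination
B = rng.normal(scale=0.1, size=(m, p, 3))             # numerator coefficients
A = np.concatenate([np.ones((m, p, 1)),               # leading 1 of A(q)
                    rng.normal(scale=0.1, size=(m, p, 2))], axis=-1)

U = rng.normal(size=(N, p))              # multichannel input sequence
Y = np.zeros((N, m))
for i in range(m):                       # Eq. (8): sum of SISO filters per output
    for j in range(p):
        Y[:, i] += lfilter(B[i, j], A[i, j], U[:, j])
```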
3 Dynamical Operator as a Deep Learning Layer
In this section, the forward and backward operations required to integrate the linear dynamical operator in a DL framework are derived. The computational cost of these operations as measured by the number of multiplications to be executed is also reported. Furthermore, the possibility of parallelizing these computations is analyzed.
In the rest of this paper, the linear dynamical operator interpreted as a differentiable layer for use in DL is also referred to as $G$-block. In our software implementation, the $G$-block is implemented in the PyTorch DL framework as a class extending torch.autograd.Function, based on the forward and backward operations derived in the following.
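A simplified single-sequence skeleton of such a class is sketched below. It anticipates the closed-form backward formulas derived in Sections 3.1 and 3.2, uses a numpy/scipy backend, and is ours for illustration; the released implementation differs in interface, generality, and efficiency.

```python
import numpy as np
import torch
from scipy.signal import lfilter

class LinearDynamicalFunction(torch.autograd.Function):
    """Sketch of a SISO G-block with explicit forward/backward passes."""

    @staticmethod
    def forward(ctx, b, a, u):
        # a stores [a_1, ..., a_{n_a}]; prepend the leading 1 of A(q)
        a_poly = np.r_[1.0, a.detach().numpy()]
        y = lfilter(b.detach().numpy(), a_poly, u.detach().numpy())  # Eq. (6)
        y = torch.from_numpy(y)
        ctx.save_for_backward(b, a, u, y)
        return y

    @staticmethod
    def backward(ctx, grad_y):
        b, a, u, y = ctx.saved_tensors
        a_poly = np.r_[1.0, a.detach().numpy()]
        gy = grad_y.detach().numpy()
        N = gy.shape[0]
        # Sensitivities via extra filtering, Eqs. (13)-(15) and (17)-(18)
        b0_sens = lfilter([1.0], a_poly, u.detach().numpy())        # dy/db_0
        a1_sens = lfilter([0.0, -1.0], a_poly, y.detach().numpy())  # dy/da_1
        # Shifted dot products, Eqs. (16) and (19)
        grad_b = torch.tensor([gy[j:] @ b0_sens[:N - j]
                               for j in range(b.numel())])
        grad_a = torch.tensor([gy[j - 1:] @ a1_sens[:N - j + 1]
                               for j in range(1, a.numel() + 1)])
        # Input gradient via flip-filter-flip, Eq. (22) and below
        grad_u = torch.from_numpy(
            lfilter(b.detach().numpy(), a_poly, gy[::-1].copy())[::-1].copy())
        return grad_b, grad_a, grad_u

# Usage sketch (double precision to ease numerical checks)
b = torch.tensor([0.1, 0.2], dtype=torch.float64, requires_grad=True)
a = torch.tensor([-1.5, 0.7], dtype=torch.float64, requires_grad=True)
u = torch.randn(1000, dtype=torch.float64, requires_grad=True)
y = LinearDynamicalFunction.apply(b, a, u)
y.sum().backward()   # populates b.grad, a.grad, u.grad
```

A module wrapping LinearDynamicalFunction.apply can then be dropped into a dynoNet-style architecture; torch.autograd.gradcheck provides an independent verification of the backward pass against numerical differentiation.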
3.1 Forward Operations
The forward operations of a $G$-block embedded in a computational graph are represented by solid arrows in Figure 3. In the forward pass, the $G$-block filters an input sequence $\mathbf{u}$ through a dynamical system with structure (4) and parameters $b$ and $a$. The block output is a vector $\mathbf{y} \in \mathbb{R}^N$ containing the filtered sequence:

$$\mathbf{y} = G(q)\,\mathbf{u}. \tag{9}$$

The input $\mathbf{u}$ of the block may be either the training input sequence or the result of previous operations in the computational graph, while the output $\mathbf{y}$ is an intermediate step towards the computation of a scalar output $\mathcal{L}$. The exact operations leading to $\mathcal{L}$ are not relevant in this discussion, and thus they are not further specified.

When the filtering operation (9) is implemented using (6), the computational cost of the block forward pass corresponds to $N(n_a + n_b + 1)$ multiplications. These multiplications can be parallelized for the $n_a + n_b + 1$ different coefficients at a given time step, but need to be performed sequentially for the $N$ time samples due to the recurrent structure of (6).
3.2 Backward Operations
The backward operations are illustrated in Figure 3 with dashed arrows. In the backward pass, the $G$-block receives the vector $\overline{\mathbf{y}} \in \mathbb{R}^N$ containing the partial derivatives of the loss $\mathcal{L}$ w.r.t. $\mathbf{y}$, namely:

$$\overline{y}_t = \frac{\partial \mathcal{L}}{\partial y_t}, \qquad t = 0, 1, \dots, N-1. \tag{10}$$

Given $\overline{\mathbf{y}}$, the block has to compute the derivatives of the loss w.r.t. its differentiable inputs $b$, $a$, and $\mathbf{u}$. Overall, the backward operation has the following structure:

$$\overline{b},\ \overline{a},\ \overline{\mathbf{u}} = \mathrm{backward}(\overline{\mathbf{y}};\ b, a, \mathbf{u}), \tag{11}$$

where

$$\overline{b}_j = \frac{\partial \mathcal{L}}{\partial b_j}, \qquad j = 0, 1, \dots, n_b, \tag{12a}$$
$$\overline{a}_j = \frac{\partial \mathcal{L}}{\partial a_j}, \qquad j = 1, 2, \dots, n_a, \tag{12b}$$
$$\overline{u}_t = \frac{\partial \mathcal{L}}{\partial u_t}, \qquad t = 0, 1, \dots, N-1. \tag{12c}$$
Numerator coefficients

Application of the chain rule leads to:

$$\overline{b}_j = \sum_{t=0}^{N-1} \overline{y}_t\, \frac{\partial y_t}{\partial b_j}.$$

The required sensitivities $\frac{\partial y_t}{\partial b_j}$, $j = 0, 1, \dots, n_b$, can be obtained in closed form through additional filtering operations (see [12], Section 10.3). Specifically, an expression for $\frac{\partial y(t)}{\partial b_j}$ is derived by differentiating the left- and right-hand side of Eq. (5) w.r.t. $b_j$. This yields:

$$A(q)\,\frac{\partial y(t)}{\partial b_j} = u(t-j), \tag{13}$$

or equivalently:

$$\frac{\partial y(t)}{\partial b_j} = \frac{q^{-j}}{A(q)}\,u(t). \tag{14}$$

Thus, $\frac{\partial y(t)}{\partial b_j}$ can be computed by filtering the input vector $\mathbf{u}$ through the linear filter $\frac{q^{-j}}{A(q)}$. Furthermore, the following condition holds:

$$\frac{\partial y(t)}{\partial b_j} = \frac{\partial y(t-j)}{\partial b_0}. \tag{15}$$

Then, one only needs to compute the sensitivity $\tilde{b}_0(t) = \frac{\partial y(t)}{\partial b_0}$ by simulating the recursive equation (13) for $j = 0$. The other sensitivities $\frac{\partial y(t)}{\partial b_j}$, $j = 1, \dots, n_b$, are obtained through simple shifting operations according to (15).

Exploiting expression (15) for the sensitivities, the $j$-th component of $\overline{b}$ is obtained as:

$$\overline{b}_j = \sum_{t=j}^{N-1} \overline{y}_t\, \tilde{b}_0(t-j). \tag{16}$$

This operation corresponds to the dot product of $\overline{\mathbf{y}}$ with a shifted version of the sensitivity $\tilde{b}_0$.

Overall, the computation of $\overline{b}$ requires: (i) filtering $\mathbf{u}$ through $\frac{1}{A(q)}$, which entails $N n_a$ multiplications, and (ii) the $n_b + 1$ dot products defined in (16), totaling approximately $N(n_b + 1)$ multiplications. As for the filtering, the operations have to be performed sequentially for the $N$ different time steps due to the recursive structure of (13). After completion of the filtering, the dot product operations may be performed in parallel.
Denominator coefficients

Following the same rationale above, a closed-form expression for the sensitivities $\frac{\partial y(t)}{\partial a_j}$, $j = 1, \dots, n_a$, is obtained by differentiating the left- and right-hand side of Eq. (5) with respect to $a_j$. This yields:

$$A(q)\,\frac{\partial y(t)}{\partial a_j} = -y(t-j),$$

or equivalently

$$\frac{\partial y(t)}{\partial a_j} = -\frac{q^{-j}}{A(q)}\,y(t). \tag{17}$$

Then, $\frac{\partial y(t)}{\partial a_j}$ can be obtained by filtering the output $\mathbf{y}$ through the linear filter $-\frac{q^{-j}}{A(q)}$. Furthermore, the following condition holds:

$$\frac{\partial y(t)}{\partial a_j} = \frac{\partial y(t-j+1)}{\partial a_1}. \tag{18}$$

Denoting as $\tilde{a}_1(t) = \frac{\partial y(t)}{\partial a_1}$ the sensitivity w.r.t. $a_1$, the $j$-th component of $\overline{a}$ is obtained as:

$$\overline{a}_j = \sum_{t=j-1}^{N-1} \overline{y}_t\, \tilde{a}_1(t-j+1). \tag{19}$$
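The numerator and denominator gradient formulas translate into two filtering calls and a few shifted dot products. The sketch below (arbitrary coefficients) also verifies them against central finite differences of the scalar loss $\mathcal{L} = \overline{\mathbf{y}}^\top \mathbf{y}$:

```python
import numpy as np
from scipy.signal import lfilter

b = np.array([0.1, 0.2]); a = np.array([1.0, -1.5, 0.7])  # leading 1 included in a
N = 200
u = np.random.randn(N)
ybar = np.random.randn(N)                          # arbitrary incoming gradient dL/dy
loss = lambda bb, aa: ybar @ lfilter(bb, aa, u)    # L = ybar . y

# Closed-form gradients via Eqs. (13)-(19)
y = lfilter(b, a, u)
b0_sens = lfilter([1.0], a, u)           # dy(t)/db_0
a1_sens = lfilter([0.0, -1.0], a, y)     # dy(t)/da_1
bbar = np.array([ybar[j:] @ b0_sens[:N - j] for j in range(len(b))])
abar = np.array([ybar[j - 1:] @ a1_sens[:N - j + 1] for j in range(1, len(a))])

# Check against central finite differences
eps = 1e-6
for j in range(len(b)):
    db = np.zeros_like(b); db[j] = eps
    assert np.isclose((loss(b + db, a) - loss(b - db, a)) / (2 * eps), bbar[j])
for j in range(1, len(a)):
    da = np.zeros_like(a); da[j] = eps
    assert np.isclose((loss(b, a + da) - loss(b, a - da)) / (2 * eps), abar[j - 1])
```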
Input time series

Application of the chain rule yields:

$$\overline{u}_\tau = \sum_{t=0}^{N-1} \overline{y}_t\, \frac{\partial y_t}{\partial u_\tau}. \tag{20}$$

From (7), the following expression for $\frac{\partial y_t}{\partial u_\tau}$ holds:

$$\frac{\partial y_t}{\partial u_\tau} = g_{t-\tau}, \tag{21}$$

with $g_{t-\tau} = 0$ for $t < \tau$. Plugging the expression above for $\frac{\partial y_t}{\partial u_\tau}$ into (20), we obtain

$$\overline{u}_\tau = \sum_{t=\tau}^{N-1} \overline{y}_t\, g_{t-\tau}.$$

By definition, the expression above corresponds to the following cross-correlation operation:

$$\overline{\mathbf{u}} = \mathbf{g} \star \overline{\mathbf{y}}. \tag{22}$$

However, direct implementation of (22) requires a number of operations proportional to $N^2$. Since $\mathbf{g}$ represents the impulse response of the filter $G(q)$, (22) can be implemented more efficiently by filtering the vector $\overline{\mathbf{y}}$ in reverse time through $G(q)$, and then reversing the result, i.e.,

$$\overline{\mathbf{u}} = \mathrm{flip}\big(G(q)\,\mathrm{flip}(\overline{\mathbf{y}})\big).$$

Neglecting the flipping operations, the computational cost of the backward pass for $\overline{\mathbf{u}}$ is thus equivalent to the filtering of an $N$-length vector through $G(q)$, namely $N(n_a + n_b + 1)$ multiplications.
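The flip-filter-flip trick is a one-liner; the sketch below compares it against the direct $O(N^2)$ cross-correlation (22), again with arbitrary coefficients:

```python
import numpy as np
from scipy.signal import lfilter

b = np.array([0.1, 0.2]); a = np.array([1.0, -1.5, 0.7])
N = 500
ybar = np.random.randn(N)

# Direct cross-correlation with the truncated impulse response g: O(N^2)
delta = np.zeros(N); delta[0] = 1.0
g = lfilter(b, a, delta)
ubar_direct = np.array([ybar[tau:] @ g[:N - tau] for tau in range(N)])

# Efficient version: flip, filter through G(q), flip back: O(N)
ubar = lfilter(b, a, ybar[::-1].copy())[::-1]
assert np.allclose(ubar, ubar_direct)
```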
4 Examples
The effectiveness of the dynoNet architecture is evaluated on system identification benchmarks publicly available at the website www.nonlinearbenchmark.org. All the code required to reproduce the results in this section is available on the GitHub repository https://github.com/forgi86/dynonet.git.
The results achieved on the Wiener-Hammerstein [11], the Bouc-Wen [13], and the Electro-Mechanical Positioning System (EMPS) [9] benchmarks are presented in this paper. Other examples are dealt with in the provided code.
Settings
In the following examples, the dynoNet is trained by minimizing the mean square of the simulation error, using the Adam algorithm [10] for gradient-based optimization. The number of iterations is chosen sufficiently large to reach a cost function plateau. The learning rate is adjusted through rough trial and error. All static nonlinearities following the $G$-blocks are modeled as feedforward neural networks with a single hidden layer containing 20 neurons and hyperbolic tangent activation function. The numerator and denominator coefficients of the linear dynamical $G$-blocks are randomly initialized from a zero-mean uniform distribution, while the feedforward neural network parameters are initialized according to PyTorch's default strategy. Note that several settings are kept constant across the benchmarks to highlight that limited tuning is needed to obtain state-of-the-art identification results using the proposed dynoNet architecture.

Hardware setup

Computations are performed on a desktop computer equipped with an AMD Ryzen 5 1600x six-core processor and 32 GB of RAM.
Metrics
The identified models are evaluated in terms of the $R^2$ and Root Mean Square Error (RMSE) indexes, defined as:

$$R^2 = 1 - \frac{\sum_{t=0}^{N-1} (y_t - \hat{y}_t)^2}{\sum_{t=0}^{N-1} (y_t - \bar{y})^2}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=0}^{N-1} (y_t - \hat{y}_t)^2},$$

where $\mathbf{y}$ is the measured (true) output vector; $\hat{\mathbf{y}}$ is the dynoNet model's open-loop simulated output vector; and $\bar{y}$ is the mean value of $\mathbf{y}$, i.e., $\bar{y} = \frac{1}{N}\sum_{t=0}^{N-1} y_t$.
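For reference, a direct implementation of the two indexes:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 fit index between measured y and simulated y_hat."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

def rmse(y, y_hat):
    """Root mean square error between measured y and simulated y_hat."""
    return np.sqrt(np.mean((y - y_hat) ** 2))
```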
4.1 Electronic Circuit with Wiener-Hammerstein Structure

The experimental setup used in this benchmark is an electronic circuit that behaves by construction as a Wiener-Hammerstein system [11]. Therefore, a simple dynoNet architecture corresponding to the WH model structure is adopted. Specifically, the dynoNet model has a sequential structure defined by a SISO $G$-block; a SISO feedforward neural network; and a final SISO $G$-block.
The model is trained over iterations of the Adam algorithm with learning rate , by minimizing the MSE on the whole training dataset ( samples). The total training time is 267 seconds. On the test dataset ( samples), the dynoNet model's performance indexes are and mV. The measured output and simulated output on a portion of the test dataset are shown in Figure 4, together with the simulation error. While specialized training algorithms for WH systems may provide even superior results on this benchmark (the best published result [19] reports an RMSE of mV), the dynoNet model trained by plain backpropagation achieves remarkably good performance.
4.2 Bouc-Wen System
The Bouc-Wen is a nonlinear dynamical system describing hysteretic effects in mechanical engineering, commonly used to assess system identification algorithms. The example in this section is based on the synthetic Bouc-Wen benchmark described in [13]. The training and test datasets of the benchmark are obtained by numerical simulation of the differential equations:

$$m_L\,\ddot{y}(t) + r_L\,\dot{y}(t) + k_L\,y(t) + z(t) = u(t),$$
$$\dot{z}(t) = \alpha\,\dot{y}(t) - \beta\big(\gamma\,|\dot{y}(t)|\,|z(t)|^{\nu-1}\,z(t) + \delta\,\dot{y}(t)\,|z(t)|^{\nu}\big),$$

where $u$ (N) is the input force; $z$ (N) is the hysteretic force; $y$ (mm) is the output displacement; and all other symbols represent fixed coefficients. The input and output signals are available at a sampling frequency of Hz. The output is corrupted by an additive band-limited Gaussian noise with bandwidth Hz and standard deviation mm. The training and test datasets for the benchmark are generated using as input independent random-phase multisine sequences containing and samples, respectively.

We adopt for this benchmark a dynoNet architecture with two parallel branches. The first branch has a sequential structure containing: a $G$-block with 1 input and 8 output channels; a feedforward network with 8 input and 4 output channels; a $G$-block with 4 input and 4 output channels; and a feedforward neural network with 4 input and 1 output channel. The second branch consists of a single SISO $G$-block. The model output is the sum of the outputs of the two branches. All the $G$-blocks in the first branch are third-order, while the single $G$-block in the second branch is second-order. This model does not have a specific physical motivation; it is chosen to showcase the representational power of dynoNet. Furthermore, it does not correspond to any classic block-oriented structure previously considered in the system identification literature.
The model is trained over iterations with learning rate , by minimizing the MSE on the whole training dataset. On the test dataset, the model achieves an $R^2$ index of and an RMSE of mm. Time traces of the measured and simulated dynoNet output on a portion of the test dataset are shown in Figure 4. The results obtained by dynoNet compare favorably with other general black-box identification methods applied to this benchmark. For instance, in [4] Nonlinear Finite Impulse Response (NFIR), Nonlinear Auto-Regressive with eXogenous input (NARX), and Nonlinear Orthonormal Basis Function (NOBF) model structures are tested on this benchmark. The best results are obtained with the NFIR structure (RMSE of mm). Superior results are achieved only in [6] using polynomial nonlinear state-space models trained with an algorithm tailored for the identification of hysteretic systems (RMSE of mm).
4.3 Electro-Mechanical Positioning System
The EMPS is a controlled prismatic joint, which is a common component of robots and machine tools. In the experimental benchmark described in [9], the system input $u$ is the motor force (N) and the measured output $y$ is the prismatic joint position (m). A physical model for the system is:

$$M\,\ddot{y}(t) = u(t) - F(t),$$

where $M$ (kg) is the joint mass and $F$ (N) is the friction force affecting the system (comprising both viscous and Coulomb terms). From a system identification perspective, this benchmark is challenging due to the unknown (and possibly complex) friction characteristic, the marginally stable (integral) system dynamics, and actuator and sensor behavior affecting the measured data. Indeed, [9] remarks that a significant measurement bias is present.
We adopt for this benchmark a sequential dynoNet structure comprising: a third-order $G$-block with 1 input and 20 output channels; a feedforward neural network with 20 input and 1 output channel; and a final integrator block. In this architecture, the final integrator is used to model the integral system dynamics, while the other units have no specific physical meaning and are used as black-box model components.
By training this dynoNet model over iterations on a dataset with samples with learning rate , we obtain on the test dataset performance indexes of and mm. As a reference, [9] reports for the best linear model on this benchmark. Time traces of the measured and simulated output on the test dataset are shown in Figure 4.
5 Conclusions
We have introduced dynoNet, a neural network architecture tailored for time series processing and dynamical system learning. The core element of dynoNet is a linear infinite impulse response dynamical operator described by a rational transfer function. We have derived all the formulas required to integrate the dynamical operator in a backpropagation-based optimization engine for end-to-end training of deep networks. Furthermore, we have analyzed the computational cost of the forward and backward operations.
The proposed case studies have shown the effectiveness and flexibility of the presented methodologies on well-known system identification benchmarks.
Current and future research activities are devoted to the design of nonlinear state estimators and control strategies for systems modeled by dynoNet networks.

Appendix A Special cases of linear dynamical operators
A.1 Introduction

In this section, two special cases of the linear dynamical operator, namely the finite impulse response and the second-order structures, are analyzed. The former is interesting as it corresponds to the convolutional block of a standard 1D CNN, which is generalized by the dynoNet architecture. The latter is useful in practice as: (i) higher-order systems may always be described as the sequential connection of first- and second-order dynamics; (ii) the coefficients of a second-order system can be readily reparametrized in order to enforce stability of the dynamical blocks and thus of the whole dynoNet network.
A.2 Finite Impulse Response structure

A Finite Impulse Response (FIR) dynamical operator has structure

$$G(q) = B(q) = b_0 + b_1 q^{-1} + \dots + b_{n_b} q^{-n_b}. \tag{23}$$

In the FIR structure, there are no denominator coefficients $a_j$. Furthermore, the numerator coefficients $b_j$ correspond to the system's non-zero impulse response coefficients. For these reasons, the formulas derived in Section 3 of the main paper and required to define the forward and backward behavior of a general $G$-block simplify significantly in the FIR case.

Forward operations

The forward pass operation is equivalent to

$$\mathbf{y} = B(q)\,\mathbf{u}, \tag{24}$$

which is equivalent to the convolution

$$\mathbf{y} = \mathbf{b} \ast \mathbf{u}, \tag{25}$$

i.e.,

$$y_t = \sum_{\Delta=0}^{\min(t,\,n_b)} b_\Delta\, u_{t-\Delta}, \qquad t = 0, 1, \dots, N-1. \tag{26}$$

Using (26) for the implementation, the forward operation of a FIR $G$-block requires approximately $N(n_b + 1)$ fully parallelizable multiplications.
Backward operations

As for the backward pass operations, the sensitivities of the block output with respect to the numerator coefficients are given by

$$\frac{\partial y_t}{\partial b_j} = u_{t-j}.$$

Applying the chain rule, we obtain

$$\overline{b}_j = \sum_{t=j}^{N-1} \overline{y}_t\, u_{t-j}.$$

The latter can also be written as the cross-correlation

$$\overline{b}_j = (\mathbf{u} \star \overline{\mathbf{y}})_j, \qquad j = 0, 1, \dots, n_b.$$

Thus, computing $\overline{b}$ requires approximately $N(n_b + 1)$ fully parallelizable multiplications.

As for the backpropagation operations with respect to the input time series $\mathbf{u}$, Equation (21) of the main paper still holds in the FIR case. Applying this equation, we obtain:

$$\overline{u}_\tau = \sum_{t=\tau}^{\min(\tau + n_b,\, N-1)} \overline{y}_t\, b_{t-\tau} = (\mathbf{b} \star \overline{\mathbf{y}})_\tau, \qquad \tau = 0, 1, \dots, N-1.$$

This operation also requires approximately $N(n_b + 1)$ fully parallelizable multiplications.

The formulas presented above for the FIR structure are very similar to the ones used in 1D CNNs. In most deep learning frameworks, however, a cross-correlation operation is implemented in the forward pass. Working out the math, the convolution operation then appears in the backward pass.
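This duality is easy to check numerically: PyTorch's conv1d computes a cross-correlation, so flipping the FIR kernel and left-padding the input recovers the causal convolution (26):

```python
import numpy as np
import torch
import torch.nn.functional as F

b = np.array([0.1, 0.2, -0.3])          # FIR coefficients b_0, b_1, b_2
u = np.random.randn(50)

y_conv = np.convolve(b, u)[:len(u)]     # causal FIR filtering, Eq. (26)

# conv1d computes a cross-correlation: flip the kernel and left-pad the
# input with n_b zeros to obtain the same causal convolution
u_t = torch.tensor(u).view(1, 1, -1)
k = torch.tensor(b[::-1].copy()).view(1, 1, -1)
y_torch = F.conv1d(F.pad(u_t, (len(b) - 1, 0)), k).view(-1).numpy()
assert np.allclose(y_conv, y_torch)
```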
Discussion

A significant limitation of the FIR structure is that a large number of coefficients is required to represent an LTI dynamics whose impulse response decays slowly. On the other hand, all operations can be performed in parallel as they are independent for each time step, owing to the non-recurrent structure. Furthermore, the FIR representation defines by construction a stable LTI dynamics, which could have numerical advantages in training.
Regularization

In linear system identification, special regularization techniques for FIR models based on a Gaussian prior on the impulse response coefficients have been developed [17]. In particular, a Gaussian prior whose covariance is described by the so-called Diagonal/Correlated (DC) kernel is effective to describe the impulse response of stable linear systems. The DC kernel has form

$$K(i, j) = c\,\lambda^{\frac{i+j}{2}}\,\rho^{|i-j|}, \tag{27}$$

with $c \geq 0$, $0 \leq \lambda < 1$, $|\rho| \leq 1$. The hyperparameter $\lambda$
represents the exponential decay along the diagonal, while $\rho$ describes the correlation across the diagonal (correlation between neighbouring impulse response coefficients). It would be interesting to extend the use of these priors to deep structured neural networks such as dynoNet.
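A sketch of the DC kernel matrix (27) for an FIR block, with arbitrary hyperparameter values:

```python
import numpy as np

def dc_kernel(n, c=1.0, lam=0.9, rho=0.5):
    """DC kernel matrix K(i, j) = c * lam**((i + j) / 2) * rho**|i - j|."""
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return c * lam ** ((i + j) / 2.0) * rho ** np.abs(i - j)

# Prior covariance for an FIR block with 16 impulse response coefficients;
# its inverse could act as a quadratic regularizer on b during training
K = dc_kernel(16)
```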
A.3 Second-order structure

A second-order dynamical operator has structure

$$G(q) = \frac{b_0 + b_1 q^{-1} + b_2 q^{-2}}{1 + a_1 q^{-1} + a_2 q^{-2}}. \tag{28}$$
The second-order structure is interesting as (i) higher-order systems may always be described as the sequential connection of first- and second-order dynamics and (ii) the coefficients of a second-order system can be readily reparametrized in order to enforce stability of the block and thus of the entire network.
System analysis

The second-order filter above is asymptotically stable if and only if the two roots of the characteristic polynomial

$$z^2 + a_1 z + a_2 \tag{29}$$

lie within the complex unit circle. By applying the Jury stability criterion [14], it is possible to show that this property holds in the region of the coefficient space characterized by:

$$|a_2| < 1, \tag{30a}$$
$$|a_1| < 1 + a_2. \tag{30b}$$

Furthermore, the two poles are: (i) real and distinct for $a_1^2 - 4a_2 > 0$; (ii) complex conjugate for $a_1^2 - 4a_2 < 0$; and (iii) real and coincident for $a_1^2 - 4a_2 = 0$. The regions of interest in the coefficient space are illustrated in Figure 5.
Stable parametrization

An intuitive stable parametrization for second-order dynamical layers is obtained by describing the denominator $A(q)$ in terms of two complex conjugate (or coincident) poles $r e^{\pm i\beta}$:

$$A(q) = (1 - re^{i\beta} q^{-1})(1 - re^{-i\beta} q^{-1}) = 1 - 2r\cos\beta\, q^{-1} + r^2 q^{-2},$$

with magnitude $r \in [0, 1)$ and phase $\beta \in [0, \pi]$. Next, in order to avoid interval constraints on $r$ and $\beta$, one can further parametrize $r$ and $\beta$ in terms of unconstrained variables $\rho, \psi \in \mathbb{R}$ as follows:

$$r = \sigma(\rho), \qquad \beta = \pi\,\sigma(\psi),$$

where $\sigma(\cdot)$ denotes the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$. The overall transformation from $(\rho, \psi)$ to $(a_1, a_2)$ is then:

$$a_1 = -2\,\sigma(\rho)\cos\big(\pi\,\sigma(\psi)\big), \tag{32a}$$
$$a_2 = \sigma(\rho)^2. \tag{32b}$$

Adopting this parametrization, it is possible to train dynoNet networks that are stable by design. In practice, the trainable parameters $\rho$ and $\psi$ may be introduced in the computational graph as parents of the coefficients $a_1$, $a_2$ of a second-order $G$-block, through the differentiable transformation (32). Then, the variables $\rho$, $\psi$ can be optimized using standard unconstrained optimization with gradients computed by plain backpropagation. The learned denominator coefficients $a_1$, $a_2$ will represent a second-order dynamics that is stable by construction.
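A PyTorch sketch of the reparametrization (32); module and variable names are ours, for illustration:

```python
import math
import torch
import torch.nn as nn

class StableSecondOrderDenominator(nn.Module):
    """Produces denominator coefficients (a_1, a_2) that are stable by design,
    via the transformation (32) from unconstrained parameters (rho, psi)."""

    def __init__(self):
        super().__init__()
        self.rho = nn.Parameter(torch.zeros(1))   # controls pole magnitude r
        self.psi = nn.Parameter(torch.zeros(1))   # controls pole phase beta

    def forward(self):
        r = torch.sigmoid(self.rho)               # r in (0, 1)
        beta = math.pi * torch.sigmoid(self.psi)  # beta in (0, pi)
        a_1 = -2.0 * r * torch.cos(beta)
        a_2 = r ** 2
        return a_1, a_2

den = StableSecondOrderDenominator()
a_1, a_2 = den()   # gradients w.r.t. rho, psi flow through this mapping
```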
The parametrization (32) excludes however the case of two distinct real poles. In order to allow this system structure, a slightly more complex parametrization spanning the whole stability region (30) for the coefficients $a_1$ and $a_2$ of the second-order structure is the following:

$$a_2 = -1 + 2\,\sigma(\alpha_2), \tag{33a}$$
$$a_1 = (1 + a_2)\,\big(\!-1 + 2\,\sigma(\alpha_1)\big), \tag{33b}$$

where $\alpha_1, \alpha_2 \in \mathbb{R}$ are used as unconstrained optimization variables.
References

[1] (2019) Deep convolutional networks in system identification. In 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 3670-3676.
[2] (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
[3] (2017) Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research 18 (1), pp. 5595-5637.
[4] (2017) Automatic modeling with local model networks for benchmark processes. IFAC-PapersOnLine 50 (1), pp. 470-475.
[5] (1985) Fading memory and the problem of approximating nonlinear operators with Volterra series. IEEE Transactions on Circuits and Systems 32 (11), pp. 1150-1161.
[6] (2017) Polynomial state-space model decoupling for the identification of hysteretic systems. IFAC-PapersOnLine 50 (1), pp. 458-463.
[7] F. Giri and E. Bai (Eds.) (2010) Block-oriented Nonlinear System Identification. Lecture Notes in Control and Information Sciences, Vol. 404.
[8] (2016) LSTM: a search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28 (10), pp. 2222-2232.
[9] (2019) Data set and reference models of EMPS. In Nonlinear System Identification Benchmarks, Eindhoven, Netherlands.
[10] (2015) Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.
[11] (2009) Wiener-Hammerstein benchmark. In 15th IFAC Symposium on System Identification, Saint-Malo, France.
[12] (1999) System Identification: Theory for the User. 2nd edition, Prentice Hall PTR, Upper Saddle River, NJ, USA.
[13] (2016) Hysteretic benchmark with a dynamic nonlinearity. In Workshop on Nonlinear System Identification Benchmarks, pp. 7-14.
[14] (1995) Discrete-Time Control Systems. Vol. 2, Prentice Hall, Englewood Cliffs, NJ.
[15] (1979) On representation and approximation of nonlinear systems. Biological Cybernetics 34 (1), pp. 49-52.
[16] (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
[17] (2014) Kernel methods in system identification, machine learning and function estimation: a survey. Automatica 50 (3), pp. 657-682.
[18] (2015) Parametric identification of parallel Wiener-Hammerstein systems. Automatica 51, pp. 111-122.
[19] (2014) Identification of Wiener-Hammerstein systems by a nonparametric separation of the best linear approximation. Automatica 50 (2), pp. 628-634.
[20] (2017) Time series classification from scratch with deep neural networks: a strong baseline. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 1578-1585.
[21] (2012) Generalised Hammerstein-Wiener system estimation and a benchmark application. Control Engineering Practice 20 (11), pp. 1097-1108.