1 Introduction
Recurrent Neural Networks (RNNs) are a class of neural networks suitable for time-series processing. In particular, gated RNNs [1, 2], such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), are fully-trained recurrent models that implement adaptive gates able to handle signals characterized by multiple time-scales dynamics. Recently, within the Reservoir Computing (RC) [3] framework, the Deep Echo State Network (DeepESN) model has been proposed as an extremely efficient way to design and train deep neural networks for temporal data, with the intrinsic ability to represent hierarchical and distributed temporal features [4, 5, 6].
In this paper, we investigate different approaches to RNN modeling (i.e., untrained stacked layers and fully-trained gated architectures) through an experimental comparison between RC and fully-trained RNNs on challenging real-world prediction tasks characterized by multivariate time-series. In particular, we perform a comparison between the DeepESN, LSTM and GRU models on 4 polyphonic music tasks [7]. Since these datasets are characterized by high-dimensional and complex temporal sequences, these challenging tasks are particularly suitable for RNN evaluation [8]. Moreover, we consider ESN and the simple RNN (Simple Recurrent Network, SRN) as baseline approaches for DeepESN and gated RNNs, respectively. The models are evaluated in terms of predictive accuracy and computational efficiency.
In a context in which model design is difficult, especially for fully-trained RNNs, this paper provides a first glimpse of an experimental comparison between different state-of-the-art recurrent models on multivariate time-series prediction tasks, which is still lacking in the literature.
2 Deep Echo State Networks
DeepESNs [4] extend Echo State Network (ESN) [9] models to the deep learning paradigm. Fig. 1 shows an example of a DeepESN architecture composed of a hierarchy of reservoirs, coupled with a readout output layer. In the following equations, $\mathbf{u}(t)$ and $\mathbf{x}^{(l)}(t)$ represent the external input and the state of the $l$-th reservoir layer at time-step $t$, respectively. Omitting bias terms for the ease of notation, and using leaky-integrator reservoir units, the state transition of the first recurrent layer is described as follows:
$$\mathbf{x}^{(1)}(t) = (1 - a^{(1)})\,\mathbf{x}^{(1)}(t-1) + a^{(1)}\, f\big(\mathbf{W}_{in}\,\mathbf{u}(t) + \hat{\mathbf{W}}^{(1)}\,\mathbf{x}^{(1)}(t-1)\big) \qquad (1)$$
while for each successive layer $l > 1$ the state computation is performed as follows:
$$\mathbf{x}^{(l)}(t) = (1 - a^{(l)})\,\mathbf{x}^{(l)}(t-1) + a^{(l)}\, f\big(\mathbf{W}^{(l)}\,\mathbf{x}^{(l-1)}(t) + \hat{\mathbf{W}}^{(l)}\,\mathbf{x}^{(l)}(t-1)\big) \qquad (2)$$
In eq. 1 and 2, $\mathbf{W}_{in}$ represents the matrix of input weights, $\hat{\mathbf{W}}^{(l)}$ is the matrix of the recurrent weights of layer $l$, $\mathbf{W}^{(l)}$ is the matrix that collects the inter-layer weights from layer $l-1$ to layer $l$, $a^{(l)}$ is the leaky parameter of layer $l$, and $f$ is the activation function of the recurrent units, implemented by a hyperbolic tangent ($\tanh$). Finally, the (global) state of the DeepESN is given by the concatenation of all the states encoded in the recurrent layers of the architecture, $\mathbf{x}(t) = (\mathbf{x}^{(1)}(t), \ldots, \mathbf{x}^{(N_L)}(t))$, where $N_L$ is the number of layers.
The weights in matrices $\mathbf{W}_{in}$ and $\mathbf{W}^{(l)}$ are randomly initialized from a uniform distribution and re-scaled such that $\|\mathbf{W}_{in}\|_2 = \omega_{in}$ and $\|\mathbf{W}^{(l)}\|_2 = \omega_{in}$, respectively, where $\omega_{in}$ is an input scaling parameter. Recurrent layers are initialized in order to satisfy the necessary condition for the Echo State Property of DeepESNs [10]. Accordingly, values in $\hat{\mathbf{W}}^{(l)}$ are randomly initialized from a uniform distribution and re-scaled such that $\rho(\hat{\mathbf{W}}^{(l)}) < 1$, where $\rho(\cdot)$ is the spectral radius of its matrix argument, i.e., the maximum among the moduli of its eigenvalues. The standard ESN case is obtained by considering a DeepESN with a single layer, i.e., when $N_L = 1$.
The output of the network at time-step $t$ is computed by the readout as a linear combination of the activations of the reservoir units, as follows: $\mathbf{y}(t) = \mathbf{W}_{out}\,\mathbf{x}(t)$, where $\mathbf{W}_{out}$ is the matrix of output weights. This combination allows the readout to differently weight the contributions of the multiple dynamics developed in the network's state. The training of the network is performed only on the readout layer, by means of direct numerical methods. Finally, as a pre-training technique we use the Intrinsic Plasticity (IP) adaptation for deep recurrent architectures, which is particularly effective for DeepESN and ESN architectures [4, 6].
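To make the state computation of eqs. (1) and (2) concrete, the following minimal NumPy sketch runs a small DeepESN reservoir over an input sequence and returns the concatenated global states. It is an illustrative reconstruction under our own assumptions (layer sizes, scaling values, spectral radius and the re-scaling choices are placeholders, and IP adaptation is omitted), not the reference MATLAB implementation used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_scaled(rows, cols, scale):
    # Uniform initialization in [-1, 1], then re-scaling of the 2-norm (illustrative choice).
    W = rng.uniform(-1.0, 1.0, size=(rows, cols))
    return scale * W / np.linalg.norm(W, 2)

def init_recurrent(units, rho):
    # Re-scale the recurrent matrix so that its spectral radius equals rho < 1.
    W = rng.uniform(-1.0, 1.0, size=(units, units))
    return W * (rho / max(abs(np.linalg.eigvals(W))))

n_layers, units, in_dim = 3, 50, 88        # placeholder architecture sizes
omega_in, rho, leak = 0.5, 0.9, 0.55       # placeholder hyper-parameters

W_in = init_scaled(units, in_dim, omega_in)                   # input weights (layer 1)
W_il = [init_scaled(units, units, omega_in)                   # inter-layer weights (l > 1)
        for _ in range(n_layers - 1)]
W_hat = [init_recurrent(units, rho) for _ in range(n_layers)]  # recurrent weights

def deep_esn_states(U):
    """Run the reservoir over a sequence U of shape (T, in_dim).
    Returns the global states, shape (T, n_layers * units), i.e. the
    concatenation of all layer states at each time-step (eqs. 1-2)."""
    x = [np.zeros(units) for _ in range(n_layers)]
    states = []
    for u in U:
        inp = u
        for l in range(n_layers):
            W_front = W_in if l == 0 else W_il[l - 1]
            x[l] = (1 - leak) * x[l] + leak * np.tanh(W_front @ inp + W_hat[l] @ x[l])
            inp = x[l]            # the state of layer l feeds layer l + 1
        states.append(np.concatenate(x))
    return np.array(states)

U = rng.uniform(0, 1, size=(200, in_dim))   # a dummy 200-step input sequence
X = deep_esn_states(U)
print(X.shape)                               # (200, 150)
```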
3 Experimental Comparison
In this section we present the results of the experimental comparison between randomized and fully-trained RNNs. The approaches are assessed on the polyphonic music tasks defined in [7]. In particular, we consider the following 4 datasets¹: Piano-midi.de, MuseData, JSB chorales and Nottingham. A polyphonic music task is defined as next-step prediction on 88-, 82-, 52- and 58-dimensional sequences for the Piano-midi.de, MuseData, JSB chorales and Nottingham datasets, respectively. Since these datasets consist of high-dimensional time-series characterized by heterogeneous sequences, sparse vector representations and complex temporal dependencies involved at different time-scales, they are considered challenging real-world benchmarks for RNNs [8].

¹ Piano-midi.de (www.piano-midi.de); MuseData (www.musedata.org); JSB chorales (chorales by J. S. Bach); Nottingham (ifdo.ca/~seymour/nottingham/nottingham.html).

Models' performance is measured by using the expected frame-level accuracy (ACC), commonly adopted as a prediction accuracy measure in polyphonic music tasks [7], and computed as follows:
$$\mathrm{ACC} = \frac{\sum_{t=1}^{T} TP(t)}{\sum_{t=1}^{T} \big( TP(t) + FP(t) + FN(t) \big)} \qquad (3)$$
where $T$ is the total number of time-steps, while $TP(t)$, $FP(t)$ and $FN(t)$ respectively denote the numbers of true positive, false positive and false negative notes predicted at time-step $t$.
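As an illustration, a frame-level accuracy of this kind can be computed from binary piano-roll matrices as in the following sketch; the array names and shapes are our own assumptions for the example, not taken from the reference implementation.

```python
import numpy as np

def frame_level_accuracy(y_true, y_pred):
    """Expected frame-level accuracy (eq. 3) for binary piano-roll matrices
    of shape (T, n_notes), where 1 marks an active note at a time-step."""
    tp = np.logical_and(y_true == 1, y_pred == 1).sum()
    fp = np.logical_and(y_true == 0, y_pred == 1).sum()
    fn = np.logical_and(y_true == 1, y_pred == 0).sum()
    return tp / (tp + fp + fn)

# Toy example: 3 time-steps, 4 notes.
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],   # one missed note (FN)
                   [0, 1, 1, 0],   # one spurious note (FP)
                   [1, 1, 0, 0]])
print(frame_level_accuracy(y_true, y_pred))  # 4 / (4 + 1 + 1) ≈ 0.667
```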
Concerning the DeepESN and ESN approaches, we considered sparsely connected reservoirs. Moreover, we performed a model selection on the major hyper-parameters, namely the spectral radius, the leaky parameter and the input scaling. Training of the readout was performed through ridge regression [9, 3], with the regularization coefficient selected by model selection. Moreover, based on the results of the design analysis in [6] on polyphonic music tasks, we set up DeepESN with 30 recurrent layers, and ESN with a single reservoir; the corresponding total numbers of recurrent units are reported in Tab. 1. We used an IP adaptation configured as in [4, 6].
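For reference, ridge regression trains the readout in closed form over the collected global states. A minimal sketch is given below, with notation of our own: X would gather the global states over the training sequences (e.g., as produced by a state-collection routine like the one sketched in Sec. 2), Y the next-step targets, and lambda_r is a placeholder regularization strength.

```python
import numpy as np

def train_readout(X, Y, lambda_r=1e-3):
    """Ridge-regression readout: W_out = Y^T X (X^T X + lambda I)^{-1}.
    X: collected global states, shape (T, n_states); Y: targets, shape (T, n_out)."""
    n_states = X.shape[1]
    A = X.T @ X + lambda_r * np.eye(n_states)   # regularized normal equations
    B = X.T @ Y
    return np.linalg.solve(A, B).T              # W_out has shape (n_out, n_states)

# The readout prediction of Sec. 2 is then y(t) = W_out @ x(t).
```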
As regards fully-trained RNNs, we used the Adam learning algorithm [11] with a maximum of 2000 epochs. In order to regularize the learning process, we applied dropout, gradient clipping and early stopping on the validation set. We then performed a model selection over the learning rate and the dropout rate. Since randomized and fully-trained RNNs implement different learning approaches, it is difficult to set up a fair experimental comparison between them. We addressed this difficulty by considering a comparable number of free parameters for all the models. The numbers of recurrent units and free parameters considered for the models are shown in the second and third columns of Tab. 1. Each model was individually selected on the validation sets through a grid search over the hyper-parameter ranges. We independently generated 5 guesses for each network hyper-parametrization (for random initialization), and averaged the results over such guesses.
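A minimal Keras sketch of the kind of fully-trained setup described above is shown below. The layer size, learning rate, clipping value, dropout rate and patience are placeholder values of ours, not the ones selected in the experiments, and the dummy data stands in for the (variable-length) piano-roll sequences, which would additionally require padding or masking.

```python
import numpy as np
import tensorflow as tf

n_notes = 88  # e.g. Piano-midi.de input/output dimensionality
model = tf.keras.Sequential([
    tf.keras.layers.GRU(100, return_sequences=True, dropout=0.2,
                        input_shape=(None, n_notes)),
    tf.keras.layers.Dense(n_notes, activation="sigmoid"),  # next-step note probabilities
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3, clipvalue=1.0),
    loss="binary_crossentropy",
)

# Next-step prediction: the target sequence is the input shifted by one step.
X = np.random.randint(0, 2, size=(32, 200, n_notes)).astype("float32")
model.fit(X[:, :-1, :], X[:, 1:, :],
          epochs=2000, batch_size=8, validation_split=0.2,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=10)])
```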
Tab. 1: Total number of recurrent units, number of free parameters, test ACC (standard deviation in parentheses) and computation time (in seconds) achieved by the considered models on the polyphonic music tasks.

| Model | total recurrent units | free-parameters | test ACC | computation time (s) |
| Piano-midi.de | | | | |
| DeepESN | | | 33.33 (0.11) % | 386 |
| ESN | | | 30.43 (0.06) % | 748 |
| SRN | | | 29.48 (0.35) % | 3185 |
| LSTM | | | 28.98 (2.93) % | 2333 |
| GRU | | | 31.38 (0.21) % | 2821 |
| MuseData | | | | |
| DeepESN | | | 36.32 (0.06) % | 789 |
| ESN | | | 35.95 (0.04) % | 997 |
| SRN | | | 34.02 (0.28) % | 8825 |
| LSTM | | | 34.71 (1.17) % | 18274 |
| GRU | | | 35.89 (0.17) % | 18104 |
| JSB chorales | | | | |
| DeepESN | | | 30.82 (0.12) % | 83 |
| ESN | | | 29.14 (0.09) % | 140 |
| SRN | | | 29.68 (0.17) % | 341 |
| LSTM | | | 29.80 (0.38) % | 532 |
| GRU | | | 29.63 (0.64) % | 230 |
| Nottingham | | | | |
| DeepESN | | | 69.43 (0.05) % | 677 |
| ESN | | | 69.12 (0.08) % | 1473 |
| SRN | | | 65.89 (0.49) % | 2252 |
| LSTM | | | 70.00 (0.24) % | 26175 |
| GRU | | | 71.50 (0.77) % | 11844 |
In accordance with the different characteristics of the considered training approaches (direct methods for RC and iterative methods for fully-trained models), we chose the most efficient implementation in each case. Accordingly, we used a MATLAB implementation for the DeepESN and ESN models, and a Keras implementation for the fully-trained RNNs. We measured the time (in seconds) spent by the models in the training and test procedures, performing the experiments on an "Intel Xeon E5, 1.80 GHz, 16 cores" CPU for the RC approaches, and on a "Tesla P100 PCIe 16GB" GPU for the fully-trained RNNs, with the aim of giving the best resources to each of them.
Tab. 1 shows the number of recurrent units, the number of free parameters, the predictive accuracy and the computation time (in seconds) achieved by the DeepESN, ESN, SRN, LSTM and GRU models. Concerning the comparison between RC approaches in terms of predictive performance, the results indicate that DeepESN outperformed ESN on the Piano-midi.de, MuseData, JSB chorales and Nottingham tasks. Concerning the comparison between fully-trained RNNs, GRU obtained an accuracy similar to the SRN and LSTM models on the JSB chorales task, and it outperformed them on the Piano-midi.de, MuseData and Nottingham tasks.
The efficiency assessment shows that DeepESN requires about one order of magnitude less computation time than fully-trained RNNs, boosting the already striking efficiency of standard ESN models. Moreover, while ESN benefits in terms of efficiency only by exploiting the sparsity of the reservoir, in the case of DeepESN the benefit is intrinsically due to the architectural constraints introduced by layering [6] (and is obtained also with fully-connected layers).
Overall, the DeepESN model outperformed all the other approaches on 3 out of 4 tasks, while being far more efficient than the fully-trained RNNs.
4 Conclusions
In this paper, we performed an experimental comparison between randomized and fully-trained RNNs on challenging real-world tasks characterized by multivariate time-series. This kind of comparison on complex temporal tasks, which is practically absent in the literature, especially as regards efficiency aspects, offered the opportunity to assess efficient alternative models (ESN and DeepESN in particular) against typical RNN approaches (LSTM and GRU). Moreover, we also assessed the effectiveness of layering in deep recurrent architectures with a large number of layers (i.e., 30).
Concerning fully-trained RNNs, GRU outperformed the other fully-trained models on 3 out of 4 tasks, and it was more efficient than LSTM in most cases. The effectiveness of the GRU approach found in our experiments is in line with the literature on the design of adaptive gates in recurrent architectures.
As regards randomized RNNs, the results show that DeepESN is able to outperform ESN in terms of both prediction accuracy and efficiency on all tasks. Interestingly, this highlights that layering improves the effectiveness of RC approaches in processing multiple time-scales. Overall, the DeepESN model outperformed the other approaches in terms of prediction accuracy on 3 out of 4 tasks. Finally, DeepESN required much less computation time than the other models, resulting in an extremely efficient model able to compete with the state of the art on challenging time-series tasks.
More in general, it is interesting to highlight the gain in prediction accuracy enabled by multiple time-scales processing, obtained through layering in deep RC models and through adaptive gates in fully-trained RNNs, in comparison to the respective baselines (ESN and SRN). It is also particularly interesting to note the comparison between models that learn multiple time-scales dynamics (LSTM and GRU) and models with an intrinsic capability to develop such hierarchical temporal representations (DeepESN), a comparison which was previously lacking in the literature.
In addition to providing insights on such general issues, this paper aims to show a practical way to efficiently approach the design of learning models in the deep RNN scenario, extending the set of tools available to users for complex time-series tasks. Indeed, the first empirical results provided in this paper seem to indicate that some classes of models (GRU, LSTM) are sometimes adopted uncritically, i.e., despite their cost, driven by the popularity due to their software availability. A similar diffusion of software tools deserves more effort on the side of the other models (the DeepESN class), although the first instances are already available.²

² DeepESN implementations are made publicly available for download both in MATLAB (see https://it.mathworks.com/matlabcentral/fileexchange/69402-deepesn) and in Python (see https://github.com/lucapedrelli/DeepESN).
References
 [1] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [2] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 [3] M. Lukoševičius and H. Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009.
 [4] C. Gallicchio, A. Micheli, and L. Pedrelli. Deep reservoir computing: a critical experimental analysis. Neurocomputing, 268:87–99, 2017.
 [5] C. Gallicchio, A. Micheli, and L. Pedrelli. Hierarchical Temporal Representation in Linear Reservoir Computing. In Neural Advances in Processing Nonlinear Dynamic Signals, pages 119–129, Cham, 2019. Springer International Publishing. WIRN 2017.
 [6] C. Gallicchio, A. Micheli, and L. Pedrelli. Design of deep echo state networks. Neural Networks, 108:33–47, 2018.

 [7] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
 [8] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu. Advances in optimizing recurrent networks. In ICASSP 2013, pages 8624–8628. IEEE, 2013.
 [9] H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.
 [10] C. Gallicchio and A. Micheli. Echo State Property of Deep Reservoir Computing Networks. Cognitive Computation, 9(3):337–350, 2017.
 [11] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.