Recurrent Neural Networks (RNNs) are a class of neural networks suitable for time-series processing. In particular, gated RNNs [1, 2], such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), are fully-trained recurrent models that implement adaptive gates able to deal with signals characterized by multiple time-scale dynamics. Recently, within the Reservoir Computing (RC) framework [3], the Deep Echo State Network (DeepESN) model has been proposed as an extremely efficient way to design and train deep neural networks for temporal data, with the intrinsic ability to represent hierarchical and distributed temporal features [4, 5, 6].
In this paper, we investigate different approaches to RNN modeling (i.e., untrained stacked layers and fully-trained gated architectures) through an experimental comparison between RC and fully-trained RNNs on challenging real-world prediction tasks characterized by multivariate time-series. In particular, we compare the DeepESN, LSTM and GRU models on 4 polyphonic music tasks. Since these datasets are characterized by high-dimensional sequences with complex temporal dependencies, these challenging tasks are particularly suitable for the evaluation of RNNs. Moreover, we consider ESN and the simple RNN (Simple Recurrent Network, SRN) as baseline approaches for DeepESN and gated RNNs, respectively. The models are evaluated in terms of predictive accuracy and computational efficiency.
In a context in which model design is difficult, especially for fully-trained RNNs, this paper aims to provide a first glimpse of an experimental comparison between different state-of-the-art recurrent models on multivariate time-series prediction tasks, which is still lacking in the literature.
2 Deep Echo State Networks
The DeepESN model extends RC models to the deep learning paradigm. Fig. 1 shows an example of a DeepESN architecture, composed of a hierarchy of reservoirs coupled with a readout output layer.
In the following equations, \(\mathbf{u}(t)\) and \(\mathbf{x}^{(l)}(t)\) represent the external input and the state of the \(l\)-th reservoir layer at time-step \(t\), respectively. Omitting bias terms for ease of notation, and using leaky-integrator reservoir units, the state transition of the first recurrent layer is described as follows:
\[
\mathbf{x}^{(1)}(t) = (1 - a^{(1)})\,\mathbf{x}^{(1)}(t-1) + a^{(1)} f\!\left(\mathbf{W}_{in}\,\mathbf{u}(t) + \hat{\mathbf{W}}^{(1)}\,\mathbf{x}^{(1)}(t-1)\right), \tag{1}
\]
while for each layer \(l > 1\) the state computation is performed as follows:
\[
\mathbf{x}^{(l)}(t) = (1 - a^{(l)})\,\mathbf{x}^{(l)}(t-1) + a^{(l)} f\!\left(\mathbf{W}^{(l)}\,\mathbf{x}^{(l-1)}(t) + \hat{\mathbf{W}}^{(l)}\,\mathbf{x}^{(l)}(t-1)\right). \tag{2}
\]
In eqs. 1 and 2, \(\mathbf{W}_{in}\) represents the matrix of input weights, \(\hat{\mathbf{W}}^{(l)}\) is the matrix of the recurrent weights of layer \(l\), \(\mathbf{W}^{(l)}\) is the matrix that collects the inter-layer weights from layer \(l-1\) to layer \(l\), \(a^{(l)}\) is the leaky parameter at layer \(l\), and \(f\) is the activation function of the recurrent units, implemented by a hyperbolic tangent (\(\tanh\)). Finally, the (global) state of the DeepESN is given by the concatenation of all the states encoded in the recurrent layers of the architecture, \(\mathbf{x}(t) = (\mathbf{x}^{(1)}(t), \ldots, \mathbf{x}^{(L)}(t))\).
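As an illustration, the layer-wise state transitions of eqs. 1 and 2 can be sketched in NumPy as follows. This is a minimal sketch under our own assumptions (function name, dense matrices, one shared layer width), not the authors' implementation:

```python
import numpy as np

def deep_esn_states(W_in, W_hat, W_il, a, u):
    """Run a DeepESN over an input sequence and return the global states.

    W_in  : input weight matrix (N_R x N_U), feeding the first layer only
    W_hat : list of recurrent weight matrices, one (N_R x N_R) per layer
    W_il  : list of inter-layer matrices, one (N_R x N_R) for each layer l >= 2
    a     : list of leaking rates, one per layer
    u     : input sequence, shape (T, N_U)
    """
    L = len(W_hat)
    N_R = W_hat[0].shape[0]
    x = [np.zeros(N_R) for _ in range(L)]
    states = []
    for u_t in u:
        prev = [xi.copy() for xi in x]  # states at time-step t-1
        for l in range(L):
            # eq. 1 for the first layer, eq. 2 for the others:
            # lower layers are already updated, so x[l-1] holds x^(l-1)(t)
            drive = W_in @ u_t if l == 0 else W_il[l - 1] @ x[l - 1]
            x[l] = (1 - a[l]) * prev[l] + a[l] * np.tanh(drive + W_hat[l] @ prev[l])
        states.append(np.concatenate(x))  # global state: concatenation over layers
    return np.array(states)  # shape (T, L * N_R)
```

Note how, for \(l > 1\), the current state of layer \(l-1\) drives layer \(l\), which is what makes the hierarchy of reservoirs develop progressively slower dynamics along the stack.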
The weights in the matrices \(\mathbf{W}_{in}\) and \(\mathbf{W}^{(l)}\) are randomly initialized from a uniform distribution and re-scaled according to an input scaling parameter \(\omega_{in}\). Recurrent layers are initialized in order to satisfy the necessary condition for the Echo State Property of DeepESNs [10]. Accordingly, the values in each \(\hat{\mathbf{W}}^{(l)}\) are randomly initialized from a uniform distribution and re-scaled such that \(\rho(\hat{\mathbf{W}}^{(l)}) < 1\), where \(\rho(\cdot)\) denotes the spectral radius of its matrix argument, i.e. the maximum among the moduli of its eigenvalues. The standard ESN case is obtained by considering a DeepESN with a single layer, i.e. when \(L = 1\).
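The spectral-radius re-scaling used for the recurrent matrices can be sketched as follows; the helper name and the dense uniform initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def rescale_spectral_radius(W, rho_target):
    """Re-scale a recurrent matrix so that its spectral radius equals rho_target."""
    rho = max(abs(np.linalg.eigvals(W)))  # maximum modulus of the eigenvalues
    return W * (rho_target / rho)

# example: a 100x100 reservoir matrix drawn from U(-1, 1), re-scaled to rho = 0.9
W_hat = rescale_spectral_radius(rng.uniform(-1, 1, (100, 100)), 0.9)
```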
The output of the network at time-step \(t\) is computed by the readout as a linear combination of the activations of the reservoir units, as follows: \(\mathbf{y}(t) = \mathbf{W}_{out}\,\mathbf{x}(t)\), where \(\mathbf{W}_{out}\) is the matrix of output weights. This combination allows the network to differently weight the contributions of the multiple dynamics developed in its state. The training of the network is performed only on the readout layer, by means of direct numerical methods. Finally, as a pre-training technique we use the Intrinsic Plasticity (IP) adaptation for deep recurrent architectures, which is particularly effective for DeepESN and ESN architectures [4, 6].
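Since the readout is trained by direct numerical methods (ridge regression in the experiments of Sec. 3), a minimal NumPy sketch of this step could look as follows; the function name and shapes are illustrative assumptions:

```python
import numpy as np

def train_readout(X, Y, lam):
    """Closed-form ridge regression for the readout weights.

    X   : collected global states, shape (T, N_x)
    Y   : target outputs, shape (T, N_y)
    lam : regularization coefficient
    Returns W_out of shape (N_y, N_x), so that y(t) = W_out @ x(t).
    """
    N_x = X.shape[1]
    # solve (X^T X + lam I) W = X^T Y for W, then transpose
    W = np.linalg.solve(X.T @ X + lam * np.eye(N_x), X.T @ Y)
    return W.T
```

Only this linear solve constitutes training in the RC approaches, which is the source of their efficiency advantage over iteratively trained RNNs.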
3 Experimental Comparison
In this section, we present the results of the experimental comparison performed between randomized and fully-trained RNNs. The approaches are assessed on the polyphonic music tasks defined in [7]. In particular, we consider the following 4 datasets¹: Piano-midi.de, MuseData, JSBchorales and Nottingham. A polyphonic music task is defined as a next-step prediction on 88-, 82-, 52- and 58-dimensional sequences for the Piano-midi.de, MuseData, JSBchorales and Nottingham datasets, respectively. Since these datasets consist of high-dimensional time-series characterized by heterogeneous sequences, sparse vector representations and complex temporal dependencies involved at different time-scales, they are considered challenging real-world benchmarks for RNNs.

¹ Piano-midi.de (www.piano-midi.de); MuseData (www.musedata.org); JSBchorales (chorales by J. S. Bach); Nottingham (ifdo.ca/~seymour/nottingham/nottingham.html).
Models’ performance is measured by means of the expected frame-level accuracy (ACC), commonly adopted as the prediction accuracy measure in polyphonic music tasks [7], and computed as follows:
\[
ACC = \frac{\sum_{t=1}^{T} TP(t)}{\sum_{t=1}^{T} \left( TP(t) + FP(t) + FN(t) \right)},
\]
where \(T\) is the total number of time-steps, while \(TP(t)\), \(FP(t)\) and \(FN(t)\) respectively denote the numbers of true positive, false positive and false negative notes predicted at time-step \(t\).
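For concreteness, the ACC measure can be computed from the per-time-step counts as follows (an illustrative helper, not the authors' code):

```python
def frame_level_acc(tp, fp, fn):
    """Expected frame-level accuracy over T time-steps.

    tp, fp, fn: sequences holding the per-time-step counts TP(t), FP(t), FN(t).
    """
    return sum(tp) / (sum(tp) + sum(fp) + sum(fn))
```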
Concerning the DeepESN and ESN approaches, we considered sparsely connected reservoirs. Moreover, we performed a model selection on the major hyper-parameters, exploring ranges of values for the spectral radius, the leaky parameter and the input scaling. Training of the readout was performed through ridge regression [9, 3], with the regularization coefficient selected on the validation set. Moreover, based on the results of the design analysis in [6] on polyphonic music tasks, we set up DeepESN with a deep architecture of 30 recurrent layers, and ESN with a single reservoir with a comparable total number of recurrent units. Finally, we used an IP adaptation configured as in [4, 6].
As regards fully-trained RNNs, we used the Adam learning algorithm [11] with a maximum of 2000 epochs. In order to regularize the learning process, we applied dropout, gradient clipping and early stopping on the validation set with a fixed patience value. Then, we performed a model selection exploring ranges of values for the learning rate and the dropout rate.
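The regularization mechanisms mentioned above (gradient clipping and patience-based early stopping) can be sketched in a framework-agnostic way as follows; this is an illustrative reimplementation of the standard techniques, not the Keras configuration actually used in the experiments:

```python
import numpy as np

def clip_gradient(grad, threshold):
    """Re-scale the gradient so that its norm never exceeds the threshold."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad

class EarlyStopping:
    """Stop training when the validation loss has not improved for `patience` epochs."""
    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0
        else:
            self.wait += 1
        return self.wait >= self.patience  # True -> stop training
```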
Since randomized and fully-trained RNNs implement different learning approaches, it is difficult to set up an entirely fair experimental comparison between them. We addressed this difficulty by considering a comparable number of free parameters for all the models. The number of recurrent units and free parameters considered in the models is shown in the second and third columns of Tab. 1. Each model is individually selected on the validation sets through a grid search over the hyper-parameter ranges. We independently generated 5 guesses for each network hyper-parametrization (for random initialization) and averaged the results over these guesses.
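The selection protocol (grid search with results averaged over 5 independent random initializations) can be sketched as follows, with a hypothetical `evaluate(cfg, seed)` callback standing in for training and validating one model:

```python
import itertools
import statistics

def model_selection(grid, evaluate, n_guesses=5):
    """Grid search: for each hyper-parametrization, average the validation
    accuracy over n_guesses independent random initializations (seeds)."""
    best_cfg, best_acc = None, float("-inf")
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        acc = statistics.mean(evaluate(cfg, seed=s) for s in range(n_guesses))
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc
```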
| Model | total recurrent units | free parameters | test ACC | computation time (s) |
|---|---|---|---|---|
| **Piano-midi.de** | | | | |
| DeepESN | | | 33.33 (0.11) % | 386 |
| ESN | | | 30.43 (0.06) % | 748 |
| SRN | | | 29.48 (0.35) % | 3185 |
| LSTM | | | 28.98 (2.93) % | 2333 |
| GRU | | | 31.38 (0.21) % | 2821 |
| **MuseData** | | | | |
| DeepESN | | | 36.32 (0.06) % | 789 |
| ESN | | | 35.95 (0.04) % | 997 |
| SRN | | | 34.02 (0.28) % | 8825 |
| LSTM | | | 34.71 (1.17) % | 18274 |
| GRU | | | 35.89 (0.17) % | 18104 |
| **JSBchorales** | | | | |
| DeepESN | | | 30.82 (0.12) % | 83 |
| ESN | | | 29.14 (0.09) % | 140 |
| SRN | | | 29.68 (0.17) % | 341 |
| LSTM | | | 29.80 (0.38) % | 532 |
| GRU | | | 29.63 (0.64) % | 230 |
| **Nottingham** | | | | |
| DeepESN | | | 69.43 (0.05) % | 677 |
| ESN | | | 69.12 (0.08) % | 1473 |
| SRN | | | 65.89 (0.49) % | 2252 |
| LSTM | | | 70.00 (0.24) % | 26175 |
| GRU | | | 71.50 (0.77) % | 11844 |
In accordance with the different characteristics of the considered training approaches (direct methods for RC and iterative methods for fully-trained models), we chose the most efficient implementation for each case. Accordingly, we used a MATLAB implementation for the DeepESN and ESN models, and a Keras implementation for the fully-trained RNNs. We measured the time in seconds spent by the models in the training and test procedures, performing the experiments on an “Intel Xeon E5, 1.80GHz, 16 cores” CPU for the RC approaches, and on a “Tesla P100 PCIe 16GB” GPU for the fully-trained RNNs, with the aim of giving the best-suited resources to each of them.
Tab. 1 shows the number of recurrent units, the number of free parameters, the predictive accuracy and the computation time (in seconds) achieved by the DeepESN, ESN, SRN, LSTM and GRU models. As regards the comparison between the RC approaches in terms of predictive performance, the results indicate that DeepESN outperformed ESN on all tasks, with accuracy improvements of 2.90, 0.37, 1.68 and 0.31 points on the Piano-midi.de, MuseData, JSBchorales and Nottingham tasks, respectively. Concerning the comparison between fully-trained RNNs, GRU obtained an accuracy similar to the SRN and LSTM models on the JSBchorales task, and outperformed them on the Piano-midi.de, MuseData and Nottingham tasks.
The efficiency assessments show that DeepESN requires about one order of magnitude less computation time than fully-trained RNNs, boosting the already striking efficiency of standard ESN models. Moreover, while ESN benefits in terms of efficiency only by exploiting the sparsity of its reservoir, in the case of DeepESN the benefit is intrinsically due to the architectural constraints introduced by layering (and is obtained also with fully-connected layers).
Overall, the DeepESN model outperformed all the other approaches on 3 out of 4 tasks, while being far more efficient than fully-trained RNNs.
4 Conclusions
In this paper, we performed an experimental comparison between randomized and fully-trained RNNs on challenging real-world tasks characterized by multivariate time-series. This kind of comparison on complex temporal tasks, which is practically absent in the literature, especially as regards efficiency aspects, offered the opportunity to assess efficient alternatives (ESN and DeepESN in particular) to typical RNN approaches (LSTM and GRU). Moreover, we also assessed the effectiveness of layering in deep recurrent architectures with a large number of layers (i.e., 30).
Concerning fully-trained RNNs, GRU outperformed the other fully-trained models on 3 out of 4 tasks and was more efficient than LSTM in most cases. The effectiveness of the GRU approach found in our experiments is in line with the literature dealing with the design of adaptive gates in recurrent architectures.
As regards randomized RNNs, the results show that DeepESN outperforms ESN in terms of both prediction accuracy and efficiency on all tasks. Interestingly, this highlights that layering allows us to improve the effectiveness of RC approaches in multiple time-scale processing. Overall, the DeepESN model outperformed the other approaches in terms of prediction accuracy on 3 out of 4 tasks. Finally, DeepESN required much less computation time than the other models, resulting in an extremely efficient model able to compete with the state-of-the-art on challenging time-series tasks.
More generally, it is interesting to highlight the gain in prediction accuracy, with respect to the respective baselines (ESN and SRN), provided by the multiple time-scale processing capability obtained through layering in deep RC models and through adaptive gates in fully-trained RNNs. It is also particularly interesting to note the comparison between models with the capability to learn multiple time-scale dynamics (LSTM and GRU) and models with an intrinsic capability to develop such hierarchical temporal representations (DeepESN), a comparison that was completely lacking in the literature.
In addition to providing insights on such general issues, this paper contributes to showing a practical way to efficiently approach the design of learning models in the deep RNN scenario, extending the set of tools available to users for complex time-series tasks. Indeed, the first empirical results provided in this paper seem to indicate that some classes of models (GRU, LSTM) are sometimes uncritically adopted despite their cost, guided by the popularity that naturally derives from their software availability. The same diffusion of software tools deserves more effort on the side of the other classes of models (DeepESN in particular), although first instances are already available².

² DeepESN implementations are publicly available for download both in MATLAB (https://it.mathworks.com/matlabcentral/fileexchange/69402-deepesn) and in Python (https://github.com/lucapedrelli/DeepESN).
- [1] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [2] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- [3] M. Lukoševičius and H. Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009.
- [4] C. Gallicchio, A. Micheli, and L. Pedrelli. Deep reservoir computing: a critical experimental analysis. Neurocomputing, 268:87–99, 2017.
- [5] C. Gallicchio, A. Micheli, and L. Pedrelli. Hierarchical temporal representation in linear reservoir computing. In Neural Advances in Processing Nonlinear Dynamic Signals, pages 119–129, Cham, 2019. Springer International Publishing. WIRN 2017.
- [6] C. Gallicchio, A. Micheli, and L. Pedrelli. Design of deep echo state networks. Neural Networks, 108:33–47, 2018.
- [7] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
- [8] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu. Advances in optimizing recurrent networks. In ICASSP 2013, pages 8624–8628. IEEE, 2013.
- [9] H. Jaeger and H. Haas. Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.
- [10] C. Gallicchio and A. Micheli. Echo state property of deep reservoir computing networks. Cognitive Computation, 9(3):337–350, 2017.
- [11] D. P. Kingma and J. Ba. Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.