1 Introduction
Randomized approaches to the design of neural networks are nowadays the subject of considerable attention in the research community [21, 29, 2]. The idea of keeping the internal connections fixed is particularly intriguing when considering Recurrent Neural Networks (RNNs) [7]. In this case, indeed, the remarkable efficiency advantages of using untrained hidden weights are coupled with the need to control the resulting system dynamics, to make sure that they can be useful for learning. In this context, Reservoir Computing (RC) [17, 24]
represents the reference paradigm for the randomized design of RNNs. A promising research line in the current development of RC is given by the exploration of its extensions to deep learning
[16, 22], with the introduction of the Deep Echo State Network (DeepESN) [5] providing a refreshing perspective on the study of hierarchically structured RNNs. On the one hand, results in [10, 6] suggested that a proper architectural design of DeepESNs can have a tremendous impact on real-world applications. On the other hand, investigations of DeepESN dynamics [5, 4, 11] revealed that a stacked composition of recurrent layers has the inherent ability to diversify the dynamical response to the driving input signals in successive levels of the architecture. However, the effects of the network setup on the quality of state dynamics in deep RC models currently remain largely unexplored.

In this paper we study the richness of the state dynamics developed in successive layers of DeepESNs. To do so, we make use of quantitative measures of diverse nature (including entropy, number of uncoupled state dynamics, and condition number) and study how they vary in deeper network settings. Differently from the work in [5], here we do not consider any pre-training approach for the recurrent layers, and focus our analysis only on the intrinsic effects of layering. Besides, the experimental investigation reported in this paper also contributes to exploring the viability of least mean squares (LMS)-based algorithms for training the output component of RC networks.
The rest of this paper is structured as follows. In Section 2 we present the basics of the DeepESN model, while the adopted quality measures of reservoir dynamics are introduced in Section 3. The experimental settings and the achieved results are reported in Section 4. Finally, Section 5 concludes the paper.
2 Deep Echo State Networks
The RC approach to RNN design is based on separating the dynamical recurrent (nonlinear) component of the network, called reservoir, from the feedforward (linear) output layer, called readout. While the application of training algorithms is limited to the readout, the reservoir is initialized under stability constraints and then is left untrained, making the overall approach extremely efficient in comparison to fully trained RNN models. The RC paradigm has several equivalent formulations, among which the Echo State Network (ESN) [14, 13] is one of the most popular. In the rest of this section we deal with deep extensions of ESNs, referring the reader interested in basic aspects of RC to the extensive overviews available in literature [17, 24].
A Deep Echo State Network (DeepESN) [5] is an RC model in which the reservoir part is structured into a stack of layers. The output of each reservoir layer constitutes the input for the next one in the deep architecture, and the external input signal is propagated only to the first reservoir in the stack. A comprehensive summary of properties and recent advancements in the study of DeepESNs can be found in [9].
In what follows, we denote the input and the output sizes respectively by $N_U$ and $N_Y$, the number of layers is indicated by $N_L$, and we make the assumption that each layer contains the same number $N_R$ of recurrent units. An illustrative representation of the DeepESN reservoir architecture is given in Fig. 1. The state update equations of a DeepESN are described in terms of discrete-time iterated mappings.^{1} At each time-step $t$, the state of the first layer, denoted by $\mathbf{x}^{(1)}(t) \in \mathbb{R}^{N_R}$, is computed as follows:

^{1} For the ease of notation, DeepESN equations are reported here omitting the bias terms. For a comprehensive mathematical description of DeepESN, comprising the case of leaky integrator reservoir units, the reader is referred to [5].
(1)  $\mathbf{x}^{(1)}(t) = \tanh\big(\mathbf{W}_{in}\,\mathbf{u}(t) + \hat{\mathbf{W}}^{(1)}\,\mathbf{x}^{(1)}(t-1)\big)$
while for the successive layers $l = 2, \ldots, N_L$, the state $\mathbf{x}^{(l)}(t) \in \mathbb{R}^{N_R}$ is updated according to the following equation:
(2)  $\mathbf{x}^{(l)}(t) = \tanh\big(\mathbf{W}^{(l)}\,\mathbf{x}^{(l-1)}(t) + \hat{\mathbf{W}}^{(l)}\,\mathbf{x}^{(l)}(t-1)\big)$
In the above eqs. 1 and 2, $\mathbf{W}_{in} \in \mathbb{R}^{N_R \times N_U}$ is the input weight matrix, $\mathbf{W}^{(l)}$ (for $l > 1$) is the inter-layer weight matrix connecting the $(l-1)$-th reservoir to the $l$-th reservoir, $\hat{\mathbf{W}}^{(l)}$ (for $l \geq 1$) is the recurrent weight matrix for layer $l$, and $\tanh$ denotes the element-wise application of the hyperbolic tangent non-linearity. Typically, a zero state is used as initial condition for each layer, i.e. $\mathbf{x}^{(l)}(0) = \mathbf{0}$. Note that whenever a single reservoir layer is considered in the architecture, i.e. if $N_L = 1$, the DeepESN reduces to a standard (shallow) ESN.
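As an illustration, the layer-wise state update of eqs. 1 and 2 can be sketched in a few lines of NumPy. All sizes and weight values below are hypothetical placeholders, and the stability-oriented rescaling of the weight matrices is deliberately ignored here (it is discussed next in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N_U, N_R, N_L, N_T = 1, 20, 3, 50  # hypothetical sizes: input, units/layer, layers, steps

# Untrained weights (their proper rescaling is a separate initialization step).
W_in = rng.uniform(-1, 1, (N_R, N_U))                               # input weights, layer 1 only
W_il = [rng.uniform(-1, 1, (N_R, N_R)) for _ in range(N_L - 1)]     # inter-layer weights
W_rec = [0.1 * rng.uniform(-1, 1, (N_R, N_R)) for _ in range(N_L)]  # recurrent weights

def deep_esn_states(u):
    """Iterate eqs. 1-2: layer 1 is driven by u(t), each layer l > 1 by layer l - 1."""
    x = [np.zeros(N_R) for _ in range(N_L)]  # zero initial conditions
    states = np.zeros((N_L, len(u), N_R))
    for t in range(len(u)):
        x[0] = np.tanh(W_in @ u[t] + W_rec[0] @ x[0])
        for l in range(1, N_L):
            x[l] = np.tanh(W_il[l - 1] @ x[l - 1] + W_rec[l] @ x[l])
        states[:, t] = x
    return states

states = deep_esn_states(rng.uniform(0, 1, (N_T, N_U)))  # shape (N_L, N_T, N_R)
```

Note how the external input reaches only the first layer, while every deeper layer sees the states of the layer below as its driving signal.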
Following the RC paradigm, the values in all the above-mentioned reservoir weight matrices (in eqs. 1 and 2) are left untrained after an initialization based on asymptotic stability criteria, commonly known under the name of Echo State Property (ESP) [17, 27]. A detailed analysis of the ESP for the case of deep reservoirs is provided in one of our previous works [4], to which the reader is referred for a detailed description. Here we limit ourselves to recall that the stability analysis of deep reservoir dynamics essentially suggests constraining the magnitude of the involved weight matrices, which leads to a simple initialization procedure for a DeepESN. Accordingly, the values in $\mathbf{W}_{in}$, $\mathbf{W}^{(l)}$ and $\hat{\mathbf{W}}^{(l)}$ are first chosen randomly from a uniform distribution in $[-1, 1]$, and then rescaled to control the values of the following hyper-parameters: the input scaling $\omega_{in}$, the inter-layer scalings $\omega_{il}$ (for $l > 1$), and the spectral radii^{2} $\rho^{(l)}$ (for $l \geq 1$).

^{2} The maximum among the eigenvalues in modulus.

Given a driving input signal of length $N_T$, i.e. $\mathbf{u}(1), \ldots, \mathbf{u}(N_T)$, we find it useful to (column-wise) collect the states developed by each reservoir layer $l$ into a state matrix:

(3)  $\mathbf{X}^{(l)} = \big[\mathbf{x}^{(l)}(1), \ldots, \mathbf{x}^{(l)}(N_T)\big] \in \mathbb{R}^{N_R \times N_T}$
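A minimal sketch of the initialization procedure described above, assuming the common recipe of drawing uniform weights in $[-1, 1]$ and rescaling the recurrent matrices to a target spectral radius (all hyper-parameter values in the example call are illustrative):

```python
import numpy as np

def spectral_radius(W):
    """Maximum eigenvalue modulus of a square matrix."""
    return np.max(np.abs(np.linalg.eigvals(W)))

def init_deep_reservoir(N_U, N_R, N_L, omega_in, omega_il, rho, seed=0):
    """Uniform weights in [-1, 1], rescaled by the three hyper-parameters."""
    rng = np.random.default_rng(seed)
    W_in = omega_in * rng.uniform(-1, 1, (N_R, N_U))
    W_il = [omega_il * rng.uniform(-1, 1, (N_R, N_R)) for _ in range(N_L - 1)]
    W_rec = []
    for _ in range(N_L):
        W = rng.uniform(-1, 1, (N_R, N_R))
        W_rec.append(W * (rho / spectral_radius(W)))  # enforce spectral radius rho
    return W_in, W_il, W_rec

W_in, W_il, W_rec = init_deep_reservoir(1, 30, 3, omega_in=1.0, omega_il=2.0, rho=0.9)
```

Rescaling by the ratio between the desired and the measured spectral radius makes each recurrent matrix hit the target $\rho$ exactly, while $\omega_{in}$ and $\omega_{il}$ simply multiply the uniformly drawn input and inter-layer weights.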
In line with the standard RC methodology, the output of the DeepESN is computed by the readout layer as a linear combination of the reservoir activations. As in this paper we are mainly interested in analyzing the behavior of the DeepESN locally to each layer, we study the output of the model when the readout is fed by the state developed individually by each reservoir. Accordingly, when layer $l$ is under focus, the output at time-step $t$, denoted by $\mathbf{y}^{(l)}(t) \in \mathbb{R}^{N_Y}$, is computed as follows:
(4)  $\mathbf{y}^{(l)}(t) = \mathbf{W}_{out}^{(l)}\,\mathbf{x}^{(l)}(t)$
In eq. 4, $\mathbf{W}_{out}^{(l)} \in \mathbb{R}^{N_Y \times N_R}$ denotes the readout weight matrix, whose values are adjusted on a training set to solve a linear regression problem given by:

(5)  $\min_{\mathbf{W}_{out}^{(l)}} \big\| \mathbf{W}_{out}^{(l)}\,\mathbf{X}^{(l)} - \mathbf{Y}_{tg} \big\|_2^2$
where $\mathbf{X}^{(l)}$ is as defined in eq. 3, and $\mathbf{Y}_{tg} \in \mathbb{R}^{N_Y \times N_T}$ is a matrix that collects the corresponding target signals (in a column-wise fashion). Due to the typically large condition number of the reservoir state matrices, readout training is commonly performed by means of non-iterative methods [17].
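For instance, a common non-iterative choice is ridge regression in closed form; the regularization value below is a hypothetical placeholder, not a setting taken from the paper:

```python
import numpy as np

def train_readout(X, Y_tg, lam=1e-6):
    """Solve eq. 5 in closed form: W_out = Y_tg X^T (X X^T + lam I)^{-1}.
    X is the N_R x N_T state matrix (eq. 3), Y_tg the N_Y x N_T target matrix."""
    N_R = X.shape[0]
    return Y_tg @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(N_R))
```

With `lam = 0` this reduces to the ordinary least-squares (pseudo-inverse) solution; the regularized version is typically preferred precisely because of the large condition numbers mentioned above.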
3 Richness of Deep Reservoir Dynamics
To quantify the richness of reservoir dynamics in DeepESNs, we make use of the following measures.
- Average State Entropy (ASE). From an information-theoretic perspective, the richness of ESN dynamics can be effectively quantified by means of the entropy of the instantaneous reservoir states [18]. Here we employ an efficient estimator of Renyi's quadratic entropy [19, 20], which, for each time-step $t$ and for each layer $l$ in the deep reservoir architecture, can be computed as follows:

(6)  $H\big(\mathbf{x}^{(l)}(t)\big) = -\log \Big( \frac{1}{N_R^2} \sum_{i=1}^{N_R} \sum_{j=1}^{N_R} G_\sigma\big(x_i^{(l)}(t) - x_j^{(l)}(t)\big) \Big)$

where $x_i^{(l)}(t)$ denotes the $i$-th component of $\mathbf{x}^{(l)}(t)$, and $G_\sigma$ is a Gaussian kernel whose size $\sigma$ is obtained by shrinking the standard deviation of the instantaneous reservoir activations, in analogy to [18]. Given an input signal of length $N_T$, we compute the average state entropy (ASE) of layer $l$ as the time-averaged value of Renyi's quadratic entropy in eq. 6:

(7)  $\mathrm{ASE}^{(l)} = \frac{1}{N_T} \sum_{t=1}^{N_T} H\big(\mathbf{x}^{(l)}(t)\big)$

- Uncoupled Dynamics (UD). It is a known fact in the RC literature that reservoir units exhibit behaviors that are strongly coupled with each other [17, 26]. In one of our previous works [8] we experimentally showed that this phenomenon can be understood in terms of the inherent Markovian characterization of ESN dynamics [3, 23]. Essentially, the stronger the contractive characterization of a reservoir (i.e., the smaller the Lipschitz constant of its state transition function), the stronger the observed redundancy of its units' activations, and the poorer the quality of the reservoir dynamics provided to the readout learner. Following this spirit, we propose to evaluate the richness of reservoirs by measuring the actual dimensionality of the (linearly) uncoupled state dynamics. To this aim, here we take a simple approach consisting in computing the number of principal components (i.e., orthogonal directions of variability) that are able to explain most of the variance in the reservoir state space. Specifically, given an input signal of length $N_T$, the $l$-th reservoir layer is driven into a set of states collected into the state matrix $\mathbf{X}^{(l)}$ (see eq. 3). Denoting by $\sigma_1 \geq \ldots \geq \sigma_{N_R}$ the singular values of $\mathbf{X}^{(l)}$ in decreasing order (whose squares correspond to the eigenvalues of the covariance matrix of the reservoir units' activations), the (normalized) relevance of the $i$-th principal component can be computed as follows:

(8)  $r_i^{(l)} = \sigma_i^2 \,\big/\, \sum_{j=1}^{N_R} \sigma_j^2$

Based on this, the uncoupled dynamics (UD) indicator of the $l$-th reservoir is given by:

(9)  $\mathrm{UD}^{(l)} = \min \Big\{ d : \sum_{i=1}^{d} r_i^{(l)} \geq \tau \Big\}$

where $\tau$ ranges in $(0, 1]$ and expresses the desired amount of explained variability, so that $\mathrm{UD}^{(l)}$ is the number of linearly uncoupled directions that explain the fraction $\tau$ of the state-space variability.

- Condition Number ($\kappa$). Another well-known measure of reservoir quality is given by the conditioning of the resulting learning problem for the readout learner (see eq. 5). Conventional ESNs are known to suffer from poor conditioning [17, 15] (i.e., high eigenvalue spread), which (among other downsides) prevents the use in RC contexts of efficient LMS-based learning algorithms employing stochastic gradient descent [12]. In this paper, we study the conditioning of the state representations developed in successive levels of DeepESNs. To this end, for each layer $l$ in the deep reservoir architecture, we compute the condition number of its reservoir state matrix $\mathbf{X}^{(l)}$, as follows:

(10)  $\kappa\big(\mathbf{X}^{(l)}\big) = \sigma_{max}\big(\mathbf{X}^{(l)}\big) \,\big/\, \sigma_{min}\big(\mathbf{X}^{(l)}\big)$

where $\sigma_{max}$ and $\sigma_{min}$ are respectively the largest and the smallest singular values of $\mathbf{X}^{(l)}$; smaller values of $\kappa$ indicate richer reservoirs.
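The three measures can be sketched as follows. The kernel-size shrink factor and the explained-variance threshold `tau` are illustrative placeholders (the exact values are configuration choices of the experiments), and the states are assumed to be collected column-wise as in eq. 3:

```python
import numpy as np

def renyi_quadratic_entropy(x, shrink=0.3):
    """Kernel estimator of Renyi's quadratic entropy (eq. 6) for one instantaneous
    state x (vector of N_R unit activations). The kernel size is a shrunk standard
    deviation of the activations; the shrink factor here is illustrative."""
    sigma = max(shrink * np.std(x), 1e-12)
    d = x[:, None] - x[None, :]
    G = np.exp(-d ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(G.mean())

def ase(X):
    """ASE (eq. 7): time average of eq. 6 over the columns of the state matrix X."""
    return np.mean([renyi_quadratic_entropy(x) for x in X.T])

def uncoupled_dynamics(X, tau=0.9):
    """UD (eq. 9): number of principal components whose cumulated relevance
    (eq. 8) reaches tau; squared singular values of X act as covariance eigenvalues."""
    s2 = np.linalg.svd(X, compute_uv=False) ** 2
    return int(np.searchsorted(np.cumsum(s2 / s2.sum()), tau) + 1)

def condition_number(X):
    """kappa (eq. 10): ratio between the largest and smallest singular values of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return s.max() / s.min()
```

Note that the entropy estimator grows with the spread of the instantaneous activations, while a reservoir whose units all follow one direction of variability collapses to a UD of 1, which is exactly the redundancy phenomenon the measure is meant to expose.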
4 Experiments
In this section we report the outcomes of our experimental analysis. Specifically, details on the considered datasets and experimental settings are given in Section 4.1, whereas numerical results are described in Section 4.2.
4.1 Datasets and Experimental Settings
We considered two well-known benchmark datasets in the RC area, both involving univariate time-series (i.e., $N_U = N_Y = 1$). Although the major focus of our analysis is on the evaluation of the deep reservoir dynamics excited by the input (irrespective of the target output values), the datasets are also taken into account for the definition of regression tasks.
The first dataset is related to the prediction of a 10th-order nonlinear autoregressive moving average (NARMA) system, where at each time-step $t$ the input $u(t)$ is randomly sampled from a uniform distribution in $[0, 0.5]$, and the target is given by the following equation:

(11)  $y_{tg}(t) = 0.3\, y_{tg}(t-1) + 0.05\, y_{tg}(t-1) \sum_{i=1}^{10} y_{tg}(t-i) + 1.5\, u(t-10)\, u(t-1) + 0.1$
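Such a target signal can be generated directly from the input sequence; the sketch below assumes the standard form of the 10th-order NARMA recurrence commonly used in the RC literature:

```python
import numpy as np

def narma10(u):
    """Generate the 10th-order NARMA target from the input sequence u,
    following the standard NARMA-10 recurrence of the RC literature (eq. 11)."""
    T = len(u)
    y = np.zeros(T)
    for t in range(10, T):
        y[t] = (0.3 * y[t - 1]
                + 0.05 * y[t - 1] * y[t - 10:t].sum()  # sum over y(t-1), ..., y(t-10)
                + 1.5 * u[t - 10] * u[t - 1]
                + 0.1)
    return y

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 0.5, 200)   # input uniform in [0, 0.5]
y_tg = narma10(u)
```

The long memory of the task comes from the sum over the last ten outputs and the $u(t-10)\,u(t-1)$ cross-term, which is what makes NARMA a standard probe of reservoir memory.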
The second dataset is the Santa Fe Laser dataset [25], consisting of a time-series of sampled intensities from a far-infrared laser in a chaotic regime. The dataset enables the definition of a next-step prediction task, where $y_{tg}(t) = u(t+1)$ for each time-step $t$. As a minimal pre-processing step, the original values in the Laser dataset were scaled down by a constant factor.
For both the NARMA and the Laser datasets, we considered sequences of fixed length to assess the richness of the reservoir dynamics. The same data was used as the training set in the regression experiments, where the continuation of the respective temporal sequences was considered as the test set. In all the experiments, the initial time-steps were used as a transient to wash out the initial conditions from the state dynamics.
In our experiments, we considered DeepESNs with an increasing number of layers $N_L$. All reservoir layers contained the same number of fully connected recurrent units, and shared the same values of the spectral radius and inter-layer scaling (i.e., $\rho^{(l)} = \rho$ and a common $\omega_{il}$ for all layers). For every DeepESN hyper-parameterization, results were averaged (and standard deviations computed) over a number of network realizations with the same hyper-parameter values but different random initializations.
We focused our experimental analysis on the behavior of the reservoir dynamics in progressively higher layers. As such, all the measures of state richness detailed in Section 3 (as well as the predictive performance) were computed on a layer-wise basis, i.e. individually for each layer in the deep architecture.
4.2 Results
Fig. 2 shows the quality measures of the DeepESN dynamics, computed in correspondence of progressively higher layers of the architecture. In particular, results correspond to DeepESNs whose reservoir layers are hyper-parameterized by a fixed input scaling $\omega_{in}$ and spectral radius $\rho$ (set to values of common use in ESN practice). Results are shown for $\omega_{il}$ values of 2, 1, and 0.5, as representatives of the cases of strong, medium and weak inter-layer connectivity strength, respectively.
Results indicate that when strong inter-layer connections are used, the quality of state representations in DeepESNs improves significantly in progressively higher layers. In this case ($\omega_{il} = 2$ in our experiments) we can appreciate a progressive increase of both the state entropy (ASE, first row in Fig. 2) and the number of relevant directions of state variability (UD, second row in Fig. 2), corresponding to a substantial decrease of the condition number ($\kappa$, third row in Fig. 2). With increasing layer number we can also observe a saturating trend in the improvement of both the entropy and the condition number. Interestingly, from Fig. 2 we can note that a marked enrichment of the reservoir dynamics is generally observed only for strong connections between consecutive layers. For medium values of $\omega_{il}$ the reservoir quality improves only slightly or remains substantially unchanged ($\omega_{il} = 1$ in our experiments), and for small values it eventually gets worse ($\omega_{il} = 0.5$ in our experiments). Overall, our results point out the major role played by the inter-layer scaling parameter $\omega_{il}$ in determining the trend of improvement or worsening of reservoir quality in deeper network settings. In this sense, the impact of other RC parameters, such as the spectral radius $\rho$ and the input scaling $\omega_{in}$, turned out to be much less relevant and is not shown here for brevity.^{3}

^{3} Changing the value of $\rho$ resulted in globally scaling the achieved results for every layer (up or down for larger or smaller values, respectively). Changing the value of $\omega_{in}$ affected only the results in the first layer (scaling them up or down for larger or smaller values, respectively). In any case, changes in the values of $\rho$ and $\omega_{in}$ did not affect the quality of the results in Fig. 2 at the increase of the network's depth.
From the perspective of RC training algorithms, an interesting consequence of the decrease of the condition number in higher reservoir layers (third row in Fig. 2) is that it potentially makes the application of computationally cheap stochastic gradient descent algorithms more suitable. To start exploring this possibility, we performed a further set of experiments, training the readout component fed by the reservoir states at individual layers, applying a basic LMS algorithm with a fixed learning rate and number of epochs. Note that our aim was not to optimize the training algorithm to achieve the highest possible performance on the specific tasks; rather, we focused on qualitatively analyzing the LMS performance when progressively higher reservoir layers are considered. Moreover, notice that when considering progressively more layers, the cost of readout training remains constant (as the input to the readout has the same dimensionality in all cases), and the only extra cost is that of the state computation in the lower reservoir layers. Under this perspective, the operation of the lower layers in the deep recurrent architecture can be understood as a composition of filters that progressively pre-process the data for the final reservoir level. The MSE achieved on the test sets of the two considered datasets is shown in Fig. 3 (under the same settings considered in Fig. 2).
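A basic LMS readout update of this kind can be sketched as follows; the learning rate and epoch count are illustrative defaults, not the settings used in the experiments:

```python
import numpy as np

def lms_train(X, y, lr=0.05, epochs=100, seed=0):
    """Train a linear readout w on states X (N_R x N_T) and scalar targets y
    with per-time-step stochastic gradient descent (least mean squares)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[0])
    for _ in range(epochs):
        for t in rng.permutation(X.shape[1]):  # visit time-steps in random order
            err = y[t] - w @ X[:, t]           # instantaneous prediction error
            w += lr * err * X[:, t]            # LMS weight update
    return w
```

LMS converges quickly only when the input correlation matrix is well conditioned, which is precisely why the layer-wise drop in $\kappa$ reported above matters for its applicability.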
Results in Fig. 3 nicely agree with those in Fig. 2, further indicating the relevant role played by the inter-layer scaling also in terms of LMS effectiveness at the increase of the network's depth. In particular, for medium or low inter-layer strength ($\omega_{il} = 1$ and $\omega_{il} = 0.5$) we see that the error stays essentially constant or even severely increases, while stronger connections among layers ($\omega_{il} = 2$) result in a progressive error drop. As a side observation, on the Laser dataset we can notice that the highest performance is achieved in correspondence of the 3rd reservoir layer, after which the performance starts degrading slightly (a trend confirmed also on the training set, not reported in Fig. 3 for the sake of conciseness). This behavior reflects the fact that the performance on a supervised learning task clearly depends on the task characterization in a broad sense (including the target properties), and it is in line with the observations already made in [1, 11] in relation to the memorization skills of reservoirs.

5 Conclusions
In this paper we have studied the richness of reservoir representations in DeepESN architectures. Our empirical analysis, conducted on benchmark datasets in the RC area, pointed out the key role played by the strength of connections between the successive layers of recurrent units. Our major finding is that a strong interlayer connectivity is required in order to determine a progressive enrichment of the state dynamics as the network’s depth is increased. This outcome is interestingly related to recent results in the context of reservoirs composed of spiking neurons
[28], which showed the importance of inter-reservoir connections to properly propagate information across the levels of a deep recurrent architecture.

Our empirical results indicate that, in the presence of strong inter-layer connections, reservoirs in higher layers (i.e., further away from the input) are able to develop internal representations featuring increasing entropy and higher intrinsic (linearly uncoupled) dimensionality, at the same time leading to a decrease in the condition number that characterizes the resulting state matrices. Interestingly, this latter observation is paired with an improved effectiveness of LMS for training the readout, which mitigates the widely known issue with the applicability of gradient descent algorithms in the context of RC.
While already revealing of the potential of deep architectures in the RC context, the work presented in this paper paves the way for promising future research developments. In this regard, we find it particularly interesting to extend the application of the analysis tools delineated here to drive the architectural design of DeepESNs in challenging real-world tasks. A starting point in this sense might be a fruitful exploitation of the saturation effects shown for the quality of reservoirs in deeper network settings. Besides, the insights on the improved amenability of LMS in DeepESNs give a further stimulus to investigate the use of efficient state-of-the-art training algorithms based on stochastic gradient descent in the RC field.
References
 [1] Boedecker, J., Obst, O., Lizier, J.T., Mayer, N.M., Asada, M.: Information processing in echo state networks at the edge of chaos. Theory in Biosciences 131(3), 205–213 (2012)

[2]
Gallicchio, C., Martin-Guerrero, J., Micheli, A., Soria-Olivas, E.: Randomized machine learning approaches: Recent developments and challenges. In: 25th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2017). pp. 77–86. i6doc.com publication (2017)
 [3] Gallicchio, C., Micheli, A.: Architectural and markovian factors of echo state networks. Neural Networks 24(5), 440–456 (2011)
 [4] Gallicchio, C., Micheli, A.: Echo state property of deep reservoir computing networks. Cognitive Computation 9(3), 337–350 (2017)
 [5] Gallicchio, C., Micheli, A., Pedrelli, L.: Deep reservoir computing: A critical experimental analysis. Neurocomputing 268, 87–99 (2017)
 [6] Gallicchio, C., Micheli, A., Pedrelli, L.: Comparison between DeepESNs and gated RNNs on multivariate time-series prediction. In: 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2019). i6doc.com publication (2019)
 [7] Gallicchio, C., Micheli, A., Tiňo, P.: Randomized recurrent neural networks. In: 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2018). pp. 415–424. i6doc.com publication (2018)
 [8] Gallicchio, C., Micheli, A.: A Markovian characterization of redundancy in echo state networks by PCA. In: Proc. of the 18th European Symposium on Artificial Neural Networks (ESANN). d-side publi. (2010)
 [9] Gallicchio, C., Micheli, A.: Deep echo state network (deepesn): A brief survey. arXiv preprint arXiv:1712.04323 (2017)
 [10] Gallicchio, C., Micheli, A., Pedrelli, L.: Design of deep echo state networks. Neural Networks 108, 33–47 (2018)
 [11] Gallicchio, C., Micheli, A., Silvestri, L.: Local lyapunov exponents of deep echo state networks. Neurocomputing 298, 34–45 (2018)
 [12] Haykin, S.: Neural networks and learning machines. Pearson, 3 edn. (2009)
 [13] Jaeger, H.: The ”echo state” approach to analysing and training recurrent neural networks  with an erratum note. Tech. rep., GMD  German National Research Institute for Computer Science (2001)
 [14] Jaeger, H., Haas, H.: Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science 304(5667), 78–80 (2004)
 [15] Jaeger, H.: Reservoir riddles: Suggestions for echo state network research. In: Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN). vol. 3, pp. 1460–1462. IEEE (2005)
 [16] LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
 [17] Lukoševičius, M., Jaeger, H.: Reservoir computing approaches to recurrent neural network training. Computer Science Review 3(3), 127–149 (2009)
 [18] Ozturk, M.C., Xu, D., Príncipe, J.C.: Analysis and design of echo state networks. Neural computation 19(1), 111–138 (2007)
 [19] Principe, J.C.: Information theoretic learning: Renyi’s entropy and kernel perspectives. Springer Science & Business Media (2010)
 [20] Principe, J.C., Xu, D., Fisher, J.: Information theoretic learning. Unsupervised adaptive filtering 1, 265–319 (2000)
 [21] Scardapane, S., Wang, D.: Randomness in neural networks: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 7(2), e1200 (2017)
 [22] Schmidhuber, J.: Deep learning in neural networks: An overview. Neural Networks 61, 85–117 (2015)
 [23] Tiňo, P., Hammer, B., Bodén, M.: Markovian bias of neuralbased architectures with feedback connections. In: Perspectives of neuralsymbolic integration, pp. 95–133. Springer (2007)
 [24] Verstraeten, D., Schrauwen, B., d’Haene, M., Stroobandt, D.: An experimental unification of reservoir computing methods. Neural networks 20(3), 391–403 (2007)
 [25] Weigend, A.S.: Time series prediction: forecasting the future and understanding the past. Routledge (2018)
 [26] Xue, Y., Yang, L., Haykin, S.: Decoupled echo state networks with lateral inhibition. Neural Networks 20(3), 365–376 (2007)
 [27] Yildiz, I., Jaeger, H., Kiebel, S.: Revisiting the echo state property. Neural networks 35, 1–9 (2012)
 [28] Zajzon, B., Duarte, R., Morrison, A.: Transferring state representations in hierarchical spiking neural networks. In: Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN). pp. 1785–1793. IEEE (2018)
 [29] Zhang, L., Suganthan, P.N.: A survey of randomized algorithms for training neural networks. Information Sciences 364, 146–155 (2016)