1 Abstract
Neuro-inspired recurrent neural network algorithms, such as echo state networks, are computationally lightweight and thereby map well onto untethered devices. Baseline echo state network algorithms have been shown to be efficient in solving small-scale spatiotemporal problems. However, they underperform on complex tasks that are characterized by multiscale structures. In this research, an intrinsic plasticity-infused modular deep echo state network architecture is proposed to solve complex and multiple-timescale temporal tasks. It outperforms the state of the art on time series prediction tasks.
Keywords: echo state networks (ESN); intrinsic plasticity; time series prediction; reservoir computing (RC)
2 Introduction
Echo state networks (ESNs) are recurrent rate-based networks that are efficient in solving spatiotemporal tasks (Jaeger et al., 2007). Studies of the associated spiking models have shown that the recurrent layer (a.k.a. the reservoir layer) corresponds to the granular cells in the cerebellum and the readout layer corresponds to the Purkinje cells (Yamazaki & Tanaka, 2007). These models are attractive because they offer lightweight, resilient, and conceptually simple networks with rapid training times, as training occurs only at the readout layer. The recurrent nature of these models introduces feedback signals that act as an innate form of fading memory.
In a recent resurgence of efforts to enhance the capabilities of ESNs, a few research groups have studied deep ESNs. The premise of adding depth to an ESN is to support hierarchical representations of the temporal input and to capture the multiscale dynamics of the input features. In prior literature, ESNs have been applied to several temporal tasks, such as speech processing, EEG classification, and anomaly detection (Soures et al., 2017; Jaeger et al., 2007). The Deep-ESN architecture proposed by Ma et al. (2017) consists of stacked reservoir layers and unsupervised encoders to efficiently exploit the temporal kernel property of each reservoir. In Gallicchio et al. (2017), multiple deep ESN architectures based on the shallow Leaky-Integrator ESN of Jaeger et al. (2007) are analyzed. Of these architectures, the DeepESN network achieved the best results on memory capacity experiments, both with and without intrinsic plasticity.

Previous works have shown that increasing the depth of reservoir networks is necessary to capture the multiscale dynamics of time series data as well as to extract features of a higher order of complexity (Gallicchio et al., 2017). However, extracting richer features from the data with a wider architecture requires greater reservoir real estate, as each reservoir learns distinct local features. Therefore, we propose a modular deep ESN architecture, known as Mod-DeepESN, which utilizes multiple reservoirs with heterogeneous topology and connectivity to capture and integrate the multiscale dynamical states of temporal data. The proposed architecture is studied on a mix of benchmarks and consistently performs well across different topologies when compared to the baselines.
3 Proposed Design
3.1 Architecture
The proposed Mod-DeepESN architecture consists of modular and deep reservoirs that can be realized in multiple topologies and connectivities. Specifically, four topologies are explored in this study: (i) Wide (shown in Figure 1), (ii) Layered, (iii) 2×2 Criss-Cross, and (iv) Wide+Layered. The reservoirs that are connected to the observed input receive the vector $u(t) \in \mathbb{R}^{N_u}$ at time $t$, which is projected via the input weight matrix $W_{in}^{(\ell)} \in \mathbb{R}^{N \times N_u}$ into each reservoir layer, where $[V]_{\ell,0}$ and $[V]_{\ell,m}$ are Iverson brackets. $N_u$ is the total number of inputs, $N$ is the fixed number of neurons within each reservoir, and lastly $N_r$ is the total number of reservoirs. $V$ is the binary and triangular connectivity matrix that determines the feedforward connections between the reservoirs and the input $u(t)$. For example, a '1' at position $[V]_{2,0}$ indicates a feedforward connection from the input to the second reservoir. The output of the $\ell$-th reservoir, $x^{(\ell)}(t)$, is computed using (1)

$$x^{(\ell)}(t) = \left(1 - a^{(\ell)}\right) x^{(\ell)}(t-1) + a^{(\ell)} \tilde{f}^{(\ell)}(t), \qquad (1)$$
where $\tilde{f}^{(\ell)}(t)$ is computed using (2).

$$\tilde{f}^{(\ell)}(t) = \tanh\!\left( [V]_{\ell,0}\, W_{in}^{(\ell)} u(t) + \sum_{m=1}^{\ell-1} [V]_{\ell,m}\, W^{(\ell,m)} x^{(m)}(t) + \hat{W}^{(\ell)} x^{(\ell)}(t-1) \right). \qquad (2)$$
Equation (1) is also referred to as the state transition function in the literature. $W^{(\ell,m)}$ is a feedforward weight matrix that connects two reservoirs in the Layered, 2×2 Criss-Cross, and Wide+Layered topologies, while $\hat{W}^{(\ell)}$ is a recurrent weight matrix that connects intra-reservoir neurons. The per-layer leaky parameter $a^{(\ell)} \in (0, 1]$ controls the leakage rate in a moving exponential average manner. The nonlinear activation function for the recurrent neurons in this work is the hyperbolic tangent. Note that the bias vectors are left out of the formulation for simplicity. The state of the Mod-DeepESN network is defined as the concatenation of the output of each reservoir, i.e. $x(t) = \left[ x^{(1)}(t); x^{(2)}(t); \ldots; x^{(N_r)}(t) \right] \in \mathbb{R}^{N_r N}$. The matrix of all states is denoted by $X \in \mathbb{R}^{N_r N \times T}$, where $T$ is the number of time steps in the time series. Finally, the output of the network for the duration $T$ is computed as a linear combination of the Mod-DeepESN state matrix using (3).

$$\hat{Y} = W_{out} X. \qquad (3)$$
The matrix $W_{out} \in \mathbb{R}^{N_y \times N_r N}$ contains the feedforward weights between the reservoir neurons and the $N_y$ output neurons. Ridge regression using the Moore-Penrose pseudo-inverse is used to solve for optimal $W_{out}$, shown in (4).

$$W_{out} = Y X^{\top} \left( X X^{\top} + \beta I \right)^{-1}, \qquad (4)$$

where $Y$ is the matrix of training targets. The formulation includes a regularization term $\beta$, and $I$ is the identity matrix. This computation is performed only during the training phase of the model.
3.2 Mod-DeepESN Topologies
For all of the proposed topologies, (i) Wide, (ii) Layered, (iii) 2×2 Criss-Cross, and (iv) Wide+Layered, the input layer is directly connected to the output layer; this connection captures sensitive spatial features and was found experimentally to improve network performance. Additionally, each reservoir is connected to the output layer.
Figure 1 illustrates the Wide topology. Distinct connections are created between the input layer and each reservoir, i.e. the input weights are not shared between reservoirs. As a result, an ensemble of shallow ESNs emerges, each of which captures varying dynamics of the input data locally.
The Layered topology, shown in Figure 2(a), presents the input to the first reservoir with each successive reservoir receiving feedforward input from its predecessor. By stacking reservoirs in this fashion, stateful dynamics of the system are integrated with features that increase in complexity with depth.
Figure 2(b) depicts the 2×2 Criss-Cross topology, which comprises a total of four (2 × 2) reservoirs. This model exhibits dense feedforward connectivity between all reservoirs but maintains reservoir-local recurrent connections. The system states are hierarchically integrated at multiple depths from disparate inputs.
The Wide+Layered topology, presented in Figure 2(c), deviates from the 2×2 Criss-Cross topology through greater sparsity in its feedforward connectivity; dynamical states of the input are integrated along discrete pathways that resemble the Layered topology.
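One plausible encoding of the connectivity matrix $V$ for these topologies is sketched below. The exact indexing convention (rows as destination reservoirs, column 0 flagging the input) is an assumption for illustration.

```python
import numpy as np

def connectivity(topology: str, n_r: int) -> np.ndarray:
    """Binary connectivity matrix V: row l is the l-th reservoir;
    column 0 flags an input connection, and column m >= 1 flags a
    feedforward connection from reservoir m."""
    V = np.zeros((n_r, n_r + 1), dtype=int)
    if topology == "wide":
        V[:, 0] = 1                 # every reservoir receives the input
    elif topology == "layered":
        V[0, 0] = 1                 # only the first reservoir sees the input
        for l in range(1, n_r):
            V[l, l] = 1             # reservoir l receives from reservoir l-1
    else:
        raise ValueError(topology)
    return V

print(connectivity("layered", 3))
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]]
```

The Criss-Cross and Wide+Layered variants would fill additional sub-diagonal entries in the same triangular pattern.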
3.3 Weight Initialization
As weights in ESN reservoirs are fixed, recurrent weights must be initialized to maintain stable reservoir states, especially in deep ESNs where inputs to a network propagate through multiple reservoirs. As mathematically supported in the ESN literature, deep ESNs need to satisfy the Echo State Property for such stability. Accordingly, each $\hat{W}^{(\ell)}$ is initialized using a uniform distribution and scaled such that (5) is satisfied.

$$\rho\!\left(\hat{W}^{(\ell)}\right) < 1. \qquad (5)$$

The spectral radius of the matrix, i.e. the magnitude of its largest eigenvalue, is computed by the function $\rho(\cdot)$. We substitute a hyperparameter $\rho^{*}$ for the value 1 to fine-tune model reservoirs for each experiment. The remaining weight matrices are initialized from a uniform distribution and scaled such that their Euclidean ($\ell_2$) norm is equivalent to either $\sigma_{in}$ or $\sigma_{fb}$, i.e. $\|W_{in}^{(\ell)}\|_2 = \sigma_{in}$ and $\|W^{(\ell,m)}\|_2 = \sigma_{fb}$. $\sigma_{in}$ and $\sigma_{fb}$ are tunable hyperparameters.
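A minimal sketch of this recurrent-weight initialization, assuming a uniform draw in $[-1, 1]$ and a target spectral radius in place of the bound of 1:

```python
import numpy as np

def init_recurrent(n: int, rho_star: float = 0.9, sparsity: float = 0.8,
                   seed: int = 0) -> np.ndarray:
    """Draw a recurrent weight matrix uniformly, sparsify it, and rescale
    it so its spectral radius equals rho_star (< 1 for the Echo State
    Property)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, (n, n))
    W *= rng.random((n, n)) >= sparsity          # nullify weights w.p. sparsity
    rho = np.max(np.abs(np.linalg.eigvals(W)))   # magnitude of largest eigenvalue
    return W * (rho_star / rho)

W_hat = init_recurrent(200, rho_star=0.9)
```

Rescaling by `rho_star / rho` makes the spectral radius of the returned matrix exactly `rho_star`, since eigenvalues scale linearly with the matrix.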
Rather than normalizing weights using the $\ell_2$ norm or spectral radius, Xavier initialization (Glorot & Bengio, 2010) can be used to determine the distribution of weights. In this method, any set of weights $W$ can be drawn from a normal distribution as shown by (6) and (7).

$$W \sim \mathcal{N}\!\left(\mu_W, \sigma_W\right), \qquad (6)$$

$$\mu_W = 0, \qquad \sigma_W = \sqrt{\frac{2}{n_{in} + n_{out}}}. \qquad (7)$$

The distribution is parameterized with a mean of $\mu_W$ and a standard deviation of $\sigma_W$, where $n_{in}$ is the number of inputs and $n_{out}$ is the number of outputs that the weight matrix connects. Additionally, this initialization can be made modular to isolate which sets of weights utilize a Xavier initialization and which use either $\ell_2$ or spectral radius normalization. All modes of initialization additionally use a sparsity hyperparameter that determines the probability that each weight is nullified. Specifically, $p_{in}$ determines sparsity for each $W_{in}^{(\ell)}$, $p_{fb}$ for each $W^{(\ell,m)}$, and $p_{res}$ for each $\hat{W}^{(\ell)}$.

3.4 Intrinsic Plasticity (IP)
In this section, we study how a neurally inspired IP mechanism can enhance the performance of the Mod-DeepESN. Originally proposed by Schrauwen et al. (2008), the IP rule introduces a gain and a bias term to the $\tanh$ activation function: $y = \tanh(a x + b)$, where $a$ is the gain and $b$ is the bias. The update rules are given by (8) and (9),

$$\Delta b = -\eta \left( -\frac{\mu}{\sigma^{2}} + \frac{y}{\sigma^{2}} \left( 2\sigma^{2} + 1 - y^{2} + \mu y \right) \right), \qquad (8)$$

$$\Delta a = \frac{\eta}{a} + \Delta b \, x, \qquad (9)$$

where $x$ is the net weighted sum of inputs to a neuron, $y$ is the activation of $x$, i.e. $y = \tanh(a x + b)$, and $\eta$ is the learning rate. The hyperparameters $\sigma$ and $\mu$ are the standard deviation and mean, respectively, of the target Gaussian distribution. In a pre-training phase, the learned parameters are initialized as $a = 1$ and $b = 0$, and are updated iteratively according to (8) and (9). The application of these updates minimizes the Kullback-Leibler (KL) divergence between the empirical output distribution and the target Gaussian distribution (Schrauwen et al., 2008).

3.5 Genetic Algorithm
To fine-tune network hyperparameters, a genetic algorithm is employed (Bäck et al., 2000). The evolution is executed with a population size of 50 individuals for 50 generations, with fitness evaluated in tournaments comprising 3 randomly selected individuals. Individuals mate with a crossover probability of 50% and mutate with a probability of 10% to form each successive generation. This evolution is used to tune all model hyperparameters as well as the Mod-DeepESN topology type.

4 Experiments & Results
Two datasets, the chaotic Mackey-Glass time series generated using the fourth-order Runge-Kutta method (RK4) and a daily minimum temperature series (Time Series Data Library, 1981–1990), are used to evaluate the Mod-DeepESN, as they present variable nonlinearity at multiple time scales. To analyze the effectiveness of the Mod-DeepESN model, root mean squared error (RMSE), normalized RMSE (NRMSE), and mean absolute percentage error (MAPE) are computed as shown in (10), (11), and (12), respectively.

$$\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( y(t) - \hat{y}(t) \right)^{2}}, \qquad (10)$$

$$\mathrm{NRMSE} = \sqrt{\frac{\sum_{t=1}^{T} \left( y(t) - \hat{y}(t) \right)^{2}}{\sum_{t=1}^{T} \left( y(t) - \bar{y} \right)^{2}}}, \qquad (11)$$

$$\mathrm{MAPE} = \frac{1}{T} \sum_{t=1}^{T} \left| \frac{y(t) - \hat{y}(t)}{y(t)} \right|. \qquad (12)$$

$\hat{y}(t)$ is the predicted time series value at time step $t$, $y(t)$ is the true value, and $\bar{y}$ is the average value of the time series over the $T$ time steps. Each reported result is averaged over 10 runs using the best hyperparameters found by the genetic algorithm search.
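The three error measures in (10)-(12) follow directly from their definitions; the sketch below is a straightforward transcription, not the authors' evaluation code.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error, eq. (10)."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def nrmse(y, y_hat):
    """RMSE normalized by the deviation of y from its mean, eq. (11)."""
    return np.sqrt(np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2))

def mape(y, y_hat):
    """Mean absolute percentage error, eq. (12)."""
    return np.mean(np.abs((y - y_hat) / y))

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = y + 0.1
print(rmse(y, y_hat))  # ~0.1
```

Note that MAPE is undefined wherever $y(t) = 0$, which is not an issue for the two strictly positive series used here.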
The Mackey-Glass dataset, shown in Figure 3(a), is split into 8,000 training samples and 2,000 testing samples to forecast 84 time steps in advance. To reduce the influence of initial reservoir states, the first 100 predictions from the network are discarded as a washout period.
The daily minimum temperature dataset, shown in Figure 3(b), comprises data from Melbourne, Australia (1981–1990), and is split into 2,920 training samples and 730 testing samples. A washout period of 30 steps is applied during model evaluation for the single-step-ahead prediction.
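Scoring after a washout amounts to dropping the first predictions before computing the error. A minimal sketch, with an assumed prediction array rather than real model output:

```python
import numpy as np

def score_after_washout(y, y_hat, washout):
    """Discard the first `washout` predictions before computing RMSE."""
    y, y_hat = y[washout:], y_hat[washout:]
    return np.sqrt(np.mean((y - y_hat) ** 2))

y = np.ones(2000)
# Hypothetical predictions with a poor initial transient
y_hat = np.concatenate([np.full(100, 5.0), np.ones(1900)])
print(score_after_washout(y, y_hat, washout=100))  # 0.0
```

Without the washout, the transient from the zero-initialized reservoir states would dominate the reported error.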
Table 1: Prediction errors on the Mackey-Glass dataset (all error values ×10⁻³).

| Method | N_r | IP | RMSE | NRMSE | MAPE |
|---|---|---|---|---|---|
| Baseline: ESN¹ | 1 | – | 43.7 | 201 | 7.03 |
| ESN² | 2 | – | 8.60 | 39.6 | 1.00 |
| RSP³ | 2 | – | 27.2 | 125 | 1.00 |
| MESM⁴ | 7 | – | 12.7 | 58.6 | 1.91 |
| Deep-ESN⁵ | 3 | – | 1.12 | 5.17 | .151 |
| This Work: Wide | 3 | N | 20.2 | 48.7 | 8.11 |
| Layered | 3 | N | 57.8 | 96.2 | 19.9 |
| Criss-Cross | 4 (2) | N | 56.1 | 54.4 | 8.94 |
| Wide+Layered | 6 (3) | N | 41.1 | 55.4 | 11.2 |
| Wide | 3 | Y | 7.22 | 27.5 | 5.55 |

The This Work models share tuned hyperparameters: 256 neurons per reservoir, regularization β = 2e-8, and scaling/leak values of .6, .1, .1, .1, and .7.
Table 2: Prediction errors on the daily minimum temperature dataset (all error values ×10⁻³).

| Method | N_r | IP | RMSE | NRMSE | MAPE |
|---|---|---|---|---|---|
| Baseline: ESN¹ | 1 | – | 501 | 139 | 39.5 |
| ESN² | 2 | – | 493 | 141 | 39.6 |
| RSP³ | 2 | – | 495 | 137 | 39.3 |
| MESM⁴ | 7 | – | 478 | 136 | 37.7 |
| Deep-ESN⁵ | 2 | – | 473 | 135 | 37.0 |
| This Work: Wide | 2 | N | 473 | 135 | 38.6 |
| Layered | 2 | N | 470 | 134 | 38.2 |
| Criss-Cross | 4 (2) | N | 472 | 135 | 38.7 |
| Wide+Layered | 4 (2) | N | 471 | 135 | 38.6 |
| Wide+Layered | 4 (2) | Y | 459 | 132 | 37.1 |

The This Work models share tuned hyperparameters: 1024 neurons per reservoir, regularization β = 7e-4, and scaling/leak values of 1, .4, .6, .3, and .6.
In each experiment, the number of neurons per reservoir is set to the same value across reservoirs to reduce the hyperparameter search space; we denote this value as $N$ in Tables 1 and 2. Additionally, a parenthesized value in the $N_r$ column indicates the reservoir width of the 2×2 Criss-Cross and Wide+Layered topologies. If Xavier initialization is used in place of initialization by spectral radius or $\ell_2$ norm, this is reported in place of the corresponding scaling value. Hyperparameters are chosen independently of the topology, as the resulting performance consistently surpasses that of the best hyperparameters found per topology. Each topology is evaluated with and without IP to determine its efficacy, but only the best result using IP is reported. Baseline ESN results for both datasets are retrieved from Ma et al. (2017). The Mod-DeepESN implementation makes use of the scikit-learn (Pedregosa et al., 2011), SciPy (Jones et al., 2001), NumPy (Oliphant, 2006), and DEAP (Fortin et al., 2012) software libraries.
The Mackey-Glass experimental results demonstrate that model fitness depends on the Mod-DeepESN topology. An IP pre-training phase greatly reduces the network error and performs best when evaluated with the Wide topology. The proposed architecture outperforms all baseline models with the exception of the Deep-ESN (Ma et al., 2017).
In the daily minimum temperature experiment, the Wide+Layered Mod-DeepESN outperforms every baseline model and further reduces the error with the inclusion of an IP pre-training phase. Varying the topology within this experiment is less impactful than within the Mackey-Glass experiment, but deeper models consistently yield better performance.
It is important to note that the best Mod-DeepESN models use Xavier initialization for multiple weight matrices. While this does not necessarily suggest that spectral radius or $\ell_2$ normalization are inferior initialization methods, it does indicate that a desirable initialization can be achieved using a non-parameterized Gaussian distribution, which reduces the hyperparameter search space. Furthermore, the best-performing Mod-DeepESN models, Wide and Wide+Layered, incorporate distinct, non-coinciding channels from the input layer to the output layer, and consistently outperform the other topologies.
5 Conclusions
In this work, we have shown the efficacy of a deep and modular echo state network architecture with varying topologies. Prediction error on multiple datasets is reduced by isolating the integration of dynamical input states to disparate computational pathways, and by incorporating IP in a pre-training phase. Xavier initialization of weights decreases the complexity of hyperparameter tuning and performs consistently well across all proposed topologies. By combining these mechanisms with a genetic algorithm, the Mod-DeepESN outperforms several baseline models on nontrivial time series forecasting tasks.
6 Acknowledgments
The authors would like to thank the members of the RIT Neuromorphic AI Lab for their valuable feedback on this work.
References

Bäck, T., Fogel, D.B., & Michalewicz, Z. (2000). Evolutionary computation 1: Basic algorithms and operators (Vol. 1). CRC Press.

Butcher, J.B., Verstraeten, D., Schrauwen, B., Day, C., & Haycock, P. (2013). Reservoir computing and extreme learning machines for non-linear time-series data analysis. Neural Networks, 38, 76–89.

Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A., Parizeau, M., & Gagné, C. (2012, July). DEAP: Evolutionary algorithms made easy. Journal of Machine Learning Research, 13, 2171–2175.

Gallicchio, C., & Micheli, A. (2011). Architectural and Markovian factors of echo state networks. Neural Networks, 24(5), 440–456.

Gallicchio, C., Micheli, A., & Pedrelli, L. (2017). Deep reservoir computing: A critical experimental analysis. Neurocomputing, 268, 87–99.

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 249–256).

Jaeger, H., Lukoševičius, M., Popovici, D., & Siewert, U. (2007). Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20(3), 335–352.

Jones, E., Oliphant, T., Peterson, P., et al. (2001). SciPy: Open source scientific tools for Python. http://www.scipy.org/

Ma, Q., Shen, L., & Cottrell, G.W. (2017). Deep-ESN: A multiple projection-encoding hierarchical reservoir computing framework. arXiv preprint arXiv:1711.05255.

Malik, Z., Hussain, A., & Wu, Q.M.J. (2016, June). Multilayered echo state machine: A novel architecture and algorithm. 47(1), 14.

Oliphant, T.E. (2006). A guide to NumPy (Vol. 1). Trelgol Publishing.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Schrauwen, B., Wardermann, M., Verstraeten, D., Steil, J.J., & Stroobandt, D. (2008). Improving reservoirs using intrinsic plasticity. Neurocomputing, 71(7–9), 1159–1171.

Soures, N., Hays, L., & Kudithipudi, D. (2017, May). Robustness of a memristor based liquid state machine. In 2017 International Joint Conference on Neural Networks (IJCNN) (pp. 2414–2420).

Time Series Data Library. (1981–1990). Daily minimum temperatures in Melbourne, Australia. https://datamarket.com/data/set/2324/daily-minimum-temperatures-in-melbourne-australia-1981-1990

Yamazaki, T., & Tanaka, S. (2007). The cerebellum as a liquid state machine. Neural Networks, 20(3), 290–297.