Neuro-inspired recurrent neural network algorithms, such as echo state networks, are computationally lightweight and thereby map well onto untethered devices. The baseline echo state network algorithms are shown to be efficient in solving small-scale spatio-temporal problems. However, they underperform on complex tasks that are characterized by multi-scale structures. In this research, an intrinsic plasticity-infused modular deep echo state network architecture is proposed to solve complex and multiple timescale temporal tasks. It outperforms several state-of-the-art models on time series prediction tasks.
Keywords: echo state networks (ESN); intrinsic plasticity; time series prediction; reservoir computing (RC)
Echo state networks (ESNs) are recurrent rate-based networks that are efficient in solving spatio-temporal tasks (Jaeger et al., 2007). Studies of the associated spiking models have shown that the recurrent layer (a.k.a. reservoir layer) corresponds to the granular cells in the cerebellum and the readout layer corresponds to the Purkinje cells (Yamazaki & Tanaka, 2007). These models are attractive because they offer lightweight, resilient, and conceptually simple networks with rapid training time, as training occurs only at the readout layer. The recurrent nature of these models introduces feedback signals similar to an innate form of fading memory.
In a recent resurgence of efforts to enhance the capabilities of ESNs, a few research groups have studied deep ESNs. The premise of adding depth to an ESN is to support hierarchical representations of the temporal input and also capture the multi-scale dynamics of the input features. In prior literature, ESNs have been applied to several temporal tasks, such as speech processing, EEG classification, and anomaly detection (Soures et al., 2017; Jaeger et al., 2007). The Deep-ESN architecture proposed by Ma et al. (2017) consists of stacked reservoir layers and unsupervised encoders to efficiently exploit the temporal kernel property of each reservoir. In Gallicchio et al. (2017), multiple deep ESN architectures based on the shallow Leaky-Integrator ESN of Jaeger et al. (2007) are analyzed. Of these architectures, the DeepESN network achieved the best results with and without utilizing intrinsic plasticity on memory capacity experiments.
Previous works have shown that increasing the depth of reservoir networks is necessary to capture multi-scale dynamics of time series data as well as extract features with a higher order of complexity (Gallicchio et al., 2017). However, extracting richer features from the data also calls for a wider architecture, which provides more reservoir capacity because each reservoir learns distinct local features. Therefore, we propose a modular deep ESN architecture, known as Mod-DeepESN, which utilizes multiple reservoirs with heterogeneous topology and connectivity to capture and integrate the multi-scale dynamical states of temporal data. The proposed architecture is studied on a mix of benchmarks and consistently performs well across different topologies when compared to the baseline.
3 Proposed Design
The proposed Mod-DeepESN architecture consists of modular and deep reservoirs that can be realized in multiple topologies and connectivities. Specifically, four topologies are explored in this study: (i) Wide (shown in Figure 1), (ii) Layered, (iii) 2×2 Criss-Cross, and (iv) Wide+Layered. The reservoirs that are connected to the observed input receive the vector $\mathbf{u}(t) \in \mathbb{R}^{N_U}$ at time $t$, which is projected via the input weight matrix $\mathbf{W}_{in}^{(l)}$ into each reservoir layer $l$, where $[C_{0,l} = 1]$ and $[C_{k,l} = 1]$ are Iverson brackets. $N_U$ is the total number of inputs, $N_R$ is the fixed number of neurons within each reservoir, and lastly $N_L$ is the total number of reservoirs. $\mathbf{C}$ is the binary and triangular connectivity matrix that determines the feedforward connections between reservoirs and the input $\mathbf{u}$, with index 0 denoting the input. For example, a '1' at position $(0, 2)$ indicates a feedforward connection from the input to the second reservoir. The output of the $l$-th reservoir, $\mathbf{x}^{(l)}(t)$, is computed using (1)

$$\mathbf{x}^{(l)}(t) = (1 - a^{(l)})\,\mathbf{x}^{(l)}(t-1) + a^{(l)} \tanh\left(\mathbf{i}^{(l)}(t) + \hat{\mathbf{W}}^{(l)}\mathbf{x}^{(l)}(t-1)\right) \quad (1)$$

where $\mathbf{i}^{(l)}(t)$ is computed using (2).

$$\mathbf{i}^{(l)}(t) = [C_{0,l} = 1]\,\mathbf{W}_{in}^{(l)}\mathbf{u}(t) + \sum_{k=1}^{l-1} [C_{k,l} = 1]\,\mathbf{W}^{(k,l)}\mathbf{x}^{(k)}(t) \quad (2)$$

Equation (1) is also referred to as the state transition function in the literature. $\mathbf{W}^{(k,l)}$ is a feedforward weight matrix that connects two reservoirs in the Layered, 2×2 Criss-Cross, and Wide+Layered topologies, while $\hat{\mathbf{W}}^{(l)}$ is a recurrent weight matrix that connects intra-reservoir neurons. The per-layer leaky parameter $a^{(l)}$ controls the leakage rate in a moving exponential average manner. The non-linear activation function for the recurrent neurons in this work is the hyperbolic tangent. Note that the bias vectors are left out of the formulation for simplicity. The state of the Mod-DeepESN network is defined as the concatenation of the output of each reservoir, i.e. $\mathbf{x}(t) = (\mathbf{x}^{(1)}(t), \ldots, \mathbf{x}^{(N_L)}(t))$. The matrix of all states is denoted by $\mathbf{X} = (\mathbf{x}(1), \ldots, \mathbf{x}(T))$, where $T$ is the number of time steps in the time series. Finally, the output of the network for the duration $T$ is computed as a linear combination of the Mod-DeepESN state matrix using (3).

$$\mathbf{Y} = \mathbf{W}_{out}\mathbf{X} \quad (3)$$

The matrix $\mathbf{W}_{out}$ contains the feedforward weights between reservoir neurons and the $N_Y$ output neurons. Ridge regression using the Moore-Penrose pseudo inverse is used to solve for the optimal $\mathbf{W}_{out}$, shown in (4).

$$\mathbf{W}_{out} = \mathbf{Y}_{target}\mathbf{X}^{\top}\left(\mathbf{X}\mathbf{X}^{\top} + \beta\mathbf{I}\right)^{-1} \quad (4)$$

The formulation includes a regularization term $\beta$, and $\mathbf{I}$ is the identity matrix. This computation is performed only during the training phase of the model.
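The state update and readout training described above can be sketched in a few lines of NumPy. This is a minimal single-reservoir illustration, not the authors' implementation: the function names `reservoir_step` and `ridge_readout`, the sine-wave input, the one-step-ahead target, and every numeric value (reservoir size, leak rate, regularization) are assumptions chosen only to make the example run.

```python
import numpy as np

def reservoir_step(x_prev, i_t, W_rec, a):
    """One leaky-integrator update: blend the old state with the tanh of the new drive."""
    return (1.0 - a) * x_prev + a * np.tanh(i_t + W_rec @ x_prev)

def ridge_readout(X, Y_target, beta):
    """Ridge solution W_out = Y X^T (X X^T + beta I)^-1, computed with a linear solve."""
    n = X.shape[0]
    A = X @ X.T + beta * np.eye(n)
    # A is symmetric, so solving A W^T = X Y^T avoids forming an explicit inverse.
    return np.linalg.solve(A, X @ Y_target.T).T

rng = np.random.default_rng(0)
n_in, n_res, T = 1, 20, 200                      # illustrative sizes (assumed)
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W_rec = rng.uniform(-0.5, 0.5, (n_res, n_res))
W_rec *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_rec)))  # spectral radius 0.9

u = np.sin(0.1 * np.arange(T + 1)).reshape(1, -1)  # stand-in input signal
x = np.zeros(n_res)
states = []
for t in range(T):
    x = reservoir_step(x, W_in @ u[:, t], W_rec, a=0.5)
    states.append(x)

X = np.array(states).T            # state matrix, shape (n_res, T)
Y_target = u[:, 1:T + 1]          # one-step-ahead target, shape (1, T)
W_out = ridge_readout(X, Y_target, beta=1e-6)
Y = W_out @ X                     # network output over all T steps
train_rmse = float(np.sqrt(np.mean((Y - Y_target) ** 2)))
```

Only `W_out` is learned here; `W_in` and `W_rec` stay fixed after initialization, which is what keeps training cheap.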
3.2 Mod-DeepESN Topologies
For all the proposed topologies, (i) Wide, (ii) Layered, (iii) 2×2 Criss-Cross, and (iv) Wide+Layered, the input layer is directly connected to the output layer; this connection captures sensitive spatial features and improves the network performance experimentally. Additionally, each reservoir is connected to the output layer.
Figure 1 illustrates the Wide topology. Distinct connections are created between the input layer and each reservoir, i.e. the input weights are not shared between reservoirs. As a result, an ensemble of shallow ESNs emerges, each of which captures varying dynamics of the input data locally.
The Layered topology, shown in Figure 2(a), presents the input to the first reservoir with each successive reservoir receiving feedforward input from its predecessor. By stacking reservoirs in this fashion, stateful dynamics of the system are integrated with features that increase in complexity with depth.
Figure 2(b) depicts the 2×2 Criss-Cross topology, which comprises a total of 4, or (2×2), reservoirs. This model exhibits dense feedforward connectivity between all reservoirs, but maintains reservoir-local recurrent connections. The system states are hierarchically integrated at multiple depths from disparate inputs.
The Wide+Layered topology, presented in Figure 2(c), deviates from the 2×2 Criss-Cross topology through greater sparsity in its feedforward connectivity; dynamical states of the input are integrated in discrete pathways that resemble the Layered topology.
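One way to picture the four topologies is as binary, upper-triangular connectivity matrices. The sketch below is a hedged reading of the descriptions above rather than the authors' code: the `connectivity` helper, the topology names used as strings, the convention that index 0 is the input, and the exact wiring chosen for the 2×2 cases (two reservoirs per layer, with crossing links only in Criss-Cross) are all assumptions.

```python
import numpy as np

def connectivity(topology, n_reservoirs=4):
    """Return a binary upper-triangular matrix C of shape (n + 1, n + 1).

    Index 0 is the input and indices 1..n are reservoirs (assumed convention);
    C[k, l] == 1 means node k feeds forward into reservoir l.
    """
    n = n_reservoirs
    C = np.zeros((n + 1, n + 1), dtype=int)
    if topology == "wide":
        C[0, 1:] = 1                      # input drives every reservoir directly
    elif topology == "layered":
        for l in range(1, n + 1):
            C[l - 1, l] = 1               # chain: input -> 1 -> 2 -> ... -> n
    elif topology in ("criss_cross", "wide_layered"):
        if n != 4:
            raise ValueError("this sketch only covers the 2x2 case")
        C[0, 1] = C[0, 2] = 1             # input drives both first-layer reservoirs
        C[1, 3] = C[2, 4] = 1             # straight pathways 1 -> 3 and 2 -> 4
        if topology == "criss_cross":
            C[1, 4] = C[2, 3] = 1         # crossing pathways make the layers dense
    else:
        raise ValueError(f"unknown topology: {topology}")
    return C

matrices = {name: connectivity(name)
            for name in ("wide", "layered", "criss_cross", "wide_layered")}
```

Under these assumptions, Wide+Layered differs from 2×2 Criss-Cross only by the two crossing entries, which matches the description of it as the sparser of the two.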
3.3 Weight Initialization
As weights in ESN reservoirs are fixed, recurrent weights must be initialized to maintain stable reservoir states, especially in deep ESNs where inputs to a network propagate through multiple reservoirs. As mathematically supported in the ESN literature, deep ESNs need to satisfy the Echo State Property for such stability. Accordingly, each recurrent weight matrix $\hat{\mathbf{W}}^{(l)}$ is initialized using a uniform distribution and scaled such that (5) is satisfied.

$$\max_{1 \le l \le N_L} \rho\left((1 - a^{(l)})\,\mathbf{I} + a^{(l)}\hat{\mathbf{W}}^{(l)}\right) < 1 \quad (5)$$

The spectral radius of the matrix, i.e. the magnitude of the largest eigenvalue, is computed by the function $\rho(\cdot)$. We substitute a hyperparameter $\hat{\rho}$ for the value 1 to fine-tune model reservoirs for each experiment.
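The scaling step can be illustrated as follows. This is a sketch under stated assumptions, not the authors' procedure: it rescales a uniformly initialized matrix so that its own spectral radius equals a chosen target, then checks the spectral radius of the leaky-integrator matrix `(1 - a) I + a W`. The helper names, matrix size, target radius, and leak rate are all hypothetical.

```python
import numpy as np

def spectral_radius(W):
    """Magnitude of the largest eigenvalue of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(W))))

def init_recurrent(n, rho_target, rng):
    """Uniform recurrent weights rescaled to the target spectral radius."""
    W = rng.uniform(-1.0, 1.0, size=(n, n))
    return W * (rho_target / spectral_radius(W))

rng = np.random.default_rng(42)
W_rec = init_recurrent(100, rho_target=0.9, rng=rng)

a = 0.3  # leak rate (assumed value)
# Every eigenvalue of the leaky matrix is (1 - a) + a * lambda, so its
# magnitude is bounded by (1 - a) + a * rho_target, which stays below 1.
effective = spectral_radius((1.0 - a) * np.eye(100) + a * W_rec)
```

Because the bound `(1 - a) + a * rho_target` is below 1 whenever `rho_target` is below 1, scaling the raw matrix this way is a simple sufficient route to a stable leaky reservoir, although it is not the only one.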
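The remaining weight matrices are initialized from a uniform distribution and scaled such that their Euclidean ($L_2$) norm is equal to either $\sigma_{in}$ or $\sigma_{res}$, i.e. $\|\mathbf{W}_{in}\|_2 = \sigma_{in}$ for the input weights and $\|\mathbf{W}\|_2 = \sigma_{res}$ for the feedforward weights between reservoirs. $\sigma_{in}$ and $\sigma_{res}$ are tunable hyperparameters.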
Rather than normalizing weights using the $L_2$ norm or spectral radius, Xavier initialization (Glorot & Bengio, 2010) can be used to determine the distribution of weights. In this method, any set of weights $\mathbf{W}$ can be drawn from a normal distribution as shown by (6) and (7).

$$\mathbf{W} \sim \mathcal{N}(0, \sigma^2) \quad (6)$$

$$\sigma = \sqrt{\frac{2}{n_{in} + n_{out}}} \quad (7)$$

The distribution is parameterized with a mean of $0$ and a standard deviation $\sigma$, where $n_{in}$ is the number of inputs and $n_{out}$ is the number of outputs that the weight matrix is connecting. Additionally, this initialization can be made modular to isolate which sets of weights utilize a Xavier initialization and which use either $L_2$ or spectral radius normalization.
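As a concrete illustration of Xavier initialization, the sketch below draws a weight matrix from a zero-mean normal distribution whose standard deviation is the Glorot and Bengio value `sqrt(2 / (n_in + n_out))`. The function name `xavier_normal` and the matrix shape are assumptions for illustration only.

```python
import numpy as np

def xavier_normal(n_out, n_in, rng):
    """Draw an (n_out, n_in) matrix from N(0, sigma^2) with sigma^2 = 2 / (n_in + n_out)."""
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(loc=0.0, scale=sigma, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = xavier_normal(n_out=200, n_in=300, rng=rng)
expected_sigma = np.sqrt(2.0 / 500)   # standard deviation implied by the shape
```

The distribution depends only on the shape of the matrix, so no scaling hyperparameter has to be searched for the weights initialized this way.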
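All modes of initialization additionally use a sparsity hyperparameter that determines the probability that each weight is nullified. Specifically, separate sparsity values are used for the input weight matrix, for each recurrent weight matrix, and for each feedforward weight matrix between reservoirs.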
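A sparsity hyperparameter of this kind can be applied with a random mask. The sketch below is illustrative only; the `sparsify` name and the example sparsity of 0.9 are assumptions, and the order in which masking and rescaling are applied is not stated in the text.

```python
import numpy as np

def sparsify(W, sparsity, rng):
    """Zero each weight independently with probability `sparsity`."""
    keep = rng.random(W.shape) >= sparsity   # True where the weight survives
    return W * keep

rng = np.random.default_rng(7)
W_dense = rng.uniform(-1.0, 1.0, size=(200, 200))
W_sparse = sparsify(W_dense, sparsity=0.9, rng=rng)
zero_fraction = float(np.mean(W_sparse == 0.0))
```

Masking changes both the norm and the spectral radius of a matrix, so a practical pipeline would likely mask first and rescale afterwards.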
3.4 Intrinsic Plasticity (IP)
In this section, we study how a neurally-inspired IP mechanism can enhance the performance of the Mod-DeepESN. Originally proposed by Schrauwen et al. (2008), the IP rule introduces a gain and bias term to the $\tanh$ function: $\tilde{y} = \tanh(g\tilde{x} + b)$, where $g$ is the gain and $b$ is the bias. The update rules are given by (8) and (9),

$$\Delta b = -\eta\left(-\frac{\mu}{\sigma^2} + \frac{\tilde{y}}{\sigma^2}\left(2\sigma^2 + 1 - \tilde{y}^2 + \mu\tilde{y}\right)\right) \quad (8)$$

$$\Delta g = \frac{\eta}{g} + \Delta b\,\tilde{x} \quad (9)$$

where $\tilde{x}$ is the net weighted sum of inputs to a neuron, $\tilde{y}$ is the activation of $\tilde{x}$, i.e. $\tilde{y} = \tanh(g\tilde{x} + b)$, and $\eta$ is the learning rate. The hyperparameters $\sigma$ and $\mu$ are the standard deviation and mean, respectively, of the target Gaussian distribution. In a pre-training phase, the learned parameters are initialized as $g = 1$ and $b = 0$, and are updated iteratively according to (8) and (9). Applying these rules minimizes the Kullback-Leibler (KL) divergence between the empirical output distribution and the target Gaussian distribution (Schrauwen et al., 2008).
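The IP pre-training loop can be sketched as below, using the tanh update rule of Schrauwen et al. (2008) applied independently to each neuron. The function name `ip_update`, the Gaussian stand-in for the net inputs, and all numeric settings (learning rate, target mean and standard deviation, number of steps) are assumptions for illustration; in a reservoir the net input would come from the input and recurrent weights.

```python
import numpy as np

def ip_update(x_net, gain, bias, eta, mu, sigma):
    """One intrinsic-plasticity step for tanh neurons (vectorized over neurons)."""
    y = np.tanh(gain * x_net + bias)
    var = sigma ** 2
    # Bias update pushes the output distribution toward N(mu, sigma^2).
    d_bias = -eta * (-mu / var + (y / var) * (2.0 * var + 1.0 - y ** 2 + mu * y))
    # Gain update reuses the bias update, scaled by the net input.
    d_gain = eta / gain + d_bias * x_net
    return gain + d_gain, bias + d_bias

rng = np.random.default_rng(1)
n_neurons = 50
gain = np.ones(n_neurons)     # gain starts at 1
bias = np.zeros(n_neurons)    # bias starts at 0
for _ in range(3000):
    x_net = rng.normal(0.0, 1.0, size=n_neurons)   # stand-in net inputs
    gain, bias = ip_update(x_net, gain, bias, eta=5e-4, mu=0.0, sigma=0.2)
```

After pre-training, `gain` and `bias` would be frozen and the readout trained as usual.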
3.5 Genetic Algorithm
To fine-tune network hyperparameters, a genetic algorithm is employed (Bäck et al., 2000). The evolution is executed with a population size of 50 individuals for 50 generations, with individuals selected through tournaments comprising 3 randomly selected individuals. Individuals mate with a crossover probability of 50% and mutate with a probability of 10% to form each successive generation. This evolution is used to tune all model hyperparameters as well as the Mod-DeepESN topology type.
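The search loop described above can be sketched with the standard library alone. This is not the DEAP-based implementation used in the paper: the quadratic `fitness` function is a hypothetical stand-in for training and scoring a model, the genes are generic values in [0, 1], and the uniform crossover and Gaussian mutation operators are assumptions, since the text only gives the population size, generation count, tournament size, and probabilities.

```python
import random

POP_SIZE, N_GEN = 50, 50
TOURN_SIZE, CX_PROB, MUT_PROB = 3, 0.5, 0.1

def fitness(ind):
    """Stand-in objective (lower is better); a real run would train and score a model."""
    return sum((g - 0.3) ** 2 for g in ind)

def tournament(pop, rng):
    """Return the best of TOURN_SIZE randomly chosen individuals."""
    return min(rng.sample(pop, TOURN_SIZE), key=fitness)

def evolve(n_genes=4, seed=0):
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_genes)] for _ in range(POP_SIZE)]
    for _ in range(N_GEN):
        # Selection: fill the next generation with copies of tournament winners.
        parents = [list(tournament(pop, rng)) for _ in range(POP_SIZE)]
        for a, b in zip(parents[::2], parents[1::2]):
            if rng.random() < CX_PROB:            # uniform crossover (assumed operator)
                for i in range(n_genes):
                    if rng.random() < 0.5:
                        a[i], b[i] = b[i], a[i]
        for ind in parents:
            if rng.random() < MUT_PROB:           # Gaussian mutation (assumed operator)
                i = rng.randrange(n_genes)
                ind[i] = min(1.0, max(0.0, ind[i] + rng.gauss(0.0, 0.1)))
        pop = parents
    return min(pop, key=fitness)

best = evolve()
```

In a real search each gene would map to one hyperparameter (for example a leak rate, a sparsity value, or an index into the topology list), and `fitness` would return a validation error.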
4 Experiments & Results
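Two datasets, the chaotic Mackey-Glass time series generated using the fourth-order Runge-Kutta method (RK4) and a daily minimum temperature series (Time Series Data Library, 1981–1990), are used to evaluate Mod-DeepESN as they present variable nonlinearity at multiple time scales. To analyze the effectiveness of the Mod-DeepESN model, root mean squared error (RMSE), normalized RMSE (NRMSE), and mean absolute percentage error (MAPE) are computed as shown in (10), (11), and (12), respectively.

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(y(t) - \hat{y}(t)\right)^2} \quad (10)$$

$$\mathrm{NRMSE} = \sqrt{\frac{\sum_{t=1}^{T}\left(y(t) - \hat{y}(t)\right)^2}{\sum_{t=1}^{T}\left(y(t) - \bar{y}\right)^2}} \quad (11)$$

$$\mathrm{MAPE} = \frac{1}{T}\sum_{t=1}^{T}\left|\frac{y(t) - \hat{y}(t)}{y(t)}\right| \quad (12)$$

$\hat{y}(t)$ is the predicted time series value at time step $t$, $y(t)$ is the corresponding target value, and $\bar{y}$ is the average value of the time series over the $T$ time steps. Each reported result is averaged over 10 runs using the best hyperparameters found from the genetic algorithm search.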
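The three error measures can be written compactly in NumPy. The definitions below follow common conventions and are assumptions where the text is silent: NRMSE is normalized by the spread of the target around its mean, and MAPE is returned as a fraction rather than multiplied by 100. The toy arrays exist only to show the arithmetic.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def nrmse(y_true, y_pred):
    """RMSE normalized by the spread of the target around its mean."""
    num = np.sum((y_true - y_pred) ** 2)
    den = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(np.sqrt(num / den))

def mape(y_true, y_pred):
    """Mean absolute error relative to the target, as a fraction (not x100)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 2.0])   # one prediction is off by 2

scores = (rmse(y_true, y_pred), nrmse(y_true, y_pred), mape(y_true, y_pred))
```

For these toy values the single error of 2 gives an RMSE of 1.0, an NRMSE of sqrt(4/5), and a MAPE of 0.125.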
The Mackey-Glass dataset, shown in Figure 3(a), is split into 8,000 training samples and 2,000 testing samples to forecast 84 time steps in advance. To reduce the influence of initial reservoir states, the first 100 predictions from the network are discarded in a washout period.
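The data preparation for the 84-step-ahead task can be sketched as follows. A sine wave stands in for the Mackey-Glass series, and the way the split is aligned with the forecast horizon (generating 84 extra samples so that every input has a target) is an assumption; only the horizon, split sizes, and washout length come from the text.

```python
import numpy as np

def make_forecast_pairs(series, horizon):
    """Pair each value with the value `horizon` steps later."""
    return series[:-horizon], series[horizon:]

HORIZON, N_TRAIN, N_TEST, WASHOUT = 84, 8000, 2000, 100

# Stand-in signal; the actual experiments use a Mackey-Glass series.
series = np.sin(0.05 * np.arange(N_TRAIN + N_TEST + HORIZON))
inputs, targets = make_forecast_pairs(series, HORIZON)

u_train, y_train = inputs[:N_TRAIN], targets[:N_TRAIN]
u_test = inputs[N_TRAIN:N_TRAIN + N_TEST]
y_test = targets[N_TRAIN:N_TRAIN + N_TEST]

# Drop the first WASHOUT targets so transient reservoir states are not scored.
y_test_scored = y_test[WASHOUT:]
```

The same slicing applied to the network predictions keeps predictions and targets aligned when the error measures are computed.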
The daily minimum temperature series dataset, shown in Figure 3(b), comprises data from Melbourne, Australia (1981-1990), and is split into 2,920 training samples and 730 testing samples. A washout period of 30 steps is realized during model evaluation for the single-step ahead prediction.
Table 1: Mackey-Glass time series, baseline models.
|Model|Reservoirs|RMSE|NRMSE|MAPE|
|ESN|1|43.7|201|7.03|
|φ-ESN (Gallicchio & Micheli, 2011)|2|8.60|39.6|1.00|
|R²SP (Butcher et al., 2013)|2|27.2|125|1.00|
|MESM (Malik et al., 2016)|7|12.7|58.6|1.91|
|Deep-ESN (Ma et al., 2017)|3|1.12|5.17|.151|

Table 2: Daily minimum temperature series, baseline models.
|Model|Reservoirs|RMSE|NRMSE|MAPE|
|ESN|1|501|139|39.5|
|φ-ESN (Gallicchio & Micheli, 2011)|2|493|141|39.6|
|R²SP (Butcher et al., 2013)|2|495|137|39.3|
|MESM (Malik et al., 2016)|7|478|136|37.7|
|Deep-ESN (Ma et al., 2017)|2|473|135|37.0|
In each experiment, the number of neurons in every reservoir is set to the same value to reduce the hyperparameter search space. We denote this value as $N_R$ in Tables 1 and 2. Additionally, a value in parentheses in the reservoir-count column is used to indicate the reservoir width of the 2×2 Criss-Cross and Wide+Layered topologies. If Xavier initialization is used in place of initialization by spectral radius or $L_2$ normalization, this is indicated in the corresponding table entry. Hyperparameters are chosen independently of the topology, as the experimental performance consistently surpasses that of the best hyperparameters found per topology. Each topology is evaluated with and without IP to determine its efficacy, but only the best result using IP is reported. Baseline ESN results for both datasets are retrieved from Ma et al. (2017). The Mod-DeepESN implementation makes use of the scikit-learn (Pedregosa et al., 2011), SciPy (Jones et al., 2001), NumPy (Oliphant, 2006), and DEAP (Fortin et al., 2012) software libraries.
The Mackey-Glass experimental results demonstrate that model fitness depends on the Mod-DeepESN topology. An IP pre-training phase greatly reduces the network error and performs best when evaluated with the Wide topology. The proposed architecture outperforms all baseline models with the exception of the Deep-ESN (Ma et al., 2017).
In the daily minimum temperature time series experiment, the Wide+Layered Mod-DeepESN outperforms every baseline model, and further reduces the error with the inclusion of an IP pre-training phase. Varying the topology within this experiment is less impactful than within the Mackey-Glass experiment, but deeper models consistently yield better performance.
It is important to note that the best Mod-DeepESN models use Xavier initialization for multiple weight matrices. While this does not necessarily suggest that spectral radius or $L_2$ normalization are inferior initialization methods, it does indicate that a desirable initialization can be achieved by using a non-parametrized Gaussian distribution that reduces the hyperparameter search space. Furthermore, the best-performing Mod-DeepESN models, Wide and Wide+Layered, incorporate distinct, non-coinciding channels from input layer to output layer, and consistently outperform the other topologies.
In this work, we have shown the efficacy of a deep and modular echo state network architecture with varying topology. Prediction error on multiple datasets is reduced by isolating the integration of dynamical input states to disparate computational pathways, and by incorporating IP in a pre-training phase. Xavier initialization of weights decreases the complexity of hyperparameter tuning and performs well consistently across all proposed topologies. By combining these mechanisms with a genetic algorithm, the Mod-DeepESN outperforms several baseline models on non-trivial time series forecasting tasks.
The authors would like to thank the members of the RIT Neuromorphic AI Lab for their valuable feedback on this work.
- Bäck, T., Fogel, D. B., & Michalewicz, Z. (2000). Evolutionary computation 1: Basic algorithms and operators (Vol. 1). CRC Press.
- Butcher, J. B., Verstraeten, D., Schrauwen, B., Day, C., & Haycock, P. (2013). Reservoir computing and extreme learning machines for non-linear time-series data analysis. Neural Networks, 38, 76–89.
- Fortin, F. A., De Rainville, F. M., Gardner, M. A., Parizeau, M., & Gagné, C. (2012). DEAP: Evolutionary algorithms made easy. Journal of Machine Learning Research, 13, 2171–2175.
- Gallicchio, C., & Micheli, A. (2011). Architectural and Markovian factors of echo state networks. Neural Networks, 24(5), 440–456.
- Gallicchio, C., Micheli, A., & Pedrelli, L. (2017). Deep reservoir computing: A critical experimental analysis. Neurocomputing, 268, 87–99.
- Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 249–256).
- Jaeger, H., Lukoševičius, M., Popovici, D., & Siewert, U. (2007). Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20(3), 335–352.
- Jones, E., Oliphant, T., Peterson, P., et al. (2001). SciPy: Open source scientific tools for Python. http://www.scipy.org/
- Ma, Q., Shen, L., & Cottrell, G. W. (2017). Deep-ESN: A multiple projection-encoding hierarchical reservoir computing framework. arXiv preprint arXiv:1711.05255.
- Malik, Z., Hussain, A., & Wu, Q. M. J. (2016, June). Multilayered echo state machine: A novel architecture and algorithm. 47, 1–14.
- Oliphant, T. E. (2006). A guide to NumPy (Vol. 1). Trelgol Publishing USA.
- Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
- Schrauwen, B., Wardermann, M., Verstraeten, D., Steil, J. J., & Stroobandt, D. (2008). Improving reservoirs using intrinsic plasticity. Neurocomputing, 71(7–9), 1159–1171.
- Soures, N., Hays, L., & Kudithipudi, D. (2017, May). Robustness of a memristor based liquid state machine. 2017 International Joint Conference on Neural Networks (IJCNN) (pp. 2414–2420).
- Time Series Data Library. (1981–1990). Daily minimum temperatures in Melbourne, Australia. https://datamarket.com/data/set/2324/daily-minimum-temperatures-in-melbourne-australia-1981-1990
- Yamazaki, T., & Tanaka, S. (2007). The cerebellum as a liquid state machine. Neural Networks, 20(3), 290–297.