Analysis of Wide and Deep Echo State Networks for Multiscale Spatiotemporal Time Series Forecasting

07/01/2019 · Zachariah Carmichael, et al.

Echo state networks are computationally lightweight reservoir models inspired by the random projections observed in cortical circuitry. As interest in reservoir computing has grown, networks have become deeper and more intricate. While these networks are increasingly applied to nontrivial forecasting tasks, there is a need for comprehensive performance analysis of deep reservoirs. In this work, we study the influence of partitioning neurons given a budget and the effect of parallel reservoir pathways across different datasets exhibiting multi-scale and nonlinear dynamics.


1. Introduction

Recurrent neural networks (RNNs) have recently shown advances on a variety of spatiotemporal tasks due to their innate feedback connections. One of the most commonly used RNNs today is the long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997). LSTMs are widely employed for solving spatiotemporal tasks and demonstrate high accuracy. However, these networks generally require expensive computations that often lead to long training times.

A computationally lightweight approach to spatiotemporal processing is reservoir computing (RC), a biologically inspired framework for neural networks. RC networks comprise hidden layers, referred to as reservoirs, that consist of pools of neurons with fixed random weights and sparse random connectivity. Echo state networks (ESNs) (Jaeger, 2001) and liquid state machines (LSMs) (Maass et al., 2002) are the two major types of RC. Both architectures use sparse random connectivity between neurons to provide an intrinsic form of memory, and both enable rapid training, as training occurs only within the readout layer. Whereas ESNs are rate-based models, LSMs are spike-based. The focus of this work is primarily on ESNs.

Figure 1. Various Mod-DeepESN topologies.

In general, ESNs have been shown to perform well on small spatiotemporal tasks but underperform as task complexity increases. Prior literature has shown that ESNs are capable of various functions, such as speech processing, EEG classification, and anomaly detection (Soures et al., 2017a; Jaeger et al., 2007). In recent literature, several groups have begun to study how these networks can cope with increasingly complex time series tasks with dynamics across multiple scales and domains (Ma et al., 2017; Malik et al., 2016; Butcher et al., 2013; Gallicchio et al., 2018b). One technique to enhance ESNs is the addition of reservoir layers. These networks are referred to as deep ESNs, which provide a hierarchical framework for feature extraction and for capturing nontrivial dynamics while maintaining the lightweight training of a conventional ESN.

Ma et al. introduced the Deep-ESN architecture, which utilizes a sequence of multiple reservoir layers and unsupervised encoders to extract the intricacies of temporal data (Ma et al., 2017). Gallicchio et al. proposed an architecture, named DeepESN, that utilizes Jaeger et al.'s leaky-integrator neurons in a deeper ESN architecture (Gallicchio and Micheli, 2017; Jaeger et al., 2007).

In our previous work, we introduced the Mod-DeepESN, a modular architecture that allows for varying topologies of deep ESNs (Carmichael et al., 2018). Intrinsic plasticity (IP) primes neurons to contribute more equally towards predictions and improves the network's performance. A network with a wide and layered topology was able to achieve a lower root-mean-square error (RMSE) than other ESN models on a daily minimum temperature series (Hyndman and Yang, 2018) and the Mackey-Glass time series. In this paper, we propose alternative design techniques to enhance the Mod-DeepESN architecture and provide the comprehensive analysis required to understand why these networks perform well.

2. Architecture

ESNs, a framework under the RC paradigm, are RNNs that comprise one or more reservoirs with a population of rate-based neurons. Reservoirs maintain a set of random untrained weights and exhibit a nonlinear response when the network is driven by an input signal. The state of each reservoir is recorded over the duration of an input sequence, and a set of output weights is trained on the states based on a teacher signal. As the output computes a simple linear transformation, there is no need for the expensive backpropagation of error through the network that is required for training RNNs in the deep learning paradigm. Thus, the vanishing gradient problem is avoided while complex dynamics of the input data are still captured. Vanilla ESNs comprise a single reservoir and have limited application, especially with data exhibiting multi-scale and highly nonlinear dynamics. To this end, various architectures have been proposed with multiple reservoirs, additional projections, autoencoders, plasticity mechanisms, etc. (Butcher et al., 2013; Ma et al., 2017; Malik et al., 2016; Gallicchio et al., 2017, 2018b; Carmichael et al., 2018).

Building off of (Gallicchio et al., 2017; Carmichael et al., 2018), we introduce the flexible Mod-DeepESN architecture, which maintains parameterized connectivity between reservoirs and the input. A broad set of topologies are accommodated by its modularity, several of which are shown in Figure 1. We denote the tensor of input data as $\mathcal{U}$, which comprises $N_s$ sequences of $T$ timesteps and $N_u$ features. $T$ may differ between each of the $N_s$ sequences, but for simplicity such variability is left out of the formulation. Reservoirs that are connected to the input receive the vector $\mathbf{u}(t) \in \mathbb{R}^{N_u}$ at timestep $t$, where $1 \le t \le T$. $\mathbf{u}(t)$ is mapped by the input weight matrix $\mathbf{W}_{in}^{(\ell)} \in \mathbb{R}^{N_r \times N_u}$ into each connected reservoir $\ell$, where $N_r$ is the number of neurons per reservoir and typically $N_r \gg N_u$. A binary matrix $\mathbf{C}$ determines the feedforward connections between reservoirs and the input $\mathbf{u}$, with entry $\mathbf{C}_{k,\ell} = 1$ indicating a connection from source $k$ (the input or a reservoir) to reservoir $\ell$; for example, if $\mathbf{C}_{u,2}$ is '1', then $\mathbf{u}$ and reservoir 2 are connected. The number of nonzero entries in the input row of $\mathbf{C}$ gives the number of reservoirs that are connected to $\mathbf{u}$. The output of reservoir $\ell$, $\mathbf{x}^{(\ell)}(t) \in \mathbb{R}^{N_r}$, is computed as in (1)

$\tilde{\mathbf{x}}^{(\ell)}(t) = \mathbf{i}^{(\ell)}(t) + \hat{\mathbf{W}}^{(\ell)}\,\mathbf{x}^{(\ell)}(t-1)$ (1a)
$\mathbf{x}^{(\ell)}(t) = \big(1 - a^{(\ell)}\big)\,\mathbf{x}^{(\ell)}(t-1) + a^{(\ell)}\tanh\!\big(\tilde{\mathbf{x}}^{(\ell)}(t)\big)$ (1b)

where the net input $\mathbf{i}^{(\ell)}(t)$ to reservoir $\ell$ is given by (2).

$\mathbf{i}^{(\ell)}(t) = \mathbf{C}_{u,\ell}\,\mathbf{W}_{in}^{(\ell)}\,\mathbf{u}(t) + \sum_{k \ne \ell} \mathbf{C}_{k,\ell}\,\mathbf{W}^{(k,\ell)}\,\mathbf{x}^{(k)}(t)$ (2)

$\mathbf{W}^{(k,\ell)}$ is a feedforward weight matrix that connects two reservoirs, while $\hat{\mathbf{W}}^{(\ell)}$ is a recurrent weight matrix that connects intra-reservoir neurons. The per-layer leaky parameter $a^{(\ell)} \in (0, 1]$ controls the leakage rate in a moving exponential average manner. Note that the bias vectors are left out of the formulation for simplicity. The state of a Mod-DeepESN network is defined as the concatenation of the outputs of the $N_L$ reservoirs, i.e. $\mathbf{x}(t) = \big[\mathbf{x}^{(1)}(t); \dots; \mathbf{x}^{(N_L)}(t)\big]$. The matrix of all states is denoted as $\mathbf{X}$. Finally, the output of the network for the duration $T$ is computed as a linear combination of the state matrix using (3).

$\hat{\mathbf{Y}} = \mathbf{W}_{out}\,\mathbf{X}$ (3)

The matrix $\mathbf{W}_{out}$ contains the feedforward weights between reservoir neurons and the output neurons, and $\mathbf{Y}$ is the ground truth with a label for each timestep. In a forecasting task, the dimensionality of the output is the same as the input, i.e. $N_y = N_u$. Ridge regression, also known as Tikhonov regularization, is used to solve for optimal $\mathbf{W}_{out}$ and is shown with the explicit solution in (4)

$\mathbf{W}_{out} = \mathbf{Y}\,\mathbf{X}^{\top}\big(\mathbf{X}\,\mathbf{X}^{\top} + \beta\,\mathbf{I}\big)^{-1}$ (4)

and using singular value decomposition (SVD) in (5)

$\mathbf{W}_{out} = \mathbf{Y}\,\mathbf{V}\,\big(\boldsymbol{\Sigma} \oslash (\boldsymbol{\Sigma} \odot \boldsymbol{\Sigma} + \beta\,\mathbf{I})\big)\,\mathbf{U}^{\top}$ (5)

where $\beta$ is a regularization term, $\mathbf{I}$ is the identity matrix, $\odot$ is the Hadamard (elementwise) product with $\oslash$ denoting elementwise division, and $\mathbf{X} = \mathbf{U}\,\boldsymbol{\Sigma}\,\mathbf{V}^{\top}$ is the SVD of the state matrix. The SVD solution of the Moore-Penrose pseudoinverse gives a more accurate result but comes at the cost of higher computational complexity.
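To make the preceding formulation concrete, the sketch below implements a minimal single-reservoir leaky ESN with a ridge-regression readout in NumPy. It is an illustrative simplification rather than the authors' released implementation: the function names, the fixed hyperparameter values, and the toy sine-wave task are ours, and the multi-reservoir connectivity of Mod-DeepESN is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_reservoir(u, w_in, w_res, a=0.5):
    """Drive a single leaky reservoir, per (1a)-(1b), and collect its states."""
    n_r = w_res.shape[0]
    x = np.zeros(n_r)
    states = np.empty((n_r, len(u)))
    for t, u_t in enumerate(u):
        pre = w_in @ np.atleast_1d(u_t) + w_res @ x   # pre-activation (1a)
        x = (1.0 - a) * x + a * np.tanh(pre)          # leaky update (1b)
        states[:, t] = x
    return states

def train_readout(states, targets, beta=1e-6):
    """Ridge-regression readout (4): W_out = Y X^T (X X^T + beta I)^-1."""
    X, Y = states, np.atleast_2d(targets)
    return Y @ X.T @ np.linalg.inv(X @ X.T + beta * np.eye(X.shape[0]))

# Toy usage: 1-step-ahead forecasting of a sine wave.
series = np.sin(0.1 * np.arange(1000))
u, y = series[:-1], series[1:]
n_r, n_u = 100, 1
w_in = rng.uniform(-0.5, 0.5, size=(n_r, n_u))
w_res = rng.uniform(-0.5, 0.5, size=(n_r, n_r))
w_res *= 0.9 / max(abs(np.linalg.eigvals(w_res)))     # crude spectral-radius scaling
X = run_reservoir(u, w_in, w_res)
w_out = train_readout(X[:, 100:], y[100:])            # discard a washout period
print("train RMSE:", np.sqrt(np.mean((w_out @ X[:, 100:] - y[100:]) ** 2)))
```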

To maintain reservoir stability, ESNs need to satisfy the echo state property (ESP) (Jaeger, 2001; Gallicchio and Micheli, 2017) as stated by (6).

$\max_{k}\,\big|\,\mathrm{eig}_k\big(\big(1 - a^{(\ell)}\big)\,\mathbf{I} + a^{(\ell)}\,\hat{\mathbf{W}}^{(\ell)}\big)\big| < 1$ (6)

The function $\mathrm{eig}_k(\cdot)$ gives the $k$th eigenvalue of its matrix argument and $|\cdot|$ gives the modulus of its complex scalar argument. The maximum eigenvalue modulus is referred to as the spectral radius and must be less than unity ('1') in order for initial conditions of each reservoir to be washed out asymptotically. A hyperparameter $\rho$ is substituted for unity to allow for reservoir tuning for a given task. Each $\hat{\mathbf{W}}^{(\ell)}$ is drawn from a uniform distribution and scaled such that the ESP is satisfied.
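A common recipe for this rescaling step is sketched below: a randomly drawn recurrent matrix is scaled so its spectral radius equals a target `rho`. The helper name is ours, and a stricter implementation would check the leaky condition in (6), which also involves the leak rate.

```python
import numpy as np

def scale_spectral_radius(w_res, rho=0.9):
    """Rescale a random recurrent matrix so its spectral radius equals rho."""
    radius = max(abs(np.linalg.eigvals(w_res)))
    return w_res * (rho / radius)

rng = np.random.default_rng(0)
w_hat = scale_spectral_radius(rng.uniform(-1, 1, size=(200, 200)), rho=0.95)
print(max(abs(np.linalg.eigvals(w_hat))))  # ~0.95
```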

The remaining untrained weight matrices are set using one of two methods. First, a matrix can be drawn from a uniform distribution and scaled to have a specified (Frobenius) norm, i.e. $\|\mathbf{W}\|_F = \sigma_W$, where the target norms for the input and inter-reservoir matrices are hyperparameters. Second, Glorot (Xavier) initialization (Glorot and Bengio, 2010) can be utilized without incurring further hyperparameters. The method initializes weights such that the activation (output) variance of consecutive fully-connected layers is the same. Formally, weights are drawn from a normal distribution with zero mean and a standard deviation of $\sqrt{2 / (n_{in} + n_{out})}$, where $n_{in}$ and $n_{out}$ are the number of inputs and outputs of a layer, respectively. All weight matrices are drawn with a sparsity parameter which gives the probability that each weight is nullified. Specifically, separate sparsity parameters determine the sparsity of $\mathbf{W}_{in}$, of each $\mathbf{W}^{(k,\ell)}$, and of each $\hat{\mathbf{W}}^{(\ell)}$.
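The two initialization options can be sketched as follows; the helper names, the order of sparsification and scaling, and the example shapes are our choices for illustration rather than the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_uniform(shape, norm, sparsity, rng):
    """Uniform weights, sparsified, then scaled to a target Frobenius norm."""
    w = rng.uniform(-1.0, 1.0, size=shape)
    w *= rng.random(shape) >= sparsity          # sparsity = P(weight nullified)
    return w * (norm / np.linalg.norm(w))

def sparse_glorot(shape, sparsity, rng):
    """Glorot-normal weights: zero mean, std = sqrt(2 / (n_in + n_out))."""
    n_out, n_in = shape
    w = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=shape)
    return w * (rng.random(shape) >= sparsity)

w_in = sparse_uniform((300, 1), norm=0.5, sparsity=0.8, rng=rng)
w_ff = sparse_glorot((300, 300), sparsity=0.9, rng=rng)
```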

Furthermore, an unsupervised intrinsic plasticity (IP) learning rule is employed. The rule, originally proposed by Schrauwen et al., introduces gain and bias terms to the nonlinearities of reservoir neurons, i.e. $\tanh(\tilde{x})$ is substituted with $\tanh(g\,\tilde{x} + b)$, where $g$ is the gain and $b$ is the bias. Iterative application of the rule minimizes the Kullback-Leibler (KL) divergence between the empirical output distribution (as driven by the input $\mathbf{u}$) and a target Gaussian distribution (Schrauwen et al., 2008). The update rules for the $i$th neuron are given by (7) and (8)

$\Delta b_i = -\eta\left(-\frac{\mu}{\sigma^2} + \frac{x_i(t)}{\sigma^2}\big(2\sigma^2 + 1 - x_i(t)^2 + \mu\,x_i(t)\big)\right)$ (7)
$\Delta g_i = \frac{\eta}{g_i} + \Delta b_i\,\tilde{x}_i(t)$ (8)

where $\tilde{x}_i(t)$ is given by (1a) and $x_i(t)$ is given by (1b). The hyperparameter $\eta$ is the learning rate, and $\sigma$ and $\mu$ are the standard deviation and mean of the target Gaussian distribution, respectively. In a pre-training phase, the learned parameters are initialized as $g_i = 1$ and $b_i = 0$ and are updated iteratively in a layer-wise fashion.
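A sketch of one IP step, assuming the reconstruction of (7) and (8) above, is given below; the step size, target moments, and per-timestep application shown are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def ip_update(gain, bias, pre, post, eta=1e-4, mu=0.0, sigma=0.2):
    """One intrinsic-plasticity step per (7)-(8) for a tanh reservoir.

    `pre` is the neuron's net input (1a) and `post` its output (1b); the step
    nudges gain/bias so the output distribution approaches N(mu, sigma^2).
    """
    var = sigma ** 2
    d_bias = -eta * (-mu / var + (post / var) * (2 * var + 1 - post ** 2 + mu * post))
    d_gain = eta / gain + d_bias * pre
    return gain + d_gain, bias + d_bias

# Gains start at 1 and biases at 0 before the layer-wise pre-training phase.
gain, bias = np.ones(100), np.zeros(100)
```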

2.1. Particle Swarm Optimization

Instances of the Mod-DeepESN network may achieve satisfactory forecasting performance using empirical guesses of hyperparameters; however, a more sophisticated optimization approach can further improve performance. We thus propose black-box optimization of the Mod-DeepESN hyperparameters using particle swarm optimization (PSO) (Kennedy and Eberhart, 1995). In PSO, a population of particles is instantiated in the search space of an optimization problem. This space contains the possible values of continuous or discrete hyperparameters, and the particles move around the space based on external fitness, or cost, signals. The communication network topology employed dictates the social behavior, or dynamics, of the swarm. Here, we utilize a star topology in which each particle is attracted to the globally best-performing particle.

Formally, each particle $i$ is a candidate solution of hyperparameters with position $\mathbf{p}_i$ and velocity $\mathbf{v}_i$. The trivial position update of each particle is given by (9) while the velocity update is given by (10).

$\mathbf{p}_i(t+1) = \mathbf{p}_i(t) + \mathbf{v}_i(t+1)$ (9)
$\mathbf{v}_i(t+1) = \omega\,\mathbf{v}_i(t) + c_1\,\mathbf{R}_1\big(\mathbf{p}_i^{best} - \mathbf{p}_i(t)\big) + c_2\,\mathbf{R}_2\big(\mathbf{g}^{best} - \mathbf{p}_i(t)\big)$ (10)

The matrices $\mathbf{R}_1$ and $\mathbf{R}_2$ are populated by values uniformly drawn from the interval $[0, 1]$ at each timestep. The best position found by a particle is the vector $\mathbf{p}_i^{best}$, while the best solution found in the neighborhood of a particle is the vector $\mathbf{g}^{best}$. With a star communication topology, $\mathbf{g}^{best}$ is the best position found globally. The velocity update comprises three parameters which influence a particle's dynamics: $\omega$ is the inertia weight, $c_1$ is the cognitive acceleration (how much a particle should follow its personal best), and $c_2$ is the social acceleration (how much a particle should follow the swarm's global best). All hyperparameters are considered during the optimization process, except for the regularization term $\beta$, which is swept after the state matrix $\mathbf{X}$ has been computed, exploiting the fact that all the weight matrices but $\mathbf{W}_{out}$ are fixed.
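The paper's implementation extends PySwarms, but the update itself is simple enough to sketch directly; the inertia and acceleration values below are placeholders rather than the paper's settings, and the quadratic toy cost stands in for a Mod-DeepESN validation score.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(pos, vel, p_best, g_best, w=0.7, c1=1.5, c2=1.5):
    """One global-best PSO update of all particles, per (9)-(10)."""
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (p_best - pos) + c2 * r2 * (g_best - pos)
    return pos + vel, vel

# Minimize a toy cost over a 3-D hyperparameter space with 50 particles.
cost = lambda p: np.sum((p - 0.3) ** 2, axis=1)
pos = rng.random((50, 3))
vel = np.zeros_like(pos)
p_best, p_best_cost = pos.copy(), cost(pos)
for _ in range(100):
    g_best = p_best[np.argmin(p_best_cost)]
    pos, vel = pso_step(pos, vel, p_best, g_best)
    c = cost(pos)
    improved = c < p_best_cost
    p_best[improved], p_best_cost[improved] = pos[improved], c[improved]
print(p_best[np.argmin(p_best_cost)])  # converges near [0.3, 0.3, 0.3]
```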

2.2. Neural Mapping

RC networks have strong underpinnings in neural processing. Recent studies have shown that the arrangement of a complex representation layer and a linear readout layer in the reservoir is similar to the one observed in the cerebellum. A cerebellar model with a granule layer (representation layer) and the synapses between granule and Purkinje cells (linear readout layer) (Yamazaki and Tanaka, 2007) is used to study computationally useful case studies such as vestibulo-ocular reflex adaptation (Dean et al., 2002). The Purkinje cells are trained from the random, distributed input signals of the granule cells' parallel dendritic connections (Shepherd, 1990). We also employ intrinsic plasticity, as in (Schrauwen et al., 2008), to modulate neuronal activations to follow a known distribution. Whereas a biological neuron's electrical properties are modified, a reservoir neuron's gain and bias are adjusted. It has been argued that whether the biological counterparts of reservoirs operate at the edge of chaos, a powerful computational regime, is yet to be understood. However, infusing hybrid plasticity mechanisms such as short-term plasticity or long-term synaptic plasticity can help enhance computational performance. It is interesting to note that reservoir computing models (both spiking and non-spiking) seem to gain a boost in performance from embedding intrinsic plasticity, akin to biological models (Soures, 2017; Soures et al., 2017b). This convergence with the biological counterparts improves our understanding of how to build spatiotemporal processing systems, though one should take a parsimonious approach when drawing such correlations.

3. Measuring Reservoir Goodness

Various metrics have been proposed to understand reservoir performance (Jaeger, 2002; Ozturk et al., 2007; Gibbons, 2010; Gallicchio et al., 2018c; Lymburn et al., 2019). In this work, we quantify reservoir goodness by measuring the stability of reservoir dynamics as well as evaluating the forecasting proficiency of networks trained on synthetic and real-world tasks.

3.1. Separation Ratio Graphs

Separation ratio graphs (Gibbons, 2010) are considered for determining the fitness of Mod-DeepESN state separability. The method compares the separation of inputs with the separation of outputs for a given input and output, e.g. the input and output of a single reservoir or the input and output of a set of reservoirs. The metric assumes that inputs that appear similar should have a comparable degree of similarity with outputs. That is,

$d\big(\mathbf{x}(t_i), \mathbf{x}(t_j)\big) \approx d\big(\mathbf{u}(t_i), \mathbf{u}(t_j)\big)$ (11)

where $d(\cdot, \cdot)$ gives the Euclidean distance between inputs (or outputs) at timesteps $t_i$ and $t_j$. When the output separation is plotted as a function of the input separation, the relation should be close to identity, i.e. a linear regression trend line should yield a slope of 1 and an intercept of 0. If the output separation exceeds the input separation, a reservoir (or network) is considered to be in the "chaotic" zone, whereas it is considered to be in the "attractor" zone if the output separation is less than the input separation.
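A sketch of how the points of such a plot can be computed is shown below; sampling random timestep pairs (rather than all pairs) and the helper name are our choices, under the assumption that separation is measured as Euclidean distance per (11).

```python
import numpy as np

def separation_points(u, x, n_pairs=2000, rng=None):
    """Sample (input separation, output separation) pairs for a ratio plot (11).

    `u` is a (T, n_u) input array and `x` the corresponding (T, n_x) reservoir states.
    """
    rng = rng or np.random.default_rng(0)
    t1 = rng.integers(0, len(u), size=n_pairs)
    t2 = rng.integers(0, len(u), size=n_pairs)
    d_in = np.linalg.norm(u[t1] - u[t2], axis=1)
    d_out = np.linalg.norm(x[t1] - x[t2], axis=1)
    return d_in, d_out

# A least-squares trend line over (d_in, d_out) should be near slope 1 and
# intercept 0; a larger slope suggests the "chaotic" zone, smaller the "attractor" zone.
```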

Figure 2. Reservoir neuronal budgeting results under a fixed network-wide neuron budget for the Mackey Glass task: NRMSE as a function of network depth and breadth, each panel with a linear trend line.

3.2. Lyapunov Exponent

The Lyapunov exponent (LE) offers a quantitative measure of the exponential growth (decay) rate of infinitesimal perturbations within a dynamical system (Lyapunov, 1992; Eckmann and Ruelle, 1985; Gallicchio et al., 2018c). An $N$-dimensional system has $N$ LEs which describe the evolution along each dimension of its state space. The maximum LE (MLE) is a good indication of a system's stability as it commands the rate of contraction or expansion in the system state space. If the MLE of a system is below 0, a system is said to exhibit "stable" dynamics, whereas an MLE above 0 describes a system with "chaotic" dynamics. An MLE of 0 is commonly referred to as the "edge of chaos." Local MLEs (Gallicchio et al., 2018c) are considered in this work as they are more useful in practical experiments and can be estimated by driving a network by a real input signal (e.g. a time series). The MLE, denoted $\lambda$, of a Mod-DeepESN instance can be computed for a given set of $N_s$ input sequences by (12)

$\lambda = \max_{k}\ \frac{1}{N_s}\sum_{s=1}^{N_s}\frac{1}{T_s}\sum_{t=1}^{T_s}\ln\Big(\big|\mathrm{eig}_k\big(\mathbf{D}_s(t)\,\hat{\mathbf{W}}\big)\big|\Big)$ (12)

where for the $s$th sequence the diagonal matrix $\mathbf{D}_s(t)$ is given by (13).

$\mathbf{D}_s(t) = \mathrm{diag}\big(1 - x_1(t)^2,\ \dots,\ 1 - x_{N_r}(t)^2\big)$ (13)
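A sketch of a local-MLE estimate for a single tanh reservoir driven by one input sequence is given below, following the reconstruction of (12) and (13); extending it to multiple sequences and multiple reservoirs is straightforward but omitted, and the function name is ours.

```python
import numpy as np

def local_mle(states, w_res):
    """Local MLE estimate per (12)-(13) for a tanh reservoir on one sequence.

    `states` is an (n_r, T) array of reservoir states x(t); the tanh Jacobian
    factor is D(t) = diag(1 - x(t)^2). Returns the largest time-averaged
    log eigen-modulus.
    """
    n_r, T = states.shape
    log_moduli = np.zeros(n_r)
    for t in range(T):
        d = 1.0 - states[:, t] ** 2                            # diagonal of D(t)
        moduli = np.sort(np.abs(np.linalg.eigvals(d[:, None] * w_res)))[::-1]
        log_moduli += np.log(moduli + 1e-12)                   # guard against log(0)
    return np.max(log_moduli / T)
```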

3.3. Task Performance Metrics

The more straightforward way of measuring reservoir goodness is by evaluating the performance of a network on a task. For scalar-valued time series forecasting tasks, we consider the following three metrics to quantify network error: root-mean-square error (RMSE) (14), normalized RMSE (NRMSE) (15), and mean absolute percentage error (MAPE) (16).

$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\big(\hat{y}(t) - y(t)\big)^2}$ (14)
$\mathrm{NRMSE} = \mathrm{RMSE} / \sigma_y$ (15)
$\mathrm{MAPE} = \frac{1}{T}\sum_{t=1}^{T}\left|\frac{\hat{y}(t) - y(t)}{y(t)}\right|$ (16)

The value $\hat{y}(t)$ is the predicted time series value at timestep $t$, $y(t)$ is the ground truth at timestep $t$, $\bar{y}$ is the average value of the ground truth (time series) over the $T$ timesteps, and $\sigma_y$ is its standard deviation about $\bar{y}$.
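For reference, the three error metrics can be written compactly as below; the normalization by the ground-truth standard deviation and the reporting of MAPE as a fraction follow the reconstructed equations above and may differ from the authors' exact conventions.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def nrmse(y_true, y_pred):
    # Normalized by the standard deviation of the ground truth, per (15).
    return rmse(y_true, y_pred) / np.std(y_true)

def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true))
```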

Proposed in (Bay and Downie, 2009) and adapted for polyphonic music tasks in (Boulanger-Lewandowski et al., 2012), frame-level accuracy (FL-ACC) rewards only true positives across every timestep of all samples as shown in (17).

$\mathrm{FL\text{-}ACC} = \frac{\sum_{s=1}^{N_s} \mathrm{TP}_s}{\sum_{s=1}^{N_s}\big(\mathrm{TP}_s + \mathrm{FP}_s + \mathrm{FN}_s\big)}$ (17)

The subscript of each of the {true, false} positives ({T,F}P) and false negatives (FN) denotes the corresponding quantity for the $s$th sequence. Note that FL-ACC is analogous to the intersection-over-union (IoU) metric, also known as the Jaccard index (Jaccard, 1912), which is commonly used to assess the performance of image segmentation models.
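A direct translation of (17) is sketched below for binary piano-roll arrays; the function name and input layout are ours.

```python
import numpy as np

def fl_acc(y_true_seqs, y_pred_seqs):
    """Frame-level accuracy (17): TP / (TP + FP + FN) pooled over all sequences.

    Each element is a binary (T, n_notes) piano-roll array for one sequence.
    """
    tp = fp = fn = 0
    for y_t, y_p in zip(y_true_seqs, y_pred_seqs):
        tp += np.sum((y_p == 1) & (y_t == 1))
        fp += np.sum((y_p == 1) & (y_t == 0))
        fn += np.sum((y_p == 0) & (y_t == 1))
    return tp / (tp + fp + fn)
```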

For forecasting tasks, NRMSE is considered as its value is not dependent on the scale of the model nor the data, thus allowing for a more accurate comparison with results reported in the literature. FL-ACC is utilized for the polyphonic forecasting task as the true negative (TN) rate is not a good indicator of performance, as well as for consistent comparison of results.

Figure 3. Reservoir neuronal budgeting results under a fixed network-wide neuron budget for the Melbourne, Australia, minimum temperature forecasting task: NRMSE as a function of network depth and breadth, each panel with a linear trend line.

4. Experiments

We analyze the Mod-DeepESN architecture on diverse time series forecasting tasks that exhibit multi-scale and nonlinear dynamics: the chaotic Mackey-Glass series, a daily minimum temperature series (Hyndman and Yang, 2018), and a set of polyphonic music series (Poliner and Ellis, 2006; Boulanger-Lewandowski et al., 2012).

4.1. Practical Details

For each task considered in this work, we run PSO to produce a candidate set of hyperparameters that offers the best performance on the appropriate validation set of data. Thereafter, these parameterizations of Mod-DeepESN are evaluated on the test set for the specific task. All numerical results are averaged over 10 runs. We run PSO for 100 iterations with 50 particles with fixed values of the inertia weight $\omega$ and the acceleration coefficients $c_1$ and $c_2$.

During training, the regularization term $\beta$ is swept over a set of candidate values, exploiting the fact that the state matrix $\mathbf{X}$ need only be computed once. Ridge regression is carried out using SVD according to (5) for increased numerical stability. Only the score produced by the best-performing $\beta$ is considered during the PSO update.

We only consider dense grid topologies in these experiments, which are referred to as Wide and Layered in (Carmichael et al., 2018). This allows all networks to be described in terms of breadth and depth, reducing the complexity of neuronal partitioning and other analyses. Additionally, the leaky rate is kept constant across all the reservoirs, i.e. $a^{(\ell)} = a$ for all $\ell$.

4.1.1. Model Implementation

The Mod-DeepESN architecture and its utilities are implemented in Python using several open-source libraries. TensorFlow (Abadi et al., 2015) and Keras (Chollet et al., 2015) are used for matrix/tensor operations, managing network graphs, and unrolling RNNs. The PSO implementation extends the base optimizers provided by the PySwarms (Miranda, 2018) library, and the pandas (McKinney, 2010), NumPy (Oliphant, 2006), and SciPy (Jones et al., 2001) libraries are employed throughout the codebase.

4.2. Neuronal Partitioning

An interesting question in reservoir models is how to determine the optimal size and connectivity of, and within, reservoirs. Is a large reservoir as effective as a multitude of small reservoirs, and should these small reservoirs extend deeper or wider? To address this matter, we propose neuronal partitioning to explore a space of grid topologies. Given a budget of $N_R$ neurons network-wide, we evaluate the task performance for each depth $N_D$ and breadth $N_B$ whose product evenly divides the budget, partitioning the $N_R$ neurons equally among the $N_D \times N_B$ reservoirs. This experiment is performed for the Mackey Glass and minimum temperature forecasting tasks. The prohibitive size of the polyphonic music data prevents us from running such an experiment. The values of $N_D$ and $N_B$ selected for each experiment are integer factors of the budget.
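How the candidate (depth, breadth) pairs are enumerated is only partially recoverable from the text; the helper below simply lists every dense grid that splits a neuron budget evenly, with the `max_dim` cap being our assumption.

```python
def grid_partitions(budget, max_dim=8):
    """List (depth, breadth, neurons per reservoir) dense-grid splits of a budget."""
    configs = []
    for depth in range(1, max_dim + 1):
        for breadth in range(1, max_dim + 1):
            n_reservoirs = depth * breadth
            if budget % n_reservoirs == 0:
                configs.append((depth, breadth, budget // n_reservoirs))
    return configs

print(grid_partitions(640)[:4])  # [(1, 1, 640), (1, 2, 320), (1, 4, 160), (1, 5, 128)]
```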

Figure 4. The Mackey Glass chaotic time series (Mackey and Glass, 1977) computed over 10,000 timesteps with time resolution $\Delta t = 0.1$.
Figure 5. Separation ratio plots for various Mod-DeepESN instances for the Melbourne forecasting task. (a) Separation ratio plot for the best-performing model. (b) Separation ratio plot for the second-best-performing model. (c) Separation ratio plot for the worst-performing model.

4.3. Mackey Glass

Mackey Glass (Mackey and Glass, 1977) is a classical chaotic time series benchmark for evaluating the forecasting capacity of dynamical systems (Jaeger, 2001; Yusoff et al., 2016; Aceituno et al., 2017; Ma et al., 2017). The series is generated from a nonlinear time delay differential equation using the fourth-order Runge-Kutta method (RK4) and is given by (18).

$\frac{dx(t)}{dt} = \frac{\alpha\,x(t-\tau)}{1 + x(t-\tau)^{n}} - \gamma\,x(t)$ (18)

During generation, we set $\tau = 17$, $\alpha = 0.2$, $\gamma = 0.1$, and $n = 10$ with a time resolution ($\Delta t$) of 0.1 to compare with methods evaluated in (Ma et al., 2017). The 10,000 samples are split into 6,400 training samples, 1,600 validation samples, and 2,000 testing samples for 84-timestep-ahead forecasting, i.e. given $x(t)$, predict $x(t+84)$. To reduce the influence of transients, the first 100 timesteps of the training set are used as a washout period.
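A typical generator for this benchmark is sketched below; the constant initial history, the initial value, and the handling of the delayed term within each RK4 step are common simplifications rather than the authors' exact generation code.

```python
import numpy as np

def mackey_glass(n_steps, tau=17.0, alpha=0.2, gamma=0.1, n=10, dt=0.1, x0=1.2):
    """Generate a Mackey-Glass series (18) with a fourth-order Runge-Kutta step.

    The delayed term x(t - tau) is read from the stored history and held fixed
    within each RK4 step.
    """
    delay = int(round(tau / dt))
    x = np.empty(n_steps + delay + 1)
    x[: delay + 1] = x0                       # constant initial history
    f = lambda xt, xd: alpha * xd / (1.0 + xd ** n) - gamma * xt
    for i in range(delay, delay + n_steps):
        xd = x[i - delay]
        k1 = f(x[i], xd)
        k2 = f(x[i] + 0.5 * dt * k1, xd)
        k3 = f(x[i] + 0.5 * dt * k2, xd)
        k4 = f(x[i] + dt * k3, xd)
        x[i + 1] = x[i] + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return x[delay + 1 :]

series = mackey_glass(10_000)
```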

Table 1 contains the best forecasting results for Mod-DeepESN as well as those reported in (Ma et al., 2017). The Mod-DeepESN framework falls slightly short of the performance of Deep-ESN but outperforms the other baselines in terms of (N)RMSE. MAPE exhibits several biases, such as punishing negative errors more than positive, which may be the reason for this discrepancy.

Network | Reservoirs | RMSE | NRMSE | MAPE
Vanilla ESN (Jaeger et al., 2007) | 1 | 43.7 | 201 | 7.03
φ-ESN (Gallicchio and Micheli, 2011) | 2 | 8.60 | 39.6 | 1.00
R²SP (Butcher et al., 2013) | 2 | 27.2 | 125 | 1.00
MESM (Malik et al., 2016) | 7 | 12.7 | 58.6 | 1.91
Deep-ESN (Ma et al., 2017) | 2 | 1.12 | 5.17 | .151
Mod-DeepESN | 3 | 7.22 | 27.5 | 5.55
Table 1. Mackey-Glass Time Series 84-Step Ahead Prediction Results. All errors are reported in thousandths.

4.3.1. Neuronal Partitioning

Neuronal partitioning is run for the Mackey Glass task with results reported in Figure 2. It is apparent that smaller reservoirs in larger numbers yield the lowest NRMSE; an agglomeration of small reservoirs outperforms a single large reservoir for this task. While the margin is small, broader topologies outperform deeper ones for this task.

Network | Reservoirs | RMSE | NRMSE | MAPE
ESN (Jaeger et al., 2007) | 1 | 501 | 139 | 39.5
φ-ESN (Gallicchio and Micheli, 2011) | 2 | 493 | 141 | 39.6
R²SP (Butcher et al., 2013) | 2 | 495 | 137 | 39.3
MESM (Malik et al., 2016) | 7 | 478 | 136 | 37.7
Deep-ESN (Ma et al., 2017) | 2 | 473 | 135 | 37.0
Mod-DeepESN | 4 | 459 | 132 | 37.1
Table 2. Daily Minimum Temperature Series 1-Step Ahead Prediction Results. All errors are reported in thousandths.
Figure 6. The Melbourne, Australia, daily minimum temperature time series (Hyndman and Yang, 2018).
Figure 7. 2D and 3D heatmaps of NRMSE and the MLE $\lambda$: (a, b) as a function of two swept hyperparameters; (c, d) as a function of reservoir breadth, depth, and a swept hyperparameter. Note that the color bar gradient is reversed for visualizing $\lambda$.

4.4. Melbourne, Australia, Daily Minimum Temperature Forecasting

The Melbourne, Australia, daily minimum temperature series (Hyndman and Yang, 2018) is recorded over 1981–1990 and shown in Figure 6. In this task, the goal is to predict the minimum temperature of the immediately following day in Melbourne, i.e. given $x(t)$, predict $x(t+1)$. The data is smoothed with a 5-step moving window average and split into 2,336 training samples, 584 validation samples, and 730 testing samples to compare with methods evaluated in (Ma et al., 2017). A washout period of 30 timesteps (days) is used to rid transients.

Table 2 contains the best forecasting results for Mod-DeepESN as well as those reported in (Ma et al., 2017). The Mod-DeepESN framework outperforms all baselines in terms of (N)RMSE. This result is more interesting than that of Mackey Glass as the time series comprises real data as opposed to synthetic.

Figure 8. Two samples from the Piano-midi.de dataset with insets highlighting activity. Terminology: Opus: number of a musical work indicating chronological order of production; D.: Deutsch Thematic Catalogue number (of a Schubert work).

4.4.1. Neuronal Partitioning

Neuronal partitioning is run for the Melbourne daily minimum temperature task with results reported in Figure 3. The trends across depth and breadth are less apparent than those observed with the Mackey Glass forecasting task. The gradient of error with respect to depth is consistent between the tasks, but the same does not hold for broader networks; in fact, the inverse is observed, which suggests that hierarchical features and a larger memory capacity are required to improve performance. Elongated memory capacity has been shown to emerge with the depth of an ESN (Gallicchio et al., 2018c), which supports this observation.

4.5. Polyphonic Music Tasks

We evaluate the Mod-DeepESN on a set of polyphonic music tasks as defined in (Boulanger-Lewandowski et al., 2012). In particular, we use the data provided at http://www-etud.iro.umontreal.ca/~boulanni/icml2012 for the Piano-midi.de task (classical MIDI piano music, http://piano-midi.de/). The data comprises a set of piano roll sequences preprocessed as described in (Poliner and Ellis, 2006). 87 sequences with an average of 872.5 timesteps are used for training, 12 sequences with an average of 711.7 timesteps are used for validation, and 25 sequences with an average of 761.4 timesteps are used for testing. The goal of this task is to predict the notes played at the next timestep given those played at the current timestep, i.e. given the binary note vector $\mathbf{y}(t)$, predict $\mathbf{y}(t+1)$.

Multiple notes may be played at once, so an argmax cannot be used at the output of the readout layer; rather, the output of each neuron is binarized with a threshold. In practice, we find this threshold using training data by sweeping 20 values uniformly distributed between the minimum and maximum of the predicted values. This threshold can also be found by training a linear classifier on the predicted outputs, using an adaptive threshold, or using Otsu's method (Otsu, 1979). Lastly, an optimal threshold may exist on a per-neuron basis. A washout period of 20 steps is used to rid transients.
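A sketch of the training-data threshold sweep is given below; using frame-level accuracy as the selection criterion and the single global threshold are assumptions on our part.

```python
import numpy as np

def best_threshold(y_true, y_pred, n_thresholds=20):
    """Pick a global binarization threshold by sweeping the prediction range.

    `y_pred` holds real-valued readout outputs and `y_true` binary piano rolls;
    the threshold maximizing frame-level accuracy on the training data is kept.
    """
    candidates = np.linspace(y_pred.min(), y_pred.max(), n_thresholds)
    def score(th):
        y_bin = (y_pred >= th).astype(int)
        tp = np.sum((y_bin == 1) & (y_true == 1))
        fp = np.sum((y_bin == 1) & (y_true == 0))
        fn = np.sum((y_bin == 0) & (y_true == 1))
        return tp / max(tp + fp + fn, 1)
    return max(candidates, key=score)
```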

Network | FL-ACC
DeepESN (Gallicchio et al., 2018a) | 33.22%
shallowESN (Gallicchio et al., 2018a) | 31.76%
RNN-RBM (Boulanger-Lewandowski et al., 2012) | 28.92%
Mod-DeepESN | 33.44%
Table 3. Piano-midi.de Time Series 1-Step Ahead Prediction Results.
Figure 9. 2D and 3D heatmaps of NRMSE and the MLE $\lambda$: (a, b) as a function of two swept hyperparameters; (c, d) as a function of reservoir breadth, depth, and a swept hyperparameter. Note that the color bar gradient is reversed for visualizing $\lambda$, and colors are mapped with power-law normalization. NRMSE values span far greater than 0.21 and thus are clipped in (c).

Table 3 contains the best forecasting results for Mod-DeepESN as well as those reported in (Boulanger-Lewandowski et al., 2012; Gallicchio et al., 2018a). The Mod-DeepESN framework outperforms both RNN-RBM (Boulanger-Lewandowski et al., 2012) and DeepESN (Gallicchio et al., 2018a) on the Piano-midi.de corpus with fewer trained parameters and reservoirs.

5. Discussion

It is evident that the optimal Mod-DeepESN topologies found via neuronal partitioning are task-specific; no one configuration seems optimal for multiple tasks. Between the Mackey Glass and Melbourne forecasting tasks, wider networks exhibit a smaller confidence interval, which indicates consistency in performance. There is less of an apparent trend on the real-world Melbourne task (more so for deep networks), although this is within expectations due to noise in the data.

We create separation ratio plots of reservoir responses to time series input, with various examples shown in Figure 5. The best-performing models achieve similar error values but exhibit considerably different dynamics at different scales of magnitude. The worst-performing model yields a separation ratio similar to that of the second-best; the biases differ, however, and this can be attributed to the difference in magnitude (as a result of e.g. input scaling). Looking at (11), it can be observed that the identity function yields an ideal response at the empirical "edge of chaos"; this sheds light on some shortcomings of the metric. The technique can be made more robust by considering input-to-output similarity, matching the variance of inputs and reservoir responses (to avoid skewing the slope), and tracking the consistency of separation ratios over time (as reservoirs are stateful). We recommend these plots as a debugging method for ESNs as they unveil useful attributes beyond input and output separation.

Of the three tasks considered, the Melbourne daily minimum temperature forecasting task is selected for exploring the design space. The data is non-synthetic and its size is not prohibitive of such exploration. Here, we produce heatmaps of NRMSE and the MLE $\lambda$ as a function of several swept hyperparameters. In each, a practical range of breadth and depth values is considered. Figure 7 delineates the impact of the swept hyperparameters and shows that $\lambda$ is a reliable indication of performance for the task, as measured by the Pearson correlation coefficient (Pearson and Galton, 1895) between NRMSE and $\lambda$ (not to be confused with $\rho$, the spectral radius). There is no significant impact induced by modulating breadth or depth, which agrees with the neuronal partitioning result (see Figure 3). Figure 9 illustrates the impact of the spectral radius $\rho$, which demonstrably has a more substantial impact on network stability than the other swept hyperparameters, as expected. The network error plateaus to a minimum near a critical value of $\rho$ and increases dramatically afterward. This critical point is beyond the "edge of chaos" ($\lambda = 0$) and error is asymmetrical about it. Here, $\lambda$ is a poor predictor of error (even the correlation between $\rho$ and NRMSE is higher). Again, depth and breadth are not indicative of error on their own; however, both deeper and wider networks suffer larger errors beyond the critical value of $\rho$.

An interesting observation is that, while the tasks differ, well-performing networks in this work often exhibit a positive $\lambda$, whereas the networks in (Gallicchio et al., 2018c) primarily exhibit a negative $\lambda$. This characterization of Mod-DeepESN as a system with "unstable" dynamics requires further investigation but indicates that such dynamics do not preclude consistent performance.

6. Conclusion

We provide analytical rationale for the characteristics of deep ESN design that influence forecasting performance. Within the malleable Mod-DeepESN architecture, we experimentally support the claim that networks can perform optimally beyond the "edge of chaos." Given constraints on model size or compute resources, we explore the effects of neuron allocation and reservoir placement on performance. We also demonstrate that network breadth plays a role in dictating the certainty of performance between instances. These characteristics may be present within neuronal populations, which could confirm that powerful models emerge from an agglomeration of weak learners. Redundancy through parallel pathways, extraction of nonlinear data regularities with depth, and discernibility of latent representations all appear to have a significant impact on Mod-DeepESN performance. Future studies should explore the design space of reservoirs in tandem with neuromorphic hardware design constraints.

Acknowledgements.

We would like to thank the members of the Neuromorphic Artificial Intelligence Lab for helpful discussions during this work. We also thank the creators and maintainers of the matplotlib (Hunter, 2007) and seaborn (Waskom et al., 2018) libraries, which we used for plotting. We acknowledge the Air Force Research Lab for funding part of this work under agreement number FA8750-16-1-0108. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory or the U.S. Government.
Figure 10. 2D and 3D heatmaps of NRMSE and the MLE $\lambda$: (a, b) as a function of two swept hyperparameters; (c, d) as a function of reservoir breadth, depth, and a swept hyperparameter. Note that the color bar gradient is reversed for visualizing $\lambda$.

References

  • Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. CoRR abs/1603.04467 (2015). arXiv:1603.04467 http://arxiv.org/abs/1603.04467 - software available from tensorflow.org.
  • Aceituno et al. (2017) Pau Vilimelis Aceituno, Yan Gang, and Yang-Yu Liu. 2017. Tailoring Artificial Neural Networks for Optimal Learning. CoRR abs/1707.02469 (2017), 1–22. arXiv:cs/1707.02469 http://arxiv.org/abs/1707.02469
  • Bay and Downie (2009) Mert Bay, Andreas F. Ehmann, and J. Stephen Downie. 2009. Evaluation of Multiple-F0 Estimation and Tracking Systems. In Proceedings of the 10th International Society for Music Information Retrieval Conference, ISMIR, Keiji Hirata, George Tzanetakis, and Kazuyoshi Yoshii (Eds.). International Society for Music Information Retrieval, Kobe International Conference Center, Kobe, Japan, 315–320. http://ismir2009.ismir.net/proceedings/PS2-21.pdf
  • Boulanger-Lewandowski et al. (2012) Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. 2012. Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In Proceedings of the 29th International Conference on Machine Learning, ICML (ICML’12). Omnipress, Edinburgh, Scotland, UK, 1881–1888. http://icml.cc/2012/papers/590.pdf
  • Butcher et al. (2013) John B. Butcher, David Verstraeten, Benjamin Schrauwen, Charles R. Day, and Peter W. Haycock. 2013. Reservoir computing and extreme learning machines for non-linear time-series data analysis. Neural Networks 38 (2013), 76–89. https://doi.org/10.1016/j.neunet.2012.11.011
  • Carmichael et al. (2018) Zachariah Carmichael, Humza Syed, Stuart Burtner, and Dhireesha Kudithipudi. 2018. Mod-DeepESN: Modular Deep Echo State Network. Conference on Cognitive Computational Neuroscience abs/1808.00523 (Sept. 2018), 1–4. arXiv:cs/1808.00523 http://arxiv.org/abs/1808.00523 or https://ccneuro.org/2018/proceedings/1239.pdf.
  • Chollet et al. (2015) François Chollet et al. 2015. Keras. https://github.com/fchollet/keras.
  • Dean et al. (2002) Paul Dean, John Porrill, and James V. Stone. 2002. Decorrelation control by the cerebellum achieves oculomotor plant compensation in simulated vestibulo-ocular reflex. The Royal Society 269, 1503 (2002), 1895–1904. https://doi.org/10.1098/rspb.2002.2103
  • Eckmann and Ruelle (1985) Jean-Pierre Eckmann and David Ruelle. 1985. Ergodic theory of chaos and strange attractors. Reviews of Modern Physics 57 (July 1985), 617–656. Issue 3. https://doi.org/10.1103/RevModPhys.57.617
  • Gallicchio and Micheli (2011) Claudio Gallicchio and Alessio Micheli. 2011. Architectural and Markovian factors of echo state networks. Neural Networks 24, 5 (2011), 440–456. https://doi.org/10.1016/j.neunet.2011.02.002
  • Gallicchio and Micheli (2017) Claudio Gallicchio and Alessio Micheli. 2017. Echo State Property of Deep Reservoir Computing Networks. Cognitive Computation 9, 3 (2017), 337–350. https://doi.org/10.1007/s12559-017-9461-9
  • Gallicchio et al. (2017) Claudio Gallicchio, Alessio Micheli, and Luca Pedrelli. 2017. Deep reservoir computing: A critical experimental analysis. Neurocomputing 268 (2017), 87–99. https://doi.org/10.1016/j.neucom.2016.12.089
  • Gallicchio et al. (2018a) Claudio Gallicchio, Alessio Micheli, and Luca Pedrelli. 2018a. Deep Echo State Networks for Diagnosis of Parkinson’s Disease. In Proceedings of the 26th European Symposium on Artificial Neural Networks, ESANN. i6doc.com, Bruges, Belgium, 397–402. http://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2018-163.pdf
  • Gallicchio et al. (2018b) Claudio Gallicchio, Alessio Micheli, and Luca Pedrelli. 2018b. Design of deep echo state networks. Neural Networks 108 (2018), 33–47. https://doi.org/10.1016/j.neunet.2018.08.002
  • Gallicchio et al. (2018c) Claudio Gallicchio, Alessio Micheli, and Luca Silvestri. 2018c. Local Lyapunov Exponents of Deep Echo State Networks. Neurocomputing 298 (2018), 34–45. https://doi.org/10.1016/j.neucom.2017.11.073
  • Gibbons (2010) Thomas E. Gibbons. 2010. Unifying quality metrics for reservoir networks. In Proceedings of the International Joint Conference on Neural Networks, IJCNN. IEEE, Barcelona, Spain, 1–7. https://doi.org/10.1109/IJCNN.2010.5596307
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, AISTATS (JMLR Proceedings), Yee Whye Teh and D. Mike Titterington (Eds.), Vol. 9. JMLR.org, Chia Laguna Resort, Sardinia, Italy, 249–256. http://jmlr.org/proceedings/papers/v9/glorot10a.html
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
  • Hunter (2007) John D. Hunter. 2007. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering 9, 3 (2007), 90–95. https://doi.org/10.1109/MCSE.2007.55
  • Hyndman and Yang (2018) Rob J. Hyndman and Yangzhuoran Yang. 2018. Daily minimum temperatures in Melbourne, Australia (1981–1990). https://pkg.yangzhuoranyang.com/tsdl/
  • Jaccard (1912) Paul Jaccard. 1912. The Distribution of the Flora in the Alpine Zone. The New Phytologist 11, 2 (Feb. 1912), 37–50. https://doi.org/10.1111/j.1469-8137.1912.tb05611.x
  • Jaeger (2001) Herbert Jaeger. 2001. The “Echo State” Approach to Analysing and Training Recurrent Neural Networks-with an Erratum Note. Technical Report 148. Fraunhofer Institute for Autonomous Intelligent Systems, GMD-German National Research Institute for Information Technology. http://www.faculty.jacobs-university.de/hjaeger/pubs/EchoStatesTechRep.pdf
  • Jaeger (2002) Herbert Jaeger. 2002. Short term memory in echo state networks. Technical Report 152. Fraunhofer Institute for Autonomous Intelligent Systems, GMD-German National Research Institute for Information Technology. http://www.faculty.jacobs-university.de/hjaeger/pubs/STMEchoStatesTechRep.pdf
  • Jaeger et al. (2007) Herbert Jaeger, Mantas Lukoševičius, Dan Popovici, and Udo Siewert. 2007. Optimization and applications of echo state networks with leaky-integrator neurons. Neural networks 20, 3 (2007), 335–352. https://doi.org/10.1016/j.neunet.2007.04.016
  • Jones et al. (2001) Eric Jones, Travis Oliphant, Pearu Peterson, et al. 2001. SciPy: Open Source Scientific Tools for Python. http://www.scipy.org/
  • Kennedy and Eberhart (1995) James Kennedy and Russell C. Eberhart. 1995. Particle swarm optimization. In Proceedings of the International Conference on Neural Networks, ICNN’95 (1995), Vol. 4. IEEE, Perth, WA, Australia, 1942–1948. https://doi.org/10.1109/ICNN.1995.488968
  • Lyapunov (1992) Aleksandr M. Lyapunov. 1992. The general problem of the stability of motion. Internat. J. Control 55, 3 (1992), 531–534. https://doi.org/10.1080/00207179208934253
  • Lymburn et al. (2019) Thomas Lymburn, Alexander Khor, Thomas Stemler, et al. 2019. Consistency in Echo-State Networks. Chaos: An Interdisciplinary Journal of Nonlinear Science 29, 2 (2019), 23118. https://doi.org/10.1063/1.5079686
  • Ma et al. (2017) Qianli Ma, Lifeng Shen, and Garrison W. Cottrell. 2017. Deep-ESN: A Multiple Projection-encoding Hierarchical Reservoir Computing Framework. CoRR abs/1711.05255 (2017), 15. arXiv:cs/1711.05255 http://arxiv.org/abs/1711.05255
  • Maass et al. (2002) Wolfgang Maass, Thomas Natschläger, and Henry Markram. 2002. Real-Time Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations. Neural Computation 14, 11 (2002), 2531–2560. https://doi.org/10.1162/089976602760407955
  • Mackey and Glass (1977) Michael C. Mackey and Leon Glass. 1977. Oscillation and chaos in physiological control systems. Science 197, 4300 (1977), 287–289. https://doi.org/10.1126/science.267326
  • Malik et al. (2016) Zeeshan K. Malik, Amir Hussain, and Qingming J. Wu. 2016. Multilayered Echo State Machine: A Novel Architecture and Algorithm. IEEE Transactions on Cybernetics 47, 4 (June 2016), 946–959. https://doi.org/10.1109/TCYB.2016.2533545
  • McKinney (2010) Wes McKinney. 2010. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference (2010), Stéfan van der Walt and Jarrod Millman (Eds.). SciPy, Austin, TX, 51–56.
  • Miranda (2018) Lester James V. Miranda. 2018. PySwarms, a Research-Toolkit for Particle Swarm Optimization in Python. Journal of Open Source Software 3, 21 (2018), 433. https://doi.org/10.21105/joss.00433
  • Oliphant (2006) Travis E. Oliphant. 2006. A Guide to NumPy. Vol. 1. Trelgol Publishing, USA.
  • Otsu (1979) Nobuyuki Otsu. 1979. A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics 9, 1 (Jan. 1979), 62–66. https://doi.org/10.1109/TSMC.1979.4310076
  • Ozturk et al. (2007) Mustafa C. Ozturk, Dongming Xu, and José Carlos Príncipe. 2007. Analysis and Design of Echo State Networks. Neural Computation 19, 1 (2007), 111–138. https://doi.org/10.1162/neco.2007.19.1.111
  • Pearson and Galton (1895) Karl Pearson and Francis Galton. 1895. VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London 58, 347-352 (1895), 240–242. https://doi.org/10.1098/rspl.1895.0041
  • Poliner and Ellis (2006) Graham E. Poliner and Daniel P. W. Ellis. 2006. A Discriminative Model for Polyphonic Piano Transcription. EURASIP Journal on Advances in Signal Processing 2007, 1 (2006), 48317. https://doi.org/10.1155/2007/48317
  • Schrauwen et al. (2008) Benjamin Schrauwen, Marion Wardermann, David Verstraeten, Jochen J. Steil, and Dirk Stroobandt. 2008. Improving reservoirs using intrinsic plasticity. Neurocomputing 71, 7-9 (2008), 1159–1171. https://doi.org/10.1016/j.neucom.2007.12.020
  • Shepherd (1990) Gordon M. Shepherd. 1990. The Synaptic Organization of the Brain (3rd ed.). Oxford University Press, New York, NY, USA.
  • Soures (2017) Nicholas M. Soures. 2017. Deep Liquid State Machines with Neural Plasticity and On-Device Learning. Master’s thesis. Rochester Institute of Technology.
  • Soures et al. (2017b) Nicholas M. Soures, Lydia Hays, Eric Bohannon, Abdullah M. Zyarah, and Dhireesha Kudithipudi. 2017b. On-Device STDP and Synaptic Normalization for Neuromemristive Spiking Neural Network. In Proceedings of the 60th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, Boston, MA, USA, 1081–1084. https://doi.org/10.1109/MWSCAS.2017.8053115
  • Soures et al. (2017a) Nicholas M. Soures, Lydia Hays, and Dhireesha Kudithipudi. 2017a. Robustness of a memristor based liquid state machine. In Proceedings of the International Joint Conference on Neural Networks, IJCNN. IEEE, Anchorage, AK, USA, 2414–2420. https://doi.org/10.1109/IJCNN.2017.7966149
  • Waskom et al. (2018) Michael Waskom, Olga Botvinnik, Drew O’Kane, et al. 2018. Mwaskom/Seaborn: V0.9.0 (July 2018). zenodo V0.9.0 (2018), 1. https://doi.org/10.5281/zenodo.1313201
  • Yamazaki and Tanaka (2007) Tadashi Yamazaki and Shigeru Tanaka. 2007. The cerebellum as a liquid state machine. Neural Networks 20, 3 (2007), 290–297. https://doi.org/10.1016/j.neunet.2007.04.004
  • Yusoff et al. (2016) Mohd-Hanif Yusoff, Joseph Chrol-Cannon, and Yaochu Jin. 2016. Modeling neural plasticity in echo state networks for classification and regression. Information Sciences 364-365 (2016), 184–196. https://doi.org/10.1016/j.ins.2015.11.017

Appendix A Additional Results

We additionally construct heatmaps of the impact of topology and a further swept hyperparameter on NRMSE and $\lambda$, shown in Figure 10. This result shows that $\lambda$ somewhat correlates with NRMSE, which only moderately supports the "edge of chaos" hypothesis. However, there is a clear trend between NRMSE and both the swept hyperparameter and depth. The error extends outward radially from the bottom-right corner of Figure 9(a), growing as one swept hyperparameter decreases and the other increases. More significantly, depth correlates with NRMSE, with deeper networks giving lower errors. Wider networks also always yield lower errors in this experiment.