1. Introduction
Recurrent neural networks (RNNs) have recently shown advances on a variety of spatiotemporal tasks due to their innate feedback connections. One of the most commonly used RNNs today is the long short-term memory (LSTM) network
(Hochreiter and Schmidhuber, 1997). LSTMs are widely employed for solving spatiotemporal tasks and demonstrate high accuracy. However, these networks are generally prone to expensive computations that often lead to long training times. A computationally lightweight approach to spatiotemporal processing is reservoir computing (RC), a biologically-inspired framework for neural networks. RC networks comprise hidden layers, referred to as reservoirs, that consist of pools of neurons with fixed random weights and sparse random connectivity. Echo state networks (ESNs) (Jaeger, 2001) and liquid state machines (LSMs) (Maass et al., 2002) are the two major types of RC. Both architectures make use of sparse random connectivity between neurons to mimic an intrinsic form of memory and enable rapid training, as training occurs only within the readout layer. Whereas ESNs are rate-based models, LSMs are spike-based. The focus of this work is primarily on ESNs.
In general, ESNs have been shown to perform well on small spatiotemporal tasks but underperform as task complexity increases. Prior literature has shown that ESNs are capable of various functions, such as speech processing, EEG classification, and anomaly detection
(Soures et al., 2017a; Jaeger et al., 2007). In recent literature, several groups have begun to study how these networks can cope with increasingly complex time series tasks with dynamics across multiple scales and domains (Ma et al., 2017; Malik et al., 2016; Butcher et al., 2013; Gallicchio et al., 2018b). One technique to enhance ESNs is the addition of reservoir layers. These networks are referred to as deep ESNs, which provide a hierarchical framework for feature extraction and for capturing nontrivial dynamics while maintaining the lightweight training of a conventional ESN.
Ma et al. introduced the DeepESN architecture, which utilizes a sequence of multiple reservoir layers and unsupervised encoders to extract intricacies of temporal data (Ma et al., 2017). Gallicchio et al. proposed an architecture, also named DeepESN, that utilizes Jaeger et al.'s leaky-integrator neurons in a deeper ESN architecture (Gallicchio and Micheli, 2017; Jaeger et al., 2007). In our previous work, we introduced the ModDeepESN, a modular architecture that allows for varying topologies of deep ESNs (Carmichael et al., 2018). Intrinsic plasticity (IP) primes neurons to contribute more equally towards predictions and improves the network's performance. A network with a wide and layered topology was able to achieve a lower root-mean-squared error (RMSE) than other ESN models on a daily minimum temperature series (Hyndman and Yang, 2018) and the Mackey-Glass time series. In this paper, we propose alternative design techniques to enhance the ModDeepESN architecture, and carry out the comprehensive analysis required to understand why these networks perform well.
2. Architecture
ESNs, a framework under the RC paradigm, are RNNs which comprise one or more reservoirs with a population of rate-based neurons. Reservoirs maintain a set of random untrained weights and exhibit a nonlinear response when the network is driven by an input signal. The state of each reservoir is recorded over the duration of an input sequence, and a set of output weights is trained on the states based on a teacher signal. As the output computes a simple linear transformation, there is no need for expensive backpropagation of error throughout the network as is required for training RNNs in the deep learning paradigm. Thus, the vanishing gradient problem is avoided while complex dynamics can still be captured from the input data. Vanilla ESNs comprise a single reservoir and have limited application, especially with data exhibiting multiscale and highly nonlinear dynamics. To this end, various architectures have been proposed with multiple reservoirs, additional projections, autoencoders, plasticity mechanisms, etc.
(Butcher et al., 2013; Ma et al., 2017; Malik et al., 2016; Gallicchio et al., 2017, 2018b; Carmichael et al., 2018). Building on (Gallicchio et al., 2017; Carmichael et al., 2018), we introduce the flexible ModDeepESN architecture, which maintains parameterized connectivity between reservoirs and the input. A broad set of topologies is accommodated by its modularity, several of which are shown in Figure 1
. We denote the tensor of input data as $\mathcal{U} \in \mathbb{R}^{S \times T \times N_u}$, which comprises $S$ sequences of $T$ timesteps and $N_u$ features. $T$ may differ between each of the $S$ sequences, but for simplicity such variability is left out of the formulation. Reservoirs that are connected to the input receive the vector $\mathbf{u}(t) \in \mathbb{R}^{N_u}$ at timestep $t$, where $1 \leq t \leq T$. $\mathbf{u}(t)$ is mapped by the input weight matrix $\mathbf{W}_{\mathrm{in}}^{(i)} \in \mathbb{R}^{N_r \times N_u}$ into each connected reservoir $i$. $N_r$ is the number of neurons per reservoir and typically $N_r \gg N_u$. The binary matrix $\mathbf{V}$ determines the feedforward connections between reservoirs and the input $\mathbf{u}(t)$. For example, if element $\mathbf{V}_{1,2}$ is '1', then $\mathbf{u}(t)$ and reservoir 2 are connected. The number of nonzero elements in the first row of $\mathbf{V}$ gives the number of reservoirs that are connected to $\mathbf{u}(t)$. The output of the $i$th reservoir, $\mathbf{x}^{(i)}(t)$, is computed as (1)

$$\tilde{\mathbf{x}}^{(i)}(t) = \tanh\!\left( \mathbf{i}^{(i)}(t) + \hat{\mathbf{W}}^{(i)} \mathbf{x}^{(i)}(t-1) \right) \quad (1a)$$

$$\mathbf{x}^{(i)}(t) = \left(1 - a^{(i)}\right) \mathbf{x}^{(i)}(t-1) + a^{(i)}\, \tilde{\mathbf{x}}^{(i)}(t) \quad (1b)$$

where the net input $\mathbf{i}^{(i)}(t)$ is given by (2).

$$\mathbf{i}^{(i)}(t) = \mathbf{V}_{1,i}\, \mathbf{W}_{\mathrm{in}}^{(i)} \mathbf{u}(t) + \sum_{j \neq i} \mathbf{V}_{j+1,i}\, \mathbf{W}^{(j,i)} \mathbf{x}^{(j)}(t) \quad (2)$$

$\mathbf{W}^{(j,i)}$ is a feedforward weight matrix that connects two reservoirs, while $\hat{\mathbf{W}}^{(i)}$ is a recurrent weight matrix that connects intra-reservoir neurons. The per-layer leaky parameter $a^{(i)} \in (0, 1]$ controls the leakage rate in a moving exponential average manner. Note that the bias vectors are left out of the formulation for simplicity. The state of a ModDeepESN network is defined as the concatenation of the output of all reservoirs, i.e. $\mathbf{x}(t) = \left[ \mathbf{x}^{(1)}(t); \dots; \mathbf{x}^{(N_{\mathrm{res}})}(t) \right]$. The matrix of all states is denoted as $\mathbf{X}$. Finally, the output of the network for the duration $T$ is computed as a linear combination of the state matrix using (3).

$$\mathbf{Y} = \mathbf{W}_{\mathrm{out}} \mathbf{X} \quad (3)$$
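To make the formulation concrete, the following sketch implements a single leaky reservoir per (1a)-(1b) driven by a toy sine series, with a ridge-trained linear readout per (3)-(4). All names, sizes, and scalings here are illustrative, not the implementation used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
N_u, N_r, T = 1, 100, 500            # input features, reservoir size, timesteps

W_in = rng.uniform(-0.1, 0.1, (N_r, N_u))    # fixed random input weights
W_hat = rng.uniform(-1.0, 1.0, (N_r, N_r))   # fixed random recurrent weights
W_hat *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_hat)))  # spectral radius < 1
a = 0.5                                       # leaky rate

u = np.sin(np.linspace(0, 16 * np.pi, T + 1))  # toy scalar input series
x = np.zeros(N_r)
X = np.empty((N_r, T))
for t in range(T):
    x_tilde = np.tanh(W_in @ u[[t]] + W_hat @ x)  # (1a)
    x = (1 - a) * x + a * x_tilde                 # (1b) leaky integration
    X[:, t] = x

# Train the readout to predict the next value via ridge regression, cf. (3)-(4)
Y = u[1:T + 1].reshape(1, T)
beta = 1e-8
W_out = Y @ X.T @ np.linalg.inv(X @ X.T + beta * np.eye(N_r))
Y_hat = W_out @ X
```

Note that only `W_out` is trained; the reservoir weights stay fixed after initialization.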
The matrix $\mathbf{W}_{\mathrm{out}}$ contains the feedforward weights between reservoir neurons and the output neurons, and $\mathbf{Y}_{\mathrm{gt}}$ is the ground truth with a label for each timestep. In a forecasting task, the dimensionality of the output is the same as that of the input. Ridge regression, also known as Tikhonov regularization, is used to solve for the optimal $\mathbf{W}_{\mathrm{out}}$ and is shown with the explicit solution in (4)

$$\mathbf{W}_{\mathrm{out}} = \mathbf{Y}_{\mathrm{gt}}\, \mathbf{X}^{\top} \left( \mathbf{X} \mathbf{X}^{\top} + \beta \mathbf{I} \right)^{-1} \quad (4)$$

where $\beta$ is a regularization term and $\mathbf{I}$ is the identity matrix. An alternative solution (5) via the SVD of the Moore-Penrose pseudoinverse gives a more accurate result but comes at the cost of higher computational complexity. To maintain reservoir stability, ESNs need to satisfy the echo state property (ESP) (Jaeger, 2001; Gallicchio and Micheli, 2017) as stated by (6).

$$\max_k \left| \lambda_k\!\left( \left(1 - a^{(i)}\right) \mathbf{I} + a^{(i)}\, \hat{\mathbf{W}}^{(i)} \right) \right| < 1 \quad (6)$$
The function $\lambda_k(\cdot)$ gives the $k$th eigenvalue of its matrix argument and $|\cdot|$ gives the modulus of its complex scalar argument. The maximum eigenvalue modulus is referred to as the spectral radius and must be less than unity ('1') in order for initial conditions of each reservoir to be washed out asymptotically. A hyperparameter $\rho$ is substituted for unity to allow for reservoir tuning for a given task. Each $\hat{\mathbf{W}}^{(i)}$ is drawn from a uniform distribution and scaled such that the ESP is satisfied.
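The scaling described above can be sketched as follows; the uniform draw, matrix size, and the value of $\rho$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.95                                 # spectral-radius hyperparameter
W_hat = rng.uniform(-1, 1, (200, 200))     # uniform draw for recurrent weights
# Rescale so the maximum eigenvalue modulus equals rho, cf. (6)
W_hat *= rho / np.max(np.abs(np.linalg.eigvals(W_hat)))
spectral_radius = np.max(np.abs(np.linalg.eigvals(W_hat)))
```

Because eigenvalues scale linearly with the matrix, a single rescaling suffices.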
The remaining untrained weight matrices are set using one of two methods. First, a matrix can be drawn from a uniform distribution and scaled to have a specified (Frobenius) norm, where the target norms of the input and inter-reservoir matrices are hyperparameters. Second, Glorot (Xavier) initialization (Glorot and Bengio, 2010) can be utilized without incurring further hyperparameters. The method initializes weights such that the activation (output) variance of consecutive fully-connected layers is the same. Formally, weights are drawn from a normal distribution with zero mean and a standard deviation of $\sqrt{2 / (n_{\mathrm{in}} + n_{\mathrm{out}})}$, where $n_{\mathrm{in}}$ and $n_{\mathrm{out}}$ are the number of inputs and outputs of a layer, respectively. All weight matrices are drawn with a sparsity parameter which gives the probability that each weight is nullified. Specifically,
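A sketch of Glorot initialization combined with the sparsity parameter might look as follows; the function name and the sparsity value are hypothetical, not taken from the original implementation.

```python
import numpy as np

def glorot_sparse(n_in, n_out, sparsity, rng):
    """Glorot-initialized weights with a fraction `sparsity` of entries nullified."""
    std = np.sqrt(2.0 / (n_in + n_out))        # Glorot (Xavier) standard deviation
    W = rng.normal(0.0, std, (n_out, n_in))
    mask = rng.random((n_out, n_in)) >= sparsity  # True = keep the weight
    return W * mask

rng = np.random.default_rng(5)
W = glorot_sparse(100, 100, sparsity=0.8, rng=rng)
```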
a separate sparsity parameter governs $\mathbf{W}_{\mathrm{in}}$, each feedforward matrix $\mathbf{W}^{(j,i)}$, and each recurrent matrix $\hat{\mathbf{W}}^{(i)}$. Furthermore, an unsupervised intrinsic plasticity (IP) learning rule is employed. The rule, originally proposed by Schrauwen et al., introduces gain and bias terms to the nonlinearities of reservoir neurons, i.e. $\tanh(x)$ is substituted with $\tanh(gx + b)$, where $g$ is the gain and $b$ is the bias. Iterative application of the rule minimizes the Kullback-Leibler (KL) divergence between the empirical output distribution (as driven by the input) and a target Gaussian distribution (Schrauwen et al., 2008). The update rules for the $k$th neuron are given by (7) and (8)

$$\Delta b = -\eta \left( -\frac{\mu}{\sigma^2} + \frac{y}{\sigma^2} \left( 2\sigma^2 + 1 - y^2 + \mu y \right) \right) \quad (7)$$

$$\Delta g = \frac{\eta}{g} + \Delta b\, x \quad (8)$$

where the net input $x$ is the argument of the nonlinearity in (1a) and the output $y$ is given by (1b). The hyperparameter $\eta$ is the learning rate, and $\sigma$ and $\mu$ are the standard deviation and mean of a target Gaussian distribution, respectively. In a pre-training phase, the learned parameters are each initialized as $g = 1$ and $b = 0$ and are updated iteratively in a layer-wise fashion.
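The IP pre-training step of (7)-(8) can be sketched per neuron as follows; the learning rate, target moments, and the random stand-in for the net input are illustrative values, not those used in the experiments.

```python
import numpy as np

def ip_update(x_net, y, g, b, eta=1e-4, mu=0.0, sigma=0.2):
    """One intrinsic-plasticity step for a tanh neuron, cf. (7)-(8)."""
    db = -eta * (-(mu / sigma**2) + (y / sigma**2) * (2 * sigma**2 + 1 - y**2 + mu * y))
    dg = eta / g + db * x_net
    return g + dg, b + db

# Pre-training: drive the neuron and adapt gain/bias toward N(mu, sigma^2) outputs
g, b = 1.0, 0.0                       # initialization described above
rng = np.random.default_rng(3)
for _ in range(1000):
    x_net = rng.standard_normal()     # stand-in for the net input of (1a)
    y = np.tanh(g * x_net + b)
    g, b = ip_update(x_net, y, g, b)
```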
2.1. Particle Swarm Optimization
Instances of the ModDeepESN network may achieve satisfactory forecasting performance using empirical guesses of hyperparameters; however, a more sophisticated optimization approach can further improve performance. We thus propose black-box optimization of the ModDeepESN hyperparameters using particle swarm optimization (PSO) (Kennedy and Eberhart, 1995). In PSO, a population of particles is instantiated in the search space of an optimization problem. This space contains the possible values of continuous or discrete hyperparameters, and the particles move around the space based on external fitness, or cost, signals. The communication network topology employed dictates the social behavior, or dynamics, of the swarm. Here, we utilize a star topology in which each particle is attracted to the globally best-performing particle.

Formally, each particle $i$ is a candidate solution of hyperparameters with position $\mathbf{x}_i$ and velocity $\mathbf{v}_i$. The trivial position update of each particle is given by (9) while the velocity update is given by (10).

$$\mathbf{x}_i(t+1) = \mathbf{x}_i(t) + \mathbf{v}_i(t+1) \quad (9)$$

$$\mathbf{v}_i(t+1) = \omega\, \mathbf{v}_i(t) + c_1\, \mathbf{r}_1 \odot \left( \mathbf{p}_i - \mathbf{x}_i(t) \right) + c_2\, \mathbf{r}_2 \odot \left( \mathbf{g} - \mathbf{x}_i(t) \right) \quad (10)$$

The random factors $\mathbf{r}_1$ and $\mathbf{r}_2$ are populated by values uniformly drawn from the interval $[0, 1]$ at each timestep. The best position found by a particle is the vector $\mathbf{p}_i$ while the best solution found in the neighborhood of a particle is the vector $\mathbf{g}$. With a star communication topology, $\mathbf{g}$ is the best position found globally. The velocity update comprises three parameters which influence a particle's dynamics: $\omega$ is the inertia weight, $c_1$ is the cognitive acceleration (how much a particle should follow its personal best), and $c_2$ is the social acceleration (how much a particle should follow the swarm's global best). All hyperparameters are considered during the optimization process, except for the regularization term $\beta$, which is swept after $\mathbf{X}$ has been computed, exploiting the fact that all the weight matrices but $\mathbf{W}_{\mathrm{out}}$ are fixed.
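A minimal PSO loop implementing (9) and (10) on a toy sphere cost is sketched below; the swarm size, iteration count, and the values of $\omega$, $c_1$, and $c_2$ are illustrative, not those used in this work.

```python
import numpy as np

rng = np.random.default_rng(4)
n_particles, dims = 20, 3
omega, c1, c2 = 0.7, 1.5, 1.5            # inertia, cognitive, social accelerations

def fitness(x):                           # toy cost: sphere function
    return np.sum(x**2, axis=1)

pos = rng.uniform(-5, 5, (n_particles, dims))
vel = np.zeros_like(pos)
pbest, pbest_cost = pos.copy(), fitness(pos)
gbest = pbest[np.argmin(pbest_cost)]      # star topology: one global best

for _ in range(100):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = omega * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)  # (10)
    pos = pos + vel                                                        # (9)
    cost = fitness(pos)
    improved = cost < pbest_cost
    pbest[improved], pbest_cost[improved] = pos[improved], cost[improved]
    gbest = pbest[np.argmin(pbest_cost)]
```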
2.2. Neural Mapping
RC networks have strong underpinnings in neural processing. Recent studies have shown that the division into a complex representation layer and a linear readout layer in a reservoir is similar to the organization observed in the cerebellum. The model with a granule layer (representation layer) and the synapses between granule and Purkinje cells (linear readout layer)
(Yamazaki and Tanaka, 2007) is used to study computationally useful case studies such as vestibulo-ocular reflex adaptation (Dean et al., 2002). The Purkinje cells are trained from the random, distributed input signals of the granule cells' parallel dendritic connections (Shepherd, 1990). We also employ intrinsic plasticity, as in (Schrauwen et al., 2008), to modulate neuronal activations to follow a known distribution. Whereas a biological neuron's electrical properties are modified, a reservoir neuron's gain and bias are adjusted. Whether the edge-of-chaos regime, a powerful computational paradigm, has a counterpart in the biological analogues of reservoirs is not yet understood. However, infusing hybrid plasticity mechanisms such as short-term plasticity or long-term synaptic plasticity can help enhance computational performance. It is interesting to note that reservoir computational models (both spiking and non-spiking) seem to gain a performance boost from embedding intrinsic plasticity, akin to the biological models (Soures, 2017; Soures et al., 2017b). This convergence with the biological counterparts improves our understanding of building spatiotemporal processing systems, though one should take a parsimonious approach with such correlations.

3. Measuring Reservoir Goodness
Various metrics have been proposed to understand reservoir performance (Jaeger, 2002; Ozturk et al., 2007; Gibbons, 2010; Gallicchio et al., 2018c; Lymburn et al., 2019). In this work, we quantify reservoir goodness by measuring the stability of reservoir dynamics as well as evaluating the forecasting proficiency of networks trained on synthetic and real-world tasks.
3.1. Separation Ratio Graphs
Separation ratio graphs (Gibbons, 2010) are considered for determining the fitness of ModDeepESN state separability. The method compares the separation of inputs with the separation of outputs for a given input and output, e.g. the input and output of a single reservoir or of a set of reservoirs. The metric assumes that inputs that appear similar should have a comparable degree of similarity between the corresponding outputs. That is,

$$d\!\left( \mathbf{x}(t_i), \mathbf{x}(t_j) \right) \approx d\!\left( \mathbf{u}(t_i), \mathbf{u}(t_j) \right) \quad (11)$$

where $d(\cdot, \cdot)$ gives the Euclidean distance between inputs (or outputs) at timesteps $t_i$ and $t_j$. When the output separation is plotted as a function of the input separation, the relation should be close to identity, i.e. a linear regression trend line should yield a slope $m \approx 1$ and intercept $b \approx 0$. If the output separation exceeds the input separation, a reservoir (or network) is considered to be in the "chaotic" zone, whereas it is considered to be in the "attractor" zone if the output separation falls below the input separation.

3.2. Lyapunov Exponent
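The trend line of output versus input separations, cf. (11), could be computed from collected inputs and reservoir states as follows; the function and variable names are hypothetical.

```python
import numpy as np

def separation_ratio_fit(U, X):
    """Fit output separation vs. input separation, cf. (11).
    U: inputs over time (T x N_u); X: reservoir states over time (T x N_r).
    Returns the trend-line slope and intercept (ideally ~1 and ~0)."""
    d_in, d_out = [], []
    T = len(U)
    for i in range(T):
        for j in range(i + 1, T):
            d_in.append(np.linalg.norm(U[i] - U[j]))    # input separation
            d_out.append(np.linalg.norm(X[i] - X[j]))   # output separation
    slope, intercept = np.polyfit(d_in, d_out, 1)
    return slope, intercept
```

As a sanity check, an identity "reservoir" (states equal to inputs) yields a slope of 1 and an intercept of 0.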
The Lyapunov exponent (LE) offers a quantitative measure of the exponential growth (decay) rate of infinitesimal perturbations within a dynamical system (Lyapunov, 1992; Eckmann and Ruelle, 1985; Gallicchio et al., 2018c). An $N$-dimensional system has $N$ LEs which describe the evolution along each dimension of its state space. The maximum LE (MLE) is a good indication of a system's stability as it commands the rate of contraction or expansion in the system state space. If the MLE of a system is below 0, the system is said to exhibit "stable" dynamics, whereas an MLE above 0 describes a system with "chaotic" dynamics. An MLE of 0 is commonly referred to as the "edge of chaos." Local MLEs (Gallicchio et al., 2018c) are considered in this work as they are more useful in practical experiments and can be estimated by driving a network with a real input signal (e.g. a time series). The MLE, denoted $\lambda$, of a ModDeepESN instance can be computed for a given set of $S$ input sequences by (12)

$$\lambda = \max_k \frac{1}{S} \sum_{s=1}^{S} \frac{1}{T_s} \sum_{t=1}^{T_s} \ln \left| \lambda_k\!\left( \mathbf{D}_s(t)\, \hat{\mathbf{W}} \right) \right| \quad (12)$$

where for the $s$th sequence the diagonal matrix $\mathbf{D}_s(t)$ is given by (13).

$$\mathbf{D}_s(t) = \mathrm{diag}\!\left( 1 - \tilde{\mathbf{x}}(t)^{2} \right) \quad (13)$$
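For a single tanh reservoir and sequence, a rough estimate in the spirit of (12)-(13) could be computed as follows; this simplified sketch time-averages the log-moduli of the eigenvalues of $\mathbf{D}(t)\hat{\mathbf{W}}$ and is not the exact procedure of (Gallicchio et al., 2018c).

```python
import numpy as np

def local_mle(states_tilde, W_hat):
    """Estimate the maximum local Lyapunov exponent of a tanh reservoir
    from its recorded activations, cf. (12)-(13)."""
    log_moduli = []
    for x_t in states_tilde:
        D = np.diag(1.0 - x_t ** 2)                     # tanh derivative, cf. (13)
        eig = np.linalg.eigvals(D @ W_hat)
        log_moduli.append(np.log(np.abs(eig) + 1e-12))  # guard against zero moduli
    return float(np.max(np.mean(log_moduli, axis=0)))   # max over state dimensions
```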
3.3. Task Performance Metrics
The most straightforward way of measuring reservoir goodness is by evaluating the performance of a network on a task. For scalar-valued time series forecasting tasks, we consider the following three metrics to quantify network error: root-mean-square error (RMSE) (14), normalized RMSE (NRMSE) (15), and mean absolute percentage error (MAPE) (16).

$$\mathrm{RMSE} = \sqrt{ \frac{1}{T} \sum_{t=1}^{T} \left( y(t) - \hat{y}(t) \right)^2 } \quad (14)$$

$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{ \sqrt{ \frac{1}{T} \sum_{t=1}^{T} \left( y(t) - \bar{y} \right)^2 } } \quad (15)$$

$$\mathrm{MAPE} = \frac{100\%}{T} \sum_{t=1}^{T} \left| \frac{y(t) - \hat{y}(t)}{y(t)} \right| \quad (16)$$

The value $\hat{y}(t)$ is the predicted time series value at timestep $t$, $y(t)$ is the ground truth, and $\bar{y}$ is the average value of the ground truth (time series) over the $T$ timesteps.
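The three error metrics, (14)-(16), translate directly to code; in this sketch NRMSE is normalized by the standard deviation of the ground truth, which is one of several conventions in the literature.

```python
import numpy as np

def rmse(y, y_hat):
    """Root-mean-square error, cf. (14)."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def nrmse(y, y_hat):
    """RMSE normalized by the standard deviation of the ground truth, cf. (15)."""
    return rmse(y, y_hat) / np.std(y)

def mape(y, y_hat):
    """Mean absolute percentage error (in percent), cf. (16)."""
    return 100.0 * np.mean(np.abs((y - y_hat) / y))
```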
Proposed in (Bay and Downie, 2009) and adapted for polyphonic music tasks in (Boulanger-Lewandowski et al., 2012), frame-level accuracy (FLACC) rewards only true positives across every timestep of all samples as shown in (17).

$$\mathrm{FLACC} = \frac{ \sum_{s=1}^{S} \sum_{t=1}^{T_s} \mathrm{TP}_s(t) }{ \sum_{s=1}^{S} \sum_{t=1}^{T_s} \left( \mathrm{TP}_s(t) + \mathrm{FP}_s(t) + \mathrm{FN}_s(t) \right) } \quad (17)$$

The subscript of each of the {true, false} positives ({T,F}P) and false negatives (FN) denotes the corresponding quantity for the $s$th sequence. Note that FLACC is analogous to the intersection-over-union (IoU) metric, also known as the Jaccard index (Jaccard, 1912), which is commonly used to assess the performance of image segmentation models. For forecasting tasks, NRMSE is considered as its value is not dependent on the scale of the model nor the data, thus allowing for a more accurate comparison with results reported in the literature. FLACC is utilized for the polyphonic forecasting task as the true negative (TN) rate is not a good indicator of performance, as well as for consistent comparison of results.
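A single-sequence FLACC in the spirit of (17) can be sketched as follows (aggregating across sequences is then straightforward); names are illustrative.

```python
import numpy as np

def flacc(y_true, y_pred):
    """Frame-level accuracy for one binarized piano-roll sequence, cf. (17):
    true positives over (TP + FP + FN); true negatives are ignored."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fp + fn)
```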
4. Experiments
We analyze the ModDeepESN architecture on diverse time series forecasting tasks that exhibit multiscale and nonlinear dynamics: the chaotic Mackey-Glass series, a daily minimum temperature series (Hyndman and Yang, 2018), and a set of polyphonic music series (Poliner and Ellis, 2006; Boulanger-Lewandowski et al., 2012).
4.1. Practical Details
For each task considered in this work, we run PSO to produce a candidate set of hyperparameters that offers the best performance on the appropriate validation set. Thereafter, these parameterizations of ModDeepESN are evaluated on the test set for the specific task. All numerical results are averaged over 10 runs. We run PSO for 100 iterations with 50 particles, with the inertia weight and the cognitive and social acceleration coefficients held fixed.
During training, the regularization term $\beta$ is swept over a fixed set of values, exploiting the fact that $\mathbf{X}$ need only be computed once. Ridge regression is carried out using SVD according to (5) for increased numerical stability. Only the score produced by the best-performing $\beta$ is considered during the PSO update.
We only consider dense grid topologies in these experiments, which are referred to as Wide and Layered in (Carmichael et al., 2018). This allows all networks to be described in terms of breadth and depth, reducing the complexity of neuronal partitioning and other analyses. Additionally, the leaky rate is kept constant across all the reservoirs, i.e. $a^{(i)} = a$ for all $i$.
4.1.1. Model Implementation
The ModDeepESN architecture and its utilities are implemented in Python using several open-source libraries. TensorFlow (Abadi et al., 2015) and Keras (Chollet et al., 2015) are used for matrix/tensor operations, managing network graphs, and unrolling RNNs. The PSO implementation extends the base optimizers provided by the PySwarms (Miranda, 2018) library, and the pandas (McKinney, 2010), NumPy (Oliphant, 2006), and SciPy (Jones et al., 2001) libraries are employed throughout the codebase.
4.2. Neuronal Partitioning
An interesting question in reservoir models is to determine the optimal size and connectivity of/within reservoirs. Is a large reservoir as effective as a multitude of small reservoirs, and should these small reservoirs extend deeper or wider? To address this matter, we propose neuronal partitioning to explore a space of grid topologies. Given a budget of $N$ neurons network-wide, we evaluate the task performance for each depth and breadth such that the number of reservoirs times the neurons per reservoir equals $N$. This experiment is performed for the Mackey-Glass and minimum temperature forecasting tasks. The prohibitive size of the polyphonic music data prevents us from running such an experiment. The depth and breadth values selected for each experiment are the integer factors of the number of reservoirs afforded by the budget.
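The enumeration of grid topologies under a neuron budget can be sketched as follows; the helper name and per-reservoir size are hypothetical.

```python
def partition_grid(n_total, n_reservoir):
    """Enumerate (depth, breadth) grid topologies for a fixed neuron budget.
    Each of the depth * breadth reservoirs holds n_reservoir neurons."""
    n_grid = n_total // n_reservoir          # number of reservoirs to place
    return [(d, n_grid // d) for d in range(1, n_grid + 1) if n_grid % d == 0]
```

For example, a budget of 1,200 neurons in reservoirs of 100 yields the factor pairs of 12 as candidate topologies.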
4.3. Mackey Glass
Mackey-Glass (Mackey and Glass, 1977) is a classical chaotic time series benchmark for evaluating the forecasting capacity of dynamical systems (Jaeger, 2001; Yusoff et al., 2016; Aceituno et al., 2017; Ma et al., 2017). The series is generated from a nonlinear time-delay differential equation using the fourth-order Runge-Kutta method (RK4) and is given by (18).

$$\frac{dx}{dt} = \frac{\beta\, x(t - \tau)}{1 + x(t - \tau)^{n}} - \gamma\, x(t) \quad (18)$$

During generation, we set $\tau = 17$, $\beta = 0.2$, $\gamma = 0.1$, and $n = 10$ with a time resolution ($\delta t$) of 0.1 to compare with methods evaluated in (Ma et al., 2017). 10,000 samples are split into 6,400 training samples, 1,600 validation samples, and 2,000 testing samples for 84-timestep-ahead forecasting, i.e. given $x(t)$, predict $x(t + 84)$. To reduce the influence of transients, the first 100 timesteps of the training set are used as a washout period.
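A sketch of generating the series per (18) via RK4 follows; note the delayed term is held fixed within each step, a simplification of a true delay-differential integrator, and the constant initial history is an illustrative choice.

```python
import numpy as np

def mackey_glass(n_samples, tau=17, beta=0.2, gamma=0.1, n=10, dt=0.1, x0=1.2):
    """Generate the Mackey-Glass series, cf. (18), via RK4 with the delayed
    term held constant over each step (an approximation)."""
    delay = round(tau / dt)
    x = np.full(delay + n_samples, x0)        # constant initial history
    f = lambda x_t, x_d: beta * x_d / (1.0 + x_d ** n) - gamma * x_t
    for t in range(delay, delay + n_samples - 1):
        x_t, x_d = x[t], x[t - delay]
        k1 = f(x_t, x_d)
        k2 = f(x_t + 0.5 * dt * k1, x_d)
        k3 = f(x_t + 0.5 * dt * k2, x_d)
        k4 = f(x_t + dt * k3, x_d)
        x[t + 1] = x_t + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x[delay:]
```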
Table 1 contains the best forecasting results for ModDeepESN as well as those reported in (Ma et al., 2017). The ModDeepESN framework falls slightly short of the performance of DeepESN but outperforms the other baselines in terms of (N)RMSE. MAPE exhibits several biases, such as punishing negative errors more than positive, which may be the reason for this discrepancy.
Network  Layers  RMSE  NRMSE  MAPE
Vanilla ESN (Jaeger et al., 2007)  1  43.7  201  7.03
ESN (Gallicchio and Micheli, 2011)  2  8.60  39.6  1.00
RSP (Butcher et al., 2013)  2  27.2  125  1.00
MESM (Malik et al., 2016)  7  12.7  58.6  1.91
DeepESN (Ma et al., 2017)  2  1.12  5.17  .151
ModDeepESN  3  7.22  27.5  5.55

Table 1. Best forecasting results on the Mackey-Glass task.
4.3.1. Neuronal Partitioning
Neuronal partitioning is run for the Mackey-Glass task with results reported in Figure 2. It is apparent that smaller depths and larger breadths yield the lowest NRMSE; an agglomeration of small reservoirs outperforms a single large reservoir for this task. While the margin is small, broader topologies outperform deeper ones for this task.
Network  Layers  RMSE  NRMSE  MAPE
ESN (Jaeger et al., 2007)  1  501  139  39.5
ESN (Gallicchio and Micheli, 2011)  2  493  141  39.6
RSP (Butcher et al., 2013)  2  495  137  39.3
MESM (Malik et al., 2016)  7  478  136  37.7
DeepESN (Ma et al., 2017)  2  473  135  37.0
ModDeepESN  4  459  132  37.1

Table 2. Best forecasting results on the Melbourne daily minimum temperature task.
4.4. Melbourne, Australia, Daily Minimum Temperature Forecasting
The Melbourne, Australia, daily minimum temperature series (Hyndman and Yang, 2018) is recorded from 1981-1990 and shown in Figure 6. In this task, the goal is to predict the minimum temperature of the directly following day in Melbourne, i.e. given the observation at day $t$, predict that at day $t + 1$. The data is smoothed with a 5-step moving window average and split into 2,336 training samples, 584 validation samples, and 730 testing samples to compare with methods evaluated in (Ma et al., 2017). A washout period of 30 timesteps (days) is used to rid transients.
Table 2 contains the best forecasting results for ModDeepESN as well as those reported in (Ma et al., 2017). The ModDeepESN framework outperforms all baselines in terms of (N)RMSE. This result is more interesting than that of Mackey-Glass as the time series comprises real data as opposed to synthetic data.
4.4.1. Neuronal Partitioning
Neuronal partitioning is run for the Melbourne daily minimum temperature task with results reported in Figure 3. The trends of depth and breadth are less apparent than those observed with the Mackey-Glass forecasting task. The gradient of error is consistent with the change in depth and breadth for deeper topologies between the tasks, but the same does not hold for broader networks; in fact, the inverse is observed, which suggests that hierarchical features and larger memory capacity are required to improve performance. Extended memory capacity has been shown to emerge with the depth of an ESN (Gallicchio et al., 2018c), which supports this observation.
4.5. Polyphonic Music Tasks
We evaluate the ModDeepESN on a set of polyphonic music tasks as defined in (Boulanger-Lewandowski et al., 2012). In particular, we use the data provided at http://www-etud.iro.umontreal.ca/~boulanni/icml2012 for the Piano-midi.de (classical MIDI piano music, http://piano-midi.de/) task. The data comprises a set of piano roll sequences preprocessed as described in (Poliner and Ellis, 2006). 87 sequences with an average of 872.5 timesteps are used for training, 12 sequences with an average of 711.7 timesteps are used for validation, and 25 sequences with an average of 761.4 timesteps are used for testing. The goal of this task is to predict the set of notes played at the next timestep given those played at the current timestep.
Multiple notes may be played at once, so an argmax cannot be used at the output of the readout layer; rather, the output of each neuron is binarized with a threshold. In practice, we find this threshold using training data by sweeping
20 values uniformly distributed between the minimum and maximum of the predicted values. This threshold can also be found by training a linear classifier on the predicted outputs, using an adaptive threshold, or using Otsu’s method
(Otsu, 1979). Lastly, an optimal threshold may exist on a per-neuron basis. A washout period of 20 steps is used to rid transients.

Network  FLACC
DeepESN (Gallicchio et al., 2018a)  33.22%
shallowESN (Gallicchio et al., 2018a)  31.76%
RNN-RBM (Boulanger-Lewandowski et al., 2012)  28.92%
ModDeepESN  33.44%

Table 3. Best frame-level accuracy on the Piano-midi.de task.
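The threshold sweep described above can be sketched as follows, scoring each candidate threshold by frame-level accuracy; the function name and the candidate count are illustrative.

```python
import numpy as np

def best_threshold(y_true, y_scores, n_thresholds=20):
    """Sweep candidate thresholds between the min and max predicted value and
    keep the one maximizing frame-level accuracy, TP / (TP + FP + FN)."""
    candidates = np.linspace(y_scores.min(), y_scores.max(), n_thresholds)
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        y_pred = (y_scores >= t).astype(int)   # binarize the readout output
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        acc = tp / max(tp + fp + fn, 1)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```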
Table 3 contains the best forecasting results for ModDeepESN as well as those reported in (Boulanger-Lewandowski et al., 2012; Gallicchio et al., 2018a). The ModDeepESN framework outperforms both RNN-RBM (Boulanger-Lewandowski et al., 2012) and DeepESN (Gallicchio et al., 2018a) on the Piano-midi.de corpus with fewer trained parameters and reservoirs.
5. Discussion
It is evident that the optimal ModDeepESN topologies found via neuronal partitioning are task-specific; no one configuration seems optimal for multiple tasks. Between the Mackey-Glass and Melbourne forecasting tasks, wider networks exhibit a smaller confidence interval, which indicates consistency in performance. There is less of an apparent trend on the real-world Melbourne task (more so for deep networks), although this is within expectations due to noise in the data.
We create separation ratio plots of reservoir responses to time series input with various examples shown in Figure 5. The best-performing models achieve similar error values but exhibit considerably different dynamics at different scales of magnitude. The worst-performing model yields a separation ratio similar to that of the second-best; the biases differ, but this can be attributed to the difference in magnitude (as a result of e.g. input scaling). Looking at (11), it can be observed that the identity function yields an ideal response at the empirical "edge of chaos"; this sheds light on some shortcomings of the metric. The technique can be made more robust by considering input-to-output similarity, matching the variance of inputs and reservoir responses (to avoid skewing the slope), and tracking the consistency of separation ratios over time (as reservoirs are stateful). We recommend these plots as a debugging method for ESNs as they unveil useful attributes beyond input and output separation.
Of the three tasks considered, the Melbourne daily minimum forecasting task is selected for exploring the design space. The data is non-synthetic and its size is not prohibitive of such exploration. Here, we produce heatmaps of NRMSE and MLE ($\lambda$) as a function of several swept hyperparameters. In each, a practical range of breadth and depth values is considered. Figure 7 delineates the impact of one swept hyperparameter and shows that $\lambda$ is a reliable indication of performance for the task, as measured by the Pearson correlation coefficient (Pearson and Galton, 1895) between NRMSE and $\lambda$ (not to be confused with $\rho$, the spectral radius). There is no significant impact induced by modulating breadth or depth, which agrees with the neuronal partitioning result (see Figure 3). Figure 4 illustrates the impact of $\rho$, which demonstrably has a more substantial impact on network stability, as expected. The network error plateaus to a minimum near a critical value of $\rho$ and increases dramatically afterward. This critical point is beyond the "edge of chaos" ($\lambda = 0$) and error is asymmetrical about it. Here, $\lambda$ is a poor predictor of error (even the correlation between $\rho$ and NRMSE is higher). Again, depth and breadth are not indicative of error on their own; however, both deeper and wider networks suffer larger errors beyond the critical value of $\rho$.
An interesting observation is that, while the tasks differ, well-performing networks in this work often exhibit a positive $\lambda$, whereas the networks in (Gallicchio et al., 2018c) primarily exhibit a negative $\lambda$. This characterization of ModDeepESN as a system with "unstable" dynamics requires further corroboration but indicates that such dynamics do not preclude consistent performance.
6. Conclusion
We provide analytical rationale for the characteristics of deep ESN design that influence forecasting performance. Within the malleable ModDeepESN architecture, we experimentally show that networks can perform optimally beyond the "edge of chaos." Given constraints on model size or compute resources, we explore the effects of neuron allocation and reservoir placement on performance. We also demonstrate that network breadth plays a role in dictating the certainty of performance between instances. These characteristics may be present within neuronal populations, which would suggest that powerful models emerge from an agglomeration of weak learners. Redundancy through parallel pathways, extraction of nonlinear data regularities with depth, and discernibility of latent representations all appear to have a significant impact on ModDeepESN performance. Future studies should explore the design space of reservoirs in tandem with neuromorphic hardware design constraints.
Acknowledgements.
We would like to thank the members of the Neuromorphic Artificial Intelligence Lab for helpful discussions during this work. We also thank the creators and maintainers of the matplotlib (Hunter, 2007) and seaborn (Waskom et al., 2018) libraries, which we used for plotting.
We acknowledge the Air Force Research Lab in funding part of this work under agreement number FA87501610108. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory or the U.S. Government.
References

Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. CoRR abs/1603.04467 (2015). arXiv:1603.04467 http://arxiv.org/abs/1603.04467 Software available from tensorflow.org.
 Aceituno et al. (2017) Pau Vilimelis Aceituno, Yan Gang, and Yang-Yu Liu. 2017. Tailoring Artificial Neural Networks for Optimal Learning. CoRR abs/1707.02469 (2017), 1–22. arXiv:cs/1707.02469 http://arxiv.org/abs/1707.02469
 Bay and Downie (2009) Mert Bay, Andreas F. Ehmann, and J. Stephen Downie. 2009. Evaluation of Multiple-F0 Estimation and Tracking Systems. In Proceedings of the 10th International Society for Music Information Retrieval Conference, ISMIR, Keiji Hirata, George Tzanetakis, and Kazuyoshi Yoshii (Eds.). International Society for Music Information Retrieval, Kobe International Conference Center, Kobe, Japan, 315–320. http://ismir2009.ismir.net/proceedings/PS221.pdf
 Boulanger-Lewandowski et al. (2012) Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. 2012. Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In Proceedings of the 29th International Conference on Machine Learning, ICML (ICML’12). Omnipress, Edinburgh, Scotland, UK, 1881–1888. http://icml.cc/2012/papers/590.pdf
 Butcher et al. (2013) John B. Butcher, David Verstraeten, Benjamin Schrauwen, Charles R. Day, and Peter W. Haycock. 2013. Reservoir computing and extreme learning machines for non-linear time-series data analysis. Neural Networks 38 (2013), 76–89. https://doi.org/10.1016/j.neunet.2012.11.011
 Carmichael et al. (2018) Zachariah Carmichael, Humza Syed, Stuart Burtner, and Dhireesha Kudithipudi. 2018. ModDeepESN: Modular Deep Echo State Network. Conference on Cognitive Computational Neuroscience abs/1808.00523 (Sept. 2018), 1–4. arXiv:cs/1808.00523 http://arxiv.org/abs/1808.00523 or https://ccneuro.org/2018/proceedings/1239.pdf.
 Chollet et al. (2015) François Chollet et al. 2015. Keras. https://github.com/fchollet/keras.
 Dean et al. (2002) Paul Dean, John Porrill, and James V. Stone. 2002. Decorrelation control by the cerebellum achieves oculomotor plant compensation in simulated vestibulo-ocular reflex. Proceedings of the Royal Society B 269, 1503 (2002), 1895–1904. https://doi.org/10.1098/rspb.2002.2103
 Eckmann and Ruelle (1985) Jean-Pierre Eckmann and David Ruelle. 1985. Ergodic theory of chaos and strange attractors. Reviews of Modern Physics 57 (July 1985), 617–656. Issue 3. https://doi.org/10.1103/RevModPhys.57.617
 Gallicchio and Micheli (2011) Claudio Gallicchio and Alessio Micheli. 2011. Architectural and Markovian factors of echo state networks. Neural Networks 24, 5 (2011), 440–456. https://doi.org/10.1016/j.neunet.2011.02.002
 Gallicchio and Micheli (2017) Claudio Gallicchio and Alessio Micheli. 2017. Echo State Property of Deep Reservoir Computing Networks. Cognitive Computation 9, 3 (2017), 337–350. https://doi.org/10.1007/s12559-017-9461-9
 Gallicchio et al. (2017) Claudio Gallicchio, Alessio Micheli, and Luca Pedrelli. 2017. Deep reservoir computing: A critical experimental analysis. Neurocomputing 268 (2017), 87–99. https://doi.org/10.1016/j.neucom.2016.12.089
 Gallicchio et al. (2018a) Claudio Gallicchio, Alessio Micheli, and Luca Pedrelli. 2018a. Deep Echo State Networks for Diagnosis of Parkinson’s Disease. In Proceedings of the 26th European Symposium on Artificial Neural Networks, ESANN. i6doc.com, Bruges, Belgium, 397–402. http://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2018-163.pdf
 Gallicchio et al. (2018b) Claudio Gallicchio, Alessio Micheli, and Luca Pedrelli. 2018b. Design of deep echo state networks. Neural Networks 108 (2018), 33–47. https://doi.org/10.1016/j.neunet.2018.08.002
 Gallicchio et al. (2018c) Claudio Gallicchio, Alessio Micheli, and Luca Silvestri. 2018c. Local Lyapunov Exponents of Deep Echo State Networks. Neurocomputing 298 (2018), 34–45. https://doi.org/10.1016/j.neucom.2017.11.073
 Gibbons (2010) Thomas E. Gibbons. 2010. Unifying quality metrics for reservoir networks. In Proceedings of the International Joint Conference on Neural Networks, IJCNN. IEEE, Barcelona, Spain, 1–7. https://doi.org/10.1109/IJCNN.2010.5596307
 Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, AISTATS (JMLR Proceedings), Yee Whye Teh and D. Mike Titterington (Eds.), Vol. 9. JMLR.org, Chia Laguna Resort, Sardinia, Italy, 249–256. http://jmlr.org/proceedings/papers/v9/glorot10a.html
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
 Hunter (2007) John D. Hunter. 2007. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering 9, 3 (2007), 90–95. https://doi.org/10.1109/MCSE.2007.55
 Hyndman and Yang (2018) Rob J. Hyndman and Yangzhuoran Yang. 2018. Daily minimum temperatures in Melbourne, Australia (1981–1990). https://pkg.yangzhuoranyang.com/tsdl/
 Jaccard (1912) Paul Jaccard. 1912. The Distribution of the Flora in the Alpine Zone. The New Phytologist 11, 2 (Feb. 1912), 37–50. https://doi.org/10.1111/j.1469-8137.1912.tb05611.x
 Jaeger (2001) Herbert Jaeger. 2001. The “Echo State” Approach to Analysing and Training Recurrent Neural Networks – with an Erratum Note. Technical Report 148. Fraunhofer Institute for Autonomous Intelligent Systems, GMD – German National Research Institute for Information Technology. http://www.faculty.jacobs-university.de/hjaeger/pubs/EchoStatesTechRep.pdf
 Jaeger (2002) Herbert Jaeger. 2002. Short term memory in echo state networks. Technical Report 152. Fraunhofer Institute for Autonomous Intelligent Systems, GMD – German National Research Institute for Information Technology. http://www.faculty.jacobs-university.de/hjaeger/pubs/STMEchoStatesTechRep.pdf
 Jaeger et al. (2007) Herbert Jaeger, Mantas Lukoševičius, Dan Popovici, and Udo Siewert. 2007. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks 20, 3 (2007), 335–352. https://doi.org/10.1016/j.neunet.2007.04.016
 Jones et al. (2001) Eric Jones, Travis Oliphant, Pearu Peterson, et al. 2001. SciPy: Open Source Scientific Tools for Python. http://www.scipy.org/
 Kennedy and Eberhart (1995) James Kennedy and Russell C. Eberhart. 1995. Particle swarm optimization. In Proceedings of the International Conference on Neural Networks, ICNN’95 (1995), Vol. 4. IEEE, Perth, WA, Australia, 1942–1948. https://doi.org/10.1109/ICNN.1995.488968
 Lyapunov (1992) Aleksandr M. Lyapunov. 1992. The general problem of the stability of motion. Internat. J. Control 55, 3 (1992), 531–534. https://doi.org/10.1080/00207179208934253
 Lymburn et al. (2019) Thomas Lymburn, Alexander Khor, Thomas Stemler, et al. 2019. Consistency in Echo-State Networks. Chaos: An Interdisciplinary Journal of Nonlinear Science 29, 2 (2019), 023118. https://doi.org/10.1063/1.5079686
 Ma et al. (2017) Qianli Ma, Lifeng Shen, and Garrison W. Cottrell. 2017. DeepESN: A Multiple Projection-encoding Hierarchical Reservoir Computing Framework. CoRR abs/1711.05255 (2017), 15. arXiv:cs/1711.05255 http://arxiv.org/abs/1711.05255
 Maass et al. (2002) Wolfgang Maass, Thomas Natschläger, and Henry Markram. 2002. RealTime Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations. Neural Computation 14, 11 (2002), 2531–2560. https://doi.org/10.1162/089976602760407955
 Mackey and Glass (1977) Michael C. Mackey and Leon Glass. 1977. Oscillation and chaos in physiological control systems. Science 197, 4300 (1977), 287–289. https://doi.org/10.1126/science.267326
 Malik et al. (2016) Zeeshan K. Malik, Amir Hussain, and Qingming J. Wu. 2016. Multilayered Echo State Machine: A Novel Architecture and Algorithm. IEEE Transactions on Cybernetics 47, 4 (June 2016), 946–959. https://doi.org/10.1109/TCYB.2016.2533545
 McKinney (2010) Wes McKinney. 2010. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference (2010), Stéfan van der Walt and Jarrod Millman (Eds.). SciPy, Austin, TX, 51–56.
 Miranda (2018) Lester James V. Miranda. 2018. PySwarms, a ResearchToolkit for Particle Swarm Optimization in Python. Journal of Open Source Software 3, 21 (2018), 433. https://doi.org/10.21105/joss.00433
 Oliphant (2006) Travis E. Oliphant. 2006. A Guide to NumPy. Vol. 1. Trelgol Publishing, USA.
 Otsu (1979) Nobuyuki Otsu. 1979. A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics 9, 1 (Jan. 1979), 62–66. https://doi.org/10.1109/TSMC.1979.4310076
 Ozturk et al. (2007) Mustafa C. Ozturk, Dongming Xu, and José Carlos Príncipe. 2007. Analysis and Design of Echo State Networks. Neural Computation 19, 1 (2007), 111–138. https://doi.org/10.1162/neco.2007.19.1.111
 Pearson and Galton (1895) Karl Pearson and Francis Galton. 1895. VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London 58, 347–352 (1895), 240–242. https://doi.org/10.1098/rspl.1895.0041
 Poliner and Ellis (2006) Graham E. Poliner and Daniel P. W. Ellis. 2006. A Discriminative Model for Polyphonic Piano Transcription. EURASIP Journal on Advances in Signal Processing 2007, 1 (2006), 48317. https://doi.org/10.1155/2007/48317
 Schrauwen et al. (2008) Benjamin Schrauwen, Marion Wardermann, David Verstraeten, Jochen J. Steil, and Dirk Stroobandt. 2008. Improving reservoirs using intrinsic plasticity. Neurocomputing 71, 7–9 (2008), 1159–1171. https://doi.org/10.1016/j.neucom.2007.12.020
 Shepherd (1990) Gordon M. Shepherd. 1990. The Synaptic Organization of the Brain (3rd ed.). Oxford University Press, New York, NY, USA.
 Soures (2017) Nicholas M. Soures. 2017. Deep Liquid State Machines with Neural Plasticity and On-Device Learning. Master’s thesis. Rochester Institute of Technology.
 Soures et al. (2017b) Nicholas M. Soures, Lydia Hays, Eric Bohannon, Abdullah M. Zyarah, and Dhireesha Kudithipudi. 2017b. On-Device STDP and Synaptic Normalization for Neuromemristive Spiking Neural Network. In Proceedings of the 60th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, Boston, MA, USA, 1081–1084. https://doi.org/10.1109/MWSCAS.2017.8053115
 Soures et al. (2017a) Nicholas M. Soures, Lydia Hays, and Dhireesha Kudithipudi. 2017a. Robustness of a memristor based liquid state machine. In Proceedings of the International Joint Conference on Neural Networks, IJCNN. IEEE, Anchorage, AK, USA, 2414–2420. https://doi.org/10.1109/IJCNN.2017.7966149
 Waskom et al. (2018) Michael Waskom, Olga Botvinnik, Drew O’Kane, et al. 2018. mwaskom/seaborn: v0.9.0 (July 2018). Zenodo. https://doi.org/10.5281/zenodo.1313201
 Yamazaki and Tanaka (2007) Tadashi Yamazaki and Shigeru Tanaka. 2007. The cerebellum as a liquid state machine. Neural Networks 20, 3 (2007), 290–297. https://doi.org/10.1016/j.neunet.2007.04.004
 Yusoff et al. (2016) Mohd-Hanif Yusoff, Joseph Chrol-Cannon, and Yaochu Jin. 2016. Modeling neural plasticity in echo state networks for classification and regression. Information Sciences 364–365 (2016), 184–196. https://doi.org/10.1016/j.ins.2015.11.017
Appendix A Additional Results
We additionally construct heatmaps showing the impact of topology on NRMSE, shown in Figure 10. The result exhibits only a moderate correlation with NRMSE, which lends limited support to the “edge of chaos” hypothesis. However, there is a clear trend between NRMSE and depth: the error extends outward radially from the bottom-right corner of Figure 9(a). More significantly, depth correlates with NRMSE, with deeper networks giving lower errors. Wider networks also consistently yield lower errors in this experiment.
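The error and correlation measures behind this analysis can be sketched briefly. The sketch below assumes NRMSE normalized by the standard deviation of the target (one of several common conventions) and uses Pearson’s r for the depth–error relationship; the depth and error values are hypothetical, for illustration only.

```python
import numpy as np
from scipy.stats import pearsonr

def nrmse(y_true, y_pred):
    """Root-mean-square error normalized by the target's standard deviation."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.std(y_true)

# Hypothetical per-configuration results: reservoir depth vs. measured NRMSE.
depths = np.array([1, 2, 3, 4, 5, 6], dtype=float)
errors = np.array([0.52, 0.47, 0.41, 0.38, 0.36, 0.35])

# Pearson's r between depth and NRMSE; a negative r indicates that
# deeper networks tend to achieve lower errors.
r, p = pearsonr(depths, errors)
```

A perfect prediction gives an NRMSE of zero, and a negative r here would correspond to the trend reported above, where deeper networks yield lower errors.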