1 Introduction
Echo State Networks (ESNs) belong to the class of computational dynamical systems implemented according to the so-called reservoir computing approach [39]. An input signal is fed to a large, recurrent and randomly connected dynamic hidden layer, the reservoir, whose outputs are combined by a memoryless layer, called readout, to solve a specified task. Contrary to most hard computing approaches, which demand long training procedures to learn model parameters through an optimization algorithm [27], an ESN is characterized by a very fast learning procedure that usually consists in solving a convex optimization problem.
ESNs have been adopted in a variety of different contexts, such as static classification [1], speech recognition [50], intrusion detection [22], adaptive control [23], harmonic distortion measurements [41] and, in general, for modeling various kinds of nonlinear dynamical systems [24]. The application field where ESNs have been used the most is the prediction of real-valued time series relative, for example, to telephonic or electric load, where the forecast is usually performed 1 hour and 24 hours ahead [14, 15, 6, 54, 44]. Outstanding results have also been achieved by ESNs in the prediction of chaotic time series [36, 31], which highlighted the capability of these neural networks to learn amazingly accurate models of a chaotic process from almost noise-free training data.
Although a large reservoir could capture the dynamics of the underlying system more accurately, it results in a model of increased complexity, with an inherent risk of overfitting that leads to lower generalization capabilities. Additionally, several regression methods adopted to train the readout layer could be affected by the curse of dimensionality in case of high-dimensional data, which could also increase both the computational requirements in software and the resources needed in hardware [7]. Several tasks in signal processing and machine learning applications have been tackled by evaluating regression functions [58], performing classification [13] or finding neighbors [28] in a reduced-dimensional space. In fact, in many cases it is possible to maintain meaningful distance relationships between the original data while dealing with the curse of dimensionality. Through dimensionality reduction, redundant features are removed, noise can be filtered and algorithms that are unfit for a large number of dimensions become applicable. In the ESN literature, different methods have been proposed to increase the generalization ability of the network and to regularize the output. For example, in [16] the authors propose a form of regularization by shrinking the weights of the connections from the reservoir to the readout layer. In [47], by pruning some connections from the reservoir to the readout layer, better generalization capabilities are achieved, along with some insight on which neurons are actually useful for the output, providing clues on how to create a good reservoir.
In this paper we propose a novel framework for training an ESN, where an additional computational block is introduced to process the output of the internal reservoir before it is fed into the readout layer. In particular, the internal state of the network is mapped to a properly chosen lower dimensional subspace, using both linear and nonlinear transformations. Accordingly, we are able to use a large reservoir to capture the dynamics of the underlying system, while increasing the generalization capabilities of the model thanks to the implicit regularization constraints provided by dimensionality reduction in the recurrent layer. Although additional operations are introduced to compute the reduced-dimensionality embedding, training the readout layer becomes less demanding, especially for regression methods whose computational complexity depends on the input dimension [17]. With the proposed procedure we improve the generalization capabilities of the network, achieving better results on well-known benchmark problems with respect to the standard ESN architecture. Additionally, in cases where data can be mapped to spaces with 2 or 3 dimensions, the internal network dynamics can be visualized precisely and relevant patterns can be detected. To justify the results obtained and to understand the mechanisms that determine the effectiveness of the proposed system, we provide a theoretical study based on methods from the field of nonlinear time-series analysis. To the best of the authors' knowledge, the coupling of dimensionality reduction with the ESN architecture has not been explored before.

The remainder of the paper is organized as follows. In Sect. 2 we describe the ESN structure along with existing approaches for its training, and we review the dimensionality reduction methods adopted in this work. In Sect. 3 we present our proposed architecture, providing implementation details. In Sect. 4 we describe the datasets used to test our system, the experimental settings adopted and the performance reached on several prediction problems. In Sect. 5 we analyze the results and the functioning of our system through the perspective of nonlinear time-series analysis. Finally, in Sect. 6 we draw our conclusions.
2 Background material
In the following, we briefly review the methodologies adopted in our framework. We first describe the classic ESN architecture and two effective approaches adopted for its training. We then summarize two well-known methods used for reducing the dimensionality of the data and for mapping them into a smaller subspace.
2.1 Echo State Network
An ESN consists of a large, untrained recurrent layer of nonlinear units and a linear, memoryless readout layer, usually trained with a linear regression. A visual representation of an ESN is reported in Fig. 1. The equations describing the ESN state update and output are, respectively, defined as follows:
$$\mathbf{x}[k] = f\left( \mathbf{W}_r \mathbf{x}[k-1] + \mathbf{W}_i \mathbf{u}[k] + \mathbf{W}_f \mathbf{y}[k-1] + \boldsymbol{\xi}[k] \right), \qquad (1)$$

$$\mathbf{y}[k] = \mathbf{W}_o \mathbf{x}[k], \qquad (2)$$

where $\boldsymbol{\xi}[k]$ is a small i.i.d. noise term. The reservoir consists of $N_r$ neurons characterized by a transfer/activation function $f(\cdot)$, typically implemented as a hyperbolic tangent function. At time instant $k$, the network is driven by the input signal $\mathbf{u}[k] \in \mathbb{R}^{N_i}$ and generates the output $\mathbf{y}[k] \in \mathbb{R}^{N_o}$, $N_i$ and $N_o$ being the dimensionality of input and output, respectively. The vector $\mathbf{x}[k] \in \mathbb{R}^{N_r}$ describes the ESN (instantaneous) state. The weight matrices $\mathbf{W}_r \in \mathbb{R}^{N_r \times N_r}$ (reservoir connections), $\mathbf{W}_i \in \mathbb{R}^{N_r \times N_i}$ (input-to-reservoir) and $\mathbf{W}_f \in \mathbb{R}^{N_r \times N_o}$ (output-to-reservoir feedback) contain real values in the $[-1, 1]$ interval, sampled from a uniform distribution.
According to the ESN theory, the reservoir must satisfy the so-called "echo state property" (ESP) [39]. This guarantees that the effect of a given input on the state of the reservoir vanishes in a finite number of time steps. A widely used rule of thumb suggests rescaling the matrix $\mathbf{W}_r$ to have $\rho(\mathbf{W}_r) < 1$, where $\rho(\cdot)$ denotes the spectral radius, but several theoretically founded approaches have been proposed in the literature to properly tune $\rho$ in an ESN driven by a specific input [9, 8, 55].
The readout matrix $\mathbf{W}_o$, instead, is optimized for the task at hand. To determine it, let us consider the training sequence of $T_{tr}$ desired input-output pairs, given by:

$$\left\{ \left( \mathbf{u}[1], \mathbf{y}^*[1] \right), \dots, \left( \mathbf{u}[T_{tr}], \mathbf{y}^*[T_{tr}] \right) \right\}. \qquad (3)$$

In the initial phase of training, called state harvesting, the inputs are fed to the reservoir in accordance with Eq. 1, producing a sequence of internal states $\mathbf{x}[1], \dots, \mathbf{x}[T_{tr}]$. Since, by definition, the outputs of the ESN are not yet available for feedback, according to the teacher forcing procedure the desired output $\mathbf{y}^*[k]$ is used instead in Eq. 1. The states are stacked in a matrix $\mathbf{X} \in \mathbb{R}^{T_{tr} \times N_r}$ and the desired outputs in a vector $\mathbf{y}^* \in \mathbb{R}^{T_{tr}}$.
The initial rows of $\mathbf{X}$ and $\mathbf{y}^*$ are the washout elements and should be discarded, since they refer to a transient phase in the ESN's behavior.
Since the gain of the sigmoid nonlinearity in the neurons is largest around the origin, three coefficients $\omega_i$, $\omega_o$ and $\omega_f$ are used to scale the input, desired output and feedback signals, respectively. In this way, it is possible to control the amount of nonlinearity introduced by the processing units.
The training of the readout consists in solving a convex optimization problem, for which several closed-form solutions have been proposed in the literature. The standard procedure to train the readout, originally proposed in [29], consists in a regularized least-square regression, which can be easily computed through the Moore-Penrose pseudoinverse. However, to learn the optimal readout we also consider Support Vector Regression (SVR), a supervised learning model that can efficiently perform a nonlinear separation of data, using a kernel function to map the inputs into high-dimensional feature spaces where they are linearly separable [11].

Ridge Regression:
to train the readout with a linear regressor we adopt ridge regression, whose solution can be computed by solving the following regularized least-square problem:

$$\mathbf{W}_o = \arg\min_{\mathbf{W}} \; \frac{1}{2} \left\| \mathbf{X}\mathbf{W} - \mathbf{y}^* \right\|^2 + \frac{\lambda}{2} \left\| \mathbf{W} \right\|^2 = \left( \mathbf{X}^T \mathbf{X} + \lambda \mathbf{I} \right)^{-1} \mathbf{X}^T \mathbf{y}^*, \qquad (4)$$

where $\mathbf{I}$ is the identity matrix and $\lambda \ge 0$ is the regularization coefficient.
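A minimal sketch of this closed-form readout training; the state matrix and targets below are randomly generated stand-ins for harvested reservoir states:

```python
import numpy as np

def train_readout_ridge(X, y, lam=1e-3):
    """Closed-form ridge solution W_o = (X^T X + lam I)^{-1} X^T y (Eq. 4)."""
    N = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ y)

# Toy check: recover a linear readout from noiseless synthetic "states".
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))   # 200 harvested states, 10 "neurons"
w_true = rng.standard_normal(10)
y = X @ w_true
w_hat = train_readout_ridge(X, y, lam=1e-8)
```

With a negligible `lam` and noiseless targets the recovered weights coincide with the generating ones; in practice `lam` is tuned by cross-validation.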
Support Vector Regression:
we adopt an SVR [49] with a Gaussian kernel, initially proposed in [7] as a method for readout training. In this case, the ESN acts as a preprocessor for an SVR kernel, and their combination can be seen as an adaptive kernel, capable of learning a task-specific time dependency. The state $\mathbf{x}[k]$ is projected to a higher dimensional feature space $\phi(\mathbf{x}[k])$, and the SVR is applied on the resulting space. The dual optimization problem can be written as:

$$\begin{aligned} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}^*} \; & -\frac{1}{2} \sum_{i,j} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K_{ij} - \epsilon \sum_i (\alpha_i + \alpha_i^*) + \sum_i y^*[i] (\alpha_i - \alpha_i^*) \\ \text{s.t.} \; & \sum_i (\alpha_i - \alpha_i^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C, \end{aligned} \qquad (5)$$

where each entry $K_{ij}$ is given by $K(\mathbf{x}[i], \mathbf{x}[j])$, with $K(\cdot, \cdot)$ being a reproducing Gaussian kernel associated to the feature mapping, given by $K(\mathbf{x}, \mathbf{x}') = \exp\left( -\| \mathbf{x} - \mathbf{x}' \|^2 / 2\sigma^2 \right)$, where $\sigma$ is denoted as the scale parameter.

By an extension of the representer theorem, the output of the ESN at a generic time instant $k$ is in this case given by:

$$y[k] = \sum_i (\alpha_i - \alpha_i^*) K(\mathbf{x}[i], \mathbf{x}[k]) + b, \qquad (6)$$

where $\alpha_i$ and $\alpha_i^*$ are the entries of the optimal solution to problem (5), and they are nonzero only for patterns that are support vectors.
2.2 Dimensionality reduction methods
In the following, we describe the dimensionality reduction techniques implemented in our framework. First of all, we underline that several approaches can be followed for reducing the dimensionality of the data and for learning the underlying manifold on a subspace of the data space [53, 33, 34]. In this work, we limit our analysis to two well-known and effective, yet simple procedures, namely Principal Component Analysis (PCA) [26] and kernel Principal Component Analysis (kPCA) [48].

PCA is a statistically motivated method, which projects the data onto an orthonormal basis that preserves most of the variance in the input signal, while ensuring that the individual components are uncorrelated. These basis vectors are called the principal components. Let $\mathbf{x} \in \mathbb{R}^N$ be a random vector and let $\mathbf{C} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^T$ be its covariance matrix, where $\mathbf{U}$ and $\boldsymbol{\Lambda}$ are the orthogonal eigenvector matrix and the diagonal eigenvalue matrix, respectively. Then the linear transformation $\mathbf{z} = \mathbf{U}^T \mathbf{x}$ ensures that the covariance matrix of $\mathbf{z}$ is $\boldsymbol{\Lambda}$, which clearly implies that the components of $\mathbf{z}$ are uncorrelated. We also see that

$$\mathrm{var}(z_i) = \lambda_i, \quad i = 1, \dots, N. \qquad (7)$$

To reduce the dimensionality to $d$ dimensions, we project the data onto the $d$ eigenvectors with the largest eigenvalues. That is, $\bar{\mathbf{z}} = \bar{\mathbf{U}}^T \mathbf{x}$, where $\bar{\mathbf{U}} \in \mathbb{R}^{N \times d}$ is the truncated eigenvector matrix associated with the $d$ largest eigenvalues $\lambda_1 \ge \dots \ge \lambda_d$. According to Eq. 7, this ensures that $\bar{\mathbf{z}}$ preserves most of the variance of $\mathbf{x}$.
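The truncated projection can be sketched as follows; the 2-dimensional toy data and the retained dimension d = 1 are illustrative choices:

```python
import numpy as np

def pca_fit(X, d):
    """Return the mean and the top-d eigenvectors of the sample covariance of X
    (rows = samples)."""
    mu = X.mean(axis=0)
    C = np.cov(X - mu, rowvar=False)
    eigval, eigvec = np.linalg.eigh(C)   # eigh returns ascending order
    order = np.argsort(eigval)[::-1]     # sort eigenvalues descending
    return mu, eigvec[:, order[:d]]

def pca_project(X, mu, U_d):
    """Project (possibly out-of-sample) data onto the retained components."""
    return (X - mu) @ U_d

# Elongated Gaussian cloud: almost all variance lies along the first axis.
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 2)) @ np.diag([3.0, 0.3])
mu, U = pca_fit(X, d=1)
Z = pca_project(X, mu, U)
```

Because projection only needs the stored mean and basis, new points can be mapped with the same `pca_project` call, which is the out-of-sample extension used later in the framework.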
Kernel Principal Component Analysis (kPCA) is a nonlinear extension of PCA. Given a valid positive semidefinite (psd) Mercer kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$, where $\phi(\cdot)$ is some nonlinear mapping from the input space to a Hilbert space $\mathcal{H}$, kernel PCA implicitly performs PCA in $\mathcal{H}$.

Let $\mathbf{K} \in \mathbb{R}^{n \times n}$, where $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$, be the kernel matrix and let $\mathbf{A}$ and $\boldsymbol{\Lambda}$ be its eigenvector and eigenvalue matrix, respectively, with the eigenvalues sorted in descending order. Then the projection of the in-sample data onto the principal components in $\mathcal{H}$ is given by

$$\mathbf{Z} = \mathbf{A} \boldsymbol{\Lambda}^{1/2}. \qquad (8)$$

The out-of-sample approximation for the projection of a data point $\mathbf{x}$ onto the $\ell$-th principal component is given by

$$z_\ell(\mathbf{x}) = \frac{1}{\sqrt{\lambda_\ell}} \sum_{i=1}^{n} a_{\ell i} K(\mathbf{x}_i, \mathbf{x}). \qquad (9)$$

Just like canonical PCA, to perform dimensionality reduction with kPCA one needs to use the truncated eigenvector and eigenvalue matrices in Eq. 8 and Eq. 9. The kernel function that is commonly used in practice is the Gaussian kernel, given by $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\| \mathbf{x}_i - \mathbf{x}_j \|^2 / 2\sigma^2 \right)$, where $\sigma$ controls the width of the kernel.
Both the PCA and kPCA methods admit an out-of-sample extension, a feature which is required in our framework, as discussed later.
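A compact sketch of kPCA with the out-of-sample projection of Eq. 9, on synthetic data; centering of the kernel matrix is omitted for brevity, and the kernel width is an arbitrary choice:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Pairwise Gaussian kernel between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kpca_fit(X, d, sigma):
    """Eigendecompose the (uncentered, for brevity) kernel matrix; keep top d."""
    K = gaussian_kernel(X, X, sigma)
    eigval, eigvec = np.linalg.eigh(K)
    order = np.argsort(eigval)[::-1][:d]
    return eigvec[:, order], eigval[order]

def kpca_project(Xnew, X, A, lam, sigma):
    """Out-of-sample projection (Eq. 9): z_l(x) = lam_l^{-1/2} sum_i a_li K(x_i, x)."""
    Knew = gaussian_kernel(Xnew, X, sigma)   # shape (n_new, n_train)
    return Knew @ A / np.sqrt(lam)

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 5))
A, lam = kpca_fit(X, d=3, sigma=2.0)
Z_in = kpca_project(X, X, A, lam, sigma=2.0)   # in-sample case reduces to Eq. 8
```

Applying the out-of-sample formula to the training points themselves reproduces Eq. 8, i.e. `Z_in` equals $\mathbf{A}\boldsymbol{\Lambda}^{1/2}$, which is a convenient consistency check.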
3 Proposed architecture
In this section, we describe the architecture of the proposed framework in detail.
The large size of the reservoir, specified by the number of hidden neurons, is one of the main features that determine the effectiveness of the reservoir computing paradigm. Thanks to the large number of neurons, the internal recurrent connections in the reservoir are capable of generating rich and heterogeneous dynamics for solving complex memory-dependent tasks. However, as the size of the reservoir increases, so does the complexity of the model, with a consequent risk of overfitting caused by a reduced generalization capability [4]. Dimensionality reduction and manifold learning are techniques that allow us to diminish the variance in the data and to introduce a bias, which can reduce the expected prediction error [19]. In the proposed architecture, we use a large reservoir in order to capture the dynamics of the underlying unknown process and then, through a dimensionality reduction procedure, we enforce regularization constraints to increase the generalization capability of our model. Another important consequence of reducing the dimensionality of the reservoir is that complex regression methods benefit from a reduced computational complexity when the internal states are described by a lower number of variables. Additionally, several methods used to identify, in an unsupervised way, the hyperparameter configurations that maximize the computational capabilities of the network require computationally demanding analysis procedures [8, 38, 9]. These procedures would greatly benefit from the simplification offered by our proposed architecture.

At each time step $k$, the vector $\mathbf{x}[k] \in \mathbb{R}^{N_r}$ that represents the internal state of the reservoir is mapped into a lower dimensional space by a projector $P: \mathbb{R}^{N_r} \to \mathbb{R}^{d}$. The new $d$-dimensional state vector $\bar{\mathbf{x}}[k]$ is then processed by the readout to compute the predicted value $y[k]$.
To train our system, the time series is split in three contiguous parts, namely the training, validation and test sets. Since we deal with time-series prediction problems, each set contains coupled real values, which represent the input value and the ground truth of the associated prediction; for example, the training set contains pairs of an input sample and the target value to be predicted. The regression function in the readout is implemented according to one of the two procedures described in Sect. 2.1, and the model parameters are learned on the training data. The system depends on several hyperparameters, which affect the network's behavior and must be carefully tuned on the specific problem at hand by performing a cross-validation procedure on the validation set, with a method whose details are provided in the next section.
Once the model has been trained, each new element of the test set is processed and the relative internal reservoir state is generated. Subsequently, its projection onto the reduced-dimensionality subspace is evaluated using a suitable out-of-sample approximation. In the case of PCA, this can be done by projecting onto the basis defined by the covariance matrix computed on the states relative to the elements of the training set, which are collected in the matrix $\mathbf{X}$ during the training phase. For kPCA it is possible to use the Nyström approximation [2], which specifies an interpolating function for determining the values of out-of-sample data points.
A schematic representation of the whole procedure is depicted in Fig. 2.
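The whole pipeline (state harvesting, projection, readout training) can be sketched end-to-end on a toy one-step-ahead sine prediction task. All sizes, the absence of feedback and noise, and the use of an SVD for the PCA step are simplifying assumptions of this sketch, not choices made in the paper:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy task: one-step-ahead prediction of a sine wave (illustrative only).
t = np.arange(600)
u = np.sin(0.1 * t)

# 1) Harvest reservoir states (no feedback or noise, for brevity).
N_r = 200
W_r = rng.uniform(-1, 1, (N_r, N_r))
W_r *= 0.9 / max(abs(np.linalg.eigvals(W_r)))   # echo state property rule of thumb
W_i = rng.uniform(-1, 1, N_r)
X = np.zeros((len(u), N_r))
for k in range(1, len(u)):
    X[k] = np.tanh(W_r @ X[k - 1] + W_i * u[k])
X, y = X[100:-1], u[101:]       # drop washout, align one-step-ahead targets

# 2) Reduce the dimensionality of the states with PCA (d = 10), via SVD.
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
Z = (X - mu) @ Vt[:10].T

# 3) Train a ridge readout on the reduced states.
lam = 1e-6
w = np.linalg.solve(Z.T @ Z + lam * np.eye(10), Z.T @ y)
y_hat = Z @ w
nrmse = np.sqrt(np.mean((y - y_hat) ** 2) / np.var(y))
```

Even with only 10 retained components out of 200 neurons, the readout fits this simple signal accurately, which is the behavior the proposed architecture exploits on the real benchmarks.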
3.1 Hyperparameter optimization
The set of hyperparameters $\theta$ that control the architecture of the ESN, the regression in the readout and the dimensionality reduction procedure is optimized by minimizing a loss function $\mathcal{L}(\theta)$, defined as

$$\mathcal{L}(\theta) = \mathrm{E}(\theta) + w \, \frac{d}{N_r}, \qquad (10)$$

where $\mathrm{E}(\theta)$ is the prediction error on the validation set, $d$ is the hyperparameter that defines the number of dimensions of the new subspace and $w$ is a weight parameter. In order to lower the complexity of the model, $\mathcal{L}$ jointly penalizes the prediction error on the validation set and the number of dimensions retained after the dimensionality reduction.
The loss function is minimized using a standard genetic algorithm (GA) with Gaussian mutation, random crossover, elitism and tournament selection [51]. The hyperparameter optimization is performed on the validation set; the best individual found is stored and subsequently used to configure the network during the training phase. A schematic description of the training procedure is depicted in Fig. 3.

4 Experiments
The component of the loss function (Eq. 10) relative to the error on the given task is implemented by the Normalized Root Mean Squared Error (NRMSE):

$$\mathrm{NRMSE} = \sqrt{ \frac{ \left\langle \| \mathbf{y} - \mathbf{y}^* \|^2 \right\rangle }{ \left\langle \| \mathbf{y}^* - \langle \mathbf{y}^* \rangle \|^2 \right\rangle } },$$

where $\mathbf{y}$ is the ESN prediction and $\mathbf{y}^*$ the desired/teacher output.
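In code, this error measure (normalization by the variance of the target, one common convention) amounts to:

```python
import numpy as np

def nrmse(y_pred, y_true):
    """Root mean squared error normalized by the standard deviation of the target."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2) / np.var(y_true))

y_true = np.array([0.0, 1.0, 2.0, 3.0])
perfect = nrmse(y_true, y_true)                   # exact prediction scores 0
baseline = nrmse(np.full(4, y_true.mean()), y_true)  # mean predictor scores 1
```

A score of 0 means a perfect prediction, while a score of 1 is achieved by the trivial predictor that always outputs the target mean.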
The GA uses a population size of 50 individuals and evaluates 20 generations. The individuals are mutated and bred at each generation with fixed mutation and crossover probabilities. The individuals in the next generation are selected by a tournament strategy with a tournament size of 4 individuals. The bounds for all parameters are shown in Tab. 1. The weight parameter in the loss function (Eq. 10) is set to 0.1.

Due to the stochastic nature of the ESN, which is a consequence of the random initialization of the weight matrices $\mathbf{W}_r$, $\mathbf{W}_i$ and $\mathbf{W}_f$, each individual is evaluated on the validation set using 5 networks initialized with different weight parameters. The fitness is then given by the NRMSE, averaged over these 5 networks. Once the optimal set of parameters has been found, we predict the values of the test set using 32 randomly initialized networks configured with the same set of optimal parameters.



min  100  0.0  0.1  0.1  0.0  0.5  0.001  0.001  0.001  0.001  0.001 
max  500  0.1  0.9  0.9  0.6  1.4  1.0  0.1  1.0  10.0  1.0 
5  0.01  0.08  0.08  0.06  0.09  0.1  0.01  0.1  1.0  0.1  

4.1 Datasets description
To test our system, we consider three benchmark tasks commonly used in time-series forecasting, namely the prediction of the Mackey-Glass time series, of the multiple superimposed oscillator and of the NARMA signal. These forecasting problems have different levels of difficulty, given by the nature of the signal and the complexity of the prediction task. According to a commonly used approach [32], in each prediction task we set the forecast step $\tau_f$ by computing a statistic that measures the independence of separated points in the time series. One usually wants the smallest $\tau_f$ that guarantees the measurements to be decorrelated. Hence, we considered the first zero of the autocorrelation function of the time series, which yields the smallest $\tau_f$ that maximizes the linear independence between the samples. Alternatively, it is possible to choose the forecast step by considering more general forms of independence, such as the first local minimum of the average mutual information [18] or of the correlation sum [37].
Mackey-Glass time series:
the input signal is generated from the Mackey-Glass (MG) time-delay differential system, described by the following equation:

$$\frac{dx}{dt} = \alpha \, \frac{x(t - \tau_{MG})}{1 + x(t - \tau_{MG})^{10}} - \beta \, x(t). \qquad (4.1)$$

We generated a time series of 150000 time steps, integrating (4.1) with a step size of 0.1.
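A simple Euler integration of the MG system can be sketched as follows; the parameter values tau = 17, a = 0.2, b = 0.1 and the constant initial history x = 1.2 are the common benchmark choices, assumed here rather than taken from the paper:

```python
import numpy as np

def mackey_glass(n_steps, tau=17.0, a=0.2, b=0.1, dt=0.1, x0=1.2):
    """Euler integration of dx/dt = a*x(t-tau)/(1+x(t-tau)^10) - b*x(t).
    The history buffer is initialized to the constant x0."""
    hist = int(tau / dt)                  # number of past samples covering the delay
    x = np.full(hist + n_steps, x0)
    for k in range(hist, hist + n_steps - 1):
        x_tau = x[k - hist]               # delayed value x(t - tau)
        x[k + 1] = x[k] + dt * (a * x_tau / (1 + x_tau ** 10) - b * x[k])
    return x[hist:]

series = mackey_glass(5000)
```

With this parameterization the trajectory settles onto a bounded chaotic oscillation roughly within [0.2, 1.4].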
NARMA signal:
the Non-Linear Auto-Regressive Moving Average (NARMA) task, originally proposed in [30], consists in modeling the output of the following $r$-th order system:

$$y[k+1] = 0.3 \, y[k] + 0.05 \, y[k] \sum_{i=0}^{r-1} y[k-i] + 1.5 \, u[k-r+1] \, u[k] + 0.1.$$

The input to the system $u[k]$ is uniform random noise in [0, 1], and the model is trained to reproduce $y[k+1]$. The NARMA task is known to require a memory of at least $r$ past time steps, since the output is determined by the current input and the outputs relative to the last $r$ time steps. In our test we fix the order $r$.
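The recursion can be generated as follows. The order r = 10 and the rescaling of the input to [0, 0.5] (often used in practice to keep the recursion bounded) are assumptions of this sketch:

```python
import numpy as np

def narma(u, r=10):
    """r-th order NARMA recursion: the next output depends on the current input,
    an old input, and the last r outputs (r = 10 is an assumed example order)."""
    y = np.zeros(len(u))
    for k in range(r, len(u) - 1):
        y[k + 1] = (0.3 * y[k]
                    + 0.05 * y[k] * y[k - r + 1:k + 1].sum()
                    + 1.5 * u[k - r + 1] * u[k]
                    + 0.1)
    return y

rng = np.random.default_rng(5)
u = rng.uniform(0, 0.5, 2000)   # input range assumed here for numerical stability
y = narma(u)
```

The model is then trained to map the input sequence `u` to the target sequence `y`, which requires remembering at least the last `r` steps.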
Multiple superimposed oscillator:
the prediction of a sinusoidal signal is a relatively simple task, which demands a minimum amount of memory to determine the next network output. However, superimposed sine waves with incommensurable frequencies are much harder to predict, since the resulting wavelength can be extremely long. The signal we consider is the multiple superimposed oscillator (MSO), studied in [31] and defined as:

$$y[k] = \sum_{i} \sin(\varphi_i k),$$

with incommensurable frequencies $\varphi_i$. ESNs struggle to solve this task, since the neurons in the reservoir tend to couple, while the task requires the simultaneous existence of multiple decoupled internal states [56].
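For illustration, an MSO signal can be generated with a frequency set frequently used in the literature; the specific values below are an assumed choice, since only the incommensurability of the frequencies matters:

```python
import numpy as np

# Incommensurable frequencies commonly used for the MSO benchmark (assumed here).
phi = [0.2, 0.311, 0.42, 0.51, 0.63, 0.74, 0.85, 0.97]

k = np.arange(10000)
mso = sum(np.sin(p * k) for p in phi)   # superposition of 8 sinusoids
```

Because the frequency ratios are irrational for practical purposes, the superposition does not repeat over any short window, which is what makes the task memory-demanding.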
4.2 Results
The averaged prediction results and the standard deviations are reported in Tab. 2. The convergence rate during the optimization of the hyperparameters for each method, expressed as the NRMSE on the validation set, is depicted in Fig. 4.

The prediction of MG is a rather simple task, and each model manages to achieve high forecast accuracy. However, by applying dimensionality reduction to the states of the reservoir, it is possible to lower the error by one or more orders of magnitude. The standard deviation of the prediction error also decreases, especially in the models using kPCA. The best results are achieved by SVR + PCA and SVR + kPCA, while using SVR without reducing the dimensionality of the reservoir proved to be less effective. This means that nonlinearity benefits the training, but without enforcing the regularization constraint the complexity of the model is too high to fit the test points well. As we can see, in every case the number of dimensions retained by both PCA and kPCA is much lower than the optimal number of neurons identified. This underlines the effectiveness of the regularization conveyed by our architecture. Fig. 4(a) shows that the model implementing ridge regression + kPCA achieves the slowest convergence during the cross-validation step. However, thanks to the generalization power provided by the nonlinear dimensionality reduction, its test error is lower than that of the other models whose readout is trained with ridge regression.
In the NARMA prediction task, the best result is achieved by training the readout function with SVR on a reservoir output whose dimensionality is reduced by kPCA. NARMA is a more complex task, which requires a higher amount of nonlinearity to be solved. This is clearly reflected by the results, which improve as more nonlinearity is introduced to learn the function, both in the readout training and in the dimensionality reduction procedure. At the same time, the bias introduced by the regularization significantly enhances the generalization capability of the network. As for the number of dimensions of the optimal subspace, it is higher than in the MG task, except for the model implemented with ridge regression + PCA; in this latter case, however, we obtain the worst performance. Interestingly, from Fig. 4(b) we observe that kPCA has the slowest convergence, even if it is the best performing model in the testing phase. In this case, the dimensionality reduction introduces a bias that prevents the model from overfitting on the validation data, allowing it to develop a high predictive power. On the other hand, the model with SVR and no dimensionality reduction overfits on the validation data, with a consequent poor performance in the test phase.
Finally, in the MSO task the model with the highest prediction performance is SVR without dimensionality reduction. In this case, the signal to predict has an extremely long periodicity, which demands a high amount of memory in the network. Hence, the compression of the information through dimensionality reduction could hamper the memory capacity of the network. Furthermore, due to the long periodicity, the slice of the time series used to train the network can be quite different from the slice to be predicted in the test. Consequently, test points are projected onto a subspace which is not optimal, as the basis is learned from the training data. As expected, the number of dimensions kept after the dimensionality reduction is larger than in the other tasks. The need for a high degree of complexity is also denoted by the poor results obtained by using ridge regression in the readout training. From Fig. 4(c), we observe that the convergence is faster in models equipped with SVR, which obtain better results both in the validation and in the testing phase. This consistency between validation and test performance reflects the limited effectiveness of the regularization constraints for this task.

RT  DR  Error
MG
ridge reg.  –  378  1.22  0.0  0.55  0.212  0.25  –  –  0.625  –  –  –
ridge reg.  PCA  318  1.214  0.042  0.807  0.840  0.355  89  –  0.294  –  –  –
ridge reg.  kPCA  445  1.148  0.0  0.339  0.202  0.422  33  0.050  0.521  –  –  –
SVR  –  490  1.051  0.0  0.873  0.588  0.568  –  –  –  0.234  5.346  0.868
SVR  PCA  474  1.044  0.0  0.604  0.807  0.350  3  –  –  2.535  0.340  0.825
SVR  kPCA  466  0.738  0.0  0.373  0.664  0.069  14  0.059  –  7.975  0.448  0.299
NARMA
ridge reg.  –  406  1.031  0.053  0.19  0.231  0.194  –  –  0.163  –  –  –
ridge reg.  PCA  409  0.934  0.016  0.135  0.127  0.073  22  –  0.963  –  –  –
ridge reg.  kPCA  342  0.887  0.018  0.167  0.407  0.0  225  0.008  0.001  –  –  –
SVR  –  440  0.928  0.015  0.129  0.603  0.031  –  –  –  4.332  0.271  0.017
SVR  PCA  433  0.952  0.0  0.107  0.207  0.267  274  –  –  4.099  0.134  0.028
SVR  kPCA  460  0.962  0.002  0.1  0.302  0.037  163  0.028  –  4.281  0.752  0.420
MSO
ridge reg.  –  298  1.148  0.008  0.345  0.147  0.045  –  –  0.438  –  –  –
ridge reg.  PCA  499  1.141  0.017  0.187  0.184  0.309  438  –  0.601  –  –  –
ridge reg.  kPCA  454  1.053  0.003  0.137  0.169  0.184  407  0.040  0.117  –  –  –
SVR  –  444  1.0  0.001  0.114  0.1  0.09  –  –  –  3.714  0.282  0.037
SVR  PCA  459  1.147  0.001  0.174  0.144  0.276  225  –  –  6.373  0.38  0.601
SVR  kPCA  480  1.027  0.032  0.1  0.627  0.011  135  0.002  –  3.529  0.204  0.495

5 Discussion
To understand the mechanics and the effectiveness of the proposed architecture, we analyze the results through the theory of nonlinear time-series analysis, which offers powerful methods to retrieve dynamical information from time-ordered data [10]. The objective of time-series analysis is to reconstruct the full dynamics of a complex nonlinear dynamical system, starting from the measurement of only one of its variables. In fact, in many cases it is possible to observe only a subset of the components necessary to determine the time-evolution law which governs the dynamical system.

The main idea behind this analysis is that a dynamical system is completely described by the time-dependent trajectory in its phase space. Hence, a recurrent neural network capable of reconstructing the dynamic attractor with a high degree of accuracy can calculate future states of the system, given its state at any particular moment.
A frequently used method for phase space reconstruction is the delay-coordinate embedding, which provides an estimation of the attractor that is topologically identical to the true one. From this reconstruction, it is possible to infer several properties of the hidden dynamical system which are invariant under diffeomorphism. We refer to these measures as the dynamical invariants of the system. The most commonly studied are the fractal dimension of the attractor, the Lyapunov exponents and the Rényi entropy. In the following, we briefly introduce the delay-coordinate embedding procedure and two approaches used to estimate the aforementioned dynamical invariants. We refer the interested reader to [35, 20] for a comprehensive overview of these methods and many other aspects of time-series analysis.

Delay-coordinate embedding:
a dynamical system is characterized by a time-evolution law, which determines its trajectory in the phase space. Each specific state of the system at time $t$ is defined by a $p$-dimensional vector in the state space, $p$ being the number of variables of the system. The delay-coordinate embedding method allows such state vectors to be reconstructed from a time-discrete measurement of only one generic smooth function of the state space [42]. Given a discrete time series $\{x[k]\}$, evenly sampled at rate $\Delta t$, the embedding is defined as:

$$\mathbf{s}[k] = \sum_{i=1}^{m} x[k - (i-1)\tau] \, \mathbf{e}_i, \qquad (11)$$

where $m$ is the embedding dimension, $\tau$ is the time delay and $\{\mathbf{e}_i\}$ form an orthonormal basis in $\mathbb{R}^m$.

With a proper choice of the embedding parameters $m$ and $\tau$, Takens' theorem guarantees the existence of a diffeomorphism between the real and the reconstructed dynamics [52]. A sufficient condition for a correct reconstruction is $m > 2d$, where $d$ is the fractal dimension of the attractor. The value of $m$ is usually computed with the false nearest-neighbors algorithm [46], which provides an estimation of the smallest sufficient embedding dimension. On the other hand, a suitable time delay $\tau$ can be estimated by looking at the first zero of the autocorrelation function of $x$ or by relying on nonlinear time dependencies, such as the mutual information [12].
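The embedding of Eq. 11 amounts to stacking delayed copies of the measurement; the toy signal, m and tau below are illustrative:

```python
import numpy as np

def delay_embed(x, m, tau):
    """Eq. 11: each row stacks x[k], x[k+tau], ..., x[k+(m-1)tau]
    (the sign convention for the delays is a matter of convention)."""
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[i * tau: i * tau + n] for i in range(m)])

x = np.sin(0.05 * np.arange(1000))   # toy scalar measurement
S = delay_embed(x, m=3, tau=30)      # reconstructed 3-dimensional state vectors
```

For a pure sinusoid the embedded trajectory traces a closed loop in three dimensions, the reconstructed limit cycle.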
Correlation dimension:
the correlation dimension is a measure, invariant under diffeomorphism, that allows the similarity of geometrical objects to be quantified. Attractors of dissipative chaotic systems often exhibit complicated geometries (hence the name strange attractors), which are characterized by a fractal dimension, such as the Rényi dimension $D_2$ [45]. An efficient estimator of fractal dimensions is the Grassberger-Procaccia algorithm [21], which computes the correlation dimension $D_2$ through the correlation sum $C(\epsilon)$:

$$C(\epsilon) = \frac{2}{N_p} \sum_{i < j - t_w} \Theta\left( \epsilon - \left\| \mathbf{s}[i] - \mathbf{s}[j] \right\| \right), \qquad (12)$$

where $N_p$ is the number of pairs considered. The temporal spacing parameter $t_w$ is chosen to ensure temporal independence between samples, $\Theta(\cdot)$ is the Heaviside function and $\epsilon$ is the size of the small boxes used to cover the geometric shape of the attractor. For $N \to \infty$ and small $\epsilon$, $C(\epsilon) \propto \epsilon^{D_2}$.
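A direct (brute-force) implementation of the correlation sum is sketched below. As a sanity check, points uniform in the unit square have correlation dimension 2, so the log-log slope of C between two scales should be close to 2 (the scales, sample size and Theiler window are illustrative choices):

```python
import numpy as np

def correlation_sum(S, eps, w=10):
    """Fraction of pairs (i, j) with |i - j| > w closer than eps
    (w is the Theiler window ensuring temporal independence)."""
    n = len(S)
    d = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
    i, j = np.triu_indices(n, k=w + 1)
    return np.mean(d[i, j] < eps)

rng = np.random.default_rng(6)
S = rng.uniform(0, 1, (400, 2))            # i.i.d. points: w can safely be 0
eps = np.array([0.05, 0.1])
C = np.array([correlation_sum(S, e, w=0) for e in eps])
slope = np.log(C[1] / C[0]) / np.log(eps[1] / eps[0])   # estimates D2
```

On a real delay-embedded trajectory one would fit the slope over a range of scales in the scaling region rather than from two points.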
Lyapunov exponent:
the Lyapunov spectrum is another invariant measure that characterizes the predictability of a dynamical system. Lyapunov exponents quantify the rate of separation of two infinitesimally close trajectories and are closely related to the 2nd order Rényi entropy $h_2$, through $h_2 \le \sum_{\lambda_i > 0} \lambda_i$. This quantity measures the number of possible trajectories that the system can take for a given number of time steps in the future. A perfectly deterministic system can only evolve along one possible trajectory and hence $h_2 = 0$. In contrast, for purely stochastic systems the number of possible future trajectories increases to infinity, so $h_2 \to \infty$. Chaotic systems are characterized by a finite value of $h_2$, as the number of possible trajectories diverges, but not as fast as in the stochastic case.

The largest Lyapunov exponent (LLE) $\lambda_1$ is a good estimate of $h_2$, and its sign determines whether a system is chaotic or not. The so-called direct methods can be used to compute $\lambda_1$ by estimating the divergent motion of the reconstructed space, without fitting a model to the data [57, 43]. In particular, the average exponential growth of the distance between neighboring orbits can be studied on a logarithmic scale by monitoring the prediction error $p(\Delta n)$:

$$p(\Delta n) = \frac{1}{N} \sum_{k=1}^{N} \ln \left\| \mathbf{s}[k + \Delta n] - \mathbf{s}_{nn(k)}[k + \Delta n] \right\|, \qquad (13)$$

where $\mathbf{s}_{nn(k)}$ is the nearest neighbor of $\mathbf{s}[k]$ at time $k$. The LLE $\lambda_1$ is estimated as the slope of $p(\Delta n)$ for $\Delta n \le t_f$, where $t_f$ is the forecast horizon within which the divergence of the trajectory in the phase space is evaluated.
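A simplified Kantz-style estimator along the lines of Eq. 13 is sketched below, tested on the logistic map at r = 4, whose LLE is known to be ln 2 ≈ 0.693. The horizon, Theiler window and series length are illustrative choices:

```python
import numpy as np

def lle_direct(x, horizon=5, w=10):
    """Average log-distance growth of initially nearby points (cf. Eq. 13);
    the slope of the averaged curve estimates the largest Lyapunov exponent."""
    n = len(x) - horizon
    p = np.zeros(horizon + 1)
    count = 0
    for k in range(n):
        d = np.abs(x[:n] - x[k])
        d[max(0, k - w): k + w + 1] = np.inf   # exclude temporally close points
        nn = np.argmin(d)                      # nearest neighbor of x[k]
        if not np.isfinite(d[nn]) or d[nn] == 0:
            continue
        # log distance between the two trajectories over the forecast horizon
        p += np.log(np.abs(x[nn: nn + horizon + 1] - x[k: k + horizon + 1]) + 1e-12)
        count += 1
    p /= count
    return np.polyfit(np.arange(horizon + 1), p, 1)[0]   # slope ~ LLE

# Logistic map at r = 4: chaotic, with LLE = ln 2.
x = np.empty(3000)
x[0] = 0.3
for k in range(2999):
    x[k + 1] = 4 * x[k] * (1 - x[k])
lam1 = lle_direct(x)
```

The positive slope confirms chaos; production implementations average over several neighbors per reference point and restrict the fit to the linear scaling region.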
5.1 ESN phase space reconstruction
In the following, we analyze two chaotic time series, generated by the Lorenz and the Moore-Spiegel system, respectively. We evaluate the accuracy of the phase space reconstruction performed with our ESN by comparing the topological properties of the true attractor of the dynamics with those of the attractor obtained by applying dimensionality reduction to the network reservoir. The equivalence of the attractor geometries is assessed by measuring the dynamical invariants, estimated through the correlation sum and the divergent motion of the reconstructed spaces.

In the following, we refer to the true attractor as the trajectory in the phase space generated directly by the differential equations of the dynamical system. With delay-embedding attractor we refer to the trajectory described by the embedding, generated with the delay-coordinate procedure. Finally, the ESN attractor is the trajectory spanned by the components of the multivariate vector $\bar{\mathbf{x}}[k]$, the output of the dimensionality reduction procedure applied to the sequence of reservoir states (see Sect. 3). For these tests we considered only the component of the loss function relative to the prediction error, by setting the weight parameter in Eq. 10 to zero, and we fixed the number of dimensions in PCA and kPCA to 3. Finally, to further emphasize the effectiveness of the proposed architecture, we also consider the phase space reconstruction obtained directly from the reservoir states, in the case where the reservoir contains only 3 neurons ($N_r = 3$).
Figure 5: Trajectories of the attractor of the Lorenz dynamical system in the phase space. In (a), the true trajectory, computed directly from the ordinary differential equations of the system. In (b), the trajectory reconstructed using time-delay embedding. In (c), the trajectory generated by the ESN internal state, on the subspace defined by the first 3 components of the PCA. In (d), the trajectory described by the internal state of an ESN with a small reservoir of 3 neurons.
Lorenz:
the system is governed by the following ordinary differential equations:
Table 3: Correlation dimension and largest Lyapunov exponent (LLE) estimated on each phase space trajectory.

System          Invariant               True    Emb    ESN+PCA    ESN+kPCA    ESN small
Lorenz          Correlation dimension
                LLE
Moore–Spiegel   Correlation dimension
                LLE
$$\begin{aligned} \dot{x} &= \sigma\,(y - x), \\ \dot{y} &= x\,(\rho - z) - y, \\ \dot{z} &= x y - \beta z, \end{aligned} \qquad (14)$$

where the variables $x$, $y$ and $z$ define the state of the system, while $\sigma$, $\rho$ and $\beta$ are system parameters. In this work we set $\sigma = 10$, $\rho = 28$ and $\beta = 8/3$, values for which the system exhibits chaotic behavior.
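To generate the true attractor, the system can be integrated numerically; the following is a simple fourth-order Runge-Kutta sketch, where the step size and initial condition are arbitrary illustrative choices:

```python
import numpy as np

def lorenz_orbit(n_steps=5000, dt=0.01, sigma=10.0, rho=28.0, beta=8.0/3.0):
    """Integrate the Lorenz system with a 4th-order Runge-Kutta scheme.
    Defaults are the canonical chaotic parameter values."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    s = np.array([1.0, 1.0, 1.0])        # arbitrary initial condition
    orbit = np.empty((n_steps, 3))
    for i in range(n_steps):
        k1 = f(s); k2 = f(s + 0.5*dt*k1); k3 = f(s + 0.5*dt*k2); k4 = f(s + dt*k3)
        s = s + dt * (k1 + 2*k2 + 2*k3 + k4) / 6.0
        orbit[i] = s
    return orbit
```

The returned trajectory remains bounded on the familiar butterfly-shaped attractor and never settles to a fixed point.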
Fig. 5 depicts the geometric shapes of the true attractor, the delay-embedding attractor, and the two ESN attractors, generated using either dimensionality reduction or a reservoir with 3 neurons. As can be observed visually, both the embedding and the ESN with dimensionality reduction reconstruct well the trajectory described by the differential equations of the dynamical system. To quantify this similarity formally, we compute the previously introduced dynamical invariants on each attractor. In Tab. 3, we report for each phase space trajectory the estimated correlation dimension and the largest Lyapunov exponent, which, as previously discussed, represents a good approximation of the entropy. Due to the stochastic nature of the approaches adopted for estimating these quantities, we repeated the procedure 10 times and we report the average values and the standard deviations. As we can see from the results, the trajectories described in the subspaces computed using PCA and kPCA both generate an attractor whose dynamical invariants are well approximated. In particular, the accuracy of the reconstruction is comparable to the one obtained by the classic time-delay embedding method, and in some cases it is even better. The standard deviations in the measurements of both the correlation dimension and the LLE are very small, which indicates a high degree of reliability for both measurements. As for the ESN with 3 neurons, the trajectory described is more “flat”, as can be seen in the figure. This is confirmed by the estimated correlation dimension and LLE, whose values are much lower than in the other cases. This denotes that the reconstructed dynamics are not rich enough, a symptom that the complexity and the memory of the network are not sufficient to model the underlying system.
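The baseline time-delay embedding used in this comparison simply stacks lagged copies of a scalar observable; the dimension and lag below are illustrative (in practice they are selected, e.g., via mutual information and false-nearest-neighbor criteria):

```python
import numpy as np

def delay_embed(x, dim=3, lag=10):
    """Delay-coordinate matrix whose rows are [x_t, x_{t+lag}, ..., x_{t+(dim-1)lag}]."""
    x = np.asarray(x)
    n = len(x) - (dim - 1) * lag         # number of complete delay vectors
    return np.column_stack([x[i * lag : i * lag + n] for i in range(dim)])
```

For example, `delay_embed(x, dim=3, lag=10)` maps a scalar series of length T to a (T - 20) x 3 trajectory whose geometry can be compared against the ESN attractor.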
Moore–Spiegel:
this dynamical system manifests interesting synchronization properties, generated by complicated patterns of period-doubling, saddle-node and homoclinic bifurcations [3]. The differential equations which govern the system dynamics are the following:
$$\begin{aligned} \dot{x} &= y, \\ \dot{y} &= z, \\ \dot{z} &= -z - (T - R + R x^2)\, y - T x, \end{aligned} \qquad (15)$$

where $x$, $y$ and $z$ form the state of the system and $T$ and $R$ are the parameters of the model. In this study, we set the parameters to values for which the dynamics of the system exhibit chaotic behavior.
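The Moore–Spiegel system can be integrated with the same Runge-Kutta scheme; in this sketch, T = 6 and R = 20 are commonly cited chaotic parameter values from the literature, assumed here for illustration and not necessarily the exact setting of our experiments:

```python
import numpy as np

def moore_spiegel_orbit(n_steps=4000, dt=0.005, T_par=6.0, R_par=20.0):
    """RK4 integration of the Moore-Spiegel system. T_par = 6 and R_par = 20
    are assumed illustrative chaotic values from the literature."""
    def f(s):
        x, y, z = s
        return np.array([y, z, -z - (T_par - R_par + R_par * x**2) * y - T_par * x])
    s = np.array([0.1, 0.0, 0.0])        # arbitrary small initial condition
    orbit = np.empty((n_steps, 3))
    for i in range(n_steps):
        k1 = f(s); k2 = f(s + 0.5*dt*k1); k3 = f(s + 0.5*dt*k2); k4 = f(s + dt*k3)
        s = s + dt * (k1 + 2*k2 + 2*k3 + k4) / 6.0
        orbit[i] = s
    return orbit
```

The cubic damping term $R x^2 y$ is what saturates the linear instability of the origin and folds the trajectory back onto the attractor.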
In Fig. 6 we show the shape of the attractors of the dynamics, evaluated directly on the differential equations of the system, on the time-delay embedding, on the internal state of the ESN reduced through PCA, and on the state of the ESN with 3 neurons. In this second test, the reconstructed trajectories of the Moore–Spiegel system are more jagged and irregular with respect to the original one. This suggests a poorer approximation of the true dynamics of the system, and it is confirmed by the results in Tab. 3. Compared to the Lorenz case, the dynamical invariants estimated on the time-delay embedding and on the ESN state trajectories approximate the real ones with less accuracy. The reconstructed attractors have a lower correlation dimension, which usually denotes a poor embedding [40]. However, it is worth noticing that the two attractors reconstructed by ESN+PCA and ESN+kPCA have a larger value than the time-delay embedding, and hence they approximate the true dynamics better. As for the LLE, the estimated value in each reconstructed dynamic is larger than in the original one. This means that both the time-delay embedding and the ESNs generate more chaotic dynamics, as is also reflected by the jagged trajectories in Fig. 6. Even in this case, however, the LLE is better approximated by ESN+PCA and ESN+kPCA than by the time-delay embedding. As before, the standard deviations of the estimates of the two dynamical invariants are very small, which provides a high degree of confidence in the measurements. As for the trajectory described by the ESN state with a small reservoir of 3 neurons, the geometric properties of the reconstructed attractor differ even more from the real ones. This confirms that also in this case such a small number of neurons cannot capture the dynamical properties of the system to be modeled.
As a concluding remark, it is important to highlight another aspect of the utility of the ESN in reproducing the attractor of the system dynamics. In fact, it provides a valid alternative to the standard approach based on time-delay embedding for reconstructing the phase space of the system, which presents several caveats and pitfalls [10]. This is a fundamental tool for a wide set of applications where an accurate estimation of the phase space of the system is required [35].
6 Conclusions and future directions
In this work we have presented a new framework for training an Echo State Network, which enhances its generalization capabilities through the regularization constraints introduced by the smoothing effect of a dimensionality reduction procedure. Through a series of tests on benchmark datasets, we have demonstrated how the proposed architecture can achieve better prediction performance in different contexts. Subsequently, we provided a theoretically grounded explanation of the functioning of the proposed architecture, based on the theory of nonlinear time-series analysis. By studying the dynamical properties of the network under this novel perspective, we showed that through an ESN it is possible to reconstruct the phase space of a dynamical system; this offers a solid, yet simple, alternative to the time-delay embedding procedure.
We believe that this work could be useful not only to enhance the prediction capabilities of an ESN, but also to provide a new instrument for the analysis of dynamical systems. As a follow-up of a recent work focused on identifying the edge of criticality of an ESN by evaluating the Fisher information on the state matrix [38], we plan to study the criticality using more reliable Fisher information matrix estimators, which are capable of working only on spaces with few dimensions (e.g., [25]). We also plan to investigate other dimensionality reduction methods, manifold learning and semi-supervised learning approaches to shrink and regularize the output of the network recurrent layer [4, 5]. Finally, as future work, we propose to use different dimensionality reduction techniques in parallel and combine their results through a single reservoir to produce the final output.

References
 Alexandre et al. [2009] L. A. Alexandre, M. J. Embrechts, and J. Linton. Benchmarking reservoir computing on timeindependent classification tasks. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pages 89–93. IEEE, 2009.
 Baker [1973] C. T. H. Baker. The numerical treatment of integral equations. 1973.
 Balmforth and Craster [1997] N. Balmforth and R. Craster. Synchronizing moore and spiegel. Chaos: An Interdisciplinary Journal of Nonlinear Science, 7(4):738–752, 1997.
 Belkin et al. [2006] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. The Journal of Machine Learning Research, 7:2399–2434, 2006.

 Bengio et al. [2004] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. Le Roux, and M. Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. Advances in neural information processing systems, 16:177–184, 2004.
 Bianchi et al. [2015a] F. M. Bianchi, E. De Santis, A. Rizzi, and A. Sadeghian. Shortterm electric load forecasting using echo state networks and PCA decomposition. IEEE Access, 3:1931–1943, Oct. 2015a. ISSN 21693536. doi: 10.1109/ACCESS.2015.2485943.
 Bianchi et al. [2015b] F. M. Bianchi, S. Scardapane, A. Uncini, A. Rizzi, and A. Sadeghian. Prediction of telephone calls load using Echo State Network with exogenous variables. Neural Networks, 71:204–213, 2015b. doi: 10.1016/j.neunet.2015.08.010.
 Bianchi et al. [2016] F. M. Bianchi, L. Livi, and C. Alippi. Investigating echo state networks dynamics by means of recurrence analysis. arXiv preprint arXiv:1601.07381, 2016.
 Boedecker et al. [2012] J. Boedecker, O. Obst, J. T. Lizier, N. M. Mayer, and M. Asada. Information processing in echo state networks at the edge of chaos. Theory in Biosciences, 131(3):205–213, 2012.
 Bradley and Kantz [2015] E. Bradley and H. Kantz. Nonlinear timeseries analysis revisited. Chaos: An Interdisciplinary Journal of Nonlinear Science, 25(9):097610, 2015.

 Burges [1998] C. J. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
 Cao [1997] L. Cao. Practical method for determining the minimum embedding dimension of a scalar time series. Physica D: Nonlinear Phenomena, 110(1):43–50, 1997.
 Davenport et al. [2007] M. A. Davenport, M. F. Duarte, M. B. Wakin, J. N. Laska, D. Takhar, K. F. Kelly, and R. G. Baraniuk. The smashed filter for compressive classification and target recognition. In Electronic Imaging 2007, pages 64980H–64980H. International Society for Optics and Photonics, 2007.
 Deihimi and Showkati [2012] A. Deihimi and H. Showkati. Application of echo state networks in shortterm electric load forecasting. Energy, 39(1):327–340, 2012.
 Deihimi et al. [2013] A. Deihimi, O. Orang, and H. Showkati. Shortterm electric load and temperature forecasting using wavelet echo state networks with neural reconstruction. Energy, 57:382–401, 2013.
 Dutoit et al. [2009] X. Dutoit, B. Schrauwen, J. V. Campenhout, D. Stroobandt, H. V. Brussel, and M. Nuttin. Pruning and regularization in reservoir computing. Neurocomputing, 72(7–9):1534–1546, 2009. ISSN 09252312. doi: 10.1016/j.neucom.2008.12.020. Advances in Machine Learning and Computational Intelligence – 16th European Symposium on Artificial Neural Networks 2008.
 Fodor [2002] I. K. Fodor. A survey of dimension reduction techniques, 2002.
 Fraser and Swinney [1986] A. M. Fraser and H. L. Swinney. Independent coordinates for strange attractors from mutual information. Physical review A, 33(2):1134, 1986.
 Friedman [1997] J. H. Friedman. On bias, variance, 0/1—loss, and the curseofdimensionality. Data mining and knowledge discovery, 1(1):55–77, 1997.
 Gao et al. [2007] J. Gao, Y. Cao, W.w. Tung, and J. Hu. Multiscale analysis of complex time series: integration of chaos and random fractal theory, and beyond. John Wiley & Sons, 2007.
 Grassberger and Procaccia [2004] P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors. In The Theory of Chaotic Attractors, pages 170–189. Springer, 2004.
 Haiyan et al. [2005] D. Haiyan, P. Wenjiang, and H. Zhenya. A multiple objective optimization based echo state network tree and application to intrusion detection. In VLSI Design and Video Technology, 2005. Proceedings of 2005 IEEE International Workshop on, pages 443–446, May 2005. doi: 10.1109/IWVDVT.2005.1504645.
 Han and Lee [2014a] S. Han and J. Lee. Fuzzy echo state neural networks and funnel dynamic surface control for prescribed performance of a nonlinear dynamic system. Industrial Electronics, IEEE Transactions on, 61(2):1099–1112, Feb 2014a. ISSN 02780046. doi: 10.1109/TIE.2013.2253072.
 Han and Lee [2014b] S. I. Han and J. M. Lee. Fuzzy echo state neural networks and funnel dynamic surface control for prescribed performance of a nonlinear dynamic system. Industrial Electronics, IEEE Transactions on, 61(2):1099–1112, 2014b.
 HarShemesh et al. [2016] O. HarShemesh, R. Quax, B. Miñano, A. G. Hoekstra, and P. M. A. Sloot. Nonparametric estimation of Fisher information from real data. Physical Review E, 93(2):023301, 2016. doi: 10.1103/PhysRevE.93.023301.
 Hotelling [1933] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417–441, Sept. 1933.

 Huang et al. [2005] C.-M. Huang, C.-J. Huang, and M.-L. Wang. A particle swarm optimization to identifying the ARMAX model for short-term load forecasting. Power Systems, IEEE Transactions on, 20(2):1126–1133, 2005.
 Indyk and Motwani [1998] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613. ACM, 1998.
 Jaeger [2001] H. Jaeger. The “echo state” approach to analysing and training recurrent neural networks – with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148:34, 2001.
 Jaeger [2002] H. Jaeger. Adaptive nonlinear system identification with echo state networks. In Advances in neural information processing systems, pages 593–600, 2002.
 Jaeger and Haas [2004] H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. science, 304(5667):78–80, 2004.
 Jan van Oldenborgh et al. [2005] G. Jan van Oldenborgh, M. A. Balmaseda, L. Ferranti, T. N. Stockdale, and D. L. Anderson. Did the ecmwf seasonal forecast model outperform statistical enso forecast models over the last 15 years? Journal of climate, 18(16):3240–3249, 2005.
 Jenssen [2010] R. Jenssen. Kernel entropy component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):847–860, May 2010. ISSN 01628828. doi: 10.1109/TPAMI.2009.100.
 Jenssen [2013] R. Jenssen. Entropyrelevant dimensions in the kernel feature space: Clustercapturing dimensionality reduction. IEEE Signal Processing Magazine, 30(4):30–39, July 2013. ISSN 10535888. doi: 10.1109/MSP.2013.2249692.
 Kantz and Schreiber [2004] H. Kantz and T. Schreiber. Nonlinear time series analysis, volume 7. Cambridge university press, 2004.
 Li et al. [2012] D. Li, M. Han, and J. Wang. Chaotic time series prediction based on a novel robust echo state network. IEEE Transactions on Neural Networks and Learning Systems, 23(5):787–799, 2012.
 Liebert and Schuster [1989] W. Liebert and H. Schuster. Proper choice of the time delay for the analysis of chaotic time series. Physics Letters A, 142(23):107–111, 1989.
 Livi et al. [2016] L. Livi, F. M. Bianchi, and C. Alippi. Determination of the edge of criticality in echo state networks through fisher information maximization. arXiv preprint arXiv:1603.03685, 2016.
 Lukoševičius and Jaeger [2009] M. Lukoševičius and H. Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009. doi: 10.1016/j.cosrev.2009.03.005.
 Marwan et al. [2007] N. Marwan, M. C. Romano, M. Thiel, and J. Kurths. Recurrence plots for the analysis of complex systems. Physics reports, 438(5):237–329, 2007.
 Mazumdar and Harley [2008] J. Mazumdar and R. Harley. Utilization of echo state networks for differentiating source and nonlinear load harmonics in the utility network. Power Electronics, IEEE Transactions on, 23(6):2738–2745, Nov 2008. ISSN 08858993. doi: 10.1109/TPEL.2008.2005097.
 Packard et al. [1980] N. H. Packard, J. P. Crutchfield, J. D. Farmer, and R. S. Shaw. Geometry from a time series. Physical Review Letters, 45(9):712, 1980.
 Parlitz [1998] U. Parlitz. Nonlinear timeseries analysis. In Nonlinear Modeling, pages 209–239. Springer, 1998.
 Peng et al. [2014] Y. Peng, M. Lei, J.B. Li, and X.Y. Peng. A novel hybridization of echo state networks and multiplicative seasonal ARIMA model for mobile communication traffic series forecasting. Neural Computing and Applications, 24(34):883–890, 2014.

 Rényi [1959] A. Rényi. On the dimension and entropy of probability distributions. Acta Mathematica Academiae Scientiarum Hungarica, 10(1-2):193–215, 1959.
 Rhodes and Morari [1997] C. Rhodes and M. Morari. The false nearest neighbors algorithm: An overview. Computers & Chemical Engineering, 21:S1149–S1154, 1997.
 Scardapane et al. [2015] S. Scardapane, D. Comminiello, M. Scarpiniti, and A. Uncini. SignificanceBased Pruning for Reservoir’s Neurons in Echo State Networks, pages 31–38. Springer International Publishing, Cham, 2015. ISBN 9783319181646. doi: 10.1007/9783319181646_4.
 Schölkopf et al. [1997] B. Schölkopf, A. Smola, and K.R. Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583–588. Springer, 1997.
 Schölkopf et al. [2000] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural computation, 12(5):1207–1245, 2000.

 Skowronski and Harris [2007] M. D. Skowronski and J. G. Harris. Automatic speech recognition using a predictive echo state network classifier. Neural Networks, 20(3):414–423, 2007.
 Srinivas and Patnaik [1994] M. Srinivas and L. M. Patnaik. Genetic algorithms: a survey. Computer, 27(6):17–26, June 1994. ISSN 00189162. doi: 10.1109/2.294849.
 Takens [1981] F. Takens. Detecting strange attractors in turbulence. Springer, 1981.
 Van Der Maaten et al. [2009] L. Van Der Maaten, E. Postma, and J. Van den Herik. Dimensionality reduction: a comparative review. J Mach Learn Res, 10:66–71, 2009.
 Varshney and Verma [2014] S. Varshney and T. Verma. Half Hourly Electricity Load Prediction using Echo State Network. International Journal of Science and Research, 3(6):885–888, 2014.
 Verstraeten and Schrauwen [2009] D. Verstraeten and B. Schrauwen. On the quantification of dynamics in reservoir computing. In C. Alippi, M. Polycarpou, C. Panayiotou, and G. Ellinas, editors, Artificial Neural Networks – ICANN 2009, volume 5768, pages 985–994. Springer Berlin, Heidelberg, 2009. ISBN 9783642042737. doi: 10.1007/9783642042744_101.

 Wierstra et al. [2005] D. Wierstra, F. J. Gomez, and J. Schmidhuber. Modeling systems with internal state using Evolino. In Proceedings of the 7th annual conference on Genetic and evolutionary computation, pages 1795–1802. ACM, 2005.
 Wolf et al. [1985] A. Wolf, J. B. Swift, H. L. Swinney, and J. A. Vastano. Determining Lyapunov exponents from a time series. Physica D: Nonlinear Phenomena, 16(3):285–317, 1985.
 Zhou et al. [2009] S. Zhou, J. Lafferty, and L. Wasserman. Compressed and privacysensitive sparse regression. IEEE Transactions on Information Theory, 55(2):846–866, 2009.