Training Echo State Networks with Regularization through Dimensionality Reduction

In this paper we introduce a new framework to train an Echo State Network to predict real-valued time-series. The method consists in projecting the output of the internal layer of the network onto a space with lower dimensionality, before training the output layer to learn the target task. Notably, we enforce a regularization constraint that leads to better generalization capabilities. We evaluate the performance of our approach on several benchmark tests, using different techniques to train the readout of the network, and achieve superior predictive performance with the proposed framework. Finally, we provide insight into the effectiveness of the implemented mechanics through a visualization of the trajectory in the phase space and by relying on the methodologies of nonlinear time-series analysis. By applying our method to well-known chaotic systems, we provide evidence that the lower-dimensional embedding retains the dynamical properties of the underlying system better than the full-dimensional internal states of the network.







1 Introduction

Echo State Networks (ESNs) belong to the class of computational dynamical systems implemented according to the so-called reservoir computing approach [39]. An input signal is fed to a large, recurrent and randomly connected dynamic hidden layer, the reservoir, whose outputs are combined by a memory-less layer called the readout to solve a specified task. Contrary to most hard computing approaches, which demand long training procedures to learn model parameters through an optimization algorithm [27], the ESN is characterized by a very fast learning procedure that usually consists in solving a convex optimization problem.

ESNs have been adopted in a variety of contexts, such as static classification [1], speech recognition [50], intrusion detection [22], adaptive control [23], harmonic distortion measurements [41] and, in general, the modeling of various kinds of non-linear dynamical systems [24]. The field where ESNs have been applied the most is the prediction of real-valued time-series relative, for example, to telephonic or electric load, where the forecast is usually performed 1 hour and 24 hours ahead [14, 15, 6, 54, 44]. Outstanding results have also been achieved by ESNs in the prediction of chaotic time-series [36, 31], which highlighted the capability of these neural networks to learn amazingly accurate models of a chaotic process from almost noise-free training data.

Although a large reservoir can capture the dynamics of the underlying system more accurately, it results in a model of increased complexity, with an inherent risk of overfitting that lowers generalization capabilities. Additionally, several regression methods adopted to train the readout layer can be affected by the curse of dimensionality when the data are high dimensional, which also increases both the computational requirements in software and the resources needed in hardware [7]. Several tasks in signal processing and machine learning have been tackled by evaluating regression functions [58], performing classification [13] or finding neighbors [28] in a reduced-dimensional space. In fact, in many cases it is possible to preserve meaningful distance relationships in the original data and, at the same time, to escape the curse of dimensionality. Through dimensionality reduction, redundant features are removed, noise can be filtered, and algorithms that are unfit for a large number of dimensions become applicable. In the ESN literature, different methods have been proposed to increase the generalization ability of the network and to regularize the output. For example, in [16] the authors propose a form of regularization that shrinks the weights of the connections from the reservoir to the readout layer. In [47], pruning some connections from the reservoir to the readout layer yields better generalization capabilities, along with some insight on which neurons are actually useful for the output, providing clues on how to create a good reservoir.

In this paper we propose a novel framework for training an ESN, in which an additional computational block processes the output of the internal reservoir before it is fed into the readout layer. In particular, the internal state of the network is mapped to a properly chosen lower-dimensional subspace, using either a linear or a non-linear transformation. Accordingly, we can use a large reservoir to capture the dynamics of the underlying system, while increasing the generalization capability of the model thanks to the implicit regularization constraint provided by dimensionality reduction in the recurrent layer. Even if additional operations are required to compute the reduced-dimensionality embedding, training the readout layer becomes less demanding, especially for regression methods whose computational complexity depends on the input dimension [17]. With the proposed procedure we improve the generalization capabilities of the network, achieving better results on well-known benchmark problems with respect to the standard ESN architecture. Additionally, when the data can be mapped to a space with 2 or 3 dimensions, the internal network dynamics can be visualized precisely and relevant patterns can be detected. To justify the results obtained and to understand the mechanisms that determine the effectiveness of the proposed system, we provide a theoretical study based on methods from the field of nonlinear time-series analysis. To the best of the authors' knowledge, the coupling of dimensionality reduction with the ESN architecture has not been explored before.

The remainder of the paper is organized as follows. In Sect. 2 we describe the ESN structure along with existing approaches for its training and we review the dimensionality reduction methods adopted in this work. In Sect. 3 we present our proposed architecture, providing implementation details. In Sect. 4 we describe the datasets used to test our system, the experimental settings adopted and the performance reached on several prediction problems. In Sect. 5 we analyze the results and the functioning of our system through the perspective of nonlinear time-series analysis. Finally, in Sect. 6 we draw our conclusions.

2 Background material

In the following, we shortly review the methodologies adopted in our framework. First, we describe the classic ESN architecture and two effective approaches for its training. Then, we summarize two well-known methods used for reducing the dimensionality of the data by mapping them onto a smaller subspace.

2.1 Echo State Network

An ESN consists of a large, untrained recurrent layer of non-linear units and a linear, memory-less readout layer, usually trained with a linear regression. A visual representation of an ESN is reported in Fig. 1.


Figure 1: Schematic depiction of the ESN architecture. The circles represent the input u[t], the state x[t], and the output y[t], respectively. Solid squares (W_r^o and W_i^o) are the trainable matrices of the readout, while dashed squares (W_r^r, W_o^r, and W_i^r) are randomly initialized matrices. The polygon represents the non-linear transformation performed by the neurons, and z^{-1} is the unit delay operator.

The equations describing the ESN state-update and output are, respectively, defined as follows:

x[t+1] = f(W_r^r x[t] + W_i^r u[t+1] + W_o^r y[t] + ξ[t+1]),   (1)

y[t+1] = W_i^o u[t+1] + W_r^o x[t+1],   (2)

where ξ is a small i.i.d. noise term. The reservoir consists of N_r neurons characterized by a transfer/activation function f(·), typically implemented as a hyperbolic tangent function. At time instant t, the network is driven by the input signal u[t] ∈ R^{N_i} and generates the output y[t] ∈ R^{N_o}, N_i and N_o being the dimensionality of input and output, respectively. The vector x[t] ∈ R^{N_r} describes the ESN (instantaneous) state. The weight matrices W_r^r ∈ R^{N_r×N_r} (reservoir connections), W_i^r ∈ R^{N_r×N_i} (input-to-reservoir), and W_o^r ∈ R^{N_r×N_o} (output-to-reservoir feedback) contain real values in the [-1, 1] interval, sampled from a uniform distribution.

According to the ESN theory, the reservoir must satisfy the so-called "echo state property" (ESP) [39]. This guarantees that the effect of a given input on the state of the reservoir vanishes in a finite number of time steps. A widely used rule of thumb suggests rescaling the matrix W_r^r so that ρ(W_r^r) < 1, where ρ(·) denotes the spectral radius, but several theoretically founded approaches have been proposed in the literature to properly tune ρ in an ESN driven by a specific input [9, 8, 55].
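As an illustration, the rule-of-thumb rescaling can be sketched as follows (a minimal NumPy sketch; the target value 0.9 and the reservoir size are illustrative choices, not taken from the paper):

```python
import numpy as np

def rescale_spectral_radius(W, rho_desired=0.9):
    """Scale W so that its spectral radius equals rho_desired."""
    rho = max(abs(np.linalg.eigvals(W)))  # largest eigenvalue magnitude
    return W * (rho_desired / rho)

rng = np.random.default_rng(0)
# Random reservoir weights in [-1, 1], then rescaled below the ESP threshold.
W_rr = rescale_spectral_radius(rng.uniform(-1.0, 1.0, (100, 100)), 0.9)
```

Since eigenvalues scale linearly with the matrix, a single multiplication suffices to set the spectral radius exactly.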

The weight matrices W_i^o and W_r^o, instead, are optimized for the task at hand. To determine them, let us consider the training sequence of desired input-output pairs given by:

{(u[1], y*[1]), (u[2], y*[2]), ..., (u[T_tr], y*[T_tr])}.   (3)
In the initial phase of training, called state harvesting, the inputs are fed to the reservoir in accordance with Eq. 1, producing a sequence of internal states x[1], ..., x[T_tr]. Since, by definition, the outputs of the ESN are not available for feedback, according to the teacher forcing procedure the desired output y*[t] is used instead as the feedback signal. The states are stacked in a matrix S ∈ R^{T_tr×N_r} and the desired outputs in a vector Y ∈ R^{T_tr×N_o}:

S = [x[1], x[2], ..., x[T_tr]]ᵀ,   Y = [y*[1], y*[2], ..., y*[T_tr]]ᵀ.

The initial rows of S and Y are the washout elements that should be discarded, since they refer to a transient phase in the ESN's behavior.
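The state-harvesting phase described above can be sketched as follows (a toy NumPy example with random input and teacher signals; all sizes, the noise amplitude and the washout length are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
Nr, Ni, No, T = 50, 1, 1, 200          # reservoir, input, output sizes; sequence length

# Randomly initialized (untrained) matrices, named after the text.
W_rr = rng.uniform(-1, 1, (Nr, Nr))
W_rr *= 0.9 / max(abs(np.linalg.eigvals(W_rr)))   # rule-of-thumb rescaling
W_ir = rng.uniform(-1, 1, (Nr, Ni))
W_or = rng.uniform(-1, 1, (Nr, No))

u = rng.uniform(-1, 1, (T, Ni))        # input sequence
y_star = rng.uniform(-1, 1, (T, No))   # desired output (teacher signal)

x = np.zeros(Nr)
states = np.empty((T, Nr))
for t in range(T):
    # Teacher forcing: the desired output replaces the (unavailable) feedback.
    fb = y_star[t - 1] if t > 0 else np.zeros(No)
    x = np.tanh(W_rr @ x + W_ir @ u[t] + W_or @ fb + 1e-4 * rng.standard_normal(Nr))
    states[t] = x

washout = 50
S = states[washout:]                   # harvested state matrix
Y = y_star[washout:]                   # matching desired outputs
```

The rows discarded by the washout correspond to the transient mentioned above.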

Since the gain of the sigmoid non-linearity in the neurons is largest around the origin, three coefficients ω_i, ω_o and ω_f are used to scale the input, desired output and feedback signals, respectively. In this way, it is possible to control the amount of non-linearity introduced by the processing units.

The training of the readout consists in solving a convex optimization problem, for which several closed-form solutions have been proposed in the literature. The standard procedure to train the readout, originally proposed in [29], consists in a regularized least-square regression, which can be easily computed through the Moore-Penrose pseudo-inverse. However, to learn the optimal readout we also consider Support Vector Regression (SVR), a supervised learning model that can efficiently perform a non-linear separation of data, using a kernel function to map the inputs into a high-dimensional feature space where they are linearly separable.


Ridge Regression:

to train the readout with a linear regressor we adopt ridge regression, whose solution can be computed by solving the following regularized least-square problem:

W* = argmin_W (1/2)‖SW − Y‖² + (λ/2)‖W‖² = (SᵀS + λI)⁻¹ SᵀY,   (4)

where W collects the readout weights and λ ∈ R⁺ is the regularization coefficient.
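A minimal sketch of the closed-form ridge solution of Eq. 4 (the synthetic data and the value of λ are illustrative, not the paper's settings):

```python
import numpy as np

def ridge_readout(S, Y, lam=1e-2):
    """Closed-form ridge solution W* = (S^T S + lam*I)^(-1) S^T Y."""
    return np.linalg.solve(S.T @ S + lam * np.eye(S.shape[1]), S.T @ Y)

rng = np.random.default_rng(2)
S = rng.standard_normal((200, 30))              # stand-in for harvested states
W_true = rng.standard_normal((30, 1))
Y = S @ W_true + 0.01 * rng.standard_normal((200, 1))
W = ridge_readout(S, Y, lam=1e-6)               # recovers W_true up to noise
```

Using `np.linalg.solve` on the normal equations avoids forming an explicit inverse.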

Support Vector Regression:

we adopt a ν-SVR [49] with a Gaussian kernel, initially proposed in [7] as a method for readout training. In this case, the ESN acts as a preprocessor for the ν-SVR kernel, and their combination can be seen as an adaptive kernel, capable of learning a task-specific time dependency. The state x[t] is projected to a higher-dimensional feature space φ(x[t]), and the ν-SVR is applied on the resulting space. The dual optimization problem can be written as:

max_{α, α*}  −(1/2) Σ_{i,j} (α_i − α_i*)(α_j − α_j*) K_{ij} + Σ_i (α_i − α_i*) y*[i]
s.t.  Σ_i (α_i − α_i*) = 0,  Σ_i (α_i + α_i*) ≤ Cν,  α_i, α_i* ∈ [0, C/T_tr],   (5)

where each entry K_{ij} is given by K(x[i], x[j]), with K(·, ·) being a reproducing Gaussian kernel associated to the feature mapping, K(x[i], x[j]) = ⟨φ(x[i]), φ(x[j])⟩ = exp(−‖x[i] − x[j]‖²/(2σ²)), where σ is the scale parameter.

By an extension of the representer theorem, the output of the ESN at a generic time instant t is in this case given by:

y[t] = Σ_i (α_i − α_i*) K(x[i], x[t]) + b,   (6)

where α_i and α_i* are the entries of the optimal solution to problem Eq. 5, and they are non-zero only for patterns that are support vectors.
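The Gaussian kernel matrix consumed by the ν-SVR can be computed as below; an off-the-shelf ν-SVR solver would then operate on K or directly on the states. This is a hedged sketch, not the authors' implementation:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.standard_normal((10, 4))                # stand-in for reservoir states
K = gaussian_kernel(X, X, sigma=0.5)
```

The matrix is symmetric, has unit diagonal, and is positive semidefinite, as required of a Mercer kernel.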

2.2 Dimensionality reduction methods

In the following, we describe the dimensionality reduction techniques implemented in our framework. First of all, we underline that several approaches can be followed for reducing the dimensionality of the data and for learning the underlying manifold on a subspace of the data space [53, 33, 34]. In this work, we limit our analysis to two well-known, effective, yet simple procedures, namely Principal Component Analysis (PCA) [26] and kernel Principal Component Analysis (kPCA) [48].


Principal Component Analysis (PCA) is a statistically motivated method, which projects the data onto an orthonormal basis that preserves most of the variance in the input signal, while ensuring that the individual components are uncorrelated. These basis vectors are called the Principal Components. Let x be a random vector and let C = E[(x − E[x])(x − E[x])ᵀ] = ΦΛΦᵀ be its covariance matrix, where Φ and Λ are the orthogonal eigenvector matrix and the diagonal eigenvalue matrix, respectively. Then the linear transformation y = Φᵀx ensures that the covariance matrix of y is Λ, which clearly implies that the components of y are uncorrelated. We also see that

Σ_i var(x_i) = Σ_i var(y_i) = Σ_i λ_i.   (7)

To reduce the dimensionality to d dimensions, we project the data onto the d eigenvectors with the largest eigenvalues. That is,

ȳ = Φ_dᵀ x,

where Φ_d is the truncated eigenvector matrix associated with the eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_d. According to Eq. 7, this ensures that ȳ preserves most of the variance of x.
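A compact sketch of the PCA projection described above, via eigendecomposition of the sample covariance (toy data; d = 3 is an arbitrary choice):

```python
import numpy as np

def pca_project(X, d):
    """Project the rows of X onto the d leading principal components."""
    Xc = X - X.mean(0)                                    # center the data
    lam, Phi = np.linalg.eigh(np.cov(Xc, rowvar=False))   # ascending eigenvalues
    Phi_d = Phi[:, ::-1][:, :d]                           # d largest eigenvalues
    return Xc @ Phi_d, Phi_d

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))
Y, Phi_d = pca_project(X, d=3)
```

The covariance of the projected data is diagonal, i.e. the retained components are uncorrelated, as stated in the text.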

Kernel Principal Component Analysis

(kPCA) is a nonlinear extension of PCA. Given a valid positive semidefinite (psd) Mercer kernel K(x[i], x[j]) = ⟨φ(x[i]), φ(x[j])⟩_H, where φ(·) is some nonlinear mapping from the input space to a Hilbert space H, kernel PCA implicitly performs PCA in H.

Let K, with entries K_{ij} = K(x[i], x[j]), be the kernel matrix and let A and Λ be its eigenvector and eigenvalue matrices, respectively, with the eigenvalues sorted in descending order. Then the projection of the in-sample data onto the principal components in H is given by

Y = KAΛ^{−1/2}.   (8)

The out-of-sample approximation for the projection of a data point x onto the i-th principal component is given by

ȳ_i(x) = (1/√λ_i) Σ_j a_{ji} K(x[j], x).   (9)

Just like canonical PCA, to perform dimensionality reduction with kPCA one needs to use the truncated eigenvector and eigenvalue matrices with Eq. 8 and Eq. 9.

The kernel function most commonly used in practice is the Gaussian kernel, given by K(x[i], x[j]) = exp(−‖x[i] − x[j]‖²/(2σ²)), where σ controls the width of the kernel.
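The in-sample projection (Eq. 8) and the out-of-sample formula (Eq. 9) can be sketched as follows. For brevity the sketch follows the uncentered formulas given above; practical kPCA implementations usually also center the kernel matrix:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kpca_fit(X, d, sigma=1.0):
    K = gaussian_kernel(X, X, sigma)
    lam, A = np.linalg.eigh(K)
    lam, A = lam[::-1][:d], A[:, ::-1][:, :d]   # d largest eigenpairs
    Y = K @ A / np.sqrt(lam)                    # in-sample projections (Eq. 8)
    return A, lam, Y

def kpca_project(X_train, A, lam, x_new, sigma=1.0):
    """Out-of-sample projection of a new point (Eq. 9)."""
    k = gaussian_kernel(x_new[None, :], X_train, sigma)[0]
    return (k @ A) / np.sqrt(lam)

rng = np.random.default_rng(6)
X = rng.standard_normal((20, 3))
A, lam, Y = kpca_fit(X, d=2, sigma=0.8)
y0 = kpca_project(X, A, lam, X[0], sigma=0.8)   # matches the in-sample row
```

Applying the out-of-sample formula to a training point reproduces the corresponding row of the in-sample projection, which is a quick consistency check.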

Both PCA and kPCA admit an out-of-sample extension, a feature which is required in our framework, as discussed later.

3 Proposed architecture

In this section, we provide the details of the architecture of the framework proposed.

The large size of the reservoir, specified by the number of hidden neurons, is one of the main features that determine the effectiveness of the reservoir computing paradigm. Thanks to the high number of neurons, the internal recurrent connections in the reservoir are capable of generating a rich and heterogeneous dynamics for solving complex memory-dependent tasks. However, as the size of the reservoir increases, so does the complexity of the model, with a consequent risk of overfitting caused by a reduced generalization capability [4]. Dimensionality reduction and manifold learning are techniques that diminish the variance in the data and introduce a bias, which can reduce the expected prediction error [19]. In the proposed architecture, we use a large reservoir in order to capture the dynamics of the underlying unknown process and then, through a dimensionality reduction procedure, we enforce regularization constraints to increase the generalization capability of our model. Another important consequence of reducing the dimensionality of the reservoir is that complex regression methods benefit from a reduced computational complexity when the internal states are described by a smaller number of variables. Additionally, several methods used to identify, in an unsupervised way, the hyperparameter configurations that maximize the computational capabilities of the network require computationally demanding analysis procedures [8, 38, 9]. These procedures would greatly benefit from the simplification offered by our proposed architecture.

At each time step t, the vector x[t] ∈ R^{N_r} that represents the internal state of the reservoir is mapped into a lower-dimensional space by a projector P: R^{N_r} → R^d. The new d-dimensional state vector x̄[t] is then processed by the readout to compute the predicted value.

To train our system, the time-series is split in three contiguous parts, namely the training, validation and test sets. Since we deal with time-series prediction problems, each set contains coupled real values, which represent the input value and the ground truth of the associated prediction: in the training set, for example, we have pairs (u[t], y*[t]), where y*[t] is the target for the prediction made from u[t]. The regression function in the readout is implemented according to one of the two procedures described in Sect. 2.1, and the model parameters are learned on the training data. The system depends on several hyperparameters, which affect the network behavior and must be carefully tuned on the specific problem at hand by performing a cross-validation procedure on the validation set, with a method whose details are provided in the next section.

Once the model has been trained, each new element of the test set is processed and the relative internal reservoir state is generated. Successively, its projection in the subspace with reduced dimensionality is evaluated using a suitable out-of-sample approximation. In the case of PCA, this can be done by projecting the new state on the basis defined by the covariance matrix computed on the states relative to the elements of the training set, which are collected in the matrix S during the training phase. For kPCA, it is possible to use the Nyström approximation [2], which specifies an interpolating function for determining the values of out-of-sample data points.

A schematic representation of the whole procedure is depicted in Fig. 2.

Figure 2: When a new element of the time-series is fed into the network, the internal state of the ESN is updated and its new value is stored in x[t]. The state vector is then projected on a subspace computed during training from the state matrix S, and the vector x̄[t] of reduced dimensionality in this subspace is evaluated. At this point, the predicted output value is computed by the ESN readout.
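Putting the pieces together, here is a minimal end-to-end sketch of the proposed pipeline (large reservoir, PCA basis learned on training states, ridge readout, out-of-sample projection of test states) on a toy next-step sine prediction task. Feedback connections and the GA-based tuning are omitted, and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy task: predict s[t+1] from s[t].
s = np.sin(0.2 * np.arange(1200))
u, y = s[:-1, None], s[1:, None]

Nr, d, washout = 300, 3, 100
W_rr = rng.uniform(-1, 1, (Nr, Nr))
W_rr *= 0.9 / max(abs(np.linalg.eigvals(W_rr)))      # enforce rho < 1
W_ir = rng.uniform(-1, 1, (Nr, 1))

x = np.zeros(Nr)
states = np.empty((len(u), Nr))
for t in range(len(u)):                               # state harvesting
    x = np.tanh(W_rr @ x + W_ir @ u[t])
    states[t] = x

S_tr, Y_tr = states[washout:1000], y[washout:1000]
S_te, Y_te = states[1000:], y[1000:]

# Dimensionality reduction: PCA basis learned on the training states only.
mu = S_tr.mean(0)
_, _, Vt = np.linalg.svd(S_tr - mu, full_matrices=False)
P = Vt[:d].T
Z_tr, Z_te = (S_tr - mu) @ P, (S_te - mu) @ P         # out-of-sample: same basis

# Ridge readout trained on the reduced states.
W = np.linalg.solve(Z_tr.T @ Z_tr + 1e-6 * np.eye(d), Z_tr.T @ Y_tr)
pred = Z_te @ W
nrmse = np.sqrt(np.mean((pred - Y_te) ** 2) / np.var(Y_te))
```

Even with only d = 3 retained dimensions out of 300, the reduced states of this periodically driven reservoir support an accurate linear readout.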

3.1 Hyperparameter optimization

The set of hyperparameters h that control the architecture of the ESN, the regression in the readout and the dimensionality-reduction procedure is optimized by minimizing a loss function L(h), defined as

L(h) = (1 − β) E(h) + β d/N_r,   (10)

where E(h) is the prediction error on the validation set, d is the hyperparameter that defines the number of dimensions of the new subspace, and β weights the two terms. In order to lower the complexity of the model, L jointly penalizes the prediction error on the validation set and the number of dimensions retained after the dimensionality reduction.

The loss function is minimized using a standard genetic algorithm (GA) with Gaussian mutation, random crossover, elitism and tournament selection [51]. The hyperparameter optimization is performed on the validation set; the best individual found is stored and subsequently used to configure the network during the training phase. A schematic description of the training procedure is depicted in Fig. 3.

Figure 3: Overview of the hyperparameter optimization in the proposed architecture. At the i-th iteration, the input elements of the validation set are processed by the ESN configured with the hyperparameters in h_i, which is the i-th individual generated by the GA. The predicted output produced by the network is matched against the ground truth, and the resulting prediction error is used to compute the fitness of h_i with the loss function L. In the next iteration, a new individual is generated, depending on the results obtained so far and on the policies of the GA.

4 Experiments

The component of the loss function (Eq. 10) relative to the error on the given task is implemented by the Normalized Root Mean Squared Error (NRMSE):

NRMSE = sqrt( ⟨‖y[t] − y*[t]‖²⟩ / ⟨‖y*[t] − ⟨y*[t]⟩‖²⟩ ),

where y[t] is the ESN prediction and y*[t] the desired/teacher output.
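The NRMSE defined above can be sketched as:

```python
import numpy as np

def nrmse(y_pred, y_true):
    """RMSE normalized by the variance of the target signal."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2) / np.var(y_true))

y_true = np.sin(0.1 * np.arange(500))
```

A perfect prediction yields 0, while predicting the constant mean of the target yields exactly 1, which makes scores comparable across signals with different scales.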

The GA uses a population size of 50 individuals and evaluates 20 generations. At each generation, the individuals are mutated and bred with fixed mutation and crossover probabilities, and the individuals of the next generation are selected by a tournament strategy with a tournament size of 4 individuals. The bounds for all parameters are shown in Tab. 1. The weight parameter β in the loss function (Eq. 10) is set to 0.1.

Due to the stochastic nature of the ESN, which is a consequence of the random initialization of the weight matrices W_r^r, W_i^r, and W_o^r, each individual is evaluated on the validation set using 5 networks initialized with different weights. The fitness is then given by the NRMSE averaged over these 5 networks. Once the optimal set of parameters has been found, we predict the values of the test set using 32 randomly initialized networks configured with that same set of optimal parameters.

min   100   0.0    0.1    0.1    0.0    0.5    0.001   0.001   0.001   0.001   0.001
max   500   0.1    0.9    0.9    0.6    1.4    1.0     0.1     1.0     10.0    1.0
Δ     5     0.01   0.08   0.08   0.06   0.09   0.1     0.01    0.1     1.0     0.1

Table 1: Each hyperparameter is searched by the GA in the interval [min, max] with resolution Δ. The fields of the table are: neurons in the reservoir (N_r), noise in the ESN state update (ξ), scaling of input, teacher and feedback weights (ω_i, ω_o, ω_f), spectral radius of the reservoir (ρ), embedding dimension (d), norm regularization factor (λ), ν-SVR parameters (C, γ, ν).

4.1 Datasets description

To test our system, we consider 3 benchmark tasks commonly used in time-series forecasting, namely the prediction of Mackey-Glass time-series, of multiple superimposed oscillator and of the NARMA signal. The forecasting problems that we consider have a different level of difficulty, given by the nature of the signal and the complexity of the prediction task. Accordingly to a commonly used approach [32], in each prediction task we set the forecast step by computing a statistic that measures the independence of -separated points in the time series. One usually wants the smallest that guarantees the measurements to be decorrelated. Hence, we considered the first zero of the autocorrelation function of the time series, which yields the smallest that maximizes the linear independence between the samples. Alternatively, it is possible to choose the forecast step by considering more general forms of independence, such as the first local minimum on the average mutual information [18] or on the correlation sum [37].
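Choosing the forecast step as the first zero of the autocorrelation function can be sketched as follows (for a pure sinusoid the first zero falls near a quarter of the period):

```python
import numpy as np

def first_zero_autocorr(x):
    """Smallest lag at which the autocorrelation of x first touches or crosses zero."""
    x = x - x.mean()
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]   # lags 0 .. N-1
    ac = ac / ac[0]                                     # normalize so ac[0] = 1
    for lag in range(1, len(ac)):
        if ac[lag] <= 0:
            return lag
    return None

t = np.arange(4000)
tf = first_zero_autocorr(np.sin(2 * np.pi * t / 100))   # quarter period is 25 samples
```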

Mackey-Glass time-series:

the input signal is generated from the Mackey-Glass (MG) time-delay differential system, described by the following equation:

dx/dt = α x(t − τ_MG) / (1 + x(t − τ_MG)^10) − β x(t).   (11)

We generated a time-series of 150000 time-steps, using 0.1 as the integration step for Eq. 11.
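A simple Euler integration of the MG system can be sketched as below. The delay τ = 17 and the remaining coefficients are common literature choices, assumed here since the paper's exact values are not shown in this extract:

```python
import numpy as np

def mackey_glass(T, tau=17.0, alpha=0.2, beta=0.1, n=10, dt=0.1, x0=1.2):
    """Euler integration of dx/dt = alpha*x(t-tau)/(1+x(t-tau)^n) - beta*x(t).

    tau, alpha, beta, n and x0 are common choices, not the paper's values."""
    delay = int(tau / dt)
    hist = np.full(delay + 1, x0)      # constant history as initial condition
    out = np.empty(T)
    for t in range(T):
        x_tau, x_now = hist[0], hist[-1]
        x_new = x_now + dt * (alpha * x_tau / (1.0 + x_tau ** n) - beta * x_now)
        hist = np.append(hist[1:], x_new)
        out[t] = x_new
    return out

s = mackey_glass(5000)
```

With these parameters the trajectory remains bounded and positive, oscillating chaotically around 1.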

NARMA signal:

the Non-Linear Auto-Regressive Moving Average (NARMA) task, originally proposed in [30], consists in modeling the output of the following r-order system:

y[t+1] = 0.3 y[t] + 0.05 y[t] ( Σ_{i=0}^{r−1} y[t−i] ) + 1.5 u[t−r+1] u[t] + 0.1.   (12)

The input u[t] to the system is uniform random noise in [0, 1], and the model is trained to reproduce y[t+1]. The NARMA task is known to require a memory of at least r past time-steps, since the output is determined by the current input and by the inputs and outputs relative to the last r time-steps. In our test we set the order r to a fixed value.
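A generator for the r-th order NARMA recursion of Eq. 12 can be sketched as follows; the input is drawn from U[0, 0.5] here, a common choice that keeps the recursion bounded (the extract above states [0, 1]):

```python
import numpy as np

def narma(T, r=10, seed=0):
    """r-th order NARMA series driven by i.i.d. uniform input."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 0.5, T)       # U[0, 0.5] keeps the recursion bounded
    y = np.zeros(T)
    for t in range(r, T - 1):
        y[t + 1] = (0.3 * y[t]
                    + 0.05 * y[t] * y[t - r + 1 : t + 1].sum()   # sum of last r outputs
                    + 1.5 * u[t - r + 1] * u[t]
                    + 0.1)
    return u, y

u, y = narma(2000)
```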

Multiple superimposed oscillator:

the prediction of a sinusoidal signal is a relatively simple task, which demands a minimal amount of memory to determine the next network output. However, superimposed sine waves with non-integer frequency ratios are much harder to predict, since the resulting wavelength can be extremely long. The signal we consider is the multiple superimposed oscillator (MSO), studied in [31] and defined as:

y[t] = Σ_i sin(φ_i t),

where the frequencies φ_i are incommensurable. ESNs struggle to solve this task, since the neurons in the reservoir tend to couple, while the task requires the simultaneous existence of multiple decoupled internal states [56].

4.2 Results

The averaged prediction results and the standard deviations are reported in Tab. 2. The convergence rate during the optimization of the hyperparameters for each method, expressed as the NRMSE on the validation set, is depicted in Fig. 4.

The prediction of MG is a rather simple task, and each model manages to achieve high forecast accuracy. However, by applying a dimensionality reduction to the states of the reservoir, it is possible to lower the error by one or more orders of magnitude. The standard deviation of the prediction error also decreases, especially in the models using kPCA. The best results are achieved by ν-SVR + PCA and ν-SVR + kPCA, while using ν-SVR without reducing the dimensionality of the reservoir turned out to be less effective. This means that non-linearity benefits the training, but without enforcing the regularization constraint the complexity of the model is too high to fit the test points well. As we can see, in every case the number of dimensions retained by both PCA and kPCA is much lower than the optimal number of neurons identified. This underlines the effectiveness of the regularization conveyed by our architecture. Fig. 4(a) shows that the model implementing ridge regression + kPCA achieves the lowest convergence rate during the cross-validation step. However, thanks to the generalization power provided by the nonlinear dimensionality reduction, its test error is lower than that of the other models whose readout is trained with ridge regression.

In the NARMA prediction task, the best result is achieved by training the readout function with ν-SVR on a reservoir output whose dimensionality is reduced by kPCA. NARMA is a more complex task, which requires a higher amount of nonlinearity to be solved. This is clearly reflected by the results, which improve as more nonlinearity is introduced to learn the function, both in the readout training and in the dimensionality reduction procedure. At the same time, the bias introduced by the regularization enhances the generalization capability of the network significantly. The number of dimensions of the optimal subspace is higher than in the MG task, except for the model implemented with ridge regression + PCA; in this latter case, however, we obtain the worst performance. Interestingly, from Fig. 4(b) we observe that kPCA has the lowest convergence rate, even though it is the best-performing model in the testing phase. In this case, the dimensionality reduction introduces a bias, which prevents the model from overfitting the validation data while still developing a high predictive power. On the other hand, the model with ν-SVR and no dimensionality reduction overfits the validation data, with a consequent poor performance in the test phase.

Finally, in the MSO task the model with the highest prediction performance is ν-SVR without dimensionality reduction. In this case, the signal to predict has an extremely long periodicity, which demands a high amount of memory in the network. Hence, the compression of information through dimensionality reduction can hamper the memory capacity of the network. Furthermore, due to the long periodicity, the slice of time-series used to train the network can be quite different from the slice to be predicted in the test. Consequently, test points are projected onto a subspace which is not optimal, as the basis is learned from the training data. As expected, the number of dimensions kept after the dimensionality reduction is larger than in the other tasks. The need for a high degree of complexity is also denoted by the poor results obtained using ridge regression in the readout training. From Fig. 4(c), we observe that the convergence rate is faster in the models equipped with ν-SVR, which obtain better results both in the validation and in the testing phase. This symmetry between validation and test performance reflects the limited effectiveness of the regularization constraints for this task.

Task    RT          DR     ...                                                        Error

MG      ridge reg.  --     378  1.22   0.0    0.55   0.212  0.25                      0.625
        ridge reg.  PCA    318  1.214  0.042  0.807  0.840  0.355  89                 0.294
        ridge reg.  kPCA   445  1.148  0.0    0.339  0.202  0.422  33   0.050         0.521
        ν-SVR       --     490  1.051  0.0    0.873  0.588  0.568       0.234  5.346  0.868
        ν-SVR       PCA    474  1.044  0.0    0.604  0.807  0.350  3    2.535  0.340  0.825
        ν-SVR       kPCA   466  0.738  0.0    0.373  0.664  0.069  14   0.059  7.975  0.448  0.299

NARMA   ridge reg.  --     406  1.031  0.053  0.19   0.231  0.194                     0.163
        ridge reg.  PCA    409  0.934  0.016  0.135  0.127  0.073  22                 0.963
        ridge reg.  kPCA   342  0.887  0.018  0.167  0.407  0.0    225  0.008         0.001
        ν-SVR       --     440  0.928  0.015  0.129  0.603  0.031       4.332  0.271  0.017
        ν-SVR       PCA    433  0.952  0.0    0.107  0.207  0.267  274  4.099  0.134  0.028
        ν-SVR       kPCA   460  0.962  0.002  0.1    0.302  0.037  163  0.028  4.281  0.752  0.420

MSO     ridge reg.  --     298  1.148  0.008  0.345  0.147  0.045                     0.438
        ridge reg.  PCA    499  1.141  0.017  0.187  0.184  0.309  438                0.601
        ridge reg.  kPCA   454  1.053  0.003  0.137  0.169  0.184  407  0.040         0.117
        ν-SVR       --     444  1.0    0.001  0.114  0.1    0.09        3.714  0.282  0.037
        ν-SVR       PCA    459  1.147  0.001  0.174  0.144  0.276  225  6.373  0.38   0.601
        ν-SVR       kPCA   480  1.027  0.032  0.1    0.627  0.011  135  0.002  3.529  0.204  0.495

Table 2: Average prediction results obtained on the test set. The table contains the following fields: method for readout training (RT), dimensionality reduction procedure (DR), neurons in the reservoir (N_r), spectral radius of the reservoir (ρ), noise in the ESN state update (ξ), scaling of input, teacher and feedback weights (ω_i, ω_o, ω_f), dimensionality (d), kPCA kernel width (σ), norm regularization factor (λ), ν-SVR parameters (C, γ, ν), and prediction error (Error). Best results are highlighted in bold.
Figure 4: Convergence of the error on the validation set during hyperparameter optimization with the GA, for (a) Mackey-Glass, (b) NARMA and (c) Multiple Superimposed Oscillators. Black lines represent models whose readout is trained with ν-SVR; models trained with ridge regression are depicted with gray lines.

5 Discussion

To understand the mechanics and the effectiveness of the proposed architecture, we analyze the results through the theory of nonlinear time-series analysis, which offers powerful methods to retrieve dynamical information from time-ordered data [10]. The objective of time-series analysis is to reconstruct the full dynamics of a complex nonlinear dynamical system, starting from the measurement of only one of its variables. In fact, in many cases it is possible to observe only a subset of the components necessary to determine the time-evolution law that governs the dynamical system.

The main idea which inspires this analysis is that a dynamic system is completely described by the time-dependent trajectory in its phase space. Hence, a recurrent neural network that is capable of reconstructing with a high degree of accuracy the dynamic attractor can calculate future states assumed by the system, given a state at any particular moment.

A frequently used method for phase space reconstruction is the delay-coordinate embedding, which provides an estimation of the attractor that is topologically identical to the true one. From this reconstruction, it is possible to infer several properties of the hidden dynamical system that are invariant under diffeomorphism. We refer to these measures as the dynamical invariants of the system. The most commonly studied are the fractal dimension of the attractor, the Lyapunov exponents and the Rényi entropy. In the following, we briefly introduce the delay-coordinate embedding procedure and two approaches used to estimate the aforementioned dynamical invariants. We refer the interested reader to [35, 20] for a comprehensive overview of these methods and many other aspects of time-series analysis.

Delay-coordinate embedding:

a dynamical system is characterized by a time-evolution law, which determines its trajectory in the phase space. Each specific state of the system at time $t$ is defined by a vector in the state space, $\mathbf{s}(t) \in \mathbb{R}^N$, $N$ being the number of variables of the system. The delay-coordinate embedding method allows one to reconstruct such state vectors from a time-discrete measurement of only one generic smooth function of the state space [42]. Given a discrete time-series $x(t)$ evenly sampled at rate $\Delta t$, the embedding is defined as:

$$\mathbf{y}(t) = \sum_{i=1}^{m} x\big(t + (i-1)\tau\big)\,\mathbf{e}_i,$$

where $m$ is the embedding dimension, $\tau$ is the time delay and $\{\mathbf{e}_i\}_{i=1}^{m}$ form an orthonormal basis in $\mathbb{R}^m$.
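Operationally, the embedding amounts to stacking delayed copies of the scalar series as the coordinates of the reconstructed vectors. A minimal sketch (function name and example series are illustrative, not from the paper):

```python
import numpy as np

def delay_embedding(x, m, tau):
    """Delay-coordinate embedding of a scalar series x with
    embedding dimension m and time delay tau (in samples)."""
    n = len(x) - (m - 1) * tau              # number of embedded vectors
    return np.column_stack([x[i * tau:i * tau + n] for i in range(m)])

# Example: a sine wave embedded in two dimensions traces a closed curve.
t = np.linspace(0, 8 * np.pi, 800)
Y = delay_embedding(np.sin(t), m=2, tau=25)
print(Y.shape)   # one row per embedded vector, m columns
```

Each row of `Y` is one reconstructed state vector, so downstream invariant estimators can treat it like a trajectory in phase space.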

With a proper choice of the embedding parameters $m$ and $\tau$, Takens' theorem guarantees the existence of a diffeomorphism between the real and the reconstructed dynamics [52]. A sufficient condition for a correct reconstruction is $m > 2d$, with $d$ the fractal dimension of the attractor. The value of $m$ is usually computed with the false nearest-neighbors algorithm [46], which provides an estimation of the smallest sufficient embedding dimension. On the other hand, a suitable time delay $\tau$ can be estimated by looking at the first zero of the autocorrelation function of the time-series, or by relying on nonlinear time dependencies, such as the mutual information [12].
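The autocorrelation heuristic for the delay can be sketched as follows; the zero-crossing rule implemented here is one of several common choices, and the sine test signal is purely illustrative:

```python
import numpy as np

def delay_from_autocorr(x):
    """Return the lag of the first zero crossing of the autocorrelation
    function of x, a common heuristic for the embedding delay tau."""
    x = x - x.mean()
    acf = np.correlate(x, x, mode='full')[len(x) - 1:]  # lags 0, 1, 2, ...
    acf /= acf[0]                                       # normalize to acf(0) = 1
    crossing = np.flatnonzero(acf <= 0)
    return int(crossing[0]) if crossing.size else None

# For a sine of period P, the autocorrelation first crosses zero near P/4.
t = np.arange(1000)
tau = delay_from_autocorr(np.sin(2 * np.pi * t / 100))
print(tau)
```

For real chaotic data the mutual-information criterion mentioned above is often preferred, since it also accounts for nonlinear dependencies.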

Correlation dimension:

the correlation dimension is a measure, invariant under diffeomorphism, that quantifies the similarity of geometrical objects. Attractors of dissipative chaotic systems often exhibit complicated geometries (hence the name strange), which are characterized by a non-integer fractal dimension $D_q$, called the Rényi dimension [45]. An efficient estimator of fractal dimensions is the Grassberger-Procaccia algorithm [21], which computes the correlation dimension $D_2$ through the correlation sum $C(r)$:

$$C(r) = \frac{2}{(N - w)(N - w - 1)} \sum_{i=1}^{N} \sum_{j > i + w} \Theta\big(r - \|\mathbf{y}_i - \mathbf{y}_j\|\big),$$

where the temporal spacing parameter $w$ is chosen to ensure temporal independence between samples, $\Theta$ is the Heaviside function and $r$ is the size of the small boxes used to cover the geometric shape of the attractor. If $C(r) \propto r^{D_2}$ as $r \to 0$, then $D_2$ is the correlation dimension of the attractor.
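A direct, unoptimized sketch of the correlation-sum estimator with a Theiler window; the radii and the uniform test data are illustrative assumptions, not the settings used in the experiments:

```python
import numpy as np

def correlation_sum(Y, r, w=1):
    """Fraction of point pairs closer than r, excluding temporally
    close pairs (|i - j| <= w, the Theiler window)."""
    N = len(Y)
    close, pairs = 0, 0
    for i in range(N - w - 1):
        d = np.linalg.norm(Y[i + w + 1:] - Y[i], axis=1)
        close += int(np.sum(d < r))
        pairs += d.size
    return close / pairs

# Sanity check: points scattered uniformly in the unit square have
# correlation dimension 2; D2 is the log-log slope of C(r) vs r.
rng = np.random.default_rng(0)
Y = rng.random((2000, 2))
radii = np.array([0.05, 0.1, 0.2])
C = np.array([correlation_sum(Y, r) for r in radii])
D2 = np.polyfit(np.log(radii), np.log(C), 1)[0]
print(D2)
```

In practice $D_2$ is read off from the scaling region of the log-log plot rather than from a single fit over arbitrary radii.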

Lyapuanov exponent:

the Lyapunov spectrum is another invariant measure that characterizes the predictability of a dynamical system. Lyapunov exponents quantify the rate of separation of two infinitesimally close trajectories and are closely related to the 2nd order Rényi entropy $K_2$. This quantity measures the number of possible trajectories that the system can take for a given number of time steps in the future. A perfectly deterministic system can only evolve along one possible trajectory and hence $K_2 = 0$. In contrast, for purely stochastic systems the number of possible future trajectories increases to infinity, so $K_2 \to \infty$. Chaotic systems are characterized by a finite value of $K_2$, as the number of possible trajectories diverges, but not as fast as in the stochastic case.

The largest Lyapunov exponent (LLE), $\lambda_1$, is a good estimate of $K_2$, and its sign determines whether a system is chaotic or not. The so-called direct methods can be used to compute $\lambda_1$ by estimating the divergent motion of the reconstructed space, without fitting a model to the data [57, 43]. In particular, the average exponential growth of the distance of neighboring orbits can be studied on a logarithmic scale by monitoring the prediction error $E(\Delta n)$:

$$E(\Delta n) = \frac{1}{N} \sum_{n=1}^{N} \ln \big\| \mathbf{y}_{n + \Delta n} - \mathbf{y}_{c(n) + \Delta n} \big\|,$$

$\mathbf{y}_{c(n)}$ being the nearest neighbor of $\mathbf{y}_n$ at time $n$. The LLE is estimated as the slope of $E(\Delta n)$ for $\Delta n \le \Delta n_{\max}$, where $\Delta n_{\max}$ is the forecast horizon within which the divergence of the trajectory in the phase space is evaluated.
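The direct approach can be sketched as follows: for every reference point, find its nearest neighbor outside a Theiler window, then average the log-distance between the two orbits as they evolve over the forecast horizon. The logistic map at full chaos serves as a sanity check, since its largest Lyapunov exponent is known to be ln 2; all parameter values below are illustrative:

```python
import numpy as np

def lle_direct(Y, horizon=8, theiler=10):
    """Average log-divergence E(dn) of initially nearest orbits; the
    LLE is estimated as the slope of E over the forecast horizon."""
    N = len(Y) - horizon
    E = np.zeros(horizon)
    used = 0
    for n in range(N):
        d = np.linalg.norm(Y[:N] - Y[n], axis=1)
        d[max(0, n - theiler):n + theiler + 1] = np.inf   # Theiler window
        c = int(np.argmin(d))                              # nearest neighbor
        sep = np.linalg.norm(Y[n:n + horizon] - Y[c:c + horizon], axis=1)
        if np.all(sep > 0):
            E += np.log(sep)
            used += 1
    E /= used
    return np.polyfit(np.arange(horizon), E, 1)[0]         # slope ~ LLE

# Sanity check on the logistic map x -> 4x(1-x), whose LLE is ln 2.
x = np.empty(3000)
x[0] = 0.3
for i in range(2999):
    x[i + 1] = 4 * x[i] * (1 - x[i])
lam = lle_direct(x[:, None], horizon=8)
print(lam)
```

On real reconstructed attractors the slope should only be fitted inside the linear region of $E(\Delta n)$, before the orbit separation saturates at the attractor diameter.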

5.1 ESN phase space reconstruction

In the following, we analyze two chaotic time-series generated by the Lorenz and the Moore-Spiegel systems, respectively. We evaluate the accuracy of the phase space reconstruction performed with our ESN by comparing the topological properties of the true attractor of the dynamics with those of the attractor obtained by applying a dimensionality reduction to the network reservoir. The equivalence of the attractor geometries is assessed by measuring the dynamical invariants, estimated through the correlation sum and the divergent motion of the reconstructed spaces.

In the following, we refer to the true attractor as the trajectory in the phase space generated directly by the differential equations of the dynamical system. With delay-embedding attractor we refer to the trajectory described by the embedding, generated with the delay-coordinate procedure. Finally, the ESN attractor is the trajectory spanned by the components of the output of the dimensionality reduction procedure, applied to the multivariate vector containing the sequence of the states of the reservoir (see Sect. 3). For these tests we considered only the component of the loss function relative to the prediction error in Eq. 10, and we fixed the number of dimensions in PCA and kPCA to 3. Finally, to further emphasize the effectiveness of the proposed architecture, we also consider the phase space reconstruction obtained directly from the reservoir states, in the case where the reservoir contains only 3 neurons.
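A minimal sketch of how such an ESN attractor can be obtained: drive a reservoir with an input series, collect the states, and project them onto the first three principal components. The reservoir size, spectral radius, input scaling and test input below are illustrative placeholders, not the hyperparameters or data used in the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative reservoir (all hyperparameters are assumptions).
Nr, rho = 300, 0.9
W = rng.standard_normal((Nr, Nr))
W *= rho / np.max(np.abs(np.linalg.eigvals(W)))   # rescale spectral radius
w_in = rng.uniform(-0.5, 0.5, Nr)                 # input weights

def reservoir_states(u):
    """Drive the reservoir with the input series u; return the state matrix."""
    h = np.zeros(Nr)
    H = np.empty((len(u), Nr))
    for t, ut in enumerate(u):
        h = np.tanh(W @ h + w_in * ut)
        H[t] = h
    return H

def pca_project(H, k=3):
    """Project the states onto the first k principal components."""
    Hc = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Hc @ Vt[:k].T

u = np.sin(0.2 * np.arange(2000))        # stand-in for a measured time-series
H = reservoir_states(u)
traj = pca_project(H[200:], k=3)         # drop washout, keep 3-D trajectory
print(traj.shape)
```

The rows of `traj` play the role of the reconstructed phase-space trajectory on which the dynamical invariants are then estimated.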

(a) True attractor trajectory.
(b) Time-delay embedding trajectory.
(c) ESN+PCA trajectory.
(d) ESN with 3 neurons trajectory.
Figure 5: Trajectory of the attractors of the Lorenz dynamical system in the phase space. In (a), the true trajectory, computed directly from the ordinary differential equations of the system. In (b), the trajectory reconstructed using time-delay embedding. In (c), the trajectory generated by the ESN internal state, on the subspace defined by the first 3 components of the PCA. In (d), the trajectory described by the internal state of an ESN with a small reservoir of 3 neurons.


Lorenz system: the system is governed by the following ordinary differential equations:

$$\dot{x} = \sigma(y - x), \qquad \dot{y} = x(\rho - z) - y, \qquad \dot{z} = xy - \beta z,$$

System Invariant True Emb ESN+PCA ESN+kPCA ESN small
Moore-Spiegel
Table 3: Correlation dimension ($D_2$) and largest Lyapunov exponent (LLE) of the attractors of the Lorenz and Moore-Spiegel dynamical systems. Each invariant is estimated on the trajectories generated by: the ordinary differential equations (True); the time-delay embedding (Emb); the ESN reservoir state, whose dimensionality is reduced using PCA (ESN+PCA) or kernel PCA (ESN+kPCA); the internal state of an ESN with a small reservoir of 3 neurons (ESN small).

where the variables $x$, $y$ and $z$ define the state of the system, while $\sigma$, $\rho$ and $\beta$ are system parameters. In this work we set these parameters to values for which the system exhibits chaotic behavior.
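As an illustration, the true attractor trajectory can be generated by integrating the Lorenz equations numerically. The sketch below uses a fourth-order Runge-Kutta step and the canonical chaotic parameter choice (sigma = 10, rho = 28, beta = 8/3), which is an assumption here, not necessarily the exact setting used in the experiments:

```python
import numpy as np

def lorenz_traj(n, dt=0.01, sigma=10.0, rho=28.0, beta=8/3):
    """Integrate the Lorenz equations with a 4th-order Runge-Kutta step.
    Parameter values are the canonical chaotic choice (an assumption)."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    s = np.array([1.0, 1.0, 1.0])
    out = np.empty((n, 3))
    for i in range(n):
        k1 = f(s)
        k2 = f(s + dt / 2 * k1)
        k3 = f(s + dt / 2 * k2)
        k4 = f(s + dt * k3)
        s = s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        out[i] = s
    return out

X = lorenz_traj(5000)
print(X.shape)
```

A scalar component of `X` (e.g. the first column) is what the delay-coordinate embedding and the ESN receive as input in the comparisons above.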

Fig. 5 depicts the geometric shapes of the true attractor, the delay-embedding attractor and the two ESN attractors, generated using a dimensionality reduction or a reservoir with 3 neurons. As can be observed visually, both the embedding and the ESN with dimensionality reduction reconstruct well the trajectory described by the differential equations of the dynamical system. To quantify this similarity formally, we compute the previously introduced dynamical invariants on each attractor. In Tab. 3 we report, for each phase space trajectory, the estimated correlation dimension and the largest Lyapunov exponent, which, as previously discussed, represents a good approximation of the entropy. Due to the stochastic nature of the approaches adopted for estimating these quantities, we repeated the procedure 10 times and report the average values and standard deviations. As we can see from the results, the trajectories described in the subspaces computed using PCA and kPCA both generate attractors whose dynamical invariants are well approximated. In particular, the accuracy of the reconstruction is comparable to the one obtained by the classic time-delay embedding method and in some cases it is even better. The standard deviations in the measurements of both the correlation dimension and the LLE are very small, which indicates a high degree of reliability of both measurements. As for the ESN with 3 neurons, the trajectory described is more “flat”, as can be seen in the figure. This is confirmed by the estimated correlation dimension and LLE, whose values are much lower than in the other cases. This denotes that the reconstructed dynamics are not rich enough, a symptom that the complexity and the memory of the network are not sufficient to model the underlying system.

(a) True attractor trajectory.
(b) Time-delay embedding trajectory.
(c) ESN+PCA trajectory.
(d) ESN with 3 neurons trajectory.
Figure 6: Trajectory of the attractors of the Moore-Spiegel dynamical system in the phase space. In (a), the true trajectory, computed directly from the ordinary differential equations of the system. In (b), the trajectory reconstructed using time-delay embedding. In (c), the trajectory generated by the ESN internal state, on the subspace defined by the first 3 components of the PCA. In (d), the trajectory described by the internal state of an ESN with a small reservoir of 3 neurons.


Moore-Spiegel system: this dynamical system manifests interesting synchronization properties, generated by complicated patterns of period-doubling, saddle-node and homoclinic bifurcations [3]. The differential equations which govern the system dynamics are the following:

$$\dot{x} = y, \qquad \dot{y} = z, \qquad \dot{z} = -z - (T - R + Rx^2)\,y - Tx,$$

where $x$, $y$ and $z$ form the state of the system and $T$ and $R$ are the parameters of the model. In this study, we set the parameters to values for which the dynamics of the system exhibits chaotic behavior.

In Fig. 6 we show the shape of the attractors of the dynamics, evaluated directly on the differential equations of the system, on the time-delay embedding, on the internal state of the ESN reduced through PCA, and on the state of the ESN with 3 neurons. In this second test, the reconstructed trajectories of the Moore-Spiegel system are more jagged and irregular with respect to the original one. This suggests a poorer approximation of the true dynamics of the system and is confirmed by the results in Tab. 3. Compared to the Lorenz case, the dynamical invariants estimated on the time-delay embedding and on the ESN state trajectories approximate the real ones with less accuracy. The reconstructed attractors have a lower correlation dimension, which usually denotes a poor embedding [40]. However, it is worth noticing that the two attractors reconstructed by ESN+PCA and ESN+kPCA have a larger value than the time-delay embedding and hence approximate the true dynamics better. As far as the LLE is concerned, the estimated value in each reconstructed dynamics is larger than in the original one. This means that both the time-delay embedding and the ESNs generate more chaotic dynamics, as is also reflected by the jagged trajectories in Fig. 6. Even in this case, however, the LLE is better approximated by ESN+PCA and ESN+kPCA than by the time-delay embedding. As before, the standard deviations of the estimates of the two dynamical invariants are very small, which provides a high degree of confidence in the measurements. As for the trajectory described by the state of an ESN with a small reservoir of 3 neurons, the geometric properties of the reconstructed attractor differ even more from the real ones. This confirms that also in this case such a small number of neurons cannot capture the dynamical properties of the system to be modeled.

As a concluding remark, it is important to highlight another aspect of the utility of the ESN in reproducing the attractor of the system dynamics. In fact, it provides a valid alternative to the standard approach based on the time-delay embedding for reconstructing the phase space of the system, which presents several caveats and pitfalls [10]. This is a fundamental tool for a wide set of applications where an accurate estimation of the phase space of the system is required [35].

6 Conclusions and future directions

In this work we have presented a new framework for training an Echo State Network, which enhances its generalization capabilities through the regularization constraint introduced by the smoothing effect of a dimensionality reduction procedure. Through a series of tests on benchmark datasets, we have demonstrated that the proposed architecture can achieve better prediction performance in different contexts. Subsequently, we provided a theoretically grounded explanation of the functioning of the proposed architecture, based on the theory of nonlinear time-series analysis. By studying the dynamical properties of the network from this novel perspective, we showed that through an ESN it is possible to reconstruct the phase space of a dynamical system; this offers a solid, yet simple, alternative to the time-delay embedding procedure.

We believe that this work could be useful not only to enhance the prediction capabilities of an ESN, but also to provide a new instrument for the analysis of dynamical systems. As a follow-up of a recent work focused on identifying the edge of criticality of an ESN by evaluating the Fisher information on the state matrix [38], we plan to study the criticality using more reliable Fisher information matrix estimators, which are capable of working only on spaces with few dimensions (e.g., [25]). We also plan to investigate other dimensionality reduction methods, manifold learning and semi-supervised learning approaches to shrink and regularize the output of the network recurrent layer [4, 5]. Finally, as a future work, we propose to use different dimensionality reduction techniques in parallel and combine their results through a single reservoir to produce the final result.


  • Alexandre et al. [2009] L. A. Alexandre, M. J. Embrechts, and J. Linton. Benchmarking reservoir computing on time-independent classification tasks. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pages 89–93. IEEE, 2009.
  • Baker [1973] C. T. H. Baker. The numerical treatment of integral equations. 1973.
  • Balmforth and Craster [1997] N. Balmforth and R. Craster. Synchronizing Moore and Spiegel. Chaos: An Interdisciplinary Journal of Nonlinear Science, 7(4):738–752, 1997.
  • Belkin et al. [2006] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. The Journal of Machine Learning Research, 7:2399–2434, 2006.
  • Bengio et al. [2004] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. Le Roux, and M. Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. Advances in neural information processing systems, 16:177–184, 2004.
  • Bianchi et al. [2015a] F. M. Bianchi, E. De Santis, A. Rizzi, and A. Sadeghian. Short-term electric load forecasting using echo state networks and PCA decomposition. IEEE Access, 3:1931–1943, Oct. 2015a. ISSN 2169-3536. doi: 10.1109/ACCESS.2015.2485943.
  • Bianchi et al. [2015b] F. M. Bianchi, S. Scardapane, A. Uncini, A. Rizzi, and A. Sadeghian. Prediction of telephone calls load using Echo State Network with exogenous variables. Neural Networks, 71:204–213, 2015b. doi: 10.1016/j.neunet.2015.08.010.
  • Bianchi et al. [2016] F. M. Bianchi, L. Livi, and C. Alippi. Investigating echo state networks dynamics by means of recurrence analysis. arXiv preprint arXiv:1601.07381, 2016.
  • Boedecker et al. [2012] J. Boedecker, O. Obst, J. T. Lizier, N. M. Mayer, and M. Asada. Information processing in echo state networks at the edge of chaos. Theory in Biosciences, 131(3):205–213, 2012.
  • Bradley and Kantz [2015] E. Bradley and H. Kantz. Nonlinear time-series analysis revisited. Chaos: An Interdisciplinary Journal of Nonlinear Science, 25(9):097610, 2015.
  • Burges [1998] C. J. Burges. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2):121–167, 1998.
  • Cao [1997] L. Cao. Practical method for determining the minimum embedding dimension of a scalar time series. Physica D: Nonlinear Phenomena, 110(1):43–50, 1997.
  • Davenport et al. [2007] M. A. Davenport, M. F. Duarte, M. B. Wakin, J. N. Laska, D. Takhar, K. F. Kelly, and R. G. Baraniuk. The smashed filter for compressive classification and target recognition. In Electronic Imaging 2007, pages 64980H–64980H. International Society for Optics and Photonics, 2007.
  • Deihimi and Showkati [2012] A. Deihimi and H. Showkati. Application of echo state networks in short-term electric load forecasting. Energy, 39(1):327–340, 2012.
  • Deihimi et al. [2013] A. Deihimi, O. Orang, and H. Showkati. Short-term electric load and temperature forecasting using wavelet echo state networks with neural reconstruction. Energy, 57:382–401, 2013.
  • Dutoit et al. [2009] X. Dutoit, B. Schrauwen, J. V. Campenhout, D. Stroobandt, H. V. Brussel, and M. Nuttin. Pruning and regularization in reservoir computing. Neurocomputing, 72(7–9):1534–1546, 2009. ISSN 0925-2312.
  • Fodor [2002] I. K. Fodor. A survey of dimension reduction techniques, 2002.
  • Fraser and Swinney [1986] A. M. Fraser and H. L. Swinney. Independent coordinates for strange attractors from mutual information. Physical review A, 33(2):1134, 1986.
  • Friedman [1997] J. H. Friedman. On bias, variance, 0/1—loss, and the curse-of-dimensionality. Data mining and knowledge discovery, 1(1):55–77, 1997.
  • Gao et al. [2007] J. Gao, Y. Cao, W.-w. Tung, and J. Hu. Multiscale analysis of complex time series: integration of chaos and random fractal theory, and beyond. John Wiley & Sons, 2007.
  • Grassberger and Procaccia [2004] P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors. In The Theory of Chaotic Attractors, pages 170–189. Springer, 2004.
  • Hai-yan et al. [2005] D. Hai-yan, P. Wen-jiang, and H. Zhen-ya. A multiple objective optimization based echo state network tree and application to intrusion detection. In VLSI Design and Video Technology, 2005. Proceedings of 2005 IEEE International Workshop on, pages 443–446, May 2005. doi: 10.1109/IWVDVT.2005.1504645.
  • Han and Lee [2014a] S. Han and J. Lee. Fuzzy echo state neural networks and funnel dynamic surface control for prescribed performance of a nonlinear dynamic system. Industrial Electronics, IEEE Transactions on, 61(2):1099–1112, Feb 2014a. ISSN 0278-0046. doi: 10.1109/TIE.2013.2253072.
  • Han and Lee [2014b] S. I. Han and J. M. Lee. Fuzzy echo state neural networks and funnel dynamic surface control for prescribed performance of a nonlinear dynamic system. Industrial Electronics, IEEE Transactions on, 61(2):1099–1112, 2014b.
  • Har-Shemesh et al. [2016] O. Har-Shemesh, R. Quax, B. Miñano, A. G. Hoekstra, and P. M. A. Sloot. Nonparametric estimation of Fisher information from real data. Physical Review E, 93(2):023301, 2016. doi: 10.1103/PhysRevE.93.023301.
  • Hotelling [1933] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417–441, Sept. 1933.
  • Huang et al. [2005] C.-M. Huang, C.-J. Huang, and M.-L. Wang. A particle swarm optimization to identifying the ARMAX model for short-term load forecasting. Power Systems, IEEE Transactions on, 20(2):1126–1133, 2005.
  • Indyk and Motwani [1998] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613. ACM, 1998.
  • Jaeger [2001] H. Jaeger. The “echo state” approach to analysing and training recurrent neural networks-with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148:34, 2001.
  • Jaeger [2002] H. Jaeger. Adaptive nonlinear system identification with echo state networks. In Advances in neural information processing systems, pages 593–600, 2002.
  • Jaeger and Haas [2004] H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. science, 304(5667):78–80, 2004.
  • Jan van Oldenborgh et al. [2005] G. Jan van Oldenborgh, M. A. Balmaseda, L. Ferranti, T. N. Stockdale, and D. L. Anderson. Did the ecmwf seasonal forecast model outperform statistical enso forecast models over the last 15 years? Journal of climate, 18(16):3240–3249, 2005.
  • Jenssen [2010] R. Jenssen. Kernel entropy component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):847–860, May 2010. ISSN 0162-8828. doi: 10.1109/TPAMI.2009.100.
  • Jenssen [2013] R. Jenssen. Entropy-relevant dimensions in the kernel feature space: Cluster-capturing dimensionality reduction. IEEE Signal Processing Magazine, 30(4):30–39, July 2013. ISSN 1053-5888. doi: 10.1109/MSP.2013.2249692.
  • Kantz and Schreiber [2004] H. Kantz and T. Schreiber. Nonlinear time series analysis, volume 7. Cambridge university press, 2004.
  • Li et al. [2012] D. Li, M. Han, and J. Wang. Chaotic time series prediction based on a novel robust echo state network. IEEE Transactions on Neural Networks and Learning Systems, 23(5):787–799, 2012.
  • Liebert and Schuster [1989] W. Liebert and H. Schuster. Proper choice of the time delay for the analysis of chaotic time series. Physics Letters A, 142(2-3):107–111, 1989.
  • Livi et al. [2016] L. Livi, F. M. Bianchi, and C. Alippi. Determination of the edge of criticality in echo state networks through fisher information maximization. arXiv preprint arXiv:1603.03685, 2016.
  • Lukoševičius and Jaeger [2009] M. Lukoševičius and H. Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009. doi: 10.1016/j.cosrev.2009.03.005.
  • Marwan et al. [2007] N. Marwan, M. C. Romano, M. Thiel, and J. Kurths. Recurrence plots for the analysis of complex systems. Physics reports, 438(5):237–329, 2007.
  • Mazumdar and Harley [2008] J. Mazumdar and R. Harley. Utilization of echo state networks for differentiating source and nonlinear load harmonics in the utility network. Power Electronics, IEEE Transactions on, 23(6):2738–2745, Nov 2008. ISSN 0885-8993. doi: 10.1109/TPEL.2008.2005097.
  • Packard et al. [1980] N. H. Packard, J. P. Crutchfield, J. D. Farmer, and R. S. Shaw. Geometry from a time series. Physical Review Letters, 45(9):712, 1980.
  • Parlitz [1998] U. Parlitz. Nonlinear time-series analysis. In Nonlinear Modeling, pages 209–239. Springer, 1998.
  • Peng et al. [2014] Y. Peng, M. Lei, J.-B. Li, and X.-Y. Peng. A novel hybridization of echo state networks and multiplicative seasonal ARIMA model for mobile communication traffic series forecasting. Neural Computing and Applications, 24(3-4):883–890, 2014.
  • Rényi [1959] A. Rényi. On the dimension and entropy of probability distributions. Acta Mathematica Academiae Scientiarum Hungarica, 10(1-2):193–215, 1959.
  • Rhodes and Morari [1997] C. Rhodes and M. Morari. The false nearest neighbors algorithm: An overview. Computers & Chemical Engineering, 21:S1149–S1154, 1997.
  • Scardapane et al. [2015] S. Scardapane, D. Comminiello, M. Scarpiniti, and A. Uncini. Significance-Based Pruning for Reservoir’s Neurons in Echo State Networks, pages 31–38. Springer International Publishing, Cham, 2015. ISBN 978-3-319-18164-6. doi: 10.1007/978-3-319-18164-6_4.
  • Schölkopf et al. [1997] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583–588. Springer, 1997.
  • Schölkopf et al. [2000] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural computation, 12(5):1207–1245, 2000.
  • Skowronski and Harris [2007] M. D. Skowronski and J. G. Harris. Automatic speech recognition using a predictive echo state network classifier. Neural networks, 20(3):414–423, 2007.
  • Srinivas and Patnaik [1994] M. Srinivas and L. M. Patnaik. Genetic algorithms: a survey. Computer, 27(6):17–26, June 1994. ISSN 0018-9162. doi: 10.1109/2.294849.
  • Takens [1981] F. Takens. Detecting strange attractors in turbulence. Springer, 1981.
  • Van Der Maaten et al. [2009] L. Van Der Maaten, E. Postma, and J. Van den Herik. Dimensionality reduction: a comparative. J Mach Learn Res, 10:66–71, 2009.
  • Varshney and Verma [2014] S. Varshney and T. Verma. Half Hourly Electricity Load Prediction using Echo State Network. International Journal of Science and Research, 3(6):885–888, 2014.
  • Verstraeten and Schrauwen [2009] D. Verstraeten and B. Schrauwen. On the quantification of dynamics in reservoir computing. In C. Alippi, M. Polycarpou, C. Panayiotou, and G. Ellinas, editors, Artificial Neural Networks – ICANN 2009, volume 5768, pages 985–994. Springer Berlin, Heidelberg, 2009. ISBN 978-3-642-04273-7. doi: 10.1007/978-3-642-04274-4_101.
  • Wierstra et al. [2005] D. Wierstra, F. J. Gomez, and J. Schmidhuber. Modeling systems with internal state using evolino. In Proceedings of the 7th annual conference on Genetic and evolutionary computation, pages 1795–1802. ACM, 2005.
  • Wolf et al. [1985] A. Wolf, J. B. Swift, H. L. Swinney, and J. A. Vastano. Determining lyapunov exponents from a time series. Physica D: Nonlinear Phenomena, 16(3):285–317, 1985.
  • Zhou et al. [2009] S. Zhou, J. Lafferty, and L. Wasserman. Compressed and privacy-sensitive sparse regression. IEEE Transactions on Information Theory, 55(2):846–866, 2009.