Time series111We interchangeably use the terms time series and sequence.
are sequences of observations that exhibit short or long term dependencies between them in time. These dependencies can be thought of as manifestations of a latent regime (e.g. natural law) governing the behaviour of the time series. Machine learning approaches designed to deal with data of a vectorial nature have no knowledge of such temporal dependencies. By introducing a model that accounts for temporal behaviour, algorithms can become “aware” of the sequential nature of the data and make the most of the available information.
Echo state networks (ESNs) 
are discrete time recurrent neural networks that have emerged as a popular method to capture the latent regime underlying a time series. ESNs have the great advantage that the hidden part, the reservoir of neurons, is fixed and only the output weights need to be trained. The ESN is thus essentially a linear model and so the output weights, also known as readout weights, can thus be easily optimised via least squares. The processing of structured data has been a topic of research for a long time[10, 11]. Regarding time series, recent attempts [4, 5, 6] have exploited the predictive capabilities of ESNs in regression and classification tasks. In the unsupervised setting, the work in  suggests compressing a linear state space model through a linear autoencoder in order to extract vectorial representations of structured data. The work in  considers the visualisation of individual observations belonging to a single sequence by temporally linking them using an ESN.
In this work, we employ the ESN in the formulation of a dimensionality reduction algorithm for visualising a dataset of time series (we extend previous work presented in 
). Given a fixed reservoir, the only free parameters in the ESN are in the readout weight vector which maps the state space to the sequence space. Thus, an optimised (i.e. trained) readout weight vector uniquely addresses an instance of the ESN (always for the same fixed reservoir) that best predicts on a given sequence. We can also reason backwards: given an observed sequence, we can train the ESN (Section2.1) and identify the readout weight vector that best predicts the given sequence. Hence, each sequence in the dataset can be mapped to the readout weight vector that exhibits the best predictive performance on it. These readout weight vectors in conjunction with the common and fixed reservoir, capture temporal features of their respective sequences. Representing sequences as weight vectors, constitutes the first part of our proposed approach (Section 3.1).
The second stage of our approach involves training an autoencoder  on the obtained readout weight vectors in order to induce a two-dimensional representation, the visualisation, at the bottleneck. At the heart of the autoencoder lies the reconstruction error function which drives the visualisation induced at the bottleneck. During training, the autoencoder receives readout weights as inputs, compresses them at the bottleneck, and outputs an approximate version of the inputs, the reconstructed readout weights. Typically, one would take as the reconstruction error function the norm between the original readout weights and reconstructed readout weights. In the proposed work, we equip the autoencoder with a different reconstruction function that assesses how well the reconstructed readout weights still predict on the sequence that it represents. If it predicts well, we deem it a good reconstruction; if it predicts poorly, we deem it a poor reconstruction (Section 3.2). An overview of the proposed method is displayed in Fig. 1.
In Section 6, we show that the autoencoder with the proposed reconstruction error function is capable of interpreting similarities between time series better than other dimensionality reduction algorithms. In Section 7, we discuss the possibility of alternative formulations of the proposed approach before concluding with some remarks on future work in Section 8.
This section introduces some notation and terminology while briefly reviewing ESNs and the autoencoder.
2.1 Echo State Networks
An ESN is a discrete-time recurrent neural network with fading memory. It processes time series composed by a sequence of observations over time that we denote here by , where is the length222In general, each sequence can have its own length . For ease of exposition, here all sequences have the same . of the sequences. Hence . Given an input , the task of the ESN is to make a prediction for the observation in the next time step. Similarly to a feedforward neural network, the ESN comprises an input layer with weights , a hidden layer with weights (hence is the size of the reservoir) and an output layer with weights , the latter weights also known as readout weights. However, in contrast to feedforward networks, ESNs equip the hidden neurons with feedback connections. The operation of an ESN is specified by the equations:
where are the hidden activations of the reservoir at time , and is a nonlinear function commonly chosen as the function. Bias terms have been omitted in the formulation for the sake of clarity in notation.
According to standard ESN methodology , parameters and in Eqs. (1), (2) are randomly generated333The spectral radius of the reservoir’s weight matrix is made to encourage the Echo State Property. and fixed. The only trainable parameters are the readout weights . Training involves feeding at each time step an observation and recording the resulting activations row-wise into a matrix . Usually, some initial observations are dismissed in order to “washout”  dependence on the initial arbitrary reservoir state (e.g. ). Given matrix , the following objective function is minimised:
The above objective can be supplemented by a regularisation term and so the combined objective is . The combined objective can be exactly minimised by solving the pertaining least squares problem and obtaining as the solution, where is the identity matrix. Given this result, we introduce function that maps a given time series to the optimal readout weights:
2.2 Deterministically Constructed Echo State Networks
In the original formulation of the ESN  the weights in and are generated stochastically and so are the connections between the hidden neurons in the reservoir. This makes the training and use of the ESN dependent on random initialisations. In order to avoid this source of randomness, we make use of a class of ESNs that are constructed in a deterministic fashion .
Deterministic ESNs make several simplifications over standard ESNs. All entries in have the same absolute value of a single scalar parameter . The signs of the entries in are deterministically generated by an aperiodic sequence: e.g. a pseudorandom binary sequence (coin flips), with outcomes and corresponding to and respectively. Similarly, the entries in are parametrised by a single scalar . As opposed to random connectivity, deterministic ESNs impose a fixed regular topology on the hidden neurons in the reservoir. Amongst possible choices, one can arrange the neurons in a cycle. A cyclic arrangement imposes the following structure on : the only nonzero entries in are on the lower sub-diagonal , and at the upper-right corner . An illustration of a cyclic deterministic ESN is shown in Fig. 2.
In this work we employ deterministic ESNs with a cyclic connectivity. Deterministic ESNs have three degrees of freedom: the reservoir size, the input weight and reservoir weight . Hence, the triple completely specifies an ESN. It has been shown empirically and theoretically (memory capacity)  that deterministic ESNs perform up to par with their stochastic counterparts. Training of a deterministic ESN is performed in exactly the same fashion as in stochastically constructed ESNs using the objective in Eq. (3).
The autoencoder  is a feedforward neural network that defines a three hidden layer architecture444To be perfectly precise, we use what is widely considered the standard autoencoder specified in [2, Sec. 12.4.2]). with the middle layer, the “bottleneck”, being smaller than the others in terms of the number of neurons denoted by . The autoencoder learns an identity mapping by training on targets identical to the inputs. Learning is hampered by the bottleneck that forces the autoencoder to reduce the dimensionality of the inputs, and hence the output is only an approximate reconstruction of the input.
Given general vectors , we want to reduce them to a -dimensional representation. The autoencoder is the composition of an encoding and a decoding function. Encoding maps inputs to low-dimensional compressed versions, , while decoding maps approximately back to the inputs, . The complete autoencoder is the function , where are the weights of the autoencoder. Training the autoencoder involves minimising the norm between given vectors and their reconstructions:
3 Model Formulation
The proposed approach consists of two stages. In Section 3.1, we discuss how time series are embedded in the space of readout weight vectors . Section 3.2 discusses how an autoencoder with a modified reconstruction function is applied on the readout weight vectors in a meaningful manner.
3.1 Embedding time series in the space of readout weights
Given a deterministically constructed and fixed reservoir , we cast a dataset via to a new dataset of readout weights . We emphasise that all time series are embedded in the space of readout weights with respect to the same fixed dynamic reservoir . After this embedding, visualisation proceeds by performing dimensionality reduction on the new representations . We take the view that the readout weight is a good representation for a sequence with respect to the fixed reservoir . The readout weight captures important information about in the sense that it exhibits good predictive power on it. Moreover, the readout weight vector features time-shift invariance, and can accommodate sequences of variable length.
A prerequisite for a successful embedding is a common, fixed reservoir that enables good predictive performance on the data. To find this reservoir, we opt for a simple strategy. For both and we take a regular grid of e.g. candidate values . For each combination , we perform the following:
Split each sequence in two halves and .
According to Eq. (3), train ESN on by minimising which yields .
Measure test error via .
Matrices and respectively record row-wise the activations and as specified in Section 2.1. The combination with the lowest test error over all sequences , determines the ESN that will cast all time series in the dataset to readout weights. Parameters and may also be included in this simple validation scheme.
3.2 ESN-coupled Autoencoder
We want to reduce the dimensionality of the new representations using an autoencoder. One possibility is to directly apply the autoencoder taking as input the readout weights and returning their reconstructed versions, . We could then minimise the following objective function with respect to the autoencoder weights :
A limitation of the above objective function is that it merely measures how well the reconstructions approximate the original inputs in the sense.
A better objective would measure reconstruction error in the sequence space as opposed to the space of readout weights. To that purpose, we map the reconstructed readout weights to the sequence space by multiplying with the respective state matrix, . In actual fact, function in Eq. (3) is cut out for this task: if returns high likelihood, then is a good reconstruction; if returns low likelihood, then is a poor reconstruction. The new objective function reads:
, is calculated by backpropagation. We use L-BFGS as the optimisation routine for training the weights .
3.3 Data Projection
Having trained the autoencoder , we would like to project a time series to a coordinate . To that end, we first use the fixed ESN reservoir to cast the time series to . Then, the readout weight is projected using the encoding part of the autoencoder to obtain .
4 Binary Sequences
The time series considered so far are sequences of reals . However, it is possible to extend the proposed approach to the processing of symbolic sequences. In particular, we consider binary sequences composed of observations . For an ESN to process binary sequences, we pass its outputs through the logistic function
(link function of the Bernoulli distribution). Hence, the equations specifying the operation of the ESN now read555 In Eq. (8) we subtract 0.5 from , since the symmetric transfer function is used in the dynamic reservoir. :
Here the output
of the ESN is interpreted as the probability of the next observationbeing equal to , i.e. . Accordingly, the objective function in Eq. (3) needs to be redefined. Instead of solving a least squares problem, we minimise the cross-entropy:
Training of the ESN is carried out by iterative gradient minimisation of Eq. (10) preceded by a period of washout.
The above modifications to the ESN, call for a modification also in the autoencoder. While in Eq. (7) reconstruction is measured via the least-squares based function , we now use the cross-entropy based function . In order for the autoencoder to process correctly the weights coming from binary sequences, its objective function needs to be changed to:
In the case of binary sequences, the outputs of the autoencoder are put though the function .
By adopting a 1-of-K coding scheme for the symbols, and the softmax function in the place of the logistic function, an extension to number of symbols is possible. The resulting objective for training the ESN is again a cross-entropy function.
5 Magnification Factors
In Fig. 3, the smooth nonlinear function embeds the low-dimensional visualisation space as a -dimensional manifold in the space of readout weights . Each point addresses an ESN model666We always have the same fixed reservoir. with readout weights . The ESN model may be viewed as a probabilistic model, if we assume that observations
are corrupted by i.i.d. Gaussian noise of zero mean and variance:
Thus, each point addresses a probabilistic model , and is a manifold of probabilistic models .
Manifold is endowed with a natural metric for measuring distances between probabilistic models
. Specifically, the metric tensor on a statistical manifold at a given pointis the Fisher information matrix (FIM) . Here, we approximate it through the observed FIM over the given dataset of sequences:
We note that the visualisation space does not necessarily reflect distances between models on . In Fig. 3, we see how neighbourhoods of , depicted as dotted ellipses, transform on . Thus, in order to interpret distances in , it is important to push-forward the natural notion of distances on onto the visualization space . In the topographic mapping literature the induced metric in the visualization space from the data space is usually represented through magnification factors . In the following we show how magnification factors can be computed in the ESN-AE setting.
Given the FIM, one can push forward local distances from onto via . In particular, at a given point
it is possible to estimate in which directionthe distance changes the most. This can be easily calculated by solving the following constrained problem:
The solution to this problem is given by setting. Eigenvalue informs us of the maximum local distortion in distance and can be taken as a measure for the local magnification factor.
6 Numerical Experiments
In the following we compare the proposed method to other visualisation algorithms and discuss the results.
In order to judge whether a visualisation truly captures similarities, we need to know a priori which time series are similar to which. We therefore employ the following particular datasets whose data items fall under known classes and are labelled. For these datasets, there is a very strong a priori expectation that the classes are governed by qualitatively distinct dynamical regimes. Thus, time series of the same class are expected to appear similar (close together) in the visualisation, while time series belonging to different classes are expected to appear dissimilar (separate) in the visualisation.
We generate sequences of length from the three qualitatively different NARMA classes  of orders . The NARMA time series is an interesting benchmark problem due to the presence of long-term dependencies.
We sample sequences from a stationary Gaussian process with correlation function given by . We generated classes by permuting parameters and . We generated from each class time series of length . Parameters and are respectively related to the fractal dimension (measuring self-similarity) and the Hurst coefficient (measuring long-memory dependence) of the series. By construction, the four classes have distinct characteristics.
The binary system GRS1915+105 is composed of an extremely heavy stellar black hole and a low-mass star. Material is transferred from the star towards the black hole through the Roche lobe. While falling into the gravitational potential of the black hole, energy is released by radiating X-ray and radio (jet) emission which is typical for the class of microquasars. A thorough investigation carried out in , detected the presence of classes of distinct dynamical patterns. Due to the lack of multiple time sequences per state, we split the observations into equal-length parts, resulting in sequences. Here we visualise classes and .
We visualise wind data777Kindly provided by the Deutscher Wetterdienst, ftp://ftp-cdc.dwd.de/ . taken from the vicinity of Hamburg, Frankfurt and Munich. Around each city, we select the 10 closest stations with a completeness of more than 99% of hourly measured wind speed data between 13/01/2014 - 31/12/2014 (
measurements per station). Missing data are interpolated using a spline function of thedegree. In order to increase the number of visualised entities, the time series of each station are cut into two non-overlapping parts of data points each. In these data there is a strong a priori expectation that time series associated with the coastal city of Hamburg are different to the other data.
Textual data (symbolic)
We visualise the first chapter of J. K. Rowling’s “Harry Potter and the Philosopher’s Stone” in three languages German, English and Spanish. A full symbolic representation of the alphabet makes the optimisation of the ESN difficult and it would be a trivial task to separate the languages as they could be identified by single words. Here, we choose a binary representation where the states and represent vowels and consonants. Punctuation and whitespaces are ignored. E.g. a German sentence is converted as follows:
Die Potters, das stimmt, das hab ich gehört -
Discarded symbols are marked by an underscore. This representation returns sequences of different length for each language, but all with at least symbols. To increase the number of sequences per language, we split the binary vectors into sequences of length with neighbouring sequences overlapping by . It is interesting to see whether texts originating from different languages still retain their distinguishing dynamics after subjected to this drastic “binarisation”.
6.2 Dimensionality Reduction Algorithms
The following dimensionality reduction algorithms are compared in the numerical experiments. All algorithms operate on the readout weights . Sequences are represented as readout weights using a deterministic cyclic ESN whose parameters are selected using the validation procedure in Section 3.1. Additionally, in this validation scheme we include the regularisation parameter . In all experiments the size of the reservoir is fixed to and we set a washout period of time steps. We set for constructing 2D visualisations.
We include PCA as it helps us gauge how difficult it is to project a dataset to low dimensions: if PCA delivers a good result, this hints that a complex, non-linear projection is superfluous.
We include t-SNE  as one of the most popular and well performing algorithms designed for vectorial data. We train t-SNE with perplexities in , and display the visualisation that shows the best class separation. The chosen perplexity is quoted in the figures.
Standard autoencoder (standard-AE)
We employ the standard autoencoder operating directly on the readout weights. The hidden layers of the encoding and decoding part have the same number of neurons. We also add a regulariser on the weights of the autoencoder to control complexity. In all experiments, we set , .
Proposed approach (ESN-AE)
The proposed ESN-AE has the same hyperparameters as the standard-AE. We again fix the hyperparameters to, .
We present the visualisations in Figs. 4 and 5. Each column of plots corresponds to one of the aforementioned dimensionality reduction algorithms, and each row to a dataset. The projections in the plots appear as coloured markers of different shapes indicating class origin. The legend in each plot shows which marker corresponds to which class. Following Section 5, we display local magnification factors, for the autoencoders, as the maximum eigenvalue of matrix on a regular grid of points on the visualisation space. Dark and bright values signify low and high eigenvalues/magnification factors respectively. There are no magnification factors for PCA, as the linear mapping connecting the visualisation space to the high-dimensional space is length/distance preserving. Also, we do not present magnification factors for t-SNE, as it does not define an explicit mapping between the visualisation and high-dimensional space. It thus requires a different framework than the one used here in order to study magnifications.
|Order 10||Order 20||Order 30|
Narma, top row in Fig. 4
We note that all visualisations separate the three classes, and show that the three classes are equidistant. The magnifications in (c)c show that the standard-AE views the three classes indeed as distinctly separable clusters. However, in the case of the ESN-AE in (d)d, the magnification factors suggest the presence of distortions in distances close to class “Order 20”. This means that in actual fact class “Order 20” is separated by significant distance from the other two classes, and that classes “Order 10” and “Order 30” are closer and more similar to each other. We investigate this hypothesis with a simple experiment. We generate from each class additional sequences. For each pair of classes (classes also pair with themselves), we train on sequences from one class and measure the mean squared error on the unseen sequences of the other classes. These errors are reported in Table 1, and support this hypothesis put forward by the magnification factors in the ESN-AE visualisation.
Cauchy, middle row in Fig. 4
PCA in (e)e and t-SNE in (f)f manage to organise the classes coherently to some degree, while the standard-AE in (g)g fails to produce a convincing result. ESN-AE in (h)h displays a clear separation between all four classes. In particular the presence of magnification factors close to the two classes located in the upper right corner, shows that these two classes are potentially separated by a larger distance to the other two.
X-ray, bottom row in Fig. 4
All visualisations clearly separate the and classes. For standard-AE in (k)k, the strong magnification suggest that the class is quite different to the others. t-SNE in (j)j and ESN-AE distinguish in (l)l the classes in a clearer fashion. ESN-AE exhibits less overlapping projections, but does not put enough distance between classes and . The presence of magnifications close to the class is a hint that this class is quite different to the other ones. Still even in the absence of labels (i.e. colour markers), the classes are identifiable in the visualisation produced by ESN-AE.
|NARMA||151451.130 14984.801||116.041 46.606||44.126 11.962|
|Cauchy||121.110 4.022||102.115 2.522||95.176 2.675|
|Xray||25.376 2.274||42.727 17.423||21.798 1.617|
|Wind||5.0498 0.253||5.192 0.260||5.079 0.231|
|Textual||0.691 0.007||0.700 0.010||0.694 0.017|
Mean reconstruction and standard deviation, averaged over 10 runs.
Wind, top row in Fig. 5
None of the visualisations separates the Munich from the Frankfurt stations. Matching our prior expectation, ESN-AE in (d)d organises the stations around Hamburg in a single region, in contrast to the other visualisations which show overlap. Standard-AE fails to produce a clear result and its magnifications do not help in its interpretation any further.
Textual data (symbolic), bottom row in Fig. 5
The binary representation of the text data in three different languages shows the true power behind the ESN-AE equipped here with the logistic function. While other visualisations do not exhibit adequate separation, ESN-AE in (h)h exhibits some clear organisation. Additionally, magnifications suggest some separation between the German and English sequences. The bright magnifications that appear in the unpopulated corners are simply artefacts as the model has not seen any data in these areas.
In order to give a quantitative impression of the quality of the visualisations, we report reconstruction errors in Table 2. Each dataset is randomly partitioned times into equally sized training and test sets. For each partitioning, we train the dimensionality algorithms and measure the error on the test data using Eq. (3). For the binary textual data, the error is measured as the fraction of predictions coinciding with the binary test sequences. We exclude t-SNE as it does not offer a way of reconstructing weights from the projections.
Though the conversion of time-series into fixed-length representations is not new (e.g. ), we believe that converting the time series via a non-parametric state space model with fixed dynamic part (i.e ESN) in conjunction with an appropriately defined reconstruction function, does provide a new way of performing dimensionality reduction on time series. The results show that the proposed visualisation is better at understanding what makes sequences (dis)similar as it manages to separate classes that are governed by qualitatively distinct dynamical regimes. Indeed, the produced visualisations reflect our prior expectations as to which sequences should be similar.
Of course, combining the ESN with the autoencoder is just one possible scheme, and certainly other dimensionality reduction schemes can be devised along this line. One can exchange the ESN with other models such as autoregressive-moving-average models (ARMA), and use them to cast the time-series to fixed parameter vectors. E.g. for slow changing signals, models based on Fourier series might be more suitable than the ESN. Choosing the ESN to model the temporal features of the sequences, is indeed a subjective choice. However, this does not mean that it is a bad choice: in the relevant literature, a wealth of applications demonstrate that ESNs are good models for a large variety of real-world time series.
Besides the autoencoder, other dimensionality reduction methods that rely on optimising reconstruction error (e.g. GPLVM ) can be adapted to the visualisation of time-series; one has to modify their objective to measure reconstruction in the sequence space, just as the loss function of ESN-AE does.
We have presented a method for the visualisation of time series that couples an ESN to an autoencoder. Time series are represented as readout weights of an ESN and are subsequently compressed to a low dimensional representation by an autoencoder. The autoencoder attempts reconstruction of the readout weights in the context of the state space pertaining to the sequences thanks to the modified loss function. In future research, we plan to work on irregularly sampled time series that originate from eclipsing binary stars. The ESN will be replaced by a physical model that will cast the time series to vectors of physical parameters.
The authors from HITS gratefully acknowledge the support of the Klaus Tschira Foundation. We thank Ranjeev Misra for providing the X-ray dataset. We would also like to thank the anonymous reviewers for helping us improve the presented work.
-  T. Belloni, M. Klein-Wolt, M. Méndez, M. van der Klis, and J. van Paradijs. A model-independent analysis of the variability of GRS 1915+105. Astronomy & Astrophysics, 355:271–290, 2000.
-  Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
-  Christopher Bishop, Markus Svensen, and Christopher K. I. Williams. Magnification factors for the GTM algorithm. In ICANN, pages 64–69, 1997.
-  S.P. Chatzis and Y. Demiris. Echo state gaussian process. Neural Networks, IEEE Transactions on, 22(9):1435–1445, 2011.
-  H. Chen, F. Tang, P. Tino, and X. Yao. Model-based kernel for efficient time series analysis. In KDD, pages 392–400, 2013.
-  H. Chen, P. Tino, A. Rodan, and X. Yao. Learning in the model space for cognitive fault diagnosis. IEEE TNNLS, 25(1):124–136, 2014.
-  N. Gianniotis, S.D. Kuegler, and R. Misra P. Tino, K. Polsterer. Autoencoding time series for visualisation. In ESANN, 2015.
-  T. Gneiting and M. Schlather. Stochastic models that separate fractal dimension and the hurst effect. SIAM Review, 46(2):269–282, 2004.
-  J. Grabocka, M. Wistuba, and L. Schmidt-Thieme. Scalable classification of repetitive time series through frequencies of local polynomials. IEEE TKDE, 27(6):1683–1695, 2015.
T. Jaakkola and D. Haussler.
Exploiting generative models in discriminative classifiers.In NIPS, pages 487–493, 1998.
-  T. Jebara, R. Kondor, and A. Howard. Probability product kernels. Journal of Machine Learning Research, 5:819–844, 2004.
M. A. Kramer.
Nonlinear principal component analysis using autoassociative neural networks.AICHE Journal, 37:233–243, 1991.
-  Solomon Kullback. Information theory and statistics. Wiley, New York, 1959.
-  Neil D. Lawrence. Probabilistic non-linear principal component analysis with gaussian process latent variable models. JMLR, 6:1783–1816, 2005.
-  Mantas Lukosevicius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009.
-  A. Rodan and P. Tino. Minimum complexity echo state network. IEEE Transactions on Neural Networks, 22(1):131–144, 2011.
-  Alessandro Sperduti. Linear autoencoder networks for structured data. In International Workshop on Neural-Symbolic Learning and Reasoning, 2013.
-  Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 9:2579–2605, 2008.
-  Tzai-Der Wang, Xiaochuan Wu, and Colin Fyfe. Comparative study of visualisation methods for temporal data. In IEEE CEC, pages 1052–1056, 2012.