In this paper, we use a recurrent autoencoder model to predict time series both a single step and multiple steps ahead. Previous prediction methods, such as recurrent neural network (RNN) and deep belief network (DBN) models, cannot learn long-term dependencies, and the conventional long short-term memory (LSTM) model does not make effective use of recent inputs. By combining LSTM with an autoencoder (AE), the proposed model captures long-term dependencies across data points while using features extracted from recent observations to augment the LSTM. Based on comprehensive experiments, we show that the proposed method significantly improves on state-of-the-art performance on chaotic time series benchmarks and also performs better on real-world data. Both single-output and multiple-output predictions are investigated.
Time series forecasting and modeling is an important interdisciplinary field of research, involving, among others, computer science, statistics, and econometrics. Made popular by Box and Jenkins in the 1970s, traditional modeling procedures combine linear autoregression (AR) and moving average (MA). However, since data are now abundantly available, complex nonlinear patterns can often be extracted, which creates the need for nonlinear forecasting procedures.
Recently, neural networks with deep architectures have proven to be very successful in image, video, audio, and language learning tasks. In time series forecasting, although shallow neural networks have traditionally been adopted, deep neural networks have also attracted considerable interest among researchers. Deep belief networks (DBN) are frequently employed in current short-term traffic forecasting [9], and stacked autoencoders (SAE) are also used. However, these deep architectures cannot capture long-range dependencies across data points that lie beyond the input observations.
RNNs are particularly suitable for modeling dynamical systems, as they operate on input information as well as a trace of previously acquired information (due to recurrent connections), allowing for direct processing of temporal dependencies. RNNs can be employed for a wide range of tasks, as they inherit their flexibility from plain neural networks. Among all RNN architectures, the most successful one for characterizing long-term memory is the long short-term memory network (LSTM), which learns both short-term and long-term memory by enforcing constant error flow through the designed cell state.
Recent advances in deep learning, especially recurrent neural network (RNN) and long short-term memory (LSTM) models, have been proposed to tackle this problem. However, these models still have some disadvantages. In particular, the LSTM model does not work well in cases where the prediction is primarily based on recent past observations.
In this work, we investigate time series forecasting by combining an autoencoder (AE) with LSTM. In this model, the input observations are first encoded into latent variables, which is equivalent to feature extraction. In the LSTM cell, the input observations and latent variables are incorporated into the state transition. The decoder is a multi-layer neural network that maps the hidden state of the LSTM and the latent variables of the input to the predicted values. We term the proposed model augmented long short-term memory (A-LSTM). The LSTM cell, encoder, and decoder are trained jointly with Adam. The model can therefore capture long-term dependencies while also augmenting the prediction with features extracted from recent observations.
The experiments show that this model not only improves on benchmark results for chaotic time series prediction, but also performs better on real-world data than previous methods such as DBN pre-trained with RBM, SAE, and LSTM models. Moreover, in the 5-step and 10-step experiments, the performance on the multi-output prediction task does not degrade as the number of outputs increases, which suggests that the proposed method scales well.
The autoencoder (AE) was first introduced as a dimension-reduction model. An autoencoder takes an input vector $x$ and transforms it into a latent representation $z$. The transformation, typically referred to as the encoder, is given by
$$z = \sigma(W x + b),$$
where $W$ and $b$ are the weights and bias of the encoder, and $\sigma$ is the sigmoid function.
The resulting latent representation is then mapped back into the reconstructed feature space by the decoder as follows:
$$\hat{x} = \sigma(W' z + b'),$$
where $W'$ and $b'$ are the weights and bias of the decoder. The autoencoder is trained by minimizing the reconstruction error $\|x - \hat{x}\|^2$. Multiple autoencoders can be connected to construct a stacked autoencoder, which can be used to learn multiple levels of nonlinear features. In this work, we use an AE to extract features from the input observations, which helps the LSTM learn more efficiently.
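The encoder and decoder maps described above can be illustrated with a minimal NumPy sketch. The names `W`, `b`, `W_dec`, `b_dec`, the layer sizes, the sigmoid decoder activation, and the squared-error reconstruction loss are assumptions made for illustration, not details confirmed by the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W, b):
    """Encoder: affine map followed by a sigmoid."""
    return sigmoid(W @ x + b)

def decode(z, W_dec, b_dec):
    """Decoder: maps the latent code back to input space (sigmoid output is assumed)."""
    return sigmoid(W_dec @ z + b_dec)

def reconstruction_error(x, x_hat):
    """Squared reconstruction error (assumed form of the training loss)."""
    return float(np.sum((x - x_hat) ** 2))

rng = np.random.default_rng(0)
n_in, n_latent = 5, 3  # hypothetical sizes
W = rng.normal(scale=0.1, size=(n_latent, n_in))
b = np.zeros(n_latent)
W_dec = rng.normal(scale=0.1, size=(n_in, n_latent))
b_dec = np.zeros(n_in)

x = rng.uniform(size=n_in)       # one input window, values in [0, 1)
z = encode(x, W, b)              # latent representation
x_hat = decode(z, W_dec, b_dec)  # reconstruction
err = reconstruction_error(x, x_hat)
```

Training would adjust all four parameter arrays to reduce `err` over the dataset; the sketch only shows the forward pass.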
RNNs are discrete-time state-space models trainable by specialized weight adaptation algorithms. The input to an RNN is a variable-length sequence that can be processed recursively. While processing each symbol $x_t$, the RNN maintains its internal hidden state $h_t$. The operation of the RNN at each timestep can be formulated as
$$h_t = f_\theta(x_t, h_{t-1}),$$
where $f$ is the deterministic state transition function and $\theta$ is the parameter of $f$. The output of the RNN is computed using the following equation:
$$y_t = g_\omega(h_t),$$
where the function $g$ can be modeled as a neural network with weights $\omega$. In implementation, the function $f$ can be realized by long short-term memory.
Although traditional RNNs are effective at modeling nonlinear time series problems, several issues remain to be addressed:
Traditional RNNs cannot be trained on time series with very long time lags, which are commonly seen in real-world datasets.
Traditional RNNs rely on predetermined time lags to learn temporal sequence processing, but it is difficult to find the optimal window size automatically.
To overcome the aforementioned disadvantages of traditional RNNs, a Long Short-Term Memory (LSTM) neural network is used in this study to predict time series single-step and multi-step ahead. LSTM was initially introduced with the objective of modeling long-term dependencies and determining the optimal time lag for time series problems. An LSTM is composed of one input layer, one recurrent hidden layer, and one output layer. The basic unit in the hidden layer is the memory block, which contains memory cells with self-connections that memorize the temporal state, and a pair of adaptive, multiplicative gating units, the input gate and the output gate, which control the input and output activations of the block.
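The memory cell is primarily a recurrently self-connected linear unit, called the Constant Error Carousel (CEC), and the cell state is represented by the activation of the CEC. Because of the CEC, the multiplicative gates can learn when to open and close, and keeping the error flow through the cell constant addresses the vanishing gradient problem. Moreover, a forget gate was added to the memory cell, which lets the cell reset its state and keeps the internal state from growing without bound when learning long time series. The operation and structure of LSTM can be described as below:
$$\begin{aligned} i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\ f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\ o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c), \\ h_t &= o_t \odot \tanh(c_t), \end{aligned} \tag{2}$$
where $i_t$, $f_t$, and $o_t$ denote the input gate, forget gate, and output gate at time $t$, respectively, and $h_t$ and $c_t$ represent the hidden state and cell state of the memory cell at time $t$. Here $W$, $U$, and $b$ are the input weights, recurrent weights, and biases of each gate, and $\odot$ is element-wise multiplication.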
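A single update of an LSTM cell with input, forget, and output gates can be sketched as follows. This is a minimal NumPy illustration of the common formulation without peephole connections; the stacked weight layout (`W`, `U`, `b` holding all four blocks) and the sizes are assumptions, not the configuration used in the experiments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM update.

    W: (4H, D), U: (4H, H), b: (4H,), with rows stacked in the order
    input gate, forget gate, output gate, candidate (assumed layout).
    """
    H = h_prev.shape[0]
    a = W @ x + U @ h_prev + b
    i = sigmoid(a[0:H])          # input gate
    f = sigmoid(a[H:2 * H])      # forget gate
    o = sigmoid(a[2 * H:3 * H])  # output gate
    g = np.tanh(a[3 * H:4 * H])  # candidate cell update
    c = f * c_prev + i * g       # new cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4  # hypothetical input and hidden sizes
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(10, D)):  # run over a short random sequence
    h, c = lstm_step(x, h, c, W, U, b)
```

The additive form of the cell update `c = f * c_prev + i * g` is what carries error back through time without repeated squashing.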
Driven by the recent success of deep learning, several different deep learning approaches to time series prediction can be found in the literature. For example, deep belief networks have been used along with restricted Boltzmann machines (RBM). Turner also compares the performance of deep belief networks with that of stacked denoising autoencoders. This last type of network is also employed by Romeu et al. to predict the temperature of an indoor environment. Another application of time series forecasting uses stacked autoencoders (SAE) to predict traffic flow from a big data dataset. However, comparisons in the literature show that deep learning models such as RBM and SAE perform worse than LSTM because they cannot capture long-term dependencies across data points; their fixed window size is another reason for the suboptimal performance.
LSTM is another learning structure that has recently been used often in time series prediction. It was first applied to the prediction of chaotic time series, and an LSTM sequence-to-sequence model has also been used to predict next values. A survey reviews many applications of LSTM and other RNN architectures to the short-term load forecasting problem. However, sequence-to-sequence RNN models, which aim to predict a sequence of future values on the basis of a sequence of past values, cannot handle very long sequences or model periods in time series, and they require every input sequence to be padded to the same length. As indicated in earlier work, such LSTM models cannot utilize recent observations effectively, since they spend too many resources on long-term memory.
Time series forecasting with multiple outputs and multiple steps is an active research topic. One line of work investigated this problem by employing multiple-output support vector regression (M-SVR) with a multiple-input multiple-output (MIMO) prediction strategy. Another adopted a cooperative neural evolution method for multi-task learning, which enables neural networks to be trained with a shared knowledge representation.
Motivated by the disadvantages of the previous models above, we propose an LSTM model in which the hidden state transition depends on both the input features extracted by the autoencoder and the previous hidden state. The prediction output of the proposed model is decoded from both the hidden state and the latent representation of the input. Our model can therefore use long-term dependencies and recent observations at the same time, and it can work on long sequences of variable length. We conduct performance comparisons on both single-output and multi-output forecasting experiments.
We note that some recent work stacks an AE and an RNN or LSTM together to predict time series. Unlike our work, the AE and LSTM in those models are trained separately, and the decoder of the AE is cut off directly, which may yield suboptimal weights for the AE and the LSTM cell. They also do not report results on multi-output prediction. The proposed method outperforms these models on chaotic and real-world datasets.
In this work, we incorporate an autoencoder (AE) into the LSTM model to improve time series forecasting performance. Here, $x_t$, $z_t$, and $h_t$ denote the input observations, the latent variables of the input, and the hidden state of the LSTM at time $t$, respectively. The hidden state includes both the cell state and the hidden state of the memory cell as described in (2). In the proposed model, the LSTM is augmented by the latent variables learned by the AE and the corresponding feature extractors.
Unlike in a conventional autoencoder, the latent variable extracted at time $t$ depends on both the input observations and the previous hidden state of the LSTM cell. It can be described as below:
$$z_t = \varphi^{\mathrm{enc}}\big(\varphi^{x}(x_t),\, h_{t-1}\big),$$
where the encoder $\varphi^{\mathrm{enc}}$ and the feature extractor $\varphi^{x}$ are both implemented as two-layer neural networks. Here, the input is not processed directly by the encoder together with the hidden state. The encoder only reads the extracted feature $\varphi^{x}(x_t)$ of the input, which removes some of the redundant information in the input observations. This is especially important when processing time series with a low sampling rate.
This model applies the AE in a recurrent setting. At each timestep, unlike in a conventional AE, the output of the decoder, i.e. the predicted value $\hat{y}_t$, depends on both the current latent variables of the input and the previous hidden state of the LSTM. This process can be written as:
$$\hat{y}_t = \varphi^{\mathrm{dec}}\big(\varphi^{z}(z_t),\, h_{t-1}\big),$$
where the decoder $\varphi^{\mathrm{dec}}$ and the feature extractor $\varphi^{z}$ can be modeled by two-layer neural networks. To improve expressive capability, we also apply a feature extractor to the latent variables, which can learn multi-scale and hierarchical features embedded in the time series. It can significantly improve the prediction performance on complex and highly dynamic time series.
Inside the LSTM cell, the hidden state is updated as follows:
$$h_t = f_\theta\big(\varphi^{x}(x_t),\, \varphi^{z}(z_t),\, h_{t-1}\big),$$
where the transition function $f_\theta$ is essentially the operation described in (2). We implement the function $f_\theta$ by multi-layer neural networks with weights $\theta$. Unlike previous LSTM work, the state transition incorporates the feature of the input, the feature of the latent variables, and the previous state, rather than the original input and latent variables.
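Putting the pieces together, one timestep of the augmented model (feature extraction, encoding, decoding, state transition) might be sketched as below. This is only an illustrative NumPy forward pass under several assumptions: the helper names (`init_mlp`, `mlp`, `a_lstm_step`), all layer sizes, the tanh hidden activation, and the choice to concatenate the two feature vectors before the LSTM update are hypothetical and are not taken from the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_mlp(n_in, n_hidden, n_out):
    """Parameters of a two-layer network (sizes are illustrative)."""
    return {
        "W1": rng.normal(scale=0.1, size=(n_hidden, n_in)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(scale=0.1, size=(n_out, n_hidden)), "b2": np.zeros(n_out),
    }

def mlp(p, v):
    """Two-layer network: tanh hidden layer, linear output (assumed)."""
    return p["W2"] @ np.tanh(p["W1"] @ v + p["b1"]) + p["b2"]

def lstm_step(v, h_prev, c_prev, W, U, b):
    """Standard LSTM update on an already-extracted feature vector v."""
    H = h_prev.shape[0]
    a = W @ v + U @ h_prev + b
    i, f, o = sigmoid(a[:H]), sigmoid(a[H:2 * H]), sigmoid(a[2 * H:3 * H])
    g = np.tanh(a[3 * H:])
    c = f * c_prev + i * g
    return o * np.tanh(c), c

# Hypothetical sizes: input window, feature, latent, hidden, output.
D, F, Z, H, K = 5, 8, 4, 16, 1
phi_x = init_mlp(D, F, F)      # feature extractor for the input
enc = init_mlp(F + H, F, Z)    # encoder reads input features and previous state
phi_z = init_mlp(Z, F, F)      # feature extractor for the latent variables
dec = init_mlp(F + H, F, K)    # decoder reads latent features and previous state
W = rng.normal(scale=0.1, size=(4 * H, 2 * F))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

def a_lstm_step(x, h_prev, c_prev):
    fx = mlp(phi_x, x)                              # features of the input
    z = mlp(enc, np.concatenate([fx, h_prev]))      # latent variables
    fz = mlp(phi_z, z)                              # features of the latent variables
    y_hat = mlp(dec, np.concatenate([fz, h_prev]))  # prediction
    h, c = lstm_step(np.concatenate([fx, fz]), h_prev, c_prev, W, U, b)
    return y_hat, h, c

h, c = np.zeros(H), np.zeros(H)
for x in rng.uniform(size=(20, D)):  # 20 random input windows
    y_hat, h, c = a_lstm_step(x, h, c)
```

Note that both the encoder and the decoder read the previous hidden state, so the prediction at each step uses long-term memory and features of the most recent window together.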
The target function is the prediction error regularized by the L2 norm of all trainable weights. We can describe this target as below:
$$\mathcal{L} = \sum_{t} \|\hat{y}_t - y_t\|^2 + \lambda \|W\|_2^2, \tag{6}$$
where $\hat{y}_t$ and $y_t$ are the predicted values and ground truth at time $t$, $W$ denotes the weights of all neural networks in the proposed model, and $\lambda$ is the trade-off coefficient for the regularizer. With all functions implemented by neural networks, the accumulated objective (6) is minimized by the Adam algorithm. $\lambda$ is chosen separately for each experiment.
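As a small worked example of an L2-regularized prediction objective of this kind, the function below sums squared prediction errors and adds a weighted squared-weight penalty. The squared-error form, the function name `regularized_loss`, and the flat-list representation of the weights are assumptions made for illustration.

```python
def regularized_loss(y_pred, y_true, weights, lam):
    """Sum of squared prediction errors plus lam times the squared L2 norm of the weights.

    weights is an iterable of flat lists, one per parameter group.
    """
    err = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    penalty = sum(w ** 2 for group in weights for w in group)
    return err + lam * penalty

# Errors: (0.5 - 0.0)^2 + (0.25 - 0.25)^2 = 0.25; penalty: 1^2 + 2^2 = 5.
loss = regularized_loss([0.5, 0.25], [0.0, 0.25], [[1.0, 2.0]], lam=0.1)
```

With the trade-off coefficient set to 0.1, the example evaluates to 0.25 + 0.1 * 5 = 0.75; a larger coefficient pushes the optimizer toward smaller weights at the cost of fit.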
This section presents a comprehensive experimental study that evaluates the predictive ability of the augmented LSTM method (A-LSTM). The performance is compared with cutting-edge time series prediction models. For the single-step ahead prediction task, we compare the proposed method with the deep belief network trained with restricted Boltzmann machines (RBM), the stacked autoencoder model (SAE), and the autoencoder stacked on LSTM (Auto-LSTM). For the multi-step ahead prediction task, the comparison is conducted with multi-output support vector regression (MSVR) and co-evolutionary multi-task learning (CMTL). We conduct 5-step and 10-step ahead prediction experiments and also discuss scalability. All of the methods above have been introduced in Section 3.
The experiments are conducted using three simulated and two real-world time series datasets. The simulated time series are chaotic nonlinear systems: Mackey-Glass, Lorenz, and Rossler. One real-world dataset is the individual household electric power consumption dataset. All time series are rescaled to a bounded range so that they can be used with sigmoid units in the feedforward neural network. The performance is measured by the root-mean-squared error (RMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2},$$
where $y_i$ and $\hat{y}_i$ are the observed and predicted data, and $N$ is the number of predictions.
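The preprocessing and the error measure can be sketched in a few lines of plain Python. The target range of [0, 1] in `min_max_scale` is an assumption (the text only says the data are scaled for use with sigmoid units), and both function names are illustrative.

```python
import math

def min_max_scale(series, lo=0.0, hi=1.0):
    """Linearly rescale a series to [lo, hi]; the default range is an assumption."""
    s_min, s_max = min(series), max(series)
    span = s_max - s_min
    if span == 0:
        raise ValueError("cannot rescale a constant series")
    return [lo + (hi - lo) * (v - s_min) / span for v in series]

def rmse(observed, predicted):
    """Root-mean-squared error between two equally long sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

scaled = min_max_scale([2.0, 4.0, 6.0])
error = rmse([0.0, 0.5, 1.0], [0.0, 0.5, 0.0])  # one of three points is off by 1
```

In practice the scaling constants would be computed on the training sequence only and reused for the test sequence, so that no test information leaks into preprocessing.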
Across the four simulations in this section, the length of the training sequence is 1000 and the length of the testing sequence is 500. The optimization is performed by Adam with a preset learning rate.
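We first experiment on the Mackey-Glass time series, which is derived from the following delay differential equation:
$$\frac{dx}{dt} = \frac{a\,x(t-\tau)}{1 + x^{c}(t-\tau)} - b\,x(t).$$
The delay $\tau$ and the coefficients $a$, $b$, and $c$ are held fixed. The dataset is constructed by the second-order Runge-Kutta method with a step size of 0.1, and the resulting series is plotted below.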
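A Mackey-Glass series can be generated along these lines. The sketch uses Heun's method, one second-order Runge-Kutta variant, with the step size 0.1 stated above. The parameter values (`a=0.2`, `b=0.1`, `c=10`, `tau=17`), the initial value, and the constant initial history are commonly used defaults assumed here for illustration; they are not taken from the text.

```python
def mackey_glass(n_steps, a=0.2, b=0.1, c=10, tau=17.0, dt=0.1, x0=1.2):
    """Integrate dx/dt = a*x(t-tau)/(1 + x(t-tau)**c) - b*x(t) with Heun's method.

    All default parameter values are assumptions (commonly used settings).
    """
    delay = int(round(tau / dt))   # delay expressed in integration steps
    x = [x0] * (delay + 1)         # constant history on [-tau, 0]

    def f(x_now, x_lag):
        return a * x_lag / (1.0 + x_lag ** c) - b * x_now

    for _ in range(n_steps):
        x_now = x[-1]
        lag_now = x[-1 - delay]    # x(t - tau)
        lag_next = x[-delay]       # x(t + dt - tau)
        k1 = f(x_now, lag_now)
        k2 = f(x_now + dt * k1, lag_next)
        x.append(x_now + 0.5 * dt * (k1 + k2))
    return x[delay:]               # samples from t = 0 onward

series = mackey_glass(3000)
```

Because the delayed values always lie on the integration grid, no interpolation of the history is needed.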
For one-step ahead prediction, the input is a sequence of 5 samples with a stride of 6. For 5-step ahead prediction, the input is a sequence of 15 samples with a stride of 6. For 10-step ahead prediction, the input is a sequence of 10 samples with a stride of 6. The RMSE and its variance for the proposed method (ALSTM), RBM, SAE, and Auto-LSTM are shown in the following tables.
The variance of ALSTM for 10-step ahead prediction is 0.011232, which is comparable with CMTL and MSVR. The complete variance table is omitted.
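One possible reading of "a sequence of n samples with a stride of s" is that each input consists of n samples taken s time steps apart, with the targets following at the same spacing. The helper below builds input/target pairs under that assumption; the name `make_windows` and the spacing of the targets are hypothetical.

```python
def make_windows(series, n_inputs, stride, horizon):
    """Build (inputs, targets) pairs from a series.

    inputs: n_inputs samples spaced `stride` apart, ending at time t.
    targets: the next `horizon` samples at the same spacing (assumed).
    """
    span_in = (n_inputs - 1) * stride
    pairs = []
    for t in range(span_in, len(series) - horizon * stride):
        inputs = [series[t - k * stride] for k in range(n_inputs - 1, -1, -1)]
        targets = [series[t + k * stride] for k in range(1, horizon + 1)]
        pairs.append((inputs, targets))
    return pairs

series = list(range(100))  # toy series whose values equal their time index
pairs = make_windows(series, n_inputs=5, stride=6, horizon=1)
first_inputs, first_targets = pairs[0]
```

On the toy series the first pair uses time indices 0, 6, 12, 18, 24 as input and index 30 as the single target; setting `horizon` to 5 or 10 gives the multi-output case.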
The Lorenz attractor is a nonlinear 3-D system that provides a simplified model of convection in the atmosphere. The Lorenz equations are given by:
$$\frac{dx}{dt} = \sigma (y - x), \qquad \frac{dy}{dt} = x (r - z) - y, \qquad \frac{dz}{dt} = x y - b z.$$
In this experiment, the parameters $\sigma$, $r$, and $b$ are held fixed. The 3-D attractor and its components are plotted below.
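For illustration, a Lorenz trajectory can be simulated as follows. The parameter values (`sigma=10`, `r=28`, `b=8/3`), the initial state, the fourth-order Runge-Kutta integrator, and the step size are commonly used choices assumed here; the text does not specify them.

```python
def lorenz(state, sigma=10.0, r=28.0, b=8.0 / 3.0):
    """Right-hand side of the Lorenz system (parameter values are assumptions)."""
    x, y, z = state
    return (sigma * (y - x), x * (r - z) - y, x * y - b * z)

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step for a tuple-valued state."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def simulate(f, state, n_steps, dt=0.01):
    """Return the list of states visited, including the initial one."""
    trajectory = [state]
    for _ in range(n_steps):
        state = rk4_step(f, state, dt)
        trajectory.append(state)
    return trajectory

trajectory = simulate(lorenz, (1.0, 1.0, 1.0), n_steps=1500)
x_series = [p[0] for p in trajectory]  # one component as a scalar time series
```

Each of the three components can then be treated as a separate univariate series for the forecasting experiments.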
The performance of one-step ahead prediction is shown in the following table.
For 5-step ahead, we use 6 samples with stride of 6. For 10-step ahead, we use 10 samples with stride of 4. The results for 5-step and 10-step ahead prediction are shown below.
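The Rossler attractor is generated by a system of three differential equations as below:
$$\frac{dx}{dt} = -y - z, \qquad \frac{dy}{dt} = x + a y, \qquad \frac{dz}{dt} = b + z (x - c),$$
where the coefficients $a$, $b$, and $c$ are held fixed. The figure below plots the 3-D Rossler attractor and its components.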
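A Rossler trajectory can be generated in the same spirit. The coefficient values (`a=0.2`, `b=0.2`, `c=5.7`), the initial state, the fourth-order Runge-Kutta integrator, and the step size below are commonly used choices assumed for illustration only.

```python
def rossler(state, a=0.2, b=0.2, c=5.7):
    """Right-hand side of the Rossler system (coefficient values are assumptions)."""
    x, y, z = state
    return (-y - z, x + a * y, b + z * (x - c))

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step for a tuple-valued state."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6.0 * (p + 2 * q + 2 * u + v)
                 for s, p, q, u, v in zip(state, k1, k2, k3, k4))

state = (1.0, 1.0, 1.0)
rossler_trajectory = [state]
for _ in range(5000):  # 5000 steps of size 0.01
    state = rk4_step(rossler, state, 0.01)
    rossler_trajectory.append(state)

x_component = [p[0] for p in rossler_trajectory]
```

The x and y components oscillate slowly while z stays near zero between occasional spikes, which is what makes the z component the hardest of the three to forecast.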
For one-step ahead prediction, we use 5 samples with stride of 6. The plots of prediction results are shown in the following figure.
The performance comparison for one-step ahead prediction is shown as below:
The performance comparisons for 5-step and 10-step ahead predictions are shown below.
The variance of ALSTM is 0.103214, and the complete table for variance is omitted.
This dataset contains measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years, which is equivalent to 7000 data points. The first 1000 data points are used for training and the remainder for testing.
The experiment shows that the ALSTM model also works well on real-world data. The following plots show the estimated values and their errors, compared with the ground truth.
Comparing the experiments with different numbers of outputs, we can see that the performance of the ALSTM model does not degrade in the multi-output cases, which indicates that it scales well in practical applications.
This paper shows the advantage of the ALSTM model for time series prediction. With comprehensive experiments, we show that the ALSTM model can outperform cutting-edge algorithms on time series prediction and improve on state-of-the-art performance on both chaotic benchmarks and real-world datasets. Moreover, the predictions from the ALSTM model have variance comparable to that of previous methods, resulting in small confidence intervals. Comparing the performance for different numbers of prediction outputs, we can see that the performance of 10-step prediction does not degrade much, showing that the proposed method scales well to multi-output prediction. In the future, we plan to apply ALSTM to financial data, which has more variance and uncertainty.
J.-T. Turner. Time series analysis using deep feed forward neural networks. Ph.D. thesis, University of Maryland, Baltimore County, 2014.
P. Romeu, F. Zamora-Martínez, P. Botella-Rocamora, J. Pardo. Time-series forecasting of indoor temperature using pre-trained deep neural networks. In: Artificial Neural Networks and Machine Learning - ICANN 2013, pp. 451-458, Springer, 2013.