1 Introduction
Time series modeling, as an example of signal processing problems, has played an important role in a variety of domains, such as complex dynamical system analysis [18], speech analysis [22], noise filtering [12], and financial market analysis [29]. It is of interest to extract temporal information, uncover the correlations among past observations, analyze the dynamic properties, and predict future behaviors. Among all these application scenarios, the future behavior prediction task is of the greatest interest [8, 27]. However, such a task is usually challenging due to the nonstationarity, nonlinearity, small sample size, and high noise of time series data.
The goal of time series forecasting is to generate the future series based on the historical observations. Besides, the observations are often related to some exogenous variables. Different models have been proposed for time series prediction with access to the exogenous data. For example, as illustrated in Fig. 1(a), the autoregressive moving-average model with exogenous inputs (ARMAX) assumes that the current value relies not only on the historical observations but also on the past exogenous variables. However, this method assumes that the underlying model is linear, which limits its applications to real-world time series. As shown in Fig. 1(b), the mixed history recurrent neural network (MIST-RNN) model [10] includes both past exogenous data and past observations, thus making more precise predictions. One problem is that MIST-RNN treats exogenous variables indistinguishably, ignoring their inherent temporal dynamics. Recently, the Dual-Attention Recurrent Neural Network (DA-RNN) model [21] was proposed to exploit the temporal dynamics of exogenous data in prediction (given in Fig. 1(c)). Although DA-RNN achieves better performance in some experiments, it is doubtful whether this method can be widely used in practice, because when predicting at a future time step, the exogenous data at that step are in general unavailable. Moreover, DA-RNN does not consider the correlations among different components of exogenous data, which may lead to poor predictions of complex real-world patterns. Therefore, how to make reliable future predictions based solely on the past exogenous data and historical observations still remains an open question.
To address the aforementioned issues, as illustrated in Fig. 1(d), we need to first model the “spatial” (we use this word for convenience, in opposition to “temporal”) relationships between different components of the exogenous data at each time step, which usually present strong correlations with the observation. Second, we need to model the temporal behaviors of the historical observations and exogenous series as well as their interactions. In addition, the temporal information is usually complicated and may occur at different semantic levels. Therefore, fully exploiting the spatial properties and the temporal properties of the historical observations and exogenous data are two key problems for time series prediction.
In this paper, we propose an end-to-end neural network architecture, i.e., the Hierarchical attention-based Recurrent Highway Network (HRHN), to forecast future time series based on the observations and exogenous data only from the past. The contributions of this work are threefold:

We use a convolutional neural network (ConvNet) to learn the spatial interactions among different components of exogenous data, and employ a recurrent highway network (RHN) to summarize the exogenous data into different semantics at different levels in order to fully exploit and model their temporal dynamics.

We propose a hierarchical attention mechanism, performing on the discovered semantics at different levels, for selecting relevant information in the prediction.

Extensive experimental results on three different datasets demonstrate that the proposed HRHN model can not only yield high accuracy for time series prediction, but also capture sudden changes and oscillations of time series.
2 Related Work
In recent years, time series forecasting has been intensively studied. Among all the classical models, ARMA [28], which includes dynamic autoregressive and moving-average components, has gained wide popularity. Moreover, the ARMAX model includes exogenous variables to model dynamics in historical observations. The NARMAX (nonlinear ARMAX) model is an extension of the linear ARMAX model, which represents the system via a nonlinear mapping from past inputs, outputs, and independent noisy terms to future outputs. However, these approaches usually use a predefined mapping and may not be able to capture the true underlying dynamics of time series.
RNNs [11, 23] are a class of neural networks that are naturally suited for modeling time series data [13, 25]. The NARX-RNN [17] model combines the capability of capturing nonlinear relationships by RNN with the effectiveness of gradient learning by NARX. However, traditional RNNs suffer from the issues of gradient vanishing and error propagation when learning long-term dependencies [4]. Therefore, special gating mechanisms that control access to memory cells have been developed, such as Long Short-Term Memory (LSTM) [14] and Gated Recurrent Unit (GRU) [7], which have already been used to perform time series predictions [3, 20, 24]. Moreover, the MIST-RNN [10] model has been developed based on a gating mechanism that is similar to GRU. This model makes the gradient decay’s exponent closer to zero by using the mixed historical inputs. Although MIST-RNN can capture the long-term dependencies well, it does not consider the temporal behaviors of exogenous variables that may affect the dynamic properties of the observation data.
Recently, the encoder-decoder architectures [5, 6] were developed for modeling sequential data. In time series forecasting, one usually makes predictions based on a long sequence of past observations, which forces the encoder to compress all the necessary information into a fixed-length vector, leading to poor performance as the input sequence length increases. Therefore, an attention mechanism [2] has been introduced, which improves the performance by learning a soft alignment between the input and output sequences. A representative work on time series prediction is the DA-RNN model [21]. In the encoder stage, DA-RNN exploits an input attention mechanism to adaptively extract relevant input features at each time step by referring to the previous encoder’s hidden states. In the decoder stage, DA-RNN uses a classical attention mechanism to select relevant encoder’s hidden states across all time steps. However, DA-RNN does not consider the “spatial” correlations among different components of exogenous data. More importantly, the classical attention mechanism cannot model the complicated temporal dynamics well, especially when the temporal dynamics may occur at different semantic levels.
3 Problem Formulation
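A time series is defined as a sequence of real-valued observations with successive time stamps. In this paper, we focus on time series with identical interval lengths, and let $\mathbf{y}_t$ denote the observation measured at time $t$. Meanwhile, $\mathbf{x}_t$ is the exogenous input at time $t$, which is assumed to be related to $\mathbf{y}_t$.
Denote by $T$ the time window size. Our goal is to predict the current value of the target series $\mathbf{y}_T$, given the historical observations $(\mathbf{y}_1, \ldots, \mathbf{y}_{T-1})$ as well as the past (exogenous) input series $(\mathbf{x}_1, \ldots, \mathbf{x}_{T-1})$. More specifically, we aim to learn a nonlinear mapping $F(\cdot)$ such that
$$\hat{\mathbf{y}}_T = F(\mathbf{y}_1, \ldots, \mathbf{y}_{T-1}, \mathbf{x}_1, \ldots, \mathbf{x}_{T-1}). \tag{1}$$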
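The windowing implied by this formulation can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper name `make_windows`, its arguments, and the toy data are hypothetical and not taken from the paper, and a scalar target is used for brevity. Each sample pairs the past targets and past exogenous inputs of a window with the value to be predicted at the window's last step; the exogenous input at the predicted step is deliberately left out.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], List[List[float]], float]


def make_windows(
    y: Sequence[float],
    x: Sequence[Sequence[float]],
    window: int,
) -> List[Sample]:
    """Build (past targets, past exogenous inputs, current target) samples.

    Each sample uses window - 1 past steps to predict the value at the
    window-th step; the exogenous input at the predicted step is never used.
    """
    if len(y) != len(x):
        raise ValueError("y and x must have the same number of time steps")
    if window < 2:
        raise ValueError("window must be at least 2")
    samples: List[Sample] = []
    for end in range(window - 1, len(y)):
        start = end - (window - 1)
        past_y = list(y[start:end])
        past_x = [list(row) for row in x[start:end]]
        samples.append((past_y, past_x, y[end]))
    return samples


# Toy series: five time steps, two exogenous components per step.
y = [0.0, 1.0, 2.0, 3.0, 4.0]
x = [[10.0, 0.5], [11.0, 0.6], [12.0, 0.7], [13.0, 0.8], [14.0, 0.9]]
samples = make_windows(y, x, window=3)
```

With a window of 3, every sample holds two past steps of both series and one target value, so a series of five steps yields three samples.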
4 Hierarchical Attention-Based Recurrent Highway Networks
We propose the Hierarchical attention-based Recurrent Highway Network (HRHN) for time series prediction. A detailed diagram of the system is plotted in Fig. 2. In the encoder, we first introduce the convolutional network (ConvNet) that can automatically learn the spatial correlations among different components of exogenous data. Then an RHN is used to model the temporal dependencies among convolved input features at different semantic levels. In the decoder, a novel hierarchical attention mechanism is proposed to select relevant encoded multi-level semantics. Another RHN is then introduced to capture the long-term temporal dependencies among historical observations, exploit the interactions between observations and exogenous data, and make the final prediction.
4.1 Encoder
The encoder takes historical exogenous data as inputs, and consists of a ConvNet for spatial correlation learning and an RHN for modeling exogenous temporal dynamics.
ConvNet for spatial correlations. CNNs have already been widely applied to learning sequential data [1, 16]. The key strength of CNN is that it automatically learns the feature representation by convolving the neighboring inputs and summarizing their interactions. Therefore, given the exogenous inputs with , we apply independent local convolutions on each of the inputs to learn the interactions between different components of . For some fixed , assume that the number of convolutional layers is and the number of feature maps at the th layer is . We also use kernels with fixed size of for all layers. Then the convolution unit for feature map of type at the th layer is given by
(2) 
where are the kernels for the type feature map at the th layer, is the activation function which is typically chosen to be ReLU [9], and are the bias terms. Note that for the first layer (), the inputs of (2) are the exogenous inputs, namely since .
A nonlinear subsampling approach, e.g., max pooling, is also performed between successive convolutional layers, which can reduce the size of feature maps so as to avoid overfitting and improve efficiency. Moreover, max pooling can remove the unreliable compositions generated during the convolution process. Assume that we adopt a max-pooling process, which is given as
(3) 
where starts from zero for clarity.
After several layers of convolution and max-pooling, we feed the outputs to a fully connected layer, leading to a sequence of local feature vectors with . Such a sequence can well exploit the interactions between different components at each time step.
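To make the convolution and pooling steps concrete, here is a minimal sketch in plain Python of one feature map computed across the components of a single exogenous input vector, followed by non-overlapping max pooling. It is a simplification under stated assumptions rather than the model's actual layer: only one feature map and one layer are shown, and the function names, kernel values, and input vector are hypothetical.

```python
from typing import List


def conv1d_relu(inputs: List[float], kernel: List[float], bias: float) -> List[float]:
    """Valid 1-D convolution (cross-correlation) over neighbouring components, then ReLU."""
    width = len(kernel)
    out = []
    for i in range(len(inputs) - width + 1):
        total = bias + sum(kernel[j] * inputs[i + j] for j in range(width))
        out.append(max(0.0, total))  # ReLU activation
    return out


def max_pool1d(inputs: List[float], size: int) -> List[float]:
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(inputs[i:i + size]) for i in range(0, len(inputs) - size + 1, size)]


# One exogenous input vector with six components (illustrative values).
x_t = [1.0, 2.0, 4.0, 3.0, 0.0, -2.0]
feature_map = conv1d_relu(x_t, kernel=[1.0, 0.0, -1.0], bias=0.0)
pooled = max_pool1d(feature_map, size=2)
```

A kernel of width 3 turns six components into four responses, and pooling with size 2 halves that to two values, which illustrates how pooling shrinks the feature maps between layers.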
RHN for exogenous temporal dynamics. Following the ConvNet, an RHN layer is used to model the temporal dynamics of exogenous series. Many sequential processing tasks require complex nonlinear transition functions from one step to the next. It is usually difficult to train gated RNNs such as LSTM and GRU when the networks go deeper. The RHN [30] is designed to resolve such an issue by extending the LSTM architecture to allow step-to-step transition depth larger than one, which can capture the complicated temporal properties from different semantic levels. Let , , and be the outputs of nonlinear transformations , and , respectively. and typically utilize a sigmoid () nonlinearity [30] and are referred to as the transform and carry gates. Assume that in our model the RHN has recurrence depth of and is the intermediate output at time and depth , where and with . Moreover, let and represent the weight matrices of the nonlinearity and the and gates at the th layer, respectively. The biases are denoted by . Then an RHN layer is described as
(4) 
where
(5)  
and is the indicator function meaning that is transformed only by the first highway layer. Moreover, at the first layer, is the RHN layer’s output of the previous time step. The highway network described in (4) shows that the transform gate acts as selecting and controlling the information from history, and that the carry gate can carry the information between hidden states without any activation functions. Thus, the hidden states composed at different levels can capture the temporal dynamics of different semantics.
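A single time step of a recurrent highway layer can be sketched as below, following the general RHN formulation of [30]: at every depth level a tanh nonlinearity is scaled by a sigmoid transform gate and added to the previous level's state scaled by a sigmoid carry gate, and the external input enters only at the first level. The parameter names (`W_h`, `R_h`, `b_h`, and so on), the dictionary layout, and the random initialization are assumptions made for illustration, not the paper's notation or training setup.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def rhn_step(x: Vector, s_prev: Vector, layers: List[Dict[str, list]]) -> List[Vector]:
    """One time step of a recurrent highway layer.

    Returns the intermediate state of every depth level; the last entry is
    the state carried over to the next time step.
    """
    states = []
    s = s_prev
    for depth, p in enumerate(layers):
        pre = {}
        for gate in ("h", "t", "c"):
            z = [r + b for r, b in zip(matvec(p["R_" + gate], s), p["b_" + gate])]
            if depth == 0:  # indicator: the input enters only at the first level
                z = [a + b for a, b in zip(z, matvec(p["W_" + gate], x))]
            pre[gate] = z
        h = [math.tanh(z) for z in pre["h"]]   # candidate nonlinearity
        t = [sigmoid(z) for z in pre["t"]]     # transform gate
        c = [sigmoid(z) for z in pre["c"]]     # carry gate
        s = [hi * ti + si * ci for hi, ti, si, ci in zip(h, t, s, c)]
        states.append(s)
    return states


def init_layer(n_in: int, n_hidden: int, rng: random.Random) -> Dict[str, list]:
    """Small random weights and zero biases (illustrative initialization only)."""
    def mat(rows: int, cols: int) -> Matrix:
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    layer: Dict[str, list] = {}
    for gate in ("h", "t", "c"):
        layer["W_" + gate] = mat(n_hidden, n_in)
        layer["R_" + gate] = mat(n_hidden, n_hidden)
        layer["b_" + gate] = [0.0] * n_hidden
    return layer


rng = random.Random(0)
layers = [init_layer(3, 4, rng) for _ in range(2)]  # recurrence depth 2
states = rhn_step([0.5, -1.0, 2.0], [0.0] * 4, layers)
```

Returning the state of every depth level, rather than only the last one, is what lets a later attention step look at each level separately.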

4.2 Decoder with Hierarchical Attention Mechanism
In order to predict the future series , we use another RHN to decode the input information which can capture not only the temporal dynamics of the historical observations but also the correlations among the observations and exogenous data. Besides, a hierarchical attention mechanism based on the discovered semantics at different levels is proposed to adaptively select the relevant exogenous features from the past.
Hierarchical attention mechanism. Although the highway networks allow unimpeded information flow across layers, the information stored in different layers captures temporal dynamics at different levels and will thus have an impact on predicting future behaviors of the target series. In order to fully exploit the multi-level representations as well as their interactions with the historical observations, the hierarchical attention mechanism computes the soft alignment of the encoder’s hidden states in each layer based on the previous decoder layer’s output . The attention weights of annotation at the th layer are given by
(6) 
where
(7) 
and , and are parameters to be learned. The bias terms have been omitted for clarity. The alignment model scores how well the inputs around position at the th layer match the output at position [2]. Then the soft alignment for layer is obtained by computing the subcontext vector as a weighted sum of all the encoder’s hidden states in the th layer, namely
(8) 
Finally, the context vector that we feed to the decoder is obtained by concatenating all the sub-context vectors in different layers, that is
(9) 
Note that the context vector is time-dependent, which selects the most important encoder information in each decoding time step. Moreover, by concatenating the sub-context vectors, can also select the significant semantics from different levels of the encoder RHN, which encourages more interactions between the observations and exogenous data than the classical attention mechanism does. Once we obtain the concatenated context vector , we can combine it with the given decoder inputs , namely
(10) 
where , and are the weight matrices and biases to be learned. The time-dependent represent the interactions between and , and are now the inputs of the decoder RHN layer.
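The idea of attending to each encoder level separately and concatenating the results can be sketched as follows. The scoring function is an additive alignment in the style of [2]; the parameter names `W_q`, `W_k`, `v`, the function names, and all numeric values are assumptions for illustration and are not the paper's notation. The sketch covers only the attention weights, the sub-context vectors, and their concatenation, not the subsequent combination with the decoder inputs.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(scores: Vector) -> Vector:
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_scores(query: Vector, keys: List[Vector],
                    W_q: Matrix, W_k: Matrix, v: Vector) -> Vector:
    """Additive alignment scores v^T tanh(W_q q + W_k k_i) for every key k_i."""
    q_proj = matvec(W_q, query)
    scores = []
    for key in keys:
        k_proj = matvec(W_k, key)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    return scores


def hierarchical_context(query: Vector,
                         encoder_states: List[List[Vector]],
                         params: List[Tuple[Matrix, Matrix, Vector]]) -> Vector:
    """encoder_states[k][i] is the encoder state at depth level k and time step i."""
    context: Vector = []
    for states_k, (W_q, W_k, v) in zip(encoder_states, params):
        weights = softmax(additive_scores(query, states_k, W_q, W_k, v))
        dim = len(states_k[0])
        # Sub-context vector: attention-weighted sum of this level's states.
        sub = [sum(w * s[d] for w, s in zip(weights, states_k)) for d in range(dim)]
        context.extend(sub)  # concatenate sub-context vectors across levels
    return context


identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [
    [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]],    # depth level 1, three time steps
    [[0.0, 1.0], [0.0, -1.0], [0.5, 0.5]],   # depth level 2, three time steps
]
params = [(identity, identity, [1.0, -1.0]), (identity, identity, [0.5, 0.5])]
context = hierarchical_context([0.1, -0.2], encoder_states, params)
```

With two levels of two-dimensional states the context has four entries. Each level has its own attention weights, so a time step that matters at one level need not matter at another.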
RHN for target temporal dynamics. Assume for simplicity that the decoder RHN also has recurrence depth of . Then the update of the decoder’s hidden states is given by
(11) 
where
(12)  
Here and represent the weight matrices of the nonlinearity and the and gate functions, respectively, and are the bias terms to be learned.
As mentioned before, our goal is to find a nonlinear mapping such that
(13) 
In our model, the prediction can be obtained by
(14) 
where is the last layer’s output and is its associated context vector. The parameters , and characterize the linear dependency and produce the final prediction result.
5 Experiments
In this section, we first introduce the datasets of interest for time series prediction and their setup. Then the parameters and performance evaluation metrics used in this work are presented. Finally, we compare the proposed HRHN model against some other cutting-edge methods, explore the performance of the ConvNet and the hierarchical attention approach in HRHN, and study the parameter sensitivity.

5.1 Datasets and Setup
We use three datasets to test and compare the performance of different methods in time series prediction. The statistics of datasets are given in Table 1.
NASDAQ 100 Stock dataset [21] collects the stock prices of 81 major corporations and the index value of NASDAQ 100, which are used as the exogenous data and the target observations, respectively. The data covers the period from July 26, 2016 to December 22, 2016, 105 trading days in total, with the frequency of every minute. We follow the original experiment [21] and use the first 35,100 data points as the training set and the following 2,730 data points as the validation set. The last 2,730 data points are used as the test set.
In addition, we also consider a state-changing prediction task in high-speed autonomous driving [19]. Our goal is to predict the state derivatives of an autonomous rally car. The vehicle states include roll angle, linear velocities ($x$ and $y$ directions) and heading rate, four dimensions in total. The exogenous measurements are vehicle states and controls including steering and throttle. The training set of 99,700 data points contains data from several different runs in one day with the frequency of every 30 seconds of high-speed driving. The validation and test sets are recorded as one continuous trajectory in the same day with sizes of 2,500 and 3,000, respectively.
In particular, a video mining task is considered in our work. We take the first two seasons of the television drama Ode to Joy, which aired in 2016 and 2017, respectively, as the dataset. Our goal is to predict the accumulated video views (VV) given the historical observations as the target data and the features extracted from the video as the exogenous data. The VV data is crawled from a public video website with the frequency of every five seconds, and the exogenous features are obtained by passing the video through a pre-trained Inception-v4 [26]. The video has a resolution of and is sampled at 25 frames per second. We use Inception-v4 to first extract the features of each frame, which yields a sequence of 1536-dimensional representations. Then we take the average of frames for each second and stack every five vectors, which leads to a sequence of 7680-dimensional vectors, representing the features of the video at every five seconds. Before feeding them into HRHN, we adopt a one-layer dense convolutional network for dimensionality reduction, resulting in 128-dimensional exogenous variables as the inputs of the encoder.
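The averaging and stacking arithmetic described above can be checked with a short sketch. Placeholder vectors stand in for the per-frame features, since no real Inception-v4 outputs are used here; the helper names and the ten-second toy clip are hypothetical, while the frame rate, feature dimension, and five-second step come from the text.

```python
from typing import List

Vector = List[float]


def average_blocks(rows: List[Vector], block: int) -> List[Vector]:
    """Average consecutive groups of `block` vectors; a trailing partial group is dropped."""
    out = []
    for start in range(0, len(rows) - block + 1, block):
        group = rows[start:start + block]
        out.append([sum(col) / block for col in zip(*group)])
    return out


def stack_blocks(rows: List[Vector], block: int) -> List[Vector]:
    """Concatenate consecutive groups of `block` vectors into one long vector each."""
    out = []
    for start in range(0, len(rows) - block + 1, block):
        stacked: Vector = []
        for row in rows[start:start + block]:
            stacked.extend(row)
        out.append(stacked)
    return out


FPS, FEATURE_DIM, SECONDS_PER_STEP = 25, 1536, 5

# Ten seconds of placeholder per-frame features (frame index repeated in every dimension).
frames = [[float(i)] * FEATURE_DIM for i in range(10 * FPS)]
per_second = average_blocks(frames, FPS)               # one 1536-dim vector per second
per_step = stack_blocks(per_second, SECONDS_PER_STEP)  # one 7680-dim vector per five seconds
```

Averaging 25 frames keeps the dimension at 1536, and stacking five such vectors gives 5 × 1536 = 7680 dimensions per five-second step.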
5.2 Parameters and Performance Evaluation
For simplicity, we assume the RHN layer has the same size in the encoder and the decoder of our model. Therefore, the parameters in our model include the number of time steps in the window , the size of convolutional layers, kernels and max-pooling for the ConvNet, and the size of hidden states and recurrence depth for the RHN. To evaluate the performance of HRHN in different datasets, we choose different but fixed parameters. In the NASDAQ 100 Stock dataset, we choose , i.e., assuming that the target series is related to the past 10 steps, and the RHN layer with size of , namely the RHN has the recurrence depth of 2 and 128 hidden states in each layer. For the ConvNet, we use three convolutional layers with kernel width of 3, and 16, 32 and 64 feature maps respectively, followed by a max-pooling layer after each convolutional layer. In the autonomous driving dataset, we use only one convolutional layer with kernels and 64 feature maps, followed by a max-pooling process. Besides, we let and RHN size to be . Finally, for the Ode to Joy Video Views dataset, we choose and the size of RHN to be . Again, only one layer of convolution and max-pooling is used, with pooling size of and 128 feature maps. In order to fully convolve the features of every second, we use a kernel with width of 384, which is the dimension of stacking 3-second features. Moreover, we employ the strides with shape of to avoid involving redundant information.
In the training process, we adopt the mean squared error as the objective function:
(15) 
where is the number of training samples and is the dimension of target data. All neural models are trained using the Adam optimizer [15] on a single NVIDIA Tesla GPU.
We consider three different metrics for performance evaluation on the single-step prediction task, namely the root mean squared error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE) [21]. Note that since we have multivariate target data in the experiments, we define the metrics of multivariate data as the average of the metrics along all dimensions. More specifically, assume that is the target value and is the predicted value at time , then the RMSE is defined as
(16) 
and the MAE and MAPE are given by
(17) 
and
(18) 
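The three metrics, averaged over the target dimensions as described above, can be sketched as follows. The function names and toy values are hypothetical, and expressing MAPE as a percentage (the factor of 100) is a common convention assumed here rather than taken from the text.

```python
import math
from typing import Callable, List

Series = List[float]


def rmse(target: Series, pred: Series) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(target, pred)) / len(target))


def mae(target: Series, pred: Series) -> float:
    return sum(abs(a - b) for a, b in zip(target, pred)) / len(target)


def mape(target: Series, pred: Series) -> float:
    # Raises ZeroDivisionError when a target value is exactly zero.
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(target, pred)) / len(target)


def averaged_over_dims(metric: Callable[[Series, Series], float],
                       y_true: List[Series], y_pred: List[Series]) -> float:
    """Compute `metric` separately for each target dimension, then average."""
    dims = len(y_true[0])
    values = []
    for d in range(dims):
        target = [row[d] for row in y_true]
        pred = [row[d] for row in y_pred]
        values.append(metric(target, pred))
    return sum(values) / dims


# Two time steps of a two-dimensional target (illustrative values).
y_true = [[1.0, 2.0], [2.0, 4.0]]
y_pred = [[1.0, 3.0], [4.0, 4.0]]
scores = {
    "RMSE": averaged_over_dims(rmse, y_true, y_pred),
    "MAE": averaged_over_dims(mae, y_true, y_pred),
    "MAPE": averaged_over_dims(mape, y_true, y_pred),
}
```

Because MAPE divides by the target value, it cannot be computed for series containing zeros.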
5.3 Result I: Time Series Prediction
To demonstrate the effectiveness of HRHN, we first compare the performance of HRHN against some cutting-edge methods in the same prediction tasks. We choose ARIMA as the representative of the traditional methods and three other deep learning methods, including the attention-based encoder-decoder models with LSTM and GRU layers, and the DA-RNN model [21]. The prediction results over three datasets are presented in Table 2. For each dataset, we run each method 5 times and report the median in the table. Note that for the autonomous driving dataset, there are many zero values after normalization, which makes the MAPE ill-defined. Therefore, the MAPE result is omitted on this dataset.
We can observe that the deep learning methods based on neural networks perform better than the traditional learning model (ARIMA), and the proposed HRHN achieves the best results for all three metrics across all datasets, due to the following reasons. On the one hand, the LSTM or GRU model only considers the temporal dynamics of exogenous inputs, while the DA-RNN model uses an input attention mechanism to extract relevant input features of exogenous data. However, DA-RNN does not capture the correlations among different components of the inputs. Our HRHN model can learn the spatial interactions of exogenous data by introducing the ConvNet. On the other hand, the RHN layer can model the complicated temporal dynamics with different semantics at different levels, and the hierarchical attention mechanism can exploit such information, which is better than the classical attention approach and other gating mechanisms.


Besides, one can also expect that the difference in results among the methods is bigger on more complicated dynamic systems. For example, in the autonomous driving dataset, the deep learning models perform much better than ARIMA due to the complexity of multivariate target outputs. In addition, the difference among methods is bigger in the video views dataset than in the stock dataset, since the patterns within a video are more complex than those in stock prices.
In time series prediction, it is sometimes more interesting to compare the performance on capturing the so-called rare events, e.g., an oscillation after a stable growth or decay, or a huge sudden change during oscillations. The visual comparison on predicting such events by HRHN against ARIMA and DA-RNN is given in Fig. 3. We plot some test samples and the corresponding prediction results over the Ode to Joy video views dataset. One can see that around steps 320 (left three circles) and 4070 (right three circles), where the oscillation occurs after a stable change, HRHN fits the ground truth better than others, which again illustrates the ability of ConvNet in summarizing the interactions of exogenous inputs and the hierarchical attention approach in selecting important multi-level features.
5.4 Result II: Effectiveness of Modules in HRHN
The effectiveness of HRHN can also be shown via a step-by-step justification. We compare HRHN against the classical attention-based RHN encoder-decoder model and the setting that only employs the ConvNet (RHN + ConvNet) or the hierarchical attention mechanism (RHN + HA). The results are presented in Table 3, where we again run each experiment 5 times and report the median. We provide a brief analysis as follows based on the results.
RHN. One can notice that the single RHN encoder-decoder model outperforms DA-RNN in all metrics and datasets except the MAE for the video views dataset (which is also comparable). Although RHN cannot model the spatial correlations of exogenous inputs, it can capture their temporal dynamics from different levels by allowing deeper step-to-step transitions.
ConvNet. From the results in Table 2 and Table 3, we can observe that the RHN equipped with ConvNet consistently outperforms DA-RNN and single RHN, which suggests that by convolving the neighboring exogenous inputs, ConvNet is able to summarize and exploit the interactions between different components at each time step, which has an impact on predicting the target series.
Hierarchical attention. Similarly, the RHN equipped with the hierarchical attention also outperforms DA-RNN and single RHN. To further demonstrate the effectiveness of the hierarchical attention mechanism, we employ the classical attention-based RHN approach where the attention is computed from each single intermediate layer of the encoder RHN, and compare the performance with that of the aforementioned RHN model. We still take the video views dataset as an illustration since the largest recurrence depth is used in this task. Because a four-layer RHN is adopted in the prediction, we compare the results obtained from Table 3 to the performance of the RHN with attentions of the three hidden layers. From the results stated in Table 4, we can confirm that useful information is also stored in the intermediate layers of RHN and the hierarchical attention mechanism can extract such information in predicting future series.
5.5 Result III: Parameter Sensitivity
Finally, we study the parameter sensitivity of the proposed methods, especially the HRHN model. Parameters of interest include the length of time steps and the size of recurrence depth of the encoder and decoder RHN. The video views dataset is used again for demonstration. When we vary or , we keep the other fixed. By setting , we plot the RMSE against different lengths of time steps in the window in Fig. 4 (left) and by setting , we also plot the RMSE against different recurrence depths for RHN in Fig. 4 (right). We compare the sensitivity results of several aforementioned methods, including the single RHN encoder-decoder model, RHN + ConvNet, RHN + HA, and HRHN.
We notice that the results of both cases and all the models do not differ much for different parameters. In particular, the difference on HRHN is the smallest among the models (less than 2%), which implies the robustness of our HRHN model. Moreover, we can also observe that HRHN performs worse when the window size or recurrence depth is too small or too large, since the former leads to lack of sufficient information for feature extraction while the latter produces redundant features for capturing the correct temporal dependencies.
6 Conclusion and Future Work
In this paper, we proposed the Hierarchical attention-based Recurrent Highway Network (HRHN) for time series prediction. Based upon the modules in the proposed model, HRHN can not only extract and select the most relevant input features hierarchically, but also capture the long-term dependencies of the time series. The extensive experiments on various datasets demonstrate that our proposed HRHN advances the state-of-the-art methods in time series prediction. In addition, HRHN is good at capturing and forecasting the rare events, such as sudden changes and sudden oscillations.
In the future, we intend to apply HRHN to other time series prediction or signal processing tasks, e.g., multi-step prediction (sequence-to-sequence learning). Moreover, we believe that our model can also produce useful information for forecasting future exogenous variables. Besides, we will investigate the correlations among different components of target variables as well. In this work, we predict all the components of multivariate target series simultaneously. However, predicting one component within the target series should help us generate the predictions on the others, which also requires more investigation.
Acknowledgments
This work is supported in part by US NSF CCF1740833, DMR1534910 and DMS1719699.
References
 [1] O. Abdel-Hamid, L. Deng, and D. Yu. Exploring convolutional neural network structures and optimization techniques for speech recognition. Interspeech, pages 3366–3370, 2013.
 [2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 [3] P. Bashivan, I. Rish, M. Yeasin, and N. Codella. Learning representations from eeg with deep recurrent-convolutional neural networks. arXiv preprint arXiv:1511.06448, 2015.
 [4] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
 [5] K. Cho, B. V. Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
 [6] K. Cho, B. V. Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 [7] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 [8] J. T. Connor, R. D. Martin, and L. E. Atlas. Recurrent neural networks and robust time series prediction. IEEE transactions on neural networks, 5(2):240–254, 1994.

 [9] G. E. Dahl, T. N. Sainath, and G. E. Hinton. Improving deep neural networks for lvcsr using rectified linear units and dropout. Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8609–8613, 2013.
 [10] R. DiPietro, N. Navab, and G. D. Hager. Revisiting narx recurrent neural networks for long-term dependencies. arXiv preprint arXiv:1702.07805, 2017.
 [11] J. L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine learning, 7(2-3):195–225, 1991.
 [12] J. Gao, H. Sultan, J. Hu, and W.-W. Tung. Denoising nonlinear time series by adaptive filtering and wavelet shrinkage: a comparison. IEEE signal processing letters, 17(3):237–240, 2010.
 [13] A. Graves, A.-r. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649, 2013.
 [14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 [15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [16] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10), 1995.
 [17] T. Lin, B. Horne, P. Tino, and C. Giles. Learning long-term dependencies in narx recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338, 1996.
 [18] Z. Liu and M. Hauskrecht. A regularized linear dynamical system framework for multivariate time series analysis. pages 1798–1804, 2015.
 [19] Y. Pan, X. Yan, E. A. Theodorou, and B. Boots. Prediction under uncertainty in sparse spectrum gaussian processes with applications to filtering and control. ICML, pages 2760–2768, 2017.
 [20] S. C. Prasad and P. Prasad. Deep recurrent neural networks for time series prediction. arXiv preprint arXiv:1407.5949, 2014.
 [21] Y. Qin, D. Song, H. Cheng, W. Cheng, G. Jiang, and G. Cottrell. A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971, 2017.
 [22] L. Rabiner and R. Schafer. Digital processing of speech signals. Prentice Hall, 1978.
 [23] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.
 [24] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. NIPS, pages 802–810, 2015.
 [25] I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.

 [26] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, pages 4278–4284, 2017.
 [27] T. Van Gestel, J. Suykens, D. Baestaens, A. Lambrechts, G. Lanckriet, B. Vandaele, B. De Moor, and J. Vandewalle. Financial time series prediction using least squares support vector machines within the evidence framework. IEEE Transactions on Neural Networks, 12(4):809–821, 2001.
 [28] P. Whittle. Hypothesis Testing in Time Series Analysis. PhD thesis, 1951.
 [29] Y. Wu, J. M. Hernández-Lobato, and Z. Ghahramani. Dynamic covariance models for multivariate financial time series. ICML, pages 558–566, 2013.
 [30] J. G. Zilly, R. K. Srivastava, J. Koutník, and J. Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.