Temporal Pattern Attention for Multivariate Time Series Forecasting

09/12/2018 ∙ by Shun-Yao Shih, et al. ∙ National Taiwan University

Forecasting multivariate time series data, such as electricity consumption, solar power production, and polyphonic piano pieces, has numerous valuable applications. However, complex and non-linear interdependencies between time steps and series complicate the task. To obtain accurate predictions, it is crucial to model long-term dependencies in time series data, which can be achieved to a good extent by a recurrent neural network (RNN) with an attention mechanism. A typical attention mechanism reviews the information at each previous time step and selects the relevant information to help generate the outputs, but it fails to capture temporal patterns across multiple time steps. In this paper, we propose to use a set of filters to extract time-invariant temporal patterns, which is similar to transforming time series data into its "frequency domain". We then propose a novel attention mechanism to select relevant time series and use their "frequency domain" information for forecasting. We applied the proposed model to several real-world tasks and achieved state-of-the-art performance in all of them with only one exception. We also show that, to some degree, the learned filters play the role of bases in the discrete Fourier transform.




1 Introduction

In modern life, time series data are everywhere. People observe evolving variables generated by sensors over discrete time steps and organize them into time series data. For example, household electricity consumption, road occupancy rate, currency exchange rate, solar power production, and even music notes can all be considered time series data. In most cases, the collected data are multivariate time series (MTS) data, such as the electricity consumption of multiple clients tracked by an electric power company. There may exist complex dynamic interdependencies between different series that are significant but difficult to capture and analyze.

Humans are often interested in forecasting the future based on historical data. The better the interdependencies among different series are modeled, the more accurate the forecast can be. For instance, the price of crude oil heavily influences the price of gasoline but has less influence on the price of lumber, as shown in Figure 1 (source: https://www.eia.gov and https://www.investing.com). Thus, knowing that gasoline is produced from crude oil and lumber is not, we can use the price of crude oil to predict the price of gasoline.

Figure 1: Comparison of historical price of crude oil, gasoline, and lumber. Units are omitted and scales are normalized for simplicity.

In machine learning, we want the model to automatically learn the interdependencies from data. Machine learning has already been applied to time series analysis for both classification and forecasting [G. Zhang and Hu.1998, Zhang2003, Lai et al.2018, Qin et al.2017]. In the classification problem, the machine learns to assign a label to a time series, such as determining a patient’s diagnostic category by reading values from medical sensors. In forecasting, the machine needs to predict future time series based on data observed in the past. For example, precipitation in the coming days, weeks, or months can be forecast from historical measurements. The further ahead we try to forecast, the harder it is.

When it comes to MTS forecasting using deep learning, the recurrent neural network (RNN) [David E Rumelhart and Williams1986, Werbos1990, Elman1990] is a natural candidate. However, one disadvantage of using RNNs in time series analysis is their weakness in managing long-term dependencies, e.g. a yearly pattern in a daily recorded sequence [Kyunghyun Cho and Bengio.2014]. The attention mechanism [Luong, Pham, and Manning2015, Bahdanau, Cho, and Bengio2015], originally used in encoder-decoder [Sutskever, Vinyals, and Le2014] networks, alleviates this problem to some extent and thus boosts the effectiveness of RNNs [Lai et al.2018].

In this paper, we propose a new attention mechanism, temporal pattern attention, for MTS forecasting. A typical attention mechanism identifies the time steps relevant to the prediction and extracts information from those time steps, which is an obvious limitation for MTS prediction. Consider the example in Figure 1. To predict the price of gasoline, the machine has to learn to focus on “crude oil” and ignore “lumber”. In temporal pattern attention, instead of selecting the relevant time steps as typical attention does, the machine learns to select the relevant variables.

In addition, time series data often entail noticeable periodic temporal patterns, which are critical for prediction. However, periodic patterns spanning multiple time steps are difficult for a typical attention mechanism to identify, since it usually focuses on only a few time steps. In temporal pattern attention, we introduce a convolutional neural network (CNN) [LeCun and Bengio.1995, A. Krizhevsky and Hinton.2012] to extract temporal pattern information from each individual variable.

The main contributions of this paper are summarized as follows:

  • We introduce a new concept in attention mechanisms: selecting the relevant variables instead of time steps. The method is simple and generally applicable to RNNs.

  • The learned CNN filters in our attention demonstrate interesting and interpretable behavior. They play a role similar to the bases in the discrete Fourier transform (DFT) and extract “frequency domain” information.

  • According to the experimental results on real-world data ranging from periodic and partially linear to non-periodic and non-linear tasks, we show that our attention achieved state-of-the-art results across multiple datasets.

The remainder of this paper is organized as follows. We will report related work in Section 2 and describe background knowledge in Section 3. Then, our proposed attention will be detailed in Section 4, while the experimental results and analysis will be presented in Section 5. Finally, we will conclude our paper in Section 6.

2 Related Work

The most renowned model for linear univariate time series forecasting is the autoregressive integrated moving average (ARIMA) [G. E. Box and Ljung2015], which encompasses other autoregressive time series models, including autoregression (AR), moving average (MA), and autoregressive moving average (ARMA). Additionally, linear support vector regression (SVR) [Cao and Tay.2003, Kim.2003] treats forecasting as a typical regression problem with time-varying parameters. However, these models are mostly limited to linear univariate time series and do not scale well to MTS. To forecast MTS data, vector autoregression (VAR), a generalization of AR-based models, was proposed. VAR is probably the most well-known model in MTS forecasting. Nevertheless, neither AR-based nor VAR-based models are capable of capturing non-linearity. For that reason, substantial effort has been put into non-linear models for time series forecasting based on kernel methods [Chen, Wang, and Harris2008], ensembles [Bouchachia and Bouchachia2008], or Gaussian processes [Frigola and Rasmussen2014]. Still, these approaches apply predetermined non-linearities and may fail to recognize different forms of non-linearity for different MTS.

Recently, deep neural networks have received a great amount of attention due to their ability to capture non-linear interdependencies. Two variants of RNN, namely long short-term memory (LSTM) [Hochreiter and Schmidhuber1997] and gated recurrent unit (GRU) [Cho et al.2014], have shown promising results in several NLP tasks and have also been employed for MTS forecasting. Previous work in this area ranges from naive RNNs [J. Connor and Martin.1991], through hybrid models that combine ARIMA and multilayer perceptrons [G. Zhang and Hu.1998, Zhang2003, Jain and Kumar.2007], to the more recent Dynamic Boltzmann Machine with RNN [Dasgupta and Osogami2017]. Although these models can be applied to MTS, they mainly target univariate or bivariate time series.

Figure 2: Overview of our attention mechanism

To the best of our knowledge, the Long- and Short-term Time-series Network (LSTNet) [Lai et al.2018] is the first model designed specifically for MTS forecasting with up to hundreds of evolving variables. In LSTNet, a CNN is used to capture short-term patterns, whereas an LSTM or GRU is responsible for memorizing relatively long-term patterns. In practice, however, LSTM and GRU are unable to memorize very long-term interdependencies due to training instability and the vanishing gradient problem. Thus, LSTNet adds either a recurrent-skip layer or a typical attention mechanism to deal with this concern. Traditional autoregression is also part of the overall model and helps to tackle the scale insensitivity of neural networks. Nonetheless, LSTNet has two major shortcomings compared to our attention: (1) the skip length of the recurrent-skip layer has to be predetermined by hand, whereas our attention learns the periodic patterns by itself, and (2) the whole LSTNet model is specifically designed for MTS data with strong periodic patterns, whereas our attention is simple and adaptable to various datasets, even non-periodic and non-linear ones, as shown in our experiments.

3 Preliminaries

In this section, we briefly introduce two essential modules related to our proposed model: one is RNN, and the other is typical attention mechanism.

3.1 Recurrent Neural Network (RNN)

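Given a sequence of information $\{x_1, x_2, \dots, x_t\}$, where $x_i \in \mathbb{R}^n$, an RNN generally defines a recurrent function, $F$, and calculates $h_t \in \mathbb{R}^m$ for each time step, $t$, as follows:

$$h_t = F(h_{t-1}, x_t)$$

where the implementation of the function $F$ depends on what kind of RNN cell is used.

The long short-term memory (LSTM) [Hochreiter and Schmidhuber1997] cell is widely used and has a slightly different recurrent function:

$$h_t, c_t = F(h_{t-1}, c_{t-1}, x_t)$$

and the function is defined by the following equations:

$$
\begin{aligned}
i_t &= \mathrm{sigmoid}(W_{x_i} x_t + W_{h_i} h_{t-1}) \\
f_t &= \mathrm{sigmoid}(W_{x_f} x_t + W_{h_f} h_{t-1}) \\
o_t &= \mathrm{sigmoid}(W_{x_o} x_t + W_{h_o} h_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{x_g} x_t + W_{h_g} h_{t-1}) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $i_t$, $f_t$, $o_t$, and $c_t \in \mathbb{R}^m$, $W_{x_i}$, $W_{x_f}$, $W_{x_o}$, and $W_{x_g} \in \mathbb{R}^{m \times n}$, $W_{h_i}$, $W_{h_f}$, $W_{h_o}$, and $W_{h_g} \in \mathbb{R}^{m \times m}$, and $\odot$ means element-wise multiplication.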
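The LSTM recurrence can be illustrated with a small, dependency-free sketch. It assumes a bias-free cell with an input gate, forget gate, output gate, and candidate update; the helper names (`matvec`, `lstm_step`, `init_params`), the random initialization, and the toy dimensions are illustrative choices, not part of the paper.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_step(x_t, h_prev, c_prev, params):
    """One bias-free LSTM step; returns the new hidden and cell states."""
    def gate(name, activation):
        pre = [a + b for a, b in zip(matvec(params["Wx" + name], x_t),
                                     matvec(params["Wh" + name], h_prev))]
        return [activation(p) for p in pre]

    i_t = gate("i", sigmoid)      # input gate
    f_t = gate("f", sigmoid)      # forget gate
    o_t = gate("o", sigmoid)      # output gate
    g_t = gate("g", math.tanh)    # candidate cell update
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def init_params(n, m, seed=0):
    """Hypothetical initializer: small uniform random weights."""
    rng = random.Random(seed)
    params = {}
    for name in "ifog":
        params["Wx" + name] = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(m)]
        params["Wh" + name] = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(m)]
    return params

n, m = 3, 4  # input size and hidden size (toy values)
params = init_params(n, m)
h, c = [0.0] * m, [0.0] * m
sequence = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
print(h)
```

Because the hidden state is an output gate in (0, 1) times a tanh, every entry of `h` stays strictly between -1 and 1 regardless of the weights.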

3.2 Typical Attention Mechanism

In the typical attention mechanism [Luong, Pham, and Manning2015, Bahdanau, Cho, and Bengio2015] on an RNN, given the previous states $H = \{h_1, h_2, \dots, h_{t-1}\}$, a context vector $v_t$ is extracted from the previous states. $v_t$ is a weighted sum of each column $h_i$ in $H$, which represents the information relevant to the current time step. $v_t$ is further integrated with the present state $h_t$ to obtain the prediction.

Assume there is a scoring function $f: \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ which computes the relevance between its input vectors. Formally, we have the following formula to get the context vector $v_t$:

$$\alpha_i = \frac{\exp(f(h_i, h_t))}{\sum_{j=1}^{t-1} \exp(f(h_j, h_t))}, \qquad v_t = \sum_{i=1}^{t-1} \alpha_i h_i$$
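A minimal sketch of this typical attention step follows. The dot-product scoring function is an assumption made only for illustration (the text leaves the scoring function generic), and the function names are hypothetical.

```python
import math

def dot_score(h_i, h_t):
    # Assumed scoring function: plain dot product.
    return sum(a * b for a, b in zip(h_i, h_t))

def typical_attention(prev_states, h_t, score=dot_score):
    """Softmax over the previous time steps, then a weighted sum of the states."""
    scores = [score(h_i, h_t) for h_i in prev_states]
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    m = len(h_t)
    context = [sum(a * h_i[d] for a, h_i in zip(alphas, prev_states)) for d in range(m)]
    return alphas, context

prev_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one hidden state per past time step
h_t = [2.0, 0.0]
alphas, v_t = typical_attention(prev_states, h_t)
print(alphas, v_t)
```

The weights sum to one across time steps, so the context vector is an average over whole hidden states; no individual dimension of a state can be suppressed, which is the limitation the next section addresses.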


4 Temporal Pattern Attention

While previous work mainly focuses on changing the network architecture of attention-based models under different settings to improve performance on different tasks, we believe there is a critical defect in applying typical attention mechanisms to RNNs for MTS forecasting. Typical attention aims at selecting the relevant information for the current time step, and the context vector $v_t$ is the weighted sum of the column vectors of the previous RNN hidden states, $H = \{h_1, h_2, \dots, h_{t-1}\}$. This design suits tasks in which each time step contains a single piece of information; for example, in NLP, each time step corresponds to a single word. If there are multiple variables in each time step, it fails to ignore the variables that are noisy for forecasting. Moreover, since typical attention averages the information across multiple time steps, it fails to detect the temporal patterns useful for forecasting.

An overview of the proposed model is shown in Figure 2. Given the previous RNN hidden states $H \in \mathbb{R}^{m \times (t-1)}$, the proposed attention attends to the row vectors of $H$. The attention weights on the rows select the variables that are helpful for forecasting. Since the context vector $v_t$ is now a weighted sum of row vectors containing information across multiple time steps, it captures temporal information.

4.1 Problem Formulation

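In the MTS forecasting task, given an MTS, $X = \{x_1, x_2, \dots, x_{t-1}\}$, where $x_i \in \mathbb{R}^n$ represents the observed value at time $i$, the task is to predict the value of $x_{t-1+\Delta}$, where $\Delta$ is a fixed horizon that depends on the task. We denote the corresponding prediction as $y_{t-1+\Delta}$, and the ground-truth value as $\hat{y}_{t-1+\Delta} = x_{t-1+\Delta}$. Moreover, for some tasks, we only use $\{x_{t-w}, x_{t-w+1}, \dots, x_{t-1}\}$ to predict $x_{t-1+\Delta}$, where $w$ is the window size.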
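The windowed formulation can be made concrete with a short sketch that slices a series into (input window, target) pairs. The function name `make_samples` and the toy series are hypothetical; the target of each window is the observation `horizon` steps after the window's last step.

```python
def make_samples(series, window, horizon):
    """Slice a multivariate series into (input window, target) pairs.

    series:  list of time steps, each a list with one value per variable.
    window:  number of consecutive steps fed to the model.
    horizon: how many steps after the window's last step the target lies.
    """
    inputs, targets = [], []
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        end = start + window                  # exclusive end of the input window
        inputs.append(series[start:end])
        targets.append(series[end - 1 + horizon])
    return inputs, targets

# Toy series with two variables and ten time steps.
series = [[t, 10 * t] for t in range(10)]
X, Y = make_samples(series, window=4, horizon=3)
print(len(X), X[0], Y[0])
```

With ten steps, a window of 4, and a horizon of 3, this yields four samples; the first window covers steps 0 to 3 and its target is step 6.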

4.2 Temporal Pattern Detection by CNN

Motivated by the huge success of CNNs and their ability to capture different important patterns of signals, we further introduce a CNN to enhance the learning ability of the model and apply CNN filters to the row vectors of $H$. Specifically, we have $k$ filters $C_i \in \mathbb{R}^{1 \times T}$, where $T$ is the maximum length we want to pay attention to. When unspecified, we assume $T = w$. After the convolutional operations, we have $H^C \in \mathbb{R}^{m \times k}$, where $H^C_{i,j}$ represents the convolutional value of the $i$-th row vector and the $j$-th filter. Formally, this operation is given by the following equation:

$$H^C_{i,j} = \sum_{l=1}^{w} H_{i,(t-w-1+l)} \times C_{j,T-w+l}$$
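When the filter length equals the window size, each filter reduces one row of the hidden-state matrix to a single number, so the convolution is just a dot product per (row, filter) pair. The sketch below assumes that case; the function name and the two hand-written filters are illustrative stand-ins for learned filters.

```python
def temporal_patterns(H, filters):
    """Apply each filter to each row of H (filter length equals the window size).

    H:       m rows, each holding one hidden dimension over w time steps.
    filters: k filters, each of length w.
    Returns an m x k matrix of filter responses.
    """
    return [[sum(h * c for h, c in zip(row, filt)) for filt in filters] for row in H]

H = [
    [1.0, 0.0, 1.0, 0.0],   # a row that alternates over time
    [1.0, 1.0, 1.0, 1.0],   # a row that stays constant
]
filters = [
    [1.0, -1.0, 1.0, -1.0],      # responds to alternation
    [0.25, 0.25, 0.25, 0.25],    # responds to the average level
]
HC = temporal_patterns(H, filters)
print(HC)  # [[2.0, 0.5], [0.0, 1.0]]
```

The alternating row excites the first filter and the constant row does not, which is the sense in which each filter detects a temporal pattern that is independent of where in absolute time the window sits.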


4.3 Proposed Attention

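We calculate $v_t$ as a weighted sum of the row vectors of $H^C$. The scoring function $f: \mathbb{R}^k \times \mathbb{R}^m \to \mathbb{R}$ to evaluate the relevance is defined as below.

$$f(H^C_i, h_t) = (H^C_i)^\top W_a h_t$$

where $H^C_i$ is the $i$-th row of $H^C$, and $W_a \in \mathbb{R}^{k \times m}$. The attention weight $\alpha_i$ is obtained as below:

$$\alpha_i = \mathrm{sigmoid}(f(H^C_i, h_t))$$

Note that we use the sigmoid activation function instead of softmax since we expect more than one variable to be useful for forecasting.

Finally, the row vectors of $H^C$ are weighted by $\alpha_i$ to obtain the context vector $v_t \in \mathbb{R}^k$,

$$v_t = \sum_{i=1}^{m} \alpha_i H^C_i$$

Then we integrate $v_t$ and $h_t$ to make the final prediction:

$$h'_t = W_h h_t + W_v v_t, \qquad y_{t-1+\Delta} = W_{h'} h'_t$$

where $h_t$, $h'_t \in \mathbb{R}^m$, $W_h \in \mathbb{R}^{m \times m}$, $W_v \in \mathbb{R}^{m \times k}$, and $W_{h'} \in \mathbb{R}^{n \times m}$ and $y_{t-1+\Delta} \in \mathbb{R}^n$.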
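The whole attention step can be sketched end to end as follows, assuming the filter responses are already available as an m x k matrix `HC` and that the score of each row is bilinear in that row and the current hidden state. The identity weight matrices and the function name are placeholders chosen for readability, not trained values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def temporal_pattern_attention(HC, h_t, W_a, W_h, W_v, W_out):
    """HC: m x k filter responses, h_t: length m, W_a: k x m,
    W_h: m x m, W_v: m x k, W_out: n x m."""
    Wa_h = matvec(W_a, h_t)                                   # length k
    # One sigmoid weight per row of HC; the weights need not sum to one.
    alphas = [sigmoid(sum(a * b for a, b in zip(row, Wa_h))) for row in HC]
    k = len(HC[0])
    v_t = [sum(alpha * row[j] for alpha, row in zip(alphas, HC)) for j in range(k)]
    h_prime = [a + b for a, b in zip(matvec(W_h, h_t), matvec(W_v, v_t))]
    y = matvec(W_out, h_prime)                                # length n
    return alphas, v_t, y

identity = [[1.0, 0.0], [0.0, 1.0]]       # placeholder weights (m = k = 2)
HC = [[2.0, 0.5], [0.0, 1.0]]
h_t = [1.0, -1.0]
alphas, v_t, y = temporal_pattern_attention(
    HC, h_t, W_a=identity, W_h=identity, W_v=identity, W_out=[[1.0, 1.0]])
print(alphas, v_t, y)
```

Since each weight comes from an independent sigmoid, several rows can be switched on at once, which matches the stated motivation for preferring sigmoid over softmax.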

5 Experiments and Analysis

In this section, we first describe the datasets that we will conduct our experiments on. Next, we show our experimental results and visualization of prediction against LSTNet. Then, ablation study is discussed. Finally, we analyze how the CNN filters act like the bases in DFT.

5.1 Datasets

To test the effectiveness and generalization of our attention mechanism, we use two dissimilar groups of datasets: typical MTS datasets and polyphonic music datasets.

The typical MTS datasets are published by LSTNet, and there are four of them: Solar-Energy, Traffic, Electricity, and Exchange Rate.

These datasets are real-world data that contain both linear and non-linear interdependencies. Moreover, three of the four datasets, namely Solar-Energy, Traffic, and Electricity, exhibit strong periodic patterns, which reflect the daily or weekly routines of human activities. Following the authors of LSTNet, all datasets have been split into training, validation, and testing sets in chronological order.

On the other hand, the polyphonic music datasets, MuseData and LPD-5-Cleansed, are much more complicated in the sense that no apparent linearity or repetitive patterns exist.

To train models on these datasets, we encode each played note as 1 and everything else as 0, and set one beat as one time step, as shown in Table 1. Given the played notes of 4 bars consisting of 16 beats, the task is to predict whether each pitch is played at the next time step. For the training, validation, and testing sets, we follow the original separation of MuseData, which is divided into 524 training pieces, 135 validation pieces, and 124 testing pieces. LPD-5-Cleansed, on the other hand, was not split in previous work [Hao-Wen Dong and Yang2018, Raffel2016], so we randomly split it into training, validation, and testing sets.
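The binary encoding of music can be sketched as a piano roll with one row per beat and one column per pitch. The `(pitch, start_beat, end_beat)` note format and the function name are assumptions made for this illustration; the 128 pitches match the number of variables listed for the music datasets in Table 1.

```python
def piano_roll(notes, num_beats, num_pitches=128):
    """Build a binary matrix: roll[beat][pitch] is 1 while that pitch sounds.

    notes: list of (pitch, start_beat, end_beat) tuples, end exclusive
           (a hypothetical input format).
    """
    roll = [[0] * num_pitches for _ in range(num_beats)]
    for pitch, start, end in notes:
        for beat in range(start, min(end, num_beats)):
            roll[beat][pitch] = 1
    return roll

# A C major chord spread over four beats: C4 held, E4 then G4.
notes = [(60, 0, 4), (64, 0, 2), (67, 2, 4)]
roll = piano_roll(notes, num_beats=4)
print([sum(step) for step in roll])  # number of active pitches per beat
```

Each row of `roll` is one time step of a 128-variable binary series, so the forecasting task becomes predicting the next row from the preceding 16 rows.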

The statistics of both typical MTS datasets and polyphonic music datasets are summarized in Table 1.

| Dataset | L | D | S |
|---|---|---|---|
| Solar-Energy | 52,560 | 137 | 10 minutes |
| Traffic | 17,544 | 862 | 1 hour |
| Electricity | 26,304 | 321 | 1 hour |
| Exchange Rate | 7,588 | 8 | 1 day |
| MuseData | 216 to 102,552 | 128 | 1 beat |
| LPD-5-Cleansed | 1,072 to 1,917,952 | 128 | 1 beat |

Table 1: Statistics of all datasets, where L is the length of the time series, D is the number of evolving variables, and S is the sampling spacing. MuseData and LPD-5-Cleansed have varying time series lengths, since the length of music pieces varies.
RSE (column headers give dataset and horizon)

| Method | Solar-Energy 3 | Solar-Energy 6 | Solar-Energy 12 | Solar-Energy 24 | Traffic 3 | Traffic 6 | Traffic 12 | Traffic 24 | Electricity 3 | Electricity 6 | Electricity 12 | Electricity 24 | Exchange Rate 3 | Exchange Rate 6 | Exchange Rate 12 | Exchange Rate 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AR | 0.2435 | 0.3790 | 0.5911 | 0.8699 | 0.5991 | 0.6218 | 0.6252 | 0.6293 | 0.0995 | 0.1035 | 0.1050 | 0.1054 | 0.0228 | 0.0279 | 0.0353 | 0.0445 |
| LRidge | 0.2019 | 0.2954 | 0.4832 | 0.7287 | 0.5833 | 0.5920 | 0.6148 | 0.6025 | 0.1467 | 0.1419 | 0.2129 | 0.1280 | 0.0184 | 0.0274 | 0.0419 | 0.0675 |
| LSVR | 0.2021 | 0.2999 | 0.4846 | 0.7300 | 0.5740 | 0.6580 | 0.7714 | 0.5909 | 0.1523 | 0.1372 | 0.1333 | 0.1180 | 0.0189 | 0.0284 | 0.0425 | 0.0662 |
| GP | 0.2259 | 0.3286 | 0.5200 | 0.7973 | 0.6082 | 0.6772 | 0.6406 | 0.5995 | 0.1500 | 0.1907 | 0.1621 | 0.1273 | 0.0239 | 0.0272 | 0.0394 | 0.0580 |
| LSTNet-Skip | 0.1843 | 0.2559 | 0.3254 | 0.4643 | 0.4777 | 0.4893 | 0.4950 | 0.4973 | 0.0864 | 0.0931 | 0.1007 | 0.1007 | 0.0226 | 0.0280 | 0.0356 | 0.0449 |
| LSTNet-Attn | 0.1816 | 0.2538 | 0.3466 | 0.4403 | 0.4897 | 0.4973 | 0.5173 | 0.5300 | 0.0868 | 0.0953 | 0.0984 | 0.1059 | 0.0276 | 0.0321 | 0.0448 | 0.0590 |
| Our Model | 0.1803 ± 0.0008 | 0.2347 ± 0.0017 | 0.3234 ± 0.0044 | 0.4389 ± 0.0084 | 0.4487 ± 0.0180 | 0.4658 ± 0.0053 | 0.4641 ± 0.0034 | 0.4765 ± 0.0068 | 0.0823 ± 0.0012 | 0.0916 ± 0.0018 | 0.0964 ± 0.0015 | 0.1006 ± 0.0015 | 0.0174 ± 0.0001 | 0.0243 ± 0.0003 | 0.0345 ± 0.0010 | 0.0444 ± 0.0006 |

CORR (column headers give dataset and horizon)

| Method | Solar-Energy 3 | Solar-Energy 6 | Solar-Energy 12 | Solar-Energy 24 | Traffic 3 | Traffic 6 | Traffic 12 | Traffic 24 | Electricity 3 | Electricity 6 | Electricity 12 | Electricity 24 | Exchange Rate 3 | Exchange Rate 6 | Exchange Rate 12 | Exchange Rate 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AR | 0.9710 | 0.9263 | 0.8107 | 0.5314 | 0.7752 | 0.7568 | 0.7544 | 0.7519 | 0.8845 | 0.8632 | 0.8591 | 0.8595 | 0.9734 | 0.9656 | 0.9526 | 0.9357 |
| LRidge | 0.9807 | 0.9568 | 0.8765 | 0.6803 | 0.8038 | 0.8051 | 0.7879 | 0.7862 | 0.8890 | 0.8594 | 0.8003 | 0.8806 | 0.9788 | 0.9722 | 0.9543 | 0.9305 |
| LSVR | 0.9807 | 0.9562 | 0.8764 | 0.6789 | 0.7993 | 0.7267 | 0.6711 | 0.7850 | 0.8888 | 0.8861 | 0.8961 | 0.8891 | 0.9782 | 0.9697 | 0.9546 | 0.9370 |
| GP | 0.9751 | 0.9448 | 0.8518 | 0.5971 | 0.7831 | 0.7406 | 0.7671 | 0.7909 | 0.8670 | 0.8334 | 0.8394 | 0.8818 | 0.8713 | 0.8193 | 0.8484 | 0.8278 |
| LSTNet-Skip | 0.9843 | 0.9690 | 0.9467 | 0.8870 | 0.8721 | 0.8690 | 0.8614 | 0.8588 | 0.9283 | 0.9135 | 0.9077 | 0.9119 | 0.9735 | 0.9658 | 0.9511 | 0.9354 |
| LSTNet-Attn | 0.9848 | 0.9696 | 0.9397 | 0.8995 | 0.8704 | 0.8669 | 0.8540 | 0.8429 | 0.9243 | 0.9095 | 0.9030 | 0.9025 | 0.9717 | 0.9656 | 0.9499 | 0.9339 |
| Our Model | 0.9850 ± 0.0001 | 0.9742 ± 0.0003 | 0.9487 ± 0.0023 | 0.9081 ± 0.0151 | 0.8812 ± 0.0089 | 0.8717 ± 0.0034 | 0.8717 ± 0.0021 | 0.8639 ± 0.0030 | 0.9429 ± 0.0004 | 0.9337 ± 0.0011 | 0.9250 ± 0.0013 | 0.9133 ± 0.0008 | 0.9790 ± 0.0003 | 0.9709 ± 0.0003 | 0.9564 ± 0.0005 | 0.9381 ± 0.0008 |

Table 2: Results on typical MTS datasets using RSE (upper) and CORR (lower) as the metric. We report the mean and standard deviation of our model over ten runs. All numbers besides the results of our model are taken from the LSTNet paper [Lai et al.2018].

5.2 Methods for Comparison

We compared our model against six methods on the typical MTS datasets. AR, LRidge, LSVR, and GP are traditional baseline methods, while LSTNet-Skip and LSTNet-Attn are state-of-the-art methods based on deep neural networks.

However, since non-linearity and the lack of periodic patterns make both traditional baseline methods and LSTNet unsuitable for polyphonic music datasets, we use LSTM and LSTM with Luong attention as the baseline models to benchmark the improvement of our model on polyphonic music datasets:

  • LSTM: RNN cell as introduced in Section 3.

  • LSTM with Luong attention: LSTM with an attention mechanism whose scoring function is $f(h_i, h_t) = (h_i)^\top W h_t$, where $W \in \mathbb{R}^{m \times m}$ [Luong, Pham, and Manning2015].

5.3 Model Setup and Parameter Settings

For all experiments, we use LSTM as our RNN cell to build our model, and fix the number of CNN filters at 32. Also, inspired by LSTNet, we include an autoregression component in our model when training and testing on typical MTS datasets.

For typical MTS datasets, we performed a grid search over tunable parameters, just like LSTNet. Specifically, on Solar-Energy, Traffic, and Electricity, we searched over the window size, the number of hidden units, and the step of the exponential learning rate decay (with rate 0.995). On Exchange Rate, these three parameters are fixed at 30, 6, and 120, respectively. Two types of data normalization are also treated as part of the grid search: one normalizes each time series by its own maximum value, and the other normalizes every time series by the maximum value over the whole dataset. Lastly, we use the absolute loss function and the Adam optimizer, with one learning rate on Solar-Energy, Traffic, and Electricity and another on Exchange Rate. For all other methods in the comparison, the parameters are identical to those reported in the LSTNet paper [Lai et al.2018].
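The two normalization schemes differ only in which maximum is used as the divisor. The sketch below is an illustration under two assumptions that are not stated in the text: the maximum is taken over absolute values, and an all-zero series is mapped to zeros. The function names are hypothetical.

```python
def normalize_per_series(data):
    """Divide each series (column) by its own maximum absolute value."""
    n = len(data[0])
    maxima = [max(abs(step[j]) for step in data) for j in range(n)]
    return [[step[j] / maxima[j] if maxima[j] else 0.0 for j in range(n)]
            for step in data]

def normalize_global(data):
    """Divide every value by the maximum absolute value over the whole dataset."""
    peak = max(abs(v) for step in data for v in step)
    return [[v / peak if peak else 0.0 for v in step] for step in data]

# Rows are time steps; the two columns are series on very different scales.
data = [[1.0, 100.0], [2.0, 50.0], [4.0, 200.0]]
per_series = normalize_per_series(data)
global_norm = normalize_global(data)
print(per_series)
print(global_norm)
```

Per-series scaling puts every variable on a comparable range, while global scaling preserves the relative magnitudes between variables, so which one works better plausibly depends on whether those relative magnitudes carry signal.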

For the models used on polyphonic music datasets, including the baselines and our model in the following subsections, we used 3 layers for all RNNs, the same as [Chuan and Herremans2018], and fixed the number of trainable parameters at a common budget by adjusting the number of LSTM units, so that the models can be compared fairly. The optimizer is Adam, and the loss function is cross entropy.

5.4 Evaluation Metrics

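On typical MTS datasets, since we compare our model with LSTNet, we follow the same evaluation metrics. The first metric is the root relative squared error (RSE), which is defined as

$$\mathrm{RSE} = \frac{\sqrt{\sum_{t=t_0}^{t_1} \sum_{i=1}^{n} (y_{t,i} - \hat{y}_{t,i})^2}}{\sqrt{\sum_{t=t_0}^{t_1} \sum_{i=1}^{n} \left(\hat{y}_{t,i} - \overline{\hat{y}_{t_0:t_1,1:n}}\right)^2}}$$

and the other metric is the empirical correlation coefficient (CORR):

$$\mathrm{CORR} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{t=t_0}^{t_1} \left(y_{t,i} - \overline{y_{t_0:t_1,i}}\right)\left(\hat{y}_{t,i} - \overline{\hat{y}_{t_0:t_1,i}}\right)}{\sqrt{\sum_{t=t_0}^{t_1} \left(y_{t,i} - \overline{y_{t_0:t_1,i}}\right)^2 \sum_{t=t_0}^{t_1} \left(\hat{y}_{t,i} - \overline{\hat{y}_{t_0:t_1,i}}\right)^2}}$$

where the prediction $y$ and the ground truth $\hat{y}$ are defined in Section 4.1, $\hat{y}_t$ for $t \in [t_0, t_1]$ is the label of the testing data, and $\overline{y}$ denotes the mean of set $y$. RSE is a normalized version of the root mean square error (RMSE) that disregards data scale. For RSE, lower is better, whereas for CORR, higher is better.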
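Both metrics can be sketched in a few lines. This version assumes predictions and labels are given as lists of time steps with one value per variable, normalizes RSE by the spread of the labels around their overall mean, and averages the per-variable Pearson correlation for CORR; it does not guard against constant series, which would cause a division by zero.

```python
import math

def rse(pred, truth):
    """Root relative squared error over all time steps and variables."""
    flat_truth = [v for step in truth for v in step]
    mean_truth = sum(flat_truth) / len(flat_truth)
    num = sum((p - y) ** 2 for ps, ys in zip(pred, truth) for p, y in zip(ps, ys))
    den = sum((y - mean_truth) ** 2 for y in flat_truth)
    return math.sqrt(num) / math.sqrt(den)

def corr(pred, truth):
    """Pearson correlation per variable, averaged over variables."""
    n = len(truth[0])
    total = 0.0
    for j in range(n):
        p = [step[j] for step in pred]
        y = [step[j] for step in truth]
        mp, my = sum(p) / len(p), sum(y) / len(y)
        cov = sum((a - mp) * (b - my) for a, b in zip(p, y))
        var = sum((a - mp) ** 2 for a in p) * sum((b - my) ** 2 for b in y)
        total += cov / math.sqrt(var)
    return total / n

truth = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
shifted = [[v + 0.5 for v in step] for step in truth]   # constant offset
print(rse(truth, truth), rse(shifted, truth), corr(shifted, truth))
```

The shifted prediction shows why the two metrics are complementary: a constant offset leaves CORR at 1 but produces a non-zero RSE.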

To decide which model is better on polyphonic music datasets, we use validation loss (negative log-likelihood), precision, recall and F1 score as measurements which are widely used in previous polyphonic music generation works [Nicolas Boulanger-Lewandowski and Vincent2012, Chuan and Herremans2018].
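Precision, recall, and F1 on binary note predictions can be computed as sketched below. Counting over every (time step, pitch) cell at once is an assumption made for this example; the text does not say how the scores are aggregated.

```python
def precision_recall_f1(pred, truth):
    """Precision, recall, and F1 over all (time step, pitch) cells of binary rolls."""
    tp = fp = fn = 0
    for p_step, t_step in zip(pred, truth):
        for p, t in zip(p_step, t_step):
            if p and t:
                tp += 1          # note predicted and played
            elif p and not t:
                fp += 1          # note predicted but not played
            elif t and not p:
                fn += 1          # note played but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

truth = [[1, 0, 1, 0], [0, 1, 1, 0]]
pred = [[1, 0, 0, 0], [0, 1, 1, 1]]
precision, recall, f1 = precision_recall_f1(pred, truth)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

F1 is the harmonic mean of precision and recall; as a sanity check on the reported numbers, the harmonic mean of 0.84009 and 0.67657 is about 0.7495, in line with the first row of Table 3.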

Figure 3: Side by side comparison of prediction between our model and LSTNet-Skip on testing set of Traffic with 3-hour horizon. Our model clearly forecasts better around the flat line after the peak and around the valley.

5.5 Results on Typical MTS Datasets

On typical MTS datasets, we choose the best model on the validation set, using RSE/CORR as the metric, and evaluate it on the testing set. The numerical results are tabulated in Table 2, where the metric of the upper table is RSE and the metric of the lower table is CORR. All numbers besides the results of our model are taken from the LSTNet paper [Lai et al.2018]. Both tables show that our model outperforms all other methods on every dataset, horizon, and metric, with only one exception. These results consistently demonstrate the strength of our model on MTS forecasting.

Compared to LSTNet-Skip and LSTNet-Attn, the previous state-of-the-art methods, our model outperforms both, especially on Traffic and Electricity, which have the largest numbers of evolving variables. Moreover, on Exchange Rate, where no repetitive pattern exists, our model is still the best overall, while LSTNet-Skip and LSTNet-Attn fall behind traditional methods, including AR, LRidge, LSVR, and GP. The single exception is Exchange Rate with a 6-day horizon, where LRidge achieves a higher CORR than our model (0.9722 versus 0.9709 in Table 2). Because linear models are already good enough on this dataset, deep learning adds little. We also visualize and compare the predictions of our model and LSTNet-Skip in Figure 3 as an illustration.

Generally speaking, our model achieved state-of-the-art performance on both periodic and non-periodic MTS datasets.

5.6 Results on Polyphonic Music Datasets

In this subsection, to further verify the efficacy and generalization ability of our model, we conducted experiments on polyphonic music datasets; the results are shown in Figure 4 and Table 3. We compare three RNN models: LSTM, LSTM with Luong attention, and LSTM with the proposed attention. Figure 4 shows the validation loss across training epochs, and in Table 3, we use the models with the lowest validation loss to calculate precision, recall, and F1 score on the testing set.

From the results, we can first verify our claim that typical attention mechanism does not work on such tasks since under similar hyperparameters and trainable weights, LSTM and our model outperform such attention mechanism. Besides, our model also learns more effectively compared to LSTM throughout the learning process and has better performance in terms of precision, recall, and F1 score.

Figure 4: Validation loss under different training epochs on MuseData (left), and LPD-5-Cleansed (right)

| Metric | Precision | Recall | F1 |
|---|---|---|---|
| w/o attention | 0.84009 | 0.67657 | 0.74952 |
| w/ Luong attention | 0.75197 | 0.52839 | 0.62066 |
| w/ our attention | 0.85581 | 0.68889 | 0.76333 |

| Metric | Precision | Recall | F1 |
|---|---|---|---|
| w/o attention | 0.83794 | 0.73041 | 0.78049 |
| w/ Luong attention | 0.83548 | 0.72380 | 0.77564 |
| w/ our attention | 0.83979 | 0.74517 | 0.78966 |

Table 3: Precision, recall, and F1 score of different models on polyphonic music datasets

| Activation | Solar-Energy: position | Solar-Energy: filter | Solar-Energy: w/o CNN | Traffic: position | Traffic: filter | Traffic: w/o CNN | Electricity: position | Electricity: filter | Electricity: w/o CNN | MuseData: position | MuseData: filter | MuseData: w/o CNN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| softmax | 0.4391 | 0.4434 | 0.4489 | 0.4715 | 0.4897 | 0.4770 | 0.1006 | 0.1007 | 0.1011 | 0.04931 | 0.04968 | 0.04902 |
| sigmoid | 0.4389 | 0.4597 | 0.4507 | 0.4765 | 0.4795 | 0.4796 | 0.1006 | 0.1028 | 0.1008 | 0.04878 | 0.04958 | 0.04987 |
| concat | 0.4462 | 0.4404 | 0.4951 | 0.4855 | 0.4774 | 0.4795 | 0.1026 | 0.1035 | 0.1014 | 0.05191 | 0.05167 | 0.05128 |

Table 4: Ablation study. The evaluation measure is RSE for Solar-Energy, Traffic, and Electricity, and negative log-likelihood for MuseData.

5.7 Ablation Study

To verify that the above improvement comes from each component we add rather than from a specific set of hyperparameters, we conduct an ablation study on 4 datasets: Solar-Energy, Traffic, Electricity, and MuseData. There are two main settings: one is how we attend to the hidden states, $H$, of the RNN, and the other is how we integrate the scoring function $f$ into our model, or whether we discard it altogether. First, in our proposed method, we let the model attend to the values of different filters at each position ($H^C_i$); alternatively, we can attend to the values of the same filter at different positions ($(H^C)^\top_i$) or to the row vectors of $H$ ($H_i$). These three approaches correspond to the column headers in Table 4: “position”, “filter”, and “without CNN”. Second, whereas typical attention mechanisms usually apply a softmax function to the output of the scoring function to extract the most relevant information, we use sigmoid as our activation function. Therefore, we compare these two functions. Moreover, concatenating all previous hidden states and letting the model automatically learn which values are important is another possible structure for forecasting. Taking these two groups of settings into consideration, we train models with all combinations of possible structures on these 4 datasets.

On MuseData, the model with the sigmoid activation function and attention on $H^C_i$ (position) is clearly the best, which suggests that our proposed model is reasonably effective for forecasting. Removing any proposed component from our model degrades performance. For example, using softmax instead of sigmoid raises the negative log-likelihood from 0.04878 to 0.04931, and dropping the CNN filters yields a worse model with a negative log-likelihood of 0.04987 (Table 4). Besides, we notice that there is no significant difference between our model and the model using softmax on the first three datasets in Table 4: Solar-Energy, Traffic, and Electricity. This is not surprising given the reason we use sigmoid, as explained in Section 4.3. Originally, we expected the CNN filters to find some basic patterns and the sigmoid function to help the model combine those patterns into the one that helps most. However, these three datasets are strongly periodic, so a small number of basic patterns may be enough for a good prediction. Overall, our model is more general and has stable and competitive results across different datasets.

5.8 Analysis of CNN Filters

The DFT is a variant of the Fourier transform (FT) that handles equally spaced samples of a signal in time. In the field of time series analysis, a long list of works uses the FT or DFT to reveal important characteristics of time series [N.E. Huang and Liu1998, Bloomfield1976]. In our case, since the MTS data are also equally spaced and discrete, we can apply the DFT to analyze them. However, MTS data contain more than one time series, so we average the magnitudes of the frequency components over all time series and arrive at a single frequency-domain representation. We call it the average discrete Fourier transform (avg-DFT). This single frequency-domain representation reveals the prevailing frequency components of the MTS data. For instance, it is reasonable to assume that there is a notable 24-hour oscillation in Figure 3, which is verified by the avg-DFT of the Traffic dataset shown in Figure 5.
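The avg-DFT can be sketched with a direct DFT from the standard library: compute the magnitude spectrum of each series, then average the magnitudes bin by bin. The synthetic two-series example with a shared 24-step period is invented for illustration and is not the Traffic data.

```python
import cmath
import math

def dft_magnitudes(series):
    """Magnitudes of the DFT for frequency bins 0 .. N/2 (direct O(N^2) sum)."""
    N = len(series)
    mags = []
    for k in range(N // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / N)
                    for t, x in enumerate(series))
        mags.append(abs(total))
    return mags

def avg_dft(mts):
    """Average the magnitude spectra of all series (one list of samples per variable)."""
    per_series = [dft_magnitudes(s) for s in mts]
    return [sum(col) / len(col) for col in zip(*per_series)]

N = 96  # four full cycles of a 24-step period
mts = [
    [math.sin(2 * math.pi * t / 24) for t in range(N)],
    [0.5 * math.cos(2 * math.pi * t / 24) + 0.2 * math.sin(2 * math.pi * t / 12)
     for t in range(N)],
]
spectrum = avg_dft(mts)
peak_bin = max(range(1, len(spectrum)), key=spectrum.__getitem__)  # skip the DC bin
peak_period = N / peak_bin
print(peak_bin, peak_period)  # 4 24.0
```

Averaging magnitudes rather than complex coefficients keeps series with different phases from cancelling each other, so a period shared by many variables shows up as a single peak. The same routine can be applied to a set of learned filters by treating each filter as one series.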

Since we expect our CNN filters to learn the temporal patterns in MTS, the prevailing frequency components of the CNN filters should match those of the training MTS data. Hence, we also apply the avg-DFT to the CNN filters trained on Traffic with a 3-hour horizon and plot the result alongside the avg-DFT of the Traffic dataset in Figure 5. Notably, the two curves peak at the same periods most of the time. At the 24-, 12-, 8-, and 6-hour periods, not only is the magnitude of the Traffic dataset at a high point, but the magnitude of the CNN filters also peaks. Moreover, in Figure 6, we show that different CNN filters behave differently. Some specialize in capturing long-term (24-hour) temporal patterns, while others are good at recognizing short-term (8-hour) temporal patterns. As a whole, this suggests that our CNN filters play the role of bases in the DFT. As demonstrated in earlier work on spectral representations for CNNs, the “frequency domain” provides a powerful representation for a CNN to train and model on. Thus, the LSTM is able to rely on the “frequency domain” information extracted by our attention to accurately forecast the future.

Figure 5: Magnitude comparison of (1) the DFT of CNN filters trained on Traffic with a 3-hour horizon, and (2) the Traffic dataset. To make the figure more intuitive, the unit of the horizontal axis is the period.
Figure 6: Two different CNN filters trained on Traffic with a 3-hour horizon, which detect temporal patterns with different periods.

6 Conclusions

In this paper, we focus on the MTS forecasting problem and propose a novel temporal pattern attention that addresses the limitation of typical attention mechanisms on such tasks. We make the attention dimension feature-wise so that the model learns interdependencies among multiple variables not only within the same time step but also across all previous time steps and series. Our experiments strongly support this idea and show that our model achieves state-of-the-art results. In addition, the visualization of the filters supports our motivation in a way that is easier for humans to understand.


  • [A. Krizhevsky and Hinton.2012] A. Krizhevsky, I. S., and Hinton., G. E. 2012. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 1097–1105.
  • [Bahdanau, Cho, and Bengio2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. ICLR.
  • [Bloomfield1976] Bloomfield, P. 1976. Fourier Analysis of Time Series: An Introduction. John Wiley.
  • [Bouchachia and Bouchachia2008] Bouchachia, A., and Bouchachia, S. 2008. Ensemble learning for time series prediction. Proceedings of the 1st International Workshop on Nonlinear Dynamics and Synchronization.
  • [Cao and Tay2003] Cao, L.-J., and Tay, F. E. H. 2003. Support vector machine with adaptive parameters in financial time series forecasting. IEEE Transactions on Neural Networks 1506–1518.
  • [Chen, Wang, and Harris2008] Chen, S.; Wang, X. X.; and Harris, C. J. 2008. NARX-based nonlinear system identification using orthogonal least squares basis hunting. IEEE Transactions on Control Systems 78–84.
  • [Cho et al.2014] Cho, K.; van Merriënboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734. Association for Computational Linguistics.
  • [Chuan and Herremans2018] Chuan, C.-H., and Herremans, D. 2018. Modeling temporal tonal relations in polyphonic music through deep networks with a novel image-based representation.
  • [Dasgupta and Osogami2017] Dasgupta, S., and Osogami, T. 2017. Nonlinear dynamic Boltzmann machines for time-series prediction.
  • [Rumelhart, Hinton, and Williams1986] Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature 533–536.
  • [Elman1990] Elman, J. L. 1990. Finding structure in time. Cognitive science 179–211.
  • [Frigola-Alcalde2015] Frigola-Alcalde, R. 2015. Bayesian time series learning with Gaussian processes. PhD thesis, University of Cambridge.
  • [Frigola and Rasmussen2014] Frigola, R., and Rasmussen, C. E. 2014. Integrated pre-processing for Bayesian nonlinear system identification with Gaussian processes. IEEE Conference on Decision and Control 552–560.
  • [Box et al.2015] Box, G. E. P.; Jenkins, G. M.; Reinsel, G. C.; and Ljung, G. M. 2015. Time series analysis: forecasting and control. John Wiley & Sons.
  • [Zhang, Patuwo, and Hu1998] Zhang, G.; Patuwo, B. E.; and Hu, M. Y. 1998. Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting 35–62.
  • [Dong et al.2018] Dong, H.-W.; Hsiao, W.-Y.; Yang, L.-C.; and Yang, Y.-H. 2018. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
  • [Connor, Atlas, and Martin1991] Connor, J.; Atlas, L. E.; and Martin, D. R. 1991. Recurrent networks and NARMA modeling. Advances in Neural Information Processing Systems 301–308.
  • [Jain and Kumar2007] Jain, A., and Kumar, A. M. 2007. Hybrid neural network models for hydrologic time series forecasting. Applied Soft Computing 7(2):585–592.
  • [Kim2003] Kim, K.-J. 2003. Financial time series forecasting using support vector machines. Neurocomputing 55(1):307–319.
  • [Cho, van Merriënboer, Bahdanau, and Bengio2014] Cho, K.; van Merriënboer, B.; Bahdanau, D.; and Bengio, Y. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
  • [Lai et al.2018] Lai, G.; Chang, W.-C.; Yang, Y.; and Liu, H. 2018. Modeling long- and short-term temporal patterns with deep neural networks. SIGIR 95–104.
  • [LeCun and Bengio1995] LeCun, Y., and Bengio, Y. 1995. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks.
  • [Luong, Pham, and Manning2015] Luong, T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing 1412–1421.
  • [Huang et al.1998] Huang, N. E.; Shen, Z.; Long, S. R.; Wu, M. C.; Shih, H. H.; Zheng, Q.; Yen, N.-C.; Tung, C. C.; and Liu, H. H. 1998. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc. Roy. Soc. London A 454:903–995.
  • [Boulanger-Lewandowski, Bengio, and Vincent2012] Boulanger-Lewandowski, N.; Bengio, Y.; and Vincent, P. 2012. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription.
  • [Qin et al.2017] Qin, Y.; Song, D.; Cheng, H.; Cheng, W.; Jiang, G.; and Cottrell, G. W. 2017. A dual-stage attention-based recurrent neural network for time series prediction. In IJCAI’17, 2627–2633.
  • [Raffel2016] Raffel, C. 2016. Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching. PhD thesis.
  • [Rippel, Snoek, and Adams2015] Rippel, O.; Snoek, J.; and Adams, R. P. 2015. Spectral representations for convolutional neural networks. NIPS 2449–2457.
  • [Roberts et al.2011] Roberts, S.; Osborne, M.; Ebden, M.; Reece, S.; Gibson, N.; and Aigrain, S. 2011. Gaussian processes for time-series modelling. Phil. Trans. R. Soc. A.
  • [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 3104–3112.
  • [Vapnik, Golowich, and Smola1997] Vapnik, V.; Golowich, S. E.; and Smola, A. 1997. Support vector method for function approximation, regression estimation, and signal processing. Advances in Neural Information Processing Systems 281–287.
  • [Werbos1990] Werbos, P. J. 1990. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 1550–1560.
  • [Zhang2003] Zhang, G. P. 2003. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 159–175.