1 Introduction
Time series data are everywhere in modern life. People observe evolving variables generated by sensors over discrete time steps and organize them into time series. For example, household electricity consumption, road occupancy rate, currency exchange rate, solar power production, and even music notes can all be considered time series data. In most cases, the collected data are multivariate time series (MTS), such as the electricity consumption of multiple clients tracked by an electric power company. There may exist complex dynamic interdependencies between different series that are significant but difficult to capture and analyze.
Humans are often interested in forecasting the future based on historical data. The better the interdependencies among different series are modeled, the more accurate the forecast can be. For instance, the price of crude oil heavily influences the price of gasoline, but has less influence on the price of lumber, as shown in Figure 1 (source: https://www.eia.gov and https://www.investing.com). Thus, knowing that gasoline is produced from crude oil and lumber is not, we can use the price of crude oil to predict the price of gasoline.
In machine learning, we want the model to learn the interdependencies from data automatically. Machine learning has already been applied to time series analysis for both classification and forecasting [G. Zhang and Hu.1998, Zhang2003, Lai et al.2018, Qin et al.2017]. In classification, the machine learns to assign a label to a time series, such as determining a patient’s diagnostic category from medical sensor readings. In forecasting, the machine predicts future time series values based on past observations. For example, precipitation in the coming days, weeks, or months can be forecast from historical measurements. The further ahead we try to forecast, the harder it is.
When it comes to MTS forecasting using deep learning, the recurrent neural network (RNN) [David E Rumelhart and Williams1986, Werbos1990, Elman1990] is a natural choice. However, one disadvantage of RNNs in time series analysis is their weakness in handling long-term dependencies, e.g., a yearly pattern in a daily recorded sequence [Kyunghyun Cho and Bengio.2014]. The attention mechanism [Luong, Pham, and Manning2015, Bahdanau, Cho, and Bengio2015], which was originally used in encoder-decoder networks [Sutskever, Vinyals, and Le2014], alleviates this problem to some extent and thus improves the effectiveness of RNNs [Lai et al.2018].
In this paper, we propose a new attention mechanism, temporal pattern attention, for MTS forecasting. The typical attention mechanism identifies the time steps relevant to the prediction and extracts information from these time steps, which is an obvious limitation for MTS prediction. Consider the example in Figure 1. To predict the price of gasoline, the machine has to learn to focus on “crude oil” and ignore “lumber”. In temporal pattern attention, instead of selecting the relevant time steps as typical attention does, the machine learns to select the relevant variables.
In addition, time series data often exhibit noticeable periodic temporal patterns, which are critical for prediction. However, periodic patterns spanning multiple time steps are difficult for the typical attention mechanism to identify, since it usually focuses on only a few time steps. In temporal pattern attention, we introduce a convolutional neural network (CNN) [LeCun and Bengio.1995, A. Krizhevsky and Hinton.2012] to extract temporal pattern information from each individual variable.
The main contributions of this paper are summarized as follows:

We introduce a new concept in attention mechanisms that selects the relevant variables instead of the relevant time steps. The method is simple and can be applied generally to RNNs.

The learned CNN filters in our attention demonstrate interesting and interpretable behavior. They play a role similar to the bases in the discrete Fourier transform (DFT) and extract “frequency domain” information.

In experiments on real-world data, ranging from periodic and partially linear tasks to nonperiodic and nonlinear ones, our attention achieves state-of-the-art results across multiple datasets.
The remainder of this paper is organized as follows. We will report related work in Section 2 and describe background knowledge in Section 3. Then, our proposed attention will be detailed in Section 4, while the experimental results and analysis will be presented in Section 5. Finally, we will conclude our paper in Section 6.
2 Related Work
The most renowned model for linear univariate time series forecasting is the autoregressive integrated moving average (ARIMA) [G. E. Box and Ljung2015], which encompasses other autoregressive time series models, including autoregression (AR), moving average (MA), and autoregressive moving average (ARMA). Additionally, linear support vector regression (SVR) [Cao and Tay.2003, Kim.2003] treats forecasting as a typical regression problem with time-varying parameters. However, these models are mostly limited to linear univariate time series and do not scale well to MTS. To forecast MTS data, vector autoregression (VAR), a generalization of AR-based models, was proposed. VAR is probably the most well-known model in MTS forecasting. Nevertheless, neither AR-based nor VAR-based models can capture nonlinearity. For that reason, substantial effort has been put into nonlinear models for time series forecasting based on kernel methods [Chen, Wang, and Harris2008], ensembles [Bouchachia and Bouchachia2008], or Gaussian processes [Frigola and Rasmussen2014]. Still, these approaches apply predetermined nonlinearities and may fail to recognize different forms of nonlinearity for different MTS.
Recently, deep neural networks have received a great amount of attention due to their ability to capture nonlinear interdependencies. Two variants of RNN, namely long short-term memory (LSTM) [Hochreiter and Schmidhuber1997] and gated recurrent unit (GRU) [Cho et al.2014], have shown promising results in several NLP tasks and have also been employed for MTS forecasting. Previous work in this area ranges from naive RNNs [J. Connor and Martin.1991], to hybrid models that combine ARIMA and multilayer perceptrons [G. Zhang and Hu.1998, Zhang2003, Jain and Kumar.2007], to the more recent dynamic Boltzmann machine with RNN [Dasgupta and Osogami2017]. Although these models can be applied to MTS, they mainly target univariate or bivariate time series.
To the best of our knowledge, the Long- and Short-term Time-series Network (LSTNet) [Lai et al.2018] is the first model designed specifically for MTS forecasting with up to hundreds of evolving variables. In LSTNet, a CNN captures short-term patterns, whereas an LSTM or GRU memorizes relatively long-term patterns. In practice, however, LSTM and GRU cannot memorize very long-term interdependencies due to training instability and the vanishing gradient problem. Thus, LSTNet adds either a recurrent-skip layer or a typical attention mechanism to address this concern. A traditional autoregression component is also part of the model and helps mitigate the scale insensitivity of neural networks. Nonetheless, LSTNet has two major shortcomings when compared to our attention: (1) the skip length of the recurrent-skip layer must be predetermined manually, whereas our attention learns the periodic patterns by itself, and (2) the LSTNet model is specifically designed for MTS data with strong periodic patterns, whereas our attention is simple and adaptable to various datasets, even nonperiodic and nonlinear ones, as shown in our experiments.
3 Preliminaries
In this section, we briefly introduce two essential modules related to our proposed model: one is RNN, and the other is typical attention mechanism.
3.1 Recurrent Neural Network (RNN)
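Given a sequence of information $\{x_1, x_2, \dots, x_t\}$, where $x_i \in \mathbb{R}^n$, an RNN generally defines a recurrent function, $F$, and calculates $h_t \in \mathbb{R}^m$ for each time step, $t$, as follows:
$$h_t = F(h_{t-1}, x_t) \tag{1}$$
where the implementation of the function $F$ depends on what kind of RNN cell is used.
The long short-term memory (LSTM) [Hochreiter and Schmidhuber1997] cell is widely used and has a slightly different recurrent function:
$$h_t, c_t = F(h_{t-1}, c_{t-1}, x_t) \tag{2}$$
which is defined by the following equations:
$$i_t = \mathrm{sigmoid}(W_{x_i} x_t + W_{h_i} h_{t-1}) \tag{3}$$
$$f_t = \mathrm{sigmoid}(W_{x_f} x_t + W_{h_f} h_{t-1}) \tag{4}$$
$$o_t = \mathrm{sigmoid}(W_{x_o} x_t + W_{h_o} h_{t-1}) \tag{5}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{x_g} x_t + W_{h_g} h_{t-1}) \tag{6}$$
$$h_t = o_t \odot \tanh(c_t) \tag{7}$$
where $i_t$, $f_t$, $o_t$, and $c_t \in \mathbb{R}^m$; $W_{x_i}$, $W_{x_f}$, $W_{x_o}$, and $W_{x_g} \in \mathbb{R}^{m \times n}$; $W_{h_i}$, $W_{h_f}$, $W_{h_o}$, and $W_{h_g} \in \mathbb{R}^{m \times m}$; and $\odot$ means element-wise multiplication.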
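As an illustration of the LSTM recurrence, the following NumPy sketch performs one step of a standard bias-free LSTM cell. It is a minimal sketch, not the implementation used in the experiments: the function name `lstm_step` and the stacking of the four weight matrices along a leading axis are assumptions made here for compactness.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h):
    """One step of a standard LSTM cell without bias terms (illustrative only).

    x_t: (n,) input, h_prev and c_prev: (m,) previous hidden and cell states.
    W_x: (4, m, n) and W_h: (4, m, m); the leading axis indexes the input gate,
    forget gate, output gate, and candidate, in that order (an assumed layout).
    """
    pre = W_x @ x_t + W_h @ h_prev        # pre-activations, shape (4, m)
    i_t = sigmoid(pre[0])                 # input gate
    f_t = sigmoid(pre[1])                 # forget gate
    o_t = sigmoid(pre[2])                 # output gate
    g_t = np.tanh(pre[3])                 # candidate cell update
    c_t = f_t * c_prev + i_t * g_t        # element-wise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```

Calling `lstm_step` repeatedly over a sequence, feeding each returned pair back in as `h_prev` and `c_prev`, produces the hidden states that the attention mechanisms discussed later operate on.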
3.2 Typical Attention Mechanism
In the typical attention mechanism [Luong, Pham, and Manning2015, Bahdanau, Cho, and Bengio2015] on an RNN, given the previous states $H = \{h_1, h_2, \dots, h_{t-1}\}$, a context vector $v_t$ is extracted from the previous states. $v_t$ is a weighted sum of the columns $h_i$ of $H$ and represents the information relevant to the current time step. $v_t$ is further integrated with the present state $h_t$ to obtain the prediction.
Assume a scoring function $f: \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ that computes the relevance between its input vectors. Formally, the context vector $v_t$ is obtained as follows:
$$\alpha_i = \frac{\exp(f(h_i, h_t))}{\sum_{j=1}^{t-1} \exp(f(h_j, h_t))} \tag{8}$$
$$v_t = \sum_{i=1}^{t-1} \alpha_i h_i \tag{9}$$
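The typical attention step can be sketched in NumPy as below. The bilinear scoring function (score of h_i against h_t computed as h_i^T W h_t) is an assumption chosen for illustration, in the spirit of Luong-style attention; the function name `typical_attention` is hypothetical.

```python
import numpy as np

def typical_attention(H, h_t, W):
    """Softmax attention over time steps (illustrative sketch).

    H: (m, t-1) previous hidden states stored as columns.
    h_t: (m,) current hidden state.
    W: (m, m) weight of the assumed bilinear scoring function.
    """
    scores = H.T @ W @ h_t                 # one score per previous time step
    scores = scores - scores.max()         # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    v_t = H @ alpha                        # weighted sum of the columns of H
    return v_t, alpha
```

Because the weights are normalized over time steps and each column mixes all variables, this form can pick out relevant time steps but cannot single out individual variables.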
4 Temporal Pattern Attention
While previous work mainly focuses on changing the network architecture of attention-based models to obtain better performance on different tasks, we argue that there is a critical defect in applying typical attention mechanisms to RNNs for MTS forecasting. Typical attention aims at selecting the information relevant to the current time step, and the context vector is the weighted sum of the column vectors of the previous RNN hidden states, $H = \{h_1, h_2, \dots, h_{t-1}\}$. This design suits tasks in which each time step carries a single piece of information; in NLP, for example, each time step corresponds to a single word. If there are multiple variables in each time step, typical attention cannot ignore the variables that are noisy for forecasting. Moreover, since typical attention averages information across multiple time steps, it fails to detect temporal patterns useful for forecasting.
An overview of the proposed model is shown in Figure 2. Given the previous RNN hidden states $H \in \mathbb{R}^{m \times (t-1)}$, the proposed attention attends to the row vectors of $H$. The attention weights on the rows select the variables that are helpful for forecasting. Since the context vector is now a weighted sum of row vectors that contain information across multiple time steps, it captures temporal information.
4.1 Problem Formulation
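In MTS forecasting, given an MTS, $X = \{x_1, x_2, \dots, x_{t-1}\}$, where $x_i \in \mathbb{R}^n$ represents the observed value at time $i$, the task is to predict the value of $x_{t-1+\Delta}$, where $\Delta$ is a fixed horizon that depends on the task. We denote the corresponding prediction as $y_{t-1+\Delta}$, and the ground-truth value as $\hat{y}_{t-1+\Delta} = x_{t-1+\Delta}$. Moreover, for some tasks, we only use $\{x_{t-w}, x_{t-w+1}, \dots, x_{t-1}\}$ to predict $x_{t-1+\Delta}$, where $w$ is the window size.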
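The windowed form of this task can be sketched as a helper that slices a series into input windows and targets. This is a hedged illustration: it reads the formulation as "use the last w observations to predict the value that lies `horizon` steps after the final observed one", and the function name `make_windows` is hypothetical.

```python
import numpy as np

def make_windows(X, w, horizon):
    """Build (input window, target) pairs from a multivariate series.

    X: (T, n) array with one row per time step.
    w: window size, horizon: number of steps between the last input
    step and the target step (an assumed reading of the task).
    Returns inputs of shape (N, w, n) and targets of shape (N, n).
    """
    T = X.shape[0]
    starts = range(T - w - horizon + 1)
    inputs = np.stack([X[s:s + w] for s in starts])
    # The last input index is s + w - 1, so the target sits at s + w - 1 + horizon.
    targets = np.stack([X[s + w - 1 + horizon] for s in starts])
    return inputs, targets
```

For example, with ten time steps, a window of three, and a horizon of two, the first window covers steps 0 to 2 and its target is step 4.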
4.2 Temporal Pattern Detection by CNN
Motivated by the success of CNNs and their ability to capture important patterns in signals, we introduce a CNN to enhance the learning ability of the model and apply CNN filters to the row vectors of $H$. Specifically, we have $k$ filters $C_i \in \mathbb{R}^{1 \times T}$, where $T$ is the maximum length we want to pay attention to. When unspecified, we assume $T = w$. The convolutional operations yield $H^C \in \mathbb{R}^{m \times k}$, where $H^C_{i,j}$ represents the convolutional value of the $i$-th row vector and the $j$-th filter. Formally, this operation is given by the following equation:
$$H^C_{i,j} = \sum_{l=1}^{w} H_{i,(t-w-1+l)} \times C_{j,T-w+l} \tag{10}$$
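When each filter is as long as the window, convolving a row of hidden states with a filter yields a single value, namely their dot product. The sketch below relies on that assumption (filter length equal to window size) and stores the hidden states of the window as the columns of a matrix; the function name `temporal_pattern_conv` is hypothetical.

```python
import numpy as np

def temporal_pattern_conv(H, C):
    """Apply k full-length filters to every row of H (illustrative sketch).

    H: (m, w) hidden states inside the window, one column per time step.
    C: (k, w) filters, each as long as the window (assumed T = w).
    Returns HC of shape (m, k) with HC[i, j] = sum_l H[i, l] * C[j, l].
    """
    return H @ C.T
```

Each row of the result describes one hidden dimension through its responses to the k filters, which is the representation the attention weights are later computed on.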
4.3 Proposed Attention
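We calculate $v_t$ as a weighted sum of the row vectors of $H^C$. The scoring function $f: \mathbb{R}^k \times \mathbb{R}^m \to \mathbb{R}$ used to evaluate relevance is defined as
$$f(H^C_i, h_t) = (H^C_i)^\top W_a h_t \tag{11}$$
where $H^C_i$ is the $i$-th row of $H^C$, and $W_a \in \mathbb{R}^{k \times m}$. The attention weight $\alpha_i$ is obtained as
$$\alpha_i = \mathrm{sigmoid}(f(H^C_i, h_t)) \tag{12}$$
Note that we use the sigmoid activation function instead of softmax, since we expect more than one variable to be useful for forecasting.
Finally, the row vectors of $H^C$ are weighted by $\alpha_i$ to obtain the context vector $v_t \in \mathbb{R}^k$:
$$v_t = \sum_{i=1}^{m} \alpha_i H^C_i \tag{13}$$
Then we integrate $v_t$ and $h_t$ to make the final prediction:
$$h'_t = W_h h_t + W_v v_t \tag{14}$$
$$y_{t-1+\Delta} = W_{h'} h'_t \tag{15}$$
where $h_t, h'_t \in \mathbb{R}^m$, $W_h \in \mathbb{R}^{m \times m}$, $W_v \in \mathbb{R}^{m \times k}$, $W_{h'} \in \mathbb{R}^{n \times m}$, and $y_{t-1+\Delta} \in \mathbb{R}^n$.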
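Putting the pieces of this section together, a forward pass of the proposed attention might look like the NumPy sketch below. It is a hedged illustration rather than the authors' code: it assumes filters as long as the window, a bilinear scoring function between each convolved row and the current hidden state, and linear output layers without bias; all function and argument names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def temporal_pattern_attention(H, h_t, C, W_a, W_h, W_v, W_out):
    """Forward pass of row-wise attention over convolved hidden states.

    H: (m, w) hidden states in the window, h_t: (m,) current hidden state,
    C: (k, w) filters, W_a: (k, m), W_h: (m, m), W_v: (m, k), W_out: (n, m).
    Returns the prediction y of shape (n,) and the weights alpha of shape (m,).
    """
    HC = H @ C.T                          # (m, k): filter responses per row
    alpha = sigmoid(HC @ W_a @ h_t)       # (m,): independent weight per row
    v_t = alpha @ HC                      # (k,): weighted sum of the rows
    h_prime = W_h @ h_t + W_v @ v_t       # (m,): combine state and context
    y = W_out @ h_prime                   # (n,): final prediction
    return y, alpha
```

Since the weights come from a sigmoid rather than a softmax, they need not sum to one, so several rows can receive a high weight at the same time.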
5 Experiments and Analysis
In this section, we first describe the datasets that we will conduct our experiments on. Next, we show our experimental results and visualization of prediction against LSTNet. Then, ablation study is discussed. Finally, we analyze how the CNN filters act like the bases in DFT.
5.1 Datasets
To test the effectiveness and generalization ability of our attention mechanism, we use two dissimilar types of datasets: typical MTS datasets and polyphonic music datasets.
The typical MTS datasets are published by LSTNet, and there are four datasets:

SolarEnergy (http://www.nrel.gov/grid/solarpowerdata.html): the solar power production data from photovoltaic plants in Alabama State in 2006.

Traffic (http://pems.dot.ca.gov): two years (2015-2016) of data provided by the California Department of Transportation that describe the road occupancy rate (between 0 and 1) on San Francisco Bay area freeways.

Electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014): a collection of the electricity consumption of 321 clients in kWh.

Exchange Rate: the exchange rates of eight foreign countries (Australia, Britain, Canada, China, Japan, New Zealand, Singapore, and Switzerland) from 1990 to 2016.
These datasets are real-world data that contain both linear and nonlinear interdependencies. Moreover, three of the four datasets, namely SolarEnergy, Traffic, and Electricity, exhibit strong periodic patterns, which reflect the daily or weekly routines of human activities. Following the authors of LSTNet, all datasets are split into training (), validation (), and testing set () in chronological order.
On the other hand, the polyphonic music datasets, introduced in the following list, are much more complicated in the sense that no apparent linearity or repetitive patterns exist:

MuseData [Nicolas BoulangerLewandowski and Vincent2012]: a collection of music pieces from various classical music composers in MIDI format.

LPD5Cleansed [HaoWen Dong and Yang2018, Raffel2016]: multi-track piano-rolls that contain drums, piano, guitar, bass, and strings.
To train models on these datasets, we encode each played note as 1 and each unplayed note as 0, and set one beat as one time step, as shown in Table 1. Given the played notes of 4 bars consisting of 16 beats, the task is to predict whether each pitch is played at the next time step. For the training, validation, and testing sets, we follow the original split of MuseData, which is divided into 524 training pieces, 135 validation pieces, and 124 testing pieces. LPD5Cleansed, on the other hand, is not split by previous work [HaoWen Dong and Yang2018, Raffel2016], so we randomly split it into training (), validation (), and testing () sets.
The statistics of both typical MTS datasets and polyphonic music datasets are summarized in Table 1.
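| Dataset | Length (time steps) | Number of variables | Sampling interval |
|---|---|---|---|
| SolarEnergy | 52,560 | 137 | 10 minutes |
| Traffic | 17,544 | 862 | 1 hour |
| Electricity | 26,304 | 321 | 1 hour |
| Exchange Rate | 7,588 | 8 | 1 day |
| MuseData | 216–102,552 | 128 | 1 beat |
| LPD5Cleansed | 1,072–1,917,952 | 128 | 1 beat |

Table 1: Statistics of the datasets.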
| RSE | SolarEnergy | | | | Traffic | | | | Electricity | | | | Exchange Rate | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| horizon | 3 | 6 | 12 | 24 | 3 | 6 | 12 | 24 | 3 | 6 | 12 | 24 | 3 | 6 | 12 | 24 |
| AR | 0.2435 | 0.3790 | 0.5911 | 0.8699 | 0.5991 | 0.6218 | 0.6252 | 0.6293 | 0.0995 | 0.1035 | 0.1050 | 0.1054 | 0.0228 | 0.0279 | 0.0353 | 0.0445 |
| LRidge | 0.2019 | 0.2954 | 0.4832 | 0.7287 | 0.5833 | 0.5920 | 0.6148 | 0.6025 | 0.1467 | 0.1419 | 0.2129 | 0.1280 | 0.0184 | 0.0274 | 0.0419 | 0.0675 |
| LSVR | 0.2021 | 0.2999 | 0.4846 | 0.7300 | 0.5740 | 0.6580 | 0.7714 | 0.5909 | 0.1523 | 0.1372 | 0.1333 | 0.1180 | 0.0189 | 0.0284 | 0.0425 | 0.0662 |
| GP | 0.2259 | 0.3286 | 0.5200 | 0.7973 | 0.6082 | 0.6772 | 0.6406 | 0.5995 | 0.1500 | 0.1907 | 0.1621 | 0.1273 | 0.0239 | 0.0272 | 0.0394 | 0.0580 |
| LSTNetSkip | 0.1843 | 0.2559 | 0.3254 | 0.4643 | 0.4777 | 0.4893 | 0.4950 | 0.4973 | 0.0864 | 0.0931 | 0.1007 | 0.1007 | 0.0226 | 0.0280 | 0.0356 | 0.0449 |
| LSTNetAttn | 0.1816 | 0.2538 | 0.3466 | 0.4403 | 0.4897 | 0.4973 | 0.5173 | 0.5300 | 0.0868 | 0.0953 | 0.0984 | 0.1059 | 0.0276 | 0.0321 | 0.0448 | 0.0590 |
| Our Model | 0.1803 ± 0.0008 | 0.2347 ± 0.0017 | 0.3234 ± 0.0044 | 0.4389 ± 0.0084 | 0.4487 ± 0.0180 | 0.4658 ± 0.0053 | 0.4641 ± 0.0034 | 0.4765 ± 0.0068 | 0.0823 ± 0.0012 | 0.0916 ± 0.0018 | 0.0964 ± 0.0015 | 0.1006 ± 0.0015 | 0.0174 ± 0.0001 | 0.0243 ± 0.0003 | 0.0345 ± 0.0010 | 0.0444 ± 0.0006 |

| CORR | SolarEnergy | | | | Traffic | | | | Electricity | | | | Exchange Rate | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| horizon | 3 | 6 | 12 | 24 | 3 | 6 | 12 | 24 | 3 | 6 | 12 | 24 | 3 | 6 | 12 | 24 |
| AR | 0.9710 | 0.9263 | 0.8107 | 0.5314 | 0.7752 | 0.7568 | 0.7544 | 0.7519 | 0.8845 | 0.8632 | 0.8591 | 0.8595 | 0.9734 | 0.9656 | 0.9526 | 0.9357 |
| LRidge | 0.9807 | 0.9568 | 0.8765 | 0.6803 | 0.8038 | 0.8051 | 0.7879 | 0.7862 | 0.8890 | 0.8594 | 0.8003 | 0.8806 | 0.9788 | 0.9722 | 0.9543 | 0.9305 |
| LSVR | 0.9807 | 0.9562 | 0.8764 | 0.6789 | 0.7993 | 0.7267 | 0.6711 | 0.7850 | 0.8888 | 0.8861 | 0.8961 | 0.8891 | 0.9782 | 0.9697 | 0.9546 | 0.9370 |
| GP | 0.9751 | 0.9448 | 0.8518 | 0.5971 | 0.7831 | 0.7406 | 0.7671 | 0.7909 | 0.8670 | 0.8334 | 0.8394 | 0.8818 | 0.8713 | 0.8193 | 0.8484 | 0.8278 |
| LSTNetSkip | 0.9843 | 0.9690 | 0.9467 | 0.8870 | 0.8721 | 0.8690 | 0.8614 | 0.8588 | 0.9283 | 0.9135 | 0.9077 | 0.9119 | 0.9735 | 0.9658 | 0.9511 | 0.9354 |
| LSTNetAttn | 0.9848 | 0.9696 | 0.9397 | 0.8995 | 0.8704 | 0.8669 | 0.8540 | 0.8429 | 0.9243 | 0.9095 | 0.9030 | 0.9025 | 0.9717 | 0.9656 | 0.9499 | 0.9339 |
| Our Model | 0.9850 ± 0.0001 | 0.9742 ± 0.0003 | 0.9487 ± 0.0023 | 0.9081 ± 0.0151 | 0.8812 ± 0.0089 | 0.8717 ± 0.0034 | 0.8717 ± 0.0021 | 0.8639 ± 0.0030 | 0.9429 ± 0.0004 | 0.9337 ± 0.0011 | 0.9250 ± 0.0013 | 0.9133 ± 0.0008 | 0.9790 ± 0.0003 | 0.9709 ± 0.0003 | 0.9564 ± 0.0005 | 0.9381 ± 0.0008 |

Table 2: Results on typical MTS datasets using RSE (upper) and CORR (lower) as metrics. The best performance is in boldface, and the second best is underlined. We report the mean and standard deviation of our model over ten runs. All numbers other than the results of our model are taken from the LSTNet paper [Lai et al.2018].
5.2 Methods for Comparison
We compared our model against the following methods on the typical MTS datasets:

AR: the standard autoregression model.

LRidge: VAR model with L2-regularization, which is the single most popular model for MTS forecasting.

LSVR: VAR model with SVR objective function [V. Vapnik1997].

GP: the Gaussian Process model [FrigolaAlcade.2015, S. Roberts and Aigrain.2011].

LSTNetSkip: the LSTNet with recurrentskip layer.

LSTNetAttn: the LSTNet with attention layer.
AR, LRidge, LSVR, and GP are the traditional baseline methods, while LSTNetSkip and LSTNetAttn are state-of-the-art methods based on deep neural networks.
However, since nonlinearity and the lack of periodic patterns make both the traditional baseline methods and LSTNet unsuitable for polyphonic music datasets, we use LSTM and LSTM with Luong attention as the baseline models to benchmark the improvement of our model on polyphonic music datasets:

LSTM: RNN cell as introduced in Section 3.

LSTM with Luong attention: LSTM with an attention mechanism whose scoring function is $f(h_i, h_t) = h_i^\top W h_t$, where $W \in \mathbb{R}^{m \times m}$ [Luong, Pham, and Manning2015].
5.3 Model Setup and Parameter Settings
For all experiments, we use LSTM as our RNN cell to build our model, and fix the number of CNN filters at 32. Also, inspired by LSTNet, we include an autoregression component in our model when training and testing on typical MTS datasets.
For typical MTS datasets, we performed a grid search over the tunable parameters, just like LSTNet. Specifically, on SolarEnergy, Traffic, and Electricity, the range of window size is , the range of number of hidden units is , and the range of step of exponential learning rate decay with rate 0.995 is . On Exchange Rate, these three parameters are fixed at 30, 6, and 120, respectively. Two types of data normalization are also treated as part of the grid search: one normalizes each time series by its own maximum value, and the other normalizes every time series by the maximum value over the whole dataset. Lastly, we use the absolute loss function and Adam with learning rate on SolarEnergy, Traffic, and Electricity, and learning rate on Exchange Rate. For all other methods for comparison mentioned in the previous subsection, the parameters are identical to those reported in the LSTNet paper [Lai et al.2018].
For the models used on polyphonic music datasets, including the baselines and our models in the following subsections, we use 3 layers for all RNNs, the same as [Chuan and Herremans2018], and fix the number of trainable parameters to around by adjusting the number of units in the LSTM, in order to compare different models fairly. In addition, the optimizer is Adam with learning rate , and the loss function is cross entropy.
5.4 Evaluation Metrics
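On typical MTS datasets, since we compare our model with LSTNet, we follow the same evaluation metrics. The first metric is the root relative squared error (RSE), which is defined as
$$\mathrm{RSE} = \frac{\sqrt{\sum_{t=t_0}^{t_1} \sum_{i=1}^{n} (y_{t,i} - \hat{y}_{t,i})^2}}{\sqrt{\sum_{t=t_0}^{t_1} \sum_{i=1}^{n} (\hat{y}_{t,i} - \overline{\hat{y}_{t_0:t_1,1:n}})^2}} \tag{16}$$
and the other metric is the empirical correlation coefficient (CORR):
$$\mathrm{CORR} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{t=t_0}^{t_1} (y_{t,i} - \overline{y_{t_0:t_1,i}})(\hat{y}_{t,i} - \overline{\hat{y}_{t_0:t_1,i}})}{\sqrt{\sum_{t=t_0}^{t_1} (y_{t,i} - \overline{y_{t_0:t_1,i}})^2 \sum_{t=t_0}^{t_1} (\hat{y}_{t,i} - \overline{\hat{y}_{t_0:t_1,i}})^2}} \tag{17}$$
where $y$ and $\hat{y}$ are the prediction and the ground-truth value defined in Section 4.1, $\hat{y}_t, \forall t \in [t_0, t_1]$, is the label of the testing data, and $\overline{y}$ denotes the mean of the set $y$. RSE is a normalized version of the root mean square error (RMSE) that disregards data scale. For RSE, lower is better, whereas for CORR, higher is better.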
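Both metrics are straightforward to compute. The sketch below assumes predictions and ground truth are stored as arrays with one row per test time step and one column per series, and that no series is constant over the test period (otherwise the correlation is undefined); the function names `rse` and `corr` are hypothetical.

```python
import numpy as np

def rse(pred, truth):
    """Root relative squared error over all series and time steps.

    pred, truth: (T, n) arrays; the denominator uses the global mean of truth.
    """
    num = np.sqrt(np.sum((pred - truth) ** 2))
    den = np.sqrt(np.sum((truth - truth.mean()) ** 2))
    return num / den

def corr(pred, truth):
    """Empirical correlation coefficient, averaged over the n series."""
    p = pred - pred.mean(axis=0)          # center each predicted series
    q = truth - truth.mean(axis=0)        # center each ground-truth series
    num = (p * q).sum(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (q ** 2).sum(axis=0))
    return np.mean(num / den)
```

A predictor that always outputs the global mean of the test labels has an RSE of exactly 1, which gives a convenient reference point when reading the results.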
To decide which model is better on polyphonic music datasets, we use validation loss (negative log-likelihood), precision, recall, and F1 score as measurements, which are widely used in previous work on polyphonic music generation [Nicolas BoulangerLewandowski and Vincent2012, Chuan and Herremans2018].
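For binary piano-roll predictions, precision, recall, and F1 can be computed by counting played notes. The sketch below assumes the predictions have already been converted to 0/1 values (how that is done is not specified here) and returns 0 for a metric whose denominator is zero; the function name `precision_recall_f1` is hypothetical.

```python
import numpy as np

def precision_recall_f1(pred, truth):
    """Precision, recall, and F1 for binary note predictions.

    pred, truth: arrays of the same shape with 1 for a played note and 0 otherwise.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)             # notes predicted and actually played
    fp = np.sum(pred & ~truth)            # notes predicted but not played
    fn = np.sum(~pred & truth)            # notes played but not predicted
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return float(precision), float(recall), float(f1)
```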
5.5 Results on Typical MTS Datasets
On typical MTS datasets, we choose the best model on the validation set, using RSE/CORR as the metric, and evaluate it on the testing set. The numerical results are tabulated in Table 2, where the metric of the upper table is RSE and the metric of the lower table is CORR. All numbers other than the results of our model are taken from the LSTNet paper [Lai et al.2018]. Both tables show that our model outperforms all other methods on every dataset, horizon, and metric, with only one exception.
Compared to LSTNetSkip and LSTNetAttn, the previous state-of-the-art methods, our model performs better, especially on Traffic and Electricity, which have the largest numbers of evolving variables. Moreover, on Exchange Rate, where no repetitive pattern exists, our model is still the best overall, while the performance of LSTNetSkip and LSTNetAttn falls behind that of traditional methods, including AR, LRidge, LSVR, and GP. The one exception is Exchange Rate with a 6-day horizon, where LRidge attains a higher CORR (0.9722 versus 0.9709). Linear models are already good enough on this dataset, so deep learning brings little additional benefit. We also visualize and compare the predictions of our model and LSTNetSkip in Figure 3 as an illustration.
Generally speaking, our model achieves state-of-the-art performance on both periodic and nonperiodic MTS datasets.
5.6 Results on Polyphonic Music Datasets
In this subsection, to further verify the efficacy and generalization ability of our model, we conduct experiments on polyphonic music datasets; the results are shown in Figure 4 and Table 3. We compare three RNN models: LSTM, LSTM with Luong attention, and LSTM with the proposed attention. Figure 4 shows the validation loss across training epochs, and in Table 3, we use the models with the lowest validation loss to calculate precision, recall, and F1 score on the testing set.
From the results, we can first verify our claim that the typical attention mechanism does not work on such tasks, since under similar hyperparameters and numbers of trainable weights, both LSTM and our model outperform it. In addition, our model learns more effectively than LSTM throughout the learning process and performs better in terms of precision, recall, and F1 score.
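| Dataset | Model | Precision | Recall | F1 |
|---|---|---|---|---|
| MuseData | w/o attention | 0.84009 | 0.67657 | 0.74952 |
| MuseData | w/ Luong attention | 0.75197 | 0.52839 | 0.62066 |
| MuseData | w/ our attention | 0.85581 | 0.68889 | 0.76333 |
| LPD5Cleansed | w/o attention | 0.83794 | 0.73041 | 0.78049 |
| LPD5Cleansed | w/ Luong attention | 0.83548 | 0.72380 | 0.77564 |
| LPD5Cleansed | w/ our attention | 0.83979 | 0.74517 | 0.78966 |

Table 3: Precision, recall, and F1 score on the testing sets of the polyphonic music datasets.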
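| Dataset | SolarEnergy | | | Traffic | | | Electricity | | | MuseData | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | position | filter | w/o CNN | position | filter | w/o CNN | position | filter | w/o CNN | position | filter | w/o CNN |
| softmax | 0.4391 | 0.4434 | 0.4489 | 0.4715 | 0.4897 | 0.4770 | 0.1006 | 0.1007 | 0.1011 | 0.04931 | 0.04968 | 0.04902 |
| sigmoid | 0.4389 | 0.4597 | 0.4507 | 0.4765 | 0.4795 | 0.4796 | 0.1006 | 0.1028 | 0.1008 | 0.04878 | 0.04958 | 0.04987 |
| concat | 0.4462 | 0.4404 | 0.4951 | 0.4855 | 0.4774 | 0.4795 | 0.1026 | 0.1035 | 0.1014 | 0.05191 | 0.05167 | 0.05128 |

Table 4: Ablation study.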
5.7 Ablation Study
To verify that the above improvement comes from each component we add rather than from a specific set of hyperparameters, we conduct an ablation study on four datasets: SolarEnergy, Traffic, Electricity, and MuseData. There are two main settings: how we attend to the hidden states, $H$, of the RNN, and how we integrate the scoring function into our model, or whether we discard it altogether. First, in our proposed method, the model attends to the values of different filters at each position ($H^C_i$); alternatively, it can attend to the values of the same filter at different positions ($(H^C)^\top_i$), or to the row vectors of $H$ ($H_i$). These three approaches correspond to the column headers “position”, “filter”, and “w/o CNN” in Table 4. Second, whereas typical attention mechanisms usually apply a softmax function to the output of the scoring function to extract the most relevant information, we use sigmoid as our activation function. Therefore, we compare these two functions. Moreover, concatenating all previous hidden states and letting the model automatically learn which values are important is another possible structure for forecasting. Taking these two groups of settings into consideration, we train models with all combinations of possible structures on these four datasets.
On MuseData, the model with the sigmoid activation function and attention over positions ($H^C_i$) is clearly the best, which suggests that our proposed model is effective for forecasting. Removing any proposed component from our model degrades performance. For example, using softmax instead of sigmoid raises the negative log-likelihood from 0.04878 to 0.04931, and dropping the CNN filters yields a worse model with a negative log-likelihood of 0.04987. We also notice that there is no significant difference between our model and the model using softmax on the first three datasets in Table 4: SolarEnergy, Traffic, and Electricity. This is not surprising given our reason for using sigmoid, as explained in Section 4.3. We originally expected the CNN filters to find some basic patterns and the sigmoid function to help the model combine those patterns into one that helps most. However, these three datasets are strongly periodic, so a small number of basic patterns may be enough for good prediction. Overall, our model is more general and gives stable and competitive results across different datasets.
5.8 Analysis of CNN Filters
DFT is a variant of the Fourier transform (FT) that handles equally spaced samples of a signal in time. In the field of time series analysis, a wide range of work uses FT or DFT to reveal important characteristics of time series [N.E. Huang and Liu1998, Bloomfield1976]. In our case, since the MTS data are also equally spaced and discrete, we can apply DFT to analyze them. However, MTS data contain more than one time series, so we average the magnitudes of the frequency components over all time series and arrive at a single frequency domain representation. We call this the average discrete Fourier transform (avgDFT). The single frequency domain representation reveals the prevailing frequency components of the MTS data. For instance, it is reasonable to assume that there is a notable 24-hour oscillation in Figure 3, which is verified by the avgDFT of the Traffic dataset shown in Figure 5.
Since we expect our CNN filters to learn the temporal patterns in MTS, the prevailing frequency components of the CNN filters should match those of the training MTS data. Hence, we also apply avgDFT to the CNN filters trained on Traffic with a 3-hour horizon, and plot the result alongside the avgDFT of the Traffic dataset in Figure 5. Impressively, the two curves peak at the same periods most of the time. At the 24-, 12-, 8-, and 6-hour periods, both the magnitude of the Traffic dataset and the magnitude of the CNN filters reach a peak. Moreover, in Figure 6, we show that different CNN filters behave differently. Some specialize in capturing long-term (24-hour) temporal patterns, while others are good at recognizing short-term (8-hour) temporal patterns. As a whole, this suggests that our CNN filters play the role of bases in DFT. As demonstrated in [Rippel, Snoek, and Adams2015], the “frequency domain” provides a powerful representation for CNNs to train and model on. Thus, LSTM is able to rely on the “frequency domain” information extracted by our attention to forecast the future accurately.
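The averaging of magnitude spectra described above can be sketched with NumPy's real-input FFT. This is an illustrative reading of avgDFT, not the authors' code: it takes the magnitude of the DFT of every series, averages over series, and reports the period of the strongest non-constant component; the function names `avg_dft` and `dominant_period` are hypothetical.

```python
import numpy as np

def avg_dft(X):
    """Average magnitude spectrum of a multivariate series.

    X: (T, n) array with one column per series.
    Returns an array of length T // 2 + 1, one value per frequency bin.
    """
    mags = np.abs(np.fft.rfft(X, axis=0))     # magnitude per bin and series
    return mags.mean(axis=1)                  # average over the series

def dominant_period(X, step=1.0):
    """Period (in units of `step`) of the strongest non-constant component."""
    spectrum = avg_dft(X)
    freqs = np.fft.rfftfreq(X.shape[0], d=step)
    peak = 1 + np.argmax(spectrum[1:])        # skip bin 0, which holds the mean
    return 1.0 / freqs[peak]
```

Applied to hourly data with a daily cycle, `dominant_period` would be expected to return a value close to 24; the same function can be applied to a matrix of learned filters, with one filter per column, to compare their spectrum against that of the data.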
6 Conclusions
In this paper, we focus on the MTS forecasting problem and propose a novel temporal pattern attention that addresses the limitation of typical attention mechanisms on such tasks. We make the attention dimension feature-wise so that the model learns interdependencies among multiple variables not only within the same time step but also across all previous time steps and series. Our experiments strongly support this idea and show that our model achieves state-of-the-art results. In addition, the visualization of the filters supports our motivation in a way that is easier for humans to understand.
References
 [A. Krizhevsky and Hinton.2012] A. Krizhevsky, I. S., and Hinton., G. E. 2012. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 1097–1105.
 [Bahdanau, Cho, and Bengio2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. ICLR.
 [Bloomfield1976] Bloomfield, P. 1976. Fourier Analysis of Time Series: An Introduction. John Wiley.
 [Bouchachia and Bouchachia2008] Bouchachia, A., and Bouchachia, S. 2008. Ensemble learning for time series prediction. Proceedings of the 1st International Workshop on Nonlinear Dynamics and Synchronization.
 [Cao and Tay.2003] Cao, L.J., and Tay., F. E. H. 2003. Support vector machine with adaptive parameters in financial time series forecasting. IEEE Transactions on neural networks 1506–1518.
 [Chen, Wang, and Harris2008] Chen, S.; Wang, X. X.; and Harris, C. J. 2008. NARX-based nonlinear system identification using orthogonal least squares basis hunting. IEEE Transactions on Control Systems 78–84.
 [Cho et al.2014] Cho, K.; van Merriënboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734. Association for Computational Linguistics.
 [Chuan and Herremans2018] Chuan, C.H., and Herremans, D. 2018. Modeling temporal tonal relations in polyphonic music through deep networks with a novel image-based representation.
 [Dasgupta and Osogami2017] Dasgupta, S., and Osogami, T. 2017. Nonlinear dynamic Boltzmann machines for time-series prediction.
 [David E Rumelhart and Williams1986] David E Rumelhart, G. E. H., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature 533–536.
 [Elman1990] Elman, J. L. 1990. Finding structure in time. Cognitive science 179–211.
 [FrigolaAlcade.2015] FrigolaAlcade., R. 2015. Bayesian time series learning with Gaussian processes. PhD thesis, University of Cambridge.
 [Frigola and Rasmussen2014] Frigola, R., and Rasmussen, C. E. 2014. Integrated pre-processing for Bayesian nonlinear system identification with Gaussian processes. IEEE Conference on Decision and Control 552–560.
 [G. E. Box and Ljung2015] G. E. Box, G. M. Jenkins, G. C. R., and Ljung, G. M. 2015. Time series analysis: forecasting and control. John Wiley & Sons.
 [G. Zhang and Hu.1998] G. Zhang, B. E. P., and Hu., M. Y. 1998. Forecasting with artificial neural networks: The state of the art. International journal of forecasting 35–62.
 [HaoWen Dong and Yang2018] HaoWen Dong, WenYi Hsiao, L.C. Y., and Yang, Y.H. 2018. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment.
 [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
 [J. Connor and Martin.1991] J. Connor, L. E. A., and Martin., D. R. 1991. Recurrent networks and NARMA modeling. Advances in Neural Information Processing Systems 301–308.
 [Jain and Kumar.2007] Jain, A., and Kumar., A. M. 2007. Hybrid neural network models for hydrologic time series forecasting. Applied Soft Computing 7(2):585–592.
 [Kim.2003] Kim., K.J. 2003. Financial time series forecasting using support vector machines. Neurocomputing 55(1):307–319.
 [Kyunghyun Cho and Bengio.2014] Kyunghyun Cho, Bart Van Merrienboer, D. B., and Bengio., Y. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
 [Lai et al.2018] Lai, G.; Chang, W.C.; Yang, Y.; and Liu, H. 2018. Modeling long- and short-term temporal patterns with deep neural networks. SIGIR 95–104.
 [LeCun and Bengio.1995] LeCun, Y., and Bengio., Y. 1995. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks.
 [Luong, Pham, and Manning2015] Luong, T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing 1412–1421.
 [N.E. Huang and Liu1998] N.E. Huang, Z. Shen, S. L. M. W. H. S. Q. Z. N. Y. C. T., and Liu, H. 1998. The empirical mode decomposition and Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc. Roy. Soc. London A 454:903–995.
 [Nicolas BoulangerLewandowski and Vincent2012] Nicolas BoulangerLewandowski, Y. B., and Vincent, P. 2012. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription.
 [Qin et al.2017] Qin, Y.; Song, D.; Cheng, H.; Cheng, W.; Jiang, G.; and Cottrell, G. W. 2017. A dual-stage attention-based recurrent neural network for time series prediction. In IJCAI’17, 2627–2633.
 [Raffel2016] Raffel, C. 2016. Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching. PhD Thesis.
 [Rippel, Snoek, and Adams2015] Rippel, O.; Snoek, J.; and Adams, R. P. 2015. Spectral representations for convolutional neural networks. NIPS 2449–2457.
 [S. Roberts and Aigrain.2011] S. Roberts, M. Osborne, M. E. S. R. N. G., and Aigrain., S. 2011. Gaussian processes for time-series modelling. Phil. Trans. R. Soc. A.
 [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 3104–3112.

 [V. Vapnik1997] V. Vapnik, S. E. Golowich, A. S. e. a. 1997. Support vector method for function approximation, regression estimation, and signal processing. Advances in Neural Information Processing Systems 281–287.
 [Werbos1990] Werbos, P. J. 1990. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 1550–1560.
 [Zhang2003] Zhang, G. P. 2003. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 159–175.