1 Introduction
Time series forecasting plays an important role in daily life, helping people manage resources and make decisions. For example, in the retail industry, probabilistic forecasting of product demand and supply based on historical data can support inventory planning that maximizes profit. Although still widely used, traditional time series forecasting models, such as State Space Models (SSMs) [2] and Autoregressive (AR) models, are designed to fit each time series independently. Moreover, they require practitioners' expertise in manually selecting trend, seasonality and other components. These two major weaknesses have greatly hindered their application to modern large-scale time series forecasting tasks.
To tackle the aforementioned challenges, deep neural networks [3, 4, 5, 6] have been proposed as an alternative, in which Recurrent Neural Networks (RNNs) [7, 8, 9] model time series in an autoregressive fashion. However, RNNs are notoriously difficult to train [10] because of the vanishing and exploding gradient problem. Despite the emergence of variants such as LSTM [11] and GRU [12], the issue remains unresolved. For example, [13] shows that language models using LSTM have an effective context size of about 200 tokens on average but can only sharply distinguish the nearest 50 tokens, indicating that even LSTM struggles to capture long-term dependencies. On the other hand, real-world forecasting applications often exhibit both long- and short-term repeating patterns [7]. For example, the hourly occupancy rate of a freeway has both daily and hourly patterns. In such cases, modeling long-term dependencies becomes the critical step toward promising performance.

Recently, the Transformer [1, 14] has been proposed as a new architecture that leverages the attention mechanism to process sequences. Unlike RNN-based methods, the Transformer allows the model to access any part of the history regardless of distance, making it potentially better suited to capturing recurring patterns with long-term dependencies. However, canonical dot-product self attention matches queries against keys without sensitivity to local context, which may make the model prone to anomalies and cause underlying optimization issues. More importantly, the space complexity of the canonical Transformer grows quadratically with the input length L, creating a memory bottleneck when modeling long time series at fine granularity. We specifically delve into these two issues and investigate the application of the Transformer to time series forecasting. Our contributions are threefold:


We successfully apply the Transformer architecture to time series forecasting and perform extensive experiments on both synthetic and real datasets to validate the Transformer's potential for handling long-term dependencies better than RNN-based models.

We propose convolutional self attention, which employs causal convolutions to produce the queries and keys in the self attention layer. Query-key matching that is aware of local context, e.g. shapes, helps the model achieve lower training error and further improves its forecasting accuracy.

We propose the LogSparse Transformer, with only O(L(log₂ L)²) space complexity, to break the memory bottleneck. It not only makes fine-grained modeling of long time series feasible but also produces comparable or even better results with much less memory usage than the canonical Transformer.
2 Related Work
Due to the wide applications of forecasting, various methods have been proposed to solve the problem. One of the most prominent models is ARIMA [15]. Its statistical properties, together with the well-known Box-Jenkins methodology [16] for model selection, make it a natural first attempt for practitioners. However, its linear assumption and limited scalability make it unsuitable for large-scale forecasting tasks. Moreover, information cannot be shared across similar time series since each series is fitted individually. In contrast, [17] models related time series as a matrix and treats forecasting as a matrix factorization problem. [18] proposes hierarchical Bayesian methods to learn across multiple related count time series from a graphical-model perspective.
Deep neural networks have been proposed to capture shared information across related time series for accurate forecasting. [3] fuses traditional AR models with RNNs by modeling a probabilistic distribution in an encoder-decoder fashion. Instead, [19] uses an RNN as an encoder and Multi-layer Perceptrons (MLPs) as a decoder to avoid the so-called error accumulation issue and to conduct multi-step forecasting in parallel. [6] uses a global RNN to directly output the parameters of a linear SSM at each step for each time series, aiming to approximate nonlinear dynamics with locally linear segments. In contrast, [9] handles noise with a local Gaussian process per time series while using a global RNN to model the shared patterns.

The well-known self-attention-based Transformer [1] has recently been proposed for sequence modeling and has achieved great success. Several recent works apply it to translation, speech, music and image generation [1, 20, 21, 22]. However, scaling attention to extremely long sequences is computationally prohibitive, since the time and space complexity of self attention grows quadratically with the sequence length [20]. This becomes a serious issue when forecasting time series at fine granularity.
3 Background
Problem definition
Suppose we have a collection of N related univariate time series {z_{i,1:t0}}_{i=1}^{N}, where z_{i,t} ∈ ℝ denotes the value of time series i at time t. We are going to predict the next τ time steps for all time series, i.e. {z_{i,t0+1:t0+τ}}_{i=1}^{N}. Besides, let {x_{i,1:t0+τ}}_{i=1}^{N} be a set of associated time-based covariate vectors of dimension d that are assumed to be known over the entire time period. We aim to model the following conditional distribution:

p(z_{i,t0+1:t0+τ} | z_{i,1:t0}, x_{i,1:t0+τ}; Φ) = ∏_{t=t0+1}^{t0+τ} p(z_{i,t} | z_{i,1:t−1}, x_{i,1:t}; Φ). (1)
We reduce the problem to learning a one-step-ahead prediction model p(z_t | z_{1:t−1}, x_{1:t}; Φ)¹, where Φ denotes the learnable parameters shared by all time series in the collection. To fully utilize both the observations and covariates, we concatenate them to obtain an augmented matrix as follows:

y_t ≜ [z_{t−1} ∘ x_t] ∈ ℝ^{d+1},  Y_t = [y_1, …, y_t]ᵀ ∈ ℝ^{t×(d+1)},

where ∘ represents concatenation. Note that we set z_0 to be zero in all cases. An appropriate model is then explored to predict the distribution of z_t given Y_t.

¹ Since the model is applicable to all time series, we omit the subscript i for simplicity and clarity.
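As a minimal sketch of this construction (NumPy-based; the function name and shapes are our own, not the paper's code), each row pairs the lagged observation with the current covariates:

```python
import numpy as np

def build_augmented_inputs(z, x):
    """Form y_t = [z_{t-1}; x_t] for every t and stack them into Y_t.

    z: (T,) observed series; x: (T, d) covariates known for all times.
    As in the text, the (non-existent) observation z_0 is set to zero.
    """
    z_lagged = np.concatenate(([0.0], z[:-1]))  # shift by one, z_0 := 0
    return np.column_stack([z_lagged, x])       # shape (T, d + 1)

z = np.array([3.0, 5.0, 4.0])
x = np.arange(6, dtype=float).reshape(3, 2)
Y = build_augmented_inputs(z, x)
# Y[0] = [0., 0., 1.]  (z_0 = 0 followed by x_1)
```

The model then predicts the distribution of z_t from the first t rows of Y.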
Transformer
We instantiate the model with the Transformer² by taking advantage of the multi-head self attention mechanism, since self attention enables the Transformer to capture both long- and short-term dependencies, and different attention heads learn to focus on different aspects of temporal patterns. These advantages make the Transformer a good candidate for time series forecasting. We briefly introduce its architecture here and refer readers to [1] for more details.

² By referring to the Transformer, we only consider the autoregressive Transformer decoder in the following.
In the self attention layer, a multi-head self attention sublayer simultaneously transforms Y³ into H distinct query matrices Q_h = Y W_h^Q, key matrices K_h = Y W_h^K and value matrices V_h = Y W_h^V, h = 1, …, H. Here W_h^Q, W_h^K ∈ ℝ^{(d+1)×d_k} and W_h^V ∈ ℝ^{(d+1)×d_v} are learnable parameters. After these linear projections, the scaled dot-product attention computes a sequence of vector outputs:

O_h = Attention(Q_h, K_h, V_h) = softmax( (Q_h K_hᵀ) / √d_k · M ) V_h. (2)

Note that a mask matrix M is applied to filter out rightward attention by setting all upper triangular elements to −∞, in order to avoid future information leakage. Afterwards, O_1, …, O_H are concatenated and linearly projected again. On top of the attention output, a position-wise feedforward sublayer with two fully connected layers and a ReLU activation in between is stacked.

³ The same model is applied at each time step, so we simplify the formulation with some abuse of notation.
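A minimal single-head NumPy sketch of the masked scaled dot-product attention described above (parameter names are illustrative, not the paper's code):

```python
import numpy as np

def causal_self_attention(Y, Wq, Wk, Wv):
    """Single-head scaled dot-product attention with a causal mask.

    Y: (L, d_in) inputs; Wq, Wk: (d_in, d_k); Wv: (d_in, d_v).
    """
    Q, K, V = Y @ Wq, Y @ Wk, Y @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (L, L) similarity
    L = scores.shape[0]
    mask = np.triu(np.ones((L, L), dtype=bool), k=1)  # strictly upper part
    scores[mask] = -np.inf                            # block future leakage
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With the mask in place, the first position can only attend to itself, so its output is exactly its own value vector.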
4 Methodology
4.1 Enhancing the locality of Transformer
Patterns in time series may evolve significantly over time due to various events, e.g. holidays and extreme weather, so whether an observed point is an anomaly, a change point or part of a pattern depends heavily on its surrounding context. However, in the self attention layers of the canonical Transformer, the similarities between queries and keys are computed from their pointwise values without fully leveraging local context such as shape, as shown in Figure 1(a) and (b). Query-key matching agnostic of local context may confuse the self attention module about whether an observed value is an anomaly, a change point or part of a pattern, and can cause underlying optimization issues.
We propose convolutional self attention to ease this issue. The architecture of the proposed convolutional self attention is illustrated in Figure 1(c) and (d). Rather than using a convolution of kernel size 1 with stride 1 (equivalent to matrix multiplication), we employ a causal convolution of kernel size k with stride 1 to transform the inputs (with proper paddings) into queries and keys. Causal convolutions ensure that the current position never has access to future information. The generated queries and keys are thus aware of local context and compute their similarities from local context information, e.g. local shapes, rather than pointwise values, which helps accurate forecasting. Note that when k = 1, convolutional self attention degrades to canonical self attention, so it can be seen as a generalization.
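A rough single-channel sketch of the causal transform for queries/keys (left padding only, so position t never sees the future; function and variable names are our own):

```python
import numpy as np

def causal_conv_queries(Y, kernel):
    """Causal 1-D transform producing context-aware queries or keys.

    Y: (L,) one input channel; kernel: (k,) filter weights.
    Left-padding with k - 1 zeros makes each output a weighted sum of
    a window ending at the current position: no future information.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), Y])
    # output[t] = sum_j kernel[j] * padded[t + j], i.e. a window ending at t
    return np.array([padded[t:t + k] @ kernel for t in range(len(Y))])
```

With a length-1 kernel this reduces to a pointwise transform, matching the remark that canonical self attention is the special case k = 1.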
4.2 Breaking the memory bottleneck of Transformer
To motivate our approach, we first perform a qualitative assessment of the attention patterns learned by a canonical Transformer on the traffic-f dataset, which contains occupancy rates of 963 car lanes in the San Francisco Bay Area recorded every 20 minutes [6]. We trained a 10-layer canonical Transformer on traffic-f with full attention and visualized the learned attention patterns. One example is shown in Figure 2. Layer 2 clearly exhibited global patterns, while layers 6 and 10 exhibited only pattern-dependent sparsity, suggesting that some form of sparsity could be introduced without significantly affecting performance. More importantly, for a sequence of length L, computing attention scores between every pair of cells causes O(L²) memory usage, making it prohibitive to model long time series at fine granularity.
We propose the LogSparse Transformer, which only needs to compute O(log L) dot products for each cell in each layer. Further, we only need to stack O(log L) layers for the model to access every cell's information, so the total memory cost is only O(L(log₂ L)²). We define I_t^l as the set of indices of the cells that cell t can attend to during the computation from layer l to layer l + 1. In the standard self attention of the Transformer, I_t^l = {s : s ≤ t}, allowing every cell to attend to all past cells and itself, as shown in Figure 3(a). However, such an algorithm suffers from quadratic space-complexity growth with the input length. To alleviate this issue, we propose to select a subset of indices I_t^l so that |I_t^l| does not grow too fast; an effective choice is to keep |I_t^l| = O(log L).
Notice that the output at cell t is a weighted combination of the cells indexed by I_t^l in the l-th self attention layer, and it passes the information of those cells onward in the next layer. Let S_t^l be the set containing the indices of all cells whose information has reached cell t up to the l-th layer. To ensure that every cell receives the information from all its previous cells and itself, the number n of stacked layers should satisfy S_t^n = {1, 2, …, t} for t = 1, …, L. That is, for every pair j ≤ t, there is a directed path with n edges j = p_1 → p_2 → ⋯ → p_{n+1} = t, where p_1 ∈ I_{p_2}^1, p_2 ∈ I_{p_3}^2, …, p_n ∈ I_{p_{n+1}}^n.
We propose LogSparse self attention by allowing each cell to attend only to itself and to its previous cells with exponential step sizes. That is, for every layer l and every cell t, I_t^l = {t − 2^⌊log₂ t⌋, …, t − 2², t − 2¹, t − 2⁰, t}, where ⌊·⌋ denotes the floor operation, as shown in Figure 3(b).⁴

⁴ Applying other bases is straightforward, so we do not discuss them here for simplicity and clarity.
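The LogSparse index set can be sketched as follows (a minimal sketch; the boundary handling near the start of the sequence, i.e. stopping once an offset would leave the sequence, is our assumption):

```python
def logsparse_indices(t):
    """Index set I_t for LogSparse self attention (1-indexed cells).

    Cell t attends to itself and to cells at exponentially spaced
    offsets t - 2^e, yielding |I_t| = O(log t) indices.
    """
    idx = {t}
    e = 0
    while t - 2 ** e >= 1:   # stop once the offset leaves the sequence
        idx.add(t - 2 ** e)
        e += 1
    return idx
```

For example, `logsparse_indices(9)` yields only 5 of the 9 possible indices, and a cell at position 1024 keeps roughly a dozen rather than a thousand.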
Theorem 1.
For every k and l with 1 ≤ k ≤ l ≤ L, there is at least one path from cell k to cell l if we stack ⌊log₂(l − k)⌋ + 1 layers. Moreover, for k < l, the number of feasible unique paths from cell k to cell l increases at a rate of O(⌊log₂(l − k)⌋!).
The proof, deferred to Appendix A.1, uses a constructive argument.
Theorem 1 implies that despite an exponential decrease in memory usage in each layer (from O(L) to O(log L) per cell), the information can still flow from any cell to any other cell provided that we go slightly "deeper": take the number of layers to be O(log L). Note that this implies an overall memory usage of O(L(log₂ L)²) and addresses the notorious scalability bottleneck of the Transformer under GPU memory constraints [1]. Moreover, as two cells move further apart, the number of paths between them increases at a rate that is super-exponential in log₂(l − k), indicating a rich information flow for modeling delicate long-term dependencies.
Local Attention
We can allow each cell to densely attend to the cells in its left window of size w so that more local information, e.g. trend, can be leveraged for the current forecasting step. Beyond the neighboring cells, we resume the LogSparse attention strategy, as shown in Figure 3(c).
Restart Attention
Further, one can divide the whole input of length L into subsequences and apply the LogSparse attention strategy within each subsequence, restarting the pattern at each subsequence boundary. One example is shown in Figure 3(d).
Employing local attention and restart attention does not change the complexity of our sparse attention strategy, but it creates more paths and decreases the required number of edges per path. Note that local attention and restart attention can be combined.
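The two variants can be sketched as index-set constructions as well (illustrative sketches; the exact boundary conventions, e.g. where the exponential steps resume past the window and how subsequence blocks are aligned, are our assumptions):

```python
def local_logsparse_indices(t, window):
    """Local attention: attend densely to the last `window` cells
    (including t itself), then resume exponential offsets beyond it."""
    idx = set(range(max(1, t - window + 1), t + 1))
    e = 0
    while t - window - 2 ** e + 1 >= 1:   # exponential steps past the window
        idx.add(t - window - 2 ** e + 1)
        e += 1
    return idx

def restart_logsparse_indices(t, sub_len):
    """Restart attention: LogSparse pattern restricted to the
    subsequence of length `sub_len` that contains cell t."""
    start = ((t - 1) // sub_len) * sub_len + 1   # first cell of t's block
    idx = {t}
    e = 0
    while t - 2 ** e >= start:
        idx.add(t - 2 ** e)
        e += 1
    return idx
```

Both keep the per-cell index count logarithmic, consistent with the claim that the overall complexity is unchanged.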
5 Experiments
5.1 Synthetic datasets
To demonstrate the Transformer's capability to capture long-term dependencies, we conduct experiments on synthetic data. Specifically, we generate piecewise sinusoidal signals whose amplitudes are randomly drawn from a uniform distribution; the final segment can only be forecast correctly by recalling the amplitude observed at the beginning of the series. Following the forecasting setting in Section 3, we aim to predict the last segment given the previous t0 data points. Intuitively, a larger t0 makes forecasting more difficult, since the model must understand and remember the relation between the beginning and the end of the series across t0 steps of irrelevant signals. Hence, we create 8 different datasets by varying the value of t0. For each dataset, we generate 4.5K, 0.5K and 1K time series instances for the training, validation and test sets, respectively. An example time series is shown in Figure 4(a).

In this experiment, we use a 3-layer canonical Transformer with standard self attention. For comparison, we employ DeepAR [3], an autoregressive model based on a 3-layer LSTM, as our baseline. To examine whether larger capacity could improve the performance of DeepAR, we also gradually increased its hidden size; no further improvement was observed. Following [3, 6], we evaluate both methods using the ρ-quantile loss

R_ρ(Z, Ẑ) = ( 2 Σ_{i,t} (z_{i,t} − ẑ^ρ_{i,t}) ( ρ I_{z_{i,t} > ẑ^ρ_{i,t}} − (1 − ρ) I_{z_{i,t} ≤ ẑ^ρ_{i,t}} ) ) / ( Σ_{i,t} |z_{i,t}| ),

with ρ ∈ {0.5, 0.9} (reported as R₀.₅/R₀.₉), where ẑ^ρ_{i,t} is the empirical ρ-quantile of the predictive distribution and I is an indicator function.
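The ρ-quantile loss described above can be sketched in a few lines (a sketch following the standard DeepAR-style normalized pinball loss; `z_hat` holds the predicted empirical ρ-quantiles):

```python
import numpy as np

def quantile_loss(z, z_hat, rho):
    """Normalized rho-quantile loss R_rho.

    z: targets; z_hat: predicted rho-quantiles; rho in (0, 1).
    Penalizes under-prediction by rho and over-prediction by 1 - rho,
    then normalizes by the total absolute target mass.
    """
    z, z_hat = np.asarray(z, float), np.asarray(z_hat, float)
    diff = z - z_hat
    # rho * (z - z_hat) if z > z_hat, else (1 - rho) * (z_hat - z)
    pinball = np.where(diff > 0, rho * diff, (rho - 1.0) * diff)
    return 2.0 * pinball.sum() / np.abs(z).sum()
```

Note the asymmetry: for ρ = 0.9, under-predicting the target is penalized far more than over-predicting it, which is what makes the metric suitable for evaluating upper quantiles.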
Figure 4(b) presents the performance of DeepAR and the Transformer on the synthetic datasets. When t0 is small, both of them perform very well. But as t0 increases, especially for the largest values, the performance of DeepAR drops significantly while the Transformer keeps its accuracy, suggesting that the Transformer can capture fairly long-term dependencies where LSTM fails to do so.
5.2 Realworld datasets
We further evaluate our model on several real-world datasets. The electricity-f (fine) dataset consists of the electricity consumption of 370 customers recorded every 15 minutes, and the electricity-c (coarse) dataset aggregates electricity-f over every 4 points, producing hourly electricity consumption. Similarly, the traffic-f (fine) dataset contains the occupancy rates of 963 freeway lanes in San Francisco recorded every 20 minutes, and traffic-c (coarse) contains hourly occupancy rates obtained by averaging every 3 points of traffic-f. The solar dataset⁵ contains solar power production records from January to August 2006, sampled every hour from 137 PV plants in Alabama. The wind⁶ dataset contains daily estimates of 28 countries' energy potential from 1986 to 2015, expressed as a percentage of a power plant's maximum output. M4-Hourly contains 414 hourly time series from the M4 competition [23].

⁵ https://www.nrel.gov/grid/solar-power-data.html
⁶ https://www.kaggle.com/sohier/30-years-of-european-wind-generation

Long-term and short-term forecasting
We first show the effectiveness of the canonical Transformer equipped with convolutional self attention for long- and short-term forecasting on electricity-c and traffic-c. Both datasets exhibit hourly and daily seasonal patterns, but traffic-c shows a much greater difference between weekday and weekend patterns than electricity-c. Hence, accurate forecasting on traffic-c requires the model to capture both long- and short-term dependencies very well. As baselines, we use the classical forecasting methods auto.arima and ets implemented in R's forecast package, the recent matrix factorization method TRMF [17], the RNN-based autoregressive model DeepAR [3] and the RNN-based state space model DeepState [6]. For short-term forecasting, we evaluate rolling-day forecasts for seven days after training, following [7]. For long-term forecasting, we directly forecast 7 days ahead. As shown in Table 1, our models with convolutional self attention achieve better results in both long-term and short-term forecasting compared to strong baselines, especially on the traffic-c dataset, partly due to the Transformer's long-term dependency modeling ability demonstrated on our synthetic data.
Table 1: R₀.₅/R₀.₉-loss on electricity-c (e-c) and traffic-c (t-c) for short-term (rolling-day) and long-term (7-day-ahead) forecasting.

             ARIMA        ETS          TRMF     DeepAR       DeepState    Ours
e-c (short)  0.154/0.102  0.101/0.077  0.084/-  0.075/0.040  0.083/0.056  0.059/0.028
e-c (long)   0.283/0.109  0.121/0.101  0.087/-  0.082/0.053  0.085/0.052  0.080/0.039
t-c (short)  0.223/0.137  0.236/0.148  0.186/-  0.161/0.099  0.167/0.113  0.116/0.080
t-c (long)   0.492/0.280  0.509/0.529  0.202/-  0.179/0.105  0.168/0.114  0.144/0.099
Convolutional self attention
In this experiment, we conduct an ablation study of the proposed convolutional self attention. We explore different kernel sizes k on the full attention model while fixing all other settings, and again use rolling-day prediction for seven days on the electricity-c and traffic-c datasets. The results for different kernel sizes are shown in Table 2. With larger k, the model is aware of more local context information, which helps it forecast more accurately and eases training. Further, we plot the training losses for different kernel sizes on electricity-c and traffic-c. We find that, in addition to obtaining better results, the Transformer with convolutional self attention also converges faster and to a lower error, as shown in Figure 5, confirming that awareness of local context can ease the training process.

Table 2: R₀.₅/R₀.₉-loss for different kernel sizes k (increasing from left to right).

electricity-c  0.064/0.030  0.058/0.029  0.059/0.030  0.057/0.032  0.059/0.028
traffic-c      0.124/0.086  0.121/0.086  0.121/0.086  0.121/0.083  0.116/0.080
Sparse attention
Further, we compare the proposed LogSparse Transformer to its full attention counterpart on the fine-grained datasets electricity-f and traffic-f. Note that the time series in these two datasets have much longer periods and are noisier than those in electricity-c and traffic-c. We first compare the models under the same memory budget: for each dataset, we choose a subsequence length and local attention length for our sparse attention model, and an input length for the full attention counterpart, such that both consume the same amount of memory. The calculation of the memory usage and other details can be found in Appendix A.4. We run the sparse and full attention models with and without convolutional self attention on both datasets and summarize the results in Table 3 (upper part). Whether or not they are equipped with convolutional self attention, our sparse attention models achieve competitive results on electricity-f and much better results on traffic-f than their full attention counterparts. The performance gain on traffic-f is likely due to that dataset's stronger long-term dependencies and our sparse model's better ability to capture them, which, under the same memory budget, the full attention model cannot match. In addition, both sparse and full attention models benefit from convolutional self attention, confirming its effectiveness.
Table 3: R₀.₅/R₀.₉-loss of sparse and full attention models, with and without convolutional self attention, under the same memory budget (upper part) and the same input length (lower part).

Constraint  Dataset        Full         Sparse       Full + Conv  Sparse + Conv
Memory      electricity-f  0.107/0.047  0.105/0.095  0.083/0.046  0.083/0.053
            traffic-f      0.317/0.306  0.196/0.105  0.205/0.107  0.134/0.103
Length      electricity-f  0.091/0.047  0.105/0.095  0.079/0.044  0.083/0.053
            traffic-f      0.155/0.109  0.196/0.105  0.155/0.114  0.134/0.103
To explore how well our sparse attention model performs compared to the full attention model with the same input length, we use equal input lengths on electricity-f and traffic-f; note that our sparse attention models then use much less memory. The results of this comparison are summarized in Table 3 (lower part). As expected, the canonical full attention Transformer outperforms its sparse attention counterpart when neither is equipped with convolutional self attention. However, our sparse Transformer with convolutional self attention obtains comparable results on electricity-f and, more interestingly, even better results on traffic-f, with much less memory usage. We argue that in some cases sparse attention may help the model learn the most useful patterns in noisy data while ignoring less relevant signals, and hence generalize better on test data than its full attention counterpart.
Further Exploration
In our last experiment, we evaluate how our methods perform on datasets with various granularities compared to our baselines. All datasets except M4-Hourly are evaluated by rolling the window 7 times, since the test set of M4-Hourly is already provided. The results, shown in Table 4, further indicate that our method achieves the best performance overall.
Table 4: R₀.₅/R₀.₉-loss on datasets with various granularities.

        electricity-f  traffic-f    solar        M4-Hourly    wind
TRMF    0.094/-        0.213/-      0.241/-      -/-          0.311/-
DeepAR  0.082/0.063    0.230/0.150  0.222/0.093  0.090/0.030  0.286/0.116
Ours    0.079/0.044    0.140/0.103  0.219/0.085  0.054/0.022  0.288/0.113
6 Conclusion
In this paper, we propose to apply the Transformer to time series forecasting. Our experiments on both synthetic and real data suggest that the Transformer can capture long-term dependencies where LSTM may fail. We also showed, on real-world datasets, that the proposed convolutional self attention further improves the Transformer's performance and achieves state-of-the-art results in different settings in comparison with recent RNN-based methods, a matrix factorization method, as well as classic statistical approaches. In addition, with the same memory budget, our sparse attention models can achieve better results on data with long-term dependencies. Exploring better sparsity patterns in self attention and extending our method to fit small datasets better are our future research directions.
References
 (1) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
 (2) James Durbin and Siem Jan Koopman. Time series analysis by state space methods. Oxford university press, 2012.
 (3) Valentin Flunkert, David Salinas, and Jan Gasthaus. Deepar: Probabilistic forecasting with autoregressive recurrent networks. arXiv preprint arXiv:1704.04110, 2017.
 (4) Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 (5) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
 (6) Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems, pages 7785–7794, 2018.
 (7) Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 95–104. ACM, 2018.
 (8) Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Long-term forecasting using tensor-train RNNs. arXiv preprint arXiv:1711.00073, 2017.
 (9) Danielle C Maddix, Yuyang Wang, and Alex Smola. Deep factors with gaussian processes for forecasting. arXiv preprint arXiv:1812.00098, 2018.

 (10) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318, 2013.
 (11) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 (12) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
 (13) Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. Sharp nearby, fuzzy far away: How neural language models use context. arXiv preprint arXiv:1805.04623, 2018.
 (14) Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933, 2016.
 (15) George EP Box and Gwilym M Jenkins. Some recent advances in forecasting and control. Journal of the Royal Statistical Society. Series C (Applied Statistics), 17(2):91–109, 1968.
 (16) George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
 (17) HsiangFu Yu, Nikhil Rao, and Inderjit S Dhillon. Temporal regularized matrix factorization for highdimensional time series prediction. In Advances in neural information processing systems, pages 847–855, 2016.
 (18) Nicolas Chapados. Effective bayesian modeling of groups of related count time series. arXiv preprint arXiv:1405.3738, 2014.
 (19) Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.
 (20) ChengZhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M Dai, Matthew D Hoffman, and Douglas Eck. An improved relative selfattention mechanism for transformer with application to music generation. arXiv preprint arXiv:1809.04281, 2018.
 (21) Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, and Sanjeev Khudanpur. A timerestricted selfattention layer for asr. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5874–5878. IEEE, 2018.
 (22) Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
 (23) Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The m4 competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4):802–808, 2018.

 (24) Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5):503–519, 2017.
 (25) Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics, 36(4):193–202, 1980.
 (26) Hrushikesh N Mhaskar and Tomaso Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848, 2016.
 (27) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 (28) Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016.
 (29) Anastasia Borovykh, Sander Bohte, and Cornelis W Oosterlee. Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691, 2017.
 (30) Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, and Aaron Courville. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.
 (31) Rose Yu, Yaguang Li, Cyrus Shahabi, Ugur Demiryurek, and Yan Liu. Deep learning: A generic approach for extreme condition traffic forecasting. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 777–785. SIAM, 2017.
 (32) Guoqiang Zhang, B Eddy Patuwo, and Michael Y Hu. Forecasting with artificial neural networks:: The state of the art. International journal of forecasting, 14(1):35–62, 1998.
 (33) Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
 (34) Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018.
 (35) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. OpenAI technical report, 2018.
 (36) Jacob Devlin, MingWei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 (37) Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
 (38) Nikolay Laptev, Jason Yosinski, Li Erran Li, and Slawek Smyl. Timeseries extreme event forecasting with neural networks at uber. In International Conference on Machine Learning, number 34, pages 1–5, 2017.
Appendix A Supplementary Materials
A.1 Proof of Theorem 1
Proof.
According to the attention strategy of the LogSparse Transformer, in each layer cell t can attend to the cells with indices in I_t = {t − 2^⌊log₂ t⌋, …, t − 2¹, t − 2⁰, t}. To ensure that every cell receives the information from all its previous cells and itself, the number of stacked layers n should satisfy S_t^n = {1, …, t} for t = 1, …, L; that is, for every k ≤ l there is a directed path with n edges from cell k to cell l. We prove the theorem by constructing such a path with length (number of edges) no larger than ⌊log₂(l − k)⌋ + 1. The case l = k is trivial, so we only consider k < l. Consider the binary representation of l − k:

l − k = Σ_{m=0}^{⌊log₂(l − k)⌋} b_m 2^m,  b_m ∈ {0, 1}. (3)

Let (m_1, …, m_s) be the subsequence of exponents with b_m = 1. A feasible path from cell k to cell l is

k → k + 2^{m_1} → k + 2^{m_1} + 2^{m_2} → ⋯ → l,

since each hop of size 2^m is an allowed LogSparse attention edge. The length of this path is s, which is no larger than ⌊log₂(l − k)⌋ + 1, so stacking ⌊log₂(l − k)⌋ + 1 layers suffices.

Furthermore, by reordering (m_1, …, m_s) we can generate multiple different paths from cell k to cell l. The number of feasible paths therefore increases at a rate of O(s!) = O(⌊log₂(l − k)⌋!) along with l − k. ∎
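The constructive argument can be sketched in Python (an illustrative sketch of the proof, not the paper's code): hop from cell k toward cell l by the set bits of l − k, so the number of hops equals the popcount of l − k, which is at most ⌊log₂(l − k)⌋ + 1.

```python
def logsparse_path(k, l):
    """Construct the path from cell k to cell l used in the proof:
    hop by the powers of two in the binary representation of l - k.
    Each hop of size 2^m is a legal LogSparse attention edge."""
    path, pos, m = [k], k, 0
    gap = l - k
    while gap:
        if gap & 1:            # bit m is set in l - k
            pos += 1 << m
            path.append(pos)
        gap >>= 1
        m += 1
    return path
```

For example, from cell 3 to cell 14 the gap is 11 = 2⁰ + 2¹ + 2³, giving the 3-edge path 3 → 4 → 6 → 14, within the ⌊log₂ 11⌋ + 1 = 4 layer bound.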
A.2 Training
Table 5: Dataset statistics: length T of available observations, number M of time series, and sampling interval S.

    electricity-c  electricity-f  traffic-c  traffic-f  wind    solar  M4-Hourly
T   32304          129216         4049       12435      10957   5832   748/1008
M   370            370            963        963        28      137    414
S   1 hour         15 mins        1 hour     20 mins    1 day   1 hour 1 hour
Similar to [3], our network directly predicts the parameters of the probability distribution for the next time point. In our experiments, we use the Gaussian likelihood since our training datasets contain real-valued data. Note that one can also use other likelihood models, e.g. the negative-binomial likelihood for positive count data.
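A minimal sketch of this output layer, assuming a scalar hidden state and illustrative weight names: the model emits a mean and a positivity-constrained scale, and training minimizes the Gaussian negative log-likelihood of the observed value.

```python
import math

def gaussian_params(h, w_mu=1.0, b_mu=0.0, w_sigma=1.0, b_sigma=0.0):
    """Map a hidden output h to (mu, sigma); softplus keeps sigma > 0.
    The weight/bias names here are illustrative, not the paper's."""
    mu = w_mu * h + b_mu
    sigma = math.log1p(math.exp(w_sigma * h + b_sigma))  # softplus
    return mu, sigma

def gaussian_nll(z, mu, sigma):
    """Negative log-likelihood of an observation z under N(mu, sigma^2);
    summing this over a training window gives the loss to minimize."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (z - mu) ** 2 / (2 * sigma ** 2)
```

Swapping in a negative-binomial likelihood for count data only changes these two functions; the rest of the training loop is unaffected.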
To learn the model, we are given a time series dataset and its associated covariates, where T is the length of all available observations and M is the number of different time series. The dataset statistics are shown in Table 5. Following [3], we create training instances by selecting windows with a fixed history length and forecasting horizon while varying the forecast start point across each of the original long time series. Note that while selecting training windows, data in the test set is never accessed. As a result, we obtain a training dataset of sliding windows. During training, we use the Adam optimizer [27] with early stopping to maximize the log-likelihood of each training instance.
For electricity-c and traffic-c, we take K training windows, while for electricity-f and traffic-f we select K and K training windows, respectively. For wind, M4-Hourly and solar, we choose K, K and K training windows, respectively. We sample and scale training windows following [3]. We use date-time information, e.g. month, weekday and hour, as our time-based covariate vectors.
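The window-selection procedure above can be sketched as follows, with illustrative parameter names; windows are cut only from the training range, so the held-out test period is never touched.

```python
def training_windows(series, history_len, horizon, stride=1):
    """Enumerate (past, future) training instances from one long series by
    sliding the forecast start point; `series` must already exclude the
    test period, which is how test data stays unseen during training."""
    windows = []
    total = history_len + horizon
    for start in range(0, len(series) - total + 1, stride):
        past = series[start:start + history_len]          # conditioning range
        future = series[start + history_len:start + total]  # prediction range
        windows.append((past, future))
    return windows
```

For example, a series of length 10 with history length 4 and horizon 2 yields 5 windows at stride 1; in practice the windows would then be sampled and scaled as in [3].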
For our Transformer models, we do not tune hyperparameters heavily. All of them use heads, . For the other hyperparameters (e.g. learning rate and number of layers), a grid search is used to find the best values. To do so, the data before the forecast start time is used as the training set and split into two partitions. For each hyperparameter candidate, we fit our model on the first partition, containing 90% of the data, and pick the candidate with the minimal negative log-likelihood on the remaining 10%. All models are trained on GTX 1080 Ti GPUs.
A.3 Evaluation
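The standard quantile loss used for evaluation in this section can be sketched as below. This follows one common normalized definition (the rho-risk); in practice the quantile forecast at each time point would be taken as the empirical quantile of the drawn forecast samples.

```python
def quantile_loss(actuals, quantile_preds, q):
    """Normalized q-quantile loss (rho-risk), a standard probabilistic
    forecasting metric: 2 * sum_t P_q(z_t, zhat_t) / sum_t |z_t|,
    where P_q penalizes under- and over-prediction asymmetrically."""
    num, den = 0.0, 0.0
    for z, zhat in zip(actuals, quantile_preds):
        diff = z - zhat
        # q * diff if the forecast is below the actual, (1-q) * |diff| otherwise.
        num += q * diff if diff > 0 else (q - 1) * diff
        den += abs(z)
    return 2 * num / den
```

A perfect quantile forecast gives a loss of 0; over- and under-shooting are weighted by 1 - q and q, respectively.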
We draw 200 samples to evaluate our method under the standard quantile loss. For electricity-c, electricity-f, traffic-c, traffic-f and solar, we leave the last 7 days as test sets. For wind, the last 210 days are left as the test set. For M4-Hourly, its training and test sets are already provided. Once the best set of hyperparameters is found by the aforementioned method, the evaluation metrics ( and ) are applied on the test sets.
A.4 Calculation of memory cost
For the electricity-f dataset, we choose with subsequence length and local attention length in each subsequence for our sparse attention model, and in its full attention counterpart. We stack the same number of layers in both the sparse attention and full attention models; hence their total memory usage is the same if they use the same memory in every layer. In sparse attention with a local window, every cell attends to cells in each subsequence. Since we have subsequences in total, the per-layer memory usage of sparse attention is . Under this setting, the memory usage of the sparse attention model is comparable to that of the full attention model. For the traffic-f dataset, one can follow the same procedure to check the memory usage.
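The per-layer memory comparison can be sketched by counting attended cells. The exact subsequence and window sizes were elided above, so the sketch below compares masked full attention against the plain LogSparse pattern (each cell attending to itself and to cells at power-of-two offsets), which already shows the O(L^2) vs. O(L log L) gap.

```python
def full_attention_cells(L):
    """Masked full self-attention: cell l attends to l + 1 cells
    (itself plus all earlier cells), so the total is L(L+1)/2."""
    return sum(l + 1 for l in range(L))

def logsparse_attention_cells(L):
    """LogSparse attention: cell l attends to itself and cells l - 2^k,
    so each cell touches O(log L) cells and the total is O(L log L)."""
    total = 0
    for l in range(L):
        seen, k = {l}, 0
        while l - 2 ** k >= 0:
            seen.add(l - 2 ** k)
            k += 1
        total += len(seen)
    return total
```

For a long fine-grained series the gap is large, e.g. at L = 1024 the sparse count is well under a tenth of the full-attention count, which is what frees memory for extra layers or heads at equal budget.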
A.5 Visualization of the attention matrix
Here we show an example of the attention patterns learned in the masked attention matrix of one head in the canonical Transformer's last layer on the traffic-c dataset. Figure 6 (a) shows a time series window containing 8 days of traffic-c. The series clearly exhibits both hourly and daily patterns. From the corresponding masked attention matrix, shown in Figure 6 (b), we can see that points on weekdays heavily attend to previous cells (including themselves) at the same time of day on weekdays, while points on weekends tend to attend only to previous cells (including themselves) at the same time of day on weekends. Hence, the model automatically learns both hourly and daily seasonality, which is key to accurate forecasting.