1 Introduction
Time series forecasting is often the key to effective decision making. For example, estimating the demand for taxis over time can help drivers plan ahead and decrease the wait time for passengers (Laptev et al., 2017). For such tasks, the statistical community has developed many well-known forecasting models, such as State Space Models (SSMs) (Snyder, 2008) and Autoregressive (AR) models. However, most of them are designed to fit each time series independently or can only handle a small group of time series instances, and they often require extensive manual feature engineering (Harvey, 1991). This makes accurate forecasting particularly challenging when there is a large number of time series, each with only a limited number of time steps.

To address this issue, Graves (2013), Flunkert et al. (2017), and Rangapuram et al. (2018) proposed to model time series with Recurrent Neural Networks (RNNs), such as Long Short-Term Memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997). One notable drawback is that RNNs force past elements to be stored in a memory of fixed size, so they can struggle to capture long-term dependencies (Khandelwal et al., 2018). Vaswani et al. (2017) proposed the transformer model, which leverages the attention mechanism so that it can access any part of the history regardless of distance. Transformers have achieved superior performance in natural language processing (Vaswani et al., 2017) and, more recently, in time series forecasting (Li et al., 2019).

Despite their outstanding capabilities, transformers often need many training samples due to their large number of trainable parameters. Nevertheless, performance under limited training samples is still critical. One example is predicting the demand for each product in a warehouse (Maddix et al., 2018). Even with daily records, several years of data amount to only a few hundred time steps, which is too few for a model that needs large amounts of data to perform well.

While some research has focused on few-shot learning for time series forecasting (Maddix et al., 2018), past work on transformers has mainly focused on transfer learning in natural language processing through pretrained input representations (Devlin et al., 2018; Peters et al., 2018; Houlsby et al., 2019). However, in the domain of temporal point forecasting, it is often more challenging to find a large corpus of similar time series processes, since the inputs are raw numbers instead of words. Separately, several training techniques have been proposed to make the training tasks more complex. BERT (Devlin et al., 2018) randomly masks some of the input word tokens and predicts them in parallel. XLNet (Yang et al., 2019) predicts all tokens, but in random order. RoBERTa (Liu et al., 2019) introduces dynamic masking so that different tokens are masked across training epochs.
In this paper, we propose a novel training technique for transformers with applications in time series forecasting. We refer to transformers trained with this technique as augmented transformers. By randomly sampling training windows, our contribution is twofold:

we propose a training technique that expands the number of distinct training tasks from linearly many to combinatorially many;

by breaking the temporal order in training windows, augmented transformers can better capture dependencies among time steps.
2 Methodology
2.1 Problem Statement
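Given $N$ time series $\{z_{i,1:t_0}\}_{i=1}^{N}$, where $z_{i,1:t_0} = [z_{i,1}, z_{i,2}, \ldots, z_{i,t_0}]$ and $z_{i,t} \in \mathbb{R}$, and a set of associated covariate vectors $\{x_{i,1:t_0+\tau}\}_{i=1}^{N}$, where $x_{i,t} \in \mathbb{R}^{d}$, the goal is to predict $\tau$ steps into the future, i.e. $z_{i,t_0+1:t_0+\tau}$. We denote $z_{i,1:t_0}$ as the observed history and $z_{i,t_0+1:t_0+\tau}$ as the forecasting horizon. Formally, we want to model the joint conditional distribution $p(z_{i,t_0+1:t_0+\tau} \mid z_{i,1:t_0}, x_{i,1:t_0+\tau})$. We use the autoregressive transformer decoder model from (Vaswani et al., 2017), decomposing the joint distribution into the product of one-step-ahead distributions $\prod_{t=t_0+1}^{t_0+\tau} p(z_{i,t} \mid z_{i,1:t-1}, x_{i,1:t})$. The input at each time step is $y_{i,t} = [z_{i,t-1}, x_{i,t}, s_i]$, where $s_i$ is a categorical feature learned from a one-hot embedding of each time series instance $i$. Details of the transformer model can be found in Appendix 5.1.
2.2 Random Data Sampling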
Traditional data preprocessing for time series forecasting truncates each time series into rolling windows. Two reasons are that RNNs often suffer from exploding or vanishing gradients (Khandelwal et al., 2018) and that transformers need a considerable amount of GPU memory for full attention. Since the loss is accumulated over the forecast horizon, the forecast horizons of different training windows should not overlap. Each window can be regarded as one training task, where the goal is to predict the target values in the forecast horizon given the observed history immediately before it.
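The rolling-window truncation described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `rolling_windows`, its parameters, and the choice of a stride equal to the horizon length (so that forecast horizons never overlap while histories may) are assumptions.

```python
def rolling_windows(series, history_len, horizon_len):
    """Cut one series into consecutive (history, horizon) training windows.

    Windows advance by `horizon_len` (an assumed stride), so the forecast
    horizons of different windows never overlap, while histories may.
    """
    window_len = history_len + horizon_len
    windows = []
    start = 0
    while start + window_len <= len(series):
        history = series[start:start + history_len]
        horizon = series[start + history_len:start + window_len]
        windows.append((history, horizon))
        start += horizon_len
    return windows


# Toy series of 10 time steps, 4 history steps and 2 forecast steps per window.
series = list(range(10))
windows = rolling_windows(series, history_len=4, horizon_len=2)
for history, horizon in windows:
    print(history, "->", horizon)
```

With this scheme the number of training tasks grows only linearly with the length of the series.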
Transformers are not aware of relative temporal information, as they rely solely on the attention mechanism (Lee et al., 2019). Instead, they resort to either positional sinusoids or learned position embeddings that are added to the per-position input representations (Shaw et al., 2018). In contrast, RNNs model relative positions by taking inputs recursively, and convolutional neural networks (CNNs) apply kernels based on the relative positions of the covered elements (Van Den Oord et al., 2016). We propose to take full advantage of this unique feature of transformers during model training through effective random data sampling.

As transformers are not aware of the relative positions within the time series, they can handle disrupted temporal coherence in the observed history and are still able to make reasonable predictions. For example, a transformer should yield the same output for an observed history in its original order as for any random permutation of the same time steps.
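The order-insensitivity claimed above can be checked on a toy example. The sketch below is a simplification, not the model used in the paper: it assumes a single attention head, no learned projections, no positional encoding, and a single query attending over the whole history. The helper name `attend` and the random toy vectors are hypothetical.

```python
import math
import random


def attend(query, keys, values):
    """Scaled dot-product attention of one query over a set of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


rng = random.Random(0)
# Six history elements with four features each (stand-ins for value + covariates).
history = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
query = [rng.gauss(0, 1) for _ in range(4)]

shuffled = history[:]
rng.shuffle(shuffled)

out_original = attend(query, history, history)
out_shuffled = attend(query, shuffled, shuffled)
same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
           for a, b in zip(out_original, out_shuffled))
print("outputs match under permutation:", same)
```

Because the attention output is a weighted sum over the history, reordering the history only reorders the terms of that sum. Any temporal information therefore has to travel with each element, for example as time covariates.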
Thus, for transformers the observed history does not have to come from immediately before the forecast horizon. Instead, we can randomly sample all the time steps in each training window independently from the observed history, so that they no longer have to be consecutive in time. Compared to traditional rolling windows, this technique boosts the number of training tasks from linearly many to combinatorially many.
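A minimal sketch of this sampling scheme, with a small count comparing it against rolling windows, is given below. Several details are assumptions rather than statements from the paper: the forecast horizon is kept contiguous, history steps are drawn without replacement from everything before the forecast start, and the sampled indices are sorted. The function name `sample_training_window` is hypothetical.

```python
import math
import random


def sample_training_window(num_observed, history_len, horizon_len, rng):
    """Draw the time indices of one randomly sampled training task."""
    # Pick a forecast start that leaves room for the history and the horizon.
    forecast_start = rng.randrange(history_len, num_observed - horizon_len + 1)
    # Assumption: history steps are sampled without replacement, then sorted.
    history_idx = sorted(rng.sample(range(forecast_start), history_len))
    horizon_idx = list(range(forecast_start, forecast_start + horizon_len))
    return history_idx, horizon_idx


num_observed, history_len, horizon_len = 20, 5, 2
rng = random.Random(0)
history_idx, horizon_idx = sample_training_window(
    num_observed, history_len, horizon_len, rng)
print("history steps:", history_idx, "horizon steps:", horizon_idx)

# Rolling windows with stride = horizon_len: linear in the series length.
num_rolling = (num_observed - history_len - horizon_len) // horizon_len + 1
# Distinct sampled histories for the latest forecast start: combinatorial.
num_sampled = math.comb(num_observed - horizon_len, history_len)
print(num_rolling, "rolling windows vs", num_sampled, "sampled histories")
```

Even on this toy series of 20 steps, the number of distinct histories for a single forecast start far exceeds the number of rolling windows.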
2.3 Capture Better Dependencies
In addition to data augmentation, we argue that the proposed technique also allows the model to extract more complex relationships among time steps. We hypothesize that there are two major reasons. First, it prevents transformers from basing the prediction entirely on one or a few time steps, as these points might not be sampled during training. Second, augmented transformers can more easily capture seasonalities longer than the training window by randomly sampling points from beyond the window size. Detailed results are presented in Section 3.
Note that inputs to the model need to be normalized. In our experiments, we observe that scaling within bigger rolling windows before sampling yields much better performance than scaling after sampling. One reason is that the magnitudes of adjacent time steps are more similar, and the relative magnitude within the adjacent window is more important than the absolute magnitude.
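The "scale first, then sample" order can be sketched as follows. The paper does not state which scaling is used, so dividing by the mean absolute value of the rolling window is an assumption here, as are the function name `scale_then_sample` and its parameters.

```python
import random


def scale_then_sample(series, window_start, window_len, num_samples, rng):
    """Scale a rolling window by its own statistics, then sample steps from it."""
    window = series[window_start:window_start + window_len]
    # Assumed scaling: mean absolute value of the window (1.0 if all zeros).
    scale = sum(abs(v) for v in window) / len(window) or 1.0
    scaled = [v / scale for v in window]
    picked = sorted(rng.sample(range(len(scaled)), num_samples))
    # Return (original time index, scaled value) pairs plus the scale factor.
    return [(window_start + i, scaled[i]) for i in picked], scale


# Two adjacent windows with the same shape but very different magnitudes.
series = [10.0, 20.0, 30.0, 40.0, 1000.0, 2000.0, 3000.0, 4000.0]
rng = random.Random(0)
first, first_scale = scale_then_sample(series, 0, 4, 4, rng)
second, second_scale = scale_then_sample(series, 4, 4, 4, rng)
print(first, first_scale)
print(second, second_scale)
```

After scaling, both windows expose the same relative pattern to the model even though their absolute magnitudes differ by two orders of magnitude.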
Note that some works introduce relative position information to transformers, for example through convolutional self-attention (Li et al., 2019) or relative position representations (Shaw et al., 2018). These approaches are not mutually exclusive with the proposed technique. For example, convolutional attention can be implemented by feeding adjacent time steps as covariates. Further empirical studies are needed.
3 Experiments
We conducted experiments on two public benchmark datasets: electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) and traffic (http://archive.ics.uci.edu/ml/datasets/PEMSSF). The electricity dataset contains the hourly electricity consumption of 370 customers from 2011 to 2014. The traffic dataset records hourly occupancy rates (between 0 and 1) of 963 car lanes in the San Francisco Bay Area. Following Yu et al. (2016) and Rangapuram et al. (2018), we use one week of test data for electricity (starting at 12 AM on September 1, 2014) and traffic (starting at 5 PM on June 15, 2008).
Data Efficiency
We measure the performance of the augmented transformer on the long-term forecasting task presented in (Rangapuram et al., 2018) by directly predicting one week given 2, 3, or 4 weeks of training data. We compare the augmented transformer against traditional statistical methods as well as recent state-of-the-art deep learning models:

ARIMA: implemented with auto.arima method in R’s forecast package;

Exponential smoothing (ETS): implemented with ets method in R’s forecast package;

DeepAR (Flunkert et al., 2017): an RNN-based autoregressive model;

Deep State Space (DeepSSM) (Rangapuram et al., 2018): an RNN-based state space model.
Following (Rangapuram et al., 2018; Maddix et al., 2018), we use the $\rho$-quantile loss to evaluate prediction accuracy, which is defined as:

$$R_\rho(z, \hat{z}) = \frac{2 \sum_{i,t} D_\rho(z_{i,t}, \hat{z}_{i,t})}{\sum_{i,t} |z_{i,t}|}, \qquad D_\rho(z, \hat{z}) = \left(\rho - \mathbb{1}\{z \le \hat{z}\}\right)(z - \hat{z}),$$

where the sums run over all series $i$ and all time steps $t$ in the forecast horizon, and $\hat{z}$ is the empirical $\rho$-quantile of the predicted distribution. $R_{0.5}$ and $R_{0.9}$ for each model are summarized in Table 1. The detailed experiment setup can be found in Appendix 5.3. Overall, the augmented transformer surpassed the other models in all but one task. More detailed confidence intervals are shown in Appendix 5.2. Note that we do not include (Maddix et al., 2018) for comparison as we do not limit the number of learnable parameters.

Table 1: $R_{0.5}$ / $R_{0.9}$ losses for each model (lower is better).

Dataset      Given    DeepAR         DeepSSM        ARIMA          ETS            Ours
Electricity  2 weeks  0.153 / 0.147  0.087 / 0.05   0.283 / 0.109  0.121 / 0.101  0.083 / 0.044
Electricity  3 weeks  0.147 / 0.132  0.130 / 0.110  0.291 / 0.112  0.130 / 0.110  0.083 / 0.042
Electricity  4 weeks  0.125 / 0.080  0.130 / 0.110  0.30 / 0.110   0.13 / 0.11    0.084 / 0.041
Traffic      2 weeks  0.177 / 0.153  0.168 / 0.117  0.492 / 0.280  0.621 / 0.650  0.141 / 0.099
Traffic      3 weeks  0.126 / 0.096  0.170 / 0.113  0.492 / 0.509  0.529 / 0.163  0.140 / 0.101
Traffic      4 weeks  0.219 / 0.138  0.168 / 0.114  0.501 / 0.298  0.532 / 0.60   0.140 / 0.104
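For reference, the quantile loss used for the comparison above can be computed along the following lines. This is a sketch that assumes the normalized $\rho$-quantile loss as defined in Rangapuram et al. (2018), evaluated on flattened lists of targets and predicted quantiles. The function name `quantile_loss` and the toy numbers are hypothetical and are not results from the paper.

```python
def quantile_loss(targets, predictions, rho):
    """Normalized rho-quantile loss over flattened targets and predicted quantiles."""
    numerator = 0.0
    denominator = 0.0
    for z, z_hat in zip(targets, predictions):
        if z > z_hat:
            # Under-prediction is weighted by rho.
            numerator += rho * (z - z_hat)
        else:
            # Over-prediction is weighted by (1 - rho).
            numerator += (1.0 - rho) * (z_hat - z)
        denominator += abs(z)
    return 2.0 * numerator / denominator


# Toy example only: three targets and the corresponding predicted quantiles.
targets = [1.0, 2.0, 3.0]
predictions = [1.0, 1.0, 5.0]
print(quantile_loss(targets, predictions, rho=0.5))
print(quantile_loss(targets, predictions, rho=0.9))
```

At $\rho = 0.5$ the loss reduces to a normalized absolute error, while at $\rho = 0.9$ under-prediction is penalized nine times as heavily as over-prediction.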
Temporal Dependencies
Next, we demonstrate that the augmented transformer not only benefits from more distinct training windows, but can also capture temporal dependencies at a finer granularity than the original transformer. In this experiment, we limit the number of training windows to the number obtained through simple rolling windows, so that we can examine the new model without data augmentation. We denote this model without data augmentation as the fixed transformer. We also include the vanilla transformer without random sampling for comparison. The results are shown in Figure 3. For a fair comparison, we padded each time series instance with additional zeros for the vanilla transformer so that the rolling windows can start before the beginning of the instance.
The fixed transformer surpasses the vanilla transformer by a wide margin and performs very close to the augmented transformer. This implies that even without data augmentation, random sampling from a larger window during training can help the model extract more features.
4 Conclusion
We present a novel augmentation technique for transformers in time series forecasting, based on random sampling of the training windows. The proposed strategy achieves competitive performance compared to strong baselines on real-world datasets. In addition to data augmentation, we also show that the augmented transformer can better capture dependencies among time steps.
References
BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805v2.
Deep contextualized word representations. arXiv preprint arXiv:1802.05365v2.
DeepAR: probabilistic forecasting with autoregressive recurrent networks. arXiv preprint arXiv:1704.04110.
Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
Forecasting, structural time series models and the Kalman filter. Cambridge University Press.
Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 2790–2799.
Sharp nearby, fuzzy far away: how neural language models use context. arXiv preprint arXiv:1805.04623.
Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 95–104.
Time-series extreme event forecasting with neural networks at Uber. In International Conference on Machine Learning, pp. 1–5.
Set transformer: a framework for attention-based permutation-invariant neural networks. arXiv preprint arXiv:1810.00825v3.
Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. arXiv preprint arXiv:1907.00235.
RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692v1.
Deep factors with Gaussian processes for forecasting. arXiv preprint arXiv:1812.00098.
Deep factors with Gaussian processes for forecasting. arXiv preprint arXiv:1901.11117v4.
Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems, pp. 7785–7794.
Self-attention with relative position representations. arXiv preprint arXiv:1803.02155v2.
Forecasting with exponential smoothing: the state space approach. Springer Science & Business Media.
WaveNet: a generative model for raw audio. SSW 125.
Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237v1.
Temporal regularized matrix factorization for high-dimensional time series prediction. In Advances in Neural Information Processing Systems, pp. 847–855.
5 Appendix
5.1 Transformer
The model consists of stacked decoder blocks, where each block has a self-attention layer followed by a feedforward layer. The self-attention layer first transforms the input $Y$ into $H$ sets of attention heads. Let $W_h^Q$, $W_h^K$, and $W_h^V$ be learnable parameters, where $h = 1, \ldots, H$. Each attention head transforms $Y$ into query matrices $Q_h = Y W_h^Q$, key matrices $K_h = Y W_h^K$, and value matrices $V_h = Y W_h^V$. From each head, the scaled dot-product attention computes a vector output for every time step:

$$O_h = \mathrm{softmax}\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}} + M\right) V_h,$$

where $d_k$ is the dimension of the keys and $M$ is an upper-triangular mask, with $-\infty$ entries above the diagonal, that prevents the current time step from accessing future information. The feedforward layer then takes the concatenated output from all attention heads and applies two pointwise dense layers with a ReLU activation in between.
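The effect of the mask can be illustrated with a small pure-Python sketch of causal scaled dot-product attention for a single head. This is not the authors' implementation. Restricting each step to earlier positions, as done here, is assumed to be equivalent to setting the masked scores to negative infinity before the softmax. The function name `causal_attention` and the toy inputs are hypothetical.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    d_k = len(keys[0])
    outputs = []
    for t, query in enumerate(queries):
        # Masking: step t may only attend to steps 0..t.
        scores = [sum(q * k for q, k in zip(query, keys[s])) / math.sqrt(d_k)
                  for s in range(t + 1)]
        max_score = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(score - max_score) for score in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights)) / total
            for j in range(len(values[0]))
        ])
    return outputs


# Toy input: three time steps with two features each, used as Q, K, and V.
steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = causal_attention(steps, steps, steps)

# Changing the last step must not affect the outputs of earlier steps.
altered = steps[:2] + [[5.0, -3.0]]
outputs_altered = causal_attention(altered, altered, altered)
print(outputs[:2] == outputs_altered[:2])
```

The first output equals the first value vector, since step 0 can only attend to itself.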
Footnote 3: Note that we included the upper-triangular mask in our implementation of the augmented transformer, which might not be necessary as the input order is permuted. Additional details can be found in [20].
5.2 Result Details
Transformer  Fixed Transformer  Ours  

Dataset  Given  
Electricity  2 weeks  0.107  0.051  
3 weeks  0.098  0.051  
4 weeks  0.092  0.047  
Traffic  2 weeks  0.223  0.177  
3 weeks  0.210  0.163  
4 weeks  0.223  0.184 
5.3 Experiment Setup
For each task, our time-based covariate vectors are hour of the day and day of the week. We do not tune hyperparameters heavily. All of the models (both the transformer baseline and the augmented transformer) use 8 attention heads and dropout of . A simple grid search is used to find the other hyperparameters: , dimension of the feedforward layer in the transformer decoder block among , and embedding dimension of one-hot features among .

We scale the time steps with the adjacent time steps and randomly sample from the entire observed history. For large datasets, sampling from the entire observed history might lead to slower or no convergence. Thus, sampling smaller training windows from bigger rolling windows is recommended. Each training window is of size , where the last is the forecast horizon. We do not use sinusoidal positional embeddings and simply use direct time covariates for faster convergence. With positional embeddings the network achieves around the same accuracy.
Note that during inference, we use all the training data as the observed history for the tasks presented, since the observed history is limited. For the validation set, we randomly sample from the data before the forecast start time and use the rest as the training set. All models are trained on GTX 1080 Ti GPUs.
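The time-based covariates mentioned in this setup (hour of the day and day of the week) can be derived from timestamps as in the sketch below. It assumes hourly data and uses only the Python standard library. The function name `time_covariates` is hypothetical, and any encoding or normalization of the two features is left out because the paper does not specify it.

```python
from datetime import datetime, timedelta


def time_covariates(start, num_steps):
    """Return (hour_of_day, day_of_week) for each hourly step from `start`."""
    covariates = []
    for step in range(num_steps):
        stamp = start + timedelta(hours=step)
        # datetime.weekday() counts Monday as 0 and Sunday as 6.
        covariates.append((stamp.hour, stamp.weekday()))
    return covariates


# Two days of hourly covariates starting at 12 AM on September 1, 2014.
covariates = time_covariates(datetime(2014, 9, 1, 0), 48)
print(covariates[0], covariates[23], covariates[24])
```

Because each sampled time step carries these covariates with it, the model still receives temporal information when the steps in a training window are not consecutive.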