1 Introduction
Timeseries forecasting is an important problem with many industrial applications, such as retail demand forecasting [19], financial prediction [13], and predicting traffic or weather patterns [4]. In general, it plays a key role in automating business processes [15]. Modern datasets can have millions of correlated timeseries over several thousand timepoints. For instance, in an online shopping portal like Amazon or Walmart, one may be interested in the future daily demands for all items in a category, where the number of items may be in the millions. This leads to the problem of forecasting $n$ timeseries (one for each of the items), given past demands over $t$ timesteps. Such a timeseries dataset can be represented as a matrix $\mathbf{Y} \in \mathbb{R}^{n \times t}$, and we are interested in the high-dimensional setting where $n$ can be of the order of millions.
Traditional timeseries forecasting methods operate on individual timeseries, or on a small number of timeseries at a time. These methods include the well-known AR, ARIMA, and exponential smoothing models [17], the classical Box-Jenkins methodology [3], and, more generally, linear state-space models [11]. However, these methods do not scale easily to datasets with millions of timeseries, owing to the need for individual training. Moreover, they cannot benefit from shared temporal patterns in the whole dataset during training and prediction.
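To make the scalability point concrete, a per-series classical model must be fit independently for each of the $n$ series. The sketch below is illustrative only (the helper names `fit_ar` and `ar_forecast` are ours, not from any cited toolkit): it fits a single AR(p) model by ordinary least squares and rolls it forward, a procedure that would have to be repeated millions of times in the high-dimensional setting.

```python
import numpy as np

def fit_ar(y, p):
    """Fit an AR(p) model y[t] ~ c + sum_j a_j * y[t-j] by ordinary least squares."""
    y = np.asarray(y, float)
    T = len(y)
    # Column j holds the lag-(j+1) values aligned with targets y[p:].
    lags = np.column_stack([y[p - j - 1:T - j - 1] for j in range(p)])
    design = np.column_stack([np.ones(T - p), lags])
    coef, *_ = np.linalg.lstsq(design, y[p:], rcond=None)
    return coef  # [c, a_1, ..., a_p]

def ar_forecast(y, coef, steps):
    """Roll the fitted model forward, feeding each prediction back in."""
    p = len(coef) - 1
    hist = list(np.asarray(y, float)[-p:])
    preds = []
    for _ in range(steps):
        nxt = coef[0] + sum(coef[j + 1] * hist[-j - 1] for j in range(p))
        preds.append(nxt)
        hist.append(nxt)
    return np.array(preds)
```

On a series generated exactly by $y[t] = 0.5\,y[t-1] + 1$, `fit_ar(y, 1)` recovers the intercept and coefficient, and `ar_forecast` reproduces the recursion.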
Deep networks have gained popularity in timeseries forecasting recently, due to their ability to model nonlinear temporal patterns. Recurrent Neural Networks (RNNs) [8] have been popular in sequential modeling; however, they suffer from vanishing/exploding gradients during training. Long Short Term Memory (LSTM) [9] networks alleviate this issue and have had great success in language modeling and other seq-to-seq tasks [9, 20]. Recently, deep timeseries models have used LSTM blocks as internal components [7, 18]. Another popular architecture, competitive with LSTMs and arguably easier to train, is the temporal (causal) convolution popularized by the WaveNet model [22]. Temporal convolutions have recently been used in timeseries forecasting [2, 1]. These deep network based models can be trained on large timeseries datasets as a whole, in mini-batches. However, they still have two important shortcomings.

First, most of the above deep models are difficult to train on datasets that have wide variation in the scales of the individual timeseries. For instance, in the item demand forecasting use case, the demand for some popular items may be orders of magnitude higher than that of niche items. In such datasets, each timeseries needs to be appropriately normalized in order for training to succeed, and the predictions then need to be scaled back to the original scale. The mode and parameters of normalization are difficult to choose and can lead to different accuracies. For example, in [7, 18]
each timeseries is whitened using the corresponding empirical mean and standard deviation, while in [2] the timeseries are scaled by the corresponding value at the first timepoint.

Second, even though these deep models are trained on the entire dataset, during prediction they only focus on local past data, i.e., only the past data of a timeseries is used for predicting the future of that timeseries. However, in most datasets, global properties may be useful at prediction time. For instance, in stock market predictions, it might be beneficial to also look at the past values of Alphabet's and Amazon's stock prices while predicting the stock price of Apple. Similarly, in retail demand forecasting, past values of similar items can be leveraged while predicting the future for a certain item. To this end, in [14], the authors propose a combination of 2D convolution and recurrent connections that can take in multiple timeseries in the input layer, thus capturing global properties during prediction. However, this method does not scale beyond a few thousand timeseries, owing to the growing size of the input layer. On the other end of the spectrum, TRMF [25] is a temporally regularized matrix factorization model that can express all the timeseries as linear combinations of basis timeseries. These basis timeseries can capture global patterns during prediction. However, TRMF can only model linear temporal dependencies. Moreover, there can be approximation errors due to the factorization, which can be interpreted as a lack of local focus, i.e., the model only concentrates on the global patterns during prediction.
In light of the above discussion, we aim to propose a deep learning model that can think globally and act locally, i.e., leverage both local and global patterns during training and prediction, and that can also be trained reliably even when there are wide variations in scale. The main contributions of this paper are as follows:

In Section 4.2, we propose a variation of the temporal convolution architecture, called the leveled network, that can be trained effectively on datasets with wide scale variations, without a priori normalization.

In Section 5.1, inspired by TRMF [25], we introduce a matrix factorization model regularized by a leveled network (TRMF-DLN) that can express each timeseries as a linear combination of $k$ basis timeseries, where $k$ is much smaller than the number of timeseries $n$. Unlike TRMF, this model can capture nonlinear dependencies, as the regularization and prediction are done using a leveled network trained concurrently, and it is also amenable to scalable mini-batch training. This model can handle global dependencies during prediction.

In Section 5.2, we combine the global model with a local leveled network trained on the original dataset, through a data-dependent attention model. This leads to a stronger hybrid model, namely DeepGLO (Deep Global LOcal Forecaster), that focuses on both local properties of individual timeseries and global dependencies in the dataset during training and prediction.
2 Related Work
The literature on timeseries forecasting is vast and spans several decades. Here, we will mostly focus on recent deep learning approaches. For a comprehensive treatment of traditional methods, we refer the readers to [11, 17, 3, 16, 10] and the references therein.
In recent years, deep learning models have gained popularity in timeseries forecasting. DeepAR [7] proposes an LSTM-based model in which the parameters of Bayesian models for the future timesteps are predicted as a function of the corresponding hidden states of the LSTM. In [18], the authors combine linear state-space models with deep networks. In [23], the authors propose a timeseries model in which the entire history of a timeseries is encoded using an LSTM block, and a multi-horizon MLP decoder is used to decode the input into future forecasts. LSTNet [14] can leverage correlations between multiple timeseries through a combination of 2D convolution and recurrent structures. However, it is difficult to scale this model beyond a few thousand timeseries because of the growing size of the input layer. Temporal convolutions have recently been used for timeseries forecasting [2].
Matrix factorization with temporal regularization was first used in [24], in the context of speech denoising. Perhaps closest to our work is TRMF [25], where the authors propose an AR-based temporal regularization. In this paper, we extend this line of work to nonlinear settings, where the temporal regularization is performed by a leveled network (introduced in Section 4.2). We further combine the global matrix factorization model with a leveled network trained separately on the original dataset, using a linear attention model.
3 Problem Setting
We consider the problem of forecasting high-dimensional timeseries over $\tau$ future timesteps. High-dimensional timeseries datasets consist of several, possibly correlated, timeseries evolving over time, and the task is to forecast the values of those timeseries over future timesteps. Before we formally define the problem, we set up some notation.
Notation: We will use bold capital letters to denote matrices, such as $\mathbf{Y}$. $Y_{ij}$ and $\mathbf{Y}[i, j]$ will be used interchangeably to denote the $(i, j)$-th entry of the matrix $\mathbf{Y}$. Let $[n] := \{1, 2, \dots, n\}$ for a positive integer $n$. For $I \subseteq [n]$ and $J \subseteq [m]$, the notation $\mathbf{Y}[I, J]$ will denote the submatrix of $\mathbf{Y}$ with rows in $I$ and columns in $J$. $\mathbf{Y}[:, J]$ means that all the rows are selected, and similarly $\mathbf{Y}[I, :]$ means all the columns are chosen. $I + s$ denotes the set of elements in $I$ increased by $s$. The notation $[a : b]$, for positive integers $a \le b$, is used to denote the set $\{a, a+1, \dots, b\}$. $\|\cdot\|_F$ and $\|\cdot\|_2$ denote the Frobenius and spectral norms, respectively. By convention, all vectors in this paper are row vectors unless otherwise specified. $\|y\|$ denotes the $\ell_2$ norm of the vector $y$. $y[a:b]$ denotes the subvector with entries $\{y[a], \dots, y[b]\}$, where $y[i]$ denotes the $i$-th coordinate of $y$. The notation $y[:t]$ will be used to denote the vector $\{y[1], \dots, y[t]\}$. $[y_1, y_2]$ will be used to denote the concatenation of two row vectors $y_1$ and $y_2$. For a vector $y$, $\mathrm{mean}(y)$ denotes the empirical mean of the coordinates and $\mathrm{std}(y)$ denotes the empirical standard deviation.

Forecasting Task: A timeseries dataset can be represented by a matrix $\mathbf{Y} \in \mathbb{R}^{n \times (t + \tau)}$, where $n$ is the number of timeseries, $t$ is the number of timepoints observed during the training phase, and $\tau$ is the window size for forecasting. $\mathbf{Y}_{tr} := \mathbf{Y}[:, 1:t]$ denotes the training portion, and $\hat{\mathbf{Y}}_{te}$ is used to denote the predicted values in the test range $[t+1 : t+\tau]$. $y^{(i)}$ is used to denote the $i$-th timeseries, i.e., the $i$-th row of $\mathbf{Y}$.
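As a quick illustration of this setup (with hypothetical values of $n$, $t$, and $\tau$), a dataset is simply an $n \times (t + \tau)$ array whose left block is observed during training and whose right block is to be forecast:

```python
import numpy as np

# Hypothetical dimensions: n series, t training timepoints, forecast window tau.
n, t, tau = 4, 100, 6
Y = np.random.default_rng(0).normal(size=(n, t + tau))  # the full matrix Y

Y_tr = Y[:, :t]   # observed during the training phase
Y_te = Y[:, t:]   # held-out test range to be forecast
y_i = Y[2]        # the i-th timeseries is the i-th row (here i = 2)
```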
Objective: The quality of the predictions is generally measured using a metric calculated between the predicted and actual values in the test range. One popular metric is the normalized absolute deviation [25], defined as follows:

$\mathcal{L}(\mathbf{Y}_{te}, \hat{\mathbf{Y}}_{te}) = \dfrac{\sum_{i, t'} |Y_{it'} - \hat{Y}_{it'}|}{\sum_{i, t'} |Y_{it'}|}$  (1)

where $Y_{it'}$ and $\hat{Y}_{it'}$ are the observed and predicted values, respectively. This metric is sometimes referred to as WAPE in the forecasting literature. Other commonly used evaluation metrics are defined in Section B.2. Note that (1) is also used as the loss function in our proposed models.
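Eq. (1) can be computed in a few lines; `wape` is simply our name for it, matching the WAPE terminology above:

```python
import numpy as np

def wape(Y_true, Y_pred):
    """Normalized absolute deviation (WAPE), as in Eq. (1)."""
    Y_true = np.asarray(Y_true, float)
    Y_pred = np.asarray(Y_pred, float)
    return np.abs(Y_true - Y_pred).sum() / np.abs(Y_true).sum()
```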
4 DLN: A Deep Leveled Network
In this section, we propose the Deep Leveled Network (DLN), an architectural variation of the temporal convolution network designed to handle high-dimensional timeseries data with wide variation in scale, without normalization. DLN constitutes an important component of both the global and local models in DeepGLO, described in Section 5.
The most commonly used base architecture in deep timeseries models is the Long Short Term Memory (LSTM) [9] network. However, more recent works have observed that sequence-to-sequence tasks can be performed effectively by dilated temporal convolution architectures (popularized by the WaveNet model [22]). In addition, temporal convolutions are more stable during training and can provide competitive or better performance compared to recurrent models [1, 2]. Therefore, in this work we use Temporal Convolution (TC) blocks as the basic building blocks of our prediction models. We provide a brief description of TC in the next section, for the sake of completeness.
4.1 Temporal Convolution
Temporal convolutions (also referred to as causal convolutions) [22, 2, 1] are multilayered neural networks composed of 1D convolutional layers. The major difference from regular 1D convolution is the introduction of dilation in the filter taps, which guarantees a much larger dynamic range. Dilation is equivalent to introducing a fixed step between every two adjacent filter taps; the dilation in layer $i$ is usually set to $2^{i-1}$. A temporal convolution network with filter size $k$ and $d$ layers then has a dynamic range (or lookback) of $l = 1 + (k-1)(2^d - 1)$, i.e., exponential in the depth. An example of a temporal convolution network with one channel per layer is shown in Fig. 1. TC networks can also have residual connections, as shown in Fig. 1. For more details, we refer the readers to the general description in [1]. Note that in practice we can have multiple channels per layer of a TC network.

Thus, a temporal convolution network can be treated as an object that takes in the previous $l$ values of a timeseries, $y[t-l:t-1]$, and outputs the one-step look-ahead predicted value $\hat{y}[t]$. We will denote a temporal convolution network by $\mathcal{T}_{\Theta}(\cdot)$, where $\Theta$ denotes the parameter weights of the network. Thus, we have $\hat{y}[t] = \mathcal{T}_{\Theta}(y[t-l:t-1])$. The same operator can be defined on matrices: given $\mathbf{Y}$ and a set of row indices $I$, we can write $\hat{\mathbf{Y}}[I, t] = \mathcal{T}_{\Theta}(\mathbf{Y}[I, t-l:t-1])$.

4.2 Leveled Network: Handling Diverse Scales
Large-scale timeseries datasets containing upwards of hundreds of thousands of timeseries can have very diverse scales. This diversity in scale causes issues when training deep models, both temporal convolution and LSTM based architectures, and some normalization is needed for training to succeed [14, 2, 7]. However, selecting the correct normalizing factor for each timeseries is not an exact science and can affect predictive performance. For instance, in [7] the datasets are whitened using the training standard deviation and mean of each timeseries, and the predictions are then renormalized. On the other hand, in [2], each timeseries is rescaled by its value at the first timestep. Moreover, when performing rolling predictions using a pretrained model, there is a potential need to update the scaling factors as new timepoints are observed. In this section we propose DLN, a simple leveled network architecture that can be trained on diverse datasets without the need for a priori normalization.
DLN consists of two temporal convolution blocks (having the same dynamic range $l$) that are trained concurrently. Let us denote the two networks and their associated weights by $\mathcal{T}_{\Theta_v}(\cdot)$ and $\mathcal{T}_{\Theta_r}(\cdot)$, respectively. The key idea is to have $\mathcal{T}_{\Theta_v}$ (the leveling component) predict the rolling mean of the next $w$ future timepoints given the past. On the other hand, $\mathcal{T}_{\Theta_r}$ (the residual component) is used to predict the variations with respect to this mean value. Given an appropriate window size $w$, the rolling mean stays stable for each timeseries and can be predicted by a simple temporal convolution model; given these predictions, the additive variations are relatively scale-free, i.e., the residual network can be trained reliably without normalization. This can be summarized by the following equations:

$\hat{y}^{(i)}[t] = \hat{v}^{(i)}[t] + \hat{r}^{(i)}[t]$  (2)

$\hat{v}^{(i)}[t] = \mathcal{T}_{\Theta_v}(y^{(i)}[t-l:t-1])$  (3)

$\hat{r}^{(i)}[t] = \mathcal{T}_{\Theta_r}(y^{(i)}[t-l:t-1] - \hat{v}^{(i)}[t-l:t-1])$  (4)

where we want $\hat{v}^{(i)}[t]$ to be close to the rolling mean $\mathrm{mean}(y^{(i)}[t : t+w-1])$ and $\hat{r}^{(i)}[t]$ to be close to the residual $y^{(i)}[t] - \hat{v}^{(i)}[t]$. An illustration of the leveled network methodology is shown in Figure 2.
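To make the level-plus-residual decomposition concrete, the sketch below substitutes trivial stand-ins for the two trained TC blocks: a trailing-window mean for the leveling component and a last-residual carry-forward for the residual component. These stand-ins are our own simplifications (the real model learns both maps), but the additive structure is the same:

```python
import numpy as np

def level_predict(past, w=4):
    """Stand-in for the leveling TC block: predict the next rolling mean
    as the mean of the last w observed points."""
    return past[-w:].mean()

def residual_predict(resid_past):
    """Stand-in for the residual TC block: carry the last residual forward."""
    return resid_past[-1]

def dln_one_step(past, w=4):
    """One-step DLN-style prediction: leveling output plus residual output."""
    # Leveling predictions over the lookback, used to de-level the series.
    v_hat = np.array([level_predict(past[:k], w) for k in range(w, len(past) + 1)])
    r_input = past[w - 1:] - v_hat  # residuals are roughly scale-free
    return v_hat[-1] + residual_predict(r_input)
```

For `past = [1, ..., 10]` and `w = 4`, the level stand-in outputs 8.5 and the last residual is 1.5, giving a prediction of 10.0.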
Training: Both networks can be trained concurrently on the training set $\mathbf{Y}_{tr}$, using mini-batch stochastic gradient updates. The pseudocode for training a DLN is described in Algorithm 1. The loss function used is the same as the metric defined in Eq. (1). Note that in Step 9, the leveling component $\mathcal{T}_{\Theta_v}$ is held fixed and only the residual component $\mathcal{T}_{\Theta_r}$ is updated.
Prediction: The trained model can be used for multi-step look-ahead prediction in a standard manner. Given the past $l$ datapoints of a timeseries, $y[t-l:t-1]$, the prediction for the next timestep is given by $\hat{y}[t]$, as defined in (2). The one-step look-ahead prediction can then be concatenated with the past values to form the sequence $[y[t-l+1:t-1], \hat{y}[t]]$, which can again be passed through the network to obtain the next prediction $\hat{y}[t+1]$. The same procedure can be repeated $\tau$ times to predict $\tau$ timesteps ahead in the future.
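The multi-step procedure above is a standard autoregressive rollout; a minimal version, with a generic `one_step` callable standing in for the trained DLN, looks like:

```python
import numpy as np

def multi_step_forecast(one_step, past, tau):
    """Roll a one-step predictor forward tau steps by feeding predictions back in."""
    hist = np.asarray(past, float).copy()
    preds = []
    for _ in range(tau):
        nxt = one_step(hist)         # one-step look-ahead prediction
        preds.append(nxt)
        hist = np.append(hist, nxt)  # concatenate the prediction to the past
    return np.array(preds)
```

For example, with the toy predictor `lambda h: h[-1] + 1.0` and history `[1, 2, 3]`, four rollout steps yield `[4, 5, 6, 7]`.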
In Section 6, we shall see that DLN can be trained on unnormalized datasets with wide variations in scale. In the next section, we will show how DLN is used as a key component in both the global and local parts of DeepGLO, our proposed hybrid model.
5 DeepGLO: A Deep Global Local Forecaster
In this section, we propose DeepGLO, our complete hybrid model, which consists of separate global and local components married through a data-driven attention mechanism. In Section 5.1, we present the global component, which is a temporal matrix factorization model regularized by a Deep Leveled Network (TRMF-DLN). This model can capture global patterns during prediction by representing each of the original timeseries as a linear combination of $k$ basis timeseries, where $k \ll n$. The local model is simply a DLN trained on the original dataset. In Section 5.2, we detail how the global and local models are combined.
5.1 Temporal Matrix Factorization Regularized by a Deep Leveled Network (TRMF-DLN)
In this section, we propose a low-rank matrix factorization model for timeseries forecasting that uses a leveled network for regularization. This approach is inspired by the temporally regularized matrix factorization (TRMF) approach proposed in [25].
The idea is to factorize the training timeseries matrix $\mathbf{Y}_{tr} \in \mathbb{R}^{n \times t}$ into low-rank factors $\mathbf{F} \in \mathbb{R}^{n \times k}$ and $\mathbf{X} \in \mathbb{R}^{k \times t}$, where $k \ll n$. This is illustrated in Figure 3. Further, we would like to encourage a temporal structure in the matrix $\mathbf{X}$, so that the future values in the test range can also be forecasted. Let $\mathbf{Y}_{tr} \approx \mathbf{F}\mathbf{X}$. Thus, the matrix $\mathbf{X}$ can be thought of as comprising $k$ basis timeseries that capture the global temporal patterns in the whole dataset, and all the original timeseries are linear combinations of these basis timeseries. In the next subsection, we describe how a DLN can be used to encourage the temporal structure of $\mathbf{X}$.
Temporal Regularization by a DLN: If we are provided with a leveled network that captures the temporal patterns in the training dataset, then we can use this model to encourage temporal structure in $\mathbf{X}$. Let us denote the said leveled network by $\mathcal{T}_{\Theta_X}(\cdot)$. The temporal patterns can be encouraged by including the following temporal regularization term in the objective function:

$\mathcal{R}(\mathbf{X} \mid \Theta_X) = \sum_{t' = l+1}^{t} \mathcal{L}(\mathbf{X}[:, t'], \hat{\mathbf{X}}[:, t'])$  (5)

where $\hat{\mathbf{X}}[:, t'] = \mathcal{T}_{\Theta_X}(\mathbf{X}[:, t'-l:t'-1])$ and $\mathcal{L}$ is the metric defined in Eq. (1). This regularizer encourages the values of $\mathbf{X}$ at time-index $t'$ to be close to the predictions from the temporal network applied on the past values between timesteps $t'-l$ and $t'-1$. Here, $l$ is ideally equal to the dynamic range of the network, as defined in Section 4. Thus, the overall loss function for the factors and the temporal network is as follows:

$\mathcal{L}_G(\mathbf{F}, \mathbf{X}, \Theta_X) = \mathcal{L}(\mathbf{Y}_{tr}, \mathbf{F}\mathbf{X}) + \lambda \, \mathcal{R}(\mathbf{X} \mid \Theta_X)$  (6)

where $\lambda$ is the regularization parameter for the temporal regularization component.
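The two terms of the overall loss can be sketched as follows, with `predict_next` standing in for the leveled network $\mathcal{T}_{\Theta_X}$ (in the usage below a one-step persistence map, purely for illustration):

```python
import numpy as np

def nd_loss(Y, Y_hat):
    """Eq. (1): normalized absolute deviation between targets and predictions."""
    return np.abs(Y - Y_hat).sum() / np.abs(Y).sum()

def temporal_reg(X, predict_next, l):
    """Eq. (5)-style penalty: how poorly the temporal model predicts each
    column of X from the previous l columns."""
    X_hat = np.column_stack(
        [predict_next(X[:, c - l:c]) for c in range(l, X.shape[1])]
    )
    return nd_loss(X[:, l:], X_hat)

def global_loss(Y_tr, F, X, predict_next, l, lam):
    """Eq. (6): reconstruction loss plus lambda times the temporal regularizer."""
    return nd_loss(Y_tr, F @ X) + lam * temporal_reg(X, predict_next, l)
```

With a persistence predictor `lambda W: W[:, -1]`, an exactly factorized matrix has zero reconstruction loss, so the global loss reduces to the scaled temporal penalty.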
Training: The low-rank factors and the temporal network can be trained alternately to approximately minimize the loss in Eq. (6). The overall training can be performed through mini-batch SGD and can be broken down into two main components, performed alternately: given a fixed $\Theta_X$, approximately minimize (6) with respect to the factors $\mathbf{F}$ and $\mathbf{X}$ (Algorithm 2), and train the network $\mathcal{T}_{\Theta_X}$ on the matrix $\mathbf{X}$ using Algorithm 1.
The overall algorithm is detailed in Algorithm 3. The global DLN $\mathcal{T}_{\Theta_X}$ is first initialized by training on a normalized version of $\mathbf{Y}_{tr}$. Here, by normalized we mean that each timeseries is centered and scaled by its empirical mean and standard deviation in the training range. Note that this normalized version is only used to initialize $\mathcal{T}_{\Theta_X}$. Then, in the second initialization step, the two factors $\mathbf{F}$ and $\mathbf{X}$ are trained using the initialized $\mathcal{T}_{\Theta_X}$ (step 3), for a fixed number of iterations. This is followed by the alternating steps that update $\mathbf{F}$, $\mathbf{X}$, and $\Theta_X$ (steps 5-7).
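The factor-update step of this scheme (with the network held fixed) is performed by mini-batch SGD in the paper; for intuition only, an alternating least-squares stand-in that handles just the reconstruction term of the loss can be written as follows (`update_factors` is our hypothetical helper, not the paper's algorithm):

```python
import numpy as np

def update_factors(Y_tr, F, X, n_iters=50):
    """Alternating least-squares stand-in for the factor updates.

    The paper minimizes the full loss (reconstruction plus temporal
    regularization) by mini-batch SGD; plain ALS on the reconstruction
    term is used here only to keep the sketch dependency-free.
    """
    for _ in range(n_iters):
        # Fix X, solve min_F ||Y_tr - F X||: transpose to standard lstsq form.
        F = np.linalg.lstsq(X.T, Y_tr.T, rcond=None)[0].T
        # Fix F, solve min_X ||Y_tr - F X||.
        X = np.linalg.lstsq(F, Y_tr, rcond=None)[0]
    return F, X
```

On exactly low-rank data, a few sweeps recover a factorization that reproduces the matrix.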
Rolling Prediction without Retraining: Once the model is trained using Algorithm 3, we can make predictions on the test range using multi-step look-ahead prediction. The method is straightforward: we first use $\mathcal{T}_{\Theta_X}$ to make multi-step look-ahead predictions for the basis timeseries in $\mathbf{X}$, as detailed in Section 4, to obtain $\hat{\mathbf{X}}_{te}$; then the predictions for the original timeseries can be obtained as $\hat{\mathbf{Y}}_{te} = \mathbf{F}\hat{\mathbf{X}}_{te}$. This model can also be adapted to make rolling predictions without retraining. In the rolling setting, the task is to train the model on a training period, say $[1:t]$, then make predictions on a future time period, say $[t+1:t+\tau]$, then receive the actual values for that range and, after incorporating these values, make further predictions for a time range further in the future, and so on. The key challenge in this scenario is to incorporate the newly observed values to generate the values of the basis timeseries in that period, i.e., $\mathbf{X}[:, t+1:t+\tau]$. We propose to obtain these values by minimizing the global loss defined in (6) while keeping $\mathbf{F}$ and $\Theta_X$ fixed:

$\mathbf{X}[:, t+1:t+\tau] \in \arg\min_{\mathbf{X}[:, t+1:t+\tau]} \mathcal{L}_G(\mathbf{F}, \mathbf{X}, \Theta_X).$
Once we obtain $\mathbf{X}[:, t+1:t+\tau]$, we can make predictions for the next set of future timesteps $[t+\tau+1 : t+2\tau]$. Note that the TRMF model in [25] needs to be retrained from scratch to incorporate the newly observed values. In this work, retraining is not required to achieve good performance, as we shall see in our experiments in Section 6.
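If the temporal term is dropped for brevity, recovering the new columns of $\mathbf{X}$ with $\mathbf{F}$ fixed reduces to a linear least-squares problem per new column; the full method minimizes all of the global loss, with $\Theta_X$ also held fixed. A sketch of this simplified step (`extend_basis` is our illustrative name):

```python
import numpy as np

def extend_basis(F, Y_new):
    """Given fixed factors F (n x k) and newly observed columns Y_new (n x tau),
    recover the new basis-timeseries columns X_new (k x tau).

    Least-squares stand-in: the paper minimizes the full loss, including
    the temporal regularizer, while keeping F and the network fixed.
    """
    X_new, *_ = np.linalg.lstsq(F, Y_new, rcond=None)
    return X_new
```

When the new observations lie exactly in the span of $\mathbf{F}$, the recovered columns are exact.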
5.2 Combining Global and Local Models
In Section 5.1, we introduced TRMF-DLN, a model that can leverage global information from the whole dataset even at prediction time, through linear combinations of the learned basis timeseries. In this section, we extend that model to leverage both global information (through the basis timeseries in TRMF-DLN) and local information (through a separate DLN that is trained on the original dataset and predicts the future values of each timeseries based on its own past observed values). The final output is produced by a linear attention model that combines the global and local predictions.
The idea is to have a simple temporal convolution network $\mathcal{T}_{\Theta_a}$ that outputs attention weights based on the past values of a timeseries $y^{(i)}$. These attention weights are used to combine the local and global predictions. For the sake of exposition, we write the relations for one-step look-ahead predictions as follows:

$[a_l^{(i)}[t], a_g^{(i)}[t]] = \mathcal{T}_{\Theta_a}(y^{(i)}[t-l:t-1])$

$\hat{y}^{(i)}[t] = a_l^{(i)}[t] \odot \hat{y}_{loc}^{(i)}[t] + a_g^{(i)}[t] \odot \hat{y}_{glo}^{(i)}[t]$

where $\odot$ means elementwise multiplication, and $\hat{y}_{loc}^{(i)}[t]$ and $\hat{y}_{glo}^{(i)}[t]$ denote the local (DLN) and global (TRMF-DLN) predictions, respectively.
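The combination step itself is a simple elementwise weighted sum. The sketch below uses a hypothetical fixed map in place of the learned attention network (the real weights are the outputs of a trained TC network and are not constrained to sum to one; the softmax here is purely to produce a valid pair):

```python
import numpy as np

def attention_weights(past, l=8):
    """Stand-in for the attention network: emit an (a_local, a_global) pair
    from the last l values. Illustrative only; the paper learns this map."""
    z = np.array([past[-l:].std(), 1.0])
    e = np.exp(z - z.max())
    return e / e.sum()

def combine(y_local, y_global, a):
    """Final prediction: attention-weighted elementwise sum of the two predictions."""
    return a[0] * y_local + a[1] * y_global
```

With equal weights, the output is simply the average of the local and global predictions.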
Thus, the final predictions are linear combinations of the local and global predictions. The linear attention weights are predicted by a temporal convolution network $\mathcal{T}_{\Theta_a}$ based on the past values of the timeseries. This network can be trained by mini-batch SGD, as detailed in our complete algorithm in Algorithm 4. Steps 1 and 2 involve training the global TRMF-DLN model and the locally focused leveled network on the original $\mathbf{Y}_{tr}$. Steps 6-11 give the pseudocode for training $\mathcal{T}_{\Theta_a}$.
Note that both rolling predictions and multi-step predictions can be performed by DeepGLO, as the individual global TRMF-DLN model and the local DLN can forecast independently, without any need for retraining. We illustrate some representative results on timeseries from the traffic and electricity datasets in Fig. 4. In Fig. 4(a), we show some of the basis timeseries (global features) extracted from the traffic dataset, which can be combined linearly to yield the individual original timeseries. It can be seen that the basis series are highly temporal and can be predicted in the test range using the network $\mathcal{T}_{\Theta_X}$. Fig. 4(b) shows global and local predictions on the same timeseries, which are combined through the attention network to provide the final predictions. In Fig. 4(c), we display a scatter plot of the local vs. global attention weights in the test range for all timepoints of all timeseries in traffic and electricity (most weights fall in [0, 1]; the axes are therefore restricted to [0, 1]). We see that in electricity more attention is on the local component, while in traffic the attention is split almost equally between the global and local components.

Table 1: Dataset statistics: number of timeseries $n$, number of training timepoints $t$, rolling window size $\tau$, and number of rolling windows.

Dataset     | n       | t      | tau | Windows
electricity | 370     | 25,968 | 24  | 7
traffic     | 963     | 10,392 | 24  | 7
wiki        | 115,084 | 747    | 14  | 4
Table 2: WAPE/MAPE/SMAPE on the test ranges of all three datasets, under normalized and unnormalized training.

Algorithm        | electricity (Norm.) | electricity (Unnorm.) | traffic (Norm.)   | traffic (Unnorm.) | wiki (Norm.)      | wiki (Unnorm.)

Proposed:
DeepGLO          | 0.084/0.291/0.119 | 0.109/0.448/0.149 | 0.159/0.218/0.202 | 0.221/0.321/0.254 | 0.233/0.380/0.402 | 0.228/0.356/0.311
Local DLN        | 0.086/0.258/0.129 | 0.118/0.336/0.172 | 0.169/0.246/0.218 | 0.237/0.422/0.275 | 0.235/0.469/0.346 | 0.288/0.397/0.341

Local-only:
LSTM             | 0.109/0.264/0.154 | 0.896/0.672/0.768 | 0.276/0.389/0.361 | 0.270/0.357/0.263 | 0.427/2.170/0.590 | 0.789/0.686/0.493
DeepAR           | 0.099/0.375/0.146 | 0.889/0.818/0.876 | 0.268/0.369/0.272 | 0.250/0.331/0.258 | 0.442/2.980/0.522 | 0.958/8.120/1.140
Temporal Conv.   | 0.147/0.476/0.156 | 0.423/0.769/0.523 | 0.204/0.284/0.236 | 0.239/0.425/0.281 | 0.336/1.322/0.497 | 0.511/0.884/0.509
Prophet          | 0.197/0.393/0.221 | 0.221/0.586/0.524 | 0.313/0.600/0.420 | 0.303/0.559/0.403 | -                 | -

Global-only:
TRMF (retrained) | 0.104/0.280/0.151 | 0.105/0.431/0.183 | 0.159/0.226/0.181 | 0.210/0.322/0.275 | 0.309/0.847/0.451 | 0.320/0.938/0.503
SVD+DLN          | 0.219/0.437/0.238 | 0.368/0.779/0.346 | 0.468/0.841/0.580 | 0.329/0.687/0.340 | 0.697/3.510/0.886 | 0.639/2.000/0.893
6 Empirical Results
In this section, we validate our model against other benchmarks on rolling prediction tasks over three real-world datasets (see Section B.1 for more details). The datasets under consideration are: electricity [21], the hourly load of 370 houses; the training set consists of 25,968 timepoints, and the task is to perform rolling validation for the next 7 days (24 timepoints at a time, for 7 windows), as done in [25, 18, 7]. traffic [5], hourly traffic on 963 roads in San Francisco; the training set consists of 10,392 timepoints, and the task is again to perform rolling validation for the next 7 days (24 timepoints at a time, for 7 windows), as done in [25, 18, 7]. wiki [12], the daily web traffic of about 115,084 Wikipedia articles; we only keep the timeseries without missing values from the original dataset. The values for each day are normalized by the total traffic on that day across all timeseries and then multiplied by a fixed constant. The training set consists of 747 timepoints, and the task is to perform rolling validation for the next 56 days (14 days at a time, for 4 windows). More dataset statistics, indicating scale variations, are provided in Table 1.
For each dataset, all models are compared in two different settings. In the first setting, the models are trained on a normalized version of the dataset, where each timeseries in the training set is whitened using its training mean and standard deviation, and the predictions are then scaled back to the original scale. In the second setting, the dataset is unnormalized, i.e., left as is during training and prediction. Note that all models are compared on the test range in the original scale of the data. The purpose of these two settings is to measure the impact of scaling on the accuracy of the different models.
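The "normalized" setting above corresponds to per-series whitening followed by rescaling the predictions back, which can be sketched as follows (the `eps` guard against constant series is our addition):

```python
import numpy as np

def normalize(Y_tr, eps=1e-8):
    """Whiten each series by its training mean/std (the 'normalized' setting)."""
    mu = Y_tr.mean(axis=1, keepdims=True)
    sd = Y_tr.std(axis=1, keepdims=True) + eps  # eps guards constant series
    return (Y_tr - mu) / sd, mu, sd

def denormalize(Y_pred, mu, sd):
    """Map predictions back to the original scale before computing metrics."""
    return Y_pred * sd + mu
```

The round trip is lossless, so metrics computed after `denormalize` are on the original scale of the data.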
The models under contention are: (1) DeepGLO: the combined local and global TRMF-DLN model proposed in Section 5.2. (2) Local DLN: the leveled temporal convolution based architecture proposed in Section 4. (3) LSTM: a simple LSTM block that predicts the timeseries values as a function of the hidden states [9]. (4) DeepAR: the model proposed in [7]. (5) TC: a simple temporal convolution model, as described in Section 4. (6) Prophet: the versatile forecasting model from Facebook based on classical techniques [6]. (7) TRMF: the model proposed in [25]; note that this model needs to be retrained for every rolling prediction window. (8) SVD+DLN: a combination of SVD and a leveled network; the original data is factorized as $\mathbf{Y}_{tr} \approx \mathbf{F}\mathbf{X}$ using SVD, and a leveled network is trained on $\mathbf{X}$. This is a simple baseline for a global-only approach.
Note: We use the same hyperparameters for DeepAR on the traffic and electricity datasets as specified in [7]. The WAPE values from the original paper could not be used directly, as different values are reported in [7] and [18]. The TRMF model [25] was trained with different hyperparameters (a larger rank) than in the original paper, and the results are therefore slightly better. More details about all the hyperparameters used, including the ranks used for DeepGLO/TRMF on electricity, traffic, and wiki, are provided in Section B.
In Table 2, we report WAPE/MAPE/SMAPE (see definitions in Section B.2) on all three datasets, under both normalized and unnormalized training. We can see that DeepGLO features among the top two models in almost all categories, under all three metrics. TRMF does marginally better on the normalized traffic dataset in terms of SMAPE, and also in terms of MAPE on the unnormalized traffic dataset. This may be attributed to the fact that the lag indices in TRMF can be chosen to directly account for weekly dependencies, and that it is retrained. It should be noted that DeepGLO performs substantially better than the other models on the much larger wiki dataset; in fact, it does at least 30% better in terms of MAPE than the next best model not proposed in this paper. Also note that DeepGLO performs better on wiki using unnormalized data. We would also like to point out that the leveled network trains well on unnormalized data, while the other deep models (LSTM, DeepAR, TC) do not perform well in this setting. More results are provided in Section A, where we plot predictions from the different methods on representative timeseries.
7 Conclusion and Future Work
In this paper, we propose DeepGLO, a deep glocalized temporal model for high-dimensional timeseries forecasting. DeepGLO combines a global matrix factorization model regularized by a leveled network with a local leveled network trained on the original dataset. This hybrid method shows considerable gains over the state of the art on real datasets and can train reliably in the presence of diverse scales. In this paper, we focus on making point forecasts; however, we believe that DeepGLO can be extended to make probabilistic forecasts, by enabling the leveled network to predict both the mean value and a Bayesian confidence interval for future timepoints. We leave this direction as future work. Another possible direction for future research is to use DeepGLO for missing-value imputation in timeseries data. This is especially interesting given the global TRMF-DLN model, owing to its matrix factorization structure.

References
 [1] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
 [2] Anastasia Borovykh, Sander Bohte, and Cornelis W Oosterlee. Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691, 2017.
 [3] George EP Box and Gwilym M Jenkins. Some recent advances in forecasting and control. Journal of the Royal Statistical Society. Series C (Applied Statistics), 17(2):91–109, 1968.
 [4] Chris Chatfield. Timeseries forecasting. Chapman and Hall/CRC, 2000.

 [5] Marco Cuturi. Fast global alignment kernels. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 929–936, 2011.
 [6] Facebook. Fbprophet. https://research.fb.com/prophet-forecasting-at-scale/, 2017. [Online; accessed 07-Jan-2019].
 [7] Valentin Flunkert, David Salinas, and Jan Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. arXiv preprint arXiv:1704.04110, 2017.
 [8] Kenichi Funahashi and Yuichi Nakamura. Approximation of dynamical systems by continuous time recurrent neural networks. Neural networks, 6(6):801–806, 1993.
 [9] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.
 [10] James Douglas Hamilton. Time series analysis, volume 2. Princeton university press Princeton, NJ, 1994.
 [11] Rob Hyndman, Anne B Koehler, J Keith Ord, and Ralph D Snyder. Forecasting with exponential smoothing: the state space approach. Springer Science & Business Media, 2008.
 [12] Kaggle. Wikipedia web traffic. https://www.kaggle.com/c/web-traffic-time-series-forecasting/data, 2017. [Online; accessed 07-Jan-2019].

 [13] Kyoung-jae Kim. Financial time series forecasting using support vector machines. Neurocomputing, 55(1-2):307–319, 2003.
 [14] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 95–104. ACM, 2018.
 [15] Paul D Larson. Designing and managing the supply chain: Concepts, strategies, and case studies, david simchilevi philip kaminsky edith simchilevi. Journal of Business Logistics, 22(1):259–261, 2001.
 [16] Helmut Lütkepohl. New introduction to multiple time series analysis. Springer Science & Business Media, 2005.
 [17] ED McKenzie. General exponential smoothing and the equivalent ARMA process. Journal of Forecasting, 3(3):333–344, 1984.
 [18] Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems, pages 7796–7805, 2018.
 [19] Matthias W Seeger, David Salinas, and Valentin Flunkert. Bayesian intermittent demand forecasting for large inventories. In Advances in Neural Information Processing Systems, pages 4646–4654, 2016.
 [20] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association, 2012.
 [21] Artur Trindade. ElectricityLoadDiagrams20112014 data set. https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014, 2011. [Online; accessed 07-Jan-2019].
 [22] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. CoRR abs/1609.03499, 2016.
 [23] Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.
 [24] Kevin W Wilson, Bhiksha Raj, and Paris Smaragdis. Regularized nonnegative matrix factorization with temporal dependencies for speech denoising. In Ninth Annual Conference of the International Speech Communication Association, 2008.
 [25] HsiangFu Yu, Nikhil Rao, and Inderjit S Dhillon. Temporal regularized matrix factorization for highdimensional time series prediction. In Advances in neural information processing systems, pages 847–855, 2016.
Appendix A More Empirical Results
In this section, we provide some more illustrations. In Fig. 5, we show the actual and predicted values from all the methods on some selected timeseries from the electricity dataset. The first row shows results from normalized training and the second row from unnormalized training. We can see that the other local deep models (DeepAR, LSTM, TC) fail to train in the unnormalized setting, as there are wide scale variations. On average, DeepGLO predictions are closer to the actual values. In Fig. 6, we do the same for the traffic dataset. We can see that normalization does not matter much here, as scales do not vary much in this dataset. In Fig. 7, we do the same for the wiki dataset. DeepGLO predictions are substantially better on this larger dataset.
Appendix B More Experimental Details
We provide more details about the experiments, such as the exact rolling prediction setting for each dataset, the evaluation metrics, and the model hyperparameters.
B.1 Rolling Prediction
In our experiments in Section 6, we compare model performances on rolling prediction tasks. The goal in this setting is to predict future timesteps in batches as more data is revealed. Suppose the initial training time-period is $\{1, \cdots, t_0\}$, the rolling window size is $\tau$, and the number of test windows is $K$. Let $t_b = t_0 + (b-1)\tau$. The rolling prediction task is a sequential process: given data till the last window, $\{1, \cdots, t_b\}$, we predict the values for the next future window $\{t_b + 1, \cdots, t_b + \tau\}$; then the actual values for that window are revealed and the process is carried on for $b = 1, \cdots, K$. The final measure of performance is the loss for the metric defined in Eq. (1).
In the traffic dataset experiments we have and in electricity . The wiki dataset experiments have the parameters .
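As an illustrative sketch (not the paper's implementation), the rolling prediction protocol above can be written as a simple loop; the `forecast` argument and the naive baseline below are hypothetical stand-ins for any of the compared models, and `b` is 0-indexed here:

```python
def rolling_predict(series, t0, tau, K, forecast):
    """Sequentially predict K windows of size tau, revealing actuals after each.

    series:   full list of observed values (test values get "revealed" over time)
    t0:       length of the initial training period
    forecast: callable (history, tau) -> list of tau predicted values
    """
    predictions = []
    for b in range(K):
        t_b = t0 + b * tau              # end of data revealed so far
        history = series[:t_b]          # all values up to the last window
        predictions.extend(forecast(history, tau))  # predict next window
        # actual values series[t_b : t_b + tau] are revealed before the next round
    return predictions


def naive(history, tau):
    # hypothetical baseline: repeat the last observed value tau times
    return [history[-1]] * tau


preds = rolling_predict(list(range(10)), t0=6, tau=2, K=2, forecast=naive)
# → [5, 5, 7, 7]
```

The final metric is then computed over the concatenation of all predicted windows against the corresponding actuals.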
B.2 Loss Metrics
The following well-known loss metrics [11] are used in this paper. Here, $y^{(i)}_t$ represents the actual value of timeseries $i$ at time $t$, while $\hat{y}^{(i)}_t$ is the corresponding prediction.
(i) WAPE: Weighted Absolute Percent Error is defined as follows,
$$\mathrm{WAPE} = \frac{\sum_{i,t} \left|\hat{y}^{(i)}_t - y^{(i)}_t\right|}{\sum_{i,t} \left|y^{(i)}_t\right|}. \quad (7)$$
(ii) MAPE: Mean Absolute Percent Error is defined as follows,
$$\mathrm{MAPE} = \frac{1}{|\Omega|} \sum_{(i,t) \in \Omega} \frac{\left|\hat{y}^{(i)}_t - y^{(i)}_t\right|}{\left|y^{(i)}_t\right|}, \quad (8)$$
where $\Omega = \left\{(i,t) : y^{(i)}_t \neq 0\right\}$.
(iii) SMAPE: Symmetric Mean Absolute Percent Error is defined as follows,
$$\mathrm{SMAPE} = \frac{1}{|\Omega'|} \sum_{(i,t) \in \Omega'} \frac{2\left|\hat{y}^{(i)}_t - y^{(i)}_t\right|}{\left|\hat{y}^{(i)}_t\right| + \left|y^{(i)}_t\right|}, \quad (9)$$
where $\Omega' = \left\{(i,t) : \left|\hat{y}^{(i)}_t\right| + \left|y^{(i)}_t\right| \neq 0\right\}$.
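As a reference sketch, the three metrics in Eqs. (7)-(9) can be computed in a few lines of pure Python over flattened lists of actuals `y` and predictions `yhat` (the function names are ours, not the paper's code):

```python
def wape(y, yhat):
    # Eq. (7): sum of absolute errors over sum of absolute actuals
    return sum(abs(p - a) for a, p in zip(y, yhat)) / sum(abs(a) for a in y)


def mape(y, yhat):
    # Eq. (8): averaged only over entries with nonzero actuals
    terms = [abs(p - a) / abs(a) for a, p in zip(y, yhat) if a != 0]
    return sum(terms) / len(terms)


def smape(y, yhat):
    # Eq. (9): symmetric variant; skip entries where both values are zero
    terms = [2 * abs(p - a) / (abs(a) + abs(p))
             for a, p in zip(y, yhat) if abs(a) + abs(p) != 0]
    return sum(terms) / len(terms)
```

Note that WAPE is robust to zero actuals (only the aggregate denominator must be nonzero), which is one reason it is commonly preferred on sparse demand data.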
B.3 Model Parameters
In this section we describe the compared models in more detail. For a TC network, the important parameters are the kernel size/filter size, the number of layers, and the number of filters/channels per layer. A network described by $[c_1, c_2, c_3]$ implies that there are three layers with $c_i$ filters in layer $i$. For an LSTM, the parameters $[h, l]$ mean that the number of neurons in each hidden layer is $h$ and the number of hidden layers is $l$. All models are trained with early stopping, with a fixed tenacity or patience and a maximum number of epochs. The hyperparameters for all the models are as follows.
DeepGLO: In all the datasets, the leveled networks and both have parameters and kernel size is . has kernel size and parameters in traffic and electricity, while having on the wiki dataset. We set and both to in all experiments. The rank used in electricity, traffic and wiki are and respectively.
Local DLN: In all the datasets, the leveled networks have parameters and kernel size is .
TRMF: The rank used in electricity, traffic and wiki are and respectively. The lag indices are set to include the last day and the same day in the last week for traffic and electricity data.
SVD+Leveled: The rank used in electricity, traffic and wiki are and respectively. In all the datasets, the leveled networks have parameters and kernel size is .
LSTM: In all datasets the parameters are .
DeepAR: In all datasets the parameters are . These are the same hyperparameters specified in [7].
TC: In all the datasets, the parameters are and kernel size is .
Prophet: The parameters are selected automatically. The model is parallelized over cores. The model was run with growth = 'logistic', as this was found to perform the best.
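To make the TC parameter notation above concrete, the following is a minimal pure-Python sketch of a causal (temporal) convolution; the kernels, dilations, and the impulse input here are illustrative choices, not the paper's settings:

```python
def causal_conv1d(x, kernel, dilation=1):
    """Causal convolution: y[t] = sum_j kernel[j] * x[t - j*dilation],
    with implicit zero left-padding so y[t] never looks at the future."""
    k = len(kernel)
    out = []
    for t in range(len(x)):
        s = 0.0
        for j in range(k):
            idx = t - j * dilation
            if idx >= 0:                 # positions before t=0 are zero-padded
                s += kernel[j] * x[idx]
        out.append(s)
    return out


# A "network" with several layers stacks such convolutions, typically
# doubling the dilation per layer so the receptive field grows exponentially.
x = [1.0, 0.0, 0.0, 0.0, 0.0]            # impulse input
h = causal_conv1d(x, [0.5, 0.5], dilation=1)   # layer 1
h = causal_conv1d(h, [0.5, 0.5], dilation=2)   # layer 2
# h → [0.25, 0.25, 0.25, 0.25, 0.0]
```

With kernel size 2 and dilations 1 and 2, each output depends on up to 4 past inputs, which is why the impulse response above spreads over four timesteps; the number of filters per layer (the $c_i$ above) would correspond to running many such kernels in parallel per layer.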