Time-series forecasting is an important problem with many industrial applications such as retail demand forecasting, financial prediction, and forecasting traffic or weather patterns. In general, it plays a key role in automating business processes. Modern data-sets can have millions of correlated time-series over several thousand time-points. For instance, in an online shopping portal like Amazon or Walmart, one may be interested in the future daily demands for all items in a category, where the number of items may be in the millions. This leads to the problem of forecasting n time-series (one for each of the items), given past demands over t time-steps. Such a time-series data-set can be represented as a matrix Y ∈ R^{n × t}, and we are interested in the high-dimensional setting where n can be of the order of millions.
Traditional time-series forecasting methods operate on individual time-series or a small number of time-series at a time. These methods include the well-known AR, ARIMA and exponential smoothing models, the classical Box-Jenkins methodology, and more generally the linear state-space models. However, these methods are not easily scalable to large data-sets with millions of time-series, owing to the need for individual training. Moreover, they cannot benefit from shared temporal patterns in the whole data-set during training and prediction.
Deep networks have gained popularity in time-series forecasting recently, due to their ability to model non-linear temporal patterns. Recurrent Neural Networks (RNNs) have been popular in sequential modeling; however, they suffer from vanishing/exploding gradients during training. Long Short Term Memory (LSTM) networks alleviate that issue and have had great success in language modeling and other seq-to-seq tasks [9, 20]. Recently, deep time-series models have used LSTM blocks as internal components [7, 18]. Another popular architecture, competitive with LSTMs and arguably easier to train, is the temporal convolution/causal convolution architecture popularized by the wavenet model. Temporal convolutions have recently been used in time-series forecasting [2, 1]. These deep network based models can be trained on large time-series data-sets as a whole, in mini-batches. However, they still have two important shortcomings.
Firstly, most of the above deep models are difficult to train on data-sets that have wide variation in the scales of the individual time-series. For instance, in the item demand forecasting use-case, the demand for some popular items may be orders of magnitude greater than that of niche items. In such data-sets, each time-series needs to be appropriately normalized in order for training to succeed, and the predictions then need to be scaled back to the original scale. The mode and parameters of normalization are difficult to choose and can lead to different accuracies. For example, in [7, 18] each time-series is whitened using the corresponding empirical mean and standard deviation, while in other work the time-series are scaled by the corresponding value at the first time-point.
Secondly, even though these deep models are trained on the entire data-set, during prediction the models only focus on local past data, i.e., only the past data of a time-series is used for predicting the future of that time-series. However, in most data-sets, global properties may be useful at prediction time. For instance, in stock market prediction, it might be beneficial to also look at the past values of Alphabet's and Amazon's stock prices while predicting the stock price of Apple. Similarly, in retail demand forecasting, past values of similar items can be leveraged while predicting the future for a certain item. To this end, the authors of LSTNet propose a combination of 2D convolution and recurrent connections that can take in multiple time-series in the input layer, thus capturing global properties during prediction. However, this method does not scale beyond a few thousand time-series, owing to the growing size of the input layer. On the other end of the spectrum, TRMF is a temporally regularized matrix factorization model that can express all the time-series as linear combinations of basis time-series. These basis time-series can capture global patterns during prediction. However, TRMF can only model linear temporal dependencies. Moreover, there can be approximation errors due to the factorization, which can be interpreted as a lack of local focus, i.e., the model only concentrates on the global patterns during prediction.
In light of the above discussion, we aim to propose a deep learning model that can think globally and act locally, i.e., leverage both local and global patterns during training and prediction, and that can also be trained reliably even when there are wide variations in scale. The main contributions of this paper are as follows:
In Section 4.2, we propose a variation of the temporal convolution architecture, called the leveled network, that can be trained effectively on data-sets with wide scale variations, without a priori normalization.
In Section 5.1, inspired by TRMF, we introduce a matrix factorization model regularized by a leveled network (TRMF-DLN) that can express each time-series as a linear combination of k basis time-series, where k is much smaller than the number of time-series. Unlike TRMF, this model can capture non-linear dependencies, as the regularization and prediction are done using a leveled network trained concurrently, and it is also amenable to scalable mini-batch training. This model can handle global dependencies during prediction.
In Section 5.2, we combine the global model with a local leveled network trained on the original data-set, through a data-dependent attention model. This leads to a stronger hybrid model, namely DeepGLO (Deep Global LOcal Forecaster), that focuses on both local properties of individual time-series and global dependencies in the data-set during training and prediction.
2 Related Work
The literature on time-series forecasting is vast and spans several decades. Here, we will mostly focus on recent deep learning approaches. For a comprehensive treatment of traditional methods, we refer the readers to [11, 17, 3, 16, 10] and the references therein.
In recent years, deep learning models have gained popularity in time-series forecasting. DeepAR proposes an LSTM based model where the parameters of Bayesian models for the future time-steps are predicted as a function of the corresponding hidden states of the LSTM. Other work combines linear state space models with deep networks, or proposes a time-series model where the entire history of a time-series is encoded using an LSTM block and a multi-horizon MLP decoder is used to decode the input into future forecasts. LSTNet can leverage correlations between multiple time-series through a combination of 2D convolution and recurrent structures. However, it is difficult to scale this model beyond a few thousand time-series because of the growing size of the input layer. Temporal convolutions have also recently been used for time-series forecasting.
Matrix factorization with temporal regularization was first used in the context of speech de-noising. Perhaps closest to our work is TRMF, where the authors propose an AR based temporal regularization. In this paper, we extend this work to non-linear settings, where the temporal regularization can be performed by a leveled network (introduced in Section 4.2). We further combine the global matrix factorization model with a leveled network trained separately on the original data-set, using a linear attention model.
3 Problem Setting
We consider the problem of forecasting high-dimensional time series over a window of future time-steps. High-dimensional time-series data-sets consist of several possibly correlated time-series evolving over time, and the task is to forecast the values of those time-series at future time-steps. Before we formally define the problem, we will set up some notation.
Notation: We will use bold capital letters to denote matrices, such as Y. Y(i, j) and Y_{ij} will be used interchangeably to denote the (i, j)-th entry of the matrix Y. Let [n] := {1, 2, ..., n} for a positive integer n. For I ⊆ [n] and J ⊆ [t], the notation Y[I, J] will denote the sub-matrix of Y with rows in I and columns in J. Y[:, J] means that all the rows are selected, and similarly Y[I, :] means all the columns are chosen. J + s denotes the set of elements in J each increased by s. The notation t1 : t2, for positive integers t1 ≤ t2, is used to denote the set {t1, t1 + 1, ..., t2}. ||·||_F and ||·||_2 denote the Frobenius and spectral norms, respectively. By convention, all vectors in this paper are row vectors unless otherwise specified. ||y|| denotes the ℓ2-norm of the vector y. y[i : j] denotes the sub-vector (y_i, ..., y_j), where y_i denotes the i-th coordinate of y. [y1, y2] will be used to denote the concatenation of two row vectors y1 and y2. For a vector y, mean(y) denotes the empirical mean of the coordinates and std(y) denotes the empirical standard deviation.
Forecasting Task: A time-series data-set can be represented by a matrix Y ∈ R^{n × (t + τ)}, where n is the number of time-series, t is the number of time-points observed during the training phase, and τ is the window size for forecasting. Ŷ ∈ R^{n × τ} is used to denote the predicted values in the test range. y^{(i)} is used to denote the i-th time series, i.e., the i-th row of Y.
Objective: The quality of the predictions is generally measured using a metric calculated between the predicted and actual values in the test range. One popular metric is the normalized absolute deviation, defined as follows:

L(Y, Ŷ) = ( Σ_{i,j} |Y(i, j) − Ŷ(i, j)| ) / ( Σ_{i,j} |Y(i, j)| ),    (1)

where Y and Ŷ are the observed and predicted values, respectively. This metric is sometimes referred to as WAPE in the forecasting literature. Other commonly used evaluation metrics are defined in Section B.2. Note that (1) is also used as the loss function in our proposed models.
4 DLN: A Deep Leveled Network
In this section, we propose the Deep Leveled Network (DLN), an architectural variation of the temporal convolution network designed to handle high-dimensional time-series data with wide variations in scale, without normalization. The DLN constitutes an important component of both the global and local models in DeepGLO in Section 5.
The most commonly used base architecture in deep time-series models is the Long Short Term Memory (LSTM) network. However, more recent works have observed that sequence-to-sequence tasks can be performed effectively by dilated temporal convolution architectures (popularized by the Wavenet model). In addition, temporal convolutions are more stable during training and can provide competitive or better performance compared to recurrent models [1, 2]. Therefore, in this work we will use Temporal Convolution (TC) blocks as the basic building blocks of our prediction models. We provide a brief description of TC in the next section, for the sake of completeness.
4.1 Temporal Convolution
Temporal convolutions (also referred to as causal convolutions) [22, 2, 1] are multi-layered neural networks comprised of 1D convolutional layers. The major difference from regular 1D convolution is the introduction of dilation in the filter taps, which guarantees a much larger dynamic range. Dilation is equivalent to introducing a fixed step between every two adjacent filter taps; the dilation in layer l is usually set to 2^l. A temporal convolution network with filter size k and L layers (with dilations 1, 2, ..., 2^{L−1}) therefore has a dynamic range (or look-back) of 1 + (k − 1)(2^L − 1). An example of a temporal convolution network with one channel per layer is shown in Fig. 1. TC networks can also have residual connections, as shown in Fig. 1. For more details, we refer the readers to the general descriptions in prior work. Note that in practice, we can have multiple channels per layer of a TC network.
Thus, a temporal convolution network can be treated as an operator that takes in the past values y[1 : j] of a time-series and outputs the one-step look-ahead prediction ŷ_{j+1}. We will denote a temporal convolution network by T_W(·), where W denotes the parameter weights of the network. Thus, we have ŷ_{j+1} = T_W(y[1 : j]). The same operator can be defined on matrices: given Y and a set of row indices I, we can write Ŷ[I, j + 1] = T_W(Y[I, 1 : j]).
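To make the dilation mechanics concrete, here is a minimal NumPy sketch (an illustration of the mechanism, not the paper's implementation; the helper names are ours) of a single causal dilated convolution layer and of the look-back computation described above:

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """Causal 1D convolution: output y[j] depends only on x[j], x[j-d], x[j-2d], ...
    The input is left-padded with zeros so the output has the same length as x."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left zero-padding keeps causality
    return np.array([sum(w[i] * xp[j + pad - i * dilation] for i in range(k))
                     for j in range(len(x))])

def receptive_field(kernel_size, n_layers):
    """Look-back of a stack of layers with dilations 1, 2, ..., 2^(L-1)."""
    return 1 + (kernel_size - 1) * (2 ** n_layers - 1)
```

With filter size 2 and 3 layers, the look-back is 8 time-steps, matching the exponential growth of the dynamic range with depth.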
4.2 Leveled Network - Handling Diverse Scales
Large scale time-series data-sets containing upwards of hundreds of thousands of time-series can have very diverse scales. The diversity in scale leads to issues in training deep models, both temporal convolution and LSTM based architectures, and some normalization is needed for training to succeed [14, 2, 7]. However, selecting the correct normalizing factor for each time-series is not an exact science and can materially affect predictive performance. For instance, in some prior work the data-sets are whitened using the training standard deviation and mean of each time-series during training, and the predictions are then renormalized. In other work, each time-series is rescaled by the value of that time-series at the first time-step. Moreover, when performing rolling predictions using a pre-trained model, there is a potential need to update the scaling factors as new time-points are observed. In this section, we propose the DLN, a simple leveled network architecture that can be trained on diverse data-sets without the need for a priori normalization.
The DLN consists of two temporal convolution blocks (having the same dynamic range) that are trained concurrently. Let us denote the two networks by T_V(·) (the leveling component) and T_W(·) (the residual component), with associated weights V and W. The key idea is to have the leveling component predict the rolling mean of the time-series given the past, while the residual component predicts the variations with respect to this mean value. Given an appropriate window size w, the rolling mean stays stable for each time-series and can be predicted by a simple temporal convolution model, and given these predictions the additive variations are relatively scale free, i.e., the residual network can be trained reliably without normalization. This can be summarized by the following equations:

ŷ^{(lv)}_{j+1} = T_V(y[1 : j]),   ŷ^{(res)}_{j+1} = T_W(y[1 : j]),   ŷ_{j+1} = ŷ^{(lv)}_{j+1} + ŷ^{(res)}_{j+1},    (2)

where we want ŷ^{(lv)}_{j+1} to be close to the rolling mean of the series around time j + 1, and ŷ^{(res)}_{j+1} to be close to the remaining variation after subtracting that mean. An illustration of the leveled network methodology is shown in Figure 2.
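To make the decomposition concrete, here is a small NumPy sketch (an illustration of the idea, not the paper's code; the window convention is our assumption) of the two regression targets: the leveling component regresses on a rolling mean of the series, and the residual component regresses on what is left after subtracting that mean.

```python
import numpy as np

def dln_targets(y, w):
    """Split a series into the two DLN targets.
    level[j]    : rolling mean of the last w values ending at j
                  (target for the leveling network).
    residual[j] : y[j] - level[j], roughly scale-free
                  (target for the residual network)."""
    level = np.array([y[max(0, j - w + 1): j + 1].mean() for j in range(len(y))])
    residual = y - level
    return level, residual
```

By construction, level + residual reconstructs the series exactly, and the residual stays in a narrow range even when the raw series is large in magnitude, which is what makes training without normalization feasible.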
Training: Both networks can be trained concurrently on the training set, using mini-batch stochastic gradient updates. The pseudo-code for training a DLN is described in Algorithm 1. The loss function used is the same as the metric defined in Eq. (1). Note that in Step 9, the leveling component is held fixed and only the residual component is updated.
Prediction: The trained model can be used for multi-step look-ahead prediction in a standard manner. Given the past data-points y[1 : j] of a time-series, the prediction for the next time-step ŷ_{j+1} is given by Eq. (2). Now, the one-step look-ahead prediction can be concatenated with the past values to form the sequence [y[1 : j], ŷ_{j+1}], which can again be passed through the network to obtain the next prediction ŷ_{j+2}. The same procedure can be repeated τ times to predict τ time-steps ahead in the future.
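The multi-step procedure can be sketched as a simple feedback loop, where `model` stands for any one-step predictor such as a trained DLN (a hypothetical callable here, taking the full past and returning the next value):

```python
import numpy as np

def rollout(model, past, tau):
    """Predict tau steps ahead by feeding each one-step prediction back in."""
    seq = list(past)
    preds = []
    for _ in range(tau):
        nxt = model(np.array(seq))   # one-step look-ahead prediction
        preds.append(nxt)
        seq.append(nxt)              # concatenate the prediction to the history
    return np.array(preds)
```

Note that prediction errors can compound over the τ steps, which is the usual trade-off of autoregressive rollouts.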
In Section 6, we shall see that the DLN can be trained on unnormalized data-sets with wide variations in scale. In the next section, we will show how the DLN can be used as a key component in both the global and local parts of DeepGLO, our proposed hybrid model.
5 DeepGLO: A Deep Global Local Forecaster
In this section, we propose DeepGLO, our complete hybrid model, which consists of separate global and local components combined through a data-driven attention mechanism. In Section 5.1, we present the global component, a temporal matrix factorization model regularized by a deep leveled network (TRMF-DLN). This model can capture global patterns during prediction by representing each of the original time-series as a linear combination of k basis time-series, where k ≪ n. The local model is simply a DLN trained on the original data-set. In Section 5.2, we detail how the global and local models are combined.
5.1 Temporal Matrix Factorization regularized by Deep Leveled Network (TRMF-DLN)
In this section, we propose a low-rank matrix factorization model for time-series forecasting that uses a leveled network for regularization. This approach is inspired by the temporally regularized matrix factorization (TRMF) approach.
The idea is to factorize the training time-series matrix Y[:, 1 : t] into low-rank factors F ∈ R^{n × k} and X ∈ R^{k × t}, where k ≪ n. This is illustrated in Figure 3. Further, we would like to encourage a temporal structure in the X matrix, such that the future values X[:, t + 1 : t + τ] in the test range can also be forecasted. Thus, the matrix X can be thought of as comprising k basis time-series that capture the global temporal patterns in the whole data-set, and all the original time-series are linear combinations of these basis time-series. In the next subsection, we will describe how a DLN can be used to encourage the temporal structure for X.
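The global structure can be illustrated with plain NumPy on synthetic data (the factors here are fitted by simple least squares rather than by the paper's training algorithm): every row of Y is a linear combination of the k rows of X, so forecasting the k basis series is enough to forecast all n series.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, k = 50, 200, 4                  # n series, t time-points, rank k

X = rng.standard_normal((k, t))       # basis time-series (global patterns)
F = rng.standard_normal((n, k))       # per-series combination weights
Y = F @ X                             # exactly rank-k data, for illustration

# Recover the combination weights from Y given the basis, by least squares.
F_hat = Y @ np.linalg.pinv(X)
Y_hat = F_hat @ X                     # reconstruction from k basis series
```

On real data Y is only approximately low rank, so the factorization incurs an approximation error; the temporal regularization described next is what makes the recovered X forecastable.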
Temporal Regularization by a DLN: If we are provided with a leveled network that captures the temporal patterns in the training data-set, then we can encourage temporal structure in X using this model. Let us denote the said leveled network by T_X(·). The temporal patterns can be encouraged by including the following temporal regularization term in the objective function:

R(X) = L( X[:, r + 1 : t], X̂[:, r + 1 : t] ),   where   X̂[:, j + 1] = T_X(X[:, 1 : j])

and L is the metric defined in Eq. (1). This implies that the values of X at time-index j + 1 are close to the predictions from the temporal network applied on the past values between time-steps j − r + 1 and j. Here, r is ideally equal to the dynamic range of the network defined in Section 4. Thus, the overall loss function for the factors and the temporal network is as follows:

L_G(F, X, T_X) = L( Y[:, 1 : t], F X ) + λ_T R(X),    (6)

where λ_T is the regularization parameter for the temporal regularization component.
Training: The low-rank factors and the temporal network can be trained alternately to approximately minimize the loss in Eq. (6). The overall training can be performed through mini-batch SGD and can be broken down into two main components performed alternately: (i) given a fixed network T_X, approximately minimize Eq. (6) with respect to the factors F and X (Algorithm 2); and (ii) train the network T_X on the matrix X using Algorithm 1.
The overall algorithm is detailed in Algorithm 3. The global DLN is first initialized by training on a normalized version of Y. Here, by normalized we mean that each time-series is centered and scaled by its empirical mean and standard deviation in the training range. Note that this normalized version is only used to initialize the network. Then, in the second initialization step, the two factors F and X are trained using the initialized network (step 3) for a fixed number of iterations. This is followed by alternating steps to update F, X, and T_X (steps 5-7).
Rolling Prediction without Retraining: Once the model is trained using Algorithm 3, we can make predictions on the test range using multi-step look-ahead prediction. The method is straight-forward: we first use T_X to make multi-step look-ahead predictions on the basis time-series in X, as detailed in Section 4, to obtain X̂[:, t + 1 : t + τ]; the original time-series predictions can then be obtained as Ŷ[:, t + 1 : t + τ] = F X̂[:, t + 1 : t + τ]. This model can also be adapted to make rolling predictions without retraining. In the case of rolling predictions, the task is to train the model on a training period, say 1 : t, then make predictions on a future time period, say t + 1 : t + τ; the actual values in that range are then revealed, and after incorporating these values we make further predictions for a time range further in the future, and so on. The key challenge in this scenario is to incorporate the newly observed values to generate the values of the basis time-series in that period, i.e., X[:, t + 1 : t + τ]. We propose to obtain these values by minimizing the global loss defined in (6) while keeping F and T_X fixed:
Once we obtain X[:, t + 1 : t + τ], we can make predictions for the next set of future time-periods. Note that the TRMF model needed to be retrained from scratch to incorporate newly observed values. In this work, retraining is not required to achieve good performance, as we shall see in our experiments in Section 6.
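A hypothetical sketch of the rolling update: when new observations arrive, only the corresponding columns of X are solved for, with F held fixed. For brevity this sketch uses plain least squares in place of minimizing the full regularized loss (6), i.e., it ignores the temporal-network term.

```python
import numpy as np

def update_basis(F, Y_new):
    """Given fixed factors F (n x k) and newly observed values Y_new (n x tau),
    recover the new basis columns X_new (k x tau) by least squares,
    solving min_X ||Y_new - F X||_F with F held fixed."""
    X_new, *_ = np.linalg.lstsq(F, Y_new, rcond=None)
    return X_new
```

Because only the small k × τ block of X is updated, new data can be folded in cheaply, which is what allows rolling predictions without retraining the whole model.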
5.2 Combining Global and Local Models
In Section 5.1, we introduced TRMF-DLN, a model that can leverage global information from the whole data-set even at prediction time, through linear combinations of the learned basis time-series. In this section, we will extend that model to leverage both global information (through the basis time-series in TRMF-DLN) and local information (through a separate DLN that is trained on the original data-set and predicts the future values of each time-series based on its own past observed values). The final output is produced by a linear attention model that combines the global and local predictions.
The idea is to have a simple temporal convolution network that outputs attention weights based on the past values of a time series. These attention weights are used to combine the local and global predictions. For the sake of exposition, we write the relations for one-step look-ahead predictions as follows:

ŷ_{j+1} = a_{j+1} ⊙ ŷ^{(l)}_{j+1} + b_{j+1} ⊙ ŷ^{(g)}_{j+1},

where ŷ^{(l)} and ŷ^{(g)} are the local and global predictions, (a, b) are the attention weights output by the attention network, and ⊙ denotes element-wise multiplication.
Thus, the final predictions are linear combinations of the local and global predictions. The linear attention weights are predicted by a temporal convolution network based on the past values of the time-series. This network can be trained by mini-batch SGD, as detailed in our complete algorithm in Algorithm 4. Steps 1 and 2 involve training the global TRMF-DLN model and the locally focused leveled network on the original data-set. Steps 6-11 give the pseudo-code for training the attention network.
Note that rolling predictions and multi-step predictions can be performed by DeepGLO, as the global TRMF-DLN model and the local DLN can forecast independently, without any need for re-training. We illustrate some representative results on time-series from the traffic and electricity data-sets in Fig. 4. In Fig. 4(a), we show some of the basis time-series (global features) extracted from the traffic data-set, which can be combined linearly to yield the individual original time-series. It can be seen that the basis series are highly temporal and can be predicted in the test range using the network T_X. Fig. 4(b) shows global and local predictions on the same time-series, which are combined through the attention weights to provide the final predictions. In Fig. 4(c), we display a scatter plot of the local vs. global attention weights in the test range over the time-points of all time-series in traffic and electricity (most weights fall in [0, 1], therefore the range is restricted to [0, 1]). We see that in electricity more attention is on the local component, while in traffic the attention is almost equally split between the global and local components.
Table 2 (fragment), Global-Only TRMF (retrained), WAPE/MAPE/SMAPE: 0.104/0.280/0.151, 0.105/0.431/0.183, 0.159/0.226/0.181, 0.210/0.322/0.275, 0.309/0.847/0.451, 0.320/0.938/0.503.
6 Empirical Results
In this section, we validate our model on three real-world data-sets on rolling prediction tasks (see Section B.1 for more details), against other benchmarks. The data-sets in consideration are: electricity - hourly load for a set of households, where the task is to perform rolling validation for the next 7 days (24 time-points at a time, for 7 windows), as done in [25, 18, 7]; traffic - hourly occupancy of roads in San Francisco, with the same rolling validation task for the next 7 days (24 time-points at a time, for 7 windows), as done in [25, 18, 7]; and wiki - daily web-traffic on Wikipedia articles, where we only keep the time-series without missing values from the original data-set, the values for each day are normalized by the total traffic on that day across all time-series and then multiplied by a fixed constant, and the task is to perform rolling validation for the next 86 days, 14 days at a time. Training set sizes and further data statistics indicating scale variations are provided in Table 1.
For each data-set, all models are compared under two different settings. In the first setting, the models are trained on a normalized version of the data-set, where each time-series in the training set is centered and scaled by its empirical mean and standard deviation, and the predictions are then mapped back to the original scale. In the second setting, the data-set is unnormalized, i.e., left as-is during training and prediction. Note that all models are compared on the original scale of the data in the test range. The purpose of these two settings is to measure the impact of scaling on the accuracy of the different models.
The models under contention are: (1) DeepGLO: the combined local and global TRMF-DLN model proposed in Section 5.2. (2) Local DLN: the leveled temporal convolution based architecture proposed in Section 4. (3) LSTM: a simple LSTM block that predicts the time-series values as a function of the hidden states. (4) DeepAR: the LSTM based probabilistic forecasting model. (5) TC: a simple temporal convolution model as described in Section 4. (6) Prophet: the versatile forecasting model from Facebook based on classical techniques. (7) TRMF: the temporally regularized matrix factorization model; note that this model needs to be retrained for every rolling prediction window. (8) SVD+DLN: a combination of SVD and a leveled network, where the original data is factorized using SVD and a leveled network is trained on the resulting temporal factor. This is a simple baseline for a global-only approach.
Note: We use the same hyper-parameters for DeepAR on the traffic and electricity data-sets as specified in the original paper. The WAPE values from the original paper could not be directly used, as different values are reported in different papers. The TRMF model was trained with different hyper-parameters (a larger rank) than in the original paper, and therefore the results are slightly better. More details about all the hyper-parameters, including the ranks used for DeepGLO/TRMF on each data-set, are provided in Section B.
In Table 2, we report WAPE/MAPE/SMAPE (see definitions in Section B.2) on all three data-sets under both normalized and unnormalized training. We can see that DeepGLO features among the top two models in almost all categories, under all three metrics. TRMF does marginally better on the normalized traffic data-set in terms of SMAPE, and also in terms of MAPE on the unnormalized traffic data-set. This may be attributed to the fact that the lag indices in TRMF can be chosen to account directly for weekly dependencies, and that it is retrained. It should be noted that DeepGLO performs substantially better than the other models on the much larger wiki data-set; in fact, it does at least 30% better in terms of MAPE than the next best model not proposed in this paper. Also note that DeepGLO performs better on wiki using unnormalized data. We would also like to point out that the leveled network can train well on unnormalized data, while the other deep models (LSTM, DeepAR, TC) do not perform well in this setting. More results are provided in Section A, where we plot predictions from the different methods on representative time-series.
7 Conclusion and Future Work
In this paper, we propose DeepGLO, a deep "glocalized" temporal model for high-dimensional time-series forecasting. DeepGLO combines a global matrix factorization model regularized by a leveled network with a local leveled network trained on the original data-set. This hybrid method shows considerable gains over the state of the art on real data-sets and can train reliably in the presence of diverse scales. In this paper, we focus on making point forecasts; however, we believe that DeepGLO can be extended to make probabilistic forecasts, by enabling the leveled network to predict both the mean value and a Bayesian confidence interval for future time-points. We leave this direction as future work. Another possible direction for future research is to use DeepGLO for missing value imputation in time-series data. This is especially interesting given the global TRMF-DLN model, owing to the matrix factorization structure.
-  Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
-  Anastasia Borovykh, Sander Bohte, and Cornelis W Oosterlee. Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691, 2017.
-  George EP Box and Gwilym M Jenkins. Some recent advances in forecasting and control. Journal of the Royal Statistical Society. Series C (Applied Statistics), 17(2):91–109, 1968.
-  Chris Chatfield. Time-series forecasting. Chapman and Hall/CRC, 2000.
-  Marco Cuturi. Fast global alignment kernels. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 929–936, 2011.
-  Facebook. Fbprophet. https://research.fb.com/prophet-forecasting-at-scale/, 2017. [Online; accessed 07-Jan-2019].
-  Valentin Flunkert, David Salinas, and Jan Gasthaus. Deepar: Probabilistic forecasting with autoregressive recurrent networks. arXiv preprint arXiv:1704.04110, 2017.
-  Ken-ichi Funahashi and Yuichi Nakamura. Approximation of dynamical systems by continuous time recurrent neural networks. Neural networks, 6(6):801–806, 1993.
-  Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with lstm. 1999.
-  James Douglas Hamilton. Time series analysis, volume 2. Princeton university press Princeton, NJ, 1994.
-  Rob Hyndman, Anne B Koehler, J Keith Ord, and Ralph D Snyder. Forecasting with exponential smoothing: the state space approach. Springer Science & Business Media, 2008.
-  Kaggle. Wikipedia web traffic. https://www.kaggle.com/c/web-traffic-time-series-forecasting/data, 2017. [Online; accessed 07-Jan-2019].
-  Kyoung-jae Kim. Financial time series forecasting using support vector machines. Neurocomputing, 55(1-2):307–319, 2003.
-  Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long-and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 95–104. ACM, 2018.
-  Paul D Larson. Designing and managing the supply chain: Concepts, strategies, and case studies, david simchi-levi philip kaminsky edith simchi-levi. Journal of Business Logistics, 22(1):259–261, 2001.
-  Helmut Lütkepohl. New introduction to multiple time series analysis. Springer Science & Business Media, 2005.
-  ED McKenzie. General exponential smoothing and the equivalent arma process. Journal of Forecasting, 3(3):333–344, 1984.
-  Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems, pages 7796–7805, 2018.
-  Matthias W Seeger, David Salinas, and Valentin Flunkert. Bayesian intermittent demand forecasting for large inventories. In Advances in Neural Information Processing Systems, pages 4646–4654, 2016.
-  Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. Lstm neural networks for language modeling. In Thirteenth annual conference of the international speech communication association, 2012.
-  Artur Trindade. Electricityloaddiagrams20112014 data set. https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014, 2011. [Online; accessed 07-Jan-2019].
-  Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. CoRR abs/1609.03499, 2016.
-  Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.
-  Kevin W Wilson, Bhiksha Raj, and Paris Smaragdis. Regularized non-negative matrix factorization with temporal dependencies for speech denoising. In Ninth Annual Conference of the International Speech Communication Association, 2008.
-  Hsiang-Fu Yu, Nikhil Rao, and Inderjit S Dhillon. Temporal regularized matrix factorization for high-dimensional time series prediction. In Advances in neural information processing systems, pages 847–855, 2016.
Appendix A More Empirical Results
In this section, we provide some more illustrations. In Fig. 5, we provide the actual and predicted values from all the methods on some selected time-series from the electricity data-set. The first row shows results from normalized training and the second row from unnormalized training. We can see that the other local deep models (DeepAR, LSTM, TC) fail to train in the unnormalized setting, as there are wide scale variations. On average, the DeepGLO predictions are closer to the actual values. In Fig. 6, we do the same for the traffic data-set. We can see that normalization does not matter much here, as scales do not vary much in this data-set. In Fig. 7, we do the same for the wiki data-set. The DeepGLO predictions are substantially better on this larger data-set.
Appendix B More Experimental Details
We provide more details about the experiments, including the exact rolling prediction setting for each data-set, the evaluation metrics, and the model hyper-parameters.
B.1 Rolling Prediction
In our experiments in Section 6, we compare model performance on rolling prediction tasks. The goal in this setting is to predict future time-steps in batches, as more data is revealed. Suppose the initial training time-period is 1 : t, the rolling window size is τ, and the number of test windows is K. The rolling prediction task is a sequential process: given the data up to the last window, we predict the values for the next future window of τ time-steps; then the actual values for that window are revealed, and the process is repeated for all K windows. The final measure of performance is the metric defined in Eq. (1), computed over all predicted windows.
In the traffic and electricity data-set experiments we have τ = 24 and K = 7, while the wiki data-set experiments use τ = 14.
B.2 Loss Metrics
The following well-known loss metrics are used in this paper. Here, Y represents the actual values while Ŷ represents the corresponding predictions, over N predicted entries in the test range.

(i) WAPE: Weighted Absolute Percent Error is defined as follows:

WAPE(Y, Ŷ) = ( Σ_{i,j} |Y(i, j) − Ŷ(i, j)| ) / ( Σ_{i,j} |Y(i, j)| ).

(ii) MAPE: Mean Absolute Percent Error is defined as follows, computed over the N' entries with nonzero actual values:

MAPE(Y, Ŷ) = (1 / N') Σ_{(i,j) : Y(i,j) ≠ 0} |Y(i, j) − Ŷ(i, j)| / |Y(i, j)|.

(iii) SMAPE: Symmetric Mean Absolute Percent Error is defined as follows:

SMAPE(Y, Ŷ) = (1 / N) Σ_{i,j} 2 |Y(i, j) − Ŷ(i, j)| / ( |Y(i, j)| + |Ŷ(i, j)| ).
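These three metrics, as defined above, translate directly to a few lines of NumPy (with MAPE computed only over time-points whose actual value is nonzero, to avoid division by zero):

```python
import numpy as np

def wape(y, yhat):
    """Weighted Absolute Percent Error: total error over total magnitude."""
    return np.abs(y - yhat).sum() / np.abs(y).sum()

def mape(y, yhat):
    """Mean Absolute Percent Error over entries with nonzero actual value."""
    nz = y != 0
    return (np.abs(y - yhat)[nz] / np.abs(y)[nz]).mean()

def smape(y, yhat):
    """Symmetric Mean Absolute Percent Error."""
    return (2 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat))).mean()
```

Note that WAPE weights every time-point by its magnitude, so large-scale series dominate it, whereas MAPE and SMAPE weight all (nonzero) points equally.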
B.3 Model Parameters
In this section, we describe the compared models in more detail. For a TC network, the important parameters are the kernel size (filter size), the number of layers, and the number of filters/channels per layer. A TC network described by a list [c_1, c_2, c_3] has three layers with c_i filters in layer i. For an LSTM, the parameters [h, L] mean that the number of neurons per hidden layer is h and the number of hidden layers is L. All models are trained with early stopping, with a fixed tenacity (patience) and a maximum number of epochs. The hyper-parameters for all the models are as follows.
DeepGLO: In all the datasets, the leveled networks and both have parameters and kernel size is . has kernel size and parameters in traffic and electricity, while having on the wiki dataset. We set and both to in all experiments. The rank used in electricity, traffic and wiki are and respectively.
Local DLN: In all the datasets, the leveled networks have parameters and kernel size is .
TRMF: The rank used in electricity, traffic and wiki are and respectively. The lag indices are set to include the last day and the same day in the last week for traffic and electricity data.
SVD+Leveled: The rank used in electricity, traffic and wiki are and respectively. In all the datasets, the leveled networks have parameters and kernel size is .
LSTM: In all datasets the parameters are .
DeepAR: In all data-sets, we use the same hyper-parameters specified in the original paper.
TC: In all the datasets, the parameters are and kernel size is .
Prophet: The parameters are selected automatically. The model is parallelized over cores. The model was run with growth = ’logistic’, as this was found to perform the best.