Time series (TS) forecasting is an important business problem and a fruitful application area for machine learning (ML). It underlies most aspects of modern business, including such critical areas as inventory control and customer management, and business planning going from production and distribution to finance and marketing. As such, it has a considerable financial impact, often ranging in the millions of dollars for every point of forecasting accuracy gained(Jain, 2017; Kahn, 2003)
. And yet, unlike areas such as computer vision or natural language processing where deep learning (DL) techniques are now well entrenched, there still exists evidence that such techniques struggle to outperform classical statistical approaches for TS forecasting(Makridakis et al., 2018a, b). For instance, the rankings of the six “pure” ML methods submitted to M4 competition were 23, 37, 38, 48, 54, and 57 out of a total of 60 entries, and most of the best-ranking methods were ensembles of classical statistical techniques (Makridakis et al., 2018b).
On the other hand, the M4 competition winner, an approach by S. Smyl, was based on a hybrid between neural residual/attention dilated LSTM stack with a classical Holt-Winters statistical model (Holt, 1957, 2004; Winters, 1960) with learnable parameters. Since Smyl’s approach heavily depends on this Holt-Winters component, Makridakis et al. (2018b) further argues that “hybrid approaches and combinations of method are the way forward for improving the forecasting accuracy and making forecasting more valuable”. In this work we aspire to challenge this conclusion by exploring the potential of pure DL architectures in the context of the TS forecasting. Therefore, we state the following two hypotheses.
Hypothesis 1. A pure DL model can be constructed to outperform statistical models and their combinations with ML on the set of univariate TS forecasting tasks represented by the M4 dataset.
Another question raised by Makridakis et al. (2018b) in the context of statistical TS model combination with ML is “which features make a combination approach more accurate than others?” We are interested in a related question, but rather in the context of interpretable DL architecture design: can we induce a suitable inductive bias in the model architecture to make its internal operations more interpretable, in the sense of teasing out the intuitive driving factors combining to produce a given forecast? This question is synthesized as the following hypothesis.
Hypothesis 2. A suitable inductive bias can by encoded in the DL architecture to make its outputs more interpretable, without significantly sacrificing forecasting accuracy.
1.1 Summary of Contributions
Deep Neural Architecture: To the best of our knowledge, this is the first work to empirically demonstrate that pure DL using no TS specific components outperforms well-established statistical approaches on M4 (by 11% over statistical benchmark, by 7% over the best statistical entry, and by 3% over the M4 competition winner). This invalidates the hypotheses 1, 4, 5 stipulated by Makridakis et al. (2018b) by showing that the proposed DL architecture is by far the superior approach for large-scale univariate TS forecasting problems. This provides a long-missing proof of concept for the use of pure ML in TS forecasting and strengthens motivation to continue advancing the research in this area.
Interpretable DL for Time Series: In addition to accuracy benefits, we also show that it is feasible to design an architecture with interpretable outputs that can be used by practitioners in very much the same way as traditional decomposition techniques such as the “seasonality-trend-level” approach (Cleveland et al., 1990).
2 Problem Statement
We consider the univariate point forecasting problem in discrete time. Given a length- forecast horizon, a length- observed series history
, the task is to predict the vector of future values. For simplicity, we will later consider a lookback window of length ending with the last observed value to serve as model input, and denoted . We denote the forecast of . We consider the problem in the context of the M4 competition and hence use the predictive accuracy metrics defined by Makridakis et al. (2018b) to evaluate competition entries:
Here is the periodicity of the data (e.g., 12 for monthly series). (symmetric Mean Absolute Percentage Error) and (Mean Absolute Scaled Error) are standard scale-free metrics in the practice of forecasting (Hyndman and Koehler, 2006; Makridakis and Hibon, 2000): whereas scales the error by the average between the forecast and ground truth, the scales by the average error of the naïve predictor that simply copies the observation measured periods in the past, thereby accounting for seasonality. (overall weighted average) is a M4-specific metric used to rank competition entries (M4 Team, 2018b), where and metrics are normalized such that a seasonally-adjusted naïve forecast obtains .
2.1 Dataset Description
M4 is the latest in an influential series of forecasting competitions organized by Spyros Makridakis since 1982 (Makridakis et al., 1982). The 100k-series dataset is large and diverse, consisting of data frequently encountered in business, financial and economic forecasting, and sampling frequencies ranging from hourly to yearly. Table 1 in the supplementary Section S1 outlines its composition and summary statistics, showing wide variability in TS characteristics (smooth vs erratic) defined by Syntetos et al. (2005).
Our architecture design methodology relies on a few key principles. First, the base architecture should be simple and generic, yet expressive (deep). Second, the architecture should not rely on time-series-specific feature engineering or input scaling. These prerequisites let us test Hypothesis 1. Finally, as a prerequisite to test Hypothesis 2, the architecture should be extendable towards making its outputs interpretable. We now discuss how those principles converge to the proposed architecture.
3.1 Basic Block
The proposed basic building block has a fork architecture and is depicted in Fig. 1 (left). It accepts an input and outputs two vectors, and . Input is a history lookback window of certain length ending with the last measured observation, is the block’s forward forecast of length , and
is the block’s best estimate of, also known as the ‘backcast’, given the constraints on the functional space that the block can use to approximate signals. We set the length of input window to a multiple of the forecast horizon , and typical lengths of in our setup range from to .
Internally, the basic building block consists of two parts. First, the waveform generator is a map from points in the time domain to points in the forecast (or backcast) value domain, . The number of points in forecast and backcast is different, in general. The waveform generator is parameterized with . The function of is to provide a sufficiently rich set of time-varying waveforms, selectable by varying . As shown below, can either be chosen to be learnable or can be set to specific functional forms to appropriately constrain the structure of outputs. Concrete examples of are discussed in Section 3.3.
Second, the forward and the backward predictors of parameters of the waveform generator are maps and . The task of is to predict the forward expansion parameters with the ultimate goal of optimizing the accuracy of the partial forecast . Conversely, the task of is to produce an estimate of with the ultimate goal of helping the downstream blocks by removing components of their input that are not helpful for forecasting. We use that has a form of the multi-layer fully connected (FC) neural network with ReLU non-linearity (Nair and Hinton, 2010). The number of hidden layers in the network is 4 and the width of each layer may consist of 256, 512 or 2048 units depending on where it is used (more details in Section 3.3).
3.2 Doubly Residual Stacking
The classical well-known residual network architecture adds the input of the stack of layers to its output before passing the result to the next stack (He et al., 2016). The DenseNet architecture proposed by Huang et al. (2017) further extends this principle by introducing extra connections from the output of each layer stack to the input of every other layer stack that follows it. These approaches provide clear advantages in improving the trainability of very deep architectures. Their disadvantage in the context of this work is that they result in network structures that are difficult to interpret.
We propose a novel hierarchical doubly residual topology depicted in Fig. 1
(middle and right). This architecture has two residual branches, one running over backcast prediction branch of each layer and the other one is running over the forecast branch of each layer. The backcast residual branch can be thought of as running a sequential analysis of the input signal. Each block removes the portion of the signal that it can explain well, making the forecast job of the downstream blocks easier. This structure also facilitates more fluid gradient backpropagation. More importantly, each layer outputs a partial forecast that is first aggregated at the stack level and then at the overall network level, providing a hierarchical decomposition. In a generic model context, when stacks are allowed to have arbitraryfor each layer, this does not have an effect other than making the network more transparent to gradient flows. In a special situation of deliberate structure enforced in shared over a stack, this has the critical importance of enabling interpretability via the aggregation of meaningful partial forecasts.
We propose two configurations of the architecture, based on the selection of . One of them is generic DL that can be used to test Hypothesis 1. The other one is augmented with certain inductive biases to be interpretable and can be used to test Hypothesis 2.
The generic architecture does not rely on TS-specific knowledge. We set to be a linear projection of the previous layer output. In this case the partial forecast at the output of block in stack is:
The interpretation of this model is that the FC layers in the basic building block depicted in Fig. 1 learn the predictive decomposition of the partial forecast in the basis learned by the network. Matrix has dimensionality . Therefore, the first dimension of has the interpretation of discrete time index in the forecast domain. The second dimension of the matrix has the interpretation of the indices of the basis functions, with being the expansion coefficients for this basis. Thus the columns of can be thought of as waveforms in the time domain. We allow independent sets of for the forecast and the backcast branches of the fork. Because no additional constraints are imposed on the form of , the waveforms learned by the deep model do not have inherent structure (and none is apparent in our experiments). This leads to as well as stack outputs not being interpretable.
The interpretable architecture can be constructed by reusing the overall architecture of Fig. 1, adding additional structure to for each stack. In this architecture, we have . Forecasting practitioners often use the decomposition of time series into trend and seasonality, such as those performed by the stl (Cleveland et al., 1990) and X13-arima (U.S. Census Bureau, 2013) procedures. We propose to design the trend and seasonality decomposition into the model to make the stack outputs more easily interpretable.
Trend model. A typical characteristic of trend is that most of the time it is a monotonic function, or at least a slowly varying function. In order to mimic this behaviour we propose to constrain to be a polynomial of small degree , a function slowly varying across forecast window:
Here the normalized time vector is defined on a discrete grid running from 0 to , forecasting steps ahead, where has been shifted to correspond to the first forecasting time-step. (For the backcast window, we let run over negative time values.) The trend forecast will then have the form:
where are polynomial coefficients predicted by a FC network of block of stack and is the matrix of powers of . If is low, e.g. 2 or 3, it forces to mimic trend.
Seasonality model. Typical characteristic of seasonality is that it is a regular, cyclical, recurring fluctuation. Therefore, to model seasonality, we propose to constrain to belong to the class of periodic functions , where is a seasonality period. A natural choice for the basis to model periodic function is the Fourier series:
The seasonality forecast will then have the form:
where are Fourier coefficients predicted by a FC network of layer of stack and is the matrix of sinusoidal waveforms. The forecast is then a periodic function mimicking typical seasonal patterns.
The overall interpretable architecture consists of two stacks only: the trend modeling stack is followed by the seasonality modeling stack. The doubly residual stacking combined with the forecast/backcast principle result in (i) the trend component being removed from the input window
before it is fed into the seasonality analysis stack and (ii) the partial forecasts of trend and seasonality are available as separate interpretable outputs. Structurally, each of the stacks consists of several blocks connected with residual connections as depicted in Fig.1 and each of them shares its respective, non-learnable . The number of blocks is 3 for both trend and seasonality. We found that on top of sharing , sharing all the weights across blocks in a stack resulted in better performance on validation set.
Ensembling is used by all the top entries in the M4-competition. We rely on ensembling as well to be comparable. We found that ensembling is a much more powerful regularization technique than the popular alternatives, e.g. dropout or L2-norm penalty. The addition of those methods improved individual models, but was hurting the performance of the ensemble. The core property of an ensemble is diversity. We build an ensemble using several sources of diversity. First, the ensemble models are fit on three different metrics: and , a version of that has only the ground truth value in the denominator. Second, for every horizon , individual models are trained on input windows of different length: , for a total of six window lengths. Thus the overall ensemble exhibits a multi-scale aspect. Finally, we perform a standard bagging procedure (Breiman, 1996) by sampling a subset of time series and a subset of points in to train each ensemble member. Each individual model is trained on 20% of the time series belonging to its horizon and is using of the entries in . We use 180 models in the ensemble and ensemble settings are cross-validated on a validation set defined in Section 3.5. We use the median as ensemble aggregation function.
3.5 Training Methodology
We split the M4 dataset into train, validation and test subsets. The test subset is the standard test set defined in the M4 competition (M4 Team, 2018a)
. The validation and train subsets are obtained by splitting the full M4 train set at the boundary of the last horizon of each time series. We use the train and validation subsets to tune hyperparameters and report the results of ablation studies (see Supplementary material). Once the hyperparameters are determined, we train the model on the full train set and report results on the test set. N-BEATS is implemented and trained in Tensorflow(Abadi et al., 2015).
N-BEATS shares parameters of the network within horizons, therefore there are a total of 6 models, one per horizon. If every time series is interpreted as a separate task, this can be linked back to the meta-learning, in which a neural network is regularized by learning on multiple tasks to improve generalization. We would like to stress that models for different horizons reuse the same architecture and all of the hyperparameters are fixed to the same values across horizons, except for a minor exception explained below. The fact that we can reuse architecture and even hyperparameters across horizons indicates that the proposed architecture design generalizes well across time series of different nature. The same architecture is successfully trained on a horizon with 48k time series (Monthly) and 359 time series (Weekly). This is a much stronger result than e.g. the result of S. Smyl (Makridakis et al., 2018b) who had to use very different architectures hand crafted for different horizons.
We use the Adam optimizer with initial learning rate 0.001. The neural network training is run for 30k batches of size 1024. The GPU based training of one ensemble member takes between 30 min and 2 hours depending on neural network settings and hardware. Additional details of the training setup pertaining to the sampling of batches are provided in Section S2 of supplementary materials.
4 Related Work
The approaches to TS forecasting can be split in a few distinct categories. The statistical modeling approaches based on exponential smoothing and its different flavors are well established and are often considered a default choice in the industry (Holt, 1957, 2004; Winters, 1960). More advanced variations of exponential smoothing include the winner of M3 competition, the Theta method (Assimakopoulos and Nikolopoulos, 2000) that decomposes the forecast into several theta-lines and statistically combines them. The pinnacle of the statistical approach encapsulates ARIMA, auto-ARIMA and in general, the unified state-space modeling approach, that can be used to explain and analyze all of the approaches mentioned above (see Hyndman and Khandakar (2008) for an overview). More recently, ML/TS combination approaches started infiltrating the domain with great success, showing promising results by using the outputs of statistical engines as features. In fact, 2 out of top-5 entries in the M4 competition are approaches of this type, including the second entry (Montero-Manso et al., 2019)
. The second entry computes the outputs of several statistical methods on the M4 dataset and combines them using gradient boosted tree(Chen and Guestrin, 2016)
. Somewhat independently, the work in the modern deep learning TS forecasting developed based on variations of recurrent neural networks(Flunkert et al., 2017; Rangapuram et al., 2018; Toubeau et al., 2019; Zia and Razzaq, 2018) being largely dominated by the electricity load forecasting in the multi-variate setup. A few earlier works explored the combinations of recurrent neural networks with dilation, residual connections and attention (Chang et al., 2017; Kim et al., 2017; Qin et al., 2017). These served as a basis for the winner of the M4 competition, an approach developed by S. Smyl. The winning entry combines a Holt-Winters style seasonality model with its parameters fitted to a given TS via gradient descent and a unique combination of dilation/residual/attention approaches for each forecast horizon. The resulting model is a hybrid model that heavily relies on preprocessing by a time-series engine and is hand crafted to each specific horizon.
|Best pure ML||14.397||11.031||13.973||4.566||12.894|
|Best ML/TS combination||13.528||9.733||12.639||4.118||11.720|
|DL/TS hybrid, M4 winner||13.176||9.679||12.126||4.014||11.374|
|Best pure ML||0.859||0.939||0.941||0.991||0.915||23|
|Best ML/TS combination||0.799||0.847||0.858||0.914||0.838||2|
|DL/TS hybrid, M4 winner||0.778||0.847||0.836||0.920||0.821||1|
5 Empirical Results
Our empirical results are based on the M4 dataset detailed in Section 2.1. Tables 2 and 2 present our key quantitative empirical results showing that the proposed model achieves the state of the art performance on the challenging M4 benchmark. We study the performance of two model configurations: generic (Ours-G) and interpretable (Ours-I), as well as Ours-I+G (ensemble of all models from Ours-G and Ours-I). We compare against 4 representatives from the M4 competition: each best in their respective model class. Best pure ML is the submission by B. Trotta, the best entry among the 6 pure ML models. Best statistical is the best pure statistical model by N.Z. Legaki and K. Koutsouri. Best ML/TS combination is the model by P. Montero-Manso, T. Talagala, R.J. Hyndman and G. Athanasopoulos, second best entry, gradient boosted tree over a few statistical time series models. Finally, DL/TS hybrid is the winner of M4 competition designed by S. Smyl.
5.1 Hypothesis 1 is supported: pure DL is state-of-the-art on univariate forecasting
N-BEATS outperforms all other approaches on all the studied subsets of time series. The gap between our generic model and the M4 winner () is greater than the gap between the M4 winner and the second entry (). Our generic model uses as little prior knowledge as possible, with no feature engineering, no scaling and no internal architectural components that may be considered TS-specific. Thus the result in Table 2 supports our Hypothesis 1: DL does not need support from the statistical approaches or hand crafted feature engineering and domain knowledge to perform extremely well on a wide array of TS forecasting tasks.
5.2 Hypothesis 2 is supported: DL can be made explainable on univariate forecasting
Fig. 2 studies the outputs of the proposed model in the generic and the interpretable configurations. As discussed in Section 3.3, to make the generic architecture presented in Fig. 1 interpretable, we constrain in the first stack to have the form of polynomial (1) while the second one has the form of Fourier basis (2). Furthermore, we use the outputs of the generic configuration of N-BEATS as control group (the generic model of 30 residual blocks depicted in Fig. 1 is divided into two stacks) and we plot both generic (suffix “-G”) and interpretable (suffix “-I”) stack outputs side by side in Fig. 2. The outputs of generic model are arbitrary and non-interpretable: either trend or seasonality or both of them are present at the output of both stacks. The magnitude of the output (peak-to-peak) is generally smaller at the output of the second stack. The outputs of the interpretable model exhibit distinct properties: the trend output is monotonic and slowly moving, the seasonality output is regular, cyclical and has recurring fluctuations. The peak-to-peak magnitude of the seasonality output is significantly larger than that of the trend, if significant seasonality is present in the time series. Similarly, the peak-to-peak magnitude of trend output tends to be small when no obvious trend is present in the ground truth signal. Thus the proposed interpretable architecture decomposes its forecast into two distinct components. Our conclusion is that the Hypothesis 2 is supported: the outputs of the DL model can be made interpretable by encoding a sensible inductive bias in the architecture. Table 2 confirms that this does not result in performance drop.
6 Discussion: Connections to Meta-learning
Meta-learning defines an inner learning procedure and an outer learning procedure. The inner learning procedure is parameterized, conditioned or otherwise influenced by the outer learning procedure (Bengio et al., 1991). The prototypical inner vs. outer learning is individual learning in the lifetime of an animal vs. evolution of the inner learning procedure itself over many generations of individuals. To see the two levels, it often helps to refer to two sets of parameters, the inner parameters (e.g. synaptic weights) which are modified inside the inner learning procedure, and the outer parameters or meta-parameters (e.g. genes) which are get modified only in the outer learning procedure.
N-BEATS can be cast as an instance of meta-learning by drawing the following parallels. The outer learning procedure is encapsulated in the parameters of the whole network, learned by gradient descent. The inner learning procedure is encapsulated in the set of basic building blocks and modifies the parameters of . The inner learning proceeds through a sequence of stages, each corresponding to a block within the stack of the architecture. Each of the blocks can be thought of as performing the equivalent of an update step which gradually modifies the parameters which eventually feed into in each block (which get added together to form the final prediction). The inner learning procedure takes a single history from a piece of a TS and sees that history as a training set. It produces forward parameters (see Fig. 1), which parametrically map inputs to predictions. In addition, each preceding block modifies the input to the next block by producing backward parameters , thus conditioning the learning and the output of the next block. In the case of the interpretable model, the meta-parameters are only in the FC layers because the ’s are fixed. In the case of the generic model, the meta-parameters also include the ’s which define the non-parametrically. This point of view is further reinforced by the results of the ablation study reported in the supplementary Section S3.2 showing that increasing the number of blocks in the stack, as well as the number of stacks improves generalization performance, and can be interpreted as more iterations of the inner learning procedure.
We proposed and empirically validated a novel architecture for univariate TS forecasting. We showed that the architecture is general, flexible and it performs well on a wide array of TS forecasting problems. We applied it to the M4 competition dataset and demonstrated state-of-the-art performance in two configurations: generic and interpretable. This allowed us to validate two hypotheses: (i) the generic DL approach performs exceptionally well on heterogeneous univariate TS forecasting problems using no TS domain knowledge, (ii) it is viable to constrain a DL model to force it to decompose its forecast into distinct human interpretable outputs. We demonstrated that the DL models can be trained on multiple time series in a multi-task fashion, successfully sharing individual learnings. We speculate that N-BEATS’s performance can be attributed in part to it carrying out a form of meta-learning, a deeper investigation of which should be the subject of future work.
- Abadi et al. (2015) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
- Assimakopoulos and Nikolopoulos (2000) V. Assimakopoulos and K. Nikolopoulos. The theta model: a decomposition approach to forecasting. International Journal of Forecasting, 16(4):521–530, 2000.
- Bengio et al. (1991) Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. In Proceedings of the International Joint Conference on Neural Networks, pages II–A969, Seattle, USA, 1991.
- Breiman (1996) L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, Aug 1996.
- Chang et al. (2017) S. Chang, Y. Zhang, W. Han, M. Yu, X. Guo, W. Tan, X. Cui, M. Witbrock, M. A. Hasegawa-Johnson, and T. S. Huang. Dilated recurrent neural networks. In NIPS, pages 77–87, 2017.
- Chen and Guestrin (2016) T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In ACM SIGKDD, pages 785–794, 2016.
- Cleveland et al. (1990) R. B. Cleveland, W. S. Cleveland, J. E. McRae, and I. Terpenning. STL: A seasonal-trend decomposition procedure based on Loess (with discussion). Journal of Official Statistics, 6:3–73, 1990.
- Flunkert et al. (2017) V. Flunkert, D. Salinas, and J. Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. CoRR, abs/1704.04110, 2017.
- He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
- Holt (1957) C. C. Holt. Forecasting trends and seasonals by exponentially weighted averages. Technical Report ONR memorandum no. 5, Carnegie Institute of Technology, Pittsburgh, PA, 1957.
- Holt (2004) C. C. Holt. Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting, 20(1):5–10, 2004.
- Huang et al. (2017) G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 2261–2269. IEEE Computer Society, 2017.
- Hyndman and Koehler (2006) R. Hyndman and A. B. Koehler. Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4):679–688, 2006.
- Hyndman and Khandakar (2008) R. J. Hyndman and Y. Khandakar. Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 26(3):1–22, 2008.
- Jain (2017) C. L. Jain. Answers to your forecasting questions. Journal of Business Forecasting, 36, Spring 2017.
- Kahn (2003) K. B. Kahn. How to measure the impact of a forecast error on an enterprise? The Journal of Business Forecasting Methods & Systems, 22(1), Spring 2003.
- Kim et al. (2017) J. Kim, M. El-Khamy, and J. Lee. Residual lstm: Design of a deep recurrent architecture for distant speech recognition. In Interspeech 2017, pages 1591–1595, 2017.
- M4 Team (2018a) M4 Team. M4 dataset, 2018a. URL https://github.com/M4Competition/M4-methods/tree/master/Dataset.
- M4 Team (2018b) M4 Team. M4 competitor’s guide: prizes and rules, 2018b. URL www.m4.unic.ac.cy/wp-content/uploads/2018/03/M4-CompetitorsGuide.pdf.
- Makridakis and Hibon (2000) S. Makridakis and M. Hibon. The M3-Competition: results, conclusions and implications. International Journal of Forecasting, 16(4):451–476, 2000.
- Makridakis et al. (1982) S. Makridakis, A. Andersen, R. Carbone, R. Fildes, M. Hibon, R. Lewandowski, J. Newton, E. Parzen, and R. Winkler. The accuracy of extrapolation (time series) methods: Results of a forecasting competition. Journal of forecasting, 1(2):111–153, 1982.
- Makridakis et al. (2018a) S. Makridakis, E. Spiliotis, and V. Assimakopoulos. Statistical and machine learning forecasting methods: Concerns and ways forward. PLoS ONE, 13(3), 2018a.
- Makridakis et al. (2018b) S. Makridakis, E. Spiliotis, and V. Assimakopoulos. The M4-Competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4):802–808, 2018b.
- Montero-Manso et al. (2019) P. Montero-Manso, G. Athanasopoulos, R. J. Hyndman, and T. S. Talagala. FFORMA: Feature-based Forecast Model Averaging. International Journal of Forecasting, 2019. to appear.
- Nair and Hinton (2010) V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, pages 807–814, 2010.
- Qin et al. (2017) Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. W. Cottrell. A dual-stage attention-based recurrent neural network for time series prediction. In IJCAI-17, pages 2627–2633, 2017.
- Rangapuram et al. (2018) S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski. Deep state space models for time series forecasting. In NeurIPS 31, pages 7785–7794, 2018.
- Syntetos et al. (2005) A. A. Syntetos, J. E. Boylan, and J. D. Croston. On the categorization of demand patterns. Journal of the Operational Research Society, 56(5):495–503, 2005.
- Toubeau et al. (2019) J. Toubeau, J. Bottieau, F. Vallée, and Z. De Grève. Deep learning-based multivariate probabilistic forecasting for short-term scheduling in power markets. IEEE Transactions on Power Systems, 34(2):1203–1215, March 2019.
- U.S. Census Bureau (2013) U.S. Census Bureau. Reference manual for the X-13ARIMA-SEATS Program, version 1.0, 2013. URL http://www.census.gov/ts/x13as/docX13AS.pdf.
- Winters (1960) P. R. Winters. Forecasting sales by exponentially weighted moving averages. Management Science, 6(3):324–342, 1960.
- Zia and Razzaq (2018) T. Zia and S. Razzaq. Residual recurrent highway networks for learning deep sequence prediction models. Journal of Grid Computing, Jun 2018.
S1 Dataset Description
|Frequency / Horizon|
Table 1 outlines the composition of the M4 dataset across domains and forecast horizons by listing the number of time series based on their frequency and type [M4 Team, 2018b]. The M4 dataset is large and diverse: all forecast horizons are composed of heterogeneous time series types (with exception of Hourly) frequently encountered in business, financial and economic forecasting. Summary statistics on series lengths are also listed, showing wide variability therein, as well as a characterization (smooth vs erratic) that follows Syntetos et al. , and is based on the squared coefficient of variation of the series. All series have positive observed values at all time-steps; as such, none can be considered intermittent or lumpy per Syntetos et al. .
S2 Details of Training Setup
To update network parameters, we sample the train batches of fixed size 1024, each consisting of TS from the same horizon. Hence only the neural network corresponding to this horizon is updated. The batch is assembled by first sampling a horizon with the probability equal to the percentage of the TS of this horizon in the dataset (for example Yearly, horizon 6, consisting of 23k TS is sampled with the probability 0.23). Given the horizon, we pick 1024 TS ids from this horizon, uniformly at random with replacement. For each selected TS id we pick a random forecast point from the historical range of lengthimmediately preceding the last point in the train part of the TS. is a hyperparameter equal to for horizons with massive numbers of time series (Yearly, Monthly, Quarterly) and for horizons with small and modest numbers of time series included in Others (Weekly, Daily, Hourly). Given a forecast point, we set one horizon worth of points following it to be the target forecast window and we set the history of points of one of lengths preceding it to be the input to the network. We use the Adam optimizer with initial learning rate 0.001. The neural network training is run for 30k batches. The GPU based training of one ensemble member takes between 30 min and 2 hours depending on neural network settings and hardware.
S3 Ablation Study
We performed an ablation study on the validation set, using metric as performance criterion. We addressed two specific questions with this study. First, Is stacking layers helpful? Second, Does the architecture based on the combination of layers with different basis functions results in better performance than the architecture using only one layer type?
s3.1 Layer stacking.
We start our study with the generic architecture that consists of stacks of one residual block of 5 FC layers each of the form Fig. 1 and we increase the number of stacks. Results presented in Table 3 confirm that increasing the number of stacks decreases error and at certain point the gain saturates. We would like to mention that the network having 30 stack of depth 5 is in fact a very deep network of total depth 150 layers.
s3.2 Basis synergy and increasing the number of blocks.
Stacking works well for the interpretable architecture as can be seen in Table 3 depicting the results of ablating the interpretable architecture configuration. Here we experiment with the architecture that is composed of 2 stacks, stack one is trend model and stack two is the seasonality model. Each stack has variable number of residual blocks and each residual block has 5 FC layers. We found that this architecture works best when all weights are shared within stack. We clearly see that increasing the number of blocks improves performance. The largest network is 60 layers deep. On top of that, we observe that the architecture that consists of stacks based on different basis functions wins over the architecture based on the same stack. It looks like chaining stacks of different nature results in synergistic effects. This is logical as function classes that can be modelled by trend and seasonality stacks have small overlap.