1 Introduction
Forecasting is one of the fundamental scientific problems and is also of great practical utility. The ability to plan and control, as well as to react appropriately to manifestations of complex, partially or completely unknown systems, often relies on the ability to forecast relevant observations based on their past history. The applications of forecasting span a variety of fields, ranging from extremely technical (e.g. vehicle and robot control (Tang and Salakhutdinov, 2019), data center optimization (Gao, 2014)), to more business oriented (supply chain management (Leung, 1995), workforce management (Chapados et al., 2014), forecasting phone call arrivals (Ibrahim et al., 2016) and customer traffic (Lam et al., 1998)) and finally to ones that may be critical for the future survival of humanity, such as precision agriculture (Rodrigues Jr et al., 2019) or fire and flood management (Mahoo et al., 2015; Sit and Demir, 2019). Unsurprisingly, forecasting methods have a long history that can be traced back to the very origins of human civilization (Neale, 1985) and of modern science (Gauss, 1809), and they have consistently attracted considerable attention (Yule, 1927; Walker, 1931; Holt, 1957; Winters, 1960; Engle, 1982; Sezer et al., 2019). The progress made in univariate forecasting in the past four decades is well reflected in the results and methods considered in the associated competitions (Makridakis et al., 1993, 1982; Makridakis and Hibon, 2000; Athanasopoulos et al., 2011; Makridakis et al., 2018b). Growing evidence suggests that machine learning approaches offer a superior modeling methodology to tackle time series (TS) forecasting tasks, in contrast to some previous assessments (Makridakis et al., 2018a). For example, the winner of the last competition (M4, Makridakis et al. (2018b)) was a deep neural network predicting parameters of a statistical model (Smyl, 2020). The latter result was reinforced by Oreshkin et al. (2020), who improved over the winner using a pure neural network model called NBEATS. One of the contributions of the present paper is to help understand why NBEATS works so well, by casting it as a metalearning method.
On the practical side, the deployment of deep neural architectures is often challenged by the cold start problem. Before a tabula rasa deep neural network provides a useful output, it should be trained on a large problem-specific dataset. For early adopters, this often implies data collection efforts, changing data handling practices and even changing the existing IT infrastructures on a massive scale. In contrast, advanced statistical models can be deployed with significantly less effort as they estimate their parameters on a single TS at a time. In this paper we address the problem of reducing the entry cost of deep neural networks in the industrial practice of TS forecasting. We show that it is viable to train a neural network model on a diversified source dataset and deploy it on a target dataset in a zeroshot regime, i.e. without explicit retraining on that target data, resulting in performance that is at least as good as that of advanced statistical models tailored to the target dataset. In addition, the TS forecasting problem is distinct in that one has to deal upfront with the challenge of out-of-distribution generalization: TS are typically generated by systems whose generative distributions shift significantly over time. Consequently, transfer learning was considered challenging in this domain until very recently (Hooshmand and Sharma, 2019; Ribeiro et al., 2018). Ours is the first work to demonstrate the highly successful zeroshot operation of a deep neural TS forecasting model, thanks to a metalearning approach.
Addressing this practical problem also provides clues to fundamental questions. Can we learn something general about forecasting and transfer this knowledge across datasets? If so, what kind of mechanisms could facilitate this? The ability to learn and transfer representations across tasks via task adaptation is an advantage of metalearning (Raghu et al., 2019). We propose here a broad theoretical framework for metalearning which spans several existing metalearning algorithms. We further show how NBEATS fits this metalearning framework. We identify within NBEATS internal metalearning adaptation mechanisms that generate new parameters on the fly, specific to a given TS, iteratively extending the architecture’s expressive power. We empirically confirm that these mechanisms are key to improving its zeroshot univariate TS forecasting performance.
1.1 Background
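The univariate point forecasting problem in discrete time is formulated given a length-$H$ forecast horizon and a length-$T$ observed series history $[y_1, \ldots, y_T] \in \mathbb{R}^T$. The task is to predict the vector of future values $\mathbf{y} \in \mathbb{R}^H = [y_{T+1}, y_{T+2}, \ldots, y_{T+H}]$. For simplicity, we will later consider a lookback window of length $t \le T$ ending with the last observed value $y_T$ to serve as model input, denoted $\mathbf{x} \in \mathbb{R}^t = [y_{T-t+1}, \ldots, y_T]$. We denote by $\hat{\mathbf{y}}$ the point forecast of $\mathbf{y}$. Its accuracy can be evaluated with sMAPE, the symmetric mean absolute percentage error (Makridakis et al., 2018b),
$$\mathrm{sMAPE} = \frac{200}{H} \sum_{i=1}^{H} \frac{|y_{T+i} - \hat{y}_{T+i}|}{|y_{T+i}| + |\hat{y}_{T+i}|}. \tag{1}$$
Other quality metrics (e.g. MAPE, MASE, OWA, ND) are possible and are defined in Appendix A.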
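To make the accuracy metric concrete, here is a minimal sketch of a sMAPE computation. It assumes the M4 competition convention (a factor of 200 and the sum of absolute values in the denominator); the function name `smape`, the handling of 0/0 terms and the toy numbers are illustrative choices, not taken from the paper.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE in percent over one forecast horizon (M4 convention, assumed)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    # Treat 0/0 terms as zero error so that a perfect forecast of zeros scores 0
    # (an assumption made here for robustness).
    ratio = np.divide(np.abs(y_true - y_pred), denom,
                      out=np.zeros_like(denom), where=denom > 0)
    return 200.0 * ratio.mean()

y = [100.0, 110.0, 120.0]       # toy ground truth over a horizon of H = 3
y_hat = [100.0, 100.0, 130.0]   # toy point forecast
print(round(smape(y, y_hat), 3))
```

The metric is bounded between 0 (perfect forecast) and 200 (forecast and truth of opposite sign), which is one reason it is convenient for averaging over series of very different scales.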
Metalearning or learning-to-learn (Harlow, 1949; Schmidhuber, 1987; Bengio et al., 1991) is believed to be necessary for intelligent machines (Lake et al., 2017). The ability to metalearn is usually linked to being able to (i) accumulate knowledge across different tasks (i.e. transfer learning, multi-task learning) and (ii) quickly adapt the accumulated knowledge to the new task (task adaptation) (Ravi and Larochelle, 2016; Lake et al., 2017; Bengio et al., 1992). Accordingly, a metalearning setup can be defined by assuming a distribution $p(\mathcal{T})$ over tasks (where each task can be seen as a meta-example), a predictor $\mathcal{P}_{\theta}$ parameterized with parameters $\theta$ and a metalearning procedure with metaparameters $\varphi$. Each task $\mathcal{T}_i$ (a meta-example) includes a limited set of task training examples and a set of task validation examples. The objective is to design a metalearner that can generalize well on a new task by appropriately choosing the predictor’s parameters $\theta$ after observing the task training data. The metalearner is trained to do so by being exposed to many tasks in a training dataset $\{\mathcal{T}_i^{train}\}$ sampled from $p(\mathcal{T})$. For each task $\mathcal{T}_i^{train}$, the metalearner is requested to produce the solution to the task in the form of $\mathcal{P}_{\theta}$, and the metalearner metaparameters $\varphi$ are optimized across many tasks based on validation data and loss functions supplied with the tasks. Training on multiple tasks enables the metalearner to produce solutions $\mathcal{P}_{\theta}$ that generalize well on a set of unseen tasks $\{\mathcal{T}_i^{test}\}$ sampled from $p(\mathcal{T})$.
NBEATS (Oreshkin et al., 2020) consists of a total of $L$ blocks connected using a doubly residual architecture that we review in detail below (see Appendix B.1 for full architecture details). Block $\ell$ has input $\mathbf{x}_\ell$ and produces two outputs: the backcast $\hat{\mathbf{x}}_\ell$ and the partial forecast $\hat{\mathbf{y}}_\ell$. For the first block we define $\mathbf{x}_1 \equiv \mathbf{x}$, where $\mathbf{x}$ is assumed to be the model-level input from now on. The internal operations of a block are based on a combination of fully connected and linear layers. In this paper, we focus on the configuration of NBEATS that shares all learnable parameters across blocks. We define the $k$-th fully-connected layer in the $\ell$-th block, having ReLU nonlinearity (Nair and Hinton, 2010; Glorot et al., 2011), weight matrix $\mathbf{W}_k$, bias vector $\mathbf{b}_k$ and input $\mathbf{h}_{\ell,k-1}$, as $\mathrm{FC}_k(\mathbf{h}_{\ell,k-1}) \equiv \mathrm{ReLU}(\mathbf{W}_k \mathbf{h}_{\ell,k-1} + \mathbf{b}_k)$. With this notation, one block of NBEATS is described as:
$$\mathbf{h}_{\ell,1} = \mathrm{FC}_1(\mathbf{x}_\ell), \quad \mathbf{h}_{\ell,k} = \mathrm{FC}_k(\mathbf{h}_{\ell,k-1}),\ k = 2, 3, 4; \qquad \hat{\mathbf{x}}_\ell = \mathbf{Q}\mathbf{h}_{\ell,4}, \quad \hat{\mathbf{y}}_\ell = \mathbf{G}\mathbf{h}_{\ell,4}, \tag{2}$$
where $\mathrm{FC}_k$ denotes a fully connected layer and $\mathbf{Q}$ and $\mathbf{G}$ are linear operators which can be seen as linear bases, combined linearly with the $\mathbf{h}_{\ell,4}$ coefficients. Finally, the doubly residual architecture is described by the following recursion (recalling that $\mathbf{x}_1 \equiv \mathbf{x}$):
$$\mathbf{x}_\ell = \mathbf{x}_{\ell-1} - \hat{\mathbf{x}}_{\ell-1}, \qquad \hat{\mathbf{y}} = \sum_{\ell=1}^{L} \hat{\mathbf{y}}_\ell. \tag{3}$$
The NBEATS parameters included in the FC and linear layers are learned by minimizing a suitable loss function (e.g. sMAPE defined in (1)) across multiple TS.
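The block structure and the doubly residual recursion described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: four ReLU layers per block, random untrained weights, toy dimensions, and the names `block`, `nbeats_forward`, `Q` and `G` are chosen here for readability rather than taken from a reference implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def block(x, weights, biases, Q, G):
    """One block: FC stack with ReLU, then linear backcast (Q) and forecast (G) heads."""
    h = x
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return Q @ h, G @ h  # backcast, partial forecast

def nbeats_forward(x, weights, biases, Q, G, n_blocks):
    """Doubly residual stacking with parameters shared across all blocks."""
    residual = np.asarray(x, dtype=float)
    forecast = np.zeros(G.shape[0])
    for _ in range(n_blocks):
        backcast, partial = block(residual, weights, biases, Q, G)
        residual = residual - backcast   # next block sees what is still unexplained
        forecast = forecast + partial    # partial forecasts are summed
    return forecast

rng = np.random.default_rng(0)
t, H, width = 8, 3, 16   # lookback length, horizon, hidden width (toy sizes)
weights = [rng.normal(scale=0.3, size=(width, t))] + \
          [rng.normal(scale=0.3, size=(width, width)) for _ in range(3)]
biases = [np.zeros(width) for _ in range(4)]
Q = rng.normal(scale=0.1, size=(t, width))
G = rng.normal(scale=0.1, size=(H, width))
x = rng.normal(size=t)
y_hat = nbeats_forward(x, weights, biases, Q, G, n_blocks=5)
print(y_hat.shape)
```

Because the same `weights`, `Q` and `G` are reused in every iteration, adding blocks changes the computation without adding learnable parameters, which is the configuration studied in the rest of the paper.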
2 Metalearning Framework
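A metalearning procedure can generally be viewed at two levels: the inner loop and the outer loop. The inner training loop operates within an individual “meta-example” or task $\mathcal{T}_i$ (fast learning loop improving over current $\mathcal{T}_i$) and the outer loop operates across tasks (slow learning loop). A task $\mathcal{T}_i$ includes task training data $\mathcal{D}^{tr}_{\mathcal{T}_i}$ and task validation data $\mathcal{D}^{val}_{\mathcal{T}_i}$, both optionally involving inputs, targets and a task-specific loss: $\mathcal{D}^{tr}_{\mathcal{T}_i} = \{\mathbf{X}^{tr}_{\mathcal{T}_i}, \mathbf{Y}^{tr}_{\mathcal{T}_i}, \mathcal{L}_{\mathcal{T}_i}\}$, $\mathcal{D}^{val}_{\mathcal{T}_i} = \{\mathbf{X}^{val}_{\mathcal{T}_i}, \mathbf{Y}^{val}_{\mathcal{T}_i}, \mathcal{L}_{\mathcal{T}_i}\}$. We extend the definition of the predictor originally provided in Section 1.1 by allowing a subset of its parameters, denoted $\mathbf{w}$, to belong to metaparameters $\varphi$ and hence not to be task adaptive. Therefore, in our framework, the predictor $\mathcal{P}_{\theta,\mathbf{w}}$ has parameters $\theta$ that can be adapted rapidly, at the task level, and metaparameters $\mathbf{w}$ that are set by the metalearning procedure and are slowly learned across tasks.
Accordingly, the metalearning procedure has three distinct ingredients: (i) metaparameters $\varphi = (t_0, \mathbf{w}, u)$, (ii) initialization function $\mathcal{I}_{t_0}$ and (iii) update function $\mathcal{U}_u$. The metalearner’s metaparameters $\varphi$ include the metaparameters of the metainitialization function, $t_0$, the metaparameters of the predictor shared across tasks, $\mathbf{w}$, and the metaparameters of the update function, $u$. The metainitialization function $\mathcal{I}_{t_0}(\mathcal{D}^{tr}_{\mathcal{T}_i}, c_{\mathcal{T}_i})$ defines the initial values of parameters $\theta$ for a given task $\mathcal{T}_i$ based on its metainitialization parameters $t_0$, task training dataset $\mathcal{D}^{tr}_{\mathcal{T}_i}$ and task metadata $c_{\mathcal{T}_i}$. Task metadata may have, for example, a form of task ID or a textual task description. The update function $\mathcal{U}_u(\theta_{\ell-1}, \mathcal{D}^{tr}_{\mathcal{T}_i})$ is parameterized with update metaparameters $u$. It defines an iterated update to predictor parameters $\theta$ at iteration $\ell$ based on their previous value and the task training set $\mathcal{D}^{tr}_{\mathcal{T}_i}$. The initialization and update functions produce a sequence of predictor parameters, which we compactly write as $\theta_{0:\ell} \equiv \{\theta_0, \ldots, \theta_{\ell-1}, \theta_\ell\}$. We let the final predictor be a function of the whole sequence of parameters, written compactly as $\mathcal{P}_{\theta_{0:\ell},\mathbf{w}}$. One implementation of such a general function could be a Bayesian ensemble or a weighted sum, for example: $\mathcal{P}_{\theta_{0:\ell},\mathbf{w}}(\cdot) = \sum_{j=0}^{\ell} \omega_j \mathcal{P}_{\theta_j,\mathbf{w}}(\cdot)$. If we set $\omega_j = 1$ iff $j = \ell$ and $0$ otherwise, then we get the more commonly encountered situation $\mathcal{P}_{\theta_{0:\ell},\mathbf{w}}(\cdot) \equiv \mathcal{P}_{\theta_\ell,\mathbf{w}}(\cdot)$.
The metaparameters $\varphi$ are updated in the outer metalearning loop so as to obtain good generalization in the inner loop, i.e., by minimizing the expected validation loss $\mathbb{E}_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}(\hat{\mathbf{Y}}^{val}_{\mathcal{T}_i}, \mathbf{Y}^{val}_{\mathcal{T}_i})$ that maps the ground truth and estimated outputs into the value that quantifies the generalization performance across tasks. The metalearning framework is succinctly described by the following set of equations.
$$\theta_0 \leftarrow \mathcal{I}_{t_0}(\mathcal{D}^{tr}_{\mathcal{T}_i}, c_{\mathcal{T}_i}); \qquad \theta_\ell \leftarrow \mathcal{U}_u(\theta_{\ell-1}, \mathcal{D}^{tr}_{\mathcal{T}_i}), \ \forall \ell > 0. \tag{4}$$
$$\hat{\mathbf{Y}}^{val}_{\mathcal{T}_i} = \mathcal{P}_{\theta_{0:\ell},\mathbf{w}}(\mathbf{X}^{val}_{\mathcal{T}_i}); \qquad \varphi \leftarrow \arg\min_{\varphi}\ \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})}\, \mathcal{L}_{\mathcal{T}_i}(\hat{\mathbf{Y}}^{val}_{\mathcal{T}_i}, \mathbf{Y}^{val}_{\mathcal{T}_i}). \tag{5}$$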
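The inner loop of the framework (initialization, iterated update, and a predictor defined over the whole parameter sequence) can be sketched generically. The code below is a hedged illustration only: the function names `adapt` and `predict`, the scalar linear predictor and the gradient-step update are hypothetical stand-ins chosen to keep the example small, and the outer loop over tasks is omitted.

```python
import numpy as np

def adapt(init_fn, update_fn, train_data, n_steps, metadata=None):
    """Inner loop: theta_0 from the initialization function, then iterated updates."""
    thetas = [init_fn(train_data, metadata)]
    for _ in range(n_steps):
        thetas.append(update_fn(thetas[-1], train_data))
    return thetas

def predict(predictor, thetas, x, omegas=None):
    """Final predictor as a weighted sum over the whole parameter sequence."""
    if omegas is None:  # default: only the last parameter value is used
        omegas = [0.0] * (len(thetas) - 1) + [1.0]
    return sum(w * predictor(th, x) for w, th in zip(omegas, thetas))

# Toy task: fit y = 2 * x with a scalar predictor theta * x.
X_tr = np.array([1.0, 2.0, 3.0])
Y_tr = 2.0 * X_tr
predictor = lambda theta, x: theta * x
init_fn = lambda data, metadata: 0.0            # initialization ignores the data here

def update_fn(theta, data):
    X, Y = data
    grad = 2.0 * np.mean((theta * X - Y) * X)   # gradient of the mean squared error
    return theta - 0.05 * grad

thetas = adapt(init_fn, update_fn, (X_tr, Y_tr), n_steps=20)
print(predict(predictor, thetas, 4.0))
```

Passing a different `omegas` vector turns the same parameter sequence into an ensemble-style predictor, which is the degree of freedom the framework keeps open.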
In this section we explained the proposed broad metalearning framework and laid out its main equations. Next, we demonstrate how existing metalearning algorithms can be cast into this framework.
2.1 Existing Metalearning Algorithms Explained
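MAML and related approaches (Finn et al., 2017; Li et al., 2017; Raghu et al., 2019) can be derived from (4) and (5) by (i) setting the initialization function $\mathcal{I}_{t_0}$ to be the identity map that copies $t_0$ into $\theta$, (ii) setting the update function $\mathcal{U}_u$ to be the SGD gradient update: $\mathcal{U}_u(\theta, \mathcal{D}^{tr}_{\mathcal{T}_i}) = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(\mathcal{P}_{\theta,\mathbf{w}}(\mathbf{X}^{tr}_{\mathcal{T}_i}), \mathbf{Y}^{tr}_{\mathcal{T}_i})$, where $u = \{\alpha\}$, and (iii) setting the predictor’s metaparameters to the empty set, $\mathbf{w} = \emptyset$. Equation (5) applies with no modifications.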
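A toy instance may help to see how the three choices above fit together. The sketch below assumes a scalar linear predictor and a family of regression tasks that differ only in their slope; the names `meta_train`, `ALPHA` and the task construction are illustrative assumptions. For this quadratic loss the derivative of the adapted parameter with respect to the initialization is available in closed form, so no automatic differentiation is needed.

```python
import numpy as np

ALPHA = 0.1  # inner-loop step size (the only update metaparameter here)

def loss_grad(theta, X, Y):
    """Gradient of the mean squared error of the predictor theta * x w.r.t. theta."""
    return 2.0 * np.mean((theta * X - Y) * X)

def inner_update(theta, X, Y):
    """One SGD step on the task training set."""
    return theta - ALPHA * loss_grad(theta, X, Y)

def meta_train(slopes, X_tr, X_val, t0=0.0, eta=0.05, n_outer=100):
    """Outer loop: move the shared initialization t0 to lower the post-adaptation validation loss."""
    d_theta1_d_t0 = 1.0 - 2.0 * ALPHA * np.mean(X_tr ** 2)  # exact for this quadratic loss
    for _ in range(n_outer):
        meta_grad = 0.0
        for a in slopes:                                   # each slope defines one task
            theta1 = inner_update(t0, X_tr, a * X_tr)      # theta_0 = t0 (identity initialization)
            meta_grad += loss_grad(theta1, X_val, a * X_val) * d_theta1_d_t0
        t0 -= eta * meta_grad / len(slopes)
    return t0

X_tr = np.array([0.5, 1.0, 1.5])
X_val = np.array([2.0, 2.5])
t0 = meta_train(slopes=[1.0, 2.0, 3.0], X_tr=X_tr, X_val=X_val)
print(t0)
```

In this symmetric toy problem the learned initialization settles at the average task slope, the point from which a single inner step is equally useful for every task.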
MTnet (Lee and Choi, 2018) is a variant of MAML in which the predictor’s metaparameter set $\mathbf{w}$ is not empty. The part of the predictor parameterized with $\mathbf{w}$ is metalearned across tasks and is fixed during task adaptation. The other part, parameterized with $\theta$, is treated exactly as in MAML.
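Optimization as a model for few-shot learning (Ravi and Larochelle, 2016) can be derived from (4) and (5) via the following steps (in addition to those of MAML). First, set the update function $\mathcal{U}_u$ to the update equation of an LSTM-like cell of the form ($\ell$ is the LSTM update step index) $\theta_\ell \leftarrow f_\ell \theta_{\ell-1} - \alpha_\ell \nabla_{\theta_{\ell-1}} \mathcal{L}_{\mathcal{T}_i}(\mathcal{P}_{\theta_{\ell-1},\mathbf{w}}(\mathbf{X}^{tr}_{\mathcal{T}_i}), \mathbf{Y}^{tr}_{\mathcal{T}_i})$. Second, set $f_\ell$ to be the LSTM’s forget gate value (Ravi and Larochelle, 2016): $f_\ell = \sigma(\mathbf{W}_F [\nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}, \mathcal{L}_{\mathcal{T}_i}, \theta_{\ell-1}, f_{\ell-1}] + \mathbf{b}_F)$ and $\alpha_\ell$ to be the LSTM’s input gate value: $\alpha_\ell = \sigma(\mathbf{W}_\alpha [\nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}, \mathcal{L}_{\mathcal{T}_i}, \theta_{\ell-1}, \alpha_{\ell-1}] + \mathbf{b}_\alpha)$. Here $\sigma$ is a sigmoid nonlinearity. Finally, include all the LSTM parameters into the set of update metaparameters: $u = \{\mathbf{W}_F, \mathbf{b}_F, \mathbf{W}_\alpha, \mathbf{b}_\alpha\}$.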
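Prototypical Networks (PNs) (Snell et al., 2017). Most metric-based metalearning approaches, including the PNs, rely on comparing embeddings of the task training set samples with those of the validation set. Therefore, it is convenient to consider a composite predictor consisting of the embedding function, $\mathcal{E}_{\mathbf{w}}$, and the comparison function, $\mathcal{C}_{\theta}$: $\mathcal{P}_{\theta,\mathbf{w}}(\cdot) = \mathcal{C}_{\theta} \circ \mathcal{E}_{\mathbf{w}}(\cdot)$. Concretely, PNs can be derived from (4) and (5) as follows. Consider a $K$-shot image classification task, $\mathcal{E}_{\mathbf{w}}$ to be a convolutional network with parameters $\mathbf{w}$, and $\theta$ to be class prototype vectors: $\theta = \{\mathbf{p}_k\}_{\forall k}$, $\mathbf{p}_k = \frac{1}{K} \sum_{i: y^{tr}_i = k} \mathcal{E}_{\mathbf{w}}(\mathbf{x}^{tr}_i)$. Initialization function $\mathcal{I}_{t_0}$ with $t_0 = \emptyset$ simply sets $\theta$ to the values of prototypes. $\mathcal{U}_u$ is an identity map with $u = \emptyset$ and $\mathcal{C}_{\theta}$ is a softmax classifier:
$$\mathcal{C}_{\theta}(\mathcal{E}_{\mathbf{w}}(\mathbf{x})) = \arg\max_{k}\ \mathrm{softmax}\big(-d(\mathcal{E}_{\mathbf{w}}(\mathbf{x}), \mathbf{p}_k)\big). \tag{6}$$
Here $d(\cdot,\cdot)$ could be Euclidean distance and the softmax is normalized w.r.t. all $\mathbf{p}_k$. Finally, define the loss $\mathcal{L}_{\mathcal{T}_i}$ in (5) as the cross-entropy of the softmax classifier described in (6). Interestingly, $\theta = \{\mathbf{p}_k\}_{\forall k}$ are nothing else than the dynamically generated weights of the final linear layer fed into the softmax of a regular image classifier, which is especially apparent when $d(\mathbf{a}, \mathbf{b}) = -\mathbf{a} \cdot \mathbf{b}$. The fact that in the prototypical network scenario only the final linear layer weights are dynamically generated based on the task training set resonates very well with the most recent study of MAML (Raghu et al., 2019). It has been shown that most of MAML’s gain can be recovered by only adapting the weights of the final linear layer in the inner loop.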
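The prototype-based initialization and the softmax comparison step can be illustrated with a small NumPy sketch. It assumes squared Euclidean distance and uses the identity map in place of the convolutional embedding, so the two-dimensional points below play the role of already-embedded images; the function names are hypothetical.

```python
import numpy as np

def prototypes(embeddings, labels, n_classes):
    """Initialization: one prototype per class, the mean embedding of its training samples."""
    return np.stack([embeddings[labels == k].mean(axis=0) for k in range(n_classes)])

def class_probabilities(query, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    dist = ((query[None, :] - protos) ** 2).sum(axis=1)
    logits = -dist
    logits = logits - logits.max()   # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# 2-way, 2-shot toy task with precomputed embeddings.
train_emb = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
train_lab = np.array([0, 0, 1, 1])
protos = prototypes(train_emb, train_lab, n_classes=2)
probs = class_probabilities(np.array([1.0, 1.0]), protos)
print(protos, probs.argmax())
```

Nothing is optimized inside the task: the task-specific parameters are produced in one shot from the task training set, which is why the update function is the identity in this case.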
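Matching networks (Vinyals et al., 2016) are similar to the PNs with a few adjustments. First, the distance is replaced by the cosine similarity $c(\cdot,\cdot)$. Second, in the vanilla matching network architecture, the comparison function is defined, assuming the labels are one-hot encoded, as a soft nearest neighbor:
$$\hat{\mathbf{y}} = \sum_{(\mathbf{x}^{tr}, \mathbf{y}^{tr}) \in \mathcal{D}^{tr}_{\mathcal{T}_i}} \mathrm{softmax}\big(c(\mathcal{E}_{\mathbf{w}}(\mathbf{x}), \mathcal{E}_{\mathbf{w}}(\mathbf{x}^{tr}))\big)\, \mathbf{y}^{tr}.$$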
The softmax is normalized w.r.t. and predictor parameters, dynamically generated by , include embedding/label pairs: . In the FCE matching network, validation and training embeddings additionally interact with the task training set via attention LSTMs (Vinyals et al., 2016). To reflect this, the update function, , updates the original embeddings via LSTM equations (Appendix A.2 in (Vinyals et al., 2016)): . The LSTM’s parameters are included in . Second, the predictor is augmented with an additional relation module , , with the set of predictor metaparameters extended accordingly: . The relation module is again implemented via LSTM: (cf. Vinyals et al. (2016), Appendix A.1).
TADAM (Oreshkin et al., 2018) extends PNs by dynamically conditioning the embedding function on the task training data via FiLM layers (Perez et al., 2018). TADAM’s predictor has the following form: ; . The compare function parameters are as before, . The embedding function parameters include the FiLM layer (scale/shift) vectors for each convolutional layer, generated by a separate FC network from the task embedding. The initialization function sets to all zeros, embeds task training data, and sets the task embedding to the average of class prototypes. The update function whose metaparameters include the coefficients of the FC network, , generates an update to from the task embedding. Then it generates an update to the class prototypes using conditioned with the updated .
LEO (Rusu et al., 2019) uses a fixed pretrained embedding function. The intermediate lowdimensional latent space is optimized and is used to generate the predictor’s taskadaptive final layer weights. LEO’s predictor has the following form: . The predictor has final layer and the latent space parameters, , and no metaparameters, . The initialization function , , uses a task encoder and a relation network with metaparameters and , respectively, to metainitialize the latent space parameters, , based on the task training data. The update function , , uses a decoder with metaparameters to iteratively decode into the final layer weights, . It optimizes by executing gradient descent .
In this section, we illustrated that seven distinct metalearning algorithms from two broad categories (optimization-based and metric-based) can be derived from our equations (4) and (5). This confirms that our metalearning framework is general and can serve as a useful tool to analyze existing and perhaps synthesize new metalearning algorithms.
3 NBEATS as a Metalearning Algorithm
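Let us now focus on the analysis of NBEATS described by equations (2), (3). We first introduce the following notation: $f: \mathbf{x}_\ell \mapsto \mathbf{h}_{\ell,4}$; $g: \mathbf{h}_{\ell,4} \mapsto \hat{\mathbf{y}}_\ell$; $q: \mathbf{h}_{\ell,4} \mapsto \hat{\mathbf{x}}_\ell$. In the original equations, $g$ and $q$ are linear and hence can be represented by equivalent matrices $\mathbf{G}$ and $\mathbf{Q}$. In the following, we keep the notation general as much as possible, transitioning to the linear case only at the end of our analysis. Then, given the network input, $\mathbf{x}$ ($\mathbf{x}_1 \equiv \mathbf{x}$), and noting that $\hat{\mathbf{x}}_{\ell-1} = q \circ f(\mathbf{x}_{\ell-1})$, we can write the output as follows:
$$\hat{\mathbf{y}} = g \circ f(\mathbf{x}) + \sum_{\ell > 1} g \circ f\big(\mathbf{x}_{\ell-1} - q \circ f(\mathbf{x}_{\ell-1})\big). \tag{7}$$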
3.1 Metalearning Framework Subsumes NBEATS
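NBEATS is now derived from the metalearning framework of Section 2, based on two observations: (i) each application of $g \circ f$ in (7) is a predictor and (ii) each block of NBEATS is the iteration of the inner metalearning loop. More concretely, we have the following: $\mathcal{P}_{\boldsymbol{\mu},\mathbf{w}}(\mathbf{x}) = g_{\mathbf{w}_g} \circ f_{\mathbf{w}_f}(\mathbf{x} - \boldsymbol{\mu})$. Here $\mathbf{w}_g$ and $\mathbf{w}_f$ are parameters of functions $g$ and $f$ in (7). The metaparameters of the predictor, $\mathbf{w} = (\mathbf{w}_g, \mathbf{w}_f)$, are learned across tasks in the outer loop. The task-specific parameters $\theta$ include the sequence of shift vectors, $\theta \equiv \{\boldsymbol{\mu}_\ell\}$, that we explain in detail next. The $\ell$-th block of NBEATS performs the adaptation of the predictor’s task-specific parameters of the form $\boldsymbol{\mu}_\ell \leftarrow \boldsymbol{\mu}_{\ell-1} + \hat{\mathbf{x}}_\ell$, $\boldsymbol{\mu}_0 \equiv \mathbf{0}$. These parameters are used to adjust the predictor’s input at every iteration as $\mathbf{x}_\ell = \mathbf{x} - \boldsymbol{\mu}_{\ell-1}$, as evident from equation (3).
This gives rise to the following initialization and update functions. $\mathcal{I}_{t_0}$ with $t_0 = \emptyset$ sets $\boldsymbol{\mu}_0$ to zero. $\mathcal{U}_u$, with $u = (\mathbf{w}_q, \mathbf{w}_f)$, generates the next parameter update based on $\hat{\mathbf{x}}_\ell$:
$$\boldsymbol{\mu}_\ell \leftarrow \mathcal{U}_u(\boldsymbol{\mu}_{\ell-1}, \mathcal{D}^{tr}_{\mathcal{T}_i}) \equiv \boldsymbol{\mu}_{\ell-1} + q_{\mathbf{w}_q} \circ f_{\mathbf{w}_f}(\mathbf{x} - \boldsymbol{\mu}_{\ell-1}).$$
Interestingly, (i) metaparameters $\mathbf{w}_f$ are shared between the predictor and the update function and (ii) the task training set is limited to the network input, $\mathcal{D}^{tr}_{\mathcal{T}_i} \equiv \{\mathbf{x}\}$. Note that the latter makes sense because the data are TS, with the inputs $\mathbf{x}$ having the same form of internal dependencies as the target outputs $\mathbf{y}$. Hence, observing $\mathbf{x}$ is enough to infer how to predict $\mathbf{y}$ from $\mathbf{x}$ in a way that is similar to how different parts of $\mathbf{x}$ are related to each other.
Finally, according to (7), predictor outputs corresponding to the values of parameters $\boldsymbol{\mu}_\ell$ learned at every iteration of the inner loop are combined in the final output. This corresponds to choosing a predictor of the form $\mathcal{P}_{\boldsymbol{\mu}_{0:L-1},\mathbf{w}}(\cdot) = \sum_{j=0}^{L-1} \omega_j \mathcal{P}_{\boldsymbol{\mu}_j,\mathbf{w}}(\cdot)$, $\omega_j = 1, \forall j$, in (5). The outer learning loop (5) describes the NBEATS training procedure across tasks (TS) with no modification.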
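The equivalence between the block-by-block residual recursion and an inner loop over an input shift can be checked numerically. The sketch below is an illustration under simplifying assumptions: a single ReLU layer stands in for the shared FC stack, the weights are random and untrained, and the names `residual_forward` and `metalearning_forward` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
t, H, width, L = 6, 2, 12, 4          # toy sizes; L is the number of blocks
W = rng.normal(scale=0.4, size=(width, t))
Q = rng.normal(scale=0.2, size=(t, width))
G = rng.normal(scale=0.2, size=(H, width))

def f(x):
    """Shared embedding: one ReLU layer standing in for the FC stack."""
    return np.maximum(W @ x, 0.0)

def residual_forward(x):
    """Block-by-block recursion: subtract backcasts, sum partial forecasts."""
    x_l, y_hat = x.copy(), np.zeros(H)
    for _ in range(L):
        h = f(x_l)
        y_hat += G @ h
        x_l = x_l - Q @ h
    return y_hat

def metalearning_forward(x):
    """Same computation written as an inner loop over the shift parameter mu."""
    mu = np.zeros(t)                   # initialization: the shift starts at zero
    y_hat = np.zeros(H)
    for _ in range(L):
        y_hat += G @ f(x - mu)         # predictor evaluated at the current shift
        mu = mu + Q @ f(x - mu)        # update function produces the next shift
    return y_hat

x = rng.normal(size=t)
print(np.allclose(residual_forward(x), metalearning_forward(x)))
```

The second form makes explicit that the only quantity changing from block to block is the shift, which is computed from the input series alone.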
Remark 3.1.
The final output of the architecture depends on the entire sequence of shifts $\{\boldsymbol{\mu}_\ell\}$. Even if predictor parameters $\mathbf{w}_g$, $\mathbf{w}_f$ are shared across blocks and fixed, the behaviour of the predictor is governed by an extended space of parameters $(\mathbf{w}, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots)$. This has two consequences. First, the expressive power of the architecture grows with the growing number of blocks, in some proportion to the growth of the space spanned by $\{\boldsymbol{\mu}_\ell\}$, even if $\mathbf{w}_g$, $\mathbf{w}_f$ are fixed and shared across blocks. Second, since the number of parameters describing the architecture behaviour grows with the number of blocks, it may lead to a phenomenon similar to overfitting. Therefore, it would be reasonable to expect that at first the addition of blocks will improve generalization performance, because of the increase in expressive power. However, at some point adding more blocks may hurt the generalization performance, because of an effect similar to overfitting, even if $\mathbf{w}_g$, $\mathbf{w}_f$ are fixed and shared across blocks, because at each iteration more information is extracted from $\mathbf{x}$ and the set of parameters $\{\boldsymbol{\mu}_\ell\}$ is expanded.
3.2 Linear Approximation Analysis
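Next, we go a level deeper in the analysis to uncover more intricate task adaptation processes. To this end, we study the behaviour of (7) assuming that residual corrections $\hat{\mathbf{x}}_\ell$ are small. This allows us to derive an alternative interpretation of NBEATS’ metalearning operation, expressing it in terms of the adaptation of the internal weights of the network based on the task input data. Under the assumption of small $\hat{\mathbf{x}}_\ell$, (7) can be analyzed using a Taylor series expansion of $g \circ f(\mathbf{x}_{\ell-1} - \hat{\mathbf{x}}_{\ell-1})$ in the vicinity of $\mathbf{x}_{\ell-1}$. This results in the following first order approximation:
$$\hat{\mathbf{y}} = g \circ f(\mathbf{x}) + \sum_{\ell > 1} \big[ g \circ f(\mathbf{x}_{\ell-1}) - \mathbf{J}_{g \circ f}(\mathbf{x}_{\ell-1})\, q \circ f(\mathbf{x}_{\ell-1}) \big] + o(\| q \circ f(\mathbf{x}_{\ell-1}) \|).$$
Here $\mathbf{J}_{g \circ f}(\mathbf{x}_{\ell-1}) = \mathbf{J}_g(f(\mathbf{x}_{\ell-1}))\, \mathbf{J}_f(\mathbf{x}_{\ell-1})$ is the Jacobian of $g \circ f$ and $o(\cdot)$ is the small O in Landau notation.
We now consider linear $g$ and $q$, as mentioned earlier, in which case $g$ and $q$ are represented by two matrices of appropriate dimensionality, $\mathbf{G}$ and $\mathbf{Q}$; and $\mathbf{J}_g(f(\mathbf{x}_{\ell-1})) = \mathbf{G}$. Thus, the above expression can be simplified as:
$$\hat{\mathbf{y}} = \mathbf{G} f(\mathbf{x}) + \sum_{\ell > 1} \mathbf{G} \big[ \mathbf{I} - \mathbf{J}_f(\mathbf{x}_{\ell-1}) \mathbf{Q} \big] f(\mathbf{x}_{\ell-1}) + o(\| \mathbf{Q} f(\mathbf{x}_{\ell-1}) \|).$$
Continuously applying the linear approximation $f(\mathbf{x}_\ell) = [\mathbf{I} - \mathbf{J}_f(\mathbf{x}_{\ell-1}) \mathbf{Q}] f(\mathbf{x}_{\ell-1}) + o(\| \mathbf{Q} f(\mathbf{x}_{\ell-1}) \|)$ until we reach $\ell = 1$ and recalling that $\mathbf{x}_1 \equiv \mathbf{x}$ we arrive at the following:
$$\hat{\mathbf{y}} = \sum_{\ell > 0} \mathbf{G} \Big[ \prod_{k=1}^{\ell-1} \big[ \mathbf{I} - \mathbf{J}_f(\mathbf{x}_{\ell-k}) \mathbf{Q} \big] \Big] f(\mathbf{x}) + o(\| \mathbf{Q} f(\mathbf{x}) \|). \tag{8}$$
Note that $\mathbf{G} \prod_{k=1}^{\ell-1} [\mathbf{I} - \mathbf{J}_f(\mathbf{x}_{\ell-k}) \mathbf{Q}]$ can be written in the form of sequential updates of $\mathbf{G}$ (reordering the factors changes only terms of second order in $\mathbf{Q}$, which are absorbed in the remainder). Consider $\mathbf{G}'_1 = \mathbf{G}$, then the update equation for $\mathbf{G}'$ can be written as $\mathbf{G}'_\ell = \mathbf{G}'_{\ell-1} [\mathbf{I} - \mathbf{J}_f(\mathbf{x}_{\ell-1}) \mathbf{Q}], \forall \ell > 1$ and (8) becomes:
$$\hat{\mathbf{y}} = \sum_{\ell > 0} \mathbf{G}'_\ell f(\mathbf{x}) + o(\| \mathbf{Q} f(\mathbf{x}) \|). \tag{9}$$
Let us now discuss how the results of the linear approximation analysis can be used to reinterpret NBEATS as an instance of the metalearning framework (4) and (5). According to (9), the predictor can now be represented in a decoupled form $\mathcal{P}_{\theta,\mathbf{w}}(\cdot) = g_{\theta} \circ f_{\mathbf{w}_f}(\cdot)$. Thus, in the predictor, task adaptation is now clearly confined in the decision function, $g_{\theta}$, whereas the embedding function $f_{\mathbf{w}_f}$ only relies on fixed metaparameters $\mathbf{w}_f$. The adaptive parameters $\theta$ include the sequence of projection matrices $\{\mathbf{G}'_\ell\}$. The metainitialization function is parameterized with $t_0 \equiv \mathbf{G}$ and it simply sets $\mathbf{G}'_1 \leftarrow \mathbf{G}$. The main ingredient of the update function is, as before, $q_{\mathbf{w}_q} \circ f_{\mathbf{w}_f}(\cdot)$. Therefore, it is parameterized with $u = (\mathbf{w}_q, \mathbf{w}_f)$, same as in Section 3.1. The update function now consists of two equations:
$$\boldsymbol{\mu}_\ell = \boldsymbol{\mu}_{\ell-1} + \mathbf{Q} f_{\mathbf{w}_f}(\mathbf{x} - \boldsymbol{\mu}_{\ell-1}), \qquad \mathbf{G}'_{\ell+1} = \mathbf{G}'_\ell \big[ \mathbf{I} - \mathbf{J}_f(\mathbf{x} - \boldsymbol{\mu}_{\ell-1}) \mathbf{Q} \big], \qquad \boldsymbol{\mu}_0 = \mathbf{0}. \tag{10}$$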
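The quality of the first order approximation can be probed numerically for a two-block network. The following sketch makes several assumptions that are not part of the paper: a single tanh layer is used as the embedding so that the Jacobian exists everywhere, the weights are random, and the backcast matrix is scaled by a small factor to control the size of the residual correction. If the linearization is correct, the gap between the exact and the linearized output should shrink roughly quadratically with that factor.

```python
import numpy as np

rng = np.random.default_rng(2)
t, H, width = 5, 2, 7
W = rng.normal(scale=0.5, size=(width, t))
G = rng.normal(size=(H, width))
Q0 = rng.normal(size=(t, width))
x = rng.normal(size=t)

def f(z):
    """Smooth embedding (tanh instead of ReLU, an assumption made for differentiability)."""
    return np.tanh(W @ z)

def jac_f(z):
    """Jacobian of f at z, shape (width, t)."""
    return (1.0 - np.tanh(W @ z) ** 2)[:, None] * W

def exact_two_blocks(Q):
    h1 = f(x)
    return G @ h1 + G @ f(x - Q @ h1)

def linearized_two_blocks(Q):
    G1 = G                                        # initialization of the final layer
    G2 = G1 @ (np.eye(width) - jac_f(x) @ Q)      # one update of the final layer
    return (G1 + G2) @ f(x)                       # block input stays fixed at x

def gap(scale):
    Q = scale * Q0
    return np.linalg.norm(exact_two_blocks(Q) - linearized_two_blocks(Q))

print(gap(1e-2), gap(1e-3))
```

Shrinking the scale of the backcast matrix by a factor of ten should reduce the gap by roughly a factor of one hundred, in line with a remainder that is of second order in the residual correction.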
Remark 3.2.
In the linearized analysis, the sequence of input shifts $\{\boldsymbol{\mu}_\ell\}$ becomes an auxiliary internal instrument of the update function. It is used to generate a good sequence of updates $\{\mathbf{G}'_\ell\}$ to the final linear layer of the predictor via an iterative two-stage process. First, for a given previous location in the input space, $\mathbf{x} - \boldsymbol{\mu}_{\ell-1}$, the new location in the input space is predicted by $\mathbf{Q} f_{\mathbf{w}_f}(\cdot)$. Second, the previous location in the input space is translated in the update of $\mathbf{G}'$ by appropriately projecting $\mathbf{J}_f(\mathbf{x} - \boldsymbol{\mu}_{\ell-1})$ via $\mathbf{Q}$. The first order analysis result (9) shows that under certain circumstances, the block-by-block manipulation of the input sequence apparent in (7) is equivalent to producing a sequential update of the final linear layer apparent in (10), with the block input being set to the same fixed value $\mathbf{x}$ (cf. the final layer update behaviour identified in MAML by Raghu et al. (2019) and the results of our analysis of PNs).
The key role in this process seems to be encapsulated in $\mathbf{Q}$, which is responsible both for generating the sequence of input shifts $\boldsymbol{\mu}_\ell$ and for the reprojection of derivatives $\mathbf{J}_f$. We study this aspect in more detail in the next section.
3.3 The Role of $\mathbf{Q}$
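It is hard to study the form of $\mathbf{Q}$ learned from the data in general. However, equipped with the results of the linear approximation analysis presented in Section 3.2, we can study the case of a two-block network, assuming that the $L_2$ norm loss between $\mathbf{y}$ and $\hat{\mathbf{y}}$ is used to train the network. If, in addition, the dataset consists of the set of $N$ pairs $\{\mathbf{x}^i, \mathbf{y}^i\}_{i=1}^{N}$, the dataset-wise loss has the following expression:
$$\mathcal{L} = \sum_i \big\| \mathbf{y}^i - 2\mathbf{G} f(\mathbf{x}^i) + \mathbf{G} \mathbf{J}_f(\mathbf{x}^i) \mathbf{Q} f(\mathbf{x}^i) + o(\| \mathbf{Q} f(\mathbf{x}^i) \|) \big\|^2.$$
Introducing $\Delta\mathbf{y}^i = \mathbf{y}^i - 2\mathbf{G} f(\mathbf{x}^i)$, the error between the default forecast $2\mathbf{G} f(\mathbf{x}^i)$ and the ground truth $\mathbf{y}^i$, and expanding the $L_2$ norm we obtain the following:
$$\mathcal{L} = \sum_i \Delta\mathbf{y}^{i\top} \Delta\mathbf{y}^i + 2 \Delta\mathbf{y}^{i\top} \mathbf{G} \mathbf{J}_f(\mathbf{x}^i) \mathbf{Q} f(\mathbf{x}^i) + f(\mathbf{x}^i)^{\top} \mathbf{Q}^{\top} \mathbf{J}_f^{\top}(\mathbf{x}^i) \mathbf{G}^{\top} \mathbf{G} \mathbf{J}_f(\mathbf{x}^i) \mathbf{Q} f(\mathbf{x}^i) + o(\| \mathbf{Q} f(\mathbf{x}^i) \|).$$
Now, assuming that the rest of the parameters of the network are fixed and neglecting the remainder, we have the derivative with respect to $\mathbf{Q}$ using matrix calculus (Petersen and Pedersen, 2012):
$$\frac{\partial \mathcal{L}}{\partial \mathbf{Q}} = \sum_i 2\, \mathbf{J}_f^{\top}(\mathbf{x}^i) \mathbf{G}^{\top} \Delta\mathbf{y}^i f(\mathbf{x}^i)^{\top} + 2\, \mathbf{J}_f^{\top}(\mathbf{x}^i) \mathbf{G}^{\top} \mathbf{G} \mathbf{J}_f(\mathbf{x}^i) \mathbf{Q} f(\mathbf{x}^i) f(\mathbf{x}^i)^{\top}.$$
Using the above expression we conclude that the first order approximation of optimal $\mathbf{Q}$ satisfies the following equation:
$$\sum_i \mathbf{J}_f^{\top}(\mathbf{x}^i) \mathbf{G}^{\top} \Delta\mathbf{y}^i f(\mathbf{x}^i)^{\top} = - \sum_i \mathbf{J}_f^{\top}(\mathbf{x}^i) \mathbf{G}^{\top} \mathbf{G} \mathbf{J}_f(\mathbf{x}^i) \mathbf{Q} f(\mathbf{x}^i) f(\mathbf{x}^i)^{\top}.$$
Although this does not help to find a closed form solution for $\mathbf{Q}$, it does provide an intuition: the LHS and the RHS are equal when $\Delta\mathbf{y}^i$ and $\mathbf{G} \mathbf{J}_f(\mathbf{x}^i) \mathbf{Q} f(\mathbf{x}^i)$ are negatively correlated. Therefore, $\mathbf{Q}$ satisfying the equation will tend to drive the update to $\mathbf{G}$ in (10) in such a way that on average the projection of $f(\mathbf{x})$ over the update $\mathbf{G} \mathbf{J}_f(\mathbf{x}) \mathbf{Q}$ to matrix $\mathbf{G}$ will tend to compensate the error $\Delta\mathbf{y}$ made by forecasting $\mathbf{y}$ using $\mathbf{G}$ based on metainitialization.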
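The matrix-calculus step in this derivation can be sanity-checked with finite differences. The sketch below treats the linearized two-block loss with the remainder dropped, and replaces the Jacobians, embeddings and default-forecast errors by random arrays of matching shapes; these stand-ins and all names are assumptions made purely for the check.

```python
import numpy as np

rng = np.random.default_rng(3)
N, t, H, width = 4, 3, 2, 5
G = rng.normal(size=(H, width))
J = rng.normal(size=(N, width, t))     # stand-ins for the Jacobians of the embedding
F = rng.normal(size=(N, width))        # stand-ins for the embeddings of each input
dY = rng.normal(size=(N, H))           # stand-ins for the default-forecast errors

def loss(Q):
    """Linearized two-block squared-error loss, remainder neglected."""
    return sum(np.sum((dY[i] + G @ J[i] @ Q @ F[i]) ** 2) for i in range(N))

def grad(Q):
    """Analytic derivative of the loss with respect to the backcast matrix."""
    out = np.zeros_like(Q)
    for i in range(N):
        A = J[i].T @ G.T                                        # shape (t, H)
        out += 2.0 * np.outer(A @ dY[i], F[i])                  # term linear in the error
        out += 2.0 * A @ G @ J[i] @ Q @ np.outer(F[i], F[i])    # term quadratic in Q
    return out

def numeric_grad(Q, eps=1e-6):
    """Central finite differences, one entry of Q at a time."""
    out = np.zeros_like(Q)
    for idx in np.ndindex(*Q.shape):
        E = np.zeros_like(Q)
        E[idx] = eps
        out[idx] = (loss(Q + E) - loss(Q - E)) / (2 * eps)
    return out

Q = rng.normal(size=(t, width))
print(np.allclose(grad(Q), numeric_grad(Q), rtol=1e-4, atol=1e-4))
```

Setting the analytic gradient to zero gives the optimality condition discussed in the text, with the linear term balancing the quadratic one.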
In this section we established that NBEATS is an instance of a metalearning algorithm described by equations (4) and (5). We showed that each block of NBEATS is an inner metalearning loop that generates additional shift parameters specific to the input timeseries. Therefore, the expressive power of the architecture is expected to grow with each additional block, even if all blocks share their parameters. We used linear approximation analysis to show that the input shift in a block is equivalent to the update of the block’s final linear layer weights under certain conditions. We further provided mathematical intuition hinting that in a twoblock network, the second block will on average tend to compensate the forecasting error made by the first block, even if the blocks share the same network weights.
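Table 1: Forecasting errors on the target datasets (lower is better); the metric is given in each column header. The first five rows list reference methods for each dataset. NBEATS is trained on the target dataset itself. NBSHM4/NBNSHM4 and NBSHFR/NBNSHFR are zeroshot NBEATS models with shared (SH) or non-shared (NSH) block weights, trained on M4 and fred, respectively.

| M4 | sMAPE | M3 | sMAPE | tourism | MAPE | electricity / traffic | electricity, ND | traffic, ND | fred | sMAPE |
|---|---|---|---|---|---|---|---|---|---|---|
| Pure ML | 12.894 | Comb | 13.52 | ETS | 20.88 | MatFact | 0.160 | 0.200 | ETS | 14.16 |
| Best STAT | 11.986 | ForePro | 13.19 | Theta | 20.88 | DeepAR | 0.070 | 0.170 | Naïve | 12.79 |
| ProLogistica | 11.845 | Theta | 13.01 | ForePro | 19.84 | DeepState | 0.083 | 0.167 | SES | 12.70 |
| Best ML/TS | 11.720 | DOTM | 12.90 | Strato | 19.52 | Theta | 0.079 | 0.178 | Theta | 12.20 |
| DL/TS hybrid | 11.374 | EXP | 12.71 | LCBaker | 19.35 | ETS | 0.083 | 0.702 | ARIMA | 12.15 |
| NBEATS | 11.135 | | 12.37 | | 18.52 | | 0.067 | 0.114 | | 11.49 |
| NBSHM4 | n/a | | 12.44 | | 18.82 | | 0.094 | 0.147 | | 11.60 |
| NBNSHM4 | n/a | | 12.38 | | 18.92 | | 0.102 | 0.152 | | 11.70 |
| NBSHFR | 11.701 | | 12.69 | | 19.94 | | | | | n/a |
| NBNSHFR | 11.675 | | 12.61 | | 19.46 | | | | | n/a |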
4 ZeroShot TS Forecasting Task
Base datasets. M4 (M4 Team, 2018), contains 100k TS representing demographic, finance, industry, macro and micro indicators. Sampling frequencies include yearly, quarterly, monthly, weekly, daily and hourly. fred is a dataset introduced in this paper containing 290k US and international economic TS from 89 sources, a subset of Federal Reserve economic data (Federal Reserve Bank of St. Louis (2019); see Appendix D.2 for a detailed description). M3 (Makridakis and Hibon, 2000) contains 3003 TS from domains and sampling frequencies similar to M4. tourism (Athanasopoulos et al., 2011) includes monthly, quarterly and yearly series of indicators related to tourism activities supplied by governmental tourism organizations and various academics. electricity (Dua and Graff, 2017; Yu et al., 2016) represents the hourly electricity usage of 370 customers over three years. traffic (Dua and Graff, 2017; Yu et al., 2016) tracks hourly occupancy rates scaled in a (0,1) range of 963 lanes in the San Francisco Bay Area freeways over a period of slightly more than one year. Additional details for all datasets appear in Appendix D.
The zeroshot forecasting task definition. One of the base datasets, a source dataset, is used to train a machine learning model. The entire source dataset can be used for training. The trained model can then forecast a TS in a target dataset. The source and the target datasets are distinct: they do not contain TS whose values are linear transformations of each other. The forecasted TS is split into two non-overlapping pieces: the history and the test. The history is used as model input and the test is used to compute the forecast error metric. We use the history and the test splits for the base datasets consistent with their original publication, unless explicitly stated otherwise. To make forecasts, the model is allowed to access the TS in the target dataset on a one-at-a-time basis. This is to avoid having the model implicitly learn/adapt based on any information contained in the target dataset other than the history of the forecasted TS. If any adjustments of model parameters or hyperparameters are necessary, they are allowed exclusively using the history of the forecasted TS.
5 Empirical Results
Experiments follow the defined zeroshot forecasting setup and the base datasets presented in Section 4. We mostly follow the original training setup of Oreshkin et al. (2020) to train NBEATS on a source dataset, with one exception. We scale/descale the architecture input/output by dividing/multiplying all input/output values by the max value of the input window. This does not affect the accuracy of the model in the usual train/test scenario. In the zeroshot regime, this operation is intended to prevent catastrophic failure when the scale of the target dataset differs significantly from the source dataset. Additional training setup details are provided in Appendix B.2. For each dataset, we compare our results with 5 representative entries reported in the literature for that dataset, according to the customary metrics specific to each dataset (M4, fred, M3: sMAPE; tourism: MAPE; electricity, traffic: ND). Our main results appear in Table 1 and more details are provided in Appendix E. In the zeroshot forecasting regime, NBEATS consistently outperforms most statistical models tailored to these datasets. NBEATS trained on fred and applied in zeroshot regime to M4 outperforms the best statistical model selected for its performance on M4 and is on par with the competition’s second entry (boosted trees). On M3 and tourism the zeroshot forecasting performance of NBEATS is better than that of the M3 winner, Theta (Assimakopoulos and Nikolopoulos, 2000). On electricity and traffic NBEATS performs close to or better than other neural models trained on these datasets. The results overall suggest that a neural model (NBEATS) is able to extract general knowledge about the TS forecasting task and then successfully adapt it to forecast on unseen TS. We believe our study presents the first example of successfully applying a neural model to zeroshot TS forecasting.
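The evaluation protocol and the input/output scaling can be sketched as follows. This is an illustrative outline, not the authors' code: `model` is any callable that maps a scaled lookback window to a scaled forecast, the naive last-value model and the mean absolute error used below are placeholders, and the guard for non-positive windows is an extra assumption.

```python
import numpy as np

def scaled_forecast(model, history, lookback):
    """Forecast one series: divide the input window by its max, multiply the output back."""
    window = np.asarray(history[-lookback:], dtype=float)
    scale = np.max(window)
    if scale <= 0:               # guard for non-positive windows (our assumption)
        scale = 1.0
    return scale * model(window / scale)

def zero_shot_evaluate(model, target_series, horizon, lookback, metric):
    """Visit target series one at a time; only each series' own history reaches the model."""
    errors = []
    for series in target_series:
        series = np.asarray(series, dtype=float)
        history, test = series[:-horizon], series[-horizon:]
        errors.append(metric(test, scaled_forecast(model, history, lookback)))
    return float(np.mean(errors))

# Placeholder model and metric for the demonstration.
horizon, lookback = 2, 3
naive = lambda window: np.repeat(window[-1], horizon)
mae = lambda y, y_hat: float(np.mean(np.abs(y - y_hat)))

score = zero_shot_evaluate(naive, [[1, 2, 3, 4, 5, 6]], horizon, lookback, mae)
print(score)
```

Because the window is normalized before it reaches the model, a model trained on series of one magnitude receives inputs of comparable magnitude on any target series, which is the purpose of the scaling step.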
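Figure 1: Zeroshot forecasting performance of NBEATS trained on M4 and applied to the M3 (left) and tourism (right) target datasets, as a function of the number of blocks. Each plot shows results with (blue) and without (red) weight sharing across blocks. Block width is 1024. The mean performance and one standard deviation interval (computed using ensemble bootstrap) are shown.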
Expressive power. Remark 3.1 on expressive power implies that NBEATS internally generates a sequence of parameters that dynamically extend the expressive power of the architecture with each newly added block, even if block parameters are shared. To validate this hypothesis, we performed an experiment studying the zeroshot forecasting performance of NBEATS with increasing number of blocks, with and without parameter sharing. The architecture was trained on M4 and the performance was measured on the target datasets M3 and tourism. The results are presented in Fig. 1 (the extended set of results for all datasets, using fred as a source dataset, a few metrics and varying layer width, is provided in Appendix F; the extended results further reinforce the findings in Fig. 1). On the two datasets and for the shared-weights configuration, we consistently see performance improvement when the number of blocks increases up to about 30 blocks. In the same scenario, increasing the number of blocks beyond 30 leads to a small but consistent deterioration in performance, an effect similar to overfitting. Recall that increasing the number of blocks with sharing does not lead to an increase in the number of metaparameters; only the sequence of task-specific parameters is extended on the fly. In our view, this provides evidence supporting the metalearning interpretation of NBEATS, with a simple interpretation of this phenomenon as overfitting in the inner loop of metalearning. It is not clear otherwise how to explain the generalization dynamics in Fig. 1.
Additionally, the performance improvement due to metalearning alone (shared weights, multiple blocks vs. a single block) is 12.60 to 12.44 (1.2%) and 20.40 to 18.82 (7.8%) for M3 and tourism, respectively (see Fig. 1). The performance improvement due to metalearning and unique weights (unique weights, multiple blocks vs. a single block) is 12.60 to 12.40 (1.6%) and 20.40 to 18.91 (7.4%). Clearly, the majority of the gain is due to the metalearning alone. The introduction of unique block weights sometimes results in marginal gain, but often leads to a loss (see more results in Appendix F). This has a clear implication for reducing the memory footprint of neural networks.
It is interesting to make a note about the scale of the improvement. On the tourism dataset (see Fig. 1, right), the zeroshot error of 1 block (MAPE 20.40) is a little bit better than that of the out-of-the-box models ETS and Theta (MAPE 20.88). As the number of blocks grows, the error drops to a MAPE of 18.80, outperforming the statistical method of LeeCBaker (tourism competition winner, handcrafted specifically for tourism (Athanasopoulos et al., 2011)). For the M3 target dataset (see Fig. 1, left) we can see the generalization performance with one NBEATS block (sMAPE 12.60) is a bit better than the best known statistical model (EXP method, sMAPE 12.71). Increasing the number of blocks closes the generalization gap between the zeroshot performance (sMAPE 12.40) and the regular NBEATS trained on M3 (sMAPE 12.37).
In this section, we presented empirical evidence that neural networks are able to provide highquality zeroshot forecasts on unseen TS. We further empirically supported the hypothesis that metalearning adaptation mechanisms identified within NBEATS in Section 3 are instrumental in achieving impressive zeroshot forecasting accuracy results. Our results provide positive evidence to stimulate research on (i) addressing the cold start problem in neural TS forecasting and (ii) designing memoryefficient neural networks.
6 Related Work
From a highlevel perspective, there are many links with classical TS modeling: a humanspecified classical model is typically designed to generalize well on unseen TS, while we propose to automate that process. The classical models include exponential smoothing with and without seasonal effects (Holt, 1957, 2004; Winters, 1960), multitrace exponential smoothing approaches, e.g. Theta and its variants (Assimakopoulos and Nikolopoulos, 2000; Fiorucci et al., 2016; Spiliotis et al., 2019). Finally, the state space modeling approach encapsulates most of the above in addition to autoARIMA and GARCH (Engle, 1982) (see Hyndman and Khandakar (2008) for an overview). The statespace approach has also been underlying significant amounts of research in the neural TS modeling (Salinas et al., 2019; Wang et al., 2019; Rangapuram et al., 2018). However, those models have not been considered in the zeroshot scenario. In this work we focus on studying the importance of metalearning for successful zeroshot forecasting. The foundations of metalearning have been developed by Schmidhuber (1987); Bengio et al. (1991). More recently, metalearning research has been expanding, mostly outside of the TS forecasting domain (Ravi and Larochelle, 2016; Finn et al., 2017; Snell et al., 2017; Vinyals et al., 2016; Rusu et al., 2019). In the TS domain, metalearning has manifested itself via neural models trained over a collection of TS (Smyl, 2020; Oreshkin et al., 2020) or via a model trained to predict weights combining outputs of several classical forecasting algorithms (MonteroManso et al., 2020). Successful application of a neural TS forecasting model trained on a source dataset and finetuned on the target dataset was demonstrated by Hooshmand and Sharma (2019); Ribeiro et al. (2018) as well as in the context of TS classification by Fawaz et al. (2018). Unlike those, we focus on the zeroshot scenario and address the cold start problem.
7 Conclusions
Zeroshot transfer learning. We propose a broad metalearning framework and explain the metalearning mechanisms facilitating zeroshot forecasting. Our results show that neural networks are able to extract generic knowledge about forecasting and apply it to solve the zeroshot forecasting problem. Residual architectures in general are covered by the analysis presented in Section 3. The results of this study may thus be applicable to explain some of the success of residual architectures. The extensions to validate this hypothesis are subject to future work.
Memory efficiency. Our analysis clearly suggests that the network is producing, onthefly, compact taskspecific parameters via residual connections. This makes sharing weights across residual blocks effective, resulting in neural networks with a reduced memory footprint and comparable statistical performance.
References
 The theta model: a decomposition approach to forecasting. International Journal of Forecasting 16 (4), pp. 521–530. Cited by: §D.3, §E.3, §5, §6.
 The tourism forecasting competition. International Journal of Forecasting 27 (3), pp. 822–844. Cited by: Appendix A, §E.3, §E.4, §1, §4, §5.
 The value of feedback in forecasting competitions. International Journal of Forecasting 27 (3), pp. 845–849. Cited by: §E.4.
 Winning methods for forecasting tourism time series. International Journal of Forecasting 27 (3), pp. 850–852. Cited by: §E.4.
 On the optimization of a synaptic learning rule. In Optimality in Artificial and Biological Neural Networks, Cited by: §1.1.
 Learning a synaptic learning rule. In Proceedings of the International Joint Conference on Neural Networks, Seattle, USA, pp. II–A969. Cited by: §1.1, §6.
 Bagging exponential smoothing methods using STL decomposition and Box–Cox transformation. International Journal of Forecasting 32 (2), pp. 303–312. Cited by: Table 7.
 Retail store scheduling for profit. European Journal of Operational Research 239 (3), pp. 609 – 624. Cited by: §1.
 UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §D.5, §4.

 Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50 (4), pp. 987–1007. Cited by: §1, §6.
 Transfer learning for time series classification. 2018 IEEE International Conference on Big Data (Big Data). Cited by: §6.
 FRED economic data. Note: Data retrieved from https://fred.stlouisfed.org/ Accessed: 2019-11-01 Cited by: §D.2, §4.
 Modelagnostic metalearning for fast adaptation of deep networks. In ICML, pp. 1126–1135. Cited by: §2.1, §6.
 Models for optimising the theta method and their relationship to state space models. International Journal of Forecasting 32 (4), pp. 1151–1161. Cited by: §D.3, §E.3, Table 7, §6.
 DeepAR: probabilistic forecasting with autoregressive recurrent networks. CoRR abs/1704.04110. External Links: 1704.04110 Cited by: Appendix A, §D.5, §E.5, §E.6, Table 10, Table 9.
 Machine learning applications for data center optimization. Technical report, Google. Cited by: §1.
 Theoria motus corporum coelestium in sectionibus conicis solem ambientium. Frid. Perthes and I. H. Besser, Hamburg. Cited by: §1.
 Deep sparse rectifier neural networks. In AISTATS’2011, Cited by: §1.1.
 The formation of learning sets. Psychological Review 56 (1), pp. 51–65. External Links: Document Cited by: §1.1.
 Forecasting trends and seasonals by exponentially weighted averages. Technical Report ONR memorandum no. 5, Carnegie Institute of Technology, Pittsburgh, PA. Cited by: §1, §6.
 Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting 20 (1), pp. 5–10. Cited by: §6.
 Energy predictive models with limited data using transfer learning. In Proceedings of the Tenth ACM International Conference on Future Energy Systems, eEnergy’19, pp. 12–16. Cited by: §1, §6.
 Automatic time series forecasting: the forecast package for R. Journal of Statistical Software 26 (3), pp. 1–22. Cited by: §E.2, §6.
 Another look at measures of forecast accuracy. International Journal of Forecasting 22 (4), pp. 679–688. Cited by: Appendix A.
 Modeling and forecasting call center arrivals: a literature survey and a case study. International Journal of Forecasting 32 (3), pp. 865–874. Cited by: §1.
 Building machines that learn and think like people. Behavioral and Brain Sciences 40, pp. e253. Cited by: §1.1.
 Retail sales force scheduling based on store traffic forecasting. Journal of Retailing 74 (1), pp. 61–88. Cited by: §1.
 Gradientbased metalearning with learned layerwise metric and subspace. In ICML, pp. 2933–2942. Cited by: §2.1.
 Neural networks in supply chain management. In Proceedings for Operating Research and the Management Sciences, pp. 347–352. Cited by: §1.
 Metasgd: learning to learn quickly for few shot learning. CoRR abs/1707.09835. Cited by: §2.1.
 M4 competitor’s guide: prizes and rules. External Links: Link Cited by: Appendix A, §D.1, §4.
 Integrating indigenous knowledge with scientific seasonal forecasts for climate risk management in Lushoto district in Tanzania. Technical Report CCAFS Working Paper No. 103, CGIAR research program on climate change, agriculture and food security. Cited by: §1.
 Statistical and machine learning forecasting methods: concerns and ways forward. PLoS ONE 13 (3). Cited by: §1.
 The accuracy of extrapolation (time series) methods: results of a forecasting competition. Journal of forecasting 1 (2), pp. 111–153. Cited by: §1.
 The m2competition: a realtime judgmentally based forecasting study. International Journal of Forecasting 9 (1), pp. 5–22. Cited by: §1.
 The M3Competition: results, conclusions and implications. International Journal of Forecasting 16 (4), pp. 451–476. Cited by: Appendix A, §D.3, §E.3, §1, §4.
 The M4Competition: results, findings, conclusion and way forward. International Journal of Forecasting 34 (4), pp. 802–808. Cited by: Appendix A, §1.1, §1.
 FFORMA: featurebased forecast model averaging. International Journal of Forecasting 36 (1), pp. 86–92. Cited by: §E.1, §6.
 Rectified linear units improve restricted boltzmann machines. In ICML, pp. 807–814. Cited by: §1.1.
 Weather forecasting: magic, art, science and hypnosis. Weather and Climate 5 (1), pp. 2–5. Cited by: §1.
 NBEATS: neural basis expansion analysis for interpretable time series forecasting. In ICLR, Cited by: Figure 2, §B.1, §B.2, Table 7, Appendix E, §1.1, §1, §5, §6.
 TADAM: task dependent adaptive metric for improved fewshot learning. In NeurIPS, pp. 721–731. Cited by: §2.1.
 FiLM: visual reasoning with a general conditioning layer. In AAAI, Cited by: §2.1.
 The matrix cookbook. Technical University of Denmark. Note: Version 20121115 Cited by: §3.3.
 Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. External Links: 1909.09157 Cited by: §1, §2.1, §2.1, Remark 3.2.
 Deep state space models for time series forecasting. In NeurIPS, Cited by: Appendix A, §D.5, §E.5, §E.6, §6.
 Optimization as a model for fewshot learning. In ICLR, Cited by: §1.1, §2.1, §6.
 Transfer learning with seasonal and trend adjustment for crossbuilding energy forecasting. Energy and Buildings 165, pp. 352–363. Cited by: §1, §6.
 MEXICAN crop observation, management and production analysis services system — COMPASS. In Poster Proceedings of the 12th European Conference on Precision Agriculture, pp. . Cited by: §1.
 Metalearning with latent embedding optimization. In ICLR, Cited by: §2.1, §6.
 DeepAR: probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting. Cited by: §6.
 Evolutionary principles in selfreferential learning. Master’s Thesis, Institut f. Informatik, Tech. Univ. Munich. Cited by: §1.1, §6.
 Financial time series forecasting with deep learning: a systematic literature review: 2005-2019. External Links: 1911.13288 Cited by: §1.
 Decentralized flood forecasting using deep neural networks. arXiv preprint arXiv:1902.02308. Cited by: §1.

 Data preprocessing and augmentation for multiple short time series forecasting with recurrent neural networks. In 36th International Symposium on Forecasting, Cited by: Table 7.
 A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting 36 (1), pp. 75–85. Cited by: §E.1, §1, §6.
 Prototypical networks for fewshot learning. In NIPS, pp. 4080–4090. Cited by: §2.1, §6.
 Forecasting with a hybrid method utilizing data smoothing, a variation of the theta method and shrinkage of seasonal factors. International Journal of Production Economics 209, pp. 92–102. Cited by: §D.3, §E.3, Table 7, §6.
 On the categorization of demand patterns. Journal of the Operational Research Society 56 (5), pp. 495–503. Cited by: §D.1, §D.3.
 Multiple futures prediction. In NeurIPS 32, pp. 15398–15408. Cited by: §1.
 Python Client for FRED API. GitHub, Data Science division of the National Association of REALTORS. External Links: Document, Link Cited by: §D.2.
 Matching networks for one shot learning. In NIPS, pp. 3630–3638. Cited by: §2.1, §6.
 On periodicity in series of related terms. Proc. R. Soc. Lond. A 131, pp. 518–532. Cited by: §1.
 Deep factors for forecasting. In ICML, Cited by: Appendix A, §D.5, §E.5, §E.6, §6.
 Forecasting sales by exponentially weighted moving averages. Management Science 6 (3), pp. 324–342. Cited by: §1, §6.
 Temporal regularized matrix factorization for highdimensional time series prediction. In NIPS, Cited by: Appendix A, §D.5, §D.5, §D.5, §E.5, §E.6, §4.
 On a method of investigating periodicities in disturbed series, with special reference to Wolfer’s sunspot numbers. Phil. Trans. R. Soc. Lond. A 226, pp. 267–298. Cited by: §1.
Appendix A TS Forecasting Metrics
The following metrics are standard scalefree metrics in the practice of forecasting performance evaluation (Hyndman and Koehler, 2006; Makridakis and Hibon, 2000; Makridakis et al., 2018b; Athanasopoulos et al., 2011): MAPE (Mean Absolute Percentage Error), sMAPE (symmetric MAPE) and MASE (Mean Absolute Scaled Error). Whereas sMAPE scales the error by the average between the forecast and ground truth, MASE scales it by the average error of the naïve predictor that simply copies the observation measured m periods in the past, thereby accounting for seasonality. Here m is the periodicity of the data (e.g., 12 for monthly series). OWA (overall weighted average) is an M4specific metric used to rank competition entries (M4 Team, 2018), where the sMAPE and MASE metrics are normalized such that a seasonallyadjusted naïve forecast obtains OWA = 1.0. Normalized Deviation, ND, being a less standard metric in the traditional TS forecasting literature, is nevertheless quite popular in machine learning TS forecasting papers (Yu et al., 2016; Flunkert et al., 2017; Wang et al., 2019; Rangapuram et al., 2018).
In the definition of ND, the sum runs over all TS indices and all TS samples.
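For concreteness, the metrics named above can be sketched in a few lines of Python. This is a minimal illustration based on the commonly used definitions, not the evaluation code used for the reported results. The function names and argument layout are hypothetical, and the MASE scaling term is assumed to be computed on the in-sample history only (some definitions also include the forecast horizon). All series are assumed to be strictly positive, so no division by zero occurs.

```python
from typing import Sequence


def mape(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """MAPE in percent: 100/H * sum(|y - y_hat| / |y|)."""
    return 100.0 / len(y) * sum(abs(a - f) / abs(a) for a, f in zip(y, y_hat))


def smape(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """sMAPE in percent: 200/H * sum(|y - y_hat| / (|y| + |y_hat|))."""
    return 200.0 / len(y) * sum(
        abs(a - f) / (abs(a) + abs(f)) for a, f in zip(y, y_hat)
    )


def mase(history: Sequence[float], y: Sequence[float],
         y_hat: Sequence[float], m: int) -> float:
    """MASE: horizon MAE divided by the MAE of the seasonal naive predictor.

    Assumption: the scaling term uses the in-sample history only.
    """
    naive_errors = [abs(history[j] - history[j - m]) for j in range(m, len(history))]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - f) for a, f in zip(y, y_hat)) / len(y)
    return mae / scale


def owa(smape_model: float, mase_model: float,
        smape_naive2: float, mase_naive2: float) -> float:
    """OWA: average of sMAPE and MASE, each relative to the Naive2 baseline."""
    return 0.5 * (smape_model / smape_naive2 + mase_model / mase_naive2)


def nd(ys: Sequence[Sequence[float]], y_hats: Sequence[Sequence[float]]) -> float:
    """ND: total absolute error over all series and samples, divided by
    the total absolute value of the ground truth."""
    num = sum(abs(a - f) for y, y_hat in zip(ys, y_hats) for a, f in zip(y, y_hat))
    den = sum(abs(a) for y in ys for a in y)
    return num / den
```

By construction, a model whose sMAPE and MASE equal those of the Naive2 baseline obtains an OWA of exactly 1.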
Appendix B NBEATS Details
B.1 Architecture Details
NBEATS (Oreshkin et al., 2020) has a hierarchical structure consisting of multiple stacks, depicted in Figure 2, reproduced from Figure 1 in (Oreshkin et al., 2020) with permission. Each stack internally consists of multiple blocks. The stacks are chained, whereas blocks within a stack are connected using a doubly residual architecture.
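The doubly residual connection can be illustrated with a toy sketch. Each block receives the current residual of the input window and returns a backcast and a partial forecast. The backcast is subtracted from the residual before it is passed to the next block, and the partial forecasts are summed. The code below is not the NBEATS implementation: `run_stack` and `toy_block` are hypothetical names, and the block is a hand-written function rather than a trained network.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# A block maps its input window to a (backcast, partial forecast) pair.
Block = Callable[[Vector], Tuple[Vector, Vector]]


def run_stack(x: Sequence[float], blocks: Sequence[Block], horizon: int) -> Vector:
    """Doubly residual pass: subtract each backcast, sum the partial forecasts."""
    residual = list(x)
    forecast = [0.0] * horizon
    for block in blocks:
        backcast, partial = block(residual)
        residual = [r - b for r, b in zip(residual, backcast)]
        forecast = [f + p for f, p in zip(forecast, partial)]
    return forecast


def toy_block(residual: Vector) -> Tuple[Vector, Vector]:
    """Hypothetical block: removes half of the residual and forecasts
    half of the residual's mean level over a horizon of 2 steps."""
    level = sum(residual) / len(residual)
    return [0.5 * r for r in residual], [0.5 * level] * 2


# The same block (same weights) reused three times, i.e. weight sharing.
shared_blocks = [toy_block] * 3
print(run_stack([4.0, 4.0], shared_blocks, horizon=2))  # [3.5, 3.5]
```

Even though the three blocks are identical, each one sees a different residual and therefore contributes a different partial forecast (2.0, 1.0 and 0.5 per step in this toy case). This is the sense in which adding shared blocks changes the output without adding parameters.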
B.2 Training Details
We use the same overall training framework as defined by Oreshkin et al. (2020), including the stratified uniform sampling of TS in the source dataset to train the model. One model is trained per frequency split of a dataset (e.g. Yearly, Quarterly, Monthly, Weekly, Daily and Hourly frequencies in the M4 dataset). All reported accuracy results are based on an ensemble of 30 models (5 different initializations with 6 different lookback periods). One aspect that we found important in the zeroshot regime, and that differs from the original training setup, is the scaling/descaling of the input/output. We scale the architecture input by dividing all input values by the maximum value of the input window, and descale the output by multiplying all output values by the same maximum. We found that this does not affect the accuracy of the model trained and tested on the same dataset in a statistically significant way. In the zeroshot regime, this operation prevents catastrophic failure when the target dataset scale (marginal distribution) is significantly different from that of the source dataset.
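The scaling and descaling step can be sketched as follows. This is an illustrative wrapper, not the training code: `forecast_with_scaling` and the stand-in `toy_model` are hypothetical, and the guard against a zero maximum is an added assumption (the datasets considered here contain strictly positive values).

```python
from typing import Callable, List, Sequence


def forecast_with_scaling(
    window: Sequence[float],
    model: Callable[[List[float]], List[float]],
) -> List[float]:
    """Divide the input window by its maximum, run the model on the scaled
    window, and multiply the model output by the same maximum."""
    scale = max(window)
    if scale == 0:  # assumption: fall back to no scaling for an all-zero window
        scale = 1.0
    scaled_window = [v / scale for v in window]
    return [scale * v for v in model(scaled_window)]


def toy_model(scaled_window: List[float]) -> List[float]:
    """Hypothetical stand-in for a trained network: last value plus a
    fixed offset, which is not scale-equivariant on its own."""
    return [scaled_window[-1] + 0.5]


small = forecast_with_scaling([2.0, 4.0, 8.0], toy_model)
large = forecast_with_scaling([2000.0, 4000.0, 8000.0], toy_model)
```

With the wrapper, multiplying the input by 1000 multiplies the forecast by 1000 as well, even though `toy_model` itself is sensitive to the absolute level of its input.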
Most of the time, the model trained on a given frequency split of a source dataset is used to forecast the same frequency split on the target dataset. There are a few exceptions to this rule. First, when transferring from M4 to M3, the Others split of M3 is forecasted with the model trained on the Quarterly split of M4. This is because (i) the default horizon length of M4 Quarterly is 8, the same as that of M3 Others, and (ii) M4 Others is heterogeneous and contains Weekly, Daily, Hourly data with horizon lengths 13, 14, 48. So the M4 Quarterly to M3 Others transfer was easier to implement from the coding standpoint. Second, the transfer from M4 to the electricity and traffic datasets is done based on a model trained on M4 Hourly. This is because electricity and traffic contain hourly timeseries with obvious 24hour seasonality patterns. It is worth noting that M4 Hourly contains only 414 timeseries and we can clearly see positive zeroshot transfer in Table 1 from the model trained on this rather small dataset. Third, the transfer from fred to electricity and traffic is done by training the model on the fred Monthly split, double upsampled using bilinear interpolation. This is because fred does not have hourly data. Monthly data naturally provide patterns with seasonality period 12. Upsampling with a factor of two and bilinear interpolation provide data with natural seasonality period 24, most often observed in Hourly data, such as electricity and traffic.
Appendix C Metalearning Analysis Details
C.1 Factors Enabling Metalearning
Let us now analyze the factors that enable the metalearning inner loop obvious in (10). First, and most straightforward, it is not viable without having multiple blocks connected via the backcast residual connection: ). Second, the metalearning inner loop is viable when is nonlinear: the update of is extracted from the curvature of at the point dictated by the input and the sequence of shifts . Indeed, suppose is linear, let’s say . The Jacobian becomes a constant, . Equation (8) simplifies as (note that for linear , (8) is exact):
Therefore, may be replaced with an equivalent that is not data adaptive.
Remark C.1.
Interestingly, happens to be a truncated Neumann series. Denoting MoorePenrose pseudoinverse as , assuming boundedness of and completing the series, , results in . Therefore, under certain conditions, the NBEATS architecture with linear and infinite number of blocks can be interpreted as a linear predictor of a signal in colored noise. Here the part cleans the intermediate space created by projection from the components that are undesired for forecasting and creates the forecast based on the initial projection after it is “sanitized” by .
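The Neumann series mentioned in the remark can be checked numerically in its simplest setting. The sketch below uses a generic invertible 2x2 matrix A as a stand-in (it is not the operator from the remark) and verifies that the partial sums of (I - A)^k approach the inverse of A when the spectral radius of I - A is below one. In this invertible case the pseudoinverse coincides with the ordinary inverse. All names are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two small dense matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def neumann_partial_sum(a: Matrix, terms: int) -> Matrix:
    """Return sum_{k=0}^{terms-1} (I - A)^k for a square matrix A."""
    n = len(a)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    step = [[identity[i][j] - a[i][j] for j in range(n)] for i in range(n)]
    total = [[0.0] * n for _ in range(n)]
    power = identity  # (I - A)^0
    for _ in range(terms):
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
        power = matmul(power, step)
    return total


# Stand-in matrix: I - A has both eigenvalues equal to 0.5, so the series converges.
A = [[0.5, 0.25], [0.0, 0.5]]
truncated = neumann_partial_sum(A, terms=3)        # a few terms only
approx_inverse = neumann_partial_sum(A, terms=60)  # close to the inverse of A
product = matmul(A, approx_inverse)                # close to the identity
```

A finite number of terms gives only an approximation of the inverse, and the approximation improves as terms are added.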
Appendix D Dataset Details
D.1 M4 Dataset Details
Frequency / Horizon  
Type  Yearly/6  Qtly/8  Monthly/18  Wkly/13  Daily/14  Hrly/48  Total 
Demographic  1,088  1,858  5,728  24  10  0  8,708 
Finance  6,519  5,305  10,987  164  1,559  0  24,534 
Industry  3,716  4,637  10,017  6  422  0  18,798 
Macro  3,903  5,315  10,016  41  127  0  19,402 
Micro  6,538  6,020  10,975  112  1,476  0  25,121 
Other  1,236  865  277  12  633  414  3,437 
Total  23,000  24,000  48,000  359  4,227  414  100,000 
Min. Length  19  24  60  93  107  748  
Max. Length  841  874  2812  2610  9933  1008  
Mean Length  37.3  100.2  234.3  1035.0  2371.4  901.9  
SD Length  24.5  51.1  137.4  707.1  1756.6  127.9  
% Smooth  82%  89%  94%  84%  98%  83%  
% Erratic  18%  11%  6%  16%  2%  17% 
Table 2 outlines the composition of the M4 dataset across domains and forecast horizons by listing the number of TS based on their frequency and type (M4 Team, 2018). The M4 dataset is large and diverse: all forecast horizons are composed of heterogeneous TS types (with exception of Hourly) frequently encountered in business, financial and economic forecasting. Summary statistics on series lengths are also listed, showing wide variability therein, as well as a characterization (smooth vs erratic) that follows Syntetos et al. (2005), and is based on the squared coefficient of variation of the series. All series have positive observed values at all timesteps; as such, none can be considered intermittent or lumpy per Syntetos et al. (2005).
D.2 fred Dataset Details
fred is a largescale dataset introduced in this paper containing around 290k US and international economic TS from 89 sources, a subset of Federal Reserve economic data (Federal Reserve Bank of St. Louis, 2019). fred is downloaded using a custom download script based on the highlevel FRED Python API (Velkoski, 2016). This is a Python wrapper over the lowlevel webbased FRED API. For each point in a timeseries, the raw data published at the time of first release are downloaded. All time series with any NaN entries have been filtered out. We focus our attention on Yearly, Quarterly, Monthly, Weekly and Daily frequency data. Other frequencies are available, for example, biweekly and fiveyearly. They are skipped because they are present only in small quantities. These factors explain the fact that the size of the dataset we assembled for this study is 290k, while 672k total timeseries are in principle available (Federal Reserve Bank of St. Louis, 2019). Hourly data are not available in this dataset. For the data frequencies included in the fred dataset, we use the same forecasting horizons as for the M4 dataset: Yearly: 6, Quarterly: 8, Monthly: 18, Weekly: 13 and Daily: 14. The dataset download takes approximately 7-10 days, because of the bandwidth constraints imposed by the lowlevel FRED API. The test, validation and train subsets are defined in the usual way. The test set is derived by splitting the full fred dataset at the left boundary of the last horizon of each time series. Similarly, the validation set is derived from the penultimate horizon of each time series.
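The NaN filter and the horizon-based split described above can be sketched as follows. This is one possible reading of "the usual way", not the dataset preparation script. The names `HORIZONS`, `has_nan` and `split_series` are hypothetical, and the requirement that a series be longer than two horizons is an added assumption.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Forecast horizons per frequency, as listed in the text.
HORIZONS: Dict[str, int] = {
    "Yearly": 6, "Quarterly": 8, "Monthly": 18, "Weekly": 13, "Daily": 14,
}


def has_nan(values: Sequence[float]) -> bool:
    """True if any entry is NaN; such series are filtered out."""
    return any(math.isnan(v) for v in values)


def split_series(
    values: Sequence[float], horizon: int
) -> Tuple[List[float], List[float], List[float]]:
    """Split one series into (train, validation, test) segments.

    The last `horizon` points form the test segment and the penultimate
    `horizon` points form the validation segment.
    """
    if len(values) <= 2 * horizon:  # assumption: skip series that are too short
        raise ValueError("series is too short for two horizons")
    train = list(values[: -2 * horizon])
    validation = list(values[-2 * horizon : -horizon])
    test = list(values[-horizon:])
    return train, validation, test


series = [float(v) for v in range(20)]
train, validation, test = split_series(series, HORIZONS["Yearly"])
```

For a Yearly series of 20 points this yields 8 training points, followed by 6 validation points and 6 test points.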
D.3 M3 Dataset Details
Table 3 outlines the composition of the M3 dataset across domains and forecast horizons by listing the number of TS based on their frequency and type (Makridakis and Hibon, 2000). The M3 is smaller than the M4, but it is still large and diverse: all forecast horizons are composed of heterogeneous TS types frequently encountered in business, financial and economic forecasting. Over the past 20 years, this dataset has supported significant efforts in the design of advanced statistical models, e.g. Theta and its variants (Assimakopoulos and Nikolopoulos, 2000; Fiorucci et al., 2016; Spiliotis et al., 2019). Summary statistics on series lengths are also listed, showing wide variability in length, as well as a characterization (smooth vs erratic) that follows Syntetos et al. (2005), and is based on the squared coefficient of variation of the series. All series have positive observed values at all timesteps; as such, none can be considered intermittent or lumpy per Syntetos et al. (2005).
Frequency / Horizon  
Type  Yearly/6  Quarterly/8  Monthly/18  Other/8  Total 
Demographic  245  57  111  0  413 
Finance  58  76  145  29  308 
Industry  102  83  334  0  519 
Macro  83  336  312  0  731 
Micro  146  204  474  4  828 
Other  11  0  52  141  204 
Total  645  756  1,428  174  3,003 
Min. Length  20  24  66  71  
Max. Length  47  72  144  104  
Mean Length  28.4  48.9  117.3  76.6  
SD Length  9.9  10.6  28.5  10.9  
% Smooth  90%  99%  98%  100%  
% Erratic  10%  1%  2%  0% 
D.4 tourism Dataset Details
Table 4 outlines the composition of the tourism dataset across forecast horizons by listing the number of TS based on their frequency. Summary statistics on series lengths are listed, showing wide variability in length. All series have positive observed values at all timesteps. In contrast to M4 and M3 datasets, tourism includes a much higher fraction of erratic series.
Frequency / Horizon  
Yearly/4  Quarterly/8  Monthly/24  Total  
518  427  366  1,311  
Min. Length  11  30  91  
Max. Length  47  130  333  
Mean Length  24.4  99.6  298  
SD Length  5.5  20.3  55.7  
% Smooth  77%  61%  49%  
% Erratic  23%  39%  51% 
D.5 electricity and traffic Dataset Details
The electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) and traffic (https://archive.ics.uci.edu/ml/datasets/PEMSSF) datasets (Dua and Graff, 2017; Yu et al., 2016) are both part of the UCI repository. electricity represents the hourly electricity usage monitoring of 370 customers over three years. The traffic dataset tracks the hourly occupancy rates, scaled to the (0,1) range, of 963 lanes in the San Francisco bay area freeways over a period of slightly more than a year. Both datasets exhibit strong hourly and daily seasonality patterns.
Both datasets are aggregated to hourly data, but using different aggregation operations: sum for electricity and mean for traffic. The hourly aggregation is done so that all the points available in the interval (h-1:00, h:00] are aggregated to hour h; thus, if the original dataset starts on 2011-01-01 00:15, then the first time point after aggregation will be 2011-01-01 01:00. For the electricity dataset we removed the first year from the training set, to match the training set used in (Yu et al., 2016), based on the aggregated dataset downloaded from what is presumably the authors' github repository (https://github.com/rofuyu/exptrmfnips16/blob/master/python/expscripts/datasets/downloaddata.sh). We also made sure that the data points for both the electricity and traffic datasets after aggregation match those used in (Yu et al., 2016). The authors of the MatFact model used the last 7 days of the datasets as the test set, but the papers from Amazon on DeepAR (Flunkert et al., 2017), Deep State (Rangapuram et al., 2018) and Deep Factors (Wang et al., 2019) use different splits, where the split points are provided by a date. Changing split points without a wellgrounded reason adds uncertainty to the comparability of model performance and creates challenges for the reproducibility of the results; we therefore tried to match all the different splits in our experiments. This was especially challenging on the traffic dataset, where we had to use some heuristics to find the record dates; the dataset authors state: “The measurements cover the period from Jan. 1st 2008 to Mar. 30th 2009” and “We remove public holidays from the dataset, as well as two days with anomalies (March 8th 2009 and March 9th 2008) where all sensors were muted between 2:00 and 3:00 AM.” In spite of this, we failed to match a part of the provided labels of week days to actual dates. Therefore, we had to assume that the actual list of gaps, which includes holidays and anomalous days, is as follows:

Jan. 1, 2008 (New Year’s Day)

Jan. 21, 2008 (Martin Luther King Jr. Day)

Feb. 18, 2008 (Washington’s Birthday)

Mar. 9, 2008 (Anomaly day)

May 26, 2008 (Memorial Day)

Jul. 4, 2008 (Independence Day)

Sep. 1, 2008 (Labor Day)

Oct. 13, 2008 (Columbus Day)

Nov. 11, 2008 (Veterans Day)

Nov. 27, 2008 (Thanksgiving)

Dec. 25, 2008 (Christmas Day)

Jan. 1, 2009 (New Year’s Day)

Jan. 19, 2009 (Martin Luther King Jr. Day)

Feb. 16, 2009 (Washington’s Birthday)

Mar. 8, 2009 (Anomaly day)
The first six gaps were confirmed by the gaps in labels, but the rest were more than one day apart from any public holiday of years 2008 and 2009 in San Francisco, California and the US. Moreover, the number of gaps we found in the labels provided by the dataset authors is 10, while the number of days between Jan. 1st 2008 and Mar. 30th 2009 is 455. Assuming that Jan. 1st 2008 was skipped from the values and labels, we should end up with either 454 - 10 = 444 days instead of 440, or a different end date. The metric used to evaluate performance on the datasets is ND (Yu et al., 2016), which is equal to the loss used in the DeepAR, Deep State, and Deep Factors papers.
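The hourly aggregation rule used in this section, and the day count quoted above, can be reproduced with the Python standard library. The sketch below is illustrative only and is not the preprocessing script. The function names are hypothetical, and the rule implemented is that a timestamp falling in the interval (h-1:00, h:00] is assigned to hour h.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta
from typing import Callable, Dict, Iterable, List, Tuple


def hour_bucket(ts: datetime) -> datetime:
    """Map a timestamp in (h-1:00, h:00] to the hour h."""
    floor = ts.replace(minute=0, second=0, microsecond=0)
    return floor if ts == floor else floor + timedelta(hours=1)


def aggregate_hourly(
    points: Iterable[Tuple[datetime, float]],
    reduce: Callable[[List[float]], float],
) -> Dict[datetime, float]:
    """Group (timestamp, value) pairs by hour bucket and reduce each group,
    e.g. with sum for electricity or a mean for traffic."""
    buckets: Dict[datetime, List[float]] = defaultdict(list)
    for ts, value in points:
        buckets[hour_bucket(ts)].append(value)
    return {hour: reduce(values) for hour, values in sorted(buckets.items())}


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


readings = [
    (datetime(2011, 1, 1, 0, 15), 1.0),
    (datetime(2011, 1, 1, 0, 30), 2.0),
    (datetime(2011, 1, 1, 0, 45), 3.0),
    (datetime(2011, 1, 1, 1, 0), 4.0),
    (datetime(2011, 1, 1, 1, 15), 5.0),
]
hourly_sum = aggregate_hourly(readings, sum)
hourly_mean = aggregate_hourly(readings, mean)

# Inclusive number of days from Jan. 1st 2008 to Mar. 30th 2009.
days_in_period = (date(2009, 3, 30) - date(2008, 1, 1)).days + 1
```

With these readings, the first aggregated time point is 2011-01-01 01:00, in line with the example in the text, and `days_in_period` evaluates to 455.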
Appendix E Empirical Results Details
On all datasets, we consider the original NBEATS (Oreshkin et al., 2020), the model trained on a given dataset and applied to this same dataset. This is provided for the purpose of assessing the generalization gap of the zeroshot NBEATS. We consider four variants of zeroshot NBEATS: NBSHM4, NBNSHM4, NBSHFR, NBNSHFR. SH/NSH option signifies block weight sharing ON/OFF. M4/FR option signifies M4/fred source dataset.
E.1 Detailed M4 Results
On M4 we compare against five M4 competition entries, each representative of a broad model class. Best pure ML is the submission by B. Trotta, the best entry among the 6 pure ML models. Best statistical is the best pure statistical model, by N.Z. Legaki and K. Koutsouri. ProLogistica is a weighted ensemble of statistical methods, the third best M4 participant. Best ML/TS combination is the model by MonteroManso et al. (2020), the second best entry, a gradient boosted tree over a few statistical time series models. Finally, DL/TS hybrid is the winner of the M4 competition (Smyl, 2020). Results are presented in Table 5.
Yearly  Quarterly  Monthly  Others  Average  
(23k)  (24k)  (48k)  (5k)  (100k)  
Best pure ML  14.397  11.031  13.973  4.566  12.894 
Best statistical  13.366  10.155  13.002  4.682  11.986 
ProLogistica  13.943  9.796  12.747  3.365  11.845 
Best ML/TS combination  13.528  9.733  12.639  4.118  11.720 
DL/TS hybrid, M4 winner  13.176  9.679  12.126  4.014  11.374 
NBEATS  12.913  9.213  12.024  3.643  11.135 
NBSHFR  13.267  9.634  12.694  4.892  11.701 
NBNSHFR  13.272  9.596  12.676  4.696  11.675 
E.2 Detailed fred Results
We compare against well established offtheshelf statistical models available from the R forecast package (Hyndman and Khandakar, 2008). Those include Naïve (repeating the last value), ARIMA, Theta, SES and ETS. The quality metric is the regular defined in (1).
Yearly  Quarterly  Monthly  Weekly  Daily  Average  
(133554)  (57569)  (99558)  (1348)  (17)  (292046)  
Theta  16.50  14.24  5.35  6.29  10.57  12.20 
ARIMA  16.21  14.25  5.58  5.51  9.88  12.15 
SES  16.61  14.58  6.45  5.38  7.75  12.70 
ETS  16.46  19.34  8.18  5.44  8.07  14.52 
Naïve  16.59  14.86  6.59  5.41  8.65  12.79 
NBEATS  15.79  13.27  4.79  4.63  8.86  11.49 
NBSHM4  15.00  13.36  6.10  5.67  8.57  11.60 
NBNSHM4  15.06  13.48  6.24  5.71  9.21  11.70 