Forecasting is one of the fundamental scientific problems and is also of great practical utility. The ability to plan and control, as well as to react appropriately to manifestations of complex, partially or completely unknown systems, often relies on the ability to forecast relevant observations based on their past history. The applications of forecasting span a variety of fields, ranging from extremely technical (e.g. vehicle and robot control (Tang and Salakhutdinov, 2019), data center optimization (Gao, 2014)), to more business oriented (supply chain management (Leung, 1995), workforce management (Chapados et al., 2014), forecasting phone call arrivals (Ibrahim et al., 2016) and customer traffic (Lam et al., 1998)) and finally to ones that may be critical for the future survival of humanity, such as precision agriculture (Rodrigues Jr et al., 2019) or fire and flood management (Mahoo et al., 2015; Sit and Demir, 2019). Unsurprisingly, forecasting methods have a long history that can be traced back to the very origins of human civilization (Neale, 1985) and of modern science (Gauss, 1809), and they have consistently attracted considerable attention (Yule, 1927; Walker, 1931; Holt, 1957; Winters, 1960; Engle, 1982; Sezer et al., 2019). The progress made in univariate forecasting in the past four decades is well reflected in the results and methods considered in the associated competitions (Makridakis et al., 1982, 1993; Makridakis and Hibon, 2000; Athanasopoulos et al., 2011; Makridakis et al., 2018b). Growing evidence suggests that machine learning approaches offer a superior modeling methodology to tackle time-series (TS) forecasting tasks, in contrast to some previous assessments (Makridakis et al., 2018a). For example, the winner of the last competition (M4, Makridakis et al. (2018b)) was a deep neural network predicting parameters of a statistical model (Smyl, 2020). The latter result was reinforced by Oreshkin et al. (2020), who improved over the winner using a pure neural network model called N-BEATS. One of the contributions of the present paper is to help understand why N-BEATS works so well, by casting it as a meta-learning method.
On the practical side, the deployment of deep neural architectures is often challenged by the cold start problem. Before a tabula rasa deep neural network provides a useful output, it should be trained on a large problem-specific dataset. For early adopters, this often implies data collection efforts, changing data handling practices and even changing the existing IT infrastructures on a massive scale. In contrast, advanced statistical models can be deployed with significantly less effort as they estimate their parameters on a single TS at a time. In this paper we address the problem of reducing the entry cost of deep neural networks in the industrial practice of TS forecasting. We show that it is viable to train a neural network model on a diversified source dataset and deploy it on a target dataset in a zero-shot regime, i.e. without explicit retraining on that target data, resulting in performance that is at least as good as that of advanced statistical models tailored to the target dataset. In addition, the TS forecasting problem is distinct in that one has to deal upfront with the challenge of out-of-distribution generalization: TS are typically generated by systems whose generative distributions shift significantly over time. Consequently, transfer learning was considered challenging in this domain until very recently (Hooshmand and Sharma, 2019; Ribeiro et al., 2018). Ours is the first work to demonstrate the highly successful zero-shot operation of a deep neural TS forecasting model, thanks to a meta-learning approach.
Addressing this practical problem also provides clues to fundamental questions. Can we learn something general about forecasting and transfer this knowledge across datasets? If so, what kind of mechanisms could facilitate this? The ability to learn and transfer representations across tasks via task adaptation is an advantage of meta-learning (Raghu et al., 2019). We propose here a broad theoretical framework for meta-learning which spans several existing meta-learning algorithms. We further show how N-BEATS fits this meta-learning framework. We identify within N-BEATS internal meta-learning adaptation mechanisms that generate new parameters on-the-fly, specific to a given TS, iteratively extending the architecture’s expressive power. We empirically confirm that these mechanisms are key to improving its zero-shot univariate TS forecasting performance.
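The univariate point forecasting problem in discrete time is formulated given a length-$H$ forecast horizon and a length-$T$ observed series history $[y_1, \ldots, y_T] \in \mathbb{R}^T$. The task is to predict the vector of future values $\mathbf{y} = [y_{T+1}, y_{T+2}, \ldots, y_{T+H}] \in \mathbb{R}^H$. For simplicity, we will later consider a lookback window of length $t \le T$ ending with the last observed value $y_T$ to serve as model input, denoted $\mathbf{x} = [y_{T-t+1}, \ldots, y_T] \in \mathbb{R}^t$. We denote by $\widehat{\mathbf{y}}$ the point forecast of $\mathbf{y}$. Its accuracy can be evaluated with sMAPE, the symmetric mean absolute percentage error (Makridakis et al., 2018b),
$$\mathrm{sMAPE} = \frac{200}{H} \sum_{i=1}^{H} \frac{|y_{T+i} - \widehat{y}_{T+i}|}{|y_{T+i}| + |\widehat{y}_{T+i}|}. \qquad (1)$$
Other quality metrics (e.g. MAPE, MASE, OWA, ND) are possible and are defined in Appendix A.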
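As a concrete illustration of the sMAPE metric in its M4 form (scaled to the range 0 to 200), the following minimal NumPy sketch may help. The function name `smape` and the treatment of terms where both the target and the forecast are zero (counted as zero error) are assumptions made here, not details taken from the paper.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE in percent (range 0..200), averaged over the horizon."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    # Assumption: a term whose denominator is zero contributes zero error.
    ratio = np.divide(np.abs(y_true - y_pred), denom,
                      out=np.zeros_like(denom), where=denom > 0)
    return 200.0 * float(np.mean(ratio))
```

For a single target of 100 forecast as 110, this gives 200 · 10 / 210, roughly 9.52.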
Meta-learning or learning-to-learn (Harlow, 1949; Schmidhuber, 1987; Bengio et al., 1991) is believed to be necessary for intelligent machines (Lake et al., 2017). The ability to meta-learn is usually linked to being able to (i) accumulate knowledge across different tasks (i.e. transfer learning, multi-task learning) and (ii) quickly adapt the accumulated knowledge to the new task (task adaptation) (Ravi and Larochelle, 2016; Lake et al., 2017; Bengio et al., 1992). Accordingly, a meta-learning set-up can be defined by assuming a distribution $p(\mathcal{T})$ over tasks (where each task can be seen as a meta-example), a predictor $\mathcal{P}_{\theta}$ parameterized with parameters $\theta$ and a meta-learning procedure with meta-parameters $\varphi$. Each task (a meta-example) includes a limited set of task training examples and a set of task validation examples. The objective is to design a meta-learner that can generalize well on a new task by appropriately choosing the predictor's parameters $\theta$ after observing the task training data. The meta-learner is trained to do so by being exposed to many tasks in a training dataset $\{\mathcal{T}_i^{train}\}$ sampled from $p(\mathcal{T})$. For each task $\mathcal{T}_i^{train}$, the meta-learner is requested to produce the solution to the task in the form of $\mathcal{P}_{\theta}$, and the meta-learner meta-parameters $\varphi$ are optimized across many tasks based on validation data and loss functions supplied with the tasks. Training on multiple tasks enables the meta-learner to produce solutions $\mathcal{P}_{\theta}$ that generalize well on a set of unseen tasks $\{\mathcal{T}_i^{test}\}$ sampled from $p(\mathcal{T})$.
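N-BEATS (Oreshkin et al., 2020) consists of a total of $L$ blocks connected using a doubly residual architecture that we review in detail below (see Appendix B.1 for full architecture details). Block $\ell$ has input $\mathbf{x}_\ell$ and produces two outputs: the backcast $\widehat{\mathbf{x}}_\ell$ and the partial forecast $\widehat{\mathbf{y}}_\ell$. For the first block we define $\mathbf{x}_1 \equiv \mathbf{x}$, where $\mathbf{x}$ is assumed to be the model-level input from now on. The internal operations of a block are based on a combination of fully connected and linear layers. In this paper, we focus on the configuration of N-BEATS that shares all learnable parameters across blocks. We define the $k$-th fully-connected layer in the $\ell$-th block, having ReLU non-linearity (Nair and Hinton, 2010; Glorot et al., 2011), weight matrix $\mathbf{W}_k$, bias vector $\mathbf{b}_k$ and input $\mathbf{h}_{\ell,k-1}$, as $\mathrm{FC}_k(\mathbf{h}_{\ell,k-1}) \equiv \mathrm{ReLU}(\mathbf{W}_k \mathbf{h}_{\ell,k-1} + \mathbf{b}_k)$. With this notation, one block of N-BEATS with $K$ fully connected layers is described as:
$$\mathbf{h}_{\ell,1} = \mathrm{FC}_1(\mathbf{x}_\ell), \quad \mathbf{h}_{\ell,k} = \mathrm{FC}_k(\mathbf{h}_{\ell,k-1}), \; k = 2, \ldots, K; \qquad \widehat{\mathbf{x}}_\ell = \mathbf{Q}\,\mathbf{h}_{\ell,K}, \quad \widehat{\mathbf{y}}_\ell = \mathbf{G}\,\mathbf{h}_{\ell,K}, \qquad (2)$$
where $\mathrm{FC}$ denotes a fully connected layer and $\mathbf{Q}$ and $\mathbf{G}$ are linear operators which can be seen as linear bases, combined linearly with the $\mathbf{h}_{\ell,K}$ coefficients. Finally, the doubly residual architecture is described by the following recursion (recalling that $\mathbf{x}_1 \equiv \mathbf{x}$):
$$\mathbf{x}_\ell = \mathbf{x}_{\ell-1} - \widehat{\mathbf{x}}_{\ell-1}, \qquad \widehat{\mathbf{y}} = \sum_{\ell=1}^{L} \widehat{\mathbf{y}}_\ell. \qquad (3)$$
The N-BEATS parameters included in the $\mathrm{FC}$ and linear layers are learned by minimizing a suitable loss function (e.g. sMAPE defined in (1)) across multiple TS.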
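The doubly residual recursion can be sketched in a few lines of NumPy. This is an illustrative forward pass only, under assumptions made here: random untrained weights, four fully connected layers per block, weights shared across blocks, and hypothetical helper names (`init_params`, `block`, `nbeats_forward`). It is not the reference implementation.

```python
import numpy as np

def init_params(t, H, width, n_layers=4, seed=0):
    """Random (untrained) parameters shared by all blocks; illustrative only."""
    rng = np.random.default_rng(seed)
    sizes = [t] + [width] * n_layers
    fc = [(rng.normal(0.0, 0.1, (sizes[i + 1], sizes[i])), np.zeros(sizes[i + 1]))
          for i in range(n_layers)]
    Q = rng.normal(0.0, 0.1, (t, width))  # backcast basis
    G = rng.normal(0.0, 0.1, (H, width))  # forecast basis
    return fc, Q, G

def block(x, fc, Q, G):
    """One block: FC stack with ReLU, then linear backcast and forecast heads."""
    h = x
    for W, b in fc:
        h = np.maximum(W @ h + b, 0.0)
    return Q @ h, G @ h

def nbeats_forward(x, params, n_blocks):
    """Each block sees the residual input; partial forecasts are summed."""
    fc, Q, G = params
    residual = np.asarray(x, dtype=float)
    forecast = np.zeros(G.shape[0])
    for _ in range(n_blocks):
        backcast, partial = block(residual, fc, Q, G)
        residual = residual - backcast
        forecast = forecast + partial
    return forecast
```

Because the same `params` are reused in every iteration, increasing `n_blocks` changes the output without adding any learnable parameters.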
2 Meta-learning Framework
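A meta-learning procedure can generally be viewed at two levels: the inner loop and the outer loop. The inner training loop operates within an individual “meta-example” or task $\mathcal{T}_i$ (fast learning loop improving over current $\theta$) and the outer loop operates across tasks (slow learning loop). A task $\mathcal{T}_i$ includes task training data $\mathcal{D}^{tr}_{\mathcal{T}_i}$ and task validation data $\mathcal{D}^{val}_{\mathcal{T}_i}$, both optionally involving inputs, targets and a task-specific loss: $\mathcal{D}^{tr}_{\mathcal{T}_i} = \{\mathbf{X}^{tr}_{\mathcal{T}_i}, \mathbf{Y}^{tr}_{\mathcal{T}_i}, \mathcal{L}_{\mathcal{T}_i}\}$, $\mathcal{D}^{val}_{\mathcal{T}_i} = \{\mathbf{X}^{val}_{\mathcal{T}_i}, \mathbf{Y}^{val}_{\mathcal{T}_i}, \mathcal{L}_{\mathcal{T}_i}\}$. We extend the definition of the predictor originally provided in Section 1.1 by allowing a subset of its parameters, denoted $\mathbf{w}$, to belong to the meta-parameters $\varphi$ and hence not to be task adaptive. Therefore, in our framework, the predictor $\mathcal{P}_{\theta,\mathbf{w}}$ has parameters $\theta$ that can be adapted rapidly, at the task level, and meta-parameters $\mathbf{w}$ that are set by the meta-learning procedure and are slowly learned across tasks.
Accordingly, the meta-learning procedure has three distinct ingredients: (i) meta-parameters $\varphi = (\mathbf{t}_0, \mathbf{w}, \mathbf{u})$, (ii) initialization function $\mathcal{I}_{\mathbf{t}_0}$ and (iii) update function $\mathcal{U}_{\mathbf{u}}$. The meta-learner's meta-parameters $\varphi$ include the meta-parameters of the meta-initialization function, $\mathbf{t}_0$, the meta-parameters of the predictor shared across tasks, $\mathbf{w}$, and the meta-parameters of the update function, $\mathbf{u}$. The meta-initialization function $\mathcal{I}_{\mathbf{t}_0}(\mathcal{D}^{tr}_{\mathcal{T}_i}, \mathbf{c}_{\mathcal{T}_i})$ defines the initial values of parameters $\theta$ for a given task $\mathcal{T}_i$ based on its meta-initialization parameters $\mathbf{t}_0$, task training dataset $\mathcal{D}^{tr}_{\mathcal{T}_i}$ and task meta-data $\mathbf{c}_{\mathcal{T}_i}$. Task meta-data may have, for example, the form of a task ID or a textual task description. The update function $\mathcal{U}_{\mathbf{u}}(\theta_{\ell-1}, \mathcal{D}^{tr}_{\mathcal{T}_i})$ is parameterized with update meta-parameters $\mathbf{u}$. It defines an iterated update to predictor parameters $\theta$ at iteration $\ell$ based on their previous value and the task training set $\mathcal{D}^{tr}_{\mathcal{T}_i}$. The initialization and update functions produce a sequence of predictor parameters, which we compactly write as $\theta_{0:\ell} \equiv \{\theta_0, \ldots, \theta_{\ell-1}, \theta_\ell\}$. We let the final predictor be a function of the whole sequence of parameters, written compactly as $\mathcal{P}_{\theta_{0:\ell},\mathbf{w}}$. One implementation of such a general function could be a Bayesian ensemble or a weighted sum, for example: $\mathcal{P}_{\theta_{0:\ell},\mathbf{w}}(\cdot) = \sum_{j=0}^{\ell} \omega_j \mathcal{P}_{\theta_j,\mathbf{w}}(\cdot)$. If we set $\omega_j = 1$ for $j = \ell$ and $\omega_j = 0$ otherwise, then we get the more commonly encountered situation $\mathcal{P}_{\theta_{0:\ell},\mathbf{w}}(\cdot) \equiv \mathcal{P}_{\theta_\ell,\mathbf{w}}(\cdot)$.
The meta-parameters $\varphi$ are updated in the outer meta-learning loop so as to obtain good generalization in the inner loop, i.e., by minimizing the expected validation loss $\mathcal{L}_{\mathcal{T}_i}$ that maps the ground truth and estimated outputs into the value that quantifies the generalization performance across tasks. The meta-learning framework is succinctly described by the following set of equations.
$$\theta_0 \leftarrow \mathcal{I}_{\mathbf{t}_0}(\mathcal{D}^{tr}_{\mathcal{T}_i}, \mathbf{c}_{\mathcal{T}_i}); \qquad \theta_\ell \leftarrow \mathcal{U}_{\mathbf{u}}(\theta_{\ell-1}, \mathcal{D}^{tr}_{\mathcal{T}_i}), \; \forall \ell > 0. \qquad (4)$$
$$\varphi \leftarrow \arg\min_{\varphi} \; \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \, \mathcal{L}_{\mathcal{T}_i}\!\left[\mathbf{Y}^{val}_{\mathcal{T}_i}, \, \mathcal{P}_{\theta_{0:\ell},\mathbf{w}}(\mathbf{X}^{val}_{\mathcal{T}_i})\right]. \qquad (5)$$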
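The inner loop of the framework (initialization, iterated update, and a predictor that may combine the whole parameter sequence) can be sketched generically as follows. The helper names `inner_loop` and `predict`, and the toy task in the usage note, are hypothetical and serve only to show the control flow; the outer loop that optimizes the meta-parameters is not shown.

```python
def inner_loop(init_fn, update_fn, train_data, n_steps):
    """Produce the sequence theta_0, ..., theta_n for one task."""
    thetas = [init_fn(train_data)]                              # initialization function
    for _ in range(n_steps):
        thetas.append(update_fn(thetas[-1], train_data))        # update function
    return thetas

def predict(predictor, thetas, x, weights=None):
    """Weighted sum of the predictors along the parameter sequence.

    With the default weights only the last theta is used, which is the
    commonly encountered special case.
    """
    if weights is None:
        weights = [0.0] * (len(thetas) - 1) + [1.0]
    return sum(w * predictor(theta, x) for w, theta in zip(weights, thetas))
```

As a toy usage, take a task whose training data is a list of numbers, an initialization that returns 0, and an update that moves the parameter halfway towards the training mean. With data `[3.0, 5.0]` and three steps the sequence is 0, 2, 3, 3.5.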
In this section we explained the proposed broad meta-learning framework and laid out its main equations. Next, we demonstrate how existing meta-learning algorithms can be cast into this framework.
2.1 Existing Meta-learning Algorithms Explained
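MAML and related approaches (Finn et al., 2017; Li et al., 2017; Raghu et al., 2019) can be derived from (4) and (5) by (i) setting the initialization function $\mathcal{I}_{\mathbf{t}_0}$ to be the identity map that copies $\mathbf{t}_0$ into $\theta$, (ii) setting the update function $\mathcal{U}_{\mathbf{u}}$ to be the SGD gradient update: $\mathcal{U}_{\mathbf{u}}(\theta, \mathcal{D}^{tr}_{\mathcal{T}_i}) = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(\mathcal{P}_{\theta,\mathbf{w}}(\mathbf{X}^{tr}_{\mathcal{T}_i}), \mathbf{Y}^{tr}_{\mathcal{T}_i})$, where $\mathbf{u} = \{\alpha\}$, and (iii) setting the predictor's meta-parameters to the empty set, $\mathbf{w} = \emptyset$. Equation (5) applies with no modifications.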
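The MAML-style inner loop described above (identity initialization followed by gradient steps on the task training loss) can be illustrated with a minimal sketch. The linear predictor, the mean squared error loss and the name `maml_inner_loop` are assumptions chosen here for brevity; the outer loop, which in MAML differentiates through these steps, is omitted.

```python
import numpy as np

def maml_inner_loop(theta_init, X, Y, alpha, n_steps):
    """Task adaptation of a linear predictor X @ theta under mean squared error.

    theta_init plays the role of the meta-learned initialization that the
    identity initialization function copies into theta.
    """
    theta = np.array(theta_init, dtype=float)
    for _ in range(n_steps):
        grad = 2.0 * X.T @ (X @ theta - Y) / len(Y)  # gradient of the MSE
        theta = theta - alpha * grad                 # SGD update with step size alpha
    return theta
```

On the toy task with inputs 1 and 2 and targets 2 and 4, starting from zero with a step size of 0.1, one step moves the parameter to 1.0 and further steps converge to the exact solution 2.0.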
MT-net (Lee and Choi, 2018) is a variant of MAML in which the predictor’s meta-parameter set is not empty. The part of the predictor parameterized with is meta-learned across tasks and is fixed during task adaptation. The other part parameterized with is treated exactly as in MAML.
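Optimization as a model for few-shot learning (Ravi and Larochelle, 2016) can be derived from (4) and (5) via the following steps (in addition to those of MAML). First, set the update function $\mathcal{U}_{\mathbf{u}}$ to the update equation of an LSTM-like cell of the form ($\ell$ is the LSTM update step index) $\theta_\ell \leftarrow f_\ell \theta_{\ell-1} - \alpha_\ell \nabla_{\theta_{\ell-1}} \mathcal{L}_{\mathcal{T}_i}(\mathcal{P}_{\theta_{\ell-1},\mathbf{w}}(\mathbf{X}^{tr}_{\mathcal{T}_i}), \mathbf{Y}^{tr}_{\mathcal{T}_i})$. Second, set $f_\ell$ to be the LSTM's forget gate value (Ravi and Larochelle, 2016): $f_\ell = \sigma(\mathbf{W}_F [\nabla_\theta \mathcal{L}_{\mathcal{T}_i}, \mathcal{L}_{\mathcal{T}_i}, \theta_{\ell-1}, f_{\ell-1}] + \mathbf{b}_F)$ and $\alpha_\ell$ to be the LSTM's input gate value: $\alpha_\ell = \sigma(\mathbf{W}_\alpha [\nabla_\theta \mathcal{L}_{\mathcal{T}_i}, \mathcal{L}_{\mathcal{T}_i}, \theta_{\ell-1}, \alpha_{\ell-1}] + \mathbf{b}_\alpha)$. Here $\sigma$ is a sigmoid non-linearity. Finally, include all the LSTM parameters into the set of update meta-parameters: $\mathbf{u} = \{\mathbf{W}_F, \mathbf{b}_F, \mathbf{W}_\alpha, \mathbf{b}_\alpha\}$.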
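Prototypical Networks (PNs) (Snell et al., 2017). Most metric-based meta-learning approaches, including the PNs, rely on comparing embeddings of the task training set samples with those of the validation set. Therefore, it is convenient to consider a composite predictor consisting of the embedding function, $\mathcal{E}_{\mathbf{w}}$, and the comparison function, $\mathcal{C}_{\theta}$: $\mathcal{P}_{\theta,\mathbf{w}}(\cdot) = \mathcal{C}_{\theta} \circ \mathcal{E}_{\mathbf{w}}(\cdot)$. Concretely, PNs can be derived from (4) and (5) as follows. Consider a $K$-shot image classification task with training pairs $(X_j, Y_j)$, $\mathcal{E}_{\mathbf{w}}$ to be a convolutional network with parameters $\mathbf{w}$, and $\theta$ to be class prototype vectors: $\theta = \{\mathbf{p}_k\}$, $\mathbf{p}_k = \frac{1}{K} \sum_{j: Y_j = k} \mathcal{E}_{\mathbf{w}}(X_j)$. The initialization function $\mathcal{I}_{\mathbf{t}_0}$ with $\mathbf{t}_0 = \emptyset$ simply sets $\theta$ to the values of prototypes. The update function $\mathcal{U}_{\mathbf{u}}$ is an identity map with $\mathbf{u} = \emptyset$ and $\mathcal{C}_{\theta}$ is a softmax classifier:
$$\widehat{Y} = \arg\max_k \, \mathrm{softmax}_k\!\left(-d(\mathcal{E}_{\mathbf{w}}(X), \mathbf{p}_k)\right). \qquad (6)$$
Here $d$ could be Euclidean distance and the softmax is normalized w.r.t. all $\mathbf{p}_k$. Finally, define the loss in (5) as the cross-entropy of the softmax classifier described in (6). Interestingly, $\theta = \{\mathbf{p}_k\}$ are nothing else than the dynamically generated weights of the final linear layer fed into the softmax of a regular image classifier, which is especially apparent when $d(\mathbf{a}, \mathbf{b}) = -\mathbf{a} \cdot \mathbf{b}$. The fact that in the prototypical network scenario only the final linear layer weights are dynamically generated based on the task training set resonates very well with the most recent study of MAML (Raghu et al., 2019), which showed that most of MAML's gain can be recovered by only adapting the weights of the final linear layer in the inner loop.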
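To make the prototypical-network predictor concrete, the sketch below computes class prototypes from already-embedded training samples and classifies a query with a softmax over negative distances. The embedding network is left out (the inputs are assumed to be embeddings already), squared Euclidean distance is used as one possible choice of distance, and the function names are hypothetical.

```python
import numpy as np

def prototypes(embeddings, labels, n_classes):
    """Initialization step: theta is the set of per-class mean embeddings."""
    return np.stack([embeddings[labels == k].mean(axis=0) for k in range(n_classes)])

def pn_class_probs(query_embedding, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    d = np.sum((protos - query_embedding) ** 2, axis=1)
    logits = -d
    logits = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

No gradient step is involved in task adaptation here: the only task-specific parameters are the prototypes, which act as dynamically generated weights of the final layer.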
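Matching networks (Vinyals et al., 2016) are similar to the PNs with a few adjustments. First, the comparison measure $d$ is cosine similarity, so the softmax is taken over $d$ itself rather than over $-d$. Second, in the vanilla matching network architecture, the comparison function $\mathcal{C}_{\theta}$ is defined, assuming the training labels $Y_j$ and the prediction $\widehat{Y}$ are one-hot encoded, as a soft nearest neighbor:
$$\widehat{Y} = \sum_{j} \mathrm{softmax}_j\!\left(d(\mathcal{E}_{\mathbf{w}}(X), \mathcal{E}_{\mathbf{w}}(X_j))\right) Y_j.$$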
The softmax is normalized w.r.t. and predictor parameters, dynamically generated by , include embedding/label pairs: . In the FCE matching network, validation and training embeddings additionally interact with the task training set via attention LSTMs (Vinyals et al., 2016). To reflect this, the update function, , updates the original embeddings via LSTM equations (Appendix A.2 in (Vinyals et al., 2016)): . The LSTM’s parameters are included in . Second, the predictor is augmented with an additional relation module , , with the set of predictor meta-parameters extended accordingly: . The relation module is again implemented via LSTM: (cf. Vinyals et al. (2016), Appendix A.1).
TADAM (Oreshkin et al., 2018) extends PNs by dynamically conditioning the embedding function on the task training data via FiLM layers (Perez et al., 2018). TADAM’s predictor has the following form: ; . The compare function parameters are as before, . The embedding function parameters include the FiLM layer (scale/shift) vectors for each convolutional layer, generated by a separate FC network from the task embedding. The initialization function sets to all zeros, embeds task training data, and sets the task embedding to the average of class prototypes. The update function whose meta-parameters include the coefficients of the FC network, , generates an update to from the task embedding. Then it generates an update to the class prototypes using conditioned with the updated .
LEO (Rusu et al., 2019) uses a fixed pretrained embedding function. The intermediate low-dimensional latent space is optimized and is used to generate the predictor’s task-adaptive final layer weights. LEO’s predictor has the following form: . The predictor has final layer and the latent space parameters, , and no meta-parameters, . The initialization function , , uses a task encoder and a relation network with meta-parameters and , respectively, to meta-initialize the latent space parameters, , based on the task training data. The update function , , uses a decoder with meta-parameters to iteratively decode into the final layer weights, . It optimizes by executing gradient descent .
In this section, we illustrated that seven distinct meta-learning algorithms from two broad categories (optimization-based and metric-based) can be derived from our equations (4) and (5). This confirms that our meta-learning framework is general and can serve as a useful tool to analyze existing and perhaps synthesize new meta-learning algorithms.
3 N-BEATS as a Meta-learning Algorithm
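Let us now focus on the analysis of N-BEATS described by equations (2), (3). We first introduce the following notation: $f: \mathbf{x}_\ell \mapsto \mathbf{h}_\ell$, where $\mathbf{h}_\ell$ is the output of the last fully connected layer of block $\ell$; $g: \mathbf{h}_\ell \mapsto \widehat{\mathbf{y}}_\ell$; $q: \mathbf{h}_\ell \mapsto \widehat{\mathbf{x}}_\ell$. In the original equations, $g$ and $q$ are linear and hence can be represented by equivalent matrices $\mathbf{G}$ and $\mathbf{Q}$. In the following, we keep the notation general as much as possible, transitioning to the linear case only at the end of our analysis. Then, given the network input, $\mathbf{x}$ ($\mathbf{x}_1 \equiv \mathbf{x}$), and noting that $\widehat{\mathbf{x}}_{\ell-1} = q \circ f(\mathbf{x}_{\ell-1})$, we can write the output as follows:
$$\widehat{\mathbf{y}} = g \circ f(\mathbf{x}) + \sum_{\ell > 1} g \circ f\!\left(\mathbf{x}_{\ell-1} - q \circ f(\mathbf{x}_{\ell-1})\right). \qquad (7)$$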
3.1 Meta-learning Framework Subsumes N-BEATS
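N-BEATS is now derived from the meta-learning framework of Section 2, based on two observations: (i) each application of $g_{\mathbf{w}_g} \circ f_{\mathbf{w}_f}$ in (7) is a predictor and (ii) each block of N-BEATS is the iteration of the inner meta-learning loop. More concretely, we have the following: $\mathcal{P}_{\mu,\mathbf{w}}(\mathbf{x}) = g_{\mathbf{w}_g} \circ f_{\mathbf{w}_f}(\mathbf{x} - \mu)$. Here $\mathbf{w}_g$ and $\mathbf{w}_f$ are parameters of functions $g$ and $f$ in (7). The meta-parameters of the predictor, $\mathbf{w} = (\mathbf{w}_g, \mathbf{w}_f)$, are learned across tasks in the outer loop. The task-specific parameters $\theta$ include the sequence of shift vectors, $\theta \equiv \{\mu_\ell\}$, that we explain in detail next. The $\ell$-th block of N-BEATS performs the adaptation of the predictor's task-specific parameters of the form $\mu_\ell \leftarrow \mu_{\ell-1} + \widehat{\mathbf{x}}_\ell$, $\mu_0 \equiv \mathbf{0}$. These parameters are used to adjust the predictor's input at every iteration as $\mathbf{x}_\ell = \mathbf{x} - \mu_{\ell-1}$, as evident from equation (3).
This gives rise to the following initialization and update functions. $\mathcal{I}_{\mathbf{t}_0}$ with $\mathbf{t}_0 = \emptyset$ sets $\mu_0$ to zero. $\mathcal{U}_{\mathbf{u}}$, with $\mathbf{u} = (\mathbf{w}_q, \mathbf{w}_f)$, generates the next parameter update based on $\widehat{\mathbf{x}}_\ell$:
$$\mu_\ell \leftarrow \mathcal{U}_{\mathbf{u}}(\mu_{\ell-1}, \mathcal{D}^{tr}_{\mathcal{T}_i}) \equiv \mu_{\ell-1} + q_{\mathbf{w}_q} \circ f_{\mathbf{w}_f}(\mathbf{x} - \mu_{\ell-1}).$$
Interestingly, (i) meta-parameters $\mathbf{w}_f$ are shared between the predictor and the update function and (ii) the task training set is limited to the network input, $\mathcal{D}^{tr}_{\mathcal{T}_i} \equiv \{\mathbf{x}\}$. Note that the latter makes sense because the data are TS, with the inputs $\mathbf{x}$ having the same form of internal dependencies as the target outputs $\mathbf{y}$. Hence, observing $\mathbf{x}$ is enough to infer how to predict $\mathbf{y}$ from $\mathbf{x}$ in a way that is similar to how different parts of $\mathbf{x}$ are related to each other.
Finally, according to (7), predictor outputs corresponding to the values of parameters $\mu_\ell$ learned at every iteration of the inner loop are combined in the final output. This corresponds to choosing a predictor of the form $\mathcal{P}_{\mu_{0:L-1},\mathbf{w}}(\cdot) = \sum_{j=0}^{L-1} \omega_j \mathcal{P}_{\mu_j,\mathbf{w}}(\cdot)$, $\omega_j = 1, \forall j$, in (5). The outer learning loop (5) describes the N-BEATS training procedure across tasks (TS) with no modification.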
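The equivalence between the doubly residual recursion and the meta-learning view (a fixed predictor whose input is shifted by an accumulated, task-specific vector) can be checked numerically. The sketch below uses a single ReLU layer as a stand-in for the fully connected stack and random untrained weights; the symbol `mu` for the shift vector and both function names are choices made here for illustration.

```python
import numpy as np

def f(x, W, b):
    """Stand-in for the FC stack of a block (a single ReLU layer here)."""
    return np.maximum(W @ x + b, 0.0)

def forecast_residual(x, W, b, Q, G, n_blocks):
    """Doubly residual form: each block consumes the residual of the previous one."""
    x_l, y = x.copy(), np.zeros(G.shape[0])
    for _ in range(n_blocks):
        h = f(x_l, W, b)
        y = y + G @ h          # add the partial forecast
        x_l = x_l - Q @ h      # subtract the backcast
    return y

def forecast_meta(x, W, b, Q, G, n_blocks):
    """Meta-learning form: same predictor, input shifted by the parameter mu."""
    mu, y = np.zeros_like(x), np.zeros(G.shape[0])  # initialization: mu_0 = 0
    for _ in range(n_blocks):
        h = f(x - mu, W, b)    # predictor applied to the shifted input
        y = y + G @ h          # unit-weight sum over inner-loop iterations
        mu = mu + Q @ h        # update function: accumulate the backcast
    return y
```

Both functions return the same forecast up to floating-point rounding, which is the sense in which each block acts as one inner-loop iteration even though the weights `W`, `b`, `Q`, `G` never change.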
It is clear that the final output of the architecture depends on the entire sequence of task-specific shift vectors $\{\mu_\ell\}$. Even if the predictor parameters $\mathbf{w}_g$, $\mathbf{w}_f$ are shared across blocks and fixed, the behaviour of the forecast $\widehat{\mathbf{y}}$ is governed by an extended space of parameters $(\mathbf{w}, \mu_1, \mu_2, \ldots)$. This has two consequences. First, the expressive power of the architecture grows with the growing number of blocks, in some proportion to the growth of the space spanned by $\{\mu_\ell\}$, even if $\mathbf{w}_g$, $\mathbf{w}_f$ are fixed and shared across blocks. Second, since the number of parameters describing the architecture behaviour grows with the number of blocks, it may lead to a phenomenon similar to overfitting. Therefore, it would be reasonable to expect that at first the addition of blocks will improve generalization performance, because of the increase in expressive power. However, at some point adding more blocks may hurt the generalization performance, because of an effect similar to overfitting: even if $\mathbf{w}_g$, $\mathbf{w}_f$ are fixed and shared across blocks, at each iteration more information is extracted from $\mathbf{x}$ and the set of parameters $\{\mu_\ell\}$ is expanded.
3.2 Linear Approximation Analysis
Next, we go a level deeper in the analysis to uncover more intricate task adaptation processes. To this end, we study the behaviour of (7) assuming that residual corrections are small. This allows us to derive an alternative interpretation of N-BEATS’ meta-learning operation, expressing it in terms of the adaptation of the internal weights of the network based on the task input data. Under the assumption of small , (7) can be analyzed using a Taylor series expansion, in the vicinity of . This results in the following first order approximation:
Here is the Jacobian of and is the small O in Landau notation.
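The first-order step behind this analysis can be checked numerically: for a small backcast, evaluating the block's feature map at the shifted input is close to applying a Jacobian-dependent linear correction to the features at the unshifted input. The sketch below is an illustration under assumptions made here, not the paper's derivation: a smooth tanh layer stands in for the ReLU stack so that the Jacobian is defined everywhere, the backcast operator `Q` is deliberately scaled to be small, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
t, width = 6, 8
W = 0.5 * rng.normal(size=(width, t))
b = 0.1 * rng.normal(size=width)
Q = 1e-5 * rng.normal(size=(t, width))  # small Q gives a small backcast
x = rng.normal(size=t)

def f(v):
    """Smooth stand-in for the block's FC stack."""
    return np.tanh(W @ v + b)

def jacobian_f(v):
    """Jacobian of f at v, shape (width, t): diag(1 - tanh^2) @ W."""
    return (1.0 - np.tanh(W @ v + b) ** 2)[:, None] * W

delta = Q @ f(x)                                        # backcast of the first block
exact = f(x - delta)                                    # features seen by the next block
first_order = (np.eye(width) - jacobian_f(x) @ Q) @ f(x)  # linearized features
err = np.linalg.norm(exact - first_order)
```

The residual `err` shrinks quadratically with the size of the backcast `delta`, which is the little-o behaviour the approximation relies on.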
We now consider linear and , as mentioned earlier, in which case and are represented by two matrices of appropriate dimensionality, and ; and . Thus, the above expression can be simplified as:
Continuously applying the linear approximation until we reach and recalling that we arrive at the following:
Note that can be written in the form of sequential updates of . Consider , then the update equation for can be written as and (8) becomes:
Let us now discuss how the results of the linear approximation analysis can be used to re-interpret N-BEATS as an instance of the meta-learning framework (4) and (5). According to (9), the predictor can now be represented in a decoupled form . Thus, in the predictor, task adaptation is now clearly confined in the decision function, , whereas the embedding function only relies on fixed meta-parameters . The adaptive parameters include the sequence of projection matrices . The meta-initialization function is parameterized with and it simply sets . The main ingredient of the update function is, as before, . Therefore, it is parameterized with , same as in Section 3.1. The update function now consists of two equations:
In the linearized analysis, the sequence of input shifts becomes an auxiliary internal instrument of the update function. It is used to generate a good sequence of updates to the final linear layer of the predictor via an iterative two-stage process. First, for a given previous location in the input space, , the new location in the input space is predicted by . Second, the previous location in the input space is translated in the update of by appropriately projecting via . The first order analysis result (9) shows that under certain circumstances, the block-by-block manipulation of the input sequence apparent in (7) is equivalent to producing a sequential update of the final linear layer apparent in (10), with the block input being set to the same fixed value (cf. the final layer update behaviour identified in MAML by Raghu et al. (2019) and the results of our analysis of PNs).
The key role in this process seems to be encapsulated in that is responsible for both generating the sequence of input shifts and for the re-projection of derivatives . We study this aspect in more detail in the next section.
3.3 The Role of
It is hard to study the form of learned from the data in general. However, equipped with the results of the linear approximation analysis presented in Section 3.2, we can study the case of a two-block network, assuming that the norm loss between and is used to train the network. If, in addition, the dataset consists of the set of pairs the dataset-wise loss has the following expression:
Introducing , the error between the default forecast and the ground truth , and expanding the L2 norm we obtain the following:
Now, assuming that the rest of the parameters of the network are fixed, we have the derivative with respect to using matrix calculus (Petersen and Pedersen, 2012):
Using the above expression we conclude that the first order approximation of optimal satisfies the following equation:
Although this does not help to find a closed form solution for , it does provide a quite obvious intuition: the LHS and the RHS are equal when and are negatively correlated. Therefore, satisfying the equation will tend to drive the update to in (10) in such a way that on average the projection of over the update to matrix will tend to compensate the error made by forecasting using based on meta-initialization.
In this section we established that N-BEATS is an instance of a meta-learning algorithm described by equations (4) and (5). We showed that each block of N-BEATS is an inner meta-learning loop that generates additional shift parameters specific to the input time-series. Therefore, the expressive power of the architecture is expected to grow with each additional block, even if all blocks share their parameters. We used linear approximation analysis to show that the input shift in a block is equivalent to the update of the block’s final linear layer weights under certain conditions. We further provided mathematical intuition hinting that in a two-block network, the second block will on average tend to compensate the forecasting error made by the first block, even if the blocks share the same network weights.
4 Zero-Shot TS Forecasting Task
Base datasets. M4 (M4 Team, 2018), contains 100k TS representing demographic, finance, industry, macro and micro indicators. Sampling frequencies include yearly, quarterly, monthly, weekly, daily and hourly. fred is a dataset introduced in this paper containing 290k US and international economic TS from 89 sources, a subset of Federal Reserve economic data (Federal Reserve Bank of St. Louis (2019); see Appendix D.2 for a detailed description). M3 (Makridakis and Hibon, 2000) contains 3003 TS from domains and sampling frequencies similar to M4. tourism (Athanasopoulos et al., 2011) includes monthly, quarterly and yearly series of indicators related to tourism activities supplied by governmental tourism organizations and various academics. electricity (Dua and Graff, 2017; Yu et al., 2016) represents the hourly electricity usage of 370 customers over three years. traffic (Dua and Graff, 2017; Yu et al., 2016) tracks hourly occupancy rates scaled in a (0,1) range of 963 lanes in the San Francisco Bay Area freeways over a period of slightly more than one year. Additional details for all datasets appear in Appendix D.
The zero-shot forecasting task definition. One of the base datasets, a source dataset, is used to train a machine learning model. The entire source dataset can be used for training. The trained model can then forecast a TS in a target dataset. The source and the target datasets are distinct: they do not contain TS whose values are linear transformations of each other. The forecasted TS is split into two non-overlapping pieces: the history and the test. The history is used as model input and the test is used to compute the forecast error metric. We use the history and the test splits for the base datasets consistent with their original publication, unless explicitly stated otherwise. To make forecasts, the model is allowed to access the TS in the target dataset on a one-at-a-time basis. This is to avoid having the model implicitly learn/adapt based on any information contained in the target dataset other than the history of the forecasted TS. If any adjustments of model parameters or hyperparameters are necessary, they are allowed exclusively using the history of the forecasted TS.
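The evaluation protocol can be summarized with a short sketch. The split used here (the last `horizon` points of each series as the test piece) is a simplifying assumption, since the paper takes the splits from the original dataset publications; `naive_last_value` is a hypothetical stand-in for a model trained on the source dataset, and `mae` stands in for the dataset-specific metric.

```python
import numpy as np

def zero_shot_evaluate(model, target_series, horizon, metric):
    """Forecast each target series separately, using only its own history."""
    scores = []
    for series in target_series:                # one series at a time
        series = np.asarray(series, dtype=float)
        history, test = series[:-horizon], series[-horizon:]
        forecast = model(history, horizon)      # no access to other target series
        scores.append(metric(test, forecast))
    return float(np.mean(scores))

def naive_last_value(history, horizon):
    """Hypothetical stand-in for a pretrained model: repeat the last value."""
    return np.full(horizon, history[-1])

def mae(y_true, y_pred):
    """Mean absolute error, used here only as a simple example metric."""
    return float(np.mean(np.abs(y_true - y_pred)))
```

The important property is structural: nothing computed on one target series is carried over to the next, so the model cannot adapt to the target dataset as a whole.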
5 Empirical Results
Experiments follow the defined zero-shot forecasting setup and the base datasets presented in Section 4. We mostly follow the original training setup of Oreshkin et al. (2020) to train N-BEATS on a source dataset, with one exception. We scale/descale the architecture input/output by dividing/multiplying all input/output values by the max value of the input window. This does not affect the accuracy of the model in the usual train/test scenario. In the zero-shot regime, this operation is intended to prevent catastrophic failure when the scale of the target dataset differs significantly from that of the source dataset. Additional training setup details are provided in Appendix B.2. For each dataset, we compare our results with 5 representative entries reported in the literature for that dataset, according to the customary metrics specific to each dataset (M4, fred, M3: sMAPE; tourism: MAPE; electricity, traffic: ND). Our main results appear in Table 1 and more details are provided in Appendix E. In the zero-shot forecasting regime, N-BEATS consistently outperforms most statistical models tailored to these datasets. N-BEATS trained on fred and applied in the zero-shot regime to M4 outperforms the best statistical model selected for its performance on M4 and is on par with the competition's second entry (boosted trees). On M3 and tourism the zero-shot forecasting performance of N-BEATS is better than that of the M3 winner, Theta (Assimakopoulos and Nikolopoulos, 2000). On electricity and traffic N-BEATS performs close to or better than other neural models trained on these datasets. The results overall suggest that a neural model (N-BEATS) is able to extract general knowledge about the TS forecasting task and then successfully adapt it to forecast on unseen TS. We believe our study presents the first example of successfully applying a neural model to solve the zero-shot TS forecasting problem.
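The input/output scaling step can be sketched as a thin wrapper around any forecasting function. The wrapper name `scaled_forecast` and the fallback to a unit scale when the window maximum is not positive are assumptions made here; the paper only states that inputs are divided, and outputs multiplied, by the max value of the input window.

```python
import numpy as np

def scaled_forecast(model, window, horizon):
    """Divide the input window by its max, forecast, then multiply back."""
    window = np.asarray(window, dtype=float)
    scale = float(np.max(window))
    # Assumption: fall back to a unit scale if the max is zero or negative.
    if not scale > 0.0:
        scale = 1.0
    return scale * np.asarray(model(window / scale, horizon), dtype=float)
```

With this wrapper, multiplying a positive input window by a constant multiplies the forecast by the same constant, even when the wrapped model is nonlinear, which is what protects a model trained on one scale from failing on a target dataset with a very different scale.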
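Figure 1: Zero-shot forecasting performance of N-BEATS trained on M4 and applied to the M3 (left) and tourism (right) target datasets, as a function of the number of blocks. Each plot shows results with (blue) and without (red) weight sharing across blocks. Block width is 1024. The mean performance and one standard deviation interval (computed using ensemble bootstrap) are shown.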
Expressive power. Remark 3.1 on expressive power implies that N-BEATS internally generates a sequence of parameters that dynamically extend the expressive power of the architecture with each newly added block, even if block parameters are shared. To validate this hypothesis, we performed an experiment studying the zero-shot forecasting performance of N-BEATS with an increasing number of blocks, with and without parameter sharing. The architecture was trained on M4 and the performance was measured on the target datasets M3 and tourism. The results are presented in Fig. 1 (the extended set of results for all datasets, using fred as a source dataset, a few metrics and varying layer width, is provided in Appendix F and further reinforces the findings in Fig. 1). On the two datasets and for the shared-weights configuration, we consistently see performance improvement when the number of blocks increases up to about 30 blocks. In the same scenario, increasing the number of blocks beyond 30 leads to a small but consistent deterioration in performance, an effect similar to overfitting. Recall that increasing the number of blocks with sharing does not increase the number of meta-parameters; only the sequence of task-specific parameters is extended on-the-fly. In our view, this provides evidence supporting the meta-learning interpretation of N-BEATS, with a simple interpretation of this phenomenon as overfitting in the inner loop of meta-learning. It is not clear otherwise how to explain the generalization dynamics in Fig. 1.
Additionally, the performance improvement due to meta-learning alone (shared weights, multiple blocks vs. a single block) is 12.60 to 12.44 (1.2%) and 20.40 to 18.82 (7.8%) for M3 and tourism, respectively (see Fig. 1). The performance improvement due to meta-learning and unique weights (unique weights, multiple blocks vs. a single block) is 12.60 to 12.40 (1.6%) and 20.40 to 18.91 (7.4%). Clearly, the majority of the gain is due to the meta-learning alone. The introduction of unique block weights sometimes results in marginal gain, but often leads to a loss (see more results in Appendix F). This has a clear implication for reducing the memory footprint of neural networks.
It is interesting to note the scale of the improvement. On the tourism dataset (see Fig. 1, right), the zero-shot error of 1 block (MAPE 20.40) is slightly better than that of the out-of-the-box models ETS and Theta (MAPE 20.88). As the number of blocks grows, the error drops to a MAPE of 18.80, outperforming the statistical method of LeeCBaker (tourism competition winner, hand-crafted specifically for tourism (Athanasopoulos et al., 2011)). For the M3 target dataset (see Fig. 1, left) we can see that the generalization performance with one N-BEATS block (sMAPE 12.60) is slightly better than that of the best known statistical model (EXP method, sMAPE 12.71). Increasing the number of blocks closes the generalization gap between the zero-shot performance (sMAPE 12.40) and the regular N-BEATS trained on M3 (sMAPE 12.37).
In this section, we presented empirical evidence that neural networks are able to provide high-quality zero-shot forecasts on unseen TS. We further empirically supported the hypothesis that meta-learning adaptation mechanisms identified within N-BEATS in Section 3 are instrumental in achieving impressive zero-shot forecasting accuracy results. Our results provide positive evidence to stimulate research on (i) addressing the cold start problem in neural TS forecasting and (ii) designing memory-efficient neural networks.
6 Related Work
From a high-level perspective, there are many links with classical TS modeling: a human-specified classical model is typically designed to generalize well on unseen TS, while we propose to automate that process. The classical models include exponential smoothing with and without seasonal effects (Holt, 1957, 2004; Winters, 1960), multi-trace exponential smoothing approaches, e.g. Theta and its variants (Assimakopoulos and Nikolopoulos, 2000; Fiorucci et al., 2016; Spiliotis et al., 2019). Finally, the state space modeling approach encapsulates most of the above in addition to auto-ARIMA and GARCH (Engle, 1982) (see Hyndman and Khandakar (2008) for an overview). The state-space approach has also been underlying significant amounts of research in the neural TS modeling (Salinas et al., 2019; Wang et al., 2019; Rangapuram et al., 2018). However, those models have not been considered in the zero-shot scenario. In this work we focus on studying the importance of meta-learning for successful zero-shot forecasting. The foundations of meta-learning have been developed by Schmidhuber (1987); Bengio et al. (1991). More recently, meta-learning research has been expanding, mostly outside of the TS forecasting domain (Ravi and Larochelle, 2016; Finn et al., 2017; Snell et al., 2017; Vinyals et al., 2016; Rusu et al., 2019). In the TS domain, meta-learning has manifested itself via neural models trained over a collection of TS (Smyl, 2020; Oreshkin et al., 2020) or via a model trained to predict weights combining outputs of several classical forecasting algorithms (Montero-Manso et al., 2020). Successful application of a neural TS forecasting model trained on a source dataset and fine-tuned on the target dataset was demonstrated by Hooshmand and Sharma (2019); Ribeiro et al. (2018) as well as in the context of TS classification by Fawaz et al. (2018). Unlike those, we focus on the zero-shot scenario and address the cold start problem.
Zero-shot transfer learning. We propose a broad meta-learning framework and explain meta-learning mechanisms facilitating zero-shot forecasting. Our results show that neural networks are able to extract generic knowledge about forecasting and apply it to solve the zero-shot forecasting problem. Residual architectures in general are covered by the analysis presented in Section 3. The results of this study may thus help explain some of the success of residual architectures. The extensions to validate this hypothesis are subject to future work.
Memory efficiency. Our analysis suggests that the network is producing, on the fly, compact task-specific parameters via residual connections. This makes sharing weights across residual blocks effective, resulting in neural networks with a reduced memory footprint and comparable statistical performance.
- The theta model: a decomposition approach to forecasting. International Journal of Forecasting 16 (4), pp. 521–530. Cited by: §D.3, §E.3, §5, §6.
- The tourism forecasting competition. International Journal of Forecasting 27 (3), pp. 822–844. Cited by: Appendix A, §E.3, §E.4, §1, §4, §5.
- The value of feedback in forecasting competitions. International Journal of Forecasting 27 (3), pp. 845–849. Cited by: §E.4.
- Winning methods for forecasting tourism time series. International Journal of Forecasting 27 (3), pp. 850–852. Cited by: §E.4.
- On the optimization of a synaptic learning rule. In Optimality in Artificial and Biological Neural Networks, Cited by: §1.1.
- Learning a synaptic learning rule. In Proceedings of the International Joint Conference on Neural Networks, Seattle, USA, pp. II–A969. Cited by: §1.1, §6.
- Bagging exponential smoothing methods using STL decomposition and Box–Cox transformation. International Journal of Forecasting 32 (2), pp. 303–312. Cited by: Table 7.
- Retail store scheduling for profit. European Journal of Operational Research 239 (3), pp. 609 – 624. Cited by: §1.
- UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: §D.5, §4.
- . Econometrica 50 (4), pp. 987–1007. Cited by: §1, §6.
- Transfer learning for time series classification. 2018 IEEE International Conference on Big Data (Big Data). Cited by: §6.
- FRED economic data. Note: Data retrieved from https://fred.stlouisfed.org/ Accessed: 2019-11-01 Cited by: §D.2, §4.
- Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135. Cited by: §2.1, §6.
- Models for optimising the theta method and their relationship to state space models. International Journal of Forecasting 32 (4), pp. 1151–1161. Cited by: §D.3, §E.3, Table 7, §6.
- DeepAR: probabilistic forecasting with autoregressive recurrent networks. CoRR abs/1704.04110. Cited by: Appendix A, §D.5, §E.5, §E.6, Table 10, Table 9.
- Machine learning applications for data center optimization. Technical report, Google. Cited by: §1.
- Theoria motus corporum coelestium in sectionibus conicis solem ambientium. Frid. Perthes and I. H. Besser, Hamburg. Cited by: §1.
- Deep sparse rectifier neural networks. In AISTATS’2011, Cited by: §1.1.
- The formation of learning sets. Psychological Review 56 (1), pp. 51–65. Cited by: §1.1.
- Forecasting trends and seasonals by exponentially weighted averages. Technical Report ONR memorandum no. 5, Carnegie Institute of Technology, Pittsburgh, PA. Cited by: §1, §6.
- Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting 20 (1), pp. 5–10. Cited by: §6.
- Energy predictive models with limited data using transfer learning. In Proceedings of the Tenth ACM International Conference on Future Energy Systems, e-Energy’19, pp. 12–16. Cited by: §1, §6.
- Automatic time series forecasting: the forecast package for R. Journal of Statistical Software 26 (3), pp. 1–22. Cited by: §E.2, §6.
- Another look at measures of forecast accuracy. International Journal of Forecasting 22 (4), pp. 679–688. Cited by: Appendix A.
- Modeling and forecasting call center arrivals: a literature survey and a case study. International Journal of Forecasting 32 (3), pp. 865–874. Cited by: §1.
- Building machines that learn and think like people. Behavioral and Brain Sciences 40, pp. e253. Cited by: §1.1.
- Retail sales force scheduling based on store traffic forecasting. Journal of Retailing 74 (1), pp. 61–88. Cited by: §1.
- Gradient-based meta-learning with learned layerwise metric and subspace. In ICML, pp. 2933–2942. Cited by: §2.1.
- Neural networks in supply chain management. In Proceedings for Operating Research and the Management Sciences, pp. 347–352. Cited by: §1.
- Meta-sgd: learning to learn quickly for few shot learning. CoRR abs/1707.09835. Cited by: §2.1.
- M4 competitor’s guide: prizes and rules. Cited by: Appendix A, §D.1, §4.
- Integrating indigenous knowledge with scientific seasonal forecasts for climate risk management in Lushoto district in Tanzania. Technical Report CCAFS Working Paper No. 103, CGIAR research program on climate change, agriculture and food security. Cited by: §1.
- Statistical and machine learning forecasting methods: concerns and ways forward. PLoS ONE 13 (3). Cited by: §1.
- The accuracy of extrapolation (time series) methods: results of a forecasting competition. Journal of forecasting 1 (2), pp. 111–153. Cited by: §1.
- The m2-competition: a real-time judgmentally based forecasting study. International Journal of Forecasting 9 (1), pp. 5–22. Cited by: §1.
- The M3-Competition: results, conclusions and implications. International Journal of Forecasting 16 (4), pp. 451–476. Cited by: Appendix A, §D.3, §E.3, §1, §4.
- The M4-Competition: results, findings, conclusion and way forward. International Journal of Forecasting 34 (4), pp. 802–808. Cited by: Appendix A, §1.1, §1.
- FFORMA: feature-based forecast model averaging. International Journal of Forecasting 36 (1), pp. 86–92. Cited by: §E.1, §6.
- Rectified linear units improve restricted boltzmann machines. In ICML, pp. 807–814. Cited by: §1.1.
- Weather forecasting: magic, art, science and hypnosis. Weather and Climate 5 (1), pp. 2–5. Cited by: §1.
- N-BEATS: neural basis expansion analysis for interpretable time series forecasting. In ICLR, Cited by: Figure 2, §B.1, §B.2, Table 7, Appendix E, §1.1, §1, §5, §6.
- TADAM: task dependent adaptive metric for improved few-shot learning. In NeurIPS, pp. 721–731. Cited by: §2.1.
- FiLM: visual reasoning with a general conditioning layer. In AAAI, Cited by: §2.1.
- The matrix cookbook. Technical University of Denmark. Note: Version 20121115 Cited by: §3.3.
- Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. Cited by: §1, §2.1, Remark 3.2.
- Deep state space models for time series forecasting. In NeurIPS, Cited by: Appendix A, §D.5, §E.5, §E.6, §6.
- Optimization as a model for few-shot learning. In ICLR, Cited by: §1.1, §2.1, §6.
- Transfer learning with seasonal and trend adjustment for cross-building energy forecasting. Energy and Buildings 165, pp. 352–363. Cited by: §1, §6.
- MEXICAN crop observation, management and production analysis services system — COMPASS. In Poster Proceedings of the 12th European Conference on Precision Agriculture, pp. . Cited by: §1.
- Meta-learning with latent embedding optimization. In ICLR, Cited by: §2.1, §6.
- DeepAR: probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting. Cited by: §6.
- Evolutionary principles in self-referential learning. Master’s Thesis, Institut f. Informatik, Tech. Univ. Munich. Cited by: §1.1, §6.
- Financial time series forecasting with deep learning: a systematic literature review: 2005-2019. Cited by: §1.
- Decentralized flood forecasting using deep neural networks. arXiv preprint arXiv:1902.02308. Cited by: §1.
- Data preprocessing and augmentation for multiple short time series forecasting with recurrent neural networks. In 36th International Symposium on Forecasting, Cited by: Table 7.
- A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting 36 (1), pp. 75 – 85. Cited by: §E.1, §1, §6.
- Prototypical networks for few-shot learning. In NIPS, pp. 4080–4090. Cited by: §2.1, §6.
- Forecasting with a hybrid method utilizing data smoothing, a variation of the theta method and shrinkage of seasonal factors. International Journal of Production Economics 209, pp. 92–102. Cited by: §D.3, §E.3, Table 7, §6.
- On the categorization of demand patterns. Journal of the Operational Research Society 56 (5), pp. 495–503. Cited by: §D.1, §D.3.
- Multiple futures prediction. In NeurIPS 32, pp. 15398–15408. Cited by: §1.
- Python Client for FRED API. GitHub, Data Science division of the National Association of REALTORS. Cited by: §D.2.
- Matching networks for one shot learning. In NIPS, pp. 3630–3638. Cited by: §2.1, §6.
- On periodicity in series of related terms. Proc. R. Soc. Lond. A 131, pp. 518–532. Cited by: §1.
- Deep factors for forecasting. In ICML, Cited by: Appendix A, §D.5, §E.5, §E.6, §6.
- Forecasting sales by exponentially weighted moving averages. Management Science 6 (3), pp. 324–342. Cited by: §1, §6.
- Temporal regularized matrix factorization for high-dimensional time series prediction. In NIPS, Cited by: Appendix A, §D.5, §D.5, §D.5, §E.5, §E.6, §4.
- On a method of investigating periodicities in disturbed series, with special reference to Wolfer’s sunspot numbers. Phil. Trans. the R. Soc. Lond. A 226, pp. 267–298. Cited by: §1.
Appendix A TS Forecasting Metrics
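The following metrics are standard scale-free metrics in the practice of forecasting performance evaluation (Hyndman and Koehler, 2006; Makridakis and Hibon, 2000; Makridakis et al., 2018b; Athanasopoulos et al., 2011): $\mathrm{MAPE}$ (Mean Absolute Percentage Error), $\mathrm{sMAPE}$ (symmetric $\mathrm{MAPE}$) and $\mathrm{MASE}$ (Mean Absolute Scaled Error). Whereas $\mathrm{sMAPE}$ scales the error by the average between the forecast and ground truth, the $\mathrm{MASE}$ scales by the average error of the naïve predictor that simply copies the observation measured $m$ periods in the past, thereby accounting for seasonality. Here $m$ is the periodicity of the data (e.g., 12 for monthly series). $\mathrm{OWA}$ (overall weighted average) is an M4-specific metric used to rank competition entries (M4 Team, 2018), where $\mathrm{sMAPE}$ and $\mathrm{MASE}$ metrics are normalized such that a seasonally-adjusted naïve forecast obtains $\mathrm{OWA} = 1.0$. Normalized Deviation, $\mathrm{ND}$, being a less standard metric in the traditional TS forecasting literature, is nevertheless quite popular in the machine learning TS forecasting papers (Yu et al., 2016; Flunkert et al., 2017; Wang et al., 2019; Rangapuram et al., 2018). With $T$ the length of the observed history, $H$ the forecast horizon, $y$ the ground truth and $\hat{y}$ the forecast:

$$\mathrm{sMAPE} = \frac{200}{H}\sum_{i=1}^{H}\frac{|y_{T+i}-\hat{y}_{T+i}|}{|y_{T+i}|+|\hat{y}_{T+i}|}, \qquad \mathrm{MAPE} = \frac{100}{H}\sum_{i=1}^{H}\frac{|y_{T+i}-\hat{y}_{T+i}|}{|y_{T+i}|},$$

$$\mathrm{MASE} = \frac{1}{H}\sum_{i=1}^{H}\frac{|y_{T+i}-\hat{y}_{T+i}|}{\frac{1}{T+H-m}\sum_{j=m+1}^{T+H}|y_{j}-y_{j-m}|}, \qquad \mathrm{OWA} = \frac{1}{2}\left[\frac{\mathrm{sMAPE}}{\mathrm{sMAPE}_{\text{Naïve2}}}+\frac{\mathrm{MASE}}{\mathrm{MASE}_{\text{Naïve2}}}\right],$$

$$\mathrm{ND} = \frac{\sum_{i,ts}|y_{T+i,ts}-\hat{y}_{T+i,ts}|}{\sum_{i,ts}|y_{T+i,ts}|}.$$

Here in the last equation, $y_{T+i,ts}$ refers to a sample $T+i$ from TS with index $ts$ and the sum is running over all TS indices and TS samples.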
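The metrics named in this appendix can be sketched in a few lines of plain Python. This is an illustrative sketch, not the evaluation code used for the reported results: the function names are hypothetical, the percentage scaling (100 and 200) follows the common M4 convention, and the span of the series used for the MASE scaling term is left to the caller through the assumed `scale_series` argument.

```python
def mape(y_true, y_pred):
    """Mean Absolute Percentage Error over one forecast horizon, in percent."""
    h = len(y_true)
    return 100.0 / h * sum(abs(y - f) / abs(y) for y, f in zip(y_true, y_pred))


def smape(y_true, y_pred):
    """Symmetric MAPE: the error is scaled by the mean of |truth| and |forecast|."""
    h = len(y_true)
    return 200.0 / h * sum(
        abs(y - f) / (abs(y) + abs(f)) for y, f in zip(y_true, y_pred)
    )


def mase(y_true, y_pred, scale_series, m):
    """MASE: forecast error scaled by the mean error of the seasonal naive
    predictor (copy the value observed m periods earlier) on scale_series."""
    naive_errors = [
        abs(scale_series[j] - scale_series[j - m])
        for j in range(m, len(scale_series))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    h = len(y_true)
    return sum(abs(y - f) for y, f in zip(y_true, y_pred)) / h / scale


def owa(smape_value, mase_value, smape_naive2, mase_naive2):
    """Overall weighted average: equals 1.0 for the reference naive forecast."""
    return 0.5 * (smape_value / smape_naive2 + mase_value / mase_naive2)


def nd(y_true_all, y_pred_all):
    """Normalized Deviation, pooled over all series and all forecast samples."""
    num = sum(
        abs(y - f)
        for ys, fs in zip(y_true_all, y_pred_all)
        for y, f in zip(ys, fs)
    )
    den = sum(abs(y) for ys in y_true_all for y in ys)
    return num / den
```

Note that ND pools numerator and denominator across series before dividing, so high-volume series dominate the score, whereas sMAPE and MASE are computed per series and then averaged.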
Appendix B N-BEATS Details
B.1 Architecture Details
N-BEATS (Oreshkin et al., 2020) has a hierarchical structure consisting of multiple stacks, depicted in Figure 2, reproduced from Figure 1 in (Oreshkin et al., 2020) with permission. Each stack internally consists of multiple blocks. The stacks are chained, whereas blocks within a stack are connected using a doubly residual architecture.
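The doubly residual connection pattern can be illustrated with a small sketch. It assumes a toy block with a single ReLU hidden layer and random weights, which is much simpler than the actual N-BEATS block; `make_block`, `block_forward`, `build_blocks` and the `share_weights` flag are hypothetical names introduced only for illustration. Each block subtracts its backcast from the signal passed to the next block, and the partial forecasts of all blocks are summed.

```python
import random


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def make_block(input_len, hidden, horizon, rng):
    """Create toy block weights: hidden layer, backcast head, forecast head."""
    def rand(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "w_h": rand(hidden, input_len),
        "w_b": rand(input_len, hidden),
        "w_f": rand(horizon, hidden),
    }


def block_forward(block, x):
    """Return (backcast, partial forecast) for one block."""
    h = [max(0.0, v) for v in matvec(block["w_h"], x)]  # ReLU hidden layer
    return matvec(block["w_b"], h), matvec(block["w_f"], h)


def build_blocks(n_blocks, input_len, hidden, horizon, share_weights, seed=0):
    """With share_weights=True, every block reuses the same parameters."""
    rng = random.Random(seed)
    if share_weights:
        shared = make_block(input_len, hidden, horizon, rng)
        return [shared] * n_blocks
    return [make_block(input_len, hidden, horizon, rng) for _ in range(n_blocks)]


def doubly_residual_forward(blocks, x):
    """Run the chain: subtract backcasts from the input, sum partial forecasts."""
    residual = list(x)
    forecast = [0.0] * len(blocks[0]["w_f"])
    for block in blocks:
        backcast, partial = block_forward(block, residual)
        residual = [r - b for r, b in zip(residual, backcast)]
        forecast = [f + p for f, p in zip(forecast, partial)]
    return forecast, residual
```

With `share_weights=True` the parameter count is that of a single block regardless of depth, while each block still sees a different residual input.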
B.2 Training details
We use the same overall training framework as defined by Oreshkin et al. (2020), including the stratified uniform sampling of TS in the source dataset to train the model. One model is trained per frequency split of a dataset (e.g. Yearly, Quarterly, Monthly, Weekly, Daily and Hourly frequencies in the M4 dataset). All reported accuracy results are based on an ensemble of 30 models (5 different initializations with 6 different lookback periods). One aspect that we found important in the zero-shot regime, and that differs from the original training setup, is the scaling/descaling of the input/output. We scale the architecture input by dividing all input values by the max value of the input window, and descale the output by multiplying all output values by the same max value. We found that this does not affect the accuracy of the model trained and tested on the same dataset in a statistically significant way. In the zero-shot regime, this operation prevents catastrophic failure when the target dataset scale (marginal distribution) is significantly different from that of the source dataset.
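The scaling/descaling step can be sketched as a thin wrapper around any forecasting function. This is a minimal illustration under stated assumptions: `model` is a hypothetical callable mapping a scaled input window to a scaled forecast, and the window maximum is assumed to be strictly positive (which holds for the strictly positive series described in the dataset appendix).

```python
def scale_window(window):
    """Divide the input window by its maximum value; return the scale too."""
    scale = max(window)  # assumed > 0, i.e. strictly positive series
    return [v / scale for v in window], scale


def forecast_with_scaling(model, window):
    """Scale the input, run the model, then multiply the output back."""
    scaled, scale = scale_window(window)
    return [v * scale for v in model(scaled)]
```

Because the model only ever sees windows whose maximum is 1, a series measured in thousands and one measured in fractions look alike to it, which is the property that matters when the source and target datasets have very different scales.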
Most of the time, the model trained on a given frequency split of a source dataset is used to forecast the same frequency split on the target dataset. There are a few exceptions to this rule. First, when transferring from M4 to M3, the Others split of M3 is forecasted with the model trained on the Quarterly split of M4. This is because (i) the default horizon length of M4 Quarterly is 8, the same as that of M3 Others, and (ii) M4 Others is heterogeneous and contains Weekly, Daily and Hourly data with horizon lengths 13, 14 and 48. The M4 Quarterly to M3 Others transfer was therefore easier to implement from the coding standpoint. Second, the transfer from M4 to the electricity and traffic datasets is done based on a model trained on M4 Hourly. This is because electricity and traffic contain hourly time-series with obvious 24-hour seasonality patterns. It is worth noting that M4 Hourly only contains 414 time-series, and we can clearly see positive zero-shot transfer in Table 1 from the model trained on this rather small dataset. Third, the transfer from fred to electricity and traffic is done by training the model on the fred Monthly split, double upsampled using bi-linear interpolation. This is because fred does not have hourly data. Monthly data naturally provide patterns with seasonality period 12. Upsampling with a factor of two and bi-linear interpolation provide data with natural seasonality period 24, most often observed in Hourly data, such as electricity and traffic.
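The effect of the factor-of-two upsampling on the seasonality period can be checked with a short sketch. The exact interpolation routine used for the experiments is not specified in the text; the version below assumes plain linear interpolation between neighbouring points of a one-dimensional series, and `upsample_linear` is a hypothetical helper name.

```python
def upsample_linear(series, factor=2):
    """Insert factor - 1 linearly interpolated points between each pair of
    neighbours; n input points yield factor * (n - 1) + 1 output points."""
    out = []
    for left, right in zip(series, series[1:]):
        for k in range(factor):
            out.append(left + (right - left) * k / factor)
    out.append(series[-1])
    return out
```

A monthly pattern that repeats every 12 points repeats every 24 points after upsampling, since original index `i` lands at index `2 * i` in the output.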
Appendix C Meta-learning Analysis Details
C.1 Factors Enabling Meta-learning
Let us now analyze the factors that enable the meta-learning inner loop obvious in (10). First, and most straightforward, it is not viable without having multiple blocks connected via the backcast residual connection: ). Second, the meta-learning inner loop is viable when is non-linear: the update of is extracted from the curvature of at the point dictated by the input and the sequence of shifts . Indeed, suppose is linear, let’s say . The Jacobian becomes a constant, . Equation (8) simplifies as (note that for linear , (8) is exact):
Therefore, may be replaced with an equivalent that is not data adaptive.
Interestingly, happens to be a truncated Neumann series. Denoting Moore-Penrose pseudo-inverse as , assuming boundedness of and completing the series, , results in . Therefore, under certain conditions, the N-BEATS architecture with linear and infinite number of blocks can be interpreted as a linear predictor of a signal in colored noise. Here the part cleans the intermediate space created by projection from the components that are undesired for forecasting and creates the forecast based on the initial projection after it is “sanitized” by .
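The Neumann series mentioned above can be illustrated numerically. The sketch below is generic linear algebra rather than the paper's derivation: it uses a hypothetical 2×2 matrix `A` as a stand-in for the operator in the text, and assumes the spectral radius of `I - A` is below one, in which case the partial sums of `(I - A)^k` converge to the inverse of `A`. The number of terms plays the role of the number of blocks.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def neumann_partial_sum(a, n_terms):
    """Return sum_{k=0}^{n_terms-1} (I - A)^k, a truncated Neumann series."""
    n = len(a)
    eye = identity(n)
    step = [[eye[i][j] - a[i][j] for j in range(n)] for i in range(n)]  # I - A
    total = [[0.0] * n for _ in range(n)]
    power = eye  # (I - A)^0
    for _ in range(n_terms):
        total = matadd(total, power)
        power = matmul(power, step)
    return total
```

For example, with `A = [[1.0, 0.2], [0.1, 0.8]]` a handful of terms already gives a rough approximation of the inverse, and the approximation tightens as terms are added.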
Appendix D Dataset Details
D.1 M4 Dataset Details
Table 2 outlines the composition of the M4 dataset across domains and forecast horizons by listing the number of TS based on their frequency and type (M4 Team, 2018). The M4 dataset is large and diverse: all forecast horizons are composed of heterogeneous TS types (with the exception of Hourly) frequently encountered in business, financial and economic forecasting. Summary statistics on series lengths are also listed, showing wide variability therein, as well as a characterization (smooth vs erratic) that follows Syntetos et al. (2005) and is based on the squared coefficient of variation of the series. All series have positive observed values at all time-steps; as such, none can be considered intermittent or lumpy per Syntetos et al. (2005).
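The smooth vs erratic characterization can be sketched as follows. This is an illustration under assumptions, not the script used to build the tables: the cut-off of 0.49 on the squared coefficient of variation is the value commonly quoted from Syntetos et al. (2005), the population variance is used, and the function names are hypothetical. Since all series are strictly positive, only the variability criterion matters here.

```python
def squared_cv(series):
    """Squared coefficient of variation: variance divided by squared mean."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((v - mean) ** 2 for v in series) / n  # population variance
    return variance / mean ** 2


def characterize(series, cv2_threshold=0.49):
    """Label a strictly positive series as 'smooth' or 'erratic'."""
    return "smooth" if squared_cv(series) < cv2_threshold else "erratic"
```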
D.2 fred Dataset Details
fred is a large-scale dataset introduced in this paper containing around 290k US and international economic TS from 89 sources, a subset of Federal Reserve economic data (Federal Reserve Bank of St. Louis, 2019). fred is downloaded using a custom download script based on the high-level FRED Python API (Velkoski, 2016), which is a Python wrapper over the low-level web-based FRED API. For each point in a time-series, the raw data published at the time of first release are downloaded. All time series with any NaN entries have been filtered out. We focus our attention on Yearly, Quarterly, Monthly, Weekly and Daily frequency data. Other frequencies are available, for example bi-weekly and five-yearly; they are skipped because they are present only in small quantities. These factors explain the fact that the size of the dataset we assembled for this study is 290k, while 672k total time-series are in principle available (Federal Reserve Bank of St. Louis, 2019). Hourly data are not available in this dataset. For the data frequencies included in the fred dataset, we use the same forecasting horizons as for the M4 dataset: Yearly: 6, Quarterly: 8, Monthly: 18, Weekly: 13 and Daily: 14. The dataset download takes approximately 7-10 days because of the bandwidth constraints imposed by the low-level FRED API. The test, validation and train subsets are defined in the usual way. The test set is derived by splitting the full fred dataset at the left boundary of the last horizon of each time series. Similarly, the validation set is derived from the penultimate horizon of each time series.
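The train/validation/test split described above can be sketched per series. This is a minimal illustration assuming each series is a plain Python list long enough to hold two horizons plus some history; `HORIZONS` and `split_series` are hypothetical names, and the horizon values are those listed in the text.

```python
# Forecast horizons per frequency, as listed in the text.
HORIZONS = {"Yearly": 6, "Quarterly": 8, "Monthly": 18, "Weekly": 13, "Daily": 14}


def split_series(series, horizon):
    """Split one series into (train, validation, test).

    The test part is the last horizon, the validation part is the
    penultimate horizon, and everything before that is training history.
    """
    if len(series) <= 2 * horizon:
        raise ValueError("series too short for two horizons plus history")
    test = series[-horizon:]
    validation = series[-2 * horizon:-horizon]
    train = series[:-2 * horizon]
    return train, validation, test
```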
D.3 M3 Dataset Details
Table 3 outlines the composition of the M3 dataset across domains and forecast horizons by listing the number of TS based on their frequency and type (Makridakis and Hibon, 2000). The M3 is smaller than the M4, but it is still large and diverse: all forecast horizons are composed of heterogeneous TS types frequently encountered in business, financial and economic forecasting. Over the past 20 years, this dataset has supported significant efforts in the design of advanced statistical models, e.g. Theta and its variants (Assimakopoulos and Nikolopoulos, 2000; Fiorucci et al., 2016; Spiliotis et al., 2019). Summary statistics on series lengths are also listed, showing wide variability in length, as well as a characterization (smooth vs erratic) that follows Syntetos et al. (2005), and is based on the squared coefficient of variation of the series. All series have positive observed values at all time-steps; as such, none can be considered intermittent or lumpy per Syntetos et al. (2005).
D.4 tourism Dataset Details
Table 4 outlines the composition of the tourism dataset across forecast horizons by listing the number of TS based on their frequency. Summary statistics on series lengths are listed, showing wide variability in length. All series have positive observed values at all time-steps. In contrast to M4 and M3 datasets, tourism includes a much higher fraction of erratic series.
D.5 electricity and traffic Dataset Details
electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) and traffic (https://archive.ics.uci.edu/ml/datasets/PEMS-SF) datasets (Dua and Graff, 2017; Yu et al., 2016) are both part of the UCI repository. electricity represents the hourly electricity usage monitoring of 370 customers over three years. The traffic dataset tracks the hourly occupancy rates, scaled to the (0,1) range, of 963 lanes in the San Francisco Bay Area freeways over a period of slightly more than a year. Both datasets exhibit strong hourly and daily seasonality patterns.
Both datasets are aggregated to hourly data, but using different aggregation operations: sum for electricity and mean for traffic. The hourly aggregation is done so that all the points available in $(h-1{:}00, h{:}00]$ hours are aggregated to hour $h$; thus, if the original dataset starts on 2011-01-01 00:15, then the first time point after aggregation will be 2011-01-01 01:00. For the electricity dataset we removed the first year from the training set, to match the training set used in (Yu et al., 2016), based on the aggregated dataset downloaded from what is presumably the authors’ GitHub repository (https://github.com/rofuyu/exp-trmf-nips16/blob/master/python/exp-scripts/datasets/download-data.sh). We also made sure that data points for both electricity and traffic datasets after aggregation match those used in (Yu et al., 2016). The authors of the MatFact model used the last 7 days of the datasets as the test set, but the papers on DeepAR (Flunkert et al., 2017), Deep State (Rangapuram et al., 2018) and Deep Factors (Wang et al., 2019) from Amazon use different splits, where the split points are provided by a date. Changing split points without a well-grounded reason adds uncertainty to the comparability of the models’ performances and creates challenges for the reproducibility of the results; thus we tried to match all the different splits in our experiments. This was especially challenging on the traffic dataset, where we had to use some heuristics to find record dates; the dataset authors state: “The measurements cover the period from Jan. 1st 2008 to Mar. 30th 2009” and “We remove public holidays from the dataset, as well as two days with anomalies (March 8th 2009 and March 9th 2008) where all sensors were muted between 2:00 and 3:00 AM.” In spite of this, we failed to match a part of the provided labels of week days to actual dates. Therefore, we had to assume that the actual list of gaps, which include holidays and anomalous days, is as follows:
Jan. 1, 2008 (New Year’s Day)
Jan. 21, 2008 (Martin Luther King Jr. Day)
Feb. 18, 2008 (Washington’s Birthday)
Mar. 9, 2008 (Anomaly day)
May 26, 2008 (Memorial Day)
Jul. 4, 2008 (Independence Day)
Sep. 1, 2008 (Labor Day)
Oct. 13, 2008 (Columbus Day)
Nov. 11, 2008 (Veterans Day)
Nov. 27, 2008 (Thanksgiving)
Dec. 25, 2008 (Christmas Day)
Jan. 1, 2009 (New Year’s Day)
Jan. 19, 2009 (Martin Luther King Jr. Day)
Feb. 16, 2009 (Washington’s Birthday)
Mar. 8, 2009 (Anomaly day)
The first six gaps were confirmed by the gaps in labels, but the rest were more than one day apart from any public holiday of years 2008 and 2009 in San Francisco, California and the US. Moreover, the number of gaps we found in the labels provided by the dataset authors is 10, while the number of days between Jan. 1st 2008 and Mar. 30th 2009 is 455; assuming that Jan. 1st 2008 was skipped from the values and labels, we should end up with either $454 - 10 = 444$ instead of 440 days or a different end date. The metric used to evaluate performance on the datasets is $\mathrm{ND}$ (Yu et al., 2016), which is equal to the $p50$ loss used in the DeepAR, Deep State, and Deep Factors papers.
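Two of the data-handling details in this subsection, the hourly aggregation convention and the count of calendar days in the traffic measurement period, can be checked with a short standard-library sketch. The helper names are hypothetical, and the bucket convention assumed here is the one implied by the 00:15 to 01:00 example in the text: a reading is assigned to the hour that closes its interval.

```python
from datetime import date, datetime, timedelta


def hour_bucket(ts):
    """Return the hour label that closes the interval containing ts.

    A reading at exactly h:00 belongs to hour h; any later reading within
    the following hour belongs to hour h + 1.
    """
    floor = ts.replace(minute=0, second=0, microsecond=0)
    return floor if ts == floor else floor + timedelta(hours=1)


def aggregate_hourly(records, how):
    """Aggregate (timestamp, value) pairs per hour bucket with sum or mean."""
    buckets = {}
    for ts, value in records:
        buckets.setdefault(hour_bucket(ts), []).append(value)
    if how == "sum":  # used for electricity in the text
        return {h: sum(v) for h, v in sorted(buckets.items())}
    return {h: sum(v) / len(v) for h, v in sorted(buckets.items())}  # mean, traffic


def inclusive_days(start, end):
    """Number of calendar days from start to end, both included."""
    return (end - start).days + 1
```

For instance, `inclusive_days(date(2008, 1, 1), date(2009, 3, 30))` evaluates to 455, matching the day count quoted above.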
Appendix E Empirical Results Details
On all datasets, we consider the original N-BEATS (Oreshkin et al., 2020), i.e. the model trained on a given dataset and applied to this same dataset. This is provided for the purpose of assessing the generalization gap of the zero-shot N-BEATS. We consider four variants of zero-shot N-BEATS: NB-SH-M4, NB-NSH-M4, NB-SH-FR, NB-NSH-FR. The -SH/-NSH option signifies block weight sharing ON/OFF. The -M4/-FR option signifies the M4/fred source dataset.
E.1 Detailed M4 Results
On M4 we compare against five M4 competition entries, each representative of a broad model class. Best pure ML is the submission by B. Trotta, the best entry among the 6 pure ML models. Best statistical is the best pure statistical model by N.Z. Legaki and K. Koutsouri. ProLogistica is a weighted ensemble of statistical methods, the third best M4 participant. Best ML/TS combination is the model by Montero-Manso et al. (2020), the second best entry, a gradient boosted tree over a few statistical time series models. Finally, DL/TS hybrid is the winner of the M4 competition (Smyl, 2020). Results are presented in Table 5.
|Model|Yearly|Quarterly|Monthly|Others|Average|
|Best pure ML|14.397|11.031|13.973|4.566|12.894|
|Best ML/TS combination|13.528|9.733|12.639|4.118|11.720|
|DL/TS hybrid, M4 winner|13.176|9.679|12.126|4.014|11.374|
E.2 Detailed fred Results
We compare against well-established off-the-shelf statistical models available from the R forecast package (Hyndman and Khandakar, 2008). Those include Naïve (repeating the last value), ARIMA, Theta, SES and ETS. The quality metric is the regular $\mathrm{sMAPE}$ defined in (1).
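Two of the simplest baselines in this list can be sketched in plain Python to make their behaviour concrete. This is not the R forecast package implementation: the package estimates the SES smoothing parameter and initial level from the data, whereas the sketch below assumes a fixed `alpha` and initializes the level with the first observation. Function names are hypothetical.

```python
def naive_forecast(history, horizon):
    """Naive baseline: repeat the last observed value over the horizon."""
    return [history[-1]] * horizon


def ses_forecast(history, horizon, alpha=0.5):
    """Simple exponential smoothing with a fixed smoothing parameter.

    The level is an exponentially weighted average of past observations;
    the forecast is flat at the final level.
    """
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return [level] * horizon
```

With `alpha=1.0` the smoothed level always equals the latest observation, so SES reduces to the Naïve forecast.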