
DeepTIMe: Deep Time-Index Meta-Learning for Non-Stationary Time-Series Forecasting

Deep learning has been actively applied to time-series forecasting, leading to a deluge of new autoregressive model architectures. Yet, despite the attractive properties of time-index based models, such as being a continuous signal function over time leading to smooth representations, little attention has been given to them. Indeed, while naive deep time-index based models are far more expressive than the manually predefined function representations of classical time-index based models, they are inadequate for forecasting due to the lack of inductive biases, and the non-stationarity of time-series. In this paper, we propose DeepTIMe, a deep time-index based model trained via a meta-learning formulation which overcomes these limitations, yielding an efficient and accurate forecasting model. Extensive experiments on real world datasets demonstrate that our approach achieves competitive results with state-of-the-art methods, and is highly efficient. Code is available at https://github.com/salesforce/DeepTIMe.



1 Introduction

Time-series forecasting has important applications across business and scientific domains, such as demand forecasting Carbonneau et al. (2008), capacity planning and management Kim (2003), electricity pricing Cuaresma et al. (2004), and anomaly detection Laptev et al. (2017). This has led to interest in developing more powerful methods for forecasting, including the recent increase in attention towards neural forecasting methods Benidis et al. (2020). There are two broad approaches to time-series forecasting – autoregressive models, and time-index based models. Autoregressive models Salinas et al. (2020); Yang et al. (2019) consider the forecast to be a function of the previous values of the time-series, while time-index based models, also known as time-series regression models, map a time-index, and optionally datetime features, to the value of the time-series at that time step. Many recent neural forecasting approaches such as the family of fully connected architectures Challu et al. (2022); Olivares et al. (2021); Oreshkin et al. (2020, 2021), and Transformer-based architectures Zhou et al. (2022); Woo et al. (2022); Xu et al. (2021); Zhou et al. (2021) belong to the autoregressive approach. Yet, while time-index based approaches have the attractive property of being viewed as a continuous signal function over time, leading to signal representations which change smoothly and correlate with each other in continuous space, they have been under-explored from a deep learning perspective. In the following section, we explore the limitations of classical time-index based approaches, and highlight how deep time-index models can overcome these limitations. At the same time, deep time-index models face their own set of limitations. We propose a meta-learning formulation of the standard time-series forecasting problem that can resolve these limitations.

Classical time-index based methods Taylor and Letham (2018); Hyndman and Athanasopoulos (2018); Ord et al. (2017) rely on predefined parametric representation functions, $f(\tau)$, to generate predictions, optionally following a structural time-series model formulation Harvey and Shephard (1993), $y(\tau) = g(\tau) + s(\tau) + h(\tau) + \epsilon_\tau$, where $g(\tau)$, $s(\tau)$, $h(\tau)$, all functions of a time-index, represent trend, periodic, and holiday components respectively, and $\epsilon_\tau$ represents idiosyncratic changes not accounted for by the model. For example, $g(\tau)$ could be predefined as a linear, polynomial, or even a piecewise linear function. While these functions are simple and easy to learn, they have limited capacity and are unable to fit more complex time-series. Furthermore, predefining the representation function is a strong assumption which may not hold across different application domains, as it is only effective when the data distribution follows this predefined function – we train on historical data and expect extrapolation into future time steps to hold. While this may be true for a short horizon, this assumption most likely collapses when dealing with long sequence time-series forecasting. Finally, while it is possible to perform model selection across various representation functions and parameters such as changepoints and seasonality, this requires either strong domain expertise or computationally heavy cross-validation across a large set of parameters.

(a) Without Meta-Learning
(b) With Meta-Learning (DeepTIMe)
Figure 1: (a) A naive deep time-index model. We visualize a reconstruction of the historical training data, as well as the forecasts. As can be seen, while it manages to fit the historical data, it is too expressive, and without any inductive biases, cannot extrapolate. This model corresponds to the (+ Local) variant in Table 3 of our ablations. (b) DeepTIMe, our proposed approach, trained via a meta-learning formulation, successfully learns the appropriate function representation and is able to extrapolate. Visualized here is the last variable of the ETTm2 dataset.

Deep learning gives a natural solution to this problem faced by classical time-index based models – parameterize the representation function as a neural network, and learn it directly from data. Neural networks have been shown to be extremely expressive representation functions with a strong capability to approximate complex functions. However, being too expressive a representation function brings about the first limitation. Time-index based models rely on the assumptions of the representation function to ensure that extrapolation beyond the range of historical training data, into future time steps, yields accurate forecasts. Being such an expressive representation function, deep time-index models have no inductive bias to perform well as a forecasting model. This is shown in Figure 1. The second limitation arises due to the non-stationarity of time-series. Time-series data, especially in long sequence time-series forecasting, are typically non-stationary – their distributions change over time. While the standard supervised learning setting assumes the validation and test set to come from the same distribution as the training set, this may not be the case for non-stationary time-series Kim et al. (2021); Arik et al. (2022), leading to a degradation in predictive performance.

To this end, we propose DeepTIMe, a deep time-index based model, which leverages a meta-learning formulation to: (i) learn from data an appropriate function representation by optimizing an extrapolation loss, only possible due to our meta-learning formulation (only reconstruction loss is available for the standard setting), and (ii) learn a global meta model shared across tasks which performs adaptation on a locally stationary distribution. We further leverage Implicit Neural Representations Sitzmann et al. (2020b); Mildenhall et al. (2020) as our choice of deep time-index models, a random Fourier features layer Tancik et al. (2020) to ensure that we are able to learn high frequency information present in time-series data, and a closed-form ridge regressor Bertinetto et al. (2019) to efficiently tackle the meta-learning formulation. We conduct extensive experiments on both synthetic and real world datasets, showing that DeepTIMe has extremely competitive performance, achieving state-of-the-art results on 20 out of 24 settings for the multivariate forecasting benchmark based on MSE. We perform ablation studies to better understand the contribution of each component of DeepTIMe, and finally show that it is highly efficient in terms of runtime and memory.

2 DeepTIMe

(a) Forecasting as Meta-Learning
(b) DeepTIMe Model Architecture
Figure 2: Illustration of DeepTIMe. A time-series dataset can be split into tasks as given in the problem formulation. For a given task, the lookback window represents the support set, and the forecast horizon represents the query set. $g_\phi$ represents the meta model associated with the meta parameters $\phi$, and is shared between the lookback window and forecast horizon. Inputs to $g_\phi$ are not normalized due to notation constraints in this illustration. The ridge regressor performs the inner loop optimization, while the outer loop optimization is performed over samples from the horizon. As illustrated in Figure 2(b), DeepTIMe has a simple overall architecture, comprising a random Fourier features layer, an MLP, and a ridge regressor.

Problem Formulation

In long sequence time-series forecasting, we consider a time-series dataset $(y_1, \ldots, y_T)$, where $y_t \in \mathbb{R}^m$ is the $m$-dimension observation at time $t$. Given a lookback window of length $L$, $Y_{t-L:t} = [y_{t-L}; \ldots; y_{t-1}]$, we aim to construct a point forecast over a horizon of length $H$, $Y_{t:t+H} = [y_t; \ldots; y_{t+H-1}]$, by learning a model $f: \mathbb{R}^{L \times m} \to \mathbb{R}^{H \times m}$ which minimizes some loss function $\mathcal{L}: \mathbb{R}^{H \times m} \times \mathbb{R}^{H \times m} \to \mathbb{R}$.

In the following, we first describe how to cast the standard time-series forecasting problem as a meta-learning problem for time-index based models, which endows DeepTIMe with the ability to extrapolate over the forecast horizon, i.e. to perform forecasting. We emphasize that this reformulation falls within the standard time-series forecasting problem and requires no extra information. Next, we elaborate on our proposed model architecture, and how it uses a differentiable closed-form ridge regression module to efficiently tackle forecasting as a meta-learning problem. Pseudocode of DeepTIMe is available in Appendix A.

2.1 Forecasting as Meta-Learning

To formulate time-series forecasting as a meta-learning problem, we treat each lookback window and forecast horizon pair as a task. Specifically, the lookback window is treated as the support set, and the forecast horizon as the query set, and each time-index and time-series value pair, $(\tau_j, y_j)$, is an input-output sample, i.e. $\tau_j = j / (L + H - 1)$ for $j = 0, \ldots, L + H - 1$, where $\tau_j$ is a $[0, 1]$-normalized time-index. The forecasting model, $f_{\phi,\theta}$, is then parameterized by $\phi$ and $\theta$, the meta and base parameters respectively, and the bi-level optimization problem can be formalized as:

$$\phi^* = \arg\min_{\phi} \sum_{i} \sum_{j=L}^{L+H-1} \mathcal{L}\big(f_{\phi, \theta_i^*}(\tau_j), \; y^{(i)}_j\big) \qquad (1)$$
$$\text{s.t.} \quad \theta_i^* = \arg\min_{\theta} \sum_{j=0}^{L-1} \mathcal{L}\big(f_{\phi, \theta}(\tau_j), \; y^{(i)}_j\big) \qquad (2)$$

Here, the outer summation in Equation 1 over index $i$ represents each lookback-horizon window, corresponding to each task in meta-learning, and the inner summation over index $j$ represents each sample in the query set, or equivalently, each time step in the forecast horizon. The summation in Equation 2 over index $j$ represents each sample in the support set, or each time step in the lookback window. This is illustrated in Figure 2(a).

To understand how our meta-learning formulation helps to learn an appropriate function representation from data, we examine how the meta-learning process performs a restriction on the hypothesis class of the model $f$. The original hypothesis class of our model, or function representation, $\mathcal{F}$, is too large and provides no guarantee that training on the lookback window leads to good extrapolation. The meta-learning formulation allows DeepTIMe to restrict the hypothesis class of the representation function, from the space of all $K$-layered INRs, to the space of $K$-layered INRs conditioned on the optimal meta parameters, $\mathcal{F}_{\phi^*}$, where the optimal meta parameters, $\phi^*$, are the minimizer of a forecasting loss (as specified in Equation 1). Given this hypothesis class, local adaptation is performed over $\theta$ given the lookback window, which is assumed to come from a locally stationary distribution, resolving the issue of non-stationarity.
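Concretely, the outer loop of Equation 1 can be optimized with ordinary stochastic gradient descent, with the inner loop of Equation 2 solved inside the forward pass. Below is a minimal PyTorch sketch of one such outer-loop training epoch; the model interface and data loader names are assumptions for illustration, with the model standing in for the forward pass given in Algorithm 2 of Appendix A.

import torch
from torch.nn import functional as F

def outer_loop_epoch(model, loader, optimizer):
    # Each batch is a set of tasks: lookback (support) and horizon (query) windows.
    for lookback, horizon in loader:  # lookback: (B, L, m), horizon: (B, H, m)
        # The inner loop (Equation 2) is solved in closed form inside the model,
        # so only the meta parameters phi receive gradients here.
        preds = model(lookback, horizon_len=horizon.shape[1])  # (B, H, m)
        loss = F.mse_loss(preds, horizon)  # query-set (extrapolation) loss, Equation 1
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()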

2.2 Model Architecture

Implicit Neural Representations

The class of deep models which map coordinates to the value at that coordinate using a stack of multi-layer perceptrons (MLPs) is known as INRs Sitzmann et al. (2020b); Tancik et al. (2020); Mildenhall et al. (2020). We make use of them as they are a natural fit for time-index based models, mapping a time-index to the value of the time-series at that time-index. A $K$-layered, ReLU Nair and Hinton (2010) INR is a function $f: \mathbb{R}^c \to \mathbb{R}^m$ of the following form:

$$z^{(0)} = \tau, \quad z^{(k)} = \max\big(0, \; W^{(k)} z^{(k-1)} + b^{(k)}\big), \; k = 1, \ldots, K-1, \quad f(\tau) = W^{(K)} z^{(K-1)} + b^{(K)} \qquad (3)$$

where $\tau \in \mathbb{R}^c$ is the time-index. Note that $c = 1$ for our proposed approach as specified in Section 2.1, but we use this notation to allow for generalization to cases where datetime features are included. Tancik et al. (2020) introduced a random Fourier features layer which allows INRs to fit high frequency functions, by modifying the first layer to $z^{(0)} = \gamma(\tau) = [\sin(2\pi B\tau), \cos(2\pi B\tau)]^T$, where each entry in $B \in \mathbb{R}^{d/2 \times c}$ is sampled from $\mathcal{N}(0, \sigma^2)$, with $d$ being the hidden dimension size of the INR and $\sigma$ the scale hyperparameter. $[\cdot, \cdot]$ is a row-wise stacking operation.

Concatenated Fourier Features

While the random Fourier features layer endows INRs with the ability to learn high frequency patterns, one major drawback is the need to perform a hyperparameter sweep for each task and dataset to avoid over- or underfitting. We overcome this limitation with a simple scheme of concatenating multiple Fourier basis functions with diverse scale parameters, i.e. $\gamma(\tau) = [\sin(2\pi B_1\tau), \cos(2\pi B_1\tau), \ldots, \sin(2\pi B_S\tau), \cos(2\pi B_S\tau)]^T$, where elements in $B_s$ are sampled from $\mathcal{N}(0, \sigma_s^2)$, and $\sigma_1, \ldots, \sigma_S$ are the scale parameters. We perform an analysis in Appendix F and show that the performance of our proposed Concatenated Fourier Features (CFF) does not significantly deviate from the setting with the optimal scale parameter obtained from a hyperparameter sweep.
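A minimal sketch of the CFF layer under the same assumptions is given below; the total feature size and the particular scale set (here mirroring the sweep values of Appendix F) are assumptions for illustration.

import math
import torch
from torch import nn

class ConcatFourierFeatures(nn.Module):
    # Sketch of CFF: one random Fourier basis per scale, concatenated.
    def __init__(self, in_dim=1, out_dim=4096, scales=(0.01, 0.1, 1, 5, 10, 20, 50, 100)):
        super().__init__()
        per_scale = out_dim // (2 * len(scales))  # dimensions divided equally per scale
        self.register_buffer("B", torch.cat(
            [torch.randn(per_scale, in_dim) * s for s in scales], dim=0))

    def forward(self, tau):                       # tau: (T, in_dim)
        proj = 2 * math.pi * tau @ self.B.T
        return torch.cat([proj.sin(), proj.cos()], dim=-1)  # (T, out_dim)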

Differentiable Closed-Form Solvers

One key aspect to tackling forecasting as a meta-learning problem is efficiency. Optimization-based meta-learning approaches originally perform an expensive bi-level optimization procedure on the entire neural network model by backpropagating through inner gradient steps Ravi and Larochelle (2017); Finn et al. (2017). Since each forecast is now treated as an inner loop optimization problem, it needs to be sufficiently fast to be competitive with competing methods. We achieve this by leveraging a ridge regression closed-form solver Bertinetto et al. (2019) on top of an INR, illustrated in Figure 2(b). The ridge regression closed-form solver restricts the inner loop to only apply to the last layer of the model, allowing for either a closed-form solution or a differentiable solver to replace the inner gradient steps. This means that for a $K$-layered model, $\phi = \{W^{(k)}, b^{(k)}\}_{k=1}^{K-1}$ are the meta parameters and $\theta = \{W^{(K)}, b^{(K)}\}$ are the base parameters, following notation from Equation 3. Then let $g_\phi: \mathbb{R}^c \to \mathbb{R}^d$ be the meta learner, where $f_{\phi,\theta}(\tau) = W^{(K)} g_\phi(\tau) + b^{(K)}$. For the $i$-th task with the corresponding lookback-horizon pair, where $Y^{(i)} \in \mathbb{R}^{L \times m}$ denotes the lookback window values, the support set features obtained from the meta learner are denoted $Z^{(i)} = [g_\phi(\tau_0), \ldots, g_\phi(\tau_{L-1})]^T \in \mathbb{R}^{L \times d}$, where $[\cdot]$ is a column-wise concatenation operation. The inner loop thus solves the optimization problem:

$$W^{(i)*} = \arg\min_{W} \; \|Y^{(i)} - Z^{(i)} W\|_F^2 + \lambda \|W\|_F^2 = \big(Z^{(i)T} Z^{(i)} + \lambda I\big)^{-1} Z^{(i)T} Y^{(i)} \qquad (4)$$

Now, let $\tilde{Z}^{(i)} = [g_\phi(\tau_L), \ldots, g_\phi(\tau_{L+H-1})]^T \in \mathbb{R}^{H \times d}$ be the query set features. Then, our predictions are $\hat{Y}^{(i)} = \tilde{Z}^{(i)} W^{(i)*}$. This closed-form solution is differentiable, which enables gradient updates on the parameters of the meta learner, $\phi$. A bias term can be included for the closed-form ridge regressor by appending a scalar 1 to the feature vector $g_\phi(\tau)$. The model obtained by DeepTIMe is ultimately the restricted hypothesis class $\mathcal{F}_{\phi^*}$.

3 Experiments

We evaluate DeepTIMe on both synthetic datasets, and a variety of real world data. We ask the following questions: (i) Is DeepTIMe, trained on a family of functions following the same parametric form, able to perform extrapolation on unseen functions? (ii) How does DeepTIMe compare to other forecasting models on real world data? (iii) What are the key contributing factors to the good performance of DeepTIMe?

3.1 Synthetic Data Experiments

(a) Linear
(b) Cubic
(c) Sum of Sinusoids
Figure 3: Predictions of DeepTIMe on three unseen functions for each function class. The orange line represents the split between lookback window and forecast horizon.

We first consider DeepTIMe’s ability to extrapolate on the following functions specified by some parametric form: (i) the family of linear functions, $y = a\tau + b$, (ii) the family of cubic functions, $y = a\tau^3 + b\tau^2 + c\tau + d$, and (iii) sums of sinusoids, $y = \sum_j A_j \sin(\omega_j \tau + \varphi_j)$. Parameters of the functions (i.e. $a, b, c, d, A_j, \omega_j, \varphi_j$) are sampled randomly (further details in Appendix B) to construct distinct tasks. A total of 400 time steps are sampled, with a lookback window length of 200 and forecast horizon of 200. Figure 3 demonstrates that DeepTIMe is able to perform extrapolation on unseen test functions/tasks after being trained via our meta-learning formulation. It demonstrates an ability to approximate and adapt, based on the lookback window, linear and cubic polynomials, and even sums of sinusoids. Next, we evaluate DeepTIMe on real world datasets, against state-of-the-art forecasting baselines.

3.2 Real World Data Experiments

Datasets

ETT (https://github.com/zhouhaoyi/ETDataset) Zhou et al. (2021) - Electricity Transformer Temperature provides measurements from an electricity transformer such as load and oil temperature. We use the ETTm2 subset, consisting of measurements at a 15 minute frequency. ECL (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) - Electricity Consuming Load provides measurements of electricity consumption for 321 households from 2012 to 2014. The data was collected at the 15 minute level, but is aggregated hourly. Exchange (https://github.com/laiguokun/multivariate-time-series-data) Lai et al. (2018) - a collection of daily exchange rates with USD of eight countries (Australia, United Kingdom, Canada, Switzerland, China, Japan, New Zealand, and Singapore) from 1990 to 2016. Traffic (https://pems.dot.ca.gov/) - a dataset from the California Department of Transportation providing the hourly road occupancy rates from 862 sensors in San Francisco Bay area freeways. Weather (https://www.bgc-jena.mpg.de/wetter/) - provides measurements of 21 meteorological indicators such as air temperature, humidity, etc., every 10 minutes for the year of 2020 from the Weather Station of the Max Planck Institute for Biogeochemistry in Jena, Germany. ILI (https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html) - Influenza-like Illness measures the weekly ratio of patients seen with ILI to the total number of patients, obtained by the Centers for Disease Control and Prevention of the United States between 2002 and 2021.

Evaluation

We evaluate the performance of our proposed approach using two metrics, the mean squared error (MSE) and the mean absolute error (MAE). The datasets are split into train, validation, and test sets chronologically, following a 70/10/20 split for all datasets except for ETTm2, which follows a 60/20/20 split, as per convention. The univariate benchmark selects the last index of the multivariate dataset as the target variable, following previous work Xu et al. (2021); Woo et al. (2022); Challu et al. (2022); Zhou et al. (2022). Pre-processing on the data is performed by standardization based on train set statistics. Hyperparameter selection is performed on only one value, the lookback length multiplier, $\mu$, which decides the length of the lookback window as a multiple of the forecast horizon, $L = \mu H$. We search through a small set of values for $\mu$, and select the best value based on the validation loss. Further implementation details on DeepTIMe are reported in Appendix C, and detailed hyperparameters are reported in Appendix D. Reported results for DeepTIMe are averaged over three runs, and standard deviations are reported in Appendix E.
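For concreteness, the following is a minimal sketch of the chronological split and train-statistics standardization described above; the function name and array layout (time by variables) are assumptions for illustration.

import numpy as np

def split_and_standardize(series, train_frac=0.7, val_frac=0.1):
    # series: array of shape (T, m), ordered in time
    T = len(series)
    n_train, n_val = int(T * train_frac), int(T * val_frac)
    train, val, test = (series[:n_train],
                        series[n_train:n_train + n_val],
                        series[n_train + n_val:])
    mean, std = train.mean(axis=0), train.std(axis=0)  # train-set statistics only
    return [(x - mean) / std for x in (train, val, test)]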

Results

Methods DeepTIMe N-HiTS ETSformer Fedformer Autoformer Informer LogTrans Reformer
Metrics MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE

ETTm2

96 0.166 0.257 0.176 0.255 0.189 0.280 0.203 0.287 0.255 0.339 0.365 0.453 0.768 0.642 0.658 0.619
192 0.225 0.302 0.245 0.305 0.253 0.319 0.269 0.328 0.281 0.340 0.533 0.563 0.989 0.757 1.078 0.827
336 0.277 0.336 0.295 0.346 0.314 0.357 0.325 0.366 0.339 0.372 1.363 0.887 1.334 0.872 1.549 0.972
720 0.383 0.409 0.401 0.426 0.414 0.413 0.421 0.415 0.422 0.419 3.379 1.388 3.048 1.328 2.631 1.242

ECL

96 0.137 0.238 0.147 0.249 0.187 0.304 0.183 0.297 0.201 0.317 0.274 0.368 0.258 0.357 0.312 0.402
192 0.152 0.252 0.167 0.269 0.199 0.315 0.195 0.308 0.222 0.334 0.296 0.386 0.266 0.368 0.348 0.433
336 0.166 0.268 0.186 0.290 0.212 0.329 0.212 0.313 0.231 0.338 0.300 0.394 0.280 0.380 0.350 0.433
720 0.201 0.302 0.243 0.340 0.233 0.345 0.231 0.343 0.254 0.361 0.373 0.439 0.283 0.376 0.340 0.420

Exchange

96 0.081 0.205 0.092 0.211 0.085 0.204 0.139 0.276 0.197 0.323 0.847 0.752 0.968 0.812 1.065 0.829
192 0.151 0.284 0.208 0.322 0.182 0.303 0.256 0.369 0.300 0.369 1.204 0.895 1.040 0.851 1.188 0.906
336 0.314 0.412 0.371 0.443 0.348 0.428 0.426 0.464 0.509 0.524 1.672 1.036 1.659 1.081 1.357 0.976
720 0.856 0.663 0.888 0.723 1.025 0.774 1.090 0.800 1.447 0.941 2.478 1.310 1.941 1.127 1.510 1.016

Traffic

96 0.390 0.275 0.402 0.282 0.607 0.392 0.562 0.349 0.613 0.388 0.719 0.391 0.684 0.384 0.732 0.423
192 0.402 0.278 0.420 0.297 0.621 0.399 0.562 0.346 0.616 0.382 0.696 0.379 0.685 0.390 0.733 0.420
336 0.415 0.288 0.448 0.313 0.622 0.396 0.570 0.323 0.622 0.337 0.777 0.420 0.733 0.408 0.742 0.420
720 0.449 0.307 0.539 0.353 0.632 0.396 0.596 0.368 0.660 0.408 0.864 0.472 0.717 0.396 0.755 0.423

Weather

96 0.166 0.221 0.158 0.195 0.197 0.281 0.217 0.296 0.266 0.336 0.300 0.384 0.458 0.490 0.689 0.596
192 0.207 0.261 0.211 0.247 0.237 0.312 0.276 0.336 0.307 0.367 0.598 0.544 0.658 0.589 0.752 0.638
336 0.251 0.298 0.274 0.300 0.298 0.353 0.339 0.380 0.359 0.359 0.578 0.523 0.797 0.652 0.639 0.596
720 0.301 0.338 0.351 0.353 0.352 0.388 0.403 0.428 0.419 0.419 1.059 0.741 0.869 0.675 1.130 0.792

ILI

24 2.425 1.086 1.862 0.869 2.527 1.020 2.203 0.963 3.483 1.287 5.764 1.677 4.480 1.444 4.400 1.382
36 2.231 1.008 2.071 0.969 2.615 1.007 2.272 0.976 3.103 1.148 4.755 1.467 4.799 1.467 4.783 1.448
48 2.230 1.016 2.346 1.042 2.359 0.972 2.209 0.981 2.669 1.085 4.763 1.469 4.800 1.468 4.832 1.465
60 2.143 0.985 2.560 1.073 2.487 1.016 2.545 1.061 2.770 1.125 5.264 1.564 5.278 1.560 4.882 1.483
Table 1: Multivariate forecasting benchmark on long sequence time-series forecasting. Best results are highlighted in bold, and second best results are underlined.
Methods DeepTIMe N-HiTS ETSformer Fedformer Autoformer Informer N-BEATS DeepAR Prophet ARIMA
Metrics MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE

ETTm2

96 0.065 0.186 0.066 0.185 0.080 0.212 0.063 0.189 0.065 0.189 0.088 0.225 0.082 0.219 0.099 0.237 0.287 0.456 0.211 0.362
192 0.096 0.234 0.087 0.223 0.150 0.302 0.102 0.245 0.118 0.256 0.132 0.283 0.120 0.268 0.154 0.310 0.312 0.483 0.261 0.406
336 0.138 0.285 0.106 0.251 0.175 0.334 0.130 0.279 0.154 0.305 0.180 0.336 0.226 0.370 0.277 0.428 0.331 0.474 0.317 0.448
720 0.186 0.338 0.157 0.312 0.224 0.379 0.178 0.325 0.182 0.335 0.300 0.435 0.188 0.338 0.332 0.468 0.534 0.593 0.366 0.487

Exchange

96 0.086 0.226 0.093 0.223 0.099 0.230 0.131 0.284 0.241 0.299 0.591 0.615 0.156 0.299 0.417 0.515 0.828 0.762 0.112 0.245
192 0.173 0.330 0.230 0.313 0.223 0.353 0.277 0.420 0.273 0.665 1.183 0.912 0.669 0.665 0.813 0.735 0.909 0.974 0.304 0.404
336 0.539 0.575 0.370 0.486 0.421 0.497 0.426 0.511 0.508 0.605 1.367 0.984 0.611 0.605 1.331 0.962 1.304 0.988 0.736 0.598
720 0.936 0.763 0.728 0.569 1.114 0.807 1.162 0.832 0.991 0.860 1.872 1.072 1.111 0.860 1.890 1.181 3.238 1.566 1.871 0.935
Table 2: Univariate forecasting benchmark on long sequence time-series forecasting. Best results are highlighted in bold, and second best results are underlined.

We compare DeepTIMe to the following baselines for the multivariate setting: N-HiTS Challu et al. (2022), ETSformer Woo et al. (2022), Fedformer Zhou et al. (2022) (we report the best score for each setting from the two variants they present), Autoformer Xu et al. (2021), Informer Zhou et al. (2021), LogTrans Li et al. (2019), and Reformer Kitaev et al. (2020). For the univariate setting, we include additional univariate forecasting models: N-BEATS Oreshkin et al. (2020), DeepAR Salinas et al. (2020), Prophet Taylor and Letham (2018), and ARIMA. We obtain the baseline results from the following papers: Challu et al. (2022); Woo et al. (2022); Zhou et al. (2022); Xu et al. (2021). Table 1 and Table 2 summarize the multivariate and univariate forecasting results respectively. DeepTIMe achieves state-of-the-art performance on 20 out of 24 settings in MSE, and 17 out of 24 settings in MAE on the multivariate benchmark, and also achieves competitive results on the univariate benchmark despite its simple architecture compared to the baselines, which comprise complex fully connected architectures and computationally intensive Transformer architectures.

3.3 Ablation Studies

Methods DeepTIMe, + Datetime, - RR, - RR + Datetime, + Local, + Local + Datetime
Metrics MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE

ETTm2

96 0.166 0.257 0.226 0.303 3.072 1.345 3.393 1.400 0.251 0.331 0.250 0.327
192 0.225 0.302 0.309 0.362 3.064 1.343 3.269 1.381 0.322 0.371 0.323 0.366
336 0.277 0.336 0.341 0.381 2.920 1.309 3.442 1.401 0.370 0.412 0.367 0.396
720 0.383 0.409 0.453 0.447 2.773 1.273 3.400 1.399 0.443 0.449 0.455 0.461
Table 3: Ablation study on variants of DeepTIMe. Starting from the original version, we add (+) or remove (-) some component from DeepTIMe. RR stands for the differentiable closed-form ridge regressor; removing it refers to replacing this module with a simple linear layer trained via gradient descent across all training samples (i.e. without the meta-learning formulation). Local refers to training an INR from scratch via gradient descent for each lookback window (RR is not used here, and there is no global training phase). Datetime refers to datetime features. Further model details can be found in Section G.1.
Methods DeepTIMe MLP SIREN RNN
Metrics MSE MAE MSE MAE MSE MAE MSE MAE
ETTm2 96 0.166 0.257 0.186 0.284 0.236 0.325 0.233 0.324
192 0.225 0.302 0.265 0.338 0.295 0.361 0.275 0.337
336 0.277 0.336 0.316 0.372 0.327 0.386 0.344 0.383
720 0.383 0.409 0.401 0.417 0.438 0.453 0.431 0.432
Table 4: Ablation study on backbone models. DeepTIMe refers to our proposed approach, an INR with random Fourier features sampled from a range of scales. MLP refers to replacing the random Fourier features with a linear map from the input dimension to the hidden dimension. SIREN refers to an INR with periodic activations as proposed by Sitzmann et al. (2020b). RNN refers to an autoregressive recurrent neural network (inputs are the time-series values, $y_t$). All approaches include the differentiable closed-form ridge regressor. Further model details can be found in Section G.2.
CFF Optimal Scale (% change) Pessimal Scale (% change)
Metrics MSE MAE MSE MAE MSE MAE

ETTm2

96 0.166 0.257 0.164 (1.20%) 0.257 (-0.05%) 0.216 (-23.22%) 0.300 (-14.22%)
192 0.225 0.302 0.220 (1.87%) 0.301 (0.25%) 0.275 (-18.36%) 0.340 (-11.25%)
336 0.277 0.336 0.275 (0.70%) 0.336 (-0.22%) 0.340 (-18.68%) 0.375 (-10.57%)
720 0.383 0.409 0.364 (5.29%) 0.392 (4.48%) 0.424 (-9.67%) 0.430 (-4.95%)
Table 5: Comparison of CFF against the optimal and pessimal scales as obtained from the hyperparameter sweep. We also calculate the change in performance between CFF and the optimal and pessimal scales, where a positive percentage refers to CFF underperforming and a negative percentage refers to CFF outperforming, calculated relative to the corresponding scale's result.

We perform an ablation study to understand how various training schemes and input features affect the performance of DeepTIMe. Table 3 presents these results. First, we observe that our meta-learning formulation is a critical component to the success of DeepTIMe. We note that DeepTIMe without meta-learning may not be a meaningful baseline, since the model outputs are always the same regardless of the input lookback window. Including datetime features helps alleviate this issue, yet we observe that the inclusion of datetime features generally leads to a degradation in performance. In the case of DeepTIMe, we observed that including datetime features leads to a much lower training loss, but a degradation in test performance – this is a case of meta-learning memorization Yin et al. (2020) due to the tasks becoming non-mutually exclusive Rajendran et al. (2020). Finally, we observe that the meta-learning formulation is indeed superior to training a model from scratch for each lookback window.

In Table 4 we perform an ablation study on various backbone architectures, while retaining the differentiable closed-form ridge regressor. We observe a degradation when the random Fourier features layer is removed, due to the spectral bias problem which neural networks face Rahaman et al. (2019); Tancik et al. (2020). DeepTIMe outperforms the SIREN variant of INRs, which is consistent with observations in the INR literature. Finally, DeepTIMe outperforms the RNN variant, which is the model proposed in Grazzi et al. (2021). This is a direct comparison between autoregressive and time-index models, and highlights the benefits of time-index models.

Lastly, we perform a comparison between the optimal and pessimal scale hyperparameters for the vanilla random Fourier features layer, against our proposed CFF. We first report the results for each scale hyperparameter of the vanilla random Fourier features layer in Table 8, Appendix F. As with the other ablation studies, the results reported in Table 8 are based on performing a hyperparameter sweep across the lookback length multiplier, selecting the optimal setting based on the validation set, and reporting the test set results. The optimal and pessimal scales are then simply the best and worst results from Table 8. Table 5 shows that CFF achieves extremely low deviation from the optimal scale across all settings, yet retains the upside of avoiding this expensive hyperparameter tuning phase. We also observe that tuning the scale hyperparameter is extremely important, as CFF obtains up to a 23.22% improvement in MSE over the pessimal scale hyperparameter.

3.4 Computational Efficiency

(a) Runtime Analysis
(b) Memory Analysis
Figure 4: Computational efficiency benchmark on the ETTm2 multivariate dataset, on a batch size of 32. Runtime is measured for one iteration (forward + backward pass). Left: Runtime/Memory usage as lookback length varies, horizon is fixed to 48. Right: Runtime/Memory usage as horizon length varies, lookback length is fixed to 48. Further model details can be found in Appendix H.

Finally, we analyse DeepTIMe’s efficiency in both runtime and memory usage, with respect to both lookback window and forecast horizon lengths. The main bottleneck in computation for DeepTIMe is the matrix inversion operation in the ridge regressor, canonically of $\mathcal{O}(n^3)$ complexity in the number of regression samples $n$. This is a major concern for DeepTIMe, as $n$ is linked to the length of the lookback window. As mentioned in Bertinetto et al. (2019), the Woodbury formulation (see Algorithm 1 in Appendix A) is used to alleviate the problem, leading to an $\mathcal{O}(d^3)$ complexity, where $d$ is the hidden size hyperparameter, fixed to some value (see Appendix D). Figure 4 demonstrates that DeepTIMe is highly efficient, even when compared to efficient Transformer models recently proposed for the long sequence time-series forecasting task, as well as fully connected models.

4 Related Work

Neural Forecasting

Neural forecasting Benidis et al. (2020) methods have seen great success in recent times. One related line of research is Transformer-based methods for long sequence time-series forecasting Li et al. (2019); Zhou et al. (2021); Xu et al. (2021); Woo et al. (2022); Zhou et al. (2022), which aim not only to achieve high accuracy, but also to overcome the vanilla attention's quadratic complexity. Li et al. (2019) and Zhou et al. (2021) introduced sparse attention mechanisms, while Xu et al. (2021); Woo et al. (2022), and Zhou et al. (2022) introduced mechanisms which make use of the Fourier transform to achieve a quasilinear complexity. They further embed prior knowledge of time-series structures such as auto-correlation and seasonal-trend decomposition into the Transformer architecture. Another relevant line of work is that of fully connected models Oreshkin et al. (2020); Olivares et al. (2021); Challu et al. (2022); Oreshkin et al. (2021). Oreshkin et al. (2020) first introduced the N-BEATS model, which made use of doubly residual stacks of fully connected layers. Challu et al. (2022) extended this approach to the long sequence time-series forecasting task by introducing hierarchical interpolation and multi-rate data sampling. Meta-learning has also been explored in time-series: Grazzi et al. (2021) used a differentiable closed-form solver in the context of time-series forecasting, but specified an auto-regressive backbone model.

Time-Index Based Models

Time-index based models, or time-series regression models, take as input datetime features and other covariates to predict the value of the time-series at that time step. They have been well explored as a special case of regression analysis Hyndman and Athanasopoulos (2018); Ord et al. (2017), and many different predictors have been proposed for the classical setting. These include linear, polynomial, and piecewise linear trends, dummy variables indicating holidays, seasonal dummy variables, and many others Hyndman and Athanasopoulos (2018). Of note, Fourier terms have been used to model periodicity, or seasonal patterns, in what is also known as harmonic regression Young et al. (1999). One popular classical time-index based method is Prophet Taylor and Letham (2018), which uses a structural time-series formulation, considering trend, seasonal, and holiday variables, specialized for business forecasting. Godfrey and Gashler (2017) introduced an initial attempt at using time-index based neural networks to fit a time-series for forecasting. Yet, their work is more reminiscent of classical methods, as they manually specify periodic and non-periodic activation functions, analogous to the representation functions, rather than learning the representation function from data.

Implicit Neural Representations

INRs have recently gained popularity in the area of neural rendering Tewari et al. (2021). They parameterize a signal as a continuous function, mapping a coordinate to the value at that coordinate. A key finding was that positional encodings Mildenhall et al. (2020); Tancik et al. (2020) are critical for ReLU MLPs to learn high frequency details, while another line of work introduced periodic activations Sitzmann et al. (2020b). Meta-learning with INRs has been explored for various data modalities, typically over images or for neural rendering tasks Sitzmann et al. (2020a); Tancik et al. (2021); Dupont et al. (2021), using both hypernetworks and optimization-based approaches. Yüce et al. (2021) show that meta-learning on INRs is analogous to dictionary learning. In time-series, Jeong and Shin (2022) explored using INRs for anomaly detection, opting to make use of periodic activations and temporal positional encodings.

5 Discussion

While out of scope for our current work, a limitation that DeepTIMe faces is that it does not consider holidays and events. We leave the consideration of such features as a potential future direction, along with the incorporation of datetime features as exogenous covariates, while avoiding the meta-learning memorization problem. While our current focus for DeepTIMe is on time-series forecasting, time-index based models are a natural fit for missing value imputation, as well as other time-series intelligence tasks for irregular time-series – this is another interesting future direction in which to extend deep time-index models. One final idea which is interesting to explore is a hypernetwork-style meta-learning solution, which could potentially allow the model to learn from multiple time-series.

6 Conclusion

In this paper, we proposed DeepTIMe, a deep time-index based model trained via a meta-learning formulation to automatically learn a representation function from time-series data, rather than manually defining the representation function as per classical methods. The meta-learning formulation further enables DeepTIMe to be utilized for non-stationary time-series by adapting to the locally stationary distribution. Importantly, we use a closed-form ridge regressor to tackle the meta-learning formulation to ensure that predictions are computationally efficient. Our extensive empirical analysis shows that DeepTIMe, while being a much simpler model architecture compared to prevailing state-of-the-art methods, achieves competitive performance across forecasting benchmarks on real world datasets. We perform substantial ablation studies to identify the key components contributing to the success of DeepTIMe, and also show that it is highly efficient.

References

  • R. Amit and R. Meir (2018) Meta-learning by adjusting priors based on extended pac-bayes theory. In International Conference on Machine Learning, pp. 205–214. Cited by: Appendix I.
  • S. O. Arik, N. C. Yoder, and T. Pfister (2022) Self-adaptive forecasting for improved deep learning on non-stationary time-series. arXiv preprint arXiv:2202.02403. Cited by: §1.
  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv. External Links: Document, Link Cited by: Appendix C.
  • K. Benidis, S. S. Rangapuram, V. Flunkert, B. Wang, D. Maddix, C. Turkmen, J. Gasthaus, M. Bohlke-Schneider, D. Salinas, L. Stella, et al. (2020) Neural forecasting: introduction and literature overview. arXiv preprint arXiv:2004.10240. Cited by: §1, §4.
  • L. Bertinetto, J. F. Henriques, P. Torr, and A. Vedaldi (2019) Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.2, §3.4.
  • R. Carbonneau, K. Laframboise, and R. Vahidov (2008) Application of machine learning techniques for supply chain demand forecasting. European Journal of Operational Research 184 (3), pp. 1140–1154. Cited by: §1.
  • O. Catoni (2007) PAC-bayesian supervised classification: the thermodynamics of statistical learning. arXiv preprint arXiv:0712.0248. Cited by: Theorem I.2, Appendix I.
  • C. Challu, K. G. Olivares, B. N. Oreshkin, F. Garza, M. Mergenthaler, and A. Dubrawski (2022) N-hits: neural hierarchical interpolation for time series forecasting. arXiv preprint arXiv:2201.12886. Cited by: §1, §3.2, §3.2, §4.
  • J. C. Cuaresma, J. Hlouskova, S. Kossmeier, and M. Obersteiner (2004) Forecasting electricity spot-prices using linear univariate time-series models. Applied Energy 77 (1), pp. 87–106. Cited by: §1.
  • E. Dupont, Y. W. Teh, and A. Doucet (2021) Generative models as distributions of functions. arXiv preprint arXiv:2102.04776. Cited by: §4.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 1126–1135. External Links: Link Cited by: §2.2.
  • L. B. Godfrey and M. S. Gashler (2017) Neural decomposition of time-series data for effective generalization. IEEE transactions on neural networks and learning systems 29 (7), pp. 2973–2985. Cited by: §4.
  • R. Grazzi, V. Flunkert, D. Salinas, T. Januschowski, M. Seeger, and C. Archambeau (2021) Meta-forecasting by combining global deeprepresentations with local adaptation. arXiv preprint arXiv:2111.03418. Cited by: §3.3, §4.
  • A. C. Harvey and N. Shephard (1993) 10 structural time series models. In Econometrics, Handbook of Statistics, Vol. 11, pp. 261–302. External Links: ISSN 0169-7161, Document, Link Cited by: §1.
  • R. J. Hyndman and G. Athanasopoulos (2018) Forecasting: principles and practice. OTexts. Cited by: §1, §4.
  • K. Jeong and Y. Shin (2022) Time-series anomaly detection with implicit neural representation. arXiv preprint arXiv:2201.11950. Cited by: §4.
  • K. Kim (2003) Financial time series forecasting using support vector machines. Neurocomputing 55 (1-2), pp. 307–319. Cited by: §1.
  • T. Kim, J. Kim, Y. Tae, C. Park, J. Choi, and J. Choo (2021) Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations, Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix C.
  • N. Kitaev, L. Kaiser, and A. Levskaya (2020) Reformer: the efficient transformer. In International Conference on Learning Representations, External Links: Link Cited by: §3.2.
  • G. Lai, W. Chang, Y. Yang, and H. Liu (2018) Modeling long-and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 95–104. Cited by: §3.2.
  • N. Laptev, J. Yosinski, L. E. Li, and S. Smyl (2017) Time-series extreme event forecasting with neural networks at uber. In International conference on machine learning, Vol. 34, pp. 1–5. Cited by: §1.
  • S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y. Wang, and X. Yan (2019) Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. ArXiv abs/1907.00235. Cited by: §3.2, §4.
  • B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) Nerf: representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pp. 405–421. Cited by: §1, §2.2, §4.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Icml, Cited by: §2.2.
  • K. G. Olivares, C. Challu, G. Marcjasz, R. Weron, and A. Dubrawski (2021) Neural basis expansion analysis with exogenous variables: forecasting electricity prices with nbeatsx. arXiv preprint arXiv:2104.05522. Cited by: §1, §4.
  • K. Ord, R. A. Fildes, and N. Kourentzes (2017) Principles of business forecasting. Wessex Press Publishing Co.. Cited by: §1, §4.
  • B. N. Oreshkin, A. Amini, L. Coyle, and M. J. Coates (2021) FC-gaga: fully connected gated graph architecture for spatio-temporal traffic forecasting. In Proc. AAAI Conf. Artificial Intell, Cited by: §1, §4.
  • B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio (2020) N-beats: neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, External Links: Link Cited by: §1, §3.2, §4.
  • N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville (2019) On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301–5310. Cited by: §3.3.
  • J. Rajendran, A. Irpan, and E. Jang (2020) Meta-learning requires meta-augmentation. Advances in Neural Information Processing Systems 33, pp. 5705–5715. Cited by: §3.3.
  • S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. In ICLR, Cited by: §2.2.
  • D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski (2020) DeepAR: probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36 (3), pp. 1181–1191. Cited by: §1, §3.2.
  • S. Shalev-Shwartz and S. Ben-David (2014) Understanding machine learning: from theory to algorithms. Cambridge university press. Cited by: Appendix I.
  • V. Sitzmann, E. Chan, R. Tucker, N. Snavely, and G. Wetzstein (2020a) Metasdf: meta-learning signed distance functions. Advances in Neural Information Processing Systems 33, pp. 10136–10147. Cited by: §4.
  • V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein (2020b) Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems 33, pp. 7462–7473. Cited by: §G.2, §1, §2.2, Table 4, §4.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: Appendix C.
  • M. Tancik, B. Mildenhall, T. Wang, D. Schmidt, P. P. Srinivasan, J. T. Barron, and R. Ng (2021) Learned initializations for optimizing coordinate-based neural representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2846–2855. Cited by: §4.
  • M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, and R. Ng (2020) Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems 33, pp. 7537–7547. Cited by: §1, §2.2, §3.3, §4.
  • S. J. Taylor and B. Letham (2018) Forecasting at scale. The American Statistician 72 (1), pp. 37–45. Cited by: §1, §3.2, §4.
  • A. Tewari, J. Thies, B. Mildenhall, P. Srinivasan, E. Tretschk, Y. Wang, C. Lassner, V. Sitzmann, R. Martin-Brualla, S. Lombardi, et al. (2021) Advances in neural rendering. arXiv preprint arXiv:2111.05849. Cited by: §4.
  • G. Woo, C. Liu, D. Sahoo, A. Kumar, and S. Hoi (2022) ETSformer: exponential smoothing transformers for time-series forecasting. arXiv preprint arXiv:2202.01381. Cited by: §1, §3.2, §3.2, §4.
  • J. Xu, J. Wang, M. Long, et al. (2021) Autoformer: decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems 34. Cited by: §1, §3.2, §3.2, §4.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. Advances in neural information processing systems 32. Cited by: §1.
  • M. Yin, G. Tucker, M. Zhou, S. Levine, and C. Finn (2020) Meta-learning without memorization. In International Conference on Learning Representations, External Links: Link Cited by: §3.3.
  • P. C. Young, D. J. Pedregal, and W. Tych (1999) Dynamic harmonic regression. Journal of forecasting 18 (6), pp. 369–394. Cited by: §4.
  • G. Yüce, G. Ortiz-Jiménez, B. Besbinar, and P. Frossard (2021) A structured dictionary perspective on implicit neural representations. arXiv preprint arXiv:2112.01917. Cited by: §4.
  • H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang (2021) Informer: beyond efficient transformer for long sequence time-series forecasting. In Proceedings of AAAI, Cited by: §1, §3.2, §3.2, §4.
  • T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin (2022) FEDformer: frequency enhanced decomposed transformer for long-term series forecasting. arXiv preprint arXiv:2201.12740. Cited by: §1, §3.2, §3.2, §4.

Appendix A DeepTIMe Pseudocode

The closed-form ridge regressor can be written directly in PyTorch; torch.linalg.solve computes the solution of a square system of linear equations with a unique solution, and diagonal().add_() adds the regularization term to the diagonal in place.

import torch
from torch.nn.functional import softplus

def ridge_regressor(X, Y, lambd=0.0):
    # X: inputs, shape: (n_samples, n_dim)
    # Y: targets, shape: (n_samples, n_out)
    # lambd: scalar (pre-softplus) regularization coefficient
    n_samples, n_dim = X.shape
    # add a bias term by concatenating an all-ones column
    ones = torch.ones(n_samples, 1, dtype=X.dtype, device=X.device)
    X = torch.cat([X, ones], dim=-1)
    if n_samples >= n_dim:
        # standard formulation: invert a (n_dim + 1) x (n_dim + 1) matrix
        A = X.T @ X
        A.diagonal().add_(softplus(torch.as_tensor(lambd)))
        B = X.T @ Y
        weights = torch.linalg.solve(A, B)
    else:
        # Woodbury formulation: invert an n_samples x n_samples matrix instead
        A = X @ X.T
        A.diagonal().add_(softplus(torch.as_tensor(lambd)))
        weights = X.T @ torch.linalg.solve(A, Y)
    w, b = weights[:-1], weights[-1:]
    return w, b
Algorithm 1 PyTorch-Style Pseudocode of Closed-Form Ridge Regressor
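A hypothetical usage of the solver above on dummy tensors shaped like one task (a lookback window of 96 steps, 256-dimensional representations, 7 variables):

import torch

Z = torch.randn(96, 256)   # support set features from the meta learner, (L, d)
Y = torch.randn(96, 7)     # lookback window values, (L, m)
w, b = ridge_regressor(Z, Y, lambd=0.0)
print(w.shape, b.shape)    # torch.Size([256, 7]) torch.Size([1, 7])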

The forward pass of DeepTIMe then reuses ridge_regressor from Algorithm 1 for the inner loop:

import torch

def deeptime_forward(inr, x, horizon_len):
    # x: input time-series (lookback window), shape: (lookback_len, multivariate_dim)
    # inr: implicit neural representation mapping a time-index to its representation
    # horizon_len: scalar value representing the length of the forecast horizon
    lookback_len = x.shape[0]
    # [0, 1]-normalized time-index over the lookback window and forecast horizon
    time_index = torch.linspace(0, 1, lookback_len + horizon_len).unsqueeze(-1)  # (L + H, 1)
    time_reprs = inr(time_index)                    # (L + H, hidden_dim)
    lookback_reprs = time_reprs[:lookback_len]
    horizon_reprs = time_reprs[-horizon_len:]
    # inner loop: closed-form ridge regression on the lookback window (support set)
    w, b = ridge_regressor(lookback_reprs, x)
    # w: (hidden_dim, multivariate_dim), b: (1, multivariate_dim)
    preds = horizon_reprs @ w + b                   # forecasts over the horizon
    return preds
Algorithm 2 PyTorch-Style Pseudocode of DeepTIMe
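As a hypothetical usage example, any module mapping a (T, 1) time-index to (T, hidden_dim) representations can serve as the INR here; the simple MLP below is only a placeholder for the Fourier-features INR used in DeepTIMe.

import torch

inr = torch.nn.Sequential(torch.nn.Linear(1, 256), torch.nn.ReLU(),
                          torch.nn.Linear(256, 256))
x = torch.randn(96, 7)                               # lookback window, (L, m)
preds = deeptime_forward(inr, x, horizon_len=192)    # forecasts, (192, 7)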

Appendix B Synthetic Data

Linear

Samples are generated from the function $y = a\tau + b$ for $\tau \in [-1, 1]$. This means that each function/task consists of 400 evenly spaced points between -1 and 1. The parameters of each function/task (i.e. $a, b$) are sampled from a normal distribution with mean 0 and standard deviation of 50, i.e. $a, b \sim \mathcal{N}(0, 50^2)$.

Cubic

Samples are generated from the function $y = a\tau^3 + b\tau^2 + c\tau + d$ for $\tau \in [-1, 1]$ over 400 points. Parameters of each task are sampled from a continuous uniform distribution with minimum value of -50 and maximum value of 50, i.e. $a, b, c, d \sim \mathcal{U}(-50, 50)$.

Sums of sinusoids

Sinusoids come from a fixed set of frequencies $\Omega$, generated by random sampling. We fix the size of this set to be five, i.e. $|\Omega| = 5$. Each function is then a sum of $J$ sinusoids, where $J$ is randomly assigned. The function is thus $y = \sum_{j=1}^{J} A_j \sin(\omega_j \tau + \varphi_j)$ for $\tau \in [-1, 1]$, where the amplitudes $A_j$ and phase shifts $\varphi_j$ are freely chosen at random, but each frequency $\omega_j$ is randomly selected from the fixed set $\Omega$.
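A minimal sketch of how the linear and cubic task families above can be generated (the helper name is hypothetical; the sums-of-sinusoids family is omitted since its frequency-set construction is only partially specified here):

import numpy as np

def make_task(kind="linear", n_points=400, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    tau = np.linspace(-1, 1, n_points)
    if kind == "linear":
        a, b = rng.normal(0, 50, size=2)             # a, b ~ N(0, 50^2)
        y = a * tau + b
    else:  # cubic
        a, b, c, d = rng.uniform(-50, 50, size=4)    # a, b, c, d ~ U(-50, 50)
        y = a * tau**3 + b * tau**2 + c * tau + d
    # first half is the lookback window, second half the forecast horizon
    half = n_points // 2
    return (tau[:half], y[:half]), (tau[half:], y[half:])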

Appendix C DeepTIMe Implementation Details

Optimization

We train DeepTIMe with the Adam optimizer Kingma and Ba [2014] with a learning rate scheduler following a linear warm up and cosine annealing scheme. Gradient clipping by norm is applied. The ridge regressor regularization coefficient, $\lambda$, is trained with a different, higher learning rate than the rest of the meta parameters. We use early stopping based on the validation loss, with a fixed patience hyperparameter (the number of epochs for which the loss may deteriorate before stopping). All experiments are performed on an Nvidia A100 GPU.

Model

The ridge regression regularization coefficient $\lambda$ is a learnable parameter constrained to positive values via a softplus function. We apply Dropout Srivastava et al. [2014], then LayerNorm Ba et al. [2016], after the ReLU activation function in each INR layer. The size of the random Fourier features layer is set independently of the layer size – we define the total size of the random Fourier features layer, and the number of dimensions allocated to each scale is divided equally.
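A minimal sketch of one INR layer following this ordering, using the Appendix D defaults (the factory function is a hypothetical helper, not the released implementation):

from torch import nn

def inr_block(hidden=256, dropout=0.1):
    # Linear -> ReLU -> Dropout -> LayerNorm, as described above.
    return nn.Sequential(
        nn.Linear(hidden, hidden),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.LayerNorm(hidden),
    )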

Appendix D DeepTIMe Hyperparameters

Hyperparameter Value

Optimization

Epochs 50
Learning rate 1e-3
$\lambda$ learning rate 1.0
Warm up epochs 5
Batch size 256
Early stopping patience 7
Max gradient norm 10.0

Model

Layers 5
Layer size 256
$\lambda$ initialization 0.0
Scales {0.01, 0.1, 1, 5, 10, 20, 50, 100}
Fourier features size 4096
Dropout 0.1
Lookback length multiplier, $\mu$ selected on validation loss (see Section 3.2)
Table 6: Hyperparameters used in DeepTIMe.

Appendix E DeepTIMe Standard Deviation

Metrics MSE (SD) MAE (SD)

ETTm2

96 0.166 (0.000) 0.257 (0.001)
192 0.225 (0.001) 0.302 (0.003)
336 0.277 (0.002) 0.336 (0.002)
720 0.383 (0.007) 0.409 (0.006)

ECL

96 0.137 (0.000) 0.238 (0.000)
192 0.152 (0.000) 0.252 (0.000)
336 0.166 (0.000) 0.268 (0.000)
720 0.201 (0.000) 0.302 (0.000)

Exchange

96 0.081 (0.001) 0.205 (0.002)
192 0.151 (0.002) 0.284 (0.003)
336 0.314 (0.033) 0.412 (0.020)
720 0.856 (0.202) 0.663 (0.082)

Traffic

96 0.390 (0.001) 0.275 (0.001)
192 0.402 (0.000) 0.278 (0.000)
336 0.415 (0.000) 0.288 (0.001)
720 0.449 (0.000) 0.307 (0.000)

Weather

96 0.166 (0.001) 0.221 (0.002)
192 0.207 (0.000) 0.261 (0.000)
336 0.251 (0.000) 0.298 (0.001)
720 0.301 (0.001) 0.338 (0.001)

ILI

24 2.425 (0.058) 1.086 (0.027)
36 2.231 (0.087) 1.008 (0.011)
48 2.230 (0.144) 1.016 (0.037)
60 2.143 (0.032) 0.985 (0.016)
(a) Multivariate benchmark.
Metrics MSE (SD) MAE (SD)

ETTm2

96 0.065 (0.000) 0.186 (0.000)
192 0.096 (0.002) 0.234 (0.003)
336 0.138 (0.001) 0.285 (0.001)
720 0.186 (0.002) 0.338 (0.002)

Exchange

96 0.086 (0.000) 0.226 (0.000)
192 0.173 (0.004) 0.330 (0.003)
336 0.539 (0.066) 0.575 (0.027)
720 0.936 (0.222) 0.763 (0.075)
(b) Univariate benchmark.
Table 7: DeepTIMe main benchmark results with standard deviation. Experiments are performed over three runs.

Appendix F Random Fourier Features Scale Hyperparameter Sensitivity Analysis

Scale Hyperparam 0.01 0.1 1 5 10 20 50 100
Metrics MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE

ETTm2

96 0.216 0.300 0.189 0.285 0.173 0.268 0.168 0.262 0.166 0.260 0.165 0.258 0.165 0.259 0.164 0.257
192 0.275 0.340 0.264 0.333 0.239 0.317 0.225 0.301 0.225 0.303 0.224 0.302 0.224 0.304 0.220 0.301
336 0.340 0.375 0.319 0.371 0.292 0.351 0.275 0.337 0.277 0.336 0.282 0.345 0.278 0.342 0.280 0.344
720 0.424 0.430 0.405 0.420 0.381 0.412 0.364 0.392 0.375 0.408 0.410 0.430 0.396 0.423 0.406 0.429
Table 8: Results from hyperparameter sweep on the scale hyperparameter. Best scores are highlighted in bold, and worst scores are highlighted in bold red.

Appendix G Ablation Studies Details

In this section, we list more details on the models compared to in the ablation studies section. Unless otherwise stated, we perform the same hyperparameter tuning for all models in the ablation studies, and use the same standard hyperparameters such as number of layers, layer size, etc.

G.1 Ablation study on variants of DeepTIMe

RR

Removing the ridge regressor module refers to replacing it with a simple linear layer, $f(\tau) = W g_\phi(\tau) + b$, where $W \in \mathbb{R}^{d \times m}$ and $b \in \mathbb{R}^{m}$ are trained via gradient descent. This corresponds to a straightforward INR, which is trained across all lookback-horizon pairs in the dataset.

Local

For models marked “Local”, we similarly remove the ridge regressor module and replace it with a linear layer. However, the model is not trained across all lookback-horizon pairs in the dataset. Instead, for each lookback-horizon pair in the validation/test set, we fit the model to the lookback window via gradient descent, and then perform prediction on the horizon to obtain the forecasts. A new model is trained from scratch for each lookback-horizon window. We perform tuning on one extra hyperparameter, the number of epochs of gradient descent, which we select via a search over a small grid of values.

Datetime Features

As each dataset comes with a timestamp for each observation, we are able to construct datetime features from these timestamps. We construct the following features:

  1. Quarter-of-year

  2. Month-of-year

  3. Week-of-year

  4. Day-of-year

  5. Day-of-month

  6. Day-of-week

  7. Hour-of-day

  8. Minute-of-hour

  9. Second-of-minute

Each feature is initially an integer value, e.g. month-of-year can take on values in $\{1, 2, \ldots, 12\}$, which we subsequently normalize to a $[0, 1]$ range. Depending on the data sampling frequency, the appropriate features can be chosen. For the ETTm2 dataset, we used all features except second-of-minute, since it is sampled at a 15 minute frequency.
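A sketch of how such normalized datetime features could be constructed from a timestamp index with pandas; the per-feature normalization constants are assumptions for illustration.

import pandas as pd

def datetime_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    # Integer calendar features, each scaled to roughly a [0, 1] range.
    return pd.DataFrame({
        "quarter_of_year": index.quarter / 4,
        "month_of_year": index.month / 12,
        "week_of_year": index.isocalendar().week.to_numpy() / 53,
        "day_of_year": index.dayofyear / 366,
        "day_of_month": index.day / 31,
        "day_of_week": (index.dayofweek + 1) / 7,
        "hour_of_day": index.hour / 23,
        "minute_of_hour": index.minute / 59,
    }, index=index)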

G.2 Ablation study on backbone models

For all models in this section, we retain the differentiable closed-form ridge regressor, to identify the effects of the backbone model used.

MLP

The random Fourier features layer is a mapping from the coordinate space $\mathbb{R}^c$ to the latent space $\mathbb{R}^d$. To remove the effects of the random Fourier features layer, we simply replace it with a linear map, $\mathbb{R}^c \to \mathbb{R}^d$.

SIREN

We replace the random Fourier features backbone with the SIREN model introduced by Sitzmann et al. [2020b]. In this model, periodic activation functions are used, i.e. $x \mapsto \sin(\omega_0 x)$, along with a specified weight initialization scheme.

RNN

We use a 2-layer LSTM with a hidden size of 256. Inputs are the observations, $y_t$, fed in an auto-regressive fashion, predicting the value at the next time step, $y_{t+1}$.

Appendix H Computational Efficiency Experiments Details

Trans/In/Auto/ETS-former

We use a model with 2 encoder and 2 decoder layers with a hidden size of 512, as specified in their original papers.

N-BEATS

We use an N-BEATS model with 3 stacks and 3 layers (relatively small compared to the 30 stacks and 4 layers used in their original paper, https://github.com/ElementAI/N-BEATS/blob/master/experiments/electricity/generic.gin), with a hidden size of 512. Note that N-BEATS is a univariate model, and values presented here are multiplied by a factor of $m$, the number of time-series dimensions, to account for the multivariate data. Another dimension of comparison is the number of parameters used in each model. As demonstrated in Table 9, the number of parameters of fully connected models like N-BEATS scales linearly with the lookback window and forecast horizon lengths, while for Transformer-based models and DeepTIMe, the number of parameters remains constant.

N-HiTS

We use an N-HiTS model with hyperparameters as suggested in their original paper (3 stacks, 1 block in each stack, 2 MLP layers, 512 hidden size). The remaining hyperparameters, which were not specified and are subject to hyperparameter tuning, namely the pooling kernel sizes and the number of stack coefficients, are set to fixed values. Similar to N-BEATS, N-HiTS is a univariate model, and values were multiplied by a factor of $m$ to account for the multivariate data.

Methods Autoformer N-HiTS DeepTIMe

Lookback

48 10,535,943 927,942 1,314,561
96 10,535,943 1,038,678 1,314,561
168 10,535,943 1,204,782 1,314,561
336 10,535,943 1,592,358 1,314,561
720 10,535,943 2,478,246 1,314,561
1440 10,535,943 4,139,286 1,314,561
2880 10,535,943 7,461,366 1,314,561
5760 10,535,943 14,105,526 1,314,561

Horizon

48 10,535,943 927,942 1,314,561
96 10,535,943 955,644 1,314,561
168 10,535,943 997,197 1,314,561
336 10,535,943 1,094,154 1,314,561
720 10,535,943 1,315,770 1,314,561
1440 10,535,943 1,731,300 1,314,561
2880 10,535,943 2,562,360 1,314,561
5760 10,535,943 4,224,480 1,314,561
Table 9: Number of parameters in each model across various lookback window and forecast horizon lengths. The models were instantiated for the ETTm2 multivariate dataset (this affects the embedding and projection layers in Autoformer). Values for N-HiTS in this table are not multiplied by $m$ since it is a global model (i.e. a single univariate model is used for all dimensions of the time-series).

Appendix I Generalization Bound for our Meta-Learning Framework

In this section, we derive a meta-learning generalization bound for DeepTIMe under the PAC-Bayes framework [Shalev-Shwartz and Ben-David, 2014]. Our formulation follows [Amit and Meir, 2018] and assumes that all tasks share the same hypothesis space $\mathcal{H}$, sample space $\mathcal{Z}$ and loss function $\ell$. We observe $n$ tasks in the form of sample sets $S_1, \ldots, S_n$. The number of samples in each task is $M$. Each dataset $S_i$ is assumed to be generated i.i.d. from an unknown sample distribution $\mathcal{D}_i$. Each task’s sample distribution is itself generated i.i.d. from an unknown meta distribution, $\mathcal{T}$. In particular, each sample is a pair $z = (\tau, y) \in \mathcal{Z}$, where $\tau$ is the time coordinate, and $y$ is the time-series value. For any forecaster $h \in \mathcal{H}$ parameterized by $\theta$, we define the loss function $\ell(h, z)$. We also define $P$ as the prior distribution over $\mathcal{H}$, and $Q_i$ as the posterior over $\mathcal{H}$ for each task. In the meta-learning setting, we assume a hyper-prior $\mathcal{P}$, which is a prior distribution over priors; the learner observes a sequence of training tasks, and then outputs a distribution over priors, called the hyper-posterior $\mathcal{Q}$.

Theorem I.1.

Consider the meta-learning framework described above, given the hyper-prior $\mathcal{P}$. Then for any hyper-posterior $\mathcal{Q}$, any $c_1, c_2 > 0$, and any $\delta \in (0, 1]$, with probability at least $1 - \delta$ we have,

(5)
Proof.

Our proof contains two steps. First, we bound the error within observed tasks due to observing a limited number of samples. Then we bound the error at the task environment level due to observing a finite number of tasks. Both steps utilize Catoni's classical PAC-Bayes bound [Catoni, 2007] to measure the error. We state Catoni's classical PAC-Bayes bound below.

Theorem I.2.

(Catoni’s bound [Catoni, 2007]) Let $\mathcal{Z}$ be a sample space, $\mathcal{D}$ a distribution over $\mathcal{Z}$, and $\mathcal{H}$ a hypothesis space. Given a loss function $\ell(h, z): \mathcal{H} \times \mathcal{Z} \to [0, 1]$ and a collection of $M$ i.i.d. random variables $(z_1, \ldots, z_M)$ sampled from $\mathcal{D}$, let $\pi$ be a prior distribution over the hypothesis space. Then, for any $\delta \in (0, 1]$ and any real number $c > 0$, the following bound holds uniformly for all posterior distributions $\rho$ over the hypothesis space,

We first utilize Theorem I.2 to bound the generalization error in each of the observed tasks. Let $i$ be the index of a task; we have the definitions of the expected error and the empirical error as follows,

(6)
(7)

Then, according to Theorem I.2, for any $\delta_i \in (0, 1]$ and $c_1 > 0$, we have

(8)

Next, we bound the error due to observing a limited number of tasks from the environment. Similarly, we have the definition of expected task error as follows

(9)

Then we have the definition of error across the tasks,

(10)

Then, by Theorem I.2, for any $\delta_0 \in (0, 1]$ and $c_2 > 0$, we have

(11)

Finally, by employing the union bound, we can bound the probability of the intersection of the events in Equation 11 and Equation 8. For any $\delta \in (0, 1]$, we set $\delta_0$ and $\delta_i$ (for $i = 1, \ldots, n$) such that they sum to $\delta$, giving

(12)

Theorem I.1 shows that the expected task generalization error is bounded by the empirical multi-task error plus two complexity terms. The first term represents the complexity of the environment, or equivalently, the time-series dataset, converging to zero if we observe an infinitely long time-series (i.e. as the number of tasks $n \to \infty$). The second term represents the complexity of the observed tasks, or equivalently, the lookback-horizon windows. This converges to zero when there is a sufficient number of time steps in each window ($M \to \infty$).