DeepTIMe
PyTorch code for DeepTIMe: Deep Time-Index Meta-Learning for Non-Stationary Time-Series Forecasting
Deep learning has been actively applied to time-series forecasting, leading to a deluge of new autoregressive model architectures. Yet, despite the attractive properties of time-index based models, such as being a continuous signal function over time leading to smooth representations, little attention has been given to them. Indeed, while naive deep time-index based models are far more expressive than the manually predefined function representations of classical time-index based models, they are inadequate for forecasting due to the lack of inductive biases and the non-stationarity of time-series. In this paper, we propose DeepTIMe, a deep time-index based model trained via a meta-learning formulation which overcomes these limitations, yielding an efficient and accurate forecasting model. Extensive experiments on real-world datasets demonstrate that our approach achieves competitive results with state-of-the-art methods, and is highly efficient. Code is available at https://github.com/salesforce/DeepTIMe.
Time-series forecasting has important applications across business and scientific domains, such as demand forecasting Carbonneau et al. (2008), capacity planning and management Kim (2003), electricity pricing Cuaresma et al. (2004), and anomaly detection
Laptev et al. (2017). This has led to interest in developing more powerful methods for forecasting, including the recent increase in attention towards neural forecasting methods Benidis et al. (2020). There are two broad approaches to time-series forecasting – autoregressive models, and time-index based models. Autoregressive models Salinas et al. (2020); Yang et al. (2019) consider the forecast to be a function of the previous values of the time-series, while time-index based models, also known as time-series regression models, map a time-index, and optionally datetime features, to the value of the time-series at that time step. Many recent neural forecasting approaches, such as the family of fully connected architectures Challu et al. (2022); Olivares et al. (2021); Oreshkin et al. (2020, 2021), and Transformer-based architectures Zhou et al. (2022); Woo et al. (2022); Xu et al. (2021); Zhou et al. (2021), belong to the autoregressive approach. Yet, while time-index based approaches have the attractive property of being viewed as a continuous signal function over time, leading to signal representations which change smoothly and correlate with each other in continuous space, they have been underexplored from a deep learning perspective. In the following section, we explore the limitations of classical time-index based approaches, and highlight how deep time-index models can overcome these limitations. At the same time, deep time-index models face their own set of limitations. We propose a meta-learning formulation of the standard time-series forecasting problem that can resolve these limitations.

Classical time-index based methods Taylor and Letham (2018); Hyndman and Athanasopoulos (2018); Ord et al. (2017) rely on predefined parametric representation functions to generate predictions, optionally following a structural time-series model formulation Harvey and Shephard (1993), $g(\tau) = T(\tau) + S(\tau) + H(\tau) + \varepsilon_\tau$, where $T(\tau)$, $S(\tau)$, and $H(\tau)$, all functions of a time-index $\tau$, represent trend, periodic, and holiday components respectively, and $\varepsilon_\tau$ represents idiosyncratic changes not accounted for by the model. For example, the trend component could be predefined as a linear, polynomial, or even a piecewise linear function. While these functions are simple and easy to learn, they have limited capacity and are unable to fit more complex time-series. Furthermore, predefining the representation function is a strong assumption which may not hold across different application domains, as it is only effective when the data distribution follows this predefined function – we train on historical data and expect extrapolation into future time steps to hold. While this may be true for a short horizon, this assumption most likely collapses when dealing with long sequence time-series forecasting. Finally, while it is possible to perform model selection across various representation functions and parameters such as changepoints and seasonality, this requires either strong domain expertise or computationally heavy cross-validation across a large set of parameters.
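As a concrete illustration of such a predefined representation function (a minimal sketch, not taken from any cited method; the period, number of harmonics, and toy series are illustrative assumptions), a linear trend plus Fourier seasonal terms can be fitted by ordinary least squares and then extrapolated into future time steps:

```python
import numpy as np

def design_matrix(t, period=24.0, n_harmonics=2):
    """Classical predefined representation: intercept + linear trend
    + Fourier seasonal terms (harmonic regression)."""
    cols = [np.ones_like(t), t]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.stack(cols, axis=1)

# toy series: linear trend + one daily cycle
t_train = np.arange(0.0, 96.0)              # 4 "days" of hourly history
y_train = 0.1 * t_train + np.sin(2 * np.pi * t_train / 24.0)

X = design_matrix(t_train)
coef, *_ = np.linalg.lstsq(X, y_train, rcond=None)  # fit by least squares

t_future = np.arange(96.0, 120.0)           # extrapolate one day ahead
forecast = design_matrix(t_future) @ coef
```

Extrapolation works here only because the data exactly follows the predefined form – which is precisely the strong assumption criticized above.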
Deep learning gives a natural solution to this problem faced by classical time-index based models – parameterize the representation function as a neural network, and learn it directly from data. Neural networks have been shown to be extremely expressive representation functions with a strong capability to approximate complex functions. However, being too expressive a representation function brings about the first limitation. Time-index based models rely on the assumptions of the representation function to ensure that extrapolation beyond the range of historical training data, into future time steps, yields accurate forecasts. Being such an expressive representation function, deep time-index models have no inductive bias to perform well as a forecasting model. This is shown in Figure 1. The second limitation arises due to the non-stationarity of time-series. Time-series data, especially in long sequence time-series forecasting, are typically non-stationary – their distributions change over time. While the standard supervised learning setting assumes the validation and test set to come from the same distribution as the training set, this may not be the case for non-stationary time-series
Kim et al. (2021); Arik et al. (2022), leading to a degradation in predictive performance. To this end, we propose DeepTIMe, a deep time-index based model which leverages a meta-learning formulation to: (i) learn an appropriate function representation from data by optimizing an extrapolation loss, only possible due to our meta-learning formulation (only a reconstruction loss is available in the standard setting), and (ii) learn a global meta-model shared across tasks which performs adaptation on a locally stationary distribution. We further leverage Implicit Neural Representations (INRs) Sitzmann et al. (2020b); Mildenhall et al. (2020) as our choice of deep time-index model, a random Fourier features layer Tancik et al. (2020) to ensure that we are able to learn the high frequency information present in time-series data, and a closed-form ridge regressor Bertinetto et al. (2019) to efficiently tackle the meta-learning formulation. We conduct extensive experiments on both synthetic and real-world datasets, showing that DeepTIMe has extremely competitive performance, achieving state-of-the-art results on 20 out of 24 settings for the multivariate forecasting benchmark based on MSE. We perform ablation studies to better understand the contribution of each component of DeepTIMe, and finally show that it is highly efficient in terms of runtime and memory.
In long sequence time-series forecasting, we consider a time-series dataset $(y_1, \ldots, y_T)$, where $y_t \in \mathbb{R}^m$ is the $m$-dimensional observation at time $t$. Given a lookback window of length $L$, we aim to construct a point forecast over a horizon of length $H$ by learning a model which minimizes some loss function.

In the following, we first describe how to cast the standard time-series forecasting problem as a meta-learning problem for time-index based models, which endows DeepTIMe with the ability to extrapolate over the forecast horizon, i.e. perform forecasting. We emphasize that this reformulation falls within the standard time-series forecasting problem and requires no extra information. Next, we further elaborate on our proposed model architecture, and how it uses a differentiable closed-form ridge regression module to efficiently tackle forecasting as a meta-learning problem. Pseudocode of DeepTIMe is available in Appendix A.

To formulate time-series forecasting as a meta-learning problem, we treat each lookback window and forecast horizon pair as a task. Specifically, the lookback window is treated as the support set $\mathcal{S}_i$, the forecast horizon as the query set $\mathcal{Q}_i$, and each time coordinate and time-series value pair, $(\tau_j, y_j)$, is an input-output sample, where $\tau_j$ is a normalized time-index. The forecasting model $f$ is then parameterized by $\phi$ and $\theta$, the meta and base parameters respectively, and the bilevel optimization problem can be formalized as:
$$\phi^{*} = \arg\min_{\phi} \sum_{i} \sum_{(\tau_j, y_j) \in \mathcal{Q}_i} \big\lVert f(\tau_j;\, \theta_i^{*}(\phi), \phi) - y_j \big\rVert^2 \qquad (1)$$

$$\text{s.t.}\quad \theta_i^{*}(\phi) = \arg\min_{\theta} \sum_{(\tau_j, y_j) \in \mathcal{S}_i} \big\lVert f(\tau_j;\, \theta, \phi) - y_j \big\rVert^2 \qquad (2)$$
Here, the outer summation in Equation 1 over index $i$ represents each lookback-horizon window, corresponding to each task in meta-learning, and the inner summation over index $j$ represents each sample in the query set, or equivalently, each time step in the forecast horizon. The summation in Equation 2 over index $j$ represents each sample in the support set, or each time step in the lookback window. This is illustrated in Figure 1(a).
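The task construction described above can be sketched as follows (a minimal sketch; the normalization $\tau_j = j/(L+H)$ and the helper name are illustrative assumptions):

```python
import numpy as np

def make_task(window, lookback_len):
    """Split one lookback-horizon window into support and query sets of
    (normalized time-index, value) pairs, as in the meta-learning view:
    the lookback window is the support set, the horizon the query set."""
    total = len(window)
    tau = np.arange(total) / total            # normalized time-index in [0, 1)
    support = (tau[:lookback_len], window[:lookback_len])   # lookback window
    query = (tau[lookback_len:], window[lookback_len:])     # forecast horizon
    return support, query

# one window of length L + H = 36 + 12
y = np.sin(np.linspace(0, 4 * np.pi, 48))
(support_x, support_y), (query_x, query_y) = make_task(y, lookback_len=36)
```

Each sliding window of the training series yields one such task, so no information beyond the standard forecasting setup is required.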
To understand how our meta-learning formulation helps to learn an appropriate function representation from data, we examine how the meta-learning process restricts the hypothesis class of the model $f$. The original hypothesis class of our model, or function representation, is too large and provides no guarantees that training on the lookback window leads to good extrapolation. The meta-learning formulation allows DeepTIMe to restrict the hypothesis class of the representation function, from the space of all $K$-layered INRs, to the space of $K$-layered INRs conditioned on the optimal meta parameters $\phi^*$, where $\phi^*$ is the minimizer of a forecasting loss (as specified in Equation 1). Given this hypothesis class, local adaptation is performed over the base parameters $\theta$ given the lookback window, which is assumed to come from a locally stationary distribution, resolving the issue of non-stationarity.
The class of deep models which map coordinates to the value at that coordinate using a stack of multilayer perceptrons (MLPs) is known as INRs Sitzmann et al. (2020b); Tancik et al. (2020); Mildenhall et al. (2020). We make use of them as they are a natural fit for time-index based models, mapping a time-index to the value of the time-series at that time-index. A $K$-layered ReLU Nair and Hinton (2010) INR is a function of the following form:

$$f(\tau) = W^{(K)} z^{(K-1)} + b^{(K)}, \qquad z^{(0)} = \tau, \qquad z^{(k)} = \max\big(0,\; W^{(k)} z^{(k-1)} + b^{(k)}\big),\; k = 1, \ldots, K-1 \qquad (3)$$
where $\tau \in \mathbb{R}^c$ is the time-index. Note that $c = 1$ for our proposed approach as specified in Section 2.1, but we use this notation to allow for generalization to cases where datetime features are included. Tancik et al. (2020) introduced a random Fourier features layer which allows INRs to fit high frequency functions, by modifying the first layer to $z^{(0)} = \gamma(\tau) = [\sin(2\pi B\tau), \cos(2\pi B\tau)]^{\top}$, where each entry in $B \in \mathbb{R}^{d/2 \times c}$ is sampled from $\mathcal{N}(0, \sigma^2)$, with $d$ the hidden dimension size of the INR, $\sigma$ the scale hyperparameter, and $[\cdot, \cdot]$ a row-wise stacking operation. While the random Fourier features layer endows INRs with the ability to learn high frequency patterns, one major drawback is the need to perform a hyperparameter sweep for each task and dataset to avoid over- or underfitting. We overcome this limitation with a simple scheme of concatenating multiple Fourier basis functions with diverse scale parameters, i.e. $\gamma(\tau) = [\sin(2\pi B_1\tau), \cos(2\pi B_1\tau), \ldots, \sin(2\pi B_S\tau), \cos(2\pi B_S\tau)]^{\top}$, where the entries of each $B_s$ are sampled from $\mathcal{N}(0, \sigma_s^2)$ for a set of scales $\sigma_1, \ldots, \sigma_S$. We perform an analysis in Appendix F and show that the performance of our proposed Concatenated Fourier Features (CFF) does not significantly deviate from the setting with the optimal scale parameter obtained from a hyperparameter sweep.
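A sketch of a CFF layer feeding a ReLU INR in PyTorch (the scale set, layer sizes, and class names are illustrative assumptions, not the paper's configuration):

```python
import math
import torch
import torch.nn as nn

class ConcatFourierFeatures(nn.Module):
    """Concatenated Fourier Features: random Fourier mappings at several
    scales, concatenated, so no single scale has to be tuned per dataset."""
    def __init__(self, in_dim, out_dim, scales=(0.01, 0.1, 1.0, 10.0)):
        super().__init__()
        assert out_dim % (2 * len(scales)) == 0
        d = out_dim // (2 * len(scales))  # dims per scale, divided equally
        self.bases = nn.ParameterList(
            nn.Parameter(torch.randn(in_dim, d) * s, requires_grad=False)
            for s in scales  # fixed random matrices B_s with std sigma_s
        )

    def forward(self, tau):  # tau: (..., in_dim)
        feats = []
        for B in self.bases:
            proj = 2 * math.pi * (tau @ B)
            feats += [torch.sin(proj), torch.cos(proj)]
        return torch.cat(feats, dim=-1)

class INR(nn.Module):
    """ReLU MLP over Fourier-featurized time-indices; the final linear
    layer plays the role of the base learner adapted per task."""
    def __init__(self, hidden=64, layers=3, out_dim=1):
        super().__init__()
        mods = [ConcatFourierFeatures(1, hidden)]
        for _ in range(layers - 1):
            mods += [nn.Linear(hidden, hidden), nn.ReLU()]
        self.features = nn.Sequential(*mods)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, tau):
        return self.head(self.features(tau))
```

Because every scale contributes a slice of the feature vector, the downstream layers can weight whichever frequency band fits the data, instead of committing to one scale up front.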
One key aspect of tackling forecasting as a meta-learning problem is efficiency. Optimization-based meta-learning approaches originally perform an expensive bilevel optimization procedure on the entire neural network model by backpropagating through inner gradient steps Ravi and Larochelle (2017); Finn et al. (2017). Since each forecast is now treated as an inner loop optimization problem, it needs to be sufficiently fast to be competitive with competing methods. We achieve this by leveraging a ridge regression closed-form solver Bertinetto et al. (2019) on top of an INR, illustrated in Figure 1(b). The ridge regression closed-form solver restricts the inner loop to only apply to the last layer of the model, allowing for either a closed-form solution or a differentiable solver to replace the inner gradient steps. This means that for a $K$-layered model, $\phi = \{W^{(k)}, b^{(k)}\}_{k=1}^{K-1}$ are the meta parameters and $\theta = \{W^{(K)}\}$ are the base parameters, following the notation of Equation 3. Then let $g_\phi$ be the meta learner, where $g_\phi(\tau) = z^{(K-1)}$. For task $i$ with the corresponding lookback-horizon pair, the support set features obtained from the meta learner are denoted $Z_i = [g_\phi(\tau_1), \ldots, g_\phi(\tau_L)]$, where $[\cdot, \ldots, \cdot]$ is a column-wise concatenation operation. The inner loop thus solves the optimization problem:

$$W_i^{(K)*} = \arg\min_{W} \big\lVert Z_i^{\top} W - Y_i \big\rVert^2 + \lambda \lVert W \rVert^2 \qquad (4)$$
Now, let $\tilde{Z}_i$ be the query set features. Then, our predictions are $\hat{Y}_i = \tilde{Z}_i^{\top} W_i^{(K)*} = \tilde{Z}_i^{\top} (Z_i Z_i^{\top} + \lambda I)^{-1} Z_i Y_i$. This closed-form solution is differentiable, which enables gradient updates on the parameters of the meta learner $\phi$. A bias term can be included for the closed-form ridge regressor by appending a scalar 1 to the feature vector. The model obtained by DeepTIMe is ultimately the restricted hypothesis class conditioned on the optimal meta parameters $\phi^*$.

We evaluate DeepTIMe on both synthetic datasets and a variety of real-world data. We ask the following questions: (i) Is DeepTIMe, trained on a family of functions following the same parametric form, able to perform extrapolation on unseen functions? (ii) How does DeepTIMe compare to other forecasting models on real-world data? (iii) What are the key contributing factors to the good performance of DeepTIMe?
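The closed-form ridge inner loop described in the method above can be sketched as follows (a hedged sketch under illustrative names and shapes; the bias is handled by appending a constant feature, as stated earlier):

```python
import torch

def ridge_solve(Z_support, y_support, Z_query, lam=0.1):
    """Differentiable closed-form ridge regression inner loop.
    Z_support: (L, d) features of the lookback window (support set),
    y_support: (L, m) target values,
    Z_query:   (H, d) features over the forecast horizon (query set)."""
    d = Z_support.shape[-1]
    # append a constant feature so the solver also fits a bias term
    Z_s = torch.cat([Z_support, torch.ones_like(Z_support[:, :1])], dim=-1)
    Z_q = torch.cat([Z_query, torch.ones_like(Z_query[:, :1])], dim=-1)
    A = Z_s.T @ Z_s + lam * torch.eye(d + 1)      # small (d+1, d+1) system
    W = torch.linalg.solve(A, Z_s.T @ y_support)  # closed-form minimizer
    return Z_q @ W                                # predictions over the horizon
```

Because `torch.linalg.solve` is differentiable, gradients flow through the inner-loop solution back to the feature extractor, replacing unrolled inner gradient steps.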
We first consider DeepTIMe’s ability to extrapolate on the following functions specified by some parametric form: (i) the family of linear functions, $y = ax + b$, (ii) the family of cubic functions, $y = ax^3 + bx^2 + cx + d$, and (iii) sums of sinusoids, $y = \sum_j A_j \sin(\omega_j x + \varphi_j)$. Parameters of the functions are sampled randomly (further details in Appendix B) to construct distinct tasks. A total of 400 time steps are sampled, with a lookback window length of 200 and forecast horizon of 200. Figure 3 demonstrates that DeepTIMe is able to perform extrapolation on unseen test functions/tasks after being trained via our meta-learning formulation. It demonstrates an ability to approximate and adapt, based on the lookback window, to linear and cubic polynomials, and even sums of sinusoids. Next, we evaluate DeepTIMe on real-world datasets against state-of-the-art forecasting baselines.
ETT (https://github.com/zhouhaoyi/ETDataset) Zhou et al. (2021) – Electricity Transformer Temperature provides measurements from an electricity transformer such as load and oil temperature. We use the ETTm2 subset, consisting of measurements at a 15-minute frequency. ECL (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) – Electricity Consuming Load provides measurements of electricity consumption for 321 households from 2012 to 2014. The data was collected at the 15-minute level, but is aggregated hourly. Exchange (https://github.com/laiguokun/multivariate-time-series-data) Lai et al. (2018) – a collection of daily exchange rates with USD of eight countries (Australia, United Kingdom, Canada, Switzerland, China, Japan, New Zealand, and Singapore) from 1990 to 2016. Traffic (https://pems.dot.ca.gov/) – dataset from the California Department of Transportation providing the hourly road occupancy rates from 862 sensors in San Francisco Bay area freeways. Weather (https://www.bgc-jena.mpg.de/wetter/) – provides measurements of 21 meteorological indicators such as air temperature, humidity, etc., every 10 minutes for the year of 2020, from the Weather Station of the Max Planck Institute for Biogeochemistry in Jena, Germany. ILI (https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html) – Influenza-like Illness measures the weekly ratio of patients seen with ILI to the total number of patients, obtained by the Centers for Disease Control and Prevention of the United States between 2002 and 2021.
We evaluate the performance of our proposed approach using two metrics, mean squared error (MSE) and mean absolute error (MAE). The datasets are split into train, validation, and test sets chronologically, following a 70/10/20 split for all datasets except for ETTm2, which follows a 60/20/20 split, as per convention. The univariate benchmark selects the last index of the multivariate dataset as the target variable, following previous work Xu et al. (2021); Woo et al. (2022); Challu et al. (2022); Zhou et al. (2022). Preprocessing on the data is performed by standardization based on train set statistics. Hyperparameter selection is performed on only one value, the lookback length multiplier, which decides the length of the lookback window. We search through a small set of candidate values, and select the best value based on the validation loss. Further implementation details on DeepTIMe are reported in Appendix C, and detailed hyperparameters are reported in Appendix D. Reported results for DeepTIMe are averaged over three runs, and standard deviations are reported in Appendix E.

Table 1. Multivariate forecasting benchmark.

Methods  DeepTIMe  NHiTS  ETSformer  Fedformer  Autoformer  Informer  LogTrans  Reformer

Metrics  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  
ETTm2 
96  0.166  0.257  0.176  0.255  0.189  0.280  0.203  0.287  0.255  0.339  0.365  0.453  0.768  0.642  0.658  0.619 
192  0.225  0.302  0.245  0.305  0.253  0.319  0.269  0.328  0.281  0.340  0.533  0.563  0.989  0.757  1.078  0.827  
336  0.277  0.336  0.295  0.346  0.314  0.357  0.325  0.366  0.339  0.372  1.363  0.887  1.334  0.872  1.549  0.972  
720  0.383  0.409  0.401  0.426  0.414  0.413  0.421  0.415  0.422  0.419  3.379  1.388  3.048  1.328  2.631  1.242  
ECL 
96  0.137  0.238  0.147  0.249  0.187  0.304  0.183  0.297  0.201  0.317  0.274  0.368  0.258  0.357  0.312  0.402 
192  0.152  0.252  0.167  0.269  0.199  0.315  0.195  0.308  0.222  0.334  0.296  0.386  0.266  0.368  0.348  0.433  
336  0.166  0.268  0.186  0.290  0.212  0.329  0.212  0.313  0.231  0.338  0.300  0.394  0.280  0.380  0.350  0.433  
720  0.201  0.302  0.243  0.340  0.233  0.345  0.231  0.343  0.254  0.361  0.373  0.439  0.283  0.376  0.340  0.420  
Exchange 
96  0.081  0.205  0.092  0.211  0.085  0.204  0.139  0.276  0.197  0.323  0.847  0.752  0.968  0.812  1.065  0.829 
192  0.151  0.284  0.208  0.322  0.182  0.303  0.256  0.369  0.300  0.369  1.204  0.895  1.040  0.851  1.188  0.906  
336  0.314  0.412  0.371  0.443  0.348  0.428  0.426  0.464  0.509  0.524  1.672  1.036  1.659  1.081  1.357  0.976  
720  0.856  0.663  0.888  0.723  1.025  0.774  1.090  0.800  1.447  0.941  2.478  1.310  1.941  1.127  1.510  1.016  
Traffic 
96  0.390  0.275  0.402  0.282  0.607  0.392  0.562  0.349  0.613  0.388  0.719  0.391  0.684  0.384  0.732  0.423 
192  0.402  0.278  0.420  0.297  0.621  0.399  0.562  0.346  0.616  0.382  0.696  0.379  0.685  0.390  0.733  0.420  
336  0.415  0.288  0.448  0.313  0.622  0.396  0.570  0.323  0.622  0.337  0.777  0.420  0.733  0.408  0.742  0.420  
720  0.449  0.307  0.539  0.353  0.632  0.396  0.596  0.368  0.660  0.408  0.864  0.472  0.717  0.396  0.755  0.423  
Weather 
96  0.166  0.221  0.158  0.195  0.197  0.281  0.217  0.296  0.266  0.336  0.300  0.384  0.458  0.490  0.689  0.596 
192  0.207  0.261  0.211  0.247  0.237  0.312  0.276  0.336  0.307  0.367  0.598  0.544  0.658  0.589  0.752  0.638  
336  0.251  0.298  0.274  0.300  0.298  0.353  0.339  0.380  0.359  0.359  0.578  0.523  0.797  0.652  0.639  0.596  
720  0.301  0.338  0.351  0.353  0.352  0.388  0.403  0.428  0.419  0.419  1.059  0.741  0.869  0.675  1.130  0.792  
ILI 
24  2.425  1.086  1.862  0.869  2.527  1.020  2.203  0.963  3.483  1.287  5.764  1.677  4.480  1.444  4.400  1.382 
36  2.231  1.008  2.071  0.969  2.615  1.007  2.272  0.976  3.103  1.148  4.755  1.467  4.799  1.467  4.783  1.448  
48  2.230  1.016  2.346  1.042  2.359  0.972  2.209  0.981  2.669  1.085  4.763  1.469  4.800  1.468  4.832  1.465  
60  2.143  0.985  2.560  1.073  2.487  1.016  2.545  1.061  2.770  1.125  5.264  1.564  5.278  1.560  4.882  1.483 
Table 2. Univariate forecasting benchmark.

Methods  DeepTIMe  NHiTS  ETSformer  Fedformer  Autoformer  Informer  NBEATS  DeepAR  Prophet  ARIMA

Metrics  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  
ETTm2 
96  0.065  0.186  0.066  0.185  0.080  0.212  0.063  0.189  0.065  0.189  0.088  0.225  0.082  0.219  0.099  0.237  0.287  0.456  0.211  0.362 
192  0.096  0.234  0.087  0.223  0.150  0.302  0.102  0.245  0.118  0.256  0.132  0.283  0.120  0.268  0.154  0.310  0.312  0.483  0.261  0.406  
336  0.138  0.285  0.106  0.251  0.175  0.334  0.130  0.279  0.154  0.305  0.180  0.336  0.226  0.370  0.277  0.428  0.331  0.474  0.317  0.448  
720  0.186  0.338  0.157  0.312  0.224  0.379  0.178  0.325  0.182  0.335  0.300  0.435  0.188  0.338  0.332  0.468  0.534  0.593  0.366  0.487  
Exchange 
96  0.086  0.226  0.093  0.223  0.099  0.230  0.131  0.284  0.241  0.299  0.591  0.615  0.156  0.299  0.417  0.515  0.828  0.762  0.112  0.245 
192  0.173  0.330  0.230  0.313  0.223  0.353  0.277  0.420  0.273  0.665  1.183  0.912  0.669  0.665  0.813  0.735  0.909  0.974  0.304  0.404  
336  0.539  0.575  0.370  0.486  0.421  0.497  0.426  0.511  0.508  0.605  1.367  0.984  0.611  0.605  1.331  0.962  1.304  0.988  0.736  0.598  
720  0.936  0.763  0.728  0.569  1.114  0.807  1.162  0.832  0.991  0.860  1.872  1.072  1.111  0.860  1.890  1.181  3.238  1.566  1.871  0.935 
We compare DeepTIMe to the following baselines for the multivariate setting: NHiTS Challu et al. (2022), ETSformer Woo et al. (2022), Fedformer Zhou et al. (2022) (we report the best score for each setting from the two variants they present), Autoformer Xu et al. (2021), Informer Zhou et al. (2021), LogTrans Li et al. (2019), and Reformer Kitaev et al. (2020). For the univariate setting, we include additional univariate forecasting models: NBEATS Oreshkin et al. (2020), DeepAR Salinas et al. (2020), Prophet Taylor and Letham (2018), and ARIMA. We obtain the baseline results from the following papers: Challu et al. (2022); Woo et al. (2022); Zhou et al. (2022); Xu et al. (2021). Table 1 and Table 2 summarize the multivariate and univariate forecasting results respectively. DeepTIMe achieves state-of-the-art performance on 20 out of 24 settings in MSE, and 17 out of 24 settings in MAE on the multivariate benchmark, and also achieves competitive results on the univariate benchmark despite its simple architecture, compared to baselines comprising complex fully connected architectures and computationally intensive Transformer architectures.
Table 3. Ablation study on training schemes and input features.

Methods  DeepTIMe  + Datetime  − RR  − RR + Datetime  + Local  + Local + Datetime
Metrics  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  
ETTm2 
96  0.166  0.257  0.226  0.303  3.072  1.345  3.393  1.400  0.251  0.331  0.250  0.327 
192  0.225  0.302  0.309  0.362  3.064  1.343  3.269  1.381  0.322  0.371  0.323  0.366  
336  0.277  0.336  0.341  0.381  2.920  1.309  3.442  1.401  0.370  0.412  0.367  0.396  
720  0.383  0.409  0.453  0.447  2.773  1.273  3.400  1.399  0.443  0.449  0.455  0.461 
Table 4. Ablation study on backbone architectures.

Methods  DeepTIMe  MLP  SIREN  RNN

Metrics  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  
ETTm2  96  0.166  0.257  0.186  0.284  0.236  0.325  0.233  0.324 
192  0.225  0.302  0.265  0.338  0.295  0.361  0.275  0.337  
336  0.277  0.336  0.316  0.372  0.327  0.386  0.344  0.383  
720  0.383  0.409  0.401  0.417  0.438  0.453  0.431  0.432 
RNN refers to an autoregressive recurrent neural network (inputs are the time-series values rather than time-indices). All approaches include the differentiable closed-form ridge regressor. Further model details can be found in Section G.2.

Table 5. CFF compared against the optimal and pessimal scale hyperparameters.

CFF  Optimal Scale (% change)  Pessimal Scale (% change)

Metrics  MSE  MAE  MSE  MAE  MSE  MAE  
ETTm2 
96  0.166  0.257  0.164 (−1.20%)  0.257 (−0.05%)  0.216 (+23.22%)  0.300 (+14.22%)
192  0.225  0.302  0.220 (−1.87%)  0.301 (−0.25%)  0.275 (+18.36%)  0.340 (+11.25%)
336  0.277  0.336  0.275 (−0.70%)  0.336 (−0.22%)  0.340 (+18.68%)  0.375 (+10.57%)
720  0.383  0.409  0.364 (−5.29%)  0.392 (−4.48%)  0.424 (+9.67%)  0.430 (+4.95%)
We perform an ablation study to understand how various training schemes and input features affect the performance of DeepTIMe. Table 3 presents these results. First, we observe that our meta-learning formulation is a critical component to the success of DeepTIMe. We note that DeepTIMe without meta-learning may not be a meaningful baseline, since the model outputs are always the same regardless of the input lookback window. Including datetime features helps alleviate this issue, yet we observe that the inclusion of datetime features generally leads to a degradation in performance. In the case of DeepTIMe, we observed that the inclusion of datetime features leads to a much lower training loss but a degradation in test performance – this is a case of meta-learning memorization Yin et al. (2020) due to the tasks becoming non-mutually exclusive Rajendran et al. (2020). Finally, we observe that the meta-learning formulation is indeed superior to training a model from scratch for each lookback window.
In Table 4 we perform an ablation study on various backbone architectures, while retaining the differentiable closed-form ridge regressor. We observe a degradation when the random Fourier features layer is removed, due to the spectral bias problem which neural networks face Rahaman et al. (2019); Tancik et al. (2020). DeepTIMe outperforms the SIREN variant of INRs, which is consistent with observations in the INR literature. Finally, DeepTIMe outperforms the RNN variant, which is the model proposed in Grazzi et al. (2021). This is a direct comparison between autoregressive and time-index models, and highlights the benefits of time-index models.
Lastly, we perform a comparison between the optimal and pessimal scale hyperparameters for the vanilla random Fourier features layer, against our proposed CFF. We first report the results for each scale hyperparameter for the vanilla random Fourier features layer in Table 8, Appendix F. As with the other ablation studies, the results reported in Table 8 are based on performing a hyperparameter sweep across the lookback length multiplier, selecting the optimal setting based on the validation set, and reporting the test set results. The optimal and pessimal scales are then simply the best and worst results from Table 8. Table 5 shows that CFF achieves extremely low deviation from the optimal scale across all settings, yet retains the upside of avoiding this expensive hyperparameter tuning phase. We also observe that tuning the scale hyperparameter is extremely important, as CFF obtains up to a 23.22% improvement in MSE over the pessimal scale hyperparameter.


Finally, we analyse DeepTIMe’s efficiency in both runtime and memory usage, with respect to both lookback window and forecast horizon lengths. The main bottleneck in computation for DeepTIMe is the matrix inversion operation in the ridge regressor, canonically of $O(L^3)$ complexity in the number of support samples $L$. This is a major concern for DeepTIMe as $L$ is linked to the length of the lookback window. As mentioned in Bertinetto et al. (2019), the Woodbury formulation, $W^{*} = (Z Z^{\top} + \lambda I)^{-1} Z Y$, is used to alleviate the problem, leading to an $O(d^3)$ complexity, where $d$ is the hidden size hyperparameter, fixed to some value (see Appendix D). Figure 4 demonstrates that DeepTIMe is highly efficient, even when compared to the efficient Transformer models recently proposed for the long sequence time-series forecasting task, as well as fully connected models.
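The two forms of the ridge solution are algebraically identical (the push-through identity $(Z^{\top}Z + \lambda I)^{-1} Z^{\top} = Z^{\top}(Z Z^{\top} + \lambda I)^{-1}$), so the solver can always invert the smaller of the two Gram matrices. A quick numerical check, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, lam = 200, 16, 0.5           # lookback length, hidden size, ridge coef
Z = rng.standard_normal((L, d))    # support features, one row per time step
Y = rng.standard_normal((L, 1))    # support targets

# solve the small (d x d) system: O(d^3), independent of lookback length
W_small = np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ Y)
# equivalent large (L x L) system: O(L^3), grows with the lookback window
W_large = Z.T @ np.linalg.solve(Z @ Z.T + lam * np.eye(L), Y)

assert np.allclose(W_small, W_large, atol=1e-6)  # same solution either way
```

Since the hidden size is a fixed hyperparameter while the lookback window grows with the forecasting setting, inverting the $d \times d$ system keeps the inner loop cost flat as windows lengthen.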
Neural forecasting Benidis et al. (2020) methods have seen great success in recent times. One related line of research is Transformer-based methods for long sequence time-series forecasting Li et al. (2019); Zhou et al. (2021); Xu et al. (2021); Woo et al. (2022); Zhou et al. (2022), which aim not only to achieve high accuracy, but also to overcome the vanilla attention’s quadratic complexity. Li et al. (2019) and Zhou et al. (2021) introduced sparse attention mechanisms, while Xu et al. (2021); Woo et al. (2022), and Zhou et al. (2022) introduced mechanisms which make use of the Fourier transform to achieve a quasi-linear complexity. They further embed prior knowledge of time-series structures, such as auto-correlation and seasonal-trend decomposition, into the Transformer architecture. Another relevant line of work is that of fully connected models Oreshkin et al. (2020); Olivares et al. (2021); Challu et al. (2022); Oreshkin et al. (2021). Oreshkin et al. (2020) first introduced the NBEATS model, which made use of doubly residual stacks of fully connected layers. Challu et al. (2022) extended this approach to the long sequence time-series forecasting task by introducing hierarchical interpolation and multi-rate data sampling. Meta-learning has been explored in time-series: Grazzi et al. (2021) used a differentiable closed-form solver in the context of time-series forecasting, but specified an autoregressive backbone model.

Time-index based models, or time-series regression models, take as input datetime features and other covariates to predict the value of the time-series at that time step. They have been well explored as a special case of regression analysis Hyndman and Athanasopoulos (2018); Ord et al. (2017), and many different predictors have been proposed for the classical setting. These include linear, polynomial, and piecewise linear trends, dummy variables indicating holidays, seasonal dummy variables, and many others Hyndman and Athanasopoulos (2018). Of note, Fourier terms have been used to model periodicity, or seasonal patterns, and are also known as harmonic regression Young et al. (1999). One popular classical time-index based method is Prophet Taylor and Letham (2018), which uses a structural time-series formulation, considering trend, seasonal, and holiday variables, specialized for business forecasting. Godfrey and Gashler (2017) introduced an initial attempt at using time-index based neural networks to fit a time-series for forecasting. Yet, their work is more reminiscent of classical methods, as they manually specify periodic and non-periodic activation functions, analogous to the representation functions, rather than learning the representation function from data.
INRs have recently gained popularity in the area of neural rendering Tewari et al. (2021). They parameterize a signal as a continuous function, mapping a coordinate to the value at that coordinate. A key finding was that positional encodings Mildenhall et al. (2020); Tancik et al. (2020) are critical for ReLU MLPs to learn high frequency details, while another line of work introduced periodic activations Sitzmann et al. (2020b). Meta-learning via INRs has been explored for various data modalities, typically over images or for neural rendering tasks Sitzmann et al. (2020a); Tancik et al. (2021); Dupont et al. (2021), using both hypernetworks and optimization-based approaches. Yüce et al. (2021) show that meta-learning on INRs is analogous to dictionary learning. In time-series, Jeong and Shin (2022) explored using INRs for anomaly detection, opting to make use of periodic activations and temporal positional encodings.
While out of scope for our current work, a limitation that DeepTIMe faces is that it does not consider holidays and events. We leave the consideration of such features as a potential future direction, along with the incorporation of datetime features as exogenous covariates, whilst avoiding the meta-learning memorization problem. While our current focus for DeepTIMe is on time-series forecasting, time-index based models are a natural fit for missing value imputation, as well as other time-series intelligence tasks for irregular time-series – this is another interesting future direction for extending deep time-index models. One final idea which is interesting to explore is a hypernetwork-style meta-learning solution, which could potentially allow the model to learn from multiple time-series.
In this paper, we proposed DeepTIMe, a deep time-index based model trained via a meta-learning formulation to automatically learn a representation function from time-series data, rather than manually defining the representation function as in classical methods. The meta-learning formulation further enables DeepTIMe to be utilized on non-stationary time-series by adapting to the locally stationary distribution. Importantly, we use a closed-form ridge regressor to tackle the meta-learning formulation and ensure that predictions are computationally efficient. Our extensive empirical analysis shows that DeepTIMe, while being a much simpler model architecture compared to prevailing state-of-the-art methods, achieves competitive performance across forecasting benchmarks on real-world datasets. We perform substantial ablation studies to identify the key components contributing to the success of DeepTIMe, and also show that it is highly efficient.
Samples are generated from the function for . This means that each function/task consists of 400 evenly spaced points between −1 and 1. The parameters of each function/task (i.e. ) are sampled from a normal distribution with mean 0 and standard deviation 50, i.e. .

Samples are generated from the function for , for 400 points. The parameters of each task are sampled from a continuous uniform distribution with minimum value −50 and maximum value 50, i.e. .

Sinusoids come from a fixed set of frequencies, generated by sampling . We fix the size of this set to five, i.e. . Each function is then a sum of sinusoids, where is randomly assigned. The function is thus for , where the amplitudes and phase shifts are freely chosen via , but the frequency is decided by , which randomly selects a frequency from the set .
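Since the exact functional forms and distributions above are elided, the following NumPy sketch only illustrates the general shape of the sum-of-sinusoids task family; the 400-point grid and the fixed frequency set of size five follow the text, while the sampling ranges and the number of summed components are assumptions.

```python
import numpy as np

def sample_sinusoid_task(num_points=400, num_freqs=5, num_components=3, rng=None):
    """Generate one synthetic task as a sum of sinusoids on a fixed grid.

    The frequencies come from a small fixed set shared across tasks, while
    amplitudes and phase shifts are sampled freely per task. All ranges
    below are illustrative assumptions.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.linspace(-1.0, 1.0, num_points)              # evenly spaced coordinates
    freq_set = rng.uniform(0.0, 10.0, size=num_freqs)   # fixed frequency set (assumed range)
    y = np.zeros(num_points)
    for _ in range(num_components):
        freq = rng.choice(freq_set)           # frequency restricted to the fixed set
        amp = rng.uniform(0.1, 5.0)           # amplitude chosen freely (assumed range)
        phase = rng.uniform(0.0, 2 * np.pi)   # phase shift chosen freely
        y += amp * np.sin(2 * np.pi * freq * x + phase)
    return x, y
```

Each call yields one task; a meta-learning episode would then split the 400 points into a support (lookback) and query (horizon) set.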
We train DeepTIMe with the Adam optimizer Kingma and Ba [2014], with a learning rate scheduler following a linear warm-up and cosine annealing scheme. Gradient clipping by norm is applied. The ridge regressor regularization coefficient, , is trained with a different, higher learning rate than the rest of the meta parameters. We use early stopping based on the validation loss, with a fixed patience hyperparameter (the number of epochs for which the loss is allowed to deteriorate before stopping). All experiments are performed on an Nvidia A100 GPU.
The ridge regression regularization coefficient is a learnable parameter constrained to positive values via a softplus function. We apply Dropout Srivastava et al. [2014], then LayerNorm Ba et al. [2016] after the ReLU activation function in each INR layer. The size of the random Fourier feature layer is set independently of the hidden layer size: we specify the total size of the random Fourier feature layer, and the number of dimensions for each scale is obtained by dividing this total equally among the scales.
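To make the two details above concrete (the softplus-constrained ridge coefficient and the multi-scale random Fourier feature layer whose total size is divided equally among scales), here is a minimal NumPy sketch; the actual repo uses PyTorch, and the N(0, scale²) projection distribution is an assumption based on common random Fourier feature practice.

```python
import numpy as np

def softplus(x):
    """Map an unconstrained scalar to a strictly positive value."""
    return np.log1p(np.exp(x))

def random_fourier_features(coords, scales, total_size, rng):
    """Multi-scale random Fourier features.

    The total feature size is divided equally among the scales; each scale
    contributes a cos and a sin block. `coords` has shape (n, d).
    """
    per_scale = total_size // (2 * len(scales))
    feats = []
    for scale in scales:
        B = rng.normal(0.0, scale, size=(coords.shape[1], per_scale))  # projection ~ N(0, scale^2)
        proj = 2 * np.pi * coords @ B
        feats.append(np.cos(proj))
        feats.append(np.sin(proj))
    return np.concatenate(feats, axis=-1)

# The ridge coefficient is stored unconstrained and passed through a
# softplus, so the effective regularization strength is always positive.
effective_lam = softplus(0.0)  # with a 0.0 initialization, effective value > 0
```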
Hyperparameter                          Value
Optimization
  Epochs                                50
  Learning rate                         1e-3
  Ridge coefficient learning rate       1.0
  Warm-up epochs                        5
  Batch size                            256
  Early stopping patience               7
  Max gradient norm                     10.0
Model
  Layers                                5
  Layer size                            256
  Ridge coefficient initialization      0.0
  Scales
  Fourier features size                 4096
  Dropout                               0.1
  Lookback length multiplier,
Scale     0.01          0.1           1             5             10            20            50            100
Metrics   MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE
ETTm2
  96      0.216  0.300  0.189  0.285  0.173  0.268  0.168  0.262  0.166  0.260  0.165  0.258  0.165  0.259  0.164  0.257
  192     0.275  0.340  0.264  0.333  0.239  0.317  0.225  0.301  0.225  0.303  0.224  0.302  0.224  0.304  0.220  0.301
  336     0.340  0.375  0.319  0.371  0.292  0.351  0.275  0.337  0.277  0.336  0.282  0.345  0.278  0.342  0.280  0.344
  720     0.424  0.430  0.405  0.420  0.381  0.412  0.364  0.392  0.375  0.408  0.410  0.430  0.396  0.423  0.406  0.429
In this section, we provide more details on the models compared in the ablation studies. Unless otherwise stated, we perform the same hyperparameter tuning for all models in the ablation studies, and use the same standard hyperparameters, such as the number of layers and layer size.
Removing the ridge regressor module refers to replacing it with a simple linear layer, , where , . This corresponds to a straightforward INR, trained across all lookback-horizon pairs in the dataset.
For models marked “Local”, we similarly remove the ridge regressor module and replace it with a linear layer. However, the model is not trained across all lookback-horizon pairs in the dataset. Instead, for each lookback-horizon pair in the validation/test set, we fit the model to the lookback window via gradient descent, and then perform prediction on the horizon to obtain the forecasts. A new model is trained from scratch for each lookback-horizon window. We tune one extra hyperparameter, the number of epochs of gradient descent, for which we search through .
As each dataset comes with timestamps for each observation, we are able to construct datetime features from these timestamps. We construct the following features:
Quarter-of-year
Month-of-year
Week-of-year
Day-of-year
Day-of-month
Day-of-week
Hour-of-day
Minute-of-hour
Second-of-minute
Each feature is initially an integer value, e.g. month-of-year can take on values in , which we subsequently normalize to a range. Depending on the data sampling frequency, the appropriate features can be chosen. For the ETTm2 dataset, we used all features except second-of-minute, since it is sampled at a 15-minute frequency.
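A sketch of this feature construction using Python's standard library follows; the paper's exact normalized range is elided above, so mapping each feature into [0, 1] by dividing by its maximum value is an assumption here.

```python
from datetime import datetime

def datetime_features(ts: datetime) -> dict:
    """Map a timestamp to normalized datetime features.

    Each raw feature is an integer (e.g. month-of-year in 1..12); here it
    is divided by its maximum value to land in [0, 1] (assumed range).
    """
    return {
        "quarter_of_year": ((ts.month - 1) // 3 + 1) / 4,
        "month_of_year": ts.month / 12,
        "week_of_year": ts.isocalendar()[1] / 53,
        "day_of_year": ts.timetuple().tm_yday / 366,
        "day_of_month": ts.day / 31,
        "day_of_week": (ts.weekday() + 1) / 7,
        "hour_of_day": ts.hour / 24,
        "minute_of_hour": ts.minute / 60,
        "second_of_minute": ts.second / 60,
    }
```

For a 15-minute-frequency dataset such as ETTm2, one would drop `second_of_minute` before feeding these values to the model.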
For all models in this section, we retain the differentiable closed-form ridge regressor, in order to identify the effects of the backbone model used.
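The closed-form solve at the heart of this setup can be sketched as follows; NumPy is shown for brevity, whereas in the actual model the solve is expressed with differentiable tensor operations so that gradients flow through it during meta-training.

```python
import numpy as np

def ridge_regressor(features, targets, lam=0.5):
    """Closed-form ridge regression: W = (Z^T Z + lam * I)^{-1} Z^T Y.

    `features` are the backbone's representations of the lookback-window
    time coordinates, `targets` the corresponding observed values; the
    returned weights are then applied to the horizon coordinates' features
    to produce the forecast.
    """
    d = features.shape[1]
    gram = features.T @ features + lam * np.eye(d)   # regularized Gram matrix
    return np.linalg.solve(gram, features.T @ targets)
```

With a vanishing regularization coefficient this reduces to ordinary least squares; larger coefficients trade fit for stability, which is why the coefficient is kept learnable.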
The random Fourier features layer is a mapping from coordinate space to latent space . To remove the effects of the random Fourier features layer, we simply replace it with a linear map, .
We replace the random Fourier features backbone with the SIREN model introduced by Sitzmann et al. [2020b], which uses periodic activation functions, i.e. , along with a specified weight initialization scheme.
We use a 2-layer LSTM with a hidden size of 256. Inputs are observations, , fed in an autoregressive fashion, predicting the next time step, .
We use a model with 2 encoder and 2 decoder layers with a hidden size of 512, as specified in their original papers.
We use an N-BEATS model with 3 stacks and 3 layers (relatively small compared to the 30 stacks and 4 layers used in their original paper, see https://github.com/ElementAI/NBEATS/blob/master/experiments/electricity/generic.gin), with a hidden size of 512. Note that N-BEATS is a univariate model, and the values presented here are multiplied by a factor of to account for the multivariate data. Another dimension of comparison is the number of parameters used in each model. As demonstrated in Table 9, the number of parameters in fully connected models like N-BEATS scales linearly with the lookback window and forecast horizon lengths, while for Transformer-based models and DeepTIMe, the number of parameters remains constant.
We use an N-HiTS model with hyperparameters as suggested in their original paper (3 stacks, 1 block in each stack, 2 MLP layers, 512 hidden size). For the following hyperparameters which were not specified (subject to hyperparameter tuning), we set the pooling kernel size to , and the number of stack coefficients to . Similar to N-BEATS, N-HiTS is a univariate model, and values were multiplied by a factor of to account for the multivariate data.
Methods     Autoformer    N-HiTS        DeepTIMe
Lookback
  48        10,535,943    927,942       1,314,561
  96        10,535,943    1,038,678     1,314,561
  168       10,535,943    1,204,782     1,314,561
  336       10,535,943    1,592,358     1,314,561
  720       10,535,943    2,478,246     1,314,561
  1440      10,535,943    4,139,286     1,314,561
  2880      10,535,943    7,461,366     1,314,561
  5760      10,535,943    14,105,526    1,314,561
Horizon
  48        10,535,943    927,942       1,314,561
  96        10,535,943    955,644       1,314,561
  168       10,535,943    997,197       1,314,561
  336       10,535,943    1,094,154     1,314,561
  720       10,535,943    1,315,770     1,314,561
  1440      10,535,943    1,731,300     1,314,561
  2880      10,535,943    2,562,360     1,314,561
  5760      10,535,943    4,224,480     1,314,561
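The linear-vs-constant scaling in the table above can be illustrated with a toy parameter-count sketch; the layer sizes are illustrative, not the exact configurations of the compared models.

```python
def fc_forecaster_params(lookback, horizon, hidden=512, layers=3):
    """Parameter count of a fully connected forecaster that maps the
    flattened lookback window directly to the horizon (N-BEATS-like):
    the first and last layers grow with the window lengths."""
    sizes = [lookback] + [hidden] * layers + [horizon]
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))

def time_index_params(coord_dim=1, hidden=256, layers=5, out_dim=1):
    """Parameter count of a time-index model: inputs are time coordinates,
    so the count is independent of lookback and horizon lengths."""
    sizes = [coord_dim] + [hidden] * layers + [out_dim]
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))
```

Doubling the lookback adds exactly `delta_lookback * hidden` weights to the fully connected model, while the time-index model's count never changes.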
In this section, we derive a meta-learning generalization bound for DeepTIMe under the PAC-Bayes framework [Shalev-Shwartz and Ben-David, 2014]. Our formulation follows [Amit and Meir, 2018] and assumes that all tasks share the same hypothesis space , sample space , and loss function . We observe tasks in the form of sample sets . The number of samples in each task is . Each dataset is assumed to be generated i.i.d. from an unknown sample distribution . Each task's sample distribution is in turn generated i.i.d. from an unknown meta distribution, . In particular, we have , where . Here, is the time coordinate, and is the time-series value. For any forecaster parameterized by , we define the loss function . We also define as the prior distribution over , and as the posterior over for each task. In the meta-learning setting, we assume a hyper-prior , a prior distribution over priors; the meta-learner observes a sequence of training tasks, and then outputs a distribution over priors, called the hyper-posterior .
Consider the meta-learning framework. Given the hyper-prior , for any hyper-posterior , any and any , the following holds with probability :
(5)
Our proof contains two steps. First, we bound the error within the observed tasks due to observing a limited number of samples. Then, we bound the error at the task-environment level due to observing a finite number of tasks. Both steps utilize Catoni's classical PAC-Bayes bound [Catoni, 2007] to measure the error, which we restate here.
(Catoni’s bound [Catoni, 2007]) Let be a sample space, a distribution over , and a hypothesis space. Given a loss function and a collection of M i.i.d. random variables ( ) sampled from , let be a prior distribution over the hypothesis space. Then, for any and any real number , the following bound holds uniformly for all posterior distributions over the hypothesis space,

We first utilize Theorem I.2 to bound the generalization error in each of the observed tasks. Let be the index of a task; we define the expected error and empirical error as follows,
(6)  
(7) 
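In Amit and Meir's standard notation (the symbols here are our reconstruction, since they are elided above: $Q_i$ the task posterior, $D_i$ the task distribution, $S_i = \{z_j\}_{j=1}^{m_i}$ the observed sample set), Equations (6) and (7) take the form:

```latex
% Expected error of task i: average loss of hypotheses drawn from Q_i
er_i(Q_i, D_i) \;=\; \mathbb{E}_{h \sim Q_i}\, \mathbb{E}_{z \sim D_i}\, \ell(h, z)

% Empirical counterpart on the observed sample set S_i
\widehat{er}_i(Q_i, S_i) \;=\; \mathbb{E}_{h \sim Q_i}\, \frac{1}{m_i} \sum_{j=1}^{m_i} \ell(h, z_j)
```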
Then, according to Theorem I.2, for any and , we have
(8) 
Next, we bound the error due to observing a limited number of tasks from the environment. Similarly, we define the expected task error as follows,
(9) 
We then define the error across the observed tasks,
(10) 
Then, by Theorem I.2, the following holds for any and ,
(11) 
Finally, by employing the union bound, we can bound the probability of the intersection of the events in Equation 11 and Equation 8. For any , set and for ,
(12) 
∎
Theorem I.1 shows that the expected task generalization error is bounded by the empirical multi-task error plus two complexity terms. The first term represents the complexity of the environment, or equivalently, the time-series dataset, and converges to zero if we observe an infinitely long time-series (). The second term represents the complexity of the observed tasks, or equivalently, the lookback-horizon windows, and converges to zero when there is a sufficient number of time steps in each window ().