DeepTIMe
PyTorch code for DeepTIMe: Deep Time-Index Meta-Learning for Non-Stationary Time-Series Forecasting
Deep learning has been actively applied to time-series forecasting, leading to a deluge of new autoregressive model architectures. Yet, despite the attractive properties of time-index based models, such as being a continuous signal function over time leading to smooth representations, little attention has been given to them. Indeed, while naive deep time-index based models are far more expressive than the manually predefined function representations of classical time-index based models, they are inadequate for forecasting due to the lack of inductive biases and the non-stationarity of time-series. In this paper, we propose DeepTIMe, a deep time-index based model trained via a meta-learning formulation which overcomes these limitations, yielding an efficient and accurate forecasting model. Extensive experiments on real-world datasets demonstrate that our approach achieves competitive results with state-of-the-art methods, and is highly efficient. Code is available at https://github.com/salesforce/DeepTIMe.
Time-series forecasting has important applications across business and scientific domains, such as demand forecasting Carbonneau et al. (2008), capacity planning and management Kim (2003), electricity pricing Cuaresma et al. (2004), and anomaly detection
Laptev et al. (2017). This has led to interest in developing more powerful methods for forecasting, including the recent increase in attention towards neural forecasting methods Benidis et al. (2020). There are two broad approaches to time-series forecasting – autoregressive models, and time-index based models. Autoregressive models Salinas et al. (2020); Yang et al. (2019) consider the forecast to be a function of the previous values of the time-series, while time-index based models, also known as time-series regression models, map a time-index, and optionally datetime features, to the value of the time-series at that time step. Many recent neural forecasting approaches, such as the family of fully connected architectures Challu et al. (2022); Olivares et al. (2021); Oreshkin et al. (2020, 2021), and Transformer-based architectures Zhou et al. (2022); Woo et al. (2022); Xu et al. (2021); Zhou et al. (2021), belong to the autoregressive approach. Yet, while time-index based approaches have the attractive property of being viewed as a continuous signal function over time, leading to signal representations which change smoothly and correlate with each other in continuous space, they have been underexplored from a deep learning perspective. In the following section, we explore the limitations of classical time-index based approaches, and highlight how deep time-index models can overcome these limitations. At the same time, deep time-index models face their own set of limitations. We propose a meta-learning formulation of the standard time-series forecasting problem that can resolve these limitations.

Classical time-index based methods Taylor and Letham (2018); Hyndman and Athanasopoulos (2018); Ord et al. (2017) rely on predefined parametric representation functions to generate predictions, optionally following a structural time-series model formulation Harvey and Shephard (1993), $g(\tau) = T(\tau) + S(\tau) + H(\tau) + \varepsilon_\tau$, where $T(\tau)$, $S(\tau)$, and $H(\tau)$, all functions of a time-index $\tau$, represent trend, periodic, and holiday components respectively, and $\varepsilon_\tau$ represents idiosyncratic changes not accounted for by the model. For example, the trend component could be predefined as a linear, polynomial, or even a piecewise linear function. While these functions are simple and easy to learn, they have limited capacity and are unable to fit more complex time-series. Furthermore, predefining the representation function is a strong assumption which may not hold across different application domains, as it is only effective when the data distribution follows this predefined function – we train on historical data and expect extrapolation into future time steps to hold. While this may be true for a short horizon, this assumption most likely collapses when dealing with long sequence time-series forecasting. Finally, while it is possible to perform model selection across various representation functions and parameters such as changepoints and seasonality, this requires either strong domain expertise or computationally heavy cross-validation across a large set of parameters.
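As a concrete illustration of such a predefined representation function (a minimal sketch, not taken from any cited method; the period, number of harmonics, and toy series are illustrative assumptions), a linear trend plus Fourier seasonal terms can be fitted by ordinary least squares and then extrapolated into future time steps:

```python
import numpy as np

def design_matrix(t, period=24.0, n_harmonics=2):
    """Classical predefined representation: intercept + linear trend
    + Fourier seasonal terms (harmonic regression)."""
    cols = [np.ones_like(t), t]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.stack(cols, axis=1)

# toy series: linear trend + one daily cycle
t_train = np.arange(0.0, 96.0)              # 4 "days" of hourly history
y_train = 0.1 * t_train + np.sin(2 * np.pi * t_train / 24.0)

X = design_matrix(t_train)
coef, *_ = np.linalg.lstsq(X, y_train, rcond=None)  # fit by least squares

t_future = np.arange(96.0, 120.0)           # extrapolate one day ahead
forecast = design_matrix(t_future) @ coef
```

Extrapolation works here only because the data exactly follows the predefined form – which is precisely the strong assumption criticized above.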
Deep learning gives a natural solution to this problem faced by classical time-index based models – parameterize the representation function as a neural network, and learn it directly from data. Neural networks have been shown to be extremely expressive representation functions with a strong capability to approximate complex functions. However, being too expressive a representation function brings about the first limitation. Time-index based models rely on the assumptions of the representation function to ensure that extrapolation beyond the range of historical training data, into future time steps, yields accurate forecasts. Being such an expressive representation function, deep time-index models have no inductive bias to perform well as a forecasting model. This is shown in Figure 1. The second limitation arises due to the non-stationarity of time-series. Time-series data, especially in long sequence time-series forecasting, are typically non-stationary – their distributions change over time. While the standard supervised learning setting assumes the validation and test set to come from the same distribution as the training set, this may not be the case for non-stationary time-series
Kim et al. (2021); Arik et al. (2022), leading to a degradation in predictive performance. To this end, we propose DeepTIMe, a deep time-index based model which leverages a meta-learning formulation to: (i) learn an appropriate function representation from data by optimizing an extrapolation loss, only possible due to our meta-learning formulation (only a reconstruction loss is available in the standard setting), and (ii) learn a global meta-model shared across tasks which performs adaptation on a locally stationary distribution. We further leverage Implicit Neural Representations (INRs) Sitzmann et al. (2020b); Mildenhall et al. (2020) as our choice of deep time-index model, a random Fourier features layer Tancik et al. (2020) to ensure that we are able to learn the high frequency information present in time-series data, and a closed-form ridge regressor Bertinetto et al. (2019) to efficiently tackle the meta-learning formulation. We conduct extensive experiments on both synthetic and real-world datasets, showing that DeepTIMe has extremely competitive performance, achieving state-of-the-art results on 20 out of 24 settings for the multivariate forecasting benchmark based on MSE. We perform ablation studies to better understand the contribution of each component of DeepTIMe, and finally show that it is highly efficient in terms of runtime and memory.
In long sequence time-series forecasting, we consider a time-series dataset $(y_1, \ldots, y_T)$, where $y_t \in \mathbb{R}^m$ is the $m$-dimensional observation at time $t$. Given a lookback window of length $L$, we aim to construct a point forecast over a horizon of length $H$ by learning a model which minimizes some loss function.

In the following, we first describe how to cast the standard time-series forecasting problem as a meta-learning problem for time-index based models, which endows DeepTIMe with the ability to extrapolate over the forecast horizon, i.e. perform forecasting. We emphasize that this reformulation falls within the standard time-series forecasting problem and requires no extra information. Next, we further elaborate on our proposed model architecture, and how it uses a differentiable closed-form ridge regression module to efficiently tackle forecasting as a meta-learning problem. Pseudocode of DeepTIMe is available in Appendix A.

To formulate time-series forecasting as a meta-learning problem, we treat each lookback window and forecast horizon pair as a task. Specifically, the lookback window is treated as the support set $\mathcal{S}_i$, the forecast horizon as the query set $\mathcal{Q}_i$, and each time coordinate and time-series value pair, $(\tau_j, y_j)$, is an input-output sample, where $\tau_j$ is a normalized time-index. The forecasting model $f$ is then parameterized by $\phi$ and $\theta$, the meta and base parameters respectively, and the bilevel optimization problem can be formalized as:
$$\phi^{*} = \arg\min_{\phi} \sum_{i} \sum_{(\tau_j, y_j) \in \mathcal{Q}_i} \big\lVert f(\tau_j;\, \theta_i^{*}(\phi), \phi) - y_j \big\rVert^2 \qquad (1)$$

$$\text{s.t.}\quad \theta_i^{*}(\phi) = \arg\min_{\theta} \sum_{(\tau_j, y_j) \in \mathcal{S}_i} \big\lVert f(\tau_j;\, \theta, \phi) - y_j \big\rVert^2 \qquad (2)$$
Here, the outer summation in Equation 1 over index $i$ represents each lookback-horizon window, corresponding to each task in meta-learning, and the inner summation over index $j$ represents each sample in the query set, or equivalently, each time step in the forecast horizon. The summation in Equation 2 over index $j$ represents each sample in the support set, or each time step in the lookback window. This is illustrated in Figure 1(a).
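The task construction described above can be sketched as follows (a minimal sketch; the normalization $\tau_j = j/(L+H)$ and the helper name are illustrative assumptions):

```python
import numpy as np

def make_task(window, lookback_len):
    """Split one lookback-horizon window into support and query sets of
    (normalized time-index, value) pairs, as in the meta-learning view:
    the lookback window is the support set, the horizon the query set."""
    total = len(window)
    tau = np.arange(total) / total            # normalized time-index in [0, 1)
    support = (tau[:lookback_len], window[:lookback_len])   # lookback window
    query = (tau[lookback_len:], window[lookback_len:])     # forecast horizon
    return support, query

# one window of length L + H = 36 + 12
y = np.sin(np.linspace(0, 4 * np.pi, 48))
(support_x, support_y), (query_x, query_y) = make_task(y, lookback_len=36)
```

Each sliding window of the training series yields one such task, so no information beyond the standard forecasting setup is required.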
To understand how our meta-learning formulation helps to learn an appropriate function representation from data, we examine how the meta-learning process restricts the hypothesis class of the model $f$. The original hypothesis class of our model, or function representation, is too large and provides no guarantees that training on the lookback window leads to good extrapolation. The meta-learning formulation allows DeepTIMe to restrict the hypothesis class of the representation function, from the space of all $K$-layered INRs, to the space of $K$-layered INRs conditioned on the optimal meta parameters $\phi^*$, where $\phi^*$ is the minimizer of a forecasting loss (as specified in Equation 1). Given this hypothesis class, local adaptation is performed over the base parameters $\theta$ given the lookback window, which is assumed to come from a locally stationary distribution, resolving the issue of non-stationarity.
The class of deep models which map coordinates to the value at that coordinate using a stack of multilayer perceptrons (MLPs) is known as INRs Sitzmann et al. (2020b); Tancik et al. (2020); Mildenhall et al. (2020). We make use of them as they are a natural fit for time-index based models, mapping a time-index to the value of the time-series at that time-index. A $K$-layered ReLU Nair and Hinton (2010) INR is a function of the following form:

$$f(\tau) = W^{(K)} z^{(K-1)} + b^{(K)}, \qquad z^{(0)} = \tau, \qquad z^{(k)} = \max\big(0,\; W^{(k)} z^{(k-1)} + b^{(k)}\big),\; k = 1, \ldots, K-1 \qquad (3)$$
where $\tau \in \mathbb{R}^c$ is the time-index. Note that $c = 1$ for our proposed approach as specified in Section 2.1, but we use this notation to allow for generalization to cases where datetime features are included. Tancik et al. (2020) introduced a random Fourier features layer which allows INRs to fit high frequency functions, by modifying the first layer to $z^{(0)} = \gamma(\tau) = [\sin(2\pi B\tau), \cos(2\pi B\tau)]^{\top}$, where each entry in $B \in \mathbb{R}^{d/2 \times c}$ is sampled from $\mathcal{N}(0, \sigma^2)$, with $d$ the hidden dimension size of the INR, $\sigma$ the scale hyperparameter, and $[\cdot, \cdot]$ a row-wise stacking operation. While the random Fourier features layer endows INRs with the ability to learn high frequency patterns, one major drawback is the need to perform a hyperparameter sweep for each task and dataset to avoid over- or underfitting. We overcome this limitation with a simple scheme of concatenating multiple Fourier basis functions with diverse scale parameters, i.e. $\gamma(\tau) = [\sin(2\pi B_1\tau), \cos(2\pi B_1\tau), \ldots, \sin(2\pi B_S\tau), \cos(2\pi B_S\tau)]^{\top}$, where the entries of each $B_s$ are sampled from $\mathcal{N}(0, \sigma_s^2)$ for a set of scales $\sigma_1, \ldots, \sigma_S$. We perform an analysis in Appendix F and show that the performance of our proposed Concatenated Fourier Features (CFF) does not significantly deviate from the setting with the optimal scale parameter obtained from a hyperparameter sweep.
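A sketch of a CFF layer feeding a ReLU INR in PyTorch (the scale set, layer sizes, and class names are illustrative assumptions, not the paper's configuration):

```python
import math
import torch
import torch.nn as nn

class ConcatFourierFeatures(nn.Module):
    """Concatenated Fourier Features: random Fourier mappings at several
    scales, concatenated, so no single scale has to be tuned per dataset."""
    def __init__(self, in_dim, out_dim, scales=(0.01, 0.1, 1.0, 10.0)):
        super().__init__()
        assert out_dim % (2 * len(scales)) == 0
        d = out_dim // (2 * len(scales))  # dims per scale, divided equally
        self.bases = nn.ParameterList(
            nn.Parameter(torch.randn(in_dim, d) * s, requires_grad=False)
            for s in scales  # fixed random matrices B_s with std sigma_s
        )

    def forward(self, tau):  # tau: (..., in_dim)
        feats = []
        for B in self.bases:
            proj = 2 * math.pi * (tau @ B)
            feats += [torch.sin(proj), torch.cos(proj)]
        return torch.cat(feats, dim=-1)

class INR(nn.Module):
    """ReLU MLP over Fourier-featurized time-indices; the final linear
    layer plays the role of the base learner adapted per task."""
    def __init__(self, hidden=64, layers=3, out_dim=1):
        super().__init__()
        mods = [ConcatFourierFeatures(1, hidden)]
        for _ in range(layers - 1):
            mods += [nn.Linear(hidden, hidden), nn.ReLU()]
        self.features = nn.Sequential(*mods)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, tau):
        return self.head(self.features(tau))
```

Because every scale contributes a slice of the feature vector, the downstream layers can weight whichever frequency band fits the data, instead of committing to one scale up front.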
One key aspect of tackling forecasting as a meta-learning problem is efficiency. Optimization-based meta-learning approaches originally perform an expensive bilevel optimization procedure on the entire neural network model by backpropagating through inner gradient steps Ravi and Larochelle (2017); Finn et al. (2017). Since each forecast is now treated as an inner loop optimization problem, it needs to be sufficiently fast to be competitive with competing methods. We achieve this by leveraging a ridge regression closed-form solver Bertinetto et al. (2019) on top of an INR, illustrated in Figure 1(b). The ridge regression closed-form solver restricts the inner loop to only apply to the last layer of the model, allowing for either a closed-form solution or a differentiable solver to replace the inner gradient steps. This means that for a $K$-layered model, $\phi = \{W^{(k)}, b^{(k)}\}_{k=1}^{K-1}$ are the meta parameters and $\theta = \{W^{(K)}\}$ are the base parameters, following the notation of Equation 3. Then let $g_\phi$ be the meta learner, where $g_\phi(\tau) = z^{(K-1)}$. For task $i$ with the corresponding lookback-horizon pair, the support set features obtained from the meta learner are denoted $Z_i = [g_\phi(\tau_1), \ldots, g_\phi(\tau_L)]$, where $[\cdot, \ldots, \cdot]$ is a column-wise concatenation operation. The inner loop thus solves the optimization problem:

$$W_i^{(K)*} = \arg\min_{W} \big\lVert Z_i^{\top} W - Y_i \big\rVert^2 + \lambda \lVert W \rVert^2 \qquad (4)$$
Now, let $\tilde{Z}_i$ be the query set features. Then, our predictions are $\hat{Y}_i = \tilde{Z}_i^{\top} W_i^{(K)*} = \tilde{Z}_i^{\top} (Z_i Z_i^{\top} + \lambda I)^{-1} Z_i Y_i$. This closed-form solution is differentiable, which enables gradient updates on the parameters of the meta learner $\phi$. A bias term can be included for the closed-form ridge regressor by appending a scalar 1 to the feature vector. The model obtained by DeepTIMe is ultimately the restricted hypothesis class conditioned on the optimal meta parameters $\phi^*$.

We evaluate DeepTIMe on both synthetic datasets and a variety of real-world data. We ask the following questions: (i) Is DeepTIMe, trained on a family of functions following the same parametric form, able to perform extrapolation on unseen functions? (ii) How does DeepTIMe compare to other forecasting models on real-world data? (iii) What are the key contributing factors to the good performance of DeepTIMe?
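The closed-form ridge inner loop described in the method above can be sketched as follows (a hedged sketch under illustrative names and shapes; the bias is handled by appending a constant feature, as stated earlier):

```python
import torch

def ridge_solve(Z_support, y_support, Z_query, lam=0.1):
    """Differentiable closed-form ridge regression inner loop.
    Z_support: (L, d) features of the lookback window (support set),
    y_support: (L, m) target values,
    Z_query:   (H, d) features over the forecast horizon (query set)."""
    d = Z_support.shape[-1]
    # append a constant feature so the solver also fits a bias term
    Z_s = torch.cat([Z_support, torch.ones_like(Z_support[:, :1])], dim=-1)
    Z_q = torch.cat([Z_query, torch.ones_like(Z_query[:, :1])], dim=-1)
    A = Z_s.T @ Z_s + lam * torch.eye(d + 1)      # small (d+1, d+1) system
    W = torch.linalg.solve(A, Z_s.T @ y_support)  # closed-form minimizer
    return Z_q @ W                                # predictions over the horizon
```

Because `torch.linalg.solve` is differentiable, gradients flow through the inner-loop solution back to the feature extractor, replacing unrolled inner gradient steps.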
We first consider DeepTIMe’s ability to extrapolate on the following functions specified by some parametric form: (i) the family of linear functions, $y = ax + b$, (ii) the family of cubic functions, $y = ax^3 + bx^2 + cx + d$, and (iii) sums of sinusoids, $y = \sum_j A_j \sin(\omega_j x + \varphi_j)$. Parameters of the functions are sampled randomly (further details in Appendix B) to construct distinct tasks. A total of 400 time steps are sampled, with a lookback window length of 200 and forecast horizon of 200. Figure 3 demonstrates that DeepTIMe is able to perform extrapolation on unseen test functions/tasks after being trained via our meta-learning formulation. It demonstrates an ability to approximate and adapt, based on the lookback window, to linear and cubic polynomials, and even sums of sinusoids. Next, we evaluate DeepTIMe on real-world datasets against state-of-the-art forecasting baselines.
ETT (https://github.com/zhouhaoyi/ETDataset) Zhou et al. (2021) – Electricity Transformer Temperature provides measurements from an electricity transformer such as load and oil temperature. We use the ETTm2 subset, consisting of measurements at a 15-minute frequency. ECL (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) – Electricity Consuming Load provides measurements of electricity consumption for 321 households from 2012 to 2014. The data was collected at the 15-minute level, but is aggregated hourly. Exchange (https://github.com/laiguokun/multivariate-time-series-data) Lai et al. (2018) – a collection of daily exchange rates with USD of eight countries (Australia, United Kingdom, Canada, Switzerland, China, Japan, New Zealand, and Singapore) from 1990 to 2016. Traffic (https://pems.dot.ca.gov/) – dataset from the California Department of Transportation providing the hourly road occupancy rates from 862 sensors in San Francisco Bay area freeways. Weather (https://www.bgc-jena.mpg.de/wetter/) – provides measurements of 21 meteorological indicators such as air temperature, humidity, etc., every 10 minutes for the year of 2020, from the Weather Station of the Max Planck Institute for Biogeochemistry in Jena, Germany. ILI (https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html) – Influenza-like Illness measures the weekly ratio of patients seen with ILI to the total number of patients, obtained by the Centers for Disease Control and Prevention of the United States between 2002 and 2021.
We evaluate the performance of our proposed approach using two metrics, mean squared error (MSE) and mean absolute error (MAE). The datasets are split into train, validation, and test sets chronologically, following a 70/10/20 split for all datasets except for ETTm2, which follows a 60/20/20 split, as per convention. The univariate benchmark selects the last index of the multivariate dataset as the target variable, following previous work Xu et al. (2021); Woo et al. (2022); Challu et al. (2022); Zhou et al. (2022). Preprocessing on the data is performed by standardization based on train set statistics. Hyperparameter selection is performed on only one value, the lookback length multiplier, which decides the length of the lookback window. We search through a small set of candidate values, and select the best value based on the validation loss. Further implementation details on DeepTIMe are reported in Appendix C, and detailed hyperparameters are reported in Appendix D. Reported results for DeepTIMe are averaged over three runs, and standard deviations are reported in Appendix E.

Table 1. Multivariate forecasting benchmark.

Methods  DeepTIMe  NHiTS  ETSformer  Fedformer  Autoformer  Informer  LogTrans  Reformer

Metrics  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  
ETTm2 
96  0.166  0.257  0.176  0.255  0.189  0.280  0.203  0.287  0.255  0.339  0.365  0.453  0.768  0.642  0.658  0.619 
192  0.225  0.302  0.245  0.305  0.253  0.319  0.269  0.328  0.281  0.340  0.533  0.563  0.989  0.757  1.078  0.827  
336  0.277  0.336  0.295  0.346  0.314  0.357  0.325  0.366  0.339  0.372  1.363  0.887  1.334  0.872  1.549  0.972  
720  0.383  0.409  0.401  0.426  0.414  0.413  0.421  0.415  0.422  0.419  3.379  1.388  3.048  1.328  2.631  1.242  
ECL 
96  0.137  0.238  0.147  0.249  0.187  0.304  0.183  0.297  0.201  0.317  0.274  0.368  0.258  0.357  0.312  0.402 
192  0.152  0.252  0.167  0.269  0.199  0.315  0.195  0.308  0.222  0.334  0.296  0.386  0.266  0.368  0.348  0.433  
336  0.166  0.268  0.186  0.290  0.212  0.329  0.212  0.313  0.231  0.338  0.300  0.394  0.280  0.380  0.350  0.433  
720  0.201  0.302  0.243  0.340  0.233  0.345  0.231  0.343  0.254  0.361  0.373  0.439  0.283  0.376  0.340  0.420  
Exchange 
96  0.081  0.205  0.092  0.211  0.085  0.204  0.139  0.276  0.197  0.323  0.847  0.752  0.968  0.812  1.065  0.829 
192  0.151  0.284  0.208  0.322  0.182  0.303  0.256  0.369  0.300  0.369  1.204  0.895  1.040  0.851  1.188  0.906  
336  0.314  0.412  0.371  0.443  0.348  0.428  0.426  0.464  0.509  0.524  1.672  1.036  1.659  1.081  1.357  0.976  
720  0.856  0.663  0.888  0.723  1.025  0.774  1.090  0.800  1.447  0.941  2.478  1.310  1.941  1.127  1.510  1.016  
Traffic 
96  0.390  0.275  0.402  0.282  0.607  0.392  0.562  0.349  0.613  0.388  0.719  0.391  0.684  0.384  0.732  0.423 
192  0.402  0.278  0.420  0.297  0.621  0.399  0.562  0.346  0.616  0.382  0.696  0.379  0.685  0.390  0.733  0.420  
336  0.415  0.288  0.448  0.313  0.622  0.396  0.570  0.323  0.622  0.337  0.777  0.420  0.733  0.408  0.742  0.420  
720  0.449  0.307  0.539  0.353  0.632  0.396  0.596  0.368  0.660  0.408  0.864  0.472  0.717  0.396  0.755  0.423  
Weather 
96  0.166  0.221  0.158  0.195  0.197  0.281  0.217  0.296  0.266  0.336  0.300  0.384  0.458  0.490  0.689  0.596 
192  0.207  0.261  0.211  0.247  0.237  0.312  0.276  0.336  0.307  0.367  0.598  0.544  0.658  0.589  0.752  0.638  
336  0.251  0.298  0.274  0.300  0.298  0.353  0.339  0.380  0.359  0.359  0.578  0.523  0.797  0.652  0.639  0.596  
720  0.301  0.338  0.351  0.353  0.352  0.388  0.403  0.428  0.419  0.419  1.059  0.741  0.869  0.675  1.130  0.792  
ILI 
24  2.425  1.086  1.862  0.869  2.527  1.020  2.203  0.963  3.483  1.287  5.764  1.677  4.480  1.444  4.400  1.382 
36  2.231  1.008  2.071  0.969  2.615  1.007  2.272  0.976  3.103  1.148  4.755  1.467  4.799  1.467  4.783  1.448  
48  2.230  1.016  2.346  1.042  2.359  0.972  2.209  0.981  2.669  1.085  4.763  1.469  4.800  1.468  4.832  1.465  
60  2.143  0.985  2.560  1.073  2.487  1.016  2.545  1.061  2.770  1.125  5.264  1.564  5.278  1.560  4.882  1.483 
Table 2. Univariate forecasting benchmark.

Methods  DeepTIMe  NHiTS  ETSformer  Fedformer  Autoformer  Informer  NBEATS  DeepAR  Prophet  ARIMA

Metrics  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  
ETTm2 
96  0.065  0.186  0.066  0.185  0.080  0.212  0.063  0.189  0.065  0.189  0.088  0.225  0.082  0.219  0.099  0.237  0.287  0.456  0.211  0.362 
192  0.096  0.234  0.087  0.223  0.150  0.302  0.102  0.245  0.118  0.256  0.132  0.283  0.120  0.268  0.154  0.310  0.312  0.483  0.261  0.406  
336  0.138  0.285  0.106  0.251  0.175  0.334  0.130  0.279  0.154  0.305  0.180  0.336  0.226  0.370  0.277  0.428  0.331  0.474  0.317  0.448  
720  0.186  0.338  0.157  0.312  0.224  0.379  0.178  0.325  0.182  0.335  0.300  0.435  0.188  0.338  0.332  0.468  0.534  0.593  0.366  0.487  
Exchange 
96  0.086  0.226  0.093  0.223  0.099  0.230  0.131  0.284  0.241  0.299  0.591  0.615  0.156  0.299  0.417  0.515  0.828  0.762  0.112  0.245 
192  0.173  0.330  0.230  0.313  0.223  0.353  0.277  0.420  0.273  0.665  1.183  0.912  0.669  0.665  0.813  0.735  0.909  0.974  0.304  0.404  
336  0.539  0.575  0.370  0.486  0.421  0.497  0.426  0.511  0.508  0.605  1.367  0.984  0.611  0.605  1.331  0.962  1.304  0.988  0.736  0.598  
720  0.936  0.763  0.728  0.569  1.114  0.807  1.162  0.832  0.991  0.860  1.872  1.072  1.111  0.860  1.890  1.181  3.238  1.566  1.871  0.935 
We compare DeepTIMe to the following baselines for the multivariate setting: NHiTS Challu et al. (2022), ETSformer Woo et al. (2022), Fedformer Zhou et al. (2022) (we report the best score for each setting from the two variants they present), Autoformer Xu et al. (2021), Informer Zhou et al. (2021), LogTrans Li et al. (2019), and Reformer Kitaev et al. (2020). For the univariate setting, we include additional univariate forecasting models: NBEATS Oreshkin et al. (2020), DeepAR Salinas et al. (2020), Prophet Taylor and Letham (2018), and ARIMA. We obtain the baseline results from the following papers: Challu et al. (2022); Woo et al. (2022); Zhou et al. (2022); Xu et al. (2021). Table 1 and Table 2 summarize the multivariate and univariate forecasting results respectively. DeepTIMe achieves state-of-the-art performance on 20 out of 24 settings in MSE, and 17 out of 24 settings in MAE on the multivariate benchmark, and also achieves competitive results on the univariate benchmark despite its simple architecture, compared to baselines comprising complex fully connected architectures and computationally intensive Transformer architectures.
Table 3. Ablation study on training schemes and input features.

Methods  DeepTIMe  + Datetime  − RR  − RR + Datetime  + Local  + Local + Datetime
Metrics  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  
ETTm2 
96  0.166  0.257  0.226  0.303  3.072  1.345  3.393  1.400  0.251  0.331  0.250  0.327 
192  0.225  0.302  0.309  0.362  3.064  1.343  3.269  1.381  0.322  0.371  0.323  0.366  
336  0.277  0.336  0.341  0.381  2.920  1.309  3.442  1.401  0.370  0.412  0.367  0.396  
720  0.383  0.409  0.453  0.447  2.773  1.273  3.400  1.399  0.443  0.449  0.455  0.461 
Table 4. Ablation study on backbone architectures.

Methods  DeepTIMe  MLP  SIREN  RNN

Metrics  MSE  MAE  MSE  MAE  MSE  MAE  MSE  MAE  
ETTm2  96  0.166  0.257  0.186  0.284  0.236  0.325  0.233  0.324 
192  0.225  0.302  0.265  0.338  0.295  0.361  0.275  0.337  
336  0.277  0.336  0.316  0.372  0.327  0.386  0.344  0.383  
720  0.383  0.409  0.401  0.417  0.438  0.453  0.431  0.432 
RNN refers to an autoregressive recurrent neural network (inputs are the time-series values rather than time-indices). All approaches include the differentiable closed-form ridge regressor. Further model details can be found in Section G.2.

Table 5. CFF compared against the optimal and pessimal scale hyperparameters.

CFF  Optimal Scale (% change)  Pessimal Scale (% change)

Metrics  MSE  MAE  MSE  MAE  MSE  MAE  
ETTm2 
96  0.166  0.257  0.164 (−1.20%)  0.257 (−0.05%)  0.216 (+23.22%)  0.300 (+14.22%)
192  0.225  0.302  0.220 (−1.87%)  0.301 (−0.25%)  0.275 (+18.36%)  0.340 (+11.25%)
336  0.277  0.336  0.275 (−0.70%)  0.336 (−0.22%)  0.340 (+18.68%)  0.375 (+10.57%)
720  0.383  0.409  0.364 (−5.29%)  0.392 (−4.48%)  0.424 (+9.67%)  0.430 (+4.95%)
We perform an ablation study to understand how various training schemes and input features affect the performance of DeepTIMe. Table 3 presents these results. First, we observe that our meta-learning formulation is a critical component to the success of DeepTIMe. We note that DeepTIMe without meta-learning may not be a meaningful baseline, since the model outputs are always the same regardless of the input lookback window. Including datetime features helps alleviate this issue, yet we observe that the inclusion of datetime features generally leads to a degradation in performance. In the case of DeepTIMe, we observed that the inclusion of datetime features leads to a much lower training loss but a degradation in test performance – this is a case of meta-learning memorization Yin et al. (2020) due to the tasks becoming non-mutually exclusive Rajendran et al. (2020). Finally, we observe that the meta-learning formulation is indeed superior to training a model from scratch for each lookback window.
In Table 4 we perform an ablation study on various backbone architectures, while retaining the differentiable closed-form ridge regressor. We observe a degradation when the random Fourier features layer is removed, due to the spectral bias problem which neural networks face Rahaman et al. (2019); Tancik et al. (2020). DeepTIMe outperforms the SIREN variant of INRs, which is consistent with observations in the INR literature. Finally, DeepTIMe outperforms the RNN variant, which is the model proposed in Grazzi et al. (2021). This is a direct comparison between autoregressive and time-index models, and highlights the benefits of time-index models.
Lastly, we perform a comparison between the optimal and pessimal scale hyperparameters for the vanilla random Fourier features layer, against our proposed CFF. We first report the results for each scale hyperparameter for the vanilla random Fourier features layer in Table 8, Appendix F. As with the other ablation studies, the results reported in Table 8 are based on performing a hyperparameter sweep across the lookback length multiplier, selecting the optimal setting based on the validation set, and reporting the test set results. The optimal and pessimal scales are then simply the best and worst results from Table 8. Table 5 shows that CFF achieves extremely low deviation from the optimal scale across all settings, yet retains the upside of avoiding this expensive hyperparameter tuning phase. We also observe that tuning the scale hyperparameter is extremely important, as CFF obtains up to a 23.22% improvement in MSE over the pessimal scale hyperparameter.


Finally, we analyse DeepTIMe’s efficiency in both runtime and memory usage, with respect to both lookback window and forecast horizon lengths. The main bottleneck in computation for DeepTIMe is the matrix inversion operation in the ridge regressor, canonically of $O(L^3)$ complexity in the number of support samples $L$. This is a major concern for DeepTIMe as $L$ is linked to the length of the lookback window. As mentioned in Bertinetto et al. (2019), the Woodbury formulation, $W^{*} = (Z Z^{\top} + \lambda I)^{-1} Z Y$, is used to alleviate the problem, leading to an $O(d^3)$ complexity, where $d$ is the hidden size hyperparameter, fixed to some value (see Appendix D). Figure 4 demonstrates that DeepTIMe is highly efficient, even when compared to the efficient Transformer models recently proposed for the long sequence time-series forecasting task, as well as fully connected models.
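The two forms of the ridge solution are algebraically identical (the push-through identity $(Z^{\top}Z + \lambda I)^{-1} Z^{\top} = Z^{\top}(Z Z^{\top} + \lambda I)^{-1}$), so the solver can always invert the smaller of the two Gram matrices. A quick numerical check, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, lam = 200, 16, 0.5           # lookback length, hidden size, ridge coef
Z = rng.standard_normal((L, d))    # support features, one row per time step
Y = rng.standard_normal((L, 1))    # support targets

# solve the small (d x d) system: O(d^3), independent of lookback length
W_small = np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ Y)
# equivalent large (L x L) system: O(L^3), grows with the lookback window
W_large = Z.T @ np.linalg.solve(Z @ Z.T + lam * np.eye(L), Y)

assert np.allclose(W_small, W_large, atol=1e-6)  # same solution either way
```

Since the hidden size is a fixed hyperparameter while the lookback window grows with the forecasting setting, inverting the $d \times d$ system keeps the inner loop cost flat as windows lengthen.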
Neural forecasting Benidis et al. (2020) methods have seen great success in recent times. One related line of research is Transformer-based methods for long sequence time-series forecasting Li et al. (2019); Zhou et al. (2021); Xu et al. (2021); Woo et al. (2022); Zhou et al. (2022), which aim not only to achieve high accuracy, but also to overcome the vanilla attention’s quadratic complexity. Li et al. (2019) and Zhou et al. (2021) introduced sparse attention mechanisms, while Xu et al. (2021); Woo et al. (2022), and Zhou et al. (2022) introduced mechanisms which make use of the Fourier transform to achieve a quasi-linear complexity. They further embed prior knowledge of time-series structures, such as auto-correlation and seasonal-trend decomposition, into the Transformer architecture. Another relevant line of work is that of fully connected models Oreshkin et al. (2020); Olivares et al. (2021); Challu et al. (2022); Oreshkin et al. (2021). Oreshkin et al. (2020) first introduced the NBEATS model, which made use of doubly residual stacks of fully connected layers. Challu et al. (2022) extended this approach to the long sequence time-series forecasting task by introducing hierarchical interpolation and multi-rate data sampling. Meta-learning has been explored in time-series: Grazzi et al. (2021) used a differentiable closed-form solver in the context of time-series forecasting, but specified an autoregressive backbone model.

Time-index based models, or time-series regression models, take as input datetime features and other covariates to predict the value of the time-series at that time step. They have been well explored as a special case of regression analysis Hyndman and Athanasopoulos (2018); Ord et al. (2017), and many different predictors have been proposed for the classical setting. These include linear, polynomial, and piecewise linear trends, dummy variables indicating holidays, seasonal dummy variables, and many others Hyndman and Athanasopoulos (2018). Of note, Fourier terms have been used to model periodicity, or seasonal patterns, and are also known as harmonic regression Young et al. (1999). One popular classical time-index based method is Prophet Taylor and Letham (2018), which uses a structural time-series formulation, considering trend, seasonal, and holiday variables, specialized for business forecasting. Godfrey and Gashler (2017) introduced an initial attempt at using time-index based neural networks to fit a time-series for forecasting. Yet, their work is more reminiscent of classical methods, as they manually specify periodic and non-periodic activation functions, analogous to the representation functions, rather than learning the representation function from data.
INRs have recently gained popularity in the area of neural rendering Tewari et al. (2021). They parameterize a signal as a continuous function, mapping a coordinate to the value at that coordinate. A key finding was that positional encodings Mildenhall et al. (2020); Tancik et al. (2020) are critical for ReLU MLPs to learn high frequency details, while another line of work introduced periodic activations Sitzmann et al. (2020b). Meta-learning via INRs has been explored for various data modalities, typically over images or for neural rendering tasks Sitzmann et al. (2020a); Tancik et al. (2021); Dupont et al. (2021), using both hypernetworks and optimization-based approaches. Yüce et al. (2021) show that meta-learning on INRs is analogous to dictionary learning. In time-series, Jeong and Shin (2022) explored using INRs for anomaly detection, opting to make use of periodic activations and temporal positional encodings.
While out of scope for our current work, a limitation that DeepTIMe faces is that it does not consider holidays and events. We leave the consideration of such features as a potential future direction, along with the incorporation of datetime features as exogenous covariates, whilst avoiding the meta-learning memorization problem. While our current focus for DeepTIMe is on time-series forecasting, time-index based models are a natural fit for missing value imputation, as well as other time-series intelligence tasks for irregular time-series – this is another interesting future direction for extending deep time-index models. One final idea which is interesting to explore is a hypernetwork-style meta-learning solution, which could potentially allow the model to learn from multiple time-series.
In this paper, we proposed DeepTIMe, a deep time-index based model trained via a meta-learning formulation to automatically learn a representation function from time-series data, rather than manually defining the representation function as in classical methods. The meta-learning formulation further enables DeepTIMe to be utilized on non-stationary time-series by adapting to the locally stationary distribution. Importantly, we use a closed-form ridge regressor to tackle the meta-learning formulation and ensure that predictions are computationally efficient. Our extensive empirical analysis shows that DeepTIMe, while being a much simpler model architecture compared to prevailing state-of-the-art methods, achieves competitive performance across forecasting benchmarks on real-world datasets. We perform substantial ablation studies to identify the key components contributing to the success of DeepTIMe, and also show that it is highly efficient.
Samples are generated from the function for . This means that each function/task consists of 400 evenly spaced points between −1 and 1. The parameters of each function/task (i.e. ) are sampled from a normal distribution with mean 0 and standard deviation 50, i.e. .

Samples are generated from the function for , for 400 points. The parameters of each task are sampled from a continuous uniform distribution with minimum value −50 and maximum value 50, i.e. .

Sinusoids come from a fixed set of frequencies, generated by sampling . We fix the size of this set to five, i.e. . Each function is then a sum of sinusoids, where is randomly assigned. The function is thus for , where the amplitudes and phase shifts are freely chosen via , but the frequency is decided by , which randomly selects a frequency from the set .
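Since the exact functional forms and distributions above are elided, the following NumPy sketch only illustrates the general shape of the sum-of-sinusoids task family; the 400-point grid and the fixed frequency set of size five follow the text, while the sampling ranges and the number of summed components are assumptions.

```python
import numpy as np

def sample_sinusoid_task(num_points=400, num_freqs=5, num_components=3, rng=None):
    """Generate one synthetic task as a sum of sinusoids on a fixed grid.

    The frequencies come from a small fixed set shared across tasks, while
    amplitudes and phase shifts are sampled freely per task. All ranges
    below are illustrative assumptions.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.linspace(-1.0, 1.0, num_points)              # evenly spaced coordinates
    freq_set = rng.uniform(0.0, 10.0, size=num_freqs)   # fixed frequency set (assumed range)
    y = np.zeros(num_points)
    for _ in range(num_components):
        freq = rng.choice(freq_set)           # frequency restricted to the fixed set
        amp = rng.uniform(0.1, 5.0)           # amplitude chosen freely (assumed range)
        phase = rng.uniform(0.0, 2 * np.pi)   # phase shift chosen freely
        y += amp * np.sin(2 * np.pi * freq * x + phase)
    return x, y
```

Each call yields one task; a meta-learning episode would then split the 400 points into a support (lookback) and query (horizon) set.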
We train DeepTIMe with the Adam optimizer Kingma and Ba [2014], with a learning rate scheduler following a linear warm-up and cosine annealing scheme. Gradient clipping by norm is applied. The ridge regressor regularization coefficient, , is trained with a different, higher learning rate than the rest of the meta parameters. We use early stopping based on the validation loss, with a fixed patience hyperparameter (the number of epochs for which the loss is allowed to deteriorate before stopping). All experiments are performed on an Nvidia A100 GPU.
The ridge regression regularization coefficient is a learnable parameter constrained to positive values via a softplus function. We apply Dropout Srivastava et al. [2014], then LayerNorm Ba et al. [2016] after the ReLU activation function in each INR layer. The size of the random Fourier feature layer is set independently of the hidden layer size: we specify the total size of the random Fourier feature layer, and the number of dimensions for each scale is obtained by dividing this total equally among the scales.
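To make the two details above concrete (the softplus-constrained ridge coefficient and the multi-scale random Fourier feature layer whose total size is divided equally among scales), here is a minimal NumPy sketch; the actual repo uses PyTorch, and the N(0, scale²) projection distribution is an assumption based on common random Fourier feature practice.

```python
import numpy as np

def softplus(x):
    """Map an unconstrained scalar to a strictly positive value."""
    return np.log1p(np.exp(x))

def random_fourier_features(coords, scales, total_size, rng):
    """Multi-scale random Fourier features.

    The total feature size is divided equally among the scales; each scale
    contributes a cos and a sin block. `coords` has shape (n, d).
    """
    per_scale = total_size // (2 * len(scales))
    feats = []
    for scale in scales:
        B = rng.normal(0.0, scale, size=(coords.shape[1], per_scale))  # projection ~ N(0, scale^2)
        proj = 2 * np.pi * coords @ B
        feats.append(np.cos(proj))
        feats.append(np.sin(proj))
    return np.concatenate(feats, axis=-1)

# The ridge coefficient is stored unconstrained and passed through a
# softplus, so the effective regularization strength is always positive.
effective_lam = softplus(0.0)  # with a 0.0 initialization, effective value > 0
```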
Hyperparameter                          Value
Optimization
  Epochs                                50
  Learning rate                         1e-3
  Ridge coefficient learning rate       1.0
  Warm-up epochs                        5
  Batch size                            256
  Early stopping patience               7
  Max gradient norm                     10.0
Model
  Layers                                5
  Layer size                            256
  Ridge coefficient initialization      0.0
  Scales
  Fourier features size                 4096
  Dropout                               0.1
  Lookback length multiplier,
Scale     0.01          0.1           1             5             10            20            50            100
Metrics   MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE
ETTm2
  96      0.216  0.300  0.189  0.285  0.173  0.268  0.168  0.262  0.166  0.260  0.165  0.258  0.165  0.259  0.164  0.257
  192     0.275  0.340  0.264  0.333  0.239  0.317  0.225  0.301  0.225  0.303  0.224  0.302  0.224  0.304  0.220  0.301
  336     0.340  0.375  0.319  0.371  0.292  0.351  0.275  0.337  0.277  0.336  0.282  0.345  0.278  0.342  0.280  0.344
  720     0.424  0.430  0.405  0.420  0.381  0.412  0.364  0.392  0.375  0.408  0.410  0.430  0.396  0.423  0.406  0.429
In this section, we provide more details on the models compared in the ablation studies. Unless otherwise stated, we perform the same hyperparameter tuning for all models in the ablation studies, and use the same standard hyperparameters, such as the number of layers and layer size.
Removing the ridge regressor module refers to replacing it with a simple linear layer, , where , . This corresponds to a straightforward INR, trained across all lookback-horizon pairs in the dataset.
For models marked “Local”, we similarly remove the ridge regressor module and replace it with a linear layer. However, the model is not trained across all lookback-horizon pairs in the dataset. Instead, for each lookback-horizon pair in the validation/test set, we fit the model to the lookback window via gradient descent, and then perform prediction on the horizon to obtain the forecasts. A new model is trained from scratch for each lookback-horizon window. We tune one extra hyperparameter, the number of epochs of gradient descent, for which we search through .
As each dataset comes with timestamps for each observation, we are able to construct datetime features from these timestamps. We construct the following features:
Quarter-of-year
Month-of-year
Week-of-year
Day-of-year
Day-of-month
Day-of-week
Hour-of-day
Minute-of-hour
Second-of-minute
Each feature is initially an integer value, e.g. month-of-year can take on values in , which we subsequently normalize to a range. Depending on the data sampling frequency, the appropriate features can be chosen. For the ETTm2 dataset, we used all features except second-of-minute, since it is sampled at a 15-minute frequency.
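A sketch of this feature construction using Python's standard library follows; the paper's exact normalized range is elided above, so mapping each feature into [0, 1] by dividing by its maximum value is an assumption here.

```python
from datetime import datetime

def datetime_features(ts: datetime) -> dict:
    """Map a timestamp to normalized datetime features.

    Each raw feature is an integer (e.g. month-of-year in 1..12); here it
    is divided by its maximum value to land in [0, 1] (assumed range).
    """
    return {
        "quarter_of_year": ((ts.month - 1) // 3 + 1) / 4,
        "month_of_year": ts.month / 12,
        "week_of_year": ts.isocalendar()[1] / 53,
        "day_of_year": ts.timetuple().tm_yday / 366,
        "day_of_month": ts.day / 31,
        "day_of_week": (ts.weekday() + 1) / 7,
        "hour_of_day": ts.hour / 24,
        "minute_of_hour": ts.minute / 60,
        "second_of_minute": ts.second / 60,
    }
```

For a 15-minute-frequency dataset such as ETTm2, one would drop `second_of_minute` before feeding these values to the model.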
For all models in this section, we retain the differentiable closed-form ridge regressor, in order to identify the effects of the backbone model used.
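The closed-form solve at the heart of this setup can be sketched as follows; NumPy is shown for brevity, whereas in the actual model the solve is expressed with differentiable tensor operations so that gradients flow through it during meta-training.

```python
import numpy as np

def ridge_regressor(features, targets, lam=0.5):
    """Closed-form ridge regression: W = (Z^T Z + lam * I)^{-1} Z^T Y.

    `features` are the backbone's representations of the lookback-window
    time coordinates, `targets` the corresponding observed values; the
    returned weights are then applied to the horizon coordinates' features
    to produce the forecast.
    """
    d = features.shape[1]
    gram = features.T @ features + lam * np.eye(d)   # regularized Gram matrix
    return np.linalg.solve(gram, features.T @ targets)
```

With a vanishing regularization coefficient this reduces to ordinary least squares; larger coefficients trade fit for stability, which is why the coefficient is kept learnable.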
The random Fourier features layer is a mapping from coordinate space to latent space . To remove the effects of the random Fourier features layer, we simply replace it with a linear map, .
We replace the random Fourier features backbone with the SIREN model introduced by Sitzmann et al. [2020b], which uses periodic activation functions, i.e. , along with a specified weight initialization scheme.
We use a 2-layer LSTM with a hidden size of 256. Inputs are observations, , fed in an autoregressive fashion, predicting the next time step, .
We use a model with 2 encoder and 2 decoder layers with a hidden size of 512, as specified in their original papers.
We use an N-BEATS model with 3 stacks and 3 layers (relatively small compared to the 30 stacks and 4 layers used in their original paper, see https://github.com/ElementAI/NBEATS/blob/master/experiments/electricity/generic.gin), with a hidden size of 512. Note that N-BEATS is a univariate model, and the values presented here are multiplied by a factor of to account for the multivariate data. Another dimension of comparison is the number of parameters used in each model. As demonstrated in Table 9, the number of parameters in fully connected models like N-BEATS scales linearly with the lookback window and forecast horizon lengths, while for Transformer-based models and DeepTIMe, the number of parameters remains constant.
We use an N-HiTS model with hyperparameters as suggested in their original paper (3 stacks, 1 block in each stack, 2 MLP layers, 512 hidden size). For the following hyperparameters which were not specified (subject to hyperparameter tuning), we set the pooling kernel size to , and the number of stack coefficients to . Similar to N-BEATS, N-HiTS is a univariate model, and values were multiplied by a factor of to account for the multivariate data.
Methods     Autoformer    N-HiTS        DeepTIMe
Lookback
  48        10,535,943    927,942       1,314,561
  96        10,535,943    1,038,678     1,314,561
  168       10,535,943    1,204,782     1,314,561
  336       10,535,943    1,592,358     1,314,561
  720       10,535,943    2,478,246     1,314,561
  1440      10,535,943    4,139,286     1,314,561
  2880      10,535,943    7,461,366     1,314,561
  5760      10,535,943    14,105,526    1,314,561
Horizon
  48        10,535,943    927,942       1,314,561
  96        10,535,943    955,644       1,314,561
  168       10,535,943    997,197       1,314,561
  336       10,535,943    1,094,154     1,314,561
  720       10,535,943    1,315,770     1,314,561
  1440      10,535,943    1,731,300     1,314,561
  2880      10,535,943    2,562,360     1,314,561
  5760      10,535,943    4,224,480     1,314,561
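The linear-vs-constant scaling in the table above can be illustrated with a toy parameter-count sketch; the layer sizes are illustrative, not the exact configurations of the compared models.

```python
def fc_forecaster_params(lookback, horizon, hidden=512, layers=3):
    """Parameter count of a fully connected forecaster that maps the
    flattened lookback window directly to the horizon (N-BEATS-like):
    the first and last layers grow with the window lengths."""
    sizes = [lookback] + [hidden] * layers + [horizon]
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))

def time_index_params(coord_dim=1, hidden=256, layers=5, out_dim=1):
    """Parameter count of a time-index model: inputs are time coordinates,
    so the count is independent of lookback and horizon lengths."""
    sizes = [coord_dim] + [hidden] * layers + [out_dim]
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))
```

Doubling the lookback adds exactly `delta_lookback * hidden` weights to the fully connected model, while the time-index model's count never changes.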
In this section, we derive a meta-learning generalization bound for DeepTIMe under the PAC-Bayes framework [Shalev-Shwartz and Ben-David, 2014]. Our formulation follows [Amit and Meir, 2018] and assumes that all tasks share the same hypothesis space , sample space , and loss function . We observe tasks in the form of sample sets . The number of samples in each task is . Each dataset is assumed to be generated i.i.d. from an unknown sample distribution . Each task's sample distribution is in turn generated i.i.d. from an unknown meta distribution, . In particular, we have , where . Here, is the time coordinate, and is the time-series value. For any forecaster parameterized by , we define the loss function . We also define as the prior distribution over , and as the posterior over for each task. In the meta-learning setting, we assume a hyper-prior , a prior distribution over priors; the meta-learner observes a sequence of training tasks, and then outputs a distribution over priors, called the hyper-posterior .
Consider the meta-learning framework. Given the hyper-prior , for any hyper-posterior , any and any , the following holds with probability :
(5)
Our proof contains two steps. First, we bound the error within the observed tasks due to observing a limited number of samples. Then, we bound the error at the task-environment level due to observing a finite number of tasks. Both steps utilize Catoni's classical PAC-Bayes bound [Catoni, 2007] to measure the error, which we restate here.
(Catoni’s bound [Catoni, 2007]) Let be a sample space, a distribution over , and a hypothesis space. Given a loss function and a collection of M i.i.d. random variables ( ) sampled from , let be a prior distribution over the hypothesis space. Then, for any and any real number , the following bound holds uniformly for all posterior distributions over the hypothesis space,

We first utilize Theorem I.2 to bound the generalization error in each of the observed tasks. Let be the index of a task; we define the expected error and empirical error as follows,
(6)  
(7) 
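In Amit and Meir's standard notation (the symbols here are our reconstruction, since they are elided above: $Q_i$ the task posterior, $D_i$ the task distribution, $S_i = \{z_j\}_{j=1}^{m_i}$ the observed sample set), Equations (6) and (7) take the form:

```latex
% Expected error of task i: average loss of hypotheses drawn from Q_i
er_i(Q_i, D_i) \;=\; \mathbb{E}_{h \sim Q_i}\, \mathbb{E}_{z \sim D_i}\, \ell(h, z)

% Empirical counterpart on the observed sample set S_i
\widehat{er}_i(Q_i, S_i) \;=\; \mathbb{E}_{h \sim Q_i}\, \frac{1}{m_i} \sum_{j=1}^{m_i} \ell(h, z_j)
```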
Then, according to Theorem I.2, for any and , we have
(8) 
Next, we bound the error due to observing a limited number of tasks from the environment. Similarly, we define the expected task error as follows,
(9) 
We then define the error across the observed tasks,
(10) 
Then, by Theorem I.2, the following holds for any and ,
(11) 
Finally, by employing the union bound, we can bound the probability of the intersection of the events in Equation 11 and Equation 8. For any , set and for ,
(12) 
∎
Theorem I.1 shows that the expected task generalization error is bounded by the empirical multi-task error plus two complexity terms. The first term represents the complexity of the environment, or equivalently, the time-series dataset, and converges to zero if we observe an infinitely long time-series (). The second term represents the complexity of the observed tasks, or equivalently, the lookback-horizon windows, and converges to zero when there is a sufficient number of time steps in each window ().