Despite being versatile and omnipresent, traditional time series forecasting methods, such as those in (Hyndman and Athanasopoulos, 2018), typically provide univariate point forecasts. Training in such frameworks requires learning one model per individual time series, which might not scale to large datasets. Global deep learning-based time series models are typically recurrent neural networks (RNN), like LSTM (Hochreiter and Schmidhuber, 1997). These methods have become popular due to their end-to-end training, the ease of incorporating exogenous covariates, and their automatic feature extraction abilities, which are the hallmarks of deep learning.
It is often desirable for the outputs to be probability distributions, in which case forecasts typically provide uncertainty bounds. In the deep learning setting, the two main approaches to estimating uncertainty have been to either model the data distribution explicitly or to use Bayesian Neural Networks as in (Zhu and Laptev, 2017). The former methods rely on some parametric density function, such as that of a Gaussian distribution, whose choice is often based on computational convenience rather than on the underlying distribution of the data.
In this paper, we propose IQN-RNN, a deep-learning-based univariate time series method that learns an implicit distribution over outputs. Importantly, our approach does not make any a priori assumptions on the underlying distribution of our data. The probabilistic output of our model is generated via Implicit Quantile Networks (IQN) (Dabney et al., 2018) and is trained by minimizing the integrand of the Continuous Ranked Probability Score (CRPS) (Matheson and Winkler, 1976).
The major contributions of this paper are:
- model the data distribution using IQNs, which allows the use of a broad class of datasets;
- model the time series via an autoregressive RNN where the emission distribution is given by an IQN;
- demonstrate competitive results on real-world datasets, in particular when compared to RNN-based probabilistic univariate time series models.
2.1 Quantile Regression
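The quantile function corresponding to a cumulative distribution function (c.d.f.) $F: \mathbb{R} \to [0, 1]$ is defined as:
$$Q(\tau) = \inf\{x \in \mathbb{R} : \tau \le F(x)\}.$$
For a continuous and strictly monotonic c.d.f. one can simply write $Q = F^{-1}$.
In order to find the quantile for a given $\tau$ one can use quantile regression (Koenker, 2005), which minimizes the quantile loss:
$$L_\tau(y, \hat{y}) = \tau (y - \hat{y})_+ + (1 - \tau)(\hat{y} - y)_+,$$
where $(\cdot)_+ = \max(\cdot, 0)$ is non-zero iff the value in the parentheses is positive.
2.2 CRPS
The Continuous Ranked Probability Score (Matheson and Winkler, 1976) of a predictive c.d.f. $F$ given an observation $y$ is:
$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left(F(x) - \mathbb{1}\{y \le x\}\right)^2 \, dx,$$
where $\mathbb{1}$ is the indicator function. The formula can be rewritten (Laio and Tamea, 2007) as an integral over the quantile loss:
$$\mathrm{CRPS}(F, y) = 2 \int_0^1 L_\tau(y, Q(\tau)) \, d\tau,$$
where $Q$ is the quantile function corresponding to $F$.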
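As a minimal sketch of the two quantities discussed in this section, the plain-Python snippet below evaluates the quantile (pinball) loss and approximates the CRPS as twice the integral of that loss over the quantile level, using a midpoint grid. The function names `quantile_loss` and `crps_from_quantile_fn` and the grid size are illustrative assumptions, not part of the paper's code.

```python
from typing import Callable


def quantile_loss(tau: float, y: float, y_hat: float) -> float:
    """Quantile (pinball) loss of predicting y_hat as the tau-quantile of y."""
    diff = y - y_hat
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return tau * max(diff, 0.0) + (1.0 - tau) * max(-diff, 0.0)


def crps_from_quantile_fn(
    quantile_fn: Callable[[float], float], y: float, num_levels: int = 1000
) -> float:
    """Approximate CRPS as 2 * integral over tau in (0, 1) of the quantile loss."""
    # Midpoint rule on an evenly spaced grid of quantile levels.
    taus = [(i + 0.5) / num_levels for i in range(num_levels)]
    total = sum(quantile_loss(tau, y, quantile_fn(tau)) for tau in taus)
    return 2.0 * total / num_levels


# Illustrative check: a Uniform(0, 1) forecast has quantile function Q(tau) = tau.
crps_uniform = crps_from_quantile_fn(lambda tau: tau, y=0.5)
```

For the uniform forecast and observation 0.5, the squared-difference form of the CRPS can be integrated by hand to 1/12, which the grid approximation should reproduce closely.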
3 Related work
Over the last years, deep learning models have shown impressive results over classical methods in many fields (Schmidhuber, 2015; Sutskever et al., 2014). Modern univariate forecasting methods like N-BEATS (Oreshkin et al., 2020) share parameters, are interpretable, and are fast to train on many target domains.
To estimate the underlying temporal distribution we can learn the parameters of some target distribution as in the DeepAR method (Salinas et al., 2019b) or use mixture density models (McLachlan and Basford, 1988) operating on neural network outputs, called mixture density networks (MDN) (Bishop, 1971). One prominent example is MDRNN (Graves, 2013), which uses a mixture-density RNN to model handwriting. These approaches assume some parametric distribution, based on the data being modeled, for example, a Negative Binomial distribution for count data. Models are trained by maximizing the likelihood of these distributions with respect to their predicted parameters and ground truth data.
Our approach is closely related to the SQF-RNN (Gasthaus et al., 2019), which models the conditional quantile function using isotonic splines. As we will detail, we instead utilize an IQN (Dabney et al., 2018; Yang et al., 2019), which has been used in the context of Distributional Reinforcement Learning (Bellemare et al., 2017), as well as for Generative Modelling (Ostrovski et al., 2018).
4 Forecasting with Implicit Quantile Networks
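In a univariate time series forecasting setting, we typically aim at forecasting a subseries $(x_{T-h+1}, \ldots, x_T)$ of length $h$ from a series $x = (x_1, \ldots, x_T)$ of length $T$, generated by an auto-regressive process $X = (X_1, \ldots, X_T)$.
Let $\tau_1 = \mathbb{P}[X_1 \le x_1]$ and, for each integer $t \in [2, T]$, $\tau_t = \mathbb{P}[X_t \le x_t \mid X_1 = x_1, \ldots, X_{t-1} = x_{t-1}]$. We can rewrite $x$ as $(F^{-1}_{X_1}(\tau_1), F^{-1}_{X_2 \mid X_1}(\tau_2), \ldots, F^{-1}_{X_T \mid X_1, \ldots, X_{T-1}}(\tau_T))$, where $F$ is the c.d.f.
In probabilistic time series forecasting, it is typically assumed that a unique function $g$ can represent the distribution of all $x_t$, given input covariates $c_t$ and previous observations $x_1, \ldots, x_{t-1}$. When using IQNs, we additionally parameterize this function with $\tau_t$: IQNs learn the mapping from $\tau_t \sim U([0, 1])$ to $x_t$. In other words, they are deterministic parametric functions trained to reparameterize samples from the uniform distribution to respective quantile values of a target distribution. Our IQN-RNN model should then learn $x_t = g(c_t, \tau_t, x_{t-1}, \ldots, x_1)$ for $t \in [T-h+1, T]$ and can be written in terms of $\psi_t \odot (1 + \phi(\tau_t))$, where $\odot$ is the Hadamard element-wise product and:
- $c_t$ are typically time-dependent features, known for all time steps;
- $\psi_t$ is the state of an RNN that takes the concatenation of $c_t$ and $x_{t-1}$, as well as the previous state $\psi_{t-1}$, as input;
- $\phi$ embeds a $\tau_t$ as described by (Dabney et al., 2018), with $n = 64$:
$$\phi_j(\tau_t) = \mathrm{ReLU}\left(\sum_{i=0}^{n-1} \cos(\pi i \tau_t)\, w_{ij} + b_j\right).$$
We perform the Hadamard operation on $(1 + \phi)$, which is one of the forms considered by (Dabney et al., 2018).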
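To make the quantile embedding concrete, here is a small plain-Python sketch of the cosine embedding of Dabney et al. (2018), phi_j(tau) = ReLU(sum_i cos(pi * i * tau) * w_ij + b_j), followed by the element-wise modulation psi * (1 + phi) of a recurrent state. The dimensions, weights, and function names are hypothetical toy values chosen for readability; a real model would learn the weights and apply a further output layer, which is not shown.

```python
import math
from typing import List


def cosine_embedding(
    tau: float, weights: List[List[float]], biases: List[float]
) -> List[float]:
    """phi_j(tau) = ReLU(sum_i cos(pi * i * tau) * weights[i][j] + biases[j])."""
    n = len(weights)  # number of cosine basis functions
    basis = [math.cos(math.pi * i * tau) for i in range(n)]
    out = []
    for j, bias in enumerate(biases):
        pre_activation = sum(basis[i] * weights[i][j] for i in range(n)) + bias
        out.append(max(pre_activation, 0.0))  # ReLU
    return out


def modulate_state(psi: List[float], phi: List[float]) -> List[float]:
    """Hadamard product psi * (1 + phi): the embedding rescales the RNN state."""
    return [p * (1.0 + f) for p, f in zip(psi, phi)]


# Toy setup (hypothetical): n = 3 cosine features, hidden size 2.
weights = [[0.5, -1.0], [0.25, 0.0], [0.25, 0.5]]
biases = [0.0, 0.1]
psi = [2.0, -4.0]

phi = cosine_embedding(0.0, weights, biases)  # cos(0) = 1 for every basis term
state = modulate_state(psi, phi)
```

Because of the `1 + phi` form, a zero embedding leaves the recurrent state unchanged, so the quantile level acts as a multiplicative correction on top of the state.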
During inference, we analogously sample a new quantile for each time step of our autoregressive loop. Thus, a full single forward pass follows an ancestral sampling scheme along the graph of our probabilistic network. This approach is guaranteed to produce valid samples from the underlying model. Sampling a larger number of trajectories this way allows us to estimate statistics over the distribution of each observation, such as the mean, quantiles, and standard deviation. For instance, the mean of the first forecast value can be estimated using the average over the first step of all sampled trajectories. In our experiments, we choose 100 samples (in parallel via the batch dimension) when calculating metrics and empirical quantiles. This strategy also addresses potential quantile-crossing issues, since nothing in the IQN-RNN architecture guarantees monotonicity with respect to the quantile level $\tau$: we simply compute the quantiles from sampled values.
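The sampling procedure can be sketched as follows. The function `toy_conditional_quantile` is a hypothetical stand-in for the trained network (a closed-form Gaussian AR(1) quantile function, assumed purely for illustration); the loop draws a fresh quantile level at every step, feeds each sample back as the next input, and then reads the mean and empirical quantiles off 100 sampled trajectories.

```python
import random
from statistics import NormalDist, mean
from typing import List


def toy_conditional_quantile(prev_value: float, tau: float) -> float:
    """Hypothetical stand-in for the trained model.

    The conditional distribution is Normal(0.8 * prev_value, 1), so its
    quantile function is available in closed form.
    """
    return 0.8 * prev_value + NormalDist().inv_cdf(tau)


def sample_trajectory(last_obs: float, horizon: int, rng: random.Random) -> List[float]:
    """Ancestral sampling: new tau at every step, sample fed back as input."""
    path, prev = [], last_obs
    for _ in range(horizon):
        # Keep tau strictly inside (0, 1) so the inverse c.d.f. is defined.
        tau = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
        prev = toy_conditional_quantile(prev, tau)
        path.append(prev)
    return path


def empirical_quantile(values: List[float], level: float) -> float:
    """Empirical quantile taken from the sorted samples (simple order statistic)."""
    ordered = sorted(values)
    index = min(int(level * len(ordered)), len(ordered) - 1)
    return ordered[index]


rng = random.Random(0)
paths = [sample_trajectory(last_obs=1.0, horizon=4, rng=rng) for _ in range(100)]

# Statistics of the first forecast step across all sampled trajectories.
first_step = [p[0] for p in paths]
first_step_mean = mean(first_step)
q10, q50, q90 = (empirical_quantile(first_step, q) for q in (0.1, 0.5, 0.9))
```

Since the quantiles are order statistics of the samples, they are monotone in the level by construction, whatever the network outputs for individual quantile levels.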
We evaluate our IQN-RNN model on synthetic and open datasets and follow the recommendations of the M4 competition (Makridakis et al., 2020) regarding performance metrics. We report the mean scaled interval score (MSIS; https://bit.ly/3c7ffmS) for a 95% prediction interval, the quantile loss at the 50th and 90th percentiles, and the CRPS. The point-wise performance of models is measured by the normalized root mean square error (NRMSE), the mean absolute scaled error (MASE) (Hyndman and Koehler, 2006), and the symmetric mean absolute percentage error (sMAPE) (Makridakis, 1993). For point-wise metrics, we use sampled medians with the exception of NRMSE, where we take the mean over our prediction sample.
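For orientation, the scaled metrics can be sketched in plain Python as below. The definitions follow the usual M4-style conventions as we understand them (scaling by the in-sample seasonal naive error, sMAPE reported as a fraction); the evaluation code used for the experiments may differ in details such as the seasonality `m` or the aggregation across series, so the function names and defaults here are assumptions.

```python
from typing import List


def seasonal_naive_scale(train: List[float], m: int = 1) -> float:
    """Mean absolute error of the in-sample seasonal naive forecast (lag m)."""
    errors = [abs(train[t] - train[t - m]) for t in range(m, len(train))]
    return sum(errors) / len(errors)


def smape(y: List[float], y_hat: List[float]) -> float:
    """Symmetric MAPE as a fraction; assumes |y| + |y_hat| is never zero."""
    terms = [2.0 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(y, y_hat)]
    return sum(terms) / len(terms)


def mase(y: List[float], y_hat: List[float], train: List[float], m: int = 1) -> float:
    """Mean absolute error scaled by the seasonal naive error on the training data."""
    mae = sum(abs(a - f) for a, f in zip(y, y_hat)) / len(y)
    return mae / seasonal_naive_scale(train, m)


def msis(
    y: List[float],
    lower: List[float],
    upper: List[float],
    train: List[float],
    alpha: float = 0.05,
    m: int = 1,
) -> float:
    """Mean scaled interval score for a (1 - alpha) prediction interval."""
    total = 0.0
    for a, lo, hi in zip(y, lower, upper):
        total += hi - lo                           # interval width
        total += (2.0 / alpha) * max(lo - a, 0.0)  # observation below the interval
        total += (2.0 / alpha) * max(a - hi, 0.0)  # observation above the interval
    return total / len(y) / seasonal_naive_scale(train, m)


# Small illustrative inputs (made up for the example).
train = [1.0, 2.0, 4.0, 7.0]
actual, median_forecast = [10.0, 15.0], [8.0, 15.0]
lower_95, upper_95 = [9.0, 11.0], [11.0, 14.0]
scores = (
    smape(actual, median_forecast),
    mase(actual, median_forecast, train),
    msis(actual, lower_95, upper_95, train),
)
```

With `alpha = 0.05`, an observation outside the 95% interval is penalized 40 times its distance to the nearest bound, which is why MSIS is sensitive to overconfident intervals.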
The code for our model is available in the PyTorchTS (Rasul, 2021) library.
5.1 Results on synthetic data
We first evaluate our IQN-RNN on synthetic data and compare its performance with another non-parametric probabilistic forecast model: SQF-RNN (Gasthaus et al., 2019). In order to minimize the MSIS and CRPS, we use this model with 50 linear pieces. Both models have the same RNN architecture and the same hyperparameters for training. Only the probabilistic head is distinct.
In a similar fashion to (Gasthaus et al., 2019), we generate 10,000 time series of 48 points, where each time step is i.i.d. and follows a Gaussian Mixture Model. The models are trained 5 times each for 20 epochs, with a context length of 15 and a prediction window of 2.
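A data set of this shape could be generated along the following lines. The mixture weights, means, and standard deviations below are hypothetical placeholders (they are not the values used in the experiment), independent draws at each time step are assumed, and the way each series is cut into a context and a prediction window is only one plausible reading of the setup.

```python
import random
from typing import List, Sequence


def sample_gmm_series(
    length: int,
    weights: Sequence[float],
    means: Sequence[float],
    stds: Sequence[float],
    rng: random.Random,
) -> List[float]:
    """Draw `length` independent points from a univariate Gaussian mixture."""
    # First pick a mixture component for every time step, then sample from it.
    components = rng.choices(range(len(weights)), weights=weights, k=length)
    return [rng.gauss(means[c], stds[c]) for c in components]


# Hypothetical mixture parameters, chosen only for illustration.
WEIGHTS = (0.3, 0.4, 0.3)
MEANS = (-3.0, 0.0, 3.0)
STDS = (0.4, 0.4, 0.4)

rng = random.Random(42)
dataset = [sample_gmm_series(48, WEIGHTS, MEANS, STDS, rng) for _ in range(10_000)]

# One possible context / prediction split of a single series.
CONTEXT_LENGTH, PREDICTION_LENGTH = 15, 2
context = dataset[0][:CONTEXT_LENGTH]
target = dataset[0][CONTEXT_LENGTH:CONTEXT_LENGTH + PREDICTION_LENGTH]
```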
We show the estimated quantile functions in Figure 4 and report the average metrics in Table 1. While both methods have a similar point-wise performance, IQN-RNN is better at estimating the entire probability distribution.
5.2 Results on empirical data
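Table 3: Properties of the datasets.
| Data set | Num. | Dom. | Freq. | Time steps | Pred. steps |

Table 4: Hyperparameters used for training.
| Hyperparameter | Value |
| rnn_cell_type | GRU (Chung et al., 2014) |
| context_length | 2 * pred_steps |
| optim | Adam (Kingma and Ba, 2015) |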
We next evaluate our model on open source datasets for univariate time series: Electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014), Traffic (https://archive.ics.uci.edu/ml/datasets/PEMS-SF), and Wikipedia (https://github.com/mbohlkeschneider/gluon-ts/tree/mv_release/datasets), preprocessed exactly as in (Salinas et al., 2019a), with their properties listed in Table 3. Our model is trained on the training split of each dataset. For testing, we use a rolling-window prediction starting from the last point seen in the training dataset and compare it to the test set.
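The rolling-window evaluation can be pictured with the following sketch, which slides a forecast window of `pred_steps` points forward from the end of the training range and pairs each window with the context that precedes it. The helper name `rolling_windows`, the non-overlapping stride, and the toy series are assumptions made for illustration; the context length of `2 * pred_steps` mirrors the hyperparameter table.

```python
from typing import List, Tuple


def rolling_windows(
    series: List[float],
    train_length: int,
    context_length: int,
    pred_steps: int,
    num_windows: int,
) -> List[Tuple[List[float], List[float]]]:
    """Return (context, target) pairs for consecutive forecast windows."""
    windows = []
    for k in range(num_windows):
        start = train_length + k * pred_steps  # index of the first forecast point
        end = start + pred_steps
        if end > len(series):
            break  # not enough test data left for a full window
        context = series[max(0, start - context_length):start]
        windows.append((context, series[start:end]))
    return windows


# Toy series 0, 1, ..., 19 with the first 12 points used for training.
series = [float(t) for t in range(20)]
pred_steps = 2
windows = rolling_windows(
    series,
    train_length=12,
    context_length=2 * pred_steps,
    pred_steps=pred_steps,
    num_windows=4,
)
```

Each later window conditions on observations that were targets of earlier windows, so the test set is consumed progressively rather than forecast in one shot.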
For comparison, we again use SQF-RNN (Gasthaus et al., 2019) with 50 linear pieces. We also evaluate DeepAR (Salinas et al., 2019b) with a Student-t or a Negative Binomial distribution, depending on the domain of the dataset. Since IQN-RNN, SQF-RNN, and DeepAR share the same RNN architecture, we compare these models using the same untuned, but recommended, hyperparameters (see Table 4) for training: only the probabilistic heads differ. Thus, deviations from performance reported in the original publications are solely due to the number of epochs used for training. Alternative models are trained on the same instances, consume a similar amount of memory, and need similar training time. We also use ETS (Hyndman and Khandakar, 2008) as a comparison. ETS is an exponential smoothing method that uses weighted averages of past observations, with weights that decay exponentially as the observations get older, together with Gaussian additive errors (E), and models trend (T) and seasonality (S) effects separately.
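The "exponentially decaying weights" idea behind ETS can be illustrated with simple exponential smoothing, sketched below. This covers only the level component, with no trend, seasonality, or error model, and is not the ETS implementation used in the experiments; the function names and the smoothing parameter are illustrative assumptions.

```python
from typing import List


def simple_exponential_smoothing(series: List[float], alpha: float) -> float:
    """Return the smoothed level after the last observation.

    The level is initialised at the first observation and updated as
    level = alpha * value + (1 - alpha) * level.
    """
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1.0 - alpha) * level
    return level


def decay_weights(alpha: float, num_lags: int) -> List[float]:
    """Weight given to the observation k steps in the past: alpha * (1 - alpha)**k."""
    return [alpha * (1.0 - alpha) ** k for k in range(num_lags)]


level = simple_exponential_smoothing([1.0, 2.0, 3.0], alpha=0.5)
weights = decay_weights(alpha=0.5, num_lags=3)
```

The one-step-ahead point forecast of this simplified model is the final level, and the weights show how quickly older observations stop influencing it.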
In Table 2 we report probabilistic and point-wise performance metrics of all models. We found that using IQN-RNN often leads to the best performance on both probabilistic and point-wise metrics while being fully non-parametric and without significantly increasing the number of parameters of the RNN model. We also note that the resulting performance on point-forecasting metrics does not come at the cost of higher errors on our probabilistic measures (unlike, e.g., DeepAR). We did not incorporate per-time-series embeddings as covariates in any of our experiments.
In this work, we proposed a general method of probabilistic time series forecasting by using IQNs to learn the quantile function of the next time point. We demonstrated the performance of our approach against competitive probabilistic methods on real-world datasets.
Our framework can be easily extended to multivariate time series, under the rather restrictive hypothesis that we observe the same quantile for individual univariate series. This is equivalent to assuming comonotonicity of the processes for each time step. Relaxing this assumption is left to future research.
K.R.: I acknowledge the traditional owners of the land on which I have lived and worked, the Wurundjeri people of the Kulin nation who have been custodians of their land for thousands of years. I pay my respects to their elders, past and present as well as past and present aboriginal elders of other communities.
References
- GluonTS: Probabilistic and Neural Time Series Modeling in Python. Journal of Machine Learning Research 21 (116), pp. 1–6.
- A distributional perspective on reinforcement learning. D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 449–458.
- Pattern recognition and machine learning. Springer US.
- Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
- Implicit Quantile Networks for Distributional Reinforcement Learning. J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1096–1105.
- Pandas-dev/pandas: pandas.
- Probabilistic Forecasting with Spline Quantile Function RNNs. K. Chaudhuri and M. Sugiyama (Eds.), Proceedings of Machine Learning Research, Vol. 89, pp. 1901–1910.
- Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 (477), pp. 359–378.
- Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
- Array programming with NumPy. Nature 585 (7825), pp. 357–362.
- Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780.
- Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9 (3), pp. 90–95.
- Forecasting: Principles and practice. OTexts.
- Automatic time series forecasting: The forecast package for R. J. Stat. Soft. 27 (3), pp. 1–22.
- Another look at measures of forecast accuracy. International Journal of Forecasting 22 (4), pp. 679–688.
- Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
- Quantile regression. Econometric Society Monographs, Cambridge University Press.
- Verification tools for probabilistic forecasts of continuous hydrological variables. Hydrology and Earth System Sciences 11 (4), pp. 1267–1277.
- The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting 36 (1), pp. 54–74.
- Accuracy measures: theoretical and practical concerns. International Journal of Forecasting 9 (4), pp. 527–529.
- Scoring rules for continuous probability distributions. Manage. Sci. 22 (10), pp. 1087–1096.
- Mixture models: Inference and applications to clustering. Marcel Dekker, New York.
- N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations.
- Autoregressive quantile networks for generative modeling. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 3936–3945.
- PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8026–8037.
- PyTorchTS.
- High-dimensional multivariate forecasting with low-rank Gaussian copula processes. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 6824–6834.
- DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecasting.
- Deep learning in neural networks: An overview. Neural Networks 61, pp. 85–117.
- Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger (Eds.), pp. 3104–3112.
- Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008.
- Fully Parameterized Quantile Function for Distributional Reinforcement Learning. In Advances in Neural Information Processing Systems, pp. 6190–6199.
- Deep and confident prediction for time series at uber. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 103–110.