1 Introduction
Despite being versatile and omnipresent, traditional time series forecasting methods, such as those in (Hyndman and Athanasopoulos, 2018), typically provide univariate point forecasts. Training in such frameworks requires learning one model per individual time series, which might not scale to large datasets. Global deep learning-based time series models are typically recurrent neural networks (RNN), like LSTM (Hochreiter and Schmidhuber, 1997). These methods have become popular due to their end-to-end training, the ease of incorporating exogenous covariates, and their automatic feature extraction abilities, which are the hallmarks of deep learning.

It is often desirable for the outputs to be probability distributions, in which case forecasts come with uncertainty bounds. In the deep learning setting, the two main approaches to estimating uncertainty have been either to model the data distribution explicitly or to use Bayesian Neural Networks, as in (Zhu and Laptev, 2017). The former methods rely on some parametric density function, such as that of a Gaussian distribution, which is often chosen for computational convenience rather than to match the underlying distribution of the data.

In this paper, we propose IQN-RNN, a deep learning-based univariate time series method that learns an implicit distribution over outputs. Importantly, our approach does not make any a priori assumptions about the underlying distribution of the data. The probabilistic output of our model is generated via Implicit Quantile Networks (IQN) (Dabney et al., 2018) and is trained by minimizing the integrand of the Continuous Ranked Probability Score (CRPS) (Matheson and Winkler, 1976).

The major contributions of this paper are:

model the data distribution using IQNs, which allows the use of a broad class of datasets;

model the time series via an autoregressive RNN where the emission distribution is given by an IQN;

demonstrate competitive results on real-world datasets, in particular when compared to RNN-based probabilistic univariate time series models.
2 Background
2.1 Quantile Regression
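The quantile function corresponding to a cumulative distribution function (c.d.f.) $F: \mathbb{R} \to [0, 1]$ is defined as:

$$Q(\tau) = \inf \{ x \in \mathbb{R} : \tau \le F(x) \}.$$

For a continuous and strictly monotonic c.d.f., one can simply write $Q = F^{-1}$.
In order to find the quantile for a given $\tau$, one can use quantile regression (Koenker, 2005), which minimizes the quantile loss:

$$L_\tau(y, \hat{y}) = \tau (y - \hat{y})_+ + (1 - \tau)(\hat{y} - y)_+, \tag{1}$$

where $(\cdot)_+$ is nonzero iff the value in the parentheses is positive, i.e. $(x)_+ = \max(x, 0)$.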
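As a minimal sketch of the two definitions above, the following plain-Python snippet implements the quantile loss (1) and the generalized inverse of an empirical c.d.f., and checks on a toy sample that the constant minimizing the average quantile loss is the corresponding empirical quantile. The function names and the toy data are illustrative assumptions, not part of our code base.

```python
def quantile_loss(y, y_hat, tau):
    """Quantile loss (1): tau * (y - y_hat)_+ + (1 - tau) * (y_hat - y)_+."""
    return tau * max(y - y_hat, 0.0) + (1.0 - tau) * max(y_hat - y, 0.0)


def empirical_quantile(values, tau):
    """Generalized inverse of the empirical c.d.f.: inf{x : tau <= F(x)}."""
    ordered = sorted(values)
    n = len(ordered)
    for i, x in enumerate(ordered, start=1):
        if tau <= i / n:  # the empirical c.d.f. equals i / n at the i-th order statistic
            return x
    return ordered[-1]


def best_constant(values, tau):
    """Brute-force minimizer of the average quantile loss over the data points."""
    return min(values, key=lambda c: sum(quantile_loss(y, c, tau) for y in values))


# Toy data (an assumption for illustration): the integers 1..10.
data = [float(v) for v in range(1, 11)]
tau = 0.75
minimizer = best_constant(data, tau)
quantile = empirical_quantile(data, tau)
```

The level 0.75 is chosen so that the minimizer is unique; at levels where $n\tau$ is an integer (for example the median of an even-sized sample), a whole interval of constants attains the minimum.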
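2.2 CRPS
The Continuous Ranked Probability Score (Matheson and Winkler, 1976; Gneiting and Raftery, 2007) is a proper scoring rule that evaluates a predictive c.d.f. $F$ given the observation $y$:

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(x) - \mathbb{1}\{y \le x\} \right)^2 dx,$$

where $\mathbb{1}$ is the indicator function. The formula can be rewritten (Laio and Tamea, 2007) as an integral over the quantile loss:

$$\mathrm{CRPS}(F, y) = 2 \int_0^1 L_\tau(y, Q(\tau)) \, d\tau,$$

where $L_\tau$ is the quantile loss (1) and $Q$ is the quantile function corresponding to $F$.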
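The equivalence between the two forms of the CRPS can be checked numerically. The sketch below evaluates both for the empirical c.d.f. of a small sample; the sample, the observation, and the function names are assumptions made only for illustration. For an empirical distribution the quantile function is piecewise constant and the quantile loss is linear in the quantile level, so a midpoint rule over $n$ equal pieces integrates the second form exactly.

```python
def quantile_loss(y, y_hat, tau):
    """Quantile loss (1)."""
    return tau * max(y - y_hat, 0.0) + (1.0 - tau) * max(y_hat - y, 0.0)


def crps_from_quantiles(samples, y):
    """2 * integral over tau of the quantile loss, for the empirical c.d.f.

    The empirical quantile function equals the i-th order statistic on
    ((i - 1) / n, i / n], so the midpoint tau_i = (i - 0.5) / n is exact.
    """
    ordered = sorted(samples)
    n = len(ordered)
    total = sum(quantile_loss(y, q, (i + 0.5) / n) for i, q in enumerate(ordered))
    return 2.0 * total / n


def crps_from_cdf(samples, y):
    """Integral of (F(x) - 1{y <= x})^2 dx for the empirical c.d.f. F."""
    points = sorted(set(samples) | {y})
    n = len(samples)
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf = sum(s <= left for s in samples) / n  # F is constant on [left, right)
        step = 1.0 if y <= left else 0.0           # so is the indicator 1{y <= x}
        total += (cdf - step) ** 2 * (right - left)
    return total


samples = [1.5, -0.3, 0.8, 2.2, 0.1]  # toy forecast sample (assumed)
observation = 0.6
crps_a = crps_from_quantiles(samples, observation)
crps_b = crps_from_cdf(samples, observation)
```

Both functions return the same value up to floating-point error, which is the identity used to motivate training on the quantile loss.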
3 Related work
Over the last years, deep learning models have shown impressive results compared with classical methods in many fields (Schmidhuber, 2015), such as computer vision, speech recognition, natural language processing (NLP), and also time series forecasting, which is related to sequence modeling in NLP (Sutskever et al., 2014). Modern univariate forecasting methods like N-BEATS (Oreshkin et al., 2020) share parameters, are interpretable, and are fast to train on many target domains.

To estimate the underlying temporal distribution, we can learn the parameters of some target distribution, as in the DeepAR method (Salinas et al., 2019b), or use mixture density models (McLachlan and Basford, 1988) operating on neural network outputs, called mixture density networks (MDN) (Bishop, 1971). One prominent example is MD-RNN (Graves, 2013), which uses a mixture-density RNN to model handwriting. These approaches assume some parametric distribution chosen according to the data being modeled, for example a Negative Binomial distribution for count data. Models are trained by maximizing the likelihood of these distributions with respect to their predicted parameters and ground truth data.

Our approach is closely related to SQF-RNN (Gasthaus et al., 2019), which models the conditional quantile function using isotonic splines. We instead utilize an IQN (Dabney et al., 2018; Yang et al., 2019), as we will detail, which has been used in the context of Distributional Reinforcement Learning (Bellemare et al., 2017) as well as for Generative Modelling (Ostrovski et al., 2018).
4 Forecasting with Implicit Quantile Networks
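In a univariate time series forecasting setting, we typically aim at forecasting a subseries $(y_{T-h+1}, \ldots, y_T)$ of length $h$ from a series $y = (y_0, y_1, \ldots, y_T)$ of length $T+1$, generated by an autoregressive process $Y = (Y_0, Y_1, \ldots, Y_T)$.
Let $\tau_0 = \mathbb{P}[Y_0 \le y_0]$ and $\tau_t = \mathbb{P}[Y_t \le y_t \mid y_0, \ldots, y_{t-1}]$ for each integer $t \in [1, T]$. We can rewrite $y$ as $(F_{Y_0}^{-1}(\tau_0), F_{Y_1 \mid y_0}^{-1}(\tau_1), \ldots, F_{Y_T \mid y_0, \ldots, y_{T-1}}^{-1}(\tau_T))$, where $F$ is the c.d.f.
In probabilistic time series forecasting, it is typically assumed that a unique function $g$ can represent the distribution of all $Y_t$, given input covariates $X_t$ and previous observations $y_0, \ldots, y_{t-1}$. When using IQNs, we additionally parameterize this function with $\tau_t$: IQNs learn the mapping from $\tau_t \sim \mathrm{U}([0, 1])$ to $y_t$. In other words, they are deterministic parametric functions trained to reparameterize samples from the uniform distribution to the respective quantile values of a target distribution. Our IQN-RNN model should then learn $y_t = g(X_t, \tau_t, y_0, \ldots, y_{t-1})$ for $t \in [T-h+1, T]$ and can be written as $q \circ [\psi_t \odot (1 + \phi(\tau_t))]$, where $\odot$ is the Hadamard element-wise product and:

$X_t$ are typically time-dependent features, known for all time steps;

$\psi_t$ is the state of an RNN that takes the concatenation of $X_t$ and $y_{t-1}$, as well as the previous state $\psi_{t-1}$, as input;

$\phi$ embeds $\tau_t$ as described by (Dabney et al., 2018), with $n$ cosine basis functions:
$$\phi_j(\tau_t) = \mathrm{ReLU}\left( \sum_{i=0}^{n-1} \cos(\pi i \tau_t)\, w_{ij} + b_j \right);$$

$q$ is an additional generator layer, which in our case is a simple two-layer feed-forward neural network, with a domain-relevant activation function.
We perform the Hadamard operation on $\psi_t \odot (1 + \phi(\tau_t))$, which is one of the forms considered by (Dabney et al., 2018).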
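The structure of this probabilistic head can be sketched in a few lines of plain Python. The snippet below is only an illustration under stated assumptions: it writes the RNN state as a list `state`, uses the cosine embedding of the quantile level from (Dabney et al., 2018), applies the multiplicative form state * (1 + embedding), and passes the result through a tiny two-layer feed-forward generator. All sizes, the random weights, and the function names are hypothetical, and the RNN itself is replaced by a random vector.

```python
import math
import random


def cosine_embedding(tau, weights, biases):
    """phi_j(tau) = ReLU(sum_i cos(pi * i * tau) * w[i][j] + b[j])."""
    n = len(weights)
    features = [math.cos(math.pi * i * tau) for i in range(n)]
    return [
        max(0.0, sum(features[i] * weights[i][j] for i in range(n)) + biases[j])
        for j in range(len(biases))
    ]


def modulate(state, embedding):
    """Hadamard product state * (1 + phi(tau)), element by element."""
    return [s * (1.0 + e) for s, e in zip(state, embedding)]


def generator(hidden, w1, b1, w2, b2):
    """Two-layer feed-forward head mapping the modulated state to a scalar."""
    layer = [
        max(0.0, sum(h * w for h, w in zip(hidden, unit)) + b)
        for unit, b in zip(w1, b1)
    ]
    return sum(a * w for a, w in zip(layer, w2)) + b2


rng = random.Random(0)
n, hidden_size = 8, 4  # hypothetical sizes, chosen for readability
weights = [[rng.gauss(0.0, 0.5) for _ in range(hidden_size)] for _ in range(n)]
biases = [0.0] * hidden_size
state = [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]  # stand-in for the RNN state
tau = rng.random()                                         # quantile level in [0, 1)
modulated = modulate(state, cosine_embedding(tau, weights, biases))
w1 = [[rng.gauss(0.0, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
w2 = [rng.gauss(0.0, 0.5) for _ in range(hidden_size)]
prediction = generator(modulated, w1, [0.0] * hidden_size, w2, 0.0)
```

One consequence of the "1 +" term is visible directly: when the embedding is zero, the RNN state passes through unchanged, so the quantile level only rescales the state.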
At training time, quantile levels are sampled for each observation at each time step and passed to both the network and the quantile loss (1) (see Figure 1).
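The training principle can be illustrated on a deliberately tiny problem. In the sketch below, the network is replaced by a hypothetical two-parameter model $Q(\tau) = a + b\tau$, the data are drawn from a uniform distribution on $[0, 1]$ (whose true quantile function is $Q(\tau) = \tau$), and each stochastic gradient step draws a fresh quantile level for the observation it processes. The model, learning rate, and step count are assumptions for illustration only.

```python
import random


def quantile_loss_grad(y, y_hat, tau):
    """Subgradient of the quantile loss (1) with respect to the prediction."""
    return -tau if y > y_hat else 1.0 - tau


def fit_linear_quantile_function(observations, steps=100_000, lr=0.002, seed=0):
    """Fit Q(tau) = a + b * tau by SGD, sampling a new tau for every observation."""
    rng = random.Random(seed)
    a, b = 0.5, 0.0
    for _ in range(steps):
        y = rng.choice(observations)
        tau = rng.random()  # quantile level sampled from U([0, 1))
        grad = quantile_loss_grad(y, a + b * tau, tau)
        a -= lr * grad        # derivative of the prediction w.r.t. a is 1
        b -= lr * grad * tau  # derivative of the prediction w.r.t. b is tau
    return a, b


data_rng = random.Random(1)
observations = [data_rng.random() for _ in range(2_000)]  # y ~ U(0, 1)
a, b = fit_linear_quantile_function(observations)
```

After training, `a` should be close to 0 and `b` close to 1, i.e. the fitted function approximates the quantile function of the data without any assumption on its parametric form beyond the model class.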
During inference, we analogously sample a new quantile level for each time step of our autoregressive loop. Thus, a full single forward pass follows an ancestral sampling scheme along the graph of our probabilistic network. This approach is guaranteed to produce valid samples from the underlying model. Sampling a larger number of trajectories this way allows us to estimate statistics over the distribution of each observation, such as the mean, quantiles, and standard deviation. For instance, the mean of the first forecast step can be estimated using the average over the first step of all sampled trajectories. In our experiments, we choose 100 samples (in parallel via the batch dimension) when calculating metrics and empirical quantiles. This strategy also addresses potential quantile-crossing issues, since nothing in the IQN-RNN architecture guarantees monotonicity with respect to the quantile level: we simply compute the quantiles from sampled values.
5 Experiments
We evaluate our IQN-RNN model on synthetic and open datasets and follow the recommendations of the M4 competition (Makridakis et al., 2020) regarding performance metrics. We report the mean scaled interval score (MSIS; see https://bit.ly/3c7ffmS) for a 95% prediction interval, the quantile loss at the 50th and 90th percentiles (QL50 and QL90), and the CRPS. The pointwise performance of models is measured by the normalized root mean square error (NRMSE), the mean absolute scaled error (MASE) (Hyndman and Koehler, 2006), and the symmetric mean absolute percentage error (sMAPE) (Makridakis, 1993). For pointwise metrics, we use sampled medians, with the exception of NRMSE, where we take the mean over our prediction sample.
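For readers unfamiliar with these metrics, the following sketch spells out sMAPE, MASE, and MSIS in plain Python. It assumes the usual M4-style definitions, with the in-sample error of a seasonal naive forecast as the scaling term and $\alpha = 0.05$ for a 95% interval; the exact conventions of a given evaluation library (seasonality, scaling by 100, aggregation across series) may differ, and the function names are our own.

```python
def smape(actual, forecast):
    """Symmetric MAPE: mean of 2 * |y - f| / (|y| + |f|); undefined if y = f = 0."""
    terms = [2.0 * abs(y - f) / (abs(y) + abs(f)) for y, f in zip(actual, forecast)]
    return sum(terms) / len(terms)


def seasonal_naive_scale(history, season):
    """In-sample mean absolute error of the seasonal naive forecast."""
    diffs = [abs(history[t] - history[t - season]) for t in range(season, len(history))]
    return sum(diffs) / len(diffs)


def mase(actual, forecast, history, season=1):
    """Mean absolute error divided by the seasonal naive scale."""
    mae = sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)
    return mae / seasonal_naive_scale(history, season)


def msis(actual, lower, upper, history, season=1, alpha=0.05):
    """Mean scaled interval score for a (1 - alpha) prediction interval."""
    total = 0.0
    for y, lo, up in zip(actual, lower, upper):
        total += up - lo                         # interval width
        if y < lo:
            total += (2.0 / alpha) * (lo - y)    # penalty for missing below
        if y > up:
            total += (2.0 / alpha) * (y - up)    # penalty for missing above
    return total / len(actual) / seasonal_naive_scale(history, season)


history = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy training series (assumed)
actual = [6.0, 8.0]
forecast = [5.0, 8.0]
scores = {
    "sMAPE": smape(actual, forecast),
    "MASE": mase(actual, forecast, history),
    "MSIS": msis(actual, [5.0, 5.0], [7.0, 7.0], history),
}
```

In the toy example the second observation falls one unit above the interval, which alone contributes a penalty of 2 / 0.05 = 40 to the interval score before averaging.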
The code for our model is available in the PyTorchTS (Rasul, 2021) library.
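Independently of that library, the way sampled trajectories turn into the medians, means, and intervals used by the metrics above can be sketched in plain Python. The snippet assumes a hypothetical stand-in for the trained network, `toy_quantile_function`, which maps the previous value and a quantile level to the next value (here an AR(1) process with Gaussian noise); the coefficient 0.8, the horizon of 24 steps, and the helper names are illustrative assumptions. Trajectories are drawn in a loop rather than in parallel along a batch dimension.

```python
import random
from statistics import NormalDist, mean, median


def toy_quantile_function(previous, tau):
    """Hypothetical stand-in for the trained model: AR(1) with Gaussian noise."""
    tau = min(max(tau, 1e-9), 1.0 - 1e-9)  # keep inv_cdf away from 0 and 1
    return 0.8 * previous + NormalDist(0.0, 1.0).inv_cdf(tau)


def sample_trajectory(last_observation, horizon, rng):
    """Ancestral sampling: a fresh quantile level per step, fed back as input."""
    path, previous = [], last_observation
    for _ in range(horizon):
        previous = toy_quantile_function(previous, rng.random())
        path.append(previous)
    return path


def empirical_quantile(values, level):
    """Order statistic used as the empirical quantile of the sampled values."""
    ordered = sorted(values)
    return ordered[min(int(level * len(ordered)), len(ordered) - 1)]


rng = random.Random(0)
num_samples, horizon = 100, 24
paths = [sample_trajectory(1.0, horizon, rng) for _ in range(num_samples)]
per_step = list(zip(*paths))  # per_step[t] holds the 100 values sampled at step t
mean_forecast = [mean(step) for step in per_step]      # used for NRMSE
median_forecast = [median(step) for step in per_step]  # used for other point metrics
lower = [empirical_quantile(step, 0.025) for step in per_step]
upper = [empirical_quantile(step, 0.975) for step in per_step]
```

Because the interval bounds and the median are read off the same sorted sample at each step, they cannot cross, whatever the model does as a function of the quantile level.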
5.1 Results on synthetic data
We first evaluate our IQN-RNN on synthetic data and compare its performance with another non-parametric probabilistic forecasting model: SQF-RNN (Gasthaus et al., 2019). In order to minimize the MSIS and CRPS, we use this model with 50 linear pieces. Both models have the same RNN architecture and the same hyperparameters for training. Only the probabilistic head is distinct.
In a similar fashion to (Gasthaus et al., 2019), we generate 10,000 time series of 48 points, where each time step is
and follows a Gaussian Mixture Model:
with , ,. Each model is trained 5 times, with a context length of 15 and a prediction window of 2, for 20 epochs.
We show the estimated quantile functions in Figure 4 and report the average metrics in Table 1. While both methods have a similar pointwise performance, IQN-RNN is better at estimating the entire probability distribution.
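A dataset of this shape can be generated as in the sketch below. The mixture weights, means, and standard deviations are placeholders chosen only for illustration and are not the values used in our experiments; the snippet also assumes that the points of each series are drawn independently. Only the dataset dimensions (10,000 series of 48 points) and the split into a context of 15 points and a prediction window of 2 follow the description above.

```python
import random


def sample_gmm_series(length, weights, means, stds, rng):
    """Draw `length` independent points from a Gaussian mixture."""
    series = []
    for _ in range(length):
        k = rng.choices(range(len(weights)), weights=weights)[0]  # pick a component
        series.append(rng.gauss(means[k], stds[k]))
    return series


# Placeholder mixture parameters (assumptions, not the experimental values).
WEIGHTS = [0.5, 0.5]
MEANS = [-2.0, 2.0]
STDS = [0.5, 0.5]

rng = random.Random(0)
dataset = [sample_gmm_series(48, WEIGHTS, MEANS, STDS, rng) for _ in range(10_000)]

# One training window: 15 context points followed by 2 points to predict.
context, target = dataset[0][-17:-2], dataset[0][-2:]
```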
Table 1: Average metrics on the synthetic data.
Method      CRPS   MSIS   sMAPE  MASE
SQF-RNN-50  0.780  3.213  1.756  0.747
IQN-RNN     0.776  3.027  1.754  0.740

5.2 Results on empirical data
Table 2: Probabilistic and pointwise performance metrics of all models on the empirical datasets.
Data set     Method      CRPS   QL50   QL90   MSIS    NRMSE  sMAPE  MASE
Electricity  SQF-RNN-50  0.078  0.097  0.044  8.66    0.632  0.144  1.051
Electricity  DeepAR-t    0.062  0.078  0.046  6.79    0.687  0.117  0.849
Electricity  ETS         0.076  0.100  0.050  9.992   0.838  0.156  1.247
Electricity  IQN-RNN     0.060  0.074  0.040  8.74    0.543  0.138  0.897
Traffic      SQF-RNN-50  0.153  0.186  0.117  8.40    0.401  0.243  0.76
Traffic      DeepAR-t    0.172  0.216  0.117  8.027   0.472  0.244  0.89
Traffic      ETS         0.427  0.488  0.325  20.856  0.872  0.594  1.881
Traffic      IQN-RNN     0.139  0.168  0.117  7.11    0.433  0.171  0.656
Wikipedia    SQF-RNN-50  0.283  0.328  0.321  23.71   2.24   0.261  1.44
Wikipedia    DeepAR-nb   0.452  0.572  0.526  46.79   2.25   0.751  2.94
Wikipedia    DeepAR-t    0.235  0.27   0.267  23.77   2.15   0.21   1.23
Wikipedia    ETS         0.788  0.440  0.836  61.685  3.261  0.301  2.214
Wikipedia    IQN-RNN     0.207  0.241  0.238  19.61   2.074  0.179  1.141
Table 3: Properties of the datasets.
Data set  Num.  Dom.  Freq.  Time steps  Pred. steps
Elec.                 hour
Traffic               hour
Wiki.                 day
Table 4: Hyperparameters used for training.
Hyperparameter     Value
rnn_cell_type      GRU (Chung et al., 2014)
rnn_hidden_size    64
rnn_num_layers     3
rnn_dropout_rate   0.2
context_length     2 * pred_steps
epochs             10
learning_rate      0.001
batch_size         256
batches_per_epoch  120
num_samples        100
optim              Adam (Kingma and Ba, 2015)
We next evaluate our model on open-source datasets for univariate time series: Electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014), Traffic (https://archive.ics.uci.edu/ml/datasets/PEMSSF), and Wikipedia (https://github.com/mbohlkeschneider/gluonts/tree/mv_release/datasets), preprocessed exactly as in (Salinas et al., 2019a), with their properties listed in Table 3. Our model is trained on the training split of each dataset. For testing, we use a rolling-window prediction starting from the last point seen in the training dataset and compare it to the test set.

For comparison, we again use SQF-RNN (Gasthaus et al., 2019) with 50 linear pieces. We also evaluate DeepAR (Salinas et al., 2019b) with a Student-T or a Negative Binomial distribution, depending on the domain of the dataset. Since IQN-RNN, SQF-RNN, and DeepAR share the same RNN architecture, we compare these models using the same untuned, but recommended, hyperparameters (see Table 4) for training: only the probabilistic heads differ. Thus, deviations from the performance reported in the original publications are solely due to the number of epochs used for training. Alternative models are trained on the same instances, consume a similar amount of memory, and need similar training time. We also use ETS (Hyndman and Khandakar, 2008) as a comparison: an exponential smoothing method that uses weighted averages of past observations, with weights that decay exponentially as the observations get older, together with Gaussian additive errors (E), and that models trend (T) and seasonality (S) effects separately.
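The phrase "weighted averages of past observations with exponentially decaying weights" can be made concrete with simple exponential smoothing, the most basic member of the ETS family. The sketch below is not the automatic ETS procedure used as a baseline (it has no trend, seasonality, or error model); it only shows that the recursive level update and the explicit weighted average give the same forecast. The smoothing parameter and the toy series are assumptions.

```python
def simple_exponential_smoothing(observations, alpha, initial_level):
    """Recursive form: level = alpha * y + (1 - alpha) * level."""
    level = initial_level
    for y in observations:
        level = alpha * y + (1.0 - alpha) * level
    return level


def weighted_average_form(observations, alpha, initial_level):
    """Same forecast as an explicit average with weights alpha * (1 - alpha)^age."""
    n = len(observations)
    total = (1.0 - alpha) ** n * initial_level
    for age, y in enumerate(reversed(observations)):  # age 0 is the newest point
        total += alpha * (1.0 - alpha) ** age * y
    return total


series = [12.0, 15.0, 11.0, 14.0, 16.0, 13.0]  # toy data (assumed)
recursive = simple_exponential_smoothing(series, alpha=0.3, initial_level=series[0])
explicit = weighted_average_form(series, alpha=0.3, initial_level=series[0])
```

With `alpha = 0.3`, the newest observation receives weight 0.3, the one before it 0.21, then 0.147, and so on, which is the exponential decay referred to above.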
In Table 2 we report probabilistic and pointwise performance metrics of all models. We found that using IQN-RNN often leads to the best performance on both probabilistic and pointwise metrics while being fully non-parametric and without significantly increasing the number of parameters of the RNN model. We also note that good performance on point-forecasting metrics does not come at the cost of higher errors on our probabilistic measures (unlike, e.g., DeepAR). We did not incorporate per-time-series embeddings as covariates in any of our experiments.
6 Conclusion
In this work, we proposed a general method of probabilistic time series forecasting by using IQNs to learn the quantile function of the next time point. We demonstrated the performance of our approach against competitive probabilistic methods on real-world datasets.
Our framework can be easily extended to multivariate time series, under the rather restrictive hypothesis that we observe the same quantile for individual univariate series. This is equivalent to assuming comonotonicity of the processes for each time step. Relaxing this assumption is left to future research.
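What this hypothesis implies can be illustrated with a short sketch: if a single quantile level drives every univariate series at a given time step, the sampled values move together perfectly in rank. The two Gaussian marginals and all names below are assumptions chosen for illustration; in the actual setting the marginal quantile functions would come from the trained model.

```python
import random
from statistics import NormalDist


def comonotonic_sample(marginal_quantile_functions, rng):
    """One shared quantile level is passed to every marginal quantile function."""
    tau = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    return [q(tau) for q in marginal_quantile_functions]


def ranks(values):
    """Rank of each value within the list (0 is the smallest)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


# Hypothetical marginals for two series.
marginals = [NormalDist(0.0, 1.0).inv_cdf, NormalDist(10.0, 5.0).inv_cdf]
rng = random.Random(0)
draws = [comonotonic_sample(marginals, rng) for _ in range(200)]
first, second = zip(*draws)
same_ordering = ranks(first) == ranks(second)
```

Here `same_ordering` is true: a large draw for one series always coincides with a large draw for the other, which is why the assumption is restrictive for series that are not strongly positively dependent.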
Software
Acknowledgements
K.R.: I acknowledge the traditional owners of the land on which I have lived and worked, the Wurundjeri people of the Kulin nation who have been custodians of their land for thousands of years. I pay my respects to their elders, past and present as well as past and present aboriginal elders of other communities.
References

GluonTS: Probabilistic and Neural Time Series Modeling in Python. Journal of Machine Learning Research 21 (116), pp. 1–6.
A distributional perspective on reinforcement learning. D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 449–458.
Pattern recognition and machine learning. Springer US. ISBN 9781461575689, 9781461575665.
Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
Implicit Quantile Networks for Distributional Reinforcement Learning. J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1096–1105.
pandas-dev/pandas: pandas.
Probabilistic Forecasting with Spline Quantile Function RNNs. K. Chaudhuri and M. Sugiyama (Eds.), Proceedings of Machine Learning Research, Vol. 89, pp. 1901–1910.
Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 (477), pp. 359–378. https://doi.org/10.1198/016214506000001437
Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
Array programming with NumPy. Nature 585 (7825), pp. 357–362.
Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. ISSN 08997667, 1530888X.
Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9 (3), pp. 90–95.
Forecasting: Principles and practice. OTexts. ISBN 9780987507112.
Automatic time series forecasting: The forecast package for R. J. Stat. Soft. 27 (3), pp. 1–22. ISSN 15487660.
Another look at measures of forecast accuracy. International Journal of Forecasting 22 (4), pp. 679–688. ISSN 01692070.
Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
Quantile regression. Econometric Society Monographs, Cambridge University Press. ISBN 9780511754098.
Verification tools for probabilistic forecasts of continuous hydrological variables. Hydrology and Earth System Sciences 11 (4), pp. 1267–1277.
The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting 36 (1), pp. 54–74. ISSN 01692070.
Accuracy measures: theoretical and practical concerns. International Journal of Forecasting 9 (4), pp. 527–529. ISSN 01692070.
Scoring rules for continuous probability distributions. Manage. Sci. 22 (10), pp. 1087–1096. ISSN 00251909, 15265501.
Mixture models: Inference and applications to clustering. Marcel Dekker, New York.
N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations.
Autoregressive quantile networks for generative modeling. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 3936–3945.
PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8026–8037.
PyTorchTS.
High-dimensional multivariate forecasting with low-rank Gaussian copula processes. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 6824–6834.
DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecasting. ISSN 01692070.
Deep learning in neural networks: An overview. Neural Networks 61, pp. 85–117. ISSN 08936080.
Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger (Eds.), pp. 3104–3112.
Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008.
Fully Parameterized Quantile Function for Distributional Reinforcement Learning. In Advances in Neural Information Processing Systems, pp. 6190–6199.
Deep and confident prediction for time series at Uber. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 103–110. ISSN 23759259, ISBN 9781538638002.