Probabilistic Time Series Forecasting with Implicit Quantile Networks

07/08/2021 ∙ by Adèle Gouttes, et al.

Here, we propose a general method for probabilistic time series forecasting. We combine an autoregressive recurrent neural network, which models temporal dynamics, with Implicit Quantile Networks, which learn a large class of distributions over a time-series target. When compared to other probabilistic neural forecasting models on real and simulated data, our approach compares favorably in terms of point-wise prediction accuracy as well as in estimating the underlying temporal distribution.




1 Introduction

Despite being versatile and omnipresent, traditional time series forecasting methods, such as those in (Hyndman and Athanasopoulos, 2018), typically provide univariate point forecasts. Training in such frameworks requires learning one model per individual time series, which might not scale to large datasets. Global deep learning-based time series models are typically recurrent neural networks (RNNs), such as the LSTM (Hochreiter and Schmidhuber, 1997). These methods have become popular due to their end-to-end training, the ease of incorporating exogenous covariates, and their automatic feature extraction abilities, which are the hallmarks of deep learning.

It is often desirable for the outputs to be probability distributions, in which case forecasts typically provide uncertainty bounds. In the deep learning setting, the two main approaches to estimating uncertainty have been either to model the data distribution explicitly or to use Bayesian Neural Networks, as in (Zhu and Laptev, 2017). The former methods rely on some parametric density function, such as that of a Gaussian distribution, which is often chosen for computational convenience rather than to match the underlying distribution of the data.

In this paper, we propose IQN-RNN, a deep-learning-based univariate time series method that learns an implicit distribution over outputs. Importantly, our approach does not make any a priori assumptions about the underlying distribution of our data. The probabilistic output of our model is generated via Implicit Quantile Networks (IQN) (Dabney et al., 2018) and is trained by minimizing the integrand of the Continuous Ranked Probability Score (CRPS) (Matheson and Winkler, 1976).

The major contributions of this paper are:

  1. modeling the data distribution using IQNs, which allows the use of a broad class of datasets;

  2. modeling the time series via an autoregressive RNN whose emission distribution is given by an IQN;

  3. demonstrating competitive results on real-world datasets, in particular when compared to RNN-based probabilistic univariate time series models.

2 Background

2.1 Quantile Regression

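The quantile function corresponding to a cumulative distribution function (c.d.f.) $F : \mathbb{R} \to [0, 1]$ is defined as:

$$Q(\tau) = \inf \{ x \in \mathbb{R} : \tau \le F(x) \}, \qquad \tau \in (0, 1).$$

For a continuous and strictly monotonic c.d.f. one can simply write $Q = F^{-1}$.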
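The generalized inverse above can be illustrated on an empirical c.d.f. The following is a minimal sketch, assuming a finite sample stands in for the distribution; the helper name `empirical_quantile` is hypothetical and not part of the original work.

```python
import math


def empirical_quantile(samples, tau):
    """Generalized inverse Q(tau) = inf{x : tau <= F_n(x)} of the empirical c.d.f. F_n."""
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    xs = sorted(samples)
    n = len(xs)
    # F_n reaches (k + 1) / n at xs[k], so we need the smallest k with (k + 1) / n >= tau.
    k = math.ceil(tau * n) - 1
    return xs[max(k, 0)]


samples = [4.0, 1.0, 3.0, 2.0]
print(empirical_quantile(samples, 0.5))   # smallest x with F_n(x) >= 0.5
print(empirical_quantile(samples, 0.51))  # a slightly larger level jumps to the next sample
```

Because the empirical c.d.f. is a step function, the infimum is always attained at one of the sample points, which is why a sort and an index suffice here.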

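In order to find the quantile for a given $\tau$ one can use quantile regression (Koenker, 2005), which minimizes the quantile loss:

$$L_\tau(y, \hat{y}) = \tau\,(y - \hat{y})_+ + (1 - \tau)\,(\hat{y} - y)_+, \tag{1}$$

where $(x)_+ = \max(x, 0)$ is non-zero iff the value in the parentheses is positive.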
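To make the asymmetry of the quantile loss concrete, here is a minimal sketch in plain Python. The function names are illustrative choices, and the brute-force search over sample values is only a demonstration that the minimizer of the summed loss is a sample quantile, not a practical estimator.

```python
def quantile_loss(tau, y, y_hat):
    """Quantile (pinball) loss: under-prediction costs tau, over-prediction costs 1 - tau."""
    diff = y - y_hat
    return tau * max(diff, 0.0) + (1.0 - tau) * max(-diff, 0.0)


def best_constant(samples, tau):
    """Sample value that minimizes the summed quantile loss over the sample."""
    return min(samples, key=lambda c: sum(quantile_loss(tau, y, c) for y in samples))


samples = [float(v) for v in range(1, 11)]  # 1.0, 2.0, ..., 10.0
print(quantile_loss(0.9, 10.0, 8.0))  # under-prediction by 2, weighted by tau = 0.9
print(quantile_loss(0.9, 8.0, 10.0))  # over-prediction by 2, weighted by 1 - tau
print(best_constant(samples, 0.85))   # the minimizer sits at the 0.85-quantile of the sample
```

For high quantile levels, under-predicting is penalized far more than over-predicting, which pushes the minimizer toward the upper tail of the data.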

2.2 CRPS

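The Continuous Ranked Probability Score (Matheson and Winkler, 1976; Gneiting and Raftery, 2007) is a proper scoring rule that evaluates a predictive c.d.f. $F$ given the observation $y$:

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(x) - \mathbb{1}\{y \le x\} \right)^2 dx,$$

where $\mathbb{1}$ is the indicator function. The formula can be rewritten (Laio and Tamea, 2007) as an integral over the quantile loss:

$$\mathrm{CRPS}(F, y) = 2 \int_0^1 L_\tau\left(y, Q(\tau)\right) d\tau,$$

where $Q$ is the quantile function corresponding to $F$ and $L_\tau$ is the quantile loss of Section 2.1.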
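The equivalence of the two CRPS forms can be checked numerically. The sketch below assumes a uniform distribution on [0, 1] as the forecast (chosen only because its c.d.f. and quantile function are trivial) and uses a midpoint rule for both integrals; all function names are hypothetical.

```python
def quantile_loss(tau, y, y_hat):
    diff = y - y_hat
    return tau * max(diff, 0.0) + (1.0 - tau) * max(-diff, 0.0)


def crps_cdf_form(cdf, y, lo, hi, steps=20000):
    """Midpoint-rule estimate of the integral of (F(x) - 1{y <= x})^2 over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        indicator = 1.0 if y <= x else 0.0
        total += (cdf(x) - indicator) ** 2
    return total * h


def crps_quantile_form(quantile_fn, y, steps=20000):
    """Midpoint-rule estimate of 2 * integral over tau in (0, 1) of the quantile loss."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        tau = (i + 0.5) * h
        total += quantile_loss(tau, y, quantile_fn(tau))
    return 2.0 * total * h


def uniform_cdf(x):
    return min(max(x, 0.0), 1.0)


def uniform_quantile(tau):
    return tau


y = 0.3
# For U(0, 1) and y in [0, 1] the integrand vanishes outside [0, 1],
# and the score has the closed form y^3 / 3 + (1 - y)^3 / 3.
closed_form = y ** 3 / 3.0 + (1.0 - y) ** 3 / 3.0
print(crps_cdf_form(uniform_cdf, y, 0.0, 1.0), crps_quantile_form(uniform_quantile, y), closed_form)
```

The quantile form is the one that matters for training: averaging the quantile loss over uniformly sampled levels is a Monte Carlo estimate of half the CRPS.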

3 Related work

In recent years, deep learning models have shown impressive results over classical methods in many fields (Schmidhuber, 2015), such as computer vision, speech recognition, natural language processing (NLP), and also time series forecasting, which is related to sequence modeling in NLP (Sutskever et al., 2014). Modern univariate forecasting methods like N-BEATS (Oreshkin et al., 2020) share parameters, are interpretable, and are fast to train on many target domains.

To estimate the underlying temporal distribution we can learn the parameters of some target distribution as in the DeepAR method (Salinas et al., 2019b) or use mixture density models (McLachlan and Basford, 1988) operating on neural network outputs, called mixture density networks (MDN) (Bishop, 1971). One prominent example is MDRNN (Graves, 2013), which uses a mixture-density RNN to model handwriting. These approaches assume some parametric distribution, based on the data being modeled, for example a Negative Binomial distribution for count data. Models are trained by maximizing the likelihood of these distributions with respect to their predicted parameters and ground truth data.

Our approach is closely related to SQF-RNN (Gasthaus et al., 2019), which models the conditional quantile function using isotonic splines. As we will detail, we instead utilize an IQN (Dabney et al., 2018; Yang et al., 2019), which has been used in the context of Distributional Reinforcement Learning (Bellemare et al., 2017) as well as for Generative Modelling (Ostrovski et al., 2018).

4 Forecasting with Implicit Quantile Networks

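In a univariate time series forecasting setting, we typically aim at forecasting a subseries $(y_{T-h+1}, \ldots, y_T)$ of length $h$ from a series $y = (y_1, \ldots, y_T)$ of length $T$, generated by an auto-regressive process $Y = (Y_1, \ldots, Y_T)$.

Let $\tau_1 = \mathbb{P}[Y_1 \le y_1]$ and $\tau_t = \mathbb{P}[Y_t \le y_t \mid Y_{1:t-1} = y_{1:t-1}]$ for each integer $t \in \{2, \ldots, T\}$. We can rewrite $y$ as $\left(F^{-1}_{Y_1}(\tau_1), F^{-1}_{Y_2 \mid Y_1}(\tau_2), \ldots, F^{-1}_{Y_T \mid Y_{1:T-1}}(\tau_T)\right)$, where $F$ is the c.d.f.

In probabilistic time series forecasting, it is typically assumed that a unique function $g$ can represent the distribution of all $Y_t$, given input covariates $X_{1:t}$ and previous observations $y_{1:t-1}$. When using IQNs, we additionally parameterize this function with $\tau_t$: IQNs learn the mapping from $\tau_t \sim U([0, 1])$ to $y_t$. In other words, they are deterministic parametric functions trained to reparameterize samples from the uniform distribution to the respective quantile values of a target distribution. Our IQN-RNN model should then learn $y_t = g(X_{1:t}, \tau_t, y_{1:t-1})$ for $t \in \{T-h+1, \ldots, T\}$ and can be written as $q \circ [\psi_t \odot (1 + \phi(\tau_t))]$, where $\odot$ is the Hadamard element-wise product and:

  • $X_t$ are typically time-dependent features, known for all time steps;

  • $\psi_t$ is the state of an RNN that takes the concatenation of $X_t$ and $y_{t-1}$, as well as the previous state $\psi_{t-1}$, as input;

  • $\phi$ embeds $\tau_t$ as described by (Dabney et al., 2018), with an $n$-dimensional cosine basis: $\phi_j(\tau_t) = \mathrm{ReLU}\left(\sum_{i=0}^{n-1} \cos(\pi i \tau_t)\, w_{ij} + b_j\right)$;

  • $q$ is an additional generator layer, which in our case is a simple two-layer feed-forward neural network with a domain-relevant activation function.

We perform the Hadamard operation on $1 + \phi(\tau_t)$, which is one of the forms considered by (Dabney et al., 2018).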
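The quantile embedding and its Hadamard combination with the RNN state can be sketched in a few lines. This is a plain-Python illustration of the cosine embedding from (Dabney et al., 2018) under assumed sizes (`n_basis = 8`, `state_dim = 4`) and randomly initialized weights; it is not the PyTorchTS implementation, and the names `cosine_embedding` and `modulate_state` are hypothetical.

```python
import math
import random


def cosine_embedding(tau, weights, biases):
    """phi_j(tau) = ReLU(sum_i cos(pi * i * tau) * weights[i][j] + biases[j])."""
    n = len(weights)
    basis = [math.cos(math.pi * i * tau) for i in range(n)]
    return [
        max(0.0, sum(basis[i] * weights[i][j] for i in range(n)) + biases[j])
        for j in range(len(biases))
    ]


def modulate_state(state, tau, weights, biases):
    """Hadamard product of the RNN state with (1 + phi(tau))."""
    phi = cosine_embedding(tau, weights, biases)
    return [s * (1.0 + p) for s, p in zip(state, phi)]


rng = random.Random(0)
n_basis, state_dim = 8, 4  # illustrative sizes, not taken from the paper
weights = [[rng.uniform(-0.5, 0.5) for _ in range(state_dim)] for _ in range(n_basis)]
biases = [0.0] * state_dim
state = [0.2, -1.0, 0.7, 0.1]  # stands in for the RNN state at one time step

for tau in (0.1, 0.5, 0.9):
    print(tau, modulate_state(state, tau, weights, biases))
```

The `1 +` term means that an embedding of zero leaves the RNN state unchanged, so the quantile level acts as a multiplicative correction to the state rather than a replacement for it.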

At training time, quantiles are sampled for each observation at each time step and passed to both the network and the quantile loss (1) (see Figure 1).

Figure 1: IQN-RNN schematic at time $t$, where during training we minimize the quantile loss with respect to the ground truth.
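The training principle (sample a quantile level, feed it to the model, score the output with the quantile loss at that same level) can be demonstrated on a toy problem. The sketch below replaces the RNN and IQN with a two-parameter linear model `a + b * (tau - 0.5)` and fits it by stochastic subgradient descent to draws from a uniform distribution on [2, 5], whose true quantile function is 3.5 + 3 (tau - 0.5). The model, the data, the learning rate, and the step count are all assumptions made for illustration.

```python
import random


def train_implicit_quantile_model(observations, steps=20000, lr=0.05, seed=0):
    """Fit q(tau) = a + b * (tau - 0.5) by SGD on the quantile loss with random tau."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    for _ in range(steps):
        y = rng.choice(observations)
        tau = rng.random()  # a fresh quantile level for every observation
        pred = a + b * (tau - 0.5)
        # Subgradient of the quantile loss with respect to the prediction.
        grad = -tau if y > pred else (1.0 - tau)
        a -= lr * grad
        b -= lr * grad * (tau - 0.5)
    return a, b


data_rng = random.Random(1)
observations = [data_rng.uniform(2.0, 5.0) for _ in range(5000)]
a, b = train_implicit_quantile_model(observations)
print(a, b)  # should land near 3.5 and 3.0 for this toy setup
```

Since every update uses a different quantile level, a single set of parameters is pushed toward the whole quantile function at once, which is the property the IQN head relies on.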

During inference, we analogously sample a new quantile for each time step of our autoregressive loop. Thus, a full single forward pass follows an ancestral sampling scheme along the graph of our probabilistic network. This approach is guaranteed to produce valid samples from the underlying model. Sampling a larger number of trajectories in this way allows us to estimate statistics over the distribution of each observation, such as the mean, quantiles, and standard deviation. For instance, the mean of the first forecast value can be estimated using the average over the first step of all sampled trajectories. In our experiments, we choose 100 samples (in parallel via the batch dimension) when calculating metrics and empirical quantiles. This strategy also addresses potential quantile-crossing issues, since nothing in the IQN-RNN architecture guarantees monotonicity with respect to the quantile level $\tau$: we simply compute the quantiles from sampled values.
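The ancestral sampling loop and the sample-based quantile estimates can be sketched as follows. The one-step function `toy_quantile_fn` is a hypothetical stand-in for the trained network (an AR(1) mean plus a Gaussian noise quantile); only the control flow is meant to mirror the description above.

```python
import random
from statistics import NormalDist, fmean


def toy_quantile_fn(prev_value, tau):
    """Hypothetical stand-in for the trained network: maps (history, tau) to the next value."""
    return 0.8 * prev_value + NormalDist(0.0, 1.0).inv_cdf(tau)


def sample_trajectories(last_obs, horizon, num_samples, rng):
    """Ancestral sampling: draw a new tau at every step and feed each output back in."""
    trajectories = []
    for _ in range(num_samples):
        prev, path = last_obs, []
        for _ in range(horizon):
            tau = min(max(rng.random(), 1e-9), 1.0 - 1e-9)  # keep tau strictly inside (0, 1)
            prev = toy_quantile_fn(prev, tau)
            path.append(prev)
        trajectories.append(path)
    return trajectories


def empirical_quantiles(values, levels):
    """Quantiles read off the sorted samples, hence monotone in the level by construction."""
    xs = sorted(values)
    return [xs[min(int(level * len(xs)), len(xs) - 1)] for level in levels]


rng = random.Random(0)
trajectories = sample_trajectories(last_obs=1.0, horizon=3, num_samples=100, rng=rng)
first_step = [path[0] for path in trajectories]
print(fmean(first_step))                                 # estimate of the first-step mean
print(empirical_quantiles(first_step, [0.1, 0.5, 0.9]))  # cannot cross
```

Reading quantiles from sorted samples sidesteps quantile crossing because a sorted list is non-decreasing regardless of how the underlying network behaves in its quantile-level input.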

We note that this method would work equally well using a masked Transformer (Vaswani et al., 2017) to model the temporal dynamics or a fixed-horizon non-autoregressive model as in (Oreshkin et al., 2020).

5 Experiments

We evaluate our IQN-RNN model on synthetic and open datasets and follow the recommendations of the M4 competition (Makridakis et al., 2020) regarding performance metrics. We report the mean scaled interval score (MSIS) for a 95% prediction interval, the quantile loss at the 50th and 90th percentiles, and the CRPS. The point-wise performance of models is measured by the normalized root mean square error (NRMSE), the mean absolute scaled error (MASE) (Hyndman and Koehler, 2006), and the symmetric mean absolute percentage error (sMAPE) (Makridakis, 1993). For point-wise metrics, we use sampled medians, with the exception of NRMSE, where we take the mean over our prediction sample.
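For reference, three of these metrics can be written out directly. The sketch below follows the M4 competition conventions (sMAPE reported as a fraction rather than a percentage, scaling by the in-sample seasonal naive error with period `m`); whether the evaluation code used for the paper matches these conventions in every detail is an assumption, and the function names are illustrative.

```python
def seasonal_naive_scale(insample, m=1):
    """Mean absolute error of the in-sample seasonal naive forecast with period m."""
    diffs = [abs(insample[i] - insample[i - m]) for i in range(m, len(insample))]
    return sum(diffs) / len(diffs)


def smape(actual, forecast):
    """Symmetric MAPE as a fraction; undefined when actual and forecast are both zero."""
    return sum(2.0 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(actual, forecast)) / len(actual)


def mase(actual, forecast, insample, m=1):
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / seasonal_naive_scale(insample, m)


def msis(actual, lower, upper, insample, alpha=0.05, m=1):
    """Interval width plus a 2 / alpha penalty for every unit the target falls outside."""
    total = 0.0
    for y, lo, hi in zip(actual, lower, upper):
        total += hi - lo
        if y < lo:
            total += (2.0 / alpha) * (lo - y)
        if y > hi:
            total += (2.0 / alpha) * (y - hi)
    return total / len(actual) / seasonal_naive_scale(insample, m)


insample = [1.0, 2.0, 3.0, 4.0]
actual = [5.0, 6.0]
print(mase(actual, [4.0, 8.0], insample))
print(smape(actual, [4.0, 8.0]))
print(msis(actual, lower=[4.0, 5.0], upper=[6.0, 5.5], insample=insample))
```

With `alpha = 0.05`, the interval bounds correspond to the 2.5% and 97.5% quantiles, so a single miss by half a unit costs as much as 20 units of interval width.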

The code for our model is available in the PyTorchTS (Rasul, 2021) library.

5.1 Results on synthetic data

We first evaluate our IQN-RNN on synthetic data and compare its performance with another non-parametric probabilistic forecast model: SQF-RNN (Gasthaus et al., 2019). In order to minimize the MSIS and CRPS, we use this model with 50 linear pieces. Both models have the same RNN architecture and the same hyperparameters for training. Only the probabilistic head is distinct.

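In a similar fashion to (Gasthaus et al., 2019), we generate 10,000 time series of 48 points, where each time step is i.i.d. and follows a Gaussian Mixture Model with three components:

$$p(y) = \sum_{k=1}^{3} w_k\, \mathcal{N}(y; \mu_k, \sigma_k^2),$$

with mixture weights $w_k$, means $\mu_k$, and standard deviations $\sigma_k$. The models are trained 5 times each for 20 epochs, with a context length of 15 and a prediction window of 2.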
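A synthetic dataset of this kind can be generated with the standard library alone. In the sketch below the mixture parameters (`WEIGHTS`, `MEANS`, `STDS`) are placeholders chosen for illustration, not values confirmed by the text; the bisection-based `gmm_quantile` gives the ground-truth quantile function that an estimated one can be compared against.

```python
import random
from statistics import NormalDist

# Placeholder mixture parameters, chosen for illustration only.
WEIGHTS = (0.3, 0.4, 0.3)
MEANS = (-3.0, 0.0, 3.0)
STDS = (0.4, 0.4, 0.4)


def sample_gmm_series(length, rng):
    """One time series whose points are independent draws from the mixture."""
    series = []
    for _ in range(length):
        k = rng.choices(range(len(WEIGHTS)), weights=WEIGHTS)[0]
        series.append(rng.gauss(MEANS[k], STDS[k]))
    return series


def gmm_cdf(x):
    return sum(w * NormalDist(m, s).cdf(x) for w, m, s in zip(WEIGHTS, MEANS, STDS))


def gmm_quantile(tau, lo=-10.0, hi=10.0, iters=60):
    """Ground-truth quantile function, obtained by bisection on the mixture c.d.f."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gmm_cdf(mid) < tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(0)
dataset = [sample_gmm_series(48, rng) for _ in range(10)]  # the text uses 10,000 series
print(len(dataset), len(dataset[0]))
print([round(gmm_quantile(t), 3) for t in (0.1, 0.5, 0.9)])
```

A multimodal target like this is a useful stress test because a single Gaussian or Student-t head cannot represent the flat regions of the quantile function between the modes.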

We show the estimated quantile functions in Figure 4 and report the average metrics in Table 1. While both methods have a similar point-wise performance, IQN-RNN is better at estimating the entire probability distribution.

Figure 4: Estimated quantile functions for five training runs on time series following a Gaussian Mixture Model using (a) IQN-RNN and (b) SQF-RNN.

SQF-RNN-50 0.780 3.213 1.756 0.747
IQN-RNN 0.776 3.027 1.754 0.740

Table 1: Performance of IQN-RNN and SQF-RNN in fitting a three-component Gaussian Mixture Model.

5.2 Results on empirical data

Electricity SQF-RNN-50 0.078 0.097 0.044 8.66 0.632 0.144 1.051
DeepAR-t 0.062 0.078 0.046 6.79 0.687 0.117 0.849
ETS 0.076 0.100 0.050 9.992 0.838 0.156 1.247
IQN-RNN 0.060 0.074 0.040 8.74 0.543 0.138 0.897

SQF-RNN-50 0.153 0.186 0.117 8.40 0.401 0.243 0.76
DeepAR-t 0.172 0.216 0.117 8.027 0.472 0.244 0.89
ETS 0.427 0.488 0.325 20.856 0.872 0.594 1.881
IQN-RNN 0.139 0.168 0.117 7.11 0.433 0.171 0.656

SQF-RNN-50 0.283 0.328 0.321 23.71 2.24 0.261 1.44
DeepAR-nb 0.452 0.572 0.526 46.79 2.25 0.751 2.94
DeepAR-t 0.235 0.27 0.267 23.77 2.15 0.21 1.23
ETS 0.788 0.440 0.836 61.685 3.261 0.301 2.214
IQN-RNN 0.207 0.241 0.238 19.61 2.074 0.179 1.141
Table 2: Comparison against different methods: SQF-RNN with 50 nodes, DeepAR with Student-T (-t) or Negative Binomial (-nb) output, ETS and IQN-RNN on the 3 datasets.

Data set Num. Dom. Freq. Time steps Pred. steps
Elec. hour
Traffic hour
Wiki. day
Table 3: Number of time series, domain, frequency, total training time steps and prediction length properties of the training datasets used in the experiments.

Hyperparameter Value
rnn_cell_type GRU (Chung et al., 2014)
rnn_hidden_size 64
rnn_num_layers 3
rnn_dropout_rate 0.2
context_length 2 * pred_steps
epochs 10
learning_rate 0.001
batch_size 256
batches_per_epoch 120
num_samples 100
optim Adam (Kingma and Ba, 2015)
Table 4: Common hyperparameters for SQF-RNN, DeepAR and IQN-RNN models.

We next evaluate our model on open source datasets for univariate time series: Electricity, Traffic, and Wikipedia, preprocessed exactly as in (Salinas et al., 2019a), with their properties listed in Table 3. Our model is trained on the training split of each dataset. For testing, we use a rolling-window prediction starting from the last point seen in the training dataset and compare it to the test set.
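A rolling-window evaluation of this kind can be sketched as below. The non-overlapping stride of one prediction length and the helper name `rolling_windows` are assumptions made for illustration; the text only states that prediction starts from the last training point and rolls forward over the test set.

```python
def rolling_windows(series, train_length, context_length, prediction_length):
    """Split the part of a series after train_length into (context, target) pairs."""
    windows = []
    start = train_length
    while start + prediction_length <= len(series):
        context = series[max(0, start - context_length):start]
        target = series[start:start + prediction_length]
        windows.append((context, target))
        start += prediction_length  # assumed stride: one full prediction window
    return windows


series = list(range(20))
for context, target in rolling_windows(series, train_length=12, context_length=4, prediction_length=2):
    print(context, "->", target)
```

Each window conditions on the most recent observed values, including test points already revealed by earlier windows, so the model is never asked to forecast further ahead than one prediction length.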

For comparison, we again use SQF-RNN (Gasthaus et al., 2019) with 50 linear pieces. We also evaluate DeepAR (Salinas et al., 2019b) with a Student-T or a Negative Binomial distribution, depending on the domain of the dataset. Since IQN-RNN, SQF-RNN and DeepAR share the same RNN architecture, we compare these models using the same untuned, but recommended, hyperparameters (see Table 4) for training: only the probabilistic heads differ. Thus, deviations from performance reported in the original publications are solely due to the number of epochs used for training. Alternative models are trained on the same instances, consume a similar amount of memory, and need similar training time. We also use ETS (Hyndman and Khandakar, 2008) as a comparison. ETS is an exponential smoothing method that uses weighted averages of past observations, with weights decaying exponentially as the observations get older, together with Gaussian additive errors (E), and models trend (T) and seasonality (S) effects separately.
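The exponentially decaying weights behind exponential smoothing are easiest to see in its simplest variant. The sketch below implements simple exponential smoothing only (no trend or seasonality components, so it is not the full ETS model used in the comparison) and shows that the recursive update equals an explicit weighted average; function names are illustrative.

```python
def simple_exponential_smoothing(series, alpha):
    """Recursive level update: level = alpha * y + (1 - alpha) * previous level."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1.0 - alpha) * level
    return level


def implied_weights(num_obs, alpha):
    """Weights on the observations, most recent first; the oldest keeps the leftover mass."""
    weights = [alpha * (1.0 - alpha) ** k for k in range(num_obs - 1)]
    weights.append((1.0 - alpha) ** (num_obs - 1))
    return weights


series = [3.0, 5.0, 4.0, 6.0]
alpha = 0.5
weights = implied_weights(len(series), alpha)
weighted_average = sum(w * y for w, y in zip(weights, reversed(series)))
print(simple_exponential_smoothing(series, alpha), weighted_average, weights)
```

The two numbers printed first agree, which confirms that the recursion is just a compact way of applying geometrically shrinking weights to older observations.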

In Table 2 we report probabilistic and point-wise performance metrics of all models. We found that using IQN-RNN often leads to the best performance on both probabilistic and point-wise metrics while being fully non-parametric and without significantly increasing the number of parameters of the RNN model. We also note that good performance on point-forecasting metrics does not come at the cost of higher errors on our probabilistic measures (unlike, e.g., DeepAR). We did not incorporate per-time-series embeddings as covariates in any of our experiments.

6 Conclusion

In this work, we proposed a general method of probabilistic time series forecasting by using IQNs to learn the quantile function of the next time point. We demonstrated the performance of our approach against competitive probabilistic methods on real-world datasets.

Our framework can be easily extended to multivariate time series, under the rather restrictive hypothesis that we observe the same quantile for individual univariate series. This is equivalent to assuming comonotonicity of the processes for each time step. Relaxing this assumption is left to future research.
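The comonotonicity assumption can be made concrete with a small sketch: a single quantile level is drawn per time step and pushed through every series' quantile function. The three marginal quantile functions below are hypothetical examples, not models from the paper.

```python
import math
import random
from statistics import NormalDist


def comonotone_sample(quantile_fns, rng):
    """Draw one shared tau and evaluate every marginal quantile function at it."""
    tau = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    return [q(tau) for q in quantile_fns]


# Hypothetical marginals for three series at one time step.
quantile_fns = [
    NormalDist(0.0, 1.0).inv_cdf,        # standard normal
    lambda tau: 5.0 + 2.0 * tau,         # uniform on [5, 7]
    lambda tau: -math.log(1.0 - tau),    # exponential with rate 1
]

rng = random.Random(0)
draws = [comonotone_sample(quantile_fns, rng) for _ in range(200)]
draws.sort(key=lambda row: row[0])
# After sorting by the first series, the other series are sorted as well.
print(all(draws[i][1] <= draws[i + 1][1] for i in range(len(draws) - 1)))
print(all(draws[i][2] <= draws[i + 1][2] for i in range(len(draws) - 1)))
```

Every series moves up and down together under this scheme, which is why the hypothesis is restrictive: it cannot represent series that are independent or negatively dependent at a given time step.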


We wish to acknowledge and thank the authors and contributors of the following open source libraries that were used in this work: GluonTS (Alexandrov et al., 2020), NumPy (Harris et al., 2020), Pandas (The pandas development team, 2020), Matplotlib (Hunter, 2007), and PyTorch (Paszke et al., 2019).


K.R.: I acknowledge the traditional owners of the land on which I have lived and worked, the Wurundjeri people of the Kulin nation, who have been custodians of their land for thousands of years. I pay my respects to their elders, past and present, as well as past and present Aboriginal elders of other communities.


  • A. Alexandrov, K. Benidis, M. Bohlke-Schneider, V. Flunkert, J. Gasthaus, T. Januschowski, D. C. Maddix, S. Rangapuram, D. Salinas, J. Schulz, L. Stella, A. C. Türkmen, and Y. Wang (2020) GluonTS: Probabilistic and Neural Time Series Modeling in Python. Journal of Machine Learning Research 21 (116), pp. 1–6.
  • M. G. Bellemare, W. Dabney, and R. Munos (2017) A distributional perspective on reinforcement learning. D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 449–458.
  • C. M. Bishop (1971) Pattern recognition and machine learning. Springer US. ISBN 9781461575689, 9781461575665.
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
  • W. Dabney, G. Ostrovski, D. Silver, and R. Munos (2018) Implicit Quantile Networks for Distributional Reinforcement Learning. J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1096–1105.
  • The pandas development team (2020) pandas-dev/pandas: Pandas.
  • J. Gasthaus, K. Benidis, Y. Wang, S. S. Rangapuram, D. Salinas, V. Flunkert, and T. Januschowski (2019) Probabilistic Forecasting with Spline Quantile Function RNNs. K. Chaudhuri and M. Sugiyama (Eds.), Proceedings of Machine Learning Research, Vol. 89, pp. 1901–1910.
  • T. Gneiting and A. E. Raftery (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 (477), pp. 359–378.
  • A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
  • C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant (2020) Array programming with NumPy. Nature 585 (7825), pp. 357–362.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. ISSN 0899-7667, 1530-888X.
  • J. D. Hunter (2007) Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9 (3), pp. 90–95.
  • R. J. Hyndman and G. Athanasopoulos (2018) Forecasting: Principles and practice. OTexts. ISBN 9780987507112.
  • R. J. Hyndman and Y. Khandakar (2008) Automatic time series forecasting: The forecast package for R. J. Stat. Soft. 27 (3), pp. 1–22. ISSN 1548-7660.
  • R. J. Hyndman and A. B. Koehler (2006) Another look at measures of forecast accuracy. International Journal of Forecasting 22 (4), pp. 679–688. ISSN 0169-2070.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
  • R. Koenker (2005) Quantile regression. Econometric Society Monographs, Cambridge University Press. ISBN 9780511754098.
  • F. Laio and S. Tamea (2007) Verification tools for probabilistic forecasts of continuous hydrological variables. Hydrology and Earth System Sciences 11 (4), pp. 1267–1277.
  • S. Makridakis, E. Spiliotis, and V. Assimakopoulos (2020) The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting 36 (1), pp. 54–74. ISSN 0169-2070.
  • S. Makridakis (1993) Accuracy measures: theoretical and practical concerns. International Journal of Forecasting 9 (4), pp. 527–529. ISSN 0169-2070.
  • J. E. Matheson and R. L. Winkler (1976) Scoring rules for continuous probability distributions. Manage. Sci. 22 (10), pp. 1087–1096. ISSN 0025-1909, 1526-5501.
  • G. J. McLachlan and K. E. Basford (1988) Mixture models: Inference and applications to clustering. Marcel Dekker, New York.
  • B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio (2020) N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations.
  • G. Ostrovski, W. Dabney, and R. Munos (2018) Autoregressive quantile networks for generative modeling. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 3936–3945.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8026–8037.
  • K. Rasul (2021) PyTorchTS.
  • D. Salinas, M. Bohlke-Schneider, L. Callot, R. Medico, and J. Gasthaus (2019a) High-dimensional multivariate forecasting with low-rank Gaussian copula processes. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 6824–6834.
  • D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski (2019b) DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecasting. ISSN 0169-2070.
  • J. Schmidhuber (2015) Deep learning in neural networks: An overview. Neural Networks 61, pp. 85–117. ISSN 0893-6080.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3104–3112.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008.
  • D. Yang, L. Zhao, Z. Lin, T. Qin, J. Bian, and T. Liu (2019) Fully Parameterized Quantile Function for Distributional Reinforcement Learning. In Advances in Neural Information Processing Systems, pp. 6190–6199.
  • L. Zhu and N. Laptev (2017) Deep and confident prediction for time series at Uber. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 103–110. ISSN 2375-9259, ISBN 9781538638002.