PyTorch based Probabilistic Time Series forecasting framework based on GluonTS backend
Time series forecasting is often fundamental to scientific and engineering problems and enables decision making. With ever increasing data set sizes, a trivial solution to scale up predictions is to assume independence between interacting time series. However, modeling statistical dependencies can improve accuracy and enable analysis of interaction effects. Deep learning methods are well suited for this problem, but multi-variate models often assume a simple parametric distribution and do not scale to high dimensions. In this work we model the multi-variate temporal dynamics of time series via an autoregressive deep learning model, where the data distribution is represented by a conditioned normalizing flow. This combination retains the power of autoregressive models, such as good performance in extrapolation into the future, with the flexibility of flows as a general purpose high-dimensional distribution model, while remaining computationally tractable. We show that it improves over the state-of-the-art for standard metrics on many real-world data sets with several thousand interacting time-series.READ FULL TEXT VIEW PDF
In this work, we propose TimeGrad, an autoregressive model for
Here, we propose a general method for probabilistic time series forecast...
Time series forecasting is a fundamental task emerging from diverse
Spatio-temporal forecasting has numerous applications in analyzing wirel...
Many real-world data sets, especially in biology, are produced by highly...
The forecasting of multi-variate time processes through graph-based
While previous distribution shift detection approaches can identify if a...
PyTorch based Probabilistic Time Series forecasting framework based on GluonTS backend
Classical time series forecasting methods such as those in (Hyndman & Athanasopoulos, 2018)
typically provide univariate forecasts and require hand tuned features to model seasonality and other parameters. Time series models based on recurrent neural networks (RNN), like LSTM(Hochreiter & Schmidhuber, 1997)
, have become popular methods due to their end-to-end training, the ease of incorporating exogenous covariates, and their automatic feature extraction abilities, which are the hallmarks of deep learning. Forecasting outputs can either be points or probability distributions, in which case the forecasts typically come with uncertainty bounds.
The problem of modeling uncertainties in time series forecasting is of vital importance for assessing how much to trust the predictions for downstream tasks, such as anomaly detection or (business) decision making. Without probabilistic modeling, the importance of the forecast in regions of low noise (small variance around a mean value) versus a scenario with high noise cannot be distinguished. Hence, point estimation models ignore risk stemming from this noise, which would be of particular importance in some contexts such as making (business) decisions.
Finally, individual time series, in many cases, are statistically dependent on each other, and models need the capacity to adapt to this to improve forecast accuracy (Battaglia et al., 2018). For example, to model the demand for a retail article, it is important to not only model its sales dependent on its own past sales, but also to take into account the effect of interacting articles, which can lead to cannibalization effects in the case of article competition. As another example, consider traffic flow in a network of streets as measured by occupancy sensors. A disruption on one particular street will also ripple to occupancy sensors of nearby streets — a uni-variate model would arguable not be able to account for these effects.
In this work we propose an end-to-end trainable autoregressive deep learning model for probabilistic forecasting that explicitly models multi-variate time series and their temporal dynamics employing a normalizing flow architecture, like the Masked Autoregressive Flow (Papamakarios et al., 2017) or Real NVP (Dinh et al., 2017).
The main contributions of this paper are:
we propose a probabilistic method to model multi-variate time series, which is able to scale to thousands of time series and their interactions
we demonstrate that the model can uncover ground-truth dependency structure on toy data
the model establishes the new state-of-the-art on many real world data sets
The model further has the advantages that:
the underlying data distribution is modeled using a conditional normalizing flow, which enables adaptation to a broad class of underlying data distributions
it is highly efficient to train due to parallelization by using attention (Vaswani et al., 2017), unlike typical RNN-based time series models. Empirically, we observe an order of magnitude faster training times for Transformer-based models.
We briefly review the current time series forecasting landscape and present the building blocks of our method in this section.
Classical time series forecasting methods rely on the ARMA (see e.g. Box et al. 2015
) method and its variants like ARIMA. Apart from the fact that these methods require manual feature engineering, they also suffer from the curse of dimensionality, require frequent re-training and are focused on model interpretability rather than test-set accuracy.
Deep learning models over the last years have shown impressive results over classical methods in many fields (Schmidhuber, 2015)2014). Modern uni-variate point forecast methods like in (Oreshkin et al., 2020) are interpretive and fast to train on many target domains.
Uncertainty estimation for classical methods in the context of control theory have been worked on for decades, see e.g. (Dietz et al., 1997). The majority of the classic forecasting literature has focused on prediction of point estimates, such as the mean or the median of the distribution at a future time point. In the deep learning setting the two approaches have been to either model the data distribution explicitly or to use BNN as in (Zhu & Laptev, 2018). To estimate the underlying temporal distribution we can either learn the parameters of some target distribution as in the DeepAR method (Flunkert et al., 2017) or use mixture density models (McLachlan & Basford, 1988) operating on neural network outputs, called mixture density networks (MDN) (Bishop, 2006), as for example in the MD-RNN approach used to model handwriting (Graves, 2013). Recently (Rangapuram et al., 2018) combined a linear state space model for each individual time series together with deep probabilistic models to additionally obtain interpretative time series predictions.
To model all time series jointly, i.e. capture interaction effects, one can use multi-variate Gaussian processes to capture the underlying structure of data (Vandenberg-Rodes & Shahbaba, 2015) or Low-rank Gaussian Copula processes via RNNs (Salinas et al., 2019). The temporal regularized matrix factorization framework (Yu et al., 2016) proposes learning the data dependencies, thus allowing the ability to forecast future values, via a matrix factorization approach. The LSTNet (Lai et al., 2018)
approach uses Convolutional Neural Network (CNN) and RNN building blocks to model multi-variate time series for point forecasts. Bayesian models using hierarchical priors have also been proposed to share statistical strength between individual time series while keeping inference feasible(Chapados, 2014). The use of multi-head attention for time series forecasting has also recently been explored (Li et al., 2019) and it allows capturing long term dependencies where RNNs like the LSTM suffer.
Normalizing flows (Tabak & Turner, 2013) are mappings from to such that densities on the input space are transformed into some simple distribution (e.g. an isotropic Gaussian) on the space . This mapping , is composed of a sequence of bijections or invertible functions. Due to the change of variables formula we can express by
where is the Jacobian of at . Normalizing flows have the property that the inverse is easy to evaluate and computing the Jacobian determinant takes time.
The bijection introduced by Real NVP (Dinh et al., 2017) called the coupling layer satisfies the above two properties. It leaves part of its inputs unchanged and transforms the other part via functions of the un-transformed variables
where is an element wise product, is a scaling and a translation function from , given by neural networks. To model a nonlinear density map , a number of coupling layers
are composed together all the while alternating the dimensions which are unchanged and transformed. Via the change of variables formula the probability density function (PDF) of the flow given a data point can be written as
Note that the Jacobian for the Real NVP is a block-triangular matrix and thus the log-determinant simply becomes
is the sum over all the vector elements,is the element-wise logarithm and is the diagonal of the Jacobian. This model, parameterized by the weights of the scaling and translation neural networks
, is then trained via stochastic gradient descent (SGD) on training data points where for each batchwe maximize the average log likelihood (1) given by
In practice Batch Normalization(Ioffe & Szegedy, 2015)
is applied, as a bijection, to outputs of successive coupling layers to stabilize training of normalizing flows. This bijection implements the normalization procedure using a weighted moving average of the layer’s mean and standard deviation values, which has to be adapted to either training or inference regimes.
The Real NVP approach can be generalized, resulting in Masked Autoregressive Flows (MAF, Papamakarios et al. 2017) where the transformation layer is built as an autoregressive neural network in the sense that it takes in some input and outputs with the requirement that this transformation is invertible and any output cannot depend on input dimensions . The Jacobian of this transformation is triangular and thus the Jacobian determinant is tractable. Instead of using a RNN to share parameters across the dimensions of one avoids this sequential computation by using masking, giving the method its name. The inverse however, needed for generating samples, is sequential.
By realizing that the scaling and translation function approximators don’t need to be invertible, it is straight-forward to implement conditioning of the PDF on some additional information : we concatenate to the inputs of the scaling and translation function approximators of the coupling layers, i.e. and which are modified to map . Another approach is to add a bias computed from to every layer inside the and networks as proposed by (Korshunova et al., 2018). This does not change the log-determinant of the coupling layers given by (2). More importantly for us, for sequential data we can share parameters across the different conditioners by using RNNs in an autoregressive fashion.
For discrete data the distribution has differential entropy of negative infinity, which leads to arbitrary high likelihoods when training normalizing flow models, even on test data. To avoid this one can dequantize the data, often by adding noise. The log-likelihood of the continuous model is then lower-bounded by the log-likelihood of the discrete one as shown in Theis et al. (2016).
The self-attention based Transformer layer (Vaswani et al., 2017) has been used for sequence modeling with great success. The multi-head self-attention mechanism enables it to capture both long- and short-term dependencies in time series data. Essentially, the Transformer takes in a sequence , and the multi-head self-attention transforms this into distinct query matrices , key matrices and value matrices where the , , and are the learnable parameters. After these linear projections the scaled dot-product attention computes a sequence of vector outputs via:
where a mask is applied to filter out right-ward attention (or future information leakage) by setting its upper-triangular elements to and we normalize by the dimension of the matrices. Afterwards all the outputs are concatenated and linearly projected again.
One typically uses the Transformer in an encoder-decoder setup, where some warm up time series is passed through the encoder and the decoder can be used to learn and autoregressively generate outputs.
Related to this work are models that combine normalizing flows for sequential modeling in some way. Transformation Autoregressive Networks (TANs, Oliva et al. 2018) which model the density of a multi-variate variable as conditional distributions , where the conditioning is given by a mixture model coming from the state of a RNN, and is then transformed via a bijection. The PixelSNAIL (Chen et al., 2018) method also models the joint as a product of conditional distributions, optionally with some global conditioning, via causal convolutions and self-attention (Vaswani et al., 2017)
to capture long-term temporal dependencies. These methods are well suited to modeling high dimensional data like images, however their use in modeling the temporal development of data has only recently been explored for example in VideoFlow(Kumar et al., 2019) which consists of a normalizing flow that autoregressively models the latent variable at time as a Gaussian whose parameters are functions of the flow at previous time steps.
Using RNNs for modeling either multi-variate or temporal dynamics introduces sequential computational dependencies that are not amenable to parallelization. However, RNNs have been shown to be very effective in modeling sequential dynamics and we feel that it is important to nonetheless explore RNN-based temporal conditioning for multi-variate time series forecasting. A recent work in this direction (Hwang et al., 2019)
employs bipartite flows with GRUs for temporal conditioning to develop a conditional generative model of multi-variate sequential data. The authors use a bidirectional training procedure to learn a generative model of observations that together with the temporal conditioning through a RNN, can also be conditioned on (observed) covariates that are modeled as additional conditioning variables in the latent space, which adds extra padding dimensions to the normalizing flow.
The Gaussian Copula Process method (Salinas et al., 2019) is a RNN-based time series model with a Gaussian copula process output model using a low-rank covariance structure to reduce the computational complexity and handle non-Gaussian marginal distributions. By using a low-rank approximation of the covariance matrix they obtain a computationally tractable method and are able to scale to multi-variate dimensions in the thousands with state-of-the-art results. We will compare our model to this method in what follows.
We denote the entities of a multi-variate time series by for where is the time index. Thus the multi-variate vector at time is given by . In this paper, we investigate the effect of the time evolution and the output distribution module on both statistical accuracy and computation time. We find that an attention layer for time evolution together with an MAF output density model results in high forecasting accuracy. At the same time, attention results in orders of magnitudes faster training.
In the DeepAR model (Flunkert et al., 2017), the log-likelihood of each entity at a time step is maximized given an individual time series’ history. This is done with respect to the parameters of the chosen distributional model (e.g. negative binomal for count data) via the state of a RNN derived from its previous time step and its corresponding covariates
. The emission distribution model, which is typically Gaussian for real-valued data or negative binomial for count data, is selected to best match the statistics of the time series and the network incorporates activation functions that satisfy the constraints of these distribution parameters, e.g. asoftplus() for the scale parameter of the Gaussian.
A simple model for multi-variate real-valued data could use a factorizing distribution in the emissions. Shared parameters can then learn patterns across the individual time series through the temporal component — but the model falls short of capturing dependencies in the emissions of the model. For this, a full joint distribution at each time step must be modeled, for example by using a multivariate Gaussian model. However, modeling the full covariance matrix not only increases the number of parameters of the neural network by, making learning difficult, but computing the loss becomes expensive when is large. Further, statistical dependencies in the emissions would be limited to second-order effects. These models are referred to as Vec-LSTM in Salinas et al. (2019).
We wish to have a scalable model of interacting time-series
, and further to use a flexible distribution model on the emissions that allows to capture and represent higher order moments. To this end, we model the conditional joint distribution at timeof all time series with a flow, e.g. a Real NVP, conditioned on either the hidden state of a RNN at time or an embedding of the time series up to from an attention module. In the case of an autoregeressive RNN (either a LSTM or a GRU; Chung et al. 2014), its hidden state is updated given the previous time step observation and its associated covariates (as in Figure 1):
This model is autoregressive since it consumes the observation of the last time step as well as the recurrent state to produce the state on which we condition the current observation.
To get a powerful emission distribution model, we stack layers of the flow model (Real NVP or MAF). Together with the RNN, we arrive at our model of the conditional distribution of the future of all time series, given its past and all the covariates in . As the model is autoregressive it can be written as a product of factors
where denotes the set of all parameters of both the flow and the RNN.
For modeling the time evolution, we also investigate using an attention module (Vaswani et al., 2017). This is used to compute an embedding of the time series up to , which we will also call for simplicity. In this case, the training time series is split into a warm up or encoding portion and the output portion . See Figure 2 for a schematic of the overall model in this case. In training, care has to be taken to prevent using information from future time points and to preserve the autoregressive property by utilizing a mask that reflects the causal direction of progressing time, i.e. to mask out future time points.
In real-world data the magnitudes of different time series can vary drastically. To normalize scales, we divide each individual time series by its mean before feeding it into the model. The outputs are then correspondingly multiplied with the same mean values to match the original scale. This rescaling technique simplifies the problem for the model, which is reflected in significantly improved empirical performance.
via SGD using Adam (Kingma & Ba, 2015) with respect to the parameters of the conditional flow and the RNN or Transformer. In practice, the time series in a batch is selected from a random time window of size within our training data, and the relative time steps are kept constant. This allows the model to learn to cold-start given only the covariates. This also increases the size of our training data when the training window size is not too large and allows us to trade-off computation time with memory consumption especially when or are large. Note that information about absolute time is only available to the RNN or Transformer via the covariates and not the relative position of in the training data.
The Transformer has computational complexity compared to a RNN which is , where is the dimension of the hidden state and the time series length. This means for large multi-variate time series, i.e. , and the assumption that the dimension of the hidden state grows proportional to the number of simultaneous time-series modeled, i.e. , the Transformer flow model has smaller computational complexity. In addition, unlike the RNN, all the computation for training happens in parallel. The Transformer allows the model to access any part of the historic time series regardless of temporal distance and thus is able to generate better conditioning for the normalizing flow.
We employ embeddings for categorical features (Charrington, 2018), which allows for relationships within a category, or its context, to be captured while training models. Combining these embeddings as features for time series forecasting yields powerful models like the first place winner of the Kaggle Taxi Trajectory Prediction111https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i challenge (De Brébisson et al., 2015). The covariates we use are composed of time-dependent (e.g. day of week, hour of day) and time-independent embeddings, if applicable, as well as lag features depending on the time frequency of the data set we are training on. All covariates must be known for the time periods we wish to forecast.
For inference we either obtain the hidden state by passing a “warm up” time series through the RNN or use the cold-start hidden state, i.e. we set , and then by sampling a noise vector from an isotropic Gaussian, go backward through the flow to obtain a sample of our time series for the next time step conditioned on this starting state . We then use this sample and its covariate to obtain the next conditioning state
via the RNN and repeat till our inference horizon. This process of sampling trajectories from some initial state can be repeated many times to obtain empirical quantiles of the uncertainty of our prediction for arbitrary long forecast horizons.
The attention model similarly uses a warm up time seriesand covariates and passes them through the encoder and then uses the decoder to output the conditioning for sampling from the flow. This sample is then used again in the decoder to iteratively sample the next conditioning state, similar to the inference procedure in seq-to-seq models.
Here we discuss a toy experiment sanity-checking our model as well as results on six real world data sets.
In this experiment, we check for some basic properties of our model by simulating flow of a liquid in a system of pipes with valves. See Figure 3 for a depiction of the system.
Liquid flows from left to right, where pressure at the first sensor in the system is given by
in the shape/scale parameterization of the Gamma distribution. The valves are given by, and we have
for and finally with . With this simulation we check whether our model captures correlations in space and time. The correlation between and results from both having the same source, measured by . This is reflected by , which is captured by our model.
The cross-covariance structure between consecutive time points in ground truth and as captured by our trained model is depicted in Figure 5. It reflects the true flow of liquid in the system from at time to and at time , on to at time .
best methods are in bold where the mean and standard errors are obtained by re-running each method three times.
For evaluation we compute the Continuous Ranked Probability Score (CRPS) (Matheson & Winkler, 1976) on each individual time series, as well as on the sum of all time series (the latter denoted by
). CRPS measures the compatibility of a cumulative distribution functionwith an observation as
where is the indicator function which is one if and zero otherwise. CRPS is a proper scoring function, hence CRPS attains its minimum when the predictive distribution and the data distribution equal. Employing the empirical CDF of , i.e. with samples as a natural approximation of the predictive CDF, CRPS can be directly computed from simulated samples of the conditional distribution (4) at each time point (Jordan et al., 2019). We take samples to estimate the empirical CDF in practice. In comparison to log-likelihood, CRPS puts heavier penalty on the distribution tails (Gebetsberger et al., 2018). Finally, is obtained by first summing across the time-series — both for the ground-truth data, and sampled data (yielding for each time-point). The results are then averaged over the prediction horizon, i.e. formally .
Our model is trained on the training split of each data set, and for testing we use a rolling windows prediction starting from the last point seen in the training data set and compare it to the test set.
We train on Exchange (Lai et al., 2018), Solar (Lai et al., 2018), Electricity222https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014, Traffic333https://archive.ics.uci.edu/ml/datasets/PEMS-SF, Taxi444https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page and Wikipedia555https://github.com/mbohlkeschneider/gluon-ts/tree/mv_release/datasets open data sets, preprocessed exactly as in (Salinas et al., 2019), with their properties listed in Table 2. Both Taxi and Wikipedia consist of count data and are thus dequantized before being fed to the flow (and mean-scaled).
We use batch sizes of , with
batches per epoch and train for a maximum ofepochs with a learning rate of
. The LSTM/GRU hyperparameters were the ones from(Salinas et al., 2019) and we used or stacks of normalizing flow bijections. We sample times to report the metrics on the test set. The Transformer uses heads and encoding and decoding layers and a dropout rate of . No extensive hyperparameter tuning was done. All the experiments run on a single Nvidia V-100 GPU.
We compare our method using GRU and two different normalizing flows (GRU-Real-NVP and GRU-MAF based on Real NVP and MAF, respectively) as well as a Transformer model with MAF (Transformer-MAF), with different RNN based methods and transformation schemes from (Salinas et al., 2019) and report the results in Table 1. Vec-LSTM-ind-scaling
outputs the parameters of an independent normal distribution with mean-scaling,Vec-LSTM-lowrank-Copula parametrizes a low-rank plus diagonal covariance via Copula process. GP-scaling unrolls a LSTM with scaling on each individual time series before reconstructing the joint distribution. Similarly, GP-Copula unrolls a LSTM on each individual time series and then the joint emission distribution is given by a low-rank plus diagonal covariance Gaussian copula.
In Table 1 we observe that MAF with either RNN or self-attention mechanism for temporal conditioning achieves the state-of-the-art (to the best of our knowledge) on all benchmarks. Moreover, bipartite flows with RNN either also outperform or are found to be competitive w.r.t. to the previous state-of-the-art results as listed in the first four columns of Table 1. Further analyses with other metrics (e.g. MSE) are reported in the Supplementary Material.
To assess how well our model captures dependencies in extrapolating the time series into the future versus real data, we plot in Figure 6 the cross-covariance matrix of observations (plotted left) as well as the mean of sample trajectories (middle plot) drawn from Transformer-MAF model for the test split of Traffic data set. The right most plot in the figure illustrates the absolute difference between the two cross-covariance matrices. As can be seen, most of the covariance structure especially in the top-left region of highly correlated sensors is very well reflected in the samples drawn from the model.
We have presented a general method to model high dimensional probabilistic multivariate time series via combining conditional normalizing flows with an autoregressive model, such as a recurrent neural network or an attention module. Autoregressive models have a long-standing reputation for working very well for time series forecasting, as they show good performance in extrapolation into the future. The flow model, on the other hand, does not assume any simple fixed distribution class, but instead can adapt to a broad range of high-dimensional data distributions. The combination hence combines the extrapolation power of the autoregressive model class with the density estimation flexibility of flows. Further, it is computationally efficient, without the need of resorting to approximations (e.g. low-rank approximations of a covariance structure). Analysis on six commonly used time series benchmarks establish the new state-of-the-art performance, without much hyperparameter tuning.
A natural way to improve the model is to incorporate a better underlying flow model. For example, Table 1 showed that swapping a Real NVP flow in for a MAF improved performance, which is a consequence of Real NVP lacking in density modeling performance compared to MAF. Likewise, we would expect other design choices of the flow model to improve performance, e.g. changes on the dequantization method, the specific affine coupling layer or more expressive conditioning, say via another Transformer. Recent improvements to flows, e.g. as proposed in the Flow++ paper (Ho et al., 2019), to obtain expressive bipartite flow models, or better flow models to handle discrete categorical data (Tran et al., 2019), are left as future work to assess their usefulness. To our knowledge, it is however still an open problem how to model discrete ordinal data via flows — which would best capture the nature of some data sets (e.g. sales data).
Also, we would expect improvements in the time evolution module to improve the time series forecasts. For example, recent improvements to the Transformer like the Reformer (Kitaev et al., 2020) could improve memory efficiency. We leave these improvements to future investigations.
Finally, real-world applications might require training on very large number of interacting time series , such as e.g. in sales modeling in e-Commerce, where can be in the order of millions or more. Flows have been successfully applied to image modeling, which is comparable to the instantaneous dimensionality faced in this setting — but the memory requirement becomes infeasible for large time-series. In future work, we’ll investigate mechanisms for scalable training of time-series models, e.g. by subsampling time-series.