1. Introduction
Forecasting is the task of extrapolating time series into the future; it is a generally well-studied area (see Hyndman and Athanasopoulos (2018) for an introduction). Forecasting has many important industrial applications in domains ranging from energy load (Xie and Hong, 2016) and e-commerce (Böse et al., 2017) to tourism (Athanasopoulos et al., 2011) and traffic (Laptev et al., 2017). Modern industrial applications often exhibit large panels of related time series, all of which need to be forecasted (Januschowski and Kolassa, 2019). The potential of “neural” forecasting models (i.e., models based on neural networks) in such contexts has long been exploited in applied industrial research, e.g., (Laptev et al., 2017; Gasthaus et al., 2019; Salinas et al., 2019c; Wen et al., 2017). Together with the overwhelming success of neural forecasting methods in the recent M4 competition (Smyl, 2020), this has also convinced formerly skeptical academics (Makridakis et al., 2018b, a).
Neural forecasting methods have seen many advancements in recent years, especially in data-abundant settings, with the central aim of learning a single global model over the entire panel of time series and thereby extracting patterns across multiple time series. Note that global models are distinct from multivariate models. Global models produce univariate forecasts, i.e., they forecast each time series member of the panel independently, but the parameters of the model are estimated over the entire panel. Multivariate time series models explicitly estimate the dependence structure between the time series in the panel. While multivariate models form an important area and initial work exists for neural forecasting methods (e.g., (Salinas et al., 2019a)), much further work is needed, and we focus on univariate time series forecasting in this article. Many deep learning architectures that have seen success in other domains (e.g., computer vision or natural language processing) have been adapted to and evaluated in the forecasting setting, ranging from simple feed-forward models, to convolutional neural networks (CNNs), in particular using 1-dimensional dilated causal convolutions
(Oord et al., 2016; Borovykh et al., 2017; Wen et al., 2017; Bai et al., 2018), recurrent neural networks (RNNs) (Salinas et al., 2019c; Mukherjee et al., 2018; Smyl, 2020), and attention-based models (Li et al., 2019; Lim et al., 2020; Vaswani et al., 2017). While some prior work has only considered the point forecasting setting, we focus on models that can produce probabilistic forecasts, i.e., forecasts that quantify the uncertainty over future events by estimating a probability distribution over future trajectories. Such probabilistic forecasts can be used for decision making under uncertainty, which is typically the ultimate goal in practical applications. To that end, the various aforementioned deep learning architectures have been combined with techniques for modeling probabilistic outputs. These techniques range from parametric distributions and parametric mixtures
(Salinas et al., 2019c; Mukherjee et al., 2018), over quantile regression-based techniques like quantile grids
(Wen et al., 2017), to parametric quantile function models (Gasthaus et al., 2019), semi-parametric probability integral transform / copula based techniques (Salinas et al., 2019a; Wen and Torkkola, 2019), and approaches based on discretization/bucketing (Oord et al., 2016). Recent developments in neural time series forecasting have mostly focused on improving model architectures (Li et al., 2019; Lim et al., 2020; Oreshkin et al., 2019; Lai et al., 2017; Sen et al., 2019) and developing strategies for modeling the probabilistic outputs in these models (Gasthaus et al., 2019; Salinas et al., 2019a; Wen and Torkkola, 2019) (see Faloutsos et al. (2019) for a recent overview). The work on global neural models for time series forecasting (Salinas et al., 2019c) has often hinted at (but not explored in detail) the importance of careful data preprocessing for learning across time series, especially when modeling datasets with heterogeneous magnitudes (e.g. in the retail demand forecasting setting). While different strategies for handling the differences in magnitudes across time series have been proposed, including mean/median scaling, standardization, bijective transformations (e.g. log or Box-Cox transforms), and discretization, a thorough and systematic empirical evaluation of the impact that input and output representations have on predictive performance and training stability, relative to the core forecasting model, has not been performed. The study presented in this paper shines some light on this question by performing an empirical comparison of multiple different input and output transformation techniques—with a particular focus on discretizing transformations—when combined with commonly-used neural forecasting architectures. It complements empirical studies evaluating the impact of other architectural choices, e.g. the extensive study of RNN models for forecasting conducted by Hewamalage et al. (2019).
The main finding and core contribution of our empirical study is the importance of discretizing inputs and outputs as a general technique for neural forecasting models. Our experimental results show that binning techniques improve the performance of forecasting models almost independently of the architecture of the neural network. This is mildly surprising, since the inputs and target time series in the forecasting setting are real values, and thus endowed with a natural total order and a notion of distance, which the model does not have access to after discretization. Further, typical forecasting accuracy measures (Gneiting et al., 2007; Hyndman and Koehler, 2006) also rely on notions of distance, giving higher scores to forecasts that are “close” to the true values. It is curious that giving up this order through discretization, and adopting a loss function in neural network models that does not take this order into account explicitly, leads to superior accuracy.
The rest of the paper is structured as follows: In Sec. 2 we first describe the general forecasting task and setup we consider; Sec. 3 describes the different data transformations and models that we compare; Sec. 4 contains the experimental results, which we discuss further in Sec. 5. We discuss related work in Sec. 6 and present some conclusions in Sec. 7.
2. Preliminaries
Our study explores the following commonly-used setup for forecasting problems. We are given a set of univariate time series $\{z_{i,1:T_i}\}_{i=1}^N$, where each time series $z_{i,1:T_i} = (z_{i,1}, \ldots, z_{i,T_i})$ is composed of consecutive values which are assumed to be equally spaced. In addition to the target time series (i.e. the ones whose future we are trying to predict), the methods we consider can optionally make use of a set of associated covariates $x_{i,1:T_i+\tau}$, which are required to be available until time point $T_i+\tau$, with $\tau$ being the prediction horizon of the forecast. Note that in this paper we will exclusively focus on applying transformations to the target time series values $z_{i,t}$, considering the covariates
given and fixed. In practice, the covariates are often synthetically constructed (e.g. date-dependent dummy variables) and require no further processing, or normalization techniques similar to those we discuss for the target time series can be applied.
Our goal is to model the joint conditional probability distribution $p(z_{i,T_i+1:T_i+\tau} \mid z_{i,1:T_i}, x_{i,1:T_i+\tau})$ for each time series $i$, given its past values $z_{i,1:T_i}$ and the observed additional covariates $x_{i,1:T_i+\tau}$. Global neural forecasting models achieve this by parametrizing this conditional distribution using a neural network with parameters $\theta$, which are learned jointly from the entire data set. In particular, for each time series $i$ we have

$$p(z_{i,T_i+1:T_i+\tau} \mid z_{i,1:T_i}, x_{i,1:T_i+\tau}) = p_\theta(z_{i,T_i+1:T_i+\tau} \mid z_{i,1:T_i}, x_{i,1:T_i+\tau}), \qquad (1)$$

and the parameters $\theta$ are learned by optimizing a scoring rule $\ell$ (often the negative log-likelihood) measuring the compatibility of the model with the observed data over the training data set, i.e.

$$\theta^\star = \arg\min_\theta \sum_{i=1}^N \ell\bigl(p_\theta(\,\cdot \mid z_{i,1:T_i}, x_{i,1:T_i+\tau}),\; z_{i,T_i+1:T_i+\tau}\bigr).$$
Note that in Eq. (1) the values of the target time series appear both in the conditioning set and as the predicted variables. We refer to a transformation that is applied to the variables in the conditioning set as an input transformation, and to a transformation that affects the predicted distribution as an output transformation. The resulting transformed values are the input and output representations, respectively. To illustrate the idea, consider a simple Markov model that predicts the next value $z_t$ conditioned on the preceding value $z_{t-1}$. Instead of directly modeling $p(z_t \mid z_{t-1})$, we can compute transformed inputs $u_{t-1} = f_{\mathrm{in}}(z_{t-1})$ and outputs $v_t = f_{\mathrm{out}}(z_t)$ and model $p(v_t \mid u_{t-1})$ instead. If $f_{\mathrm{out}}$ is invertible, we can generate samples from the predictive distribution by sampling $v_t \sim p(v_t \mid u_{t-1})$ and computing $z_t = f_{\mathrm{out}}^{-1}(v_t)$. If necessary (e.g. for computing the training loss), we can also evaluate the density via the change-of-variables formula,

$$p(z_t \mid z_{t-1}) = p\bigl(f_{\mathrm{out}}(z_t) \mid u_{t-1}\bigr)\,\bigl|f_{\mathrm{out}}'(z_t)\bigr|.$$

Note that while using the same transformation for both input and output, i.e. $f_{\mathrm{in}} = f_{\mathrm{out}}$, is commonly done (e.g. by preprocessing the data before applying the model), this is not necessary. In particular, the input transformation is not required to be invertible in general.
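To make the mechanics concrete, here is a minimal numerical sketch of this input/output-transformation pattern. The names `f_in`, `f_out` and the toy Gaussian "model" are our own illustration, not a method from this paper:

```python
import numpy as np

# Illustrative sketch of the input/output-transformation pattern described
# above. f_in / f_out and the toy Gaussian "model" are invented for
# illustration; they are not a method from this paper.

def f_in(z, nu):
    """Input transformation: mean scaling by a per-series scale nu."""
    return z / nu

def f_out(z, nu):
    """Output transformation (here chosen identical to f_in)."""
    return z / nu

def f_out_inv(v, nu):
    """Inverse of f_out, needed to map model samples back to the data scale."""
    return v * nu

def sample_next(z_prev, nu, rng):
    """Toy Markov 'model': transformed next value = transformed previous
    value plus Gaussian noise; invert f_out to return to the original scale."""
    u = f_in(z_prev, nu)
    v = u + rng.normal(scale=0.1)
    return f_out_inv(v, nu)

rng = np.random.default_rng(0)
sample = sample_next(100.0, 50.0, rng)
```

Because `f_out` is invertible here, sampling in the transformed space and mapping back is exact; an input-only transformation would drop `f_out_inv` entirely.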
When global models are used in the time series forecasting setting, the model parameters $\theta$ are learned jointly over the data set, but the parameters of the input and output transformations are typically allowed to vary per time series, e.g. by estimating them on data preceding the training range. For example, the mean-scaling approach employed by the DeepAR method (Salinas et al., 2019c) estimates a per-series scale $\nu_i$ and then sets $f_{\mathrm{in}}(z) = z/\nu_i$. DeepAR uses no output transformation, but includes $\nu_i$ as an additional input to the model.
3. Methods
Next, we introduce the main objects in our study, the input and output transformations, and the neural network models.
3.1. Transformations
The transformations of data that we apply in our empirical study range from schemes to rescale the time series in the panel to discretization and other transformations. We describe these next in detail.
3.1.1. Scaling
When training global forecasting models on datasets with heterogeneous scales, accounting for the difference in scales between time series in some way is of critical importance for obtaining good predictive performance. Firstly, it is desirable for the models to learn scaleinvariant patterns, especially seasonal behavior (e.g. the seasonal sales pattern for a text book is largely independent of how popular the specific book is). Secondly, neural network models with saturating nonlinearities are very sensitive to the scale of their inputs, leading to slow convergence (or convergence to undesirable optima) if the scale of their inputs is not carefully controlled.
A common approach for addressing the challenge of heterogeneous scales is to apply an affine transformation $f(z) = (z - b_i)/a_i$ to each time series, where the parameters $a_i, b_i$ of the transformation are chosen for each time series independently. Here, as a representative member of this family of transformations, we use the mean scaling (ms) scheme employed e.g. by DeepAR (Salinas et al., 2019c), which seems to be effective in many practical settings. In particular, we set $b_i = 0$ and $a_i = \nu_i = 1 + \frac{1}{T_i}\sum_{t=1}^{T_i} |z_{i,t}|$. Other common choices include min-max scaling ($b_i = \min_t z_{i,t}$, $a_i = \max_t z_{i,t} - \min_t z_{i,t}$) and standardization ($b_i$ the empirical mean, $a_i$ the empirical standard deviation). In practice, variants that use a more robust estimate of the scale, e.g. the trimmed mean or the median, have also been employed.
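The scaling schemes above can be sketched in a few lines. This is an illustrative implementation under the formulas as given, not code from any forecasting library:

```python
import numpy as np

# Sketch of the affine scaling schemes above (mean scaling, min-max,
# standardization). Function names are ours; mean scaling follows the
# 1 + mean-absolute-value convention described in the text.

def mean_scale(z):
    nu = 1.0 + np.mean(np.abs(z))          # a_i = nu_i, b_i = 0
    return z / nu, nu

def min_max_scale(z):
    lo, hi = z.min(), z.max()              # maps the series onto [0, 1]
    return (z - lo) / (hi - lo), (lo, hi)

def standardize(z):
    mu, sigma = z.mean(), z.std()          # b_i = mean, a_i = std
    return (z - mu) / sigma, (mu, sigma)

z = np.array([0.0, 10.0, 20.0, 30.0])
scaled, nu = mean_scale(z)                 # nu = 1 + 15 = 16
```

All three are invertible given the stored per-series parameters, which is what allows forecasts to be mapped back to the original scale.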
3.1.2. Continuous Transformations
In addition to the affine transformations described above, other continuous transformations, such as power transforms, e.g. the Box-Cox transform (Box and Cox, 1964) or the log transform (which is its limit as $\lambda \to 0$), are commonly applied not only in combination with “classical” forecasting techniques (e.g. ARIMA, exponential smoothing, or linear regression models; see (Hyndman and Athanasopoulos, 2018) for an introduction), but also in combination with deep learning techniques. The Box-Cox transform was originally introduced as a “Gaussianizing” transform, i.e. a transformation that makes the data distribution “more Gaussian”, so that techniques that assume Gaussian noise can more readily be applied. Here we will consider an alternative technique for transforming the marginal distributions into a form more amenable to modeling, based on the probability integral transform, similar to the approach proposed by Salinas et al. (2019a).

The probability integral transform (PIT) is the transformation that maps a random variable $Z$ through its cumulative distribution function $F_Z$, i.e. $U = F_Z(Z)$, resulting in a transformed variable $U$ with uniform distribution on $[0, 1]$. In the setting we are considering here, an (approximate) probability integral transform (pit) can be used to make the empirical marginal distribution of values in each time series (approximately) uniform. In particular, we apply the transform $u_{i,t} = \hat{F}_i(z_{i,t})$, where $\hat{F}_i$ is the empirical cumulative distribution function estimated from the observed values of time series $i$. The PIT is effectively the non-discretizing (i.e. producing real-valued instead of categorical outputs) analogue of the quantile binning transform discussed below. In order to make these two approaches directly comparable, we combine the PIT input transform with an additional two-layer input transformation network, which performs a function similar to the embedding layer used in conjunction with the discrete inputs. In principle, such a network could learn to implement a binning operation in the first layer, while the second layer performs the embedding, making this setup at least as expressive as quantile binning followed by an embedding layer.
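An approximate PIT via the empirical CDF can be sketched as follows; `empirical_pit` is an illustrative name, not a function from GluonTS:

```python
import numpy as np

# Sketch of the approximate PIT via the empirical CDF of a single series.
# `empirical_pit` is an illustrative name, not a library function.

def empirical_pit(z_train, z):
    """Map values through the empirical CDF estimated on z_train."""
    sorted_train = np.sort(z_train)
    # fraction of training points <= each value: approximately Uniform(0, 1)
    ranks = np.searchsorted(sorted_train, z, side="right")
    return ranks / len(sorted_train)

rng = np.random.default_rng(0)
z_train = rng.exponential(size=10_000)     # heavily skewed marginal
u = empirical_pit(z_train, z_train)        # transformed values are ~uniform
```

Even for the heavily skewed exponential marginal here, the transformed values are approximately uniform on $[0, 1]$, which is the property the text relies on.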
3.1.3. Discretizing Transformations
Binning is a form of data discretization (also called quantization) into a set of buckets with disjoint support, widely used in machine learning as a feature engineering technique. Formally, we define a function $B : \mathbb{R} \to \{1, \ldots, K\}$ which maps a real-valued input to a discrete output with $K$ distinct bin values. Each of the $K$ possible output values is tied to a specific interval (“bucket”) $[b_{k-1}, b_k)$ into which real-valued inputs can fall, with edge cases $b_0 = -\infty$ and $b_K = \infty$. The quantization transform then maps a real-valued input to its bucket index, i.e. $B(z) = k$ iff $z \in [b_{k-1}, b_k)$. In case the input domain happens to be a bounded subset of $\mathbb{R}$, we can adjust the edge cases accordingly. In order to also define a reconstruction function $R$ which transforms a discrete bucket value back to the original real-valued domain, we associate each bucket $k$ with a reconstruction value $r_k$ and set $R(k) = r_k$. Given such reconstruction values and assuming squared error as the loss function, it can be shown that the reconstruction error is minimized by choosing the bin edges as the midpoints $b_k = (r_k + r_{k+1})/2$. Note that optimal reconstruction values $r_k$, in the sense of minimizing squared reconstruction error, can also be obtained (given fixed bin edges $b_k$) by setting $r_k = \mathbb{E}[Z \mid Z \in [b_{k-1}, b_k)] = \int_{b_{k-1}}^{b_k} z\,p(z)\,\mathrm{d}z \big/ \int_{b_{k-1}}^{b_k} p(z)\,\mathrm{d}z$, where $p$ is the density of the real-valued inputs. Iterating the steps for optimizing the reconstruction values and bin edges is the Lloyd-Max algorithm for obtaining optimal quantizers.
Different strategies have been developed for selecting appropriate bin edges (or reconstruction values, choosing the edges as midpoints as described above), and we will consider two strategies as part of this paper: equally-spaced binning and quantile binning. In equally-spaced (linear) binning, an interval $[l, u]$ of the input domain is divided into $K$ intervals of equal width, i.e. $b_k = l + k \cdot (u - l)/K$ for $k = 1, \ldots, K-1$. In contrast to linear binning, quantile binning makes use of the underlying cumulative distribution function (CDF) to construct a binning such that the number of data points falling into each bin is (approximately) equal. In quantile binning, we first create a list of equally spaced quantile levels $q_0, q_1, \ldots, q_K$ where $q_0 = 0$ and $q_K = 1$. Then, we can obtain the quantile-based bin edges by evaluating the quantile function $F^{-1}$ for each $q_k$, i.e. $b_k = F^{-1}(q_k)$, ensuring that all buckets contain approximately the same number of samples.
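The two binning strategies and the midpoint-style reconstruction can be sketched as follows; function names are illustrative, not from any library:

```python
import numpy as np

# Sketch of linear vs. quantile binning with midpoint-style reconstruction,
# following the definitions above. Function names are illustrative.

def linear_edges(lo, hi, n_bins):
    """Interior edges b_1..b_{K-1} of K equal-width buckets on [lo, hi]."""
    return np.linspace(lo, hi, n_bins + 1)[1:-1]

def quantile_edges(samples, n_bins):
    """Interior edges chosen so each bucket holds ~ the same sample count."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(samples, qs)

def bin_values(z, edges):
    """Bucket index k such that edges[k-1] <= z < edges[k], open-ended."""
    return np.digitize(z, edges)

def reconstruct(idx, edges, lo, hi):
    """Map bucket indices back to real values (midpoints, clamped at ends)."""
    centers = np.concatenate([[lo], (edges[:-1] + edges[1:]) / 2.0, [hi]])
    return centers[idx]

rng = np.random.default_rng(0)
z = rng.exponential(size=1000)
edges = quantile_edges(z, 10)
idx = bin_values(z, edges)                 # bucket counts are ~100 each
rec = reconstruct(idx, edges, z.min(), z.max())
```

For the skewed sample here, quantile edges are dense near zero and sparse in the tail, which is exactly the property that makes quantile binning robust to heavy-tailed marginals.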
A nowadays well-established strategy for using categorical inputs with deep learning models, which we adopt here, is to use an embedding layer as the first layer in the model, which maps each categorical input $k$ to a vector $e_k$ that is learned together with the weights of the network using gradient descent. As part of this study we consider two main binning strategies for time series forecasting: local absolute binning, where the bins are computed for each time series separately, and global relative binning, where the bins are computed jointly for the entire dataset, after scaling each time series individually.
Local Absolute Binning (lab)
In local absolute binning, each time series is binned individually. This involves computing the bin edges for each time series separately and then binning each time series with its own edges. Since each time series is binned using its own set of bin edges, and each time series is mapped to the same set of bin identifiers $\{1, \ldots, K\}$, local absolute binning effectively acts as a scaling mechanism.
Global Relative Binning (grb)
In global relative binning, all time series are first rescaled and then binned with one global binning. In particular, we use the mean scaling approach described before to scale each time series. We can then estimate one single set of bin edges over the entire collection of scaled time series, and bin every scaled time series according to these bin edges. A visual example of global relative binning is shown in Figure 1, while an analysis of the reconstruction loss incurred via global binning is shown in Figure 2.
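A compact sketch of the global relative binning pipeline (scale per series, pool, fit one set of quantile edges); the function names are ours:

```python
import numpy as np

# Sketch of global relative binning (grb): mean-scale every series, pool the
# scaled values, and fit one global set of quantile bin edges. Names are ours.

def grb_fit(series_list, n_bins):
    scales = [1.0 + np.mean(np.abs(z)) for z in series_list]
    pooled = np.concatenate([z / nu for z, nu in zip(series_list, scales)])
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(pooled, qs), scales

def grb_transform(z, nu, edges):
    return np.digitize(z / nu, edges)      # scale first, then global binning

# two series with very different magnitudes end up on comparable bin indices
series = [np.arange(10.0), 100.0 * np.arange(10.0)]
edges, scales = grb_fit(series, 4)
codes = [grb_transform(z, nu, edges) for z, nu in zip(series, scales)]
```

Unlike local absolute binning, the bin edges here are shared across the panel, so the scale information survives only through the per-series scaling factor.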
Hybrid Binning (hyb)
In addition to pure local absolute or global relative binnings, one can compose multiple binnings by concatenating the resulting embeddings before passing them to the model. This not only allows us to use both local and global binnings, but also enables us to provide the model with multi-scale inputs by passing binnings with varying bin sizes and bin edges.
3.2. Output Representations
In terms of modeling the output distribution, we compare three different approaches: a parametric distribution, in particular a Student-t distribution (st) applied to the raw target values, but where the mean and variance are scaled by the mean value computed on the conditioning range for each time series (an approach used for example in (Salinas et al., 2019c)); the piecewise-linear spline quantile function approach of Gasthaus et al. (2019) (plqs); and a categorical distribution applied to the binned values obtained through one of the binning strategies described above, where the final forecasts are obtained by applying the reconstruction function to samples from the predictive distribution.

3.3. Models
As part of this study we consider three different base models, which we combine with the aforementioned input and output transformations: a variant of the WaveNet CNN architecture (Oord et al., 2016), the DeepAR RNN architecture (Salinas et al., 2019c), and a basic two-layer feed-forward model.
3.3.1. WaveNet (CNN)
The WaveNet (Oord et al., 2016) architecture is an autoregressive convolutional neural network which uses 1-dimensional dilated causal convolutions. While originally developed for speech synthesis, it has also been shown to be effective for time series forecasting (Borovykh et al., 2017). More generally, CNN-based architectures using dilated causal convolutions have demonstrated promising performance on a variety of sequence modeling tasks (Bai et al., 2018). The specific model used in the experiments is the original WaveNet architecture as described in (Oord et al., 2016), but using only a single stack of dilated convolutions with exponentially increasing dilation factor.
3.3.2. DeepAR (RNN)
DeepAR (Salinas et al., 2019c) is an autoregressive recurrent neural network architecture designed for time series forecasting. At its core, DeepAR is an RNN consisting of LSTM cells, which additionally receives autoregressive inputs in the form of lagged target values.
3.3.3. Simple Deep Neural Network (FeedForward)
The simple feed-forward model is a deep neural network which directly maps the past input sequence to the parameters of a multi-step output distribution, without any feedback loops or memory. The model used in the experiments is a plain two-layer model with 40 hidden units per layer, ReLU activation functions, and no additional regularization tricks (e.g. dropout, batch normalization, weight decay). No additional features or lags are used as inputs to this model.
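For concreteness, the forward pass of such a two-layer network can be sketched as follows. The weights are random placeholders and the context/horizon lengths are our choice; only the 40 hidden units per layer come from the description above:

```python
import numpy as np

# Forward-pass sketch of the two-layer feed-forward baseline (40 ReLU units
# per layer) mapping a context window to a multi-step output. Weights are
# random placeholders, not trained parameters; sizes other than the 40
# hidden units are assumptions for illustration.

rng = np.random.default_rng(0)
context_len, horizon, hidden = 24, 8, 40

W1 = rng.normal(scale=0.1, size=(context_len, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, hidden))
W_out = rng.normal(scale=0.1, size=(hidden, horizon))

def forward(z_context):
    h1 = np.maximum(z_context @ W1, 0.0)   # first ReLU layer
    h2 = np.maximum(h1 @ W2, 0.0)          # second ReLU layer
    return h2 @ W_out                      # one output per forecast step

y = forward(rng.normal(size=context_len))
```

In the actual models the final layer would parametrize one of the output distributions of Section 3.2 rather than emit point values directly.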
4. Experiments
Table 1: Effect of varying the output representation while keeping the input representation fixed. Cells show mean (± standard deviation) over 10 runs.

| Dataset | Output | WaveNet Mean wQL | WaveNet ND | DeepAR Mean wQL | DeepAR ND | FeedForw Mean wQL | FeedForw ND |
|---|---|---|---|---|---|---|---|
| m4_h | ms | 0.0988 (±0.0871) | 0.1135 (±0.0940) | 0.0566 (±0.0096) | 0.0676 (±0.0102) | 0.0407 (±0.0028) | 0.0519 (±0.0015) |
| m4_h | ms+plqs | 0.0453 (±0.0110) | 0.0557 (±0.0106) | 0.1462 (±0.0257) | 0.1618 (±0.0289) | NaN | NaN |
| m4_h | grb(bin1024) | 0.0371 (±0.0092) | 0.0487 (±0.0132) | 0.0953 (±0.0176) | 0.1071 (±0.0152) | 0.0428 (±0.0006) | 0.0539 (±0.0010) |
| m4_h | grb(bin1024,iqF) | 0.1292 (±0.0083) | 0.1518 (±0.0170) | 0.0779 (±0.0155) | 0.0890 (±0.0120) | 0.0468 (±0.0007) | 0.0588 (±0.0009) |
| m4_h | lab(bin1024) | 0.0372 (±0.0029) | 0.0463 (±0.0028) | 0.0979 (±0.0134) | 0.1123 (±0.0177) | 0.0419 (±0.0004) | 0.0528 (±0.0004) |
| m4_d | ms | 0.0260 (±0.0030) | 0.0321 (±0.0040) | 0.0282 (±0.0009) | 0.0338 (±0.0012) | 0.0298 (±0.0001) | 0.0304 (±0.0001) |
| m4_d | ms+plqs | 0.0237 (±0.0013) | 0.0289 (±0.0019) | 0.0300 (±0.0021) | 0.0363 (±0.0030) | NaN | NaN |
| m4_d | grb(bin1024) | 0.0228 (±0.0004) | 0.0280 (±0.0005) | 0.2134 (±0.0181) | 0.2235 (±0.0214) | 0.0307 (±0.0001) | 0.0353 (±0.0000) |
| m4_d | grb(bin1024,iqF) | 0.0530 (±0.0009) | 0.0629 (±0.0012) | 0.2103 (±0.0184) | 0.2187 (±0.0195) | 0.0283 (±0.0000) | 0.0330 (±0.0001) |
| m4_d | lab(bin1024) | 0.0359 (±0.0003) | 0.0412 (±0.0002) | 0.1675 (±0.0033) | 0.2132 (±0.0017) | 0.0316 (±0.0000) | 0.0367 (±0.0001) |
| m4_w | ms | 0.0547 (±0.0039) | 0.0686 (±0.0049) | 0.0455 (±0.0016) | 0.0565 (±0.0026) | 0.0705 (±0.0002) | 0.0811 (±0.0002) |
| m4_w | ms+plqs | 0.0502 (±0.0038) | 0.0626 (±0.0045) | 0.0477 (±0.0031) | 0.0570 (±0.0056) | NaN | NaN |
| m4_w | grb(bin1024) | 0.0447 (±0.0016) | 0.0569 (±0.0021) | 0.1746 (±0.0142) | 0.1947 (±0.0133) | 0.0725 (±0.0002) | 0.0851 (±0.0002) |
| m4_w | grb(bin1024,iqF) | 0.0641 (±0.0024) | 0.0799 (±0.0029) | 0.1724 (±0.0150) | 0.1899 (±0.0133) | 0.0707 (±0.0002) | 0.0832 (±0.0001) |
| m4_w | lab(bin1024) | 0.0623 (±0.0013) | 0.0770 (±0.0018) | 0.2140 (±0.0036) | 0.2446 (±0.0028) | 0.0764 (±0.0002) | 0.0885 (±0.0002) |
| m4_m | ms | 0.1313 (±0.0046) | 0.1576 (±0.0049) | 0.1376 (±0.0123) | 0.1639 (±0.028) | 0.1227 (±0.0009) | 0.1589 (±0.0007) |
| m4_m | ms+plqs | 0.1378 (±0.0013) | 0.1595 (±0.0028) | 0.1471 (±0.0149) | 0.1648 (±0.0062) | NaN | NaN |
| m4_m | grb(bin1024) | 0.1177 (±0.0031) | 0.1447 (±0.0030) | 0.1755 (±0.0223) | 0.2046 (±0.0048) | 0.1273 (±0.0004) | 0.1466 (±0.0003) |
| m4_m | grb(bin1024,iqF) | 0.1429 (±0.0034) | 0.1749 (±0.0054) | 0.1727 (±0.0220) | 0.2049 (±0.0020) | 0.1268 (±0.009) | 0.1454 (±0.0007) |
| m4_m | lab(bin1024) | 0.1507 (±0.0003) | 0.1819 (±0.0022) | 0.1931 (±0.0254) | 0.2257 (±0.0089) | 0.1231 (±0.0006) | 0.1470 (±0.0006) |
| m4_q | ms | 0.0936 (±0.0032) | 0.1148 (±0.0036) | 0.1067 (±0.0039) | 0.1299 (±0.0047) | 0.1097 (±0.0007) | 0.1299 (±0.0002) |
| m4_q | ms+plqs | 0.0987 (±0.0028) | 0.1188 (±0.0035) | 0.1267 (±0.0117) | 0.1439 (±0.0108) | NaN | NaN |
| m4_q | grb(bin1024) | 0.0908 (±0.0015) | 0.1126 (±0.0018) | 0.1673 (±0.0060) | 0.1903 (±0.0044) | 0.1146 (±0.0011) | 0.1318 (±0.0003) |
| m4_q | grb(bin1024,iqF) | 0.0998 (±0.0014) | 0.1237 (±0.0017) | 0.1591 (±0.0092) | 0.1819 (±0.0073) | 0.1131 (±0.0013) | 0.1291 (±0.0003) |
| m4_q | lab(bin1024) | 0.1195 (±0.0019) | 0.1412 (±0.0022) | 0.1647 (±0.0103) | 0.1980 (±0.0087) | 0.1197 (±0.0004) | 0.1374 (±0.0002) |
| m4_y | ms | 0.1235 (±0.0030) | 0.1476 (±0.0032) | 0.1733 (±0.0073) | 0.1940 (±0.0072) | 0.1262 (±0.0014) | 0.1497 (±0.0014) |
| m4_y | ms+plqs | 0.1271 (±0.0033) | 0.1486 (±0.0039) | 0.1758 (±0.0200) | 0.1973 (±0.0206) | NaN | NaN |
| m4_y | grb(bin1024) | 0.1538 (±0.0112) | 0.1860 (±0.0140) | 0.2765 (±0.0076) | 0.3073 (±0.0110) | 0.2131 (±0.0008) | 0.2295 (±0.0002) |
| m4_y | grb(bin1024,iqF) | 0.1407 (±0.0116) | 0.1712 (±0.0122) | 0.3001 (±0.0148) | 0.3264 (±0.0124) | 0.2100 (±0.0009) | 0.2254 (±0.0002) |
| m4_y | lab(bin1024) | 0.2024 (±0.0110) | 0.2237 (±0.0169) | 0.2596 (±0.0081) | 0.3034 (±0.0135) | 0.2300 (±0.0003) | 0.2399 (±0.0006) |
| elec | ms | 0.0610 (±0.0018) | 0.0774 (±0.0028) | 0.0551 (±0.0011) | 0.0678 (±0.0016) | 0.0668 (±0.0010) | 0.0826 (±0.0017) |
| elec | ms+plqs | 0.0540 (±0.0028) | 0.0681 (±0.0036) | 0.0582 (±0.0029) | 0.0707 (±0.0030) | NaN | NaN |
| elec | grb(bin1024) | 0.0475 (±0.0016) | 0.0588 (±0.0024) | 0.0647 (±0.0018) | 0.0821 (±0.0020) | 0.0677 (±0.0013) | 0.0841 (±0.0020) |
| elec | grb(bin1024,iqF) | 0.0512 (±0.0015) | 0.0641 (±0.0013) | 0.0712 (±0.0064) | 0.0905 (±0.0088) | 0.0661 (±0.0007) | 0.0817 (±0.0008) |
| elec | lab(bin1024) | 0.0527 (±0.0009) | 0.0647 (±0.0007) | 0.0791 (±0.0001) | 0.1013 (±0.0002) | 0.0671 (±0.0010) | 0.0793 (±0.0012) |
| traff | ms | 0.1437 (±0.0044) | 0.1721 (±0.0050) | 0.1185 (±0.0089) | 0.1401 (±0.0014) | 0.2111 (±0.0008) | 0.2527 (±0.0008) |
| traff | ms+plqs | 0.1237 (±0.0034) | 0.1510 (±0.0040) | 0.1369 (±0.0045) | 0.1680 (±0.0016) | 0.3185 (±0.1331) | 0.3849 (±0.1646) |
| traff | grb(bin1024) | 0.1209 (±0.0016) | 0.1455 (±0.0019) | 0.1883 (±0.0012) | 0.2323 (±0.0005) | 0.2218 (±0.0010) | 0.0252 (±0.0015) |
| traff | grb(bin1024,iqF) | 0.1229 (±0.0017) | 0.1495 (±0.0022) | 0.1835 (±0.0034) | 0.2245 (±0.0012) | 0.2210 (±0.0017) | 0.0341 (±0.0050) |
| traff | lab(bin1024) | 0.1261 (±0.0006) | 0.1536 (±0.0007) | 0.1632 (±0.0051) | 0.2005 (±0.0028) | 0.3820 (±0.0832) | 0.0206 (±0.0058) |
| wiki | ms | 0.2204 (±0.0021) | 0.2480 (±0.0027) | 0.2284 (±0.0012) | 0.2577 (±0.0012) | 0.2721 (±0.0015) | 0.3349 (±0.0014) |
| wiki | ms+plqs | 0.2336 (±0.0097) | 0.2654 (±0.0118) | 0.2305 (±0.0040) | 0.2561 (±0.0033) | NaN | NaN |
| wiki | grb(bin1024) | 0.2156 (±0.0020) | 0.2439 (±0.0021) | 0.8465 (±0.0199) | 0.9459 (±0.0270) | 0.2930 (±0.0013) | 0.3316 (±0.0010) |
| wiki | grb(bin1024,iqF) | 0.2224 (±0.0037) | 0.2524 (±0.0044) | 0.7919 (±0.0019) | 0.9086 (±0.0076) | 0.2961 (±0.0015) | 0.3346 (±0.0009) |
| wiki | lab(bin1024) | 0.2564 (±0.0016) | 0.2901 (±0.0021) | 0.6996 (±0.0009) | 0.8200 (±0.0007) | 0.2540 (±0.0004) | 0.2858 (±0.0005) |
Table 2: Effect of varying the input representation while keeping the output representation fixed. Cells show mean (± standard deviation) over 10 runs.

| Dataset | Input | WaveNet Mean wQL | WaveNet ND | DeepAR Mean wQL | DeepAR ND | FeedForw Mean wQL | FeedForw ND |
|---|---|---|---|---|---|---|---|
| m4_h | ms | 0.0391 (±0.0057) | 0.0506 (±0.0083) | 0.0931 (±0.0093) | 0.1066 (±0.0090) | 0.0463 (±0.0005) | 0.0588 (±0.0011) |
| m4_h | lab(bin1024) | 0.0577 (±0.0075) | 0.0736 (±0.0099) | 0.1114 (±0.0078) | 0.1255 (±0.0101) | 0.0517 (±0.0035) | 0.0643 (±0.0034) |
| m4_h | pit(bin1024) | 0.0296 (±0.0001) | 0.0370 (±0.0002) | 0.0902 (±0.0089) | 0.1120 (±0.0095) | 0.0721 (±0.0392) | 0.0912 (±0.0491) |
| m4_h | hyb(16,128,1024) | 0.0375 (±0.0009) | 0.0504 (±0.0003) | 0.1020 (±0.0057) | 0.1189 (±0.0109) | 0.0435 (±0.0004) | 0.0549 (±0.0006) |
| m4_h | hyb(grb,lab) | 0.0369 (±0.0061) | 0.0475 (±0.0089) | 0.1057 (±0.0088) | 0.1201 (±0.0110) | 0.0421 (±0.0017) | 0.0537 (±0.0021) |
| m4_d | ms | 0.0315 (±0.0057) | 0.0378 (±0.0065) | 0.2128 (±0.0182) | 0.2216 (±0.0188) | 0.0305 (±0.0000) | 0.0352 (±0.0000) |
| m4_d | lab(bin1024) | 0.0317 (±0.0007) | 0.0369 (±0.0008) | 0.2189 (±0.0124) | 0.2244 (±0.0130) | 0.0305 (±0.0000) | 0.0352 (±0.0001) |
| m4_d | pit(bin1024) | 0.0286 (±0.0053) | 0.0345 (±0.0061) | 0.2204 (±0.0144) | 0.2283 (±0.0141) | 0.0305 (±0.0002) | 0.0352 (±0.0000) |
| m4_d | hyb(16,128,1024) | 0.0227 (±0.0003) | 0.0278 (±0.0004) | 0.2196 (±0.0137) | 0.2267 (±0.0135) | 0.0306 (±0.0001) | 0.0353 (±0.0001) |
| m4_d | hyb(grb,lab) | 0.0272 (±0.0004) | 0.0318 (±0.0003) | 0.2222 (±0.0156) | 0.2301 (±0.0157) | 0.0307 (±0.0001) | 0.0353 (±0.0000) |
| m4_w | ms | 0.0848 (±0.0327) | 0.1026 (±0.0371) | 0.1651 (±0.0113) | 0.1830 (±0.0111) | 0.0750 (±0.0005) | 0.0839 (±0.0001) |
| m4_w | lab(bin1024) | 0.1061 (±0.0023) | 0.1244 (±0.0033) | 0.1838 (±0.0070) | 0.1995 (±0.0083) | 0.0760 (±0.0002) | 0.0834 (±0.0001) |
| m4_w | pit(bin1024) | 0.0467 (±0.0022) | 0.0585 (±0.0028) | 0.1884 (±0.0099) | 0.2082 (±0.0103) | 0.0724 (±0.0005) | 0.0848 (±0.0004) |
| m4_w | hyb(16,128,1024) | 0.0443 (±0.0010) | 0.0561 (±0.0014) | 0.1792 (±0.0047) | 0.1980 (±0.0042) | 0.0723 (±0.0002) | 0.0854 (±0.0002) |
| m4_w | hyb(grb,lab) | 0.0500 (±0.0012) | 0.0627 (±0.0015) | 0.1815 (±0.0072) | 0.1975 (±0.0074) | 0.0719 (±0.0003) | 0.0849 (±0.0002) |
| m4_m | ms | 0.1373 (±0.0143) | 0.1655 (±0.0137) | 0.2080 (±0.0102) | 0.2412 (±0.0098) | 0.1392 (±0.0009) | 0.1470 (±0.0000) |
| m4_m | lab(bin1024) | 0.2055 (±0.0021) | 0.2136 (±0.0012) | 0.2395 (±0.0154) | 0.2891 (±0.0101) | 0.1396 (±0.0005) | 0.1463 (±0.0001) |
| m4_m | pit(bin1024) | 0.1213 (±0.0024) | 0.1481 (±0.0029) | 0.1921 (±0.0097) | 0.2287 (±0.0084) | 0.1332 (±0.0049) | 0.1462 (±0.0009) |
| m4_m | hyb(16,128,1024) | 0.1187 (±0.0037) | 0.1463 (±0.0046) | 0.1944 (±0.0098) | 0.2294 (±0.0057) | 0.1267 (±0.0023) | 0.1459 (±0.0001) |
| m4_m | hyb(grb,lab) | 0.1206 (±0.0010) | 0.1468 (±0.0008) | 0.2018 (±0.0105) | 0.2388 (±0.0083) | 0.1264 (±0.0014) | 0.1454 (±0.0002) |
| m4_q | ms | 0.1272 (±0.0006) | 0.1488 (±0.0003) | 0.1507 (±0.0037) | 0.1698 (±0.0021) | 0.1256 (±0.0009) | 0.1501 (±0.0008) |
| m4_q | lab(bin1024) | 0.1299 (±0.0017) | 0.1486 (±0.0013) | 0.1689 (±0.0025) | 0.1861 (±0.0016) | 0.1174 (±0.0011) | 0.1320 (±0.0004) |
| m4_q | pit(bin1024) | 0.1278 (±0.0014) | 0.1488 (±0.0002) | 0.1748 (±0.0028) | 0.1958 (±0.0021) | 0.1180 (±0.0011) | 0.1324 (±0.0002) |
| m4_q | hyb(16,128,1024) | 0.0893 (±0.0011) | 0.1108 (±0.0012) | 0.1743 (±0.0052) | 0.1972 (±0.0035) | 0.1152 (±0.0019) | 0.1314 (±0.0007) |
| m4_q | hyb(grb,lab) | 0.1137 (±0.0032) | 0.1372 (±0.0034) | 0.1722 (±0.0029) | 0.1974 (±0.0008) | 0.1152 (±0.0021) | 0.1308 (±0.0007) |
| m4_y | ms | 0.1308 (±0.0039) | 0.1562 (±0.0034) | 0.2663 (±0.0177) | 0.2907 (±0.0123) | 0.2162 (±0.0016) | 0.2326 (±0.0008) |
| m4_y | lab(bin1024) | 0.2812 (±0.0144) | 0.3171 (±0.0094) | 0.3062 (±0.0140) | 0.3248 (±0.0085) | 0.2143 (±0.0004) | 0.2309 (±0.0002) |
| m4_y | pit(bin1024) | 0.1844 (±0.0523) | 0.2202 (±0.0621) | 0.3058 (±0.0077) | 0.3280 (±0.0086) | 0.2151 (±0.0019) | 0.2324 (±0.0022) |
| m4_y | hyb(16,128,1024) | 0.1337 (±0.0033) | 0.1618 (±0.0045) | 0.2925 (±0.0028) | 0.3219 (±0.0024) | 0.2129 (±0.0006) | 0.2295 (±0.0001) |
| m4_y | hyb(grb,lab) | 0.2065 (±0.0149) | 0.2505 (±0.0195) | 0.3184 (±0.0050) | 0.3576 (±0.0052) | 0.2235 (±0.0068) | 0.2388 (±0.0020) |
| elec | ms | 0.0501 (±0.0010) | 0.0607 (±0.0017) | 0.0732 (±0.0007) | 0.0923 (±0.0004) | 0.0800 (±0.0038) | 0.1004 (±0.0056) |
| elec | lab(bin1024) | 0.1389 (±0.0070) | 0.1677 (±0.0096) | 0.0986 (±0.0023) | 0.1107 (±0.0067) | 0.1269 (±0.0033) | 0.1632 (±0.0034) |
| elec | pit(bin1024) | 0.0484 (±0.0010) | 0.0598 (±0.0015) | 0.4210 (±0.1192) | 0.4924 (±0.1078) | 0.0705 (±0.0026) | 0.0875 (±0.0041) |
| elec | hyb(16,128,1024) | 0.0495 (±0.0004) | 0.0612 (±0.0007) | 0.1143 (±0.0028) | 0.1339 (±0.0055) | 0.0678 (±0.0018) | 0.0843 (±0.0026) |
| elec | hyb(grb,lab) | 0.0472 (±0.0005) | 0.0585 (±0.0005) | 0.1528 (±0.0072) | 0.1801 (±0.0089) | 0.0687 (±0.0009) | 0.0856 (±0.0011) |
| traff | ms | 0.1251 (±0.0013) | 0.1507 (±0.0013) | 0.1974 (±0.0088) | 0.2423 (±0.0339) | 0.2280 (±0.0005) | 0.0287 (±0.0021) |
| traff | lab(bin1024) | 0.2571 (±0.0174) | 0.3200 (±0.0246) | 0.2535 (±0.0246) | 0.3131 (±0.0632) | 0.2456 (±0.0010) | 0.0070 (±0.0021) |
| traff | pit(bin1024) | 0.1275 (±0.0008) | 0.1539 (±0.0010) | 0.5953 (±0.1299) | 0.7266 (±0.2040) | 0.2258 (±0.0017) | 0.0254 (±0.0037) |
| traff | hyb(16,128,1024) | 0.1242 (±0.0008) | 0.1498 (±0.0009) | 0.1886 (±0.0072) | 0.2316 (±0.0290) | 0.2184 (±0.0011) | 0.0249 (±0.0016) |
| traff | hyb(grb,lab) | 0.1245 (±0.0011) | 0.1505 (±0.0018) | 0.1885 (±0.0091) | 0.2315 (±0.0387) | 0.2182 (±0.0007) | 0.0217 (±0.0028) |
| wiki | ms | 0.2183 (±0.0028) | 0.2472 (±0.0033) | 0.8156 (±0.0176) | 0.9170 (±0.0234) | 0.3027 (±0.0007) | 0.3381 (±0.0008) |
| wiki | lab(bin1024) | 0.3071 (±0.0030) | 0.3478 (±0.0026) | 0.8143 (±0.0112) | 0.9115 (±0.0130) | 0.3066 (±0.0005) | 0.3408 (±0.0004) |
| wiki | pit(bin1024) | 0.2177 (±0.0043) | 0.2465 (±0.0047) | 0.9238 (±0.2095) | 0.9981 (±0.3049) | 0.2935 (±0.0014) | 0.3304 (±0.0014) |
| wiki | hyb(16,128,1024) | 0.2163 (±0.0017) | 0.2447 (±0.0019) | 0.8140 (±0.0129) | 0.9075 (±0.0090) | 0.2927 (±0.0012) | 0.3311 (±0.0008) |
| wiki | hyb(grb,lab) | 0.2342 (±0.0035) | 0.2631 (±0.0037) | 0.8191 (±0.0150) | 0.9238 (±0.0198) | 0.2931 (±0.0012) | 0.3316 (±0.0011) |
Our experiments were conducted using GluonTS (Alexandrov et al., 2019), a Python toolkit for probabilistic time series modeling with deep learning-based models, built on MXNet (Chen et al., 2015). GluonTS provides implementations of the models introduced in Section 3.3, and we extended it with the ability to flexibly specify the input representations and output distributions. We trained all models on the data sets from the M4 forecasting competition (m4_hourly, m4_daily, m4_weekly, m4_monthly, m4_quarterly, m4_yearly) (Makridakis et al., 2018a), on the electricity and traffic datasets (Dua and Graff, 2017), and on a sample of daily page hits of Wikipedia subpages, wiki10k, using various combinations of representations, and report mean error metrics and standard deviations over 10 random runs per configuration. Specifically, we investigated the performance effect of fixing the input representation while varying the output representation and vice versa, and also examined the effects of the binning and embedding resolutions in detail. Although we do report results for m4_hourly, we note that we specifically used this dataset for tuning hyperparameters and for generating deeper insights on representation performance.

Both the FeedForw and WaveNet models were trained using the Adam optimizer with a decaying learning rate, in mini-batches over a fixed number of epochs, on a p2.xlarge instance (NVIDIA K80 GPU) on Amazon Web Services. DeepAR follows the same setting but uses a different initial learning rate and number of epochs. All models, with the exception of FeedForw, make use of supplementary covariates encoding date-dependent features in the form of dummy variables, in addition to the time series target values. Moreover, DeepAR further utilizes lagged values at varying frequencies (hourly, daily, weekly, etc.) for quicker convergence, as this allows the model to pick up highly periodic patterns more easily.
By default, binnings are assumed to be quantile-based, utilize bins, and are embedded in a dimensional space (Team, 2017) before being fed into the model. Initially, we also experimented with linear binning, but found the quantile binnings to be generally more reliable. When using quantile splines on the output, we default to a resolution of knots.
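As a concrete illustration, quantile-based binning computes bin edges from the empirical quantiles of the data, so each bin receives roughly the same number of observations, unlike linear (equal-width) binning. A minimal NumPy sketch (the function names are ours, not the GluonTS API):

```python
import numpy as np

def quantile_bin_edges(values, num_bins):
    """Compute interior bin edges from empirical quantiles, so each
    bin holds roughly the same number of observations."""
    probs = np.linspace(0.0, 1.0, num_bins + 1)[1:-1]  # interior edges only
    return np.quantile(values, probs)

def bin_series(series, edges):
    """Map each real value to the index of its quantile bin."""
    return np.searchsorted(edges, series, side="right")

# Toy example on a heavy-tailed series.
rng = np.random.default_rng(0)
series = rng.lognormal(size=10_000)
edges = quantile_bin_edges(series, num_bins=1024)
binned = bin_series(series, edges)          # integer indices in [0, 1023]
counts = np.bincount(binned, minlength=1024)  # roughly uniform occupancy
```

The resulting integer indices would then be passed through an embedding layer before entering the model.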
We report predictive performance in the form of two commonly-used accuracy metrics. To evaluate the quality of the predictive distributions we measure the mean weighted quantile loss, which is an approximation to the continuous ranked probability score (Matheson and Winkler, 1976; Gneiting and Raftery, 2007). In particular, we compute

$$\text{mean wQL} = \frac{1}{|A|}\sum_{\alpha\in A}\text{wQL}[\alpha], \qquad \text{wQL}[\alpha] = \frac{2\sum_{i,t}\Lambda_\alpha(\hat q^{\alpha}_{i,t}, z_{i,t})}{\sum_{i,t}|z_{i,t}|},$$

where $\Lambda_\alpha(q, z) = (\alpha - \mathbb{1}_{z < q})(z - q)$ is the quantile (pinball) loss, $\hat q^{\alpha}_{i,t}$ is the $\alpha$-quantile of the predictive distribution for $z_{i,t}$, and $A$ is the set of quantile levels we evaluate. To evaluate the point forecasting performance, we evaluate the normalized deviation (ND), which is equivalent to wQL evaluated only at the median, i.e. with $A = \{0.5\}$.
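The mean wQL and ND metrics can be computed directly from a set of predicted quantiles; the following sketch (helper names are ours, not a GluonTS API) illustrates the computation:

```python
import numpy as np

def quantile_loss(alpha, actual, quantile_pred):
    """Pinball (quantile) loss at level alpha."""
    diff = actual - quantile_pred
    return np.maximum(alpha * diff, (alpha - 1) * diff)

def mean_wql(actual, quantile_preds, alphas):
    """Mean weighted quantile loss over a set of quantile levels.

    quantile_preds: dict mapping alpha -> predicted quantiles,
    same shape as `actual`. Normalization is by sum of |actual|.
    """
    denom = np.sum(np.abs(actual))
    losses = [2 * np.sum(quantile_loss(a, actual, quantile_preds[a])) / denom
              for a in alphas]
    return np.mean(losses)

# ND is mean wQL evaluated only at the median:
actual = np.array([1.0, 2.0, 3.0])
preds = {0.5: np.array([1.0, 2.0, 2.0])}
nd = mean_wql(actual, preds, alphas=[0.5])  # 2 * 0.5 * 1 / 6 = 1/6
```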
The main results are shown in Table 1 (which shows the effect of varying the output representation but keeping the input fixed), Table 2 (different discretizing input transformations while keeping the output fixed), and Table 3 (performance with scaled input/output but no binning).
We performed additional experiments using the WaveNet model on the m4_hourly data set to better understand the effect of the various transformation hyperparameters (grb vs. lab vs. pit; number of bins used; embedding size). The results of these experiments are shown in Figure 3.

5. Discussion
In the following we summarize the observations and conclusions from our experiments.
Output Scaling versus Binning
Our main results show (cf. first column in Table 1) that the WaveNet model in particular benefits substantially from binned output representations when compared to real-valued, scaled outputs modeled through a parametric Student-t distribution (ms) or the quantile spline output (msplqs). In fact, WaveNet combined with global relative (quantile) binning for both the input and output transformation (grb(bin1024)) almost always (m4_y being the exception) outperforms all other combinations in our comparison across datasets. Interestingly, for DeepAR this effect is reversed (cf. Table 1, col. 2) and mean-scaled, non-binned output representations substantially outperform the discretized ones. FeedForw shows no clear advantage for either of the representations, but generally performs worse than either of the other models in their best configuration. These results underline our claim that input/output representations in general, and output representations in particular, can be as important as (or even more important than) the choice of model class for obtaining good predictive performance, and that more powerful models like WaveNet can be outperformed by simpler models like the feed-forward model if the representations are not carefully chosen (e.g. on m4_h, FeedForw with mean-scaled Student-t output (ms) outperforms WaveNet with the same output, but is in turn outperformed by WaveNet with global relative binning (grb)).
Input Scaling vs. Binning
Table 2 shows the performance of the models when the input representation is varied while the output representation is fixed (grb). Interestingly, while the hybrid binning (hyb(16,128,1024)) often performs well, there is no clearly dominant input strategy that outperforms the others across datasets and/or models. However, the impact of the input transformation on performance is also less pronounced than that of the output transformation. One notable exception is local absolute binning (lab), which often performs significantly worse than the other strategies in this setting. This, in combination with the insensitivity to the number of input bins shown in Figure 3, hints at the models' ability to extract sufficient information from either of the binnings, even at low resolution. Further, as expected, the (pit) strategy, being the continuous analogue of (grb), performs on par with it, though (grb) appears to have a slight edge.
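The probability integral transform underlying (pit) can be sketched as mapping values through the empirical CDF of a reference sample (e.g. the training data), which is the continuous counterpart of assigning quantile bin indices. A minimal sketch with our own function names:

```python
import numpy as np

def pit_transform(series, reference):
    """Probability integral transform: map each value through the
    empirical CDF of `reference`, yielding values in (0, 1]."""
    ref_sorted = np.sort(reference)
    ranks = np.searchsorted(ref_sorted, series, side="right")
    return ranks / len(ref_sorted)

rng = np.random.default_rng(1)
train = rng.lognormal(size=5000)
u = pit_transform(train, train)
# Applied to the reference sample itself, the output is
# (approximately) uniformly distributed on (0, 1].
```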
Binning resolution effects
Interestingly, we found (cf. Figure 3) that, given a fixed global relative binning on the output with 1024 quantile bins, a surprisingly small number of input bins already suffices to achieve good predictive accuracy and, moreover, that increasing the number of input bins does not significantly improve performance. In contrast, given a fixed global relative binning on the input with 1024 quantile bins, increasing the number of bins on the output leads to steady improvements in performance. While the latter effect is mostly expected due to the reconstruction loss incurred by a discretized output with fewer bins (cf. Figure 2), the former effect is more surprising and hints at the fact that the models learn to focus on coarse-grained effects in the input, rather than on fine details (which would be lost with a smaller number of bins).
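The reconstruction loss incurred by a discretized output can be made concrete by quantizing a series at several resolutions and measuring the relative error of reconstructing each value from its bin. This sketch assumes each bin is reconstructed by the median of its members; the paper's exact dequantization convention may differ:

```python
import numpy as np

def quantize_dequantize(series, num_bins):
    """Quantile-bin a series, reconstruct each value by its bin's
    median, and return the relative L1 reconstruction error."""
    probs = np.linspace(0.0, 1.0, num_bins + 1)[1:-1]
    edges = np.quantile(series, probs)
    idx = np.searchsorted(edges, series, side="right")
    recon = np.empty_like(series)
    for b in np.unique(idx):
        mask = idx == b
        recon[mask] = np.median(series[mask])  # representative per bin
    return np.abs(recon - series).sum() / np.abs(series).sum()

rng = np.random.default_rng(0)
series = rng.lognormal(size=20_000)
errs = {k: quantize_dequantize(series, k) for k in (16, 128, 1024)}
# Reconstruction error shrinks steadily as the number of bins grows.
```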
Embedding size effects
Since the embedding size, which is governed by a heuristic described in Section 4, depends on the number of bins, we also explicitly assess the performance impact of varying the embedding size in isolation, keeping the other parameters fixed (Figure 3c). Similar to the results reported in Figure 3a, we found that altering the embedding size while keeping the number of bins fixed does not significantly impact performance, and that a relatively small embedding size is sufficient.

Global versus Local Binning
We observed that the global relative binning strategy tends to work better than local absolute binning for the output. While the effect is small on some datasets, it is more pronounced on others (e.g. WaveNet on m4_m, m4_q, and m4_y in Table 1). Note that (grb) is used for the input transformation here, so that there is a “mismatch” between the input and output binning, which seems to be responsible for part of this effect. However, we performed additional experiments with the (lab) input transformation (not shown) in which this effect is somewhat alleviated, but does not vanish.
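The distinction can be sketched as follows: global relative binning computes one shared set of quantile edges over mean-scaled values pooled across the whole panel, while local absolute binning computes equal-width edges per series on its raw values. The exact scaling and edge conventions in the paper may differ; this is an illustrative sketch with our own names:

```python
import numpy as np

def global_relative_bins(panel, num_bins):
    """One shared set of quantile bin edges, computed on mean-scaled
    values pooled across the whole panel of series."""
    scaled = np.concatenate([s / np.mean(np.abs(s)) for s in panel])
    probs = np.linspace(0.0, 1.0, num_bins + 1)[1:-1]
    return np.quantile(scaled, probs)

def local_absolute_bins(series, num_bins):
    """Equal-width bin edges computed per series on its raw values."""
    return np.linspace(series.min(), series.max(), num_bins + 1)[1:-1]

rng = np.random.default_rng(0)
panel = [rng.lognormal(mean=m, size=500) for m in (0.0, 2.0, 4.0)]
grb_edges = global_relative_bins(panel, 1024)              # shared edges
lab_edges = [local_absolute_bins(s, 1024) for s in panel]  # per-series edges
```

Under (grb), series on very different scales map to comparable bin indices after mean scaling; under (lab), each series gets its own absolute grid.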
Hybrid versus Single Binning
We also analyzed whether hybrid binning strategies used as an input transformation can improve performance over a single binning. Specifically, we considered two different kinds of hybrid binnings: hyb(16,128,1024), which combines multiple global relative binnings at different resolutions, and hyb(grb,lab), which combines a global relative and a local absolute binning. Our results show that the multi-scale hybrid binning does indeed improve performance in many instances and is in fact the best-performing method reported for many datasets when used in conjunction with WaveNet. However, combining local and global information does not consistently lead to improvements over the best-performing single binning, but rather averages the results reported for global relative inputs and local absolute inputs.
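A hybrid multi-scale input can be sketched as stacking bin indices from several binnings at different resolutions, each of which would then be embedded separately. For brevity this sketch computes quantile edges on the single series; in our experiments the global relative binnings share edges across the panel:

```python
import numpy as np

def hybrid_binning(series, bin_counts=(16, 128, 1024)):
    """Stack quantile-bin indices of the same series at several
    resolutions; each column becomes its own embedded input feature."""
    cols = []
    for k in bin_counts:
        probs = np.linspace(0.0, 1.0, k + 1)[1:-1]
        edges = np.quantile(series, probs)
        cols.append(np.searchsorted(edges, series, side="right"))
    return np.stack(cols, axis=-1)  # shape (T, len(bin_counts))

rng = np.random.default_rng(0)
series = rng.lognormal(size=1000)
feats = hybrid_binning(series)  # coarse, medium, and fine bin indices
```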
Models
Overall, WaveNet profits the most from the proposed binning strategies, while the FeedForw model does not show any meaningful gains from binning. As already hinted at, while DeepAR can make effective use of input binnings, it performs significantly worse when combined with a binned output representation. The reason for this is not yet clear and would benefit from further investigation.
6. Related Work
| Dataset | WaveNet Mean wQL | WaveNet ND | DeepAR Mean wQL | DeepAR ND | FeedForw Mean wQL | FeedForw ND |
|---|---|---|---|---|---|---|
| m4_h | 0.1517 (±0.0904) | 0.2008 (±0.1334) | 0.0533 (±0.0012) | 0.0645 (±0.0009) | 0.0463 (±0.0010) | 0.0580 (±0.0012) |
| m4_d | 0.0334 (±0.0088) | 0.0401 (±0.0102) | 0.0318 (±0.0029) | 0.0384 (±0.0036) | 0.0247 (±0.0005) | 0.0296 (±0.0008) |
| m4_w | 0.0574 (±0.0036) | 0.0716 (±0.0042) | 0.0460 (±0.0011) | 0.0565 (±0.0012) | 0.0521 (±0.0006) | 0.0614 (±0.0006) |
| m4_m | 0.1481 (±0.0170) | 0.1674 (±0.0152) | 0.1362 (±0.0089) | 0.1480 (±0.0083) | 0.1159 (±0.0011) | 0.1260 (±0.0023) |
| m4_q | 0.0983 (±0.0019) | 0.1196 (±0.0017) | 0.1030 (±0.0031) | 0.1176 (±0.0027) | 0.0869 (±0.0010) | 0.1030 (±0.0010) |
| m4_y | 0.1236 (±0.0055) | 0.1458 (±0.0057) | 0.1570 (±0.0088) | 0.1757 (±0.0085) | 0.1262 (±0.0014) | 0.1497 (±0.0014) |
| elec | 0.0724 (±0.0151) | 0.0923 (±0.0194) | 0.0571 (±0.0012) | 0.0695 (±0.0018) | 0.0649 (±0.0011) | 0.0793 (±0.0015) |
| traff | 0.1450 (±0.0065) | 0.1720 (±0.0073) | 0.1222 (±0.0077) | 0.1456 (±0.0082) | 0.2144 (±0.0008) | 0.2558 (±0.0009) |
| wiki | 0.2295 (±0.0063) | 0.2601 (±0.0072) | 0.2378 (±0.0070) | 0.2694 (±0.0091) | 0.2594 (±0.0026) | 0.3030 (±0.0036) |
The empirical study presented here is part of a growing body of literature on neural forecasting approaches (Smyl, 2020; Wang et al., 2019; Laptev et al., 2017; Fan et al., 2019; Li et al., 2019; Ding et al., 2019; Deshpande and Sarawagi, 2019). While most prior art considers the probabilistic forecasting setting, some recent work has resorted to providing only point forecasts (Lai et al., 2017; Oreshkin et al., 2019). For forecasting problems with many related time series, as is the focus of the present work, it can safely be assumed that neural networks are the state of the art. For example, the models described in (Smyl, 2020) won the recent M4 forecasting competition (Makridakis et al., 2018a) by a large margin. Most recent work on forecasting using neural networks focuses primarily on novel or extended network architectures.
Input transformations have a long and rich history in time series, potentially starting with Box and Cox (1964), who proposed a power transformation of the data to make it “more normal”. However, the use of some of the input transformations in the focus of this paper, such as input scaling or variants of binning, is partially folkloric (i.e. commonly used in practice by machine learning practitioners but seldom thoroughly described and investigated). In contrast, the more general area of probability integral transformation and copula approaches (e.g., (Elidan, 2013; Patton, 2012)) enjoys continued attention. For example, (Salinas et al., 2019b) propose a semi-parametric neural forecasting model that uses the marginal empirical CDFs combined with Gaussian copulas to model non-Gaussian multivariate data. In order to be tractable, it assumes a particular low-rank structure of the covariance matrix.
Another set of approaches for modeling the output distributions, related to using a categorical distribution on binned time series values, are techniques based on quantile regression (Koenker and Bassett Jr, 1978; Koenker, 2005; Wen et al., 2017). In these approaches, instead of modeling the entire output distribution, only a fixed set of quantile levels is predicted. The spline quantile function approach of Gasthaus et al. (2019) that we compare in our study is an extension of these techniques, where the quantile levels to be predicted are learned by the model and interpolated using a linear spline.
The idea of global-local models, i.e. models that explicitly capture the patterns shared between time series globally (i.e. across time series), while allowing the idiosyncratic behavior of each time series to be modeled locally (i.e. per time series), has also been explored (Wang et al., 2019; Sen et al., 2019; Deshpande and Sarawagi, 2019). The data transformations explored here, which locally apply a transformation before modeling the result globally, can be seen as an instance of the same paradigm. Further, the core idea behind the hybrid binning strategy (Section 3) is to mix global (to the panel of time series) and local (specific to a member of the panel) effects.
7. Conclusions and Future Work
We have conducted a large-scale study comparing the performance of different input and output transformations when combined with several different types of models. Our investigation sheds light on the question of to what extent such transformations affect the predictive performance of different model architectures, with the overarching conclusion that carefully choosing and tuning the input and output transformations is important, as it has a large impact on the models' predictive performance, potentially larger than the performance difference between model architectures.
The work presented here can be extended in multiple directions: First and foremost, there are interesting additional kinds of input and in particular output transformations that we want to explore, e.g. hybrid binnings using multiple scales, and using hybrid binnings also at the output (e.g. using a multiresolution approach similar to the “dual softmax” used in (Kalchbrenner et al., 2018)). On the methodological side, extensive and principled hyperparameter tuning would allow us to make stronger conclusions about the effectiveness of particular model classes when combined with different input/output representations.
Finally, categorical sequence data is common in other domains, e.g. text in NLP or quantized audio data in speech recognition and generation, which are large subfields of AI where novel deep learning techniques are constantly being developed and improved. Modifying models from those domains to better fit the forecasting problem is a productive line of recent research, e.g., (Fan et al., 2019; Li et al., 2019). Whether models from these domains can perform well in the forecasting setting without substantial modification, simply by discretizing the inputs using the techniques discussed here, is an interesting open question that, if answered affirmatively, would allow further improvements in these domains to immediately carry over to the time series domain. At the same time, it is still surprising to us how well the categorical distribution performs even though it ignores the ordering of the values, and this deserves further study. The discretized logistic mixture likelihood (Salimans et al., 2017) has been proposed as an alternative to the categorical distribution that retains the ordering. Exploring such methods, which could retain the apparent benefits of discretized inputs while making use of order and distance information, is an interesting avenue for further research in time series forecasting.
References
 Alexandrov et al. (2019) Alexander Alexandrov, Konstantinos Benidis, Michael Bohlke-Schneider, Valentin Flunkert, Jan Gasthaus, Tim Januschowski, Danielle C Maddix, Syama Rangapuram, David Salinas, Jasper Schulz, et al. 2019. GluonTS: Probabilistic Time Series Models in Python. arXiv preprint arXiv:1906.05264 (2019).
 Athanasopoulos et al. (2011) George Athanasopoulos, Rob Hyndman, Haiyan Song, and Doris C. Wu. 2011. The tourism forecasting competition. International Journal of Forecasting 27, 3 (2011), 822–844. https://EconPapers.repec.org/RePEc:eee:intfor:v:27:y::i:3:p:822844
 Bai et al. (2018) Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. 2018. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. CoRR abs/1803.01271 (2018). arXiv:1803.01271 http://arxiv.org/abs/1803.01271
 Borovykh et al. (2017) Anastasia Borovykh, Sander Bohte, and Cornelis W. Oosterlee. 2017. Conditional Time Series Forecasting with Convolutional Neural Networks. arXiv e-prints, Article arXiv:1703.04691 (Mar 2017). arXiv:stat.ML/1703.04691
 Böse et al. (2017) Joos-Hendrik Böse, Valentin Flunkert, Jan Gasthaus, Tim Januschowski, Dustin Lange, David Salinas, Sebastian Schelter, Matthias Seeger, and Yuyang Wang. 2017. Probabilistic demand forecasting at scale. Proceedings of the VLDB Endowment 10, 12 (2017), 1694–1705.
 Box and Cox (1964) G. E. P. Box and D. R. Cox. 1964. An Analysis of Transformations. Journal of the Royal Statistical Society. Series B (Methodological) 26, 2 (1964), 211–252.
 Chen et al. (2015) Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. NeurIPS Workshop on Machine Learning Systems (2015).
 Deshpande and Sarawagi (2019) Prathamesh Deshpande and Sunita Sarawagi. 2019. Streaming Adaptation of Deep Forecasting Models Using Adaptive Recurrent Units. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD ’19). Association for Computing Machinery, New York, NY, USA, 1560–1568. https://doi.org/10.1145/3292500.3330996
 Ding et al. (2019) Daizong Ding, Mi Zhang, Xudong Pan, Min Yang, and Xiangnan He. 2019. Modeling Extreme Events in Time Series Prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD ’19). Association for Computing Machinery, New York, NY, USA, 1114–1122. https://doi.org/10.1145/3292500.3330896
 Dua and Graff (2017) Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
 Elidan (2013) Gal Elidan. 2013. Copulas in Machine Learning. In Copulae in Mathematical and Quantitative Finance. Springer Berlin Heidelberg, Berlin, Heidelberg, 39–60.
 Faloutsos et al. (2019) Christos Faloutsos, Valentin Flunkert, Jan Gasthaus, Tim Januschowski, and Yuyang Wang. 2019. Forecasting Big Time Series: Theory and Practice. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 48, 2019.
 Fan et al. (2019) Chenyou Fan, Yuze Zhang, Yi Pan, Xiaoyue Li, Chi Zhang, Rong Yuan, Di Wu, Wensheng Wang, Jian Pei, and Heng Huang. 2019. Multi-Horizon Time Series Forecasting with Temporal Attention Learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD ’19). Association for Computing Machinery, New York, NY, USA, 2527–2535.

 Gasthaus et al. (2019) Jan Gasthaus, Konstantinos Benidis, Yuyang Wang, Syama Sundar Rangapuram, David Salinas, Valentin Flunkert, and Tim Januschowski. 2019. Probabilistic Forecasting with Spline Quantile Function RNNs. In The 22nd International Conference on Artificial Intelligence and Statistics.
 Gneiting et al. (2007) Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E Raftery. 2007. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69, 2 (2007), 243–268.
 Gneiting and Raftery (2007) Tilmann Gneiting and Adrian E Raftery. 2007. Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc. 102, 477 (2007), 359–378.
 Hewamalage et al. (2019) Hansika Hewamalage, Christoph Bergmeir, and Kasun Bandara. 2019. Recurrent neural networks for time series forecasting: Current status and future directions. arXiv preprint arXiv:1909.00590 (2019).
 Hyndman and Athanasopoulos (2018) Rob J Hyndman and George Athanasopoulos. 2018. Forecasting: principles and practice. OTexts.
 Hyndman and Koehler (2006) Rob J. Hyndman and Anne B. Koehler. 2006. Another look at measures of forecast accuracy. International Journal of Forecasting 22, 4 (2006), 679–688.
 Januschowski and Kolassa (2019) Tim Januschowski and Stephan Kolassa. 2019. A Classification of Business Forecasting Problems. Foresight: The International Journal of Applied Forecasting 52 (2019), 36–43.
 Kalchbrenner et al. (2018) Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient Neural Audio Synthesis. arXiv e-prints (2018). arXiv:1802.08435
 Koenker (2005) Roger Koenker. 2005. Quantile Regression. Cambridge University Press. https://doi.org/10.1017/CBO9780511754098
 Koenker and Bassett Jr (1978) Roger Koenker and Gilbert Bassett Jr. 1978. Regression quantiles. Econometrica: Journal of the Econometric Society (1978), 33–50.
 Lai et al. (2017) Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. 2017. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. CoRR abs/1703.07015 (2017). arXiv:1703.07015 http://arxiv.org/abs/1703.07015
 Laptev et al. (2017) Nikolay Laptev, Jason Yosinsk, Li Li Erran, and Slawek Smyl. 2017. Timeseries Extreme Event Forecasting with Neural Networks at Uber. In ICML Time Series Workshop.
 Li et al. (2019) Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. 2019. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 5244–5254.
 Lim et al. (2020) Bryan Lim, Sercan Arik, Nicolas Loeff, and Tomas Pfister. 2020. Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting. In arXiv.
 Makridakis et al. (2018a) Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. 2018a. The M4 Competition: Results, findings, conclusion and way forward. International Journal of Forecasting 34, 4 (2018), 802–808.
 Makridakis et al. (2018b) Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. 2018b. Statistical and Machine Learning forecasting methods: Concerns and ways forward. PLOS ONE 13 (03 2018), 1–26.
 Matheson and Winkler (1976) James E Matheson and Robert L Winkler. 1976. Scoring rules for continuous probability distributions. Management science 22, 10 (1976), 1087–1096.
 Mukherjee et al. (2018) Srayanta Mukherjee, Devashish Shankar, Atin Ghosh, Nilam Tathawadekar, Pramod Kompalli, Sunita Sarawagi, and Krishnendu Chaudhury. 2018. ARMDN: Associative and Recurrent Mixture Density Networks for e-Retail Demand Forecasting. CoRR abs/1803.03800 (2018). arXiv:1803.03800 http://arxiv.org/abs/1803.03800
 Oord et al. (2016) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499 (2016).
 Oreshkin et al. (2019) Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2019. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv e-prints (2019). arXiv:stat.ML/1905.10437

 Patton (2012) Andrew J Patton. 2012. A review of copula models for economic time series. Journal of Multivariate Analysis 110 (2012), 4–18.
 Salimans et al. (2017) Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. 2017. PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. arXiv e-prints (2017). arXiv:1701.05517
 Salinas et al. (2019a) David Salinas, Michael Bohlke-Schneider, Laurent Callot, Roberto Medico, and Jan Gasthaus. 2019a. High-dimensional multivariate forecasting with low-rank Gaussian Copula Processes. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 6824–6834.
 Salinas et al. (2019b) David Salinas, Michael Bohlke-Schneider, Laurent Callot, Roberto Medico, and Jan Gasthaus. 2019b. High-dimensional multivariate forecasting with low-rank Gaussian Copula Processes. In Advances in Neural Information Processing Systems 32. 6824–6834.
 Salinas et al. (2019c) David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. 2019c. DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. International Journal of Forecasting (2019).
 Sen et al. (2019) Rajat Sen, Hsiang-Fu Yu, and Inderjit S Dhillon. 2019. Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting. In Advances in Neural Information Processing Systems. 4838–4847.
 Smyl (2020) Slawek Smyl. 2020. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting 36, 1 (2020), 75–85.
 Team (2017) TensorFlow Team. 2017. Introducing TensorFlow Feature Columns. https://developers.googleblog.com/2017/11/introducing-tensorflow-feature-columns.html
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems.
 Wang et al. (2019) Yuyang Wang, Alex Smola, Danielle Maddix, Jan Gasthaus, Dean Foster, and Tim Januschowski. 2019. Deep factors for forecasting. In International Conference on Machine Learning. 6607–6617.
 Wen and Torkkola (2019) Ruofeng Wen and Kari Torkkola. 2019. Deep Generative Quantile-Copula Models for Probabilistic Forecasting. arXiv e-prints (Jul 2019). arXiv:stat.ML/1907.10697
 Wen et al. (2017) Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. 2017. A Multi-Horizon Quantile Recurrent Forecaster. arXiv e-prints (Nov 2017). arXiv:stat.ML/1711.11053
 Xie and Hong (2016) Jingrui Xie and Tao Hong. 2016. GEFCom2014 probabilistic electric load forecasting: An integrated solution with forecast combination and residual simulation. International Journal of Forecasting 32, 3 (2016), 1012–1016.