1 Introduction
Some prevalent forecasting methods in statistics and econometrics were developed for forecasting individual time series or small groups of them. These methods consist of complex models designed and tuned by domain experts (harvey1990forecasting). Recently, there has been a paradigm shift from model-based to fully automated, data-driven approaches. This shift can be attributed to the availability of large and diverse time series datasets in a wide variety of fields (seeger2016bayesian). A substantial amount of data on the past behavior of related time series can be leveraged to forecast an individual time series. Using data from related time series allows more complex and potentially more accurate models to be fit without overfitting.
Classical time series methods, such as Autoregressive Integrated Moving Average (ARIMA) (brockwell2013time), exponential smoothing (hyndman2008forecasting) and general Bayesian time series models (barber2011bayesian), excel at modeling the complex dynamics of individual time series with a sufficiently long history. These methods are computationally efficient, e.g. via a Kalman filter, and provide uncertainty estimates, which are critical for optimal downstream decision making. However, these methods are local, that is, they learn one model per time series. As a consequence, they cannot effectively extract information across multiple time series. They also struggle with cold-start problems, where time series are added or removed over time.
Deep neural networks (DNNs), in particular recurrent neural networks (RNNs) such as LSTMs (hochreiter1997long), have been successful in time series forecasting (flunkert2017deepar; wen2017multi). DNNs are generally effective at extracting patterns across multiple time series. Unless they are combined with probabilistic methods, such as variational dropout (gal2016) and deep Kalman filters (krishnan2015deep), DNNs can be prone to overfitting and have difficulty modeling uncertainty (garnelo2018conditional).
The combination of probabilistic graphical models with deep neural networks has recently been an active research area (krishnan2015deep; krishnan2017structured; fraccaro2016sequential; fraccaro2017disentangled). In the time series forecasting domain, a recent example is (rangapuram2018), where the authors combine RNNs and State-Space Models (SSMs) for scalable time series forecasting.
Our work in this paper follows a similar theme: we propose a novel and scalable global-local method, Deep Factors with Gaussian Processes. It is based on a global DNN backbone and a local Gaussian Process (GP) model for computational efficiency. The global-local structure extracts complex nonlinear patterns globally while capturing individual random effects for each time series locally. The main idea of our approach is to represent each time series as a combination of a global time series and a corresponding local model. The global part is given by a linear combination of a set of deep dynamic factors, whose loadings are determined over time by attention. The local model is a stochastic Gaussian Process, which allows the uncertainty to propagate forward in time.
2 Deep Factor Model with Gaussian Processes
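We first define the forecasting problem that we aim to solve. Let $\mathcal{X}$ denote the input feature space and $\mathcal{Z}$ the space of the observations. We are given a set of $N$ time series, with the $i$-th time series consisting of pairs $(x_{i,t}, z_{i,t}) \in \mathcal{X} \times \mathcal{Z}$, $t = 1, \dots, T$, where $x_{i,t}$ are the input covariates and $z_{i,t}$ is the corresponding observation at time $t$. Given a forecast horizon $\tau$, our goal is to calculate the joint predictive distribution of future observations,
$$p\left(\{z_{i,T+1:T+\tau}\}_{i=1}^{N} \mid \{x_{i,1:T+\tau}, z_{i,1:T}\}_{i=1}^{N}\right),$$
where $\{x_{i,1:T+\tau}, z_{i,1:T}\}$ denotes the $i$-th time series with its corresponding features. For concreteness, we restrict ourselves to univariate time series ($\mathcal{Z} \subseteq \mathbb{R}$).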
2.1 Generative Model
We assume that each time series is governed by two components: fixed effects and random effects.
Fixed effects are common patterns given by linear combinations of $K$ latent global deep factors, $g_{t,k}$, $k = 1, \dots, K$. These deep factors can be thought of as dynamic principal components or eigen time series that drive the underlying dynamics of all the time series.
Random effects, $r_{i,t}$, are the local fluctuations of time series $i$, which are chosen to be a Gaussian Process (rasmussen2006gaussian), i.e., $r_i \sim \mathcal{GP}(0, \mathcal{K}_i)$, where the covariance $\mathcal{K}_i$ is a kernel matrix and $r_{i,t} = r_i(x_{i,t})$.
The observed value $z_{i,t}$ of time series $i$ at time $t$, or more generally its latent function $u_{i,t}$ such that $z_{i,t} \sim p(z_{i,t} \mid u_{i,t})$, can be expressed as the sum of a weighted average of the global patterns and the local fluctuations. This generative model is summarized in Eqn. (1) and illustrated in Figure 1. For simplicity, we consider the features $x_{i,t}$ to be the embedding of time series $i$.
$$u_{i,t} = \sum_{k=1}^{K} a_{i,t,k}\, g_{t,k} + r_{i,t}, \qquad r_i \sim \mathcal{GP}(0, \mathcal{K}_i), \qquad z_{i,t} \sim p(z_{i,t} \mid u_{i,t}) \tag{1}$$
We use a global dynamic-factor RNN, or a set of univariate-valued RNNs, to generate the factors $g_{t,k}$. The RNNs are learned globally to capture the common patterns from all time series. For each time series $i$ at time $t$, we use attention networks to assign stationary attentions $a_{i,t,k}$ to the dynamic factors $g_{t,k}$. This determines which group of global factors to focus on and which segment of the history is relevant. At a high level, the weighting gives temporal attention to different global factors.
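To make the generative description above concrete, the following is a minimal NumPy sketch that draws a few series as an attention-weighted sum of global factors plus a local GP term. It is an illustration under stated assumptions, not the implementation used in the paper: sinusoids stand in for the RNN-generated factors, random scores passed through a softmax stand in for the attention network, and all function names, kernel settings and the Gaussian observation noise are hypothetical choices.

```python
import numpy as np


def rbf_kernel(t, lengthscale=5.0, variance=0.1):
    """RBF kernel matrix over a 1-D array of time points."""
    diff = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)


def softmax(scores):
    """Softmax over the last axis, shifted for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def sample_series(num_series=3, length=48, num_factors=4, noise_std=0.05, seed=0):
    """Draw series as (attention-weighted global factors) + (local GP) + noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(length, dtype=float)

    # Assumption: sinusoids stand in for the global factors an RNN would emit.
    periods = 6.0 * 2.0 ** np.arange(num_factors)
    factors = np.sin(2.0 * np.pi * t[:, None] / periods[None, :])  # (T, K)

    # Assumption: random scores stand in for the attention network output.
    scores = rng.normal(size=(num_series, length, num_factors))
    attention = softmax(scores)  # (N, T, K), each row sums to 1

    # Fixed effects: weighted average of the global factors at each time step.
    fixed = np.einsum("ntk,tk->nt", attention, factors)  # (N, T)

    # Random effects: one zero-mean GP draw per series (jitter for stability).
    cov = rbf_kernel(t) + 1e-6 * np.eye(length)
    chol = np.linalg.cholesky(cov)
    random_effects = (chol @ rng.normal(size=(length, num_series))).T  # (N, T)

    # Gaussian observation model around the latent function.
    latent = fixed + random_effects
    observed = latent + noise_std * rng.normal(size=latent.shape)
    return attention, fixed, random_effects, observed
```

Calling `sample_series()` returns the attention weights, the fixed and random effects, and the noisy observations, which makes it easy to inspect how much of each series is explained by the shared factors versus the local term.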
2.2 Inference and Learning
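Given a set of $N$ time series generated by Eqn. (1), our goal is to estimate $\Theta$, the parameters of the global RNNs and the attention network together with the hyperparameters of the kernel function. To do so, we use maximum likelihood estimation, where
$$\hat{\Theta} = \operatorname*{argmax}_{\Theta} \sum_{i=1}^{N} \log p(z_{i,1:T} \mid \Theta).$$
Computing the marginal likelihood may require inference over the latent variables. In our case, the likelihood of an observation given its latent function value is Gaussian, and the marginal likelihood can be computed easily as
$$z_{i,1:T} \sim \mathcal{N}\left(f_{i,1:T},\; \mathcal{K}_i + \sigma_i^2 I\right),$$
where $f_{i,1:T}$ collects the fixed effects of series $i$ (the attention-weighted sum of the global factors), $\mathcal{K}_i$ is its kernel matrix and $\sigma_i^2$ is the observation noise variance.
For non-Gaussian likelihoods, classical techniques such as the Box-Cox transform (box2015time) or variational inference in the framework of the Variational Auto-Encoder (VAE) (kingma2013auto; rezende2014stochastic) can be used. This is a direction for future work.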
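For the Gaussian case described above, the per-series log marginal likelihood is that of a multivariate normal whose mean is the fixed effect and whose covariance is the kernel matrix plus observation noise. The sketch below evaluates it with a Cholesky factorization in NumPy. It is an illustration only: the RBF kernel parameterization, the function names and the scalar noise term `noise_std` are assumptions, and the paper's MXNet implementation is not reproduced here.

```python
import numpy as np


def rbf_kernel(t, lengthscale, variance):
    """RBF kernel matrix over a 1-D array of time points."""
    diff = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)


def gaussian_log_marginal(z, fixed, t, lengthscale, variance, noise_std):
    """log N(z | fixed, K + noise_std^2 I) for one series, via Cholesky."""
    length = z.shape[0]
    cov = rbf_kernel(t, lengthscale, variance) + noise_std ** 2 * np.eye(length)
    chol = np.linalg.cholesky(cov)
    # Solve chol @ whitened = (z - fixed); then whitened @ whitened is the
    # quadratic form (z - fixed)^T cov^{-1} (z - fixed).
    whitened = np.linalg.solve(chol, z - fixed)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (whitened @ whitened + log_det + length * np.log(2.0 * np.pi))
```

Summing this quantity over all series gives the objective that maximum likelihood estimation would maximize with respect to the network parameters (through `fixed`) and the kernel hyperparameters.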
3 Experiments
The model is implemented in MXNet Gluon (chen2015mxnet) with an RBF kernel (gardner2018) using the mxnet.linalg library (seegar2017; dai2018). We use a p3.4xlarge SageMaker instance in all our experiments. The global factor network is chosen to be an LSTM with 1 hidden layer and 50 hidden units. We fix the number of factors to be 10.
To assess the quality of the proposed model, we limit the training data, sometimes artificially by pruning it, to only one week of each time series. This results in 168 observations per time series. Figures 1(a) and 1(b) show the forecasts qualitatively on the publicly available electricity and traffic datasets from the UCI repository (Dua:2017; yu2016temporal).
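We use the quantile loss to evaluate the probabilistic forecasts. For a given quantile $\rho \in (0, 1)$, a target value $z_t$ and a quantile prediction $\hat{z}_t(\rho)$, the quantile loss is defined as
$$\mathrm{QL}_\rho\left(z_t, \hat{z}_t(\rho)\right) = 2\left[\rho\,\left(z_t - \hat{z}_t(\rho)\right)\mathbb{I}_{z_t - \hat{z}_t(\rho) > 0} + (1 - \rho)\left(\hat{z}_t(\rho) - z_t\right)\mathbb{I}_{z_t - \hat{z}_t(\rho) \le 0}\right].$$
We use a normalized sum of quantile losses, $\sum_{i,t} \mathrm{QL}_\rho\left(z_{i,t}, \hat{z}_{i,t}(\rho)\right) / \sum_{i,t} |z_{i,t}|$, to compute the quantile loss for a given span across all time series. We include results for $\rho = 0.5$ and $\rho = 0.9$, which we abbreviate as P50QL (mean absolute percentage error (MAPE)) and P90QL, respectively. We also report the root mean square error (RMSE), which is the square root of the aggregated squared error normalized by the product of the number of time series and the length of the time series in the evaluation segment.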
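The evaluation metrics described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming targets and quantile predictions are arrays of shape (number of series, evaluation length) and assuming the common convention of a factor of 2 in the quantile loss, under which the normalized loss at the 0.5 quantile equals the summed absolute error divided by the summed absolute target. The function names are hypothetical.

```python
import numpy as np


def quantile_loss(z, z_hat, rho):
    """Summed quantile loss at level rho (with the conventional factor 2)."""
    diff = z - z_hat
    # Under-prediction (diff > 0) is weighted by rho,
    # over-prediction (diff <= 0) by 1 - rho.
    return 2.0 * np.sum(np.where(diff > 0, rho * diff, (rho - 1.0) * diff))


def normalized_quantile_loss(z, z_hat, rho):
    """Quantile loss summed over all series and times, divided by sum |z|."""
    return quantile_loss(z, z_hat, rho) / np.sum(np.abs(z))


def rmse(z, z_hat):
    """Root of the squared error averaged over all series and time steps."""
    return np.sqrt(np.mean((z - z_hat) ** 2))
```

For example, with targets `[[1, 2], [3, 4]]` and predictions `[[2, 2], [1, 4]]`, the normalized loss is 0.3 at the 0.5 quantile and 0.38 at the 0.9 quantile, because the under-prediction of the third value is penalized more heavily at the higher quantile.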
Table 1 compares our model (DFGP) with DeepAR (DA), a state-of-the-art RNN-based forecasting algorithm publicly available on AWS SageMaker (flunkert2017deepar; janu2018), and Prophet (P), a Bayesian structural time series model (taylor2017forecasting). To ensure a fair comparison, we set DeepAR to have the same 1-layer, 50-hidden-unit network configuration, with the number of epochs set to 2000. The results show that our model outperforms the others on every metric reported, in particular with respect to the P90 quantile loss. This suggests that our model is better at capturing uncertainty.
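Table 1: P50QL, P90QL and RMSE for DeepAR (DA), Prophet (P) and our model (DFGP) on the electricity (elec) and traffic (traf) datasets, by forecast horizon (hrzn).
| ds | hrzn | P50QL DA | P50QL P | P50QL DFGP | P90QL DA | P90QL P | P90QL DFGP | RMSE DA | RMSE P | RMSE DFGP |
|---|---|---|---|---|---|---|---|---|---|---|
| elec | 3d | 0.216 | 0.149 | 0.109 | 0.182 | 0.103 | 0.061 | 1194.421 | 902.724 | 745.175 |
| elec | 24hr | 0.132 | 0.124 | 0.103 | 0.100 | 0.091 | 0.074 | 2100.927 | 783.598 | 454.307 |
| traf | 3d | 0.348 | 0.457 | 0.137 | 0.162 | 0.207 | 0.093 | 0.028 | 0.032 | 0.021 |
| traf | 24hr | 0.268 | 0.380 | 0.131 | 0.149 | 0.191 | 0.090 | 0.024 | 0.028 | 0.019 |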
4 Conclusion
We propose a novel global-local model, Deep Factors with Gaussian Processes, for forecasting a collection of related time series. Our method differs from other global-local models by combining classical Bayesian probabilistic models with deep learning techniques that scale. Our experiments show promising results that demonstrate the effectiveness and potential of our method in learning across multiple time series and propagating uncertainty.
References
 [1] Andrew C Harvey. Forecasting, structural time series models and the Kalman filter. Cambridge university press, 1990.
 [2] Matthias W Seeger, David Salinas, and Valentin Flunkert. Bayesian intermittent demand forecasting for large inventories. In Advances in Neural Information Processing Systems, pages 4646–4654, 2016.
 [3] Peter J Brockwell and Richard A Davis. Time series: theory and methods. Springer Science & Business Media, 2013.
 [4] Rob Hyndman, Anne B Koehler, J Keith Ord, and Ralph D Snyder. Forecasting with exponential smoothing: the state space approach. Springer Science & Business Media, 2008.
 [5] David Barber, A Taylan Cemgil, and Silvia Chiappa. Bayesian time series models. Cambridge University Press, 2011.
 [6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 [7] Valentin Flunkert, David Salinas, and Jan Gasthaus. Deepar: Probabilistic forecasting with autoregressive recurrent networks. arXiv preprint arXiv:1704.04110, 2017.
 [8] Ruofeng Wen, Kari Torkkola, and Balakrishnan Narayanaswamy. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.
 [9] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287, 2016.
 [10] Rahul G Krishnan, Uri Shalit, and David Sontag. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015.

 [11] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and SM Ali Eslami. Conditional neural processes. In International Conference on Machine Learning, pages 1690–1699, 2018.
 [12] Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In AAAI, pages 2101–2109, 2017.
 [13] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in neural information processing systems, pages 2199–2207, 2016.

 [14] Marco Fraccaro, Simon Kamronn, Ulrich Paquet, and Ole Winther. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, pages 3604–3613, 2017.
 [15] Syama Sundar Rangapuram, Matthias Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems, 2018.
 [16] Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. MIT press, 2006.
 [17] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
 [18] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ICLR, 2014.

 [19] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
 [20] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
 [21] J R Gardner, G Pleiss, D Bindel, K Q Weinberger, and A G Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. 32nd Conference on Neural Information Processing Systems (NIPS 2018), arXiv:1809.11165v2, 2018.
 [22] Matthias Seeger, Asmus Hetzel, Zhenwen Dai, Eric Meissner, and Neil D. Lawrence. Auto-differentiating linear algebra. arXiv preprint arXiv:1710.08717, 2017.
 [23] Zhenwen Dai, Eric Meissner, and Neil D. Lawrence. Mxfusion: A modular deep probabilistic programming library. NIPS 2018 Workshop MLOSS, 2018.
 [24] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2017. University of California, Irvine, School of Information and Computer Sciences.
 [25] Hsiang-Fu Yu, Nikhil Rao, and Inderjit S Dhillon. Temporal regularized matrix factorization for high-dimensional time series prediction. In Advances in neural information processing systems, pages 847–855, 2016.
 [26] Tim Januschowski, David Arpin, David Salinas, Valentin Flunkert, Jan Gasthaus, Lorenzo Stella, and Paul Vazquez. Now available in amazon sagemaker: Deepar algorithm for more accurate time series forecasting. https://aws.amazon.com/blogs/machinelearning/nowavailableinamazonsagemakerdeeparalgorithmformoreaccuratetimeseriesforecasting/, 2018.
 [27] Sean J Taylor and Benjamin Letham. Forecasting at scale. The American Statistician, 2017.