Deep Factors with Gaussian Processes for Forecasting

11/30/2018 · by Danielle C. Maddix, et al. (Amazon)

A large collection of time series poses significant challenges for classical and neural forecasting approaches. Classical time series models often fail to fit the data well or to scale to large problems, but they provide uncertainty estimates; the converse is true for deep neural networks. In this paper, we propose a hybrid model that combines the benefits of both approaches. Our method is data-driven and scalable via a latent, global, deep component, and it handles uncertainty through a local, classical Gaussian Process model. Our experiments demonstrate that our method obtains higher accuracy than state-of-the-art methods.

1 Introduction

Prevalent forecasting methods in statistics and econometrics have been developed for individual or small groups of time series. These methods consist of complex models designed and tuned by domain experts (harvey1990forecasting). Recently, there has been a paradigm shift from these model-based approaches to fully-automated, data-driven approaches. This shift can be attributed to the availability of large and diverse time series datasets in a wide variety of fields (seeger2016bayesian). A substantial amount of data on the past behavior of related time series can be leveraged when forecasting an individual time series. Using data from related time series allows fitting more complex and potentially more accurate models without overfitting.

Classical time series methods, such as Autoregressive Integrated Moving Average (ARIMA) (brockwell2013time), exponential smoothing (hyndman2008forecasting) and general Bayesian time series models (barber2011bayesian), excel at modeling the complex dynamics of individual time series with sufficiently long histories. These methods are computationally efficient, e.g. via a Kalman filter, and provide uncertainty estimates, which are critical for optimal downstream decision making. These methods are local, that is, they learn one model per time series; as a consequence, they cannot effectively extract information across multiple time series. They also struggle with cold-start problems, where new time series are added or existing ones are removed over time.

Deep neural networks (DNNs), in particular recurrent neural networks (RNNs) such as LSTMs (hochreiter1997long), have been successful in time series forecasting (flunkert2017deepar; wen2017multi). DNNs are generally effective at extracting patterns across multiple time series. Unless combined with probabilistic methods, such as variational dropout (gal2016) and deep Kalman filters (krishnan2015deep), DNNs can be prone to overfitting and have difficulty modeling uncertainty (garnelo2018conditional).

The combination of probabilistic graphical models with deep neural networks has recently been an active research area (krishnan2015deep; krishnan2017structured; fraccaro2016sequential; fraccaro2017disentangled). In the time series forecasting domain, a recent example is (rangapuram2018), where the authors combine RNNs and State-Space Models (SSMs) for scalable time series forecasting. Our work follows a similar theme: we propose a novel and scalable global-local method, Deep Factors with Gaussian Processes. It is based on a global DNN backbone and a local Gaussian Process (GP) model for computational efficiency. The global-local structure extracts complex non-linear patterns globally while capturing individual random effects for each time series locally. The main idea of our approach is to represent each time series as a combination of a global time series and a corresponding local model. The global part is given by a linear combination of a set of deep dynamic factors, where the loadings are determined over time by an attention mechanism. The local model is a Gaussian Process (GP), which allows uncertainty to propagate forward in time.

2 Deep Factor Model with Gaussian Processes

We first define the forecasting problem we aim to solve. Let $\mathcal{X}$ denote the input feature space and $\mathcal{Z}$ the space of observations. We are given a set of $N$ time series, with the $i$-th time series consisting of pairs $\{(x_{i,t}, z_{i,t})\}_{t=1}^{T}$, where $x_{i,t} \in \mathcal{X}$ are the input covariates and $z_{i,t} \in \mathcal{Z}$ is the corresponding observation at time $t$. Given a forecast horizon $\tau$, our goal is to calculate the joint predictive distribution of future observations,

$$ p\bigl(\{z_{i,T+1:T+\tau}\}_{i=1}^{N} \,\big|\, \{z_{i,1:T}\}_{i=1}^{N}, \{x_{i,1:T+\tau}\}_{i=1}^{N}\bigr), $$

where $z_{i,t_1:t_2}$ denotes the time series $(z_{i,t_1}, \dots, z_{i,t_2})$ with corresponding features $x_{i,t_1:t_2}$. For concreteness, we restrict ourselves to univariate time series ($\mathcal{Z} \subseteq \mathbb{R}$).
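As a concrete illustration of this setup, the sketch below arranges N univariate series and their covariates as arrays and marks the forecast horizon; the variable names and shapes are illustrative and are not the paper's notation or implementation.

```python
import numpy as np

N, T, tau, d = 200, 168, 24, 4        # number of series, history length, horizon, covariate dim

x = np.zeros((N, T + tau, d))         # covariates x_{i,t}, available over history and horizon
z = np.zeros((N, T))                  # observations z_{i,1:T}

history_x, future_x = x[:, :T], x[:, T:]   # features for conditioning vs. forecasting
# Goal: the joint predictive distribution of z_{i,T+1:T+tau}
# given z_{i,1:T} and x_{i,1:T+tau}.
```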

2.1 Generative Model

We assume that each time series is governed by the following two components: fixed and random.

Figure 1: Plate graph of the proposed Deep Factors with Gaussian Processes model. The diamond nodes represent deterministic states.

Fixed effects are common patterns that are given by linear combinations of latent global deep factors, $g_k(\cdot)$, $k = 1, \dots, K$. These deep factors can be thought of as dynamic principal components or eigen time series that drive the underlying dynamics of all the time series.

Random effects, $r_i(\cdot)$, are the local fluctuations, which are chosen to be a Gaussian Process (rasmussen2006gaussian), i.e., $r_i \sim \mathcal{GP}(0, k_i(\cdot, \cdot))$, so that over any finite set of time points the covariance is a kernel matrix $\mathcal{K}_i$.

The observed value $z_{i,t}$ at time $t$, or more generally its latent function $f_i(t)$ such that $z_{i,t} \sim p(z_{i,t} \mid f_i(t))$, can be expressed as the sum of a weighted average of the global patterns and its local fluctuations. The summary of this generative model is given in Eqn. (1), and is illustrated in Figure 1. For simplicity, we consider the features $x_{i,t}$ to be the embedding of time series $i$:

$$ f_i(t) = \sum_{k=1}^{K} w_{i,k}(t)\, g_k(t) + r_i(t), \qquad z_{i,t} \sim p\bigl(z_{i,t} \mid f_i(t)\bigr). \qquad (1) $$

We use a global dynamic factor RNN, or a set of univariate-valued RNNs, to generate the deep factors $g_k(\cdot)$. The RNNs are learned globally to capture the common patterns from all time series. For each time series $i$ at time $t$, we use attention networks to assign attention weights $w_{i,k}(t)$ to the dynamic factors $g_k(t)$. This determines which global factors to focus on and the relevant segments of their histories. At a high level, the weighting gives temporal attention to different global factors.
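To make the generative process concrete, the following sketch samples a single series from Eqn. (1) under simplifying assumptions: the deep factors are pre-computed arrays (in the model they are produced by the global RNN), the attention weights are an arbitrary fixed matrix, the local random effect uses an RBF kernel, and the observation model is Gaussian. All names, shapes, and constants are illustrative rather than the paper's implementation.

```python
import numpy as np

def rbf_kernel(t, lengthscale=24.0, variance=0.1):
    """RBF kernel matrix over time indices t (covariance of the local GP)."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_series(g, w, t, noise_std=0.05, seed=0):
    """Sample one series from f_i(t) = sum_k w_{i,k}(t) g_k(t) + r_i(t).

    g: (K, T) global deep factors (stand-ins for RNN outputs)
    w: (K, T) attention weights for series i
    """
    rng = np.random.default_rng(seed)
    fixed = (w * g).sum(axis=0)                          # weighted average of global patterns
    K_i = rbf_kernel(t)                                  # local GP covariance
    r = rng.multivariate_normal(np.zeros(len(t)), K_i)   # random effect r_i ~ GP(0, K_i)
    f = fixed + r                                        # latent function f_i(t)
    return f + noise_std * rng.normal(size=len(t))       # Gaussian observation noise

T, K = 168, 10
t = np.arange(T, dtype=float)
g = np.stack([np.sin(2 * np.pi * (k + 1) * t / T) for k in range(K)])  # stand-in factors
w = np.full((K, T), 1.0 / K)                                           # uniform attention
z = sample_series(g, w, t)
```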

2.2 Inference and Learning

Given a set of time series generated according to Eqn. (1), our goal is to estimate $\Theta$, the parameters of the global RNNs and the attention network, and the hyperparameters of the kernel function. To do so, we use maximum likelihood estimation, where

$$ \Theta^{*} = \arg\max_{\Theta} \sum_{i=1}^{N} \log p(z_{i,1:T} \mid \Theta). $$

Computing the marginal likelihood may require inference over the latent variables. In our case, the likelihood is Gaussian, and the marginal likelihood can be computed in closed form as

$$ p(z_{i,1:T} \mid \Theta) = \mathcal{N}\bigl(z_{i,1:T} \,\big|\, u_i,\; \mathcal{K}_i + \sigma_i^{2} I\bigr), \qquad u_{i,t} = \sum_{k=1}^{K} w_{i,k}(t)\, g_k(t), $$

where $\sigma_i^{2}$ is the observation noise variance.

For non-Gaussian likelihoods, classical techniques, such as the Box-Cox transform (box2015time) or variational inference in the framework of the Variational Auto-Encoder (VAE) (kingma2013auto; rezende2014stochastic), can be used. This is a direction for future work.
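As a sanity check on the Gaussian case above, the sketch below evaluates the GP marginal log-likelihood of a single series given its fixed-effect mean and kernel matrix via a Cholesky factorization. It assumes an RBF kernel plus observation noise and uses plain numpy/scipy rather than the paper's MXNet implementation.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_marginal_log_likelihood(z, mean, K, noise_var=1e-2):
    """log N(z | mean, K + noise_var * I) for one time series.

    z, mean: (T,) observations z_{i,1:T} and fixed effect u_i
    K:       (T, T) kernel matrix of the local random effect r_i
    """
    T = len(z)
    cov = K + noise_var * np.eye(T)
    L, lower = cho_factor(cov, lower=True)        # cov = L L^T
    resid = z - mean
    alpha = cho_solve((L, lower), resid)          # (K + noise_var * I)^{-1} (z - u)
    return (-0.5 * resid @ alpha
            - np.sum(np.log(np.diag(L)))          # -0.5 * log|cov|
            - 0.5 * T * np.log(2.0 * np.pi))

# Usage with the arrays from the sampling sketch above:
# ll = gp_marginal_log_likelihood(z, (w * g).sum(axis=0), rbf_kernel(t))
```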

3 Experiments

The model is implemented in MXNet Gluon (chen2015mxnet) with an RBF kernel (gardner2018), using the mxnet.linalg library (seegar2017; dai2018). We use a p3.4xlarge SageMaker instance in all our experiments. The global factor network is chosen to be an LSTM with 1 hidden layer and 50 hidden units, and we fix the number of factors at 10.
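A minimal sketch of how this configuration might look in MXNet Gluon is shown below, assuming a single LSTM that emits all 10 factors and a small dense attention network over them. The class name, the attention architecture, the softmax normalization, and the feature dimensions are our assumptions, not details taken from the paper.

```python
import mxnet as mx
from mxnet.gluon import nn, rnn

NUM_FACTORS = 10

class GlobalFactorNet(nn.Block):
    """Global LSTM emitting K deep factors plus an attention network over them."""
    def __init__(self, num_factors=NUM_FACTORS, hidden=50, **kwargs):
        super().__init__(**kwargs)
        self.lstm = rnn.LSTM(hidden, num_layers=1, layout='NTC')  # 1 hidden layer, 50 units
        self.factor_head = nn.Dense(num_factors, flatten=False)   # g_k(t), k = 1..K
        self.attention = nn.Dense(num_factors, flatten=False)     # unnormalized attention scores

    def forward(self, features):
        # features: (batch, time, feature_dim), e.g. series embeddings over time
        h = self.lstm(features)
        g = self.factor_head(h)                                   # (batch, time, K) factors
        w = mx.nd.softmax(self.attention(features), axis=-1)      # (batch, time, K) weights
        return (w * g).sum(axis=-1)                               # fixed effect u_i(t)

net = GlobalFactorNet()
net.initialize()
u = net(mx.nd.random.normal(shape=(4, 168, 8)))  # 4 series, 168 time steps, 8-dim features
```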

To assess the quality of the proposed model, we limit the training data, sometimes artificially by pruning, to only one week of observations. This results in 168 observations per time series. Figures 2(a)-2(b) show the forecasts qualitatively on the publicly available electricity and traffic datasets from the UCI repository (Dua:2017; yu2016temporal).

(a) electricity
(b) traffic
Figure 2: The dashed orange curve shows the forecast of the proposed global LSTM with GP local model. The black vertical line marks the division between the training and prediction regions.

We use the quantile loss to evaluate the probabilistic forecast. For a given quantile $\rho \in (0, 1)$, a target value $z_t$ and a $\rho$-quantile prediction $\hat{z}_t(\rho)$, the $\rho$-quantile loss is defined as

$$ \mathrm{QL}_{\rho}\bigl(z_t, \hat{z}_t(\rho)\bigr) = 2\bigl[\rho\,(z_t - \hat{z}_t(\rho))\,\mathbf{1}_{z_t > \hat{z}_t(\rho)} + (1 - \rho)\,(\hat{z}_t(\rho) - z_t)\,\mathbf{1}_{z_t \le \hat{z}_t(\rho)}\bigr]. $$

We use a normalized sum of quantile losses, $\sum_{i,t} \mathrm{QL}_{\rho}(z_{i,t}, \hat{z}_{i,t}(\rho)) \,/\, \sum_{i,t} |z_{i,t}|$, to compute the quantile losses for a given span across all time series. We include results for $\rho = 0.5, 0.9$, which we abbreviate as the P50QL (mean absolute percentage error (MAPE)) and P90QL, respectively. We also report the root mean square error (RMSE), which is the square root of the aggregated squared error normalized by the product of the number of time series and the length of the time series in the evaluation segment.
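For reference, a small numpy sketch of the quantile loss and its normalized aggregate as defined above; the array shapes and function names are ours.

```python
import numpy as np

def quantile_loss(z, z_hat, rho):
    """Pointwise rho-quantile loss QL_rho(z, z_hat), including the factor of 2."""
    over = z > z_hat
    return 2.0 * (rho * (z - z_hat) * over + (1.0 - rho) * (z_hat - z) * ~over)

def normalized_quantile_loss(z, z_hat, rho):
    """Sum of quantile losses over all series and time steps, normalized by sum |z|."""
    return quantile_loss(z, z_hat, rho).sum() / np.abs(z).sum()

# z, z_hat: (num_series, horizon) arrays of targets and rho-quantile predictions, e.g.
# p50 = normalized_quantile_loss(z, z_hat, 0.5)   # P50QL
# p90 = normalized_quantile_loss(z, z_hat, 0.9)   # P90QL
```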

Table 1 compares our model (DFGP) with DeepAR (DA), a state-of-the-art RNN-based forecasting algorithm publicly available on AWS SageMaker (flunkert2017deepar; janu2018), and Prophet (P), a Bayesian structural time series model (taylor2017forecasting). To ensure a fair comparison, we set DeepAR to the same 1-layer, 50-hidden-unit network configuration, with the number of epochs set to 2000. The results show that our model outperforms the others, in particular with respect to the P90 quantile loss, indicating that it better captures uncertainty.

dataset  horizon    P50QL                      P90QL                      RMSE
                    DA      P       DFGP       DA      P       DFGP       DA        P         DFGP
elec     3 days     0.216   0.149   0.109      0.182   0.103   0.061      1194.421  902.724   745.175
elec     24 hours   0.132   0.124   0.103      0.100   0.091   0.074      2100.927  783.598   454.307
traf     3 days     0.348   0.457   0.137      0.162   0.207   0.093      0.028     0.032     0.021
traf     24 hours   0.268   0.380   0.131      0.149   0.191   0.090      0.024     0.028     0.019
Table 1: Results for the short-term (3-day forecast) and near-term (24-hour forecast) scenarios with one week of training data on electricity and traffic.

4 Conclusion

We propose a novel global-local model, Deep Factors with Gaussian Processes, for forecasting collections of related time series. Our method differs from other global-local models by combining classical Bayesian probabilistic models with deep learning techniques that scale. We present promising experiments that demonstrate the effectiveness and potential of our method in learning across multiple time series and in propagating uncertainty.

References