1 Introduction
Accurate and robust prediction of time-series data has shown meaningful impact in various applications (Längkvist et al., 2014). For example, at Uber, predicting rider demand accurately benefits supply planning and resource allocation. Inaccurate predictions and miscalibrated confidence in those predictions can lead to suboptimal decisions, which may in turn result in either over- or under-supply. Ultimately, in an industrial setting such miscalibration can result in extra cost to the company or to its customers. However, it is challenging to predict quantities like demand accurately because potentially unknown exogenous variables cause anomalous patterns and contribute to prediction variability. Although there are many popular and successful recurrent or modified convolutional network models for capturing time dynamics (Hochreiter and Schmidhuber, 1997; Gehring et al., 2017), they are typically trained using maximum likelihood and suffer from overconfidence. Moreover, the resulting point estimates are typically insufficient to quantify prediction variability. While there are important successes using frequentist ensembles (Lakshminarayanan et al., 2017), the Bayesian framework is a natural choice for modeling the prediction uncertainty and interpreting the estimates. Recently, many attempts have been made to adapt existing Bayesian techniques to neural networks (Gal and Ghahramani, 2016a, b; Blundell et al., 2015; Hernández-Lobato and Adams, 2015; Louizos and Welling, 2017; Karaletsos et al., 2018), yielding what are referred to as Bayesian neural networks (BNNs). Variational inference (VI) is often used to approximate the posterior distribution over parameters efficiently (Blundell et al., 2015). One particular posterior approximation for BNNs is Monte Carlo Dropout (Gal and Ghahramani, 2016a; Li and Gal, 2017), which has also been applied to time-series forecasting (Zhu and Laptev, 2017).
However, both the accuracy and the scalability of VI depend on the particular approximating distribution. In this work, we employ Stein variational gradient descent (SVGD), which is a generalized nonparametric VI algorithm for approximating continuous distributions (Liu and Wang, 2016). SVGD has the advantage of not requiring knowledge of the explicit form of the posterior distribution and provides a theoretically guaranteed weak convergence of the samples (Liu, 2017). By assuming independent prior distributions and using the radial basis function (RBF) kernel, SVGD is fast and scalable to large neural networks and offers an elegant and efficient solution for forecasting quantities like rider demand while also modeling the prediction uncertainty.
We propose a regression-based BNN model to predict spatiotemporal quantities like hourly rider demand with calibrated uncertainties. The main contributions of this paper are (i) a feedforward deterministic neural network (DetNN) architecture that predicts cyclical time series data with sensitivity to anomalous forecasting events; and (ii) a Bayesian framework applying SVGD to train large neural networks for such tasks, capable of producing time series predictions as well as measures of uncertainty surrounding the predictions. Experiments show that the proposed BNN reduces the average estimation error across 8 U.S. cities by 10% compared to a fine-tuned multilayer perceptron (MLP), and by 4% compared to the same network architecture trained without SVGD.
2 Bayesian neural network
The proposed neural network consists of an encoder to learn the hidden features and a decoder to predict the time series, as shown in Figure 1. The outcome of interest is a vector of continuous variables $y$. The input features are denoted as $x$. The parameter of the model consists of the neural network parameter $\omega$ and the noise covariance matrix $\Sigma$. The regression model is specified as:

$$y = f(x; \omega) + \epsilon, \qquad (1)$$

where $\epsilon \sim \mathcal{N}(0, \Sigma)$. In equation (1), $f(x; \omega)$ denotes the output of the neural network. The predicted sequence is modeled independently across time points through the neural network. The correlation among the time points could be modeled through a structured $\Sigma$, but in our experiments $\Sigma$ is assumed to be a positive definite diagonal matrix for simplicity and computational efficiency. The $i$th outcome $y_i$ is a vector of length $d$ and is assumed to be independently but not identically sampled from a multivariate Gaussian distribution $y_i \sim \mathcal{N}(f(x_i; \omega), \Sigma)$ for $i = 1, \dots, N$.

A Bayesian framework is imposed on the model (1) by assigning prior distributions to the model parameters. The prior of the neural network parameters is given by , ; the prior of the noise covariance , for . and are estimated to maximize the joint log-likelihood with different learning rates.
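To make the likelihood implied by model (1) concrete, the following is a minimal NumPy sketch of the Gaussian log-likelihood under a diagonal noise covariance. It is an illustration under stated assumptions rather than the authors' code: the function name `gaussian_log_likelihood` and the argument layout are hypothetical, and the prior terms are deliberately left out.

```python
import numpy as np

def gaussian_log_likelihood(y, f_x, noise_var):
    """Log-likelihood of outcomes under y_i ~ N(f(x_i), diag(noise_var)).

    y, f_x    : arrays of shape (N, d), observed outcomes and network outputs
    noise_var : positive array of shape (d,), the diagonal of the noise covariance
    """
    y = np.asarray(y, dtype=float)
    f_x = np.asarray(f_x, dtype=float)
    noise_var = np.asarray(noise_var, dtype=float)
    resid = y - f_x
    # With a diagonal covariance the density factorises over output dimensions,
    # so the log-likelihood is a sum of univariate Gaussian log-densities.
    per_dim = -0.5 * (np.log(2.0 * np.pi * noise_var) + resid ** 2 / noise_var)
    return float(per_dim.sum())
```

Because the covariance is diagonal, no matrix inverse or determinant is needed, which is one reason this assumption is computationally convenient.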
During training, such neural networks are built via SVGD. When a new data point is passed into the trained network, posterior samples of the parameters are obtained for inference. The predicted outcome is estimated as the average of the network outputs over the SVGD samples. The prediction variability is decomposed into three sources: model uncertainty, model misspecification and inherent noise. Assuming there is no misspecification, the prediction variability can be estimated through the SVGD samples as the sum of two terms, where the first represents the model uncertainty and the second represents the inherent noise. Under the assumption of diagonal noise covariance, constructing a credible region is equivalent to constructing a credible interval at each dimension. The $(1-\alpha)$ level credible interval is estimated as the predicted outcome plus or minus $z_{\alpha/2}$ times the estimated standard deviation of the prediction, where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of a standard Gaussian.

The detailed BNN via SVGD algorithm is shown in the Appendix. In all experiments, an RBF kernel is used with a bandwidth computed from $H$, the median of the pairwise distances between the SVGD samples. The bandwidth is changed adaptively over iterations. The Stein operator depends on the target posterior only through the score function $\nabla_\theta \log p(\theta \mid D)$, where $\theta$ denotes the model parameters and $D$ the training data. Thus the exact posterior distribution does not need to be represented explicitly to generate approximate samples from it. Calculating the gradient of the log posterior requires all the training data, but during training only minibatches are passed into the neural network. This is fixed by approximating the log posterior with the minibatch log-likelihood, rescaled to the size of the full training set, plus $\log p_0(\theta)$, where $p_0$ is the prior distribution of $\theta$.
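A single SVGD update with an RBF kernel can be sketched as follows. This is a minimal NumPy illustration, not the algorithm from the Appendix: the bandwidth $H^2 / \log n$ is the choice suggested by Liu and Wang (2016) and is assumed here, the minibatch rescaling factor (training set size over batch size) is likewise an assumption, and the names `svgd_step`, `score_fn` and `minibatch_score` are hypothetical.

```python
import numpy as np

def minibatch_score(grad_log_lik_batch, grad_log_prior, n_train, batch_size):
    """Approximate the posterior score from one minibatch (assumed N/B rescaling)."""
    return (n_train / batch_size) * grad_log_lik_batch + grad_log_prior

def svgd_step(particles, score_fn, step_size=0.1):
    """Apply one SVGD update.

    particles : array of shape (n, p), one row per SVGD sample
    score_fn  : callable mapping (n, p) particles to (n, p) gradients of the
                log target density evaluated at each particle
    """
    n = particles.shape[0]
    diffs = particles[:, None, :] - particles[None, :, :]  # diffs[j, i] = x_j - x_i
    sq_dists = np.sum(diffs ** 2, axis=-1)
    if n > 1:
        # H is the median pairwise distance; recomputed at every call,
        # so the bandwidth adapts as the particles move.
        H = np.median(np.sqrt(sq_dists[np.triu_indices(n, k=1)]))
        h = max(H ** 2 / np.log(n), 1e-12)
    else:
        h = 1.0
    K = np.exp(-sq_dists / h)  # K[j, i] = k(x_j, x_i)
    scores = score_fn(particles)
    # Driving term: sum_j k(x_j, x_i) * score(x_j)
    attract = K.T @ scores
    # Repulsive term: sum_j grad_{x_j} k(x_j, x_i) = -(2/h) * sum_j K[j, i] * (x_j - x_i)
    repulse = (-2.0 / h) * np.einsum("ji,jid->id", K, diffs)
    return particles + step_size * (attract + repulse) / n
```

For example, calling `svgd_step` repeatedly with `score_fn = lambda p: -p` moves the particles toward a standard Gaussian target. The update only ever evaluates the score, which is why the normalizing constant of the posterior is never needed.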
3 Experiments
We predict the hourly rider demand across 8 U.S. cities along with quantified prediction variability. The data used in the experiment are the hourly numbers of completed trips at Uber from 2014 to 2018 in these 8 cities. The dataset is split sequentially into 50%/25%/25% train, validation, and test sets and preprocessed to fit the Gaussian assumption. The hourly demand data exhibit a strong 24-hour cyclical pattern with jitters around some special time windows. For example, the demand drops during Thanksgiving and rises dramatically after New Year’s Eve. To handle these important time windows, extra one-hot encoded channels are added to the input to the convolutional layers. As illustrated in Figure 4 (a), the input of the model consists of an hourly demand sequence and several sequential location sequences indicating the hour of the day, day of the week, etc.; the output is the predicted demand sequence. Figure 4 (b) shows the difference in prediction variability between a 72-hour window around holidays and one around a non-holiday, using the previous 144-hour input. The estimated variability is always higher around holidays, especially around Christmas, than around a normal day in all 8 cities, meaning that the BNN model is less confident when predicting a holiday than when predicting a non-holiday, as expected.
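The data preparation described above can be sketched roughly as follows. This is an illustrative NumPy example under several assumptions that are not taken from the paper: the series is assumed to start on a Monday at 00:00, only hour-of-day and day-of-week channels are built (holiday channels would be added the same way), and all function names are hypothetical. The default window lengths follow the 144-hour input and 6-hour output used in the experiments.

```python
import numpy as np

def sequential_split(series, fractions=(0.5, 0.25, 0.25)):
    """Split a series in time order into train, validation and test parts."""
    n = len(series)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    return (series[:n_train],
            series[n_train:n_train + n_val],
            series[n_train + n_val:])

def one_hot(indices, num_classes):
    """Return a (len(indices), num_classes) one-hot matrix."""
    out = np.zeros((len(indices), num_classes))
    out[np.arange(len(indices)), indices] = 1.0
    return out

def build_channels(demand):
    """Stack demand with one-hot hour-of-day and day-of-week channels.

    Assumes demand[0] corresponds to a Monday at 00:00 (hypothetical).
    Output shape: (T, 1 + 24 + 7).
    """
    demand = np.asarray(demand, dtype=float)
    t = np.arange(len(demand))
    return np.concatenate(
        [demand[:, None], one_hot(t % 24, 24), one_hot((t // 24) % 7, 7)],
        axis=1,
    )

def make_windows(channels, demand, in_len=144, out_len=6):
    """Slide over the series to build (input window, output window) pairs."""
    starts = range(len(demand) - in_len - out_len + 1)
    X = np.stack([channels[s:s + in_len] for s in starts])
    Y = np.stack([demand[s + in_len:s + in_len + out_len] for s in starts])
    return X, Y
```

Splitting before windowing, in time order, keeps validation and test windows strictly later than the training data, which matches the sequential split described in the text.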
Table 1: WMAPE averaged across the 6-hour prediction window.

City              MLP     DetNN   BNN10   BNN30   BNN50
San Francisco     0.0718  0.0678  0.0658  0.0657  0.0660
New York City     0.0773  0.0747  0.0763  0.0743  0.0743
Boston            0.0823  0.0790  0.0778  0.0770  0.0768
Chicago           0.0935  0.0840  0.0807  0.0802  0.0795
Washington D.C.   0.0790  0.0742  0.0758  0.0737  0.0737
Seattle           0.0822  0.0813  0.0777  0.0772  0.0770
Los Angeles       0.0792  0.0703  0.0655  0.0647  0.0650
Atlanta           0.0933  0.0877  0.0825  0.0805  0.0813
Average           0.0823  0.0774  0.0753  0.0741  0.0742
The performance of the BNN model with 10, 30 and 50 particle samples, referred to as BNN10, BNN30 and BNN50, is shown in Table 1. With only one SVGD sample, a reasonably good maximum a posteriori estimate can be obtained. The sample size in the experiment is chosen arbitrarily as a balance between prediction performance and computational efficiency. The input sequence length is 144 hours and the output sequence length is 6 hours. The weighted mean absolute percentage error (WMAPE), $\sum_i |y_i - \hat{y}_i| / \sum_i |y_i|$, where $y_i$ and $\hat{y}_i$ are the true and predicted outcomes, is used as the evaluation metric. Table 1 shows a summary of the WMAPE averaged across the 6-hour prediction window. As performance benchmarks, we also show the results of an MLP model and a DetNN model that has the same network structure as the BNN. The hyperparameters are tuned separately for each model. Averaging across all cities, DetNN achieves a 6% decrease in WMAPE relative to MLP, and BNN30 achieves a further 4% decrease relative to DetNN. Bayesian inference of the parameters using SVGD thus improves on the DetNN performance, with the additional benefit of quantified prediction variability.
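The metric and the relative improvements quoted above can be checked with a few lines of NumPy. The sketch assumes the common definition of WMAPE (sum of absolute errors divided by the sum of absolute true values); the function names are hypothetical, and the averages are copied from the last row of Table 1.

```python
import numpy as np

def wmape(y_true, y_pred):
    """Weighted MAPE: total absolute error relative to total absolute demand."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.abs(y_true - y_pred).sum() / np.abs(y_true).sum())

def relative_decrease(baseline, new):
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline

# Average WMAPE per model, copied from Table 1.
average_wmape = {"MLP": 0.0823, "DetNN": 0.0774, "BNN30": 0.0741}

detnn_vs_mlp = relative_decrease(average_wmape["MLP"], average_wmape["DetNN"])
bnn_vs_detnn = relative_decrease(average_wmape["DetNN"], average_wmape["BNN30"])
bnn_vs_mlp = relative_decrease(average_wmape["MLP"], average_wmape["BNN30"])
```

With these inputs the three ratios round to 6%, 4% and 10%, consistent with the figures reported in the text.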
Figure 7: Estimated WMAPE (left) and 95% coverage probability (right) over a 6-hour prediction window.

Figure 7 shows the estimated WMAPE and 95% coverage probability from BNN30 over a 6-hour prediction window. The 95% coverage probability is the percentage of cases in which the true value falls within the 95% credible band. The WMAPE increases as the prediction horizon grows, but the coverage probability does not necessarily decrease. Even when the point estimate is poor, the BNN model can report low confidence through high variability around the estimate, thus facilitating better informed supply allocation.
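As an illustration of how a credible band and its coverage probability might be computed from SVGD samples, consider the sketch below. It rests on assumptions that are not confirmed by the text: each particle is taken to carry its own diagonal noise variance, the model uncertainty is estimated by the variance of the network outputs across particles, and the inherent noise by the average noise variance. The function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def predictive_summary(outputs, noise_vars, level=0.95):
    """Summarise particle predictions for one input.

    outputs    : (n, d) network outputs, one row per particle
    noise_vars : (n, d) diagonal noise variances, one row per particle
    Returns the predictive mean and the lower and upper credible bounds.
    """
    outputs = np.asarray(outputs, dtype=float)
    noise_vars = np.asarray(noise_vars, dtype=float)
    mean = outputs.mean(axis=0)
    model_var = outputs.var(axis=0)      # spread across particles (model uncertainty)
    noise_var = noise_vars.mean(axis=0)  # average noise level (inherent noise)
    sd = np.sqrt(model_var + noise_var)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% level
    return mean, mean - z * sd, mean + z * sd

def coverage_probability(y_true, lower, upper):
    """Fraction of true values that fall inside the credible band."""
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((y_true >= lower) & (y_true <= upper)))
```

A well-calibrated model should give a coverage probability close to the nominal level, here 95%, on held-out data.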
4 Discussion
We have proposed a neural network architecture for spatiotemporal modeling with cyclical components and applied it to demand estimation, an important problem in the ride-sharing space. We furthermore perform Bayesian inference on the proposed model using a variant of SVGD, which gives promising performance gains. Our experimental results indicate the advantage of Bayesian estimation using SVGD for our model, which encourages further investigation into modeling uncertainty for industrial-scale problems. Interesting research questions remain. For example, in future work we will explore different correlation structures for modeling time series data and investigate more structured prior distributions in place of the independent prior assumption we currently make.
References

Längkvist et al. [2014] Martin Längkvist, Lars Karlsson, and Amy Loutfi. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters, 42:11–24, 2014.
Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Gehring et al. [2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
Lakshminarayanan et al. [2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
Gal and Ghahramani [2016a] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016a.
Gal and Ghahramani [2016b] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027, 2016b.
Blundell et al. [2015] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
Hernández-Lobato and Adams [2015] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.
Louizos and Welling [2017] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.
Karaletsos et al. [2018] Theofanis Karaletsos, Peter Dayan, and Zoubin Ghahramani. Probabilistic meta-representations of neural networks. arXiv preprint arXiv:1810.00555, 2018.
Li and Gal [2017] Yingzhen Li and Yarin Gal. Dropout inference in Bayesian neural networks with alpha-divergences. arXiv preprint arXiv:1703.02914, 2017.
Zhu and Laptev [2017] Lingxue Zhu and Nikolay Laptev. Deep and confident prediction for time series at Uber. In Data Mining Workshops (ICDMW), 2017 IEEE International Conference on, pages 103–110. IEEE, 2017.
Liu and Wang [2016] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378–2386, 2016.
Liu [2017] Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3115–3123, 2017.