1 Introduction
Short-term demand forecasting is crucial in many areas, including on-demand ride-hailing platforms such as Uber, Didi, and Lyft, because a dispatch system's efficiency can be improved by dynamically adjusting fare prices and relocating idle drivers to high-demand areas.
A predictive model must learn complex spatiotemporal correlations to predict future demand volumes in each region, and deep neural networks show prominent performance. After the region of interest is transformed into a grid, convolutional neural networks (CNNs) model local spatial correlations in a receptive field and extract spatial features of each region ma2017learning ; zhang2016dnn ; zhang2017deep . Recurrent neural networks (RNNs) elman1990finding or long short-term memory (LSTM) hochreiter1997long are also used to learn temporal patterns in general time series, including demand patterns qin2017dual ; lai2018modeling ; zhao2017lstm ; yao2018modeling . Another important issue is temporally recurrent patterns, because periodic and seasonal patterns commonly appear in real-world time series. For example, demand data show similar patterns on the same time-of-day and day-of-week (Figure 1). Thus, long-term histories from periods/seasons ago are used as input to model periodicity and seasonality in temporal patterns zhang2016dnn ; zhang2017deep ; yao2018deep . Recent approaches also use attention mechanisms over long input sequences and improve forecasting results qin2017dual ; yao2018deep .
In this paper, we rethink the use of CNNs for modeling spatial features of a region from its neighborhood. Convolution calculates different values according to the permutations of positions in a receptive field. For example, if a subway station is adjacent to the target region, a convolutional filter considers whether the subway station is to the west or east to model local spatial correlations. However, we claim that the direction of a neighbor is not important; the proximity of the neighborhood is sufficient to define the spatial features of the target region. That is, whether a region is adjacent to a subway station is more important than whether the station lies to the west or east of the region. We found that a permutation-invariant operation, which does not model the directionality of the neighborhood, improves forecasting results with fewer trainable parameters than convolution.
We also postulate that inputs from periods/seasons ago may not be the optimal way to digest temporally recurrent patterns, although they are a common and effective way to model periodicity and seasonality. Choosing the right periodicity and seasonality is also an open question, and existing heuristics, such as the (partial) autocorrelation function (ACF), are often time consuming. Previous approaches, from seasonal ARIMA to recent deep learning models, do not explicitly consider the temporal context at each time, but only learn a predictive model over an ordered input sequence. However, when a person understands a time series and learns to predict it, she or he does not only learn to map histories from days/weeks ago to an output. She or he recognizes temporal contexts from time-of-day, day-of-week, and holiday information and explicitly learns contexts such as weekday morning rush hours or holiday patterns from training data.
In this paper, we propose an efficient demand forecasting framework (TGNet), which consists of graph networks with temporal-guided embedding. Graph networks extract spatiotemporal features of a region from its neighborhood, and the features are invariant to permutations of the neighborhood. Temporal-guided embedding learns temporal contexts directly from training data and is concatenated to the input of the model, making TGNet a conditional autoregressive model on the temporal context of the target time. Experimental results show that TGNet has about 20 times fewer trainable parameters than a recent state-of-the-art model yao2018modeling and improves forecasting performance on real-world datasets.
Our paper is organized as follows. We introduce demand forecasting from spatiotemporal data in Section 2. In Section 3, we propose our model, TGNet, which consists of graph networks with temporal-guided embedding. We present experimental results on real-world datasets in Section 4. Related work is reviewed in Section 5. We conclude and discuss future work in Section 6.
2 Demand Forecasting from Spatiotemporal Data
In spatiotemporal modeling, different tessellations, such as grids ma2017learning , hexagons ke2018hexagon , or others davis2018taxi , are used to divide the region of interest into non-overlapping cells. We use grid tessellation in this study. Then, multivariate autoregressive models predict the future data of each region from the T immediate past observations as input features.
$d_{t,r} = |\{(t', r') \in O : t' \in t,\ r' \in r\}| \quad (1)$
where $\mathcal{R}$ and $\mathcal{T}$ are the sets of non-overlapping regions and time intervals, $(t', r')$ is the (time, location) of a demand log in $O$, and $|\cdot|$ denotes the cardinality of a set.
We define a graph $G = (V, E)$, where $V$ is the set of nodes with their features and $E$ is the set of edges between nodes. Here, each node $v_i$ corresponds to a region $r_i$ in $\mathcal{R}$. If $r_i$ and $r_j$ are adjacent, $e_{ij}$ is defined as 1, otherwise 0. The node feature of $v_i$ at time $t$ is defined by
$x_{v_i}^{t} = (d_{t-T+1, r_i}, \ldots, d_{t, r_i}) \in \mathbb{R}^{T}. \quad (2)$
Then, the forecasting model $f$ predicts the demand volumes of the target regions at $t+1$:
$(\hat{d}_{t+1, r_1}, \ldots, \hat{d}_{t+1, r_N}) = f(G_t), \quad (3)$
where $G_t$ is the graph with node features $\{x_{v_i}^{t}\}$.
The model is an autoregressive model over a fixed-length ordered sequence from a stationary process, or contains a feature extractor that makes observations from a non-stationary process stationary (theoretical details are in the supplementary material). Note that the model does not depend on a specific time and does not explicitly contain any temporal information about each observation.
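To make the formulation concrete, a minimal sketch of the grid tessellation and node-feature construction described above, in plain Python; all names are illustrative and not taken from our implementation:

```python
def build_grid_graph(n_rows, n_cols):
    # e_ij = 1 iff regions r_i and r_j share an edge in the grid tessellation
    n = n_rows * n_cols
    adj = [[0] * n for _ in range(n)]
    for i in range(n_rows):
        for j in range(n_cols):
            v = i * n_cols + j
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n_rows and 0 <= nj < n_cols:
                    adj[v][ni * n_cols + nj] = 1
    return adj

def node_feature(demand, region, t, T):
    # x_r^t: demand counts of region r over the T immediate past intervals
    # demand[s][r] is the demand volume of region r at interval s
    return [demand[s][region] for s in range(t - T + 1, t + 1)]
```

The adjacency matrix is symmetric by construction, and each node feature is simply a length-T window of past counts.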
3 Graph Networks with TemporalGuided Embedding
3.1 Graph Networks for Spatial Features with Permutation Invariance
Instead of convolutional layers, which are commonly used to model spatial correlations in demand patterns zhang2016dnn ; zhang2017deep ; yao2018modeling ; yao2018deep , our model consists of a stack of graph networks. Convolutional layers learn to extract spatial features of a region from its adjacent regions, but convolution is a permutation-variant operation: its output depends on the permutation and ordering of the neighborhood.
We claim that a permutation-invariant operation is a more efficient way to extract spatial correlations between a region and its neighborhood than convolution. To define the spatial features of a region, the characteristics of its neighbors and their proximity are sufficient; their directionality need not be considered. For example, the proximity of a subway station to a region matters more for defining the region's features than whether the station lies to the west or east of the region. Convolution, however, considers permutations of the neighborhood and requires different filter weights for each position. This can increase the number of trainable parameters unnecessarily and result in overfitting when training data are limited.
Thus, we use a permutation-invariant operation to aggregate the features of each region's adjacent regions. When the spatial feature of a region is extracted, the directionality of its neighborhood is not considered. This efficiently reduces the number of trainable parameters while maintaining or improving forecasting results on test data. For simplicity of notation, we write $h_{v}^{(0)}$ for the node feature $x_{v}^{t}$ in Equation 2.
$m_{u \to v}^{(k)} = W_{\mathrm{agg}}^{(k)} h_{u}^{(k-1)}, \quad u \in N(v), \quad (4)$
$a_{v}^{(k)} = \sum_{u \in N(v)} m_{u \to v}^{(k)}, \quad (5)$
where $h_{u}^{(k-1)}$ is the $(k-1)$-th feature vector of node $u$, $N(v)$ is the set of neighborhood regions of region $v$, and $W_{\mathrm{agg}}^{(k)}$ are trainable parameters in the $k$-th layer. Note that Equations 4 and 5 receive messages from the feature vectors of neighbor regions and aggregate them with a permutation-invariant operation (summation). The feature vector of node $v$ is then calculated by a fully connected layer, combining the aggregation over its neighborhood with a linear transformation of the node itself:
$h_{v}^{(k)} = \sigma\!\left(W^{(k)} \left[h_{v}^{(k-1)} \oplus a_{v}^{(k)}\right] + b^{(k)}\right), \quad (6)$
where $\oplus$ and $\sum$ denote concatenation and element-wise summation, respectively. The concatenation in Equation 6 is a skip connection and helps the model learn through feature reuse and alleviation of the vanishing-gradient problem huang2017densely . All trainable parameters in each layer are shared over every node.
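A minimal NumPy sketch of one such layer, assuming sum aggregation, a ReLU activation, and illustrative weight names:

```python
import numpy as np

def gn_layer(h, adj, w_agg, w_out, b_out):
    # h: (n_nodes, d_in) node features; adj: (n_nodes, n_nodes) 0/1 adjacency
    messages = h @ w_agg.T                               # per-node message
    agg = adj @ messages                                 # sum over neighbors
    combined = np.concatenate([h, agg], axis=1)          # skip connection
    return np.maximum(0.0, combined @ w_out.T + b_out)   # activation
```

Because the aggregation is a sum, relabeling the nodes consistently permutes the output rows in the same way; the extracted feature of a region does not depend on any ordering of its neighbors.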
After $K$ layers of graph networks, the demand volume of region $r$ at time $t+1$ is predicted as
$\hat{d}_{t+1, r} = \mathrm{ReLU}\!\left(W_{o}\left[h_{v_r}^{(K)} \oplus e_{\mathrm{ext}}\right] + b_{o}\right), \quad (7)$
where $e_{\mathrm{ext}}$ is a feature vector of the region from external data sources and is explained in Section 3.3. ReLU is used in the output layer to produce positive demand values. Note that the above operations generalize to different tessellations of a city, such as hexagonal ke2018hexagon or irregular patterns davis2018taxi .
3.2 TemporalGuided Embedding
Time series data have temporally recurrent patterns, such as periodicity and seasonality, and similar patterns tend to repeat. For example, different demand patterns recur on the same time-of-day, day-of-week, and holidays (Figure 1), reflecting people's life cycles. Existing approaches use immediate past data together with long-term histories from a period or season ago as model inputs zhang2016dnn ; zhang2017deep ; yao2018modeling . The periodicity and seasonality are also determined by manual methods such as the (partial) ACF.
Temporal-guided embedding is proposed to learn temporal contexts directly from training data and to capture these recurrent patterns. We assume that the combination of immediate past data and a learned temporal context can substitute for histories from days/weeks ago in capturing temporally recurrent patterns. The temporal-guided embedding at time $t$ is defined by
$e_{\mathrm{tge}}^{t} = \mathrm{FC}(c_{t}), \quad (8)$
where $c_{t}$ is a categorical variable representing the temporal information of time $t$. For example, we can use the concatenation of four one-hot vectors, corresponding to the time-of-day, day-of-week, holiday, and day-before-holiday information of time $t$, to represent the temporal information of demand. The fully connected layer, FC, outputs a distributed representation of the temporal information of $t$ and is trained in an end-to-end manner.

The temporal-guided embedding is concatenated to the input of the model, making the model learn a distribution conditioned on the temporal context of the forecasting target:
$\tilde{x}_{v}^{t} = x_{v}^{t} \oplus_{f} e_{\mathrm{tge}}^{t+1}, \quad (9)$
$(\hat{d}_{t+1, r_1}, \ldots, \hat{d}_{t+1, r_N}) = f(\tilde{G}_{t}), \quad (10)$
where $\oplus_{f}$ is feature-wise concatenation and $\tilde{G}_{t}$ is the graph with node features $\{\tilde{x}_{v}^{t}\}$. The temporal information of $t+1$ is available at time $t$, and the temporal-guided embedding of the forecasting target leads TGNet to extract spatiotemporal features of the input conditioned on the temporal context of $t+1$.
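A sketch of how the temporal context can be encoded and concatenated; the interval counts and dimensions are illustrative, and the embedding weights would be learned end-to-end in the full model:

```python
import numpy as np

N_TOD, N_DOW = 48, 7  # 30-minute intervals per day; days per week

def temporal_context(tod, dow, holiday, pre_holiday):
    # Concatenation of one-hot vectors for time-of-day, day-of-week,
    # holiday, and day-before-holiday (the input c_{t+1} of Eq. 8)
    c = np.zeros(N_TOD + N_DOW + 2 + 2)
    c[tod] = 1.0
    c[N_TOD + dow] = 1.0
    c[N_TOD + N_DOW + int(holiday)] = 1.0
    c[N_TOD + N_DOW + 2 + int(pre_holiday)] = 1.0
    return c

def temporal_guided_embedding(c, w_e):
    # Fully connected layer mapping the one-hot context to a distributed
    # representation; w_e is a learned weight matrix (name is illustrative)
    return w_e @ c

def concat_tge(x_nodes, e):
    # Broadcast the target-time embedding and concatenate it feature-wise
    # to every node's input (Eq. 9)
    return np.concatenate([x_nodes, np.tile(e, (x_nodes.shape[0], 1))], axis=1)
```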
Similar approaches exist that learn a conditional distribution of training images on labels mirza2014conditional or of words on positions vaswani2017attention . However, to the best of our knowledge, temporal-guided embedding is the first approach in the time series domain to learn a distribution conditioned on explicitly learned temporal contexts. Note that determining periodicity and seasonality using the partial ACF is a heuristic, handcrafted procedure; temporal-guided embedding can replace this procedure and learn temporal contexts directly, instead of using long-term historical inputs.
3.3 Late Fusion with External Data Sources
Orthogonally to capturing complex spatiotemporal patterns in demand data, forecasting results can be improved by incorporating external data such as meteorological, traffic flow, or event information tong2017simpler ; zhang2016dnn ; zhang2017deep ; yao2018deep ; yao2018modeling . In this paper, we do not use external data sources to improve our results and focus only on extracting complex spatiotemporal features effectively. However, we explain how our model architecture can incorporate data from other domains.
As an example, past drop-off volumes are used to improve demand forecasting results, because drop-offs in a region might turn into demands in the future yao2018modeling ; vahedian2019predicting . Feature vectors of drop-off patterns are extracted by graph networks in the same manner and concatenated to the features from demand (Equation 7). This type of late fusion is a common approach to combining heterogeneous data sources across modalities zadeh2017tensor ; anderson2018bottom ; ku2018joint . Although we do not use other external data, we expect that various external data can be incorporated in this manner to improve results in future work.
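A sketch of such a late-fusion output layer, assuming the demand and drop-off features of each region have already been extracted by separate graph-network stacks; weight names are illustrative:

```python
import numpy as np

def late_fusion_head(h_demand, h_dropoff, w_o, b_o):
    # Concatenate per-region features from the two modalities, then apply
    # the output layer; ReLU keeps predicted demand non-negative (cf. Eq. 7)
    z = np.concatenate([h_demand, h_dropoff], axis=1)
    return np.maximum(0.0, z @ w_o.T + b_o)
```

Fusing at the feature level (rather than the raw-input level) lets each modality keep its own feature extractor, which is the design choice described above.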
4 Experiment
4.1 Experimental Setting
Datasets Three real-world datasets (NYC-bike nycb2017data , NYC-taxi taxi2017tlc , and SEO-taxi) are used for evaluation. The details of the datasets are described in the supplementary material. The first two datasets are publicly open and SEO-taxi is private.
Evaluation We use two metrics to measure forecasting accuracy: mean absolute percentage error (MAPE) and root mean squared error (RMSE). We follow the same evaluation method as yao2018modeling ; yao2018deep for fair comparison and exclude low-volume samples (demand volume less than 11; see Section 4.2). This is common practice in industry and academia, because real-world applications have little interest in such low-volume samples. In all tables in this paper, mean performances over ten repeats are reported, and bold indicates statistical significance. The standard deviations are in the supplementary material.
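For concreteness, a sketch of the two metrics with low-volume filtering; the default threshold is an assumption based on the cutoff reported in Section 4.2:

```python
import math

def rmse_mape(y_true, y_pred, min_volume=11):
    # Keep only samples whose true volume reaches min_volume, following the
    # common practice of ignoring low-volume samples in evaluation
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t >= min_volume]
    n = len(pairs)
    se = sum((t - p) ** 2 for t, p in pairs)          # squared errors
    ape = sum(abs(t - p) / t for t, p in pairs)       # absolute pct. errors
    return math.sqrt(se / n), 100.0 * ape / n          # (RMSE, MAPE in %)
```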
Implementations Demand and drop-off volumes in the previous 8 and 16 time intervals (4 and 8 hours) are used to forecast demand in the next time interval (30 minutes). NYC and Seoul are divided into 10×20 and 50×50 grids respectively, considering the areas of the cities; the area of each region is about 700 m × 700 m. Batch normalization ioffe2015batch and dropout srivastava2014dropout are used in every layer. We attach the details of the implementation, including the number of layers and hidden neurons, in the supplementary material. Source code with Tensorflow 1.17.0 abadi2016tensorflow and Keras 2.22.2 chollet2015keras is available.^{1} ^{1}https://github.com/LeeDoYup/TGNetkeras

Training We use two types of loss to train TGNet: L2 loss (mean squared error) first, then L1 loss (mean absolute error). L1 loss is more robust to anomalies in real time series lai2018modeling , but we found its optimization unstable on its own; initial training with L2 loss makes the subsequent L1 optimization stable. TGNet is trained with the Adam optimizer kingma2014adam using a 0.01 learning rate with decay. We use 20% of the samples as test data, and 20% of the training data for validation; early stopping is applied to select an optimal model. Two Tesla P40 GPUs are used, and training takes about 2 (26) hours on the NYC (SEO) dataset.
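The two-phase loss schedule can be sketched as follows; a scalar linear model stands in for TGNet, and all hyperparameters are illustrative:

```python
import numpy as np

def train_two_phase(x, y, epochs_l2=50, epochs_l1=50, lr=0.01):
    # Phase 1: warm up with L2 (MSE) loss for stable optimization.
    # Phase 2: switch to L1 (MAE) loss for robustness to anomalies.
    w = 0.0
    for epoch in range(epochs_l2 + epochs_l1):
        residual = w * x - y
        if epoch < epochs_l2:
            grad = 2.0 * np.mean(residual * x)     # d/dw of mean (wx - y)^2
        else:
            grad = np.mean(np.sign(residual) * x)  # d/dw of mean |wx - y|
        w -= lr * grad
    return w
```

The same schedule applies to the full model: only the loss function changes at the phase boundary, not the optimizer or architecture.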
Baseline Methods We compare TGNet with statistical and state-of-the-art deep learning methods for spatiotemporal data: ARIMA, XGBoost chen2016xgboost , ST-ResNet zhang2017deep , DMVST-Net yao2018deep , and STDN yao2018modeling .

Table 1: Forecasting performances on the three datasets (mean over ten repeats).

Method               | NYC-bike       | NYC-taxi       | SEO-taxi       | # of Parameters (NYC)
                     | RMSE   MAPE(%) | RMSE   MAPE(%) | RMSE   MAPE(%) |
ARIMA                | 11.53  27.82   | 36.53  28.51   | 48.92  56.43   | -
XGBoost              |  9.57  23.52   | 26.07  19.35   | 32.09  45.75   | -
ST-ResNet            |  9.80  25.06   | 26.23  21.13   |   -      -     | 4,835,373
DMVST-Net            |  9.14  22.20   | 25.74  17.38   |   -      -     | 1,499,021
STDN yao2018modeling |  8.85  21.84   | 24.10  16.30   |   -      -     | 9,446,274
GN                   |  9.09  22.51   | 23.75  15.43   | 28.10  37.31   | 410,977
GN+TGE               |  8.88  22.37   | 22.81  14.99   | 25.96  35.67   | 419,857
TGNet                |  8.84  21.92   | 22.75  14.83   | 25.35  35.72   | 475,543
4.2 Forecasting Performances of TGNet
The forecasting accuracies of TGNet and the compared models, averaged over ten repeats on the NYC-bike, NYC-taxi, and SEO-taxi datasets, are shown in Table 1. In evaluation, samples with demand volume less than 11 were excluded.
The traditional time series model, ARIMA, shows the lowest accuracy on all datasets, because it cannot capture spatial correlations or the complex nonlinearity of demand patterns. XGBoost performs better than the statistical time series model.
Recent deep learning models outperform ARIMA and XGBoost by capturing complex spatiotemporal correlations in the datasets. The most recent model, STDN, shows the best performance among the baselines on the NYC-bike and NYC-taxi datasets, because it utilizes various modules and data: local CNNs, LSTMs, periodic and seasonal inputs, periodically shifted attention, and external data such as weather and traffic in/out flow.
Several observations indicate that TGNet is an effective architecture for predicting future demand. First, TGNet has about 20 times fewer trainable parameters (475,543) than STDN (9,446,274), yet shows better results than the other deep learning models. Graph networks with permutation-invariant aggregation and temporal-guided embedding reduce the number of trainable parameters compared to CNNs and long-term histories.
Second, TGNet does not use external data sources, while the other compared models use meteorological, traffic flow, or event information. The performance of TGNet on NYC-bike is not better than that of STDN, but the difference is not significant. Our results are promising considering that bike demand patterns are highly dependent on meteorological conditions, and considering the number of parameters.
Third, TGNet learns the large-scale dataset (SEO-taxi) successfully. The SEO-taxi dataset has 12.5 times more regions and a 3 times longer period than the NYC datasets. TGNet can learn the SEO-taxi dataset simply by increasing the number of hidden neurons in each layer. However, despite our best efforts, we failed to train the other deep learning baselines on the SEO-taxi dataset. A simple and efficient model architecture is compelling for generalizing across dataset scales.
We conducted a series of ablation studies on the effectiveness of the proposed methods. Our baseline model with graph networks (GN) outperforms a recent deep learning model (DMVST-Net), which uses convolutional and recurrent layers and graph embedding. STDN adds long-term histories and attention to DMVST-Net to capture temporally recurrent patterns; considering the performance gains of STDN over DMVST-Net and the accompanying increase in trainable parameters, temporal-guided embedding is a more efficient way to improve forecasting results than long-term histories. Temporal-guided embedding is simple to implement, yet gives performance gains on all datasets. Adding drop-off volumes also improves forecasting results. These results show that our proposed methods effectively capture complex spatiotemporal patterns in demand.
4.3 Effectiveness of PermutationInvariant Operation
Convolutional and graph networks are compared to show the effectiveness of the permutation-invariant operation.
TGNet-CA Fully convolutional networks noh2015learning , which use convolution-ReLU blocks instead of graph networks. The number of filters equals the number of neurons in the proposed model.
TGNet-CB We substitute the aggregation operation (Equation 4) with a 3×3 convolution and keep the other operations in the graph networks the same.
Graph networks show better forecasting results than both convolutional variants (Table 2). The results are notable because TGNet has about 1.5-2 times fewer trainable parameters than the variants. We conclude that a permutation-invariant operation can model the spatial features of each region efficiently, and that the proximity of particular neighbors can be more important than the directionality of the neighborhood.
Table 2: Comparison between permutation-invariant aggregation and convolution.

Method    | NYC-bike       | NYC-taxi       | # of Parameters (NYC)
          | RMSE   MAPE(%) | RMSE   MAPE(%) |
TGNet-CA  | 9.17   22.19   | 23.68  15.01   | 737,751
TGNet-CB  | 9.10   22.13   | 23.03  15.15   | 876,951
TGNet     | 8.84   21.92   | 22.75  14.83   | 475,543
4.4 Forecasting when Atypical Events Occur
In practice, short-term demand forecasting is most important when atypical events, which have abnormally larger values than usual, occur. Poor performance in these situations causes fatal supply-demand mismatches and can lead to service failure in ride-hailing services. Abnormally high values are non-repetitive, have different patterns from the majority of samples vahedian2019predicting , and are hard to learn, because they rarely appear at training time.
We set different thresholds according to time-of-day, weekend, and holiday information for each region from the training data, to account for different spatial and temporal contexts. Then, we select atypical samples, which exceed the threshold in each region, from the test data and investigate the forecasting results. Samples above the top 1% and 5% thresholds in each region are selected; they are more than 10 standard deviations above the mean of each region. We also verified that most samples from atypical events, such as concerts, festivals, or academic conferences, are included in our atypical samples.
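A sketch of the per-context threshold selection; the percentile choice follows the top 1%/5% setting above, while the exact thresholding rule in our pipeline may differ in detail:

```python
def atypical_threshold(values, top_pct=1.0):
    # Threshold for one (region, temporal-context) pair from training data:
    # the smallest value still inside the top `top_pct` percent
    ranked = sorted(values, reverse=True)
    k = max(1, int(round(len(ranked) * top_pct / 100.0)))
    return ranked[k - 1]

def select_atypical(test_values, threshold):
    # Test samples strictly above the threshold are treated as atypical
    return [v for v in test_values if v > threshold]
```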
Although recent deep learning models (ST-ResNet and STDN) show great results on average (Table 1), they do not learn minority samples with extremely large values. We infer that an excessive number of parameters can result in overfitting: atypical samples are hard for an overfitted model to predict, because they scarcely appear during training. The results of TGNet are acceptable, although somewhat inferior to its average performance (Table 1). Drop-off volumes are helpful in atypical event situations, because a past surge of drop-off volumes can convert into future demand vahedian2019predicting . Forecasting models should be evaluated not only on average performance but also in unusual situations, to study the robustness of the model in practice.
Table 3: Forecasting performances on atypical samples (top 1% and 5%).

Method    | NYC-taxi RMSE    | NYC-taxi MAPE(%) | SEO-taxi RMSE   | SEO-taxi MAPE(%)
          | top 1%  top 5%   | top 1%  top 5%   | top 1%  top 5%  | top 1%  top 5%
ST-ResNet | 224.50  217.72   | 154.06  157.72   |  N/A    N/A     |  N/A    N/A
STDN      | 210.34  203.11   |  90.55   89.71   |  N/A    N/A     |  N/A    N/A
GN        |  21.15   20.36   |  28.75   29.62   | 39.91   30.99   | 47.32   48.08
GN+TGE    |  20.79   20.03   |  27.51   28.36   | 37.18   28.96   | 45.86   46.78
TGNet     |  19.64   18.83   |  27.43   28.23   | 36.37   28.19   | 46.16   47.16
4.5 Visualization of Temporal-Guided Embedding
We visualize the temporal-guided embedding to investigate whether it learns temporal contexts as intended. Because the input of the temporal-guided embedding is a categorical variable, its dimensions are not mutually correlated; for example, 5 a.m. and 6 a.m. are independent in the one-hot input. We expect the temporal-guided embedding to show meaningful visualizations with distributed representations of temporal contexts, and to explain the temporal patterns in the training data.
We find three remarkable indications that temporal-guided embedding actually learns temporal contexts from data. First, the embeddings of adjacent times-of-day are located adjacent to each other (see supplementary material), consistent with the basic intuition that events at adjacent times are strongly correlated. Second, the time-of-day vectors are clustered according to temporal contexts: they fall into four clusters corresponding to commute time, daytime, evening, and night (Figure 2, left). This division is analogous to how people understand daily demand patterns based on a common lifestyle; temporal-guided embedding learns different temporal contexts depending on the time. Lastly, temporal-guided embedding learns the concepts of day-of-week and holiday. The locations of weekday and weekend vectors are strictly divided, and if a weekday is a holiday, its embedding is adjacent to the weekend vectors, because holiday and weekend demand patterns are similar (Figure 2, right).

For the NYC datasets, the insights from the visualization are not as definite as those for SEO-taxi, but we found similar patterns: the embedding vectors of adjacent times-of-day are located nearby, and the embeddings of working days and the others (weekends and holidays) are clearly separated (see supplementary material). We assume that the NYC datasets are relatively small for learning an intuitive representation of the temporal patterns. In summary, temporal-guided embedding not only improves forecasting results but also yields interpretable visualizations of temporal contexts from demand patterns on a large-scale dataset.
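The kind of 2-D visualization used here can be sketched with a simple PCA projection of the learned embedding vectors; the paper's figures may use a different projection method:

```python
import numpy as np

def embed_2d(embeddings):
    # Project embedding vectors (n_contexts, dim) to 2-D for plotting:
    # center, then keep the top-2 right singular directions of the data
    x = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T
```

Plotting the two returned coordinates per time-of-day (or day-of-week) vector reproduces the kind of cluster structure described above.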
5 Related Work
Many predictive models have been used to learn complex spatiotemporal patterns in demand data. ARIMA is used to predict future traffic conditions and exploit temporal patterns in the data pan2012utilizing ; moreira2013predicting . Latent space models deng2016latent or k-nearest neighbors (kNN) cheng2018short have been applied to capture spatial correlations between adjacent regions for short-term traffic forecasting. While these approaches show promising progress on traffic forecasting, they have limited capability to capture complex spatiotemporal patterns.

Most recent models adopt convolutional neural networks (CNNs) lecun1995convolutional and long short-term memory (LSTM) hochreiter1997long to extract spatial and temporal features respectively. First, they form a grid over a region and assign a quantity of interest as a pixel value, turning the geographical data into a 2D image. For example, ma2017learning turned the traffic speed of each region into a 2D image and forecast future traffic speeds. Feature maps are then extracted by a stack of convolutional layers, considering local relationships between adjacent regions zhang2016dnn ; zhang2017deep ; yao2018modeling ; yao2018deep . To capture autoregressive sequential dependencies, various models use LSTM layers to forecast traffic volumes and conditions zhao2017lstm ; cheng2017deeptransport , taxi demands ke2017short ; zhou2018predicting ; yao2018deep ; yao2018modeling , or traffic speeds yu2017spatiotemporal .
Taxi demand patterns recur temporally according to time-of-day, day-of-week, and holidays. Some approaches utilize long-term histories of demand volumes from days/weeks ago to improve forecasting performance. zhang2016dnn ; zhang2017deep use three convolutional models to extract features of temporal closeness, period, and seasonal trend from immediate past, days-ago, and weeks-ago samples of the forecasting target. yao2018modeling also uses days/weeks-ago samples as input to LSTM layers and combines them with a periodically shifted attention mechanism. Long-term histories are used to capture temporally recurrent patterns, but they can increase model size and result in overfitting. Our model does not use long-term histories; instead, it learns the temporal context of the forecasting target time and a distribution conditioned on the immediate past samples and that context.
Recent studies successfully apply neural networks to graphs. GraphSAGE hamilton2017inductive learns a function that generates node embeddings by sampling and aggregating from a node's neighborhood. Message passing neural networks (MPNNs) gilmer2017neural define message/update functions and unify many previous studies on graph domains duvenaud2015convolutional ; li2015gated ; battaglia2016interaction ; kearnes2016molecular ; schutt2017quantum ; kipf2016semi . Attention mechanisms have also been used to define relationships between entities velickovic2018graph ; wang2018non . Graph neural networks extract hidden representations of each node from the messages of its neighborhood, and the features are invariant to the ordering of the neighborhood. We use graph neural networks to make the spatial features of a target region invariant to permutations of its adjacent regions. This approach focuses on the particular characteristics of a neighboring area, not its relative location, such as left or right.
Ingesting external data sources that may be related to future demand can improve forecasting performance. For example, meteorological and event information zhang2016dnn ; zhang2017deep ; yao2018deep , or traffic (in/out) flows yao2018modeling can be used to improve forecasting results. However, such improvements are orthogonal to modeling the complex spatiotemporal dependencies in the input data. We focus only on proposing an efficient model for learning complex spatiotemporal patterns and expect various data to be combined with our model in the future.
6 Conclusion
We propose the temporal-guided network (TGNet), a graph neural network with temporal-guided embedding. TGNet uses the permutation-invariant operation of graph networks and temporal-guided embedding to extract spatial and temporal features efficiently, instead of CNNs and long-term histories. TGNet has about 20 times fewer trainable parameters than a recent state-of-the-art model yao2018modeling and shows competitive or better results on three real-world datasets. We also show that TGNet with a permutation-invariant operation achieves better performance with fewer parameters than models with convolution. Our results are notable because external data sources such as weather, traffic flow, or event information are not used in this study. Temporal-guided embedding directly learns temporal contexts from training data and yields interpretable visualizations. TGNet also gives stable forecasting results on atypical samples with extremely large values, where other deep learning models perform poorly; such atypical event situations are practically important and deserve attention. Temporal-guided embedding can also be utilized to capture temporally recurrent patterns in other time series data, and we will demonstrate this generalizability in future work. We expect our model to be a strong baseline for forecasting spatiotemporal data in various applications.
References

[1]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey
Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al.
Tensorflow: a system for largescale machine learning.
In OSDI, volume 16, pages 265–283, 2016. 
[2]
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen
Gould, and Lei Zhang.
Bottomup and topdown attention for image captioning and visual
question answering.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pages 6077–6086, 2018.  [3] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pages 4502–4510, 2016.
 [4] NYC Bike. https://www.citibikenyc.com/systemdata. Accessed = 20190522.
 [5] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016.
 [6] Shifen Cheng and Feng Lu. Shortterm traffic forecasting: A dynamic stknn model considering spatial heterogeneity and temporal nonstationarity. 2018.
 [7] Xingyi Cheng, Ruiqing Zhang, Jie Zhou, and Wei Xu. Deeptransport: Learning spatialtemporal dependency for traffic condition forecasting. arXiv preprint arXiv:1709.09585, 2017.
 [8] François Chollet et al. Keras, 2015.
 [9] Neema Davis, Gaurav Raina, and Krishna Jagannathan. Taxi demand forecasting: A hedgebased tessellation strategy for improved accuracy. IEEE Transactions on Intelligent Transportation Systems, (99):1–12, 2018.
 [10] Dingxiong Deng, Cyrus Shahabi, Ugur Demiryurek, Linhong Zhu, Rose Yu, and Yan Liu. Latent space model for road networks to predict timevarying traffic. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1525–1534. ACM, 2016.
 [11] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán AspuruGuzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
 [12] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
 [13] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 1263–1272. JMLR. org, 2017.
 [14] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
 [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[16] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 [17] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
 [18] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[19] Jintao Ke, Hai Yang, Hongyu Zheng, Xiqun Chen, Yitian Jia, Pinghua Gong, and Jieping Ye. Hexagon-based convolutional neural network for supply-demand forecasting of ride-sourcing services. IEEE Transactions on Intelligent Transportation Systems, 2018.
[20] Jintao Ke, Hongyu Zheng, Hai Yang, and Xiqun Michael Chen. Short-term forecasting of passenger demand under on-demand ride services: A spatio-temporal deep learning approach. Transportation Research Part C: Emerging Technologies, 85:591–608, 2017.
[21] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.
 [22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[23] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
 [24] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8. IEEE, 2018.
[25] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 95–104. ACM, 2018.
 [26] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
 [27] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
[28] Xiaolei Ma, Zhuang Dai, Zhengbing He, Jihui Ma, Yong Wang, and Yunpeng Wang. Learning traffic as images: a deep convolutional neural network for large-scale transportation network speed prediction. Sensors, 17(4):818, 2017.
[29] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
 [30] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[31] Luis Moreira-Matias, Joao Gama, Michel Ferreira, Joao Mendes-Moreira, and Luis Damas. Predicting taxi–passenger demand using streaming data. IEEE Transactions on Intelligent Transportation Systems, 14(3):1393–1402, 2013.
 [32] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
[33] Bei Pan, Ugur Demiryurek, and Cyrus Shahabi. Utilizing real-world transportation data for accurate traffic prediction. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 595–604. IEEE, 2012.

[34] Yao Qin, Dongjin Song, Haifeng Cheng, Wei Cheng, Guofei Jiang, and Garrison W Cottrell. A dual-stage attention-based recurrent neural network for time series prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2627–2633. AAAI Press, 2017.
[35] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[36] Kristof T Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R Müller, and Alexandre Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8:13890, 2017.
[37] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[38] NYC Taxi and Limousine Commission, et al. TLC trip record data. Accessed October 12, 2017.
 [39] R Core Team et al. R: A language and environment for statistical computing. 2013.
 [40] Yongxin Tong, Yuqiang Chen, Zimu Zhou, Lei Chen, Jie Wang, Qiang Yang, Jieping Ye, and Weifeng Lv. The simpler the better: a unified approach to predicting original taxi demands based on largescale online platforms. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1653–1662. ACM, 2017.
[41] Amin Vahedian, Xun Zhou, Ling Tong, W Nick Street, and Yanhua Li. Predicting urban dispersal events: A two-stage framework through deep survival analysis on mobility data. arXiv preprint arXiv:1905.01281, 2019.
 [42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
 [43] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
 [44] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Nonlocal neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[45] Huaxiu Yao, Xianfeng Tang, Hua Wei, Guanjie Zheng, Yanwei Yu, and Zhenhui Li. Modeling spatial-temporal dynamics for traffic prediction. arXiv preprint arXiv:1803.01254, 2018.
[46] Huaxiu Yao, Fei Wu, Jintao Ke, Xianfeng Tang, Yitian Jia, Siyu Lu, Pinghua Gong, Jieping Ye, and Zhenhui Li. Deep multi-view spatial-temporal network for taxi demand prediction. In AAAI, pages 2588–2595, 2018.
 [47] Haiyang Yu, Zhihai Wu, Shuqin Wang, Yunpeng Wang, and Xiaolei Ma. Spatiotemporal recurrent convolutional networks for traffic prediction in transportation networks. Sensors, 17(7):1501, 2017.
[48] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250, 2017.
 [49] Junbo Zhang, Yu Zheng, and Dekang Qi. Deep spatiotemporal residual networks for citywide crowd flows prediction. In AAAI, pages 1655–1661, 2017.
[50] Junbo Zhang, Yu Zheng, Dekang Qi, Ruiyuan Li, and Xiuwen Yi. DNN-based prediction model for spatio-temporal data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, page 92. ACM, 2016.
[51] Zheng Zhao, Weihai Chen, Xingming Wu, Peter CY Chen, and Jingmeng Liu. LSTM network: a deep learning approach for short-term traffic forecast. IET Intelligent Transport Systems, 11(2):68–75, 2017.
[52] Xian Zhou, Yanyan Shen, Yanmin Zhu, and Linpeng Huang. Predicting multi-step citywide passenger demands using attention-based neural networks. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 736–744. ACM, 2018.
Appendix A Implementation
A.1 Implementation Details
Our code is based on TensorFlow 1.7.0 [1] with the high-level API Keras 2.2.2 [8]. The source code, including a README.txt, is available in the supplementary material. There are six hidden layers before the fully-connected layer (equation (7)), and two additional layers are used for taxi drop-off volumes. We apply a 2D average pooling layer with a 2x2 kernel after the first GN layer for computational efficiency. Skip connections, as in U-net [35], are used to alleviate the vanishing gradient problem [17].
The number of hidden neurons in the first layer (NF in Figure 3) is 32 for the NYC datasets. We double the number of neurons for the SEO-taxi dataset because it is larger in scale than the NYC datasets. Batch normalization and dropout are applied in each layer.
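For intuition about why a permutation-invariant neighborhood operation needs fewer parameters than convolution, the following is a minimal NumPy sketch. It is illustrative only, not the exact GN layer in our code; the function names and the self/neighbor-mean decomposition are assumptions for exposition:

```python
import numpy as np

def conv_like_params(in_ch, out_ch, k=3):
    # A k x k convolution learns a separate weight per neighbor position.
    return k * k * in_ch * out_ch

def invariant_params(in_ch, out_ch):
    # A permutation-invariant layer only distinguishes "self" from the
    # "neighborhood mean", so it needs two weight matrices of size
    # in_ch x out_ch regardless of how the neighbors are arranged.
    return 2 * in_ch * out_ch

def invariant_aggregate(grid):
    """Permutation-invariant neighborhood feature: mean over the 8 neighbors.

    grid: (H, W) demand map. Returns (H, W, 2): [self, neighbor mean].
    Any permutation of the 8 neighbors yields the same output.
    """
    H, W = grid.shape
    padded = np.pad(grid, 1, mode="constant")
    # Sum of each 3x3 window minus its center = sum of the 8 neighbors.
    window_sum = sum(
        padded[i:i + H, j:j + W] for i in range(3) for j in range(3)
    )
    neighbor_mean = (window_sum - grid) / 8.0
    return np.stack([grid, neighbor_mean], axis=-1)
```

Because the neighbors enter only through their mean, relocating a feature (e.g. a subway station) from west to east of the target cell leaves the output unchanged, and the layer carries 2·C·F weights instead of the convolution's 9·C·F.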
A.2 Methods for Comparison
We compare the performance of TGNet with existing demand forecasting models for spatio-temporal data and describe them in this section. We follow the hyperparameters from the original papers but adjust the learning rates for training.
ARIMA: Autoregressive Integrated Moving Average (ARIMA) is a traditional model for non-stationary time series. We use the auto ARIMA function in R [39] to fit each dataset.
XGBoost [5]: XGBoost is a popular tool for training boosted trees. The number of trees is 500, the max depth is 4, and the subsample rate is 0.6.
ST-ResNet [49]: ST-ResNet is a CNN-based model with residual blocks [15]. It uses various past time steps as temporal closeness, periodic, and seasonal inputs to capture temporally recurrent patterns. A ResNet extracts hidden representations of each input (an image at each time step), and all feature maps are concatenated before predicting future demand.
DMVST-Net [46]: DMVST-Net models spatial, temporal, and semantic views through a local CNN, LSTM, and graph embedding. It does not forecast demand for all target regions at once but predicts the future demand of each region independently. After convolutional layers extract spatial features of the input image at each time step, the feature maps are fed into LSTM layers to extract temporal features.
STDN [45]: STDN builds on DMVST-Net [46] and adds components to improve forecasting results. Temporal closeness, periodic, and seasonal inputs are used to model temporally recurrent patterns, and a periodically shifted attention mechanism is proposed to handle long sequences. Traffic flow and taxi drop-off volumes are also used through flow gating mechanisms.
A.3 Hyperparameter Search
We used grid search over the settings below to determine the optimal hyperparameters. Bold indicates the selected values. For SEO-taxi, the number of hidden neurons in the first layer was 64; the other settings were the same.
Learning Rate: Learning rate for Adam optimizer.
Decaying Rate: Decaying rate for Adam optimizer.
Number of Hidden Neurons: The number of convolutional filters in the first hidden layer. The filter numbers in the other layers keep the same ratio to the optimal value mentioned above.
Number of Hidden Neurons for Drop-off: The number of convolutional filters for encoding drop-off volumes.
Batch Size: Mini-batch size for model updates.
Dimensionality of Temporal-Guided Embedding: The number of dimensions of the temporal-guided embedding.
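The search procedure itself is a straightforward exhaustive grid search, sketched below. The candidate values in `search_space` are placeholders for illustration, not the exact grids we searched:

```python
import itertools

# Hypothetical search space mirroring the settings above; the values are
# illustrative placeholders, not the grids actually searched in the paper.
search_space = {
    "learning_rate": [1e-3, 1e-4],
    "decay_rate": [0.96, 0.99],
    "hidden_neurons": [32, 64],
    "dropoff_neurons": [16, 32],
    "batch_size": [32, 64],
    "tge_dim": [8, 16],
}

def grid_search(space, evaluate):
    """Exhaustive grid search; evaluate(cfg) returns a validation error to minimize."""
    best_cfg, best_err = None, float("inf")
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        err = evaluate(cfg)
        if err < best_err:
            best_cfg, best_err = cfg, err
    return best_cfg, best_err
```

In practice `evaluate` would train the model under `cfg` and return its validation error; here it is left abstract so the search loop stands on its own.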
Appendix B Evaluation
We introduce the three real-world datasets used to evaluate our model and other evaluation details. The region of NYC is divided into a 10 x 20 grid and the region of Seoul into a 50 x 50 grid. A grid cell covers about 700 m x 700 m. A time interval is 30 minutes in this study.
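As a sketch of this preprocessing, raw trip records can be aggregated into a (time step, row, column) demand tensor as follows. The flat meters-per-degree conversion and the helper name `to_demand_tensor` are illustrative assumptions, not our exact pipeline:

```python
import numpy as np

def to_demand_tensor(records, lat0, lon0, cell_m=700.0, n_rows=10, n_cols=20,
                     interval_min=30, horizon_steps=2880):
    """Aggregate (timestamp_min, lat, lon) pickup records into a demand tensor.

    Illustrative sketch: approximates the 700 m x 700 m grid with a flat
    meters-per-degree conversion anchored at (lat0, lon0); the paper's exact
    projection may differ.
    """
    m_per_deg_lat = 111_320.0
    demand = np.zeros((horizon_steps, n_rows, n_cols), dtype=np.int32)
    for t_min, lat, lon in records:
        step = int(t_min // interval_min)
        row = int((lat - lat0) * m_per_deg_lat // cell_m)
        col = int((lon - lon0) * m_per_deg_lat * np.cos(np.radians(lat0)) // cell_m)
        # Count only records that fall inside the spatial grid and time horizon.
        if 0 <= step < horizon_steps and 0 <= row < n_rows and 0 <= col < n_cols:
            demand[step, row, col] += 1
    return demand
```

Each 30-minute slot then holds one demand image, which is the input format assumed throughout the paper.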
B.1 Dataset Description
NYC-bike: The NYC-bike dataset contains the number of bike rentals and returns in NYC from 07/01/2016 to 08/29/2016. The first 40 days are used for training and the remaining 20 days for testing. This dataset is not about taxi demand, but we also evaluate on it to show that our model generalizes as a spatio-temporal demand forecasting model. Bike demand patterns are vulnerable to weather conditions; for example, on a rainy day there is little bike demand. In this paper, we do not use external data, including weather, but we show that our model is competitive with baselines that use external data.
NYC-taxi: The NYC-taxi dataset contains taxi pickup and drop-off records of NYC from 01/01/2015 to 03/01/2015. The first 40 days are used for training and the remaining 20 days for testing.
SEO-taxi: The SEO-taxi dataset contains ride request and drop-off records in Seoul, South Korea. The data are provided by an on-demand ride-hailing service provider and are private. The dataset covers 01/01/2018 to 06/30/2018; the first 4 months are used for training and the remainder for testing. The SEO-taxi dataset is relatively large-scale because the area of Seoul (50 x 50 grid) is larger than that of NYC (10 x 20) and the period is longer than those of NYC-bike and NYC-taxi. We found that, to the best of our effort, the other baselines could not learn SEO-taxi under the hyperparameter settings described above.
B.2 Dataset Details
In this paper, the other deep learning models could not learn the SEO-taxi dataset because it is large-scale and more sparse and complex. The dataset is currently private, so we attach a statistical comparison of the three datasets. We will release the SEO-taxi dataset if it is free of security issues.
Table 4 shows the statistics of the datasets: NYC-bike, NYC-taxi, and SEO-taxi. We set the time interval to 30 minutes. The NYC datasets span 60 days, i.e., 2,880 time steps, over 10 x 20 regions; the SEO-taxi dataset spans 181 days, i.e., 8,688 steps, over 50 x 50 regions. SEO-taxi has a lower mean and standard deviation than NYC-taxi, but its maximum demand volume is much larger. We consider that SEO-taxi has sparser and more complex taxi demand dynamics over a larger scale of regions and times.
Dataset  Period  Regions  Mean  Median  Std  Min  Max
NYC-bike  07/01/2016 - 08/29/2016  10 x 20  4.52  0  14.33  0  307
NYC-taxi  01/01/2015 - 03/01/2015  10 x 20  38.8  0  107.71  0  1,149
SEO-taxi  01/01/2018 - 06/30/2018  50 x 50  -  0  18.27  0  4,491
B.3 Performance Measures
Two evaluation metrics measure the performance of the forecasting models: mean absolute percentage error (MAPE) and root mean squared error (RMSE). In evaluation, samples whose demand value falls below a threshold are excluded, as is common practice in industry and academia [46, 45], because such samples are of little interest in real-world applications. Let Ω be the set of filtered samples, with ground truth y_i and prediction ŷ_i; the performance measures are given by

MAPE = (100% / |Ω|) Σ_{i ∈ Ω} |ŷ_i − y_i| / y_i  (11)

RMSE = sqrt( (1 / |Ω|) Σ_{i ∈ Ω} (ŷ_i − y_i)² )  (12)
MAPE and RMSE tend to be sensitive to small-value and large-value samples, respectively. For an extreme one-sample case, if the model predicts 3 when the ground truth is 1, MAPE is 200% and RMSE is 2; on the other hand, if the model predicts 500 when the ground truth is 1,000, MAPE is 50% and RMSE is 500. Because of these complementary characteristics, both measures are reported together.
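A minimal NumPy implementation of the two measures is below; the filtering threshold is left as an illustrative parameter rather than the exact cutoff used in evaluation:

```python
import numpy as np

def filter_samples(y_true, y_pred, threshold=1.0):
    # Drop samples whose ground-truth demand falls below the threshold,
    # mirroring the common filtering practice described above.
    mask = y_true >= threshold
    return y_true[mask], y_pred[mask]

def mape(y_true, y_pred):
    # Mean absolute percentage error, in percent (equation (11)).
    return 100.0 * np.mean(np.abs(y_pred - y_true) / y_true)

def rmse(y_true, y_pred):
    # Root mean squared error (equation (12)).
    return np.sqrt(np.mean((y_pred - y_true) ** 2))
```

The one-sample examples above check out directly: predicting 3 for a ground truth of 1 gives MAPE 200% and RMSE 2, while predicting 500 for 1,000 gives MAPE 50% and RMSE 500.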
Appendix C Temporal-Guided Embedding
C.1 Input of Temporal-Guided Embedding
The input of temporal-guided embedding is a concatenation of four one-hot vectors (time-of-day, day-of-week, holiday or not, and the day before a holiday or not), so the input vector consists of binary (0/1) categorical variables. Detailed explanations are in Table 5. In the case of time-of-day, for example, one-hot encoding means all time-of-day values are independent of each other, with no correlation between time-of-day vectors. However, we expect that temporal-guided embedding can learn distributed representations of temporal contexts in the process of learning how to forecast taxi demand and understanding the characteristics of the time series.
Type  Dimensionality  Explanation
Time of Day  48  30-minute intervals
Day of Week  7  MTWTFSS
Holiday  1  Holiday or not
Bef. Holiday  1  The day before a holiday or not
Total  57  0/1 variables
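The 57-dimensional input above can be constructed as follows; this is a minimal sketch with the hypothetical helper name `temporal_context_vector`:

```python
import numpy as np

def temporal_context_vector(interval_of_day, day_of_week, is_holiday, is_before_holiday):
    """Build the 57-dim 0/1 input for temporal-guided embedding.

    interval_of_day: 0..47 (30-minute slots), day_of_week: 0..6 (Mon=0).
    Layout: [time-of-day (48), day-of-week (7), holiday (1), before-holiday (1)].
    """
    time_onehot = np.zeros(48)
    time_onehot[interval_of_day] = 1.0
    dow_onehot = np.zeros(7)
    dow_onehot[day_of_week] = 1.0
    return np.concatenate([
        time_onehot, dow_onehot,
        [float(is_holiday)], [float(is_before_holiday)],
    ])
```

The embedding layer then maps this sparse 57-dim vector to a dense learned representation.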
C.2 Visualization of Temporal-Guided Embeddings
We visualize the learned temporal-guided embeddings to investigate whether they are interpretable. We assumed that temporal-guided embeddings can provide meaningful insights or visualizations beyond performance gains. For visualization, we use t-SNE [29] in scikit-learn 0.19.1. The learning rate is 1,000 and the other hyperparameters are set to their defaults.
We visualize some examples of temporal-guided embeddings of different time-of-day vectors from the SEO-taxi dataset in Figure 4. We find that temporal-guided embedding learns to locate adjacent time-of-day vectors near each other. This result matches human understanding of the basic concept of time, because people naturally assume that events at adjacent times are strongly correlated. The same assumption is applied in sequential modeling with recurrent layers as a relational inductive bias for series. The embeddings of the remaining time intervals are available in the supplementary material.
Although temporal-guided embedding improves forecasting results on all datasets, we cannot show insights on the NYC datasets as clearly as on SEO-taxi. That is, adjacent time-of-day vectors tend to be located close together, but this is not obvious for all time-of-day vectors (Figure 5). Working days (weekdays) and the other days (weekends and holidays) are also clearly separated in the embedding space (Figure 6). We conclude that the NYC datasets may not have enough samples to learn temporal contexts as well as SEO-taxi, but the overall behavior of temporal-guided embedding is similar to that on the large-scale SEO-taxi dataset.
C.3 Time-Series Forecasting and Temporal-Guided Embedding
In this paper, we showed that temporal-guided embedding improves forecasting performance and lets the model learn temporal contexts explicitly. The implementation of temporal-guided embedding is simple, but it has a theoretical background. Let y_t be an observation of a time series. From ARMA to recent deep learning models, forecasting models learn an autoregressive model of y_{t+1} with lag inputs y_t, y_{t-1}, …, y_{t-k+1}:

p(y_{t+1} | y_t, y_{t-1}, …, y_{t-k+1})  (13)

where t is the time stamp of each data sample. If a model assumes the Markov property (not our case), as recurrent models such as LSTM do through a hidden state h_t that summarizes the earlier observations, it becomes

p(y_{t+1} | y_t, h_t)  (14)
TGNet does not assume a sequential model and directly learns equation (13). In general, the model (13) or (14) is shared across all time stamps in the training samples, assuming a stationarity condition. Because of this condition, such a model is not feasible when the time series is non-stationary and has different probability distributions at different time stamps. Thus, preprocessing steps, such as log scaling or differencing, are used to make the series stationary, and neural networks can effectively make a non-stationary series stationary automatically by learning hierarchical nonlinear transformations. In other words, a deep learning model contains both a model of the probability distribution of a stationary process (the output layer) and preprocessing modules (the hidden layers) that make the input stationary. Note that we can rewrite (13) with random variables of a fixed-length ordered sequence:

p(x_{k+1} | x_k, x_{k-1}, …, x_1)  (15)

where x_i = y_{t-k+i} for i = 1, …, k and x_{k+1} = y_{t+1}. That is, equation (13) is the special case of (15) anchored at a particular time stamp t.
Equation (15) does not contain any temporal information about the specific time t; it only models the probability distribution of the input ordered sequence. That is, the model combines input values to predict future demand without explicit knowledge or understanding of temporal contexts. However, this approach to modeling time series is quite different from how humans understand time series, because people learn the temporal contexts of data from an explicit understanding of time-of-day, day-of-week, or holidays.
Temporal-guided embedding makes the model predict the conditional distribution given the temporal context of the forecasting target:

p(x_{k+1} | x_k, …, x_1, e(τ_{t+1}))  (16)

where e(τ_{t+1}) is the learned temporal context and τ_{t+1} is the temporal information vector (time-of-day, day-of-week, holiday, the day before a holiday) of the forecasting target x_{k+1}. Temporal-guided embedding explicitly learns temporal contexts of the forecasting target and makes the model extract hidden representations of the input sequence conditioned on the embedding. We replace the input of long-term histories from days/weeks ago with temporal-guided embedding and show that the embeddings improve forecasting performance and have interpretable visualizations. We expect that temporal-guided embedding can be used for general time-series modeling in future work.
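Because the temporal information vector is a concatenation of one-hot and binary indicators, its embedding amounts to a learned linear map; conditioning the predictor on it can be sketched as follows (dimensions and names are illustrative assumptions, not the actual TGNet layer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 57-dim temporal information vector tau (Table 5),
# a d_embed-dim temporal-guided embedding, and k_lags lag observations.
d_embed, d_tau, k_lags = 8, 57, 12

# Since tau is a stack of one-hot/binary indicators, the embedding e(tau)
# is a learned linear map of tau; W_embed stands in for the learned weights.
W_embed = rng.normal(size=(d_embed, d_tau))

def condition_on_temporal_context(lag_inputs, tau):
    """Concatenate lag observations with e(tau), in the spirit of equation (16)."""
    e_tau = W_embed @ tau
    return np.concatenate([lag_inputs, e_tau])
```

Downstream layers would then extract hidden representations from this conditioned input instead of consuming long-term histories from days or weeks ago.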
Appendix D Atypical Events and Drop-off Volumes
We evaluate forecasting performance on atypical-event samples, which have extremely large demand volumes, and show that drop-off volumes can improve forecasting results. In fact, we found that the patterns of taxi drop-offs in a region were different before atypical events occurred (Figure 7). A sudden surge of pickup requests is observed after atypical events, such as music festivals, end, and drop-off volumes are much larger than usual before the atypical events start. Many people rush into the region to participate in the events, and these drop-off volumes indicate potential future demand.
Appendix E Forecasting Performance Details
The standard deviations over ten repeats are reported in Table 6. We conclude that our proposed model is significantly competitive with the other baseline models. In the case of NYC-bike, our model is not significantly better than STDN [45], but there is also no statistically significant difference. Considering that TGNet has about 20 times fewer parameters than STDN and that bike demand is vulnerable to weather conditions, we consider our results on NYC-bike promising.
Method  NYC-bike  NYC-taxi  SEO-taxi
  RMSE  MAPE(%)  RMSE  MAPE(%)  RMSE  MAPE(%)
ARIMA  11.53  27.82  36.53  28.51  48.92  56.43
XGBoost  9.57  23.52  26.07  19.35  32.09  45.75
ST-ResNet  9.80 ± 0.12  25.06 ± 0.36  26.23 ± 0.33  21.13 ± 0.63  -  -
DMVST-Net  9.14 ± 0.13  22.20 ± 0.33  25.74 ± 0.26  17.38 ± 0.46  -  -
STDN [45]  8.85 ± 0.11  21.84 ± 0.36  24.10 ± 0.25  16.30 ± 0.23  -  -
GN  9.09 ± 0.05  22.51 ± 0.16  23.75 ± 0.30  15.43 ± 0.15  28.10  37.31
GN + TGE  8.88 ± 0.09  22.37 ± 0.06  22.81 ± 0.07  14.99 ± 0.07  25.96  35.67
TGNet  8.84 ± 0.07  21.92 ± 0.13  22.75 ± 0.14  14.83 ± 0.06  25.35  35.72