Short-term demand forecasting is crucial in many areas, including on-demand ride-hailing platforms such as Uber, Didi, and Lyft, because the efficiency of a dispatch system can be improved by dynamically adjusting fare prices and relocating idle drivers to high-demand areas.
A predictive model must learn complex spatiotemporal correlations to predict future demand volumes in each region, and deep neural networks have shown strong performance on this task. After the region of interest is transformed into a grid, convolutional neural networks (CNNs) model local spatial correlations within a receptive field and extract spatial features of each region ma2017learning ; zhang2016dnn ; zhang2017deep . Recurrent neural networks (RNNs) elman1990finding or long short-term memory (LSTM) hochreiter1997long are also used to learn temporal patterns in general time series, including demand patterns qin2017dual ; lai2018modeling ; zhao2017lstm ; yao2018modeling . Another important issue is temporally recurrent patterns, because periodic and seasonal patterns commonly appear in real-world time series. For example, demand data show similar patterns at the same time-of-day and day-of-week (Figure 1). Thus, long-term histories from periods or seasons ago are used as input to model periodicity and seasonality in temporal patterns zhang2016dnn ; zhang2017deep ; yao2018deep . Recent approaches also apply attention mechanisms to long input sequences and improve forecasting results qin2017dual ; yao2018deep .
In this paper, we rethink the use of CNNs for modeling the spatial features of a region from its neighborhood. Convolution computes different values according to the permutation of positions in a receptive field. For example, if a subway station is adjacent to the target region, a convolutional filter takes into account whether the station lies to the west or to the east when modeling local spatial correlations. However, we claim that the direction of a neighbor is not important and that proximity alone is enough to define the spatial features of the target region from its neighborhood. That is, whether a region is adjacent to a subway station matters more than whether the station lies to the west or east of the region. We found that a permutation-invariant operation, which does not model the directionality of the neighborhood, improves forecasting results with fewer trainable parameters than convolution.
We also postulate that feeding inputs from periods or seasons ago may not be the optimal way to capture temporally recurrent patterns, although it is a common and effective way to model periodicity and seasonality. Choosing the right period and season length is also an open question, and existing heuristics, such as the (partial) autocorrelation function (ACF), are often time-consuming. Previous approaches, from seasonal ARIMA to recent deep learning models, do not explicitly consider the temporal context at each time, but only learn a predictive model for an ordered input sequence. However, when a person studies a time series and learns to predict it, she or he does not only learn to match histories from days or weeks ago to an output. She or he learns to recognize temporal contexts from time-of-day, day-of-week, and holiday information, and explicitly understands contexts such as weekday morning rush hours or holiday patterns from the training data.
In this paper, we propose an efficient demand forecasting framework (TGNet), which consists of graph networks with temporal-guided embedding. The graph networks extract spatiotemporal features of a region from its neighborhood, and the features are invariant to permutations of the neighbors' positions. Temporal-guided embedding learns temporal contexts directly from training data and is concatenated to the input of the model. TGNet is thus an autoregressive model conditioned on the temporal context of the target time. Experimental results show that TGNet has about 20 times fewer trainable parameters than a recent state-of-the-art model yao2018modeling and improves forecasting performance on real-world datasets.
The paper is organized as follows. We introduce demand forecasting from spatiotemporal data in Section 2. In Section 3, we propose our model, TGNet, which consists of graph networks with temporal-guided embedding. We present experimental results on real-world datasets in Section 4. Related work is reviewed in Section 5. We conclude and discuss future work in Section 6.
2 Demand Forecasting from Spatiotemporal Data
In spatiotemporal modeling, different tessellations, such as a grid ma2017learning , hexagons ke2018hexagon , or others davis2018taxi , are used to divide the region of interest into non-overlapping cells. We use a grid tessellation to divide the entire region in this study. Multivariate autoregressive models are then used to predict the future demand of each region from the $T$ immediate past observations at time $t$ as input features. The demand of region $i$ in time interval $t$ is
$$x_t^i = \left|\{\, o : \tau_o \in t,\ l_o \in i \,\}\right|, \quad i \in \mathcal{L},\ t \in \mathcal{T}, \qquad (1)$$
where $\mathcal{L}$ and $\mathcal{T}$ are the sets of non-overlapping regions and time intervals, $(\tau_o, l_o)$ is the (time, location) of a demand log $o$, and $|\cdot|$ denotes the cardinality of a set.
We define a graph $G = (X_t, E)$, where $X_t = \{\mathbf{x}_t^i\}_{i \in \mathcal{L}}$ is the set of node features and $E$ is the set of edges between nodes. Each node corresponds to a region in $\mathcal{L}$. If regions $i$ and $j$ are adjacent, $e_{ij}$ is defined as 1, and otherwise as 0. The node feature of region $i$ at time $t$ is defined by
$$\mathbf{x}_t^i = \left[x_{t-T+1}^i, \dots, x_t^i\right]. \qquad (2)$$
Then, the forecasting model $F$ predicts the demand volumes of the target regions at time $t+1$:
$$\{\hat{x}_{t+1}^i\}_{i \in \mathcal{L}} = F(X_t, E). \qquad (3)$$
The model is an autoregressive model that takes a fixed-length ordered sequence, either from a process that is stationary over time or through a feature extractor that makes observations from a non-stationary process stationary (theoretical details are in the supplementary material). Note that the model does not depend on a specific time and does not explicitly contain any temporal information about each observation.
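As a minimal sketch of the construction described in this section, the following Python snippet counts demand logs per grid cell and time interval and builds the fixed-length window of past demand for one region. The log format, the 700 m cell size, the 30-minute interval, the 8-neighbor notion of adjacency, and all function names are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter

def count_demand(logs, cell_m=700.0, interval_s=1800):
    """Count demand logs per (time interval, grid row, grid column).

    logs: iterable of (timestamp_s, x_m, y_m) tuples (an assumed log format).
    """
    counts = Counter()
    for ts, x, y in logs:
        t = int(ts // interval_s)   # index of the time interval
        r = int(y // cell_m)        # grid row of the location
        c = int(x // cell_m)        # grid column of the location
        counts[(t, r, c)] += 1
    return counts

def node_feature(counts, t, r, c, T):
    """Demand of region (r, c) in the T most recent intervals ending at t."""
    return [counts.get((s, r, c), 0) for s in range(t - T + 1, t + 1)]

def neighbors(r, c, n_rows, n_cols):
    """Adjacent grid cells (8-neighborhood, an assumption about adjacency)."""
    out = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr, dc) != (0, 0) and 0 <= rr < n_rows and 0 <= cc < n_cols:
                out.append((rr, cc))
    return out

# Four hypothetical demand logs: (timestamp in seconds, x in meters, y in meters).
logs = [(10, 100.0, 100.0), (20, 650.0, 10.0),
        (1900, 100.0, 100.0), (2000, 800.0, 100.0)]
counts = count_demand(logs)
feature = node_feature(counts, t=1, r=0, c=0, T=2)
```

Here `feature` holds the demand of cell (0, 0) in the two most recent intervals, which is the kind of fixed-length ordered sequence the autoregressive model receives for each node.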
3 Graph Networks with Temporal-Guided Embedding
3.1 Graph Networks for Spatial Features with Permutational Invariance
Instead of convolutional layers, which are commonly used to model spatial correlations in demand patterns zhang2016dnn ; zhang2017deep ; yao2018modeling ; yao2018deep , our model consists of a stack of graph networks. Convolutional layers learn to extract the spatial features of a region from its adjacent regions, but convolution is a permutation-variant operation: its output depends on the permutation and ordering of the neighborhood.
We claim that a permutation-invariant operation is a more efficient way than convolution to extract spatial correlations between a region and its neighborhood. When we define the spatial features of a region, it is enough to consider the characteristics of its neighbors and their proximity, rather than their directionality. For example, the proximity of a subway station to a region matters more for defining the features of the region than whether the station lies to the west or east of it. Convolution, however, distinguishes permutations of the neighborhood and requires different filter weights for different permutations. This can increase the number of trainable parameters unnecessarily and result in overfitting when training data are limited.
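To illustrate the distinction, the sketch below compares a weighted sum with position-specific weights, as a convolution filter computes, against a mean over neighbors. The feature values and weights are made-up numbers chosen only for illustration.

```python
def conv_like(neighbor_values, position_weights):
    """Position-dependent weighted sum, as a convolution filter computes."""
    return sum(w * v for w, v in zip(position_weights, neighbor_values))

def mean_aggregate(neighbor_values):
    """Permutation-invariant aggregation: the order of neighbors is ignored."""
    return sum(neighbor_values) / len(neighbor_values)

# Hypothetical features of the west, north, east, and south neighbors.
station_in_west = [4.0, 0.0, 0.0, 0.0]
station_in_east = [0.0, 0.0, 4.0, 0.0]
weights = [0.5, 0.25, -0.5, 0.25]  # one weight per neighbor position

conv_west = conv_like(station_in_west, weights)
conv_east = conv_like(station_in_east, weights)
mean_west = mean_aggregate(station_in_west)
mean_east = mean_aggregate(station_in_east)
```

The convolution-like output changes when the station moves from west to east, whereas the mean stays the same and needs no per-position weights.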
Thus, we use a permutation-invariant operation to aggregate the features of the regions adjacent to each region. When the spatial feature of a region is extracted, the directionality of its neighborhood is not considered. This efficiently reduces the number of trainable parameters while maintaining or improving forecasting results on test data. For simplicity of notation, we use a shorthand for the node features defined in Equation 2.
where the inputs are the $(k-1)$-th feature vectors of the nodes in the neighborhood of region $i$ and the trainable parameters of the $k$-th layer. Note that Equation 4 receives messages from the feature vectors of neighboring regions and uses a permutation-invariant operation to aggregate them. The feature vector of node $i$ is calculated by a fully connected layer that combines the aggregation of its neighborhood with a linear transformation of the node itself,
where the two operators denote concatenation and element-wise summation, respectively. The concatenation in Equation 6 is a skip connection; it helps the model learn through feature reuse and alleviates the vanishing-gradient problem huang2017densely . All trainable parameters in each layer are shared across all nodes.
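One possible reading of the layer described above can be sketched in plain Python. The mean aggregator, the order in which the linear maps and the ReLU are applied, and the names `graph_layer`, `W_nbr`, and `W_self` are assumptions made for illustration; the exact form of Equations 4 to 6 may differ.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def graph_layer(h, adj, W_nbr, W_self):
    """One graph-network layer with weights shared over all nodes (assumed form).

    h: dict node -> feature vector; adj: dict node -> list of neighbor nodes.
    Assumes every node has at least one neighbor.
    """
    out = {}
    for i, h_i in h.items():
        nbrs = adj[i]
        # Mean over neighbors: the result does not depend on neighbor order.
        agg = [sum(h[j][d] for j in nbrs) / len(nbrs) for d in range(len(h_i))]
        # Element-wise sum of the transformed aggregate and the transformed node.
        summed = [a + b for a, b in zip(matvec(W_nbr, agg), matvec(W_self, h_i))]
        # Skip connection: concatenate the new features with the previous ones.
        out[i] = relu(summed) + h_i
    return out

h = {0: [1.0, 2.0], 1: [3.0, 0.0], 2: [5.0, 4.0]}
adj = {0: [1, 2], 1: [0], 2: [0]}
W_nbr = [[1.0, 0.0], [0.0, 1.0]]
W_self = [[0.5, 0.0], [0.0, -1.0]]
out = graph_layer(h, adj, W_nbr, W_self)
```

Listing the neighbors of node 0 as `[2, 1]` instead of `[1, 2]` yields the same output, and the same two weight matrices serve every node regardless of how many neighbors it has.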
After $K$ layers of graph networks, the demand volume of region $i$ at time $t+1$ is predicted as
where the additional term is the feature vector of region $i$ extracted from external data sources, as explained in Section 3.3. ReLU is also used in the output layer to produce non-negative demand values. Note that the above operations generalize to other tessellations of a city, such as hexagonal ke2018hexagon or irregular patterns davis2018taxi .
3.2 Temporal-Guided Embedding
Time series data have temporally recurrent patterns, such as periodicity and seasonality, and similar patterns tend to repeat. For example, distinct demand patterns recur at the same time-of-day, on the same day-of-week, and on holidays (Figure 1), reflecting people's life cycles. Existing approaches feed the model both the immediate past and long-term histories from one period or season length ago zhang2016dnn ; zhang2017deep ; yao2018modeling . The period and season lengths are also determined by manual methods such as the (partial) ACF.
Temporal-guided embedding is proposed to learn temporal contexts directly from training data and to account for recurrent patterns. We assume that the combination of immediate past data and a learned temporal context can substitute for histories from days or weeks ago in capturing temporally recurrent patterns. The temporal-guided embedding at time $t$ is defined by
where the input is a categorical variable that represents the temporal information of time $t$. For example, we can use the concatenation of four one-hot vectors, which correspond to the time-of-day, day-of-week, holiday, and day-before-holiday information of time $t$, to represent the temporal information of demand. A fully connected layer outputs a distributed representation of the temporal information of $t$ and is trained end-to-end.
The temporal-guided embedding is concatenated to the input of the model and makes the model learn a distribution conditioned on the temporal context of the forecasting target.
where the operator is feature-wise concatenation. The temporal information of $t+1$ is available at time $t$, and the temporal-guided embedding of the forecasting target leads TGNet to extract spatiotemporal features of the input conditioned on the temporal context of $t+1$.
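The categorical input described above can be sketched as follows. The 48 half-hour slots per day, the holiday set, and the function names are assumptions for illustration; the fully connected layer that maps this vector to the embedding is omitted.

```python
from datetime import datetime, timedelta

def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec

def temporal_input(target, holidays, slots_per_day=48):
    """Concatenate four one-hot vectors describing the forecasting target time.

    target: datetime of the forecasting target; holidays: set of dates.
    """
    minutes = target.hour * 60 + target.minute
    slot = minutes * slots_per_day // (24 * 60)          # time-of-day slot
    is_holiday = target.date() in holidays
    before_holiday = (target + timedelta(days=1)).date() in holidays
    return (one_hot(slot, slots_per_day)                 # time-of-day
            + one_hot(target.weekday(), 7)               # day-of-week
            + one_hot(int(is_holiday), 2)                # holiday or not
            + one_hot(int(before_holiday), 2))           # day before a holiday or not

# Hypothetical holiday calendar with a single entry.
holidays = {datetime(2019, 1, 1).date()}
vec = temporal_input(datetime(2018, 12, 31, 8, 30), holidays)
```

Because the input is a concatenation of one-hot vectors, a fully connected layer applied to `vec` amounts to summing one learned vector per active category, which is what lets the model learn a distributed representation for each temporal context.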
Similar approaches exist that learn the conditional distribution of training images on labels mirza2014conditional or of words on positions vaswani2017attention . However, to the best of our knowledge, temporal-guided embedding is the first approach in the time series domain to learn a conditional distribution on explicitly learned temporal contexts. Note that determining periodicity and seasonality with the partial ACF is a heuristic, hand-crafted procedure; temporal-guided embedding can replace this procedure and learn temporal contexts directly, without long-term historical inputs.
3.3 Late Fusion with External Data Sources
Orthogonally to capturing complex spatiotemporal patterns in demand data, forecasting results can be improved by incorporating external data such as meteorological, traffic flow, or event information tong2017simpler ; zhang2016dnn ; zhang2017deep ; yao2018deep ; yao2018modeling . In this paper, we do not use external data sources to improve our results and focus only on extracting complex spatiotemporal features effectively. Nevertheless, we explain how our model architecture can incorporate data from other domains.
As an example, past drop-off volumes are used to improve demand forecasting results, because drop-offs in a region may turn into future demand yao2018modeling ; vahedian2019predicting . Feature vectors of drop-off patterns are extracted by graph networks in the same manner and concatenated to the features from demand (Equation 7). This type of late fusion is a common approach to combining heterogeneous data sources from multiple modalities zadeh2017tensor ; anderson2018bottom ; ku2018joint . Although we do not use other external data, we expect that various external data can be incorporated in this manner to improve the results in future work.
4 Experiments
4.1 Experimental Setting
Datasets Three real-world datasets (NYC-bike nycb2017data , NYC-taxi taxi2017tlc , and SEO-taxi) are used for evaluation. Details of the datasets are described in the supplementary material. The first two datasets are publicly available, and SEO-taxi is private.
We use two evaluation metrics to measure the accuracy of forecasting results: mean absolute percentage error (MAPE) and root mean squared error (RMSE). For a fair comparison, we follow the same evaluation method as yao2018modeling ; yao2018deep and exclude samples whose value is below a fixed threshold. This is common practice in industry and academia, because real-world applications have little interest in such low-volume samples. In all tables in this paper, mean performances over ten repeats are reported, and bold indicates statistical significance. Standard deviations are given in the supplementary material.
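The two metrics and the low-volume filter can be sketched as below. The threshold of 10 in the example and the function names are hypothetical and serve only to show the mechanics.

```python
import math

def filter_low_volume(y_true, y_pred, threshold):
    """Keep only samples whose true value is at least the threshold."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t >= threshold]
    return [t for t, _ in pairs], [p for _, p in pairs]

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values; the threshold is a placeholder, not the one used in the paper.
y_true = [2, 20, 40]
y_pred = [9, 18, 44]
kept_true, kept_pred = filter_low_volume(y_true, y_pred, threshold=10)
scores = (rmse(kept_true, kept_pred), mape(kept_true, kept_pred))
```

Filtering before computing MAPE also keeps the percentage error from being dominated by samples whose true value is close to zero.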
Implementations Demand and drop-off volumes in the previous 8 and 16 time intervals (4 and 8 hours), respectively, are used to forecast demand in the next time interval (30 minutes). NYC and Seoul are divided into 10 × 20 and 50 × 50 grids, respectively, considering the area of each city. The area of each region is about 700 m × 700 m. Batch normalization ioffe2015batch and dropout srivastava2014dropout are applied. The implementation, with TensorFlow abadi2016tensorflow and Keras 2.2 chollet2015keras , is available at https://github.com/LeeDoYup/TGNet-keras.
Training We use two types of loss to train TGNet. We first train with L2 loss (mean squared error) and then switch to L1 loss (mean absolute error). L1 loss is more robust to anomalies in real time series lai2018modeling , but in our experiments optimization with L1 loss alone was not stable. Initial training with L2 loss stabilizes the subsequent optimization with L1 loss. TGNet is trained with the Adam optimizer kingma2014adam using a learning rate and decay rate of 0.01. We use 20 % of the samples as test data. Of the training data, 20 % are used for validation, and early stopping is applied to select the final model. That is ( numbers of samples in NYC (SEO) are used for training/valid/test. Two Tesla P40 GPUs are used, and training takes about 2 (26) hours on the NYC (SEO) datasets.
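The split proportions and the two-stage loss schedule can be sketched as follows. The point at which the loss is switched is not stated in the text, so `switch_epoch` is a hypothetical parameter, and the function names are illustrative.

```python
def split_counts(n_samples, test_frac=0.2, val_frac=0.2):
    """Return (train, validation, test) sample counts.

    The validation set is taken from what remains after the test split.
    """
    n_test = round(n_samples * test_frac)
    n_val = round((n_samples - n_test) * val_frac)
    return n_samples - n_test - n_val, n_val, n_test

def loss_name(epoch, switch_epoch):
    """L2 loss for the initial epochs, then L1 loss (switch point assumed)."""
    return "mse" if epoch < switch_epoch else "mae"

train, val, test = split_counts(1000)
```

With these fractions, the samples are divided roughly 64 % / 16 % / 20 % into training, validation, and test sets.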
We compare TGNet with statistical and state-of-the-art deep learning methods for spatiotemporal data: ARIMA, XGBoost chen2016xgboost , STResNet zhang2017deep , DMVST-Net yao2018deep , and STDN yao2018modeling .
[Table 1. Columns: Method; NYC-bike; NYC-taxi; SEO-taxi; (NYC) # of parameters.]
4.2 Forecasting Performances of TGNet
The forecasting accuracies of TGNet and the compared models, computed over ten repeats on the NYC-bike, NYC-taxi, and SEO-taxi datasets, are reported in Table 1. In evaluation, samples with demand volume less than 11 were eliminated.
The traditional time series model, ARIMA, shows the lowest accuracy on all datasets, because it cannot account for spatial correlations and complex non-linearity in demand patterns. XGBoost performs better than this statistical time series model.
Recent deep learning models outperform ARIMA and XGBoost by capturing complex spatiotemporal correlations in the datasets. The most recent model, STDN, shows the best performance on the NYC-bike and NYC-taxi datasets among the baseline methods, because it utilizes various modules and data, such as local CNNs, LSTM, periodic and seasonal inputs, periodically shifted attention, and external data such as weather and traffic in/out flow.
Several results indicate that TGNet has an effective architecture for predicting future demand. First, TGNet has about 20 times fewer trainable parameters (475,543) than STDN (9,446,274), yet shows better results than the other deep learning models. Graph networks with permutation-invariant aggregation and temporal-guided embedding, used in place of CNNs and long-term histories, reduce the number of trainable parameters.
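The factor of about 20 follows directly from the two parameter counts quoted above; as a worked check:

```latex
\frac{9{,}446{,}274}{475{,}543} \approx 19.9
```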
Second, TGNet does not use external data sources, whereas the other compared models use meteorological, traffic flow, or event information. The performance of TGNet on NYC-bike is not better than that of STDN, but the difference is not significant. Our results are nevertheless promising when we consider the number of parameters and the fact that bike demand patterns depend heavily on weather conditions.
Third, TGNet can successfully learn from a large-scale dataset (SEO-taxi). The SEO-taxi dataset has 12.5 times more regions and a 3 times longer period than the NYC datasets. TGNet can learn the SEO-taxi dataset simply by increasing the number of hidden neurons in each layer. In contrast, despite our best efforts, we failed to train the other deep learning baselines on the SEO-taxi dataset. A simple and efficient model architecture is compelling because it generalizes across dataset scales.
We conducted a series of ablation studies on the effectiveness of the proposed methods. The baseline model with graph networks outperforms a recent deep learning model (DMVST-Net), which uses convolutional and recurrent layers and graph embedding. STDN adds long-term histories and an attention method to DMVST-Net to handle temporally recurrent patterns. Considering the performance gain of STDN over DMVST-Net and the accompanying increase in trainable parameters, temporal-guided embedding is a more efficient way to improve forecasting results than long-term histories. Temporal-guided embedding is simple to implement, yet gives performance gains on all datasets. Adding drop-off volumes can further improve forecasting results. These results show that our proposed methods are effective at capturing complex spatiotemporal patterns in demand.
4.3 Effectiveness of Permutation-Invariant Operation
Convolutional and graph networks are compared to show the effectiveness of the permutation-invariant operation.
TGNet-C-A Fully convolutional networks noh2015learning , which use only convolution-ReLU layers instead of graph networks. The number of filters is the same as the number of neurons in the proposed model.
TGNet-C-B We substitute the aggregation operation (Equation 4) with a 3 × 3 convolution and keep the other operations of the graph networks the same.
Graph networks show better forecasting results than the variants with convolutional layers (Table 2). The results are notable, because the number of trainable parameters of TGNet is about 1.5 to 2 times smaller than that of the others. We conclude that a permutation-invariant operation can model the spatial features of each region efficiently, and that the proximity of a neighbor can be more important than its direction.
[Table 2. Columns: Method; NYC-bike; NYC-taxi; (NYC) # of parameters.]
4.4 Forecasting when Atypical Events Occur
In practice, short-term demand forecasting is especially important when atypical events occur, in which demand is abnormally larger than usual. Poor performance in these situations causes a critical supply-demand mismatch, which can lead to service failure in ride-hailing services. Abnormally high values are non-repetitive, follow different patterns from the majority of samples vahedian2019predicting , and are hard to learn, because they rarely appear during training.
We set different thresholds for each region according to time-of-day, weekend, and holiday information from the training data, in order to account for different spatial and temporal contexts. We then select atypical samples from the test data, namely those that exceed the threshold of their region, and investigate the forecasting results. Samples above the top 1 % and top 5 % thresholds in each region are selected; they are larger than the mean of each region by more than 10 times its standard deviation. We also confirm that most samples from atypical events, such as concerts, festivals, or academic conferences, are included in our atypical samples.
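The selection procedure above can be sketched as follows. Reading "top 1 %" as a nearest-rank 99th percentile, the grouping key, and the function names are assumptions for illustration.

```python
def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * q // 100))  # ceiling of n * q / 100
    return ordered[rank - 1]

def fit_thresholds(train_samples, q=99):
    """Per-(region, context) thresholds computed from training data.

    train_samples: iterable of (region, context, demand); the context is an
    assumed key such as (time_of_day, is_weekend, is_holiday).
    """
    groups = {}
    for region, context, demand in train_samples:
        groups.setdefault((region, context), []).append(demand)
    return {key: percentile(vals, q) for key, vals in groups.items()}

def select_atypical(test_samples, thresholds):
    """Test samples whose demand exceeds the threshold of their group."""
    return [s for s in test_samples
            if (s[0], s[1]) in thresholds and s[2] > thresholds[(s[0], s[1])]]

# Hypothetical data: one region and context with demands 1..100 in training.
train = [("A", "weekday_am", d) for d in range(1, 101)]
test = [("A", "weekday_am", 100), ("A", "weekday_am", 50), ("B", "weekday_am", 500)]
thresholds = fit_thresholds(train, q=99)
atypical = select_atypical(test, thresholds)
```

Computing thresholds only from training data keeps the selection independent of the test set, and grouping by context prevents ordinary rush-hour peaks from being flagged as atypical.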
Although recent deep learning models (STResNet and STDN) show strong results on average (Table 1), they do not learn minority samples with extremely large values. We infer that an excessive number of parameters can result in overfitting. Atypical samples are hard for an overfitted model to predict, because they scarcely appear during training. The results of TGNet are acceptable, although they are somewhat inferior to its average performance (Table 1). Drop-off volumes are helpful in atypical event situations, because a past surge of drop-offs can be converted into future demand vahedian2019predicting . Forecasting models need to be evaluated not only on average performance but also on unusual situations in order to study the robustness of a model in practice.
| Method | RMSE | | MAPE (%) | | RMSE | | MAPE (%) | |
| | top 1 % | top 5 % | top 1 % | top 5 % | top 1 % | top 5 % | top 1 % | top 5 % |
| GN + TGE | 20.79 | 20.03 | 27.51 | 28.36 | 37.18 | 28.96 | 45.86 | 46.78 |
4.5 Visualization of Temporal-Guided Embedding
We visualize the temporal-guided embedding to investigate whether it learns temporal contexts as intended. Because the input of the temporal-guided embedding is a categorical variable, its dimensions are mutually independent. For example, 5 a.m. and 6 a.m. are independent, because the input is a one-hot vector. We expect the temporal-guided embedding to yield a meaningful visualization with distributed representations of temporal contexts and to explain temporal patterns in the training data.
We find three notable pieces of evidence that temporal-guided embedding actually learns and extracts temporal contexts from data. First, the embeddings of adjacent times of day are located next to each other (see the supplementary material). This matches the basic notion that events at adjacent times are strongly correlated. Second, the time-of-day vectors are clustered according to temporal contexts based on time-of-day patterns. They fall into four clusters: commute time, daytime, evening, and night (Figure 2, left). This division is analogous to the way people understand daily demand patterns based on a common lifestyle; temporal-guided embedding learns temporal contexts with different patterns depending on time. Last, temporal-guided embedding learns the concepts of day-of-week and holiday. The weekday and weekend vectors are clearly separated. If a weekday is also a holiday, its embedding lies close to the weekend vectors, because holiday and weekend demand patterns are similar (Figure 2, right).
For the NYC datasets, the insights from the visualization are not as clear-cut as those for SEO-taxi, but we found similar patterns. The embedding vectors of adjacent times of day are located nearby. The temporal-guided embeddings of working days and of other days (weekends and holidays) are also clearly separated (see the supplementary material). We assume that the NYC datasets are too small for the model to learn an intuitive representation of the temporal patterns. In summary, temporal-guided embedding not only improves forecasting results but also provides an interpretable visualization of temporal contexts based on demand patterns when trained on a large-scale dataset.
5 Related Work
Many predictive models have been used to learn complex spatiotemporal patterns in demand data. ARIMA is used to predict future traffic conditions and exploit temporal patterns in the data pan2012utilizing ; moreira2013predicting . A latent space model deng2016latent or k-nearest neighbors (kNN) cheng2018short is applied to capture spatial correlations between adjacent regions for short-term traffic forecasting. While these approaches show promising progress in traffic forecasting, they have limited capability to capture complex spatiotemporal patterns.
Most recent models adopt convolutional neural networks (CNNs) lecun1995convolutional and long short-term memory (LSTM) hochreiter1997long to extract spatial and temporal features, respectively. First, they form a grid over a region and assign a quantity of interest as a pixel value to turn the geographical data into a 2D image. For example, ma2017learning turned the traffic speed of each region into a 2D image and forecast future traffic speed. Feature maps are then extracted by a stack of convolutional layers, considering local relationships between adjacent regions zhang2016dnn ; zhang2017deep ; yao2018modeling ; yao2018deep . To capture autoregressive sequential dependencies, various models use LSTM layers to forecast traffic amounts and conditions zhao2017lstm ; cheng2017deeptransport , taxi demand ke2017short ; zhou2018predicting ; yao2018deep ; yao2018modeling , or traffic speeds yu2017spatiotemporal .
Taxi demand patterns recur according to time-of-day, day-of-week, and whether the day is a holiday. Some approaches utilize the long-term history of demand volumes from days or weeks ago to improve forecasting performance. zhang2016dnn ; zhang2017deep use three convolutional models to extract features of temporal closeness, period, and seasonal trend from samples of the immediate past, days ago, and weeks ago relative to the forecasting target. yao2018modeling also uses samples from days or weeks ago as input to LSTM layers and combines them with a periodically shifted attention mechanism. Long-term histories are thus used to capture temporally recurrent patterns. However, they can increase model size and result in overfitting. Our model does not use long-term histories; instead, it learns the temporal context of the forecasting target time and a conditional distribution given the immediate past samples and the target temporal context.
Recent studies successfully apply neural networks to graphs. GraphSAGE hamilton2017inductive learns a function that generates the embedding of a node by sampling and aggregating features from its neighborhood. Message passing neural networks (MPNNs) gilmer2017neural define message and update functions and unify many previous studies on graph domains duvenaud2015convolutional ; li2015gated ; battaglia2016interaction ; kearnes2016molecular ; schutt2017quantum ; kipf2016semi . Attention mechanisms are also used to define relationships between entities velickovic2018graph ; wang2018non . Graph neural networks extract hidden representations of each node from the messages of its neighborhood, and the features are invariant to the ordering of the neighbors. We use graph neural networks to make the spatial features of a target region invariant to permutations of its adjacent regions. This approach focuses on the characteristics of the neighboring areas, not on their relative locations, such as left or right.
Ingesting external data sources that may be related to future demand can improve forecasting performance. For example, meteorological and event information zhang2016dnn ; zhang2017deep ; yao2018deep or traffic (in/out) flows yao2018modeling can be used to improve forecasting results. However, this improvement is orthogonal to modeling complex spatiotemporal dependencies in the input data. We focus only on proposing an efficient model that learns complex spatiotemporal patterns and expect that various data can be combined with our model in the future.
6 Conclusion
We propose the temporal-guided network (TGNet), a graph neural network with temporal-guided embedding. TGNet uses the permutation-invariant operation of graph networks and temporal-guided embedding, instead of CNNs and long-term histories, to extract spatial and temporal features efficiently. TGNet has about 20 times fewer trainable parameters than a recent state-of-the-art model yao2018modeling and shows competitive or better results on three real-world datasets. We also show that TGNet with a permutation-invariant operation performs better and has fewer parameters than models with convolution. Our results are notable because external data sources such as weather, traffic flow, or event information are not used in this study. Temporal-guided embedding can directly learn temporal contexts from training data and yields interpretable visualizations. TGNet also gives stable forecasting results on atypical samples with extremely large values, on which other deep learning models perform poorly. Atypical event situations are practically important and deserve more attention. Temporal-guided embedding can also be utilized to capture temporally recurrent patterns in various time series data, and we will show its generalizability in future work. We expect that our model can serve as a strong baseline for forecasting spatiotemporal data in various applications.
References
-  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
-  Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
-  Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pages 4502–4510, 2016.
-  NYC Bike. https://www.citibikenyc.com/system-data. Accessed: 2019-05-22.
-  Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016.
-  Shifen Cheng and Feng Lu. Short-term traffic forecasting: A dynamic st-knn model considering spatial heterogeneity and temporal non-stationarity. 2018.
-  Xingyi Cheng, Ruiqing Zhang, Jie Zhou, and Wei Xu. Deeptransport: Learning spatial-temporal dependency for traffic condition forecasting. arXiv preprint arXiv:1709.09585, 2017.
-  François Chollet et al. Keras, 2015.
-  Neema Davis, Gaurav Raina, and Krishna Jagannathan. Taxi demand forecasting: A hedge-based tessellation strategy for improved accuracy. IEEE Transactions on Intelligent Transportation Systems, (99):1–12, 2018.
-  Dingxiong Deng, Cyrus Shahabi, Ugur Demiryurek, Linhong Zhu, Rose Yu, and Yan Liu. Latent space model for road networks to predict time-varying traffic. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1525–1534. ACM, 2016.
-  David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
-  Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
-  Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1263–1272. JMLR. org, 2017.
-  Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  Jintao Ke, Hai Yang, Hongyu Zheng, Xiqun Chen, Yitian Jia, Pinghua Gong, and Jieping Ye. Hexagon-based convolutional neural network for supply-demand forecasting of ride-sourcing services. IEEE Transactions on Intelligent Transportation Systems, 2018.
-  Jintao Ke, Hongyu Zheng, Hai Yang, and Xiqun Michael Chen. Short-term forecasting of passenger demand under on-demand ride services: A spatio-temporal deep learning approach. Transportation Research Part C: Emerging Technologies, 85:591–608, 2017.
-  Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
-  Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8. IEEE, 2018.
-  Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long-and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 95–104. ACM, 2018.
-  Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
-  Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
-  Xiaolei Ma, Zhuang Dai, Zhengbing He, Jihui Ma, Yong Wang, and Yunpeng Wang. Learning traffic as images: a deep convolutional neural network for large-scale transportation network speed prediction. Sensors, 17(4):818, 2017.
-  Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
-  Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
-  Luis Moreira-Matias, Joao Gama, Michel Ferreira, Joao Mendes-Moreira, and Luis Damas. Predicting taxi–passenger demand using streaming data. IEEE Transactions on Intelligent Transportation Systems, 14(3):1393–1402, 2013.
-  Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
-  Bei Pan, Ugur Demiryurek, and Cyrus Shahabi. Utilizing real-world transportation data for accurate traffic prediction. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 595–604. IEEE, 2012.
-  Yao Qin, Dongjin Song, Haifeng Cheng, Wei Cheng, Guofei Jiang, and Garrison W Cottrell. A dual-stage attention-based recurrent neural network for time series prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2627–2633. AAAI Press, 2017.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
-  Kristof T Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R Müller, and Alexandre Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature communications, 8:13890, 2017.
-  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
-  NYC Taxi, Limousine Commission, et al. Tlc trip record data. Accessed October, 12, 2017.
-  R Core Team et al. R: A language and environment for statistical computing. 2013.
-  Yongxin Tong, Yuqiang Chen, Zimu Zhou, Lei Chen, Jie Wang, Qiang Yang, Jieping Ye, and Weifeng Lv. The simpler the better: a unified approach to predicting original taxi demands based on large-scale online platforms. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1653–1662. ACM, 2017.
-  Amin Vahedian, Xun Zhou, Ling Tong, W Nick Street, and Yanhua Li. Predicting urban dispersal events: A two-stage framework through deep survival analysis on mobility data. arXiv preprint arXiv:1905.01281, 2019.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
-  Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
-  Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
-  Huaxiu Yao, Xianfeng Tang, Hua Wei, Guanjie Zheng, Yanwei Yu, and Zhenhui Li. Modeling spatial-temporal dynamics for traffic prediction. arXiv preprint arXiv:1803.01254, 2018.
-  Huaxiu Yao, Fei Wu, Jintao Ke, Xianfeng Tang, Yitian Jia, Siyu Lu, Pinghua Gong, Jieping Ye, and Zhenhui Li. Deep multi-view spatial-temporal network for taxi demand prediction. In AAAI, pages 2588–2595, 2018.
-  Haiyang Yu, Zhihai Wu, Shuqin Wang, Yunpeng Wang, and Xiaolei Ma. Spatiotemporal recurrent convolutional networks for traffic prediction in transportation networks. Sensors, 17(7):1501, 2017.
-  Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250, 2017.
-  Junbo Zhang, Yu Zheng, and Dekang Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In AAAI, pages 1655–1661, 2017.
-  Junbo Zhang, Yu Zheng, Dekang Qi, Ruiyuan Li, and Xiuwen Yi. Dnn-based prediction model for spatio-temporal data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, page 92. ACM, 2016.
-  Zheng Zhao, Weihai Chen, Xingming Wu, Peter CY Chen, and Jingmeng Liu. Lstm network: a deep learning approach for short-term traffic forecast. IET Intelligent Transport Systems, 11(2):68–75, 2017.
-  Xian Zhou, Yanyan Shen, Yanmin Zhu, and Linpeng Huang. Predicting multi-step citywide passenger demands using attention-based neural networks. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 736–744. ACM, 2018.
Appendix A Implementation
a.1 Implementation Details
Our code is based on TensorFlow 1.7.0 and uses the high-level Keras 2.2.2 API. The source code, including README.txt, is available in the supplementary material. There are six hidden layers before the fully-connected layer (equation (7)), and two layers are used for taxi drop-off volumes. We use a 2D average pooling layer with a 2x2 kernel after the GN 1 layer for computational efficiency. Skip connections are used to alleviate the vanishing gradient problem.
The number of hidden neurons in the first layer (NF in Figure 3) is 32 for the NYC datasets. We double this number (to 64) for SEO-taxi, because the SEO-taxi dataset is larger in scale than the NYC datasets. Batch normalization and dropout are used in each layer.
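To make the effect of the 2x2 average pooling concrete, the following minimal sketch pools a toy 10×20 grid in plain Python. It assumes non-overlapping windows (stride 2), which the text does not state explicitly, and the function name `average_pool_2x2` is ours, not part of the released code.

```python
def average_pool_2x2(grid):
    """Average non-overlapping 2x2 blocks of a 2D list (stride 2 is an assumption)."""
    height, width = len(grid), len(grid[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


# Toy 10 x 20 grid (the NYC grid size) filled with placeholder values.
toy_grid = [[float(r * 20 + c) for c in range(20)] for r in range(10)]
pooled = average_pool_2x2(toy_grid)
print(len(pooled), len(pooled[0]))  # 5 10
```

Under this assumption the number of cells processed by the later layers drops from 200 to 50, which is where the computational saving comes from.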
a.2 Methods for Comparison
We compare the performance of TGNet with existing models for demand forecasting from spatiotemporal data, which we describe in this section. We follow the hyperparameters in the original papers, but adjust the learning rates for training.
ARIMA: Autoregressive Integrated Moving Average (ARIMA) is a traditional model for non-stationary time series. We use the auto ARIMA function in R to fit each dataset.
XGBoost: XGBoost is a popular tool for training boosted trees. The number of trees is 500, the maximum depth is 4, and the subsample rate is 0.6.
ST-ResNet: ST-ResNet is a CNN-based model with residual blocks. It uses various past time steps as temporal closeness, periodic, and seasonal inputs to capture temporally recurrent patterns. ResNet is used to extract hidden representations of each input (an image at each time step), and all feature maps are concatenated before predicting future demand.
DMVST-Net: DMVST-Net models spatial, temporal, and semantic views through a local CNN, an LSTM, and graph embedding. It does not forecast the demands of all target regions at once, but predicts the future demand of each region independently. After convolutional layers extract spatial features of the input image at each time step, the feature maps are fed into LSTM layers to extract temporal features.
STDN: STDN is based on DMVST-Net and adds several components to improve forecasting results. Temporal closeness, periodic, and seasonal inputs are used to model temporally recurrent patterns, and a periodically shifted attention mechanism is proposed to deal with long sequences. Traffic flow and taxi drop-off volumes are also used through a flow gating mechanism.
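For the XGBoost baseline, the three settings quoted above can be written as a small configuration sketch. The use of the scikit-learn style `XGBRegressor` wrapper is an assumption on our part (the text does not say which interface was used), and all other parameters are left at the library defaults.

```python
# Settings quoted in the text; every other parameter keeps the library
# default, which is an assumption.
XGB_PARAMS = {
    "n_estimators": 500,  # number of trees
    "max_depth": 4,       # maximum depth of each tree
    "subsample": 0.6,     # fraction of rows sampled for each tree
}


def build_xgb_regressor():
    """Build a regressor with the quoted settings (imports xgboost lazily)."""
    from xgboost import XGBRegressor  # requires the xgboost package

    return XGBRegressor(**XGB_PARAMS)
```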
a.3 Hyperparameter Search
We used grid search over the settings below to determine the optimal hyperparameters. Bold values denote the selected ones. The number of hidden neurons in the first layer was 64 for SEO-taxi; the other hyperparameters were the same.
Learning Rate: Learning rate for Adam optimizer.
Decaying Rate: Decaying rate for Adam optimizer.
Number of Hidden Neurons: The number of convolutional filters in the first hidden layer. The filter numbers in the other layers keep the same ratio to the first layer as in the optimal setting mentioned above.
Number of Hidden Neurons for Drop-off: The number of convolutional filters for encoding drop-off volumes.
Batch Size: Mini-batch size for model updates.
Dimensionality of Temporal-Guided Embedding: The number of dimensions of the temporal-guided embedding.
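The search procedure itself can be sketched as an exhaustive loop over the Cartesian product of candidate values. The candidate values and the scoring function below are hypothetical placeholders for illustration only; they are not the values searched in this paper, and `toy_validation_error` merely stands in for training a model and measuring its validation error.

```python
from itertools import product

# Hypothetical candidate values, for illustration only.
SEARCH_SPACE = {
    "learning_rate": [1e-2, 1e-3],
    "batch_size": [64, 128],
    "embedding_dim": [4, 8],
}


def grid_search(search_space, evaluate):
    """Return (best_config, best_score) over all combinations; lower is better."""
    names = list(search_space)
    best_config, best_score = None, float("inf")
    for values in product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_error(config):
    """Stand-in for training a model and returning its validation error."""
    return (
        abs(config["learning_rate"] - 1e-3) * 100
        + abs(config["batch_size"] - 64) / 64
        + abs(config["embedding_dim"] - 8) / 8
    )


best_config, best_score = grid_search(SEARCH_SPACE, toy_validation_error)
```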
Appendix B Evaluation
We introduce the three real-world datasets used to evaluate our model, along with other evaluation details. The NYC region is divided into a 10×20 grid and the Seoul region into a 50×50 grid. A grid cell covers about 700 m × 700 m. The time interval is 30 minutes in this study.
b.1 Dataset Description
NYC-bike: The NYC-bike dataset contains the numbers of bike rentals and returns in NYC from 07/01/2016 to 08/29/2016. The first 40 days are used for training and the remaining 20 days for testing. This dataset is not about taxi demand, but we evaluate on it to show that our model generalizes as a spatiotemporal demand forecasting model. Bike demand patterns are sensitive to weather conditions. For example, on a rainy day there is hardly any demand for bikes. In this paper we do not use external data, including weather, but we show that our model performs competitively with baselines that use external data.
NYC-taxi: The NYC-taxi dataset contains taxi pick-up and drop-off records in NYC from 01/01/2015 to 03/01/2015. The first 40 days of data are used for training and the remaining 20 days for testing.
SEO-taxi: The SEO-taxi dataset contains ride request and drop-off records in Seoul, South Korea. The data are provided by an on-demand ride-hailing service provider and are private. The dataset covers 01/01/2018 to 06/30/2018; the first 4 months of data are used for training and the remainder for testing. SEO-taxi is relatively large-scale, because the area of Seoul (50×50 grid) is larger than that of NYC (10×20 grid) and the period is longer than that of NYC-bike and NYC-taxi. To the best of our efforts, the other baselines could not learn SEO-taxi under the hyperparameter settings described above.
b.2 Dataset Details
In this paper, the other deep learning models cannot learn the SEO-taxi dataset because it is large-scale, sparser, and more complex. The dataset is currently private, so we provide a statistical comparison of the three datasets instead. We will release the SEO-taxi dataset if the security issues are resolved.
Table 4 shows the statistics of the datasets: NYC-bike, NYC-taxi, and SEO-taxi. We set the time interval to 30 minutes. The NYC datasets span 60 days, i.e. 2,880 time steps, with 10×20 regions. The SEO-taxi dataset, on the other hand, spans 181 days, i.e. 8,688 time steps, with 50×50 regions. SEO-taxi has a lower mean and standard deviation than NYC-taxi, but its maximum demand volume is much larger. We consider SEO-taxi to have sparser and more complex taxi demand dynamics over a larger scale of regions and times.
|NYC-bike||10 x 20||4.52||0||14.33||0||307|
|NYC-taxi||10 x 20||38.8||0||107.71||0||1,149|
|SEO-taxi||50 x 50||-||0||18.27||0||4,491|
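The period lengths and step counts quoted above follow directly from the date ranges in B.1 and the 30-minute interval. The short check below assumes that both end dates of each range are inclusive; the names `PERIODS` and `period_stats` are ours.

```python
from datetime import date

STEPS_PER_DAY = 24 * 60 // 30  # 30-minute intervals give 48 steps per day

# Date ranges from the dataset descriptions (both ends assumed inclusive).
PERIODS = {
    "NYC-bike": (date(2016, 7, 1), date(2016, 8, 29)),
    "NYC-taxi": (date(2015, 1, 1), date(2015, 3, 1)),
    "SEO-taxi": (date(2018, 1, 1), date(2018, 6, 30)),
}


def period_stats(start, end):
    """Return (number of days, number of 30-minute time steps)."""
    days = (end - start).days + 1  # +1 because both end dates are counted
    return days, days * STEPS_PER_DAY


for name, (start, end) in PERIODS.items():
    days, steps = period_stats(start, end)
    print(name, days, steps)
```

This gives 60 days and 2,880 steps for both NYC datasets, and 181 days and 8,688 steps for SEO-taxi.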
b.3 Performance Measures
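Two evaluation metrics measure the performance of the forecasting models: mean absolute percentage error (MAPE) and root mean squared error (RMSE). In evaluation, samples with a value less than a threshold are excluded, as is common practice in industry and academia [46, 45], because they are of little interest in real-world applications. Let $\mathcal{S}$ be the set of filtered samples, with ground truth $y_i$ and prediction $\hat{y}_i$ for sample $i$; the performance measures are then given by

$$\mathrm{MAPE} = \frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}} \frac{|\hat{y}_i - y_i|}{y_i}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}} (\hat{y}_i - y_i)^2}.$$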
MAPE and RMSE tend to be sensitive to samples with small and large values, respectively. Consider extreme cases with a single sample: if the model predicts 3 when the ground truth is 1, MAPE is 200% and RMSE is 2. On the other hand, if the model predicts 500 when the ground truth is 1,000, MAPE is 50% and RMSE is 500. Because of these characteristics, we report both measures together.
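The two single-sample cases above can be reproduced with a minimal sketch of the filtered metrics. The standard definitions of MAPE and RMSE are assumed, and the `threshold` argument is a placeholder: the value 1 used in the example is hypothetical and is not the filtering value used in our evaluation.

```python
import math


def _filtered_pairs(y_true, y_pred, threshold):
    """Keep (truth, prediction) pairs whose ground truth is at least the threshold."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t >= threshold]
    if not pairs:
        raise ValueError("no samples left after filtering")
    return pairs


def mape(y_true, y_pred, threshold):
    """Mean absolute percentage error over the filtered samples, in percent."""
    pairs = _filtered_pairs(y_true, y_pred, threshold)
    return 100.0 * sum(abs(p - t) / t for t, p in pairs) / len(pairs)


def rmse(y_true, y_pred, threshold):
    """Root mean squared error over the filtered samples."""
    pairs = _filtered_pairs(y_true, y_pred, threshold)
    return math.sqrt(sum((p - t) ** 2 for t, p in pairs) / len(pairs))


# Small ground truth: MAPE is large while RMSE is small.
print(mape([1], [3], threshold=1), rmse([1], [3], threshold=1))  # 200.0 2.0
# Large ground truth: MAPE is moderate while RMSE is large.
print(mape([1000], [500], threshold=1), rmse([1000], [500], threshold=1))  # 50.0 500.0
```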
Appendix C Temporal-Guided Embedding
c.1 Input of Temporal-Guided Embedding
The input of the temporal-guided embedding is the concatenation of four one-hot vectors (time-of-day, day-of-week, holiday or not, and the day before a holiday or not), so every entry of the input vector is a 0/1 categorical variable. Detailed explanations are given in Table 5. Taking time-of-day as an example, a one-hot vector treats all time-of-day values as independent of each other, with no correlation between time-of-day vectors. However, we expect the temporal-guided embedding to learn distributed representations of temporal contexts while it learns to forecast taxi demand and to understand the characteristics of the time series.
|Feature||Dimension||Description|
|Time of Day||48||30-minute intervals|
|Day of Week||7||Monday to Sunday|
|Holiday||1||Holiday or not|
|Before Holiday||1||The day before a holiday or not|
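As a minimal sketch of how such an input vector could be assembled, the function below concatenates the four parts listed in Table 5 into a 57-dimensional 0/1 vector (48 + 7 + 1 + 1). The function name `temporal_input` and the way the holiday flags are supplied by the caller are assumptions for illustration; the ordering of the parts is also assumed.

```python
from datetime import datetime


def temporal_input(ts, is_holiday, is_before_holiday):
    """Build the 0/1 temporal information vector for the time stamp `ts`."""
    time_of_day = [0] * 48
    time_of_day[ts.hour * 2 + ts.minute // 30] = 1  # index of the 30-minute slot
    day_of_week = [0] * 7
    day_of_week[ts.weekday()] = 1  # Monday is index 0, Sunday is index 6
    return time_of_day + day_of_week + [int(is_holiday), int(is_before_holiday)]


# Monday 2018-01-01, 08:30, flagged as a holiday by the caller.
vec = temporal_input(datetime(2018, 1, 1, 8, 30), is_holiday=True, is_before_holiday=False)
print(len(vec), sum(vec))  # 57 3
```

In this example the active entries are slot 17 of the time-of-day part, the Monday entry of the day-of-week part, and the holiday flag.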
c.2 Visualization of Temporal-Guided Embeddings
We visualize the learned temporal-guided embeddings to investigate whether they are interpretable. We assumed that temporal-guided embeddings can provide meaningful insights or visualizations beyond performance gains. For visualization, we use t-SNE in scikit-learn 0.19.1. The learning rate is 1,000 and the other hyperparameters are left at their defaults.
We visualize some examples of temporal-guided embeddings of different time-of-day vectors from the SEO-taxi dataset in Figure 4. We find that the temporal-guided embedding learns to place adjacent time-of-day vectors near each other. The results agree with the human understanding of the basic concept of time, because people naturally assume that events at adjacent times are strongly correlated. This assumption is also applied in sequential modeling with recurrent layers, as a relational inductive bias of sequences. The embeddings of the remaining time intervals are available in the supplementary material.
Although the temporal-guided embedding improves forecasting results on all datasets, we cannot show insights on the NYC datasets that are as meaningful as those on SEO-taxi. That is, adjacent time-of-day vectors tend to be located near each other, but this is not evident for all time-of-day vectors (Figure 5). The working days (weekdays) and the other days (weekends and holidays) are also clearly separated in the embedding space (Figure 6). We conclude that the NYC datasets may not have enough samples to learn temporal contexts as well as SEO-taxi does, but the overall behavior of the learned temporal-guided embedding is similar to that on the large-scale SEO-taxi dataset.
c.3 Time-Series Forecasting and Temporal-Guided Embedding
In this paper, we showed that the temporal-guided embedding improves the performance of the forecasting model and lets it learn temporal contexts explicitly. The implementation of the temporal-guided embedding is simple, but it has a theoretical background. Let $x_t$ denote an observation of the time series. From ARMA to recent deep learning models, forecasting models learn an autoregressive model of $x_t$ with lag inputs $x_{t-1}, \ldots, x_{t-p}$,

$$P(x_t \mid x_{t-1}, \ldots, x_{t-p}) \qquad (13)$$
where t is the time stamp of each data sample. If a model assumes the Markov property (not our case), as an LSTM does, it becomes
TGNet does not assume a sequential model and directly learns equation (13). In general, model (13) or (14) is shared across all time stamps in the training samples, which assumes stationarity. Because of this assumption, such a model is not suitable when the time series is non-stationary and its probability distribution varies with the time stamp. Thus, preprocessing steps such as log scaling or differencing are used to make the series stationary, and neural networks can effectively make a non-stationary series stationary automatically by learning hierarchical nonlinear transformations. In this view, a deep learning model contains both a model of the probability distribution of a stationary process (the output layer) and preprocessing modules (the hidden layers) that make the input stationary.
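The fixed-length lag formulation and the differencing step mentioned above can be sketched in a few lines. Here `p` denotes the number of lag inputs and `x_t` the observation at step t; this notation and the function names are assumptions made for the illustration, and the toy series is made up.

```python
def make_lag_samples(series, p):
    """Return (lags, target) pairs with lags = [x_{t-p}, ..., x_{t-1}] and target = x_t."""
    return [(series[t - p:t], series[t]) for t in range(p, len(series))]


def difference(series):
    """First-order differencing: d_t = x_t - x_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]


def undo_difference(first_value, diffs):
    """Invert first-order differencing given the first observation."""
    restored = [first_value]
    for d in diffs:
        restored.append(restored[-1] + d)
    return restored


series = [3, 5, 4, 8, 9, 7]  # toy demand values
samples = make_lag_samples(series, p=3)
diffs = difference(series)
print(samples[0])  # ([3, 5, 4], 8)
print(diffs)       # [2, -1, 4, 1, -2]
```

Every sample produced this way has the same shape regardless of its time stamp, so a model trained on these pairs has no direct access to the time at which the target occurs.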
Note that we can rewrite (13) with random variables of a fixed-length ordered sequence
where and . That is, above equation (13) is a special case when .
Note that equation (15) does not contain any temporal information about the specific time t, but only models the probability distribution of the input ordered sequence. That is, the model combines input values to predict future demand without explicit knowledge or understanding of temporal contexts. However, this approach to modeling time series is quite different from how humans understand time series, because people learn the temporal contexts of data from an explicit understanding of time-of-day, day-of-week, or holidays.
The temporal-guided embedding makes the model predict a distribution conditioned on the temporal contexts of the forecasting target
where is learned temporal contexts and is temporal information vector (time-of-day, day-of-week, holiday, the day before holiday) of random variable . The temporal-guided embedding explicitly learns the temporal contexts of the forecasting target and makes the model extract hidden representations of the input sequence conditioned on the embedding. We replace the input of long-term histories from days or weeks ago with the temporal-guided embedding, and show that the embeddings improve forecasting performance and have interpretable visualizations. We expect that the temporal-guided embedding can be used for general time-series modeling in future work.
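To illustrate what conditioning on the temporal information vector means at the level of model inputs, the sketch below maps a 0/1 temporal vector to a dense embedding with a linear layer and appends it to the lag inputs. This is a simplified illustration under our own assumptions: the weights are fixed toy values instead of learned parameters, and plain concatenation stands in for the way the embedding enters the network described in the main text.

```python
def embed(temporal_vector, weights):
    """Linear embedding: sum of the weight rows selected by the non-zero entries."""
    embed_dim = len(weights[0])
    out = [0.0] * embed_dim
    for value, row in zip(temporal_vector, weights):
        if value:
            for j in range(embed_dim):
                out[j] += value * row[j]
    return out


def conditioned_input(lags, temporal_vector, weights):
    """Concatenate the lag inputs with the embedding of the target's temporal context."""
    return list(lags) + embed(temporal_vector, weights)


# Toy example: a 3-dimensional temporal vector and a 2-dimensional embedding.
toy_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
features = conditioned_input([3, 5, 4], [1, 0, 1], toy_weights)
print(features)  # [3, 5, 4, 1.5, 0.5]
```

Two samples with identical lag inputs but different target times now produce different feature vectors, which is the property that the lag-only formulation lacks.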
Appendix D Atypical Events and Drop-off Volumes
We evaluate forecasting performance on atypical event samples, which have extremely large demand volumes, and show that drop-off volumes can improve forecasting results. In fact, we found that the patterns of taxi drop-offs in a region were different before atypical events occurred (Figure 7). A sudden surge of pick-up requests is observed after atypical events, such as a music festival, end, and drop-off volumes are much larger than usual before the atypical events start. Many people rush into the region to participate in the events, so the drop-off volumes can indicate potential future demand.
Appendix E Forecasting Performance Details
The standard deviations over ten repeats are reported in Table 6. We conclude that our proposed model is significantly competitive with the other baseline models. In the case of NYC-bike, our model is not significantly better than STDN, but there is no statistically significant difference between them. Considering that TGNet has about 20 times fewer parameters than STDN and that bike demand is sensitive to weather conditions, we consider our results on NYC-bike promising.
|Model||NYC-bike RMSE||NYC-bike MAPE (%)||NYC-taxi RMSE||NYC-taxi MAPE (%)||SEO-taxi RMSE||SEO-taxi MAPE (%)|
|ST-ResNet||9.80 ± 0.12||25.06 ± 0.36||26.23 ± 0.33||21.13 ± 0.63||-||-|
|DMVST-Net||9.14 ± 0.13||22.20 ± 0.33||25.74 ± 0.26||17.38 ± 0.46||-||-|
|STDN||8.85 ± 0.11||21.84 ± 0.36||24.10 ± 0.25||16.30 ± 0.23||-||-|
|GN||9.09 ± 0.05||22.51 ± 0.16||23.75 ± 0.30||15.43 ± 0.15||28.10||37.31|
|GN + TGE||8.88 ± 0.09||22.37 ± 0.06||22.81 ± 0.07||14.99 ± 0.07||25.96||35.67|
|TGNet||8.84 ± 0.07||21.92 ± 0.13||22.75 ± 0.14||14.83 ± 0.06||25.35||35.72|
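One way to check a statement such as "no statistically significant difference" from the reported means and standard deviations is Welch's t-test on summary statistics. The sketch below is only an illustration: it assumes that both models were run ten times, reads the first value column of Table 6 as the NYC-bike RMSE, and does not claim to be the test used to draw the conclusion above.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and approximate degrees of freedom from summary statistics."""
    var_a = std_a ** 2 / n_a
    var_b = std_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, dof


# STDN (8.85 with std 0.11) versus TGNet (8.84 with std 0.07), ten repeats each (assumed).
t_stat, dof = welch_t(8.85, 0.11, 10, 8.84, 0.07, 10)
print(round(t_stat, 3), round(dof, 1))
```

With these inputs the statistic is roughly 0.24 at about 15 degrees of freedom, far below the two-sided 5% critical value of about 2.13, which is consistent with the absence of a significant difference on this column.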