Cities are the keystone of modern human living, and people constantly migrate from rural areas to urban areas as urbanization proceeds. For instance, Delhi, the largest city in India, has a total of 29.4 million residents (http://worldpopulationreview.com/world-cities/). Such a huge population brings great challenges to urban management, especially traffic management. To address this challenge, intelligent transportation systems (ITS) have been extensively studied for decades and have emerged as an effective way of improving the efficiency of urban transportation. As a crucial component of ITS, crowd flow prediction [3, 4, 5] has recently attracted widespread research interest in both the academic and industrial communities, due to its huge potential in many real-world applications (e.g., intelligent traffic diversion and travel optimization).
In this paper, we aim to forecast the future crowd flow in a city from the historical mobility data of residents. Nowadays, we live in an era where ubiquitous digital devices broadcast rich information about human mobility in real time and at a high rate, which exponentially increases the availability of large-scale mobility data (e.g., GPS signals or mobile phone signals). How to utilize these mobility data to predict crowd flow is still an open problem. In the literature, numerous methods applied time series models (e.g., Auto-Regressive Integrated Moving Average (ARIMA) and Kalman filtering) to predict traffic flow at each individual location separately. Subsequently, some studies incorporated spatial information to conduct prediction [9, 10]. However, these traditional models cannot fully capture the complex spatial-temporal dependencies of crowd flow, and this task is still far from being well solved.
Recently, notable successes have been achieved in citywide crowd flow prediction based on deep neural networks coupled with certain spatial-temporal priors [11, 6, 12, 13]. In these works, the studied city is partitioned into a grid map based on the longitude and latitude, as shown in Fig. 1. The historical crowd flow maps generated from mobility data are fed into convolutional neural networks to forecast the future crowd flow. Nevertheless, several challenges still limit the performance of crowd flow analysis in complex scenarios. First, crowd flow data can vary greatly in temporal sequences, and capturing such dynamic variations is non-trivial. Second, the spatial dependencies between locations are not strictly stationary, and the importance of a specific region may change from time to time. Third, some periodic laws (e.g., traffic flow changing suddenly due to rush hours) and external factors (e.g., a sudden rain) can greatly affect the situation of crowd flow, which increases the difficulty of learning crowd flow representations from data.
To address the above issues, we propose a novel spatial-temporal neural network, called Attentive Crowd Flow Machine (ACFM), to adaptively exploit the diverse factors that affect crowd flow evolution and, at the same time, produce the crowd flow estimation map in an end-to-end manner. The attention mechanism embedded in ACFM is designed to automatically discover the regions with primary impacts on the future flow prediction and simultaneously adjust the impacts of different regions with different weights at each time step. Specifically, our ACFM comprises two progressive ConvLSTM units. The first one takes as input i) the original crowd flow features at each moment and ii) the memorized representations of previous moments, to compute the attentional weights. The second ConvLSTM unit dynamically adjusts the spatial dependencies with the computed attentional map and generates a superior spatial-temporal feature representation.
The proposed ACFM has the following three appealing properties. First, it can effectively incorporate spatial-temporal information into feature representations and can flexibly compose solutions for crowd flow prediction with different types of input data. Second, by integrating a deep attention mechanism [15, 16], ACFM adaptively learns the weights of each spatial location at each time step, which allows the model to dynamically perceive the impact of a given area at a given moment on the future traffic flow. Third, as a general and differentiable module, our ACFM can be effectively combined with various network architectures for end-to-end training and can also be applied to various traffic prediction tasks.
Based on the proposed ACFM, we further develop a deep architecture for forecasting citywide short-term crowd flow. Specifically, this customized framework consists of four components: i) a normal feature extraction module, ii) a sequential representation learning module, iii) a periodic representation learning module and iv) a temporally-varying fusion module. The middle two components are implemented by two parallel ACFMs for contextual dependency modeling at different temporal scales, while the temporally-varying fusion module is proposed to adaptively merge the two separate temporal representations for crowd flow prediction. Finally, we ameliorate this framework to predict long-term crowd flow with an extra LSTM prediction network.
In summary, the contributions of this work are three-fold:
We propose a novel neural network called Attentive Crowd Flow Machine (ACFM), which incorporates two ConvLSTM units with an attention mechanism to infer the evolution trend of crowd flow via dynamic spatial-temporal feature representation learning.
We integrate the proposed ACFM in a customized deep framework for citywide crowd flow prediction, which effectively incorporates the sequential and periodic dependencies with a temporally-varying fusion module.
Extensive experiments on two public benchmarks of crowd flow prediction demonstrate that our approach outperforms existing state-of-the-art methods.
A preliminary version of this work was published in . In this work, we inherit the idea of dynamically learning the spatial-temporal representations and provide more details of the proposed method. Moreover, we ameliorate the customized framework to forecast long-term crowd flow. Further, we conduct a more comprehensive ablation study of our method and present more comparisons with state-of-the-art models under different settings (e.g., weekday, weekend, day and night). Finally, we extend the proposed method to forecast passenger pickup/dropoff demands and show that our method generalizes to various traffic prediction tasks.
The rest of this paper is organized as follows. First, we review some related works of crowd flow analysis in Section II and provide a preliminary of this task in Section III. Then, we introduce the proposed ACFM in Section IV and develop two unified frameworks to forecast short-term/long-term crowd flow in Section V. Extensive evaluation and comparisons are conducted in Section VI. Finally, we conclude this paper in Section VII.
II Related Work
II-A Crowd Flow Analysis
As a crucial task in ITS, crowd flow analysis has been studied for decades [18, 19] because of its wide applications in city traffic management and public safety monitoring. Traditional approaches usually used time series models (e.g., ARIMA, Kalman filtering and their variants) to forecast crowd flow [7, 20, 21]. However, most of these earlier methods modeled the evolution of crowd flow at each individual location separately and cannot well capture the complex spatial-temporal dependencies.
Recently, deep learning methods have been widely used in various traffic-related tasks [22, 23, 24, 25, 26]. Inspired by these works, many researchers have attempted to address crowd flow prediction with deep neural networks. For instance, Zhang et al. developed a deep learning based framework to leverage temporal information of various scales (i.e., temporal closeness, period and seasonal trend) for citywide crowd flow prediction. Xu et al. designed a cascade multiplicative unit to model the dependencies between multiple frames and applied it to forecast the future crowd flow. Zhao et al.
proposed a unified traffic forecasting model based on long short-term memory networks for short-term crowd flow forecasting. Geng et al. developed a multi-graph convolution network to encode the non-Euclidean pair-wise correlations among regions for spatiotemporal forecasting. More recently, to overcome the scarcity of crowd flow data, Wang et al. proposed to learn the target city model from the source city model with a region-based cross-city deep transfer learning algorithm. Yao et al. incorporated the meta-learning paradigm into networks to tackle crowd flow prediction for cities with only a short period of data collection.
II-B Temporal Sequence Modeling
Recurrent neural networks (RNNs) are a special class of artificial neural networks for temporal sequence modeling. As a variant of the RNN, the Long Short-Term Memory (LSTM) network enables RNNs to store information over extended time intervals and exploit longer-term temporal dependencies. Recently, LSTM has been widely applied to various sequential prediction tasks, such as natural language processing and speech recognition. Many works in the computer vision community [33, 34] also combined CNNs with LSTM to model spatial-temporal information and achieved substantial progress in various applications. Inspired by the success of the aforementioned works, many researchers [35, 36, 37] have attempted to address crowd flow prediction with recurrent neural networks. However, these works simply applied LSTM to extract features and cannot fully model the crowd flow evolution.
II-C Attention Mechanism
Visual attention is a fundamental aspect of the human visual system; it refers to the process by which humans focus the computational resources of their brain's visual system on specific regions of the visual field while perceiving the surrounding world. It has recently been embedded in deep convolutional networks or recurrent neural networks to adaptively attend to mission-related regions during the feed-forward pass. Moreover, it has been proven effective for many tasks, including machine translation, visual question answering and crowd counting. However, to the best of our knowledge, there are few works that incorporate an attention mechanism to address crowd flow prediction.
III Preliminary
In this section, we first describe some basic elements of crowd flow and then define the crowd flow prediction problem.
Region Partition: There are many ways to divide a city into multiple regions in terms of different granularities and semantic meanings, such as road networks and zip-code tabulation areas. In this work, we follow the previous work to partition a city into a non-overlapping grid map based on the longitude and latitude. Each rectangular grid represents a different geographical region in the city. All partitioned regions of Beijing and New York City are shown in Fig. 1. With this simple partition strategy, the raw mobility data can be easily transformed into a matrix or tensor, which is the most common input format for deep neural networks.
Crowd Flow Map: In some practical applications, we can extract a mass of crowd trajectories from GPS signals or mobile phone signals. With these crowd trajectories, we measure the number of pedestrians entering or leaving a given region at each time interval, which are called inflow and outflow in our work. For convenience, we denote the crowd flow map at the $t$-th time interval of the $d$-th day as a tensor $M_d^t \in \mathbb{R}^{2 \times h \times w}$, of which the first channel is the inflow and the second channel is the outflow. Some examples of crowd flow maps are visualized in Fig. 8.
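As an illustration, the two-channel map can be built by counting region transitions within one time interval. The sketch below assumes a simplified `(prev_cell, cur_cell)` input format, which is illustrative and not the paper's actual data pipeline:

```python
import numpy as np

def build_flow_map(transitions, grid_h, grid_w):
    """Count inflow/outflow per grid cell for one time interval.

    transitions: list of (prev_cell, cur_cell) pairs, where each cell is
    a (row, col) tuple, or None if the person was outside the city.
    Returns a (2, H, W) tensor: channel 0 = inflow, channel 1 = outflow.
    """
    flow = np.zeros((2, grid_h, grid_w), dtype=np.int64)
    for prev, cur in transitions:
        if prev != cur:
            if cur is not None:           # entered `cur` -> inflow
                flow[0, cur[0], cur[1]] += 1
            if prev is not None:          # left `prev` -> outflow
                flow[1, prev[0], prev[1]] += 1
    return flow
```

A person staying in the same cell contributes to neither channel, which matches the entering/leaving definition above.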
External Factors: As mentioned in , crowd flow can be affected by many complex external factors, such as meteorological information and holiday information. For example, a sudden rain may seriously affect the crowd flow evolution, and people would gather in some commercial areas for celebration on New Year's Eve. In this paper, we also consider the effect of these external factors. The meteorological information (e.g., weather condition, temperature and wind speed) can be collected from public meteorological websites, such as Wunderground (https://www.wunderground.com/). Specifically, the weather condition is categorized into sixteen categories (e.g., sunny and rainy) and is digitized with one-hot encoding, while temperature and wind speed are scaled into the range [0, 1] with a min-max linear normalization. Multiple categories of holidays (e.g., Chinese Spring Festival and Christmas) can be acquired from a calendar and encoded into a binary vector with one-hot encoding. Finally, we concatenate all external factors data into a 1D tensor. The external factors tensor at the $t$-th time interval of the $d$-th day is expressed as $E_d^t$ in the following sections.
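A minimal encoding of the external factors vector might look like the following sketch; the temperature and wind-speed ranges are illustrative assumptions, and the holiday vector is omitted for brevity:

```python
import numpy as np

WEATHER_TYPES = 16  # sixteen weather categories, as described above

def encode_external(weather_id, temp_c, wind_mph,
                    temp_range=(-20.0, 40.0), wind_max=50.0):
    """Encode one interval's external factors into a 1D vector.
    The numeric ranges here are illustrative, not the paper's values."""
    weather = np.zeros(WEATHER_TYPES)
    weather[weather_id] = 1.0                    # one-hot weather condition
    t_lo, t_hi = temp_range
    temp_n = (temp_c - t_lo) / (t_hi - t_lo)     # min-max scale to [0, 1]
    wind_n = wind_mph / wind_max
    return np.concatenate([weather, [temp_n, wind_n]])
```

The resulting vector has 16 + 2 entries and would be further concatenated with the holiday encoding in practice.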
Crowd Flow Prediction: Given the historical crowd flow maps and external factors data until the $t$-th time interval of the $d$-th day, we aim to predict the crowd flow map $M_d^{t+1}$, which is called short-term prediction in our work. Moreover, we also extend our model to conduct long-term prediction, in which we forecast the crowd flow at the next multiple time intervals.
IV Attentive Crowd Flow Machine
In this section, we propose a unified neural network, named Attentive Crowd Flow Machine (ACFM), to learn spatial-temporal representations of crowd flow. ACFM is designed to adequately capture various contextual dependencies of the crowd flow, e.g., the spatial consistency and the long- and short-term temporal dependencies. As shown in Fig. 2, the proposed ACFM consists of two progressive ConvLSTM units connected with a convolutional layer for attention weight prediction at each time step. Specifically, the first ConvLSTM unit learns the temporal dependency from the normal crowd flow features, whose extraction process is described in Section V-A1. The output hidden state encodes the historical evolution information and is concatenated with the current crowd flow feature for spatial weight map inference. The second ConvLSTM unit takes the re-weighted crowd flow features as input at each time step and is trained to recurrently learn the spatial-temporal representations for further crowd flow prediction.
Let us denote the input crowd flow feature map at the $t$-th iteration as $X_t \in \mathbb{R}^{h \times w \times c}$, with $h$, $w$ and $c$ representing the height, width and the number of channels. At this iteration, the first ConvLSTM unit takes $X_t$ as input and updates its memorized cell state $C^1_t$ with an input gate $i_t$ and a forget gate $f_t$. Meanwhile, it updates its new hidden state $H^1_t$ with an output gate $o_t$. The computation process of our first ConvLSTM unit is formulated as:

$$i_t = \sigma\big(W_{xi} \ast X_t + W_{hi} \ast H^1_{t-1} + b_i\big),$$
$$f_t = \sigma\big(W_{xf} \ast X_t + W_{hf} \ast H^1_{t-1} + b_f\big),$$
$$o_t = \sigma\big(W_{xo} \ast X_t + W_{ho} \ast H^1_{t-1} + b_o\big),$$
$$C^1_t = f_t \circ C^1_{t-1} + i_t \circ \tanh\big(W_{xc} \ast X_t + W_{hc} \ast H^1_{t-1} + b_c\big),$$
$$H^1_t = o_t \circ \tanh\big(C^1_t\big), \tag{1}$$

where $\ast$ denotes the convolution operation, $W_{x\cdot}$ and $W_{h\cdot}$ are the parameters of convolutional layers in ConvLSTM, $\sigma$ denotes the logistic sigmoid function and $\circ$ is an element-wise multiplication operation. For notation simplification, we denote Eq. (1) as:

$$H^1_t, C^1_t = \mathrm{ConvLSTM}\big(X_t, H^1_{t-1}, C^1_{t-1}\big). \tag{2}$$

Generated from the memorized cell state $C^1_t$, the new hidden state $H^1_t$ encodes the dynamic evolution of historical crowd flow in the temporal view.
We then integrate a deep attention mechanism to dynamically model the spatial dependencies of crowd flow. Specifically, we incorporate the historical state $H^1_t$ and the current crowd flow feature $X_t$ to infer an attention map $A_t$, which is implemented by:

$$A_t = W_a \ast \big[H^1_t, X_t\big],$$

where $[\cdot, \cdot]$ denotes a feature concatenation operation and $W_a$ is the parameters of a convolutional layer. The attention map $A_t$ is learned to discover the weights of each spatial location on the input feature map $X_t$.
Finally, we learn a more effective spatial-temporal representation with the guidance of the attention map. After reweighing the normal crowd flow feature map by multiplying $X_t$ and $A_t$ element by element, we feed it into the second ConvLSTM unit and generate a new hidden state $H^2_t$, which is expressed as:

$$H^2_t, C^2_t = \mathrm{ConvLSTM}\big(A_t \circ X_t, H^2_{t-1}, C^2_{t-1}\big),$$

where $H^2_t$ encodes the attention-aware content of the current input as well as memorizes the contextual knowledge of previous moments. When the elements in a sequence of crowd flow maps are recurrently fed into ACFM, the last hidden state encodes the information of the whole sequence and can be used as the spatial-temporal representation for evolution analysis of the future flow map.
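The recurrence above can be sketched in NumPy. This is a didactic simplification (all convolutions collapsed to 1×1, i.e., per-pixel channel mixing, with random untrained weights), not the trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ConvLSTMCell:
    """Minimal ConvLSTM cell. For brevity every convolution is 1x1
    (a per-pixel linear map over channels); the real ACFM uses larger
    spatial kernels."""

    def __init__(self, in_ch, hid_ch, rng):
        # fused weights for the input, forget, output and candidate gates
        self.W = rng.normal(0.0, 0.1, (4 * hid_ch, in_ch + hid_ch))
        self.b = np.zeros((4 * hid_ch, 1))
        self.hid_ch = hid_ch

    def __call__(self, x, h, c):
        hh, ww = x.shape[1:]                        # x: (in_ch, H, W)
        z = np.concatenate([x, h]).reshape(-1, hh * ww)
        gates = self.W @ z + self.b                 # the 1x1 "convolution"
        i, f, o, g = np.split(gates, 4)
        c_new = sigmoid(f) * c.reshape(-1, hh * ww) + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new.reshape(-1, hh, ww), c_new.reshape(-1, hh, ww)

class ACFM:
    """Two progressive ConvLSTM units: the first summarizes the history,
    a (here 1x1) convolution over [H1_t, X_t] infers a spatial attention
    map, and the second unit consumes the re-weighted features."""

    def __init__(self, in_ch, hid_ch, seed=0):
        rng = np.random.default_rng(seed)
        self.lstm1 = ConvLSTMCell(in_ch, hid_ch, rng)
        self.lstm2 = ConvLSTMCell(in_ch, hid_ch, rng)
        self.w_att = rng.normal(0.0, 0.1, (1, hid_ch + in_ch))

    def forward(self, seq):
        hh, ww = seq[0].shape[1:]
        hid = self.lstm1.hid_ch
        h1 = c1 = h2 = c2 = np.zeros((hid, hh, ww))
        for x in seq:
            h1, c1 = self.lstm1(x, h1, c1)
            z = np.concatenate([h1, x]).reshape(-1, hh * ww)
            att = sigmoid(self.w_att @ z).reshape(1, hh, ww)  # weight per pixel
            h2, c2 = self.lstm2(att * x, h2, c2)
        return h2  # spatial-temporal representation of the whole sequence
```

The last hidden state `h2` plays the role of the sequence-level representation discussed above.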
V Citywide Crowd Flow Prediction
In this section, we first develop a deep neural network framework which incorporates the proposed ACFM for citywide short-term crowd flow prediction. We then ameliorate this framework to predict long-term crowd flow with an extra LSTM prediction network.
V-A Short-term Prediction
As illustrated in Fig. 3, our short-term prediction framework consists of four components: (1) a normal feature extraction (NFE) module, (2) a sequential representation learning (SRL) module, (3) a periodic representation learning (PRL) module and (4) a temporally-varying fusion (TVF) module. First, the NFE module is used to extract the normal features of the crowd flow map and the external factors tensor at each time interval. Second, the SRL and PRL modules are employed to model the contextual dependencies of crowd flow at different temporal scales. Third, the TVF module adaptively merges the feature representations of SRL and PRL with fusion weights learned from the comprehensive features of various factors. Finally, the merged feature map is fed to an additional convolutional layer for crowd flow map inference. For convenience, this framework is denoted as Sequential-Periodic Network (SPN) in the following sections.
V-A1 Normal Feature Extraction
We first describe how to extract the normal features of crowd flow and external factors, which will be further fed into the SRL and PRL modules for dynamic spatial-temporal representation learning.
As shown in Fig. 5, we utilize a customized ResNet to automatically learn a feature embedding from the given crowd flow map $M_d^t$. Specifically, our ResNet consists of $N$ residual units, each of which has two convolutional layers with a channel number of 16. To maintain the resolution, we set the strides of all convolutional layers to 1 and do not adopt any pooling layers in the ResNet. We first scale $M_d^t$ into the range $[-1, 1]$ with a min-max linear normalization and then feed it into the ResNet to generate the crowd flow feature, which is denoted as $F_m$.
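A single residual unit of this ResNet can be sketched as follows; the 3×3 kernels with zero padding are an assumption for illustration, and the unit is simplified to a plain ReLU between the two convolutions:

```python
import numpy as np

def conv3x3(x, w):
    """Stride-1, zero-padded 3x3 convolution.
    x: (C_in, H, W); w: (C_out, C_in, 3, 3)."""
    c_out = w.shape[0]
    h, wd = x.shape[1:]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c_out, h, wd))
    for dy in range(3):
        for dx in range(3):
            # accumulate the contribution of each kernel offset
            out += np.einsum('oc,chw->ohw',
                             w[:, :, dy, dx], xp[:, dy:dy + h, dx:dx + wd])
    return out

def residual_unit(x, w1, w2):
    """One residual unit: two convolutions with an identity shortcut;
    stride 1 and no pooling, so the map resolution is preserved."""
    hidden = np.maximum(conv3x3(x, w1), 0.0)   # ReLU
    return x + conv3x3(hidden, w2)             # identity shortcut
```

Because the strides are 1 and no pooling is used, the output keeps the input's spatial size, which is what lets the framework predict a full-resolution flow map.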
Then, we extract the feature of the given external factors tensor $E_d^t$ with fully-connected layers. We reshape the output of the last layer to form the 3D external factor feature $F_e$. Finally, we fuse $F_m$ and $F_e$ to generate an embedded feature $X_d^t$, which is expressed as:

$$X_d^t = [F_m, F_e],$$

where $[\cdot, \cdot]$ denotes feature concatenation. $X_d^t$ is the normal feature at a specific time interval and is unaware of the dynamic spatial dependencies of crowd flow. Thus, the following two modules are proposed to dynamically learn the spatial-temporal representation.
V-A2 Sequential Representation Learning
The evolution of citywide crowd flow is usually affected by the recent traffic states. For instance, a traffic accident occurring on a main road of the studied city during morning rush hours may seriously affect the crowd flow of nearby regions in subsequent time intervals. In this subsection, we develop a sequential representation learning (SRL) module based on the proposed ACFM to fully model the evolution trend of crowd flow.
First, we take the normal crowd flow features of several recent time intervals to form a group of sequential temporal features, which is denoted as:

$$\mathcal{S} = \{X_d^{t-n+1}, X_d^{t-n+2}, \dots, X_d^{t}\},$$

where $n$ is the length of the sequentially related time intervals. We then apply the proposed ACFM to learn the sequential representation from the temporal features $\mathcal{S}$. As shown on the left of Fig. 3, at each iteration, ACFM takes one element in $\mathcal{S}$ as input and learns to selectively memorize the spatial-temporal context of the sequential crowd flow. Finally, we get the sequential representation $S_s$ by feeding the last hidden state of ACFM into a convolutional layer. $S_s$ encodes the sequential evolution trend of crowd flow.
V-A3 Periodic Representation Learning
In urban transportation systems, there exist some periodicities which make a significant impact on the changes of traffic flow. For example, the traffic conditions are very similar during morning rush hours of consecutive workdays, repeating every 24 hours. Thus, in this subsection, we propose a periodic representation learning (PRL) module that fully captures the periodic dependencies of crowd flow with the proposed ACFM.
Similar to the sequential representation learning, we first construct a group of periodic temporal features:

$$\mathcal{P} = \{X_{d-m}^{t+1}, X_{d-m+1}^{t+1}, \dots, X_{d-1}^{t+1}\},$$

where $m$ is the number of periodic days. At each iteration, we feed one element in $\mathcal{P}$ into ACFM to dynamically learn the periodic dependencies, as shown on the right of Fig. 3. After the last iteration, we feed the hidden state of ACFM into a convolutional layer to generate the final periodic representation $S_p$. Encoding the periodic evolution trend of crowd flow, $S_p$ is proven to be effective for traffic prediction in our experiments.
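With globally indexed time intervals, selecting the sequential and periodic inputs reduces to simple index arithmetic; the function names and the per-day interval count below are illustrative:

```python
def sequential_indices(t, n):
    """Global indices of the n intervals immediately before target t."""
    return [t - i for i in range(n, 0, -1)]

def periodic_indices(t, m, intervals_per_day):
    """Global indices of the target's time-of-day slot on the previous
    m days (e.g., intervals_per_day=48 for half-hour intervals)."""
    return [t - d * intervals_per_day for d in range(m, 0, -1)]
```

Every periodic index lands on the same time-of-day slot as the target, which is exactly the repeating pattern (e.g., morning rush hours every 24 hours) these features are meant to capture.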
V-A4 Temporally-Varying Fusion
As described in the previous two modules, the future crowd flow is affected by the sequential representation and the periodic representation simultaneously. We find that the relative importance of these two representations is temporally dynamic, and it is suboptimal to directly concatenate them without any specific preprocessing. To address this issue, we propose a novel temporally-varying fusion (TVF) module to adaptively fuse the two representations with different weights learned from the comprehensive features of various internal and external factors.
In the TVF module, we take the sequential representation $S_s$, the periodic representation $S_p$ and the external factors integrative feature $F_{ext}$ to determine the fusion weight. Specifically, $F_{ext}$ is the element-wise addition of the external factors features of the sequential and the periodic time intervals. As shown in Fig. 3, we first feed the concatenation of $S_s$, $S_p$ and $F_{ext}$ into two fully-connected layers for fusion weight inference. The first layer has 32 output neurons and the second one has only one neuron. We then obtain the fusion weight $w$ of $S_s$ by applying a sigmoid function on the output of the second FC layer. The weight of $S_p$ is automatically set to $1 - w$. We then fuse these two temporal representations on the basis of the learned weights and compute a comprehensive spatial-temporal representation as:

$$S_f = w \cdot S_s + (1 - w) \cdot S_p,$$

where $S_f$ contains the sequential and periodic dependencies of crowd flow.
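The fusion step itself is a convex combination gated by a sigmoid; a sketch (the FC layers that produce the scalar logit are omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tvf_fuse(s_seq, s_per, logit):
    """Temporally-varying fusion: `logit` is the scalar produced by the
    two FC layers; its sigmoid is the weight of the sequential
    representation, and the periodic one receives the complement."""
    w = sigmoid(logit)
    return w * s_seq + (1.0 - w) * s_per
```

Because the two weights always sum to one, the fused representation stays on the line segment between the two inputs, and the network only has to learn which end to lean toward at each moment.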
Finally, we feed $S_f$ into a convolutional layer with two filters to predict the future crowd flow map with the following formula:

$$\hat{M}_d^{t+1} = \tanh\big(W_p \ast S_f\big),$$

where $W_p$ is the parameters of the predictive convolutional layer and the hyperbolic tangent ensures the output values are within the range $[-1, 1]$. Further, the predicted map is re-scaled back to normal values with an inverted min-max linear normalization.
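The $[-1, 1]$ scaling and its inverse, assuming the data range is known from the training set, can be written as:

```python
import numpy as np

def minmax_scale(x, lo, hi):
    """Linearly map values from [lo, hi] to [-1, 1], matching tanh's range."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def minmax_inverse(y, lo, hi):
    """Invert the scaling to recover flow values in the original range."""
    return (y + 1.0) / 2.0 * (hi - lo) + lo
```

The inverse is applied to the network's tanh output to obtain crowd counts in the original units.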
V-B Long-term Prediction
In this subsection, we extend our method to predict longer-term crowd flow. With a similar setting to short-term prediction, we incorporate the sequential data and periodic data at previous time intervals to forecast the crowd flow at the next four time intervals. For convenience, we denote this model as SPN-LONG in the following sections.
The architecture of our SPN-LONG is shown in Fig. 4. For every previous time interval, we first extract its normal features with the proposed NFE module. Then, the sequential features are recurrently fed into ACFM to learn the sequential representation. The output sequential representation is then fed into an LSTM prediction network. With four ConvLSTM units, this prediction network is designed to forecast the crowd flow at the next four time intervals. Specifically, at each ConvLSTM unit, we use a TVF module to adaptively fuse its hidden state and the periodic representation learned from the corresponding periodic features. The external factors integrative feature is the element-wise addition of the external factors features of the involved time intervals. Finally, we take the output of the TVF module to predict the corresponding crowd flow map with a convolutional layer.
| Crowd Flow | TaxiBJ | BikeNYC |
|---|---|---|
| City | Beijing | New York |
| Grid Map Size | (32, 32) | (16, 8) |
| Data Type | Taxi GPS | Bike Rent |
| Time Span | 7/1/2013 - 10/30/2013 | 4/1/2014 - 9/30/2014 |
|  | 3/1/2014 - 6/30/2014 |  |
|  | 3/1/2015 - 6/30/2015 |  |
|  | 11/1/2015 - 4/10/2016 |  |
| Time Interval | 0.5 hour | 1 hour |
| # Available Time Intervals | 22,459 | 4,392 |
| # Holidays | 41 | 20 |
| Weather Conditions | 16 types (e.g., Sunny, Rainy) | - |
| Temperature / °C | ✓ | - |
| Wind Speed / mph | ✓ | - |
VI Experiments
In this section, we first introduce the commonly-used benchmarks and evaluation metrics for citywide crowd flow prediction. Then, we compare the proposed approach with several state-of-the-art methods under different settings. Further, we conduct an extensive component analysis to demonstrate the effectiveness of each component in our model. Finally, we apply the proposed method to passenger pickup/dropoff demand forecasting and show its generalizability to other traffic prediction tasks.
VI-A Experiment Settings
VI-A1 Dataset Setting
In this work, we forecast the inflow and outflow of citywide crowds on two public benchmarks, including the TaxiBJ dataset for taxicab flow prediction and the BikeNYC dataset for bike flow prediction. The summaries of these two datasets are shown in Table I (the details of the TaxiBJ and BikeNYC datasets are quoted from ).
TaxiBJ Dataset: In this dataset, a mass of taxi GPS trajectories are collected from 34 thousand taxicabs in Beijing over 16 months. The time interval is half an hour, and 22,459 crowd flow maps of size (32, 32) are generated from these trajectory data. The external factors contain weather conditions, temperature, wind speed and 41 categories of holidays. This dataset is officially divided into a training set and a testing set: the data of the last four weeks are used for evaluation and the rest are used for training.
BikeNYC Dataset: Generated from the NYC bike trajectory data of 182 days, this dataset contains 4,392 crowd flow maps with a time interval of one hour, and the size of these maps is (16, 8). As for external factors, 20 categories of holidays are recorded. The data of the first 172 days are used for training and the data of the last ten days are chosen as the test set.
VI-A2 Implementation Details
We adopt the PyTorch toolbox to implement our crowd flow prediction network. The sequential length $n$ and the periodic length $m$ are set to 4 and 2, respectively. For a fair comparison with ST-ResNet, we develop the customized ResNet in Section V-A1 with 12 residual units on the TaxiBJ dataset and 4 residual units on the BikeNYC dataset. The filter weights of all convolutional layers and fully-connected layers are initialized by Xavier initialization. The size of a minibatch is set to 64 and the learning rate is . We optimize the parameters of our network in an end-to-end manner via Adam optimization by minimizing the Euclidean loss on a GTX 1080Ti GPU.
VI-A3 Evaluation Metrics
In crowd flow prediction, Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are two popular evaluation metrics that are extensively used to measure the performance of all methods. Specifically, they are defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(\hat{M}_i - M_i\big)^2}, \qquad \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\big|\hat{M}_i - M_i\big|,$$

where $\hat{M}_i$ and $M_i$ represent the predicted flow map and its ground truth map, respectively, and $N$ indicates the number of samples used for validation. However, some partitioned regions in New York City are water areas and their flows are always zero. This phenomenon would decrease the mean error and make it hard to distinguish the capacities of different methods. To correctly reflect the performance of different methods on the BikeNYC dataset, we re-scale their mean errors with a ratio (1.58) provided by ST-ResNet.
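Both metrics are straightforward to compute; the sketch below also applies the BikeNYC re-scaling ratio (1.58) mentioned above:

```python
import numpy as np

def rmse(pred, gt):
    """Root Mean Square Error over all elements of the flow maps."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(gt)) ** 2)))

def mae(pred, gt):
    """Mean Absolute Error over all elements of the flow maps."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(gt))))

BIKENYC_RATIO = 1.58  # re-scaling ratio provided by ST-ResNet

def bikenyc_rmse(pred, gt):
    """RMSE re-scaled to discount the always-zero water-area regions."""
    return BIKENYC_RATIO * rmse(pred, gt)
```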
VI-B Comparison for Short-term Prediction
In this subsection, we compare the proposed method with ten typical methods for short-term crowd flow prediction. These compared methods can be divided into three categories, including: (i) traditional models for time series forecasting, (ii) deep learning networks particularly designed for crowd flow prediction and (iii) the state-of-the-art approaches originally designed for some related tasks. The details of the compared methods are described as follows.
HA: Historical Average (HA) is a simple model that directly predicts the future crowd flow by averaging the historical flow in the corresponding periods. For example, the predicted flow from 7:00 am to 7:30 am on a specific Tuesday is the average flow from 7:00 am to 7:30 am on all historical Tuesdays.
ARIMA : Auto-Regressive Integrated Moving Average (ARIMA) is a famous statistical analysis model that uses time series data to predict future trends.
SARIMA : Seasonal ARIMA (SARIMA) is an advanced variant of ARIMA that considers the seasonal terms.
VAR : Vector Auto-Regression (VAR) is a well-known stochastic process model and it can capture the linear interdependencies among multiple time series.
ST-ANN : As an artificial neural network, this model extracts spatial features (the values of 8 nearby regions) and temporal features (8 previous time intervals) to forecast the future crowd flow.
DeepST : This is a DNN-based model and it utilizes various temporal properties to conduct prediction.
ST-ResNet : As an advanced version of DeepST, this model incorporates the closeness, period, trend data as well as external factors to predict crowd flow with residual networks.
VPN : Video Pixel Networks (VPN) is a probabilistic video model designed for multi-frame prediction. A variant of VPN based on RMBs is implemented to predict crowd flow.
PredNet : As a predictive neural network, this model was originally developed to predict the next frame in a video sequence. We apply this method to crowd flow prediction.
PredRNN : This method utilizes a predictive recurrent neural network to memorize both spatial appearances and temporal variations in order to generate future images. It is implemented to forecast the future crowd flow in this work.
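The HA baseline described above admits a near one-line implementation; the slot-id bookkeeping here is an illustrative assumption:

```python
import numpy as np

def historical_average(maps, slots, target_slot):
    """Average all historical maps observed at the same weekly slot.
    maps: (T, H, W) array of flow maps; slots: length-T list of slot ids
    such as "tue-7:00"; target_slot: the slot to predict."""
    same = [m for m, s in zip(maps, slots) if s == target_slot]
    return np.mean(same, axis=0)
```

Despite its simplicity, HA is a useful sanity check: any learned model should beat the average of all flows seen at the same weekly slot.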
The performance of the proposed method and the other ten compared methods is summarized in Table II. Among these methods, the baseline model is HA, which obtains an RMSE of 57.79 on the TaxiBJ dataset and 21.57 on the BikeNYC dataset. Although making some progress, the traditional time series models (e.g., ARIMA and SARIMA) still perform poorly on both datasets. The recent deep learning based methods decrease the errors to some extent (e.g., ST-ResNet decreases the RMSE to 16.59 on TaxiBJ), but this task is still far from being perfectly solved. By contrast, our method further improves the performance by explicitly learning the spatial-temporal features and modeling the attention weighting of each spatial influence. Specifically, our method achieves an RMSE of 15.31 on the TaxiBJ dataset, outperforming the previous best approach PredRNN with a relative performance improvement of 6.3%. On the BikeNYC dataset, our method also boosts the prediction accuracy, i.e., decreases the RMSE from 5.99 to 5.59, and outperforms the other methods. Moreover, we compare the performance of five deep learning based methods at different time intervals, such as weekday (from Monday to Friday), weekend (Saturday and Sunday), day (from 6:00 to 18:00) and night (from 18:00 to 6:00). As shown in Fig. 6, our method outperforms the other compared methods under various settings, which well demonstrates its robustness.
We further measure the RMSE on the regions with high crowd flow, since in some specific applications we are more concerned about the predicted results on these regions. We first rank the 1,024 regions in Beijing on the basis of the average crowd flow on the training set and then choose the top-$k$ regions ($k$ is a percentage) to conduct the evaluation. As shown in Fig. 7, the RMSEs of the five deep learning based methods are large on the top-10% regions, and our method obtains an RMSE of 32.11, which shows that there is still much room for improvement on this task. As the percentage increases, the RMSEs of all methods gradually drop, and our method outperforms the other compared methods consistently. These comparisons well show the superiority of our method.
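The region-level evaluation above can be sketched as follows; for brevity the maps are treated as single-channel, and the function name is illustrative:

```python
import numpy as np

def topk_region_rmse(preds, gts, train_mean_flow, pct):
    """RMSE restricted to the top-pct% busiest regions, ranked by the
    average crowd flow on the training set.
    preds, gts: (num_samples, H, W); train_mean_flow: (H, W)."""
    flat = np.asarray(train_mean_flow).ravel()
    k = max(1, int(round(pct / 100.0 * flat.size)))
    busiest = np.argsort(flat)[::-1][:k]     # indices of the k busiest cells
    diff = (np.asarray(preds) - np.asarray(gts)).reshape(len(preds), -1)
    return float(np.sqrt(np.mean(diff[:, busiest] ** 2)))
```

Restricting the error to the busiest cells prevents the many near-empty regions from masking a model's weakness exactly where accurate prediction matters most.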
VI-C Comparison for Long-term Prediction
In this subsection, we apply the customized SPN-LONG to predict long-term crowd flow and compare it with four deep learning based methods (on the TaxiBJ dataset, the performances of all compared methods are directly quoted from ; on the BikeNYC dataset, we implement all compared methods and evaluate their performances). These compared methods have been fine-tuned for long-term prediction. As shown in Table III, the RMSE of all methods gradually increases on the TaxiBJ dataset when attempting to forecast the longer-term flow. It can be observed that PredNet performs poorly in this scenario, since it was originally designed for one-frame prediction and has a low capacity for long-term prediction. By contrast, our method suffers only minor performance degradation and outperforms the other methods at each time interval. Specifically, our method achieves the lowest RMSE of 20.83 at the fourth time interval, a relative improvement of 8.2% compared with the previous best-performing method PredRNN. Moreover, we also evaluate the original SPN for long-term prediction, in which it is used to forecast crowd flow in a rolling style. As shown in the penultimate column of Table III, it performs worse than SPN-LONG; thus we can conclude that it is essential to adapt and retrain SPN for long-term prediction. We also conduct long-term prediction on the BikeNYC dataset and find that our SPN-LONG consistently outperforms the other compared methods, as shown in Table IV. These experiments well demonstrate the effectiveness of the customized SPN-LONG for long-term crowd flow prediction.
VI-D Component Analysis
As described in Section V, our full model consists of four components: normal feature extraction, sequential representation learning, periodic representation learning and a temporally-varying fusion module. In this subsection, we implement eight variants of our full model in order to verify the effectiveness of each component:
PCNN: directly concatenates the periodic features and feeds them to a convolutional layer with two filters to predict future crowd flow;
SCNN: directly concatenates the sequential features and feeds them to a convolutional layer to predict future crowd flow;
PRNN-w/o-Attention: takes periodic features as input and learns the periodic representation with an LSTM layer to predict future crowd flow;
PRNN: takes periodic features as input and learns periodic representation with the proposed ACFM to predict future crowd flow;
SRNN-w/o-Attention: takes sequential features as input and learns the sequential representation with an LSTM layer for crowd flow estimation;
SRNN: takes sequential features as input and learns sequential representation with the proposed ACFM to predict future crowd flow;
SPN-w/o-Ext: doesn’t consider the effect of external factors and directly trains the model with crowd flow maps;
SPN-w/o-Fusion: directly merges sequential representation and periodic representation with equal weight (0.5) to predict future crowd flow.
Effectiveness of Sequential Representation Learning: As shown in Table V, the baseline variant SCNN, which directly concatenates the sequential features for prediction, gets an RMSE of 17.15. When the sequential contextual dependencies of crowd flow are explicitly modeled with the proposed ACFM, the variant SRNN decreases the RMSE to 15.82, a 7.75% relative improvement over the baseline SCNN, which indicates the effectiveness of sequential representation learning.
Effectiveness of Periodic Representation Learning: We also explore different network architectures to learn the periodic representation. As shown in Table V, PCNN, which estimates the flow map by simply concatenating all of the periodic features, only achieves an RMSE of 33.91. In contrast, when introducing ACFM to learn the periodic representation, the RMSE drops to 32.89. This experiment also well demonstrates the effectiveness of the proposed ACFM for spatial-temporal modeling.
Effectiveness of Spatial Attention: As shown in Table V, by adopting spatial attention, PRNN decreases the RMSE by 0.62 compared to PRNN-w/o-Attention. Similarly, SRNN with spatial attention achieves a comparable improvement over SRNN-w/o-Attention. Fig. 8 and Fig. 9 show some attentional maps generated by our method, as well as the residual maps between the input crowd flow maps and their corresponding ground truth. We can observe that there is, to some extent, a negative correlation between the attentional maps and the residual maps. This indicates that our ACFM is able to capture valuable regions at each time step and make better predictions by inferring the trend of evolution. Roughly, the greater the residual of a region, the smaller its weight, and vice versa. By multiplying small weights on the features of regions with great residuals, we can inhibit the impact of these regions. With the visualization of attentional maps, we can also learn which regions have the primary positive impact on the future flow prediction. According to this experiment, the proposed model not only effectively improves the prediction accuracy, but also enhances the interpretability of the model to a certain extent.
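The weighting behavior described above can be illustrated with a minimal spatial-attention sketch; the softmax normalization over locations is an assumption made for this illustration and may differ from the exact ACFM formulation.

```python
import numpy as np

def apply_spatial_attention(features, scores):
    """Weight location features by a spatial attention map (sketch).
    features: (C, H, W) feature map; scores: (H, W) raw attention scores.
    Locations with small weights have their features inhibited, as
    discussed for regions with large residuals."""
    flat = scores.reshape(-1)
    weights = np.exp(flat - flat.max())
    weights = weights / weights.sum()        # softmax over all H*W locations
    attn = weights.reshape(scores.shape)     # attentional map, sums to 1
    return features * attn[None, :, :], attn
```

With uniform scores every location gets the same weight; raising the score of a location concentrates the attentional map there, which is the mechanism visualized in Fig. 8 and Fig. 9.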
Necessity of External Factors: Without modeling the effect of external factors, the variant SPN-w/o-Ext obtains an RMSE of 16.84 on the TaxiBJ dataset, a performance degradation of 10% compared to SPN. The main reason for this degradation is that adverse meteorological conditions (e.g., rain and snow) or holidays can seriously affect the crowd flow. Thus, it is necessary to incorporate external factors when modeling the crowd flow evolution.
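One simple way to prepare such external factors for a model is to concatenate continuous values with one-hot categorical codes; this is only a hedged sketch of the idea (the function name and field choices are illustrative), and the actual model may process these factors differently, e.g., with embedding layers.

```python
import numpy as np

def encode_external(temperature, wind_speed, weather_id, n_weather, is_holiday):
    """Assemble an external-factor feature vector (illustrative sketch).
    Continuous factors are passed through as-is here (in practice they
    would typically be normalized); the categorical weather condition is
    one-hot encoded, and the holiday flag is a single binary entry."""
    weather = np.zeros(n_weather)
    weather[weather_id] = 1.0                       # one-hot weather condition
    cont = np.array([temperature, wind_speed], dtype=float)
    return np.concatenate([cont, weather, [1.0 if is_holiday else 0.0]])
```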
Effectiveness of Temporally-Varying Fusion: When the two temporal representations are directly merged with equal contributions (0.5), SPN-w/o-Fusion achieves only a negligible improvement compared to SRNN. In contrast, with our proposed fusion strategy, the full model SPN decreases the RMSE from 15.82 to 15.31, a relative improvement of 3.2% over SRNN. These results show that the contributions of the two representations are not equal and are influenced by various factors. The proposed fusion strategy adaptively merges the different temporal representations and further improves the performance of crowd flow prediction.
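A minimal sketch of the fusion idea follows, assuming the learned score is squashed by a sigmoid into a weight in (0, 1); in the actual module this score would be produced from the two representations themselves rather than supplied as an input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(seq_rep, per_rep, score):
    """Temporally-varying fusion sketch: a scalar score is squashed to a
    weight w in (0, 1), and the representations are merged as
    w * sequential + (1 - w) * periodic. A score of 0 recovers the
    equal-weight baseline SPN-w/o-Fusion."""
    w = sigmoid(score)
    return w * seq_rep + (1.0 - w) * per_rep, w
```

Because the score can vary with the time interval and external conditions, the model can lean on the sequential representation at some times and the periodic one at others.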
Further Discussion: To analyze how each temporal representation contributes to the performance of crowd flow prediction, we measure the average fusion weights of the two temporal representations at each time interval on the testing set. As shown on the left of Fig. 10, the fusion weights of the sequential representation are greater than those of the periodic representation. To explain this phenomenon, we further measure i) the RMSE of crowd flow between two consecutive time intervals, denoted as “Pre-Hour”, and ii) the RMSE of crowd flow between two adjacent days at the same time interval, denoted as “Pre-Day”. As shown on the right of Fig. 10, the RMSE of “Pre-Day” is much higher than that of “Pre-Hour” at most times, except for the wee hours. Based on this observation, we conclude that the sequential representation is more essential for crowd flow prediction, since the sequential data is more regular. Although its weight is lower, the periodic representation still helps to improve the performance of crowd flow prediction both qualitatively and quantitatively. For example, incorporating the periodic representation decreases the RMSE of SRNN by 3.2%.
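The “Pre-Hour” and “Pre-Day” quantities are RMSEs of naive copy-forecasts at different lags, which can be computed as in this sketch (the function name is illustrative):

```python
import numpy as np

def naive_rmse(flows, lag):
    """RMSE of a naive copy-forecast at a given lag over a (T, H, W) flow
    series: lag=1 compares consecutive intervals ('Pre-Hour'-style),
    while lag = intervals_per_day compares adjacent days at the same
    time interval ('Pre-Day'-style)."""
    diff = flows[lag:] - flows[:-lag]
    return float(np.sqrt(np.mean(diff ** 2)))
```

For half-hour intervals, the day-level lag would be 48; a larger `naive_rmse` at that lag indicates the day-over-day signal is less regular than the hour-over-hour one.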
VI-E Extension to Citywide Passenger Demand Prediction
Our ACFM is a general model for urban mobility modeling. Apart from crowd flow prediction, it can also be applied to other related traffic tasks, such as citywide passenger demand prediction. In this subsection, we extend the proposed method to forecast the passenger pickup/dropoff demands at the next time interval (half an hour) from historical mobility trips.
We conduct experiments with taxi trips in New York City. First, we choose the Manhattan borough as the studied area, since most taxi transactions were made in this area. Manhattan is divided into a grid map and each grid represents a geographical region. Second, we collect 132 million taxicab trip records during 2014 from the New York City Taxi and Limousine Commission (NYCTLC, https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). Each record contains the timestamp and the geo-coordinates of the pickup and dropoff locations. For each region, we measure the passenger pickup/dropoff demands in every half-hour interval to form the passenger demand maps. We collect external meteorological factors (e.g., temperature, wind speed and weather conditions) from Wunderground, and holidays are also marked. Finally, we train our SPN with the historical demand of the first 337 days and test on the last four weeks. During evaluation, we filter out regions whose ground-truth demand is less than 5, since such low demand is typically negligible in real-world applications.
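Building demand maps from trip records amounts to counting pickups and dropoffs per region and half-hour interval. The following sketch assumes trips have already been mapped to grid indices and interval ids; the record layout is an assumption for illustration, not the raw NYCTLC schema.

```python
import numpy as np

def build_demand_maps(trips, n_rows, n_cols, n_intervals):
    """Aggregate trip records into pickup/dropoff demand maps (sketch).
    Each trip is a tuple (interval, pickup_row, pickup_col, dropoff_row,
    dropoff_col) with coordinates already mapped to grid cells and
    half-hour intervals. Returns an array of shape
    (n_intervals, 2, n_rows, n_cols): channel 0 counts pickups,
    channel 1 counts dropoffs."""
    maps = np.zeros((n_intervals, 2, n_rows, n_cols))
    for t, pr, pc, dr, dc in trips:
        maps[t, 0, pr, pc] += 1   # pickup demand
        maps[t, 1, dr, dc] += 1   # dropoff demand
    return maps
```

The resulting tensors play the same role as the crowd flow maps in the earlier experiments, so the SPN framework applies without structural changes.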
We compare our method with HA and five deep learning based methods. As shown in Table VI, the baseline method HA obtains an RMSE of 42.23 and an MAE of 23.62, which is impractical in the taxi industry. By contrast, our method dramatically decreases the RMSE to 18.75 and outperforms the other compared methods for short-term prediction. Moreover, we adapt and retrain these deep learning based methods to forecast long-term demand and summarize their RMSE in Table VII. It can be observed that our SPN-LONG model achieves the best performance at every time interval. In particular, our method has a performance improvement of 16.55% over PredRNN at the fourth time interval. These experiments show that the proposed method is also effective for passenger demand prediction.
VII Conclusion and Future Work
In this work, we utilize massive human trajectory data collected from ubiquitous digital devices to study the crowd flow prediction problem. Its key challenge lies in how to adaptively integrate the various factors that affect the flow changes, such as sequential trends, periodic laws and spatial dependencies. To address these issues, we propose a novel Attentive Crowd Flow Machine (ACFM), which explicitly learns dynamic spatial-temporal representations from historical crowd flow maps with an attention mechanism. Based on the proposed ACFM, we develop a unified framework that adaptively merges the sequential and periodic representations with the aid of a temporally-varying fusion module for citywide crowd flow prediction. Extensive experiments on two public benchmarks verify the effectiveness of our method for crowd flow prediction. Moreover, to verify the generalization of ACFM, we apply the customized framework to forecast passenger pickup/dropoff demand, and it also achieves practical performance on this traffic prediction task.
However, there is still much room for improvement. In most previous works, the functionality information of regions has not been fully explored. Intuitively, regions with the same functionality usually have similar crowd flow patterns. For instance, most residential regions have high outflow during morning rush hours and high inflow during evening rush hours. How to model the relationship between the functionalities of regions and crowd flow patterns remains an open problem. We propose to address this issue by incorporating Point of Interest (POI) information, which can represent the functionality of a region to some extent. However, to the best of our knowledge, there does not yet exist an effective way to integrate POI information into deep learning models. In future work, we will develop an improved neural network that incorporates POI information to forecast crowd flow.
-  Y. Zheng, L. Capra, O. Wolfson, and H. Yang, “Urban computing: concepts, methodologies, and applications,” TIST, vol. 5, no. 3, p. 38, 2014.
-  J. Zhang, F.-Y. Wang, K. Wang, W.-H. Lin, X. Xu, and C. Chen, “Data-driven intelligent transportation systems: A survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 4, pp. 1624–1639, 2011.
-  W. Huang, G. Song, H. Hong, and K. Xie, “Deep architecture for traffic flow prediction: deep belief networks with multitask learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 5, pp. 2191–2201, 2014.
-  Y. Lv, Y. Duan, W. Kang, Z. Li, and F.-Y. Wang, “Traffic flow prediction with big data: a deep learning approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 2, pp. 865–873, 2014.
-  N. G. Polson and V. O. Sokolov, “Deep learning for short-term traffic flow prediction,” Transportation Research Part C: Emerging Technologies, vol. 79, pp. 1–17, 2017.
-  J. Zhang, Y. Zheng, and D. Qi, “Deep spatio-temporal residual networks for citywide crowd flows prediction.” in AAAI, 2017, pp. 1655–1661.
-  S. Shekhar and B. M. Williams, “Adaptive seasonal time series models for forecasting short-term traffic flow,” Transportation Research Record, vol. 2024, no. 1, pp. 116–125, 2007.
-  J. Guo, W. Huang, and B. M. Williams, “Adaptive kalman filter approach for stochastic short-term traffic flow rate prediction and uncertainty quantification,” Transportation Research Part C: Emerging Technologies, vol. 43, pp. 50–64, 2014.
-  J. Zheng and L. M. Ni, “Time-dependent trajectory regression on road networks via multi-task learning,” in AAAI, 2013, pp. 1048–1055.
-  D. Deng, C. Shahabi, U. Demiryurek, L. Zhu, R. Yu, and Y. Liu, “Latent space model for road networks to predict time-varying traffic,” KDD, pp. 1525–1534, 2016.
-  J. Zhang, Y. Zheng, D. Qi, R. Li, and X. Yi, “Dnn-based prediction model for spatio-temporal data,” in SIGSPATIAL. ACM, 2016, p. 92.
-  Z. Xu, Y. Wang, M. Long, J. Wang, and M. KLiss, “Predcnn: Predictive learning with cascade convolutions.” in IJCAI, 2018, pp. 2940–2947.
-  J. Zhang, Y. Zheng, J. Sun, and D. Qi, “Flow prediction in spatio-temporal networks based on multitask deep learning,” TKDE, 2019.
-  S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” in NIPS, 2015, pp. 802–810.
-  S. Sharma, R. Kiros, and R. Salakhutdinov, “Action recognition using visual attention,” arXiv:1511.04119, 2015.
-  J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing when to look: Adaptive attention via a visual sentinel for image captioning,” arXiv:1612.01887, 2016.
-  L. Liu, R. Zhang, J. Peng, G. Li, B. Du, and L. Lin, “Attentive crowd flow machines,” in ACM Multimedia Conference. ACM, 2018, pp. 1553–1561.
-  B. Williams, P. Durvasula, and D. Brown, “Urban freeway traffic flow prediction: application of seasonal autoregressive integrated moving average and exponential smoothing models,” Transportation Research Record: Journal of the Transportation Research Board, no. 1644, pp. 132–141, 1998.
-  M. Castro-Neto, Y.-S. Jeong, M.-K. Jeong, and L. D. Han, “Online-svr for short-term traffic flow prediction under typical and atypical traffic conditions,” Expert systems with applications, vol. 36, no. 3, pp. 6164–6173, 2009.
-  X. Li, G. Pan, Z. Wu, G. Qi, S. Li, D. Zhang, W. Zhang, and Z. Wang, “Prediction of urban human mobility using large-scale taxi traces and its applications,” Frontiers of Computer Science, vol. 6, no. 1, pp. 111–121, 2012.
-  M. Lippi, M. Bertini, and P. Frasconi, “Short-term traffic flow forecasting: An experimental comparison of time-series analysis and supervised learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 871–882, 2013.
-  Y. Duan, Y. Lv, Y.-L. Liu, and F.-Y. Wang, “An efficient realization of deep learning for traffic data imputation,” Transportation Research Part C: Emerging Technologies, vol. 72, pp. 168–181, 2016.
-  Z. Chen, J. Zhou, and X. Wang, “Visual analytics of movement pattern based on time-spatial data: A neural net approach,” arXiv preprint arXiv:1707.02554, 2017.
-  M. Fouladgar, M. Parchami, R. Elmasri, and A. Ghaderi, “Scalable deep traffic flow neural networks for urban traffic congestion prediction,” arXiv preprint arXiv:1703.01006, 2017.
-  J. Ke, H. Zheng, H. Yang, and X. M. Chen, “Short-term forecasting of passenger demand under on-demand ride services: A spatio-temporal deep learning approach,” Transportation Research Part C: Emerging Technologies, vol. 85, pp. 591–608, 2017.
-  H. Wei, G. Zheng, H. Yao, and Z. Li, “Intellilight: A reinforcement learning approach for intelligent traffic light control,” in KDD. ACM, 2018, pp. 2496–2505.
-  Z. Zhao, W. Chen, X. Wu, P. C. Chen, and J. Liu, “Lstm network: a deep learning approach for short-term traffic forecast,” IET Intelligent Transport Systems, vol. 11, no. 2, pp. 68–75, 2017.
-  X. Geng, Y. Li, L. Wang, L. Zhang, Q. Yang, J. Ye, and Y. Liu, “Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting,” in AAAI, 2019.
-  L. Wang, X. Geng, X. Ma, F. Liu, and Q. Yang, “Crowd flow prediction by deep spatio-temporal transfer learning,” arXiv preprint arXiv:1802.00386, 2018.
-  H. Yao, Y. Liu, Y. Wei, X. Tang, and Z. Li, “Learning from multiple cities: A meta-learning approach for spatial-temporal prediction,” arXiv preprint arXiv:1901.08518, 2019.
-  M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv:1508.04025, 2015.
-  A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in ICASSP. IEEE, 2013, pp. 6645–6649.
-  J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (m-rnn),” arXiv:1412.6632, 2014.
-  V. Veeriah, N. Zhuang, and G.-J. Qi, “Differential recurrent neural networks for action recognition,” in ICCV, 2015, pp. 4041–4049.
-  Y. Tian and L. Pan, “Predicting short-term traffic flow by long short-term memory recurrent neural network,” in 2015 IEEE international conference on smart city/SocialCom/SustainCom (SmartCity). IEEE, 2015, pp. 153–158.
-  R. Fu, Z. Zhang, and L. Li, “Using lstm and gru neural network methods for traffic flow prediction,” in 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC). IEEE, 2016, pp. 324–328.
-  J. Mackenzie, J. F. Roddick, and R. Zito, “An evaluation of htm and lstm for short-term arterial traffic flow prediction,” IEEE Transactions on Intelligent Transportation Systems, no. 99, pp. 1–11, 2018.
-  L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to scale: Scale-aware semantic image segmentation,” in CVPR, 2016, pp. 3640–3649.
-  H. Xu and K. Saenko, “Ask, attend and answer: Exploring question-guided spatial attention for visual question answering,” in ECCV. Springer, 2016, pp. 451–466.
-  L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin, “Crowd counting using deep recurrent spatial-aware network,” in IJCAI. AAAI Press, 2018, pp. 849–855.
-  L. Liu, Z. Qiu, G. Li, Q. Wang, W. Ouyang, and L. Lin, “Contextualized spatial-temporal network for taxi origin-destination demand prediction,” IEEE Transactions on Intelligent Transportation Systems, 2019.
-  D. Harris and S. Harris, Digital design and computer architecture. Morgan Kaufmann, 2010.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS workshop, 2017.
-  X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in AISTATS, 2010, pp. 249–256.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
-  G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time series analysis: forecasting and control, 2015.
-  S. Johansen, “Estimation and hypothesis testing of cointegration vectors in gaussian vector autoregressive models,” Econometrica: Journal of the Econometric Society, pp. 1551–1580, 1991.
-  N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel networks,” in ICML, 2017, pp. 1771–1779.
-  W. Lotter, G. Kreiman, and D. Cox, “Deep predictive coding networks for video prediction and unsupervised learning,” in ICLR, 2017.
-  Y. Wang, M. Long, J. Wang, Z. Gao, and S. Y. Philip, “Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms,” in NIPS, 2017, pp. 879–888.