I Introduction
Cities are the keystone of modern human living, and people constantly migrate from rural to urban areas as urbanization proceeds. For instance, Delhi, the largest city in India, has a total of 29.4 million residents (http://worldpopulationreview.com/worldcities/). Such a huge population poses a great challenge to urban management, especially traffic management [1]. To address this challenge, intelligent transportation systems (ITS) [2] have been studied extensively for decades and have emerged as an efficient way of improving urban transportation. As a crucial component of ITS, crowd flow prediction [3, 4, 5] has recently attracted widespread research interest in both the academic and industrial communities, due to its huge potential in many real-world applications (e.g., intelligent traffic diversion and travel optimization).
In this paper, we aim to forecast the future crowd flow in a city from the historical mobility data of its residents. Nowadays, we live in an era where ubiquitous digital devices broadcast rich information about human mobility in real time and at a high rate, which exponentially increases the availability of large-scale mobility data (e.g., GPS signals or mobile phone signals). How to utilize these mobility data to predict crowd flow is still an open problem. In the literature, numerous methods applied time series models (e.g., Auto-Regressive Integrated Moving Average (ARIMA) [7]
and Kalman filtering
[8]) to predict traffic flow at each individual location separately. Subsequently, some studies incorporated spatial information to conduct prediction [9, 10]. However, these traditional models cannot fully capture the complex spatial-temporal dependencies of crowd flow, and this task is still far from being well solved.

Recently, notable successes have been achieved for citywide crowd flow prediction with deep neural networks coupled with certain spatial-temporal priors [11, 6, 12, 13]. In these works, the studied city is partitioned into a grid map based on longitude and latitude, as shown in Fig. 1
. The historical crowd flow maps generated from mobility data are fed into convolutional neural networks to forecast the future crowd flow. Nevertheless, there still exist several challenges limiting the performance of crowd flow analysis in complex scenarios.
First, crowd flow data can vary greatly over time, and capturing such dynamic variations is non-trivial. Second, the spatial dependencies between locations are not strictly stationary, and the importance of a specific region may change from time to time. Third, some periodic laws (e.g., traffic flow changing suddenly during rush hours) and external factors (e.g., a sudden rain) can greatly affect the evolution of crowd flow, which increases the difficulty of learning crowd flow representations from data.

To address the issues above, we propose a novel spatial-temporal neural network, called Attentive Crowd Flow Machine (ACFM), which adaptively exploits the diverse factors that affect crowd flow evolution and produces the crowd flow estimation map in an end-to-end manner. The attention mechanism embedded in ACFM is designed to automatically discover the regions with primary impacts on the future flow prediction and simultaneously adjust the impacts of different regions with different weights at each time step. Specifically, our ACFM comprises two progressive ConvLSTM
[14] units. The first one takes as input i) the original crowd flow features at each moment and ii) the memorized representations of previous moments, to compute the attention weights. The second ConvLSTM unit dynamically adjusts the spatial dependencies with the computed attention map and generates a superior spatial-temporal feature representation.
The proposed ACFM has the following three appealing properties. First, it can effectively incorporate spatial-temporal information into feature representations and can flexibly compose solutions for crowd flow prediction with different types of input data. Second, by integrating a deep attention mechanism [15, 16], ACFM adaptively learns the weight of each spatial location at each time step, which allows the model to dynamically perceive the impact of a given area at a given moment on the future traffic flow. Third, as a general and differentiable module, our ACFM can be effectively combined with various network architectures for end-to-end training and can also be applied to various traffic prediction tasks.
Based on the proposed ACFM, we further develop a deep architecture for forecasting citywide short-term crowd flow. Specifically, this customized framework consists of four components: i)
a normal feature extraction module,
ii) a sequential representation learning module, iii) a periodic representation learning module and iv) a temporally-varying fusion module. The middle two components are implemented with two parallel ACFMs that model contextual dependencies at different temporal scales, while the temporally-varying fusion module adaptively merges the two separate temporal representations for crowd flow prediction. Finally, we extend this framework to predict long-term crowd flow with an extra LSTM prediction network.

In summary, the contributions of this work are threefold:

We propose a novel neural network called Attentive Crowd Flow Machine (ACFM), which incorporates two ConvLSTM units with an attention mechanism to infer the evolution trend of crowd flow via dynamic spatial-temporal feature representation learning.

We integrate the proposed ACFM into a customized deep framework for citywide crowd flow prediction, which effectively incorporates sequential and periodic dependencies with a temporally-varying fusion module.

Extensive experiments on two public benchmarks of crowd flow prediction demonstrate that our approach outperforms existing state-of-the-art methods.
A preliminary version of this work was published in [17]. In this work, we inherit the idea of dynamically learning spatial-temporal representations and provide more details of the proposed method. Moreover, we extend the customized framework to forecast long-term crowd flow. Further, we conduct a more comprehensive ablation study of our method and present more comparisons with state-of-the-art models under different settings (e.g., weekday, weekend, day and night). Finally, we apply the proposed method to forecasting passenger pick-up/drop-off demands and show that it generalizes to various traffic prediction tasks.
The rest of this paper is organized as follows. First, we review related work on crowd flow analysis in Section II and formalize the task in Section III. Then, we introduce the proposed ACFM in Section IV and develop two unified frameworks to forecast short-term/long-term crowd flow in Section V. Extensive evaluations and comparisons are conducted in Section VI. Finally, we conclude this paper in Section VII.
II Related Work
II-A Crowd Flow Analysis
As a crucial task in ITS, crowd flow analysis has been studied for decades [18, 19] because of its wide applications in city traffic management and public safety monitoring. Traditional approaches usually used time series models (e.g., ARIMA, Kalman filtering and their variants) to forecast crowd flow [7, 20, 21]. However, most of these earlier methods modeled the evolution of crowd flow for each individual location separately and cannot fully capture the complex spatial-temporal dependencies.
Recently, deep learning methods have been widely used in various traffic-related tasks
[22, 23, 24, 25, 26]. Inspired by these works, many researchers have attempted to address crowd flow prediction with deep neural networks. For instance, Zhang et al. [11] developed a deep learning based framework that leverages temporal information at various scales (i.e., temporal closeness, period and season) for citywide crowd flow prediction. Xu et al. [12] designed a cascade multiplicative unit to model the dependencies between multiple frames and applied it to forecast future crowd flow. Zhao et al. [27] proposed a unified traffic forecast model based on a long short-term memory network for short-term crowd flow forecasting. Geng et al.
[28] developed a multi-graph convolution network to encode the non-Euclidean pairwise correlations among regions for spatiotemporal forecasting. Recently, to overcome the scarcity of crowd flow data, Wang et al. [29] proposed to learn the target city model from a source city model with a region-based cross-city deep transfer learning algorithm. Yao et al.
[30] incorporated the meta-learning paradigm into networks to tackle crowd flow prediction for cities with only a short period of data collection.

II-B Temporal Sequence Modeling
The recurrent neural network (RNN) is a special class of artificial neural network for temporal sequence modeling. As a variant of the RNN, the Long Short-Term Memory (LSTM) network can store information over extended time intervals and exploit longer-term temporal dependencies. Recently, LSTM has been widely applied to various sequential prediction tasks, such as natural language processing [31] and speech recognition [32]
. Many works in the computer vision community
[33, 34] also combined CNNs with LSTMs to model spatial-temporal information and achieved substantial progress in various applications. Inspired by the success of the aforementioned works, many researchers [35, 36, 37] have attempted to address crowd flow prediction with recurrent neural networks. However, these works simply applied LSTM to extract features and cannot fully model crowd flow evolution.

II-C Attention Mechanism
Visual attention is a fundamental aspect of the human visual system: while perceiving the surrounding world, humans focus the computational resources of their brain's visual system on specific regions of the visual field. Attention has recently been embedded in deep convolutional networks [38] and recurrent neural networks to adaptively attend to task-related regions during the feed-forward pass. Moreover, it has proven effective for many tasks, including machine translation [31], visual question answering [39] and crowd counting [40]. However, to the best of our knowledge, few works have incorporated an attention mechanism to address crowd flow prediction.
III Preliminary
In this section, we first describe some basic elements of crowd flow and then define the crowd flow prediction problem.
Region Partition: There are many ways to divide a city into multiple regions in terms of different granularities and semantic meanings, such as by road network [10] or zip code tabulation [41]. In this work, we follow the previous work [11] and partition a city into a non-overlapping grid map based on longitude and latitude. Each rectangular grid cell represents a different geographical region of the city. All partitioned regions of Beijing and New York City are shown in Fig. 1
. With this simple partition strategy, the raw mobility data can be easily transformed into a matrix or tensor, which is the most common input format for deep neural networks.
Crowd Flow Map: In many practical applications, we can extract a mass of crowd trajectories from GPS signals or mobile phone signals. With these crowd trajectories, we measure the number of pedestrians entering or leaving a given region at each time interval, which are referred to as inflow and outflow in our work. For convenience, we denote the crowd flow map at the $t$-th time interval of the $d$-th day as a tensor $M_d^t \in \mathbb{R}^{h \times w \times 2}$, of which the first channel is the inflow and the second channel is the outflow, where $h$ and $w$ denote the height and width of the grid map. Some examples of crowd flow maps are visualized in Fig. 8.
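To make the inflow/outflow construction concrete, here is a minimal counting sketch over a hypothetical trajectory format (each trajectory is assumed to be a time-ordered list of (interval, row, col) samples; the real datasets use richer GPS records):

```python
import numpy as np

def flow_maps(trajectories, h, w, t):
    """Count inflow/outflow for time interval t on an h x w grid.

    A pedestrian who moves from region A to region B between intervals
    t-1 and t contributes one outflow at A and one inflow at B -- a
    simplified reading of the inflow/outflow definition above.
    """
    flow = np.zeros((h, w, 2))  # channel 0: inflow, channel 1: outflow
    for traj in trajectories:
        for (t0, r0, c0), (t1, r1, c1) in zip(traj, traj[1:]):
            if t1 == t and (r0, c0) != (r1, c1):
                flow[r1, c1, 0] += 1  # entering region (r1, c1)
                flow[r0, c0, 1] += 1  # leaving region (r0, c0)
    return flow
```

With this layout, stacking the maps of consecutive intervals directly yields the tensor sequence consumed by the networks described later.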
External Factors: As mentioned in [6], crowd flow can be affected by many complex external factors, such as meteorological and holiday information. For example, a sudden rain may seriously affect crowd flow evolution, and people gather in commercial areas for celebrations on New Year's Eve. In this paper, we also consider the effect of these external factors. The meteorological information (e.g., weather condition, temperature and wind speed) can be collected from public meteorological websites, such as Wunderground (https://www.wunderground.com/)
. Specifically, the weather condition is categorized into sixteen categories (e.g., sunny and rainy) and digitized with one-hot encoding [42], while temperature and wind speed are scaled into the range [0, 1] with a min-max linear normalization. Multiple categories of holidays (e.g., Chinese Spring Festival and Christmas) can be acquired from a calendar and encoded into a binary vector with one-hot encoding. Finally, we concatenate all external factor data into a 1-D tensor. The external factors tensor at the $t$-th time interval of the $d$-th day is expressed as $E_d^t$ in the following sections.

Crowd Flow Prediction: Given the historical crowd flow maps and external factors data until the $t$-th time interval of the $d$-th day, we aim to predict the crowd flow map $M_d^{t+1}$, which is called short-term prediction in our work. Moreover, we also extend our model to conduct long-term prediction, in which we forecast the crowd flow at the next multiple time intervals.
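The external-factor encoding described above can be illustrated as follows. The category counts follow the TaxiBJ setup (16 weather types, 41 holiday categories), but the temperature range and wind-speed ceiling used for min-max scaling are made-up placeholders:

```python
import numpy as np

def encode_external(weather_id, temp, wind, holiday_ids,
                    n_weather=16, n_holiday=41,
                    temp_range=(-25.0, 41.0), wind_max=50.0):
    """Encode external factors as a single 1-D vector.

    One-hot weather, multi-hot holidays, and min-max-scaled temperature
    and wind speed are concatenated, mirroring the description above.
    The scaling bounds here are illustrative assumptions.
    """
    weather = np.zeros(n_weather)
    weather[weather_id] = 1.0           # one-hot weather condition
    holiday = np.zeros(n_holiday)
    for h in holiday_ids:               # a day may match several categories
        holiday[h] = 1.0
    lo, hi = temp_range
    scaled = np.array([(temp - lo) / (hi - lo),  # temperature -> [0, 1]
                       wind / wind_max])          # wind speed -> [0, 1]
    return np.concatenate([weather, holiday, scaled])
```

The resulting 59-dimensional vector plays the role of $E_d^t$ in the rest of the paper.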
IV Attentive Crowd Flow Machine
In this section, we propose a unified neural network, named Attentive Crowd Flow Machine (ACFM), to learn spatial-temporal representations of crowd flow. ACFM is designed to adequately capture various contextual dependencies of the crowd flow, e.g., the spatial consistency and the long- and short-term temporal dependencies. As shown in Fig. 2, the proposed ACFM consists of two progressive ConvLSTM units connected with a convolutional layer for attention weight prediction at each time step. Specifically, the first ConvLSTM unit learns temporal dependencies from the normal crowd flow features, whose extraction process is described in Section V-A1. Its output hidden state encodes the historical evolution information and is concatenated with the current crowd flow feature for spatial weight map inference. The second ConvLSTM unit takes the re-weighted crowd flow features as input at each time step and is trained to recurrently learn the spatial-temporal representations for further crowd flow prediction.
Let us denote the input crowd flow feature map at the $i$-th iteration as $X_i \in \mathbb{R}^{h \times w \times c}$, with $h$, $w$ and $c$ representing its height, width and number of channels. At this iteration, the first ConvLSTM unit takes $X_i$ as input and updates its memorized cell state $C_i^1$ with an input gate $g_i$ and a forget gate $f_i$. Meanwhile, it updates its new hidden state $H_i^1$ with an output gate $o_i$. The computation process of our first ConvLSTM unit is formulated as:

$$
\begin{aligned}
g_i &= \sigma(W_{xg} \ast X_i + W_{hg} \ast H_{i-1}^1 + b_g), \\
f_i &= \sigma(W_{xf} \ast X_i + W_{hf} \ast H_{i-1}^1 + b_f), \\
o_i &= \sigma(W_{xo} \ast X_i + W_{ho} \ast H_{i-1}^1 + b_o), \\
C_i^1 &= f_i \circ C_{i-1}^1 + g_i \circ \tanh(W_{xc} \ast X_i + W_{hc} \ast H_{i-1}^1 + b_c), \\
H_i^1 &= o_i \circ \tanh(C_i^1),
\end{aligned} \tag{1}
$$

where $\ast$ denotes the convolutional operation, $W_{\ast}$ and $b_{\ast}$ are the parameters of the convolutional layers in the ConvLSTM, $\sigma$ denotes the logistic sigmoid function and $\circ$ is an element-wise multiplication operation. For notational simplicity, we denote Eq. (1) as:

$$
H_i^1, C_i^1 = \mathrm{ConvLSTM}_1(H_{i-1}^1, C_{i-1}^1, X_i). \tag{2}
$$

Generated from the memorized cell state $C_i^1$, the new hidden state $H_i^1$ encodes the dynamic evolution of historical crowd flow from a temporal view.
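As a concrete reference for Eq. (1), a single ConvLSTM step can be sketched in NumPy as follows. This is a minimal illustration, assuming 3x3 kernels, 'same' padding and no peephole connections; shapes and initialization are illustrative rather than the paper's exact configuration:

```python
import numpy as np

def conv_same(x, k):
    """'Same' 2-D convolution: x has shape (cin, h, w), k (cout, cin, 3, 3)."""
    cin, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((k.shape[0], h, w))
    for o in range(k.shape[0]):
        for c in range(cin):
            for di in range(3):
                for dj in range(3):
                    out[o] += k[o, c, di, dj] * xp[c, di:di + h, dj:dj + w]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h_prev, c_prev, W):
    """One ConvLSTM step following Eq. (1).

    W maps each gate name ('i', 'f', 'o', 'c') to a tuple
    (input kernel, hidden kernel, bias).
    """
    pre = {g: conv_same(x, kx) + conv_same(h_prev, kh) + b
           for g, (kx, kh, b) in W.items()}
    i, f, o = sigmoid(pre['i']), sigmoid(pre['f']), sigmoid(pre['o'])
    c = f * c_prev + i * np.tanh(pre['c'])  # memorized cell state
    h = o * np.tanh(c)                      # new hidden state
    return h, c
```

Because the hidden state is produced through a sigmoid-gated tanh, its values always stay in (-1, 1), regardless of the input scale.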
We then integrate a deep attention mechanism to dynamically model the spatial dependencies of crowd flow. Specifically, we incorporate the historical state $H_i^1$ and the current feature $X_i$ to infer an attention map $W_i$, which is implemented as:

$$
W_i = \sigma(w_a \ast [H_i^1, X_i]), \tag{3}
$$

where $[\cdot,\cdot]$ denotes a feature concatenation operation and $w_a$ is the parameters of a convolutional layer with a kernel size of $1 \times 1$. The attention map $W_i$ is learned to discover the weight of each spatial location on the input feature map $X_i$.
Finally, we learn a more effective spatial-temporal representation under the guidance of the attention map. After re-weighting the normal crowd flow feature map by multiplying $X_i$ and $W_i$ element by element, we feed it into the second ConvLSTM unit to generate a new hidden state $H_i^2$, which is expressed as:

$$
H_i^2, C_i^2 = \mathrm{ConvLSTM}_2(H_{i-1}^2, C_{i-1}^2, W_i \circ X_i), \tag{4}
$$

where $H_i^2$ encodes the attention-aware content of the current input as well as memorizes the contextual knowledge of previous moments. When the elements of a sequence of crowd flow maps are recurrently fed into ACFM, the last hidden state encodes the information of the whole sequence and can be used as the spatial-temporal representation for evolution analysis of the future flow map.
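The full ACFM recurrence of Eqs. (2)-(4) then reduces to the wiring below. This is a structural sketch, not the trained model: the two recurrent units are passed in as callables (any ConvLSTM implementation with the Eq. (2) interface can be plugged in), and the 1x1 attention convolution is expressed as a tensor contraction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def acfm(seq, step1, step2, w_att):
    """Run ACFM over a feature sequence.

    step1/step2: callables (x, h, c) -> (h, c), standing in for the two
    ConvLSTM units of Eq. (2). w_att has shape (1, 2c) and acts as a 1x1
    convolution over the channel-concatenated [H^1_i, X_i] of Eq. (3).
    """
    c, hh, ww = seq[0].shape
    h1 = c1 = h2 = c2 = np.zeros((c, hh, ww))
    for x in seq:
        h1, c1 = step1(x, h1, c1)                     # temporal context
        feat = np.concatenate([h1, x], axis=0)        # [H^1_i, X_i]
        att = sigmoid(np.tensordot(w_att, feat, 1))   # spatial weights, Eq. (3)
        h2, c2 = step2(att * x, h2, c2)               # re-weighted input, Eq. (4)
    return h2  # spatial-temporal representation of the whole sequence
```

Returning only the final hidden state matches the usage above: the last state summarizes the whole input sequence for downstream prediction.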
V Citywide Crowd Flow Prediction
In this section, we first develop a deep neural network framework which incorporates the proposed ACFM for citywide short-term crowd flow prediction. We then extend this framework to predict long-term crowd flow with an extra LSTM prediction network.
V-A Short-term Prediction
As illustrated in Fig. 3, our short-term prediction framework consists of four components: (1) a normal feature extraction (NFE) module, (2) a sequential representation learning (SRL) module, (3) a periodic representation learning (PRL) module and (4) a temporally-varying fusion (TVF) module. First, the NFE module extracts the normal features of the crowd flow map and the external factors tensor at each time interval. Second, the SRL and PRL modules model the contextual dependencies of crowd flow at different temporal scales. Third, the TVF module adaptively merges the feature representations of SRL and PRL with fusion weights learned from the comprehensive features of various factors. Finally, the merged feature map is fed into an additional convolutional layer for crowd flow map inference. For convenience, this framework is denoted as Sequential-Periodic Network (SPN) in the following sections.
V-A1 Normal Feature Extraction
We first describe how to extract the normal features of crowd flow and external factors, which will be further fed into the SRL and PRL modules for dynamic spatialtemporal representation learning.
As shown in Fig. 5, we utilize a customized ResNet [43] to automatically learn a feature embedding from the given crowd flow map $M_d^t$. Specifically, our ResNet consists of a stack of residual units, each of which has two convolutional layers with a channel number of 16 and a kernel size of $3 \times 3$. To maintain the spatial resolution, we set the strides of all convolutional layers to 1 and do not adopt any pooling layers in the ResNet. Following [6], we first scale $M_d^t$ into the range $[-1, 1]$ with a min-max linear normalization and then feed it into the ResNet to generate the crowd flow feature, which is denoted as $F_d^t$.

Then, we extract the feature of the given external factors tensor $E_d^t$ with a multilayer perceptron implemented with two fully-connected layers. The first layer has 40 output neurons, and the output of the second layer is reshaped to form the 3-D external factor feature $O_d^t$. Finally, we fuse $F_d^t$ and $O_d^t$ to generate an embedded feature $X_d^t$, which is expressed as:

$$
X_d^t = [F_d^t, O_d^t], \tag{5}
$$

where $[\cdot,\cdot]$ denotes feature concatenation. $X_d^t$ is the normal feature at a specific time interval and is unaware of the dynamic spatial dependencies of crowd flow. Thus, the following two modules are proposed to dynamically learn the spatial-temporal representation.
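The external-factor branch of NFE can be sketched as follows. The 40-unit first layer follows the text; the ReLU activation and the exact width of the second layer are illustrative assumptions:

```python
import numpy as np

def normal_feature(flow_feat, ext, W1, b1, W2, b2, h, w):
    """NFE fusion sketch, Eq. (5).

    Embed the external-factor vector with a two-layer MLP, reshape it to
    a 2-D map, and concatenate it channel-wise with the flow feature.
    """
    z = np.maximum(W1 @ ext + b1, 0.0)   # FC-1 (40 units in the paper)
    e = (W2 @ z + b2).reshape(-1, h, w)  # FC-2, reshaped to a 3-D tensor
    return np.concatenate([flow_feat, e], axis=0)  # channel concatenation
```

The concatenated tensor corresponds to the normal feature $X_d^t$ consumed by the SRL and PRL modules.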
V-A2 Sequential Representation Learning
The evolution of citywide crowd flow is usually affected by the recent traffic states. For instance, a traffic accident occurring on a main road of the studied city during morning rush hours may seriously affect the crowd flow of nearby regions in subsequent time intervals. In this subsection, we develop a sequential representation learning (SRL) module based on the proposed ACFM to fully model the evolution trend of crowd flow.
First, we take the normal crowd flow features of the $n$ most recent time intervals to form a group of sequential temporal features, which is denoted as:

$$
S = \{\, X_d^{t-n+1}, \ldots, X_d^{t-1}, X_d^t \,\}, \tag{6}
$$

where $n$ is the length of the sequentially related time intervals. We then apply the proposed ACFM to learn a sequential representation from the temporal features $S$. As shown on the left of Fig. 3, at each iteration, ACFM takes one element of $S$ as input and learns to selectively memorize the spatial-temporal context of the sequential crowd flow. Finally, we obtain the sequential representation $S_f$ by feeding the last hidden state of ACFM into a convolutional layer. $S_f$ encodes the sequential evolution trend of crowd flow.
V-A3 Periodic Representation Learning
In urban transportation systems, there exist some periodicities which make a significant impact on the changes of traffic flow. For example, the traffic conditions are very similar during morning rush hours of consecutive workdays, repeating every 24 hours. Thus, in this subsection, we propose a periodic representation learning (PRL) module that fully captures the periodic dependencies of crowd flow with the proposed ACFM.
Similar to the sequential representation learning, we first construct a group of periodic temporal features:

$$
P = \{\, X_{d-m}^{t+1}, X_{d-m+1}^{t+1}, \ldots, X_{d-1}^{t+1} \,\}, \tag{7}
$$

where $m$ is the length of the periodic days. At each iteration, we feed one element of $P$ into ACFM to dynamically learn the periodic dependencies, as shown on the right of Fig. 3. After the last iteration, we feed the hidden state of ACFM into a convolutional layer to generate the final periodic representation $P_f$. Encoding the periodic evolution trend of crowd flow, $P_f$ is proven to be effective for traffic prediction in our experiments.
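The two feature groups of Eqs. (6) and (7) differ only in which axis they slide along, which the following index-construction sketch makes explicit (assuming the prediction target is interval $t+1$ of day $d$):

```python
def sequential_indices(d, t, n):
    """(day, interval) indices for the sequential features of Eq. (6):
    the n most recent intervals of day d, up to and including t."""
    return [(d, t - k) for k in range(n - 1, -1, -1)]

def periodic_indices(d, t, m):
    """(day, interval) indices for the periodic features of Eq. (7):
    the same target interval t+1 on each of the previous m days."""
    return [(d - k, t + 1) for k in range(m, 0, -1)]
```

With the paper's settings n = 4 and m = 2, the SRL branch looks back four half-hours while the PRL branch looks back two days at the target slot.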
V-A4 Temporally-Varying Fusion
As described in the previous two modules, the future crowd flow is affected by the sequential representation $S_f$ and the periodic representation $P_f$ simultaneously. We find that the relative importance of these two representations is temporally dynamic, and directly concatenating them without any specific preprocessing is suboptimal. To address this issue, we propose a novel temporally-varying fusion (TVF) module that adaptively fuses the representations $S_f$ and $P_f$ with different weights learned from the comprehensive features of various internal and external factors.
In the TVF module, we take the sequential representation $S_f$, the periodic representation $P_f$ and the integrative external factor feature $O_E$ to determine the fusion weight, where $O_E$ is the element-wise addition of the external factor features of all input time intervals. As shown in Fig. 3, we first feed the concatenation of $S_f$, $P_f$ and $O_E$ into two fully-connected layers for fusion weight inference. The first layer has 32 output neurons and the second one has only one neuron. We then obtain the fusion weight $r$ of $S_f$ by applying a sigmoid function to the output of the second FC layer; the weight of $P_f$ is automatically set to $1-r$. We then fuse these two temporal representations on the basis of the learned weights and compute a comprehensive spatial-temporal representation as:

$$
X_{fu} = r \cdot S_f + (1-r) \cdot P_f, \tag{8}
$$

where $X_{fu}$ contains the sequential and periodic dependencies of crowd flow.
Finally, we feed $X_{fu}$ into a convolutional layer with two filters to predict the future crowd flow map with the following formula:

$$
\hat{M}_d^{t+1} = \tanh(w_p \ast X_{fu}), \tag{9}
$$

where $w_p$ is the parameters of the predictive convolutional layer and the hyperbolic tangent $\tanh$ ensures the output values are within the range $[-1, 1]$. Further, the predicted map is rescaled back to normal values with an inverted min-max linear normalization.
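A compact sketch of the TVF module and prediction head follows. For brevity the two fully-connected layers of the fusion branch are collapsed into one linear map, the prediction convolution is taken as 1x1, and all weights are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tvf_predict(s, p, e, w_fuse, w_pred, fmin, fmax):
    """Temporally-varying fusion (Eq. (8)) plus prediction head (Eq. (9)).

    s, p: sequential/periodic representations (c, h, w); e: external
    feature vector; w_fuse: single linear layer standing in for the two
    FC layers of the paper; w_pred: 1x1 prediction convolution.
    """
    feat = np.concatenate([s.ravel(), p.ravel(), e.ravel()])
    r = sigmoid(w_fuse @ feat)          # scalar fusion weight in (0, 1)
    fused = r * s + (1.0 - r) * p       # Eq. (8): weighted merge
    pred = np.tanh(np.tensordot(w_pred, fused, 1))  # Eq. (9): tanh head
    # inverse min-max normalization from [-1, 1] back to flow counts
    return (pred + 1.0) / 2.0 * (fmax - fmin) + fmin
```

With zero weights the sigmoid yields r = 0.5, i.e., the module degrades gracefully to an equal-weight average of the two branches.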
V-B Long-term Prediction
In this subsection, we extend our method to predict longer-term crowd flow. With a setting similar to short-term prediction, we incorporate the sequential data and periodic data of previous time intervals to forecast the crowd flow at the next four time intervals. For convenience, this model is denoted as SPN-LONG in the following sections.
The architecture of our SPN-LONG is shown in Fig. 4. For every previous time interval, we first extract its normal features with the proposed NFE module. Then, the features in $S$ are recurrently fed into ACFM to learn the sequential representation. The output sequential representation is then fed into an LSTM prediction network. With four ConvLSTM units, this prediction network is designed to forecast the crowd flow at the next four time intervals. Specifically, at each ConvLSTM unit, we use a TVF module to adaptively fuse its hidden state with the corresponding periodic representation learned from $P$, where the integrative external factor feature is the element-wise addition of the external factor features of the input intervals. Finally, we take the output of each TVF module to predict the corresponding crowd flow map with a convolutional layer.
Table I: Summaries of the TaxiBJ and BikeNYC datasets.

Dataset                      | TaxiBJ                 | BikeNYC
Crowd Flow:
  City                       | Beijing                | New York
  Grid Map Size              | (32, 32)               | (16, 8)
  Data Type                  | Taxi GPS               | Bike Rent
  Time Span                  | 7/1/2013 - 10/30/2013  | 4/1/2014 - 9/30/2014
                             | 3/1/2014 - 6/30/2014   |
                             | 3/1/2015 - 6/30/2015   |
                             | 11/1/2015 - 4/10/2016  |
  # Taxis/Bikes              | 34,000+                | 6,800+
  Time Interval              | 0.5 hour               | 1 hour
  # Available Time Intervals | 22,459                 | 4,392
External Factors:
  # Holidays                 | 41                     | 20
  Weather Conditions         | 16 types (e.g., Sunny, Rainy)
  Temperature / °C           |
  Wind Speed / mph           |
VI Experiments
In this section, we first introduce the commonly-used benchmarks and evaluation metrics for citywide crowd flow prediction. Then, we compare the proposed approach with several state-of-the-art methods under different settings. Further, we conduct an extensive component analysis to demonstrate the effectiveness of each component of our model. Finally, we apply the proposed method to passenger pick-up/drop-off demand forecasting to show its generalization to other traffic prediction tasks.
VI-A Experiment Settings
VI-A1 Dataset Settings
In this work, we forecast the inflow and outflow of citywide crowds on two public benchmarks: the TaxiBJ dataset [6] for taxicab flow prediction and the BikeNYC dataset [11] for bike flow prediction. These two datasets are summarized in Table I (the details of TaxiBJ and BikeNYC are quoted from [6]).
TaxiBJ Dataset: In this dataset, a mass of taxi GPS trajectories were collected from over 34 thousand taxicabs in Beijing for more than 16 months. The time interval is half an hour, and 22,459 crowd flow maps of size 32×32 are generated from these trajectory data. The external factors contain weather conditions, temperature, wind speed and 41 categories of holidays. This dataset is divided into a training set and a testing set officially: the data of the last four weeks are used for evaluation and the rest are used for training.
BikeNYC Dataset: Generated from 182 days of NYC bike trajectory data, this dataset contains 4,392 crowd flow maps with a time interval of one hour, and the size of these maps is 16×8. As for external factors, 20 categories of holidays are recorded. The data of the first 172 days are used for training and the data of the last ten days are chosen as the test set.
VI-A2 Implementation Details
We adopt the PyTorch [44] toolbox to implement our crowd flow prediction network. The sequential length $n$ and the periodic length $m$ are set to 4 and 2, respectively. For a fair comparison with ST-ResNet [6], we build the customized ResNet of Section V-A1 with 12 residual units on the TaxiBJ dataset and 4 residual units on the BikeNYC dataset. The filter weights of all convolutional and fully-connected layers are initialized with Xavier initialization [45]. The size of a mini-batch is set to 64. We optimize the parameters of our network in an end-to-end manner via Adam optimization [46] by minimizing the Euclidean loss on a GTX 1080Ti GPU.

VI-A3 Evaluation Metrics
In crowd flow prediction, Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are two popular metrics that are extensively used to measure the performance of all methods. Specifically, they are defined as:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{z}\sum_{i=1}^{z} \left\| \hat{M}_i - M_i \right\|_2^2}, \qquad
\mathrm{MAE} = \frac{1}{z}\sum_{i=1}^{z} \left\| \hat{M}_i - M_i \right\|_1, \tag{10}
$$

where $\hat{M}_i$ and $M_i$ represent the $i$-th predicted flow map and its ground-truth map, respectively, and $z$ indicates the number of samples used for validation. However, some partitioned regions of New York City are water areas and their flows are always zero. This phenomenon would decrease the mean error and make it hard to distinguish the capacities of different methods. To correctly reflect the performance of different methods on the BikeNYC dataset, we rescale their mean errors with a ratio (1.58) provided by ST-ResNet.
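The two metrics of Eq. (10) can be computed as follows; this sketch uses the per-element averaging variant that is common in practice (normalizing by the total number of map entries rather than only by the sample count):

```python
import numpy as np

def rmse_mae(pred, gt):
    """RMSE and MAE between predicted and ground-truth flow maps."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))  # root mean square error
    mae = np.mean(np.abs(pred - gt))           # mean absolute error
    return rmse, mae
```

Since RMSE squares the residuals before averaging, it penalizes large errors on high-flow regions more heavily than MAE does, which is why both are reported.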
Table II: Quantitative comparisons on the TaxiBJ and BikeNYC datasets for short-term prediction.

Method     | TaxiBJ RMSE | TaxiBJ MAE | BikeNYC RMSE | BikeNYC MAE
HA         | 57.79       | -          | 21.57        | -
SARIMA     | 26.88       | -          | 10.56        | -
VAR        | 22.88       | -          | 9.92         | -
ARIMA      | 22.78       | -          | 10.07        | -
ST-ANN     | 19.57       | -          | -            | -
DeepST     | 18.18       | -          | 7.43         | -
VPN        | 16.75       | 9.62       | 6.17         | 3.68
ST-ResNet  | 16.69       | 9.52       | 6.37         | 2.95
PredNet    | 16.68       | 9.67       | 7.45         | 3.71
PredRNN    | 16.34       | 9.62       | 5.99         | 4.89
SPN (Ours) | 15.31       | 9.14       | 5.59         | 2.74
VI-B Comparison for Short-term Prediction
In this subsection, we compare the proposed method with ten representative methods for short-term crowd flow prediction. The compared methods fall into three categories: (i) traditional models for time series forecasting, (ii) deep learning networks particularly designed for crowd flow prediction and (iii) state-of-the-art approaches originally designed for related tasks. The details of the compared methods are as follows.

HA: Historical Average (HA) is a simple model that predicts the future crowd flow by averaging the historical flows of the corresponding periods. For example, the predicted flow from 7:00 am to 7:30 am on a specific Tuesday is the average flow from 7:00 am to 7:30 am on all historical Tuesdays.

ARIMA [47]: AutoRegressive Integrated Moving Average (ARIMA) is a famous statistical analysis model that uses time series data to predict future trends.

SARIMA [18]: Seasonal ARIMA (SARIMA) is an advanced variant of ARIMA that considers the seasonal terms.

VAR [48]: Vector AutoRegression (VAR) is a wellknown stochastic process model and it can capture the linear interdependencies among multiple time series.

ST-ANN [6]: As an artificial neural network, this model extracts spatial (the values of the 8 nearby regions) and temporal (the 8 previous time intervals) features to forecast the future crowd flow.

DeepST [11]: This is a DNN-based model that utilizes various temporal properties to conduct prediction.

ST-ResNet [6]: As an advanced version of DeepST, this model incorporates closeness, period and trend data as well as external factors to predict crowd flow with residual networks.

VPN [49]: Video Pixel Network (VPN) is a probabilistic video model designed for multi-frame prediction. A variant of VPN based on residual multiplicative blocks (RMBs) is implemented to predict crowd flow.

PredNet [50]: As a predictive neural network, this model was originally developed to predict the next frame of a video sequence. We apply this method to crowd flow prediction.

PredRNN [51]: This method utilizes a predictive recurrent neural network to memorize both spatial appearances and temporal variations in order to generate future images. It is implemented to forecast the future crowd flow in this work.
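The HA baseline described first in the list above admits a very short implementation; the calendar-metadata format used here is a hypothetical stand-in for the datasets' real timestamps:

```python
import numpy as np

def historical_average(maps, timestamps, target_weekday, target_hour):
    """HA baseline: average all historical maps sharing the target slot.

    `timestamps` is a list of (weekday, hour) tuples aligned with `maps`.
    The prediction for a given weekday/hour is simply the mean of every
    historical map observed at that same weekday and hour.
    """
    slot = [m for m, (wd, h) in zip(maps, timestamps)
            if (wd, h) == (target_weekday, target_hour)]
    return np.mean(slot, axis=0)
```

Despite its simplicity, HA serves as a useful floor: any learned model should at least beat this slot-wise mean.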
The performance of the proposed method and the ten compared methods is summarized in Table II. Among these methods, the baseline model HA obtains an RMSE of 57.79 on the TaxiBJ dataset and 21.57 on the BikeNYC dataset. Although they make some progress, the traditional time series models (e.g., ARIMA and SARIMA) still perform poorly on both datasets. The recent deep learning based methods decrease the errors to some extent (e.g., ST-ResNet decreases the RMSE to 16.69 on TaxiBJ), but this task is still far from being perfectly solved. By contrast, our method further improves the performance by explicitly learning spatial-temporal features and modeling the attention weighting of each spatial influence. Specifically, our method achieves an RMSE of 15.31 on the TaxiBJ dataset, outperforming the previous best approach PredRNN with a relative improvement of 6.3%. On the BikeNYC dataset, our method also boosts the prediction accuracy, decreasing the RMSE from 5.99 to 5.59, and outperforms the other methods. Moreover, we compare the performance of five deep learning based methods over different time ranges: weekday (Monday to Friday), weekend (Saturday and Sunday), day (6:00 to 18:00) and night (18:00 to 6:00). As shown in Fig. 6, our method outperforms the compared methods under all of these settings, which demonstrates its robustness.
We further measure the RMSE on regions with high crowd flow, since in some applications we are more concerned about the predicted results of these regions. We first rank the 1,024 regions of Beijing on the basis of their average crowd flow on the training set and then choose the top k% regions to conduct the evaluation. As shown in Fig. 7, the RMSE of all five deep learning based methods is large on the top-10% regions, and our method obtains an RMSE of 32.11, which shows that there is still much room for improvement on this task. As the percentage increases, the RMSE of all methods gradually drops, and our method outperforms the compared methods consistently. These comparisons demonstrate the superiority of our method.
VI-C Comparison for Long-term Prediction
TABLE III: RMSE of long-term crowd flow prediction on the TaxiBJ dataset at time intervals 1-4.

Method | 1 | 2 | 3 | 4
ST-ResNet | 16.75 | 19.56 | 21.46 | 22.91
VPN | 17.42 | 20.50 | 22.58 | 24.26
PredNet | 27.55 | 254.68 | 255.54 | 255.47
PredRNN | 16.08 | 19.51 | 20.66 | 22.69
SPN (Ours) | 15.31 | 19.59 | 23.70 | 28.61
SPN-LONG (Ours) | 15.42 | 17.63 | 19.08 | 20.83
TABLE IV: RMSE of long-term crowd flow prediction on the BikeNYC dataset at time intervals 1-4.

Method | 1 | 2 | 3 | 4
ST-ResNet | 6.45 | 7.47 | 8.77 | 10.28
VPN | 6.55 | 8.01 | 8.86 | 9.41
PredNet | 7.46 | 8.95 | 10.08 | 10.93
PredRNN | 5.97 | 7.37 | 8.61 | 9.40
SPN (Ours) | 5.59 | 7.81 | 11.96 | 15.74
SPN-LONG (Ours) | 5.81 | 6.80 | 7.54 | 7.90
In this subsection, we apply the customized SPN-LONG to predict long-term crowd flow and compare it with four deep learning based methods^{4}. These compared methods have been fine-tuned for long-term prediction. As shown in Table III, the RMSE of all methods gradually increases on the TaxiBJ dataset as they attempt to forecast longer-term flow. PredNet performs particularly poorly in this scenario, since it was originally designed for one-frame prediction and has a low capacity for long-term prediction. By contrast, our method suffers only minor performance degradation and outperforms the other methods at each time interval. Specifically, it achieves the lowest RMSE of 20.83 at the fourth time interval, a relative improvement of 8.2% over the previous best-performing method PredRNN. Moreover, we also evaluate the original SPN for long-term prediction, where it forecasts crowd flow in a rolling manner. As shown in the penultimate row of Table III, it performs worse than SPN-LONG, so we conclude that it is essential to adapt and retrain SPN for long-term prediction. We also conduct long-term prediction on the BikeNYC dataset and find that our SPN-LONG consistently outperforms the compared methods, as shown in Table IV. These experiments demonstrate the effectiveness of the customized SPN-LONG for long-term crowd flow prediction.

^{4} On the TaxiBJ dataset, the performances of all compared methods are directly quoted from [12]. On the BikeNYC dataset, we implement all compared methods and evaluate their performances.
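The rolling-style forecasting used for the original SPN can be sketched as below. The `rolling_forecast` helper and its window handling are hypothetical: any one-step model callable can be plugged in, and its own predictions are fed back as inputs for subsequent steps.

```python
import numpy as np

def rolling_forecast(model, history, steps):
    """Long-term prediction in a rolling manner (a sketch).

    model:   any callable mapping a stacked window of past flow maps,
             shape (T, ...), to the next flow map of shape (...)
    history: list of the most recent flow maps (the input window)
    steps:   number of future intervals to forecast
    """
    window = list(history)
    preds = []
    for _ in range(steps):
        nxt = model(np.stack(window))   # one-step-ahead prediction
        preds.append(nxt)
        window = window[1:] + [nxt]     # slide the window: feed prediction back
    return np.stack(preds)
```

Because each step consumes earlier predictions, errors compound with the horizon, which is consistent with the original SPN degrading at intervals 3 and 4 in Table III while the retrained SPN-LONG does not.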
VI-D Component Analysis
As described in Section V, our full model consists of four components: normal feature extraction, sequential representation learning, periodic representation learning and the temporally-varying fusion module. In this section, we implement eight variants of our full model in order to verify the effectiveness of each component:
- PCNN: directly concatenates the periodic features and feeds them to a convolutional layer with two filters to predict the future crowd flow;
- SCNN: directly concatenates the sequential features and feeds them to a convolutional layer to predict the future crowd flow;
- PRNN-w/o-Attention: takes the periodic features as input and learns the periodic representation with an LSTM layer to predict the future crowd flow;
- PRNN: takes the periodic features as input and learns the periodic representation with the proposed ACFM to predict the future crowd flow;
- SRNN-w/o-Attention: takes the sequential features as input and learns the sequential representation with an LSTM layer for crowd flow estimation;
- SRNN: takes the sequential features as input and learns the sequential representation with the proposed ACFM to predict the future crowd flow;
- SPN-w/o-Ext: does not consider the effect of external factors and is trained directly on the crowd flow maps;
- SPN-w/o-Fusion: directly merges the sequential and periodic representations with equal weights (0.5) to predict the future crowd flow.

TABLE V: Performance of our SPN and its variants on the TaxiBJ dataset.

Method | RMSE | MAE
PCNN | 33.91 | 17.16
PRNN-w/o-Attention | 33.51 | 16.70
PRNN | 32.89 | 16.64
SCNN | 17.15 | 9.56
SRNN-w/o-Attention | 16.20 | 9.43
SRNN | 15.82 | 9.34
SPN-w/o-Ext | 16.84 | 9.83
SPN-w/o-Fusion | 15.67 | 9.40
SPN | 15.31 | 9.14
Effectiveness of Sequential Representation Learning: As shown in Table V, the baseline variant SCNN, which directly concatenates the sequential features for prediction, obtains an RMSE of 17.15. When the sequential contextual dependencies of crowd flow are explicitly modeled with the proposed ACFM, the variant SRNN decreases the RMSE to 15.82, a 7.75% relative improvement over SCNN, which indicates the effectiveness of sequential representation learning.
Effectiveness of Periodic Representation Learning: We also explore different network architectures for learning the periodic representation. As shown in Table V, PCNN, which estimates the flow map by simply concatenating all of the periodic features, only achieves an RMSE of 33.91. In contrast, when ACFM is introduced to learn the periodic representation, the RMSE drops to 32.89. This experiment further demonstrates the effectiveness of the proposed ACFM for spatial-temporal modeling.
Effectiveness of Spatial Attention: As shown in Table V, adopting spatial attention, PRNN decreases the RMSE by 0.62 compared to PRNN-w/o-Attention, and SRNN obtains a similar improvement over SRNN-w/o-Attention. Fig. 8 and Fig. 9 show some attentional maps generated by our method, together with the residual maps between the input crowd flow maps and their corresponding ground truth. We observe that the attentional maps and the residual maps are, to some extent, negatively correlated: roughly, the greater the residual of a region, the smaller its attention weight, and vice versa. This indicates that our ACFM is able to locate reliable regions at each time step and make better predictions by inferring the trend of evolution; the impact of regions with large residuals is suppressed by multiplying their location features by small weights. The visualized attentional maps also reveal which regions have the primary positive impact on the future flow prediction. Overall, the proposed model not only improves the prediction accuracy but also enhances the interpretability of the model to a certain extent.
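The region re-weighting behavior described here can be illustrated with a minimal sketch. The softmax normalization and the `apply_spatial_attention` helper are assumptions for illustration, not the exact ACFM formulation.

```python
import numpy as np

def apply_spatial_attention(features, scores):
    """Weight per-location features by a spatial attention map (a sketch).

    features: (H, W, C) feature map
    scores:   (H, W) unnormalized attention scores; low scores suppress
              regions whose flow is hard to predict
    """
    flat = scores.reshape(-1)
    w = np.exp(flat - flat.max())                  # numerically stable softmax
    w = (w / w.sum()).reshape(scores.shape)        # attention map sums to 1
    return features * w[..., None]                 # broadcast over channels
```

With uniform scores every region contributes equally; learned scores instead concentrate weight on regions whose evolution is regular, matching the negative correlation with the residual maps noted above.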
Necessity of External Factors: Without modeling the effect of external factors, the variant SPN-w/o-Ext obtains an RMSE of 16.84 on the TaxiBJ dataset, a performance degradation of 10% compared to SPN. The main reason for this degradation is that adverse meteorological conditions (e.g., rain and snow) or holidays can seriously affect the crowd flow. It is therefore necessary to incorporate external factors when modeling crowd flow evolution.
Effectiveness of Temporally-Varying Fusion: When the two temporal representations are merged directly with equal contributions (0.5), SPN-w/o-Fusion achieves only a negligible improvement over SRNN. In contrast, with our proposed fusion strategy, the full model SPN decreases the RMSE from 15.82 to 15.31, a relative improvement of 3.2% over SRNN. These results show that the contributions of the two representations are not equal and are influenced by various factors. The proposed fusion strategy adaptively merges the different temporal representations and further improves the performance of crowd flow prediction.
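A minimal sketch of such a context-dependent fusion is given below; the sigmoid gate computed from an external-context vector is an illustrative assumption, not the exact temporally-varying fusion module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(seq_rep, per_rep, gate_w, ext):
    """Temporally-varying fusion (a sketch): instead of a fixed 0.5/0.5
    merge, a scalar gate predicted from the temporal/external context
    `ext` decides how much each representation contributes.

    seq_rep, per_rep: sequential / periodic representations (same shape)
    gate_w:           assumed learned projection vector
    ext:              context feature vector for the current interval
    """
    g = sigmoid(float(np.dot(gate_w, ext)))    # contribution of sequential part
    return g * seq_rep + (1.0 - g) * per_rep   # convex combination
```

When `gate_w` is all zeros the gate degenerates to 0.5 and the sketch reduces to the equal-weight SPN-w/o-Fusion baseline; training the gate end-to-end is what lets the contributions vary over time.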
Further Discussion: To analyze how each temporal representation contributes to the performance of crowd flow prediction, we measure the average fusion weights of the two temporal representations at each time interval on the testing set. As shown on the left of Fig. 10, the fusion weights of the sequential representation are greater than those of the periodic representation. To explain this phenomenon, we further measure i) the RMSE of crowd flow between two consecutive time intervals, denoted as “Pre-Hour”, and ii) the RMSE of crowd flow between two adjacent days at the same time interval, denoted as “Pre-Day”. As shown on the right of Fig. 10, the RMSE of “Pre-Day” is much higher than that of “Pre-Hour” at most times, except for the wee hours. Based on this observation, we conclude that the sequential representation is more essential for crowd flow prediction, since the sequential data is more regular. Although its weight is low, the periodic representation still helps to improve the prediction performance both qualitatively and quantitatively; for example, incorporating it decreases the RMSE of SRNN by 3.2%.
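The “Pre-Hour” and “Pre-Day” quantities are simply the errors of naive lag predictors that copy the flow from a fixed number of intervals back, which can be computed as in this sketch (the `naive_rmse` helper is illustrative):

```python
import numpy as np

def naive_rmse(flow, lag):
    """RMSE of the naive predictor that copies the flow `lag` intervals back.

    flow: (T, ...) array of flow maps over T time intervals
    lag:  1 for "Pre-Hour" (previous interval); the number of intervals
          per day for "Pre-Day" (same interval, previous day)
    """
    err = flow[lag:].astype(float) - flow[:-lag].astype(float)
    return float(np.sqrt((err ** 2).mean()))
```

A smaller value means the corresponding signal is more regular, which is why a lower "Pre-Hour" RMSE supports giving the sequential representation a larger fusion weight.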
VI-E Extension to Citywide Passenger Demand Prediction
Our ACFM is a general model for urban mobility modeling. Apart from crowd flow prediction, it can also be applied to other related traffic tasks, such as citywide passenger demand prediction. In this subsection, we extend the proposed method to forecast passenger pickup/dropoff demands at the next time interval (half an hour) from historical mobility trips.
We conduct experiments with taxi trips in New York City. First, we choose the Manhattan borough as the studied area, since most taxi transactions are made there. Manhattan is divided into a grid map, and each grid cell represents a geographical region. Second, we collect 132 million taxicab trip records during 2014 from the New York City Taxi and Limousine Commission (NYC-TLC^{5}). Each record contains the timestamp and the geo-coordinates of the pickup and dropoff locations. For each region, we measure the passenger pickup/dropoff demands in every half hour. We collect external meteorological factors (e.g., temperature, wind speed and weather conditions) from Wunderground, and holidays are also marked. Finally, we train our SPN on the historical demand of the first 337 days and test on the last four weeks. When evaluating, we filter out regions whose ground-truth demand is less than 5, since such low demand is usually negligible in real-world applications.

^{5} https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
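The conversion of raw trip records into half-hourly demand maps can be sketched as follows; the grid dimensions, argument names, and the `build_pickup_maps` helper are assumptions for illustration, since the exact preprocessing is not reproduced here.

```python
import numpy as np

def build_pickup_maps(trips, lat_rng, lon_rng, n_slots, H, W, slot_sec=1800):
    """Rasterize taxi trips into half-hourly pickup demand maps (a sketch).

    trips:   iterable of (unix_ts, lat, lon) pickup events
    lat_rng: (lat_min, lat_max) bounds of the studied area
    lon_rng: (lon_min, lon_max) bounds of the studied area
    n_slots: number of half-hour intervals covered
    H, W:    assumed grid dimensions of the studied area
    """
    maps = np.zeros((n_slots, H, W), dtype=np.int32)
    for ts, lat, lon in trips:
        t = int(ts // slot_sec)                                  # time slot index
        i = int((lat - lat_rng[0]) / (lat_rng[1] - lat_rng[0]) * H)
        j = int((lon - lon_rng[0]) / (lon_rng[1] - lon_rng[0]) * W)
        if 0 <= t < n_slots and 0 <= i < H and 0 <= j < W:       # drop out-of-area trips
            maps[t, i, j] += 1
    return maps
```

Dropoff maps are built the same way from dropoff coordinates; stacking the two gives the two-channel demand maps that the model consumes.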
TABLE VI: Performance of short-term passenger demand prediction on the NYC-TLC dataset.

Method | RMSE | MAE
HA | 42.23 | 23.62
VPN | 20.22 | 12.57
DeepST | 20.06 | 12.39
ST-ResNet | 19.55 | 11.94
PredNet | 19.98 | 12.34
PredRNN | 19.26 | 11.77
SPN | 18.75 | 11.36
TABLE VII: RMSE of long-term passenger demand prediction at time intervals 1-4.

Method | 1 | 2 | 3 | 4
ST-ResNet | 19.60 | 24.75 | 30.53 | 37.35
VPN | 21.14 | 24.80 | 27.57 | 30.68
PredNet | 19.87 | 24.21 | 28.04 | 31.69
PredRNN | 19.18 | 23.38 | 27.48 | 31.71
SPN-LONG (Ours) | 18.83 | 21.72 | 24.02 | 26.46
We compare our method with HA and five deep learning based methods. As shown in Table VI, the baseline method HA obtains an RMSE of 42.23 and a MAE of 23.62, which is impractical for the taxi industry. By contrast, our method dramatically decreases the RMSE to 18.75 and outperforms the other compared methods for short-term prediction. Moreover, we adapt and retrain the deep learning based methods to forecast long-term demand and summarize their RMSE in Table VII. Our SPN-LONG model achieves the best performance at every time interval; in particular, it obtains a relative improvement of 16.55% over PredRNN at the fourth time interval. These experiments show that the proposed method is also effective for passenger demand prediction.
VII Conclusion
In this work, we utilize massive human trajectory data collected from mobile digital devices to study the crowd flow prediction problem. Its key challenge lies in how to adaptively integrate the various factors that affect the flow changes, such as sequential trends, periodic laws and spatial dependencies. To address these issues, we propose a novel Attentive Crowd Flow Machine (ACFM), which explicitly learns dynamic spatial-temporal representations from historical crowd flow maps with an attention mechanism. Based on the proposed ACFM, we develop a unified framework that adaptively merges the sequential and periodic representations with the aid of a temporally-varying fusion module for citywide crowd flow prediction. Extensive experiments on two public benchmarks verify the effectiveness of our method for crowd flow prediction. Moreover, to verify the generalization of ACFM, we apply the customized framework to forecast passenger pickup/dropoff demand, where it also achieves practical performance.
However, there is still much room for improvement. In most previous works, the functionality information of regions has not been fully explored. Intuitively, regions with the same functionality usually have similar crowd flow patterns; for instance, most residential regions have high outflow during morning rush hours and high inflow during evening rush hours. How to model the relationship between regions' functionalities and crowd flow patterns is still an open problem. We propose to address this issue by incorporating Point of Interest (POI) information, which can represent the functionality of a region to some extent. However, to the best of our knowledge, there is not yet an effective way to integrate POI information into deep learning models. In future work, we will develop an improved neural network that incorporates POI information to forecast crowd flow.
References
 [1] Y. Zheng, L. Capra, O. Wolfson, and H. Yang, “Urban computing: concepts, methodologies, and applications,” TIST, vol. 5, no. 3, p. 38, 2014.
 [2] J. Zhang, F.-Y. Wang, K. Wang, W.-H. Lin, X. Xu, and C. Chen, “Data-driven intelligent transportation systems: A survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 4, pp. 1624–1639, 2011.

 [3] W. Huang, G. Song, H. Hong, and K. Xie, “Deep architecture for traffic flow prediction: deep belief networks with multitask learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 5, pp. 2191–2201, 2014.
 [4] Y. Lv, Y. Duan, W. Kang, Z. Li, and F.-Y. Wang, “Traffic flow prediction with big data: a deep learning approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 2, pp. 865–873, 2014.
 [5] N. G. Polson and V. O. Sokolov, “Deep learning for short-term traffic flow prediction,” Transportation Research Part C: Emerging Technologies, vol. 79, pp. 1–17, 2017.
 [6] J. Zhang, Y. Zheng, and D. Qi, “Deep spatio-temporal residual networks for citywide crowd flows prediction,” in AAAI, 2017, pp. 1655–1661.
 [7] S. Shekhar and B. M. Williams, “Adaptive seasonal time series models for forecasting short-term traffic flow,” Transportation Research Record, vol. 2024, no. 1, pp. 116–125, 2007.
 [8] J. Guo, W. Huang, and B. M. Williams, “Adaptive Kalman filter approach for stochastic short-term traffic flow rate prediction and uncertainty quantification,” Transportation Research Part C: Emerging Technologies, vol. 43, pp. 50–64, 2014.
 [9] J. Zheng and L. M. Ni, “Time-dependent trajectory regression on road networks via multi-task learning,” in AAAI, 2013, pp. 1048–1055.
 [10] D. Deng, C. Shahabi, U. Demiryurek, L. Zhu, R. Yu, and Y. Liu, “Latent space model for road networks to predict time-varying traffic,” in KDD, 2016, pp. 1525–1534.
 [11] J. Zhang, Y. Zheng, D. Qi, R. Li, and X. Yi, “DNN-based prediction model for spatio-temporal data,” in SIGSPATIAL. ACM, 2016, p. 92.
 [12] Z. Xu, Y. Wang, M. Long, J. Wang, and M. KLiss, “Predcnn: Predictive learning with cascade convolutions.” in IJCAI, 2018, pp. 2940–2947.
 [13] J. Zhang, Y. Zheng, J. Sun, and D. Qi, “Flow prediction in spatiotemporal networks based on multitask deep learning,” TKDE, 2019.

 [14] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” in NIPS, 2015, pp. 802–810.
 [15] S. Sharma, R. Kiros, and R. Salakhutdinov, “Action recognition using visual attention,” arXiv:1511.04119, 2015.
 [16] J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing when to look: Adaptive attention via a visual sentinel for image captioning,” arXiv:1612.01887, 2016.
 [17] L. Liu, R. Zhang, J. Peng, G. Li, B. Du, and L. Lin, “Attentive crowd flow machines,” in 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 2018, pp. 1553–1561.
 [18] B. Williams, P. Durvasula, and D. Brown, “Urban freeway traffic flow prediction: application of seasonal autoregressive integrated moving average and exponential smoothing models,” Transportation Research Record: Journal of the Transportation Research Board, no. 1644, pp. 132–141, 1998.
 [19] M. Castro-Neto, Y.-S. Jeong, M.-K. Jeong, and L. D. Han, “Online-SVR for short-term traffic flow prediction under typical and atypical traffic conditions,” Expert Systems with Applications, vol. 36, no. 3, pp. 6164–6173, 2009.
 [20] X. Li, G. Pan, Z. Wu, G. Qi, S. Li, D. Zhang, W. Zhang, and Z. Wang, “Prediction of urban human mobility using large-scale taxi traces and its applications,” Frontiers of Computer Science, vol. 6, no. 1, pp. 111–121, 2012.

 [21] M. Lippi, M. Bertini, and P. Frasconi, “Short-term traffic flow forecasting: An experimental comparison of time-series analysis and supervised learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 871–882, 2013.
 [22] Y. Duan, Y. Lv, Y.-L. Liu, and F.-Y. Wang, “An efficient realization of deep learning for traffic data imputation,” Transportation Research Part C: Emerging Technologies, vol. 72, pp. 168–181, 2016.
 [23] Z. Chen, J. Zhou, and X. Wang, “Visual analytics of movement pattern based on time-spatial data: A neural net approach,” arXiv preprint arXiv:1707.02554, 2017.
 [24] M. Fouladgar, M. Parchami, R. Elmasri, and A. Ghaderi, “Scalable deep traffic flow neural networks for urban traffic congestion prediction,” arXiv preprint arXiv:1703.01006, 2017.
 [25] J. Ke, H. Zheng, H. Yang, and X. M. Chen, “Short-term forecasting of passenger demand under on-demand ride services: A spatio-temporal deep learning approach,” Transportation Research Part C: Emerging Technologies, vol. 85, pp. 591–608, 2017.

 [26] H. Wei, G. Zheng, H. Yao, and Z. Li, “Intellilight: A reinforcement learning approach for intelligent traffic light control,” in KDD. ACM, 2018, pp. 2496–2505.
 [27] Z. Zhao, W. Chen, X. Wu, P. C. Chen, and J. Liu, “Lstm network: a deep learning approach for short-term traffic forecast,” IET Intelligent Transport Systems, vol. 11, no. 2, pp. 68–75, 2017.
 [28] X. Geng, Y. Li, L. Wang, L. Zhang, Q. Yang, J. Ye, and Y. Liu, “Spatiotemporal multigraph convolution network for ridehailing demand forecasting,” in AAAI, 2019.
 [29] L. Wang, X. Geng, X. Ma, F. Liu, and Q. Yang, “Crowd flow prediction by deep spatiotemporal transfer learning,” arXiv preprint arXiv:1802.00386, 2018.
 [30] H. Yao, Y. Liu, Y. Wei, X. Tang, and Z. Li, “Learning from multiple cities: A meta-learning approach for spatial-temporal prediction,” arXiv preprint arXiv:1901.08518, 2019.
 [31] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv:1508.04025, 2015.
 [32] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in ICASSP. IEEE, 2013, pp. 6645–6649.
 [33] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (mrnn),” arXiv:1412.6632, 2014.
 [34] V. Veeriah, N. Zhuang, and G.J. Qi, “Differential recurrent neural networks for action recognition,” in ICCV, 2015, pp. 4041–4049.
 [35] Y. Tian and L. Pan, “Predicting short-term traffic flow by long short-term memory recurrent neural network,” in 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity). IEEE, 2015, pp. 153–158.
 [36] R. Fu, Z. Zhang, and L. Li, “Using LSTM and GRU neural network methods for traffic flow prediction,” in 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC). IEEE, 2016, pp. 324–328.
 [37] J. Mackenzie, J. F. Roddick, and R. Zito, “An evaluation of HTM and LSTM for short-term arterial traffic flow prediction,” IEEE Transactions on Intelligent Transportation Systems, no. 99, pp. 1–11, 2018.
 [38] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to scale: Scale-aware semantic image segmentation,” in CVPR, 2016, pp. 3640–3649.
 [39] H. Xu and K. Saenko, “Ask, attend and answer: Exploring question-guided spatial attention for visual question answering,” in ECCV. Springer, 2016, pp. 451–466.
 [40] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin, “Crowd counting using deep recurrent spatial-aware network,” in IJCAI. AAAI Press, 2018, pp. 849–855.
 [41] L. Liu, Z. Qiu, G. Li, Q. Wang, W. Ouyang, and L. Lin, “Contextualized spatial-temporal network for taxi origin-destination demand prediction,” IEEE Transactions on Intelligent Transportation Systems, 2019.
 [42] D. Harris and S. Harris, Digital design and computer architecture. Morgan Kaufmann, 2010.
 [43] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
 [44] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS workshop, 2017.
 [45] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in AISTATS, 2010, pp. 249–256.
 [46] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
 [47] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time series analysis: forecasting and control, 2015.

 [48] S. Johansen, “Estimation and hypothesis testing of cointegration vectors in gaussian vector autoregressive models,” Econometrica: Journal of the Econometric Society, pp. 1551–1580, 1991.
 [49] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel networks,” in ICML, 2017, pp. 1771–1779.

 [50] W. Lotter, G. Kreiman, and D. Cox, “Deep predictive coding networks for video prediction and unsupervised learning,” in ICLR, 2017.
 [51] Y. Wang, M. Long, J. Wang, Z. Gao, and S. Y. Philip, “Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms,” in NIPS, 2017, pp. 879–888.