1. Introduction
Crowd flow prediction is crucial for traffic management and public safety, and has drawn substantial research interest due to its huge potential in many intelligent applications, including intelligent traffic diversion and travel optimization.
Nowadays, we live in an era where ubiquitous digital devices broadcast rich information about human mobility in real time and at a high rate, which exponentially increases the availability of large-scale mobility data (e.g., GPS signals or mobile phone signals). In this paper, we generate crowd flow maps from these mobility data and utilize the historical crowd flow maps to forecast the future crowd flow of a city. As shown in Fig. 1, we partition a city into a grid map based on longitude and latitude, and measure the number of pedestrians in each region at each time interval with the mobility data. Although the regional scale can vary greatly across cities, the core problem lies in modeling the evolution of traffic flow across different spatial and temporal regions.
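The grid construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the city bounds, grid size, and function names (`grid_index`, `flow_map`) are assumptions introduced here.

```python
from collections import Counter

def grid_index(lat, lon, bounds, rows, cols):
    """Map one GPS point to its (row, col) grid cell.
    bounds = (lat_min, lat_max, lon_min, lon_max) of the city."""
    lat_min, lat_max, lon_min, lon_max = bounds
    # Points exactly on the upper boundary are clamped into the last cell.
    r = min(int((lat - lat_min) / (lat_max - lat_min) * rows), rows - 1)
    c = min(int((lon - lon_min) / (lon_max - lon_min) * cols), cols - 1)
    return r, c

def flow_map(points, bounds, rows, cols):
    """Count how many mobility records fall in each region
    during a single time interval."""
    counts = Counter(grid_index(lat, lon, bounds, rows, cols)
                     for lat, lon in points)
    return [[counts[(r, c)] for c in range(cols)] for r in range(rows)]
```

Repeating `flow_map` for every time interval, and separating records that enter a region from those that leave it, yields the inflow/outflow maps the paper defines in Section 3.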
Recently, notable successes have been achieved for citywide crowd flow prediction based on deep neural networks coupled with certain spatial-temporal priors (Zhang et al., 2016, 2017b). Nevertheless, several challenges still limit the performance of crowd flow analysis in complex scenarios. First, crowd flow data can vary greatly over a temporal sequence, and capturing such dynamic variations is non-trivial. Second, some periodic laws (e.g., traffic flow suddenly changing due to rush hours or pre-holiday effects) can greatly affect the crowd flow, which increases the difficulty of learning crowd flow representations from data.
To address all of the above issues, we propose a novel spatial-temporal neural network module, called Attentive Crowd Flow Machine (ACFM), to adaptively exploit the diverse factors that affect crowd flow evolution and, at the same time, produce the crowd flow estimation map end-to-end in a unified module. The attention mechanism embedded in ACFM is designed to automatically discover the regions with primary positive impacts on the future flow prediction and simultaneously adjust the impacts of different regions with different weights at each timestep. Specifically, the ACFM comprises two progressive ConvLSTM (Xingjian et al., 2015) units. The first one takes as input i) the feature map representing flow density at each moment and ii) the memorized representations of previous moments, to compute the attentional weights, while the second LSTM aims at generating a superior spatial-temporal feature representation from the attentionally weighted sequential flow density features.
The proposed ACFM has the following appealing properties. First, it can effectively incorporate spatial-temporal information in feature representation and can flexibly compose solutions for crowd flow prediction with different types of input data. Second, by integrating the deep attention mechanism (Sharma et al., 2015; Lu et al., 2016), ACFM adaptively learns to represent the weight of each spatial location at each timestep, which allows the model to dynamically perceive the impact of a given area at a given moment on the future traffic flow. Third, ACFM is a general and differentiable module which can be effectively combined with various network architectures for end-to-end training.
In addition, for forecasting the citywide crowd flow, we further build a deep architecture based on the ACFM, which consists of three components: i) sequential representation learning, ii) periodic representation learning and iii) a temporally-varying fusion module. The first two components are implemented by two parallel ACFMs for modeling contextual dependencies at different temporal scales, while the temporally-varying fusion module adaptively merges the two separate temporal representations for crowd flow prediction.
The main contributions of this work are threefold:

We propose a novel ACFM neural network module, which incorporates two LSTM modules with spatial-temporal attentional weights, to enhance crowd flow prediction via adaptively weighted spatial-temporal feature modeling.

We integrate ACFM into our customized deep architecture for citywide crowd flow estimation, which recurrently incorporates various sequential and periodic dependencies with temporally-varying data.

Extensive experiments on two public benchmarks for crowd flow prediction demonstrate that our approach outperforms existing state-of-the-art methods by large margins.
2. Related Work
Crowd Flow Analysis. Due to its wide application in traffic congestion analysis and public safety monitoring, citywide crowd flow analysis has recently attracted a wide range of research interest (Zheng et al., 2014). A pioneering work was proposed by Zheng (Zheng, 2015), in which public traffic trajectories are represented as graphs or tensor structures. Inspired by the significant progress of deep learning on various tasks (Zhang et al., 2015; Chen et al., 2016a; Li et al., 2017; Zhang et al., 2018; Liu et al., 2018a), many researchers have also attempted to handle this task with deep neural networks. Fouladgar et al. (Fouladgar et al., 2017) introduced a scalable decentralized deep neural network for urban short-term traffic congestion prediction. In (Zhang et al., 2016), a deep-learning-based framework was proposed to leverage temporal information at various scales (i.e., temporal closeness, period and season) for crowd flow prediction. Following this work, Zhang et al. (Zhang et al., 2017b) further employed a convolution-based residual network to collectively predict the inflow and outflow of crowds in every region of a city grid map. To enable more effective temporal modeling, Dai et al. (Dai et al., 2017) proposed a deep hierarchical neural network for traffic flow prediction, which consists of an extraction layer to extract time-variant trends in traffic flow and a prediction layer for final crowd flow forecasting. Recently, to overcome the scarcity of crowd flow data, Wang et al. (Wang et al., 2018) proposed to learn the target city model from a source city model with a region-based cross-city deep transfer learning algorithm.
Memory and attention neural networks. Recurrent neural networks (RNNs) have been widely applied to various sequential prediction tasks (Sutskever et al., 2014; Donahue et al., 2015). As a variant of the RNN, the Long Short-Term Memory (LSTM) network enables RNNs to store information over extended time intervals and exploit longer-term temporal dependencies. It was first applied to natural language processing
(Luong et al., 2015) and speech recognition (Graves et al., 2013), while recently many researchers have attempted to combine CNNs with LSTMs to model spatial-temporal information for various computer vision applications, such as video salient object detection (Li et al., 2018), image captioning (Mao et al., 2014; Wu et al., 2018) and action recognition (Veeriah et al., 2015). Visual attention is a fundamental aspect of the human visual system, referring to the process by which humans focus the computational resources of their brain's visual system on specific regions of the visual field while perceiving the surrounding world. It has recently been embedded in deep convolutional networks (Chen et al., 2016b) or recurrent neural networks to adaptively attend to mission-related regions during the feed-forward operation, and has been proved effective for many tasks, including machine translation (Luong et al., 2015), crowd counting (Liu et al., 2018b), multi-label image classification (Wang et al., 2017), face hallucination (Cao et al., 2017), and visual question answering (Xu and Saenko, 2016). However, no existing work incorporates an attention mechanism in crowd flow prediction. The works most relevant to ours are (Zhang et al., 2017a; Xiong et al., 2017), which also incorporate ConvLSTM for spatial-temporal modeling. However, they are used to represent consecutive video frames and aim to estimate the crowd count on a given surveillance image, rather than to forecast crowd flow evolution from mobility data. Moreover, our proposed ACFM is composed of two progressive LSTM modules with learnable attention weights, which is not only adept at modeling spatial-temporal representations, but also efficient at capturing the effect that a change of traffic conditions in a particular spatial-temporal region (e.g., a traffic jam caused by an accident) has on the global crowd flow evolution. Last but not least, the attention mechanism embedded in our ACFM module also helps to improve the interpretability of the network while boosting its performance.
3. Preliminaries
In this section, we first describe some basic elements of crowd flow and then define the crowd flow prediction problem.
Region Partition: There are many ways to divide a city into multiple regions in terms of different granularities and semantic meanings, such as road networks (Deng et al., 2016) and zip-code tabulation areas. In this study, following previous works (Zhang et al., 2017b; Yao et al., 2018), we partition a city into a non-overlapping grid map based on longitude and latitude. Each rectangular grid cell represents a different geographical region in the city. Figure 1 illustrates the partitioned regions of Beijing and New York City.
Crowd Flow: In practical applications, we can extract large numbers of crowd trajectories from GPS signals or mobile phone signals. With these crowd trajectories, we can measure the number of pedestrians entering or leaving a given region at each time interval, which are called inflow and outflow in our work. For convenience, we denote the crowd flow map at the $t^{th}$ time interval of the $d^{th}$ day as a tensor $M_t^d \in \mathbb{R}^{h \times w \times 2}$, of which the first channel is the inflow and the second channel is the outflow. Some crowd flow maps are visualized in Figure 5.
External Factors: As mentioned in previous work (Zhang et al., 2017b), crowd flow can be affected by many complex external factors, such as meteorological information and holiday information. In this paper, we also consider the effect of these external factors. The meteorological information (e.g., weather condition, temperature and wind speed) can be collected from public meteorological websites such as Wunderground (https://www.wunderground.com/). Specifically, the weather condition is categorized into sixteen categories (e.g., sunny and rainy) and digitized with One-Hot Encoding (Harris and Harris, 2010), while temperature and wind speed are scaled into the range [0, 1] with min-max linear normalization. Multiple categories of holidays (the holiday categories vary across datasets; e.g., Chinese Spring Festival and Christmas) can be acquired from the calendar and encoded into a binary vector with One-Hot Encoding. Finally, we concatenate all external factors data into a 1D tensor. The external factors tensor at the $t^{th}$ time interval of the $d^{th}$ day is denoted as $E_t^d$ in the following sections.
Crowd Flow Prediction: This problem aims to predict the crowd flow map $M_t^d$, given the historical crowd flow maps and external factors data until the $(t-1)^{th}$ time interval of the $d^{th}$ day.
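The external-factors encoding above can be sketched as follows. This is an illustrative reconstruction: the exact temperature/wind ranges, the holiday count, and the function names are assumptions, not values from the paper.

```python
def one_hot(index, size):
    """One-hot encode a category index into a list of length `size`."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def min_max(x, lo, hi):
    """Min-max linear normalization into [0, 1]."""
    return (x - lo) / (hi - lo)

def external_vector(weather_id, temp, wind, holiday_ids,
                    n_weather=16, n_holiday=20,
                    temp_range=(-20.0, 40.0), wind_range=(0.0, 30.0)):
    """Build the 1D external-factors tensor: one-hot weather condition,
    scaled temperature and wind speed, and a binary holiday vector.
    Ranges and n_holiday are illustrative assumptions."""
    holiday = [1.0 if i in holiday_ids else 0.0 for i in range(n_holiday)]
    return (one_hot(weather_id, n_weather)
            + [min_max(temp, *temp_range), min_max(wind, *wind_range)]
            + holiday)
```

The resulting vector plays the role of $E_t^d$ and is later embedded by fully-connected layers (Section 5.4).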
4. Attentive Crowd Flow Machine
We propose a unified neural network module, named Attentive Crowd Flow Machine (ACFM), to learn crowd flow spatial-temporal representations. ACFM is designed to adequately capture various contextual dependencies of the crowd flow, e.g., the spatial consistency and the long- and short-term temporal dependencies. As shown on the left of Fig. 2, the ACFM is composed of two progressive ConvLSTM (Xingjian et al., 2015) units connected with a convolutional layer for attention weight prediction at each time step. The first LSTM (the bottom LSTM in the figure) models the temporal dependency through the original crowd flow feature embedding (extracted with a CNN); its output hidden state is concatenated with the current crowd flow feature and fed to a convolutional layer for weight map inference. The second LSTM (the upper LSTM in the figure) has the same structure as the first, but takes the reweighted crowd flow features as input at each timestep and is trained to recurrently learn the spatial-temporal representations for further crowd flow prediction.
For better understanding, we denote the input feature map at the $i^{th}$ iteration as $X_i \in \mathbb{R}^{h \times w \times c}$, with $h$, $w$ and $c$ representing the height, width and number of channels. Following (Hochreiter and Schmidhuber, 1997), the hidden state $H_i^1$ of the first LSTM can be formulated as:

(1)  $H_i^1, C_i^1 = \mathrm{ConvLSTM}(X_i, H_{i-1}^1, C_{i-1}^1)$

where $C_i^1$ is the memorized cell state of the first LSTM at the $i^{th}$ iteration. The internal hidden state $H_i^1$ is maintained to model the dynamic temporal behavior of the previous crowd flow sequences.
We concatenate $H_i^1$ and $X_i$ to generate a new tensor, and feed it to a single convolutional layer with kernel size $1 \times 1$ to generate an attention map $W_i$, which can be expressed as:

(2)  $W_i = \mathrm{Conv}(X_i \oplus H_i^1;\ \theta_w)$

where $\oplus$ denotes feature concatenation and $\theta_w$ denotes the parameters of the convolutional layer. $W_i$ indicates the weight of each spatial location on the feature map $X_i$. We further reweigh $X_i$ with an element-wise multiplication according to $W_i$ and take the reweighed map as input to the second LSTM for representation learning, whose hidden state $H_i^2$ can be formulated as:

(3)  $H_i^2, C_i^2 = \mathrm{ConvLSTM}(W_i \odot X_i, H_{i-1}^2, C_{i-1}^2)$

where $\odot$ refers to element-wise multiplication and $C_i^2$ is the cell state of the second LSTM. $H_i^2$ encodes the attention-aware content of the current input as well as memorizing the contextual knowledge of previous moments. The output of the last hidden state thus encodes the information of the whole crowd flow sequence, and is used as the spatial-temporal representation for evolution analysis of the future flow map. In the next section, we will show how to incorporate the proposed ACFM into our crowd flow prediction framework.
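The two-LSTM loop of Eqs. (1)-(3) can be sketched in NumPy. This is a deliberately simplified sketch, not the paper's implementation: the gates use 1x1 (per-pixel) linear maps instead of true spatial convolutions, the weights are random rather than learned, and all class/function names are introduced here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SimpleConvLSTM:
    """Toy ConvLSTM cell whose four gates are 1x1 convolutions,
    i.e. per-pixel linear maps (real ConvLSTMs use larger kernels)."""
    def __init__(self, c_in, c_hid, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix per gate: input, forget, output, candidate.
        self.W = rng.normal(0.0, 0.1, (4, c_in + c_hid, c_hid))
        self.c_hid = c_hid

    def step(self, x, h, c):
        z = np.concatenate([x, h], axis=-1)   # (H, W, c_in + c_hid)
        i = sigmoid(z @ self.W[0])
        f = sigmoid(z @ self.W[1])
        o = sigmoid(z @ self.W[2])
        g = np.tanh(z @ self.W[3])
        c_new = f * c + i * g
        h_new = o * np.tanh(c_new)
        return h_new, c_new

def acfm_forward(seq, lstm1, lstm2, w_att):
    """seq: list of (H, W, C) feature maps X_i. Returns the last hidden
    state of the second LSTM, the spatial-temporal representation."""
    H, W, C = seq[0].shape
    h1 = c1 = np.zeros((H, W, lstm1.c_hid))
    h2 = c2 = np.zeros((H, W, lstm2.c_hid))
    for x in seq:
        h1, c1 = lstm1.step(x, h1, c1)                    # Eq. (1)
        att_in = np.concatenate([x, h1], axis=-1)
        w = att_in @ w_att                                # 1x1 conv, Eq. (2)
        h2, c2 = lstm2.step(w * x, h2, c2)                # Eq. (3)
    return h2
```

The attention map `w` has one channel per spatial location and broadcasts over the channels of `x`, matching the element-wise reweighting of Eq. (3).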
5. Citywide Crowd Flow Prediction
We build a deep neural network architecture incorporating our proposed ACFM to predict citywide crowd flow. As illustrated on the right of Fig. 2, the crowd flow prediction framework consists of three components: (1) sequential representation learning, (2) periodic representation learning and (3) a temporally-varying fusion module. For the first two parts of the framework, we employ the ACFM to model the contextual dependencies of crowd flow at different temporal scales. After that, the temporally-varying fusion module adaptively merges the feature embeddings from the two components, with weights learned from the concatenation of the respective feature representations and the external information. Finally, the merged feature map is fed to one additional convolutional layer for crowd flow map inference.
5.1. Sequential Representation Learning
The evolution of citywide crowd flow is usually affected by diverse internal and external factors, e.g., current urban traffic and weather conditions. For instance, a traffic accident occurring on a city main road at 9 am may seriously affect the crowd flow of nearby regions in subsequent time periods. Similarly, a sudden rain may seriously affect the crowd flow in a specific region. To deal with these issues, we take several continuous crowd flow features and their corresponding external factors features as the sequential temporal features, and feed them into our ACFM to recurrently capture the trend of crowd flow in the short term.
Specifically, we denote the input sequential temporal features as:

(4)  $S = \{X_{t-n}^d, X_{t-n+1}^d, \ldots, X_{t-1}^d\}$

where $n$ is the length of the sequentially related time intervals and $X_i^d$ denotes the embedding features of the crowd flow and the external factors at the $i^{th}$ time interval of the $d^{th}$ day. The extraction of the embedding features will be described in Section 5.4. We apply the proposed ACFM to learn the sequential representation from the temporal features $S$. As shown on the right of Fig. 2, the ACFM recurrently takes each element of $S$ as input and learns to selectively memorize the context of this specific temporally-varying data. The output hidden state of the last iteration is further fed into a following convolutional layer to generate a feature representation of size $h \times w \times c$, denoted as $S^f$, which forms the spatial-temporal feature embedding of the fine-grained sequential data.
5.2. Periodic Representation Learning
Generally, there exist some periodicities which have a significant impact on the changes of traffic flow. For example, traffic conditions are very similar during the morning rush hours of consecutive workdays, repeating every 24 hours. Similar to the sequential representation learning described in Section 5.1, we take the periodic temporal features

(5)  $P = \{X_t^{d-m}, X_t^{d-m+1}, \ldots, X_t^{d-1}\}$

to capture the periodic property of crowd flow, where $m$ is the length of the periodic days. As shown on the right of Fig. 2, we employ ACFM to learn the periodic representation with the periodic temporal features $P$ as input. The hidden output of the last iteration of ACFM is passed through a convolutional layer to generate a representation $P^f$. $P^f$ encodes the context of periodic laws, which is essential for crowd flow prediction.
5.3. Temporally-Varying Fusion
The future crowd flow is affected by the two temporally-varying representations $S^f$ and $P^f$. A naive method is to directly merge these two representations; however, this is suboptimal. In this subsection, we propose a novel temporally-varying fusion module to adaptively fuse the sequential representation $S^f$ and the periodic representation $P^f$ of crowd flow with different weights.
Considering that the external factors may affect the relative importance of the two representations, we take the sequential representation $S^f$, the periodic representation $P^f$ and the external factors integrative feature $E^f$ to calculate the fusion weight, where $E^f$ is the element-wise addition of the external factors features of all relevant time intervals and will be described in Section 5.4. As shown on the right of Fig. 2, we first concatenate $S^f$, $P^f$ and $E^f$ and feed them as input to two fully-connected layers (the first layer has 512 neurons and the second has only one neuron) for fusion weight inference. After a sigmoid function, the temporally-varying fusion module outputs a single value $w$, which reflects the importance of the sequential representation $S^f$, while $1-w$ is treated as the fusion weight of the periodic representation $P^f$. We then merge these two temporal representations with different weights and further reduce the feature to two channels (inflow and outflow) with a linear transformation, which can be expressed as:

(6)  $\tilde{M} = F(w \cdot S^f + (1-w) \cdot P^f;\ \theta_t)$

where $F$ is the linear transformation implemented by a convolutional layer with two filters and parameters $\theta_t$. The predicted crowd flow map can be computed as

(7)  $\hat{M}_t^d = \tanh(\tilde{M})$

where the hyperbolic tangent $\tanh$ ensures the output values are within $[-1, 1]$. (When training, we use min-max linear normalization to scale the crowd flow maps into the range $[-1, 1]$. When evaluating, we rescale the predicted values back to their normal range and then compare them with the ground truth.)
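The fusion of Eqs. (6)-(7) can be sketched as follows. This is a hedged sketch, not the trained model: the fully-connected weights are passed in as arbitrary arrays (the paper's layers have 512 and 1 neurons), the two-filter convolution $F$ is simplified to a 1x1 convolution, and all parameter names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_and_predict(s_f, p_f, e_f, fc1, fc2, conv_w):
    """Temporally-varying fusion sketch.
    s_f, p_f : (H, W, C) sequential / periodic representations.
    e_f      : (D,) external-factors integrative feature.
    fc1, fc2 : weights of the two fully-connected layers.
    conv_w   : (C, 2) weights of the 1x1 two-filter convolution F."""
    z = np.concatenate([s_f.ravel(), p_f.ravel(), e_f])
    w = sigmoid(fc2 @ np.tanh(fc1 @ z))   # scalar fusion weight in (0, 1)
    merged = w * s_f + (1.0 - w) * p_f    # weighted merge of Eq. (6)
    m = np.tanh(merged @ conv_w)          # two-channel map in [-1, 1], Eq. (7)
    return w, m
```

Because `w` is a single scalar, the whole sequential map is traded off against the whole periodic map, which matches the module's single fusion-weight output.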
5.4. Implementation Details
We first detail the methods for extracting the crowd flow feature and the external factors feature, and then describe our network optimization.
Crowd Flow Feature: For the crowd flow map $M_t^d$ at the $t^{th}$ time interval of the $d^{th}$ day, we extract its feature $F_c$ with a customized ResNet (He et al., 2016) structure, which is stacked by residual units without any down-sampling operations. Each residual unit contains two convolutional layers followed by two ReLU layers. We set the channel number of all convolutional layers to 16 and the kernel size to $3 \times 3$.
External Factors Feature: For the external factors $E_t^d$, we extract its feature with a simple neural network implemented by two fully-connected layers. The first FC layer has 256 neurons, and the output of the second layer is reshaped to a 3D tensor $F_e$ with spatial size $h \times w$, which is the final feature of $E_t^d$.
Finally, we concatenate $F_c$ and $F_e$ to generate the embedding feature $X_t^d$, which can be expressed as

(8)  $X_t^d = F_c \oplus F_e$

where $\oplus$ denotes feature concatenation. The external factors integrative feature $E^f$ described in Section 5.3 is the element-wise addition of the external factors features of all relevant time intervals.
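One residual unit of the crowd-flow feature extractor can be sketched in NumPy. This is an illustrative sketch under assumptions: a naive 3x3 "same" convolution, random rather than learned weights, and one plausible conv/ReLU ordering of the unit described above.

```python
import numpy as np

def conv3x3(x, w):
    """Naive 3x3 'same' convolution: x is (H, W, Cin), w is (3, 3, Cin, Cout).
    Zero padding keeps the h x w spatial size (no down-sampling)."""
    H, W, _ = x.shape
    Cout = w.shape[-1]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, Cout))
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H, j:j + W] @ w[i, j]
    return out

def residual_unit(x, w1, w2):
    """Two 3x3 convolutions with ReLU activations plus an identity
    skip connection, as in the customized ResNet above."""
    relu = lambda t: np.maximum(t, 0.0)
    return x + relu(conv3x3(relu(conv3x3(x, w1)), w2))
```

Stacking several such units (12 for TaxiBJ, 4 for BikeNYC in Section 6) yields the crowd flow feature $F_c$.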
Network Optimization: We adopt the TensorFlow (Abadi et al., 2016) toolbox to implement our crowd flow prediction network. The filter weights of all convolutional layers and fully-connected layers are initialized with Xavier initialization (Glorot and Bengio, 2010). The size of a mini-batch is set to 64. We optimize our network's parameters in an end-to-end manner via Adam optimization (Kingma and Ba, 2014), minimizing the Euclidean loss for 270 epochs on a GTX 1080Ti GPU.
6. Experiments
In this section, we first conduct experiments on two public benchmarks (i.e., TaxiBJ (Zhang et al., 2016) and BikeNYC (Zhang et al., 2016)) to evaluate the performance of our model on citywide crowd flow prediction. We then conduct an ablation study to demonstrate the effectiveness of each component of our model.
6.1. Dataset Setting and Evaluation Metric
We forecast the inflow and outflow of citywide crowds on two datasets: the TaxiBJ (Zhang et al., 2016) dataset for taxicab flow prediction and the BikeNYC (Zhang et al., 2016) dataset for bike flow prediction.
TaxiBJ Dataset: This dataset contains 22,459 time intervals of crowd flow maps with a size of 32x32, which are generated from Beijing taxicab GPS trajectory data. The external factors contain weather conditions, temperature, wind speed and 41 categories of holidays. For a fair comparison, we follow (Zhang et al., 2017b) and take the data of the last four weeks as the testing set and the rest as the training set. In this dataset, we set the sequential length n and the periodic length m as and , respectively. As with ST-ResNet (Zhang et al., 2017b), the ResNet described in Section 5.4 is composed of 12 residual units.
BikeNYC Dataset: This dataset is generated from NYC bike trajectory data and contains 4,392 available time intervals of crowd flow maps with a size of 16x8. The data of the last ten days are chosen as the test set. As for external factors, 20 categories of holidays are recorded. In this dataset, we set the sequential length n to 5 and the periodic length m to 7. For a fair comparison with ST-ResNet (Zhang et al., 2017b), we also utilize a ResNet as described in Section 5.4, with 4 residual units, to extract the crowd flow feature.
We adopt Root Mean Square Error (RMSE) as the evaluation metric for all methods, which is defined as:

(9)  $\mathrm{RMSE} = \sqrt{\frac{1}{z} \sum_{i=1}^{z} \left\| \hat{M}_i - M_i \right\|_2^2}$

where $\hat{M}_i$ and $M_i$ represent the predicted flow map and its ground truth map, respectively, and $z$ indicates the number of samples used for validation.
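The metric can be computed directly; the short sketch below averages the squared error over all validation samples and grid cells before taking the square root, matching the standard RMSE definition.

```python
import numpy as np

def rmse(pred, gt):
    """Root Mean Square Error between predicted flow maps and
    ground truth maps of identical shape."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))
```

Note that predictions must first be rescaled from $[-1, 1]$ back to raw counts (Section 5.3) before computing the RMSE against the ground truth.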
6.2. Comparison with the State of the Art
Table 1. RMSE of different methods on TaxiBJ and BikeNYC.

Method | TaxiBJ | BikeNYC
SARIMA (Williams et al., 1998) | 26.88 | 10.56
VAR (Lütkepohl, 2011) | 22.88 | 9.92
ARIMA (Box et al., 2015) | 22.78 | 10.07
ST-ANN (Zhang et al., 2017b) | 19.57 | -
DeepST (Zhang et al., 2016) | 18.18 | 7.43
ST-ResNet (Zhang et al., 2017b) | 16.69 | 6.33
Ours | 15.40 | 5.64
We compare our method with six state-of-the-art methods, including Auto-Regressive Integrated Moving Average (ARIMA) (Box et al., 2015), Seasonal ARIMA (SARIMA) (Williams et al., 1998), Vector Auto-Regression (VAR) (Lütkepohl, 2011), ST-ANN (Zhang et al., 2017b), DeepST (Zhang et al., 2016) and ST-ResNet (Zhang et al., 2017b). For these compared methods, we use the performance numbers provided by Zhang et al. (Zhang et al., 2017b) as their results.
Table 1 summarizes the performance of the proposed method and the six compared methods. On the TaxiBJ dataset, our method decreases the RMSE from 16.69 to 15.40 compared with the previous best model, a relative improvement of 7.7%. Our method also boosts the prediction accuracy on BikeNYC, decreasing the RMSE from 6.33 to 5.64. Note that some of the compared methods, e.g., ST-ANN, DeepST and ST-ResNet, also employ deep learning techniques. The experimental results demonstrate that our proposed ACFM is able to explicitly model the spatial-temporal features as well as the attention weighting of each spatial influence, and thereby greatly outperforms the state of the art. Some crowd flow prediction maps of our full model on the TaxiBJ dataset are shown in the second row of Fig. 5. As can be seen, our generated crowd flow maps are consistently closest to the ground truth, which accords with the quantitative RMSE comparison.
6.3. Ablation Study
Our full model for citywide crowd flow prediction consists of three components: sequential representation learning, periodic representation learning and a temporally-varying fusion module. For convenience, we denote our full model as the Sequential-Periodic Network (SPN) in the following experiments. To show the effectiveness of each component, we implement seven variants of our full model on the TaxiBJ dataset:

PCNN: directly concatenates the periodic features and feeds them to a convolutional layer with two filters, followed by tanh, to predict the future crowd flow;

SCNN: directly concatenates the sequential features and feeds them to a convolutional layer, followed by tanh, to predict the future crowd flow;

PRNN-w/o-Attention: takes periodic features as input and learns the periodic representation with an LSTM layer to predict the future crowd flow;

PRNN: takes periodic features as input and learns the periodic representation with the proposed ACFM to predict the future crowd flow;

SRNN-w/o-Attention: takes sequential features as input and learns the sequential representation with an LSTM layer for crowd flow estimation;

SRNN: takes sequential features as input and learns the sequential representation with the proposed ACFM to predict the future crowd flow;

SPN-w/o-Fusion: directly merges the sequential representation and the periodic representation with equal weights (0.5) to predict the future crowd flow.
Table 2. RMSE of the variants of our model on TaxiBJ.

Method | RMSE
PCNN | 33.44
PRNN-w/o-Attention | 32.97
PRNN | 32.52
SCNN | 17.48
SRNN-w/o-Attention | 16.62
SRNN | 16.11
SPN-w/o-Fusion | 16.01
SPN | 15.40
Effectiveness of Spatial Attention: As shown in Table 2, by adopting spatial attention, SRNN decreases the RMSE by 0.51 compared to SRNN-w/o-Attention. For the other pair of variants, PRNN with spatial attention achieves a similar performance improvement over PRNN-w/o-Attention. Fig. 3 and Fig. 4 show some attentional maps generated by our method, as well as the residual maps between the input crowd flow maps and their corresponding ground truth. We can observe a negative correlation between the attentional maps and the residual maps: roughly, the greater the difference a region exhibits, the smaller its weight, and vice versa. This indicates that our ACFM is able to capture valuable regions at each time step and make better predictions by inferring the trend of evolution. By multiplying small weights onto the features of regions with great differences, the impacts of those regions are inhibited. With the visualization of attentional maps, we can also identify which regions have the primary positive impacts on the future flow prediction. These experiments show that the proposed model can not only effectively improve the prediction accuracy, but also enhance the interpretability of the model to a certain extent.
Effectiveness of Sequential Representation Learning: As shown in Table 2, the baseline variant SCNN, which directly concatenates the sequential features for prediction, gets an RMSE of 17.48. When explicitly modeling the sequential contextual dependencies of crowd flow with the proposed ACFM, the variant SRNN decreases the RMSE to 16.11, a 7.8% relative improvement over the baseline SCNN, which indicates the effectiveness of the sequential representation learning.
Effectiveness of Periodic Representation Learning: We also explore different network architectures for learning the periodic representation. As shown in Table 2, PCNN, which learns to estimate the flow map by simply concatenating all of the periodic features, only achieves an RMSE of 33.44. In contrast, when introducing ACFM to learn the periodic representation, the RMSE drops to 32.52. This further demonstrates the effectiveness of the proposed ACFM for spatial-temporal modeling.
Effectiveness of Temporally-Varying Fusion: When directly merging the two temporal representations with equal contributions (0.5), SPN-w/o-Fusion achieves only a negligible improvement compared to SRNN. In contrast, with our proposed fusion strategy, the full model SPN decreases the RMSE from 16.11 to 15.40, a relative improvement of 4.4% compared with SRNN. These results show that the significance of the two representations is not equal and is influenced by various factors. The proposed fusion strategy can adaptively merge the different temporal representations and further improve the performance of crowd flow prediction.
Further Discussion: To analyze how each temporal representation contributes to the performance of crowd flow prediction, we further measure the average fusion weights of the two temporal representations at each time interval. As shown on the right of Fig. 6, the fusion weights of the sequential representation are greater than those of the periodic representation at most times of day, except for the wee hours. Based on this observation, we conclude that the sequential representation is more essential for crowd flow prediction. Although its weight is low, the periodic representation still helps to improve the performance of crowd flow prediction both qualitatively and quantitatively: fusing it in decreases the RMSE of SRNN by 4.4% and generates more precise crowd flow maps, as shown in Table 2 and Fig. 5.
7. Conclusion
This work studies spatial-temporal modeling for the crowd flow prediction problem. To incorporate the various factors that affect flow changes, we propose a unified neural network module named Attentive Crowd Flow Machine (ACFM). In contrast to existing flow estimation methods, our ACFM explicitly learns dynamic representations of temporally-varying data with an attention mechanism and can infer the evolution of future crowd flow from historical crowd flow maps. A unified framework is also proposed to merge two types of temporal information for further prediction. Extensive experiments verify the effectiveness of our proposed ACFM on the task of citywide crowd flow prediction.
References
 Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. Tensorflow: Largescale machine learning on heterogeneous distributed systems. arXiv:1603.04467 (2016).
 Box et al. (2015) George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. 2015. Time series analysis: forecasting and control.
 Cao et al. (2017) Qingxing Cao, Liang Lin, Yukai Shi, Xiaodan Liang, and Guanbin Li. 2017. Attentionaware face hallucination via deep reinforcement learning. arXiv:1708.03132 (2017).
 Chen et al. (2016b) LiangChieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. 2016b. Attention to scale: Scaleaware semantic image segmentation. In CVPR.
 Chen et al. (2016a) Tianshui Chen, Liang Lin, Lingbo Liu, Xiaonan Luo, and Xuelong Li. 2016a. DISC: Deep Image Saliency Computing via Progressive Representation Learning. TNNLS 27, 6 (2016), 1135–1149.
 Dai et al. (2017) Xingyuan Dai, Rui Fu, Yilun Lin, Li Li, and FeiYue Wang. 2017. DeepTrend: A Deep Hierarchical Neural Network for Traffic Flow Prediction. arXiv:1707.03213 (2017).
 Deng et al. (2016) Dingxiong Deng, Cyrus Shahabi, Ugur Demiryurek, Linhong Zhu, Rose Yu, and Yan Liu. 2016. Latent Space Model for Road Networks to Predict TimeVarying Traffic. KDD (2016), 1525–1534.
 Donahue et al. (2015) Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Longterm recurrent convolutional networks for visual recognition and description. In CVPR.
 Fouladgar et al. (2017) Mohammadhani Fouladgar, Mostafa Parchami, Ramez Elmasri, and Amir Ghaderi. 2017. Scalable deep traffic flow neural networks for urban traffic congestion prediction. arXiv preprint arXiv:1703.01006 (2017).
 Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS. 249–256.
 Graves et al. (2013) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In ICASSP.
 Harris and Harris (2010) David Harris and Sarah Harris. 2010. Digital design and computer architecture. Morgan Kaufmann.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980 (2014).
 Li et al. (2018) Guanbin Li, Yuan Xie, Tianhao Wei, Keze Wang, and Liang Lin. 2018. Flow Guided Recurrent Neural Encoder for Video Salient Object Detection. In CVPR. 3243–3252.
 Li et al. (2017) Ya Li, Lingbo Liu, Liang Lin, and Qing Wang. 2017. Face Recognition by Coarse-to-Fine Landmark Regression with Application to ATM Surveillance. In CCCV. Springer, 62–73.
 Liu et al. (2018a) Lingbo Liu, Guanbin Li, Yuan Xie, Yizhou Yu, and Liang Lin. 2018a. Facial Landmark Localization in the Wild by Backbone-Branches Representation Learning. In BigMM. IEEE.
 Liu et al. (2018b) Lingbo Liu, Hongjun Wang, Guanbin Li, Wanli Ouyang, and Liang Lin. 2018b. Crowd Counting using Deep Recurrent Spatial-Aware Network. In IJCAI.
 Lu et al. (2016) Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2016. Knowing when to look: Adaptive attention via A visual sentinel for image captioning. arXiv:1612.01887 (2016).
 Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv:1508.04025 (2015).
 Lütkepohl (2011) Helmut Lütkepohl. 2011. Vector autoregressive models. In International Encyclopedia of Statistical Science. 1645–1647.
 Mao et al. (2014) Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2014. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv:1412.6632 (2014).
 Sharma et al. (2015) Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. 2015. Action recognition using visual attention. arXiv:1511.04119 (2015).
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
 Veeriah et al. (2015) Vivek Veeriah, Naifan Zhuang, and Guo-Jun Qi. 2015. Differential recurrent neural networks for action recognition. In ICCV.
 Wang et al. (2018) Leye Wang, Xu Geng, Xiaojuan Ma, Feng Liu, and Qiang Yang. 2018. Crowd Flow Prediction by Deep Spatio-Temporal Transfer Learning. arXiv:1802.00386 (2018).
 Wang et al. (2017) Zhouxia Wang, Tianshui Chen, Guanbin Li, Ruijia Xu, and Liang Lin. 2017. Multi-Label Image Recognition by Recurrently Discovering Attentional Regions. In CVPR.
 Williams et al. (1998) Billy Williams, Priya Durvasula, and Donald Brown. 1998. Urban freeway traffic flow prediction: application of seasonal autoregressive integrated moving average and exponential smoothing models. Transportation Research Record: Journal of the Transportation Research Board 1644 (1998), 132–141.
 Wu et al. (2018) Xian Wu, Guanbin Li, Qingxing Cao, Qingge Ji, and Liang Lin. 2018. Interpretable Video Captioning via Trajectory Structured Localization. In CVPR. 6829–6837.
 Xingjian et al. (2015) SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS.
 Xiong et al. (2017) Feng Xiong, Xingjian Shi, and Dit-Yan Yeung. 2017. Spatiotemporal modeling for crowd counting in videos. In ICCV.
 Xu and Saenko (2016) Huijuan Xu and Kate Saenko. 2016. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV.
 Yao et al. (2018) Huaxiu Yao, Fei Wu, Jintao Ke, Xianfeng Tang, Yitian Jia, Siyu Lu, Pinghua Gong, and Jieping Ye. 2018. Deep multi-view spatial-temporal network for taxi demand prediction. arXiv:1802.08714 (2018).
 Zhang et al. (2017b) Junbo Zhang, Yu Zheng, and Dekang Qi. 2017b. Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction. In AAAI. 1655–1661.
 Zhang et al. (2016) Junbo Zhang, Yu Zheng, Dekang Qi, Ruiyuan Li, and Xiuwen Yi. 2016. DNN-based prediction model for spatio-temporal data. In ACM SIGSPATIAL.
 Zhang et al. (2018) Ruimao Zhang, Liang Lin, Guangrun Wang, Meng Wang, and Wangmeng Zuo. 2018. Hierarchical Scene Parsing by Weakly Supervised Learning with Image Descriptions. PAMI (2018).
 Zhang et al. (2015) Ruimao Zhang, Liang Lin, Rui Zhang, Wangmeng Zuo, and Lei Zhang. 2015. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. TIP 24, 12 (2015), 4766–4779.
 Zhang et al. (2017a) Shanghang Zhang, Guanhang Wu, João P Costeira, and José MF Moura. 2017a. FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in City Cameras. arXiv:1707.09476 (2017).
 Zheng (2015) Yu Zheng. 2015. Trajectory data mining: an overview. TIST 6, 3 (2015), 29.
 Zheng et al. (2014) Yu Zheng, Licia Capra, Ouri Wolfson, and Hai Yang. 2014. Urban computing: concepts, methodologies, and applications. TIST 5, 3 (2014), 38.