1 Introduction
Forecasting crowd flows in each and every part of a city, especially in irregular regions, plays an important role in traffic control, risk assessment, and public safety. For example, when vast amounts of people streamed into a strip region at the 2015 New Year’s Eve celebrations in Shanghai, this resulted in a catastrophic stampede that killed 36 people. Such tragedies can be mitigated or prevented by utilizing emergency mechanisms, like sending out warnings or evacuating people in advance, if we can accurately forecast the crowd flow in a region ahead of time.
Prior work mainly focused on predicting the crowd flows in regular gridded regions [48, 47]. Although partitioning a city into grids is easier and more effective for subsequent data mining [50] and machine learning approaches [48], the regions in a city are actually separated by road networks and are therefore extremely irregular. In this study, our goal is to collectively predict the inflow and outflow of crowds in each and every irregular region of a city. Figure 1 shows an illustration. Inflow is the total flow of crowds entering a region from other regions during a given time interval, and outflow denotes the total flow of crowds leaving a region for other regions during that interval; both track the transition of crowds between regions, and knowing them is very beneficial for traffic control. We can measure them by the number of cars/bikes running on the roads, the number of pedestrians, the number of people traveling on public transportation systems (e.g. metro, bus), or all of them together if the data is available. For example, using the GPS trajectories of vehicles in Figure 1, the inflow and outflow of the highlighted region are (0, 2), respectively; similarly, using the mobile phone signals of pedestrians, the two types of flows are (3, 2), respectively. We formulate the crowd flow forecasting problem as a spatio-temporal graph (STG) prediction problem, in which an irregular region can be viewed as a node associated with time-varying inflow and outflow, and the transition flows between regions can be used to construct the edges. However, forecasting these two flows in each and every node of an STG is very challenging because of the following three complex factors:
1) Interactions and spatial correlations between different vertices of an STG. The inflow of a node (Figure 1(b)) is affected by the outflows of its adjacent (1-hop) neighbors as well as its multi-hop neighbors. Likewise, a node's outflow affects its neighbors' inflows. Moreover, a node's inflow and outflow interact with each other.
2) Multiple types of temporal correlations among different time intervals: closeness, period, and trend. i) Closeness: the flows of a node are affected by recent time intervals. Taking traffic flow as an example, congestion occurring at 5pm will affect the traffic flow at 6pm. ii) Many types of periods: daily, weekly, etc. Traffic conditions during rush hours may be similar on consecutive workdays (daily period) and consecutive weekends (weekly period). iii) Many types of trends: monthly, quarterly, etc. Morning peak hours may gradually occur earlier as summer approaches, with people getting up earlier as the temperature gradually increases and the sun rises earlier.
3) Complex external factors and meta features. Holidays can influence the flow of crowds for consecutive days, and extreme weather often changes the crowd flows tremendously in different regions of a city. Besides, crowd flows are also affected by meta data, like the time of the day and weekday/weekend. For example, the flow patterns during rush hours may differ from those during non-rush hours.
To tackle all the aforementioned challenges, we propose a general multi-view learning framework for crowd flow prediction in all the irregular regions of a city, as shown in Figure 2. The framework is composed of two stages: data preparation and model learning. The data preparation stage involves fetching global information based on the target time and selecting the dependent crowd flow matrices from key timesteps according to different temporal properties. Based on the collected multi-view data, we present a new model for learning, which we refer to as a multi-view graph convolutional network (MVGCN), consisting of several GCNs and fully-connected neural networks (FNNs). The contributions of this research lie in the following four aspects:

We propose a variant of GCN that can capture not only the spatial correlations and interactions between different nodes but also the spatial property (i.e. geospatial position). A single graph convolutional layer can capture 1-hop spatial dependencies in an STG while integrating the geospatial position information. We stack multiple such layers to model more complex spatial patterns, like multi-hop interactions. We also integrate residual networks with the GCN, which helps train a deeper network and effectively accelerates model training.

We use the proposed GCN variant to build an MVGCN for crowd flow prediction that can capture different temporal properties, including closeness, many types of periods (e.g. daily, weekly, yearly), and many types of trends (e.g. monthly, quarterly). Each view of the MVGCN is a multi-layer GCN, which can model different temporal and spatial properties.

We design a multi-view fusion module to fuse the multiple latent representations from different views. The module is based on two fusion methods: a gating mechanism and sum fusion, which are used for sudden and gradual changes, respectively.

We evaluate our MVGCN on four real-world mobility datasets, including taxicab data in Beijing and New York City (NYC), and bike data in NYC and Washington D.C. The extensive results demonstrate the advantages of our MVGCN over adaptations of several state-of-the-art approaches, such as diffusion convolutional recurrent neural networks and Gaussian Markov random field based models.
2 Problem Definition
2.1 Irregular Regions
Urban areas are naturally divided into different irregular regions by road networks. These regions may have different functions, such as education and business [45], and different functional areas usually have different traffic flow patterns. For example, people usually leave residential areas and arrive at work areas in the morning, and return home after work. It is therefore more rational and insightful to perform traffic flow prediction on these irregular regions.
Region Partition. The region partition task consists of two main operations: map segmentation and map clustering. For example, the road network in Beijing is composed of multi-level roads, such as levels 0, 1, 2, etc., which represent different functional road categories. As shown in Figure 4(a), the red segments denote highways and city expressways, and the blue segments represent urban arterial roads in Beijing.
Referring to [46], we utilize morphological image processing techniques to tackle the region partition task. Specifically, we partition the map into small grid-cells and map each road point to its corresponding grid-cell, thereby obtaining a binary image in which 1 and 0 stand for road segments and blank areas, respectively. Then we apply a dilation operation and a thinning operation to get the skeleton of the road network. The dilation operation helps thicken the roads, fill small holes, and smooth out unnecessary details; the thinning operation then recovers the size of each region while keeping the connectivity between regions [24], as shown in Figure 3. Finally, we obtain all labeled irregular regions' locations using the connected component labeling (CCL) algorithm [15], which finds individual regions by clustering "1"-labeled grids.
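The two key operations above can be illustrated with a toy NumPy sketch (a simplified stand-in for the paper's morphological pipeline, not the authors' implementation; a 3x3 structuring element and 4-connected labeling are our assumptions):

```python
import numpy as np
from collections import deque

def dilate(img, iters=1):
    """3x3 binary dilation: thicken road pixels to close small gaps."""
    out = img.astype(bool)
    for _ in range(iters):
        padded = np.pad(out, 1)
        nbh = np.zeros_like(out)
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                nbh |= padded[1 + di:1 + di + out.shape[0],
                              1 + dj:1 + dj + out.shape[1]]
        out = nbh
    return out

def label_regions(road):
    """4-connected component labeling (CCL) of the non-road cells,
    so each enclosed blank area becomes one candidate region."""
    labels = np.zeros(road.shape, dtype=int)
    n = 0
    for i in range(road.shape[0]):
        for j in range(road.shape[1]):
            if road[i, j] or labels[i, j]:
                continue
            n += 1
            q = deque([(i, j)])
            labels[i, j] = n
            while q:
                a, b = q.popleft()
                for c, d in ((a - 1, b), (a + 1, b), (a, b - 1), (a, b + 1)):
                    if 0 <= c < road.shape[0] and 0 <= d < road.shape[1] \
                            and not road[c, d] and not labels[c, d]:
                        labels[c, d] = n
                        q.append((c, d))
    return labels, n
```

For example, a single vertical road through a 5x5 map yields two labeled regions, one on each side of the road.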
After the map segmentation, we get a large number of low-level irregular regions, many of which are too small to collect or predict traffic flows at the city scale. Therefore, we apply a clustering operation [20] on these regions. Specifically, we define the edge weight between two low-level regions as the Spearman's rank correlation coefficient between their average crowd flows within a time period (e.g. one day). After this operation, the small intractable regions are clustered into high-level regions, as shown in Figure 4(b) and (c).
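The Spearman edge weight can be computed directly from the two regions' flow series; a minimal sketch (ignoring tie correction, which the paper does not discuss) is:

```python
import numpy as np

def spearman(x, y):
    """Spearman's rank correlation between two flow series (no tie
    correction): Pearson correlation of the rank vectors. Used as the
    edge weight between two low-level regions before clustering."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Two regions with identically ordered average flows get weight 1.0; perfectly opposite orderings get -1.0.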
Symbol  Description
G = (V, E)  spatio-temporal graph
V  set of N nodes
A  an adjacency matrix
Â  a modified adjacency matrix
p_i  geospatial position of node v_i
T  available time interval set
X_t  a matrix of node feature vectors at time t
x_i  feature vector of node v_i
x^(c)  vector of the c-th channel over all nodes
2.2 Prediction Problem on SpatioTemporal Graphs
The goal in this research is to collectively predict the future inflows/outflows in each and every node of an STG based on previously observed ones. Table I lists the mathematical notation used in the paper.
Definition 1 (STG)
A spatio-temporal graph (STG) is denoted as G = (V, E, A), where V and E respectively denote the set of vertices and edges, and A is a binary unweighted adjacency matrix. Each vertex v_i ∈ V has a geospatial position p_i and time-varying attributes. The attributes over an STG at time t can be viewed as a graph signal X_t ∈ R^{N×C}, where the row x_i represents the attributes of node v_i, e.g., the inflow and outflow [48] (C = 2). The edges between two regions can be constructed from transition flows, which also vary with time. We construct a static graph adjacency matrix A based on a period of transition flows; a binary entry of A indicates whether two regions have frequent interactions of transition traffic flows.
Problem 1
Given a graph G and the observed attributes of nodes {X_t | t ∈ T}, predict the attributes at the next time step, i.e., X_{t+1}.
3 Methodology
In this section, we present our new model for crowd flow forecasting. We first present the multi-view deep learning framework, then review graph convolutional networks and present our new spatial graph convolutional network. Finally, we present the multi-view fusion method and the loss function used in our model.
3.1 Multiview Deep Learning Framework
Figure 2 provides an overview of our proposed deep learning framework for predicting the crowd flows in an STG. We adopt the multi-view framework [42], which is an effective mechanism to learn latent representations from cross-domain data [39]. The proposed framework is composed of two stages: data preparation and model learning/predicting. The first stage fetches global information and selects the key timesteps; we then feed all of them to the second stage to perform model training. We provide concrete details in the following sections.
Data preparation stage. "What factors should be considered when forecasting the crowd flow in a region?" a) weather, b) time of the day, c) period, etc. Different people may give different answers, each highlighting a different view on this problem. We summarize these views into two categories: the global view and the temporal view. (1) The global view is composed of the external and meta views. According to the time of the predicted target, we fetch different external data, like meteorological data in previous timesteps and the weather forecast. We also construct meta features: time of the day, day of the week, and so on. The external and meta features are represented as separate feature vectors. (2) The temporal view contains multiple views according to temporal closeness, period, and trend. Considering two types of periods (daily and weekly) and two types of trends (monthly and quarterly) (one can set different periods and trends in practice, like a yearly period, based on the characteristics of the data), we select the corresponding recent, daily, weekly, monthly, and quarterly timesteps as the key timesteps to construct five views. For each temporal view, we fetch a list of key timesteps' flow matrices and concatenate them, constructing five inputs as follows:

X_c = [X_{t−l_c}, …, X_{t−1}],
X_d = [X_{t−l_d·p_d}, …, X_{t−p_d}],
X_w = [X_{t−l_w·p_w}, …, X_{t−p_w}],
X_m = [X_{t−l_m·s_m}, …, X_{t−s_m}],
X_q = [X_{t−l_q·s_q}, …, X_{t−s_q}],

where l_c, l_d, l_w, l_m, and l_q are the input lengths of the recent, daily, weekly, monthly, and quarterly lists, respectively; p_d and p_w are the daily and weekly periods; s_m and s_q are the monthly and quarterly trend spans.
By selecting these key timesteps, our approach can capture multiple types of temporal properties. The input complexity of our approach is O(l_c + l_d + l_w + l_m + l_q), and these views can be modeled in parallel. If one instead uses a sequence model (like recurrent neural networks, RNNs) to capture all these temporal dependencies automatically, the input must cover the longest dependency span, and RNNs maintain a hidden state of the entire past that prevents parallel computation within a sequence. Assuming the lengths of the recent, daily, weekly, monthly, and quarterly lists all equal 3, our architecture only needs 3 × 5 = 15 key frames. In contrast, an RNN needs three quarters of data, i.e. thousands of frames. Such a long-range sequence tremendously raises the training complexity for RNNs, making them infeasible in real-world applications.
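The key-timestep selection above can be sketched as a small index computation (the function name and default spans are illustrative; one-hour intervals are assumed, as in the paper's datasets):

```python
def key_timesteps(t, lens, p_d=24, p_w=24 * 7, s_m=24 * 30, s_q=24 * 90):
    """Indices of the dependent frames for the five temporal views at
    target timestep t. lens = (l_c, l_d, l_w, l_m, l_q)."""
    l_c, l_d, l_w, l_m, l_q = lens
    return {
        "recent":    [t - i for i in range(1, l_c + 1)],
        "daily":     [t - i * p_d for i in range(1, l_d + 1)],
        "weekly":    [t - i * p_w for i in range(1, l_w + 1)],
        "monthly":   [t - i * s_m for i in range(1, l_m + 1)],
        "quarterly": [t - i * s_q for i in range(1, l_q + 1)],
    }
```

With all five lengths equal to 3, only 15 frames are fetched in total, matching the complexity argument above.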
Model learning/predicting stage. We employ graph convolutional networks (GCNs, see Section 3.2) and fully-connected neural networks (FNNs) to model the temporal and global views, respectively. For each temporal view, a GCN is used to learn the time-varying spatial correlations and interactions using the structural information of the STG; the corresponding outputs of the five GCNs are denoted h_c, h_d, h_w, h_m, and h_q. Two FNNs are employed to capture the influences from the external and meta data, respectively. All these outputs are then fed into the multi-view fusion module (see Section 3.3), followed by a post-net (e.g. an FNN), to obtain the final prediction. The multi-view fusion can effectively combine the outputs of different views based on their characteristics. Finally, we propose employing the Huber loss [19] for robust regression.
3.2 Graph Convolutional Network for STG
Convolutional networks over graphs. Recently, generalizing convolutional networks to graphs has become an area of interest. In this paper, we mainly consider spectral convolutions [5, 9] on arbitrary graphs. As it is difficult to express a meaningful translation operator in the node domain [5], [9] presented a spectral formulation for the convolution operator on the graph, denoted as ★_G. By this definition, the graph signal x ∈ R^N is filtered by g_θ, parameterized by θ in the Fourier domain:

g_θ ★_G x = U g_θ(Λ) Uᵀ x,    (1)

where U is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues of the normalized graph Laplacian L = I_N − D^{−1/2} A D^{−1/2}; here I_N is the identity matrix and D is the diagonal degree matrix with D_ii = Σ_j A_ij. We can understand g_θ(Λ) as a function of the eigenvalues of L. However, evaluating Eq. 1 is computationally expensive, as the multiplication with U is O(N²). To circumvent this problem, a Chebyshev polynomial expansion (up to order K) [9] was applied to obtain an efficient approximation:

g_θ ★_G x ≈ Σ_{k=0}^{K} θ_k T_k(L̃) x,    (2)

where T_k is the Chebyshev polynomial of order k evaluated at the scaled Laplacian L̃ = 2L/λ_max − I_N, λ_max denotes the largest eigenvalue of L, and θ ∈ R^{K+1} is now a vector of Chebyshev coefficients. The details of this approximation can be found in [14, 9].
Furthermore, [22] proposed a fast approximation of the spectral filter by setting K = 1 (with λ_max ≈ 2) and successfully used it for semi-supervised classification of nodes:

Z = D̃^{−1/2} Ã D̃^{−1/2} X Θ,    (3)

where Z ∈ R^{N×F} is the convolved signal matrix, Ã = A + I_N is the adjacency matrix of the graph with added self-connections, D̃ is the diagonal degree matrix of Ã with D̃_ii = Σ_j Ã_ij, and Θ ∈ R^{C×F} is a trainable matrix of filter parameters in a graph convolutional layer. The filtering operation has complexity O(|E|FC) [22] and can be efficiently implemented as the product of a sparse matrix with a dense matrix.
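A dense NumPy sketch of the layer in Eq. 3 (real implementations use sparse matrices for the stated complexity; the function name is ours):

```python
import numpy as np

def gcn_layer(A, X, Theta):
    """Fast approximate graph convolution (Eq. 3):
    Z = D̃^{-1/2} Ã D̃^{-1/2} X Θ, with Ã = A + I."""
    A_tilde = A + np.eye(A.shape[0])          # add self-connections
    d = A_tilde.sum(axis=1)                   # degrees of Ã
    D_inv_sqrt = np.diag(d ** -0.5)           # D̃^{-1/2}
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X @ Theta
```

With no edges (A = 0) the normalized operator reduces to the identity, so the layer degenerates to the per-node linear map X Θ, a useful sanity check.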
Spatial graph convolutional network. We present a variant of the fast approximate graph convolution (Eq. 3) that also considers the geospatial positions of vertices in an STG. We explore an approach to integrate such geospatial positions based on the First Law of Geography [38]: everything is related to everything else, but near things are more related than distant things.
Given an adjacency matrix A, we assign spatial weights to the existing edges based on the spatial distance:

Â = A ⊙ S,    (4)

where Â is the modified adjacency matrix and ⊙ is the Hadamard product (i.e. element-wise multiplication). S is the spatially weighted adjacency matrix, calculated via a thresholded Gaussian kernel weighting function [31]:

S_ij = exp(−d_ij² / σ²)  if exp(−d_ij² / σ²) ≥ ε,  and 0 otherwise,    (5)

where d_ij denotes the geographical distance between nodes v_i and v_j; σ and ε are two parameters that control the scale and sparsity of the adjacency matrix.
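Eqs. 4-5 can be sketched as follows (Euclidean distance on planar coordinates is an assumption; the paper uses geographical distance):

```python
import numpy as np

def spatial_adjacency(A, coords, sigma=1.0, eps=0.1):
    """Â = A ⊙ S (Eq. 4), where S is a thresholded Gaussian kernel of
    pairwise node distances (Eq. 5); sigma controls the scale and eps
    the sparsity of the resulting adjacency matrix."""
    diff = coords[:, None, :] - coords[None, :, :]
    d2 = (diff ** 2).sum(-1)                  # squared pairwise distances
    S = np.exp(-d2 / sigma ** 2)
    S[S < eps] = 0.0                          # threshold for sparsity
    return A * S                              # Hadamard product with A
```

Nearby connected nodes keep a weight close to 1, while edges between distant nodes are zeroed out, encoding the First Law of Geography directly in Â.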
With the modified matrix Â, we consider multiple graph convolutional layers with the following layer-wise propagation rule:

H^(l+1) = f( D̃^{−1/2} Ã D̃^{−1/2} H^(l) W^(l) ),    (6)

where H^(l+1) and H^(l) are the output and input of the l-th layer, Ã = Â + I_N is the modified adjacency matrix with added self-connections, D̃ is its diagonal degree matrix, W^(l) is a trainable matrix of filter parameters in a graph convolutional layer, and f denotes an activation function, e.g. the rectifier [23]. The filtering operation has complexity O(|E|FC), as it can be efficiently implemented as a product of a sparse matrix with a dense matrix.
GCN-based Residual Unit. To capture K-hop spatial correlations and interactions, we stack K spatial graph convolutional layers, inspired by graph convolutions [22]. When K is large, we need a very deep network. Residual learning [16] allows neural networks to have a super deep structure of over 100 layers. Here we propose a GCN-based residual unit that integrates the graph convolutional layer into the residual framework (Figure 5(a)). Formally, the residual unit is defined as:

H^(l+1) = H^(l) + f( D̃^{−1/2} Ã D̃^{−1/2} H^(l) W^(l) ),    (7)

where f is an activation function. By stacking multiple GCN-based residual units, we can build very deep neural networks to capture multi-hop spatial dependencies.
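A minimal dense sketch of one such residual unit (assuming, as our reading of Eq. 7, a single graph convolution with ReLU inside the skip connection; W must be square so the shapes match):

```python
import numpy as np

def gcn_residual_unit(A_hat, H, W):
    """GCN-based residual unit (Eq. 7): H + ReLU(norm(Â) H W),
    where norm(Â) = D̃^{-1/2} (Â + I) D̃^{-1/2}."""
    A_t = A_hat + np.eye(A_hat.shape[0])      # add self-connections
    d = A_t.sum(axis=1)
    A_norm = A_t / np.sqrt(np.outer(d, d))    # symmetric normalization
    return H + np.maximum(0.0, A_norm @ H @ W)
```

Stacking this unit K times propagates information K hops while the identity skip path keeps gradients flowing through deep stacks.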
3.3 Multiview Fusion
We propose a multi-view fusion method (see Figure 5(b)) to fuse the latent representations of the five flow views with the two global views (external and meta data). In our previous crowd flow prediction work [48], we showed that different regions share these temporal properties but the degrees of influence may differ. Inspired by this, we also employ the parametric-matrix-based fusion method [48] to fuse the outputs of the five GCNs for the temporal views:

X_tmp = W_c ⊙ h_c + W_d ⊙ h_d + W_w ⊙ h_w + W_m ⊙ h_m + W_q ⊙ h_q,    (8)

where W_c, W_d, W_w, W_m, and W_q are learnable parameters that adjust the degrees affected by closeness, daily period, weekly period, monthly trend, and quarterly trend, respectively.
For the external factors (like weather and holidays) and the meta data (e.g. time of the day), we separately feed them into different fully-connected (FC) layers to obtain different latent representations. We then concatenate all the outputs of the embedding module and add an FC layer followed by reshaping, obtaining the global representation X_glb.
Different factors may change the flows in different ways. For example, holidays may change the crowd flows gradually, as shown in Figure 6(a), while rainstorms may reduce the flows sharply and dramatically (Figure 6(b)). The latter acts like a switch, changing the flows tremendously when it happens. On account of these insights, we leverage two different fusion methods to deal with these two types of situations. For the gradual changes, we employ a sum-fusion method, e.g., X_tmp + X_glb. For the sudden changes, we employ a gating-mechanism-based fusion, e.g., g(X_glb) ⊙ X_tmp, where g is an approximated gating function such as the sigmoid. When the concatenated global representation captures some special external information, such as rainstorm weather, the gate g(X_glb) will suddenly take a much larger value due to the property of the sigmoid function, while in most common cases, without sudden changes, this term stays close to zero. Based on the two fusion methods, the final output is calculated as

X_fusion = f( X_tmp + X_glb + g(X_glb) ⊙ X_tmp ),    (9)

where f is an activation function, e.g., tanh, and g is the sigmoid gate.
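The two-stage fusion can be sketched in NumPy (the combined form in Eq. 9 follows our reading of the text; the names `fuse`, `h_views`, `W_views`, and `x_glb` are illustrative, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse(h_views, W_views, x_glb):
    """Parametric-matrix fusion of the temporal views (Eq. 8), then sum
    fusion plus a sigmoid-gated term with the global representation
    (Eq. 9, as reconstructed here)."""
    x_tmp = sum(W * h for W, h in zip(W_views, h_views))    # Eq. 8
    return np.tanh(x_tmp + x_glb + sigmoid(x_glb) * x_tmp)  # Eq. 9
```

When the global signal is neutral (x_glb = 0), the gate contributes a constant factor of 0.5 and the output varies smoothly with the temporal views; a strongly activated x_glb pushes the gate toward 1 and amplifies the temporal term, modeling sudden changes.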
3.4 Loss and Algorithm
Let x and x̂ be the observed and predicted values. The objective function we employ here is the Huber loss, an elegant compromise between squared-error loss and absolute-error loss that has been verified as a robust loss function for regression [19]. The Huber loss, denoted L_δ, is defined by

L_δ(x, x̂) = ½ (x − x̂)²          if |x − x̂| ≤ δ,
            δ |x − x̂| − ½ δ²     otherwise,    (10)

where δ is a threshold (1 by default). The Huber loss combines the desirable properties of squared-error loss near zero and absolute-error loss when |x − x̂| is greater than δ (Table V shows the empirical comparison).
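Eq. 10 translates directly to a vectorized implementation:

```python
import numpy as np

def huber(y, y_hat, delta=1.0):
    """Elementwise Huber loss (Eq. 10): quadratic for residuals within
    delta, linear (with matched value and slope) beyond it."""
    r = np.abs(np.asarray(y) - np.asarray(y_hat))
    return np.where(r <= delta,
                    0.5 * r ** 2,
                    delta * r - 0.5 * delta ** 2)
```

The two branches agree at |x − x̂| = δ (both give ½δ²), so the loss is continuous and differentiable there, which is what makes it robust to outliers without sacrificing smoothness near zero.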
Let θ be all the trainable parameters in MVGCN. The Huber loss yields the following optimization problem:

θ* = argmin_θ Σ_{i,j} L_δ( (X_t)_{ij}, (X̂_t)_{ij} ),    (11)

where (X_t)_{ij} denotes the element of the i-th row and j-th column of X_t.
4 Experiments
4.1 Settings
Datasets. We use four different datasets as shown in Table II. The details are described as follows:
TaxiNYC (http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml): The trajectory data is taxi GPS data for New York City (NYC) from 1st Jan. 2011 to 30th Jun. 2016. We partition NYC into 100 irregular regions based on the map segmentation method (Section 2.1), build the graph according to transition flows and geographical distances between regions, and then calculate crowd flows as in [17].
TaxiBJ: The trajectory data is taxicab GPS data in Beijing from four time intervals: 1st Jul. 2013 - 30th Oct. 2013, 1st Mar. 2014 - 30th Jun. 2014, 1st Mar. 2015 - 30th Jun. 2015, and 1st Nov. 2015 - 10th Apr. 2016. The graph construction and crowd flow calculation method in Beijing is the same as that of NYC.
BikeDC (https://www.capitalbikeshare.com/systemdata): The data is taken from the Washington D.C. bike system. Trip data includes trip duration, start and end station IDs, and start and end times. There are 472 stations in total. For each station, we get two types of flows, where the inflow is the number of checked-in bikes and the outflow is the number of checked-out bikes. Since many stations have no data or very few records, we remove these stations and apply a clustering operation [20] to the remaining stations using the average flow of historical observations, obtaining 120 irregular regions. We construct the graph with the transition flows and geographical distances between these regions.
BikeNYC (https://www.citibikenyc.com/systemdata): The data is taken from the NYC bike system from 1st Jul. 2013 to 31st Dec. 2016. There are 416 stations in total. We also remove unavailable bike stations and cluster the remaining stations into 120 regions. The graph construction and bike flow calculation method in NYC is the same as that of D.C.
For all four aforementioned datasets, we choose the data from the last four weeks as the test set and all data before that as the training set. We build the commuting network (i.e. graph) via the geographical distance between stations or regions, which can be viewed as nodes in the graph. Each station has a geospatial position; for a region, we approximate it by the geospatial position of the region's central location.
Dataset  TaxiNYC  TaxiBJ  BikeDC  BikeNYC 

Data type  Taxi trip  Taxi GPS  Bike rent  Bike rent 
Location  NYC  Beijing  D.C.  NYC 
Start time  1/1/2011  7/1/2013  1/1/2011  7/1/2013 
End time  6/30/2016  4/10/2016  12/31/2016  12/31/2016 
Time interval  1 hour  1 hour  1 hour  1 hour 
# timesteps  48192  12336  52608  30720 
# regions (stations)  100  100  120 (472)  120 (416) 
# holidays  627  105  686  401 
Weather  \  16 types  \  \ 
Temp. / °C  \  [−24.6, 41]  \  \ 
WS / mph  \  [0,48.6]  \  \ 
Baselines. We compare MVGCN with the following 8 models:

HA: Historical average, which models crowd flows as a seasonal process with a period of one week and uses the average of previous seasons as the prediction. For example, the prediction for this Tuesday is the average of the crowd flows from all historical Tuesdays.

VAR: Vector autoregression is a more advanced spatio-temporal model, which is implemented using the statsmodels Python package (http://www.statsmodels.org). The number of lags is set as 3, 5, 10, or 30, and the best result is reported.

GBRT: Gradient boosted regression trees [11]. It uses the same features as the input of STANN. The optimal parameters are obtained by grid search.

FC-LSTM: Encoder-decoder framework using LSTM [36]. Both the encoder and decoder have two recurrent layers with 128 or 64 LSTM units.

GCN: We build a 3-layer supervised graph convolutional network employing the graph convolution of [22]. The inputs are the previous 6 timesteps and the output is the target timestep.

DCRNN: We build a 2-layer supervised diffusion convolutional recurrent neural network [27], which achieves state-of-the-art results in predicting traffic speed on roads. The inputs are the previous 6 timesteps and the output is the target timestep (or timesteps, for multi-step prediction).

FCCF: Forecasting Citywide Crowd Flows model based on Gaussian Markov random fields [17], which leverages the flows in all individual regions and the transitions between regions as well as external factors. As the other baselines do not use the transition features, we remove the transitions to obtain an additional baseline, named FCCF-noTrans.



Dataset  Metric  HA  VAR  GBRT  FC-LSTM  GCN  DCRNN  FCCF-noTrans  FCCF  MVGCN 
TaxiNYC  RMSE  101.54  30.78  83.71  36.44  34.55  35.86  26.02  26.00  25.59 
MAE  33.02  11.21  23.46  14.16  12.25  13.92  9.25  9.24  9.42  
TaxiBJ 
RMSE  38.77  18.79  33.89  20.49  17.71  20.90  18.70  18.42  16.07 
MAE  22.89  11.38  20.34  13.05  10.48  13.64  10.74  10.44  9.52  
BikeDC  RMSE  2.61  1.95  3.46  1.88  1.88  1.90  2.22  2.14  1.72 
MAE  1.48  1.20  1.98  1.10  1.08  1.20  1.34  1.27  1.00  
BikeNYC  RMSE  6.77  4.21  8.57  4.66  5.06  4.35  4.41  4.19  4.15 
MAE  4.00  2.71  5.17  2.78  2.85  2.90  2.79  2.65  2.60  

The neural-network-based models are implemented using TensorFlow and trained via backpropagation with Adam [21] optimization.

Preprocessing. The Min-Max normalization method is used to scale the data into the range [0, 1] or [−1, 1]. In the evaluation, we rescale the predicted values back to the original scale and compare them with the ground truth. For external factors, we use one-hot encoding to transform the meta data (i.e., the day of the week and the time of the day), holidays, and weather conditions into binary vectors, and use Min-Max normalization to scale the temperature and wind speed into the range [0, 1].

Hyper-parameters. We introduce the hyper-parameter settings of our experiments. For the five dependent sequences, we use fixed input lengths l_c, l_d, l_w, l_m, and l_q for the recent, daily, weekly, monthly, and quarterly lists, together with a fixed number of graph convolutional layers. The hidden unit size is set as 10 for each embedding layer by default. The training data is split into three parts: the last four weeks' data is used as the test set, the adjacent previous four weeks' data is used as the validation set, and the rest of the data is used to train the models. The validation set is used to control the training process by early stopping and to choose the final parameters for each model based on the best validation score. The batch size is set as 48, and the learning rate is fixed. For all trained models, we only select the model with the best score on the validation set and evaluate it on the test set.

Evaluation Metrics. For the evaluation of ST prediction, we employ two metrics that are widely used in regression tasks: Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). Given predicted values x̂_1, …, x̂_z and ground-truth values x_1, …, x_z, they are respectively calculated as

RMSE = sqrt( (1/z) Σ_{i=1}^{z} (x_i − x̂_i)² ),  MAE = (1/z) Σ_{i=1}^{z} |x_i − x̂_i|,

where z is the total number of predicted values.
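The two metrics are one-liners in NumPy:

```python
import numpy as np

def rmse(x, x_hat):
    """Root Mean Square Error over all predicted values."""
    return float(np.sqrt(np.mean((np.asarray(x) - np.asarray(x_hat)) ** 2)))

def mae(x, x_hat):
    """Mean Absolute Error over all predicted values."""
    return float(np.mean(np.abs(np.asarray(x) - np.asarray(x_hat))))
```

RMSE penalizes large errors more heavily than MAE, which is why the two metrics can rank models differently, as seen for MAE on TaxiNYC in Table III.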
4.2 Comprehensive Results
Table III presents a comprehensive comparison. Overall, our MVGCN performs best on all datasets under both metrics, except for MAE on TaxiNYC, where MVGCN is still comparable to the best model, FCCF. Among the four datasets, MVGCN achieves the greatest improvement on TaxiBJ, because this dataset contains more external information, like weather, temperature, and wind speed. FCCF performs very well because it also considers period and trend as well as external information, and even the transitions between regions. When the transition features are removed, FCCF degrades into FCCF-noTrans, resulting in a small increase in both RMSE and MAE, which shows the effectiveness of the transition features. FC-LSTM and DCRNN perform worse than FCCF and MVGCN because they are designed to model sequences and do not consider period and trend in the crowd flow data.
4.2.1 Results on sudden changes
Figure 7 presents comparisons between MVGCN and five baselines in a special case, i.e. sudden changes, which may be caused by anomalous weather or traffic events. For each of the four datasets, we select the largest citywide flow changes between adjacent timesteps in the test set as the sudden-change cases. We observe that our MVGCN greatly outperforms all the baselines, especially on TaxiBJ. One reason may be that MVGCN can effectively model the weather data that is only available in TaxiBJ.
4.2.2 Results on multistep prediction
For further analysis, we present the multi-step prediction results based on RMSE and MAE on the BikeDC dataset in Figure 8. For the single-step prediction models, e.g. our MVGCN, we train different models for different timesteps. For the multi-step prediction models, including FC-LSTM and DCRNN, we use the previous 6 timesteps as the input sequence and the next 6 timesteps as the target sequence. Our MVGCN is robust as the step number varies from 1 to 6 (only a small increase in both RMSE and MAE), achieving the best results for all 6 steps. We can observe that the original graph convolutional network (GCN) is not robust as the timestep increases, demonstrating that existing models do not work well if applied to crowd flow prediction in a straightforward way. DCRNN performs less well because it only uses the sequence from the recent timesteps, so it cannot capture period, trend, and external factors.
4.3 Effects of Different Components
4.3.1 Temporal view
Figure 9 shows the effects of different combinations of temporal views based on RMSE and MAE, including the recent (view 1), daily (view 2), weekly (view 3), monthly (view 4), and quarterly (view 5) views. With only the recent view considered, the result is poor. Adding the daily view greatly improves the result, indicating that periodicity is an important feature of traffic flow patterns. The result then improves steadily as more temporal views are considered.
4.3.2 Geospatial position.
Recall that our model introduces a spatial graph convolution (see Eq. 6), which integrates the geospatial position into the graph convolution. After eliminating such geospatial information, the layer degrades into a standard graph convolution (Eq. 3). From Table IV, we observe that the RMSE increases from 25.29 to 27.91 without the geospatial position, and the MAE also becomes worse, demonstrating the effectiveness of the spatial graph convolution.



Setting  RMSE  MAE 
MVGCN  25.29  9.72 
w/o geospatial position  27.91  10.74 
w/o external  27.36  10.15 
w/o metadata  26.75  10.01 

4.3.3 Global information
To show the effects of the embedding component, we compare the performance of MVGCN under two settings: removing the external factors or the meta data, as shown in Table IV. Eliminating the external factors increases the RMSE from 25.29 to 27.36; similarly, without the meta data, the RMSE increases to 26.75. The results demonstrate that the external factors and meta data affect prediction in an STG.
4.3.4 Huber loss and number of GCN layers.
We present experiments with different loss functions and varying numbers of spatial graph convolutional layers, as shown in Table V. The Huber loss yields lower RMSE than the absolute-error and squared-error losses across different numbers of graph convolutional layers, empirically showing that the Huber loss is better suited to robust regression than the other two losses. As the number of GCN layers increases, the RMSEs of all three methods decrease, demonstrating that deeper networks yield better results.



Loss  number of GCN layers  
3  4  5  
absoluteerror  36.73  30.48  27.18 
squarederror  33.76  27.06  26.05 
Huber  27.72  26.18  25.79 

5 Crowd flow forecasting system in irregular regions
We have developed a crowd flow forecasting demo for irregular regions (called UrbanFlow) internally, which can be accessed at http://101.124.0.58/urbanflow_graph, as shown in Figure 10(a). We have deployed it in the city area of Beijing, China, similar to our previous system for gridded regions; the detailed system architecture can be found in our previous work (Section 3 of [49]). Figure 10(a) shows the inflow and outflow results for a certain region, where the green line represents the ground-truth inflow or outflow over the previous 14 hours, the blue line denotes the prediction results for the same 14 hours, and the orange line shows the forecasted values for the next 10 hours. The green and blue lines have very close values and similar trends, meaning that our MVGCN works effectively in the traffic flow forecasting system. Figure 10(b) displays another view of the overall flow changes at different timestamps for the whole city, where we can observe the overall flow distribution varying with time. As the figure shows, in the morning rush hours most regions have larger crowd flows because people are travelling from home, and the flows decrease in the mid-afternoon, during which most people are working or resting indoors.
6 Related Work
6.1 Spatio-Temporal Prediction
There is some previously published work on predicting an individual’s movement based on their location history [10, 35, 30]. These approaches forecast millions or even billions of individual mobility traces rather than the aggregated crowd flows in a region; such a task may require enormous computational resources and is not always necessary for public safety applications. Other researchers aim to predict travel speed and traffic volume on roads [40, 1, 32]. Most of them make predictions for single or multiple road segments rather than citywide ones [43, 7]. Recently, researchers have started to focus on city-scale traffic flow prediction [17, 28, 44]. Specifically, [17] proposed a Gaussian Markov random field based model (called FCCF) that achieves state-of-the-art results on the crowd flow forecasting problem, which can be formulated as a prediction problem on an STG. [44] proposes a multi-view framework for citywide crowd flow prediction, but it targets flow prediction in regular regions using traditional convolutional neural networks.
6.2 Classical Models for Time Series Prediction
Forecasting flow in a spatiotemporal network can be viewed as a time series prediction problem. Existing time-series models, like the autoregressive integrated moving average model (ARIMA) [4], seasonal ARIMA [34], and the vector autoregressive model [6], can capture temporal dependencies very well, yet they fail to capture spatial correlations.
6.3 Neural Networks for Sequence Prediction
Neural networks and deep learning [3, 26] have achieved numerous successes in fields such as computer vision [23, 33], speech recognition [12, 13], and natural language understanding [25, 29]. Recurrent neural networks (RNNs) have been used successfully for sequence learning tasks [37, 2]. The incorporation of long short-term memory (LSTM) [18] or gated recurrent units (GRU) [8] enables RNNs to learn long-term temporal dependencies. However, each of these neural network models captures only spatial or only temporal dependencies. Recently, researchers have combined the above networks and proposed the convolutional LSTM network [41], which learns spatial and temporal dependencies simultaneously but cannot operate on spatiotemporal graphs. [48] proposed a spatiotemporal residual network, which is capable of capturing spatiotemporal dependencies as well as external factors in regular regions, yet it cannot be adapted to deal with graphs.
7 Conclusion
We propose a novel multi-view deep learning model, MVGCN, consisting of several graph convolutional networks, to predict the inflow and outflow in each and every irregular region of a city. MVGCN not only captures spatially adjacent and multi-hop correlations as well as their interactions, but also integrates the geospatial position via spatial graph convolutions. In addition, MVGCN captures many types of temporal properties, including closeness, periods (daily, weekly, etc.), and trends (e.g. monthly, quarterly), as well as various external factors (like weather and events) and meta information (e.g. time of day). We evaluate MVGCN on four real-world datasets from different cities, achieving performance significantly better than 8 baselines, including recurrent neural networks and Gaussian Markov random field based models.
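For readers unfamiliar with the spatial graph convolution step summarized above, the following is a minimal, dependency-free sketch of one layer. The 3-region path graph, feature values, and weights are all hypothetical, and the layer follows the standard GCN propagation rule of [22], which may differ in detail from MVGCN's exact formulation; stacking k such layers mixes information from k-hop neighbors, which is how multi-hop correlations are captured.

```python
import math

def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def gcn_layer(A, H, W):
    """One graph-convolution layer: ReLU(D^-1/2 (A+I) D^-1/2 . H . W).

    A: n x n binary adjacency, H: n x f node features, W: f x g weights.
    Self-loops (A + I) let each region keep its own signal; symmetric
    degree normalization keeps propagated values on a comparable scale
    regardless of how many neighbors a region has.
    """
    n = len(A)
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in A_hat]
    # Normalized adjacency: A_norm[i][j] = A_hat[i][j] / sqrt(d_i * d_j)
    A_norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]
    # Propagate features over the graph, apply the linear map, then ReLU.
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, v) for v in row] for row in Z]

# Toy 3-region graph: region 0 -- region 1 -- region 2 (a path).
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H = [[1.0], [0.0], [0.0]]   # one scalar flow feature per region
W = [[1.0]]                 # trivial 1x1 weight for readability
print(gcn_layer(A, H, W))   # region 1 now receives part of region 0's signal
```

After one layer, only the immediate neighbor of region 0 sees its signal; a second layer would propagate it to region 2, illustrating the multi-hop effect.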
Acknowledgments
The work was supported by the National Natural Science Foundation of China (Grant No. 61672399, No. U1401258, and No. 61773324), and the China National Basic Research Program (973 Program, No. 2015CB352400).
References
 [1] A. Abadi, T. Rajabioun, and P. A. Ioannou, “Traffic flow prediction for road transportation networks with limited traffic data,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 2, pp. 653–662, 2015.
 [2] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” CoRR, vol. abs/1409.0473, 2015.
 [3] Y. Bengio, I. J. Goodfellow, and A. Courville, Deep Learning. Book in preparation for MIT Press, 2015. URL: http://www.iro.umontreal.ca/bengioy/dlbook
 [4] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time series analysis: forecasting and control. John Wiley & Sons, 2015.
 [5] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” in ICLR, 2014.
 [6] S. R. Chandra and H. AlDeek, “Predictions of freeway traffic speeds and volumes using vector autoregressive models,” Journal of Intelligent Transportation Systems, vol. 13, no. 2, pp. 53–72, 2009.
 [7] P.-T. Chen, F. Chen, and Z. Qian, “Road traffic congestion monitoring in social media with hinge-loss markov random fields,” in 2014 IEEE International Conference on Data Mining. IEEE, 2014, pp. 80–89.
 [8] K. Cho, B. van Merrienboer, Çaglar Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” in EMNLP, 2014.
 [9] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
 [10] Z. Fan, X. Song, R. Shibasaki, and R. Adachi, “Citymomentum: an online approach for crowd behavior prediction at a citywide level,” in Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2015, pp. 559–569.
 [11] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, pp. 1189–1232, 2001.
 [12] A. Graves, A.r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 2013, pp. 6645–6649.
 [13] A. Graves, A.-r. Mohamed, and G. E. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.
 [14] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
 [15] R. Haralick and L. Shapiro, Computer and Robot Vision, vol. I, 1991.
 [16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
 [17] M. X. Hoang, Y. Zheng, and A. K. Singh, “Fccf: forecasting citywide crowd flows based on big data,” in Proceedings of the 24th ACM SIGSPATIAL. ACM, 2016.
 [18] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [19] P. J. Huber, “Robust estimation of a location parameter,” The Annals of Mathematical Statistics, vol. 35, no. 1, pp. 73–101, 1964.
 [20] G. Karypis and V. Kumar, “Parallel multilevel series k-way partitioning scheme for irregular graphs,” SIAM Review, vol. 41, no. 2, pp. 278–300, 1999.
 [21] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [22] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
 [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
 [24] L. Lam, S.-W. Lee, and C. Suen, “Thinning methodologies: a comprehensive survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, pp. 869–885, 1992.
 [25] Q. V. Le and T. Mikolov, “Distributed representations of sentences and documents,” in ICML, vol. 14, 2014, pp. 1188–1196.
 [26] Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.
 [27] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional recurrent neural network: Data-driven traffic forecasting,” in ICLR, 2018.
 [28] Y. Li, Y. Zheng, H. Zhang, and L. Chen, “Traffic prediction in a bikesharing system,” in Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2015, p. 33.
 [29] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH, 2010.
 [30] S. Scellato, M. Musolesi, C. Mascolo, V. Latora, and A. T. Campbell, “Nextplace: A spatiotemporal prediction framework for pervasive systems,” in Pervasive, 2011.
 [31] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
 [32] R. Silva, S. M. Kang, and E. M. Airoldi, “Predicting traffic volumes and estimating the effects of shocks in massive transportation systems,” Proceedings of the National Academy of Sciences, vol. 112, no. 18, pp. 5643–5648, 2015.
 [33] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” CoRR, vol. abs/1409.1556, 2015.
 [34] B. L. Smith, B. M. Williams, and R. K. Oswald, “Comparison of parametric and nonparametric models for traffic flow forecasting,” Transportation Research Part C: Emerging Technologies, vol. 10, no. 4, pp. 303–321, 2002.
 [35] X. Song, Q. Zhang, Y. Sekimoto, and R. Shibasaki, “Prediction of human emergency behavior and their mobility following largescale disaster,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 5–14.
 [36] N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised learning of video representations using lstms,” arXiv preprint arXiv:1502.04681, 2015.
 [37] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
 [38] W. R. Tobler, “A computer movie simulating urban growth in the detroit region,” Economic geography, vol. 46, no. sup1, pp. 234–240, 1970.
 [39] W. Wang, R. Arora, K. Livescu, and J. Bilmes, “On deep multiview representation learning,” in International Conference on Machine Learning, 2015, pp. 1083–1092.
 [40] Y. Wang, Y. Zheng, and Y. Xue, “Travel time estimation of a path using sparse trajectories,” in KDD, 2014.
 [41] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” in Advances in Neural Information Processing Systems, 2015, pp. 802–810.
 [42] C. Xu, D. Tao, and C. Xu, “A survey on multiview learning,” arXiv preprint arXiv:1304.5634, 2013.
 [43] Y. Xu, Q.J. Kong, R. Klette, and Y. Liu, “Accurate and interpretable bayesian mars for traffic flow prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 6, pp. 2457–2469, 2014.
 [44] H. Yao, F. Wu, J. Ke, X. Tang, Y. Jia, S. Lu, P. Gong, J. Ye, and Z. Li, “Deep multiview spatialtemporal network for taxi demand prediction,” in AAAI, 2018.
 [45] J. Yuan, Y. Zheng, and X. Xie, “Discovering regions of different functions in a city using human mobility and pois,” in The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, 2012, pp. 186–194.
 [46] N. J. Yuan, Y. Zheng, and X. Xie, “Segmentation of urban areas using road networks,” MSRTR2012–65, Technical Report, 2012.
 [47] J. Zhang, Y. Zheng, J. Sun, and D. Qi, “Flow prediction in spatiotemporal networks based on multitask deep learning,” IEEE Transactions on Knowledge and Data Engineering, pp. 1–1, 2019.
 [48] J. Zhang, Y. Zheng, and D. Qi, “Deep spatio-temporal residual networks for citywide crowd flows prediction,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 [49] J. Zhang, Y. Zheng, D. Qi, R. Li, X. Yi, and T. Li, “Predicting citywide crowd flows using deep spatiotemporal residual networks,” arXiv preprint arXiv:1701.02543, 2017.
 [50] Y. Zheng, L. Capra, O. Wolfson, and H. Yang, “Urban computing: concepts, methodologies, and applications,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 5, no. 3, p. 38, 2014.