I Introduction
Metro is an efficient and economical travel mode in metropolises, and it plays an important role in the daily life of residents. By the end of 2018, 35 metro systems had been put into operation to serve tens of millions of passengers in Mainland China (https://en.wikipedia.org/wiki/Urban_rail_transit_in_China). For instance, over 10 million metro trip transactions were made per day in 2018 in Beijing (https://en.wikipedia.org/wiki/Beijing_Subway) and Shanghai (https://en.wikipedia.org/wiki/Shanghai_Metro). Such huge metro ridership poses great challenges for urban transportation, and any carelessness in traffic management may result in citywide congestion. To improve the service efficiency of metro systems, a fundamental problem is how to accurately forecast the ridership (e.g., inflow and outflow) of each station, which is termed station-level metro ridership prediction in this work. Due to its potential applications in traffic dispatch and route planning, this problem has become a hot research topic [1, 2, 3, 4, 5, 6] in the community of intelligent transportation systems (ITSs).
Over the past decade, massive efforts have been made to address the prediction of traffic states (e.g., flow, speed and demand). In early works [7, 8, 9], the raw data of traffic states at each time interval was usually transformed into a vector/sequence, and time series models [10, 8] were applied for prediction. However, this data format failed to maintain the spatial information of locations and the topological connections between locations. In recent years, deep neural networks (e.g., Long Short-Term Memory [11] and Gated Recurrent Unit [12]) have been widely used for citywide traffic prediction [13, 14, 15, 16, 17]. These works usually partitioned the studied cities into regular grid maps on the basis of geographical coordinates and organized the collected traffic state data as Euclidean 2D or 3D tensors, which can be directly fed into convolutional networks for automatic representation learning. Nevertheless, this manner is unsuitable for metro systems, since their topologies are irregular graphs and their data structures are non-Euclidean. Although the transaction records of a metro system can be rendered as a grid map [18], it is inefficient to learn ridership evolution patterns from the rendered map, which is very sparse and cannot maintain the connection information between stations. Therefore, a more reasonable mechanism is desirable for modeling graph-based metro systems.

Fortunately, the emerging Graph Neural Networks (GNN [19, 20, 21]) have been proven to be general and effective for non-Euclidean data embedding. How to construct the graph in a GNN is an open problem, and the construction strategy varies across tasks [22, 23, 24, 25]. Recently, some researchers [26, 27, 28, 29, 30, 31] have applied GCN to forecast traffic states, and most related works directly built graphs based on the physical topologies of the studied traffic systems. Although this simple strategy helps to learn the local spatial dependency of neighboring stations, we believe it is suboptimal for metro ridership prediction, since it cannot fully capture the inter-station flow patterns in a metro system. Besides the physical topology, some domain knowledge can also be utilized to guide the graph construction, such as:

Inter-station Flow Similarity: Intuitively, two metro stations in different regions may have similar evolution patterns of passenger flow if their regions share the same functionality (e.g., office districts). Even though these stations are not directly linked in the real-world metro system, we can connect them in a GNN with a virtual edge to jointly learn the evolution patterns.

Inter-station Flow Correlation: In general, the ridership between every two stations is not uniform, and the direction of passenger flow implicitly represents the correlation of two stations. For instance, if (i) the majority of the inflow of station $v_i$ streams to station $v_j$, or (ii) the outflow of station $v_j$ primarily comes from station $v_i$, we argue that the stations $v_i$ and $v_j$ are highly correlated. Under such circumstances, these stations could also be connected to learn the ridership interaction among stations.
Based on the above observations, we propose a unified Physical-Virtual Collaboration Graph Network (PVCGN) to predict future metro ridership in an end-to-end manner. To fully explore the ridership evolution patterns, we utilize the metro physical topology information and human domain knowledge to construct three complementary graphs. First, a physical graph is directly built on the basis of the realistic topology of the studied metro system. Then, a similarity graph and a correlation graph are built with virtual topologies based on the passenger flow similarity and correlation among different stations, respectively. In particular, the similarity score of two stations is obtained by computing the warping distance between their historical flow series with Dynamic Time Warping (DTW [32]), while the correlation ratio is determined by the historical origin-destination distribution of ridership. These tailor-designed graphs are incorporated into an extended Graph Convolution Gated Recurrent Unit to collaboratively capture the ridership evolution patterns. Furthermore, a Fully-Connected Gated Recurrent Unit is utilized to learn the semantic feature of the global evolution tendency. Finally, we apply a Seq2Seq model [33] to sequentially forecast the metro ridership of the next several time intervals. To verify the effectiveness of the proposed PVCGN, we conduct extensive experiments on two large-scale benchmarks (i.e., Shanghai Metro and Hangzhou Metro), and the evaluation results show that our approach outperforms existing state-of-the-art methods under various comparison settings.
In summary, our major contributions are threefold:

We systematically analyze the drawbacks of existing methods for traffic state prediction and propose to model the studied metro systems with graph networks, which are naturally suitable for representation learning on non-Euclidean data.

We develop a unified Physical-Virtual Collaboration Graph Network (PVCGN) to address station-level metro ridership prediction. Specifically, PVCGN incorporates a physical graph, a similarity graph and a correlation graph into a Graph Convolution Gated Recurrent Unit to facilitate spatial-temporal representation learning. The physical graph is built based on the realistic topology of a metro system, while the other two virtual graphs are constructed with human domain knowledge to fully exploit the ridership evolution patterns.

Extensive experiments on two real-world benchmarks show that the proposed PVCGN consistently outperforms existing methods by significant margins.
The remaining parts of this paper are organized as follows. We first review deep learning on graphs and some related works on traffic state prediction in Section II. We then introduce the proposed PVCGN in Section III and conduct extensive comparisons in Section IV. Finally, we conclude this paper and discuss future works in Section V.

II Related Work
II-A Deep Learning on Graphs
In machine learning, Euclidean data refers to signals with an underlying Euclidean structure [34] (such as speeches, images, and videos). Although deep Convolutional/Recurrent Neural Networks (CNN/RNN) can handle Euclidean data successfully, it is still challenging to deal with non-Euclidean data (e.g., graphs), which is the data structure of many applications. To address this issue, Graph Neural Networks (GNN) have been proposed to automatically learn feature representations on graphs. For instance, Bruna et al. [19] introduced a graph-Laplacian spectral filter to generalize the convolution operators in non-Euclidean domains. Defferrard et al. [35] presented a formulation of CNN with spectral graph theory and designed fast localized convolutional filters on graphs. Atwood and Towsley [36] developed a spatial-based graph convolution and regarded it as a diffusion process, in which the information of a node was transferred to its neighboring nodes with a certain transition probability. Veličković et al. [37] assumed the contributions of neighboring nodes to the central node were not identical and thus proposed a Graph Attention Network. Wu et al. [38] reduced the complexity of GNN by successively removing nonlinearities and collapsing weight matrices between consecutive layers.

Recently, GNN has been widely applied to various tasks, and the graph construction strategy varies across works. For instance, in computer vision, Jiang et al.
[39] utilized the co-occurrence probability, attribute correlation and spatial correlation of objects to build three graphs for large-scale object detection. Chen et al. [25] constructed a semantic-specific graph based on statistical label co-occurrence for multi-label image recognition. In natural language processing, Beck et al. [40] used source dependency information to build a Levi graph [41] for neural machine translation. For semi-supervised document classification, Kipf and Welling [21] introduced a first-order approximation of spectral graph convolutions [35] and constructed their graphs based on citation links. In data mining, the relations between items-and-items, users-and-users and users-and-items are usually leveraged to construct graph-based recommender systems [42]. In summary, how to build a graph is an open problem, and we should flexibly design the topology of a graph for a specific task.

II-B Traffic State Prediction
Accurately forecasting future traffic states is crucial for intelligent transportation systems, and numerous models have been proposed to address this task [43, 44, 45]. In early works [46, 7, 8, 9, 47], massive traffic data was collected from some specific locations, and the raw data at each time interval was arranged as a vector (sequence) in a certain order. These vectors were further fed into time series models for prediction. A representative work was the data aggregation (DA) model [48], in which moving average (MA), exponential smoothing (ES) and autoregressive integrated moving average (ARIMA) models were simultaneously applied to forecast traffic flow. However, this simple data format was inefficient due to the lack of spatial information, and these basic time series models failed to learn complex traffic patterns. Therefore, the above-mentioned works were far from satisfactory in complex traffic scenarios.
In recent years, deep neural networks have become the mainstream approach in this field. For instance, Zhang et al. [13] utilized three residual networks to learn the closeness, period and trend properties of crowd flow. Wang et al. [49] developed an end-to-end convolutional neural network to automatically discover the supply-demand patterns from car-hailing service data. Zhang et al. [17] simultaneously predicted the region-based flow and inter-region transitions with a deep multi-task framework. Subsequently, RNN and its variants were also widely adopted to learn temporal patterns. For instance, Yao et al. [14] proposed a Deep Multi-View Spatial-Temporal Network for taxi demand prediction, which learned the spatial relations and the temporal correlations with a deep CNN and a Long Short-Term Memory (LSTM [11]) unit, respectively. Liu et al. [50] developed an attentive convolutional LSTM network to dynamically learn spatial-temporal representations with an attention mechanism. In [16], an LSTM with a periodically shifted attention mechanism was introduced to capture long-term periodic dependency and temporal shifting. To fit the required input format of CNN and RNN, most of these works divided the studied cities into regular grid maps and transformed the raw traffic data into tensors. However, this preprocessing manner is ineffective for traffic systems with irregular topologies, such as metro systems and road networks.

To improve the generality of the above-mentioned methods, some researchers have attempted to address this task with Graph Convolutional Networks. For instance, Li et al. [26] modeled the traffic flow as a diffusion process on a directed graph and captured the spatial dependency with bidirectional random walks. Zheng et al. [31] developed a graph multi-attention network to model the impact of spatial-temporal factors on traffic conditions, while Zhao et al. [30] proposed a temporal graph convolutional network for traffic forecasting based on urban road networks. Recently, GCN has also been employed for metro ridership prediction. In [51], graph convolution operations were applied to capture the irregular spatiotemporal dependencies along the metro network, but the graph was directly built based on the physical topology of the metro system.
In contrast, we combine the physical topology information and human domain knowledge to construct three collaborative graphs with various topologies, which can effectively capture the complex ridership patterns.
III Methodology
In this work, we propose a novel Physical-Virtual Collaboration Graph Network (PVCGN) for station-level metro ridership prediction. Based on the physical topology of a metro system and human domain knowledge, we construct a physical graph, a similarity graph and a correlation graph, which are incorporated into a Graph Convolution Gated Recurrent Unit (GCGRU) for local spatial-temporal representation learning. Then, a Fully-Connected Gated Recurrent Unit (FCGRU) is applied to learn the global evolution feature. Finally, we develop a Seq2Seq framework with GCGRU and FCGRU to forecast the ridership of each metro station.
We first define some notations of ridership prediction before introducing the details of PVCGN. The ridership data of station $i$ at time interval $t$ is denoted as $x^i_t \in \mathbb{R}^2$, where the two values are the passenger counts of inflow and outflow. The ridership of the whole metro system is represented as a signal $X_t \in \mathbb{R}^{N \times 2}$, where $N$ is the number of stations. Given a historical ridership sequence, our goal is to predict a future ridership sequence:

$\hat{X}_{t+1}, \hat{X}_{t+2}, \dots, \hat{X}_{t+m} = \mathrm{PVCGN}(X_{t-n+1}, \dots, X_{t-1}, X_t), \qquad (1)$

where $n$ refers to the length of the input sequence and $m$ is the length of the predicted sequence. For convenience in the following subsections, we also denote the whole historical ridership of station $i$ as a vector $v_i \in \mathbb{R}^{T}$, where $T$ is the number of time intervals in the training set.
III-A Physical-Virtual Graphs
In this section, we describe how to construct the physical graph and the two virtual graphs. By definition, a graph is composed of nodes, edges, and the weights of edges. In our work, the physical graph, similarity graph and correlation graph are denoted as $\mathcal{G}_p = (V, E_p, W_p)$, $\mathcal{G}_s = (V, E_s, W_s)$ and $\mathcal{G}_c = (V, E_c, W_c)$, respectively. $V$ is the set of nodes ($|V| = N$) and each node represents a real-world metro station. Note that these three graphs share the same nodes, but have different edges and edge weights. $E_p$, $E_s$ and $E_c$ are the edge sets of the different graphs. For a specific graph $\mathcal{G}_x$ ($x \in \{p, s, c\}$), $W_x$ denotes the weights of all edges. Specifically, $W_x(i, j)$ is the weight of the edge from node $v_i$ to node $v_j$.
III-A1 Physical Graph
$\mathcal{G}_p$ is directly built based on the physical topology of the studied metro system. An edge is formed to connect nodes $v_i$ and $v_j$ in $\mathcal{G}_p$ if the corresponding stations $i$ and $j$ are connected in the real world. To calculate the weights of these edges, we first construct a physical connection matrix $P \in \mathbb{R}^{N \times N}$. As shown in Fig. 1(a,b), $P_{i,j} = 1$ if there exists an edge between $v_i$ and $v_j$, and $P_{i,j} = 0$ otherwise. The edge weight matrix $P^w$ is obtained by performing a linear normalization on each row (see Fig. 1(c)). Specifically, $P^w_{i,j}$ is computed by:

$P^w_{i,j} = \frac{P_{i,j}}{\sum_{k=1}^{N} P_{i,k}}. \qquad (2)$
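As an illustration, the row normalization of Eq. 2 can be sketched in a few lines; the toy 3-station line topology below is our own example, not from the paper:

```python
import numpy as np

def physical_edge_weights(conn):
    """Row-normalize a binary connection matrix into edge weights (Eq. 2).

    conn[i, j] = 1 if stations i and j are physically linked, else 0.
    """
    conn = np.asarray(conn, dtype=float)
    row_sums = conn.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # isolated node: avoid division by zero
    return conn / row_sums

# Toy 3-station line: 0 - 1 - 2
conn = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
weights = physical_edge_weights(conn)
# station 1 splits its weight evenly between its two neighbors
```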
III-A2 Similarity Graph
In this section, the similarities of metro stations are used to guide the construction of $\mathcal{G}_s$. First, we construct a similarity score matrix $S \in \mathbb{R}^{N \times N}$ by calculating the passenger flow similarities between every two stations. Specifically, the score $S_{i,j}$ between stations $i$ and $j$ is computed with Dynamic Time Warping (DTW [32]):

$S_{i,j} = \exp\left(-\mathrm{DTW}(v_i, v_j)\right), \qquad (3)$
where DTW is a general algorithm for measuring the distance between two temporal sequences. Based on the matrix $S$, we select some station pairs to build the edges $E_s$. The selection strategy is flexible. For instance, these virtual edges can be determined with a predefined similarity threshold, or be built by choosing the top-ranked station pairs with high similarity scores. More selection details can be found in Section IV-A2. Finally, we calculate the edge weights by conducting row normalization on $S$:

$S^w_{i,j} = \frac{\tilde{S}_{i,j}}{\sum_{k=1}^{N} \tilde{S}_{i,k}}, \qquad (4)$

where $\tilde{S}_{i,j} = S_{i,j}$ if $E_s$ contains an edge connecting nodes $v_i$ and $v_j$, and $\tilde{S}_{i,j} = 0$ otherwise. A toy example of the similarity graph is shown in Fig. 1(d,e,f), where we can observe that the matrix $S$ is symmetric, but the weight matrix $S^w$ is asymmetric due to the row normalization.
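A minimal sketch of the similarity computation, assuming the textbook quadratic-time DTW recursion and mapping distance to similarity via the exponential of Eq. 3 (the toy flow series are illustrative):

```python
import numpy as np

def dtw_distance(a, b):
    """Textbook dynamic-time-warping distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible alignments
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def similarity_scores(flows):
    """flows: list of per-station historical flow series v_i."""
    n = len(flows)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                S[i, j] = np.exp(-dtw_distance(flows[i], flows[j]))
    return S

flows = [[10, 20, 30], [10, 20, 30], [30, 20, 10]]
S = similarity_scores(flows)
# identical series get the maximum score exp(0) = 1
```

Note that $S$ comes out symmetric here; the asymmetry only appears after the row normalization of Eq. 4.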
III-A3 Correlation Graph
In this work, we utilize the origin-destination distribution of ridership to build the virtual graph $\mathcal{G}_c$. First, we construct a correlation ratio matrix $C \in \mathbb{R}^{N \times N}$. Specifically, $C_{i,j}$ is computed by:

$C_{i,j} = \frac{D_{j \rightarrow i}}{\sum_{k=1}^{N} D_{k \rightarrow i}}, \qquad (5)$

where $D_{j \rightarrow i}$ is the total number of passengers that traveled from station $j$ to station $i$ in the whole training set. We use a selection strategy similar to the one described in Section III-A2 to select some station pairs for edge construction. Finally, the edge weights $C^w_{i,j}$ are calculated by:

$C^w_{i,j} = \frac{\tilde{C}_{i,j}}{\sum_{k=1}^{N} \tilde{C}_{i,k}}, \qquad (6)$

where $\tilde{C}_{i,j} = C_{i,j}$ if $E_c$ contains the corresponding edge, and $\tilde{C}_{i,j} = 0$ otherwise. An example of the correlation graph is shown in Fig. 1(d,e,f), where we can see that $\mathcal{G}_c$ is a directed graph, since $C_{i,j} \neq C_{j,i}$ in general.
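A sketch of the correlation-ratio computation of Eq. 5, assuming origin-destination counts are given as a matrix (the toy counts and the `od` layout are our own illustration):

```python
import numpy as np

def correlation_ratios(od):
    """od[j, i]: passengers who traveled from station j to station i over
    the training period. Returns C with C[i, j] = fraction of station i's
    incoming ridership that originated at station j (Eq. 5)."""
    od = np.asarray(od, dtype=float)
    totals = od.sum(axis=0, keepdims=True)  # total arrivals at each station
    totals[totals == 0] = 1.0               # guard stations with no arrivals
    return (od / totals).T

# Toy OD counts for 3 stations
od = np.array([[0, 50, 10],
               [30, 0, 30],
               [70, 50, 0]])
C = correlation_ratios(od)
# 70% of station 0's arrivals come from station 2, so C[0, 2] = 0.7
```

Since the OD distribution is not symmetric, the resulting graph is directed, matching the observation above.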
III-B Graph Convolution Gated Recurrent Unit
As an alternative to LSTM [11], the Gated Recurrent Unit has been widely used for temporal modeling, and it is usually implemented with standard convolutions or full connections. In this section, we incorporate the proposed physical-virtual graphs to develop a unified Graph Convolution Gated Recurrent Unit (GCGRU) for spatial-temporal feature learning.
We first formulate the convolution on the proposed physical-virtual graphs. Let us assume that the input of the graph convolution is $X \in \mathbb{R}^{N \times D}$, where $X$ can be the ridership data or its feature. The parameters of this graph convolution are denoted as $\Theta = \{W_{self}, W_p, W_s, W_c\}$. By the definition of convolution, the output feature $Y_i$ of node $v_i$ is computed by:

$Y_i = W_{self} \odot X_i + \sum_{j \in N_p(i)} P^w_{i,j} \left(W_p \odot X_j\right) + \sum_{j \in N_s(i)} S^w_{i,j} \left(W_s \odot X_j\right) + \sum_{j \in N_c(i)} C^w_{i,j} \left(W_c \odot X_j\right), \qquad (7)$

where $\odot$ is the Hadamard product and $Y_i \in \mathbb{R}^{D}$. Specifically, $W_{self}$ denotes the learnable parameters for the self-loop. $W_p$ denotes the parameters of the physical graph, and $N_p(i)$ represents the neighbor set of node $v_i$ in $\mathcal{G}_p$. The other notations $W_s$, $N_s(i)$, $W_c$ and $N_c(i)$ have similar semantic meanings. $D$ is the dimensionality of the feature $X$. In this manner, a node can dynamically receive information from highly-correlated neighbor nodes. For convenience, the graph convolution in Eq. 7 is abbreviated as $Y = \Theta \star X$ in the following.
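For concreteness, a dense-matrix sketch of Eq. 7, assuming per-channel Hadamard filters (the variable names and the toy two-station setup are ours):

```python
import numpy as np

def pv_graph_conv(X, theta, graphs):
    """Physical-virtual graph convolution of Eq. 7, written with dense
    matrices. X: (N, D) node features; theta: dict of (D,) Hadamard
    filters keyed by 'self' plus one key per graph; graphs: dict of
    (N, N) row-normalized edge-weight matrices (P^w, S^w, C^w)."""
    Y = theta['self'] * X  # self-loop term
    for key, W in graphs.items():
        # weighted aggregation of filtered neighbor features
        Y = Y + W @ (theta[key] * X)
    return Y

# Toy check: 2 stations, 1 feature channel, a single "physical" graph
X = np.array([[1.0], [2.0]])
theta = {'self': np.array([1.0]), 'p': np.array([1.0])}
Pw = np.array([[0.0, 1.0], [1.0, 0.0]])
Y = pv_graph_conv(X, theta, {'p': Pw})
# each station adds its only neighbor's feature: Y = [[3.], [3.]]
```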
Since the above-mentioned operation is conducted in the spatial view, we embed the physical-virtual graph convolution into a Gated Recurrent Unit to learn spatial-temporal features. Specifically, the reset gate $R_t$, update gate $U_t$, new information $C_t$ and hidden state $H_t$ are computed by:

$R_t = \sigma\left(\Theta_{xr} \star X_t + \Theta_{hr} \star H_{t-1} + b_r\right),$
$U_t = \sigma\left(\Theta_{xu} \star X_t + \Theta_{hu} \star H_{t-1} + b_u\right),$
$C_t = \tanh\left(\Theta_{xc} \star X_t + \Theta_{hc} \star (R_t \odot H_{t-1}) + b_c\right),$
$H_t = U_t \odot H_{t-1} + (1 - U_t) \odot C_t, \qquad (8)$

where $\sigma$ is the sigmoid function and $H_{t-1}$ is the hidden state at the last iteration. $\Theta_{xr}$ denotes the graph convolution parameters between $X_t$ and $R_t$, while $\Theta_{hr}$ denotes the parameters between $H_{t-1}$ and $R_t$. The other parameters $\Theta_{xu}$, $\Theta_{hu}$, $\Theta_{xc}$ and $\Theta_{hc}$ have similar meanings. $b_r$, $b_u$ and $b_c$ are biases. The feature dimensionalities of $R_t$, $U_t$, $C_t$ and $H_t$ are also set to $D$. For convenience, we denote the operation of Eq. 8 as:

$H_t = \mathrm{GCGRU}(X_t, H_{t-1}). \qquad (9)$
Thanks to this GCGRU, we can effectively learn spatial-temporal features from the ridership data of metro systems.
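A numpy sketch of one GCGRU step (Eq. 8), where `gconv` stands in for the physical-virtual graph convolution; the parameter names and the sanity check are our own illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gcgru_step(X, H, gconv, params):
    """One GCGRU step (Eq. 8): the matrix products of a standard GRU are
    replaced by graph convolutions. gconv(inp, theta) applies the
    physical-virtual graph convolution with parameters theta."""
    R = sigmoid(gconv(X, params['xr']) + gconv(H, params['hr']) + params['br'])
    U = sigmoid(gconv(X, params['xu']) + gconv(H, params['hu']) + params['bu'])
    C = np.tanh(gconv(X, params['xc']) + gconv(R * H, params['hc']) + params['bc'])
    return U * H + (1.0 - U) * C  # next hidden state

# Sanity check with a trivial scaling "convolution" and zero parameters:
gconv = lambda inp, theta: theta * inp
zeros = {k: 0.0 for k in ['xr', 'hr', 'br', 'xu', 'hu', 'bu', 'xc', 'hc', 'bc']}
H_next = gcgru_step(np.array([[1.0]]), np.array([[2.0]]), gconv, zeros)
# gates collapse to 0.5 and C to 0, so H_next = 0.5 * H = [[1.]]
```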
III-C Local-Global Feature Fusion
In previous works [15, 5], global features have also been proven useful for traffic state prediction. However, the proposed GCGRU conducts convolutions over local neighborhoods and fails to capture the global context. To address this issue, we apply a Fully-Connected Gated Recurrent Unit (FCGRU) to learn the global evolution features of all stations and generate a comprehensive feature by fusing the outputs of GCGRU and FCGRU. The developed fusion module is termed the Collaborative Gated Recurrent Module (CGRM) in this work, and its architecture is shown in Fig. 2.
Specifically, the inputs of CGRM are $X_t$ and $H_{t-1}$, where $H_{t-1}$ is the fused output of the last iteration. For GCGRU, rather than taking its own previous hidden state as input, it utilizes the accumulated information in $H_{t-1}$ to update its hidden state; thus Eq. 9 becomes:

$\hat{H}_t = \mathrm{GCGRU}(X_t, H_{t-1}). \qquad (10)$

For FCGRU, we first transform $X_t$ and $H_{t-1}$ into embedded features $\tilde{X}_t$ and $\tilde{H}_{t-1}$ with two fully-connected (FC) layers. Then we feed $\tilde{X}_t$ and $\tilde{H}_{t-1}$ into a common GRU [12] implemented with full connections to generate a global hidden state $\bar{H}_t$, which can be expressed as:

$\bar{H}_t = \mathrm{FCGRU}(\tilde{X}_t, \tilde{H}_{t-1}). \qquad (11)$

Finally, we incorporate $\hat{H}_t$ and $\bar{H}_t$ to generate a comprehensive hidden state $H_t$ with a fully-connected layer:

$H_t = \mathrm{FC}\left(\left[\hat{H}_t, \bar{H}_t\right]\right), \qquad (12)$

where $[\cdot, \cdot]$ denotes feature concatenation. $H_t$ contains the local and global context of ridership, and we verify its effectiveness in Section IV-C2.
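The fusion step of Eq. 12 can be sketched as follows, assuming the global FCGRU state is shared across all stations before concatenation (shapes and names are illustrative):

```python
import numpy as np

def fuse_states(H_local, h_global, W, b):
    """Fuse the GCGRU output H_local (N, D) with the FCGRU global state
    h_global (D,) via concatenation + one FC layer (Eq. 12).
    W: (2D, D) weight, b: (D,) bias."""
    N = H_local.shape[0]
    g = np.tile(h_global, (N, 1))          # share global context with every station
    fused = np.concatenate([H_local, g], axis=1)
    return fused @ W + b

H_local = np.array([[1.0], [2.0]])         # N=2 stations, D=1
h_global = np.array([3.0])
W = np.array([[1.0], [1.0]])               # this toy weight sums local and global parts
H = fuse_states(H_local, h_global, W, np.zeros(1))
# H = [[4.], [5.]]
```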
III-D Physical-Virtual Collaboration Graph Network
In this section, we apply the above-mentioned CGRMs to construct a Physical-Virtual Collaboration Graph Network (PVCGN) for station-level metro ridership prediction. Note that our PVCGN is developed based on the Seq2Seq model [33], and its architecture is shown in Fig. 3.
Specifically, PVCGN consists of an encoder and a decoder, both of which contain two CGRMs. In the encoder, the ridership data $X_{t-n+1}, \dots, X_t$ are sequentially fed into the CGRMs to accumulate the historical information. At each iteration, the bottom CGRM takes the current ridership as input, and its output hidden state is fed into the upper CGRM for high-level feature learning. In particular, the initial hidden states of both CGRMs at the first iteration are set to zero. In the decoder, the input data at the first iteration is also set to zero, and the final hidden states of the encoder are used to initialize the hidden states of the decoder. The future ridership $\hat{X}_{t+1}$ is predicted by feeding the output hidden state of the upper CGRM into a fully-connected layer. At iteration $k$, the bottom CGRM takes $\hat{X}_{t+k-1}$ as input, and the upper CGRM again applies a fully-connected layer to forecast $\hat{X}_{t+k}$. Finally, we obtain a future ridership sequence $\hat{X}_{t+1}, \dots, \hat{X}_{t+m}$.
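The encoder-decoder control flow described above can be summarized schematically; the step functions below are placeholders for the two stacked CGRMs plus the output layer, not the full model:

```python
def seq2seq_forecast(inputs, encoder_step, decoder_step, m):
    """Schematic PVCGN inference loop. encoder_step accumulates hidden
    state over the n input intervals; decoder_step maps (previous
    prediction, state) -> (next prediction, state) for m future intervals,
    re-feeding each prediction as the next decoder input."""
    state = None
    for x in inputs:              # encoder: absorb the historical sequence
        state = encoder_step(x, state)
    outputs, y = [], 0.0          # decoder starts from a zero input
    for _ in range(m):
        y, state = decoder_step(y, state)
        outputs.append(y)
    return outputs

# Toy check with trivial step functions
enc = lambda x, s: (s or 0.0) + x
dec = lambda y, s: (s + y, s)
preds = seq2seq_forecast([1.0, 2.0, 3.0], enc, dec, m=2)
```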
IV Experiments
In this section, we first introduce the settings of experiments (e.g., dataset construction, implementation details, and evaluation metrics). Then, we compare the proposed PVCGN with eight representative approaches under various scenarios. Finally, we conduct extensive internal analyses to verify the effectiveness of each component in our method.
IV-A Experiment Settings
IV-A1 Dataset Construction
Since there are few public benchmarks for metro ridership prediction, we collect a large number of trip transaction records from two real-world metro systems and construct two large-scale datasets, termed HZMetro and SHMetro respectively. Overviews of these two datasets are summarized in Table I.
SHMetro: This dataset was built on the metro system of Shanghai, China. A total of 811.8 million transaction records were collected from Jul. 1st 2016 to Sept. 30th 2016, with 8.82 million rides per day. Each record contains the passenger ID, the entry/exit stations and the corresponding timestamps. In this time period, 288 metro stations were operated normally and were connected by 958 physical edges. For each station, we measured its inflow and outflow for every 15 minutes by counting the number of passengers entering or exiting the station. The ridership data of the first two months and that of the last three weeks are used for training and testing respectively, while the ridership data of the remaining days are used for validation.
HZMetro: This dataset was created with the transaction records of the Hangzhou metro system collected in January 2019. With 80 operational stations and 248 physical edges, this system handles 2.35 million rides per day. The time interval of this dataset is also set to 15 minutes. Similar to SHMetro, this dataset is divided into three parts: a training set (Jan. 1st - Jan. 18th), a validation set (Jan. 19th - Jan. 20th) and a testing set (Jan. 21st - Jan. 25th).
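The 15-minute inflow/outflow aggregation can be sketched as follows; the record schema here is a simplification of the actual transaction logs (times in minutes since midnight):

```python
from collections import Counter

def count_flows(records, interval_min=15):
    """Aggregate transaction records into per-(station, interval) counts.
    Each record: (passenger_id, entry_station, entry_min, exit_station,
    exit_min), an assumed simplified schema."""
    inflow, outflow = Counter(), Counter()
    for _, s_in, t_in, s_out, t_out in records:
        inflow[(s_in, t_in // interval_min)] += 1     # passenger enters s_in
        outflow[(s_out, t_out // interval_min)] += 1  # passenger exits s_out
    return inflow, outflow

records = [(1, 'A', 5, 'B', 20), (2, 'A', 10, 'C', 31)]
inflow, outflow = count_flows(records)
# inflow[('A', 0)] == 2: both passengers entered A in the first interval
```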
Dataset  SHMetro  HZMetro 

City  Shanghai, China  Hangzhou, China 
# Station  288  80 
# Physical Edge  958  248 
Ridership/Day  8.82 M  2.35 M 
Time Interval  15 min  15 min 
Training Timespan  7/01/2016 - 8/31/2016  1/01/2019 - 1/18/2019 
Validation Timespan  9/01/2016 - 9/09/2016  1/19/2019 - 1/20/2019 
Testing Timespan  9/10/2016 - 9/30/2016  1/21/2019 - 1/25/2019 
Time  Metric  HA  RF  GBDT  MLP  LSTM  GRU  DCRNN  GCRNN  PVCGN (Ours) 

15 min  RMSE  136.97  66.63  62.59  48.71  55.53  52.04  46.02  46.09  44.97 
MAE  48.26  34.37  32.72  25.16  26.68  25.91  24.04  24.26  23.29  
MAPE  31.55%  24.09%  23.40%  19.44%  18.76%  18.87%  17.82%  18.06%  16.83%  
30 min  RMSE  136.81  88.03  82.32  51.80  57.37  54.02  49.90  50.12  47.83 
MAE  47.88  41.37  39.50  26.15  27.25  26.39  25.23  25.42  24.16  
MAPE  31.49%  28.89%  28.17%  20.38%  19.04%  19.20%  18.35%  18.73%  17.23%  
45 min  RMSE  136.45  118.65  113.95  57.06  60.45  56.97  54.92  54.87  52.02 
MAE  47.26  50.91  49.14  27.91  28.08  27.17  26.76  26.92  25.33  
MAPE  31.27%  41.34%  40.76%  22.20%  19.61%  19.84%  19.30%  19.81%  17.92%  
60 min  RMSE  135.72  143.5  137.5  63.33  63.41  59.91  58.83  58.67  55.27 
MAE  46.40  59.15  57.31  29.92  28.94  28.08  28.01  28.18  26.29  
MAPE  30.80%  52.91%  52.60%  23.96%  20.59%  21.03%  20.44%  21.07%  18.69% 
Time  Metric  HA  RF  GBDT  MLP  LSTM  GRU  DCRNN  GCRNN  PVCGN (Ours) 

15 min  RMSE  64.19  53.52  51.50  46.55  45.30  45.10  40.39  40.24  37.76 
MAE  36.37  32.19  30.88  26.57  25.76  25.69  23.76  23.84  22.68  
MAPE  19.14%  18.34%  17.60%  16.26%  14.91%  15.13%  14.00%  14.08%  13.70%  
30 min  RMSE  64.10  64.54  61.94  47.96  45.52  45.26  42.57  41.95  39.34 
MAE  36.37  38.00  36.48  27.44  26.01  25.93  25.22  25.14  23.33  
MAPE  19.31%  21.46%  20.49%  17.10%  15.10%  15.35%  14.99%  14.86%  13.81%  
45 min  RMSE  63.92  80.06  76.70  50.66  46.30  46.13  46.26  45.53  40.95 
MAE  36.23  45.78  44.12  28.79  26.38  26.36  26.97  26.82  24.22  
MAPE  19.57%  26.51%  25.75%  19.01%  15.40%  15.79%  16.19%  16.05%  14.45%  
60 min  RMSE  63.72  94.29  91.21  54.62  47.53  47.69  49.35  50.28  42.61 
MAE  35.99  52.95  51.10  30.52  26.76  26.98  28.47  28.75  24.93  
MAPE  20.01%  37.12%  38.10%  22.56%  16.34%  17.20%  18.16%  17.89%  15.49% 
IV-A2 Implementation Details
Since the physical graph has a well-defined topology, we only introduce the details of the two virtual graphs in this section. On the SHMetro dataset, to reduce the computational cost of GCN, for each station we only select the top ten stations with the highest similarity scores or correlation ratios to construct virtual graphs, so both the similarity graph and the correlation graph have 2880 edges. On the HZMetro dataset, since its station number is much smaller than that of SHMetro and the computational cost is not heavy, we can build more virtual edges to learn the complex patterns. Specifically, we determine the virtual edges by setting the similarity/correlation thresholds to 0.1/0.02, and the resulting similarity graph and correlation graph have 2502 and 1094 edges respectively.
We implement our PVCGN with the popular deep learning framework PyTorch [52]. The lengths of the input and output sequences are both set to 4. The input data and the ground-truth of the output are normalized with Z-score normalization (https://en.wikipedia.org/wiki/Standard_score) before being fed into the network. The filter weights of all layers are initialized with Xavier initialization [53]. The batch size is set to 8 for SHMetro and 32 for HZMetro. The feature dimensionality $D$ is set to 256. The initial learning rate is 0.001 and its decay ratio is 0.1. We apply Adam [54] to optimize our PVCGN for 200 epochs by minimizing the mean absolute error between the predicted results and the corresponding ground truths. On each benchmark, we select the final model with the minimum error on the validation set and apply the selected model to conduct prediction on the testing set.
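The Z-score normalization and its inversion are straightforward; the statistics below are placeholder values, whereas in practice they are computed on the training set:

```python
import numpy as np

mean, std = 120.0, 45.0  # placeholder training-set statistics

def z_normalize(x):
    """Map raw ridership to zero-mean, unit-variance scale."""
    return (x - mean) / std

def z_denormalize(z):
    """Invert the normalization to recover the original ridership scale."""
    return z * std + mean

x = np.array([75.0, 120.0, 165.0])
z = z_normalize(x)       # -> [-1., 0., 1.]
```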
IV-A3 Evaluation Metrics
Following previous works [26, 30], we evaluate the performance of all methods with the Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE), which are defined as:

$\mathrm{RMSE} = \sqrt{\frac{1}{z}\sum_{i=1}^{z}\left(\hat{x}_i - x_i\right)^2}, \quad \mathrm{MAE} = \frac{1}{z}\sum_{i=1}^{z}\left|\hat{x}_i - x_i\right|, \quad \mathrm{MAPE} = \frac{1}{z}\sum_{i=1}^{z}\frac{\left|\hat{x}_i - x_i\right|}{x_i}, \qquad (13)$

where $z$ is the number of testing samples. $\hat{x}_i$ and $x_i$ denote the predicted ridership and the ground-truth ridership, respectively. Notice that $\hat{x}_i$ and $x_i$ have been transformed back to the original scale with an inverted Z-score normalization. As mentioned in Section IV-A2, our PVCGN is developed to predict the metro ridership of the next four time intervals. In the following experiments, we measure the errors of each time interval separately.
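These metrics translate directly into code; the `eps` guard in MAPE is our addition, since the paper does not specify how zero ground truths are handled:

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred, eps=1.0):
    # eps guards near-empty stations against division by zero (our assumption)
    return float(np.mean(np.abs(y_true - y_pred) / np.maximum(y_true, eps)) * 100)

y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 190.0])
# rmse = 10.0, mae = 10.0, mape = 7.5 (%)
```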
Time  Metric  HA  RF  GBDT  MLP  LSTM  GRU  DCRNN  GCRNN  PVCGN (Ours) 

15 min  RMSE  255.63  108.09  100.49  64.95  75.78  69.92  67.50  66.21  65.04 
MAE  96.23  56.20  53.36  36.42  39.49  37.27  37.92  37.94  36.46  
MAPE  46.74%  20.06%  19.20%  14.47%  13.87%  13.58%  13.93%  14.07%  13.16%  
30 min  RMSE  270.74  161.33  149.34  69.97  77.24  72.19  73.07  73.63  68.85 
MAE  99.18  75.06  72.13  38.24  39.97  37.73  40.16  40.26  37.77  
MAPE  47.10%  23.72%  22.85%  14.65%  14.02%  13.61%  14.33%  14.44%  13.41%  
45 min  RMSE  265.61  231.29  222.53  74.30  79.05  72.70  79.42  79.88  73.85 
MAE  95.56  98.57  97.18  39.58  39.77  37.33  41.92  41.65  38.84  
MAPE  45.39%  29.61%  28.97%  15.42%  14.45%  14.04%  15.29%  15.29%  14.06%  
60 min  RMSE  248.93  284.02  271.83  75.72  77.13  69.71  79.98  81.37  74.41 
MAE  87.10  115.13  113.73  39.53  38.23  35.68  41.23  41.27  38.12  
MAPE  42.48%  36.00%  35.54%  16.59%  15.27%  14.81%  16.58%  16.62%  15.08% 
Time  Metric  HA  RF  GBDT  MLP  LSTM  GRU  DCRNN  GCRNN  PVCGN (Ours) 

15 min  RMSE  65.53  84.33  82.25  57.39  57.10  56.31  54.17  55.51  49.79 
MAE  40.63  52.07  51.60  35.77  35.27  35.23  35.08  35.68  32.63  
MAPE  11.51%  15.24%  15.02%  10.96%  9.99%  10.12%  10.37%  10.36%  9.72%  
30 min  RMSE  67.89  108.25  103.38  62.25  59.03  58.81  58.27  57.34  51.63 
MAE  42.08  65.97  63.94  37.58  36.45  36.59  37.48  37.31  33.30  
MAPE  11.58%  17.56%  16.95%  10.80%  10.07%  10.10%  10.69%  10.54%  9.52%  
45 min  RMSE  67.33  123.34  126.36  61.85  58.48  58.13  61.83  59.54  51.45 
MAE  41.63  74.91  75.46  37.09  35.72  35.59  37.95  37.58  32.73  
MAPE  12.22%  19.58%  19.36%  11.30%  10.55%  10.36%  11.16%  11.16%  9.88%  
60 min  RMSE  67.22  136.08  132.87  61.81  57.35  57.14  59.52  58.88  51.09 
MAE  40.72  75.40  74.39  36.13  34.19  34.01  36.27  35.94  31.43  
MAPE  13.21%  20.81%  20.54%  12.16%  11.23%  11.08%  11.94%  11.93%  10.43% 
IV-B Comparison with State-of-the-Art Methods
In this section, we compare our PVCGN with eight basic and advanced methods under various scenarios (e.g., comparison on the whole testing sets, comparison on rush hours and comparison on high-ridership stations). These methods can be classified into three categories: (i) three traditional time series models, (ii) three general deep learning models and (iii) two recently-proposed graph networks. The details of these methods are described as follows:
Historical Average (HA): HA is a seasonal baseline that forecasts the future ridership by averaging the riderships of the corresponding historical periods. For instance, the ridership at interval 9:00-9:15 on a specific Monday is predicted as the average ridership of the same time interval of the previous $k$ Mondays. The variate $k$ is set to 4 on SHMetro and 2 on HZMetro.

Random Forest (RF): RF is a machine learning technique for both regression and classification problems that operates by constructing a multitude of decision trees. RF has been widely used to predict traffic states [55].
Gradient Boosting Decision Trees (GBDT): GBDT is a weighted ensemble method that consists of a series of weak prediction models. A gradient descent optimizer is applied to minimize the loss function.

Multi-Layer Perceptron (MLP): This model consists of two fully-connected layers with 256 and $N$ neurons respectively, where $N$ is the number of stations. It takes as input the riderships of all stations at the previous $n$ time intervals and predicts the riderships of all stations at the next $m$ time intervals simultaneously.
Long Short-Term Memory (LSTM): This network is a simple Seq2Seq model whose core module consists of two fully-connected LSTM layers. The hidden size of each LSTM layer is set to 256.

Gated Recurrent Unit (GRU): With an architecture similar to that of the previous model, this network replaces the original LSTM layers with GRU layers. The hidden size of each GRU layer is also set to 256.
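The LSTM/GRU baselines share the seq2seq pattern sketched below (GRU shown): encode the input window, then decode future steps one interval at a time. The feature size and horizon are assumed values for illustration, not the exact baseline code:

```python
import torch
import torch.nn as nn

class Seq2SeqGRU(nn.Module):
    """Sketch of the GRU baseline: a two-layer GRU encoder summarizes the
    past; a two-layer GRU decoder autoregressively emits future intervals."""
    def __init__(self, n_features, hidden=256, horizon=4):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, num_layers=2, batch_first=True)
        self.decoder = nn.GRU(n_features, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_features)
        self.horizon = horizon

    def forward(self, x):
        _, h = self.encoder(x)                  # hidden state summarizing the past
        step = x[:, -1:, :]                     # start from the last observation
        outputs = []
        for _ in range(self.horizon):
            out, h = self.decoder(step, h)
            step = self.proj(out)               # predicted ridership, fed back in
            outputs.append(step)
        return torch.cat(outputs, dim=1)

model = Seq2SeqGRU(n_features=160)              # e.g. 80 stations x (inflow, outflow)
pred = model(torch.randn(8, 4, 160))            # 4 past intervals -> 4 future intervals
```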

Diffusion Convolutional Recurrent Neural Network (DCRNN [56]): As a deep learning framework specially designed for traffic forecasting, DCRNN captures the spatial dependencies using bidirectional random walks on graphs and learns the temporal dependencies with an encoder-decoder architecture.

Graph Convolutional Recurrent Neural Network (GCRNN): The architecture of this model is very similar to that of DCRNN. The main difference is that GCRNN replaces the diffusion convolutional layers with K=3 order ChebNets [21] based on spectral graph convolutions.
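The K-order ChebNet convolution used by GCRNN can be sketched as below. For brevity this uses one scalar weight per Chebyshev order (a real layer uses a weight matrix per order); the toy three-station line graph is an assumed example:

```python
import numpy as np

def cheb_conv(X, L_tilde, thetas):
    """K-order Chebyshev graph convolution: sum_k theta_k * T_k(L_tilde) @ X,
    with T_0 = I, T_1 = L_tilde, T_k = 2 L_tilde T_{k-1} - T_{k-2}."""
    T_prev, T_curr = np.eye(L_tilde.shape[0]), L_tilde
    out = thetas[0] * (T_prev @ X)
    if len(thetas) > 1:
        out += thetas[1] * (T_curr @ X)
    for theta in thetas[2:]:
        T_prev, T_curr = T_curr, 2 * L_tilde @ T_curr - T_prev
        out += theta * (T_curr @ X)
    return out

# Toy 3-station line graph: 0 - 1 - 2.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
L = np.diag(A.sum(1)) - A                                   # combinatorial Laplacian
L_tilde = 2 * L / np.linalg.eigvalsh(L).max() - np.eye(3)   # rescale spectrum to [-1, 1]
X = np.ones((3, 1))                                         # one feature per station
out = cheb_conv(X, L_tilde, thetas=[0.5, 0.3, 0.2])         # K = 3 orders
```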
TABLE VI: Performance on the top 1/4 high-ridership stations of SHMetro.
Time  Metric  HA  RF  GBDT  MLP  LSTM  GRU  DCRNN  GCRNN  PVCGN (Ours) 

15 min  RMSE  242.87  111.31  103.94  80.72  94.74  87.40  84.04  86.09  74.80 
MAE  96.38  60.65  57.47  45.31  49.29  47.09  44.98  45.89  41.38  
MAPE  27.82%  15.24%  14.80%  12.23%  12.39%  12.23%  13.76%  14.12%  10.62%  
30 min  RMSE  242.68  152.10  141.45  86.46  98.02  91.25  88.52  89.89  79.43 
MAE  95.83  75.66  72.31  47.58  50.52  48.27  46.80  47.50  43.05  
MAPE  28.08%  20.25%  20.08%  13.62%  13.19%  13.32%  14.47%  14.82%  11.46%  
45 min  RMSE  242.22  208.91  201.11  96.13  103.88  96.89  97.75  96.81  87.32 
MAE  94.84  95.56  93.02  51.63  52.54  50.19  49.84  50.08  45.67  
MAPE  28.11%  34.72%  35.40%  16.37%  14.12%  14.54%  16.32%  16.70%  12.48%  
60 min  RMSE  241.27  255.64  245.41  107.53  109.64  102.81  106.03  102.93  93.59 
MAE  93.41  112.77  110.44  56.42  54.64  52.50  53.30  52.79  48.02  
MAPE  27.80%  53.09%  54.37%  19.48%  15.70%  16.41%  17.89%  18.16%  13.61% 
TABLE VII: Performance on the top 1/4 high-ridership stations of HZMetro.
Time  Metric  HA  RF  GBDT  MLP  LSTM  GRU  DCRNN  GCRNN  PVCGN (Ours) 

15 min  RMSE  111.26  82.98  79.23  77.07  75.19  74.57  66.18  65.29  60.56 
MAE  70.30  54.73  52.28  45.95  45.28  44.81  41.22  40.93  38.29  
MAPE  16.36%  14.47%  13.80%  11.68%  11.49%  11.45%  10.72%  10.59%  9.97%  
30 min  RMSE  111.01  99.84  95.27  79.30  75.48  74.75  69.36  67.29  63.77 
MAE  70.19  64.59  62.02  47.80  45.86  45.24  43.92  43.07  39.93  
MAPE  16.52%  17.49%  16.80%  12.31%  11.73%  11.80%  11.51%  11.42%  10.34%  
45 min  RMSE  110.64  125.47  119.12  82.86  76.80  76.12  75.16  72.42  65.99 
MAE  69.86  79.09  75.98  50.36  46.62  46.17  46.99  45.93  41.75  
MAPE  16.93%  22.69%  21.97%  14.43%  12.09%  12.42%  12.74%  12.46%  11.28%  
60 min  RMSE  110.34  148.45  145.35  89.47  79.27  79.11  79.79  80.34  69.25 
MAE  69.44  92.23  89.36  53.96  47.48  47.6  49.5  49.4  43.16  
MAPE  17.64%  33.21%  32.26%  20.03%  13.41%  14.25%  15.00%  14.74%  12.54% 
IV-B1 Comparison on the Whole Testing Sets
We first compare the performance of all comparative methods on the whole testing sets (including all time intervals and all metro stations). Their performance on the SHMetro and HZMetro datasets is summarized in Table II and Table III, respectively. We can see that the baseline HA obtains an unacceptable MAPE at all time intervals (about 31% on SHMetro and 20% on HZMetro). Compared with HA, RF and GBDT obtain better results at the first time interval. However, as the prediction time increases, their MAPEs gradually become worse and even larger than that of HA, since these two traditional models have weak abilities to learn the ridership distribution. By automatically learning deep features from data, the general neural networks (e.g., MLP, LSTM and GRU) greatly improve the performance. For example, LSTM obtains a MAPE of 18.76% on SHMetro and 14.91% on HZMetro when predicting the ridership at the first time interval, while GRU obtains a MAPE of 21.03% on SHMetro and 17.20% on HZMetro for the prediction of the fourth time interval. Thanks to advanced graph learning, DCRNN and GCRNN achieve competitive performance by reducing the MAPE to 17.82% on SHMetro and to 14.00% on HZMetro. However, these methods construct graphs directly from physical topologies. To fully capture the complex ridership patterns, our PVCGN constructs physical/virtual graphs with the information of physical topologies and human domain knowledge, thereby achieving state-of-the-art performance. For example, our PVCGN improves the MAPE by at least 1% at different time intervals on the SHMetro dataset. On HZMetro, PVCGN outperforms the existing best-performing models DCRNN and GCRNN by a large margin in all metrics. This comparison well demonstrates the superiority of the proposed PVCGN.
IV-B2 Comparison on Rush Hours
In this section, we focus on ridership prediction during rush hours, since accurate predictions are crucial for metro scheduling during such periods. In this work, the rush hours are defined as 7:30-9:30 and 17:30-19:30. The performance of all methods is summarized in Table IV and Table V. We can observe that our PVCGN consistently outperforms all comparative methods on both datasets. On SHMetro, our PVCGN obtains a MAPE of 13.16% for the ridership prediction at the first time interval, while the MAPEs of DCRNN and GCRNN are 13.93% and 14.07%, respectively. Other deep learning methods (such as MLP, LSTM and GRU) are relatively worse. For forecasting the ridership at the fourth time interval, our PVCGN achieves a MAPE of 15.08%, outperforming DCRNN and GCRNN by a relative improvement of at least 9.04%.
A similar situation occurs on HZMetro. For example, with a MAPE of 9.72% for the ridership at the first time interval, our PVCGN is clearly better than DCRNN and GCRNN, whose MAPEs are 10.37% and 10.36%, respectively. When predicting the ridership at the fourth time interval, our PVCGN achieves a very impressive MAPE of 10.43%, while DCRNN and GCRNN suffer from serious performance degradation, their MAPEs rapidly increasing to 11.94% and 11.93%, respectively. In summary, the extensive experiments on the SHMetro and HZMetro datasets show the effectiveness and robustness of our method during rush hours.
IV-B3 Comparison on High-Ridership Stations
Besides the prediction during rush hours, we also pay attention to the prediction at stations with high ridership, since the demands of such stations should be prioritized in real-world metro systems. In this section, we first rank all metro stations by their historical ridership on the training set and conduct comparisons on the top 1/4 high-ridership stations. The performance on SHMetro is summarized in Table VI, and we can observe that our PVCGN ranks first among all comparative methods. When forecasting the ridership of the next 15 minutes, PVCGN achieves an RMSE of 74.80 and a MAPE of 10.62%. By contrast, the best RMSE and MAPE of the other methods are 80.72 and 12.23%. As the prediction time increases to 60 minutes, our PVCGN still obtains the best results (e.g., 13.61% in MAPE), while the MAPE of GCRNN significantly increases to 18.16%.
As shown in Table VII, our PVCGN also achieves impressive performance on the HZMetro dataset. For the ridership prediction at the first time interval, the RMSE and MAPE of our PVCGN are 60.56 and 9.97%, while those of the existing best-performing method GCRNN are 65.29 and 10.59%. When forecasting the ridership at the fourth time interval, our PVCGN suffers only minor performance degradation: its RMSE and MAPE increase to 69.25 and 12.54%, respectively. In the same situation, the RMSE and MAPE of GCRNN increase to 80.34 and 14.74%. Therefore, we can conclude that our PVCGN is not only effective but also robust for prediction on high-ridership stations.
TABLE VIII: Performance of PVCGN variants with different graph combinations.
Time  Metric  SHMetro  HZMetro  

P  P+S  P+C  S+C  P+S+C  P  P+S  P+C  S+C  P+S+C  
15 min  RMSE  50.45  47.38  46.18  46.52  44.97  41.80  38.89  39.46  39.92  37.73 
MAE  25.89  24.16  23.88  23.74  23.29  24.81  23.23  23.34  23.84  22.69  
MAPE  19.04%  17.13%  17.12%  16.94%  16.83%  14.84%  13.93%  14.08%  14.38%  13.72%  
30 min  RMSE  58.09  50.86  50.29  50.18  47.83  45.31  40.63  41.26  41.59  39.38 
MAE  28.13  25.28  25.13  24.74  24.16  26.63  24.22  24.22  24.59  23.35  
MAPE  20.19%  17.72%  17.73%  17.32%  17.23%  15.50%  14.49%  14.36%  14.60%  13.83%  
45 min  RMSE  65.81  55.98  55.54  54.45  52.02  50.26  42.63  43.96  44.81  40.88 
MAE  30.51  26.90  26.68  26.01  25.33  29.02  25.31  25.42  25.91  24.23  
MAPE  21.65%  18.66%  18.44%  18.03%  17.92%  16.76%  15.35%  15.26%  15.23%  14.48%  
60 min  RMSE  73.06  60.08  60.59  58.93  55.27  56.32  44.46  44.93  45.49  42.51 
MAE  32.55  27.92  27.94  27.14  26.29  31.41  26.16  26.13  26.54  24.90  
MAPE  23.43%  19.56%  19.30%  18.87%  18.69%  18.33%  16.31%  16.32%  16.69%  15.48% 
IV-C Component Analysis
IV-C1 Effectiveness of Different Graphs
The distinctive characteristic of our work is that we incorporate a physical graph and two virtual graphs into Gated Recurrent Units (GRU) to collaboratively capture the complex flow patterns. To verify the effectiveness of each graph, we implement five variants of PVCGN, which are described as follows:

P-Net: This variant utilizes only the physical graph to implement the ridership prediction network;

P+S-Net: This variant is developed with the physical graph and the virtual similarity graph;

P+C-Net: Similar to P+S-Net, this variant is built with the physical graph and the virtual correlation graph;

S+C-Net: Unlike the above variants that contain the physical graph, this variant is constructed only with the virtual similarity and correlation graphs;

P+S+C-Net: This network is the full model of the proposed PVCGN. It contains the physical graph and the two virtual graphs simultaneously.
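To make the variants concrete, the toy sketch below combines several graphs in one convolution by applying one normalized-adjacency convolution per graph (physical, similarity, or correlation) and summing the results. The sum fusion, row normalization, and all shapes are simplifying assumptions for illustration, not the exact PVCGN operator:

```python
import numpy as np

def multi_graph_conv(X, adjs, weights):
    """Apply one graph convolution per adjacency matrix and sum the results;
    dropping entries of `adjs` mimics the P / P+S / P+S+C style variants."""
    out = np.zeros((X.shape[0], weights[0].shape[1]))
    for A, W in zip(adjs, weights):
        A_hat = A + np.eye(A.shape[0])           # add self-loops
        D_inv = np.diag(1.0 / A_hat.sum(1))      # row-normalize the adjacency
        out += D_inv @ A_hat @ X @ W
    return out

N, F_in, F_out = 4, 2, 3                          # stations, input/output features
X = np.random.default_rng(0).normal(size=(N, F_in))
physical = np.array([[0, 1, 0, 0],
                     [1, 0, 1, 0],
                     [0, 1, 0, 1],
                     [0, 0, 1, 0]], float)        # toy metro line topology
similarity = np.ones((N, N)) - np.eye(N)          # toy fully-connected virtual graph
Ws = [np.random.default_rng(i).normal(size=(F_in, F_out)) for i in range(2)]
fused = multi_graph_conv(X, [physical, similarity], Ws)   # a "P+S"-style fusion
```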
The performance of all variants is summarized in Table VIII. When predicting the ridership at the next time interval (15 minutes), the baseline P-Net obtains a MAPE of 19.04% on SHMetro and 14.84% on HZMetro, ranking last among all variants. By aggregating the physical graph with either of the proposed virtual graphs, the variants P+S-Net and P+C-Net achieve obvious improvements on all evaluation metrics. For instance, P+S-Net decreases the RMSE from 50.45 to 47.38 on SHMetro and from 41.80 to 38.89 on HZMetro, while P+C-Net reduces the RMSE to 46.18 and 39.46, respectively. Moreover, we observe that the variant S+C-Net also achieves very competitive performance, even though it does not contain the physical graph. On the SHMetro dataset, S+C-Net obtains an RMSE of 46.52, outperforming P-Net with a relative improvement of 7.8%. On the HZMetro dataset, S+C-Net achieves a similar improvement by decreasing the RMSE to 39.92. These phenomena indicate that the proposed virtual graphs are reasonable. Finally, the variant P+S+C-Net obtains the best performance by incorporating the physical graph and both virtual graphs into the network. Specifically, P+S+C-Net achieves the lowest RMSE (44.97 on SHMetro, 37.73 on HZMetro) and the lowest MAPE (16.83% on SHMetro, 13.72% on HZMetro). This significant improvement is mainly attributed to the enhanced spatial-temporal representation learned by the collaborative physical/virtual graph networks. These comparisons demonstrate the effectiveness of the tailor-designed graphs for single-interval prediction.
Moreover, we find that these collaborative graphs are also effective for ridership prediction over consecutive time intervals. As shown in the bottom nine rows of Table VIII, all variants suffer from performance degradation to some extent as the number of time intervals increases from 2 to 4. For instance, the RMSE rapidly increases to 73.06 on SHMetro and 56.32 on HZMetro when the baseline P-Net is applied to forecast the ridership at the fourth future time interval (60 minutes). By contrast, P+S-Net and P+C-Net achieve much lower RMSEs (about 60 on SHMetro and 44 on HZMetro), since the proposed virtual graphs help these variants learn the complex flow patterns. Incorporating all physical/virtual graphs, P+S+C-Net further improves the performance, with an RMSE of 55.27 on SHMetro and 42.51 on HZMetro, which shows that these graphs are complementary.
TABLE IX: Influence of the local and global features.
Time  Metric  SHMetro  HZMetro  

Local  Local + Global  Local  Local + Global  
15 min  RMSE  45.64  44.97  38.46  37.76 
MAE  23.51  23.29  23.00  22.68  
MAPE  17.23%  16.83%  13.86%  13.70%  
30 min  RMSE  48.79  47.83  39.65  39.34 
MAE  24.48  24.16  23.78  23.33  
MAPE  17.59%  17.23%  14.30%  13.81%  
45 min  RMSE  52.70  52.02  41.45  40.95 
MAE  25.58  25.33  24.60  24.22  
MAPE  18.16%  17.92%  14.88%  14.45%  
60 min  RMSE  56.56  55.27  43.11  42.61 
MAE  26.50  26.29  25.36  24.93  
MAPE  18.64%  18.69%  16.06%  15.49% 
IV-C2 Influences of Local and Global Features
As described in Section III-C, a Graph Convolution Gated Recurrent Unit (GC-GRU) is developed for local feature learning, while a Fully-Connected Gated Recurrent Unit (FC-GRU) is applied to learn the global feature. In this section, we train two variants to explore the influence of each type of feature on metro ridership prediction. The first variant contains only the GC-GRU, and the second variant consists of both the GC-GRU and the FC-GRU. The results of these variants are summarized in Table IX. We can observe that the performance of the first variant is very competitive. For example, when predicting the ridership of the next 15 minutes, the first variant obtains an RMSE of 45.64 on SHMetro and 38.46 on HZMetro. For the prediction of the fourth time interval, with an MAE of 26.50 on SHMetro and 25.36 on HZMetro, this variant is only slightly worse than the full PVCGN model. This competitive performance is attributed to the fact that the customized physical/virtual graphs effectively learn semantic local features. By fusing the local/global features of the GC-GRU/FC-GRU, the second variant boosts the performance to a certain degree. For example, when predicting the ridership of the second time interval, the RMSE decreases from 48.79 to 47.83 on SHMetro. Through these experiments, we can conclude that the local feature plays a dominant role while the global feature provides ancillary information for ridership prediction.
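A hedged sketch of fusing the two feature types: it assumes the GC-GRU yields one feature vector per station and the FC-GRU a single global vector that is broadcast to every station before fusion. The dimensions and the concatenation-plus-linear fusion are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

N, d_local, d_global = 80, 64, 32
local_feat = torch.randn(8, N, d_local)      # GC-GRU output: one vector per station
global_feat = torch.randn(8, d_global)       # FC-GRU output: one vector per sample

fuse = nn.Linear(d_local + d_global, d_local)
# Copy the global feature to every station, then fuse with the local feature.
broadcast = global_feat.unsqueeze(1).expand(-1, N, -1)
fused = fuse(torch.cat([local_feat, broadcast], dim=-1))
```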
V Conclusion
In this work, we propose a unified Physical-Virtual Collaboration Graph Network (PVCGN) to address station-level metro ridership prediction. Unlike previous works that either ignored the topological information of a metro system or modeled the physical topology directly, we model the studied metro system as a physical graph and two virtual similarity/correlation graphs to fully capture the ridership evolution patterns. Specifically, the physical graph is built on the basis of the realistic metro topology, while the similarity and correlation graphs are constructed with virtual topologies under the guidance of the historical passenger-flow similarity and correlation among different stations. We incorporate these graphs into a Graph Convolution Gated Recurrent Unit (GC-GRU) to learn spatial-temporal representations and apply a Fully-Connected Gated Recurrent Unit (FC-GRU) to capture the global evolution tendency. Finally, these GRUs are used to develop a seq2seq model for forecasting the ridership of each station. To verify the effectiveness of our method, we construct two real-world benchmarks with massive transaction records of the Shanghai and Hangzhou metros, and extensive experiments on these benchmarks show the superiority of the proposed PVCGN.
In future works, several improvements should be considered. First, some external factors (such as weather and holiday events) may greatly affect the ridership evolution, so we should incorporate these factors to dynamically forecast the ridership. Second, metro ridership evolves periodically; for instance, the ridership at 9:00 on every weekday is usually similar. Therefore, we should also utilize the periodic law of ridership to learn more comprehensive representations. Last but not least, the origin-destination distribution (ODD) of ridership provides rich information, and we can model the ridership ODD at each time interval to facilitate inflow/outflow prediction.
References

[1]
Y. Li, X. Wang, S. Sun, X. Ma, and G. Lu, “Forecasting short-term subway passenger flow under special events scenarios using multiscale radial basis function networks,”
Transportation Research Part C: Emerging Technologies, vol. 77, pp. 306–328, 2017.
 [2] L. Tang, Y. Zhao, J. Cabrera, J. Ma, and K. L. Tsui, “Forecasting short-term passenger flow: An empirical study on Shenzhen metro,” IEEE Transactions on Intelligent Transportation Systems, 2018.
 [3] Y. Gong, Z. Li, J. Zhang, W. Liu, Y. Zheng, and C. Kirsch, “Networkwide crowd flow prediction of sydney trains via customized online nonnegative matrix factorization,” in CIKM. ACM, 2018, pp. 1243–1252.
 [4] E. Chen, Z. Ye, C. Wang, and M. Xu, “Subway passenger flow prediction for special events using smart card data,” IEEE Transactions on Intelligent Transportation Systems, 2019.
 [5] S. Fang, Q. Zhang, G. Meng, S. Xiang, and C. Pan, “Gstnet: global spatialtemporal network for traffic flow prediction,” in IJCAI, 2019, pp. 10–16.
 [6] S. Hao, D.H. Lee, and D. Zhao, “Sequence to sequence learning with attention mechanism for shortterm passenger flow prediction in largescale metro system,” Transportation Research Part C: Emerging Technologies, vol. 107, pp. 287–300, 2019.

[7]
M. Lippi, M. Bertini, and P. Frasconi, “Shortterm traffic flow forecasting: An experimental comparison of timeseries analysis and supervised learning,”
IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 871–882, 2013. 
[8]
J. Guo, W. Huang, and B. M. Williams, “Adaptive kalman filter approach for stochastic shortterm traffic flow rate prediction and uncertainty quantification,”
Transportation Research Part C: Emerging Technologies, vol. 43, pp. 50–64, 2014.  [9] S. V. Kumar and L. Vanajakshi, “Shortterm traffic flow prediction using seasonal arima model with limited input data,” European Transport Research Review, vol. 7, no. 3, p. 21, 2015.
 [10] B. M. Williams, P. K. Durvasula, and D. E. Brown, “Urban freeway traffic flow prediction: application of seasonal autoregressive integrated moving average and exponential smoothing models,” Transportation Research Record, 1998.
 [11] S. Xingjian, Z. Chen, H. Wang, D.Y. Yeung, W.K. Wong, and W.c. Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” in NIPS, 2015, pp. 802–810.
 [12] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
 [13] J. Zhang, Y. Zheng, and D. Qi, “Deep spatiotemporal residual networks for citywide crowd flows prediction.” in AAAI, 2017, pp. 1655–1661.
 [14] H. Yao, F. Wu, J. Ke, X. Tang, Y. Jia, S. Lu, P. Gong, J. Ye, and Z. Li, “Deep multiview spatialtemporal network for taxi demand prediction,” in AAAI, 2018.
 [15] L. Liu, Z. Qiu, G. Li, Q. Wang, W. Ouyang, and L. Lin, “Contextualized spatial-temporal network for taxi origin-destination demand prediction,” IEEE Transactions on Intelligent Transportation Systems, 2019.
 [16] H. Yao, X. Tang, H. Wei, G. Zheng, and Z. Li, “Revisiting spatialtemporal similarity: A deep learning framework for traffic prediction,” in AAAI, 2019.
 [17] J. Zhang, Y. Zheng, J. Sun, and D. Qi, “Flow prediction in spatiotemporal networks based on multitask deep learning,” IEEE Transactions on Knowledge and Data Engineering, 2019.
 [18] X. Ma, J. Zhang, B. Du, C. Ding, and L. Sun, “Parallel architecture of convolutional bidirectional lstm neural networks for networkwide metro ridership prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 6, pp. 2278–2288, 2019.
 [19] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun, “Spectral networks and locally connected networks on graphs,” in ICLR, 2014.
 [20] D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. D. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in NIPS, 2015, pp. 2224–2232.
 [21] T. N. Kipf and M. Welling, “Semisupervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
 [22] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, “Scene graph generation from objects, phrases and region captions,” in ICCV, 2017, pp. 1261–1270.
 [23] X. Wang, C. Li, R. Yang, T. Zhang, J. Tang, and B. Luo, “Describe and attend to track: Learning natural language guided structural representation and visual attention for object tracking,” arXiv preprint arXiv:1811.10014, 2018.

[24]
J.X. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, “Graph convolutional label noise cleaner: Train a plugandplay action classifier for anomaly detection,” in
CVPR, 2019, pp. 1237–1246.  [25] T. Chen, M. Xu, X. Hui, H. Wu, and L. Lin, “Learning semanticspecific graph representation for multilabel image recognition,” in CVPR, 2019, pp. 522–531.
 [26] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional recurrent neural network: Datadriven traffic forecasting,” in ICLR, 2018.
 [27] B. Yu, H. Yin, and Z. Zhu, “Spatiotemporal graph convolutional networks: A deep learning framework for traffic forecasting,” in IJCAI, 2018.
 [28] Z. Cui, K. Henrickson, R. Ke, and Y. Wang, “Traffic graph convolutional recurrent neural network: A deep learning framework for networkscale traffic learning and forecasting,” arXiv preprint arXiv:1802.07007, 2018.
 [29] X. Geng, Y. Li, L. Wang, L. Zhang, Q. Yang, J. Ye, and Y. Liu, “Spatiotemporal multigraph convolution network for ridehailing demand forecasting,” in AAAI, 2019.
 [30] L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, M. Deng, and H. Li, “Tgcn: A temporal graph convolutional network for traffic prediction,” IEEE Transactions on Intelligent Transportation Systems, 2019.
 [31] C. Zheng, X. Fan, C. Wang, and J. Qi, “Gman: A graph multiattention network for traffic prediction,” in AAAI, 2020.
 [32] D. J. Berndt and J. Clifford, “Using dynamic time warping to find patterns in time series.” in KDD workshop, vol. 10, no. 16. Seattle, WA, 1994, pp. 359–370.
 [33] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in NIPS, 2014, pp. 3104–3112.
 [34] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
 [35] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in NIPS, 2016, pp. 3844–3852.
 [36] J. Atwood and D. Towsley, “Diffusionconvolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 1993–2001.
 [37] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
 [38] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in ICML, 2019, pp. 6861–6871.
 [39] C. Jiang, H. Xu, X. Liang, and L. Lin, “Hybrid knowledge routed modules for largescale object detection,” in NIPS, 2018, pp. 1552–1563.
 [40] D. Beck, G. Haffari, and T. Cohn, “Graphtosequence learning using gated graph neural networks,” in ACL, 2018, pp. 273–283.
 [41] F. W. Levi, Finite Geometrical Systems: Six Public Lectures Delivered in February, 1940, at the University of Calcutta. The University of Calcutta, 1942.
 [42] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, “Graph convolutional neural networks for webscale recommender systems,” in KDD. ACM, 2018, pp. 974–983.
 [43] E. Bolshinsky and R. Friedman, “Traffic flow forecast survey,” Computer Science Department, Technion, Tech. Rep., 2012.
 [44] J. Barros, M. Araujo, and R. J. Rossetti, “Shortterm realtime traffic prediction methods: A survey,” in MTITS. IEEE, 2015, pp. 132–139.
 [45] A. M. Nagy and V. Simon, “Survey on traffic prediction in smart cities,” Pervasive and Mobile Computing, vol. 50, pp. 148–163, 2018.
 [46] X. Li, G. Pan, Z. Wu, G. Qi, S. Li, D. Zhang, W. Zhang, and Z. Wang, “Prediction of urban human mobility using largescale taxi traces and its applications,” Frontiers of Computer Science, vol. 6, no. 1, pp. 111–121, 2012.
 [47] P. Dell’Acqua, F. Bellotti, R. Berta, and A. De Gloria, “Timeaware multivariate nearest neighbor regression methods for traffic flow prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 6, pp. 3393–3402, 2015.
 [48] M.C. Tan, S. C. Wong, J.M. Xu, Z.R. Guan, and P. Zhang, “An aggregation approach to shortterm traffic flow prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 10, no. 1, pp. 60–69, 2009.
 [49] D. Wang, W. Cao, J. Li, and J. Ye, “Deepsd: supplydemand prediction for online carhailing services using deep neural networks,” in ICDE. IEEE, 2017, pp. 243–254.
 [50] L. Liu, R. Zhang, J. Peng, G. Li, B. Du, and L. Lin, “Attentive crowd flow machines,” in ACM Multimedia. ACM, 2018, pp. 1553–1561.
 [51] Y. Han, S. Wang, Y. Ren, C. Wang, P. Gao, and G. Chen, “Predicting stationlevel shortterm passenger flow in a citywide metro network using spatiotemporal graph convolutional neural networks,” ISPRS International Journal of GeoInformation, vol. 8, no. 6, p. 243, 2019.
 [52] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS workshop, 2017.
 [53] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in AISTATS, 2010, pp. 249–256.
 [54] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
 [55] Y. Liu and H. Wu, “Prediction of road traffic congestion based on random forest,” in ISCID, vol. 2. IEEE, 2017, pp. 361–364.
 [56] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional recurrent neural network: Datadriven traffic forecasting,” arXiv preprint arXiv:1707.01926, 2017.