Metro is an efficient and economical travel mode in metropolises, and it plays an important role in the daily life of residents. By the end of 2018, 35 metro systems were in operation in Mainland China, serving tens of millions of passengers (https://en.wikipedia.org/wiki/Urban_rail_transit_in_China). For instance, over 10 million metro trip transactions were made per day in 2018 in Beijing (https://en.wikipedia.org/wiki/Beijing_Subway) and Shanghai (https://en.wikipedia.org/wiki/Shanghai_Metro). Such huge metro ridership poses great challenges for urban transportation, and any carelessness in traffic management may result in citywide congestion. To improve the service efficiency of metro systems, a fundamental problem is how to accurately forecast the ridership (e.g., inflow and outflow) of each station, which is termed station-level metro ridership prediction in this work. Due to its potential applications in traffic dispatch and route planning, this problem has become a hot research topic [1, 2, 3, 4, 5, 6] in the community of intelligent transportation systems (ITSs).
In earlier works, the raw data of traffic states at each time interval was usually transformed into a vector/sequence, and time series models [10, 8] were applied for prediction. However, this data format failed to maintain the spatial information of locations and the topological connection information between two locations. In recent years, deep neural networks (e.g., Long Short-Term Memory and Gated Recurrent Unit) have been widely used for citywide traffic prediction [13, 14, 15, 16, 17]. These works usually partitioned the studied cities into regular grid maps on the basis of geographical coordinates and organized the collected traffic state data as Euclidean 2D or 3D tensors, which can be directly fed into convolutional networks for automatic representation learning. Nevertheless, this approach is also unsuitable for metro systems, since their topologies are irregular graphs and their data structures are non-Euclidean. Although the transaction records of a metro system can be rendered as a grid map, it is inefficient to learn ridership evolution patterns from the rendered map, which is very sparse and cannot maintain the connection information of two stations. Therefore, a more reasonable mechanism is needed to model graph-based metro systems.
Fortunately, the emerging Graph Neural Networks (GNN [19, 20, 21]) have been proven to be general and effective for non-Euclidean data embedding. How to construct the graph in a GNN is an open problem, and the construction strategy varies across tasks [22, 23, 24, 25]. Recently, some researchers [26, 27, 28, 29, 30, 31] have applied graph convolutional networks (GCN) to forecast traffic states, and most related works directly built graphs based on the physical topologies of the studied traffic systems. Although this simple strategy helps to learn the local spatial dependency of neighboring stations, we believe it is suboptimal for metro ridership prediction, since it cannot fully capture the inter-station flow patterns in a metro system. Besides the physical topologies, some domain knowledge can also be utilized to guide the graph construction, such as:
Inter-station Flow Similarity: Intuitively, two metro stations in different regions may have similar evolution patterns of passenger flow, if their located regions share the same functionality (e.g., office districts). Even though these stations are not directly linked in the real-world metro system, we can connect them in GNN with a virtual edge to jointly learn the evolution patterns.
Inter-station Flow Correlation: In general, the ridership between every two stations is not uniform and the direction of passenger flow implicitly represents the correlation of two stations. For instance, if (i) the majority of the inflow of station $a$ streams to station $b$, or (ii) the outflow of station $a$ primarily comes from station $b$, we argue that stations $a$ and $b$ are highly correlated. Under such circumstances, these stations could also be connected to learn the ridership interaction among stations.
Based on the above observations, we propose a unified Physical-Virtual Collaboration Graph Network (PVCGN) to predict the future metro ridership in an end-to-end manner. To fully explore the ridership evolution patterns, we utilize the metro physical topology information and human domain knowledge to construct three complementary graphs. First, a physical graph is directly formed on the basis of the realistic topology of the studied metro system. Then, a similarity graph and a correlation graph are built with virtual topologies, based respectively on the passenger flow similarity and correlation among different stations. In particular, the similarity score of two stations is obtained by computing the warping distance between their historical flow series with Dynamic Time Warping (DTW), while the correlation ratio is determined by the historical origin-destination distribution of ridership. These tailor-designed graphs are incorporated into an extended Graph Convolution Gated Recurrent Unit to collaboratively capture the ridership evolution patterns. Furthermore, a Fully-Connected Gated Recurrent Unit is utilized to learn the semantic feature of the global evolution tendency. Finally, we apply a Seq2Seq model to sequentially forecast the metro ridership at the next several time intervals. To verify the effectiveness of the proposed PVCGN, we conduct extensive experiments on two large-scale benchmarks (i.e., Shanghai Metro and Hangzhou Metro), and the evaluation results show that our approach outperforms existing state-of-the-art methods under various comparison circumstances.
In summary, our major contributions are three-fold:
We systematically analyze the drawbacks of existing methods of traffic states prediction and inventively propose to model the studied metro systems with graph networks, which are naturally suitable for the representation learning on non-Euclidean data.
We develop a unified Physical-Virtual Collaboration Graph Network (PVCGN) to address the station-level metro ridership prediction. Specifically, PVCGN incorporates a physical graph, a similarity graph and a correlation graph into a Graph Convolution Gated Recurrent Unit to facilitate the spatial-temporal representation learning. The physical graph is built based on the realistic topology of a metro system, while the other two virtual graphs are constructed with human domain knowledge to fully exploit the ridership evolution patterns.
Extensive experiments on two real-world benchmarks show that the proposed PVCGN is comprehensively better than existing methods with significant margins.
The remaining parts of this paper are organized as follows. We first review deep learning on graphs and related works on traffic state prediction in Section II. We then introduce the proposed PVCGN in Section III and conduct extensive comparisons in Section IV. Finally, we conclude this paper and discuss future works in Section V.
II Related Work
II-A Deep Learning on Graphs
In machine learning, Euclidean data refers to signals with an underlying Euclidean structure (such as speech, images, and videos). Although deep Convolutional/Recurrent Neural Networks (CNN/RNN) can handle Euclidean data successfully, it is still challenging to deal with non-Euclidean data (e.g., graphs), which is the data structure of many applications. To address this issue, Graph Neural Networks (GNN) have been proposed to automatically learn feature representations on graphs. For instance, Bruna et al. introduced a graph-Laplacian spectral filter to generalize the convolution operators to non-Euclidean domains. Defferrard et al. presented a formulation of CNN with spectral graph theory and designed fast localized convolutional filters on graphs. Atwood and Towsley developed a spatial-based graph convolution and regarded it as a diffusion process, in which the information of a node was transferred to its neighboring nodes with a certain transition probability. Veličković et al. assumed that the contributions of neighboring nodes to the central node were not identical, and thus proposed a Graph Attention Network. Wu et al. reduced the complexity of GNN by successively removing nonlinearities and collapsing weight matrices between consecutive layers.
Recently, GNN has been widely applied to various tasks, and the graph construction strategy varied across works. For instance, in computer vision, Jiang et al. utilized the co-occurrence probability, attribute correlation and spatial correlation of objects to build three graphs for large-scale object detection. Chen et al. constructed a semantic-specific graph based on the statistical label co-occurrence for multi-label image recognition. In natural language processing, Beck et al. used source dependency information to build a Levi graph for neural machine translation. For semi-supervised document classification, Kipf and Welling introduced a first-order approximation of spectral graph convolution and constructed their graphs based on citation links. In data mining, the relations between items-and-items, users-and-users and users-and-items were usually leveraged to construct graph-based recommender systems. In summary, how to build a graph is an open problem and the topology of a graph should be flexibly designed for the specific task.
II-B Traffic States Prediction
Accurately forecasting the future traffic states is crucial for intelligent transportation systems, and numerous models have been proposed to address this task [43, 44, 45]. In early works [46, 7, 8, 9, 47], mass traffic data was collected from some specific locations and the raw data at each time interval was arranged as a vector (sequence) in a certain order. These vectors were further fed into time series models for prediction. A representative work was the data aggregation (DA) model, in which the moving average (MA), exponential smoothing (ES) and autoregressive integrated moving average (ARIMA) were simultaneously applied to forecast traffic flow. However, this simple data format was inefficient due to the lack of spatial information, and these basic time series models failed to learn the complex traffic patterns. Therefore, the above-mentioned works were far from satisfactory in complex traffic scenarios.
In recent years, deep neural networks have become the mainstream approach in this field. For instance, Zhang et al. utilized three residual networks to learn the closeness, period and trend properties of crowd flow. Wang et al. developed an end-to-end convolutional neural network to automatically discover the supply-demand patterns from car-hailing service data. Zhang et al. simultaneously predicted the region-based flow and inter-region transitions with a deep multitask framework. Subsequently, RNNs and their variants have also been widely adopted to learn temporal patterns. For instance, Yao et al. proposed a Deep Multi-View Spatial-Temporal Network for taxi demand prediction, which learned the spatial relations and the temporal correlations with a deep CNN and Long Short-Term Memory (LSTM) units, respectively. Liu et al. developed an attentive convolutional LSTM network to dynamically learn the spatial-temporal representations with an attention mechanism. In another work, an LSTM based on a periodically shifted attention mechanism was introduced to capture the long-term periodic dependency and temporal shifting. To fit the required input format of CNN and RNN, most of these works divided the studied cities into regular grid maps and transformed the raw traffic data into tensors. However, this preprocessing is ineffective for traffic systems with irregular topologies, such as metro systems and road networks.
To improve the generality of the above-mentioned methods, some researchers have attempted to address this task with Graph Convolutional Networks. For instance, Li et al. modeled the traffic flow as a diffusion process on a directed graph and captured the spatial dependency with bidirectional random walks. Zheng et al. developed a graph multi-attention network to model the impact of some spatial-temporal factors on traffic conditions, while Zhao et al. proposed a temporal graph convolutional network for traffic forecasting based on urban road networks. Recently, GCN has also been employed for metro ridership prediction. In one such work, graph convolution operations were applied to capture the irregular spatiotemporal dependencies along the metro network, but the graph was directly built based on the physical topology of metro systems. In contrast, we combine physical topology information and human domain knowledge to construct three collaborative graphs with various topologies, which can effectively capture the complex patterns.
In this work, we propose a novel Physical-Virtual Collaboration Graph Network (PVCGN) for station-level metro ridership prediction. Based on the physical topology of a metro system and human domain knowledge, we construct a physical graph, a similarity graph and a correlation graph, which are incorporated into a Graph Convolution Gated Recurrent Unit (GC-GRU) for local spatial-temporal representation learning. Then, a Fully-Connected Gated Recurrent Unit (FC-GRU) is applied to learn the global evolution feature. Finally, we develop a seq2seq framework with GC-GRU and FC-GRU to forecast the ridership of each metro station.
We first define some notations of ridership prediction before introducing the details of PVCGN. The ridership data of station $i$ at time interval $t$ is denoted as $X_t^i \in \mathbb{R}^{2}$, where the two values are the passenger counts of inflow and outflow. The ridership of the whole metro system is represented as a signal $X_t = (X_t^1, X_t^2, \ldots, X_t^N) \in \mathbb{R}^{2 \times N}$, where $N$ is the number of stations. Given a historical ridership sequence, our goal is to predict a future ridership sequence:
$$X_{t+1}, X_{t+2}, \ldots, X_{t+m} = \mathrm{Predict}(X_{t-n+1}, X_{t-n+2}, \ldots, X_t), \tag{1}$$
where $n$ refers to the length of the input sequence and $m$ is the length of the predicted sequence. For convenience in the following subsections, we also denote the whole historical ridership of station $i$ as a vector $X^i \in \mathbb{R}^{2T}$, where $T$ is the number of time intervals in the training set.
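The input/output windowing implied by this problem definition can be illustrated with a minimal sketch. It assumes the ridership signals are already stored as a chronological Python list; the function name `make_samples` and its arguments are hypothetical and not part of the original method description.

```python
def make_samples(series, n, m):
    """Split a chronological list of per-interval signals into (history, future) pairs.

    series: list of signals, one per time interval (any Python objects).
    n: length of the input sequence; m: length of the predicted sequence.
    """
    samples = []
    for t in range(n, len(series) - m + 1):
        history = series[t - n:t]   # the n most recent observed intervals
        future = series[t:t + m]    # the next m intervals to be predicted
        samples.append((history, future))
    return samples


if __name__ == "__main__":
    toy_series = list(range(10))    # stand-in for ten consecutive ridership signals
    for history, future in make_samples(toy_series, n=4, m=4):
        print(history, "->", future)
```

With `n = m = 4`, each training sample pairs one hour of history with the following hour of targets at a 15-minute resolution.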
III-A Physical-Virtual Graphs
In this section, we describe how to construct the physical graph and two virtual graphs. By definition, a graph is composed of nodes, edges and the weights of edges. In our work, the physical graph, similarity graph and correlation graph are denoted as $\mathcal{G}_p = (V, E_p, W_p)$, $\mathcal{G}_s = (V, E_s, W_s)$ and $\mathcal{G}_c = (V, E_c, W_c)$, respectively. $V$ is the set of nodes ($|V| = N$) and each node represents a real-world metro station. Note that these three graphs share the same nodes, but have different edges and edge weights. $E_p$, $E_s$ and $E_c$ are the edge sets of the different graphs. For a specific graph $\mathcal{G}_\alpha$ ($\alpha \in \{p, s, c\}$), $W_\alpha \in \mathbb{R}^{N \times N}$ denotes the weights of all edges. Specifically, $W_\alpha(i, j)$ is the weight of an edge from node $j$ to node $i$.
III-A1 Physical Graph
$\mathcal{G}_p$ is directly built based on the physical topology of the studied metro system. An edge is formed to connect nodes $i$ and $j$ in $E_p$ if the corresponding stations $i$ and $j$ are connected in the real world. To calculate the weights of these edges, we first construct a physical connection matrix $P \in \mathbb{R}^{N \times N}$. As shown in Fig.1-(a,b), $P(i, j) = 1$ if there exists an edge between nodes $i$ and $j$, or else $P(i, j) = 0$. The edge weight $W_p$ is obtained by performing a linear normalization on each row (see Fig.1-c). Specifically, $W_p(i, j)$ is computed by:
$$W_p(i, j) = \frac{P(i, j)}{\sum_{k=1}^{N} P(i, k)}. \tag{2}$$
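As a small illustration of the row normalization used for the physical graph, the sketch below builds a binary connection matrix from a list of undirected station links and divides each row by its sum. The helper names and the toy three-station line are assumptions made for the example only.

```python
def row_normalize(matrix):
    """Divide every row by its sum; rows that sum to zero are left as zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


def physical_weights(num_stations, links):
    """Build the binary connection matrix from undirected links, then row-normalize it."""
    connection = [[0.0] * num_stations for _ in range(num_stations)]
    for i, j in links:
        connection[i][j] = 1.0
        connection[j][i] = 1.0
    return row_normalize(connection)


if __name__ == "__main__":
    # Toy line with three stations: 0 - 1 - 2
    for row in physical_weights(3, [(0, 1), (1, 2)]):
        print(row)
```

In this toy example the middle station splits its weight equally between its two neighbors, while each terminal station puts all of its weight on its single neighbor.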
III-A2 Similarity Graph
In this section, the similarities of metro stations are used to guide the construction of $\mathcal{G}_s$. First, we construct a similarity score matrix $S \in \mathbb{R}^{N \times N}$ by calculating the passenger flow similarities between every two stations. Specifically, the score $S(i, j)$ between stations $i$ and $j$ is computed with Dynamic Time Warping (DTW):
$$S(i, j) = \exp\left(-\mathrm{DTW}(X^i, X^j)\right), \tag{3}$$
where DTW is a general algorithm for measuring the distance between two temporal sequences. Based on the matrix $S$, we select some station pairs to build the edges $E_s$. The selection strategy is flexible. For instance, these virtual edges can be determined with a predefined similarity threshold, or be built by choosing the top-$k$ station pairs with high similarity scores. More selection details can be found in Section IV-A2. Finally, we calculate the edge weights $W_s$ by conducting row normalization on $S$:
$$W_s(i, j) = \frac{S(i, j) \cdot L(E_s, i, j)}{\sum_{k=1}^{N} S(i, k) \cdot L(E_s, i, k)}, \tag{4}$$
where $L(E_s, i, j) = 1$ if $E_s$ contains an edge connecting nodes $i$ and $j$, or else $L(E_s, i, j) = 0$. A toy example of the similarity graph is shown in Fig.1-(d,e,f), and we can observe that the matrix $S$ is symmetrical, but the matrix $W_s$ is asymmetrical due to the row normalization.
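The construction of the similarity graph can be sketched in a few lines. This is only an illustration under stated assumptions: the flow series are one-dimensional lists of floats, the point-wise cost inside DTW is the absolute difference, distances are mapped to scores with `exp(-d)`, and edges are chosen per station with a top-k rule. The function names are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic-programming DTW with absolute difference as the local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def similarity_weights(flows, top_k):
    """Return (score matrix, row-normalized weights restricted to top_k edges per station)."""
    n = len(flows)
    scores = [[math.exp(-dtw_distance(flows[i], flows[j])) for j in range(n)]
              for i in range(n)]
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        selected = sorted(others, key=lambda j: scores[i][j], reverse=True)[:top_k]
        total = sum(scores[i][j] for j in selected)
        for j in selected:
            weights[i][j] = scores[i][j] / total if total else 0.0
    return scores, weights


if __name__ == "__main__":
    toy_flows = [[1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0], [9.0, 9.0, 9.0]]
    scores, weights = similarity_weights(toy_flows, top_k=1)
    print(weights)
```

For long series the quadratic cost of DTW per station pair can become significant, so in practice the series are often normalized and downsampled first; that preprocessing is outside the scope of this sketch.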
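III-A3 Correlation Graph
In this work, we utilize the origin-destination distribution of ridership to build the virtual graph $\mathcal{G}_c$. First, we construct a correlation ratio matrix $C \in \mathbb{R}^{N \times N}$. Specifically, $C(i, j)$ is computed by:
$$C(i, j) = \frac{D(i, j)}{\sum_{k=1}^{N} D(i, k)}, \tag{5}$$
where $D(i, j)$ is the total number of passengers that traveled from station $j$ to station $i$ in the whole training set. We use a selection strategy similar to that described in Section III-A2 to select some station pairs for edge construction. Finally, the edge weights $W_c$ are calculated by:
$$W_c(i, j) = \frac{C(i, j) \cdot L(E_c, i, j)}{\sum_{k=1}^{N} C(i, k) \cdot L(E_c, i, k)}. \tag{6}$$
One example of the correlation graph is shown in Fig.1-(g,h,i), and we can see that $\mathcal{G}_c$ is a directed graph, since $C(i, j) \neq C(j, i)$.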
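A minimal sketch of the correlation graph construction is given below. It assumes an origin-destination count matrix whose entry `[i][j]` holds the number of passengers who traveled from station `j` to station `i` (this orientation is an assumption of the example), and it selects edges with a ratio threshold; the function name and the toy counts are hypothetical.

```python
def correlation_weights(od_counts, threshold):
    """Return (ratio matrix, row-normalized weights over edges whose ratio >= threshold)."""
    n = len(od_counts)
    ratios = []
    for row in od_counts:
        total = sum(row)
        ratios.append([count / total if total else 0.0 for count in row])
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Keep only sufficiently strong origin stations for destination i.
        kept = [j for j in range(n) if j != i and ratios[i][j] >= threshold]
        total = sum(ratios[i][j] for j in kept)
        for j in kept:
            weights[i][j] = ratios[i][j] / total if total else 0.0
    return ratios, weights


if __name__ == "__main__":
    toy_counts = [[0, 30, 10],
                  [5, 0, 5],
                  [1, 99, 0]]
    ratios, weights = correlation_weights(toy_counts, threshold=0.2)
    print(ratios[0][1], ratios[1][0])   # the two directions differ, so the graph is directed
    print(weights)
```

Because each row is normalized by its own total, the ratio from station 1 to station 0 generally differs from the ratio in the opposite direction, which is what makes the resulting graph directed.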
III-B Graph Convolution Gated Recurrent Unit
As an alternative to LSTM, the Gated Recurrent Unit (GRU) has been widely used for temporal modeling, and it is usually implemented with standard convolution or full connection. In this section, we incorporate the proposed physical-virtual graphs to develop a unified Graph Convolution Gated Recurrent Unit (GC-GRU) for spatial-temporal feature learning.
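We first formulate the convolution on the proposed physical-virtual graphs. Let us assume that the input of graph convolution is $I_t = (I_t^1, I_t^2, \ldots, I_t^N)$, where $I_t^i$ can be the ridership data or its feature. The parameters of this graph convolution are denoted as $\Theta$. By definition of convolution, the output feature $\tilde{I}_t^i$ of $I_t^i$ is computed by:
$$\tilde{I}_t^i = \Theta^l \odot I_t^i + \sum_{j \in \mathcal{N}_p(i)} W_p(i, j) \cdot \Theta^p \odot I_t^j + \sum_{j \in \mathcal{N}_s(i)} W_s(i, j) \cdot \Theta^s \odot I_t^j + \sum_{j \in \mathcal{N}_c(i)} W_c(i, j) \cdot \Theta^c \odot I_t^j, \tag{7}$$
where $\odot$ is the Hadamard product and $\Theta = \{\Theta^l, \Theta^p, \Theta^s, \Theta^c\}$. Specifically, $\Theta^l$ denotes the learnable parameters for the self-loop. $\Theta^p$ denotes the parameters of the physical graph $\mathcal{G}_p$ and $\mathcal{N}_p(i)$ represents the neighbor set of node $i$ in $\mathcal{G}_p$. Other notations $\Theta^s$, $\Theta^c$, $\mathcal{N}_s(i)$ and $\mathcal{N}_c(i)$ have similar semantic meanings. $d$ is the dimensionality of feature $\tilde{I}_t^i$. In this manner, a node can dynamically receive information from some highly-correlated neighbor nodes. For convenience, the graph convolution in Eq.7 is abbreviated as $\Theta * I_t$ in the following.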
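The aggregation performed by this multi-graph convolution can be sketched with plain Python lists. The sketch assumes that every parameter set is a vector with the same length as a node feature, so that the Hadamard product is an element-wise multiplication; this shape choice and the name `graph_conv` are assumptions of the example rather than details confirmed by the text.

```python
def graph_conv(features, graphs, self_theta):
    """Aggregate node features over several weighted graphs.

    features: list of N feature vectors, each of length d.
    graphs: list of (weights, theta) pairs; weights is an N x N matrix whose
            entry [i][j] weights the message from node j to node i, and theta
            is a length-d parameter vector applied element-wise.
    self_theta: length-d parameter vector of the self-loop term.
    """
    n, d = len(features), len(self_theta)
    # Self-loop term: element-wise product of parameters and the node's own feature.
    out = [[self_theta[c] * features[i][c] for c in range(d)] for i in range(n)]
    for weights, theta in graphs:
        for i in range(n):
            for j in range(n):
                if weights[i][j] == 0.0:
                    continue    # node j is not a neighbor of node i in this graph
                for c in range(d):
                    out[i][c] += weights[i][j] * theta[c] * features[j][c]
    return out


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0]]
    physical = ([[0.0, 1.0], [1.0, 0.0]], [1.0, 1.0])
    virtual = ([[0.0, 0.0], [0.0, 0.0]], [1.0, 1.0])   # empty graph contributes nothing
    print(graph_conv(feats, [physical, virtual], self_theta=[0.5, 0.5]))
```

A practical implementation would express the same computation with dense or sparse matrix products on tensors; the nested loops here only make the per-edge weighting explicit.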
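Since the above-mentioned operation is conducted on the spatial view, we embed the physical-virtual graph convolution in a Gated Recurrent Unit to learn spatial-temporal features. Specifically, the reset gate $R_t$, update gate $Z_t$, new information $N_t$ and hidden state $H_t$ are computed by:
$$\begin{aligned} R_t &= \sigma(\Theta_{rx} * I_t + \Theta_{rh} * H_{t-1} + b_r), \\ Z_t &= \sigma(\Theta_{zx} * I_t + \Theta_{zh} * H_{t-1} + b_z), \\ N_t &= \tanh(\Theta_{nx} * I_t + R_t \odot (\Theta_{nh} * H_{t-1}) + b_n), \\ H_t &= (1 - Z_t) \odot N_t + Z_t \odot H_{t-1}, \end{aligned} \tag{8}$$
where $\sigma$ is the sigmoid function and $H_{t-1}$ is the hidden state at the last iteration. $\Theta_{rx}$ denotes the graph convolution parameters between $R_t$ and $I_t$, while $\Theta_{rh}$ denotes the parameters between $R_t$ and $H_{t-1}$. Other parameters $\Theta_{zx}$, $\Theta_{zh}$, $\Theta_{nx}$ and $\Theta_{nh}$ have similar meanings. $b_r$, $b_z$ and $b_n$ are biases. The feature dimensionalities of $R_t$, $Z_t$, $N_t$ and $H_t$ are also set to $d$. For convenience, we denote the operation of Eq.8 as:
$$H_t = \text{GC-GRU}(I_t, H_{t-1}). \tag{9}$$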
Thanks to this GC-GRU, we can effectively learn spatial-temporal features from the ridership data of metro systems.
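One recurrent update of such a unit can be sketched as follows. The sketch assumes the commonly used GRU gate arrangement (reset gate applied to the transformed hidden state, update gate interpolating between the old state and the candidate), treats features as flat lists of floats, and takes the six graph convolutions as caller-supplied functions. All names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gc_gru_step(x, h, conv, bias):
    """One GRU-style update in which every linear map is replaced by a supplied convolution.

    x, h: flat lists of floats (current input and previous hidden state).
    conv: dict with keys 'rx', 'rh', 'zx', 'zh', 'nx', 'nh'; each value maps a
          flat list to a flat list of the same length as h.
    bias: dict with scalar biases under keys 'r', 'z', 'n'.
    """
    reset = [sigmoid(a + b + bias["r"])
             for a, b in zip(conv["rx"](x), conv["rh"](h))]
    update = [sigmoid(a + b + bias["z"])
              for a, b in zip(conv["zx"](x), conv["zh"](h))]
    candidate = [math.tanh(a + r * b + bias["n"])
                 for a, r, b in zip(conv["nx"](x), reset, conv["nh"](h))]
    # Interpolate between the candidate and the previous hidden state.
    return [(1.0 - z) * n + z * old for z, n, old in zip(update, candidate, h)]


if __name__ == "__main__":
    def zero_map(values):
        return [0.0] * len(values)

    convs = {key: zero_map for key in ("rx", "rh", "zx", "zh", "nx", "nh")}
    biases = {"r": 0.0, "z": 0.0, "n": 0.0}
    print(gc_gru_step([1.0, 2.0], [2.0, -4.0], convs, biases))
```

Passing the convolutions in as functions keeps the recurrence independent of how the graph aggregation itself is implemented.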
III-C Local-Global Feature Fusion
In previous works [15, 5], global features have also been proven useful for traffic state prediction. However, the proposed GC-GRU conducts convolution on local space and fails to capture the global context. To address this issue, we apply a Fully-Connected Gated Recurrent Unit (FC-GRU) to learn the global evolution features of all stations and generate a comprehensive feature by fusing the outputs of GC-GRU and FC-GRU. The developed fusion module is termed the Collaborative Gated Recurrent Module (CGRM) in this work and its architecture is shown in Fig.2.
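Specifically, the inputs of CGRM are $I_t$ and $H_{t-1}$, where $H_{t-1}$ is the output of the last iteration. For GC-GRU, rather than taking its own previous local hidden state $H_{t-1}^l$ as input, it utilizes the accumulated information in $H_{t-1}$ to update the hidden state, thus Eq.9 becomes:
$$H_t^l = \text{GC-GRU}(I_t, H_{t-1}). \tag{10}$$
For FC-GRU, we first transform $I_t$ and $H_{t-1}$ into embedded features $I_t^e$ and $H_{t-1}^e$ with two fully-connected (FC) layers. Then we feed $I_t^e$ and $H_{t-1}^e$ into a common GRU implemented with full connections to generate a global hidden state $H_t^g$, which can be expressed as:
$$I_t^e = \mathrm{FC}(I_t), \quad H_{t-1}^e = \mathrm{FC}(H_{t-1}), \quad H_t^g = \text{FC-GRU}(I_t^e, H_{t-1}^e). \tag{11}$$
Finally, we incorporate $H_t^l$ and $H_t^g$ to generate a comprehensive hidden state $H_t$ with a fully-connected layer:
$$H_t = \mathrm{FC}(H_t^l \oplus H_t^g), \tag{12}$$
where $\oplus$ denotes feature concatenation. $H_t$ contains the local and global context of ridership, and we verify its effectiveness in Section IV-C2.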
III-D Physical-Virtual Collaboration Graph Network
In this section, we apply the above-mentioned CGRMs to construct a Physical-Virtual Collaboration Graph Network (PVCGN) for station-level metro ridership prediction. Note that our PVCGN is developed based on the seq2seq model and its architecture is shown in Fig.3.
Specifically, PVCGN consists of an encoder and a decoder, both of which contain two CGRMs. In the encoder, the ridership data $\{X_{t-n+1}, \ldots, X_t\}$ are sequentially fed into the CGRMs to accumulate the historical information. At iteration $i$, the bottom CGRM takes $X_{t-n+i}$ as input and its output hidden state is fed into the upper CGRM for high-level feature learning. In particular, the initial hidden states of both CGRMs at the first iteration are set to zero. In the decoder, at the first iteration, the input data is also set to zero and the final hidden states of the encoder are used to initialize the hidden states of the decoder. The future ridership $\hat{X}_{t+1}$ is predicted by feeding the output hidden state of the upper CGRM into a fully-connected layer. At iteration $i$ ($i \geq 2$), the bottom CGRM takes $\hat{X}_{t+i-1}$ as input data and the upper CGRM also applies a fully-connected layer to forecast $\hat{X}_{t+i}$. Finally, we obtain a future ridership sequence $\{\hat{X}_{t+1}, \hat{X}_{t+2}, \ldots, \hat{X}_{t+m}\}$.
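The encoder-decoder rollout described above can be summarized with a short control-flow sketch. It deliberately collapses the two stacked recurrent modules into a single state and takes the recurrent steps and the output layer as caller-supplied functions, so it only illustrates the order of operations; every name in it is hypothetical.

```python
def rollout(encoder_step, decoder_step, readout, history, horizon,
            zero_input, zero_state):
    """Encode a history, then decode autoregressively for `horizon` steps."""
    state = zero_state                    # hidden state starts at zero
    for observation in history:           # encoder accumulates historical information
        state = encoder_step(observation, state)

    predictions = []
    decoder_input = zero_input            # the first decoder input is zero
    for _ in range(horizon):
        state = decoder_step(decoder_input, state)
        decoder_input = readout(state)    # each prediction feeds the next iteration
        predictions.append(decoder_input)
    return predictions


if __name__ == "__main__":
    # Toy scalar stand-ins for the recurrent modules and the output layer.
    def add_step(value, state):
        return state + value

    def identity(state):
        return state

    print(rollout(add_step, add_step, identity, [1, 2, 3], 3, 0, 0))
```

The decoder starts from the final encoder state, which is why the same `state` variable is carried from the first loop into the second.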
In this section, we first introduce the settings of experiments (e.g., dataset construction, implementation details, and evaluation metrics). Then, we compare the proposed PVCGN with eight representative approaches under various scenarios. Finally, we conduct extensive internal analyses to verify the effectiveness of each component in our method.
IV-A Experiment Settings
IV-A1 Dataset Construction
Since there are few public benchmarks for metro ridership prediction, we collect a mass of trip transaction records from two real-world metro systems and construct two large-scale datasets, which are termed as HZMetro and SHMetro respectively. The overviews of these two datasets are summarized in Table I.
SHMetro: This dataset was built based on the metro system of Shanghai, China. A total of 811.8 million transaction records were collected from Jul. 1st 2016 to Sept. 30th 2016, with a daily ridership of 8.82 million. Each record contains the passenger ID, the entry/exit stations and the corresponding timestamps. In this time period, 288 metro stations were operated normally and they were connected by 958 physical edges. For each station, we measured its inflow and outflow every 15 minutes by counting the number of passengers entering or exiting the station. The ridership data of the first two months and that of the last three weeks are used for training and testing respectively, while the ridership data of the remaining days are used for validation.
HZMetro: This dataset was created with the transaction records of the Hangzhou metro system collected in January 2019. With 80 operational stations and 248 physical edges, this system has a daily ridership of 2.35 million. The time interval of this dataset is also set to 15 minutes. Similar to SHMetro, this dataset is divided into three parts, including a training set (Jan. 1st - Jan. 18th), a validation set (Jan. 19th - Jan. 20th) and a testing set (Jan. 21st - Jan. 25th).
| Dataset | SHMetro | HZMetro |
| City | Shanghai, China | Hangzhou, China |
| # Station | 288 | 80 |
| # Physical Edge | 958 | 248 |
| Ridership/Day | 8.82 M | 2.35 M |
| Time Interval | 15 min | 15 min |
| Training Timespan | 7/01/2016 - 8/31/2016 | 1/01/2019 - 1/18/2019 |
| Validation Timespan | 9/01/2016 - 9/09/2016 | 1/19/2019 - 1/20/2019 |
| Testing Timespan | 9/10/2016 - 9/30/2016 | 1/21/2019 - 1/25/2019 |
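The 15-minute inflow/outflow counting used to build both datasets can be sketched as follows. The record layout (entry station, entry time, exit station, exit time) and the function names are assumptions made for this example; the actual transaction schema is not specified beyond the fields listed in the dataset description.

```python
from collections import defaultdict
from datetime import datetime


def floor_to_interval(timestamp, minutes=15):
    """Round a datetime down to the start of its interval."""
    floored_minute = timestamp.minute - timestamp.minute % minutes
    return timestamp.replace(minute=floored_minute, second=0, microsecond=0)


def count_ridership(records):
    """Map (station, interval start) to [inflow, outflow] passenger counts."""
    counts = defaultdict(lambda: [0, 0])
    for entry_station, entry_time, exit_station, exit_time in records:
        counts[(entry_station, floor_to_interval(entry_time))][0] += 1   # inflow
        counts[(exit_station, floor_to_interval(exit_time))][1] += 1     # outflow
    return dict(counts)


if __name__ == "__main__":
    toy_records = [
        ("A", datetime(2019, 1, 1, 9, 3), "B", datetime(2019, 1, 1, 9, 31)),
        ("A", datetime(2019, 1, 1, 9, 14, 59), "C", datetime(2019, 1, 1, 9, 20)),
    ]
    for key, value in sorted(count_ridership(toy_records).items()):
        print(key, value)
```

Each trip contributes one inflow count at its entry station and one outflow count at its exit station, possibly in different intervals.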
IV-A2 Implementation Details
Since the physical graph has a well-defined topology, we only introduce the details of the two virtual graphs in this section. In the SHMetro dataset, to reduce the computational cost of GCN, for each station we only select the ten stations with the highest similarity scores or correlation ratios to construct the virtual graphs, so both the similarity graph and the correlation graph have 2880 edges. In the HZMetro dataset, as its station number is much smaller than that of SHMetro and the computational cost is not heavy, we can build more virtual edges to learn the complex patterns. Specifically, we determine the virtual edges by setting the similarity/correlation thresholds to 0.1/0.02, and the final similarity graph and correlation graph have 2502 and 1094 edges, respectively.
We implement our PVCGN with the popular deep learning framework PyTorch. The lengths of the input and output sequences are both set to 4. The input data and the ground truth of the output are normalized with Z-score normalization (https://en.wikipedia.org/wiki/Standard_score) before being fed into the network. The filter weights of all layers are initialized with Xavier initialization. The batch size is set to 8 for SHMetro and 32 for HZMetro. The feature dimensionality is set to 256. The initial learning rate is 0.001 and its decay ratio is 0.1. We apply Adam to optimize our PVCGN for 200 epochs by minimizing the mean absolute error between the predicted results and the corresponding ground truths. On each benchmark, we select the final model with the minimum error on the validation set and apply the selected model to conduct prediction on the testing set.
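The Z-score normalization applied before training, and the inverse transform needed to read predictions in passenger counts, can be sketched as below. The sketch assumes a single global mean and standard deviation estimated from the training values; whether statistics are computed globally or per station is not stated in the text, and the function names are hypothetical.

```python
import statistics


def fit_zscore(train_values):
    """Estimate the mean and (population) standard deviation from training data."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def normalize(values, mean, std):
    return [(value - mean) / std for value in values]


def denormalize(values, mean, std):
    """Map normalized values back to the original ridership scale."""
    return [value * std + mean for value in values]


if __name__ == "__main__":
    train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    mean, std = fit_zscore(train)
    scaled = normalize(train, mean, std)
    print(mean, std)
    print(denormalize(scaled, mean, std))
```

The statistics are fitted on the training split only and reused for validation and testing, so that no information from held-out data leaks into the preprocessing.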
IV-A3 Evaluation Metrics
Following previous works [26, 30], we evaluate the performance of methods with Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE), which are defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} (\hat{X}_i - X_i)^2}, \quad \mathrm{MAE} = \frac{1}{M} \sum_{i=1}^{M} |\hat{X}_i - X_i|, \quad \mathrm{MAPE} = \frac{1}{M} \sum_{i=1}^{M} \frac{|\hat{X}_i - X_i|}{X_i}, \tag{13}$$
where $M$ is the number of testing samples. $\hat{X}_i$ and $X_i$ denote the predicted ridership and the ground-truth ridership, respectively. Notice that $\hat{X}_i$ and $X_i$ have been transformed back to the original scale with an inverted Z-score normalization. As mentioned in Section IV-A2, our PVCGN is developed to predict the metro ridership of the next four time intervals. In the following experiments, we measure the errors of each time interval separately.
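For reference, the three metrics can be computed with the short sketch below. It assumes predictions and ground truths are flat lists already mapped back to the original scale, reports MAPE as a percentage, and skips samples whose ground truth is zero when computing MAPE; the zero-handling rule is an assumption of this sketch, since the text does not say how such samples are treated.

```python
import math


def rmse(predicted, truth):
    squared = [(p - t) ** 2 for p, t in zip(predicted, truth)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted, truth):
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)


def mape(predicted, truth):
    """Mean absolute percentage error in percent, ignoring zero ground truths."""
    pairs = [(p, t) for p, t in zip(predicted, truth) if t != 0]
    return 100.0 * sum(abs(p - t) / abs(t) for p, t in pairs) / len(pairs)


if __name__ == "__main__":
    predicted = [110.0, 90.0, 50.0]
    truth = [100.0, 100.0, 50.0]
    print(rmse(predicted, truth), mae(predicted, truth), mape(predicted, truth))
```

RMSE penalizes large errors at busy stations more heavily, while MAPE weights every sample by its relative error, which is why the two metrics can rank methods differently.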
IV-B Comparison with State-of-the-Art Methods
In this section, we compare our PVCGN with eight basic and advanced methods under various scenarios (e.g., comparison on the whole testing sets, comparison on rush hours and comparison on high-ridership stations). These methods can be classified into three categories: (i) three traditional time series models, (ii) three general deep learning models and (iii) two recently-proposed graph networks. The details of these methods are described as follows:
Historical Average (HA): HA is a seasonal-based baseline that forecasts the future ridership by averaging the riderships of the corresponding historical periods. For instance, the ridership at interval 9:00-9:15 on a specific Monday is predicted as the average ridership of the corresponding time intervals of the previous $k$ Mondays. The variate $k$ is set to 4 on SHMetro and 2 on HZMetro.
Multiple Layer Perceptron (MLP): This model consists of two fully-connected layers with 256 and $2 \times N \times m$ neurons respectively, where $N$ is the number of stations. It takes as input the riderships of all stations of the previous $n$ time intervals and predicts the ridership of all stations of the next $m$ time intervals simultaneously.
Long Short-Term Memory (LSTM): This network is a simple seq2seq model and its core module consists of two fully-connected LSTM layers. The hidden size of each LSTM layer is set to 256.
Gated Recurrent Unit (GRU): With the similar architecture of the previous model, this network replaces the original LSTM layers with GRU layers. The hidden size of GRU is also set to 256.
Diffusion Convolutional Recurrent Neural Network (DCRNN): As a deep learning framework specially designed for traffic forecasting, DCRNN captures the spatial dependencies using bidirectional random walks on graphs and learns the temporal dependencies with an encoder-decoder architecture.
Graph Convolutional Recurrent Neural Network (GCRNN): The architecture of this model is very similar to that of DCRNN. The main difference is that GCRNN replaces the diffusion convolutional layers with third-order ChebNets based on spectral graph convolutions.
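Among the baselines listed above, Historical Average is simple enough to state in a few lines. The sketch assumes the caller has already gathered, oldest first, the ridership values observed at the same weekday and time slot in previous weeks; the function name and the argument `k` are illustrative choices.

```python
def historical_average(same_slot_history, k):
    """Average the ridership of the same weekday/time slot over the last k weeks."""
    recent = same_slot_history[-k:]
    return sum(recent) / len(recent)


if __name__ == "__main__":
    # Ridership at 9:00-9:15 on the five preceding Mondays, oldest first.
    mondays = [100.0, 120.0, 110.0, 130.0, 150.0]
    print(historical_average(mondays, k=4))
```

Because the forecast depends only on past weeks, it is identical for all four prediction horizons, which is consistent with HA serving as a horizon-independent reference point.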
IV-B1 Comparison on the Whole Testing Sets
We first compare the performance of all comparative methods on the whole testing sets (including all time intervals and all metro stations). Their performance on the SHMetro and HZMetro datasets is summarized in Table II and Table III, respectively. We can see that the baseline HA obtains unacceptable MAPE at all time intervals (about 31% on SHMetro and 20% on HZMetro). Compared with HA, RF and GBDT get better results at the first time interval. However, as the prediction horizon increases, their MAPEs gradually become worse and even larger than that of HA, since these two traditional models have weak abilities to learn the ridership distribution. By automatically learning deep features from data, the general neural networks (e.g., MLP, LSTM and GRU) can greatly improve the performance. For example, LSTM obtains a MAPE of 18.76% on SHMetro and 14.91% on HZMetro when predicting the ridership at the first time interval, while GRU obtains a MAPE of 21.03% on SHMetro and 17.20% on HZMetro for the prediction of the fourth time interval. Thanks to advanced graph learning, DCRNN and GCRNN achieve competitive performance by reducing the MAPE to 17.82% on SHMetro and to 14.00% on HZMetro. However, these methods directly construct graphs based on physical topologies. To fully capture the complex ridership patterns, our PVCGN constructs physical/virtual graphs with the information of physical topologies and human domain knowledge, thereby achieving state-of-the-art performance. For example, our PVCGN improves the MAPE by at least 1% at different time intervals on the SHMetro dataset. On HZMetro, PVCGN outperforms the existing best-performing models DCRNN and GCRNN by a large margin in all metrics. This comparison demonstrates the superiority of the proposed PVCGN.
IV-B2 Comparison on Rush Hours
In this section, we focus on the ridership prediction of rush hours, since accurate predictions are crucial for metro scheduling during such times. In this work, the rush time is defined as 7:30-9:30 and 17:30-19:30. The performance of all methods is summarized in Table IV and Table V. We can observe that our PVCGN outperforms all comparative methods consistently on both datasets. On SHMetro, our PVCGN obtains a MAPE of 13.16% for the ridership prediction at the first time interval, while the MAPEs of DCRNN and GCRNN are 13.93% and 14.07%, respectively. Other deep learning methods (such as MLP, LSTM and GRU) are relatively worse. For forecasting the ridership at the fourth time interval, our PVCGN achieves a MAPE of 15.08%, outperforming DCRNN and GCRNN with a relative improvement of at least 9.04%.
A similar situation holds on HZMetro. For example, obtaining a MAPE of 9.72% for the ridership at the first time interval, our PVCGN is clearly better than DCRNN and GCRNN, the MAPEs of which are 10.37% and 10.36%, respectively. When predicting the ridership at the fourth time interval, our PVCGN achieves a MAPE of 10.43%, while DCRNN and GCRNN suffer from serious performance degradation. For instance, their MAPEs increase to 11.94% and 11.93%, respectively. In summary, the extensive experiments on the SHMetro and HZMetro datasets show the effectiveness and robustness of our method during rush hours.
Iv-B3 Comparison on High-Ridership Stations
In addition to the prediction during rush hours, we also pay attention to the prediction for stations with high ridership, since the demand of such stations should be prioritized in real-world metro systems. In this section, we first rank all metro stations by their historical ridership in the training set and then conduct the comparison on the top 1/4 high-ridership stations. The performance on SHMetro is summarized in Table VI, and we can observe that our PVCGN ranks first among all comparative methods. When forecasting the ridership during the next 15 minutes, PVCGN achieves an RMSE of 74.80 and a MAPE of 10.62%. By contrast, the best RMSE and MAPE of the other methods are 80.72 and 12.23%. As the prediction time increases to 60 minutes, our PVCGN still obtains the best result (e.g., 13.61% in MAPE), while the MAPE of GCRNN increases significantly to 18.16%.
As shown in Table VII, our PVCGN also achieves impressive performance on the HZMetro dataset. For the ridership prediction at the first time interval, the RMSE and MAPE of our PVCGN are 60.56 and 9.97%, while the RMSE and MAPE of the best-performing existing method, GCRNN, are 65.29 and 10.59%. When forecasting the ridership at the fourth time interval, our PVCGN shows only minor performance degradation: its RMSE and MAPE increase to 69.25 and 12.54%, respectively. In the same situation, the RMSE and MAPE of GCRNN increase to 80.34 and 14.74%. Therefore, we can conclude that our PVCGN is not only effective but also robust for the prediction on high-ridership stations.
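The station selection used in this comparison can be sketched as follows. The text only states that stations are ranked by their historical training ridership and the top 1/4 are kept; ranking by the total count, the dictionary layout, and the rounding rule are assumptions made for illustration.

```python
def top_quarter_stations(train_ridership):
    """Return ids of the top 1/4 stations ranked by total training ridership.

    `train_ridership` maps a station id to a list of per-interval ridership
    counts from the training set (hypothetical layout).
    """
    totals = {sid: sum(series) for sid, series in train_ridership.items()}
    ranked = sorted(totals, key=lambda sid: totals[sid], reverse=True)
    k = max(1, len(ranked) // 4)  # assumption: round down, keep at least one
    return ranked[:k]
```

The evaluation metrics would then be computed only over the returned station ids.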
IV-C Component Analysis
IV-C1 Effectiveness of Different Graphs
The distinctive characteristic of our work is that we incorporate a physical graph and two virtual graphs into Gated Recurrent Units (GRU) to collaboratively capture the complex flow patterns. To verify the effectiveness of each graph, we implement five variants of PVCGN, which are described as follows:
P-Net: This variant only utilizes the physical graph to implement the ridership prediction network;
P+S-Net: This variant is developed with the physical graph and the virtual similarity graph;
P+C-Net: Similar to P+S-Net, this variant is built with the physical graph and the virtual correlation graph;
S+C-Net: Unlike the above variants, which contain the physical graph, this variant is constructed only with the virtual similarity and correlation graphs;
P+S+C-Net: This network is the full model of the proposed PVCGN. It contains the physical graph and the two virtual graphs simultaneously.
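The five variants listed above differ only in which graphs they use, which can be written down as a small configuration table. This is an illustrative sketch: the string labels and the `select_graphs` helper are hypothetical, and the adjacency objects are left opaque.

```python
# Hypothetical labels for the three graphs described in the text.
PHYSICAL, SIMILARITY, CORRELATION = "physical", "similarity", "correlation"

# Each variant is defined by the set of graphs it is built with.
VARIANTS = {
    "P-Net": (PHYSICAL,),
    "P+S-Net": (PHYSICAL, SIMILARITY),
    "P+C-Net": (PHYSICAL, CORRELATION),
    "S+C-Net": (SIMILARITY, CORRELATION),
    "P+S+C-Net": (PHYSICAL, SIMILARITY, CORRELATION),  # full PVCGN
}


def select_graphs(variant, graphs):
    """Pick the adjacency structures used by a variant.

    `graphs` maps a graph label to whatever adjacency object the model
    expects; the object type is not specified here.
    """
    missing = [name for name in VARIANTS[variant] if name not in graphs]
    if missing:
        raise KeyError(f"missing graphs: {missing}")
    return [graphs[name] for name in VARIANTS[variant]]
```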
The performance of all variants is summarized in Table VIII. To predict the ridership at the next time interval (15 minutes), the baseline P-Net obtains a MAPE of 19.04% on SHMetro and 14.84% on HZMetro, ranking last among all the variants. By aggregating the physical graph and any one of the proposed virtual graphs, the variants P+S-Net and P+C-Net achieve obvious performance improvements over all evaluation metrics. For instance, P+S-Net decreases the RMSE from 50.45 to 47.38 on SHMetro and from 41.80 to 38.89 on HZMetro, while P+C-Net reduces the RMSE to 46.18 and 39.46, respectively. Moreover, we observe that the variant S+C-Net also achieves very competitive performance, even though it does not contain the physical graph. On the SHMetro dataset, S+C-Net obtains an RMSE of 46.52, outperforming P-Net with a relative improvement of 7.8%. On the HZMetro dataset, S+C-Net also improves over P-Net by decreasing the RMSE to 39.92. These phenomena indicate that the proposed virtual graphs are reasonable. Finally, the variant P+S+C-Net obtains the best performance by incorporating the physical graph and all virtual graphs into the network. Specifically, P+S+C-Net gets the lowest RMSE (44.97 on SHMetro, 37.73 on HZMetro) and the lowest MAPE (16.83% on SHMetro, 13.72% on HZMetro). This significant improvement is mainly attributed to the enhanced spatial-temporal representation learned by the collaborative physical/virtual graph networks. These comparisons demonstrate the effectiveness of these tailor-designed graphs for single time interval prediction.
Moreover, we find that these collaborative graphs are also effective for ridership prediction over multiple consecutive time intervals. As shown in the bottom nine rows of Table VIII, all variants suffer from performance degradation to some extent as the number of time intervals increases from 2 to 4. For instance, when the baseline P-Net is applied to forecast the ridership at the fourth future time interval (60 minutes), its RMSE rapidly increases to 73.06 on SHMetro and 56.32 on HZMetro. By contrast, P+S-Net and P+C-Net achieve much lower RMSE (about 60 on SHMetro and 44 on HZMetro), since the proposed virtual graphs help these variants learn the complex flow patterns. Incorporating all physical/virtual graphs, P+S+C-Net further improves the performance with an RMSE of 55.27 on SHMetro and 42.51 on HZMetro, which shows that these graphs are complementary.
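The relative improvements quoted in this paragraph follow from a one-line formula, (baseline - improved) / baseline. The sketch below applies it to the first-interval RMSE values stated in the text; the dictionary and function names are hypothetical, and only numbers that appear in the paragraph are used.

```python
# RMSE at the first time interval (15 minutes), copied from the text.
RMSE_15MIN = {
    "P-Net": {"SHMetro": 50.45, "HZMetro": 41.80},
    "P+S-Net": {"SHMetro": 47.38, "HZMetro": 38.89},
    "P+C-Net": {"SHMetro": 46.18, "HZMetro": 39.46},
    "S+C-Net": {"SHMetro": 46.52, "HZMetro": 39.92},
    "P+S+C-Net": {"SHMetro": 44.97, "HZMetro": 37.73},
}


def relative_improvement(baseline, improved):
    """Fraction by which `improved` lowers an error metric relative to `baseline`."""
    return (baseline - improved) / baseline


def improvement_over_p_net(dataset):
    """Relative RMSE improvement (in %, one decimal) of each variant over P-Net."""
    base = RMSE_15MIN["P-Net"][dataset]
    return {
        name: round(100 * relative_improvement(base, scores[dataset]), 1)
        for name, scores in RMSE_15MIN.items()
        if name != "P-Net"
    }
```

For S+C-Net on SHMetro this gives (50.45 - 46.52) / 50.45, which rounds to the 7.8% reported in the text.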
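One way to read the multi-interval results is to compare how much the RMSE grows between the first (15-minute) and fourth (60-minute) intervals. The sketch below does this for the two variants whose values at both horizons are stated in the text; the growth measure itself is an illustrative choice and is not a metric reported by the authors.

```python
# (RMSE at 15 minutes, RMSE at 60 minutes), copied from the text.
RMSE_BY_HORIZON = {
    "P-Net": {"SHMetro": (50.45, 73.06), "HZMetro": (41.80, 56.32)},
    "P+S+C-Net": {"SHMetro": (44.97, 55.27), "HZMetro": (37.73, 42.51)},
}


def horizon_growth(rmse_first, rmse_last):
    """Relative RMSE increase from the first to the last forecast interval."""
    return (rmse_last - rmse_first) / rmse_first


def growth_table():
    """Relative RMSE growth (in %, one decimal) per variant and dataset."""
    return {
        name: {d: round(100 * horizon_growth(*pair), 1) for d, pair in per_ds.items()}
        for name, per_ds in RMSE_BY_HORIZON.items()
    }
```

On both datasets the full model's RMSE grows by a smaller fraction than that of P-Net, which is consistent with the robustness claim in the paragraph.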
[Table IX column headers: Local | Local + Global | Local | Local + Global; table values not recovered]
IV-C2 Influences of Local and Global Features
As described in Section III-C, a Graph Convolution Gated Recurrent Unit (GC-GRU) is developed for local feature learning, while a Fully-Connected Gated Recurrent Unit (FC-GRU) is applied to learn the global feature. In this section, we train two variants to explore the influence of each type of feature on metro ridership prediction. The first variant contains only GC-GRU, and the second variant consists of both GC-GRU and FC-GRU. The results of these variants are summarized in Table IX. We can observe that the performance of the first variant is very competitive. For example, when predicting the ridership of the next 15 minutes, the first variant obtains an RMSE of 45.64 on SHMetro and 38.46 on HZMetro. For the prediction of the fourth time interval, with an MAE of 26.50 on SHMetro and 25.36 on HZMetro, this variant is only slightly worse than the full model of PVCGN. This competitive performance is attributed to the fact that we can effectively learn the semantic local feature with the customized physical/virtual graphs. By fusing the local and global features of GC-GRU and FC-GRU, the second variant boosts the performance to a certain degree. For example, when predicting the ridership of the second time interval, the RMSE decreases from 48.79 to 47.83 on SHMetro. Through these experiments, we can conclude that the local feature plays a dominant role and the global feature provides ancillary information for ridership prediction.
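The difference between the two feature types can be illustrated with a toy example: a local view in which each station only aggregates over its graph neighbors, and a global view in which a fully-connected map lets every station see all stations. This sketch omits the GRU gating entirely and is not the GC-GRU or FC-GRU of Section III-C; the mean aggregation, the weight matrix, and the concatenation-style fusion are all assumptions.

```python
def local_feature(x, neighbors):
    """Local view: each station averages its own value with its graph neighbors."""
    out = []
    for i, xi in enumerate(x):
        group = [xi] + [x[j] for j in neighbors[i]]
        out.append(sum(group) / len(group))
    return out


def global_feature(x, weights):
    """Global view: a fully-connected map, so every output depends on all stations."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in weights]


def fuse(local, global_):
    """Assumed fusion: pair the two views per station (simple concatenation)."""
    return [(l, g) for l, g in zip(local, global_)]
```

Here `neighbors` maps a station index to the indices of its adjacent stations and `weights` is a dense station-by-station matrix; both are hypothetical inputs chosen to keep the contrast between the two views visible.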
In this work, we propose a unified Physical-Virtual Collaboration Graph Network (PVCGN) to address station-level metro ridership prediction. Unlike previous works that either ignored the topological information of a metro system or modeled only its physical topology, we model the studied metro system as a physical graph and two virtual similarity/correlation graphs to fully capture the ridership evolution patterns. Specifically, the physical graph is built on the basis of the realistic topology of the metro system. The similarity graph and correlation graph are constructed with virtual topologies under the guidance of the historical passenger flow similarity and correlation among different stations. We incorporate these graphs into a Graph Convolution Gated Recurrent Unit (GC-GRU) to learn spatial-temporal representation and apply a Fully-Connected Gated Recurrent Unit (FC-GRU) to capture the global evolution tendency. Finally, these GRUs are utilized to develop a seq2seq model for forecasting the ridership of each station. To verify the effectiveness of our method, we construct two real-world benchmarks with massive transaction records of the Shanghai metro and the Hangzhou metro, and extensive experiments on these benchmarks show the superiority of the proposed PVCGN.
In future work, several improvements should be considered. First, some external factors (such as weather and holiday events) may greatly affect the ridership evolution, and we should incorporate these factors to dynamically forecast the ridership. Second, the metro ridership evolves periodically. For instance, the ridership at 9:00 on every weekday is usually similar. Therefore, we should also utilize the periodicity of ridership to learn a more comprehensive representation. Last but not least, the origin-destination distribution (ODD) of ridership provides rich information, and we can model the ridership ODD at each time interval to facilitate the inflow/outflow ridership prediction.
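As a small illustration of the periodicity idea mentioned above, the following sketch computes the indices of earlier slots that share the same time of day as a target slot, one or more days and weeks back. It is a hypothetical helper for this future direction and not part of PVCGN; `slots_per_day` (the number of 15-minute intervals recorded per operating day) and the default lag counts are assumptions.

```python
def periodic_lag_indices(t, slots_per_day, num_days=3, num_weeks=1):
    """Indices of earlier slots at the same time of day as slot `t`.

    Returns (daily, weekly): slots 1..num_days days back and
    1..num_weeks weeks back, dropping indices before the start of history.
    """
    daily = [t - d * slots_per_day for d in range(1, num_days + 1)]
    weekly = [t - w * 7 * slots_per_day for w in range(1, num_weeks + 1)]
    # Keep only indices that exist in the history (non-negative).
    return [i for i in daily if i >= 0], [i for i in weekly if i >= 0]
```

The ridership values at the returned indices could then be supplied as additional inputs alongside the most recent intervals.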