As urban populations grow, metro lines have become increasingly crowded. To improve the service efficiency of metro systems and facilitate passenger travel, metro ridership prediction has become an active research topic in the Intelligent Transportation Systems (ITS) community. Recent works focus on designing complicated graph convolutional recurrent network architectures to capture spatial and temporal patterns. These works depend on dividing the ridership data into equal-sized intervals on the timeline and counting the ridership within each interval. Metro ridership, however, has distinct peak and valley periods. The dense ridership during peak periods needs to be divided more finely, whereas ridership in valley periods is sparse, so a small time interval leaves too little data. Fixed-interval preprocessing therefore destroys information about how ridership evolves over time, which may include information about latent variables. It also restricts the problem to interval-to-interval prediction, which is not conducive to real-time data collection and feedback to passengers.
A more practical approach is to construct a continuous-time model that defines a latent state at all times. Chen et al. (2018) proposed a new family of continuous-time models, Neural ODEs (NODE), which directly model the dynamics of network hidden states with an ODE solver. Rubanova et al. (2019) generalized state transitions in RNNs to continuous-time dynamics specified by Neural ODEs and proposed the latent ODE network defined in continuous time, which makes it more effective on irregularly sampled data. Zhou et al. (2021, ) applied NODE to the high-resolution reconstruction and prediction of urban flow grid maps. These works provide a new direction for metro passenger flow prediction.
Previous work on NODE mainly focused on Euclidean data such as sequences and matrices. We modify it to suit non-Euclidean data such as graph networks, and call the result Graph ODEs (GODE). To facilitate performance comparison between models, we use the three complementary graphs proposed by PVCGN (Liu et al., 2020) as spatial information to learn the relationships between metro stations. Inspired by latent ODEs, we use a GODE-RNN structure as the encoder to learn the initial state and the hidden states containing the characteristics of ridership. Subsequently, we modify the GODE network used for prediction so that the ridership information and its hidden vectors can be incorporated into the prediction process, which reduces the cumulative error over long time series. To verify the effectiveness of our STR-GODEs, we conduct experiments on two large-scale benchmarks (i.e., Shanghai Metro and Hangzhou Metro) against widely used baselines and state-of-the-art models. The evaluation results show that our approach outperforms existing state-of-the-art methods under various comparison settings.
In summary, our contributions are threefold:
- We extend the Neural ODE network to time series prediction on graph networks and, to the best of our knowledge, are the first to apply it to metro ridership prediction.
- By combining the ridership data and its hidden state, we improve the prediction model of the neural graph ODE network so that it reduces the cumulative error over long time series.
- Our STR-GODEs is more effective and robust, and is no longer limited to the equal time slices and equal-distance prediction of previous work, so it can be better applied in real life. As this is early work, we consider the direction promising.
2 Background and Related Works
Traffic States Prediction Traffic forecasting depends on the combination of spatial and temporal features and has been studied for decades. Traditional methods (Liu et al., 2019; Milenković et al., 2018; Williams and Hoel, 2003; Tan et al., 2009) consider only the temporal relation of a single station, lack spatial information, and can learn only simple time series models. In recent years, deep neural networks have become the mainstream method in this field. Early works (Yao et al., 2018, 2019; Liu et al., 2018) mainly divided the studied cities into regular grid maps, transformed the raw traffic data into tensors, and used CNNs to capture the spatial correlation among nearby regions. For traffic systems with irregular topology, this preprocessing introduces structural noise that degrades model performance. Graph convolution networks (Defferrard et al., 2016; Kipf and Welling, 2016) greatly alleviate the shortcomings of previous work on spatial information. Recent works (Zhao et al., 2019; Geng et al., 2019; Bai et al., 2020) focus on designing complicated graph convolutional recurrent network architectures to capture the spatial and temporal patterns. DCRNN (Li et al., 2017) reformulates the spatial dependency of traffic as a diffusion process and extends the previous GCN to directed graphs. Graph-WaveNet (Wu et al., 2019) captures the hidden spatial dependency in the data by constructing an adaptive dependency matrix and learning it through node embeddings. Liu et al. (2020) construct three complementary graphs to utilize the metro physical topology information. A Graph Convolution Gated Recurrent Unit (GC-GRU) is applied to learn the spatial-temporal representation and a Fully-Connected Gated Recurrent Unit (FC-GRU) is applied to capture the global evolution tendency. However, due to the limitations of the RNN structure, there are still some deficiencies in the temporal dimension.
Graph Neural Network Unlike traditional Euclidean data, graph data is difficult to embed into Euclidean space losslessly. Graph Convolution Networks (GCN) (Defferrard et al., 2016; Kipf and Welling, 2016) have been proposed to automatically learn feature representations on graphs and are widely used in node classification (Kipf and Welling, 2016), link prediction (Zhang and Chen, 2018), and graph classification (Ying et al., 2018); see Wu et al. (2020) for a survey. There are two mainstreams of graph convolution networks: the spectral-based approaches (Bruna et al., 2013; Defferrard et al., 2016; Kipf and Welling, 2016) and the spatial-based approaches (Atwood and Towsley, 2016; Gilmer et al., 2017; Hamilton et al., 2017). In these approaches, the adjacency matrix is considered prior knowledge and is fixed throughout training. Recently, researchers (Li et al., 2018; Wu et al., 2019; Bai et al., 2020) have paid more attention to operating on both spatial and temporal dimensions without a pre-defined graph structure.
Built upon previous works (Weinan, 2017; Lu et al., 2018), Chen et al. (2018) proposed a new family of continuous-time models named Neural ODEs, which can be interpreted as a continuous equivalent of residual networks (He et al., 2016). A basic formulation of Neural ODEs is:
$$\frac{dh(t)}{dt} = f_\theta(h(t), t)$$
where $f_\theta$ is parametrized as a neural network with parameters $\theta$. The hidden state h(t) is defined at all times and can be evaluated at any desired times using a numerical black-box ODE solver:
$$h_0, \dots, h_N = \mathrm{ODESolve}(f_\theta, h_0, (t_0, \dots, t_N))$$
By using the adjoint method (Chen et al., 2018), the gradients w.r.t. $\theta$ can be computed memory-efficiently during backpropagation. Recent works (Dupont et al., 2019; Massaroli et al., 2020; Davis et al., 2020; Yan et al., 2019; Ghosh et al., 2020) have tried to analyze this framework theoretically, overcome instability issues, and improve memory efficiency. Researchers have also applied NODE to other fields such as medical imaging (Pinckaers and Litjens, 2019; et al., 2020), video generation (Kanaa et al., 2019; Yildiz et al., 2019), and graph data (Xhonneux et al., 2020).
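To make the idea of evaluating a hidden state at arbitrary times concrete, the following is a minimal sketch of an ODE solver based on fixed-step Euler integration. The function names `odesolve_euler` and `decay` and the `substeps` parameter are illustrative assumptions, not part of any Neural ODE library; practical implementations typically use adaptive solvers.

```python
import math

def odesolve_euler(f, h0, times, substeps=100):
    """Integrate dh/dt = f(h, t) from times[0]; return h at every requested time.

    Fixed-step Euler is used only for clarity (an assumption of this sketch).
    """
    h, t = list(h0), times[0]
    out = [list(h)]
    for t_next in times[1:]:
        dt = (t_next - t) / substeps
        for _ in range(substeps):
            dh = f(h, t)
            h = [hi + dt * di for hi, di in zip(h, dh)]  # one Euler step
            t += dt
        out.append(list(h))
    return out

# Toy dynamics dh/dt = -h, whose exact solution is h(t) = h(0) * exp(-t).
def decay(h, t):
    return [-hi for hi in h]

# The query times need not be evenly spaced.
states = odesolve_euler(decay, [1.0], [0.0, 0.3, 1.0, 2.5])
```

Each Euler step has the same form as a residual-network update, which is one way to read the "continuous equivalent of residual networks" interpretation.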
3.1 Problem Definition
Previous work did not support predictions over continuous or unequally spaced time intervals. To facilitate performance comparison between models, we follow the traditional prediction framework. Let N be the number of metro stations. The data over a time period T consists, for each station i and each time interval t, of the passenger counts of inflow and outflow. Our target is to predict the future ridership sequence based on the observed historical values.
Suppose the latent state at time interval t includes the ridership feature, and a dynamics function describes how ridership changes with spatial, temporal, and ridership information. By the Euler method, the latent state at the next step equals the current latent state plus the step size times the dynamics function evaluated at the current state.
The original problem can thus be abstracted as follows: we learn the ridership trend function from the training data and the pre-defined graphs (the physical-virtual graphs of Liu et al. (2020)), obtain the initial state from the observation sequence, and predict the subsequent sequence.
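As an illustration of the traditional prediction framework, the sketch below splits a time-ordered ridership series into (history, future) pairs. The helper name `make_windows`, the parameters `n_in` and `n_out`, and the toy data layout (one `[inflow, outflow]` pair per station per interval) are assumptions made for this example.

```python
def make_windows(series, n_in, n_out):
    """Split a time-ordered list of snapshots into (history, future) pairs."""
    pairs = []
    for start in range(len(series) - n_in - n_out + 1):
        history = series[start:start + n_in]
        future = series[start + n_in:start + n_in + n_out]
        pairs.append((history, future))
    return pairs

# Toy data: 6 intervals, 2 stations, each station holding [inflow, outflow].
series = [[[t, t + 1], [2 * t, 2 * t + 1]] for t in range(6)]
pairs = make_windows(series, n_in=3, n_out=2)
```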
Chen et al. (2018) present a continuous-time, generative approach to modeling time series. Each trajectory is determined from a local initial state, $z_{t_0}$, and a global set of latent dynamics shared across all time series:
$$z_{t_0} \sim p(z_{t_0})$$
$$z_{t_1}, z_{t_2}, \dots, z_{t_N} = \mathrm{ODESolve}(z_{t_0}, f, \theta_f, t_0, \dots, t_N)$$
$$x_{t_i} \sim p(x \mid z_{t_i}, \theta_x)$$
We follow the framework of Chen et al. (2018) for both training and prediction. We modify the original ODE Block to make it suitable for graph network data and call it the GODE Block. As shown in Figure 1, inspired by Rubanova et al. (2019), we use the GODE-RNN structure to obtain the local initial state $z_{t_0}$ by learning the spatial, temporal, and ridership features of the observed sequence. In particular, because the GODE Block needs to gather information from adjacent metro stations, we use GODE-RNN to learn the approximate posterior of the local initial state $z_{t_0}$ directly instead of learning its mean and standard deviation.
Furthermore, by combining the ridership information and its hidden state, we further improve the performance of the prediction model so that it can reduce the cumulative error over a long time series.
3.3 Physical-Virtual Graphs
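In this section, we follow the pre-defined graphs proposed by PVCGN (Liu et al., 2020) to learn spatial information. In our work, the physical graph, similarity graph, and correlation graph are denoted as $G_p = (V, E_p, W_p)$, $G_s = (V, E_s, W_s)$, and $G_c = (V, E_c, W_c)$, where $V$ is a set of nodes representing real-world metro stations ($|V| = N$), $E_\alpha$ is a set of edges, and $W_\alpha \in \mathbb{R}^{N \times N}$ denotes the weights of all edges, for $\alpha \in \{p, s, c\}$.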
Physical Graph: $G_p$ is directly built based on the physical topology of the studied metro system. We first construct a physical connection matrix $P \in \mathbb{R}^{N \times N}$, where $P(i, j) = 1$ if there exists an edge between node i and j, or else $P(i, j) = 0$. Note that each diagonal value P(i, i) is directly set to 0. Finally, the edge weight $W_p(i, j)$ is obtained by performing a linear normalization on each row:
$$W_p(i, j) = \frac{P(i, j)}{\sum_{k=1}^{N} P(i, k)}$$
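The construction of the physical graph weights can be sketched as follows. The function name `physical_weights`, the undirected edge list, and the handling of isolated stations (an all-zero row) are assumptions of this example rather than details given in the text.

```python
def physical_weights(n, edges):
    """Build a 0/1 connection matrix from undirected edges, then row-normalize."""
    P = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        if i != j:  # diagonal entries stay 0
            P[i][j] = P[j][i] = 1.0
    W = []
    for row in P:
        total = sum(row)
        # Assumption: a station without neighbours keeps an all-zero row.
        W.append([v / total if total > 0 else 0.0 for v in row])
    return W

# Toy network: station 1 is connected to stations 0, 2 and 3.
W_p = physical_weights(4, [(0, 1), (1, 2), (1, 3)])
```

Every non-empty row of `W_p` sums to one, so a station with many neighbours spreads its weight evenly across them.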
Similarity Graph: The similarities of metro stations are used to guide the construction of $G_s$. We construct a similarity score matrix $S \in \mathbb{R}^{N \times N}$, where the score S(i, j) between station i and j is computed with Dynamic Time Warping (Berndt and Clifford, 1994): $S(i, j) = \exp(-\mathrm{DTW}(X^i, X^j))$, with $X^i$ the historical ridership series of station i. Based on the matrix S, we select station pairs to build edges, using a predefined similarity threshold of 0.1 for HZMetro and the ten stations with the highest similarity scores for SHMetro. Finally, we calculate the edge weights by conducting row normalization on S:
$$W_s(i, j) = \frac{S(i, j) \cdot L(E_s, i, j)}{\sum_{k=1}^{N} S(i, k) \cdot L(E_s, i, k)}$$
where $L(E_s, i, k) = 1$ if $E_s$ contains an edge connecting node i and k, or else $L(E_s, i, k) = 0$.
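A minimal sketch of the similarity graph follows, assuming a textbook Dynamic Time Warping distance with absolute-difference cost and a similarity score of exp(-DTW). The names `dtw` and `similarity_weights`, the exclusion of self-similarity on the diagonal, and the toy series are assumptions made for illustration; only the threshold variant of edge selection is shown.

```python
import math

def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance for 1-D series."""
    inf = float("inf")
    D = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[len(a)][len(b)]

def similarity_weights(series, threshold):
    """Score station pairs, keep those above the threshold, row-normalize."""
    n = len(series)
    # Assumption: a station is not considered similar to itself (zero diagonal).
    S = [[0.0 if i == j else math.exp(-dtw(series[i], series[j]))
          for j in range(n)] for i in range(n)]
    W = []
    for i in range(n):
        kept = [s if s >= threshold else 0.0 for s in S[i]]
        total = sum(kept)
        W.append([v / total if total > 0 else 0.0 for v in kept])
    return W

# Stations 0 and 1 share the same shape at different speeds; station 2 differs.
series = [[0.0, 1.0, 2.0], [0.0, 1.0, 1.0, 2.0], [5.0, 5.0, 5.0]]
W_s = similarity_weights(series, threshold=0.1)
```

Because warping absorbs the repeated value in the second series, stations 0 and 1 have a DTW distance of zero and become each other's only neighbour, while station 2 ends up with no edges.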
Correlation Graph: We utilize the origin-destination distribution of ridership to build the virtual graph $G_c$. First, we construct a correlation ratio matrix $C \in \mathbb{R}^{N \times N}$: $C(i, j) = D(i, j) / \sum_{k=1}^{N} D(i, k)$, where D(i, j) is the total number of passengers that traveled from station j to station i in the whole training set. Based on the matrix C, we select station pairs to build edges, using a predefined threshold of 0.02 for HZMetro and the ten stations with the highest scores for SHMetro. Then, the edge weights are calculated by:
$$W_c(i, j) = \frac{C(i, j) \cdot L(E_c, i, j)}{\sum_{k=1}^{N} C(i, k) \cdot L(E_c, i, k)}$$
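The correlation graph can be sketched in the same way, here with the top-k variant of edge selection. The function name `correlation_weights`, the parameter `top_k`, and the toy origin-destination counts are assumptions of this example.

```python
def correlation_weights(D, top_k):
    """D[i][j] counts passengers travelling from station j to station i."""
    n = len(D)
    C = []
    for row in D:
        total = sum(row)
        C.append([v / total if total > 0 else 0.0 for v in row])  # ratio matrix
    W = []
    for i in range(n):
        # Keep the top_k largest ratios of row i as edges.
        keep = sorted(range(n), key=lambda j: C[i][j], reverse=True)[:top_k]
        kept = [C[i][j] if j in keep and C[i][j] > 0 else 0.0 for j in range(n)]
        total = sum(kept)
        W.append([v / total if total > 0 else 0.0 for v in kept])
    return C, W

# Toy origin-destination counts for three stations.
D = [[0, 30, 10], [60, 0, 20], [5, 15, 0]]
C, W_c = correlation_weights(D, top_k=1)
```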
3.4 GODE Block
In general, a graph convolutional network aggregates the information of the neighbor nodes around a target node and can be expressed as (Defferrard et al., 2016; Kipf and Welling, 2016; Hamilton et al., 2017; Veličković et al., 2017):
$$h_i^{(k)} = \gamma\Big(h_i^{(k-1)}, \operatorname{AGG}_{j \in \mathcal{N}(i)} \phi\big(h_i^{(k-1)}, h_j^{(k-1)}, e_{ij}\big)\Big)$$
where $h_i^{(k)}$ refers to the hidden state of node i in the k-th layer, $e_{ij}$ refers to the feature vector of the edge from node i to node j, $\operatorname{AGG}$ is the aggregation method of neighbor information, $\phi$ is the transformation method of neighbor information, and $\gamma$ is a transformation that fuses a node's own feature with its aggregated neighbors' information.
These and related methods can be understood as special cases of a simple differentiable message-passing framework (Gilmer et al., 2017):
$$h_i^{(l+1)} = \sigma\Big(\sum_{m \in \mathcal{M}_i} g_m\big(h_i^{(l)}, h_j^{(l)}\big)\Big)$$
where $h_i^{(l)}$ is the hidden state of node $v_i$ in the l-th layer of the neural network. Incoming messages of the form $g_m(\cdot, \cdot)$ are accumulated and passed through an element-wise activation function $\sigma(\cdot)$. $\mathcal{M}_i$ denotes the set of incoming messages for node $v_i$ and is often chosen to be identical to the set of incoming edges. It can also be expressed in the following form:
According to Neural ODEs (Chen et al., 2018), the hidden state h(t) can be defined as the solution to an ODE initial-value problem:
$$\frac{dh(t)}{dt} = f_\theta(h(t), t), \quad h(t_0) = h_0$$
Inspired by R-GCN (Schlichtkrull et al., 2018), we define our GODE function with one set of learnable weights that captures how each station's latent state changes over time and another set of learnable weights that captures the association relationships between stations.
For any time, the latent representation can then be calculated by integrating this function from the initial state with the ODE solver.
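One possible shape of such a graph ODE function is sketched below, loosely following the R-GCN pattern of a self-connection term plus one term per relation (here, per graph). This is an illustrative assumption, not the exact GODE function of this work: the names `gode_function` and `gode_solve`, the tanh nonlinearity, the weight shapes, and the fixed-step Euler integration are all chosen for the example.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gode_function(H, graphs, W_self, W_rel):
    """Rate of change of station states H (N x d): self term plus one term per graph."""
    out = matmul(H, W_self)
    for A, W in zip(graphs, W_rel):
        msg = matmul(matmul(A, H), W)  # aggregate neighbours, then transform
        out = [[o + m for o, m in zip(ro, rm)] for ro, rm in zip(out, msg)]
    return [[math.tanh(v) for v in row] for row in out]

def gode_solve(H0, t0, t1, graphs, W_self, W_rel, steps=50):
    """Integrate the graph ODE from t0 to t1 with fixed-step Euler."""
    H, dt = H0, (t1 - t0) / steps
    for _ in range(steps):
        dH = gode_function(H, graphs, W_self, W_rel)
        H = [[h + dt * d for h, d in zip(rh, rd)] for rh, rd in zip(H, dH)]
    return H

A = [[0.0, 1.0], [1.0, 0.0]]             # toy 2-station adjacency
W_self = [[-0.5, 0.0], [0.0, -0.5]]      # hypothetical self-dynamics weights
W_rel = [[[0.1, 0.0], [0.0, 0.1]]]       # hypothetical per-graph weights
H1 = gode_solve([[1.0, 0.0], [0.0, 1.0]], 0.0, 1.0, [A], W_self, W_rel)
```

Because `t0` and `t1` are arbitrary real numbers, the same function yields latent states at unevenly spaced times without any change to the weights.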
3.5 STR-GODEs encoder
Rubanova et al. proposed using an ODE-RNN as the encoder of a latent ODE model to learn the mean and standard deviation of the approximate posterior of the local initial state.
As shown in Algorithm 1, we make several modifications to ODE-RNN. We replace ODESolve with GODESolve, which can be applied to graph data. We use GODE-RNN to learn the approximate posterior of the local initial state directly instead of learning its mean and standard deviation. The hidden state of the ridership information is learned and stored in hidden, which is combined with the observed ridership in the RNNCell. In our work, we use the classical GRU (Cho et al., 2014) structure as the RNNCell. Specifically, the RNNCell computes the reset gate, the update gate, and the new information as in the standard GRU.
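The interplay between the continuous evolution and the GRU update can be sketched as follows. This is a simplified, single-vector illustration of the ODE-RNN pattern, not a transcription of Algorithm 1: the names `gru_cell` and `ode_rnn_encode`, the omission of biases, the concatenated-input weight layout, and the identity `evolve` stand-in for GODESolve are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_cell(h, x, Wr, Wz, Wn):
    """Standard GRU update on the concatenated vector [h, x] (biases omitted)."""
    hx = h + x                                        # list concatenation
    r = [sigmoid(v) for v in affine(Wr, hx)]          # reset gate
    z = [sigmoid(v) for v in affine(Wz, hx)]          # update gate
    rhx = [ri * hi for ri, hi in zip(r, h)] + x
    n = [math.tanh(v) for v in affine(Wn, rhx)]       # new information
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def ode_rnn_encode(times, xs, h0, evolve, cell):
    """Evolve the state between observations, then update it at each observation."""
    h, t_prev = h0, times[0]
    for t, x in zip(times, xs):
        h = evolve(h, t_prev, t)   # stand-in for GODESolve
        h = cell(h, x)             # discrete update with the observation
        t_prev = t
    return h

# Toy run: hidden size 2, input size 1, all-zero weights, identity evolution.
zeros = [[0.0] * 3 for _ in range(2)]
h_final = ode_rnn_encode(
    [0.0, 0.4, 1.5], [[1.0], [2.0], [3.0]], [8.0, 4.0],
    evolve=lambda h, a, b: h,
    cell=lambda h, x: gru_cell(h, x, zeros, zeros, zeros),
)
```

With all-zero weights both gates equal 0.5 and the candidate state is zero, so each observation halves the hidden state; the toy run makes the control flow easy to check by hand.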
3.6 STR-GODEs decoder
Before prediction, we pass the latent state obtained from GODE-RNN through a transform block composed of a fully-connected layer, a tanh activation function, and a second fully-connected layer.
Previous work uses Neural ODEs as the decoder for prediction, and when the sequence is long, the error accumulates gradually. Similar to the GODE-RNN, we introduce the ridership information and its hidden state into GODESolver through the RNNCell structure, as shown in Algorithm 2. The experimental results show that this reduces the cumulative error over long time series. Specifically, we use the classical GRU (Cho et al., 2014) structure described in Section 3.5 as the RNNCell.
To evaluate the performance of our work, we conduct experiments on two real-world datasets: SHMetro and HZMetro (https://tianchi.aliyun.com/competition/entrance/231708/information).
SHMetro: The SHMetro dataset contains the ridership data of the metro system of Shanghai, China. There are 288 metro stations connected by 958 physical edges. A total of 811.8 million transaction records were collected from Jul. 1st 2016 to Sept. 30th 2016, with 8.82 million ridership per day. Following the setting of PVCGN (Liu et al., 2020), for each station, we measured its inflow and outflow every 15 minutes by counting the number of passengers entering or exiting the station. The ridership data of the first two months and of the last three weeks are used for training and testing, respectively, while the ridership data of the remaining days are used for validation.
HZMetro: The HZMetro dataset contains the ridership data of the metro system of Hangzhou, China. There are 80 metro stations connected by 248 physical edges, with 2.35 million ridership per day. The time interval of this dataset is also set to 15 minutes. Similar to SHMetro, this dataset is divided into three parts: a training set (Jan. 1st - Jan. 18th), a validation set (Jan. 19th - Jan. 20th), and a testing set (Jan. 21st - Jan. 25th).
4.2 Experimental Settings
We retrained PVCGN (Liu et al., 2020) using its official code; the results differ slightly from those reported in the original paper. To reduce differences caused by parameter choices and to facilitate model comparison, we keep the parameters in STR-GODEs the same as those in the official code of PVCGN.
We implement our STR-GODEs in Python with PyTorch 1.6.0 and run it on a server with one NVIDIA Titan X GPU card. The lengths of the input and output sequences are both set to 4. The input data and the ground truth of the output are normalized with Z-score normalization (https://en.wikipedia.org/wiki/Standard_score) before being fed into the network. The batch size is set to 8 and the feature dimensionality d is set to 256. The initial learning rate is 0.001 and its decay ratio is 0.1. We optimize the models with the Adam (Kingma and Ba, 2014) optimizer for a maximum of 200 epochs by minimizing the mean absolute error between the predicted results and the corresponding ground truths. The best parameters for all deep learning models are chosen through a careful parameter-tuning process on the validation set.
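For reference, Z-score normalization and its inverse can be sketched as below. The helper names are illustrative, and fitting the statistics on the training split only is an assumption of this example, not a detail stated above.

```python
import math

def fit_zscore(values):
    """Return the mean and (population) standard deviation of the values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def normalize(values, mean, std):
    return [(v - mean) / std for v in values]

def denormalize(values, mean, std):
    """Map network outputs back to passenger counts before computing metrics."""
    return [v * std + mean for v in values]

train = [10.0, 20.0, 30.0]
mean, std = fit_zscore(train)   # assumption: statistics come from training data
z = normalize(train, mean, std)
restored = denormalize(z, mean, std)
```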
4.3 Compared Methods
To evaluate the overall performance of our work, we compare STR-GODEs with widely used baselines and state-of-the-art models:
DCRNN: Huang et al. (2019) proposed a deep learning framework specially designed for traffic prediction, which uses bidirectional random walks on graphs to capture spatial dependencies and an encoder-decoder architecture to learn temporal correlations. We implement this method with its official code.
Graph-WaveNet: Wu et al. (2019) developed an adaptive dependency matrix to capture the hidden spatial dependencies and used a stacked dilated 1D convolution component to handle very long sequences. We implement this method with its official code.
ST-ODEs: Zhou et al. applied NODE to the prediction of urban flow grid maps. They concatenate the data over different time intervals, feed it into a CNN layer to get the initial state, and then use the Neural ODE network for prediction. In this experiment, we reproduced the code and extended the ODE network to a GODE network that can be used for graph data.
Latent ODEs: Rubanova et al. refine the latent ODE model of Chen et al. (2018) by using the ODE-RNN as a recognition network, where ODE-RNN generalizes RNNs with continuous-time hidden dynamics defined by ordinary differential equations. In this experiment, we modified its official code and extended the ODE network to a GODE network that can be used for graph data.
PVCGN: Liu et al. (2020) developed a Seq2Seq model with GC-GRU and FC-GRU to forecast the future metro ridership sequentially, where the Graph Convolution Gated Recurrent Unit (GC-GRU) is applied to learn the spatial-temporal representation and the Fully-Connected Gated Recurrent Unit (FC-GRU) is applied to capture the global evolution tendency. We implement this method with its official code.
4.4 Performance Comparison and Analysis
We measure the performance of predictive models with three widely used metrics - Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE).
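The three metrics can be computed as in the sketch below. Skipping zero ground-truth values in MAPE is an assumption of this example, made to avoid division by zero; how empty intervals are handled in the reported results is not stated here.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent, over non-zero ground truths."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs)

y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

RMSE penalizes large individual errors more strongly than MAE, while MAPE weights errors relative to the station's ridership level, which is why the three are usually reported together.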
Conventional prediction experiment: Table 1 and Table 2 present the performance of our STR-GODEs and representative comparison methods in conventional prediction experiments on the HZMetro and SHMetro datasets. Compared with latent ODEs, we observe that the GODE network combined with ridership data and hidden states does reduce the cumulative error. Compared with all other methods, STR-GODEs achieves some improvement in most metrics.
Furthermore, we extract the ridership predictions for peak periods (7:30-9:30 and 17:30-19:30) in the conventional experiments, as shown in Table 6 and Table 7. We observe that our STR-GODEs outperforms all comparison methods on most metrics on both datasets. On HZMetro, we achieve average relative improvements over the best competing method of 4.13% in MAE, 3.59% in RMSE, and 4.28% in MAPE. On SHMetro, the corresponding average relative improvements are 4.32% in MAE, 4.57% in RMSE, and 1.84% in MAPE.
Irregular prediction experiment: To compare the performance of different methods in continuous time or at unequal time intervals, we design a compromise experiment in which the traditional methods can also run.
In this experiment, the observed ridership sequence is a random subsequence of the historical intervals, and our goal is to predict the future ridership sequence.
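Drawing a random subsequence of observations can be sketched as follows. The function name `sample_irregular`, the fixed number of kept observations, and the seeding are assumptions of this example; the sampling procedure actually used in the experiments is not specified in the text.

```python
import random

def sample_irregular(history, keep, seed=0):
    """Keep a random, time-ordered subset of observations and their time indices."""
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(history)), keep))
    return idx, [history[i] for i in idx]

# Toy history of 8 intervals; keep 5 of them at irregular positions.
history = [10, 12, 15, 30, 42, 40, 25, 14]
idx, observed = sample_irregular(history, keep=5)
```

The retained indices serve as the observation times passed to a continuous-time model, whereas a fixed-interval model has to treat the gaps as missing data.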
Table 3 and Table 4 present the performance of our STR-GODEs and representative comparison methods in irregular prediction experiments on the HZMetro and SHMetro datasets. The results show that the neural GODE network is superior to the traditional methods in irregular prediction, and the comparison with other work confirms the effectiveness of the proposed STR-GODEs.
In this paper, we extended Neural ODE algorithms to graph networks and proposed the STR-GODEs network, which can effectively learn spatial, temporal, and ridership correlations without dividing data into equal-sized intervals on the timeline. Specifically, we use a GODE-RNN structure as the encoder to learn the initial state and the hidden states containing the characteristics of ridership. Subsequently, a GODE network that incorporates the ridership information and its hidden vectors is used to predict the future sequence. Experimental results on two public large-scale datasets demonstrate the superior performance of our algorithm.
In future work, we will make further use of the continuous-time characteristics of the model to meet the requirements of real-time prediction and apply the model in passenger-facing settings. We will also further optimize the time and space efficiency of the model so that it can run on mobile devices such as mobile phones.
- Diffusion-convolutional neural networks. In Advances in neural information processing systems, pp. 1993–2001. Cited by: §2.
- Adaptive graph convolutional recurrent network for traffic forecasting. arXiv preprint arXiv:2007.02842. Cited by: §2, §2.
- Using dynamic time warping to find patterns in time series.. In KDD workshop, Vol. 10, pp. 359–370. Cited by: §3.3.
- Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §2.
- Neural ordinary differential equations. arXiv preprint arXiv:1806.07366. Cited by: §1, §2, §3.2, §3.4, §4.3.
- Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §3.5, §3.6.
- Time dependence in non-autonomous neural odes. arXiv preprint arXiv:2005.01906. Cited by: §2.
- Convolutional neural networks on graphs with fast localized spectral filtering. arXiv preprint arXiv:1606.09375. Cited by: §2, §2, §3.4.
- Model-based reinforcement learning for semi-markov decision processes with neural odes. arXiv preprint arXiv:2006.16210. Cited by: §2.
- Augmented neural odes. arXiv preprint arXiv:1904.01681. Cited by: §2.
- Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33, pp. 3656–3663. Cited by: §2.
- STEER: simple temporal regularization for neural odes. arXiv preprint arXiv:2006.10711. Cited by: §2.
- Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272. Cited by: §2, §3.4.
- Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216. Cited by: §2, §3.4.
- Deep residual learning for image recognition. In , pp. 770–778. Cited by: §2.
- Diffusion convolutional recurrent neural network with rank influence learning for traffic forecasting. In 2019 18th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/13th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), pp. 678–685. Cited by: §4.3.
- Simple video generation using neural odes. In Annu. Conf. Neural Inf. Process. Syst, Cited by: §2.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
- Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2, §2, §3.4.
- Adaptive graph convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §2.
- Diffusion convolutional recurrent neural network: data-driven traffic forecasting. arXiv preprint arXiv:1707.01926. Cited by: §2.
- Physical-virtual collaboration modeling for intra-and inter-station metro ridership prediction. IEEE Transactions on Intelligent Transportation Systems. Cited by: §1, §2, §3.2, §3.3, §4.1, §4.2, §4.3.
- Attentive crowd flow machines. In Proceedings of the 26th ACM international conference on Multimedia, pp. 1553–1561. Cited by: §2.
- DeepPF: a deep learning based architecture for metro passenger flow prediction. Transportation Research Part C: Emerging Technologies 101, pp. 18–34. Cited by: §2.
- Beyond finite layer neural networks: bridging deep architectures and numerical differential equations. In International Conference on Machine Learning, pp. 3276–3285. Cited by: §2.
- Stable neural flows. arXiv preprint arXiv:2003.08063. Cited by: §2.
- SARIMA modelling approach for railway passenger flow forecasting. Transport 33 (5), pp. 1113–1120. Cited by: §2.
- Neural ordinary differential equations for semantic segmentation of individual colon glands. arXiv preprint arXiv:1910.10470. Cited by: §2.
- Latent odes for irregularly-sampled time series. arXiv preprint arXiv:1907.03907. Cited by: §1, §3.2.
- Modeling relational data with graph convolutional networks. In European semantic web conference, pp. 593–607. Cited by: §3.4.
- An aggregation approach to short-term traffic flow prediction. IEEE Transactions on Intelligent Transportation Systems 10 (1), pp. 60–69. Cited by: §2.
- Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §3.4.
- A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics 5 (1), pp. 1–11. Cited by: §2.
- Modeling and forecasting vehicular traffic flow as a seasonal arima process: theoretical basis and empirical results. Journal of transportation engineering 129 (6), pp. 664–672. Cited by: §2.
- A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems. Cited by: §2.
- Graph wavenet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121. Cited by: §2, §2, §4.3.
- Continuous graph neural networks. In International Conference on Machine Learning, pp. 10432–10441. Cited by: §2.
- On robustness of neural ordinary differential equations. arXiv preprint arXiv:1910.05513. Cited by: §2.
- Revisiting spatial-temporal similarity: a deep learning framework for traffic prediction. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33, pp. 5668–5675. Cited by: §2.
- Deep multi-view spatial-temporal network for taxi demand prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §2.
- ODE2VAE: deep generative second order odes with bayesian neural networks. Cited by: §2.
- Hierarchical graph representation learning with differentiable pooling. arXiv preprint arXiv:1806.08804. Cited by: §2.
- Link prediction based on graph neural networks. arXiv preprint arXiv:1802.09691. Cited by: §2.
- T-gcn: a temporal graph convolutional network for traffic prediction. IEEE Transactions on Intelligent Transportation Systems 21 (9), pp. 3848–3858. Cited by: §2.
- Urban flow prediction with spatial–temporal neural odes. Transportation Research Part C: Emerging Technologies 124, pp. 102912. Cited by: §1.
- Enhancing urban flow maps via neural odes. Cited by: §1.