Spatio-temporal data forecasting has received increasing attention from the deep learning community in recent years [36, 32, 53]. It plays a vital role in a wide range of applications, such as traffic speed prediction and air quality inference. In this paper, we study the problem of forecasting future traffic conditions given historical observations on a road network.
Recent studies formulate traffic forecasting as a spatio-temporal graph modeling problem [23, 46, 40, 13, 29, 1, 7]. The basic assumption is that the state of each node is conditioned on the information of its neighboring nodes. Based on this, they construct a spatial graph with a pre-defined or data-adaptive adjacency matrix. In such a graph, each node corresponds to a location of interest (e.g., a traffic sensor). A graph neural network is applied on that graph to model the correlations among spatially neighboring nodes at each time step. To leverage the information from temporally neighboring nodes, they further connect each node with itself between adjacent time steps, which results in a spatio-temporal graph, as shown in Figure 1. The 1D convolutional neural network [23] is commonly used to model the correlations at each node between different time steps. By combining the spatial and temporal features, they are able to update the state of each node.
However, those spatio-temporal graphs do not explicitly reflect the correlations between different nodes at different time steps (e.g., the red dashed lines in Figure 1). In such a graph, the information of spatial and temporal neighborhoods is captured through the spatial and temporal connections respectively, while the information of neighboring nodes across both spatial and temporal dimensions is not considered, which may restrict the learning ability of graph neural networks. For example, a traffic jam that occurs at an intersection may affect not only the nearby roads at the current time (spatial neighborhoods) and its own future traffic condition (temporal neighborhoods), but also the downstream roads in the next few hours (spatio-temporal neighborhoods). Thus, we argue that it is necessary to model the comprehensive correlations in the spatio-temporal data.
Another limitation of previous works is that they ignore the dynamic correlations among nodes at different time steps, as shown in Figure 1. The road network distances among sensors (nodes) are commonly used to define the spatial graph [23, 46]. This pre-defined graph is usually static. Some researchers [40, 1] propose to learn a data-adaptive adjacency matrix, which is also unchanged over time steps. However, traffic data exhibits strong dynamic correlations in the spatial and temporal dimensions, and those static graphs are unable to reflect the dynamic characteristics of correlations among nodes. For example, a residential region is highly correlated with an office area during workday morning rush hours, while the correlation is relatively weaker in the evening because some people might prefer dining out before going home. Thus, it is crucial to model the dynamic spatio-temporal correlations for traffic forecasting.
This paper addresses these limitations from the following perspectives. First, besides the spatial and temporal connections, we further add spatio-temporal connections between two time steps according to the spatio-temporal distances to define the spatio-temporal joint graph (STJG). In this way, the pre-defined STJG preserves comprehensive spatio-temporal correlations between any two time steps. Second, in order to adapt to the dynamic correlations among nodes, we propose an adaptive STJG, which is made time-variant by encoding the time features. The adjacency matrix in this adaptive STJG is dynamic, changing over time steps. By constructing both the pre-defined and adaptive STJGs, we are able to preserve comprehensive and dynamic spatio-temporal correlations.
On this basis, we then develop the spatio-temporal joint graph convolution (STJGC) operations on both pre-defined and adaptive STJGs to simultaneously capture the spatio-temporal dependencies in a unified operation. We further design the dilated causal STJGC layers to extract multiple spatio-temporal ranges of information. Next, a multi-range attention mechanism is proposed to aggregate the information of different ranges. Finally, we apply independent fully-connected layers to produce the multi-step ahead prediction results. The whole framework is named spatio-temporal joint graph convolutional networks (STJGCN), and it can be learned end-to-end. To evaluate the efficiency and effectiveness of STJGCN, we conduct extensive experiments on four public traffic datasets. The experimental results demonstrate that our STJGCN is computationally efficient and achieves the best performance against 11 state-of-the-art baseline methods. Our main contributions are summarized as follows.
We construct both pre-defined and adaptive spatio-temporal joint graphs (STJGs), which reflect comprehensive and dynamic spatio-temporal correlations.
We design dilated causal spatio-temporal joint graph convolution layers on both types of STJG to model multiple ranges of spatio-temporal correlations.
We propose a multi-range attention mechanism to aggregate the information of different ranges.
We evaluate our model on four public traffic datasets, and experimental results demonstrate that STJGCN is computationally efficient and outperforms 11 state-of-the-art baseline methods.
The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the preliminaries of this work. Section 4 details the method of STJGCN. Section 5 compares STJGCN with state-of-the-art methods on four datasets. Finally, Section 6 concludes this paper and discusses future work.
2 Related Work
2.1 Graph Convolutional Networks
Graph convolutional networks (GCNs) have been successfully applied to various tasks (e.g., node classification, link prediction) due to their superior ability to handle graph-structured data. There are mainly two types of GCN: spatial GCN and spectral GCN. The spatial GCN applies convolution filters on the neighborhoods of each node. One line of work proposes a heuristic linear method for neighborhood selection. GraphSAGE samples a fixed number of neighbors for each node and aggregates their features. GAT learns the weights among nodes via attention mechanisms. The spectral GCN defines the convolution in the spectral domain. ChebNet reduces the computational complexity with fast localized convolution filters. A later work further simplifies ChebNet and achieves state-of-the-art performance on various tasks. Recently, a range of studies apply the GCN to time-series data and construct spatio-temporal graphs for traffic forecasting [23, 50], human action recognition [41, 28], etc.
2.2 Spatio-Temporal Forecasting
Problem definition. Suppose there are $N$ sensors (nodes) on a road network, and each sensor records $C$ traffic measurements (e.g., volume, speed) at each time step. Thus, the traffic conditions at time step $t$ can be represented as $X_t \in \mathbb{R}^{N \times C}$. The traffic forecasting problem aims to learn a function $f$ that maps the traffic conditions of $P$ historical time steps to the next $Q$ time steps:
$$[X_{t-P+1}, \dots, X_t] \xrightarrow{f} [X_{t+1}, \dots, X_{t+Q}]$$
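As a minimal sketch of how the problem definition translates into training samples, the following splits a sequence of per-time-step observations into (history, future) pairs. The function name `make_windows` and the parameter names `p` and `q` are illustrative assumptions, not names from the source.

```python
def make_windows(series, p, q):
    """Split observations into (history, future) pairs.

    series: list where series[t] holds the traffic conditions at time step t
    p: number of historical time steps (assumed name)
    q: number of future time steps to predict (assumed name)
    """
    samples = []
    for start in range(len(series) - p - q + 1):
        history = series[start:start + p]          # p consecutive past steps
        future = series[start + p:start + p + q]   # the q steps that follow
        samples.append((history, future))
    return samples


# Toy data: 6 time steps, 2 sensors, 1 measurement per sensor.
toy = [[[float(t + n)] for n in range(2)] for t in range(6)]
pairs = make_windows(toy, p=3, q=2)
```

Each pair keeps the temporal order intact, so the history never overlaps the steps it is asked to predict.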
4.1 Framework Overview
Figure 2 depicts the framework of our proposed Spatio-Temporal Joint Graph Convolutional Networks (STJGCN), which includes three modules. First, since previous graph-based methods generally ignore the spatio-temporal connections and the dynamic correlations among nodes, we propose the spatio-temporal joint graph (STJG) construction module to construct both pre-defined and adaptive STJGs, which preserve comprehensive and dynamic spatio-temporal correlations. Second, as the standard graph convolution operation models spatial correlations only, we propose the spatio-temporal joint graph convolution (STJGC) operation on both types of STJG to model the comprehensive and dynamic spatio-temporal correlations in a unified operation. Based on the STJGC, we further propose the dilated causal STJGC module to capture spatio-temporal dependencies within multiple neighborhood and time ranges. Finally, in the prediction module, we propose a multi-range attention mechanism to aggregate the information of different ranges, and apply fully-connected layers to produce the prediction results. We detail each module in the following subsections.
4.2 STJG Construction Module
In this module, we first pre-define the spatio-temporal joint graph (STJG) according to the spatio-temporal distances among nodes. Since the pre-defined graph may not reflect the underlying correlations among nodes [40, 1], we further propose to learn an adaptive STJG. By constructing both types of STJG, we are able to represent comprehensive and dynamic spatio-temporal correlations among nodes.
4.2.1 Pre-defined Spatio-Temporal Joint Graph
In Equation 2, the edge weight between two nodes is determined by the road network distance from one node to the other and by the standard deviation of distances. Previous studies construct the spatial graph at each time step in this way, and then connect each node with itself between adjacent time steps to define the spatio-temporal graph. In such a graph, the connections between different nodes at different time steps are not incorporated, which may restrict its representation ability.
We propose to construct a spatio-temporal joint graph (STJG), which preserves comprehensive spatio-temporal correlations. The intuitive idea is to further connect different nodes between two time steps, as shown in Figure 1. Thus, we modify Equation 2 to be the STJG adjacency matrix, as:
where $k$ is the time difference between two time steps. The adjacency matrix defines the edge weight between node $v_i$ at time step $t-k$ and node $v_j$ at time step $t$, which decreases as the spatio-temporal distance increases. When $k = 0$, Equation 3 degenerates to Equation 2, which represents the spatial connections. If $i = j$, the STJG adjacency matrix defines the temporal connections at each node between two time steps. Otherwise, it represents the spatio-temporal connections between different nodes at different time steps. Thus, we are able to define a comprehensive spatio-temporal graph according to Equation 3. Note that the STJG could be constructed between any two time steps, which makes it flexible to reveal multiple time-ranges of spatio-temporal correlations.
We filter out the values smaller than a threshold in the STJG adjacency matrix to eliminate weak connections and control the sparsity. As this adjacency matrix depends only on the time difference $k$ and not on a specific time step, we index it by $k$ alone in the following discussions.
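The construction above can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian kernel form follows common practice for distance-based graphs, and the way the time difference enters the distance (a fixed `time_cost` added per time step) is a hypothetical choice, not the formula from the source.

```python
import math


def predefined_stjg(dist, sigma, k, time_cost, threshold):
    """Edge weights between nodes at time step t-k (rows) and t (columns).

    dist[i][j]: road network distance from node i to node j
    sigma: standard deviation of the distances
    k: time difference between the two time steps
    time_cost: ASSUMED distance added per time step of separation
    threshold: weights below this value are set to zero
    """
    n = len(dist)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            st_dist = dist[i][j] + k * time_cost  # assumed spatio-temporal distance
            weight = math.exp(-(st_dist ** 2) / (sigma ** 2))
            adj[i][j] = weight if weight >= threshold else 0.0
    return adj


dist = [[0.0, 1.0, 5.0], [1.0, 0.0, 2.0], [5.0, 2.0, 0.0]]
a0 = predefined_stjg(dist, sigma=2.0, k=0, time_cost=1.0, threshold=0.05)
a1 = predefined_stjg(dist, sigma=2.0, k=1, time_cost=1.0, threshold=0.05)
```

With `k=0` the matrix reduces to a purely spatial graph; with `k=1` the diagonal holds temporal connections and the off-diagonal entries hold spatio-temporal ones, all weaker than their `k=0` counterparts.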
4.2.2 Adaptive Spatio-Temporal Joint Graph
Previous studies [40, 1] demonstrate that the pre-defined adjacency matrix may not reflect the underlying correlations among nodes, and propose adaptive ones. However, they only define the spatial graph, and it is unchanged over time steps. We propose to learn adaptive STJG adjacency matrices that could represent comprehensive and dynamic spatio-temporal correlations based on the latent space modeling algorithm.
Latent space modeling
Given a graph, we assume each node resides in a latent space with various attributes. The attributes of nodes and how these attributes interact with each other jointly determine the underlying relations among nodes. The nodes which are close to each other in the latent space are more likely to form a link. Mathematically, we aim to learn two matrices: an attribute matrix that holds the latent attributes of the nodes, and an interaction matrix that represents the attribute interaction patterns, which could be asymmetric for a directed graph or symmetric for an undirected graph. The product of the attribute matrix, the interaction matrix, and the transposed attribute matrix could represent the connections among nodes.
We propose a spatio-temporal embedding to form the latent node attributes. We first randomly initialize a spatial embedding for each of the nodes, and then transform it to the embedding dimension via fully-connected layers. To obtain time-varying node attributes, we further encode the time information as the temporal embedding. At each time step, we consider two time features, i.e., time-of-day and day-of-week, which are encoded by one-hot coding and then projected to the same dimension using fully-connected layers. We then add the spatial and temporal embeddings together to generate the spatio-temporal embedding at each time step, which can be updated during the training stage. The spatio-temporal embedding encodes both the node-specific and time-related information, and it has the potential to take periodic patterns into account through the time features.
Adaptive STJG adjacency matrix
Based on the spatio-temporal embedding, we define the STJG adjacency matrix at each time step according to the latent space modeling algorithm, as:
where the spatio-temporal embedding of the nodes at the given time step serves as the latent node attributes, a thresholding function eliminates the weights smaller than a threshold, and the softmax function is applied for normalization. The resulting adjacency matrix defines the spatial connections among nodes at that time step and is dynamic, changing over time steps. In order to construct the connections between different time steps, we modify Equation 4 as:
where the result is the normalized STJG adjacency matrix between two time steps. When the two time steps coincide, Equation 6 degenerates to Equation 4, which describes the spatial graph at a single time step. Thus, Equation 6 is able to define the spatio-temporal joint graph between two time steps with comprehensive and dynamic spatio-temporal connections.
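A rough sketch of the adaptive construction is given below. It assumes the latent space product of node embeddings and an interaction matrix, and it treats thresholding as masking before a row-wise softmax; both the masking behavior and all names (`st_embedding`, `adaptive_stjg`, `interaction`) are assumptions made for illustration.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def st_embedding(spatial, temporal):
    """Add one temporal embedding vector to every node's spatial embedding."""
    return [[s + t for s, t in zip(node, temporal)] for node in spatial]


def adaptive_stjg(u_src, u_dst, interaction, threshold):
    """Row-normalized weights between nodes at two time steps.

    Scores below `threshold` are masked out (ASSUMED behavior) before softmax.
    """
    scores = matmul(matmul(u_src, interaction), transpose(u_dst))
    out = []
    for row in scores:
        kept = [s for s in row if s >= threshold]
        if not kept:  # every connection of this node was eliminated
            out.append([0.0] * len(row))
            continue
        peak = max(kept)
        exps = [math.exp(s - peak) if s >= threshold else 0.0 for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


spatial = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
u_t1 = st_embedding(spatial, [0.5, 0.0])  # embedding at one time step
u_t2 = st_embedding(spatial, [0.0, 0.5])  # embedding at another time step
adj_t1 = adaptive_stjg(u_t1, u_t1, identity, threshold=0.5)
adj_t2 = adaptive_stjg(u_t2, u_t2, identity, threshold=0.5)
adj_cross = adaptive_stjg(u_t1, u_t2, identity, threshold=0.5)
```

Because the temporal embedding differs between the two time steps, `adj_t1` and `adj_t2` differ even though the spatial embeddings are shared, which is the dynamic behavior the text describes.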
4.3 Dilated Causal STJGC Module
The standard graph convolution operates on spatial graphs and models spatial correlations only. We thus propose the spatio-temporal joint graph convolution (STJGC) on both types of STJG to model spatio-temporal correlations in a unified operation. We further design dilated causal STJGC layers to capture multiple ranges of spatio-temporal dependencies, as shown in Figure 2. In the following discussion, we first describe the STJGC operation in section 4.3.1, and then introduce the dilated causal STJGC layers in section 4.3.2.
4.3.1 Spatio-Temporal Joint Graph Convolution (STJGC)
Graph convolution is an effective operation for learning node information from spatial neighborhoods according to the graph structure, but the standard graph convolution operates on the spatial graph and models the spatial correlations only. In order to model the comprehensive and dynamic spatio-temporal correlations on the STJG, we propose the spatio-temporal joint graph convolution (STJGC) operations on both types of STJG.
STJGC on pre-defined STJG
Consider the STJG between time steps $t-k$ and $t$. The information of each node at time step $t$ comes from its spatial, temporal, and spatio-temporal neighborhoods:
where the aggregation is weighted by the normalized pre-defined STJG adjacency matrix between time steps $t-k$ and $t$ (see Equation 3). In Equation 8, one term aggregates neighborhood information (both temporal and spatio-temporal) from time step $t-k$, and the other term aggregates the information from spatial neighborhoods at time step $t$. Thus, by performing Equation 8, we are able to model comprehensive spatio-temporal correlations between two time steps.
Furthermore, at time step $t$, we propose to incorporate the information of $K$ time steps, where $K$ denotes the kernel size, to update the node features. Specifically, we modify Equation 8 as:
In the case of a directed graph, we consider two directions of information propagation (i.e., forward and backward), corresponding to two normalized adjacency matrices, one normalized by the out-degree matrix and the other by the in-degree matrix. Thus, we transform Equation 9 to:
where the inputs are the graph signals at the time steps covered by the kernel, the output is the updated feature at time step $t$, and the weight matrices and bias are learnable parameters.
By this design, our STJGC simultaneously models the information propagation from three kinds of connections (i.e., spatial, temporal, and spatio-temporal) in a unified operation.
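To make the unified operation concrete, the sketch below sums, over the time steps covered by the kernel, the product of an adjacency matrix, the graph signal at that time step, and a weight matrix. It is a simplified single-direction version; the exact arrangement of terms and the name `stjgc` are assumptions.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def stjgc(signals, adjs, weights, bias):
    """Simplified spatio-temporal joint graph convolution (assumed form).

    signals[k]: N x C features at time step t-k
    adjs[k]:    N x N normalized STJG adjacency for time difference k
    weights[k]: C x D learnable matrix for time difference k
    bias:       length-D learnable vector
    """
    n = len(signals[0])
    out = [list(bias) for _ in range(n)]
    for x, a, w in zip(signals, adjs, weights):
        term = matmul(matmul(a, x), w)  # aggregate neighbors, then transform
        for i in range(n):
            for c in range(len(bias)):
                out[i][c] += term[i][c]
    return out


x_t = [[1.0], [2.0]]     # features at time step t
x_prev = [[3.0], [4.0]]  # features at time step t-1
a_spatial = [[1.0, 0.0], [0.0, 1.0]]  # k = 0: each node keeps its own feature
a_cross = [[0.0, 1.0], [1.0, 0.0]]    # k = 1: each node reads the other node
z_t = stjgc([x_t, x_prev], [a_spatial, a_cross], [[[1.0]], [[0.5]]], [0.0])
```

In this toy case each node combines its own current feature with half of the other node's previous feature, so one call mixes spatial and spatio-temporal information.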
STJGC on adaptive STJG
As the pre-defined STJG may not reflect the underlying correlations among nodes, we further propose STJGC on adaptive STJG. The computation is similar to that on the pre-defined STJG:
where the normalized adaptive STJG adjacency matrix between the two time steps (defined in Equation 6) replaces the pre-defined one. Inspired by the bi-directional RNN, we consider both time directions of the information flow. Specifically, we compute two adaptive STJG adjacency matrices, one for each time direction, and modify Equation 11 accordingly, as:
where the output is the updated feature at time step $t$, which encodes the comprehensive and dynamic spatio-temporal correlations, and the weight matrices and bias are learnable parameters.
The pre-defined and adaptive STJGs represent the spatio-temporal correlations from distinct perspectives. To enhance the representation ability, we use a gating mechanism to fuse the features extracted on two types of STJG. Specifically, we define a gate to control the importance of two features as:
where the two features are combined by the concatenation operation, the sigmoid function constrains the output to the $(0, 1)$ range, and the weight matrix and bias are learnable parameters. The gate controls the information flow between pre-defined and adaptive STJGs both node-wise and channel-wise. Based on the gate, we fuse two features as:
where $\odot$ denotes the element-wise product. As a result, the fused feature represents the updated representation of nodes at time step $t$, which aggregates the information from their spatial, temporal, and spatio-temporal neighborhoods on both types of STJG.
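The gating fusion can be sketched as below, assuming the gate is a sigmoid of a linear map over the concatenated features and that the fusion is a convex combination weighted by the gate. The convex-combination form and the names `gated_fusion`, `w_gate`, and `b_gate` are assumptions.

```python
import math


def gated_fusion(z_pdf, z_adt, w_gate, b_gate):
    """Fuse features from the pre-defined and adaptive STJGs (assumed form).

    z_pdf, z_adt: N x D features from the two graph types
    w_gate: 2D x D weight matrix, b_gate: length-D bias
    """
    fused = []
    for zp, za in zip(z_pdf, z_adt):
        concat = zp + za  # concatenation along the channel dimension
        gate = []
        for c in range(len(zp)):
            pre = sum(concat[i] * w_gate[i][c] for i in range(len(concat))) + b_gate[c]
            gate.append(1.0 / (1.0 + math.exp(-pre)))  # sigmoid, value in (0, 1)
        # One gate value per node and per channel.
        fused.append([g * p + (1.0 - g) * a for g, p, a in zip(gate, zp, za)])
    return fused


zeros = [[0.0, 0.0] for _ in range(4)]
balanced = gated_fusion([[2.0, 4.0]], [[0.0, 0.0]], zeros, [0.0, 0.0])
```

With zero weights and bias the gate equals 0.5 everywhere, so the result is the plain average of the two features; a learned gate shifts that balance separately for each node and channel.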
4.3.2 Dilated Causal STJGC Layers
The STJGC operation is able to model the correlations in different time ranges by controlling the time difference. In addition, different STJGC layers aggregate information within diverse neighborhood ranges. This makes it flexible to model the spatio-temporal correlations in multiple neighborhood and time ranges. The information in different ranges reveals distinct traffic properties. A small range uncovers the local dependency and a large range indicates the global dependency. Inspired by the dilated causal convolution [33, 2], which is able to capture diverse time-ranges of dependencies in different layers, we propose dilated causal STJGC layers to capture multiple ranges of spatio-temporal dependencies.
Dilated causal convolution
The dilated causal convolution operation slides over the input sequence by skipping elements with a certain time step (i.e., the dilation factor), and it involves only historical information at each time step to satisfy the causal constraint. In this way, it models diverse time-ranges of dependencies in different layers.
Dilated causal STJGC
As illustrated in Figure 3, we first transform the inputs into the hidden dimension using fully-connected layers. Then we stack several STJGC layers upon it in the dilated causal way. Different from the standard dilated causal convolution using 1D CNN, we use the STJGC in each layer to model the dynamic and comprehensive spatio-temporal correlations. For a given length of input graph signals, we could stack four STJGC layers, each with its own kernel size and dilation factor. Residual connections are also applied in each STJGC layer at the corresponding output time steps. The number of STJGC layers, dilation factors, and kernel sizes could be re-designed according to the length of input graph signals, in order to ensure that the output of the last STJGC layer covers the information from all input time steps.
In these dilated causal STJGC layers, each STJGC layer captures different ranges of spatio-temporal dependencies. For example, as shown in Figure 3, in the first STJGC layer, the hidden state at each output time step aggregates information from 1-hop neighborhoods at two input time steps. As the layers go deeper, the model could extract features from higher-order neighborhoods over longer time ranges. In particular, in the last STJGC layer, each node at the output time step captures the information within 4-hop neighborhoods from all input time steps.
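The requirement that the last layer covers all input time steps can be checked with a small calculation. The configuration used below (kernel size 2 in every layer, dilation factors 1, 2, 4, 4, input length 12) is an illustrative assumption, not a setting confirmed by the source.

```python
def receptive_field(kernel_sizes, dilations):
    """Number of input time steps seen by one output of the last layer."""
    span = 1
    for kernel, dilation in zip(kernel_sizes, dilations):
        span += (kernel - 1) * dilation
    return span


def output_lengths(input_len, kernel_sizes, dilations):
    """Sequence length after each causal layer when no padding is used."""
    lengths = []
    current = input_len
    for kernel, dilation in zip(kernel_sizes, dilations):
        current -= (kernel - 1) * dilation
        lengths.append(current)
    return lengths


# Hypothetical four-layer stack over 12 input time steps.
kernels = [2, 2, 2, 2]
dilations = [1, 2, 4, 4]
span = receptive_field(kernels, dilations)
lengths = output_lengths(12, kernels, dilations)
```

Under these assumed values the receptive field equals the input length and the sequence shrinks to a single output time step, which matches the design goal stated above.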
4.4 Prediction Module
In this module, we first propose a multi-range attention mechanism to aggregate the information of different ranges extracted by the dilated causal STJGC layers, and then apply independent fully-connected layers to produce the multi-step ahead prediction results.
4.4.1 Multi-Range Attention
As introduced in section 4.3.2, each STJGC layer captures different spatio-temporal ranges of dependencies. A small range uncovers the local dependency and a large range indicates the global dependency, e.g., the correlations between distant nodes at distant time steps. Thus, it is essential to combine the multi-range information. In addition, the importance of different ranges could be diverse. We propose a multi-range attention mechanism to aggregate the information of different ranges. Mathematically, for the hidden state of each node at a given time step in each STJGC layer, the attention score is computed as:
where the projection weights and bias are learnable parameters, $L$ is the number of STJGC layers, and the attention score indicates the importance of the corresponding hidden state. Based on the attention scores, the multi-range information can be aggregated as:
where the result is the updated feature of the node, which aggregates the information from multiple spatio-temporal ranges. The attention mechanism is conducted on all of the nodes in parallel with shared learnable parameters, and produces the aggregated features of all nodes as output.
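A minimal sketch of attention over layers for a single node follows. The scoring function (a tanh projection followed by a dot product with a vector `v`) is an assumed form chosen for illustration; only the softmax over layers and the weighted sum follow directly from the description.

```python
import math


def multi_range_attention(hidden, w, b, v):
    """Aggregate one node's hidden states across STJGC layers.

    hidden: list over layers of length-D vectors
    w (D x H), b (length H), v (length H): parameters of the ASSUMED scorer
    Returns (attention weights over layers, aggregated length-D vector).
    """
    scores = []
    for h in hidden:
        proj = [math.tanh(sum(h[i] * w[i][j] for i in range(len(h))) + b[j])
                for j in range(len(b))]
        scores.append(sum(p * vj for p, vj in zip(proj, v)))
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # softmax over layers
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden[0])
    out = [sum(a * h[c] for a, h in zip(alphas, hidden)) for c in range(dim)]
    return alphas, out


identity = [[1.0, 0.0], [0.0, 1.0]]
alphas, fused = multi_range_attention(
    [[1.0, 0.0], [3.0, 2.0]], identity, [0.0, 0.0], [0.0, 0.0])
```

With a zero scoring vector every layer receives the same weight, so the output is the mean of the layer states; learned parameters would let some ranges dominate.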
4.4.2 Independent Fully-Connected Layers
As the traffic at different time steps may exhibit different properties, it would be better to use different networks to generate the predictions at different forecasting horizons. We thus apply independent two-layer fully-connected networks upon the aggregated features to produce the multi-step-ahead prediction results:
where the output denotes the prediction result at each future time step, the weights and biases of the corresponding network are learnable parameters, and the hidden layer uses an activation function.
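The idea of one small network per forecasting horizon can be sketched as follows. The helper names and the use of ReLU in the hidden layer are assumptions for this illustration.

```python
def dense(x, w, b):
    """Affine layer: x (length I), w (I x O), b (length O)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def relu(values):
    return [max(0.0, v) for v in values]


def predict_horizons(feature, heads):
    """Apply an independent two-layer network for each forecasting horizon.

    feature: aggregated length-D feature of one node
    heads: one (w1, b1, w2, b2) tuple per horizon (hypothetical layout)
    """
    predictions = []
    for w1, b1, w2, b2 in heads:
        hidden = relu(dense(feature, w1, b1))
        predictions.append(dense(hidden, w2, b2))
    return predictions


head_a = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [[1.0], [1.0]], [0.0])
head_b = ([[-1.0, 0.0], [0.0, -1.0]], [0.0, 0.0], [[1.0], [1.0]], [0.5])
preds = predict_horizons([1.0, -2.0], [head_a, head_b])
```

The two heads see the same input feature but hold separate parameters, so each horizon can respond to it differently.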
4.4.3 Loss Function
In practice, the mean absolute error (MAE) loss optimizes all prediction values equally regardless of their magnitude, which leads to relatively poor predictions for small values compared to the predictions of large values. The mean absolute percentage error (MAPE) loss is more sensitive to the predictions of small values. Thus, we propose to combine the MAE loss and MAPE loss as our loss function:
where a balancing hyperparameter is used to trade off the MAE loss and MAPE loss, and the loss is minimized over all learnable parameters in STJGCN.
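One plausible way to combine the two losses is a weighted sum, sketched below. The weighted-sum form, the name `beta`, and the small `eps` guard against division by zero are all assumptions; the source does not show the exact formula.

```python
def combined_loss(pred, truth, beta, eps=1e-8):
    """MAE plus beta times MAPE over paired predictions (assumed form).

    pred, truth: flat lists of predicted and observed values
    beta: ASSUMED balancing weight between the two terms
    eps: guard against division by zero for zero-valued ground truth
    """
    count = len(pred)
    mae = sum(abs(p - y) for p, y in zip(pred, truth)) / count
    mape = sum(abs(p - y) / max(abs(y), eps) for p, y in zip(pred, truth)) / count
    return mae + beta * mape


loss = combined_loss([11.0, 90.0], [10.0, 100.0], beta=10.0)
```

In the toy call both predictions are off by 10 percent, yet the absolute errors are 1 and 10; the MAPE term gives the small-valued sample the same relative weight as the large one.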
4.5 Complexity Analysis
|Module||STJG construction module||Dilated causal STJGC module||Prediction module|
We further analyze the time complexity of the main components in each module in our STJGCN, which is summarized in Table I.
In the STJG construction module, the computation mainly comes from the learning of the adaptive STJG adjacency matrix (Equation 6). The time complexity depends on the number of nodes and the dimension of the spatio-temporal embedding. Regarding the embedding dimension as a constant, the time complexity becomes quadratic in the number of nodes, which is attributed to the pairwise computation of the nodes' embeddings.
In the dilated causal STJGC module, the time complexity mainly depends on the computation of each STJGC operation (Equations 10 and 12), which is governed by the kernel size, the number of edges in the graph, and the dimension of hidden states. The time complexity of STJGC mainly depends on the number of edges, as each node aggregates information from its neighborhoods, whose total number equals the number of edges.
In the prediction module, the time complexity comes from the multi-range attention mechanism (Equations 15, 16, and 17) and the independent fully-connected layers (Equation 18). The total time complexity of the prediction module therefore depends on the number of STJGC layers and the number of time steps to be predicted. It is highly related to the latter, as we use independent fully-connected layers to produce the multi-step prediction results.
|Dataset||Time range||Time interval||# Nodes|
|PeMSD3||1/Sep/2018 - 30/Nov/2018||5-minute||358|
|PeMSD4||1/Jan/2018 - 28/Feb/2018||5-minute||307|
|PeMSD7||1/May/2017 - 31/Aug/2017||5-minute||883|
|PeMSD8||1/Jul/2016 - 31/Aug/2016||5-minute||170|
We evaluate our STJGCN on four highway traffic datasets: PeMSD3, PeMSD4, PeMSD7, and PeMSD8, which are released in [13, 29]. These datasets are collected by the Caltrans Performance Measurement System (PeMS) from 4 districts in real time every 30 seconds. The raw traffic data is aggregated into 5-minute intervals. There are three kinds of traffic measurements in the PeMSD4 and PeMSD8 datasets, including total flow, average speed, and average occupancy. In the PeMSD3 and PeMSD7 datasets, only the traffic flow is recorded. Following previous studies [1, 7], we predict the traffic flow in all datasets. The summary statistics of the four datasets are presented in Table II.
5.2 Experimental Setup
5.2.1 Evaluation Metrics
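We adopt three widely used metrics for evaluation, i.e., mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE), which are defined as:
$$\mathrm{MAE} = \frac{1}{NQ} \sum_{t=1}^{Q} \sum_{i=1}^{N} \left| \hat{y}_{i,t} - y_{i,t} \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{NQ} \sum_{t=1}^{Q} \sum_{i=1}^{N} \left( \hat{y}_{i,t} - y_{i,t} \right)^2}$$
$$\mathrm{MAPE} = \frac{100\%}{NQ} \sum_{t=1}^{Q} \sum_{i=1}^{N} \left| \frac{\hat{y}_{i,t} - y_{i,t}}{y_{i,t}} \right|$$
where $\hat{y}_{i,t}$ and $y_{i,t}$ denote the prediction result and ground truth of node $i$ at time step $t$, respectively, $N$ is the number of nodes, and $Q$ is the number of time steps to be predicted.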
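For reference, the three metrics can be computed as in the sketch below, which operates on flat lists of paired values (all nodes and predicted time steps flattened together). Skipping zero-valued ground truth in MAPE is an assumption added here to avoid division by zero.

```python
import math


def mae(pred, truth):
    return sum(abs(p - y) for p, y in zip(pred, truth)) / len(pred)


def rmse(pred, truth):
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(pred, truth)) / len(pred))


def mape(pred, truth):
    """Mean absolute percentage error in percent.

    Pairs with zero ground truth are skipped (ASSUMED handling).
    """
    ratios = [abs(p - y) / abs(y) for p, y in zip(pred, truth) if y != 0]
    return 100.0 * sum(ratios) / len(ratios)


pred = [11.0, 90.0]
truth = [10.0, 100.0]
scores = (mae(pred, truth), rmse(pred, truth), mape(pred, truth))
```

RMSE penalizes the larger error more heavily than MAE, while MAPE treats both samples as equally wrong because each is off by the same fraction of its true value.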
5.2.2 Experimental Settings
The PeMSD3 and PeMSD7 datasets contain one traffic measurement (i.e., traffic flow). Thus, the dimensions of the input and output are 1 and 1, respectively. The PeMSD4 and PeMSD8 datasets contain three traffic measurements (i.e., traffic flow, average speed, and average occupancy), and only the traffic flow is predicted in the experiments [1, 7]. Thus, the dimensions of the input and output are 3 and 1, respectively. Following previous studies [1, 7], we use the traffic data of the 12 historical time steps to forecast the next 12 time steps.
The core hyperparameters in STJGCN include the thresholds in the pre-defined and adaptive STJG adjacency matrices, the dimension of hidden states, the kernel size of each STJGC layer, and the threshold in the loss function. We tune these hyperparameters and select the values that achieve the best performance on the validation set. We provide a parameter study in section 5.3.3. The detailed hyperparameter settings of STJGCN on four datasets are presented in Table III.
The nonlinear activation function in our STJGCN is the ReLU activation, and we also add a Batch Normalization layer before each ReLU activation function.
We train our model using the Adam optimizer with an initial learning rate of 0.001 and a batch size of 64 on an NVIDIA Tesla V100 GPU card. We run the experiments for 200 epochs and keep the model that performs best on the validation set. We run each experiment 5 times, and report the mean errors and standard deviations.
5.2.3 Baseline Methods
We compare STJGCN with 11 baseline methods, which could be divided into two categories. The first category is the time-series prediction models, including:
FC-LSTM: an encoder-decoder framework using long short-term memory (LSTM) with peephole for multi-step time-series prediction.
The second category refers to the spatio-temporal graph neural networks, which are detailed as follows:
DCRNN: Diffusion Convolutional Recurrent Neural Network, which models the traffic as a diffusion process, and integrates diffusion convolution with recurrent neural network (RNN) into the encoder-decoder architecture.
STGCN: Spatio-Temporal Graph Convolutional Network, which employs graph convolutional network (GCN) to capture spatial dependencies and 1D convolutional neural network (CNN) for temporal correlations modeling.
ASTGCN: Attention based Spatio-Temporal Graph Convolutional Network, which designs spatial and temporal attention mechanisms to capture spatial and temporal patterns, respectively.
Graph WaveNet: a graph neural network that performs diffusion convolution with both pre-defined and self-adaptive adjacency matrices to capture spatial dependencies, and applies 1D dilated causal convolution to capture temporal dependencies.
STSGCN: Spatio-Temporal Synchronous Graph Convolutional Network, which designs a spatio-temporal synchronous modeling mechanism to capture localized spatio-temporal correlations.
AGCRN: Adaptive Graph Convolutional Recurrent Network, which learns a data-adaptive adjacency matrix for graph convolution to model spatial correlations and uses gated recurrent unit (GRU) to model temporal correlations.
GMAN: Graph Multi-Attention Network, an encoder-decoder framework that designs multiple spatial and temporal attention mechanisms in the encoder and decoder to model spatio-temporal correlations, and a transform attention mechanism to transfer information from encoder to decoder.
5.3 Experimental Results
5.3.1 Overall Comparison
Table IV presents the forecasting performance comparison of our STJGCN with 11 baseline methods. We observe that: (1) the time-series prediction models, including the traditional approach (i.e., VAR), the machine learning based method (i.e., SVR), and the deep neural network (i.e., FC-LSTM), perform poorly as they only consider the temporal correlations. (2) Spatio-temporal graph neural networks generally achieve better performance as they further model the spatial correlations using graph neural networks. (3) Our STJGCN performs the best in terms of all metrics on all datasets (1.4% to 7.7% improvement against the second best results). Compared with other graph-based methods, the advantages of our STJGCN are three-fold. First, STJGCN models comprehensive spatio-temporal correlations. Second, STJGCN is able to capture dynamic dependencies at different time steps. Third, STJGCN leverages the information of multiple spatio-temporal ranges.
5.3.2 Ablation Study
|Dataset||Metrics||STJGCN||w/o STC-pdf||w/o STC-adt||w/o STC||w/o dgm||w/o mr||w/o att||w/o idp|
To better understand the effectiveness of different components in STJGCN, we conduct ablation studies on PeMSD4 and PeMSD8 datasets.
Effect of spatio-temporal connections
One difference between our STJG and a conventional spatio-temporal graph is that we explicitly add the spatio-temporal connections between different nodes at different time steps. To evaluate the effectiveness of this approach, we drop them from the pre-defined STJG, from the adaptive STJG, or from both. These three variants of STJGCN are named “w/o STC-pdf” (drop in pre-defined STJG), “w/o STC-adt” (drop in adaptive STJG), and “w/o STC” (drop in both types of STJG), respectively. The results in Table V demonstrate that the introduction of spatio-temporal connections improves the performance as it helps the model to explicitly capture comprehensive spatio-temporal correlations.
Effect of dynamic graph modeling
To evaluate the effect of dynamic graph modeling, we conduct experiments of learning static adjacency matrices. Specifically, we design a variant of STJGCN (i.e., “w/o dgm”) that only uses the node embedding to generate the adaptive STJG adjacency matrix without using the time feature. The results in Table V validate the effectiveness of modeling dynamic correlations among nodes at different time steps.
Effect of multi-range information
To verify the effect of multi-range information, we design a variant of STJGCN, namely “w/o mr”, in which we do not combine multiple ranges of information but directly use the output of the last STJGC layer to produce the predictions. The results in Table V indicate the necessity of leveraging multi-range information. We further design a variant “w/o att” that directly adds the outputs of each STJGC layer together without using the multi-range attention mechanism, and it performs worse than STJGCN, showing that it is beneficial to distinguish the importance of different ranges of information.
Effect of independent fully-connected layers
In the prediction module, we use independent fully-connected layers to produce the multi-step predictions. To evaluate the effectiveness of this design, we conduct experiments using shared fully-connected layers whose output layer has one unit per forecast step, so that a single network produces the predictions for all time steps. We name this variant of STJGCN “w/o idp”, and present the experimental results in Table V. We observe that STJGCN improves the performance by introducing independent learnable parameters for multi-step prediction. A potential reason is that the traffic at different time steps may exhibit different properties, and using different networks to generate the predictions at different forecasting horizons could be beneficial.
Effect of different STJG adjacency matrix configurations
We further conduct experiments with different STJG adjacency matrix configurations to evaluate their effectiveness. As shown in Table VI, the models with only pre-defined STJG adjacency matrices (lines 3-4) perform poorly, as they do not capture the underlying dependencies in the data. The models with only adaptive STJG adjacency matrices (lines 5-6) achieve promising performance, which indicates that our model can be used even if the graph structure is unavailable. Using both pre-defined and adaptive STJG adjacency matrices (line 7) yields better results. We further apply a gating fusion approach (Section 4.3.1) in STJGCN (line 8) and observe a consistent improvement in predictive performance, as the gate is able to control the information flow between the pre-defined and adaptive STJGs.
|STJG adjacency matrix configuration||PeMSD4 MAE||PeMSD4 RMSE||PeMSD4 MAPE (%)||PeMSD8 MAE||PeMSD8 RMSE||PeMSD8 MAPE (%)|
|+ gf (ours)||18.81 ± 0.06||30.35 ± 0.09||11.92 ± 0.04||14.53 ± 0.17||23.74 ± 0.20||9.15 ± 0.09|
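The role of the gate in fusing the outputs computed on the pre-defined and adaptive STJGs can be illustrated with a minimal element-wise sketch. The exact gate of Section 4.3.1 is not reproduced here; the form below (a sigmoid gate with one scalar weight per dimension) and the names `gated_fusion`, `z_pdf`, and `z_adt` are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(z_pdf, z_adt, w_pdf, w_adt, bias):
    """Element-wise fusion: g * z_pdf + (1 - g) * z_adt.

    The gate g = sigmoid(w_pdf * z_pdf + w_adt * z_adt + bias) uses one
    scalar weight per dimension here; a real layer would use weight matrices.
    """
    fused = []
    for zp, za, wp, wa, b in zip(z_pdf, z_adt, w_pdf, w_adt, bias):
        g = sigmoid(wp * zp + wa * za + b)
        fused.append(g * zp + (1.0 - g) * za)
    return fused

z_pdf = [1.0, 0.0, 4.0]  # features aggregated over the pre-defined graph
z_adt = [3.0, 2.0, 4.0]  # features aggregated over the adaptive graph
zeros = [0.0, 0.0, 0.0]
fused = gated_fusion(z_pdf, z_adt, zeros, zeros, zeros)
print(fused)
```

With all gate parameters at zero the gate equals 0.5 and the fusion is a plain average; learned parameters let the gate favor one source per dimension.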
5.3.3 Parameter Study
|PeMSD3||# Parameter (M)||0.37||0.42||0.31||0.59||3.50||0.75||0.57||0.52||0.32|
|Training time (s/epoch)||118.06||12.20||59.73||78.69||127.86||55.45||168.77||208.55||49.82|
|Inference time (s)||18.70||19.10||5.16||26.80||15.41||8.44||17.45||25.79||5.22|
|PeMSD4||# Parameter (M)||0.37||0.38||0.31||0.45||2.87||0.75||0.57||0.52||0.31|
|Training time (s/epoch)||69.55||6.54||32.40||53.51||56.18||37.05||82.40||88.41||25.64|
|Inference time (s)||11.97||13.44||2.60||14.67||6.03||5.55||9.16||11.84||2.87|
|PeMSD7||# Parameter (M)||0.37||0.75||0.31||3.24||15.36||0.75||0.57||0.52||0.36|
|Training time (s/epoch)||306.66||33.59||173.85||213.30||465.12||189.48||779.12||624.32||158.64|
|Inference time (s)||45.13||71.17||16.17||64.81||54.60||26.31||83.2||89.99||16.30|
|PeMSD8||# Parameter (M)||0.37||0.30||0.31||0.18||1.66||0.75||0.57||0.52||0.31|
|Training time (s/epoch)||46.41||4.24||20.48||47.07||31.23||21.74||32.27||52.51||17.60|
|Inference time (s)||8.81||9.37||1.72||14.01||3.09||3.04||4.06||7.36||1.67|
We conduct a parameter study on five core hyperparameters of STJGCN on the PeMSD4 and PeMSD8 datasets: the two thresholds in the pre-defined and adaptive STJG adjacency matrices, the dimension of hidden states, the kernel size in the STJGC operation, and the threshold in the loss function. In each experiment, we vary the parameter under investigation and fix the other parameters. Figures 4 and 5 show the experimental results on the PeMSD4 and PeMSD8 datasets, respectively.
As shown in Figures 4 and 5, the performance is not strongly sensitive to the sparsity of the STJG adjacency matrices, which we attribute to the adaptive STJG adjacency matrix adjusting itself to aggregate the neighboring information during training. In general, a sparser adjacency matrix helps select the most related nodes for each node and leads to better results. However, an overly sparse graph may lose the connections between interrelated nodes and thus degrade the performance. We therefore select the two thresholds on each of the PeMSD4 and PeMSD8 datasets according to the validation loss.
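The effect of a threshold on graph sparsity can be seen in a small sketch. The weights below are made-up values, and `sparsify` and `density` are hypothetical helpers; the sketch assumes that entries below the threshold are simply set to zero.

```python
def sparsify(adj, threshold):
    """Zero out entries below the threshold; keep the rest unchanged."""
    return [[w if w >= threshold else 0.0 for w in row] for row in adj]

def density(adj):
    """Fraction of non-zero entries in the matrix."""
    total = sum(len(row) for row in adj)
    nonzero = sum(1 for row in adj for w in row if w != 0.0)
    return nonzero / total

adj = [[1.0, 0.6, 0.1],
       [0.6, 1.0, 0.4],
       [0.1, 0.4, 1.0]]
for threshold in (0.0, 0.3, 0.5, 0.9):
    print(threshold, density(sparsify(adj, threshold)))
```

Raising the threshold first removes the weakest links and eventually leaves only the self-connections, which mirrors the trade-off between selecting the most related nodes and losing connections between interrelated ones.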
As shown in Figures 4 and 5, increasing the number of hidden units could enhance the model’s expressive capacity. However, when it is larger than 64, the performance degrades significantly, as the model needs to learn more parameters and may suffer from the over-fitting problem.
Figures 4 and 5 show that the model performs poorly when the kernel size equals 1, as it then captures only the spatial dependencies and does not consider the correlations in the temporal dimension. We further observe that it is sufficient to aggregate the information from 2 or 3 neighboring time steps at each time step. With a larger kernel size, the model’s performance degrades. A possible reason is that a node’s information at a time step may be correlated only with the nodes at a limited number of neighboring time steps, so a large kernel size would introduce noise into the model. Thus, we select the kernel size on each of the PeMSD4 and PeMSD8 datasets according to the validation loss.
In the parameter study of the threshold in the loss function, we report the validation MAE, RMSE, and MAPE instead of the loss value, as the threshold directly affects the magnitude of the loss. As shown in Figures 4 and 5, a larger threshold means that the model optimizes more on the MAPE loss and less on the MAE loss, which leads to a smaller MAPE and a larger MAE. The RMSE is also influenced by the threshold. Through a comprehensive consideration of the validation MAE, RMSE, MAPE, and their standard deviations, we choose the threshold separately for the PeMSD4 and PeMSD8 datasets.
5.3.4 Performance Comparison at Each Horizon
Figures 6 and 7 present the forecasting performance of STJGCN and five representative baseline methods (i.e., Graph WaveNet, STSGCN, AGCRN, GMAN, and Z-GCNETs) at each prediction time step on the PeMSD4 and PeMSD8 datasets, respectively. We exclude the other baseline methods due to their poorer performance, as shown in Table IV. We observe that Graph WaveNet performs well in short-term prediction (one or two time steps ahead), but its performance degrades quickly as the forecasting horizon increases. The performance of GMAN degrades slowly as predictions are made further into the future, and it performs well in long-term prediction (e.g., 12 time steps ahead), although it is still worse than STJGCN. In general, our model achieves the best performance at almost all horizons in terms of all three metrics on both datasets.
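A per-horizon comparison of this kind amounts to computing the three metrics separately for each prediction time step. The following minimal sketch shows the computation on made-up numbers; the function name `horizon_metrics` and the choice to skip zero targets in the MAPE are assumptions of this illustration.

```python
import math

def horizon_metrics(y_true, y_pred, eps=1e-8):
    """Return a list of (MAE, RMSE, MAPE in %) tuples, one per horizon.

    y_true and y_pred are lists of samples; each sample is a list with one
    value per forecasting horizon. Targets close to zero are skipped in MAPE.
    """
    num_horizons = len(y_true[0])
    results = []
    for q in range(num_horizons):
        errors = [p[q] - t[q] for t, p in zip(y_true, y_pred)]
        mae = sum(abs(e) for e in errors) / len(errors)
        rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
        ratios = [abs(p[q] - t[q]) / abs(t[q])
                  for t, p in zip(y_true, y_pred) if abs(t[q]) > eps]
        mape = 100.0 * sum(ratios) / len(ratios) if ratios else float("nan")
        results.append((mae, rmse, mape))
    return results

# Two samples, two forecasting horizons (illustrative values only).
y_true = [[100.0, 200.0], [50.0, 100.0]]
y_pred = [[110.0, 180.0], [45.0, 120.0]]
metrics = horizon_metrics(y_true, y_pred)
for q, (mae, rmse, mape) in enumerate(metrics, start=1):
    print(f"horizon {q}: MAE={mae:.2f} RMSE={rmse:.2f} MAPE={mape:.2f}%")
```

Plotting each metric against the horizon index for several models gives curves of the kind shown in Figures 6 and 7.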
5.3.5 Model Size and Computation Time
We compare the model size and computation time of STJGCN with those of the graph-based baseline methods in Table VII. The results demonstrate the high computational efficiency of our model. In terms of model size, STJGCN has fewer parameters than most of the baseline models. In the training phase, our model runs faster than all other methods except STGCN. In the inference stage, STGCN runs very slowly, as it generates multi-step predictions iteratively, while STJGCN and Graph WaveNet are the most efficient. Taking the prediction accuracy into account as well (see Table IV), our model achieves a favorable balance among predictive performance, time consumption, and model size.
We proposed STJGCN, which models comprehensive and dynamic spatio-temporal correlations and aggregates multiple ranges of information to forecast the traffic conditions over several time steps ahead on a road network. When evaluated on four public traffic datasets, STJGCN showed high computation efficiency and outperformed 11 state-of-the-art baseline methods. Our model could be potentially applied to other spatio-temporal data forecasting tasks, such as air quality inference and taxi demand prediction. We plan to investigate this in future work.
The research is supported by Natural Science Foundation of China (61872306), Xiamen Science and Technology Bureau (3502Z20193017) and Fundamental Research Funds for the Central Universities (20720200031).