1 Introduction
Spatiotemporal data forecasting has received increasing attention from the deep learning community in recent years [36, 32, 53]. It plays a vital role in a wide range of applications, such as traffic speed prediction [23] and air quality inference [8]. In this paper, we study the problem of forecasting future traffic conditions given historical observations on a road network.

Recent studies formulate traffic forecasting as a spatiotemporal graph modeling problem [23, 46, 40, 13, 29, 1, 7]. The basic assumption is that the state of each node is conditioned on the information of its neighboring nodes. Based on this, they construct a spatial graph with a predefined [23] or data-adaptive [40] adjacency matrix. In such a graph, each node corresponds to a location of interest (e.g., a traffic sensor). A graph neural network [39] is applied on that graph to model the correlations among spatially neighboring nodes at each time step. To leverage the information from temporally neighboring nodes, they further connect each node with itself between adjacent time steps, which results in a spatiotemporal graph, as shown in Figure 1. The 1D convolutional neural network [46, 23] is commonly used to model the correlations at each node between different time steps. By combining the spatial and temporal features, they are able to update the state of each node.

However, such spatiotemporal graphs do not explicitly reflect the correlations between different nodes at different time steps (e.g., the red dashed lines in Figure 1). In such a graph, the information of spatial and temporal neighborhoods is captured through the spatial and temporal connections respectively, while the information of neighboring nodes across both the spatial and temporal dimensions is not considered, which may restrict the learning ability of graph neural networks. For example, a traffic jam occurring at an intersection may affect not only the nearby roads at present (spatial neighborhoods) and its own future traffic condition (temporal neighborhoods), but also the downstream roads in the next few hours (spatiotemporal neighborhoods). Thus, we argue that it is necessary to model such comprehensive correlations in spatiotemporal data.
Another limitation of previous works is that they ignore the dynamic correlations among nodes at different time steps, as shown in Figure 1. The road network distances among sensors (nodes) are commonly used to define the spatial graph [23, 46]. This predefined graph is usually static. Some researchers [40, 1] propose to learn a data-adaptive adjacency matrix, which is also unchanged over time steps. However, traffic data exhibits strong dynamic correlations in the spatial and temporal dimensions, and such static graphs are unable to reflect the dynamic characteristics of the correlations among nodes. For example, a residential region is highly correlated to an office area during workday morning rush hours, while the correlation weakens in the evening because some people might prefer dining out before going home. Thus, it is crucial to model the dynamic spatiotemporal correlations for traffic forecasting.
This paper addresses these limitations from the following perspectives. First, besides the spatial and temporal connections, we further add spatiotemporal connections between two time steps according to the spatiotemporal distances, thereby defining a spatiotemporal joint graph (STJG). In this way, the predefined STJG preserves comprehensive spatiotemporal correlations between any two time steps. Second, in order to adapt to the dynamic correlations among nodes, we propose to learn an adaptive STJG, which is time-variant as it encodes time features. The adjacency matrix of this adaptive STJG is dynamic, changing over time steps. By constructing both the predefined and adaptive STJGs, we are able to preserve comprehensive and dynamic spatiotemporal correlations.
On this basis, we then develop spatiotemporal joint graph convolution (STJGC) operations on both the predefined and adaptive STJGs to simultaneously capture the spatiotemporal dependencies in a unified operation. We further design dilated causal STJGC layers to extract information over multiple spatiotemporal ranges. Next, a multi-range attention mechanism is proposed to aggregate the information of different ranges. Finally, we apply independent fully-connected layers to produce the multi-step-ahead prediction results. The whole framework, named spatiotemporal joint graph convolutional networks (STJGCN), can be learned end-to-end. To evaluate the efficiency and effectiveness of STJGCN, we conduct extensive experiments on four public traffic datasets. The experimental results demonstrate that STJGCN is computationally efficient and achieves the best performance against 11 state-of-the-art baseline methods. Our main contributions are summarized as follows.

We construct both predefined and adaptive spatiotemporal joint graphs (STJGs), which reflect comprehensive and dynamic spatiotemporal correlations.

We design dilated causal spatiotemporal joint graph convolution layers on both types of STJG to model multiple ranges of spatiotemporal correlations.

We propose a multi-range attention mechanism to aggregate the information of different ranges.

We evaluate our model on four public traffic datasets, and the experimental results demonstrate that STJGCN has high computational efficiency and outperforms 11 state-of-the-art baseline methods.
The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the preliminaries of this work. Section 4 details the method of STJGCN. Section 5 compares STJGCN with state-of-the-art methods on four datasets. Finally, Section 6 concludes this paper and outlines future work.
2 Related Work
2.1 Graph Convolutional Networks
Graph convolutional networks (GCNs) have been successfully applied to various tasks (e.g., node classification [21], link prediction [49]) due to their superior ability to handle graph-structured data [39]. There are mainly two types of GCN [3]: spatial GCN and spectral GCN. The spatial GCN performs convolution filters on the neighborhoods of each node. Researchers in [26] propose a heuristic linear method for neighborhood selection. GraphSAGE [16] samples a fixed number of neighbors for each node and aggregates their features. GAT [35] learns the weights among nodes via attention mechanisms. The spectral GCN defines the convolution in the spectral domain [22], which was first introduced in [4]. ChebNet [9] reduces the computational complexity with fast localized convolution filters. In [21], researchers further simplify ChebNet to a simpler form and achieve state-of-the-art performances on various tasks. Recently, a range of studies apply GCNs to time-series data and construct spatiotemporal graphs for traffic forecasting [23, 50], human action recognition [41, 28], etc.

2.2 Spatio-Temporal Forecasting
Spatiotemporal forecasting is an important research topic, which has been extensively studied for decades [45, 30, 12, 19, 14]. Recurrent neural networks (RNNs), especially the long short-term memory (LSTM) and gated recurrent unit (GRU), have been successfully applied to model temporal correlations [24]. To capture the spatial dependencies, convolutional neural networks (CNNs) are introduced, which are restricted to processing regular grid structures [47, 43, 42, 52, 48]. Recently, researchers apply graph neural networks to model the non-Euclidean spatial correlations [44]. DCRNN [23] employs diffusion convolution to capture the spatial dependency and applies GRU to model the temporal dependency. STGCN [46] uses graph convolution and 1D convolution to model the spatial and temporal dependencies, respectively. Several works [13, 51, 37] introduce attention mechanisms [34] into spatiotemporal graph modeling to improve the prediction accuracy. Some studies consider more kinds of connections (e.g., semantic connections [38], edge interaction patterns [6]) to construct the spatial graph. The adjacency matrices in these models are usually predefined according to some prior knowledge (e.g., distances among nodes). Some researchers [40, 1] argue that the predefined adjacency matrix does not necessarily reflect the underlying dependencies among nodes, and propose to learn an adaptive adjacency matrix for graph modeling. However, both the predefined and adaptive adjacency matrices assume static correlations among nodes, which cannot adapt to evolving systems (e.g., traffic networks). Moreover, these graph-based methods do not explicitly model the correlations between different nodes at different time steps, which may restrict the learning ability of graph neural networks.

3 Preliminary
Problem definition. Suppose there are $N$ sensors (nodes) on a road network, and each sensor records $C$ traffic measurements (e.g., volume, speed) at each time step. Thus, the traffic conditions at time step $t$ can be represented as $X_t \in \mathbb{R}^{N \times C}$. The traffic forecasting problem aims to learn a function $f$ that maps the traffic conditions of the historical $P$ time steps to the next $Q$ time steps:

$$[X_{t-P+1}, \ldots, X_t] \xrightarrow{f} [\hat{X}_{t+1}, \ldots, \hat{X}_{t+Q}] \tag{1}$$
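To make the input/output shapes of this formulation concrete, here is a minimal NumPy sketch; the sizes and the trivial "repeat the last frame" stand-in for the learned mapping $f$ are illustrative assumptions, not part of the model:

```python
import numpy as np

# Hypothetical sizes for illustration: N sensors, C measurements per sensor,
# P historical steps, Q forecast steps (symbols follow the problem definition).
N, C, P, Q = 170, 3, 12, 12

history = np.random.rand(P, N, C)   # [X_{t-P+1}, ..., X_t]

def naive_forecast(history, Q):
    """A trivial stand-in for the learned mapping f: repeat the last frame.
    Only the traffic-flow channel (index 0) is predicted."""
    last = history[-1, :, 0:1]          # (N, 1)
    return np.repeat(last[None], Q, 0)  # (Q, N, 1)

pred = naive_forecast(history, Q)
assert pred.shape == (Q, N, 1)
```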
4 Methodology
4.1 Framework Overview
Figure 2 depicts the framework of our proposed Spatio-Temporal Joint Graph Convolutional Networks (STJGCN), which includes three modules. First, since previous graph-based methods generally ignore the spatiotemporal connections and the dynamic correlations among nodes, we propose the spatiotemporal joint graph (STJG) construction module to construct both predefined and adaptive STJGs, which preserve comprehensive and dynamic spatiotemporal correlations. Second, as the standard graph convolution operation models spatial correlations only, we propose the spatiotemporal joint graph convolution (STJGC) operation on both types of STJG to model the comprehensive and dynamic spatiotemporal correlations in a unified operation. Based on the STJGC, we further propose the dilated causal STJGC module to capture spatiotemporal dependencies within multiple neighborhood and time ranges. Finally, in the prediction module, we propose a multi-range attention mechanism to aggregate the information of different ranges, and apply fully-connected layers to produce the prediction results. We detail each module in the following subsections.
4.2 STJG Construction Module
In this module, we first predefine the spatiotemporal joint graph (STJG) according to the spatiotemporal distances among nodes. However, since the predefined graph may not reflect the underlying correlations among nodes [40, 1], we further propose to learn an adaptive STJG. By constructing both types of STJG, we are able to represent comprehensive and dynamic spatiotemporal correlations among nodes.
4.2.1 Predefined Spatio-Temporal Joint Graph
Previous studies [23, 46] for traffic forecasting on graphs usually define the spatial adjacency matrix based on pairwise road network distances:

$$A_{i,j} = \exp\left(-\frac{d_{i,j}^2}{\sigma^2}\right) \tag{2}$$

where $d_{i,j}$ represents the road network distance from node $v_i$ to node $v_j$, $\sigma$ is the standard deviation of the distances, and $A_{i,j}$ denotes the edge weight between node $v_i$ and node $v_j$. They construct the spatial graph at each time step, and then connect each node with itself between adjacent time steps to define the spatiotemporal graph. In such a graph, the connections between different nodes at different time steps are not incorporated, which may restrict its representation ability.

We propose to construct a spatiotemporal joint graph (STJG), which preserves comprehensive spatiotemporal correlations. The intuitive idea is to further connect different nodes between two time steps, as shown in Figure 1. Thus, we modify Equation 2 to obtain the STJG adjacency matrix:
$$A^{(\Delta t)}_{i,j} = \exp\left(-\frac{d_{i,j}^2}{\sigma^2} - \Delta t\right) \tag{3}$$

where $\Delta t = t_2 - t_1$ is the time difference between two time steps $t_1$ and $t_2$. $A^{(\Delta t)}_{i,j}$ defines the edge weight between node $v_i$ at time step $t_1$ and node $v_j$ at time step $t_2$, which decreases as the spatiotemporal distance increases. When $\Delta t = 0$, Equation 3 degenerates to Equation 2, which represents the spatial connections. If $i = j$, the STJG adjacency matrix defines the temporal connections at each node between two time steps. Otherwise, it represents the spatiotemporal connections between different nodes at different time steps. Thus, we are able to define a comprehensive spatiotemporal graph according to Equation 3. Note that the STJG can be constructed between any two time steps, which makes it flexible enough to reveal spatiotemporal correlations over multiple time ranges.

We filter out the values smaller than a threshold $\epsilon$ in the STJG adjacency matrix to eliminate weak connections and control the sparsity. As this adjacency matrix is conditioned on the time difference $\Delta t$ but irrelevant to a specific time step, we denote it as $A^{(\Delta t)}$ in the following discussions.
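As a minimal NumPy sketch of building one predefined STJG adjacency matrix: the exact distance/time decay is an assumption here (any kernel that reduces to Equation 2 when the time difference is zero and decays with it fits the description), and the thresholding matches the sparsity filtering above:

```python
import numpy as np

def predefined_stjg_adj(dist, dt, eps):
    """Sketch of a predefined STJG adjacency matrix A^(dt) (Eq. 3).
    dist: (N, N) road-network distances; dt: time difference between the
    two steps; eps: sparsity threshold. The decay form is an assumption:
    weights shrink with both distance and dt, and dt = 0 recovers the
    purely spatial Gaussian kernel of Eq. 2."""
    sigma = dist.std()
    A = np.exp(-(dist ** 2) / (sigma ** 2) - dt)
    A[A < eps] = 0.0  # filter weak connections to control sparsity
    return A

dist = np.array([[0.0, 2.0], [2.0, 0.0]])
A0 = predefined_stjg_adj(dist, dt=0, eps=0.01)  # spatial graph (dt = 0)
A1 = predefined_stjg_adj(dist, dt=1, eps=0.01)  # one step apart
assert (A1 <= A0).all()  # weights decay as the time difference grows
```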
4.2.2 Adaptive Spatio-Temporal Joint Graph
Previous studies [40, 1] demonstrate that the predefined adjacency matrix may not reflect the underlying correlations among nodes, and propose adaptive ones. However, they only define a spatial graph, which is unchanged over time steps. We propose to learn adaptive STJG adjacency matrices that can represent comprehensive and dynamic spatiotemporal correlations, based on the latent space modeling algorithm [10].
Latent space modeling
Given a graph, we assume each node resides in a latent space with various attributes. The attributes of the nodes and how these attributes interact with each other jointly determine the underlying relations among nodes. Nodes that are close to each other in the latent space are more likely to form a link. Mathematically, we aim to learn two matrices $U \in \mathbb{R}^{N \times d}$ and $B \in \mathbb{R}^{d \times d}$. Here, $U$ denotes the latent attributes of the $N$ nodes, and $B$ represents the attribute interaction patterns, which could be an asymmetric matrix for a directed graph or a symmetric matrix for an undirected graph. The product $U B U^\top$ could represent the connections among nodes.
Spatiotemporal embedding
We propose a spatiotemporal embedding to form the latent node attributes. We first randomly initialize a spatial embedding for each of the $N$ nodes, and then transform it to $d$ dimensions via fully-connected layers. To obtain time-varying node attributes, we further encode the time information as a temporal embedding. At each time step, we consider two time features, i.e., time-of-day and day-of-week, which are encoded by one-hot coding and then projected to $d$ dimensions using fully-connected layers. We then add the spatial and temporal embeddings together to generate the spatiotemporal embedding at each time step $t$, represented as $E_t \in \mathbb{R}^{N \times d}$, which can be updated during the training stage. The spatiotemporal embedding encodes both the node-specific and time-related information, and it has the potential to take periodic patterns into account through the time features.
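The embedding construction can be sketched as follows; the 288 time-of-day slots (5-minute intervals) and the simple lookup-table projections standing in for the fully-connected layers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8                      # nodes, embedding dimension (illustrative)

# Learnable spatial embedding, one row per node (randomly initialized).
E_spatial = rng.standard_normal((N, d))

# Time features: one-hot time-of-day (288 five-minute slots) and
# day-of-week (7), projected to d dims; lookup tables stand in for the
# (hypothetical) fully-connected projection layers here.
W_tod = rng.standard_normal((288, d))
W_dow = rng.standard_normal((7, d))

def st_embedding(tod, dow):
    """Spatio-temporal embedding E_t: spatial + temporal parts added."""
    E_temporal = W_tod[tod] + W_dow[dow]      # (d,)
    return E_spatial + E_temporal             # broadcast to (N, d)

E_t = st_embedding(tod=100, dow=2)
assert E_t.shape == (N, d)
```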
Adaptive STJG adjacency matrix
Based on the spatiotemporal embedding, we define the STJG adjacency matrix at time step $t$ according to the latent space modeling algorithm, as:

$$\hat{A}_t = \mathrm{softmax}(S_t) \tag{4}$$

with

$$S_t = \mathrm{ReLU}_{\kappa}\left(E_t B E_t^\top\right) \tag{5}$$

where $E_t \in \mathbb{R}^{N \times d}$ is the spatiotemporal embedding of the nodes at time step $t$, $\mathrm{ReLU}_{\kappa}$ is used to eliminate the weights smaller than a threshold $\kappa$, and the softmax function is applied for normalization. $\hat{A}_t$ defines the spatial connections among nodes at time step $t$, which are dynamic, changing over time steps. In order to construct the connections between different time steps, we modify Equation 4 as:

$$\hat{A}_{t_1, t_2} = \mathrm{softmax}\left(\mathrm{ReLU}_{\kappa}\left(E_{t_1} B E_{t_2}^\top\right)\right) \tag{6}$$

where $\hat{A}_{t_1, t_2}$ is the normalized STJG adjacency matrix between time steps $t_1$ and $t_2$. When $t_1 = t_2$, Equation 6 degenerates to Equation 4, which describes the spatial graph at time step $t_1$. Thus, Equation 6 is able to define the spatiotemporal joint graph between time steps $t_1$ and $t_2$ with comprehensive and dynamic spatiotemporal connections.
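The adaptive adjacency construction described by Equations 4-6 can be sketched in NumPy as below; this is a minimal illustration (random embeddings, no training), and the placement of the threshold inside a ReLU-style clipping is an assumption consistent with the prose:

```python
import numpy as np

def adaptive_stjg_adj(E1, E2, B, kappa):
    """Sketch of the adaptive STJG adjacency (Eq. 6): latent-space scores
    E1 @ B @ E2.T, entries below the threshold kappa clipped to zero
    (ReLU-style), then row-wise softmax normalization.
    E1, E2: (N, d) spatio-temporal embeddings; B: (d, d) interactions."""
    S = E1 @ B @ E2.T
    S = np.where(S < kappa, 0.0, S)             # ReLU-style thresholding
    expS = np.exp(S - S.max(axis=1, keepdims=True))
    return expS / expS.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
N, d = 4, 6
E1, E2 = rng.standard_normal((N, d)), rng.standard_normal((N, d))
B = rng.standard_normal((d, d))
A = adaptive_stjg_adj(E1, E2, B, kappa=0.5)
assert np.allclose(A.sum(axis=1), 1.0)          # rows are normalized
```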
4.3 Dilated Causal STJGC Module
The standard graph convolution operates on spatial graphs and models spatial correlations only; we thus propose the spatiotemporal joint graph convolution (STJGC) on both types of STJG to model spatiotemporal correlations in a unified operation. We further design dilated causal STJGC layers to capture multiple ranges of spatiotemporal dependencies, as shown in Figure 2. In the following, we first describe the STJGC operation in Section 4.3.1, and then introduce the dilated causal STJGC layers in Section 4.3.2.
4.3.1 Spatio-Temporal Joint Graph Convolution (STJGC)
Graph convolution is an effective operation for learning node information from spatial neighborhoods according to the graph structure, but the standard graph convolution operates on the spatial graph and thus models the spatial correlations only. In order to model the comprehensive and dynamic spatiotemporal correlations on the STJG, we propose spatiotemporal joint graph convolution (STJGC) operations on both types of STJG.
Graph Convolution
The graph convolution is defined as [21]:

$$H' = \sigma\left(\tilde{A} H W + b\right) \tag{7}$$

Here, $H$ and $H'$ denote the input and output graph signals, $W$ and $b$ are learnable parameters, $\sigma$ is an activation function (e.g., ReLU [25]), and $\tilde{A} = \tilde{D}^{-\frac{1}{2}} (A + I) \tilde{D}^{-\frac{1}{2}}$ is the normalized adjacency matrix, where $A + I$ is the adjacency matrix with self-loops and $\tilde{D}$ is its degree matrix.

STJGC on predefined STJG
Consider the STJG between time steps $t_1$ and $t_2$ ($t_1 < t_2$); the information of each node at time step $t_2$ comes from its spatial, temporal, and spatiotemporal neighborhoods:

$$H'_{t_2} = \sigma\left(\tilde{A}^{(\Delta t)} H_{t_1} W_1 + \tilde{A}^{(0)} H_{t_2} W_2\right) \tag{8}$$

where $\tilde{A}^{(\Delta t)}$ is the normalized predefined STJG adjacency matrix between time steps $t_1$ and $t_2$ (see Equation 3). In Equation 8, the term $\tilde{A}^{(\Delta t)} H_{t_1} W_1$ means we aggregate neighborhood (both temporal and spatiotemporal) information from time step $t_1$, and the term $\tilde{A}^{(0)} H_{t_2} W_2$ means we aggregate the information from spatial neighborhoods at time step $t_2$. Thus, by performing Equation 8, we are able to model comprehensive spatiotemporal correlations between two time steps.
Furthermore, at time step $t$, we propose to incorporate the information of $K$ (denoted as the kernel size) time steps (i.e., $t-K+1, \ldots, t$) to update the node features. Specifically, we modify Equation 8 as:

$$H'_t = \sigma\left(\sum_{k=0}^{K-1} \tilde{A}^{(k)} H_{t-k} W_k\right) \tag{9}$$

In the case of a directed graph, we consider two directions of information propagation (i.e., forward and backward), corresponding to two normalized adjacency matrices: $\tilde{A}_f^{(k)} = D_O^{-1} A^{(k)}$ and $\tilde{A}_b^{(k)} = D_I^{-1} \left(A^{(k)}\right)^\top$, where $D_O$ and $D_I$ represent the out-degree and in-degree matrices, respectively. Thus, we transform Equation 9 to:

$$H'_t = \sigma\left(\sum_{k=0}^{K-1} \left(\tilde{A}_f^{(k)} H_{t-k} W_{k,f} + \tilde{A}_b^{(k)} H_{t-k} W_{k,b}\right)\right) \tag{10}$$

where $H_{t-k}$ ($k = 0, \ldots, K-1$) are the input graph signals at the corresponding time steps, $H'_t$ denotes the updated feature at time step $t$, and $W_{k,f}$ and $W_{k,b}$ are learnable parameters.
By this design, our STJGC simultaneously models the information propagation from three kinds of connections (i.e., spatial, temporal, and spatiotemporal) in a unified operation.
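The unified aggregation over several time steps (in the spirit of Equation 9) can be sketched as below; identity adjacencies and random weights are placeholders for the normalized STJG matrices and learned parameters:

```python
import numpy as np

def stjgc(adjs, feats, weights):
    """Minimal sketch of one STJGC step: the node states at the current
    time step are updated by aggregating over K earlier steps, each
    through its own STJG adjacency and weight matrix.
    adjs: list of K (N, N) normalized adjacencies (dt = K-1, ..., 0);
    feats: list of K (N, F_in) graph signals at the matching steps;
    weights: list of K (F_in, F_out) learnable matrices."""
    out = sum(A @ H @ W for A, H, W in zip(adjs, feats, weights))
    return np.maximum(out, 0.0)  # ReLU activation

rng = np.random.default_rng(2)
N, F_in, F_out, K = 4, 3, 5, 2
adjs = [np.eye(N) for _ in range(K)]      # placeholder adjacencies
feats = [rng.standard_normal((N, F_in)) for _ in range(K)]
weights = [rng.standard_normal((F_in, F_out)) for _ in range(K)]
H_new = stjgc(adjs, feats, weights)
assert H_new.shape == (N, F_out)
```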
STJGC on adaptive STJG
As the predefined STJG may not reflect the underlying correlations among nodes, we further propose STJGC on the adaptive STJG. The computation is similar to that on the predefined STJG:

$$\hat{H}'_t = \sigma\left(\sum_{k=0}^{K-1} \hat{A}_{t-k, t} H_{t-k} W_k\right) \tag{11}$$

where $\hat{A}_{t-k, t}$ is the normalized adaptive STJG adjacency matrix between time steps $t-k$ and $t$ (defined in Equation 6). Inspired by the bidirectional RNN [27], we consider both time directions of the information flow. Specifically, we compute two adaptive STJG adjacency matrices, $\hat{A}_{t-k, t}$ and $\hat{A}_{t, t-k}$, and modify Equation 11 accordingly, as:

$$\hat{H}'_t = \sigma\left(\sum_{k=0}^{K-1} \left(\hat{A}_{t-k, t} H_{t-k} W_{k,1} + \hat{A}_{t, t-k}^\top H_{t-k} W_{k,2}\right)\right) \tag{12}$$

where $\hat{H}'_t$ is the updated feature at time step $t$, which encodes the comprehensive and dynamic spatiotemporal correlations, and $W_{k,1}$ and $W_{k,2}$ are learnable parameters.
Gating fusion
The predefined and adaptive STJGs represent the spatiotemporal correlations from distinct perspectives. To enhance the representation ability, we use a gating mechanism to fuse the features extracted on the two types of STJG. Specifically, we define a gate to control the importance of the two features as:

$$g = \mathrm{sigmoid}\left(\left[H'_t \,\|\, \hat{H}'_t\right] W_g + b_g\right) \tag{13}$$

where $\|$ denotes the concatenation operation, the sigmoid function is used to constrain the output to lie in the range $(0, 1)$, and $W_g$ and $b_g$ are learnable parameters. The gate controls the information flow between the predefined and adaptive STJGs in both a node-wise and channel-wise manner. Based on the gate, we fuse the two features as:

$$Z_t = g \odot H'_t + (1 - g) \odot \hat{H}'_t \tag{14}$$

where $\odot$ denotes the element-wise product. As a result, $Z_t$ represents the updated representation of the nodes at time step $t$, which aggregates the information from their spatial, temporal, and spatiotemporal neighborhoods on both types of STJG.
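A minimal sketch of this gating fusion (Equations 13-14), with random placeholder features and weights standing in for the two STJGC branch outputs and the learned gate parameters:

```python
import numpy as np

def gated_fusion(H_pre, H_adt, Wg, bg):
    """Sketch of the gating fusion: a sigmoid gate computed from the
    concatenated features weighs the predefined-STJG branch against the
    adaptive-STJG branch, node- and channel-wise."""
    z = np.concatenate([H_pre, H_adt], axis=-1) @ Wg + bg
    g = 1.0 / (1.0 + np.exp(-z))          # gate values in (0, 1)
    return g * H_pre + (1.0 - g) * H_adt  # element-wise blend

rng = np.random.default_rng(3)
N, F = 4, 6
H_pre, H_adt = rng.standard_normal((N, F)), rng.standard_normal((N, F))
Wg, bg = rng.standard_normal((2 * F, F)), np.zeros(F)
H = gated_fusion(H_pre, H_adt, Wg, bg)
assert H.shape == (N, F)
```

Because the gate lies in (0, 1), each output entry is a convex combination of the two branches, so the fused feature never leaves the range spanned by them.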
4.3.2 Dilated Causal STJGC Layers
The STJGC operation is able to model the correlations in different time ranges by controlling the time difference $\Delta t$. In addition, different STJGC layers aggregate information within diverse neighborhood ranges. This makes it flexible to model the spatiotemporal correlations over multiple neighborhood and time ranges. The information in different ranges reveals distinct traffic properties: a small range uncovers local dependencies, and a large range indicates global dependencies. Inspired by the dilated causal convolution [33, 2], which captures diverse time ranges of dependencies in different layers, we propose dilated causal STJGC layers to capture multiple ranges of spatiotemporal dependencies.
Dilated causal convolution
The dilated causal convolution operation slides over the input sequence by skipping elements with a certain step (i.e., the dilation factor), and it involves only historical information at each time step to satisfy the causal constraint. In this way, it models diverse time ranges of dependencies in different layers.
Dilated causal STJGC
As illustrated in Figure 3, we first transform the inputs into a $d$-dimensional space using fully-connected layers. Then we stack several STJGC layers upon it in a dilated causal manner. Different from the standard dilated causal convolution using 1D CNNs, we use the STJGC in each layer to model the dynamic and comprehensive spatiotemporal correlations. Given the length of the input graph signals, we stack four STJGC layers whose kernel sizes and dilation factors are chosen accordingly. The residual connections [17] are also applied in each STJGC layer at the corresponding output time steps. The number of STJGC layers, the dilation factors, and the kernel sizes can be redesigned according to the length of the input graph signals, in order to ensure that the output of the last STJGC layer covers the information from all input time steps.

In these dilated causal STJGC layers, each STJGC layer captures different ranges of spatiotemporal dependencies. For example, as shown in Figure 3, in the first STJGC layer, the hidden state at time step $t$ aggregates information from 1-hop neighborhoods at time steps $t$ and $t-1$. As the layers go deeper, they extract features from higher-order neighborhoods over longer time ranges. In particular, in the last STJGC layer, each node at time step $t$ captures the information within 4-hop neighborhoods from all input time steps.
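The requirement that the last layer cover all input time steps can be checked with a small receptive-field helper; the kernel size and dilation schedule below are illustrative assumptions, not the paper's reported settings:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked dilated causal layers:
    each layer adds (kernel_size - 1) * dilation past steps."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# With kernel size 2, four layers and a growing dilation schedule
# (an illustrative choice) cover all 12 input time steps.
assert receptive_field(2, [1, 2, 4, 4]) == 12
```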
4.4 Prediction Module
In this module, we first propose a multi-range attention mechanism to aggregate the information of different ranges extracted by the dilated causal STJGC layers, and then apply independent fully-connected layers to produce the multi-step-ahead prediction results.
4.4.1 MultiRange Attention
As introduced in Section 4.3.2, each STJGC layer captures a different spatiotemporal range of dependencies. A small range uncovers local dependencies, and a large range indicates global dependencies, e.g., the correlations between distant nodes at distant time steps. Thus, it is essential to combine the multi-range information. In addition, the importance of different ranges could be diverse. We propose a multi-range attention mechanism to aggregate the information of different ranges. Mathematically, we denote the hidden state of node $v_i$ at time step $t$ in the $l$-th STJGC layer as $h_i^{(l)}$; the attention score is computed as:

$$s^{(l)} = v^\top \tanh\left(W h_i^{(l)} + b\right) \tag{15}$$

$$\alpha^{(l)} = \frac{\exp\left(s^{(l)}\right)}{\sum_{l'=1}^{L} \exp\left(s^{(l')}\right)} \tag{16}$$

where $W$, $b$, and $v$ are learnable parameters, $L$ is the number of STJGC layers, and $\alpha^{(l)}$ is the attention score, indicating the importance of $h_i^{(l)}$. Based on the attention scores, the multi-range information can be aggregated as:

$$h_i = \sum_{l=1}^{L} \alpha^{(l)} h_i^{(l)} \tag{17}$$

where $h_i$ is the updated feature of node $v_i$, which aggregates the information from multiple spatiotemporal ranges. The attention mechanism is conducted on all of the nodes in parallel with shared learnable parameters, and produces an output $Z \in \mathbb{R}^{N \times d}$.
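The per-node scoring and layer-wise softmax of Equations 15-17 can be sketched for all nodes at once; shapes and random parameters are illustrative:

```python
import numpy as np

def multi_range_attention(H_layers, W, b, v):
    """Sketch of multi-range attention: score each layer's hidden state
    per node, softmax over the L layers, then aggregate.
    H_layers: (L, N, F) hidden states from the L STJGC layers;
    W: (F, F), b: (F,), v: (F,) shared learnable parameters."""
    s = np.tanh(H_layers @ W + b) @ v            # (L, N) scores
    a = np.exp(s - s.max(axis=0))                # softmax over layers
    a = a / a.sum(axis=0)                        # (L, N) attention weights
    return (a[..., None] * H_layers).sum(axis=0) # (N, F) aggregated output

rng = np.random.default_rng(4)
L, N, F = 3, 4, 6
H_layers = rng.standard_normal((L, N, F))
W, b, v = rng.standard_normal((F, F)), np.zeros(F), rng.standard_normal(F)
out = multi_range_attention(H_layers, W, b, v)
assert out.shape == (N, F)
```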
4.4.2 Independent FullyConnected Layers
As the traffic at different time steps may exhibit different properties, it is better to use different networks to generate the predictions at different forecasting horizons. We thus apply independent two-layer fully-connected networks upon $Z$ to produce the $Q$-step-ahead prediction results:

$$\hat{Y}_q = \sigma\left(Z W_{q,1} + b_{q,1}\right) W_{q,2} + b_{q,2} \tag{18}$$

where $\hat{Y}_q$ denotes the prediction result at time step $t+q$ ($q = 1, \ldots, Q$), $W_{q,1}$, $b_{q,1}$, $W_{q,2}$, and $b_{q,2}$ are the corresponding learnable parameters, and $\sigma$ is an activation function.
4.4.3 Loss Function
The mean absolute error (MAE) loss is commonly used for the traffic forecasting problem [23, 40, 51]. In practice, the MAE loss optimizes all prediction values equally regardless of their magnitude, which leads to relatively poor predictions for small values compared to large values. The mean absolute percentage error (MAPE) loss is more sensitive to the predictions of small values. Thus, we propose to combine the MAE loss and the MAPE loss as our loss function:

$$\mathcal{L}(\Theta) = \mathcal{L}_{\mathrm{MAE}} + \lambda \, \mathcal{L}_{\mathrm{MAPE}} \tag{19}$$

where $\lambda$ is used to balance the MAE loss and the MAPE loss, and $\Theta$ denotes all learnable parameters in STJGCN.
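The combined objective can be sketched as follows; the small epsilon guarding the MAPE denominator is an implementation detail assumed here, not stated in the text:

```python
import numpy as np

def combined_loss(pred, true, lam, eps=1e-5):
    """Sketch of the training objective (Eq. 19): MAE plus a MAPE term
    weighted by a balance factor lam. eps guards against division by
    near-zero ground-truth values (an assumed implementation detail)."""
    mae = np.abs(pred - true).mean()
    mape = (np.abs(pred - true) / np.maximum(np.abs(true), eps)).mean()
    return mae + lam * mape

pred = np.array([10.0, 20.0])
true = np.array([12.0, 20.0])
loss = combined_loss(pred, true, lam=0.5)
assert abs(loss - (1.0 + 0.5 * (1.0 / 12.0))) < 1e-9
```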
4.5 Complexity Analysis
Module  Time complexity

STJG construction module  $O(N^2 d)$
Dilated causal STJGC module  $O(KMF)$
Prediction module  $O(N(L+Q)F)$
We further analyze the time complexity of the main components of each module in STJGCN, which is summarized in Table I.
In the STJG construction module, the computation mainly comes from learning the adaptive STJG adjacency matrix (Equation 6). The time complexity is $O(N^2 d)$, where $N$ denotes the number of nodes and $d$ is the dimension of the spatiotemporal embedding. Regarding $d$ as a constant, the time complexity becomes $O(N^2)$, which is attributed to the pairwise computation over the nodes' embeddings.
In the dilated causal STJGC module, the time complexity mainly depends on the computation of each STJGC operation (Equations 10 and 12), which incurs $O(KMF)$ time complexity. Here, $K$ is the kernel size, $M$ denotes the number of edges in the graph, and $F$ is the dimension of the hidden states. The time complexity of STJGC mainly depends on $M$, as each node aggregates information from its neighborhoods, whose total count equals the number of edges.
In the prediction module, the time complexities of the multi-range attention mechanism (Equations 15, 16, and 17) and the independent fully-connected layers (Equation 18) are $O(NLF)$ and $O(NQF)$, respectively. Thus, the total time complexity of the prediction module is $O(N(L+Q)F)$, where $L$ is the number of STJGC layers and $Q$ is the number of time steps to be predicted. The time complexity is highly related to $Q$, as we use independent fully-connected layers to produce the multi-step prediction results.
5 Experiments
5.1 Datasets
Dataset  Time range  Time interval  # Nodes

PeMSD3  1/Sep/2018 to 30/Nov/2018  5-minute  358
PeMSD4  1/Jan/2018 to 28/Feb/2018  5-minute  307
PeMSD7  1/May/2017 to 31/Aug/2017  5-minute  883
PeMSD8  1/Jul/2016 to 31/Aug/2016  5-minute  170
We evaluate our STJGCN on four highway traffic datasets: PeMSD3, PeMSD4, PeMSD7, and PeMSD8, which were released in [13, 29]. These datasets are collected by the Caltrans Performance Measurement System (PeMS) from four districts in real time every 30 seconds. The raw traffic data is aggregated into 5-minute intervals. There are three kinds of traffic measurements in the PeMSD4 and PeMSD8 datasets, including total flow, average speed, and average occupancy. In the PeMSD3 and PeMSD7 datasets, only the traffic flow is recorded. Following previous studies [1, 7], we predict the traffic flow on all datasets. The summary statistics of the four datasets are presented in Table II.
5.2 Experimental Setup
5.2.1 Evaluation Metrics
We adopt three widely used metrics for evaluation, i.e., the mean absolute error (MAE), the root mean squared error (RMSE), and the mean absolute percentage error (MAPE), which are defined as:

$$\mathrm{MAE} = \frac{1}{NQ} \sum_{q=1}^{Q} \sum_{i=1}^{N} \left| \hat{y}_i^{(q)} - y_i^{(q)} \right| \tag{20}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{NQ} \sum_{q=1}^{Q} \sum_{i=1}^{N} \left( \hat{y}_i^{(q)} - y_i^{(q)} \right)^2} \tag{21}$$

$$\mathrm{MAPE} = \frac{1}{NQ} \sum_{q=1}^{Q} \sum_{i=1}^{N} \left| \frac{\hat{y}_i^{(q)} - y_i^{(q)}}{y_i^{(q)}} \right| \times 100\% \tag{22}$$

where $\hat{y}_i^{(q)}$ and $y_i^{(q)}$ denote the prediction result and the ground truth of node $v_i$ at time step $t+q$, respectively, $N$ is the number of nodes, and $Q$ is the number of time steps to be predicted.
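These three metrics can be computed directly with NumPy; the small prediction/ground-truth arrays below are illustrative:

```python
import numpy as np

def mae(pred, true):
    return np.abs(pred - true).mean()

def rmse(pred, true):
    return np.sqrt(((pred - true) ** 2).mean())

def mape(pred, true):
    return (np.abs(pred - true) / np.abs(true)).mean() * 100

# pred/true hold Q steps for N nodes; values are illustrative.
pred = np.array([[10.0, 20.0], [30.0, 40.0]])
true = np.array([[12.0, 20.0], [30.0, 50.0]])
assert abs(mae(pred, true) - 3.0) < 1e-9
assert abs(rmse(pred, true) - np.sqrt(26.0)) < 1e-9
```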
5.2.2 Experimental Settings
Dataset  $\epsilon$  $\kappa$  $d$  $K$  $\lambda$

PeMSD3  0.5  0.5  64  2  0.1
PeMSD4  0.5  0.5  64  3  1.0
PeMSD7  0.9  0.7  64  2  0.5
PeMSD8  0.5  0.3  64  2  1.5
The PeMSD3 and PeMSD7 datasets contain one traffic measurement (i.e., traffic flow). Thus, the dimensions of the input and output are both 1. The PeMSD4 and PeMSD8 datasets contain three traffic measurements (i.e., traffic flow, average speed, and average occupancy), and only the traffic flow is predicted in the experiments [1, 7]. Thus, the dimensions of the input and output are 3 and 1, respectively. Following previous studies [1, 7], we use the traffic data of the historical 12 time steps ($P = 12$) to forecast the next 12 time steps ($Q = 12$).
The core hyperparameters in STJGCN include the thresholds $\epsilon$ and $\kappa$ in the predefined and adaptive STJG adjacency matrices respectively, the dimension $d$ of the hidden states, the kernel size $K$ of each STJGC layer, and the balance factor $\lambda$ in the loss function. We tune these hyperparameters on the validation set and select the values that achieve the best validation performance. We provide a parameter study in Section 5.3.3. The detailed hyperparameter settings of STJGCN on the four datasets are presented in Table III.
The nonlinear activation function in STJGCN refers to the ReLU activation [25], and we also add a Batch Normalization [18] layer before each ReLU activation function.

We train our model using the Adam optimizer [20] with an initial learning rate of 0.001 and a batch size of 64 on an NVIDIA Tesla V100 GPU. We run the experiments for 200 epochs and save the best model as evaluated on the validation set. We run each experiment 5 times, and report the mean errors and standard deviations.
5.2.3 Baseline Methods
We compare STJGCN with 11 baseline methods, which can be divided into two categories. The first category is the time-series prediction models, including:

FC-LSTM [31]: an encoder-decoder framework using long short-term memory (LSTM) with peephole connections for multi-step time-series prediction.

SVR [11]: Support Vector Regression, which utilizes a linear support vector machine to perform regression.
The second category refers to the spatiotemporal graph neural networks, which are detailed as follows:

DCRNN [23]: Diffusion Convolutional Recurrent Neural Network, which models the traffic as a diffusion process, and integrates diffusion convolution with recurrent neural network (RNN) into the encoderdecoder architecture.

STGCN [46]: SpatioTemporal Graph Convolutional Network, which employs graph convolutional network (GCN) to capture spatial dependencies and 1D convolutional neural network (CNN) for temporal correlations modeling.

ASTGCN [13]: Attention based SpatioTemporal Graph Convolutional Network that designs spatial and temporal attention mechanisms to capture spatial and temporal patterns, respectively.

Graph WaveNet [40]: a graph neural network that performs diffusion convolution with both predefined and selfadaptive adjacency matrices to capture spatial dependencies, and applies 1D dilated causal convolution to capture temporal dependencies.

STSGCN [29]: SpatioTemporal Synchronous Graph Convolutional Network that designs spatiotemporal synchronous modeling mechanism to capture localized spatiotemporal correlations.

AGCRN [1]: Adaptive Graph Convolutional Recurrent Network that learns dataadaptive adjacency matrix for graph convolution to model spatial correlations and uses gated recurrent unit (GRU) to model temporal correlations.

GMAN [51]: Graph MultiAttention Network is an encoderdecoder framework, which designs multiple spatial and temporal attention mechanisms in the encoder and decoder to model spatiotemporal correlations, and a transform attention mechanism to transform information from encoder to decoder.
Dataset  Metrics  VAR  SVR  FC-LSTM  DCRNN  STGCN  ASTGCN  Graph WaveNet  STSGCN  AGCRN  GMAN  ZGCNETs  STJGCN

PeMSD3  MAE  19.72  19.77  19.56±0.32  17.62±0.13  19.76±0.67  18.67±0.42  15.67±0.06  15.74±0.09  16.10±0.16  15.52±0.09  15.90±0.77  14.92±0.10
RMSE  32.38  32.78  33.38±0.46  29.86±0.47  33.87±1.18  30.71±1.02  26.42±0.14  26.39±0.36  28.55±0.28  26.53±0.19  27.90±0.86  25.70±0.41
MAPE (%)  20.50  23.04  19.56±0.51  16.83±0.13  17.33±0.94  19.85±1.06  15.72±0.23  15.40±0.07  15.02±0.26  15.19±0.25  15.51±1.67  14.81±0.16
PeMSD4  MAE  24.44  26.18  23.60±0.52  24.42±0.06  23.90±0.17  22.90±0.20  19.91±0.10  19.62±0.16  19.74±0.09  19.25±0.06  19.54±0.07  18.81±0.06
RMSE  37.76  38.91  37.11±0.50  37.48±0.10  36.43±0.22  35.59±0.35  31.06±0.17  31.02±0.29  32.01±0.17  30.85±0.21  31.33±0.11  30.35±0.09
MAPE (%)  17.27  22.84  16.17±0.13  16.86±0.09  13.67±0.14  16.75±0.59  13.62±0.22  13.13±0.11  12.98±0.21  13.00±0.26  12.87±0.05  11.92±0.04
PeMSD7  MAE  27.96  28.45  34.05±0.51  24.45±0.85  26.22±0.37  28.13±0.70  20.83±0.18  21.64±0.11  21.22±0.17  20.68±0.08  21.26±0.28  19.95±0.04
RMSE  41.31  42.67  55.70±0.60  37.61±1.18  39.18±0.42  43.67±1.33  33.64±0.22  34.87±0.27  35.05±0.13  33.56±0.12  34.53±0.28  33.01±0.07
MAPE (%)  12.11  14.00  15.31±0.31  10.67±0.53  10.74±0.16  13.31±0.55  9.10±0.27  9.09±0.05  9.00±0.12  9.31±0.12  9.04±0.11  8.31±0.11
PeMSD8  MAE  19.83  20.92  21.18±0.27  18.49±0.16  18.79±0.49  18.72±0.16  15.57±0.12  16.12±0.25  15.92±0.19  14.87±0.15  16.12±0.08  14.53±0.17
RMSE  29.24  31.23  31.88±0.43  27.30±0.22  28.23±0.36  28.99±0.11  24.32±0.21  24.89±0.52  25.31±0.25  24.06±0.16  25.74±0.13  23.74±0.20
MAPE (%)  13.08  14.24  13.72±0.27  11.69±0.06  10.55±0.30  12.53±0.48  10.32±0.79  10.50±0.22  10.30±0.13  9.77±0.07  10.35±0.09  9.15±0.09
5.3 Experimental Results
5.3.1 Overall Comparison
Table IV presents the forecasting performance comparison of our STJGCN with 11 baseline methods. We observe that: (1) the time-series prediction models, including the traditional approach (VAR), the machine-learning-based method (SVR), and the deep neural network (FC-LSTM), perform poorly, as they consider only temporal correlations; (2) spatiotemporal graph neural networks generally achieve better performance, as they additionally model spatial correlations with graph neural networks; (3) our STJGCN performs the best in terms of all metrics on all datasets (a 1.4%–7.7% improvement over the second-best results). Compared with other graph-based methods, the advantages of STJGCN are threefold. First, STJGCN models comprehensive spatiotemporal correlations. Second, it captures dynamic dependencies at different time steps. Third, it leverages information from multiple spatiotemporal ranges.
5.3.2 Ablation Study
Table V. Ablation study on the PeMSD4 and PeMSD8 datasets (mean ± standard deviation).

| Dataset | Metric | STJGCN | w/o STC-pdf | w/o STC-adt | w/o STC | w/o dgm | w/o mr | w/o att | w/o idp |
|---|---|---|---|---|---|---|---|---|---|
| PeMSD4 | MAE | 18.81±0.06 | 18.99±0.14 | 19.07±0.10 | 19.36±0.09 | 19.70±0.06 | 19.03±0.04 | 18.97±0.09 | 18.89±0.08 |
| PeMSD4 | RMSE | 30.35±0.09 | 30.63±0.23 | 30.71±0.13 | 30.80±0.10 | 31.47±0.05 | 30.79±0.08 | 30.56±0.12 | 30.46±0.10 |
| PeMSD4 | MAPE (%) | 11.92±0.04 | 12.00±0.07 | 12.07±0.06 | 12.27±0.08 | 12.39±0.07 | 11.98±0.03 | 11.96±0.02 | 11.95±0.02 |
| PeMSD8 | MAE | 14.53±0.17 | 14.63±0.23 | 14.82±0.09 | 15.07±0.07 | 15.49±0.22 | 15.11±0.57 | 14.67±0.11 | 14.60±0.11 |
| PeMSD8 | RMSE | 23.74±0.20 | 24.01±0.22 | 24.11±0.14 | 24.22±0.14 | 24.49±0.23 | 24.49±0.55 | 24.03±0.30 | 23.96±0.21 |
| PeMSD8 | MAPE (%) | 9.15±0.09 | 9.18±0.19 | 9.26±0.08 | 9.48±0.06 | 9.55±0.16 | 9.39±0.22 | 9.16±0.09 | 9.16±0.12 |
To better understand the effectiveness of the different components of STJGCN, we conduct ablation studies on the PeMSD4 and PeMSD8 datasets.
Effect of spatiotemporal connections
One difference between our STJG and a normal spatiotemporal graph is that we explicitly add spatiotemporal connections between different nodes at different time steps. To evaluate the effectiveness of these connections, we drop them separately or simultaneously from the predefined and/or adaptive STJG. The three resulting variants of STJGCN are named "w/o STC-pdf" (dropped from the predefined STJG), "w/o STC-adt" (dropped from the adaptive STJG), and "w/o STC" (dropped from both), respectively. The results in Table V demonstrate that introducing spatiotemporal connections improves performance, as they help the model explicitly capture comprehensive spatiotemporal correlations.
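The distinction can be sketched by assembling a block adjacency matrix over a short window of time steps. This is a minimal illustration assuming an undirected, unweighted spatial graph; the paper's predefined and adaptive STJGs are constructed differently (Section 4.2), but the role of the spatiotemporal connections is the same:

```python
import numpy as np

def build_stjg(A, K):
    """Build a spatiotemporal joint adjacency over K consecutive time steps.

    Diagonal blocks hold the spatial connections at each step. Blocks between
    adjacent steps hold the temporal self-connections (identity) plus the
    spatiotemporal connections between *different* nodes (A) -- the part the
    "w/o STC" ablation removes, leaving only the identity there.
    """
    N = A.shape[0]
    cross = A + np.eye(N)                      # temporal self + spatiotemporal links
    stjg = np.zeros((K * N, K * N))
    for t in range(K):
        stjg[t*N:(t+1)*N, t*N:(t+1)*N] = A     # spatial connections at step t
        if t + 1 < K:
            stjg[t*N:(t+1)*N, (t+1)*N:(t+2)*N] = cross
            stjg[(t+1)*N:(t+2)*N, t*N:(t+1)*N] = cross
    return stjg

# Toy example: two connected sensors observed over two time steps.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
S = build_stjg(A, K=2)
```

In this toy graph, node 0 at step 0 is connected to node 1 at step 0 (spatial), to itself at step 1 (temporal), and to node 1 at step 1 (the spatiotemporal connection that a normal spatiotemporal graph omits).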
Effect of dynamic graph modeling
To evaluate the effect of dynamic graph modeling, we experiment with learning static adjacency matrices instead. Specifically, we design a variant of STJGCN ("w/o dgm") that uses only the node embeddings to generate the adaptive STJG adjacency matrix, without the time feature. The results in Table V validate the effectiveness of modeling dynamic correlations among nodes at different time steps.
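The idea can be sketched as follows; the parameterization below (ReLU similarity of embeddings followed by a row-wise softmax) is an illustrative assumption, not the paper's exact formulation. Passing `time_emb=None` corresponds to the static "w/o dgm" variant, while conditioning on a per-step time embedding makes the learned graph change over time:

```python
import numpy as np

def row_softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def adaptive_adjacency(node_emb, time_emb=None):
    """Generate an adjacency matrix from learnable node embeddings.

    With time_emb=None this behaves like the static "w/o dgm" variant;
    adding a per-step time embedding yields a different graph per time step.
    (Illustrative parameterization; the paper's exact form may differ.)
    """
    E = node_emb if time_emb is None else node_emb + time_emb   # (N, d)
    scores = np.maximum(E @ E.T, 0.0)                           # ReLU similarity
    return row_softmax(scores)                                  # row-stochastic adjacency

rng = np.random.default_rng(0)
emb = rng.standard_normal((5, 8))                    # node embeddings (N=5, d=8)
A_t1 = adaptive_adjacency(emb, rng.standard_normal((1, 8)))   # graph at step 1
A_t2 = adaptive_adjacency(emb, rng.standard_normal((1, 8)))   # graph at step 2
```

With different time embeddings, the two generated adjacency matrices differ, which is exactly the dynamic behavior the "w/o dgm" variant gives up.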
Effect of multi-range information
To verify the effect of multi-range information, we design a variant of STJGCN, namely "w/o mr", which does not combine multiple ranges of information but directly uses the output of the last STJGC layer to produce the predictions. The results in Table V indicate the necessity of leveraging multi-range information. We further design a variant "w/o att" that directly sums the outputs of the STJGC layers without the multi-range attention mechanism; it performs worse than STJGCN, showing that it is beneficial to distinguish the importance of different ranges of information.
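The fusion step can be sketched as attention over the per-layer outputs. The scoring vector `w` below stands in for the paper's attention parameters, whose exact form is not reproduced here; the "w/o att" ablation corresponds to replacing the weighted sum with a plain sum over layers:

```python
import numpy as np

def softmax_over_layers(x):
    e = np.exp(x - x.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def multi_range_attention(layer_outputs, w):
    """Fuse the outputs of L stacked STJGC layers (one per receptive range).

    layer_outputs: list of L arrays of shape (N, d); w: (d,) scoring vector
    (a stand-in for learned attention parameters). Each node gets its own
    distribution over the L ranges.
    """
    H = np.stack(layer_outputs)                 # (L, N, d)
    alpha = softmax_over_layers(H @ w)          # (L, N): weight per range, per node
    return (alpha[..., None] * H).sum(axis=0)   # (N, d) fused representation

rng = np.random.default_rng(1)
outs = [rng.standard_normal((4, 6)) for _ in range(3)]   # 3 ranges, N=4, d=6
fused = multi_range_attention(outs, rng.standard_normal(6))
```

Because the weights form a convex combination per node, the fused features always lie within the envelope of the per-range features, unlike the plain sum used by "w/o att".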
Effect of independent fully-connected layers
In the prediction module, we use independent fully-connected layers to produce the multi-step predictions. To evaluate this design, we experiment with shared fully-connected layers whose output layer produces the predictions for all time steps at once. We name this variant of STJGCN "w/o idp" and present the experimental results in Table V. We observe that STJGCN improves performance by introducing independent learning parameters for multi-step prediction. A potential reason is that the traffic at different time steps may exhibit different properties, so using different networks to generate the predictions at different forecasting horizons can be beneficial.
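The structural difference between the two prediction heads can be sketched with single linear layers (the actual heads are likely deeper; the dimensions below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
N, d, Q = 5, 16, 12           # nodes, hidden dimension, forecast horizon

H = rng.standard_normal((N, d))   # final hidden representation of each node

# Independent heads: a separate linear layer per forecast step ("idp"),
# so each horizon gets its own parameters.
W_idp = rng.standard_normal((Q, d))
b_idp = rng.standard_normal(Q)
pred_idp = np.stack([H @ W_idp[q] + b_idp[q] for q in range(Q)])   # (Q, N)

# Shared head: one linear layer emitting all Q steps at once ("w/o idp"),
# so every horizon is produced from the same weight matrix.
W_shared = rng.standard_normal((d, Q))
b_shared = rng.standard_normal(Q)
pred_shared = (H @ W_shared + b_shared).T                          # (Q, N)
```

Both variants emit the same (Q, N) prediction tensor; the difference is whether the parameters producing each horizon are tied or independent.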
Effect of different STJG adjacency matrix configurations
We further conduct experiments with different STJG adjacency matrix configurations to evaluate their effectiveness. As shown in Table VI, the models with only predefined STJG adjacency matrices (lines 3–4) perform poorly, as they do not capture the underlying dependencies in the data. The models with only adaptive STJG adjacency matrices (lines 5–6) achieve promising performance, which indicates that our model can be used even when the graph structure is unavailable. Using both predefined and adaptive STJG adjacency matrices (line 7) yields better results. We further apply a gating fusion approach (Section 4.3.1) in STJGCN (line 8) and observe consistent improvements in predictive performance, as the gate controls the information flow between the predefined and adaptive STJGs.
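The gating fusion can be sketched as a learned element-wise gate between the two branch representations. The shapes and the single-layer gate parameterization below are assumptions for illustration; Section 4.3.1 gives the paper's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_pdf, h_adt, W1, W2, b):
    """Gate between the predefined-STJG and adaptive-STJG representations.

    z in (0, 1) decides, per node and per feature, how much information
    flows from each branch. Shapes assumed: h_* (N, d), W_* (d, d), b (d,).
    """
    z = sigmoid(h_pdf @ W1 + h_adt @ W2 + b)   # learned gate
    return z * h_pdf + (1.0 - z) * h_adt

rng = np.random.default_rng(2)
N, d = 4, 8
h_pdf = rng.standard_normal((N, d))   # features from the predefined STJG branch
h_adt = rng.standard_normal((N, d))   # features from the adaptive STJG branch
fused = gated_fusion(h_pdf, h_adt,
                     rng.standard_normal((d, d)), rng.standard_normal((d, d)),
                     rng.standard_normal(d))
```

Since the gate is strictly between 0 and 1, the fused features interpolate between the two branches rather than simply adding them, which is how the gate "controls the information flow".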
Table VI. Performance with different STJG adjacency matrix configurations (mean ± standard deviation).

| STJG adjacency matrix configuration | PeMSD4 MAE | PeMSD4 RMSE | PeMSD4 MAPE (%) | PeMSD8 MAE | PeMSD8 RMSE | PeMSD8 MAPE (%) |
|---|---|---|---|---|---|---|
| predefined only | 24.64±0.05 | 38.21±0.02 | 15.70±0.08 | 18.52±0.10 | 29.24±0.18 | 11.35±0.08 |
| predefined only | 24.40±0.06 | 38.03±0.23 | 15.47±0.03 | 18.12±0.07 | 28.49±0.16 | 11.19±0.11 |
| adaptive only | 19.39±0.12 | 31.60±0.23 | 12.38±0.08 | 15.93±0.15 | 25.87±0.23 | 9.98±0.07 |
| adaptive only | 19.35±0.13 | 31.47±0.16 | 12.34±0.14 | 15.42±0.15 | 24.80±0.32 | 9.85±0.14 |
| predefined + adaptive | 18.93±0.09 | 30.48±0.13 | 11.97±0.04 | 14.65±0.08 | 23.93±0.14 | 9.23±0.08 |
| predefined + adaptive + gf (ours) | 18.81±0.06 | 30.35±0.09 | 11.92±0.04 | 14.53±0.17 | 23.74±0.20 | 9.15±0.09 |
5.3.3 Parameter Study
Table VII. Comparison of model size and computation time.

| Dataset | Metric | DCRNN | STGCN | Graph WaveNet | ASTGCN | STSGCN | AGCRN | GMAN | Z-GCNETs | STJGCN |
|---|---|---|---|---|---|---|---|---|---|---|
| PeMSD3 | # Parameters (M) | 0.37 | 0.42 | 0.31 | 0.59 | 3.50 | 0.75 | 0.57 | 0.52 | 0.32 |
| PeMSD3 | Training time (s/epoch) | 118.06 | 12.20 | 59.73 | 78.69 | 127.86 | 55.45 | 168.77 | 208.55 | 49.82 |
| PeMSD3 | Inference time (s) | 18.70 | 19.10 | 5.16 | 26.80 | 15.41 | 8.44 | 17.45 | 25.79 | 5.22 |
| PeMSD4 | # Parameters (M) | 0.37 | 0.38 | 0.31 | 0.45 | 2.87 | 0.75 | 0.57 | 0.52 | 0.31 |
| PeMSD4 | Training time (s/epoch) | 69.55 | 6.54 | 32.40 | 53.51 | 56.18 | 37.05 | 82.40 | 88.41 | 25.64 |
| PeMSD4 | Inference time (s) | 11.97 | 13.44 | 2.60 | 14.67 | 6.03 | 5.55 | 9.16 | 11.84 | 2.87 |
| PeMSD7 | # Parameters (M) | 0.37 | 0.75 | 0.31 | 3.24 | 15.36 | 0.75 | 0.57 | 0.52 | 0.36 |
| PeMSD7 | Training time (s/epoch) | 306.66 | 33.59 | 173.85 | 213.30 | 465.12 | 189.48 | 779.12 | 624.32 | 158.64 |
| PeMSD7 | Inference time (s) | 45.13 | 71.17 | 16.17 | 64.81 | 54.60 | 26.31 | 83.20 | 89.99 | 16.30 |
| PeMSD8 | # Parameters (M) | 0.37 | 0.30 | 0.31 | 0.18 | 1.66 | 0.75 | 0.57 | 0.52 | 0.31 |
| PeMSD8 | Training time (s/epoch) | 46.41 | 4.24 | 20.48 | 47.07 | 31.23 | 21.74 | 32.27 | 52.51 | 17.60 |
| PeMSD8 | Inference time (s) | 8.81 | 9.37 | 1.72 | 14.01 | 3.09 | 3.04 | 4.06 | 7.36 | 1.67 |
We conduct a parameter study on five core hyperparameters of STJGCN on the PeMSD4 and PeMSD8 datasets: the thresholds of the predefined and adaptive STJG adjacency matrices, the dimension of the hidden states, the kernel size of the STJGC operation, and the threshold in the loss function. In each experiment, we vary the parameter under investigation while fixing all the others. Figures 4 and 5 show the experimental results on the PeMSD4 and PeMSD8 datasets, respectively.
As shown in Figures 4 and 5, performance is not strongly sensitive to the sparsity of the STJG adjacency matrices, which we attribute to the adaptive STJG adjacency matrix adjusting itself during training to aggregate neighboring information. In general, a sparser adjacency matrix helps select the most related nodes for each node and leads to better results; however, an overly sparse graph may lose the connections between interrelated nodes and thus degrade performance. We set the thresholds on each dataset according to the validation loss.
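The trade-off above comes from a simple thresholding step: raising the threshold prunes more edges from the (dense) adjacency matrix. A minimal sketch, assuming the threshold is applied element-wise to edge weights:

```python
import numpy as np

def sparsify(A, threshold):
    """Keep only the connections whose weight reaches the threshold;
    a larger threshold yields a sparser graph."""
    return np.where(A >= threshold, A, 0.0)

rng = np.random.default_rng(3)
A = rng.random((6, 6))          # a dense (e.g., adaptive) adjacency matrix
A_mild = sparsify(A, 0.2)       # mild pruning: most edges survive
A_harsh = sparsify(A, 0.8)      # harsh pruning: only the strongest edges survive
```

With a moderate threshold, weak (likely spurious) edges are removed while the strongest correlations are preserved; an overly large threshold starts cutting genuinely related node pairs, matching the degradation observed in the figures.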
As shown in Figures 4 and 5, increasing the number of hidden units enhances the model's expressive capacity. However, when it exceeds 64, performance degrades significantly, as the model must learn more parameters and may suffer from overfitting.
Figures 4 and 5 show that the model performs poorly when the kernel size equals 1, as it then captures only spatial dependencies and ignores correlations in the temporal dimension. We further observe that aggregating information from two or three neighboring time steps at each time step is sufficient; larger kernel sizes degrade performance. This is possibly because a node's state at a given time step is correlated with nodes at only a limited number of neighboring time steps, so a large kernel size introduces noise into the model. We therefore set the kernel size on the PeMSD4 and PeMSD8 datasets according to the validation loss.
In the parameter study of the threshold in the loss function, we report the validation MAE, RMSE, and MAPE instead of the loss value, as the threshold directly affects the magnitude of the loss. As shown in Figures 4 and 5, a larger threshold means the model optimizes more for the MAPE loss and less for the MAE loss, which leads to a smaller MAPE and a larger MAE; the RMSE is also influenced. Through a comprehensive consideration of the validation MAE, RMSE, and MAPE and their standard deviations, we choose the threshold for the PeMSD4 and PeMSD8 datasets accordingly.
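The described trade-off can be illustrated with a weighted combination of the two error terms. The convex combination below is only a stand-in for the paper's thresholded loss, whose exact definition is elided from this copy; it reproduces the reported behavior that putting more weight on MAPE pulls the MAPE down at the cost of the MAE:

```python
import numpy as np

def combined_loss(y_true, y_pred, lam, eps=1e-6):
    """Trade off MAE against MAPE; a larger lam weights MAPE more heavily.

    (Illustrative convex combination, not the paper's exact loss.)
    """
    abs_err = np.abs(y_true - y_pred)
    mae = abs_err.mean()
    mape = (abs_err / np.maximum(np.abs(y_true), eps)).mean()  # guard against /0
    return (1.0 - lam) * mae + lam * mape

y = np.array([10.0, 20.0, 30.0])
yhat = np.array([12.0, 18.0, 33.0])
```

At `lam = 0` the objective is pure MAE, at `lam = 1` pure MAPE; intermediate values interpolate, which is why the choice of the trade-off parameter shifts the balance between the validation MAE and MAPE.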
5.3.4 Performance Comparison at Each Horizon
Figures 6 and 7 present the forecasting performance comparison of our STJGCN with five representative baseline methods (Graph WaveNet, STSGCN, AGCRN, GMAN, and Z-GCNETs) at each prediction time step on the PeMSD4 and PeMSD8 datasets, respectively. We exclude the other baseline methods due to their poorer performance, as shown in Table IV. We observe that Graph WaveNet performs well in short-term (one or two time steps ahead) prediction, but its performance degrades quickly as the forecasting horizon increases. The performance of GMAN degrades slowly as predictions are made further into the future, and it performs well in long-term (e.g., 12 time steps ahead) prediction, though still worse than STJGCN. In general, our model achieves the best performance at almost all horizons in terms of all three metrics on both datasets.
5.3.5 Model Size and Computation Time
We compare the model size and computation time of our STJGCN with the graph-based baseline methods in Table VII. The results demonstrate the high computational efficiency of our model. In terms of model size, STJGCN has fewer parameters than most of the baseline models. In the training phase, our model runs faster than all other methods except STGCN. In the inference stage, STGCN runs very slowly as it generates multi-step predictions iteratively, while STJGCN and Graph WaveNet are the most efficient. Considering also the prediction accuracy (see Table IV), our model shows a superior ability to balance predictive performance against time consumption and model size.
6 Conclusion
We proposed STJGCN, which models comprehensive and dynamic spatiotemporal correlations and aggregates multiple ranges of information to forecast traffic conditions several time steps ahead on a road network. Evaluated on four public traffic datasets, STJGCN showed high computational efficiency and outperformed 11 state-of-the-art baseline methods. Our model could potentially be applied to other spatiotemporal data forecasting tasks, such as air quality inference and taxi demand prediction; we plan to investigate this in future work.
Acknowledgment
This research is supported by the Natural Science Foundation of China (61872306), the Xiamen Science and Technology Bureau (3502Z20193017), and the Fundamental Research Funds for the Central Universities (20720200031).
References
[1] (2020) Adaptive graph convolutional recurrent network for traffic forecasting. In NeurIPS.
[2] (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
[3] (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
[4] (2014) Spectral networks and deep locally connected networks on graphs. In ICLR.
[5] (2010) Zigzag persistence. Foundations of Computational Mathematics 10, pp. 367–405.
[6] (2020) Multi-range attentive bicomponent graph convolutional network for traffic forecasting. In AAAI.
[7] (2021) Z-GCNETs: time zigzags at graph convolutional networks for time series forecasting. In ICML.
[8] (2018) A neural attention model for urban air quality inference: learning the weights of monitoring stations. In AAAI, pp. 2151–2158.
[9] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, pp. 3844–3852.
[10] (2016) Latent space model for road networks to predict time-varying traffic. In KDD.
[11] (1997) Support vector regression machines. In NeurIPS, pp. 155–161.
[12] (2020) Exploiting interpretable patterns for flow prediction in dockless bike sharing systems. IEEE Transactions on Knowledge and Data Engineering.
[13] (2019) Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In AAAI, pp. 922–929.
[14] (2021) Learning dynamics and heterogeneity of spatial-temporal graph data for traffic forecasting. IEEE Transactions on Knowledge and Data Engineering.
[15] (1994) Time series analysis. Princeton University Press.
[16] (2017) Inductive representation learning on large graphs. In NeurIPS, pp. 1024–1034.
[17] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[18] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456.
[19] (2021) DeepCrowd: a deep model for large-scale citywide crowd density and flow prediction. IEEE Transactions on Knowledge and Data Engineering.
[20] (2015) Adam: a method for stochastic optimization. In ICLR.
[21] (2017) Semi-supervised classification with graph convolutional networks. In ICLR.
[22] (2020) Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI.
[23] (2018) Diffusion convolutional recurrent neural network: data-driven traffic forecasting. In ICLR.
[24] (2015) Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies 54, pp. 187–197.
[25] (2010) Rectified linear units improve restricted Boltzmann machines. In ICML, pp. 807–814.
[26] (2016) Learning convolutional neural networks for graphs. In ICML, pp. 2014–2023.
[27] (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681.
[28] (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In CVPR, pp. 12026–12035.
[29] (2020) Spatial-temporal synchronous graph convolutional networks: a new framework for spatial-temporal network data forecasting. In AAAI.
[30] (2020) Predicting citywide crowd flows in irregular regions using multi-view graph convolutional networks. IEEE Transactions on Knowledge and Data Engineering.
[31] (2014) Sequence to sequence learning with neural networks. In NeurIPS, pp. 3104–3112.
[32] (2020) A survey on modern deep neural network for traffic prediction: trends, methods and challenges. IEEE Transactions on Knowledge and Data Engineering.
[33] (2016) WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.
[34] (2017) Attention is all you need. In NeurIPS, pp. 5998–6008.
[35] (2018) Graph attention networks. In ICLR.
[36] (2020) Deep learning for spatio-temporal data mining: a survey. IEEE Transactions on Knowledge and Data Engineering.
[37] (2020) Traffic flow prediction via spatial temporal graph neural network. In WWW, pp. 1082–1092.
[38] (2019) Origin-destination matrix prediction via graph convolution: a new perspective of passenger demand modeling. In KDD, pp. 1227–1235.
[39] (2021) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems 32 (1), pp. 4–24.
[40] (2019) Graph WaveNet for deep spatial-temporal graph modeling. In IJCAI.
[41] (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, pp. 3482–3489.
[42] (2019) Revisiting spatial-temporal similarity: a deep learning framework for traffic prediction. In AAAI.
[43] (2018) Deep multi-view spatial-temporal network for taxi demand prediction. In AAAI, pp. 2588–2595.
[44] (2020) How to build a graph-based deep learning architecture in traffic domain: a survey. arXiv preprint arXiv:2005.11691.
[45] (2020) A comprehensive survey on traffic prediction. arXiv preprint arXiv:2004.08555.
[46] (2018) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In IJCAI, pp. 3634–3640.
[47] (2017) Deep spatio-temporal residual networks for citywide crowd flows prediction. In AAAI, pp. 1655–1661.
[48] (2020) Flow prediction in spatio-temporal networks based on multitask deep learning. IEEE Transactions on Knowledge and Data Engineering 32 (3).
[49] (2018) Link prediction based on graph neural networks. In NeurIPS, pp. 5165–5175.
[50] (2021) A graph-based temporal attention framework for multi-sensor traffic flow forecasting. IEEE Transactions on Intelligent Transportation Systems.
[51] (2020) GMAN: a graph multi-attention network for traffic prediction. In AAAI, pp. 1234–1241.
[52] (2020) DeepSTD: mining spatio-temporal disturbances of multiple context factors for citywide traffic flow prediction. IEEE Transactions on Intelligent Transportation Systems 21 (9), pp. 3744–3755.
[53] (2021) STPC-Net: learn massive geo-sensory data as spatio-temporal point clouds. IEEE Transactions on Intelligent Transportation Systems.