I Introduction
With the increasing volume of traffic data and the growing impact of data-driven technologies in modern transportation systems, the traffic flow forecasting problem is drawing increasing attention [16]. Reliable and timely predictions of traffic dynamics can assist transportation management, help alleviate traffic congestion, and enhance traffic efficiency.
A traffic system is characterised by the changing flows at locations in a road network, i.e., nodes (sensors) interlinked by road segments, from which one may observe salient patterns: sudden bursts, drastic fluctuations, and periodic shifts. One can develop an understanding of these patterns from two perspectives. On the outside, traffic is the accumulation of various extrinsic factors, from road layout, to geographic features of the environment, to traffic laws and erratic behaviours of drivers, all playing a role in shaping traffic conditions. On the inside, intrinsic principles govern the flow of traffic, as vehicles, the particles of the road system, maintain stable directional, speed, and concentration features as they travel through space and time, giving rise to some universal patterns of traffic flow. These universal patterns are what is behind well-established traffic flow models, such as Wardrop's equilibria and Kerner's three-phase traffic theory, which describe traffic flow from a mathematical point of view [14].
We argue that accurate predictions of traffic flow, based on macro-level traffic data, rest upon a model's ability to grasp not only extrinsic features from the given data set, but also intrinsic principles that are universal to any traffic system. At its core, traffic can be seen as a physical system [22] where localised changes cascade like waves along roads, to some other locations, some time later [12]. Therefore, the key to capturing the intrinsic principles of traffic flow lies in extracting the latent correlations between locations' present states and surrounding locations' past.
Recent breakthroughs in neural techniques, in particular those based on graph neural networks (GNNs) [23, 21, 5, 30], represent a significant step towards representing nonlinear traffic features by leveraging topological constraints, achieving better prediction results than earlier statistical methods such as autoregressive models
[17]. However, existing GNN-based traffic forecasting models have three common limitations. (1) These models rely heavily on the graph structure, which varies across road networks; as such, they are restricted to a particular road network rather than revealing intrinsic properties of traffic systems. (2) They apply computationally expensive feature aggregation operations (e.g., graph convolutions) over the entire network, and thus do not scale to large road networks that contain hundreds of thousands of sensors; for instance, the graph convolutional network (GCN) scales quadratically with the number of nodes in the network. (3) These models all consist of separate components that extract features from the spatial and the temporal dimensions, respectively, before integrating them into the same feature map to derive a prediction. The extracted feature map represents "aggregated" correlations between locations of the network across time. This setup implicitly assumes that the correlations between locations stay uniform over time. However, two locations may have different correlations at different times: as indicated in Figure 1, a node may have a stronger correlation with the target node at an earlier time than at a later time, while another node may show the opposite pattern.

To break the above limitations of existing GNN-based models, we propose a new spatiotemporal correlation learning paradigm, called spacetime interval learning, which fuses the spatial dimensions and the temporal dimension into a single manifold, i.e., a spacetime, as coined in the theory of special relativity. The correlations that we aim to capture reflect a form of "distance" between two traffic events in spacetime, where this distance is referred to as an interval in spacetime terminology [11].
Specifically, the proposed spacetime interval learning paradigm extracts the traffic data of nearby sensors within a fixed time window, regarded as the local spacetime context of each target location. The local spacetime context is analogous to a "receptive field" for sensors, ignoring the sensors that do not contribute much to the final prediction. The paradigm then learns to correlate a node at a given time with another node at a different time within the local spacetime context.
The proposed learning paradigm addresses the limitations of GNN-based models as follows. Firstly, it models the spatiotemporal correlations within the local spacetime context of each target location; in this way, the model we build is independent of the graph structure, and thus is universal to traffic flow at large rather than tied to any specific network. Secondly, our method focuses on the local spacetime, turning traffic prediction from network-level into node-level; that is, our model can predict the traffic for multiple locations of interest in parallel on different machines. Thirdly, unlike existing methods, we fuse the spatial and the temporal dimensions into a single manifold to capture the varying spatial correlations between locations across time.
Under the spacetime interval learning paradigm, we propose an implementation called the Spacetime Neural Network (STNN), which combines novel spacetime attention blocks and spacetime convolution blocks. The former highlights the interval between events, while the latter aggregates the learned features from spatial, temporal, and spatiotemporal aspects. The main advantages of the model are as follows. The code is available at https://github.com/songyangco/STNN.


Accuracy: The learned spacetime intervals explicitly capture the correlations between locations across different time instances, making the learned features highly informative for traffic prediction. As validated through experiments on two real-world datasets, METR-LA and PeMS-Bay, when trained on data from a single network, our model significantly outperforms existing methods, with, e.g., 15%–20% improvement over the current state of the art for 15-min predictions. See Section VI-B.

Transferability is a main advantage of our universal model. We train a single STNN model on one dataset and use it without fine-tuning on unseen traffic datasets. In all cases, the results achieve competitive, if not better, performance compared to state-of-the-art approaches, validating the applicability of our universal model. See Section VI-B.

Ability to handle dynamic networks is another natural consequence of converting the raw input to a local spacetime for every target node. This covers scenarios where the network undergoes structural changes, such as the blocking or addition of roads. In an experiment on a synthetic dynamic network, our model performs much better than the baselines. See Section VI-C.

Scalability: Previous GNN-based models all assume a small, fixed network whose size is limited by computational capability. By performing prediction at the node level, we break the limit on the network size: the complexity of our model is independent of the network size, making it applicable to networks of arbitrary size. See Section VI-F.
II Related Works
Studies on traffic forecasting through spatiotemporal data analysis can roughly be divided into four types.
Statistical methods. Early studies in the 2000s use various statistical methods, such as historical average (HA), autoregressive integrated moving average (ARIMA) [17], and vector autoregression (VAR) [32], for the traffic forecasting problem. However, these approaches assume the input time series to be stationary and univariate, an assumption that may not hold in a complex traffic system. Thus, statistical methods often generate inaccurate predictions.

Machine learning methods. To account for more complex traffic data, various classical machine learning methods, such as kNN [15], SVM [29], and SVR [3], are employed in the early 2010s. One shortcoming of these methods is that they fail to capture the highly nonlinear spatial and temporal patterns present in the data.

Deep learning methods. After 2015, capitalising on the success of deep learning [9], several deep learning models are proposed to capture the underlying patterns of traffic data. Recurrent neural networks (RNNs) are adopted to analyse time series patterns [8, 4], while convolutional neural networks (CNNs) are used to capture spatial dependencies by treating a traffic network as a grid [27, 28]. However, these methods omit the road network, which is an essential constraint on how traffic moves in the spatial dimension. This limitation inhibits traditional deep learning methods from accurate traffic forecasting.

Graph neural networks. More recently, graph-based deep learning models are increasingly used for spatiotemporal data analysis. Yu et al. [23] propose a spatiotemporal graph convolutional network based on spectral graph theory to extract features from the road network and the historical time series data; the spatial patterns are captured by graph convolutional layers while the temporal patterns are extracted by gated convolutional layers. Li et al. [10] model the traffic flow as a diffusion process and capture the spatial dependencies by random walks on a graph, followed by an RNN to extract temporal patterns. Yao et al. [21] use graph embeddings to capture the road network information and integrate Long Short-Term Memory (LSTM) and CNN with the graph embeddings for traffic prediction. Attention mechanisms have also been adopted to learn spatial and temporal patterns with improved performance [26, 6]. However, all of these methods fall into the same paradigm in which separate layers are applied to extract spatial patterns and temporal patterns, respectively. We summarize current methods under this paradigm in Table I. In particular, CNNs, graph convolutional networks (GCNs), or spatial attention are often used to learn spatial patterns, while gated recurrent units (GRUs), LSTMs, gated temporal convolutions (gated TCNs), or temporal attention are used to learn temporal patterns.
Studies | Spatial layers | Temporal layers
DCRNN (2017) [10] | Diffusion Conv | GRU
STGCN (2017) [23] | GCN | Gated TCN
DMVST-Net (2018) [21] | CNN | LSTM
STDN (2019) [20] | CNN | LSTM + Attn
Graph WaveNet (2019) [19] | GCN | Gated TCN
DSTGCN (2019) [5] | GCN | TCN
ASTGCN (2019) [6] | GCN + Attn | TCN + Attn
SLCNN (2020) [30] | GCN | Gated TCN
GMAN (2020) [31] | Spatial Attn | Temporal Attn
MTGCN (2020) [18] | GCN | Gated TCN
STNN (ours) | ST Conv + Attn (a single unified spacetime layer)
Unlike current methods, this work proposes a novel learning paradigm. Instead of using separate components or layers to learn spatial and temporal patterns, we design a novel spacetime module that learns the local spatiotemporal correlations, capturing both spatial and temporal patterns at the same time. Our model does not fall into any of the existing categories: instead of exploiting GNNs to learn spatial patterns, we fuse spatial and temporal features together and learn them with a unified layer.
III Problem Formulation
We summarise the list of important notations in Table II.
Notation | Description
V^t | sensor set at time t
V | union sensor set of all time steps
D^t | distance matrix at time t
D | stack of D^t along the time axis
x_i^t | a vector of features recorded by the i-th sensor at time t
X^t | a matrix of features of all sensors at time t
X | stack of X^t along the time axis
V_T | a set of target nodes to predict
v | a target node
N(v) | a set of neighboring nodes close to v
F | number of features captured by each sensor station
y_i^t | future traffic measure of the i-th sensor at time t
y_i | future traffic measures of the i-th sensor
Y | future traffic measures of all nodes in V_T
A^t | normalized connectivity matrix at time t
S_v^t | a local-spacetime snapshot for v at time t
S_v | a local spacetime for v
Traffic conditions such as vehicle speed are recorded by roadside sensor stations along with other auxiliary information including time, sensor location, etc. Together, the sensors form a sensor set V^t that monitors the traffic flow of a certain area.

At each time step t, the i-th sensor station records local traffic measures, such as the average vehicle speed over a fixed time duration, together with supplementary data. These measurements form a feature vector x_i^t. In particular, we have two features in our input data: average speed and timestamp. Collecting the feature vectors of all sensor stations, we obtain the overall observation of the traffic network at time step t as X^t. A traffic flow dataset contains the measures of a sequence of T time steps and is thus presented as a sensor feature tensor X.

Based on the locations of the sensor stations, we can obtain the travel distance on the road network between any two sensors. Consequently, we build a matrix D^t to reflect the distances of all sensor pairs. One can think of D^t as the weighted adjacency matrix of a directed complete graph: edge weights are given by actual travel distances, and the graph is complete because a distance can always be calculated between two sensors. Furthermore, we do not require D^t to remain unchanged over time. On the contrary, the travel distance between two sensors may vary because of events like traffic accidents or road constructions, which alter the underlying topology of the road network and lead to longer travel distances. Thus, to better incorporate the network dynamics in our model, we use D^t to indicate the spatial context at the particular time step t. The stack D of all D^t reflects the dynamics of the underlying road network structure over the T time steps. For a static network, all D^t are identical and D reduces to a single static matrix.
Traffic flow forecasting: Given a sensor feature tensor X and a spatial context tensor D over the past T time steps, as well as a set of nominated target nodes V_T, the traffic flow forecasting problem aims to predict the traffic flow of the next T′ time steps for every target node in V_T. The desired output is thus Y = {y_v | v ∈ V_T}, where y_v denotes the traffic conditions of the next T′ time steps at node v.
IV Spacetime Interval Learning
Spacetime interval learning aims to discover the influence between traffic events in a single spacetime manifold instead of modeling the spatial and temporal dimensions separately. We define traffic events and spacetime as follows:
Definition 1 (Traffic Event)
Given a traffic measurement x (e.g., speed) observed at sensor u and time t, a traffic event is a tuple consisting of the measurement, time, and location, namely, e = (x, t, u).
A traffic event, defined analogously to the notion of events in physics [1], embodies both spatial and temporal information of a traffic measurement by a sensor station.
Definition 2 (Spacetime)
Given a subset of sensors V′ of V, the sensor features X, the related spatial context D, and T time steps of historical data, the spacetime is a manifold consisting of all traffic events produced by the sensors in V′ over these T time steps.
A spacetime manifold can be organized as a 3D structure whose first dimension corresponds to time, whose second dimension corresponds to space, and whose third dimension is the observed traffic measurement. The time dimension is easy to understand, but the space data needs more work: in order to squeeze the spatial information into a single dimension, we replace the sensor coordinates with the travel distance to an anchor/target sensor. As a result, we construct a local spacetime around a target sensor and use it to predict the future traffic flow of that particular sensor.
Definition 3 (Local Spacetime)
The local spacetime S_v for sensor v is a subset of the spacetime, which contains only the traffic events occurring at locations near v within the most recent T time steps. Furthermore, the sensor location in every traffic event of S_v is replaced by its travel distance to v.
The spacetime interval between two traffic events denotes the extent to which one event influences the other; a smaller interval means a closer association between the two events. In a local spacetime, we only care about the intervals between the target sensor's traffic events and those of other sensors.
Definition 4 (Spacetime Interval)
A spacetime interval is the quantified influence that one traffic event imposes on another with regard to the traffic measurement.
A crucial step in the proposed learning paradigm is to build the local spacetime S_v of a target node v, which is then used for spacetime interval learning. In this way, the trained model not only captures the spacetime interval explicitly, but is also able to generalize learned patterns to other local spacetimes in the same road network or even a different city. We now give details of how to construct a local spacetime for an arbitrary node v in V_T, where V_T is the set of nodes we want to predict. The pseudocode is outlined in Algorithm 1.
Given a time step t, we define a connectivity matrix A^t by applying a Gaussian kernel to the distance matrix D^t, converting travel distances into weights that reflect the connectivity of two sensors; a longer distance corresponds to a smaller weight:

A^t(u, v) = exp(−d^t(u, v)^2 / σ^2),   (1)

where u, v are any two nodes in V^t, d^t(u, v) is their travel distance at time t, and σ is a hyperparameter. Empirically, we set σ to the standard deviation of all travel distances. Intuitively, A^t(u, v) is a normalized value that expresses how easy it is to travel from u to v in the network at time step t. Note that Equation (1) guarantees A^t(u, v) ∈ (0, 1]. The superscript t depicts the variation of travel distances caused by changes in the underlying network structure. Using A^t, we extract a set of nodes, denoted as N(v), that could benefit the future traffic flow prediction for the target node v, as
N(v) = { u | A^t(u, v) > ε or A^t(v, u) > ε },   (2)

where ε is a predetermined threshold parameter that indicates how close a node should be from (or to) v to be considered relevant to the traffic flow at v. Note that A^t(u, v) can differ from A^t(v, u) because D^t is not symmetric. ε controls the trade-off between computational cost and prediction accuracy. Moreover, to obtain a fixed-shape input for training, we require the size of N(v) to be k, namely |N(v)| = k. This is done by keeping only the k nearest neighbors if |N(v)| > k, or adding dummy nodes if |N(v)| < k. Dummy nodes have no connections with other nodes, and their features are all zero. As a result, the nodes in N(v) are sorted by their travel distance to v in ascending order.
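As a concrete sketch of this neighbor-selection step, the snippet below implements a Gaussian-kernel connectivity matrix and the threshold-plus-padding rule described around Equations (1) and (2); the function names and the `-1` dummy-node marker are our own illustrative choices, not the paper's released code:

```python
import numpy as np

def connectivity(dist, sigma=None):
    """Gaussian-kernel connectivity (Eq. 1): longer travel distance -> smaller weight."""
    if sigma is None:
        sigma = dist.std()  # empirical choice: std of all pairwise travel distances
    return np.exp(-(dist ** 2) / (sigma ** 2))

def neighbors(dist, v, k, eps):
    """Select up to k nodes whose connectivity with target v exceeds eps (Eq. 2),
    sorted by travel distance to v in ascending order; pad with -1 for dummy nodes."""
    A = connectivity(dist)
    cand = [u for u in range(dist.shape[0])
            if u != v and (A[u, v] > eps or A[v, u] > eps)]
    cand.sort(key=lambda u: dist[u, v])   # nearest neighbors first
    cand = cand[:k]                       # keep at most k
    return cand + [-1] * (k - len(cand))  # -1 marks an all-zero dummy node
```

Because `A` is built from the (possibly asymmetric) distance matrix, the `or` in the candidate test mirrors the "from (or to) v" condition in the text.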
We now construct the local spacetime S_v of v. For each time step t, the i-th row of the snapshot matrix S_v^t is defined as

S_v^t[i] = A^t(u_i, v) ∥ x_{u_i}^t,   (3)

where u_i is the i-th node in N(v), x_{u_i}^t is the traffic measurement recorded at node u_i at time step t, and ∥ denotes concatenation. The matrix S_v^t can be regarded as a snapshot of the local spacetime S_v: it encodes the spatial relationship between v and the nodes in N(v), as well as the traffic measurements of all these nodes at time step t. Finally, we define the local spacetime by stacking the snapshots of the past T time steps:

S_v = [S_v^{t−T+1}; …; S_v^t].   (4)

Since S_v contains all the information we need to train a predictive model of the future traffic flow at v, the learning process is independent of the size of the entire network, thus resolving the scalability issue. Moreover, the incorporation of the network information at all time steps makes the model capable of handling dynamic network topology.
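A minimal sketch of the local-spacetime construction of Equations (3) and (4), assuming for brevity a static distance matrix and a precomputed neighbor list; all names and shapes are illustrative:

```python
import numpy as np

def local_spacetime(X, dist, v, nbrs, T, sigma=1.0):
    """Build S_v: for each of the last T steps, concatenate the connectivity
    weight A(u_i, v) with node u_i's features (Eq. 3), then stack along time (Eq. 4).

    X    : (T, N, F) feature tensor over the last T steps
    dist : (N, N) travel-distance matrix (static here for brevity)
    nbrs : list of k neighbor indices for target v (-1 = dummy node)
    Returns S_v with shape (T, k, F + 1).
    """
    A = np.exp(-(dist ** 2) / sigma ** 2)   # Gaussian-kernel connectivity (Eq. 1)
    k, F = len(nbrs), X.shape[2]
    S = np.zeros((T, k, F + 1))
    for t in range(T):
        for i, u in enumerate(nbrs):
            if u >= 0:                      # dummy nodes stay all-zero
                S[t, i] = np.concatenate(([A[u, v]], X[t, u]))
    return S
```

For a dynamic network, one would simply index a per-step distance matrix inside the time loop instead of reusing `dist`.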
In summary, the spacetime interval learning framework reduces the traffic flow forecasting problem to learning a universal function f that maps S_v to the future traffic flow y_v of v:

y_v = f(S_v).   (5)

In the next section, we propose a novel model, the spacetime neural network, as a realization of the mapping function f.
V Spacetime Neural Network
The core task of spacetime interval learning is to adaptively learn the influences between traffic events and use them to predict the future traffic flow. To this end, we propose the Spacetime Neural Network (STNN), an end-to-end deep learning model that realizes the function f in Equation (5).
Design Principles. To quantify the influence of traffic events in the local spacetime on the prediction of the future traffic flow at the target sensor, a natural idea is to learn how each single traffic event influences the prediction. However, traffic events in a local spacetime are not independent and may interfere with each other. For example, congestion on an arterial road may increase the traffic in the surrounding region and then diffuse to farther road segments. As such, we design our model with two main principles: (1) the model should capture the pairwise influence, and (2) it should be aware of the conditions of nearby regions, capturing the many-to-one influence.
Model Architecture. Following the above two principles, we propose a novel spacetime module, on top of which we build the proposed STNN model. A spacetime module comprises a spacetime attention block (Section V-A) to capture the pairwise influence, and a spacetime convolution block (Section V-B) to capture the many-to-one influence.
The overall architecture of STNN is shown in Figure 2. The input is the local spacetime of the target (central) node, represented by the blue cylinder. STNN employs several spacetime modules to predict the future traffic of the target node.
More formally, we convert the raw data into the local spacetime S_v and permute it to the channel-first format. The input is then transformed by a convolution that maps traffic events into a high-dimensional feature space. Next, we stack L spacetime modules, where L controls the trade-off between model complexity and performance: a smaller L leads to a model with fewer parameters that is less likely to overfit the data, while a larger L gives better performance but may jeopardize the generalization ability. Residual connections are applied around each module. Last, we use a fully connected layer with linear activation to produce the prediction results.

V-A Spacetime Attention Block
Inspired by the self-attention mechanism [24], we propose the spacetime attention block to automatically discover the pairwise influence between traffic events. Specifically, for the attention block at the l-th layer, the input is a 3D tensor H^(l), which is the output of the previous convolutional layer or the initial input data, i.e., H^(0) = S_v, where C_l denotes the number of channels of the feature maps output by the previous layer. We flatten the spatial and temporal axes so that each of the n = k·T traffic events becomes a row of H^(l), and calculate the attention map as

Q = H^(l) W_Q,  K = H^(l) W_K,   (6)

Att = softmax(Q K^T),   (7)

where W_Q and W_K are learnable weight matrices that project each traffic event into feature spaces corresponding to "query" and "key", respectively. The dot product of the "query" and "key", i.e., (Q K^T)_{ij}, indicates the relevance between the i-th and the j-th traffic events, and the softmax function guarantees that the attention weights of each event sum to one. Then, we apply another learnable weight matrix W_V to project the input data into an output feature space and use the attention matrix to weight the contributions from different traffic events:

H_att = Att (H^(l) W_V).   (8)

The attention block can dynamically adjust the impact of different traffic events on the target events w.r.t. the features from the previous layer.
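The attention computation can be sketched in plain NumPy with events flattened to rows; note that the 1/√d scaling below is a common convention we add for numerical stability and may differ from the authors' exact formulation:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spacetime_attention(H, Wq, Wk, Wv):
    """Pairwise influence between traffic events (Eqs. 6-8).
    H  : (n, C) matrix, one row per traffic event (n = k*T flattened events)
    Wq, Wk, Wv : (C, d) learnable query / key / value projections
    Returns an (n, d) feature matrix where each event is a weighted mix of all events.
    """
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    att = softmax(Q @ K.T / np.sqrt(K.shape[1]))  # each row sums to one
    return att @ V
```

In the actual model these projections would be trained end-to-end (e.g., as `nn.Linear` layers in PyTorch); the sketch only shows the data flow.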
V-B Spacetime Convolution Block
After the attention layer, we design a spacetime convolution block that contains three kernels to capture the many-to-one influence from three different perspectives corresponding to space, time, and spacetime, respectively. The spacetime kernel is the main perspective for uncovering the spacetime correlations from the local spacetime to the target events. The temporal kernel finds correlations between traffic events of the same location along time, and the spatial kernel captures the influence from nearby locations at the same time step. The motivation for adopting the temporal kernel is that each sensor may have unique periodic temporal traffic patterns (peak/off-peak, weekday/weekend, etc.), which may be undervalued from the spacetime perspective. Similarly, we use the spatial kernel to keep the underlying geospatial influence from neighbors, which is invariant to time. Figure 3 illustrates these three kernels on the spacetime manifold.
Each spacetime convolution block takes the output of the former attention block, H_att, as input. The output is computed by

H_s = LeakyReLU(Θ_s ∗ H_att),  H_t = LeakyReLU(Θ_t ∗ H_att),  H_st = LeakyReLU(Θ_st ∗ H_att),   (9)

H_out = Θ_{1×1} ∗ (H_s ∥ H_t ∥ H_st),   (10)

where Θ_st, Θ_t, and Θ_s are the spacetime kernel, the temporal kernel, and the spatial kernel, respectively; LeakyReLU denotes the leaky rectified linear unit function; and ∗ represents the convolution operation. The kernel size for the three convolution kernels is fixed in the experiments. Additionally, padding is used to make sure the output has the same size as the input. Last, we concatenate (∥) the outputs of the three kernels and use a 1×1 convolution, Θ_{1×1}, to condense the features and restrict the number of channels. Finally, a fully connected layer is adopted to predict the future traffic flow from the learned features.
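To make the three-kernel design concrete, here is a single-channel NumPy sketch of the block; for brevity, averaging stands in for the concatenation-plus-1×1-convolution step of Equation (10), and all kernel shapes are illustrative assumptions (a 2D spacetime kernel, a 1 × w temporal kernel, and an h × 1 spatial kernel):

```python
import numpy as np

def conv2d_same(x, kernel):
    """'Same'-padded 2D cross-correlation of a (k, T) map with a small kernel."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * kernel).sum()
    return out

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def spacetime_conv_block(H, k_st, k_t, k_s):
    """Many-to-one influence (Eqs. 9-10), single-channel sketch: run the
    spacetime, temporal, and spatial kernels in parallel over the
    (space, time) map H, apply LeakyReLU, then merge the three outputs
    (averaging here stands in for concatenation + 1x1 convolution)."""
    outs = [leaky_relu(conv2d_same(H, k)) for k in (k_st, k_t, k_s)]
    return sum(outs) / 3.0
```

In the full STNN, an attention block followed by this convolution block forms one spacetime module, and L such modules are stacked with residual connections.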
VI Experiments
To evaluate the performance of the proposed model, we compare STNN with state-of-the-art traffic prediction models on two public traffic datasets in Section VI-B. In addition, a simulated dynamic traffic network is used to demonstrate our model's capability of handling dynamic graphs in Section VI-C. We further investigate the learned spacetime intervals via a case study (Section VI-D) and the impacts of the different components of STNN (Section VI-E).
VI-A Experimental Setup
VI-A1 Real-World Networks
Two public real-world datasets, METR-LA and PeMS-Bay, are used for evaluation. METR-LA was collected from loop detectors on the highways of Los Angeles County [7], and PeMS-Bay from the California Caltrans Performance Measurement System (PeMS) [2]. These datasets were released by [10] and have been widely used to evaluate traffic prediction models. METR-LA records four months of traffic data in early 2012 in Los Angeles County with 207 sensors. PeMS-Bay captures six months of traffic data in early 2017 in the Bay Area with 325 sensors. The speed readings of the sensors in METR-LA and PeMS-Bay are aggregated into 5-minute windows, resulting in 288 data points per day. Statistics of the datasets are summarized in Table III.
Data | Nodes | Time steps | Traffic events | Dynamics
METR-LA | 207 | 34,272 | 7,094,304 | No
PeMS-Bay | 325 | 52,116 | 16,937,700 | No
Simulated | 84 | 2,000 | 16,800 | Yes
VI-A2 Simulated Network
Apart from the two public datasets with static networks, we simulate a traffic network to demonstrate our model's ability to handle dynamic networks. We synthesize the dynamic traffic data using CityFlow [25], a multi-agent reinforcement learning tool for city traffic scenarios. It is worth noting that the goal of this experiment is not to show how closely the simulated network resembles reality, but to evaluate how well our model predicts traffic flow as the network topology changes. The simulated dataset contains 84 nodes and 2,000 time steps. Each node denotes a sensor station located in the middle of a road segment, as shown in Figure 4. Each sensor station records the total number of vehicles passing it in a fixed time interval instead of the average speed. Dynamics are represented as road closures that simulate traffic accidents or road constructions. For example, road segment 8 in Figure 4 is closed from time steps 400 to 600 and from 1500 to 1900. Such a closure alters the travel distance between nearby nodes, resulting in a modified matrix D^t. Besides the evaluation on the entire dataset, we also highlight the results for node 8 and its nearby roads 7 and 9, whose surroundings changed the most.

VI-A3 Baselines
We compare our model with the following baselines on the two realworld datasets. The default settings of these methods are used.

HA: Historical average method. We use the moving average with window size 12 for the forecasting.

ARIMA [17]: Auto-regressive integrated moving average model with Kalman filter for time series analysis.

FC-LSTM [13]: Recurrent neural network with fully connected LSTM hidden units.

DCRNN [10] combines diffusion convolution and gated recurrent units in an encoder-decoder manner.

STGCN [23] employs graph convolution for spatial patterns and 1D convolution for temporal features.

Graph WaveNet [19] uses node embeddings to learn a dependency matrix, and the dilated 1D convolution for prediction.

GMAN [31]: Graph multiattention network equipped with spatial, temporal and transform attention.

MTGCN [18]: Multivariate time series forecasting with graph neural networks.
VI-A4 Experiment Settings
We implemented our model using the PyTorch framework on the following hardware platform (CPU: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz, GPU: NVIDIA Quadro RTX 8000). For the STNN model, we set L = 2, i.e., two ST-Modules are stacked, which is enough to produce satisfying results. The numbers of output channels of the two ST-Modules are 32 and 64, respectively. The convolutional kernel size is fixed across all experiments. The slope parameter of the LeakyReLU is 0.2. In the experiments, datasets are split into 70% training, 20% validation and 10% testing in chronological order. The lengths of the input and output time sequences are both 12. In the training phase, the batch size is 80, and the number of epochs is 50, which takes about 5 hours to train the model on the METR-LA dataset. The Adam optimizer is adopted with learning rate 0.001 to update model parameters towards minimizing the L1 loss between the predicted values and the ground truth. Dropout is enabled with rate 0.3. We apply the commonly used metrics mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE) to evaluate the prediction accuracy. We report the average performance over five runs. The random seed of PyTorch is set to 1 for reproducibility.

We perform a grid search over several important hyperparameters to determine appropriate values with regard to complexity and performance. The first hyperparameter is the number of nodes k in each local spacetime. The second is the proportion of data used for training, ranging from 0.05 to 1. The third is the number of spacetime modules, ranging from 1 to 4. Results show that the model performance improves drastically with an increasing number of neighbors in a local spacetime and then reaches a plateau. A similar pattern shows for the ratio of the training set, with the best trade-off point at 0.2; in other words, from the 70% training split, we randomly select 20% for the final training. Last, the model is not very sensitive to the number of ST modules once two modules are stacked.
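For reference, the three evaluation metrics can be computed as follows; the small epsilon guard in MAPE is our own choice, as papers differ on how zero readings are masked:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error; eps guards against division by zero."""
    return np.mean(np.abs((y_true - y_pred) / (y_true + eps))) * 100.0
```

These definitions match the values reported in Table IV up to the zero-handling convention.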
Training Data | Test Data | Models | 15 mins (MAE / RMSE / MAPE) | 30 mins (MAE / RMSE / MAPE) | 60 mins (MAE / RMSE / MAPE)
METR-LA | METR-LA | HA | 4.16 / 7.80 / 13.02% | 4.16 / 7.80 / 13.02% | 4.16 / 7.80 / 13.02%
METR-LA | METR-LA | ARIMA | 3.99 / 8.21 / 9.60% | 5.15 / 10.45 / 12.70% | 6.90 / 13.23 / 17.40%
METR-LA | METR-LA | FC-LSTM | 3.44 / 6.30 / 9.60% | 3.77 / 7.23 / 10.90% | 4.37 / 8.69 / 13.20%
METR-LA | METR-LA | DCRNN | 2.77 / 5.38 / 7.30% | 3.15 / 6.45 / 8.80% | 3.60 / 7.60 / 10.50%
METR-LA | METR-LA | STGCN | 2.88 / 5.74 / 7.62% | 3.47 / 7.24 / 9.57% | 4.59 / 9.40 / 12.70%
METR-LA | METR-LA | Graph WaveNet | 2.68 / 5.15 / 6.90% | 3.07 / 6.22 / 8.37% | 3.53 / 7.37 / 10.01%
METR-LA | METR-LA | GMAN | 2.75 / 5.42 / 7.19% | 3.01 / 6.25 / 8.18% | 3.34 / 7.21 / 9.71%
METR-LA | METR-LA | MTGCN | 2.69 / 5.18 / 6.86% | 3.05 / 6.17 / 8.19% | 3.49 / 7.23 / 9.87%
METR-LA | METR-LA | STNN (ours) | 2.27 / 4.46 / 5.80% | 2.56 / 5.29 / 6.84% | 3.01 / 6.23 / 8.50%
PeMS-Bay | PeMS-Bay | HA | 2.88 / 5.59 / 6.85% | 2.88 / 5.59 / 6.85% | 2.88 / 5.59 / 6.85%
PeMS-Bay | PeMS-Bay | ARIMA | 1.62 / 3.30 / 3.50% | 2.33 / 4.76 / 5.40% | 3.38 / 6.50 / 8.30%
PeMS-Bay | PeMS-Bay | FC-LSTM | 2.05 / 4.19 / 4.80% | 2.20 / 4.55 / 5.20% | 2.37 / 4.96 / 5.70%
PeMS-Bay | PeMS-Bay | DCRNN | 1.38 / 2.95 / 2.90% | 1.74 / 3.97 / 3.90% | 2.07 / 4.74 / 4.90%
PeMS-Bay | PeMS-Bay | STGCN | 1.36 / 2.96 / 2.90% | 1.81 / 4.27 / 4.17% | 2.49 / 5.69 / 5.79%
PeMS-Bay | PeMS-Bay | Graph WaveNet | 1.30 / 2.74 / 2.73% | 1.63 / 3.70 / 3.67% | 1.95 / 4.52 / 4.63%
PeMS-Bay | PeMS-Bay | GMAN | 1.34 / 2.82 / 2.81% | 1.62 / 3.72 / 3.63% | 1.86 / 4.32 / 4.31%
PeMS-Bay | PeMS-Bay | MTGCN | 1.32 / 2.79 / 2.77% | 1.65 / 3.74 / 3.69% | 1.94 / 4.49 / 4.53%
PeMS-Bay | PeMS-Bay | STNN (ours) | 1.20 / 2.41 / 2.53% | 1.50 / 3.26 / 3.33% | 1.86 / 4.22 / 4.30%
METR-LA | PeMS-Bay | STNN (ours) | 1.36 / 2.73 / 2.92% | 1.68 / 3.58 / 3.73% | 2.15 / 4.72 / 5.01%
PeMS-Bay | METR-LA | STNN (ours) | 2.60 / 5.02 / 6.75% | 3.01 / 5.94 / 8.37% | 3.68 / 7.33 / 11.13%
Combined | METR-LA | STNN (ours) | 2.30 / 4.50 / 5.83% | 2.60 / 5.29 / 6.90% | 3.07 / 6.37 / 8.80%
Combined | PeMS-Bay | STNN (ours) | 1.21 / 2.43 / 2.51% | 1.49 / 3.22 / 3.26% | 1.87 / 4.17 / 4.40%
VI-B Evaluation on Real-World Data
One salient feature of this work is that a trained STNN model can be used directly on unseen traffic networks; none of the existing state-of-the-art methods can do this. Current models train and predict on the same dataset, and a trained model cannot be applied to other networks of different sizes. Thus, we evaluate the performance of STNN in three settings: (1) train and test on the same network; (2) train and test on different networks; (3) train on multiple networks and test on each network.
VI-B1 Train and Test on the Same Network
This experiment follows the conventional setting where we train and test the STNN model on the same network. We train two STNN models separately, one on the METR-LA dataset and one on the PeMS-Bay dataset. The prediction accuracy on the test sets is shown in the first two row groups of Table IV. STNN surpasses the existing methods on both datasets and at all prediction horizons. For short-term predictions such as the METR-LA 15-minute case, STNN achieves 2.27 MAE and 4.46 RMSE; compared with the best baseline MAE and RMSE of 2.68 and 5.15, respectively, this is roughly a 15% and 13% improvement. For mid-term predictions (30 minutes), STNN is about 15% better than the baselines. For long-term predictions such as the METR-LA 60-minute case, STNN improves RMSE by 14%. On the other real-world dataset, PeMS-Bay, STNN also shows superior performance: for example, the best baseline MAE for 15-minute prediction is 1.30, while our model achieves 1.20.
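For concreteness, the three error measures reported throughout (MAE, RMSE, MAPE) can be computed as below. This is a minimal sketch with made-up speed values, not the paper's evaluation code; the function names are our own.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.abs(y_true - y_pred).mean()

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(((y_true - y_pred) ** 2).mean())

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return (np.abs(y_true - y_pred) / np.abs(y_true)).mean() * 100

# Toy example: observed vs. predicted speeds (mph) at four sensors.
y_true = np.array([60.0, 55.0, 40.0, 65.0])
y_pred = np.array([58.0, 57.0, 43.0, 63.0])

print(round(mae(y_true, y_pred), 2))    # 2.25
print(round(rmse(y_true, y_pred), 2))   # 2.29
```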
VI-B2 Train and Test on Different Networks
Since STNN is designed to discover universal traffic patterns, an interesting question is whether a trained model can be applied to an unseen dataset. This is a major challenge for methods that rely on the graph adjacency/Laplacian matrix, because these matrices can differ dramatically across networks. The proposed STNN concentrates on the local spatial and temporal context, so the learned model can be applied to unseen datasets directly. In the previous experiments, we trained one STNN on METR-LA and another on PeMS-Bay. Without any fine-tuning, we use the METR-LA model to make predictions on PeMS-Bay data, and its performance is comparable to state-of-the-art methods trained directly on PeMS-Bay. Similar results hold for the PeMS-Bay model on METR-LA. The details are shown in the middle two rows of Table IV. The MAE of the METR-LA model on the PeMS-Bay 15-minute horizon is 1.36, while the MAEs of the baselines range from 1.30 to 1.38. The MAE of the PeMS-Bay model on the METR-LA 15-minute horizon is 2.60, while the MAEs of the baselines range from 2.68 to 2.77.
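The transfer works because the model never sees the whole graph, only a fixed-size local window around each target node. A minimal sketch of that idea, with a hypothetical `build_local_spacetime` helper and made-up sizes (k = 8 neighbours, T = 12 steps):

```python
import numpy as np

def build_local_spacetime(flows, dists, k=8, T=12):
    """Assemble the local space-time input for one target node.

    flows: (N, T_total) flow history for all N sensors in the network
    dists: (N,) road-network distance from each sensor to the target
    Returns a fixed-size (k, T) array regardless of the network size N.
    """
    nearest = np.argsort(dists)[:k]   # k closest sensors (incl. the target itself)
    return flows[nearest, -T:]        # their last T time steps

# A toy network with N = 200 sensors and 100 historical steps.
rng = np.random.default_rng(0)
flows = rng.uniform(20, 70, size=(200, 100))
dists = rng.uniform(0.1, 5.0, size=200)
dists[111] = 0.0                      # the target node is at distance 0 from itself

x = build_local_spacetime(flows, dists)
print(x.shape)                        # (8, 12) -- independent of N
```

Because the input shape depends only on k and T, a model trained on one network accepts windows cut from any other network without modification.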
VI-B3 Train on Multiple Networks
Finally, we train a single model on a combined dataset that mixes the training examples from PeMS-Bay and METR-LA. The learned STNN model captures the general traffic patterns of both datasets. The results are shown in the last two rows of Table IV. The performance of this uniform STNN is very close to that of the STNNs trained and tested on the same network, and it outperforms the baselines on both datasets by a large margin.
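Since every training example is a fixed-size local window (see above), mixing the two datasets reduces to concatenating and shuffling the example arrays. A sketch with hypothetical window counts and shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-built training windows, each of shape (k=8 neighbours, T=12 steps).
metr_la  = rng.uniform(20, 70, size=(1000, 8, 12))   # 1000 windows from METR-LA
pems_bay = rng.uniform(20, 70, size=(1500, 8, 12))   # 1500 windows from PeMS-Bay

# Concatenate along the example axis, then shuffle so each training batch
# interleaves examples from both networks.
combined = np.concatenate([metr_la, pems_bay], axis=0)
combined = combined[rng.permutation(len(combined))]
print(combined.shape)   # (2500, 8, 12)
```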
VI-C Evaluation on Simulated Dynamic Network
Table V shows the traffic volume prediction error for the next three time steps at nodes 7, 8, and 9, together with the average error over the whole network. We only compare with HA, ARIMA, and FC-LSTM, because the other methods cannot be applied to networks with dynamic topology. In general, STNN outperforms the baselines by a large margin. The overall MAE and RMSE of STNN are 5.31 and 19.83 on the entire dataset, which is 42% better than the baselines. For a specific location such as node 7, the MAE and RMSE of the best baseline are 9.75 and 19.39, respectively, while STNN achieves 5.27 MAE and 10.31 RMSE, a 46% and 47% improvement. For the road segment monitored by sensor 8, the traffic volume is always zero during the closure period; it is not sensible to predict the traffic flow for such time steps, so we exclude them from the evaluation. Similar to the other nodes, STNN still outperforms the baselines, with a 35% better MAE. It is worth noting that the prediction error at node 7 is larger than at node 9, mainly because node 7 is one of the traffic sources, where new vehicles enter the network continuously and at random, making it hard to predict.
| Data | Model | MAE | RMSE | MAPE |
|---|---|---|---|---|
| Node 7 | HA | 26.69 | 64.57 | 12.53% |
| Node 7 | ARIMA | 12.08 | 29.16 | 9.84% |
| Node 7 | FC-LSTM | 9.75 | 19.39 | 7.74% |
| Node 7 | STNN (ours) | 5.27 | 10.31 | 3.97% |
| Node 8 | HA | 20.75 | 43.82 | 28.79% |
| Node 8 | ARIMA | 17.37 | 35.22 | 23.54% |
| Node 8 | FC-LSTM | 15.19 | 30.46 | 18.69% |
| Node 8 | STNN (ours) | 9.78 | 26.31 | 9.63% |
| Node 9 | HA | 11.75 | 29.63 | 10.55% |
| Node 9 | ARIMA | 6.35 | 11.35 | 6.10% |
| Node 9 | FC-LSTM | 5.23 | 9.61 | 5.19% |
| Node 9 | STNN (ours) | 2.65 | 6.14 | 2.88% |
| Overall | HA | 15.53 | 42.19 | 10.01% |
| Overall | ARIMA | 11.02 | 31.75 | 7.81% |
| Overall | FC-LSTM | 9.17 | 25.65 | 5.89% |
| Overall | STNN (ours) | 5.31 | 19.83 | 3.79% |
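The exclusion of closure-period time steps mentioned above amounts to masking out zero-volume ground truth before averaging the error. A sketch, with made-up volumes; the function name is our own:

```python
import numpy as np

def masked_mae(y_true, y_pred, mask_value=0.0):
    """MAE computed only where the ground truth is valid (non-zero).

    Zero ground-truth volume here marks a closed road segment, so those
    time steps are dropped rather than counted as prediction targets.
    """
    mask = y_true != mask_value
    return np.abs(y_true[mask] - y_pred[mask]).mean()

# Toy series: the two zeros are closure-period steps at sensor 8.
y_true = np.array([30.0, 0.0, 0.0, 25.0, 28.0])
y_pred = np.array([29.0, 3.0, 1.0, 27.0, 28.0])
print(masked_mae(y_true, y_pred))   # 1.0, the mean of |1|, |2|, |0|
```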
VI-D Case Study
To develop better insight into our model, we present a real-world case study. The aim is to reveal the space-time intervals extracted by our model. Figure 5 illustrates a small part of the METR-LA dataset that contains a number of nodes around the Los Angeles Zoo. These nodes are used for the prediction of a target node, indicated as node 111. We set the prediction period to 15 minutes (3 time steps). The color scale in Figure 4(a), which is transformed from the learned attention map, indicates the different influence (i.e., space-time interval) that the surrounding traffic events have on the target node 111. The vertical axis represents the spatial dimension, where nodes are listed in ascending order of their distance to the target node 111. The horizontal axis represents the temporal dimension, where time steps 0–11 are used to predict the next 3 time steps of node 111. Figure 4(b) displays the spatial context of all nodes involved. From these two figures, we observe that the most significant nodes for prediction are 111, 42, 54, 37, 142, and 145. Among them, 54, 37, 142, and 145 are the nearest upstream neighbors, and 42 is the immediate downstream neighbor. It is remarkable that downstream locations can be useful for traffic forecasting, while the correlation fades rapidly as distance increases, as for node 24. Besides the spatial context, the temporal dimension also plays an important role: the time steps of node 111 that are key to the prediction are the immediate last four, while this is not true for the other nodes. Despite the close proximity of nodes 125 and 58 to node 111, they show a smaller influence, since 58 lies in the opposite direction of the same road and 125 is on another road. This case study demonstrates that our model can extract meaningful correlations between traffic events along both the spatial and temporal dimensions.
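Reading off the most influential traffic events from an attention map, as done in this case study, is a matter of ranking (node, time-step) cells by weight. A sketch with a randomly generated stand-in for the learned map; the node ids are taken from the case study, everything else is hypothetical:

```python
import numpy as np

# Stand-in for a learned attention map for target node 111:
# rows = 8 surrounding sensors, cols = 12 input time steps.
node_ids = [111, 42, 54, 37, 142, 145, 125, 58]
rng = np.random.default_rng(1)
attn = rng.random((8, 12))
attn /= attn.sum()                     # normalise, like a softmax output

def top_events(attn, node_ids, k=5):
    """Return the k (node, time step, weight) triples with the largest attention."""
    flat = np.argsort(attn, axis=None)[::-1][:k]     # indices of the k largest cells
    rows, cols = np.unravel_index(flat, attn.shape)
    return [(node_ids[r], int(c), float(attn[r, c])) for r, c in zip(rows, cols)]

for node, t, w in top_events(attn, node_ids):
    print(f"node {node}, step {t}: weight {w:.4f}")
```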
VI-E Ablation Study
We perform an ablation study to validate the impact of each proposed component on the model performance. In each experiment, one block/layer is removed while the rest remain unchanged. Specifically, six variants are evaluated: the full STNN, STNN without the ST-Attn block, STNN without the ST-Conv block, STNN without the spatial convolution layer, STNN without the temporal convolution layer, and STNN without the space-time convolution layer. We report the performance of these models for short-term (15-minute) forecasting in Table VI.
| Model | METR-LA (MAE / RMSE / MAPE%) | PeMS-Bay (MAE / RMSE / MAPE%) |
|---|---|---|
| Full STNN | 2.27 / 4.46 / 5.80 | 1.20 / 2.41 / 2.53 |
| No ST-Attn block | 2.35 / 4.56 / 5.85 | 1.23 / 2.43 / 2.55 |
| No ST-Conv block | 2.41 / 4.69 / 5.97 | 1.30 / 2.58 / 2.71 |
| No spatial conv | 2.31 / 4.51 / 5.74 | 1.22 / 2.42 / 2.50 |
| No temporal conv | 2.33 / 4.58 / 5.90 | 1.22 / 2.47 / 2.55 |
| No space-time conv | 2.34 / 4.59 / 5.98 | 1.23 / 2.49 / 2.58 |
The effect of the proposed space-time attention block is evident, since it highlights the contributive traffic events. The space-time convolution block improves the prediction accuracy significantly, as it aggregates the traffic events from three perspectives. Moreover, the space-time convolutional layer appears to be more important than the other two convolutional layers.
VI-F Complexity Analysis
We analyze the complexity of the proposed model to justify its scalability, showing that the computation of STNN is independent of the road network size. In addition, the total number of trainable parameters in STNN is 308,786, which is smaller than in STGCN (454,358), MTGCN (405,452), and other state-of-the-art methods.
The time complexity of the space-time convolutional block is O(n · T · C_i · C_o), where n is the number of sensors in the local-spacetime, T is the length of the input time steps, and C_i and C_o are the channels of the input and output feature maps. Additionally, four different convolutional layers (space-time, spatial, temporal, and 1×1) are used in the space-time convolutional block. The space-time attention block incurs O((nT)² · (C_i + C_o)) time complexity, where C_i and C_o are again the channels of the input and output data, and n and T still represent the size of the local-spacetime manifold. In short, the computational complexity of the proposed STNN depends on the local-spacetime size, the input sequence length, and the channels; it is independent of the size of the entire network.
Last, the local-spacetime construction is part of the proposed space-time interval learning framework and is not included in the forward pass and backpropagation of STNN. The complexity of the local-spacetime construction scales with N, the number of sensors in the network.
VII Conclusion
In this paper, we propose a novel space-time interval learning framework and a space-time neural network (STNN) for accurate traffic prediction. Our approach captures the intrinsic principles of traffic flow by learning the space-time intervals between traffic events. The model works on the local-spacetime extracted for target nodes and thus can handle dynamic network topology. Experiments on two real-world datasets and a simulated dynamic network show that the proposed framework significantly outperforms state-of-the-art methods, confirming the effectiveness of capturing spatiotemporal correlations directly. In the future, we will further explore capturing the graph dynamics from the input data when the underlying graph structure is unknown. Another promising direction is to embed the local-spacetime construction into an end-to-end deep learning framework, making its parameters learnable. Last, we would like to investigate the space-time interval learning paradigm in other spatiotemporal data modeling scenarios.
References
 [1] (2019) Spacetime and geometry. Cambridge University Press.
 [2] (2001) Freeway performance measurement system: mining loop detector data. Transportation Research Record 1748 (1), pp. 96–102.
 [3] (2015) Forecasting holiday daily tourist flow based on seasonal support vector regression with adaptive genetic algorithm. Applied Soft Computing 26, pp. 435–443.
 [4] (2018) Deep bidirectional and unidirectional LSTM recurrent neural network for network-wide traffic speed prediction. arXiv preprint arXiv:1801.02143.
 [5] (2019) Dynamic spatial-temporal graph convolutional neural networks for traffic forecasting. In AAAI, Vol. 33, pp. 890–897.
 [6] (2019) Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In AAAI, Vol. 33, pp. 922–929.
 [7] (2014) Big data and its technical challenges. Communications of the ACM 57 (7), pp. 86–94.
 [8] (2017) Time-series extreme event forecasting with neural networks at Uber. In International Conference on Machine Learning, Vol. 34, pp. 1–5.
 [9] (2015) Deep learning. Nature 521 (7553), pp. 436–444.
 [10] (2018) Diffusion convolutional recurrent neural network: data-driven traffic forecasting. In ICLR.
 [11] (2013) Geometrical physics in Minkowski spacetime. Springer Science & Business Media.
 [12] (2008) Traffic jams without bottlenecks—experimental evidence for the physical mechanism of the formation of a jam. New Journal of Physics 10 (3), pp. 033001.
 [13] (2014) Sequence to sequence learning with neural networks. In NeurIPS.
 [14] (2013) Traffic flow dynamics: data, models and simulation. Springer-Verlag Berlin Heidelberg.
 [15] (2012) Short-term traffic and travel time prediction models. Artificial Intelligence Applications to Critical Transportation Issues 22 (1), pp. 22–41.
 [16] (2014) Short-term traffic forecasting: where we are and where we're going. Transportation Research Part C: Emerging Technologies 43, pp. 3–19.
 [17] (2003) Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: theoretical basis and empirical results. Journal of Transportation Engineering 129 (6), pp. 664–672.
 [18] (2020) Connecting the dots: multivariate time series forecasting with graph neural networks. In KDD.
 [19] (2019) Graph WaveNet for deep spatial-temporal graph modeling. In IJCAI.
 [20] (2019) Revisiting spatial-temporal similarity: a deep learning framework for traffic prediction. In AAAI, Vol. 33, pp. 5668–5675.
 [21] (2018) Deep multi-view spatial-temporal network for taxi demand prediction. In AAAI.
 [22] (2005) The link transmission model: an efficient implementation of the kinematic wave theory in traffic networks. In Proceedings of the 10th EWGT Meeting, pp. 122–127.
 [23] (2018) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In IJCAI.
 [24] (2019) Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354–7363.
 [25] (2019) CityFlow: a multi-agent reinforcement learning environment for large scale city traffic scenario. In The World Wide Web Conference, pp. 3620–3624.
 [26] (2018) GaAN: gated attention networks for learning on large and spatiotemporal graphs. In UAI.
 [27] (2016) DNN-based prediction model for spatio-temporal data. In SIGSPATIAL, pp. 1–4.
 [28] (2017) Deep spatio-temporal residual networks for citywide crowd flows prediction. In AAAI.
 [29] (2011) Seasonal autoregressive integrated moving average and support vector machine models: prediction of short-term traffic flow on freeways. Transportation Research Record 2215 (1), pp. 85–92.
 [30] (2020) Spatio-temporal graph structure learning for traffic forecasting. In AAAI, Vol. 34, pp. 1177–1185.
 [31] (2020) GMAN: a graph multi-attention network for traffic prediction. In AAAI, Vol. 34, pp. 1234–1241.
 [32] (2006) Vector autoregressive models for multivariate time series. Modeling Financial Time Series with S-PLUS®, pp. 385–429.