
Space Meets Time: Local Spacetime Neural Network For Traffic Flow Forecasting

Traffic flow forecasting is a crucial task in urban computing. The challenge arises because traffic flows often exhibit intrinsic and latent spatio-temporal correlations that cannot be identified by extracting the spatial and temporal patterns of traffic data separately. We argue that such correlations are universal and play a pivotal role in traffic flow. We put forward spacetime interval learning as a paradigm to explicitly capture these correlations through a unified analysis of both spatial and temporal features. Unlike state-of-the-art methods, which are restricted to a particular road network, we model universal spatio-temporal correlations that are transferable from city to city. To this end, we propose a new spacetime interval learning framework that constructs a local-spacetime context of a traffic sensor, comprising the data from its neighbors within close time points. Based on this idea, we introduce the spacetime neural network (STNN), which employs novel spacetime convolution and attention mechanisms to learn the universal spatio-temporal correlations. The proposed STNN captures local traffic patterns, which do not depend on a specific network structure. As a result, a trained STNN model can be applied to any unseen traffic network. We evaluate the proposed STNN on two public real-world traffic datasets and a simulated dataset on dynamic networks. The experimental results show that STNN not only improves prediction accuracy by 15% over state-of-the-art methods, but is also effective in handling the case when the traffic network undergoes dynamic changes, and has superior generalization capability.



I Introduction

With the increasing volume of traffic data and the growing impacts of data-driven technologies in modern transportation systems, the traffic flow forecasting problem is drawing increasing attention [16]. Reliable and timely predictions of the traffic dynamics can assist transportation management, help alleviate traffic congestion, and enhance traffic efficiency.

A traffic system is characterised by the changing flows at locations in a road network, i.e., nodes (sensors) interlinked by road segments, from which one may observe salient patterns: sudden bursts, drastic fluctuations, and periodic shifts. One can develop an understanding of these patterns from two perspectives. On the outside, traffic is the accumulation of various extrinsic factors, from road layout, to geographic features of the environment, to traffic laws, and erratic behaviours of drivers, all playing a role in shaping traffic conditions. On the inside, intrinsic principles are governing the flow of traffic, as vehicles – particles in the road system – maintain stable directional, speed and concentration features as they travel through space and time, giving rise to some universal patterns of the traffic flow. These universal patterns are what is behind well-established traffic flow models, such as Wardrop’s equilibria, and Kerner’s three-phase traffic theory, that describe traffic flow from a mathematical point of view [14].

We argue that accurate predictions of traffic flow – based on macro-level traffic data – rest upon a model’s ability to grasp not only extrinsic features from the given data set, but also intrinsic principles that are universal to any traffic systems. At its core, traffic can be seen as a physical system [22] where localised changes cascade like waves along roads, to some other locations, some time later [12]. Therefore the key to capturing the intrinsic principles of traffic flow lies in extracting the latent correlations between locations’ present states and surrounding locations’ past.


Fig. 1: An illustration of the spacetime interval. The figure shows three snapshots of the network. The target node’s state at the current time is strongly influenced by the states of some neighboring nodes at particular earlier times, but only weakly influenced by the states of those nodes at other times.

Recent breakthroughs in neural-based techniques, in particular those based on graph neural networks (GNNs) [23, 21, 5, 30], represent a significant step towards representing non-linear traffic features by leveraging topological constraints, while achieving better prediction results than earlier statistical methods such as autoregressive models [17]. However, existing GNN-based traffic forecasting models have three common limitations: (1) These models rely heavily on the graph structure, which varies across road networks. As such, they are restricted to a particular road network, rather than revealing intrinsic properties of traffic systems. (2) They apply computationally expensive feature aggregation operations (e.g., graph convolutions) on the entire network, and thus do not scale to large road networks that contain hundreds of thousands of sensors. For instance, a graph convolution network (GCN) scales quadratically w.r.t. the number of nodes in the network. (3) These models all consist of separate components to extract features from the spatial and the temporal dimensions, respectively, before integrating them into the same feature map to derive a prediction. The extracted feature map represents “aggregated” correlations between locations of the network across time. This setup implicitly assumes that the correlations between locations stay uniform over time. However, two locations may have different correlations at different times: as indicated in Figure 1, relative to the target node at the current time, one neighboring node has a stronger correlation at an earlier time step than at a later one, while another neighboring node has a weaker correlation at one time step than at another.

To overcome the above limitations of existing GNN-based models, we propose a new spatio-temporal correlation learning paradigm, called Spacetime Interval Learning, which fuses the spatial dimensions and the temporal dimension into a single manifold, i.e., a spacetime, the term used in special relativity. The correlations that we aim to capture reflect a form of “distance” between two traffic events in spacetime, where this distance is referred to as an interval in spacetime terminology [11]. Specifically, the proposed spacetime interval learning paradigm extracts the traffic data of nearby sensors within a fixed time window, regarded as the local-spacetime context of each target location. The local-spacetime context is analogous to a “receptive field”: it excludes the sensors that do not contribute much to the final prediction. Then, the paradigm learns to correlate a node at a given time with another node at a different time in the local-spacetime context.

The proposed learning paradigm addresses the limitations of the GNN-based models as follows. Firstly, it models the spatio-temporal correlations within the local-spacetime context of each target location. In this way, the model we build is independent of the graph structure, and thus applies to traffic flows at large rather than to one specific network. Secondly, our method focuses on the local-spacetime, which changes traffic prediction from the network level to the node level. That is, our model can predict the traffic for multiple locations of interest in parallel on different machines. Thirdly, unlike existing methods, we fuse the spatial and the temporal dimensions into a single manifold to capture the varying spatial correlations between locations across time.

Under the spacetime interval learning paradigm, we propose an implementation called Spacetime Neural Network (STNN), which combines novel spacetime attention blocks and spacetime convolutional blocks. The former highlight the interval between events, while the latter aggregate the learned features from spatial, temporal, and spatio-temporal aspects. The main advantages of the model are as follows. The code is available on

  1. Accuracy: The learned spacetime intervals explicitly capture the correlations between locations across different time instances, making the learned features highly informative for traffic prediction. As validated through experiments on two real-world datasets, METR-LA and PeMS-Bay, when trained on data from a single network, our model significantly outperforms existing methods, with, e.g., a 15%–20% improvement over the current state of the art for 15-min predictions. See Section VI-B.

  2. Transferability: This is a main advantage of our universal model. We train a single STNN model on one dataset and use it without fine-tuning on unseen traffic datasets. In all cases, the results are competitive with, if not better than, state-of-the-art approaches, validating the applicability of our universal model. See Section VI-B.

  3. Ability to handle dynamic networks: This is another natural consequence of converting raw input to a local-spacetime for every target node. It covers cases where the network undergoes structural changes, such as the blocking or addition of roads. In an experiment on a synthetic dynamic network, our model performs much better than the baselines. See Section VI-C.

  4. Scalability: Previous GNN-based models all assume a small fixed network whose size is limited by computational capability. By performing prediction at the node level, we remove the limit on the network size. The complexity of our model is independent of the network size, making our model applicable to networks of arbitrary size. See Section VI-F.

II Related Works

Studies on traffic forecasting through spatio-temporal data analysis can roughly be divided into four types.

Statistical methods. Early studies in the 2000s use various statistical methods, such as historical average (HA), autoregressive integrated moving average (ARIMA) [17], and vector autoregression (VAR) [32], for the traffic forecasting problem. However, these approaches assume the input time series to be stationary and univariate. This assumption may not hold in a complex traffic system. Thus, statistical methods often generate inaccurate predictions.

Machine learning methods. To account for more complex traffic data, various classical machine learning methods, such as kNN [15], SVM [29], and SVR [3], were employed in the early 2010s. One shortcoming of these methods is that they fail to capture the highly non-linear spatial and temporal patterns present in the data.

Deep learning methods. After 2015, capitalising on the success of deep learning, several deep learning models were proposed to capture the underlying patterns of traffic data. Recurrent neural networks (RNNs) are adopted to analyse time series patterns [8, 4], while convolutional neural networks (CNNs) are used to capture spatial dependencies by treating a traffic network as a grid [27, 28]. However, these methods omit the road network, which is an essential spatial constraint on how traffic moves in the spatial dimension. This limitation prevents traditional deep learning methods from forecasting traffic accurately.

Graph neural networks. More recently, graph-based deep learning models have been increasingly used for spatio-temporal data analysis. Yu et al. [23] propose a spatio-temporal graph convolutional network based on spectral graph theory to extract features from the road network and the historical time series data. In particular, the spatial patterns are captured by graph convolutional layers while the temporal patterns are extracted by gated convolutional layers. Li et al. [10] model the traffic flow as a diffusion process and capture the spatial dependencies by random walks on a graph, followed by an RNN to extract temporal patterns. Yao et al. [21] use graph embeddings to capture the road network information and integrate long short-term memory (LSTM) and CNN with the graph embeddings for traffic prediction. Attention mechanisms have also been adopted to learn spatial and temporal patterns with improved performance [26, 6]. However, all of these methods fall into the same paradigm, in which separate layers are applied to extract spatial patterns and temporal patterns, respectively. We summarize current methods under this paradigm in Table I. In particular, CNN, graph convolutional network (GCN), or spatial attention layers are often used to learn spatial patterns, while gated recurrent unit (GRU), LSTM, gated temporal convolution (gated TCN), or temporal attention layers are used to learn temporal patterns.

| Studies | Spatial layers | Temporal layers |
|---|---|---|
| DCRNN (2017) [10] | Diffusion Conv | GRU |
| STGCN (2017) [23] | GCN | Gated TCN |
| DMVST-Net (2018) [21] | CNN | LSTM |
| STDN (2019) [20] | CNN | LSTM + Attn |
| GraphWaveNet (2019) [19] | GCN | Gated TCN |
| DSTGCN (2019) [5] | GCN | TCN |
| ASTGCN (2019) [6] | GCN + Attn | TCN + Attn |
| SLCNN (2020) [30] | GCN | Gated TCN |
| GMAN (2020) [31] | Spatial Attn | Temporal Attn |
| MTGCN (2020) [18] | GCN | Gated TCN |
| STNN (ours) | ST Conv + Attn (unified) | ST Conv + Attn (unified) |

TABLE I: Existing methods fall into the same paradigm

Unlike current methods, this work proposes a novel learning paradigm. Instead of using separate components or layers to learn spatial and temporal patterns, we design a novel spacetime module that learns the local spatio-temporal correlations and captures both spatial and temporal patterns at the same time. Our model does not fall into any existing category. Specifically, instead of exploiting a GNN to learn spatial patterns, we fuse spatial features with temporal features and learn them with a unified layer.

III Problem Formulation

We summarise the list of important notations in Table II.

Notation Description
sensor set at time
union sensor set of all time steps
distance matrix at time
stack along time axis
a vector of features recorded by -th sensor at time
a matrix of features of all sensors at time
stack along time axis
a set of target nodes to predict
a target node
a set of neighboring nodes that close to
number of features captured by each sensor station
future traffic measure of -th sensor at time
future traffic measures of -th sensor
future traffic measures of all nodes in
normalized connectivity matrix at time
a local-spacetime snapshot for at time
a local-spacetime for
TABLE II: Table of Notations

Traffic conditions such as vehicle speed are often recorded by roadside sensor stations, along with auxiliary information including time and sensor location. Aggregating sensors forms a sensor set that monitors the traffic flow of a certain area.

At each time step, every sensor station records local traffic measures, such as the average vehicle speed over a fixed duration, together with supplementary data. These measurements form the feature vector of that sensor at that time step. In particular, our input data has two features: average speed and timestamp. Collecting the feature vectors of all sensor stations gives the overall observation of the traffic network at that time step, a matrix with one row per sensor. A traffic flow dataset contains the measurements for a sequence of time steps and is thus represented as a sensor feature tensor, obtained by stacking these matrices along the time axis.

Based on the locations of the sensor stations, we can obtain the travel distance on the road network between any two sensors. Consequently, we build a distance matrix Q that records the travel distance between every pair of sensors. One can think of Q as the weighted adjacency matrix of a directed complete graph: the edge weights are the actual travel distances, and the graph is complete because the distance between any two sensors can always be calculated. Furthermore, we do not require the distance matrix to remain unchanged over time. On the contrary, the travel distance between two sensors may vary because of events such as traffic accidents or road construction, which alter the underlying topology of the road network and can lead to longer travel distances. Thus, to incorporate the network dynamics in our model, we use a separate distance matrix for each time step to indicate the spatial context at that particular time step. Stacking these matrices along the time axis gives a spatial context tensor that reflects how the underlying road network structure changes over the time steps. For a static network, this tensor reduces to the single matrix Q.

Traffic flow forecasting: Given a sensor feature tensor and a spatial context tensor over a window of past time steps, as well as a set of nominated target nodes, the traffic flow forecasting problem is to predict the traffic flow over a fixed number of future time steps for every target node. The desired output thus contains, for each target node, the traffic conditions at that node over those future time steps.

IV Spacetime Interval Learning

Spacetime interval learning aims to discover the influence between traffic events in a single spacetime manifold instead of modeling the spatial and temporal dimensions separately. We define traffic events and spacetime as follows:

Definition 1 (Traffic Event)

Given a traffic measurement (e.g., speed) observed at a sensor at a given time, a traffic event is a tuple consisting of the measurement, the time, and the location.

A traffic event, defined analogously to the notion of events in physics [1], embodies both spatial and temporal information of a traffic measurement by a sensor station.

Definition 2 (Spacetime)

Given a subset of sensors, their sensor features, the related spatial context, and a window of historical time steps, the spacetime is a manifold consisting of all traffic events produced by these sensors over these time steps.

A spacetime manifold can be organized as a 3D structure in which the first dimension corresponds to time, the second to space, and the third to the observed traffic measurement. The time dimension is straightforward, but the spatial data needs more work. To squeeze the spatial information into a single dimension, we convert the sensor coordinates into travel distances to the anchor (target) sensor. As a result, we construct a local-spacetime around a target sensor and use it to predict the future traffic flow of that particular sensor.

Definition 3 (Local-spacetime)

The local-spacetime for a target sensor is a subset of the spacetime that contains only the traffic events occurring at locations near the target sensor and at recent time steps. Furthermore, the sensor location in every traffic event of the local-spacetime is replaced by its travel distance to the target sensor.

The spacetime interval between two traffic events denotes the extent to which one event influences the other; a smaller interval means a closer association between two traffic events. In a local-spacetime, we only care about the intervals between traffic events of the target sensor with other sensors.

Definition 4 (Spacetime Interval)

The spacetime interval is the quantified influence that one traffic event exerts on another traffic event with respect to the traffic measurement.

A crucial step in the proposed learning paradigm is to build the local-spacetime of a target node, which is then used for spacetime interval learning. In this way, the trained model not only captures the spacetime interval explicitly, but is also able to generalize learned patterns to other local-spacetimes in the same road network or even in a different city. We now give details of how to construct a local-spacetime for an arbitrary node in the set of target nodes we want to predict. The pseudocode is outlined in Algorithm 1.


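Input: sensor feature tensor, spatial context tensor, target nodes
Parameters: connectivity threshold, number of neighbors
Output: local-spacetime list
Set the local-spacetime list to empty;
for each target node do
    for each node in the sensor set do
        if its connectivity to the target node, or from the target node, exceeds the threshold then
            add the node to the neighbor set of the target node;
        end if
    end for
    for each time step in the input window do
        for each node in the neighbor set do
            set the node's row of the snapshot to its spatial relationship with the target node concatenated with its traffic measurement;
        end for
    end for
    append the local-spacetime of the target node to the local-spacetime list;
end for
Algorithm 1 Local-spacetime construction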

Given a time step, we define a connectivity matrix by applying a Gaussian kernel to the distance matrix of that time step, converting each travel distance into a weight that reflects the connectivity of two sensors. A longer distance corresponds to a smaller weight. The kernel is evaluated for any two nodes in the sensor set, and its width is a hyper-parameter. Empirically, we set the width to the standard deviation of all pairwise distances. Intuitively, each entry of the connectivity matrix is a normalized value that expresses how easy it is to travel from one node to the other in the network at that time step. Note that the Gaussian kernel guarantees that every entry lies between 0 and 1. The time superscript reflects the variation in travel distance caused by changes in the underlying network structure.
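The conversion from travel distance to connectivity can be sketched as follows. This is a minimal illustration, not the authors' code: the exact kernel form exp(-d²/σ²) and the function name `connectivity_matrix` are assumptions, and the distances are made-up values. Only the use of a Gaussian kernel and the choice of the standard deviation as the kernel width come from the text.

```python
import math

def connectivity_matrix(dist):
    """Turn a square travel-distance matrix into Gaussian-kernel weights."""
    flat = [d for row in dist for d in row]
    mean = sum(flat) / len(flat)
    # Kernel width: standard deviation of all pairwise distances (as in the text).
    sigma = math.sqrt(sum((d - mean) ** 2 for d in flat) / len(flat))
    if sigma == 0:
        raise ValueError("all distances are equal; kernel width is undefined")
    # Assumed kernel form: exp(-d^2 / sigma^2); longer distance -> smaller weight.
    return [[math.exp(-(d ** 2) / (sigma ** 2)) for d in row] for row in dist]

# Hypothetical travel distances between three sensors. The matrix is not
# symmetric, e.g. because of one-way roads.
dist = [[0.0, 2.0, 5.0],
        [3.0, 0.0, 4.0],
        [5.0, 6.0, 0.0]]
W = connectivity_matrix(dist)
```

With these inputs every weight falls in (0, 1], the diagonal is exactly 1, and the asymmetry of the distances carries over to the weights.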

Using the connectivity matrix, we extract a set of neighboring nodes that could benefit the future traffic flow prediction for the target node: a node is included if its connectivity to the target node, or from the target node, exceeds a pre-determined threshold. The threshold indicates how close a node should be from (or to) the target node to be considered relevant to the traffic flow at the target node. Note that the two directions can differ because the connectivity matrix is not symmetric. The threshold controls the trade-off between computational cost and prediction accuracy. Moreover, to obtain input data of a fixed shape for training, we require the neighbor set to have a fixed size. This is done by keeping only the nearest neighbors if the set is too large, or adding dummy nodes if it is too small. Dummy nodes have no connections with other nodes, and their features are all zero. Finally, the nodes in the neighbor set are sorted by travel distance to the target node in ascending order.
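The neighbor selection described above can be sketched as below. The function name `select_neighbors`, the use of `None` for dummy nodes, and the choice to sort by the distance from the neighbor to the target are assumptions made for illustration. In this sketch the target node also qualifies as its own nearest neighbor, because its self-connectivity is 1.

```python
def select_neighbors(W, dist, u, eps, k):
    """Return exactly k neighbor ids for target u; None marks a dummy node."""
    n = len(W)
    # Keep nodes whose connectivity with u exceeds the threshold in either direction.
    candidates = [v for v in range(n) if W[u][v] > eps or W[v][u] > eps]
    # Sort by travel distance to the target, nearest first (assumed direction: v -> u).
    candidates.sort(key=lambda v: dist[v][u])
    kept = candidates[:k]                     # too many candidates: keep the k nearest
    return kept + [None] * (k - len(kept))    # too few: pad with dummy nodes

# Hypothetical 4-sensor network; sensor 3 is far away from the others.
dist = [[0, 1, 4, 9],
        [1, 0, 2, 9],
        [4, 2, 0, 9],
        [9, 9, 9, 0]]
W = [[1.00, 0.90, 0.40, 0.01],
     [0.90, 1.00, 0.70, 0.01],
     [0.40, 0.70, 1.00, 0.01],
     [0.01, 0.01, 0.01, 1.00]]

two_nearest = select_neighbors(W, dist, u=0, eps=0.3, k=2)
padded = select_neighbors(W, dist, u=0, eps=0.3, k=5)
```

Fixing the output length to k is what gives every local-spacetime the same shape, regardless of how dense the road network is around the target.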

We now construct the local-spacetime of the target node. For each time step in the input window, we build a snapshot matrix with one row per node in the neighbor set. Each row is the concatenation of the node’s spatial relationship with the target node and the traffic measurement recorded at that node at that time step. The matrix can be regarded as a snapshot of the local-spacetime: it encodes the spatial relationship between the target node and the nodes in its neighbor set, as well as the traffic measurements of all these nodes at that time step. Finally, the local-spacetime is defined by stacking the snapshots of all time steps in the input window.

Since the local-spacetime contains all the information we need to train a predictive model of the future traffic flow at the target node, the learning process is independent of the size of the entire network, thus resolving the scalability issue. Moreover, incorporating the network information at all time steps makes the model capable of handling dynamic network topology.
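As an illustration of the construction, the sketch below stacks one snapshot per time step for a single target node. The row layout (connectivity in both directions followed by the node's features) is an assumption, since the text only states that a row concatenates the spatial relationship with the measurement. The function name and the toy data are also hypothetical.

```python
def build_local_spacetime(W_seq, X_seq, u, neighbors):
    """Stack per-time-step snapshots for target u.

    W_seq[t][i][j]: connectivity from i to j at step t
    X_seq[t][i]:    feature list of sensor i at step t
    neighbors:      node ids, with None marking a dummy node
    """
    n_feat = len(X_seq[0][0])
    local = []
    for W, X in zip(W_seq, X_seq):
        snapshot = []
        for v in neighbors:
            if v is None:
                # Dummy node: no connections and all-zero features.
                row = [0.0, 0.0] + [0.0] * n_feat
            else:
                # Assumed layout: [W(u->v), W(v->u)] followed by v's features.
                row = [W[u][v], W[v][u]] + list(X[v])
            snapshot.append(row)
        local.append(snapshot)
    return local

# Toy data: 2 time steps, 3 sensors, 2 features (speed, timestamp index).
W_static = [[1.0, 0.8, 0.2],
            [0.7, 1.0, 0.5],
            [0.2, 0.5, 1.0]]
W_seq = [W_static, W_static]
X_seq = [[[60.0, 0.0], [55.0, 0.0], [30.0, 0.0]],
         [[58.0, 1.0], [50.0, 1.0], [35.0, 1.0]]]

S_u = build_local_spacetime(W_seq, X_seq, u=0, neighbors=[0, 1, None])
```

The result has shape (time steps) x (neighbors) x (2 + features), which depends only on the window length and the neighbor count, not on the size of the network.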

In summary, the spacetime interval learning framework reduces the traffic flow forecasting problem to learning a universal function that maps the local-spacetime of any target node to the future traffic flow of that node.

In the next section, we propose a novel model as a realization of this mapping function, namely, the spacetime neural network.


Fig. 2: The architecture of STNN with an example local-spacetime constructed from the input data. STNN consists of spacetime modules (ST-Modules) and a fully-connected output layer. Each ST-Module contains a spacetime attention block (ST-Attn block) and a spacetime convolution block (ST-Conv block). The ST-Attn block uses self-attention mechanism to spotlight the most contributive traffic events. In each ST-Conv block, three different convolution kernels are employed to aggregate the spatio-temporal correlations in different perspectives. Then, the extracted features are stacked, and condensed by the convolution.

V Spacetime Neural Network

The core task of spacetime interval learning is to adaptively learn the influence between traffic events and use them to predict the future traffic flow. To this end, we propose SpaceTime Neural Network (STNN), which is an end-to-end deep learning model that realizes the function in Equation 5.

Design Principles. To quantify the influence of traffic events in the local-spacetime on the prediction of future traffic flow at the target sensor, a natural idea is to learn how each single traffic event influences the prediction. However, we observe that traffic events in a local-spacetime are not independent and may interfere with each other. For example, congestion on an arterial road may increase the traffic in a surrounding region and then diffuse to further road segments. As such, we design our model with two main principles: (1) the model should be able to capture the pair-wise influence, and (2) it needs to be aware of the conditions of nearby regions, capturing the many-to-one influence.

Model Architecture. Following the above two principles, we propose a novel spacetime module, on top of which we build the proposed STNN model. A spacetime module comprises a Spacetime Attention Block (Section V-A) to capture the pair-wise influence, and a Spacetime Convolution Block (Section V-B) to capture the many-to-one influence.

The overall architecture of STNN is shown in Figure 2. The input is a local-spacetime of the target (central) node represented by the blue cylinder. STNN employs several spacetime modules to predict the future traffic of the target node.

More formally, we convert the raw data to the local-spacetime and permute it to the channel-first format. The input is then transformed by a convolution that maps traffic events to a high-dimensional feature space. Next, we stack several spacetime modules, where the number of modules controls the trade-off between model complexity and performance. Fewer modules lead to a model with fewer parameters that is less likely to overfit the data. More modules give better performance but may jeopardize the generalization ability. Residual connections are also applied to each module. Last, we use a fully connected layer with linear activation to produce the prediction results.
V-A Spacetime Attention Block

Inspired by the self-attention mechanism [24], we propose the spacetime attention block to automatically discover the pairwise influence between traffic events. Specifically, the input of the attention block at a given layer is a 3D tensor, which is either the output of the previous convolutional layer or the initial input data; its channel dimension equals the number of channels of the feature maps output by the previous layer. The attention map is calculated from this input using two learnable weight matrices that project each traffic event into feature spaces corresponding to the “query” and the “key”, respectively. As such, the dot product of the “query” of one traffic event and the “key” of another indicates the relevance between the two traffic events. The softmax function is used to guarantee that the attention weights sum to one. Then, we apply another learnable weight matrix to project the input data into an output feature space and use the attention matrix to weight the contributions of different traffic events.

The attention block can thus dynamically adjust the impact of different traffic events on the target events w.r.t. the features from the previous layer.
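The query/key/value computation described above can be sketched in plain Python as follows. This is a generic self-attention sketch under stated assumptions, not the authors' implementation: the traffic events are assumed to be flattened into a list of N feature vectors, the softmax is applied row-wise, no scaling factor is used, and the weight matrices are fixed toy values rather than learned parameters.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def spacetime_attention(X, Wq, Wk, Wv):
    """X: N events x C channels. Returns (N x N attention map, N x C_out output)."""
    Q = matmul(X, Wq)                       # "query" projection of every event
    K = matmul(X, Wk)                       # "key" projection of every event
    V = matmul(X, Wv)                       # projection into the output feature space
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)                 # pairwise relevance between events
    A = [softmax(r) for r in scores]        # each row of weights sums to one
    return A, matmul(A, V)

# Three toy traffic events with two channels each; identity projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
A, out = spacetime_attention(X, I2, I2, I2)
```

Each row of `A` holds the weights that one event assigns to all events, so the output for that event is a weighted mix of the projected inputs.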

V-B Spacetime Convolution Block

After the attention layer, we design a spacetime convolution block that contains three kernels to capture the many-to-one influence from three different perspectives, corresponding to space, time, and spacetime, respectively. The spacetime kernel is the main perspective for uncovering the spacetime correlations from a local-spacetime to the target events. The temporal kernel finds correlations between traffic events of the same location along time, and the spatial kernel captures the influence from nearby locations at the same time step. The motivation for adopting the temporal kernel is that each sensor may have unique periodic temporal traffic patterns (peak/off-peak, weekday/weekend, etc.), which may be undervalued from the spacetime perspective. Similarly, we use the spatial kernel to retain the underlying geo-spatial influence from neighbors, which is invariant over time. Figure 3 demonstrates these three kernels on the spacetime manifold.


Fig. 3: Spatial kernel, temporal kernel and spacetime kernel on the sub-spacetime

Each spacetime convolution block takes the output of the preceding attention block as its input. The output is computed by applying the spacetime kernel, the temporal kernel and the spatial kernel to this input through convolution operations, with the leaky rectified linear unit (LeakyReLU) as the activation function. The sizes of the three convolution kernels are fixed in the experiments. Additionally, padding is used to make sure the output has the same size as the input. Last, we concatenate the outputs of the three kernels and use a further convolution to condense the features and restrict the number of channels.
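One way to write the block down is sketched below. All symbols are assumptions introduced for illustration: $Z$ is the output of the attention block, $\Theta_{st}$, $\Theta_{t}$ and $\Theta_{s}$ are the spacetime, temporal and spatial kernels, $\Theta_{c}$ is the condensing convolution, $\ast$ is convolution and $\|$ is channel-wise concatenation. Applying the LeakyReLU to each branch separately is also an assumption; the text does not say where the activation sits.

```latex
\begin{aligned}
Z_{st} &= \mathrm{LeakyReLU}\left(\Theta_{st} \ast Z\right), \\
Z_{t}  &= \mathrm{LeakyReLU}\left(\Theta_{t} \ast Z\right), \\
Z_{s}  &= \mathrm{LeakyReLU}\left(\Theta_{s} \ast Z\right), \\
H      &= \Theta_{c} \ast \left[\, Z_{st} \,\|\, Z_{t} \,\|\, Z_{s} \,\right]
\end{aligned}
```

Because padding keeps each branch the same size as $Z$, the three outputs can be concatenated along the channel axis before $\Theta_{c}$ reduces the channel count.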

Finally, a fully connected layer is adopted to predict the future traffic flow from the learned features.

VI Experiments

To evaluate the performance of the proposed model, we compare STNN with the state-of-the-art traffic prediction models on two public traffic datasets in Section VI-B. In addition, a simulated dynamic traffic network is used to demonstrate our model’s capability of handling dynamic graphs in Section VI-C. We further investigate the learned spacetime interval via a case study (Section VI-D) and the impacts of different components of STNN (Section VI-E).

VI-A Experimental Setup

VI-A1 Real-world Networks

Two public real-world datasets, METR-LA and PeMS-Bay, are used for evaluation. METR-LA was collected from loop detectors on the highways of Los Angeles County [7], and PeMS-Bay was collected from the Caltrans Performance Measurement System (PeMS) in California [2]. These datasets were released by [10] and have been widely used to evaluate traffic prediction models. METR-LA records four months of traffic data in early 2012 in Los Angeles County with 207 sensors. PeMS-Bay captures six months of traffic data in early 2017 in the Bay Area with 325 sensors. The speed readings of the sensors in METR-LA and PeMS-Bay are aggregated into 5-minute windows, resulting in 288 data points per day. Statistics of the datasets are summarized in Table III.

| Data | Nodes | Time steps | Traffic events | Dynamics |
|---|---|---|---|---|
| METR-LA | 207 | 34,272 | 7,094,304 | No |
| PeMS-Bay | 325 | 52,116 | 16,937,700 | No |
| Simulated | 84 | 2,000 | 168,000 | Yes |

TABLE III: Statistics of datasets
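The figures for the two real-world datasets can be cross-checked with a little arithmetic. The sketch below assumes that a traffic event count is simply the number of nodes times the number of time steps, which is consistent with the reported values; the variable names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60
WINDOW_MINUTES = 5
points_per_day = MINUTES_PER_DAY // WINDOW_MINUTES   # 5-minute windows per day

# (nodes, time steps) as reported for the two real-world datasets.
datasets = {"METR-LA": (207, 34272), "PeMS-Bay": (325, 52116)}

summary = {
    name: {
        "traffic_events": nodes * steps,      # one event per node per time step
        "days": steps / points_per_day,       # length of the recording in days
    }
    for name, (nodes, steps) in datasets.items()
}
```

The day counts come out at 119 days for METR-LA and just under 181 days for PeMS-Bay, matching the stated four and six months.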

VI-A2 Simulated Network

Apart from the two public datasets with static networks, we simulate a traffic network to demonstrate our model’s ability to handle dynamic networks. We synthesize the dynamic traffic data using CityFlow [25], a multi-agent reinforcement learning tool for city traffic scenarios. It is worth noting that the goal of this experiment is not to show how closely the simulated traffic network resembles reality, but to evaluate how well our model predicts traffic flow as the network topology changes. The simulated dataset contains 84 nodes and 2000 time steps. Each node denotes a sensor station located in the middle of a road segment, as shown in Figure 4. Each sensor station records the total number of vehicles passing it in a fixed time interval, instead of the average speed. Dynamics are represented as road closures that simulate traffic accidents or road construction. For example, road segment 8 in Figure 4 is closed from time step 400 to 600 and from 1500 to 1900. Such a closure alters the travel distance between nearby nodes, resulting in a modified distance matrix. Besides the evaluation on the entire dataset, we also highlight the results for node 8 and its nearby roads 7 and 9, whose surroundings change the most.


Fig. 4: Simulated road network illustration

VI-A3 Baselines

We compare our model with the following baselines on the two real-world datasets. The default settings of these methods are used.

  • HA: Historical average method. We use the moving average with window size 12 for the forecasting.

  • ARIMA [17]: Auto-regressive integrated moving average model with Kalman filter for time series analysis.

  • FC-LSTM [13]: Recurrent neural network with fully connected LSTM hidden units.

  • DCRNN [10] combines diffusion convolution and gated recurrent units in an encoder-decoder manner.

  • STGCN [23] employs graph convolution for spatial patterns and 1D convolution for temporal features.

  • Graph WaveNet [19] uses node embeddings to learn a dependency matrix, and the dilated 1D convolution for prediction.

  • GMAN [31]: Graph multi-attention network equipped with spatial, temporal and transform attention.

  • MTGCN [18]: Multivariate time series forecasting with graph neural networks.
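As a point of reference for the simplest baseline in the list, the following is a minimal sketch of a moving-average forecast with window size 12. The function name `historical_average_forecast`, the `horizon` argument, and the sample speeds are assumptions made for illustration; the exact HA implementation used in the experiments is not specified in the text.

```python
def historical_average_forecast(history, window=12, horizon=12):
    """Predict every future step as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("need at least `window` observations")
    recent = history[-window:]
    mean = sum(recent) / window
    # The same value is repeated for each step of the forecast horizon.
    return [mean] * horizon

# Hypothetical speed readings for one sensor (12 past time steps).
speeds = [60.0, 62.0, 58.0, 61.0, 59.0, 60.0, 63.0, 57.0, 60.0, 61.0, 59.0, 60.0]
forecast = historical_average_forecast(speeds)
print(forecast[0], len(forecast))
```

In this sketch the forecast is flat across the horizon, which is why such a baseline cannot react to short-term changes in traffic.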

VI-A4 Experiment Settings

We implemented our model with the PyTorch framework on the following hardware platform: CPU: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz; GPU: NVIDIA Quadro RTX 8000. For the STNN model, we stack two ST-Modules, which is enough to produce satisfactory results. The numbers of output channels for the two ST-Modules are 32 and 64, respectively. The convolutional kernel size is set to be . The negative slope used in the LeakyReLU is 0.2. In the experiments, the datasets are split chronologically into 70% training, 20% validation, and 10% testing. The lengths of the input and output time sequences are both 12. In the training phase, the batch size is 80 and the number of epochs is 50; training the model on the METR-LA dataset takes about 5 hours. The Adam optimizer is adopted with learning rate 0.001 to update the model parameters by minimizing the L1 loss between the predicted values and the ground truth. Dropout is enabled with rate 0.3. We apply the commonly used metrics mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE) to evaluate the prediction accuracy. We report the average performance of five runs. The random seed of PyTorch is set to 1 for reproducibility.
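The three evaluation metrics can be written down in a few lines. This is a minimal sketch in plain Python; the function names and the sample values are illustrative, and skipping zero ground-truth entries in MAPE is an assumption, since the text does not say how such entries are handled.

```python
import math

def mae(pred, true):
    """Mean absolute error (the mean L1 distance)."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def mape(pred, true):
    """Mean absolute percentage error in percent.

    Assumption: entries whose ground truth is zero are skipped.
    """
    terms = [abs(p - t) / abs(t) for p, t in zip(pred, true) if t != 0]
    return 100.0 * sum(terms) / len(terms)

# Hypothetical ground truth and predictions for four time steps.
true = [50.0, 60.0, 40.0, 80.0]
pred = [55.0, 57.0, 44.0, 80.0]
print(mae(pred, true), rmse(pred, true), mape(pred, true))
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two metrics are usually reported together.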

We perform a grid search over several important hyper-parameters to determine appropriate values with regard to complexity and performance. The first hyper-parameter is the number of nodes in each local-spacetime. The second is the proportion of data used for training, which ranges from 0.05 to 1. The third is the number of spacetime modules employed, which ranges from 1 to 4. Results show that the model performance improves drastically as the number of neighbors in a local-spacetime increases and reaches a plateau after . A similar pattern appears for the ratio of the training set, and the best trade-off point is 0.2. In other words, from the 70% training data, we randomly select 20% for the final training. Last, the model is not very sensitive to the number of ST-Modules once two modules are stacked.
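The data preparation described above can be sketched as a chronological split followed by random subsampling of the training portion. The helper names, the use of Python's `random` module, and the choice of 2000 time steps are assumptions for illustration. The sketch also assumes that the trade-off ratio 0.2 is the fraction of training examples that is kept, and it indexes raw time steps rather than the input/output windows used by the model.

```python
import random

def chronological_split(n_steps, train=0.7, val=0.2):
    """Return index ranges for train/validation/test in time order."""
    n_train = round(n_steps * train)
    n_val = round(n_steps * val)
    return (range(0, n_train),
            range(n_train, n_train + n_val),
            range(n_train + n_val, n_steps))

def subsample(indices, ratio, seed=1):
    """Randomly keep a fraction `ratio` of the given indices."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    k = round(len(indices) * ratio)
    return sorted(rng.sample(list(indices), k))

train_idx, val_idx, test_idx = chronological_split(2000)
final_train = subsample(train_idx, 0.2)
print(len(train_idx), len(val_idx), len(test_idx), len(final_train))
```

Because the subsample is drawn only from the training range, the validation and test portions stay strictly later in time than every training example.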

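Training Data | Test Data | Model | 15 mins: MAE / RMSE / MAPE | 30 mins: MAE / RMSE / MAPE | 60 mins: MAE / RMSE / MAPE
METR-LA | METR-LA | HA | 4.16 / 7.80 / 13.02% | 4.16 / 7.80 / 13.02% | 4.16 / 7.80 / 13.02%
METR-LA | METR-LA | ARIMA | 3.99 / 8.21 / 9.60% | 5.15 / 10.45 / 12.70% | 6.90 / 13.23 / 17.40%
METR-LA | METR-LA | FC-LSTM | 3.44 / 6.30 / 9.60% | 3.77 / 7.23 / 10.90% | 4.37 / 8.69 / 13.20%
METR-LA | METR-LA | DCRNN | 2.77 / 5.38 / 7.30% | 3.15 / 6.45 / 8.80% | 3.60 / 7.60 / 10.50%
METR-LA | METR-LA | STGCN | 2.88 / 5.74 / 7.62% | 3.47 / 7.24 / 9.57% | 4.59 / 9.40 / 12.70%
METR-LA | METR-LA | Graph WaveNet | 2.68 / 5.15 / 6.90% | 3.07 / 6.22 / 8.37% | 3.53 / 7.37 / 10.01%
METR-LA | METR-LA | GMAN | 2.75 / 5.42 / 7.19% | 3.01 / 6.25 / 8.18% | 3.34 / 7.21 / 9.71%
METR-LA | METR-LA | MTGCN | 2.69 / 5.18 / 6.86% | 3.05 / 6.17 / 8.19% | 3.49 / 7.23 / 9.87%
METR-LA | METR-LA | STNN (ours) | 2.27 / 4.46 / 5.80% | 2.56 / 5.29 / 6.84% | 3.01 / 6.23 / 8.50%
PeMS-Bay | PeMS-Bay | HA | 2.88 / 5.59 / 6.85% | 2.88 / 5.59 / 6.85% | 2.88 / 5.59 / 6.85%
PeMS-Bay | PeMS-Bay | ARIMA | 1.62 / 3.30 / 3.50% | 2.33 / 4.76 / 5.40% | 3.38 / 6.50 / 8.30%
PeMS-Bay | PeMS-Bay | FC-LSTM | 2.05 / 4.19 / 4.80% | 2.20 / 4.55 / 5.20% | 2.37 / 4.96 / 5.70%
PeMS-Bay | PeMS-Bay | DCRNN | 1.38 / 2.95 / 2.90% | 1.74 / 3.97 / 3.90% | 2.07 / 4.74 / 4.90%
PeMS-Bay | PeMS-Bay | STGCN | 1.36 / 2.96 / 2.90% | 1.81 / 4.27 / 4.17% | 2.49 / 5.69 / 5.79%
PeMS-Bay | PeMS-Bay | Graph WaveNet | 1.30 / 2.74 / 2.73% | 1.63 / 3.70 / 3.67% | 1.95 / 4.52 / 4.63%
PeMS-Bay | PeMS-Bay | GMAN | 1.34 / 2.82 / 2.81% | 1.62 / 3.72 / 3.63% | 1.86 / 4.32 / 4.31%
PeMS-Bay | PeMS-Bay | MTGCN | 1.32 / 2.79 / 2.77% | 1.65 / 3.74 / 3.69% | 1.94 / 4.49 / 4.53%
PeMS-Bay | PeMS-Bay | STNN (ours) | 1.20 / 2.41 / 2.53% | 1.50 / 3.26 / 3.33% | 1.86 / 4.22 / 4.30%
METR-LA | PeMS-Bay | STNN (ours) | 1.36 / 2.73 / 2.92% | 1.68 / 3.58 / 3.73% | 2.15 / 4.72 / 5.01%
PeMS-Bay | METR-LA | STNN (ours) | 2.60 / 5.02 / 6.75% | 3.01 / 5.94 / 8.37% | 3.68 / 7.33 / 11.13%
Combined | METR-LA | STNN (ours) | 2.30 / 4.50 / 5.83% | 2.60 / 5.29 / 6.90% | 3.07 / 6.37 / 8.80%
Combined | PeMS-Bay | STNN (ours) | 1.21 / 2.43 / 2.51% | 1.49 / 3.22 / 3.26% | 1.87 / 4.17 / 4.40%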
TABLE IV: Overall performance of short-term (15 mins), mid-term (30 mins) and long-term (60 mins) traffic forecasting.

VI-B Evaluation on Real-World Data

One salient feature of this work is that a trained STNN model can be used directly on unseen traffic networks. None of the existing state-of-the-art methods can do this. Current models train and predict on the same dataset, and the trained model cannot be applied to other networks with different sizes. Thus, we evaluate the performance of STNN in three settings: (1) train and test on the same network; (2) train and test on different networks; (3) train on multiple networks and test on each network.

VI-B1 Train and Test on the Same Network

This experiment follows the conventional setting in which we train and test the STNN model on the same network. We train two STNN models separately on the two real-world networks, one on the METR-LA dataset and one on the PeMS-Bay dataset. The prediction accuracy on the test sets is shown in the first two groups of rows of Table IV. STNN surpasses the existing methods on both datasets and at all prediction horizons. For short-term predictions such as the METR-LA 15 mins case, STNN achieves 2.27 MAE and 4.46 RMSE. Compared with the baselines, where the best MAE and RMSE are 2.68 and 5.15, respectively, STNN yields about 15% and 13% improvement. For mid-term predictions (30 mins), STNN is 15% better than the baselines in terms of METR-LA MAE (2.56 versus the best baseline's 3.01). For long-term predictions such as the METR-LA 60 mins case, STNN improves the performance by 14% in terms of RMSE. On the other real-world dataset, PeMS-Bay, STNN also shows superior performance. For example, the best baseline MAE for the 15 mins prediction is 1.30, while our model achieves 1.20.
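The quoted improvements are relative error reductions with respect to the best baseline. As a quick check of the METR-LA 15 mins figures from Table IV, the following sketch recomputes them; the helper name `relative_improvement` is introduced here for illustration only.

```python
def relative_improvement(baseline, ours):
    """Percentage reduction in error relative to the baseline."""
    return 100.0 * (baseline - ours) / baseline

# METR-LA, 15 mins: best baseline (Graph WaveNet) versus STNN, from Table IV.
mae_gain = relative_improvement(2.68, 2.27)
rmse_gain = relative_improvement(5.15, 4.46)
print(f"{mae_gain:.1f}% {rmse_gain:.1f}%")  # prints "15.3% 13.4%"
```

These values round to the 15% and 13% reported in the text.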

VI-B2 Train and Test on Different Networks

STNN is designed to discover universal traffic patterns, so an interesting question is whether a trained model can be applied to an unseen dataset. This is a huge challenge for methods that rely on the graph adjacency/Laplacian matrix because these matrices can be dramatically different across networks. The proposed STNN concentrates on the local spatial and temporal context, so the learned model can be easily used on unseen datasets. In the previous experiments, we trained one STNN model on METR-LA and another on PeMS-Bay. Without any fine-tuning, we use the model trained on METR-LA to make predictions on PeMS-Bay data, and its performance is comparable to that of state-of-the-art methods trained directly on PeMS-Bay. Similar results hold for the model trained on PeMS-Bay when it is applied to METR-LA. The details are shown in the middle two rows of Table IV. For the 15 mins prediction period on PeMS-Bay, the MAE of the model trained on METR-LA is 1.36, while the MAEs of the strongest baselines range from 1.30 to 1.38. For the 15 mins prediction period on METR-LA, the MAE of the model trained on PeMS-Bay is 2.60, while the MAEs of the strongest baselines range from 2.68 to 2.77.

VI-B3 Train on Multiple Networks

Finally, we train a single model on the combined dataset by mixing the training examples from PeMS-Bay and METR-LA. The learned STNN model captures the general traffic patterns from both datasets. The results are shown in the last two rows of Table IV. The performance of this single STNN is very close to that of the STNNs trained and tested on the same network, and it outperforms the baselines on both datasets by a large margin.

VI-C Evaluation on Simulated Dynamic Network

Table V shows the traffic volume prediction error for the next three time steps at nodes 7, 8, and 9, together with the average error over the whole network. We only compare with HA, ARIMA, and FC-LSTM because the other methods cannot be applied to networks with dynamic topology. In general, STNN outperforms the baselines by a large margin. The overall MAE and RMSE of STNN are 5.31 and 19.83 for the entire dataset; the MAE is 42% lower than that of the best baseline (FC-LSTM, 9.17). Moreover, for a specific location such as node 7, the MAE and RMSE of the best baseline are 9.75 and 19.39, respectively. STNN achieves 5.27 MAE and 10.31 RMSE, a 46% and 47% improvement over the baseline. For the road segment monitored by sensor 8, the traffic volume is always zero during the closure periods. It is not sensible to predict the traffic flow for such time steps, so we exclude them from the evaluation. As with the other nodes, STNN still outperforms the baselines, with a 35% better MAE. It is worth noting that the prediction error on node 7 is larger than that on node 9, mainly because node 7 is one of the traffic sources, where new vehicles keep entering the traffic network continuously and randomly, which makes it hard to predict.

Node | Model | MAE | RMSE | MAPE
Node 7 | HA | 26.69 | 64.57 | 12.53%
Node 7 | ARIMA | 12.08 | 29.16 | 9.84%
Node 7 | FC-LSTM | 9.75 | 19.39 | 7.74%
Node 7 | STNN (ours) | 5.27 | 10.31 | 3.97%
Node 8 | HA | 20.75 | 43.82 | 28.79%
Node 8 | ARIMA | 17.37 | 35.22 | 23.54%
Node 8 | FC-LSTM | 15.19 | 30.46 | 18.69%
Node 8 | STNN (ours) | 9.78 | 26.31 | 9.63%
Node 9 | HA | 11.75 | 29.63 | 10.55%
Node 9 | ARIMA | 6.35 | 11.35 | 6.10%
Node 9 | FC-LSTM | 5.23 | 9.61 | 5.19%
Node 9 | STNN (ours) | 2.65 | 6.14 | 2.88%
Overall | HA | 15.53 | 42.19 | 10.01%
Overall | ARIMA | 11.02 | 31.75 | 7.81%
Overall | FC-LSTM | 9.17 | 25.65 | 5.89%
Overall | STNN (ours) | 5.31 | 19.83 | 3.79%
TABLE V: Performance of traffic volume prediction on the simulated dynamic network. We highlight sensors 7, 8, and 9 as they are most affected by the changing network topology.
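Excluding the closure periods from the evaluation of sensor 8 amounts to computing the error over a masked set of time steps. The sketch below illustrates this with a toy series; the function `masked_mae`, the half-open treatment of period boundaries, and the sample values are assumptions, as the text does not state whether the boundary time steps are included.

```python
def masked_mae(pred, true, closed_periods):
    """MAE over time steps that fall outside every (start, end) closure period.

    Assumption: periods are half-open, i.e. start <= t < end is treated as closed.
    """
    def is_open(t):
        return not any(start <= t < end for start, end in closed_periods)

    kept = [(p, y) for t, (p, y) in enumerate(zip(pred, true)) if is_open(t)]
    return sum(abs(p - y) for p, y in kept) / len(kept)

# Hypothetical five-step series in which the road is closed at steps 2 and 3.
true = [10.0, 12.0, 0.0, 0.0, 11.0]
pred = [9.0, 14.0, 3.0, 2.0, 11.0]

print(masked_mae(pred, true, [(2, 4)]))  # closure steps excluded
print(masked_mae(pred, true, []))        # all steps included
```

For the simulated dataset, the same function could be called with the closure periods of road segment 8 given in the text, `[(400, 600), (1500, 1900)]`.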

VI-D Case Study

To gain better insight into our model, we present a real-world case study. The aim is to reveal the spacetime intervals that are extracted by our model. Figure 5 illustrates a small part of the METR-LA dataset that contains a number of nodes around the Los Angeles Zoo. These nodes are used for the prediction of a target node, node 111. We set the prediction period to 15 minutes (3 time steps). The color scale in Figure 5(a), which is transformed from the learned attention map, indicates the influence (i.e., spacetime interval) that the surrounding traffic events have on the target node 111. The vertical axis represents the spatial dimension, where nodes are listed in ascending order of their distance to the target node 111. The horizontal axis represents the temporal dimension, where the data of time steps 0–11 are used for predicting the next 3 time steps of node 111. Figure 5(b) displays the spatial context of all nodes involved. From these two figures, we can observe that the most significant nodes for prediction are 111, 42, 54, 37, 142, and 145. Among them, 54, 37, 142, and 145 are the nearest upstream neighbors, and 42 is the immediate downstream neighbor. It is remarkable that downstream locations can be useful for traffic forecasting, although the correlation fades out rapidly as distance increases, as for node 24. Besides the spatial context, the temporal dimension also plays an important role. The time steps of node 111 that are key to the prediction are the immediate last four, while this is not true for other nodes. Despite the close proximity of nodes 125 and 58 to 111, they demonstrate a smaller influence, as 58 is in the opposite direction of the same road while 125 is on another road. This case study demonstrates that our model can extract meaningful correlations between the traffic events along both the spatial and temporal dimensions.


(a) Impact of traffic events for the prediction of future traffic flow at node 111. Nodes are sorted by travel distance with 111 on road network.


(b) Spatial context of node 111. Red markers represent roads from west to east, blue markers represent roads from east to west, and green markers stand for north to south.
Fig. 5: Case study

VI-E Ablation Study

We perform an ablation study to validate the impact of each proposed component on the model performance. In each experiment, one block or layer is removed and the rest remains unchanged. In particular, six experiments are conducted: full STNN, STNN without the ST-Attn block, STNN without the ST-Conv block, STNN without the spatial convolution layer, STNN without the temporal convolution layer, and STNN without the spacetime convolution layer. We report the performance of these models for short-term (15 mins) forecasting in Table VI.

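Model | METR-LA: MAE / RMSE / MAPE (%) | PeMS-Bay: MAE / RMSE / MAPE (%)
Full STNN | 2.27 / 4.46 / 5.80 | 1.20 / 2.41 / 2.53
No ST-Attn block | 2.35 / 4.56 / 5.85 | 1.23 / 2.43 / 2.55
No ST-Conv block | 2.41 / 4.69 / 5.97 | 1.30 / 2.58 / 2.71
No spatial conv | 2.31 / 4.51 / 5.74 | 1.22 / 2.42 / 2.50
No temporal conv | 2.33 / 4.58 / 5.90 | 1.22 / 2.47 / 2.55
No spacetime conv | 2.34 / 4.59 / 5.98 | 1.23 / 2.49 / 2.58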
TABLE VI: Ablation study

The effect of the proposed spacetime attention block is evident: it highlights the contributive traffic events, and removing it raises the METR-LA MAE from 2.27 to 2.35. The spacetime convolution block improves the prediction accuracy significantly, as it aggregates the traffic events from three perspectives; without it, the METR-LA MAE rises to 2.41 and the PeMS-Bay MAE rises from 1.20 to 1.30. Moreover, the spacetime convolutional layer appears to be more important than the other two convolutional layers.

VI-F Complexity Analysis

We analyze the complexity of the proposed model to justify its scalability by showing that the computation of STNN is independent of the road network size. In addition, the total number of trainable parameters in STNN is 308,786, which is smaller than that of STGCN (454,358), MTGCN (405,452), and other state-of-the-art methods.

The time complexity of the spacetime convolutional block depends on the number of sensors in the local-spacetime, the length of the input time steps, and the numbers of channels of the input and output feature maps. Additionally, four different convolutional layers (spacetime, spatial, temporal, and 1×1) are used in the spacetime convolutional block. The time complexity of the spacetime attention block depends on the numbers of channels of the input and output data and on the size of the spacetime manifold, which is again given by the number of sensors in the local-spacetime and the length of the input time steps. In short, the computational complexity of the proposed STNN depends on the local-spacetime size, the input sequence length, and the channels; it is independent of the size of the entire network.

Last, the local-spacetime construction is part of the proposed spacetime interval learning framework and is not included in the forward pass or backpropagation of STNN. Its complexity depends on the number of sensors in the network.

VII Conclusion

In this paper, we propose a novel spacetime interval learning framework and spacetime neural network (STNN) for accurate traffic prediction. Our approach captures the intrinsic principles of traffic flow by learning the spacetime intervals between traffic events. The model works on the local-spacetime extracted for target nodes and thus can handle dynamic network topology. Experiments on two real-world datasets and a simulated dynamic network show that the proposed framework significantly outperforms state-of-the-art methods. This confirms the effectiveness of capturing spatio-temporal correlations directly. In the future, we will further explore the possibility of capturing the graph dynamics from the input data when the underlying graph structure is unknown. Another promising direction is to embed the local-spacetime construction into an end-to-end deep learning framework, making the parameters used in local-spacetime construction learnable. Last, we would like to investigate the spacetime interval learning paradigm in other spatio-temporal data modeling scenarios.


  • [1] S. M. Carroll (2019) Spacetime and geometry. Cambridge University Press. Cited by: §IV.
  • [2] C. Chen, K. Petty, A. Skabardonis, P. Varaiya, and Z. Jia (2001) Freeway performance measurement system: mining loop detector data. Transportation Research Record 1748 (1), pp. 96–102. Cited by: §VI-A1.
  • [3] R. Chen, C. Liang, W. Hong, and D. Gu (2015) Forecasting holiday daily tourist flow based on seasonal support vector regression with adaptive genetic algorithm. Applied Soft Computing 26, pp. 435–443. Cited by: §II.
  • [4] Z. Cui, R. Ke, Z. Pu, and Y. Wang (2018) Deep bidirectional and unidirectional lstm recurrent neural network for network-wide traffic speed prediction. arXiv preprint arXiv:1801.02143. Cited by: §II.
  • [5] Z. Diao, X. Wang, D. Zhang, Y. Liu, K. Xie, and S. He (2019) Dynamic spatial-temporal graph convolutional neural networks for traffic forecasting. In AAAI, Vol. 33, pp. 890–897. Cited by: §I, TABLE I.
  • [6] S. Guo, Y. Lin, N. Feng, C. Song, and H. Wan (2019) Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In AAAI, Vol. 33, pp. 922–929. Cited by: TABLE I, §II.
  • [7] H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi (2014) Big data and its technical challenges. Communications of the ACM 57 (7), pp. 86–94. Cited by: §VI-A1.
  • [8] N. Laptev, J. Yosinski, L. E. Li, and S. Smyl (2017) Time-series extreme event forecasting with neural networks at uber. In International conference on machine learning, Vol. 34, pp. 1–5. Cited by: §II.
  • [9] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §II.
  • [10] Y. Li, R. Yu, C. Shahabi, and Y. Liu (2018) Diffusion convolutional recurrent neural network: data-driven traffic forecasting. In ICLR, Cited by: TABLE I, §II, 4th item, §VI-A1.
  • [11] E. P. Rowe (2013) Geometrical physics in minkowski spacetime. Springer Science & Business Media. Cited by: §I.
  • [12] Y. Sugiyama, M. Fukui, M. Kikuchi, K. Hasebe, A. Nakayama, K. Nishinari, S. Tadaki, and S. Yukawa (2008) Traffic jams without bottlenecks—experimental evidence for the physical mechanism of the formation of a jam. New journal of physics 10 (3), pp. 033001. Cited by: §I.
  • [13] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NeurIPS, External Links: 1409.3215, ISSN 10495258 Cited by: 3rd item.
  • [14] M. Treiber and A. Kesting (2013) Traffic flow dynamics. Traffic Flow Dynamics: Data, Models and Simulation, Springer-Verlag Berlin Heidelberg. Cited by: §I.
  • [15] J. Van Lint and C. Van Hinsbergen (2012) Short-term traffic and travel time prediction models. Artificial Intelligence Applications to Critical Transportation Issues 22 (1), pp. 22–41. Cited by: §II.
  • [16] E. I. Vlahogianni, M. G. Karlaftis, and J. C. Golias (2014) Short-term traffic forecasting: where we are and where we’re going. Transportation Research Part C: Emerging Technologies 43, pp. 3–19. Cited by: §I.
  • [17] B. M. Williams and L. A. Hoel (2003) Modeling and forecasting vehicular traffic flow as a seasonal arima process: theoretical basis and empirical results. Journal of transportation engineering 129 (6), pp. 664–672. Cited by: §I, §II, 2nd item.
  • [18] Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, and C. Zhang (2020) Connecting the dots: multivariate time series forecasting with graph neural networks. In KDD, Cited by: TABLE I, 8th item.
  • [19] Z. Wu, S. Pan, G. Long, J. Jiang, and C. Zhang (2019) Graph wavenet for deep spatial-temporal graph modeling. IJCAI. Cited by: TABLE I, 6th item.
  • [20] H. Yao, X. Tang, H. Wei, G. Zheng, and Z. Li (2019) Revisiting spatial-temporal similarity: a deep learning framework for traffic prediction. In AAAI, Vol. 33, pp. 5668–5675. Cited by: TABLE I.
  • [21] H. Yao, F. Wu, J. Ke, X. Tang, Y. Jia, S. Lu, P. Gong, J. Ye, and Z. Li (2018) Deep multi-view spatial-temporal network for taxi demand prediction. In AAAI, Cited by: §I, TABLE I, §II.
  • [22] I. Yperman, S. Logghe, and B. Immers (2005) The link transmission model: an efficient implementation of the kinematic wave theory in traffic networks. In Proceedings of the 10th EWGT Meeting, pp. 122–127. Cited by: §I.
  • [23] B. Yu, H. Yin, and Z. Zhu (2018) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In IJCAI, Cited by: §I, TABLE I, §II, 5th item.
  • [24] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354–7363. Cited by: §V-A.
  • [25] H. Zhang, S. Feng, C. Liu, Y. Ding, Y. Zhu, Z. Zhou, W. Zhang, Y. Yu, H. Jin, and Z. Li (2019) Cityflow: a multi-agent reinforcement learning environment for large scale city traffic scenario. In The World Wide Web Conference, pp. 3620–3624. Cited by: §VI-A2.
  • [26] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D. Yeung (2018) Gaan: gated attention networks for learning on large and spatiotemporal graphs. In UAI, Cited by: §II.
  • [27] J. Zhang, Y. Zheng, D. Qi, R. Li, and X. Yi (2016) DNN-based prediction model for spatio-temporal data. In SIGSPATIAL, pp. 1–4. Cited by: §II.
  • [28] J. Zhang, Y. Zheng, and D. Qi (2017) Deep spatio-temporal residual networks for citywide crowd flows prediction. In AAAI, Cited by: §II.
  • [29] N. Zhang, Y. Zhang, and H. Lu (2011) Seasonal autoregressive integrated moving average and support vector machine models: prediction of short-term traffic flow on freeways. Transportation Research Record 2215 (1), pp. 85–92. Cited by: §II.
  • [30] Q. Zhang, J. Chang, G. Meng, S. Xiang, and C. Pan (2020) Spatio-temporal graph structure learning for traffic forecasting. In AAAI, Vol. 34, pp. 1177–1185. Cited by: §I, TABLE I.
  • [31] C. Zheng, X. Fan, C. Wang, and J. Qi (2020) Gman: a graph multi-attention network for traffic prediction. In AAAI, Vol. 34, pp. 1234–1241. Cited by: TABLE I, 7th item.
  • [32] E. Zivot and J. Wang (2006) Vector autoregressive models for multivariate time series. Modeling financial time series with S-PLUS®, pp. 385–429. Cited by: §II.