How to Build a Graph-Based Deep Learning Architecture in Traffic Domain: A Survey

05/24/2020 ∙ by Jiexia Ye, et al.

The huge success of deep learning in computer vision and natural language processing has inspired researchers to exploit deep learning techniques in the traffic domain. Various deep learning architectures have been proposed to solve its complex challenges (e.g., spatial temporal dependencies). Traditionally, researchers modeled the traffic network as grids or segments in the spatial dimension. However, many traffic networks are graph-structured in nature, so to utilize such spatial information fully it is more appropriate to formulate traffic networks as graphs mathematically. Recently, many novel deep learning techniques have been developed to process graph data. More and more works have applied these graph-based deep learning techniques to various traffic tasks and have achieved state-of-the-art performance. To provide a comprehensive and clear picture of this emerging trend, this survey carefully examines various graph-based deep learning architectures in many traffic applications. We first give guidelines to formulate a traffic problem based on graphs and to construct graphs from various traffic data. Then we decompose these graph-based architectures and discuss their shared deep learning techniques, clarifying the utilization of each technique in traffic tasks. Furthermore, we summarize common traffic challenges and the corresponding graph-based deep learning solutions to each challenge. Finally, we provide benchmark datasets, open source codes and future research directions in this rapidly growing field.




I Related Work

There have been some surveys summarizing the evolving algorithms in the traffic domain from different perspectives. [21] discussed differences and similarities between statistical methods and neural networks to promote mutual understanding between these two communities. [22] reviewed ten challenges in short-term traffic forecasting, which stemmed from the changing needs of ITS applications. [23] conducted a comprehensive overview of approaches in urban flow forecasting. [7] provided a classification of urban big data fusion methods based on deep learning (DL): DL-output-based fusion, DL-input-based fusion and DL-double-stage-based fusion. [24], [25] discussed deep learning for popular topics including traffic network representation, traffic flow forecasting, traffic signal control and automatic vehicle detection. [26] and [27] gave a similar but more elaborate analysis of newly emerging deep learning models in multiple transportation applications. [28] provided a spatial temporal perspective to summarize the deep learning techniques in the traffic domain and other domains. However, none of these surveys takes the literature on graph neural networks (GNNs) into consideration, except that [28] mentioned GNNs in a very short subsection.

On the other hand, there are also some reviews summarizing the literature on GNNs from different aspects. [29] is the first to overview deep learning techniques for processing data in non-Euclidean space (e.g., graph data). [30] categorized GNNs by graph types, propagation types and training types and divided related applications into structural scenarios, non-structural scenarios, and other scenarios. [31] introduced GNNs based on small graphs and giant graphs respectively. [32],[31] focused on reviewing related works in a specific branch of GNNs, i.e., graph convolutional network (GCN). However, they seldom introduce GNN works related to traffic scenarios. [33] is the only survey that spends a paragraph describing GNNs in the traffic domain, which is not enough for anyone wishing to explore this field.

Up to now, there is still no systematic and elaborate survey exploring graph-based deep learning techniques in the traffic domain, even though these techniques have developed rapidly. Our work aims to fill this gap and to promote the understanding of these newly emerging techniques in the transportation community.

II Problems, Research Directions and Challenges

Fig. 1: Traffic problems and the corresponding research directions

In this section, we briefly introduce background knowledge in the traffic domain, including some important traffic problems and research directions (as shown in Figure 1) as well as common challenges under these problems. On the one hand, we believe that such a concise but systematic introduction can help readers understand this domain quickly. On the other hand, our survey shows that existing works on graph-based deep learning techniques have covered only some research directions, which should inspire successors to transfer similar techniques to the remaining directions.

II-A Traffic Problems

There are many problems that the transportation community intends to tackle, including relieving traffic congestion, satisfying travel demand, enhancing traffic management, ensuring transportation safety and realizing automatic driving. Each problem can be partitioned into several research directions, and some directions can serve more than one problem. We introduce these problems along with their research directions below.

II-A1 Traffic Congestion

Traffic congestion [34] is one of the most important and urgent problems in modern cities. Extending road infrastructure is extremely expensive and time consuming. A more practical way is to increase traffic efficiency, for example, by predicting traffic congestion on the road network [35], [36], controlling road conditions through traffic state prediction [37],[18], or optimizing vehicle flow by controlling traffic signals [38],[39].

II-A2 Travel Demand

Travel demand refers to the demand for traffic services (taxi, bike, public transport) from citizens. With the emergence of online ride-hailing platforms (e.g., Uber, DiDi) and the rapid development of public transportation systems (e.g., metro and bus systems), travel demand prediction has become more and more important from many perspectives. For the related authorities, it can help to better allocate resources, e.g., increasing metro frequency at rush hours or adding more buses to service hotspots. For the business sector, it enables companies to better manage taxi-hiring [40], carpooling [41] and bike-sharing services [42],[43], and to maximize their revenues. For individuals, it encourages users to consider various forms of transportation to decrease their commuting time and improve their travel experience.

II-A3 Transportation Safety

Transportation safety is an indispensable part of public safety. Traffic accidents cause long delays and bring injuries or even deaths to victims. Therefore, monitoring traffic accidents and evaluating traffic risk are essential to avoid property loss and save lives. Many studies focus on directions such as detecting traffic incidents [44], predicting traffic accidents from social media data [45], predicting their risk level [46], and predicting the injury severity of traffic accidents [47], [48].

II-A4 Traffic Surveillance

Nowadays, surveillance cameras have been widely deployed on city roads, generating numerous images and videos [27]. Such development has enhanced traffic surveillance, which includes traffic law enforcement, automatic toll collection [49] and traffic monitoring systems. The research directions of traffic surveillance include license plate detection, automatic vehicle detection [50] and pedestrian detection [51].

II-A5 Autonomous Driving

Autonomous driving is a key emerging industry representing the future. It requires identifying trees, paths and pedestrians smoothly and accurately, so many of its tasks are related to visual recognition. The research directions of autonomous driving include lane and vehicle detection [52],[53], pedestrian detection [54] and traffic sign detection.

II-B Research Directions

Our survey of graph-based deep learning in the traffic domain shows that existing works focus mainly on two directions, i.e., traffic state prediction and passenger demand prediction, while a few works focus on driver behavior classification [55], optimal DETC scheme [49], vehicle/human trajectory prediction [56],[57], path availability [58] and traffic signal control [59]. To the best of our knowledge, traffic incident detection and vehicle detection have not yet been explored from a graph view.

II-B1 Traffic State Prediction

Traffic state in the literature refers to traffic flow, traffic speed, travel time, traffic density and so on. Traffic flow prediction (TFP) [60],[61], traffic speed prediction (TSP) [62], [63] and travel time prediction (TTP) [64],[65] are hot branches of traffic state prediction, which have attracted intensive studies.

II-B2 Travel Demand Prediction

Travel demand prediction aims to estimate the future number of users who require traffic services, for example, to predict future taxi requests in each area of a city [66],[67], or to predict the station-level passenger demand in a subway system [68],[69], or to predict the bike hiring demand citywide [42],[43].

II-B3 Traffic Signal Control

Traffic signal control aims to properly control the traffic lights so as to reduce the time vehicles stay at intersections in the long run [25]. Traffic signal control [59] can optimize traffic flow and reduce traffic congestion and emissions.

II-B4 Driver Behavior Classification

With the availability of in-vehicle sensors and GPS data, automatically classifying the driving styles of human drivers is an interesting research problem. A high-dimensional representation of driving features is expected to benefit the autonomous driving and auto insurance industries.

II-B5 Traffic Incident Detection

Major incidents can cause fatal injuries to travelers and long delays on a road network. Therefore, understanding the main cause of incidents and their impact on a traffic network is crucial for a modern transportation management system [44].

II-B6 Vehicle Detection

Automatic vehicle detection aims to process videos recorded by stationary cameras over roads, which are then transmitted to the surveillance centre for recording and processing.

II-C Challenges

Fig. 2: Traffic challenges and the corresponding deep learning techniques

Although traffic problems and their research directions are various, they share some common challenges, i.e., spatial dependency, temporal dependency, and external factors.

For instance, when traffic congestion occurs on a main road during the morning rush hours, the traffic flow will change in the following hours. What is more, its adjacent roads are likely to have traffic jams soon [70], [71], [72]. In vehicle trajectory prediction, the stochastic behaviors of surrounding vehicles, the relative positions of neighbors and the historical information of the vehicle's own trajectory are factors influencing the prediction performance [56]. When predicting the ride-hailing demand in a region, its previous orders are critical for prediction. In addition, regions sharing similar functionality are likely to share similar patterns in taxi demand [73],[66],[67]. For traffic signal control, the geometric features of multiple intersections on the road network are taken into consideration as well as the previous traffic flow around them [59].

To tackle the challenges above, many works provide various solutions, which can be divided into statistical methods, traditional machine learning approaches and deep learning techniques. In this paper, we focus on deep learning techniques in the traffic domain. Different from previous deep learning related traffic surveys, we are interested in how to build a graph-based deep learning architecture to overcome challenges in various tasks. We look into many graph-based solutions provided by related traffic works and summarize common techniques to solve the challenges mentioned above (as shown in Figure 2).

In the following sections, we first introduce a common way to formulate the traffic problem and give detailed guidelines to build a traffic graph from traffic data. Then we clarify the correlations between challenges and techniques from two perspectives, i.e., the techniques perspective and the challenges perspective. In the techniques perspective, we introduce several common techniques and interpret how they tackle challenges in traffic tasks. In the challenges perspective, we elaborate on each challenge and summarize the techniques which can tackle it. In short, we hope to provide insights into solving traffic problems from a graph view combined with deep learning techniques.

III Problem Formulation and Graph Construction

Among the graph-based deep learning traffic literature we investigate, more than 80% of the tasks are essentially spatial temporal forecasting problems based on graphs, especially traffic state prediction and travel demand prediction. In this section, we first list commonly used notations. Then we summarize a general formulation of graph-based spatial temporal prediction in the traffic domain, and provide the details needed to construct graphs from various traffic datasets. Finally, we discuss multiple definitions of the adjacency matrix, which represents the graph topology of the traffic network and is the key element of a graph-based solution.

III-A Notations

In this paper, we denote graph related elements, variables, parameters (hyper or trainable), activation functions, and operations as listed below. The variables comprise input variables {$x$, $X$, $\mathbf{x}$, $\mathbf{X}$, $\mathcal{X}$} and output variables {$y$, $Y$, $\mathbf{y}$, $\mathbf{Y}$, $\mathcal{Y}$}. These variables can be divided into three groups. The first group is composed of spatial variables which only represent spatial attributes. The second group is composed of temporal variables only representing temporal attributes. The last group is composed of spatiotemporal variables which represent both spatial and temporal features.

Symbol | Content
--- | ---
Graph related elements |
$E$ | Edges of graph $G$
$V$ | Vertices of graph $G$
$A$ | Adjacency matrix of graph $G$
$A^T$ | The transpose matrix of $A$
$\tilde{A}$ | Equal to $A + I_N$, a self-looped $A$
$D$ | The degree matrix of adjacency matrix $A$
$D_I$ | The in-degree matrix of adjacency matrix $A$
$D_O$ | The out-degree matrix of adjacency matrix $A$
$L$ | Laplacian matrix of graph $G$
$U$ | The eigenvectors matrix of $L$
$\Lambda$ | The diagonal eigenvalues matrix of $L$
$\lambda_{max}$ | The max eigenvalue of $L$
$I_N$ | An identity matrix
Hyper parameters |
$N$ | The number of nodes in graph $G$
$F_I$ | The number of input features
$F_H$ | The number of hidden features
$F_O$ | The number of output features
$P$ | The number of past time slices
$Q$ | The number of future time slices
$d$ | The dilation rate
Trainable parameters |
$W$, $b$ | The trainable parameters
$\Theta$ | The kernel
Activation functions |
$\rho(\cdot)$ | The activation function, e.g., tanh, sigmoid, ReLU
$\sigma(\cdot)$ | The sigmoid function
$\tanh(\cdot)$ | The hyperbolic tangent function
$\mathrm{ReLU}(\cdot)$ | The ReLU function
$*_G$ | The convolution operator on graph
$\odot$ | Element-wise multiplication
$\cdot$ | Matrix multiplication
Spatial variables |
$X$ | An input graph composed of $N$ nodes with $F_I$ features
$X^j$ | The $j$-th feature of an input graph
$X_i$ | Node $i$ in an input graph
$x$ | A simple input graph
$Y$ | An output graph composed of $N$ nodes with $F_O$ features
$Y^j$ | The $j$-th feature of an output graph
$Y_i$ | Node $i$ in an output graph
$y$ | A simple output graph
Temporal variables |
$\mathbf{X}$ | A sequential input with $F_I$ features over $P$ time slices
$\mathbf{X}_t$ | The element of sequential input at time $t$
$\mathbf{x}$ | A simple sequential input over $P$ time slices
$\mathbf{x}_t$ | The element of simple sequential input at time $t$
$\mathbf{H}_t$ | A hidden state with $F_H$ features at time $t$
$\mathbf{Y}$ | A sequential output with $F_O$ features over $Q$ time slices
$\mathbf{Y}_t$ | The element of sequential output at time $t$
$\mathbf{y}$ | A simple sequential output over $Q$ time slices
$\mathbf{y}_t$ | The element of simple sequential output at time $t$
Spatiotemporal variables |
$\mathcal{X}$ | A series of input graphs composed of $N$ nodes with $F_I$ features over $P$ time slices
$\mathcal{X}_t$ | An input graph at time $t$
$\mathcal{X}_{t,i}$ | Node $i$ in an input graph at time $t$
$\mathcal{X}_t^j$ | The $j$-th feature of an input graph at time $t$
$\mathcal{X}_{t,i}^j$ | The $j$-th feature of node $i$ in an input graph at time $t$
$\mathcal{Y}$ | A series of output graphs composed of $N$ nodes with $F_O$ features over $Q$ time slices
$\mathcal{Y}_t$ | An output graph at time $t$
$\mathcal{Y}_{t,i}$ | Node $i$ in an output graph at time $t$
$\mathcal{Y}_t^j$ | The $j$-th feature of an output graph at time $t$
$\mathcal{Y}_{t,i}^j$ | The $j$-th feature of node $i$ in an output graph at time $t$

III-B Graph-based Spatial Temporal Forecasting

To the best of our knowledge, most existing graph-based deep learning traffic works can be categorized as spatial temporal forecasting. They formalize their prediction problems in a very similar manner despite different mathematical notations. We summarize their works to provide a general formulation for many graph-based spatial temporal problems in the traffic domain.

The traffic network is represented as a graph $G = (V, E, A)$, which can be weighted [74],[64],[60] or unweighted [58],[70],[75], directed [58],[76],[77] or undirected [74], [61],[78], depending on specific tasks. $V$ is a set of nodes and $N = |V|$ refers to the number of nodes in the graph. Each node represents a traffic object, which can be a sensor [62],[61],[79], a road segment [74],[80],[81], a road intersection [64],[76], or even a GPS intersection [60]. $E$ is a set of edges referring to the connectivity between nodes.

$A = (a_{ij})_{N \times N} \in \mathbb{R}^{N \times N}$ is the adjacency matrix containing the topology information of the traffic network, which is valuable for traffic prediction. The entry $a_{ij}$ in matrix $A$ represents the node proximity and is different among various applications. It can be a binary value $0$ or $1$ [61],[70],[75]. Specifically, $a_{ij} = 0$ indicates no edge between node $i$ and node $j$ while $a_{ij} = 1$ indicates an edge between these two nodes. It can also be a float value representing some kind of relationship between nodes [74],[73], e.g., the road distance between two sensors [62],[82],[77].

$\mathcal{X}_t \in \mathbb{R}^{N \times F_I}$ is a feature matrix of the whole graph at time $t$. $\mathcal{X}_{t,i} \in \mathbb{R}^{F_I}$ represents node $i$ with $F_I$ features at time $t$. The features are usually traffic indicators, such as traffic flow [78],[77], traffic speed [62],[80],[76], ride-hailing orders [74],[73], or passenger flow [68],[69]. Usually, continuous indicators are normalized during the data preprocessing phase.

Fig. 3: The graph-based spatiotemporal problem formulation in traffic domain

Given historical observations of the whole traffic network over the past $P$ time slices, denoted as $\mathcal{X} = [\mathcal{X}_1, \dots, \mathcal{X}_P] \in \mathbb{R}^{P \times N \times F_I}$, the spatial temporal forecasting problem in the traffic domain aims to predict the future traffic observations over the next $Q$ time slices, denoted as $\mathcal{Y} = [\mathcal{Y}_1, \dots, \mathcal{Y}_Q] \in \mathbb{R}^{Q \times N \times F_O}$, where $\mathcal{Y}_t \in \mathbb{R}^{N \times F_O}$ represents the output graph with $F_O$ features at time $t$. The problem (as shown in Figure 3) can be formulated as follows:

$$\mathcal{Y} = f(\mathcal{X}; G)$$

Some works predict multiple traffic indicators in the future (i.e., $F_O > 1$) while other works predict one traffic indicator (i.e., $F_O = 1$), such as traffic speed [80], [76] or ride-hailing orders [74],[73]. Some works only consider one-step prediction [83],[66],[49], i.e., forecasting traffic conditions in the next time step only ($Q = 1$). But models designed for one-step prediction can't be directly applied to predict multiple steps, because they are optimized by reducing the error during the training stage for the next step instead of the subsequent time steps [67]. Many works focus on multi-step forecasting (i.e., $Q > 1$) [84],[18],[85]. According to our survey, there are mainly three kinds of techniques to generate a multi-step output, i.e., FC layer, Seq2Seq, and the dilation technique. A fully connected (FC) layer is the simplest technique, serving as the output layer to obtain a desired output shape [62], [61], [86], [70], [87], [88]. Some works adopt the Sequence to Sequence (Seq2Seq) architecture with an RNN-based decoder to generate the output recursively through multiple steps [89],[79],[71],[84],[90],[77]. [82], [85] adopted the dilation technique to get a desired output length.

In addition, some works not only consider traffic related measurements, but also take external factors (e.g., time attributes, weather) [62],[91],[87],[92] into consideration. Therefore, the problem formulation becomes:

$$\mathcal{Y} = f(\mathcal{X}, \mathcal{E}; G)$$

where $\mathcal{E}$ denotes the external factors.
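
To make the input and output shapes of this formulation concrete, the following is a minimal sketch of how training samples might be cut from a recorded traffic series with a sliding window. It writes P for the number of past time slices, Q for the number of future time slices and N for the number of nodes. The helper name `make_windows` and the toy data are assumptions for illustration, not taken from any surveyed work, and the sketch predicts the same features it observes (so the output feature count equals the input feature count).

```python
import numpy as np

def make_windows(series, P, Q):
    """Slice a (T, N, F) array into (X, Y) pairs of P past and Q future steps."""
    T = series.shape[0]
    n_samples = T - P - Q + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested P and Q")
    # X[s] holds time slices s .. s+P-1, Y[s] holds the Q slices that follow.
    X = np.stack([series[s:s + P] for s in range(n_samples)])
    Y = np.stack([series[s + P:s + P + Q] for s in range(n_samples)])
    return X, Y

# Toy data: T=10 time slices, N=3 nodes, 2 features per node.
series = np.arange(10 * 3 * 2, dtype=float).reshape(10, 3, 2)
X, Y = make_windows(series, P=4, Q=2)
print(X.shape, Y.shape)  # (5, 4, 3, 2) (5, 2, 3, 2)
```

Setting `Q=1` gives the one-step setting, while `Q>1` gives multi-step targets that an FC output layer, a Seq2Seq decoder or dilated convolutions would then have to produce.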

III-C Graph Construction from Traffic Datasets

Fig. 4: Graph construction from traffic datasets: 1) In a sensor graph, each sensor represents a node and there is an edge between adjacent sensors on the same side of a road. The features of a sensor are the traffic measurements collected by itself. 2) In a road segment graph, each road segment represents a node and two connected segments have an edge. In sensor datasets, the features of a road segment are the average traffic measurements (e.g., traffic speed) recorded by all the sensors on it. In GPS datasets, the features of each road segment are the average traffic measurements recorded by all the GPS points on it. 3) In a road intersection graph, each road intersection represents a node and two road intersections connected by a road segment have an edge. The features of a road intersection are the sum of the traffic measurements through it. Most works take the edge direction to be the traffic flow direction [62],[79],[58],[77],[60],[93], while some works ignore the direction and construct an undirected graph [61],[82],[75],[81],[76].

Modeling a traffic network as a graph is vital for any work that intends to utilize graph-based deep learning architectures. Even though many works share a similar problem formulation, they differ in graph construction due to the traffic datasets they collect. We find that these datasets can be divided into four categories by the related traffic infrastructure: sensor data on road networks [62],[61],[63], GPS trajectories of taxis [60],[93],[76], orders of ride-hailing systems [73],[67],[92], and transaction records of subway [68],[69] or bus systems [93]. For each category, we describe the datasets and explain the construction of nodes $V$, edges $E$ and feature matrix $\mathcal{X}_t$ in the traffic graph $G$.

III-C1 Sensors Datasets

Traffic measurements (e.g., traffic speed) are generally collected every 30 seconds by the sensors (e.g., loop detectors, probes) on a road network in cities and regions such as Beijing [74], California [63], Los Angeles [62], New York [80], Philadelphia [86], Seattle [75], Xiamen [79], and Washington [86]. Sensor datasets are the most prevalent datasets in existing works, especially the PEMS dataset from California. Generally, a road network contains traffic objects such as sensors and road segments (shown in Figure 4). Some existing works construct a sensor graph [62],[61],[77] while others construct a road segment graph [74],[80],[86].

III-C2 GPS Datasets

GPS trajectory datasets are usually generated by a large number of taxis over some period of time in a city, e.g., Beijing [60], Chengdu [60], Shenzhen [70], Cologne [76], and Chicago [81]. Each taxi produces substantial GPS points with time, location and speed information every day. Every GPS record is fitted to its nearest road on the city road map. All roads are divided into multiple road segments at road intersections. Some works extract a road segment graph [81], [70] while others extract a road intersection graph [64],[60],[76] (shown in Figure 4).

III-C3 Ride-hailing Datasets

These datasets record car/taxi/bicycle demand orders over a period of time in cities like Beijing [74],[73], Chengdu [73], and Shanghai [74]. The target city, taken from an OpenStreetMap map, is divided into equal-size grid-based regions. Each region is defined as a node in a graph. The feature of each node is the number of orders in its region during a given interval. [74],[73] observed that various correlations between nodes were valuable for prediction, so multiple graphs were constructed (as shown in Figure 5).

Fig. 5: Multi-relationships:1) A spatial locality graph: This graph is based on spatial proximity and it constructs edges between a region and its 8 adjacent regions in a 3 x 3 grid. 2) A transportation connectivity graph: This graph assumes that geographically distant but conveniently reachable regions by motorway, highway or subway have strong correlations with the target region. There should be edges between them. 3) A functional similarity graph: This graph assumes that regions sharing similar functionality might have similar demand patterns. Edges are constructed between regions with similar surrounding POIs.
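
As an illustration of the spatial locality graph described in Fig. 5, the sketch below builds a binary adjacency matrix in which every grid region is linked to its (up to) 8 surrounding regions. The function name and the row-major node numbering are assumptions made for this example only.

```python
import numpy as np

def grid_locality_adjacency(rows, cols):
    """Binary adjacency linking each grid cell to its (up to) 8 neighbours."""
    n = rows * cols
    A = np.zeros((n, n), dtype=int)
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c  # row-major index of the region at (r, c)
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < rows and 0 <= cc < cols
                    if (dr, dc) != (0, 0) and inside:
                        A[i, rr * cols + cc] = 1
    return A

A = grid_locality_adjacency(3, 3)
print(A[4].sum(), A[0].sum())  # 8 3
```

The centre region of a 3 x 3 grid has all 8 neighbours, while a corner region has only 3. A transportation connectivity graph or a functional similarity graph would use the same node set with a different rule for placing edges.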

III-C4 Transactions Datasets

These datasets are generated from subway or bus transaction system, from which a subway graph [68],[69],[93] or a bus graph [93] can be constructed.

A subway graph: Each station in the subway system is treated as a node. If two stations of a metro line are adjacent, there is an edge between them and vice versa. The features of a station are usually its inflow and outflow records during a given time interval.

A bus graph: Each bus stop is treated as a node. If two bus stops in a bus line are adjacent, there is an edge between them and vice versa. The features of a bus stop are usually its entrance records along with other features during a given time interval.
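
The station-adjacency rule above can be sketched in a few lines. The example assumes each line is given as an ordered list of station names; the network, the station names and the helper `build_station_graph` are hypothetical.

```python
def build_station_graph(lines):
    """Return (stations, adjacency) from ordered station lists, one per line."""
    stations = sorted({s for line in lines.values() for s in line})
    index = {s: k for k, s in enumerate(stations)}
    n = len(stations)
    A = [[0] * n for _ in range(n)]
    for line in lines.values():
        # Consecutive stations on the same line are adjacent.
        for a, b in zip(line, line[1:]):
            A[index[a]][index[b]] = 1
            A[index[b]][index[a]] = 1  # undirected: the edge holds both ways
    return stations, A

# Hypothetical two-line network sharing the interchange station "C".
lines = {"Line 1": ["A", "B", "C", "D"], "Line 2": ["E", "C", "F"]}
stations, A = build_station_graph(lines)
print(stations)   # ['A', 'B', 'C', 'D', 'E', 'F']
print(sum(A[2]))  # 4
```

The interchange station "C" ends up with four neighbours because it lies on both lines. Inflow and outflow counts per time interval would then be attached to each station as node features.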

III-D Adjacency Matrix

The adjacency matrix $A$ is the key element to extract traffic graph topology, which is valuable for prediction. Element $a_{ij}$ (binary or weighted) represents a heterogeneous pairwise relationship between nodes. However, based on different assumptions in traffic scenarios, the matrix can be designed in very different ways, such as a fixed matrix or a dynamic matrix.

III-D1 Fixed Matrix

Many works assume that the correlations between nodes are fixed based on some prior knowledge and don't change over time. Therefore, a fixed matrix is designed and left unchanged during the whole experiment. In addition, some works extract multiple relationships between nodes, thus resulting in multiple fixed matrices [57],[43]. Generally, the pre-defined matrix represents spatial dependency in the traffic network, while in some works it also captures other kinds of correlations, like function similarity and transportation connectivity [74], semantic connection [73], and temporal similarity [63]. As to the entry value $a_{ij}$, it is defined as $1$ (connection) or $0$ (disconnection) in some works [61],[86],[70],[75]. In many other works, it is defined as a function of the distance between nodes [64],[60],[81],[78],[67],[76]. [74],[62],[91],[79],[82],[77] used a thresholded Gaussian kernel to define $a_{ij}$ as follows:

$$a_{ij} = \begin{cases} \exp\left(-\dfrac{d_{ij}^2}{\sigma^2}\right), & i \neq j \text{ and } \exp\left(-\dfrac{d_{ij}^2}{\sigma^2}\right) \geq \epsilon \\ 0, & \text{otherwise} \end{cases}$$

where $d_{ij}$ is the distance between node $i$ and node $j$. Hyper parameters $\sigma^2$ and $\epsilon$ are thresholds to control the distribution and sparsity of matrix $A$.
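
A minimal sketch of such a distance-based fixed matrix follows. It assumes the thresholded Gaussian kernel takes the common form in which the weight between two distinct nodes is exp(-d²/σ²) when that weight reaches the threshold ε, and 0 otherwise. The distance values and hyper parameters are made up for the example.

```python
import numpy as np

def gaussian_kernel_adjacency(dist, sigma, eps):
    """a_ij = exp(-d_ij^2 / sigma^2) if i != j and the weight >= eps, else 0."""
    W = np.exp(-np.square(dist) / sigma ** 2)
    W[W < eps] = 0.0          # sparsify: drop weak (distant) connections
    np.fill_diagonal(W, 0.0)  # no self-edges in this variant
    return W

# Pairwise road distances between three sensors (arbitrary units).
dist = np.array([[0.0, 1.0, 5.0],
                 [1.0, 0.0, 2.0],
                 [5.0, 2.0, 0.0]])
A = gaussian_kernel_adjacency(dist, sigma=2.0, eps=0.1)
```

Here the two sensors that are 5 units apart receive weight 0 because exp(-25/4) falls below the threshold, while the closer pairs keep a positive weight. A larger `sigma` flattens the weights and a larger `eps` makes the matrix sparser.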

III-D2 Dynamic Matrix

Some works argue that the pre-defined matrix does not necessarily reflect the true dependency among nodes due to defective prior knowledge or incomplete data [64]. A novel adaptive matrix is proposed and learned through node embedding. Experiments in [82],[64],[80] indicate that the adaptive matrix can capture hidden spatial dependency in the data.
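
One way such an adaptive matrix can be parameterized is the form used in Graph WaveNet, where two node embedding matrices are multiplied, passed through ReLU and row-normalized with softmax. The sketch below shows only this forward computation with random stand-in embeddings; in a real model the embeddings would be trainable parameters updated by gradient descent, and other works may parameterize the matrix differently.

```python
import numpy as np

def adaptive_adjacency(E1, E2):
    """Row-normalised adjacency softmax(relu(E1 @ E2.T)) from node embeddings."""
    scores = np.maximum(E1 @ E2.T, 0.0)                  # ReLU
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    expd = np.exp(scores)
    return expd / expd.sum(axis=1, keepdims=True)        # softmax over each row

rng = np.random.default_rng(0)
N, c = 5, 3                    # 5 nodes, embedding size 3 (both arbitrary)
E1 = rng.normal(size=(N, c))   # source node embeddings (random stand-in)
E2 = rng.normal(size=(N, c))   # target node embeddings (random stand-in)
A_adp = adaptive_adjacency(E1, E2)
```

Each row of `A_adp` sums to 1, so it can be read as a learned transition matrix that needs no prior knowledge of the road network.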

In some scenarios, the graph structure can evolve over time as some edges may become unavailable, like road congestion or closure, and become available again after alleviating congestion. An evolving topological structure [58] is incorporated into the model to capture such dynamic spatial change.

IV Deep Learning Techniques Perspective

We summarize many graph-based deep learning architectures in the existing traffic literature and find that most of them are composed of graph neural networks (GNNs) and other modules, such as recurrent neural networks (RNNs), temporal convolution network (TCN), Sequence to Sequence (Seq2Seq) model, and generative adversarial network (GAN) (as shown in Table I). It is the cooperation of GNNs and other deep learning techniques that achieves state-of-the-art performance in many traffic scenarios. This section introduces their principles, advantages, defects and variants in traffic tasks, to help practitioners understand how to utilize deep learning techniques in the traffic domain.

Reference | Year | Direction | Model | Modules
--- | --- | --- | --- | ---
[57] | 2018 | Human Trajectory Prediction | - | SGCN
[49] | 2019 | Optimal DETC Scheme | - | SGCN
[55] | 2020 | Vehicle Behaviour Classification | MR-GCN | SGCN, LSTM
[56] | 2020 | Vehicle Trajectory Prediction | - | SGCN, LSTM
[59] | 2018 | Traffic Signal Control | - | SGCN, Reinforcement Learning
[58] | 2019 | Path Availability | LRGCN-SAPE | SGCN, LSTM
[64] | 2019 | Travel Time Prediction | - | SGCN
[60] | 2018 | Traffic Flow Prediction | KW-GCN | SGCN, LCN
[78] | 2018 | Traffic Flow Prediction | Graph-CNN | CNN, Graph Matrix
[94] | 2018 | Traffic Flow Prediction | DST-GCNN | SGCN
[61] | 2019 | Traffic Flow Prediction | - | SGCN, CNN, Attention Mechanism
[93] | 2019 | Traffic Flow Prediction | - | SGCN, TCN, Residual
[71] | 2019 | Traffic Flow Prediction | GHCRNN | SGCN, GRU, Seq2Seq
[84] | 2019 | Traffic Flow Prediction | STGSA | GAT, GRU, Seq2Seq
[77] | 2019 | Traffic Flow Prediction | DCRNN-RIL | DGCN, GRU, Seq2Seq
[95] | 2019 | Traffic Flow Prediction | MVGCN | SGCN, FNN, Gate Mechanism, Residual
[96] | 2019 | Traffic Flow Prediction | STGI-ResNet | SGCN, Residual
[72] | 2020 | Traffic Flow Prediction | FlowConvGRU | DGCN, GRU
[97] | 2018 | Traffic Speed Prediction | - | GAT, GRU, Gate Mechanism
[62] | 2019 | Traffic Speed Prediction | GTCN | SGCN, TCN, Residual
[63] | 2019 | Traffic Speed Prediction | 3D-TGCN | SGCN, Gate Mechanism
[87] | 2019 | Traffic Speed Prediction | DIGC-Net | SGCN, LSTM
[98] | 2019 | Traffic Speed Prediction | MW-TGC | SGCN, LSTM
[90] | 2019 | Traffic Speed Prediction | AGC-Seq2Seq | SGCN, GRU, Seq2Seq, Attention Mechanism
[76] | 2019 | Traffic Speed Prediction | GCGA | SGCN, GAN
[88] | 2019 | Traffic Speed Prediction | ST-GAT | GAT, LSTM
[74] | 2018 | Traffic State Prediction | STGCN | SGCN, TCN, Gate Mechanism
[89] | 2018 | Traffic State Prediction | DCRNN | DGCN, GRU, Seq2Seq
[80] | 2019 | Traffic State Prediction | - | SGCN, CNN, Gate Mechanism
[91] | 2019 | Traffic State Prediction | MRes-RGNN | DGCN, GRU, Residual, Gate Mechanism
[81] | 2019 | Traffic State Prediction | GCGAN | DGCN, LSTM, GAN, Seq2Seq, Attention Mechanism
[82] | 2019 | Traffic State Prediction | Graph WaveNet | DGCN, TCN, Residual, Gate Mechanism
[70] | 2019 | Traffic State Prediction | T-GCN | SGCN, GRU
[75] | 2019 | Traffic State Prediction | TGC-LSTM | SGCN, LSTM
[37] | 2019 | Traffic State Prediction | DualGraph | Seq2Seq, MLP, Graph Matrix
[85] | 2019 | Traffic State Prediction | ST-UNet | SGCN, GRU
[79] | 2020 | Traffic State Prediction | GMAN | GAT, Gate Mechanism, Seq2Seq, Attention Mechanism
[86] | 2020 | Traffic State Prediction | OGCRNN | SGCN, GRU, Attention Mechanism
[18] | 2020 | Traffic State Prediction | MRA-BGCN | SGCN, GRU, Seq2Seq, Attention Mechanism
[42] | 2018 | Travel Demand-bike | - | SGCN, LSTM, Seq2Seq
[43] | 2018 | Travel Demand-bike | GCNN-DDGF | SGCN, LSTM
[68] | 2020 | Travel Demand-subway | PVCGN | SGCN, GRU, Seq2Seq, Attention Mechanism
[69] | 2019 | Travel Demand-subway | WDGTC | Tensor Completion, Graph Matrix
[74] | 2019 | Travel Demand-taxi | CGRNN | SGCN, RNN, Attention Mechanism, Gate Mechanism
[73] | 2019 | Travel Demand-taxi | GEML | SGCN, LSTM
[66] | 2019 | Travel Demand-taxi | MGCN | SGCN
[67] | 2019 | Travel Demand-taxi | STG2Seq | SGCN, Seq2Seq, Attention Mechanism, Gate Mechanism, Residual
[92] | 2019 | Travel Demand-taxi | - | SGCN, LSTM, Seq2Seq
[83] | 2019 | Travel Demand-taxi | ST-ED-RMGC | SGCN, LSTM, Seq2Seq, Residual
TABLE I: The decomposition of graph-based deep learning architectures investigated in this paper

IV-A GNNs

Fig. 6: A general structure of Graph Neural Networks is composed of two kinds of layers: 1) Aggregation layer: On each feature dimension, the features of adjacent nodes are aggregated to the central node. Mathematically, the output of the aggregation layer is the product of the adjacency matrix and the feature matrix. 2) Non-linear transformation layer: Subsequently, all the aggregated features of each node are fed into the non-linear transformation layer to create a higher-level feature representation. All nodes share the same transformation kernel.

In the last couple of years, motivated by the huge success of deep learning approaches (e.g., CNNs, RNNs), there has been increasing interest in generalizing neural networks to arbitrarily structured graphs, and such networks are classified as graph neural networks. Many works focus on extending the convolution of CNNs to graph data, and novel convolutions on graphs have been developed rapidly. The two mainstream graph convolutions related to traffic tasks are spectral graph convolution (SGC) for undirected graphs and diffusion graph convolution (DGC) for directed graphs. There are also other novel convolutions [60], but the related traffic works are relatively few. Both SGC and DGC aim to generate new feature representations for each node in a graph through feature aggregation and non-linear transformation (as shown in Figure 6). Note that we refer to the SGC network as SGCN and the DGC network as DGCN.

Iv-A1 Spectral Graph Convolution

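In spectral graph theory, a graph is represented by its corresponding normalized Laplacian matrix $L = I_N - D^{-1/2} A D^{-1/2} \in \mathbb{R}^{N \times N}$, where $A$ is the adjacency matrix and $D$ is the degree matrix. The real symmetric matrix $L$ can be diagonalized via eigendecomposition as $L = U \Lambda U^T$, where $U \in \mathbb{R}^{N \times N}$ is the eigenvector matrix and $\Lambda \in \mathbb{R}^{N \times N}$ is the diagonal eigenvalue matrix. Since $U$ is also an orthogonal matrix, it can be adopted as a graph Fourier basis, defining the graph Fourier transform of a graph signal $x \in \mathbb{R}^N$ as $\hat{x} = U^T x$, and its inverse as $x = U \hat{x}$.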
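The transform pair can be checked numerically. The following is a minimal NumPy sketch, assuming a small hypothetical 4-node path graph with one scalar reading per node; the helper name `normalized_laplacian` is illustrative and not taken from any surveyed work.

```python
import numpy as np

def normalized_laplacian(A):
    """Return L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg, dtype=float)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5  # isolated nodes keep a zero entry
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# Hypothetical path graph 0-1-2-3 and a graph signal x (one value per node).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 2.0, 3.0, 4.0])

L = normalized_laplacian(A)
eigvals, U = np.linalg.eigh(L)  # eigh handles symmetric input, U is orthogonal

x_hat = U.T @ x    # graph Fourier transform
x_rec = U @ x_hat  # inverse transform recovers the signal
```

Because `U` is orthogonal, `x_rec` matches `x` up to floating-point error, and the eigenvalues of the normalized Laplacian lie in the interval [0, 2].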

[100] tried to build an analogue of CNN convolution in the spectral domain and defined the spectral convolution as $y = \Theta *_{\mathcal{G}} x = U \Theta U^T x$, i.e., transforming $x$ into the spectral domain, adjusting its amplitude by a diagonal kernel $\Theta = \mathrm{diag}(\theta_0, \dots, \theta_{N-1})$, and applying the inverse Fourier transform to get the final result in the spatial domain. Although such convolution is theoretically guaranteed, it is computationally expensive, as multiplication with $U$ is $O(N^2)$ and the eigendecomposition of $L$ is intolerable for large-scale graphs. In addition, it considers all nodes through the kernel $\Theta$ with $N$ parameters and cannot extract spatially localized features.

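To avoid such limitations, [101] localized the convolution and reduced its parameters by restricting the kernel $\Theta$ to be a polynomial of the eigenvalue matrix $\Lambda$, i.e., $\Theta = \sum_{k=0}^{K-1} \theta_k \Lambda^k$, where $K$ determines the maximum radius of the convolution from a central node. Thus, the convolution can be rewritten as $\Theta *_{\mathcal{G}} x = \sum_{k=0}^{K-1} \theta_k U \Lambda^k U^T x = \sum_{k=0}^{K-1} \theta_k L^k x$. Furthermore, [101] adopted the Chebyshev polynomials $T_k(x)$ to approximate $L^k$, resulting in $\Theta *_{\mathcal{G}} x \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L}) x$ with a rescaled $\tilde{L} = \frac{2}{\lambda_{max}} L - I_N$, $\lambda_{max}$ being the largest eigenvalue of $L$, and $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$, $T_0(x) = 1$, $T_1(x) = x$ [102]. By recursively computing $T_k(\tilde{L}) x$, the complexity of this $K$-localized convolution can be reduced to $O(K|E|)$, with $|E|$ being the number of edges.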
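The Chebyshev recursion avoids any eigendecomposition at inference time. Below is a minimal NumPy sketch of it, assuming a dense Laplacian for readability; the function name `chebyshev_graph_conv` and the 4-node ring graph are hypothetical. Reaching the stated cost in the number of edges would require a sparse matrix type, which this sketch does not use.

```python
import numpy as np

def chebyshev_graph_conv(L, x, theta, lam_max=None):
    """Approximate sum_k theta[k] * T_k(L_tilde) @ x via the Chebyshev recursion."""
    n = L.shape[0]
    if lam_max is None:
        lam_max = np.linalg.eigvalsh(L).max()  # only needed to rescale L
    L_tilde = (2.0 / lam_max) * L - np.eye(n)

    t_prev = x                       # T_0(L_tilde) x = x
    out = theta[0] * t_prev
    if len(theta) == 1:
        return out
    t_curr = L_tilde @ x             # T_1(L_tilde) x = L_tilde x
    out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        # T_k = 2 * L_tilde * T_{k-1} - T_{k-2}, applied directly to the signal
        t_prev, t_curr = t_curr, 2.0 * (L_tilde @ t_curr) - t_prev
        out = out + theta[k] * t_curr
    return out

# Hypothetical 4-node ring: every node has degree 2, so L = I - A / 2.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.eye(4) - A / 2.0
x = np.array([1.0, 0.0, 0.0, 0.0])
theta = np.array([0.5, -0.3, 0.2])   # K = 3 coefficients
y = chebyshev_graph_conv(L, x, theta)
```

Each loop iteration needs only one matrix-vector product, which is why the recursion scales with the number of non-zero entries of the Laplacian when a sparse representation is used.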

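Based on [101], [103] simplified the spectral graph convolution by limiting $K = 2$. With $T_0(\tilde{L}) = I_N$ and $T_1(\tilde{L}) = \tilde{L}$, they got $\Theta *_{\mathcal{G}} x \approx \theta_0 T_0(\tilde{L}) x + \theta_1 T_1(\tilde{L}) x = \theta_0 x + \theta_1 \tilde{L} x$. Noticing that $\tilde{L} = \frac{2}{\lambda_{max}} L - I_N$, they set $\lambda_{max} = 2$, resulting in $\tilde{L} = L - I_N$. Since $L = I_N - D^{-1/2} A D^{-1/2}$ and $\tilde{L} = L - I_N$, they got $\Theta *_{\mathcal{G}} x \approx \theta_0 x - \theta_1 (D^{-1/2} A D^{-1/2}) x$. Further, they reduced the number of parameters by setting $\theta = \theta_0 = -\theta_1$ to address overfitting and got $\Theta *_{\mathcal{G}} x \approx \theta (I_N + D^{-1/2} A D^{-1/2}) x$. They further defined $\tilde{A} = A + I_N$ and adopted a renormalization trick to get $y = \Theta *_{\mathcal{G}} x \approx \theta \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} x$, where $\tilde{D}$ is the degree matrix of $\tilde{A}$. Finally, [103] proposed a spectral graph convolution layer as:

$$Y_j = \rho\left(\sum_{i=1}^{F_I} \theta_{i,j}\, \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X_i\right), \quad 1 \le j \le F_O, \qquad \text{i.e.,} \quad Y = \rho\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X W\right)$$

Here, $X \in \mathbb{R}^{N \times F_I}$ is the layer input with $F_I$ features, and $X_i \in \mathbb{R}^N$ is its $i$-th feature. $Y \in \mathbb{R}^{N \times F_O}$ is the layer output, and $Y_j \in \mathbb{R}^N$ is its $j$-th feature. $W \in \mathbb{R}^{F_I \times F_O}$ is a trainable parameter. $\rho(\cdot)$ is the activation function. Such a layer can aggregate information from 1-hop neighbors. The receptive neighborhood field can be expanded by stacking multiple graph convolution layers [18].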
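A spectral graph convolution layer of this kind reduces to two matrix products around a normalized adjacency matrix with self-loops. The sketch below is a minimal NumPy illustration under stated assumptions: the function name `gcn_layer`, the ReLU default, and the 3-node example are hypothetical choices, and edge weights are assumed non-negative.

```python
import numpy as np

def gcn_layer(A, X, W, activation=lambda z: np.maximum(z, 0.0)):
    """One layer: activation(D~^{-1/2} (A + I) D~^{-1/2} X W)."""
    A_tilde = A + np.eye(A.shape[0])       # add self-loops (renormalization trick)
    d_tilde = A_tilde.sum(axis=1)          # at least 1 for non-negative weights
    d_inv_sqrt = d_tilde ** -0.5
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return activation(A_hat @ X @ W)       # aggregate neighbors, then transform

# Hypothetical example: 3-node path graph, 2 input features, 4 output features.
rng = np.random.default_rng(0)
A_path = np.array([[0, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]], dtype=float)
X_in = rng.normal(size=(3, 2))
W_in = rng.normal(size=(2, 4))
Y_out = gcn_layer(A_path, X_in, W_in)      # shape (3, 4)
```

Stacking two such calls lets each node see its 2-hop neighborhood, which is the layer-stacking idea mentioned in the text.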

IV-A2 Diffusion Graph Convolution

Spectral graph convolution requires a symmetric Laplacian matrix to implement eigendecomposition, so it becomes invalid for a directed graph with an asymmetric Laplacian matrix. Diffusion convolution originates from graph diffusion and places no constraint on the graph. Graph diffusion [104], [105] can be represented as a transition matrix power series giving the probability of jumping from node $i$ to node $j$ at each step. After many steps, such a Markov process converges to a stationary distribution $\mathcal{P} = \sum_{k=0}^{\infty} \alpha (1 - \alpha)^k (D_O^{-1} A)^k$, where $D_O^{-1} A$ is the transition matrix, $\alpha \in [0, 1]$ is the restart probability, and $k$ is the diffusion step. In practice, a finite $K$-step truncation of the diffusion process is adopted and each step is assigned a trainable weight $\theta$. Based on the $K$-step diffusion process, [89] defined diffusion graph convolution as:

$$y = \Theta *_{\mathcal{G}} x = \sum_{k=0}^{K-1} \left(\theta_{k,1} (D_O^{-1} A)^k + \theta_{k,2} (D_I^{-1} A^T)^k\right) x$$

Here, $D_O^{-1} A$ and $D_I^{-1} A^T$ represent the transition matrix and its reverse, respectively. Such bidirectional diffusion enables the operation to capture the spatial correlation on a directed graph [89]. Similar to the spectral graph convolution layer, a diffusion graph convolutional layer is built as follows:

$$Y_j = \rho\left(\sum_{k=0}^{K-1} \sum_{i=1}^{F_I} \left(\theta_{k,1,i,j} (D_O^{-1} A)^k + \theta_{k,2,i,j} (D_I^{-1} A^T)^k\right) X_i\right), \quad 1 \le j \le F_O$$

where the parameters $\Theta \in \mathbb{R}^{K \times 2 \times F_I \times F_O}$ are trainable.
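The bidirectional diffusion on a single graph signal can be sketched as follows. This is a minimal NumPy illustration, assuming every node has non-zero in-degree and out-degree and using scalar per-step weights; the function name `diffusion_conv` and the directed 3-node ring are hypothetical.

```python
import numpy as np

def diffusion_conv(A, x, theta_fwd, theta_bwd):
    """K-step diffusion: sum_k (theta_fwd[k] * P_f^k + theta_bwd[k] * P_b^k) @ x."""
    P_fwd = A / A.sum(axis=1, keepdims=True)   # D_O^{-1} A, rows sum to 1
    P_bwd = A.T / A.sum(axis=0)[:, None]       # D_I^{-1} A^T, reverse direction
    y = np.zeros_like(x, dtype=float)
    x_fwd = x.astype(float)
    x_bwd = x.astype(float)
    for t_f, t_b in zip(theta_fwd, theta_bwd):
        y += t_f * x_fwd + t_b * x_bwd
        # Advance one diffusion step in each direction.
        x_fwd, x_bwd = P_fwd @ x_fwd, P_bwd @ x_bwd
    return y

# Hypothetical directed ring 0 -> 1 -> 2 -> 0 with K = 2 diffusion steps.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
x = np.array([1.0, 2.0, 3.0])
y = diffusion_conv(A, x, theta_fwd=[0.5, 1.0], theta_bwd=[0.5, 2.0])
```

On this ring the forward step pulls each node's value from its successor and the backward step from its predecessor, so the two weight sets let a model treat upstream and downstream influence differently.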

IV-A3 GNNs in Traffic Domain

Many traffic networks are naturally graph-structured (see Section III). However, previous studies can only capture spatial locality roughly because they model traffic networks as grids or segments [106], [107], overlooking the connectivity and globality of the traffic network. All the works we investigate model the traffic network as a graph to fully utilize spatial information.

Many works employ convolution operation directly on traffic graph to capture the complex spatial dependency of traffic data. Most of them adopt spectral graph convolution (SGC) while some employ diffusion graph convolution (DGC) [91], [89], [81], [82], [77],[72]. There are also some other graph neural networks such as graph attention network (GAT) [97], [88], [79],[84], tensor decomposition and completion on graph [69], but their related works are few, which might be a future research direction.

The key difference between SGC and DGC lies in their matrices, which represent different assumptions on the spatial correlations of the traffic network. The adjacency matrix in SGC implies that a central node in a graph has a stronger correlation with its directly adjacent nodes than with distant ones, which reflects reality in many traffic scenarios [74], [62]. The state transition matrix in DGC indicates that the spatial dependency is stochastic and dynamic instead of fixed and regular. The traffic flow is related to a diffusion process on a traffic graph to model its changing spatial correlations. In addition, the bidirectional diffusion in DGC offers the model more flexibility to capture the influence from both upstream and downstream traffic [89]. In short, DGC is more complicated than SGC. DGC can be adopted on any traffic network graph, while SGC can only be utilized to process symmetric traffic graphs.

When it comes to spatial temporal forecasting problems related to traffic, the input is a 3-D tensor $\mathcal{X} \in \mathbb{R}^{P \times N \times F_I}$ instead of a 2-D tensor $X \in \mathbb{R}^{N \times F_I}$. Thus, the convolution operations need to be further generalized to 3-D tensors. In many works, the same convolution operation (e.g., SGC, DGC) with the same kernel is imposed on each time step of $\mathcal{X}$ in parallel [74], [62], [93], [94].
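Applying one shared kernel to every time step amounts to a batched matrix product. The following minimal NumPy sketch assumes a tensor laid out as (time steps, nodes, features) and a precomputed normalized adjacency matrix; the name `graph_conv_over_time` and all sizes are hypothetical.

```python
import numpy as np

def graph_conv_over_time(A_hat, X_seq, W):
    """Apply relu(A_hat @ X_t @ W) to every time slice X_t with the same W."""
    # NumPy broadcasts the 2-D operands over the leading time axis.
    return np.maximum(A_hat @ X_seq @ W, 0.0)

rng = np.random.default_rng(0)
P, N, F_I, F_O = 4, 3, 2, 5                 # time steps, nodes, in/out features
A_tilde = np.array([[1, 1, 0],
                    [1, 1, 1],
                    [0, 1, 1]], dtype=float)  # adjacency with self-loops
d_inv_sqrt = A_tilde.sum(axis=1) ** -0.5
A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

X_seq = rng.normal(size=(P, N, F_I))
W = rng.normal(size=(F_I, F_O))
Y_seq = graph_conv_over_time(A_hat, X_seq, W)   # shape (P, N, F_O)
```

The result is identical to looping over the time axis and convolving each slice separately, but the batched form runs all slices in one call.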

In addition, to enhance the performance of graph convolution in traffic tasks, many works develop variants of SGC combined with other techniques according to their prediction goals.

For instance, [61] redefined SGC with an attention mechanism to adaptively capture the dynamic correlations in the traffic network: $\Theta *_{\mathcal{G}} x \approx \sum_{k=0}^{K-1} \theta_k (T_k(\tilde{L}) \odot S) x$, where $S \in \mathbb{R}^{N \times N}$ is the spatial attention.

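[63] generalized SGC on both spatial and temporal dimensions by scanning $K$-order neighbors on the graph and $K_t$ neighbors on the time axis without padding, which shortens the length of sequences by $K_t - 1$ at each step:

$$Y_{t,j} = \rho\left(\sum_{t'=0}^{K_t-1} \sum_{k=0}^{K-1} \sum_{i=1}^{F_I} \theta_{j,t',k,i}\, \tilde{L}^k X_{t-t',i}\right)$$

where $\tilde{L}$ is the rescaled Laplacian, $X_{t-t',i}$ is the $i$-th feature of input $X$ at time $t-t'$, and $Y_{t,j}$ is the $j$-th feature of output $Y$ at time $t$.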

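[75] changed SGC to $\Theta *_{\mathcal{G}} x = (W \odot \tilde{A}^K \odot \mathcal{FFR}) x$, where $\tilde{A}^K$ is the $K$-hop neighborhood matrix and $\mathcal{FFR}$ is a matrix representing physical properties of roadways. [98], [90] followed this work and redefined $\Theta *_{\mathcal{G}} x = (W \odot Bi(\tilde{A}^K)) x$, where $Bi(\cdot)$ is a function clipping each nonzero element in the matrix to 1.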

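[95] modified the adjacency matrix $A$ in SGC as $S = A \odot \omega$ to integrate geospatial position information into the model, where $\omega$ is a matrix calculated via a thresholded Gaussian kernel weighting function. The layer is built as $Y = \rho(\tilde{Q}^{-1/2} \tilde{S} \tilde{Q}^{-1/2} X W)$, where $\tilde{Q}$ is the degree matrix of $\tilde{S} = S + I_N$.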

[49] designed a novel edge-based SGC on road network to extract the spatiotemporal correlations of the edge features. Both the feature matrix and adjacency matrix are defined on edges instead of nodes.


IV-B RNNs

Recurrent Neural Networks (RNNs) are a type of neural network architecture mainly used to detect patterns in sequential data [108]. The traffic data collected in many traffic tasks are time series, so RNNs are commonly utilized in the traffic literature to capture the temporal dependency in traffic data. In this subsection, we introduce three classical RNN models (i.e., RNN, LSTM, GRU) and the relationships among them, which provides a theoretical basis for practitioners to choose an appropriate model for a specific traffic problem.

IV-B1 RNN

Fig. 7: The folded and unfolded structure of recurrent neural networks

Similar to a classical Feedforward Neural Network (FNN), a simple recurrent neural network (RNN) [109] contains three layers, i.e., input layer, hidden layer, output layer [110]. What differentiates RNN from FNN is the hidden layer. It passes information forward to the output layer in FNN while in RNN, it also transmits information back into itself forming a cycle [108]. For this reason, the hidden layer in RNN is called recurrent hidden layer. Such cycling trick can retain historical information, enabling RNN to process time series data.

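Suppose there are $F_I$, $F_H$, $F_O$ units in the input, hidden, and output layer of the RNN respectively. The input layer takes time series data $X = [X_1, \dots, X_P]$ in. For each element $X_t \in \mathbb{R}^{F_I}$ at time $t$, the hidden layer transforms it to $H_t \in \mathbb{R}^{F_H}$ and the output layer maps $H_t$ to $Y_t \in \mathbb{R}^{F_O}$. Note that the hidden layer takes not only $X_t$ but also $H_{t-1}$ as input. Such a cycling mechanism enables the RNN to memorize past information (as shown in Figure 7). The mathematical notations of the hidden layer and output layer are as follows.

$$H_t = \tanh\left([H_{t-1}, X_t] \cdot W_h + b_h\right)$$
$$Y_t = \rho\left(H_t \cdot W_y + b_y\right)$$

where $W_h \in \mathbb{R}^{(F_H + F_I) \times F_H}$, $W_y \in \mathbb{R}^{F_H \times F_O}$, $b_h \in \mathbb{R}^{F_H}$, $b_y \in \mathbb{R}^{F_O}$ are trainable parameters, $t = 1, \dots, P$, and $P$ is the input sequence length. $H_0$ is initialized using small non-zero elements, which can improve the overall performance and stability of the network [111].

In short, an RNN takes sequential data as input and generates another sequence of the same length: $[X_1, \dots, X_P] \rightarrow [Y_1, \dots, Y_P]$. Note that we can deepen an RNN by stacking multiple recurrent hidden layers.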
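The recurrence can be written as a short loop. Below is a minimal NumPy sketch of a single-layer RNN forward pass, assuming the previous hidden state and the current input are concatenated before one weight matrix and that the output activation is the identity; the function name `rnn_forward` and all sizes are hypothetical.

```python
import numpy as np

def rnn_forward(X, h0, W_h, b_h, W_y, b_y):
    """Run a single-layer RNN over X of shape (P, F_I); returns outputs (P, F_O)."""
    h = h0
    outputs = []
    for x_t in X:
        # The hidden layer sees both the previous state and the current input.
        h = np.tanh(np.concatenate([h, x_t]) @ W_h + b_h)
        outputs.append(h @ W_y + b_y)  # identity output activation (assumption)
    return np.stack(outputs)

rng = np.random.default_rng(0)
P, F_I, F_H, F_O = 6, 3, 4, 2
X = rng.normal(size=(P, F_I))
W_h = rng.normal(scale=0.1, size=(F_H + F_I, F_H))
W_y = rng.normal(scale=0.1, size=(F_H, F_O))
h0 = np.full(F_H, 0.01)                # small non-zero initial state
Y = rnn_forward(X, h0, W_h, np.zeros(F_H), W_y, np.zeros(F_O))
```

The output has one row per input time step, matching the equal-length input and output sequences described above.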

IV-B2 LSTM

Although the hidden state enables an RNN to memorize the input information over past time steps, it also introduces repeated matrix multiplication over the (potentially very long) sequence. Small values in this matrix multiplication cause the gradient to decrease at each time step and eventually vanish, while large values lead to the exploding gradient problem [112]. Exploding or vanishing gradients hinder the capacity of an RNN to learn long-term sequential dependencies in data [110].

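To overcome this hurdle, Long Short-Term Memory (LSTM) neural networks [113] were proposed to capture long-term dependency in sequence learning. Compared with the hidden layer in an RNN, the LSTM hidden layer has four extra parts: a memory cell, an input gate, a forget gate, and an output gate. The three gates, with values ranging in [0,1], control the information flow into the memory cell and preserve the extracted features from previous time steps. These simple changes enable the memory cell to store and read as much long-term information as possible. The mathematical notations of the LSTM hidden layer are as follows.

$$i_t = \sigma\left([H_{t-1}, X_t] \cdot W_i + b_i\right)$$
$$o_t = \sigma\left([H_{t-1}, X_t] \cdot W_o + b_o\right)$$
$$f_t = \sigma\left([H_{t-1}, X_t] \cdot W_f + b_f\right)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tanh\left([H_{t-1}, X_t] \cdot W_c + b_c\right)$$
$$H_t = o_t \odot \tanh(C_t)$$

where $i_t$, $o_t$, $f_t$ are the input gate, output gate, and forget gate at time $t$ respectively, and $C_t$ is the memory cell at time $t$.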
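One LSTM time step can be sketched directly from the gate definitions. This is a minimal NumPy illustration assuming the common formulation in which each gate applies a sigmoid to a linear map of the concatenated previous hidden state and current input; the function name `lstm_step` and the parameter dictionary keys are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps names such as 'W_i', 'b_i' to arrays."""
    z = np.concatenate([h_prev, x_t])
    i = sigmoid(z @ p["W_i"] + p["b_i"])       # input gate
    o = sigmoid(z @ p["W_o"] + p["b_o"])       # output gate
    f = sigmoid(z @ p["W_f"] + p["b_f"])       # forget gate
    c_cand = np.tanh(z @ p["W_c"] + p["b_c"])  # candidate memory content
    c = f * c_prev + i * c_cand                # keep part of old memory, add new
    h = o * np.tanh(c)                         # expose part of the memory
    return h, c

rng = np.random.default_rng(0)
F_I, F_H = 3, 4
params = {}
for g in "iofc":
    params["W_" + g] = rng.normal(scale=0.1, size=(F_H + F_I, F_H))
    params["b_" + g] = np.zeros(F_H)
h, c = lstm_step(rng.normal(size=F_I), np.zeros(F_H), np.zeros(F_H), params)
```

The additive update of the memory cell is what lets gradients flow across many steps without being repeatedly squashed by a matrix product.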

IV-B3 GRU

While LSTM is a viable option for avoiding vanishing or exploding gradients, its complex structure leads to higher memory requirements and longer training time. [114] proposed a simple yet powerful variant of LSTM, i.e., Gated Recurrent Units (GRU). The LSTM cell has three gates, but the GRU cell has only two, resulting in fewer parameters and thus shorter training time. Empirically, GRU is as effective as LSTM [114] and is widely used in various tasks. The mathematical notations of the GRU hidden layer are as follows.

$$r_t = \sigma\left([H_{t-1}, X_t] \cdot W_r + b_r\right)$$
$$u_t = \sigma\left([H_{t-1}, X_t] \cdot W_u + b_u\right)$$
$$\tilde{H}_t = \tanh\left([r_t \odot H_{t-1}, X_t] \cdot W_h + b_h\right)$$
$$H_t = u_t \odot H_{t-1} + (1 - u_t) \odot \tilde{H}_t$$

where $r_t$ is the reset gate and $u_t$ is the update gate.
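A GRU step is shorter than an LSTM step because it has no separate memory cell. The following minimal NumPy sketch assumes the common formulation in which the reset gate scales the previous hidden state inside the candidate and the update gate interpolates between old state and candidate; the function name `gru_step` and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps 'W_r', 'b_r', 'W_u', 'b_u', 'W_h', 'b_h' to arrays."""
    z = np.concatenate([h_prev, x_t])
    r = sigmoid(z @ p["W_r"] + p["b_r"])       # reset gate
    u = sigmoid(z @ p["W_u"] + p["b_u"])       # update gate
    z_reset = np.concatenate([r * h_prev, x_t])
    h_cand = np.tanh(z_reset @ p["W_h"] + p["b_h"])
    return u * h_prev + (1.0 - u) * h_cand     # blend old state and candidate

rng = np.random.default_rng(0)
F_I, F_H = 3, 4
params = {}
for g in "ruh":
    params["W_" + g] = rng.normal(scale=0.1, size=(F_H + F_I, F_H))
    params["b_" + g] = np.zeros(F_H)
h_next = gru_step(rng.normal(size=F_I), np.zeros(F_H), params)
```

With three weight matrices instead of four, the cell has fewer parameters than the LSTM cell of the same hidden size.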

IV-B4 RNNs in Traffic Domain

RNNs have shown impressive stability and capability in processing time series data. Since traffic data has a distinct temporal dependency, RNNs are usually leveraged to capture temporal correlation in traffic data. Among the works we survey, only [74] utilized the plain RNN to capture temporal dependency in traffic, while more than half adopted GRU and some employed LSTM. This can be explained by the fact that the plain RNN suffers from severe gradient vanishing or gradient explosion, while LSTM and GRU handle this successfully, and GRU additionally shortens the training time.

In addition, there are many tricks to augment the capacity of RNNs to model the complex temporal dynamics in the traffic domain, such as the attention mechanism, gating mechanism, and residual mechanism.

For instance, [74] incorporated contextual information (i.e., the output of SGCN containing information of related regions) into an attention operation to model the correlations between observations at different timestamps: $z = F_{pool}(X_t, SGCN(X_t))$ and $S = \sigma(W_1 \mathrm{ReLU}(W_2 z))$, $H_t = RNN([H_{t-1}, X_t] \odot S)$, where $F_{pool}(\cdot)$ is a global average pooling layer and $RNN(\cdot)$ denotes the RNN hidden layer.

[91] took external factors into consideration by embedding external attributes into the input. In addition, they added the previous hidden states to the next hidden states through a residual shortcut path, which they believed can make GRU more sensitive and robust to sudden changes in historical traffic observations. The new hidden state is formulated as $H_t = GRU([X_t, E_t], H_{t-1}) + W H_{t-1}$, where $E_t$ is the external features at time $t$, $W$ is a trainable linear parameter, and $W H_{t-1}$ is the residual shortcut.

[85] inserted a dilated skip connection into GRU by changing the hidden state from $H_t = GRU(X_t, H_{t-1})$ to $H_t = GRU(X_t, H_{t-s})$, where $s$ refers to the skip length or dilation rate of each layer and $GRU(\cdot)$ denotes the GRU hidden layer. Such a hierarchical design of dilation brings in multiple temporal scales for recurrent units at different layers, which achieves multi-timescale modeling.

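Besides the tricks above, some works replace the matrix multiplication in the hidden layer of RNNs with spectral graph convolution (SGC) or diffusion graph convolution (DGC) to capture spatial temporal correlations jointly. Take GRU as an example:

$$r_t = \sigma\left([H_{t-1}, X_t] *_{\mathcal{G}} W_r + b_r\right)$$
$$u_t = \sigma\left([H_{t-1}, X_t] *_{\mathcal{G}} W_u + b_u\right)$$
$$\tilde{H}_t = \tanh\left([r_t \odot H_{t-1}, X_t] *_{\mathcal{G}} W_h + b_h\right)$$
$$H_t = u_t \odot H_{t-1} + (1 - u_t) \odot \tilde{H}_t$$

The operator $*_{\mathcal{G}}$ can represent SGC, DGC or other variants. In the literature we survey, most replacements happen in GRU and only one in LSTM [58]. Among GRU-related traffic works, [91], [89], [86], [77], [72] replaced matrix multiplication with DGC, [18], [85], [68] with SGC, and [84], [97] with GAT.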
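Swapping the dense product for a graph convolution changes only one helper inside the GRU step. The sketch below is a minimal NumPy illustration, assuming a first-order graph convolution of the form normalized adjacency times features times weights; the function name `graph_gru_step` and the 3-node example are hypothetical, and a diffusion or attention operator could be substituted for `gconv`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def graph_gru_step(A_hat, X_t, H_prev, p):
    """GRU step on a graph: X_t is (N, F_I), H_prev is (N, F_H)."""
    def gconv(Z, W, b):
        # Graph convolution in place of the dense product Z @ W.
        return A_hat @ Z @ W + b

    Z = np.concatenate([H_prev, X_t], axis=1)
    r = sigmoid(gconv(Z, p["W_r"], p["b_r"]))   # reset gate per node
    u = sigmoid(gconv(Z, p["W_u"], p["b_u"]))   # update gate per node
    Z_reset = np.concatenate([r * H_prev, X_t], axis=1)
    H_cand = np.tanh(gconv(Z_reset, p["W_h"], p["b_h"]))
    return u * H_prev + (1.0 - u) * H_cand

rng = np.random.default_rng(0)
N, F_I, F_H = 3, 2, 4
A_tilde = np.array([[1, 1, 0],
                    [1, 1, 1],
                    [0, 1, 1]], dtype=float)     # adjacency with self-loops
d = A_tilde.sum(axis=1) ** -0.5
A_hat = d[:, None] * A_tilde * d[None, :]
params = {}
for g in "ruh":
    params["W_" + g] = rng.normal(scale=0.1, size=(F_H + F_I, F_H))
    params["b_" + g] = np.zeros(F_H)
H_next = graph_gru_step(A_hat, rng.normal(size=(N, F_I)), np.zeros((N, F_H)), params)
```

Every gate now mixes information from neighboring nodes before the temporal update, which is how spatial and temporal dependencies are captured in one recurrent cell.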

Note that besides RNNs, other techniques (e.g., TCN in the next subsection) are also popular choices to extract the temporal dynamics in traffic tasks.

IV-C TCN

Although RNN-based models have become widespread in time-series analysis, RNNs for traffic prediction still suffer from time-consuming iteration, complex gate mechanisms, and slow response to dynamic changes [74]. In contrast, 1D-CNN has the advantages of fast training, simple structure, and no dependency constraints on previous steps [115]. However, 1D-CNN is less common than RNNs in practice due to its lack of memory for long sequences [116]. In 2016, [117] proposed a novel convolution operation integrating causal convolution and dilated convolution, which outperforms RNNs in text-to-speech tasks. The prediction of causal convolution depends on previous elements but not on future elements. Dilated convolution expands the receptive field of the original filter by dilating it with zeros [118]. [119] simplified the causal dilated convolution in [117] for sequence modeling problems and renamed it the temporal convolution network (TCN). Recently, more and more works employ TCN to process sequential traffic data [74], [62], [82], [93].

IV-C1 Sequence Modeling and 1-D TCN

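Given an input sequence of length $P$ denoted as $x = [x_1, \dots, x_P] \in \mathbb{R}^P$, sequence modeling aims to generate an output sequence of the same length, denoted as $y = [y_1, \dots, y_P] \in \mathbb{R}^P$. The key assumption is that the output $y_t$ at the current time is related only to the historical data $[x_1, \dots, x_t]$ and does not depend on any future inputs $[x_{t+1}, \dots, x_P]$, i.e., $y_t = f(x_1, \dots, x_t)$, where $f$ is the mapping function.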

Obviously, RNN, LSTM and GRU can be solutions to sequence modeling tasks. However, TCN can tackle sequence modeling problems more efficiently than RNNs because it can capture long sequences properly in a non-recursive manner. The dilated causal convolution in TCN is formulated as follows:

$$y_t = \Theta *_{\mathcal{T}^d} x_t = \sum_{k=0}^{K-1} w_k x_{t - dk}$$

where $*_{\mathcal{T}^d}$ is the dilated causal operator with dilation rate $d$ controlling the skipping distance, and $\Theta = [w_0, \dots, w_{K-1}] \in \mathbb{R}^K$ is the kernel. A zero padding strategy is utilized to keep the output length the same as the input length (as shown in Figure 8). Without padding, the output length is shortened by $(K-1)d$ [74].
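A dilated causal convolution on a 1-D sequence fits in a few lines. The following minimal NumPy sketch assumes left zero padding when the output length must equal the input length; the function name `dilated_causal_conv` and the example values are hypothetical. The last lines stack three layers with kernel size 2 and dilation rates 1, 2 and 4 (an illustrative choice) to show how the receptive field grows.

```python
import numpy as np

def dilated_causal_conv(x, w, d=1, pad=True):
    """y_t = sum_k w[k] * x[t - d*k]; left zero padding keeps len(y) == len(x)."""
    x = np.asarray(x, dtype=float)
    K = len(w)
    shift = (K - 1) * d                 # how far back the kernel reaches
    if pad:
        x = np.concatenate([np.zeros(shift), x])
    T_out = len(x) - shift
    y = np.zeros(T_out)
    for k in range(K):
        start = shift - k * d           # tap k looks d*k steps into the past
        y += w[k] * x[start:start + T_out]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_same = dilated_causal_conv(x, w=[1.0, 1.0], d=2)              # padded
y_short = dilated_causal_conv(x, w=[1.0, 1.0], d=2, pad=False)  # no padding

# Stack three layers on a unit impulse to expose the receptive field.
impulse = np.zeros(10)
impulse[0] = 1.0
out = impulse
for rate in (1, 2, 4):
    out = dilated_causal_conv(out, w=[1.0, 1.0], d=rate)
```

Here `y_same` keeps all five positions while `y_short` loses the first two, and the impulse spreads over exactly eight positions after three layers, i.e. one plus the sum of the dilation rates for a kernel of size 2.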

Fig. 8: Multiple dilated causal convolution layers in TCN: $x = [x_1, \dots, x_P]$ is the input sequence and $y = [y_1, \dots, y_P]$ is the output sequence with the same length. A zero padding strategy is adopted.

To enlarge the receptive field, TCN stacks multiple dilated causal convolution layers with exponentially increasing dilation rates (e.g., $d^l = 2^l$ for the $l$-th layer, as shown in Figure 8). Therefore, the receptive field of the network grows exponentially without requiring many convolutional layers or larger filters, so longer sequences can be handled with fewer layers and less computation [82].

IV-C2 TCN in Traffic Domain

Many traffic works involve sequence modeling, especially traffic spatial temporal forecasting tasks. Compared with RNNs, the non-recursive calculation of TCN alleviates the gradient explosion problem and facilitates training through parallel computation. Therefore, some works adopt TCN to capture the temporal dependency in traffic data.

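Most graph-based traffic data are 3-D tensors denoted as $\mathcal{X} \in \mathbb{R}^{P \times N \times F_I}$, which requires the generalization of 1-D TCN to 3-D variables. The dilated causal convolution can be adopted to produce the $j$-th output feature of node $i$ at time $t$ as follows [62]:

$$\mathcal{Y}_{t,j}^{i} = \rho\left(\Theta_j *_{\mathcal{T}^d} \mathcal{X}_{t}^{i}\right) = \rho\left(\sum_{m=1}^{F_I} \sum_{k=0}^{K-1} w_{j,m,k}\, \mathcal{X}_{t-dk,m}^{i}\right), \quad 1 \le j \le F_O$$

where $\mathcal{Y}_{t,j}^{i}$ is the $j$-th output feature of node $i$ at time $t$, and $\mathcal{X}_{t-dk,m}^{i}$ is the $m$-th input feature of node $i$ at time $t-dk$. The kernel $\Theta_j \in \mathbb{R}^{K \times F_I}$ is trainable. $F_O$ is the number of output features.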

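The same convolution kernel is applied to all nodes on the traffic network and each node produces $F_O$ new features. The mathematical formulation of the layer is as follows [62], [93]:

$$\mathcal{Y} = \rho\left(\Theta *_{\mathcal{T}^d} \mathcal{X}\right)$$

where $\mathcal{X} \in \mathbb{R}^{P \times N \times F_I}$ represents the historical observations of the whole traffic network over the past $P$ time slices, $\Theta \in \mathbb{R}^{K \times F_I \times F_O}$ represents the related convolution kernel, and $\mathcal{Y} \in \mathbb{R}^{P \times N \times F_O}$ is the output of the TCN layer.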

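There are some tricks to enhance the performance of TCN in specific traffic tasks. For instance, [93] stacked multiple TCN layers to extract short-term neighboring dependencies in the bottom layers and long-term temporal features in the higher layers:

$$\mathcal{Y}^{(l+1)} = \sigma\left(\Theta^{l} *_{\mathcal{T}^{d^l}} \mathcal{Y}^{(l)}\right)$$

where $\mathcal{Y}^{(l)}$ is the input of the $l$-th layer, $\mathcal{Y}^{(l+1)}$ is its output, and $\mathcal{Y}^{(0)} = \mathcal{X}$. $d^l$ is the dilation rate of the $l$-th layer.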

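To reduce the complexity of model training, [62] constructed a residual block containing two TCN layers with the same dilation rate, and the block input was added to the output of the last TCN layer to get the block output:

$$\mathcal{Y}^{(l+1)} = \mathcal{Y}^{(l)} + \mathrm{ReLU}\left(\Theta_1^{l} *_{\mathcal{T}^d} \mathrm{ReLU}\left(\Theta_0^{l} *_{\mathcal{T}^d} \mathcal{Y}^{(l)}\right)\right)$$

where $\Theta_0^{l}$, $\Theta_1^{l}$ are the convolution kernels of the first layer and the second layer respectively, $\mathcal{Y}^{(l)}$ is the input of the residual block, and $\mathcal{Y}^{(l+1)}$ is its output.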

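[82] integrated a gating mechanism [116] with TCN to learn complex temporal dependency in traffic data:

$$\mathcal{Y} = \rho\left(\Theta_1 *_{\mathcal{T}^d} \mathcal{X} + b_1\right) \odot \sigma\left(\Theta_2 *_{\mathcal{T}^d} \mathcal{X} + b_2\right)$$

where $\sigma(\cdot) \in [0, 1]$ determines the ratio of information passed to the next layer.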
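The gating idea can be sketched on a 1-D sequence as two parallel causal convolutions, one passed through a filter activation and one through a sigmoid gate. This is a minimal NumPy illustration, assuming tanh as the filter activation and left zero padding; the names `causal_conv` and `gated_tcn` and the kernel values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def causal_conv(x, w, d=1):
    """Zero-padded dilated causal convolution: y_t = sum_k w[k] * x[t - d*k]."""
    shift = (len(w) - 1) * d
    xp = np.concatenate([np.zeros(shift), np.asarray(x, dtype=float)])
    return sum(w[k] * xp[shift - k * d: shift - k * d + len(x)] for k in range(len(w)))

def gated_tcn(x, w_filter, w_gate, b_filter=0.0, b_gate=0.0, d=1):
    """Filter branch (tanh, an assumption) multiplied by a sigmoid gate branch."""
    filt = np.tanh(causal_conv(x, w_filter, d) + b_filter)
    gate = sigmoid(causal_conv(x, w_gate, d) + b_gate)   # values in (0, 1)
    return filt * gate

x = np.array([0.5, -1.0, 2.0, 0.0, 1.5])
y = gated_tcn(x, w_filter=[0.8, -0.2], w_gate=[1.0, 0.5], d=2)
```

The gate output scales each position of the filter branch, so positions where the gate is close to zero pass little information to the next layer.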

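Similarly, [74] used the gated TCN and set the dilation rate $d = 1$ without zero padding to shorten the output length as $\mathcal{Y} = (\Theta_1 *_{\mathcal{T}^1} \mathcal{X} + b_1) \odot \sigma(\Theta_2 *_{\mathcal{T}^1} \mathcal{X} + b_2)$. They argued that this can discover variances in time series traffic data.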

IV-D Seq2Seq

IV-D1 Seq2Seq

Fig. 9: Sequence to Sequence Structure without attention mechanism

The Sequence to Sequence (Seq2Seq) model proposed in 2014 [120] has been widely used in sequence prediction such as machine translation [121]. The Seq2Seq architecture consists of two components, i.e., an encoder in charge of converting the input sequence $X$ into a fixed latent vector $C$, and a decoder responsible for converting $C$ into an output sequence $Y$. Note that $X$ and $Y$ can have different lengths (as shown in Figure 9).

$$X = [X_1, \dots, X_i, \dots, X_P] \rightarrow C \rightarrow Y = [Y_1, \dots, Y_j, \dots, Y_Q]$$

where $P$ is the input length and $Q$ is the output length.

The specific calculation of $Y_j$ is denoted as follows:

$$H_i = Encoder(X_i, H_{i-1}), \quad C = H_P, \quad S_0 = H_P$$
$$S_j = Decoder(C, Y_{j-1}, S_{j-1}), \quad Y_j = S_j W$$

Here, $H_i$ is the hidden state related to input $X_i$. $H_0$ is initialized using small non-zero elements. $S_j$ is the hidden state related to output $Y_j$. $Y_0$ is the representation of the beginning sign. Note that the encoder and decoder can be any model as long as it can accept a sequence (vector or matrix) and produce a sequence, such as RNN, LSTM, GRU or other novel models.

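A major limitation of Seq2Seq is that the latent vector $C$ is fixed for each $Y_j$, while $Y_j$ might have a stronger correlation with $X_i$ than with other elements. To address this issue, the attention mechanism is integrated into Seq2Seq, allowing the decoder to focus on task-relevant parts of the input sequence and thus make better decisions.

$$C_j = \sum_{i=1}^{P} \theta_{ji} H_i, \quad \theta_{ji} = \frac{\exp(f_{ji})}{\sum_{k=1}^{P} \exp(f_{jk})}$$

where $\theta_{ji}$ is the normalized attention score, and $f_{ji} = f(H_i, S_{j-1})$ [121] is a function to measure the correlation between the $i$-th input and the $j$-th output. For instance, [122] proposed three kinds of attention score calculation:

$$f_{ji} = \begin{cases} H_i^T S_{j-1} & \text{dot} \\ H_i^T W_a S_{j-1} & \text{general} \\ v_a^T \tanh\left(W_a [H_i, S_{j-1}]\right) & \text{concat} \end{cases}$$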
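The attention-weighted latent vector can be sketched with a dot-product score. The following minimal NumPy illustration assumes that encoder states and the previous decoder state share the same dimension, which the dot-product score requires; the function name `attention_context` and the example states are hypothetical.

```python
import numpy as np

def attention_context(H_enc, s_prev):
    """Return the context vector and attention weights for one decoding step."""
    scores = H_enc @ s_prev                 # dot-product score per encoder state
    scores = scores - scores.max()          # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over input steps
    context = weights @ H_enc               # weighted sum of encoder states
    return context, weights

# Hypothetical encoder states for P = 3 input steps and one decoder state.
H_enc = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
s_prev = np.array([2.0, 0.0])
context, weights = attention_context(H_enc, s_prev)
```

The first and third encoder states align with `s_prev` and therefore receive equal, larger weights than the second, so the context vector changes from one decoding step to the next instead of staying fixed.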


Another way to enhance Seq2Seq performance is the scheduled sampling technique [123]. The inputs of the decoder during the training and testing phases are different. During training the decoder is fed with the true labels of the training dataset, while during testing it is fed with predictions generated by itself, which accumulates error at testing time and degrades performance. To mitigate this issue, scheduled sampling is integrated into the model. At the $j$-th iteration of the training process, the decoder is fed with the true label with probability $\epsilon_j$ and with the prediction from the previous step with probability $1 - \epsilon_j$. The probability $\epsilon_j$ gradually decreases to 0, allowing the decoder to learn the testing distribution [89] and keeping training and testing as similar as possible.
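The sampling rule itself is a single random draw per decoding step. Below is a minimal Python sketch, assuming an inverse-sigmoid decay schedule for the probability of feeding the true label; the function names, the decay constant `k`, and the placeholder string inputs are hypothetical.

```python
import math
import random

def teacher_forcing_prob(i, k=2000.0):
    """Inverse-sigmoid decay: close to 1 at iteration 0, approaching 0 later."""
    # Cap the exponent so math.exp cannot overflow for very large i.
    return k / (k + math.exp(min(i / k, 700.0)))

def next_decoder_input(y_true_prev, y_pred_prev, i, rng, k=2000.0):
    """Feed the true label with probability eps_i, else the model's own prediction."""
    if rng.random() < teacher_forcing_prob(i, k):
        return y_true_prev
    return y_pred_prev

rng = random.Random(0)
early = [next_decoder_input("truth", "pred", 0, rng, k=100.0) for _ in range(1000)]
late = [next_decoder_input("truth", "pred", 2000, rng, k=100.0) for _ in range(1000)]
```

Early in training almost every draw returns the true label, while late in training the decoder mostly consumes its own predictions, which is the condition it faces at test time.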

IV-D2 Seq2Seq in Traffic Domain

Since Seq2Seq can take an input sequence and generate an output sequence of a different length, it is applied to multi-step prediction in many traffic works. The encoder encodes the historical traffic data into a latent space vector. Then, the latent vector is fed into a decoder to generate the future traffic conditions.

Attention mechanism is usually incorporated into Seq2Seq to model the different influence on future prediction from previous traffic observations at different time slots [81],[79], [90],[67].

The encoder and decoder in many traffic works are in charge of capturing spatial temporal dependencies. For instance, [89] proposed DCGRU as both the encoder and the decoder, which can capture spatial and temporal dynamics jointly. The design of the encoder and decoder is usually the core contribution and novel part of the related papers. However, the encoder and decoder are not necessarily the same, and we summarize the Seq2Seq structures in previous graph-based traffic works in Table II.

References Encoder Decoder
[89] GRU+DGCN Same as encoder
[79] STAtt Block Same as encoder
[37] MLPs An MLP
[71] SGCN+Pooling+GRU GCN+Upooling+GRU
[84] GRU with graph self-attention Same as encoder
[18] GRU+SGCN Same as encoder
[90] SGCN+ bidirectional GRU Same as encoder
[67] Long-term encoder (Gated SGCN) Short-term encoder
[77] SGCN+GRU Same as encoder
[68] CGRM (GRU, SGCN) Same as encoder
[42] LSTM Same as encoder
TABLE II: The encoders and decoders of sequence to sequence architecture

Note that the RNN-based decoder has a severe error accumulation problem during testing inference because each previously predicted step is the input for producing the next step prediction. [89], [84] adopted scheduled sampling to alleviate this problem. [67] replaced the RNN-based decoder with a short-term and long-term decoder that takes in only the last step prediction, thus easing error accumulation. The utilization of the Seq2Seq technique in the traffic domain is very flexible; for instance, [81] integrated Seq2Seq into a bigger framework, using it as the generator and discriminator of a GAN.

IV-E GAN

IV-E1 GAN

Fig. 10: Generative Adversarial Network: Generator $G$ is in charge of producing a generated sample $G(z)$ from a random vector $z$, which is sampled from a prior distribution $p_z$. Discriminator $D$ is in charge of discriminating between the fake sample generated from $G$ and the real sample from the training data.

Generative Adversarial Network (GAN) [124] is a powerful deep generative model aiming to generate artificial samples as indistinguishable as possible from their real counterparts. GAN, inspired by game theory, is composed of two players, a generative neural network called Generator $G$ and an adversarial network called Discriminator $D$ (as shown in Figure 10).

Discriminator $D$ tries to determine whether the input samples belong to the generated data or the real data, while Generator $G$ tries to fool Discriminator $D$ by producing samples that are as realistic as possible. The two mutually adversarial processes are trained and optimized alternately, which strengthens the performance of both $D$ and $G$. When the fake sample produced by $G$ is very close to the ground truth and $D$ is unable to distinguish them any more, Generator $G$ is considered to have learned the true distribution of the real data and the model converges. At this point, the game can be considered to have reached a Nash equilibrium.

Mathematically, such a process can be formulated as minimizing their losses $Loss_G$ and $Loss_D$. With the loss function being cross entropy, denoted as $f$, we have: