I Introduction
Accurate short-term behavior prediction of traffic participants is important for applications such as automated driving or infrastructure-assisted human driving [hinz_designing_2017]. A major open research question is how to model interaction between traffic participants. In the past, interactions have been modelled either by creating a representation of one or several traffic participants [treiber_congested_2000, lenz_deep_2017] or by using a fixed environment representation such as a simulated lidar beam [kuefler_imitating_2017].
However, these methods come with certain disadvantages: A fixed environment representation poses a much harder learning problem, since we cannot use data we might have extracted previously. Traffic participant representations, on the other hand, scale computationally with the number of possible interactions, require a human to decide on a useful representation, and underspecify the problem one should learn.
Representing this problem as a graph makes sense intuitively: Each vehicle is a node, and possible interactions between vehicles are modelled as edges (see Fig. 1 for a visualization).
At the same time, it has been shown [Morton2016, kuefler_imitating_2017, lenz_deep_2017] that machine learning models and particularly (deep) neural networks perform well on this problem. Yet most available deep learning models operate on data of a fixed size and with a fixed spatial organization, such as single data points, time series, or images.
Only fairly recently [gori_new_2005, scarselli_graph_2009] have GNN, i.e. neural networks operating on graph data, seen research interest and enjoyed successes. Later models [kipf_semisupervised_2016, velickovic_graph_2017] only operate on a node's local neighbourhood. This greatly improves scalability while also improving performance.
Marrying the representation of a traffic situation as a graph with the modelling capabilities of GNN models promises a clear method to take interactions between traffic participants into account, good predictive performance, and efficient computation.
To evaluate this, we conduct traffic participant prediction on two real-world datasets, evaluating the models' predictive performance and comparing them to three baseline models. We show that the prediction error decreases by 30% compared to our baseline when interaction is plentiful, and is no worse when little interaction occurs. At the same time, computational complexity remains reasonable and scales linearly in the number of interactions.
This suggests a graph interpretation of interacting traffic participants is a worthwhile addition to traffic prediction systems.
Our main contributions are:

We show that representing interactions as graphs leads to better performance.

We introduce several adaptations to two state-of-the-art GNN models.

We study both the results of different graph construction techniques and our introduced adaptations on two different datasets.
II Related Work
Since traffic participant prediction is a key feature of autonomous driving and traffic simulations, it has been a focus of extensive research for decades. This has led to a multitude of different algorithms suited to varying prediction timespans and computational resources.
II-A Traffic Prediction
Following the survey by [Lefevre2014], we roughly categorize traffic participant prediction into three subgroups of ascending complexity: physics-based, maneuver-based, and interaction-aware.
Physics-based models usually assume little driver action and instead assume constant velocity or acceleration. The vehicle motion is then predicted from a physical model only. These models can be used for tracking [Schubert2008] but often fail for predictions longer than a second or when vehicle interaction plays an important role.
Maneuver-based models use a set of maneuver prototypes and match them either against the past trajectory directly using cluster-based approaches [Vasquez2004] or against vehicle features using machine learning methods [GarciaOrtiz2011, Morris2011, Kumar2013]. While these models are able to represent more complex maneuvers, they still cannot take interaction into account.
Interaction-aware models aim to include interactions between vehicles in their predictions. These include an expansion of maneuver-based models which accounts for collision probabilities [Lawitzky2013], coupled HMMs, which model pairwise entity dependencies [Brand1997], and machine learning-based models.
Machine learning-based models also vary in complexity and goal. [lenz_deep_2017] use simple feedforward neural networks to create a fast model for use in a Monte Carlo Tree Search algorithm. [Morton2016] evaluate RNN for the same task, also trained in a supervised fashion. Conversely, [kuefler_imitating_2017] use Generative Adversarial Imitation Learning to imitate human driving behavior using reinforcement learning. For all of these, performance crucially depends on the representation of the environment.
II-B Environmental Representation
Environmental representations can be differentiated by their level of abstraction: One can represent the environment as data close to sensor input, such as LIDAR beams [kuefler_imitating_2017], camera images, or a simple grid map. Alternatively, one can represent the environment as a list of discrete objects.
II-B1 Sensor-like representation
A sensor-like representation does not require expert knowledge to define the features to use and remains of constant size independent of factors such as traffic density. At the same time, it can capture information on many vehicles. However, the representation is inefficient (requiring many LIDAR beams or pixels per vehicle), and we are forced to learn not just driving behavior but also the extraction of vehicles from sensor data.
II-B2 Discrete object representation
Alternatively, we can represent each vehicle as an object with certain attributes. Predictions are then created per car. Interaction then becomes a matter of choosing the correct environment representation, which may be as simple as the distance and approach speed to the preceding vehicle, as in the IDM [treiber_congested_2000], or might contain a multitude of preprocessed features [lenz_deep_2017]. However, these models by design have to be simplistic in their assumptions about interaction between traffic participants and are therefore limited.
Several of these shortcomings can be avoided by thinking about traffic participants and their interactions as nodes and edges in a graph. A behavior prediction model then operates on that graph, producing predictions for each node.
III Traffic Participant Prediction from a Graph
While there are several different GNN architectures, experimental results suggest the relatively simple GCN model still performs best over a wide variety of tasks [shchur_pitfalls_2018]. We also evaluate the GAT model, since it allows us to easily include edge features in the model.
III-A Graph Convolutional Networks
GCN [kipf_semisupervised_2016] are an approach for node-based classification or prediction on a graph. Analogous to convolutions on images or time series, a GCN applies the same operation to all nodes. Like other neural networks, it is defined by a series of differently-parameterized layers which are applied successively.
III-A1 The Base Model
Each layer of the GCN uses

\[ H^{(l+1)} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \tag{1} \]

as a transformation. Here, \(H^{(l)}\) is the \(l\)-th layer's activations, \(\hat{A} = A + I_N\) is the adjacency matrix with added self-connections between nodes, \(\hat{D}\) is the degree matrix of \(\hat{A}\), and \(W^{(l)}\) is the \(l\)-th layer's learnable weight matrix. This is equivalent to a first-order approximation of a localized spectral filter, but has two crucial advantages: the graph Laplacian does not need to be inverted (which would incur a computational cost of \(\mathcal{O}(N^3)\)), and the transformation specified by \(l\) layers takes exactly the \(l\)-hop neighborhood of a node into account. Accordingly, computational complexity scales linearly in the number of edges.
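As a concrete illustration, the propagation rule of Eq. 1 can be sketched in a few lines of NumPy. This is a minimal dense-matrix version for exposition only; real implementations such as pytorch-geometric use sparse operations and learn \(W\) by backpropagation.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step (Eq. 1): sigma(D^-1/2 (A + I) D^-1/2 H W).

    A: (N, N) binary adjacency matrix, H: (N, F) node features,
    W: (F, F') weight matrix. ReLU serves as the nonlinearity sigma.
    """
    A_hat = A + np.eye(A.shape[0])            # add self-connections
    d = A_hat.sum(axis=1)                     # node degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D^-1/2
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
```

Note that each output row mixes a node's own features with those of its neighbours, weighted by the symmetric degree normalization.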
III-A2 Adaptations for the GCN
We originally applied the GCN exactly as described by [kipf_semisupervised_2016]. However, we found several changes to be crucial:

Self-Weights: GCN compute the next layer's features for a node from a spectral decomposition of that node's neighborhood (and, with added self-connections, the ego node itself). However, this means a GCN cannot treat the ego node's own features differently from those of any of its neighbors. In the prediction task, this appears to be a significant obstacle to good performance. Accordingly, we remove the self-connections but introduce a second weight matrix \(W_{\mathrm{self}}^{(l)}\) defining a transformation on the ego node's features. Our transformation equation is therefore

\[ H^{(l+1)} = \sigma\left(D^{-\frac{1}{2}} A D^{-\frac{1}{2}} H^{(l)} W^{(l)} + H^{(l)} W_{\mathrm{self}}^{(l)}\right) \tag{2} \]
Weight by Distance: [kipf_semisupervised_2016] note that the adjacency matrix can be binary or weighted. We evaluate weighting edges by the inverse distance between the vehicles, with the weight of self-loops set to a fixed value.

Feedforward output:
In addition, we no longer use a full GCN but replace the output layer with a feedforward layer operating on each node's features independently. This allows a better decoupling of the feature extraction (occurring in the first few GCN layers) from the final prediction based on the extracted features.
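The self-weight variant of Eq. 2 can be sketched in the same dense NumPy form. The names `W_nbr` and `W_self` are ours; in the actual model both matrices are learned.

```python
import numpy as np

def gcn_selfweight_layer(A, H, W_nbr, W_self):
    """GCN step with a separate ego-node weight matrix (Eq. 2).

    Self-connections are removed from A; the ego node's own features are
    instead transformed by a dedicated weight matrix W_self.
    """
    d = np.maximum(A.sum(axis=1), 1e-12)      # guard against isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    neighbour_term = D_inv_sqrt @ A @ D_inv_sqrt @ H @ W_nbr
    return np.maximum(0.0, neighbour_term + H @ W_self)
```

The key design choice is that the ego node's contribution no longer passes through the same transformation as its neighbours' features.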
III-B Graph Attention Networks
GAT [velickovic_graph_2017] layers compute each node's next representation by an attention mechanism over all of its neighbors.
III-B1 The Base Model
Specifically, they compute attention coefficients

\[ \alpha_{ij} = \mathrm{softmax}_j\left(a\left(W^{(l)} h_i^{(l)}, W^{(l)} h_j^{(l)}\right)\right) \tag{3} \]

for each connected node pair, with \(h_i^{(l)}\) being the \(i\)-th node's features in the \(l\)-th layer and \(W^{(l)}\) being the learnable weight matrix for the \(l\)-th layer. \(a\) is the learnable attention computation, implemented by a neural network. The node feature vector is then computed as

\[ h_i^{(l+1)} = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W^{(l)} h_j^{(l)}\right) \tag{4} \]

where \(\sigma\) is a nonlinearity, usually ReLU.
In practice, [velickovic_graph_2017] note that learning is stabilized by using multi-head attention, i.e. using several differently-parameterized attention mechanisms and concatenating (or, in the last layer, averaging) the results. This allows features to be created from different subsets of nodes depending on the needs of these features.
As with the GCN, a GAT layer operates on the local neighbourhood only and therefore also scales linearly in the number of edges.
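A single attention head of Eqs. 3 and 4 can be sketched as follows (dense NumPy, with the attention computation `a` passed in as a callable). A full GAT would run several such heads in parallel and concatenate their outputs; this sketch omits that for clarity.

```python
import numpy as np

def gat_layer(H, W, a, neighbors):
    """Single-head GAT step (Eqs. 3 and 4).

    H: (N, F) features, W: (F, F') weight matrix, a: callable scoring a
    pair of transformed feature vectors, neighbors: per-node index lists
    (including the node itself, matching the self-connection convention).
    """
    Z = H @ W
    out = []
    for i, nbrs in enumerate(neighbors):
        e = np.array([a(Z[i], Z[j]) for j in nbrs])   # raw attention scores
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                          # softmax over neighbours
        out.append(np.maximum(0.0, alpha @ Z[nbrs]))  # Eq. 4 with ReLU
    return np.stack(out)
```

With a constant scoring function, attention degenerates to uniform averaging over the neighbourhood, which is a useful sanity check.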
III-B2 Adaptations for the GAT
As before, we also apply some adaptations to the base GAT model.

Edge attributes: In the GAT as introduced by [velickovic_graph_2017], attention depends only on the features of the two nodes. However, we do have additional data, like the relative positions, available to us in this scenario. Accordingly, we augment the attention computation from Eq. 3 by including edge features \(e_{ij}\), such that

\[ \alpha_{ij} = \mathrm{softmax}_j\left(a\left(W^{(l)} h_i^{(l)}, W^{(l)} h_j^{(l)}, e_{ij}\right)\right) \tag{5} \]

We do not learn successive edge features but instead use the relative positions as \(e_{ij}\) for each layer.

Self-weights: While the GAT should be able to learn by itself to concentrate one attention head on the ego node, we also evaluate explicitly adding a transformation of the ego node's features.

Feedforward output: As with the GCN, our final output is produced by a feedforward layer.
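The edge-augmented attention score of Eq. 5 can be sketched as below, assuming the single-layer attention network with LeakyReLU used in the original GAT formulation; the weight vector `att_w` and the negative slope of 0.2 are illustrative assumptions.

```python
import numpy as np

def edge_attention_score(z_i, z_j, e_ij, att_w):
    """Unnormalized attention score a(W h_i, W h_j, e_ij) from Eq. 5.

    z_i, z_j: transformed node features; e_ij: edge features (here the
    relative position); att_w: weight vector of the attention network.
    """
    x = np.concatenate([z_i, z_j, e_ij])  # append edge features to the input
    s = float(att_w @ x)
    return s if s > 0.0 else 0.2 * s      # LeakyReLU
```

The scores for all neighbours of a node would then be normalized with the softmax of Eq. 5.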
III-C Graph and Feature Construction
Formulating the prediction problem as a graph still leaves open the task of how we construct said graph and the node features. While there is an obvious strategy to construct node features, namely to use the corresponding car features like position or velocity, no single obvious strategy exists for constructing the connections between the nodes. However, four basic strategies suggest themselves:

Self connections: This only adds self-loops to the graph. It ignores all interaction and should perform identically to a simple model operating on the vehicle data only.

All connections: Connecting all vehicles ensures that no interactions are ignored. However, this ignores previous knowledge on spatial position and interaction and greatly increases the problem size.

Preceding connection: Arguably the most important interaction is with the vehicle immediately in front of us. We can therefore construct interactions only between the current vehicle and its predecessor.

Close vehicles: Alternatively, we can argue that the main interactions are with the vehicles in an ego vehicle’s direct environment, which are at most eight vehicles located to the front, rear, and sides of the ego vehicle.
While we would prefer to learn these connection strategies, this is a very difficult open problem whose cost scales quadratically with the number of considered vehicles. We therefore only evaluate the fixed strategies.
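The four strategies can be sketched as follows, using only longitudinal positions and integer lane indices for simplicity; the function and argument names are ours.

```python
def build_edges(positions, lanes, strategy):
    """Return an edge list of (i, j) pairs for each construction strategy.

    positions: longitudinal position per vehicle; lanes: lane index per
    vehicle. All strategies include self-loops.
    """
    n = len(positions)
    edges = [(i, i) for i in range(n)]
    if strategy == "all":
        edges += [(i, j) for i in range(n) for j in range(n) if i != j]
    elif strategy == "preceding":
        for i in range(n):
            ahead = [j for j in range(n)
                     if lanes[j] == lanes[i] and positions[j] > positions[i]]
            if ahead:  # connect only to the nearest vehicle in front
                edges.append((i, min(ahead, key=lambda j: positions[j])))
    elif strategy == "neighbour":
        for i in range(n):
            for dl in (-1, 0, 1):  # ego lane and both adjacent lanes
                in_lane = [j for j in range(n)
                           if j != i and lanes[j] == lanes[i] + dl]
                ahead = [j for j in in_lane if positions[j] > positions[i]]
                behind = [j for j in in_lane if positions[j] <= positions[i]]
                if ahead:   # nearest vehicle in front in this lane
                    edges.append((i, min(ahead, key=lambda j: positions[j])))
                if behind:  # nearest vehicle behind in this lane
                    edges.append((i, max(behind, key=lambda j: positions[j])))
    return edges
```

The "neighbour" branch yields at most eight neighbours per vehicle (front and rear in the ego lane and the two adjacent lanes), matching the Close vehicles strategy above.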
IV Experiments
In order to evaluate the newly proposed models, we conduct a prediction experiment on realworld traffic data. We purposely keep baselines and models simple to demonstrate whether the graph interpretation is beneficial without introducing a multitude of confounding factors. We therefore do not include RNN architectures, simulation steps, or imitation learning.
From this, we aim to answer three main questions: (A) Which of our adaptations to GNN are necessary? (B) How do we construct an interaction graph? (C) Does a graph model increase prediction quality?
IV-A Datasets
We conduct our experiments on two different datasets: the HighD dataset [highDdataset] and the NGSIM I80 dataset [NGSIM].
IV-A1 NGSIM
The NGSIM project's I80 dataset contains trajectory data for vehicles in a highway merge scenario over three 15-minute timespans, tracked using a fixed camera system. As [Thiemann2008] show, the position, velocity, and acceleration data contain unrealistic values. We therefore smooth the positions using double-sided exponential smoothing with a span of 0.5 and compute velocities from these.
We use two of the recordings as training set and split the last one equally into validation and test set. We subsample the trajectory data to 1 FPS and extract trajectories of 10 s total length. The goal of the model is to predict the second half of each trajectory given the first five seconds.
IV-A2 HighD
Since the NGSIM dataset still contains many artifacts (errors in bounding boxes, undetected cars, complete non-overlap of bounding box and true vehicle), we additionally conduct experiments on the newer HighD dataset [highDdataset], a series of drone recordings with extracted vehicle features, covering road sections of about 400 meters at several locations on the German Autobahn. A total of 16.5 h of data is available, containing 110 000 vehicles with a total driving distance of 45 000 km. However, since the dataset consists mainly of roads without on- or off-ramps and without traffic jams, interaction seems limited: Only about 5% of the cars perform a lane change.
To avoid information leakage, we split the dataset by recording. The last 10 % of the recordings are used as test set, the 10 % before that as validation set. Trajectory construction is then identical to the NGSIM dataset.
IV-B Baselines
We compare our approach to two different model-based static approaches and one learned approach.
IV-B1 CVM
This model assumes each car continues moving at the same velocity (both laterally and longitudinally) as in the last frame in which it was observed.
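A minimal sketch of this baseline (the function and parameter names are ours; `dt` and `steps` correspond to whatever prediction horizon is used):

```python
def cvm_predict(x, y, vx, vy, dt, steps):
    """Constant velocity rollout from the last observed state."""
    return [(x + vx * dt * k, y + vy * dt * k) for k in range(1, steps + 1)]
```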
IV-B2 IDM
The IDM [treiber_congested_2000] is a commonly used driver model for microscopic traffic simulation, since it is interpretable and collision-free. We use it to predict the changes in longitudinal velocity and keep the in-lane position constant.
The IDM's acceleration is computed from both a free road and an interaction term. The free road acceleration is computed as

\[ \dot{v}_{\mathrm{free}} = a \left[1 - \left(\frac{v}{v_0}\right)^{\delta}\right] \]

with the maximum acceleration \(a\), the acceleration exponent \(\delta\), and the desired velocity \(v_0\) being tunable parameters, and \(v\) the current velocity. The interaction term is defined as

\[ \dot{v}_{\mathrm{int}} = -a \left(\frac{s^*(v, \Delta v)}{s}\right)^2 \quad \text{with} \quad s^*(v, \Delta v) = s_0 + v T + \frac{v \, \Delta v}{2 \sqrt{a b}} \]

where the minimum distance to the front vehicle \(s_0\), the time gap \(T\), and the maximum deceleration \(b\) are tunable parameters, \(s\) is the gap to the front vehicle, \(v\) is the vehicle's speed, and \(\Delta v\) the closing speed to its predecessor. The total acceleration is the sum of the free road and the interaction acceleration.
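The two terms combine into a single acceleration function; a direct transcription of the standard IDM equations (parameter names follow the text, and the numeric values in the test are illustrative, not the tuned parameters):

```python
import math

def idm_acceleration(v, gap, dv, v0, a, b, delta, s0, T):
    """IDM acceleration: free-road term plus interaction term.

    v: own speed, gap: distance s to the front vehicle, dv: closing
    speed, v0: desired velocity, a: maximum acceleration, b: maximum
    deceleration, delta: acceleration exponent, s0: minimum distance,
    T: time gap.
    """
    free = a * (1.0 - (v / v0) ** delta)                     # free-road term
    s_star = s0 + v * T + v * dv / (2.0 * math.sqrt(a * b))  # desired gap
    interaction = -a * (s_star / gap) ** 2                   # interaction term
    return free + interaction
```

At the desired velocity with a very large gap the acceleration approaches zero, while a rapidly closing gap yields strong braking.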
We take the IDM parameters for the NGSIM dataset from [Morton2016]. For the HighD dataset, we tune the IDM's parameters using guided random search with a total of 20 000 samples. The values for both datasets are listed in Table I.
TABLE I: IDM parameters

Parameter | HighD | NGSIM [Morton2016]
Desired velocity \(v_0\) | |
Maximum acceleration \(a\) | |
Time gap \(T\) | |
Comfortable deceleration \(b\) | |
Minimum distance \(s_0\) | |
IV-B3 Independent Feed-Forward Model
In addition to the models taking interaction into account, we also add a simple feedforward neural network predicting the trajectory from only the ego vehicle’s past data. We use this baseline model to measure the improvement we gain from including interaction into our models.
IV-C Model Configuration
Each model uses a similar configuration: two layers producing a 256-dimensional feature representation, followed by a feedforward layer producing the final output. All models use the ReLU nonlinearity. The GAT employs four attention heads (with a 64-dimensional feature representation each).
Since the GNN models use two layers, their effective receptive field is the two-hop neighbourhood of the ego vehicle.
All models receive inputs and produce outputs in fixed-length timesteps without recurrence. They are trained to predict displacements relative to the last position and receive position and velocity for each past timestep. They are trained to minimize the mean squared error over all outputs. All models are implemented in PyTorch [paszke2017automatic], using and expanding upon the pytorch-geometric library [Fey/etal/2018].
IV-D Performance Measure
We report the performance of the models by measuring the error in position between ground truth and prediction. We report both the mean displacement over five seconds, weighting each timestep identically, and the final displacement after five seconds.
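Both measures can be computed directly from predicted and ground-truth positions; a minimal sketch:

```python
import numpy as np

def displacement_errors(pred, gt):
    """Return (mean displacement, final displacement) in position space.

    pred, gt: (T, 2) arrays of predicted and ground-truth positions over
    the prediction horizon.
    """
    d = np.linalg.norm(pred - gt, axis=1)  # Euclidean error per timestep
    return d.mean(), d[-1]
```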
IV-E Experimental Procedure
Our choice of experiments is guided by the three main questions (Sections V-A, V-B and V-C). To ensure meaningful results, we repeat each evaluation a total of ten times using different, randomly chosen seeds. In tables, we report all results as mean ± standard deviation. Figures are violin plots, showing both individual results and the overall result distribution.
We optimize both network adaptations and graph construction strategies on the NGSIM I80 dataset, since it is both smaller and contains more interactions. We then use these insights to pick the best-performing models and evaluate them on both the NGSIM I80 and the HighD dataset.
V Discussion
TABLE II: Ablation and graph construction results

Mean Displ. | Displ. @5s
GCN Adaptations
Default
no ff output
with weighted edges
no self-weight & weighted edges
no self-weight
GAT Adaptations
Default
no ff output
no self-weight
no edge features
Connection Strategy (GAT)
Self-Connections
Preceding Connection
Neighbour Connection
All Connections (*)
(*) Uses 3 instead of 10 evaluations.
We structure our evaluation according to three research questions which answer (A) whether our proposed architectural adaptions are worthwhile, (B) which of the graph construction strategies should be preferred, and (C) whether the inclusion of interaction graph information improves performance.
V-A Which of our adaptations to GNN are necessary?
In Sections III-A2 and III-B2, we proposed several changes to the GCN and GAT architectures. To determine which of these changes are beneficial, we conducted an ablation study whose results are listed in Table II. We evaluated all variants using the Neighbour Connection graph construction strategy.
For both models, the added self-weights improve the final result significantly. We believe these additional weights help greatly because there is a clear difference between a neighbouring node and the ego node in this task. We also found that using a feedforward layer as the last layer not only produces a small increase in performance but also stabilizes training.
Introducing relative positions as edge features into the GAT is a clear success, reducing the final displacement by about a meter. In contrast, edge weights for the GCN slightly decrease performance, especially when omitting self-weights. We believe that the main contribution of edge weights in our scenario is to discern between the ego and surrounding vehicles, which is already more effectively modelled through self-weights.
We therefore evaluate the graph construction using the GAT model.
TABLE III: Prediction performance on the NGSIM I80 dataset

Mean Displ. | Displ. @5s
GAT
GAT NEF
GCN
FF
CVM
IDM
TABLE IV: Prediction performance on the HighD dataset

Mean Displ. | Displ. @5s
GAT
GAT NEF
FF
GCN
IDM
CVM
V-B How do we construct an interaction graph?
In Section III-C, we proposed four construction strategies for the interaction graph. We evaluate the quality of predictions with each of these strategies using the GAT models, since these seemed to perform best. We note that in practical scenarios, a trade-off might be necessary between prediction quality and computational complexity. Table II shows the results.
As expected, the Self-Connections strategy performs identically to the FF baseline model, and the Neighbour Connection graph construction method performs best. Somewhat surprisingly, the Preceding Connection strategy performs no better than the baseline.
We especially note that the All Connections strategy imposes significant computational disadvantages, with quadratic instead of linear runtime and, in our experiments, a slowdown of about 50x.
We therefore use the Neighbour Connection graph construction strategy for our evaluation.
V-C Does a graph model increase prediction quality?
The motivation of our work is to evaluate whether it is beneficial to model interaction between traffic participants and whether this can be done via a graph construction. To answer this question, we compare models with interaction to a model without (FF). We also include a comparison with two classical models (CVM and IDM).
We chose the GAT model as the best-performing GNN. We also include a GAT model without edge features (called GAT NEF in our figures and tables) for a fair comparison with the GCN model.
V-C1 NGSIM
As can be clearly seen, every GNN model performs better than the baseline. At the same time, there are clear performance differences between them: both GCN and GAT NEF perform worse, which we assume is because these models cannot take relative positions directly into account and instead only act on the existence or non-existence of edges.
At the same time, the introduction of fixed edge features to the GAT model clearly shows its performance advantage, reducing the prediction error by 30% compared to the FF baseline.
We note that the comparatively bad performance of the IDM on shorter timescales is consistent with previous work [lenz_deep_2017], and it is likely to achieve better performance in a closed- or open-loop simulation.
V-C2 HighD
On the HighD dataset, our results differ: As Fig. 3 and Table IV show, there is no significant performance difference between any of the learned models, and no significant performance difference between the IDM and CVM models. We believe this to be a consequence of the little interaction between the cars, which makes all learned models degenerate to the non-interaction case and makes the interaction term of the IDM model irrelevant. This shows that, even with no interaction, including interaction representations in our models does not cause performance degradation.
In summary, we show that (A) several of our changes result in better performance, (B) as does a good interaction graph construction strategy, and (C) overall, our model retains performance on a dataset with little interaction and greatly improves it on a dataset with plentiful interaction.
VI Conclusion
We have proposed modelling a traffic scene as a graph of interacting vehicles. Through this interpretation, we gain a flexible and abstract model for interactions. To predict future traffic participant actions, we use GNN, i.e. neural networks operating on graph data. These naturally take the graph structure and therefore interaction into account. We evaluated two computationally efficient GNN architectures and proposed several adaptations for our scenario.
In a traffic dataset with plentiful interaction, including interactions decreased prediction error by over 30% compared to the best baseline model. At the same time, we saw no increase in prediction error on a dataset with little interaction.
While we have improved prediction quality, much work remains to be done: This work is only a proof of concept that modelling interactions as a graph is worthwhile, and should thus be seen as just one technique for one aspect of traffic prediction. Integrating this model into existing state-of-the-art methodology, particularly RNN, remains an open task. At the same time, we would like to explore other graph construction strategies, particularly automatically finding relevant interactions.