Accurate short-term behavior prediction of traffic participants is important for applications such as automated driving or infrastructure-assisted human driving hinz_designing_2017. A major open research question is how to model interaction between traffic participants. In the past, interactions have been modelled either by creating a representation of one or several traffic participants treiber_congested_2000, lenz_deep_2017 or by using a fixed environment representation such as simulated lidar beams kuefler_imitating_2017.
However, these methods have disadvantages: A fixed environment representation poses a much harder problem to learn, since we cannot use data we might have extracted previously. Traffic participant representations, on the other hand, scale computationally with the number of possible interactions, require a human to decide on a useful representation, and underspecify the problem one should learn.
Representing this problem as a graph makes sense intuitively: Each vehicle is a node, and possible interactions between vehicles are modelled as edges (see Fig. 1 for a visualization).
At the same time, it has been shown Morton2016, kuefler_imitating_2017, lenz_deep_2017 that machine learning models and particularly (deep) neural networks perform well on this problem. Yet most available deep learning models operate on data of a fixed size and with a fixed spatial organization such as single data points, time series, or images.
Only fairly recently have graph neural networks (GNN), i.e. neural networks operating on graph data, seen research interest and enjoyed successes gori_new_2005, scarselli_graph_2009. Later models kipf_semi-supervised_2016, velickovic_graph_2017 operate only on a node’s local neighbourhood, which greatly improves scalability while also improving performance.
Marrying the representation of a traffic situation as a graph with the modelling capabilities of GNN models promises a clear method to take interactions between traffic participants into account, good predictive performance, and efficient computation.
To evaluate this, we conduct traffic participant prediction on two real-world datasets, evaluating the predictive performance of the GNN models and comparing them to three baseline models. We show that prediction error decreases by 30% compared to our baseline when interaction is plentiful and is no worse when little interaction occurs. At the same time, computational complexity remains reasonable and scales linearly in the number of interactions.
This suggests a graph interpretation of interacting traffic participants is a worthwhile addition to traffic prediction systems.
Our main contributions are:
We show that representing interactions as graphs leads to better performance.
We introduce several adaptations to two state-of-the-art GNN models.
We study both the results of different graph construction techniques and our introduced adaptations on two different datasets.
II Related Work
Since traffic participant prediction is a key feature of autonomous driving and traffic simulations, it has been a focus of extensive research for decades. This has led to a multitude of different algorithms useful for varying prediction timespans and computational resources.
II-A Traffic Prediction
Following the survey by Lefevre2014, we roughly categorize traffic participant prediction into three subgroups of ascending complexity: Physics-based, maneuver-based, and interaction-aware.
Physics-based models usually assume little vehicle action and instead use constant velocity or acceleration. The vehicle motion is then predicted from a physical model only. These models can be used for tracking Schubert2008 but often fail for predictions longer than a second or when vehicle interaction plays an important role.
Maneuver-based models use a set of maneuver prototypes and match these either to the past trajectory directly using cluster-based approaches Vasquez2004 or to vehicle features using machine learning methods GarciaOrtiz2011, Morris2011, Kumar2013. While these models are able to include more complex maneuvers, they also cannot take interaction into account.
Interaction-aware models aim to include interactions between vehicles in their predictions. These include an expansion of maneuver-based models which accounts for collision probabilities Lawitzky2013, coupled HMM, which model pairwise entity dependencies Brand1997, and machine learning-based models.
Machine learning-based models also vary in complexity and goal. lenz_deep_2017 use simple feed-forward neural networks to create a fast model for use in a Monte-Carlo Tree Search algorithm. Morton2016 evaluate RNN for the same task, also trained in a supervised fashion. Conversely, kuefler_imitating_2017 use Generative Adversarial Imitation Learning to imitate human driving behavior using reinforcement learning. For all of these, performances crucially depend on the representation of the environment.
II-B Environmental Representation
Environment representations can be differentiated by their level of abstraction: One can represent the environment as data close to sensor input, such as LIDAR beams kuefler_imitating_2017, camera images, or a simple gridmap. Alternatively, one can represent the environment as a list of discrete objects.
II-B1 Sensor-like representation
A sensor-like representation does not require expert knowledge to define the features to use and remains of constant size independent of factors such as traffic density. At the same time, it can receive information on many vehicles. However, the representation is inefficient (requiring many LIDAR beams or pixels per vehicle) and we are forced to learn not just driving behavior but also the extraction of vehicles from sensor data.
II-B2 Discrete object representation
Alternatively, we can represent each vehicle as an object with certain attributes. Predictions are then created per car. Interaction is then a matter of choosing the correct environment representation, which may be as simple as the distance and approach speed to the preceding vehicle — as in the IDM treiber_congested_2000 — or might contain a multitude of preprocessed features lenz_deep_2017. However, these models by design have to be simplistic in their assumptions of interaction between traffic participants and are therefore limited.
Several of these shortcomings can be avoided by thinking about traffic participants and their interactions as nodes and edges in a graph. A behavior prediction model then operates on that graph, producing predictions for each node.
III Traffic Participant Prediction from a Graph
While there are several different GNN architectures, experimental results suggest the relatively simple GCN model still performs best over a wide variety of tasks shchur_pitfalls_2018. We also evaluate the GAT model, since it allows us to easily include edge features into the model.
III-A Graph Convolutional Networks
GCN kipf_semi-supervised_2016 are an approach for node-based classification or prediction on a graph. Analogous to convolutions on images or time series, a GCN applies the same operation on all nodes. Like other neural networks, it is defined by a series of differently-parameterized layers which are applied successively.
III-A1 The Base Model
Each layer of the GCN uses
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \qquad (1)$$
as a transformation. Here, $H^{(l)}$ is the $l$th layer’s activations, $\tilde{A}$ is the adjacency matrix with added self-connections between nodes, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, $W^{(l)}$ is the $l$th layer’s learnable weight matrix, and $\sigma$ is a non-linearity.
This is equivalent to a first-order approximation of a localized spectral filter, but has two crucial advantages: The Graph Laplacian does not need to be inverted (which would be computationally expensive for large graphs) and the transformation specified by $k$ layers takes exactly the $k$-hop neighborhood of a node into account. Accordingly, computational complexity scales linearly in the number of edges.
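As an illustration of the layer-wise propagation described above, the following is a minimal sketch in plain Python, assuming the symmetric normalization of kipf_semi-supervised_2016 and a ReLU non-linearity. The helper names (`matmul`, `gcn_layer`) and the three-vehicle toy graph are hypothetical and not taken from our implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, h, w):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-connections so each node also sees its own features.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # node degrees including the self-loop
    # Symmetric normalisation: entry (i, j) is divided by sqrt(d_i * d_j).
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    pre_activation = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in pre_activation]


# Hypothetical toy scene: three vehicles connected in a chain 0 - 1 - 2.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
h = [[1.0], [2.0], [3.0]]  # one feature per vehicle
w = [[1.0]]                # 1x1 weight matrix
one_layer = gcn_layer(adj, h, w)
two_layers = gcn_layer(adj, one_layer, w)
```

In this toy chain, the one-layer output of vehicle 0 depends only on vehicles 0 and 1, while the two-layer output also depends on vehicle 2, which is the hop-wise growth of the receptive field mentioned above.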
III-A2 Adaptations for the GCN
We originally applied the GCN exactly as described by kipf_semi-supervised_2016. However, we found several changes to be crucial:
Self-Weights: GCN compute the next layer’s features for a node from a spectral decomposition of that node’s neighborhood (and, with added self-connections, the ego node itself). However, this means a GCN cannot treat the ego node’s own features differently from any of its neighbors. In the prediction task, this appears to be a significant obstacle to good performance. Accordingly, we remove the self-connections but introduce a second weight matrix defining a transformation on the ego node’s features. Our transformation equation is therefore
$$H^{(l+1)} = \sigma\left(D^{-\frac{1}{2}} A D^{-\frac{1}{2}} H^{(l)} W^{(l)} + H^{(l)} W_s^{(l)}\right) \qquad (2)$$
where $A$ and $D$ are the adjacency and degree matrices without self-connections and $W_s^{(l)}$ is the $l$th layer’s self-weight matrix.
Weight by Distance: kipf_semi-supervised_2016 note that the adjacency matrix can be binary or weighted. We evaluate weighting edges by the inverse distance, with self-loops set to .
In addition, we no longer use a full GCN but replace the output layer with a feed-forward layer operating on each node’s features independently. This allows a better decoupling of the feature extraction (occurring in the first few GCN layers) and the final prediction from the extracted features.
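The self-weight adaptation can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: a symmetric adjacency matrix without self-loops, a ReLU non-linearity, and neighbour messages set to zero for isolated vehicles (a guard chosen for this sketch). The names `gcn_self_weight_layer`, `w_neigh`, and `w_self` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_self_weight_layer(adj, h, w_neigh, w_self):
    """ReLU(D^-1/2 A D^-1/2 H W_neigh + H W_self), with A free of self-loops."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Normalise only existing edges; isolated nodes keep an all-zero row
    # (assumption of this sketch to avoid dividing by a zero degree).
    a_hat = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] != 0 else 0.0
              for j in range(n)] for i in range(n)]
    from_neighbours = matmul(matmul(a_hat, h), w_neigh)
    from_self = matmul(h, w_self)  # separate transformation of the ego node
    return [[max(0.0, a + b) for a, b in zip(row_n, row_s)]
            for row_n, row_s in zip(from_neighbours, from_self)]


# Hypothetical toy scene: vehicles 0 and 1 interact, vehicle 2 is isolated.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
h = [[1.0], [2.0], [3.0]]
out = gcn_self_weight_layer(adj, h, w_neigh=[[1.0]], w_self=[[10.0]])
```

Because the ego features pass through their own weight matrix, the layer can weight a vehicle's own state very differently from that of its neighbours, which the shared-weight formulation cannot do.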
III-B Graph Attention Networks
GAT velickovic_graph_2017 layers compute each node’s next representation by an attention mechanism over all of its neighbors.
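III-B1 The Base Model
Specifically, they compute attention coefficients
$$\alpha_{ij}^{(l)} = \operatorname{softmax}_j\left(a\left(W^{(l)} h_i^{(l)}, W^{(l)} h_j^{(l)}\right)\right) \qquad (3)$$
for each connected node pair, with $h_i^{(l)}$ being the $i$th node’s feature in the $l$th layer and $W^{(l)}$ being the learnable weight matrix for the $l$th layer. $a$ is the learnable attention computation, implemented by a neural network, and the softmax normalizes over the neighbors $j$ of node $i$. The node feature vector is then computed as
$$h_i^{(l+1)} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} W^{(l)} h_j^{(l)}\right) \qquad (4)$$
where $\mathcal{N}(i)$ is the set of neighbors of node $i$ and $\sigma$ is a non-linearity, usually ReLU.
In practice, velickovic_graph_2017 note that learning is stabilized by using multi-head attention, i.e. using $K$ differently-parameterized attention mechanisms and concatenating (or, in the last layer, averaging) the results. This allows features to be created from different subsets of nodes depending on the needs of these features.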
As with GCN, a GAT layer operates on local neighbourhood only and therefore also scales linearly in the number of edges.
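A single attention head of such a layer can be sketched in plain Python as below. The sketch assumes the attention computation is a single linear map over the concatenated transformed features followed by a LeakyReLU, as in velickovic_graph_2017, and uses ReLU as output non-linearity. The function names and the toy inputs are hypothetical.

```python
import math


def matvec(w, x):
    """Apply weight matrix w (list of rows) to feature vector x."""
    return [sum(w_k * x_k for w_k, x_k in zip(row, x)) for row in w]


def leaky_relu(v, slope=0.2):
    return v if v > 0 else slope * v


def gat_layer(neighbours, h, w, att):
    """Single-head attention layer; returns (new features, attention coefficients)."""
    wh = [matvec(w, h_i) for h_i in h]  # W h_i for every node
    new_h, alphas = [], []
    for i, nbrs in enumerate(neighbours):
        if not nbrs:  # no incoming edges: nothing to attend to
            new_h.append([0.0] * len(wh[i]))
            alphas.append([])
            continue
        # Unnormalised score for each neighbour j from the pair (W h_i, W h_j).
        scores = [leaky_relu(sum(a * f for a, f in zip(att, wh[i] + wh[j]))) for j in nbrs]
        top = max(scores)  # subtract the maximum for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        # Attention-weighted sum of the neighbours' transformed features.
        agg = [sum(a * wh[j][k] for a, j in zip(alpha, nbrs)) for k in range(len(wh[i]))]
        new_h.append([max(0.0, v) for v in agg])
        alphas.append(alpha)
    return new_h, alphas


# Hypothetical toy graph: node 0 attends to nodes 0, 1, 2; node 1 only to itself.
neighbours = [[0, 1, 2], [1], []]
h = [[1.0], [2.0], [4.0]]
new_h, alphas = gat_layer(neighbours, h, w=[[1.0]], att=[0.0, 1.0])
```

Each node only loops over its own neighbour list, so the cost of one head grows with the number of edges rather than with the square of the number of nodes.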
III-B2 Adaptations for the GAT
As before, we also apply some adaptations to the base GAT model.
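Edge attributes: In the GAT as introduced by velickovic_graph_2017, attention depends only on the features of the two nodes. However, we do have additional data, such as the relative positions, available to us in this scenario. Accordingly, we augment the attention computation from Eq. 3 by including edge features, such that
$$\alpha_{ij}^{(l)} = \operatorname{softmax}_j\left(a\left(W^{(l)} h_i^{(l)}, W^{(l)} h_j^{(l)}, r_{ij}\right)\right) \qquad (5)$$
where $r_{ij}$ denotes the edge features between nodes $i$ and $j$.
We do not learn successive edge features but instead use the relative positions as edge features in each layer.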
Self-weights: While the GAT should be able to learn by itself to concentrate one attention head onto the ego node, we also evaluate explicitly adding a transformation of the ego node’s features.
Feed-forward output: As with the GCN, our final output is produced by a feed-forward layer.
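To make the edge-attribute adaptation concrete, the sketch below computes attention coefficients for one ego vehicle from the transformed node features and a per-edge feature vector. It assumes a single linear attention map over the concatenation of all three inputs followed by a LeakyReLU; the function name `edge_attention`, the parameter vector `att`, and the example values are hypothetical.

```python
import math


def edge_attention(wh_i, wh_neighbours, edge_features, att):
    """Softmax attention over neighbours using node features and edge features."""
    scores = []
    for wh_j, r_ij in zip(wh_neighbours, edge_features):
        # Concatenate ego features, neighbour features, and edge features.
        z = sum(a * f for a, f in zip(att, wh_i + wh_j + r_ij))
        scores.append(z if z > 0 else 0.2 * z)  # LeakyReLU
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example: two neighbours with identical node features but
# different relative positions (longitudinal, lateral offset in metres).
wh_ego = [1.0]
wh_nbrs = [[1.0], [1.0]]
rel_pos = [[10.0, 0.0], [-40.0, 3.5]]
att = [0.0, 0.0, -0.05, 0.0]  # illustrative weights, not learned values
alpha = edge_attention(wh_ego, wh_nbrs, rel_pos, att)
```

With identical node features, attention that ignores edge features would have to weight both neighbours equally; including the relative position lets the coefficients differ.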
III-C Graph and Feature Construction
Formulating the prediction problem as a graph still leaves open how we construct said graph and the node features. While there is an obvious strategy for constructing node features, namely using the corresponding car features such as position or velocity, no single strategy is obviously correct for constructing connections between the nodes. We therefore consider four basic strategies:
Self connections: This only adds self-loops to the graph. It ignores all interaction and should perform identically to a simple model operating on the vehicle data only.
All connections: Connecting all vehicles ensures that no interactions are ignored. However, this ignores previous knowledge on spatial position and interaction and greatly increases the problem size.
Preceding connection: Arguably the most important interaction is with the vehicle immediately in front of us. We can therefore construct interactions only between the current vehicle and its predecessor.
Close vehicles (neighbour connections): Alternatively, we can argue that the main interactions are with the vehicles in an ego vehicle’s direct environment, which are at most eight vehicles located to the front, rear, and sides of the ego vehicle.
While we would prefer to learn these connecting strategies, this is a very difficult open problem and scales quadratically with the number of considered vehicles. We therefore only evaluate the fixed strategies.
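The four fixed strategies can be sketched as edge-list constructors. This is a minimal illustration under simplifying assumptions that are not taken from our implementation: each vehicle is reduced to a hypothetical `(lane, x)` tuple, an edge `(j, i)` means that vehicle `i` receives information from vehicle `j`, and a vehicle in an adjacent lane counts as "alongside" if it is within a hypothetical `side_window` of the ego position.

```python
def self_connections(vehicles):
    return [(i, i) for i in range(len(vehicles))]


def all_connections(vehicles):
    n = len(vehicles)
    return [(j, i) for i in range(n) for j in range(n) if i != j]


def preceding_connections(vehicles):
    edges = []
    for i, (lane_i, x_i) in enumerate(vehicles):
        # Vehicles ahead of i in the same lane, as (position, index) pairs.
        ahead = [(x_j, j) for j, (lane_j, x_j) in enumerate(vehicles)
                 if lane_j == lane_i and x_j > x_i]
        if ahead:
            edges.append((min(ahead)[1], i))  # nearest vehicle ahead
    return edges


def close_connections(vehicles, side_window=5.0):
    edges = []
    for i, (lane_i, x_i) in enumerate(vehicles):
        slots = {}  # (lane offset, slot name) -> (absolute gap, neighbour index)
        for j, (lane_j, x_j) in enumerate(vehicles):
            offset, dx = lane_j - lane_i, x_j - x_i
            if j == i or abs(offset) > 1:
                continue
            if offset != 0 and abs(dx) <= side_window:
                slot = "side"
            else:
                slot = "front" if dx > 0 else "rear"
            key = (offset, slot)
            # Keep only the nearest vehicle per slot: at most 2 + 3 + 3 = 8.
            if key not in slots or abs(dx) < slots[key][0]:
                slots[key] = (abs(dx), j)
        edges.extend((j, i) for _, j in sorted(slots.values()))
    return edges


# Hypothetical scene: (lane index, longitudinal position in metres).
vehicles = [(1, 0.0), (1, 20.0), (1, 45.0), (2, 2.0), (0, -30.0)]
print(preceding_connections(vehicles))
print(close_connections(vehicles))
```

The all-connections strategy produces a number of edges quadratic in the number of vehicles, while the other three are bounded by a constant number of edges per vehicle.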
In order to evaluate the newly proposed models, we conduct a prediction experiment on real-world traffic data. We purposely keep baselines and models simple to demonstrate whether the graph interpretation is beneficial without introducing a multitude of confounding factors. We therefore do not include RNN architectures, simulation steps, or imitation learning.
From this, we aim to answer three main questions: (A) Which of our adaptations to GNN are necessary? (B) How do we construct an interaction graph? (C) Does a graph model increase prediction quality?
We conduct our experiment on two different datasets: The HighD dataset highDdataset and the NGSIM I-80 dataset NGSIM.
The NGSIM project’s I-80 dataset contains trajectory data for vehicles in a highway merge scenario for three 15-minute timespans. These are tracked using a fixed camera system. As Thiemann2008 show, position, velocity, and acceleration data contain unrealistic values. We therefore smooth the positions using double-sided exponential smoothing with a span of 0.5 and compute velocities from these.
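The smoothing step can be illustrated with a symmetric (double-sided) exponential kernel in the spirit of Thiemann2008. The sketch below is an assumption-laden illustration, not our preprocessing code: it interprets the span as a time constant in seconds, truncates the kernel at three time constants, and shrinks the window symmetrically near the ends of a track. The function and parameter names are hypothetical.

```python
import math


def smooth_symmetric_exponential(values, span_s, dt):
    """Smooth a 1-D signal with weights exp(-|i - k| / delta), delta = span_s / dt."""
    delta = span_s / dt  # kernel width in samples
    n = len(values)
    smoothed = []
    for i in range(n):
        # Symmetric window, shrunk near the boundaries so it stays centred on i.
        half = min(int(3 * delta), i, n - 1 - i)
        num = den = 0.0
        for k in range(i - half, i + half + 1):
            weight = math.exp(-abs(i - k) / delta)
            num += weight * values[k]
            den += weight
        smoothed.append(num / den)
    return smoothed


# Hypothetical 10 Hz position track with a single outlier at index 10.
positions = [0.0] * 10 + [1.0] + [0.0] * 10
smoothed = smooth_symmetric_exponential(positions, span_s=0.5, dt=0.1)
```

Because the window is symmetric, a vehicle moving at constant speed is left unchanged by the filter, while isolated outliers are damped. Velocities can then be obtained by differencing the smoothed positions.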
We use two of the recordings as training set and split the last one equally into validation and test set. We subsample the trajectory data to 1 FPS and extract trajectories with a total length of 10 s. The goal of the model is to predict the second half of the trajectory given the first five seconds.
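The trajectory extraction can be sketched as a sliding window over a subsampled track. The sketch assumes that a 10 s trajectory at 1 FPS corresponds to ten frames (five observed, five to predict) and that windows are shifted by one frame; both choices, as well as the function name and the `source_fps` value, are assumptions made for illustration.

```python
def make_samples(track, source_fps, window_frames=10, history_frames=5):
    """Split one vehicle track into (history, future) pairs at 1 FPS."""
    frames = track[::source_fps]  # keep one frame per second
    samples = []
    for start in range(len(frames) - window_frames + 1):
        window = frames[start:start + window_frames]
        samples.append((window[:history_frames], window[history_frames:]))
    return samples


# Hypothetical track: 250 frames recorded at 10 FPS, each frame stored as its index.
track = list(range(250))
samples = make_samples(track, source_fps=10)
history, future = samples[0]
```

Here `history` holds the frames the model observes and `future` the frames it is asked to predict.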
Since the NGSIM dataset still contains many artifacts (errors in bounding boxes, undetected cars, complete non-overlap of bounding box and true vehicle), we additionally conduct experiments on the new HighD dataset highDdataset. It is a series of drone recordings, each covering about 400 meters of road at one of several locations on the German Autobahn, together with extracted vehicle features. A total of 16.5 h of data is available, containing 110 000 vehicles with a total driving distance of 45 000 km. However, since the dataset consists mainly of roads without on- or off-ramps and without traffic jams, interaction seems limited: Only about 5% of the cars perform a lane change.
To avoid information leakage, we split the dataset by recording. The last 10 % of the recordings are used as test set, the 10 % before that as validation set. Trajectory construction is then identical to the NGSIM dataset.
We compare our approach to two different model-based static approaches, and one learned approach.
The constant velocity model (CVM) assumes that each car continues moving at the velocity (both lateral and longitudinal) it had in the last frame in which it was observed.
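The constant velocity baseline amounts to a linear extrapolation of the last observed state. A minimal sketch, assuming positions in metres, velocities in metres per second, and five prediction steps of one second each (the function name and the example values are hypothetical):

```python
def constant_velocity_prediction(position, velocity, steps=5, dt=1.0):
    """Extrapolate the last observed position with the last observed velocity."""
    x, y = position
    vx, vy = velocity
    return [(x + vx * dt * k, y + vy * dt * k) for k in range(1, steps + 1)]


# Hypothetical vehicle: 30 m/s longitudinally, drifting 0.5 m/s laterally.
predicted = constant_velocity_prediction((0.0, 0.0), (30.0, 0.5))
```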
The IDM treiber_congested_2000 is a commonly-used driver model for microscopic traffic simulation since it is interpretable and collision-free. We use this to predict the changes in longitudinal velocity and keep the in-lane position constant.
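The IDM’s acceleration is computed from both a free road and an interaction term. The free road acceleration is computed as
$$a_{\text{free}} = a_{\max}\left(1 - \left(\frac{v}{v_0}\right)^{\delta}\right) \qquad (6)$$
with the maximum acceleration $a_{\max}$, the acceleration exponent $\delta$ and the desired velocity $v_0$ being tunable parameters and the current velocity $v$. The interaction term is defined as
$$a_{\text{int}} = -a_{\max}\left(\frac{s_0 + vT + \frac{v\,\Delta v}{2\sqrt{a_{\max} b}}}{s}\right)^{2} \qquad (7)$$
where the minimum distance to the front vehicle $s_0$, the time gap $T$, and the maximum deceleration $b$ are tunable parameters. $v$ is the vehicle’s speed, $\Delta v$ the closing speed to its predecessor, and $s$ the current distance to the front vehicle. The total acceleration is the sum of both the free road and the interaction acceleration.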
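The two IDM terms can be written down directly. The sketch below assumes the standard formulation of treiber_congested_2000; the parameter names and the numeric values in the example are illustrative placeholders and are not the tuned values used in our experiments.

```python
import math


def idm_acceleration(v, gap, closing_speed, a_max, delta, v_desired,
                     s_min, time_gap, b_max):
    """Total IDM acceleration: free-road term plus interaction term."""
    free = a_max * (1.0 - (v / v_desired) ** delta)
    # Desired gap: minimum distance, time-gap distance, and a braking term
    # that grows with the closing speed to the predecessor.
    desired_gap = (s_min + v * time_gap
                   + v * closing_speed / (2.0 * math.sqrt(a_max * b_max)))
    interaction = -a_max * (desired_gap / gap) ** 2
    return free + interaction


# Illustrative parameters only (hypothetical, not the tuned values).
params = dict(a_max=1.0, delta=4.0, v_desired=30.0, s_min=2.0, time_gap=1.5, b_max=2.0)
approaching = idm_acceleration(v=25.0, gap=20.0, closing_speed=5.0, **params)
free_road = idm_acceleration(v=15.0, gap=1e9, closing_speed=0.0, **params)
```

With a very large gap the interaction term vanishes and only the free road acceleration remains, which is the regime in which the IDM behaves like a non-interacting model.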
We take the IDM parameters for the NGSIM dataset from Morton2016. For the HighD dataset, we tune the IDM’s parameters using guided random search with a total of 20 000 samples. Both parameter sets are listed in Table I.
IV-B3 Independent Feed-Forward Model
In addition to the models taking interaction into account, we also add a simple feed-forward neural network predicting the trajectory from only the ego vehicle’s past data. We use this baseline model to measure the improvement we gain from including interaction into our models.
IV-C Model Configuration
Each model uses a similar configuration: Two layers producing a 256-dimensional feature representation followed by a feed-forward layer producing the final output. All models use the ReLU nonlinearity. The GAT employs four attention heads (and 64-dimensional feature representations each).
Since the GNN models use two layers, their effective receptive field is the two-hop neighbourhood from the ego vehicle.
All models receive inputs and produce outputs in fixed-length timesteps without recurrence. They are trained to predict displacement relative to the last position and receive position and velocity for each past timestep. They are trained to minimize the mean squared error over all outputs. All models are implemented in PyTorch paszke2017automatic using and expanding upon the pytorch-geometric library Fey/etal/2018.
IV-D Performance Measure
We report the performance of each model by measuring the error in position between ground truth and prediction. We report both the mean displacement over five seconds, weighting each timestep identically, and the final displacement after five seconds.
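Both measures can be computed from the per-timestep Euclidean distance between predicted and true positions. A minimal sketch for a single trajectory, assuming one `(x, y)` position per second (the function name and the example values are hypothetical):

```python
import math


def displacement_errors(predicted, truth):
    """Return (mean displacement, final displacement) for one trajectory."""
    errors = [math.hypot(px - tx, py - ty)
              for (px, py), (tx, ty) in zip(predicted, truth)]
    return sum(errors) / len(errors), errors[-1]


# Hypothetical five-second prediction that drifts laterally from the truth.
predicted = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
truth = [(1.0, 0.0), (2.0, 1.0), (3.0, 2.0), (4.0, 3.0), (5.0, 4.0)]
mean_displacement, final_displacement = displacement_errors(predicted, truth)
```

Since prediction errors typically grow with the horizon, the final displacement is usually the larger of the two values.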
IV-E Experimental Procedure
Our choice of experiments is guided by the three main questions (sections V-A, V-B and V-C). To ensure meaningful results, we repeat each evaluation a total of ten times using different random seeds. In tables, we report all results as mean ± standard deviation. Figures are violin plots, showing both individual results and the total result distribution.
We optimize both network adaptations and graph construction strategies on the NGSIM I-80 dataset since it is both smaller and contains more interactions. We then use these insights to pick the best-performing models and evaluate them on both the NGSIM I-80 and the HighD dataset.
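Table II: Mean displacement and displacement at 5 s for the ablation of the GCN (variants: no FF output, with weighted edges, no self-weight and weighted edges) and of the GAT (variants: no FF output, no edge features), and for the connection strategies evaluated with the GAT. (*) The All Connections strategy uses 3 instead of 10 evaluations.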
We structure our evaluation according to three research questions which answer (A) whether our proposed architectural adaptations are worthwhile, (B) which of the graph construction strategies should be preferred, and (C) whether the inclusion of interaction graph information improves performance.
V-A Which of our adaptations to GNN are necessary?
In Sections III-A2 and III-B2, we proposed several changes to the GCN and GAT architectures. To answer which of these changes are beneficial, we conducted an ablation study whose results are listed in Table II. We evaluated this using the Neighbour Connection graph construction strategy.
For both models, the added self-weights improve the final result significantly. We believe these additional weights help greatly because there is a clear difference between a neighbouring node and the ego node in this task. We also found that using a feed-forward layer as last layer produces only a small increase in performance but also stabilizes training.
Introducing relative positions as edge features into the GAT seems to be a clear success, reducing the final displacement by about a meter. Contrary to that, edge weights for the GCN slightly decrease performance, especially when omitting self-weights. We believe that the main contribution of edge weights in our scenario is to discern between the ego and surrounding vehicles, which is already more effectively modelled through self-weights.
We therefore evaluate the graph construction using the GAT model.
V-B How do we construct an interaction graph?
In Section III-C, we proposed four construction strategies for the interaction graph. We evaluate the quality of predictions with each of these strategies using the GAT models, since these seemed to perform best. We note that in practical scenarios, a tradeoff might be necessary between prediction quality and computational complexity. Table II shows results.
As expected, the Self-Connection strategy performs identically to the FF baseline model, and the Neighbour-Connection graph construction method performs best. Somewhat surprisingly, the Preceding-Connection strategy performs no better than the baseline.
We especially note that using the All-Connection strategy imposes significant computational disadvantages with quadratic instead of linear runtime and, in our experiments, a slowdown of about 50x.
We therefore use the Neighbour Connection graph construction strategy for our evaluation.
V-C Does a graph model increase prediction quality?
The motivation of our work is to evaluate whether it is beneficial to model interaction between traffic participants and whether this can be modelled in a graph construction. To answer this question, we compare models with interaction to a model without (FF). We also include a comparison with two classical models (CVM and IDM).
We chose the GAT model as best-performing GNN. We also included a GAT model without edge features (called GAT NEF in our figures and tables) for a fair comparison with the GCN model.
As can be clearly seen, every GNN model performs better than the baseline. At the same time, there are clear performance differences between them: Both GCN and GAT NEF perform worse, which we assume is because these models cannot take relative positions directly into account and instead only act on the existence or non-existence of edges.
At the same time, the introduction of fixed edge features to the GAT model clearly shows its performance advantage, reducing the prediction error by 30% compared to the FF baseline.
We note that the comparatively bad performance of the IDM in shorter timescales is consistent with previous work lenz_deep_2017 and it is likely to achieve better performances in a closed- or open-loop simulation.
On the HighD dataset, our results are different: As Fig. 3 and Table IV show, there is no significant performance difference among the learned models, and no significant performance difference between the IDM and CVM models. We believe this to be a consequence of little interaction between the cars, which makes all learned models degenerate to the non-interaction case and makes the interaction term of the IDM irrelevant. This shows that, even with no interaction, including interaction representations into our models does not cause performance degradation.
In summary, we show that (A) several of our changes result in better performance, (B) as does a good interaction graph construction strategy. (C) In total, our model retains performance on a dataset with little interaction and greatly improves it on a dataset with plentiful interaction.
We have proposed modelling a traffic scene as a graph of interacting vehicles. Through this interpretation, we gain a flexible and abstract model for interactions. To predict future traffic participant actions, we use GNN, neural networks operating on graph data. These naturally take the graph model and therefore interaction into account. We evaluated two computationally efficient GNN and proposed several adaptations for our scenario.
In a traffic dataset with plentiful interaction, including interactions decreased prediction error by over 30% compared to the best baseline model. At the same time, we saw no increase in prediction error on a dataset with little interaction.
While we have improved prediction quality, much work remains to be done: This work is only a proof-of-concept that modelling interactions as a graph is worthwhile and should thus be seen as only one technique for one aspect of traffic prediction. Integrating this model into existing state-of-the-art methodology, particularly RNN, remains an open task. At the same time, we would like to explore other graph construction strategies, particularly automatically finding relevant interactions.