
Multi-Head Attention with Joint Agent-Map Representation for Trajectory Prediction in Autonomous Driving

by   Kaouther Messaoud, et al.

For autonomous vehicles to navigate in urban environments, the ability to predict the possible future behaviors of surrounding vehicles is essential: it increases their safety level by allowing them to avoid dangerous situations in advance. The behavior anticipation task is mainly based on two tightly linked cues: surrounding agents' recent motions and scene information. The configuration of the agents may uncover which part of the scene is important, while scene structure determines which of the existing agents are influential. To better represent this correlation, we deploy multi-head attention on a joint agents and map context. Moreover, to account for the uncertainty of the future, we use an efficient multi-modal probabilistic trajectory prediction model that learns to extract different joint context features and generate diverse possible trajectories accordingly in one forward pass. Results on the publicly available nuScenes dataset show that our model achieves the performance of existing methods and generates diverse possible future trajectories compliant with scene structure. Most importantly, the visualization of attention maps reveals some of the underlying prediction logic of our approach, which increases its interpretability and reliability for real-world deployment.



1 Introduction

Autonomous vehicles navigate in a highly uncertain and interactive environment shared with other dynamic agents. In order to plan safe and comfortable maneuvers, they need to anticipate multiple possible future behaviors of surrounding vehicles. To do so, they mainly rely on two tightly related cues: the recent motions of the surrounding agents and the scene structure (map). Considered separately, the map and the surrounding agents do not present the context in its entirety. In fact, some information from the map or the surrounding agents could be important, but this may only be revealed by knowing information from the other. For example, for the same scene, different map regions should be considered differently depending on the presence of different agents and their motions. Similarly, the influence of the surrounding agents depends on the map structure. For instance, the importance of an agent present on the target vehicle’s left differs depending on whether the map presents right-turn or left-turn roads.
Most existing studies build one context representation and then generate multiple possible trajectories based on this representation. However, we believe that each possible future trajectory is conditioned on a specific subset of surrounding agents’ behaviors and scene context. Therefore, for each possible intention, a different partial context is important to understand the future behavior. For example, for one possible direction at an intersection, the agents and road structure in that direction would have the greatest influence on the possible future motion toward that direction. Based on this observation, we build a model that extracts different scene context representations and generates different possible trajectories conditioned on these contexts.
The Multi-Head Attention (MHA) mechanism [20] has shown great performance in different domains like natural language processing and visual question answering. Existing studies have also deployed attention mechanisms for trajectory prediction, using single [21, 19, 14] or multiple attention heads [11, 12]. In this paper, we use multi-head attention differently: we generate attention weights and values using a joint representation of the map and the surrounding agents' motion to model the existing spatio-temporal context interactions. Moreover, our model incorporates multi-modality by using each attention head to extract a different context representation and by generating a plausible trajectory conditioned on each context.
Our main contributions are:

  • Designing a new way of extracting different joint spatio-temporal representations of the driving context combining the map and surrounding agents’ recent motion using multi-head attention.

  • Extending the use of each attention head as a prediction head, as introduced in [13], to generate each possible trajectory conditioned on specific agents motion and map context representation.

  • Presenting an interpretable method that reveals information about the prediction logic through the visualization and analysis of the attention maps.

Figure 1: Proposed model. The LSTM encoders generate an encoding vector of each agent's recent motion. Each attention head models a possible way of interaction between the target (green car) and the combined context features. The decoder receives each interaction vector and the target vehicle encoding and generates a possible distribution over the predicted trajectory conditioned on each context.

2 Related Studies

The agent motion anticipation task has been addressed in the literature from different perspectives. Recently, several intention prediction approaches have been proposed; they are well described in [18]. Here, we give an overview of related deep learning based methods.

Cross-agents interaction:
As neighboring vehicles' movements are correlated, most state-of-the-art work investigates different methods of modeling vehicles’ interactions and deploys them to predict future intentions more accurately. Alahi et al. [1] created the social pooling approach to model nearby pedestrian interactions. Deo et al. [5] extended this concept to model more distant interactions using successive convolutional layers. Messaoud et al. [11] applied a multi-head attention mechanism to directly relate distant vehicles and extract a context representation.

Agent-Scene modeling:
Map information is also exploited in the task of trajectory prediction. Zhao et al. [22] concatenate trajectories and map features to build a global representation of the scene. Then, the authors apply a convolutional network to extract combined salient features. SoPhie [19] deployed two parallel attention blocks: a social attention for vehicle-vehicle interactions and a physical attention for modeling vehicle-map interactions. Yuan et al. [14] also use two attention blocks, but they deploy them sequentially by feeding the output of the cross-vehicle attention block as a query to the visual attention block.

Multimodal trajectory prediction:
The inherent uncertainty of the future implies the existence of multiple plausible future trajectories that depend on partially observable information such as agents’ goals and interactions with other agents and the scene. This has led recent studies to include multi-modality in their trajectory prediction models in various ways. The most common one is sampling from generative models such as conditional variational autoencoders (CVAEs) [9] and Generative Adversarial Networks (GANs) [22, 7, 19]. In contrast, other methods sample a stochastic policy learnt by imitation or inverse reinforcement learning [10, 6]. Ridel et al. [17] predict probability distributions over grids and generate multiple trajectory samples.

In this paper, we generate a fixed number of possible distributions over trajectories (modes) and their corresponding probabilities [4, 3], and we train our model using the ’best of k’ loss as well as other complementary loss functions (cf. the Loss Functions section).

3 Multi-Head Attention with Joint Agent-Map Representation

3.1 Input Representation

Our goal is to predict the future trajectory of a target vehicle. For this purpose, we exploit two sources of information:

  • The past trajectories of the target and its surrounding agents.

  • The scene map presenting the road structure oriented toward the driving direction.

We define the interaction space as the area centered on the target vehicle’s position at the prediction time and oriented toward its direction of motion. In the following, we consider the agents present in this area and the map covering it. This representation enables us to consider different numbers of interacting agents based on the occupancy of this area.
Trajectories representation: Each agent $i$ is represented by a sequence of its recent states $S_i^t$, for time steps $t$ between $t_{obs} - t_h$ and $t_{obs}$:

$$S_i = [S_i^{t_{obs}-t_h}, \ldots, S_i^{t_{obs}}]$$

Each state is composed of the agent's relative coordinates $x_i^t$ and $y_i^t$, velocity $v_i^t$, acceleration $a_i^t$ and yaw rate $\dot{\theta}_i^t$:

$$S_i^t = (x_i^t, y_i^t, v_i^t, a_i^t, \dot{\theta}_i^t)$$

The positions are expressed in a stationary frame of reference whose origin is the position of the target vehicle at the prediction time $t_{obs}$. One axis is oriented toward the target vehicle’s direction of motion and the other points in the direction perpendicular to it.
We denote the state of the target vehicle by $S_T$.
Map representation: The road geometry, driving area and the lane divisions of the interaction space are extracted from the global map and fed as an input RGB image to the model. We use a rasterized representation of the scene for each vehicle as described in [4].
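The change of reference frame described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `to_target_frame`, the choice of the y-axis as the forward axis, the x-axis pointing to the right of the direction of motion, and the heading convention (radians, counter-clockwise from the global x-axis) are all assumptions made for the example.

```python
import math

def to_target_frame(points, target_xy, target_heading):
    """Express global (x, y) points in a frame centered on the target vehicle.

    Assumed convention: the local y-axis points along the target's heading and
    the local x-axis is perpendicular to it, to the right of the motion.
    `target_heading` is in radians, counter-clockwise from the global x-axis.
    """
    tx, ty = target_xy
    # Unit vector along the heading (local y-axis) and its right-hand normal (local x-axis).
    fwd = (math.cos(target_heading), math.sin(target_heading))
    right = (math.sin(target_heading), -math.cos(target_heading))
    local = []
    for gx, gy in points:
        dx, dy = gx - tx, gy - ty
        lateral = dx * right[0] + dy * right[1]      # offset perpendicular to the motion
        longitudinal = dx * fwd[0] + dy * fwd[1]     # offset along the motion
        local.append((lateral, longitudinal))
    return local
```

With this convention, an agent 3 m straight ahead of the target maps to approximately (0, 3) regardless of the target's global heading, which is what makes the encoded trajectories comparable across scenes.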

3.2 Encoding layer

The Encoder is composed of two modules:
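The trajectory encoding module: The state vector $S_i^t$ of each agent is embedded using a fully connected layer to a vector $e_i^t$ and encoded using an LSTM encoder:

$$e_i^t = \Psi(S_i^t; W_{emb}), \qquad h_i^t = \mathrm{LSTM}(h_i^{t-1}, e_i^t; W_{enc})$$

where $\Psi$ denotes the fully connected embedding layer with weights $W_{emb}$. $h_i^t$ and $h_T^t$ are the hidden state vectors of the $i$-th surrounding agent and the target vehicle, respectively, at time $t$. All the LSTM encoders share the same weights $W_{enc}$.
The scene feature extractor module: We use a pretrained CNN to extract features of the map.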

3.3 Joint Agent-Map Attention

The first step in modeling cross-agent and vehicle-map interactions is to build a combined representation of the global context. Similar to [22], we place the trajectory encodings of the surrounding vehicles at their corresponding positions on top of the map features to generate a spatio-temporal representation of the global context $C$. Unlike them, however, we use the attention mechanism to attend to the shared context features composed of map and trajectories as follows (cf. Figure 2):

  • The hidden state of the target vehicle $h_T^{t_{obs}}$ is projected to form different queries $Q_l = h_T^{t_{obs}} W_l^Q$.

  • The combined trajectories and map features $C$ are projected into a joint space to form different keys $K_l = C W_l^K$ and values $V_l = C W_l^V$.

$W_l^Q$, $W_l^K$ and $W_l^V$ are the weight matrices learned in each attention head $l$.
An attention feature $head_l$ is then calculated as a weighted sum of the values $v_{lj}$:

$$head_l = \sum_j \alpha_{lj} v_{lj}$$

where the attention weights $\alpha_{lj}$ weight the effect of the surrounding context features on the target vehicle's future trajectory:

$$\alpha_l = \mathrm{softmax}\left(\frac{Q_l K_l^\top}{\sqrt{d}}\right)$$

$Q_l K_l^\top$ is a matrix multiplication used to calculate dot-product similarities, and $d$ is the dimensionality of the projection space, which sets the scaling factor $\sqrt{d}$.
We use multiple attention heads to extract different representations of the scene and we use each context representation to generate a plausible trajectory.

Figure 2: MHA with Joint-Agent-map representation: Each attention head generates keys and values based on a joint representation of agent and map features. Thus the attention weights take into account both the scene and agent contexts.
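To make the joint attention step concrete, here is a small pure-Python sketch. It assumes that "placing trajectory encodings on top of the map features" means concatenating, in each grid cell, the map feature vector with the encoding of the agent located there (zeros when the cell is empty), and that each head uses scaled dot-product attention as in [20]. The function names and the list-based data layout are illustrative choices, not the authors' implementation.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)                                  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def build_joint_context(map_feats, agent_encodings, enc_dim):
    """Concatenate map features and agent encodings cell by cell (assumed layout).

    map_feats: grid of feature vectors, indexed [row][col].
    agent_encodings: dict mapping (row, col) to the encoding of the agent in that cell.
    """
    context = []
    for r, row in enumerate(map_feats):
        for c, cell in enumerate(row):
            agent = agent_encodings.get((r, c), [0.0] * enc_dim)
            context.append(list(cell) + list(agent))
    return context

def attention_head(h_target, context, W_q, W_k, W_v):
    """One head: query from the target state, keys and values from the joint context."""
    q = matvec(W_q, h_target)
    keys = [matvec(W_k, c) for c in context]
    values = [matvec(W_v, c) for c in context]
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    alpha = softmax(scores)                          # one weight per context cell
    head = [sum(a * v[j] for a, v in zip(alpha, values)) for j in range(len(values[0]))]
    return head, alpha

def multi_head_contexts(h_target, context, heads):
    """heads: list of (W_q, W_k, W_v) triples; returns one context vector per head."""
    return [attention_head(h_target, context, *w)[0] for w in heads]
```

Because the keys are computed from the concatenated cell features, the weight given to a map cell depends on whether an agent occupies it, which is the property that distinguishes the joint representation from separate agent and map attention blocks.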

3.4 Decoding layer

Each context vector $z_l$, representing the selected information about the target vehicle’s interactions with the surrounding agents and the scene, is fed together with the target vehicle's motion encoding $h_T^{t_{obs}}$ to one of $L$ LSTM decoders. The decoders generate the predicted parameters of the distributions over the target vehicle’s estimated future positions for each possible trajectory, for time steps $t = t_{obs}+1, \ldots, t_{obs}+t_f$.
All the LSTM decoders share the same weights $W_{dec}$.
Similar to MTP [4], we also predict the likelihood of each predicted distribution. To do so, we combine all the scene representation vectors $z_l$ and feed them to two successive fully connected layers separated by an activation function. This network outputs the probability $P_l$ ($l = 1, \ldots, L$) of each produced trajectory being the best fit with the ground truth behavior.

3.5 Multimodal Output

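Our model outputs the parameters characterizing a probability distribution over each possible predicted trajectory of the target vehicle:

$$\hat{Y} = [\hat{\mathbf{y}}^{t_{obs}+1}, \ldots, \hat{\mathbf{y}}^{t_{obs}+t_f}]$$

where $\hat{\mathbf{y}}^t = (\hat{x}^t, \hat{y}^t)$ are the predicted coordinates of the target vehicle at time $t$. We denote the ground truth positions by $Y$.

Our model infers the conditional probability distribution of $Y$ given the input agent states and map. The distribution over the possible positions at time $t$ is represented as a bivariate Gaussian distribution with the parameters $\Theta^t = (\mu_x^t, \mu_y^t, \sigma_x^t, \sigma_y^t, \rho^t)$: the means, the standard deviations and the correlation coefficient.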

3.6 Loss Functions

We train the model using the following loss functions:
Regression loss: While the model outputs a multimodal predictive distribution corresponding to $L$ distinct futures, we only have access to one ground truth trajectory for training the model. In order not to penalize plausible trajectories generated by the model that do not correspond to the ground truth, we use a variant of the best-of-$L$ regression loss for training our model, as has been done in previous work. This encourages the model to generate a diverse set of predicted trajectories. Since we output the parameters of a bivariate Gaussian distribution at each time step for the trajectories, we compute the negative log-likelihood (NLL) of the ground truth trajectory under each of the modes output by the model, and consider the minimum of the $L$ NLL values as the regression loss. The regression loss is given by

$$\mathcal{L}_{reg} = \min_{l \in \{1, \ldots, L\}} \left( - \sum_{t=t_{obs}+1}^{t_{obs}+t_f} \log P_{\Theta_l^t}(\mathbf{y}^t) \right) \tag{9}$$

where $\mathbf{y}^t$ is the ground truth position at time $t$ and $P_{\Theta_l^t}$ is the bivariate Gaussian density with the parameters $\Theta_l^t$ predicted by mode $l$.
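The best-of-L regression loss can be illustrated with a short sketch. It assumes each mode is given as a list of per-step bivariate Gaussian parameters (mean x, mean y, standard deviation x, standard deviation y, correlation) and that the trajectory NLL is the sum of the per-step NLLs; the function names are hypothetical.

```python
import math

def bivariate_nll(point, params):
    """Negative log-likelihood of a 2D point under a bivariate Gaussian."""
    x, y = point
    mu_x, mu_y, sig_x, sig_y, rho = params
    zx = (x - mu_x) / sig_x
    zy = (y - mu_y) / sig_y
    one_minus_rho2 = 1.0 - rho * rho
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / (2.0 * one_minus_rho2)
    log_norm = math.log(2.0 * math.pi * sig_x * sig_y * math.sqrt(one_minus_rho2))
    return quad + log_norm

def best_of_l_regression_loss(ground_truth, modes):
    """Return (loss, index of the best mode, per-mode NLLs).

    ground_truth: list of (x, y) positions, one per future time step.
    modes: list of L modes, each a list of per-step Gaussian parameter tuples.
    """
    nlls = [sum(bivariate_nll(p, th) for p, th in zip(ground_truth, mode))
            for mode in modes]
    best = min(range(len(nlls)), key=nlls.__getitem__)
    return nlls[best], best, nlls
```

Only the mode with the lowest NLL contributes to the loss, so the other modes are free to cover alternative futures without being penalized for missing the single observed trajectory.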


Classification loss [4]: In addition to the regression loss, we consider the cross entropy

$$\mathcal{L}_{cl} = - \sum_{l=1}^{L} \delta_{l^*}(l) \log P_l \tag{10}$$

where $\delta_{l^*}(l)$ is a function equal to 1 if $l = l^*$ and 0 otherwise. Here $l^*$ is the mode corresponding to the minimum NLL in equation 9, $\hat{Y}_l$ is the predicted trajectory corresponding to mode $l$ and $P_l$ its predicted probability.
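Since the indicator selects a single mode, the cross entropy reduces to the negative log-probability assigned to the best mode. The sketch below assumes the mode probabilities are obtained by a softmax over raw scores (the text only says the network outputs probabilities), and the function names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)                                  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_loss(mode_logits, best_mode):
    """Cross entropy against a one-hot target placed on the best mode.

    mode_logits: raw scores, one per mode (softmax normalization is an assumption).
    best_mode: index of the mode with the minimum regression NLL.
    """
    probs = softmax(mode_logits)
    return -math.log(probs[best_mode]), probs
```

The loss decreases as the classifier puts more probability on the mode that best fits the ground truth, which is what later allows the predicted trajectories to be ranked.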

Off-road loss [14]: While the loss given by equation 9 encourages the model to generate a diverse set of trajectories, we wish to generate trajectories that conform to the road structure. Since the regression loss only affects the trajectory closest to the ground truth, we consider the auxiliary loss function proposed in [14] that penalizes points in any of the L trajectories that lie off the drivable area. The off-road loss for each predicted location is the minimum distance of that location from the drivable area. Figure 3(b) shows a heatmap of the off-road loss for the layout in Figure 3(a).
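The distance-based penalty can be sketched on a small binary drivable-area mask. This is a brute-force illustration under stated assumptions: predicted locations are given in grid coordinates (row, column), distances are measured to the centers of drivable cells, and the per-point penalties are averaged over all trajectories. The averaging and the function names are choices made for the example, not details given in the text.

```python
import math

def off_road_distance(point, drivable, resolution=1.0):
    """Distance from a point to the nearest drivable cell center (0 at such a center).

    point: (row, col) in grid coordinates.
    drivable: 2D list of 0/1 flags marking the drivable area.
    resolution: meters per cell, used to convert the result to meters.
    """
    r, c = point
    best = math.inf
    for i, row in enumerate(drivable):
        for j, ok in enumerate(row):
            if ok:
                best = min(best, math.hypot(r - i, c - j))
    return best * resolution

def off_road_loss(trajectories, drivable, resolution=1.0):
    """Mean off-road distance over every point of every predicted trajectory."""
    dists = [off_road_distance(p, drivable, resolution)
             for traj in trajectories for p in traj]
    return sum(dists) / len(dists)
```

In practice the distance map would be precomputed once per scene (as the heatmap in the figure suggests) rather than searched per point; the brute-force loop is kept here only for clarity.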

The overall loss for training the model is given by

$$\mathcal{L} = \mathcal{L}_{reg} + \lambda_{cl} \mathcal{L}_{cl} + \lambda_{or} \mathcal{L}_{or}$$

where $\mathcal{L}_{or}$ is the off-road loss and the weights $\lambda_{cl}$ and $\lambda_{or}$ are empirically determined hyperparameters.

Figure 3: Off-road loss: an auxiliary loss function that penalizes locations predicted by the model that fall outside the drivable area. It is proportional to the distance of a predicted location from the nearest point on the drivable area. (a) Input scene representation. (b) Off-road loss distance-based map.

3.7 Implementation details

The input states are embedded in a space of dimension 32. We use an image representation of the scene map (cf. Figure 3(a)) with a resolution of 0.1 meters per pixel. Similar to the representation in [16], the input image covers a fixed extent ahead of the target vehicle, behind it and on each side. We use a ResNet pretrained on ImageNet to extract map features, and we place the trajectory encodings on top of the feature map output by this CNN. The deployed LSTM encoder and decoder have 64 and 128 units respectively. We use $L$ parallel attention operations applied to the vectors projected onto different spaces of size d=64. We use a batch size of 32 and the Adam optimizer [8]. The model is implemented using PyTorch.


4 Experimental Analysis and Evaluation

4.1 Dataset

We train and evaluate our model on the public self-driving car dataset nuScenes [2]. It was captured using camera and lidar sensors during urban driving in Boston, USA and Singapore. It is composed of 1000 scenes, each a 20-second recording. Each scene includes tracks hand-annotated at 2 Hz as well as high definition maps. We train and evaluate our model using the data split given by nuScenes: 32,186 observations in the train set, 8,560 observations in the train-val set, and 9,041 observations in the validation set.

| Model | MinADE | MinADE | MinADE | MinADE | MinFDE | MinFDE | MinFDE | MinFDE | MissRate | MissRate | Off-Road Rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Const vel and yaw | 4.61 | 4.61 | 4.61 | 4.61 | 11.21 | 11.21 | 11.21 | 11.21 | 0.91 | 0.91 | 0.14 |
| Physics oracle | 3.69 | 3.69 | 3.69 | 3.69 | 9.06 | 9.06 | 9.06 | 9.06 | 0.88 | 0.88 | 0.12 |
| MTP [4] | 4.42 | 2.22 | 1.74 | 1.55 | 10.36 | 4.83 | 3.54 | 3.05 | 0.74 | 0.67 | 0.25 |
| Multipath [3] | 4.43 | 1.78 | 1.55 | 1.52 | 10.16 | 3.62 | 2.93 | 2.89 | 0.78 | 0.76 | 0.36 |
| MHA-SAM | 4.13 | 1.83 | 1.30 | 1.12 | 9.38 | 3.88 | 2.48 | 1.96 | 0.64 | 0.50 | 0.09 |
| MHA-JAM (JAH) | 4.15 | 1.92 | 1.33 | 1.10 | 9.52 | 4.14 | 2.58 | 1.94 | 0.67 | 0.52 | 0.11 |
| MHA-JAM | 3.86 | 1.87 | 1.30 | 1.09 | 8.84 | 4.00 | 2.49 | 1.91 | 0.63 | 0.50 | 0.09 |
| MHA-JAM (off-road) | 3.82 | 1.88 | 1.34 | 1.10 | 8.78 | 4.04 | 2.61 | 1.96 | 0.63 | 0.49 | 0.06 |

Table 1: Results of comparative analysis on the nuScenes dataset, over a prediction horizon of 6 seconds.

4.2 Baselines

We compare our model to four baselines:
Constant velocity and yaw: a physics-based method.
Physics oracle: an extension of the physics-based model introduced in [16]. Based on the current state of the vehicle (velocity, acceleration and yaw), it computes the minimum average point-wise Euclidean distance over the predictions generated by the following four models: i) constant velocity and yaw, ii) constant velocity and yaw rate, iii) constant acceleration and yaw, iv) constant acceleration and yaw rate.
Multiple-Trajectory Prediction (MTP) [4]: It uses a convolutional neural network over a rasterized representation of the scene and the target vehicle state to generate a fixed number of trajectories (modes) and their associated probabilities. It minimizes a weighted sum of regression (cf. Equation 9) and classification (cf. Equation 10) losses. We use the implementation of this model by [16].
Multipath [3]: This model first generates a fixed set of anchors (the same for all agents) representing different modes of the trajectory distribution. Then, it selects the best matching anchor, the one that minimizes the average displacement to the ground truth, and proceeds by regressing the residuals from it while accounting for uncertainties. We follow the details in [3] to implement MultiPath.
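The two physics baselines can be sketched as simple kinematic rollouts. The code below is an illustration only: it assumes forward Euler integration with a 0.5 s step (matching the 2 Hz annotation rate), a state dictionary with hypothetical keys `x`, `y`, `v`, `a`, `yaw` and `yaw_rate`, and an oracle that keeps whichever of the four rollouts has the lowest average displacement to the ground truth.

```python
import math

def rollout(state, use_accel, use_yaw_rate, steps, dt):
    """Integrate a simple kinematic model forward for `steps` time steps."""
    x, y = state["x"], state["y"]
    v, yaw = state["v"], state["yaw"]
    traj = []
    for _ in range(steps):
        if use_accel:
            v += state["a"] * dt                     # constant acceleration
        if use_yaw_rate:
            yaw += state["yaw_rate"] * dt            # constant yaw rate
        x += v * math.cos(yaw) * dt
        y += v * math.sin(yaw) * dt
        traj.append((x, y))
    return traj

def ade(pred, gt):
    """Average point-wise Euclidean distance between two trajectories."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

# The four physics models, as (use_accel, use_yaw_rate) flags.
PHYSICS_MODELS = {
    "const_vel_yaw": (False, False),
    "const_vel_yaw_rate": (False, True),
    "const_accel_yaw": (True, False),
    "const_accel_yaw_rate": (True, True),
}

def physics_oracle(state, gt, dt=0.5):
    """Pick the physics model whose rollout is closest to the ground truth."""
    candidates = {name: rollout(state, acc, yr, len(gt), dt)
                  for name, (acc, yr) in PHYSICS_MODELS.items()}
    best = min(candidates, key=lambda name: ade(candidates[name], gt))
    return best, candidates[best]
```

Because the oracle consults the ground truth to choose its model, it is an upper bound on what these physics models can achieve rather than a deployable predictor, and it produces a single trajectory whatever the number of modes evaluated.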

4.3 Our Models

We also compare different variants of applying MHA on agents and map features:

Figure 4: MHA with Separate-Agent-Map representation: A baseline where attention weights are separately generated for the map and agents features by generating keys and values for each set of features independent of the other.

MHA with Separate-Agent-Map representation (MHA-SAM): In this case, we have two separate MHA blocks (cf. Figure 4): MHA applied on surrounding agents (MHA-A) and on the map (MHA-M). The target vehicle hidden state is projected to form two different queries, one in the MHA-A block and one in the MHA-M block. Similarly, each block (MHA-A and MHA-M) has its own independent keys and values. While the keys and values of the MHA-A block are computed by different projections of the surrounding agents' encodings, those of the MHA-M block are computed from the map features.
MHA with Joint-Agent-Map representation (MHA-JAM): Model described in Section 3.3 and Figure 2. The main difference between MHA-SAM and MHA-JAM is that MHA-JAM generates keys and values in MHA using a joint representation of the map and agents while MHA-SAM computes keys and values of the map and agents features separately.
MHA-JAM with Joint-Attention-Heads (MHA-JAM(JAH)): MHA-JAM uses each attention head to generate a possible trajectory. In contrast, MHA-JAM with joint attention heads uses a fully connected layer to combine the outputs of all attention heads, and generates each possible trajectory using a different combination of all the attention heads.
MHA-JAM with Off-Road Loss (MHA-JAM (off-road)): MHA-JAM trained with off road loss (cf. Section 3.6).

4.4 Metrics

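We use different metrics to evaluate our trajectory prediction model. As we predict multiple plausible trajectories, we analyze the minimum of the average displacement error over the $K$ most probable trajectories:

$$\mathrm{MinADE}_K = \min_{\hat{Y} \in P_K} \frac{1}{t_f} \sum_{t=t_{obs}+1}^{t_{obs}+t_f} \lVert \mathbf{y}^t - \hat{\mathbf{y}}^t \rVert$$

where $P_K$ is the set of the $K$ most probable trajectories, $\mathbf{y}^t$ is the ground truth position at time $t$ and $\hat{\mathbf{y}}^t$ is the corresponding position on the predicted trajectory $\hat{Y}$.

We also use the minimum of the final displacement error over the $K$ most probable trajectories:

$$\mathrm{MinFDE}_K = \min_{\hat{Y} \in P_K} \lVert \mathbf{y}^{t_{obs}+t_f} - \hat{\mathbf{y}}^{t_{obs}+t_f} \rVert$$

The miss rate indicates whether the predicted trajectories are fairly close to the ground truth. A prediction is considered a miss when every one of the $K$ most probable trajectories deviates from the ground truth by more than a distance threshold $d$ at some time step:

$$\mathrm{Miss}_{K,d} = \begin{cases} 1 & \text{if } \min_{\hat{Y} \in P_K} \max_{t} \lVert \mathbf{y}^t - \hat{\mathbf{y}}^t \rVert > d \\ 0 & \text{otherwise} \end{cases}$$

The MissRate metric is the average of this indicator over all the performed predictions. Finally, similar to [6], we consider the off-road rate metric, which measures the fraction of predicted trajectories that are not entirely contained in the drivable area of the map.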
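The displacement and miss metrics can be computed as in the following sketch. It assumes trajectories are lists of (x, y) points aligned with the ground truth, that the K most probable trajectories are selected by sorting on the predicted probabilities, and that a miss means the maximum point-wise error exceeds a threshold `d` for every selected trajectory. The function names are illustrative.

```python
import math

def displacements(pred, gt):
    """Point-wise Euclidean errors between a prediction and the ground truth."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]

def top_k(trajs, probs, k):
    """Keep the k trajectories with the highest predicted probability."""
    order = sorted(range(len(trajs)), key=lambda i: probs[i], reverse=True)
    return [trajs[i] for i in order[:k]]

def min_ade(trajs, probs, gt, k):
    """Minimum over the top-k trajectories of the average displacement error."""
    return min(sum(displacements(t, gt)) / len(gt) for t in top_k(trajs, probs, k))

def min_fde(trajs, probs, gt, k):
    """Minimum over the top-k trajectories of the final displacement error."""
    return min(displacements(t, gt)[-1] for t in top_k(trajs, probs, k))

def is_miss(trajs, probs, gt, k, d):
    """True when every top-k trajectory strays more than d from the ground truth."""
    return all(max(displacements(t, gt)) > d for t in top_k(trajs, probs, k))
```

Averaging `is_miss` over all evaluated samples gives the miss rate. Note that these metrics depend on the classifier as well as on the regressor: a well-fitting trajectory only counts if its predicted probability places it among the top K.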

Figure 5: Prediction examples: Trajectories predicted by the different compared methods (including Multipath, MTP and MHA-JAM (off-road)) on two intersection scenarios, (a) Example 1 and (b) Example 2. The ground truth trajectories are plotted in red and the five most probable trajectories predicted by the different methods are presented in green. The size of the marker points is proportional to the probability of each predicted trajectory being the best fit with the ground truth.
Figure 6: Examples of attention maps and trajectories produced by the MHA-JAM (off-road) model. For each of (a) Example 1 and (b) Example 2: i) predicted trajectories, ii) the 5 most probable trajectories and their corresponding attention maps.

4.5 Quantitative Results

Table 1 compares our proposed methods and the baselines according to different metrics. Our methods outperform the compared methods in most cases, although Multipath performs better in some, and the ranking of the methods depends on the metric considered. Our methods perform well according to the average displacement error metric: they obtain the second-lowest MinADE in the second MinADE column of Table 1, where Multipath is best, and the lowest MinADE among the learned models in the other columns. The same pattern holds for the final displacement error metric, where our methods give the best final goal fit except in the second MinFDE column, the only setting in which Multipath performs best. Obtaining the best performance among the learned models in the first MinADE and MinFDE columns suggests that our method generates and selects trajectories that better fit the real ones. However, in the setting where Multipath leads, our classifier does not succeed in selecting the trajectories closest to the ground truth among the most probable ones, while the Multipath classifier does.

Moreover, our method presents significant improvements compared to the others when considering the miss rate and off-road rate metrics. This implies that our predicted trajectories are less likely to deviate from the ground truth by more than the miss-rate distance threshold. In addition, our model reduces the off-road rate, especially when trained with the off-road loss that penalizes predictions outside of the drivable area.

We note that MHA-JAM shows better performance compared to MHA-SAM. This proves the benefit of applying attention on a joint spatio-temporal context representation composed of map and surrounding agents motion, over using separate attention blocks to model vehicle-map and vehicle-agents interaction independently. Besides, comparing MHA-JAM and MHA-JAM(JAH) reveals that conditioning each possible trajectory on a context generated by one attention head performs better than generating each trajectory based on a combination of all attention heads outputs.

4.6 Qualitative Results

Figure 7: Visualisation of average attention maps over different generated maneuvers: (a) going straight (low speed), (b) going straight (high speed), (c) going left, (d) going right.

Figure 5 gives two examples of right turns performed at an intersection. The first one (cf. Figure 5(a)) presents an early prediction task, before a clear pattern of the performed maneuver has emerged, while the second one (cf. Figure 5(b)) is taken during the execution of the maneuver.
In the first example, all the trajectories predicted by the MTP and Multipath models go straight on (onto a road where this would mean driving the wrong way). In addition, given the vehicles on its left turning right, going straight on could be fatal for the target vehicle. Both MTP and Multipath miss more plausible trajectories. MHA-based methods present more diverse predictions and successfully predict the performed maneuver even at an early stage. Moreover, MHA-JAM (off-road) predicts that the target vehicle will slow down if it intends to go straight, which suggests that it reasons about the interactions with the surrounding vehicles (on its left).
In the second example, all the compared methods successfully predict the right-turn maneuver. However, even though Multipath, MTP and MHA-SAM present more diverse predictions, they generate inadmissible trajectories suggesting that the target vehicle would drive outside of the drivable area. In contrast, MHA-JAM (off-road) presents only predictions that lie on the drivable area. This highlights the importance of the off-road loss for producing more admissible predictions.
These examples show that our proposed methods better account for interactions between vehicles and comply with the map. However, all the methods present unlikely predictions suggesting that the vehicle would drive against traffic. This may be caused by the absence of clear information about the driving direction.
Figure 6 presents two examples of vehicle trajectory prediction, with the 5 most probable generated trajectories and their associated attention maps. We notice that our proposed model MHA-JAM (off-road) successfully predicts the possible maneuvers: straight and left for the first example (Figure 6(a)), and straight, left and right for the second (Figure 6(b)). In addition, it produces different attention maps, which implies that it has learnt to create specific context features for each predicted trajectory. For instance, the attention maps of the going-straight trajectories assign high weights to the drivable area in the straight direction and to the leading vehicles (the dark red cells). Moreover, they focus on relatively close features when the maneuver is performed at low speed and on more distant ones at high speed (cf. Figure 6(a)). For the left and right turns, in both examples, the corresponding attention maps seem to assign high weights to surrounding agents that could interact with the target vehicle while performing those maneuvers. For instance, in the left turn (cf. Figure 6(b)), the attention map assigns high weights to vehicles in the opposite lane turning right. For the left turn of the first example and for the right turn of the second example, the attention maps assign high weights to pedestrians standing on both sides of the crosswalks. However, for the right turn, the model fails to take into account the traffic direction.
Figure 7 shows the average attention maps, for 4 generated possible maneuvers (going straight with low and high speed, left and right), over all samples in the test set. We note that each attention map assigns high weights, on average, to the leading vehicles, to surrounding agents and to the map cells in the direction of the performed maneuvers. This consolidates the previous observations in Figure 6. We conclude that our model generates attention maps that focus on specific surrounding agents and scene features depending on the future possible trajectory.

5 Concluding Remarks

This work tackled the task of vehicle trajectory prediction in an urban environment while considering interactions between the target vehicle, its surrounding agents and scene. To this end, we deployed a multi-head attention-based method on a joint agents and map based global context representation. The model enabled each attention head to extract specific agents and map interaction features that help infer the driver’s diverse possible behaviors. Furthermore, the visualisation of the attention maps reveals the importance of joint agents and map features and the interactions occurring during the execution of each possible maneuver. Experiments showed that our proposed approaches outperform the existing methods according to most of the metrics considered, especially the off-road metric. This highlights that the predicted trajectories comply with the scene structure.
As future work, in order to enhance driving safety, we will focus on the prediction and identification of irregular or dangerous behaviors of surrounding agents.