
Multi-Head Attention-based Probabilistic Vehicle Trajectory Prediction

by Hayoung Kim, et al.
HanYang University

This paper presents an online-capable deep learning model for probabilistic vehicle trajectory prediction. We propose a simple encoder-decoder architecture based on multi-head attention. The proposed model generates the distribution of the predicted trajectories for multiple vehicles in parallel. Our approach to modeling the interactions learns to attend to a few influential vehicles in an unsupervised manner, which can improve the interpretability of the network. Experiments using naturalistic highway trajectories show a clear improvement in positional error in both the longitudinal and lateral directions.



I Introduction

One of the most difficult problems in autonomous driving is perceiving and understanding the vehicle's surroundings. For safe and efficient decision making, it is necessary to accurately forecast the future trajectories of surrounding vehicles. However, accurate trajectory prediction is challenging, because the future trajectory is inherently uncertain and the behaviors of surrounding vehicles affect each other. To tackle these challenges, the prediction model should consider both the interactions among vehicles and their uncertainty.

In this paper, we propose a probabilistic model for vehicle trajectory prediction that considers the interaction among surrounding vehicles and the road environment. To model the vehicle interaction, we use the multi-head attention architecture of the Transformer [vaswani2017attention], which is considered a major breakthrough in the field of natural language processing. When humans drive, they internally predict the future trajectories of surrounding vehicles. Instead of predicting trajectories for all the surrounding vehicles, humans focus on a small number of influential vehicles to plan a safe and efficient trajectory. The proposed prediction model is motivated by this characteristic of human drivers: we want the model to learn to attend to a few influential vehicles naturally. In addition, by encoding lane features using the attention mechanism, the prediction model reflects the contextual information of the road environment, which helps it better predict the future trajectories of surrounding vehicles.

To evaluate the proposed model, naturalistic trajectories recorded on highways are used. In the experiments, the proposed method is compared with existing methods; the model jointly learns the distribution of the future trajectories and the interactions.

The proposed model has several attractive properties for vehicle trajectory prediction:

  • Interpretability: The use of multi-head attention improves the interpretability of the neural network because the model can learn the social relations of neighboring vehicles in an unsupervised manner.

  • Scalability: Because the output dimension of multi-head attention adapts to the number of vehicles, the proposed network can be extended to very dense traffic scenarios. The network was tested on an autonomous vehicle platform with fewer than 30 surrounding vehicles; the average computation time is 50 ms.

  • Accuracy: The proposed method is verified using naturalistic highway trajectory data, and the results show better performance than the existing methods in terms of positional error.

Figure 1: Proposed prediction architecture

II Related Work

Classically, several studies on trajectory prediction assumed that the vehicle moves according to a certain motion model (e.g. CTRA; Constant Turn Rate and Acceleration) [berthelot2011handling, polychronopoulos2007sensor]. In [xie2017vehicle], the authors integrated the motion model and the maneuver-based model using Interactive Multiple Model (IMM) filters, which improved the accuracy of longer-term prediction. In [wiest2012probabilistic], the future trajectory was predicted based on motion patterns extracted from the past trajectory, using a Gaussian Mixture Model (GMM) to account for uncertainty.

Recently, deep learning techniques have succeeded in the field of natural language processing, and these techniques have been widely used to design models for trajectory prediction [altche2017lstm], [park2018sequence]. In [altche2017lstm], lateral position and longitudinal velocity were predicted using Long Short-Term Memory (LSTM) [hochreiter1997long]. The LSTM used current state values such as position, velocity, distance from the preceding vehicle, and time-to-collision (TTC). The authors of [park2018sequence] used the encoder-decoder structure of Sequence-to-Sequence [sutskever2014sequence] for trajectory prediction. After the past trajectory was encoded using an LSTM, the future trajectory sequence was generated using an LSTM decoder. The main contribution of that prediction framework is that beam search can generate multiple future trajectories with high probability. However, since the future trajectory is predicted in an occupancy grid representation, it inherently contains an error corresponding to the size of the grid cells.

All vehicles on the road maintain a certain social distance to avoid collisions with each other. For this reason, predicting future trajectories while reflecting the interactions among vehicles has received more emphasis than predicting the trajectory of each vehicle independently [lee2017desire, feng2019vehicle, li2019grip]. A framework that generates diverse trajectory samples with a Conditional Variational Auto-Encoder (CVAE) [NIPS2015_5775] and refines the trajectories using inverse optimal control was introduced in [lee2017desire]. In the refinement process, the interaction of surrounding vehicles is considered using social pooling [alahi2016social]. In [feng2019vehicle], the authors proposed a prediction model that uses behavior-level intention as a condition. In [li2019grip], the Graph Convolutional Network (GCN) [kipf2016semi], an emerging topic in deep neural networks, was applied with a grid representation to model the interactions of nearby vehicles. These studies either used semantic information about the vehicles (e.g. front vehicle, front left vehicle, rear right vehicle, etc.) for interaction modeling or manually set the maximum distance within which interaction occurs. In contrast, the proposed method improves the prediction performance by learning the attention distribution via multi-head attention in an unsupervised manner, which makes the model focus on vehicles with close social interaction.

III Proposed Prediction Model

It is very difficult to consider all the interactions with surrounding vehicles in autonomous driving. As the number of surrounding vehicles increases, the complexity of the interactions increases more than exponentially. Interestingly, human drivers can reduce this complexity by focusing on vehicles with close social interaction even when there are many surrounding vehicles. This motivates us to apply multi-head attention, which is used in the Transformer [vaswani2017attention], to the vehicle trajectory prediction problem. Although the relations among surrounding vehicles have been studied, the relations had to be set manually as semantic positional information ([xie2017vehicle, scheel2019attention, dong2019interactive]) rather than being learned automatically. This section describes the structure of the proposed model, which consists of an encoder and a decoder with two attention layers.

III-A Problem formulation

The goal of trajectory prediction is to learn the posterior distribution, $P(Y \mid X, L)$, of multiple vehicles' future trajectories, $Y = \{Y_1, \dots, Y_{N_v}\}$, given their past trajectories and properties (e.g. length and width), $X = \{X_1, \dots, X_{N_v}\}$, and lane information, $L = \{L_1, \dots, L_{N_l}\}$, where $N_v$ and $N_l$ denote the number of vehicles and the number of lanes, respectively. The positions of all the vehicles are observed from time 1 to $T_{obs}$, and their future positions are predicted for time $T_{obs}+1$ to $T_{pred}$. The past trajectory and properties of vehicle $i$ can be written as $X_i = \{x_i^{1}, \dots, x_i^{T_{obs}}, p_i^{1}, \dots, p_i^{k}\}$, where $k$ is the number of properties considered. Similarly, the future trajectory of vehicle $i$ can be written as $Y_i = \{y_i^{T_{obs}+1}, \dots, y_i^{T_{pred}}\}$. Each element of a trajectory is a 2D position vector. The positions of the surrounding vehicles are represented in an ego-vehicle-based relative coordinate system. As in previous studies (e.g. [xie2017vehicle, lee2017desire, li2019grip]), it is assumed that the positions of the surrounding vehicles can be tracked from time 1 to $T_{obs}$.
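The ego-relative representation described above can be illustrated with a minimal NumPy sketch. The function name `to_ego_frame` and the optional `ego_heading` argument are assumptions made for illustration; the text only states that positions are expressed relative to the ego vehicle, so whether a rotation is applied in addition to the translation is not specified.

```python
import numpy as np

def to_ego_frame(positions, ego_position, ego_heading=0.0):
    """Express global x/y positions in a frame attached to the ego vehicle.

    positions:    array of shape (..., 2), e.g. (num_vehicles, num_steps, 2)
    ego_position: array of shape (2,), ego x/y in the global frame
    ego_heading:  ego yaw in radians (hypothetical argument; 0 means
                  translation only)
    """
    positions = np.asarray(positions, dtype=float)
    shifted = positions - np.asarray(ego_position, dtype=float)
    c, s = np.cos(ego_heading), np.sin(ego_heading)
    # Rotation by -heading, so the ego x-axis points along the driving direction.
    rot = np.array([[c, s], [-s, c]])
    return shifted @ rot.T
```

For example, with the ego vehicle at (10, 5) heading along the global y-axis, a vehicle at (10, 8) lies 3 m straight ahead, so its ego-relative position is approximately (3, 0).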

Figure 2: Structure of the attention layer for both the lane and the vehicles.

III-B Prediction model architecture

The prediction model has an encoder-decoder structure. Both the encoder and the decoder use multi-head attention, and the output of the decoder is modeled as a Gaussian distribution. The overall architecture is shown in Fig. 1. Here, the encoder maps the past trajectories, $X$, and lane information, $L$, to a compressed representation, $Z$. Given $Z$, the decoder generates the predictive mean, $\mu$, and predictive covariance, $\Sigma$, for the future trajectories, $Y$.

There are two attention layers in the proposed prediction model: the vehicle attention layer and the lane attention layer. Both attention layers have the same architecture, shown in Fig. 2. An attention layer maps a set of queries, $Q$, a set of keys, $K$, and a set of values, $V$, into an output vector. The only difference between the vehicle attention layer and the lane attention layer is the input configuration. The vehicle attention layer uses the vehicle embedding for $Q$, $K$ and $V$, whereas the lane attention layer uses the vehicle embedding as $Q$ and the lane embedding as $K$ and $V$. Inside the attention layer in Fig. 2, there is a scaled dot-product attention layer, which enables the model to discover inter-dependencies within its inputs. The attention computation in a single scaled dot-product attention can be written as (1). A set of queries, $Q$, is compared to a set of keys, $K$, by computing the dot-product attention, $QK^\top$. The attention matrix is obtained by scaling the dot-product attention by $1/\sqrt{d_k}$, where $d_k$ is the key dimension, and normalizing it with the softmax function.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \qquad (1)$$
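As a minimal sketch of the scaled dot-product attention described above, the following NumPy code computes the attention matrix and the attended output. The function names are illustrative and are not taken from the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v).

    Returns the attended output (n_q, d_v) and the attention matrix (n_q, n_k).
    """
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)  # scaled dot products
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ v, weights
```

Each row of the returned attention matrix is a probability distribution over the keys, which is what makes the matrix readable as "how much one vehicle attends to another".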


Multi-head attention performs the scaled dot-product attention function $N$ times in parallel. The independent attention outputs are concatenated and linearly transformed to the output dimension of the layer. Each attention layer adopts a residual connection with a dropout layer [srivastava2014dropout] and layer normalization [ba2016layer].
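The multi-head computation with a residual connection and layer normalization can be sketched as follows. This is a simplified illustration under several assumptions: the projection matrices `w_q`, `w_k`, `w_v`, `w_o` are hypothetical names, the input and output dimensions are taken to be equal so that the residual sum is well defined, dropout is omitted (as at inference time), and the layer normalization has no learnable gain or bias.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # eps is an assumed value; learnable gain and bias are omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def split_heads(x, num_heads):
    # (n, d_model) -> (num_heads, n, d_model // num_heads)
    n, d_model = x.shape
    return x.reshape(n, num_heads, d_model // num_heads).transpose(1, 0, 2)

def multi_head_attention(q_in, kv_in, w_q, w_k, w_v, w_o, num_heads):
    q = split_heads(q_in @ w_q, num_heads)
    k = split_heads(kv_in @ w_k, num_heads)
    v = split_heads(kv_in @ w_v, num_heads)
    # One attention matrix per head, all computed in a single batched product.
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
    heads = weights @ v                                   # (heads, n_q, d_head)
    concat = heads.transpose(1, 0, 2).reshape(q_in.shape[0], -1)
    return concat @ w_o, weights

def attention_layer(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention block: multi-head attention, residual sum, layer norm."""
    out, weights = multi_head_attention(x, x, w_q, w_k, w_v, w_o, num_heads)
    return layer_norm(x + out), weights
```

The returned `weights` array has one attention matrix per head, which corresponds to the per-head matrices visualized in Fig. 4.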


The encoder has both a vehicle attention layer and a lane attention layer. These attention layers generate attention-based representations for each surrounding vehicle. The lane attention layer encodes the lane information related to the past trajectories of the surrounding vehicles. Using the lane attention layer improves the prediction accuracy compared with simply using the lane information as an embedding vector. The vehicle attention layer encodes the relations among the past trajectories of the vehicles. The outputs of the two attention layers are concatenated to form the final encoder output, $Z$.
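The encoder wiring described above can be sketched with single-head attention for brevity. The dimensions follow the values given in the implementation details (vehicle embedding 16, lane embedding 4, attention outputs 8 and 32); the helper names, the random weights, and the omission of multiple heads, residual connections and normalization are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attend(q_in, kv_in, w_q, w_k, w_v):
    # Queries come from q_in; keys and values both come from kv_in.
    q, k, v = q_in @ w_q, kv_in @ w_k, kv_in @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def encode(vehicle_emb, lane_emb, params):
    # Vehicle attention: Q, K and V are all the vehicle embedding.
    veh_out = attend(vehicle_emb, vehicle_emb, *params["vehicle"])
    # Lane attention: Q is the vehicle embedding, K and V the lane embedding.
    lane_out = attend(vehicle_emb, lane_emb, *params["lane"])
    # One concatenated row per vehicle forms the encoder output.
    return np.concatenate([veh_out, lane_out], axis=-1)

rng = np.random.default_rng(0)

def init(*shape):
    return rng.normal(scale=0.1, size=shape)

params = {
    "vehicle": (init(16, 8), init(16, 8), init(16, 8)),
    "lane": (init(16, 32), init(4, 32), init(4, 32)),
}
z = encode(rng.normal(size=(5, 16)), rng.normal(size=(3, 4)), params)
print(z.shape)  # (5, 40): 5 vehicles, 8 + 32 features each
```

Because the lane attention uses the vehicles as queries, its output also has one row per vehicle, regardless of the number of lanes, which is what allows the two outputs to be concatenated.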

The decoder generates a probabilistic prediction of the future trajectories. The decoder consists of a single vehicle attention layer, where $Q$, $K$ and $V$ are all the encoder output, $Z$. It gathers the encoded information to predict the trajectories of the surrounding vehicles. The outputs of the decoder are the predicted mean, $\mu$, and the predicted covariance, $\Sigma$.

III-C Implementation details

The proposed prediction network is implemented in Python using TensorFlow [abadi2016tensorflow]. The core parameters used for trajectory prediction are explained below. In the encoder, the vehicles and the lane information are embedded first, and the embedded vectors are then used in the vehicle attention layer and the lane attention layer, respectively. The embedding dimension for the past trajectories is 16 and the embedding dimension for the lane information is 4. In addition, the output dimensions after the linear transformation of the concatenated heads are 8 and 32 for the vehicle attention layer and the lane attention layer, respectively. Layer normalization is applied to the output of multi-head attention. The residual connection, which adds the input and output of the attention, uses a dropout rate of 0.7 to prevent excessive reliance on the residual connection during training. The vehicle attention layer used in the decoder is designed with the same parameters as those used in the encoder.

The following two loss terms are used to train the attention-based encoder-decoder: the negative log-likelihood loss in (3) and the reconstruction loss in (4). The total loss is a weighted sum of the two losses.
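For illustration, a negative log-likelihood of 2D positions under a predicted Gaussian can be computed as in the sketch below. This is the standard bivariate Gaussian form, $\frac{1}{2}(y-\mu)^\top \Sigma^{-1} (y-\mu) + \frac{1}{2}\log|\Sigma| + \log 2\pi$, averaged over all predicted points; it is an assumption that the loss in (3) takes exactly this form, and the reconstruction loss in (4) and the loss weights are not sketched here.

```python
import numpy as np

def gaussian_nll(y, mean, cov):
    """Mean negative log-likelihood of 2D points under per-point Gaussians.

    y, mean: arrays of shape (..., 2)
    cov:     array of shape (..., 2, 2), symmetric positive definite
    """
    diff = (np.asarray(y, float) - np.asarray(mean, float))[..., None]  # (..., 2, 1)
    solved = np.linalg.solve(cov, diff)                    # cov^{-1} (y - mean)
    maha = np.squeeze(np.swapaxes(diff, -1, -2) @ solved, axis=(-1, -2))
    _, logdet = np.linalg.slogdet(cov)                     # log|cov|
    # For a 2D Gaussian the normalization constant is log(2*pi).
    return np.mean(0.5 * (maha + logdet) + np.log(2.0 * np.pi))
```

Minimizing this quantity penalizes both a mean that is far from the observed position and a covariance that is larger than necessary, which is why the predicted ellipses can be read as calibrated uncertainty only to the extent the Gaussian assumption holds.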


For optimization, Adam optimizer [kingma2014adam] is applied. The learning rate is set to 0.001. In this work, the batch size is set to 128.

IV Experiments

In this section, experimental results are obtained on a publicly available highway dataset: the highD dataset [highDdataset], a large-scale naturalistic vehicle trajectory dataset recorded by drones on German highways. NGSIM [NGSIMdataset] is one of the largest datasets of naturalistic vehicle trajectories and is widely used in trajectory prediction research, but raw NGSIM trajectories must be carefully refined because the dataset contains erroneous trajectories such as false-positive collisions. To avoid these data-quality issues, the highD dataset is used instead of NGSIM.

                     Longitudinal position error (m)   Lateral position error (m)
Prediction horizon   1s      2s      3s                1s      2s      3s
Linear               0.71    1.67    3.41              0.17    0.55    1.31
V-LSTM               0.72    1.94    3.81              0.13    0.31    0.65
ED-LSTM              0.69    1.77    3.21              0.14    0.32    0.58
Proposed (N=2)       0.59    0.77    1.31              0.08    0.14    0.30
Proposed (N=4)       0.43    0.47    0.89              0.04    0.06    0.11
Proposed (N=8)       0.54    0.58    1.09              0.07    0.11    0.18
Table I: Performance comparison of augmentation methods for test dataset.
Figure 3: An example of trajectory prediction with four surrounding vehicles. The blue vehicle indicates the ego vehicle. The yellow solid line with three dots indicates the true future trajectory, where the dots represent positions at 1-second intervals. The red dashed line indicates the predicted future trajectory. The uncertainties are drawn as ellipses colored from blue to red in chronological order. The boundaries of the ellipses correspond to 3σ.

IV-A Prediction model evaluation

The proposed model provides a Gaussian distribution over future trajectories. To evaluate it, we use the root mean square error (RMSE) of the predicted mean as the evaluation metric. In Table I, we compare the performance of our model with several baseline methods:

  • Linear model (Linear): Extrapolates the trajectory under a linear-velocity assumption using an off-the-shelf Kalman filter.

  • Vanilla LSTM (V-LSTM): Predicts the future trajectory as a point estimate using an LSTM model. The past trajectory of a single vehicle is fed to the LSTM.

  • Encoder-decoder LSTM (ED-LSTM): An LSTM encoder and LSTM decoder architecture is used for future trajectory prediction.

  • Proposed model: The proposed prediction model with various numbers of attention heads, $N$.
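The evaluation metric can be sketched as follows: the RMSE of the predicted mean is computed separately for each prediction step and for each axis. The array layout and the convention that axis 0 of the last dimension is longitudinal and axis 1 is lateral are assumptions made for this illustration.

```python
import numpy as np

def rmse_per_axis(pred_mean, truth):
    """RMSE of the predicted mean for each prediction step and each axis.

    pred_mean, truth: arrays of shape (num_samples, num_steps, 2), where the
    last axis is assumed to hold (longitudinal, lateral) positions in metres.
    Returns an array of shape (num_steps, 2).
    """
    err = np.asarray(pred_mean, dtype=float) - np.asarray(truth, dtype=float)
    # Average the squared error over samples only, keeping steps and axes apart.
    return np.sqrt(np.mean(err ** 2, axis=0))
```

Selecting the rows of the result that correspond to 1 s, 2 s and 3 s yields one row of a table in the layout of Table I for a given method.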

Even though a highway dataset is used for evaluation, the error of the linear model grows with the prediction horizon. In particular, the positional errors in the longitudinal direction are larger than the errors in the lateral direction. The vanilla LSTM performs better than the linear model in terms of lateral position error because it can predict the future trajectory based on the past trajectory. At the 3 s horizon, the ED-LSTM outperforms both the linear model and V-LSTM in both directions.

However, the proposed algorithm performs much better than the baseline methods, including ED-LSTM. The prediction performance differs slightly depending on the number of heads, $N$, used in the proposed network. In Table I, the model with 4 attention heads has smaller prediction errors than the models with 2 or 8 heads. With $N=4$, the proposed model has an error of 0.89 m in the longitudinal direction and 0.11 m in the lateral direction after 3 seconds. The proposed prediction model, with its high accuracy, is expected to help autonomous vehicles drive safely.

IV-B Analyzing the attention in trajectory prediction

One of the advantages of the proposed prediction algorithm is that it learns attention during training. The attention matrices can be used as an indicator of how strongly the vehicles interact. Fig. 3 shows an example of predicting the future trajectories of four surrounding vehicles. Note that Vehicle 2 is changing lanes and Vehicle 4 is keeping its lane. The attention matrices for these two vehicles are shown in Fig. 4. In the lane-change case, the prediction model attends to the vehicles in the target lane (the ego vehicle and Vehicle 1), as shown in Fig. 4 (a). On the other hand, in the lane-keeping case, the attention toward the vehicle itself is higher than the attention to other vehicles, as shown in Fig. 4 (b). In most cases, the last attention head places a high weight on the vehicle itself, indicating that the prediction depends heavily on the vehicle's own past trajectory.

IV-C Scalability for different numbers of vehicles

Unlike RNN variants, the outputs of the encoder and decoder can be calculated in parallel by using multi-head attention. This not only saves computation time when considering the interactions, but also makes the network independent of the order in which vehicles enter the network and of the number of vehicles to be predicted, as long as the memory capacity allows.
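The order and size independence noted above follows from the structure of self-attention, and can be checked numerically with a small sketch. Single-head attention with random weights is used here as a simplification, and the dimensions (16 in, 8 out) are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # x holds one row per vehicle; the weights do not depend on the row count.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))

# Reordering the input vehicles reorders the outputs in exactly the same way.
x = rng.normal(size=(6, 16))
perm = rng.permutation(6)
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)
print(np.allclose(out[perm], out_perm))  # True

# The same weights accept a different number of vehicles without any change.
out_21 = self_attention(rng.normal(size=(21, 16)), w_q, w_k, w_v)
print(out_21.shape)  # (21, 8)
```

The check shows only that the computation is well defined and order-consistent for any number of vehicles; how accurate the predictions remain outside the trained range is an empirical question.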

To validate this scalability, we first trained the prediction network on scenes with up to 10 surrounding vehicles. After that, the prediction performance of the network was tested on scenes with different numbers of vehicles. The results are shown in Fig. 5. The top two plots of Fig. 5 show the prediction results when there are fewer than 10 vehicles (3 and 7, respectively), whereas the lower two show the results with more than 10 vehicles (11 and 21, respectively). With 11 surrounding vehicles, the traffic density is not significantly different from that with 7 surrounding vehicles. Because of this, even though the network has not been trained on the case of 11 surrounding vehicles, the predictions are not very different from the actual future trajectories and have small uncertainty. In contrast, with 21 vehicles, the traffic density differs drastically from the density seen during training. In this case, the mean of the predicted trajectory does not differ much from the actual trajectory, but its uncertainty tends to increase noticeably. This experiment demonstrates that the proposed network can predict trajectories relatively robustly even for different numbers of vehicles. In addition, the uncertainty generated when predicting 21 vehicles is epistemic uncertainty resulting from model uncertainty, and it can be reduced by training the network with trajectories of surrounding vehicles that have a similar velocity distribution.

Figure 4: The attention matrices of the four attention heads for the situation in Fig. 3. The ego vehicle ID is zero. (a) Vehicle 2 is predicted to change lanes. (b) Vehicle 4 is predicted to keep its lane.

V Conclusions

In this paper, a multi-head attention-based prediction model is proposed for predicting the future trajectories of multiple vehicles while considering their interactions. The proposed model has an encoder-decoder architecture, which incorporates the past trajectories and the lane information through a vehicle attention layer and a lane attention layer. The proposed method is evaluated on experimental data of naturalistic highway trajectories, and the results show that it outperforms the baseline methods. Additionally, the trained attention shows that the prediction network intuitively gathers information from a few influential vehicles to make better predictions. The learned distribution of vehicle trajectories can be used as constraints or costs in a trajectory planning framework, which is our future research topic.

Figure 5: Prediction results for the robustness test on scalability. The number of surrounding vehicles is 3, 7, 11 and 21 from the top to the bottom figure. The prediction network is trained only with up to 10 surrounding vehicles. The blue vehicle indicates the ego vehicle. The yellow solid line with three dots indicates the true future trajectory, where the dots represent positions at 1-second intervals. The red dashed line indicates the predicted future trajectory. The uncertainties are drawn as ellipses colored from blue to red in chronological order. The boundaries of the ellipses correspond to 3σ.