Multi-modal Motion Prediction with Transformer-based Neural Network for Autonomous Driving

09/14/2021 ∙ by Zhiyu Huang, et al. ∙ Nanyang Technological University

Predicting the behaviors of other agents on the road is critical for autonomous driving to ensure safety and efficiency. However, the challenging part is how to represent the social interactions between agents and output different possible trajectories with interpretability. In this paper, we introduce a neural prediction framework based on the Transformer structure to model the relationship among the interacting agents and extract the attention of the target agent on the map waypoints. Specifically, we organize the interacting agents into a graph and utilize the multi-head attention Transformer encoder to extract the relations between them. To address the multi-modality of motion prediction, we propose a multi-modal attention Transformer encoder, which modifies the multi-head attention mechanism to multi-modal attention, and each predicted trajectory is conditioned on an independent attention mode. The proposed model is validated on the Argoverse motion forecasting dataset and shows state-of-the-art prediction accuracy while maintaining a small model size and a simple training process. We also demonstrate that the multi-modal attention module can automatically identify different modes of the target agent's attention on the map, which improves the interpretability of the model.




I Introduction

Accurately predicting traffic participants’ future trajectories is critical for autonomous vehicles to make safe, informed, and human-like decisions [9], especially in complex traffic scenarios. However, motion prediction is a remarkably challenging task due to the complicated dependencies of agents’ behaviors on the road structure and interactions among agents in addition to their kinematics, as well as the inherent uncertainty and multi-modality of their intentions.

One major challenge is how to represent the driving environment in the prediction model, including encoding road structure and agent interactions, in addition to the target’s physical states. The difficulty of representing interactions or map data in explicit rules or parametric models gives rise to deep neural network-based methods [15], which are able to handle high-dimensional map data and represent the agent interaction patterns by learning from human driving data. In this paper, we employ a graph-based structure to represent the relationship between the interacting agents, where all the agents are treated as nodes and each surrounding agent is connected to the target agent, and a Transformer-based encoder to model the relationship among the target agent and its surrounding agents. The multi-head attention in the Transformer layer can help extract different aspects of the agent interactions. To extract the relationship between the map and the target agent, we utilize the vectorized map representation [6] and a lane set-based map structure consisting of a list of waypoints from different lanes, which provides a higher map resolution. To model the target agent’s different aspects of attention on the different lane segments, we propose the multi-modal Transformer encoder that can extract the different modes of agent-map relations. The rationale and details are given below.

The motion prediction model should be capable of outputting multiple possible trajectories, and we can notice that the uncertainty of agent behaviors predominantly comes from their different targets on the road [24]. Therefore, in modeling the attention between the target agent and the map, we propose to modify the multi-head attention mechanism in the Transformer to multi-modal attention. The multi-head attention is designed to extract multiple possible relationships and richer interpretations between the inputs, and all of the attention heads are then merged to produce a final output. Instead of combining them to output a single result, we propose to directly output the result of each independent attention head, and each separate agent-map attention result is used for predicting a possible trajectory. The intuition behind this is that each attention head can represent a different relationship between the target agent and different segments of the map, which affects their future trajectories. Moreover, to ensure the diversity of the multiple attention heads, only the head that outputs the closest trajectory to the ground-truth one gets updated during training.

The proposed motion prediction model is succinct: besides the trajectory and map encoders and the prediction head, it encompasses only two cross-attention Transformer layers, one for modeling agent interactions and one for map attention. This makes it easy to train and deploy while maintaining satisfactory prediction accuracy. The main contributions of this paper are listed as follows.

  1. We propose a Transformer-based network using the waypoint-based map structure for multi-modal trajectory prediction. The proposed multi-modal attention Transformer layer is shown to be able to capture the different modes of agent-map relations, thus bringing better accuracy and interpretability.

  2. We validate the proposed method on a large-scale real-world driving dataset and the results reveal that the proposed method with a simple structure and training process can achieve competitive accuracy compared to other state-of-the-art methods.

  3. We investigate the performance of using lane-based and waypoint-based map structure with the proposed prediction network, as well as the improvement of the proposed multi-modal attention mechanism.

II Related Work

II-A Encoding Map and Interaction

The most common method of encoding map and agent interaction information is to rasterize the driving scene into bird’s-eye-view images, which contain the information of relationships among agents and road structure. Such environment representation can be effectively processed by convolutional neural networks (CNNs), which have been used in many motion prediction-related works [3, 17, 4]. However, the drawback of using rasterized images and CNNs is that the image structure is overly complex for driving environment representation, and thus it requires a larger network to process and more computation and data to train the network. More recently, a notably compact vector map representation [6, 12] has been proposed, which can significantly reduce the computation burden. VectorNet [6] treats the lanes on the map and agent historical trajectories as a set of polylines, and models them as a global fully-connected interaction graph, which is processed by a graph neural network (GNN) to encode the map and agent information. However, putting all information including lanes and agents in a single graph makes training hard and inefficient, and thus we utilize two separate Transformer-based layers to encode agent interaction and the agent’s attention on the map, respectively. On the other hand, LaneGCN [12] proposes to organize the lanes on the map into a lane graph considering the spatial connectivity, and then use the graph convolutional network (GCN) to encode the topology of the map. However, using lane-level features may decrease the map resolution, and thereby we propose to use a lane set-based map representation, which is a set of waypoints from different lanes with minimal information loss.

II-B Multi-modal Prediction

To realize multi-modal prediction, i.e., predicting multiple possible future trajectories, some generative modeling approaches, such as conditional variational autoencoder (CVAE) [11, 10] and generative adversarial network (GAN) [8], are employed. However, such generative methods are hard to train and infer because the model needs to be sampled many times to recover a plausible distribution over future behaviors. Some other works propose to predict a set of trajectories [3, 12, 21] using the variety loss [19]. However, they use the same extracted feature vector to output multiple trajectories, which is not intuitive and lacks interpretability on the outputs. Anchor-based methods [1, 16, 18] can provide better interpretability, feasibility, and diversity on the results, but their predictions are restricted to a predefined set, obtained by clustering from the data or generated by a model, which may impede the prediction accuracy. On the other hand, goal-based [24] or proposal-based methods [23, 13, 5] have been widely used due to their superior accuracy and interpretability. The trajectory predictions are conditioned on the possible long-term goals or proposals on the map, which brings diversity and also flexibility to the model outputs. Nevertheless, the goals or proposals are still manually selected, which is laborious in data processing and needs careful design. Different from these methods above, in this work, we propose to use the multi-modal Transformer layer to learn to attend to different segments of the map and produce diverse possible trajectories accordingly in an end-to-end manner, which can simplify the training process and maintain accuracy, interpretability, and flexibility.

III Multi-modal Motion Prediction Framework

III-A Problem Formulation

The task of motion prediction is to predict the possible future trajectories of a target agent over a time horizon $T_f$ based on its historical states over a time period $T_h$ and environmental context information. The input to the prediction model consists of the historical dynamic states of the target agent ($X_0$) and its surrounding agents ($X_1, \dots, X_N$), as well as the current environment information $M$. Without loss of generality, we assume that there are $N$ surrounding agents (e.g., vehicles, pedestrians, and cyclists) around the target agent; however, the number of surrounding agents can vary in different situations. The output of the prediction model is $K$ trajectories, each consisting of a discrete sequence of 2D coordinates $\hat{Y}^k$, denoting the future positions of the target agent, as well as the corresponding probability $p^k$. Mathematically, the problem is formulated as:

$$\{(\hat{Y}^k, p^k)\}_{k=1}^{K} = f(X_0, X_1, \dots, X_N, M \mid \theta), \quad X_i = \{s^i_{t_0-T_h+1}, \dots, s^i_{t_0}\}, \quad \hat{Y}^k = \{\hat{y}^k_{t_0+1}, \dots, \hat{y}^k_{t_0+T_f}\},$$

where $\theta$ denotes the parameters of the prediction model $f$, $s^i_t$ is the dynamic state of agent $i$ at timestep $t$, $\hat{y}^k_t$ is the coordinate of the $k$th predicted trajectory at timestep $t$, and $t_0$ is the current time step.

Fig. 1: An overview of our proposed motion prediction model. The agent encoder and map encoder are used to extract the features of agents and map waypoints, respectively. The agent-agent encoder is employed to model the relationship among interacting agents, and the agent-map encoder to model the relationship between the target agent with interaction feature and the waypoints on the map. Finally, the interaction feature, the target agent’s historical feature, and the agent-map attention feature are concatenated and passed through the trajectory and score decoders to output the predicted trajectories and their scores.

Fig. 2: The detailed structures of the map encoder, agent-agent encoder, and agent-map encoder in the prediction model.

III-B Prediction Framework

The proposed motion prediction framework is illustrated in Fig. 1, which encompasses four main parts. First, the map and agent encoders translate the low-dimensional raw state inputs to high-dimensional feature vectors. Then, the agent-agent encoder is used to capture the relationship between the interacting agents, and the agent-map encoder models the target agent’s attention on different segments of the map. Finally, the interaction feature, map attention feature, and dynamic feature of the target agent are concatenated and passed through the scored trajectory decoder to generate possible trajectories with associated probabilities. The detailed structures of the key components, i.e., the map encoder, agent-agent encoder, and agent-map encoder, are illustrated in Fig. 2, and the detailed explanations of the prediction model are given below.

III-B1 Map and Agent Encoders

The dynamic state of the target agent and its surrounding agents at timestep $t$ is in the format of $(x_t, y_t, v_t, \psi_t)$, where $(x_t, y_t)$ is the coordinate, $v_t$ the velocity, and $\psi_t$ the heading angle. Note that the coordinate system is centered on the target agent’s position at the current timestep with its heading aligned with the x-axis. Therefore, the historical state of an agent can be represented by a tensor with one row of state features per historical timestep. The agent encoder consists of two layers, i.e., one 1D convolutional layer and one long short-term memory (LSTM) layer, to extract the temporal motion feature of the agent. We only consider up to ten surrounding agents within a radius of 30 meters of the target agent. All the agents, including the target and surrounding agents, share the same agent encoder.
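The target-centric normalization and neighbor selection described above can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the function names are hypothetical, and keeping the nearest agents when more than ten fall inside the radius is an assumption, since the text only states the limits.

```python
import math

def to_target_frame(points, origin, heading):
    """Express world-frame (x, y) points in a frame centered at `origin`
    whose x-axis points along `heading` (radians)."""
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    transformed = []
    for x, y in points:
        dx, dy = x - origin[0], y - origin[1]
        # Rotate by -heading so the target's heading maps onto the +x axis.
        transformed.append((cos_h * dx + sin_h * dy, -sin_h * dx + cos_h * dy))
    return transformed

def select_neighbors(target_xy, others_xy, radius=30.0, max_agents=10):
    """Indices of at most `max_agents` agents within `radius` meters.
    Assumption: the nearest agents are kept when there are too many."""
    in_range = sorted(
        (math.dist(target_xy, p), i)
        for i, p in enumerate(others_xy)
        if math.dist(target_xy, p) <= radius
    )
    return [i for _, i in in_range[:max_agents]]

# A point one meter ahead of a target heading along +y lands on the +x axis.
ahead = to_target_frame([(0.0, 1.0)], origin=(0.0, 0.0), heading=math.pi / 2)
neighbors = select_neighbors((0.0, 0.0), [(40.0, 0.0), (5.0, 0.0), (1.0, 0.0)])
```

Applying the same transform to the map waypoints keeps agents and map in one shared frame.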

The map information is represented by a set of waypoints from different segments of the map. Each waypoint has a unique feature of $(x_w, y_w, \psi_w)$, where $(x_w, y_w)$ is the coordinate relative to the target agent, and $\psi_w$ is the direction. The waypoints from the same lane have the same lane features, which are the turning direction, whether the lane is in an intersection, and whether the lane has traffic control measures. These lane features are first one-hot encoded respectively and then concatenated with the waypoint features. All the waypoints share the same map encoder, which is shown in Fig. 2. First, the waypoint features are fed into a fully connected layer and we use the max-pooling operation to aggregate the information from all the waypoints on the same lane; the lane features are processed by another fully connected layer. The waypoint feature, aggregated feature, and lane feature are concatenated and passed through a fully connected layer to get the final feature vector of the waypoint.
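The aggregation step of the map encoder can be illustrated with a small sketch. It shows only the per-lane max-pooling and the concatenation, leaving out the fully connected layers; the function names are hypothetical.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length feature vectors."""
    return [max(column) for column in zip(*vectors)]

def fuse_lane(waypoint_feats, lane_feat):
    """For every waypoint on one lane, concatenate its own feature, the
    lane-level max-pooled feature, and the shared lane feature."""
    pooled = max_pool(waypoint_feats)
    return [list(w) + pooled + list(lane_feat) for w in waypoint_feats]

# Two waypoints with 2-D features on a lane with a 2-D lane feature.
fused = fuse_lane([[1.0, 4.0], [3.0, 2.0]], [0.0, 1.0])
```

Each fused vector would then pass through the final fully connected layer described above.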

III-B2 Agent-agent Encoder

We first represent the relationship between the agents as a graph, shown in Fig. 1. All the agents (nodes) are connected to the target agent (including the self-loop), and we ignore the edge attributes. We use a Transformer layer with the multi-head attention mechanism to encode the interactions between the agents, as seen in Fig. 2. In addition to the multi-head attention, the encoder also contains two position-wise fully connected feed-forward layers and two layer normalization layers following the Transformer architecture. The multi-head attention mechanism is defined as follows [20]:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O, \quad \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V),$$

where $h$ is the total number of attention heads, $Q$, $K$, $V$ are the query, key, and value vectors, respectively, and $W_i^Q$, $W_i^K$, $W_i^V$, $W^O$ are the matrices for linear projection. The attention operation is called scaled dot-product attention, which is shown as

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V,$$

where $d_k$ is the dimension of the key vector.

In the agent-agent encoder, the query is the target agent’s feature vector from the agent encoder and the key and value are the feature vectors of all the agents.
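For a single query, as in the agent-agent encoder where only the target agent queries the other agents, scaled dot-product attention reduces to a weighted average of the value vectors. The sketch below is a plain-Python illustration with hypothetical function names; masking padded entries with a large negative score is a common convention and is assumed here, not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, mask=None):
    """Scaled dot-product attention for one query vector.
    mask[i] is False for padded entries, which receive zero weight."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys
    ]
    if mask is not None:
        # Assumed convention: a large negative score removes padded entries.
        scores = [s if keep else -1e9 for s, keep in zip(scores, mask)]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

# Target agent attends to itself, one neighbor, and one padded slot.
output, weights = attend(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    values=[[1.0], [2.0], [100.0]],
    mask=[True, True, False],
)
```

The padded slot contributes nothing to the output even though its key would otherwise score highest.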

III-B3 Agent-map Encoder

The relationship between the target agent and the map is represented as a graph, where the target agent can attend to all the elements in the map. We use another Transformer layer to model the agent-map relationship, as seen in Fig. 2, and we modify the multi-head attention to multi-modal attention as shown in Eq. 5. Specifically, we do not concatenate the results from individual heads and project the concatenated vector to a low-dimensional one, but instead, we directly output the results of individual heads, and the final trajectory outputs are conditioned on the individual heads.

$$\text{MultiModal}(Q, K, V) = \{\text{head}_1, \dots, \text{head}_h\}, \quad \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V), \tag{5}$$

where $h$ is the number of modes (heads) and $W_i^Q$, $W_i^K$, $W_i^V$ are the projection matrices of the $i$th head.

The output of the agent-map encoder is a mode-wise feature, which means each mode has a different feature, corresponding to a different relationship between the target agent and map. To ensure the diversity of these modes, i.e., attending to different parts of the map, we only back-propagate the loss through the individual head that is closest to the ground truth in terms of final displacement error. In the agent-map encoder, the query is the interaction feature from the agent-agent encoder and the key and value are the feature vectors from the map encoder.
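The difference from standard multi-head attention is only in what happens after the heads are computed: the head outputs are returned separately instead of being concatenated and projected. A minimal sketch, assuming small randomly initialized projection matrices and hypothetical function names (the feed-forward and normalization layers are omitted):

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]

def multi_modal_attention(query, keys, values, heads):
    """`heads` holds one (W_q, W_k, W_v) triple per mode. Returns one
    output vector per head; the outputs are NOT concatenated."""
    outputs = []
    for w_q, w_k, w_v in heads:
        q = matvec(w_q, query)
        ks = [matvec(w_k, k) for k in keys]
        vs = [matvec(w_v, v) for v in values]
        outputs.append(head_attention(q, ks, vs))
    return outputs

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_head, num_modes, num_waypoints = 8, 4, 6, 5  # illustrative sizes
heads = [
    tuple(random_matrix(d_head, d_model, rng) for _ in range(3))
    for _ in range(num_modes)
]
interaction_feature = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
waypoint_features = [
    [rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(num_waypoints)
]
mode_features = multi_modal_attention(
    interaction_feature, waypoint_features, waypoint_features, heads
)
```

Each entry of `mode_features` would feed one predicted trajectory, which is what allows the attention weights of a head to be inspected per mode.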

III-B4 Score and Trajectory Decoders

The predicted trajectories and their scores are conditioned on three features, i.e., the target agent’s historical state, the interaction among agents, and the target agent’s attention on the map. The interaction feature and target agent feature are first repeated along the mode axis to match with the shape of the multi-modal agent-map feature, and then the three features are concatenated to form a final representation of the driving environment. The trajectory decoder is a mode-wise four-layer MLP with the final layer outputting the coordinates of the trajectory at each timestep. The score decoder follows the same structure, except for the last layer that outputs the score of the predicted trajectories. The scores of all the predicted trajectories are grouped and passed through a softmax layer to yield a probability distribution.
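The assembly of the decoder inputs and the normalization of the scores can be sketched as below. The MLP decoders themselves are left out, the function names are hypothetical, and the concatenation order is an assumption.

```python
import math

def build_decoder_inputs(agent_feat, interaction_feat, mode_feats):
    """Repeat the two mode-independent features for every mode and
    concatenate them with that mode's agent-map feature."""
    return [list(agent_feat) + list(interaction_feat) + list(m) for m in mode_feats]

def scores_to_probabilities(scores):
    """Softmax over the per-mode scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

decoder_inputs = build_decoder_inputs([1.0], [2.0], [[3.0], [4.0]])
probabilities = scores_to_probabilities([0.0, 0.0])
```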

III-C Training Objectives

All the modules in the prediction model are differentiable, and thus we can train the model end-to-end. To predict the trajectories, we use the smooth L1 loss on all predicted time steps. For a data point, the trajectory regression loss is defined as:

$$\mathcal{L}_{reg} = \sum_{t=t_0+1}^{t_0+T_f} \text{SmoothL1}\left(\hat{y}^{k^*}_t - y_t\right),$$

where $\hat{y}^{k}_t$ is the predicted position of the $k$th trajectory at time step $t$, $y_t$ is the ground truth position at time step $t$, $t_0$ is the current time step, and $T_f$ is the number of predicted time steps. Using the variety or Minimum over N (MoN) loss [19], we only calculate the loss between the ground truth and the closest output prediction, and $k^*$ is the index of the predicted trajectory that is closest to the ground truth in terms of L2 distance between the endpoint $\hat{y}^{k}_{t_0+T_f}$ and the ground truth endpoint $y_{t_0+T_f}$:

$$k^* = \arg\min_{k \in \{1, \dots, K\}} \left\| \hat{y}^{k}_{t_0+T_f} - y_{t_0+T_f} \right\|_2 .$$

Since different predicted trajectories are conditioned on the different attention heads in the agent-map encoder, only the head that corresponds to the closest trajectory gets updated, making the heads attend to different parts of the map and ensuring the diversity of the heads.
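A minimal sketch of this winner-takes-all regression loss follows. The function names are hypothetical, and two details are assumptions rather than facts from the text: the smooth L1 transition point `beta=1.0`, and summing (rather than averaging) over time steps and coordinates.

```python
import math

def smooth_l1(error, beta=1.0):
    """Smooth L1 penalty on a scalar error: quadratic near zero, linear beyond beta."""
    magnitude = abs(error)
    if magnitude < beta:
        return 0.5 * magnitude * magnitude / beta
    return magnitude - 0.5 * beta

def best_mode(trajectories, ground_truth):
    """Index of the trajectory whose endpoint is nearest (L2) to the true endpoint."""
    end = ground_truth[-1]
    return min(
        range(len(trajectories)),
        key=lambda k: math.dist(trajectories[k][-1], end),
    )

def regression_loss(trajectories, ground_truth):
    """Smooth L1 loss of the closest trajectory only; other modes get no loss."""
    k = best_mode(trajectories, ground_truth)
    loss = sum(
        smooth_l1(px - gx) + smooth_l1(py - gy)
        for (px, py), (gx, gy) in zip(trajectories[k], ground_truth)
    )
    return k, loss

ground_truth = [(1.0, 0.0), (2.0, 0.0)]
predictions = [
    [(1.0, 1.0), (2.0, 3.0)],  # endpoint is 3 m off
    [(1.0, 0.5), (2.0, 0.0)],  # endpoint matches
]
winner, loss = regression_loss(predictions, ground_truth)
```

Because only the winning mode contributes to the loss, gradients reach only the attention head that produced it.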

The scoring loss is the cross entropy loss between the ground truth scores (probability distribution) $q$ and the predicted probability distribution $p$. For a data point, the scoring loss is defined as:

$$\mathcal{L}_{score} = -\sum_{k=1}^{K} q^k \log p^k,$$

where the ground truth distribution is defined as:


The total loss is a weighted sum of the trajectory regression loss and the scoring loss:

$$\mathcal{L} = \mathcal{L}_{reg} + \lambda \mathcal{L}_{score},$$

where $\lambda$ is a hyperparameter to balance the two learning objectives.
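The combination of the two objectives can be sketched as follows. Two points are assumptions made only for illustration: the target distribution puts all probability mass on the closest mode, and the weight multiplies the scoring term. The function names are hypothetical.

```python
import math

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross entropy between a target and a predicted distribution."""
    return -sum(
        q * math.log(max(p, eps)) for q, p in zip(target_probs, predicted_probs)
    )

def total_loss(reg_loss, score_loss, weight=0.5):
    """Weighted sum of the two objectives (weight on the scoring term is assumed)."""
    return reg_loss + weight * score_loss

# Assumed target for illustration: all mass on the closest mode (index 1).
target = [0.0, 1.0, 0.0]
predicted = [0.2, 0.5, 0.3]
score_loss = cross_entropy(target, predicted)
loss = total_loss(reg_loss=0.125, score_loss=score_loss)
```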

IV Experiments

IV-A Experimental Setup

IV-A1 Dataset

The proposed method is validated on the Argoverse Motion Forecasting dataset [2], which contains 324,557 real-world driving scenarios for training and validation. For each scenario, five-second trajectory sequences of each tracked object sampled at 10 Hz are provided, and the map information is represented as a set of lane centerlines, each composed of a set of waypoints. The prediction task is to forecast the future possible trajectories of the target agent in a scenario over the next 3 seconds, given the 2-second historical trajectory of the target agent, the trajectories of its neighboring agents, and the map context. The whole dataset is split into 205,942 training, 39,472 validation, and 78,143 testing sequences, respectively. We train the prediction model on the training set and test it on the standard testing set to benchmark the performance of the model.

IV-A2 Metrics

The performance of the prediction model is evaluated using standard evaluation metrics: minimum average displacement error (minADE), minimum final displacement error (minFDE), brier-minFDE, and miss rate (MR). minADE and minFDE are two common distance-based metrics; minADE reports the average displacement error between the best-predicted trajectory and the ground truth over all time steps, and minFDE reports the displacement error at the endpoint. The best-predicted trajectory refers to the trajectory that has the minimum endpoint error. To evaluate the scoring function of the model, we employ another metric: brier-minFDE is defined as the sum of minFDE and the Brier score $(1-p)^2$, where $p$ is the probability of the best-predicted trajectory. Additionally, the miss rate (MR) is reported, which is defined as the ratio of the scenarios in which none of the endpoints of predicted trajectories are within 2.0 meters of the ground truth.
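The four metrics can be computed per scenario as in the sketch below, which follows the definitions given above. The function name and the dictionary layout are illustrative choices; the miss rate of a dataset is the fraction of scenarios whose `miss` flag is true.

```python
import math

def evaluate_scenario(trajectories, probabilities, ground_truth, miss_threshold=2.0):
    """Metrics for one scenario; the best trajectory has the smallest endpoint error."""
    endpoint_errors = [math.dist(t[-1], ground_truth[-1]) for t in trajectories]
    best = min(range(len(trajectories)), key=lambda k: endpoint_errors[k])
    step_errors = [math.dist(p, g) for p, g in zip(trajectories[best], ground_truth)]
    min_fde = endpoint_errors[best]
    return {
        "minADE": sum(step_errors) / len(step_errors),
        "minFDE": min_fde,
        "brier-minFDE": min_fde + (1.0 - probabilities[best]) ** 2,
        "miss": min_fde > miss_threshold,
    }

ground_truth = [(0.0, 0.0), (1.0, 0.0)]
predictions = [
    [(0.0, 1.0), (1.0, 1.0)],  # constant 1 m offset
    [(0.0, 0.0), (4.0, 0.0)],  # endpoint is 3 m off
]
metrics = evaluate_scenario(predictions, [0.6, 0.4], ground_truth)
```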

Fig. 3: The qualitative motion forecasting results of the proposed model on the Argoverse validation set. The historical trajectory of the target agent is in red and those of the surrounding agents are in blue; the predicted trajectories are in yellow and the ground truth trajectory is in green.

IV-B Implementation Details

IV-B1 Input and Output

For the map information, we search for the 40 lanes closest to the current position of the target agent, each lane with 10 waypoints. For the agent information, we search for 10 neighboring agents within 30 meters of the target agent and organize them into a tensor along with the target agent. The missing lanes or agents in the tensors are padded with zeros and masked out when calculating the attention. The historical horizon is 2 seconds ($T_h = 20$ timesteps at 10 Hz) and the prediction horizon is 3 seconds ($T_f = 30$ timesteps). The output of the model is $K$ possible trajectories (each a sequence of coordinates), with a matching probability for each trajectory. The number of prediction modes is set as $K = 6$.
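Zero-padding to a fixed size with an accompanying mask can be sketched as follows. The helper name is hypothetical; the resulting mask is what an attention layer would use to ignore the padded slots.

```python
def pad_with_mask(items, max_items, feature_dim):
    """Pad a list of feature vectors with zero vectors up to `max_items`.
    Returns (padded, mask), where mask is True for real entries."""
    kept = items[:max_items]
    padding = [[0.0] * feature_dim for _ in range(max_items - len(kept))]
    mask = [True] * len(kept) + [False] * len(padding)
    return kept + padding, mask

# Two observed neighbors padded to the fixed size of ten agents.
neighbors = [[1.0, 2.0], [3.0, 4.0]]
padded, mask = pad_with_mask(neighbors, max_items=10, feature_dim=2)
```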

IV-B2 Network Structure

The dimensions of the embedded agent and map waypoint features are both 256. In the agent-agent encoder, the number of heads in the multi-head attention is 6, and the feed-forward network first projects the feature vector to 1024 dimensions and then reduces it to 256 dimensions. In the agent-map encoder, the number of modes (heads) in the multi-modal attention is 6 and the remaining structure is the same as the agent-agent encoder. The final feature tensor obtained from the environment encoders contains one feature vector per mode and is then fed into the trajectory and score decoders to produce the trajectory and score per mode, respectively. All the activation functions in the dense layers are ELU, and all the fully connected layers are followed by dropout layers with a dropout rate of 0.1 to mitigate overfitting. The total number of parameters of the model is 6,328,125.

IV-B3 Training

The weighting hyperparameter in the loss function is set to 0.5 after some trials. We use the Nadam optimizer with a learning rate that starts at 1e-4 and decays by 50% after every 20 epochs. To stabilize training, we use gradient clipping with a threshold of 5 (by norm). The number of training epochs is 100 and the batch size is 64. No data augmentation is used, and training one epoch on the training set takes about 10 minutes using TensorFlow with an NVIDIA RTX 2080Ti GPU.
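The learning rate schedule and the clipping rule can be written out in a few lines. This is an illustrative sketch with hypothetical helper names, not the training script: it assumes zero-based epoch indices, a staircase decay, and clipping applied to a single flat gradient vector.

```python
import math

def learning_rate(epoch, initial=1e-4, decay=0.5, step=20):
    """Step decay: halve the learning rate after every `step` epochs (0-based epoch)."""
    return initial * decay ** (epoch // step)

def clip_by_norm(gradient, max_norm=5.0):
    """Rescale the gradient so that its L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    return [g * max_norm / norm for g in gradient]

schedule = [learning_rate(epoch) for epoch in (0, 19, 20, 99)]
clipped = clip_by_norm([6.0, 8.0])  # norm 10 is scaled down to norm 5
```

Under these assumptions the rate is halved four times over the 100 epochs, ending at 6.25e-6.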

IV-C Results

IV-C1 Qualitative Results

Fig. 3 shows some representative examples of the motion forecasting results given by our prediction model. The model is capable of outputting multiple possible trajectories that are diverse and compliant with the structure of the map. The best-predicted trajectory is very close to the ground-truth one while the model maintains the ability to predict other possible trajectories with varying speed profiles or directions. The qualitative results demonstrate the effectiveness of our proposed model on different complex urban driving scenarios including left-turn, right-turn, intersection, etc.

Method                          minADE (m)  minFDE (m)  brier-minFDE  Miss rate
PRIME [18]                      1.2187      1.5582      2.0978        0.1150
LaneRCNN [22]                   0.9038      1.4526      2.1470        0.1232
TNT [24]                        0.9097      1.4457      2.1401        0.1656
Multi-head attention [14]       0.9973      1.4209      2.1154        0.1308
LaneGCN [12]                    0.8679      1.3550      2.0495        0.1597
mmTransformer [13]              0.8436      1.3383      2.0328        0.1540
HOME [7]                        0.8904      1.2919      1.8601        0.0846
Ours (Multi-modal Transformer)  0.8372      1.2905      1.9393        0.1429
TABLE I: The quantitative results in comparison with existing methods on the Argoverse benchmark (test set)

Here, we visualize the target agent’s attention on the map waypoints, as seen in Fig. 4, to demonstrate the interpretability of our prediction model. The attention of each mode (head) to the map waypoints is represented as attention scores, and the waypoints with scores greater than 0.01 are displayed on the map according to different modes. In the given two cases of left-turn scenarios, we can notice that more attention is placed on the left-turn lane when the model predicts a left-turn trajectory, and likewise, more attention is placed on the straight lane when the model predicts a go-straight trajectory. The results show that the proposed multi-modal attention can automatically learn to extract the possible goals on the map, and thus the predicted trajectories can be conditioned on the different goals.

Fig. 4: The visualization of the attention scores on the map waypoints for different modes. The darker red means more attention on the waypoint.

IV-C2 Quantitative Results

The proposed method is evaluated with the quantitative metrics previously defined and compared against the state-of-the-art methods on the Argoverse benchmark (test set). We only report the results with six forecasted trajectories ($K = 6$), which are summarized in Table I. Our proposed method achieves the best prediction accuracy in terms of the two distance error-based metrics (minADE and minFDE). The brier-minFDE and miss rate of our model are slightly worse than those of HOME [7]. This is because HOME is optimized for minimizing the miss rate, and the weight of the scoring term in the loss function needs further tuning in training our model to improve the scoring performance. It is also worth noting that our model has a smaller size and a simpler training process, which eases the burden of pre- or post-processing and enables fast inference.

IV-D Ablation Study

We conduct an ablation study to evaluate and analyze the influence of map structure and the contributions of our proposed multi-modal attention to the final prediction accuracy. For the map structure, we investigate two different levels of representations, i.e., lane and waypoint. To encode the feature of a map lane, we use global max-pooling to aggregate the features of all the waypoints in a lane. For multi-modal prediction, in addition to our proposed multi-modal attention, we take the ensemble method trained with the variety loss as a comparison. It uses an ensemble of trajectory decoders to output different trajectories from the same extracted environment feature vector. All these methods are tested on the Argoverse standard test set with six predicted trajectories. From the results given in Table II, we can conclude that our proposed method with map waypoints and multi-modal agent-map attention delivers the best prediction accuracy. Using the map waypoints brings a higher map resolution and reduces the loss of information. Moreover, using the proposed multi-modal attention achieves not only better prediction accuracy but also better interpretability, as shown in the previous section.

Map Multi-modal minADE (m) minFDE(m)
Lane Waypoint Attention Ensemble
0.8372 1.2905
0.8461 1.3097
0.8512 1.3199
0.8604 1.3373
TABLE II: Ablation study of the map structure and multi-modal prediction method on the Argoverse test set

V Conclusions

In this paper, we propose a multi-modal trajectory prediction model based on the Transformer structure. We employ a multi-head attention Transformer layer to model the relationship among interacting agents and introduce a multi-modal attention Transformer layer to extract the different relationships between the target agent and map waypoints, which determines the final trajectory outputs. Comprehensive experiments on the Argoverse motion dataset reveal the effectiveness of our model with competitive accuracy, better interpretability, yet a simple structure and training process.


  • [1] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov (2020) MultiPath: multiple probabilistic anchor trajectory hypotheses for behavior prediction. In Conference on Robot Learning, Cited by: §II-B.
  • [2] M. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al. (2019) Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8748–8757. Cited by: §IV-A1.
  • [3] H. Cui, V. Radosavljevic, F. Chou, T. Lin, T. Nguyen, T. Huang, J. Schneider, and N. Djuric (2019) Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In 2019 International Conference on Robotics and Automation (ICRA), pp. 2090–2096. Cited by: §II-A, §II-B.
  • [4] B. Dong, H. Liu, Y. Bai, J. Lin, Z. Xu, X. Xu, and Q. Kong (2021) Multi-modal trajectory prediction for autonomous driving with semantic map and dynamic graph attention network. In Machine Learning for Autonomous Driving Workshop at the 34th Conference on Neural Information Processing Systems, Cited by: §II-A.
  • [5] L. Fang, Q. Jiang, J. Shi, and B. Zhou (2020) Tpnet: trajectory proposal network for motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6797–6806. Cited by: §II-B.
  • [6] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid (2020) Vectornet: encoding hd maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11525–11533. Cited by: §I, §II-A.
  • [7] T. Gilles, S. Sabatini, D. Tsishkou, B. Stanciulescu, and F. Moutarde (2021) HOME: heatmap output for future motion estimation. arXiv preprint arXiv:2105.10968. Cited by: §IV-C2, TABLE I.
  • [8] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi (2018) Social gan: socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2255–2264. Cited by: §II-B.
  • [9] Z. Huang, J. Wu, and C. Lv (2021) Driving behavior modeling using naturalistic human driving data with inverse reinforcement learning. IEEE Transactions on Intelligent Transportation Systems. Cited by: §I.
  • [10] B. Ivanovic, K. Leung, E. Schmerling, and M. Pavone (2020) Multimodal deep generative models for trajectory prediction: a conditional variational autoencoder approach. IEEE Robotics and Automation Letters 6 (2), pp. 295–302. Cited by: §II-B.
  • [11] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker (2017) Desire: distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 336–345. Cited by: §II-B.
  • [12] M. Liang, B. Yang, R. Hu, Y. Chen, R. Liao, S. Feng, and R. Urtasun (2020) Learning lane graph representations for motion forecasting. In European Conference on Computer Vision, pp. 541–556. Cited by: §II-A, §II-B, TABLE I.
  • [13] Y. Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou (2021) Multimodal motion prediction with stacked transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7577–7586. Cited by: §II-B, TABLE I.
  • [14] J. Mercat, T. Gilles, N. El Zoghby, G. Sandou, D. Beauvois, and G. P. Gil (2020) Multi-head attention for multi-modal joint vehicle motion forecasting. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9638–9644. Cited by: TABLE I.
  • [15] S. Mozaffari, O. Y. Al-Jarrah, M. Dianati, P. Jennings, and A. Mouzakitis (2020) Deep learning-based vehicle behavior prediction for autonomous driving applications: a review. IEEE Transactions on Intelligent Transportation Systems. Cited by: §I.
  • [16] T. Phan-Minh, E. C. Grigore, F. A. Boulton, O. Beijbom, and E. M. Wolff (2020) Covernet: multimodal behavior prediction using trajectory sets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14074–14083. Cited by: §II-B.
  • [17] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone (2020) Trajectron++: dynamically-feasible trajectory forecasting with heterogeneous data. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pp. 683–700. Cited by: §II-A.
  • [18] H. Song, D. Luan, W. Ding, M. Y. Wang, and Q. Chen (2021) Learning to predict vehicle trajectories with model-based planning. arXiv preprint arXiv:2103.04027. Cited by: §II-B, TABLE I.
  • [19] L. A. Thiede and P. P. Brahma (2019) Analyzing the variety loss in the context of probabilistic trajectory prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9954–9963. Cited by: §II-B, §III-C.
  • [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §III-B2.
  • [21] M. Ye, T. Cao, and Q. Chen (2021) TPCN: temporal point cloud networks for motion forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11318–11327. Cited by: §II-B.
  • [22] W. Zeng, M. Liang, R. Liao, and R. Urtasun (2021) LaneRCNN: distributed representations for graph-centric motion forecasting. arXiv preprint arXiv:2101.06653. Cited by: TABLE I.
  • [23] L. Zhang, P. Su, J. Hoang, G. C. Haynes, and M. Marchetti-Bowick (2020) Map-adaptive goal-based trajectory prediction. In Conference on Robot Learning, Cited by: §II-B.
  • [24] H. Zhao, J. Gao, T. Lan, C. Sun, B. Sapp, B. Varadarajan, Y. Shen, Y. Shen, Y. Chai, C. Schmid, et al. (2020) Tnt: target-driven trajectory prediction. In Conference on Robot Learning (CoRL), Cited by: §I, §II-B, TABLE I.