Diverse and Admissible Trajectory Forecasting through Multimodal Context Understanding

Multi-agent trajectory forecasting in autonomous driving requires an agent to accurately anticipate the behaviors of the surrounding vehicles and pedestrians, for safe and reliable decision-making. Due to partial observability over the goals, contexts, and interactions of agents in these dynamical scenes, directly obtaining the posterior distribution over future agent trajectories remains a challenging problem. In realistic embodied environments, each agent's future trajectories should be diverse, since multiple plausible sequences of actions can be used to reach its intended goals, and they should be admissible, since they must obey physical constraints and stay in drivable areas. In this paper, we propose a model that fully synthesizes multiple input signals from the multimodal world (the environment's scene context and interactions between multiple surrounding agents) to best model all diverse and admissible trajectories. We offer new metrics to evaluate the diversity of trajectory predictions, while ensuring admissibility of each trajectory. Based on our new metrics as well as those used in prior work, we compare our model with strong baselines and ablations across two datasets and show a 35% performance improvement over the current baselines.




1 Introduction

Trajectory forecasting is an important problem in autonomous driving scenarios, where an autonomous vehicle must anticipate the behavior of other surrounding agents (e.g., vehicles and pedestrians) within a dynamically changing environment, in order to plan its own actions accordingly. However, since none of the goals, contexts, or interactions are directly observed, predicting future trajectories is a challenging problem [40, 28, 35]. It necessitates both the estimation of plausible agent actions based on observable environmental features (e.g., road structures, agent interactions) and the simulation of agents' hypothetical future trajectories toward their intended goals. In realistic embodied environments, there are multiple plausible sequences of actions that an agent can take to reach its intended goals. However, each trajectory must obey physical constraints (e.g., Newton's laws) and stay in statistically plausible locations in the environment (i.e., the drivable areas). In this paper, we refer to these attributes as diverse and admissible trajectories, respectively, and illustrate some examples in Fig. 1.

Achieving diverse and admissible trajectory forecasting for autonomous driving allows each agent to make the best predictions, by taking into account all valid actions that other agents could take. In addition, it allows each agent to assess the surrounding situation to ensure safety and prevent accidents.

Figure 1: Diverse and admissible trajectory forecasting. Based on the existing context, there can be multiple valid hypothetical futures. Therefore, the predicted future trajectory distribution should have multiple modes representing multiple plausible goals (diversity) while at the same time assigning low density to the implausible trajectories that either conflict with the other agents or are outside valid drivable areas (admissibility).

To predict a diverse set of admissible trajectories, each agent must understand its multimodal environment, consisting of the scene context as well as interactions between multiple surrounding agents. While the scene context gives direct information about regions an agent can drive in, observation of other agents’ trajectories can provide additional environmental context. For example, conceptual constraints over the agent’s motion (e.g., traffic laws, road etiquette) may be inferred from the motion of the surrounding agents. Therefore, the model’s ability to extract and meaningfully represent multimodal cues is crucial.

Concurrently, another challenging aspect of trajectory forecasting lies in encouraging models to make diverse predictions about future trajectories. However, due to the high cost of data collection, most public datasets are not explicitly annotated with multiple future trajectories [18, 8, 21]. Vanilla predictive models that fit future trajectories based only on the existing annotations would severely underestimate the diversity of all possible trajectories. In addition, measuring the quality of predictions using existing annotation-based measures (e.g., displacement errors [30]) does not faithfully score diverse and admissible trajectory predictions.

As a step towards multimodal understanding for diverse trajectory forecasting, our contribution is four-fold.

  1. We propose a model that addresses the lack of diversity and admissibility for trajectory forecasting through the understanding of the multimodal environmental context. As illustrated in Fig. 2, our approach explicitly models agent-to-agent and agent-to-scene interactions through “self-attention” [36] among multiple agent trajectory encodings, and a conditional trajectory-aware “visual attention” [39] over the map, respectively. Together with a constrained flow-based decoding trained with symmetric cross-entropy [27], this allows our model to generate diverse and admissible trajectory candidates by fully integrating all environmental contexts.

  2. We propose a new approximation of the true trajectory distribution based on a differentiable drivable-area map. This approximation is used when evaluating our posterior likelihood. Previous approximation methods [27] utilize ground-truth (GT) trajectories to model the real distribution. However, only one GT annotation is available per agent. Our approximation method does not rely on GT samples and empirically facilitates greater diversity in the predicted trajectories while ensuring admissibility.

  3. We propose a new metric, Drivable Area Occupancy (DAO), to evaluate the diversity of the trajectory predictions while ensuring admissibility. This new metric utilizes the drivable-area map, without requiring multiple annotations of trajectory futures. We couple this new metric with standard metrics from prior art, such as Average Displacement Error (ADE) and Final Displacement Error (FDE), to compare our model with existing baselines.

  4. We provide a programmatic set of procedures to convert the nuScenes [6] tracking data into a new dataset for trajectory forecasting. The procedure includes trajectory association, smoothing, imputation, and generation of the drivable-area features. These features are used for approximating the real trajectory distribution and for calculating our new metrics. We set new state-of-the-art performance for multi-agent trajectory forecasting, wherein our model enjoys a 35% performance improvement over the current baselines.

We will publish tools to replicate our data and results, which we hope will advance the study of diverse trajectory forecasting.

Figure 2: Overview of our multimodal attention approach. Best viewed in color. The cross-agent attention module (left) generates an attention map, based on the encoded trajectories of nearby agents. The agent-to-scene attention module (right) generates an attention map over the scene, based on the posterior approximations.

2 Related Work

Multimodal trajectory forecasting requires a detailed understanding of the agent's environment. Many works integrate information from multiple modalities [20, 24], such as RGB images and LiDAR point-clouds to model the surrounding environment [19, 27], and high-dimensional map data to model vehicle lane segmentation [40, 3, 7]. Other methods additionally fuse different combinations of map context [10, 7, 3], LiDAR [19], and RGB [29, 21] with the intention of jointly capturing all interactions between the agents and the environment [1, 14, 31]. Without mechanisms to explicitly model agent-to-agent and agent-to-scene interactions, we hypothesize that these models are unable to capture complex nonlinear interactions in the high-dimensional input space. In this paper, we study and propose methods to explicitly model these interactions, improving performance in trajectory forecasting.

Multi-agent modeling aims to learn representations that summarize the behavior of one agent given its surrounding agents. These interactions are often modeled through either spatial-oriented methods or through neural attention-based methods. Spatial-oriented methods use pooling approaches across individual agent representations [19, 9, 40] and usually take into account inter-agent distances, through a relative coordinate system [28, 22, 15, 3]. Despite their wide usage, spatial-oriented methods are designed to concentrate only on adjacent (spatially close) agents and assume a fixed number of agents in the scene; they also limit the maximum number of agents. Attention-based methods use attention [36] architectures to model multi-agent interaction for applications involving pedestrians [37, 12, 31], sports players [11, 33], indoor robots [25], and vehicle trajectories [35, 21]. In this paper, we use a cross-agent attention module to model the agent-to-agent interaction. Rather than using this information solely for prediction, we additionally generate attended scene context, conditioned on these cross-agent representations. We hypothesize that the attended map context will lead to improved tractability in modeling high-dimensional correlations in the scene. We support this with our empirical results in Section 6.

Diverse trajectory prediction: Many models follow a deterministic trajectory-prediction approach [9, 40] and, therefore, struggle to estimate the diversity in the future trajectories. Some works have applied generative models such as Generative Adversarial Networks (GANs) [13, 14, 31, 40] and Variational Auto Encoders (VAEs) [19] to encourage diverse predictions. However, these approaches focus more on generating and scoring multiple output candidates and focus less on analyzing the diversity across distributional modes.

Trajectory forecasting: Trajectory forecasting has been studied in various domains, spanning marine vessels, aircraft, satellites, motor vehicles, and pedestrians [5, 2, 32, 28, 14]. Tasks involving motor vehicles and pedestrians are especially challenging, due to the high stochasticity that arises from attempting to model complex latent factors (e.g., human intent, "social" agent interactions, and scene context) [8]. Despite some promising empirical results, it remains difficult to evaluate both the diversity and admissibility of predictions. In this paper, we define the task of diverse and admissible trajectory forecasting and provide a new dataset generated from nuScenes [6], a popular image tracking source. We also define new task metrics that specifically assess models on the basis of prediction diversity and admissibility, and we analyze model generalization based on data from multiple domains.

3 Problem Formulation

3.1 Notation and Terminology

We define the terminology that constitutes our problem. An agent is a dynamic on-road object that is represented as a sequence of 2D coordinates, i.e., a spatial position over time. We denote the position of agent $i$ at time $t$ as $s^i_t \in \mathbb{R}^2$. By writing $s^i_{t_1:t_2}$, we represent the sequence of its positions between $t_1$ and $t_2$, and we use the bold $\mathbf{s}^i$ to denote the full sequence of positions for agent $i$. We set $t = 0$ as present, $t \le 0$ as past, and $t > 0$ as prediction or, simply, pred. We often split the sequence into two parts, with respect to the past and pred sub-sequences, denoted $s^i_{past}$ and $s^i_{pred}$, respectively. A scene is high-dimensional structured data that describes the present environmental context around the agent. For this, we utilize a bird's-eye-view array, denoted $\Phi \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ are the sizes of the field around the agent and $C$ is the channel size of the scene; each channel consists of distinct information such as the drivable area, position, and distance encodings.

Combining the scene and all agent trajectories yields an episode. In an episode, there is a variable number $A$ of agents, each of which appears over a different time period between a variable start time and final time. As a result, the episode is the set $\{\Phi, \mathbf{s}^1, \dots, \mathbf{s}^A\}$. In the combined setting, we often use the bold $\mathbf{S}$ to denote the set of agent trajectories in the episode and write $\mathbf{S}_{past}$ or $\mathbf{S}_{pred}$ to represent the set of past or pred segments of the agents. Since $\mathbf{S}_{past}$ and $\Phi$ serve as the observed information cues used for prediction, they are often called the observation, denoted simply as $\mathbf{o} = \{\mathbf{S}_{past}, \Phi\}$. Finally, we may add a subscript to any of these notations to distinguish information from different episodes.

We define diversity to be the level of coverage in a model's predictions across the modes of a distribution representing all possible future trajectories. We denote the model distribution as $q(\mathbf{S}_{pred} \mid \mathbf{o})$ and want the model to generate $K$ candidates, or hypotheses, denoted $\hat{\mathbf{S}}^{(1)}, \dots, \hat{\mathbf{S}}^{(K)}$. We interpret these as a set of independent hypotheses that might have happened, given the same observation. Instead of generating samples from one mode, which we refer to as perturbation, we expect to build a model that generates multiple hypotheses that cover multiple modes.

Finally, we acknowledge that encouraging a model's predictions to be diverse, alone, is not sufficient for accurate and safe output; the model predictions should also lie in the support of the real future trajectory distribution $p(\mathbf{S}_{pred} \mid \mathbf{o})$. Given the observation $\mathbf{o}$, it is futile to predict samples in regions that are physically and statistically implausible to reach. In conclusion, our task is diverse and admissible multi-agent trajectory forecasting: modeling multiple modes of the posterior distribution over the pred trajectories, given the observation, $q(\mathbf{S}_{pred} \mid \mathbf{o})$.

4 Proposed Approach

We hypothesize that future trajectories of human drivers should follow distributions of multiple modes conditioned on the scene context and social behaviors of agents. Therefore, we design our model to explicitly capture both agent-to-scene interactions and cross-agent interactions with respect to each agent of interest. Through our objective function, we explicitly encourage the model to learn a distribution with multiple modes by taking into account past trajectories and attended scene context.

As illustrated in Fig. 3, our model consists of an encoder-decoder architecture. The encoder has two modules to capture cross-agent interactions and existing trajectories. The decoder has three modules: the local scene extractor, the agent-to-scene interaction module, and the flow-based decoding module. Please refer to Fig. 4 for a detailed illustration of our main proposed modules.

Figure 3: Model Architecture. The model consists of an encoder-decoder architecture: the encoder takes as input past agent trajectories and calculates cross-agent attention, and the flow-based decoder predicts future trajectories by attending to scene contexts at each decoding step.

The encoder extracts a past trajectory encoding for each agent, then calculates and fuses the interaction features among the agents. Given the set of past trajectories in an observation, we encode each agent's past trajectory by feeding it to the agent trajectory encoding module. The module utilizes a recurrent neural network (RNN) to summarize the past trajectory. It iterates through the past trajectory with Eq. (1), and its final output (at the present step $t = 0$) is utilized as the agent embedding $h^i$. Collecting the embeddings for all agents, we get $\{h^1, \dots, h^A\}$. We then pass these to the cross-agent interaction module, depicted in Fig. 4, which uses self-attention [36] to generate a cross-agent representation. We linearly transform each agent embedding to get a query-key-value triple $(Q^i, K^i, V^i)$, and we calculate the interaction features through scaled dot-product self-attention over the agents. Finally, the fused agent encoding is calculated by adding the attended features to each agent embedding (see Eq. (2) and Fig. 4). The architectural details of the encoder, including the parameters for the agent encoding RNN and the cross-agent attention structures, are given in the supplemental material.
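The cross-agent interaction module is, at its core, single-head scaled dot-product self-attention over the agent embeddings with a residual connection. The following NumPy sketch illustrates the computation; the matrix names, dimensions, and random inputs are illustrative placeholders, not the model's actual learned parameters:

```python
import numpy as np

def cross_agent_attention(H, Wq, Wk, Wv):
    """Single-head self-attention over agent embeddings H (N x d).

    Each embedding is projected to a query/key/value triple; a softmax
    over the agent-to-agent scores yields attention weights, and the
    attended values are added back to the embeddings (residual
    connection) to form the fused agent encodings."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (N, N) pairwise scores
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)               # row-wise softmax
    return H + A @ V                                 # fused encodings

rng = np.random.default_rng(0)
N, d = 4, 8                                          # 4 agents, 8-dim embeddings
H = rng.normal(size=(N, d))
Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
H_fused = cross_agent_attention(H, Wq, Wk, Wv)
```

With zero value projections the module reduces to the identity on the embeddings, which makes the residual structure easy to verify.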


The decoder takes the final encodings and the scene context as inputs. We first extract the scene feature $\Gamma$ through a ConvNet. The decoder then autoregressively generates the future positions, referring to both a local scene context and a global scene context from the agent-to-scene interaction module. The local scene feature is gathered using bilinear interpolation on a crop of $\Gamma$ corresponding to the agent's current physical position. This feature is then concatenated with the agent encoding and processed through fully-connected layers to make the "local context"; we call this part the local scene extractor. The global scene feature is calculated using visual attention [39] to generate weighted scene features, as shown in Fig. 4. To calculate the attention, we first encode the previously decoded outputs using an RNN in Eq. (3), whose output is used to calculate a pixel-wise attention map at each decoding step, for each agent; the global scene feature (a 1D vector) is gathered by pooling (pixel-wise sum) the attended feature map, as described in Eq. (4) and Fig. 4. Finally, the local context, the global scene feature, and the agent encoding are concatenated to make the "global context" in Eq. (5).

Figure 4: (a) Cross-agent attention. Interaction between each agent is modeled using attention, (b) Cross-agent interaction module. Agent trajectory encodings are corrected via cross-agent attention. (c) Visual attention. Agent-specific scene features are calculated using attention. (d) Agent-to-scene interaction module. Pooled vectors are retrieved from pooling layer after visual attention.

The flow-based decoding module generates the future position at each step. The module utilizes Normalizing Flow [26], a generative modeling method based on a bijective and differentiable mapping. In particular, we choose an autoregressive design [17, 27, 28]. We use fully-connected layers to project the global context down to a 6-dimensional vector, which we split into a 2D shift vector $\hat{\mu}_t$ and a $2 \times 2$ scale matrix $\hat{\sigma}_t$. Next, we transform a standard Gaussian sample $z_t \sim \mathcal{N}(0, I)$ by the bijective and differentiable mapping $s_t = \mu_t + \sigma_t z_t$; the hats on $\mu_t$ and $\sigma_t$ are dropped to indicate that the raw network outputs have passed through the following constraints. To ensure positive definiteness of $\sigma_t$, we apply the matrix exponential using the formula in [4]. Also, to improve the physical admissibility of the prediction, we apply the constraint $\mu_t = \gamma\,(2\hat{s}_{t-1} - \hat{s}_{t-2}) + \hat{\mu}_t$, where $\gamma$ is a model degradation coefficient. When $\gamma = 1$, the constraint is equivalent to Verlet integration [38], used in some previous works [27, 28], which gives the model a perfect constant-velocity (CV) prior. However, we found empirically that the model easily overfits to the dataset when the perfect CV prior is used, and perturbing the CV prior with $\gamma < 1$ prevents overfitting. An analysis of the effect of the degradation coefficient is given in the supplemental material.

Iterating the autoregressive decoding procedure, we get the future trajectory prediction for each agent. Note that by sampling multiple instances of $z$, we can generate multiple future trajectories.
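Under the stated constraints, one decoding step reduces to sampling a Gaussian variable, computing the constrained mean, and mapping through the affine bijection. The NumPy sketch below illustrates this; the `step_net` callable and the default `gamma` value are hypothetical stand-ins for the real context network and tuned coefficient:

```python
import numpy as np

def decode_trajectory(s_prev2, s_prev, step_net, T=6, gamma=0.5, rng=None):
    """One agent's autoregressive flow decoding (sketch).

    step_net is a hypothetical stand-in for the real context network: at
    step t it returns a raw shift mu_hat (2,) and scale sigma_hat (2, 2).
    The degraded Verlet constraint blends a constant-velocity
    extrapolation (weighted by gamma) into the mean before the Gaussian
    sample is mapped through s_t = mu_t + sigma_t @ z_t."""
    rng = rng or np.random.default_rng(0)
    traj = []
    for t in range(T):
        mu_hat, sigma_hat = step_net(t)
        mu = gamma * (2.0 * s_prev - s_prev2) + mu_hat   # gamma = 1 -> pure Verlet
        s_t = mu + sigma_hat @ rng.standard_normal(2)
        traj.append(s_t)
        s_prev2, s_prev = s_prev, s_t
    return np.stack(traj)

# With zero shift, zero scale, and gamma = 1, decoding reduces to a
# constant-velocity rollout:
cv_net = lambda t: (np.zeros(2), np.zeros((2, 2)))
rollout = decode_trajectory(np.array([0.0, 0.0]), np.array([1.0, 0.0]),
                            cv_net, T=3, gamma=1.0)
```

Sampling the decoder repeatedly with fresh noise yields the multiple future hypotheses described above.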

4.1 Drivable Area Map and Approximating the True Trajectory Distribution

In this work, we generate a binary mask feature of size $H \times W$ that denotes the drivable spaces around the agents. We call this feature the drivable area map and utilize it for three different purposes: 1) deriving the approximated true trajectory distribution $\tilde{p}$, 2) calculating the diversity and admissibility measures, and 3) building the scene context input for the model.

In particular, $\tilde{p}$ is a key component in the evaluation of the reverse cross-entropy term in our training objective, Eq. (7). Since this term penalizes the predicted trajectories with respect to the real distribution, the approximation should not underestimate any region of the real distribution, or diversity in the prediction could be erroneously penalized. Previous work on deriving $\tilde{p}$ utilized the ground-truth (GT) trajectories to model the true distribution [27]. However, there is often only one GT annotation available per agent, and an approximation based on the GT may severely assign low probability to some regions of the real distribution. To cope with this problem in the previous methods, we propose a new way to derive $\tilde{p}$ using the drivable area. Our $\tilde{p}$ is defined based on the assumptions that every location in the drivable area is equally probable for future trajectories to appear in, and that locations in the non-drivable area are increasingly less probable in proportion to their distance from the drivable area. To derive it, we first apply the distance transform to the drivable area map, to encode the distance at each non-drivable location. Lastly, we apply a softmax over the entire map to constitute it as a probability distribution. Visualizations of $\tilde{p}$ are available in Fig. 7. Procedures regarding the diversity and admissibility measures are discussed in Section 5.3; details on deriving $\tilde{p}$ and the scene context input, as well as additional visualizations and qualitative results, are given in the supplemental material.
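This construction can be sketched in a few lines: a distance transform over the binary mask, followed by a softmax over the negated distances. The brute-force distance computation below stands in for an efficient distance transform on full-size maps, and the `temperature` parameter is an illustrative assumption:

```python
import numpy as np

def drivable_prior(drivable, temperature=1.0):
    """Approximate the true trajectory distribution from a binary
    drivable-area mask (H x W).

    Drivable cells sit at distance 0 and therefore share the highest
    probability; non-drivable cells receive exponentially less mass the
    farther they lie from the nearest drivable cell."""
    H, W = drivable.shape
    road = np.argwhere(drivable > 0).astype(float)        # drivable (row, col)
    cells = np.indices((H, W)).reshape(2, -1).T.astype(float)
    dist = np.sqrt(((cells[:, None, :] - road[None, :, :]) ** 2).sum(-1)).min(1)
    logits = -dist.reshape(H, W) / temperature            # negated distances
    p = np.exp(logits - logits.max())
    return p / p.sum()                                    # proper distribution

road_mask = np.zeros((5, 5))
road_mask[:, 2] = 1.0                                     # toy vertical road
p_tilde = drivable_prior(road_mask)
```

On this toy mask every on-road cell shares the maximum probability, and off-road cells decay with their distance to the road, mirroring the two assumptions stated above.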

4.2 Learning

Our model learns to predict the joint distribution over the future trajectories of the agents present in a given episode. In detail, we focus on predicting the conditional distribution $q(\mathbf{S}_{pred} \mid \mathbf{o})$, where the future trajectory depends on the set of observations of the past trajectories and the scene context in an episode. As described in the previous sections, our model utilizes a bijective and differentiable mapping, parameterized by a learnable parameter $\theta$, between the future trajectory and a Gaussian prior to generate and evaluate future trajectories. This technique, commonly called 'normalizing flow', enables our model not only to generate multiple candidate samples of the future, but also to evaluate the ground-truth trajectory under the predicted distribution by using the change-of-variables formula in Eq. (6).


As a result, our model can learn to close the discrepancy between the predicted distribution $q$ and the real-world distribution $p$. In particular, we choose to minimize the combination of the forward and reverse cross-entropies $H(p, q)$ and $H(q, p)$, also known as the 'symmetric cross-entropy', between the two distributions in Eq. (7), by optimizing the model parameter $\theta$. Minimizing the symmetric cross-entropy, an objective also used in [27], allows the model to learn to generate diverse yet plausible trajectories.


To realize this, we gather the ground-truth trajectories and scene contexts from the dataset, which we assume to reflect the real distribution $p$ well, then optimize the model parameter $\theta$ such that 1) the density of the ground-truth future trajectories under the predicted distribution is maximized and 2) the density of the predicted samples under the real distribution is also maximized, as described in Eq. (8).


This symmetric combination of the two cross-entropies guides our model to predict a distribution that covers all plausible modes of the future trajectory while penalizing bad samples that are unlikely under the real distribution $p$. However, one major problem inherent in this setting is that we cannot actually evaluate $p$ in practice. To cope with this problem, several ways of approximating $p$ with a separate model have been suggested [27]. In this paper, we propose a new way of modeling the approximation $\tilde{p}$ using a discrete grid map derived from the drivable area map in our dataset, which considers every drivable location around the ego-vehicle to be equally probable for future trajectories. Details about our new $\tilde{p}$ are included in the supplemental material. Applying bilinear interpolation around each prediction time-step of a generated sample, we obtain an evaluation of $\tilde{p}$ that is differentiable with respect to the model parameter $\theta$. Our overall loss function is:


where $B$ is the batch size, $A_b$ is the number of agents in the $b$-th episode, and $K$ is the number of candidates sampled per agent. Since this objective is fully differentiable with respect to the model parameter $\theta$, we train our model using the Adam optimizer [16], a popular variant of the stochastic gradient descent algorithm. We also use adaptive learning-rate scheduling and early stopping. Optimization details are included in the supplementary material.
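A minimal sketch of this objective, assuming the log-density of the ground truth under the model has already been computed via the change-of-variables formula: the reverse term evaluates the grid-based approximation at the sampled positions by bilinear interpolation (a NumPy stand-in for the differentiable version used during training):

```python
import numpy as np

def bilinear_logp(grid_logp, xy):
    """Bilinear lookup of the grid log-probability at continuous (x, y)
    pixel positions; in an autograd framework the same lookup is
    differentiable with respect to xy."""
    x0 = np.floor(xy[:, 0]).astype(int)
    y0 = np.floor(xy[:, 1]).astype(int)
    wx, wy = xy[:, 0] - x0, xy[:, 1] - y0
    return ((1 - wx) * (1 - wy) * grid_logp[y0, x0]
            + wx * (1 - wy) * grid_logp[y0, x0 + 1]
            + (1 - wx) * wy * grid_logp[y0 + 1, x0]
            + wx * wy * grid_logp[y0 + 1, x0 + 1])

def symmetric_ce(logq_gt, grid_logp, sample_xy):
    """Forward term: mean negative model log-density of ground-truth
    futures. Reverse term: mean negative grid log-probability at the
    sampled prediction positions."""
    return -np.mean(logq_gt) - np.mean(bilinear_logp(grid_logp, sample_xy))

grid = np.log(np.full((4, 4), 0.25))                 # toy uniform log-prob grid
loss = symmetric_ce(np.array([-1.0]), grid, np.array([[0.5, 0.5]]))
```

On a uniform grid the interpolated value equals the cell value everywhere, which makes the two terms of the toy loss easy to check by hand.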

5 Experimental Setup

The primary goal of the following experiments is to evaluate our model, baselines, and ablations on the following criteria: 1) leveraging mechanisms that explicitly model agent-to-agent and agent-to-scene interactions (experiments 1 and 2); 2) producing diverse trajectory predictions, while obeying admissibility constraints on the trajectory candidates, given different approximation methods for the true trajectory distribution (experiment 3); 3) remaining robust to an increasing number of agents in the scene (agent complexity; experiment 4); and 4) generalizing to other domains (experiment 5). We implement all models in PyTorch [23] and train them on NVIDIA TITAN X GPUs. Procedural, architectural, and training details are included in the supplementary material.

5.1 Dataset

Most current autonomous driving trajectory forecasting datasets are insufficient for evaluating predictions, due to their small size and limited number of multimodal cues [21].

The Argoverse motion forecasting dataset consists of a large volume of forecasting data with drivable-area annotations, but lacks certain modalities, i.e., LiDAR point-clouds and map images. We have generated motion forecasting datasets from the nuScenes and Argoverse tracking datasets using their original annotations, through programmatic trajectory association, smoothing, and imputation. Unlike the Argoverse forecasting dataset, this new dataset provides additional context information from LiDAR point-clouds and map information, for better forecasting performance. We utilize the trajectory records, vectorized geometry, and drivable-area annotations as modalities for our research. In order to make the experimental setup of nuScenes similar to Argoverse, we crop each sequence to be 5 seconds long in total: 3 seconds for prediction and 2 seconds for observation, with a sampling rate of 2 Hz. Background information on the nuScenes and Argoverse trajectory data generation is included in the supplementary material. By evaluating the baselines and our models on both real-world datasets, we provide complementary validation of each model's diversity, admissibility, and generalizability across domains. Data extraction procedures and quantitative results on the simulated data are discussed in the supplementary material.
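The 5-second windowing can be sketched as follows (a simplified helper assuming a fully observed, already-imputed track; the real pipeline also performs association, smoothing, and imputation):

```python
import numpy as np

def split_window(track, hz=2, obs_sec=2, pred_sec=3):
    """Split one agent track (T x 2 array sampled at `hz` Hz) into the
    2 s observation and 3 s prediction segments of a 5 s window."""
    n_obs, n_pred = obs_sec * hz, pred_sec * hz
    assert len(track) >= n_obs + n_pred, "track shorter than the window"
    return track[:n_obs], track[n_obs:n_obs + n_pred]

past, future = split_window(np.zeros((10, 2)))       # 10 samples = 5 s at 2 Hz
```

At 2 Hz this yields 4 observed and 6 predicted positions per agent.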

5.2 Baseline Models

Deterministic baselines: We compare three deterministic models with our approach, to examine our model's ability to capture agent-to-agent interaction: an LSTM-based encoder-decoder [34] (LSTM), convolutional social pooling LSTM (CSP) [9], and a deterministic version of multi-agent tensor fusion (MATF-D) [40]. For our deterministic model, we use an LSTM with our cross-agent attention module in the encoder, which we refer to as the cross-agent attention model (CAM). Because each model is predicated on an LSTM component, we set the capacity to be the same in all cases, to ensure fair comparison.

5.3 Metrics

Measuring diversity and admissibility: We define multiple metrics that provide a thorough interpretation of the behavior of each model in terms of precision, diversity, and admissibility. For each predicted trajectory, we first evaluate the Euclidean errors: the average displacement error (ADE) and the final displacement error (FDE), or Error to denote both. To evaluate a set of $K$ predictions (i.e., precision), we use the average and the minimum Errors over the set: avgError and minError. A large avgError implies that the predictions are spread out, and a small minError implies that at least one of the predictions has high precision. From this observation, we define new evaluation metrics that capture diversity in predictions: the ratios of avgADE to minADE and of avgFDE to minFDE, namely rA and rF. In particular, rF is robust to variability in the magnitude of velocity across predictions, because a high avgFDE and a high minFDE caused by large magnitudes offset each other and only the directional variability remains. As a result, rF provides a handy tool that can distinguish between predictions with multiple modes (diversity) and predictions with a single mode (perturbation). For deterministic models, rA and rF have a value of 1.
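These displacement metrics and the ratios rA and rF can be computed directly from a set of hypotheses; a NumPy sketch with a toy example (the helper name is ours, not from a released evaluation toolkit):

```python
import numpy as np

def displacement_metrics(preds, gt):
    """preds: (K, T, 2) trajectory hypotheses; gt: (T, 2) ground truth.

    Returns avg/min ADE and FDE, plus the diversity ratios
    rA = avgADE / minADE and rF = avgFDE / minFDE."""
    err = np.linalg.norm(preds - gt[None], axis=-1)  # (K, T) per-step errors
    ade, fde = err.mean(axis=1), err[:, -1]
    m = dict(avgADE=ade.mean(), minADE=ade.min(),
             avgFDE=fde.mean(), minFDE=fde.min())
    m["rA"] = m["avgADE"] / m["minADE"]
    m["rF"] = m["avgFDE"] / m["minFDE"]
    return m

gt = np.zeros((3, 2))
hyps = np.stack([np.full((3, 2), [1.0, 0.0]),        # constant 1 m error
                 np.full((3, 2), [3.0, 0.0])])       # constant 3 m error
m = displacement_metrics(hyps, gt)
```

For a deterministic model (all hypotheses identical) avgError equals minError, so rA and rF collapse to 1, as stated above.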


We also report performance on additional metrics designed to capture diversity and admissibility in predictions. We follow [8] in the use of Drivable Area Count (DAC), $(K - m)/K$, where $m$ is the number of predictions that go out of the drivable area and $K$ is the total number of predictions. Next, we propose a new metric, Drivable Area Occupancy (DAO), which measures the percentage of pixels that the predicted trajectories occupy in the drivable area. As shown in Eq. (11), DAO divides the number of drivable-area pixels occupied by predictions by the total number of pixels of the drivable area, both within a pre-defined grid around the ego-vehicle. Due to the nature of DAO and DAC, the number of trajectory hypotheses should be set equally for a fair comparison between models.
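Both drivable-area metrics can be sketched against a binary mask; the rasterization of trajectories to integer pixel indices is assumed to have happened upstream:

```python
import numpy as np

def dac_dao(preds_pix, drivable):
    """preds_pix: (K, T, 2) integer (x, y) pixel indices of K hypotheses;
    drivable: binary H x W drivable-area mask.

    DAC = (K - m) / K, where m counts hypotheses that leave the drivable
    area; DAO = fraction of drivable-area pixels touched by any
    hypothesis (diversity restricted to admissible space)."""
    on_road = drivable[preds_pix[..., 1], preds_pix[..., 0]].astype(bool)
    dac = on_road.all(axis=1).mean()                 # per-hypothesis admissibility
    occupied = np.zeros(drivable.shape, dtype=bool)
    flat = preds_pix.reshape(-1, 2)
    occupied[flat[:, 1], flat[:, 0]] = True
    dao = (occupied & (drivable > 0)).sum() / (drivable > 0).sum()
    return dac, dao

mask = np.zeros((4, 4), dtype=int)
mask[:, 1:3] = 1                                     # 8 drivable pixels
trajs = np.array([[[1, 0], [1, 1], [1, 2]],          # stays on the road
                  [[2, 0], [3, 1], [3, 2]]])         # drifts off the road
dac, dao = dac_dao(trajs, mask)
```

In the toy example one of the two hypotheses leaves the road, and the admissible predictions cover half of the drivable pixels.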

Figure 5: We motivate the need for multiple metrics to assess diversity and admissibility. Case 1: DAO measures are equal, even though the predictions have differing regard for the modes in the posterior distribution. Case 2: rF measures are equal, despite differing regard for the cost of leaving the drivable area. In both cases, it is important to distinguish between the conditions; we do this by using DAO, rF, and DAC together.

We use rF, DAO, and DAC to assess the diversity and admissibility of models. Initially, DAO may seem like a reasonable standalone measure of both diversity and admissibility, as it only counts diversity within a reasonable region of interest. However, DAO by itself cannot distinguish between diversity (Section 3) and arbitrary stochasticity in predictions, as illustrated by Case 1 in Fig. 5: although the DAO measures of both predictions are equal, the causality behind each prediction is different and we must distinguish the two. rF and DAO work in a complementary way, and we therefore use both for measuring diversity. To assure the admissibility of predictions, we use DAC, which explicitly counts off-road predictions, as shown by Case 2 in Fig. 5. As a result, assessing predictions using DAO along with rF and DAC provides a holistic view of the quantity and the quality of diversity in predictions; the characteristics of each metric are summarized in Fig. 6.

For our experiments, we use minADE and minFDE to measure precision, and we use rF, DAC, and DAO to measure both diversity and admissibility. Because the raw DAO values are small fractions of the drivable-area pixels in the grid, we normalize DAO by a constant scaling factor when reporting results. For the multi-agent experiment (experiment 4), relative improvement (RI) is calculated, as we are interested in the relative improvement as the number of agents increases. Unless otherwise specified, the number of hypotheses is set to 12 and minFDE is reported for performance.

Figure 6: Metric quality spectrum. Among our newly proposed metrics, rF measures the spread of predictions in Euclidean distance, and DAO measures diversity only among admissible predictions. DAC flags extreme off-road predictions that defy admissibility.

6 Results and Discussion

In this section, we present experimental results in a range of settings, including comparisons with baselines and ablation studies of our model. We first show the effect of our cross-agent interaction module and agent-to-scene interaction module on model performance, then analyze performance with respect to different numbers of agents and across datasets. All experiments are measured with minADE, minFDE, rF, DAC, and DAO for a holistic interpretation.

Model minADE minFDE
LSTM 1.186 2.408
CSP [9] 1.390 2.676
MATF-D [40] 1.261 2.538
CAM (ours) 1.124 2.318
Table 1: Deterministic models on nuScenes. Our proposed model outperforms the existing baselines.

Effectiveness of cross-agent interaction module: We show the performance of one of our proposed models, CAM, which utilizes our cross-agent attention module, along with three deterministic baselines in Table 1. Each model we test handles agent-to-agent interaction differently. CSP models interaction through layers of convolutional networks, so interaction is computed implicitly within the receptive field of the convolutional layers. MATF-D extends convolutional social pooling with scene information. CAM explicitly models the interaction between each pair of agents using attention. The results show that CAM outperforms the other baselines in both minADE and minFDE, indicating that explicit modeling of agent-to-agent interaction yields better precision than the implicit convolutional modeling used in CSP and MATF-D. Interestingly, CAM also outperforms MATF-D, which utilizes scene information. This suggests that our cross-agent interaction module can learn the geometric structure of the roads from the trajectories of surrounding agents alone.
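The explicit agent-to-agent modeling described above can be sketched as scaled dot-product attention over per-agent trajectory encodings. This is a minimal illustration in the spirit of our cross-agent module, not its exact parameterization; in particular, sharing one encoding for queries, keys, and values is a simplifying assumption:

```python
import numpy as np

def cross_agent_attention(agent_enc):
    """agent_enc: (N, d) per-agent trajectory encodings.
    Returns (N, d) encodings where each agent attends to all agents."""
    d = agent_enc.shape[1]
    scores = agent_enc @ agent_enc.T / np.sqrt(d)   # (N, N) pairwise scores
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over agents
    return weights @ agent_enc                      # interaction-aware encodings
```

Each row of `weights` is a distribution over the N agents, so every output encoding is a convex combination of all agents' encodings, making the interaction explicit rather than mediated by a convolutional receptive field.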

Effectiveness of agent-to-scene interaction module: The performance of stochastic models is compared in Table 2. We experiment with removing scene-processing operations from the decoder to validate the importance of our proposed agent-to-scene interaction module. As mentioned previously, generating multiple modes of samples requires a strong scene-processing module and a diversity-oriented decoder. Our proposed models all outperform the other stochastic baselines in terms of precision. MATF-GAN has a small rF, suggesting that its predictions are mostly unimodal, while other models, such as the VAE-based DESIRE and the flow-based R2P2 and ours, show more spread in their predictions. We note that R2P2 was not designed for the multi-agent setting, which causes it to produce unstable, jittery outputs. Our model achieves the highest rF and, among models with stable outputs, the highest DAO, indicating that our models produce diverse and admissible predictions by accurately utilizing scene context.

Figure 7: Our map loss and corresponding model predictions. Each pixel on our map loss denotes probability of future trajectories; higher probability values are represented by brighter pixels. Our approach generates diverse and admissible future trajectories. More visualizations of qualitative results are provided in the supplementary material.
Model minADE minFDE rF DAO DAC
DESIRE [19] 0.937 1.808 1.754 9.430 0.376
MATF-GAN [40] 1.053 2.124 1.194 5.950 0.391
R2P2-MA [27] 1.185 2.215 1.611 13.50** 0.396
CAM-NF (ours) 0.756 1.386 2.113 11.70 0.400
Local-CAM-NF (ours) 0.772 1.404 2.066 11.70 0.400
Global-CAM-NF (ours) 0.744 1.359 2.103 11.60 0.400
AttGlobal-CAM-NF (ours) 0.638 1.171 2.558 12.28 0.399

Table 2: Stochastic models on nuScenes. **: unstable outputs observed on R2P2-MA.
Model minADE minFDE rF DAO DAC
AttGlobal-CAM-NF(MSE) 0.763 1.390 2.009 12.09 0.400
AttGlobal-CAM-NF 0.638 1.171 2.558 12.28 0.399
Table 3: Optimizing with our map loss outperforms MSE loss on nuScenes.

Effectiveness of new loss: We compare MSE with our drivable-area-based approximation of the true trajectory distribution in Table 3. Training with our map loss shows superior results on most of the reported metrics. In particular, the precision and the diversity of predictions increase drastically, as reflected in minADE, minFDE, and rF, while DAC remains unchanged. Our map loss thus assures admissibility while improving precision and diversity, as the drivable-area-based term provides additional plausible regions for future trajectories.
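The idea behind the drivable-area-based approximation can be illustrated by turning the binary drivable-area mask into a normalized spatial prior that spreads probability mass over drivable pixels. This is only a simplified sketch; the actual approximation used in training is more elaborate:

```python
import numpy as np

def drivable_area_prior(drivable_mask, eps=1e-3):
    """Turn an (H, W) boolean drivable-area mask into a normalized
    probability map: drivable pixels get uniform high mass, off-road
    pixels a small epsilon so the density stays well-defined."""
    p = np.where(drivable_mask, 1.0, eps)
    return p / p.sum()
```

A training objective can then penalize predicted densities that place mass on low-prior (off-road) pixels, which is how a drivable-area term can encourage admissibility without collapsing diversity.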

Complexity from number of agents: We experiment with a varying number of surrounding agents, as shown in Table 4. Across all models, performance improves as the number of agents increases, even though we observe that many surrounding agents do not move significantly. In terms of relative improvement (RI), calculated between 1 agent and 10 agents, our model improves the most, indicating that it makes the most use of the fine-grained trajectories of surrounding agents when generating future trajectories.

Model 1 agent 3 agents 5 agents 10 agents RI(1-10)
LSTM 2.736 2.477 2.442 2.268 17.1%
CSP [9] 2.871 2.679 2.671 2.569 10.5%
DESIRE [19] 2.150 1.846 1.878 1.784 17.0%
MATF GAN [40] 2.377 2.168 2.150 2.011 15.4%
R2P2-MA [27] 2.227 2.135 2.142 2.048 8.0%

AttGlobal-CAM-NF (ours) 1.278 1.158 1.100 0.964 24.6%
Table 4: Multi-agent experiments on nuScenes (minFDE). RI denotes the relative improvement in minFDE between the 1-agent and 10-agent settings. Our approach best models multi-agent interactions.
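The RI column in Table 4 can be reproduced as the percentage reduction in minFDE between the 1-agent and 10-agent settings:

```python
def relative_improvement(fde_1_agent, fde_10_agents):
    """RI(1-10): percentage reduction in minFDE from 1 agent to 10 agents."""
    return 100.0 * (fde_1_agent - fde_10_agents) / fde_1_agent
```

For the LSTM row, `relative_improvement(2.736, 2.268)` gives roughly 17.1, and for our full model `relative_improvement(1.278, 0.964)` gives roughly 24.6, matching the reported RI values.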

Generalizability across datasets: We further compare our model with the baselines across two real-world datasets, nuScenes and Argoverse, to test generalization to different environments. Results are shown in Table 5, where we outperform or achieve results comparable to the baselines. For Argoverse, we additionally outperform MFP3 [35] in minFDE with 6 hypotheses: our full model achieves a minFDE of 0.915, while MFP3 achieves 1.399.

Model Argoverse nuScenes
A (↓) B (↓) C (↑)* D (↑)* E (↑)* A (↓) B (↓) C (↑)* D (↑)* E (↑)*
LSTM 1.441 2.780 1.000 1.786 0.378 1.186 2.408 1.000 1.690 0.391
CSP 1.385 2.567 1.000 1.799 0.379 1.390 2.676 1.000 1.710 0.388
MATF-D 1.344 2.484 1.000 1.768 0.379 1.261 2.538 1.000 1.690 0.384
DESIRE 0.777 1.276 3.642 11.80 0.301 0.937 1.808 1.754 9.430 0.376
MATF-GAN 1.214 2.316 1.099 6.075 0.376 1.053 2.124 1.194 5.950 0.391
R2P2-MA 1.270 2.190 1.589 18.10** 0.381 1.185 2.215 1.611 13.50** 0.396
CAM 1.131 2.504 1.000 1.750 0.389 1.124 2.318 1.000 1.670 0.404
CAM-NF 0.852 1.347 2.763 17.60 0.378 0.756 1.386 2.113 11.70 0.400
Local-CAM-NF 0.807 1.250 2.858 17.00 0.381 0.772 1.404 2.066 11.70 0.400
Global-CAM-NF 0.807 1.241 3.068 16.90 0.380 0.744 1.359 2.103 11.60 0.400
AttGlobal-CAM-NF 0.731 1.126 3.278 15.50 0.383 0.638 1.171 2.558 12.28 0.399
Table 5: Results of baseline models and our proposed model. Local-CAM-NF is an ablation, whereas AttGlobal-CAM-NF is our full proposed model. The metrics are abbreviated as follows: minADE (A), minFDE (B), rF (C), DAO (D), DAC (E). Improvements are indicated by arrows (↓: smaller is better; ↑: larger is better). *: larger is better, as long as A and B are small. **: unstable outputs observed on R2P2-MA.

7 Conclusion

In this paper, we tackled the problem of generating diverse and admissible predictions by understanding each agent's multimodal context. We proposed a model that learns agent-to-agent interactions and agent-to-scene interactions using attention mechanisms, resulting in better prediction in terms of precision, diversity, and admissibility. We also developed a new approximation method that provides richer information about the true trajectory distribution and allows more accurate training of flow-based generative models. Finally, we presented new metrics that provide a holistic view of the quantity and the quality of diversity in predictions, along with a nuScenes trajectory-extraction code release, to support future research in diverse and admissible trajectory forecasting.


  • [1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese (2016) Social lstm: human trajectory prediction in crowded spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 961–971. Cited by: §2.
  • [2] S. Ayhan and H. Samet (2016) Aircraft trajectory prediction made easy with predictive analytics. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 21–30. Cited by: §2.
  • [3] M. Bansal, A. Krizhevsky, and A. Ogale (2018) Chauffeurnet: learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079. Cited by: §2, §2.
  • [4] D. S. Bernstein and W. So (1993) Some explicit formulas for the matrix exponential. IEEE Transactions on Automatic Control 38 (8), pp. 1228–1232. Cited by: §4.
  • [5] P. Borkowski (2017) The ship movement trajectory prediction algorithm using navigational data fusion. Sensors 17 (6), pp. 1432. Cited by: §2.
  • [6] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2019) Nuscenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027. Cited by: item 4, §2.
  • [7] S. Casas, W. Luo, and R. Urtasun (2018) Intentnet: learning to predict intention from raw sensor data. In Conference on Robot Learning, pp. 947–956. Cited by: §2.
  • [8] M. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al. (2019) Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8748–8757. Cited by: §1, §2, §5.3.
  • [9] N. Deo and M. M. Trivedi (2018) Convolutional social pooling for vehicle trajectory prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1468–1476. Cited by: §2, §2, §5.2, Table 1, Table 4.
  • [10] N. Djuric, V. Radosavljevic, H. Cui, T. Nguyen, F. Chou, T. Lin, and J. Schneider (2018) Short-term motion prediction of traffic actors for autonomous driving using deep convolutional networks. arXiv preprint arXiv:1808.05819. Cited by: §2.
  • [11] P. Felsen, P. Lucey, and S. Ganguly (2018) Where will they go? predicting fine-grained adversarial multi-agent motion using conditional variational autoencoders. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 732–747. Cited by: §2.
  • [12] T. Fernando, S. Denman, S. Sridharan, and C. Fookes (2017) Soft + hardwired attention: an LSTM framework for human trajectory prediction and abnormal event detection. CoRR abs/1702.05552. External Links: Link, 1702.05552 Cited by: §2.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • [14] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi (2018) Social gan: socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2255–2264. Cited by: §2, §2, §2.
  • [15] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi (2017) Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pp. 399–404. Cited by: §2.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • [17] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751. Cited by: §4.
  • [18] R. Krajewski, J. Bock, L. Kloeker, and L. Eckstein (2018) The highd dataset: a drone dataset of naturalistic vehicle trajectories on german highways for validation of highly automated driving systems. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 2118–2125. Cited by: §1.
  • [19] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker (2017) Desire: distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 336–345. Cited by: §2, §2, §2, Table 2, Table 4.
  • [20] P. P. Liang, Y. C. Lim, Y. H. Tsai, R. Salakhutdinov, and L. Morency (2019) Strong and simple baselines for multimodal utterance embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Cited by: §2.
  • [21] Y. Ma, X. Zhu, S. Zhang, R. Yang, W. Wang, and D. Manocha (2019) Trafficpredict: trajectory prediction for heterogeneous traffic-agents. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6120–6127. Cited by: §1, §2, §2, §5.1.
  • [22] S. H. Park, B. Kim, C. M. Kang, C. C. Chung, and J. W. Choi (2018) Sequence-to-sequence prediction of vehicle trajectory via lstm encoder-decoder architecture. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1672–1678. Cited by: §2.
  • [23] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS-W, Cited by: §5.
  • [24] H. Pham, P. P. Liang, T. Manzini, L. Morency, and B. Póczos (2019) Found in translation: learning robust joint representations by cyclic translations between modalities. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, Cited by: §2.
  • [25] A. H. Qureshi, Y. Nakamura, Y. Yoshikawa, and H. Ishiguro (2017) Show, attend and interact: perceivable human-robot social interaction through neural attention q-network. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1639–1645. Cited by: §2.
  • [26] D. J. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770. Cited by: §4.
  • [27] N. Rhinehart, K. M. Kitani, and P. Vernaza (2018) R2p2: a reparameterized pushforward policy for diverse, precise generative path forecasting. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 772–788. Cited by: item 1, item 2, §2, §4.1, §4.2, §4, Table 2, Table 4.
  • [28] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine (2019) Precog: prediction conditioned on goals in visual multi-agent settings. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2821–2830. Cited by: §1, §2, §2, §4.
  • [29] C. Rodriguez, B. Fernando, and H. Li (2018) Action anticipation by predicting future dynamic images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §2.
  • [30] A. Rudenko, L. Palmieri, M. Herman, K. M. Kitani, D. M. Gavrila, and K. O. Arras (2019) Human motion trajectory prediction: a survey. arXiv preprint arXiv:1905.06113. Cited by: §1.
  • [31] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, H. Rezatofighi, and S. Savarese (2019) Sophie: an attentive gan for predicting paths compliant to social and physical constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1349–1358. Cited by: §2, §2, §2.
  • [32] I. I. Shapiro (1963) The prediction of satellite orbits. In Dynamics of Satellites/Dynamique des Satellites, pp. 257–312. Cited by: §2.
  • [33] C. Sun, P. Karlsson, J. Wu, J. B. Tenenbaum, and K. Murphy (2019) Stochastic prediction of multi-agent interactions from partial observations. arXiv preprint arXiv:1902.09641. Cited by: §2.
  • [34] I. Sutskever, O. Vinyals, and Q. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. Cited by: §5.2.
  • [35] C. Tang and R. R. Salakhutdinov (2019) Multiple futures prediction. In Advances in Neural Information Processing Systems, pp. 15398–15408. Cited by: §1, §2, §6.
  • [36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: item 1, §2, §4.
  • [37] A. Vemula, K. Muelling, and J. Oh (2018) Social attention: modeling attention in human crowds. In 2018 IEEE international Conference on Robotics and Automation (ICRA), pp. 1–7. Cited by: §2.
  • [38] L. Verlet (1967) Computer "experiments" on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules. Physical Review 159 (1), pp. 98. Cited by: §4.
  • [39] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: item 1, §4.
  • [40] T. Zhao, Y. Xu, M. Monfort, W. Choi, C. Baker, Y. Zhao, Y. Wang, and Y. N. Wu (2019) Multi-agent tensor fusion for contextual trajectory prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12126–12134. Cited by: §1, §2, §2, §2, §5.2, Table 1, Table 2, Table 4.