Multi-agent trajectory forecasting in autonomous driving requires an agent to accurately anticipate the behaviors of the surrounding vehicles and pedestrians, for safe and reliable decision-making. Due to partial observability over the goals, contexts, and interactions of agents in these dynamic scenes, directly obtaining the posterior distribution over future agent trajectories remains a challenging problem. In realistic embodied environments, each agent's future trajectories should be diverse, since multiple plausible sequences of actions can be used to reach its intended goals, and they should be admissible, since they must obey physical constraints and stay in drivable areas. In this paper, we propose a model that fully synthesizes multiple input signals from the multimodal world (the environment's scene context and interactions between multiple surrounding agents) to best model all diverse and admissible trajectories. We offer new metrics to evaluate the diversity of trajectory predictions, while ensuring admissibility of each trajectory. Based on our new metrics as well as those used in prior work, we compare our model with strong baselines and ablations across two datasets and show a 35% performance improvement.
Trajectory forecasting is an important problem in autonomous driving scenarios, where an autonomous vehicle must anticipate the behavior of other surrounding agents (e.g., vehicles and pedestrians) within a dynamically changing environment, in order to plan its own actions accordingly. However, since none of the goals, contexts, or interactions are directly observed, predicting future trajectories is a challenging problem [40, 28, 35]. It necessitates both the estimation of plausible agent actions based on observable environmental features (e.g., road structures, agent interactions) and the simulation of agents' hypothetical future trajectories toward their intended goals. In realistic embodied environments, there are multiple plausible sequences of actions that an agent can take to reach its intended goals. However, each trajectory must obey physical constraints (e.g., Newton's laws) and stay within statistically plausible regions of the environment (i.e., the drivable areas). In this paper, we refer to these attributes as
diverse and admissible trajectories, respectively, and illustrate some examples in Fig. 1. Achieving diverse and admissible trajectory forecasting for autonomous driving allows each agent to make the best predictions, by taking into account all valid actions that other agents could take. In addition, it allows each agent to assess the surrounding situation to ensure safety and prevent accidents. To predict a diverse set of admissible trajectories, each agent must understand its multimodal environment, consisting of the scene context as well as interactions between multiple surrounding agents. While the scene context gives direct information about regions an agent can drive in, observation of other agents' trajectories can provide additional environmental context. For example, conceptual constraints over the agent's motion (e.g., traffic laws, road etiquette) may be inferred from the motion of the surrounding agents. Therefore, the model's ability to extract and meaningfully represent multimodal cues is crucial.
Concurrently, another challenging aspect of trajectory forecasting lies in encouraging models to make diverse predictions about future trajectories. However, due to high costs in data collection, most public datasets are not explicitly annotated with multiple future trajectories [18, 8, 21]. Vanilla predictive models that fit future trajectories based only on the existing annotations would severely underestimate the diversity of all possible trajectories. In addition, measuring the quality of predictions using existing annotation-based measures (e.g., displacement errors [30]) does not faithfully score diverse and admissible trajectory predictions.
As a step towards multimodal understanding for diverse trajectory forecasting, our contribution is fourfold.
We propose a model that addresses the lack of diversity and admissibility in trajectory forecasting through the understanding of the multimodal environmental context. As illustrated in Fig. 2, our approach explicitly models agent-to-agent and agent-to-scene interactions through "self-attention" [36] among multiple agent trajectory encodings, and a conditional trajectory-aware "visual attention" [39] over the map, respectively. Together with a constrained flow-based decoder trained with symmetric cross-entropy [27], this allows our model to generate diverse and admissible trajectory candidates by fully integrating all environmental contexts.
We propose a new approximation of the true trajectory distribution based on a differentiable drivable-area map. This approximation is used when evaluating our posterior likelihood. Previous approximation methods [27] utilize ground-truth (GT) trajectories to model the real distribution. However, only one GT annotation is available per agent. Our approximation method does not rely on GT samples and empirically facilitates greater diversity in the predicted trajectories while ensuring admissibility.
We propose a new metric, Drivable Area Occupancy (DAO), to evaluate the diversity of trajectory predictions while ensuring admissibility. This new metric utilizes the drivable-area map, without requiring multiple annotations of trajectory futures. We couple this new metric with standard metrics from prior art, such as Average Displacement Error (ADE) and Final Displacement Error (FDE), to compare our model with existing baselines.
We provide a programmatic set of procedures to convert the nuScenes [6] tracking data into a new dataset for trajectory forecasting. The procedure includes trajectory association, smoothing, imputation, and generation of the drivable-area features. These features are used to approximate the real trajectory distribution and to calculate our new metrics. We set new state-of-the-art performance for multi-agent trajectory forecasting, wherein our model enjoys a 35% performance improvement over the current baselines. We will publish tools to replicate our data and results, which we hope will advance the study of diverse trajectory forecasting.
Multimodal trajectory forecasting requires a detailed understanding of the agent's environment. Many works integrate information from multiple modalities [20, 24], such as RGB images and LiDAR point clouds, to model the surrounding environment [19, 27], and high-dimensional map data to model vehicle lane segmentation [40, 3, 7]. Other methods additionally fuse different combinations of map context [10, 7, 3], LiDAR [19], and RGB [29, 21] with the intention of jointly capturing all interactions between the agents and the environment [1, 14, 31]. Without mechanisms to explicitly model agent-to-agent and agent-to-scene interactions, we hypothesize that these models are unable to capture complex nonlinear interactions in the high-dimensional input space. In this paper, we study and propose methods to explicitly model these interactions, improving performance in trajectory forecasting.
Multi-agent modeling aims to learn representations that summarize the behavior of one agent given its surrounding agents. These interactions are often modeled through either spatial-oriented methods or neural attention-based methods. Spatial-oriented methods use pooling approaches across individual agent representations [19, 9, 40] and usually take inter-agent distances into account through a relative coordinate system [28, 22, 15, 3]. Despite their wide usage, spatial-oriented methods are designed to concentrate only on adjacent (spatially close) agents and assume a fixed number of agents in the scene, limiting the maximum number of agents they can handle. Attention-based methods use attention [36] architectures to model multi-agent interaction for applications involving pedestrians [37, 12, 31], sports players [11, 33], indoor robots [25], and vehicle trajectories [35, 21]. In this paper, we use a cross-agent attention module to model the agent-to-agent interaction. Rather than using this information solely for prediction, we additionally generate attended scene context, conditioned on these cross-agent representations. We hypothesize that the attended map context will lead to improved tractability in modeling high-dimensional correlations in the scene. We support this with our empirical results in Section 6.
Diverse trajectory prediction: Many models follow a deterministic trajectory-prediction approach [9, 40] and, therefore, struggle to estimate the diversity of future trajectories. Some works have applied generative models such as Generative Adversarial Networks (GANs) [13, 14, 31, 40] and Variational Autoencoders (VAEs) [19] to encourage diverse predictions. However, these approaches focus more on generating and scoring multiple output candidates, and less on analyzing diversity across distributional modes.
Trajectory forecasting: Trajectory forecasting has been studied in various domains, spanning marine vessels, aircraft, satellites, motor vehicles, and pedestrians [5, 2, 32, 28, 14]. Tasks involving motor vehicles and pedestrians are especially challenging, due to the high stochasticity that arises from attempting to model complex latent factors (e.g., human intent, "social" agent interactions, and scene context) [8]. Despite some promising empirical results, it remains difficult to evaluate both the diversity and admissibility of predictions. In this paper, we define the task of diverse and admissible trajectory forecasting and provide a new dataset generated from nuScenes [6], a popular image tracking source. We also define new task metrics that specifically assess models on the basis of prediction diversity and admissibility, and we analyze model generalization based on data from multiple domains.
We define the terminology that constitutes our problem. An agent is a dynamic on-road object that is represented as a sequence of 2D coordinates, i.e., a spatial position over time. We denote the position of agent $a$ at time $t$ as $S^a_t$, and write $S^a_{t_1:t_2}$ for the sequence of its positions between $t_1$ and $t_2$; bold $\mathbf{S}^a$ denotes the full sequence of positions for agent $a$. We set $t = 0$ as the present, $t < 0$ as the past, and $t > 0$ as the prediction horizon, or simply pred. We often split the sequence into two parts with respect to the past and pred subsequences, denoted $S^a_{past}$ and $S^a_{pred}$, respectively. A scene is high-dimensional structured data that describes the present environmental context around the agents. For this, we utilize a bird's-eye-view array, denoted $\Phi \in \mathbb{R}^{W \times H \times C}$, where $W$ and $H$ are the sizes of the field around the agent and $C$ is the channel size of the scene; each channel consists of distinct information such as the drivable area, position, and distance encodings.
Combining the scene and all agent trajectories yields an episode. In an episode, there is a variable number of agents $A$, each of which is present for a different time period between its variable start time $t^a_s$ and final time $t^a_f$. As a result, the episode is the set $\{\Phi, \mathbf{S}^1, \dots, \mathbf{S}^A\}$. In this combined setting, we often use bold $\mathbf{S}$ to denote the agents' trajectories in the episode, and write $\mathbf{S}_{past}$ or $\mathbf{S}_{pred}$ to represent the sets of past or pred segments of the agents. Since $\mathbf{S}_{past}$ and $\Phi$ serve as the observed information cues used for prediction, they are jointly called the observation, denoted $\mathcal{O}$. Finally, we may add a subscript $i$ to any of these notations to distinguish information from different episodes.
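As a hypothetical illustration of this bookkeeping, the episode structure can be organized as follows; the class and field names are our own choices for the sketch, not the paper's implementation:

```python
# Sketch of the agent / episode data structures described above. Positions are
# indexed by integer timestep; t <= 0 is "past", t > 0 is "pred".
from dataclasses import dataclass

@dataclass
class Agent:
    positions: dict  # {t: (x, y)}

    def past(self):
        return {t: p for t, p in self.positions.items() if t <= 0}

    def pred(self):
        return {t: p for t, p in self.positions.items() if t > 0}

@dataclass
class Episode:
    agents: list   # variable number of Agent objects
    scene: list    # W x H x C bird's-eye-view array (nested lists here)

    def observation(self):
        # The observation O = (past trajectories, scene context).
        return [a.past() for a in self.agents], self.scene

# Toy usage: one agent observed for 2 past steps, with 2 future steps.
a = Agent(positions={-1: (0.0, 0.0), 0: (1.0, 0.0), 1: (2.0, 0.0), 2: (3.0, 0.0)})
ep = Episode(agents=[a], scene=[[[0]]])
past, scene = ep.observation()
```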
We define diversity to be the level of coverage in a model's predictions across modes of the distribution over all possible future trajectories. We denote the model distribution as $q(\mathbf{S}_{pred} \mid \mathcal{O})$ and want the model to generate $K$ candidates, or hypotheses, denoted $\hat{\mathbf{S}}^{(1)}, \dots, \hat{\mathbf{S}}^{(K)}$. We interpret these as a set of independent hypotheses that might have happened, given the same observation. Instead of generating samples from one mode, which we refer to as perturbation, we expect to build a model that generates multiple hypotheses covering multiple modes.
Finally, we acknowledge that encouraging a model's predictions to be diverse is, alone, not sufficient for accurate and safe output; the model predictions should also lie in the support of the real future-trajectory distribution $p(\mathbf{S}_{pred} \mid \mathcal{O})$. Given the observation, it is futile to predict samples in regions that are physically or statistically implausible to reach. In conclusion, our task is diverse and admissible multi-agent motion forecasting: modeling multiple modes of the posterior distribution over the pred trajectories, given the observation, $p(\mathbf{S}_{pred} \mid \mathcal{O})$.
We hypothesize that the future trajectories of human drivers follow distributions with multiple modes, conditioned on the scene context and the social behaviors of agents. Therefore, we design our model to explicitly capture both agent-to-scene interactions and cross-agent interactions with respect to each agent of interest. Through our objective function, we explicitly encourage the model to learn a distribution with multiple modes by taking into account past trajectories and attended scene context.
As illustrated in Fig. 3, our model consists of an encoder-decoder architecture. The encoder has two modules, which encode the agents' past trajectories and capture cross-agent interactions. The decoder has three modules: the local scene extractor, the agent-to-scene interaction module, and the flow-based decoding module. Please refer to Fig. 4 for a detailed illustration of our main proposed modules.
The encoder extracts a past-trajectory encoding for each agent, then calculates and fuses the interaction features among the agents. Given the set of past trajectories in an observation, we encode each agent's past trajectory by feeding it to the agent trajectory encoding module. The module utilizes a recurrent neural network (RNN) to summarize the past trajectory: it iterates through the past positions with Eq. (1), and its final output (at the present step) is utilized as the agent embedding $h^a$:

(1) $h^a_t = \mathrm{RNN}_{enc}\big(h^a_{t-1},\, S^a_t\big)$, with $h^a \equiv h^a_0$.

Collecting the embeddings for all $A$ agents, we get $\{h^1, \dots, h^A\}$. We then pass these to the cross-agent interaction module, depicted in Fig. 4, which uses self-attention [36] to generate a cross-agent representation. We linearly transform each agent embedding to get a query-key-value triple $(q^a, k^a, v^a)$. Next, we calculate the interaction feature through self-attention, $i^a = \sum_b \alpha_{ab}\, v^b$, where $\alpha_{ab} = \mathrm{softmax}_b\big(q^a \cdot k^b / \sqrt{d_k}\big)$. Finally, the fused agent encoding is calculated by adding the interaction feature to each agent embedding (see Eq. (2) and Fig. 4):

(2) $\tilde{h}^a = h^a + i^a$.

The architectural details of the encoder, including the parameters of the agent-encoding RNN and the cross-agent attention structures, are given in the supplemental material.
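The cross-agent attention step can be sketched in plain Python as follows; for brevity the query/key/value projections are identity maps and the RNN encoder is omitted (the real module learns linear projections), so this is an illustrative simplification:

```python
# Minimal sketch of cross-agent self-attention over agent embeddings.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_agent_attention(embeddings):
    """embeddings: list of agent embedding vectors h^a (lists of floats)."""
    d = len(embeddings[0])
    fused = []
    for h_a in embeddings:                      # query comes from agent a
        scores = [sum(q * k for q, k in zip(h_a, h_b)) / math.sqrt(d)
                  for h_b in embeddings]        # keys from every agent b
        alpha = softmax(scores)
        # interaction feature i^a = sum_b alpha_ab * v^b (values = embeddings)
        i_a = [sum(a_b * h_b[j] for a_b, h_b in zip(alpha, embeddings))
               for j in range(d)]
        # fused encoding: residual add of the interaction feature
        fused.append([h + i for h, i in zip(h_a, i_a)])
    return fused

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy agent embeddings
out = cross_agent_attention(h)
```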
The decoder takes the final encodings $\tilde{h}^a$ and the scene context $\Phi$ as inputs. We first extract the scene feature $\Gamma$ through a ConvNet. The decoder then autoregressively generates the future positions $\hat{S}^a_t$, while referring to both the local scene context and the global scene context from the agent-to-scene interaction module. The local scene feature is gathered using bilinear interpolation on a crop of $\Gamma$ corresponding to the physical position $\hat{S}^a_{t-1}$. This feature is concatenated with the encoding $\tilde{h}^a$ and processed through fully-connected layers to make the "local context" $l^a_t$. We call this part the local scene extractor. The global scene feature is calculated using visual attention [39] to generate weighted scene features, as shown in Fig. 4. To calculate the attention, we first encode the previous outputs using an RNN, Eq. (3), whose output $g^a_t$ is used to calculate the pixel-wise attention $\beta^a_t$ at each decoding step, for each agent; the global scene feature $\gamma^a_t$ (a 1D vector) is gathered by pooling (pixel-wise sum) the attended feature map, as described in Eq. (4) and Fig. 4. Finally, the fused encoding, the local context, and the global scene feature are concatenated to make the "global context" $c^a_t$ in Eq. (5):

(3) $g^a_t = \mathrm{RNN}_{dec}\big(g^a_{t-1},\, \hat{S}^a_{t-1}\big)$

(4) $\gamma^a_t = \sum_{x,y} \beta^a_t(x, y)\, \Gamma(x, y)$, where $\beta^a_t = \mathrm{Att}\big(\Gamma,\, g^a_t\big)$

(5) $c^a_t = \mathrm{FC}\big(\big[\tilde{h}^a;\ l^a_t;\ \gamma^a_t\big]\big)$
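The attended pooling of the global scene feature can be sketched as below; the dot-product score stands in for the learned attention network of [39], an assumption of this sketch:

```python
# Sketch of pixel-wise visual attention over a scene feature map, followed by
# pooling (weighted sum) into a single global scene feature vector.
import math

def attended_scene_feature(feature_map, state):
    """feature_map: H x W grid of C-dim features; state: C-dim decoder state."""
    scores, cells = [], []
    for row in feature_map:
        for feat in row:
            scores.append(sum(f * s for f, s in zip(feat, state)))
            cells.append(feat)
    m = max(scores)
    ws = [math.exp(s - m) for s in scores]
    z = sum(ws)
    alpha = [w / z for w in ws]          # attention weights sum to 1 over pixels
    c = len(state)
    # pooled global scene feature: sum_xy alpha(x, y) * Gamma(x, y)
    return [sum(a * cell[j] for a, cell in zip(alpha, cells)) for j in range(c)]

fmap = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.5, 0.5], [0.0, 0.0]]]
g = attended_scene_feature(fmap, state=[1.0, 0.0])
```

Since the weights are a softmax, the pooled feature is a convex combination of the pixel features, so each component stays within the range spanned by the map.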
The flow-based decoding module generates the future positions $\hat{S}^a_t$. The module utilizes normalizing flow [26], a generative modeling method based on a bijective and differentiable mapping; in particular, we choose an autoregressive design [17, 27, 28]. We use fully-connected layers to project the global context down to a 6-dimensional vector, which we split into a 2D vector $\hat{\mu}^a_t$ and a $2 \times 2$ matrix $\hat{\sigma}^a_t$. Next, we transform a standard Gaussian sample $z_t$ through the bijective and differentiable mapping $\hat{S}^a_t = \sigma^a_t z_t + \mu^a_t$. The hats on $\hat{\mu}^a_t$ and $\hat{\sigma}^a_t$ are removed once they pass through the following constraints. To ensure positive definiteness, we obtain $\sigma^a_t$ by applying the matrix exponential to $\hat{\sigma}^a_t$, using the formula in [4]. Also, to improve the physical admissibility of the prediction, we apply the constraint $\mu^a_t = \hat{\mu}^a_t + \alpha\,\big(2\hat{S}^a_{t-1} - \hat{S}^a_{t-2}\big)$, where $\alpha$ is a model degradation coefficient. When $\alpha = 1$, the constraint is equivalent to Verlet integration [38], used in some previous works [27, 28], which gives a perfect constant-velocity (CV) prior to the model. However, we found empirically that the model easily overfits to the dataset when the perfect CV prior is used, and perturbing the CV prior with $\alpha$ prevents overfitting. We use a fixed value of $\alpha$ in our model, and an analysis of the effect of the degradation coefficient is given in the supplemental material.

Iterating the autoregressive decoding procedure, we get the future trajectory prediction $\hat{S}^a$ for each agent. Note that by sampling multiple instances of $z$, we can generate multiple futures.
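One decoding step and its inverse can be sketched as follows; the exact constraint form, the value of alpha, and the diagonal exponential standing in for the matrix exponential of [4] are our reading of the text, not confirmed implementation details:

```python
# Sketch of one autoregressive flow step with a degraded Verlet-style prior.
import math

def decode_step(s_prev, s_prev2, mu_hat, sigma_hat, z, alpha=0.5):
    # Constant-velocity (Verlet) prior, degraded by alpha; alpha=1 -> pure Verlet.
    mu = [m + alpha * (2 * p - q) for m, p, q in zip(mu_hat, s_prev, s_prev2)]
    sigma = [math.exp(s) for s in sigma_hat]      # positive scale per dimension
    return [m + sg * zi for m, sg, zi in zip(mu, sigma, z)]

def invert_step(s_next, s_prev, s_prev2, mu_hat, sigma_hat, alpha=0.5):
    # The mapping is bijective: recover z from an observed next position,
    # which is what allows likelihood evaluation during training.
    mu = [m + alpha * (2 * p - q) for m, p, q in zip(mu_hat, s_prev, s_prev2)]
    return [(s - m) / math.exp(sg) for s, m, sg in zip(s_next, mu, sigma_hat)]

z = [0.3, -0.7]
s = decode_step([1.0, 0.0], [0.0, 0.0], [0.1, 0.1], [0.0, 0.0], z)
z_back = invert_step(s, [1.0, 0.0], [0.0, 0.0], [0.1, 0.1], [0.0, 0.0])
```

Round-tripping through the inverse recovers the Gaussian sample exactly, which is the property the change-of-variables likelihood relies on.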
In this work, we generate a binary mask feature of size $W \times H$ that denotes the drivable space around the agents. We call this feature the drivable-area map and utilize it for three different purposes: 1) deriving the approximated true trajectory distribution $\tilde{p}$, 2) calculating the diversity and admissibility measures, and 3) building the scene context input $\Phi$ for the model.
In particular, $\tilde{p}$ is a key component in evaluating the reverse cross-entropy term of our training objective, Eq. (7). Since this term penalizes predicted trajectories with respect to the real distribution, the approximation should not underestimate any region of the real distribution, or diversity in the prediction could be erroneously penalized. Previous work on deriving such an approximation utilized the ground-truth (GT) trajectories to model the true distribution [27]. However, there is often only one GT annotation available per agent in datasets, and an approximation based on the GT may severely assign low probability to some plausible regions. To cope with this problem in previous methods, we propose a new method to derive $\tilde{p}$ using the drivable area. Our $\tilde{p}$ is defined based on the assumptions that every location in the drivable area is equally probable for future trajectories, and that locations in the non-drivable area are increasingly less probable, in proportion to their distance from the drivable area. To derive it, we first apply the distance transform on the drivable-area map, to encode the distance at each non-drivable location. Lastly, we apply a softmax over the entire map to constitute it as a probability distribution. Visualizations of $\tilde{p}$ are available in Fig. 7. Procedures regarding the diversity and admissibility measures are discussed in Section 5.3; details on deriving $\tilde{p}$ and the scene context input $\Phi$, as well as additional visualizations and qualitative results, are given in the supplemental material.

Our model learns to predict the joint distribution over the future trajectories of the agents present in a given episode. In detail, we focus on predicting the conditional distribution $q(\mathbf{S}_{pred} \mid \mathcal{O})$, where the future trajectories depend on the set of observations of the past trajectories and the scene context given an episode. As described in the previous sections, our model utilizes a bijective and differentiable mapping $g_\theta$, parameterized by a learnable parameter $\theta$, between the future trajectory and a Gaussian prior, to generate and evaluate future trajectories. This technique, commonly called 'normalizing flow', enables our model not only to generate multiple candidate samples of the future, but also to evaluate the ground-truth trajectory under the predicted distribution by using the change-of-variables formula in Eq. (6):

(6) $q\big(\mathbf{S}_{pred} \mid \mathcal{O}\big) = \mathcal{N}\big(g_\theta^{-1}(\mathbf{S}_{pred});\, 0, I\big)\,\bigg|\det \frac{\partial g_\theta}{\partial z}\bigg|^{-1}$
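The drivable-area prior $\tilde{p}$ described above can be sketched on a toy grid; the brute-force distance transform stands in for an optimized implementation, and the grid contents are illustrative only:

```python
# Toy construction of the approximate prior p~: drivable cells share the peak
# probability; non-drivable cells decay with distance to the nearest drivable
# cell, via a softmax over negated distances.
import math

def drivable_prior(drivable):
    h, w = len(drivable), len(drivable[0])
    road = [(i, j) for i in range(h) for j in range(w) if drivable[i][j]]
    # distance transform: 0 on the drivable area, Euclidean distance elsewhere
    dist = [[0.0 if drivable[i][j] else
             min(math.hypot(i - a, j - b) for a, b in road)
             for j in range(w)] for i in range(h)]
    # softmax over the negated distances -> a proper probability grid
    logits = [-d for row in dist for d in row]
    m = max(logits)
    es = [math.exp(l - m) for l in logits]
    z = sum(es)
    p = [e / z for e in es]
    return [p[i * w:(i + 1) * w] for i in range(h)]

grid = [[1, 1, 0],
        [1, 1, 0],
        [0, 0, 0]]
p = drivable_prior(grid)
```

All drivable cells receive the same (maximal) probability, and probability decays smoothly off-road, which is what keeps the reverse cross-entropy from punishing diverse on-road predictions.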
As a result, our model can simply learn to close the discrepancy between the predicted distribution $q$ and the real-world distribution $p$. In particular, we choose to minimize a combination of the forward and reverse cross-entropies, $H(p, q)$ and $H(q, \tilde{p})$, also known as the 'symmetric cross-entropy', between the two distributions, by optimizing the model parameter $\theta$, as in Eq. (7). Minimizing the symmetric cross-entropy, the approach mainly developed in [27], allows the model to learn to generate diverse yet plausible trajectories.

(7) $\min_\theta\; H\big(p, q_\theta\big) + H\big(q_\theta, \tilde{p}\big)$
To realize this, we gather the ground-truth trajectories and scene contexts from the dataset, which we assume to reflect the real distribution $p$ well, then optimize the model parameter $\theta$ such that 1) the density of the ground-truth future trajectories under the predicted distribution is maximized and 2) the density of the predicted samples under the (approximated) real distribution is also maximized, as described in Eq. (8):

(8) $\min_\theta\; \mathbb{E}_{\mathbf{S}_{pred} \sim p}\big[-\log q_\theta(\mathbf{S}_{pred} \mid \mathcal{O})\big] + \mathbb{E}_{\mathbf{S}_{pred} \sim q_\theta}\big[-\log \tilde{p}(\mathbf{S}_{pred} \mid \mathcal{O})\big]$
Such a symmetric combination of the two cross-entropies guides our model to predict a $q$ that covers all plausible modes of the future trajectories, while penalizing bad samples that are unlikely under the real distribution $p$. However, one major problem inherent in this setting is that $p$ cannot actually be evaluated in practice. To cope with this problem, several ways of approximating $p$ with a separate model have been suggested [27]. In this paper, we propose a new way of modeling the approximation, $\tilde{p}$, using a discrete grid map derived from the differentiable drivable-area map in our dataset, which considers every drivable region around the ego-vehicle to be equally probable for future trajectories. The details of our new $\tilde{p}$ are included in the supplemental material. Applying bilinear interpolation around each prediction timestep of a generated sample, we obtain an evaluation of $\tilde{p}$ that is differentiable with respect to the model parameter $\theta$. Our overall loss function is:

(9) $\mathcal{L}(\theta) = \frac{1}{B}\sum_{i=1}^{B} \frac{1}{A_i}\sum_{a=1}^{A_i} \bigg[-\log q_\theta\big(S^a_{pred} \mid \mathcal{O}_i\big) \;-\; \frac{1}{K}\sum_{k=1}^{K} \log \tilde{p}\big(\hat{S}^{a,(k)}_{pred} \mid \mathcal{O}_i\big)\bigg]$

where $B$ is the batch size, $A_i$ is the number of agents in the $i$-th episode, and $K$ is the number of candidates sampled per agent. Since this objective is fully differentiable with respect to the model parameter $\theta$, we train our model using the Adam optimizer [16], a popular variant of stochastic gradient descent. We also use adaptive learning-rate scheduling and early stopping. Optimization details are included in the supplementary material.
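A schematic of the two loss terms, in simplified per-step form: the forward term is the exact negative log-likelihood of an affine flow step (diagonal scale for brevity), and the reverse term reads a log-probability grid with bilinear interpolation, which is what keeps the penalty differentiable in a real framework. The grid values and diagonal covariance are illustrative assumptions:

```python
# Sketch of the symmetric cross-entropy components.
import math

def forward_nll(gt, mus, sigmas):
    # -log q(gt) for an affine flow: Gaussian NLL plus log|det| (change of vars)
    nll = 0.0
    for g, m, s in zip(gt, mus, sigmas):
        z = (g - m) / s
        nll += 0.5 * z * z + 0.5 * math.log(2 * math.pi) + math.log(s)
    return nll

def bilinear(grid, x, y):
    # Interpolate grid[y][x] at a fractional position (x, y).
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    dx, dy = x - x0, y - y0
    return (grid[y0][x0] * (1 - dx) * (1 - dy) + grid[y0][x0 + 1] * dx * (1 - dy)
            + grid[y0 + 1][x0] * (1 - dx) * dy + grid[y0 + 1][x0 + 1] * dx * dy)

def reverse_ce(samples, log_p_grid):
    # -E_q[log p~]: average interpolated log-prior over predicted points
    return -sum(bilinear(log_p_grid, x, y) for x, y in samples) / len(samples)

log_p = [[-1.0, -1.0, -5.0],
         [-1.0, -1.0, -5.0],
         [-5.0, -5.0, -5.0]]       # toy log p~: top-left block is "drivable"
on_road = reverse_ce([(0.5, 0.5)], log_p)
off_road = reverse_ce([(1.5, 1.5)], log_p)
```

Samples falling off the drivable region incur a larger reverse term, which is the mechanism that steers the decoder toward admissible regions.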
The primary goal of the following experiments is to evaluate our model, baselines, and ablations on the following criteria: 1) leveraging mechanisms that explicitly model agent-to-agent and agent-to-scene interactions (experiments 1 and 2); 2) producing diverse trajectory predictions, while obeying admissibility constraints on the trajectory candidates, given different approximation methods for the true trajectory distribution (experiment 3); 3) remaining robust to an increasing number of agents in the scene (agent complexity; experiment 4); and 4) generalizing to other domains (experiment 5). We implement all models in PyTorch [23] and train on NVIDIA TITAN X GPUs. Procedural, architectural, and training details are included in the supplementary material.

Most current autonomous-driving trajectory forecasting datasets are insufficient for evaluating predictions, due to their small size and limited number of multimodal cues [21].
The Argoverse motion forecasting dataset consists of a large volume of forecasting data with drivable-area annotations, but lacks certain modalities, i.e., LiDAR point clouds and map images. We have generated motion forecasting datasets from the nuScenes and Argoverse tracking datasets using their original annotations, through programmatic trajectory association, smoothing, and imputation. Unlike the Argoverse forecasting dataset, this new dataset provides additional context information from LiDAR point clouds and map information, for better forecasting performance. We utilize the trajectory records, vectorized geometry, and drivable-area annotations as modalities for our research. To make the experimental setup of nuScenes similar to Argoverse, we crop each sequence to be 5 seconds long in total (3 seconds for prediction and 2 seconds for observation) with a sampling rate of 2 Hz. Background information on the nuScenes and Argoverse trajectory-data generation is included in the supplementary material. By evaluating baselines and our models on both real-world datasets, we provide complementary validation of each model's diversity, admissibility, and generalizability across domains. Data-extraction procedures and quantitative results on simulated data are discussed in the supplementary material.
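The smoothing and imputation steps can be illustrated with two toy helpers; the actual pipeline's algorithms are described in the supplementary material, so these stand-ins only show the general idea:

```python
# Toy trajectory cleanup: linear imputation of missing frames and a simple
# moving-average smoother over (x, y) tuples.
def impute(traj):
    """traj: list of (x, y) or None per frame; fill gaps by linear interpolation.
    Assumes every gap is bracketed by known positions."""
    out = list(traj)
    known = [i for i, p in enumerate(out) if p is not None]
    for i, p in enumerate(out):
        if p is None:
            lo = max(k for k in known if k < i)
            hi = min(k for k in known if k > i)
            t = (i - lo) / (hi - lo)
            out[i] = tuple(a + t * (b - a) for a, b in zip(out[lo], out[hi]))
    return out

def smooth(traj, win=3):
    """Centered moving average with window size `win` (shrinks at the ends)."""
    half = win // 2
    res = []
    for i in range(len(traj)):
        window = traj[max(0, i - half):i + half + 1]
        res.append(tuple(sum(c) / len(window) for c in zip(*window)))
    return res

filled = impute([(0.0, 0.0), None, (2.0, 2.0), (3.0, 3.0)])
smoothed = smooth(filled)
```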
Deterministic baselines: We compare three deterministic models with our approach, to examine our model's ability to capture agent-to-agent interaction: an LSTM-based encoder-decoder [34] (LSTM), convolutional social pooling LSTM (CSP) [9], and a deterministic version of multi-agent tensor fusion (MATF-D) [40]. For our deterministic model, we use an LSTM with our cross-agent attention module in the encoder, which we refer to as the cross-agent attention model (CAM). Because each model is predicated on an LSTM component, we set the capacity to be the same in all cases, to ensure fair comparison.

Measuring diversity and admissibility: We define multiple metrics that provide a thorough interpretation of the behavior of each model in terms of precision, diversity, and admissibility. For the $i$-th of $K$ predicted trajectories, we first evaluate the prediction in terms of Euclidean errors: the average displacement error $\mathrm{ADE}_i$ and the final displacement error $\mathrm{FDE}_i$, or $\mathrm{Error}_i$ to denote both. To evaluate precision over the $K$ predictions, we use the average and the minimum Errors: $\mathrm{avgError} = \frac{1}{K}\sum_{i} \mathrm{Error}_i$ and $\mathrm{minError} = \min_i \mathrm{Error}_i$. A large avgError implies that predictions are spread out, and a small minError implies that at least one of the predictions has high precision. From this observation, we define new evaluation metrics that capture diversity in predictions: the ratios of avgADE to minADE and of avgFDE to minFDE, namely rA and rF, shown in Eq. (10). In particular, rF is robust to the variability of velocity magnitudes in predictions, because high avgADE and high minADE caused by large magnitudes offset each other, leaving only the directional variability. As a result, rF provides a handy tool that can distinguish predictions with multiple modes (diversity) from predictions with a single mode (perturbation). For deterministic models, rA and rF have a value of 1.

(10) $\mathrm{rA} = \frac{\mathrm{avgADE}}{\mathrm{minADE}}, \qquad \mathrm{rF} = \frac{\mathrm{avgFDE}}{\mathrm{minFDE}}$

(11) $\mathrm{DAO} = \frac{m_{pred}}{M_{da}}$
We also report performance on additional metrics that are designed to capture diversity and admissibility in predictions. We follow [8] in the use of Drivable Area Count (DAC), defined as $(K - m)/K$, where $m$ is the number of predictions that go out of the drivable area and $K$ is the total number of predictions. Next, we propose a new metric, Drivable Area Occupancy (DAO), which measures the percentage of pixels that predicted trajectories occupy in the drivable area. As shown in Eq. (11), $m_{pred}$ is the number of pixels occupied by predictions and $M_{da}$ is the total number of pixels of the drivable area, both within a predefined grid around the ego-vehicle. Due to the nature of DAO and DAC, the number of trajectory hypotheses must be set equally across models for fair comparison.
We use rF, DAO, and DAC to assess the diversity and admissibility of models. Initially, DAO may seem like a reasonable standalone measure of both diversity and admissibility, as it only cares about diversity within a reasonable region of interest. However, DAO by itself cannot distinguish between diversity (Section 3) and arbitrary stochasticity in predictions, as illustrated by Case 1 in Fig. 5: although the DAO measures of both predictions are equal, the causality behind each prediction is different, and we must distinguish the two. rF and DAO work in a complementary way, and we therefore use both for measuring diversity. To assure the admissibility of predictions, we use DAC, which explicitly counts off-road predictions, as shown by Case 2 in Fig. 5. As a result, assessing predictions using DAO along with rF and DAC provides a holistic view of the quantity and quality of diversity in predictions; the characteristics of each metric are summarized in Fig. 6.
For our experiments, we use minADE and minFDE to measure precision, and rF, DAC, and DAO to measure both diversity and admissibility. Due to the nature of DAO, where the denominator in our case is the number of pixels in a grid, we normalize it by a constant scale factor when reporting results. For the multi-agent experiment (experiment 4), relative improvement (RI) is calculated, as we are interested in the relative improvement as the number of agents increases. If not specified, the number of hypotheses is set to 12, and minFDE is reported for the performance.
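Putting the metric definitions together, a simplified single-agent computation (with integer-cell occupancy standing in for the paper's pixel-grid rasterization, an assumption of this sketch) looks like:

```python
# Sketch of minADE/minFDE, the diversity ratios rA/rF, DAC, and DAO for K
# predicted trajectories of one agent.
import math

def ade(pred, gt):
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    return math.dist(pred[-1], gt[-1])

def metrics(preds, gt, drivable):
    ades = [ade(p, gt) for p in preds]
    fdes = [fde(p, gt) for p in preds]
    rA = (sum(ades) / len(ades)) / min(ades)          # avgADE / minADE
    rF = (sum(fdes) / len(fdes)) / min(fdes)          # avgFDE / minFDE
    # DAC: fraction of hypotheses that never leave the drivable area
    off = sum(any((int(x), int(y)) not in drivable for x, y in p) for p in preds)
    dac = (len(preds) - off) / len(preds)
    # DAO: fraction of drivable cells touched by any hypothesis
    occupied = {(int(x), int(y)) for p in preds for x, y in p} & drivable
    dao = len(occupied) / len(drivable)
    return min(ades), min(fdes), rA, rF, dac, dao

drivable = {(x, y) for x in range(4) for y in range(2)}   # a toy 4x2 "road"
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
preds = [[(0.0, 0.0), (1.0, 0.1), (2.0, 0.1)],
         [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]]
m = metrics(preds, gt, drivable)
```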
In this section, we show experimental results across numerous settings, including comparisons with the baselines and ablation studies of our model. We first show the effect of our cross-agent interaction module and agent-to-scene interaction module on model performance; we then analyze performance with respect to different numbers of agents and across datasets. All experiments are measured with minADE, minFDE, rF, DAC, and DAO for holistic interpretation.
Model  minADE  minFDE 

LSTM  1.186  2.408 
CSP [9]  1.390  2.676 
MATF-D [40]  1.261  2.538 
CAM (ours)  1.124  2.318 
Effectiveness of cross-agent interaction module: We show the performance of one of our proposed models, CAM, which utilizes our cross-agent attention module, along with three deterministic baselines, in Table 1. Each model treats agent-to-agent interaction differently: CSP models the interaction through layers of convolutional networks, so the interaction is implicitly calculated within the receptive field of the convolutional layers; MATF-D extends convolutional social pooling with scene information; CAM explicitly models the interaction between each pair of agents using attention. The results show that CAM outperforms the other baselines in both minADE and minFDE, indicating that explicitly modeling agent-to-agent interaction yields better precision than the implicit convolutional modeling used in CSP and MATF-D. Interestingly, CAM also outperforms MATF-D, which utilizes scene information. This suggests that our cross-agent interaction module can learn the geometric structure of the roads from the trajectories of surrounding agents.
Effectiveness of agent-to-scene interaction module: The performance of stochastic models is compared in Table 2. We experiment with removing scene-processing operations in the decoder to validate the importance of our proposed agent-to-scene interaction module. As mentioned previously, generating samples across multiple modes requires a strong scene-processing module and a diversity-oriented decoder. Our proposed models all outperform the other stochastic baselines in terms of precision. MATF-GAN has a small rF, suggesting that its predictions are mostly unimodal, while other models, such as the VAE-based DESIRE and the flow-based R2P2-MA and ours, show more spread in their predictions. We note that R2P2 was not designed for the multi-agent setting, which causes it to produce erratic outputs. Our model achieves the highest rF along with strong DAO and DAC, indicating that it makes diverse and admissible predictions by accurately utilizing the scene context.
Model  minADE  minFDE  rF  DAO  DAC 

DESIRE [19]  0.937  1.808  1.754  9.430  0.376 
MATF-GAN [40]  1.053  2.124  1.194  5.950  0.391 
R2P2-MA [27]  1.185  2.215  1.611  13.50**  0.396 
CAM-NF (ours)  0.756  1.386  2.113  11.70  0.400 
Local-CAM-NF (ours)  0.772  1.404  2.066  11.70  0.400 
Global-CAM-NF (ours)  0.744  1.359  2.103  11.60  0.400 
AttGlobal-CAM-NF (ours)  0.638  1.171  2.558  12.28  0.399 

Model  minADE  minFDE  rF  DAO  DAC 

AttGlobal-CAM-NF (MSE)  0.763  1.390  2.009  12.09  0.400 
AttGlobal-CAM-NF  0.638  1.171  2.558  12.28  0.399 
Effectiveness of new loss: We compare MSE and our drivable-area-based approximation of the true distribution in Table 3. Using our map loss in training shows superior results on most of the reported metrics. In particular, the precision and diversity of predictions increase drastically, as reflected in minError and rF, while DAC remains unchanged. Our map loss thus assures admissibility while improving precision and diversity, as the drivable-area-based approximation provides additional possible regions for future trajectories.
Complexity from number of agents: We experiment with a varying number of surrounding agents, as shown in Table 4. Across all models, performance improves as the number of agents increases, even though many surrounding agents do not move significantly. In terms of relative improvement (RI), calculated between 1 agent and 10 agents, our model improves the most, indicating that it makes the most use of the fine-grained trajectories of surrounding agents when generating future trajectories.
Model  1 agent  3 agents  5 agents  10 agents  RI (1→10) 

LSTM  2.736  2.477  2.442  2.268  17.1% 
CSP [9]  2.871  2.679  2.671  2.569  10.5% 
DESIRE [19]  2.150  1.846  1.878  1.784  17.0% 
MATF-GAN [40]  2.377  2.168  2.150  2.011  15.4% 
R2P2-MA [27]  2.227  2.135  2.142  2.048  8.0% 
AttGlobal-CAM-NF (ours)  1.278  1.158  1.100  0.964  24.6% 
Generalizability across datasets: We further compare our model with the baselines across two real-world datasets, nuScenes and Argoverse, to test generalization to different environments. We show results in Table 5, where we outperform, or achieve results comparable to, the baselines. On Argoverse, we additionally outperform MFP3 [35] in minFDE with 6 hypotheses: our full model shows a minFDE of 0.915, while MFP3 achieves 1.399.
Model  Argoverse  nuScenes  

minADE  minFDE  rF*  DAO*  DAC*  minADE  minFDE  rF*  DAO*  DAC*  
LSTM  1.441  2.780  1.000  1.786  0.378  1.186  2.408  1.000  1.690  0.391 
CSP  1.385  2.567  1.000  1.799  0.379  1.390  2.676  1.000  1.710  0.388 
MATF-D  1.344  2.484  1.000  1.768  0.379  1.261  2.538  1.000  1.690  0.384 
DESIRE  0.777  1.276  3.642  11.80  0.301  0.937  1.808  1.754  9.430  0.376 
MATF-GAN  1.214  2.316  1.099  6.075  0.376  1.053  2.124  1.194  5.950  0.391 
R2P2-MA  1.270  2.190  1.589  18.10**  0.381  1.185  2.215  1.611  13.50**  0.396 
CAM  1.131  2.504  1.000  1.750  0.389  1.124  2.318  1.000  1.670  0.404 
CAM-NF  0.852  1.347  2.763  17.60  0.378  0.756  1.386  2.113  11.70  0.400 
Local-CAM-NF  0.807  1.250  2.858  17.00  0.381  0.772  1.404  2.066  11.70  0.400 
Global-CAM-NF  0.807  1.241  3.068  16.90  0.380  0.744  1.359  2.103  11.60  0.400 
AttGlobal-CAM-NF  0.731  1.126  3.278  15.50  0.383  0.638  1.171  2.558  12.28  0.399 
In this paper, we tackled the problem of generating diverse and admissible predictions by understanding each agent's multimodal context. We proposed a model that learns agent-to-agent and agent-to-scene interactions using attention mechanisms, resulting in better prediction in terms of precision, diversity, and admissibility. We also developed a new approximation method that provides richer information about the true trajectory distribution and allows more accurate training of flow-based generative models. Finally, we presented new metrics that provide a holistic view of the quantity and quality of diversity in prediction, and a nuScenes trajectory-extraction tool to support future research in diverse and admissible trajectory forecasting.