1 Introduction
Modeling and predicting the future behavior of human agents is a fundamental problem in many real-world robotics domains. For example, accurately forecasting the future state of other vehicles, cyclists and pedestrians is critical for safe, comfortable, and human-like autonomous driving. However, behavior prediction in an autonomous vehicle (AV) driving setting poses a number of unique modeling challenges:
1) Multimodal output space: The problem is inherently stochastic; it is impossible to truly know the future state of the environment. This is exacerbated by the fact that other agents' intentions are not observable, and leads to a highly multimodal distribution over possible outcomes (e.g., a car could turn left or right at an intersection). Effective models must represent such a rich output space with high precision and recall, matching the underlying distribution.
2) Heterogeneous, interrelated input space: The driving environment representation can contain a highly heterogeneous mix of static and dynamic inputs: road network information (lane geometry and connectivity, stop lines, crosswalks), traffic light state, and the motion history of agents. Driving situations are often highly interactive and can involve many agents at once (e.g., negotiating a 4-way stop with crosswalks). This requires careful modeling choices, as explicitly modeling joint future distributions over multiple agents is exponential in the number of agents. Effective models must capture not only the interactions between the agents in the scene, but also the relationships between road elements and the behavior of agents given the road context.
The novel challenges and high impact of this problem have naturally garnered much interest in recent years. There has been a rich body of work on how to model agents’ futures, their interactions, and the environment. However, there is little consensus to date on the best modeling choices for each component, and in popular benchmark challenge datasets [nuscenes2019, chang2019argoverse, interactiondataset, ettinger2021womd], there is a surprisingly diverse set of solutions to this problem; for details see Section 2 and Table 1.
The MultiPath framework [sapp2019multipath] addresses the multimodal output space challenge above by modeling the highly multimodal output distribution via a Gaussian Mixture Model (GMM). It handles the common issue of mode collapse during learning by using static trajectory anchors, an external input to the model. This practical solution gives practitioners a straightforward way of ensuring diversity, and an extra level of modeler control via the design of such anchors. The choice of a GMM representation has proved extremely popular, appearing in many works; see the "Trajectory Distribution" column of Table 1, where "Weighted set" is a special case of GMMs in which only means and mixture weights are modeled.
The MultiPath input representation and backbone draw heavily upon the computer vision literature. By rasterizing all world state in a top-down orthographic view, MultiPath and others [phan2019covernet, DESIRE, hong2019rules, tang_multifuture, marchetti2020mantra, casas2018intentnet, cui2019multimodal, neural_motion_planner_zeng2019] leverage powerful, established CNN architectures like ResNet [ResNet16], which offer a solution to the heterogeneous, interrelated input space: the heterogeneous world state is mapped to a common pixel format, and interactions occur via local information sharing through convolution operations. While convenient and established, such rasterization has significant downsides: (1) There is an uneasy trade-off between the resolution of the spatial grid, the field of view, and compute requirements. (2) Rasterizing is a form of manual feature engineering, and some features may be inherently difficult to represent in such a framework (e.g., radial velocity). (3) It is difficult to capture long-range interactions via convolutions with small receptive fields. (4) The information content is spatially very sparse, making a dense representation a potentially computationally wasteful choice.
In this paper, we introduce MultiPath++, which builds upon MultiPath, retaining its GMM output representation and concept of anchors, but reconsidering how to represent and combine highly heterogeneous world state inputs, and how to model interactions between state elements. MultiPath++ introduces a number of key upgrades:

We eschew the rasterization-and-CNN approach in favor of modeling sparse world state objects more directly from their compact state descriptions. We represent road elements as polylines, agent history as a sequence of physical states encoded with RNNs, and agent interactions as RNNs over the states of neighbors relative to each ego-agent. These choices avoid lossy rasterization in favor of raw, continuous state, and result in compute complexity that scales with the number of scene elements rather than the size of a spatial grid. Long-range dependencies are effectively and efficiently modeled in our representation.

Capturing relationships between road elements and agents is critical, and we find that encoding each element independently does not perform as well as modeling their interactions (e.g., when the road encoding is aware of relevant agents, and vice versa). To address this, we propose a novel form of context awareness we call multi-context gating (MCG), in which sets of elements have access to a summary context vector upon which encodings are conditioned. MCG is implemented as a generic neural network component that is applied throughout our model. It can be viewed as an efficient form of cross-attention, whose efficiency/quality trade-off is controlled by the size of the context vector.

We also explore improvements in trajectory modeling, comparing representations based on kinematic controls and/or polynomials as functions of continuous future time. We further demonstrate a way to learn latent anchor representations, and show they outperform the original static anchors of MultiPath while simplifying model creation to a single-step process.

Finally, we find significant additional gains on public benchmarks by applying ensembling techniques to our models. Unlike models with static anchors, the latent anchors of a MultiPath++ ensemble are not in direct correspondence. Furthermore, many popular behavior prediction benchmarks have introduced metrics such as miss rate (MR) and mean Average Precision (mAP), which require the ability to model diverse outcomes with few trajectories, and which differ from pure trajectory distance errors capturing the average agent behavior. With the above in mind, we formulate the problem of ensembling the results of several models as one of greedy iterative clustering, which maximizes a probabilistic objective using the popular Expectation-Maximization algorithm [bishop2006pattern].
As of November 1, 2021, MultiPath++ ranks highly on the Waymo Open Motion Dataset leaderboard (https://waymo.com/open/challenges/2021/motionprediction/) and on the Argoverse Motion Forecasting Competition (https://eval.ai/web/challenges/challengepage/454/leaderboard/1279). We offer MultiPath++ as a reference set of design choices, empirically validated via ablation studies, that can be adopted, further studied and extended by the behavior modeling community.
2 Related Work
| Method | Road Enc. | Motion Enc. | Interactions | Decoder | Output | Trajectory Distribution |
| --- | --- | --- | --- | --- | --- | --- |
| Jean [mercat2020multi] | – | LSTM | attention | LSTM | states | GMM |
| TNT [zhao2020tnt] | polyline | polyline | max-pool, attention | MLP | states | Weighted set |
| LaneGCN [liang2020laneGCN] | GNN | 1D conv | GNN | MLP | states | Weighted set |
| WIMP [wimp2020] | polyline | LSTM | GNN + attention | LSTM | states | GMM |
| VectorNet [gao2020vectornet] | polyline | polyline | max-pool, attention | MLP | states | Single traj. |
| SceneTransformer [ngiam21scene_transformer] | polyline | attention | attention | attention | states | Weighted set |
| GOHOME [gilles2021gohome] | GNN | 1D conv + GRU | GNN | MLP | states | heatmap |
| MP3 [casas2021mp3] | raster | raster | conv | conv | cost function | Weighted samples |
| CoverNet [phan2019covernet] | raster | raster | conv | lookup | states | GMM w/ dynamic anchors |
| DESIRE [DESIRE] | raster | GRU | spatial pooling | GRU | states | Samples |
| RoadRules [hong2019rules] | raster | raster | conv | LSTM | states | GMM |
| SocialLSTM [sociallstm] | – | LSTM | spatial pooling | LSTM | states | Samples |
| SocialGan [SocialGAN] | – | LSTM | max-pool | LSTM | states | Samples |
| MFP [tang_multifuture] | raster | GRU | RNNs + attention | GRU | states | Samples |
| MANTRA [marchetti2020mantra] | raster | GRU | – | GRU | states | Samples |
| PRANK [biktairov2020prank] | raster | raster | conv | lookup | states | Weighted set |
| IntentNet [casas2018intentnet] | raster | raster | conv | conv | states | Single traj. |
| SpaGNN [casas2020spagnn] | raster | raster | GNN | MLP | states | Single traj. |
| Multimodal [cui2019multimodal] | raster | raster | conv | conv | states | Weighted set |
| PLOP [buhet2020plop] | raster | LSTM | conv | MLP | state poly | GMM |
| Precog [precog_Rhinehart_2019_ICCV] | raster | GRU | multi-agent sim. | GRU | motion | Samples |
| R2P2 [rhinehart2018r2p2] | raster | GRU | – | GRU | motion | Samples |
| HYU_ACE [park2020diverse] | raster | LSTM | attention | LSTM | motion | Samples |
| Trajectron++ [salzmann2020trajectron++] | raster | LSTM | RNNs + attention | GRU | controls | GMM |
| DKM [cui2020deep] | raster | raster | conv | conv | controls | Weighted set |
| MultiPath [sapp2019multipath] | raster | raster | conv | MLP | states | GMM w/ static anchors |
| MultiPath++ | polyline | LSTM | RNNs + max-pool | MLP | control poly | GMM |
We focus on architectural design choices for behavior prediction in driving environments—what representations to use to encode road information, agent motion, agent interactions, output trajectories, and output distributions. Table 1 is a summary of past work, which we go over here with additional context.
For road encoding, there is a dichotomy of representations. The raster approach encodes the world as a stack of images, from a top-down orthographic (or "bird's-eye") view. Rasterizing the world state has the benefit of simplicity: all the various types of input information (road configuration, agent state history, spatial relationships) are unified via rendering as a multi-channel image, enabling one to leverage powerful off-the-shelf Convolutional Neural Network (CNN) techniques. However, this one-size-fits-all approach has significant downsides: difficulty in modeling long-range interactions, a constrained field of view, and difficulty in representing continuous physical states. As an alternative, the polyline approach describes curves (e.g., lanes, crosswalks, boundaries) as piecewise linear segments. This is a significantly more compact form due to the sparse nature of road networks. Previous works typically process a set-of-polylines description of the world in a per-agent, agent-centric coordinate system. LaneGCN [liang2020laneGCN] stands apart by treating road lanes as nodes in a graph neural network, leveraging road network connectivity structure.
To model motion history, one popular choice is to encode the sequence of past observed states via a recurrent net (GRU, LSTM) or temporal (1D) convolution. As an alternative, in the raster framework, the state sequence is typically rendered as a stack of binary mask images depicting agents' oriented bounding boxes, or rendered in a single image with the corresponding time information rendered separately [Refaat2019AgentPF].
To model agent interactions, one must deal with a dynamic set of neighboring agents around each modeled agent. This is typically done by aggregating neighbor motion history with a permutation-invariant set operator: pooling or soft attention. Notably, Precog [precog_Rhinehart_2019_ICCV] jointly rolls out agent policies in a stepwise simulation. Raster approaches rely on convolution over the 2D spatial grid to implicitly capture interactions; long-range interactions depend on the network's receptive fields.
Agent trajectory decoding choices are similar to choices for encoding motion history, with the exception of methods that do lookup on a fixed or learned trajectory database [phan2019covernet, biktairov2020prank].
The most popular output trajectory representation is a sequence of states (or state differences). A few works [rhinehart2018r2p2, rhinehart2019precog] instead model motion via discrete-time updates consistent with Newton's laws, capturing Verlet integration. Other works [salzmann2020trajectron++, cui2020deep] explicitly model controls which parameterize a kinematically feasible model for vehicles and bicycles. With any of these representations, the space-time trajectory can be represented intrinsically as a sequence of sample points or as a continuous polynomial [buhet2020plop]. In our experimental results, we explore the effectiveness of states and kinematic controls, with and without an underlying polynomial basis. Notably unique are (1) HOME [gilles21home] and GOHOME [gilles2021gohome], which first predict a heatmap and then decode trajectories after sampling, and (2) MP3 [casas2021mp3] and NMP [neural_motion_planner_zeng2019], which learn a cost-function evaluator of trajectories, where the trajectories are enumerated heuristically rather than generated by a learned model.
Nearly all work assumes an independent, peragent output space, in which agent interactions cannot be explicitly captured. A few works are notable in describing joint interactions as output, either in an asymmetric [wimp2020, tolstaya2021cbp] or symmetric way [ettinger2021womd, precog_Rhinehart_2019_ICCV, ngiam21scene_transformer].
The choice of output trajectory distribution has ramifications on downstream applications. An intrinsic property of the driving setting is that a vehicle or a pedestrian can follow one of a diverse set of possible trajectories. It is thus essential to capture the multimodal nature of the problem. Gaussian Mixture Models (GMMs) are a popular choice for this purpose due to their compact parameterized form; mode collapse is addressed through training tricks [kitani_diverse_forecasting_dpps, cui2019multimodal] or the use of trajectory anchors [sapp2019multipath]. Other approaches model a discrete distribution over a set of trajectories (learned or fixed a priori) [zhao2020tnt, liang2020laneGCN, biktairov2020prank, cui2019multimodal], or via a collection of trajectory samples drawn from a latent distribution and decoded by the model [sociallstm, DESIRE, precog_Rhinehart_2019_ICCV, rhinehart2018r2p2, marchetti2020mantra].
3 Model Architecture
Figure 1 depicts the proposed MultiPath++ model architecture, which on a high level is similar to that of MultiPath [sapp2019multipath]; the model consists of an encoding step and a predictor head which conditions on anchors and outputs a Gaussian Mixture Model (GMM) [bishop2006pattern] distribution for the possible agent position at each future time step.
MultiPath used a common, top-down, image-based representation for all input modalities (e.g., agents' tracked state, road network information), and a CNN encoder. In contrast, MultiPath++ has a separate encoder for each input modality, converting it to a compact and sparse representation; the different modality encodings are later fused using a multi-context gating (MCG) mechanism.
3.1 Input Representation
MultiPath++ makes predictions based on the following input modalities:

Agent state history: a state sequence describing the agent trajectory over a fixed number of past steps. In the Waymo Open Motion Dataset [ettinger2021womd], this state information includes position, velocity, 3D bounding box size, heading angle and object type; for Argoverse [chang2019argoverse] only position information is provided. The state is transformed into an agent-centric coordinate system, such that the most recent agent pose is at the origin, heading east. (Since the explicit heading is missing in Argoverse data, we use the last two time steps to estimate the current orientation.)

Road network: Road network elements such as lane lines, crosswalks, and stop lines are often represented as parametric curves like clothoids [neural_motion_planner_zeng2019], which can be sampled to produce point collections easily stored in multi-dimensional array format, as is done in many public datasets [ettinger2021womd, chang2019argoverse]. We further summarize this information by approximating the point sequence for each road element as a set of piecewise linear segments, or polylines, similar to [gao2020vectornet, liang2020laneGCN, homayounfar2018maps].

Agent interactions: For each modeled agent, we consider all neighboring agents. For each neighboring agent, we extract features in the modeled agent’s coordinate frame, such as relative orientation, distance, history and speed.

AV-relative features: Similar to the interaction features, we extract features of the autonomous / sensing vehicle (AV) relative to each other agent. We model the AV separately from the other agents. We hypothesize this is a helpful distinction for the model because: (a) the AV is the center of the sensors' field of view, and tracking errors due to distance and occlusion are relative to this center; (b) the behavior of the AV can be unlike that of the other road users, who to a good approximation can all be assumed to be humans.
Details on how these features are encoded and fused are described next. These steps comprise the “Encoder” block of Figure 1, whose output is an encoding per agent, in each agent’s coordinate frame.
3.2 Multi-Context Gating for fusing modalities
In this section we focus on how to combine the different input modality encodings in an effective way. Other works use a common rasterized format [sapp2019multipath, neural_motion_planner_zeng2019], a simple concatenation of encodings [DESIRE, precog_Rhinehart_2019_ICCV, salzmann2020trajectron++], or employ attention [ngiam21scene_transformer, tang_multifuture, gao2020vectornet, liang2020laneGCN]. We propose an efficient mechanism for fusing information we term multi-context gating (MCG), and use MCG blocks throughout the MultiPath++ architecture.
Given a set of element embeddings $\{s_1, \dots, s_n\}$ and an input context vector $c$, a CG block assigns an output $s'_i$ to each element in the set and computes an output context vector $c'$. The output does not depend on the ordering of the input elements. Mathematically, let $\mathrm{CG}$ be the function implemented by the CG block, and $\sigma$ be any permutation of a sequence of elements. The following holds for CG:
(1) $\mathrm{CG}(\sigma(s_1, \dots, s_n),\, c) = (\sigma(s'_1, \dots, s'_n),\, c')$
which implies that the element outputs are permutation-equivariant and the context output is permutation-invariant. The size $n$ of the set can vary across calls to $\mathrm{CG}$.
CG’s set function properties—permutation invariance/equivariance and ability to process arbitrarily sized sets—are naturally motivated by the need to encode a variable, unordered set of road network elements and agent relationships. A number of set functions have been proposed in the literature such as DeepSets [zaheer17deepset], PointNet [qi2017pointnet] and SetTransformers [lee19settransformer].
A single CG block is implemented via
(2) $\tilde{s}_i = \mathrm{MLP}_s(s_i)$
(3) $\tilde{c} = \mathrm{MLP}_c(c)$
(4) $s'_i = \tilde{s}_i \odot \tilde{c}$
(5) $c' = \mathrm{Pool}(s'_1, \dots, s'_n)$
where $\odot$ denotes element-wise product and $\mathrm{Pool}$ is a permutation-invariant pooling layer such as max or average pooling. These operations are illustrated in Figure 2. In the absence of an input context, we simply set $c$ to an all-ones vector in the first context gating block. Note that $c'$ depends on all inputs, and, once blocks are stacked, so does each $s'_i$ through the shared context. It can be shown that $c'$ is permutation-invariant w.r.t. the input embeddings, and that the $s'_i$ are permutation-equivariant.
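As a concrete illustration, the gating equations above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the per-element and context MLPs are stood in for by single ReLU layers, and max pooling is used for the pooling operator.

```python
import numpy as np

def mlp(x, W, b):
    # Single ReLU layer standing in for the block's MLPs (illustrative only).
    return np.maximum(0.0, x @ W + b)

def context_gate(S, c, Ws, bs, Wc, bc):
    """One CG block: gate each element embedding by the shared context,
    then pool the gated elements into an output context."""
    S_tilde = mlp(S, Ws, bs)      # per-element embeddings, shape (n, d)
    c_tilde = mlp(c, Wc, bc)      # context embedding, shape (d,)
    S_out = S_tilde * c_tilde     # element-wise gating, broadcast over n
    c_out = S_out.max(axis=0)     # permutation-invariant pooling
    return S_out, c_out

rng = np.random.default_rng(0)
n, d = 5, 8
S, c = rng.normal(size=(n, d)), rng.normal(size=d)
Ws, bs = rng.normal(size=(d, d)), np.zeros(d)
Wc, bc = rng.normal(size=(d, d)), np.zeros(d)

S_out, c_out = context_gate(S, c, Ws, bs, Wc, bc)
# Permuting the input elements permutes S_out identically (equivariance)
# and leaves c_out unchanged (invariance).
perm = rng.permutation(n)
S_perm, c_perm = context_gate(S[perm], c, Ws, bs, Wc, bc)
assert np.allclose(S_perm, S_out[perm])
assert np.allclose(c_perm, c_out)
```

The assertions check exactly the two set-function properties claimed in Equation 1.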
We stack multiple CG blocks by incorporating running-average skip connections, as is done in residual networks [ResNet16]:
(6) $s_i^{(k+1)},\, c^{(k+1)} = \mathrm{CG}(\bar{s}_i^{(k)},\, \bar{c}^{(k)})$
(7) $\bar{s}_i^{(k)} = \frac{1}{k} \sum_{j=1}^{k} s_i^{(j)}$
(8) $\bar{c}^{(k)} = \frac{1}{k} \sum_{j=1}^{k} c^{(j)}$
We denote a stack of $N$ such CG blocks as $\mathrm{MCG}^N$.
Comparison with attention. Attention is a popular mechanism in domains such as NLP [vaswani2017attention] and computer vision [dosovitskiy2020ViT, dai2021coatnet], in which the encoding of each element of a set is updated via a combination of the encodings of all other elements. For a set of size $n$, this intrinsically requires $O(n^2)$ operations. In models of human behavior in driving scenarios, self-attention has been employed to update encodings for, e.g., road lanes by attending to neighboring lanes, or to update per-agent encodings based on the other agents in the scene. Cross-attention has also been used to condition one input type (e.g., agent encodings) on another (e.g., road lanes) [liang2020laneGCN, gao2020vectornet, ngiam21scene_transformer]. Without loss of generality, if there are $m$ agents and $r$ road elements, this cross-attention scales as $O(mr)$ to aggregate road information for each agent.
MCG can be viewed as an approximation to cross-attention. Rather than having each of the $m$ elements attend to all $r$ elements of the latter set, CG summarizes the latter set with the single context vector $c$, as shown in Figure 3. Thus the dimensionality of $c$ needs to be large enough to capture the useful information contained in the original encodings. If the elements have dimensionality $d$ and $c$ has dimensionality $d_c$, then with $d_c$ on the order of $r \cdot d$ the context can preserve all of the element encodings, and CG reduces to a form of cross-attention. When $d_c \ll r \cdot d$, we trade the representational power of full cross-attention for computational efficiency.
3.3 Encoders
In this section we detail the specific encoders shown in Figure 1.
Agent history encoding. The agent history encoding is obtained by concatenating the output of three sources:

An LSTM applied to the history features, from a fixed number of time steps in the past up to the present time.

An LSTM applied to the differences between consecutive history features.

MCG blocks applied to the set of history elements. Each element in the set consists of a historical position and a time offset in seconds relative to the present time. The context input here is an all-ones vector with an identity context MLP. Additionally, we encode the history frame index as a one-hot vector to further disambiguate the history steps.
We denote the final embedding, which concatenates these three state history encodings, as the agent history embedding.
Agent interaction encoding. For each modeled agent, we build an interaction encoding by considering the past state observations of each neighboring agent. We transform the neighbor's states into the modeled agent's coordinate frame and embed them with an LSTM to obtain a per-neighbor embedding. Note this is similar to the ego-agent history embedding, but applied to the relative coordinates of another agent.
By doing this for all neighboring agents we obtain a set of interaction embeddings $\{e_1, \dots, e_n\}$, which we fuse with stacked MCG blocks as follows:
(9) $E_{\text{interaction}} = \mathrm{MCG}(\{e_1, \dots, e_n\},\, [e_{\text{history}}, e_{\text{AV}}])$
where the second argument is the input context vector, in this case a concatenation of the modeled agent's history embedding and the AV's interaction embedding. In this way we emphasize the AV's representation as a unique entity in the context for all interactions; see Section 3.1 for motivation.
Road network encoding. We use the polyline road element representation discussed in Section 3.1 as input. Each line segment is parameterized by its start point, its end point, and the road element semantic type (e.g., Crosswalk, SolidDoubleYellow, etc.). For each agent of interest, we transform the closest polylines into its frame of reference. Let $r$ be the closest point from the agent to the segment, and $t$ be the unit tangent vector at $r$ on the original curve; we represent the agent's spatial relationship to the segment via features derived from $r$ and $t$. These feature vectors are each processed with a shared MLP, resulting in a set of agent-specific embeddings $\{r_1, \dots, r_m\}$, one per road segment. We then fuse the road element embeddings with the agent history embedding using stacked MCG blocks:
(10) $E_{\text{road}} = \mathrm{MCG}(\{r_1, \dots, r_m\},\, e_{\text{history}})$
and thus enrich the road embeddings with dynamic state information.
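The per-segment geometry above can be sketched concretely. This is one plausible way to compute the closest point and tangent for a single segment in the agent's frame; the exact feature layout used by MultiPath++ is an assumption here.

```python
import numpy as np

def segment_features(start, end):
    """Agent-frame features for one polyline segment: the closest point r
    on the segment to the agent (at the origin of its own frame) and the
    unit tangent t of the segment. Feature layout is illustrative."""
    start, end = np.asarray(start, float), np.asarray(end, float)
    d = end - start
    t = d / np.linalg.norm(d)                        # unit tangent
    # Project the origin onto the segment, clamped to its endpoints.
    alpha = np.clip(-start @ d / (d @ d), 0.0, 1.0)
    r = start + alpha * d                            # closest point
    return np.concatenate([r, t])

feat = segment_features([1.0, -1.0], [1.0, 1.0])     # vertical segment x=1
# Closest point to the origin is (1, 0); tangent points along +y.
assert np.allclose(feat, [1.0, 0.0, 0.0, 1.0])
```

In the model, a vector like `feat` would be passed through the shared MLP to produce one road-segment embedding per agent.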
3.4 Output representation
MultiPath++ predicts a distribution over future behavior parameterized as a Gaussian Mixture Model (GMM), as is done in MultiPath [sapp2019multipath] and other works [mercat2020multi, phan2019covernet, buhet20_plop]. For efficient long-term prediction, the distribution is conditionally independent across time steps given a mixture component, so each mode at each time step $t$ is represented as a Gaussian over the agent's position with mean $\mu_t^k$ and covariance $\Sigma_t^k$. The mode likelihoods $\pi_k$ are tied over time. MAP inference per mode is equivalent to taking the sequence of means as state waypoints defining a possible future trajectory for the agent. The full output distribution is
(11) $p(s) = \sum_{k=1}^{K} \pi_k \prod_{t=1}^{T} \mathcal{N}(s_t \mid \mu_t^k, \Sigma_t^k)$
where $s = (s_1, \dots, s_T)$ represents a trajectory.
The classification head of Figure 1 predicts the likelihoods $\pi_k$ as a softmax distribution over mixture components. The regression head outputs the Gaussian parameters $\mu_t^k$ and $\Sigma_t^k$ for all modes $k$ and time steps $t$.
Training objective. We follow the original MultiPath approach and maximize the likelihood of the ground-truth trajectory under our model's predicted distribution. We make a hard assignment of a "correct" mixture component by choosing the one with the smallest Euclidean distance to the ground-truth trajectory.
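The hard-assignment objective can be sketched as follows. This is a simplified illustration, assuming fixed isotropic uncertainty, whereas the actual model predicts per-step covariances.

```python
import numpy as np

def multipath_loss(gt, mu, log_pi, sigma=1.0):
    """Hard-assignment negative log-likelihood sketch of the MultiPath
    objective with fixed isotropic uncertainty.
    gt: (T, 2) ground-truth trajectory; mu: (K, T, 2) mode means;
    log_pi: (K,) log mixture weights."""
    # Assign the ground truth to the closest mode by average L2 distance.
    dists = np.linalg.norm(mu - gt, axis=-1).mean(axis=-1)   # (K,)
    k = int(np.argmin(dists))
    # Classification term plus regression term for the chosen mode only.
    T = gt.shape[0]
    sq = np.sum((mu[k] - gt) ** 2)
    nll = -log_pi[k] + sq / (2 * sigma**2) + T * np.log(2 * np.pi * sigma**2)
    return nll, k

gt = np.zeros((3, 2))
mu = np.stack([np.zeros((3, 2)), np.ones((3, 2))])
nll, k = multipath_loss(gt, mu, np.log([0.5, 0.5]))
assert k == 0   # the all-zeros mode matches the ground truth
```

Only the selected mode receives a regression gradient; all modes compete through the classification term.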
The average log loss over the entire training set is optimized using Adam. We use an initial learning rate of and a batch size of , with decay rate of
every 2 epochs. The final model is chosen after training for
steps.
3.5 Prediction architecture with learned anchor embeddings
The goal of the Predictor module (Figure 1) is to predict the parameters of the GMM described in Section 3.4: a set of trajectories with likelihoods and uncertainties around each waypoint.
In applications related to future prediction, capturing the highly uncertain and multimodal set of outcomes is a key challenge and the focus of much work [rhinehart2018r2p2, liang2020garden, kitani_diverse_forecasting_dpps, phan2019covernet, sapp2019multipath]. One of MultiPath's key innovations was to use a static set of anchor trajectories as predefined modes applied to all scenes. One major downside is that most modes are not a good fit to any particular scene, requiring a large number of modes to be considered, most of which obtain a low likelihood and are discarded. Another downside is the added complexity and effort stemming from a two-phase learning process (first estimating the modes from data, then training the network).
In this work, we learn anchor embeddings as part of the overall model training. We interpret these embeddings as anchors in latent space, and construct our architecture to have a one-to-one correspondence between these embeddings and the output trajectory modes of our GMM. The anchor embeddings are trainable model parameters that are independent of the input. This has connections to Detection Transformers (DETR) [carion20detr], which propose a way to learn anchors rather than hand-design them for object detection. It is also similar in spirit to MANTRA [marchetti2020mantra], a trajectory prediction network with an explicit learned memory consisting of a database of embeddings that can be retrieved and decoded into trajectories.
We concatenate the history, interaction and road embeddings to obtain a fixed-length feature vector for each modeled agent. We then use this vector as the context in stacked MCG blocks that operate on the set of anchor embeddings, with a final MLP that predicts all parameters of the output GMM.
3.6 Internal Trajectory Representation
We model the future position and heading of agents, along with agent-relative longitudinal and lateral Gaussian uncertainties. We parameterize the trajectory at each time step by position, heading, and standard deviations for longitudinal and lateral uncertainty.
The most popular approach in the literature is to directly predict a sequence of such states at a uniform time discretization. Here we also consider two non-mutually-exclusive variants.

We can represent functions over time as polynomials, which adds an inductive bias that ensures a smooth trajectory and gives us a compact, interpretable representation of each predicted signal.

Instead of directly predicting states, we can predict the underlying kinematic control signals, which are then integrated to evaluate the output state. In this work, we experiment with predicting the acceleration and heading change rate, and integrating them to recover the trajectory.
These representations add inductive bias encouraging natural and realistic trajectories that are based on realistic kinematics and consistent with the current state of the predicted agent. For the polynomial representation, it is also possible to specify a soft constraint by regularizing the polynomial’s constant term, which determines the shift of the predicted signal from its current value.
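The polynomial variant and its constant-term regularizer can be sketched as below. The coefficient layout and the exact form of the soft constraint are assumptions for illustration; signals are treated as offsets from the agent's current value.

```python
import numpy as np

def eval_poly_trajectory(coeffs, times):
    """Evaluate per-signal polynomials at continuous future times.
    coeffs: (S, D+1) polynomial coefficients (highest degree first) for
    S signals, e.g. x, y, heading; times: (T,) query times in seconds."""
    return np.stack([np.polyval(c, times) for c in coeffs], axis=-1)  # (T, S)

def constant_term_penalty(coeffs, weight=1.0):
    """Soft constraint tying the trajectory's value at t=0 to the agent's
    current state (each signal is an offset from the present value)."""
    return weight * np.sum(coeffs[:, -1] ** 2)

coeffs = np.array([[1.0, 0.0, 0.0],    # x(t) = t^2
                   [0.0, 2.0, 0.5]])   # y(t) = 2t + 0.5
traj = eval_poly_trajectory(coeffs, np.array([0.0, 1.0, 2.0]))
assert np.allclose(traj[:, 0], [0.0, 1.0, 4.0])
assert np.isclose(constant_term_penalty(coeffs), 0.25)
```

Because the representation is continuous in time, the same coefficients can be queried at any future timestamp, not just the training discretization.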
Algorithm 1 demonstrates the conversion from control signals to output positions. Note that this operation is differentiable, permitting end-to-end optimization. It is a numerical approximation of the kinematic integration with additional technical considerations: (1) When computing the next position, we use midpoint approximations of the speed and heading. (2) Given vehicle dimensions, we cap the heading change rate to match a predetermined maximum feasible curvature. (3) These equations are applied to the rear axle of the vehicle rather than the center position; we use the rear-end position of the vehicle as an approximation of the rear-axle position.
Note that Algorithm 1 can be viewed as a special type of recurrent network without learned parameters. This decoding stage then mirrors other works which use a learned RNN (LSTM or GRU cells) to decode an embedding vector into a trajectory [mercat2020multi, wimp2020, hong2019rules, tang_multifuture, salzmann2020trajectron++]. In our case, the recurrent state consists of the pose and speed, and the input consists of the predicted controls. Encoding an inductive bias derived from kinematic modeling spares the network the need to explicitly learn these properties and makes the predicted state compact. This promotes data efficiency and generalization power, but can be more sensitive to perception errors in the current state estimate.
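The control-integration step can be sketched as follows. This is a minimal sketch of the midpoint scheme: the rear-axle offset is omitted and the curvature cap is approximated with a simple clip on the heading rate.

```python
import numpy as np

def integrate_controls(x0, y0, v0, h0, accel, hdot, dt=0.2, max_hdot=None):
    """Midpoint integration of predicted controls into positions.
    accel, hdot: (T,) acceleration and heading-change-rate per step."""
    xs, ys = [], []
    x, y, v, h = x0, y0, v0, h0
    for a, w in zip(accel, hdot):
        if max_hdot is not None:               # feasibility cap (simplified)
            w = np.clip(w, -max_hdot, max_hdot)
        v_mid = v + 0.5 * a * dt               # midpoint speed
        h_mid = h + 0.5 * w * dt               # midpoint heading
        x += v_mid * np.cos(h_mid) * dt
        y += v_mid * np.sin(h_mid) * dt
        v += a * dt
        h += w * dt
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

# Constant speed, zero heading: the agent moves straight along +x.
xs, ys = integrate_controls(0.0, 0.0, 5.0, 0.0,
                            accel=np.zeros(5), hdot=np.zeros(5))
assert np.allclose(xs, 5.0 * 0.2 * np.arange(1, 6))
assert np.allclose(ys, 0.0)
```

Every operation here is differentiable, so in a deep-learning framework the same loop would propagate gradients from positions back to the predicted controls.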
4 Ensembling predictor heads via bootstrap aggregation
Ensembling is a powerful and popular technique in many machine learning applications. For example, ensembling is a critical technique for getting the best performance on ImageNet
[ResNet16]. By combining multiple models which are to some degree complementary, we can enjoy the benefits of a higher capacity model with lower statistical variance.
We specifically apply bootstrap aggregation (bagging) [eslbook] to our predictor heads by training several such heads together. To encourage the heads to learn complementary information, their weights are initialized randomly, and each training example is used to update the weights of each head with 50% probability.
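The per-example head masking can be sketched as below. This is an illustrative sketch; the normalization of each head's loss by the number of examples it actually sees is an assumption, not a detail stated in the text.

```python
import numpy as np

def head_losses(per_head_loss, rng, p=0.5):
    """Bagging sketch: each training example updates each head with
    probability p, implemented by masking per-head losses.
    per_head_loss: (H, B) losses for H heads over a batch of B examples."""
    mask = rng.random(per_head_loss.shape) < p
    # Normalize so heads that see fewer examples are not down-weighted.
    denom = np.maximum(mask.sum(axis=1, keepdims=True), 1)
    return (per_head_loss * mask).sum(axis=1) / denom.squeeze(1)

out = head_losses(np.ones((3, 8)), np.random.default_rng(0))
assert out.shape == (3,)
# With all-ones losses, each head's masked mean is 1.0 (or 0.0 if it
# happened to receive no examples in this batch).
assert np.all((out == 1.0) | (out == 0.0))
```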
Unlike scalar regression or classification, it is not obvious how to combine the outputs of different heads in our case: each is a Gaussian Mixture Model, with no correspondence of mixture components across ensemble heads. Furthermore, we allow each predictor head to predict a richer output distribution with more modes than the number $M$ fixed as a requirement of the task (and used in benchmark metric calculations).
Let $\tilde{p}$ denote the union of the predictions from all $H$ heads,
(12) $\tilde{p}(s) = \sum_{m} \tilde{\pi}_m \prod_{t=1}^{T} \mathcal{N}(s_t \mid \tilde{\mu}_t^m, \tilde{\Sigma}_t^m)$
where the mode likelihoods $\tilde{\pi}_m$ are divided by the number of heads so that they sum up to 1. We then pose the ensemble combination task as one of converting $\tilde{p}$ to a more compact GMM $p$ with $M$ modes,
(13) $p(s) = \sum_{k=1}^{M} \pi_k \prod_{t=1}^{T} \mathcal{N}(s_t \mid \mu_t^k, \Sigma_t^k)$
while requiring that $p$ best approximates $\tilde{p}$. In this section we describe the aggregation algorithm we use; theoretical motivations and the derivation can be found in Appendix A.
We fit $p$ to $\tilde{p}$ using an iterative clustering algorithm similar to Expectation-Maximization [bishop2006pattern], but with hard assignment of cluster membership. This setting lends itself to efficient implementation in a compute graph, and allows us to train this step end-to-end as a final layer in our deep network.
We start by selecting $M$ cluster centroids from $\tilde{p}$ in a greedy fashion. The selection criterion maximizes the probability that a trajectory sampled from $\tilde{p}$ lies within distance $d$ of at least one selected centroid:
(14) $\max_{\mu^1, \dots, \mu^M} \; \mathbb{E}_{s \sim \tilde{p}} \left[ \max_{k} \mathbb{1}\left[ \lVert s - \mu^k \rVert \le d \right] \right]$
This criterion explicitly optimizes trajectory diversity, which is a good fit for metrics such as miss rate, mAP and minADE, as defined in [chang2019argoverse, ettinger2021womd]. Other criteria could be used depending on the metric of interest. It is interesting to relate this criterion to the ensembling and sampling method employed by GOHOME [gilles2021gohome]: that work outputs an intermediate spatial heatmap representation, which is amenable to ensemble aggregation, and then greedily samples endpoints in a similar fashion.
Since jointly optimizing (14) is hard, we select each μ̃_m greedily for m = 1, ..., M according to

(15) \tilde\mu_m = \arg\max_{\mu \in \Omega} \; \mathbb{E}_{s \sim \Omega} \left[ \mathbf{1}\!\left( \min\!\left( \| s - \mu \|, \, \min_{j < m} \| s - \tilde\mu_j \| \right) \le \tau \right) \right],

which differs from (14) in that the outer maximization is done iteratively over m rather than jointly over all M centroids.
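As a concrete illustration, the greedy selection in (15) can be sketched as follows. This is a simplified sketch, not the paper's implementation: it approximates the expectation over Ω by evaluating the coverage indicator at the component means, weighted by their likelihoods, and all function and variable names are our own.

```python
import numpy as np

def greedy_select_centroids(means, weights, M, tau):
    """Greedily pick M centroid indices from overcomplete component means.

    Approximates E_{s~Omega}[1(min_m ||s - mu_m|| <= tau)] by evaluating the
    coverage indicator at the component means, weighted by their likelihoods.
    means: (K, d) array, weights: (K,) array summing to 1.
    """
    covered = np.zeros(len(means), dtype=bool)  # means already within tau of a pick
    selected = []
    for _ in range(M):
        best_gain, best_idx = -1.0, 0
        for i in range(len(means)):
            # Additional probability mass covered if candidate i were added.
            dist = np.linalg.norm(means - means[i], axis=1)
            gain = weights[(dist <= tau) & ~covered].sum()
            if gain > best_gain:
                best_gain, best_idx = gain, i
        selected.append(best_idx)
        covered |= np.linalg.norm(means - means[best_idx], axis=1) <= tau
    return selected
```

Because this coverage objective is submodular, the greedy procedure enjoys the usual (1 − 1/e) approximation guarantee noted in Appendix A.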
Starting from the selected centroids, we iteratively update the parameters of q using an expectation-maximization-style [dempster77] algorithm, where each iteration consists of the following updates:

(16) \tilde\pi_m \leftarrow \sum_k \pi_k \, \gamma_{km}

(17) \tilde\mu_m \leftarrow \frac{1}{\tilde\pi_m} \sum_k \pi_k \, \gamma_{km} \, \mu_k

(18) \tilde\Sigma_m \leftarrow \frac{1}{\tilde\pi_m} \sum_k \pi_k \, \gamma_{km} \left( \Sigma_k + (\mu_k - \tilde\mu_m)(\mu_k - \tilde\mu_m)^\top \right)

where γ_{km} is the (hard) posterior probability that a sample drawn from component k of Ω is assigned to component m of the mixture model q, which can be computed as

(19) \gamma_{km} = \mathbf{1}\!\left( m = \arg\max_{m'} \tilde\pi_{m'} \, \mathcal{N}(\mu_k; \tilde\mu_{m'}, \tilde\Sigma_{m'}) \right).
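One iteration of these updates can be sketched in NumPy with the hard assignment evaluated at the component means (the approximation derived in Appendix A). This is an illustrative sketch under assumed array shapes, not the paper's in-graph implementation.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian at a single point x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.inv(cov) @ d + logdet + len(x) * np.log(2 * np.pi))

def hard_em_step(mu, sigma, pi, mu_t, sigma_t, pi_t):
    """One hard-EM iteration collapsing an overcomplete GMM (mu, sigma, pi)
    into a compact one (mu_t, sigma_t, pi_t), K components -> M modes."""
    K, M = len(mu), len(mu_t)
    # E-step (hard): assign each overcomplete component, evaluated at its
    # mean, to the most likely compact mode.
    assign = np.array([
        np.argmax([np.log(pi_t[m]) + gaussian_logpdf(mu[k], mu_t[m], sigma_t[m])
                   for m in range(M)])
        for k in range(K)
    ])
    new_mu, new_sigma, new_pi = mu_t.copy(), sigma_t.copy(), pi_t.copy()
    for m in range(M):
        sel = assign == m
        if not sel.any():
            continue  # keep previous parameters for empty modes
        w = pi[sel]
        new_pi[m] = w.sum()
        new_mu[m] = (w[:, None] * mu[sel]).sum(0) / new_pi[m]
        d = mu[sel] - new_mu[m]
        # Moment matching: within-component covariance plus spread of means.
        new_sigma[m] = (w[:, None, None] *
                        (sigma[sel] + d[:, :, None] * d[:, None, :])).sum(0) / new_pi[m]
    return new_mu, new_sigma, new_pi / new_pi.sum()
```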
5 Experiments
5.1 Datasets
The Waymo Open Motion Dataset (WOMD) [ettinger2021womd] consists of 1.1M examples time-windowed from 103K 20s scenarios. The dataset is derived from real-world driving in urban and suburban environments. Each example for training and inference consists of 1 second of history state and 8 seconds of future, which we resample at 5Hz. The agent state contains attributes such as position, agent dimensions, velocity and acceleration vectors, orientation, angular velocity, and turn-signal state. The long (8s) time horizon in this dataset tests the model's ability to capture a large field of view and scale to an output space of trajectories, which in theory grows exponentially with time.
The Argoverse dataset [chang2019argoverse] consists of 333K scenarios containing trajectory histories, context agents, and lane centerline inputs for motion prediction. The trajectories are sampled at 10Hz, with 2 seconds of past history and a 3second future prediction horizon.
5.2 Metrics
We compare models using the competition-specific metrics associated with each dataset; for each dataset, we report the results of our model against published results of publicly available models. Specifically, we report the following metrics.
minDE (Minimum Distance Error): The minimum distance, over the top-k most-likely trajectories, between a predicted trajectory and the ground-truth trajectory at a fixed time horizon T.
minADE (Minimum Average Distance Error): Similar to minDE, but the distance is averaged over all timesteps.
MR@τ (Miss Rate): Measures the rate at which minFDE exceeds τ meters. Note that the WOMD leaderboard uses a different definition [ettinger2021womd].
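Simplified per-agent versions of these distance metrics can be sketched as below. The official benchmark implementations differ in details (notably WOMD's miss-rate definition), and all function names here are our own.

```python
import numpy as np

def min_fde(preds, gt):
    """Minimum final displacement error over k predicted trajectories.
    preds: (k, T, 2), gt: (T, 2)."""
    return np.min(np.linalg.norm(preds[:, -1] - gt[-1], axis=-1))

def min_ade(preds, gt):
    """Minimum, over k trajectories, of the displacement averaged over time."""
    per_traj = np.linalg.norm(preds - gt, axis=-1).mean(axis=1)
    return per_traj.min()

def miss_rate(batch_preds, batch_gt, tau=2.0):
    """Fraction of examples whose minimum final error exceeds tau meters."""
    return float(np.mean([min_fde(p, g) > tau
                          for p, g in zip(batch_preds, batch_gt)]))
```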
mAP: For each set of predicted trajectories, there is at most one true positive: the trajectory closest to the ground truth, provided it lies within distance τ of the ground truth. The remaining predicted trajectories are counted as misses. From this, we can compute precision and recall at various thresholds. Following the WOMD metrics definition [ettinger2021womd], agents' future trajectories are partitioned into behavior buckets, and the area under the precision-recall curve is computed from the possible true positives and false positives per agent, giving an Average Precision per behavior bucket. The total mAP value is the mean over the APs of the behavior buckets.
Overlap rate: The fraction of times the most likely trajectory prediction of any agent overlaps with a real future trajectory of another agent (see [ettinger2021womd] for details).
TRI (Turning Radius Infeasibility): We compute the turning radius along the predicted trajectories using two approaches: one uses the predicted yaw output from the model (TRIh); the other does not require yaw predictions and instead uses the circumradius of three consecutive waypoints (TRIc). If the radius is less than a threshold r_min, it is treated as a violation. We set this threshold to the approximate minimum turning radius of a mid-size sedan. Note that a model that simply predicts a constant heading can achieve a TRIh rate of zero; hence we also compute inconsistencies between the turning radius suggested by the coordinates and the predicted headings (TRIhc). A TRIhc inconsistency is flagged when the difference between the heading implied by the waypoint circumradius and the predicted heading exceeds 0.05 radians at any time step in a trajectory.
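A minimal sketch of the circumradius-based check (TRIc), assuming 2D waypoints and an illustrative r_min value; the heading-based variants require the model's yaw output and are omitted here.

```python
import numpy as np

def circumradius(p1, p2, p3):
    """Circumradius of the circle through three 2D waypoints; np.inf for
    (near-)collinear points, i.e. straight-line motion."""
    a = np.linalg.norm(p2 - p1)
    b = np.linalg.norm(p3 - p2)
    c = np.linalg.norm(p3 - p1)
    cross = (p2[0] - p1[0]) * (p3[1] - p1[1]) - (p2[1] - p1[1]) * (p3[0] - p1[0])
    area = 0.5 * abs(cross)
    if area < 1e-9:
        return np.inf
    return a * b * c / (4.0 * area)

def turning_radius_violations(traj, r_min):
    """Count consecutive waypoint triples whose implied turning radius is
    below r_min (a TRIc-style check). traj: (T, 2) array."""
    return sum(circumradius(traj[i], traj[i + 1], traj[i + 2]) < r_min
               for i in range(len(traj) - 2))
```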
5.3 MultiPath baseline
As our work evolved from MultiPath, we include a reference MultiPath model whose input and backbone are faithful to the original paper [sapp2019multipath] as a point of comparison, with a few minor differences. Specifically, we use a top-down rendering of the scene as before, but now employ a splat-rendering [zwicker2001surface_splat] approach for rasterization, in which we sample points uniformly from scene elements and apply an orthographic projection. This is a simpler, sparse form of rendering which does not employ anti-aliasing, but it is efficient and straightforward to implement in TensorFlow and run as part of the model compute graph on hardware accelerators (GPU/TPU).
As in the original paper, we use a top-down grid of cells centered around the AV sensing vehicle in WOMD, with a ResNet18 backbone [ResNet16]. We use 128 static anchors obtained via k-means, which are shared among all agent types (vehicles, pedestrians, cyclists) for simplicity. Figure 10 illustrates this model's inputs and architecture.

5.4 External benchmark results
On Argoverse, MultiPath++ achieves top-5 performance on most metrics (Table 2). Our technique is ranked first on the Waymo Open Motion Dataset [ettinger2021womd] (Table 3).
The tested model is based on the best configuration in Table 4, where the outputs from multiple ensemble heads are aggregated as described in Section 4.
On WOMD, we also see that the original MultiPath model, even with the refinement of learned anchors and ensembling, is outperformed by more recent methods. It is interesting to note that MultiPath is the best-performing top-down scene-centric model employing a CNN; every known method that outranks it uses sparse representations.
Argoverse leaderboard

Model | Rank | brier-minDE | minFDE | MR | minADE
LaneGCN [liang2020laneGCN] | 50 | 2.059 | 1.364 | 0.163 | 0.868
DenseTNT [gu21dense_tnt] | 23 | 1.976 | 1.282 | 0.126 | 0.882
HOME + GOHOME [gilles2021gohome] | 10 | 1.860 | 1.292 | 0.085 | 0.890
TPCN++ [ye2021tpcn] | 5 | 1.796 | 1.168 | 0.116 | 0.780
MultiPath++ (ours) | 4 | 1.793 | 1.214 | 0.132 | 0.790
QCraft Blue Team | 1 | 1.757 | 1.214 | 0.114 | 0.801
Waymo Open Motion Prediction

Model | Rank | minDE | minADE | MR | Overlap | mAP
MultiPath [sapp2019multipath] | 11 | 2.04 | 0.880 | 0.345 | 0.166 | 0.409
SceneTransformer [ngiam21scene_transformer] | 7 | 1.212 | 0.612 | 0.156 | 0.147 | 0.279
DenseTNT [gu21dense_tnt] | 5 | 1.551 | 1.039 | 0.157 | 0.178 | 0.328
MultiPath++ (ours) | 1 | 1.158 | 0.556 | 0.134 | 0.131 | 0.409
5.5 Qualitative Examples
Figure 4 shows examples of MultiPath++ on WOMD scenes. Figure 5 shows examples of MultiPath++ on Argoverse scenes. These examples show the ability of MultiPath++ to handle different road layouts and agent interactions.
5.6 Ablation Study
In this section we evaluate our design choices through an ablation study. Table 4 summarizes ablation results. In the following subsections we discuss how our architecture choices affect the model performance.
5.6.1 Set Functions
Recall that MultiPath++ uses two types of set functions. Invariant set functions are used to encode a set of elements (e.g. agents, roadgraph segments) into a single feature vector. Equivariant set functions are used to convert the set of learned anchors, together with the encoded feature vector as a context, into a corresponding set of trajectories with likelihoods.
We use multicontext gating to represent both types of functions. We experimented with other representations of set functions:

MLP+MaxPool: In this experiment, we replace the multi-context-gating (MCG) road-network encoder with an MLP+MaxPool applied to points rather than polylines, inspired by PointNet [qi2017pointnet]. We use a 5-layer MLP with ReLU activations.

Equivariant DeepSet [zaheer17deepset]: The equivariant set function is represented as a series of blocks, each consisting of an element-wise transformation followed by pooling to compute the context. Unlike MCG, it does not use gating (pointwise multiplication) between set elements and the context vector; instead, a linear transformation of the context is added to each element. We use a DeepSet of 5 blocks in the predictor.

Transformers [lee19settransformer]: We replace the gating mechanism (element-wise multiplication) on polylines with self-attention. For decoding, we use cross-attention, where the queries are the learned embeddings and the keys are the various encoder features.
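To make the contrast between the variants concrete, here is a minimal sketch of additive (DeepSet-style) versus multiplicative (MCG-style gating) context fusion. The weight shapes, pooling choices, and ReLU are illustrative assumptions of ours, not the paper's exact parameterization.

```python
import numpy as np

def deepset_block(x, W_x, W_c, b):
    """Permutation-equivariant DeepSet-style block: each element receives an
    element-wise linear transform plus an *added* linear transform of the
    pooled context. x: (n, d_in) -> (n, d_out)."""
    ctx = x.max(axis=0)                      # permutation-invariant pooling
    return np.maximum(x @ W_x + ctx @ W_c + b, 0.0)

def mcg_block(x, c, W_x, W_c):
    """MCG-style block (sketch): set elements are combined with a context
    vector by element-wise *multiplication* (gating) rather than addition,
    and an updated context is pooled from the gated elements."""
    gated = (x @ W_x) * (c @ W_c)            # gate broadcast over the n elements
    return gated, gated.mean(axis=0)         # (new elements, new context)
```

Permuting the input set permutes `deepset_block`'s output rows identically (equivariance) and leaves `mcg_block`'s pooled context unchanged (invariance); stacking several such blocks with skip connections gives the deeper MCG variants ablated below.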
5.6.2 Trajectory representation
As mentioned in Section 3.6, we experiment with predicting polynomial coefficients for the trajectory, as well as predicting kinematic control signals (acceleration and heading-change rate). We found that polynomial representations hurt performance, counter to the conclusions of PLOP [buhet20_plop], which demonstrated improvements over the then state of the art on PRECOG [precog] and nuScenes [caesar2020nuscenes] using polynomials to represent output trajectories. However, the PLOP datasets require predicting only 4s into the future, which is much shorter than our 8s prediction horizon, and for such short futures polynomial representations may be more suitable. In our setting, we do not see gains from the polynomial representation, possibly due to the larger dataset size and longer prediction horizon.
The controls-based output works better on distance metrics than polynomial representations, which suggests it is a more beneficial, domain-specific form of inductive bias. Overall, our results suggest that the simple sequence-of-raw-coordinates trajectory representation works best for distance-based metrics. However, these unconstrained representations have a non-trivial rate of kinematic infeasibility (TRIx metrics in Table 4). Kinematic feasibility and consistency between headings and positions are crucial in practice when such behavior models are used for planning and control of a real-world robot, an issue not captured by public benchmark metrics.
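To illustrate why the controls-based representation is kinematically feasible by construction, predicted control signals can be unrolled into positions by integration; the Euler scheme and state layout below are our simplifying assumptions, not the paper's exact integrator.

```python
import numpy as np

def rollout_controls(x0, y0, v0, theta0, accels, yaw_rates, dt=0.2):
    """Unroll predicted control signals (acceleration, heading-change rate)
    into a trajectory of (x, y) waypoints via Euler integration."""
    x, y, v, th = x0, y0, v0, theta0
    traj = []
    for a, w in zip(accels, yaw_rates):
        v += a * dt            # update speed from acceleration
        th += w * dt           # update heading from heading-change rate
        x += v * np.cos(th) * dt
        y += v * np.sin(th) * dt
        traj.append((x, y))
    return np.array(traj)
```

Because positions are produced by integrating bounded controls, headings and waypoints are consistent with each other by construction, unlike raw-coordinate outputs.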
5.6.3 Ensembling
We explore ensembling, producing an overcomplete set of trajectories that is then summarized using the aggregation proposed in Section 4, as well as their combination. We denote the number of ensemble heads by E and the number of trajectories per head by M'. Finally, we aggregate the trajectories down to M = 6, the number of trajectories required for the WOMD submission.
5.6.4 Anchor representation
We explore learned anchors and k-means-based anchors.
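The static-anchor baseline can be sketched as k-means over flattened ground-truth future trajectories; the naive initialization and fixed iteration count below are illustrative choices of ours, not the paper's recipe.

```python
import numpy as np

def kmeans_anchors(trajs, K, iters=20):
    """Static anchor trajectories via k-means on flattened futures.
    trajs: (N, T, 2) ground-truth futures; returns (K, T, 2) anchors."""
    X = trajs.reshape(len(trajs), -1).astype(float)
    centers = X[:K].copy()  # naive init; k-means++ would be more robust
    for _ in range(iters):
        # Assign each trajectory to its nearest anchor in flattened space.
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        assign = d.argmin(axis=1)
        for k in range(K):
            if (assign == k).any():
                centers[k] = X[assign == k].mean(axis=0)
    return centers.reshape(K, *trajs.shape[1:])
```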
Configuration | minDE | minADE | MR | AUC | TRIh (%) | TRIc (%) | TRIhc (%)
Original MultiPath | 4.752 | 1.796 | 0.749 | – | – | – | –
Set Function
MLP+MaxPool | 2.693 | 1.107 | 0.528 | 0.367 | – | – | –
DeepSet | 2.562 | 1.055 | 0.5 | 0.368 | – | – | –
Transformer | 2.479 | 1.042 | 0.479 | 0.3687 | – | – | –
1 MCG block | 2.764 | 1.15 | 0.55 | 0.312 | – | – | –
5 stacked MCG blocks | 2.305 | 0.978 | 0.44 | 0.393 | – | – | –
Trajectory Representation
Polynomial | 2.537 | 1.041 | 0.501 | 0.368 | n/a | 1.92 | n/a
Control | 2.319 | 0.987 | 0.449 | 0.386 | 0.00 | 1.22 | 0.00
Raw coordinates | 2.305 | 0.978 | 0.44 | 0.393 | n/a | 1.08 | n/a
Raw coordinates w/ heading | 2.311 | 0.978 | 0.443 | 0.395 | 4.10 | 1.04 | 9.92
Ensembling
| 2.333 | 0.982 | 0.410 | 0.240 | – | – | –
| 2.18 | 0.948 | 0.395 | 0.297 | – | – | –
| 2.487 | 1.057 | 0.473 | 0.367 | – | – | –
| 2.305 | 0.978 | 0.44 | 0.393 | – | – | –
Anchors
Static k-means anchors | 2.99 | 1.22 | 0.578 | 0.324 | – | – | –
Learned anchors | 2.305 | 0.978 | 0.44 | 0.393 | – | – | –
The reference configuration uses the road encoding, state-history encoding, and interaction encoding described in Section 3. "n/a" denotes a model that does not predict heading.
5.7 Discussion
First, we remark that MultiPath++ is a significant improvement over its predecessor MultiPath, as seen in Tables 3 and 4. As discussed in this paper, they differ in many design dimensions, the primary one being the change from a dense top-down raster representation to a sparse, element-based representation with agent-centric coordinate systems. The other design choices are validated in isolation in the following discussion.
We find that MLP+MaxPool performs worst among the set-function variants, as expected given its limited capacity, while DeepSet outperforms it. Increasing the depth of the MCG gives consistently better results, owing to the effective increase in capacity and the flow of information across skip connections; we obtain the best performance with an MCG depth of 5 blocks.
We find that learning anchors ("Learned anchors") is more effective than using a set of anchors obtained a priori via k-means. This runs counter to the original finding in the MultiPath paper [sapp2019multipath] that anchor-free models suffer from mode collapse. The difference could be due to the richer and more structured inputs, improved model architecture, and larger batch sizes in MultiPath++; we leave more detailed ablations on this issue to future work. We compare the baseline of directly outputting a single head with 6 trajectories to training 5 ensemble heads. We see that ensembling significantly improves most metrics, particularly minDE, for which this combination is best. We also train a model with a single head that outputs 64 trajectories, followed by our aggregation method that reduces them to 6. Compared to the baseline, this model significantly improves the metrics that require diverse predictions, but slightly regresses the average trajectory-distance metrics. This suggests that the different metrics pose different solution requirements: as expected, our aggregation criterion is well suited to preserving diversity, while straight-up ensembling is better at capturing the average distribution. Finally, our experiment with more ensemble heads and more predictions per head combines the strengths of both techniques, obtaining strictly superior performance on all metrics compared to the baseline.
5.8 Conclusion
We proposed a novel behavior prediction system, MultiPath++, by carefully considering choices for input representation and encoding, encoding fusion, and output-distribution representation. We demonstrated state-of-the-art results on popular benchmarks for behavior prediction. Furthermore, we surveyed existing methods, analyzed our approach empirically, and provided practical insights for the research community. In particular, we showed the importance of sparse encodings, efficient fusion methods, control-based methods, and learned anchors. Finally, we provided a practical guide to various tricks used during training and inference to improve robustness, increase diversity, handle missing data, and ensure fast convergence during training.
References
Appendix A Details and Derivation of Aggregation Algorithm
By producing an overcomplete trajectory representation that is later aggregated into a fixed, small number of trajectories, we attempt to address two kinds of uncertainty in the data:

Aleatoric uncertainty: This is natural variation in the data. For example, given the same context information, an agent could turn left, turn right, or change lanes. This level of ambiguity cannot be resolved by increasing model capacity; rather, the model needs to predict calibrated probabilities for these outcomes. Although it is theoretically possible to model these variations with a small number of output trajectories directly, learning poses several challenges, including mode collapse and failure to capture the variations due to limited model capacity. Training the model to produce an overcomplete representation forces it to output a diverse distribution of trajectories and could make it more resistant to mode collapse. Following this with greedy iterative trajectory aggregation enhances diversity in the final output.

Epistemic uncertainty: This is the variation across model outputs, which typically indicates a model's failure to capture certain aspects of the scene or input features. Such variation can occur if some models are poorly trained or have not seen a particular slice of the data. Model ensembling attempts to reduce this uncertainty.
For ease of exposition, we assume each trajectory to consist of a single time point; the same computations are applied to each time step of a future sequence. The overcomplete output is a Gaussian mixture model (GMM) with EM' modes over the future position s:

(20) p_\Omega(s) = \sum_{k=1}^{E M'} \pi_k \, \mathcal{N}(s; \mu_k, \Sigma_k).
We formulate the aggregation as obtaining an M-mode GMM q which minimizes the KL divergence D_KL(p_Ω ‖ q). This is equivalent to maximizing the expected log-likelihood of a sample drawn from the overcomplete distribution p_Ω:

(21) q^* = \arg\max_{q} \; \mathbb{E}_{s \sim p_\Omega} \left[ \log q(s) \right].
Assuming the overcomplete distribution approximates the real distribution, this is roughly equivalent to fitting the compact distribution to the real data, but with the added benefits described above. Directly maximizing (21) is intractable, so we employ an Expectation-Maximization-like algorithm to obtain a local maximum. Writing θ for the parameters of q and L(θ) for the objective in (21), the difference in the objective between an old value θ and a new value θ' may be written as

(22) \mathcal{L}(\theta') - \mathcal{L}(\theta) = \mathbb{E}_{s \sim p_\Omega} \left[ \log q_{\theta'}(s) - \log q_\theta(s) \right].
Denoting by h the hidden variable selecting a mixture component of the compact representation, we may write:

(23) \log q_{\theta'}(s) - \log q_\theta(s)

(24) = \log \sum_h q_{\theta'}(s, h) - \log q_\theta(s)

(25) = \log \sum_h q_\theta(h \mid s) \, \frac{q_{\theta'}(s, h)}{q_\theta(h \mid s)} - \log q_\theta(s)

(26) \ge \sum_h q_\theta(h \mid s) \log \frac{q_{\theta'}(s, h)}{q_\theta(h \mid s)} - \log q_\theta(s)

(27) = \sum_h q_\theta(h \mid s) \log \frac{q_{\theta'}(s, h)}{q_\theta(h \mid s) \, q_\theta(s)}

(28) = \sum_h q_\theta(h \mid s) \log \frac{q_{\theta'}(s, h)}{q_\theta(s, h)},

where (26) follows from Jensen's inequality.
Thus

(29) \mathcal{L}(\theta') - \mathcal{L}(\theta) \ge \mathbb{E}_{s \sim p_\Omega} \left[ \sum_h q_\theta(h \mid s) \log \frac{q_{\theta'}(s, h)}{q_\theta(s, h)} \right].
The right-hand side is, up to terms independent of θ', the Q function of the EM algorithm. Maximizing the Q function with respect to θ' ensures that the likelihood increases at least as much when we update the parameters from θ to θ'. Noting that the denominator in (29) does not depend on θ' and factoring out the terms independent of θ', we find the update that maximizes the lower bound to be

(30) \theta' = \arg\max_{\theta'} \; \mathbb{E}_{s \sim p_\Omega} \left[ \sum_h q_\theta(h \mid s) \log q_{\theta'}(s, h) \right]

(31) = \arg\max_{\theta'} \; \sum_k \pi_k \, \mathbb{E}_{s \sim \mathcal{N}(\mu_k, \Sigma_k)} \left[ \sum_h q_\theta(h \mid s) \log q_{\theta'}(s, h) \right],

where the second equality follows from the fact that the overcomplete distribution is a mixture of Gaussians. The updates can be solved as follows.
Here q_θ(h = m | s) is the posterior probability that a given sample s is assigned to component m of the mixture model specified by θ (we use the previous estimate of θ for the posterior). This can be computed as:

(32) q_\theta(h = m \mid s) = \frac{\tilde\pi_m \, \mathcal{N}(s; \tilde\mu_m, \tilde\Sigma_m)}{\sum_{m'} \tilde\pi_{m'} \, \mathcal{N}(s; \tilde\mu_{m'}, \tilde\Sigma_{m'})}.
Notice the resemblance to standard GMM fitting, except that in the standard setting p_Ω is a Dirac delta at each data point (the input to standard GMM fitting is a set of points rather than a distribution). Unlike the standard setting, the expectations (integrations) in the above EM updates are hard to compute in closed form. Instead, for any function f we employ the approximation

\mathbb{E}_{s \sim \mathcal{N}(\mu_k, \Sigma_k)} \left[ q_\theta(h \mid s) \, f(s) \right] \approx q_\theta(h \mid \mu_k) \, \mathbb{E}_{s \sim \mathcal{N}(\mu_k, \Sigma_k)} \left[ f(s) \right].

In other words, we assume that the posterior probability of any output cluster depends only on the mean of the overcomplete component inside the expectation. This approximation is reasonable since most samples drawn from a component are concentrated around its mean; furthermore, as the number of components in the overcomplete representation increases, the variance within each component becomes smaller, concentrating mass further around the mean. With γ_{km} = q_θ(h = m | μ_k), the set of updates can now be solved in closed form:

\tilde\pi_m = \sum_k \pi_k \, \gamma_{km}, \qquad
\tilde\mu_m = \frac{1}{\tilde\pi_m} \sum_k \pi_k \, \gamma_{km} \, \mu_k, \qquad
\tilde\Sigma_m = \frac{1}{\tilde\pi_m} \sum_k \pi_k \, \gamma_{km} \left( \Sigma_k + (\mu_k - \tilde\mu_m)(\mu_k - \tilde\mu_m)^\top \right).
Since EM is a local optimization method, careful initialization of the GMM parameters is important. Our initialization criterion for the GMM centroids is to maximize the probability that a future point lies within distance τ of at least one centroid:

(33) \max_{\{\tilde\mu_1, \dots, \tilde\mu_M\} \subset \{\mu_k\}} \; \mathbb{E}_{s \sim p_\Omega} \left[ \mathbf{1}\!\left( \min_m \| s - \tilde\mu_m \| \le \tau \right) \right].

Unfortunately, directly optimizing (33) is NP-hard, so we instead select an M-sized subset of the component means in a greedy fashion to maximize (33). Note that this subset-selection problem is submodular, which means the greedy method is guaranteed to achieve at least (1 - 1/e) of the optimal subset value.