Modeling and predicting the future behavior of human agents is a fundamental problem in many real-world robotics domains. For example, accurately forecasting the future state of other vehicles, cyclists and pedestrians is critical for safe, comfortable, and human-like autonomous driving. However, behavior prediction in an autonomous vehicle (AV) driving setting poses a number of unique modeling challenges:
1) Multimodal output space: The problem is inherently stochastic; it is impossible to truly know the future state of the environment. This is exacerbated by the fact that other agents’ intentions are not observable, and leads to a highly multimodal distribution over possible outcomes (e.g., a car could turn left or right at an intersection). Effective models must be able to represent such a rich output space with high precision and recall matching the underlying distribution.
2) Heterogeneous, interrelated input space: The driving environment representation can contain a highly heterogeneous mix of static and dynamic inputs: road network information (lane geometry and connectivity, stop lines, crosswalks), traffic light state information, and motion history of agents. Driving situations are often highly interactive, and can involve many agents at once (e.g., negotiating a 4-way stop with crosswalks). This requires careful modeling choices, as the cost of explicitly modeling joint future distributions over multiple agents grows exponentially with the number of agents. Effective models must capture not only the interactions between the agents in the scene, but also the relationships between the road elements and the behavior of agents given the road context.
The novel challenges and high impact of this problem have naturally garnered much interest in recent years. There has been a rich body of work on how to model agents’ futures, their interactions, and the environment. However, there is little consensus to date on the best modeling choices for each component, and in popular benchmark challenge datasets [nuscenes2019, chang2019argoverse, interactiondataset, ettinger2021womd], there is a surprisingly diverse set of solutions to this problem; for details see Section 2 and Table 1.
The MultiPath framework [sapp2019multipath] addresses the multimodal output space challenge above by modeling the highly multimodal output distributions via a Gaussian Mixture Model (GMM). It handles the common issue of mode collapse during learning by using static trajectory anchors, an external input to the model. This practical solution gives practitioners a straightforward way of ensuring diversity and an extra level of modeler control via the design of such anchors. The choice of a GMM representation proved to be extremely popular, appearing in many works; see Table 1, “Trajectory Distribution”, where “Weighted set” is a special case of GMMs in which only means and mixture weights are modeled.
The MultiPath input representation and backbone draw heavily upon the computer vision literature. By rasterizing all world state in a top-down orthographic view, MultiPath and others [phan2019covernet, DESIRE, hong2019rules, tang_multifuture, marchetti2020mantra, casas2018intentnet, cui2019multimodal, neural_motion_planner_zeng2019] leverage powerful, established CNN architectures like ResNet [ResNet16], which offer solutions to the heterogeneous interrelated input space: the heterogeneous world state is mapped to a common pixel format, and interactions occur through local information sharing in convolution operations. While convenient and established, there are downsides to such rasterization: (1) There is an uneasy trade-off between resolution of the spatial grid, field of view, and compute requirements. (2) Rasterizing is a form of manual feature engineering, and some features may be inherently difficult to represent in such a framework (e.g., radial velocity). (3) It is difficult to capture long-range interactions via convolutions with small receptive fields. (4) The information content is spatially very sparse, making a dense representation a potentially computationally wasteful choice.
In this paper, we introduce MultiPath++, which builds upon MultiPath, taking its output GMM representation and concept of anchors, but reconsidering how to represent and combine highly heterogeneous world state inputs and model interactions between state elements. MultiPath++ introduces a number of key upgrades:
We eschew the rasterization-and-CNN approach in favor of modeling sparse world state objects more directly from their compact state description. We represent road elements as polylines, agent history as a sequence of physical state encoded with RNNs, and agent interactions as RNNs over the state of neighbors relative to each ego-agent. These choices avoid lossy rasterization in favor of raw, continuous state, and result in compute complexity that scales with the number of scene elements rather than the size of a spatial grid. Long-range dependencies are effectively and efficiently modeled in our representation.
Capturing relationships between road elements and agents is critical, and we find that encoding each element independently does not perform as well as modeling their interactions (e.g., when the road encoding is aware of relevant agents, and vice versa). To address this, we propose a novel form of context awareness we call multi-context gating (MCG), in which sets of elements have access to a summary context vector upon which encodings are conditioned. MCG is implemented as a generic neural network component that is applied throughout our model. MCG can be viewed as an efficient form of cross-attention, whose efficiency/quality trade-off depends on the size of the context vector.
We also explore improvements in trajectory modeling, comparing representations based on kinematic controls, and/or polynomials as a function of continuous future time. We further demonstrate a way to learn latent representations of anchors and show they outperform the original static anchors of MultiPath, while simplifying model creation to a single-step process.
Finally, we find significant additional gains on public benchmarks by applying ensembling techniques to our models. Unlike models with static anchors, the latent anchors of a MultiPath++ ensemble are not in direct correspondence. Furthermore, many popular behavior prediction benchmarks have introduced metrics such as miss rate (MR) and mean Average Precision (mAP), which require the ability to model diverse outcomes with few trajectories and differ from pure trajectory distance error, which captures the average agent behavior. With the above in mind, we formulate the problem of ensembling the results of several models as one of greedy iterative clustering, which maximizes a probabilistic objective using the popular Expectation Maximization algorithm [bishop2006pattern].
As of November 1, 2021, MultiPath++ ranks on the Waymo Open Motion Dataset leaderboard (https://waymo.com/open/challenges/2021/motion-prediction/) and on the Argoverse Motion Forecasting Competition (https://eval.ai/web/challenges/challenge-page/454/leaderboard/1279). We offer MultiPath++ as a reference set of design choices, empirically validated via ablation studies, that can be adopted, further studied and extended by the behavior modeling community.
2 Related Work
| Method | Road Enc. | Motion Enc. | Interactions | Decoder | Output | Trajectory Distribution |
| --- | --- | --- | --- | --- | --- | --- |
| TNT [zhao2020tnt] | polyline | polyline | maxpool, attention | MLP | states | Weighted set |
| LaneGCN [liang2020laneGCN] | GNN | 1D conv | GNN | MLP | states | Weighted set |
| VectorNet [gao2020vectornet] | polyline | polyline | maxpool, attention | MLP | states | Single traj. |
| SceneTransformer [ngiam21scene_transformer] | polyline | attention | attention | attention | states | Weighted set |
| GOHOME [gilles2021gohome] | GNN | 1D conv + GRU | GNN | MLP | states | heatmap |
| MP3 [casas2021mp3] | raster | raster | conv | conv | cost function | Weighted samples |
| CoverNet [phan2019covernet] | raster | raster | conv | lookup | states | GMM w/ dynamic anchors |
| DESIRE [DESIRE] | raster | GRU | spatial pooling | GRU | states | Samples |
| SocialLSTM [sociallstm] | – | LSTM | spatial pooling | LSTM | states | Samples |
| PRANK [biktairov2020prank] | raster | raster | conv | lookup | states | Weighted set |
| IntentNet [casas2018intentnet] | raster | raster | conv | conv | states | Single traj. |
| SpaGNN [casas2020spagnn] | raster | raster | GNN | MLP | state | Single traj. |
| Multimodal [cui2019multimodal] | raster | raster | conv | conv | states | Weighted set |
| PLOP [buhet2020plop] | raster | LSTM | conv | MLP | state poly | GMM |
| Precog [precog_Rhinehart_2019_ICCV] | raster | GRU | multi-agent sim. | GRU | motion | Samples |
| DKM [cui2020deep] | raster | raster | conv | conv | controls | Weighted set |
| MultiPath [sapp2019multipath] | raster | raster | conv | MLP | states | GMM w/ static anchors |
We focus on architectural design choices for behavior prediction in driving environments—what representations to use to encode road information, agent motion, agent interactions, output trajectories, and output distributions. Table 1 is a summary of past work, which we go over here with additional context.
For road encoding, there is a dichotomy of representations. The raster approach encodes the world as a stack of images, from a top-down orthographic (or “bird’s-eye”) view. Rasterizing the world state has the benefit of simplicity: all the various types of input information (road configuration, agent state history, spatial relationships) are unified via rendering as a multi-channel image, enabling one to leverage powerful off-the-shelf Convolutional Neural Network (CNN) techniques. However, this one-size-fits-all approach has significant downsides: difficulty in modeling long-range interactions, constrained field of view, and difficulty in representing continuous physical states. As an alternative, the polyline approach describes curves (e.g., lanes, crosswalks, boundaries) as piecewise linear segments. This is a significantly more compact form due to the sparse nature of road networks. Previous works typically process a set-of-polylines description of the world in a per-agent, agent-centric coordinate system. LaneGCN [liang2020laneGCN] stands apart by treating road lanes as nodes in a graph neural network, leveraging road network connectivity structure.
To model motion history, one popular choice is to encode the sequence of past observed states via a recurrent net (GRU, LSTM) or temporal (1D) convolution. As an alternative, in the raster framework, the state sequence is typically rendered as a stack of binary mask images depicting agent oriented bounding boxes, or rendered in the same image, with the corresponding time information rendered separately [Refaat2019AgentPF].
To model agent interactions, one must deal with a dynamic set of neighboring agents around each modeled agent. This is typically done by aggregating neighbor motion history with a permutation-invariant set operator: pooling or soft attention. Notably, Precog [precog_Rhinehart_2019_ICCV] jointly rolls out agent policies in a step-wise simulation. Raster approaches rely on convolution over the 2D spatial grid to implicitly capture interactions; long-term interactions are dependent on the network receptive fields.
Agent trajectory decoding choices are similar to choices for encoding motion history, with the exception of methods that do lookup on a fixed or learned trajectory database [phan2019covernet, biktairov2020prank].
The most popular output trajectory representation is a sequence of states (or state differences). A few works [rhinehart2018r2p2, rhinehart2019precog] instead model Newton’s laws of motion in a discrete time-step aggregation capturing Verlet integration. Other works [salzmann2020trajectron++, cui2020deep] explicitly model controls which parameterize a kinematically-feasible model for vehicles and bicycles. With any of these representations, the spacetime trajectory can be intrinsically represented as a sequence of sample points or a continuous polynomial representation [buhet2020plop]. In our experimental results, we explore the effectiveness of states and kinematic controls, with and without an underlying polynomial basis. Notably unique are (1) HOME [gilles21home] and GOHOME [gilles2021gohome], which first predict a heatmap and then decode trajectories after sampling, and (2) MP3 [casas2021mp3] and NMP [neural_motion_planner_zeng2019], which learn a cost function evaluator of trajectories, with the trajectories enumerated heuristically rather than generated by a learned model.
Nearly all work assumes an independent, per-agent output space, in which agent interactions cannot be explicitly captured. A few works are notable in describing joint interactions as output, either in an asymmetric [wimp2020, tolstaya2021cbp] or symmetric way [ettinger2021womd, precog_Rhinehart_2019_ICCV, ngiam21scene_transformer].
The choice of output trajectory distribution has ramifications on downstream applications. An intrinsic property of the driving setting is that a vehicle or a pedestrian can follow one of a diverse set of possible trajectories. It is thus essential to capture the multimodal nature of the problem. Gaussian Mixture Models (GMMs) are a popular choice for this purpose due to their compact parameterized form; mode collapse is addressed through training tricks [kitani_diverse_forecasting_dpps, cui2019multimodal] or the use of trajectory anchors [sapp2019multipath]. Other approaches model a discrete distribution over a set of trajectories (learned or fixed a priori) [zhao2020tnt, liang2020laneGCN, biktairov2020prank, cui2019multimodal], or via a collection of trajectory samples drawn from a latent distribution and decoded by the model [sociallstm, DESIRE, precog_Rhinehart_2019_ICCV, rhinehart2018r2p2, marchetti2020mantra].
3 Model Architecture
Figure 1 depicts the proposed MultiPath++ model architecture, which on a high level is similar to that of MultiPath [sapp2019multipath]; the model consists of an encoding step and a predictor head which conditions on anchors and outputs a Gaussian Mixture Model (GMM) [bishop2006pattern] distribution for the possible agent position at each future time step.
MultiPath used a common, top-down image based representation for all input modalities (e.g., agents’ tracked state, road network information), and a CNN encoder. In contrast, MultiPath++ has encoders processing each input modality and converting it to a compact and sparse representation; the different modality encodings are later fused using a multi-context gating (MCG) mechanism.
3.1 Input Representation
MultiPath++ makes predictions based on the following input modalities:
Agent state history: a state sequence describing the agent trajectory for a fixed number of past steps. In the Waymo Open Motion dataset [ettinger2021womd], this state information includes position, velocity, 3D bounding box size, heading angle and object type; for Argoverse [chang2019argoverse] only position information is provided. The state is transformed to an agent-centric coordinate system, such that the most recent agent pose is located at the origin and heading east. (Since the explicit heading is missing in Argoverse data, we use the last two time steps to get the current orientation.)
Road network: Road network elements such as lane lines, crosswalks, and stop lines are often represented as parametric curves like clothoids [neural_motion_planner_zeng2019], which can be sampled to produce point collections that are easily stored in multi-dimensional array format, as is done in many public datasets [ettinger2021womd, chang2019argoverse]. We further summarize this information by approximating point sequences for each road element as a set of piecewise linear segments, or polylines, similar to [gao2020vectornet, liang2020laneGCN, homayounfar2018maps].
Agent interactions: For each modeled agent, we consider all neighboring agents. For each neighboring agent, we extract features in the modeled agent’s coordinate frame, such as relative orientation, distance, history and speed.
AV-relative features: Similar to the interaction features, we extract features of the autonomous vehicle / sensing vehicle (AV) relative to each other agent. We model the AV separately from the other agents. We hypothesize this is a helpful distinction for the model because: (a) The AV is the center of sensors’ field of view. Tracking errors due to distance and occlusion are relative to this center. (b) The behavior of the AV can be unlike the other road users, which to a good approximation can be assumed to all be humans.
Details on how these features are encoded and fused are described next. These steps comprise the “Encoder” block of Figure 1, whose output is an encoding per agent, in each agent’s coordinate frame.
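The agent-centric normalization described above can be illustrated with a short sketch. It is not the authors' implementation; the function names `heading_from_last_two` and `to_agent_frame` are hypothetical, and the heading estimate from the last two positions follows the Argoverse note above.

```python
import math

def heading_from_last_two(positions):
    """Estimate heading (radians) from the last two (x, y) positions."""
    (x0, y0), (x1, y1) = positions[-2], positions[-1]
    return math.atan2(y1 - y0, x1 - x0)

def to_agent_frame(points, origin, heading):
    """Translate by -origin, then rotate by -heading so the agent faces +x (east)."""
    c, s = math.cos(-heading), math.sin(-heading)
    ox, oy = origin
    out = []
    for x, y in points:
        dx, dy = x - ox, y - oy
        out.append((c * dx - s * dy, s * dx + c * dy))
    return out

# Hypothetical history of an agent driving north in world coordinates.
history = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
heading = heading_from_last_two(history)
local = to_agent_frame(history, history[-1], heading)
```

In the agent frame the most recent pose sits at the origin and earlier poses lie on the negative x-axis, i.e. behind an agent that is heading east.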
3.2 Multi Context Gating for fusing modalities
In this section we focus on how to combine the different input modality encodings in an effective way. Other works use a common rasterized format [sapp2019multipath, neural_motion_planner_zeng2019], a simple concatenation of encodings [DESIRE, precog_Rhinehart_2019_ICCV, salzmann2020trajectron++], or employ attention [ngiam21scene_transformer, tang_multifuture, gao2020vectornet, liang2020laneGCN]. We propose an efficient mechanism for fusing information we term multi-context gating (MCG), and use MCG blocks throughout the MultiPath++ architecture.
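Given a set of elements $s_{1:n}$ and an input context vector $c$, a CG block assigns an output $s'_{1:n}$ to each element in the set, and computes an output context vector $c'$. The output does not depend on the ordering of input elements. Mathematically, let $f$ be the function implemented by the CG block, and $\pi$ be any permutation operation on a sequence of elements. The following equations hold for CG:
$$(s'_{1:n},\, c') = f(s_{1:n},\, c), \qquad (\pi(s'_{1:n}),\, c') = f(\pi(s_{1:n}),\, c),$$
which imply that we have permutation-equivariant element outputs $s'_{1:n}$ and a permutation-invariant output context $c'$.
The size of the set $n$ can vary across calls to $f$.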
CG’s set function properties—permutation invariance/equivariance and ability to process arbitrarily sized sets—are naturally motivated by the need to encode a variable, unordered set of road network elements and agent relationships. A number of set functions have been proposed in the literature such as DeepSets [zaheer17deepset], PointNet [qi2017pointnet] and SetTransformers [lee19settransformer].
A single CG block is implemented via
$$s'_i = \mathrm{MLP}(s_i) \odot \mathrm{MLP}(c), \qquad c' = \mathrm{Pool}(s'_{1:n}),$$
where $s_{1:n}$ are the input elements, $c$ is the input context vector, $\odot$ denotes element-wise product and $\mathrm{Pool}$ is a permutation-invariant pooling layer such as max or average pooling. These operations are illustrated in Figure 2. In the absence of an input context, we simply set $c$ to an all-ones vector in the first context gating block. Note that both $s'_{1:n}$ and $c'$ depend on all inputs. It can be shown that $c'$ is permutation-invariant w.r.t. the input embeddings. It can also be shown that $s'_{1:n}$ are permutation-equivariant.
We stack multiple CG blocks by incorporating running-average skip-connections, as is done in residual networks [ResNet16].
We denote such multi-layer CG blocks as $\mathrm{MCG}_N(\cdot)$ for a stack of $N$ blocks.
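A minimal sketch of context gating and its stacked form is given below. It is illustrative only: `make_mlp` is a hypothetical one-layer stand-in for a learned MLP, max pooling is one of the pooling choices named above, and the exact running-average form of the skip connections is an assumption (here, the mean over the block input and all block outputs so far), not a detail confirmed by the text.

```python
import random

def make_mlp(in_dim, out_dim, rng):
    """Hypothetical one-layer 'MLP' (linear + ReLU) with fixed random weights."""
    w = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    def mlp(x):
        return [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w]
    return mlp

def cg_block(elements, context, mlp_s, mlp_c):
    """One CG block: gate each element by the context, then max-pool."""
    gate = mlp_c(context)
    out = [[a * b for a, b in zip(mlp_s(s), gate)] for s in elements]
    new_context = [max(col) for col in zip(*out)]  # permutation-invariant pooling
    return out, new_context

def mcg(elements, context, blocks):
    """Stack CG blocks with running-average skip connections (assumed form)."""
    s_bar, c_bar = elements, context
    for k, (mlp_s, mlp_c) in enumerate(blocks, start=1):
        s_k, c_k = cg_block(s_bar, c_bar, mlp_s, mlp_c)
        # running mean over the input and the k block outputs seen so far
        s_bar = [[(k * a + b) / (k + 1) for a, b in zip(old, new)]
                 for old, new in zip(s_bar, s_k)]
        c_bar = [(k * a + b) / (k + 1) for a, b in zip(c_bar, c_k)]
    return s_bar, c_bar

rng = random.Random(0)
dim = 4
blocks = [(make_mlp(dim, dim, rng), make_mlp(dim, dim, rng)) for _ in range(3)]
elements = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
ones = [1.0] * dim  # no input context: all-ones vector
out, ctx = mcg(elements, ones, blocks)
out_rev, ctx_rev = mcg(elements[::-1], ones, blocks)
```

Reversing the input set reverses the per-element outputs and leaves the pooled context unchanged, which is the equivariance/invariance property the section relies on. All vectors share one dimension here so that the running averages are well defined.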
Comparison with attention. Attention is a popular mechanism in domains such as NLP [vaswani2017attention] and computer vision [dosovitskiy2020ViT, dai2021coatnet], in which the encoding for each element of a set is updated via a combination of encodings of all other elements. For a set of size $n$, this intrinsically requires $O(n^2)$ operations. In models of human behavior in driving scenarios, self attention has been employed to update encodings for, e.g., road lanes, by attending to neighboring lanes, or to update encodings per agent based on the other agents in the scene. Cross attention has also been used to condition one input type (e.g., agent encodings) on another (e.g., road lanes) [liang2020laneGCN, gao2020vectornet, ngiam21scene_transformer]. Without loss of generality, if there are $n$ agents and $m$ road elements, this cross attention scales as $O(nm)$ to aggregate road information for each agent.
CG can be viewed as an approximation to cross-attention. Rather than each of $n$ elements attending to all $m$ elements of the latter set, CG summarizes the latter set with the single context vector $c$, as shown in Figure 3. Thus the dimensionality of $c$ needs to be great enough to capture all the useful information contained in the original encodings. If the dimensionality of elements is $d$, and the dimensionality of $c$ is $d'$, then if $d' = md$, CG can be reduced to some form of cross-attention by setting $c$ to the concatenation of the $m$ elements. When $d' < md$, we are trading off the representational power of full cross-attention for computational efficiency.
3.3 Encoders
In this section we detail the specific encoders shown in Figure 1.
Agent history encoding. The agent history encoding is obtained by concatenating the output of three sources:
An LSTM on the history features from $H$ time steps ago to the present time: $(x_{t-H}, \dots, x_t)$.
An LSTM on the difference in the history features $(x_{t-H+1} - x_{t-H}, \dots, x_t - x_{t-1})$.
MCG blocks applied to the set of history elements. Each element in the set consists of a historical position and time offset in seconds relative to the present time. The context input here is an all-ones vector with an identity context MLP. Additionally, we encode the history frame id as a one-hot vector to further disambiguate the history steps.
We denote the final embedding, which concatenates these three state history encodings, as $\phi^{(\text{state})}$.
Agent interaction encoding. For each modeled agent, we build an interaction encoding by considering each neighboring agent $v$’s past state observations: $(x^v_{t-H}, \dots, x^v_t)$. We transform $v$’s state into the modeled agent’s coordinate frame, and embed it with an LSTM to obtain an embedding $\phi^{(\text{interaction})}_v$. Note this is similar to the ego-agent history embedding but instead applied to the relative coordinates of another agent.
By doing this for $n$ neighboring agents we obtain a set of interaction embeddings $\phi^{(\text{interaction})}_{1:n}$. We fuse neighbor information with stacked MCG blocks as follows:
$$\left(\phi'^{(\text{interaction})}_{1:n},\ \phi^{(\text{interaction})}\right) = \mathrm{MCG}_N\left(\phi^{(\text{interaction})}_{1:n},\ \left[\phi^{(\text{state})},\ \phi^{(\text{interaction})}_{\text{AV}}\right]\right),$$
where the second argument is the input context vector to $\mathrm{MCG}_N$, which in this case is a concatenation of the modeled agent’s history embedding $\phi^{(\text{state})}$ and the AV’s interaction embedding $\phi^{(\text{interaction})}_{\text{AV}}$. In this way we emphasize the AV’s representation as a unique entity in the context for all interactions; see Section 3.1 for motivation.
Road network encoding. We use the polyline road element representation discussed in Section 3.1 as input. Each line segment is parameterized by its start point, end point and the road element semantic type (e.g., Crosswalk, SolidDoubleYellow, etc.). For each agent of interest, we transform the closest $P$ polylines into their frame of reference and call the $i$-th transformed segment $\ell_i$. Let $\mathbf{r}$ be the closest point from the agent to the segment, and $\mathbf{t}$ be the unit tangent vector at $\mathbf{r}$ on the original curve. Then we represent the agent’s spatial relationship to the segment via a feature vector computed from these quantities. These feature vectors are each processed with a shared MLP, resulting in a set of agent-specific embeddings per road segment, which we denote by $\phi^{(\text{road})}_{1:P}$. We then fuse road element embeddings with the agent history embedding $\phi^{(\text{state})}$ using stacked MCG blocks
$$\left(\phi'^{(\text{road})}_{1:P},\ \phi^{(\text{road})}\right) = \mathrm{MCG}_N\left(\phi^{(\text{road})}_{1:P},\ \phi^{(\text{state})}\right)$$
and thus enrich the road embeddings with dynamic state information.
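The geometric quantities used for the road features, the closest point on a segment and a unit tangent, can be computed as in the sketch below. It assumes the segment is already expressed in the agent frame (agent at the origin) and approximates the tangent of the original curve by the direction of the line segment; the function name is hypothetical.

```python
import math

def closest_point_and_tangent(start, end):
    """Closest point on segment start->end to the agent at the origin,
    plus the unit tangent of the segment (zero vector if degenerate)."""
    sx, sy = start
    ex, ey = end
    dx, dy = ex - sx, ey - sy
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return (sx, sy), (0.0, 0.0)
    # projection parameter of the origin onto the line, clamped to the segment
    u = max(0.0, min(1.0, -(sx * dx + sy * dy) / length_sq))
    length = math.sqrt(length_sq)
    return (sx + u * dx, sy + u * dy), (dx / length, dy / length)

# A lane segment two meters to the left of the agent, running east.
r, t = closest_point_and_tangent((-1.0, 2.0), (3.0, 2.0))
# A segment entirely ahead of the agent: the closest point is its start.
r_clamped, _ = closest_point_and_tangent((1.0, 1.0), (2.0, 1.0))
```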
3.4 Output representation
MultiPath++ predicts a distribution of future behavior parameterized as a Gaussian Mixture Model (GMM), as is done in MultiPath [sapp2019multipath] and other works [mercat2020multi, phan2019covernet, buhet20_plop]. For efficient long-term prediction, the distribution is conditionally independent over time steps given the mixture component; thus each mode $k$ at each time step $t$ is represented as a Gaussian distribution over $(x, y)$ with a mean $\mu_k^t \in \mathbb{R}^2$ and covariance $\Sigma_k^t \in \mathbb{R}^{2 \times 2}$. The mode likelihoods $q_k$ are tied over time. MAP inference per mode is equivalent to taking the sequence $\mu_k^1, \dots, \mu_k^T$ as state waypoints defining a possible future trajectory for the agent. The full output distribution is
$$p(\mathbf{s}) = \sum_{k=1}^{M} q_k \prod_{t=1}^{T} \mathcal{N}\left(s_t;\ \mu_k^t,\ \Sigma_k^t\right),$$
where $\mathbf{s}$ represents a trajectory; $\mathbf{s} = \{s_1, \dots, s_T\}$.
The classification head of Figure 1 predicts the $q_k$ as a softmax distribution over mixture components. The regression head outputs the parameters of the Gaussians $\mu_k^t$ and $\Sigma_k^t$ for $M$ modes and $T$ time steps.
Training objective. We follow the original MultiPath approach and maximize the likelihood of the groundtruth trajectory under our model’s predicted distribution. We make a hard-assignment labeling of a “correct” mixture component by choosing the one with the smallest Euclidean distance to the groundtruth trajectory.
The average log loss over the entire training set is optimized using Adam. We use an initial learning rate of and a batch size of , with decay rate of every 2 epochs. The final model is chosen after training for steps.
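The hard-assignment likelihood objective can be sketched for a single example as follows. This is a simplified illustration, not the authors' loss code: it assumes axis-aligned (diagonal) Gaussians with per-step standard deviations instead of full covariances, and the function names are hypothetical.

```python
import math

def nearest_mode(means, gt):
    """Index of the mode whose mean trajectory is closest (Euclidean) to gt."""
    def dist_sq(traj):
        return sum((mx - gx) ** 2 + (my - gy) ** 2
                   for (mx, my), (gx, gy) in zip(traj, gt))
    return min(range(len(means)), key=lambda k: dist_sq(means[k]))

def hard_assignment_nll(log_q, means, sigmas, gt):
    """-log q_k - sum_t log N(gt_t; mu_k^t, diag(sigma^2)) for the nearest mode k."""
    k = nearest_mode(means, gt)
    nll = -log_q[k]
    for (mx, my), (sx, sy), (gx, gy) in zip(means[k], sigmas[k], gt):
        nll += math.log(2.0 * math.pi * sx * sy)
        nll += 0.5 * (((gx - mx) / sx) ** 2 + ((gy - my) / sy) ** 2)
    return k, nll

# Two modes over two future steps; the ground truth matches mode 0 exactly.
means = [[(1.0, 0.0), (2.0, 0.0)], [(0.0, 1.0), (0.0, 2.0)]]
sigmas = [[(1.0, 1.0), (1.0, 1.0)], [(1.0, 1.0), (1.0, 1.0)]]
log_q = [math.log(0.5), math.log(0.5)]
k, nll = hard_assignment_nll(log_q, means, sigmas, [(1.0, 0.0), (2.0, 0.0)])
```

Only the selected mode receives a regression signal, while the classification term pushes its mixture weight up.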
3.5 Prediction architecture with learned anchor embeddings
In applications related to future prediction, capturing the highly uncertain and multimodal set of outcomes is a key challenge and the focus of much work [rhinehart2018r2p2, liang2020garden, kitani_diverse_forecasting_dpps, phan2019covernet, sapp2019multipath]. One of MultiPath’s key innovations was to use a static set of anchor trajectories as pre-defined modes that applied to all scenes. One major downside to this is that most modes are not a good fit to any particular scene, thus requiring a large number of modes to be considered, with most obtaining a low likelihood and getting discarded. Another downside is the added complexity and effort stemming from a 2-phase learning process (first estimating the modes from data, then training the network).
In this work, we learn anchor embeddings as part of the overall model training. We interpret these embeddings as anchors in latent space, and construct our architecture to have a one-to-one correspondence with these embeddings and the output trajectory modes of our GMM. The vectors are trainable model parameters that are independent of the input. This has connections to Detection Transformers (DETR) [carion20detr] which propose a way to learn anchors rather than hand-design them for object detection. This is also similar in spirit to MANTRA [marchetti2020mantra], a trajectory prediction network, which has an explicit learned memory network which consists of a database of embeddings that can be retrieved and decoded into trajectories.
We concatenate the embeddings $\phi^{(\text{state})}$, $\phi^{(\text{interaction})}$ and $\phi^{(\text{road})}$ (the agent history, interaction and road encodings) obtained from the output of the respective blocks to obtain a fixed-length feature vector for each modeled agent. We then use this as context in stacked MCG blocks that operate on the set of anchor embeddings $e_{1:M}$, with a final MLP that predicts all parameters of the output GMM:
where is formed from .
3.6 Internal Trajectory Representation
We model the future position and heading of agents, along with agent-relative longitudinal and lateral Gaussian uncertainties. We parameterize the trajectory using $(x(t), y(t), \theta(t), \sigma_{\text{lon}}(t), \sigma_{\text{lat}}(t))$: position, heading, and standard deviation for longitudinal and lateral uncertainty.
The most popular approach in the literature is to directly predict a sequence of such states at uniform time-discretization. Here we also consider two non-mutually exclusive variants.
We can represent functions over time as polynomials, which add an inductive bias that ensures a smooth trajectory. It gives us a compact, interpretable representation of each predicted signal.
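Instead of directly predicting $(x(t), y(t), \theta(t))$, we can predict the underlying kinematic control signals, which can then be integrated to evaluate the output state. In this work, we experiment with predicting the acceleration $a(t)$ and heading change rate $\dot{\theta}(t)$ and integrating them to recover the trajectory as follows:
$$\begin{aligned} v(t) &= v(0) + \int_0^t a(\tau)\,d\tau, & \theta(t) &= \theta(0) + \int_0^t \dot{\theta}(\tau)\,d\tau, \\ x(t) &= x(0) + \int_0^t v(\tau)\cos\theta(\tau)\,d\tau, & y(t) &= y(0) + \int_0^t v(\tau)\sin\theta(\tau)\,d\tau. \end{aligned} \tag{2}$$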
These representations add inductive bias encouraging natural and realistic trajectories that are based on realistic kinematics and consistent with the current state of the predicted agent. For the polynomial representation, it is also possible to specify a soft constraint by regularizing the polynomial’s constant term, which determines the shift of the predicted signal from its current value.
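As a small illustration of the polynomial variant, the sketch below evaluates a predicted signal from polynomial coefficients and computes a penalty on the constant term. The coefficient layout (lowest order first), the quadratic form of the penalty and the function names are assumptions made for the example.

```python
def eval_poly(coeffs, t):
    """Horner evaluation of c0 + c1*t + c2*t^2 + ... at time t."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result

def constant_term_penalty(coeffs, weight=1.0):
    """Soft constraint: penalize the constant term, i.e. the offset at t = 0."""
    return weight * coeffs[0] ** 2

# Hypothetical longitudinal displacement: 2 m/s initial speed, 1 m/s^2 acceleration.
coeffs = [0.0, 2.0, 0.5]
samples = [eval_poly(coeffs, t) for t in (0.0, 1.0, 2.0)]
```

Because the signal is a function of continuous time, it can be sampled at any rate after prediction, and a zero constant term keeps the trajectory anchored to the agent's current value.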
Algorithm 1 demonstrates the conversion from control signals to output positions. Note that this operation is differentiable, permitting end-to-end optimization. It is a numerical approximation of Equation 2 with additional technical considerations: (1) When computing the next position $(x_{t+1}, y_{t+1})$, we use the midpoint approximation of the speed, $\tfrac{1}{2}(v_t + v_{t+1})$, and heading, $\tfrac{1}{2}(\theta_t + \theta_{t+1})$. (2) Given vehicle dimensions, we cap the heading change rate to match a predetermined maximum feasible curvature. (3) These equations are applied to the rear axle of the vehicle rather than the center position. We use the rear-end position of the vehicle as an approximation of the rear-axle position.
Note that Algorithm 1 can be viewed as a special type of recurrent network, without learned parameters. This decoding stage then mirrors other works which use a learned RNN (LSTM or GRU cells) to decode an embedding vector into a trajectory [mercat2020multi, wimp2020, hong2019rules, tang_multifuture, salzmann2020trajectron++]. In our case, the recurrent network state consists of the current pose and speed, and the input consists of the acceleration $a_t$ and heading change rate $\dot{\theta}_t$. Encoding an inductive bias derived from kinematic modeling spares the network the need to explicitly learn these properties and makes the predicted state compact. This promotes data efficiency and generalization power, but can be more sensitive to perception errors in the current state estimate.
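The control-to-position conversion can be sketched as a parameter-free recurrence. This is an illustration under stated assumptions rather than a reproduction of Algorithm 1: the argument names, the fixed step `dt`, and the way the curvature cap is applied (limiting the heading rate to `max_curvature` times the midpoint speed) are choices made for the example, and the rear-axle offset is omitted.

```python
import math

def integrate_controls(x, y, heading, speed, accels, yaw_rates, dt, max_curvature):
    """Roll out (x, y, heading) from acceleration and heading-rate controls."""
    states = []
    for a, w in zip(accels, yaw_rates):
        next_speed = speed + a * dt
        mid_speed = 0.5 * (speed + next_speed)        # midpoint speed
        # curvature = heading rate / speed, so cap |w| at max_curvature * speed
        limit = max_curvature * abs(mid_speed)
        w = max(-limit, min(limit, w))
        next_heading = heading + w * dt
        mid_heading = 0.5 * (heading + next_heading)  # midpoint heading
        x += mid_speed * math.cos(mid_heading) * dt
        y += mid_speed * math.sin(mid_heading) * dt
        heading, speed = next_heading, next_speed
        states.append((x, y, heading))
    return states

# Straight-line constant acceleration from rest: x(t) = 0.5 * a * t^2.
straight = integrate_controls(0.0, 0.0, 0.0, 0.0, [2.0] * 4, [0.0] * 4,
                              dt=0.5, max_curvature=0.2)
# An infeasibly large heading rate is clipped to max_curvature * speed.
capped = integrate_controls(0.0, 0.0, 0.0, 10.0, [0.0], [100.0],
                            dt=0.1, max_curvature=0.2)
```

Every operation is a sum, product, clamp or trigonometric function, so the same recurrence written in an autodiff framework would be differentiable almost everywhere, which is what end-to-end training requires.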
4 Ensembling predictor heads via bootstrap aggregation
By combining multiple models which are to some degree complementary, we can enjoy the benefits of a higher capacity model with lower statistical variance.
We specifically apply bootstrap aggregation (bagging) [eslbook] to our predictor heads by training $E$ such heads together. To encourage the models to learn complementary information, the weights of the $E$ heads are initialized randomly, and an example is used to update the weights of each head with a 50% probability.
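One way to realize the per-example, per-head sampling is a random mask over heads, as in the sketch below. The masking mechanism and the function names are assumptions for illustration; the text only specifies that each example updates each head with 50% probability.

```python
import random

def head_update_mask(num_heads, rng, keep_prob=0.5):
    """Per-example mask: head e is trained on this example iff mask[e] is True."""
    return [rng.random() < keep_prob for _ in range(num_heads)]

def masked_loss(per_head_losses, mask):
    """Sum the losses of the selected heads; unselected heads get no gradient."""
    return sum(loss for loss, keep in zip(per_head_losses, mask) if keep)

rng = random.Random(0)
mask = head_update_mask(3, rng)
total = masked_loss([1.0, 2.0, 4.0], [True, False, True])
```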
Unlike scalar regression or classification, it is not obvious how to combine output from different heads in our case: each is a Gaussian Mixture Model, with no correspondence of mixture components across ensemble heads. Furthermore, we consider allowing each predictor head to predict a richer output distribution with more modes $L > M$, where $M$ is fixed as a requirement for the task (and is used in benchmark metrics calculations).
Let $\hat{p}$ denote the union of the predictions from all heads:
$$\hat{p}(\mathbf{s}) = \sum_{j=1}^{EL} \hat{q}_j\, \mathcal{N}\left(\mathbf{s};\ \hat{\mu}_j,\ \hat{\Sigma}_j\right),$$
where $j$ ranges over the $L$ modes of each of the $E$ heads, and the mode likelihoods $\hat{q}_j$ are divided by the number of heads so that they sum up to 1. Then we pose the ensemble combination task as one of converting $\hat{p}$ to a more compact GMM $p$ with $M$ modes:
$$p(\mathbf{s}) = \sum_{k=1}^{M} q_k\, \mathcal{N}\left(\mathbf{s};\ \mu_k,\ \Sigma_k\right),$$
while requiring that $p$ best approximates $\hat{p}$. In this section we describe the aggregation algorithm we use. Theoretical motivations and derivation can be found in Appendix A.
We fit $p$ (the compact GMM) to $\hat{p}$ (the union of all head predictions) using an iterative clustering algorithm, like Expectation-Maximization [bishop2006pattern], but with hard assignment of cluster membership. This setting lends itself to efficient implementation in a compute graph, and allows us to train this step end-to-end as a final layer in our deep network.
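We start by selecting $M$ cluster centroids from the modes $\hat{\mu}_j$ of $\hat{p}$ in a greedy fashion. The selection criterion is to maximize the probability that a centroid sampled from $\hat{p}$ lies within distance $d$ from at least one selected centroid:
$$\mu_{1:M} = \operatorname*{arg\,max}_{\mu_{1:M}} \sum_{j} \hat{q}_j \max_{k \in \{1, \dots, M\}} \mathbb{1}\left[\lVert \hat{\mu}_j - \mu_k \rVert \le d\right].$$
This is a criterion that explicitly optimizes trajectory diversity, which is a fit for metrics such as miss rate, mAP and minADE, as defined in [chang2019argoverse, ettinger2021womd]. Other criteria could also be used depending on the metric of interest. It is interesting to relate this criterion to the ensembling and sampling method employed by GOHOME [gilles2021gohome]. In that work, they output an intermediate spatial heatmap representation, which is amenable to ensemble aggregation. Then they greedily sample end-points in a similar fashion.
Since jointly optimizing $\mu_{1:M}$ is hard, we select each $\mu_k$ greedily for $k = 1, \dots, M$ according to
$$\mu_k = \operatorname*{arg\,max}_{\mu} \sum_{j} \hat{q}_j \max\left(\mathbb{1}\left[\lVert \hat{\mu}_j - \mu \rVert \le d\right],\ \max_{k' < k} \mathbb{1}\left[\lVert \hat{\mu}_j - \mu_{k'} \rVert \le d\right]\right),$$
which differs in that the outer $\arg\max$ is done iteratively over $k$ rather than jointly over $\mu_{1:M}$.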
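Starting with the $M$ selected centroids, we iteratively update the parameters of $p$ using an expectation-maximization-style [dempster77] algorithm, where each iteration consists of the following updates:
$$q_k \leftarrow \sum_{j} \hat{q}_j\, p(k \mid \hat{\mu}_j), \qquad \mu_k \leftarrow \frac{1}{q_k} \sum_{j} \hat{q}_j\, p(k \mid \hat{\mu}_j)\, \hat{\mu}_j, \qquad \Sigma_k \leftarrow \frac{1}{q_k} \sum_{j} \hat{q}_j\, p(k \mid \hat{\mu}_j) \left(\hat{\Sigma}_j + (\hat{\mu}_j - \mu_k)(\hat{\mu}_j - \mu_k)^\top\right).$$
Here $\hat{q}_j$, $\hat{\mu}_j$ and $\hat{\Sigma}_j$ are the weight, mean and covariance of the $j$-th mode of $\hat{p}$, and $p(k \mid \hat{\mu}_j)$ is the posterior probability that a given sample $\hat{\mu}_j$ is sampled from the $k$-th component of the mixture model specified by $p$, which can be computed as
$$p(k \mid \hat{\mu}_j) = \frac{q_k\, \mathcal{N}\left(\hat{\mu}_j;\ \mu_k,\ \Sigma_k\right)}{\sum_{k'=1}^{M} q_{k'}\, \mathcal{N}\left(\hat{\mu}_j;\ \mu_{k'},\ \Sigma_{k'}\right)}.$$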
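The aggregation pipeline (pooling head outputs, greedy centroid selection, iterative refinement) can be sketched as follows. This is a simplified illustration, not the authors' algorithm: modes are reduced to (weight, mean) pairs with covariances ignored, each mean is a flat tuple of coordinates compared by Euclidean distance, refinement uses hard nearest-centroid assignment, and all function names are hypothetical.

```python
import math

def union_of_heads(heads):
    """Pool (weight, mean) modes from all heads; divide weights by the head count."""
    return [(q / len(heads), mu) for head in heads for q, mu in head]

def greedy_centroids(modes, m, d):
    """Pick m centroids one at a time, each maximizing newly covered probability mass."""
    covered = [False] * len(modes)
    chosen = []
    for _ in range(m):
        def gain(c):
            return sum(q for (q, mu), done in zip(modes, covered)
                       if not done and math.dist(mu, modes[c][1]) <= d)
        best = max(range(len(modes)), key=gain)
        chosen.append(modes[best][1])
        covered = [done or math.dist(mu, modes[best][1]) <= d
                   for (_, mu), done in zip(modes, covered)]
    return chosen

def refine(modes, centroids, iters=3):
    """Hard-assignment EM-style loop: assign modes to the nearest centroid, re-average."""
    weights = [0.0] * len(centroids)
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for q, mu in modes:
            k = min(range(len(centroids)), key=lambda j: math.dist(mu, centroids[j]))
            groups[k].append((q, mu))
        weights = [sum(q for q, _ in g) for g in groups]
        centroids = [
            tuple(sum(q * mu[i] for q, mu in g) / w for i in range(len(c))) if w > 0 else c
            for g, w, c in zip(groups, weights, centroids)
        ]
    return weights, centroids

# Two heads, two modes each; means are 2D end-points for brevity.
heads = [[(0.6, (0.0, 0.0)), (0.4, (10.0, 0.0))],
         [(0.5, (0.2, 0.0)), (0.5, (10.0, 0.4))]]
modes = union_of_heads(heads)
initial = greedy_centroids(modes, m=2, d=1.0)
weights, centroids = refine(modes, initial)
```

Maximizing newly covered mass at each step is equivalent to the greedy coverage criterion, since mass that is already covered does not change with the next choice.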
The Waymo Open Motion Dataset (WOMD) [ettinger2021womd] consists of 1.1M examples time-windowed from 103K 20s scenarios. The dataset is derived from real-world driving in urban and suburban environments. Each example for training and inference consists of 1 second of history state and 8 seconds of future, which we resample at 5Hz. The object-agent state contains attributes such as position, agent dimensions, velocity and acceleration vectors, orientation, angular velocity, and turn signal state. The long (8s) time horizon in this dataset tests the model’s ability to capture a large field of view and scale to an output space of trajectories, which in theory grows exponentially with time.
The Argoverse dataset [chang2019argoverse] consists of 333K scenarios containing trajectory histories, context agents, and lane centerline inputs for motion prediction. The trajectories are sampled at 10Hz, with 2 seconds of past history and a 3-second future prediction horizon.
We compare models using the competition-specific metrics associated with each dataset. (For each dataset, we report the results of our model against published results of publicly available models.) Specifically, we report the following metrics.
minDE (Minimum Distance Error): The minimum distance, over the top k most-likely trajectories, between a predicted trajectory and the ground truth trajectory at time .
minADE (Minimum Average Distance Error): Similar to minDE, but the distance is calculated as an average over all timesteps.
MR@ (Miss Rate): Measures the rate at which minFDE (minDE evaluated at the final time step) exceeds meters. Note that the WOMD leaderboard uses a different definition [ettinger2021womd].
mAP: For each set of predicted trajectories, we have at most one positive: the one closest to the ground truth and within distance from the ground truth. The other predicted trajectories are reported as misses. From this, we can compute precision and recall at various thresholds. Following the WOMD metrics definition [ettinger2021womd], the agents' future trajectories are partitioned into behavior buckets, and the area under the precision-recall curve is computed using the possible true positives and false positives per agent, giving the Average Precision (AP) per behavior bucket. The total mAP value is the mean of the APs over the behavior buckets.
Overlap rate: The fraction of times the most likely trajectory prediction of any agent overlaps with a real future trajectory of another agent (see [ettinger2021womd] for details).
TRI (Turning Radius Infeasibility): We compute the turning radius along the predicted trajectories using two approaches: one that uses the predicted yaw output from the model (TRI-h), and one that does not require yaw predictions and instead uses the circumradius of the circle through three consecutive waypoints (TRI-c). If the radius is less than a certain threshold , it is treated as a violation. We set this threshold as the approximate minimum turning radius threshold for a midsize sedan, . Note that a model that simply predicts a constant heading can achieve a TRI-h rate of zero, hence we also compute inconsistencies between the turning radius suggested by the coordinates and the predicted headings (TRI-hc). A TRI-hc inconsistency is flagged when the difference between the heading implied by the circumradius of the waypoints and the predicted heading is greater than 0.05 radians at any time step in a trajectory.
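The distance-based metrics and the waypoint-only turning-radius check above can be illustrated with a minimal NumPy sketch. This is not the benchmark or paper implementation: the function names, the array layout (`pred` with shape `(K, T, 2)`, `gt` with shape `(T, 2)`), and the parameters `d` and `min_radius` are assumptions made for illustration, and the threshold values used by the paper are not reproduced here.

```python
import numpy as np

def min_ade(pred, gt):
    """Minimum over K predictions of the displacement averaged over all time steps."""
    dist = np.linalg.norm(pred - gt[None], axis=-1)  # (K, T) per-step distances
    return dist.mean(axis=1).min()

def min_fde(pred, gt):
    """Minimum over K predictions of the displacement at the final time step."""
    return np.linalg.norm(pred[:, -1] - gt[-1], axis=-1).min()

def is_miss(pred, gt, d):
    """A miss occurs when even the best final displacement exceeds d meters."""
    return bool(min_fde(pred, gt) > d)

def circumradius(p0, p1, p2):
    """Radius of the circle through three 2D points (inf if they are collinear)."""
    a = np.linalg.norm(p1 - p0)
    b = np.linalg.norm(p2 - p1)
    c = np.linalg.norm(p2 - p0)
    # Triangle area from the 2D cross product of the two edge vectors at p0.
    cross = (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p1[1] - p0[1]) * (p2[0] - p0[0])
    area = 0.5 * abs(cross)
    if area < 1e-9:
        return np.inf  # straight-line motion or repeated points: no curvature
    return a * b * c / (4.0 * area)

def tri_c_violation(traj, min_radius):
    """True if any three consecutive waypoints imply a radius below min_radius."""
    return any(
        circumradius(traj[i], traj[i + 1], traj[i + 2]) < min_radius
        for i in range(len(traj) - 2)
    )
```

The miss rate over a dataset is then the mean of `is_miss` across examples, and a TRI-c rate is the fraction of trajectories for which `tri_c_violation` is true.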
5.3 MultiPath baseline
As our work evolved from MultiPath, we include a reference MultiPath model where the input and backbone are faithful to the original paper [sapp2019multipath] for a point of comparison, with a few minor differences. Specifically, we use a top-down rendering of the scene as before, but now employ a splat rendering [zwicker2001surface_splat] approach for rasterization, in which we sample points uniformly from scene elements and do an orthographic projection. This is a simpler, sparse form of rendering, which doesn’t employ anti-aliasing, but is efficient and straightforward to implement in TensorFlow and run as part of the model compute graph on hardware accelerators (GPU/TPU).
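As a rough illustration of this style of rasterization, the sketch below samples points along a 2D polyline and writes them into a top-down grid without anti-aliasing. It is written in NumPy rather than TensorFlow, and the function name, the fixed number of samples per segment, and the `grid_size` and `cell_m` parameters are assumptions for illustration; they are not the values or the implementation used for the baseline.

```python
import numpy as np

def splat_polyline(polyline, grid_size, cell_m, points_per_segment=16):
    """Rasterize a 2D polyline (N, 2), in meters, into a square occupancy grid.

    The grid is centered on the origin and each cell is cell_m meters wide.
    """
    grid = np.zeros((grid_size, grid_size), dtype=np.float32)
    t = np.linspace(0.0, 1.0, points_per_segment)[:, None]  # (P, 1) interpolation weights
    for a, b in zip(polyline[:-1], polyline[1:]):
        pts = a + t * (b - a)  # (P, 2) evenly spaced samples on this segment
        # Top-down projection: scale meters to cells and shift the origin to the grid center.
        cols = np.floor(pts[:, 0] / cell_m + grid_size / 2).astype(int)
        rows = np.floor(pts[:, 1] / cell_m + grid_size / 2).astype(int)
        inside = (rows >= 0) & (rows < grid_size) & (cols >= 0) & (cols < grid_size)
        grid[rows[inside], cols[inside]] = 1.0  # hard write, no anti-aliasing
    return grid
```

Because every sample simply sets one cell, the cost scales with the number of sampled points rather than with the image area, which is the main appeal of this kind of sparse rendering.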
As in the original paper, we use a grid of cells, with grid cell physical dimension of , thus a total field-of-view of centered around the AV sensing vehicle in WOMD, with a ResNet18 backbone [ResNet16]. We use 128 static anchors obtained via k-means, which are shared among all agent types (vehicles, pedestrians, cyclists) for simplicity. Figure 10 illustrates this model’s inputs and architecture.
5.4 External benchmark results
On WOMD, we also see that the original MultiPath model, even with the refinement of learned anchors and ensembling, is outperformed by more recent methods. It is interesting to note that MultiPath is the best performing top-down scene-centric model employing a CNN; every known method which outranks it uses sparse representations.
|Argoverse leaderboard (, , )|
|HOME + GOHOME [gilles2021gohome]||10||1.860||1.292||0.085||0.890|
|QCraft Blue Team||1||1.757||1.214||0.114||0.801|
|Waymo Open Motion Prediction (, )|
5.5 Qualitative Examples
Figure 4 shows examples of MultiPath++ on WOMD scenes. Figure 5 shows examples of MultiPath++ on Argoverse scenes. These examples show the ability of MultiPath++ to handle different road layouts and agent interactions.
5.6 Ablation Study
In this section we evaluate our design choices through an ablation study. Table 4 summarizes ablation results. In the following subsections we discuss how our architecture choices affect the model performance.
5.6.1 Set Functions
Recall that MultiPath++ uses two types of set functions. Invariant set functions are used to encode a set of elements (e.g. agents, roadgraph segments) into a single feature vector. Equivariant set functions are used to convert the set of learned anchors, together with the encoded feature vector as a context, into a corresponding set of trajectories with likelihoods.
We use multi-context gating to represent both types of functions. We experimented with other representations of set functions:
MLP+MaxPool: In this experiment, we replace the multi-context gating (MCG) road network encoder with an MLP+MaxPool applied on points rather than polylines, inspired by PointNet [qi2017pointnet]. We use a 5-layer MLP with ReLU activations.
Equivariant DeepSet [zaheer17deepset]: The equivariant set function is represented as a series of blocks, each involving an element-wise transformation followed by pooling to compute the context. Unlike MCG, it does not use gating (pointwise multiplication) between set elements and the context vector. Instead, a linear transformation of the context is added to each element. We use a DeepSet of 5 blocks in the predictor.
Transformers [lee19settransformer]: We replace the gating mechanism (element-wise multiplication) on polylines with self-attention. For decoding, we used cross attention where the queries are the learned embeddings and the keys are the various encoder features.
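The difference between the additive DeepSet update and a multiplicative gating update can be made concrete with a small NumPy sketch. Both blocks below are simplified stand-ins: the max pooling, the ReLU, the random linear maps, and the function names are assumptions for illustration, and `gating_block` only shows the element-times-context idea, not the full MCG block defined in Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(d_in, d_out):
    """Return a fixed random linear map x -> x @ W (a stand-in for a learned layer)."""
    W = rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))
    return lambda x: x @ W

def deepset_block(elements, f_elem, f_ctx):
    """Equivariant DeepSet-style block: a transformed context is ADDED to each element."""
    h = np.maximum(f_elem(elements), 0.0)  # element-wise transform + ReLU, shape (N, D)
    ctx = h.max(axis=0)                    # permutation-invariant pooled context, shape (D,)
    return h + f_ctx(ctx)                  # broadcast add over the N elements

def gating_block(elements, f_elem, f_ctx):
    """Gating-style block: each element is MULTIPLIED pointwise by a transformed context."""
    h = np.maximum(f_elem(elements), 0.0)
    ctx = h.max(axis=0)
    return h * f_ctx(ctx)                  # broadcast multiply over the N elements
```

Both blocks are permutation-equivariant: reordering the input rows reorders the output rows in the same way, since the pooled context does not depend on element order.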
5.6.2 Trajectory representation
As mentioned in Section 3.6, we experiment with predicting polynomial coefficients for the trajectory, as well as predicting kinematic control signals (acceleration and heading change rate). We found that polynomial representations hurt performance, counter to the conclusions of PLOP [buhet20_plop], which demonstrated improvements over the then state of the art on PRECOG [precog] and nuScenes [caesar2020nuscenes] using polynomials to represent output trajectories. However, the PLOP datasets require predicting only 4s into the future, which is much shorter than our prediction horizon of 8s, and polynomial representations are more suitable for such short futures. In our case, we do not see much gain from the polynomial representation, possibly due to the larger dataset size and longer prediction horizon.
The controls-based output works better on distance metrics than the polynomial representation, which suggests it is a more beneficial and domain-specific form of inductive bias. Overall, our results suggest that the simple raw-coordinate trajectory representation works best for distance-based metrics. However, these unconstrained representations have a non-trivial rate of kinematic infeasibility (TRI-x metrics in Table 4). Kinematic feasibility and consistency between headings and positions are crucial in practice when such behavior models are used for planning and control of a real-world robot, an issue that is not captured by public benchmark metrics.
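To illustrate why a controls-based output yields headings and positions that are consistent by construction, the sketch below integrates a sequence of accelerations and heading change rates into waypoints. The simple Euler integration scheme, the function name, and the argument layout are assumptions for illustration; the exact formulation from Section 3.6 is not reproduced here.

```python
import numpy as np

def rollout_controls(x0, y0, heading0, speed0, accel, yaw_rate, dt):
    """Integrate per-step acceleration and heading change rate into a trajectory.

    Returns positions of shape (T, 2) and headings of shape (T,).
    """
    T = len(accel)
    positions = np.empty((T, 2))
    headings = np.empty(T)
    x, y, h, v = x0, y0, heading0, speed0
    for t in range(T):
        v = v + accel[t] * dt       # update speed from acceleration
        h = h + yaw_rate[t] * dt    # update heading from heading change rate
        x = x + v * np.cos(h) * dt  # move along the current heading
        y = y + v * np.sin(h) * dt
        positions[t] = (x, y)
        headings[t] = h
    return positions, headings
```

With `dt = 0.2` (the 5Hz resampling used for WOMD), each displacement points along the predicted heading, so a heading-versus-waypoint inconsistency of the kind TRI-hc measures cannot arise from the representation itself.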
We explore ensembling, producing an over-complete set of trajectories that is then summarized using the aggregation proposed in Section 4, as well as their combination. The number of ensembles is denoted by and the number of trajectories per ensemble is denoted by . Finally we aggregate the trajectories to which is the required number of trajectories for the WOMD submission.
5.6.4 Anchor representation
We explore both learned and k-means-based anchor representations.
|minDE||minADE||MR||AUC||TRI-h (%)||TRI-c (%)||TRI-hc (%)|
|1 MCG block||2.764||1.15||0.55||0.312||–||–||–|
|5 stacked MCG blocks||2.305||0.978||0.44||0.393||–||–||–|
|Raw coordinates w/ heading||2.311||0.978||0.443||0.395||4.10||1.04||9.92|
|Static k-means anchors||2.99||1.22||0.578||0.324||–||–||–|
denotes the reference configuration: road encoding, state history encoding and interaction encoding as described in Section 3. “n/a” denotes a model that does not predict heading.
First, we remark that MultiPath++ is a significant improvement over its predecessor MultiPath, as seen in Tables 3 and 4. As discussed in this paper, they differ in many design dimensions, the primary being the change from a dense top-down raster representation to a sparse, element-based representation with agent-centric coordinate systems. Other design choices are validated in isolation in the following discussion.
We find that MLP+MaxPool performs the worst among all set function variants as expected due to limited capacity. DeepSet is able to outperform MLP+MaxPool. Also increasing the depth of the MCG gives consistently better results owing to effective increase in capacity and flow of information across skip connections. We get the best performance by increasing the depth of the MCG to 5 layers.
We find that learning anchors (“Learned anchors”) is more effective than using a set of anchors obtained a priori via k-means. This runs counter to the original findings in the MultiPath paper [sapp2019multipath] that anchor-free models suffer from mode collapse. The difference could possibly be due to the richer and more structured inputs, improved model architecture, and larger batch sizes in MultiPath++. We leave more detailed ablations on this issue between the two approaches to future work.

We compare the baseline of directly outputting a single head with 6 trajectories (), to training 5 ensemble heads (). We see that ensembling significantly improves most metrics, and particularly minDE, for which this combination is best. We also train a model with a single head that outputs 64 trajectories, followed by our aggregation method that reduces them to 6 (). Compared to our initial baseline, this model significantly improves and that require diverse predictions, but regresses the average trajectory distance metrics , and even a little bit. This suggests that the different metrics pose different solution requirements. As expected, our aggregation criterion is well suited to preserving diversity, while straight-up ensembling is better at capturing the average distribution. Finally, our experiment () with more ensemble heads and more predictions per ensemble combines the strengths of both techniques, obtaining a strictly superior performance in all metrics compared to the baseline.
We proposed a novel behavior prediction system, MultiPath++, by carefully considering choices for input representation and encoding, fusing encodings, and representing the output distribution. We demonstrated state-of-the-art results on popular benchmarks for behavior prediction. Furthermore, we surveyed existing methods, analyzed our approach empirically, and provided practical insights for the research community. In particular, we showed the importance of sparse encoding, efficient fusion methods, control-based methods, and learned anchors. Finally, we provided a practical guide for various tricks used for training and inference to improve robustness, increase diversity, handle missing data, and ensure fast convergence during training.
Appendix A Details and Derivation of Aggregation Algorithm
By having an overcomplete trajectory representation that is later aggregated into a fixed small number of trajectories, we attempt to address two kinds of uncertainties in the data:
Aleatoric uncertainty: This is the natural variation in the data. For example, given the same context information, an agent can take a left or right turn, change lanes, etc. This level of ambiguity cannot be resolved by increasing the model capacity; rather, the model needs to predict calibrated probabilities for these outcomes. Despite the theoretical possibility of modeling these variations using a small number of output trajectories directly, there are several challenges in learning. Some examples include mode collapse and failure to model these variations due to limited model capacity. Training the model to produce an overcomplete representation forces the model to output a diverse distribution of trajectories and could make it more resistant to mode collapse. Following this up with greedy iterative trajectory aggregation enhances diversity in the final output.
Epistemic uncertainty: This is the variation across model outputs, which typically indicates the model’s failure to capture certain aspects of the scene or input features. Such variations could occur if some models are poorly trained or haven’t seen a particular slice of the data. By doing model ensembling, we attempt to reduce this uncertainty.
For ease of exposition, we assume each trajectory to be composed of a single time point; the same computations are applied to each time step in a future sequence. The output is a Gaussian mixture model (GMM) distribution with modes on the future position:
We formulate the aggregation as obtaining an -mode GMM which minimizes the KL-divergence . This is equivalent to maximizing the expected log likelihood of a sample point drawn from the overcomplete distribution :
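For reference, the standard identity behind this equivalence can be written out as below. The symbols $\hat{p}$ (the overcomplete distribution) and $q_\theta$ (the compact GMM with parameters $\theta$) are notation assumed here for illustration and may differ from the paper's own.

```latex
\begin{align}
D_{\mathrm{KL}}\!\left(\hat{p} \,\|\, q_\theta\right)
  &= \int \hat{p}(x) \log \hat{p}(x)\, dx - \int \hat{p}(x) \log q_\theta(x)\, dx \\
  &= -H(\hat{p}) - \mathbb{E}_{x \sim \hat{p}}\!\left[\log q_\theta(x)\right].
\end{align}
% The entropy H(\hat{p}) does not depend on \theta, hence
\begin{equation}
\arg\min_\theta D_{\mathrm{KL}}\!\left(\hat{p} \,\|\, q_\theta\right)
  = \arg\max_\theta \mathbb{E}_{x \sim \hat{p}}\!\left[\log q_\theta(x)\right].
\end{equation}
```

Only the cross term depends on the compact model, which is why minimizing the divergence and maximizing the expected log likelihood select the same parameters.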
Assuming the overcomplete distribution approximates the real distribution, this is roughly equivalent to fitting the compact distribution to the real data, but with the added benefits described above. Directly maximizing (21) is intractable. Hence we attempt to employ an Expectation-Maximization-like algorithm to obtain a local maximum. The difference in the objective function between an old and new value may be written as
Denoting the hidden variable h to be a mixture in the compact representation, we may write:
The right hand side is called the Q function in the EM algorithm. Maximizing the Q function with respect to ensures that the likelihood increases at least as much when we update the parameters to . Noting that and factoring out the terms independent of , we find the update that maximizes the lower bound to be
The second equation follows from the fact that the overcomplete distribution is a mixture of Gaussians. The updates can be solved as follows.
where is the posterior probability that a given sample is sampled from the component of the mixture model specified by (here we use the previous estimate for ). This can be computed as:
Notice the resemblance with standard GMM, except where is a Dirac delta function in the standard setting (since the input data in standard GMM is a set of points instead of a distribution). Unlike standard GMM, these expectations (integrations) in the above EM updates are hard to compute in closed form. Instead we employ the approximation for any function
In other words, we assume that the posterior probability of any output cluster only depends on the mean of the overcomplete cluster centroid inside the expectation. This approximation is reasonable since most samples drawn from the distribution would be concentrated around the mean. Furthermore as we increase the number of cluster centroids in the overcomplete representation, the variance within each overcomplete cluster centroid becomes smaller yielding more focus around the mean. The set of updates can now be solved in closed form as follows:
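A minimal NumPy sketch of EM-style updates under this mean approximation is given below. It is an illustration under stated assumptions, not the paper's exact update equations: it assumes isotropic covariances (one scalar variance per component), and the function name, argument layout, initialization from a list of indices, and numerical safeguards (`eps`, `min_var`) are all hypothetical.

```python
import numpy as np

def em_aggregate(w, mu, var, init_idx, n_iters=3, eps=1e-9, min_var=1e-6):
    """Fit a compact GMM to an overcomplete GMM with weights w (N,), means mu (N, D)
    and isotropic variances var (N,). Posteriors are evaluated only at the
    overcomplete means, following the approximation described above.
    """
    d = mu.shape[1]
    pi = w[init_idx] / w[init_idx].sum()  # compact mixture weights, shape (M,)
    m = mu[init_idx].copy()               # compact means, shape (M, D)
    s2 = var[init_idx].copy()             # compact isotropic variances, shape (M,)
    for _ in range(n_iters):
        # E-step: posterior of each compact component, evaluated at each overcomplete mean.
        sq = ((mu[:, None, :] - m[None, :, :]) ** 2).sum(-1)  # (N, M) squared distances
        log_r = np.log(pi + eps) - 0.5 * d * np.log(2 * np.pi * s2) - 0.5 * sq / s2
        log_r -= log_r.max(axis=1, keepdims=True)             # stabilize the softmax
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: overcomplete weights act as the "data mass" for each mean.
        wr = w[:, None] * r                                    # (N, M)
        pi = wr.sum(axis=0)
        m = (wr[:, :, None] * mu[:, None, :]).sum(axis=0) / (pi[:, None] + eps)
        sq = ((mu[:, None, :] - m[None, :, :]) ** 2).sum(-1)
        # Per-dimension variance: within-component variance plus spread of the means.
        s2 = np.maximum((wr * (var[:, None] + sq / d)).sum(axis=0) / (pi + eps), min_var)
    return pi, m, s2
```

A hard-assignment variant, as mentioned in the main text, could replace `r` with a one-hot encoding of its row-wise argmax; every operation here is a dense array expression, which is what makes this kind of update easy to place inside a compute graph.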
Since EM is a local optimization method, careful initialization of GMM parameters is important. Our initialization criterion of GMM centroids is to maximize the probability that future point lies within distance from at least one centroid:
Unfortunately, directly optimizing (33) is NP-hard. So instead, we select an -sized subset of in a greedy fashion to maximize (33). (Note that this subset selection problem is submodular, which means that a greedy method is guaranteed to achieve at least (1 − 1/e) of the optimal subset value.)
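The greedy initialization can be sketched as a weighted coverage problem, where each overcomplete centroid carries its mixture weight and a candidate "covers" every centroid within a given radius. This is a minimal illustration assuming the probability mass is concentrated at the overcomplete centroids; the function name and the `radius` and `n_select` parameters are hypothetical, and the distance threshold used in the paper is not reproduced here.

```python
import numpy as np

def greedy_select(mu, w, n_select, radius):
    """Greedily pick n_select centroid indices that maximize the covered weight.

    mu: (N, D) overcomplete centroids, w: (N,) their probabilities.
    """
    dist = np.linalg.norm(mu[:, None, :] - mu[None, :, :], axis=-1)  # (N, N) pairwise
    covers = dist <= radius  # covers[c, i]: candidate c lies within radius of centroid i
    covered = np.zeros(len(mu), dtype=bool)
    taken = np.zeros(len(mu), dtype=bool)
    selected = []
    for _ in range(n_select):
        # Marginal gain: weight of not-yet-covered centroids each candidate would add.
        gain = (covers & ~covered[None, :]).astype(float) @ w
        gain[taken] = -1.0  # never pick the same candidate twice
        best = int(np.argmax(gain))
        selected.append(best)
        taken[best] = True
        covered |= covers[best]
    return selected
```

Each step only maximizes over one new centroid given the ones already chosen, so the cost is linear in the number of selected centroids rather than combinatorial in the subset size.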