Self-driving cars must share the road with other vehicles, bicyclists, and pedestrians. To safely operate in this dynamic and uncertain environment, it is important to reason about the likely future motions of other road users. Each road user has different goals (e.g., turn left, speed up, stop) and preferences (e.g., desired speed, braking profile), which influence his or her actions. Thus, useful predictions of future motion must represent multiple possibilities and their associated likelihoods.
Motion prediction is fundamentally challenging due to the uncertainty in road user behavior. Likely behaviors are constrained by both the road (e.g., following lanes, stopping for traffic lights) and the actions of others (e.g., slowing for a car ahead, stopping for a pedestrian crossing the road).
State-of-the-art motion prediction models learn features over a combined representation of the map and the recent history of other road users [11, 6, 19]. These methods use a variety of multimodal output representations, including stochastic policies, occupancy maps, and regression [11, 6]. Recent work [27] has shown that classification over a trajectory set, which approximates possible motions, achieves state-of-the-art performance and avoids issues like mode collapse. However, that work did not use map data or trajectory set geometry in the loss, and had a limited comparison with ordinal regression. This paper addresses those limitations.
The classification loss used in [27] does not explicitly penalize predictions that go off the road. To use the prior knowledge that cars typically drive on the road, we simplify and adapt the off-road loss from [25] to classification over trajectory sets. Our formulation allows pretraining a model using only map data, which significantly improves performance for small datasets.
The trajectory sets introduced for motion prediction in [27] have rich spatial-temporal structure. However, the cross-entropy loss they used does not exploit the fact that classes correspond to physical trajectories. We investigate weighted cross-entropy losses that determine how much “near-misses” in the trajectory set are penalized. Surprisingly, we find that a variety of weighted losses do not improve on the original formulation. We analyze these results in the context of prediction diversity.
The classification-based approach of [27] is closely related to the ordinal regression approach of [6], which computes residuals from a set of “anchor” trajectories that are analogous to a trajectory set. We perform a detailed comparison of these approaches on two public datasets. Interestingly, we find that performance of ordinal regression can benefit from using an order of magnitude more “anchor” trajectories than previously reported [6].
Our main contributions on multimodal, probabilistic motion prediction are summarized as follows:
- use an off-road loss with pretraining to help learn domain knowledge;
- explore weighted cross-entropy losses to capture spatial relationships within a trajectory set;
2 Related Work
State-of-the-art motion prediction algorithms now typically use CNNs to learn appropriate features from a bird's-eye-view rendering of the scene (map and road users). Other road users are represented as sensor data [23, 5] or the output of a tracking and fusion system [11, 19]. Recent work [4, 15] has also explored using graph neural networks (GNNs) to encode interactions.
Complementing the input representations described above, various approaches have been used to represent the possible future motions. Generative models encode choice over multiple actions via sampling latent variables. Examples include stochastic policies [21, 28, 29, 32], CVAEs [19, 22, 2, 20] and GANs [31, 17, 34]. These approaches require multiple samples or policy rollouts at inference.
Regression models either predict a single future trajectory [23, 5, 14, 1], or a distribution over multiple trajectories [11, 19, 13]. The former unrealistically average over behaviors in many driving situations, while the latter can suffer from mode collapse. In [6], the authors propose a hybrid approach using ordinal regression. This method regresses to residuals from a set of pre-defined anchors, much like in object detection. This technique helps mitigate mode collapse.
CoverNet [27] frames the problem as classification over a trajectory set, which approximates all possible motions. Our work further explores this formulation through losses that use the map and trajectory set geometry, as well as a detailed experimental comparison to ordinal regression.
2.1 Use of Domain Knowledge
Prior work using map-based losses includes the use of an approximate prior that captures off-road areas as part of a symmetric KL loss, and an off-road loss on future occupancy maps. However, these loss formulations are not directly compatible with most trajectory prediction approaches that output point coordinates. The most closely related work is [25], which applies an off-road loss to multimodal regression. Our formulation is simpler and allows pretraining using just the map.
We leverage domain knowledge by encoding the relationships among trajectories via a weighted cross-entropy loss, where the weight is a function of distance to the ground truth. To our knowledge, this loss has not been explored in motion forecasting, likely due to the prior focus on generative and regression models. This loss is typically used to mitigate class imbalance problems.
2.2 Public datasets
Until recently, public self-driving datasets suitable for motion prediction were either relatively small [16] or for highway driving [9]. Accordingly, many publications are evaluated only on private datasets [11, 33, 5, 24]. We report results on two recently released self-driving datasets, nuScenes [3] and Argoverse [7], to help establish clear comparisons between state-of-the-art methods for motion forecasting on city roads.
We now set notation and give a brief, self-contained overview of CoverNet [27], which we will extend with modified loss functions in Section 4. CoverNet uses the past states of all road users and a high-definition map to compute a distribution over a vehicle’s possible future states.
We use the state outputs of an object detection and tracking system, and a high-definition map that includes lanes, drivable area, and other relevant information. Both are typically used by self-driving cars operating in urban environments and are also assumed in [27, 11, 6].
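We denote the set of road users at time $t$ by $\mathcal{I}_t$ and the state of object $i \in \mathcal{I}_t$ at time $t$ by $s^i_t$. Let $s^i_{m:n} = [s^i_m, \ldots, s^i_n]$, where $m < n$ and $i \in \mathcal{I}_n$, denote the discrete-time trajectory of object $i$ at times $t = m, \ldots, n$. Let $x$ denote the scene context over the past $m$ steps (i.e., partial history of all objects and the map). We are interested in predicting $s_{t+1:t+N}$ given $x$, where $N$ is the prediction horizon.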
3.2 CoverNet overview
We briefly summarize CoverNet [27], which makes multimodal, probabilistic trajectory predictions for a vehicle of interest by classifying over a trajectory set. Figure 1 overviews the model architecture.
Input. The input is a bird's-eye-view raster image centered around the agent of interest that combines both map data and the past states of all objects. The raster image is an RGB image that is aligned so that the agent’s heading points up. Map layers are drawn first, followed by vehicles and pedestrians. The agent of interest, each map layer, and each object type are assigned different colors in the raster. The sequence of past observations for each object is represented through fading object colors using linearly decreasing saturation (in HSV space) as a function of time.
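The fading scheme can be illustrated with a minimal sketch. The function name `faded_rgb`, the convention that `age = 0` is the most recent observation, and the choice to fade all the way to zero saturation for the oldest observation are assumptions made for illustration; the text only states that saturation decreases linearly with time in HSV space.

```python
import colorsys

def faded_rgb(base_rgb, age, num_steps):
    """Fade an 8-bit RGB color by scaling its HSV saturation linearly with age.

    age = 0 is the most recent observation (full saturation);
    age = num_steps - 1 is the oldest one (zero saturation, an assumption).
    """
    h, s, v = colorsys.rgb_to_hsv(*(c / 255.0 for c in base_rgb))
    # Fraction of the original saturation to keep for this observation.
    keep = 1.0 if num_steps <= 1 else 1.0 - age / (num_steps - 1)
    r, g, b = colorsys.hsv_to_rgb(h, s * keep, v)
    return (round(r * 255), round(g * 255), round(b * 255))
```

Hue and value are left untouched, so an object keeps its type color while older observations look progressively washed out.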
Output. The output is a distribution over a trajectory set, which approximates the vehicle’s possible motions. Using a trajectory set is a reasonable approximation given the relatively short prediction horizons (3 to 6 seconds) and inherent uncertainty in agent behavior. Let $\mathcal{K}$ be a trajectory set. Then, the output dimension is equal to the number of modes, namely $|\mathcal{K}|$.
Training. The model is trained as a multi-class classification problem using a standard cross-entropy loss. For each example, the positive label is the element in the trajectory set closest to the ground truth, as measured by the minimum average of point-wise Euclidean distances.
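The label assignment and loss described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: trajectories are assumed to be equal-length lists of `(x, y)` points, and the function names are hypothetical.

```python
import math

def avg_pointwise_distance(traj_a, traj_b):
    """Average Euclidean distance between corresponding points."""
    return sum(math.dist(p, q) for p, q in zip(traj_a, traj_b)) / len(traj_a)

def closest_mode(trajectory_set, ground_truth):
    """Index of the trajectory-set element closest to the ground truth."""
    dists = [avg_pointwise_distance(t, ground_truth) for t in trajectory_set]
    return min(range(len(dists)), key=dists.__getitem__)

def cross_entropy(logits, label):
    """Standard cross-entropy for one example, using a stable log-sum-exp."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]
```

For a training example, `closest_mode` gives the positive class and `cross_entropy` is evaluated on the model's logits over the trajectory set.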
We now introduce an off-road loss to partially encode “rules-of-the-road,” and a weighted cross-entropy classification loss to capture spatial relationships in a trajectory set.
4.1 Off-road loss
4.2 Weighted cross-entropy loss
The approach proposed in [27] uses a cross-entropy loss where positive samples are determined by the element in the trajectory set closest to the ground truth. This loss penalizes the second-closest trajectory just as much as the furthest, since it ignores the geometric structure of the trajectory set.
Our proposed modification is to create a probability distribution over all modes that are “close enough” to the ground truth. So, instead of a delta distribution over the closest mode, there is also probability assigned to near misses (see Figure 3).
Let $d(\cdot, \cdot)$ be a real-valued function that measures the distance between a trajectory in the trajectory set $\mathcal{K}$ and the ground truth trajectory $s$ for agent $i$ at time $t$, and let $D = [d(k, s)]_{k \in \mathcal{K}}$ be the vector of these distances. We set a threshold $\theta$ that defines which trajectories are “close enough” to the ground truth and let $\mathcal{K}_\theta = \{k \in \mathcal{K} : d(k, s) \le \theta\}$ be the set of these trajectories. We experiment with $d$ as both the max and mean of element-wise Euclidean distances, denoted by Max and Mean, respectively.
We take the element-wise inverse of $D$ and set the value of trajectories not in $\mathcal{K}_\theta$ to $0$. We linearly normalize the entries of this vector so they sum to $1$, use it as the target probability distribution with the standard cross-entropy loss, and denote it by $p$. Finally, let $\sigma$ be the softmax function and let $\hat{p} = \sigma(z)$ be the result of applying softmax to the output $z$ of the model’s penultimate layer. Our weighted cross-entropy loss is defined as:
$$\mathcal{L}_{\mathrm{WCE}} = -\sum_{k \in \mathcal{K}} p_k \log \hat{p}_k$$
We introduce a hyperparameter to encode the relative weight of the off-road loss versus the classification loss. The total loss for agent at time is then
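The construction of the weighted target distribution can be sketched as follows. This is an illustrative reading of the description in Section 4.2 under stated assumptions: the names `weighted_target` and `soft_cross_entropy` are hypothetical, the small `eps` guarding against division by zero is an addition, and the fallback to the closest mode when nothing lies within the threshold is an assumption rather than something the text specifies.

```python
import math

def max_pointwise_distance(traj_a, traj_b):
    """Max of element-wise Euclidean distances (the "Max" variant)."""
    return max(math.dist(p, q) for p, q in zip(traj_a, traj_b))

def weighted_target(trajectory_set, ground_truth, threshold,
                    dist_fn=max_pointwise_distance, eps=1e-6):
    """Target distribution: inverse distance for near misses, zero elsewhere."""
    dists = [dist_fn(t, ground_truth) for t in trajectory_set]
    inv = [1.0 / (d + eps) if d <= threshold else 0.0 for d in dists]
    if sum(inv) == 0.0:
        # Assumption: if no mode is "close enough", use the closest one.
        inv[min(range(len(dists)), key=dists.__getitem__)] = 1.0
    total = sum(inv)
    return [w / total for w in inv]  # linearly normalized to sum to 1

def soft_cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(p * (z - log_norm) for p, z in zip(target, logits))
```

With a one-hot target, `soft_cross_entropy` reduces to the standard cross-entropy, so the weighted loss only differs from the baseline through the probability mass assigned to near misses.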
For our motion prediction experiments, the input representation and model architecture were fixed across all models and baselines. We varied both the loss functions and output representations.
Physics oracle. The physics oracle [27] is the minimum average point-wise Euclidean distance over the following physics-based models: i) constant velocity and yaw, ii) constant velocity and yaw rate, iii) constant acceleration and yaw, and iv) constant acceleration and yaw rate.
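A minimal sketch of such an oracle is given below. The forward-Euler integration, the clamping of speed at zero, and the state layout `(x, y, yaw, speed, accel, yaw_rate)` are assumptions made for illustration; the text only names the four motion models and the distance used to pick among them.

```python
import math

def rollout(x, y, yaw, speed, accel, yaw_rate, num_steps, dt):
    """Roll a simple kinematic model forward with Euler steps."""
    points = []
    for _ in range(num_steps):
        x += speed * math.cos(yaw) * dt
        y += speed * math.sin(yaw) * dt
        yaw += yaw_rate * dt
        speed = max(0.0, speed + accel * dt)  # assumption: no reversing
        points.append((x, y))
    return points

def ade(pred, gt):
    """Average point-wise Euclidean distance."""
    return sum(math.dist(p, q) for p, q in zip(pred, gt)) / len(gt)

def physics_oracle_ade(state, ground_truth, dt):
    """Best ADE over the four physics-based models, chosen with hindsight."""
    x, y, yaw, speed, accel, yaw_rate = state
    n = len(ground_truth)
    candidates = [
        rollout(x, y, yaw, speed, 0.0, 0.0, n, dt),         # i) const. velocity and yaw
        rollout(x, y, yaw, speed, 0.0, yaw_rate, n, dt),    # ii) const. velocity and yaw rate
        rollout(x, y, yaw, speed, accel, 0.0, n, dt),       # iii) const. acceleration and yaw
        rollout(x, y, yaw, speed, accel, yaw_rate, n, dt),  # iv) const. acceleration and yaw rate
    ]
    return min(ade(c, ground_truth) for c in candidates)
```

Because the best model is selected using the ground truth, this baseline is an oracle rather than a deployable predictor.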
Regression to anchor residuals. We compare our contributions to MultiPath (MP) [6], a state-of-the-art multimodal regression model. This model implements ordinal regression by first choosing among a fixed set of anchors (computed a priori) and then regressing to residuals from the chosen anchor. This model predicts a fixed number of trajectories (modes) and their associated probabilities. The per-agent loss (agent at time ) is defined as:
where is the indicator function that equals only for the “closest” mode, is the mode index, is the regression loss, and is a hyper-parameter used to trade off between classification and regression. The -th trajectory is the sum of the corresponding anchor and predicted residual. We use and use CoverNet trajectory sets as the fixed anchors.
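The shape of this loss can be sketched as a classification term on the closest anchor plus a weighted regression term on that anchor's residual. The sketch below rests on several assumptions that the text does not confirm: the "closest" mode is picked by average point-wise distance to the anchor, the regression loss is a smooth L1 averaged over coordinates, and `alpha` stands in for the trade-off hyper-parameter. All function names are hypothetical.

```python
import math

def avg_distance(a, b):
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def smooth_l1(x, beta=1.0):
    x = abs(x)
    return 0.5 * x * x / beta if x < beta else x - 0.5 * beta

def anchor_residual_loss(logits, residuals, anchors, ground_truth, alpha):
    # "Closest" mode: anchor with the smallest average distance (assumption).
    k = min(range(len(anchors)),
            key=lambda j: avg_distance(anchors[j], ground_truth))
    # Classification term: negative log-softmax of the closest mode.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    classification = log_norm - logits[k]
    # Regression term: the k-th trajectory is anchor plus predicted residual.
    errors = []
    for (ax, ay), (rx, ry), (gx, gy) in zip(anchors[k], residuals[k], ground_truth):
        errors.append(smooth_l1(ax + rx - gx))
        errors.append(smooth_l1(ay + ry - gy))
    regression = sum(errors) / len(errors)
    return classification + alpha * regression
```

Only the residual of the closest anchor receives a regression gradient, which is what lets a dense anchor set reduce the size of the residuals the network has to learn.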
5.2 Implementation details
The input raster is an RGB image of size (, , ). The image is aligned so that the agent’s heading faces up, and the agent is placed on pixel (, ) as measured from the top-left of the image. In our experiments, we use a resolution of meters per pixel and choose and . Thus, the model can “see” meters ahead, meters behind, and meters on each side of the agent. Objects are rendered based on their oriented bounding boxes, which are imputed for Argoverse.
All models use a ResNet-50 [18] backbone with pretrained ImageNet weights. We apply a global pooling layer to the ResNet conv5 feature map and concatenate the result with the agent’s speed, acceleration, and yaw rate. We then add a dimensional fully connected layer.
Trajectory sets for each dataset were constructed as described in [27] and our supplementary material. The variable specifies the maximum distance between the ground truth and the trajectory set.
For the regression models, our outputs are of dimension , where represents the total number of predicted modes, represents the number of coordinates we are predicting per point, represents the number of points in our predictions, and the extra output per mode is the probability associated with each mode. For our implementations, , where is the length of the prediction horizon, and is the sampling frequency. We predict coordinates, so .
We use a smooth loss for all regression baselines.
Models were trained for epochs on a GPU machine, using a batch size of , and a learning rate schedule that starts at and applies a multiplicative factor of every epoch. We found that these parameters gave good performance across all models. The per-batch core time was ~ ms.
We use commonly reported metrics to give insight into various facets of multimodal trajectory prediction. In our experiments, we average the metrics described below over all instances.
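For insight into trajectory prediction performance in scenarios where there are multiple plausible actions, we use the minimum average displacement error (ADE). The minADE$_k$ is $\min_{\hat{s} \in \mathcal{P}_k} \frac{1}{N} \sum_{\tau=1}^{N} \lVert s_\tau - \hat{s}_\tau \rVert$, where $s$ is the ground truth trajectory with $N$ points and $\mathcal{P}_k$ is the set of $k$ most likely trajectories.
We use the notion of a miss rate to simplify interpretation of whether or not a prediction was “close enough.” We define the MissRate$_{k,d}$ for a single instance (agent at a given time) as 1 if $\min_{\hat{s} \in \mathcal{P}_k} \max_{\tau} \lVert s_\tau - \hat{s}_\tau \rVert \ge d$, and 0 otherwise.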
To measure how well predicted trajectories satisfy the domain knowledge that cars drive on the road, we use Drivable Area Compliance (DAC) [7]. DAC is computed as $(n - m)/n$, where $n$ is the total number of predictions and $m$ is the number that are off the road.
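The three metrics can be sketched for a single instance as follows. Function names are hypothetical, trajectories are assumed to be equal-length lists of `(x, y)` points, and treating a maximum deviation exactly equal to the threshold as a miss is an assumption.

```python
import math

def ade(pred, gt):
    return sum(math.dist(p, q) for p, q in zip(pred, gt)) / len(gt)

def max_dist(pred, gt):
    return max(math.dist(p, q) for p, q in zip(pred, gt))

def top_k(predictions, probabilities, k):
    """The k most likely predicted trajectories."""
    order = sorted(range(len(predictions)),
                   key=lambda j: probabilities[j], reverse=True)
    return [predictions[j] for j in order[:k]]

def min_ade_k(predictions, probabilities, gt, k):
    return min(ade(p, gt) for p in top_k(predictions, probabilities, k))

def miss_k(predictions, probabilities, gt, k, threshold_m):
    """1 if even the best of the top-k modes strays threshold_m or more."""
    best = min(max_dist(p, gt) for p in top_k(predictions, probabilities, k))
    return 1 if best >= threshold_m else 0

def drivable_area_compliance(num_predictions, num_off_road):
    return (num_predictions - num_off_road) / num_predictions
```

Averaging `min_ade_k` and `miss_k` over all instances gives the dataset-level numbers reported in the tables.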
5.4 The impact of domain knowledge
Argoverse. Argoverse [7] is a large-scale motion prediction dataset recorded in Pittsburgh and Miami. Object centroids (imperfectly estimated by a tracking system) are published at Hz. The task is to predict the last seconds of each scene given its first seconds and the map. We use the split for version of the dataset, which includes training scenarios.
5.4.1 Off-road loss and pretraining
As all vehicles in Argoverse are on the drivable area, a perfect model should have drivable area compliance (DAC) of . Figure 5 explores how model performance varies with the amount of training data available. Using an off-road penalty improves DAC in all cases. Pretraining on the drivable area also gives a significant performance boost in data-limited regimes.
Figure 7 further explores the impact of off-road losses on DAC of less likely modes. We see small, but consistent improvements in DAC as we increase the off-road loss penalty. Furthermore, the relative improvements increase with less likely modes.
5.4.2 Weighted cross-entropy loss
|Loss||CE||Max WCE||Mean WCE||Avoid Nearby WCE|
|Mean dist. between modes||1.34||1.05||1.13||1.01||1.17||1.23||0.93||1.35|
Table 1 compares performance of a dense trajectory set () when we consider all trajectories within various distances of the ground truth as a match. Surprisingly, the cross-entropy loss performs better than weighted max and mean cross-entropy losses. We hypothesize that this is due to the weighted cross-entropy loss causing more “clumping” of modes. We explored this hypothesis by computing the mean distance between predicted trajectories, and results in Table 1 support this interpretation. To reduce clumping, we tried an “Avoid Nearby” weighted cross entropy loss that assigns weight of to the closest match, to all other trajectories within meters of ground truth, and to the rest. We see that we are able to increase mode diversity and recover the performance of the baseline loss. Our results indicate that losses that are better able to enforce mode diversity may lead to improved performance.
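The "mean distance between modes" diagnostic can be sketched as an average over all pairs of predicted trajectories. Which pairwise distance underlies the numbers in Table 1 is not stated here, so the use of average point-wise Euclidean distance below is an assumption, as is the function name.

```python
import math
from itertools import combinations

def mean_distance_between_modes(trajectories):
    """Average pairwise distance among predicted trajectories.

    Smaller values indicate that the predicted modes "clump" together.
    """
    def avg_dist(a, b):
        return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

    pairs = list(combinations(trajectories, 2))
    return sum(avg_dist(a, b) for a, b in pairs) / len(pairs)
```

Applied to the top predicted modes of each instance and averaged over the dataset, a lower value for a given loss would be consistent with the clumping hypothesis.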
5.5 Classification vs. ordinal regression
We perform a detailed comparison of classification (CoverNet) and ordinal regression (MultiPath). All models use the same input representation and backbone to focus on the output representation.
nuScenes. nuScenes [3] is a self-driving car dataset that consists of scenes recorded in Boston and Singapore. Objects are hand-annotated in 3D and published at Hz. The task is to predict the next seconds given the prior second and the map. We use the split publicly available in the nuScenes software development kit [26], which only includes moving vehicles. This split includes observations in the train set.
For CoverNet models, we find that offers the best or second best performance for all metrics on both datasets. Additionally, smaller models tend to overfit, especially on the smaller nuScenes dataset. For MultiPath, we find that the MissRate5, 2m monotonically decreases as the trajectory set size increases but using a modest number of modes, such as , can achieve the best minADE5.
We emphasize the strong performance of the MultiPath approach with a large number of modes ( and ). The MultiPath paper [6] noted decreasing performance with more than anchors, whereas we see benefits from using an order of magnitude more. Our analysis suggests that the increase in performance associated with the higher number of modes is due to better coverage of space by the anchors, which leaves the network to learn smaller residuals. Table 7 displays the average and norms of the residuals learned by the model using different numbers of modes.
|Constant velocity||N/A||2.33 4.11||2.33 4.11||2.33 4.11||0.77 0.90|
|Physics oracle||N/A||2.26 2.97||2.26 2.97||2.26 2.97||0.76 0.85|
|MTP ||all 1||1.80 3.41||1.80 3.41||1.80 3.41||0.71 0.90|
|MultiPath ||all 16||1.81 4.30||1.14 2.22||1.10 2.16||0.56 0.89|
|MultiPath ||all 64||1.90 4.40||0.98 1.94||0.89 1.68||0.41 0.84|
|MultiPath , =3||682 946||1.77 4.70||0.91 2.27||0.75 1.73||0.32 0.74|
|MultiPath , =2||1571 2339||1.76 4.80||0.94 2.50||0.76 1.85||0.32 0.72|
|CoverNet , =3||682 946||1.75 4.60||1.05 2.22||0.91 1.68||0.38 0.78|
|CoverNet , =2||1571 2339||1.79 4.60||1.00 2.30||0.82 1.70||0.36 0.71|
|CoverNet , =1||5077 9909||1.94 6.10||1.03 2.90||0.87 2.12||0.41 0.80|
5.5.2 Argoverse test set
We further compared our implementation and modifications of MultiPath and CoverNet on the Argoverse test set. Table 3 compares our results with notable published methods. Our multimodal results are similar to MFP [32], which explicitly models interactions. Our poor unimodal performance may be due to the simple input representation we used, which encodes history via fading colors.
|Model||CV ||LSTM ED ||VectorNet ||MFP ||MultiPath* ||CoverNet* |
We extended a state-of-the-art classification-based motion prediction algorithm to utilize domain knowledge. By adding an auxiliary loss that penalizes off-road predictions, the model can better learn that likely future motions stay on the road. Additionally, our formulation makes it easy to pretrain using only map information (e.g., off-road area), which significantly improves performance on small datasets. In an attempt to better encode spatial-temporal relationships among trajectories, we also investigated various weighted cross-entropy losses. Our results here did not improve on the baseline, although they pointed toward the need for losses that promote mode diversity. Finally, our detailed analysis of classification and ordinal regression (on public self-driving datasets) showed that the best performance can be achieved with an order of magnitude more modes than previously reported.
Acknowledgments. We would like to thank Oscar Beijbom, Caglayan Dicle, Emilio Frazzoli, and Sourabh Vora for helping improve the presentation of our work.
- [1] (2016-06) Social LSTM: human trajectory prediction in crowded spaces.
- [2] (2018-06) Accurate and diverse sampling of sequences based on a “best of many” sample objective. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [3] (2019) nuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027.
- [4] (2019) Spatially-aware graph neural networks for relational behavior forecasting from sensor data. https://arxiv.org/abs/1910.08233
- [5] (2018-10) IntentNet: learning to predict intention from raw sensor data. In Proceedings of The 2nd Conference on Robot Learning.
- [6] (2019-11) MultiPath: multiple probabilistic anchor trajectory hypotheses for behavior prediction. In 3rd Conference on Robot Learning (CoRL).
- [7] (2019-06) Argoverse: 3d tracking and forecasting with rich maps. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [8] (2017) Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, pp. 742–751.
- [9] (2007) US highway 101 dataset. Federal Highway Administration (FHWA), Tech. Rep. FHWA-HRT07-030.
- [10] (2019) (Website) https://pytorch.org/docs/stable/torchvision/models.html
- [11] (2019-05) Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In International Conference on Robotics and Automation (ICRA).
- [12] (2019) Deep kinematic models for physically realistic prediction of vehicle trajectories. https://arxiv.org/abs/1908.00219v1
- [13] (2018-06) Convolutional social pooling for vehicle trajectory prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
- [14] (2018) Short-term motion prediction of traffic actors for autonomous driving using deep convolutional networks. https://arxiv.org/abs/1808.05819v2
- [15] (2020) VectorNet: encoding HD maps and agent dynamics from vectorized representation. https://arxiv.org/abs/2005.04259
- [16] (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [17] (2018-06) Social GAN: socially acceptable trajectories with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [18] (2015) Deep residual learning for image recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [19] (2019-06) Rules of the Road: predicting driving behavior with a convolutional model of semantic interactions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [20] (2018-10) Generative modeling of multimodal multi-human behavior. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS).
- [21] (2012) Activity forecasting. In The European Conference on Computer Vision (ECCV).
- [22] (2017-07) DESIRE: distant future prediction in dynamic scenes with interacting agents. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [23] (2018-06) Fast and Furious: real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [24] (2019-06) ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. In Robotics: Science and Systems (RSS).
- [25] (2019) Improving movement prediction of traffic actors using off-road loss and bias mitigation. In Advances in Neural Information Processing Systems.
- [26] (2020) (Website) https://www.nuscenes.org/
- [27] (2019) CoverNet: multimodal behavior prediction using trajectory sets. https://arxiv.org/abs/1911.10298
- [28] (2018-09) R2P2: a reparameterized pushforward policy for diverse, precise generative path forecasting. In The European Conference on Computer Vision (ECCV).
- [29] (2019-10) PRECOG: prediction conditioned on goals in visual multi-agent settings. In The IEEE International Conference on Computer Vision (ICCV).
- [30] (2015-12) ImageNet large scale visual recognition challenge. The International Journal of Computer Vision.
- [31] (2019-06) SoPhie: an attentive gan for predicting paths compliant to social and physical constraints. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [32] (2019) Multiple futures prediction. In Advances in Neural Information Processing Systems.
- [33] (2019-06) End-to-end interpretable neural motion planner. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [34] Multi-agent tensor fusion for contextual trajectory prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
7 Supplementary Material
We created fixed trajectory sets as described in Section 3 of [27]. We used ground truth trajectories from the Argoverse training set and all trajectories from the nuScenes training set to compute trajectory sets. The parameter controls the furthest a ground truth trajectory in our training set can be from a trajectory in our trajectory set. We visualize the trajectory sets used by CoverNet and MultiPath on the Argoverse dataset and observe that for all , the trajectory sets include fast-moving lane-keeping trajectories and some very wide turns. As we decrease , the diversity of lateral maneuvers increases.
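One way to build a set with this coverage property is a greedy set-cover heuristic, sketched below. This is an illustration of the coverage idea only, not a claim about the exact procedure used to build the sets in this work; the function names, the parameter name `epsilon`, and the use of maximum point-wise distance as the coverage test are assumptions.

```python
import math

def max_pointwise_distance(a, b):
    return max(math.dist(p, q) for p, q in zip(a, b))

def greedy_trajectory_set(trajectories, epsilon):
    """Pick trajectories until every input is within epsilon of a chosen one."""
    n = len(trajectories)
    # covers[i]: indices of trajectories within epsilon of trajectory i.
    covers = [
        {j for j in range(n)
         if max_pointwise_distance(trajectories[i], trajectories[j]) <= epsilon}
        for i in range(n)
    ]
    uncovered = set(range(n))
    chosen = []
    while uncovered:
        # Greedy step: take the candidate covering the most uncovered inputs.
        best = max(range(n), key=lambda i: len(covers[i] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return [trajectories[i] for i in chosen]
```

A smaller `epsilon` forces more trajectories into the set, which matches the observation that tighter coverage yields larger sets with more diverse lateral maneuvers.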