1 Introduction
Self-driving cars must share the road with other vehicles, bicyclists, and pedestrians. To safely operate in this dynamic and uncertain environment, it is important to reason about the likely future motions of other road users. Each road user has different goals (e.g., turn left, speed up, stop) and preferences (e.g., desired speed, braking profile), which influence his or her actions. Thus, useful predictions of future motion must represent multiple possibilities and their associated likelihoods.
Motion prediction is fundamentally challenging due to the uncertainty in road user behavior. Likely behaviors are constrained by both the road (e.g., following lanes, stopping for traffic lights) and the actions of others (e.g., slowing for a car ahead, stopping for a pedestrian crossing the road).
State-of-the-art motion prediction models learn features over a combined representation of the map and the recent history of other road users [11, 6, 19]. These methods use a variety of multimodal output representations, including stochastic policies [28], occupancy maps [19], and regression [11, 6]. Recent work [27] has shown that classification over a trajectory set, which approximates possible motions, achieves state-of-the-art performance and avoids issues like mode collapse. However, that work did not use map data and trajectory set geometry in the loss, and had a limited comparison with ordinal regression. This paper addresses those limitations.
The classification loss used in [27] does not explicitly penalize predictions that go off the road. To use the prior knowledge that cars typically drive on the road, we simplify and adapt the off-road loss from [25] to classification over trajectory sets. Our formulation allows pre-training a model using only map data, which significantly improves performance for small datasets.
The trajectory sets introduced for motion prediction in [27] have rich spatio-temporal structure. However, the cross-entropy loss they used does not exploit the fact that classes correspond to physical trajectories. We investigate weighted cross-entropy losses that determine how much “near-misses” in the trajectory set are penalized. Surprisingly, we find that a variety of weighted losses do not improve on the original formulation. We analyze these results in the context of prediction diversity.
The classification-based approach of [27] is closely related to the ordinal regression approach of [6], which computes residuals from a set of “anchor” trajectories that are analogous to a trajectory set. We perform a detailed comparison of these approaches on two public datasets. Interestingly, we find that performance of ordinal regression can benefit from using an order of magnitude more “anchor” trajectories than previously reported [6].
Our main contributions on multimodal, probabilistic motion prediction are summarized as follows:

- we use an off-road loss with pre-training to help learn domain knowledge;
- we explore weighted cross-entropy losses to capture spatial relationships within a trajectory set;
- we perform a detailed comparison of classification and ordinal regression approaches on two public self-driving datasets.
2 Related Work
State-of-the-art motion prediction algorithms now typically use CNNs to learn appropriate features from a bird’s-eye-view rendering of the scene (map and road users). Other road users are represented as sensor data [23, 5] or the output of a tracking and fusion system [11, 19]. Recent work [4, 15] has also explored using graph neural networks (GNNs) to encode interactions.
Complementing the input representations described above, various approaches have been used to represent the possible future motions. Generative models encode choice over multiple actions via sampling latent variables. Examples include stochastic policies [21, 28, 29, 32], CVAEs [19, 22, 2, 20] and GANs [31, 17, 34]. These approaches require multiple samples or policy rollouts at inference.
Regression models either predict a single future trajectory [23, 5, 14, 1], or a distribution over multiple trajectories [11, 19, 13]. The former unrealistically average over behaviors in many driving situations, while the latter can suffer from mode collapse. In [6], the authors propose a hybrid approach using ordinal regression. This method regresses to residuals from a set of predefined anchors, much like in object detection. This technique helps mitigate mode collapse.
CoverNet [27] frames the problem as classification over a trajectory set, which approximates all possible motions. Our work further explores this formulation through losses that use the map and trajectory set geometry, as well as detailed experimental comparison to ordinal regression.
2.1 Use of Domain Knowledge
Both dynamic constraints and “rules-of-the-road” place strong priors on likely motions. Dynamic constraints were explicitly enforced via trajectory sets in [27] and a kinematic layer in [12].
Prior work using map-based losses includes the use of an approximate prior that captures off-road areas as part of a symmetric KL loss [28], and an off-road loss on future occupancy maps [24]. However, these loss formulations are not directly compatible with most trajectory prediction approaches that output point coordinates. The most closely related work is [25], which applies an off-road loss to multimodal regression. Our formulation is simpler and allows pre-training using just the map.
We leverage domain knowledge by encoding the relationships among trajectories via a weighted cross-entropy loss, where the weight is a function of distance to the ground truth. To our knowledge, this loss has not been explored in motion forecasting, likely due to the prior focus on generative and regression models. This loss is typically used to mitigate class imbalance problems [8].
2.2 Public datasets
Until recently, public self-driving datasets suitable for motion prediction were either relatively small [16] or for highway driving [9]. Accordingly, many publications are evaluated only on private datasets [11, 33, 5, 24]. We report results on two recently released self-driving datasets, nuScenes [3] and Argoverse [7], to help establish clear comparisons between state-of-the-art methods for motion forecasting on city roads.
3 Preliminaries
We now set notation and give a brief, self-contained overview of CoverNet [27], which we will extend with modified loss functions in Section 4. CoverNet uses the past states of all road users and a high-definition map to compute a distribution over a vehicle’s possible future states.
3.1 Notation
We use the state outputs of an object detection and tracking system, and a high-definition map that includes lanes, drivable area, and other relevant information. Both are typically used by self-driving cars operating in urban environments and are also assumed in [27, 11, 6].
We denote the set of road users at time $t$ by $\mathcal{I}_t$ and the state of object $i \in \mathcal{I}_t$ at time $t$ by $s^i_t$. Let $s^i_{a:b} = (s^i_a, \ldots, s^i_b)$, where $a \le b$ and $i \in \mathcal{I}_t$, denote the discrete-time trajectory of object $i$ at times $a, \ldots, b$. Let $x_t$ denote the scene context over the past $m$ steps (i.e., partial history of all objects and the map). We are interested in predicting $s^i_{t+1:t+N}$ given $x_t$, where $N$ is the prediction horizon.
3.2 CoverNet overview
We briefly summarize CoverNet [27], which makes multimodal, probabilistic trajectory predictions for a vehicle of interest by classifying over a trajectory set. Figure 1 overviews the model architecture.
Input The input is a bird’s-eye-view raster image centered around the agent of interest that combines both map data and the past states of all objects. The raster image is an RGB image that is aligned so that the agent’s heading points up. Map layers are drawn first, followed by vehicles and pedestrians. The agent of interest, each map layer, and each object type are assigned different colors in the raster. The sequence of past observations for each object is represented through fading object colors using linearly decreasing saturation (in HSV space) as a function of time.
Output The output is a distribution over a trajectory set, which approximates the vehicle’s possible motions. Using a trajectory set is a reasonable approximation given the relatively short prediction horizons (3 to 6 seconds) and inherent uncertainty in agent behavior. Let $\mathcal{K}$ be a trajectory set. Then, the output dimension is equal to the number of modes, namely $|\mathcal{K}|$.
The output layer applies the softmax function to convert network activations to a probability distribution. The probability of the $k$th trajectory is $p_k = e^{f_k(x)} / \sum_j e^{f_j(x)}$, where $f(x)$ is the output of the network’s penultimate layer.
Training The model is trained as a multi-class classification problem using a standard cross-entropy loss. For each example, the positive label is the element in the trajectory set closest to the ground truth, as measured by the minimum average of pointwise Euclidean distances.
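As a concrete sketch of this labeling rule (array shapes and the function name are illustrative, not from the paper):

```python
import numpy as np

def closest_trajectory_index(trajectory_set, ground_truth):
    """Index of the trajectory-set element closest to the ground truth,
    measured by the minimum average of pointwise Euclidean distances.

    trajectory_set: array of shape (K, T, 2) -- K candidate trajectories,
                    each with T (x, y) points.
    ground_truth:   array of shape (T, 2).
    """
    # Pointwise Euclidean distances, shape (K, T).
    dists = np.linalg.norm(trajectory_set - ground_truth[None], axis=-1)
    # Average over time, then pick the closest candidate.
    return int(np.argmin(dists.mean(axis=1)))
```

The index returned here serves as the positive class label for the cross-entropy loss.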
4 Losses
We now introduce an off-road loss to partially encode “rules-of-the-road,” and a weighted cross-entropy classification loss to capture spatial relationships in a trajectory set.
4.1 Off-road loss
Let $f(x)$ be the activations of CoverNet’s penultimate layer for agent $i$ at time $t$, and let $d \in \{0, 1\}^{|\mathcal{K}|}$ be a binary vector whose value is $1$ at entry $k$ if trajectory $k$ in the set is entirely contained in the drivable area. Let $\sigma$ be the sigmoid function and let $\mathrm{BCE}$ be the binary cross-entropy loss. We define the off-road loss as

$L_{\mathrm{offroad}} = \mathrm{BCE}(\sigma(f(x)), d)$.   (1)
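A minimal sketch of this loss, assuming raw per-trajectory activations and a precomputed drivable-area mask (names and shapes are illustrative):

```python
import numpy as np

def offroad_loss(activations, on_road_mask):
    """Binary cross-entropy between sigmoid(activations) and a 0/1 mask.

    activations:  shape (K,), raw penultimate-layer outputs, one per trajectory.
    on_road_mask: shape (K,), 1.0 where the trajectory stays entirely on the
                  drivable area, 0.0 otherwise.
    """
    p = 1.0 / (1.0 + np.exp(-activations))  # sigmoid
    eps = 1e-12                             # numerical safety for log(0)
    bce = -(on_road_mask * np.log(p + eps)
            + (1.0 - on_road_mask) * np.log(1.0 - p + eps))
    return float(bce.mean())
```

Note that this term depends only on the map (through the mask), which is what makes map-only pre-training possible.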
4.2 Weighted cross-entropy loss
The approach proposed in [27] uses a cross-entropy loss where positive samples are determined by the element in the trajectory set closest to the ground truth. This loss penalizes the second-closest trajectory just as much as the furthest, since it ignores the geometric structure of the trajectory set.
Our proposed modification is to create a probability distribution over all modes that are “close enough” to the ground truth. So, instead of a delta distribution over the closest mode, there is also probability assigned to near misses (see Figure 3).
Let $g$ be a real-valued function that measures the distance between trajectories in trajectory set $\mathcal{K}$ and the ground truth for agent $i$ at time $t$, and let $w = g(\mathcal{K}, s^i_{t+1:t+N})$. We set a threshold $\delta$ that defines which trajectories are “close enough” to the ground truth and let $\mathcal{K}_\delta$ be the set of these trajectories. We experiment with $g$ as both the max and mean of element-wise Euclidean distances, denoted by Max and Mean, respectively.
We take the element-wise inverse of $w$ and set the value of trajectories not in $\mathcal{K}_\delta$ to $0$. We linearly normalize the entries of this vector so they sum to $1$ and use this vector as the target probability distribution with the standard cross-entropy loss; we denote it by $q$. Finally, let $\sigma$ be the softmax function and let $p = \sigma(f(x))$ be the result of applying softmax to the model’s penultimate layer. Our weighted cross-entropy loss is defined as:

$L_{\mathrm{WCE}} = -\sum_{k=1}^{|\mathcal{K}|} q_k \log p_k$.   (2)
We introduce hyperparameter $\alpha$ to encode the relative weight of the off-road loss versus the classification loss. The total loss for agent $i$ at time $t$ is then

$L^i_t = L_{\mathrm{WCE}} + \alpha L_{\mathrm{offroad}}$.   (3)
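The construction of the soft target distribution described above can be sketched as follows (function name and the small clamping constant are illustrative assumptions):

```python
import numpy as np

def wce_target(distances, threshold):
    """Target distribution over a trajectory set for a weighted
    cross-entropy loss.

    distances: shape (K,), distance of each trajectory-set element to the
               ground truth (e.g., max or mean pointwise Euclidean distance).
    threshold: trajectories farther than this get zero target probability.
    """
    # Element-wise inverse of distance for "close enough" trajectories,
    # zero elsewhere; clamp to avoid dividing by an exact-zero distance.
    w = np.where(distances <= threshold,
                 1.0 / np.maximum(distances, 1e-6),
                 0.0)
    return w / w.sum()  # linearly normalize to a probability distribution
```

Closer trajectories receive higher target probability, so near-misses are penalized less than distant modes.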
5 Experiments
For our motion prediction experiments, the input representation and model architecture were fixed across all models and baselines. We varied both the loss functions and output representations.
5.1 Baselines
Physics oracle. The physics oracle [27] is the minimum average pointwise Euclidean distance over the following physics-based models: i) constant velocity and yaw, ii) constant velocity and yaw rate, iii) constant acceleration and yaw, and iv) constant acceleration and yaw rate.
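One of these component models, constant velocity and yaw, can be sketched as follows (an illustrative helper, not code from the paper):

```python
import numpy as np

def constant_velocity_rollout(position, velocity, yaw, horizon_s, hz):
    """Future (x, y) points under constant speed and heading.

    position:  (x, y) at the current time.
    velocity:  scalar speed in m/s.
    yaw:       heading in radians.
    horizon_s: prediction horizon in seconds.
    hz:        sampling rate of the predicted points.
    """
    t = np.arange(1, int(horizon_s * hz) + 1) / hz   # future timestamps
    direction = np.array([np.cos(yaw), np.sin(yaw)])  # unit heading vector
    return np.asarray(position) + t[:, None] * velocity * direction
```

The oracle then scores each such rollout against the ground truth and keeps the best, providing a lower bound on the error achievable by purely kinematic extrapolation.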
Regression to anchor residuals. We compare our contributions to MultiPath (MP) [6], a state-of-the-art multimodal regression model. This model implements ordinal regression by first choosing among a fixed set of anchors (computed a priori) and then regressing to residuals from the chosen anchor. This model predicts a fixed number of trajectories (modes) and their associated probabilities. The per-agent loss (for agent $i$ at time $t$) is defined as:
$L^i_t = \sum_{k=1}^{K} \mathbb{1}(k = \hat{k}) \left( -\log p_k + \beta\, L_{\mathrm{reg}}(k) \right)$   (4)

where $\mathbb{1}$ is the indicator function that equals $1$ only for the “closest” mode $\hat{k}$, $k$ is the mode index, $L_{\mathrm{reg}}$ is the regression loss, and $\beta$ is a hyperparameter used to trade off between classification and regression. The $k$th trajectory is the sum of the corresponding anchor and predicted residual. We use and use CoverNet trajectory sets as the fixed anchors.
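A schematic version of this per-agent loss follows; it is a sketch under stated assumptions (smooth-L1 standing in for the regression term, with illustrative names and shapes), not the paper's implementation:

```python
import numpy as np

def multipath_loss(mode_probs, residuals, target_residual, closest, beta):
    """Classification term on the closest mode plus a regression term on
    that mode's predicted residual.

    mode_probs:      shape (K,), softmax probabilities over modes.
    residuals:       shape (K, T, 2), predicted residuals from each anchor.
    target_residual: shape (T, 2), ground truth minus the closest anchor.
    closest:         index of the mode whose anchor is nearest the ground truth.
    beta:            classification/regression trade-off weight.
    """
    diff = residuals[closest] - target_residual
    # Smooth-L1 (Huber) regression loss on the chosen mode's residual.
    reg = np.where(np.abs(diff) < 1.0,
                   0.5 * diff ** 2,
                   np.abs(diff) - 0.5).mean()
    return float(-np.log(mode_probs[closest] + 1e-12) + beta * reg)
```

Only the mode matched to the ground truth contributes, which is what mitigates mode averaging.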
5.2 Implementation details
The input raster is an RGB image of size (, , ). The image is aligned so that the agent’s heading faces up, and the agent is placed on pixel (, ) as measured from the top-left of the image. In our experiments, we use a resolution of meters per pixel and choose and . Thus, the model can “see” meters ahead, meters behind, and meters on each side of the agent. Objects are rendered based on their oriented bounding boxes, which are imputed for Argoverse.
All models use a ResNet-50 [18] backbone with pretrained ImageNet [30] weights [10]. We apply a global pooling layer to the ResNet conv5 feature map and concatenate the result with the agent’s speed, acceleration, and yaw rate. We then add a dimensional fully connected layer.
Trajectory sets for each dataset were constructed as described in [27] and our supplementary material. The variable $\epsilon$ specifies the maximum distance between the ground truth and the trajectory set.
For the regression models, our outputs are of dimension $K \times (C \cdot P + 1)$, where $K$ represents the total number of predicted modes, $C$ represents the number of coordinates we are predicting per point, $P$ represents the number of points in our predictions, and the extra output per mode is the probability associated with each mode. For our implementations, $P = N \cdot f$, where $N$ is the length of the prediction horizon, and $f$ is the sampling frequency. We predict $(x, y)$ coordinates, so $C = 2$.
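For instance, the total output size under this layout can be computed as follows (the helper name and argument conventions are illustrative):

```python
def regression_output_dim(num_modes, horizon_s, hz, coords_per_point=2):
    """Total regression-head output size: for each mode, two coordinates
    for every predicted point plus one probability logit."""
    points = int(horizon_s * hz)  # number of predicted points per mode
    return num_modes * (coords_per_point * points + 1)
```

For example, 64 modes over a 6-second horizon sampled at 2 Hz yields 64 * (2 * 12 + 1) = 1600 outputs.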
We use a smooth $\ell_1$ loss for all regression baselines.
Models were trained for epochs on a GPU machine, using a batch size of , and a learning rate schedule that starts at and applies a multiplicative factor of every epoch. We found that these parameters gave good performance across all models. The per-batch core time was ~ ms.
5.3 Metrics
We use commonly reported metrics to give insight into various facets of multimodal trajectory prediction. In our experiments, we average the metrics described below over all instances.
For insight into trajectory prediction performance in scenarios where there are multiple plausible actions, we use the minimum average displacement error (ADE). The minADE_{k} is $\min_{s \in P_k} \frac{1}{N} \sum_{t=1}^{N} \lVert s_t - s^{*}_t \rVert_2$, where $P_k$ is the set of $k$ most likely trajectories and $s^{*}$ is the ground truth trajectory.
We use the notion of a miss rate to simplify interpretation of whether or not a prediction was “close enough.” We define the MissRate_{k, d} for a single instance (agent at a given time) as 1 if $\min_{s \in P_k} \max_{t} \lVert s_t - s^{*}_t \rVert_2 > d$, and 0 otherwise.
To measure how well predicted trajectories satisfy the domain knowledge that cars drive on the road, we use Drivable Area Compliance (DAC) [7]. DAC is computed as $(n - m) / n$, where $n$ is the total number of predictions and $m$ is the number that are off the road.
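A sketch of the first two metrics for a single instance (array shapes and names are illustrative):

```python
import numpy as np

def min_ade(pred_top_k, ground_truth):
    """Minimum average displacement error over the k most likely trajectories.

    pred_top_k:   shape (k, T, 2), the k most likely predicted trajectories.
    ground_truth: shape (T, 2).
    """
    dists = np.linalg.norm(pred_top_k - ground_truth[None], axis=-1)  # (k, T)
    return float(dists.mean(axis=1).min())

def miss_rate_instance(pred_top_k, ground_truth, d):
    """1 if every one of the k trajectories strays farther than d meters
    from the ground truth at some point, else 0."""
    dists = np.linalg.norm(pred_top_k - ground_truth[None], axis=-1)
    return int(dists.max(axis=1).min() > d)
```

Averaging these per-instance values over the evaluation set gives the reported minADE_{k} and MissRate_{k, d}.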
5.4 The impact of domain knowledge
We investigate the impact of the domain knowledge that we encoded with the losses in Section 4. All results in this section are evaluated on the Argoverse [7] validation set.
Argoverse. Argoverse [7] is a large-scale motion prediction dataset recorded in Pittsburgh and Miami. Object centroids (imperfectly estimated by a tracking system) are published at 10 Hz. The task is to predict the last 3 seconds of each scene given its first 2 seconds and the map. We use the split for version of the dataset, which includes training scenarios.
5.4.1 Off-road loss and pre-training
As all vehicles in Argoverse are on the drivable area, a perfect model should have drivable area compliance (DAC) of 1. Figure 5 explores how model performance varies with the amount of training data available. Using an off-road penalty improves DAC in all cases. Pre-training on the drivable area also gives a significant performance boost in data-limited regimes.
Figure 7 further explores the impact of off-road losses on the DAC of less likely modes. We see small but consistent improvements in DAC as we increase the off-road loss penalty. Furthermore, the relative improvements increase with less likely modes.
5.4.2 Weighted cross-entropy loss
Loss  CE  Max WCE  Mean WCE  Avoid Nearby WCE  

Threshold (m)  N/A  2  3  1000  2  3  1000  2 
minADE_{1}  1.75  1.79  1.84  2.01  2.05  1.91  1.94  1.75 
minADE_{5}  1.05  1.19  1.26  1.42  1.30  1.32  1.41  1.05 
MissRate_{5, 2m}  0.38  0.44  0.49  0.55  0.48  0.49  0.53  0.38 
Mean dist. between modes  1.34  1.05  1.13  1.01  1.17  1.23  0.93  1.35 
Table 1 compares performance of a dense trajectory set () when we consider all trajectories within various distances of the ground truth as a match. Surprisingly, the cross-entropy loss performs better than weighted max and mean cross-entropy losses. We hypothesize that this is due to the weighted cross-entropy loss causing more “clumping” of modes. We explored this hypothesis by computing the mean distance between predicted trajectories, and the results in Table 1 support this interpretation. To reduce clumping, we tried an “Avoid Nearby” weighted cross-entropy loss that assigns weight of to the closest match, to all other trajectories within meters of ground truth, and to the rest. We see that we are able to increase mode diversity and recover the performance of the baseline loss. Our results indicate that losses that are better able to enforce mode diversity may lead to improved performance.
5.5 Classification vs. ordinal regression
We perform a detailed comparison of classification (CoverNet) and ordinal regression (MultiPath). All models use the same input representation and backbone to focus on the output representation.
nuScenes. nuScenes [3] is a self-driving car dataset that consists of 1000 scenes recorded in Boston and Singapore. Objects are hand-annotated in 3D and published at 2 Hz. The task is to predict the next 6 seconds given the prior second and the map. We use the split publicly available in the nuScenes software development kit [26], which only includes moving vehicles. This split includes observations in the train set.
5.5.1 Discussion
For CoverNet models, we find that offers the best or second best performance for all metrics on both datasets. Additionally, smaller models tend to overfit, especially on the smaller nuScenes dataset. For MultiPath, we find that the MissRate_{5, 2m} monotonically decreases as the trajectory set size increases but using a modest number of modes, such as , can achieve the best minADE_{5}.
We emphasize the strong performance of the MultiPath approach with a large number of modes ( and ). The MultiPath paper [6] noted decreasing performance with more than anchors, whereas we see benefits from using an order of magnitude more. Our analysis suggests that the increase in performance associated with the higher number of modes happens due to better coverage of space via anchors, leaving the network to learn smaller residuals. Table 7 displays the average and norms of the residuals learned by the model using different numbers of modes.
Method  Modes  minADE_{1}  minADE_{5}  minADE_{10}  MissRate_{5, 2m} 

Constant velocity  N/A  2.33 4.11  2.33 4.11  2.33 4.11  0.77 0.90 
Physics oracle  N/A  2.26 2.97  2.26 2.97  2.26 2.97  0.76 0.85 
MTP [11]  all 1  1.80 3.41  1.80 3.41  1.80 3.41  0.71 0.90 
MultiPath [6]  all 16  1.81 4.30  1.14 2.22  1.10 2.16  0.56 0.89 
MultiPath [6]  all 64  1.90 4.40  0.98 1.94  0.89 1.68  0.41 0.84 
MultiPath [6], ε=3  682 946  1.77 4.70  0.91 2.27  0.75 1.73  0.32 0.74 
MultiPath [6], ε=2  1571 2339  1.76 4.80  0.94 2.50  0.76 1.85  0.32 0.72 
CoverNet [27], ε=3  682 946  1.75 4.60  1.05 2.22  0.91 1.68  0.38 0.78 
CoverNet [27], ε=2  1571 2339  1.79 4.60  1.00 2.30  0.82 1.70  0.36 0.71 
CoverNet [27], ε=1  5077 9909  1.94 6.10  1.03 2.90  0.87 2.12  0.41 0.80 
5.5.2 Argoverse test set
We further compared our implementation and modifications of MultiPath and CoverNet on the Argoverse test set. Table 3 compares our results with notable published methods. Our multimodal results are similar to MFP [32], which explicitly models interactions. Our poor unimodal performance may be due to the simple input representation we used, which encodes history via fading colors.
Model  CV [7]  LSTM ED [7]  VectorNet [15]  MFP [32]  MultiPath^{*} [6]  CoverNet^{*} [27] 

minADE_{1}  3.55  2.15  1.81    2.34  2.38 
minADE_{6}  3.55  2.15  1.18  1.40  1.28  1.42 
6 Conclusion
We extended a state-of-the-art classification-based motion prediction algorithm to utilize domain knowledge. By adding an auxiliary loss that penalizes off-road predictions, the model can better learn that likely future motions stay on the road. Additionally, our formulation makes it easy to pre-train using only map information (e.g., off-road area), which significantly improves performance on small datasets. In an attempt to better encode spatio-temporal relationships among trajectories, we also investigated various weighted cross-entropy losses. Our results here did not improve on the baseline, although they pointed towards the need for losses that promote mode diversity. Finally, our detailed analysis of classification and ordinal regression (on public self-driving datasets) showed that the best performance can be achieved with an order of magnitude more modes than previously reported.
Acknowledgments. We would like to thank Oscar Beijbom, Caglayan Dicle, Emilio Frazzoli, and Sourabh Vora for helping improve the presentation of our work.
References

[1] (2016) Social LSTM: human trajectory prediction in crowded spaces. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[2] (2018) Accurate and diverse sampling of sequences based on a “best of many” sample objective. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[3] (2019) nuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027.
[4] (2019) Spatially-aware graph neural networks for relational behavior forecasting from sensor data. arXiv preprint arXiv:1910.08233.
[5] (2018) IntentNet: learning to predict intention from raw sensor data. In Proceedings of The 2nd Conference on Robot Learning.
[6] (2019) MultiPath: multiple probabilistic anchor trajectory hypotheses for behavior prediction. In 3rd Conference on Robot Learning (CoRL).
[7] (2019) Argoverse: 3D tracking and forecasting with rich maps. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[8] (2017) Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, pp. 742–751.
[9] (2007) US Highway 101 dataset. Federal Highway Administration (FHWA), Tech. Rep. FHWA-HRT-07-030.
[10] (2019) Torchvision models. https://pytorch.org/docs/stable/torchvision/models.html
[11] (2019) Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In International Conference on Robotics and Automation (ICRA).
[12] (2019) Deep kinematic models for physically realistic prediction of vehicle trajectories. arXiv preprint arXiv:1908.00219.
[13] (2018) Convolutional social pooling for vehicle trajectory prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
[14] (2018) Short-term motion prediction of traffic actors for autonomous driving using deep convolutional networks. arXiv preprint arXiv:1808.05819.
[15] (2020) VectorNet: encoding HD maps and agent dynamics from vectorized representation. arXiv preprint arXiv:2005.04259.
[16] (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[17] (2018) Social GAN: socially acceptable trajectories with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[18] (2015) Deep residual learning for image recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] (2019) Rules of the Road: predicting driving behavior with a convolutional model of semantic interactions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[20] (2018) Generative modeling of multimodal multi-human behavior. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS).
[21] (2012) Activity forecasting. In The European Conference on Computer Vision (ECCV).
[22] (2017) DESIRE: distant future prediction in dynamic scenes with interacting agents. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[23] (2018) Fast and Furious: real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[24] (2019) ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. In Robotics: Science and Systems (RSS).
[25] (2019) Improving movement prediction of traffic actors using off-road loss and bias mitigation. In Advances in Neural Information Processing Systems.
[26] (2020) nuScenes software development kit. https://www.nuscenes.org/
[27] (2019) CoverNet: multimodal behavior prediction using trajectory sets. arXiv preprint arXiv:1911.10298.
[28] (2018) R2P2: a reparameterized pushforward policy for diverse, precise generative path forecasting. In The European Conference on Computer Vision (ECCV).
[29] (2019) PRECOG: prediction conditioned on goals in visual multi-agent settings. In The IEEE International Conference on Computer Vision (ICCV).
[30] (2015) ImageNet large scale visual recognition challenge. The International Journal of Computer Vision.
[31] (2019) SoPhie: an attentive GAN for predicting paths compliant to social and physical constraints. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[32] (2019) Multiple futures prediction. In Advances in Neural Information Processing Systems.
[33] (2019) End-to-end interpretable neural motion planner. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[34] (2019) Multi-agent tensor fusion for contextual trajectory prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
7 Supplementary Material
We created fixed trajectory sets as described in Section 3 of [27]. We used ground truth trajectories from the Argoverse training set and all trajectories from the nuScenes training set to compute trajectory sets. The parameter $\epsilon$ controls the furthest a ground truth trajectory in our training set can be from a trajectory in our trajectory set. We visualize the trajectory sets used by CoverNet and MultiPath on the Argoverse dataset and observe that for all $\epsilon$, the trajectory sets include fast-moving lane-keeping trajectories and some very wide turns. As we decrease $\epsilon$, the diversity of lateral maneuvers increases.