I. Introduction
Self-driving vehicles (SDVs) have the potential to make a large impact on our society, making transportation safer, cheaper and more efficient. A core component of every self-driving vehicle is its ability to perceive the world (including dynamic objects) and forecast how the future might unroll. In recent years there has been incredible progress in perception systems [1, 2, 3]. Many challenges still remain, however, in providing motion forecasts that are simultaneously diverse and precise [4]. That is, they must cover all the modes of the data distribution while generating few highly unrealistic trajectories.
Roads in modern cities have well-defined geometries and topologies, as well as traffic rules. The vast majority of actors in the scene will adhere to this structure, for example driving close to the middle of their lane, respecting stop signs, or obeying yielding laws. These agents will also most likely act in a socially acceptable manner, avoiding collisions with other traffic participants. Despite this fact, most perception and motion forecasting systems are trained to be as close as possible to the ground truth, employing symmetric loss functions that do not take this structure into account. For example, the Euclidean distance at the waypoint level between the predicted and ground-truth future trajectories is a common choice for motion forecasting. This can cause uncomfortable rides for the self-driving vehicle, with plenty of sudden brakes and steering changes due to false positive motion forecasts intruding into the ego-car's lane, such as the red trajectory illustrated in Fig. 1 (left). Even worse, these sudden reactions to avoid an imminent collision with a motion forecast can cause an actual collision (with the ground-truth actor) as a byproduct. It is also critical to recall all actors (and their future motion) in the SDV lane, since otherwise the road ahead would look like free space and the SDV could dangerously accelerate, as depicted by the red motion forecast in Fig. 1 (right).
One possible solution is to add the aforementioned intuitions into the motion forecasting model as hard constraints. Unfortunately, this may not be resilient to non-compliant behavior from other actors or to map failures, possibly predicting unrealistic and dangerous situations. In this paper we take an alternative approach and design loss functions that encourage our perception and prediction system to only violate these constraints when such violations happen in reality.
Incorporating prior knowledge via loss functions is easy when the perception and prediction modules are deterministic. However, deterministic systems fail to capture the inherent uncertainty of the future. This can be catastrophic when the model fails to capture the true actor intention (e.g., crossing the street vs. waiting, yielding vs. not). In order to plan a safe maneuver, coverage of the possible future scenarios is required, along with information about the likelihood of each possible future, such that the motion planner can choose the plan with the lowest expected cost. The Gaussian distribution and mixtures thereof have been widely used to represent uncertainty over spatial locations
[5, 6, 7]. However, as shown in [4], maximizing the log-likelihood of the data encourages the model to produce distributions with high recall in order to avoid the big penalty associated with low-density areas. As a consequence, many unrealistic samples are generated, sacrificing the precision of the model.

In this paper, we show that making explicit use of our prior knowledge about the geometry and topology of the roads, as well as the traffic rules, can provide more precise distributions over future outcomes while preserving their recall. However, this is challenging as these priors are typically non-differentiable and thus not directly amenable to gradient-based optimization. For instance, the fact that humans tend to follow traffic rules is better described as a discrete (follow / not follow) action. To this end, we propose a flexible framework to incorporate non-differentiable prior knowledge as a loss and exploit the popular REINFORCE [8]
gradient estimator. Our formulation allows us to optimize for any prior knowledge on future trajectories, as long as drawing samples from the perception and prediction model and evaluating their likelihood can be done efficiently. In particular, we apply our formulation to model how the vehicles interact with the map, encouraging the predictions to respect lane dividers and traffic lights. We also exploit our framework to make the motion forecasting module more planning-aware by emphasizing the importance of high recall and high precision near the SDV route.
Our experiments show that our proposed framework can improve the map understanding of state-of-the-art motion forecasting methods in very complex, partially observable urban environments on two challenging real-world datasets: ATG4D [2] and nuScenes [9]. Importantly, our approach achieves significant improvements in the precision of the trajectory distribution, while maintaining the recall. Unlike previous works, in this paper we advocate for measuring the system-level impact of the motion forecasts via motion planning metrics. We demonstrate that including prior knowledge not only results in more comfortable rides, but also in major safety improvements over the decisions taken by a state-of-the-art motion planner [10]. Moreover, we show that achieving a lower minADE alone, the most frequently used metric to benchmark multimodal motion forecasts, may not translate into safer motion plans.
II. Related Work
In order to plan a safe maneuver, we must effectively deal with noisy and partial observations from sensor data, and provide an accurate characterization of the distribution over future trajectories of all actors in the scene. We thus want to model $p(\mathbf{Y} \mid \mathbf{X})$, with $\mathbf{X}$ the observations (i.e., a local context for each actor) and $\mathbf{Y}$ the future trajectories of all actors. This is particularly challenging since the future is inherently uncertain, and actors' discrete decisions induce highly multimodal distributions (e.g., turning right vs. going straight at an intersection, going vs. stopping at a yield sign).
In traditional self-driving stacks, there is an object detection module responsible for recognizing traffic participants in the scene, followed by a motion forecasting module that predicts how the scene might unroll given the current state of each actor. However, the actor state is typically a very compact representation that includes pose, velocity, and acceleration. As a consequence, it is hard to incorporate uncertainty coming from sources such as sensor noise and occlusion.
FAF [11] unified these two tasks by having a single fully convolutional backbone network predict both the current and future states for each pixel in a bird's-eye-view grid, directly from a voxelized LiDAR point cloud. This naturally propagates uncertainty between the two tasks in the feature space, without any need for hand-crafted intermediate representations. [12]
extended this framework to include the map as an input by adding a parallel fully convolutional backbone network that processes a semantic raster map of the scene, thus making fusion trivial by concatenation. Recently, this framework was further extended to model agent-agent interactions via graph neural networks
[5], learn a cost map for motion planning [13], and add differentiable tracking in-the-loop [14]. While these works are great at dealing with uncertainties at the sensor level and mitigating downstream failures caused by object detection uncertainty by learning joint features, they do not focus on the output parameterization, and all of them produce unimodal predictions. Unimodal predictions can result in unsafe behaviors if, for example, the predicted intention is not accurate (e.g., a pedestrian crossing vs. waiting) or the predictions lie in between two modes (e.g., at branching roads). In this work, we extend this framework to predict multimodal behaviors that are aligned with human prior knowledge.

The motion planning algorithm in an SDV needs to take all possibilities into account to make safe decisions. Thus, motion forecasting models that can characterize complex multimodal distributions are a must, and efficient sampling is desired so that motion planning can find the plan with the lowest expected cost in a timely manner. Approaches that directly output the parameters of the marginal distribution at each timestep with a closed-form likelihood, such as a sequence of Gaussians over time [15, 5], meet the efficient-sampling requirement but can suffer from low expressivity. [6]
proposed a more stable way to train mixtures of future trajectories than directly optimizing the likelihood, by only training the Gaussian likelihood of the mode closest to the ground-truth trajectory and applying a cross-entropy loss to the mode probabilities.
[7] followed up on this idea by anchoring the predictions to clusters produced offline by running k-means on the training set. These two models achieve a good trade-off between sampling efficiency and expressivity. Predicting discrete occupancy maps into the future
[16] has very good expressivity and naturally captures multimodality, but it is very memory intensive and requires adaptations in order to use traditional motion planners designed for trajectory representations.

Autoregressive models [4, 17, 18] require sequential sampling and thus are not amenable to real-time inference, which is necessary in safety-critical applications such as self-driving. Furthermore, there is a mismatch between training and inference: the distributions at training time are conditioned on the ground truth, but during inference they are conditioned on model samples. This makes the model less robust to noise, particularly in the joint detection and prediction setting where perfect detection is not possible. Finally, modeling interactions between the traffic participants [15, 19, 5, 17, 18] has been shown to help reduce the prediction uncertainty and generate more socially consistent predictions.
There has been incredible progress in learning multimodal futures with a likelihood objective without leveraging prior knowledge or known structure, thanks to the availability of large and diverse driving datasets. However, datasets collected from real-world driving can only cover one possible mode out of the many possibilities of how the future might have unrolled. As shown in [4], this partial description of the underlying distribution, together with a Maximum Likelihood Estimation (MLE) objective, results in high-entropy distributions that favor mode-covering over mode-seeking behaviors [20]. Empirically, although the learned distribution recovers the ground-truth future, it also generates highly implausible samples (e.g., out-of-map predictions). In this paper we show that coverage at the expense of precision severely harms motion planning.
A solution to avoid mode-covering predictions is to directly leverage prior knowledge to characterize the true distribution. In particular, the actor's motion history, road geometry, traffic rules, and nearby traffic participants all constrain the space of plausible futures. Past approaches have (i) devised neural networks with inductive priors at the model level [21], and (ii) used a map-based reconstruction loss as an auxiliary task (e.g., recovering the road mask shape [22] to encourage predictions to fall on drivable areas, which they show helps with generalization to novel driving scenarios). However, the latter approach is limited since the policy is non-probabilistic and the loss is applied only to the maximum a posteriori (MAP) estimate. In this paper, we propose a more general and powerful approach that leverages the REINFORCE gradient estimator to incorporate any prior knowledge (including non-differentiable functions) over probabilistic motion forecasts, while still allowing the model to recover non-compliant behavior.
III. Exploiting Prior Knowledge in Driving
In this section, we describe a novel framework to incorporate prior knowledge explicitly into probabilistic motion forecasts. Importantly, our approach still permits the model to predict non-compliant behavior that does not follow the traffic rules, in the rare event that this occurs. Our method is general and can be applied to any model that can generate actor trajectory samples and evaluate their marginal likelihood efficiently. Since for the remainder of the paper we only refer to per-actor marginal likelihoods, we simplify the notation and refer to an actor's trajectory as $y$ and to its local context as $x$ (i.e., the detected bounding box and local LiDAR/map features). We defer the explanation of the particular state-of-the-art perception and prediction model we use to Section IV, as this is not the main contribution of our work.
Given a traffic scene, humans have rich prior knowledge over how the traffic participants might behave. In this paper, we propose to directly use this prior knowledge as supervision when learning an actor's distribution over future trajectories. Towards this goal, we encode the prior knowledge as a deterministic reward function $r(y)$. We then define the prior knowledge objective as the negative expected reward over samples from the future trajectory distribution. Note that applying the loss directly to the point estimate of the means is not sufficient, since our goal is to learn an accurate characterization of the full distribution for safe motion planning. The goal is then to learn a stochastic policy $p_\theta(y \mid x)$ (parameterized by $\theta$) that maximizes the expected reward:

$$\max_\theta \; \mathbb{E}_{y \sim p_\theta(y \mid x)}\left[\, r(y) \,\right]$$
Most priors are non-differentiable and cannot be easily relaxed (e.g., whether a motion forecast follows the traffic rules or not). Thus, we leverage policy gradient algorithms, which do not assume differentiability of the reward function and allow direct optimization without making any approximations. In particular, we use the popular REINFORCE algorithm [8], which only requires the policy to be differentiable, provide efficient sampling, and allow likelihood evaluation. In this case, the gradient of this objective can be computed as:

$$\nabla_\theta \, \mathbb{E}_{y \sim p_\theta(y \mid x)}\left[\, r(y) \,\right] = \mathbb{E}_{y \sim p_\theta(y \mid x)}\left[\, r(y)\, \nabla_\theta \log p_\theta(y \mid x) \,\right]$$
The expectation can then be approximated by drawing samples from the predicted distribution as follows:

$$\nabla_\theta \, \mathbb{E}_{y \sim p_\theta(y \mid x)}\left[\, r(y) \,\right] \approx \frac{1}{S} \sum_{s=1}^{S} r\!\left(y^{(s)}\right) \nabla_\theta \log p_\theta\!\left(y^{(s)} \mid x\right)$$

where $y^{(s)}$ is the $s$-th trajectory sample, out of $S$ samples. Although this Monte Carlo estimation is unbiased, it typically has high variance. Our experiments show that this does not pose a problem when using a policy that has an efficient sampling mechanism, since we can draw a large number of samples.
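To make the estimator concrete, below is a minimal PyTorch-style sketch of the resulting surrogate loss. The `policy.sample`/`policy.log_prob` interface and the array shapes are illustrative assumptions, not the authors' implementation.

```python
import torch

def reinforce_loss(policy, context, reward_fn, num_samples=50):
    """Monte Carlo REINFORCE surrogate for the negative expected reward.

    policy:    object exposing sample(context, n) -> trajectories and
               log_prob(trajectories, context) -> per-sample log-likelihoods
               (a hypothetical interface).
    reward_fn: non-differentiable prior-knowledge reward, evaluated per sample.
    """
    with torch.no_grad():
        # Draw S trajectory samples; the reward itself needs no gradient.
        trajectories = policy.sample(context, num_samples)   # e.g. [S, T, 2]
        rewards = reward_fn(trajectories)                     # [S]
    # log p_theta(y^(s) | x) is differentiable w.r.t. the policy parameters.
    log_probs = policy.log_prob(trajectories, context)        # [S]
    # Minimizing this surrogate follows the REINFORCE gradient of -E[r(y)].
    return -(rewards * log_probs).mean()
```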
In practice, our reward explicitly incorporates the prior knowledge that drivers generally follow the reachable lanes defined by lane markers, traffic signs, and traffic lights. Furthermore, we leverage our knowledge about the self-driving vehicle's planned route to place emphasis on the most relevant actors. Intuitively, missing the prediction of an actor coming into conflict with the SDV route (a false negative) or predicting that an actor will cross in front of the SDV when in reality it stops (a false positive) is of greater importance than accurately characterizing the behavior of an actor 50 meters behind the SDV. We express the final reward as a simple linear combination of two terms: a reachable-lanes reward and an SDV-route reward.
Next, we explain both terms in detail.
III-A. Reachable Lanes
Human driving behavior is highly structured: in the majority of scenarios, drivers will follow the road topology and traffic rules. To leverage this informative prior, but not overly penalize non-compliant behavior, we define a flexible traffic-rule-informed loss that is conditioned on the ground-truth behavior. To this end, we leverage a lane-graph representation where the nodes encode lane segments and the edges represent relationships between lane segments such as adjacency, predecessor, and successor (taking into account the direction of traffic flow). This allows us to define the reachable lanes loss per timestep as:

$$\ell_{\text{lanes}}(\hat{y}_t) = \mathbb{1}\!\left[\hat{y}_t \notin \mathcal{R}\right] \cdot \mathbb{1}\!\left[y_t \in \mathcal{R}\right] \qquad (1)$$

where $\mathcal{R}$ represents the lanes that are reachable from the detected vehicle bounding box by obeying the traffic rules, $\hat{y}_t$ is a sampled waypoint, and $y_t$ is the corresponding ground-truth waypoint. Note that, to be robust to noise in the lane graph and avoid penalizing non-compliant behaviors, we only apply the loss if the ground-truth waypoint falls within the set of reachable lanes as well. This loss is summarized in Fig. 2. To define this set of reachable lanes, we capture lane-divider infractions as well as traffic-light violations on the lane graph.
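As an illustration, the per-timestep penalty of Eq. (1) could be computed from a rasterized reachable-lanes mask as sketched below; the boolean raster representation, the coordinate mapping, and the array shapes are assumptions made for brevity.

```python
import numpy as np

def reachable_lanes_loss(sample_xy, gt_xy, reach_mask, origin, resolution):
    """Per-timestep reachable-lanes penalty in the spirit of Eq. (1).

    sample_xy:  [T, 2] sampled waypoints in metric coordinates.
    gt_xy:      [T, 2] ground-truth waypoints.
    reach_mask: [H, W] boolean raster of the reachable-lanes region.
    origin, resolution: mapping from metric coordinates to raster indices.
    """
    def inside(points):
        ij = np.floor((points - origin) / resolution).astype(int)
        valid = (ij[:, 0] >= 0) & (ij[:, 0] < reach_mask.shape[0]) & \
                (ij[:, 1] >= 0) & (ij[:, 1] < reach_mask.shape[1])
        out = np.zeros(len(points), dtype=bool)
        out[valid] = reach_mask[ij[valid, 0], ij[valid, 1]]
        return out

    # Penalize only timesteps where the sample leaves the reachable lanes while
    # the ground truth stays inside (non-compliant ground truth is not penalized).
    penalty = (~inside(sample_xy)) & inside(gt_xy)
    # The corresponding REINFORCE reward is the negative of this penalty.
    return penalty.astype(float).sum()
```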
Table I: Motion forecasting results on ATG4D, broken down by the ground-truth high-level action.

Model           |  Final Lane Error (%)   |      meanADE (m)        |      minADE (m)
                |  Straight  Left   Right |  Straight  Left   Right |  Straight  Left   Right
SpAGNN [5]      |   18.79   51.27   51.52 |    3.36    5.70    6.22 |    0.68    1.63    1.68
MultiPath [7]   |   12.76   49.27   39.72 |    2.40    5.09    5.08 |    0.58    1.59    1.30
R2P2-MA [17]    |    6.36   42.26   28.58 |    2.57    4.74    5.09 |    0.75    1.85    1.63
SpAGNN+         |   10.07   47.92   39.60 |    2.35    4.53    4.89 |    0.53    1.39    1.26
Ours            |    6.28   39.13   28.07 |    2.17    4.16    4.57 |    0.54    1.60    1.51
Table II: Motion forecasting results on nuScenes.

Model     |  Final Lane Error (%)   |  meanADE (m)  |  minADE (m)
          |  Straight  Left   Right |               |
SpAGNN+   |   15.24   24.24   28.48 |     1.72      |     0.48
Ours      |   10.69   17.44   20.31 |     1.65      |     0.50
III-A.1 Lane Infraction
Lane dividers limit the set of legal high-level actions a vehicle can take on the road. For instance, lane changing over a solid line or overtaking another vehicle by crossing a double yellow line into oncoming traffic is not allowed. We incorporate this prior by removing the edges corresponding to illegal maneuvers from the lane graph. Encoding this prior helps the model predict less entropic distributions.
III-A.2 Traffic Light Violation
Many interactions occur at intersections, some of which are safety-critical. Thus, it is important to have accurate actor predictions at intersections, in particular differentiating stopping and going behaviors. To this end, we leverage the traffic control states (i.e., green, red, yellow) to remove edges connecting lane segments that are currently governed by a red traffic light.
Once we have processed the lane graph by applying the aforementioned rules, we perform lane association to match each vehicle bounding box to a lane (or a set of lanes when the vehicle overlaps with multiple ones, for example during a lane change). Subsequently, we run a depth-first search starting from the current lane segment, obtaining the set of reachable lanes.
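A minimal sketch of how the reachable-lane set could be obtained from the pruned lane graph is shown below; the adjacency-dictionary encoding and the edge-type labels are illustrative assumptions.

```python
def reachable_lanes(lane_graph, start_lanes, red_light_lanes):
    """Depth-first search over a pruned lane graph.

    lane_graph:      dict mapping lane_id -> list of (neighbor_id, edge_type),
                     with edge_type in {"successor", "adjacent_legal",
                     "adjacent_illegal"} (illustrative encoding).
    start_lanes:     lane ids the detected vehicle currently overlaps with.
    red_light_lanes: lane ids currently governed by a red traffic light.
    """
    reachable, stack = set(), list(start_lanes)
    while stack:
        lane = stack.pop()
        if lane in reachable:
            continue
        reachable.add(lane)
        for neighbor, edge_type in lane_graph.get(lane, []):
            # Prune edges corresponding to illegal lane changes (e.g. over a
            # solid line) and lane segments blocked by a red light.
            if edge_type == "adjacent_illegal" or neighbor in red_light_lanes:
                continue
            stack.append(neighbor)
    return reachable
```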
III-B. SDV Route
It is more important to precisely characterize the motion of vehicles that might interact with the SDV [23] than that of other traffic participants that do not influence the SDV behavior. We can approximate the area of interest with the SDV's route (i.e., the high-level command), which is defined as the union of all lane segments that the SDV can travel on to reach a preset goal, given the lane graph. More concretely, the horizon is set to be equal to the prediction horizon (5 s), and the target lane is generated by a given high-level route planner (outside the scope of this paper). This gives a safe approximation of the SDV's possible future locations.
We now define positive trajectories as those with at least one waypoint falling within the SDV route, and negative otherwise. We would like our trajectory predictions to achieve high precision and high recall under this definition, taking into account whether the ground-truth trajectory intersects the route (positive) or not (negative).
More concretely, we define the route loss as:

$$\ell_{\text{route}}(\hat{y}) = \alpha_{\text{TP}}\,\mathbb{1}_{\text{TP}}(\hat{y}) + \alpha_{\text{FP}}\,\mathbb{1}_{\text{FP}}(\hat{y}) + \alpha_{\text{TN}}\,\mathbb{1}_{\text{TN}}(\hat{y}) + \alpha_{\text{FN}}\,\mathbb{1}_{\text{FN}}(\hat{y}) \qquad (2)$$

where we assign different rewards $\alpha_{\text{TP}}, \alpha_{\text{FP}}, \alpha_{\text{TN}}, \alpha_{\text{FN}}$ to true positive, false positive, true negative, and false negative waypoint predictions, since there is a high imbalance in the data and each error type has a different impact on the safety of our motion planner. This loss is illustrated in Fig. 3.
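For concreteness, a sketch of this route-based reward at the trajectory level is given below; the coefficient values and the `route_mask_fn` helper are placeholders, not the values or interfaces used in the paper.

```python
def route_reward(sample_xy, gt_xy, route_mask_fn,
                 r_tp=0.0, r_fp=-1.0, r_tn=0.0, r_fn=-1.0):
    """Reward a sampled trajectory according to whether it agrees with the
    ground truth about intersecting the SDV route (in the spirit of Eq. 2).

    route_mask_fn: callable mapping [T, 2] waypoints -> [T] booleans that are
                   True where a waypoint falls inside the SDV route
                   (hypothetical helper).
    The coefficients are illustrative; in practice they would be tuned to
    reflect the class imbalance and the safety impact of each error type.
    """
    pred_pos = route_mask_fn(sample_xy).any()   # the sample intersects the route
    gt_pos = route_mask_fn(gt_xy).any()         # the ground truth intersects it
    if pred_pos and gt_pos:
        return r_tp
    if pred_pos and not gt_pos:
        return r_fp   # false positive: phantom intrusion into the SDV route
    if not pred_pos and gt_pos:
        return r_fn   # false negative: missed actor conflicting with the route
    return r_tn
```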
IV. Perception and Prediction Model
So far we have presented our general framework, but we have not specified the particular perception and motion forecasting model we apply our prior knowledge loss to. We exploit a combination of the backbone feature extraction, object detection network, and graph propagation from
SpAGNN [5], together with the mixture of Gaussians output parameterization of MTP [6]. This perception and prediction model takes a voxelized LiDAR point cloud and a raster map as input, extracts scene features using a backbone CNN, and applies Rotated Region of Interest Align (RRoI Align) [24] to extract per-actor features. After that, we define a fully connected graph where the nodes correspond to traffic participants, and perform a series of graph propagations to refine each actor's representation by aggregating features from its neighbors, as proposed by [5]. Finally, an MLP header predicts the parameters of a multimodal distribution over future trajectories by using a mixture of Gaussians with full covariance for each actor:

$$p_\theta(y \mid x) = \sum_{k=1}^{K} \pi_k \prod_{t=1}^{T} \mathcal{N}\!\left(y_t \,\middle|\, \mu_{k,t},\, \Sigma_{k,t}\right)$$

Our method can be trained end-to-end using backpropagation and stochastic gradient descent. In particular, we minimize a multi-objective loss containing classification and regression terms for object detection, a symmetric motion forecasting loss agnostic to the map or the SDV, as well as the prior-informed non-differentiable loss we described in Section III. The loss of each actor is a weighted sum of the multiple objectives. For the classification branch of the detection header (background vs. vehicle), we employ a binary cross-entropy loss with hard negative mining. In particular, we select all positive examples from the ground truth and three times as many negative examples from the rest of the spatial locations. For box fitting, we apply a smooth L1 loss to each of the 5 parameters of the bounding boxes anchored to a positive example.
Instead of directly optimizing the likelihood of the mixture model, we follow [6] in heuristically matching the closest mode to the ground truth and only taking the negative log-likelihood of that mode, while training the mode scores with a cross-entropy loss. This has been shown empirically [7] to be a more stable training objective than optimizing the mixture likelihood directly as in [25]. Thus we define

$$\ell_{\text{forecast}} = -\log \pi_{k^*} - \sum_{t=1}^{T} \log \mathcal{N}\!\left(y_t \,\middle|\, \mu_{k^*,t},\, \Sigma_{k^*,t}\right)$$

where $k^*$ is the mode whose mean is closest to the ground-truth trajectory in Euclidean distance.
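A PyTorch-style sketch of this closest-mode objective follows; the tensor shapes and the diagonal-covariance simplification are illustrative assumptions (the model above uses full covariances).

```python
import torch
import torch.nn.functional as F

def closest_mode_loss(means, log_sigmas, mode_logits, gt):
    """MTP-style objective: NLL of the mode closest to the ground truth plus
    cross-entropy on the mode scores.

    means:       [K, T, 2] predicted waypoint means per mode.
    log_sigmas:  [K, T, 2] per-axis log standard deviations (diagonal
                 covariance here for brevity).
    mode_logits: [K] unnormalized mode scores.
    gt:          [T, 2] ground-truth trajectory.
    """
    # Pick the mode whose mean trajectory is closest in Euclidean distance.
    dists = ((means - gt.unsqueeze(0)) ** 2).sum(-1).sqrt().mean(-1)   # [K]
    k_star = torch.argmin(dists)

    # Gaussian negative log-likelihood of the matched mode (constants dropped).
    var = (2.0 * log_sigmas[k_star]).exp()
    nll = 0.5 * (((gt - means[k_star]) ** 2) / var).sum() \
          + log_sigmas[k_star].sum()

    # Cross-entropy pushes probability mass onto the matched mode.
    ce = F.cross_entropy(mode_logits.unsqueeze(0), k_star.unsqueeze(0))
    return nll + ce
```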
Since the Gaussian waypoints in this model are independent across time, we use a heuristic trajectory sampler at inference to draw smooth samples from this model (see Appendix A for details). Note that during training we do not need this sampler, since all the losses are formulated at the waypoint level. However, it is important to have temporally consistent samples when measuring the impact of our predictions on the downstream task of motion planning, since we approximate the heading into the future by finite differences between waypoints of the sampled trajectories.
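As an example of the last point, per-waypoint headings can be approximated by finite differences as sketched below; this illustrates only that step, not the full trajectory sampler deferred to Appendix A.

```python
import numpy as np

def headings_from_waypoints(waypoints):
    """Approximate per-waypoint headings by finite differences.

    waypoints: [T, 2] array of (x, y) positions of one sampled trajectory.
    Returns [T] headings in radians; the last waypoint reuses the previous
    heading since it has no forward difference.
    """
    deltas = np.diff(waypoints, axis=0)                  # [T-1, 2]
    headings = np.arctan2(deltas[:, 1], deltas[:, 0])    # [T-1]
    return np.append(headings, headings[-1])
```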
Table III: Ego-motion planning results.

Model           |  Collision (% up to 5 s)  |  L2 human (m @ 5 s)  |  Lat. acc. (m/s²)  |  Jerk (m/s³)  |  Progress (m @ 5 s)
SpAGNN [5]      |           4.19            |        5.98          |        2.94        |      2.90     |        32.37
R2P2-MA [17]    |           3.71            |        5.65          |        2.84        |      2.53     |        33.90
MultiPath [7]   |           3.30            |        5.58          |        2.73        |      2.57     |        32.99
SpAGNN+         |           3.33            |        5.52          |        2.77        |      2.56     |        33.11
Ours            |           2.75            |        5.43          |        2.67        |      2.47     |        33.09
V. Experimental Evaluation
In this section, we first describe how we measure the motion forecasting ability of our method, namely how well it predicts the future behavior of all the traffic participants. We then introduce the comprehensive set of metrics that we use for evaluation in the downstream task of ego-motion planning. Despite being the final goal of the system, this task has generally been ignored in previous motion forecasting works. Next, we discuss the state-of-the-art baselines we compare against, report extensive quantitative results on two challenging, real-world datasets, ATG4D [2] and nuScenes [9], and showcase the differences of adding prior knowledge from a qualitative standpoint. Finally, we perform a thorough ablation study to justify our choices on how to incorporate prior knowledge. We defer our implementation details to Appendix A.
V-A. Metrics
Motion forecasting metrics neglect the fact that actors are of different importance to the overall system, and they are not necessarily aligned with system-level metrics. A model with the best aggregate metrics in perception and motion forecasting may excel at the unimportant cases yet miss safety-critical cases, resulting in unsafe driving. Since our method is a perception and motion forecasting system, we lead the discussion with motion forecasting metrics for clarity, as well as to gain intuition about the differences with the baselines. However, what we really care about is how well we drive, and we thus provide an extensive analysis of system-level metrics to conclude this section.
V-A.1 Motion Forecasting
We follow previous works in joint perception and prediction [12, 2, 5] and perform IoU-based matching between object detections and ground-truth bounding boxes, ignoring fully occluded vehicles without any LiDAR points.
In order to measure the performance of the motion forecasting system, we use sample quality measures, following [17]. In particular, we use (i) the final lane error (trajectory waypoint inside vs. outside the reachable lanes at 5 seconds into the future) to measure map understanding, (ii) the minimum average displacement error (minADE) to show the recall of our motion forecasts at different time horizons, and (iii) the mean average displacement error (meanADE), which gives us an idea of the precision of our predictions, since unrealistic samples severely harm this metric. Furthermore, we benchmark performance across a diverse set of driving behaviors by breaking down all metrics by the ground-truth high-level action of the vehicle: going straight, turning left, and turning right. We omit stationary and near-stationary vehicles, since all models do well in detecting and predicting the future states of those. We use 50 samples for all evaluations.
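These sample-quality metrics can be computed per actor as in the following sketch, where the array shapes are assumptions for illustration.

```python
import numpy as np

def ade_metrics(samples, gt):
    """minADE and meanADE over a set of trajectory samples for one actor.

    samples: [S, T, 2] sampled future trajectories (e.g. S = 50).
    gt:      [T, 2] ground-truth future trajectory.
    """
    # Average displacement error of each sample over the prediction horizon.
    ade = np.linalg.norm(samples - gt[None], axis=-1).mean(axis=-1)   # [S]
    # minADE rewards recall (some sample is close to the ground truth), while
    # meanADE rewards precision (unrealistic samples inflate it).
    return ade.min(), ade.mean()
```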
Note that, in contrast to [17], we emphasize the need for metrics that show the precision of the predictions, as opposed to using only recall-oriented metrics such as minADE, for two reasons. First, recall is easily achievable at the expense of precision by simply predicting very fan-out distributions. Second, precision in the motion forecasts is critical for safe motion planning, as we show in Sec. V-B.
V-A.2 Ego-Motion Planning
To evaluate how our approach impacts the full system, we use the learnable motion planner proposed in [10]. We feed the planner with 50 trajectory samples for each vehicle, as a Monte Carlo approximation of the marginal distribution. We assign equal weight to each sample to avoid overweighting the high-likelihood region of the distribution, since the samples already come from our model. This way we can keep the motion planner as originally proposed. Because in this paper we do not consider perception and prediction of pedestrians and bicyclists, we feed the ground-truth trajectories of these traffic participants to the motion planner in all experiments.
We focus on the safety-related metric of collision rate (% of the time the SDV plan collides with any other traffic participant in the ground truth, for a future horizon of 5 seconds). We also provide results on comfort-related metrics such as lateral acceleration and jerk, to reveal any potential trade-off. Finally, we also include the progress of the SDV to show that the methods are not just trivially avoiding collisions and uncomfortable situations by staying still. Note that these metrics are computed in open-loop, by unrolling the motion plan for the duration of the prediction horizon.
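A possible open-loop collision check is sketched below; the data layout and the rotated-box overlap helper are hypothetical, introduced only to make the metric concrete.

```python
def plan_collides(plan_boxes, actor_gt_boxes, overlap_fn):
    """Open-loop collision check for one scenario.

    plan_boxes:     list over timesteps of the SDV bounding box along the
                    unrolled motion plan (5 s horizon).
    actor_gt_boxes: dict actor_id -> list of ground-truth boxes per timestep.
    overlap_fn:     callable(box_a, box_b) -> bool (hypothetical helper,
                    e.g. a rotated-rectangle intersection test).
    """
    for t, sdv_box in enumerate(plan_boxes):
        for boxes in actor_gt_boxes.values():
            if t < len(boxes) and overlap_fn(sdv_box, boxes[t]):
                return True
    return False

# Collision rate = fraction of evaluated scenarios where plan_collides(...) is True.
```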
[Fig. 5: Qualitative motion forecasting comparison on four ATG4D examples (Example 1-4); top row: SpAGNN+, bottom row: Ours.]

[Fig. 6: Ego-motion planning example unrolled at t = 0 s, 1.5 s, 3.5 s, and 5 s; top row: SpAGNN+, bottom row: Ours.]
V-B. Comparison Against the State of the Art
We compare our approach to three previously proposed state-of-the-art motion forecasting approaches: SpAGNN [5], R2P2-MA [17], and MultiPath [7]. We also consider the model outlined in Section IV without our proposed prior-knowledge loss as a baseline, which we call SpAGNN+. We implemented all models in the context of joint perception and prediction, and employ the same detection backbone in the baselines as in our approach for a fair comparison. We provide more details about the adaptations of R2P2-MA [17] and MultiPath [7] to the joint perception and prediction setting in Appendix A.
Motion Forecasting: Table I shows motion forecasting results on ATG4D, operating at a 90% common recall point across all models for a fair comparison. We first show that adding the mixture of Gaussians as output parameterization to SpAGNN improves all sample quality metrics (SpAGNN vs. SpAGNN+). This makes sense since unimodal distributions cannot capture multimodal behaviors such as braking vs. accelerating, or turning right vs. going straight. Incorporating prior knowledge via our proposed method delivers much better map understanding and precision, as shown by the final lane error and meanADE. While the distance between the ground truth and the closest sample (minADE) suffers a minor regression when incorporating prior knowledge, this does not impact the downstream task of motion planning, as we show later in this section. Indeed, minADE does not describe the full distribution and overly favors distributions with higher entropy.
We show qualitative results for four scenarios in ATG4D with diverse road topologies in Fig. 5. We can clearly see how adding prior knowledge makes the distributions less entropic while preserving multimodality, and substantially improves the map understanding of our predictions. We highlight that, despite incorporating prior knowledge about the fact that vehicles tend to follow the map, our model can still predict vehicles going out of the map, as shown in Example 2 (in this case an unmapped driveway).
In Table II, we validate our method on nuScenes to show that it is robust to variations in the high-definition map specification. We use the same evaluation setup, with the exception that we operate our object detector at a 60% common recall point. As in ATG4D, our approach shows improvements in precision and map understanding metrics.
Ego-Motion Planning: In Table III, we show that our method yields much safer motion plans, indicated by a 17% collision reduction over the strongest baseline on this metric. Remarkably, the increase in safety is not at the expense of comfort, with our method achieving marginally lower jerk and lateral acceleration than the other approaches. Note that while progress is in general desirable, it cannot come at the expense of safety and comfort. We notice that the ego-motion plans make similar progress across models, but our approach produces the trajectories closest to the ground truth executed by an expert human driver (lowest L2 distance at 5 seconds into the future), while yielding far fewer collisions. We observe that despite the popularity of the minADE metric across previous works in motion forecasting, the model that achieves the lowest minADE in Table I does not yield the safest or most comfortable ego-motion plans.
Fig. 6 showcases an example that illustrates a behavior repeated throughout the dataset: the baseline method produces much more fan-out distributions that cause the ego-vehicle to veer into dangerous situations. In this particular example, there is a vehicle in the oncoming lane for which the baseline predicts, with significant probability, that it will drive into the SDV's lane, making the motion planner drive into oncoming traffic and finally causing a collision. In contrast, our model predicts a more precise and plausible distribution over possible futures, where the relevant vehicle is predicted to follow its original lane.
V-C. Ablations
To show that our design choices to incorporate prior knowledge are sound, we explore different approaches:
Table VI: Ablation study of how prior knowledge is incorporated (motion forecasting, ATG4D).

Prior Repr.      | Approach       |  Final Lane Error (%)   |      meanADE (m)        |      minADE (m)
                 |                |  Straight  Left   Right |  Straight  Left   Right |  Straight  Left   Right
–                | –              |   10.07   47.92   39.60 |    2.35    4.53    4.89 |    0.53    1.39    1.26
Reach.           | Reconstruction |   10.08   48.02   38.21 |    2.37    4.48    4.85 |    0.53    1.32    1.24
Centerline Dist  | Mean           |   11.95   53.78   46.11 |    2.80   11.07    9.03 |    0.63    2.07    1.95
Reach. Dist      | Mean           |    9.68   47.65   39.73 |    2.24    4.98    5.12 |    0.55    1.58    1.53
Centerline Dist  | Reparam        |    7.22   40.21   30.66 |    2.36    4.81    5.24 |    0.57    1.80    1.90
Reach. Dist      | Reparam        |    6.84   37.75   27.55 |    2.29    4.46    4.91 |    0.56    1.68    1.86
Reach.           | REINFORCE      |    6.74   41.57   29.80 |    2.25    4.47    4.82 |    0.55    1.68    1.65
Reach. & Route   | REINFORCE      |    6.28   39.13   28.07 |    2.17    4.16    4.57 |    0.54    1.60    1.51
Table VII: Ablation study of how prior knowledge is incorporated (ego-motion planning).

Prior Repr.      | Approach       |  Collision (% up to 5 s)  |  L2 human (m @ 5 s)  |  Lat. acc. (m/s²)  |  Jerk (m/s³)  |  Progress (m @ 5 s)
–                | –              |           3.33            |        5.52          |        2.77        |      2.56     |        33.11
Reach.           | Reconstruction |           3.58            |        5.54          |        2.80        |      2.56     |        33.10
Centerline Dist  | Mean           |           3.94            |        5.72          |        2.81        |      2.75     |        32.40
Reach. Dist      | Mean           |           3.30            |        5.46          |        2.75        |      2.55     |        32.99
Centerline Dist  | Reparam        |           3.61            |        5.52          |        2.70        |      2.56     |        33.00
Reach. Dist      | Reparam        |           3.25            |        5.49          |        2.70        |      2.52     |        33.04
Reach.           | REINFORCE      |           3.00            |        5.45          |        2.70        |      2.51     |        33.02
Reach. & Route   | REINFORCE      |           2.75            |        5.43          |        2.67        |      2.47     |        33.09
Reachable Lanes Reconstruction
Similar to the road loss proposed in [22], this approach uses an auxiliary convolutional head at the RRoI level that predicts the reachable lanes for each vehicle, represented as a spatial binary mask. For this ablation, we replace our prior loss with a per-pixel cross-entropy reconstruction loss, hoping to make the backbone features more map-aware.
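A minimal sketch of such an auxiliary head is given below; the feature channel count and head architecture are assumptions, not the configuration used in the ablation.

```python
import torch.nn as nn
import torch.nn.functional as F

class ReachableLanesHead(nn.Module):
    """Auxiliary convolutional head predicting a per-actor reachable-lanes mask."""

    def __init__(self, in_channels=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),                       # one logit per pixel
        )

    def forward(self, rroi_features):                  # [N, C, H, W]
        return self.conv(rroi_features)                # [N, 1, H, W] logits

def reconstruction_loss(logits, target_mask):
    """Per-pixel binary cross-entropy against the rasterized reachable lanes."""
    return F.binary_cross_entropy_with_logits(logits, target_mask)
```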
Differentiable relaxations
If the prior knowledge is a differentiable reward function over the support of the distribution, we can directly apply the loss to the model samples and backpropagate the gradients through the Gaussian output distribution using the reparameterization trick [26]. More concretely, we consider two differentiable relaxations of our non-differentiable reachable lanes loss as baselines, shown in Fig. 4:

Distance to centerline: we define the loss at each spatial location as the closest distance to a centerline of any lane in the set of reachable lanes.

Distance to boundary: we define the loss at each spatial location as the closest distance to the boundary of one of the reachable lanes when outside the reachable lanes surface, and zero when inside.
We test the aforementioned relaxed losses by applying them both to the mean of the distribution only, as well as to samples drawn from $p_\theta(y \mid x)$ via the reparameterization trick.
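To illustrate the reparameterized variant, the sketch below applies a precomputed distance transform to samples drawn with the reparameterization trick; the diagonal Gaussian and the differentiable `distance_map_fn` lookup are simplifying assumptions.

```python
import torch

def relaxed_reachable_loss(mu, log_sigma, distance_map_fn, num_samples=10):
    """Differentiable relaxation: penalize the distance of sampled waypoints
    to the reachable-lanes surface via the reparameterization trick.

    mu, log_sigma:   [T, 2] per-waypoint Gaussian parameters (diagonal here
                     for brevity; the model uses full covariances).
    distance_map_fn: differentiable lookup mapping [..., 2] positions to their
                     distance to the reachable lanes (0 inside), e.g. a
                     bilinear interpolation of a precomputed distance
                     transform (hypothetical helper).
    """
    eps = torch.randn(num_samples, *mu.shape)           # [S, T, 2] noise
    samples = mu.unsqueeze(0) + log_sigma.exp().unsqueeze(0) * eps
    # Gradients flow back to mu and log_sigma through the samples.
    return distance_map_fn(samples).mean()
```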
Next, we discuss the motion forecasting results and finally the impact on motion planning.
Motion Forecasting: As shown in Table VI, the implicit feature-learning approach inspired by [22] barely changes the results from the baseline. Applying the relaxed loss only to the mean of the distribution can have conflicting effects with the data likelihood term in the loss, since the latter optimizes the full distribution but the former only the mean. This yields results that are worse than the base model over all metrics. Finally, applying the relaxed loss to the samples via the reparameterization trick achieves the best lane metrics, but harms the meanADE. This is caused by approximating the reachable lanes loss with a continuous function when applying the distance transform. For instance, in curvy lanes or turns, the relaxed loss will push the predictions to the closest point in the reachable lanes, causing the trajectory to shorten (i.e., reduce the speed). Another drawback is that in branching topologies, the prediction will get pushed to the closest branch, which could differ from the ground-truth one.
Ego-Motion Planning: In Table VII, we show that not all approaches to incorporating prior knowledge improve the safety and comfort of ego-motion planning. In particular, the continuous relaxations of the reachable lanes loss do not reduce the number of collisions, despite improving the final lane error as shown in the previous paragraph. We conjecture that there is a fine balance between map understanding, precision, and recall that is adequate for motion planning. The baselines sacrifice too much precision and recall in exchange for map understanding, most likely due to the approximations in the relaxation. Utilizing the REINFORCE gradient estimator to optimize the exact prior translates into much safer plans, particularly when incorporating the SDV route loss, showing that it is important to focus on the actors that can interact with the SDV.
VI. Conclusion and Future Work
In this paper we have proposed a novel framework to explicitly incorporate prior knowledge into probabilistic motion forecasts, while still allowing the model to predict non-compliant behavior when there is evidence for it. Our method is general, and can be applied to any model that can generate trajectory samples and evaluate their marginal likelihood per actor. We have demonstrated the effectiveness of our approach on two challenging real-world datasets, significantly outperforming other state-of-the-art methods both in motion forecasting and in the downstream task of motion planning. Though we have chosen SpAGNN [5]
as the base model here, our method is general. We plan on integrating our approach with more base models, with a particular interest in joint distributions over actors where we can also apply our prior knowledge about interactions, such as the fact that vehicles do not generally collide.
References
 [1] B. Li, T. Zhang, and T. Xia, “Vehicle detection from 3d lidar using fully convolutional network,” arXiv preprint arXiv:1608.07916, 2016.
 [2] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3D object detection from point clouds,” in Proceedings of the IEEE CVPR, 2018.
 [3] S. Shi, Z. Wang, X. Wang, and H. Li, “Part-A^2 net: 3D part-aware and aggregation neural network for object detection from point cloud,” arXiv preprint arXiv:1907.03670, 2019.
 [4] N. Rhinehart, K. M. Kitani, and P. Vernaza, “R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting,” in ECCV, 2018.
 [5] S. Casas, C. Gulino, R. Liao, and R. Urtasun, “Spatially-aware graph neural networks for relational behavior forecasting from sensor data,” arXiv preprint arXiv:1910.08233, 2019.
 [6] H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, T. Nguyen, T.-K. Huang, J. Schneider, and N. Djuric, “Multimodal trajectory predictions for autonomous driving using deep convolutional networks,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 2090–2096.
 [7] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, “Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction,” arXiv preprint arXiv:1910.05449, 2019.

 [8] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
 [9] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuScenes: A multimodal dataset for autonomous driving,” arXiv preprint arXiv:1903.11027, 2019.
 [10] A. Sadat, M. Ren, A. Pokrovsky, Y.-C. Lin, E. Yumer, and R. Urtasun, “Jointly learnable behavior and trajectory planning for self-driving vehicles,” arXiv preprint arXiv:1910.04586, 2019.
 [11] W. Luo, B. Yang, and R. Urtasun, “Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net,” in Proceedings of the IEEE CVPR, 2018.
 [12] S. Casas, W. Luo, and R. Urtasun, “Intentnet: Learning to predict intention from raw sensor data,” in Conference on Robot Learning, 2018.
 [13] W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun, “End-to-end interpretable neural motion planner,” in Proceedings of the IEEE CVPR, 2019.
 [14] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, “Multi-task multi-sensor fusion for 3D object detection,” in Proceedings of the IEEE CVPR, 2019.
 [15] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social LSTM: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE CVPR, 2016.
 [16] A. Jain, S. Casas, R. Liao, Y. Xiong, S. Feng, S. Segal, and R. Urtasun, “Discrete residual flow for probabilistic pedestrian behavior prediction,” 2019.
 [17] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine, “PRECOG: PREdiction Conditioned On Goals in visual multi-agent settings,” arXiv preprint arXiv:1905.01296, May 2019.
 [18] C. Tang and R. R. Salakhutdinov, “Multiple futures prediction,” in Advances in Neural Information Processing Systems, 2019.

 [19] N. Deo and M. M. Trivedi, “Convolutional social pooling for vehicle trajectory prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1468–1476.
 [20] C. M. Bishop, Pattern Recognition and Machine Learning, 2006.
 [21] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al., “Argoverse: 3D tracking and forecasting with rich maps,” 2019.
 [22] M. Bansal, A. Krizhevsky, and A. Ogale, “Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst,” arXiv preprint arXiv:1812.03079, 2018.
 [23] K. S. Refaat, K. Ding, N. Ponomareva, and S. Ross, “Agent prioritization for autonomous navigation,” arXiv preprint arXiv:1909.08792, 2019.
 [24] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, “Arbitrary-oriented scene text detection via rotation proposals,” IEEE Transactions on Multimedia, vol. 20, no. 11, 2018.
 [25] C. M. Bishop, “Mixture density networks,” 1994.
 [26] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” 2013.