In order for autonomous vehicles to successfully navigate a world filled with dynamic actors, they must perform a wide variety of challenging tasks. These include perceiving objects in a scene, forecasting actor motion, and determining an appropriate ego-trajectory to drive both safely and comfortably. While the domains of object detection and motion planning are well represented in the literature, the problem of predicting another actor’s future motion is a newer but rapidly growing area of study that is critically important for autonomous driving.
The most challenging aspect of the prediction problem is that the future is inherently ambiguous. Consider a vehicle approaching an intersection. Although we can use the actor’s current and past motion, along with contextual cues, to get an idea of where the actor will go, the driver could quite easily change their mind about which direction to travel while navigating the intersection. As a result, in order to successfully tackle the problem of long-term motion prediction (over a 5-10 second horizon), we must directly handle this ambiguity by considering multiple possible futures.
One way to achieve this is to predict a distribution over the actor’s future motion. Previous methods have primarily explored two main approaches to this problem. One group of methods has addressed this by considering a fixed set of maneuvers that an actor may pursue (e.g., left, right, straight) and estimating a categorical distribution over these behaviors [2, 9, 20]. This simplified view of the world works nicely in simulated environments, but breaks down almost immediately when driving in real cities, where intersections are complex (see Figure 1). Another group of methods models the actor’s future behavior as a multi-modal distribution over trajectories, where the set of modes is completely unstructured. These approaches have frequently suffered from mode collapse [18, 5].
In this work, we propose a new method for predicting a distribution over the actor’s future behavior that addresses the limitations of previous methods. Our approach relies on having a pre-cached map of the environment that encodes the precise topology of the road network. This topological representation captures information about the spatial relationships between lanes, thereby encoding important semantics on where and how to drive. In many existing approaches, this road topology is either ignored altogether or converted into a 2D bird’s-eye view of the scene, which we posit leads to significant information loss. Prior approaches thus under-utilize the map information and struggle with the basic task of predicting that actors will follow lanes, spurring the need for auxiliary loss functions to prevent actors from being predicted to drive off the road.
Our approach directly uses the mapped road topology to propose a broad set of lanes that the actor may traverse in the future. We then predict the occupancy for a sequence of spatial cells along each lane, which captures the likelihood that the actor will occupy that cell at any point over the targeted prediction horizon. This allows us to predict a distribution over the future occupancy of the actor within the mapped network of lanes. To reflect this, we call our method LaneOccupancyNet. The key contribution of this work is a new method for encoding structure from the map into our model, done in such a way that it (1) naturally captures the multi-modality of the future behavior of lane-following actors, (2) adapts to the local map context and generalizes to unseen road geometries, and (3) allows us to explicitly query the likelihood that a specific actor will occupy any particular lane of interest in the map, without necessarily estimating this likelihood for all lanes. These occupancy predictions can then be used to determine the highest-probability paths with a path-finding algorithm and to generate timed trajectories, or can be consumed directly by the planner to determine the cost map for the actor.
II Related Work
Structured Motion Prediction: Early approaches to motion prediction focused heavily on the use of physics-based models to estimate the current motion states of actors, using a fully specified kinematic model to propagate these states into the future [21, 8, 4]. Another class of structured methods leverages prior knowledge to estimate an actor’s intended destinations, and then uses a planner to generate the actor’s most likely trajectory to reach each goal [15, 7, 16]. In contrast, our method uses a broad set of candidate goals as input to a deep network for the purpose of occupancy prediction, rather than as a prior on strict lane-following behavior. Structured methods have also been applied to lane occupancy prediction, for instance by modeling lane-following maneuvers and actor-actor interactions using explicit policies. Our approach also performs occupancy prediction, but with less explicit structure on motion.
Unstructured Motion Prediction:
More recently, many methods use deep learning models to directly predict future trajectories from inputs that capture the actor’s state plus context from the surrounding scene [12, 2, 6, 5]. In contrast to the highly structured physics-based models, these impose very little structure on the problem and instead assume the model will fully learn the patterns of actor motion from data. While we also use a DNN to extract features from the scene, our approach differs in how the map is utilized with the model: rather than providing the map as input to an unstructured motion prediction system, it is used to directly query occupancy probabilities for relevant spatial locations.
Intention Prediction: In addition to directly predicting motion, inferring an actor’s higher-level intent can help predict its future behavior. Intent can be modeled as a discrete set of actions: several methods directly predict these action categories [11, 19, 2] or produce multimodal trajectory predictions with each mode representing a particular semantic category of behavior [5, 13]. Our work focuses specifically on spatial multimodality, and we use structure from nearby mapped lanes to enumerate maneuvers and predict future occupancy without restricting ourselves to a pre-determined set of action categories.
III Method
In this section, we describe our approach for predicting vehicle occupancy over the lane network. In order to focus on the prediction problem, we assume we already have an object detection and tracking system which operates on sensor data to produce a list of actors in the scene along with their state estimates. We further assume that we have access to a high-resolution map of our surroundings, which encodes road boundaries, lane boundaries, lane connectivity, lane directions, crosswalks, traffic signals, and other details of the scene geometry. In this approach, we consider each actor in turn and generate predictions for the actor of interest in the context of all other actors in the scene, including the self-driving vehicle (SDV) itself. An overview of our approach is shown in Figure 2.
III-A Path Generation
The first step in our approach is to use the map to generate a set of candidate paths that the actor may follow.
We define each path as a region in 2D space that is specified by a sequence of lanes with no branching. Lanes that branch are split into multiple paths that diverge at the branching point. To generate a set of paths for the actor of interest at a given time, we query the map to identify all lanes that fall within 2 meters of the actor’s position. Then, starting from that position, we “roll out” the paths according to the lane successor relationships specified by the map, up to a fixed distance. This yields a collection of candidate paths for the actor, as shown in the middle image of Figure 3 for an actor in the intersection.
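As an illustrative sketch (not the authors' implementation), this roll-out can be phrased as a breadth-first traversal of the lane-successor graph. The `successors` mapping and the lane-count cutoff (standing in for the fixed distance limit) are assumptions for this example:

```python
from collections import deque

def roll_out_paths(start_lane, successors, max_lanes):
    """Enumerate candidate paths by following lane-successor links.

    `successors` maps a lane id to the list of lane ids it flows into.
    Branches split into separate paths; roll-out stops when a lane has
    no successor or after `max_lanes` lanes (a stand-in for the fixed
    distance cutoff described in the text).
    """
    paths = []
    queue = deque([[start_lane]])
    while queue:
        path = queue.popleft()
        nxt = successors.get(path[-1], [])
        if not nxt or len(path) >= max_lanes:
            paths.append(path)          # path ends here
            continue
        for lane in nxt:                # a branch spawns one path per successor
            queue.append(path + [lane])
    return paths

# Example: lane "a" flows into "b", which branches into "c" and "d",
# so the actor near "a" gets two diverging candidate paths.
succ = {"a": ["b"], "b": ["c", "d"]}
print(roll_out_paths("a", succ, max_lanes=3))
```

A real implementation would also start paths from every lane within the 2-meter query radius, not just one.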
Together, the spatial area covered by the union of the paths determines the region over which we predict the occupancy of the actor. One potential drawback of this approach is that we explicitly do not predict an actor’s occupancy over regions of the world that are not covered by mapped lanes. However, it’s important to note that areas outside of mapped lanes are specifically of less interest to us, as we assume the SDV is designed to follow typical rules of the road and drive only within mapped lanes. Thus, predicting the occupancy of actors within the mapped lanes is of much higher importance.
Once we have the set of candidate paths, we discretize each path into a fixed-length sequence of cells, where the number of cells is determined by the cell length. This discretization enables us to capture maneuvers like lane changes, where the actor does not stay in the same lane for the duration of the prediction horizon. We constrain the discretization to be one-dimensional, meaning that each cell covers the full width of the lane. We make this choice based on the observation that the SDV cares much more about which lanes will be occupied than about the precise lateral position of a vehicle within a lane. Finally, given these cells, we label each one according to whether or not the vehicle’s polygon entered that cell at any point over the prediction horizon.
III-B Occupancy Prediction
Given the set of candidate paths and the sequence of cells along each path, we aim to predict whether or not each of these cells will be occupied by the actor. Because we are predicting spatial occupancy rather than spatio-temporal occupancy, the actor can occupy multiple cells over the duration of the prediction horizon. As a result, we want to predict an occupancy probability in the range $[0, 1]$ for each cell rather than a normalized categorical distribution over all cells. To do this, rather than jointly predicting the occupancy for all cells at once, we consider each path independently and predict the occupancy only over the cells in a single path.
Let $o_1, \dots, o_N$ be a sequence of binary random variables, where $o_i$ indicates whether the $i$-th cell in a path was occupied at any point over the next $T$ seconds. We assume that our data consists of independent samples from the joint distribution $p(o_1, \dots, o_N \mid x)$, parameterized by the per-cell occupancy probabilities, where $x$ denotes the inputs to our model. Our goal then is to estimate these probabilities. This approach, in which we process each path separately, has several benefits. First, it allows us to consider an arbitrary number of paths for each actor without relying on truncation or padding to force the paths into a fixed-size output representation. This flexibility in the number of paths enables our method to adapt to the local map context. As an example, an actor driving on a single-lane road far from any intersection will have only a single candidate path. In contrast, an actor approaching the 6-way intersection depicted in Figure 1 may have 10 or more candidate paths. Another benefit of this approach is that it provides a path-centric output representation. This enables our model to generalize very well to unseen road geometries, as long as the map accurately captures the lane topology.
For a given actor and a given path, we assign a binary label to each cell along the path. To determine the label, we use the future ground truth trajectory of the vehicle, truncated at the prediction horizon. If the vehicle’s ground truth polygon touches the cell at any point over the horizon, we label it 1; otherwise we label it 0. When we observe the vehicle for a duration shorter than the prediction horizon, we only know the positive labels for certain (the other cells may or may not be visited in the remaining time). For these cases, we label all cells the ground truth polygon does not touch with a sentinel value of -1. These cells are ignored in the loss function, and therefore are not used for either training or evaluation. Since positive labels are scarce relative to negative labels, this allows us to leverage additional positive samples for training.
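A minimal sketch of how the sentinel labels described above could interact with the per-cell loss, assuming NumPy and treating -1 exactly as stated (excluded from the mean); the function name and shapes are illustrative, not the authors' code:

```python
import numpy as np

def masked_cell_loss(logits, labels):
    """Mean sigmoid cross-entropy over cells, ignoring sentinel labels.

    labels: 1 (occupied), 0 (not occupied), -1 (unknown: the actor was
    not observed long enough, so the cell is excluded from the loss).
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mask = labels >= 0                       # drop sentinel (-1) cells
    p = 1.0 / (1.0 + np.exp(-logits[mask]))  # sigmoid
    y = labels[mask]
    eps = 1e-12                              # numerical safety for log
    ce = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return ce.mean()

# Third cell has an unknown label and contributes nothing to the loss.
print(masked_cell_loss([2.0, -2.0, 0.0], [1, 0, -1]))  # ≈ 0.1269
```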
In this section we provide an overview of our occupancy prediction model, LaneOccupancyNet (LON), including a description of the input representation, the network architecture, and the output representation. The overall model architecture is shown in Figure 4.
III-D1 Input Representation
There are three different inputs we provide to our model to capture different pieces of information that influence an actor’s future occupancy: context from the actor’s neighborhood, information about the actor’s current and past behavior, and information about the candidate goal path.
(a) Scene rasters: The actor’s future behavior is influenced by the geometry of the scene and the positions of other nearby vehicles. To capture this context, we provide a rasterized RGB image capturing a bird’s-eye view of the scene at the current time instant, oriented based on the position and heading of the actor of interest. The rasters have a resolution of 0.2m and capture a 60m x 60m region, with 10m behind the actor and 50m in front of it. These rasters are similar to those used in prior work, with the addition of the candidate path overlaid on the raster in dark green to indicate the path along which we are predicting occupancy.
(b) Actor features: 1D array of hand-crafted features that capture the actor’s current state and past behavior, such as the actor’s speed, its angular velocity, and the variance in its heading over the past 3 seconds.
(c) Path features: 1D array of hand-crafted features that capture additional information about the candidate path. Since the scene raster shows only an early portion of the candidate path, these features help provide information about the entire path, such as its curvature. We also provide information about the actor’s relationship to the path, such as the path-relative position, velocity, acceleration, and heading, along with the history of these values cached from previous cycles.
III-D2 Network Architecture
Our model architecture is inspired by the FastMobileNet architecture proposed in prior work. In particular, we project the 1D concatenated array of actor and path features into the 2D space of the latent scene features. This allows us to directly add the projected features to the scene features and perform additional convolutional operations on top. Following that work, we hypothesize that this form of feature fusion allows the map information at different spatial locations to interact differently with the engineered features. See Figure 4 for details.
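The fusion step described above can be sketched with plain NumPy; the tensor shapes and the random projection matrix are illustrative stand-ins for the learned layers, not the paper's actual dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent scene features from the CNN backbone: (H, W, C).
scene = rng.normal(size=(8, 8, 16))
# Concatenated 1D actor + path features: (D,).
feats = rng.normal(size=(24,))

# Learned projection D -> C (here: a fixed random matrix as a stand-in).
W = rng.normal(size=(24, 16)) * 0.1
projected = feats @ W                       # (C,)

# Broadcast-add the projected vector at every spatial location; the
# convolutions applied afterwards can then mix it differently with the
# map features at each position.
fused = scene + projected[None, None, :]    # (H, W, C)
assert fused.shape == (8, 8, 16)
```

In a real network the projection would be a trainable dense layer and `fused` would feed additional convolutional blocks.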
III-D3 Output Representation
We independently predict the future spatial occupancy of each actor along each of its paths, with the $i$-th element in the output representing the probability of the actor occupying the $i$-th cell along the path at any point within the prediction horizon. We use a sigmoid cross-entropy loss for each cell and compute the mean loss over all cells.
III-E Implementation Details
We train our model for 50,000 iterations, as we observe that the validation loss stabilizes by then. We decay the learning rate by a factor of 0.9 every 11,000 steps. Training takes around 12 hours using distributed training on 4 GPUs. In practice, we use a cell length of 4.8 meters, the length of an average car, and 40 cells per path, for a path length of 192 meters (this allows us to capture fast-moving actors). We experiment with prediction horizons of 3, 6, and 9 seconds, but primarily report results with the 9-second horizon.
IV Experimental Results
For training and evaluation, we use the large-scale ATG4D dataset, which contains a variety of interesting scenarios and diverse driving behaviors from multiple cities across North America. The training and validation sets consist of 30-second scenarios (5,000 in the training set). To train our model, we randomly sample 5% of the training set. We report results on the validation set, which contains 127,669 actor frames with at least 9s of observed future. For all experiments, we measure performance only on moving vehicles (those with estimated speed of at least 0.5 m/s) that are within 50m of the SDV and are observed for at least the full duration of the prediction horizon in the future.
IV-B Spatial Occupancy Metrics
We evaluate our system performance by comparing an actor’s true spatial location against the spatial occupancies predicted by different methods, using an average likelihood metric. We compare against two baselines: (1) unimodal trajectories generated by an unscented Kalman filter (UKF) that forward-propagates actor states from a second-order tracking system; (2) trimodal trajectories generated by an unstructured deep network, referred to here as Multiple Trajectory Prediction (MTP). The MTP model was trained on the same data as our proposed method. In both baseline methods, each predicted trajectory consists of a sequence of waypoints over time, position covariances surrounding each waypoint, and per-trajectory probabilities (relevant in the multimodal case). In contrast, our method predicts spatial occupancy up to a future time horizon.
In order to compare these two different representations, we first convert the output of all methods into a common representation that consists of 2D spatial occupancy predictions over a grid that is centered at the actor’s current position (we use a 150m x 150m grid with 1m resolution). Ground truth labels are generated by determining which 2D cells an actor occupied at any point over the prediction horizon. Figure 5 shows an example of the ground truth occupancy mask and predicted occupancy likelihoods.
Given a ground truth 2D occupancy grid and a corresponding predicted likelihood grid, we compute the overall average likelihood as

$$\ell = \frac{1}{K} \sum_{k=1}^{K} \left[ y_k \hat{p}_k + (1 - y_k)(1 - \hat{p}_k) \right].$$
Note that $y_k$ is the ground truth label of the $k$-th cell in 2D space (as opposed to path space), $\hat{p}_k$ is the predicted occupancy likelihood for that cell, and $K$ is the grid size. We also compute two additional metrics: the positive likelihood, calculated only on 2D cells with label $y_k = 1$, and the negative likelihood, calculated only on 2D cells with label $y_k = 0$.
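These three metrics can be computed in a few lines. This sketch assumes the per-cell likelihood is the predicted probability for occupied cells and its complement for empty cells, matching the definitions above; the function name is illustrative:

```python
import numpy as np

def likelihood_metrics(labels, probs):
    """Overall / positive / negative average likelihood over a 2D grid.

    For an occupied cell (label 1) the likelihood is the predicted
    probability p; for an empty cell (label 0) it is 1 - p.
    """
    y = np.asarray(labels, dtype=float).ravel()
    p = np.asarray(probs, dtype=float).ravel()
    lik = y * p + (1 - y) * (1 - p)
    return lik.mean(), lik[y == 1].mean(), lik[y == 0].mean()

overall, pos, neg = likelihood_metrics([1, 1, 0, 0], [0.9, 0.5, 0.2, 0.1])
print(overall, pos, neg)  # 0.775 0.7 0.85
```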
Next we describe how to convert the two output representations into the common 2D spatial representation.
IV-B1 2D Spatial Occupancy from Trajectories
A multimodal trajectory prediction is defined as a time-varying mixture of Gaussian distributions. Each trajectory in the mixture of $M$ components is given by $\{(\mu_t^m, \Sigma_t^m)\}_{t=1}^{T}$, where $\mu_t^m$ is the mean 2D position of the actor at time $t$ and $\Sigma_t^m$ is the corresponding position covariance. Each trajectory also has an associated mode probability $\pi^m$. In the unimodal case, such as the UKF, $M = 1$ and $\pi^1 = 1$.
To convert a spatio-temporal trajectory into a spatial occupancy likelihood, for each 2D location we determine the probability that the actor occupies this location at any time over the prediction horizon. To approximate this, we utilize a Monte Carlo sampling technique. We first generate $S$ samples by repeating the following procedure:
Sample a mode $m$ from the distribution over the mixture components $\pi^1, \dots, \pi^M$.
Sample a trajectory as follows. First, sample $z$ from a 2D standard Gaussian distribution $\mathcal{N}(0, I)$. Then, for each time point $t$, convert $z$ into a sample from $\mathcal{N}(\mu_t^m, \Sigma_t^m)$. To do this, we compute $L_t$, the Cholesky decomposition of $\Sigma_t^m$ (so that $L_t L_t^\top = \Sigma_t^m$), and then calculate $x_t = \mu_t^m + L_t z$. Repeat this for all time points to obtain a sequence of sampled positions $(x_1, \dots, x_T)$. Reusing the same $z$ at every time point allows us to sample a coherent trajectory from the sequence of Gaussians.
Define $V_s$ as the swept volume produced by the actor’s polygon moving along the $s$-th sampled trajectory.
Finally, to estimate the predicted occupancy likelihood at cell $k$ in the 2D grid, we check the cell against the swept volumes from each of the $S$ sampled trajectories and calculate the frequency with which the cell is occupied:

$$\hat{p}_k = \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\left[\text{cell } k \text{ intersects } V_s\right].$$
In our experiments, we use a fixed number of samples $S$.
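The coherent-sampling step of the procedure above can be sketched as follows, reusing one standard-normal draw across all time steps; the variable names and the toy means/covariances are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_trajectory(means, covs):
    """Draw one coherent trajectory from a sequence of 2D Gaussians.

    A single standard-normal draw z is reused at every time step, so the
    sampled waypoints shift together rather than jittering independently.
    """
    z = rng.standard_normal(2)              # one shared 2D sample
    traj = []
    for mu, sigma in zip(means, covs):
        L = np.linalg.cholesky(sigma)       # sigma = L @ L.T
        traj.append(mu + L @ z)
    return np.array(traj)

# Toy example: three time steps with growing positional uncertainty.
means = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([2.0, 0.0])]
covs = [np.eye(2) * s for s in (0.1, 0.2, 0.3)]
print(sample_trajectory(means, covs).shape)  # (3, 2)
```

Repeating this draw, sweeping the actor polygon along each sample, and counting cell hits gives the Monte Carlo occupancy estimate.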
IV-B2 2D Spatial Occupancy from Path-Based Occupancy
Since LaneOccupancyNet directly predicts the spatial occupancy probability for cells along each path, we directly use the polygon shapes of the cells to map these probabilities into the 2D likelihood grid. Note that the set of paths we consider will likely have some spatial overlap (e.g., if a lane branches into 3 successor lanes, we create 3 separate paths that partially overlap). To handle this, we perform a post-processing step that estimates the final occupancy of each unique cell by averaging the estimates from each path that contains that cell.
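A sketch of this overlap-averaging post-processing step, under the simplifying assumption that each path cell comes with a boolean footprint mask on the common grid (rather than an arbitrary polygon):

```python
import numpy as np

def paths_to_grid(cell_probs, cell_masks, grid_shape):
    """Average per-path cell probabilities into a common 2D grid.

    cell_probs: one predicted occupancy probability per cell.
    cell_masks: one boolean grid per cell marking its footprint.
    Overlapping estimates (e.g. the shared prefix of branching paths)
    are averaged; untouched grid cells stay at 0.
    """
    total = np.zeros(grid_shape)
    count = np.zeros(grid_shape)
    for p, mask in zip(cell_probs, cell_masks):
        total[mask] += p
        count[mask] += 1
    out = np.zeros(grid_shape)
    hit = count > 0
    out[hit] = total[hit] / count[hit]
    return out

# Two overlapping cells on a 2x2 grid: the shared corner averages
# 0.8 and 0.4 to 0.6, the non-overlapping cell keeps 0.8.
m1 = np.array([[True, True], [False, False]])
m2 = np.array([[True, False], [False, False]])
grid = paths_to_grid([0.8, 0.4], [m1, m2], (2, 2))
```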
IV-C Multimodality Estimation
Since actors choose among multiple possible future actions, and we are interested in prediction methods that capture these different possibilities, we also developed a measure to evaluate the spatial multimodality of each prediction method.
Our measure calculates the number of unique spatial modes an actor may follow according to its predicted occupancy. Modes are defined as spatially distinct paths from the actor’s current location out to a new location some distance away. To count the modes, we trace 1D likelihoods along concentric rings at varying ranges from the actor’s current location, and count the number of observed peaks in the likelihood along these rings.
Figure 6 provides a visual description of this general approach. Once 1D likelihoods are obtained for a ring at a given distance from the actor, the estimated number of modes is determined by counting the number of peaks in each curve. A peak is defined as a local maximum that rises at least a threshold $\delta$ above the neighboring local minima. In our experiments, we focus on the 180-degree arcs in the forward direction of the actor of interest, using its estimated heading.
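The peak-counting rule can be sketched as below. This is a simplified stand-in for the authors' procedure: it works on a plain 1D array (one arc) and tests the rise over the nearest local minimum on each side of a candidate peak:

```python
def count_peaks(values, delta):
    """Count local maxima that rise at least `delta` above the
    neighboring local minima on both sides."""
    def next_local_min(i, step):
        # Walk from index i in the given direction until the curve
        # stops descending; return the value at that local minimum.
        j = i
        while 0 <= j + step < len(values) and values[j + step] <= values[j]:
            j += step
        return values[j]

    peaks = 0
    for i in range(1, len(values) - 1):
        if values[i] > values[i - 1] and values[i] >= values[i + 1]:
            if (values[i] - next_local_min(i, -1) >= delta and
                    values[i] - next_local_min(i, +1) >= delta):
                peaks += 1
    return peaks

# Two bumps separated by a deep valley -> two spatial modes.
print(count_peaks([0.0, 0.8, 0.1, 0.9, 0.0], delta=0.5))  # 2
```

A shallow dip between two bumps (smaller than `delta`) would merge them into a single counted mode.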
IV-D Quantitative Results
| Method | Overall likelihood | Positive likelihood | Negative likelihood |
| --- | --- | --- | --- |
| Unscented Kalman Filter | 0.9873 | 0.3709 | 0.9911 |
| Multiple Trajectory Prediction | 0.9955 | 0.6577 | 0.9976 |
Table I shows comparative results on the three likelihood measures. While the overall likelihood and negative likelihood both degrade with our method, we observe a 13.8% improvement in positive likelihood. This suggests that our approach does a better job of estimating all of the possible places an actor may end up; the positive likelihood is effectively a measure of recall for prediction methods. We explore this further by plotting the positive likelihood at different future time horizons, measuring actor occupancy at individual points in time and comparing against our predictions. As seen in Figure 7(a), our method consistently outperforms both UKF and MTP on this metric.
Using the approach described earlier, we evaluate the multimodality of all three methods over a set of ranges and plot the results in Figure 7(b). These results demonstrate that we observe a greater degree of multimodality in the predictions from LaneOccupancyNet than in those from the UKF and MTP. Given that a key advantage of our approach is the ability to adaptively add modes to the predicted occupancy distribution as additional lanes appear in the scene, this analysis supports the argument that we are better able to handle complex scenes where actors can choose among many different paths.
IV-E Qualitative Results
To further understand our results, we examine the cases with the best and worst likelihood deltas compared to the MTP baseline. Figure 8 shows examples from the results where our method performed better and worse than the baseline. The top left example shows that MTP rigidly predicts left, right, and straight trajectories relative to the actor’s position even when the lane topology doesn’t warrant it. The right panel illustrates some cases where we achieve a lower likelihood than MTP, such as when actors drive in unmapped areas. Interestingly, the bottom right example shows a case where we get penalized in the overall likelihood and negative likelihood metrics for having significant but appropriate multi-modality in our predictions, whereas MTP only predicts one reasonable mode.
We also highlight our performance on a few specific cases of interest in which actors move between different lane paths. Figure 9 shows examples where our method correctly assigns high likelihood to adjacent lanes in order to capture an actor going around another actor (left) and a lane change (right).
V Conclusion
We present LaneOccupancyNet, a model that incorporates map structure in a novel way to predict the future occupancy of lane-following vehicles. Through quantitative and qualitative results, we demonstrate that our method does a better job than the two baselines of capturing the full distribution over possible future behaviors of an actor. In autonomous driving, the ability to generate multimodal predictions with high recall is critical for safe operation of SDVs.
By predicting discretized occupancies along a lane path, we find a middle ground between unstructured trajectory predictions and strict lane-following predictions, as demonstrated by our model’s ability to predict complex lane-changing maneuvers while still generalizing well to unusual road topologies. Although our approach is sensitive to map coverage, in that we are only able to predict occupancy for mapped lanes, we assume an SDV must drive strictly within the mapped road network; thus our method still predicts occupancy in the regions most important to the SDV. We also show, through qualitative examples, that it can capture a full range of behaviors, including those where actors move between lanes rather than simply driving along a single lane.
- (2018) ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. CoRR abs/1812.03079.
- (2018) IntentNet: learning to predict intention from raw sensor data. In Proceedings of The 2nd Conference on Robot Learning, Proceedings of Machine Learning Research, Vol. 87, pp. 947–956.
- (2019) Predicting motion of vulnerable road users using high-definition maps and efficient convnets. CoRR abs/1906.08469.
- (2017) Towards full automated drive in urban environments: a demonstration in GoMentum Station, California. In 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 1811–1818.
- (2018) Multimodal trajectory predictions for autonomous driving using deep convolutional networks. CoRR abs/1809.10732.
- (2018) Motion prediction of traffic actors for autonomous driving using deep convolutional networks. arXiv preprint arXiv:1808.05819.
- (2013) Vehicle trajectory prediction based on motion model and maneuver recognition. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4363–4369.
- (1960) A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82 (1), pp. 35–45.
- (2017) Prediction of driver’s intention of lane change by augmenting sensor information using machine learning techniques. Sensors 17 (6), pp. 1350.
- (2017) Interaction-aware occupancy prediction of road vehicles. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pp. 1–8.
- (2013) Learning-based approach for online lane change intention prediction. In 2013 IEEE Intelligent Vehicles Symposium (IV), pp. 797–802.
- (2018) Fast and furious: real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3569–3577.
- (2019) Overcoming limitations of mixture density networks: a sampling and fitting framework for multimodal future prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7144–7153.
- (2019) LaserNet: an efficient probabilistic 3D object detector for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12677–12686.
- (2013) Map-based long term motion prediction for vehicles in traffic environments. In 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), pp. 2166–2172.
- (2015) Goal-directed pedestrian prediction. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 50–58.
- (2019) Human motion trajectory prediction: a survey. CoRR abs/1905.06113.
- (2017) Learning in an uncertain world: representing ambiguity through multiple hypotheses. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3591–3600.
- (2014) Prediction of driver intended path at intersections. In 2014 IEEE Intelligent Vehicles Symposium Proceedings, pp. 134–139.
- (2015) A Hidden Markov Model based driver intention prediction system. In 2015 IEEE International Conference on Cyber Technology in Automation, Control, and Intelligent Systems (CYBER), pp. 115–120.
- (2000) The unscented Kalman filter for nonlinear estimation. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No. 00EX373), pp. 153–158.