Goal-Directed Occupancy Prediction for Lane-Following Actors

09/06/2020 ∙ by Poornima Kaniarasu, et al. ∙ Uber

Predicting the possible future behaviors of vehicles that drive on shared roads is a crucial task for safe autonomous driving. Many existing approaches to this problem strive to distill all possible vehicle behaviors into a simplified set of high-level actions. However, these action categories do not suffice to describe the full range of maneuvers possible in the complex road networks we encounter in the real world. To combat this deficiency, we propose a new method that leverages the mapped road topology to reason over possible goals and predict the future spatial occupancy of dynamic road actors. We show that our approach is able to accurately predict future occupancy that remains consistent with the mapped lane geometry and naturally captures multi-modality based on the local scene context while also not suffering from the mode collapse problem observed in prior work.




I Introduction

In order for autonomous vehicles to successfully navigate a world filled with dynamic actors, they must perform a wide variety of challenging tasks. These include perceiving objects in a scene, forecasting actor motion, and determining an appropriate ego-trajectory to drive both safely and comfortably. While the domains of object detection and motion planning are well represented in the literature, the problem of predicting another actor’s future motion is a newer but rapidly growing area of study that is critically important for autonomous driving [17].

The most challenging aspect of the prediction problem is that the future is inherently ambiguous. Consider a vehicle approaching an intersection. Although we can use the actor’s current and past motion, along with contextual cues, to get an idea of where the actor will go, the driver could quite easily change their mind about which direction to travel while navigating the intersection. As a result, in order to successfully tackle the problem of long-term motion prediction (over a 5-10 second horizon), we must directly handle this ambiguity by considering multiple possible futures.

One way to achieve this is to predict a distribution over the actor’s future motion. Previous methods have primarily explored two approaches to this problem. One group of methods considers a fixed set of maneuvers that an actor may pursue (e.g., left, right, straight) and estimates a categorical distribution over these behaviors [2, 9, 20]. This simplified view of the world works nicely in simulated environments, but breaks down almost immediately when driving in real cities, where intersections are complex (see Figure 1). Another group of methods models the actor’s future behavior as a multi-modal distribution over trajectories, where the set of modes is completely unstructured. These approaches have frequently suffered from the problem of mode collapse [18, 5].

Fig. 1: (a) Existing multi-modal prediction models assume that actor behavior can be summarized by a set of high-level maneuvers such as left/right/straight, as illustrated on the left. (b) However, many real-world intersections are much more complex than this, such as the 6-way intersection in Washington, DC shown on the right (source: Google Maps).

In this work, we propose a new method for predicting a distribution over the actor’s future behavior that addresses the limitations of previous methods. Our approach relies on having a pre-cached map of the environment that encodes the precise topology of the road network. This topological representation captures information about the spatial relationships between lanes, thereby encoding important semantics on where and how to drive. In many existing approaches, this road topology is either ignored altogether or converted into a 2D bird’s-eye view of the scene, which we posit leads to significant information loss. Prior approaches thus under-utilize the map information and struggle with the basic task of predicting that actors will follow lanes, spurring the need for auxiliary loss functions to prevent actors from being predicted to drive off the road [1].


Our approach directly uses the mapped road topology to propose a broad set of lanes that the actor may traverse in the future. We then predict the occupancy for a sequence of spatial cells along each lane, which captures the likelihood that the actor will occupy that cell at any point over the targeted prediction horizon. This allows us to predict a distribution over the future occupancy of the actor within the mapped network of lanes. To reflect this, we call our method LaneOccupancyNet. The key contribution of this work is a new method for encoding structure from the map into our model, which is done in such a way that it (1) naturally captures the multi-modality of the future behavior of lane-following actors, (2) adapts to the local map context and generalizes to unseen road geometries, and (3) allows us to explicitly query the likelihood that a specific actor will occupy any particular lane of interest in the map, without necessarily estimating this likelihood for all lanes. These occupancy predictions can then be used to determine the highest-probability paths with a path-finding algorithm and to generate timed trajectories, or they can be consumed directly by the planner to determine the cost map for the actor.

II Related Work

Structured Motion Prediction: Early approaches to motion prediction focused heavily on the use of physics-based models to estimate the current motion states of actors, using a fully specified kinematic model to propagate these states into the future [21, 8, 4]. Another class of structured methods leverages prior knowledge to estimate an actor’s intended destinations, and then uses a planner to generate the actor’s most likely trajectory to reach each goal [15, 7, 16]. In contrast, our method uses a broad set of candidate goals as input to a deep network for the purpose of occupancy prediction, rather than as a prior on strict lane-following behavior. Structured methods have also been previously applied to the problem of lane occupancy prediction, for instance predicting occupancy by modeling lane-following maneuvers and actor-actor interactions using explicit policies [10]. Our approach also performs occupancy prediction, but with less explicit structure on motion.

Unstructured Motion Prediction: More recently, many methods use deep learning models to directly predict future trajectories from inputs that capture the actor’s state plus context from the surrounding scene [12, 2, 6, 5]. In contrast to the highly structured physics-based models, these impose very little structure on the problem and instead assume the model will fully learn the patterns of actor motion from data. While we also use a DNN to extract features from the scene, our approach differs in how the map is utilized with the model: rather than providing the map as input to an unstructured motion prediction system, it is used to directly query occupancy probabilities for relevant spatial locations.

Intention Prediction: In addition to directly predicting motion, inferring an actor’s higher-level intent can help predict its future behavior. Intent can be modeled as a discrete set of actions, and several methods either directly predict these action categories [11, 19, 2] or produce multimodal trajectory predictions with each mode representing a particular semantic category of behavior [5, 13]. Our work focuses specifically on spatial multimodality, and we use structure from nearby mapped lanes to enumerate maneuvers and predict future occupancy without restricting ourselves to a pre-determined set of action categories.

III Approach

In this section, we describe our approach for predicting vehicle occupancy over the lane network. In order to focus on the prediction problem, we assume we already have an object detection and tracking system which operates on sensor data to produce a list of actors in the scene along with their state estimates. We further assume that we have access to a high-resolution map of our surroundings, which encodes road boundaries, lane boundaries, lane connectivity, lane directions, crosswalks, traffic signals, and other details of the scene geometry. In this approach, we consider each actor in turn and generate predictions for the actor of interest in the context of all other actors in the scene, including the self-driving vehicle (SDV) itself. An overview of our approach is shown in Figure 2.

Fig. 2: An overview of our approach. Top: For a given traffic scene, we identify an actor of interest, generate a set of candidate lane paths for the actor, discretize the path into cells, and label each cell according to the actor’s ground truth occupancy. Bottom: We process each path separately and apply our LaneOccupancyNet to predict the spatial occupancy probabilities along the path.

III-A Path Generation

The first step in our approach is to use the map to generate a set of candidate paths that the actor may follow.

We define each path as a region in 2D space that is specified by a sequence of lanes with no branching. Lanes that branch are split into multiple paths that diverge at the branching point. To generate a set of paths for the actor of interest at a given time, we query the map to identify all lanes that fall within 2 meters of the actor’s position. Then, starting from that position, we “roll out” the paths according to the lane successor relationships specified by the map, up to a fixed distance. This yields a collection of candidate paths for the actor, as shown in the middle image in Figure 3 for an actor in the intersection.
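The roll-out step can be sketched as a depth-first traversal of the lane successor graph. This is a minimal illustration rather than the authors' implementation: the function name `roll_out_paths`, the dictionary-based map, and the toy lane IDs are all hypothetical, and distance is measured from the start of the first lane instead of from the actor's exact position within it.

```python
from typing import Dict, List


def roll_out_paths(start_lane: str,
                   successors: Dict[str, List[str]],
                   lane_length_m: Dict[str, float],
                   max_distance_m: float) -> List[List[str]]:
    """Enumerate non-branching lane sequences that start at start_lane.

    Every branch point in the successor graph forks the current path. A path
    ends once its cumulative length reaches max_distance_m or the last lane
    has no unvisited successors.
    """
    paths: List[List[str]] = []

    def extend(path: List[str], travelled_m: float) -> None:
        lane = path[-1]
        travelled_m += lane_length_m[lane]
        # Skip lanes already on the path so that map loops cannot recurse forever.
        next_lanes = [s for s in successors.get(lane, []) if s not in path]
        if travelled_m >= max_distance_m or not next_lanes:
            paths.append(path)
            return
        for nxt in next_lanes:
            extend(path + [nxt], travelled_m)

    extend([start_lane], 0.0)
    return paths


# Toy map (hypothetical): lane "a" branches into "b" (straight) and "c" (turn).
successors = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
lengths = {"a": 50.0, "b": 80.0, "c": 60.0, "d": 100.0}
paths = roll_out_paths("a", successors, lengths, max_distance_m=192.0)
print(paths)
```

In the setting described above, the same routine would be called once for every lane found within 2 meters of the actor, and the resulting paths pooled.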

Together, the spatial area covered by the union of the paths determines the region over which we predict the occupancy of the actor. One potential drawback of this approach is that we explicitly do not predict other actors’ occupancy over regions of the world that are not covered by mapped lanes. However, it is important to note that areas outside of mapped lanes are of less interest to us, as we assume the SDV is designed to follow typical rules of the road and drive only within mapped lanes. Thus, predicting the occupancy of actors within the mapped lanes is of much higher importance.

Once we have the set of candidate paths, we discretize each path into a fixed-length sequence of cells, where the number of cells is determined by the cell length. This discretization enables us to capture maneuvers like lane changes, where the actor does not stay in the same lane for the duration of the prediction horizon. We constrain the discretization to be one-dimensional, meaning that each cell covers the full width of the lane. We make this choice based on the observation that the SDV cares much more about which lanes will be occupied than about the precise lateral position of a vehicle within the lane. Finally, given these cells, we label each one according to whether or not the vehicle’s polygon entered that cell at any point over the prediction horizon.

Fig. 3: Left: A scene with the actor of interest shown in dark blue. Middle: The set of candidate lane paths generated for the actor of interest. Right: The discretization of each path into fixed-length cells.

III-B Occupancy Prediction

Given the set of candidate paths and the sequence of cells along each path, we aim to predict whether or not each of these cells will be occupied by the actor. Because we are predicting spatial occupancy rather than spatio-temporal occupancy, the actor can occupy multiple cells over the duration of the prediction horizon. As a result, we want to predict an occupancy probability in the range $[0, 1]$ for each cell rather than a normalized categorical distribution over all cells. To do this, rather than jointly predicting the occupancy for all cells at once, we consider each path independently and predict the occupancy only over the cells in a single path.

Specifically, let $o_1, \dots, o_M$ be a sequence of binary random variables, where $o_i$ indicates whether the $i$-th cell in a path was occupied at any point over the next $H$ seconds. We assume that our data consists of independent samples from the joint distribution $P(o_1, \dots, o_M)$, parameterized by the per-cell occupancy probabilities $p_1, \dots, p_M$. Our goal then is to estimate $p_i = P(o_i = 1 \mid x)$, where $x$ are the inputs to our model. This approach, in which we process each path separately, has several benefits. First, it allows us to consider an arbitrary number of paths for each actor without relying on truncation or padding to force the paths into a fixed-size output representation. This flexibility in the number of paths enables our method to adapt to the local map context. As an example, an actor driving on a single-lane road far from any intersection will only have a single candidate path. In contrast, an actor approaching the 6-way intersection depicted in Figure 1 may have 10 or more candidate paths. Another benefit of this approach is that it provides a path-centric output representation. This enables our model to generalize very well to unseen road geometries, as long as the map accurately captures the lane topology.

III-C Labels

For a given actor and a given path, we assign a binary label to each cell along the path. To determine the label, we use the future ground truth trajectory of the vehicle truncated at the prediction horizon. If the vehicle’s ground truth polygon touches the cell at any point over the horizon, we label it 1, and otherwise we label it 0. When we observe the vehicle for a duration shorter than the prediction horizon, we only know the positive labels for certain (the other cells may or may not be visited in the remaining time). For these cases, we label all cells the ground truth polygon does not touch with a sentinel value of -1. These cells are ignored in the loss function, and therefore are not used for either training or evaluation. Since positive labels are scarce relative to negative labels, this allows us to leverage additional positive samples for training.
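The labeling rule can be sketched in one dimension as follows. This is an illustrative simplification, not the authors' code: it assumes the polygon-to-path projection has already been done upstream, so that the vehicle's footprint is available as arc-length intervals along the path. The function name `label_cells` and its arguments are hypothetical.

```python
import numpy as np


def label_cells(touched_intervals_m, num_cells, cell_length_m,
                observed_s, horizon_s):
    """Return per-cell labels: 1 = occupied, 0 = free, -1 = unknown (ignored).

    touched_intervals_m: iterable of (start_m, end_m) arc-length intervals
    covered by the vehicle footprint over the observed future (assumed input).
    """
    # Untouched cells are true negatives only if the full horizon was observed.
    fill = 0 if observed_s >= horizon_s else -1
    labels = np.full(num_cells, fill, dtype=np.int64)
    for start_m, end_m in touched_intervals_m:
        first = max(int(np.floor(start_m / cell_length_m)), 0)
        last = min(int(np.floor(end_m / cell_length_m)), num_cells - 1)
        if first <= last:
            labels[first:last + 1] = 1
    return labels


# Footprint covers the first 12 m of a path made of 4.8 m cells.
labels_full = label_cells([(0.0, 12.0)], num_cells=5, cell_length_m=4.8,
                          observed_s=9.0, horizon_s=9.0)
labels_partial = label_cells([(0.0, 12.0)], num_cells=5, cell_length_m=4.8,
                             observed_s=4.0, horizon_s=9.0)
print(labels_full, labels_partial)
```

With a fully observed future the untouched cells become negatives; with a truncated observation they receive the sentinel and only the positives are kept.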

III-D Model

In this section we provide an overview of our occupancy prediction model, LaneOccupancyNet (LON), including a description of the input representation, the network architecture that is inspired by [3], and the output representation. The overall model architecture is shown in Figure 4.

III-D1 Input Representation

There are three different inputs we provide to our model to capture different pieces of information that influence an actor’s future occupancy: context from the actor’s neighborhood, information about the actor’s current and past behavior, and information about the candidate goal path.

(a) Scene rasters: The actor’s future behavior is influenced by the geometry of the scene and the positions of other nearby vehicles. To capture this context, we provide a rasterized RGB image capturing a bird’s-eye view representation of the scene at the current time instance, oriented based on the position and heading of the actor of interest. The rasters have a resolution of 0.2m and capture a 60m x 60m region, with 10m behind the actor and 50m in front of it. These rasters are similar to the ones used in [5], with the addition of the candidate path overlaid on the raster in dark green to capture the path along which we are predicting occupancy.

(b) Actor features: 1D array of hand-crafted features that capture the actor’s current state and past behavior, such as the actor’s speed, its angular velocity, and the variance in its heading for the past 3 seconds.

(c) Path features: 1D array of hand-crafted features that capture additional information about the candidate path. Since the scene raster shows only an early portion of the candidate path, these features help provide information about the entire path, such as its curvature. We also provide information about the actor’s relationship to the path, such as the path-relative position, velocity, acceleration, and heading, along with the history of these values cached from previous cycles.

III-D2 Network Architecture

Fig. 4: The architecture of LaneOccupancyNet, which predicts 1-D occupancy along a path. We extract scene features from the BEV raster by passing the image through a convolutional network. We then fuse these with the 1D actor and path features by projecting them into the 2D space of the scene features using ideas from [3]. The projection operation uses a sequence of a fully connected layer, a reshape, and a 1x1 convolutional layer. Lastly, we have a second convolutional block and a fully connected block with 2 hidden layers of sizes 2048 and 1024, followed by the output layer.

Our model architecture is inspired by the FastMobileNet architecture proposed in [3]. In particular, we project the 1D concatenated array of actor and path features into the 2D space of the latent scene features. This allows us to directly add the projected features to the scene features and perform additional convolutional operations on top. As pointed out in [3], we hypothesize that this form of feature fusion allows the map information at different spatial locations to interact differently with the engineered features. See Figure 4 for details.
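The projection-and-add fusion described here (fully connected layer, reshape, 1x1 convolution, then element-wise addition to the scene features) can be sketched in NumPy. All tensor sizes, weight names, and the function `project_and_fuse` are hypothetical, activations are omitted, and the weights are random, so this only illustrates the shape bookkeeping, not the trained model.

```python
import numpy as np


def project_and_fuse(scene_feats, flat_feats, w_fc, b_fc, w_1x1, b_1x1):
    """Project a 1D feature vector into the 2D scene-feature space and add it.

    scene_feats: (C, H, W) latent scene features from the convolutional trunk.
    flat_feats:  (F,) concatenated actor and path features.
    w_fc, b_fc:  fully connected weights, shapes (F, C_mid*H*W) and (C_mid*H*W,).
    w_1x1, b_1x1: 1x1 convolution weights, shapes (C, C_mid) and (C,).
    """
    c_mid = w_1x1.shape[1]
    _, height, width = scene_feats.shape
    hidden = flat_feats @ w_fc + b_fc                # fully connected layer
    grid = hidden.reshape(c_mid, height, width)      # reshape into a 2D map
    # A 1x1 convolution is a per-pixel linear map across channels.
    projected = np.einsum("oc,chw->ohw", w_1x1, grid) + b_1x1[:, None, None]
    return scene_feats + projected                   # element-wise fusion


rng = np.random.default_rng(0)
C, C_MID, H, W, F = 8, 2, 4, 4, 6                    # illustrative sizes only
scene = rng.normal(size=(C, H, W))
flat = rng.normal(size=F)
w_fc = rng.normal(size=(F, C_MID * H * W))
b_fc = np.zeros(C_MID * H * W)
w_1x1 = rng.normal(size=(C, C_MID))
b_1x1 = np.zeros(C)
fused = project_and_fuse(scene, flat, w_fc, b_fc, w_1x1, b_1x1)
print(fused.shape)
```

Because the projected features vary by spatial location, subsequent convolutions can combine the engineered features with the map differently at each pixel, which is the intuition given for this form of fusion.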

III-D3 Output Representation

We independently predict the future spatial occupancy of each actor along each of its paths, with the $i$-th element in the output representing the probability of the actor occupying the $i$-th cell along the path for any duration within the entire prediction horizon. We use the sigmoid cross entropy loss for each cell and compute the mean loss over all cells.
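A minimal sketch of a per-cell sigmoid cross entropy that skips cells carrying the -1 sentinel label is shown below. The function name `masked_sigmoid_cross_entropy` is hypothetical, and averaging over the valid cells only is an assumption about how the ignored cells are handled.

```python
import numpy as np


def masked_sigmoid_cross_entropy(logits, labels):
    """Mean sigmoid cross entropy over cells whose label is 0 or 1.

    Cells labeled -1 (unknown occupancy) are excluded from the mean.
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    valid = labels >= 0
    if not valid.any():
        return 0.0
    z = logits[valid]
    y = labels[valid].astype(np.float64)
    # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))].
    per_cell = np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return float(per_cell.mean())


loss = masked_sigmoid_cross_entropy(logits=[0.0, 0.0, 0.0], labels=[1, 0, -1])
print(loss)
```

With all logits at zero, every valid cell is predicted at probability 0.5, so the loss equals log 2 regardless of the ignored third cell.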

III-E Implementation Details

We train our model for 50,000 iterations as we observe the validation loss stabilizes by then. We set the learning rate to with a decay of 0.9 every 11000 steps. It takes around 12 hours to train the model using distributed training on 4 GPUs. In practice, we use a cell length of 4.8 meters, since this is the length of the average car, and use 40 cells per path, for a path length of 192 meters (this allows us to capture fast-moving actors). We experiment with prediction horizons of 3, 6, and 9 seconds, but primarily report results with the 9-second horizon.

IV Experiment Results

IV-A Dataset

For training and evaluation, we use the large-scale ATG4D dataset described in [14], which contains a variety of interesting scenarios and diverse driving behaviors from multiple cities across North America. The training and validation sets contain 5,000 and 30-second scenarios, respectively. To train our model, we randomly sample 5% of the training set. We report results on the validation set, which contains 127,669 actor frames with at least 9s of observed future. For all experiments, we measure performance only on moving vehicles (those with estimated speed above 0.5 m/s) that are within 50m of the SDV and are observed for at least the full duration of the prediction horizon in the future.

IV-B Spatial Occupancy Metrics

We evaluate our system performance by comparing an actor’s true spatial location against the spatial occupancies predicted by different methods using an average likelihood metric. We compare against two baselines: (1) unimodal trajectories generated by an unscented Kalman filter (UKF) [21] that forward-propagates actor states from a second-order tracking system; (2) trimodal trajectories generated by an unstructured deep network [5], identified here as Multiple Trajectory Prediction (MTP). The MTP model was trained on the same data as our proposed method. In both baseline methods, each predicted trajectory consists of a sequence of waypoints over time, position covariances surrounding each waypoint, and probabilities per trajectory (relevant in the multimodal case). In contrast, our method predicts spatial occupancy up to a future time horizon.

In order to compare these two different representations, we first convert the output of all methods into a common representation that consists of 2D spatial occupancy predictions over a grid that is centered at the actor’s current position (we use a 150m x 150m grid with 1m resolution). Ground truth labels are generated by determining which 2D cells an actor occupied at any point over the prediction horizon. Figure 5 shows an example of the ground truth occupancy mask and predicted occupancy likelihoods.

Fig. 5: Ground truth and predicted occupancy heatmaps, shown for each method (panels, left to right: Ground Truth, UKF, MTP, LON). In each figure, the actor starts on the left and then moves to the right. The UKF predicts a unimodal trajectory that disperses at future horizons. MTP predicts a trimodal distribution of trajectories that describe multiple possible future motions. Our method, LON, produces likelihoods tied to the geometry of the nearby lanes.

Given a ground truth 2D occupancy grid and a corresponding predicted likelihood grid, we compute the overall average likelihood as

$$\mathcal{L} = \frac{1}{N} \sum_{j=1}^{N} \left[ o_j \, p_j + (1 - o_j)(1 - p_j) \right]$$

Note that $o_j$ is the ground truth label of the $j$-th cell in 2D space (as opposed to in path space), $p_j$ is the predicted occupancy likelihood for that cell, and $N$ is the grid size. We also compute two additional metrics, the positive likelihood, which is calculated only on 2D cells with label $o_j = 1$, and the negative likelihood, which is calculated only on 2D cells with label $o_j = 0$.
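The three likelihood metrics can be sketched as below, under the assumption that the per-cell likelihood is the probability the prediction assigns to the true label (the predicted occupancy for occupied cells, one minus it for free cells). The function name `average_likelihoods` and the toy grid are hypothetical.

```python
import numpy as np


def average_likelihoods(labels, probs):
    """Return (overall, positive, negative) average likelihoods for a 2D grid.

    labels: binary ground truth occupancy grid.
    probs:  predicted occupancy likelihood grid of the same shape.
    """
    labels = np.asarray(labels, dtype=bool)
    probs = np.asarray(probs, dtype=np.float64)
    # Likelihood assigned to the true label of every cell.
    per_cell = np.where(labels, probs, 1.0 - probs)
    overall = float(per_cell.mean())
    positive = float(per_cell[labels].mean()) if labels.any() else float("nan")
    negative = float(per_cell[~labels].mean()) if (~labels).any() else float("nan")
    return overall, positive, negative


toy_labels = [[1, 0], [0, 0]]
toy_probs = [[0.8, 0.1], [0.0, 0.3]]
overall, positive, negative = average_likelihoods(toy_labels, toy_probs)
print(overall, positive, negative)
```

Because occupied cells are a small fraction of a 150m x 150m grid, the overall value is dominated by the negative cells, which is why the positive likelihood is reported separately.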

Next we describe how to convert the two output representations into the common 2D spatial representation.

IV-B1 2D Spatial Occupancy from Trajectories

A multimodal trajectory prediction is defined as a time-varying mixture of Gaussian distributions. Each trajectory in the mixture of $K$ components is given by $\{(\mu_{k,t}, \Sigma_{k,t})\}_{t=1}^{T}$, where $\mu_{k,t}$ is the mean 2D position of the actor at time $t$ and $\Sigma_{k,t}$ is the corresponding position covariance. Each trajectory also has an associated mode probability, $\pi_k$. In a unimodal case, such as UKF, $K = 1$ and $\pi_1 = 1$.

To convert a spatio-temporal trajectory into a spatial occupancy likelihood, for each 2D location, we determine the probability that an actor ends up occupying this location at any time over the prediction horizon. To approximate this, we utilize a Monte Carlo sampling technique. We first generate $S$ samples by repeating the following procedure:

  • Sample a mode $k$ from the distribution over the mixture components $(\pi_1, \dots, \pi_K)$.

  • Sample a trajectory as follows. First, sample $z$ from a 2D standard Gaussian distribution $\mathcal{N}(0, I)$. Then, for each time point $t$, convert $z$ into a sample from $\mathcal{N}(\mu_{k,t}, \Sigma_{k,t})$. To do this, we compute $L_{k,t}$, the Cholesky decomposition of $\Sigma_{k,t}$, and then calculate $x_t = \mu_{k,t} + L_{k,t} z$. Repeat this for all time points to obtain a sequence of sampled positions $x_1, \dots, x_T$. This allows us to sample a coherent trajectory from the sequence of Gaussians.

  • Define $V^{(s)}$ as the swept volume produced by the actor’s polygon moving along the sampled trajectory $x_1, \dots, x_T$.

Finally, to estimate the predicted occupancy likelihood at cell $j$ in the 2D grid, we simply check each cell against the swept volumes from each of the sampled trajectories, and calculate the frequency with which the cell is occupied:

$$p_j = \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\left[\text{cell } j \text{ intersects } V^{(s)}\right]$$

In our experiments, we use .
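The sampling procedure can be sketched in NumPy as follows. This is a simplified illustration, not the evaluation code: the actor footprint is reduced to a single point at each waypoint instead of the swept polygon, the grid is centered on the origin (the actor's current position), and the function name `occupancy_from_mixture`, the array layout, and the sample count used in the example are all assumptions.

```python
import numpy as np


def occupancy_from_mixture(means, covs, mode_probs, grid_size, cell_m,
                           num_samples, rng):
    """Estimate a spatial occupancy grid from a Gaussian-mixture trajectory.

    means: (K, T, 2) waypoint means, covs: (K, T, 2, 2) position covariances,
    mode_probs: (K,) mode probabilities. Returns a (grid_size, grid_size)
    grid with the fraction of sampled trajectories that visit each cell.
    """
    counts = np.zeros((grid_size, grid_size), dtype=np.int64)
    chols = np.linalg.cholesky(covs)           # batched lower-triangular factors
    half_extent = grid_size * cell_m / 2.0
    for _ in range(num_samples):
        k = rng.choice(len(mode_probs), p=mode_probs)   # sample a mode
        z = rng.standard_normal(2)             # one z shared across all times
        points = means[k] + chols[k] @ z       # (T, 2) coherent trajectory
        idx = np.floor((points + half_extent) / cell_m).astype(int)
        inside = np.all((idx >= 0) & (idx < grid_size), axis=1)
        visited = np.zeros(counts.shape, dtype=bool)
        visited[idx[inside, 1], idx[inside, 0]] = True  # row = y, column = x
        counts += visited                      # each sample counts a cell once
    return counts / num_samples


rng = np.random.default_rng(0)
means = np.array([[[0.5, 0.5], [1.5, 0.5], [2.5, 0.5]]])    # K=1, T=3
covs = np.tile(1e-12 * np.eye(2), (1, 3, 1, 1))             # near-deterministic
grid = occupancy_from_mixture(means, covs, np.array([1.0]),
                              grid_size=10, cell_m=1.0,
                              num_samples=20, rng=rng)
print(grid[5, 5:8])
```

Reusing a single standard-normal draw for every time step is what keeps the sampled trajectory coherent: a sample that lies to one side of the mean at one waypoint stays on that side along the whole trajectory.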

IV-B2 2D Spatial Occupancy from Path-Based Occupancy

Since LaneOccupancyNet directly predicts the spatial occupancy probability for cells along each path, we directly use the polygon shapes of the cells to map these into the 2D likelihood grid. Note that the set of paths we consider will likely have some spatial overlap (e.g., if a lane branches into 3 successor lanes, we will create 3 separate paths that partially overlap). To handle this, we also perform a post-processing step to estimate the final occupancy of each unique cell by taking the average of the set of estimates from each path that contains that cell.
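The averaging step over overlapping paths can be sketched as below. It assumes that cells lying on a shared lane segment carry the same identifier in every path that contains them; the identifiers and the function name `merge_path_predictions` are hypothetical.

```python
from collections import defaultdict


def merge_path_predictions(path_cell_ids, path_probs):
    """Average per-path occupancy estimates for cells shared by several paths.

    path_cell_ids: list of cell-ID sequences, one per path.
    path_probs:    list of predicted probabilities aligned with path_cell_ids.
    Returns a dict mapping each unique cell ID to its mean predicted occupancy.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cell_ids, probs in zip(path_cell_ids, path_probs):
        for cell_id, prob in zip(cell_ids, probs):
            sums[cell_id] += prob
            counts[cell_id] += 1
    return {cell_id: sums[cell_id] / counts[cell_id] for cell_id in sums}


# Two paths share cells "a0" and "a1" before diverging into lanes "b" and "c".
merged = merge_path_predictions(
    [["a0", "a1", "b0"], ["a0", "a1", "c0"]],
    [[0.9, 0.8, 0.6], [0.7, 0.6, 0.2]],
)
print(merged)
```

Cells that appear in only one path keep their single estimate, while shared cells receive the mean of all estimates that cover them.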

IV-C Multimodality Estimation

Since actors choose among multiple future actions and we are interested in prediction methods that capture these different possibilities, we also developed a measure to evaluate the spatial multimodality of each prediction method.

Our measure calculates the number of unique spatial modes an actor may follow according to its predicted occupancy. Modes are defined as spatially distinct paths to get from an actor’s current location out to a new location some distance away. To count the modes, we trace 1D likelihoods along concentric rings at varying ranges from the actor’s current location, and count the number of observed peaks in the likelihood along these rings.

Figure 6 provides a visual description of this general approach. Once 1D likelihoods are obtained for a ring at a given distance from the actor, the estimated number of modes is determined by counting the number of peaks in each curve. A peak is defined as a local maximum that rises at least above neighboring local minima. In our experiments, we use , and focus on the 180 degree arcs in the forward direction of the actor of interest, using its estimated heading.
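The peak-counting rule can be sketched as follows for a single ring. This is a literal reading of the definition above rather than the authors' implementation: the function name `count_modes`, the toy ring values, and the thresholds used in the example are hypothetical, and peaks located exactly at the ends of the arc are not counted.

```python
def count_modes(likelihoods, min_rise):
    """Count local maxima that rise at least min_rise above both neighboring minima.

    likelihoods: 1D sequence of occupancy likelihoods sampled along an arc.
    min_rise:    required height of a peak above the local minimum on each side.
    """
    values = list(likelihoods)
    n = len(values)
    modes = 0
    for i in range(1, n - 1):
        # Candidate peak: strictly above the left sample, not below the right
        # one, so a flat-topped peak is counted once at its left edge.
        if not (values[i] > values[i - 1] and values[i] >= values[i + 1]):
            continue
        left = i
        while left > 0 and values[left - 1] <= values[left]:
            left -= 1                      # walk downhill to the left minimum
        right = i
        while right < n - 1 and values[right + 1] <= values[right]:
            right += 1                     # walk downhill to the right minimum
        if (values[i] - values[left] >= min_rise
                and values[i] - values[right] >= min_rise):
            modes += 1
    return modes


ring = [0.0, 0.1, 0.6, 0.1, 0.05, 0.5, 0.1, 0.0]
print(count_modes(ring, min_rise=0.2), count_modes(ring, min_rise=0.5))
```

Raising the threshold suppresses weaker peaks, so the count depends on the chosen value. Small ripples on top of a broad peak can also split it into candidates that each fail the test, which a prominence-based variant would handle more gracefully.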

Fig. 6: A visual overview of the multimodality measure. Left: Spatial likelihoods for the MTP and LON approaches. In each, the spatial occupancies clearly show two separate possible paths. The traced curves show a 180 degree arc in the forward facing direction of the actor of interest. Right: Plot showing 1D likelihoods for the arcs traced on the left. In each case, we see two obvious peaks. Our multimodality estimation method uses peak finding to count the number of spatial modes at each range.

IV-D Quantitative Results

| Method | Overall | Positive | Negative |
| --- | --- | --- | --- |
| Unscented Kalman Filter | 0.9873 | 0.3709 | 0.9911 |
| Multiple Trajectory Prediction | 0.9955 | 0.6577 | 0.9976 |
| LaneOccupancyNet | 0.9934 | 0.7482 | 0.9949 |

TABLE I: Average likelihood results for the two baselines vs. our proposed method (9s horizon).

Table I shows comparative results on the three likelihood measures. While the overall likelihood and negative likelihood both degrade slightly with our method relative to MTP, we observe a 13.8% relative improvement in positive likelihood (0.7482 vs. 0.6577). This suggests that our approach does a better job of estimating all of the possible places an actor may end up, effectively a measure of recall for prediction methods. We explore this further by plotting the positive likelihood at different future time horizons of the ground truth. We measure actor occupancy across individual points of time and compare against our predictions. As seen in Figure 7(a), our method consistently outperforms both UKF and MTP on this metric.

Using the approach described earlier, we evaluate the multimodality of all three methods over a set of ranges, and plot the results in Figure 7(b). These results demonstrate that we observe a greater degree of multimodality in the predictions from LaneOccupancyNet than in those from UKF and MTP. Given that a key advantage of our approach is that we are able to adaptively add modes to our predicted occupancy distribution as additional lanes appear in the scene, this analysis supports the argument that we are better able to handle complex scenes where actors can choose among many different paths.

Fig. 7: (a) Comparison of the positive likelihood distribution across the 3 methods shows that LON best predicts occupied areas at future time horizons. Solid lines show the median average likelihood across the scenes in our test set, shaded regions corresponding to 25th and 75th percentiles. (b) Observed number of spatial modes at varying distances. UKF is unimodal by design. MTP exhibits multimodality at shorter range, diminishing at longer ranges due to short predictions in some cases. LON has a greater number of modes overall.

IV-E Qualitative Results

To further understand our results, we examine the cases with the best and worst likelihood deltas compared to the MTP baseline. Figure 8 shows examples from the results where our method performed better and worse than the baseline. The top left example shows that MTP rigidly predicts left, right, and straight trajectories relative to the actor’s position even when the lane topology doesn’t warrant it. The right panel illustrates some cases where we achieve a lower likelihood than MTP, such as when actors drive in unmapped areas. Interestingly, the bottom right example shows a case where we get penalized in the overall likelihood and negative likelihood metrics for having significant but appropriate multi-modality in our predictions, whereas MTP only predicts one reasonable mode.

We also highlight our performance on a few specific cases of interest in which actors move between different lane paths. Figure 9 shows examples where our method correctly assigns high likelihood to adjacent lanes in order to capture an actor going around another actor (left) and a lane change (right).

Fig. 8: Examples of scenes in which our method has (a) higher likelihood (top cases) and (b) lower likelihood (bottom cases) than the MTP baseline. Within each panel, the left image shows the ground truth, the center image shows the trajectory predictions from MTP (each trajectory is colored according to its mode probability), and the right image shows the occupancy predictions from our method (each cell is colored according to its occupancy likelihood). (a) The top two examples depict unusual road geometries (here MTP predicts that the vehicle will drive out of the road or into an oncoming lane). The bottom two examples are cases where the baseline under-utilizes the map information (e.g., knowledge of left-turn-only lanes). (b) The first example shows an emergency vehicle driving against the direction of the lane. The next two examples show actors driving in areas where we don’t have mapped lanes (e.g., parking spaces and shoulders). The last example shows a case where our predictions are more multimodal than MTP.
Fig. 9: Examples showing that our method can handle complex scenarios involving actors moving across multiple lanes. The example on the left shows the actor going around a blocking actor and coming back to the lane. The example on the right shows a lane change, where we rightly predict less likelihood for future occupancy in the current lane and more likelihood in the adjacent lanes.

V Conclusion

We present LaneOccupancyNet, a model that incorporates map structure in a novel way to predict the future occupancy of lane-following vehicles. Through quantitative and qualitative results, we demonstrate that our method does a better job than the two baselines of capturing the full distribution over possible future behaviors of an actor. In autonomous driving, the ability to generate multimodal predictions with high recall is critical for safe operation of SDVs.

By predicting discretized occupancies along a lane path, we find a middle ground between unstructured trajectory predictions and strict lane-following predictions, as demonstrated by our model’s ability to predict complex lane-changing maneuvers while still generalizing well to unusual road topologies. Our approach depends on good map coverage, in that we are only able to predict occupancy for mapped lanes. However, because we assume an SDV must drive strictly within the mapped road network, our method still predicts occupancy in the regions most important to the SDV. We also show, through qualitative examples, that it can capture a full range of behaviors, including those where actors move between lanes rather than simply driving along a single lane.


  • [1] M. Bansal, A. Krizhevsky, and A. S. Ogale (2018) ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. CoRR abs/1812.03079. External Links: Link, 1812.03079 Cited by: §I.
  • [2] S. Casas, W. Luo, and R. Urtasun (2018-29–31 Oct) IntentNet: learning to predict intention from raw sensor data. In Proceedings of The 2nd Conference on Robot Learning, A. Billard, A. Dragan, J. Peters, and J. Morimoto (Eds.),

    Proceedings of Machine Learning Research

    , Vol. 87, , pp. 947–956.
    External Links: Link Cited by: §I, §II, §II.
  • [3] F. Chou, T. Lin, H. Cui, V. Radosavljevic, T. Nguyen, T. Huang, M. Niedoba, J. Schneider, and N. Djuric (2019) Predicting motion of vulnerable road users using high-definition maps and efficient convnets. CoRR abs/1906.08469. Cited by: Fig. 4, §III-D2, §III-D.
  • [4] A. Cosgun, L. Ma, J. Chiu, J. Huang, M. Demir, A. M. Anon, T. Lian, H. Tafish, and S. Al-Stouhi (2017) Towards full automated drive in urban environments: A demonstration in GoMentum Station, California. In 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 1811–1818. Cited by: §II.
  • [5] H. Cui, V. Radosavljevic, F. Chou, T. Lin, T. Nguyen, T. Huang, J. Schneider, and N. Djuric (2018) Multimodal trajectory predictions for autonomous driving using deep convolutional networks. CoRR abs/1809.10732. Cited by: §I, §II, §II, §III-D1, §IV-B.
  • [6] N. Djuric, V. Radosavljevic, H. Cui, T. Nguyen, F. Chou, T. Lin, and J. Schneider (2018) Motion prediction of traffic actors for autonomous driving using deep convolutional networks. arXiv preprint arXiv:1808.05819. Cited by: §II.
  • [7] A. Houenou, P. Bonnifait, V. Cherfaoui, and W. Yao (2013) Vehicle trajectory prediction based on motion model and maneuver recognition. In 2013 IEEE/RSJ international conference on intelligent robots and systems, pp. 4363–4369. Cited by: §II.
  • [8] R. E. Kalman (1960) A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82 (1), pp. 35–45. Cited by: §II.
  • [9] I. Kim, J. Bong, J. Park, and S. Park (2017) Prediction of driver’s intention of lane change by augmenting sensor information using machine learning techniques. Sensors 17 (6), pp. 1350. Cited by: §I.
  • [10] M. Koschi and M. Althoff (2017) Interaction-aware occupancy prediction of road vehicles. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pp. 1–8. Cited by: §II.
  • [11] P. Kumar, M. Perrollaz, S. Lefevre, and C. Laugier (2013) Learning-based approach for online lane change intention prediction. In 2013 IEEE Intelligent Vehicles Symposium (IV), pp. 797–802. Cited by: §II.
  • [12] W. Luo, B. Yang, and R. Urtasun (2018) Fast and furious: real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3569–3577. Cited by: §II.
  • [13] O. Makansi, E. Ilg, O. Cicek, and T. Brox (2019) Overcoming limitations of mixture density networks: a sampling and fitting framework for multimodal future prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7144–7153. Cited by: §II.
  • [14] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington (2019) LaserNet: An efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12677–12686. Cited by: §IV-A.
  • [15] D. Petrich, T. Dang, D. Kasper, G. Breuel, and C. Stiller (2013) Map-based long term motion prediction for vehicles in traffic environments. In 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), pp. 2166–2172. Cited by: §II.
  • [16] E. Rehder and H. Kloeden (2015) Goal-directed pedestrian prediction. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 50–58. Cited by: §II.
  • [17] A. Rudenko, L. Palmieri, M. Herman, K. M. Kitani, D. M. Gavrila, and K. O. Arras (2019) Human motion trajectory prediction: A survey. CoRR abs/1905.06113. Cited by: §I.
  • [18] C. Rupprecht, I. Laina, R. DiPietro, M. Baust, F. Tombari, N. Navab, and G. D. Hager (2017) Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3591–3600. Cited by: §I.
  • [19] T. Streubel and K. H. Hoffmann (2014) Prediction of driver intended path at intersections. In 2014 IEEE Intelligent Vehicles Symposium Proceedings, pp. 134–139. Cited by: §II.
  • [20] D. Tran, W. Sheng, L. Liu, and M. Liu (2015) A Hidden Markov Model based driver intention prediction system. In 2015 IEEE International Conference on Cyber Technology in Automation, Control, and Intelligent Systems (CYBER), pp. 115–120. Cited by: §I.
  • [21] E. A. Wan and R. Van Der Merwe (2000) The unscented Kalman filter for nonlinear estimation. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No. 00EX373), pp. 153–158. Cited by: §II, §IV-B.