Multiple Futures Prediction

11/04/2019 ∙ by Yichuan Charlie Tang, et al. ∙ 23

Temporal prediction is critical for making intelligent and robust decisions in complex dynamic environments. Motion prediction needs to model the inherently uncertain future which often contains multiple potential outcomes, due to multi-agent interactions and the latent goals of others. Towards these goals, we introduce a probabilistic framework that efficiently learns latent variables to jointly model the multi-step future motions of agents in a scene. Our framework is data-driven and learns semantically meaningful latent variables to represent the multimodal future, without requiring explicit labels. Using a dynamic attention-based state encoder, we learn to encode the past as well as the future interactions among agents, efficiently scaling to any number of agents. Finally, our model can be used for planning via computing a conditional probability density over the trajectories of other agents given a hypothetical rollout of the 'self' agent. We demonstrate our algorithms by predicting vehicle trajectories on both simulated and real data, achieving state-of-the-art results on several vehicle trajectory datasets.




1 Introduction

The ability to make good predictions lies at the heart of robust and safe decision making. It is especially critical to be able to predict the future motions of all relevant agents in complex and dynamic environments. For example, in the autonomous driving domain, motion prediction is central both to the ability to make high level decisions, such as when to perform maneuvers, as well as to low level path planning optimizations Thrun et al. (2006); Paden et al. (2016).

Motion prediction is a challenging problem due to the various needs of a good predictive model. The varying objectives, goals, and behavioral characteristics of different agents can lead to multiple possible futures or modes. Agents’ states do not evolve independently from one another, but rather they interact with each other. As an illustration, we provide some examples in Fig. 1. In Fig. 1(a), there are a few different possible futures for the blue vehicle approaching an intersection. It can either turn left, go straight, or turn right, forming different modes in trajectory space. In Fig. 1(b), interactions between the two vehicles during a merge scenario show that their trajectories influence each other, depending on who yields to whom. Besides multimodal interactions, prediction needs to scale efficiently with an arbitrary number of agents in a scene and take into account auxiliary and contextual information, such as map and road information. Additionally, the ability to measure uncertainty by computing probability over likely future trajectories of all agents in closed-form (as opposed to Monte Carlo sampling) is of practical importance.

(a) Multiple possible future trajectories.
(b) Scenario A: green yields to blue. (c) Scenario B: blue yields to green.
Figure 1: Examples illustrating the need for multimodal interactive predictions. (a): There are a few possible modes for the blue vehicle. (b and c): Time-lapsed visualization of how interactions between agents influence each other’s trajectories.

Despite a large body of work in temporal motion predictions Lee et al. (2017); Casas et al. (2018); Cui et al. (2018); Ma et al. (2018); Deo and Trivedi (2018); Bansal et al. (2018); Rhinehart et al. (2019); Chai et al. (2019); Zhao et al. (2019), existing state-of-the-art methods often only capture a subset of the aforementioned features. For example, algorithms are either deterministic, not multimodal, or do not fully capture both past and future interactions. Multimodal techniques often require the explicit labeling of modes prior to training. Models which perform joint prediction often assume the number of agents present to be fixed Watters et al. (2017); Sun et al. (2019).

We tackle these challenges by proposing a unifying framework that captures all of the desirable features mentioned earlier. Our framework, which we call Multiple Futures Predictor (MFP), is a sequential probabilistic latent variable generative model that learns directly from multi-agent trajectory data. Training maximizes a variational lower bound on the log-likelihood of the data. MFP learns to model multimodal interactive futures jointly for all agents, while using a novel factorization technique to remain scalable to an arbitrary number of agents. After training, MFP can compute both unconditional and conditional trajectory probabilities in closed form, without requiring any Monte Carlo sampling.

MFP builds on the Seq2seq encoder-decoder framework Sutskever et al. (2014) by introducing latent variables and using a set of parallel RNNs (with shared weights) to represent the set of agents in a scene. Each RNN takes on the point-of-view of its agent and aggregates historical information for sequential temporal prediction for that agent. Discrete latent variables, one per RNN, automatically learn semantically meaningful modes to capture multimodality without explicit labeling. MFP can be further efficiently and jointly trained end-to-end for all agents in the scene. To summarize, we make the following contributions. First, semantically meaningful latent variables are automatically learned from trajectory data without labels, which addresses the multimodality problem. Second, interactive and parallel step-wise rollouts are performed for all agents in the scene, which addresses the modeling of interactions between actors during future prediction (see Sec. 3.1). Third, we propose a dynamic attentional encoding which captures both the relationships between agents and the scene context (see Sec. 3.1). Finally, MFP is capable of performing hypothetical inference: evaluating the conditional probability of agents’ trajectories conditioned on fixing the trajectories of one or more agents (see Sec. 3.2).

2 Related Work

The problem of predicting future motion for dynamic agents has been well studied in the literature. The bulk of classical methods focus on using physics-based dynamic or kinematic models Welch et al. (1995); Haarnoja et al. (2016); Lefèvre et al. (2014). These approaches include Kalman filters and maneuver-based methods, which compute the future motion of agents by propagating their current state forward in time. While these methods perform well for short time horizons, their performance degrades over longer horizons due to the lack of interaction and context modeling.

The success of machine learning and deep learning ushered in a variety of data-driven recurrent neural network (RNN) based methods Lee et al. (2017); Casas et al. (2018); Cui et al. (2018); Ma et al. (2018); Deo and Trivedi (2018); Bansal et al. (2018). These models often combine RNN variants, such as LSTMs or GRUs, with encoder-decoder architectures such as conditional variational autoencoders (CVAEs). These methods eschew physics-based dynamic models in favor of learning generic sequential predictors (e.g. RNNs) directly from data. Converting raw input data to input features can also be learned, often by encoding rasterized inputs using CNNs Casas et al. (2018); Cui et al. (2018).

Methods that can learn multiple future modes have been proposed in Deo and Trivedi (2018); Lee et al. (2017); Cui et al. (2018). However, Deo and Trivedi (2018) explicitly label six maneuvers/modes and learn to separately classify these modes. Lee et al. (2017); Cui et al. (2018) do not require mode labeling but they also do not train in an end-to-end fashion by maximizing the data log-likelihood of the model. Most of the methods in the literature encode the past interactions of agents in a scene; however, prediction is often an independent rollout of a decoder RNN, independent of other future predicted trajectories Deo and Trivedi (2018); Park et al. (2018). Encoding of spatial relationships is often done by placing other agents in a fixed and spatially discretized grid Deo and Trivedi (2018); Lee et al. (2017).

In contrast, MFP proposes a unifying framework which exhibits the aforementioned features. To summarize, we present a feature comparison of MFP with some of the recent methods in the supplementary materials.

3 Multiple Futures Prediction

We tackle motion prediction by formulating a probabilistic framework for a continuous-space, discrete-time system with a finite (but variable) number $N$ of interacting agents. We represent the joint state of all agents at time $t$ as $X_t = \{x^1_t, x^2_t, \ldots, x^N_t\} \in \mathbb{R}^{N \times d}$, where $d$ is the dimensionality of each state (we assume states are fully observable and are agents’ coordinates on the ground plane, $d=2$), and $x^n_t \in \mathbb{R}^d$ is the state of the $n$-th agent at time $t$. With a slight abuse of notation, we use superscripted $X^n = \{x^n_{t-\tau}, x^n_{t-\tau+1}, \ldots, x^n_t\}$ to denote the past states of the $n$-th agent and $X = X^{1:N}_{t-\tau:t}$ to denote the joint agent states from time $t-\tau$ to $t$, where $\tau$ is the number of past history steps. The future state at time $\delta$ of all agents is denoted by $Y_\delta = \{y^1_\delta, y^2_\delta, \ldots, y^N_\delta\}$ and the future trajectory of agent $n$, from time $t+1$ to time $T$, is denoted by $Y^n = \{y^n_{t+1}, y^n_{t+2}, \ldots, y^n_T\}$. $Y = Y^{1:N}_{t+1:T}$ denotes the joint state of all agents for the future timesteps. Contextual scene information, e.g. a rasterized image of the map, could be useful by providing important cues. We use $I_t$ to represent any contextual information at time $t$.

The goal of motion prediction is then to accurately model $p(Y \mid X, I)$. As in most sequential modelling tasks, it is both inefficient and intractable to model $p(Y \mid X, I)$ jointly. RNNs are typically employed to sequentially model the distribution in a cascade form. However, there are two major challenges specific to our multi-agent prediction framework. (1) Multimodality: optimizing vanilla RNNs via backpropagation through time will lead to mode-averaging since the mapping from $X$ to $Y$ is not a function, but rather a one-to-many mapping. In other words, multimodality means that for a given $X$, there could be multiple distinctive modes that result in significant probability mass over different sequences of $Y$. (2) Variable-Agents: the number of agents $N$ is variable and unknown, and therefore we cannot simply vectorize $X_t$ as the input to a standard RNN at time $t$.

(a) Graphical model of the MFP. Solid nodes denote observed variables. Cross-agent interaction edges are shaded for clarity. $x_t$ denotes both the state and contextual information from timesteps $1$ to $t$.
(b) Architecture of the proposed MFP. The circular ’world’ contains the world state and positions of all agents. Diamond nodes are deterministic while the circular $z^n$ are discrete latent random variables.

Figure 2: Graphical model and computation graph of the MFP. See text for details. Best viewed in color.

For multimodality, we introduce a set of stochastic latent variables $z^n$, one per agent, where $z^n$ can take on $K$ discrete values. The intuition here is that $z^n$ would learn to represent intentions (left/right/straight) and/or behavior modes (aggressive/conservative). Learning maximizes the marginalized distribution, where $z^n$ is free to learn any latent behavior so long as it helps to improve the data log-likelihood. Each $z^n$ is conditioned on $X$ at the current time (before future prediction) and will influence the distribution over future states $Y$. A key feature of the MFP is that $z^n$ is only sampled once at time $t$, and must be consistent for all future time steps up to $T$. Compared to sampling $z^n$ at every timestep, this leads to tractability and more realistic intention/goal modeling, as we will discuss in more detail later. We now arrive at the following distribution:

$$p(Y \mid X) = \sum_{Z} p(Y, Z \mid X) = \sum_{Z} p(Y \mid Z, X)\, p(Z \mid X), \tag{1}$$

where $Z = \{z^1, \ldots, z^N\}$ denotes the joint latent variables of all agents. Naïvely optimizing for Eq. 1 is prohibitively expensive and not scalable as the number of agents and timesteps may become large. In addition, the maximum number of possible joint modes is exponential in the number of agents: $K^N$. We first make the model more tractable by factorizing across time, followed by factorization across agents. The joint future distribution assumes the form of a product of conditional distributions:

$$p(Y \mid Z, X) = \prod_{\delta=t+1}^{T} p(Y_\delta \mid Y_{t+1:\delta-1}, Z, X), \tag{2}$$

$$p(Y_\delta \mid Y_{t+1:\delta-1}, Z, X) = \prod_{n=1}^{N} p(y^n_\delta \mid Y_{t+1:\delta-1}, z^n, X). \tag{3}$$

The second factorization is sensible as the factorial component is conditioning on the joint states of all agents in the immediate previous timestep, where the typical temporal delta is very short (a fraction of a second). Also note that the future distribution of the $n$-th agent is explicitly dependent on its own mode $z^n$ but only implicitly dependent on the latent modes of other agents, through re-encoding the other agents’ predicted states (please see discussion later and also Sec. 3.1). Explicitly conditioning only on an agent’s own latent mode is both more scalable computationally and more realistic: agents in the real world can only infer other agents’ latent goals/intentions via observing their states. Finally our overall objective from Eq. 1 can be written as:

$$\log p(Y \mid X) = \log \sum_{Z} \prod_{n=1}^{N} p(z^n \mid X) \prod_{\delta=t+1}^{T} p(y^n_\delta \mid Y_{t+1:\delta-1}, z^n, X). \tag{4}$$
The graphical model of the MFP is illustrated in Fig. 2(a). While we show only three agents for simplicity, MFP can easily scale to any number of agents. Nonlinear interactions among agents make the per-step conditional distribution of each agent complicated to model. Recurrent neural networks are powerful and flexible models that can efficiently capture and represent long-term dependencies in sequential data. At a high level, RNNs introduce deterministic hidden units $h_t$ at every timestep $t$, which act as features or embeddings that summarize all of the observations up until time $t$. At time step $t$, an RNN takes as its input the observation $x_t$ and the previous hidden representation $h_{t-1}$, and computes the update $h_t = f_{rnn}(x_t, h_{t-1})$. The prediction $y_t$ is computed from the decoding layer of the RNN, $y_t = f_{dec}(h_t)$. $f_{rnn}$ and $f_{dec}$ are recursively applied at every timestep of the sequence.

Fig. 2(b) shows the computation graph of the MFP. A point-of-view (PoV) transformation is first used to transform the past states to each agent’s own reference frame by translation and rotation such that one coordinate axis aligns with the agent’s heading. We then instantiate an encoding and a decoding RNN per agent (we use GRUs Chung et al. (2014); LSTMs and GRUs perform similarly, but GRUs were slightly faster computationally). Each encoding RNN is responsible for encoding the past observations of its agent into a feature vector. Scene context is transformed via a convolutional neural network into its own feature. The features are combined via a dynamic attention encoder, detailed in Sec. 3.1, to provide inputs both to the latent variables as well as to the ensuing decoding RNNs. During predictive rollouts, the decoding RNN will predict its own agent’s state at every timestep. The predictions will be aggregated and subsequently transformed via the PoV transformation, providing inputs to every agent/RNN for the next timestep. Latent variables provide extra inputs to the decoding RNNs to enable multimodality. Finally, the output consists of a 5-dim vector governing a Bivariate Normal distribution: $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, and correlation coefficient $\rho$.
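To make the decoder output concrete, the following is a minimal sketch of the log-density of a Bivariate Normal parameterized by two means, two standard deviations, and a correlation coefficient. The function name and argument order are illustrative assumptions and are not taken from the MFP implementation.

```python
import math


def bivariate_normal_logpdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Log-density of a 2-D Normal given means, std devs, and correlation.

    Hypothetical helper for illustration: the five parameters mirror the
    5-dim decoder output described in the text.
    """
    if sigma_x <= 0 or sigma_y <= 0 or not -1.0 < rho < 1.0:
        raise ValueError("need sigma_x, sigma_y > 0 and -1 < rho < 1")
    # Standardized residuals along each axis.
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho * rho
    # Quadratic form of the correlated Gaussian.
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / one_minus_rho2
    # Log of the normalizing constant 2*pi*sigma_x*sigma_y*sqrt(1 - rho^2).
    log_norm = math.log(2.0 * math.pi * sigma_x * sigma_y) + 0.5 * math.log(one_minus_rho2)
    return -0.5 * quad - log_norm
```

Because this is a density rather than a probability mass, the log-density can be positive when the standard deviations are small, which is why a negative log-likelihood below zero is possible for confident predictions.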

While we instantiate two RNNs per agent, these RNNs share the same parameters across agents, which means we can efficiently perform joint predictions by combining inputs in a minibatch, allowing us to scale to an arbitrary number of agents. Making the latent variables discrete and having only one set of latent variables influence subsequent predictions is also a deliberate choice. We would like to model modes generated by high-level intentions such as left/right lane changes or conservative/aggressive modes of agent behavior. These latent behavior modes also tend to stay consistent over the time horizon which is typical of motion prediction (e.g. 5 seconds).


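Given a set of training trajectory data $\mathcal{D} = \{(X^{(i)}, Y^{(i)})\}_{i=1}^{|\mathcal{D}|}$, we use maximum likelihood estimation (MLE) to estimate the parameters $\theta^* = \arg\max_\theta \mathcal{L}(\theta, \mathcal{D})$ that achieve the maximum marginal data log-likelihood (we have omitted the dependence on context for clarity; the R.H.S. of Eq. 6 is derived from the common log-derivative trick):

$$\mathcal{L}(\theta, \mathcal{D}) = \log p(Y \mid X; \theta) = \log \sum_{Z} p(Y, Z \mid X; \theta), \tag{5}$$

$$\frac{\partial \mathcal{L}(\theta, \mathcal{D})}{\partial \theta} = \sum_{Z} p(Z \mid Y, X; \theta)\, \frac{\partial}{\partial \theta} \log p(Y, Z \mid X; \theta). \tag{6}$$

Optimizing for Eq. 6 directly is non-trivial as the posterior distribution $p(Z \mid Y, X; \theta)$ is not only hard to compute, but also varies with $\theta$. We can however decompose the log-likelihood into the sum of the evidence lower bound (ELBO) and the KL-divergence between the true posterior and an approximating posterior $Q(Z)$ Neal and Hinton (1998):

$$\log p(Y \mid X; \theta) = \sum_{Z} Q(Z) \log \frac{p(Y, Z \mid X; \theta)}{Q(Z)} + D_{KL}(Q \,\|\, P) \;\geq\; \sum_{Z} Q(Z) \log p(Y, Z \mid X; \theta) + H(Q), \tag{7}$$

where Jensen’s inequality is used to arrive at the lower bound, $H$ is the entropy function and $D_{KL}(Q \,\|\, P)$ is the KL-divergence between the approximating posterior $Q(Z)$ and the true posterior $P = p(Z \mid Y, X; \theta)$. We learn by maximizing the variational lower bound on the data log-likelihood by first using the true posterior at the current $\theta'$ as the approximating posterior: $Q(Z) = p(Z \mid Y, X; \theta')$ (the ELBO is the tightest when the KL-divergence is zero and $Q$ is the true posterior). We can then fix the approximate posterior and optimize the model parameters for the following function:

$$\mathcal{Q}(\theta, \theta') = \sum_{Z} p(Z \mid Y, X; \theta') \log p(Y, Z \mid X; \theta) = \sum_{Z} p(Z \mid Y, X; \theta') \left\{ \log p(Y \mid Z, X; \theta_{rnn}) + \log p(Z \mid X; \theta_Z) \right\}, \tag{8}$$

where $\theta = \{\theta_{rnn}, \theta_Z\}$ denotes the parameters of the RNNs and the parameters of the network layers for predicting $Z$, respectively. As our latent variables $Z$ are discrete and have small cardinality (e.g. < 10), we can compute the posterior exactly for a given $\theta'$. The RNN parameter gradients are computed from $\partial \mathcal{Q}(\theta, \theta') / \partial \theta_{rnn}$ and the gradient for $\theta_Z$ is $\partial\, \mathrm{KL}\big(p(Z \mid Y, X; \theta') \,\|\, p(Z \mid X; \theta_Z)\big) / \partial \theta_Z$.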

Our learning algorithm is a form of the EM algorithm Dempster et al. (1977), where for the M-step we optimize RNN parameters using stochastic gradient descent. By integrating out the latent variable $Z$, MFP learns directly from trajectory data, without requiring any annotations or weak supervision for latent modes. We provide a detailed training algorithm pseudocode in the supplementary materials.
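The exact posterior computation that makes this EM-style training feasible can be sketched for a single agent with a small number of discrete modes. This is an illustrative sketch only: the function names are hypothetical, and the per-mode log-likelihoods are assumed to have been produced already by the decoding RNN.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def e_step(log_prior, log_lik):
    """Exact posterior over modes and the marginal log-likelihood.

    log_prior[k]: log-probability of mode k given the past.
    log_lik[k]:   log-likelihood of the observed future under mode k.
    """
    joint = [lp + ll for lp, ll in zip(log_prior, log_lik)]
    log_marginal = logsumexp(joint)
    posterior = [math.exp(j - log_marginal) for j in joint]
    return posterior, log_marginal


def q_function(posterior, log_prior, log_lik):
    """Expected complete-data log-likelihood under a fixed posterior."""
    return sum(q * (lp + ll) for q, lp, ll in zip(posterior, log_prior, log_lik))


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

When the fixed posterior equals the true posterior, `q_function(...) + entropy(posterior)` equals the marginal log-likelihood, which is the sense in which the lower bound is tight. In an M-step, the posterior would be held constant while the network parameters that produce `log_prior` and `log_lik` are updated by gradient ascent on `q_function`.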


Teacher forcing is a standard technique (albeit biased) to accelerate RNN and sequence-to-sequence training by using ground truth values at step $t$ as the input to step $t+1$. Even with scheduled sampling Bengio et al. (2015), we found that over-fitting due to exposure bias could be an issue. Interestingly, an alternative, which we call classmates-forcing, is possible in the MFP: at time $t$, for agent $n$, the ground truth observations are used as inputs for all other agents $m \neq n$. However, for agent $n$ itself, we still use its previous predicted state instead of the true observations as its input. We provide empirical comparisons in Table 2.
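The difference between the two input schemes can be sketched as a small selection routine. This is a hypothetical illustration, not the authors' code: states are represented as plain tuples and the function names are assumptions.

```python
def teacher_forcing_inputs(ground_truth):
    """Standard teacher forcing: every agent is fed ground-truth states."""
    return list(ground_truth)


def classmates_forcing_inputs(ground_truth, predictions, agent_index):
    """Inputs seen by the decoder of `agent_index` at the next step.

    All other agents ("classmates") contribute their ground-truth states,
    while the agent itself contributes its own previous prediction.
    """
    if len(ground_truth) != len(predictions):
        raise ValueError("ground_truth and predictions must cover the same agents")
    return [
        predictions[n] if n == agent_index else ground_truth[n]
        for n in range(len(ground_truth))
    ]
```

For three agents, calling `classmates_forcing_inputs(gt, pred, 1)` returns the ground truth for agents 0 and 2 and the prediction for agent 1, so each decoder is exposed to its own errors during training while still seeing realistic neighbors.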

Connections to other Stochastic RNNs

Various stochastic recurrent models have been proposed in the existing literature: DRAW Gregor et al. (2015), STORN Bayer and Osendorfer (2014), VRNN Chung et al. (2015), SRNN Fraccaro et al. (2016), Z-forcing Goyal et al. (2017), Graph-VRNN Sun et al. (2019). Besides the multi-agent modeling capability of the MFP, the key difference between these methods and MFP is that the other methods use continuous stochastic latent variables at every timestep, sampled from a standard Normal prior. The training is performed via pathwise derivatives, i.e. the reparameterization trick. Having multiple continuous stochastic variables means that the posterior cannot be computed in closed form and Monte Carlo (or lower-variance MCMC) estimators must be used to estimate the ELBO (even with IWAE Burda et al. (2015), 50 samples are needed to obtain a somewhat tight lower bound, making it prohibitively expensive to compute good log-densities for these stochastic RNNs for online applications). This makes it hard to efficiently compute the log-probability of an arbitrary imagined or hypothetical trajectory, which might be useful for planning and decision-making (see Sec. 3.2). In contrast, the latent variables in MFP are discrete and can learn semantically meaningful modes (Sec. 4.1). With $K$ modes, it is possible to evaluate the exact log-likelihoods of trajectories in $O(K)$, without resorting to sampling.
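The reason exact evaluation stays cheap with discrete modes can be checked numerically. The sketch below assumes that, for a fully specified joint trajectory, each agent's term depends only on that agent's own mode, so the sum over all joint mode assignments equals a product of per-agent sums. The table layout and function names are illustrative assumptions.

```python
import itertools
import math


def joint_sum(per_agent):
    """Brute-force sum over every joint mode assignment (K^N terms).

    per_agent[n][k] is the weight of agent n under its mode k, e.g. the
    mode prior times the likelihood of that agent's trajectory.
    """
    total = 0.0
    mode_ranges = [range(len(weights)) for weights in per_agent]
    for assignment in itertools.product(*mode_ranges):
        total += math.prod(per_agent[n][k] for n, k in enumerate(assignment))
    return total


def factorized_sum(per_agent):
    """Same quantity computed as a product of per-agent sums (N*K terms)."""
    return math.prod(sum(weights) for weights in per_agent)
```

With 3 agents and 2 modes each, `joint_sum` visits 8 assignments while `factorized_sum` touches 6 numbers; the gap widens quickly as agents are added, which is the scalability argument in numeric form.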

3.1 State Encodings

As shown in Fig. 2(b), the input to the RNNs at step $t$ is first transformed via the point-of-view transformation, followed by state encoding, which aggregates the relative positions of other agents with respect to the $n$-th agent (ego agent, or the agent for which the RNN is predicting) and encodes the information into a feature vector. We denote the encoded feature $s^n_t$. Here, we propose a dynamic attention-like mechanism where radial basis functions are used for matching and routing relevant agents from the input to the feature encoder, shown in Fig. 3.


Figure 3: Diagram for dynamic attentional state encoding. MFP uses state encoding at every timestep to convert the state of surrounding agents into a feature vector for next-step prediction, see text for more details.
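The point-of-view transformation that precedes state encoding amounts to a translation followed by a rotation. The following minimal sketch assumes 2-D positions, a heading angle in radians measured counter-clockwise from the world x-axis, and alignment of the heading with the positive x-axis of the agent frame; these conventions are assumptions made for illustration.

```python
import math


def to_agent_frame(points, origin, heading):
    """Express world-frame points in an agent-centered frame.

    The agent's position `origin` maps to (0, 0) and its heading direction
    maps to the positive x-axis (an assumed convention).
    """
    # Rotating by -heading brings the heading direction onto +x.
    c, s = math.cos(-heading), math.sin(-heading)
    transformed = []
    for px, py in points:
        dx, dy = px - origin[0], py - origin[1]
        transformed.append((c * dx - s * dy, s * dx + c * dy))
    return transformed
```

For an agent at (1, 1) facing along the world y-axis, a point two meters directly ahead at (1, 3) maps to approximately (2, 0) in the agent frame, so "ahead" always means the same thing regardless of where the agent is or which way it points.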

Each agent uses a neural network to transform its state (positions, velocity, acceleration, and heading) into a key or descriptor, which is then matched via a radial basis function to a fixed number of “slots” with learned keys in the encoder network. The ego agent (we use ego to refer to the main or ’self’ agent for whom we are predicting) has a separate slot to send its own state. Slots are aggregated and further transformed by a two-layer encoder network, encoding the state as a feature vector. The entire dynamic encoder can be learned in an end-to-end fashion. The key-matching is similar to dot-product attention Vaswani et al. (2017); however, the use of radial basis functions allows us to learn spatially sensitive and meaningful keys to extract relevant agents. In addition, Softmax normalization in dot-product attention lacks the ability to differentiate between a single close-by agent and a single far-away agent.
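A rough sketch of radial-basis key matching follows. It is not the paper's encoder: keys are passed in directly rather than produced by a network, the Gaussian bandwidth is a made-up parameter, and slots are filled by an unnormalized weighted sum, which is one plausible reading of the description above.

```python
import math


def rbf_weight(key, slot_key, bandwidth=1.0):
    """Gaussian radial basis similarity between an agent key and a slot key."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(key, slot_key))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def encode_slots(agent_keys, agent_values, slot_keys, bandwidth=1.0):
    """Route agent values into slots using unnormalized RBF weights.

    Without softmax normalization, a lone far-away agent contributes almost
    nothing, whereas softmax over a single agent would always assign weight 1.
    """
    dim = len(agent_values[0]) if agent_values else 0
    slots = []
    for slot_key in slot_keys:
        accumulated = [0.0] * dim
        for key, value in zip(agent_keys, agent_values):
            weight = rbf_weight(key, slot_key, bandwidth)
            for i in range(dim):
                accumulated[i] += weight * value[i]
        slots.append(accumulated)
    return slots
```

With one slot keyed at the origin, an agent whose key sits exactly on the slot key passes its value through unchanged, while an agent ten units away is effectively ignored. The slot vectors would then be concatenated and passed to the encoder network.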

3.2 Hypothetical Rollouts

Planning and decision-making must rely on prediction for what-ifs Howard et al. (2014). It is important to predict how others might react to different hypothetical ego actions (e.g. what if ego were to perform a more aggressive lane change?). Specifically, we are interested in the distribution when conditioning on any hypothetical future trajectory $Y^m$ of one (or more) agents:

$$p(Y^{n \neq m} \mid Y^m, X) = \sum_{Z^{n \neq m}} \prod_{n \neq m} p(z^n \mid X) \prod_{\delta=t+1}^{T} p(y^n_\delta \mid Y_{t+1:\delta-1}, z^n, X). \tag{9}$$

This can be easily computed within MFP by fixing the future states of the conditioning agent $m$ on the R.H.S. of Eq. 9 while the states of other agents are not changed. This is due to the fact that MFP performs interactive future rollouts in a synchronized manner for all agents, as the joint predicted states of all agents at one timestep are used as inputs for predicting the states at the next. As a comparison, most other prediction algorithms perform independent rollouts, which makes hypothetical rollouts impossible as there are no interactions during the future timesteps.
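The mechanics of a hypothetical rollout can be illustrated with a toy synchronized rollout in which one agent's future is clamped. Everything here is an assumption for illustration: states are 1-D positions, `toy_step` is a hand-written stand-in for the learned decoder, and the rollout is deterministic.

```python
def rollout(initial_states, step_fn, horizon, fixed=None):
    """Synchronized rollout where each step sees the joint previous state.

    fixed: optional dict mapping an agent index to its list of clamped
    future states (the hypothetical trajectory being conditioned on).
    """
    fixed = fixed or {}
    joint = list(initial_states)
    trajectory = []
    for t in range(horizon):
        # Every agent predicts its next state from the same joint state.
        next_states = [step_fn(n, joint) for n in range(len(joint))]
        # Overwrite the conditioning agents with their hypothetical states.
        for n, states in fixed.items():
            next_states[n] = states[t]
        trajectory.append(next_states)
        joint = next_states
    return trajectory


def toy_step(n, joint):
    """Stand-in dynamics: agent 0 leads, agent 1 follows and yields when close."""
    if n == 0:
        return joint[0] + 1.0
    gap = joint[0] - joint[1]
    return joint[1] + (1.0 if gap > 2.0 else 0.0)
```

Starting from positions `[3.0, 0.0]`, the follower advances every step in the standard rollout; if the leader is clamped to stay at 3.0, the follower advances once and then holds its position. An independent per-agent rollout could not produce this reaction, since the follower would never see the leader's clamped states.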

4 Experimental Results

We demonstrate the effectiveness of MFP in learning interactive multimodal predictions for the driving domain, where each agent is a vehicle. As a proof-of-concept, we first generate simulated trajectory data from the CARLA simulator Dosovitskiy et al. (2017), where we can specify the number of modes and script 2nd-order interactions. We demonstrate MFP can learn semantically meaningful latent modes to capture all of the modes of the data, all without using labeling of the latent modes. We then experiment on a widely known standard dataset of real vehicle trajectories, the NGSIM Colyar and Halkias (2007) dataset. We show that MFP achieves state-of-the-art results on modeling held-out test trajectories. In addition, we also benchmark MFP with previously published results on the more recent large scale Argoverse motion forecasting dataset Chang et al. (2019). We provide MFP architecture and learning details in the supplementary materials.

(a) CARLA simulation Dosovitskiy et al. (2017). (b) MFP sample rollouts after training. Multiple trials from same initial locations are overlaid.
(c) Learned latent modes. Same marker shape denotes the same mode across agents. Time is the -axis.
Figure 4: (a) CARLA data. (b) Sample rollouts overlaid, showing learned multimodality. (c) MFP learned semantically meaningful latent modes automatically: triangle: right turn, square: straight ahead, circle: stop.

4.1 CARLA

CARLA is a realistic, open-source, high-fidelity driving simulator based on the Unreal Engine Dosovitskiy et al. (2017). It currently contains six different towns and dozens of different vehicle assets. The simulation includes both highways and urban settings with traffic light intersections and four-way stops. Simple traffic-law-abiding "auto-pilot" CPU agents are also available.

We create a scenario at an intersection where one vehicle is approaching the intersection and two other vehicles are moving across horizontally (Fig. 4(a)). The first vehicle (red) has 3 different possibilities which are randomly chosen during data generation. In the first mode, it aggressively speeds up and makes the right turn, cutting in front of the green vehicle. In the second mode, it still makes the right turn, but slows down and yields to the green vehicle. In the third mode, it slows to a stop, yielding to both of the other vehicles. The far left vehicle also chooses randomly between going straight or turning right. We report the performance of MFP as a function of the number of modes in Table 1.

Metric (nats)   C.V.    RNN basic   MFP 1 mode   MFP 2 modes   MFP 3 modes   MFP 4 modes   MFP 5 modes
NLL             11.46   5.64        5.23         3.37          1.72          1.39          1.39
Table 1: Test performance (minMSD with ) comparisons.

        Fixed-Encoding    DynEnc
NLL     1.878             1.694

        Teacher-forcing   Classmates-forcing
NLL     4.344             4.196
Table 2: Additional comparisons.

Metric (K=12)   Vehicle 1 Standard   Vehicle 1 Hypo   Vehicle 2 Standard   Vehicle 2 Hypo
minADE
minFDE
Table 3: Hypothetical Rollouts.

The modes learned here are somewhat semantically meaningful. In Fig. 4(c), we can see that even for different vehicles, the same latent variable value learned an interpretable behavior. The square mode learned to go straight, the circle mode learned to brake/stop, and the triangle mode represents right turns. Finally, in Table 2, we compare teacher-forcing with the proposed classmates-forcing. In addition, we compare different types of encodings. DynEnc is the encoding proposed in Sec. 3.1. Fixed-encoding uses a fixed ordering, which is not ideal when there is an arbitrary number of agents. We can also look at how well we can perform hypothetical rollouts by conditioning our predictions of other agents on ego’s future trajectories. We report these results in Table 3.

We next evaluated MFP on a much larger CARLA dataset with published benchmark results. This dataset consists of over 60K training sequences collected from two different towns in CARLA Rhinehart et al. (2019). We trained MFP (with and without LIDAR; with 3, 5, and 7 modes) on the Town01 training set for 300K updates. We report the minMSD metric (in meters) for 5 agents jointly. We compare our results with state-of-the-art results in Tables 4 and 5. Non-MFP results are reported from Rhinehart et al. (2019); Chai et al. (2019). Results in Table 5 are reported for the Town02 test set.

Table 4: Test performance (minMSD with ) comparisons on the Town01 and Town02 test sets. Methods compared: DESIRE, SocialGAN, R2P2-MA, ESP Rhinehart et al. (2019), ESP (no LIDAR), MultiPath Chai et al. (2019), and MFP7.
Table 5: MFP variations: MFP3, MFP5, MFP5 (no LIDAR), and MFP7.

NLL (nats)
Time     Cons vel.  CVGMM  Kuefler et al.  MATF  LSTM  S-LSTM  CS-LSTM(M)   MFP-1  MFP-2  MFP-3  MFP-4  MFP-5
1 sec.   3.72       2.02   -               -     1.17  1.01    0.89 (0.58)  0.73   -0.32  -0.58  -0.65  -0.45
2 sec.   5.37       3.63   -               -     2.85  2.49    2.43 (2.14)  2.33   1.43   1.26   1.19   1.36
3 sec.   6.40       4.62   -               -     3.80  3.36    3.30 (3.03)  3.17   2.45   2.32   2.28   2.42
4 sec.   7.16       5.35   -               -     4.48  4.01    3.97 (3.68)  3.77   3.21   3.07   3.06   3.17
5 sec.   7.76       5.93   -               -     4.99  4.54    4.51 (4.22)  4.26   3.81   3.69   3.69   3.76

RMSE (m)
Time     Cons vel.  CVGMM  Kuefler et al.  MATF  LSTM  S-LSTM  CS-LSTM      MFP-1  MFP-2  MFP-3  MFP-4  MFP-5
1 sec.   0.73       0.66   0.69            0.66  0.68  0.65    0.61         0.54   0.55   0.54   0.54   0.55
2 sec.   1.78       1.56   1.51            1.34  1.65  1.31    1.27         1.16   1.18   1.17   1.16   1.18
3 sec.   3.13       2.75   2.55            2.08  2.91  2.16    2.09         1.90   1.92   1.91   1.89   1.92
4 sec.   4.78       4.24   3.65            2.97  4.46  3.25    3.10         2.78   2.80   2.78   2.75   2.78
5 sec.   6.68       5.99   4.71            4.13  6.27  4.55    4.37         3.83   3.85   3.83   3.78   3.80

Baselines: CVGMM Deo et al. (2018); Kuefler et al. (2017); MATF Zhao et al. (2019); S-LSTM Alahi et al. (2016); CS-LSTM Deo and Trivedi (2018).

Table 6: NGSIM prediction results. Highlighted columns are our results (lower is better). MFP-K: K is the number of latent modes. The standard error of the mean is computed over multiple trials. For multimodal MFPs, we report minRMSE over 5 samples. NLL can be negative as we are modeling a continuous density function.
(a) Merge-off scenario.
(b) Lane change left scenario.
Figure 5: Qualitative MFP-3 results after training on NGSIM data. Three modes: red, purple, and green are shown as density contour plots for the blue vehicle. Grey vehicles are other agents. Blue path is past trajectory, orange path is actual future ground truth. Grey pixels form a heatmap of frequently visited paths. Additional visualizations provided in the supplementary materials.

4.2 NGSIM

Next Generation Simulation Colyar and Halkias (2007)(NGSIM) is a collection of video-transcribed datasets of vehicle trajectories on US-101, Lankershim Blvd. in Los Angeles, I-80 in Emeryville, CA, and Peachtree St. in Atlanta, Georgia. In total, it contains approximately minutes of vehicle trajectory data at Hz and consisting of diverse interactions among cars, trucks, buses, and motorcycles in congested flow.

We experiment with the US-101 and I-80 datasets, and follow the experimental protocol of Deo and Trivedi (2018), where the datasets are split into training, validation, and testing. We extract seconds trajectories, using the first seconds as history to predict seconds into the future.

In Table 6, we report both negative log-likelihood and RMSE errors on the test set. RMSE and other measures such as average/final displacement errors (ADE/FDE) are not good metrics for multimodal distributions and are only reported for MFP-1. For multimodal MFPs, we report minRMSE over 5 samples, which uses the ground truth to select the best trajectory and therefore could be overly optimistic. Note that this applies equally to other popular metrics such as minADE, minFDE, and minMSD.
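The optimism of "min" metrics is easy to see once they are written down. The sketch below uses common definitions of minADE and minFDE over a set of sampled trajectories; the exact evaluation protocol used for the tables may differ, and the function names are illustrative.

```python
import math


def displacement_errors(predicted, truth):
    """Average and final Euclidean displacement errors of one trajectory."""
    distances = [math.dist(p, q) for p, q in zip(predicted, truth)]
    return sum(distances) / len(distances), distances[-1]


def min_ade_fde(samples, truth):
    """Best average and best final error over a set of sampled trajectories.

    The ground truth is used to pick the best sample, which is why these
    metrics can look better than a deployed system could achieve.
    """
    errors = [displacement_errors(sample, truth) for sample in samples]
    return min(e[0] for e in errors), min(e[1] for e in errors)
```

Note that in this formulation the best average error and the best final error may come from different samples, which makes the reported pair even more favorable than any single predicted trajectory.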

The current state-of-the-art, multimodal CS-LSTM Deo and Trivedi (2018), requires a separate prediction of 6 fixed maneuver modes. As a comparison, MFP achieves significant improvements with fewer modes. Detailed evaluation protocols are provided in the supplementary materials. We also provide qualitative results on the different modes learned by MFP in Fig. 5. In the right panel, we can interpret the green mode as a fairly aggressive lane change while the purple and red modes are more “cautious”. Ablative studies showing the contributions of both interactive rollouts and dynamic attention encoding are also provided in the supplementary materials. We obtain the best performance with the combination of both interactive rollouts and dynamic attention encoding.

4.3 Argoverse Motion Forecasting

The Argoverse motion forecasting dataset is a large-scale trajectory prediction dataset of curated scenarios Chang et al. (2019). Each sequence is 5 seconds long in total and the task is to predict the next 3 seconds after observing 2 seconds of history.

Table 7: Argoverse Motion Forecasting. Performance on the validation set, measured by minADE (K=6, in meters). Methods compared: C.V., NN+map, LSTM+ED, LSTM ED+map, MFP3 (ver. 1.0), and MFP3 (ver. 1.1). CV: constant velocity. Baseline results are from Chang et al. (2019).

We performed preliminary experiments by training an MFP with 3 modes for 20K updates and compared it to the existing official baselines in Table 7. MFP hyperparameters were not selected for this dataset, so we expect MFP performance to improve with additional tuning. We report validation set performance on both version 1.0 and version 1.1 of the dataset.

4.4 Planning and Decision Making

The original intuitive motivation for learning a good predictor is to enable robust decision making. We now test this by creating a simple yet non-trivial reinforcement learning (RL) task in the form of an unprotected left turn. Situated in Town05 of the CARLA simulator, the objective is to safely perform an unprotected (no traffic lights) turn, see Fig. 6. Two oncoming vehicles have random initial speeds. Collisions incur a penalty while success yields a reward. There is also a small reward for higher velocity and the action space is acceleration along the ego agent’s default path (blue).

Using predictions to learn the policy is in the domain of model-based RL Sutton (1990); Weber et al. (2017). Here, MFP can be used in several ways: 1) we can generate imagined future rollouts and add them to the experiences from which temporal difference methods learn Sutton (1990), or 2) we can perform online planning by using a form of the shooting methods Betts (1998), which allows us to optimize over future trajectories. We perform experiments with the latter technique where we progressively train MFP to predict the joint future trajectories of all three vehicles in the scene. We find the optimal policy by leveraging the current MFP model and optimizing over ego’s future actions. We compare this approach to two strong model-free RL baselines: DDPG and Proximal Policy Optimization (PPO). In Fig. 7, we plot the reward vs. the number of environmental steps taken. In Table 8, we show that MFP-based planning is more robust to parameter variations in the testing environment.

Figure 6: RL learning environment - Unprotected left turn.

Figure 7: Learning curves as a function of step sizes.

Table 8: Testing crash rates per 100 trials. Test env. modifies the velocity & acceleration parameters. The Env. Params row labels are missing here; rows are listed in their original order.

DDPG   PPO   MFP
3%     4%    0%
8%     4%    0%
6%     15%   0%
3%     1%    0%

5 Discussions

In this paper, we proposed a probabilistic latent variable framework that facilitates the joint multi-step temporal prediction of an arbitrary number of agents in a scene. Leveraging the ability to learn latent modes directly from data and interactively rolling out the future with different point-of-view encodings, MFP demonstrated state-of-the-art performance on several vehicle trajectory datasets. For future work, it would be interesting to add a mix of discrete and continuous latent variables, as well as to train and validate on pedestrian or bicycle trajectory datasets.

Acknowledgements We thank Barry Theobald, Russ Webb, Nitish Srivastava, and the anonymous reviewers for making this a better manuscript. We also thank the authors of Deo and Trivedi (2018) for open sourcing their code and dataset.


  • A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese (2016) Social LSTM: human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971. Cited by: Table 6.
  • M. Bansal, A. Krizhevsky, and A. S. Ogale (2018) ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. CoRR abs/1812.03079. Cited by: §1, §2.
  • J. Bayer and C. Osendorfer (2014) Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610. Cited by: §3.
  • S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179. Cited by: §3.
  • J. T. Betts (1998) Survey of numerical methods for trajectory optimization. Journal of guidance, control, and dynamics 21 (2), pp. 193–207. Cited by: §4.4.
  • Y. Burda, R. Grosse, and R. Salakhutdinov (2015) Importance weighted autoencoders. arXiv preprint arXiv:1509.00519. Cited by: footnote 5.
  • S. Casas, W. Luo, and R. Urtasun (2018) IntentNet: learning to predict intention from raw sensor data. In Proceedings of The 2nd Conference on Robot Learning, A. Billard, A. Dragan, J. Peters, and J. Morimoto (Eds.), Proceedings of Machine Learning Research, Vol. 87, pp. 947–956. Cited by: §1, §2.
  • Y. Chai, B. Sapp, M. Bansal, and D. Anguelov (2019) MultiPath: multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449. Cited by: §1, §4.1, §4.1.
  • M. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al. (2019) Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8748–8757. Cited by: §4.3, Table 7, §4.
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: footnote 2.
  • J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio (2015) A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pp. 2980–2988. Cited by: §3.
  • J. Colyar and J. Halkias (2007) US highway 101 dataset. FHWA-HRT-07-030. Cited by: §4.2, §4.
  • H. Cui, V. Radosavljevic, F. Chou, T. Lin, T. Nguyen, T. Huang, J. Schneider, and N. Djuric (2018) Multimodal trajectory predictions for autonomous driving using deep convolutional networks. CoRR abs/1809.10732. Cited by: §1, §2, §2.
  • A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39, pp. 1–38. Cited by: §3.
  • N. Deo, A. Rangesh, and M. M. Trivedi (2018) How would surround vehicles move? a unified framework for maneuver classification and motion prediction. IEEE Transactions on Intelligent Vehicles 3 (2), pp. 129–140. Cited by: Table 6.
  • N. Deo and M. M. Trivedi (2018) Convolutional social pooling for vehicle trajectory prediction. CoRR abs/1805.06771. Cited by: §1, §2, §2, §4.2, §4.2, Table 6, §5.
  • A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16. Cited by: 3(a), §4.1, §4.
  • M. Fraccaro, S. K. Sønderby, U. Paquet, and O. Winther (2016) Sequential neural models with stochastic layers. In Advances in neural information processing systems, pp. 2199–2207. Cited by: §3.
  • A. Goyal, A. Sordoni, M. Côté, N. R. Ke, and Y. Bengio (2017) Z-forcing: training stochastic recurrent networks. In Advances in neural information processing systems, pp. 6713–6723. Cited by: §3.
  • K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra (2015) DRAW: a recurrent neural network for image generation. arXiv preprint arXiv:1502.04623. Cited by: §3.
  • T. Haarnoja, A. Ajay, S. Levine, and P. Abbeel (2016) Backprop KF: learning discriminative deterministic state estimators. CoRR abs/1605.07148. Cited by: §2.
  • T. Howard, M. Pivtoraiko, R. A. Knepper, and A. Kelly (2014) Model-predictive motion planning: several key developments for autonomous mobile robots. IEEE Robotics & Automation Magazine 21 (1), pp. 64–73. Cited by: §3.2.
  • A. Kuefler, J. Morton, T. Wheeler, and M. Kochenderfer (2017) Imitating driver behavior with generative adversarial networks. In 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 204–211. Cited by: Table 6.
  • N. Lee, W. Choi, P. Vernaza, C. Choy, P. H. S. Torr, and M. Chandraker (2017) DESIRE: distant future prediction in dynamic scenes with interacting agents. pp. 2165–2174. Cited by: §1, §2, §2.
  • S. Lefèvre, D. Vasquez, and C. Laugier (2014) A survey on motion prediction and risk assessment for intelligent vehicles. ROBOMECH Journal 1 (1), pp. 1. Cited by: §2.
  • Y. Ma, X. Zhu, S. Zhang, R. Yang, W. Wang, and D. Manocha (2018) TrafficPredict: trajectory prediction for heterogeneous traffic-agents. CoRR abs/1811.02146. Cited by: §1, §2.
  • R. M. Neal and G. E. Hinton (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pp. 355–368. Cited by: §3.
  • B. Paden, M. Cap, S. Z. Yong, D. Yershov, and E. Frazzoli (2016) A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Transactions on Intelligent Vehicles 1. Cited by: §1.
  • S. Park, B. Kim, C. M. Kang, C. C. Chung, and J. W. Choi (2018) Sequence-to-sequence prediction of vehicle trajectory via LSTM encoder-decoder architecture. CoRR abs/1802.06338. Cited by: §2.
  • N. Rhinehart, R. McAllister, K. M. Kitani, and S. Levine (2019) PRECOG: prediction conditioned on goals in visual multi-agent settings. CoRR abs/1905.01296. Cited by: §1, §4.1, §4.1.
  • C. Sun, P. Karlsson, J. Wu, J. B. Tenenbaum, and K. Murphy (2019) Stochastic prediction of multi-agent interactions from partial observations. CoRR abs/1902.09641. Cited by: §1, §3.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.
  • R. S. Sutton (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pp. 216–224. Cited by: §4.4.
  • S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al. (2006) Stanley: the robot that won the DARPA Grand Challenge. Journal of Field Robotics 23 (9), pp. 661–692. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.1.
  • N. Watters, D. Zoran, T. Weber, P. Battaglia, R. Pascanu, and A. Tacchetti (2017) Visual interaction networks: learning a physics simulator from video. In Advances in neural information processing systems, pp. 4539–4547. Cited by: §1.
  • T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, R. Pascanu, P. W. Battaglia, D. Silver, and D. Wierstra (2017) Imagination-augmented agents for deep reinforcement learning. CoRR abs/1707.06203. Cited by: §4.4.
  • G. Welch, G. Bishop, et al. (1995) An introduction to the Kalman filter. Cited by: §2.
  • T. Zhao, Y. Xu, M. Monfort, W. Choi, C. Baker, Y. Zhao, Y. Wang, and Y. N. Wu (2019) Multi-agent tensor fusion for contextual trajectory prediction. CoRR abs/1904.04776. Cited by: §1, Table 6.