I Introduction
Vehicle trajectory prediction is crucial for autonomous driving and advanced driver assistance systems. While existing literature focuses on improving the accuracy of prediction [13, 33, 15, 7, 1], the diversity of the predicted trajectories [21, 11] must also be explored. High accuracy implies a good approximation of the true distribution according to some performance metric, but emphasizing diversity allows prediction approaches to access low-probability but high-importance parts of the state space. Diverse trajectory sampling provides coverage of possible actions for surrounding vehicles, and facilitates safe motion planning and accurate behavior modeling for nearby vehicles in simulation. For instance, at an intersection, sampling distinct outcomes, such as left or right turns, rather than simply predicting going forward, provides benefits in verification. Different maneuvers can have radically different outcomes, and missing one of them can be catastrophic. Sampling efficiently proves difficult in such scenarios, as neither the distribution of trajectories nor the definition of semantically distinct outcomes has an analytical form. Additionally, expensive rollouts of a future trajectory are required to define its utility, which depends on the environment of the car and nearby agents.
In this paper, we propose a model that handles both accuracy and diversity by incorporating a latent semantic layer into the trajectory generation step. This layer should represent approximate high-level vehicle behaviors, matching semantic distinctions when they exist. We expect it to be effectively low-dimensional, since a driver can perform only a few distinct maneuvers at any given moment. Therefore, enumerating low-dimensional samples should be feasible; however, we wish to do so without forcing the driver's behaviors into a fixed taxonomy. We illustrate this idea in Figure 1, where the goal is to produce diverse trajectory predictions and cover distinct outcomes. The top row shows traditional sampling, which fails to sample diverse behaviors efficiently. The bottom row demonstrates our latent semantic sampling technique, which is able to capture both maneuvers in the intersection. We do so by shaping the notion of similarity in the intermediate layer activations via metric learning [32]. We train the latent semantic layer activations to match annotations of high-level labels where these exist. The distance between two trajectories should be large if they carry different semantic labels, and small otherwise.
In addition to prediction, our model can produce behavior samples for simulation and verification. Verification of safety properties for a given driving strategy is challenging, since it requires numerous simulations using predictive models instantiated over a large sampling space of initial agent conditions, road configurations, etc. A semantically meaningful, low-dimensional latent space enables efficient sampling of all possible behaviors, requiring fewer simulations to find rare events that affect safety (e.g., collisions between cars).
Finally, our proposed latent state affords some interpretation of the network, which is crucial in safety-critical tasks such as autonomous driving. By tuning the high-level latent state, our samples better cover human intuition about diverse outcomes.
Our work has three main contributions. i) We extend a generative adversarial network to produce diverse and realistic future vehicle trajectories. We process the noise samples into two independent latent vectors, utilizing loss functions to disentangle them. The high-level vector captures semantic properties of trajectories, while the low-level vector maintains spatial and social context. ii) We describe an efficient sampling method to cover the possible future actions, which is important for safe motion planning and realistic behavior modeling in simulation. iii) We validate our approach on a publicly available dataset with vehicle trajectories collected in urban driving. Quantitative results show our method outperforming state-of-the-art approaches, while in qualitative scenarios it efficiently generates diversified trajectories.
The remainder of the paper is organized as follows. We introduce relevant work in Section I-A, and our problem formulation and proposed method in Section II. We demonstrate results in vehicle motion prediction in Section III, followed by a summary and a discussion of future work in Section IV.
I-A Related Work
Our work relates to several topics in probabilistic trajectory prediction. Unlike deterministic alternatives [13], it allows us to reason about the uncertainty of drivers' behaviors. Several representations underlie reasoning about trajectories: [33, 15, 16, 28, 4] predict future vehicle trajectories as Gaussian mixture models, whereas [18] utilizes a grid-based map. In our work, we focus on generating trajectory samples directly from an approximated distribution space, using a sequential network, similar to [21, 11]. For longer-term prediction horizons, additional context cues are needed from the driving environment. Spatial context, such as mapped lanes, not only indicates the possible options a vehicle may take (especially at intersections), but also improves prediction accuracy, as vehicles usually follow lane centers closely [7, 5]. Another important cue is social context based on nearby agents, which affords reasoning about interaction among agents [1, 11, 23, 16]. Our method takes advantage of both cues by feeding map data and nearby agent positions into our model, improving the accuracy of predictions over a few seconds.
Recently proposed generative adversarial networks (GANs) can sample trajectories by utilizing a generator of vehicle trajectories and a discriminator that distinguishes real trajectories from those produced by the generator [10, 11, 23, 22]. Despite their success, efficiently producing unlikely events, such as lane changes and turns, remains a challenge. These events are important to consider, as they can pose a significant risk and affect driving decisions.
Hybrid maneuver-based models [8] are effective in producing distinct vehicle behaviors. They first classify maneuvers based on vehicle trajectories, and then predict future positions conditioned on a maneuver. As such, they are restricted to cases where predefined maneuvers are well defined. Similar to [28], our method handles more general cases with undefined semantics, including multi-vehicle interactions. Beyond prediction, recent learning models use an intermediate representation in probabilistic network models to improve sample efficiency and coverage. [28] utilizes a set of discrete latent variables to represent different driver intentions and behaviors. [27] has shown that semantics exist in the latent space of generative adversarial networks (GANs), and [6] successfully decomposes the latent factors in a GAN into structured semantic parts. Beyond GANs, [14] has learned disentangled latent representations in a variational autoencoder (VAE) framework to ground spatial relations between objects. Unlike the information-bottleneck motivation of [6], we use metric learning [32] to capture information such as maneuvers and interactions. The low dimensionality of the semantic space allows us to obtain distinct vehicle behaviors efficiently. In a related work, [30] proposes to generate samples in a potential field learned by the discriminator, to approximate the real probability distribution of the data accurately and to ensure sample diversity.
Finally, our work has applications to sampling and estimation of rare events for verification, which is an active field of its own; see [26, 3, 19, 20, 24] and references therein. The closest works to ours are [24, 19], which also propose sample-based estimation of probabilities. As opposed to probability estimation under standard driving, our work focuses explicitly on sampling from diverse modes of behavior.

II Model
Here, we present the problem formulation and describe the model underlying our work, including loss functions and our proposed sampling procedure.
II-A Problem Formulation
The input to the trajectory prediction problem includes a sequence of observed vehicle positions X = \{x_{t-T+1}, \dots, x_t\}, as well as the surrounding lanes, given as their centerline coordinates and denoted \mathcal{M}. The goal is to predict a set of N possible future trajectories \{\hat{Y}^{(i)}\}_{i=1}^{N}, where the acausal (ground-truth) future trajectory is denoted Y = \{y_{t+1}, \dots, y_{t+H}\}.
In the probabilistic setting, since multiple future trajectories are possible, the goal is to estimate the predicted probability distribution P(Y \mid X, \mathcal{M}). Many modern approaches sample from P(Y \mid X, \mathcal{M}) in the absence of a closed-form expression for it, requiring some form of sample generation, such as traditional MCMC methods and particle filters [17], planning-based approaches such as RRTs [2], and GANs and other probabilistic generative networks [21, 1].
II-B Model Overview
We now describe the network structure and sampling approach, as illustrated in Figure 2. The trajectory generator takes the past trajectory of target vehicles, a map of lane centerlines, and a noise sample, and produces samples of future trajectories. The discriminator identifies whether the generated trajectory is realistic.
In addition to the generator and discriminator networks, we require a source of semantic labels for trajectories. These labels can include maneuvers such as merging, turning, or slowing down, or interaction patterns such as giving right of way or taking turns at a four-way-stop junction. For simplicity, these labels may be boolean or unknown values, arranged into a vector s with elements s^k \in \{0, 1, \perp\}, where \perp denotes that label k is unknown or undefined. We stress that in some instances, certain labels do not make sense at all. For example, the labels "the vehicle is next at a stop-sign intersection" and "the vehicle is waiting at a red light" cannot coexist. This motivates a representation that avoids a single taxonomy of all road situations with definite semantic values.
II-C Trajectory Generator
The trajectory generator predicts realistic future vehicle trajectories given inputs of the past trajectories and the map information. It embeds the two inputs before sending them into a long short-term memory (LSTM) network encoder that captures both the spatial and temporal aspects of the inputs. The encoder output is combined with a noise vector generated from a standard normal distribution, and fed into a latent network that separates the information into a high-level vector z_h and a low-level vector z_l. The decoder, taking these two vectors, produces the trajectory samples.
II-C1 Trajectory Network
A series of fully connected layers that embeds spatial coordinates into a trajectory embedding vector [1].
II-C2 Map Network
In order to simplify the task of learning to interact with the map, we use the following representation for the lanes. First, we find the nearest point to the vehicle on each lane at the prediction time. Second, we traverse each lane starting at its nearest point to generate an arc-length-parameterized curve, and compute polynomial coefficients of the curve up to second order. Third, we form monomials of the arc length traveled by the target vehicle, using the vehicle velocity over one and two sampling time steps. Last, we feed the resulting products to the encoder and discriminator, allowing them to learn lane behavior.
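A minimal sketch of this lane representation in numpy may help fix ideas. The exact polynomial order is from the text, but the function name, array shapes, and the precise monomial set are assumptions:

```python
import numpy as np

def lane_features(lane_xy, vehicle_xy, v, dt=0.1):
    """Illustrative arc-length-parameterized lane features.

    lane_xy: (M, 2) centerline points; vehicle_xy: (2,) vehicle position;
    v: vehicle speed (m/s); dt: sampling time step (s).
    """
    # 1) nearest centerline point to the vehicle
    d = np.linalg.norm(lane_xy - vehicle_xy, axis=1)
    i0 = int(np.argmin(d))
    ahead = lane_xy[i0:]
    # 2) arc-length parameterization of the lane from that point onward
    seg = np.diff(ahead, axis=0)
    s = np.concatenate([[0.0], np.cumsum(np.linalg.norm(seg, axis=1))])
    # fit x(s), y(s) with second-order polynomials
    cx = np.polyfit(s, ahead[:, 0], 2)
    cy = np.polyfit(s, ahead[:, 1], 2)
    # 3) monomials of the arc length traveled after 1 and 2 time steps
    feats = []
    for k in (1, 2):
        ds = v * k * dt
        feats.extend([ds, ds ** 2])
    return np.concatenate([cx, cy, feats])
```

A straight lane yields a near-linear fit (x(s) ≈ s, y(s) ≈ 0), so the feature vector is dominated by the linear coefficient and the velocity monomials.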
II-C3 Encoder
A series of LSTM units processes the spatial and map embedding vectors over the observed time steps. The output is a hidden vector that stores the relevant information up to the current time step.
II-C4 Latent Network
A series of fully connected layers takes the encoder's hidden vector and a noise sample from a standard normal distribution. The outputs are two activation vectors: a vector z_h that represents high-level information such as maneuvers, and a vector z_l that represents low-level information such as vehicle dynamics. To sample efficiently from z_h at test time, z_h is designed to be much smaller than z_l. We train the vectors to be uncorrelated, with z_h matching semantic labels in terms of distances between samples. This representation disentangles semantic concepts from low-level trajectory information, in a fashion resembling information bottlenecks [6], but driven by human notions of semantic similarity as learned from the labels.
II-C5 RNN-based decoder
A series of LSTM units takes z_h, z_l, and a map embedding vector, and outputs a sequence of future vehicle positions.
II-D Trajectory Discriminator
An LSTM-based encoder converts the past trajectory and future predictions into a label \in \{fake, real\}, where fake means the trajectory was generated by our predictor, and real means the trajectory comes from data. The structure of the discriminator mirrors that of the trajectory encoder, except in its output dimensionality.
II-E Losses
Similar to [11], we measure the performance of our model using the average displacement error (ADE) of Equation 1 and the final displacement error (FDE) of Equation 2.
\mathrm{ADE} = \frac{1}{H} \sum_{\tau=1}^{H} \left\| \hat{y}_{t+\tau} - y_{t+\tau} \right\|_2   (1)

\mathrm{FDE} = \left\| \hat{y}_{t+H} - y_{t+H} \right\|_2   (2)

where \hat{y}_{t+\tau} and y_{t+\tau} denote the predicted and ground-truth positions \tau steps ahead, and H is the prediction horizon.
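The two metrics can be computed directly from predicted and ground-truth position arrays; a sketch (array shapes are assumptions):

```python
import numpy as np

def ade(pred, gt):
    """Average displacement error over the horizon.
    pred, gt: (H, 2) arrays of predicted / ground-truth positions."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

def fde(pred, gt):
    """Final displacement error at the last prediction step."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))
```

For a prediction offset from the ground truth by a constant (0.3, 0.4) meters at every step, both metrics equal 0.5.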
II-E1 Best prediction displacement loss
Also as in [11], we compute the Minimum-over-N (MoN) loss to encourage the model to cover ground-truth options while maintaining diversity in its predictions:
\mathcal{L}_{\mathrm{MoN}} = \mathbb{E}\left[ \min_{i=1,\dots,N} \mathrm{ADE}\big(\hat{Y}^{(i)}, Y\big) \right]   (3)
where \hat{Y}^{(i)} are samples generated by our model. The loss, over N samples from the generator, is computed as the average distance between the best predicted trajectory and the acausal trajectory. Although minimizing the MoN loss leads to a diluted probability density function compared to the ground truth [29], we use it to show that our method can estimate an approximate distribution efficiently. We defer a different, more accurate supervisory cue to future work.

II-E2 Adversarial loss
We use a standard binary cross-entropy loss, \mathcal{L}_{\mathrm{BCE}}, to compute the loss between the discriminator outputs and the labels. This loss is used to encourage diversity in predictions, and is assigned a higher weight once the best prediction displacement loss has been reduced to a reasonable scale.
II-E3 Independence loss
The independence loss enforces that the cross-covariance between the two latent vectors z_h and z_l remains small, encouraging z_l to hold only low-level information. While this does not guarantee independence of the two, we found it to suffice as regularization.
\mathcal{L}_{I} = \big\| \widehat{\mathrm{Cov}}(z_h, z_l) \big\|_F^2   (4)
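A sketch of the batch cross-covariance penalty (the squared-Frobenius form and function name are assumptions; the paper only states that the cross-covariance is driven small):

```python
import numpy as np

def independence_loss(zh, zl):
    """Squared Frobenius norm of the empirical cross-covariance between
    high-level codes zh (B, dh) and low-level codes zl (B, dl)."""
    zh_c = zh - zh.mean(axis=0)
    zl_c = zl - zl.mean(axis=0)
    cov = zh_c.T @ zl_c / zh.shape[0]  # (dh, dl) cross-covariance matrix
    return float(np.sum(cov ** 2))
```

Perfectly correlated batches are penalized, while a constant (zero-variance) second code incurs no penalty.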
II-E4 Latent space regularization loss
The latent loss regularizes z_h and z_l in terms of their mean and variance, and helps to avoid degenerate solutions.
\mathcal{L}_{R} = \|\mu_h\|_2^2 + \|\mu_l\|_2^2 + \|\Sigma_h - I\|_F + \|\Sigma_l - I\|_F   (5)
where \|\cdot\|_F denotes the Frobenius norm, \mu and \Sigma denote the batch mean and covariance of each latent vector, and I is the identity matrix.
II-E5 Embedding loss
After enforcing that z_h and z_l are independent vectors, we introduce an embedding loss to enforce the correlation between the high-level latent vector z_h and the semantic label vector s. Similar to [25], if two data samples have the same answer for label k, we expect the difference in their high-level latent vectors to be small. On the other hand, if two samples have different answers, we encourage the difference to be large. This can be written as
\mathcal{L}_{E} = \sum_{k} \sum_{i,j=1}^{B} \mathbb{1}\big[s_i^k, s_j^k\big] \Big( \mathbb{1}\big[s_i^k = s_j^k\big]\, d_{ij}^2 + \mathbb{1}\big[s_i^k \neq s_j^k\big] \max\big(0, m - d_{ij}\big)^2 \Big), \quad d_{ij} = \|z_{h,i} - z_{h,j}\|_2   (6)
where B is the batch size, s_i^k and s_j^k denote the answers for label k on examples i and j respectively, m is a margin, and \mathbb{1}[\cdot,\cdot] = 0 if either argument is \perp (and 1 otherwise).
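The pull/push structure with the unknown-label mask can be sketched as follows (a contrastive-loss sketch; the margin value and the encoding of the unknown label are assumptions):

```python
import numpy as np

UNKNOWN = -1  # stands in for the "unknown" label value (⊥ in the text)

def embedding_loss(zh, labels, margin=1.0):
    """Contrastive metric-learning loss over high-level codes.

    zh: (B, d) high-level latent vectors.
    labels: (B, K) ternary labels in {0, 1, UNKNOWN}.
    Pairs that share a known label value are pulled together; pairs with
    differing known values are pushed at least `margin` apart. Pairs where
    either label is UNKNOWN are skipped entirely.
    """
    B, K = labels.shape
    total = 0.0
    for i in range(B):
        for j in range(i + 1, B):
            d = np.linalg.norm(zh[i] - zh[j])
            for k in range(K):
                a, b = labels[i, k], labels[j, k]
                if a == UNKNOWN or b == UNKNOWN:
                    continue  # label undefined for one of the pair
                if a == b:
                    total += d ** 2                      # pull together
                else:
                    total += max(0.0, margin - d) ** 2   # push apart
    return total
```

Note that a pair with an unknown label contributes nothing, so labels that "do not make sense" for a situation never distort the embedding.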
II-E6 Total loss
In total, we combine the losses listed above with appropriate coefficients that are adjusted dynamically during training.
\mathcal{L}_{G} = \lambda_{1}\,\mathcal{L}_{\mathrm{MoN}} + \lambda_{2}\,\mathcal{L}_{\mathrm{adv}} + \lambda_{3}\,\mathcal{L}_{I} + \lambda_{4}\,\mathcal{L}_{R} + \lambda_{5}\,\mathcal{L}_{E}   (7)

\mathcal{L}_{D} = \mathcal{L}_{\mathrm{BCE}}\big(D(X, Y), \mathrm{real}\big) + \mathcal{L}_{\mathrm{BCE}}\big(D(X, \hat{Y}), \mathrm{fake}\big)   (8)
II-F Sampling Approach
We now describe how we sample from the space of z_h in Alg. 1. We generate a set of latent samples, selecting from them a subset of representatives using the farthest point sampling (FPS) algorithm [9, 12]. We store the nearest-representative identity as we compute the distances, in order to augment each FPS representative with a weight proportional to its Voronoi cell. This gives us a weighted set of samples that converges to the original distribution, but favors samples from distinct regions of the space. FPS allows us to emphasize samples that represent distinct high-level maneuvers encoded in z_h.
The samples cover (in the sense of an \epsilon-covering) the space of possible high-level choices. The high-level latent space is shaped according to human labels of similarity. With this similarity-metric shaping, FPS techniques can leverage their optimal distance-coverage property in order to capture the majority of semantically different rollouts in just a few samples. (We note that a modified FPS [31] can trade off mode-seeking with coverage-seeking when generating samples.)
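The weighted FPS step described above can be sketched in a few lines of numpy (the function name, the arbitrary choice of the first representative, and the uniform-count Voronoi weighting are assumptions):

```python
import numpy as np

def weighted_fps(z, k):
    """Select k representatives from latent samples z (n, d) by farthest
    point sampling, weighting each representative by the fraction of
    samples whose nearest representative it is (its empirical Voronoi cell).
    Returns (representatives, weights)."""
    n = z.shape[0]
    reps = [0]                              # start from an arbitrary sample
    d = np.linalg.norm(z - z[0], axis=1)    # distance to nearest representative
    nearest = np.zeros(n, dtype=int)        # index into reps for each sample
    for _ in range(k - 1):
        nxt = int(np.argmax(d))             # farthest point from current reps
        reps.append(nxt)
        d_new = np.linalg.norm(z - z[nxt], axis=1)
        closer = d_new < d
        nearest[closer] = len(reps) - 1     # reassign to the new representative
        d = np.minimum(d, d_new)
    weights = np.bincount(nearest, minlength=k) / n
    return z[reps], weights
```

On two well-separated clusters, k = 2 picks one representative per cluster, and the weights recover each cluster's mass, which is exactly the behavior the sampling procedure relies on: distinct regions are covered first, while the weights keep the sample set consistent with the underlying distribution.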
III Results
In this section, we describe the details of our model and dataset, followed by a set of quantitative results against state-of-the-art baselines and qualitative results on diverse prediction.
III-A Model Details
The Trajectory Network utilizes two stacked linear layers with dimensions (32, 32). The Map Network uses four stacked linear layers with dimensions (64, 32, 16, 32). An LSTM with one layer and a hidden dimension of 64 forms both the Encoder and the Decoder in the Trajectory Generator. The Latent Network takes inputs from the Encoder and a noise vector of dimension 10. This network is composed of two individual linear layers with output dimensions of 3 and 71 for the high-level and low-level layers, respectively. The Discriminator is an LSTM with the same structure as the Generator's Encoder, followed by a series of stacked linear layers with dimensions (64, 16, 1), with a sigmoid activation at the end. All linear layers in the Generator are followed by batch normalization, ReLU, and dropout layers. The linear layers in the Discriminator use a leaky-ReLU activation instead. The number of samples N we use for the MoN loss is 5.

III-B Semantic Annotations
In order to test our embedding over a large-scale dataset, we devised a set of classifiers for the data as surrogates for human annotations. Each checks for a specific high-level trajectory feature and outputs a ternary bit representing whether the feature exists, does not exist, or is unknown; together they form a vector with one element per filter. The list of feature filters used in this paper includes: accelerate, decelerate, turn left, turn right, lane follow, lane change, move left laterally, and move right laterally.
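As an illustration, one such surrogate filter might derive a turn label from the accumulated heading change of a trajectory. The thresholds, the ternary encoding, and the function name below are assumptions, not the paper's actual filters:

```python
import numpy as np

UNKNOWN = -1  # ternary labels: 1 = feature present, 0 = absent, UNKNOWN

def turn_left_filter(traj, min_speed=1.0, thresh_rad=0.3, dt=0.1):
    """Ternary 'turn left' label from a trajectory of (n, 2) positions.

    Returns UNKNOWN when the vehicle is too slow for heading to be well
    defined; otherwise 1 if the total heading change exceeds the threshold
    counter-clockwise, else 0.
    """
    v = np.diff(traj, axis=0) / dt
    speed = np.linalg.norm(v, axis=1)
    if np.mean(speed) < min_speed:
        return UNKNOWN  # heading is unreliable at near-zero speed
    heading = np.unwrap(np.arctan2(v[:, 1], v[:, 0]))
    return int(heading[-1] - heading[0] > thresh_rad)
```

The UNKNOWN branch is what feeds the \perp values that the embedding loss skips: a stationary vehicle has no meaningful turn direction, so the filter declines to label it.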
III-C Quantitative Results
III-C1 Prediction
Over 1- and 3-second prediction horizons, with N = 5 samples, we compute the MoN ADE (1) and FDE (2) losses, respectively. In addition to our method, we introduce several baseline models to demonstrate the prediction accuracy of our method. The first two baselines are linear Kalman filters with a constant velocity (CV) model and a constant acceleration (CA) model, respectively; we sample multiple trajectories given the smoothing uncertainties. The third baseline is an LSTM-based encoder-decoder model [5], which produces deterministic predictions. In addition, we introduce several variants of a vanilla GAN-based model taking different input features, where social contains the positions of nearby agents and map contains the nearby lane information as described in II-C2. The results are summarized in Table I. The first two rows indicate that physics-based models can produce predictions with reasonable accuracy. Using only five samples, the CV Kalman filter outperforms a deterministic deep model, with results shown in the third row. The rest of the table shows that a generative adversarial network improves accuracy by a large margin compared to physics-based models using five samples. We observe that the map features contribute more to long-horizon predictions. Additionally, our method is competitive with standard ones, after regularizing the latent space, while adding sample diversification.

TABLE I: MoN ADE and FDE over 1- and 3-second prediction horizons (N = 5 samples).

Model Name                        1 Second        3 Seconds
                                  ADE    FDE      ADE    FDE
Kalman Filter (CV)                0.51   0.79     1.63   3.62
Kalman Filter (CA)                0.69   1.22     2.87   7.08
LSTM Encoder Decoder              0.57   0.94     1.81   4.13
GAN                               0.42   0.62     1.55   3.09
GAN+social                        0.44   0.66     1.68   3.04
GAN+social+map                    0.44   0.63     1.34   2.75
DiversityGAN+social+map           0.41   0.65     1.35   2.74
DiversityGAN(FPS)+social+map      0.44   0.62     1.33   2.72
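The constant-velocity baseline in the first row can be sketched as a linear rollout with sampled velocity noise. This is only an illustrative stand-in: the paper's actual Kalman smoother settings are not specified, and the function name and noise model are assumptions:

```python
import numpy as np

def cv_predict(x, v, horizon, dt=0.1, sigma=0.0, n_samples=1, seed=0):
    """Constant-velocity rollout from position x (2,) and velocity v (2,).

    Returns (n_samples, horizon, 2) position samples; sigma perturbs the
    velocity per sample with Gaussian noise, standing in for the smoothing
    uncertainty of a full Kalman filter.
    """
    rng = np.random.default_rng(seed)
    steps = np.arange(1, horizon + 1)[:, None] * dt  # (horizon, 1) elapsed times
    out = []
    for _ in range(n_samples):
        v_s = v + rng.normal(0.0, sigma, size=2)
        out.append(x + steps * v_s)
    return np.stack(out)
```

With sigma = 0 this reduces to a deterministic straight-line prediction, which is exactly the failure mode at intersections that motivates diverse sampling.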
To show the effectiveness of our latent sampling approach, we measure the MoN loss with and without the FPS method. We test on a challenging subset of the validation dataset that filters out straight, constant-velocity driving scenarios, resulting in a trajectory distribution that emphasizes rare events in the data. As indicated in Figure 3, as the number of samples increases, the prediction loss using FPS drops faster than with direct sampling. We note that the improvement is larger in the regime of few samples, where reasoning about full rollouts of multiple hypotheses is still practical in real-time systems. Beyond the gain in average accuracy, however, the importance of the method is that it obtains samples from the additional modes of the distribution of trajectories. We demonstrate the advantage of our method with a small number of samples in Section III-D.
III-D Qualitative Results

We first show how FPS can be used to improve both prediction accuracy and diversity coverage by illustrating two examples in Figure 4.
In the first example, illustrated in Figure 4(a), our method, as described in Algorithm 1, first generates samples (in grey), then selects samples using FPS (highlighted in the left column) and direct sampling (highlighted in the right column) to produce predictions. By selecting samples that are farther apart, FPS is able to produce rare events such as a right turn, as labelled in 2, that match the acausal trajectory and thus improve prediction accuracy. Direct sampling, on the other hand, tends to draw points from denser regions, which leads to high-likelihood events. We show two additional challenging examples in Figure 5(a), where FPS reduces the prediction error by covering turning events when the vehicle approaches an off-ramp and a full intersection, respectively.
In the second example, illustrated in Figure 4(b), our method predicts rare events that do not improve the displacement losses compared to direct sampling, but are still important for decision making and risk estimation. Although the target vehicle is most likely to go forward, it is useful for our predictor to cover lane-change behavior, as labelled in 1, even with low likelihood, since such a prediction could help avoid a possible collision if our ego car is driving in the right lane. Similarly, in the other two examples shown in Figure 5(b), our method produces events such as merging and turning that are unlikely to happen but are important for robust and safe decision making for the ego car.
IV Conclusion
We propose a vehicle motion prediction method that caters to both prediction accuracy and diversity. We achieve this by dividing a latent variable into a learned semantic-level part encoding discrete options that the target vehicle can possibly take, and a low-level part encoding other information. The method is demonstrated to achieve state-of-the-art prediction accuracy, while efficiently obtaining trajectory coverage by near-optimal sampling of the high-level latent vector. Future work includes adding more complicated semantic labels such as vehicle interactions, and exploring other sampling methods beyond FPS.
References

[1] (2016) Social LSTM: human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971.
[2] (2011) Mobile agent trajectory prediction using bayesian nonparametric reachability trees. In Infotech@Aerospace 2011, pp. 1512.
[3] (2013) Introduction to rare event simulation. Springer Science & Business Media.
[4] (2019) MultiPath: multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449.
[5] (2019) Argoverse: 3D tracking and forecasting with rich maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8748–8757.
[6] (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
[7] (2019) Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In 2019 IEEE International Conference on Robotics and Automation (ICRA), pp. 2090–2096.
[8] (2018) Multi-modal trajectory prediction of surrounding vehicles with maneuver based LSTMs. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1179–1184.
[9] (1985) Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38, pp. 293–306.
[10] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
[11] (2018) Social GAN: socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2255–2264.
[12] (1985) A best possible heuristic for the k-center problem. Mathematics of Operations Research 10 (2), pp. 180–184.
[13] (2013) Vehicle trajectory prediction based on motion model and maneuver recognition. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4363–4369.
[14] (2019) Disentangled relational representations for explaining and learning from demonstration. arXiv preprint arXiv:1907.13627.
[15] (2019) Uncertainty-aware driver trajectory prediction at urban intersections. In 2019 IEEE International Conference on Robotics and Automation (ICRA), pp. 9718–9724.
[16] (2019) The Trajectron: probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[17] (2016) Intent-aware long-term prediction of pedestrian motion. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 2543–2549.
[18] (2017) Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network. In 2017 IEEE International Conference on Intelligent Transportation Systems (ITSC), pp. 399–404.
[19] (2019) Efficient autonomy validation in simulation with adaptive stress testing. arXiv preprint arXiv:1907.06795.
[20] (2019) Computationally efficient safety falsification of adaptive cruise control systems. In 2019 IEEE International Conference on Intelligent Transportation Systems (ITSC).
[21] (2017) DESIRE: distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 336–345.
[22] (2019) Conditional generative neural system for probabilistic trajectory prediction. arXiv preprint arXiv:1905.01631.
[23] (2019) Interaction-aware multi-agent tracking and probabilistic behavior prediction via adversarial learning. In 2019 IEEE International Conference on Robotics and Automation (ICRA), pp. 6658–6664.
[24] (2019) A scalable risk-based framework for rigorous autonomous vehicle evaluation.
[25] (2017) Hybrid control and learning with coresets for autonomous vehicles. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6894–6901.
[26] (2001) Combinatorial optimization, cross-entropy, ants and rare events. In Stochastic Optimization: Algorithms and Applications, pp. 303–363.
[27] (2019) Interpreting the latent space of GANs for semantic face editing. arXiv preprint arXiv:1907.10786.
[28] (2019) Multiple futures prediction. In Advances in Neural Information Processing Systems (NeurIPS).
[29] (2019) Analyzing the variety loss in the context of probabilistic trajectory prediction. arXiv preprint arXiv:1907.10178.
[30] (2018) Coulomb GANs: provably optimal nash equilibria via potential fields. In 2018 International Conference on Learning Representations (ICLR).
[31] (2015) Coresets for visual summarization with applications to loop closure. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 3638–3645.
[32] (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10, pp. 207–244.
[33] (2012) Probabilistic trajectory prediction with gaussian mixture models. In 2012 IEEE Intelligent Vehicles Symposium (IV), pp. 141–146.