1 Introduction
Trajectory forecasting has long been of great interest in autonomous driving, since accurate predictions of the future trajectories of traffic agents are essential for the safe motion planning of an autonomous vehicle (AV). Many approaches to trajectory forecasting have been proposed in the literature, and remarkable progress has been made in recent years. The recent trend is to predict multiple possible trajectories for each agent in the traffic scene. This is because human drivers’ future behavior is uncertain and, consequently, the future motion of an agent naturally exhibits a multimodal distribution.
Latent variable models, such as variational autoencoders (VAEs) [19] and generative adversarial networks (GANs) [13], have been used for modeling the distribution over the agents’ future trajectories. Using latent variables, trajectory forecasting models can learn to capture agent-agent and agent-space interactions from data, and consequently, generate future trajectories that are compliant with the input scene contexts.
VAEs have been applied in many machine learning applications, including image synthesis [15, 33], language modeling [3, 34], and trajectory forecasting [21, 5] because they are theoretically elegant, easy to train, and have nice manifold representations. One of the limitations of VAEs is that the generated sample tends to be blurry (especially in image reconstruction and synthesis tasks) [37]. We found from our experiments that a similar problem often arises in VAE-based trajectory forecasting models. More specifically, it is often found that the generated trajectory is located between adjacent lanes as illustrated in Figure 1. These false positive motion forecasts can cause uncomfortable rides for the AV with plenty of sudden brakes and steering changes [6]. In the rest of this paper, we will refer to this problem as mode blur, as instance-level lanes are closely related to the modes of the trajectory distribution [16]. Mode blur is also found in the recent SOTA model [8] as shown in supplementary materials.
Many approaches have been proposed to mitigate the blurry sample generation problem, primarily for image reconstruction or synthesis tasks. In this paper, we introduce a hierarchical latent structure into a VAE-based forecasting model to mitigate mode blur. Based on the assumption that the trajectory distribution can be approximated as a mixture of simple distributions (or modes), the low-level latent variable is employed to model each mode of the mixture and the high-level latent variable is employed to represent the weights for the modes. As a result, the forecasting model is capable of generating clear multimodal trajectory distributions. To model each mode accurately, we condition the low-level latent variable using two lane-level context vectors (one corresponds to vehicle-lane interaction (VLI) and the other to vehicle-vehicle interaction (V2I)) computed in novel ways. The context vectors are also used to model the weights via the proposed mode selection network.
Lastly, we also introduce two techniques to further improve the prediction performance of our model: 1) positional data preprocessing and 2) GAN-based regularization. The preprocessing is introduced based on the fact that vehicles moving along a lane usually try to be parallel to the tangent vector of the lane. The regularization is intended to ensure that the proposed model generates trajectories that match the shape of the lanes well.
In summary, our contributions are as follows:

The hierarchical latent structure is introduced in the VAE-based forecasting model to mitigate mode blur.

Two context vectors (one corresponds to the VLI and the other to the V2I) calculated in novel ways are proposed for lane-level scene contexts.

Positional data preprocessing and GAN-based regularization are introduced to further improve the prediction performance.

Our forecasting model outperforms the SOTA models in terms of prediction accuracy on two large-scale real-world datasets.
2 Related Works
2.1 Limitations of VAEs
The VAE framework has been used to explicitly learn data distributions. The models based on the VAE framework learn mappings from samples in a dataset to points in a latent space and generate plausible samples from variables drawn from the latent space. The VAE-based generative models are known to suffer from two problems: 1) posterior collapse (the models ignore the latent variable when generating samples) and 2) blurry sample generation. To mitigate these problems, many approaches have been proposed in the literature, primarily for image reconstruction or synthesis tasks [29, 15, 11, 18, 28, 14, 36, 33]. In trajectory forecasting, some researchers [5, 31] have employed techniques for mitigating posterior collapse. To mitigate blurry sample generation, [2] proposed a “best-of-many” sample objective that leads to accurate and diverse trajectory generation.
2.2 Forecasting with Lane Geometry
Because the movement of vehicles on the road is greatly restricted by the lane geometry, many works have been proposed to utilize the lane information provided by High-Definition (HD) maps [9, 5, 31, 24, 10, 27, 12, 23, 16, 26]. There are two types of approaches to the representation of the lane information: 1) rasterizing the components of the HD maps on a 2D canvas to obtain the top-view images of the HD maps, 2) representing each component of the HD maps as a series of coordinates of points. In general, a Convolutional Neural Network (CNN) is utilized for the former case while a Long Short-Term Memory (LSTM) network or a 1D-CNN is utilized for the latter case to encode the lane information. In this paper, we adopt the second approach. The centerline of each lane in the HD maps is first represented as a series of equally spaced 2D coordinates and then encoded by an LSTM network. The ability to handle individual lanes in the HD maps allows us to calculate lane-level scene contexts.
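The representation of a centerline as equally spaced 2D coordinates can be illustrated with a minimal sketch. It assumes linear interpolation along arc length; the paper does not state how the resampling is done, and the function name `resample_polyline` is hypothetical.

```python
import math

def resample_polyline(points, num_points):
    """Return `num_points` (>= 2) points equally spaced in arc length along a 2D polyline."""
    # Cumulative arc length at each original vertex.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]

    out, seg = [], 0
    for k in range(num_points):
        target = total * k / (num_points - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        r = 0.0 if span == 0.0 else (target - cum[seg]) / span
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        out.append((x0 + r * (x1 - x0), y0 + r * (y1 - y0)))
    return out

# Illustrative centerline with unevenly spaced vertices (made-up values).
centerline = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (4.0, 4.0)]
resampled = resample_polyline(centerline, 5)
```

The resampled sequence has a fixed length regardless of how the map stores the lane, which is convenient when every lane is fed to the same sequence encoder.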
2.3 Lane-level Scene Context
Since instance-level lanes are closely related to the modes of the trajectory distribution, recent works [24, 16, 26, 10] proposed calculating lane-level scene contexts and using them for generating trajectories. Our work shares the idea with the previous works. However, ours differs from them in the way it calculates the lane-level scene contexts, which leads to significant gains in the prediction performance. Instead of considering only a single lane for a lane-level scene context, we also take into account surrounding lanes along with their relative importance. The relative importance is calculated based on the past motion of the target vehicle, thus reflecting the vehicle-lane interaction. In addition, for the interaction between the target vehicle and surrounding vehicles, we consider only the surrounding vehicles within a certain distance from the reference lane as illustrated in Figure 1c. This approach shows improved prediction performance compared to the existing approaches that consider either all neighbors [26] or only the most relevant neighbor [16]. This result is consistent with the observation that only a subset of surrounding vehicles is indeed relevant when predicting the future trajectory of the target vehicle [22].
3 Proposed Method
In this section, we present the details of our trajectory forecasting model.
3.1 Problem Formulation
Assume that there are $N$ vehicles $\{V_i\}_{i=1}^{N}$ in the traffic scene. We aim to generate plausible trajectory distributions $p(Y_i \mid X_i, C_i)$ for the vehicles. Here, $X_i$ denotes the positional history of $V_i$ for the previous $H$ timesteps at time $t$, $Y_i$ denotes the future positions of $V_i$ for the next $T$ timesteps, and $C_i$ denotes additional scene information available to $V_i$. For $C_i$, we use the positional histories of the surrounding vehicles $\{X_j\}_{j \neq i}$ and the lane candidates $L^{(1:M)}$ available for $V_i$ at time $t$, where $L^{(m)}$ denotes the equally spaced coordinate points on the centerline of the $m$-th lane. Finally, we note that all positional information is expressed in the coordinate frame defined by $V_i$’s current position and heading. According to [16], $p(Y_i \mid X_i, C_i)$ can be rewritten as
$$p(Y_i \mid X_i, C_i) = \sum_{m=1}^{M} p(Y_i \mid E_m, X_i, C_i)\, p(E_m \mid X_i, C_i), \tag{1}$$
where $E_m$ denotes the event that $L^{(m)}$ becomes the reference lane for $V_i$. Equation 1 shows that the trajectory distribution can be expressed as a weighted sum of the distributions $p(Y_i \mid E_m, X_i, C_i)$, which we call modes. The fact that the modes are usually much simpler than the overall distribution inspired us to model each mode through a latent variable, and sample trajectories from the modes in proportion to their weights as illustrated in Figure 1b.
3.2 Forecasting Model with Hierarchical Latent Structure
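We introduce two latent variables $z_l$ and $z_h$ to model the modes and the weights for the modes in Eq. 1. With the low-level latent variable $z_l$, our forecasting model defines $p(Y_i \mid E_m, X_i, C_i)$ by using the decoder network $p_\theta(Y_i \mid z_l, X_i, C_i^m)$ and the prior network $p_\gamma(z_l \mid X_i, C_i^m)$ based on
$$p(Y_i \mid E_m, X_i, C_i) = \int p_\theta(Y_i \mid z_l, X_i, C_i^m)\, p_\gamma(z_l \mid X_i, C_i^m)\, dz_l, \tag{2}$$
where $C_i^m$ denotes the scene information relevant to $L^{(m)}$. To train our forecasting model, we employ the conditional VAE framework [32] and optimize the following modified ELBO objective [14]:
$$\mathcal{L}_{ELBO} = -\mathbb{E}_{z_l \sim q_\phi}\left[\log p_\theta(Y_i \mid z_l, X_i, C_i^m)\right] + \beta\, \mathrm{KL}\left(q_\phi(z_l \mid Y_i, X_i, C_i^m) \,\|\, p_\gamma(z_l \mid X_i, C_i^m)\right), \tag{3}$$
where $\beta$ is a constant and $q_\phi(z_l \mid Y_i, X_i, C_i^m)$ is the approximated posterior network. The weights for the modes are modeled by the high-level latent variable $z_h$, which is the output of the proposed mode selection network.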
As shown in Eq. 3 and the definition of the mode selection network, the performance of our forecasting model is dependent on how the lane-level scene information is utilized along with the positional history of the target vehicle for defining the lane-level scene context. One can consider two interactions for the lane-level scene context: the VLI and V2I. This is because the future motion of the vehicle is highly restricted not only by the vehicle’s motion history but also by the motion histories of the surrounding vehicles and the lane geometry of the road. For the VLI, the existing works [10, 16, 24, 26] considered only the reference lane. For the V2I, [16] considered only the one vehicle most relevant to the reference lane, while the other works considered all vehicles. In this paper, we present novel ways of defining the two interactions. For the VLI, instead of considering only the reference lane, we also take into account surrounding lanes along with their relative importance, which is calculated based on the target vehicle’s motion history. The V2I is encoded through a GNN by considering only surrounding vehicles within a certain distance from the reference lane. Our approach is based on the fact that human drivers often pay attention to surrounding lanes and vehicles occupying the surrounding lanes when driving along the reference lane. Driving behaviors such as lane changes and overtaking are examples.
3.3 Proposed Network Structure
We show in Fig. 2 the overall architecture of our forecasting model. In the following sections, we describe the details of our model.
3.3.1 Feature Extraction Module:
Three LSTM networks are used to encode the positional data of the target vehicle, the surrounding vehicles, and the lane candidates, respectively. The last hidden state vector of each network is used as the encoding result. Before the encoding process, we preprocess the positional data. For the vehicles, we calculate the speed and heading at each timestep and concatenate the sequential speed and heading data to the original data along the data dimension. As a result, the vehicle data have a data dimension of size 4 (x-position, y-position, speed, and heading). For the lanes, at each coordinate point, we calculate the tangent vector and the direction of the tangent vector. The sequential tangential and directional data are concatenated to the original data along the data dimension. As a result, the lane data have a data dimension of size 5 (2D position vector, 2D tangent vector, and direction). We introduce the preprocessing step so that our model can better infer the future positions of the target vehicle from the historical speed and heading records and the tangential data, based on the fact that vehicles moving along a lane usually try to be parallel to the tangent vector of the lane. As shown in Table 1, the prediction performance of our model is improved by the preprocessing step. In the rest of this paper, we use a tilde symbol at the top of a variable to indicate that it is the result of the encoding process. For example, the encoding result of a variable $X$ is expressed as $\tilde{X}$.
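The preprocessing step described above can be sketched as follows. This is a minimal illustration under stated assumptions: speed and heading are taken from finite differences of consecutive positions, the tangent vector is the normalized difference between neighboring centerline points, and the function names and the sampling interval `dt` are hypothetical. The paper does not specify these details.

```python
import math

def augment_vehicle_track(positions, dt):
    """Append speed and heading to each (x, y) sample of a track with >= 2 samples."""
    feats = []
    for k, (x, y) in enumerate(positions):
        # Backward difference; the first sample reuses the first forward difference.
        a, b = (k - 1, k) if k > 0 else (0, 1)
        dx = positions[b][0] - positions[a][0]
        dy = positions[b][1] - positions[a][1]
        feats.append((x, y, math.hypot(dx, dy) / dt, math.atan2(dy, dx)))
    return feats

def augment_lane(centerline):
    """Append the unit tangent vector and its direction angle to each centerline point."""
    feats = []
    last = len(centerline) - 1
    for k, (x, y) in enumerate(centerline):
        # Forward difference; the last point reuses the last backward difference.
        a, b = (k, k + 1) if k < last else (k - 1, k)
        dx = centerline[b][0] - centerline[a][0]
        dy = centerline[b][1] - centerline[a][1]
        norm = math.hypot(dx, dy) or 1.0
        feats.append((x, y, dx / norm, dy / norm, math.atan2(dy, dx)))
    return feats

# Illustrative inputs (made-up values).
track = augment_vehicle_track([(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)], dt=0.5)
lane = augment_lane([(0.0, 0.0), (0.0, 2.0), (0.0, 4.0)])
```

Each vehicle sample ends up with 4 features and each lane point with 5, matching the data dimensions stated in the text.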
3.3.2 Scene Context Extraction Module:
Two lane-level context vectors are calculated in this stage. Assume that the $m$-th lane is the reference lane for the target vehicle. The context vector for the VLI is calculated as follows:
(4) 
where are the weights calculated through the attention operation [1] between and and the semicolon denotes the concatenation operation. represents the relative importance of the surrounding lane compared to the reference lane under the consideration of the past motion of . As a result, our model can generate plausible trajectories for the vehicles that drive paying attention to multiple lanes. For example, suppose that the vehicle is changing its lane from to . will be close to 1 and can be approximated as , thus, our model can generate plausible trajectories corresponding to the lane change. We show in supplementary materials how the target vehicle interacts with the surrounding lanes of the reference lane using some driving scenarios.
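A minimal sketch of the VLI context vector may help make the construction concrete. It assumes a plain dot-product score between the encoded motion history and each surrounding-lane encoding, whereas the paper refers to the attention operation of [1]; the function names and the toy encodings are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def vli_context(history_enc, lane_encs, ref_idx):
    """Concatenate the reference-lane encoding with an attention-weighted
    sum of the surrounding-lane encodings."""
    ref = list(lane_encs[ref_idx])
    others = [lane for i, lane in enumerate(lane_encs) if i != ref_idx]
    if not others:
        # No surrounding lanes: pad with zeros (an assumption of this sketch).
        return ref + [0.0] * len(ref)
    # Assumed scoring function: dot product with the motion-history encoding.
    scores = [sum(h * v for h, v in zip(history_enc, lane)) for lane in others]
    weights = softmax(scores)
    blended = [sum(w * lane[d] for w, lane in zip(weights, others))
               for d in range(len(ref))]
    return ref + blended

# Toy encodings (made-up values): the history aligns strongly with lane 1.
ctx = vli_context([1.0, 0.0], [[0.0, 1.0], [10.0, 0.0], [-10.0, 0.0]], ref_idx=0)
```

In the toy call the weight of lane 1 is close to 1, so the second half of `ctx` is approximately the encoding of lane 1, which mirrors the lane-change example in the text.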
To model the interaction between and its surrounding vehicles , we use a GNN. As we mentioned, only the surrounding vehicles within a certain distance from the reference lane are considered for the interaction; see Fig. 1c. Let denote the set of the vehicles including and its select neighbors. The context vector for the V2I is calculated as follows:
(5) 
(6) 
(7) 
(8) 
where for all vehicles in . The message passing from to is defined in Eq. 5 and all messages coming to are aggregated by the sum operation as shown in Eq. 6. After the rounds of the message passing, the hidden feature vector represents not only the motion history of but also the history of the interaction between and the others. The distance threshold for plays the important role in the performance improvement. We explore the choice of value and empirically find that the best performance is achieved with meters (the distance between two nearby lane centerlines in straight roads is around 5 meters). Finally, note that we use the zero vector for when has the target vehicle only.
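The neighbor selection and the sum aggregation of messages can be sketched as follows. This is an illustration only: the distance to the lane is approximated by the distance to the nearest sampled centerline point, the message function is a stand-in for the learned one, and all names, positions, and the threshold value are hypothetical.

```python
import math

def distance_to_lane(position, centerline):
    """Distance from a point to the nearest sampled centerline point."""
    return min(math.hypot(position[0] - x, position[1] - y) for x, y in centerline)

def select_neighbors(positions, centerline, threshold):
    """Indices of vehicles whose current position lies within `threshold` of the lane."""
    return [i for i, p in enumerate(positions)
            if distance_to_lane(p, centerline) <= threshold]

def aggregate_messages(hidden, message_fn):
    """One round of message passing: each node sums the messages from all other nodes."""
    aggregated = []
    for i, h_i in enumerate(hidden):
        total = [0.0] * len(h_i)
        for j, h_j in enumerate(hidden):
            if j != i:
                total = [t + m for t, m in zip(total, message_fn(h_i, h_j))]
        aggregated.append(total)
    return aggregated

# Illustrative scene (made-up values): the third vehicle is far from the lane.
reference_lane = [(0.0, 0.0), (0.0, 5.0), (0.0, 10.0)]
vehicle_positions = [(1.0, 2.0), (4.0, 6.0), (12.0, 5.0)]
selected = select_neighbors(vehicle_positions, reference_lane, threshold=5.0)

# Stand-in message: the difference of hidden features (a learned network in practice).
diff_message = lambda h_i, h_j: [b - a for a, b in zip(h_i, h_j)]
messages = aggregate_messages([[1.0, 0.0], [3.0, 2.0]], diff_message)
```

Restricting the graph to `selected` keeps the interaction model focused on vehicles that can plausibly affect driving along the reference lane.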
3.3.3 Mode Selection Network:
The weights for the modes of the trajectory distribution are calculated by the mode selection network. As instance-level lanes are closely related to the modes, it can be assumed that there are as many modes as lane candidates, each corresponding to one of the candidate lanes. We calculate the weights from the lane-level scene context vectors, which condense the information about the modes:
(9) 
The softmax operation is applied to the output of the mode selection network to obtain the final weights, so that the weight of the $m$-th mode is equal to the $m$-th element of the softmax result. The lane-level scene context vector is the core feature vector for our encoder, prior, and decoder networks as described in the next section.
3.3.4 Encoder, Prior, and Decoder:
The approximated posterior, also known as the encoder or recognition network, is implemented as MLPs with the encoding of the future trajectory and the lane-level scene context vector as inputs:
(10)
where the two outputs are the mean and standard deviation vectors, respectively. The encoder is utilized in the training phase only because the future trajectory is not available in the inference phase. The prior is also implemented as MLPs with the context vector as input:
(11)
where the two outputs are the mean and standard deviation vectors, respectively. The latent variable is sampled from the encoder via the reparameterization trick [19] during the training and from the prior during the inference.
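The sampling step and the KL term between the encoder and the prior can be sketched for diagonal Gaussians as follows. The diagonal-Gaussian form is an assumption suggested by the mean and standard deviation outputs; the function names are hypothetical.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    kl = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        kl += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return kl

rng = random.Random(0)
z_train = reparameterize([0.5, -1.0], [1.0, 0.2], rng)     # sample from the encoder
kl_term = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])    # unit variances, means differ by 1
```

Because the noise is drawn separately from the mean and standard deviation, the sample remains differentiable with respect to both, which is what allows the encoder to be trained through the sampling step.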
The decoder network generates the prediction of the future trajectory, , via an LSTM network as follows:
(12) 
(13) 
(14) 
where we initialize and as the last observed position of and the zero vector, respectively.
3.4 Regularization Through GAN
To generate clearer image samples, [20] proposed a method that combines VAE and GAN. Based on the observation that the discriminator network implicitly learns a rich similarity metric for images, the typical element-wise reconstruction metric (e.g., distance) in the ELBO objective is replaced with a feature-wise metric expressed in the discriminator. In this paper, we also propose training our forecasting model with a discriminator network simultaneously. However, we do not replace the element-wise reconstruction metric with the feature-wise metric since the characteristics of trajectory data are quite different from those of images. We instead use the discriminator to regularize our forecasting model during the training so that the trajectories generated by our model match the shape of the reference lane well.
The proposed discriminator network is defined as follows:
(15) 
We explored different choices for the encoding of the inputs to the discriminator network and observed that the following approaches improve the prediction performance: 1) is the result of encoding through an LSTM network where , , and is the coordinate point of closest to , 2) is from the feature extraction module. We also observed that generating trajectories for the GAN objective ( defined in Eq. 18) from both the encoder and prior yields better prediction performance, which is consistent with the observations in [20]. However, not backpropagating the error signal from the GAN objective to the encoder and prior does not lead to the performance improvement, which is not consistent with the observations in [20].
3.5 Training Details
The proposed model is trained by optimizing the following objective:
(16) 
Here, is the binary cross entropy loss for the mode selection network and is defined as follows:
(17) 
where is the one-hot vector indicating the index of the lane, in which the target vehicle traveled in the future timesteps, among the candidate lanes. is the typical adversarial loss defined as follows:
(18) 
where denotes our forecasting model. The hyperparameters (, ) in Eq. 16 and in Eq. 3 are set to , , and , respectively. More details can be found in supplementary materials.
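The binary cross entropy loss for the mode selection network can be sketched as below. Whether the loss is summed or averaged over the candidate lanes is not stated in the text, so the sum used here is an assumption, as are the function name and the clipping constant `eps`.

```python
import math

def mode_selection_bce(weights, one_hot, eps=1e-12):
    """Binary cross entropy between predicted mode weights and a one-hot lane label,
    summed over the candidate lanes."""
    loss = 0.0
    for w, g in zip(weights, one_hot):
        w = min(max(w, eps), 1.0 - eps)  # clip to avoid log(0)
        loss -= g * math.log(w) + (1 - g) * math.log(1.0 - w)
    return loss

# Two candidate lanes with equal predicted weights; the vehicle actually followed lane 0.
bce_value = mode_selection_bce([0.5, 0.5], [1, 0])
```

The loss decreases as the weight of the lane actually traveled approaches 1 and the weights of the other candidates approach 0.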
3.6 Inference
Future trajectories for the target vehicle are generated from the modes based on their weights. Assume that trajectories need to be generated for . out of future trajectories are generated by the decoder network using and . In the end, a total of trajectories can be generated from since .
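One way to turn the mode weights into per-mode sample counts is sketched below. Since the product of a weight and the total number of trajectories is generally not an integer, this sketch uses largest-remainder rounding, which is an assumption and not necessarily the rule used by the authors; the function name is hypothetical.

```python
def allocate_samples(weights, total):
    """Split `total` samples across modes in proportion to `weights`,
    using largest-remainder rounding so that the counts sum to `total`."""
    raw = [w * total for w in weights]
    counts = [int(r) for r in raw]
    # Hand out the leftover samples to the modes with the largest fractional parts.
    leftover = total - sum(counts)
    order = sorted(range(len(weights)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

# Illustrative weights for three candidate lanes (made-up values).
counts = allocate_samples([0.5, 0.3, 0.2], 12)
```

Each count then determines how many trajectories the decoder generates for the corresponding lane candidate.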
4 Experiments
4.1 Dataset
Two large-scale real-world datasets, Argoverse Forecasting [7] and nuScenes [4], are used to evaluate the prediction performance of our model. Both provide 2D or 3D annotations of road agents, track IDs of agents, and HD map data. nuScenes includes 1000 scenes, each 20 seconds in length. A 6-second future trajectory is predicted from a 2-second past trajectory for each target vehicle. Argoverse Forecasting is a dataset for the trajectory prediction task. It provides more than 300K scenarios, each 5 seconds in length. A 3-second future trajectory is predicted from a 2-second past trajectory for each target vehicle. Argoverse Forecasting and nuScenes publicly release only training and validation sets. Following the existing works [16, 31], we use the validation set for testing. For the training, we use the training set only.
4.2 Evaluation Metric
For the quantitative evaluation of our forecasting model, we employ two popular metrics, average displacement error (ADE) and final displacement error (FDE), defined as follows:
$$\text{ADE}(Y, \hat{Y}) = \frac{1}{T} \sum_{t=1}^{T} \lVert Y_t - \hat{Y}_t \rVert_2, \tag{19}$$
$$\text{FDE}(Y, \hat{Y}) = \lVert Y_T - \hat{Y}_T \rVert_2, \tag{20}$$
where $Y$ and $\hat{Y}$ respectively denote the ground-truth trajectory and its prediction, $Y_t$ is the position at future timestep $t$, and $T$ is the number of future timesteps. In the rest of this paper, we denote $\text{ADE}_K$ and $\text{FDE}_K$ as the minimum of ADE and FDE among the $K$ generated trajectories, respectively. It is worth noting that the $\text{ADE}_1$ and $\text{FDE}_1$ metrics shown in the tables presented in the later sections represent the average quality of the trajectories generated for the target vehicle. Our derivation can be found in the supplementary materials. On the other hand, $\text{ADE}_K$ and $\text{FDE}_K$ for larger $K$ represent the quality of the trajectory closest to the ground truth among the $K$ generated trajectories. We will call these metrics in the tables the best quality in the rest of this paper. According to [5], the average quality and the best quality are complementary and evaluate the precision and coverage of the predicted trajectory distributions, respectively.
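The two metrics and their minimum over a set of generated trajectories can be computed as in the following minimal sketch. It uses the standard definitions of ADE and FDE; the function names and the toy trajectories are hypothetical.

```python
import math

def ade(pred, gt):
    """Mean Euclidean distance between prediction and ground truth over all timesteps."""
    return sum(math.hypot(p[0] - g[0], p[1] - g[1]) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """Euclidean distance between prediction and ground truth at the final timestep."""
    return math.hypot(pred[-1][0] - gt[-1][0], pred[-1][1] - gt[-1][1])

def min_ade_fde(preds, gt):
    """Minimum ADE and minimum FDE over a set of predicted trajectories."""
    return min(ade(p, gt) for p in preds), min(fde(p, gt) for p in preds)

# Toy example (made-up values): a straight ground truth and two candidate predictions.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
preds = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],  # drifts away from the ground truth
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)],  # only the last point is off
]
best_ade, best_fde = min_ade_fde(preds, gt)
```

Note that the minimum ADE and the minimum FDE are taken independently, so they need not come from the same generated trajectory.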


Table 2. Results on nuScenes.
Model              ADE_1  FDE_1  ADE_5  FDE_5  ADE_10  FDE_10  ADE_15  FDE_15
CoverNet [27]      3.87   9.26   1.96   -      1.48    -       -       -
Trajectron++ [31]  -      9.52   1.88   -      1.51    -       -       -
AgentFormer [35]   -      -      1.86   3.89   1.45    2.86    -       -
ALAN [26]          4.67   10.0   1.77   3.32   1.10    1.66    -       -
LaPred [16]        3.51   8.12   1.53   3.37   1.12    2.39    1.10    2.34
MHAJAM [25]        3.69   8.57   1.81   3.72   1.24    2.21    1.03    1.7
Ours

Table 3. Results on Argoverse Forecasting.
Model               ADE_1  FDE_1  ADE_5  FDE_5  ADE_6  FDE_6  ADE_12  FDE_12
DESIRE [21]         2.38   4.64   1.17   2.06   1.09   1.89   0.90    1.45
R2P2 [30]           3.02   5.41   1.49   2.54   1.40   2.35   1.11    1.77
VectorNet [12]      1.66   3.67   -      -      -      -      -       -
LaneAttention [24]  1.46   3.27   -      -      1.05   2.06   -       -
LaPred [16]         1.48   3.29   0.76   1.55   0.71   1.44   0.60    1.15
Ours
4.3 Ablation Study
4.3.1 Performance Gain over Baseline
In Table 1(a), we present the contributions of each idea to the performance gain over a baseline. M1 denotes the baseline that does not use the positional data preprocessing (PDP), VLI, V2I, and GAN regularization proposed in this paper. We can see from the table that the average quality of the generated trajectories is improved by both the PDP and the VLI (M1 vs. M2 vs. M3). The improvement due to the VLI is consistent with the observation in [16] that consideration of multiple lane candidates is more helpful than using a single best lane candidate in predicting the future trajectory. Both the average quality and the best quality are much improved by the V2I (M3 vs. M4). The accurate trajectory prediction for the vehicles waiting for traffic lights is the most representative case of the performance improvement by the V2I. Due to the past movement of the neighboring vehicles waiting for the traffic light, our model can easily conclude that the target vehicle will also be waiting for the traffic light. Finally, the prediction performance is further improved by the GAN regularization (M4 vs. M5). As seen in Eq. 15, our discriminator uses a future trajectory along with the reference lane to discriminate between fake trajectories and real trajectories.
4.3.2 Effect of Surrounding Vehicle Selection Mechanism
In Table 1(b), we show the effect of the surrounding vehicle selection mechanism on the prediction performance of our model. Here, Ours () denotes our model in which only the surrounding vehicles within meters from the reference lane are considered. Ours+Rel and Ours+All denote our model in which the most relevant vehicle and all the vehicles are considered, respectively. We can see from the table that Ours with shows the best performance. This result demonstrates that considering only surrounding vehicles within a certain distance from the reference lane is effective in modeling the V2I from a lane-level perspective.
4.3.3 Hierarchical Latent Structure
We show in Fig. 3 the generated trajectories for a particular scenario to demonstrate how helpful the introduction of the hierarchical latent structure is for the mitigation of mode blur. In the figure, Baseline denotes the VAE-based forecasting model in which a latent variable is trained to model the trajectory distribution. Baseline+BOM and Baseline+NF respectively denote Baseline trained with the best-of-many (BOM) sample objective [2] and normalizing flows (NF) [29]. We introduce NF since the blurry sample generation is often attributed to the limited capability of the approximated posterior [15] and NF is a powerful framework for building flexible approximated posterior distributions [18]. In the figure, gray and black circles indicate historical and future positions, respectively. Squares with colors indicate the predictions of the future positions. Time is encoded in the rainbow color map ranging from red (0s) to blue (6s). Red solid lines indicate the centerlines of the candidate lanes. For the scenario, fifteen trajectories were generated. We can see in the figure that the proposed model generates trajectories that are aligned with the lane candidates. In contrast, neither normalizing flows nor the BOM objective helps much in mitigating mode blur.
4.4 Performance Evaluation
4.4.1 Quantitative Evaluation
We compare our forecasting model with the existing models objectively. The results are shown in Tables 2 and 3. Note that the bold and underline indicate the best and second-best performance, respectively. The values in the subscript indicate the performance gain over the second-best or loss over the best. Finally, the values in the table are from the corresponding papers and [16]. Table 2 presents the results on nuScenes. It shows that our model outperforms the SOTA models [26, 16, 35] on most of the metrics. In particular, the performance gains over the SOTA models in the $\text{ADE}_1$ and $\text{FDE}_1$ metrics are significant. Consequently, it can be said that the trajectories generated from our model, on average, are more accurate than those from the SOTA models. On the other hand, [26] shows strong performance on some of the best-quality metrics. This is because, in [26], the vehicle trajectory is defined along the centerlines in a 2D curvilinear normal-tangential coordinate frame, so that the predicted trajectory is well aligned with the centerlines. However, [26] shows the poorest performance in the average quality. Table 3 presents the results on Argoverse Forecasting. It is seen that our forecasting model outperforms the SOTA models [16, 24] on all the metrics. The results for larger $K$ show that our model achieves much better performance in the best quality compared to the other models. However, the performance gain over the second-best model in the average quality is not significant. In short, our forecasting model exhibits remarkable performance in the average and best quality on the two large-scale real-world datasets.
4.4.2 Qualitative Evaluation
Figure 4 illustrates the trajectories generated by our model for particular scenarios in the test dataset. Note that fifteen and twelve trajectories were generated for each scenario in nuScenes and Argoverse Forecasting, respectively. We can see in the figure that the generated trajectories are well distributed along admissible routes. In addition, the shape of the generated trajectory matches the shape of the candidate lane well. These results verify that the trajectory distribution is nicely modeled by the two latent variables conditioned on the proposed lane-level scene context vectors. It is noticeable that our model can generate plausible trajectories for the driving behaviors that require simultaneous consideration of multiple lanes. The first and third figures in the first column show the scenario where the target vehicle has just started changing lanes, and the second shows the scenario where the target vehicle is in the middle of a lane change. For both scenarios, our model generates plausible trajectories corresponding to both changing lanes and returning to its lane. Finally, the last figure in the first column shows the scenario where the target vehicle is in the middle of a right turn. Our model captures well the motion ambiguity of the vehicle that can keep a lane or change lanes.
5 Conclusions
In this paper, we proposed a VAE-based trajectory forecasting model that exploits the hierarchical latent structure. The hierarchy in the latent space was introduced to the forecasting model to mitigate mode blur by modeling the modes of the trajectory distribution and the weights for the modes separately. For the accurate modeling of the modes and weights, we introduced two lane-level context vectors calculated in novel ways, one corresponding to the VLI and the other to the V2I. The prediction performance of the model was further improved by the two techniques, positional data preprocessing and GAN-based regularization, introduced in this paper. Our experiments on two large-scale real-world datasets demonstrated that the model is not only capable of generating clear multimodal trajectory distributions but also outperforms the SOTA models in terms of prediction accuracy.
Acknowledgment This research work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIP) (No. 2020-0-00002, Development of standard SW platform-based autonomous driving technology to solve social problems of mobility and safety for public transport-marginalized communities).
References

[1]
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Int. Conf. on Learn. Represent. (2015)
 [2] Bhattacharyya, A., Schiele, B., Fritz, M.: Accurate and diverse sampling of sequences based on a bestofmany sample objective. In: IEEE Conf. Comput. Vis. Pattern Recog. (2018)
 [3] Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. In: arXiv:1511.06349 (2015)
 [4] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: a multimodal dataset for autonomous driving. In: IEEE Conf. Comput. Vis. Pattern Recog. (2020)
 [5] Casas, S., Gulino, C., Suo, S., Luo, K., Liao, R., Urtasun, R.: Implicit latent variable model for sceneconsistent motion forecasting. In: Eur. Conf. Comput. Vis. (2020)
 [6] Casas, S., Gulino, C., Suo, S., Urtasun, R.: The importance of prior knowledge in precise multimodal prediction. In: Int. Conf. Intell. Robots Syst. (2020)
 [7] Chang, M.F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S., Ramanan, D., Hays, J.: Argoverse: 3d tracking and forecasting with rich maps. In: IEEE Conf. Comput. Vis. Pattern Recog. (2019)
 [8] Cui, A., Sadat, A., Casas, S., Liao, R., Urtasun, R.: Lookout: diverse multifuture prediction and planning for selfdriving. In: Int. Conf. Comput. Vis. (2021)
 [9] Cui, H., Radosavljevic, V., F.C.Chou, Lin, T.H., Nguyen, T., Huang, T.K., Schneider, J., Djuric, N.: Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In: IEEE Int. Conf. Robotics and Automation (2019)
 [10] Fang, L., Jiang, Q., Shi, J., Zhou, B.: Tpnet: trajectory proposal network for motion prediction. In: IEEE Conf. Comput. Vis. Pattern Recog. (2020)
 [11] Fu, H., Li, C., Liu, X., Gao, J., Celikyilmaz, A., Carin, L.: Cyclical annealing schedule: A simple approach to mitigating kl vanishing. In: NAACL (2019)
 [12] Gao, J., Sun, C., Zhao, H., Shen, Y., Anguelov, D., Li, C., Schmid, C.: Vectornet: encoding hd maps and agent dynamics from vectorized representation. In: IEEE Conf. Comput. Vis. Pattern Recog. (2020)
 [13] Goodfellow, I., Abadie, J.P., Mirza, M., Xu, B., Farley, D.W., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Adv. Neural Inform. Process. Syst. (2014)
 [14] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A.: betavae: learning basic visual concepts with a constrained variational framework. In: Int. Conf. on Learn. Represent. (2017)
 [15] Huang, H., Li, Z., He, R., Sun, Z., Tan, T.: Introvae: Introspective variational autoencoders for photographic image synthesis. In: Adv. Neural Inform. Process. Syst. (2018)
 [16] Kim, B., Park, S.H., Lee, S., Khoshimjonov, E., Kum, D., Kim, J., Kim, J.S., Choi, J.W.: Lapred: laneaware prediction of multimodal future trajectories of dynamic agents. In: IEEE Conf. Comput. Vis. Pattern Recog. (2021)
 [17] Kingma, D.P., Ba, L.J.: Adam: a method for stochastic optimization. In: Int. Conf. on Learn. Represent. (2015)
 [18] Kingma, D.P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., Welling, M.: Improved variational inference with inverse autoregressive flow. In: Adv. Neural Inform. Process. Syst. (2016)
 [19] Kingma, D.P., Welling, M.: Autoencoding variational bayes. In: arXiv:1312.6114 (2013)
 [20] Larsen, A.B.L., Sonderby, S.K., Larochelle, H., Winther, O.: Autoencoding beyond pixels using a learned similarity metric. In: Int. Conf. on Learn. Represent. (2016)
 [21] Lee, N., Choi, W., Vernaza, P., Choy, C.B., Torr, P.H.S., Chan, M.: Desire: Distant future prediction in dynamic scenes with interacting agents. In: IEEE Conf. Comput. Vis. Pattern Recog. (2017)
 [22] Li, J., Yang, F., Ma, H., Malla, S., Tomizuka, M., Choi, C.: Rain: reinforced hybrid attention inference network for motion forecasting. In: Int. Conf. Comput. Vis. (2021)
 [23] Liang, M., Yang, B., Hu, R., Chen, Y., Liao, R., Feng, S., Urtasun, R.: Learning lane graph representations for motion forecasting. In: Eur. Conf. Comput. Vis. (2020)
 [24] Luo, C., Sun, L., Dabiri, D., Yuille, A.: Probabilistic multimodal trajectory prediction with lane attention for autonomous vehicles. In: IEEE Conf. Intell. Robots Syst. (2020)
 [25] Messaoud, K., Deo, N., Trivedi, M.M., Nashashibi, F.: Trajectory prediction for autonomous driving based on multi-head attention with joint agent-map representation. In: arXiv:2005.02545 (2020)
 [26] Narayanan, S., Moslemi, R., Pittaluga, F., Liu, B., Chandraker, M.: Divide-and-conquer for lane-aware diverse trajectory prediction. In: IEEE Conf. Comput. Vis. Pattern Recog. (2021)
 [27] Phan-Minh, T., Grigore, E.C., Boulton, F.A., Beijbom, O., Wolff, E.M.: Covernet: multimodal behavior prediction using trajectory sets. In: IEEE Conf. Comput. Vis. Pattern Recog. (2020)
 [28] Razavi, A., Oord, A., Poole, B., Vinyals, O.: Preventing posterior collapse with delta-VAEs. In: Int. Conf. on Learn. Represent. (2019)
 [29] Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows. In: Int. Conf. on Mach. Learn. (2015)
 [30] Rhinehart, N., Kitani, K.M., Vernaza, P.: R2p2: a reparameterized pushforward policy for diverse, precise generative path forecasting. In: Eur. Conf. Comput. Vis. (2018)
 [31] Salzmann, T., Ivanovic, B., Chakravarty, P., Pavone, M.: Trajectron++: dynamically-feasible trajectory forecasting with heterogeneous data. In: Eur. Conf. Comput. Vis. (2020)
 [32] Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. In: Adv. Neural Inform. Process. Syst. (2015)
 [33] Vahdat, A., Kautz, J.: Nvae: a deep hierarchical variational autoencoder. In: Adv. Neural Inform. Process. Syst. (2020)
 [34] Yang, Z., Hu, Z., Salakhutdinov, R., Berg-Kirkpatrick, T.: Improved variational autoencoders for text modeling using dilated convolutions. In: Int. Conf. on Mach. Learn. (2017)
 [35] Yuan, Y., Weng, X., Ou, Y., Kitani, K.: Agentformer: agent-aware transformers for socio-temporal multi-agent forecasting. In: arXiv:2103.14023 (2021)
 [36] Zhao, S., Song, J., Ermon, S.: Infovae: information maximizing variational autoencoders. In: arXiv:1706.02262 (2017)
 [37] Zhao, S., Song, J., Ermon, S.: Towards a deeper understanding of variational autoencoding models. In: arXiv:1702.08658v1 (2017)
Appendix 0.A Visualization of Vehicle-Lane Interaction (VLI)
As mentioned in the paper, the lane-level context vector is calculated using not only the reference lane but also the surrounding lanes, weighted by their relative importance. This idea is based on the fact that human drivers often pay attention to surrounding lanes when driving along the reference lane. To show how our model attends to the surrounding lanes for the target vehicle, we use four scenarios from nuScenes and show the results in Figure 5. In the figure, blue lines denote the reference lanes, while the others denote the surrounding lanes. Surrounding lanes of high importance are shown in red and those of low importance in green. The figure shows that our forecasting model pays more attention to the surrounding lanes that are close to the reference lane.
Appendix 0.B Mode Blur in SOTA Model
We show in Figure 6 prediction examples of the state-of-the-art model [8]. Note that the figure is identical to the one in the supplementary material of [8]. The model is built upon [5], which is based on the VAE framework and learns a diverse joint distribution over multi-agent future trajectories in a traffic scene. In the figure, green and light blue bounding boxes denote the AV and the surrounding vehicles, respectively. The solid lines with light blue dots denote the predicted trajectories for the surrounding vehicles. Some trajectories are located between adjacent lanes, which can cause uncomfortable rides for the AV with frequent sudden braking and steering changes [6].
Appendix 0.C Further Explanation of Average Quality
We mentioned in the paper that the and metrics shown in the tables represent the average quality of the trajectories generated for the ground-truth trajectory . The metric in the table is calculated as
(21) 
where is the test dataset and is the prediction of . Because there are relatively few distinct actions that can be taken by a vehicle over a reasonable time horizon (3 to 6 seconds) [27], the ground-truth trajectories in can be clustered into multiple groups, where the trajectories of each group are very close to each other in Euclidean space. Assume that there are groups in and let denote the th group. Then Eqn. 21 can be expressed as
(22)  
where . Since the trajectories of each group are very close to each other in Euclidean space, in the last line of Eqn. 22 can be approximated as
(23) 
where is large enough. Here and are the most representative trajectory in and its th prediction, respectively. The last term of Eqn. 23 is the average quality of the trajectories generated for . Consequently, the metric represents the average quality. The same derivation can be applied to the metric.
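Under assumed notation, the three steps of this argument can be sketched as follows. Every symbol here is introduced only for the sketch and is not necessarily the one used in the paper: $\mathcal{D}$ is the test dataset, $\hat{Y}$ the single prediction generated for the ground truth $Y$, $d(\cdot,\cdot)$ the displacement error (averaged over timesteps for ADE, taken at the final timestep for FDE), $\mathcal{D}_g$ the $g$-th of $G$ groups, $w_g$ its relative size, and $Y_g$, $\hat{Y}_g^{n}$ the representative trajectory of $\mathcal{D}_g$ and its $n$-th prediction. The metric is written $\mathrm{ADE}_1$ on the assumption that it is the single-sample ADE.

```latex
\begin{align}
\mathrm{ADE}_1
  &= \frac{1}{|\mathcal{D}|} \sum_{Y \in \mathcal{D}} d(Y, \hat{Y}) \\
  &= \frac{1}{|\mathcal{D}|} \sum_{g=1}^{G} \sum_{Y \in \mathcal{D}_g} d(Y, \hat{Y})
   = \sum_{g=1}^{G} w_g \, \frac{1}{|\mathcal{D}_g|} \sum_{Y \in \mathcal{D}_g} d(Y, \hat{Y}),
   \qquad w_g = \frac{|\mathcal{D}_g|}{|\mathcal{D}|} \\
\frac{1}{|\mathcal{D}_g|} \sum_{Y \in \mathcal{D}_g} d(Y, \hat{Y})
  &\approx \frac{1}{N} \sum_{n=1}^{N} d\bigl(Y_g, \hat{Y}_g^{\,n}\bigr)
\end{align}
```

The approximation in the last line treats the predictions made for the near-identical members of a group as repeated samples drawn for one representative trajectory, which is why the right-hand side reads as an average over generated trajectories.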


Appendix 0.D Trajectory Generation from the Most Prominent Mode
We show in Table 4 the ADE and FDE performance of our forecasting model when trajectories are generated from the most prominent mode only. In the table, Ours+Multi denotes the inference method that generates future trajectories from the modes, which is the method described in the paper. Ours+Single denotes the inference method that generates future trajectories only from the most prominent mode, which is identified by the weight distribution. The table shows that the best quality is degraded when the trajectories are generated from the most prominent mode only. On the other hand, Ours+Single shows nearly the same average quality as Ours+Multi. These results are expected. When sampling a single future trajectory, the most prominent mode is chosen for the sampling, so Ours+Multi and Ours+Single show the same performance. When sampling multiple future trajectories, however, the trajectories generated by Ours+Multi better reflect the true future trajectory distribution, so Ours+Multi outperforms Ours+Single in terms of the best quality.
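The difference between the two inference methods can be illustrated with a small sketch of how a sample budget might be split across modes. The allocation rule used here (largest-remainder rounding of the mode weights) is an assumption chosen because it gives the single sample to the most prominent mode when only one trajectory is drawn; the paper does not state the exact rule, and the function names are hypothetical.

```python
import math


def allocate_multi(weights, num_samples):
    """Split num_samples across modes in proportion to their weights.

    Assumed rule: largest-remainder rounding, so one sample goes to the top mode.
    """
    total = sum(weights)
    quotas = [num_samples * w / total for w in weights]
    counts = [math.floor(q) for q in quotas]
    leftover = num_samples - sum(counts)
    # Give the remaining samples to the modes with the largest fractional parts.
    order = sorted(range(len(weights)), key=lambda m: quotas[m] - counts[m], reverse=True)
    for m in order[:leftover]:
        counts[m] += 1
    return counts


def allocate_single(weights, num_samples):
    """Assign every sample to the most prominent (highest-weight) mode."""
    counts = [0] * len(weights)
    counts[max(range(len(weights)), key=lambda m: weights[m])] = num_samples
    return counts


mode_weights = [0.6, 0.3, 0.1]  # illustrative weights, not values from the paper
one_multi = allocate_multi(mode_weights, 1)
one_single = allocate_single(mode_weights, 1)
many_multi = allocate_multi(mode_weights, 15)
many_single = allocate_single(mode_weights, 15)
```

With a budget of one, both rules pick the same mode, which matches the observation that the average quality is nearly identical. With a larger budget, only the multi-mode rule spends samples on the less prominent modes.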
Appendix 0.E Trajectory Generation Speed
We ran our model on a PC equipped with an Intel i7 CPU, 32 GB of RAM, and an RTX 2080 Ti GPU. Generating 15 trajectories per vehicle takes around 0.02 seconds.
Appendix 0.F Implementation Details
0.f.1 Candidate Lanes Acquisition
We identify lane candidates for each target vehicle based on the methods proposed in [16, 26, 7]. The lane segments within the search radius (10 meters) of the vehicle’s current position are found first. Next, lane candidates 80 meters long in the vehicle’s heading direction are obtained by attaching the preceding and succeeding lane segments based on the lane connectivity information provided by the HD maps. The coordinate points of each lane candidate are resampled so that any two adjacent points are equally spaced (1 meter apart). The ground-truth lane, on which the target vehicle travels during the future timesteps, is identified by the Euclidean distance between the ground-truth future trajectory and the lane candidates. If the number of identified lane candidates is less than , we add fake lane candidates with coordinate points of (0, 0). If the number is greater than , randomly selected lanes and the ground-truth lane are used.
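The resampling, ground-truth lane identification, and padding steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the lane-to-trajectory distance is taken as the mean nearest-point distance, and the lane limit `max_lanes` stands in for the symbol that is missing from the text.

```python
import math
import random


def resample_polyline(points, spacing=1.0):
    """Resample a 2-D polyline so consecutive points are `spacing` apart along the line."""
    resampled = [points[0]]
    carried = 0.0  # arc length covered since the last emitted point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg == 0.0:
            continue  # skip duplicated points
        d = spacing - carried  # offset of the next sample inside this segment
        while d <= seg:
            t = d / seg
            resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            d += spacing
        carried = seg - (d - spacing)
    return resampled


def lane_distance(trajectory, lane):
    """Mean distance from each trajectory point to its nearest lane point (assumed metric)."""
    return sum(
        min(math.hypot(px - lx, py - ly) for lx, ly in lane) for px, py in trajectory
    ) / len(trajectory)


def build_lane_set(lanes, future_trajectory, max_lanes, rng):
    """Return exactly `max_lanes` lanes and the index of the ground-truth lane among them."""
    gt = min(range(len(lanes)), key=lambda i: lane_distance(future_trajectory, lanes[i]))
    if len(lanes) > max_lanes:
        # Keep the ground-truth lane plus randomly chosen others.
        others = [i for i in range(len(lanes)) if i != gt]
        keep = sorted(rng.sample(others, max_lanes - 1) + [gt])
        return [lanes[i] for i in keep], keep.index(gt)
    # Pad with fake lanes whose points are all (0, 0).
    fake = [(0.0, 0.0)] * len(lanes[0])
    return lanes + [list(fake) for _ in range(max_lanes - len(lanes))], gt


lanes = [
    resample_polyline([(0.0, 0.0), (0.5, 0.0), (4.0, 0.0)]),
    resample_polyline([(0.0, 3.5), (4.0, 3.5)]),
]
future = [(1.0, 3.4), (2.0, 3.6), (3.0, 3.5)]
lane_set, gt_index = build_lane_set(lanes, future, max_lanes=4, rng=random.Random(0))
```

In this toy case the second lane is closest to the future trajectory, and two fake lanes are appended to reach the limit.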
0.f.2 Details of Our Implementation
0.f.2.1 Preprocessing:
Let denote the position of the vehicle at . The speed (meters per second) and heading (radians) of the vehicle at are calculated as follows:
(24) 
(25) 
where is the sampling rate. Let denote the coordinate of the th point of the lane . The tangent vector and its direction at the point are calculated as follows:
(26) 
(27) 
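As a rough illustration of these preprocessing quantities, the sketch below computes speed and heading from consecutive positions, and tangent vectors and directions from consecutive lane points. The use of first-order differences and `atan2` is an assumption, as are the function names; the exact formulas are the ones given in the paper.

```python
import math


def speed_and_heading(positions, sampling_rate_hz):
    """Per-step speed (m/s) and heading (rad) from consecutive (x, y) positions."""
    result = []
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        dx, dy = x1 - x0, y1 - y0
        # Displacement per step times steps per second gives meters per second.
        speed = math.hypot(dx, dy) * sampling_rate_hz
        heading = math.atan2(dy, dx)
        result.append((speed, heading))
    return result


def lane_tangents(lane_points):
    """Tangent vector and its direction (rad) at each lane point except the last."""
    result = []
    for (x0, y0), (x1, y1) in zip(lane_points, lane_points[1:]):
        tx, ty = x1 - x0, y1 - y0
        result.append(((tx, ty), math.atan2(ty, tx)))
    return result


track = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]  # toy positions in meters
motion = speed_and_heading(track, sampling_rate_hz=2.0)
tangents = lane_tangents([(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)])
```

The first position and the last lane point have no difference to take, so each output is one element shorter than its input; how the paper handles these boundary elements is not covered here.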
0.f.2.2 Feature Extraction Module:
The positional data , , and are first preprocessed by the method proposed in this paper. Next, the data are embedded by single-layer MLPs followed by ReLU activation. The MLPs for and take as input a 4-dimensional vector and output a 16-dimensional vector. The MLP for takes as input a 5-dimensional vector and outputs a 64-dimensional vector. Finally, the embedded sequential vectors are encoded by LSTM networks, and the final hidden states of the LSTM networks are used as the final encodings. The hidden state size of the LSTM networks for and is 16. The hidden state size for is 64.
0.f.2.3 Scene Context Extraction Module:
The attention operation between and for the context vector is based on [1]. The context vector is calculated as follows. The messages coming to the node are first calculated by a single-layer MLP followed by ReLU activation, which takes as input a 34-dimensional vector and outputs a 16-dimensional vector, and are then aggregated by summation. The aggregated message is used to update the hidden state of the node. To update the hidden state, we use a GRU cell, which takes as input a 16-dimensional vector and outputs a 16-dimensional hidden state vector. After one round of message passing, is obtained by summing the hidden states of the neighboring nodes.
0.f.2.4 Mode Selection Network:
Ten lane-level scene context vectors are first embedded by a single-layer MLP followed by ReLU activation, which takes as input a 160-dimensional vector and outputs a 64-dimensional vector. The embedded vectors are then concatenated and used as input to a single-layer MLP, which takes as input a 640-dimensional vector and outputs a 10-dimensional vector, to obtain the latent vector .
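The dimension flow of this network (ten 160-dimensional contexts, each embedded to 64 dimensions, concatenated to 640, then mapped to 10) can be traced with a dependency-free sketch. The weights are random placeholders, sharing one embedding MLP across the ten lanes is an assumption, and the final softmax is added only to show how the 10-dimensional output could be read as mode weights; all function names are hypothetical.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Random weights and zero biases for a fully connected layer (illustration only)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x, layer):
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [v / total for v in exps]


def mode_selection(contexts, embed_layer, output_layer):
    """10 x 160 contexts -> 10 x 64 embeddings -> 640 concatenation -> 10 outputs."""
    embedded = [relu(linear(c, embed_layer)) for c in contexts]
    flat = [v for e in embedded for v in e]
    return linear(flat, output_layer)


rng = random.Random(0)
embed_layer = make_linear(160, 64, rng)
output_layer = make_linear(640, 10, rng)
contexts = [[rng.gauss(0.0, 1.0) for _ in range(160)] for _ in range(10)]
logits = mode_selection(contexts, embed_layer, output_layer)
mode_weights = softmax(logits)  # assumed normalization, one weight per lane-level mode
```

In practice these layers would be trained jointly with the rest of the model in a deep learning framework; the pure-Python version only makes the shapes explicit.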
0.f.2.5 Encoder and Prior:
The encoder produces the mean and variance vectors from the lane-level scene context vector and the positional data encoding . We use two two-layer MLPs, one for the mean and one for the variance. The first layers of the MLPs take as input a 178-dimensional vector and output a 64-dimensional vector. The second layers take as input a 64-dimensional vector and output a 16-dimensional vector. The prior produces the mean and variance vectors from . The networks for the prior have the same structure as those for the encoder, except that the first layers of the MLPs take as input a 160-dimensional vector. Finally, note that we use ReLU activation for the first layers of the MLPs.
0.f.2.6 Decoder:
To produce the next position , the current position is first embedded by a single-layer MLP followed by ReLU activation, which takes as input a 2-dimensional vector and outputs a 16-dimensional vector. Next, , , and the embedding are concatenated and used as input to an LSTM network, which takes as input a 192-dimensional vector and outputs a 128-dimensional hidden state vector, to update the hidden state vector. The next position is obtained by a single-layer MLP, which takes as input a 128-dimensional vector and outputs a 2-dimensional vector.
0.f.2.7 Discriminator:
The positional data is first embedded by a single-layer MLP followed by ReLU activation, which takes as input a 4-dimensional vector and outputs a 16-dimensional vector. The embedded sequential data is then encoded by an LSTM network, which takes as input 16-dimensional sequential vectors and outputs 16-dimensional sequential hidden state vectors. The future encoding and lane encoding are then used as input to a single-layer MLP to produce a scalar value. The MLP takes as input an 80-dimensional vector.
0.f.2.8 Training:
The Adam optimizer [17] is used for the optimization, with initial learning rates of (nuScenes) and (Argoverse Forecasting) and a batch size of 8, for 100 (nuScenes) and 50 (Argoverse Forecasting) epochs. We evaluate the prediction performance after every three training epochs using the validation samples in the training dataset. Whenever the prediction performance improves on the previous best, we save the model’s network parameters. During training, we use a cyclical annealing schedule [11] for .
0.f.3 Details of Ablation Study
We describe the details of the ablation study shown in Section 4.3 of the paper. For , we do not use the positional data preprocessing (PDP), VLI, V2I, and GAN regularization proposed in the paper. As a result, the lane-level scene context vector is defined as . For , we use the VLI so that . Finally, is used for , which employs the VLI and V2I.
0.f.4 Details of Baselines
We describe the details of the baseline models shown in Figure 3 of the paper. For the figure, we exclude the scene context extraction module and discriminator to show how much the hierarchical latent structure helps mitigate mode blur. Finally, note that the trajectories depicted in Figure 3(a) of the paper are generated from .
0.f.4.1 Baseline:
We train a generative model with a latent variable to model the trajectory distribution. One scene context vector that condenses the information about all the modes of the distribution is first calculated as follows:
(28) 
where is the result of the attention operation [1] between and . is then used as input to the encoder, prior, and decoder.
0.f.4.2 Baseline+BOM:
We train Baseline with the best-of-many (BOM) sample objective [2]. During training, the model generates five trajectories per vehicle, and the trajectory with the minimum ADE among the five is used for the distance loss calculation.
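The selection step of the best-of-many objective can be sketched as follows. The helper names are hypothetical, and the example trajectories are made up for illustration; in training, the selected error would be the distance term of the loss and gradients would flow only through the chosen sample.

```python
import math


def ade(prediction, ground_truth):
    """Average displacement error between two equally long 2-D trajectories."""
    return sum(
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(prediction, ground_truth)
    ) / len(ground_truth)


def best_of_many(samples, ground_truth):
    """Return the index and ADE of the sampled trajectory closest to the ground truth."""
    errors = [ade(sample, ground_truth) for sample in samples]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors[best]


ground_truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
# Five sampled trajectories, each shifted sideways by a different offset.
samples = [[(x, y + offset) for x, y in ground_truth] for offset in (2.0, -1.5, 0.5, 3.0, -0.25)]
best_index, distance_loss = best_of_many(samples, ground_truth)
```

Because only the closest sample is penalized, the other samples remain free to cover alternative modes.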
0.f.4.3 Baseline+NF:
We train Baseline with normalizing flows (NF) [29]. We apply ten planar flow operations to a random vector that follows the normal distribution to obtain the final latent variable.
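A chain of planar flows in the form given in [29], f(z) = z + u tanh(w·z + b), can be sketched in plain Python. The parameters below are random placeholders rather than learned values, the 16-dimensional latent size is an assumption taken from the encoder output size, and the function name is hypothetical. The reparameterization of u follows the invertibility condition described in [29].

```python
import math
import random


def planar_flow(z, u, w, b):
    """One planar flow step. Returns the transformed vector and log|det Jacobian|."""
    wu = sum(wi * ui for wi, ui in zip(w, u))
    w_norm_sq = sum(wi * wi for wi in w)
    # Reparameterize u so that w.u_hat > -1, which keeps the map invertible.
    m = -1.0 + math.log1p(math.exp(wu))
    u_hat = [ui + (m - wu) * wi / w_norm_sq for ui, wi in zip(u, w)]
    activation = math.tanh(sum(wi * zi for wi, zi in zip(w, z)) + b)
    z_new = [zi + ui * activation for zi, ui in zip(z, u_hat)]
    # det J = 1 + tanh'(w.z + b) * (w . u_hat), with tanh' = 1 - tanh^2.
    w_u_hat = sum(wi * ui for wi, ui in zip(w, u_hat))
    log_det = math.log(abs(1.0 + (1.0 - activation * activation) * w_u_hat))
    return z_new, log_det


rng = random.Random(0)
dim = 16  # assumed latent size
z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # base sample from a standard normal
total_log_det = 0.0
for _ in range(10):  # ten planar flow operations
    u = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    w = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    b = rng.gauss(0.0, 0.1)
    z, log_det = planar_flow(z, u, w, b)
    total_log_det += log_det
```

The accumulated log-determinant is what adjusts the density of the base sample, so a trained model would use it in the KL term of the objective.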