Traffic Agent Trajectory Prediction Using Social Convolution and Attention Mechanism

07/06/2020 · by Tao Yang, et al. · Xi'an Jiaotong University

Trajectory prediction is significant for the decision making of autonomous driving vehicles. In this paper, we propose a model to predict the trajectories of target agents around an autonomous vehicle. The main idea of our method is to consider both the history trajectories of the target agent and the influence of surrounding agents on it. To this end, we encode the target agent's history trajectories as an attention mask and construct a social map to encode the interactive relationship between the target agent and its surrounding agents. Given a trajectory sequence, LSTM networks first extract features for all agents, from which the attention mask and social map are formed. Then, the attention mask and social map are fused to get a fusion feature map, which is processed by the social convolution to obtain a fusion feature representation. Finally, this fusion feature is taken as the input of a variable-length LSTM to predict the trajectory of the target agent. We note that the variable-length LSTM enables our model to handle the case where the number of agents in the sensing scope is highly dynamic in traffic scenes. To verify the effectiveness of our method, we compare it extensively with several methods on a public dataset, achieving a 20% error decrease. In addition, the model satisfies the real-time requirement, running at 32 fps.


I Introduction

Benefiting from advanced sensors such as laser radar, cameras, and millimeter-wave radar, as well as complex processing algorithms, autonomous vehicles can accurately perceive the surrounding environment [3]. Based on the perception results, the planning and control algorithms can make an autonomous vehicle follow a specified route and avoid collisions. However, in some complex driving scenes this mode can lead to serious consequences, such as the traffic accidents of Tesla and Uber in 2018. These failures result from the planning algorithm's inability to predict the future trajectories of agents. Experts believe that autonomous vehicles able to predict the future trajectories of agents can avoid similar accidents [20].

(a) Trajectory prediction results under egocentric vision
(b) Trajectory prediction results under radar map
Fig. 1:

Trajectory prediction in a high-density traffic scene. The red cuboids in (b), from big to small, are vehicles, riders, and pedestrians, respectively. The circles on the left represent autonomous vehicles. In (a) and (b), the blue lines are inputs, the green lines are ground truth, and the purple, blue, yellow, and red lines are the prediction results of Linear Regression, LSTM AE, CVAE, and our model, respectively.

The factors that affect agent trajectories in traffic scenarios are particularly numerous and complex; therefore, trajectory prediction in the autonomous driving scenario is an extremely challenging task. These factors include the type of the agent (pedestrian, rider, vehicle) [15], the traffic rules [11], the interaction between different types of agents [13], the drivers' subjective decisions [6], etc. Some early works focusing on different aspects of these factors have been proposed. However, these works did not model the regression problem from the perspective of an agent's decision making, which would allow these factors to be considered in a more intuitive way and decrease the computation consumption. Some other works [13, 1, 19] regard the problem as a regression over the whole traffic scene, which makes it difficult to deal with agents entering and leaving the sensing scope. One recently proposed work [19] uses egocentric visual cues and optical flow to predict the future bounding box of the agent, but the vision-based trajectory prediction is not accurate. The work in [14] uses 3D point cloud computing for simultaneous detection, tracking, and trajectory prediction, but it requires a large amount of calculation.

In this paper, inspired by the above-mentioned methods, we propose a trajectory prediction model that involves an attention mechanism and a social map, aiming to model the procedure of decision making and decrease the computation consumption. When making a decision, the ego-vehicle usually pays more attention to nearby agents and less attention to distant agents. Motivated by this observation, we encode the history trajectories of the target agent as an attention mask and the positions of surrounding agents as a social map. The attention mask is essentially a probability map with high probabilities in the regions around the target agent and low probabilities in the regions far from it. This attention mask is fused with the social map to encode the importance of surrounding agents. In addition, this mechanism decreases the computation consumption, since the trajectory prediction is conducted using only surrounding agents instead of all agents in the traffic scene. Given a trajectory sequence as input, the attention mask and social map are first formed based on the original LSTM features of the agents. Then, the attention mask and social map are fused and processed by a social convolution to output a fusion feature. Finally, the fusion feature is concatenated with the original LSTM feature to serve as the input of a variable-length LSTM that predicts the trajectory of the target agent. We note that the variable-length LSTM enables our model to handle the case where the number of agents in the sensing scope is highly dynamic in traffic scenes. In the experiments, our method is compared with several methods, achieving the best performance on three different metrics. In addition, ablation studies are conducted to verify the effectiveness of our attention mechanism and social convolution.

The contributions of this paper are as follows:

  • An efficient and accurate framework is proposed to improve the trajectory prediction accuracy for traffic agents around autonomous vehicles.

  • An attention mechanism and a social map are proposed to model the procedure of decision making and decrease the computation consumption.

The rest of this paper is organized in the following order. We discuss related works in Section II, followed by the problem formulation in Section III. In Section IV, we detail our method. We present the implementation details and report our experimental results in Section V. Finally, we conclude this paper in Section VI.

II Related Work

Trajectory prediction has been researched extensively. Traditional methods include the Bayesian formulation [12], Hidden Markov Models (HMMs) [7], Kalman filters [9], Monte Carlo simulation [4], Gaussian processes [10], and LSTM autoencoders [16]. These traditional methods do not take the complex interactions between agents into consideration. Thus, here we only summarize the more recent works that take these interactions into account.

Alahi et al. proposed Social LSTM [1], which adds a social pooling layer to an LSTM to extract surrounding (local) agents' information for trajectory prediction. It was the first method to use scene information to assist trajectory prediction, and it makes precise predictions. However, since it predicts the future trajectories of the whole traffic scene, it has difficulty handling the frequent entry and exit of agents. Moreover, at every time step it must compute the social pooling of all agents, so the amount of calculation is large.

Deo et al. proposed convolutional social pooling layers to fuse the surrounding agents' (global) information [5]. The method divides each lane into cells, with the different lanes forming a grid, and uses a convolutional neural network to fuse the information of the agents located in different cells, which solves the problem of frequent entry and exit of agents in the scene. The social convolution fully utilizes the location information of the agents. However, the prediction is conditioned on a maneuver classification, so maneuver prediction errors have a great impact on the trajectory prediction, and the type of agent is not taken into account.

Li et al. proposed a graph convolutional model, GRIP [13], to model the interactions between agents. They regard the agents as nodes and the interactive events as edges, then use graph convolution to extract the effect of the interactive events on the trajectory prediction, achieving precise results. However, since GRIP does not directly utilize the position information of the agents but convolves the original tensor, it takes more effort to learn the interactive events. In addition, it needs to compute a matrix inverse, which consumes a large amount of calculation.

III Problem Formulation

For convenience of describing the method, we formulate the trajectory prediction problem before presenting our model. The observable scene of the autonomous vehicle at time $t$ is denoted as $X_t$; thus, the input of our model is the sequence of historical scenes over $T_h$ time steps:

$$I = \{X_{t-T_h+1}, X_{t-T_h+2}, \dots, X_t\} \qquad (1)$$

As the autonomous vehicle moves, agents enter and leave its observable region, so the number of agents at different times may differ. Suppose there are $N_t$ agents in the observable region at time $t$; the observable scene at time $t$ is then:

$$X_t = \{p_t^1, p_t^2, \dots, p_t^{N_t}\} \qquad (2)$$

where $p_t^i$ is the position of agent $i$ at time $t$. Considering that there are slopes during driving and that different lanes may sometimes have different heights, we believe it is necessary for our coordinates to include the z-axis, so each position is:

$$p_t^i = (x_t^i, y_t^i, z_t^i) \qquad (3)$$

The coordinates used here are in the ego-vehicle-based coordinate system with relative measurement. In the above context, assuming the model needs to predict the trajectories of the agents from time step $t+1$ to $t+T_f$, the output of the model at time $t$ is:

$$O = \{\hat{X}_{t+1}, \hat{X}_{t+2}, \dots, \hat{X}_{t+T_f}\} \qquad (4)$$

where the definition of $\hat{X}$ is the same as that of the input, except that the number of agents is fixed to the number present at time step $t$.
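To make the formulation concrete, the following minimal sketch (the container layout and function name are ours, not the authors' code) shows how the input $I$ and target $O$ can be sliced from a scene sequence:

```python
# A minimal sketch of the problem's input/output structure under the
# notation above; the dict/list layout is our assumption for illustration.
from typing import Dict, List, Tuple

Position = Tuple[float, float, float]   # p_t^i = (x, y, z), ego-relative
Scene = Dict[int, Position]             # X_t: agent id -> position at time t

def split_history_future(scenes: List[Scene], t: int, T_h: int, T_f: int):
    """Return the input I = {X_{t-T_h+1}, ..., X_t} and the
    target O = {X_{t+1}, ..., X_{t+T_f}}."""
    history = scenes[t - T_h + 1 : t + 1]   # T_h scenes ending at time t
    future = scenes[t + 1 : t + T_f + 1]    # T_f scenes to predict
    return history, future
```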

IV Approach

Fig. 2: The architecture of the proposed approach. The target agent is marked by the grey square, and the blue grid region around it is its grid cell. We generate input representations for all agents based on their trajectory information. These representations are passed through LSTMs and used to construct the social map, while the target agent's representation is encoded as the attention mask. The product of the attention mask and the social map is passed through ConvNets and then concatenated with the target agent's tensor to produce a latent representation. Finally, this latent representation is passed through an LSTM to generate the trajectory prediction for the target agent.

We divide the proposed deep neural network model into four parts: the input representation module, the LSTM encoder module, the attention mask and social map fusion module, and the LSTM decoder module. The overall architecture is shown in Fig. 2. Part 1 processes the raw data and converts it into structured data; part 2 extracts the historical trajectory representations of all the agents; part 3 fuses the surrounding agents' representations to obtain a social representation; this social representation is concatenated with the history trajectory representation to serve as the input of part 4, which finally produces the future prediction.

IV-A Input Representation Module

A complex traffic scene yields unstructured raw data; before feeding the data to our model, we need to transform them into structured data.

We consider a scenario at time $t$ in which $N$ agents are to be predicted. When predicting the trajectory of the $i$-th agent by modeling its planning, assuming there are $n$ agents in the $m \times m$ observable region, we divide the observable region into a $k \times k$ grid. We then label each agent with its corresponding grid cell and represent the historical trajectory of the predicted agent as a tensor of size $B \times T_h \times D$, where $B$ denotes the batch size. We set $D = 3$ to indicate the $x$, $y$, and $z$ coordinates of the agent.

As shown in Fig. 2, we represent the grid as a tensor of size $k \times k$: the cell containing an agent is set to that agent's label, and the value of every other cell is set to 0. Because the number of agents in the observable region varies, we represent their trajectories by a dictionary {Agent_label: trajectory}. However, this representation cannot be fed into our model directly; we need to pad it into a tensor of size $N_{max} \times T_{max} \times D$, where $N_{max}$ represents the maximum number of agents in the grids of the batch, and $T_{max}$ represents the maximum time length of the historical trajectories of the agents in the batch. It is worth noting that the padded tensor of size $N_{max} \times T_{max} \times D$ is used as the input of the LSTM Encoder Module, while the $k \times k$ grid tensor is used to form the social map.
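A sketch of this grid labelling and padding step, assuming an ego-relative m x m region split into a k x k grid (the function and variable names are ours, not the authors' code):

```python
# Build the k x k label grid and the padded trajectory tensor described
# above; label 0 is reserved for empty cells (our convention).
import torch

def build_grid_inputs(trajectories, positions, m=30.0, k=11, D=3):
    """trajectories: {agent_id: (T_i, D) history tensor};
    positions: {agent_id: (x, y) current ego-relative position}.
    Returns the (k, k) label grid, a padded (N, T_max, D) tensor,
    and each agent's true history length."""
    grid = torch.zeros(k, k, dtype=torch.long)
    cell = m / k
    ids = list(trajectories)
    for label, aid in enumerate(ids, start=1):
        x, y = positions[aid]
        col = min(max(int((x + m / 2) / cell), 0), k - 1)
        row = min(max(int((y + m / 2) / cell), 0), k - 1)
        grid[row, col] = label                   # cell holds the agent's label
    T_max = max(traj.shape[0] for traj in trajectories.values())
    padded = torch.zeros(len(ids), T_max, D)
    lengths = torch.zeros(len(ids), dtype=torch.long)
    for i, aid in enumerate(ids):
        T_i = trajectories[aid].shape[0]
        padded[i, :T_i] = trajectories[aid]      # zero-pad shorter histories
        lengths[i] = T_i
    return grid, padded, lengths
```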

IV-B LSTM Encoder Module

To encode the historical trajectory information, we feed the trajectories of the different agents, which have different time lengths, into a dynamic LSTM, taking the hidden state at each agent's own final time step as the representation of its trajectory. This representation is concatenated with the one-hot representation of the agent's class label to form the final representation $h_i$ of agent $i$.
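A sketch of this encoder in PyTorch, assuming the padded tensor and per-agent lengths from the previous module; pack_padded_sequence lets the LSTM read each agent's true, variable length. The 17-dimensional hidden state and 3 agent classes match the hyper-parameters reported in Sec. V-A:

```python
# Variable-length trajectory encoder: hidden state at each agent's final
# step, concatenated with a one-hot class label (17 + 3 = 20 dims).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence

class TrajectoryEncoder(nn.Module):
    def __init__(self, in_dim=3, hidden_dim=17, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)
        self.num_classes = num_classes

    def forward(self, padded, lengths, class_ids):
        # padded: (N, T_max, in_dim); lengths: (N,); class_ids: (N,)
        packed = pack_padded_sequence(padded, lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)          # h_n: (1, N, hidden_dim)
        h = h_n.squeeze(0)                       # per-agent trajectory feature
        one_hot = F.one_hot(class_ids, self.num_classes).float()
        return torch.cat([h, one_hot], dim=-1)   # 20-dim representation h_i
```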

IV-C Attention Mask and Social Map Fusion

IV-C1 Social Map

We take the historical trajectory representation $h_i$ of each agent $i$ in the grid. We then plug it into the corresponding position of its grid cell, using the labels defined in the input representation module, forming a tensor of size $k \times k \times d$ as Fig. 2 shows, where $d$ is the dimension of $h_i$. In this way, only the cells containing agents hold the representations of their historical trajectories, while all other positions are 0. This tensor contains both the location information and the historical trajectory information; we denote it as the social map $S$.
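A sketch of this scatter operation, using the label grid from the input representation module (the channel-first layout is our choice for the ConvNet that follows):

```python
# Scatter per-agent features into their grid cells to form the social map;
# we assume label i+1 corresponds to row i of the feature matrix.
import torch

def build_social_map(features, grid, k=11):
    """features: (N, d) per-agent representations; grid: (k, k) long
    tensor of agent labels, 0 = empty cell. Returns a (d, k, k) map."""
    d = features.shape[1]
    social_map = torch.zeros(d, k, k)
    rows, cols = torch.nonzero(grid, as_tuple=True)
    for r, c in zip(rows.tolist(), cols.tolist()):
        social_map[:, r, c] = features[grid[r, c] - 1]   # label -> feature row
    return social_map
```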

IV-C2 Trajectory-based Attention Mask

In a driving scenario, an agent making a decision pays attention to those agents that enter its observable zone and could be dangerous. Thus, when using the surrounding information to assist trajectory prediction, not every agent is equally important: the importance is related to the centric agent's historical trajectory as well as the other agents' positions. Therefore, we model this process [2] by feeding the trajectory representation of the centric agent into a fully-connected network to predict a $k \times k$ grid mask $M$:

$$M = \sigma(\mathrm{FC}(h_c)) \qquad (5)$$

where $h_c$ is the trajectory representation of the centric agent derived in the LSTM encoder module, and each value of $M$ lies in $[0, 1]$, indicating how important the corresponding agent is for predicting the trajectory of the centric agent.
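A sketch of this mask network, assuming a single fully-connected layer; the sigmoid is our reading of "values in [0, 1]":

```python
# Map the centric agent's 20-dim representation to a k x k grid of
# attention weights in [0, 1].
import torch
import torch.nn as nn

class AttentionMask(nn.Module):
    def __init__(self, feat_dim=20, k=11):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(feat_dim, k * k)

    def forward(self, h_centric):
        # h_centric: (feat_dim,) or (B, feat_dim)
        mask = torch.sigmoid(self.fc(h_centric))   # per-cell importance
        return mask.view(*h_centric.shape[:-1], self.k, self.k)
```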

IV-C3 Social Convolution Fusion

We take the element-wise product of the derived social map $S$ and the attention mask $M$, then feed the result into a convolutional layer with ReLU activation to compute convolutional feature maps, which are further processed by a second convolutional layer that fuses the information of different positions. Finally, the result is fed into a max-pooling layer to further extract the surrounding information, as Fig. 2 shows. A fully-connected network then embeds this social feature into the same representation space as $h_c$, and the two are concatenated to form a trajectory representation with agents' interactions.
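A sketch of the fusion under the hyper-parameters reported in Sec. V-A (kernel sizes 3 and 5, strides 2, output channels 64 and 16, pooling kernel 2); the paddings are our addition so the 11 x 11 grid does not collapse before the pool:

```python
# Attention-gated social convolution: mask gates the map element-wise,
# two convs + max-pool extract a fused feature, an FC embeds it, and the
# result is concatenated with the centric agent's representation.
import torch
import torch.nn as nn

class SocialConvFusion(nn.Module):
    def __init__(self, feat_dim=20, k=11, embed_dim=20):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(feat_dim, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),   # spatial size: 11 -> 6 -> 3 -> 1
        )
        self.fc = nn.Linear(16, embed_dim)

    def forward(self, social_map, mask, h_centric):
        # social_map: (B, feat_dim, k, k); mask: (B, k, k); h_centric: (B, feat_dim)
        gated = social_map * mask.unsqueeze(1)   # element-wise attention gating
        fused = self.conv(gated).flatten(1)      # (B, 16)
        return torch.cat([self.fc(fused), h_centric], dim=-1)   # (B, 40)
```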

IV-D LSTM Decoder Module

This module predicts the centric agent's trajectory by taking the trajectory representation with agents' interactions as input. As Fig. 2 shows, we use this representation as the initial hidden state of the LSTM decoder, which predicts the trajectory by outputting a vector $o_t \in \mathbb{R}^d$ at each step, where $d$ varies according to the type of loss function we use. In detail, if we use the L2 loss as the loss function, $d = 3$, representing the $x$, $y$, and $z$ coordinates. If the loss function is the GMM loss, $o_t$ represents the parameters of the Gaussian mixture distribution over the next position.
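A sketch of such a decoder for the L2 case; the roll-out scheme of feeding each predicted point back as the next input is our assumption, and the hidden size of 40 matches the 20 + 20 concatenation above:

```python
# LSTM decoder initialised with the fused interaction-aware representation;
# rolls out T_f steps, each step emitting the next predicted point.
import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    def __init__(self, hidden_dim=40, out_dim=3):
        super().__init__()
        self.cell = nn.LSTMCell(out_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, fused, last_pos, T_f=5):
        # fused: (B, hidden_dim) fusion feature; last_pos: (B, out_dim)
        h, c = fused, torch.zeros_like(fused)   # fused feature as initial hidden state
        preds, inp = [], last_pos
        for _ in range(T_f):
            h, c = self.cell(inp, (h, c))
            inp = self.out(h)                   # next point (or distribution params)
            preds.append(inp)
        return torch.stack(preds, dim=1)        # (B, T_f, out_dim)
```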

IV-E Loss Function

We use the L2 loss to regress the predicted coordinates; the overall loss is computed as:

$$L_{2} = \frac{1}{T_f} \sum_{t=1}^{T_f} \left\| \hat{p}_{t} - p_{t} \right\|^2 \qquad (6)$$

where $T_f$ is the prediction time length, $\hat{p}_t$ denotes the predicted coordinates of the agent, and $p_t$ is the ground truth. Alternatively, we regard the model as a probability density estimation model and use a Gaussian mixture model to model the probability of the predicted trajectory. In this case, the objective function is:

$$L_{GMM} = -\log P(O \mid I; \theta) \qquad (7)$$

where $\theta$ denotes the model parameters, $I$ represents the input historical scenes, and $O$ denotes the predicted scenes. We refer to this loss function as the GMM loss.
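Minimal sketches of the two objectives; the GMM version is written for a single diagonal Gaussian component per step, a one-component special case of Eq. (7):

```python
# L2 regression loss (Eq. 6) and a single-Gaussian negative log-likelihood
# in the spirit of the GMM loss (Eq. 7); the one-component simplification
# is ours, to keep the sketch short.
import math
import torch

def l2_loss(pred, gt):
    """pred, gt: (B, T_f, 3); squared displacement averaged over the horizon."""
    return ((pred - gt) ** 2).sum(dim=-1).mean()

def gmm_nll_loss(mu, log_sigma, gt):
    """mu, log_sigma, gt: (B, T_f, 3); -log N(gt; mu, sigma) per coordinate."""
    var = (2 * log_sigma).exp()
    nll = 0.5 * ((gt - mu) ** 2 / var + 2 * log_sigma + math.log(2 * math.pi))
    return nll.sum(dim=-1).mean()
```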

Dataset:               |        all        |    pedestrian     |      vehicle      |       rider
Model \ Metric         | ADE   MDE   FDE   | ADE   MDE   FDE   | ADE   MDE   FDE   | ADE   MDE   FDE
Linear Regression      | 3.31  3.77  3.35  | 2.68  3.03  2.69  | 3.31  3.77  3.34  | 3.04  3.53  3.09
LSTM_AE [8]            | 1.11  1.69  1.60  | 0.94  1.43  1.35  | 1.12  1.71  1.63  | 1.12  1.75  1.64
CVAE [17]              | 0.81  1.28  1.05  | 0.70  1.17  1.01  | 1.23  1.27  1.03  | 0.82  1.32  1.11
Ours (L2 Loss)         | 0.71  1.15  1.04  | 0.69  1.12  1.02  | 0.71  1.14  1.03  | 0.80  1.29  1.19
Ours (GMM Loss)        | 0.65  1.04  0.93  | 0.64  1.01  0.91  | 0.65  1.04  0.93  | 0.72  1.15  1.02
TABLE I: The results for trajectory prediction on the BLVD dataset.

V Experiments

V-A Settings

Dataset

The BLVD dataset [18] consists of 654 high-resolution video clips with a total of 120k frames, collected in Changshu city, Jiangsu province. The dataset includes 6,004 valid event fragments of surrounding participants. In each frame, the ID, 3D coordinates, direction information, and interaction behavior of all objects are recorded. We follow Xue et al. [18] in dividing the dataset into a training set and a test set. The data cover four conditions (day high density, day low density, night high density, night low density); the different lighting conditions between day and night affect the detection of the agents.

Implementation Details

We run our model on a desktop running Ubuntu 16.04 with a 4.0 GHz Intel Core i7 CPU, 32 GB of memory, and an NVIDIA Tesla V100 graphics card. Our model is implemented in Python with PyTorch.

Hyper-parameter Settings

We set the observable region size m to 30 m and the grid size k to 11. The dimension of the output representation is 20, consisting of a 17-dimensional historical trajectory representation and a 3-dimensional representation of the agent category; there are three types of agents: vehicles, pedestrians, and riders. The hyper-parameters of the two convolutional layers are: kernel sizes 3 and 5, strides 2 and 2, and output channels 64 and 16. The kernel size of the pooling layer is 2. We use the Adam optimizer to train the model, setting the learning rate to 0.001 and multiplying it by 0.1 every 10 epochs until convergence, and set the batch size to 256.
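A minimal sketch of this training setup in PyTorch; the placeholder module and epoch count are ours, while the learning rate, decay factor, and step size follow the text:

```python
# Adam with lr 0.001, decayed by 0.1 every 10 epochs via StepLR.
import torch
import torch.nn as nn

model = nn.LSTM(3, 17)  # placeholder standing in for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(50):
    # ... one pass over the training set with batch size 256 goes here ...
    scheduler.step()    # multiplies the learning rate by 0.1 every 10 epochs
```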

V-B Metrics

We follow the metrics used in [11] and [19]. In this paper, we use three evaluation metrics to comprehensively measure the performance of the model. The Average Displacement Error (ADE) reflects the average level of the prediction error:

$$ADE = \frac{1}{N \, T_f} \sum_{i=1}^{N} \sum_{t=1}^{T_f} \left\| \hat{p}_t^i - p_t^i \right\| \qquad (8)$$

The Maximum Displacement Error (MDE) is the upper bound of the prediction error:

$$MDE = \max_{i,t} \left\| \hat{p}_t^i - p_t^i \right\| \qquad (9)$$

The Final Displacement Error (FDE) is the displacement error of the predicted trajectory's final point:

$$FDE = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{p}_{T_f}^i - p_{T_f}^i \right\| \qquad (10)$$
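The three metrics are straightforward to implement; a sketch for predictions and ground truth of shape (N, T_f, 3), with function names of our choosing:

```python
# ADE, MDE, FDE: displacement is the Euclidean norm per predicted point.
import torch

def displacement(pred, gt):
    return torch.linalg.norm(pred - gt, dim=-1)   # (N, T_f)

def ade(pred, gt):
    return displacement(pred, gt).mean()          # average over agents and steps

def mde(pred, gt):
    return displacement(pred, gt).max()           # worst-case error

def fde(pred, gt):
    return displacement(pred, gt)[:, -1].mean()   # error at the final point
```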

V-C Comparison Results

In this subsection, to verify the effectiveness of our model, we compare our model with three baseline methods, which are briefly introduced as follows.

  • Linear Regression (LR) estimates linear parameters by minimizing the least-squares error.

  • LSTM autoencoder (LSTM AE) [8] takes the historical trajectory as input to extract an intention representation, which serves as the hidden state of the LSTM decoder for predicting the future trajectory.

  • Conditional VAE (CVAE) [17] uses a variational autoencoder that takes the historical trajectory as the encoder input, concatenates the one-hot class representation to the encoder output, and feeds the result to the decoder to obtain the future prediction.

The comparison results are reported in Tab. I, from which we observe that our model significantly outperforms the baselines on all of the datasets, especially on the vehicle dataset, which has the most samples. We analyze our experimental results from the following two aspects:

Baselines LSTM_AE performs significantly better than Linear Regression since it can learn non-linear motions. CVAE performs better than LSTM_AE on all of the datasets since it takes advantage of the information on the types of agents. Both are less accurate on the vehicle dataset than on the other two, because the vehicle dataset has more samples than the other two and its trajectories are more complex and less predictable.

(a) Day high-density
(b) Day low-density
(c) Night high-density
(d) Night low-density
Fig. 3: Visualized prediction results. The red cuboids from big to small are vehicles, riders, and pedestrians, respectively. The circles on the left represent autonomous vehicles. The blue lines are inputs, the green lines are ground truth, and the purple, blue, yellow, and red lines are the prediction results of Linear Regression, LSTM AE, CVAE, and our model, respectively.

Proposed models Our model with L2 loss outperforms CVAE by 12%: on the one hand, it utilizes the information of surrounding agents to assist the trajectory prediction; on the other hand, we use the centric agent's intention to predict an attention mask that emphasizes the important information of surrounding agents. Moreover, our model with GMM loss outperforms CVAE by 20%, since the GMM loss predicts the output's distribution rather than the trajectory itself, which makes the information extracted from the surroundings, and hence our attention mask, more accurate. The comparatively weaker performance of our model on the rider dataset is due to riders' weak dependence on the surrounding information. Furthermore, we measure the computing speed of our proposed model: in testing it runs at 32 fps, which outperforms the 12 fps of the model without attention.

V-D Qualitative Results

We show the visualization results of trajectory prediction in the four different scenarios of Fig. 3, namely day high-density, day low-density, night high-density, and night low-density. Under these four complex scenarios with different modes, the trajectories predicted by our model are the most accurate, whether the agents are pedestrians, vehicles, or riders.

Fig. 4: The visualization of the attention mask. The left is the heat map of the attention mask, and the right is the distribution of agents in the grid cell; the cells with a value of 1 are the cells containing agents.

In addition, to visualize what our attention module has learned, we show a typical mask in Fig. 4. It is worth noting that:

  • The proposed framework provides a simple and effective way to model the different types of traffic agents, and this modeling improves the trajectory prediction precision.

  • The high-weight regions in the attention mask learned by our model are strips. Considering that the driving routes of agents are also strips, we speculate that the model predicts which surrounding agents are important according to its own intention.

We conclude that our model uses the centric agent's intention-based attention convolution fusion representation to improve the performance of trajectory prediction compared with existing methods.

V-E Ablation Study

To examine the impact of the information fusion mechanism on our model, we adopt the following two mechanisms to replace our social convolutional network (SCNN): direct concatenation (CON) and social pooling (SP). The experimental results are shown in Tab. II: our model achieves the lowest prediction error, in which our variable-length LSTM encoder and attention mask play an important role. The ConvNet is also essential for extracting the interactive information. In addition, as Tab. III shows, a longer prediction horizon causes a decrease in prediction accuracy.

Dataset: all
Model \ Metric             ADE    MDE    FDE
VLSTM + CON                1.08   1.77   1.72
VLSTM + SP                 0.86   1.26   1.16
VLSTM + SCNN               0.82   1.26   1.17
LSTM + Attention + SCNN    0.76   1.23   1.16
VLSTM + Attention + SCNN   0.65   1.04   0.93
TABLE II: The results of different combinations of modules.

Dataset: all
Horizon \ Metric   ADE    MDE    FDE
5 frames           0.65   1.04   0.93
7 frames           0.78   1.11   1.10
9 frames           0.80   1.15   1.14
TABLE III: The results for different prediction horizon lengths.

VI Conclusions

In this paper, we propose a trajectory prediction model that involves an attention mechanism and a social map. By comparing our method with several existing methods on the BLVD dataset and analyzing the ablation experiments, we conclude that 1) it is important for trajectory prediction to consider the influence of surrounding agents on the target agent, and 2) the attention mechanism contributes significantly to the accuracy improvement.

VII Acknowledgements

This work is supported by the National Science Foundation of China (Nos. 61790562, 61790563, and 61773312).

References

  • [1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese (2016) Social LSTM: human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • [3] S. Chen, Z. Jian, Y. Huang, Y. Chen, Z. Zhou, and N. Zheng (2019) Autonomous driving: cognitive construction and situation understanding. Science China Information Sciences 62 (8), pp. 81101.
  • [4] S. Danielsson, L. Petersson, and A. Eidehall (2007) Monte Carlo based threat assessment: analysis and improvements. In 2007 IEEE Intelligent Vehicles Symposium, pp. 233–238.
  • [5] N. Deo and M. M. Trivedi (2018) Convolutional social pooling for vehicle trajectory prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1468–1476.
  • [6] N. Deo and M. M. Trivedi (2018) Multi-modal trajectory prediction of surrounding vehicles with maneuver based LSTMs. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1179–1184.
  • [7] J. Firl, H. Stübing, S. A. Huss, and C. Stiller (2012) Predictive maneuver evaluation for enhancement of car-to-x mobility data. In 2012 IEEE Intelligent Vehicles Symposium, pp. 558–564.
  • [8] A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
  • [9] R. E. Kalman (1960) A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82 (1), pp. 35–45.
  • [10] C. Laugier, I. E. Paromtchik, M. Perrollaz, M. Yong, J. Yoder, C. Tay, K. Mekhnacha, and A. Nègre (2011) Probabilistic analysis of dynamic scenes and collision risks assessment to improve driving safety. IEEE Intelligent Transportation Systems Magazine 3 (4), pp. 4–19.
  • [11] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker (2017) DESIRE: distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 336–345.
  • [12] S. Lefèvre, C. Laugier, and J. Ibañez-Guzmán (2011) Exploiting map information for driver intention estimation at road intersections. In 2011 IEEE Intelligent Vehicles Symposium (IV), pp. 583–588.
  • [13] X. Li, X. Ying, and M. C. Chuah (2019) GRIP: graph-based interaction-aware trajectory prediction. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 3960–3966.
  • [14] W. Luo, B. Yang, and R. Urtasun (2018) Fast and Furious: real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3569–3577.
  • [15] Y. Ma, X. Zhu, S. Zhang, R. Yang, W. Wang, and D. Manocha (2019) TrafficPredict: trajectory prediction for heterogeneous traffic-agents. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6120–6127.
  • [16] S. H. Park, B. Kim, C. M. Kang, C. C. Chung, and J. W. Choi (2018) Sequence-to-sequence prediction of vehicle trajectory via LSTM encoder-decoder architecture. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1672–1678.
  • [17] K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491.
  • [18] J. Xue, J. Fang, T. Li, B. Zhang, P. Zhang, Z. Ye, and J. Dou (2019) BLVD: building a large-scale 5D semantics benchmark for autonomous driving. arXiv preprint arXiv:1903.06405.
  • [19] Y. Yao, M. Xu, C. Choi, D. J. Crandall, E. M. Atkins, and B. Dariush (2019) Egocentric vision-based future vehicle localization for intelligent driving assistance systems. In 2019 International Conference on Robotics and Automation (ICRA), pp. 9711–9717.
  • [20] W. Zhan, A. La de Fortelle, Y. Chen, C. Chan, and M. Tomizuka (2018) Probabilistic prediction from planning perspective: problem formulation, representation simplification and evaluation metric. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1150–1156.