Multi-Vehicle Trajectories Generation for Vehicle-to-Vehicle Encounters

09/15/2018 ∙ by Wenhao Ding, et al. ∙ Carnegie Mellon University

Generating multi-vehicle trajectories analogous to those in the real world can provide reliable and versatile testing scenarios for autonomous vehicles. This paper presents an unsupervised learning framework to achieve this. First, we implement a variational autoencoder (VAE) to extract interpretable and controllable representations of vehicle encounter trajectories. By sampling from the distribution of these representations, we are able to generate new meaningful driving encounters with a developed Multi-Vehicle Trajectory Generator (MTG). A new metric is also proposed to comprehensively analyze and compare disentangled models. It can reveal the robustness of models and the dependence among latent codes, thus providing guidance for practical applications to improve system performance. Experimental results demonstrate that our proposed MTG outperforms infoGAN and vanilla VAE in terms of disentanglement ability and traffic awareness. The generated trajectories can provide abundant and controllable driving scenarios, thus providing testbeds and algorithm design insights for autonomous vehicle development.




I Introduction

Diverse and complex scenarios involving surrounding vehicles pose a major challenge for fully autonomous driving because of environmental uncertainty. Classifying complex scenarios and then designing associated policies separately seems to be an easy way to overcome this challenge, but the flood of on-hand datasets could overwhelm human insight and analysis because of limited prior knowledge of complex driving scenarios. Some researchers resort to deep learning technologies, such as learning a controller via end-to-end neural networks, which do not need to fully recover the internal relationships of interaction policies among multiple vehicles. However, this approach requires large amounts of high-quality data and would fail when encountering new scenarios that never appeared in the training dataset. Other researchers resort to reinforcement learning because of its ability to learn from failures in new environments through exploration and exploitation. However, it has to try all possible scenarios before it can drive successfully in any existing or forthcoming scenario. Sufficient testing in all possible scenarios is required for the end-users.

There exist some public databases [3], but most of them do not contain multi-vehicle interaction data such as vehicles' GPS trajectories and dynamic states, because of cost and technical limitations. Collecting such data on the road is also excessively time- and resource-consuming, and dangerous as well [4]. An alternative is to develop an efficient model that can generate new scenarios, statistically analogous to those in the real world, from the limited on-hand datasets (Fig. 1). This procedure consists of two stages: first projecting the encounter trajectories to a disentangled space, and then generating new trajectories by sampling from this space.

Fig. 1: Procedure of generating multi-vehicle trajectories.

For generating new samples, one suitable candidate is the generative model, for example Generative Adversarial Networks (GANs) [5], which have been applied to image style transformation and face reconstruction [6, 7]. The Variational Autoencoder (VAE) [8], another class of generative models, can control the characteristics of generated samples more explicably than GAN thanks to significant theoretical improvements [9, 10]. On the other hand, the Convolutional Neural Network (CNN) is well suited to images but unsuitable for time-series data. To that end, combining recurrent neural networks (RNNs) with GAN or VAE provides a practically tractable way to deal with time series. Most such works are in Natural Language Processing (NLP), for example machine translation [11] and image captioning [12]. In these methods, Long Short-Term Memory (LSTM) [13] and the Gated Recurrent Unit (GRU) [14] are usually used because they selectively remember and forget past states.

Fig. 2: Framework of our proposed multi-vehicle trajectory generator, which consists of three parts: the encoder (green), the reparameterization process (purple), and the decoder (red).

Some existing literature uses the aforementioned generative models with supervised methods to predict spatiotemporal trajectories of human movements [15, 16, 17, 18, 19] or vehicle behavior [20, 21], given a specific trajectory. However, this approach is not applicable here because of the difficulty of modeling all moving objects in a scenario. To make a generated trajectory reasonable, the potential trajectories of nearby objects must be considered simultaneously.

Supervised learning can extract the features of interaction; however, its transformation ability is limited to the data space. Since unsupervised learning can extract intrinsic features and reconstruct scenes of multi-agent interactions, in this paper we develop an unsupervised learning framework (Fig. 2) that regenerates multi-trajectory time-series data to address the above issues. An end-to-end system is built to extract interpretable representations of driving encounters by combining an encoder (green) with a bidirectional GRU (purple). For the decoder (red), we implement two branches to process multiple sequences separately. These sequences interact with each other through hidden states containing information about the former samples: a hidden state of one sequence is considered as part of the input to the next state of the other sequence. In summary, the main contributions of this paper are threefold:

  • We utilize VAE to extract the representations of driving encounters, and then realize the intersection trajectory generation of two vehicles by sampling from these representations.

  • We propose an interactive structure to generate trajectories consistent with the spatiotemporal characteristics of real traffic trajectories.

  • We develop a new disentangled metric to comprehensively analyze and compare generative models regarding their robustness.

The remainder of this paper is organized as follows. Section II introduces some related works. Section III presents our developed methods. Section IV details the experiment procedure. Section V discusses and analyzes the experiment results. Finally, the conclusion and future work are given in Section VI.

II Related Works

II-A Generative Models

Deep generative models such as VAE [8] and GAN [5] have been widely used. VAE introduces the idea of variational inference into neural networks to calculate the posterior probability, while GAN uses an adversarial game between a generator and a discriminator. The discriminator learns to distinguish true samples from false ones, so the sample distribution created by the generator is gradually forced to approach the real data distribution. VAE can obtain interpretable latent representations and controllable decoupled features. However, one of its limitations is the ambiguity of the decoder when processing images. Thus β-VAE [9, 22] was developed to balance distribution matching and reconstruction.

Different models based on GAN have been developed. For example, the infoGAN structure was proposed to obtain interpretable representations by combining GAN with mutual information maximization [23]. Its main advantage is the ability to generate high-quality samples [24], but its training process is unstable and can run into the mode collapse problem [6, 7]. Our goal is to generate trajectories, so we do not expect to face these issues. Considering the difficulty of training GAN, we selected β-VAE [9] as the basic framework and modified it to suit time-series information extraction and trajectory generation.

II-B Time Series Processing

Dealing with time series is challenging because adjacent states depend on each other, which requires the model to have some form of memory. LSTM [13] and GRU [14] are potential solutions to this issue. Existing work on time-series processing usually combines a generative model with LSTM, GRU, or their variants. For instance, RGAN and RCGAN [25] were proposed to process medical time-series data, and VRAE and VRNN [26, 27] were proposed by combining RNN with VAE. The factorized hierarchical variational autoencoder (FHVAE) was proposed to denoise voice data [28]. In addition, Seq2seq is essentially a time-series-based autoencoder [11] and is a well-known structure in the field of NLP [11, 29]. Since GRU is faster than LSTM in processing time series, we select GRU for our new model.

II-C Trajectory Generation

Besides natural language and voice data, RNN is also commonly used to process coordinate trajectories. Most work on trajectory generation aims to predict the subsequent positions of a trajectory under supervision, for example predicting pedestrian trajectories in scenarios with multiple people [15, 16, 17, 18, 19]. Alahi et al. proposed a social LSTM [15] to increase the correlation between hidden states within LSTM, so that multiple agents in a neighborhood are considered simultaneously. An improvement on LSTM was made in [17], which proposed social GAN, where the trajectory is produced by the generator of a GAN. A new attention mechanism was added to LSTM to implement social and physical constraints [18]. A structured LSTM was also proposed in [30] to predict pedestrian trajectories. Nevertheless, supervised learning methods can only generate results in a fixed data space, and they do not have a latent code to control the generated results.

Other examples of trajectory generation are sketch drawing [31] and Chinese character stroke generation [32]. These works usually use VAE [31] and Seq2seq structures combined with the Mixture Density Network (MDN) [33]. In MDN, the author used a Gaussian mixture model (GMM) to describe the generated result instead of outputting it directly. Their model creates simple sketch trajectories instead of images. However, these works all focus on a single sequence, and only one binary code was used to determine pen-up and pen-down, without considering the relationship between multiple sequences. In contrast, our task is to deal with the interaction between two vehicles, which is more complicated.

Traffic scenarios are quite complex and contain a great deal of uncertainty, for example regarding driving intention [34, 35]. The authors of [3] proposed a traffic-primitive-based framework to reconstruct the scenarios. A very simple LSTM structure for road vehicle trajectory prediction was developed in [36], and then improved by considering behavior and social rules [20, 21]. These analyses of traffic scenes and signals have achieved some promising results.

III Methods

In this section, we briefly introduce the basic principles of VAE and β-VAE, then propose two simple baselines based on VAE and infoGAN. In addition, we present the MTG architecture, which outputs high-quality vehicle encounter trajectories. Finally, we introduce our developed metric for comprehensive model evaluation and analysis.

III-A VAE and β-VAE

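The main difference in formulation between the VAE and the autoencoder is an additional Kullback-Leibler (KL) divergence term, $D_{KL}(q_\phi(z|x)\,\|\,p(z))$, where $x$ and $z$ represent the real data and the latent code, respectively. The generative model is defined by a standard Gaussian prior $p(z) = \mathcal{N}(0, I)$, and $q_\phi(z|x)$ is the distribution of the representation of the real data. To make backward propagation possible, we use the reparameterization trick: the neural network outputs the mean value $\mu$ and variance $\sigma^2$ of the current distribution, and $z$ is expressed as follows:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

The objective function of VAE, written as a loss to be minimized, consists of two parts:

$$\mathcal{L}(\theta, \phi; x) = -\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right)$$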

Essentially, there is a trade-off between the two parts: the KL divergence forces the distribution of the latent codes to be as close as possible to a Gaussian, while the reconstruction error forces the latent code to contain more information about the data. The two terms constrain each other during training, so the result is either poor reconstruction quality or a latent code distribution far from Gaussian. β-VAE improves the vanilla VAE by adding a hyper-parameter $\beta$ to the KL divergence term to adjust the ratio [9]. As a result, the training balance can be controlled by adjusting a single parameter.
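The reparameterization step and the weighted objective described above can be illustrated with a minimal, framework-free sketch. It assumes a diagonal Gaussian posterior parameterized by a mean and a log-variance (a common convention, not confirmed by the text), uses MSE as the reconstruction error, and all function names are hypothetical.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def mse(target, output):
    """Mean squared error between two equally long sequences of numbers."""
    return sum((t - o) ** 2 for t, o in zip(target, output)) / len(target)

def beta_vae_loss(target, output, mu, log_var, beta):
    """Reconstruction error plus beta-weighted KL term (to be minimized)."""
    return mse(target, output) + beta * kl_to_standard_normal(mu, log_var)

# Toy numbers, chosen only to show how beta shifts the balance.
rng = random.Random(0)
mu, log_var = [0.5, -0.2], [0.0, -1.0]
z = reparameterize(mu, log_var, rng)
loss_vanilla = beta_vae_loss([1.0, 2.0], [0.9, 2.1], mu, log_var, beta=1.0)
loss_beta4 = beta_vae_loss([1.0, 2.0], [0.9, 2.1], mu, log_var, beta=4.0)
```

With `beta=1.0` this reduces to the vanilla VAE loss; a larger `beta` penalizes deviation from the Gaussian prior more strongly at the expense of reconstruction quality.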

III-B Baseline of Method

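Considering the encoding and generation tasks of two sequences, we first build a Seq2seq baseline system with GRU and β-VAE. The structure merges the two sequences and processes them simultaneously. The encoder uses a single-layer GRU and reparameterizes its output vectors by

$$h^{enc} = \mathrm{GRU}^{enc}([S_1; S_2]), \quad \mu = f_\mu(h^{enc}), \quad \sigma = f_\sigma(h^{enc}), \quad z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

where $S_1$ and $S_2$ are the two input sequences, $f_\mu$ and $f_\sigma$ denote learned output layers, and $z$ is the latent code vector with dimension $K$. The obtained $z$ is then fed to the decoder as the initial state, and the decoder outputs sequence coordinates recurrently. Namely, it uses the output coordinates of the last state as the input coordinates of the next state:

$$h_t^{dec} = \mathrm{GRU}^{dec}([\hat{S}_{1,t-1}; \hat{S}_{2,t-1}],\, h_{t-1}^{dec}), \quad [\hat{S}_{1,t}; \hat{S}_{2,t}] = f_{out}(h_t^{dec}), \quad h_0^{dec} = z$$

For the reconstruction error, we use the Mean Square Error (MSE) computed on the two sequences separately, and the final objective function is

$$\mathcal{L} = \mathrm{MSE}(S_1, \hat{S}_1) + \mathrm{MSE}(S_2, \hat{S}_2) + \beta\, D_{KL}\left(q_\phi(z|S_1, S_2)\,\|\,p(z)\right)$$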

In order to test the capability of GAN, we implemented a GRU version of infoGAN as another baseline. Its generator and discriminator adopt a structure similar to that of the VAE baseline.

Initialize $D$ as an empty dictionary
for each iteration $n = 1, \dots, N$ do
   Initialize $Z$ as an empty set
   Sample an index $k$ from $\{1, \dots, K\}$ with uniform distribution
   for each $l = 1, \dots, L$ do
      Use $z_k$ as the index code (kept fixed) and generate the other codes with uniform distribution
      Concatenate them into the latent code $z$
      Decoder: latent code $z$ → one sample $x$
      Encoder: one sample $x$ → latent code $\hat{z}$
      Append $\hat{z}$ to $Z$
   end for
   Calculate the normalized variance of each dimension over all latent codes in $Z$
   $d$ = index of the dimension with the smallest variance
   Append $(d, k)$ to the dictionary $D$
end for
Use $D$ to train a classifier and get the accuracy
Algorithm 1 Disentangled metric proposed in [10]

III-C Multi-Vehicle Trajectory Generator (MTG)

Some problems arise when generating multi-vehicle trajectories with VAE. First, the training process is quite slow because the structure of the encoder and decoder is ill-suited to the task, which results in invalid or coupled latent codes and affects the overall performance. Second, the generated sequences contain many sharp turns and circles, which are quite different from real traffic trajectories. A structure that outputs the two sequences in parallel lets them interfere with each other, so the two sequences become dependent.

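Based on this analysis, we propose a new VAE structure for modeling traffic sequences, called the multi-vehicle trajectory generator (MTG), as shown in Fig. 2. We first replace the encoder with a bidirectional GRU, since a reversed vehicle trajectory sequence is still meaningful. The bidirectional GRU is used to better analyze the whole data sequence. This encoder is expressed as

$$\overrightarrow{h} = \overrightarrow{\mathrm{GRU}}([S_1; S_2]), \quad \overleftarrow{h} = \overleftarrow{\mathrm{GRU}}([S_1; S_2]), \quad h^{enc} = [\overrightarrow{h}; \overleftarrow{h}]$$

where $S_1$ and $S_2$ are the two input sequences and $h^{enc}$ concatenates the final forward and backward hidden states, from which the latent code is reparameterized.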

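In the decoder, we separate the two sequences and use the hidden state of one sequence as part of the input to the next state of the other sequence:

$$h_t^{1} = \mathrm{GRU}^{1}([\hat{S}_{1,t-1}; h_{t-1}^{2}],\, h_{t-1}^{1}), \quad h_t^{2} = \mathrm{GRU}^{2}([\hat{S}_{2,t-1}; h_{t-1}^{1}],\, h_{t-1}^{2})$$

where $h_t^{1}$ and $h_t^{2}$ are the hidden states of the two decoder branches and $\hat{S}_{1,t}$, $\hat{S}_{2,t}$ are the coordinates they output at step $t$.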

Since the hidden state retains the information about all past positions, it is able to guide the generation of the other sequence. Generating the two sequences in separate branches also avoids mutual interference, thus making the generated results analogous to the real observed samples.
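The cross-coupling between the two decoder branches can be sketched in a few lines. The sketch below is only illustrative: it replaces the GRU with a toy tanh recurrent cell to stay dependency-free, uses random untrained weights, initializes both branches from the latent code, and starts from a zero coordinate. All of these choices, and every name in the code, are assumptions rather than the authors' implementation.

```python
import math
import random

def make_cell(input_size, hidden_size, rng):
    """Toy tanh recurrent cell standing in for a GRU (assumption for brevity)."""
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
            for _ in range(hidden_size)]
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
           for _ in range(hidden_size)]

    def step(x, h):
        return [math.tanh(sum(a * b for a, b in zip(w_in[k], x)) +
                          sum(a * b for a, b in zip(w_h[k], h)))
                for k in range(hidden_size)]
    return step

def make_readout(hidden_size, rng):
    """Linear map from a hidden state to one (x, y) coordinate."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(2)]
    return lambda h: [sum(a * b for a, b in zip(row, h)) for row in w]

def decode_pair(z, steps, rng):
    """Generate two trajectories; each branch sees the other's last hidden state."""
    hidden = len(z)
    # Branch input = own previous (x, y) + other branch's previous hidden state.
    cell_a = make_cell(2 + hidden, hidden, rng)
    cell_b = make_cell(2 + hidden, hidden, rng)
    out_a, out_b = make_readout(hidden, rng), make_readout(hidden, rng)
    h_a, h_b = list(z), list(z)          # assumed: both start from the latent code
    p_a, p_b = [0.0, 0.0], [0.0, 0.0]    # assumed start coordinate
    traj_a, traj_b = [], []
    for _ in range(steps):
        new_h_a = cell_a(p_a + h_b, h_a)
        new_h_b = cell_b(p_b + h_a, h_b)
        h_a, h_b = new_h_a, new_h_b
        p_a, p_b = out_a(h_a), out_b(h_b)
        traj_a.append(p_a)
        traj_b.append(p_b)
    return traj_a, traj_b

traj_a, traj_b = decode_pair([0.1, -0.3, 0.7], steps=50, rng=random.Random(0))
```

Each branch keeps its own weights and output head, so the sequences are produced separately, while the exchanged hidden states are the only channel through which one vehicle's history influences the other.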

Gaussian mixture models (GMMs) have been used to describe the decoder output [31, 32] for data whose coordinate points act as keypoints: slightly modifying the keypoints changes only the overall structure, so the result keeps a similar semantic meaning. However, a vehicle trajectory should be continuous and smooth, and thus GMM cannot be directly used here.

In many NLP-related works, the Teacher-Forcing algorithm is applied to Seq2seq models during the training stage, wherein the ground truth is used as the input of each state. It can increase the stability of the network and shorten the training time, but it lets the previous information pass directly into the decoder through the decoder input, and thus the hidden states will be unable to contain all the information.

Fig. 3: Trajectories generated from different models. (a), (b), (c) and (d) represent the real trajectories, infoGAN output results, VAE output results, and MTG output results, respectively. The first row displays the reconstruction results of the three models, while the second row shows the results with manually controlled test codes. $z_i$ is the $i$-th element of the latent code, and each code takes 10 values, varying from -1 to 1, to generate different results.
Fig. 4: Results of MTG with our new metric. 10 figures show 10 different input codes. For one code, 10 different variance values from 0.1 to 2.8 are shown. Legends are displayed on the top of the whole figure.
Fig. 5: Results of infoGAN with our new metric. 10 figures show 10 different input codes. For one code, 10 different variance values from 0.1 to 2.8 are shown. Legends are displayed on the top of the whole figure.
Fig. 6: Results of VAE with our new metric. 10 figures show 10 different input codes. For one code, 10 different variance values from 0.1 to 2.8 are shown. Legends are displayed on the top of the whole figure.
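Set the variance values $\sigma_1^2, \dots, \sigma_M^2$
Initialize $V$ as an empty set
for each code index $i = 1, \dots, K$ do
   ($M$ is the number of all variance values)
   for each $j = 1, \dots, M$ do
      Initialize $Z_{i,j}$ as an empty set
      for each $l = 1, \dots, L$ do
         Generate one latent vector $z$ by sampling $z_i$ with variance $\sigma_j^2$ and fixing the others
         Decoder: latent code $z$ → one sample $x$
         Encoder: one sample $x$ → latent code $\hat{z}$
         Append $\hat{z}$ to $Z_{i,j}$
      end for
      $v_{i,j}$ = variance of each code over $Z_{i,j}$
      Append $v_{i,j}$ to $V$
   end for
end for
Display variances in $V$
Algorithm 2 New disentangled metric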

III-D A New Disentangled Metric

Performance evaluation plays a pivotal role in the development of disentangled models. In [9], a supervised metric was proposed that fixes one of the factors and randomly selects the other factors. The latent codes are then retrieved by passing through the decoder and the encoder in turn. A simple classifier is trained to identify which factor is fixed, and the model's coding ability can be evaluated according to the identification result. However, Kim et al. [10] claimed that the method proposed in [9] is flawed in principle, and they proposed a method that calculates the normalized variance for analysis. The pipeline of the metric in [10] is shown in Algorithm 1. In practical experiments, an applicable method should be provided to analyze the model robustness and the effect of each factor. However, the method in [10] can only offer a relative evaluation of the capability of an autoencoder, rather than analyzing the independence between the factors. We elaborate on this problem in the experimental section.

In order to overcome this issue, we propose a comprehensive comparison and analysis metric, as shown in Algorithm 2. We divide the input samples into several groups with different variances (each group has $L$ samples), represented as $Z_{i,j}$, where $i$ is the index of the latent code and $j$ is the group index of the different variances. Differing from the metric in [10], we sample one factor in each group and fix the other factors. After passing through the decoder and encoder, we obtain a more comprehensive analysis by comparing the variance of each factor in the output with the variance of the input.
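The procedure of Algorithm 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the `decode` and `encode` arguments are hypothetical stand-ins for a trained model, the non-sampled codes are fixed at 0, the sampled code is drawn from a zero-mean Gaussian, and the function name `variance_response` is invented for this example.

```python
import math
import random
import statistics

def variance_response(decode, encode, dim, input_variances, group_size, rng):
    """For each code i and each input variance, sample code i with the other
    codes fixed at 0, pass it through decode -> encode, and record the output
    variance of every code."""
    table = {}
    for i in range(dim):
        for var in input_variances:
            outputs = []
            for _ in range(group_size):
                z = [0.0] * dim
                z[i] = rng.gauss(0.0, math.sqrt(var))
                outputs.append(encode(decode(z)))
            table[(i, var)] = [statistics.pvariance([o[k] for o in outputs])
                               for k in range(dim)]
    return table

# Hypothetical stand-ins for a trained model: a perfectly disentangled pair
# (identity maps) and a coupled one in which code 0 leaks into code 1.
identity = lambda v: list(v)
leaky_encode = lambda v: [v[0], v[1] + 0.8 * v[0]] + list(v[2:])

rng = random.Random(0)
clean = variance_response(identity, identity, 3, [0.5, 1.0], 500, rng)
leaky = variance_response(identity, leaky_encode, 3, [0.5, 1.0], 500, rng)
```

In the disentangled case only the sampled code shows an output variance that tracks the input variance, while in the coupled case the variance of code 1 also grows when code 0 is sampled, which is the kind of interference the metric is meant to expose.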

IV Experiment and Data Processing

IV-A Dataset and Preprocessing

The driving encounter data we used was collected by the University of Michigan Transportation Research Institute (UMTRI) [4]. The dataset includes approximately 3,500 equipped vehicles. We used the latitude and longitude information, collected by on-board GPS, as coordinates. The data was collected with a sampling frequency of 10 Hz. To extract features uniformly, linear interpolation was used to reshape each trajectory to a length of 50. Considering the properties of neural networks, we also normalized the data to a fixed range.

Some examples of vehicle encounter trajectories are shown in the first column of Fig. 3. The trajectories of two vehicles are marked with blue and red.
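The preprocessing described above (linear interpolation to 50 points, then normalization) can be sketched as below. The target range of the normalization is not recoverable from the text, so the `low` and `high` parameters and their defaults are assumptions, as is the choice to scale both vehicles with shared bounds; the function names are hypothetical.

```python
def resample(points, length=50):
    """Linearly interpolate a list of (x, y) points to `length` samples,
    evenly spaced in index (i.e. in time for uniformly sampled GPS data)."""
    n = len(points)
    if n < 2 or length < 2:
        raise ValueError("need at least two points and two output samples")
    out = []
    for k in range(length):
        pos = k * (n - 1) / (length - 1)   # fractional index into `points`
        i = min(int(pos), n - 2)
        t = pos - i
        (x0, y0), (x1, y1) = points[i], points[i + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out

def normalize_pair(traj_a, traj_b, low=0.0, high=1.0):
    """Min-max scale both trajectories with shared bounds and a single scale,
    so the relative geometry of the encounter is preserved."""
    xs = [p[0] for p in traj_a + traj_b]
    ys = [p[1] for p in traj_a + traj_b]
    x_min, y_min = min(xs), min(ys)
    span = max(max(xs) - x_min, max(ys) - y_min) or 1.0
    scale = (high - low) / span

    def apply(traj):
        return [(low + (x - x_min) * scale, low + (y - y_min) * scale)
                for x, y in traj]
    return apply(traj_a), apply(traj_b)

# Toy encounter: two short polylines resampled to 50 points each.
a = resample([(0.0, 0.0), (5.0, 2.0), (10.0, 5.0)])
b = resample([(2.0, 2.0), (4.0, 4.0)])
a_norm, b_norm = normalize_pair(a, b)
```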

IV-B Experiment Settings

In this paper, we compare three different architectures for generating trajectories and then evaluate them using our proposed metric. In the next section, we show the generation and evaluation results. To make the latent codes understandable, we use t-Distributed Stochastic Neighbor Embedding (t-SNE) [37] to display generated results in their feature space. We also show the benefits of our proposed evaluation metric by comparing it with the metric proposed in [10].

All the results are generated with manually controlled test codes. We set the dimension of the latent code to 10. Thus, the test codes have 10 different groups, one for each code. In each group, we varied the value of one code from -1 to 1 (ten values were chosen for display) and fixed the other codes at 0. In addition, 10 different variances (from 0.1 to 2.8) were tested for each code and are represented with different colors.
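The manually controlled test codes can be built with a short helper. The sketch assumes the ten displayed values are evenly spaced between -1 and 1, which the text does not state; the function name is hypothetical.

```python
def make_test_codes(dim=10, steps=10, low=-1.0, high=1.0):
    """Return one group per latent code; in group i only code i varies
    from `low` to `high` while every other code stays at 0."""
    groups = []
    for i in range(dim):
        group = []
        for k in range(steps):
            z = [0.0] * dim
            z[i] = low + k * (high - low) / (steps - 1)  # assumed even spacing
            group.append(z)
        groups.append(group)
    return groups

codes = make_test_codes()
```

Feeding each group to the decoder in turn shows which trajectory feature, if any, a single latent code controls.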

V Result Analysis and Evaluation

V-A Generated Results

Fig. 3 shows the raw trajectories and the outcomes generated by three different models. The top four figures show the reconstruction results, which indicate that our proposed MTG outperforms the other two models when reconstructing real trajectories. The bottom four plots show the capability of the latent codes. It can be seen that for infoGAN, the last three rows are almost the same, indicating that these codes neither control any feature nor contain any information about the data. However, some of its codes undeniably learn simple rules. The third and fourth columns display the results of the original VAE and of our proposed VAE architecture, respectively. The output of the original VAE is quite complex, with some circles and sharp turns that are abnormal in real traffic scenarios. In contrast, our developed architecture outputs more reasonable trajectories, and the last column also intuitively shows the control ability of all codes when varying from -1 to 1.

Fig. 7: The t-SNE results of four models. We use codes of variance 1 to generate encounter trajectories. These trajectories are then processed with t-SNE, and the outcomes are 2-dimensional points.
Fig. 8: Comparison of two evaluation metrics on autoencoder and VAE. Only one code is selected to be shown in these four figures.

V-B Disentangled Evaluation of Models

Figs. 4, 5 and 6 display the output variances of the three models. The legend on the top displays the different input variances. First, the results of our new architecture are shown in Fig. 4, where each plot shows the result of sampling one code from a normal distribution while fixing the other codes. As shown in the histogram, only the sampled code outputs an increasing variance, while the other codes stay close to zero. This indicates that there is no interference among those codes, that is, sampling one code only influences itself.

The line chart inside each figure shows the ratio of output variance to input variance for each code. A more robust model will keep the lines of all other codes close to zero and the slope of the input code's line close to 1. For our proposed MTG, the slope of the input code is larger than 1, but the other lines are quite close to 0. This demonstrates that our proposed architecture achieves decoupled and stable performance, and generates meaningful samples with manually controllable latent codes. Figs. 5 and 6 show the results of infoGAN and the original VAE, respectively, which include two kinds of patterns not seen in Fig. 4.

  • One is the pattern of a code in the 1st plot of Fig. 5. It indicates that some codes are always independent of changes in the input code, with variances always around 1. Moreover, these codes output the normal distribution that the KL divergence forces them toward. It can be concluded that these codes are invalid and contain little information about the input data.

  • The other is the pattern of a code in the 9th plot of Fig. 6. It indicates that the output variance changes along with the input variance. This is because these codes are correlated and coupled, and they may control different features dependently. Such interacting codes appear when the model lacks the ability to factorize the latent codes.

As for the ratio of output variance to input variance for each code, some codes show a negative slope in Figs. 5 and 6. This phenomenon is consistent with an output variance that stays at that of the normal distribution while the input variance grows. From the analysis of these results, it can be concluded that our proposed MTG outperforms the original VAE and infoGAN both in decoupling and encoding latent codes and in generating reasonable trajectories.

V-C Feature Space Display

We used t-SNE to display the generated results in feature space. Fig. 7 displays four t-SNE results in two dimensions. In total, 200 samples were generated with a variance of 1.0 for each code independently. We first used principal component analysis (PCA) to reduce the dimension of each sample to 5, then applied t-SNE to obtain the results in Fig. 7. The goal of VAE is to project the data into a disentangled space, where the codes are decoupled. As shown in Fig. 7, the codes of the autoencoder are entangled and not continuous. Both infoGAN and VAE show disentanglement ability for most codes, while some codes are still coupled with each other. In contrast, MTG demonstrates continuity and independence, i.e., its latent codes interact less with each other.

V-D Evaluation Metric

The first plot of Fig. 8 (a) displays the results obtained by using the metric proposed in [10]. The left plot displays the standard deviation of all 10 codes. The autoencoder obtains different standard deviations without the KL divergence, since the elements of the embeddings are encoded in an entangled space. After normalization by dividing by the standard deviation as mentioned before, the right plot shows the output normalized variance of all codes.

In our case, the selected code gets the lowest value, and this sample will be easily classified by the metric proposed in [10]. When this metric is applied to the autoencoder, it achieves a higher score than VAE, which indicates that this disentangled metric has problems in evaluating the distribution of the latent codes and the interference among them. In the second plot of Fig. 8 (a), our proposed metric on the autoencoder shows a different result: the selected code strongly affects the other codes, because most codes other than the selected one obtain a large variance.

Fig. 8 (b) shows the result of the evaluation on VAE. It can be seen that the variance of the selected code is much lower than the others, which ensures that VAE gets a high score. The left figure, using the metric proposed in [10], cannot explain why a code is invalid; however, this can be easily revealed by our proposed metric in the right figure, since the output variance of such a code always remains around 1.

VI Conclusions

This paper proposed a way to generate the encounter trajectories of two vehicles based on VAE. In order to better extract features that account for the relationship between the two spatiotemporal sequences, a new network architecture was proposed. We also developed an evaluation metric capable of comprehensively analyzing the generated results and their stability. Experimental results demonstrate that our proposed architecture achieves more disentangled and stable latent codes. Moreover, our proposed method can obtain more realistic encounter trajectories than the original VAE and infoGAN.

Successfully generating the trajectories of two-vehicle encounters is significant, since controllable trajectory generation could provide sufficient high-quality data at low cost for self-driving applications. We only generate very short trajectories in this paper, but the start point could be set and these trajectories could be carefully cascaded together. In future work, we will also use the actual road coordinates as conditions, and then use a conditional GAN or a conditional VAE to generate trajectories that consider more constraints regarding road profiles and vehicle dynamics.

Appendix A

For more details about the hyper-parameter settings, refer to the link: