An LSTM-Based Autonomous Driving Model Using Waymo Open Dataset

02/14/2020 ∙ by Zhicheng Li, et al. ∙ Columbia University

The Waymo Open Dataset has been released recently, providing a platform to crowdsource some fundamental challenges for automated vehicles (AVs), such as 3D detection and tracking. While the dataset provides a large amount of high-quality and multi-source driving information, people in academia are more interested in the underlying driving policy programmed in Waymo self-driving cars, which is inaccessible due to AV manufacturers' proprietary protection. Accordingly, academic researchers have to make various assumptions to implement AV components in their models or simulations, which may not represent realistic interactions in real-world traffic. Thus, this paper introduces an approach to learn a long short-term memory (LSTM)-based model for imitating the behavior of Waymo's self-driving model. The proposed model is evaluated based on Mean Absolute Error (MAE). The experimental results show that our model outperforms several baseline models in driving action prediction. In addition, a visualization tool is presented for verifying the performance of the model.


1 Introduction

Autonomous vehicles (AVs) have attracted significant interest in recent years, and many leading companies, such as Waymo and Lyft, have invested enormous amounts of money, manpower, and engineering capability in developing such systems. Designing policies for an autonomous driving system is particularly challenging because of demanding performance requirements: the system must make safe operational decisions while processing information in real time. Despite recent technological advances, such systems are still largely under exploration, with many fundamental challenges unsolved.

1.1 Motivation

Reliable driving policies play a crucial role in developing an effective AV system and remain a fundamental challenge for researchers. Due to AV manufacturers' proprietary protection, no documentation reveals how the existing AV test fleets on public roads are actually programmed to drive and interact with other road users, which forms a barrier for outsiders developing AV policies in follow-up studies. Accordingly, academic researchers have to make various assumptions to implement AV components in their models or simulations, which may not represent realistic interactions in simulation-based research.

Waymo’s recent release of a rich and diverse autonomous driving dataset [waymo_open_dataset; sun2019scalability] has attracted attention in academia in terms of pattern recognition, 3D detection, and tracking. However, researchers might be more interested in the inaccessible driving policies of the Waymo cars that generated the dataset. By combining the idea of behavioral cloning with the large Waymo dataset, this lack-of-practical-policy issue can be properly addressed. To this end, this paper proposes a novel long short-term memory (LSTM)-based model to study the latent driving policies reflected by the dataset.

1.2 Research Objective

In this paper, the problem is to learn an autonomous driving policy model from which the given Waymo Open Dataset is most likely to have been generated. The policy model generates a driving action given that the environment is in a certain motion state. The response refers to the instantaneous acceleration a (including the magnitude and steering angle). In a realistic situation, a can be generated by controlling the gas pedal and the brake. The environment is defined by measurable variables that describe the surroundings of an AV, including 12 kinematic and distance measures. These variables were collected by the radars, lidars, cameras, and other sensors installed on Waymo cars. The dataset contains various driving scenarios, among which we focus on one specific scenario: car-following on highways or urban roads.
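Stated compactly (in our own illustrative notation, not taken from the dataset documentation), the objective is to fit a policy that maps a short history of environment states to the AV's next accelerations:

$$\hat{a}_{t+1:t+H} \;=\; \pi_\theta\left(s_{t-T+1}, \ldots, s_t\right), \qquad s_\tau \in \mathbb{R}^{12},$$

where $s_\tau$ is the measured environment state at frame $\tau$, $T$ is the length of the observed history, $H$ is the prediction horizon, and the parameters $\theta$ are chosen so that the recorded Waymo trajectories are most likely under $\pi_\theta$.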

1.3 Contributions

To the best of our knowledge, this paper is the first of its kind to develop a driving model for existing level-5 autonomous cars using Waymo’s sensor data. Due to manufacturers’ proprietary protection, no documentation has revealed how the existing AV fleets are actually programmed to drive and interact with other road users on public roads. Accordingly, academic researchers have to make various assumptions prior to controller design. Our work can be used as a general autonomous driving model for the transportation community in AV-related studies. Specifically, we have made the following contributions:

  • Design advanced LSTM-based policy models with an encoder-decoder structure, which contains multiple stacked LSTM cells;

  • Evaluate the proposed LSTM-based policies by comparing to baselines; and

  • Investigate how using different feature sets as input affects the performance and training efficiency of the models.

2 Related Work

2.1 Behavioral Cloning

Behavioral cloning is a popular approach to mimic the actions of human drivers. Bojarski et al. [BojarskiTDFFGJM16] from NVIDIA presented a convolutional neural network (CNN) to imitate driving behavior and predict steering commands. The trained network maps raw pixels from a single front-facing camera to basic steering commands, and can successfully handle complicated scenarios, such as areas with blurred visual cues or unpaved roads.

2.2 Convolutional Neural Networks

Convolutional neural networks (CNNs) are widely applied in the computer vision field. In 2012, Krizhevsky et al. [NIPS2012_4824] proposed a novel deep learning architecture using CNNs and demonstrated its competence in the ImageNet contest [5206848]. Simonyan and Zisserman [Simonyan15] proposed a deeper CNN network than AlexNet and showed that deeper models have better capacity and lower error rates. To build even deeper networks, Szegedy et al. [43022] introduced the idea of inception, which improved the utilization of the computing resources inside the network. Moreover, He et al. [He_2016_CVPR] proposed the well-known ResNet, which further increased network depth by adding residual blocks and reached a stunning depth of 152 layers. ResNet was later refined by introducing identity mappings [10.1007/978-3-319-46493-0_38], and has been successfully applied to taxi demand forecasting [liao2018large]. In this work, the refined ResNet is treated as a feature extractor for camera images in the Waymo dataset.

2.3 Recurrent Neural Networks

Recurrent neural networks (RNNs) are advantageous in processing time-sequence data because their internal state can store information and capture temporal dynamics. The LSTM network, a refined version of the RNN, solves the vanishing and exploding gradient problems that arise in the original RNN [DBLP:journals/corr/abs-1211-5063]. Thus, many researchers adopt LSTMs as their primary models for various time-sequence tasks such as machine translation [DBLP:journals/corr/ChoMGBSB14], activity recognition [8121994], and trajectory prediction [Alahi_2016_CVPR].

2.4 Trajectory Prediction

In trajectory prediction, future predictions are usually based on past experience, i.e., the history of previous trajectories. Thus, LSTMs, designed to process time-sequence data, are likely to perform well on such a task. Alahi et al. [Alahi_2016_CVPR] proposed a novel LSTM model for predicting human trajectories in crowded spaces, and the SS-LSTM model [8354239] incorporated more information, including occupancy maps and camera images, to build a robust model for predicting pedestrian trajectories. Such ideas are also effective for predicting cars’ trajectories, as cars and pedestrians share common motion patterns to some degree. Inspired by their work, we use LSTM-based approaches to process the time-series data in the Waymo dataset and learn the driving policy underlying it.

3 Data

The Waymo Open Dataset is the largest, richest, and most diverse AV dataset ever published for academic research [sun2019scalability]. This dataset, collected from Waymo level-5 autonomous vehicles in various traffic conditions, comprises radar, lidar, and camera data from 1000 20-second segments (as of December 2019) with labels. In this section, we introduce details about the Waymo dataset, as well as how the data is preprocessed before being fed into several machine learning models.

3.1 Labels

Labels in this dataset refer to kinematic and spatial parameters of objects, which are represented as bounding boxes. Specifically, the type label classifies each object into the pedestrian, vehicle, unknown, sign, or cyclist category. Detailed information is provided for each label, among which we especially pay attention to the coordinates of the bounding boxes, velocities v, and accelerations a in the subsequent feature extraction step.

3.2 Coordinate Systems

Three coordinate systems are provided in this dataset: the global frame, the vehicle frame, and the sensor frame. Some raw features are represented in unintended coordinate systems; in order to maintain consistency, it is crucial to transform data into the correct coordinate system. The dataset also provides the vehicle pose VP, a 4x4 transformation matrix, to transform variables from one coordinate system to another. The formula below shows the most frequent transformation employed in our experiments [waymo_data]:

$$\mathbf{p}_{g} = \mathrm{VP}\,\mathbf{p}_{v},$$

where the subscripts $v$ and $g$ refer to the vehicle frame and the global frame, respectively.
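A minimal sketch of applying this transformation (assuming VP is the usual 4x4 homogeneous transform stored per frame; the helper names below are ours):

```python
import numpy as np

def vehicle_to_global(points_v, vehicle_pose):
    """Transform Nx3 positions from the vehicle frame to the global frame."""
    vp = np.asarray(vehicle_pose, dtype=float).reshape(4, 4)
    pts = np.asarray(points_v, dtype=float)
    homo = np.hstack([pts, np.ones((pts.shape[0], 1))])   # homogeneous Nx4
    return (vp @ homo.T).T[:, :3]

def vectors_to_global(vectors_v, vehicle_pose):
    """Velocities/accelerations are direction vectors, so only rotation applies."""
    vp = np.asarray(vehicle_pose, dtype=float).reshape(4, 4)
    return (vp[:3, :3] @ np.asarray(vectors_v, dtype=float).T).T
```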

3.3 Data Size

According to the data format, 1000 segments are packed into multiple compressed files (tars) with a size of 25 GB each. In our experiments, 32 training tars are used as the training set and 8 validation tars are used as the testing set. The total number of videos extracted from the segments is 45000.

3.4 Acceleration Computation

Because the AV’s instantaneous acceleration a is not directly available in the dataset, the “ground truth” for training and evaluation needs to be computed from velocity differences. We use the formula below to calculate the acceleration between two consecutive video frames, where the acceleration of the initial frame is set to zero:

$$a_t = \begin{cases} 0, & t = 0,\\ \dfrac{v_t - v_{t-1}}{\Delta t}, & t \geq 1, \end{cases}$$

where $t$ is the frame index, $\Delta t$ is the time between consecutive frames, and $a_t$ and $v_t$ are the acceleration and velocity of the AV at time frame $t$, respectively.
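A minimal sketch of this computation (assuming the dataset's 10 Hz frame rate, i.e., $\Delta t = 0.1$ s):

```python
import numpy as np

FRAME_DT = 0.1  # seconds between consecutive frames (assumed 10 Hz)

def compute_accelerations(velocities, dt=FRAME_DT):
    """Finite-difference acceleration per frame; the initial frame is set to zero.

    velocities: (N, 2) array of [vx, vy] in the global frame.
    """
    v = np.asarray(velocities, dtype=float)
    acc = np.zeros_like(v)
    acc[1:] = (v[1:] - v[:-1]) / dt
    return acc
```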

(a) Detection tolerance = 0 (b) Detection tolerance = 1
Figure 1: Examples of two different detection tolerances applied to the same scenario, displayed on visualizations of camera images in Waymo Open Dataset.

3.5 More Features

In addition to the AV’s own velocity, which is represented in the global frame, it is also beneficial to take information about the front car into account while studying the change in the acceleration of the AV. As mentioned previously, the acceleration and velocity data are provided per bounding box in the vehicle frame, so it is necessary to transform them into the global frame to keep consistency.

Furthermore, we use a simple geometrical procedure to detect the existence of a front car. As shown in Figure 1, the procedure checks whether a hypothetical ray (green) starting from the position of the AV intersects the front car (the red bounding box). However, a single ray can miss many cases, so we optimized this procedure by widening the ray into a rectangle that represents the AV’s view with some detection tolerance. As a result, the following quantities are extracted as part of the features: the velocity and acceleration of the front car and the relative distance to it. Additionally, enlarging the tolerance range may lead to detection mistakes when there are multiple vehicles ahead, so the number of vehicles is also included to describe the complexity of the scenario. For frames where no front car is detected, the aforementioned quantities are set to 0.
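A simplified sketch of this front-car check (a hypothetical helper of ours; it assumes axis-aligned boxes expressed in the vehicle frame, with the AV at the origin facing the +x direction):

```python
def find_front_car(boxes, tolerance=1.0):
    """Return (index, distance) of the nearest box hit by the widened forward ray.

    Each box is (center_x, center_y, length, width) in the vehicle frame.
    `tolerance` widens the hypothetical forward ray into a rectangle.
    """
    best, best_dist = None, float("inf")
    for i, (cx, cy, length, width) in enumerate(boxes):
        if cx <= 0:                        # behind the AV
            continue
        lateral_gap = abs(cy) - width / 2.0
        if lateral_gap <= tolerance:       # box overlaps the widened ray
            dist = cx - length / 2.0       # distance to the box's rear face
            if dist < best_dist:
                best, best_dist = i, dist
    return best, best_dist
```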

3.6 Image Embedding

As shown in Figure 2, five cameras are installed on the AV, facing front, front-left, front-right, side-left, and side-right, respectively. These images reflect the time-series information of the moving vehicle with relatively smoother variation than the numerical data, which helps to prevent spiky predictions between consecutive frames. Therefore, we utilize image embeddings, concise vector representations of the images, as part of the input data to train the models. Image embeddings are discussed in more detail in the LSTM-based model sections.

(a) Front-Left
(b) Front
(c) Front-Right
(d) Side-Left
(e) Side-Right
Figure 2: One example of images captured by cameras of five different views from segment-11004685739714500220 of tar training_0000 in Waymo Open Dataset.

4 Baseline Models

In order to establish benchmarks for comparison, we propose several baseline models using different methods including neural network models and decision-tree-based ensemble methods. In this section, we focus on the structures and principles of the baseline models.

4.1 Neural Network Models

  • NN: Since both the numeric and the image data represent the surroundings and affect the decisions made by the AV, the first baseline model is a simple deep neural network that takes the 12 features as input and has 2 hidden layers with 8 and 4 neurons, respectively. This model aims at capturing the characteristics of the 12 features, as well as the relationship between accelerations and the surrounding environment.

  • CNN: The second model is a traditional convolutional neural network with convolutional, batch normalization, and pooling layers. Grid search is applied to determine the optimal parameters during training. Video images of each frame in the dataset are used as inputs to predict the acceleration of the same frame. The output of these layers is then fed into dense layers to extract a feature embedding of the image and generate the output.

  • NN + CNN: To better represent a frame using the image embedding as well as the 12 features corresponding to that frame, the third model combines the NN and CNN models. Specifically, as shown in Figure 3, the last dense layers of both models are concatenated into new dense layers, which eventually generate the final output (a code sketch of this composite model follows Figure 3).

Figure 3: The composited model of NN and CNN. Outputs of dense layers are concatenated to form new dense layers which generate the final output. Notice that the output of the dense layer from the CNN network is actually the trained representation of the input image.
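The composite model can be sketched in Keras as follows (a minimal illustration; the image resolution and CNN filter sizes are our assumptions, while the 12-feature input and the 8/4-neuron hidden layers follow the NN baseline described above):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Numeric branch: the 12 hand-crafted features.
num_in = keras.Input(shape=(12,), name="features")
x = layers.Dense(8, activation="relu")(num_in)
x = layers.Dense(4, activation="relu")(x)

# Image branch: a small CNN over a downscaled camera frame (assumed 128x128).
img_in = keras.Input(shape=(128, 128, 3), name="image")
y = layers.Conv2D(16, 3, activation="relu")(img_in)
y = layers.BatchNormalization()(y)
y = layers.MaxPooling2D()(y)
y = layers.Conv2D(32, 3, activation="relu")(y)
y = layers.GlobalAveragePooling2D()(y)
y = layers.Dense(32, activation="relu")(y)   # image embedding

# Concatenate both branches and regress the 2-D acceleration (x, y).
z = layers.concatenate([x, y])
z = layers.Dense(16, activation="relu")(z)
out = layers.Dense(2, name="acceleration")(z)

model = keras.Model([num_in, img_in], out)
model.compile(optimizer="adam", loss="mae")
```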

4.2 Ensemble Methods

Another popular family of methods for regression problems is ensemble methods. For a thorough exploration of the dataset, we tried several ensemble methods and compared their prediction results (a code sketch follows the list below).

  • XGBoost: XGBoost is a decision-tree-based ensemble machine learning algorithm built on a gradient boosting framework. It uses a more regularized model formalization to control over-fitting, which frequently produces superior performance in competitions.

  • Light Gradient Boosting: Another popular boosting algorithm, Light Gradient Boosting [NIPS2017_6907], uses gradient-based one-side sampling and grows trees leaf-wise: instead of growing trees horizontally (level by level), it grows them vertically by adding leaves, which reduces the loss more per split than level-wise growth.

  • Stacked Linear Regressor: The stacked linear regressor [Breiman_1996] is a more comprehensive ensemble method that linearly combines different models. Unlike the other boosting regressors, it uses a second-layer learning algorithm that weights the first-layer models at their optimum. Moreover, the second layer can be built from more complex models, which overcomes the limitations of a single linear regression model.
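The sketch below shows one way to set up these baselines with scikit-learn, XGBoost, and LightGBM (the hyper-parameter values are illustrative only; the exact settings used in the paper are not reproduced here):

```python
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.multioutput import MultiOutputRegressor

def build_ensembles():
    """X: (n_samples, 12) features; y: (n_samples, 2) accelerations (x, y)."""
    xgb = MultiOutputRegressor(XGBRegressor(
        n_estimators=300, max_depth=6, learning_rate=0.05, reg_lambda=1.0))
    lgbm = MultiOutputRegressor(LGBMRegressor(
        n_estimators=300, num_leaves=31, learning_rate=0.05))
    stacked = MultiOutputRegressor(StackingRegressor(
        estimators=[("xgb", XGBRegressor(n_estimators=300)),
                    ("lgbm", LGBMRegressor(n_estimators=300))],
        final_estimator=RidgeCV()))        # second-layer (linear) regressor
    return {"XGBoost": xgb, "Light Gradient Boosting": lgbm,
            "Stacked Linear Regressor": stacked}
```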

5 LSTM-based Model

5.1 Basic Model with 12 Features

One of the most straightforward ways to build the acceleration prediction model is to treat the 12 basic features as the model input. The "encoder-decoder" architecture proposed for trajectory prediction in SS-LSTM [8354239] is suitable for the acceleration prediction problem, as the acceleration curve is a trajectory built on past experience. While SS-LSTM uses several kinds of input information, such as an occupancy map, the 12 basic features are packed into one single input channel in our basic model. These features are fed into an "encoder" module, which extracts the key information from the input features and generates an intermediate result, or more precisely, a latent representation of these features. The intermediate result is then forwarded into a "decoder" module, which decodes the representation and outputs the acceleration prediction. This architecture is shown in Figure 4.

Figure 4: Given one video clip with a frame length of 10, the input is a vector consisting of the 12 features from these 10 frames. The output is the acceleration for the next 5 frames starting from the end of the video clip. The "encoder" module contains 128 LSTM cells and the "decoder" module contains 128 LSTM cells.
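A minimal Keras sketch of this basic encoder-decoder model (the RepeatVector bridge between encoder and decoder is our assumption about how the latent representation is expanded to the 5 output frames):

```python
from tensorflow import keras
from tensorflow.keras import layers

PAST, FUTURE, N_FEAT = 10, 5, 12

inputs = keras.Input(shape=(PAST, N_FEAT), name="features")
# Encoder: compress the 10 observed frames into a latent vector.
latent = layers.LSTM(128, name="encoder")(inputs)
# Repeat the latent vector once per future frame, then decode.
decoded = layers.LSTM(128, return_sequences=True,
                      name="decoder")(layers.RepeatVector(FUTURE)(latent))
# Per-frame 2-D acceleration (longitudinal and lateral).
outputs = layers.TimeDistributed(layers.Dense(2), name="acceleration")(decoded)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mae")
```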

However, the information carried by the 12 input features is limited. It is difficult for the model to capture and analyze the complex environment surrounding the AV from these simple numerical features, and a model trained on them alone may fail to give satisfactory predictions. Higher-quality features carrying more key information are needed to improve the model capacity.

5.2 Advanced Model with Image Inputs

Given the limitation of the basic model, more features from the Waymo Open Dataset should be incorporated, and one of the most valuable sources is the camera image, which records the changing environment of the AV. Obviously, a raw camera image cannot be fed into the basic model directly, since an image is two-dimensional while the input is required to be a one-dimensional vector. Thus, it is necessary to convert the raw image into a vector whose elements represent the key content of the image.

Naturally, a CNN is a popular and excellent image feature extractor that converts an image into a flattened vector, so adding a CNN in front of the "encoder" module of the basic model seems reasonable. Training the CNN module together with the "encoder" and "decoder" modules yields an end-to-end advanced model that is much more complex than the basic model. However, training this model from scratch can be time-consuming because of the complexity of a deep CNN module. One quick but effective solution is to use a pre-trained CNN module to convert the raw image into a vector without any fine-tuning, as the popular pre-trained CNN networks are mostly well trained on massive datasets and should generalize well. One of the best such networks is ResNetV2 [10.1007/978-3-319-46493-0_38], and the output of its second-to-last layer can serve as a high-quality representation of the input image. Given such an image feature extractor, the front camera image in the Waymo Open Dataset is chosen as the major input, as this camera captures the most critical part of the AV's environment and contains the key information for determining the future acceleration, i.e., the car's behavioral decision.
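A minimal sketch of such a frozen feature extractor in Keras (we use the pooled 2048-dimensional output of ResNet152V2 here; the 1576-dimensional representation used in this paper may be a reduced or combined variant of this output):

```python
import numpy as np
from tensorflow.keras.applications.resnet_v2 import ResNet152V2, preprocess_input

# Pre-trained backbone without the classification head; global average pooling
# yields one fixed-length vector per image.
backbone = ResNet152V2(weights="imagenet", include_top=False, pooling="avg")

def embed_images(images):
    """images: (N, H, W, 3) uint8 camera frames -> (N, 2048) embeddings."""
    x = preprocess_input(np.asarray(images, dtype=np.float32))
    return backbone.predict(x, verbose=0)
```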

The architecture of this advanced model is similar to that of the basic model: an "encoder-decoder" structure is maintained to learn the information hidden in the input features. The difference is that the front camera images are treated as an additional input. The details of this architecture are shown in Figure 5.

Figure 5: Note that the "image" input is actually a vector containing the key image content. The first channel input is the 12 features for the observed 10 frames. The second channel input is the 1576-dimensional representation of the front camera images from these 10 frames, extracted from the second-to-last output of a pre-trained ResNet152V2. The output is the acceleration for the future 5 frames. The "encoder" module contains 128 LSTM cells and the "decoder" module contains 128 LSTM cells.

Since adding the front camera improved the model performance to a significant degree, adding cameras from more views may help the model understand the driving environment even better. The Waymo Open Dataset provides cameras from five different views: front, front-left, front-right, side-left, and side-right. Incorporating all five cameras yields a more advanced model than those trained previously. The module architecture is similar; the only change is adding 4 extra input channels for the four cameras besides the front camera. The architecture of this model is shown in Figure 6.

Figure 6: The first channel input is the 12 features for the observed 10 frames. The remaining inputs are the 1576-dimensional representations of the camera images from the different views over the observed 10 frames. The output is the acceleration for the future 5 frames. All "encoder" modules contain 128 LSTM cells and all "decoder" modules contain 128 LSTM cells.
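A sketch of this multi-channel variant (per-channel encoders whose latent vectors are concatenated before a shared decoder; how the channels are merged is our assumption, and the 1576-dimensional embedding size is taken from the figure descriptions above):

```python
from tensorflow import keras
from tensorflow.keras import layers

PAST, FUTURE, N_FEAT, EMB = 10, 5, 12, 1576
VIEWS = ["front", "front_left", "front_right", "side_left", "side_right"]

def encode(x, name):
    """One 128-cell LSTM encoder per input channel."""
    return layers.LSTM(128, name=f"encoder_{name}")(x)

feat_in = keras.Input(shape=(PAST, N_FEAT), name="features")
cam_ins = [keras.Input(shape=(PAST, EMB), name=f"cam_{v}") for v in VIEWS]

# Merge all per-channel latent vectors, then decode the future accelerations.
merged = layers.concatenate([encode(feat_in, "features")] +
                            [encode(c, v) for c, v in zip(cam_ins, VIEWS)])
decoded = layers.LSTM(128, return_sequences=True,
                      name="decoder")(layers.RepeatVector(FUTURE)(merged))
outputs = layers.TimeDistributed(layers.Dense(2), name="acceleration")(decoded)

model = keras.Model([feat_in] + cam_ins, outputs)
model.compile(optimizer="adam", loss="mae")
```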

6 Experiments and Results

Each model is evaluated on the same test data sampled from the Waymo Open Dataset, namely the validation part of the dataset. To measure the accuracy of the predicted accelerations, Mean Absolute Error (MAE) is chosen as the main metric; the reported MAE is the average over all video clips in the test data. A VM instance on Google Cloud Platform with a single NVIDIA Tesla K80 GPU was used to train all the models mentioned.

6.1 Hyper-Parameters

In the experiments, the major hyper-parameters are the following:

  • The length of the previous frames: This decides how much history the model can see. If the length is too short, the model may perform poorly because it cannot obtain enough history to predict accurately. If the length is too long, the model becomes too complex and wastes resources. We set this number to 10 (see the windowing sketch after this list).

  • The length of the future frames: This decides how far into the future the model needs to predict. If the length is too long, the model becomes too complex and needs more history to support good predictions. We set this number to 5.

  • Training epochs: We set this to 300 in order to obtain a well-trained model.
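The first two hyper-parameters define how each segment is sliced into training samples; a minimal sketch of this windowing (a hypothetical helper of ours) is:

```python
import numpy as np

PAST, FUTURE = 10, 5  # hyper-parameters from above

def make_windows(features, accelerations):
    """Slice one segment into (10-frame history, 5-frame future) training pairs.

    features:      (T, 12) per-frame feature matrix
    accelerations: (T, 2)  per-frame ground-truth accelerations
    """
    X, Y = [], []
    for t in range(PAST, len(features) - FUTURE + 1):
        X.append(features[t - PAST:t])
        Y.append(accelerations[t:t + FUTURE])
    return np.array(X), np.array(Y)
```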

6.2 Results

As shown in Table 1, the model using the 12 features and the front camera image outperforms most of the other models, which demonstrates that adding images pushes the model performance to another level compared with using only the 12 basic features. However, the model with all cameras performs similarly to the model using only the front camera, which is not consistent with our earlier expectation. A likely reason why the five-camera model fails to perform even better is that the images from the other cameras are noisy inputs that may interfere with the information from the front camera, resulting in a slight decrease in model performance.

Models MAE X MAE Y
NN 0.4014 0.4312
CNN 0.3272 0.3123
NN+CNN 0.2985 0.2802
XGBoost 0.3502 0.3537
Light Gradient Boosting 0.3459 0.3444
Stacked Linear Regressor 0.3623 0.3222
LSTM with 12 features 0.3179 0.3088
LSTM with front camera 0.1379 0.1278
LSTM with all cameras 0.1327 0.1363
Table 1: MAE comparison. The minimum loss is bolded.

6.3 Training Efficiency

The time efficiency of different models is also evaluated during the experiments. From Table 2, it is obvious that adding images will lead to the rapid growth of training time due to the increase of the model complexity. When all five camera images are introduced into the input data, training takes much more time than only using front camera images. But the previous result has shown that the difference between the performance of the model using only the front camera and that using all cameras is small. Thus, it is more reasonable to only use the front camera image as the image input because it is more computationally efficient than using images from all five cameras.

Model Approximated GPU Training Time
LSTM with 12 features 1.8 hours
LSTM with front camera 9.0 hours
LSTM with all cameras 37.8 hours
Table 2: Approximate training time of different models.

6.4 Visualization

To visually inspect how well our model performs, we developed a visualization tool that takes the predicted accelerations and raw camera images as input and generates a video as output. In the three data plots of Figure 7, the horizontal axis represents the frame index within the segment, while the vertical axis represents the acceleration value. The ranges of the vertical axes are fixed to prevent misinterpretation. The generated video shows both predicted and ground-truth values for the longitudinal and lateral accelerations (marked as “x acc” and “y acc”, respectively). Combining the data plots (left) with the actual video frames (right) gives us an intuitive way to inspect prediction deviations over a continuous spatial and temporal range.
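A minimal sketch of the plotting side of such a tool (the fixed vertical range here is illustrative; the exact limits used in the paper are not reproduced):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_prediction(pred, truth, frame_img, axis_limit=2.0):
    """Plot predicted vs. ground-truth accelerations next to a camera frame.

    pred, truth: (T, 2) arrays of [x_acc, y_acc]; axis_limit fixes the
    vertical range so plots stay comparable across segments.
    """
    fig, (ax_x, ax_y, ax_img) = plt.subplots(1, 3, figsize=(15, 4))
    frames = np.arange(len(truth))
    for ax, i, label in [(ax_x, 0, "x acc"), (ax_y, 1, "y acc")]:
        ax.plot(frames, truth[:, i], label="Waymo (ground truth)")
        ax.plot(frames, pred[:, i], label="predicted")
        ax.set_ylim(-axis_limit, axis_limit)
        ax.set_xlabel("frame index")
        ax.set_ylabel(label)
        ax.legend()
    ax_img.imshow(frame_img)   # the synchronized front-camera frame
    ax_img.axis("off")
    return fig
```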

Generally, our model maintains decent performance in more complex scenarios such as nighttime (as in Figure 7(c)). The visualization results show that our predicted accelerations behave more cautiously than the Waymo cars’. As indicated in Figure 7(a), our model accelerates and decelerates more gently than Waymo’s model. Also, because of the smoothing step in preprocessing, our model generates a smoother acceleration curve than Waymo’s, as shown in Figure 7(b).

7 Conclusions and Future Work

The experimental results show that the LSTM-based models perform better than the baseline models, which supports our assumption that acceleration prediction is a time-series problem. Inspired by the LSTM architecture, the "encoder-decoder" structure is adopted in all LSTM-based models. At first, only the basic numeric features, such as the velocity and the distance between the front car and the AV, were used to train the model. This LSTM-based model already outperformed the baseline models that do not use a CNN or do not consider the time dependency in the data. To further increase performance, camera images were introduced to help the model produce more accurate results. The results show that the LSTM-based model with images brings a significant improvement. However, the model using only the front camera performs similarly to the model using all five cameras, while the latter takes much longer to train. Therefore, we conclude that the LSTM-based model with front camera images is an effective and efficient model for acceleration prediction.

This work can be extended in several ways. One limitation of our experiments is that, in the current training and testing process, the model continues to take the ground-truth relative distance and velocities as input for the frames after the first predicted frame beyond the initial 10 frames. In a real-world scenario, however, the input should be taken from (or accumulated from) the predictions of previous frames, so our current approach may cause the error to accumulate continually. To address this, our future work is to generate the prediction for only one frame immediately after the initial ten frames and compute the loss between that specific frame and the ground truth. Another future direction is to train a more generalizable model for traffic scenarios other than car-following, such as lane-changing, merging, and diverging.

(a) The last frame of segment-10289507859301986274 from tar validation_0001 and corresponding prediction.
(b) The last frame of segment-12496433400137459534 from tar validation_0001 and corresponding prediction.
(c) The last frame of segment-11660186733224028707 from tar validation_0001 and corresponding prediction.
Figure 7: Model prediction with synchronized frames captured by the front-camera on the right.

Acknowledgments: We would like to express our great gratitude to the other collaborators who have contributed to this work. We thank Qian Zheng, Gerardo Antonio Lopez Ruiz, and Yumeng Jiang for the feature engineering at an early stage, for coming up with the definitions of the crucial “12 features” on which the models were trained, and for developing and testing several baseline models such as the XGBoost method. We also appreciate the explorations made by Aashish Kumar Misraa, Naman Jain, and Saurav Singh Dhakad on including camera images in the training data, which greatly broadened our scope. Finally, we thank Siddhant Gada and Aditya Das for implementing other neural network baseline models for comparison purposes.

References