Simulating LIDAR Point Cloud for Autonomous Driving using Real-world Scenes and Traffic Flows

by   Jin Fang, et al.
Baidu, Inc.

We present a LIDAR simulation framework that can automatically generate 3D point cloud based on LIDAR type and placement. The point cloud, annotated with ground truth semantic labels, is to be used as training data to improve environmental perception capabilities for autonomous driving vehicles. Different from previous simulators, we generate the point cloud based on real environment and real traffic flow. More specifically we employ a mobile LIDAR scanner with cameras to capture real world scenes. The input to our simulation framework includes dense 3D point cloud and registered color images. Moving objects (such as cars, pedestrians, bicyclists) are automatically identified and recorded. These objects are then removed from the input point cloud to restore a static background (e.g., environment without movable objects). With that we can insert synthetic models of various obstacles, such as vehicles and pedestrians in the static background to create various traffic scenes. A novel LIDAR renderer takes the composite scene to generate new realistic LIDAR points that are already annotated at point level for synthetic objects. Experimental results show that our system is able to close the performance gap between simulation and real data to be 1 6 fine tuning, only 10 original model trained with full real dataset.


page 1

page 4


A LiDAR Point Cloud Generator: from a Virtual World to Autonomous Driving

3D LiDAR scanners are playing an increasingly important role in autonomo...

AADS: Augmented Autonomous Driving Simulation using Data-driven Algorithms

Simulation systems have become an essential component in the development...

Multimodal Virtual Point 3D Detection

Lidar-based sensing drives current autonomous vehicles. Despite rapid pr...

PillarFlow: End-to-end Birds-eye-view Flow Estimation for Autonomous Driving

In autonomous driving, accurately estimating the state of surrounding ob...

MVLidarNet: Real-Time Multi-Class Scene Understanding for Autonomous Driving Using Multiple Views

Autonomous driving requires the inference of actionable information such...

The ApolloScape Dataset for Autonomous Driving

Scene parsing aims to assign a class (semantic) label for each pixel in ...

Recovering and Simulating Pedestrians in the Wild

Sensor simulation is a key component for testing the performance of self...

1 Introduction

Figure 1: Simulation point cloud with different methods. (a) is generated from CARLA, (b) comes from our proposed method and (c) is the real point cloud collected by Velodyne HDL-64E. The second row shows the point cloud from the bird-eye-view. Notice the inherently rich background in our approach.

LiDAR devices have been widely used in robotics and in particular autonomous driving. They provide robust and precise depth measurements of their surroundings, making them usually the first choice for environmental sensing. Typically the raw point cloud from LiDAR is sent to a computer vision system to find obstacles and other relevant driving information. Currently high-performance vision systems usually are based on deep learning techniques. Deep neural networks (DNN) have proven to be a powerful tool for many vision tasks

[4] [30] [26].

The success of DNN most relies on the quality and quantity of labeled training data. Compared to image data, 3D point cloud from LiDAR is much more difficult to label manually[13] [23]. This is particularly true from real-time LiDAR scanners, such as these from Velodyne. These devices, predominantly used in autonomous vehicles (AV), generate sparse point cloud that is difficult to interpret. Labeling the huge amount of point cloud data needed for the safety of AV quickly becomes prohibitively expensive.

There have been work to leverage computer graphics techniques to generate synthetic labeled data (e.g., [21], [18], [15], [24], [25]

). While these simulated data are shown to be useful to improve DNN’s performance, there remain a few unsolved problems. First the CG environment is mostly manually crafted with limited scale and complexity. As we will show in our experiments, the fidelity of background plays a significant role in perception accuracy. Creating photo-realistic scenes, with a price tag of over 10K USD per kilometer, simply does not scale. Secondly the obstacle placement and movement are mostly based on heuristics, which often do not reflect the diversity of our real world. Thirdly in the scope of LiDAR simulation 

[29] [25], existing methods simply render the scene depth, without considering the physical characteristics of LiDAR, leading to obvious artifacts. As such detector trained with only synthetic data performed poorly (e.g., around accuracy) on real data [29].

In this paper we present a novel hybrid point cloud generation framework for automatically producing high-fidelity annotated 3D point data, aimed to be immediately used for training DNN models. To both enhance the realism of our simulation and reduce the cost, we have made the following design choices. First, we take advantage of mobile LiDAR scanners, which are usually used in land surveys, to directly take 3D scan of road scenes as our virtual environment, which naturally retains the complexity and diversity of real world geometry, therefore bypassing completely the need for environmental model creation. Secondly we develop novel data-driven approaches to determine obstacles’ poses (position and orientation) and shapes (CAD model). More specifically, we extract from real traffic scenes the distributions of obstacle models and their poses. The learned obstacle distribution is used to synthesize the placement and type of obstacles to be placed in the synthetic background. Note that the learning distribution does not have to be aligned with the captured background. Different combinations of obstacle distribution and background provide a much richer set of data without any additional data acquisition or labeling cost. Thirdly we develop a novel LiDAR renderer that takes into considerations both the physical model and real statistics from the corresponding hardware. Combing all these together, we have developed a simulation system that is realistic, efficient, and scalable.

The main contributions of this paper include the following.

  • We present a LiDAR point cloud simulation framework that can generate the annotated data for autonomous driving perception, the resultant data have achieved comparable performance with real point cloud. Our simulator provides the realism from real data, with the same amount of flexibility that was previously available only in VR-based simulation, such as regeneration of traffic patterns and change of sensor parameters.

  • We combine real-world background models, which is acquired by LiDAR scanners, with realistic obstacle placement that is learned from real traffic scenes. Our approach does not require the costly background modeling process, or heuristics based rules. It is both efficient and scalable.

  • We demonstrate that the model trained with synthetic point data alone can achieve competitive performance in terms of 3D obstacle detection and semantic segmentation. Mixing real-data and simulated data can easily outperform the model trained with the real data alone.

2 Related Work

As deep learning becomes prevalent, increasing effort has been invested to alleviate the lack of annotated data for training DNN. In this section, we mainly review the recent work on data simulation and synthesis for autonomous driving.

To liberate the power of DNN from limited training data, [19] introduces SYNTHIA, a dataset of a big volume of synthetic images and associated annotation of urban scenes. [7] imitates the tracking video data of KITTI in virtual scenes to create a synthetic copy, and then augments it by simulating different lighting and weather conditions. [7] and [18] generate a comprehensive dataset with pixel level labels and instance level annotation, aiming to provide a benchmark supporting both low-level and high-level vision tasks. [5] added CG characters to existing street view images for the purposed of pedestrians detection. However the problem of existing pedestrians in the image is not discussed. [9] synthesizes annotated images for vehicle detection, and shows an encouraging result that it is possible for a state-of-the-art DNN model trained with purely synthetic data to beat the one trained on real data, when the amount of synthetic data is sufficiently large. Similarly, our target is to obtain the comparable perception capability from purely simulated point data by virtue of their diversity. To address the scarcity of realism of virtual scenes,[1] proposes to augment the dataset of real-world images by inserting rendered vehicles into those images, thus inheriting the realism from the background and taking advantage of the variation of the foreground. [22] presents a method for data synthesis based on procedural world modeling and more complicated image rendering techniques. Our work also benefits from the real background data and physically based sensor simulation.

While most of previous work are devoted to image synthesis, only few focus on the generation and usage of synthetic LiDAR point cloud, albeit they play an even more important role for autonomous driving. Recently, [29] collects the calibrated images and point cloud using the APIs provided by the video game engine, and apply these data for vehicle detection in their later work [24]. Carla [6] and AutonoVi-Sim [2] also furnish the function to simulate LiDAR point data from the virtual world. However their primary target is to provide platform for testing algorithms of learning and control for autonomous vehicles.

In summary, in the domain of augmenting data, our method is the first to focus on LiDAR point. In addition, we can change the parameters of LiDAR, in terms of placement and the number of lines, arbitrarily. The change of camera parameters has not been reported in image-augmentation methods.

Figure 2:

The proposed LiDAR point cloud simulation framework. (a) describes accurate, dense background with semantic information obtained by a professional 3D scanner. (b) shows the synthetic movable obstacles e.g., vehicle, cyclist and other objects. (c) illustrates an example of placing foreground obstacles (yellow boxes) in the static background based on a Probability Map. (d) is an example of the simulated LiDAR point cloud with ground truth 3D bounding boxes (green boxes) by using our carefully designed simulation strategy.

3 Methodology

In general, we simulate the data acquisition process of the LiDAR sensor mounted on the autonomous driving vehicle in the real traffic environment. The whole process is composed of several modules: static background construction, movable foreground obstacles generation and placement, LiDAR point cloud simulation and the final verification stage. A general overview of our proposed framework is described in Fig 2 and more details of each module will be described in the following context.

3.1 Static Background Generation

Different with other simulation frameworks [6] [29], which generate both the foreground and background point cloud from an artificial virtual world, we generate the static background with the help of a professional 3D scanner Riegl VMX-1HA 111Riegl VMX-1HA

The RIEGL is a high speed, high performance dual scanner mobile mapping system which provides dense, accurate, and feature-rich data at highway speeds. The resolution of the point cloud from the Riegl scanner is about within a range of 100 meters. In the real application, a certain traffic scene will be repeatedly scanned several rounds (e.g., rounds) in our experiments. After multiple scanning, the point cloud resolution can be increased to about . An example of the scanned point cloud is displayed in Fig 3. By using this scanner, the structure details can be well obtained. Theoretically, we can simulate any other type of LiDAR point cloud whose point distance is larger than . For example, the resolution of common used Velodyne HDL-64E S3 222Velodyne HDL-64E S3 is about within a range of 10 meters.

To obtain clean background, both the dynamic and static movable obstacles should be removed away. To improve the efficiency, state-of-the-art semantic segmentation approaches are employed to obtain an initial labeling results roughly and then annotators correct these wrong parts manually. Here, PointNet++ [16] is used for semantic segmentation and the average precision can reach about . Based on the semantic information, holes are filled with the surrounding points.

Figure 3: Left sub-image illustrates an example of the point cloud obtained by RIEGL scanner with more than 200 million 3D points. The actual size of the place is about . Right sub-image displays the detail structure of the point cloud.

3.2 Movable Obstacle Generation

After obtaining the static background, we need to consider how to add movable obstacles in the environment. Particularly, we find that the position of obstacles has a great influence on the final detection and segmentation results. However, this has been rarely mentioned by other simulation approaches. Instead of placing obstacles randomly, we propose a data-driven-based method to generalize the obstacle’s pose based on their distribution in the real dataset.

3.2.1 Probability Map for Obstacle Placement

First of all, a Probability Map [5] will be constructed based on the obstacles distribution in the labeled dataset from different scenarios. In the probability map, the position with a higher value means it will be selected to place an obstacle with a higher chance. This map is built based on some labeled dataset. Instead of merely increasing the probability value at the position where an obstacle appeared in the labeled dataset, we also increase the neighboring positions based on a Gaussian kernel. Similarly, the direction of the obstacles can also be generated from this map. Details of building probability map can be found in  Alg 1. Particularly, we have built different probability maps for different classes. Given the semantics of background, we could easily generalize the Probability Map to other areas in a texture-synthesis fashion.

1: - Annotated point cloud in a local area - A scanner pose ;
2: - A set of obstacles’ pose;  
3:Divide the local area into grids with the weight matrix and direction matrix ;
4:Initialize the Gaussian weight template with the size of ;
5:for  to  do
6:     if  contain obstacles then
7:         for  to  do
10:         end for
11:     end if
12:end for
13: Given a scanner pose , obstacle positions and directions can be sampled with and by weighted random sampling;
Algorithm 1 Probability Map based Obstacle Pose Generation
Figure 4: Man-made CAD models. For real AD application, some uncommon categories have been also considered in our simulation, e.g., traffic cone, baby carriage and tricycle, etc.

3.2.2 Model Selection

Similar to the obstacle’s position, a data-driven-based strategy has been employed to determine the occurrence frequency of different obstacle categories. Based on the labeled dataset, prior occurrence frequency information of different types can be easily obtained. During the simulation process, each 3D model will be selected with this prior knowledge. In addition, the model for each category has been divided into two groups: one is the high-frequency model used to cover the most common cases and another is the low-frequency model used to meet the requirement of diversity.

3.2.3 Obstacle Classes and CAD Models

In order to apply for the real AD application, both the common types (such as cars, SUVs, trucks, buses, bicyclist, motorcyclist, pedestrians) and uncommon categories (e.g., traffic cone, baby carriage and tricycle) have been considered. Some examples of CAD models are shown in Fig 4. The number of models for each type has been given in Tab 1. Interestingly, a small number of 3D models can achieve high detection rate for rigid obstacles, while more models are needed for non-rigid objects. To maintain fidelity, all the 3D models are made with real size and appearance. In addition, we also try to maximize the diversity for each type as much as possible. Specifically, for vehicle models, glasses have been marked as transparent and passengers and driver are added inside them as in the real traffic scene.

Types Cars, SUVs Trucks, Buses Bicyc- & Motor-list Pedestrians Others
Number 45 60 350 500 200
Table 1: Number of 3D model for different categories. Some uncommon categories are included in the others such as traffic cones, baby carriages and tricycles.

3.3 Sensor Simulation

The LiDAR sensor captures the surrounding scene through reckoning the time of flight of laser pulses emitted from the LiDAR and reflected from target surfaces [11]. A point is generated if the returned pulse energy is much higher than a certain threshold.

3.3.1 Model Design

Figure 5: Geometric model of Velodyne HDL-64E S3 which emits 64 laser beams at a preset rate and rotates to cover 360 degrees filled of view.

A simple but practically sufficient physical model has been used to simulate this process here. The model is formulated as


where denotes the energy of a returned laser pulse and is the energy of original laser pulse, represents the reflectivity of the surface material, denotes the reflection rate w.r.t the laser incident angle, is the air attenuation rate because the laser beam is absorbed and reflected when traveling in the air, is a constant number (0.004) in our implementation, and denotes the distance from LiDAR center to target.

Given the basic principle above, the common used multi-beam LiDAR sensor (e.g., Velodyne HDL-64E S3) for autonomous vehicles can be simulated. It emits 64 laser beams in different vertical angles ranging from to , as shown in Fig 5. These beams can be assumed to be emitted from the center of the LiDAR. During data acquisition, HDL-64E S3 rotates around its own upright direction and shoots laser beams at a predefined rate to accomplish coverage of the scenes.

Theoretically, 5 parameters of the beam should be considered for generating the point cloud, including the vertical and azimuth angles and their angular noises, as well as the distance measurement noise. Ideally, these parameters should keep constant, however, we found that different devices have different vertical angles and noise. To be closer to reality, we obtain these values from real point clouds statistically. Specifically, We collect real point clouds of these HDL-64E S3 sensors atop parked vehicles, guaranteeing the point curves generated by different laser beams to be smooth. The points of each laser beam are then marked manually and fitted by a cone with the apex located in the LiDAR center. The half-angle of the cone minus

forms the real vertical angle while the noise variance is figured out from the deviation of lines constructed by the cone apex and the points from the cone surface. The real vertical angles usually differ from the ideal ones by

. In our implementation, we approximate aforementioned noises using Standard Gaussian Distribution, setting distance noise variance to

and the azimuth angular noise variance .

3.3.2 Point Cloud Rendering

In order to generate a point cloud, we have to compute intersections of laser beams and the virtual scene, for this we propose a cube map based method to handle the hybrid data of virtual scenes, points and meshes. Instead of computing intersections of beams and the hybrid data, we compute the intersection with the projected maps (depth map) of scenes which offer the equivalent information but much easier.

To do this, we first perspectively project the scene onto 6 faces of a cube centered at the LiDAR origin to form the cube maps as in Fig 6. The key to make cube maps usable is to obtain the smooth and holeless projection of the scene, with presence of environment points. Therefore we render the environment point cloud using surface splatting [31], while rendering obstacle models using the regular method. We synergize these two parts in the same rendering pipeline, yielding in a complete image with both the environment and obstacles. In this way, we get 3 types of cube maps: depth, normal, and material which are used in Eq (1).

Figure 6: The cube map generated by projecting the surrounding point cloud onto 6 faces of a cube centered at the LiDAR origin. Here we only show the depth maps one the 6 different views.

Next, we simulate the laser beams according to the geometric model of HDL-64E S3, and for each beam we look for the distance, normal and material of the target sample it hit, with which we generate a point for this beam. Note that some beams are likely discarded, if its returned energy computed with Eq (1) is too low or it hit a empty area in the cube face, which indicates the sky.

Finally, we automatically generate the tight oriented bounding box (OBB) for each obstacle by simply adjusting its original CAD’ OBB to points of the obstacle.


Methods Instance Segmentation 3D Object Detection
mean AP mean MasK AP AP 50 AP 70
CARLA 10.55 23.98 45.17 15.32
Proposed 33.28 44.68 66.14 29.98
Real KITTI 40.22 48.62 83.47 64.79
CARLA + Real KITTI 40.97 49.31 84.79 65.85
Proposed + Real KITTI 45.51 51.26 85.42 71.54
Table 2: The performance of models trained by different simulation point cloud on KITTI benchmark. In which, “CARLA” and “Proposed” represents the model trained by the point cloud generated by CARLA and proposed method, “Real KITTI” represents the model trained with KITTI training data and the “CARLA + Real KITTI” and “Proposed + Real KITTI” represent the models trained on the simulation data first and the fine-tuned on the KITTI training data.

4 Experimental Results and Analysis

The whole simulation framework is a complex system. Direct comparison of different simulation system is really a difficult task and it is also not the key point of this paper. The ultimate objective of our work is to boost DNN’s perception performance by introducing free auto-labeled simulation ground truth. Therefore, the comparison of the different simulators can be transferred by comparing the point cloud generated by different ones. Here, we choose an indirect way of evaluation by comparing the DNN’s performance trained with different simulation point cloud. To highlight the superiority of our proposed framework, we plan to verify it on two types of dataset: public and the self-collected point cloud. Due to the popularity of Velodyne HDL-64E in the field of AD, we set all the model parameters based on this type of LiDAR in our simulation process and all the following experiments are executed based on this type of LiDAR.

4.1 Evaluation on Public Dataset

Currently, CARLA, as an open-source simulation platform 333CARLA, have been widely used for different kinds of simulation purposes (e.g., [10] [28] [20] [12] [17]). Similar with most typical simulation frameworks, both the foreground and background CG models have to be built in advance. As we have mentioned before, the proposed framework only need the CG models of foreground and the background can be directly obtained by a laser scanner. Here, we take CARLA as a representative of traditional methods for comparison.

4.1.1 Dataset

Simulation Data: we use CARLA and the proposed method to produce two groups of simulation data. In order to get better generalization, we first generate 100,000 frame of point cloud from a large scale traffic scenario and then randomly select 10,000 frames for using. For CARLA, both the foreground and background point cloud are rendered simultaneously with CG models. For the proposed method, we first build a clean background and then place some certain obstacles base on the proposed PM algorithm. To be fair, similar number of obstacles are included in each group.

Real Data: all the evaluation are executed on the third-party public KITTI [8] object detection benchmark. This data has been divided into training and testing two subsets. Since the ground truth for the test set is not available, we subdivide the training data into a training set and a validation set as described in [3] [4] [30]. Finally, we obtained 3,712 data samples for training and 3,769 data samples for validation. On the KITTI benchmark, the objects have been categorized into “easy”, “moderate” and “hard” based on their height in the image and occlusion ratio etc. Here, we merge them together for evaluation because they are equally important for the real AD application. In addition, the intensity attribute has been removed in our experiments.

4.1.2 Evaluation Methods

Two popular perception tasks in AD application have been evaluated here including instance segmentation (simultaneous 3D object detection and semantic segmentation) and 3D object detection. For each task, one state-of-the-art approach is used for evaluation here. As far, SECOND [27] performs the best for 3D object detection on the KITTI benchmark among all the open-source methods. Therefore, we choose it for object detection evaluation. While for instance segmentation, an accelerated real-time version of MV3D [4] from ApolloAuto 444ApolloAuto project is used here.

4.1.3 Evaluation Metrics

For the 3D object detection, we take the AP-50 (Average Precision) and AP-70 which have been used on the KITTI benchmark. For instance segmentation, we take the metric of mean bounding box (Bb)/mask AP proposed in coco challenges [14] here. Specifically, the thresholds are set as .

4.1.4 Experimental Results and Analysis

Here, we advocate two kinds of way for using the simulation data. One is to train a model purely using simulation data and the other is to train a model on the simulation data first and then fine-tuned on the real data. Both of them are evaluated here and the experimental results are shown in Tab 2. Compared with CARLA, the model trained purely by the proposed simulation point cloud achieves better performances on both instance segmentation and object detection. For instance segmentation, it gives more than 20 points improvements for mean AP and Mask AP. While for object detection, the AP is also improved by a big margin. With the help of simulation data, both of the models for segmentation and object detection can be boosted compared to the model trained only on the real data. The performance has just increased slightly by CARLA data, while the mean AP for segmentation and AP 70 for object detection have been improved about 5 point by using our simulated dataset.

Analysis: although the simulation data can really boost the DNN’s performance by fine-tuning with some real data, however, its generalization ability is far from the real application. For example, the model trained only with simulation data can achieve only 29.28% detection rate for the 3D object detection which is far from the real application in AD. The main reason of this is the domain in training is quite different with testing data and this is a very important drawback for the traditionally simulation framework. However, we would like to claim that this situation can be well recited with our proposed method. With carefully design, the model trained with simulated data can achieve applicable detection results for the real application. Detailed results will be introduced in the following subsection.

4.2 Simulation for Real Application

As we have mentioned, the proposed framework can easily solve the domain problem happening in the traditional simulators by collecting the similar environment by a profession laser scanner. For building a city-level large-scale background, one week is sufficient enough. Then, we can use this background to generate sufficient ground truth data.

To further prove the effectiveness of our proposed framework, we test it on a large self-collected dataset. We collect and label more point cloud with our self-driving car from different cities. The data is collected by Velodyne HDL-64E, including 100,000 frames for training and 20,000 for testing. Different from KIITI, our data is labeled for full view. Six types of obstacles have been labeled in our dataset, including small cars, big motors (e.g., truck and bus), cyclists, pedestrians, traffic cones and other unknown obstacles. The following experiments are executed based on this dataset.

4.2.1 Results on Self-collected Dataset

Dataset mean Mask AP


100k sim 91.02
16k real 93.27
100k sim + 1.6k real 94.10
100k real 94.39
100k sim + 16k real 94.41
100k sim + 32k real 94.91
100k sim + 100k real 95.27
Table 3: Model trained with pure simulation point cloud can achieve comparable results with model trained with real dataset for instance segmentation task.

First of all, we collect background point cloud from different cities to build a large background database. Then we generate sufficient simulation point cloud by placing different objects into the background. Finally, we randomly select 100K frame of simulation point cloud for our experiments. The experimental results are shown in Tab 3. From the first big row of the table, we can find that the model trained with 100K simulation data gives comparable result with the model trained with 16K real data, with only 2 points of gap. More important, the detection rate can achieve 91.02% which is relatively high enough for the real application. Furthermore, by mixing 1.6K real data with the 100K simulation data, the mane AP reaches which outperforms 16K real data.

From the second row of the table, we can see that 16K real data together with 100K simulation can beat 100K real data, which can save more than 80% of the money for annotation. Furthermore, we can obviously found last row the table that model trained with real data can be boosted to varying degrees by adding simulation data even we have big enough labeled real data.

4.2.2 Ablation Studies

The results in Tab 3, shows the promising performance of the proposed simulation framework. While the whole simulation framework is a complex system and the final perception results comes from the influence of different steps of the system. Inspired by the idea of ablation analysis, a set of experiments have been designed to learn the effectiveness of different parts. Generally, we found that three main parts are very important for the whole system including the way of background construction, obstacle pose generation, random point dropout. Detailed information for each part will be introduced in this section.

Background As discovered by many researchers, context information from the background is very important for perception task. As shown in Tab 4, we have found that the background simulated from our scanned point cloud (bold values) can achieve very close performance with the real background (values in blue). Compared with synthetic models (e.g., CARLA), the performance can be improved about points for both mean Bbox and mask APs.


Methods Mask AP 50 Mask AP 70 mean Mask AP
No BG + Sim FG 1.86 1.41 1.25
Scan BG + Sim FG 88.60 86.80 83.38
Real BG + Sim FG 88.79 87.19 84.22
Real BG + Real FG 90.40 89.45 86.33
Table 4: Evaluations with different background for instance segmentation, where “Sim”, “BG” and “FG” represent “simulation”,“background” and “foreground” for short.


Methods Mask AP 50 Mask AP 70 mean Mask AP
Random on road 73.03 69.23 65.95
Rule-based 81.37 78.13 74.32
Proposed PM 86.57 84.55 82.47
Augmented PM 87.80 86.19 83.07
Real Pose 88.60 86.80 83.38
Table 5: Evaluations for instance segmentation with different obstacle poses.

Obstacle Poses We also empirically find that where to place objects into background has a big influence on the perception results. Five different ways of placing obstacles have been evaluated here: (a) randomly on the road, (b) rule based method depends on prior high definition map information, (c) proposed probability map (PM) based method, (d) probability map plus some pose augmentation, (e) manual labeled pose plus data augmentation. From Tab 5, we can find that the proposed PM method together with proper pose augmentation can achieve comparable results with the manual labeled real pose while the proposed method can largely free the manual labors.

Random Dropout Another difference between the real and simulated data is the number of the point in each frame. For Velodyne HDL-64E, usually about 102,000 points will be returned each frame for the real sensor, while the number is about 117,000 in the simulated point cloud. The reason for this may come from different aspects: textures, material, surface reflectivity and even the color of objects. For man-made foreground obstacle, these proprieties can be easily obtained, however, it is not a trail task for backgrounds. Inspired by the dropout strategy in DNN, we randomly drop a certain ratio of points during our model training process. Surprising, we find that it can steadily improve the performance by 2 points for mean Mask AP.


Methods Mask AP 50 Mask AP 70 mean Mask AP
w/o dropout 88.60 86.80 83.38
w dropout 89.43 88.75 85.53
Real Data 90.40 89.45 86.33
Table 6: Evaluations for instance segmentation with or without random dropout.

4.2.3 Extension to Different Sensors

Beyond the Velodyne HDL-64E LiDAR, the proposed framework can be easily generalized to other type sensors such as Velodyne Alpha Puck (VLS-128) 555Velodyne Alpha (VLS-128) For our implementation, a friendly interface have been designed. With a slightly modification of configure file, we can achieve different types of LiDAR point clouds. In the configure file, the specific LiDAR properties can be modified such as the channels number, range, horizontal and vertical filed-of-view, angular and vertical resolution etc. Fig 7 gives a simulation example of VLS-128. By using the same LiDAR location and obstacle poses, we can’t visually distinguish the real and simulated one.

Figure 7: An simulation example of VLS-128 with same LiDAR location and obstacle posse, where (a) is the simulated point cloud and (b) is the real one.

5 Conclusions and Future Works

This paper presents an augmented LiDAR simulation system to automatically produce annotated point cloud used for 3D obstacle perception in autonomous driving scenario. Our approach is entirely data driven, with scanned background, obstacle poses and types that are statistically similar to that from real traffic data, and a general LiDAR renderer that takes into considerations of physical/statics properties of the actual devices. The most significant benefits of our approach are realism and scalablity. We demonstrated realism by showing that the performance gap between detectors trained with real or simulated data is within two percentage point. The scalablity of our system is by design, there is no manual labeling, and the combination of different background and different obstacle placement provides abundant labeled data with virtually no cost except computation. More importantly, it can rapidly simulate a scene of interests by simply scanning that area.

Looking into the future, we want to investigate the use of low-fidelity LiDAR. Our current hardware system can produce very high quality and dense 3D point cloud, which can be re-sampled to simulate any LiDAR type. By using a low-fidelity one, we think it may limit the LiDAR type our system can simulate, this could be a design choice between cost and versatility. Another area to address is to simulate the LiDAR intensity value for the foreground object. It involves the modeling of material reflectance properties under near infrared (NIR) illumination, which is doable but quite tedious. CG models for foreground needs to be updated accordingly to include an NIR texture.



  • [1] H. A. Alhaija, S. K. Mustikovela, L. Mescheder, A. Geiger, and C. Rother. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. International Journal of Computer Vision, 126(9):961–972, 2018.
  • [2] A. Best, S. Narang, L. Pasqualin, D. Barber, and D. Manocha. Autonovi-sim: Autonomous vehicle simulation platform with weather, sensing, and traffic control. In Neural Information Processing Systems, 2017.
  • [3] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3d object detection for autonomous driving. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , pages 2147–2156, 2016.
  • [4] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.
  • [5] E. Cheung, A. Wong, A. Bera, and D. Manocha. Mixedpeds: pedestrian detection in unannotated videos using synthetically generated human-agents for training. In

    Thirty-Second AAAI Conference on Artificial Intelligence

    , 2018.
  • [6] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. Carla: An open urban driving simulator. In Conference on Robot Learning, pages 1–16, 2017.
  • [7] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4340–4349, 2016.
  • [8] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354–3361. IEEE, 2012.
  • [9] M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, and R. Vasudevan. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 746–753. IEEE, 2017.
  • [10] M. Kaneko, K. Iwami, T. Ogawa, T. Yamasaki, and K. Aizawa. Mask-slam: Robust feature-based monocular slam by masking using semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 258–266, 2018.
  • [11] S. Kim, I. Lee, and Y. J. Kwon. Simulation of a geiger-mode imaging ladar system for performance assessment. sensors, 13(7):8461–8489, 2013.
  • [12] B. R. Kiran, L. Roldao, B. Irastorza, R. Verastegui, S. Süss, S. Yogamani, V. Talpaert, A. Lepoutre, and G. Trehard. Real-time dynamic object detection for autonomous driving using prior 3d-maps. In European Conference on Computer Vision, pages 567–582. Springer, 2018.
  • [13] R. Kopper, F. Bacim, and D. A. Bowman. Rapid and accurate 3d selection by progressive refinement. 2011.
  • [14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [15] M. Mueller, V. Casser, J. Lahoud, N. Smith, and B. Ghanem. Ue4sim: A photo-realistic simulator for computer vision applications. arXiv preprint arXiv:1708.05869, 2017.
  • [16] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
  • [17] J. Qiu, Z. Cui, Y. Zhang, X. Zhang, S. Liu, B. Zeng, and M. Pollefeys. Deeplidar: Deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image. arXiv preprint arXiv:1812.00488, 2018.
  • [18] S. R. Richter, Z. Hayder, and V. Koltun. Playing for benchmarks. In International conference on computer vision (ICCV), volume 2, 2017.
  • [19] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3234–3243, 2016.
  • [20] I. Sobh, L. Amin, S. Abdelkarim, K. Elmadawy, M. Saeed, O. Abdeltawab, M. Gamal, and A. El Sallab. End-to-end multi-modal sensors fusion system for urban automated driving. 2018.
  • [21] H. Su, C. R. Qi, Y. Li, and L. J. Guibas.

    Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views.

    In Proceedings of the IEEE International Conference on Computer Vision, pages 2686–2694, 2015.
  • [22] A. Tsirikoglolu, J. Kronander, M. Wrenninge, and J. Unger. Procedural modeling and physically based rendering for synthetic data generation in automotive applications. arXiv preprint arXiv:1710.06270, 2017.
  • [23] P. Welinder and P. Perona. Online crowdsourcing: rating annotators and obtaining cost-effective labels. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 25–32. IEEE, 2010.
  • [24] B. Wu, A. Wan, X. Yue, and K. Keutzer. Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1887–1893. IEEE, 2018.
  • [25] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer. Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. arXiv preprint arXiv:1809.08495, 2018.
  • [26] D. Xu, D. Anguelov, and A. Jain. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 244–253. IEEE, 2018.
  • [27] Y. Yan, Y. Mao, and B. Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
  • [28] D. J. Yoon, T. Y. Tang, and T. D. Barfoot. Mapless online detection of dynamic objects in 3d lidar. arXiv preprint arXiv:1809.06972, 2018.
  • [29] X. Yue, B. Wu, S. A. Seshia, K. Keutzer, and A. L. Sangiovanni-Vincentelli. A lidar point cloud generator: from a virtual world to autonomous driving. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pages 458–464. ACM, 2018.
  • [30] Y. Zhou and O. Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.
  • [31] M. Zwicker, H. Pfister, J. Van Baar, and M. Gross. Surface splatting. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 371–378. ACM, 2001.