LiDAR is a prominent sensor for autonomous driving and robotics because it provides detailed 3D information critical for perceiving and tracking real-world objects [2, 28]. The 3D localization of objects within LiDAR point clouds represents one of the most important tasks in visual perception, and much effort has focused on developing novel network architectures for operating on point clouds [1, 32, 21, 31, 36, 30, 16, 25]. Following the image classification literature, such modeling efforts have employed manually designed data augmentation schemes for boosting performance [1, 30, 16, 32, 22, 35, 25, 36].
In recent years, much work in the 2D image literature has demonstrated that investing heavily in data augmentation may lead to gains comparable to architectural improvements [4, 37, 20, 11, 5]. In spite of these advancements, 3D detection models have yet to significantly leverage automated data augmentation methods (but see [18]). Porting such ideas naively to point cloud data presents numerous challenges. Most prominently, the types of augmentations appropriate for point clouds differ tremendously from those for labeled images. Transformations appropriate for point clouds are typically geometric and may contain a large number of parameters. Thus, the search space proposed in [4, 37] may not be naively reused for an automated search in point cloud augmentation space. Finally, because the search space is far larger, employing a more efficient search method becomes a necessity for making such a system practical. Several works have attempted to significantly accelerate the search for data augmentation strategies [20, 11, 5]; however, it is unclear whether such methods would continue to work in a point cloud setting.
In this work, we demonstrate that automated data augmentation significantly improves the prediction accuracy of 3D object detection models. We introduce a new search space for point cloud augmentations in 3D object detection. In this search space, we find the performance distribution of augmentation policies is quite diverse. To effectively discover good augmentation policies, we present an evolutionary search algorithm termed Progressive Population Based Augmentation (PPBA). PPBA works by narrowing down the search space through successive iterations of evolutionary search, and by adopting the best parameters discovered in past iterations. We demonstrate that PPBA is effective at finding good data augmentation strategies across datasets and detection architectures. Additionally, we find that a model trained with PPBA may be up to 10x more data efficient, implying reduced human labeling demands for point clouds. Our main contributions can be summarized as follows: (1) We propose an automated data augmentation technique for localization in 3D point clouds. (2) Compared to random search, the proposed search method improves point cloud 3D detection models more effectively and at a lower computation cost. (3) We demonstrate up to a 10x increase in data efficiency when employing PPBA.
2 Related Work
Data augmentation has been an essential technique for boosting the performance of 2D image classification and object detection models. Augmentation methods typically include manually designed image transformations to which the labels remain invariant, or distortions of the information present in the images. For example, elastic distortions, scale transformations, translations, and rotations benefit models [26, 3, 29, 24] trained on MNIST. Crops, image mirroring and color shifting / whitening are commonly adopted on natural image datasets like CIFAR-10 and ImageNet [14]. Recently, cutout [6] and mixup [33] have emerged as data augmentation methods that lead to good improvements on natural image datasets. For object detection in 2D images, image mirroring and multi-scale training are popular distortions [10]. Dwibedi et al. add new objects to training images by cut-and-paste [7].
While the distortions mentioned above are designed by domain experts, there are also automated approaches to designing data augmentation for 2D images. Early attempts include Smart Augmentation, which uses a network to generate augmented data by merging two or more samples [17]. Ratner et al. use GANs to output sequences of data augmentation operations [23]. AutoAugment uses reinforcement learning to optimize data augmentation strategies for classification and object detection [4, 37]. More recently, improved search methods have been able to find data augmentation strategies more efficiently [5, 11, 20].
While all the work mentioned so far addresses 2D image classification and object detection, to the best of our knowledge automated data augmentation methods have not been explored for 3D object detection tasks. Models trained on KITTI use a wide variety of manually designed distortions. Due to the small size of the KITTI training set, data augmentation has been shown to improve performance significantly (common augmentations include horizontal flips, global scale distortions, and rotations) [1, 30, 32, 16, 25]. Yan et al. add new objects to training point clouds by pasting points inside the 3D bounding boxes of ground truths [30]. Despite its effectiveness for KITTI models, data augmentation was not used on some of the larger point cloud datasets [22, 35]. Very recently, an automated data augmentation approach was studied for point cloud classification [18].
Historically, 2D vision research has focused on architectural modifications to improve generalization. More recently, it was observed that improving data augmentation strategies can lead to comparable gains to a typical architectural advance [33, 8, 4, 37]. In this work, we demonstrate that a similar type of improvement can also be obtained by an effective automated data augmentation strategy for 3D object detection over point clouds.
3 Methods
We formulate the problem of finding the right augmentation strategy as a special case of hyperparameter schedule learning. The proposed method consists of two components: a specialized data augmentation search space for point cloud inputs and a search algorithm for the optimization of data augmentation parameters. We describe these two components below.
3.1 Search Space for 3D Point Cloud Augmentation
In the proposed search space, an augmentation policy consists of N augmentation operations. Additionally, each operation is associated with a probability and some specialized parameters. For example, the ground-truth augmentation operation has parameters denoting the probability for sampling vehicles, pedestrians, cyclists, etc.; the global translation noise operation has parameters for the distortion magnitude of the translation operation on x, y and z coordinates. To reduce the size of the search space, these different operations are always applied in the same, pre-determined order in the model.
The operations (see Fig. 2) we searched over are GroundTruthAugmentor, RandomFlip, WorldScaling, GlobalTranslateNoise, FrustumDropout, FrustumNoise, RandomRotation and RandomDropLaserPoints. In total, there are 8 augmentation operations and 29 operation parameters in the proposed search space. Details about these operations are in the Appendix.
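The policy structure described above can be illustrated with a minimal sketch. This is not the implementation used in the experiments: the three operation bodies, the parameter names (`min_scale`, `max_scale`, `std_x`, `std_y`, `std_z`) and the numeric values are hypothetical, and the bounding-box labels that a real detector would also need to transform are omitted. The sketch only shows that each operation carries its own probability and parameters, and that operations are applied in one fixed order.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float, float]
OpFn = Callable[[List[Point], Dict[str, float], random.Random], List[Point]]


@dataclass
class AugmentationOp:
    """One operation: an application probability plus op-specific parameters."""
    name: str
    prob: float
    params: Dict[str, float]
    fn: OpFn


def random_flip(points, params, rng):
    # Mirror the scene by negating the y coordinate (no extra parameters).
    return [(x, -y, z) for x, y, z in points]


def world_scaling(points, params, rng):
    # Scale all coordinates by one factor drawn from [min_scale, max_scale].
    s = rng.uniform(params["min_scale"], params["max_scale"])
    return [(x * s, y * s, z * s) for x, y, z in points]


def global_translate_noise(points, params, rng):
    # Shift the whole scene by Gaussian noise with a per-axis magnitude.
    dx = rng.gauss(0.0, params["std_x"])
    dy = rng.gauss(0.0, params["std_y"])
    dz = rng.gauss(0.0, params["std_z"])
    return [(x + dx, y + dy, z + dz) for x, y, z in points]


def apply_policy(points, policy, rng):
    """Apply the operations in their fixed list order, each with its own probability."""
    for op in policy:
        if rng.random() < op.prob:
            points = op.fn(points, op.params, rng)
    return points


# Hypothetical policy values, chosen only for illustration.
policy = [
    AugmentationOp("RandomFlip", 0.5, {}, random_flip),
    AugmentationOp("WorldScaling", 0.5,
                   {"min_scale": 0.95, "max_scale": 1.05}, world_scaling),
    AugmentationOp("GlobalTranslateNoise", 0.5,
                   {"std_x": 0.2, "std_y": 0.2, "std_z": 0.1},
                   global_translate_noise),
]
cloud = [(1.0, 2.0, 0.5), (-3.0, 0.5, 1.0)]
augmented = apply_policy(cloud, policy, random.Random(0))
```

Keeping the order fixed means a policy is fully described by the per-operation probabilities and parameters, which is what keeps the search space from growing with the number of possible orderings.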
3.2 Learning through Progressive Population Based Search
The proposed search process maximizes a given metric $\Omega$ on a model $\theta$ by optimizing a schedule of augmentation operation parameters $\lambda = (\lambda_t)_{t=1}^{T}$, where $T$ represents the number of iterative updates of the augmentation operation parameters during model training. For point cloud detection tasks, the performance metric is mean average precision (mAP). The process of discovering the optimal augmentation schedule $\lambda^*$ is given as follows:

$$\lambda^* = \arg\max_{\lambda \in \Lambda^{T}} \Omega(\theta)$$

During training, the objective function $L$ (which is used to optimize the model $\theta$ given data and label pairs $(X, Y)$) is usually different from the actual performance metric $\Omega$, since the optimization procedure (i.e., stochastic gradient descent) requires a differentiable objective function. Therefore, at each iteration $t$ the model $\theta$ is optimizing:

$$\theta_t^* = \arg\min_{\theta \in \Theta} L(X, Y, \lambda_t)$$

During search, the training process of the model is split into iterations. At every iteration, only a subset of augmentation operations is explored for optimization while the others are fixed. While models with different $\lambda_t$ are trained in parallel, those models are evaluated by the same predefined metric $\Omega$ at the end of the iteration. Models trained in all previous iterations are placed in the same population $\mathcal{P}$.
Similar to Population Based Training [12], the exploit phase keeps the good models and replaces the inferior models. In contrast with Population Based Training, the proposed method focuses only on a subset of the search space at each iteration. During the exploration phase, a successor might focus on a different subset of the parameters than its predecessor. In that case, the remaining parameters (parameters that the predecessor does not focus on) are inherited from the parameters of the corresponding operations with the best overall performance.
The detailed Progressive Population Based Augmentation algorithm is described in Algorithm 1.
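As a rough illustration of the overall loop (and not a reproduction of Algorithm 1), the toy sketch below keeps a population of trials, scores all of them with the same metric at the end of each iteration, lets inferior trials copy better ones, and then mutates only a small subset of operations. Everything here is an assumption made for illustration: `toy_metric` stands in for training a detector and measuring mAP, each operation is reduced to a single value in [0, 1], the top/bottom-quarter selection rule is borrowed from common population-based training practice, and the inheritance of historical best parameters is left out.

```python
import random

OPS = ["RandomFlip", "WorldScaling", "GlobalTranslateNoise", "FrustumDropout"]
# Hypothetical "ideal" magnitudes, used only to build a toy metric.
TARGET = {"RandomFlip": 0.1, "WorldScaling": 0.7,
          "GlobalTranslateNoise": 0.4, "FrustumDropout": 0.9}


def toy_metric(params):
    """Stand-in for training a detector and measuring mAP (higher is better)."""
    return -sum((params[op] - TARGET[op]) ** 2 for op in OPS)


def mutate(value, rng, scale=0.2):
    """Perturb one parameter and clip it to [0, 1]."""
    return min(1.0, max(0.0, value + rng.uniform(-scale, scale)))


def progressive_search(num_trials=8, num_iters=20, ops_per_iter=2, seed=0):
    rng = random.Random(seed)
    trials = [{"params": {op: rng.random() for op in OPS}}
              for _ in range(num_trials)]
    best_score, best_params = float("-inf"), None

    for _ in range(num_iters):
        # Every trial is scored with the same metric at the end of the iteration.
        for trial in trials:
            trial["score"] = toy_metric(trial["params"])
            if trial["score"] > best_score:
                best_score, best_params = trial["score"], dict(trial["params"])

        # Exploit: the bottom quarter copies a trial from the top quarter.
        ranked = sorted(trials, key=lambda t: t["score"], reverse=True)
        k = max(1, num_trials // 4)
        for loser in ranked[-k:]:
            winner = rng.choice(ranked[:k])
            loser["params"] = dict(winner["params"])
            # Explore: only a small subset of operations is mutated;
            # the parameters of all other operations stay fixed.
            for op in rng.sample(OPS, ops_per_iter):
                loser["params"][op] = mutate(loser["params"][op], rng)

    return best_score, best_params


score, params = progressive_search()
print(round(score, 4), params)
```

In a real search the model weights would also be carried over from the winning trial, so that each iteration continues training rather than restarting; the toy metric hides that detail.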
3.3 Optimize Schedule with Historical Data
Compared to augmentation operations on 2D images (e.g., rotation, translation), parameters for point cloud augmentation are more complicated because of the geometric nature of 3D data. For example, the FrustumDropout operation has five parameters (theta_width, phi_width, distance, keep_prob and drop_type), three of which are related to 3D coordinates. The analogous operation for 2D images is cutout [6], which has only one parameter. It is therefore more challenging to discover optimal parameters for point cloud operations with limited resources.
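To make the five parameters concrete, the following sketch shows one plausible reading of a frustum dropout operation. The parameter semantics are assumptions made for illustration and may differ from the implementation used in the experiments: a random point defines the frustum center in spherical coordinates, `theta_width` and `phi_width` are full angular widths in radians, `drop_type` selects whether a point must fall inside either angular window ("union") or both ("intersection"), only points farther than `distance` from the origin are candidates, and candidates are kept with probability `keep_prob`.

```python
import math
import random


def to_spherical(x, y, z):
    """Return (range, azimuth theta, inclination phi) of a Cartesian point."""
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.atan2(y, x)
    phi = math.acos(max(-1.0, min(1.0, z / r))) if r > 0 else 0.0
    return r, theta, phi


def frustum_dropout(points, theta_width, phi_width, distance,
                    keep_prob, drop_type, rng):
    """Drop points inside a frustum centered on a randomly chosen point."""
    _, theta0, phi0 = to_spherical(*rng.choice(points))
    kept = []
    for p in points:
        r, theta, phi = to_spherical(*p)
        # Wrap the azimuth difference into [-pi, pi].
        dtheta = math.atan2(math.sin(theta - theta0), math.cos(theta - theta0))
        in_theta = abs(dtheta) <= theta_width / 2
        in_phi = abs(phi - phi0) <= phi_width / 2
        if drop_type == "union":
            in_frustum = in_theta or in_phi
        else:  # "intersection"
            in_frustum = in_theta and in_phi
        # Only points farther than `distance` from the origin are candidates.
        if in_frustum and r > distance and rng.random() >= keep_prob:
            continue
        kept.append(p)
    return kept


cloud = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.5), (-3.0, 1.0, 1.0), (10.0, -4.0, 2.0)]
thinned = frustum_dropout(cloud, theta_width=0.5, phi_width=0.5, distance=0.0,
                          keep_prob=0.2, drop_type="intersection",
                          rng=random.Random(0))
```

By contrast, a 2D cutout needs only a patch size, which illustrates why the point cloud search space grows so quickly with the number of operations.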
In order to learn the parameters for individual operations effectively, PPBA modifies only a small portion of the parameters in the search space at every iteration, and the historical information from previous iterations is reused to progressively optimize the augmentation schedule. By narrowing the focus to certain subsets of the search space, it becomes easier to identify inferior augmentation parameters. To mitigate the slowdown in search speed caused by shrinking the search space at each training iteration, successors inherit the best parameters of each operation discovered in past iterations whenever their focused subsets of the search space differ from those of their predecessors.
Algorithm 2 describes the exploration phase based on historical data.
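The exploration rule described in this section can be sketched as follows. This is an illustrative reading rather than a transcription of Algorithm 2: the function name `explore`, the representation of each operation as a single value in [0, 1], the uniform perturbation, and the choice to perturb an inherited value are all assumptions. The sketch shows the one behavior the text does specify: when a successor focuses on an operation that its predecessor did not focus on, it starts from the best value recorded for that operation in past iterations.

```python
import random


def explore(parent_params, parent_focus, history_best, all_ops, rng,
            ops_per_iter=2, scale=0.2):
    """Return (child_params, child_focus) for a successor trial.

    parent_params: op name -> value in [0, 1] used by the predecessor.
    parent_focus:  ops the predecessor was optimizing.
    history_best:  op name -> best value recorded for that op so far.
    """
    child_focus = rng.sample(all_ops, ops_per_iter)
    child_params = dict(parent_params)  # unfocused ops stay fixed
    for op in child_focus:
        if op in parent_focus:
            # Same focus as the predecessor: mutate its current value.
            base = parent_params[op]
        else:
            # New focus: start from the best value seen in past iterations,
            # falling back to the predecessor's value if none is recorded.
            base = history_best.get(op, parent_params[op])
        child_params[op] = min(1.0, max(0.0, base + rng.uniform(-scale, scale)))
    return child_params, child_focus


ops = ["WorldScaling", "FrustumDropout", "RandomRotation"]
parent = {"WorldScaling": 0.5, "FrustumDropout": 0.5, "RandomRotation": 0.5}
history = {"FrustumDropout": 0.9, "RandomRotation": 0.1}
child, focus = explore(parent, ["WorldScaling"], history, ops, random.Random(0))
```

Starting from the recorded best value lets a trial that switches focus begin near a region already known to work, which is how the narrowed search avoids re-exploring each operation from scratch.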
4 Experiments
In this section, we empirically investigate the performance of PPBA in terms of predictive accuracy, computational efficiency and data efficiency. We focus on single-stage detection models due to their simplicity, speed advantages and widespread adoption [30, 16, 22]. We first benchmark PPBA on the KITTI object detection benchmark [9] and the Waymo Open Dataset [27] (Sections 4.1 and 4.2). Our results show that PPBA improves the baseline models and that the magnitude of the improvements may be comparable to advances in 3D perception architectures. Next, we compare PPBA with random search and PBA [11] on the KITTI Dataset (Section 4.3). Our results demonstrate that PPBA may be up to 31x more computationally efficient than random search, while identifying higher-performing augmentation strategies. Furthermore, PPBA outperforms PBA by a substantial margin with the same computation budget. Finally, we study the data efficiency of PPBA on the Waymo Open Dataset (Section 4.4). Our control experiments demonstrate that on a single-stage baseline model, the proposed method may be up to 3.3x or 10x more data efficient than training without augmentation when sampling from run segments and single frames of sensor data, respectively.
4.1 Surpassing Single-Stage Models on the KITTI Dataset
The KITTI Dataset [9] is generally recognized to be a small dataset for modern methods, and thus data augmentation is critical to the performance of models trained on it [30, 16, 22]. We evaluate PPBA with StarNet [22] on the KITTI test split in Table 1. PPBA improves the detection performance of StarNet significantly, outperforming all current state-of-the-art single-stage point cloud detection models on the moderate difficulty category. Compared to StarNet with manually designed augmentation, the proposed method achieves much better results for all categories (car, pedestrian and cyclist) on all difficulties; for example, it increases the mAP by 3.66, 2.83 and 3.7 for car, pedestrian and cyclist on the moderate difficulty, respectively.
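Table 1: 3D detection mAP on the KITTI test split.
|Method|Car Easy|Car Moderate|Car Hard|Pedestrian Easy|Pedestrian Moderate|Pedestrian Hard|Cyclist Easy|Cyclist Moderate|Cyclist Hard|
|---|---|---|---|---|---|---|---|---|---|
|3D IoU Loss [34]|84.43|76.28|68.22|-|-|-|-|-|-|
|StarNet [22] + PPBA|84.16|77.65|71.21|52.65|44.08|41.54|79.42|61.99|55.34|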
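The moderate-difficulty gains quoted in the text can be checked directly from the reported numbers. The short sketch below subtracts the moderate-difficulty mAP of StarNet with manually designed augmentation (the "Manual design" row reported in the Section 4.3 comparison) from the StarNet + PPBA row; the dictionary layout is only an illustrative convenience.

```python
# Moderate-difficulty mAP values as reported for the KITTI test split.
manual_design = {"car": 73.99, "pedestrian": 41.25, "cyclist": 58.29}
starnet_ppba = {"car": 77.65, "pedestrian": 44.08, "cyclist": 61.99}

# Absolute mAP gain per category, rounded to two decimals.
gains = {name: round(starnet_ppba[name] - manual_design[name], 2)
         for name in manual_design}
print(gains)
```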
During the PPBA search, 16 trials are trained to optimize the mAP for car and for pedestrian/cyclist, respectively. The same training and inference settings as [22] (http://github.com/tensorflow/lingvo) are used during search, while all trials are trained on the train split (3,712 samples) and validated on the val split (3,769 samples). The search is conducted in the search space described in Section 3.1. In contrast to AutoAugment [4], in which every sub-policy contains two augmentation operations and only two operations are applied to each training example, we found it helpful to apply all operations according to learned probabilities, since this can largely increase the diversity of the training data.
We train the first iteration for 3,000 steps and all subsequent iterations for 1,000 steps during the search. All iterations use a batch size of 64. We perform the search for 30 iterations for the car category and 20 iterations for the pedestrian/cyclist categories. In the search for the car category, trials with lower sampling probabilities for the GroundTruthAugmentor and FrustumNoise operations tend to perform better than the others. In contrast to the observations in [30, 16], we find that aggressively applying the GroundTruthAugmentor operation can hurt performance on the val split.
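For reference, the search schedule above implies the following per-trial training budget. The helper name `total_search_steps` is introduced here for illustration; the step counts, iteration counts and batch size are the ones stated in the text.

```python
def total_search_steps(num_iterations, first_iter_steps, later_iter_steps):
    """Training steps one trial accumulates over the whole search."""
    return first_iter_steps + (num_iterations - 1) * later_iter_steps


BATCH_SIZE = 64

car_steps = total_search_steps(30, 3000, 1000)       # car category
ped_cyc_steps = total_search_steps(20, 3000, 1000)   # pedestrian/cyclist

print(car_steps, car_steps * BATCH_SIZE)
print(ped_cyc_steps, ped_cyc_steps * BATCH_SIZE)
```

Under this schedule a car trial runs 32,000 steps and a pedestrian/cyclist trial runs 22,000 steps in total.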
4.2 Automated Data Augmentation Benefits Large-Scale Data
The Waymo Open Dataset is a recently released, large-scale dataset for 3D object detection in point clouds [27]. The dataset contains roughly 20x more scenes than KITTI, and roughly 20x more human-annotated objects per scene. This dataset presents an opportunity to ask whether data augmentations, which are critical to model performance on KITTI because of its small size, continue to provide a benefit in a large-scale training setting more reflective of real-world self-driving conditions.
To address this question, we evaluate the proposed method on the Waymo Open Dataset. In particular, we evaluate PPBA with StarNet [22] and PointPillars [16] on the test split in Table 2 and Table 3 on both LEVEL 1 and LEVEL 2 difficulties at different ranges. Our results indicate that PPBA notably improves the predictive accuracy of 3D detection across architectures, difficulty levels and object classes. These results indicate that data augmentation remains an important method for boosting model performance even in large-scale dataset settings. Furthermore, the gains due to PPBA may be as large as those from changing the underlying architecture, without any increase in inference cost.
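Table 2: Results on the Waymo Open Dataset test split (IoU=0.7).
|Method|Difficulty|3D mAP Overall|3D mAP 0-30m|3D mAP 30-50m|3D mAP 50m-Inf|3D mAPH Overall|3D mAPH 0-30m|3D mAPH 30-50m|3D mAPH 50m-Inf|
|---|---|---|---|---|---|---|---|---|---|
|StarNet [22] + PPBA|1|64.0|82.8|59.4|34.6|63.5|82.3|58.8|34.2|
|StarNet [22] + PPBA|2|55.6|81.9|53.8|26.4|55.2|81.4|53.3|26.0|
|PointPillars [16] + PPBA|1|65.0|83.8|61.9|37.1|64.3|83.2|61.0|36.2|
|PointPillars [16] + PPBA|2|57.4|82.0|55.8|28.3|56.8|81.4|55.0|27.6|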
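Table 3: Results on the Waymo Open Dataset test split (IoU=0.5).
|Method|Difficulty|3D mAP Overall|3D mAP 0-30m|3D mAP 30-50m|3D mAP 50m-Inf|3D mAPH Overall|3D mAPH 0-30m|3D mAPH 30-50m|3D mAPH 50m-Inf|
|---|---|---|---|---|---|---|---|---|---|
|StarNet [22] + PPBA|1|69.7|77.5|68.7|57.0|61.7|69.3|61.2|48.4|
|StarNet [22] + PPBA|2|63.0|74.8|63.2|46.5|55.8|66.8|56.2|39.4|
|PointPillars [16] + PPBA|1|66.4|74.7|64.8|52.7|54.4|62.5|52.5|41.2|
|PointPillars [16] + PPBA|2|60.1|72.2|59.7|42.8|49.2|60.4|48.2|33.4|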
When performing the search with PPBA, 16 trials are trained to optimize the mAP for car and pedestrian, respectively. All augmentation operations listed in Section 3.1 except GroundTruthAugmentor and RandomFlip are used during search. In our experiments, we found that RandomFlip has a negative impact on heading prediction for both car and pedestrian.
For both StarNet and PointPillars on the Waymo Open Dataset, the same training and inference settings as [22] (http://github.com/tensorflow/lingvo) are used. All trials are trained on the full train set and validated on the 10% val split (4,109 samples). During the search, we train the first iteration for 8,000 steps and the remaining iterations for 4,000 steps on StarNet with a batch size of 128. For PointPillars, we halve the training steps in each iteration and use a batch size of 64, since it converges faster. We perform the search for 25 iterations on StarNet and for 20 iterations on PointPillars.
4.3 Better Results with Less Computation
Above, we have verified the effectiveness of PPBA in improving 3D object detection on the KITTI Dataset and the Waymo Open Dataset. In this section, we analyze the computational cost of PPBA, and compare PPBA with random search and PBA [11] on the KITTI test split.
All searches are performed with StarNet [22] and the search space described in Section 3.1. For Random Search, 1,000 distinct augmentation policies are randomly sampled and trained. (Our initial random search experiment shows that the performance distribution of augmentation policies is spread out on the KITTI val split. To save computation resources, the random search here is performed on a fine-grained search space.) PBA is run with 16 total trials, training the first iteration for 3,000 steps and the remaining iterations for 1,000 steps with a batch size of 64.
The baseline StarNet is trained for 8 hours with a TPU v3-32 Pod [13, 15] for both the vehicle detection and the pedestrian/cyclist detection models. Random Search must train all 1,000 sampled policies, so its cost in TPU hours is correspondingly large. In comparison, both PBA and PPBA train at a much smaller cost in TPU hours, plus an additional real-time computation overhead from waiting for evaluation results. PPBA speeds up the search by 31x compared to random search, while achieving the best results among random search, PBA and PPBA on the car and cyclist detection categories.
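The individual cost terms are not stated here, so the following back-of-the-envelope calculation is only one set of assumptions that happens to be consistent with the reported 31x figure: each random-search policy is assumed to cost the same 8 TPU hours as the baseline, each of the 16 PBA/PPBA trials is assumed to cost about 8 TPU hours in total, and the evaluation-waiting overhead is assumed to equal the training cost.

```python
BASELINE_TPU_HOURS = 8      # stated: baseline StarNet training time
NUM_RANDOM_POLICIES = 1000  # stated: policies sampled by random search
NUM_TRIALS = 16             # stated: trials used by PBA and PPBA

# Assumption: every random-search policy costs one full baseline training run.
random_search_hours = NUM_RANDOM_POLICIES * BASELINE_TPU_HOURS

# Assumption: each trial costs about one baseline run over the whole search.
population_training_hours = NUM_TRIALS * BASELINE_TPU_HOURS
# Assumption: waiting for evaluation results costs as much as training.
evaluation_overhead_hours = population_training_hours

population_total_hours = population_training_hours + evaluation_overhead_hours
speedup = random_search_hours / population_total_hours
print(random_search_hours, population_total_hours, speedup)
```

Under these assumptions the ratio is 8,000 / 256, or about 31.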
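Table 4: Comparison of augmentation search methods on the KITTI test split.
|Method|TPU hours|Car Easy|Car Moderate|Car Hard|Pedestrian Easy|Pedestrian Moderate|Pedestrian Hard|Cyclist Easy|Cyclist Moderate|Cyclist Hard|
|---|---|---|---|---|---|---|---|---|---|---|
|Manual design [22]|8|81.63|73.99|67.07|48.58|41.25|39.66|73.14|58.29|52.58|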
While searching the augmentation policies randomly for pedestrian/cyclist detection, the majority of samples perform worse than the manually designed augmentation strategy on the KITTI val split (see Fig. 3). As shown in Table 7 in the Appendix, augmentation parameters of different operations represent magnitudes in different domains (e.g., geometric distance, operation strength, distribution of categorical sampling). Because of the complex search space, it is challenging to discover good augmentation policies with random search, especially for the cyclist category. We find it effective to fine-tune the parameter search space of each operation to improve the overall performance of random search. However, the whole process is expensive and requires domain expertise.
We observe that PBA is not effective at discovering better augmentation policies, compared to random search or even to manual search, when the detection category is sensitive to inferior augmentation parameters. Inspired by beam search, PPBA narrows the search space to optimize only 2 operations at every iteration, and the next iteration builds on the search result within a randomly mutated small search space. To counteract the slowdown of the search caused by shrinking the search space at each iteration, the best parameters of each operation in past iterations are recorded as references for mutating parameters in future iterations. As shown in Table 4, PPBA achieves much larger improvements on the car and cyclist categories, demonstrating the effectiveness of the proposed strategy.
4.4 Automated Data Augmentation Improves Data Efficiency
In this section, we conduct experiments to determine how PPBA performs when the dataset size grows. To conduct these experiments, we take subsets of the Waymo Open Dataset containing 10%, 30% and 50% of the training examples, obtained by randomly sampling run segments and single frames of sensor data, respectively. We use the re-implemented PointPillars model for this experiment. During training, the decay interval of the learning rate is decreased linearly according to the percentage of data sampled (e.g., the decay interval is reduced by 50% when sampling 50% of the training examples), while the number of training epochs is set to be inversely proportional to the percentage of data sampled. Since smaller datasets are commonly known to need more regularization, we increase the weight decay from 1e-4 to 1e-3 when training on 10% of the examples.
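The schedule adjustments described above can be written as a small helper. This is a sketch only: the function name `subset_schedule`, the dictionary keys, and the base values passed in the usage loop are hypothetical, and the units of the decay interval are left abstract because the text does not specify them.

```python
import math


def subset_schedule(fraction, base_decay_interval, base_epochs,
                    base_weight_decay=1e-4, small_data_weight_decay=1e-3):
    """Training settings for a run on `fraction` of the training examples."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return {
        # Decay interval shrinks linearly with the fraction of data sampled.
        "lr_decay_interval": base_decay_interval * fraction,
        # Number of epochs is inversely proportional to the fraction.
        "epochs": base_epochs / fraction,
        # Stronger regularization only for the 10% subset, as stated.
        "weight_decay": (small_data_weight_decay
                         if math.isclose(fraction, 0.1)
                         else base_weight_decay),
    }


# Hypothetical base values, used only to show the scaling.
for fraction in (0.1, 0.3, 0.5, 1.0):
    print(fraction, subset_schedule(fraction, base_decay_interval=100.0,
                                    base_epochs=10.0))
```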
Compared to downsampling from single frames of sensor data, the performance degradation of PointPillars models is more severe when downsampling from run segments. This phenomenon is due to the relative lack of diversity in the run segments, which tend to contain the same set of distinct vehicles and pedestrians. In Table 5, Fig. 4 and Fig. 5, we compare the overall 3D detection mAP on the Waymo Open Dataset val split, for all ground truth examples with at least 5 points and rated as LEVEL 1 difficulty, for 3 sets of PointPillars models: with no augmentation, with a random augmentation policy, and with PPBA. While a random augmentation policy can improve the PointPillars baselines and demonstrates the effectiveness of the proposed search space, PPBA pushes the limit even further. PPBA is 10x more data efficient when sampling from single frames of sensor data, and 3.3x more data efficient when sampling from run segments. As expected, the improvement from PPBA becomes larger when the dataset size is reduced.
5 Conclusion
We have presented Progressive Population Based Augmentation, a novel automated augmentation algorithm for point clouds. PPBA optimizes the augmentation schedule via narrowing down the search space and adopting the best parameters from past iterations. Compared with random search and PBA, PPBA can more effectively and more efficiently discover good augmentation policies in a rich search space for 3D object detection. Experimental results on the KITTI dataset and the Waymo Open Dataset demonstrate that the proposed method can significantly improve 3D object detection in terms of performance and data efficiency.
Acknowledgements
We would like to thank Peisheng Li, Chen Wu, Ming Ji, Weiyue Wang, Zhinan Xu, James Guo, Shirley Chung, Yukai Liu, Pei Sun of Waymo and Ang Li of DeepMind for helpful feedback and discussions. We also thank the larger Google Brain team including Matthieu Devin, Zhifeng Chen, Wei Han and Brandon Yang for their support and comments.
References
[1] Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907–1915.
[2] (2014) A multi-sensor fusion system for moving object detection and tracking in urban driving environments. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1836–1843.
[3] Multi-column deep neural networks for image classification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3642–3649.
[4] (2018) Autoaugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501.
[5] (2019) RandAugment: practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719.
[6] Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
[7] (2017) Cut, paste and learn: surprisingly easy synthesis for instance detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1301–1310.
[8] (2019) Instaboost: boosting instance segmentation via probability map guided copy-pasting. arXiv preprint arXiv:1908.07801.
[9] (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR).
[10] (2018) Detectron.
[11] (2019) Population based augmentation: efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393.
[12] (2017) Population based training of neural networks. arXiv preprint arXiv:1711.09846.
[13] In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12.
[14] (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
[15] (2019) Scale mlperf-0.6 models on google tpu-v3 pods. arXiv preprint arXiv:1909.09756.
[16] (2019) Pointpillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12697–12705.
[17] (2017) Smart augmentation learning an optimal data augmentation strategy. IEEE Access 5, pp. 5858–5869.
[18] (2020) PointAugment: an auto-augmentation framework for point cloud classification. arXiv preprint arXiv:2002.10876.
[19] (2018) Deep continuous fusion for multi-sensor 3d object detection.
[20] (2019) Fast autoaugment. arXiv preprint arXiv:1905.00397.
[21] (2018) Fast and furious: real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3569–3577.
[22] (2019) Starnet: targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069.
[23] (2017) Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems, pp. 3239–3249.
[24] (2015) Apac: augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229.
[25] (2019) PV-rcnn: point-voxel feature set abstraction for 3d object detection. arXiv preprint arXiv:1912.13192.
[26] (2003) Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of International Conference on Document Analysis and Recognition.
[27] (2019) Scalability in perception for autonomous driving: waymo open dataset. arXiv preprint arXiv:1912.04838.
[28] (2006) Stanley: the robot that won the darpa grand challenge. Journal of Field Robotics 23 (9), pp. 661–692.
[29] Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pp. 1058–1066.
[30] (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337.
[31] (2018) Hdnet: exploiting HD maps for 3d object detection. In Conference on Robot Learning, pp. 146–155.
[32] (2018) Pixor: real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7652–7660.
[33] (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
[34] (2019) Iou loss for 2d/3d object detection.
[35] (2019) End-to-end multi-view fusion for 3d object detection in lidar point clouds. arXiv preprint arXiv:1910.06528.
[36] (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499.
[37] (2019) Learning data augmentation strategies for object detection. arXiv preprint arXiv:1906.11172.