3D object detection is critical for real-world applications such as autonomous driving and robotics. Although 3D object detection has been studied extensively, most works focus on architectures suited to 3D point clouds [lang2019pointpillars, yang2019std, he2020structure, Wang2019FrustumCS, Liu2020TANetR3, shi2020pv].
Meanwhile, data augmentation plays an important role in boosting the performance of 3D models. 3D labeling is much more time-consuming than 2D labeling, so most 3D datasets have a limited number of training samples. Yet 3D data augmentation has not been explored much.
Many works in 3D object detection apply data augmentation such as translation, random flipping, shifting, scaling, and rotation, directly extending typical 2D augmentation methods to 3D [shi2020pv, Wang2019FrustumCS, chen2019object, hahner2020quantifying]. These existing methods are effective in improving performance, but they do not fully utilize the 3D information. 3D ground-truth boxes carry much richer structural information than 2D ground-truth boxes because they tightly fit the object along each direction. For example, a 2D box may contain other instances and background, so the information it provides is noisy. In contrast, a 3D box provides sufficient information about a single object even under occlusion and contains little background noise (Fig. 1, First row). Also, since 2D boxes carry no structural information about the objects, they cannot tell which part of a car is the ‘wheel’. With the intra-object part location information of 3D boxes, however, we know that the wheels are located near the bottom corners (Fig. 1, Second row). Utilizing these unique characteristics of 3D boxes enables more sophisticated and effective augmentation than 2D augmentation can provide.
In this paper, we propose a part-aware data augmentation method that is robust to various extreme environments by using the structural information of 3D ground-truth boxes. The network becomes aware of intra-object relations as it learns the individual variation within each intra-object part. Our part-aware augmentation divides 3D ground-truth boxes into 8 or 4 partitions depending on the object type. It then stochastically applies five augmentation methods to each partition: internal points dropout, cutmix [yun2019cutmix], cutmixup [yoo2020rethinking], sparse sampling, and random noise generation. Internal points dropout stochastically removes partitions and leaves the corner of an object, which enables the network to find the entire box when only some parts of the object are given. Cutmix and cutmixup respectively replace and mix the points in a partition with points from the same class and the same partition location, which gives the network a regularization effect. Sparse sampling subsamples the points of a dense partition; by sparsifying the partition, the network can learn more about distant objects. Random noise generation trains the network for situations of severe occlusion.
Note that [yun2019cutmix, yoo2020rethinking] respectively apply cutmix and cutmixup to an image region with a patch from another class, so that the network can learn relations across examples of different classes. In our work, however, points from the same class are mixed to give a further regularization effect for intra-class examples. This reflects the characteristics of 3D object detection, which requires accurate localization while classifying only 3 to 23 classes [geiger2013vision, sun2019scalability, caesar2020nuscenes] centered on car, pedestrian, and cyclist, whereas [yun2019cutmix, yoo2020rethinking] classify the 1000 classes of ImageNet.
Our proposed part-aware data augmentation improves the KITTI [geiger2013vision] Cyclist 3D AP of the PointPillars baseline [lang2019pointpillars] by up to 8.91%p. This gain can be attributed to part-awareness, considering that the improvement is only 0.45%p when random partitions are used instead of part-aware partitions. Part-aware data augmentation also makes the model robust in poor but inevitable environments, such as severe occlusion, low resolution, and inaccuracy due to snow or rain. In those situations, our method shows a much smaller drop in accuracy than the baseline. In addition, part-aware augmentation performs well when data is insufficient, with an effect equivalent to increasing the training data by about 2.5 times. Finally, because our method divides a 3D box according to its structure and applies augmentation methods individually to the partitions, multiple augmentation methods can be used simultaneously without interfering with each other, which further strengthens the regularization effect.
Our main contributions are:
We propose a partitioning method based on the structural information of a 3D box, together with five novel 3D LiDAR data augmentation methods that significantly enhance performance when used together.
Our work is compatible with existing LiDAR data augmentation methods and boosts the performance of conventional detectors at negligible cost.
We show that the proposed part-aware augmentation not only improves recognition accuracy on the given datasets but also provides robustness to corrupted data.
II Related Works
II-A 3D Object Detection
Although both RGB and LiDAR data can be used for 3D object detection, recent state-of-the-art (SOTA) detectors [he2020structure, shi2020pv] rely only on LiDAR data. LiDAR-based 3D object detectors are broadly classified into projection-based, voxelization-based, and raw-point-cloud-based methods, depending on how the point cloud is encoded.
The projection-based detection methods project point cloud data into an FV (Front View) or BEV (Bird's Eye View) representation in order to use 2D convolutions. MV3D [chen2017multi] fuses 2D CNN features extracted from BEV, FV, and RGB images. PIXOR [yang2018pixor] proposed a proposal-free, single-stage detector that uses BEV. LaserNet [meyer2019lasernet] proposed a method of predicting boxes in the form of a distribution using FV. Since projection-based detectors use 2D CNNs, they have a great advantage in recognition speed, but their recognition performance is somewhat insufficient due to the information loss that occurs in the projection process.
Voxelization-based methods quantize the point cloud and encode it as a 3D matrix in order to use 3D convolution. VoxelNet [zhou2018voxelnet] divides the space into a 3D grid, combines the points in each grid cell with fully connected layers, and performs 3D convolution to regress 3D boxes. However, 3D convolution is very time-consuming. To resolve this problem, SECOND [yan2018second] introduced sparse convolution, which greatly improved on the detection speed of VoxelNet.
Unlike the projection-based and voxelization-based methods, methods based on the raw point cloud incur no loss of input information. PointNet [qi2017pointnet] and PointNet++ [qi2017pointnet++] perform classification and segmentation by learning a 3D representation of points using fully connected layers. PointRCNN [shi2019pointrcnn] proposed a method that generates proposals using PointNet++ and refines 3D boxes with PointNet.
In recent years, many studies have combined the advantages of the methods introduced above. PointPillars [lang2019pointpillars] proposed a method of encoding a point cloud by voxelization in the form of a BEV 2D grid, significantly improving the detection speed. Part-A² Net [shi2020points] creates proposals using raw point clouds to reduce the region of interest and then performs box refinement using voxelization. In addition, Part-A² Net proposed a method of using the intra-object part location information of 3D labels. PV-RCNN [shi2020pv] performs region proposal using voxelization and combines multi-scale voxel features with a voxel set abstraction module to compensate for inaccurate recognition due to the insufficient spatial resolution of voxelization-based proposals, greatly improving detection performance. SA-SSD [he2020structure] also converts the point cloud into a tensor using quantization and then extracts features with 3D convolution. In addition, to compensate for inaccurate detection caused by downsampling, it uses an auxiliary network that learns from the raw point cloud at the point level. Networks using these fusion methods currently show the best performance in LiDAR-based 3D object detection.
II-B 2D Data Augmentation
It has been demonstrated that data augmentation leads to gains in 2D image tasks such as classification and object detection [zhong2020random, inoue2018data, taylor2018improving]. In particular, patch-based data augmentation methods, which cut and paste patches among training images, have boosted performance. In [devries2017improved], image patches are zeroed out, which encourages the model to utilize the full context of the image; on the other hand, the deleted regions become uninformative. Cutmix [yun2019cutmix] replaces the deleted regions with a patch from another image and thus maximizes the proportion of informative pixels. These methods greatly improve performance on the CIFAR and ImageNet datasets. Similar improvements have also been shown in low-level vision tasks. Cutblur [yoo2020rethinking] cuts a low-resolution patch and replaces it with the corresponding high-resolution image region, and vice versa. It has the same effect as making the image partially sparse and enables the model to learn both “how” and “where” to super-resolve the image.
In our work, the 2D image patch is extended to a 3D partition. Using the 3D partition, we extend cutout [devries2017improved], cutmix [yun2019cutmix], and cutblur [yoo2020rethinking] to 3D point clouds. The five proposed augmentation methods are applied to the partitions simultaneously, which makes the model more robust and significantly improves performance.
II-C 3D Data Augmentation
Considering the limited size of datasets for 3D object detection, including the KITTI dataset, data augmentation is one way to alleviate overfitting and boost performance. Works that reported improved 3D object detection performance [shi2020pv, Wang2019FrustumCS, chen2019object] adopted data augmentation methods such as translation, random flipping, shifting, scaling, and rotation when training on KITTI, which led to additional improvement. Oversampling has also been used to address the foreground-background class imbalance problem [yan2018second, shi2020pv, hahner2020quantifying]. A large-scale dataset with the entire sensor suite [caesar2020nuscenes] was provided to complement the shortcomings of KITTI, yet data augmentation is still necessary for model robustness.
Despite their effectiveness, existing data augmentation methods do not fully utilize the information in point clouds, which is richer than that in 2D images. We propose part-aware data augmentation, which takes full advantage of the spatial information unique to 3D datasets.
Recently, automated data augmentation approaches have been actively studied. [cheng2020improving] narrowed down the search space with an evolutionary search algorithm and adopted the best parameters discovered. [li2020pointaugment] jointly optimized the augmentor and the classifier via an adversarial learning strategy. These approaches could be combined with our proposed part-aware data augmentation to further enhance performance in future work.
We propose a part-aware partitioning method that divides an object into partitions according to intra-object part location in order to fully utilize the structural information of the 3D label. Partitioning is necessary to separate the characteristic sub-parts of an object, and it enables more diverse and efficient augmentation than existing methods. Because the location of the characteristic parts differs for each class, Car, Pedestrian, and Cyclist are divided into 8, 4, and 4 partitions respectively (Fig. 2, First column). With partition-based augmentation, instead of applying the same augmentation to the entire object, different augmentations can be applied to each intra-object sub-part.
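The partitioning step can be illustrated with a minimal NumPy sketch. Everything named here is an assumption made for illustration: the box layout `(cx, cy, cz, dx, dy, dz, yaw)`, the helper names, and the use of a regular grid such as `(2, 2, 2)` for the 8 Car partitions. The exact 4-partition layout for Pedestrian and Cyclist is defined by Fig. 2 and is not reproduced here.

```python
import numpy as np

def to_canonical(points, box):
    """Map world-frame points (N, 3) into the box frame, scaled to [-0.5, 0.5]^3."""
    cx, cy, cz, dx, dy, dz, yaw = box  # assumed box layout
    shifted = points - np.array([cx, cy, cz])
    c, s = np.cos(-yaw), np.sin(-yaw)  # rotate by -yaw about the z axis
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    local = shifted @ rot.T
    return local / np.array([dx, dy, dz])

def partition_ids(points, box, grid):
    """Return a partition index per point, or -1 for points outside the box."""
    canon = to_canonical(points, box)
    inside = np.all(np.abs(canon) <= 0.5, axis=1)
    grid = np.asarray(grid)
    # Grid cell along each box axis, clipped so boundary points stay in range.
    cell = np.floor((canon + 0.5) * grid).astype(int)
    cell = np.clip(cell, 0, grid - 1)
    ids = (cell[:, 0] * grid[1] + cell[:, 1]) * grid[2] + cell[:, 2]
    return np.where(inside, ids, -1)
```

Working in the normalized box frame means the same partition index refers to the same relative part of an object regardless of the box's size, position, or heading.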
A point cloud can be expressed as the union of foreground points and background points. The foreground points are in turn the union of the points inside each 3D box over all boxes in the scene, and the points inside each box are the union of the internal points of its partitions.
The set of augmented foreground points is obtained by applying augmentation only to the selected partitions of the selected bounding boxes; all other foreground points are left unchanged.
III-A Dropout Partition
Dropout [srivastava2014dropout] was first used at the feature level to increase the regularization effect of the network by randomly setting the activations of some nodes to zero. It was later shown that dropout can also be applied effectively to the input in 2D image classification [devries2017improved]. Inspired by these works, we propose a partition-based dropout method that can be used effectively on 3D point clouds, as below.
In Eq. (4), the index indicates a randomly selected dropout partition among the partitions of a box. Dropout using a predefined partition can remove characteristic sub-parts of an object, making learning more robust.
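As a rough illustration of partition dropout, the sketch below removes every point that falls in one randomly chosen partition of a single box. The function name and the inputs (`part_ids` holding a per-point partition index, `num_parts` giving the partition count) are assumptions made for this example, and the stochastic decision of whether a given box is augmented at all is left out.

```python
import numpy as np

def dropout_partition(points, part_ids, num_parts, rng):
    """Remove all points of one randomly chosen partition of a box.

    points:   (N, 3) points inside the box
    part_ids: (N,) partition index of each point (assumed precomputed)
    """
    drop = int(rng.integers(num_parts))  # index of the partition to drop
    keep = part_ids != drop
    return points[keep], part_ids[keep], drop
```

If the chosen partition happens to be empty, the box is returned unchanged.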
III-B Swap Partition
CutMix [yun2019cutmix], which is used in 2D image recognition, is an augmentation method that swaps random regions extracted from training samples. It can be applied across different classes by mixing class labels and has been shown to be effective for regularization. Inspired by this work, we propose a swap partition operation that utilizes the intra-object part location information of 3D labels. The difference from CutMix is that our method swaps partitions of the same class and the same location within an object.
As in Eqs. (5) and (6), each box is selected for swapping with a given probability. For a selected box, we swap a randomly selected non-empty partition with the partition at the same location in another box of the same class. Because partitions of different boxes have different scales, directions, and locations, we first convert them to the canonical coordinate system [shi2019pointrcnn] with a standard scale, swap the partitions, and then restore them to the original coordinate system with the original scale.
Because CutMix swaps patches of random areas, part of an object can be swapped into a background area, which could have a negative effect on learning. Our partition-aware swap has no such problem, and it maximizes the effect of intra-class regularization by swapping only between objects of the same class.
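The canonical-frame swap can be sketched as follows. This is only an illustration under stated assumptions: boxes are given as `(cx, cy, cz, dx, dy, dz, yaw)`, per-point partition indices are precomputed, both boxes belong to the same class, and only one direction of the swap (box B into box A) is shown. All function names are hypothetical.

```python
import numpy as np

def rot_z(angle):
    """Rotation matrix about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def to_canonical(points, box):
    """World frame -> box frame with unit scale."""
    center, size, yaw = np.array(box[:3]), np.array(box[3:6]), box[6]
    return ((points - center) @ rot_z(-yaw).T) / size

def from_canonical(canon, box):
    """Box frame with unit scale -> world frame of the given box."""
    center, size, yaw = np.array(box[:3]), np.array(box[3:6]), box[6]
    return (canon * size) @ rot_z(yaw).T + center

def swap_partition(pts_a, ids_a, box_a, pts_b, ids_b, box_b, j):
    """Replace partition j of box A with partition j of box B."""
    # Normalize the donor points in B's frame, then place them in A's frame.
    donor = from_canonical(to_canonical(pts_b[ids_b == j], box_b), box_a)
    kept = pts_a[ids_a != j]
    return np.concatenate([kept, donor], axis=0)
```

Calling the function a second time with the roles of A and B exchanged gives the other half of the swap.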
|Method|Car 3D (IoU=0.7) Easy|Car Mod.|Car Hard|Cyclist 3D (IoU=0.5) Easy|Cyclist Mod.|Cyclist Hard|Pedestrian 3D (IoU=0.5) Easy|Pedestrian Mod.|Pedestrian Hard|
|---|---|---|---|---|---|---|---|---|---|
|PointPillars + Dropout|80.89|72.23|68.03|66.00|44.19|41.89|55.10|50.38|45.63|
|PointPillars + Swap|81.45|68.60|66.85|66.66|44.94|42.62|56.92|51.97|47.32|
|PointPillars + Mix|81.79|70.21|67.87|62.78|40.45|38.42|59.98|54.60|48.87|
|PointPillars + Sparse|82.56|69.83|67.27|66.88|44.37|42.00|58.47|53.62|48.64|
|PointPillars + Noise|82.03|68.37|65.81|66.44|44.31|41.79|57.81|52.55|47.73|
|PointPillars + PA-AUG|83.70|72.48|68.23|70.88|47.58|44.80|57.38|51.85|46.91|
|PV-RCNN + PA-AUG|89.38|80.90|78.95|86.56|72.21|68.01|67.57|60.61|56.58|
III-C Mix Partition
CutMixup [yoo2020rethinking], a combination of CutMix [yun2019cutmix] and Mixup [zhang2018mixup], blends random areas of the training images and is quite effective as a data augmentation method for image super-resolution. We apply it to our partition-based augmentation and call it Mix partition. The detailed method is identical to Eq. (5) except for how the augmented partition is formed.
The partition to mix is selected in the same way as in the Swap operation, and the same partition standardization process is applied. The only difference is that the selected partition and its counterpart from the other box are merged, rather than exchanged, when creating the augmented partition.
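The difference from the Swap operation can be made concrete with a short sketch. It assumes that the donor points from the other box have already been standardized and mapped into the target box's frame, and that `ids_a` holds a per-point partition index; the function name and arguments are hypothetical.

```python
import numpy as np

def mix_partition(pts_a, ids_a, donor_pts, j):
    """Merge donor points into partition j of box A, keeping A's own points.

    donor_pts is assumed to be already expressed in box A's frame.
    """
    mixed_pts = np.concatenate([pts_a, donor_pts], axis=0)
    donor_ids = np.full(len(donor_pts), j, dtype=ids_a.dtype)
    mixed_ids = np.concatenate([ids_a, donor_ids])
    return mixed_pts, mixed_ids
```

Where Swap discards the original points of partition `j`, Mix keeps them, so the augmented partition is denser than either source.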
III-D Sparsify Partition
The density of LiDAR points decreases cubically as the distance of the box increases. As the point density decreases, the shape of the object can no longer be fully recognized, which is one of the most significant factors that reduce the performance of LiDAR-based detectors. We propose partition sparsification, an augmentation method that makes some dense partitions sparse in order to improve the recognition of distant objects. The details are as follows.
As in Eq. (8), partitions to augment are selected with a given probability from among the dense partitions, that is, those whose number of points exceeds a threshold. The points of each selected partition are then subsampled using Farthest Point Sampling (FPS).
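A minimal sketch of this step is given below, using a plain greedy implementation of Farthest Point Sampling. The parameter names `min_points` (the density threshold) and `k` (the number of points to keep, assumed to be at least 1) are placeholders for the values in Table II, and the per-partition selection probability is omitted.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: return indices of k points that are mutually far apart."""
    n = points.shape[0]
    k = min(k, n)
    chosen = np.empty(k, dtype=int)
    chosen[0] = start
    # Distance from every point to the nearest already-chosen point.
    dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, k):
        chosen[i] = int(np.argmax(dist))
        new_dist = np.linalg.norm(points - points[chosen[i]], axis=1)
        dist = np.minimum(dist, new_dist)
    return chosen

def sparsify_partition(points, part_ids, j, min_points, k):
    """Subsample partition j to k points if it holds more than min_points."""
    mask = part_ids == j
    if mask.sum() <= min_points:
        return points, part_ids  # not dense enough, leave unchanged
    idx = np.flatnonzero(mask)
    keep = idx[farthest_point_sampling(points[idx], k)]
    keep_mask = ~mask          # keep all points of the other partitions
    keep_mask[keep] = True     # plus the FPS-selected points of partition j
    return points[keep_mask], part_ids[keep_mask]
```

FPS is preferred over uniform random sampling here because it preserves the overall extent of the partition while reducing its density, which is closer to how a distant object appears.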
III-E Add Noise to Partition
Since RGB-image-based detectors are greatly influenced by the illuminance of the surrounding environment, augmentation methods that change brightness and color help improve their performance. Likewise, LiDAR-based detectors are vulnerable to weather changes such as rain or snow, which can cause noise and occlusion in point cloud data. To make the detector robust to noise from such sources, we propose the following partition-based augmentation method:
As in Eq. (9), partitions to augment are selected with a given probability. Randomly generated noise points are then added to each selected partition.
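One way to generate such noise is sketched below. It assumes, purely for illustration, that noise points are drawn uniformly inside the selected partition, that partitions form a regular grid in the box frame, and that the box is given as `(cx, cy, cz, dx, dy, dz, yaw)`; none of these choices is confirmed by the text above.

```python
import numpy as np

def partition_bounds(cell, grid):
    """Bounds of a grid cell in the normalized box frame [-0.5, 0.5]^3."""
    cell = np.asarray(cell, dtype=float)
    grid = np.asarray(grid, dtype=float)
    return cell / grid - 0.5, (cell + 1.0) / grid - 0.5

def noise_in_partition(box, cell, grid, num_noise, rng):
    """Sample num_noise uniform points inside one partition, in world frame."""
    lo, hi = partition_bounds(cell, grid)
    canon = rng.uniform(lo, hi, size=(num_noise, 3))
    center, size, yaw = np.array(box[:3]), np.array(box[3:6]), box[6]
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Undo the normalization: scale, rotate by yaw, then translate.
    return (canon * size) @ rot.T + center
```

The sampled points would then be concatenated with the original points of the box.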
The five augmentation methods based on part-aware partitioning introduced above can be used individually. Because the methods are independent, several of them can also be applied to the same object, and the operations are applied independently so that different operations may act on one partition, creating various combinations of augmentations. However, if all augmentations are applied in an arbitrary order, the operations may interfere with each other. To minimize this interference, we apply them in the order Dropout, Swap, Mix, Sparse, Noise. We call this scheme PA-AUG; since it stochastically applies all five operations, it can take advantage of each and shows a strong regularization effect.
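The fixed ordering can be sketched as a small driver loop. This is a simplified illustration, not the exact procedure: it assumes each operation is a callable acting on the points of a single partition, and that each operation is triggered independently per partition with its own probability. The names `ORDER`, `pa_aug`, `ops`, and `probs` are hypothetical.

```python
import numpy as np

# Operations are always visited in this fixed order.
ORDER = ("dropout", "swap", "mix", "sparse", "noise")

def pa_aug(partitions, ops, probs, rng):
    """Stochastically apply the five operations to the partitions of one box.

    partitions: list of (M_j, 3) arrays, one per partition
    ops:        dict mapping operation name -> callable(points, rng) -> points
    probs:      dict mapping operation name -> probability of applying it
    """
    out = list(partitions)
    for name in ORDER:
        for j, part in enumerate(out):
            if rng.random() < probs[name]:
                out[j] = ops[name](part, rng)
    return out
```

Because each operation only touches the partitions it is drawn for, several operations can act on the same object in one pass.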
In this section, we analyze from several angles how the proposed augmentation method affects recognition. Section IV-A shows how performance changes when our methods are applied to existing detectors on the KITTI [geiger2013vision] dataset. In Section IV-B, robustness tests are performed on corrupted KITTI datasets to check in which situations each augmentation method actually helps. In Section IV-C, we test whether the partitioning method using intra-object part location information actually has a performance advantage. Finally, Section IV-D shows how efficiently our augmentation works when the size of the training dataset is reduced.
Settings The KITTI object detection benchmark dataset [geiger2013vision] consists of 7,481 training samples and 7,518 testing samples. In order to verify the effectiveness of PA-AUG, we separated the training dataset into 3,712 training samples and 3,769 validation samples [chen20153d]. Since our augmentation methods are applied stochastically, we report the average values of 3 repeated experiments in Table I.
PointPillars [lang2019pointpillars] uses two separate networks for the Car and Cyclist/Pedestrian classes. We use a batch size of 2 for the Car network and 1 for the Cyc/Ped network, and we train for 160 epochs for the Car network and 80 epochs for the Cyc/Ped network. PV-RCNN [shi2020pv] uses a single network for all classes. We train this network with a batch size of 8 for 80 epochs. The parameters of the proposed augmentation methods are shown in Table II. The values to the left of ‘/’ are the parameters of the Car network, and the values to the right are the parameters of the Cyc/Ped network. Basic data augmentations such as ground-truth oversampling [yan2018second], rotation, translation, and flipping are applied before our partition-based augmentations. For other parameters not mentioned, the settings of each original paper are used.
Results Table I shows the effect of each partition-based augmentation method and of PA-AUG. In most cases, each proposed augmentation method on its own performed better than the baseline algorithms without our data augmentation (PointPillars [lang2019pointpillars] and PV-RCNN [shi2020pv]). We found that each operation brings its largest gains in different cases. For example, Dropout does not improve the Easy score of Car much, but it does improve the Mod. and Hard scores. The other operations, on the contrary, increase the Easy score much more than the Mod. and Hard scores. For the Cyc/Ped network, the Mix operation achieves the highest scores for the Pedestrian class, but its scores for the Cyclist class are remarkably low. Interestingly, PA-AUG achieves the highest average improvement by improving all scores evenly, which means that the proposed partition-based augmentations have a synergistic effect when used together. PA-AUG also improves all the scores of PV-RCNN [shi2020pv], one of the current state-of-the-art LiDAR-based detectors.
IV-B Robustness Test
Settings We evaluate the robustness of our proposed augmentations using three corrupted KITTI-val datasets: KITTI-D, KITTI-S, and KITTI-J. KITTI-D (Dropout) is a dataset in which some of the foreground points are removed by dropping out a portion of every object. For fairness, the dropped region is not chosen in the same way as in our proposed dropout augmentation; instead, a random dense area with many points is selected. KITTI-S (Sparse) is a dataset that keeps only 30% of the points, selected using Farthest Point Sampling (FPS) across the point cloud. Finally, KITTI-J (Jittering) is a dataset in which Gaussian noise is added to all points. The corrupted datasets are designed to simulate real scenarios in which occlusion is severe, the LiDAR has a low resolution, or the LiDAR measurements are inaccurate. Some examples with detection results are shown in Fig. 3.
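Two of these corruptions can be sketched in a few lines of NumPy. The sketch assumes a non-empty `(N, 3)` array of coordinates; the noise scale `sigma` is a hypothetical parameter, since the value used for KITTI-J is not stated here, and the function names are illustrative.

```python
import numpy as np

def jitter_points(points, sigma, rng):
    """KITTI-J style corruption: add zero-mean Gaussian noise to every point."""
    return points + rng.normal(0.0, sigma, size=points.shape)

def fps_subsample(points, keep_ratio=0.3):
    """KITTI-S style corruption: keep a fraction of the points via greedy FPS."""
    n = len(points)
    k = max(1, int(round(keep_ratio * n)))
    chosen = [0]
    # Distance from every point to the nearest already-chosen point.
    dist = np.linalg.norm(points - points[0], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[np.array(chosen)]
```

This greedy FPS costs on the order of N times k distance evaluations, so for a full LiDAR sweep a faster implementation would be needed in practice.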
Results Table III reports the 3D AP (IoU=0.7) scores on KITTI-val and its corrupted versions. The values in parentheses are the performance decrease on each corrupted dataset relative to KITTI-val (leftmost). The best performance on each dataset is shown in bold. The Dropout, Swap, Mix, and Sparse operations all showed a smaller performance decrease than the baseline on the KITTI-D and KITTI-S datasets. However, their performance decreased significantly on the KITTI-J dataset. The Noise operation, on the other hand, showed a smaller decrease than the baseline on every corrupted dataset. PA-AUG takes advantage of each operation evenly and shows the most robust performance on the corrupted datasets. Some qualitative results are shown in Fig. 3.
|Method||KITTI-val 3D AP|
IV-C Partitioning Method
To verify the need for the part-aware partitioning method, we randomly create partitions without part information. The random partitions have the same number and the same direction as the part-aware partitions, but their scales and positions are randomly generated. We apply the proposed partition-based augmentations in the same way to the random partitions and to the part-aware partitions. As shown in Table IV, augmentation based on random partitions yields a significantly smaller performance improvement than augmentation based on part-aware partitions for all classes. This result shows that part information plays a critical role in applying the proposed partition-based augmentations.
IV-D Data Efficiency
In this section, we conduct experiments to determine how PA-AUG performs when very little data is available. We downsample the KITTI training set, taking subsets with 20%, 40%, 60%, and 80% of the training examples. The PointPillars model [lang2019pointpillars] with the parameters in Table II is used to compare 3D AP scores on the KITTI-val dataset.
The green and orange dots in Fig. 4 show the performance of PA-AUG on the full dataset and the four subsets for the Car and Pedestrian categories, respectively. The cyan and yellow dots show the corresponding baseline results for Car and Pedestrian. In these experiments, no data augmentation other than PA-AUG is used, in order to verify the effectiveness of PA-AUG alone. PA-AUG not only improves on the baselines but is also data-efficient. Using only 40% of the examples, PA-AUG achieves a 3D AP comparable to that of the baselines trained on the full dataset for both Car and Pedestrian. That is, PA-AUG is about 2.5 times more data-efficient.
We notice that the performance of PA-AUG drops more steeply than that of the baseline as the size of the dataset decreases. This is due to the relative lack of information about the original objects under PA-AUG: modified and augmented data are provided in a regime where the original data itself is already highly insufficient. The drop may become less steep when smaller augmentation parameters are applied. Even so, PA-AUG outperforms the baseline on the full dataset and on all subsets, since its improvement is much larger than this effect.
We have presented PA-AUG, which makes better use of the 3D information in point clouds than conventional methods. We divide objects into 4 or 8 partitions according to intra-object part location and apply five separate augmentation methods, which can be used simultaneously, in a partition-based way. The proposed data augmentation methods can be applied to any architecture, and PA-AUG further improves PointPillars [lang2019pointpillars] and PV-RCNN [shi2020pv], one of the SOTA detectors on the KITTI dataset. PA-AUG outperforms augmentation with random partitions, demonstrating the effectiveness of part-aware augmentation that utilizes 3D information. Experimental results show that PA-AUG improves robustness to corrupted data and enhances data efficiency. Because of the generality of the proposed methods, we believe they can be used in any task that utilizes 3D point clouds, such as semantic segmentation and object tracking.