Point clouds, a commonly used 3D data representation, are widely generated, collected and processed in many important computer vision applications, such as autonomous driving, AR/VR and remote sensing. Motivated by the unprecedented success and popularity of deep neural networks (DNNs) in 2D image processing, both academia and industry are now actively investigating high-performance DNN-based solutions for efficient 3D point cloud processing.
However, unlike a spatially regular image, a point cloud is essentially an unordered set of vectors, and is therefore inherently invariant to the permutation of its member points. This important and unique characteristic of point cloud data, if not properly considered, would significantly limit the effectiveness of DNNs. To address this challenge, some earlier efforts [wu20153d, maturana2015voxnet] propose to first convert the irregular point cloud to a regular volumetric representation, and then use standard 3D convolutional neural networks (CNNs) for back-end processing. Such an intermediate representation-based strategy, though it allows convenient reuse of existing DNN models, suffers from high memory usage as well as inevitable quantization artifacts. Therefore, pioneered by PointNet [qi2017pointnet], directly consuming the raw point cloud data has become the preferred solution. To date, many different DNN architectures (e.g., PointNet [qi2017pointnet], PointNet++ [qi2017pointnet++], DGCNN [wang2019dynamic]) have been proposed to process 3D point clouds without extra voxelization or data transformation, and some of them demonstrate state-of-the-art performance in various point cloud processing tasks.
Regularization for Point Cloud: Importance & Benefits. These recent architecture-level innovations in the point cloud domain show the large benefits brought by advances in DNN architecture design. On the other hand, regularization, another important design strategy that has significantly promoted the development of deep learning in the image domain, is little exploited in DNN-based point cloud processing. In principle, regularization, if performed properly, can potentially provide even larger performance improvement in the point cloud domain than it has in the image domain. As pointed out in [lee2021regularization], 3D point cloud datasets are usually much smaller and less diverse than 2D image datasets (e.g., 40-category 10K-model ModelNet40 vs 1000-class 1000K-image ImageNet). Therefore, compared with their counterparts for image processing, DNN models for point cloud processing are typically more prone to overfitting and generalize less well. These challenges, fortunately, are exactly what regularization can effectively alleviate.
Noise Injection-based Regularization for Point Cloud: Uncharted Territory. Despite its promising potential benefits, regularization has not been thoroughly studied in the point cloud domain. To date, only very few works explore this direction, and most of them [lee2021regularization, 10.1007/978-3-030-58580-8_20] focus on data augmentation-based regularization, which essentially manipulates the raw input points. In contrast, noise injection-based regularization, e.g., Dropout, DropBlock and DropPath [srivastava2014dropout, ghiasi2018dropblock, cai2019effective], an important regularization strategy that has been widely adopted in image-domain DNNs, is very little studied in point cloud processing. Currently only the conventional Dropout [srivastava2014dropout], which randomly drops part of the 1D activation outputs, is straightforwardly used on the back-end last multilayer perceptron (MLP) of point cloud-domain DNNs. Surprisingly, how to properly drop the information-rich high-dimensional point features generated by the front-end point processing layers (e.g., shared MLPs in PointNet and EdgeConv in DGCNN), which is potentially much more important for alleviating overfitting and improving model generality, has not been investigated or reported in the existing literature.
Technical Preview and Contributions. In this paper, we, for the first time, systematically analyze and investigate noise injection-based regularization for point cloud processing, and propose a series of regularization techniques for point cloud-domain DNNs. To be specific, we propose three different regularization solutions, namely DropFeat, DropPoint and DropCluster, to drop information of the point features of a DNN at the feature level, point level and cluster level, respectively. Similar to their counterparts in the image domain, these point cloud-domain noise injection approaches are easy to implement. More importantly, they can be used as convenient plug-ins to improve various backbone DNN architectures for various classification/segmentation tasks. Overall, the contributions of this paper are summarized as follows:
We systematically investigate noise injection-based regularization for point cloud processing, and propose to regularize point cloud-domain DNN models at three different levels (feature, point and cluster). Our three proposed regularization solutions (DropFeat, DropPoint and DropCluster) are easy to implement, and they are general and effective for different DNN models on different datasets. To the best of our knowledge, this is the first work that comprehensively and systematically studies noise injection-based regularization in the point cloud domain.
We empirically analyze the impacts of different dropping factors on the regularization performance. Based on our ablation study for the dropping rate, the cluster size and the dropping positions, we obtain useful insights and general guidelines that facilitate the deployment of our regularization techniques across different datasets and different DNN architectures.
We perform extensive experiments for various DNN models in various tasks. On the ModelNet40 shape classification dataset, DropCluster enables , and overall accuracy increase for PointNet, PointNet++ and DGCNN, respectively. In the part segmentation task, DropCluster brings , and mean IoU increase for PointNet, PointNet++ and DGCNN, respectively, on the ShapeNet dataset. On the S3DIS semantic segmentation dataset, DropCluster increases the mean IoU of PointNet, PointNet++ and DGCNN by , and , respectively. Also, the overall accuracy of these three models is improved by , and , respectively.
2 Related Work
Deep Learning on Point Clouds. Conventional DNN models are designed to process input data with regular structure. To adapt deep learning to tasks on irregular point cloud data, a simple solution is to voxelize the point cloud into a volumetric representation, which can then be processed by various well-established 3D CNNs [liu2019point, zhou2018voxelnet, ben20173d]. A major drawback of this intermediate representation-based strategy is the high memory cost incurred by the voxelization. This approach is also limited by inevitable quantization artifacts. Multiview-based solutions [su2015multi, yu2018multi, yang2019learning, qi2016volumetric, wang2019dominant] render and project the 3D point clouds to multiple 2D images, and then apply well-engineered 2D CNNs to perform the corresponding classification and segmentation tasks. Such a projection-based strategy, by its nature, cannot fully preserve the rich geometric information of the point cloud, and hence it is far from an ideal solution. PointNet [qi2017pointnet] is the pioneering DNN model that can directly consume the raw point cloud input. By introducing a simple symmetric function to accumulate the features, PointNet preserves the permutation invariance of point cloud data very efficiently. Since then, many architecture-level innovations, such as PointNet++ [qi2017pointnet++] and DGCNN [wang2019dynamic], have been proposed to efficiently exploit and capture the local geometric structure. This recent progress further brings state-of-the-art performance in various point cloud processing tasks.
Noise Injection-based Regularization for Image-domain DNN. Noise injection-based regularization has been widely used in image-domain DNN training to alleviate the overfitting problem. Dropout [srivastava2014dropout], as the first work that drops some information/features during the training procedure, successfully demonstrates the large benefits of this methodology. Since then, many follow-up variants, including but not limited to DropPath [cai2019effective], DropBlock [ghiasi2018dropblock], ZoneOut [krueger2016zoneout], CutOut [devries2017improved] and Variational Dropout [kingma2015variational], have been proposed and applied to different components of DNN training (e.g., filter channel and feature map) and different DNN model types (e.g., CNN and RNN). According to their extensive experiments, many dropping-related factors, such as rate, schedule and position, have significant impacts on the overall regularization performance.
Regularization in Point Cloud Processing. Unlike their counterparts in image processing, regularization techniques are rarely studied for point cloud processing. To date, most existing point cloud-oriented regularization works focus on data augmentation. The two most recent advances along this direction are PointMixUp [10.1007/978-3-030-58580-8_20] and RSMix [lee2021regularization], which propose to generate new virtual examples via structure-preserving linear interpolation to enhance model generality. On the other hand, noise injection-based regularization, such as dropping certain information and features during the training procedure, is even less explored in the point cloud domain. Though Dropout has been commonly used in modern point cloud-domain DNNs (e.g., PointNet, PointNet++ and DGCNN), it is just a straightforward adoption in the last MLP of those models; injecting noise into the much more important point feature maps generated by the front-end processing layers has, to the best of our knowledge, not been investigated before.
3 Background and Preliminaries
3.1 Basics of DNNs for Point Cloud Processing
Point Cloud. A point cloud, as a set of unordered points, can be represented as , where each point = contains both geometric position and feature information . To be specific, is the 3D coordinates, and represents the corresponding -dimensional feature. Notice that the values of may vary for different data formats. For instance, for the plain black and the RGB-based colorful point cloud data, is set as 0 and 3, respectively.
Processing Points: Independence Style. Depending on the network architecture, point cloud data can be processed in DNNs in different styles. PointNet [qi2017pointnet] proposes to first operate on each point independently to achieve the important permutation invariance, and then use max pooling and an MLP to aggregate all the extracted individual point features. In general, this type of independent processing style is essentially a set function that maps a point set to a vector as:
where is the overall mapping function of the front-end layers of the DNN model. After each point has been independently processed by those front-end layers, the corresponding vectorized outputs are aggregated by a max pooling function to form a vector-format global feature. This global feature is then sent to the back-end multilayer perceptron (MLP) to learn the desired-dimensional global point cloud signature (see Figure 1).
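The independence style described above can be illustrated with a minimal sketch. The per-point function `point_feature` below is a hypothetical hand-written stand-in for the learned shared front-end layers, and the back-end MLP is omitted; the only point being made is that a shared per-point function followed by channel-wise max pooling gives a result that does not depend on the order of the points.

```python
def point_feature(point):
    """Hypothetical per-point function standing in for the shared front-end layers."""
    x, y, z = point
    return [x + y + z, x * x + y * y + z * z, max(x, y, z)]


def max_pool(features):
    """Channel-wise max over all points: a symmetric aggregation."""
    return [max(channel) for channel in zip(*features)]


def global_feature(points):
    """Apply the same function to every point, then aggregate by max pooling."""
    return max_pool([point_feature(p) for p in points])
```

Because `max_pool` is symmetric in its inputs, calling `global_feature` on any reordering of the same points returns the same vector.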
Processing Points: Aggregation Style. As indicated in [qi2017pointnet, qi2017pointnet++], processing each point independently largely neglects the geometric relationship among different points. Therefore, the state-of-the-art point cloud-domain DNNs, e.g., DGCNN and PointCNN [wang2019dynamic, li2018pointcnn], explicitly process and aggregate the information of each point and its neighbors. Consequently, the set function described above can be further generalized as:
Here we assume the model requires times of point feature updates before the max pooling operation. For the -th point feature update, is the update function, and is the corresponding point feature map for the entire points associated with this update, where each row represents the -dimensional feature for the -th point. For the raw point cloud input, with and (see Figure 1).
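As a rough illustration of one aggregation-style point feature update, the sketch below lets each point gather information from its nearest neighbors. It is loosely modeled on an EdgeConv-like update but is an assumption made for illustration only: the learned layers are omitted, and the names `knn_indices` and `update_features` are hypothetical.

```python
import math


def knn_indices(coords, i, k):
    """Indices of the k nearest neighbors of point i (excluding i itself)."""
    others = (j for j in range(len(coords)) if j != i)
    return sorted(others, key=lambda j: math.dist(coords[i], coords[j]))[:k]


def update_features(coords, feats, k):
    """One feature update: for each point i, build [f_i, f_j - f_i] for every
    neighbor j and take the channel-wise max over the neighbors."""
    new_feats = []
    for i, f_i in enumerate(feats):
        edges = [list(f_i) + [b - a for a, b in zip(f_i, feats[j])]
                 for j in knn_indices(coords, i, k)]
        new_feats.append([max(channel) for channel in zip(*edges)])
    return new_feats
```

Applying `update_features` repeatedly, then max pooling over all points, mirrors the sequence of point feature updates followed by global aggregation described in the text.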
3.2 Existing Point Cloud-domain Dropout
As mentioned in Section 1, noise injection-based regularization is rarely studied in the point cloud domain. Currently this strategy is applied at only two simple stages of the entire point cloud processing pipeline.
Dropout on the Back-end MLP. State-of-the-art point cloud-domain DNNs are typically equipped with a back-end MLP to learn the desired global point cloud signature from the global point feature (see Figure 1). Considering that the most conventional use of Dropout in the image domain is on the MLP, most point cloud-domain DNNs naturally also perform the Dropout operation on their back-end MLPs. Hence the optimization objective of the entire DNN, e.g., for shape classification task, can be formulated as:
where is the vectorized ground truth and denotes squared Frobenius norm. is an diagonal matrix with as the diagonal. , as the -th entry of , is i.i.d. Bernoulli with dropping rate .
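As a hedged illustration, the objective described above can be written in the following form. All symbols are assumptions introduced here rather than the paper's notation: $\mathbf{y}$ is the vectorized ground truth, $\mathbf{z}$ the max-pooled global feature of length $m$, $g_{\theta}$ the back-end MLP with parameters $\theta$, $\mathbf{D}$ the diagonal Dropout mask, and $p$ the dropping rate, so that each $d_i$ equals 0 with probability $p$. The exact position of the mask inside the MLP may differ in practice.

```latex
\min_{\theta}\; \bigl\lVert \mathbf{y} - g_{\theta}\!\left(\mathbf{D}\,\mathbf{z}\right) \bigr\rVert_F^2,
\qquad
\mathbf{D} = \operatorname{diag}(d_1, \dots, d_m),
\qquad
d_i \overset{\text{i.i.d.}}{\sim} \operatorname{Bernoulli}(1 - p)
```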
Dropout on the Raw Input Points. In PointNet++ [qi2017pointnet++], Dropout is also performed on the raw input point cloud data. As illustrated in Figure 1, the input points are randomly dropped with various densities in different training epochs. From the perspective of regularization taxonomy, such a random dropping operation on the input data is essentially a type of data augmentation method. Hence the optimization objective of the DNN (e.g., for shape classification task) is:
where is an diagonal matrix with . is raw point cloud inputs, and diagonal matrix randomly zeros a subset of the input points.
4 Our Methods
As formulated in Eqs. 3 and 4, the current point cloud-domain dropping operations perform Dropout only at the very early stage (e.g., on the raw input data) or at the very late stage (e.g., on the back-end last MLP) of the entire processing pipeline. However, these two existing approaches do not involve the main body of the DNNs: the front-end processing layers such as the shared MLPs in PointNet and EdgeConv in DGCNN. From the perspective of feature learning, these front-end layers play a critical role in high-performance point cloud processing: they are in charge of preserving, extracting and learning the important multi-level point features in both Euclidean and semantic space. Neglecting this regularization opportunity is likely to limit the performance of DNN models in the point cloud domain.
Motivated by this observation, we propose to systematically investigate the noise injection-based regularization strategy on the important point features. To be specific, we propose three different regularization techniques, namely DropFeat, DropPoint and DropCluster, aiming to inject noise at the feature, point and cluster level, respectively.
4.2 DropFeat: Drop the Features of Points
In the image domain, Dropout is performed on the activation map by randomly dropping at the pixel level. Following a similar principle, we propose to randomly drop some features from the entire point feature map . This feature-level dropping strategy, namely DropFeat, is illustrated in Figure 1. To be specific, for an point feature map where each row represents the -dimensional feature belonging to one individual point, DropFeat randomly zeros out some entries of . In other words, part (instead of all) of the feature information of part (instead of all) of the points is removed. In this scenario, the global distribution of the entire point set is still preserved, thereby increasing the generality of the trained models without sacrificing task performance. Hence the optimization objective of the DNN models, e.g., for shape classification task, can be formulated as follows:
Here is the Hadamard product and is the dropping mask. is placed after the -th point feature update, and is the pre-set dropping rate.
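A minimal sketch of the feature-level dropping described above follows. The function name `drop_feat`, its arguments, and the list-of-lists layout of the point feature map (one row per point) are assumptions for illustration. Rescaling the surviving entries, as some Dropout implementations do, is not described in the text and is therefore left out.

```python
import random


def drop_feat(feature_map, drop_rate, rng=random):
    """Zero each entry of a point feature map (one row per point, one column
    per channel) independently with probability drop_rate. Training-time only."""
    return [[0.0 if rng.random() < drop_rate else value for value in row]
            for row in feature_map]
```

The element-wise comparison plays the role of the binary dropping mask that is multiplied entry by entry with the feature map.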
4.3 DropPoint: Drop the Points with the Features
As described above, DropFeat only drops part of feature information for part of the points during point feature update. From the perspective of regularization, this type of noise injection is relatively conservative. Recall that in PointNet++, some raw input points, which are associated with -dimensional features, can be entirely dropped in a random way. Such point-wise dropping, surprisingly, can further enhance the generality and robustness of the trained models. In fact, as long as the dropping is performed in a uniform way, such point-wise dropping operation will only make the point sets uniformly sparser – the critical local and global geometric information can still be preserved, thereby retaining or even improving model performance.
Inspired by this phenomenon, we propose DropPoint, a regularization strategy that performs a random point-level dropping operation on the point features. Figure 1 illustrates the key idea of DropPoint. For an point feature map , DropPoint randomly zeros out some entire rows, where each row represents one individual point containing -dimensional feature. In other words, all the associated feature information belonging to the randomly selected points is dropped. In general, the optimization objective of the DropPoint-regularized DNN models, e.g., for shape classification task, can be formulated as follows:
Here is the diagonal matrix spanned by , while is the pre-set dropping rate.
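The point-level dropping can be sketched in the same way. The name `drop_point` and the list-of-lists layout are again assumptions for illustration; each row of the feature map is either kept intact or zeroed as a whole, which corresponds to multiplying the feature map by a diagonal binary matrix.

```python
import random


def drop_point(feature_map, drop_rate, rng=random):
    """Zero whole rows (points) of a point feature map, each independently
    with probability drop_rate. Training-time only."""
    dropped = []
    for row in feature_map:
        keep = rng.random() >= drop_rate
        dropped.append(list(row) if keep else [0.0] * len(row))
    return dropped
```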
4.4 DropCluster: Drop the Clusters of Points
Beyond feature-level dropping (DropFeat) and point-level dropping (DropPoint), we further explore the possibility of a more aggressive dropping strategy at a higher level. To be specific, we propose DropCluster, a technique that performs a random dropping operation on clusters of neighboring points. Historically, such a neighborhood-aware dropping strategy originates from DropBlock [ghiasi2018dropblock], which bounds and drops contiguous regions of the feature map in the image domain. As [ghiasi2018dropblock] indicates, contiguous region-free dropping (e.g., Dropout) is not effective in removing semantic information due to the spatial correlation among nearby activations. Instead, dropping a contiguous region of the feature map can effectively remove the semantic information in the correlated area, and hence the models are forced to learn more representative features.
Inspired by DropBlock, we believe dropping the neighboring points in the point cloud domain can bring similar benefits: nearby points typically contain closely related information, hence a neighborhood-aware dropping strategy can effectively remove certain geometric and semantic information, thereby pushing the DNN models to enhance their capability for feature learning. Figure 2 illustrates the main mechanism of DropCluster. Here some points in the point feature map are first randomly selected. With each of those points as the centroid, multiple clusters of points, which contain the selected points and their neighboring points, are then dropped during training.
As a neighborhood-aware dropping strategy, the performance of DropCluster highly depends on two factors: distance calculation and neighboring point selection.
Distance Calculation. Considering that the raw point cloud input data is defined in Euclidean space, DropCluster directly utilizes the 3D coordinates of the input points to calculate spatial distances. Notice that for some DNN models such as DGCNN, this computation can even be skipped, since the distance calculation has already been done in their K-Nearest Neighbors (KNN) step.
Neighboring Point Selection. To determine and adjust the size of the dropped region, DropCluster introduces a hyperparameter to determine the number of the selected points in one cluster. To be specific, once a point is randomly selected, its nearest points, together with their associated features, will be dropped as well. Notice that since points are now dropped simultaneously for each cluster, the sampling rate for the centroid points becomes instead of . In general, the optimization objective of the DropCluster-regularized DNN models, e.g., for shape classification task, can be formulated as follows:
Here is an diagonal matrix spanned by , and is an -length binary vector to record the index of the sampled centroid point. Each entry of , where 0 denotes the corresponding entry is the sampled centroid. The function returns the set of nearest neighbors’ indices of the point . Notice that when , DropCluster converges to DropPoint.
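The cluster-level dropping can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the name `drop_cluster` and its arguments are hypothetical, each point is sampled as a centroid with probability `drop_rate / k` (our reading of the adjusted centroid sampling rate), and a cluster consists of the centroid plus its `k - 1` nearest neighbors in 3D Euclidean distance, so that `k = 1` reduces to point-level dropping.

```python
import math
import random


def drop_cluster(coords, feature_map, drop_rate, k, rng=random):
    """Zero the feature rows of randomly chosen clusters of neighboring points.

    coords: list of (x, y, z) tuples; feature_map: one feature row per point.
    """
    n = len(coords)
    dropped = set()
    for i in range(n):
        if rng.random() < drop_rate / k:  # point i is sampled as a centroid
            others = sorted((j for j in range(n) if j != i),
                            key=lambda j: math.dist(coords[i], coords[j]))
            dropped.update([i] + others[:k - 1])
    return [[0.0] * len(row) if i in dropped else list(row)
            for i, row in enumerate(feature_map)]
```

Clusters sampled around nearby centroids can overlap, so the realized fraction of dropped points may be somewhat below `drop_rate`. The brute-force neighbor search is used only for clarity; a model that already computes a KNN graph could reuse it.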
5.1 Experimental Setup
Dataset. We evaluate our methods (DropFeat, DropPoint and DropCluster) on three 3D point cloud datasets: ModelNet40 [wu20153d], the ShapeNet part dataset [yi2016scalable] and the Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS) [armeni20163d], for shape classification, part segmentation and semantic segmentation tasks, respectively. To be specific, for ModelNet40, we use 9843 models for training and 2468 models for testing. In each model, 1024 points are sampled and rescaled into the unit sphere. For ShapeNet, we use 14006 shapes for training and 2874 shapes for testing. In each shape, 2048 points are sampled and labeled with at most five parts. For S3DIS, we follow the same 6-fold strategy used in PointNet. For each example in S3DIS, 4096 points are sampled and each input point is represented as a 9D feature vector (XYZ, RGB and normalized spatial coordinates).
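The rescaling into the unit sphere mentioned above can be sketched as below. Centering on the centroid before scaling is an assumption (a common convention, not stated in the text), and the function name `normalize_to_unit_sphere` is hypothetical.

```python
import math


def normalize_to_unit_sphere(points):
    """Center a point cloud on its centroid and scale it so that the
    farthest point lies on the unit sphere."""
    n = len(points)
    centroid = [sum(p[a] for p in points) / n for a in range(3)]
    centered = [[p[a] - centroid[a] for a in range(3)] for p in points]
    radius = max(math.sqrt(sum(c * c for c in p)) for p in centered)
    if radius == 0.0:  # degenerate cloud: all points coincide
        return centered
    return [[c / radius for c in p] for p in centered]
```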
Network Architecture. We select three popular backbone networks, PointNet, PointNet++ with multi-scale grouping (MSG) and DGCNN, to demonstrate the generality of our proposed noise injection-based regularization approaches. All three networks are evaluated on the ModelNet40, ShapeNet and S3DIS datasets.
We conduct our experiments on Nvidia TITAN RTX GPUs using the PyTorch framework. For all the evaluations of PointNet and PointNet++, we use the Adam optimizer with an initial learning rate of 0.001. For the evaluations of DGCNN, we adopt the SGD optimizer with an initial learning rate of 0.1 and a momentum of 0.9. The batch size is set from 16 to 32 across different tasks.
5.2 Ablation Study for , and Dropping Positions
As analyzed in Section 4, the performance of our proposed noise injection-based regularization approaches is highly determined by three key factors: the dropping rate , the positions where the dropping is applied, and the cluster size (for DropCluster). To study their individual impact on the DNN performance, we perform an ablation study on an example experimental setting: applying DropCluster on DGCNN for the shape classification task (ModelNet40 dataset).
The Dropping Rate . We first investigate the impact of the overall dropping rate . Figure 3(a) shows the test accuracy of DropCluster-regularized DGCNN on ModelNet40 with respect to different values of . Here the cluster size is set as 20. It is seen that when the test accuracy reaches its maximum ( higher than the non-dropping case), and then it quickly decreases with higher dropping rates. This phenomenon indicates that, unlike the case in the image domain (e.g., dropping rate=0.5 for Dropout), a smaller dropping rate is preferred in the point cloud domain. We hypothesize that this trend may be caused by the relatively small number of points in the point cloud data: typically the point feature map only has points, while the feature map in the image domain can have tens of thousands of activation values. A too aggressive dropping strategy may therefore negatively affect the overall performance.
The Cluster Size . Figure 3(b) shows the overall test accuracy of DGCNN with respect to different cluster sizes. Here the overall dropping rate is set as . It is seen that the test accuracy significantly increases when , peaks when , and then decreases with larger . This phenomenon indicates that a proper setting of in the medium range is very important to achieve a "sweet spot": dropping either too few or too many neighboring points would impair the capability of learning geometric information. Also notice that the accuracy with , which just means using DropPoint, is lower than in many cases. This verifies our prior hypothesis that dropping nearby points together with proper dropping rates is more effective.
Where to Drop. We further study where the dropping operation should be applied in the DNN models to achieve the best performance. As shown in Table 1, we apply DropCluster at different positions of DGCNN. Because the DGCNN classification model has 4 EdgeConv blocks, namely EdgeConv 1, 2, 3 and 4 in Table 1, we evaluate different combinations when applying DropCluster on the point feature maps output from those EdgeConv blocks. From this table we find that it is better to apply the dropping operation at the earlier stages rather than the later stages of DNN models. We hypothesize that this is because the point features extracted in the earlier stages tend to represent lower-level features that are more noise-tolerant, while the point features extracted in the later stages are higher-level condensed features that are sensitive to noise injection.
|Data Augment-based Regularization||Noise Injection-based Regularization|
|RSMix [lee2021regularization]||PointMixup [10.1007/978-3-030-58580-8_20]||DropFeat||DropPoint||DropCluster|
|Mean IoU||Aero||Bag||Cap||Car||Chair||Ear Phone||Guitar||Knife||Lamp||Laptop||Motor||Mug||Pistol||Rocket||Skate Board||Table|
Performance of different regularization approaches on the ShapeNet part segmentation dataset. The evaluation metric is the mean IoU (%). Notice that RSMix and PointMixup do not report performance on the part segmentation task.
5.3 Shape Classification on ModelNet40 Dataset
Settings. Based on the analysis in the ablation study, we set the overall dropping rate for DropFeat, DropPoint and DropCluster on all the shape classification experiments. Also, we set the cluster size for DropCluster when regularizing DGCNN and PointNet. For PointNet++, since it hierarchically samples a portion of the input point cloud, we set to fit the reduced number of points. Meanwhile, we apply DropFeat, DropPoint and DropCluster on the point feature maps output from the first two EdgeConv layers and the first two set abstraction blocks of DGCNN and PointNet++, respectively. For PointNet, the dropping operation is performed on the point feature maps of the first shared linear layers with 64 output channels.
Results. Table 2 shows the overall accuracy of three backbone networks regularized by different approaches on ModelNet40. Compared with the state-of-the-art data augmentation-based regularization (RSMix [lee2021regularization] and PointMixup [10.1007/978-3-030-58580-8_20]), our proposed noise injection-based regularization approaches show the competitive (DropFeat and DropPoint) and better (DropCluster) performance. In particular, DropCluster consistently outperforms RSMix and PointMixup on all the three models. To be specific, DropCluster achieves , and overall accuracy improvement on PointNet, PointNet++ and DGCNN, respectively.
5.4 Part Segmentation on ShapeNet Dataset
Settings. For part segmentation task we still use the dropping rate for all the dropping methods on all the backbone networks. Also, the dropping operations are performed at the same positions as described in the shape classification task. Since the inputs of part segmentation models are 2048-point data, we adjust cluster size for PointNet and DGCNN, and for PointNet++.
Results. Table 3 shows the performance of the regularized backbone networks on ShapeNet dataset in terms of mean Intersection-over-union (IoU). It is seen that all our proposed three dropping approaches improve the mean IoU for part segmentation task. In particular, DropCluster brings , and mean IoU increase for PointNet, PointNet++, and DGCNN, respectively. Notice that RSMix and PointMixup do not report performance on part segmentation task. Also, we show the qualitative part segmentation results in Figure 4.
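For readers less familiar with the metric, a minimal sketch of mean IoU over per-point labels is given below. It uses one common convention, namely averaging the IoU over the classes that appear in either the prediction or the ground truth; the exact averaging protocol used for the reported numbers (for example per shape or per category) is not specified here, so this should be read as an assumption. The name `mean_iou` is hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union of two per-point label sequences,
    averaged over the classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```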
|Mean IoU()||Overall Accuracy()|
5.5 Semantic Segmentation on S3DIS Dataset
Setting. Considering that DropCluster consistently outperforms DropFeat and DropPoint in the shape classification and part segmentation tasks, in the semantic segmentation task we only evaluate the performance of DropCluster regularization. Here we adopt the same dropping rate () and dropping positions that are used in the shape classification and part segmentation experiments. In addition, we use the same cluster size settings that are used in shape classification ( for PointNet and DGCNN and for PointNet++). This is because, though each S3DIS input example has a large number of points, the average number of points for each semantic object is relatively small, since several objects belonging to different categories exist in one room.
Results. Table 4 shows the performance of DropCluster-regularized backbone networks on the S3DIS dataset in terms of both the mean IoU and overall accuracy. It is seen that using DropCluster brings , , mean IoU improvement on PointNet, PointNet++ and DGCNN, respectively. For overall accuracy, DropCluster enables , , increase for PointNet, PointNet++ and DGCNN, respectively. Notice that RSMix and PointMixup do not report performance on the semantic segmentation task. We also show the qualitative semantic segmentation results in Figure 5.
In this paper, we present a systematic investigation of noise injection-based regularization for point cloud processing. We develop a series of techniques to inject noise at different levels of the point feature maps of DNN models. Experimental results show that our proposed approaches bring significant performance improvement across different DNN models for different point cloud processing tasks.