We present PointAugment, a new auto-augmentation framework that automatically optimizes and augments point cloud samples to enrich data diversity when training a classification network. Different from existing auto-augmentation methods for 2D images, PointAugment is sample-aware and adopts an adversarial learning strategy to jointly optimize an augmentor network and a classifier network, such that the augmentor learns to produce augmented samples that best fit the classifier. Moreover, we formulate a learnable point-augmentation function with a shape-wise transformation and a point-wise displacement, and carefully design loss functions to adapt the augmented samples to the learning progress of the classifier. Extensive experiments confirm PointAugment's effectiveness and robustness in improving the performance of various networks on shape classification and retrieval.
In recent years, there has been growing interest in developing deep neural networks [20, 21, 33, 16, 15] for 3D point clouds. Robustly training a network often relies on the availability and diversity of data. However, unlike 2D image benchmarks such as ImageNet and the MS COCO dataset, which contain millions of training samples, 3D datasets are typically much smaller, with relatively few labels and limited diversity. For instance, ModelNet40, one of the most commonly-used benchmarks for 3D point cloud classification, has only 12,311 models across 40 categories. The limited data quantity and diversity may cause overfitting and degrade the generalization ability of the network.
Nowadays, data augmentation (DA) is a common strategy to avoid overfitting and improve network generalization by artificially enlarging the quantity and diversity of the training samples. For 3D point clouds, due to the limited amount of training data and the immense augmentation space in 3D, conventional DA strategies [20, 21] often simply perturb the input point cloud randomly within a small, fixed, pre-defined range to maintain the class label. Despite its effectiveness for existing classification networks, this conventional DA approach may lead to insufficient training, as summarized below.
First, existing methods regard network training and DA as two independent phases, without jointly optimizing them, e.g., by feeding the training results back to enhance the DA. Hence, the trained network could be suboptimal. Second, existing methods apply the same fixed augmentation process, with rotation, scaling, and/or jittering, to all input point cloud samples, ignoring the shape complexity of each sample: a sphere remains the same no matter how we rotate it, but a complex shape may need larger rotations. Hence, conventional DA may be redundant or insufficient for augmenting the training samples.
To improve the augmentation of point cloud samples, we formulate PointAugment, a new auto-augmentation framework for 3D point clouds, and demonstrate its effectiveness for shape classification; see Figure 1. Different from previous works for 2D images, PointAugment learns to produce augmentation functions specific to individual samples. Further, the learnable augmentation function considers both shape-wise transformation and point-wise displacement, which relate to the characteristics of 3D point cloud samples. Also, PointAugment jointly optimizes the augmentation process with the network training, via an adversarial learning strategy that trains the augmentation network (augmentor) together with the classification network (classifier) in an end-to-end manner. By taking the classifier losses as feedback, the augmentor can learn to enrich the input samples by enlarging the intra-class data variations, while the classifier can learn to combat this by extracting insensitive features. Benefiting from such adversarial learning, the augmentor can learn to generate augmented samples that best fit the classifier at different stages of the training, thus maximizing the capability of the classifier.
As the first attempt to explore auto-augmentation for 3D point clouds, we show that replacing conventional DA with PointAugment yields clear improvements in shape classification on the ModelNet40 (see Figure 1) and SHREC16 (see Section 5) datasets for four representative networks: PointNet, PointNet++, RSCNN, and DGCNN. Also, we demonstrate the effectiveness of PointAugment on shape retrieval and evaluate its robustness, loss configuration, and modularization design. More results are presented in Section 5.
Data augmentation on images. Training data plays a crucial role in enabling deep neural networks to learn to perform tasks. However, training data usually has limited quantity compared with the complexity of the real world, so data augmentation is often needed to enlarge the training set and maximize the knowledge that a network can learn from it. Instead of randomly transforming the training samples [37, 36], some works attempted to generate augmented samples from the original data by using image combination, generative adversarial networks (GAN) [27, 23, 29], Bayesian optimization, and image interpolation in the latent space [3, 13, 1]. However, these methods may produce unreliable samples that differ from those in the original data.
Another approach aims to find an optimal combination of predefined transformation functions to augment the training samples, instead of applying the transformations by manual design or complete randomness. AutoAugment suggests a reinforcement learning strategy to find the best set of augmentation functions by alternately training a proxy task and a policy controller, then applying the learned augmentation functions to the input data. Soon after, two other works, FastAugment and PBA, explored advanced hyper-parameter optimization methods to find the best transformations more efficiently. Different from these methods, which learn a fixed augmentation strategy for all training samples, PointAugment is sample-aware: we dynamically produce the transformation functions based on the properties of individual training samples and the network capability during the training process.
Data augmentation on point cloud. In existing point-processing networks, data augmentation mainly includes random rotation about the gravity axis, random scaling, and random jittering [20, 21]. These handcrafted rules are fixed throughout the training process, so they may not produce the best samples for effectively training the network. So far, we are not aware of any work that explores auto-augmentation to maximize network learning with 3D point clouds.
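As a concrete reference, the conventional recipe just described can be sketched as follows; this is an illustrative numpy version, and the parameter ranges (scale in [0.8, 1.25], clipped Gaussian jitter) are typical choices rather than values prescribed by any particular network:

```python
import numpy as np

def conventional_augment(points, rng):
    """Conventional point cloud DA: random rotation about the gravity
    (here: z) axis, random uniform scaling, and random jittering.

    points: (N, 3) array of xyz coordinates.
    """
    # Random rotation about the gravity axis.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    points = points @ rot.T

    # Random uniform scaling.
    points = points * rng.uniform(0.8, 1.25)

    # Random jittering: small clipped Gaussian noise per point.
    jitter = np.clip(0.01 * rng.standard_normal(points.shape), -0.05, 0.05)
    return points + jitter
```

Note that the rule is the same for every sample and every epoch, which is exactly the fixedness criticized above.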
Deep learning on point cloud. Improving on the PointNet architecture, several works [21, 14, 15] explored local structures to enhance the feature learning. Others explored graph convolutional networks by creating a local graph [32, 33, 25, 39] or geometric elements [9, 19]. Another stream of works [28, 30, 16] projected irregular points into a regular space for traditional convolutional neural networks to operate on. Different from the above works, our goal is not to design a new network but to boost the classification performance of existing networks by effectively optimizing the augmentation of point cloud samples. To this end, we design an augmentor that learns a sample-specific augmentation function and adjusts the augmentation based also on the learning progress of the classifier.
The main contribution of this work is the PointAugment framework, which automatically optimizes the augmentation of the input point cloud samples for more effectively training the classification network. Figure 2 illustrates the design of our framework, which has two deep neural network components: (i) an augmentor A and (ii) a classifier C. Given an input training dataset of samples, each containing a set of points, before we train classifier C with a sample P, we first feed P to our augmentor A to generate an augmented sample P'. Then, we feed P and P' separately to classifier C for training, and further take C's results as feedback to guide the training of augmentor A.
Before elaborating on the PointAugment framework, we first discuss the key ideas behind it. These are new ideas (not present in previous works [2, 11, 6]) that enable us to efficiently augment training samples that are 3D point clouds instead of 2D images.
Sample-aware. Rather than finding a universal augmentation policy or procedure for processing every input data sample, we aim to regress a specific augmentation function for each input sample by considering its underlying geometric structure. We call this sample-aware auto-augmentation.
2D vs. 3D augmentation. Unlike 2D augmentation for images, 3D augmentation involves a larger and very different spatial domain. Accounting for the nature of 3D point clouds, we consider two kinds of transformations on point cloud samples: shape-wise transformation (including rotation, scaling, shearing, and their combinations) and point-wise displacement (jittering of point locations), which our augmentor should learn to produce to enhance the network training.
Joint optimization. During the network training, the classifier gradually learns and becomes more powerful, so we need increasingly challenging augmented samples to keep training it effectively. Hence, we design and train the PointAugment framework in an end-to-end manner, jointly optimizing both the augmentor and the classifier. To achieve this, we have to carefully design the loss functions and dynamically adjust the difficulty of the augmented samples, considering both the input sample and the capacity of the classifier.
In this section, we first present the network architecture details of the augmentor and classifier (Section 4.1). Then, we present our loss functions formulated for the augmentor (Section 4.2) and classifier (Section 4.3), and introduce our end-to-end training strategy (Section 4.4). Lastly, we present the implementation details (Section 4.5).
Augmentor. Different from existing works [2, 11, 6], our augmentor is sample-aware: it learns to generate a specific function for augmenting each input sample. From now on, we drop the sample subscript for ease of reading, and denote P as the training sample input to augmentor A and P' as the corresponding augmented sample output from A.
The overall architecture of our augmentor is illustrated in Figure 3 (top). First, we use a per-point feature extraction unit to embed per-point features F ∈ R^{N×K} for all N points in P, where K is the number of feature channels. From F, we then regress the augmentation function specific to input sample P using two separate components in the architecture: (i) shape-wise regression to produce transformation M ∈ R^{3×3} and (ii) point-wise regression to produce displacement D ∈ R^{N×3}. Note that the learned M is a linear matrix in 3D space, combining mainly rotation, scaling, and shearing, whereas the learned D gives point-wise translation and jittering. Using M and D, we can then generate the augmented sample as P' = P·M + D.
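For illustration, applying the regressed outputs amounts to a single linear map plus a per-point offset. In this minimal numpy sketch, the matrix and displacement values are random placeholders standing in for the augmentor's learned outputs:

```python
import numpy as np

def apply_augmentation(points, M, D):
    """Apply a shape-wise transform M (3x3) and point-wise displacement
    D (N x 3) to a point cloud P (N x 3): P' = P M + D."""
    assert M.shape == (3, 3) and D.shape == points.shape
    return points @ M + D

# Placeholder values standing in for the augmentor's regressed outputs:
rng = np.random.default_rng(0)
P = rng.standard_normal((1024, 3))
M = np.eye(3) + 0.1 * rng.standard_normal((3, 3))  # near-identity linear map
D = 0.02 * rng.standard_normal((1024, 3))          # small per-point displacement
P_aug = apply_augmentation(P, M, D)
```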
The design of our proposed framework for the augmentor is generic, meaning that we may use different models to build its components. Figure 4 shows our current implementation, for reference. Specifically, similar to PointNet, we first employ a series of shared multi-layer perceptrons (MLPs) to extract the per-point features F. To regress M, we generate a noise vector from a Gaussian distribution, concatenate it with the per-shape feature summarized from F, and then employ MLPs to obtain M. Note that the noise vector enables the augmentor to explore more diverse choices in regressing the transformation matrix, through the randomness introduced into the regression process. To regress D, we concatenate per-point copies of the shape feature with F, together with a noise matrix whose values are randomly and independently generated from a Gaussian distribution, and then employ MLPs to obtain D.
Classifier. Figure 3 (bottom) shows the general architecture of classifier C. It takes P and P' as inputs in two separate rounds and predicts the corresponding class labels ŷ and ŷ', both probability vectors over the total number of classes in the classification problem. In general, C first extracts per-shape global features F_g (from P) or F'_g (from P'), and then employs fully-connected layers to regress a class label. Also, the choice of implementing C is flexible: we may employ different classification networks as C. In Section 5, we show that the performance of several conventional classification networks can be further boosted when equipped with our augmentor during training.
To maximize the network learning, the augmented sample P' generated by the augmentor should satisfy two requirements: (i) P' should be more challenging than P; and (ii) P' should not lose its shape distinctiveness, meaning that it should describe a shape not too far from P.
To achieve requirement (i), a simple way to formulate the loss function for the augmentor (denoted as L_A) is to maximize the difference between the cross-entropy losses on P' and P, or equivalently, to minimize

  L_A = L(P) − L(P'),   (1)

where L(·) is C's cross-entropy loss, L(P) = −∑_k y_k log p_k(P); y_k denotes the one-hot ground-truth label (y_k = 1 when the sample belongs to the k-th class); and p_k(P) is the predicted probability of P belonging to the k-th class. Note also that, for P' to be more challenging than P, we assume L(P') ≥ L(P), and a larger difference indicates a larger magnitude of augmentation, which we define as L(P') − L(P).
However, if we naively minimize Eq. (1), we encourage L(P') to become arbitrarily large, so a trivial solution for P' is an arbitrary sample regardless of P. Such a P' clearly violates requirement (ii). Hence, we further restrict the augmentation magnitude L(P') − L(P). Inspired by LS-GAN, we first introduce a dynamic parameter ρ and re-formulate L_A as

  L_A = |1.0 − exp(L(P') − ρ·L(P))|.   (2)

See Figure 5 for a plot of Eq. (2). In this formulation, we want L(P') to be large (for requirement (i)) but not too large (for requirement (ii)), so we upper-bound L(P') by ρ·L(P). Hence, we can obtain

  L(P) ≤ L(P') ≤ ρ·L(P),

where we denote ρ·L(P) as L(P')'s upper bound.
Note that, when we train the augmentor, the classifier is fixed (to be presented in Section 4.4), so L(P) is fixed. Hence, L_A depends only on L(P'). Since the augmentation magnitude should be non-negative, we ensure ρ ≥ 1. Moreover, considering that the classifier is very fragile at the beginning of the training, we pay more attention to training the classifier rather than generating a challenging P'. Hence, ρ should not be too large early on, meaning that P' should not be too challenging. Later, when the classifier becomes more powerful, we can gradually enlarge ρ to allow the augmentor to generate a more challenging P'. Therefore, we design a dynamic ρ to control L(P')'s upper bound with the following formulation:

  ρ = max(1, exp(∑_k y_k · p_k(P))),   (3)

where the max operation ensures ρ ≥ 1. At the beginning of the network training, the classifier predictions may not be accurate, so the predicted probability of the ground-truth class is generally small, resulting in a small ρ according to Eq. (3). When the classifier becomes more powerful, this probability increases, and we will have a larger ρ and a larger upper bound ρ·L(P) accordingly.
Lastly, to further ensure that the augmented sample P' remains shape distinctive (requirement (ii)), we add the cross-entropy loss L(P') as a fidelity term to Eq. (2) to construct the final loss L_A:

  L_A = L(P') + λ·|1.0 − exp(L(P') − ρ·L(P))|,   (5)

where λ is a fixed hyper-parameter that controls the relative importance of the two terms. A small λ encourages the augmentor to focus more on the classification, with less augmentation on P, and vice versa. In our implementation, we set λ = 1.0 to treat the two terms equally.
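In this notation, the augmentor loss can be sketched numerically as follows. This illustrative numpy version operates on precomputed class probabilities rather than a live network, and assumes the loss form L_A = L(P') + λ·|1 − exp(L(P') − ρ·L(P))| with ρ = max(1, exp(p_gt)), where p_gt is the classifier's probability for the ground-truth class of P:

```python
import numpy as np

def cross_entropy(probs, label):
    """Cross-entropy of predicted class probabilities against a
    ground-truth class index."""
    return -np.log(probs[label] + 1e-12)

def augmentor_loss(probs_p, probs_p_aug, label, lam=1.0):
    """Sketch of the augmentor loss on one sample:
       L_A = L(P') + lam * |1 - exp(L(P') - rho * L(P))|,
    with a dynamic rho that grows as the classifier becomes confident
    on the ground-truth class of the original sample P."""
    L_p = cross_entropy(probs_p, label)        # L(P), classifier fixed
    L_pa = cross_entropy(probs_p_aug, label)   # L(P')
    rho = max(1.0, np.exp(probs_p[label]))     # dynamic upper-bound ratio
    return L_pa + lam * np.abs(1.0 - np.exp(L_pa - rho * L_p))
```

Setting lam small makes the fidelity term L(P') dominate, matching the behavior described above.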
The goal of the classifier is to correctly predict both P and P'. Additionally, C should also learn stable per-shape global features, no matter whether P or P' is given as input. We thus formulate the classifier loss as

  L_C = L(P) + L(P') + γ·‖F_g − F'_g‖,   (6)

where F_g and F'_g are the global features that C extracts from P and P', respectively, and γ balances the importance of the terms; we empirically set it as 10.0.
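A minimal numeric sketch of this classifier loss follows, assuming the feature-stability term is the L2 distance between the two global features (an assumption of this sketch):

```python
import numpy as np

def classifier_loss(probs_p, probs_p_aug, label, feat_p, feat_p_aug, gamma=10.0):
    """Sketch of the classifier loss on one sample: cross-entropy on both
    the original and augmented predictions, plus a stability term that
    penalizes the distance between the two per-shape global features."""
    ce = lambda pr: -np.log(pr[label] + 1e-12)
    stability = np.linalg.norm(feat_p - feat_p_aug)  # L2 feature distance
    return ce(probs_p) + ce(probs_p_aug) + gamma * stability
```

The stability term pushes C to extract features that are insensitive to the augmentor's perturbations, complementing the adversarial pressure from L_A.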
Algorithm 1 summarizes our end-to-end training strategy. Overall, the procedure alternately optimizes and updates the learnable parameters in augmentor A and classifier C, fixing one while training the other. Given an input sample P, we first employ A to generate its augmented sample P'. We then update the learnable parameters in A by calculating the augmentor loss using Eq. (5); in this step, we keep C unchanged. After updating A, we keep A unchanged and generate the updated P'. We then feed P and P' to C one by one to obtain their respective predictions, and update the learnable parameters in C by calculating the classifier loss using Eq. (6). In this way, we can optimize and train A and C in an end-to-end manner.
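The alternating procedure can be sketched as follows; the two update functions are placeholders for real gradient steps on the augmentor and classifier losses:

```python
def train_pointaugment(augmentor, update_augmentor, update_classifier,
                       loader, epochs):
    """Skeleton of the alternating end-to-end training: for each batch,
    first update the augmentor (classifier frozen), then update the
    classifier on both the original and regenerated augmented samples
    (augmentor frozen). The update callables stand in for real
    optimizer steps on the two losses."""
    for _ in range(epochs):
        for P, y in loader:
            P_aug = augmentor(P)            # A generates the augmented sample
            update_augmentor(P, P_aug, y)   # step A on its loss; C held fixed
            P_aug = augmentor(P)            # regenerate P' with the updated A
            update_classifier(P, P_aug, y)  # step C on its loss; A held fixed
```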
We implement PointAugment using PyTorch, training with a fixed number of epochs and a fixed batch size. To train the augmentor, we adopt the Adam optimizer with a learning rate of 0.001. To train the classifier, we follow the respective original configuration from the released code and paper. Specifically, for PointNet, PointNet++, and RSCNN, we use the Adam optimizer with an initial learning rate of 0.001, gradually reduced with a decay rate of 0.5 every 20 epochs. For DGCNN, we use the SGD solver with a momentum of 0.9 and a base learning rate of 0.1, which decays using a cosine annealing strategy.
Note also that, to reduce model oscillation, we follow a prior strategy and train PointAugment using mixed training samples, which contain the original training samples as one half and our previously-augmented samples as the other half, rather than using only the original training samples. Moreover, to avoid overfitting, we set a dropout probability of 0.5 to randomly drop or keep the regressed shape-wise transformation and point-wise displacement. In the testing phase, we follow previous networks [20, 21] and feed the input test samples to the trained classifier to obtain the predicted labels, without any additional computational cost.
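One plausible reading of this dropout strategy, sketched with numpy (our interpretation, not a verified implementation detail): with probability 0.5 the regressed transformation is replaced by the identity and, independently, the displacement by zeros, i.e., dropped:

```python
import numpy as np

def dropout_augmentation(M, D, rng, p=0.5):
    """With probability p, drop the regressed shape-wise transform
    (replace by the identity) and, independently, the point-wise
    displacement (replace by zeros)."""
    if rng.uniform() < p:
        M = np.eye(3)          # drop the shape-wise transformation
    if rng.uniform() < p:
        D = np.zeros_like(D)   # drop the point-wise displacement
    return M, D
```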
We conducted extensive experiments on PointAugment. First, we introduce the benchmark datasets and classifiers employed in our experiments (Section 5.1). We then evaluate PointAugment on shape classification and shape retrieval (Section 5.2). Next, we perform detailed analysis on PointAugment’s robustness, loss configuration, and modularization design (Section 5.3). Lastly, we present further discussion and potential future extensions (Section 5.4).
Table 1. Statistics of the three datasets, including the number of categories (classes), the number of training and testing samples, the average number of samples per class, and the corresponding standard deviation.
Datasets. We employed three 3D benchmark datasets in our evaluations, i.e., ModelNet10, ModelNet40, and SHREC16, which we denote as MN10, MN40, and SR16, respectively. Table 1 presents statistics of the datasets, showing that MN10 is a very small dataset with only 10 classes. Though most networks [20, 14] can achieve high classification accuracy on MN10, they may easily overfit. SR16 is the largest dataset, with over 36,000 training samples. However, its high standard deviation (std.) of 1,111 samples per class shows the uneven distribution of training samples among the classes: in SR16, the Table class has 5,905 training samples, while the Hat class has only 39. For MN40, we directly adopt the data kindly provided by PointNet and follow the same train-test split. For MN10 and SR16, we uniformly sample 1,024 points on each mesh surface and normalize the point sets to fit a unit ball centered at the origin.
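The normalization step applied to the MN10 and SR16 samples can be sketched as:

```python
import numpy as np

def normalize_to_unit_ball(points):
    """Center a point set at the origin and scale it to fit inside a
    unit ball, as done when preparing the sampled point sets."""
    points = points - points.mean(axis=0)              # move centroid to origin
    radius = np.linalg.norm(points, axis=1).max()      # farthest point distance
    return points / (radius + 1e-12)
```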
Classifiers. As explained in Section 4.1, our overall framework is generic, and we can employ different classification networks as classifier C. To show that the performance of conventional classification networks can be further boosted when equipped with our augmentor, in the following experiments we employ several representative classification networks as C: (i) PointNet, a pioneering network that processes points individually; (ii) PointNet++, a hierarchical feature extraction network; (iii) RSCNN (only the single-scale version has been released so far), a recently-released enhancement of PointNet++ with a relation weight inside each local region; and (iv) DGCNN, a graph-based feature extraction network. Note that most existing networks [38, 30, 14] are built and extended from the above networks with various means of adaptation.
Table 2 (excerpt). Overall classification accuracy (%) with PointAugment (+PA); the value in parentheses is the gain over training with conventional DA.

| Classifier (+PA)  | MN40       | MN10       | SR16       |
| PointNet (+PA)    | 90.9 (1.7) | 94.1 (2.2) | 87.2 (4.1) |
| PointNet++ (+PA)  | 92.9 (2.2) | 95.8 (2.5) | 89.5 (4.4) |
| RSCNN (+PA)       | 92.7 (1.0) | 96.0 (1.8) | 90.1 (3.5) |
| DGCNN (+PA)       | 93.4 (1.2) | 96.7 (1.9) | 90.6 (3.6) |
Shape classification. First, we evaluate our PointAugment on the shape classification task using the classifiers listed in Section 5.1. For comparison, when we train the classifiers without PointAugment, we follow common practice and augment the training samples by random rotation, scaling, and jittering.
Table 2 summarizes the quantitative results. We report the overall classification accuracy (%) of each classifier on all three benchmark datasets, trained with conventional DA and with our PointAugment. From the results, we can clearly see that employing PointAugment improves the shape classification accuracy of all classifier networks on all three benchmark datasets. In particular, on MN40, DGCNN+PointAugment achieves 93.4%, a very high accuracy comparable with very recent works [38, 30, 14]. Moreover, PointAugment is shown to be more effective on the imbalanced SR16 dataset (see the right-most column in Table 2), suggesting that PointAugment can alleviate the class-imbalance problem through our sample-aware auto-augmentation strategy, which introduces more intra-class variation into the augmented samples.
Shape retrieval. To validate whether PointAugment helps the classifiers learn a better shape signature, we compare shape retrieval performance on MN40. Specifically, we regard each sample in the testing split as a query and aim to retrieve the most similar shapes from the training split by comparing the cosine similarity between their global features. In this experiment, we employ the mean Average Precision (mAP) as the evaluation metric.
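The retrieval step reduces to ranking gallery features by cosine similarity to the query feature; a minimal numpy sketch (the mAP computation is omitted):

```python
import numpy as np

def retrieve(query_feat, gallery_feats, k=5):
    """Rank gallery samples by cosine similarity to a query's global
    feature and return the indices of the top-k matches."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-12)
    g = gallery_feats / (np.linalg.norm(gallery_feats, axis=1,
                                        keepdims=True) + 1e-12)
    sims = g @ q                      # cosine similarity per gallery item
    return np.argsort(-sims)[:k]      # indices, most similar first
```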
Table 3 presents the evaluation results, which clearly show that PointAugment improves the shape retrieval performance of all four classifier networks. In particular, for PointNet and PointNet++, the improvement is over 6%. Besides, we show visual retrieval results for three different query models in Figure 6. Compared with the original PointNet equipped with conventional DA, the version trained with PointAugment produces more accurate retrievals.
Further, we conducted more experiments to evaluate various aspects of PointAugment, including a robustness test (Section 5.3.1), an ablation study (Section 5.3.2), and a detailed analysis on its Augmentor network (Section 5.3.3). Note that, in these experiments, we employ PointNet++  as the classifier and perform experiments on MN40.
We conducted the robustness test by corrupting the test samples and using PointNet++ previously trained with either conventional DA or PointAugment to classify them. Specifically, we use the following five settings: (i) adding random jittering with small Gaussian noise; (ii, iii) applying uniform scaling with a ratio of 0.9 or 1.1; and (iv, v) applying a fixed rotation (two different angles) about the gravity axis.
Table 4 reports the results, where we also show the original test accuracy (Ori.) on uncorrupted test samples as a reference. Comparing the two rows of results in the table, we can see that PointAugment consistently outperforms conventional DA under all settings. In particular, compared with the original test accuracy, PointAugment is less sensitive to corruption: its accuracy drops only slightly. This shows that PointAugment improves the robustness of a network, yielding better shape recognition.
Table 5 summarizes the results of the ablation study. Model A denotes PointNet++ without our augmentor, which gives a baseline classification accuracy of 90.7%. On top of Model A, we employ our augmentor with point-wise displacement alone (Model B), shape-wise transformation alone (Model C), or both (Model D). From the first four rows in Table 5, we can see that each of the augmentation functions contributes to producing more effective augmented samples.
Besides, we also ablate the dropout strategy (DP) for training, and the use of mixed training samples (Mix), as presented in Section 4.5, where we create Models E & F for comparison; see Table 5. By comparing the classification accuracies achieved by Models D, E, and F, we can see that both DP and Mix help to improve the network training.
Lastly, to study the impact of the input point-set size on PointAugment, we separately trained the whole framework with 2,048 points (Model G) and found no change in the classification accuracy; see the bottom-most row in Table 5. We believe that, compared with collecting more input points to train the network, incorporating a powerful augmentation is a more cost-effective way to maximize the knowledge that we can extract from the training samples.
Analysis on the augmentor loss. As described in Section 4.2, we employ the augmentor loss (see Eq. (5)) to guide the training of our augmentor. To validate its superiority over the simple version (see Eq. (1)) and the conventional DA employed in PointNet++, we plot the training accuracy against training epochs for the three cases; see Figure 7.
Clearly, the training achieved with the simple version is very unstable; see the gray plot. This indicates that simply enlarging the difference between the two cross-entropy losses without restriction creates turbulence in the training process, resulting in worse classification performance, even compared with the baseline; see the orange plot. Comparing the blue and orange plots in Figure 7, we can see that, at the beginning of the training, since the augmentor is initialized randomly, the accuracy with PointAugment is slightly lower than the baseline. However, as training continues, PointAugment rapidly surpasses the baseline and shows a clear improvement over it, demonstrating the effectiveness of our augmentor loss.
Hyper-parameter λ in Eq. (5). Table 6 shows the classification accuracy for different choices of λ. As mentioned in Section 4.2, a small λ encourages the augmentor to focus more on the classification. However, if it is too small, e.g., 0.5, the augmentor tends to take no action in the augmentation function, leading to worse classification performance; see the left two columns in Table 6. On the other hand, if we set a larger λ, e.g., 2.0, the augmented samples may be too difficult for the classifier, which also hinders the network training; see the right-most column. Hence, in PointAugment, we adopt λ = 1.0 to treat the two components equally.
Analysis on the augmentor architecture design. As mentioned in Section 4.1, we employ a per-point feature extraction unit to embed per-point features from the training sample. In our implementation, we use shared MLPs to extract these features. Here, we further explore two other choices of feature extraction unit in place of the MLPs: DenseConv and EdgeConv; please refer to their original papers for details. Table 7 compares the accuracy of the three implementations. From the results, we can see that, although using MLPs is a relatively simple implementation compared with DenseConv and EdgeConv, it leads to the best classification performance. We attribute this to the following reason: the aim of our augmentor is to regress a shape-wise transformation and a point-wise displacement from the per-point features, which is not a very tough task, so a complex feature extraction unit may easily overfit. The results in Table 7 demonstrate that MLPs are already sufficient for our augmentor to regress the augmentation functions.
Overall, the augmentor network learns the sample-level augmentation function in a self-supervised manner, taking feedback from the classifier to update its parameters. As a result, by exploring these well-tuned augmented samples, the classifier can enhance its capability, better uncovering the intrinsic variations among different classes and discovering features that are insensitive to intra-class variation.
In the future, we plan to adapt PointAugment to more tasks, such as part segmentation [20, 15], semantic segmentation [17, 39], and object detection [35, 26]. However, it is worth noting that different tasks require different considerations. While classification networks focus on the whole 3D shape, semantic segmentation, for example, takes input samples that are partial but may still contain multiple objects. Hence, designing auto-augmentation for segmentation would require learning to produce more diverse transformations for individual objects in the scene, rather than for the scene as a whole. Therefore, one future direction is to explore instance-aware auto-augmentation to extend PointAugment to other tasks.
We presented PointAugment, the first auto-augmentation framework that we are aware of for 3D point clouds, which considers both the capability of the classification network and the complexity of the training samples. First, PointAugment is an end-to-end framework that jointly optimizes the augmentor and classifier networks, such that the augmentor can learn to improve based on feedback from the classifier, and the classifier can learn to process a wider variety of training samples. Second, PointAugment is sample-aware: its augmentor learns to produce augmentation functions specific to the input samples, with a shape-wise transformation and a point-wise displacement for handling point cloud samples. Third, we formulate a novel loss function that enables the augmentor to dynamically adjust the augmentation magnitude based on the learning state of the classifier, so that it can generate augmented samples that best fit the classifier at different training stages. Lastly, we conducted extensive experiments demonstrating how PointAugment improves the performance of four representative networks on the MN40 and SR16 datasets.