CP-Net: Contour-Perturbed Reconstruction Network for Self-Supervised Point Cloud Learning

by   Mingye Xu, et al.

Self-supervised learning has not been fully explored for point cloud analysis. Current frameworks are mainly based on point cloud reconstruction. Given only 3D coordinates, such approaches tend to learn local geometric structures and contours, while failing to understand high-level semantic content. Consequently, they achieve unsatisfactory performance in downstream tasks such as classification and segmentation. To fill this gap, we propose a generic Contour-Perturbed Reconstruction Network (CP-Net), which can effectively guide self-supervised reconstruction to learn semantic content in the point cloud, and thus promote the discriminative power of the point cloud representation. First, we introduce a concise contour-perturbed augmentation module for point cloud reconstruction. With the guidance of geometry disentangling, we divide the point cloud into contour and content components. Subsequently, we perturb the contour components and preserve the content components of the point cloud. As a result, the self-supervisor can effectively focus on semantic content by reconstructing the original point cloud from the perturbed one. Second, we use this perturbed reconstruction as an assistant branch to guide the learning of the basic reconstruction branch via a distinct dual-branch consistency loss. In this case, our CP-Net not only captures structural contour but also learns semantic content for discriminative downstream tasks. Finally, we perform extensive experiments on a number of point cloud benchmarks. Part segmentation results demonstrate that our CP-Net (81.5% mIoU) outperforms the self-supervised models and narrows the gap with the fully-supervised methods. For classification, we get a competitive result with the fully-supervised methods on ModelNet40 (92.5% accuracy) and ScanObjectNN (87.9% accuracy). The codes and models will be released afterwards.




I Introduction

Point cloud analysis has gradually become an important problem for understanding the 3D world, owing to its wide applications in robotics, AR/VR, autonomous driving, etc. [33, 30, 28, 18, 11]. Current mainstream methods of point cloud analysis are mainly driven by fully supervised deep learning [4, 31, 43, 45, 39, 27, 36]. However, these methods require a large number of manual annotations, which can be expensive and infeasible for practical applications. Therefore, it is desirable to obtain discriminative representations of 3D point clouds in a self-supervised manner.

Fig. 1: Point cloud representation learning of fully supervised method (a) and self-supervised baseline method (b) and our CP-Net (c). On the left is the basic structure of these models. The right parts are the feature representation and distribution curves of the semantic parts (fuselage and wing). (a): It can be observed that fully-supervised feature representation can be well distinguished among different parts. (b): Self-supervised methods can learn contours well, but not content. For example, the distribution curves of the wing and the fuselage overlap in the content. (c): The self-supervised feature distribution of our CP-Net shows the distinction both in contour and content components.

Current self-supervised methods are mainly based on pretext tasks provided by generation or reconstruction [12, 14, 23, 26, 53, 32, 35, 46, 50, 15]. However, their performance is far from that of the fully-supervised methods. To fill this gap, we investigate the underlying problem in self-supervised point cloud learning. As Figure 1 shows, we take the point cloud segmentation task of an airplane for illustration. We first visualize the point cloud by the corresponding feature response in different methods. It can be observed that self-supervised reconstruction is good at learning structural contours, but fails to distinguish semantic content: the contours of the fuselage and the wing are clear, while their contents are confused, with the same feature responses. This further motivates us to investigate self-supervised point cloud representations in terms of contour and content. Specifically, we use the geometry-disentangle module [44] to divide the point cloud into contour and content components, and show the feature response distributions of the fuselage and the wing on the complete point cloud, the contour components and the content components in Figure 1. As expected, the feature distributions of different semantic parts can be easily separated in the contour components, while they heavily overlap in the content components. This clearly indicates that self-supervised reconstruction lacks the capacity to distinguish semantic parts in the content components.

To address this difficulty, we propose a generic Contour-Perturbed Reconstruction Network (CP-Net), which can effectively guide self-supervised reconstruction to pay more attention to discriminative object content, based on point cloud disentangling. First, we geometrically divide a point cloud into contour and content components, and then augment it by perturbing the contour and preserving the content. By reconstructing the original point cloud from the contour-perturbed one, we can effectively force the self-supervisor to learn semantic content. Second, we build a weight-sharing dual-branch structure to boost point cloud representation learning. Since the basic branch is good at learning structural contour while ignoring semantic content, we leverage the perturbed reconstruction as an assistant branch of the original reconstruction task. By designing a novel dual-branch consistency loss, we can progressively use the assistant branch to guide the basic branch in learning semantic content. In this case, the basic branch can capture easy-to-learn contour as well as exploit necessary-to-learn content, which enhances the point cloud representation of self-supervised reconstruction for discriminative downstream tasks. Finally, the experiments and visualization analysis in Sec. IV and V demonstrate the effectiveness, robustness and generalization ability of our method. The main contributions are summarized as follows:

  • We propose a generic Contour-Perturbed Reconstruction Network (CP-Net) for self-supervised point cloud feature learning, which can effectively learn the discriminative representation both on structural contour and semantic content components.

  • We explore an effective contour-perturbed augmentation module to perturb contour components of point clouds, which is used to force the assistant branch to learn the semantic content information.

  • We introduce a multi-scale dual-branch consistency loss, which brings the corresponding features of the basic and assistant branches closer. The basic branch can then pay more attention to the semantic content information under the guidance of the assistant branch.

  • Experiments demonstrate that our method outperforms the self-supervised methods and narrows the gap between unsupervised and supervised models in part segmentation task (81.5% mIoU of ShapeNetPart). For classification, our self-supervised method gets a comparable result with the fully-supervised methods on ModelNet40 (92.5% accuracy) and ScanObjectNN (87.9% accuracy).

Fig. 2: Network architecture of our CP-Net. “RC” means RS-Conv module, “TU” means transition-up module. The basic branch uses the original point cloud to reconstruct its coordinates and normal vectors, while the assistant branch takes the contour-perturbed point cloud as input and reconstructs the original point cloud. The feature extractor network and prediction network of the two branches share parameters.

II Related Work

II-A Supervised Learning on 3D Point Clouds

Recently, 3D point cloud analysis has enjoyed remarkable progress on various downstream tasks. PointNet [4] and DeepSet [49] are pioneering architectures that directly process point clouds. Their basic idea is to learn a spatial encoding of each point and fuse all individual point features into a global signature with max pooling. Though efficient, they do not sufficiently capture local geometric structures. To remedy this, PointNet++ [31] extracts local features capturing fine geometric structures from neighbors through a hierarchical grouping architecture. Some subsequent works such as PointCNN [25], PointConv [40] and RSCNN [27] also focus on the extraction of local geometric features. To capture holistic geometric information more efficiently, GS-Net [43] groups distant points with similar and relevant geometric information and aggregates features from neighbors in both Euclidean space and eigenvalue space. DGCNN [39] dynamically reconstructs the k-NN graph using nearest neighbors in the feature space. Although these supervised methods push the state of the art of point cloud deep learning with the help of extensive supervised signals, their generalization ability may be limited by the supervised learning mechanism. Therefore, it is desirable to learn features in an unsupervised manner and obtain a general representation of 3D point clouds.

II-B Unsupervised Learning on 3D Point Clouds

In order to produce a semantic latent space without relying on annotations, an unsupervised network is trained to perform tasks based on information obtained from the point cloud itself. Based on this, recent self-supervised approaches design various pretext tasks such that models need to learn useful information from the data itself [1, 7, 8, 16, 37]. Several prior works have attempted to learn point cloud representations without human supervision [32, 46, 53, 50, 42, 15]. FoldingNet [46] trains an end-to-end auto-encoder that consumes unordered point clouds directly and reconstructs the point cloud itself. PointGLR [32] focuses on reasoning between local and global representations. GraphTER [13] proposes graph transformation equivariant representation learning to extract unsupervised representations. Chen et al. [5] destroy certain local shape parts of an object, and then segment the points that belong to the distorted parts with a point cloud network. Different from all these previous works, we explore a dual-branch self-supervised learning framework (CP-Net), which can guide the self-supervisor to learn discriminative representations of both semantic content and structural contour. Moreover, the self-supervised representation of our CP-Net is better suited to downstream tasks.

III Method

This section will introduce our proposed CP-Net in detail. First, we elaborate on the overall framework. Then, we introduce our contour-perturbed augmentation module for assistant branch. Finally, we describe the loss terms in our CP-Net.

III-A Overall Architecture of Our CP-Net

Our CP-Net is a generic dual-branch network, which consists of an assistant branch and a basic branch. The assistant branch is used to learn a discriminative representation of semantic content, while the basic branch preserves the discriminative representation of the structural contour. Since both branches are used for point cloud reconstruction, they share the weights of the feature extraction network and the prediction network. The feature extraction network is designed to obtain the global feature and the point-wise feature. The prediction network reconstructs the coordinates and estimates the normal vectors. By introducing the dual-branch consistency loss as a feature consistency regularization, we can leverage the assistant branch to guide the basic branch in distinguishing the content information of point clouds.

As shown in Figure 2, we take a 3D point cloud as the input. Generally, a point cloud contains 3D coordinates and normal vectors. The input of the basic branch is the original point cloud coordinates, while the input of the assistant branch is the point cloud coordinates perturbed by the contour-perturbed augmentation module. The extractor network receives point cloud coordinates as input, and outputs the point-wise features and the global feature. The prediction network outputs the reconstructed coordinates and the estimated normal vectors.

Feature Extraction Network. Like PointNet++ [31], we use a hierarchical structure with skip connections to learn point cloud features progressively. Specifically, at each level of the encoder, the point set is downsampled by iterative farthest point sampling (FPS) to produce a new, smaller point set. Meanwhile, we extract the point-wise feature by applying RS-Conv [27] to each point. At the corresponding level of the decoder, we use the transition-up module [52], built on the point interpolation function of [31], to propagate the point features. Moreover, we also use the transition-up module to propagate the features of each level to the original points. We then concatenate the propagated features from all layers of the feature extractor network as the point-wise feature, while the global feature is obtained by a symmetric aggregation function (e.g., max pooling) operating on the point-wise feature.
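The downsampling step above relies on iterative farthest point sampling. The following is a minimal sketch in plain Python, not the authors' implementation; the function name `farthest_point_sampling`, the random choice of the first point and the `seed` parameter are assumptions made for illustration.

```python
import math
import random


def farthest_point_sampling(points, num_samples, seed=0):
    """Return the indices of `num_samples` points chosen by iterative FPS.

    `points` is a sequence of coordinate tuples. The first index is picked
    at random (an assumption; any starting point works).
    """
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")
    rng = random.Random(seed)
    selected = [rng.randrange(len(points))]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[selected[0]]) for p in points]
    while len(selected) < num_samples:
        # Pick the point that is farthest from everything selected so far.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected
```

Each iteration costs one pass over the point set, so sampling m points from n costs O(mn) distance evaluations. Practical pipelines usually run a batched GPU version of the same loop.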

Fig. 3: The process of contour-perturbed augmentation module.

Prediction Networks. The self-supervised prediction networks consist of a normal prediction network and a point cloud reconstruction network. The normal prediction network is used to enhance the point cloud representation with geometric structure information. It takes the concatenation of the global feature, the original point coordinates and the point-wise feature as input, and obtains the estimated normals through a shared lightweight MLP followed by a normalization operation.

The self-reconstruction network is used to recover the coordinate information of the original point cloud. Following FoldingNet [46], we incorporate a standard two-dimensional grid and deform it into the reconstructed coordinates under the guidance of the global feature. The self-reconstruction network contains two consecutive 3-layer MLPs. Specifically, before we feed the global feature into this network, we replicate it once per grid point, and then concatenate the replicated feature with a matrix that contains the grid points on a square centered at the origin [46]. The reconstructed point cloud is obtained by passing this concatenation through the two MLPs.
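To make the folding input concrete, here is a small sketch of how the global feature can be replicated and concatenated with a 2D grid before it enters the first MLP. It is illustrative only: the helper names `make_grid` and `folding_input`, the grid extent of [-1, 1] and the ordering (feature first, grid coordinates last) are assumptions, not details taken from the paper.

```python
def make_grid(side, half_extent=1.0):
    """Return side*side 2D points evenly spaced on a square centred at the origin."""
    if side < 2:
        raise ValueError("side must be at least 2")
    step = 2.0 * half_extent / (side - 1)
    coords = [-half_extent + i * step for i in range(side)]
    return [(u, v) for u in coords for v in coords]


def folding_input(global_feature, side):
    """Replicate the global feature once per grid point and append (u, v).

    The result has side*side rows, each of length len(global_feature) + 2,
    and is what a folding-style decoder would feed to its first MLP.
    """
    return [list(global_feature) + [u, v] for (u, v) in make_grid(side)]
```

For example, a 3-dimensional global feature with `side=3` yields 9 rows of length 5, one per grid point.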

III-B Contour-Perturbed Augmentation Module

As mentioned in Sec. I, self-supervised reconstruction mainly focuses on structural contours, while ignoring the discriminative content information of the point cloud. To tackle this problem, we design the contour-perturbed augmentation module for the assistant branch, shown in Figure 3. We have observed that the geometry-disentangle module [44] can decompose the point cloud into structural contour information and semantic content information. Inspired by this, we design our contour-perturbed augmentation module to perturb the point cloud.

First, for the input point cloud, we construct a point graph whose eigenvalues represent the graph frequencies. Second, we collect the contour components and the content components by applying the graph filters of [44] to the constructed graph. The points in the contour components more readily describe the local geometric structure of the point cloud, while the content points highlight the relatively common semantic information. Third, we perturb the contour points with normally distributed noise, and then concatenate the perturbed contour points with the original content points to form the perturbed point cloud. Finally, we use the perturbed point cloud to reconstruct the original point cloud in the assistant branch, which makes this branch pay more attention to the semantic content information.
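The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the graph filter of [44]: the contour score used here (distance from a point to the centroid of its k nearest neighbours) is a hypothetical stand-in for a high-frequency graph response, and the names `contour_scores` and `perturb_contour` are invented for this example. The default noise std of 0.02 is the value reported in the implementation details.

```python
import math
import random


def contour_scores(points, k=4):
    """Hypothetical high-frequency proxy: distance to the k-NN centroid."""
    scores = []
    for i, p in enumerate(points):
        others = (q for j, q in enumerate(points) if j != i)
        nbrs = sorted(others, key=lambda q: math.dist(p, q))[:k]
        centroid = [sum(c) / len(nbrs) for c in zip(*nbrs)]
        scores.append(math.dist(p, centroid))
    return scores


def perturb_contour(points, num_contour, std=0.02, k=4, seed=0):
    """Jitter the `num_contour` highest-scoring points, keep the rest intact.

    Returns (perturbed_points, sorted_contour_indices). Point order is
    preserved, which is equivalent to concatenating the jittered contour
    points with the untouched content points up to a permutation.
    """
    rng = random.Random(seed)
    scores = contour_scores(points, k)
    order = sorted(range(len(points)), key=scores.__getitem__, reverse=True)
    contour_idx = set(order[:num_contour])
    perturbed = []
    for i, p in enumerate(points):
        if i in contour_idx:
            perturbed.append(tuple(c + rng.gauss(0.0, std) for c in p))
        else:
            perturbed.append(tuple(p))
    return perturbed, sorted(contour_idx)
```

On a flat 5x5 grid of points, this proxy scores the four corners highest, so only they are jittered while every other point is returned unchanged.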

Besides these methodological insights, our experiments in Table VI also support this statement: they show that content components are harder to learn in a self-supervised manner. Moreover, we also consider other point cloud decomposition schemes, such as point cloud clustering and spatial domain decomposition. Specific results and analysis are presented in Table VI.

III-C Loss Terms

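To train our model effectively, we combine three training losses: the dual-branch consistency loss, the reconstruction loss and the normal estimation loss. The dual-branch consistency loss is used to guide the basic branch to learn the semantic content representation from the assistant branch, while the reconstruction loss and the normal estimation loss are widely used self-reconstruction losses.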

Dual-branch consistency loss. After the perturbed point cloud is obtained, we feed the original and perturbed point clouds to the feature extraction network to extract the point-wise feature and the global feature for the basic branch and for the assistant branch, respectively. Then, we design the dual-branch consistency loss to bring the corresponding features of the two branches closer. This transmits the discriminative content information from the assistant branch to the basic branch. Moreover, to capture multi-scale semantics in the point cloud, we introduce consistency losses at different scales, i.e., a global consistency loss, a local consistency loss, and a local-to-global consistency loss.

Method          Train data    cat. mIoU  ins. mIoU  aero  bag   cap   car   chair  earp.  guitar  knife  lamp  laptop  motor  mug   pistol  rocket  skate.  table
Kd-Net[21]      Full (100%)   77.4       82.3       80.1  74.6  74.3  70.3  88.6   73.5   90.2    87.2   81.0  94.9    57.4   86.7  78.1    51.8    69.9    80.3
PointNet[4]     Full (100%)   80.4       83.7       83.4  78.7  82.5  74.9  89.6   73.0   91.5    85.9   80.8  95.3    65.2   93.0  81.2    57.9    72.8    80.6
SO-Net[24]      Full (100%)   80.8       84.6       81.9  83.5  84.8  78.1  90.8   72.2   90.1    83.6   82.3  95.2    69.3   94.2  80.0    51.6    72.1    82.6
KCNet[34]       Full (100%)   82.2       84.7       82.8  81.5  86.4  77.6  90.3   76.8   91.0    87.0   84.5  95.5    69.2   94.4  81.6    60.1    75.2    81.3
RS-Net[17]      Full (100%)   81.4       84.9       82.7  86.4  84.1  78.2  90.4   69.3   91.4    87.0   83.5  95.4    66.0   92.6  81.8    56.1    75.8    82.2
PointNet++[31]  Full (100%)   81.9       85.1       82.4  79.0  87.7  77.3  90.8   71.8   91.0    85.9   83.7  95.3    71.6   94.1  81.3    58.7    76.4    82.6
DGCNN[39]       Full (100%)   82.3       85.1       84.2  83.7  84.4  77.1  90.9   78.5   91.5    87.3   82.9  96.0    67.8   93.3  82.6    59.7    75.5    82.0
RSCNN[27]       Full (100%)   84.0       86.2       83.5  84.8  88.8  79.6  91.2   81.1   91.6    88.4   86.0  96.0    73.7   94.1  83.4    60.5    77.7    83.6
Ours            Self. (5%)    76.4       81.5       79.6  66.5  79.4  73.2  87.5   64.4   89.2    80.1   76.1  94.9    51.2   93.2  78.7    48.5    79.4    80.5
Ours            Self. (10%)   76.4       81.6       79.1  73.1  76.5  70.5  87.6   63.2   89.2    82.3   77.8  95.0    52.4   90.5  78.6    48.8    75.9    80.8
Ours            Self. (50%)   78.8       82.5       79.0  78.5  82.9  75.0  88.0   68.8   89.8    85.4   77.1  94.9    62.6   86.8  81.6    56.7    72.7    82.1
TABLE I: Comparison on the ShapeNetPart segmentation task. Average mIoU over instances (ins.) and categories (cat.) is reported. Full (100%): the fully-supervised methods are trained on the train set of ShapeNetPart (with annotations). Self. (R%): the self-supervised methods are first pretrained on the train set of ShapeNetPart (without annotations), and then fine-tuned on only R% of the train set of ShapeNetPart (the parameters of the pre-trained models are fixed).

The global consistency loss preserves the global consistency between the perturbed global representation and the original global representation. It measures the similarity between the global features of the two branches.

The local consistency loss acts on the point-wise representation, and is used to enhance the feature relevance between the basic branch and the assistant branch. It measures the similarity between the features of each pair of corresponding points.

Moreover, in order to maximize a lower bound of the mutual information between local and global representations, and thereby make the point-wise representation as close as possible to the global representation, we use a local-to-global consistency loss that connects the local and global representations of different branches. This loss is computed over a batch, using the global features of the different point clouds in that batch. Finally, the dual-branch consistency loss combines the global, local and local-to-global consistency terms.
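As an illustration of how a local-to-global term over a batch can be built, the sketch below uses an InfoNCE-style objective, which is one common lower-bound estimator of mutual information. This is an assumption made for illustration and may differ from the paper's exact formulation: the function name `local_to_global_infonce`, the cosine similarity, the `temperature` value and the requirement that local and global features share one dimension are all hypothetical.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _unit(v, eps=1e-12):
    norm = math.sqrt(_dot(v, v))
    return [c / max(norm, eps) for c in v]


def local_to_global_infonce(local_feats, global_feats, temperature=0.1):
    """InfoNCE-style local-to-global loss (illustrative assumption).

    local_feats[b]  : list of point-wise feature vectors of cloud b (one branch)
    global_feats[b] : global feature vector of cloud b (the other branch)
    The positive for every point of cloud b is global_feats[b]; the global
    features of the other clouds in the batch act as negatives.
    """
    globals_n = [_unit(g) for g in global_feats]
    total, count = 0.0, 0
    for b, cloud in enumerate(local_feats):
        for f in cloud:
            f = _unit(f)
            logits = [_dot(f, g) / temperature for g in globals_n]
            m = max(logits)  # subtract the max for a stable log-sum-exp
            log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_sum - logits[b]
            count += 1
    return total / count
```

The loss is small when each point feature is most similar to the global feature of its own cloud, and grows when it matches another cloud in the batch better.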

Reconstruction Loss. Based on our dual-branch framework, we extract one global representation for the original point cloud and one for the perturbed point cloud. To perform self-reconstruction, a FoldingNet-based predictor [46] deforms a regular 2D grid, conditioned on each global representation, onto the 3D coordinates of the reconstructed point cloud of the corresponding branch. The reconstruction loss between each reconstructed point cloud and the original point cloud is defined as the chamfer distance [10].
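For reference, a minimal sketch of a symmetric chamfer distance is given below. Several variants exist (squared or unsquared distances, sum or mean, sum or max of the two directions); this version assumes squared Euclidean distances averaged over each set and summed over both directions, which may not match the exact variant used in the paper.

```python
def chamfer_distance(a, b):
    """Symmetric chamfer distance between two point sets.

    For every point in one set, take the squared distance to its nearest
    neighbour in the other set; average per set and add both directions.
    """
    def one_way(src, dst):
        total = 0.0
        for p in src:
            total += min(sum((x - y) ** 2 for x, y in zip(p, q)) for q in dst)
        return total / len(src)

    return one_way(a, b) + one_way(b, a)
```

The two sets do not need the same number of points, which is why this distance suits comparing a folded grid against the original cloud.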


Normal Estimation Loss. The normal vector is one of the most basic point cloud features and plays a vital role in many point cloud processing algorithms [4, 32]. The task of normal estimation requires a high-level representation of the surface of the 3D object. In self-supervised feature learning, we do not need to pursue the accuracy of the estimated normal vectors [27]; instead, we use this task as a self-supervised signal to improve the point-wise self-supervised representation. We use a cosine loss between the predicted normal vectors and the original normal vectors to measure the estimation error.
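A cosine loss on normals can be written in a few lines. The sketch below assumes the common form of one minus the cosine similarity, averaged over points; the function names are illustrative. Some methods use the absolute value of the cosine to ignore normal orientation, which is not done here.

```python
import math


def _normalize(v, eps=1e-12):
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / max(norm, eps) for c in v)


def cosine_normal_loss(pred_normals, target_normals):
    """Mean of (1 - cosine similarity) between predicted and target normals."""
    total = 0.0
    for p, t in zip(pred_normals, target_normals):
        p, t = _normalize(p), _normalize(t)
        total += 1.0 - sum(a * b for a, b in zip(p, t))
    return total / len(pred_normals)
```

The loss is 0 for parallel vectors, 1 for orthogonal vectors and 2 for opposite vectors, independent of vector length.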

Method             Unsupervised pretrain  1% train data mIoU (%)  5% train data mIoU (%)
SO-Net [24]        ShapeNet               64.0                    69.0
PointCapsNet [53]  ShapeNet               67.0                    70.0
MultiTask [15]     ShapeNet               68.2                    77.7
UFF [50]           ShapeNet               68.5                    78.3
PointContrast [42] ScanNet                74.0                    79.9
Du et al. [9]      ShapeNet               76.2                    79.2
Chen et al. [5]    ShapeNet               74.1                    80.1
Ours               ShapeNet               79.3                    81.2
TABLE II: Transfer capacity (ShapeNetPart segmentation task). Most models are first unsupervisedly pretrained on ShapeNet, and then semi-supervisedly finetuned on ShapeNetPart with limited annotations. Note that, PointContrast is pretrained on larger ScanNet dataset. We can see that, our self-supervised model achieves the best performance, especially when the fine-tuned labeled data is quite limited (e.g., 1% ShapeNetPart). This shows its powerful generalization capacity on limited data.

IV Experiments

This section will introduce the implementation details and the experimental comparisons for point cloud classification and part segmentation.

IV-A Implementation Details

All our models are trained on a single RTX 2080 Ti GPU with the deep learning library PyTorch [29]. Our model is trained with the Adam [20] optimizer with a base learning rate of 0.001, and the learning rate is decayed by a factor of 0.7 every 20 epochs. The momentum of the batch normalization [19] layers starts from 0.9 and decays by a factor of 0.5 every 20 epochs.
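The two step schedules described above can be written down directly. This sketch assumes that "reduced by 0.7" and "decays at a rate of 0.5" both mean multiplication by that factor at every multiple of 20 epochs, with epochs counted from 0; the function names are illustrative.

```python
def step_decay(initial, factor, epoch, step=20):
    """Value after multiplying `initial` by `factor` once every `step` epochs."""
    return initial * factor ** (epoch // step)


def learning_rate(epoch):
    """Base learning rate 0.001, decayed by 0.7 every 20 epochs."""
    return step_decay(1e-3, 0.7, epoch)


def bn_momentum(epoch):
    """Batch-norm momentum starting at 0.9, decayed by 0.5 every 20 epochs."""
    return step_decay(0.9, 0.5, epoch)
```

Under these assumptions the learning rate is 0.001 for epochs 0 to 19, 0.0007 for epochs 20 to 39, and 0.00049 for epochs 40 to 59.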

For the self-supervised pre-training for classification, we use only three RS-Conv modules to extract point cloud features in the extractor network, and the features do not need to be propagated to the original number of points. We use the global consistency loss and the local-to-global consistency loss to preserve the consistency of global fine-grained representations. For segmentation, we use the transition-up part [52] with four layers to obtain more diverse point-wise self-supervised features, and the intermediate features are propagated to the original number of points. Since segmentation focuses on the point-wise representation, we use the local consistency loss as the dual-branch consistency loss. For evaluation, we randomly select the training subset by category. For the contour-perturbed augmentation module, we jitter the contour points with normally distributed noise with a std of 0.02, which is determined by the experimental results in Table VIII.

Setting                   Method             Acc. (%)
Supervised                PointNet [4]       89.2
Supervised                PointNet++ [31]    90.5
Supervised                SO-Net [24]        92.5
Supervised                PointCNN [25]      92.5
Supervised                DGCNN [39]         92.9
Supervised                RSCNN [27]         92.9
Supervised                RSCNN (vote)       93.6
Supervised                DGCNN [39]         93.5
Supervised                SO-Net [24]        93.4
Supervised                KPConv [36]        92.9
Unsupervised (simple)     PointHop [51]      89.1
Unsupervised (simple)     PointHop++ [51]    91.1
Unsupervised (simple)     PointGLR [32]      92.9
Unsupervised (simple)     Chen et al. [5]    92.4
Unsupervised (simple)     Ours               92.5
Unsupervised (difficult)  FoldingNet [46]    88.9
Unsupervised (difficult)  PointCapsNet [53]  88.9
Unsupervised (difficult)  MultiTask [15]     89.1
Unsupervised (difficult)  UFF [50]           90.4
Unsupervised (difficult)  Ours               91.9
TABLE III: Comparison of classification results on the ModelNet40 dataset. “vote” denotes the test-time voting trick. “simple” means the self-supervised methods are trained and tested on ModelNet40, while “difficult” means that the self-supervised methods are trained on ShapeNet and tested on ModelNet40.

IV-B Point Cloud Part Segmentation

The purpose of point cloud part segmentation is to predict the part category label of each point in a given point cloud. We evaluate the point-wise features learned by our self-supervised model, pre-trained on either ShapeNetPart or ShapeNet [3], on the ShapeNetPart dataset [47]. ShapeNetPart [47] contains 16,881 objects from 16 categories. Each object consists of 2 to 6 parts, with a total of 50 distinct parts among all categories. ShapeNet [3] contains 57,000 models across 55 categories.

For the training setting in Table I, we use the training set of ShapeNetPart (without annotations) for unsupervised pretraining, and then use only R% of the samples of the training set for fine-tuning the unsupervised feature. It can be observed that with only 50% of the training annotations, our self-supervised CP-Net achieves 82.5% instance mIoU. Table I also shows the comparison with fully-supervised models. The results suggest that our model achieves an mIoU that is only 4.1% lower than that of the best supervised model. In addition, to isolate the performance benefit due to unsupervised training, we trained the fully-supervised DGCNN [39] of Table I with only 5% of the training data. It achieves only 76.6% instance mIoU, which is worse than our semi-supervised method (81.5% mIoU with 5% of the train set). This also shows that our method has strong generalization capacity on limited data.
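The instance and category mIoU figures discussed above are usually computed per shape and then averaged. The sketch below follows the widely used ShapeNetPart evaluation convention, in which a part that is absent from both the prediction and the ground truth of a shape counts as IoU 1; that convention and the function names are assumptions, since the evaluation code is not given in the text.

```python
from collections import defaultdict


def shape_iou(pred, target, part_ids):
    """Mean IoU over the parts of one shape's category."""
    ious = []
    for part in part_ids:
        inter = sum(1 for p, t in zip(pred, target) if p == part and t == part)
        union = sum(1 for p, t in zip(pred, target) if p == part or t == part)
        ious.append(1.0 if union == 0 else inter / union)
    return sum(ious) / len(ious)


def instance_miou(shapes):
    """Average shape IoU over all shapes.

    `shapes` is a list of (category, pred_labels, target_labels, part_ids).
    """
    scores = [shape_iou(p, t, parts) for _, p, t, parts in shapes]
    return sum(scores) / len(scores)


def category_miou(shapes):
    """Average shape IoU within each category, then across categories."""
    per_cat = defaultdict(list)
    for cat, p, t, parts in shapes:
        per_cat[cat].append(shape_iou(p, t, parts))
    return sum(sum(v) / len(v) for v in per_cat.values()) / len(per_cat)
```

Instance mIoU weights every shape equally, so large categories dominate, while category mIoU weights every category equally. This is why the two columns of Table I can differ by several points.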

In Table II, we use the whole train set of ShapeNet (without annotations) for pretraining and use 1%/5% of the train set of ShapeNetPart for fine-tuning. These two settings follow [15, 24, 50, 53], to make a fair comparison. To assess the transferability of the self-supervised methods, we train our model on ShapeNet [3] and then evaluate it on ShapeNetPart. As the results in Table II show, our method performs better than the other methods in transfer learning to ShapeNetPart.

Moreover, in the setting of GraphTER [13], 100% of the train set of ShapeNetPart (with labels) is used for supervised fine-tuning, with the same pretraining as ours. Our setting is therefore more challenging, as it uses less data in fine-tuning. Under the same setting as GraphTER [13], our method (83.2%) outperforms GraphTER (81.9%) on the ShapeNetPart segmentation task.

Setting       Method           Acc. (%)
Supervised    3DmFV [2]        73.8
Supervised    PointNet [4]     79.2
Supervised    SpiderCNN [45]   79.5
Supervised    DGCNN [39]       86.2
Supervised    PointCNN [25]    85.5
Supervised    GDANet [44]      88.1
Supervised    Point-BERT [48]  88.1
Unsupervised  PointGLR [32]    86.9
Unsupervised  Ours             87.9
TABLE IV: Comparison of classification results on real-world ScanObjectNN dataset (OBJ ONLY).
Fig. 4: Different perturbation manners for the assistant branch (corresponding to Table VI).

IV-C Unsupervised Point Cloud Classification

For unsupervised classification, we first obtain the self-supervised shape features from the ModelNet40 [41] and ScanObjectNN [38] datasets using the self-supervised pre-trained model. Then we use a linear SVM [6] to classify the self-supervised shape features. ModelNet40 [41] is a benchmark dataset for shape classification. It contains 9,843 training samples, 2,468 testing samples and 40 object categories, where the points are sampled from CAD models. ScanObjectNN [38] is a real-world dataset, where 2,902 3D objects are extracted from scans. In our classification experiments, we sample 1,024 points from each point cloud for training and evaluation. All our results are measured using a single view, without the multi-view voting trick, to show the plain performance of different models. For our models trained on ModelNet40, surface normal vectors are used to provide self-supervised signals but are not used as input. For the models trained on ScanObjectNN, we do not use normal vectors, because the normal vectors of real-world data are not accurate enough.

Our CP-Net mIoU(%)
only basic branch 78.4
only assistant branch 78.2
basic branch and assistant branch 81.5
TABLE V: Ablation study of the branches. We report the mIoU on ShapeNetPart semi-supervised segmentation with 5% train data, where the self-supervised features are learned from ShapeNetPart.
Perturbation Manners mIoU
A Randomly delete a cluster part 78.7
B Delete content points 80.8
C Delete contour points 81.0
D Randomly delete contour or content points 79.9
E Jitter all points 79.6
F Randomly jitter a cluster part 78.8
G Jitter content points 80.8
H Jitter contour points 81.5
I Randomly jitter contour or content points 81.1
TABLE VI: Ablation Study of the perturbation of contour-perturbed augmentation module. We report the mIoU on ShapeNetPart semi-supervised segmentation with 5% train data, where the self-supervised features are learned from ShapeNetPart.

Unsupervised learning on ModelNet40. As shown in Table III, we compare the performance of unsupervised classification methods and fully-supervised classification methods. The top part (Supervised) is the result of the fully-supervised SOTA methods, the middle part (Unsupervised simple) is the classification result of self-supervised feature learning on ModelNet40, and the bottom part (Unsupervised difficult) is the unsupervised transfer learning from the ShapeNet to the ModelNet40 dataset.

As for learning on ModelNet40, many studies [39, 25, 27] have shown that performance on ModelNet40 has gradually saturated. Our method achieves 92.5%, comparable with the SOTA unsupervised method PointGLR [32]. More notably, it achieves very competitive results compared to SOTA supervised methods (92.9% without the voting trick) in an unsupervised manner. This indicates that our method can discover a global semantic representation shared across different kinds of point clouds.

To better evaluate and further explore the generalization ability of the learned representation, we use a more challenging transfer setting (Table III, bottom): we test our method with transfer learning from ShapeNet to ModelNet40, i.e., ShapeNet (unsupervised pretraining) + ModelNet40 (SVM for evaluating the unsupervised representation). Our method largely outperforms the SOTA approaches. This shows that the unsupervised features from our method are more generic than those of other methods.

# Jittered Points 512 1024 1536 2048
mIoU (%) 80.9 81.5 81.3 79.6
TABLE VII: Evaluation of the mIoU when increasing the number of jittered points of the contour points (all points number is 2048). We report the mIoU on ShapeNetPart segmentation with 5% train data.
STD 0.01 0.02 0.03
mIoU (%) 80.5 81.5 80.7
TABLE VIII: Experiments of the std of the normal distributed noise. We report the mIoU on ShapeNetPart segmentation with 5% train data.

Unsupervised Learning on ScanObjectNN. To verify the effectiveness of our method more comprehensively, we conduct the same unsupervised classification task on ScanObjectNN [38]. ScanObjectNN is used to investigate robustness to noisy real-world objects with deformed geometric shape and non-uniform surface density. We apply our model to the OBJ ONLY variant (the simplest variant of the dataset). The results are summarized in Table IV. Our method achieves accuracy comparable with the fully-supervised methods, which shows that our method is practical on real-world data.

V Network Analysis

In this section, we first present the ablation studies of our framework. Second, we analyze the details of the contour perturbation module. We then analyze the normal estimation loss, the self-reconstruction loss and the dual-branch consistency loss. Finally, we analyze the robustness of our method to sampling density.

V-A Ablation Studies of the Network Architecture

To examine the effectiveness of our designs, we conduct architecture ablation studies based on our framework. Table V compares the branches of our framework: when we use only the basic branch or only the assistant branch, there is a large performance gap compared with the complete dual-branch model. We conclude that the features of both the contour components (basic branch) and the semantic content components (assistant branch) are important for self-supervised feature representation. More importantly, consistency learning between the two branches largely improves performance by strengthening the relevance between the representations of semantic content and structural contour.

V-B Analysis of Contour-perturbed Augmentation Module

To explore an effective way to perturb the point cloud for self-supervised feature learning, we design a variety of perturbations for the input of the assistant branch, as shown in Table VI and Figure 4. We first consider part-level aggregation of the original point cloud. Specifically, we use non-negative matrix factorization [22] to cluster the original point cloud, and then randomly select one cluster as the cluster part used in Models A and F. However, whether we jitter or delete a randomly chosen cluster part, the results are not ideal: there is a large gap between the parts obtained by clustering and the ground truth, which might mislead the network into learning distracting information. To verify the effectiveness of the contour-perturbed augmentation module, we conduct a series of comparative experiments (Models B, C, D, G, H, I). Model C indicates that the contour points carry useful information; Models B and D show that deleting the content points harms performance. Models G, H and I verify that jittering contour points can highlight the learning of holistic and generic information.

Based on Model H, we further analyze the perturbation details. Table VII shows the mIoU as the number of jittered contour points increases; we obtain the best performance when jittering 1024 points. For the normally distributed noise, we conduct a series of experiments on its standard deviation, shown in Table VIII.
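As a rough illustration of the perturbation studied in Tables VII and VIII, the sketch below jitters the k points with the highest contour score using zero-mean Gaussian noise and leaves the remaining (content) points untouched. The function name `jitter_contour_points` and the per-point `contour_score` are assumptions made for illustration: the score is only a stand-in for the output of the geometry-disentangling step, and the sketch is not the authors' implementation. Only the settings k = 1024 out of 2048 points and std = 0.02 are taken from the tables.

```python
import random


def jitter_contour_points(points, contour_score, k=1024, std=0.02, seed=0):
    """Return a perturbed copy of `points` and the set of jittered indices.

    points:        list of (x, y, z) tuples
    contour_score: list of floats, one per point; a hypothetical stand-in for
                   the geometry-disentangling output (higher = more contour-like)
    """
    if len(points) != len(contour_score):
        raise ValueError("points and contour_score must have the same length")
    rng = random.Random(seed)
    # Indices of the k points with the largest contour score.
    order = sorted(range(len(points)), key=lambda i: contour_score[i], reverse=True)
    contour_idx = set(order[:k])
    perturbed = []
    for i, (x, y, z) in enumerate(points):
        if i in contour_idx:
            # Contour point: add independent Gaussian noise to each coordinate.
            perturbed.append((x + rng.gauss(0.0, std),
                              y + rng.gauss(0.0, std),
                              z + rng.gauss(0.0, std)))
        else:
            # Content point: kept unchanged.
            perturbed.append((x, y, z))
    return perturbed, contour_idx


# Toy usage: 2048 random points with random (made-up) contour scores.
_rng = random.Random(1)
cloud = [(_rng.uniform(-1, 1), _rng.uniform(-1, 1), _rng.uniform(-1, 1))
         for _ in range(2048)]
scores = [_rng.random() for _ in range(2048)]
perturbed_cloud, jittered = jitter_contour_points(cloud, scores, k=1024, std=0.02)
```

Keeping the content points bit-identical makes it easy to check that only the selected contour subset was modified.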

Model | mIoU (%)
(a)   | 71.7
(b)   | 75.9
(c)   | 79.8
(d)   | 78.4
(e)   | 81.2
(f)   | 80.8
(g)   | 81.5
TABLE IX: Ablation study of losses. We report the mIoU on ShapeNetPart semi-supervised segmentation with 5% train data, where the self-supervised features are learned from ShapeNetPart.

Model | mIoU (%)
a     | 81.5
b     | 80.5
c     | 80.2
TABLE X: Ablation study of the dual-branch consistency loss for the segmentation downstream task. We report the mIoU on ShapeNetPart semi-supervised segmentation with 5% train data, where the self-supervised features are learned from ShapeNetPart.

Model | Acc. (%)
A     | 91.3
B     | 91.5
C     | 92.5
TABLE XI: Ablation study of the consistency loss for the classification downstream task. We report the accuracy on the ModelNet40 test set.

V-C Analysis of Network Losses

Fig. 5: Sampling density robustness test compared to the supervised version on classification and part segmentation. (a) Test results on ModelNet40 when using sparser points as the input to a model trained with 1,024 points. (b) Test results on ShapeNetPart when using sparser points as the input to a model trained with 2,048 points.

In Table IX, we report the segmentation mIoU with 5% train data, where the self-supervised representation is learned from the ShapeNetPart dataset. Models (a), (b) and (d) are based on our baseline model, which is trained with the basic branch only, without the assistant branch. Model (a) can be viewed as a variant of FoldingNet [46]: it is trained with the self-reconstruction loss only and gets a low segmentation mIoU of 71.7%. Models (b) and (d) improve on this (75.9% and 78.4%) with the normal estimation loss: because normal estimation is a point-wise task, this self-supervised signal affects the feature of each point and thereby improves part segmentation. When we use only the dual-branch consistency loss, without the reconstruction and normal estimation losses, we still obtain 79.8%. Compared with models (a), (b) and (d), model (c) shows that the contour-perturbed augmentation module with the dual-branch consistency loss greatly improves performance. On top of the dual-branch consistency loss of model (c), we add the normal estimation module and the self-reconstruction module, which gives models (e), (f) and (g). These bring further improvements, and the best performance is 81.5%, achieved by model (g).

Here, we further analyze the dual-branch consistency losses. These losses minimize the feature distance between the point clouds from the basic branch and the assistant branch, leading the self-supervisor to obtain diverse and effective representations. Because different downstream tasks attend to different aspects of the point cloud representation, we conduct ablation studies of the dual-branch consistency losses for both the classification and the part segmentation downstream tasks. As Table XI shows, for classification, the global consistency loss and the local-to-global consistency loss play a critical role in performance improvement. For part segmentation, which focuses more on point-wise local representation, the local consistency loss is more critical, as Table X indicates. Thus, when pre-training for the part segmentation task, we can use the local consistency loss as the dual-branch consistency loss.
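To make the idea of "minimizing the feature distance between the two branches" concrete, the following minimal sketch computes a local (point-wise) and a global (pooled) consistency term. It assumes a squared Euclidean distance and channel-wise max pooling for the global feature; both choices, and all function names, are illustrative assumptions rather than the exact losses used in the paper. The local-to-global term is omitted because its form is not specified in this section.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def max_pool(features):
    """Channel-wise max over per-point features (a list of equal-length lists)."""
    return [max(channel) for channel in zip(*features)]


def local_consistency(feat_basic, feat_assist):
    """Mean squared distance between matching per-point features of the two branches."""
    if len(feat_basic) != len(feat_assist):
        raise ValueError("both branches must provide one feature per point")
    total = sum(squared_distance(p, q) for p, q in zip(feat_basic, feat_assist))
    return total / len(feat_basic)


def global_consistency(feat_basic, feat_assist):
    """Squared distance between the max-pooled (global) features of the two branches."""
    return squared_distance(max_pool(feat_basic), max_pool(feat_assist))


# Toy per-point features (2 points, 2 channels) from the two branches.
basic = [[0.0, 1.0], [2.0, 3.0]]
assist = [[0.0, 1.0], [2.0, 5.0]]
local_loss = local_consistency(basic, assist)    # (0 + 4) / 2 = 2.0
global_loss = global_consistency(basic, assist)  # [2, 3] vs [2, 5] -> 4.0
```

Under this reading, a segmentation-oriented pre-training run would weight the local term, while a classification-oriented one would rely on the pooled term.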

Fig. 6: Comparison of feature representations. We show the supervised version (a), the self-supervised baseline (b) and our self-supervised CP-Net (c). The color of each point indicates its activation value, obtained by averaging all the entries in its feature vector.

V-D Robustness Analysis

Figure 5 shows the robustness of our method to sampling density compared to the supervised version. Following [32], we use sparse points (1024, 512, 256, 128, 64) as the test input of the classification model and the part segmentation model. For classification (Figure 5 (a)), we feed sparse point clouds to the model trained with 1024 points to obtain the self-supervised features, and then use a linear SVM to produce the classification results. Our self-supervised classification model is much more robust to sampling density than the supervised model: even with only 128 points for testing, the accuracy reaches 84.3%. For part segmentation (Figure 5 (b)), we likewise feed point clouds of different densities to a segmentation model trained with 2048 points, and use the same setting as in Table I to produce the segmentation results. We conclude that our self-supervised method is more robust to point cloud density.
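The sparse test inputs described above could be produced along the following lines. This is only a sketch: it assumes uniform random subsampling without replacement, which the text does not specify (another sampling scheme may have been used), and the helper name `subsample` is hypothetical.

```python
import random


def subsample(points, n, seed=0):
    """Randomly keep n points without replacement (assumed sampling scheme)."""
    if n > len(points):
        raise ValueError("cannot sample more points than the cloud contains")
    return random.Random(seed).sample(points, n)


# Toy cloud of 1024 points, reduced to the densities used in the robustness test.
full = [(float(i), 0.0, 0.0) for i in range(1024)]
sparse_inputs = {n: subsample(full, n) for n in (1024, 512, 256, 128, 64)}
```

Each sparse cloud would then be passed through the frozen encoder trained at full density, so that any accuracy drop reflects sensitivity to density rather than retraining effects.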

VI Visualization

VI-A Reasonable Segmentation

In Tables I and II, we show the part segmentation results of our CP-Net on the ShapeNetPart dataset [47]. The mIoU is an important evaluation metric that indicates the overall statistical performance on the test dataset. However, the ground truth itself can occasionally be ambiguous. To present our results more comprehensively, we visualize some test segmentation results of our method alongside the ground truth in Figure 7. Although our results differ from the manual annotations, both are reasonable ways of segmenting the objects. Our method assigns the lamp bracket and the lamp rope to the same category, while the ground truth labels treat bases and brackets as the same category. The situation is similar for shank, chair leg brackets …
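For readers less familiar with the metric, the sketch below shows one common way to compute a per-shape part IoU and average it over shapes. It assumes the convention, widespread in ShapeNetPart evaluation code, that a part absent from both the prediction and the ground truth counts as IoU 1; the paper does not state its exact protocol here, so treat this as an illustration with hypothetical function names.

```python
def shape_iou(pred, gt, part_ids):
    """Mean IoU over the parts of a single shape.

    pred, gt:  per-point part labels (lists of equal length)
    part_ids:  labels that are valid for this shape's category
    """
    ious = []
    for part in part_ids:
        inter = sum(1 for p, g in zip(pred, gt) if p == part and g == part)
        union = sum(1 for p, g in zip(pred, gt) if p == part or g == part)
        # Assumed convention: a part missing from both pred and gt scores 1.
        ious.append(1.0 if union == 0 else inter / union)
    return sum(ious) / len(ious)


def mean_iou(shapes):
    """Average shape_iou over an iterable of (pred, gt, part_ids) triples."""
    scores = [shape_iou(pred, gt, ids) for pred, gt, ids in shapes]
    return sum(scores) / len(scores)


# Toy shape with 4 points and 3 possible parts; one point is mislabeled.
gt = [0, 0, 1, 1]
pred = [0, 1, 1, 1]
score = shape_iou(pred, gt, part_ids=[0, 1, 2])
```

Because the score is computed against a single annotation, a prediction that groups parts differently but sensibly, as in the lamp example, is still penalized.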

VI-B Feature Representation Visualization

To gain an intuitive understanding of our models, we color the point clouds of the ShapeNetPart test set according to their point-wise feature response in Figure 6. Points that belong to the same part have similar activations. The fully supervised feature representations are well distinguished among the different parts, whereas the self-supervised feature representations from the baseline model are confused among the different parts. Compared with the baseline, our self-supervised CP-Net shows clearer distinction, which verifies that our model learns semantic content more effectively. Our method still needs to be improved on some object details, such as the lamp.
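The per-point activation used for this coloring is the average of all entries of the point's feature vector. A minimal sketch of that step follows; the min-max scaling to [0, 1] before applying a colormap is an added assumption for illustration, and the function names are hypothetical.

```python
def point_activations(features):
    """Average all entries of each point's feature vector."""
    return [sum(f) / len(f) for f in features]


def normalize(values):
    """Min-max scale to [0, 1] for a colormap (assumed step); constant input maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy features for three points with two channels each.
feats = [[0.0, 2.0], [1.0, 3.0], [4.0, 4.0]]
acts = point_activations(feats)
colors = normalize(acts)
```

Points of the same part should then receive similar scalar values, which is what makes part boundaries visible in the colored clouds.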

Fig. 7: Reasonable point cloud segmentation results. The first row shows the ground truth part labels. The second row shows the labels predicted by our method. Even though the ground truth is noisy, our method obtains reasonable segmentations of the objects.

VII Conclusion

We propose a dual-branch CP-Net for self-supervised point cloud learning. Equipped with the contour-perturbed augmentation module and the dual-branch consistency loss, our CP-Net not only preserves the discriminative representation of easy-to-learn structural contour information, but also extracts the semantic content information that is hard for self-supervisors to learn. Extensive experiments demonstrate the performance, transferability and robustness of our CP-Net.


  • [1] P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910. Cited by: §II-B.
  • [2] Y. Ben-Shabat, M. Lindenbaum, and A. Fischer (2018) 3dmfv: three-dimensional point cloud classification in real-time using convolutional neural networks. IEEE Robotics and Automation Letters. Cited by: TABLE IV.
  • [3] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §IV-B, §IV-B.
  • [4] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3d classification and segmentation. CVPR. Cited by: §I, §II-A, §III-C, TABLE I, TABLE III, TABLE IV.
  • [5] Y. Chen, J. Liu, B. Ni, H. Wang, J. Yang, N. Liu, T. Li, and Q. Tian (2021) Shape self-correction for unsupervised point cloud understanding. In ICCV, Cited by: §II-B, TABLE II, TABLE III.
  • [6] C. Cortes and V. Vapnik (1995) Support-vector networks. Machine learning. Cited by: §IV-C.
  • [7] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In ICCV, Cited by: §II-B.
  • [8] C. Doersch and A. Zisserman (2017) Multi-task self-supervised visual learning. In ICCV, Cited by: §II-B.
  • [9] B. Du, X. Gao, W. Hu, and X. Li (2021) Self-contrastive learning with hard negative sampling for self-supervised point cloud learning. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 3133–3142. Cited by: TABLE II.
  • [10] H. Fan, H. Su, and L. J. Guibas (2017) A point set generation network for 3d object reconstruction from a single image. In CVPR, Cited by: §III-C.
  • [11] M. Feng, S. Z. Gilani, Y. Wang, L. Zhang, and A. Mian (2021) Relation graph network for 3d object detection in point clouds. IEEE Transactions on Image Processing. Cited by: §I.
  • [12] M. Gadelha, R. Wang, and S. Maji (2018) Multiresolution tree networks for 3d point cloud processing. In ECCV, Cited by: §I.
  • [13] X. Gao, W. Hu, and G. Qi (2020) GraphTER: unsupervised learning of graph transformation equivariant representations via auto-encoding node-wise transformations. In CVPR, Cited by: §II-B, §IV-B.
  • [14] Z. Han, X. Wang, Y. Liu, and M. Zwicker (2019) Multi-angle point cloud-vae: unsupervised feature learning for 3d point clouds from multiple angles by joint self-reconstruction and half-to-half prediction. In ICCV, Cited by: §I.
  • [15] K. Hassani and M. Haley (2019) Unsupervised multi-task feature learning on point clouds. In ICCV, Cited by: §I, §II-B, TABLE II, §IV-B, TABLE III.
  • [16] O. Henaff (2020) Data-efficient image recognition with contrastive predictive coding. In ICML, Cited by: §II-B.
  • [17] Q. Huang, W. Wang, and U. Neumann (2018) Recurrent slice networks for 3d segmentation of point clouds. In CVPR, Cited by: TABLE I.
  • [18] X. Huang, J. Zhang, L. Fan, Q. Wu, and C. Yuan (2017) A systematic approach for cross-source point cloud registration by preserving macro and micro structures. IEEE Transactions on Image Processing. Cited by: §I.
  • [19] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §IV-A.
  • [20] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-A.
  • [21] R. Klokov and V. Lempitsky (2017) Escape from cells: deep kd-networks for the recognition of 3d point cloud models. In ICCV, Cited by: TABLE I.
  • [22] D. D. Lee and H. S. Seung (1999) Learning the parts of objects by non-negative matrix factorization. Nature. Cited by: §V-B.
  • [23] C. Li, M. Zaheer, Y. Zhang, B. Poczos, and R. Salakhutdinov (2018) Point cloud gan. arXiv preprint arXiv:1810.05795. Cited by: §I.
  • [24] J. Li, B. M. Chen, and G. Hee Lee (2018) SO-net: self-organizing network for point cloud analysis. In CVPR, Cited by: TABLE I, TABLE II, §IV-B, TABLE III.
  • [25] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen (2018) Pointcnn: convolution on x-transformed points. In NIPS, Cited by: §II-A, §IV-C, TABLE III, TABLE IV.
  • [26] X. Liu, Z. Han, X. Wen, Y. Liu, and M. Zwicker (2019) L2g auto-encoder: understanding point clouds by local-to-global reconstruction with hierarchical self-attention. In Proceedings of the 27th ACM International Conference on Multimedia, Cited by: §I.
  • [27] Y. Liu, B. Fan, S. Xiang, and C. Pan (2019) Relation-shape convolutional neural network for point cloud analysis. Cited by: §I, §II-A, §III-A, §III-C, TABLE I, §IV-C, TABLE III.
  • [28] S. Milani, E. Polo, and S. Limuti (2020) A transform coding strategy for dynamic point clouds. IEEE Transactions on Image Processing. Cited by: §I.
  • [29] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703. Cited by: §IV-A.
  • [30] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §I.
  • [31] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In NIPS, Cited by: §I, §II-A, §III-A, TABLE I, TABLE III.
  • [32] Y. Rao, J. Lu, and J. Zhou (2020) Global-local bidirectional reasoning for unsupervised representation learning of 3d point clouds. In CVPR, Cited by: §I, §II-B, §III-C, §IV-C, TABLE III, TABLE IV, §V-D.
  • [33] R. B. Rusu, Z. C. Marton, N. Blodow, M. Dolha, and M. Beetz (2008) Towards 3d point cloud based object maps for household environments. Robotics and Autonomous Systems. Cited by: §I.
  • [34] Y. Shen, C. Feng, Y. Yang, and D. Tian (2018) Mining point cloud local structures by kernel correlation and graph pooling. In CVPR, Cited by: TABLE I.
  • [35] M. Shoef, S. Fogel, and D. Cohen-Or (2019) Pointwise: an unsupervised point-wise feature learning network. arXiv preprint arXiv:1901.04544. Cited by: §I.
  • [36] H. Thomas, C. R. Qi, J. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas (2019) KPConv: flexible and deformable convolution for point clouds. arXiv preprint arXiv:1904.08889. Cited by: §I, TABLE III.
  • [37] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §II-B.
  • [38] M. A. Uy, Q. Pham, B. Hua, T. Nguyen, and S. Yeung (2019) Revisiting point cloud classification: a new benchmark dataset and classification model on real-world data. In ICCV, Cited by: §IV-C, §IV-C.
  • [39] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2018) Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829. Cited by: §I, §II-A, TABLE I, §IV-B, §IV-C, TABLE III, TABLE IV.
  • [40] W. Wu, Z. Qi, and L. Fuxin (2019) Pointconv: deep convolutional networks on 3d point clouds. In CVPR, Cited by: §II-A.
  • [41] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In CVPR, Cited by: §IV-C.
  • [42] S. Xie, J. Gu, D. Guo, C. R. Qi, L. Guibas, and O. Litany (2020) PointContrast: unsupervised pre-training for 3d point cloud understanding. In European Conference on Computer Vision, Cited by: §II-B, TABLE II.
  • [43] M. Xu, Z. Zhou, and Y. Qiao (2020) Geometry sharing network for 3d point cloud classification and segmentation. AAAI. Cited by: §I, §II-A.
  • [44] M. Xu, J. Zhang, Z. Zhou, M. Xu, X. Qi, and Y. Qiao (2020) Learning geometry-disentangled representation for complementary understanding of 3d object point cloud. arXiv preprint arXiv:2012.10921. Cited by: §I, §III-B, §III-B, TABLE IV.
  • [45] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao (2018) Spidercnn: deep learning on point sets with parameterized convolutional filters. In ECCV, Cited by: §I, TABLE IV.
  • [46] Y. Yang, C. Feng, Y. Shen, and D. Tian (2018) Foldingnet: point cloud auto-encoder via deep grid deformation. In CVPR, Cited by: §I, §II-B, §III-A, §III-C, TABLE III, §V-C.
  • [47] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas (2016) A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (ToG). Cited by: §IV-B, §VI-A.
  • [48] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu (2021) Point-bert: pre-training 3d point cloud transformers with masked point modeling. arXiv preprint arXiv:2111.14819. Cited by: TABLE IV.
  • [49] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola (2017) Deep sets. Cited by: §II-A.
  • [50] M. Zhang, P. Kadam, S. Liu, and C. J. Kuo (2020) Unsupervised feedforward feature (uff) learning for point cloud classification and segmentation. In 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP), Cited by: §I, §II-B, TABLE II, §IV-B, TABLE III.
  • [51] M. Zhang, H. You, P. Kadam, S. Liu, and C. J. Kuo (2020) Pointhop: an explainable machine learning method for point cloud classification. IEEE Transactions on Multimedia. Cited by: TABLE III.
  • [52] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun (2020) Point transformer. arXiv preprint arXiv:2012.09164. Cited by: §III-A, §IV-A.
  • [53] Y. Zhao, T. Birdal, H. Deng, and F. Tombari (2019) 3D point capsule networks. In CVPR, Cited by: §I, §II-B, TABLE II, §IV-B, TABLE III.