Label-Efficient Learning on Point Clouds using Approximate Convex Decompositions

03/30/2020 ∙ by Matheus Gadelha, et al. ∙ University of Massachusetts Amherst

The problems of shape classification and part segmentation from 3D point clouds have garnered increasing attention in the last few years. Both of these problems, however, suffer from relatively small training sets, creating the need for statistically efficient methods to learn 3D shape representations. In this paper, we investigate the use of Approximate Convex Decompositions (ACD) as a self-supervisory signal for label-efficient learning of point cloud representations. We show that using ACD to approximate ground truth segmentation provides excellent self-supervision for learning 3D point cloud representations that are highly effective on downstream tasks. We report improvements over the state-of-the-art for unsupervised representation learning on the ModelNet40 shape classification dataset and significant gains in few-shot part segmentation on the ShapeNetPart dataset. Code available at https://github.com/matheusgadelha/PointCloudLearningACD


1 Introduction

Figure 1: Overview of our method vs. a fully-supervised approach. Top: Approximate Convex Decomposition (ACD) can be applied to a large repository of unlabeled point clouds, yielding a self-supervised training signal for the neural network without involving any human annotators. Bottom: the usual fully-supervised setting, where human annotators label the semantic parts of point clouds, which are then used as supervision for the neural network. The unsupervised ACD task results in the neural network learning useful representations from unlabeled data, significantly improving performance in shape classification and semantic segmentation when labeled data is scarce or unavailable.

The performance of current deep neural network models on tasks such as classification and semantic segmentation of point cloud data is limited by the amount of high-quality labeled data available for training. Since collecting high-quality annotations on point cloud data is often time-consuming and costly, there have been increasing efforts to circumvent this problem by training neural networks on noisy or weakly labeled datasets [46], or by training in completely unsupervised ways [20, 14, 67, 66, 8].

A ubiquitous technique for training deep networks is to first train the network on one task to initialize its parameters and learn generically useful features, and then fine-tune it on the final task. In particular, there has been great interest in so-called self-supervised tasks for initialization. These tasks require no human annotations; instead, labels are generated automatically, i.e., in a self-supervised manner, via techniques such as clustering, solving jigsaw puzzles, and colorization. There have been a few recent attempts to devise similar tasks for 3D data [20, 8]. The overarching question here is "what makes for a good self-supervision task?" – what useful inductive biases does a model learn from solving such a task that benefit the actual downstream target task we are interested in?

We propose using a classical shape decomposition method, Approximate Convex Decomposition (ACD), as the self-supervisory signal to train neural networks built to process 3D data. We posit that being able to decompose a shape into geometrically simple constituent parts provides an excellent self-supervisory learning signal for such purposes. As shown in Figure 2, ACD decomposes shapes into segments that roughly align with instances of different parts; e.g., the two wings of an airplane are decomposed into two separate approximately convex parts. Many man-made shapes are shaped by physical and geometric constraints: convex parts tend to be easy to manufacture, strong, and aerodynamic. However, strictly convex decomposition often leads to highly over-segmented shapes. For that reason, we choose approximate convex decomposition, which we show benefits a number of learning tasks.

Our approach is illustrated in Figure 1. The main idea is to automatically generate training data by decomposing unlabeled 3D shapes into convex components. Since ACD relies solely on geometric information to perform its decomposition, the process does not require any human intervention. From the model perspective, we formulate ACD as a metric learning problem on point embeddings and train the model using a contrastive loss [19, 9]. We demonstrate the effectiveness of our approach on standard 3D shape classification and segmentation benchmarks. In classification, we show that the representations learned from performing shape decomposition lead to features that achieve state-of-the-art performance on ModelNet40 [64] unsupervised shape classification (89.8%; see Table 1). For few-shot part segmentation on ShapeNet [7], our model outperforms the state-of-the-art by 7.5% mIoU when using 1% of the available labeled training data. Moreover, unlike other unsupervised approaches, our method can be applied to any of the well-known neural network backbones for point cloud processing. Finally, we provide thorough experimental analysis and visualizations demonstrating the role of the ACD self-supervision in the representations learned by neural networks.

2 Related Work

Learning Representations on 3D data.

Shape representations using neural networks have been widely studied in computer graphics and computer vision. Occupancy grids have been used to represent shapes for classification and segmentation tasks [33]; however, they suffer from issues of computational and memory efficiency, which were later circumvented by architectures using spatial partitioning data structures [42, 27, 57, 58]. Multi-view approaches [22, 52, 26, 48, 50] learn representations by using order-invariant pooling of features from multiple rendered views of a shape. Another class of methods takes a point cloud representation (i.e., a set of coordinates) as input and learns permutation-invariant representations [61, 14, 40, 41, 66, 20, 13]. Point clouds are a compact 3D representation that suffers neither from the memory constraints of volumetric representations nor the visibility issues of multi-view approaches. However, all these approaches rely on massive amounts of labeled 3D data. In this paper, we focus on developing a technique to allow label-efficient representation learning on point clouds. Our approach is architecture-agnostic and relies on learning approximate convex decompositions, which can be automatically computed from a variety of shape representations without any human intervention.

Approximate Convex Decompositions. Early cognitive science literature has demonstrated that humans tend to reason about 3D shapes as unions of convex components [21]. However, exact convex decomposition is an NP-hard problem that leads to an undesirably high number of components on realistic shapes [4]. Thus, we are interested in a particular class of decomposition techniques named Approximate Convex Decomposition (ACD) [31, 72, 25, 32], which compute components that are approximately convex, up to a concavity tolerance ε. This makes the computation significantly more efficient and leads to shape approximations containing a smaller number of components. These approximations are useful for a variety of tasks, like mesh generation [31] and collision detection [62]. ACDs are also an important step in non-parametric shape segmentation methods [25, 3]. Furthermore, ACD is shown to have a low Rand index compared to human segmentations on the PSB benchmark [25], which indicates that it is a reasonable proxy for our intuitions about shape parts. In this work, we use a particular type of ACD named Volumetric Hierarchical Approximate Convex Decomposition (V-HACD) [32] – details in Section 3.1. Differently from non-parametric approaches, our goal is to use ACD as a self-supervisory task to improve point cloud representations learned by deep neural networks. We show that the training signal provided by ACD leads to improvements not only in semantic segmentation, but also in unsupervised shape classification.

Self-supervised learning.

In many situations, unlabeled images or videos themselves contain information that can be leveraged to provide a training loss for learning useful representations. Self-supervised learning explores this idea, utilizing unlabeled data to train deep networks by solving proxy tasks that require no human annotation effort, such as predicting data transformations [36, 16, 37] or clustering [5, 6]. Learning to colorize grayscale images was among the first approaches to training modern deep neural networks in a self-supervised fashion [29, 69, 70] – being able to predict the correct color for an image requires some understanding of a pixel's semantic meaning (e.g., skies are blue, grass is green, etc.), leading to representations useful in downstream tasks like object classification. The contextual information in an image also lends itself to the design of proxy tasks – learning to predict the relative positions of cropped image patches as in Doersch et al. [11], similarity of patches tracked across videos [59, 60], or inpainting a missing patch in an image by leveraging the context from the rest of the image [39, 54]. Motion from unlabeled videos also provides a useful pre-training signal, as shown by Pathak et al. [38] using motion segmentation, and by Jiang et al. [24], who predict relative depth as pre-training for downstream scene understanding tasks. Other approaches include solving jigsaw puzzles with permuted image patches [36] and training a generative adversarial model [12]. Empirical comparisons of various self-supervised tasks may be found in [17, 28]. In the case of limited samples, i.e., the few-shot classification setting, including self-supervised losses along with the usual supervised training is shown to be beneficial by Su et al. [51]. Recent work has also focused on learning unsupervised representations for 3D shapes using tasks such as clustering [20] and reconstruction [45, 67], which we compare against in our experiments.

Label-efficient representation learning on point clouds. Several recent approaches [35, 74, 20, 8, 46] have been proposed to alleviate the expensive labeling of shapes. Muralikrishnan et al. [35] learn per-point representations by training the network to predict shape-level tags. Yi et al. [68] embed pre-segmented parts in a descriptor space by jointly learning a metric for clustering parts, assigning tags to them, and building a consistent part hierarchy. Another direction of research is to utilize noisy or weak labels for supervision. Chen et al. [8] proposed a branched autoencoder, where each branch learns coarse part-level features, which are further used to reconstruct the shape by producing implicit fields for each part. However, this approach requires one decoder for every part, which restricts their experiments to category-specific models. Our approach, on the other hand, can be directly applied to any of the well-known point-based architectures, handling multiple categories at once for part segmentation and learning state-of-the-art features for unsupervised shape classification. Furthermore, [8] shows experiments on single-shot semantic segmentation on manually selected shapes, whereas we show results on randomly selected training shapes in the few-shot setting. Most similar to our work, Hassani et al. [20] propose a novel architecture for point clouds which is trained on multiple tasks at the same time: clustering, classification, and reconstruction. In our experiments, we demonstrate that we outperform their method on few-shot segmentation by 7.5% IoU and achieve the same performance on unsupervised ModelNet40 classification using only ACD as a proxy task; if we further add a reconstruction term, our method achieves state-of-the-art performance in unsupervised shape classification. Finally, Sharma et al. [46] proposed learning point embeddings by utilizing noisy part labels and semantic tags freely available in a 3D Warehouse dataset; the model learned in this way is used for a few-shot semantic segmentation task. In this work, we instead obtain part labels using approximate convex decomposition, whose computation is completely automatic and can be applied to any mesh regardless of the existence of semantic tags.

3 Method

Figure 2: Input point clouds (first row), convex components automatically computed by ACD (second row), and human-labeled point clouds (last row) from the ShapeNet [7] part segmentation benchmark. Note: (i) different colors for the ACD components only signify different parts – no semantic meaning or inter-shape correspondence is inferred by this procedure; (ii) for the human labels, colors do convey semantic meaning: e.g., the backs of chairs are always orange; (iii) while the ACD decompositions tend to over-segment the shapes, they contain most of the boundaries present in the human annotations, suggesting that ACD follows similar criteria when decomposing objects into subparts; e.g., the chair's legs are separated from the seat, the wings and engines are separated from the airplane body, the pistol's trigger is separated from the rest, etc.

3.1 Approximate Convex Decomposition

In this subsection, we provide an overview of the shape decomposition approach used to generate the training data for our self-supervised task. A detailed description of the method used in this work can be found in [32].

Decomposing complex shapes into sets of convex components is a well-studied problem [31, 32, 25, 72]. Given a polyhedron P, the goal is to compute the smallest set of convex polyhedra Σ = {σ_1, ..., σ_K} such that their union corresponds to P. However, exact convex decomposition of polyhedra is an NP-hard problem [4] and leads to decompositions containing too many components, rendering it impractical for most applications (ours included). This can be circumvented by Approximate Convex Decomposition (ACD) techniques. ACD relaxes the convexity constraint of exact convex decomposition by allowing every component to be approximately convex, up to a concavity tolerance ε. How concavity is computed and how the components are split varies between methods [32, 15, 31, 25, 72]. In this work, we use an approach called Volumetric Hierarchical Approximate Convex Decomposition (V-HACD) [32]. The reasons for utilizing this approach are threefold. First, as the name suggests, V-HACD performs computations on volumetric representations, which can be easily computed from dense point cloud data or meshes and lead to good results without resorting to costly decimation and remeshing procedures. Second, the procedure is reasonably fast and can be applied to open surfaces of arbitrary genus. Third, V-HACD guarantees that no two components overlap, which means that no part of the surface is approximated by more than one component. In the next paragraph, we describe V-HACD in detail.

V-HACD. Since the method operates on volumetric representations, the first step is to convert the shape into an occupancy grid. If the shape is represented as a point cloud, one can compute an occupancy grid by marking which cells are occupied by points and filling the interior. In our case, since our training shapes come from ShapeNet [7] with meshes, we compute the occupancy grid by voxelizing the meshes using [23]. Once the voxelization is computed, the algorithm computes convex components by recursively splitting the volume into two parts. First, the volume is centered and aligned in the coordinate system according to its principal axes. Then, one of the three axis-aligned planes is selected as a splitting plane that separates the volume into two parts. This procedure is repeated until the maximum number of desired components or the concavity tolerance is reached. The concavity of a set of components Σ is computed as follows:

\[ \mathrm{Concavity}(\Sigma) = \max_{k \in \{1,\dots,K\}} d\big(\sigma_k,\ \mathrm{CH}(\sigma_k)\big) \tag{1} \]

where d(X, Y) is the difference between the volumes of X and Y; CH(X) is the convex hull of X; and σ_k is the k-th element of the set Σ. The splitting plane is selected by choosing the axis-aligned plane p that minimizes an energy E(ω, p), where ω is the volume we are aiming to split and p is the splitting plane. This energy is defined as:

\[ E(\omega, p) = E_{con}(\omega, p) + \alpha\, E_{bal}(\omega, p) + \beta\, E_{sym}(\omega, p) \tag{2} \]

where E_con is the connectivity component, which measures the sum of the normalized concavities on both sides of the volume; E_bal is the balance component, which measures the dissimilarity between both sides; and E_sym is the symmetry component, which penalizes planes that are orthogonal to a potential revolution axis. α and β are weights for the last two terms; in all our experiments we use their default values. We refer the reader to [32] for a detailed description of the components in the energy term.
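To make the control flow concrete, below is a minimal Python sketch of the recursive splitting loop described above. It is not the V-HACD implementation: the geometric primitives (concavity, candidate planes, the energy of Eq. 2, and the actual volume split) are passed in as hypothetical callables, and only the outer loop follows the procedure of [32].

```python
def vhacd_sketch(volume, concavity, candidate_planes, energy, split,
                 max_parts=32, tol=1e-3):
    """Outer splitting loop of V-HACD as described above (a sketch only).

    The geometric primitives are passed in as callables because they stand
    in for V-HACD internals not implemented here:
      concavity(v)        -> d(v, CH(v)), as in Eq. 1
      candidate_planes(v) -> iterable of axis-aligned splitting planes for v
      energy(v, plane)    -> E(v, plane), as in Eq. 2
      split(v, plane)     -> the two sub-volumes on either side of the plane
    max_parts and tol are illustrative defaults, not the paper's settings.
    """
    parts = [volume]
    while len(parts) < max_parts:
        worst = max(parts, key=concavity)      # most concave component so far
        if concavity(worst) <= tol:            # every part is within tolerance
            break
        plane = min(candidate_planes(worst),   # plane minimizing the energy of Eq. 2
                    key=lambda p: energy(worst, p))
        parts.remove(worst)
        parts.extend(split(worst, plane))      # replace it by its two halves
    return parts
```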

Assigning component labels to point clouds. The output of ACD for every shape is a set of convex components represented by convex meshes. For each shape, we sample points on the original ShapeNet mesh and on the mesh of every ACD component. We then propagate component labels to every point in the original point cloud by nearest-neighbor matching with points sampled from the decomposition. More precisely, given an unlabeled point cloud P, this assigns a component label ℓ(p) to each point p ∈ P via

\[ \ell(p) = \ell\Big( \operatorname*{arg\,min}_{q \in Q} \| p - q \|_2 \Big) \tag{3} \]

where Q is the set of points sampled on the ACD component meshes and ℓ(q) is the component label of q.
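A small sketch of this label-transfer step, assuming the points sampled from the original mesh and from the component meshes are available as arrays (the function name and array layout are ours, not from the paper's code):

```python
import numpy as np
from scipy.spatial import cKDTree

def propagate_component_labels(points, comp_points, comp_labels):
    """Transfer ACD component labels to a point cloud via 1-NN (Eq. 3).

    points:      (N, 3) array sampled from the original shape mesh
    comp_points: (M, 3) array of points sampled from the ACD component meshes
    comp_labels: (M,)   integer component index for each row of comp_points
    """
    tree = cKDTree(comp_points)             # index the decomposition points
    _, nn_idx = tree.query(points, k=1)     # nearest decomposition point
    return np.asarray(comp_labels)[nn_idx]  # (N,) per-point component labels
```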

3.2 Self-supervision with ACD

The component labels generated by the ACD algorithm are not consistent across point clouds, i.e. “component 5” may refer to the seat of a chair in one point cloud, while the leg of the chair may be labeled as “component 5” in another point cloud. Therefore, the usual cross-entropy loss, which is generally used to train networks for tasks such as semantic part labeling, is not applicable in our setting. We formulate the learning of Approximate Convex Decompositions as a metric learning problem on point embeddings via a pairwise or contrastive loss [19].

We assume that each point p_i in a point cloud is encoded as x_i in some embedding space by a neural network encoder Φ, e.g. PointNet [49] or PointNet++ [41]. Let the embeddings of a pair of points from a shape be x_1 and x_2, normalized to unit length (i.e., ||x_1|| = ||x_2|| = 1), and let the set of convex components as described above be {C_1, C_2, ..., C_K}. The pairwise loss is then defined as

\[ \mathcal{L}_{pair}(p_1, p_2) = \big[\, c(p_1) = c(p_2) \,\big] \big( 1 - x_1^{\top} x_2 \big) + \big[\, c(p_1) \neq c(p_2) \,\big] \max\big( 0,\; x_1^{\top} x_2 - m \big) \tag{4} \]

This loss encourages points belonging to the same component to have a high similarity x_1ᵀx_2, and points from different components to have low similarity, subject to a margin m. Here c(p) denotes the ACD component label of point p, and [·] denotes the Iverson bracket.
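As an illustration, this loss can be written in a few lines of PyTorch; the random pair sampling, margin value, and pair count below are our own illustrative choices, not hyper-parameters reported in the paper:

```python
import torch

def acd_pairwise_loss(x, comp, margin=0.5, n_pairs=4096):
    """Pairwise contrastive loss over ACD components (Eq. 4).

    x:    (N, D) per-point embeddings, assumed already L2-normalized
    comp: (N,)   ACD component index of each point
    margin and n_pairs are illustrative values, not the paper's settings.
    """
    n = x.shape[0]
    i = torch.randint(n, (n_pairs,))          # random pair sampling
    j = torch.randint(n, (n_pairs,))
    sim = (x[i] * x[j]).sum(dim=1)            # dot product of unit vectors
    same = (comp[i] == comp[j]).float()       # Iverson bracket [c(p1) = c(p2)]
    pos = 1.0 - sim                           # pull same-component pairs together
    neg = torch.clamp(sim - margin, min=0.0)  # push different pairs below margin m
    return (same * pos + (1.0 - same) * neg).mean()
```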

Joint training with ACD. Formally, let us consider samples X, divided into two parts: X^L and X^U of sizes N_L and N_U, respectively. X^L consists of point clouds that are provided with human-annotated labels Y^L, while we do not know the labels of the samples in X^U. By running ACD on the samples in X^U, we can obtain a set of components for each shape, so the pairwise contrastive loss L_pair (Eq. 4) can be defined over X^U as a self-supervised objective. For the samples in X^L, we have access to their ground-truth labels Y^L, which may, for example, be semantic part labels. In that case, the standard choice of training objective is the cross-entropy loss L_ce, defined over the points in an input point cloud. Thus, we can train a network on both X^L and X^U via a joint loss that combines the supervised (L_ce) and self-supervised (L_pair) objectives,

\[ \mathcal{L} = \mathcal{L}_{ce} + \lambda\, \mathcal{L}_{pair} \tag{5} \]

The scalar hyper-parameter λ controls the relative strength of the supervised and self-supervised training signals. In the pre-training scenario, when we only have the unlabeled dataset available, we can train a neural network purely on the ACD parts by optimizing the L_pair objective alone.
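A sketch of the joint objective on one labeled and one unlabeled batch, reusing the pairwise loss sketched above; the batch layout and the default λ are assumptions of this sketch:

```python
import torch.nn.functional as F

def joint_loss(part_logits, part_labels, x_unlab, comp_unlab, lam=1.0):
    """Joint objective of Eq. 5 on one labeled and one unlabeled batch.

    part_logits: (N_l, 50) part predictions for the labeled point cloud
    part_labels: (N_l,)    ground-truth part labels
    x_unlab:     (N_u, D)  embeddings of the unlabeled point cloud
    comp_unlab:  (N_u,)    its ACD component indices
    lam:         weight of the self-supervised term (value not specified here)
    """
    l_ce = F.cross_entropy(part_logits, part_labels)  # supervised term
    l_pair = acd_pairwise_loss(x_unlab, comp_unlab)   # ACD term (sketch above)
    return l_ce + lam * l_pair
```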

4 Experiments

We demonstrate the effectiveness of the ACD self-supervision across a range of experimental scenarios. For all the experiments in this section we use ACDs computed on all shapes from the ShapeNetCore data [7], which contains 57,447 shapes across 55 categories. The decompositions were computed with a fixed concavity tolerance and volumetric grid resolution; all other parameters are set to their default values according to a publicly available implementation (https://github.com/kmammou/v-hacd) of [32]. The resulting decompositions have an average of 17 parts per shape.

4.1 Shape classification on ModelNet

In this set of experiments, we show that the representations learned by a network trained on ACD are useful for discriminative downstream tasks such as classifying point clouds into shape categories.

Dataset.

We report results on the ModelNet40 shape classification benchmark, which consists of 12,311 shapes from 40 shape categories in a train/test split of 9,843/2,468. A linear SVM is trained on the features extracted from the training set of ModelNet40. This setup mirrors other approaches for unsupervised learning on point clouds, such as FoldingNet [67] and Hassani et al. [20].

Experimental setup.

A PointNet++ network is trained on the unlabeled ShapeNetCore data using the pairwise contrastive loss on the ACD task, using the Adam optimizer with an initial learning rate of 1e-3, halving the learning rate every epoch. However, this network architecture creates an embedding for each of the points in an input shape, while for the shape classification task we require a single global descriptor for the entire point cloud. Therefore, we aggregate the per-point features of PointNet++ at the first two set aggregation layers (SA1 and SA2) and the last fully connected layer (fc), resulting in 128-, 256- and 128-dimensional feature vectors, respectively. Since features from different layers may have different scales, we normalize each vector to unit length before concatenating them, and apply element-wise signed square-rooting [43], resulting in a final 512-dim descriptor for each point cloud. The results are presented in Table 1.
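The aggregation pipeline can be summarized as follows; the per-layer pooling used to obtain one vector per shape is an assumption of this sketch, and only the normalize / signed-square-root / concatenate steps are taken from the text:

```python
import numpy as np

def shape_descriptor(sa1, sa2, fc):
    """Build the 512-d global shape descriptor from PointNet++ features.

    sa1, sa2, fc: one feature vector per layer (128-, 256- and 128-dim),
    e.g. obtained by pooling the per-point features of each layer over the
    shape (the pooling choice is an assumption of this sketch).
    """
    out = []
    for f in (sa1, sa2, fc):
        f = f / (np.linalg.norm(f) + 1e-8)           # unit-normalize per layer
        out.append(np.sign(f) * np.sqrt(np.abs(f)))  # signed square-rooting [43]
    return np.concatenate(out)                       # 128 + 256 + 128 = 512 dims

# Downstream evaluation uses a linear SVM on these frozen descriptors, e.g.:
#   from sklearn.svm import LinearSVC
#   clf = LinearSVC().fit(train_descs, train_labels)
#   accuracy = clf.score(test_descs, test_labels)
```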

Comparison with baselines.

As an initial naïve baseline, we use a PointNet++ network with random weights as our feature extractor, and then perform the usual SVM training. This gives 78% accuracy on ModelNet40 – while surprisingly good, the performance is not entirely unexpected: randomly initialized convolutional neural networks are known to provide useful features by virtue of their architecture alone, as studied by Saxe et al. [44]. Training this network with ACD, on the other hand, gives a significant boost to performance (78% → 89.1%), demonstrating the effectiveness of our proposed self-supervision task. This indicates some degree of generalization across datasets and tasks – from distinguishing convex components on ShapeNet to classifying shapes on ModelNet40. Inspired by [20], we also investigated whether adding a reconstruction component to the loss would further improve accuracy. Reconstruction is done by simply adding an AtlasNet [18] decoder to our model and using the Chamfer distance as the reconstruction loss. Without the reconstruction term (i.e., trained only to perform ACD using the contrastive loss), our accuracy (89.1%) is the same as the multi-task learning approach presented in [20]. After adding the reconstruction term, we achieve an improved accuracy of 89.8%. On the other hand, reconstruction alone without ACD yields an accuracy of 86.2%. This shows not only that ACD is a useful task when learning representations for shape classification, but also that it can be combined with shape reconstruction to yield even better results.

Comparison with previous work. Approaches for unsupervised or self-supervised learning on point clouds are listed in the upper portion of Table 1. Our method achieves 89.1% classification accuracy from purely using the ACD loss, which is met only by the unsupervised multi-task learning method of Hassani et al. [20]. We note that our method merely adds a contrastive loss to a standard architecture (PointNet++), without requiring a custom architecture and multiple pretext tasks as in [20], which uses clustering, pseudo-labeling and reconstruction.

Method Accuracy (%)
VConv-DAE [45] 75.5
3D-GAN [63] 83.3
Latent-GAN [1] 85.7
MRTNet [14] 86.4
PointFlow [66] 86.8
FoldingNet [67] 88.4
PointCapsNet [71] 88.9
Multi-task [20] 89.1
Our baseline (random weights) 78.0
With reconstruction term only 86.2
Ours with ACD 89.1
Ours with ACD + Reconstruction 89.8
Table 1: Unsupervised shape classification on the ModelNet40 dataset. The representations learned in the intermediate layers by a network trained for the ACD task on ShapeNet data are general enough to be useful for discriminating between shape categories on ModelNet40.

4.2 Few-shot segmentation on ShapeNet

Dataset. We report results on the ShapeNetSeg part segmentation benchmark [7], a subset of the ShapeNetCore database with manual annotations (train/val/test splits of 12,149/1,858/2,874). It consists of 16 man-made shape categories such as airplanes, chairs, and tables, with manually labeled semantic parts (50 in total), such as wings, tails, and engines for airplanes; legs, backs, and seats for chairs; and so on. Given a point cloud at test time, the goal is to assign each point its correct part label out of the 50 possible parts. Few-shot learning tasks are typically described as "n-way k-shot": the task is to discriminate among n classes, and k samples per class are provided as training data. We modify this protocol to our setup as follows: we select k samples from each of the 16 shape categories as the labeled training data, while the task remains semantic part labeling over the 50 part categories.
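A minimal sketch of how such a k-shot split can be drawn; the data layout is hypothetical, and the paper specifies only that training shapes are selected at random:

```python
import random
from collections import defaultdict

def sample_k_shot(shapes, k, seed=0):
    """Draw k labeled shapes per category for a few-shot split.

    shapes: iterable of (shape_id, category) pairs -- a hypothetical data
    layout; the paper specifies only that training shapes are sampled at
    random for each of the 16 categories.
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for shape_id, category in shapes:
        by_cat[category].append(shape_id)
    # Guard against categories with fewer than k shapes.
    return {c: rng.sample(ids, min(k, len(ids))) for c, ids in by_cat.items()}
```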

Samples/cls. k=1 k=3 k=5 k=10
Baseline 53.15 ± 2.49 59.54 ± 1.49 68.14 ± 0.90 71.32 ± 0.52
w/ ACD 61.52 ± 2.19 69.33 ± 2.85 72.30 ± 1.80 74.12 ± 1.17
Samples/cls. k=20 k=50 k=100 k=inf
Baseline 75.22 ± 0.82 78.79 ± 0.44 79.67 ± 0.33 81.40 ± 0.44
w/ ACD 76.19 ± 1.18 78.67 ± 0.72 78.76 ± 0.61 81.57 ± 0.68
Table 2: Few-shot segmentation on the ShapeNet dataset (class avg. IoU ± std. over 5 rounds). k denotes the number of shots or samples per class for each of the 16 ShapeNet categories used for supervised training. Jointly training with the ACD task reduces overfitting when labeled data is scarce, leading to significantly better performance over a purely supervised baseline.

Experimental setup. For this task, we perform joint training with two losses – the usual cross-entropy loss over labeled parts for the training samples from ShapeNetSeg, and an additional contrastive loss over the ACD components for the samples from ShapeNetCore (Eq. 5), with a fixed weight λ on the self-supervised term. In our initial experiments, we found joint training to be more helpful than pre-training on ACD and then fine-tuning on the few-shot task (an empirical phenomenon also noted in [65]), and thereafter consistently used joint training for the few-shot experiments. All point clouds overlapping between the human-annotated ShapeNetSeg and the unlabeled ShapeNetCore were removed from the self-supervised training set. The coordinates of the points in each point cloud are used as the input to the neural network; we do not include any additional information such as normals or category labels in these experiments.

Comparison with baselines. Table 2 shows the few-shot segmentation performance of our method versus a fully-supervised baseline. Especially in the cases of very few labeled training samples (k ≤ 10), having the ACD loss over a large unlabeled dataset provides a consistent and significant gain in performance over purely training on the labeled samples. As larger amounts of labeled training samples are made available, there is naturally limited benefit from the additional self-supervised loss – e.g., when using all the labeled data, our method is within one standard deviation of the purely supervised baseline. Qualitative results are shown in Fig. 3.

Comparison with previous work. The performance of recent unsupervised and self-supervised methods on ShapeNet segmentation is listed in Table 3. Consistent with the protocol followed by the multi-task learning approach of Hassani et al. [20], we provide 1% and 5% of the training samples of ShapeNetSeg as the labeled data and report instance-averaged IoU. Our method clearly outperforms the state-of-the-art unsupervised learning approaches, improving over [20] in both the 1% and 5% settings (68.2 → 75.7% and 77.7 → 79.7%, respectively).

Method 1% labeled 5% labeled
IoU IoU
SO-Net [30] 64.0 69.0
PointCapsNet [71] 67.0 70.0
MortonNet [53] - 77.1
Multi-task [20] 68.2 77.7
ACD (ours) 75.7 79.7
Table 3: Comparison with state-of-the-art semi-supervised part segmentation methods on ShapeNet. Performance is evaluated using instance-averaged IoU.
Figure 3: Qualitative comparison on 5-shot ShapeNet [7] part segmentation. The baseline method in the first row corresponds to training using only 5 examples per class, whereas the ACD results in the second row were computed by performing joint training (cross-entropy from 5 examples + contrastive loss over ACD components from ShapeNetCore). The network backbone architecture is the same for both approaches – PointNet++ [41]. The baseline method merges parts that should be separated, e.g. engines of the airplane, details of the rocket, top of the table, seat of the motorcycle, etc.

4.3 Analysis of ACD

On the effect of backbone architectures. Differently from [8, 20, 67], the ACD self-supervision does not require any custom network design and should be easily applicable across various backbone architectures. To this end, we use two recent high-performing models – PointNet++ (with multi-scale grouping [41]) and DGCNN [61] – as the backbones, reporting results on ModelNet40 shape classification and few-shot segmentation (k = 5) on ShapeNetSeg (Table 4). On shape classification, both networks show large gains from ACD pre-training: 11% for PointNet++ (as reported earlier) and 14% for DGCNN.

Figure 4: Classification accuracy of a linear SVM on the ModelNet40 validation set vs. the ACD validation loss over training epochs.

On few-shot segmentation with 5 samples per category (16 shape categories), PointNet++ improves from 68.14% IoU to 72.3% with the inclusion of the ACD loss. The baseline DGCNN performance with only 5 labeled samples per class is relatively lower (64.14%), however with the additional ACD loss on unlabeled samples, the model achieves 73.11% IoU, which is comparable to the corresponding PointNet++ performance (72.30%).

Task / Dataset Method PointNet++ DGCNN
Class./MN40 Baseline 77.96 74.11
w/ ACD 89.06 88.21
5-shot Seg./ShapeNet Baseline 68.14 ± 0.90 64.14 ± 1.43
w/ ACD 72.30 ± 1.80 73.11 ± 0.95
Table 4: Comparing embeddings from PointNet++ [41] and DGCNN [61] backbones: shape classification accuracy on ModelNet40 (Class./MN40) and 5-shot part segmentation performance in terms of class-averaged IoU on ShapeNet (5-shot Seg./ShapeNet).

On the role of ACD in shape classification. Fig. 4 shows the reduction in validation loss on learning ACD (red curve) as training progresses on the unlabeled ShapeNet data. Note that doing well on ACD (in terms of the validation loss) also leads to learning representations that are useful for the downstream task of shape classification (in terms of SVM accuracy on a validation subset of ModelNet40 data, shown in blue).

However, the correlation between the two quantities is not very strong (in terms of the Pearson correlation coefficient) – from the plots it appears that after the initial epochs, where we observe a large gain in classification accuracy as well as a large reduction in ACD loss, continuing to improve at the pretext task does not lead to any noticeable gains in the ability to classify shapes: training with ACD gives the model a useful notion of grouping and parts, but it is not obvious that perfectly mimicking ACD would further improve representations for classifying point clouds into shape categories.

Figure 5: Correspondence between human part labels and shape decompositions, comparing ACD with basic clustering algorithms – K-means, spectral clustering, and hierarchical agglomerative clustering (HAC). Row 1: histograms of normalized mutual information (NMI) between human labels and each clustering – ACD is closer to the ground-truth parts than the others (y-axes clipped at 100 for clarity). Row 2: precision vs. recall for each input shape; ACD has high precision and moderate recall (a tendency to over-segment parts), while the other methods are usually lower on both metrics.

Comparison with clustering algorithms. We quantitatively analyse the connection between convex decompositions and semantic object parts by comparing ACD with human part annotations on 400 shapes from ShapeNet, along with simple clustering baselines – K-means [2], spectral clustering [47, 56], and hierarchical agglomerative clustering (HAC) [34] – applied to the coordinates of the point clouds. For the baselines, we set the number of clusters to the number of ground-truth parts in each shape. For each sample shape, given the set of part categories C and the set of clusters K, clustering performance is evaluated using normalized mutual information (NMI) [55], defined as

\[ \mathrm{NMI}(C, K) = \frac{I(C;\, K)}{\sqrt{H(C)\, H(K)}} \tag{6} \]

where I(C; K) denotes the mutual information between classes C and clusters K, and H(·) is the entropy [10]. A better clustering results in a higher NMI w.r.t. the ground-truth part labels. The first row of Fig. 5 shows the histograms of NMI between cluster assignments and human part annotations: ACD, though not exactly aligned with human notions of parts, is significantly better than the other clustering methods, which have very low NMI in most cases.
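For reference, this variant of NMI (with the geometric mean of the entropies in the denominator) is available in scikit-learn; a toy example with made-up labels:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# Toy example: ground-truth part labels vs. cluster assignments for one shape.
part_labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])
cluster_ids = np.array([1, 1, 0, 0, 0, 2, 2, 2])

# 'geometric' averaging uses sqrt(H(C) * H(K)) in the denominator,
# matching the NMI definition of Eq. 6.
nmi = normalized_mutual_info_score(part_labels, cluster_ids,
                                   average_method='geometric')
print(f'NMI = {nmi:.3f}')
```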

We plot the precision and recall of clustering for each of the 400 shapes in the second row of Fig. 5. The baseline methods show that a naïve clustering of points does not correspond well to semantic parts. ACD has high precision and moderate recall on most shapes – this agrees with the visual impression that, though ACD tends to over-segment the shapes, the decompositions contain most of the boundaries present in the human annotations. For example, ACD typically segments the legs of a chair into four separate components; the part annotations on ShapeNet, however, give all the legs the same label, since the benchmark does not distinguish between the individual legs of a chair. We note that the correspondence of ACD to human part labels is not perfect, and this opens an interesting avenue for further work – exploring other decomposition methods, like generalized cylinders [73], that may correspond more closely to human-defined parts and in turn could lead to improved downstream performance on discriminative tasks.

5 Conclusions

Self-supervision using approximate convex decomposition (ACD) has been shown to be effective across multiple tasks and datasets – few-shot part segmentation on ShapeNet and shape classification on ModelNet – consistently surpassing existing self-supervised and unsupervised methods. A simple pairwise contrastive loss suffices to introduce the ACD task into a network training framework, without dependencies on any custom architectures or losses.

The method can be easily integrated into existing state-of-the-art architectures operating on point clouds such as PointNet++ and DGCNN, yielding significant improvements in both cases. Extensive ablations and analyses are presented on the approach, helping us develop a better intuition about the method. Given the demonstrated effectiveness of ACD in self-supervision, this opens the door to incorporating other shape decomposition methods from the classical geometry processing literature into deep neural network based models operating on point clouds.

References

  • [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas (2017) Representation learning and adversarial generation of 3d point clouds. arXiv preprint arXiv:1707.02392. Cited by: Table 1.
  • [2] D. Arthur and S. Vassilvitskii (2007) K-means++: the advantages of careful seeding. Technical report Society for Industrial and Applied Mathematics. Cited by: §4.3.
  • [3] O. K. Au, Y. Zheng, M. Chen, P. Xu, and C. Tai (2011) Mesh segmentation with concavity-aware fields. IEEE Trans. Visual. Comput. Graphics 18 (7), pp. 1125–1134. Cited by: §2.
  • [4] B. Chazelle (1984) Convex partitions of polyhedra: a lower bound and worst-case optimal algorithm. SIAM J. Comput. Cited by: §2, §3.1.
  • [5] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: §2.
  • [6] M. Caron, P. Bojanowski, J. Mairal, and A. Joulin (2019) Unsupervised pre-training of image features on non-curated data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2959–2968. Cited by: §2.
  • [7] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu (2015) ShapeNet: an information-rich 3d model repository. CoRR abs/1512.03012. Cited by: §1, Figure 2, §3.1, Figure 3, §4.2, §4.
  • [8] Z. Chen, K. Yin, M. Fisher, S. Chaudhuri, and H. Zhang (2019) BAE-NET: branched autoencoder for shape co-segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8490–8499. Cited by: §1, §1, §2, §4.3.
  • [9] S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1, pp. 539–546. Cited by: §1.
  • [10] T. M. Cover and J. A. Thomas (2012) Elements of information theory. John Wiley & Sons. Cited by: §4.3.
  • [11] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430. Cited by: §2.
  • [12] J. Donahue and K. Simonyan (2019) Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pp. 10541–10551. Cited by: §2.
  • [13] M. Gadelha, S. Maji, and R. Wang (2017) 3D shape generation using spatially ordered point clouds. In British Machine Vision Conference (BMVC), Cited by: §2.
  • [14] M. Gadelha, R. Wang, and S. Maji (2018) Multiresolution Tree Networks for 3D Point Cloud Processing. In ECCV, Cited by: §1, §2, Table 1.
  • [15] M. Ghosh, N. M. Amato, Y. Lu, and J. Lien (2013) Fast approximate convex decomposition using relative concavity. Computer-Aided Design 45, pp. 494–504. Cited by: §3.1.
  • [16] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In ICLR, Cited by: §2.
  • [17] P. Goyal, D. Mahajan, A. Gupta, and I. Misra (2019) Scaling and benchmarking self-supervised visual representation learning. arXiv preprint arXiv:1905.01235. Cited by: §2.
  • [18] T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry (2018) AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.1.
  • [19] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §1, §3.2.
  • [20] K. Hassani and M. Haley (2019) Unsupervised multi-task feature learning on point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8160–8171. Cited by: §1, §1, §2, §2, §2, §4.1, §4.1, §4.1, §4.2, §4.3, Table 1, Table 3.
  • [21] D. D. Hoffman and W. Richards (1983) Parts of recognition. Cited by: §2.
  • [22] H. Huang, E. Kalogerakis, S. Chaudhuri, D. Ceylan, V. G. Kim, and E. Yumer (2018) Learning local shape descriptors from part correspondences with multiview convolutional networks. ACM Transactions on Graphics 37 (1). Cited by: §2.
  • [23] J. Huang, R. Yagel, V. Filippov, and Y. Kurzion (1998) An accurate method for voxelizing polygon meshes. IEEE Symposium on Volume Visualization. Cited by: §3.1.
  • [24] H. Jiang, G. Larsson, M. Maire, G. Shakhnarovich, and E. Learned-Miller (2018) Self-supervised relative depth learning for urban scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–35. Cited by: §2.
  • [25] O. V. Kaick, N. Fish, Y. Kleiman, S. Asafi, and D. Cohen-OR (2014) Shape segmentation by approximate convexity analysis. ACM Trans. Graph. 34 (1). Cited by: §2, §3.1.
  • [26] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri (2017) 3D shape segmentation with projective convolutional networks. In Proc. CVPR, Cited by: §2.
  • [27] R. Klokov and V. Lempitsky (2017) Escape from cells: deep Kd-Networks for the recognition of 3D point cloud models. In Proc. ICCV, Cited by: §2.
  • [28] A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005. Cited by: §2.
  • [29] G. Larsson, M. Maire, and G. Shakhnarovich (2016) Learning representations for automatic colorization. In European Conference on Computer Vision, pp. 577–593. Cited by: §2.
  • [30] J. Li, B. M. Chen, and G. Hee Lee (2018) So-net: self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9397–9406. Cited by: Table 3.
  • [31] J. Lien and N. M. Amato (2007) Approximate convex decomposition of polyhedra. In Proceedings of the 2007 ACM Symposium on Solid and Physical Modeling, SPM ’07. Cited by: §2, §3.1.
  • [32] K. Mamou (2016) Volumetric approximate convex decomposition. In Game Engine Gems 3, E. Lengyel (Ed.), pp. 141–158. Cited by: §2, §3.1, §3.1, §3.1, §4.
  • [33] D. Maturana and S. Scherer (2015) 3D convolutional neural networks for landing zone detection from LiDAR. In Proc. ICRA, Cited by: §2.
  • [34] D. Müllner et al. (2013) Fastcluster: fast hierarchical, agglomerative clustering routines for r and python. Journal of Statistical Software 53 (9), pp. 1–18. Cited by: §4.3.
  • [35] S. Muralikrishnan, V. G. Kim, and S. Chaudhuri (2018) Tags2Parts: discovering semantic regions from shape tags. In Proc. CVPR, Cited by: §2.
  • [36] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Cited by: §2.
  • [37] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash (2018) Boosting self-supervised learning via knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9359–9367. Cited by: §2.
  • [38] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan (2017) Learning features by watching objects move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2701–2710. Cited by: §2.
  • [39] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2536–2544. Cited by: §2.
  • [40] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Proc. CVPR, Cited by: §2.
  • [41] C. R. Qi, L. Yi, H. Su, and L. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Proc. NIPS, Cited by: §2, §3.2, Figure 3, §4.3, Table 4.
  • [42] G. Riegler, A. O. Ulusoys, and A. Geiger (2017) Octnet: learning deep 3D representations at high resolutions. In Proc. CVPR, Cited by: §2.
  • [43] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek (2013) Image classification with the fisher vector: theory and practice. International journal of computer vision 105 (3), pp. 222–245. Cited by: §4.1.
  • [44] A. M. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Y. Ng (2011) On random weights and unsupervised feature learning.. In ICML, Vol. 2, pp. 6. Cited by: §4.1.
  • [45] A. Sharma, O. Grau, and M. Fritz (2016) Vconv-dae: deep volumetric shape learning without object labels. In European Conference on Computer Vision, pp. 236–250. Cited by: §2, Table 1.
  • [46] G. Sharma, E. Kalogerakis, and S. Maji (2019) Learning point embeddings from shape repositories for few-shot segmentation. CoRR abs/1910.01269. Cited by: §1, §2.
  • [47] J. Shi and J. Malik (2000) Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence 22 (8), pp. 888–905. Cited by: §4.3.
  • [48] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller (2015) Multi-view convolutional neural networks for 3d shape recognition. In Proc. ICCV, Cited by: §2.
  • [49] H. Su, C. Qi, K. Mo, and L. Guibas (2017) PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In CVPR, Cited by: §3.2.
  • [50] J. Su, M. Gadelha, R. Wang, and S. Maji (2018) A deeper look at 3d shape classifiers. In Second Workshop on 3D Reconstruction Meets Semantics, ECCV, Cited by: §2.
  • [51] J. Su, S. Maji, and B. Hariharan (2019) When does self-supervision improve few-shot learning?. arXiv preprint arXiv:1910.03560. Cited by: §2.
  • [52] M. Tatarchenko, J. Park, V. Koltun, and Q. Zhou. (2018) Tangent convolutions for dense prediction in 3D. CVPR. Cited by: §2.
  • [53] A. Thabet, H. Alwassel, and B. Ghanem (2019) MortonNet: self-supervised learning of local features in 3D point clouds. arXiv preprint arXiv:1904.00230. Cited by: Table 3.
  • [54] T. H. Trinh, M. Luong, and Q. V. Le (2019) Selfie: self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940. Cited by: §2.
  • [55] N. X. Vinh, J. Epps, and J. Bailey (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. Journal of Machine Learning Research 11 (Oct), pp. 2837–2854. Cited by: §4.3.
  • [56] U. Von Luxburg (2007) A tutorial on spectral clustering. Statistics and computing 17 (4), pp. 395–416. Cited by: §4.3.
  • [57] P. Wang, Y. Liu, Y. Guo, C. Sun, and X. Tong (2017) O-CNN: octree-based convolutional neural networks for 3D shape analysis. ACM Trans. Graph. 36 (4). Cited by: §2.
  • [58] P. Wang, C. Sun, Y. Liu, and X. Tong (2018) Adaptive o-cnn: a patch-based deep representation of 3d shapes. ACM Trans. Graph. 37 (6). Cited by: §2.
  • [59] X. Wang and A. Gupta (2015) Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802. Cited by: §2.
  • [60] X. Wang, K. He, and A. Gupta (2017) Transitive invariance for self-supervised visual representation learning. In Proceedings of the IEEE international conference on computer vision, pp. 1329–1338. Cited by: §2.
  • [61] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), pp. 1–12. Cited by: §2, §4.3, Table 4.
  • [62] R. Weller (2013) A Brief Overview of Collision Detection. SpringerLink, pp. 9–46. Cited by: §2.
  • [63] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum (2016) Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in neural information processing systems, pp. 82–90. Cited by: Table 1.
  • [64] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §1.
  • [65] Q. Xie, E. Hovy, M. Luong, and Q. V. Le (2019) Self-training with noisy student improves ImageNet classification. arXiv preprint arXiv:1911.04252. Cited by: §4.2.
  • [66] G. Yang, X. Huang, Z. Hao, M. Liu, S. Belongie, and B. Hariharan (2019) Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4541–4550. Cited by: §1, §2, Table 1.
  • [67] Y. Yang, C. Feng, Y. Shen, and D. Tian (2018) Foldingnet: point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 206–215. Cited by: §1, §2, §4.1, §4.3, Table 1.
  • [68] L. Yi, L. Guibas, A. Hertzmann, V. G. Kim, H. Su, and E. Yumer (2017) Learning hierarchical shape segmentation and labeling from online repositories. ACM Trans. Graph. 36. Cited by: §2.
  • [69] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In European conference on computer vision, pp. 649–666. Cited by: §2.
  • [70] R. Zhang, P. Isola, and A. A. Efros (2017) Split-brain autoencoders: unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1058–1067. Cited by: §2.
  • [71] Y. Zhao, T. Birdal, H. Deng, and F. Tombari (2019) 3D point capsule networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1009–1018. Cited by: Table 1, Table 3.
  • [72] Z. Ren, J. Yuan, C. Li, and W. Liu (2011) Minimum near-convex decomposition for robust shape representation. In 2011 International Conference on Computer Vision, Cited by: §2, §3.1.
  • [73] Y. Zhou, K. Yin, H. Huang, H. Zhang, M. Gong, and D. Cohen-Or (2015) Generalized cylinder decomposition. ACM Trans. Graph. 34 (6). Cited by: §4.3.
  • [74] C. Zhu, K. Xu, S. Chaudhuri, L. Yi, L. J. Guibas, and H. Zhang (2019) CoSegNet: deep co-segmentation of 3D shapes with group consistency loss. CoRR abs/1903.10297. Cited by: §2.