Few-shot 3D Point Cloud Semantic Segmentation

06/22/2020 ∙ by Na Zhao, et al. ∙ National University of Singapore

Many existing approaches for point cloud semantic segmentation are strongly supervised. These strongly supervised approaches heavily rely on a large amount of labeled training data that is difficult to obtain and suffer from poor generalization to new classes. To mitigate these limitations, we propose a novel attention-aware multi-prototype transductive few-shot point cloud semantic segmentation method to segment new classes given a few labeled examples. Specifically, each class is represented by multiple prototypes to model the complex data distribution of 3D point clouds. Subsequently, we employ a transductive label propagation method to exploit the affinities between labeled multi-prototypes and unlabeled query points, and among the unlabeled query points. Furthermore, we design an attention-aware multi-level feature learning network to learn the discriminative features that capture the semantic correlations and geometric dependencies between points. Our proposed method shows a significant improvement compared to the baselines for few-shot point cloud segmentation on unseen classes in two benchmark datasets.




1 Introduction

Point cloud semantic segmentation is a fundamental computer vision problem, which aims to estimate the category of each point in the 3D point cloud representation of a scene. The outcome of 3D semantic segmentation can benefit various real-world applications, including autonomous driving, robotics, and augmented/virtual reality. However, point cloud semantic segmentation is a challenging task due to the unstructured and unordered characteristics of point clouds. Recently, a number of strongly supervised 3D semantic segmentation approaches (Engelmann et al., 2017; Huang et al., 2018; Landrieu and Simonovsky, 2018; Li et al., 2018; Qi et al., 2017a, b; Wang et al., 2019b; Ye et al., 2018) have been proposed and have achieved promising performance on several benchmark datasets Armeni et al. (2016); Dai et al. (2017). Nonetheless, their success relies heavily on the availability of a large amount of labeled training data that is time-consuming and expensive to collect. Moreover, their learned classifiers still suffer from poor generalization to new classes. Consequently, it is necessary to find an alternative that does not require strong supervision and is able to easily generalize to new classes.

Although several existing works used weakly supervised Guinard and Landrieu (2017); Xu and Lee (2020) and semi-supervised Mei et al. (2019) learning to mitigate the data-hungry bottleneck in strongly supervised learning, these approaches overlooked the generalization ability of the models. The increasingly popular few-shot learning is a promising direction that allows us to concurrently ease the requirement of enormous training data and enhance the model generalization ability. In few-shot point cloud segmentation, our goal is to train a model to segment new classes given a few labeled point clouds. We adopt the commonly used meta-learning strategy, i.e. episodic training Vinyals et al. (2016), that learns over a distribution of similar few-shot tasks instead of only one target segmentation task. Each few-shot task consists of a few labeled samples (support set) and unlabeled samples (query set), and the model segments the query with learned knowledge from the support. Due to the consistency between the training few-shot task and the testing task, the model is less susceptible to overfitting to rare support samples and is endowed with generalization ability. Despite the benefit of episodic training, few-shot point cloud segmentation still faces two major challenges, namely how to: (1) distill discriminative knowledge from scarce support that can represent the distributions of novel classes; and (2) leverage this knowledge effectively to perform segmentation.

In this paper, we propose a novel attention-aware multi-prototype transductive inference method for few-shot point cloud segmentation. Specifically, our approach is able to model the complex distributions of the points within the point clouds of the support set, and perform the segmentation via transductive inference with the discriminative features extracted under the few-shot constraint. We are motivated by the prototypical network Snell et al. (2017), which represents each class with a single prototype obtained from averaging the embeddings of labeled samples in the support. We postulate that such a uni-modal distribution assumption can be violated in point cloud segmentation due to the complex data distribution of points. In particular, the geometric structures of the points can vary largely within the same semantic class. Consequently, we propose to represent each class with multiple prototypes to better capture the complex distribution. Furthermore, it is important to learn discriminative features in the few-shot 3D point cloud semantic segmentation setting. To this end, we design an attention-aware multi-level feature learning network to learn the point-wise features by capturing the geometric dependencies and semantic correlations between the points. Subsequently, we perform the segmentation step in a transductive manner with the multiple prototypes in the learned feature space. In contrast to the conventional prototypical network Snell et al. (2017) that matches unlabeled instances with the class prototypes by computing their Euclidean distances, our transductive inference considers the relationships among the unlabeled query points, and between the unlabeled query points and the multi-prototypes. We validate our proposed approach on two benchmark datasets: S3DIS and ScanNet. The improvements in performance over the baselines demonstrate the effectiveness of our method in few-shot point cloud segmentation.

2 Related Work

3D Semantic Segmentation

3D semantic segmentation aims to assign each point of a point cloud a semantic class label from a set of pre-defined classes. Due to the unstructured and unordered nature of point clouds, the learning of point-wise labels is very challenging. Many approaches Engelmann et al. (2017); Huang et al. (2018); Landrieu and Simonovsky (2018); Li et al. (2018); Qi et al. (2017a, b); Wang et al. (2019b); Ye et al. (2018) have been proposed to solve this task using strong supervision (i.e. point-wise ground truths). PointNet Qi et al. (2017a) is the first work that designs an end-to-end deep neural network to segment raw point clouds instead of their transformed representations (e.g. voxel grids and multi-view images). Despite its simplicity and efficiency, PointNet overlooks the important local information embedded in the neighboring points. DGCNN Wang et al. (2019b) addresses this issue by designing the EdgeConv module that can capture local structures. In our work, we make use of DGCNN as the backbone of our feature extractor to extract local geometric features and semantic features. Although these strongly supervised approaches achieved promising segmentation performance, the requirement for large amounts of training data precludes their use in many real-world scenarios where training data is costly to acquire. Moreover, the learned classifiers in these approaches generalize poorly to new classes that are not seen in the training data. To alleviate these limitations, we explore the direction of few-shot learning for 3D semantic segmentation. This enables the model to generalize to new classes by seeing just a few labeled samples.

Few-shot Learning

A naive and standard ‘fine-tuning’ method with rare training samples can lead to severe overfitting and poor performance. The goal of few-shot learning is to develop a classifier that is able to generalize to new classes with very few samples (e.g. one sample for the one-shot case). To address this challenging few-shot learning problem, several recent approaches Finn et al. (2017); Garcia and Estrach (2018); Munkhdalai and Yu (2017); Ravi and Larochelle (2017); Santoro et al. (2016); Snell et al. (2017); Vinyals et al. (2016) have proposed to learn transferable knowledge from a collection of learning tasks and made significant progress. In particular, metric-based methods Garcia and Estrach (2018); Snell et al. (2017); Vinyals et al. (2016) are notable because of their effectiveness in directly inferring labels for unseen classes during inference. The key idea in metric-based methods is to learn a good metric function which is able to produce a similarity embedding space representing the relationship between samples. Matching network Vinyals et al. (2016) and Prototypical Network Snell et al. (2017) are two representative metric-based methods. Both methods utilize a deep neural network to map the support and query sets into an embedding space, and then apply a non-parametric method to predict classes for the query based on the support. Specifically, matching network leverages the weighted nearest neighbor method that represents a class by all its support samples, while prototypical network leverages the prototypical method that represents a class by the mean of its support samples. These two non-parametric methods become two extreme ends of the spectrum of complex to simple data distribution modeling when applied to few-shot point cloud segmentation. This is because a support set contains a large number of points. We argue that the complexity of the distribution in point clouds should be somewhere between the two extremes. Consequently, we propose a novel attention-aware multi-prototype transductive inference method for segmenting point clouds under the few-shot constraint. Our method models the complex distribution of each class with multiple prototypes and learns more discriminative features with an attention mechanism.

3 Our Methodology

3.1 Problem Definition

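We align the training and testing of few-shot point cloud segmentation with the episodic paradigm Vinyals et al. (2016) that is commonly used in few-shot learning. Specifically, we train our model on a group of few-shot tasks sampled from a dataset with the class set $\mathcal{C}_{train}$, and then we test the trained model by evaluating it on another group of tasks sampled from a different dataset with new classes $\mathcal{C}_{test}$, where $\mathcal{C}_{train} \cap \mathcal{C}_{test} = \emptyset$. Each few-shot task, a.k.a. an episode, is instantiated as an $N$-way $K$-shot point cloud segmentation task. In each $N$-way $K$-shot episode, we are given a support set, denoted as $\mathcal{S} = \{(\mathbf{P}_s^{n,k}, \mathbf{M}^{n,k})\}$ with $n = 1, \dots, N$ and $k = 1, \dots, K$, with $K$ labeled pairs of support point cloud $\mathbf{P}_s^{n,k}$ and its corresponding binary mask $\mathbf{M}^{n,k}$ for each of the $N$ unique classes. Each point cloud contains $M$ points associated with the coordinate information and an additional feature, e.g. color. We are also given a query set, denoted as $\mathcal{Q} = \{(\mathbf{P}_q^{i}, \mathbf{L}^{i})\}_{i=1}^{T}$, which contains $T$ pairs of query point cloud $\mathbf{P}_q^{i}$ and its corresponding label $\mathbf{L}^{i}$. Note that the ground-truth label $\mathbf{L}$ is only available during training. The goal of $N$-way $K$-shot point cloud segmentation is to learn a model $\mathcal{F}_{\Theta}(\mathbf{P}_q^{i}, \mathcal{S})$ that predicts the label distribution $\mathbf{H}^{i}$ for any query point cloud $\mathbf{P}_q^{i}$ based on $\mathcal{S}$. Formally, our training objective is to find the optimal parameters $\Theta^{*}$ of $\mathcal{F}_{\Theta}$ by computing:

$$\Theta^{*} = \arg\min_{\Theta} \; \mathbb{E}_{(\mathcal{S}, \mathcal{Q}) \sim \mathcal{T}_{train}} \Big[ \sum_{(\mathbf{P}_q^{i}, \mathbf{L}^{i}) \in \mathcal{Q}} J\big(\mathbf{L}^{i}, \mathcal{F}_{\Theta}(\mathbf{P}_q^{i}, \mathcal{S})\big) \Big] \tag{1}$$

where $\mathcal{T}_{train}$ denotes the training set containing all the episodes sampled from $\mathcal{C}_{train}$, and $J(\cdot)$ is the loss function that will be defined in Section 3.2.4.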


3.2 Attention-aware Multi-prototype Transductive Inference Method

Figure 1: The architecture of our proposed method. This figure illustrates a 3-way 1-shot setting.

Figure 1 illustrates our attention-aware multi-prototype transductive inference framework. It consists of five components: the embedding network that learns the discriminative features for the support and query point clouds; the multi-prototype generation that produces multiple prototypes for each of the $N+1$ classes ($N$ semantic classes and one background class); the $k$-NN graph construction that encodes both the cross-set (support-query) and intra-set (support-support, query-query) relationships within the embedding space; the label propagation that diffuses labels through the whole graph along high density areas formed by the unlabeled query points; and the cross-entropy loss function that computes the loss between the predicted labels and ground-truth labels of all the query points.

3.2.1 Embedding Network

Figure 2: The architecture of the Self-Attention Network (SAN). Its three linear embedding functions contain trainable parameters.

The embedding network is the core of our proposed approach, since both multi-prototype generation and $k$-NN graph construction are performed on the learned embedding space. We expect the space to possess three properties: it can 1) encode the geometric structures of points based on local context; 2) encode the semantic information of points and their semantic correlation based on global context; and 3) quickly adapt to different few-shot tasks. To this end, we design an attention-aware multi-level feature learning network that incorporates three levels of features: local geometric features, global semantic features, and metric-adaptive features. Specifically, our embedding network is composed of three modules: feature extractor, attention learner, and metric learner. We adopt DGCNN Wang et al. (2019b), a dynamic graph CNN architecture, as the backbone of our feature extractor to produce local geometric features (outputs of the first EdgeConv layer) and semantic features (outputs of the feature extractor), respectively. To further explore semantic correlation between points in the global context, we apply a self-attention network (SAN) on the generated semantic features. SAN allows each point-wise feature to aggregate the global contextual information of the corresponding point cloud in a flexible and adaptive manner. Figure 2 illustrates the architecture of SAN. We introduce the metric learner, i.e. a stack of multi-layer perceptron (MLP) layers, to enable the adaptability of the embedding space. The metric learner maps all point-wise features of the support and query sets into a manifold space, where common distance functions (e.g., Euclidean distance or cosine distance) can be directly used to measure proximity between points. Finally, we concatenate the three levels of learned features together as the output of the embedding network.
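To make the role of the attention learner concrete, the following is a minimal NumPy sketch of one self-attention pass over the point-wise features of a single point cloud. It assumes the common scaled dot-product form with three linear embeddings; the exact SAN layout is the one in Figure 2, and the names `self_attention`, `w_q`, `w_k`, and `w_v` are hypothetical.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (M, d) point-wise features; w_q, w_k, w_v: (d, d_out) linear embeddings.

    Assumption: scaled dot-product attention; the paper's SAN may differ in detail.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])        # (M, M) pairwise affinities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # each row sums to 1
    return attn @ v                               # (M, d_out) context-aware features

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                       # 8 points, 4-dim features
w_q, w_k, w_v = (rng.normal(size=(4, 3)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
```

Each output row is a convex combination of the value vectors of all points in the cloud, which is what lets a point-wise feature aggregate global context.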

3.2.2 Multi-prototype Generation

For each of the $N+1$ classes in the support set, we generate $n$ prototypes to model the complex data distribution according to the few labeled samples in the episode. (Although we can vary $n$ for different classes, we keep it uniform for simplicity.) We cast the generation procedure as a clustering problem. While there can be different ways to cluster support points into multiple prototypes, we employ a simple strategy: sampling seed points and point-to-seed assignment based on the learned embedding space. More specifically, we sample a subset of $n$ seed points from the set of support points in one class using farthest point sampling in the embedding space. Intuitively, the farthest points in this space can inherently represent different perspectives of one class if the embedding space is learned well. Let $\{s_1^c, \dots, s_n^c\}$ and $\{f_1^c, \dots, f_{m^c}^c\}$ denote the sampled seeds and all the $m^c$ support points belonging to the class $c$, respectively. We compute the point-to-seed distance and take the index of the closest seed as the assignment of a point. Formally, the multi-prototypes $\boldsymbol{\mu}^c$ of class $c$ are given by:

$$\boldsymbol{\mu}^c = \{\mu_1^c, \dots, \mu_n^c\}, \quad \mu_i^c = \frac{1}{|\mathcal{I}_i^{c*}|} \sum_{f_j^c \in \mathcal{I}_i^{c*}} f_j^c, \quad \text{s.t.} \;\; \mathcal{I}^{c*} = \arg\min_{\mathcal{I}^c} \sum_{i=1}^{n} \sum_{f_j^c \in \mathcal{I}_i^c} \lVert f_j^c - s_i^c \rVert_2 \tag{2}$$

where $\{f_j^c\}$ is partitioned into $n$ sets $\mathcal{I}^{c*} = \{\mathcal{I}_1^{c*}, \dots, \mathcal{I}_n^{c*}\}$ such that each $f_j^c \in \mathcal{I}_i^{c*}$ is assigned to $s_i^c$.
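The seed sampling and point-to-seed assignment described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the first seed is fixed to index 0, Euclidean distance is used in the embedding space, each prototype is the mean of its assigned points, and the function names are hypothetical.

```python
import numpy as np

def farthest_point_sampling(feats, n, start=0):
    """Greedily pick n row indices of feats that are mutually far apart."""
    chosen = [start]
    dist = np.linalg.norm(feats - feats[start], axis=1)
    for _ in range(n - 1):
        nxt = int(dist.argmax())               # farthest from all seeds chosen so far
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(feats - feats[nxt], axis=1))
    return np.array(chosen)

def multi_prototypes(feats, n):
    """feats: (m, d) embeddings of one class. Returns (n, d) prototypes.

    Assumes n does not exceed the number of distinct rows in feats,
    so that every seed keeps at least itself in its cluster.
    """
    seeds = feats[farthest_point_sampling(feats, n)]
    d = np.linalg.norm(feats[:, None, :] - seeds[None, :, :], axis=2)  # (m, n)
    assign = d.argmin(axis=1)                  # index of the closest seed
    return np.stack([feats[assign == i].mean(axis=0) for i in range(n)])
```

Setting `n = 1` makes the sketch return a single class mean, which is the one-prototype representation used by the prototypical network.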

3.2.3 Transductive Inference

In addition to the similarity relations between each unlabeled query point and the labeled multi-prototypes, we also consider the similarity relations between pairs of unlabeled query points to exploit the "smoothness" constraints between neighboring query points in our semantic segmentation task. To this end, we leverage transductive inference to reason about cross-set and intra-set relationships based on the embedding space. Concretely, we propose the use of transductive label propagation to construct a graph on the labeled multi-prototypes and the unlabeled query points, and then propagate the labels in the graph with a random walk.

$k$-NN graph construction.

To cope with the large number of query points, we construct a $k$-nearest-neighbor ($k$-NN) graph instead of a fully-connected graph for computational efficiency. Specifically, we take both the multi-prototypes and query points as nodes of a graph with size $V = n \times (N+1) + T \times M$. We construct a sparse affinity matrix, denoted as $A \in \mathbb{R}^{V \times V}$, by computing the Gaussian similarity between each node and its $k$ nearest neighbors in the embedding space:

$$A_{ij} = \exp\Big(-\frac{\lVert v_i - v_j \rVert_2^2}{2\sigma^2}\Big), \quad \text{for } v_j \in \mathcal{N}_k(v_i) \tag{3}$$

and $A_{ij} = 0$ otherwise, where $v_i$ represents the node feature and $\sigma^2$ is the variance of the distance between two nodes. We follow Iscen et al. (2019) to let $W = A + A^{\top}$, which ensures that the adjacency matrix is non-negative and symmetric. Subsequently, we symmetrically normalize $W$ to yield $S = D^{-1/2} W D^{-1/2}$, where $D$ is the diagonal degree matrix whose diagonal entries are the row sums of $W$. In addition to $S$, we also define a label matrix $Y \in \mathbb{R}^{V \times (N+1)}$, where the rows corresponding to labeled prototypes are one-hot ground truth labels and the rest are zero.
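As an illustration of the graph construction, the sketch below builds the Gaussian affinity over each node's nearest neighbors, symmetrizes it, and applies the symmetric normalization with the degree matrix. It is a dense NumPy toy under stated assumptions: self-loops are excluded, the bandwidth is passed in as `sigma`, and the function name is hypothetical. A real implementation would use a sparse matrix and an approximate neighbor search, since the number of nodes is large.

```python
import numpy as np

def knn_affinity(nodes, k, sigma):
    """nodes: (V, d) features of prototypes and query points, with k < V.

    Returns the symmetrically normalized matrix D^{-1/2} W D^{-1/2}.
    """
    v = len(nodes)
    d2 = ((nodes[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=2)  # squared distances
    np.fill_diagonal(d2, np.inf)                  # assumption: no self-loops
    nbrs = np.argsort(d2, axis=1)[:, :k]          # k nearest neighbors per node
    rows = np.repeat(np.arange(v), k)
    cols = nbrs.ravel()
    a = np.zeros((v, v))
    a[rows, cols] = np.exp(-d2[rows, cols] / (2 * sigma ** 2))
    w = a + a.T                                   # non-negative and symmetric
    deg = w.sum(axis=1)                           # diagonal of the degree matrix
    return w / np.sqrt(np.outer(deg, deg))
```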

Label propagation.

Given $S$ and $Y$, label propagation iteratively diffuses labels through the graph according to:

$$Z_{t+1} = \alpha S Z_t + (1 - \alpha) Y \tag{4}$$

where $Z_t \in \mathbb{R}^{V \times (N+1)}$ represents the predicted label distributions at iteration $t$, and $\alpha \in (0, 1)$ is a parameter that controls the relative weight of the information a node receives from its adjacent nodes versus its initial labels. Zhou et al. (2004) show that the sequence $\{Z_t\}$ converges, up to the constant factor $(1 - \alpha)$, to a closed-form solution:

$$Z^{*} = (I - \alpha S)^{-1} Y \tag{5}$$

We adopt the closed-form solution to directly compute the predictions of label propagation.
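The relation between the iterative update and the closed-form solution can be checked numerically. The sketch below is illustrative only: it solves the linear system rather than forming an explicit inverse, and the function names and the default step count are assumptions.

```python
import numpy as np

def propagate_labels(s, y, alpha=0.99):
    """Closed form: solve (I - alpha * S) Z = Y for Z."""
    return np.linalg.solve(np.eye(len(s)) - alpha * s, y)

def propagate_iteratively(s, y, alpha=0.99, steps=2000):
    """Repeated diffusion Z <- alpha * S Z + (1 - alpha) * Y, starting from Y."""
    z = y.copy()
    for _ in range(steps):
        z = alpha * s @ z + (1 - alpha) * y
    return z
```

The iterate approaches `(1 - alpha)` times the closed-form result, so both give the same class ranking for every node.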

3.2.4 Loss Function

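Once $Z^{*}$ is obtained, we first take the predictions corresponding to the $T$ query point clouds, denoted as $\{z^{i}\}_{i=1}^{T}$, where $z^{i} \in \mathbb{R}^{M \times (N+1)}$ represents the predictions of the point cloud $\mathbf{P}_q^{i}$. The prediction of each point in $z^{i}$ is then normalized into a probability distribution using the softmax function:

$$\mathbf{H}^{i}_{m,n} = \frac{\exp(z^{i}_{m,n})}{\sum_{j=1}^{N+1} \exp(z^{i}_{m,j})} \tag{6}$$

Finally, we compute the cross-entropy loss between $\{\mathbf{H}^{i}\}_{i=1}^{T}$ and the ground truth labels $\{\mathbf{L}^{i}\}_{i=1}^{T}$ as:

$$J_{\Theta} = -\frac{1}{T} \frac{1}{M} \sum_{i=1}^{T} \sum_{m=1}^{M} \sum_{n=1}^{N+1} \mathbb{1}[\mathbf{L}^{i}_{m} = n] \, \log \mathbf{H}^{i}_{m,n} \tag{7}$$

where $\Theta$ is the set of parameters of our model $\mathcal{F}_{\Theta}(\mathbf{P}_q^{i}, \mathcal{S})$. More precisely, $\mathcal{F}_{\Theta}$ is a composite function of the embedding network and the multi-prototype generation and transductive inference operations. It becomes apparent that the minimization of $J_{\Theta}$ over the parameters $\Theta$ is governed by the affinity properties among the labeled multi-prototypes and unlabeled query points, since the gradients have to flow through the parameter-free operations into the embedding network.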
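A small numerical sketch of the softmax normalization followed by the cross-entropy loss is given below. It assumes integer class ids starting at 0 and averages over all query points of all query point clouds; the function name is hypothetical, and a training implementation would use an automatic differentiation framework instead of NumPy.

```python
import numpy as np

def cross_entropy(z, labels):
    """z: (T, M, N+1) propagated scores; labels: (T, M) integer class ids (0-based)."""
    z = z - z.max(axis=-1, keepdims=True)                        # stabilize exp
    log_h = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))    # log-softmax per point
    picked = np.take_along_axis(log_h, labels[..., None], axis=-1)
    return float(-picked.mean())                                 # average over T * M points
```

With uniform scores over three classes, every point has probability 1/3 for its true class, so the loss equals log 3.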

4 Experiments

We conduct experiments to evaluate the effectiveness of our proposed method on two benchmark datasets. To the best of our knowledge, there is no prior study of few-shot point cloud segmentation. Consequently, we define the setup of the dataset, implementation details, and baselines for evaluating our method.

4.1 Datasets and Setup


We evaluate on two datasets: 1) S3DIS Armeni et al. (2016) consists of 272 point clouds of rooms with various styles (e.g. lobby, hallway, office, pantry). The annotation of the point clouds corresponds to 12 semantic classes plus one for the clutter. 2) ScanNet Dai et al. (2017) consists of 1,513 point clouds of scans from 707 unique indoor scenes. The annotation of the point clouds corresponds to 20 semantic classes plus one for the unannotated space.


To fit the dataset to the few-shot learning setting, we evenly split the object classes in each dataset into two non-overlapping subsets based on the alphabetical order of the class names. The splitting details are listed in the Appendix. For each dataset, we perform cross-validation on the corresponding two subsets by selecting one split as the test class set $\mathcal{C}_{test}$ while taking the remaining split as the training class set $\mathcal{C}_{train}$.

Since the number of points in the original rooms is large, we divide the rooms into blocks using a non-overlapping sliding window of 1m × 1m on the xy plane, following the data pre-processing strategy used in Qi et al. (2017a); Wang et al. (2019b). As a result, S3DIS and ScanNet produce 7,547 and 36,350 blocks, respectively. From each block, $M$ points are randomly sampled as the input.
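The block partition and per-block sampling can be sketched as below. This is a simplified illustration, not the exact pre-processing code of the cited works: it bins points by the floor of their x and y coordinates, and it assumes that sampling falls back to drawing with replacement when a block holds fewer than the requested number of points. Function names are hypothetical.

```python
import numpy as np

def split_into_blocks(points, size=1.0):
    """Group points (columns: x, y, z, ...) by their size-by-size cell on the xy plane."""
    cells = np.floor(points[:, :2] / size).astype(int)
    blocks = {}
    for cell, p in zip(cells.tolist(), points):
        blocks.setdefault(tuple(cell), []).append(p)
    return {cell: np.array(ps) for cell, ps in blocks.items()}

def sample_points(block, m, rng):
    """Randomly draw m points; with replacement only if the block is too small."""
    idx = rng.choice(len(block), size=m, replace=len(block) < m)
    return block[idx]
```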

The training set $\mathcal{T}_{train}$ is constructed by including all the blocks that contain at least 100 points for any target class in $\mathcal{C}_{train}$. During training, we randomly sample one episode from $\mathcal{T}_{train}$ using the following procedure: we first randomly choose $N$ classes from $\mathcal{C}_{train}$, where $N < |\mathcal{C}_{train}|$; and then a support set $\mathcal{S}$ and a query set $\mathcal{Q}$ are randomly sampled based on the chosen $N$ classes. The mask $\mathbf{M}$ in the support set and the label $\mathbf{L}$ in the query set are modified from the original point annotation accordingly to correspond to the chosen $N$ classes. The testing episodes are formed in a similar fashion, with the exception that we exhaustively iterate over all the combinations of $N$ classes out of $\mathcal{C}_{test}$ rather than randomly choosing $N$ classes. Specifically, we sample 100 episodes for each combination and use them as the test set $\mathcal{T}_{test}$ for evaluating each of the methods in our experiments. It is worth highlighting that the same point cloud may appear in both $\mathcal{T}_{train}$ and $\mathcal{T}_{test}$, but the annotations of this point cloud are different due to the difference in the classes of interest.

Evaluation metric.

As for the evaluation metric, we adopt the widely used metric in point cloud semantic segmentation, i.e. mean Intersection over Union (mean-IoU). In our few-shot setting, the mean-IoU is obtained by averaging over the set of testing classes $\mathcal{C}_{test}$.
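For reference, a minimal per-class IoU computation is sketched below. It operates on one pair of prediction and ground-truth arrays and skips classes absent from both; how intersections and unions are accumulated across test episodes is not specified here, so that part is left out. The function name is hypothetical.

```python
import numpy as np

def mean_iou(pred, gt, classes):
    """Average IoU over the given class ids, skipping classes absent from both arrays."""
    ious = []
    for c in classes:
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```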

4.2 Implementation Details

Framework details. We illustrate the architecture and configuration of the embedding network in Figure 1 (bottom). Following Zhou et al. (2004), the hyper-parameter $\alpha$ in label propagation is set to 0.99. The settings of the other three hyper-parameters (i.e., $n$ in multi-prototype generation, and $k$ and $\sigma$ in the $k$-NN graph construction) are discussed in Section 4.4 and the Appendix.

Training. We pre-train the feature extractor module on the training set $\mathcal{T}_{train}$ by adding three MLP layers at the end of the feature extractor as the segmentor over $\mathcal{C}_{train}$. During pre-training, we set the batch size to 32 and train for 100 epochs. The pre-trained model is optimized by Adam with a learning rate of 0.001. After initializing the feature extractor with the pre-trained weights, we use the Adam optimizer with an initial learning rate of 0.0001 for the feature extractor module, and an initial learning rate of 0.001 for the metric learner and attention learner modules. Both learning rates are decayed by half after 5000 iterations. In each iteration, one episode is randomly sampled, and all the point clouds in the support and query sets are augmented by Gaussian jittering and random rotation around the z-axis.
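The augmentation and learning-rate decay can be sketched as below. Several details are assumptions rather than values from the text: the jitter standard deviation and clipping bound, the uniform choice of rotation angle, and the reading that the rate is halved every 5000 iterations rather than only once. Function names are hypothetical.

```python
import numpy as np

def augment(points, rng, sigma=0.01, clip=0.05):
    """Rotate xyz about the z-axis by a random angle, then add clipped Gaussian jitter.

    points: float array (M, 3 + extra); extra columns such as color are left untouched.
    sigma and clip are assumed values.
    """
    theta = rng.uniform(0, 2 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    out = points.copy()
    out[:, :3] = points[:, :3] @ rot.T
    out[:, :3] += np.clip(rng.normal(0, sigma, size=(len(points), 3)), -clip, clip)
    return out

def learning_rate(base_lr, iteration, step=5000):
    """Halve the learning rate every `step` iterations (assumed schedule)."""
    return base_lr * 0.5 ** (iteration // step)
```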

4.3 Baselines

We design four baselines for comparison.

Fine-tuning (FT).

We take the architecture of our pre-trained segmentation network as the backbone of this baseline. For fair comparison, we use the same pre-trained weights for this segmentation network and our method. Following the strategy in Shaban et al. (2017), we fine-tune the trained segmentation network on samples from the support set and test on the query samples for each $N$-way $K$-shot task. To avoid overfitting, we only fine-tune the last three MLP layers.

Prototypical Learning (ProtoNet).

We adapt the prototypical network Garcia and Estrach (2018) that has been utilized in the few-shot image segmentation Dong and Xing (2018); Wang et al. (2019a) task to few-shot point cloud segmentation. To extract the point-wise features for the support and query sets, we use a similar architecture to our embedding network but replace SAN with a linear mapper that maps the features into the same dimension as SAN. Similarly, the feature extractor is initialized by the same pre-trained weights. We represent each class by one prototype given by the mean feature of its support points. The predictions for the query points are based on their squared Euclidean distances to the prototypes.
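The prediction rule of this baseline can be illustrated with a short sketch. It assumes the usual prototypical-network choice of a softmax over negative squared Euclidean distances; the function name is hypothetical.

```python
import numpy as np

def protonet_predict(query_feats, prototypes):
    """query_feats: (M, d); prototypes: (C, d), one mean feature per class.

    Returns (M, C) class probabilities from negative squared Euclidean distances.
    """
    d2 = ((query_feats[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```

Unlike the transductive inference of Section 3.2.3, each query point is classified independently here, without using relations among query points.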

Attention-aware Prototypical Learning (AttProtoNet).

This baseline is an upgraded version of ProtoNet, where the self-attention mechanism is added into the embedding network. In other words, it uses the same architecture as our embedding network.

Multi-prototype Transductive Inference (MPTI).

This can be considered as a degraded version of our proposed method, where the attention learner module (SAN) in the embedding network is replaced by a linear mapper similar to ProtoNet.

4.4 Results and Analyses

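| Method | 2-way 1-shot $S^0$ | 2-way 1-shot $S^1$ | 2-way 1-shot mean | 2-way 5-shot $S^0$ | 2-way 5-shot $S^1$ | 2-way 5-shot mean | 3-way 1-shot $S^0$ | 3-way 1-shot $S^1$ | 3-way 1-shot mean | 3-way 5-shot $S^0$ | 3-way 5-shot $S^1$ | 3-way 5-shot mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FT | 36.34 | 38.79 | 37.57 | 56.49 | 56.99 | 56.74 | 30.05 | 32.19 | 31.12 | 46.88 | 47.57 | 47.23 |
| ProtoNet | 48.39 | 49.98 | 49.19 | 57.34 | 63.22 | 60.28 | 40.81 | 45.07 | 42.94 | 49.05 | 53.42 | 51.24 |
| AttProtoNet | 50.98 | 51.90 | 51.44 | 61.02 | 65.25 | 63.14 | 42.16 | 46.76 | 44.46 | 52.20 | 56.20 | 54.20 |
| MPTI | 52.27 | 51.48 | 51.88 | 58.93 | 60.56 | 59.75 | 44.27 | 46.92 | 45.60 | 51.74 | 48.57 | 50.16 |
| Ours | 53.77 | 55.94 | 54.86 | 61.67 | 67.02 | 64.35 | 45.18 | 49.27 | 47.23 | 54.92 | 56.79 | 55.86 |

Table 1: Results on S3DIS dataset using mean-IoU metric (%). $S^i$ denotes that split $i$ is used for testing.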
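
| Method | 2-way 1-shot $S^0$ | 2-way 1-shot $S^1$ | 2-way 1-shot mean | 2-way 5-shot $S^0$ | 2-way 5-shot $S^1$ | 2-way 5-shot mean | 3-way 1-shot $S^0$ | 3-way 1-shot $S^1$ | 3-way 1-shot mean | 3-way 5-shot $S^0$ | 3-way 5-shot $S^1$ | 3-way 5-shot mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FT | 31.55 | 28.94 | 30.25 | 42.71 | 37.24 | 39.98 | 23.99 | 19.10 | 21.55 | 34.93 | 28.10 | 31.52 |
| ProtoNet | 33.92 | 30.95 | 32.44 | 45.34 | 42.01 | 43.68 | 28.47 | 26.13 | 27.30 | 37.36 | 34.98 | 36.17 |
| AttProtoNet | 37.99 | 34.67 | 36.33 | 52.18 | 46.89 | 49.54 | 32.08 | 28.96 | 30.52 | 44.49 | 39.45 | 41.97 |
| MPTI | 39.27 | 36.14 | 37.71 | 46.90 | 43.59 | 45.25 | 29.96 | 27.26 | 28.61 | 38.14 | 34.36 | 36.25 |
| Ours | 42.55 | 40.83 | 41.69 | 54.00 | 50.32 | 52.16 | 35.23 | 30.72 | 32.98 | 46.74 | 40.80 | 43.77 |

Table 2: Results on ScanNet dataset using mean-IoU metric (%). $S^i$ denotes that split $i$ is used for testing.
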
Comparison with baselines.

Tables 1 and 2 summarize the results of comparing our method to the baselines on S3DIS and ScanNet, respectively. It is not surprising that using more labeled samples, i.e. a larger $K$ in the $K$-shot setting, leads to significant improvements for all the methods. We can also see that the performance of 3-way segmentation is generally lower than that of 2-way segmentation due to its higher difficulty. It can be seen that our proposed method significantly outperforms the baselines in all four settings, i.e. 2/3-way 1/5-shot, on both datasets. In particular, our method improves upon FT under the challenging 3-way 1-shot setting by 51.75% and 53.05% on the S3DIS and ScanNet datasets, respectively. Compared to ProtoNet, our method gains around 10% and over 20% improvements on S3DIS and ScanNet, respectively, when using only one sample, i.e. one-shot. These improvements show that our proposed method can obtain more useful knowledge from very limited data during inference. The superiority of our method over AttProtoNet shows the contribution of our proposed multi-prototype transductive inference mechanism. Additionally, both the improvements of AttProtoNet over ProtoNet, and the improvements of our method over MPTI, demonstrate the capacity of the self-attention network in exploiting semantic correlations between the points.

An interesting observation is that the degraded version of our method, i.e. MPTI, clearly outperforms ProtoNet under the one-shot settings, but loses the gain under five shots. This is probably because the naive multi-prototype generation of MPTI makes it difficult to extract accurate multi-prototypes from a large number of support points if the embedding space is not learned well. This phenomenon also indicates the importance of incorporating the self-attention network, which helps the model learn a more representative embedding space.

Ablation study of multi-level features.

| S3DIS | ScanNet |
|---|---|
| 40.31 | 26.91 |
| 44.43 | 34.51 |
| 48.24 | 35.07 |
| 47.82 | 38.69 |
| 52.21 | 36.12 |
| 50.12 | 39.81 |
| 53.77 | 40.83 |

Table 3: Effects of different features under 2-way 1-shot setting on S3DIS ($S^0$) and ScanNet ($S^1$) datasets.

We study the effects of various designs of the embedding network since it is one of the most important components of our method. We denote the levels of features, i.e. local geometric feature, global semantic feature, and metric-adaptive feature, as $f_{geo}$, $f_{sem}$, and $f_{met}$, respectively. We select one or two levels of features as our embedded feature to estimate their contributions. The results of six variants are listed in Table 3. From the perspective of individual features, $f_{sem}$ and $f_{met}$ contribute more than $f_{geo}$. This is reasonable since the embedding space is supposed to be semantic. By combining any two levels of features, we achieve varying improvements on the two datasets. Eventually, the integration of the three levels of features gives us the best performance on both datasets.

Effects of hyper-parameters.

Figure 3: Effects of hyper-parameters under 2-way 1-shot setting on one split each of the S3DIS and ScanNet datasets.

In Figure 3, we illustrate the effects of three hyper-parameters (i.e., $n$, $k$, $\sigma$) for 2-way 1-shot point cloud segmentation on one split of each dataset. As can be seen from Figure 3(a), increasing the number of prototypes per class $n$ can achieve better results, but an overly large $n$ may lead to over-fitting and adversely affect the performance. As Figure 3(b) shows, there is a slight difference in performance between the two datasets when choosing a smaller $k$. However, the overall trend is similar, and the same choice of $k$ gives the best result on both datasets. As reported in Liu et al. (2019); Xiaojin and Zoubin (2002), $\sigma$ in the Gaussian similarity function used in the construction of the affinity matrix (see Eq. 3) plays a role in the performance of label propagation. We empirically find that $\sigma$ has different optimal values on different datasets: the best-performing value on S3DIS differs from the one on ScanNet.

5 Conclusion

This study investigates the unexplored few-shot point cloud segmentation problem. We propose a novel solution: the attention-aware multi-prototype transductive inference method, which achieves clear and consistent improvements over baselines on a variety of few-shot point cloud segmentation tasks. This work offers several key insights. Firstly, learning discriminative features that encode both geometric and semantic context is the core of the metric-based few-shot point cloud segmentation method. Secondly, the data distribution of point clouds is more complex than a uni-modal distribution. Thirdly, the exploitation of intrinsic relationships in the embedding space is necessary for the point cloud segmentation task. Future work could investigate an adaptive generation of multi-prototypes to efficiently infer the number of prototypes based on data complexity.


  • [1] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese (2016) 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543. Cited by: §1, §4.1.
  • [2] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Niessner (2017) ScanNet: richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839. Cited by: §1, §4.1.
  • [3] N. Dong and E. Xing (2018) Few-shot semantic segmentation with prototype learning. In Proceedings of the British Machine Vision Conference, Vol. 3. Cited by: §4.3.
  • [4] F. Engelmann, T. Kontogianni, A. Hermans, and B. Leibe (2017) Exploring spatial context for 3d semantic segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 716–724. Cited by: §1, §2.
  • [5] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §2.
  • [6] V. Garcia and J. B. Estrach (2018) Few-shot learning with graph neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Cited by: §2, §4.3.
  • [7] S. Guinard and L. Landrieu (2017) Weakly supervised segmentation-aided classification of urban scenes from 3d lidar point clouds. ISPRS-International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, pp. 151–157. Cited by: §1.
  • [8] Q. Huang, W. Wang, and U. Neumann (2018) Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2626–2635. Cited by: §1, §2.
  • [9] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum (2019) Label propagation for deep semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5070–5079. Cited by: §3.2.3.
  • [10] L. Landrieu and M. Simonovsky (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.
  • [11] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen (2018) Pointcnn: convolution on x-transformed points. In Advances in neural information processing systems, pp. 820–830. Cited by: §1, §2.
  • [12] Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. J. Hwang, and Y. Yang (2019) Learning to propagate labels: transductive propagation network for few-shot learning. In 7th International Conference on Learning Representations, ICLR 2019, Cited by: §4.4.
  • [13] J. Mei, B. Gao, D. Xu, W. Yao, X. Zhao, and H. Zhao (2019) Semantic segmentation of 3d lidar data in dynamic scene using semi-supervised learning. IEEE Transactions on Intelligent Transportation Systems. Cited by: §1.
  • [14] T. Munkhdalai and H. Yu (2017) Meta networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2554–2563. Cited by: §2.
  • [15] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §1, §2, §4.1.
  • [16] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108. Cited by: §1, §2.
  • [17] S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. In 5th International Conference on Learning Representations, ICLR 2017, Cited by: §2.
  • [18] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016) Meta-learning with memory-augmented neural networks. In International conference on machine learning, pp. 1842–1850. Cited by: §2.
  • [19] A. Shaban, S. Bansal, Z. Liu, I. Essa, and B. Boots (2017) One-shot learning for semantic segmentation. In Proceedings of the British Machine Vision Conference, Cited by: §4.3.
  • [20] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in neural information processing systems, pp. 4077–4087. Cited by: §1, §2.
  • [21] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §1, §2, §3.1.
  • [22] K. Wang, J. H. Liew, Y. Zou, D. Zhou, and J. Feng (2019) Panet: few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9197–9206. Cited by: §4.3.
  • [23] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), pp. 1–12. Cited by: §1, §2, §3.2.1, §4.1.
  • [24] Z. Xiaojin and G. Zoubin (2002) Learning from labeled and unlabeled data with label propagation. Tech. Rep., Technical Report CMU-CALD-02–107, Carnegie Mellon University. Cited by: §4.4.
  • [25] X. Xu and G. H. Lee (2020) Weakly supervised semantic point cloud segmentation: towards 10x fewer labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [26] X. Ye, J. Li, H. Huang, L. Du, and X. Zhang (2018) 3D recurrent neural networks with context fusion for point cloud semantic segmentation. In European Conference on Computer Vision, pp. 415–430. Cited by: §1, §2.
  • [27] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf (2004) Learning with local and global consistency. In Advances in neural information processing systems, pp. 321–328. Cited by: §3.2.3, §4.2.

Appendix A Dataset Split

Table 4 lists the class names in each split of the S3DIS and ScanNet datasets.

Dataset | split=0 | split=1
S3DIS | beam, board, bookcase, ceiling, chair, column | door, floor, sofa, table, wall, window
ScanNet | bathtub, bed, bookshelf, cabinet, chair, counter, curtain, desk, door, floor | otherfurniture, picture, refrigerator, shower curtain, sink, sofa, table, toilet, wall, window
Table 4: Test class names for each split of S3DIS and ScanNet.
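
As a minimal sketch, the splits in Table 4 can be written as a lookup from which the N-way class combinations within a split are enumerated. The names `TEST_CLASSES` and `n_way_combinations` are hypothetical and are not taken from the authors' code. Enumerating every combination within a split is an assumption about how test episodes could be formed, and the ScanNet label is spelled "shower curtain" here.

```python
from itertools import combinations

# Test classes per split, transcribed from Table 4.
TEST_CLASSES = {
    "S3DIS": {
        0: ["beam", "board", "bookcase", "ceiling", "chair", "column"],
        1: ["door", "floor", "sofa", "table", "wall", "window"],
    },
    "ScanNet": {
        0: ["bathtub", "bed", "bookshelf", "cabinet", "chair",
            "counter", "curtain", "desk", "door", "floor"],
        1: ["otherfurniture", "picture", "refrigerator", "shower curtain",
            "sink", "sofa", "table", "toilet", "wall", "window"],
    },
}


def n_way_combinations(dataset, split, n_way):
    """Return every unordered group of n_way test classes in one split (hypothetical helper)."""
    return list(combinations(TEST_CLASSES[dataset][split], n_way))


pairs = n_way_combinations("S3DIS", 0, 2)
print(len(pairs))  # C(6, 2) = 15
```

Under this assumption, a pair such as "ceiling, column" is one of the 2-way combinations available in split 0 of S3DIS.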

Appendix B More Framework Details

B.1 EdgeConv architecture details

Figure 4 illustrates the architecture and configuration of EdgeConv, which is a basic block of the feature extractor. To perform graph convolution, a $k$-NN graph is dynamically computed from the point-wise features that are input to EdgeConv. Note that this $k$-NN graph is different from the $k$-NN graph in Section 3.2.3. We set $k$ = … in our experiments. Each point in the point cloud is concatenated with its translated neighbor point, which is obtained by translating the neighbor to the local coordinate system centered at that point. Consequently, a feature tensor is produced from the input tensor and further passed to two MLP layers. Finally, EdgeConv aggregates the resultant feature tensor over the $k$ neighboring features using a max-pooling operator to generate the output point-wise features.

Figure 4: The architecture of EdgeConv component in the embedding network.
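
The steps described in this subsection can be sketched in plain Python as follows. This is an illustrative sketch and not the authors' implementation. It assumes squared Euclidean distance for the neighbor search, excludes a point from its own neighbor list, uses ReLU after each of the two MLP layers, and picks random weights, layer widths, and `k=4` purely for demonstration. All function names (`knn_indices`, `linear_relu`, `edge_conv`, `random_layer`) are hypothetical.

```python
import random


def knn_indices(feats, k):
    """For each point, return the indices of its k nearest neighbors in feature space.

    Assumption: squared Euclidean distance, and a point is not its own neighbor.
    """
    result = []
    for i, fi in enumerate(feats):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(fi, fj)), j)
            for j, fj in enumerate(feats)
            if j != i
        )
        result.append([j for _, j in dists[:k]])
    return result


def linear_relu(vec, weight, bias):
    """One shared MLP layer applied to a single edge feature: ReLU(W v + b)."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, vec)) + b)
        for row, b in zip(weight, bias)
    ]


def edge_conv(feats, k, layers):
    """feats: N lists of d floats. layers: list of (weight, bias) pairs."""
    out = []
    for i, neighbors in enumerate(knn_indices(feats, k)):
        center = feats[i]
        edge_outputs = []
        for j in neighbors:
            # Translate the neighbor into the local system centered at the
            # point, then concatenate it with the point itself (length 2d).
            edge = center + [b - a for a, b in zip(center, feats[j])]
            for weight, bias in layers:
                edge = linear_relu(edge, weight, bias)
            edge_outputs.append(edge)
        # Channel-wise max-pooling over the k neighboring edge features.
        out.append([max(channel) for channel in zip(*edge_outputs)])
    return out


def random_layer(n_in, n_out, rng):
    """Random weights and zero biases, for demonstration only."""
    weight = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


rng = random.Random(0)
points = [[rng.random() for _ in range(3)] for _ in range(8)]  # 8 points, d = 3
layers = [random_layer(6, 16, rng), random_layer(16, 16, rng)]  # two MLP layers
features = edge_conv(points, k=4, layers=layers)
print(len(features), len(features[0]))  # one 16-dimensional feature per point
```

Because the neighbor search runs on whatever features are passed in, stacking several such blocks recomputes the graph from the features of each layer, which is the sense in which the graph is dynamic.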

B.2 Hyper-parameter settings

As mentioned in Section 4.4, we empirically find that the optimal number of prototypes per class varies across datasets. Additionally, we observe that the optimal number of prototypes per class varies under different few-shot settings. Table 5 shows the optimal value in each few-shot setting. It can be seen that the number of prototypes becomes larger when the number of shots increases. This is reasonable since more shots result in a larger number of observed support points for each class, which requires more prototypes to model the larger variety. From Table 5, we also observe that the number of prototypes becomes larger when the number of “ways” increases. This is probably because the more difficult 3-way segmentation requires fine-grained multi-prototypes for each class. We set $k$ = … for the $k$-NN graph mentioned in Section 3.2.3 on all few-shot settings in both datasets.

Few-shot setting | 2-way 1-shot | 2-way 5-shot | 3-way 1-shot | 3-way 5-shot
Prototypes per class | 100 | 150 | 150 | 300
Table 5: The number of prototypes per class in different few-shot settings.
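
For reference, the values in Table 5 can be kept in a small lookup keyed by the few-shot setting. This is only a sketch: `PROTOTYPES_PER_CLASS` and `prototypes_per_class` are hypothetical names, and the reading of each table entry as the number of prototypes per class follows the discussion in this subsection.

```python
# Number of prototypes per class, keyed by (n_way, k_shot), transcribed from Table 5.
PROTOTYPES_PER_CLASS = {
    (2, 1): 100,
    (2, 5): 150,
    (3, 1): 150,
    (3, 5): 300,
}


def prototypes_per_class(n_way, k_shot):
    """Look up the Table 5 value; raises KeyError for settings the table does not cover."""
    return PROTOTYPES_PER_CLASS[(n_way, k_shot)]


print(prototypes_per_class(3, 5))  # 300
```

Settings outside the four reported ones are deliberately left undefined here, since the table gives no values for them.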

Appendix C Qualitative Results

Figures 5 and 6 show the qualitative results of our proposed method for 2-way 1-shot point cloud semantic segmentation on the S3DIS and ScanNet datasets, respectively. We compare the predictions of our method for one query point cloud with the ground truth and the predictions of ProtoNet. As we can see from Figure 5, the S3DIS dataset is very challenging in many scenarios, e.g., “the white columns that are very similar to the white wall and the window frame” (first row of Figure 5), “the doors that only have visible door frames” (second row of Figure 5), and “the table that has a lot of clutter on it” (last row of Figure 5). The accuracy of the predictions from our method drops for these challenging scenarios due to the limitation of having only one labeled sample as support. Nonetheless, our method still generally gives more accurate segmentation results than ProtoNet.

In contrast with the S3DIS dataset, the ScanNet dataset contains more diverse room types, such as bathroom (see first and last rows of Figure 6), bedroom/hotel (see second row of Figure 6), living room/lounge (see third and fifth rows of Figure 6), etc. Our proposed method is able to correctly segment most of the semantic classes within these new room types, while ProtoNet gives poor segmentation results that mix the background class with the semantic classes. We believe that our correct segmentations are a consequence of integrating the attention-aware feature embedding and multi-prototype transductive inference, which facilitates the smoothness among neighboring points and the distinction between different semantic classes.

Figure 5: Qualitative results of our method in 2-way 1-shot point cloud semantic segmentation on the S3DIS dataset in comparison to the ground truth and ProtoNet. Four combinations of 2-way are illustrated from the top to bottom rows, i.e., “ceiling, column” (first row), “floor, door” (second row), “chair, bookcase” (third and fourth rows), “table, wall” (last row).
Figure 6: Qualitative results of our method in 2-way 1-shot point cloud semantic segmentation on the ScanNet dataset in comparison to the ground truth and ProtoNet. Four combinations of 2-way are illustrated from the top to bottom rows, i.e., “cabinet, bed” (first and second rows), “sofa, window” (third and fourth rows), “chair, floor” (fifth row), “sink, toilet” (last row).