Omni-supervised Point Cloud Segmentation via Gradual Receptive Field Component Reasoning

05/21/2021 ∙ by Jingyu Gong, et al. ∙ Xiamen University ∙ Shanghai Jiao Tong University ∙ East China Normal University

Hidden features in neural networks usually fail to learn informative representations for 3D segmentation because supervision is only given on the output prediction; this can be solved by omni-scale supervision on intermediate layers. In this paper, we bring the first omni-scale supervision method to point cloud segmentation via the proposed gradual Receptive Field Component Reasoning (RFCR), where target Receptive Field Component Codes (RFCCs) are designed to record the categories within the receptive fields of hidden units in the encoder. The target RFCCs then supervise the decoder to gradually infer the RFCCs in a coarse-to-fine category reasoning manner, and finally obtain the semantic labels. Because many hidden features are inactive with tiny magnitude and make minor contributions to RFCC prediction, we propose a Feature Densification with a centrifugal potential to obtain more unambiguous features, which is in effect equivalent to entropy regularization over features. More active features can further unleash the potential of our omni-supervision method. We embed our method into four prevailing backbones and test on three challenging benchmarks. Our method significantly improves the backbones on all three datasets. Specifically, it brings new state-of-the-art performance on S3DIS and Semantic3D and ranks 1st on the ScanNet benchmark among all point-based methods. Code will be publicly available at https://github.com/azuki-miho/RFCR.


1 Introduction

Semantic segmentation of point clouds, in which point-level labels must be inferred, is a typical but still challenging task in 3D vision. This technique is widely used in applications such as robotics, autonomous driving, and virtual/augmented reality.

Figure 1: Illustration of Receptive Field Component Reasoning for a point cloud in ScanNet v2, from top to bottom. The Receptive Field Component Code (RFCC) indicates the category components in the receptive field. In the decoding stage, the segmentation problem is decomposed into a much easier global context recognition problem (predicting the global RFCCs, see the top of the figure) and a series of receptive field component reasoning problems. During reasoning, the target RFCCs generated in the encoder are used as the ground truth in the decoder to guide the network to gradually reason the RFCCs in a coarse-to-fine manner and finally obtain the semantic labels.

To handle point cloud segmentation, previous works usually introduced well-designed encoder-decoder architectures to hierarchically extract global context features in the encoding stage and distribute contextual features to points in the decoding stage to achieve point-wise labeling [graham20183d, thomas2019kpconv, yan2020pointasnl]. However, in the typical encoder-decoder framework, the network is merely supervised by point labels in the final layer [wu2019pointconv, thomas2019kpconv, hu2020randla], ignoring the critical fact that hidden units in the other layers lack direct supervision to extract informative representations. In other words, multi-scale, or even omni-scale, supervision is indeed necessary.

In 2D vision, CVAE [sohn2015learning] attempted to give multi-scale prediction and supervision to extract useful features for segmentation. CPM [wei2016convolutional] and MSS-net [ke2018multi] added intermediate supervision periodically and layer-wise losses, respectively. PointRend [kirillov2020pointrend] proposed to segment images at low resolution and to iteratively up-sample and refine the coarse prediction to obtain the final result, so that predictions at different scales can be supervised together.

However, so far, no one has succeeded in applying multi-scale, let alone omni-scale, supervision to 3D semantic segmentation, due to the irregularity of point clouds. Unlike in the image domain, it is hard to up-sample hidden features to the original resolution through simple tiling or interpolation, because there is no fixed mapping between the sampled point cloud and the original point cloud, especially when the sampling is random [wu2019pointconv, hu2020randla]. Additionally, the common up-sampling methods using nearest neighbors cannot trace the encoding relationship, thus introducing improper supervision to the intermediate features (see Sec 4.4 for a discussion). More recently, SceneEncoder [xu2020sceneencoder] provided a method to supervise the center-most layer to extract meaningful global features, but many other layers remain unhandled.

To solve this problem, we propose an omni-scale supervision method via gradual Receptive Field Component Reasoning. Instead of up-sampling the hidden features to the original resolution, we design a Receptive Field Component Code (RFCC) to effectively trace the encoding relationship and represent the categories within the receptive field of each hidden unit. Based upon this, we generate the target RFCCs at different layers from the semantic labels in the encoding stage to supervise the network at all scales. Specifically, in the decoding stage, the target RFCCs supervise the network to predict the RFCCs at different scales, and the features (hints) from the skip links help further deduce the RFCCs within more local and specific receptive fields. In this way, the decoding stage is transformed into a gradual reasoning procedure, as shown in Figure 1.

Inspired by SceneEncoder [xu2020sceneencoder], for each sampled point in any layer of the encoder, a multi-hot binary code can be built according to the existence of categories in its receptive field, designated as the target Receptive Field Component Code (RFCC). The target RFCCs at different layers are generated alongside the convolution and down-sampling, so they precisely record the categories existing in the corresponding receptive fields without any extra annotations. In Figure 1, we show the target RFCCs at various layers for a point cloud in the decoding stage, where the network first recognizes the global context (inferring the categories of objects existing in the whole point cloud). Then, contextual features are up-sampled iteratively to gradually reason the RFCCs in a coarse-to-fine manner. By comparing the target RFCCs and the predicted RFCCs, omni-scale supervision can be realized. It is noteworthy that even though the network reasons about the RFCCs gradually, training and inference are implemented in an end-to-end manner.

Additionally, to further unleash the potential of omni-scale supervision, more active features (features with large magnitude) are required to make an unambiguous contribution to the RFCC prediction. In contrast, in traditional networks [wu2019pointconv, thomas2019kpconv, xu2020sceneencoder], many units are inactive with tiny magnitude and thus contribute little to the final prediction. The principle underlying this observation comes from entropy regularization [grandvalet2005semi, lee2013pseudo] over features, where a greater number of active dimensions brings low-density separation between positive and negative features, generating more unambiguous features with certain signals. Consequently, in the point cloud scenario, more certainty in the features helps the network learn to reason the RFCCs at various scales and finally predict the semantic labels. Motivated by this, we propose a Feature Densification method with a well-designed potential function to push hidden features away from zero. Moreover, this potential is in effect equivalent to an entropy loss over the features (the detailed derivation is shown in Sec 3.4), leading to a simple but highly effective regularization for intermediate features.

To evaluate the performance and versatility of our method on the point cloud semantic segmentation task, we embed it into four prevailing backbones (deformable KPConv, rigid KPConv [thomas2019kpconv], RandLA [hu2020randla], and SceneEncoder [xu2020sceneencoder]) and test on three challenging point cloud datasets (ScanNet v2 [dai2017scannet] for indoor cluttered rooms, S3DIS [armeni20163d] for large indoor spaces, and Semantic3D [hackel2017semantic3d] for large-scale outdoor spaces). On all three datasets, we outperform the backbone methods and almost all state-of-the-art point-based competitors. Moreover, we also push the state of the art on S3DIS [armeni20163d] and Semantic3D [hackel2017semantic3d] forward.

2 Related Work

Point Cloud Semantic Segmentation.

PointNet [qi2017pointnet] proposed to directly concatenate global features to point-wise features before several Multi-Layer Perceptrons (MLPs) to perform semantic segmentation. Later, PointNet++ [qi2017pointnet++], SubSparseConv [graham20183d], and KPConv [thomas2019kpconv] utilized an encoder-decoder architecture with skip links for better fusion of local and global information. Joint tasks such as instance segmentation and edge detection have also been introduced to enhance semantic segmentation through additional supervision [pham2019jsis3d, zhao2020jsnet, hu2020jsenet]. SceneEncoder [xu2020sceneencoder] designed a meaningful global scene descriptor to guide global feature extraction. These methods directly use semantic labels to supervise the output features or the features in the center-most layer.

Compared with previous works, we propose an omni-scale supervision method for point cloud semantic segmentation via a gradual Receptive Field Component Reasoning.

Multi-scale Supervision.

In 2D vision, CVAE [sohn2015learning] proposed to give multi-scale predictions in the segmentation task. RMI [zhao2019region] proposed to predict and supervise the neighborhood of each pixel rather than the pixel itself. PointRend [kirillov2020pointrend] segmented images in a coarse-to-fine fashion, giving a low-resolution prediction and iteratively up-sampling and fine-tuning it to obtain the original-resolution prediction. CPM [wei2016convolutional] and MSS-net [ke2018multi] added intermediate supervision periodically and layer-wise losses, respectively.

Compared with these methods, we design a Receptive Field Component Code (RFCC) to represent the receptive field components and dynamically generate target RFCCs that give omni-scale supervision to the network, rather than simply up-sampling the features to the original resolution or down-sampling the ground truth. Thanks to the omni-scale supervision, the network can infer the RFCCs gradually and finally obtain the RFCCs at the original resolution, which are exactly the semantic labels.

Entropy Regularization.

Entropy Regularization [grandvalet2005semi] minimized the prediction entropy in semi-supervised classification to obtain unambiguous final features. This idea was introduced into deep neural networks for self-training by [lee2013pseudo], where final features with tiny magnitude are pushed away from zero to make a deterministic contribution to the final prediction. In these methods, final features with positive values become greater and negative features become smaller due to the entropy loss.

Compared with their methods, our Feature Densification introduces entropy regularization [grandvalet2005semi, lee2013pseudo] on the hidden features rather than just the final features, obtaining more active hidden features that directly contribute to the RFCC prediction.

3 Methods

Figure 2: Framework of gradual Receptive Field Component Reasoning. (a) shows how the target Receptive Field Component Codes (RFCCs) are generated alongside the common encoding procedure. (b) indicates that the network predicts the RFCCs in a coarse-to-fine manner. (c) represents the centrifugal potential, which pushes hidden features away from zero. In our network, the target RFCCs supervise the RFCC predictions, and the learnt features can reason about RFCCs in more local and specific receptive fields as more and more local features (clues) are provided through skip links. The prediction activation function is Softmax for the final layer and Sigmoid otherwise.

In the following parts, we first give an overview of our method in Sec 3.1. Then, we introduce the Receptive Field Component Codes (RFCCs) and the target RFCCs generated at various layers in Sec 3.2. In Sec 3.3, we explain how these target RFCCs supervise the network and enable the gradual Receptive Field Component Reasoning. Finally, we present the Feature Densification strategy for more active features in Sec 3.4.

3.1 Overview

The framework of our gradual Receptive Field Component Reasoning (RFCR) is shown in Figure 2. In our method, we generate target Receptive Field Component Codes (RFCCs) at different layers alongside the convolution and sampling of features (Figure 2 (a)) in the encoding stage. In the decoding stage, the network will reason the RFCCs at different layers, and the corresponding target RFCCs will give omni-scale supervision on the predicted RFCCs (Figure 2 (b)). Consequently, the semantic segmentation task can be treated as a coarse-to-fine receptive field component reasoning procedure after recognizing the global context (predicting categories of objects existing in the point cloud). Additionally, we introduce Feature Densification through a centrifugal potential to obtain more active features for omni-scale RFCC prediction (Figure 2 (c)).

3.2 Receptive Field Component Code

For a point cloud, it is easy to define the label of a point in the original point cloud. Nevertheless, it is non-trivial to assign a label to a point in a down-sampled point cloud, which receives information from all points inside its receptive field. In our method, we design a Receptive Field Component Code (RFCC) to represent all categories within the receptive field of each sampled point in the encoder. The target RFCCs are generated alongside the convolution and sampling of features in the encoding stage. In other words, the sampling is shared between the encoding stage (left part of the top branch in Figure 2) and the RFCC generation (Figure 2 (a)), so the generated target RFCCs precisely record the category components in the receptive fields, even though the sampling of the point cloud is a random process.

Implementation.

Our RFCC is designed to be a multi-hot label for every point in any layer of the encoder. Specifically, in a semantic segmentation task where each point must be classified into one of $N$ categories, the RFCC is an $N$-dimensional binary vector. Given the $i$-th point $p_i^l$ in the $l$-th layer of the encoder, the target RFCC $\mathbf{c}_i^l \in \{0,1\}^N$ represents the categories of objects existing in the receptive field of $p_i^l$, and each element $\mathbf{c}_{i,n}^l$ indicates the existence of category $n$. Based upon this definition, we first assign the one-hot label of each input point to its RFCC in the input layer, because the receptive field of an input point contains only itself:

$\mathbf{c}_i^0 = \mathrm{onehot}(y_i),$   (1)

where $y_i$ is the label of point $p_i^0$ in the original point cloud. As illustrated in Figure 2 (a), we can obtain $\mathbf{c}_i^l$ from the RFCCs in the previous layer alongside the 3D Convs:

$\mathbf{c}_{i,n}^{l} = \bigvee_{j \in \mathcal{N}_i^l} \mathbf{c}_{j,n}^{l-1},$   (2)

where $n$ indicates the channel index and $\mathcal{N}_i^l$ indexes the points of the $(l{-}1)$-th layer inside $p_i^l$'s receptive field. That is to say, $p_i^l$ receives features from $p_j^{l-1}$, $j \in \mathcal{N}_i^l$, in the 3D Convs thanks to the shared sampling. $\vee$ represents the logical OR (disjunction) operation. It is noteworthy that the generation of RFCCs only occurs in the encoder, not in the decoder. The generation of RFCCs is iterated until reaching the center-most layer $L$. Typically, the scene descriptor of [xu2020sceneencoder] is just the naturally deduced global supervisor $\mathbf{c}^L$ when the center-most layer contains only one point. Besides, $\mathbf{c}_i^l$ can also be treated as a simplified version of the neighborhood multi-dimensional distribution in RMI [zhao2019region], which exploits the semantic relationship among neighboring points.
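As a concrete illustration of Eqs. (1) and (2), the following minimal NumPy sketch generates the target RFCCs layer by layer. It is our own sketch rather than the released implementation, and the array names (labels, pool_idx_per_layer, num_classes) are assumptions about how the backbone exposes its pooling indices; the logical OR over a receptive field is realized as a max over multi-hot codes.

```python
import numpy as np

def initial_rfcc(labels, num_classes):
    """Eq. (1): one-hot RFCC for every point of the original cloud."""
    rfcc = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    rfcc[np.arange(labels.shape[0]), labels] = 1.0
    return rfcc

def propagate_rfcc(prev_rfcc, pool_idx):
    """Eq. (2): OR-pool the previous-layer RFCCs into each sampled point.

    prev_rfcc: (N_{l-1}, C) multi-hot codes of layer l-1.
    pool_idx:  (N_l, K) indices of the layer-(l-1) points inside the receptive
               field of each sampled point, shared with the encoder so the
               codes follow exactly the same random sampling.
    """
    return prev_rfcc[pool_idx].max(axis=1)   # logical OR == max on {0, 1}

def generate_target_rfccs(labels, pool_idx_per_layer, num_classes):
    """Target RFCCs for every encoder layer, generated alongside the encoder."""
    targets = [initial_rfcc(labels, num_classes)]
    for pool_idx in pool_idx_per_layer:
        targets.append(propagate_rfcc(targets[-1], pool_idx))
    return targets
```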

3.3 RFCC Reasoning

In the semantic segmentation task, the decoder of the network infers the category of each input point. In our method, as shown in Figure 2 (b), we decompose this complex problem into a much easier global context recognition problem (predicting $\mathbf{c}^L$) and a series of gradual receptive field component reasoning problems (reasoning $\mathbf{c}_i^{l-1}$ from $\mathbf{c}_i^l$ with additional features from the skip links, finally obtaining the semantic labels $\mathbf{c}_i^0$).

As shown in Figure 2, $\mathbf{f}_i^l$ denotes the feature of sampled point $p_i^l$ in the decoder. For each layer of the decoder except the last one, we apply a shared Multi-Layer Perceptron (MLP) $\mathcal{M}^l$ and a sigmoid function $\sigma$ to $\mathbf{f}_i^l$ to predict the RFCC $\hat{\mathbf{c}}_i^l$:

$\hat{\mathbf{c}}_i^l = \sigma\big(\mathcal{M}^l(\mathbf{f}_i^l)\big).$   (3)

Then, the target RFCC $\mathbf{c}_i^l$ generated in the encoding stage is directly used to guide the prediction through a layer-wise supervision $\mathcal{L}^l$:

$\mathcal{L}^l = \frac{1}{|\mathcal{P}^l|} \sum_{p_i^l \in \mathcal{P}^l} \ell\big(\hat{\mathbf{c}}_i^l, \mathbf{c}_i^l\big),$   (4)

where

$\ell\big(\hat{\mathbf{c}}, \mathbf{c}\big) = -\sum_{n=1}^{N} \Big[ \mathbf{c}_{n} \log \hat{\mathbf{c}}_{n} + \big(1-\mathbf{c}_{n}\big) \log\big(1-\hat{\mathbf{c}}_{n}\big) \Big]$   (5)

is the binary cross entropy, $\mathcal{P}^l$ denotes the sampled point cloud in the $l$-th layer of the encoder, and $|\mathcal{P}^l|$ corresponds to the number of points in $\mathcal{P}^l$.

According to Eq. (3), the center-most features, which contain global information, will learn to recognize the global context, i.e., predict $\hat{\mathbf{c}}^L$ with the largest receptive field. Meanwhile, $\mathbf{c}^L$ is used to regularize this prediction and help learn a better representation. Then, for the following layer of the decoder, the feature $\mathbf{f}_i^l$, which has learnt an informative representation to predict $\hat{\mathbf{c}}_i^l$, is up-sampled and concatenated with the encoder features from the skip link. After that, the concatenated features are used to extract more distinguishable features via the 3D Convs, and the extracted features are used to reason the RFCCs of more local and specific receptive fields. This procedure is iterated until the original resolution is reached. The whole RFCC reasoning loss can be simply expressed by

$\mathcal{L}_{RFCR} = \sum_{l=1}^{L} \mathcal{L}^l.$   (6)

In the last layer, we simply utilize MLPs and a softmax to predict the semantic labels, and a cross entropy loss is used to supervise the output features at the original scale.
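To make the omni-scale supervision of Eqs. (3)–(6) concrete, here is a minimal PyTorch sketch assuming the decoder exposes a list of per-layer point features together with the matching target RFCCs from the generator above; the head architecture and all names are illustrative choices, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RFCCHeads(nn.Module):
    """One shared MLP per decoder layer to predict the RFCCs (Eq. 3)."""

    def __init__(self, feat_dims, num_classes):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, num_classes))
            for d in feat_dims
        )

    def forward(self, feats):
        # feats: list of (N_l, D_l) decoder features, coarse to fine.
        # Return logits; the sigmoid is folded into the BCE loss below.
        return [head(f) for head, f in zip(self.heads, feats)]

def rfcc_reasoning_loss(logits_per_layer, targets_per_layer):
    """Eqs. (4)-(6): per-layer binary cross-entropy (mean reduction), summed over layers.

    targets_per_layer: list of float tensors of shape (N_l, num_classes)
    holding the multi-hot target RFCCs.
    """
    losses = [
        F.binary_cross_entropy_with_logits(logits, target)
        for logits, target in zip(logits_per_layer, targets_per_layer)
    ]
    return torch.stack(losses).sum()
```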

3.4 Feature Densification

Due to the large amount of supervision introduced by the gradual Receptive Field Component Reasoning, more active features with unambiguous signals are required. However, there are many inactive hidden units with tiny magnitude in traditional networks (a detailed experiment is shown in Sec 4.4). Therefore, we introduce a centrifugal potential to bring low-density separation between positive and negative features (pushing features away from zero), as shown in Figure 2 (c):

$\phi\big(\mathbf{f}_i^l\big) = -\sum_{m=1}^{D^l} \Big| \sigma\big(g(\mathbf{f}_i^l)_m\big) - \tfrac{1}{2} \Big|,$   (7)

where $\sigma$ is the sigmoid function and $g$ can be an identity function or a simple perceptron. The negative gradient of the potential function with respect to the $m$-th channel $u_m = g(\mathbf{f}_i^l)_m$ of the feature is

$-\frac{\partial \phi}{\partial u_m} = \mathrm{sign}(u_m)\,\sigma(u_m)\big(1-\sigma(u_m)\big),$   (8)

which has the same sign as the feature. This indicates that positive features will become greater and negative features will become smaller under this potential. Additionally, features with smaller absolute value receive a larger gradient according to this formula.

Meanwhile, this centrifugal potential can be implemented by a simple entropy loss:

$\mathcal{E}\big(\mathbf{f}_i^l\big) = -\sum_{m=1}^{D^l} \Big[ \sigma(u_m) \log \sigma(u_m) + \big(1-\sigma(u_m)\big) \log\big(1-\sigma(u_m)\big) \Big],$   (9)

where $u_m = g(\mathbf{f}_i^l)_m$ is the $m$-th channel of $g(\mathbf{f}_i^l)$.

If we take the following notation:

$q_m = \sigma(u_m),$   (10)

we can reformulate Eq. (9) into

$\mathcal{E}\big(\mathbf{f}_i^l\big) = \sum_{m=1}^{D^l} H(q_m), \quad H(q) = -q \log q - (1-q) \log (1-q).$   (11)

So, our centrifugal potential can be treated as entropy regularization [lee2013pseudo] over hidden features, which decreases the ambiguity of features in the intermediate layers. On the other hand, our omni-scale supervision directly benefits from the more active features with certain signals introduced by Feature Densification, because more unambiguous features can participate in the RFCC predictions and help learn better representations in the hidden layers, improving the semantic segmentation performance.

The total loss for Feature Densification can be summarized as

$\mathcal{L}_{FD} = \sum_{l} \frac{1}{|\mathcal{P}^l|\, D^l} \sum_{p_i^l \in \mathcal{P}^l} \mathcal{E}\big(\mathbf{f}_i^l\big),$   (12)

where $D^l$ represents the number of feature channels in $\mathbf{f}_i^l$.
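A minimal PyTorch sketch of the Feature Densification loss of Eqs. (9)–(12), under the assumption that g is the identity (the text also allows a simple perceptron); the tensor names are placeholders.

```python
import torch

def feature_densification_loss(hidden_feats):
    """Entropy regularization over hidden features (Eqs. 9-12).

    hidden_feats: list of (N_l, D_l) feature tensors from the supervised layers.
    Minimizing the binary entropy of sigmoid(f) pushes every channel away from zero.
    """
    eps = 1e-12
    total = 0.0
    for f in hidden_feats:
        q = torch.sigmoid(f)                                   # Eq. (10)
        entropy = -(q * torch.log(q + eps)
                    + (1.0 - q) * torch.log(1.0 - q + eps))    # Eqs. (9)/(11)
        total = total + entropy.mean()                         # 1/(N_l * D_l) normalization
    return total
```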

In a nutshell, all the supervision can be summarized by

$\mathcal{L} = \mathcal{L}_{seg} + \alpha\, \mathcal{L}_{RFCR} + \beta\, \mathcal{L}_{FD},$   (13)

where $\alpha$ and $\beta$ are two adjustable hyper-parameters and $\mathcal{L}_{seg}$ represents the common cross entropy loss for semantic segmentation. In our experiments, we simply set $\alpha$ and $\beta$ to the same fixed value and find that it performs well in most cases.
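The three terms are then combined as in Eq. (13); in this sketch alpha and beta are placeholder weights rather than values taken from the paper.

```python
def total_loss(seg_ce_loss, rfcr_loss, fd_loss, alpha=1.0, beta=1.0):
    """Eq. (13): semantic cross entropy plus weighted RFCR and FD terms."""
    return seg_ce_loss + alpha * rfcr_loss + beta * fd_loss
```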

4 Experiments

To show the effectiveness of our method and prove our claims, we embed our method into four prevailing methods (deformable KPConv, rigid KPConv [thomas2019kpconv], RandLA [hu2020randla] and SceneEncoder [xu2020sceneencoder]), and conduct experiments on three popular point cloud segmentation datasets (ScanNet v2 [dai2017scannet] for cluttered indoor scenes, S3DIS [armeni20163d] for large-scale indoor rooms and Semantic3D [hackel2017semantic3d] for large outdoor spaces). First, we introduce these three datasets in Sec 4.1. Next, implementation details and hyper-parameters used in our experiments are described in Sec 4.2. Then, we give the metric used to evaluate the performance as well as the quantitative and qualitative results in Sec 4.3. Finally, we conduct more ablation studies to prove our claims in Sec 4.4.

4.1 Datasets

ScanNet v2.

In the ScanNet v2 task [dai2017scannet], we need to classify all the points into 20 semantic categories. This dataset provides 1513 scanned scenes with point-level annotations, of which 1201 scenes are used for training and 312 scenes for validation. Another 100 scanned scenes are published without annotations for testing. We make predictions on the test set and submit the final results to the ScanNet server for evaluation.

S3DIS.

S3DIS [armeni20163d] provides comprehensively annotated point clouds of rooms in 6 large-scale indoor areas from 3 different buildings. There are about 273 million points in total, and all points are categorized into 13 classes. Following [qi2017pointnet, thomas2019kpconv], we take Area 5 as the test set and the rooms in the remaining areas for training.

Semantic3D.

Semantic3D [hackel2017semantic3d] is a large-scale outdoor point cloud dataset with an online benchmark. It contains more than 4 billion points from diverse urban scenes, and all the points are classified into 8 categories. The whole dataset includes 15 point clouds for training and another 15 point clouds for testing. For easier evaluation, Semantic3D provides the reduced-8 task, where the 15 large-scale point clouds are used for training and 4 down-sampled point clouds are used for testing.

4.2 Implementation

All the experiments can be conducted on a single GTX 1080Ti with 3700X CPU and 64 GB RAM. We apply our method to a common backbone deformable KPConv [thomas2019kpconv] and evaluate the performance on all three datasets. To show the versatility of our method, we also embed our method into three other backbones (one for each dataset).

ScanNet.

We separately choose deformable KPConv [thomas2019kpconv] and SceneEncoder [xu2020sceneencoder] as our backbones and apply our method. When we take deformable KPConv as our backbone, we randomly sample spheres of fixed radius from scenes in the training set during training. When we take SceneEncoder as our backbone, we randomly sample cubes from the training scenes for every batch, as in SceneEncoder [xu2020sceneencoder]. After training, we separately predict the results on the test set using the two trained models and submit them to the online benchmark server for evaluation [dai2017scannet].

S3DIS.

We insert our method into deformable KPConv [thomas2019kpconv] and RandLA [hu2020randla] respectively and treat them as our backbones. When we take deformable KPConv as our backbone, we randomly sample spheres of fixed radius from the original point clouds. When taking RandLA [hu2020randla] as the backbone, we randomly sample a fixed number of points from entire rooms for each training sample. Rooms in Areas 1, 2, 3, 4, and 6 are used for training. After training, we test the model on the whole S3DIS Area-5 set.

Semantic3D.

Deformable KPConv and rigid KPConv, both proposed in [thomas2019kpconv], are taken as our backbones to evaluate our method on the Semantic3D reduced-8 task [hackel2017semantic3d]. Because Semantic3D is a large-scale outdoor dataset, each input is randomly sampled as a sphere from the point cloud, with a different radius for the deformable and rigid KPConv backbones. Batches of such samples are fed into the network for training and testing. The final predictions are submitted to the Semantic3D server for evaluation [hackel2017semantic3d].

4.3 Metric and Results

Metric.

For better evaluation of segmentation performance, we take the mean Intersection over Union (mIoU) over all categories as our metric, as in many previous works [gong2021boundary, qi2017pointnet, thomas2019kpconv].
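For reference, a standard confusion-matrix computation of the per-class IoUs and mIoU used in the tables below; this is our own sketch, not the evaluation script of any benchmark server.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class IoU and mIoU from integer label arrays of shape (N,)."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)                 # confusion matrix: rows = gt, cols = pred
    inter = np.diag(conf).astype(np.float64)
    union = conf.sum(0) + conf.sum(1) - np.diag(conf)
    iou = inter / np.maximum(union, 1)             # classes absent from gt and pred get IoU 0
    return iou, iou.mean()
```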

The results of semantic segmentation on ScanNet v2 [dai2017scannet] are reported in Table 1, where we achieve 70.2% mIoU and rank first in this benchmark among all point-based methods. Here, we take deformable KPConv as our baseline, and a 1.8% improvement in mIoU is achieved. To show the generalization ability of our method, we also apply it to SceneEncoder [xu2020sceneencoder]. As shown in Table 1, a 3.1% improvement in mIoU is achieved. Additionally, we provide qualitative results of our baseline (deformable KPConv) and our method in Figure 3. The red dashed circles indicate the obvious qualitative improvements.

Method mIoU(%)
PointNet++ (NIPS'17) [qi2017pointnet++] 33.9
PointCNN (NIPS'18) [li2018pointcnn] 45.8
3DMV (ECCV'18) [dai20183dmv] 48.4
PointConv (CVPR'19) [wu2019pointconv] 55.6
TextureNet (CVPR'19) [huang2019texturenet] 56.6
HPEIN (ICCV'19) [jiang2019hierarchical] 61.8
SPH3D-GCN (TPAMI'20) [lei2020spherical] 61.0
FusionAwareConv (CVPR'20) [zhang2020fusion] 63.0
FPConv (CVPR'20) [lin2020fpconv] 63.9
DCM-Net (CVPR'20) [Schult_2020_CVPR] 65.8
PointASNL (CVPR'20) [yan2020pointasnl] 66.6
FusionNet (ECCV'20) [zhang2020deep] 68.8
SceneEncoder (IJCAI'20) [xu2020sceneencoder] 62.8
SceneEncoder + Ours 65.9
KPConv deform (ICCV'19) [thomas2019kpconv] 68.4
KPConv deform + Ours 70.2
Table 1: Results of indoor scene semantic segmentation on ScanNet v2.

We report the segmentation results on S3DIS Area-5 [armeni20163d] in Table 2. On this dataset, we also take deformable KPConv as our backbone and achieve 68.73% mIoU on the S3DIS Area-5 task, which pushes the state-of-the-art performance ahead. Deformable KPConv is also treated as our baseline for its good performance. Meanwhile, we also apply our method to RandLA, and the improvement over this backbone is also obvious (i.e., 2.67% mIoU). Figure 4 gives the visualization results of our method and the qualitative improvement over the baseline (deformable KPConv).

Figure 3: Visualization results on the validation dataset of ScanNet v2. The images from the left to right are input point clouds, semantic labels, predictions given by our baseline and our method, respectively.
Method mIoU(%)
PointNet (CVPR'17) [qi2017pointnet] 41.09
RSNet (CVPR'18) [huang2018recurrent] 51.93
PointCNN (NIPS'18) [li2018pointcnn] 57.26
ASIS (CVPR'19) [wang2019associatively] 54.48
ELGS (NIPS'19) [wang2019exploiting] 60.06
PAT (CVPR'19) [yang2019modeling] 60.07
SPH3D-GCN (TPAMI'20) [lei2020spherical] 59.5
PointASNL (CVPR'20) [yan2020pointasnl] 62.6
FPConv (CVPR'20) [lin2020fpconv] 62.8
Point2Node (AAAI'20) [han2019point2node] 62.96
SegGCN (CVPR'20) [lei2020seggcn] 63.6
DCM-Net (CVPR'20) [Schult_2020_CVPR] 64.0
FusionNet (ECCV'20) [zhang2020deep] 67.2
RandLA (CVPR'20) [hu2020randla] 62.42
RandLA [hu2020randla] + Ours 65.09
KPConv deform (ICCV'19) [thomas2019kpconv] 67.1
KPConv deform + Ours 68.73
Table 2: Results of indoor scene semantic segmentation on S3DIS Area-5.
Figure 4: Visualization results on the test dataset of the S3DIS Area-5. The left-most images are input point clouds and the following images are segmentation ground truth, predictions of baseline and our method separately.

In Table 3, we show the results of our method and other prevailing methods on Semantic3D [hackel2017semantic3d]. On this task, we achieve 77.8% mIoU, outperforming all the state-of-the-art competitors. When taking deformable KPConv as our backbone, our method improves it by 4.7%. When we take rigid KPConv as our backbone, our method still brings a 3.0% improvement in mIoU. We present the visual results of our method and the baseline (deformable KPConv) on the validation set of Semantic3D in Figure 5. The dark blue dashed circles indicate the qualitative improvements.

Method mIoU(%)
SegCloud (3DV'17) [tchapmi2017segcloud] 61.3
RF_MSSF (3DV'18) [thomas2018semantic] 62.7
SPG (CVPR'18) [landrieu2018large] 73.2
ShellNet (ICCV'19) [zhang2019shellnet] 69.4
GACNet (CVPR'19) [wang2019graph] 70.8
FGCN (CVPR'20) [khan2020fgcn] 62.4
PointGCR (WACV'20) [ma2020global] 69.5
RandLA (CVPR'20) [hu2020randla] 77.4
KPConv rigid (ICCV'19) [thomas2019kpconv] 74.6
KPConv rigid + Ours 77.6
KPConv deform (ICCV'19) [thomas2019kpconv] 73.1
KPConv deform + Ours 77.8
Table 3: Results of outdoor space semantic segmentation on Semantic3D (reduced-8).
Figure 5: Visualizations on validation set of Semantic3D. Inputs, semantic labels, results of our baseline and our method are presented separately from the left to the right.

4.4 Ablation Study

In this section, we conduct more experiments to evaluate the effectiveness of the proposed gradual Receptive Field Component Reasoning (RFCR) from different aspects. Without loss of generality, the ablation studies are mainly conducted on the Semantic3D reduced-8 task with deformable KPConv [thomas2019kpconv] as the backbone.

Gradual Receptive Field Component Reasoning.

To conduct ablation studies on the different parts of gradual Receptive Field Component Reasoning, we first apply only the omni-supervision in the decoding procedure to guide the network to reason the Receptive Field Component Codes (RFCCs) gradually, without the loss for Feature Densification (FD). Then, we add the centrifugal potential to obtain more active features for RFCC prediction, and the results are reported in Table 4. The results indicate that Receptive Field Component Reasoning alone improves the segmentation performance by 2.9%, and FD brings a further 1.8% improvement. We also conduct ablation studies on the effects of supervision at different scales and provide the details in the supplementary materials.

Method mIoU
KPConv deform 73.1
  + RFCR 76.0
    + FD 77.8
Table 4: Ablation study on impact of different parts of gradual Receptive Field Component Reasoning.

Omni-supervision via Up-sampling.

Multi-scale supervision is usually realized in 2D segmentation by up-sampling low-resolution predictions. Even though we cannot up-sample a point cloud through simple tiling or interpolation, we attempt to up-sample the intermediate predictions iteratively using nearest neighbors. Then, the semantic labels are used to supervise all the up-sampled predictions. As in our method, all scales are supervised, and Feature Densification is also used to provide more unambiguous features for the intermediate predictions. We report the result of Omni-supervision via Up-sampling (OvU) in Table 5 and compare it with our method. It shows inferior performance (76.2% vs. 77.8% mIoU) because up-sampling with nearest neighbors cannot trace the proper encoding relationship.

Method mIoU
KPConv deform 73.1
KPConv deform + OvU + FD 76.2
KPConv deform + RFCR[one-hot] + FD 76.4
KPConv deform + RFCR + FD 77.8
Table 5: Ablation study on omni-scale supervision strategy.
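A sketch of how the OvU baseline in Table 5 can be realized: each intermediate prediction is copied layer by layer to its nearest neighbor in the next finer point cloud and then supervised with the semantic labels. The KD-tree query is our own choice of nearest-neighbor search; the paper does not specify the implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def nn_upsample(pred, coarse_xyz, fine_xyz):
    """Copy each fine point's prediction from its nearest coarse point."""
    _, idx = cKDTree(coarse_xyz).query(fine_xyz, k=1)
    return pred[idx]

def upsample_to_input(pred, xyz_per_layer):
    """Iteratively up-sample a coarse prediction to the input resolution.

    xyz_per_layer: list of (N_l, 3) coordinate arrays, ordered from the layer
    of `pred` down to the original point cloud, so the result can then be
    supervised with the point-level semantic labels.
    """
    for coarse_xyz, fine_xyz in zip(xyz_per_layer[:-1], xyz_per_layer[1:]):
        pred = nn_upsample(pred, coarse_xyz, fine_xyz)
    return pred
```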

One-hot RFCC.

In previous works like PointRend [kirillov2020pointrend], one-hot predictions are given at low resolution, and these predictions are up-sampled to be supervised by the one-hot labels at the original resolution. So, it is intuitive to use a one-hot RFCC for the major category in the receptive field to supervise the prediction. However, the category information of some points is ignored in this way. In contrast, we take a multi-hot label for every sampled point at all scales, so no labels are ignored in the supervision of down-sampled points. To show the benefit of multi-hot labels, we replace them with one-hot labels that represent the majority category in each receptive field, keeping all other settings the same. We report the results in Table 5. We can see that the one-hot RFCC, which ignores the minor categories, cannot fully represent the information in the receptive field and thus yields sub-optimal segmentation performance (76.4% mIoU), lower than the multi-hot RFCC.
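The one-hot variant can be built with the same shared sampling by pooling per-category point counts through the encoder and keeping only the majority category at each sampled point; the sketch below and its names are ours.

```python
import numpy as np

def majority_onehot_rfcc(prev_counts, pool_idx):
    """One-hot RFCC of the majority category in each receptive field.

    prev_counts: (N_{l-1}, C) per-category point counts of the previous layer,
                 initialized with the one-hot labels at the input layer.
    pool_idx:    (N_l, K) receptive-field indices shared with the encoder,
                 as in Eq. (2).
    """
    counts = prev_counts[pool_idx].sum(axis=1)                # (N_l, C) category counts
    onehot = np.zeros_like(counts)
    onehot[np.arange(counts.shape[0]), counts.argmax(axis=1)] = 1.0
    return onehot, counts                                     # counts feed the next layer
```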

Feature Densification.

As stated in Sec 3.4, features are densified (made more active) by the centrifugal potential given the loss in Eq. (12). The distribution of the features' magnitude after training is visualized by the bar chart in Figure 6. As indicated in this figure, features are pushed away from zero, and more unambiguous features are available for the Receptive Field Component Reasoning, thus improving the segmentation performance (Table 4).

Figure 6: Visualization of features’ magnitude in the decoding layers. The green chart bars represent the distribution of features’ absolute value after adding Feature Densification while the red chart bars represent the distribution of features’ absolute value in the original network.

5 Conclusion

In this paper, we propose a gradual Receptive Field Component Reasoning method for omni-supervised point cloud segmentation, which decomposes the hard segmentation problem into a global context recognition task and a series of gradual Receptive Field Component Code reasoning steps. Additionally, we propose a complementary Feature Densification method to provide more active features for RFCC prediction. We evaluate our method with four prevailing backbones on three popular benchmarks and outperform almost all the state-of-the-art point-based competitors. Furthermore, our method brings new state-of-the-art performance on the Semantic3D and S3DIS benchmarks. Although our method brings large improvements to many backbones for point cloud segmentation, it is most suitable for networks with an encoder-decoder architecture.

References

Appendix

Appendix A Supervisions at Different Layers

Supervision Scales mIoU
1 2 3 4 5
76.3
76.6
76.9
76.2
77.8
Table 6: Ablation study on significance of supervisions at different scales.

We design an omni-scale supervision method for point cloud segmentation via the proposed gradual Receptive Field Component Reasoning in the main paper. All scales are supervised in the decoding stage to learn informative representations for semantic segmentation. In this section, we analyze the significance of the supervision at different scales. In this ablation study, deformable KPConv [thomas2019kpconv] is again taken as the backbone, and performance is evaluated on the Semantic3D reduced-8 task. In the architecture of the deformable KPConv network, there are 5 different scales, as shown in Figure 7. So, we separately remove the supervision at individual scales. It is noteworthy that we always keep the supervision for the final layer, because it directly guides the semantic label prediction; otherwise the network would give random predictions. The results are reported in Table 6. They indicate that supervision in the center-most layer plays an important role in the omni-scale supervision, because it helps the encoder obtain representative global features, which are quite important for the following reasoning. Meanwhile, the supervision right before the final prediction also contributes a lot, because it directly provides semantically informative features to the final segmentation.

Figure 7: Illustration of framework using deformable KPConv as the backbone. In our method, all the five scales are supervised by the target RFCCs.

Appendix B Visualization of intermediate RFCC

We visualize the RFCC reasoning process and the predicted RFCCs in intermediate layers to implicitly show the intermediate feature learning in Figure 8. Meanwhile, the overall accuracy (OA) of the RFCC predictions on the validation set of ScanNet v2 demonstrates, to some extent, that the intermediate features learn good representations.

Figure 8:

Visualization of intermediate RFCCs whose element color represents the probability of existence for each category.

Appendix C Supervision on Decoder vs. Encoder

Method mIoU
KPConv deform 73.1
KPConv deform + [RFCR + FD][encoder] 76.8
KPConv deform + RFCR + FD 77.8
Table 7: More ablation study on the strategy of omni-scale supervision.

In our implementation, all the supervision is added in the decoder, even though the target RFCCs are generated according to the receptive fields of features in the encoder. That is because the features in the encoder can also be supervised through the skip links. To show the advantage of this strategy, we instead supervise the features in the encoder rather than the decoder with the RFCCs, and Feature Densification is also applied to the corresponding encoder features. Compared with supervision in the decoding stage, guiding the feature extraction with RFCCs in the encoder cannot effectively extract informative representations from the global and local features in the decoding stage, thus obtaining an inferior result (76.8% mIoU), as reported in Table 7.

Appendix D Visualization Results

Figure 9: More visualization results on the validation dataset of ScanNet v2. The images from the left to right are input point clouds, semantic labels, predictions given by our baseline and our method, respectively.
Figure 10: More visualization results on the test dataset of the S3DIS Area-5. The left-most images are inputs and the following images are segmentation ground truth, predictions of baseline and our method separately.
Figure 11: More visualization results on the validation dataset of Semantic3D. Input point clouds, semantic labels, results of our baseline and our method are presented respectively from left to right.

In this section, we present more visualization results of our method on the three datasets described in the main paper. Figure 9 compares our baseline and our method on the validation set of ScanNet v2 [dai2017scannet]. In Figure 10, we provide additional visualization results to show the qualitative improvement over the baseline on S3DIS Area 5. We also visualize more scenes from the validation set of Semantic3D in Figure 11.

Method mIoU bath. bed bksf. cab. chair ctr. curt. desk door floor oth. pic. ref. shw. sink sofa tab. toil. wall win.  
PointNet++ (NIPS’17[qi2017pointnet] 33.9 58.4 47.8 45.8 25.6 36.0 25.0 24.7 27.8 26.1 67.7 18.3 11.7 21.2 14.5 36.4 34.6 23.2 54.8 52.3 25.2
PointCNN (NIPS’18[li2018pointcnn] 45.8 57.7 61.1 35.6 32.1 71.5 29.9 37.6 32.8 31.9 94.4 28.5 16.4 21.6 22.9 48.4 54.5 45.6 75.5 70.9 47.5
3DMV (ECCV’18[dai20183dmv] 48.4 48.4 53.8 64.3 42.4 60.6 31.0 57.4 43.3 37.8 79.6 30.1 21.4 53.7 20.8 47.2 50.7 41.3 69.3 60.2 53.9
PointConv (CVPR’19[wu2019pointconv] 55.6 - - - - - - - - - - - - - - - - - - - -
TextureNet (CVPR’19[huang2019texturenet] 56.6 67.2 66.4 67.1 49.4 71.9 44.5 67.8 41.1 39.6 93.5 35.6 22.5 41.2 53.5 56.5 63.6 46.4 79.4 68.0 56.8
HPEIN (ICCV’19[jiang2019hierarchical] 61.8 72.9 66.8 64.7 59.7 76.6 41.4 68.0 52.0 52.5 94.6 43.2 21.5 49.3 59.9 63.8 61.7 57.0 89.7 80.6 60.5
SegGCN (CVPR’20[lei2020seggcn] 58.9 83.3 73.1 53.9 51.4 78.9 44.8 46.7 57.3 48.4 93.6 39.6 6.1 50.1 50.7 59.4 70.0 56.3 87.4 77.1 49.3
SPH3D-GCN (TPAMI’20[lei2020spherical] 61.0 85.8 77.2 48.9 53.2 79.2 40.4 64.3 57.0 50.7 93.5 41.4 4.6 51.0 70.2 60.2 70.5 54.9 85.9 77.3 53.4
FusionAwareConv (CVPR’20[zhang2020fusion] 63.0 60.4 74.1 76.6 59.0 74.7 50.1 73.4 50.3 52.7 91.9 45.4 32.3 55.0 42.0 67.8 68.8 54.4 89.6 79.5 62.7
FPConv (CVPR’20[lin2020fpconv] 63.9 78.5 76.0 71.3 60.3 79.8 39.2 53.4 60.3 52.4 94.8 45.7 25.0 53.8 72.3 59.8 69.6 61.4 87.2 79.9 56.7
DCM-Net (CVPR’20[Schult_2020_CVPR] 65.8 77.8 70.2 80.6 61.9 81.3 46.8 69.3 49.4 52.4 94.1 44.9 29.8 51.0 82.1 67.5 72.7 56.8 82.6 80.3 63.7
PointASNL (CVPR’20[yan2020pointasnl] 66.6 70.3 78.1 75.1 65.5 83.0 47.1 76.9 47.4 53.7 95.1 47.5 27.9 63.5 69.8 67.5 75.1 55.3 81.6 80.6 70.3
FusionNet (ECCV’20[zhang2020deep] 68.8 70.4 74.1 75.4 65.6 82.9 50.1 74.1 60.9 54.8 95.0 52.2 37.1 63.3 75.6 71.5 77.1 62.3 86.1 81.4 65.8
SceneEncoder (IJCAI’20[xu2020sceneencoder] 62.8 - - - - - - - - - - - - - - - - - - - -
SceneEncoder + Ours 65.9 69.1 72.4 69.6 63.2 81.5 47.7 75.4 64.6 50.9 95.2 42.8 28.4 56.6 76.1 62.6 71.1 61.0 88.9 79.3 61.0
KPConv deform (ICCV’19[thomas2019kpconv] 68.4 84.7 75.8 78.4 64.7 81.4 47.3 77.2 60.5 59.4 93.5 45.0 18.1 58.7 80.5 69.0 78.5 61.4 88.2 81.9 63.2
KPConv deform + Ours 70.2 88.9 74.5 81.3 67.2 81.8 49.3 81.5 62.3 61.0 94.7 47.0 24.9 59.4 84.8 70.5 77.9 64.6 89.2 82.3 61.1
Table 8: Semantic segmentation results on ScanNet v2.
Method mIoU ceil. floor wall beam col. wind. door chair table book. sofa board clut.  
PointNet (CVPR’17[qi2017pointnet] 41.09 88.80 97.33 69.80 0.05 3.92 46.26 10.76 58.93 52.61 5.85 40.28 26.38 33.22
RSNet (CVPR’18[huang2018recurrent] 51.93 93.34 98.36 79.18 0.00 15.75 45.37 50.10 65.52 67.87 22.45 52.45 41.02 43.64
PointCNN (NIPS’18[li2018pointcnn] 57.26 92.31 98.24 79.41 0.00 17.6 22.77 62.09 74.39 80.59 31.67 66.67 62.05 56.74
ASIS (CVPR’19[wang2019associatively] 53.40 - - - - - - - - - - - - -
ELGS (NIPS’19[wang2019exploiting] 60.06 92.80 98.48 72.65 0.01 32.42 68.12 28.79 74.91 85.12 55.89 64.93 47.74 58.22
PAT (CVPR’19[yang2019modeling] 60.07 93.04 98.51 72.28 1.00 41.52 85.05 38.22 57.66 83.64 48.12 67.00 61.28 33.64
SPH3D-GCN (TPAMI’20[lei2020spherical] 59.5 93.3 97.1 81.1 0.0 33.2 45.8 43.8 79.7 86.9 33.2 71.5 54.1 53.7
PointASNL (CVPR’20[yan2020pointasnl] 62.6 94.3 98.4 79.1 0.0 26.7 55.2 66.2 83.3 86.8 47.6 68.3 56.4 52.1
FPConv (CVPR’20[lin2020fpconv] 62.8 94.6 98.5 80.9 0.0 19.1 60.1 48.9 80.6 88.0 53.2 68.4 68.2 54.9
Point2Node (AAAI’20[han2019point2node] 62.96 93.88 98.26 83.30 0.00 35.65 55.31 58.78 79.51 84.67 44.07 71.13 58.72 55.17
SegGCN (CVPR’20[lei2020seggcn] 63.6 93.7 98.6 80.6 0.0 28.5 42.6 74.5 80.9 88.7 69.0 71.3 44.4 54.3
DCM-Net (CVPR’20[Schult_2020_CVPR] 64.0 92.1 96.8 78.6 0.0 21.6 61.7 54.6 78.9 88.7 68.1 72.3 66.5 52.4  
FusionNet (ECCV’20[zhang2020deep] 67.2 - - - - - - - - - - - - -
RandLA (CVPR’20[hu2020randla] 62.42 91.19 95.66 80.11 0.00 25.24 62.27 47.36 75.78 83.17 60.82 70.82 65.15 53.95
RandLA+Ours 65.09 92.66 97.43 82.40 0.00 37.04 59.72 52.30 77.49 86.95 63.48 71.99 70.54 54.13  
KPConv deform (ICCV’19[thomas2019kpconv] 67.1 92.8 97.3 82.4 0.0 23.9 58.0 69.0 91.0 81.5 75.3 75.4 66.7 58.9
KPConv deform+Ours 68.73 94.18 98.33 84.34 0.00 28.45 62.36 71.17 91.95 82.60 76.13 71.14 71.60 61.25  
Table 9: Results of indoor scene semantic segmentation on S3DIS Area-5.
Method mIoU man-made. natural. high veg. low veg. buildings hard scape scanning. cars  
SegCloud (3DV’17[tchapmi2017segcloud] 61.3 83.9 66.0 86.0 40.5 91.1 30.9 27.5 64.3
RF_MSSF (3DV’18[thomas2018semantic] 62.7 87.6 80.3 81.8 36.4 92.2 24.1 42.6 56.6
SPG (CVPR’18[landrieu2018large] 73.2 97.4 92.6 87.9 44.0 93.2 31.0 63.5 76.2
ShellNet (ICCV’19[zhang2019shellnet] 69.4 96.3 90.4 83.9 41.0 94.2 34.7 43.9 70.2
GACNet (CVPR’19[wang2019graph] 70.8 86.4 77.7 88.5 60.6 94.2 37.3 43.5 77.8
FGCN (CVPR’20[khan2020fgcn] 62.4 90.3 65.2 86.2 38.7 90.1 31.6 28.8 68.2
PointGCR (WACV’20[ma2020global] 69.5 93.8 80.0 64.4 66.4 93.2 39.2 34.3 85.3
RandLA (CVPR’20[hu2020randla] 77.4 95.6 91.4 86.6 51.5 95.7 51.5 69.8 76.8
KPConv rigid (ICCV’19[thomas2019kpconv] 74.6 90.9 82.2 84.2 47.9 94.9 40.0 77.3 79.7
KPConv rigid + Ours 77.6 97.0 90.9 86.7 50.8 94.5 37.3 79.7 84.1 
KPConv deform (ICCV’19[thomas2019kpconv] 73.1 - - - - - - - -
KPConv deform + Ours 77.8 94.2 89.1 85.7 54.4 95.0 43.8 76.2 83.7  
Table 10: Semantic segmentation results on Semantic3D (reduced-8).

Appendix E Detailed Experimental Results

In this section, we provide more quantitative details about our experimental results for better comparison with other competitors. In Table 8, we present the mean IoU (mIoU) over categories and the per-class IoUs for ScanNet v2. We also list the category scores for S3DIS Area-5 in Table 9. It is noteworthy that none of the methods perform well on the segmentation of beams in Area 5, because the beams in Area 5 (test set) differ greatly from those in Areas 1, 2, 3, 4, and 6 (training set). Finally, Table 10 shows the IoUs of the various classes for the Semantic3D reduced-8 task.