SASO: Joint 3D Semantic-Instance Segmentation via Multi-scale Semantic Association and Salient Point Clustering Optimization

06/25/2020 ∙ by Jingang Tan, et al. ∙ 0

We propose a novel 3D point cloud segmentation framework named SASO, which jointly performs semantic and instance segmentation. For semantic segmentation, inspired by the inherent correlation among objects in spatial context, we propose a Multi-scale Semantic Association (MSA) module to exploit semantic context information. For instance segmentation, unlike previous works that use clustering only at inference, we propose a Salient Point Clustering Optimization (SPCO) module that introduces a clustering procedure into training and pushes the network to focus on points that are difficult to distinguish. In addition, because of the inherent structures of indoor scenes, the imbalance of the category distribution is rarely considered but severely limits the performance of 3D scene perception. To address this issue, we introduce an adaptive Water Filling Sampling (WFS) algorithm to balance the category distribution of the training data. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on benchmark datasets in both semantic segmentation and instance segmentation.






1 Introduction

Scene perception plays a decisive role in many applications, such as autonomous driving, robot navigation and augmented reality. With the growth of computing technology and artificial intelligence in recent years, the scene perception ability of intelligent devices has received increasing attention from both academia and industry, especially for 3D scenes, which represent the real environment intuitively. Semantic segmentation and instance segmentation are fundamental and critical parts of 3D scene perception. Nevertheless, how to model 3D space in a digital form suitable for scene segmentation remains an open problem. Various representations of 3D scenes have been investigated, such as depth maps, voxels, multi-views, meshes and point clouds. Based on these representations, a series of excellent works have addressed the segmentation task, such as wang2019voxsegnet; dai20183dmv; qi2017pointnet; qi2017pointnet++; engelmann2017exploring; graham20183d; shen2018mining; huang2018recurrent; Xiaoqing20183D; yi2019gspn; yang2019learning; lahoud20193d; liu2019masc; wang2018sgpn; wang2019associatively; pham2019jsis3d; liang20193d; hou20193d. Among these representations, point clouds are the most compact and the closest to the geometric distribution of real 3D scenes, and they have been applied extensively in recent research.

For semantic and instance segmentation in 3D point clouds, building on the great success achieved in recent years Landrieu2018Large; wang2019voxsegnet; graham20183d; wang2019graph; dai20183dmv; engelmann2017exploring; Xiaoqing20183D; wang2018sgpn; yi2019gspn; yang2019learning; lahoud20193d; liu2019masc; elich20193d for each single task, joint learning methods for both tasks wang2018sgpn; pham2019jsis3d; wang2019associatively have opened up an effective new way to explore 3D scene segmentation, improving performance and promoting further development. Compared with the method of wang2018sgpn, which exploits a similarity matrix, pham2019jsis3d; wang2019associatively used a clustering algorithm to generate the instance segmentation result, which proved to be more effective and flexible. Nevertheless, whether the convergence direction of the training process is consistent with the orientation of the clustering algorithm was rarely considered. Additionally, marginal points are usually harder to distinguish than central points, and when multiple objects are present, internal points are easier to distinguish than boundary points across objects, as shown in Figure 3. To address this problem, we propose a Salient Point Clustering Optimization (SPCO) module that introduces clustering into the training process and focuses on the points that are harder to distinguish in the clustering process.

As for semantic segmentation, the spatial distribution of semantic information exhibits strong associations, which can be further exploited. For example, when a point belongs to a table, its neighboring points are far more likely to belong to a chair than to the ceiling. The most common approach to exploring semantic associations is the Conditional Random Fields (CRF) algorithm lafferty2001conditional, which uses normalization based on statistical global probability and has proved effective in segmentation tasks. However, CRF is complex and consumes plenty of resources, so how to exploit the semantic associations more efficiently remains an open problem. Consequently, we propose a Multi-scale Semantic Association (MSA) module to fine-tune the semantic segmentation results, based on multi-scale semantic association maps generated by statistical analysis.

In addition, because of the inherent structures of indoor scenes, the imbalance of the category distribution badly limits the performance of 3D scene perception. For example, wall and floor exist in every room, while other categories such as sofa, sink and bookshelf may not. As a result, the number of points from wall and floor is much larger than the number from other categories. This imbalance problem is rarely considered in previous works. Thus, we present an adaptive Water Filling Sampling (WFS) algorithm that addresses it by adaptively changing the sampling probability of each category.
To summarize, our contributions are the following:

  • We propose a Salient Point Clustering Optimization (SPCO) module to introduce clustering into the training process and saliently focus on the points that are harder to be distinguished in instance segmentation.

  • We propose a Multi-scale Semantic Association (MSA) module based on statistical knowledge to explore the potential spatial association of the semantic information in point clouds.

  • We propose an adaptive Water Filling Sampling (WFS) algorithm to balance category distribution in the point clouds, which is rarely considered but critical in 3D scene perception.

  • Extensive experiments demonstrate that our SASO outperforms the state-of-the-art related methods on benchmark datasets in both semantic and instance segmentation criteria.

2 Related Works

This section reviews recent deep learning-based techniques applied to 3D point clouds. In recent years, a series of deep learning architectures have been proposed to encode and decode 3D point clouds or their derived representations, and they are widely used in 3D vision tasks such as semantic and instance segmentation, object part segmentation and object detection. We divide these methods into four categories based on the data representation. Furthermore, we introduce recent progress in 3D semantic and instance segmentation built on these techniques.

2.1 Volumetric Methods

Because 3D point clouds are irregular, the simplest, if naive, approach is to voxelize them into regular 3D grids so that 3D convolutions can be applied wu20153d; wang2019voxsegnet; graham20183d; wang2017cnn; zhou2018voxelnet; qi2016volumetric; klokov2017escape; riegler2017octnet; lei2019octree; ren2018sbnet; maturana2015voxnet; huang2016point. Specifically, Wu et al. wu20153d represented a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid, using a Convolutional Deep Belief Network. Maturana et al. maturana2015voxnet proposed an architecture that efficiently handles large amounts of point cloud data by integrating a volumetric occupancy grid representation with a supervised 3D Convolutional Neural Network. Zhou et al. zhou2018voxelnet removed manual feature engineering for 3D point clouds by dividing the point cloud into equally spaced 3D voxels and transforming the group of points within each voxel into a unified feature representation through a newly introduced voxel feature encoding layer. Wang et al. wang2019voxsegnet designed a spatial dense extraction module to preserve spatial resolution during feature extraction, alleviating the loss of detail caused by sub-sampling operations such as max-pooling. Although the volumetric representation is the most common and simplest form, it has an obvious drawback: the cubic complexity of 3D convolutions leads to a dramatic increase in memory consumption and computation. To tackle this issue, wang2017cnn; riegler2017octnet proposed octree representations to improve network efficiency and reduce computation. In addition, graham20183d; ren2018sbnet proposed sparse convolutional operations to process spatially sparse 3D point clouds and achieved impressive results. Although these methods alleviate the efficiency problem, they are much more complex than volumetric CNNs and cannot fundamentally solve the memory consumption problem.

Figure 1: An illustration of our joint learning framework. The input 3D point clouds are first encoded by PointNet++ qi2017pointnet++, then the shared features are decoded separately by the semantic and instance segmentation branches. In the semantic segmentation branch (blue), an MSA module based on statistical knowledge is proposed to explore semantic associations; it is described in Sec 3.2. For the instance segmentation branch (green), we propose the SPCO module to introduce clustering into the training process and focus on hard-to-distinguish points; it is explained in Sec 3.3.

2.2 Multi-view Methods

Another common representation for 3D point clouds is the multi-view representation. In recent years, Convolutional Neural Networks have proved successful in a wide range of 2D visual tasks. To take full advantage of the strong feature extraction capability of classical CNNs, 3D point clouds are first projected into multiple pre-defined views, which are then processed by well-designed image-based CNNs to extract features, as in su2015multi; shi2015deeppano; roveri2018network; you2018pvnet; dai20183dmv; guerry2017snapnet; qi2016volumetric. Specifically, Guerry et al. guerry2017snapnet used 3D-coherent synthesis of scene observations and mixed them in a multi-view framework for 3D labeling. Su et al. su2015multi presented a novel CNN architecture that combines information from multiple views of a 3D shape into a single compact shape descriptor, offering even better recognition performance. Dai et al. dai20183dmv encoded sparse 3D point clouds with a compact multi-view representation, including bird's eye view and front view as well as RGB images, to perform high-accuracy 3D object detection. You et al. you2018pvnet proposed PVNet to integrate both point cloud and multi-view data for joint 3D shape recognition. Although the multi-view representation of point cloud data is reasonable, the projection from 3D to 2D loses part of the 3D geometric information.

2.3 Graph Convolution Methods

A graph is a natural representation of irregular data such as 3D point clouds, offering a compact yet rich representation of the contextual relationships between points of different object parts bruna2013spectral; wang2018local; te2018rgcnn; simonovsky2017dynamic; wang2019graph; Landrieu2018Large. Specifically, Bruna et al. bruna2013spectral proposed two constructions, based on a hierarchical clustering of the domain and on the spectrum of the graph Laplacian, to show that for low-dimensional graphs it is possible to learn convolutional layers with a number of parameters independent of the input size, resulting in efficient deep architectures. Wang et al. wang2018local applied spectral graph convolution on a local graph, combined with a novel graph pooling strategy, to augment the relative layout of neighboring points as well as their features. Te et al. te2018rgcnn treated the features of points in a point cloud as signals on a graph and defined the convolution over the graph by Chebyshev polynomial approximation, leveraging spectral graph theory. They also designed a graph-signal smoothness prior in the loss function to regularize the learning process. Although graph convolutional methods have achieved significant performance, methods built on the Laplacian matrix are computationally expensive because of the Laplacian eigen-decomposition, need a large number of parameters to express the convolutional filters, and lack spatial localization.

2.4 Point Cloud Methods

Point clouds are an intuitive, memory-efficient 3D representation that is well suited to representing geometric details. How to apply deep learning techniques to point clouds directly, simply and efficiently is a critical problem. To address this challenge, Qi et al. qi2017pointnet designed PointNet, a novel type of neural network that directly consumes point clouds and respects the permutation invariance of the input points. More specifically, they solved the unordered nature of point clouds through max pooling and maintained rotation invariance through a spatial transformer network (STN). The extracted features of each point combine its own information with the global information. PointNet has proved efficient in many applications, ranging from object classification, part segmentation and object detection to scene semantic parsing. However, PointNet relies only on the max-pooling layer to learn global features and does not consider local relationships. Therefore, a series of works qi2017pointnet++; huang2018recurrent; wang2019dynamic; li2018pointcnn; engelmann2017exploring investigated local context and hierarchical learning structures. Typically, Qi et al. qi2017pointnet++ proposed PointNet++ based on their previous work PointNet; it uses PointNet as a local feature extraction module to perform hierarchical feature extraction like CNNs, and finally uses upsampling to generate the final high-level features. Li et al. li2018pointcnn proposed PointCNN, which uses an MLP to learn a transformation matrix that resolves the unordered nature of point clouds, and then applies the introduced X-Conv module to perform convolution on the transformed features. This method achieved performance similar to PointNet++.

2.5 3D Semantic and Instance Segmentation

Recent advances in learning-based techniques have also led to various cutting-edge 3D semantic and instance segmentation approaches Landrieu2018Large; te2018rgcnn; wang2019graph; wang2019voxsegnet; graham20183d; qi2016volumetric; dai20183dmv; qi2017pointnet; qi2017pointnet++; engelmann2017exploring; shen2018mining; li2018pointcnn; hua2018pointwise; huang2018recurrent; wu2019pointconv. The volumetric representation has been adopted by wang2019voxsegnet; graham20183d to convert 3D point clouds into regular grids and apply CNNs to extract features. Landrieu2018Large; te2018rgcnn; wang2019graph used graph convolutional networks to model the relationships between 3D points, which offers a compact yet rich representation of context. qi2016volumetric; dai20183dmv converted 3D point clouds into multiple views to take full advantage of the strong feature extraction capability of classical CNNs. qi2017pointnet; qi2017pointnet++; engelmann2017exploring; shen2018mining presented more efficient and flexible ways to apply MLPs directly to point clouds while respecting the permutation invariance of points. li2018pointcnn; hua2018pointwise; wu2019pointconv performed segmentation by designing novel CNNs on point clouds. Huang et al. huang2018recurrent and Ye et al. Xiaoqing20183D proposed new approaches that slice the point clouds and use recurrent neural networks to exploit the inherent contextual features.

3D instance segmentation is a relatively new research area that attracts more and more attention yang2019learning; lahoud20193d; liu2019masc. Specifically, Lahoud et al. lahoud20193d proposed a network based on 3D voxel grids, which treats instance segmentation as a multi-task learning problem. The network generates abstract feature embeddings for voxels and estimates instance centers to learn instance information. Yang et al. yang2019learning introduced a framework that simultaneously generates 3D bounding boxes and predicts binary masks for the points within each box in one stage. Recently, Wang et al. wang2018sgpn introduced a framework that jointly performs semantic and instance segmentation in 3D point clouds. Inspired by the proposal mechanism in 2D Faster R-CNN ren2015faster, they proposed a similarity matrix indicating the similarity between each pair of points in the embedded feature space to predict point grouping proposals; the network then predicts the corresponding semantic class for each proposal to generate the final semantic-instance results. Although the similarity matrix is an effective and natural way to indicate proposals, it is large and inefficient, leading to heavy computation and memory consumption. Some subsequent proposal-based methods yi2019gspn; hou20193d boosted the performance of similar frameworks but still depended on a two-stage procedure and the time-consuming non-maximum suppression algorithm. More recently, wang2019associatively; pham2019jsis3d used a clustering algorithm to divide points into different objects, which was demonstrated to be more effective and efficient than proposal-based methods. Nevertheless, they did not consider whether the convergence direction of the training process is coupled with the orientation of the clustering algorithm. In addition, different points differ in how difficult they are to assign to distinct objects, which is rarely considered. In this work, we propose a framework that takes this critical problem into consideration and show that doing so is significant and effective.

Figure 2: An illustration of our MSA module. First, we create multi-scale semantic association maps from statistics gathered with ball queries over all the training 3D scenes. For each point in the semantic prediction result, we also generate a vector indicating the category probabilities of the surrounding points found with a ball query. Then we calculate the similarity between this vector and each row (category) of the association map and normalize it into a probability vector; the details of the calculation are formulated in equation (7). The final prediction for each point is the fusion of the original predicted probability and the fine-tuned probability, as formulated in equation (8).

3 Proposed Method

In this section, we first introduce the baseline framework of our network, which jointly performs semantic and instance segmentation. Then we give the details of our MSA module for semantic segmentation in Sec 3.2, as depicted in Figure 2. Next, we describe our SPCO module for instance segmentation in Sec 3.3, as shown in Figure 3. The whole framework of our method can be seen in Figure 1. Finally, the adaptive Water Filling Sampling (WFS) algorithm is explained in detail in Sec 3.4.

3.1 Baseline Framework

As depicted in Figure 1, the baseline framework is the network without MSA and with SPCO replaced by ordinary clustering. First, the input point cloud is encoded into a high-dimensional shared feature matrix by the PointNet++ encoder qi2017pointnet++. Next, the two branches decode this shared feature separately for their own tasks. The semantic segmentation branch decodes it into a semantic feature matrix and then outputs per-point semantic predictions over the semantic classes. The instance segmentation branch decodes it into an instance feature matrix, which is used to predict per-point instance embeddings of a fixed output dimension. These embeddings are used to calculate the distances among the points for instance clustering. During training, the semantic branch is supervised by a cross-entropy loss, while the loss function for instance segmentation, inspired by wang2019associatively, is formulated as follows:

\[ L_{ins} = L_{var} + L_{dist} + \alpha \cdot L_{reg} \tag{1} \]

where the goal of $L_{var}$ is to pull the embeddings toward the mean embedding of the points in the same instance, while $L_{dist}$ guides the mean embeddings of different instances to repel each other. We denote by $L_{reg}$ a regularization term that bounds the embedding values, weighted by $\alpha$. The three loss terms are:

\[ L_{var} = \frac{1}{I} \sum_{i=1}^{I} \frac{1}{N_i} \sum_{j=1}^{N_i} \left[ \lVert \mu_i - e_j \rVert_1 - \delta_v \right]_+^2 \tag{2} \]

\[ L_{dist} = \frac{1}{I(I-1)} \sum_{i_A=1}^{I} \sum_{\substack{i_B=1 \\ i_B \neq i_A}}^{I} \left[ 2\delta_d - \lVert \mu_{i_A} - \mu_{i_B} \rVert_1 \right]_+^2 \tag{3} \]

\[ L_{reg} = \frac{1}{I} \sum_{i=1}^{I} \lVert \mu_i \rVert_1 \tag{4} \]

where $I$ represents the number of ground-truth instances; $N_i$ is the number of points in instance $i$; $\mu_i$ denotes the mean embedding of instance $i$; $e_j$ is an embedding of a point; $\delta_v$ and $\delta_d$ indicate margins for the variance and distance loss respectively; $i_A$ and $i_B$ represent different instances; $[x]_+ = \max(0, x)$ is the hinge function; and the distance is represented by $\lVert \cdot \rVert_1$.

Figure 3: An illustration of our SPCO module. As shown in the first row, different points of one object differ in how difficult they are to distinguish, especially points at the joints between different objects. To address this problem, we introduce clustering into the training procedure and focus on the points that are harder to distinguish in the clustering process.

For inference, we use mean-shift clustering comaniciu2002mean on the instance embeddings to obtain the final instance labels, following wang2019associatively. The mode of the semantic labels of the points within the same instance is assigned as the predicted semantic class of that instance.
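The instance loss terms described in this section can be sketched in plain Python. This is a minimal illustration that assumes the terms follow the discriminative loss form used by wang2019associatively with an L1 distance; the function names and the default margins `delta_v` and `delta_d` are hypothetical choices for the example, not values taken from this paper.

```python
def l1(a, b):
    """L1 distance between two embedding vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mean_embedding(points):
    """Component-wise mean of a list of embedding vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def discriminative_loss(embeddings, instance_ids, delta_v=0.5, delta_d=1.5):
    """Return (variance, distance, regularization) terms.

    delta_v and delta_d are assumed margins chosen for illustration.
    """
    groups = {}
    for e, i in zip(embeddings, instance_ids):
        groups.setdefault(i, []).append(e)
    means = {i: mean_embedding(pts) for i, pts in groups.items()}
    n = len(means)

    # Variance term: pull each point toward the mean of its own instance.
    l_var = sum(
        sum(max(0.0, l1(means[i], e) - delta_v) ** 2 for e in pts) / len(pts)
        for i, pts in groups.items()
    ) / n

    # Distance term: push the means of different instances apart.
    l_dist = 0.0
    if n > 1:
        ids = list(means)
        l_dist = sum(
            max(0.0, 2 * delta_d - l1(means[a], means[b])) ** 2
            for a in ids for b in ids if a != b
        ) / (n * (n - 1))

    # Regularization term: keep the instance means bounded.
    zero = [0.0] * len(embeddings[0])
    l_reg = sum(l1(m, zero) for m in means.values()) / n
    return l_var, l_dist, l_reg
```

Two well-separated, tight instances give zero variance and distance terms, so only the regularizer contributes; overlapping instance means are penalized by the distance term.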

3.2 Multi-scale Semantic Association Module

In 3D semantic segmentation, the categories of the points surrounding a given point are usually related to the category of the point itself, i.e., the spatial distribution of semantic information exhibits strong associations, as in the example in Sec 1, and this can be further exploited. Thus, based on the semantic context information, we propose our Multi-scale Semantic Association (MSA) module, shown in Figure 2.
As shown in Figure 2, on the one hand, we create multi-scale semantic association maps from statistics gathered with ball queries over all the training 3D scenes, with one map per scale whose rows and columns are indexed by the semantic classes. On the other hand, from the decoded semantic output, we generate for each point the category probabilities of its surrounding points with a ball query at the same scale. Then, for each point, we calculate the distance between this neighborhood vector and each row of the association map at that scale, and convert the result into a probability vector in which a larger entry means a higher probability that the point belongs to the corresponding category. Note that the MSA module generates one probability vector per scale, and these probabilities come only from the surrounding points. Finally, the original predicted probability vector is added to the per-scale probability vectors to obtain the final prediction. The calculation is given in equations (5)-(8),


which use a one-hot operation, the number of points in the ball query of a point at a given scale, the semantic association map at that scale, a normalization and a softmax operation. The operations between the per-point vectors and the association map can be carried out with broadcasting, and the row-wise operations are applied along axis 1. The final probability output is the sum of the original prediction and the per-scale probability vectors, weighted by different coefficients. In our experiments, we set the ball-query radii of the scales to 0.2, 0.3 and 0.5 m and the corresponding coefficients to 0.5, 0.3 and 0.2 respectively.
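The MSA procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the ball query is brute force, the similarity between a neighborhood vector and a map row is taken to be the negative L1 distance followed by a softmax, and all function names are hypothetical. The exact operations of equations (5)-(8) may differ.

```python
import math


def neighbors(points, i, radius):
    """Indices of points within `radius` of point i (brute-force ball query)."""
    return [j for j in range(len(points))
            if j != i and math.dist(points[i], points[j]) <= radius]


def association_map(points, labels, num_classes, radius):
    """Row c: distribution of neighbor classes around training points of class c."""
    counts = [[0.0] * num_classes for _ in range(num_classes)]
    for i, c in enumerate(labels):
        for j in neighbors(points, i, radius):
            counts[c][labels[j]] += 1.0
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in counts]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def msa_refine(points, probs, assoc_maps, radii, weights):
    """Add weighted per-scale association probabilities to the predictions."""
    num_classes = len(probs[0])
    pred = [max(range(num_classes), key=p.__getitem__) for p in probs]
    fused = [list(p) for p in probs]
    for amap, radius, w in zip(assoc_maps, radii, weights):
        for i in range(len(points)):
            nbrs = neighbors(points, i, radius)
            if not nbrs:
                continue
            # Class histogram of the predicted labels around point i.
            hist = [0.0] * num_classes
            for j in nbrs:
                hist[pred[j]] += 1.0 / len(nbrs)
            # Assumed similarity: negative L1 distance to each map row.
            sims = [-sum(abs(a - b) for a, b in zip(hist, row)) for row in amap]
            for c, s in enumerate(softmax(sims)):
                fused[i][c] += w * s
    return fused
```

In this sketch a point with an ambiguous prediction is nudged toward the class whose typical neighborhood best matches what surrounds it, which is the intuition behind the table and chair example in Sec 1.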

3.3 Salient Point Clustering Optimization Module

As explained in the baseline framework, for instance segmentation, the goal of the variance loss is to pull the embeddings toward the mean embedding of the points from the same object, while the distance loss guides the mean embeddings of different instances to repel each other during training. At inference time, the mean-shift clustering algorithm is used to distinguish the points of different objects. However, the coupling between the convergence orientation in training and the clustering orientation in inference is not taken into consideration. In addition, points from the same object differ in how difficult they are to segment, as in the example in Sec 1. Thus, in this paper, we propose a Salient Point Clustering Optimization (SPCO) module, which brings the mean-shift clustering algorithm into the training process and focuses on the points that are harder to distinguish in the clustering process. More specifically, as shown in Figure 3, mean-shift clustering is run during training to simulate the clustering procedure used in inference. Then, for the points that are clustered into an instance but do not belong to it according to the ground truth, we generate an additional loss that repels their embeddings from the mean embedding of this instance. The loss is formulated in equation (9). Note that the ID of a clustered instance is decided by the mode of the ground-truth IDs of its points, and, so that training first converges to a reliable model, we add the SPCO loss to the training process only from epoch 10 onward.

In equation (9), the loss is averaged over the instances produced by clustering and, within each clustered instance, over its wrongly clustered points; it involves the embedding of each wrongly clustered point and the mean embedding of the correctly clustered points of that instance. Equipped with our SPCO module, the network can simulate the clustering procedure of inference more realistically and pay more attention to the points that are easily clustered erroneously, which is significant for improving the performance of instance segmentation.
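A minimal sketch of such a repelling loss is given below. It takes the cluster assignments as input, so any clustering step (mean-shift in this paper) can be plugged in. The squared hinge with a `margin` parameter and the L1 distance are assumptions made for illustration and mirror the distance loss of the baseline; the exact form of equation (9) may differ, and all names are hypothetical.

```python
def l1(a, b):
    """L1 distance between two embedding vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def spco_loss(embeddings, gt_ids, cluster_ids, margin=1.5):
    """Repel wrongly clustered points from the mean of the correct points.

    `margin` is an assumed hyperparameter, not a value from the paper.
    """
    clusters = {}
    for idx, k in enumerate(cluster_ids):
        clusters.setdefault(k, []).append(idx)

    total = 0.0
    for members in clusters.values():
        gt = [gt_ids[i] for i in members]
        # The clustered instance takes the most frequent ground-truth ID.
        mode = max(sorted(set(gt)), key=gt.count)
        right = [embeddings[i] for i in members if gt_ids[i] == mode]
        wrong = [embeddings[i] for i in members if gt_ids[i] != mode]
        if not wrong:
            continue
        dim = len(right[0])
        mu = [sum(e[d] for e in right) / len(right) for d in range(dim)]
        # Penalize wrong points that sit closer than `margin` to the mean.
        total += sum(max(0.0, margin - l1(mu, e)) ** 2 for e in wrong) / len(wrong)
    return total / len(clusters)
```

When the clustering agrees with the ground truth the loss is zero, so the extra term only acts on the hard points that clustering gets wrong.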

3.4 Water Filling Sampling Algorithm

Indoor scenes have some inherent structures. For example, the space is always bounded by walls and floors. When we sample point clouds from indoor scenes, points of certain categories make up the major proportion, which causes a serious imbalance between these dominant categories and the other categories, especially tiny objects. This problem is rarely discussed in previous works on point cloud segmentation. Therefore, in this paper, a Water Filling Sampling algorithm is proposed to solve the imbalance problem in indoor scenes, and it adapts to different category distributions. Specifically, for the point cloud of a scene, we first cut it into blocks along the x-y plane and store the corresponding semantic and instance labels for each point in the blocks. In addition, we define an accumulative vector that stores the number of blocks for each category, and for each category we keep a list of the blocks that contain points of that category. If the number of points in a block that belong to a category is larger than a threshold, the block index is added to the list of that category and the block count of that category is increased by 1. Once the cutting step is finished, we obtain the original probability of each category from the block counts. To keep the categories balanced, we need to sample a set of blocks of the same size from all the blocks with different probabilities. If the original probability of a category is high in the raw data, its sampling probability should be correspondingly low. To achieve this goal, we gradually add a small probability value to the category with the minimum sum of original probability and current sampling probability, until the total sampling probability reaches 1. The process resembles filling water into a canyon formed by the original probabilities of all the categories; the details are given in Algorithm 1.
For part segmentation datasets, such as ShapeNet, the algorithm becomes more concise because we can obtain the per-category counts and lists for each object directly and skip the cutting step. Note that, because of the characteristics of part segmentation datasets, we run the algorithm on super categories.

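Algorithm 1 Details of the Water Filling Sampling (WFS) algorithm

Input: Training point clouds of all the scenes with corresponding semantic-instance labels, and a series of parameters, including the threshold, the number of points for each block and the number of categories.
Output: All the balanced blocks with corresponding semantic labels and instance labels.
Initialization: per-category block counts set to zero, per-category block lists empty, sampling probabilities set to zero, set of all blocks empty, set of balanced blocks empty.

for each scene in all the scenes do
    Cut the scene into blocks along the x-y plane.
    In each block, randomly sample the given number of points with labels.
    for each block in all the blocks do
        Separate out the corresponding labels.
        for each category do
            Count the points of this category in the block.
            if the count is larger than the threshold then
                Extend the block list of this category with the block index.
                Increase the block count of this category by 1.
            end if
        end for
        Extend the set of all blocks with this block.
    end for
end for
Get the original probability of each category from the block counts.
while the sum of the sampling probabilities is less than 1 do
    Find the category with the minimum sum of original probability and current sampling probability.
    Add a small probability value to the sampling probability of this category.
end while
for each category do
    Determine the number of blocks to sample from the sampling probability of this category.
    Randomly sample block indices from the block list of this category.
    Extend the sampled index set with these indices.
end for
Extend the balanced blocks with the blocks at the sampled indices.
Separate the balanced blocks into points, semantic labels and instance labels.
Return the balanced points, semantic labels and instance labels.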
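The water-filling step that turns original category probabilities into sampling probabilities can be sketched as below. The function name and the `step` size are assumptions made for illustration; the sketch uses a fixed number of increments rather than a floating-point sum test so that the total added probability is well defined.

```python
def water_filling(original, step=0.001):
    """Compute per-category sampling probabilities by water filling.

    `original` holds the original probability of each category;
    `step` is an assumed small increment (the "drop of water").
    """
    sampling = [0.0] * len(original)
    # Pour a total probability mass of 1 in increments of `step`.
    for _ in range(round(1.0 / step)):
        # The lowest point of the canyon: original plus water so far.
        k = min(range(len(original)), key=lambda c: original[c] + sampling[c])
        sampling[k] += step
    return sampling


if __name__ == "__main__":
    # Rare categories receive most of the sampling probability.
    print(water_filling([0.5, 0.3, 0.2]))
```

A dominant category whose original probability already exceeds the final water level receives no sampling probability at all, which is how the procedure suppresses categories such as wall and floor.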

4 Experiments

In this section, we compare our method with other state-of-the-art methods on 3D point cloud semantic and instance segmentation to demonstrate that it is effective and robust on different kinds of datasets, including a large-scale indoor 3D dataset and a part segmentation 3D dataset.

4.1 Datasets and Details

Datasets. Following wang2019associatively, we conduct experiments on two benchmark datasets: the Stanford 3D Indoor Semantics Dataset (S3DIS) armeni20163d and the ShapeNet part segmentation dataset yi2016scalable. The datasets are described as follows:

  • S3DIS is a real 3D point cloud dataset of indoor spaces captured with Matterport scanners, which contains 6 areas and 272 rooms. Each point has a 9-dimensional input feature consisting of XYZ, RGB and normalized coordinates. Each point is annotated with an instance ID and a semantic category ID from 13 classes. Following qi2017pointnet, we split the rooms into 1 m × 1 m overlapping blocks with a stride of 0.5 m along the x-y plane and sample 4096 points from each block.

  • The ShapeNet dataset is a synthetic mesh dataset for part segmentation, which consists of 16881 shape models from 16 categories. Each object is annotated with 2 to 5 parts from 50 different sub-categories. We use the instance annotations generated by wang2018sgpn as the ground-truth labels and sample 2048 points for each shape during training, following qi2017pointnet. We split the dataset into training and validation sets following wang2019associatively, and a 3-dimensional vector of XYZ coordinates is fed into our network as input.

Figure 4: Qualitative results of our method on the S3DIS dataset. For semantic results, each color refers to a particular category and for instance results, different colors represent different objects.
Dataset    Method                       mCov  mWCov  mPrec  mRec  mAcc  mIoU  oAcc
Area5      SGPN wang2018sgpn            32.7  35.5   36.0   28.7  -     -     -
Area5      JSIS3D pham2019jsis3d        32.6  35.6   39.7   29.1  59.2  51.8  86.9
Area5      3D-BoNet yang2019learning    41.5  44.6   57.6   40.2  59.2  51.8  86.9
Area5      ASIS wang2019associatively   44.6  47.8   55.3   42.4  60.9  53.4  86.9
Area5      OURS                         49.0  51.9   59.5   45.9  63.5  55.5  87.5
6-Fold CV  SGPN wang2018sgpn            37.9  40.8   38.2   31.2  -     -     -
6-Fold CV  JSIS3D pham2019jsis3d        37.3  41.0   49.5   33.4  59.8  48.5  79.9
6-Fold CV  3D-BoNet yang2019learning    48.4  52.4   65.6   47.6  69.3  59.4  86.3
6-Fold CV  ASIS wang2019associatively   51.2  55.1   63.6   47.5  70.1  59.3  86.2
6-Fold CV  OURS                         54.5  58.3   64.2   50.8  72.8  61.1  87.0
Table 1: Semantic (mAcc, mIoU, oAcc) and instance (mCov, mWCov, mPrec, mRec) segmentation results on S3DIS.
SPCO  MSA  WFS  mWCov  mPrec  mAcc  mIoU
                47.1   51.9   59.7  52.0
                50.3   56.0   61.6  53.6
                47.1   51.9   61.4  53.2
                49.8   55.3   61.2  53.3
                50.3   56.0   62.7  54.5
                51.9   59.5   63.5  55.5
Table 2: Ablation study on the S3DIS dataset in Area5.
Metric   Method                      mean  ceiling  floor  wall  beam  column  window  door  table  chair  sofa  bookcase  board  clutter
WCov     BASE                        47.1  89.7     88.7   68.3  0.0   3.4     60.9    5.0   51.8   67.6   23.9  53.6      50.3   49.5
WCov     ASIS wang2019associatively  47.6  89.0     89.2   72.4  0.0   8.8     58.1    4.7   52.4   76.6   46.3  50.1      64.4   45.5
WCov     OURS                        51.9  89.0     87.3   73.1  0.0   9.1     60.1    13.3  54.3   69.8   48.7  55.0      68.1   46.6
Sem IoU  BASE                        52.0  92.8     97.8   74.8  0.0   7.9     51.9    16.1  72.3   77.9   35.4  56.1      42.5   50.8
Sem IoU  ASIS wang2019associatively  53.4  92.4     98.4   76.7  0.0   15.6    49.5    21.4  72.3   78.7   38.0  55.9      45.8   49.7
Sem IoU  OURS                        55.5  92.5     97.7   77.2  0.0   11.7    50.8    29.0  74.2   80.3   41.3  60.0      56.6   50.8
Table 3: Per class results on the S3DIS dataset.
| Method | Train time (min) | Train memory (MB) | Test time (min) | Test memory (MB) | mPrec |
|---|---|---|---|---|---|
| SGPN wang2018sgpn | 59.3 | 7549 | 209.5 | 420 | 36.0 |
| ASIS wang2019associatively | 64.7 | 4275 | 54.2 | 1235 | 55.3 |
| OURS | 75.0 | 1203 | 40.4 | 373 | 59.5 |

Table 4: Comparisons of computation time, GPU memory and performance.

Details. For instance segmentation, we trained SASO with . We use five output embeddings following wang2019associatively and set to 0.01. We select the Adam optimizer to optimize the network on a single GPU (Tesla P100) and set the momentum to 0.9 for the training process. During the inference process, we set the bandwidth to 0.6 for mean-shift clustering and apply the BlockMerging algorithm wang2018sgpn to merge instances from different blocks.
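
To make the inference-time grouping step concrete, the following is a minimal pure-Python sketch of flat-kernel mean-shift clustering over per-point embeddings. It is an illustration under stated assumptions, not the implementation used in the experiments: the function name `mean_shift`, the convergence tolerance, the iteration cap and the mode-merging radius (half the bandwidth) are all hypothetical choices, and only the bandwidth of 0.6 comes from the text.

```python
import math


def mean_shift(points, bandwidth=0.6, max_iter=50, tol=1e-4):
    """Flat-kernel mean-shift; returns one cluster label per input point.

    `points` is a list of equal-length tuples (e.g. embedding vectors).
    `max_iter`, `tol` and the merge radius below are illustrative assumptions.
    """
    modes = []
    for p in points:
        mode = list(p)
        for _ in range(max_iter):
            # Average all points that fall inside the window around the mode.
            neighbours = [q for q in points if math.dist(mode, q) <= bandwidth]
            if not neighbours:  # defensive guard; the window is normally non-empty
                break
            new_mode = [sum(c) / len(neighbours) for c in zip(*neighbours)]
            shift = math.dist(mode, new_mode)
            mode = new_mode
            if shift < tol:
                break
        modes.append(mode)

    # Points whose modes landed close together share an instance label.
    centres, labels = [], []
    for mode in modes:
        for idx, centre in enumerate(centres):
            if math.dist(mode, centre) <= bandwidth / 2:
                labels.append(idx)
                break
        else:
            centres.append(mode)
            labels.append(len(centres) - 1)
    return labels


# Two well-separated groups of 2-D embeddings form two instances.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
print(mean_shift(embeddings))
```

The sketch is quadratic in the number of points, so it is only meant to show the mechanics on small inputs; merging instances across blocks with BlockMerging is not shown.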

Evaluation. Following wang2019associatively, we evaluate the experimental results with the following metrics. For semantic segmentation, we calculate the overall accuracy (oAcc), mean accuracy (mAcc) and mean IoU (mIoU) across all the semantic classes, along with the detailed per-class IoU scores. To evaluate the performance of instance segmentation, we use the coverage (Cov) and weighted coverage (WCov) ren2017end; liu2017sgn; zhuo2017indoor. Cov is the average instance-wise IoU of the prediction matched with ground truth, and WCov is the score after being weighted by the size of ground truth. For the predicted regions $O$ and the ground-truth regions $G$, Cov and WCov are defined as:

$$\mathrm{Cov}(G, O) = \sum_{i=1}^{|G|} \frac{1}{|G|} \max_{j} \mathrm{IoU}(r_i^G, r_j^O)$$

$$\mathrm{WCov}(G, O) = \sum_{i=1}^{|G|} w_i \max_{j} \mathrm{IoU}(r_i^G, r_j^O), \qquad w_i = \frac{|r_i^G|}{\sum_{k} |r_k^G|}$$

where $|r_i^G|$ is the number of points in ground-truth region $i$. We also measure the classical metrics of mean precision (mPrec) and mean recall (mRec) with an IoU threshold of 0.5.
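
As a worked illustration of the coverage metrics described above, the sketch below computes the coverage and weighted coverage for one scene, with each region represented as a set of point indices. The function names `iou` and `coverage` and the set-based representation are assumptions made for this example; averaging over classes or scenes to obtain the reported mean values is not shown.

```python
def iou(a, b):
    """Intersection over union of two regions given as sets of point indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def coverage(gt_regions, pred_regions):
    """Return (coverage, weighted coverage) for one scene.

    Each ground-truth region is matched to the prediction with the highest IoU.
    Coverage averages these best IoUs uniformly; weighted coverage weights each
    one by the ground-truth region's share of all ground-truth points.
    """
    total_points = sum(len(g) for g in gt_regions)
    cov = wcov = 0.0
    for g in gt_regions:
        best = max((iou(g, p) for p in pred_regions), default=0.0)
        cov += best / len(gt_regions)
        wcov += best * len(g) / total_points
    return cov, wcov


# Two ground-truth instances; one point (index 3) is assigned to the wrong one.
gt = [{0, 1, 2, 3}, {4, 5}]
pred = [{0, 1, 2}, {3, 4, 5}]
cov, wcov = coverage(gt, pred)
print(f"coverage={cov:.3f} weighted={wcov:.3f}")
```

In this toy case the larger instance is matched with IoU 3/4 and the smaller with IoU 2/3, so the weighted score is pulled toward the larger instance's value.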

Figure 5: Qualitative results for semantic and instance segmentation on the ShapeNet dataset.

4.2 S3DIS Evaluation

We conduct the experiments on the S3DIS dataset with PointNet++ as the backbone network. We train the network for 50 epochs with a batch size of 12; the initial learning rate is set to 0.001 and divided by 2 every 300k iterations.
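
The learning-rate schedule described above can be sketched as a staircase decay. This is a minimal illustration, assuming the rate is halved at exact multiples of 300k iterations; the function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(iteration, base_lr=0.001, decay_every=300_000, factor=0.5):
    """Staircase decay: multiply the rate by `factor` every `decay_every` iterations."""
    return base_lr * factor ** (iteration // decay_every)


for it in (0, 299_999, 300_000, 600_000):
    print(it, learning_rate(it))
```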

Quantitative Results. For the classical Area5 validation scenes, the quantitative results of SASO in instance and semantic segmentation are shown in Table 1. SASO achieves 51.9 mWCov and 59.5 mPrec, outperforming the state-of-the-art method 3D-BoNet yang2019learning by 7.3 in mWCov and 1.9 in mPrec. For semantic segmentation, our method improves mAcc and mIoU by 2.6 and 2.1 respectively, compared with ASIS wang2019associatively. For a more comprehensive comparison, we also evaluate our method with 6-fold cross validation on the S3DIS dataset. As shown in the table, our method achieves 58.3 mWCov and 72.8 mAcc, exceeding the best previous results of 55.1 and 70.1.

Figure 6: The sampling probability and the corresponding improvement for different categories on the ShapeNet dataset.
(a) Orange shows the original frequency of each category in the training dataset; blue shows the sampling probability of each category.
(b) Orange indicates a positive boost, while purple indicates a negative influence. Note that for an intuitive visualization, the values are multiplied by 5.

The consistent improvement in both semantic and instance segmentation demonstrates the effectiveness of our method. For a more detailed comparison with our baseline framework and ASIS wang2019associatively, Table 3 shows the per-category results for both instance and semantic segmentation on the Area5 scenes of S3DIS. Note that for a fair comparison, we reproduce the results of ASIS wang2019associatively with the PointNet++ backbone using the authors' code to obtain the per-class results.

Qualitative Results. To present our results intuitively, we visualize the predicted results and annotations on point clouds, as shown in Figure 4. For instance segmentation, different colors represent different instances. For semantic segmentation, each color refers to a particular category. Our method performs well, especially at the boundaries between different objects.

Ablation Study. The ablation study results are shown in Table 2, where each module is added to the baseline framework. With our SPCO module, we obtain gains of 3.2 in mWCov and 4.1 in mPrec. Interestingly, the semantic segmentation results are also improved with this module. We attribute this to the fact that the semantic and instance segmentation tasks share the shallow features, so an improvement in the instance segmentation branch can also benefit the semantic segmentation branch. When we add the MSA module to the baseline, the semantic segmentation results improve by 1.7 in mAcc and 1.2 in mIoU. With the WFS algorithm added to the baseline framework, we obtain gains of 3.4 in mPrec and 1.3 in mIoU, which indicates that the balance among different categories is critical to both tasks. Finally, compared with the baseline framework, our full method improves both tasks, with a gain of 7.6 in mPrec for instance segmentation and 3.5 in mIoU for semantic segmentation.
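
The gains quoted in the ablation study can be re-derived from the rows of Table 2. The short sketch below does this arithmetic; the row labels (`baseline`, `SPCO`, `MSA`, `WFS`, `full`) are an assumption that maps each row of the table to the configuration the prose describes, and the names `TABLE2` and `gains` are hypothetical.

```python
# Rows of Table 2 (Area5), in the column order mWCov, mPrec, mAcc, mIoU.
# The row-to-configuration mapping is inferred from the prose, not stated in the table.
TABLE2 = {
    "baseline": (47.1, 51.9, 59.7, 52.0),
    "SPCO": (50.3, 56.0, 61.6, 53.6),
    "MSA": (47.1, 51.9, 61.4, 53.2),
    "WFS": (49.8, 55.3, 61.2, 53.3),
    "full": (51.9, 59.5, 63.5, 55.5),
}
METRICS = ("mWCov", "mPrec", "mAcc", "mIoU")


def gains(config):
    """Difference between a configuration and the baseline, rounded to one decimal."""
    base = TABLE2["baseline"]
    return {m: round(v - b, 1) for m, v, b in zip(METRICS, TABLE2[config], base)}


for name in ("SPCO", "MSA", "WFS", "full"):
    print(name, gains(name))
```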

Consumption of memory and time. Table 4 shows a comparison of the memory cost and computation time. For a fair comparison, we conducted the experiments in the same environment, including the same GPU (GTX 1080), batch size (4) and data (Area5 including 68 rooms). All times are in minutes and all memory figures are in MB. For training, the reported values are the time and memory required for one epoch. Our method needs somewhat more training time (75.0 minutes versus 64.7 for ASIS and 59.3 for SGPN) because we introduce clustering into the training process, but it uses far less memory (1203 MB) thanks to its compact architecture. For inference, the reported values are the resources consumed to process Area5. Our approach takes only 373 MB and 40.4 minutes while achieving better performance, making it faster and more memory-efficient than the compared methods.

| Method | mIoU |
|---|---|
| PointNet++ qi2017pointnet++ | 84.3 |
| ASIS wang2019associatively | 85.0 |
| SGPN wang2018sgpn | 85.8 |
| SpiderCNN xu2018spidercnn | 85.3 |
| SSCN graham20183d | 86.0 |
| PointConv wu2019pointconv | 85.7 |
| BASE | 83.5 |
| OURS | 86.4 |

Table 5: Semantic segmentation results on the ShapeNet dataset.

4.3 ShapeNet Evaluation

We also validate our method on the part segmentation dataset ShapeNet. The semantic annotations are publicly available, while the instance segmentation annotations are the generated results of wang2018sgpn. Because ground-truth instance annotations are lacking, we only provide qualitative results for instance segmentation in Figure 5, as in wang2019associatively. The four rows from top to bottom in Figure 5 show the semantic segmentation results, semantic annotations, instance segmentation results and instance annotations respectively. Different parts of the same object are well grouped into individual instances, especially at the boundaries between parts. The semantic segmentation results are reported in Table 5. Our approach improves mIoU over the baseline framework by 2.9 and outperforms ASIS wang2019associatively, PointConv wu2019pointconv and SSCN graham20183d. These results show that our proposed method can also improve part segmentation performance.
To illustrate the effectiveness of our WFS algorithm, we show the sampling probability and the corresponding improvement for different categories in Figure 6. In the upper graph (a), orange shows the original frequency of each category in the training dataset and blue shows the sampling probability of each category. The distribution across categories is more balanced with our WFS algorithm. The second graph (b) shows the improvement for each category, where orange indicates a positive boost and purple indicates a negative influence. For categories with low frequency in the raw data, the improvements are clear, while for categories with high frequency, the results are barely affected. This demonstrates that our WFS algorithm is effective in alleviating the imbalance problem.

5 Conclusion

In this paper, we propose a novel framework which jointly performs semantic and instance segmentation. For the instance segmentation task, a module named SPCO is proposed to introduce clustering into the training process and focus on the points that are harder to distinguish during clustering. For the semantic segmentation branch, we introduce the MSA module, based on statistical knowledge, to exploit the potential association of the spatial semantic distribution. In addition, we propose a Water Filling Sampling algorithm to address the imbalance problem of the category distribution. Qualitative and quantitative experimental results on challenging benchmark datasets demonstrate the effectiveness and robustness of our method.


* This project was supported by National Natural Science Foundation of China (No.61806189) and Shanghai Municipal Science and Technology Major Project (Grant No. 2018SHZDZX01, ZHANGJIANG LAB).