1 Introduction
Scene perception plays a decisive role in many applications, such as autonomous driving, robot navigation and augmented reality. With the growth of computing technology and artificial intelligence in recent years, the scene perception ability of intelligent devices has received increasing attention from both academia and industry, especially for 3D scenes, which represent the real environment intuitively. Semantic segmentation and instance segmentation are fundamental and critical parts of 3D scene perception. Nevertheless, how to model 3D space in a digital form suitable for scene segmentation remains an open problem. Various representations of 3D scenes have been investigated, such as depth maps, voxels, multiple views, meshes and point clouds. Based on these representations, a series of excellent segmentation methods have been proposed, such as wang2019voxsegnet; dai20183dmv; qi2017pointnet; qi2017pointnet++; engelmann2017exploring; graham20183d; shen2018mining; huang2018recurrent; Xiaoqing20183D; yi2019gspn; yang2019learning; lahoud20193d; liu2019masc; wang2018sgpn; wang2019associatively; pham2019jsis3d; liang20193d; hou20193d. Among these representations, point clouds are the most compact and the most natural fit to the geometric distribution of real 3D scenes, and they have been applied extensively in recent research.

For semantic and instance segmentation of 3D point clouds, building on the success achieved in recent years on each single task Landrieu2018Large; wang2019voxsegnet; graham20183d; wang2019graph; dai20183dmv; engelmann2017exploring; Xiaoqing20183D; wang2018sgpn; yi2019gspn; yang2019learning; lahoud20193d; liu2019masc; elich20193d, joint learning methods for both tasks wang2018sgpn; pham2019jsis3d; wang2019associatively have opened up an effective new way to explore 3D scene segmentation, improving performance and promoting further development. Compared with the method of wang2018sgpn, which exploits a similarity matrix, pham2019jsis3d; wang2019associatively use a clustering algorithm to generate the instance segmentation result, which has proved more effective and flexible. Nevertheless, whether the convergence direction of the training process is consistent with the direction of the clustering algorithm has rarely been considered. Additionally, marginal points are usually harder to distinguish than central points, and when multiple objects are present, internal points are easier to distinguish than boundary points between objects, as shown in Figure 3. To address this problem, we propose a Salient Point Clustering Optimization (SPCO) module that introduces clustering into the training process and focuses on the points that are harder to distinguish during clustering.

As for semantic segmentation, the spatial distribution of semantic information exhibits strong associations that can be further exploited. For example, when a point belongs to a table, its neighboring points are much more likely to belong to a chair than to the ceiling. The most common approach to exploiting semantic associations is the Conditional Random Field (CRF) lafferty2001conditional, which uses normalization based on a global statistical probability and has proved effective in segmentation tasks.
However, CRF is complex and consumes considerable resources, so how to exploit semantic associations more efficiently remains an open problem. Consequently, we propose a Multiscale Semantic Association (MSA) module to fine-tune the semantic segmentation results, based on multiscale semantic association maps generated by statistical analysis.

In addition, because of the inherent structure of indoor scenes, the imbalance of the category distribution severely limits the performance of 3D scene perception. For example, walls and floors exist in every room, while other categories, such as sofa, sink and bookshelf, may not. As a result, the number of points from walls and floors is much larger than that from other categories. This imbalance problem has rarely been considered in previous works. Thus, we present an adaptive Water Filling Sampling (WFS) algorithm that addresses the problem by adaptively changing the sampling probability of each category.

To summarize, our contributions are the following:

We propose a Salient Point Clustering Optimization (SPCO) module that introduces clustering into the training process and focuses on the points that are harder to distinguish in instance segmentation.

We propose a Multiscale Semantic Association (MSA) module based on statistical knowledge to explore the potential spatial association of the semantic information in point clouds.

We propose an adaptive Water Filling Sampling (WFS) algorithm to balance category distribution in the point clouds, which is rarely considered but critical in 3D scene perception.

Extensive experiments demonstrate that our SASO outperforms related state-of-the-art methods on benchmark datasets in terms of both semantic and instance segmentation metrics.
2 Related Works
This section reviews recent deep-learning-based techniques applied to 3D point clouds. In recent years, a series of deep learning architectures have been proposed to encode and decode 3D point clouds or their derived representations, and they are widely used in 3D vision tasks such as semantic and instance segmentation, object part segmentation and object detection. We divide these methods into four categories based on the data representation. Furthermore, we introduce recent progress in 3D semantic and instance segmentation built on these techniques.
2.1 Volumetric Methods
Because 3D point clouds are irregular, the simplest approach is to voxelize them into regular 3D grids so that 3D convolutions can be applied wu20153d; wang2019voxsegnet; graham20183d; wang2017cnn; zhou2018voxelnet; qi2016volumetric; klokov2017escape; riegler2017octnet; lei2019octree; ren2018sbnet; maturana2015voxnet; huang2016point. Specifically, Wu et al. wu20153d represented a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid, using a Convolutional Deep Belief Network. Maturana et al. maturana2015voxnet proposed an architecture that efficiently handles large amounts of point cloud data by integrating a volumetric occupancy grid representation with a supervised 3D Convolutional Neural Network. Zhou et al. zhou2018voxelnet removed manual feature engineering for 3D point clouds: they divided point clouds into equally spaced 3D voxels and transformed the group of points within each voxel into a unified feature representation through a newly introduced voxel feature encoding layer. Wang et al. wang2019voxsegnet designed a spatial dense extraction module to preserve spatial resolution during feature extraction, alleviating the loss of detail caused by subsampling operations such as max pooling. Although the volumetric representation is the most common and simplest form, it has an obvious drawback: the cubic complexity of 3D convolutions dramatically increases memory consumption and computing cost. To tackle this issue, wang2017cnn; riegler2017octnet proposed octree representations to improve network efficiency and reduce computing cost. In addition, graham20183d; ren2018sbnet proposed sparse convolutional operations to process spatially sparse 3D point clouds and achieved impressive results. Although these methods alleviate the efficiency problem, they are much more complex than volumetric CNNs and cannot fundamentally solve the memory consumption problem.
2.2 Multiview Methods
Another common representation for 3D point clouds is the multiview representation. In recent years, Convolutional Neural Networks have proved successful in a wide range of 2D visual tasks. To take advantage of the strong feature extraction capability of classical CNNs, 3D point clouds are first projected into multiple predefined views, which are then processed by well-designed image-based CNNs to extract features, such as su2015multi; shi2015deeppano; roveri2018network; you2018pvnet; dai20183dmv; guerry2017snapnet; qi2016volumetric. Specifically, Guerry et al. guerry2017snapnet used 3D-coherent synthesis of scene observations and mixed them in a multiview framework for 3D labeling. Su et al. su2015multi presented a novel CNN architecture that combines information from multiple views of a 3D shape into a single, compact shape descriptor offering even better recognition performance. Dai et al. dai20183dmv encoded the sparse 3D point clouds with a compact multiview representation, including bird's eye view and front view as well as RGB images, to perform high-accuracy 3D object detection. You et al. you2018pvnet proposed PVNet to integrate both point cloud and multiview data for joint 3D shape recognition. Although the multiview representation of point cloud data is reasonable, the projection from 3D to 2D loses part of the 3D geometric information.
2.3 Graph Convolution Methods
A graph is a native representation of irregular data such as 3D point clouds, and it offers a compact yet rich representation of the contextual relationships between points of different object parts bruna2013spectral; wang2018local; te2018rgcnn; simonovsky2017dynamic; wang2019graph; Landrieu2018Large. Specifically, Bruna et al. bruna2013spectral proposed two constructions, based on a hierarchical clustering of the domain and on the spectrum of the graph Laplacian, to show that for low-dimensional graphs it is possible to learn convolutional layers with a number of parameters independent of the input size, resulting in efficient deep architectures. Wang et al. wang2018local applied spectral graph convolution on a local graph, combined with a novel graph pooling strategy, to augment the relative layout of neighboring points as well as their features. Te et al. te2018rgcnn treated the features of points in a point cloud as signals on a graph and defined convolution over the graph by Chebyshev polynomial approximation, leveraging spectral graph theory. They also designed a graph-signal smoothness prior in the loss function to regularize the learning process. Although graph convolutional methods have achieved strong performance, methods built on the Laplacian matrix are computationally expensive because of the Laplacian eigendecomposition, require a large number of parameters to express the convolutional filters, and lack spatial localization.
2.4 Point Cloud Methods
Point clouds are an intuitive, memory-efficient 3D representation that is well suited to representing geometric details. How to apply deep learning techniques to point clouds directly, simply and efficiently is a critical problem. To address this challenge, Qi et al. qi2017pointnet designed PointNet, a novel type of neural network that directly consumes point clouds and respects the permutation invariance of the input points. More specifically, they handled the unordered nature of point clouds through max pooling and maintained rotation invariance through a spatial transformer network (STN). The extracted feature of each point combines its own information with the global information. PointNet has proved efficient in many applications, ranging from object classification, part segmentation and object detection to scene semantic parsing. However, PointNet relies only on the max pooling layer to learn global features and does not consider local relationships. Therefore, a series of works qi2017pointnet++; huang2018recurrent; wang2019dynamic; li2018pointcnn; engelmann2017exploring investigated local context and hierarchical learning structures. Typically, Qi et al. qi2017pointnet++ proposed PointNet++ based on their previous work PointNet, which uses PointNet as a local feature extraction module to perform hierarchical feature extraction like CNNs and finally uses upsampling to generate the final high-level features. Li et al. li2018pointcnn proposed PointCNN, which uses an MLP to learn a transformation matrix that resolves the unordered nature of point clouds and then applies the introduced X-Conv module to perform convolution on the transformed features. This method achieved performance similar to PointNet++.
2.5 3D semantic and instance segmentation
Recent advances in learning-based techniques have also led to various cutting-edge 3D semantic and instance segmentation approaches Landrieu2018Large; te2018rgcnn; wang2019graph; wang2019voxsegnet; graham20183d; qi2016volumetric; dai20183dmv; qi2017pointnet; qi2017pointnet++; engelmann2017exploring; shen2018mining; li2018pointcnn; hua2018pointwise; huang2018recurrent; wu2019pointconv. The volumetric representation was adopted by wang2019voxsegnet; graham20183d to convert 3D point clouds into regular grids and apply CNNs to extract features. Landrieu2018Large; te2018rgcnn; wang2019graph used graph convolutional networks to model the relationships among 3D points, which offers a compact yet rich representation of context. qi2016volumetric; dai20183dmv projected 3D point clouds into multiple views to take advantage of the strong feature extraction capability of classical CNNs. qi2017pointnet; qi2017pointnet++; engelmann2017exploring; shen2018mining presented more efficient and flexible ways to apply MLPs directly to point clouds while respecting the permutation invariance of points. li2018pointcnn; hua2018pointwise; wu2019pointconv performed segmentation by designing novel CNNs on point clouds. Huang et al. huang2018recurrent and Ye et al. Xiaoqing20183D proposed new approaches that slice the point clouds and use recurrent neural networks to exploit the inherent contextual features.

3D instance segmentation is a relatively new research area that is attracting increasing attention yang2019learning; lahoud20193d; liu2019masc. Specifically, Lahoud et al. lahoud20193d proposed a network based on 3D voxel grids that treats instance segmentation as a multi-task learning problem. The network generates abstract feature embeddings for voxels and estimates instance centers to learn instance information. Yang et al. yang2019learning introduced a framework that simultaneously generates 3D bounding boxes and predicts binary masks for the points within each box in a single stage. Wang et al. wang2018sgpn introduced a framework that jointly performs semantic and instance segmentation in 3D point clouds. Inspired by the proposal mechanism of the 2D Faster R-CNN ren2015faster, they proposed a similarity matrix that indicates the similarity between each pair of points in the embedded feature space to predict point grouping proposals; the network then predicts the semantic class of each proposal to generate the final semantic-instance results. Although the similarity matrix is an effective and natural way to indicate proposals, the matrix is large and inefficient, incurring heavy computation and memory consumption. Subsequent proposal-based methods yi2019gspn; hou20193d boosted the performance of similar frameworks but still depended on a two-stage procedure and the time-consuming non-maximum suppression algorithm. More recently, wang2019associatively; pham2019jsis3d used a clustering algorithm to divide points into different objects, which was demonstrated to be more effective and efficient than proposal-based methods. Nevertheless, they did not consider whether the convergence direction of the training process is coupled with the direction of the clustering algorithm. In addition, different points differ in how difficult they are to assign to distinct objects, which is rarely considered. In this work, we propose a framework that takes these problems into consideration and show that doing so is significant and effective.
3 Proposed Method
In this section, we first introduce the baseline framework of our network, which jointly performs semantic and instance segmentation. Then we give the details of our MSA module for semantic segmentation in Sec 3.2, as depicted in Figure 2. Next, we describe our SPCO module for instance segmentation in Sec 3.3, as shown in Figure 3. The whole framework of our method can be seen in Figure 1. Finally, the adaptive Water Filling Sampling (WFS) algorithm is explained in detail in Sec 3.4.
3.1 Baseline Framework
As depicted in Figure 1, the baseline framework is the network without MSA and with SPCO replaced by standard clustering. First, point clouds of size are encoded into a high-dimensional feature matrix by the encoder PointNet++ qi2017pointnet++. Next, the two task branches decode this matrix separately. In the semantic segmentation branch, is decoded into the semantic feature matrix and then outputs the semantic predictions , where is the number of semantic classes. The instance segmentation branch decodes into the instance feature matrix , which is used to predict the per-point instance embeddings , where denotes the length of the output embedding dimensions. These embeddings are used to calculate the distances among the points for instance clustering. During training, the semantic branch is supervised by the cross-entropy loss, while the loss function for instance segmentation, inspired by wang2019associatively, is formulated as follows:
(1) 
where the goal of the variance loss is to pull the embeddings toward the mean embedding of the points in the same instance, while the distance loss pushes the mean embeddings of different instances away from each other. The third term is a regularization term that bounds the embedding values. The three loss terms are defined as:
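$$L_{var} = \frac{1}{I}\sum_{i=1}^{I}\frac{1}{N_i}\sum_{j=1}^{N_i}\left[\lVert \mu_i - e_j \rVert - \delta_v\right]_+^2 \tag{2}$$
$$L_{dist} = \frac{1}{I(I-1)}\sum_{i_A=1}^{I}\sum_{\substack{i_B=1 \\ i_B \neq i_A}}^{I}\left[2\delta_d - \lVert \mu_{i_A} - \mu_{i_B} \rVert\right]_+^2 \tag{3}$$
$$L_{reg} = \frac{1}{I}\sum_{i=1}^{I}\lVert \mu_i \rVert \tag{4}$$
where $I$ represents the number of ground-truth instances; $N_i$ is the number of points in instance $i$; $\mu_i$ denotes the mean embedding of instance $i$; $e_j$ is the embedding of a point; $\delta_v$ and $\delta_d$ are the margins for the variance and distance losses, respectively; $i_A$ and $i_B$ denote different instances; $[x]_+ = \max(0, x)$ is the hinge function; and the distance is represented by $\lVert \cdot \rVert$.

For inference, we use mean-shift clustering comaniciu2002mean on the instance embeddings to obtain the final instance labels, following wang2019associatively. The mode of the semantic labels of the points within the same instance is assigned as the predicted semantic class.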
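The pull, push and regularization terms described above can be illustrated with a small pure-Python sketch. It assumes the standard discriminative-loss form with hinged, squared margins; the L1 distance, the margin values `delta_v=0.5` and `delta_d=1.5`, and all function names are illustrative assumptions rather than the settings used in this paper.

```python
from itertools import combinations


def l1(a, b):
    """L1 distance between two embedding vectors (the choice of norm is an assumption)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mean_embedding(points):
    """Component-wise mean of a list of embedding vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def hinge(x):
    """Hinge function: max(0, x)."""
    return max(0.0, x)


def instance_loss(embeddings, instance_ids, delta_v=0.5, delta_d=1.5):
    """Return the (variance, distance, regularization) terms for one point cloud."""
    groups = {}
    for e, i in zip(embeddings, instance_ids):
        groups.setdefault(i, []).append(e)
    means = {i: mean_embedding(pts) for i, pts in groups.items()}
    num_inst = len(groups)

    # Pull term: points farther than delta_v from their instance mean are penalized.
    l_var = sum(
        sum(hinge(l1(means[i], e) - delta_v) ** 2 for e in pts) / len(pts)
        for i, pts in groups.items()
    ) / num_inst

    # Push term: pairs of instance means closer than 2 * delta_d are penalized.
    l_dist = 0.0
    if num_inst > 1:
        pairs = list(combinations(means.values(), 2))
        l_dist = sum(hinge(2 * delta_d - l1(a, b)) ** 2 for a, b in pairs) / len(pairs)

    # Regularization term: keep the mean embeddings close to the origin.
    zero = [0.0] * len(embeddings[0])
    l_reg = sum(l1(m, zero) for m in means.values()) / num_inst
    return l_var, l_dist, l_reg
```

Averaging over unordered pairs of instance means gives the same value as averaging over all ordered pairs, because the distance is symmetric.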
3.2 Multiscale Semantic Association Module
In 3D semantic segmentation, the categories of the points surrounding a point of an object are usually related to the category of the point itself, i.e., the spatial distribution of the semantic information has a strong association, as in the example in Sec 1, which can be further exploited. Thus, based on the semantic context information, we propose our Multiscale Semantic Association (MSA) module, which can be seen in Figure 2.
As shown in Figure 2, on the one hand, we create multiscale semantic association maps by collecting ball-query statistics over all the training 3D scenes, means the map in scale , is the number of classes. On the other hand, based on the decoded semantic output feature , we can also generate the probabilities of the categories of the surrounding points with a ball query in scale . Then for each point in , we calculate the distance between and each row in , and convert the result into a probability vector for this point, where the larger an entry is, the higher the probability that this point belongs to the corresponding category. Note that the MSA module generates multiple probability vectors because of the multiple scales, and these probabilities come only from the surrounding points. Finally, the original predicted probability vector is combined with the multiple probability vectors to obtain the final prediction. The formulas are given in Equations (5)-(8):
(5) 
(6) 
(7) 
(8) 
where means the one-hot operation, means the number of points in the ball query of point in scale , means the semantic association map in scale , means normalization and means the softmax operation. Note that and can be combined using the broadcast mechanism, and is applied along axis 1. The final probability output is the sum of and in different scales with different coefficients. In our experiments, we set equal to radii of 0.2, 0.3 and 0.5 m and equal to 0.5, 0.3 and 0.2, respectively.
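The idea behind the MSA module can be sketched in pure Python for illustration. The sketch below is built on assumptions: the association map is taken to be the row-normalized histogram of neighbor classes per class, the distance is Euclidean, and the distance-to-probability step is a softmax over negative distances. The function names and the brute-force ball query are hypothetical and are not the exact formulation of Equations (5)-(8).

```python
import math


def neighbors_within(points, i, radius):
    """Indices of the points within `radius` of point i (brute-force ball query)."""
    return [j for j, p in enumerate(points)
            if j != i and math.dist(points[i], p) <= radius]


def association_map(points, labels, num_classes, radius):
    """Row c: normalized class histogram of the neighbors of class-c training points."""
    rows = [[0.0] * num_classes for _ in range(num_classes)]
    for i, c in enumerate(labels):
        for j in neighbors_within(points, i, radius):
            rows[c][labels[j]] += 1.0
    return [[v / sum(r) for v in r] if sum(r) else r for r in rows]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return [e / sum(exps) for e in exps]


def msa_refine(points, probs, maps, radii, weights):
    """Add neighborhood-based class probabilities to the predicted probabilities."""
    num_classes = len(probs[0])
    preds = [max(range(num_classes), key=p.__getitem__) for p in probs]
    refined = []
    for i, p in enumerate(probs):
        out = list(p)
        for amap, radius, w in zip(maps, radii, weights):
            nbrs = neighbors_within(points, i, radius)
            if not nbrs:
                continue
            # Histogram of the predicted (one-hot) classes around point i.
            hist = [0.0] * num_classes
            for j in nbrs:
                hist[preds[j]] += 1.0 / len(nbrs)
            # A small distance to row c suggests that the point belongs to class c.
            dists = [math.dist(hist, row) for row in amap]
            q = softmax([-d for d in dists])
            out = [o + w * qc for o, qc in zip(out, q)]
        refined.append(out)
    return refined
```

With the scales reported above, `radii` would be `[0.2, 0.3, 0.5]` and `weights` would be `[0.5, 0.3, 0.2]`, with one association map per radius.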
3.3 Salient Point Clustering Optimization Module
As explained in the baseline framework, for instance segmentation, the variance loss pulls the embeddings toward the mean embedding of the points from the same object, while the distance loss pushes the mean embeddings of different instances away from each other during training. At inference time, the mean-shift clustering algorithm is used to distinguish the points of different objects. However, the coupling between the convergence direction in training and the clustering direction in inference is not taken into consideration. In addition, the points of the same object differ in how difficult they are to segment, as in the example in Sec 1. Thus, in this paper, we propose a Salient Point Clustering Optimization (SPCO) module, which brings the mean-shift clustering algorithm into the training process and focuses on the points that are harder to distinguish during clustering. More specifically, as shown in Figure 3, mean-shift clustering is run during training to simulate the clustering procedure used in inference. Then, for the points that are clustered into one instance but do not belong to this instance according to the ground truth, we generate an additional loss that repels their embeddings away from the mean embedding of this instance. The loss is formulated in Equation (9). Note that the ID of a clustered instance is decided by the mode of the ground-truth IDs of its points, and to converge to a reliable model, we add into the training process from epoch 10 onward.
(9) 
(10) 
where means the number of instances in clustering, means the number of wrongly clustered points in instance , means the embedding of th wrongly clustered point in instance and means the mean embedding of the correctly clustered points in instance . Equipped with our SPCO module, the network can simulate the inference-time clustering procedure more realistically and pay more attention to the points that are easily clustered erroneously, which is significant for improving the performance of instance segmentation.
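The bookkeeping of the SPCO step (assigning each cluster the mode of its ground-truth IDs, then penalizing the wrongly clustered points) can be sketched as follows. The penalty form is an assumption: a squared hinge on the L1 distance with a hypothetical `margin` is used purely for illustration and is not the loss of Equation (9). The cluster assignments are taken as given, for example from mean-shift.

```python
from collections import Counter


def spco_loss(embeddings, cluster_ids, gt_ids, margin=1.5):
    """Hypothetical repel loss on wrongly clustered points (the hinge form is assumed)."""
    clusters = {}
    for idx, c in enumerate(cluster_ids):
        clusters.setdefault(c, []).append(idx)

    total, n_used = 0.0, 0
    for members in clusters.values():
        # The cluster takes the most frequent ground-truth ID among its points.
        mode_id = Counter(gt_ids[i] for i in members).most_common(1)[0][0]
        correct = [embeddings[i] for i in members if gt_ids[i] == mode_id]
        wrong = [embeddings[i] for i in members if gt_ids[i] != mode_id]
        if not wrong:
            continue
        mean = [sum(col) / len(correct) for col in zip(*correct)]
        # Penalize wrong points that lie within `margin` of the mean of the correct points.
        penalties = [
            max(0.0, margin - sum(abs(a - b) for a, b in zip(e, mean))) ** 2
            for e in wrong
        ]
        total += sum(penalties) / len(wrong)
        n_used += 1
    return total / n_used if n_used else 0.0
```

Only clusters that contain at least one wrongly clustered point contribute, so a perfect clustering yields a loss of zero.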
3.4 Water Filling Sampling algorithm
Indoor scenes have some inherent structure. For example, the space is always surrounded by walls and floors. When we sample point clouds from indoor scenes, points of certain categories occupy the main proportion, which causes a serious imbalance between these dominant categories and the other categories, especially for tiny objects. This problem has rarely been discussed in previous works on point cloud segmentation. Therefore, in this paper, a Water Filling Sampling algorithm that adapts to different category distributions is proposed to solve the imbalance problem in indoor scenes. Specifically, for the point cloud of a scene, we first cut it into blocks along the x-y plane and store the corresponding semantic and instance labels for each point in the blocks. In addition, we define an accumulative vector to store the block number for each category, and generate a list to indicate which block contains points of category . If the number of points in a block that belong to category is larger than a threshold , the block index is added to and is increased by 1. When the cutting step is finished, we can obtain the probabilities of the block number for each category from . To keep the balance among the categories, we need to sample the same number of blocks from all the blocks with different probabilities. If the original probability of a category is high in the raw data, its sampling probability should be correspondingly low. To achieve this, we gradually add a small probability value to the category with the minimum sum of original probability and current sampling probability, until the total sampling probability sums to 1. The process resembles pouring water into a canyon formed by the original probabilities of all the categories. The details of the algorithm are given in Algorithm 1.
As for part segmentation datasets, such as ShapeNet, the algorithm becomes more concise because we can obtain and for each object directly and skip the cutting step. Note that because of the characteristics of part segmentation datasets, we apply the algorithm to super-categories.
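The filling step of WFS, as described in Sec 3.4, can be sketched in a few lines. This is a minimal illustration that covers only the probability-filling loop; the step size `step` and the function name are assumptions, and the block cutting and the bookkeeping of which blocks contain which categories are omitted.

```python
def water_filling_probabilities(original, step=1e-3):
    """Pour sampling probability into the categories with the lowest level.

    original: per-category probabilities in the raw data (they sum to 1).
    Returns per-category sampling probabilities that sum to approximately 1.
    """
    sampling = [0.0] * len(original)
    n_steps = round(1.0 / step)
    for _ in range(n_steps):
        # The "water level" of a category is its original probability
        # plus the sampling probability added so far.
        levels = [o + s for o, s in zip(original, sampling)]
        lowest = levels.index(min(levels))
        sampling[lowest] += step
    return sampling
```

For example, with raw probabilities `[0.6, 0.3, 0.1]` the rare third category receives the largest sampling probability and the dominant first category the smallest, so the summed levels end up nearly equal across categories.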
4 Experiments
In this section, we compare our method with other state-of-the-art methods on 3D point cloud semantic and instance segmentation to demonstrate that it is effective and robust on different kinds of datasets, including a large-scale indoor 3D dataset and a 3D part segmentation dataset.
4.1 Datasets and Details
Datasets. Following wang2019associatively, we conduct experiments on two benchmark datasets: the Stanford 3D Indoor Semantics Dataset (S3DIS) armeni20163d and the ShapeNet part segmentation dataset yi2016scalable. The datasets are described as follows:

S3DIS is a real 3D point cloud dataset of indoor spaces captured with Matterport scanners, containing 6 areas and 272 rooms. Each point has a 9-dimensional input feature consisting of XYZ, RGB and normalized coordinates. Each point is annotated with an instance ID and a semantic category ID from 13 classes. Following qi2017pointnet, we split the rooms into 1 m × 1 m overlapping blocks with a stride of 0.5 m along the x-y plane and sample 4096 points from each block.

ShapeNet is a synthetic mesh dataset for part segmentation, consisting of 16881 shape models from 16 categories. Each object is annotated with 2 to 5 parts from 50 different subcategories. We use the instance annotations generated by wang2018sgpn as the ground-truth labels and, following qi2017pointnet, sample 2048 points from each shape during training. We split the dataset into training and validation sets following wang2019associatively, and a 3-dimensional vector of XYZ coordinates is fed into our network as input.
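The S3DIS block preparation described above (1 m × 1 m blocks, 0.5 m stride, 4096 points per block) can be sketched as follows. This is a simplified illustration, not the preprocessing code used in the experiments: the function name is hypothetical, the horizontal plane is assumed to be x-y, edge blocks may extend past the room boundary, and blocks with too few points are filled by sampling with replacement.

```python
import math
import random


def split_into_blocks(points, block=1.0, stride=0.5, n_sample=4096, seed=0):
    """Cut a room into overlapping blocks on the x-y plane and resample each one.

    points: list of (x, y, z, ...) tuples.
    Returns a list of blocks, each a list of exactly n_sample points.
    """
    rng = random.Random(seed)
    x_min = min(p[0] for p in points)
    y_min = min(p[1] for p in points)
    x_max = max(p[0] for p in points)
    y_max = max(p[1] for p in points)
    # Number of window positions along each axis (at least one).
    nx = max(1, math.ceil((x_max - x_min) / stride))
    ny = max(1, math.ceil((y_max - y_min) / stride))
    blocks = []
    for ix in range(nx):
        for iy in range(ny):
            x0 = x_min + ix * stride
            y0 = y_min + iy * stride
            inside = [p for p in points
                      if x0 <= p[0] <= x0 + block and y0 <= p[1] <= y0 + block]
            if not inside:
                continue
            if len(inside) >= n_sample:
                blocks.append(rng.sample(inside, n_sample))
            else:
                # Too few points: sample with replacement to reach n_sample.
                blocks.append(rng.choices(inside, k=n_sample))
    return blocks
```

Because the stride is half the block size, most points fall into several blocks, which is why instances predicted in different blocks have to be merged after inference.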
Table 1: Instance and semantic segmentation results on S3DIS (Area5 and 6-fold cross validation).
Dataset    Method                      mCov  mWCov  mPrec  mRec  mAcc  mIoU  oAcc
Area5      SGPN wang2018sgpn           32.7  35.5   36.0   28.7  -     -     -
Area5      JSIS3D pham2019jsis3d       32.6  35.6   39.7   29.1  59.2  51.8  86.9
Area5      3D-BoNet yang2019learning   41.5  44.6   57.6   40.2  59.2  51.8  86.9
Area5      ASIS wang2019associatively  44.6  47.8   55.3   42.4  60.9  53.4  86.9
Area5      OURS                        49.0  51.9   59.5   45.9  63.5  55.5  87.5
6-fold CV  SGPN wang2018sgpn           37.9  40.8   38.2   31.2  -     -     -
6-fold CV  JSIS3D pham2019jsis3d       37.3  41.0   49.5   33.4  59.8  48.5  79.9
6-fold CV  3D-BoNet yang2019learning   48.4  52.4   65.6   47.6  69.3  59.4  86.3
6-fold CV  ASIS wang2019associatively  51.2  55.1   63.6   47.5  70.1  59.3  86.2
6-fold CV  OURS                        54.5  58.3   64.2   50.8  72.8  61.1  87.0

Table 2: Ablation study results.
SPCO  MSA  WFS  mWCov  mPrec  mAcc  mIoU
-     -    -    47.1   51.9   59.7  52.0
✓     -    -    50.3   56.0   61.6  53.6
-     ✓    -    47.1   51.9   61.4  53.2
-     -    ✓    49.8   55.3   61.2  53.3
✓     ✓    -    50.3   56.0   62.7  54.5
✓     ✓    ✓    51.9   59.5   63.5  55.5

Table 3: Per-class instance (WCov) and semantic (Sem IoU) segmentation results on S3DIS Area5.
Metric   Method                      mean  ceiling  floor  wall  beam  column  window  door  table  chair  sofa  bookcase  board  clutter
WCov     BASE                        47.1  89.7     88.7   68.3  0.0   3.4     60.9    5.0   51.8   67.6   23.9  53.6      50.3   49.5
WCov     ASIS wang2019associatively  47.6  89.0     89.2   72.4  0.0   8.8     58.1    4.7   52.4   76.6   46.3  50.1      64.4   45.5
WCov     OURS                        51.9  89.0     87.3   73.1  0.0   9.1     60.1    13.3  54.3   69.8   48.7  55.0      68.1   46.6
Sem IoU  BASE                        52.0  92.8     97.8   74.8  0.0   7.9     51.9    16.1  72.3   77.9   35.4  56.1      42.5   50.8
Sem IoU  ASIS wang2019associatively  53.4  92.4     98.4   76.7  0.0   15.6    49.5    21.4  72.3   78.7   38.0  55.9      45.8   49.7
Sem IoU  OURS                        55.5  92.5     97.7   77.2  0.0   11.7    50.8    29.0  74.2   80.3   41.3  60.0      56.6   50.8

Table 4: Comparison of memory cost and computation time.
Method                      Train time (min)  Train memory (MB)  Test time (min)  Test memory (MB)  mPrec
SGPN wang2018sgpn           59.3              7549               209.5            420               36.0
ASIS wang2019associatively  64.7              4275               54.2             1235              55.3
OURS                        75.0              1203               40.4             373               59.5

Details. For instance segmentation, we trained SASO with . We use five output embeddings following wang2019associatively and set to 0.01. We use the Adam optimizer on a single GPU (Tesla P100) with the momentum set to 0.9 during training. During inference, we set the bandwidth to 0.6 for mean-shift clustering and apply the BlockMerging algorithm wang2018sgpn to merge instances from different blocks.
Evaluation. Following wang2019associatively, we evaluate the experimental results with the following metrics. For semantic segmentation, we calculate the overall accuracy (oAcc), mean accuracy (mAcc) and mean IoU (mIoU) across all the semantic classes, along with the detailed per-class IoU scores. To evaluate the performance of instance segmentation, we use the coverage (Cov) and weighted coverage (WCov) ren2017end; liu2017sgn; zhuo2017indoor. Cov is the average instance-wise IoU of the predictions matched with the ground truth, and WCov is the same score weighted by the size of each ground-truth region. For the predicted regions $O$ and the ground-truth regions $G$, Cov and WCov are defined as:
$$Cov(G, O) = \frac{1}{|G|}\sum_{i=1}^{|G|}\max_{j} IoU(r_i^G, r_j^O) \tag{11}$$
$$WCov(G, O) = \sum_{i=1}^{|G|} w_i \max_{j} IoU(r_i^G, r_j^O) \tag{12}$$
$$w_i = \frac{|r_i^G|}{\sum_{k}|r_k^G|} \tag{13}$$
where $|r_i^G|$ is the number of points in ground-truth region $i$.
We also measure the classical metrics of mean precision (mPrec) and mean recall (mRec) with an IoU threshold of 0.5.
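The coverage metrics described above can be illustrated with a short sketch for a single scene. It assumes the usual definition, in which every ground-truth region is matched to the predicted region with the highest IoU, and it represents regions as sets of point indices. The function names are hypothetical, and the per-class averaging that yields mCov and mWCov is omitted.

```python
def iou(a, b):
    """IoU of two regions given as sets of point indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def coverage(gt_regions, pred_regions):
    """Return (Cov, WCov) for one scene."""
    if not gt_regions:
        return 0.0, 0.0
    # Best-matching prediction for every ground-truth region.
    best = [max((iou(g, p) for p in pred_regions), default=0.0) for g in gt_regions]
    total_points = sum(len(g) for g in gt_regions)
    cov = sum(best) / len(gt_regions)
    # Each region is weighted by its share of the ground-truth points.
    wcov = sum(len(g) / total_points * b for g, b in zip(gt_regions, best))
    return cov, wcov
```

WCov gives large instances more influence, so the two scores diverge most in scenes that mix very large and very small objects.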
4.2 S3DIS Evaluation
We conduct the experiments on the S3DIS dataset with the PointNet++ backbone network. We train the network for 50 epochs with a batch size of 12. The initial learning rate is set to 0.001 and divided by 2 every 300k iterations.
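The learning rate schedule stated above can be written as a one-line function. The sketch assumes a staircase schedule, where the rate changes only at multiples of 300k iterations; whether the original training code decays in steps or continuously is not stated, and the function name is illustrative.

```python
def learning_rate(step, base_lr=0.001, decay_every=300_000, factor=0.5):
    """Step decay: halve the learning rate every `decay_every` iterations."""
    return base_lr * factor ** (step // decay_every)
```

Under this assumption the rate stays at 0.001 for the first 300k iterations and is 0.0005 for the next 300k.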
Quantitative Results. For the classical Area5 validation scenes, the quantitative results of SASO on instance and semantic segmentation are shown in Table 1. SASO achieves 51.9 mWCov and 59.5 mPrec, which outperforms the state-of-the-art method 3D-BoNet yang2019learning by 7.3 in mWCov and 1.9 in mPrec. As for semantic segmentation, our method improves mAcc and mIoU by 2.6 and 2.1, respectively, compared with ASIS wang2019associatively.
For a more comprehensive comparison, we evaluate our method with 6-fold cross validation on the S3DIS dataset. As shown in Table 1, our method achieves 58.3 mWCov and 72.8 mAcc, which outperforms the state-of-the-art methods. The consistent improvement in both semantic and instance segmentation demonstrates the effectiveness of our method.
For a more detailed comparison with our baseline framework and ASIS wang2019associatively, Table 3 shows the results for specific categories in both instance and semantic segmentation on the Area5 scenes of S3DIS. Note that for a fair comparison, we reproduce the results of ASIS wang2019associatively with the PointNet++ backbone using the authors' code to obtain the per-class results.
Qualitative Results. To present our results intuitively, we visualize the predicted results and the annotations on point clouds, as shown in Figure 4. For instance segmentation, different colors represent different instances. For semantic segmentation, each color refers to a particular category. Our method performs well, especially at the boundaries between different objects.
Ablation Study. The ablation study results are shown in Table 2, where different modules of our method are added to the baseline framework. With the SPCO module, we obtain gains of 3.2 in mWCov and 4.1 in mPrec. Interestingly, the semantic segmentation results also improve with this module. We attribute this to the shallow features shared by the semantic and instance segmentation tasks, so that an improvement in the instance branch also benefits the semantic branch. When we add the MSA module to the baseline, the semantic segmentation results improve by 1.7 in mAcc and 1.2 in mIoU. With the WFS algorithm added to the baseline framework, we obtain gains of 3.4 in mPrec and 1.3 in mIoU, which indicates that balancing the categories is critical to both tasks. Finally, compared with the baseline framework, our full method improves both tasks substantially, with gains of 7.6 in mPrec for instance segmentation and 3.5 in mIoU for semantic segmentation.
Consumption of memory and time. Table 4 shows a comparison of the memory cost and computation time. For a fair comparison, we conducted the experiments in the same environment, including the same GPU (GTX 1080), batch size (4) and data (Area5 including 68 rooms).
All times are reported in minutes and all memory figures in MB. For training, the reported values are the time and memory required for one epoch. Our method needs relatively more time for training because we introduce clustering into the training process, but it costs little memory because of its compact yet efficient architecture. For inference, the results show the resource consumption on Area5. Our approach takes only 373 MB and 40.4 minutes while achieving better performance, which makes it significantly faster and more memory-efficient than the state-of-the-art methods.
Method                          mIoU

PointNet++ qi2017pointnet++     84.3
ASIS wang2019associatively      85.0
SGPN wang2018sgpn               85.8
SpiderCNN xu2018spidercnn       85.3
SSCN graham20183d               86.0
PointConv wu2019pointconv       85.7
BASE                            83.5
OURS                            86.4
4.3 ShapeNet Evaluation
We also validate our method on the part segmentation dataset ShapeNet. Its semantic annotations are publicly available, while the instance segmentation annotations are generated following wang2018sgpn. Because ground-truth instance annotations are lacking, we only provide qualitative results for instance segmentation in Figure 5, following wang2019associatively. The four rows of Figure 5 show, from top to bottom, the semantic segmentation results, the semantic annotations, the instance segmentation results and the instance annotations.
As we can see, different parts of the same object are well grouped into individual instances, especially at the boundaries between parts.
The semantic segmentation results are reported in Table 5. Our approach improves on the baseline framework by 2.9 and outperforms the state-of-the-art methods ASIS wang2019associatively, PointConv wu2019pointconv and SSCN graham20183d.
These results reveal that our proposed method also has the capability to improve the part segmentation performance.
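For readers unfamiliar with the mIoU metric reported in Table 5, the following is a minimal sketch of how a mean intersection-over-union score can be computed from per-point labels. It assumes plain per-class IoU averaged over the classes that occur; the function name `mean_iou` and the toy labels are illustrative only, and the official benchmark protocol may average differently.

```python
def mean_iou(pred, target, num_classes):
    """Average the per-class IoU over classes present in pred or target.

    Assumption: a simple per-class average, not necessarily the
    averaging scheme used by the benchmark.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with six points and three part labels (hypothetical data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

In this toy case the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5.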
To illustrate the effectiveness of our WFS algorithm intuitively, we show the sampling probability and the corresponding improvement for different categories in Figure 6. In the upper graph (a), orange denotes the original frequency of each category in the training dataset, and blue denotes the sampling probability of each category. The distribution of categories is more balanced with our WFS algorithm. The lower graph (b) shows the improvement for each category, where orange denotes a positive boost and purple a negative influence. For the categories with low frequency in the raw data, the improvements are evident, while for the categories with high frequency, the results are barely affected. This demonstrates that our WFS algorithm is effective and critical for alleviating the imbalance problem.
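The balancing effect described for Figure 6 (a) can be illustrated with a generic water-filling scheme. The sketch below is not the paper's WFS algorithm, whose details are given elsewhere. It assumes a hypothetical `budget` of extra probability mass that is "poured" onto the rarest categories until they reach a common level, after which the result is renormalized. The function name `water_fill` and the example frequencies are illustrative.

```python
def water_fill(freqs, budget):
    """Raise the lowest frequencies to a common water level.

    Assumption: `budget` is extra mass distributed over the rarest
    categories; this is a generic water-filling sketch, not the
    exact WFS procedure of the paper.
    """
    lo, hi = min(freqs), max(freqs) + budget
    for _ in range(100):  # bisection on the water level
        level = (lo + hi) / 2
        poured = sum(max(level - f, 0.0) for f in freqs)
        if poured > budget:
            hi = level  # too much water used, lower the level
        else:
            lo = level  # budget not exceeded, try a higher level
    filled = [max(f, lo) for f in freqs]
    total = sum(filled)
    return [x / total for x in filled]  # renormalize to a distribution


# Hypothetical category frequencies: one frequent, one rare.
original = [0.6, 0.3, 0.1]
balanced = water_fill(original, budget=0.2)
print([round(p, 3) for p in balanced])
```

With these values the rare category is lifted to the level of the middle one before renormalization, so rare categories gain sampling probability while the ordering of frequent categories is preserved.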
5 Conclusion
In this paper, we propose a novel framework that jointly performs semantic and instance segmentation. For the instance segmentation task, a module named SPCO is proposed to introduce clustering into the training process and to focus saliently on the points that are harder to distinguish in the clustering process. For the semantic segmentation branch, we introduce the MSA module, based on statistical knowledge, to exploit the potential association of the spatial semantic distribution. In addition, we propose a Water Filling Sampling algorithm to address the imbalance problem of the category distribution. Qualitative and quantitative experimental results on challenging benchmark datasets demonstrate the effectiveness and robustness of our method.
Acknowledgment
* This project was supported by the National Natural Science Foundation of China (No. 61806189) and the Shanghai Municipal Science and Technology Major Project (Grant No. 2018SHZDZX01, ZHANGJIANG LAB).