1 Introduction
Semantic scene segmentation refers to the process of assigning a class label to each elementary representation of a scene. The outcome of semantic scene segmentation is extremely useful for many applications in artificial intelligence, such as the interaction of robots (self-driving cars and autonomous drones) with their environments, and augmented/virtual reality (AR/VR). Semantic scene segmentation on 2D images, where each pixel is the elementary representation of the scene, is a long-standing problem in computer vision, and many impressive results have been achieved with deep learning in recent years [3, 6, 9, 16, 20]. In contrast to its 2D image counterpart, semantic scene segmentation on 3D point clouds, where each point is the elementary representation of the scene, has gained attention in the deep learning and computer vision communities only over the last few years. This is largely attributed to the lack of permutation invariance in standard neural networks [18, 29], which work well on 2D image pixels arranged in a regular and orderly structure, but fail catastrophically on 3D point clouds, where the points exist in an unordered and irregular manner.
To circumvent the permutation invariance problem, several existing works attempt to convert the unordered and irregular 3D point cloud into ordered and regular representations that can be directly used in neural networks. Two commonly used approaches are: (a) projecting the 3D point cloud onto a set of virtual multi-view image planes [4, 11, 14, 22, 23], and (b) quantizing the 3D point cloud into regular voxel grids [5, 7, 17, 24, 27]. However, both approaches result in severe loss of information. The former suffers information loss from occlusions and 3D-to-2D projections. The latter suffers from quantization error, and the high computational cost of 3D convolutions limits its scalability.
Recently, [18] pioneered the permutation-invariant PointNet, which allows deep learning to be applied directly to 3D point clouds, and showed promising results on both 3D object classification and semantic segmentation tasks. Nonetheless, the design of PointNet, which feeds each point individually into several multi-layer perceptrons (MLPs) followed by a max-pooling operation to achieve permutation invariance, inherently prevents the network from capturing the local information embedded in neighboring points. This local information is essential for modeling fine-grained structures, e.g. planes and corners, or convex and concave elements. Consequently, several later works [8, 10, 13, 15, 19, 28, 30] proposed to capture the local information by considering the neighboring points. However, these approaches treat each point in their respective local regions independently to achieve permutation invariance, which prevents the modeling of geometric relationships among neighboring points and hinders discriminative local feature learning. Moreover, the RNN-based approaches [8, 10, 13, 28] require sequential input ordering, which violates permutation invariance, and [15] is inherently not permutation-invariant. DGCNN [26] is designed to overcome such limitations. It embeds the relationship between a point and its neighbors in so-called edge features, and uses a channel-wise symmetric aggregation operation (max-pooling) over the local neighborhood to capture representative local information. Nonetheless, the global context among different local features is neglected in their work due to the max-pooling operation, which limits its ability to encode the semantic information of the entire 3D scene.
In this paper, we propose PSNet, an efficient end-to-end framework for Point cloud Semantic segmentation (PS) that takes both local structures and global context into consideration. Our work leverages EdgeConv [26] to capture local information, and uses NetVLAD [1] to encode global context. More specifically, PSNet takes a raw point cloud as input and outputs point-level semantic class labels. We design an encoder module that can be stacked repeatedly into a deep network for learning discriminative representations. Each encoder module consists of two basic components: EdgeConv and NetVLAD. Furthermore, we prove theoretically that PSNet guarantees permutation invariance to any order of input points. Our main contributions are summarized as follows:

We design PSNet, an end-to-end network for semantic scene segmentation on 3D point clouds. Our network is permutation-invariant, and is able to integrate both local structures and global context.

Our encoder is flexible and can be stacked or recurrently plugged into existing deep learning architectures to exploit the fine-grained local and global properties of point clouds.

Extensive experimental results on two large-scale 3D indoor datasets show that our PSNet outperforms existing state-of-the-art approaches on the task of semantic scene segmentation on 3D point clouds.
2 Related work
In this section, we focus on a survey of existing deep learning approaches that directly take point clouds as input for semantic scene segmentation. We omit discussions of existing works that alleviate the permutation invariance problem via conversion of the point clouds into alternative forms of representation, i.e. projections into multi-view virtual images and voxelization.
PointNet [18] pioneered the direct use of 3D points as input in deep learning. It learns point-wise features by passing each point individually through several shared MLPs, followed by a symmetric function (max-pooling) that aggregates over the feature channels into a global feature representing the point cloud. The operation of the shared MLPs on individual points and the max-pooling make the network permutation-invariant. However, the design of PointNet overlooks the exploitation of local structures and prevents the network from learning fine-grained structures. To overcome this limitation, most follow-up works attempt to incorporate the local structures in an efficient way. PointNet++ [19] utilizes farthest point sampling and ball queries to group the point cloud into a set of local regions, and then applies PointNet on each region to capture the local structures. Despite the preservation of permutation invariance and the incorporation of local structures, the grouping of the point cloud into independent local regions may impede learning of the global context.
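To make the symmetry argument concrete, the following minimal NumPy sketch (toy dimensions and random weights, not the actual PointNet configuration) shows that a shared MLP followed by channel-wise max-pooling yields the same global feature for any ordering of the input points:

```python
import numpy as np

def shared_mlp(points, w, b):
    """Apply the same (shared) linear layer + ReLU to every point independently."""
    return np.maximum(points @ w + b, 0.0)

def pointnet_global_feature(points, w, b):
    """Point-wise shared MLP followed by channel-wise max-pooling (a symmetric function)."""
    return shared_mlp(points, w, b).max(axis=0)

rng = np.random.default_rng(0)
pts = rng.normal(size=(8, 3))          # 8 points with xyz coordinates
w, b = rng.normal(size=(3, 16)), rng.normal(size=16)

feat = pointnet_global_feature(pts, w, b)
feat_perm = pointnet_global_feature(pts[rng.permutation(8)], w, b)
assert np.allclose(feat, feat_perm)    # max-pooling is order-independent
```

Because the max is taken per feature channel over all points, any row permutation of the input leaves the aggregated feature unchanged.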
Other variants of PointNet employ various Recurrent Neural Network (RNN) based techniques to learn the contextual information from the local features, which are extracted by PointNet from local patches partitioned according to the spatial locations of the points [8, 10, 28] or their geometric homogeneity [13]. Engelmann et al. [8] divide a point cloud into a tessellation of ordered blocks. A feature generated from each block using PointNet is then fed sequentially into a Gated Recurrent Unit (GRU). The GRU ensures that information from neighboring blocks is encoded in the final feature. Huang et al. [10] propose RSNet, where the point cloud is sliced into three sets of uniformly-spaced blocks independently along the three coordinate axes (x, y, z). A feature is extracted from each block using a PointNet-like network. Next, an ordered sequence for each set of blocks is defined manually along each axis. Each of the three ordered sequences is respectively fed into an RNN layer to explore the local context. Finally, the updated block features are assigned back to the points in the blocks. Thus, each point-wise feature aggregates its three associated directional block features. In contrast to the independent use of RNNs on the different axes as in RSNet, Ye et al. [28] propose a two-direction hierarchical RNN model. Specifically, the first set of RNNs is applied on the blocks along the x-axis, and the outputs are used by the second set of RNNs that runs along the y-axis to generate the block features across the two dimensions. However, the RNN-based approaches require sequential ordering in the inputs, and this is a violation of permutation invariance.
More recently, several works [15, 21, 26] have been proposed to explore the fine-grained geometric relationships among neighboring points. PointCNN [15] learns a transformation from the K nearest neighbors of each input point. It assumes that the points in a local coordinate frame can be permuted into a latent and potentially canonical order with the learned transformation. The transformation is further combined with typical convolutions to form a new process for extracting features from local regions. However, the transformation is not permutation-invariant. KCNet [21] presents a kernel correlation layer to exploit local geometric structures, where point-set kernel correlation is used to measure similarities between local neighborhood graphs and learned kernel graphs.
Despite showing promising performance on object-level tasks, it is somewhat sensitive to the underlying graph structures, and its capability in scene-level semantic segmentation is unknown. DGCNN [26] explicitly designs the EdgeConv module to better capture local geometric features of point clouds while preserving permutation invariance. In particular, it dynamically computes a K-nearest-neighbors (KNN) graph for each point in the feature spaces produced by the different EdgeConv layers. EdgeConv applies multiple MLP layers over the combination of point-wise features and their respective KNN features to generate edge features, and outputs the enhanced point-wise features by pooling among neighboring edge features. However, the features learned by DGCNN do not encode the contextual and semantic information of the whole 3D scene, i.e. the global context. Furthermore, the dynamic recomputation of the KNN graphs during network training is extremely computationally expensive and inefficient when applied to large-scale point clouds.
Our PSNet leverages EdgeConv to learn the local information, but uses static KNN graphs defined in the spatial coordinate space to reduce computational complexity. Additionally, we use the permutation-invariant NetVLAD layer [1] to incorporate the global contextual information into our learned features.
3 Our PSNet
The network architecture of our PSNet is presented in Figure 2. Our network consists of four encoders that are repeatedly stacked together, and a final segment of shared 3-layer MLPs. The encoders output different levels of feature abstractions that are subsequently merged and passed to the shared MLPs for point-level semantic label classification. The input to our network is a matrix of N points, each with xyz coordinates and additional features (color or surface normal). The network output is a matrix of predicted probabilities of the semantic labels for each of the N points. A KNN search is done on each of the points in the spatial coordinate space to produce an N × k index matrix, where k is the number of nearest neighbors. This matrix is fed into each of the encoders to group local regions in the corresponding feature spaces.
An encoder is made up of two components: EdgeConv and NetVLAD. The output of EdgeConv is fed into NetVLAD, and then concatenated with the output of NetVLAD to generate the final output of the encoder. Each encoder first forms an edge tensor with the Γ operator (see next paragraph for more details) using the input features of each point and its k nearest neighbors. This tensor is fed into several MLP layers to extract the edge features that describe the relationships between each point and its neighbors. The input feature dimension differs for the first and subsequent encoders. We then apply two symmetric operations, i.e. channel-wise max-pooling and average-pooling, to respectively transform the edge features into two locally aggregated representations. Subsequently, the two representations are concatenated and fed into another MLP layer to produce a set of point-wise features as the output of EdgeConv.
More formally, let us denote the input feature matrix to EdgeConv as X, the N × k matrix of KNN indices as K, the edge feature tensor as H, and the output feature matrix of the EdgeConv as Y. Then, the operator Γ is given by:

H = Γ(X, K),  (1)

where

H_i = [ X̂_i ‖ (N_i − X̂_i) ],  (2a)
N_i = [ x_{K_{i1}}; x_{K_{i2}}; …; x_{K_{ik}} ],  (2b)

x_i is the i-th row of X, X̂_i stacks x_i k times, N_i gathers the features of the k nearest neighbors of point i, and [A ‖ B] represents the concatenation of matrices A and B. The edge features E are obtained by:

E = σ( h_{θ2}( σ( h_{θ1}(H) ) ) ),  (3)

where h_θ represents a shared MLP layer, σ represents a non-linear ReLU activation function, and θ1 and θ2 are two sets of learnable parameters in the two respective shared MLP layers. Finally, the output of EdgeConv Y is given by:

Y = σ( h_{θ3}( [ maxpool(E) ‖ avgpool(E) ] ) ),  (4)

where θ3 is the set of learnable parameters for the EdgeConv output shared MLP, and the two pooling operators aggregate channel-wise over the k neighbors of each point.
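As an illustration, Eqs. (1)-(4) can be sketched in NumPy as follows; the single weight matrix per "MLP" and all dimensions are toy stand-ins for the shared MLP stacks, not the trained network:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def edgeconv(X, K, th1, th2, th3):
    """Toy EdgeConv following Eqs. (1)-(4), with one weight matrix per 'MLP'."""
    k = K.shape[1]
    center = np.repeat(X[:, None, :], k, axis=1)        # (N, k, F): repeat each point
    H = np.concatenate([center, X[K] - center], -1)     # Eqs. (1)-(2): [x_i || x_j - x_i]
    E = relu(relu(H @ th1) @ th2)                       # Eq. (3): edge features
    pooled = np.concatenate([E.max(1), E.mean(1)], -1)  # channel-wise max- and avg-pooling
    return relu(pooled @ th3)                           # Eq. (4): point-wise output features

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 6))              # 10 points with 6-dim input features
K = rng.integers(0, 10, size=(10, 4))     # toy KNN index matrix (N x k)
th1 = rng.normal(size=(12, 8))            # 2F -> 8
th2 = rng.normal(size=(8, 8))
th3 = rng.normal(size=(16, 8))            # concatenated max/avg (2 x 8) -> 8
Y = edgeconv(X, K, th1, th2, th3)         # (10, 8) point-wise features
```

Note how the two pooling operators collapse the neighbor axis, so the output stays point-wise regardless of neighbor ordering.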
By taking Y as input, NetVLAD outputs a global descriptor, denoted as V. The (j, c)-th element of V is given by:

V(j, c) = Σ_{i=1}^{N} ( exp(w_c^T y_i + b_c) / Σ_{c′} exp(w_{c′}^T y_i + b_{c′}) ) ( y_i(j) − m_c(j) ),  (5)

where y_i is the i-th point-wise feature (the i-th row of Y), y_i(j) is its j-th dimension, and m_c(j) is the j-th dimension of the c-th cluster center. There are two sets of parameters in the NetVLAD module: the cluster centers {m_c} ("visual words") and {w_c, b_c} for learning the soft assignments, which determine the propagation of information from the input point-wise feature vectors to the cluster centroids. Specifically, NetVLAD uses an MLP followed by a softmax function to obtain the soft assignments. Next, the residuals between the input point-wise descriptors and the cluster centroids are aggregated with the soft assignments. Finally, the matrix V is flattened into an output vector with intra- and inter-normalization. To be computationally efficient, this high-dimensional vector is further compressed into a concise feature representation via an MLP, which is treated as the final global descriptor.
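A minimal NumPy sketch of the aggregation in Eq. (5) (toy sizes and random parameters; the final compression MLP is omitted) also makes the order-independence of the descriptor visible:

```python
import numpy as np

def netvlad(Y, centers, w, b):
    """Toy NetVLAD (Eq. 5): soft-assign point features to clusters, aggregate residuals."""
    logits = Y @ w + b                        # (N, C) soft-assignment logits
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)         # softmax over clusters
    resid = Y[:, None, :] - centers[None]     # (N, C, D) residuals to cluster centers
    V = (a[..., None] * resid).sum(axis=0)    # (C, D): sum over points -> order-free
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12   # intra-normalization
    V = V.ravel()
    return V / (np.linalg.norm(V) + 1e-12)    # inter (global) normalization

rng = np.random.default_rng(1)
Y = rng.normal(size=(32, 8))                  # 32 point-wise features
centers = rng.normal(size=(4, 8))             # 4 "visual words"
w, b = rng.normal(size=(8, 4)), rng.normal(size=4)
v1 = netvlad(Y, centers, w, b)
v2 = netvlad(Y[rng.permutation(32)], centers, w, b)
assert np.allclose(v1, v2)   # summation over points makes the descriptor permutation-invariant
```

Because the residuals are summed over all points, shuffling the rows of Y leaves the descriptor unchanged, which is exactly the property used in Section 3.3.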
3.1 Exploitation of Local Structure
As mentioned earlier, we take inspiration from EdgeConv in [26] to exploit the local structures within the local neighborhood of a point. We replace the original dynamic KNN graph computation with static KNN graphs computed in the metric space for two reasons. First, we assume that the static metric-based KNN graphs may supervise local structure learning with a spatial constraint. Second, static KNN graphs are analogous to the structure of images, where the neighborhoods of pixels remain fixed during convolution. Additionally, the two symmetric operations in aggregating edge features, i.e. channel-wise max-pooling and average-pooling, are designed to compensate for the information loss during aggregation. The max-pooling operator operates over each feature dimension individually and selects the maximum feature response over the neighborhood of each point, while the average-pooling operator sums up all the features in the same local region. By combining these two operations, fine-grained local information is preserved and the strong responses are emphasized. We progressively expand the receptive field by repeatedly stacking the encoders.
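The static graph construction we describe can be sketched as follows (brute-force distances for clarity; a k-d tree would be used at scale, and whether to exclude the point itself is a design choice):

```python
import numpy as np

def static_knn(xyz, k):
    """Build the (N, k) KNN index matrix once, in the spatial (xyz) metric space.

    The same indices are then reused by every encoder, avoiding DGCNN-style
    per-layer recomputation of the graph in feature space."""
    d2 = ((xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                             # exclude the point itself
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(0)
pts = rng.uniform(size=(100, 3))
idx = static_knn(pts, 20)        # (100, 20) neighbor indices, computed once
```

Computing these indices once per scene, rather than once per layer per forward pass, is where the claimed efficiency gain over dynamic graphs comes from.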
3.2 Aggregation of Global Context
In addition to exploiting the local structures, we also design our network to be aware of the global context that provides scene-level semantic information. This information is potentially useful and can alleviate local confusions. In particular, semantic context helps to distinguish patches with similar appearances/geometries but different semantic meanings, e.g. differentiating a white wall from a white board mounted on it. To incorporate this global contextual information, we leverage the success of NetVLAD [1], a technique originally designed for aggregating local descriptors into a global vector in the image domain. [25] adopted it into PointNetVLAD, which generates global descriptors for point-based inputs. The descriptor vector of each cluster center is a summation of the residuals (contributions) of each input feature to the center. Consequently, it is able to reveal fine-grained global context due to the large receptive field and the aggregation of the learned relationships with all points.
3.3 Permutation Invariance
We now prove that the PSNet is permutationinvariant.
Lemma 1.
The PSNet is permutation-invariant: if the rows of the input point cloud matrix are permuted, the output of the network remains unchanged.
Proof.
Let P denote the input matrix and L denote the output matrix. We want to prove that if the input is AP, where A is a permutation matrix, the output of PSNet remains L. As the shared MLPs operating on individual points are obviously permutation-invariant, we simplify the proof by proving the permutation invariance property of the encoder in PSNet. Suppose we have a permuted point cloud P′ that only reorders two points p_s and p_t in P; then the feature representation of P′ is given by X′, which only swaps the rows x_s and x_t of X.
As the reordering does not affect the order of the nearest neighbors, the KNN indices are still given by K. Inputting X′ and K into the encoder, the output is formulated as:

O′ = [ Φ(X′, K) ‖ Ψ( Φ(X′, K) ) ],  (6)

where Φ denotes the series of operations in EdgeConv and Ψ denotes the NetVLAD function.
Putting X′ into the Γ operator in Equation 2, we get H′, which only swaps the sub-tensors H_s and H_t of H. Since each H_i is processed independently in the shared MLPs (see Equation 3), it is obvious that E′ only swaps E_s and E_t of E. Finally, the output of EdgeConv is given by Y′ = Φ(X′, K), which only swaps the rows y_s and y_t of Y; again this is because each point is processed independently in the shared EdgeConv output MLP in Equation 4.
The output of NetVLAD of the original input features X before permutation can now be written as

V(j, c) = Σ_{i=1}^{N} a_c(y_i) ( y_i(j) − m_c(j) ),  (7)

where

a_c(y_i) = exp(w_c^T y_i + b_c) / Σ_{c′} exp(w_{c′}^T y_i + b_{c′}),  (8)

and the output of NetVLAD with the permuted input features Y′ is given by

V′(j, c) = Σ_{i=1}^{N} a_c(y′_i) ( y′_i(j) − m_c(j) ).  (9)

Since the summation over points is commutative and Y′ only reorders the rows of Y,

V′(j, c) = V(j, c).  (10)

This completes our proof that the encoder of PSNet is permutation-invariant.
∎
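The lemma can also be checked numerically with a toy encoder that mimics the two symmetric stages (edge features with a channel-wise max locally, a sum over points globally); all weights and sizes here are illustrative, not those of PSNet:

```python
import numpy as np

def knn_idx(xyz, k):
    """Static KNN indices in the spatial coordinate space."""
    d2 = ((xyz[:, None] - xyz[None]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def mini_encoder(xyz, w):
    """Toy encoder: edge features + channel-wise max (local), then a sum over
    points (global) -- both symmetric operations, as in Lemma 1."""
    idx = knn_idx(xyz, 4)
    edges = np.concatenate([np.repeat(xyz[:, None], 4, axis=1),
                            xyz[idx] - xyz[:, None]], -1)
    Y = np.maximum(edges @ w, 0.0).max(axis=1)   # point-wise local features
    return Y, Y.sum(axis=0)                      # global descriptor via symmetric sum

rng = np.random.default_rng(2)
P = rng.normal(size=(30, 3))
w = rng.normal(size=(6, 8))
perm = rng.permutation(30)

Y, g = mini_encoder(P, w)
Yp, gp = mini_encoder(P[perm], w)
assert np.allclose(g, gp)          # global output unchanged under permutation
assert np.allclose(Yp, Y[perm])    # point-wise outputs are permuted consistently
```

The global part is identical under any input ordering, while the point-wise part is merely re-indexed, which is precisely the behavior the proof establishes for the full encoder.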
4 Experiments
4.1 Datasets
We conduct experiments on two challenging benchmark datasets: S3DIS [2] and ScanNet [7]. The details of these two datasets are as follows.
S3DIS
This dataset consists of 6 different indoor areas from 3 different buildings. It has 271 rooms of various styles, e.g. conference rooms, lobbies and restrooms. Each point is labeled with one of 13 semantic classes. These classes are partitioned into structural types (ceiling, floor, wall, beam, column, window, door), furniture types (table, chair, sofa, bookcase, board) and clutter. In the experiments, we adopt the 6-fold training/testing split used in [18].
ScanNet
This dataset consists of 1,513 scans from 707 unique indoor scenes. The space type is very diverse, ranging from very small (bathroom, closet, utility room) to very large (apartment, classroom, library) spaces. Each point is annotated by one out of 21 semantic classes, including 20 object classes plus 1 extra class representing free space. Following the experimental settings in [19], we adopt 1,201 scans for training and 312 scans for testing.
4.2 Implementation Details
Our PSNet consists of four repeatedly stacked encoders with the same configuration. Each EdgeConv module has two shared MLPs (64, 64) to extract edge features and one shared MLP (128) to fuse the max- and avg-pooled edge features. The NetVLAD module has 16 clusters and produces a high-dimensional global descriptor that is fed into a shared MLP (128) for dimension reduction. A skip link is added from the output of EdgeConv to the output of NetVLAD for integrating local and global features. The outputs from all four encoders are concatenated and forwarded to three shared MLPs (512, 256, 128) to map the learned features to point labels. Dropout with a drop ratio of 0.3 is used in the first of these layers. Batch normalization and ReLU are added to all the respective MLPs. The number of nearest neighbors k is set to 20.
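The feature dimensions implied by this configuration can be traced with a few lines of arithmetic; the assumption that each encoder outputs the EdgeConv features (128) concatenated with the reduced NetVLAD descriptor (128) is our reading of the skip link described above:

```python
# Feature-dimension walkthrough of the configuration above (assumes each
# encoder concatenates EdgeConv(128) with the MLP-reduced NetVLAD descriptor).
EDGECONV_OUT = 128       # shared MLP (128) fusing max- and avg-pooled edge features
NETVLAD_OUT = 128        # 16-cluster descriptor after the reduction MLP (128)
NUM_ENCODERS = 4

encoder_out = EDGECONV_OUT + NETVLAD_OUT    # features produced per encoder
concat_dim = NUM_ENCODERS * encoder_out     # concatenation fed to the MLPs (512, 256, 128)
```

Under this reading, each encoder emits a 256-dim feature per point, and the classification head receives a 1024-dim concatenation.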
Table 1: Comparison of semantic segmentation results on the S3DIS dataset (upper part: PointNet data preparation setup; lower part: PointCNN setup).

Method  OA  mIoU  ceiling  floor  wall  beam  column  window  door  table  chair  sofa  bookcase  board  clutter 
PointNet [18]  78.5  47.6  88  88.7  69.3  42.4  23.1  47.5  51.6  54.1  42  9.6  38.2  29.4  35.2 
G+RCU [8]  81.1  49.7  90.3  92.1  67.9  44.7  24.2  52.3  51.2  58.1  47.4  6.9  39  30  41.9 
DGCNN [26]  84.1  56.1  —  —  —  —  —  —  —  —  —  —  —  —  — 
RNNCF [28]  86.9  56.3  92.9  93.8  73.1  42.5  25.9  47.6  59.2  60.4  66.7  24.8  57  36.7  51.6 
RSNet [10]  —  56.47  92.48  92.83  78.56  32.75  34.37  51.62  68.11  60.13  59.72  50.22  16.42  44.85  52.03 
PSNet—P1  86.69  61.56  93.40  95.64  79.94  37.17  40.93  59.83  66.65  63.65  65.71  37.16  49.83  54.56  55.66 
PointCNN [15]  88.14  65.39  94.78  97.3  75.82  63.25  51.71  58.38  57.18  71.63  69.12  39.08  61.15  52.19  58.59 
PSNet—P2  88.22  66.60  93.04  96.26  83.22  41.61  54.05  60.08  70.40  67.37  73.13  48.75  58.73  58.68  60.48 
Our framework is implemented using the PyTorch deep learning library on an NVIDIA GTX 1080Ti. We optimize the network using Adam [12] with an initial learning rate of 0.001 and a weight decay of 1e-5. Multi-class cross-entropy is used as the loss function. In most experiments, the learning rate is decayed by half after every 100 epochs. In general, the networks converged at around 150 epochs. The batch size is set to 6 for experiments with 4,096 points as input; otherwise, the batch size is updated with respect to the number of input points. Note that we do not use any data augmentation in our experiments.
We adopt the two widely used metrics, overall accuracy (OA) and mean intersection over union (mIoU), to evaluate the segmentation performance of our network. Additionally, we also report the class-wise IoU, which is computed over the points belonging to each semantic class.
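The two metrics can be computed as below; the tiny label arrays are made up purely to exercise the functions:

```python
import numpy as np

def overall_accuracy(pred, gt):
    """Fraction of points whose predicted label matches the ground truth."""
    return (pred == gt).mean()

def mean_iou(pred, gt, num_classes):
    """mIoU: per-class intersection/union over all evaluated points, averaged."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                 # skip classes absent from both pred and gt
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
oa = overall_accuracy(pred, gt)
miou = mean_iou(pred, gt, 3)
```

mIoU weights every class equally, which is why it is the more informative criterion on a class-imbalanced dataset such as S3DIS.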
4.3 Segmentation on S3DIS Dataset
4.3.1 Data preparation
We observed some differences in the data preparation setups among the existing approaches for the S3DIS dataset. One widely-adopted setup is proposed by PointNet [18], which splits each room into non-overlapping blocks of 1m × 1m area on the xy plane; each point is represented by a 9-dim vector containing the xyz coordinates, RGB color and normalized xyz coordinates. 4,096 points are randomly sampled from each block during training and testing. The other setup is presented by PointCNN [15]. It slices the rooms into 1.5m-by-1.5m blocks with 0.3m padding on each side, and each point is associated with a 6-dim vector containing the xyz coordinates and color. A Gaussian-distributed number of points is sampled from each block during training, while each block is sampled multiple times to make sure that all the points are evaluated during testing. In order to make fair comparisons, we conduct experiments using both data preparation setups
(we use the code releases of PointNet and PointCNN for preprocessing) and report the comparisons accordingly.

Table 2: Comparison of semantic segmentation results on the ScanNet dataset.

Method  OA  mIoU  wall  floor  chair  table  desk  bed  bookshelf  sofa  sink  bathtub 

PointNet [18]  —  14.69  69.44  88.59  35.93  32.78  2.63  17.96  3.18  32.79  0  0.17 
PointNet++ [19]  —  34.26  77.48  92.5  64.55  46.6  12.69  51.32  52.93  52.27  30.23  42.72 
RSNet [10]  —  39.35  79.23  94.1  64.99  51.04  34.53  55.95  53.02  55.41  34.84  49.38 
PSNet—P3  —  40.17  72.44  91.51  65.08  45.61  26.27  48.90  39.96  53.94  24.58  64.78 
PointCNN [15]  85.1  —  —  —  —  —  —  —  —  —  —  — 
PSNet—P2  87.21  44.90  77.02  91.22  68.36  56.66  31.62  53.55  36.32  58.75  43.07  70.11 
Method  toilet  curtain  counter  door  window  shower curtain  refridgerator  picture  cabinet  other furniture 

PointNet [18]  0  0  5.09  0  0  0  0  0  4.99  0.13 
PointNet++ [19]  31.37  32.97  20.04  2.02  3.56  27.43  18.51  0  23.81  2.2 
RSNet [10]  54.16  6.78  22.72  3  8.75  29.92  37.9  0.95  31.29  18.98 
PSNet—P3  60.87  41.20  24.62  8.37  21.55  47.24  19.68  2.63  28.02  16.22 
PointCNN [15]  —  —  —  —  —  —  —  —  —  — 
PSNet—P2  66.28  41.94  23.73  10.94  17.82  51.02  44.19  3.17  32.82  19.33 
4.3.2 Results and Discussion
Table 1 illustrates the performance of our PSNet in comparison to previous state-of-the-art approaches on the S3DIS dataset. All approaches [8, 10, 26, 28] in the upper part of the table prepared their experimental data according to the data preparation setup in PointNet [18]. We can see that our PSNet achieves the best performance in the mIoU criterion, which is a more precise criterion than the overall accuracy as the dataset is highly unbalanced. In particular, our PSNet improves the mIoU by 9.7% and the overall accuracy by 3% when compared with DGCNN [26], which is made up of a stack of EdgeConv layers. We argue that the incorporation of global context in our network contributes to this significant improvement. Additionally, our PSNet outperforms the three existing RNN-based methods [8, 10, 28] in the mIoU criterion.
In the lower part of Table 1, we show competitive performance of our PSNet with PointCNN [15] when the same data preprocessing setup is used. Moreover, the performance of our PSNet—P2 increases by 8.2% in mIoU compared to PSNet—P1. This indicates that data preprocessing plays an important role in training, and the extensive online sampling used in PointCNN is able to expose the model to larger groups of samples during training.
Furthermore, several qualitative results on the S3DIS dataset are shown in Figure 3. As we can see from the examples, the dataset is very challenging in many scenarios, e.g. "the white boards on the white wall", "the open doors with only the door frames visible", and "the white column at the boundaries of the white wall". Interestingly, our PSNet successfully segments the boards, doors and columns in most cases despite the similar colors (wall vs column) or geometries (wall vs board). We believe that the correct classifications are a consequence of integrating local structures and global context into our network.
4.4 Segmentation on ScanNet Dataset
4.4.1 Data preparation
Similar to S3DIS, there are two data preparation setups for the ScanNet dataset. The first setup is proposed by PointNet++ [19]. It follows the same setup as [7] to first generate a tessellation of 1.5m × 1.5m × 3m cubes with 2cm voxels from the 3D point clouds. Next, cubes with occupied voxels and valid annotations on the voxel surfaces are extracted.
During training, 8,192 points are sampled from each cube. Each point is represented by only the xyz coordinates. The second setup is proposed by PointCNN [15]. It prepares the data in the same way as for S3DIS, except that only the xyz coordinates are used for each point. Additionally, it converts the segmentation results on the test data into semantic voxel labeling for comparison with the test results from the first data preprocessing setup. Again, to make fair comparisons, we conduct experiments using both data processing setups (we use the code release of PointNet++ for preprocessing). Note that, different from S3DIS, we only use the xyz information as input in this dataset in order to be compliant with the previous approaches.
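As an illustration of the block-style preparation used for both datasets, the following sketch tiles a scene into non-overlapping 1m × 1m blocks on the xy plane and samples a fixed number of points per block, in the spirit of the PointNet setup (padding, normalized coordinates, and the multi-pass test-time sampling are omitted):

```python
import numpy as np

def split_into_blocks(xyz, block=1.0, num_sample=4096, rng=None):
    """Tile the xy plane into block x block cells and sample num_sample points
    from each non-empty cell (with replacement when a cell holds fewer points)."""
    if rng is None:
        rng = np.random.default_rng()
    cell = np.floor(xyz[:, :2] / block).astype(int)      # cell index per point
    out = []
    for key in np.unique(cell, axis=0):
        idx = np.flatnonzero(np.all(cell == key, axis=1))
        pick = rng.choice(idx, size=num_sample, replace=len(idx) < num_sample)
        out.append(xyz[pick])
    return out

rng = np.random.default_rng(0)
room = rng.uniform(0.0, 2.0, size=(5000, 3))   # a toy 2m x 2m room
blocks = split_into_blocks(room, rng=rng)       # 4 blocks of 4096 points each
```

Fixing the number of points per block is what allows mini-batching point clouds of very different sizes.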
4.4.2 Results and Discussion
The comparison of performances on the ScanNet dataset is summarized in Table 2. It can be seen that our PSNet outperforms the previous state-of-the-art methods [10, 19] in the mIoU criterion when the preprocessing setup of PointNet++ [19] is used. Notably, we achieve remarkable improvements in the classification results of several challenging classes with very little training data, e.g. bathtub (0.3%), toilet (0.3%), curtain (1.5%), window (0.9%) and shower curtain (0.2%). We reckon that the increase in performance comes from the use of EdgeConv, where the discriminative representations learned from local structures are superior to the point-wise features that are individually processed in [10, 18, 19]. We obtain an impressive improvement in mIoU (11.8%) on PSNet—P2 compared to PSNet—P3, and furthermore, we surpass PointCNN [15]. In particular, our PSNet achieves more accurate segmentation results (class-wise IoU ≥ 0.4) in 10 out of 20 classes.
Several segmentation results are visualized in Figure 4. Our PSNet is able to recognize both frequently seen objects, e.g. wall, floor and chair, and rarely seen objects, e.g. toilet, sink and refridgerator.
4.5 Ablation studies
In this section, we investigate the contributions of the respective components (EdgeConv and NetVLAD) of our PSNet, and evaluate the effects of several key hyperparameters (the number of nearest neighbors in EdgeConv, the number of clusters in NetVLAD, and the number of stacked encoders). We conduct the ablation experiments on the fifth fold of the S3DIS dataset, i.e. we test on Area 5 and train on the remaining data. Notably, the testing area is collected in a separate building from all the training areas. All settings remain unchanged from the baseline PSNet in the ablation experiments except for the target hyperparameter.
Table 3: Ablation study on the network modules, evaluated on Area 5 of S3DIS.

Model  OA  mIoU 

w/o local features  81.66  45.25 
w/o EdgeConv  83.50  50.35 
w/o NetVLAD  84.00  50.18 
Our full PSNet  84.60  52.95 
Effectiveness of network modules
We study the effect of each network module by removing them individually from the network, and compare the performance before and after the removal. To further measure the contribution of local features, we design two settings: in (1) "w/o local features", we substitute the EdgeConv module with a point-wise feature learning module that does not exploit local structures, and in (2) "w/o EdgeConv", we apply point-wise feature learning followed by a max-pooling operator over the k nearest neighbors to extract the local features. Additionally, in "w/o NetVLAD" we replace NetVLAD with a max-pooling operation similar to DGCNN [26]. Table 3 shows the comparison results. The poor performance of "w/o local features" in mIoU accords well with our claim of the importance of local information in modeling fine-grained structures. Furthermore, we can see that the integration of local structures and global context in our PSNet contributes to the improvements over "w/o EdgeConv" and "w/o NetVLAD".
Number of encoders
We evaluate the performance of our PSNet with different numbers of stacked encoders, and show the results in Figure 5(a). Generally, the network becomes deeper and is capable of learning more discriminative representations as the number of encoders increases. This can be seen from the improvement in performance when the number of encoders is increased from 1 to 4. However, we also observe that the performance begins to drop after a certain number of encoders. This might be because a deeper network with more parameters requires more training data to prevent overfitting. In other words, a deeper network may fail to generalize well to new cases when the training dataset is not sufficiently large. Hence, we use the best setting of 4 encoders in all our experiments.
Number of nearest neighbors
This hyperparameter controls the range of the local regions and thus influences the amount of local information included in our network. From Figure 5(b), we can see a trade-off in selecting k. A small k may lead to limited regions with insufficient local context, while a big k may bring in irrelevant noise and increase the computational complexity. We set k to 20 in the experiments according to Figure 5(b).
Number of clusters
As revealed in Figure 5(c), performance drops when the number of clusters is bigger than 16. This is likely because we reduce the dimension of the flattened NetVLAD descriptor V to 128. If the number of clusters is too big, the dimension reduction operation may induce an uncertain amount of information loss. Consequently, we selected 16 clusters, which achieved the best performance.
5 Conclusion
In this paper, we proposed PSNet, an end-to-end deep neural network for the 3D point cloud semantic segmentation task. PSNet is built on four repeatedly stacked encoders, where each encoder has two basic components, EdgeConv and NetVLAD, that capture local structures and global context, respectively. We provided a proof that guarantees the permutation invariance property of our PSNet. We obtained state-of-the-art performance with our locally and globally aware PSNet on two challenging 3D indoor scene datasets for point-based semantic segmentation.
References
 [1] (2016) NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, pp. 5297–5307. Cited by: §1, §2, §3.2.
 [2] (2016) 3D semantic parsing of large-scale indoor spaces. In CVPR, pp. 1534–1543. Cited by: §4.1.
 [3] (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis & Machine Intelligence (12), pp. 2481–2495. Cited by: §1.
 [4] (2018) SnapNet: 3D point cloud semantic labeling with 2D deep segmentation networks. Computers & Graphics 71, pp. 189–198. Cited by: §1.
 [5] (2016) Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236. Cited by: §1.
 [6] (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis & Machine Intelligence 40 (4), pp. 834–848. Cited by: §1.
 [7] (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In CVPR, pp. 5828–5839. Cited by: §1, §4.1, §4.4.1.
 [8] (2017) Exploring spatial context for 3D semantic segmentation of point clouds. In CVPR, pp. 716–724. Cited by: §1, §2, §4.3.2, Table 1.
 [9] (2017) Mask R-CNN. In ICCV, pp. 2980–2988. Cited by: §1.
 [10] (2018) Recurrent slice networks for 3D segmentation of point clouds. In CVPR, pp. 2626–2635. Cited by: §1, §2, §4.3.2, §4.4.2, Table 1, Table 2.
 [11] (2017) 3D shape segmentation with projective convolutional networks. In CVPR, pp. 6630–6639. Cited by: §1.
 [12] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
 [13] (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In CVPR, pp. 4558–4567. Cited by: §1, §2.
 [14] (2017) Deep projective 3D semantic segmentation. In International Conference on Computer Analysis of Images and Patterns, pp. 95–107. Cited by: §1.
 [15] (2018) PointCNN: convolution on X-transformed points. In Advances in Neural Information Processing Systems, pp. 820–830. Cited by: §1, §2, §4.3.1, §4.3.2, §4.4.1, §4.4.2, Table 1, Table 2.
 [16] (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440. Cited by: §1.
 [17] (2015) VoxNet: a 3D convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pp. 922–928. Cited by: §1.
 [18] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In CVPR, pp. 652–660. Cited by: §1, §2, §4.1, §4.3.1, §4.3.2, §4.4.2, Table 1, Table 2.
 [19] (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108. Cited by: §1, §2, §4.1, §4.4.1, §4.4.2, Table 2.
 [20] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §1.
 [21] (2018) Mining point cloud local structures by kernel correlation and graph pooling. In CVPR, Vol. 4. Cited by: §2.
 [22] (2016) Deep learning 3D shape surfaces using geometry images. In ECCV, pp. 223–240. Cited by: §1.
 [23] (2015) Multi-view convolutional neural networks for 3D shape recognition. In CVPR, pp. 945–953. Cited by: §1.
 [24] (2017) SegCloud: semantic segmentation of 3D point clouds. In 3D Vision (3DV), 2017 International Conference on, pp. 537–547. Cited by: §1.
 [25] (2018) PointNetVLAD: deep point cloud based retrieval for large-scale place recognition. In CVPR. Cited by: §3.2.
 [26] (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics. Cited by: §1, §2, §3.1, §4.3.2, §4.5, Table 1.
 [27] (2015) 3D ShapeNets: a deep representation for volumetric shapes. In CVPR, pp. 1912–1920. Cited by: §1.
 [28] (2018) 3D recurrent neural networks with context fusion for point cloud semantic segmentation. In ECCV, pp. 415–430. Cited by: §1, §2, §4.3.2, Table 1.
 [29] (2017) Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401. Cited by: §1.
 [30] (2018) 3DContextNet: kd-tree guided hierarchical learning of point clouds using local and global contextual cues. In ECCV, pp. 314–330. Cited by: §1.
Appendix A More Visualizations and Discussions on PSNet Variants
Fig. 6 visualizes qualitative results from the ablation studies — "w/o Local", "w/o EdgeConv", "w/o NetVLAD", and "full PSNet" — described in Sec. 4.5 of our main paper. These qualitative results yield several interesting findings on the effectiveness of integrating the local and global contexts:

The method "w/o Local" (3rd column of Fig. 6) has difficulty capturing local structures, such as planes, corners, and convex or concave elements, without the exploitation of local context. The results show that it fails to differentiate "column" from "wall", which tend to have similar colors but different geometric structures. Additionally, the segmented regions are not homogeneous because each point is processed independently in "w/o Local"; see "column" in Row A, "chair" in Row B, and "bookcase" in Row E.

In comparison to "w/o Local", "w/o EdgeConv" (4th column of Fig. 6) considers neighboring points to some extent. Hence, it performs slightly better in capturing geometric differences, as can be seen from the correct classification of "column" in Rows B and E. However, despite considering local regions, "w/o EdgeConv" still treats each point within its local region independently, which leads to insufficient exploitation of fine-grained local structures. As we can see from the wrongly segmented "chair" and "clutter" (on the "table") in Rows B and D, "w/o EdgeConv" is confused by neighboring points from the "bookcase" class, indicating that it lacks the ability to capture complex geometry in classes such as "clutter" and "chair".

We observe that "w/o NetVLAD" (5th column of Fig. 6) fails to model scene-level semantic information without the aggregation of the global context. As shown in Row A, "w/o NetVLAD" confuses the "chair" and "board" classes, an error that does not occur when the global context is available. Similar confusions can be seen in Rows D and F, where the "wall" class is confused with the "table" and "chair" classes, respectively.

Our full PSNet (6th column of Fig. 6), which incorporates both local structures and global context, achieves the best segmentation results compared to the variants that remove either the local or the global context exploitation module. This suggests that our proposed locally and globally aware framework is able to capture fine-grained local structures (successful segmentation of the "column" class in Rows A–F, and the distinction between the "clutter" and "bookcase" classes in Rows B and D) as well as the global context (the distinction between the white "wall" and white "board" classes in Rows B, C, D, and F).
Appendix B More Discussions on the Dataset Imbalance Problem
The two datasets used in our experiments are very challenging, as both are highly imbalanced. Class distribution statistics for the two datasets are given in Table 4 and Table 5, respectively. As seen in Table 4, the dominant classes on S3DIS ("wall", "ceiling", and "floor") have 10–66 times more data than rare classes such as "sofa", "board", and "beam". This imbalance is even more severe on the ScanNet dataset, where the most dominant class ("wall") has over 200 times more training data than the rarest class ("shower curtain"). In the experiments, we did not apply any strategy that explicitly addresses this problem, such as oversampling or a loss weighted by class frequency. Interestingly, we still achieve acceptable performance on rare classes such as "column", "window", and "board" on the S3DIS dataset, and "bathtub", "toilet", "window", and "shower curtain" on the ScanNet dataset. This may indicate that our proposed method is robust to imbalanced data distributions.
Class name  ceiling  floor  wall  beam  column  window  door  table  chair  sofa  bookcase  board  clutter
# objects  391  290  1,552  165  260  174  549  461  1,369  61  590  143  —
Data percentage (%)  19.27  16.52  27.81  1.73  2.02  2.52  4.78  3.39  3.43  0.42  6.33  1.24  10.54
Class name  wall  floor  chair  table  desk  bed  bookshelf  sofa  sink  bathtub  toilet
Train data percentage (%)  36.80  24.90  4.60  2.53  1.66  2.58  2.04  2.59  0.34  0.34  0.27
Test data percentage (%)  36.46  24.38  5.36  2.90  1.65  2.14  2.15  2.44  0.33  0.22  0.26
Class name  curtain  counter  door  window  shower curtain  refrigerator  picture  cabinet  other furniture
Train data percentage (%)  1.48  0.62  2.33  0.94  0.18  0.43  0.37  2.59  2.46
Test data percentage (%)  0.98  0.65  2.15  0.66  0.10  0.34  0.17  2.43  3.23
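Although we did not use any explicit rebalancing strategy in our experiments, the class-frequency-weighted loss mentioned above can be sketched as follows. This is a minimal illustrative implementation, not part of PSNet; the function names and the inverse-frequency weighting scheme are one common choice among several.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes, eps=1e-6):
    """Per-class weights inversely proportional to class frequency,
    normalized so the average weight is 1."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    freqs = counts / counts.sum()
    w = 1.0 / (freqs + eps)
    return w / w.sum() * num_classes

def weighted_cross_entropy(logits, labels, weights):
    """Mean weighted cross-entropy over N points; logits is (N, C)."""
    z = logits - logits.max(axis=1, keepdims=True)     # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]   # per-point loss
    return float((weights[labels] * nll).mean())       # rare classes count more

# toy imbalanced labels: class 0 dominates, class 2 is rare
labels = np.array([0] * 90 + [1] * 8 + [2] * 2)
w = inverse_frequency_weights(labels, num_classes=3)
assert w[2] > w[0]  # the rare class receives the larger weight
```

Under such a weighting, mistakes on rare classes like "shower curtain" would contribute more to the loss than mistakes on "wall", which is the intent of the strategy we deliberately omitted.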