I Introduction
In recent years, 3D imaging sensors have developed rapidly, facilitating the acquisition of 3D point cloud data. With the explosive growth of such data, point cloud semantic segmentation has received increasing attention [23, 7, 15] in 3D scene understanding. Point cloud semantic segmentation aims to classify each point into a semantic category. Due to the unordered and irregular structure of 3D point clouds, exploiting their context information for semantic segmentation is very challenging.
Recently, various efforts have been made on point cloud semantic segmentation. PointNet [16] directly employs a multi-layer perceptron (MLP) to extract the feature of each individual point for point cloud segmentation. Building on PointNet, PointNet++ [17] and DGCNN [21] aggregate different local features of point clouds with a max pooling operation for segmentation. Based on the PointNet++ framework, PointWeb
[27] learns the weights between each pair of points in a local region to extract local features. Although these methods capture the geometric structures of individual neighborhoods well, they ignore the relationships between long-range neighborhoods of point clouds. In fact, the context from long-range neighboring points is important for point cloud semantic segmentation. It is therefore desirable to exploit the long-range dependencies between different neighborhoods to characterize discriminative geometric structures of point clouds. To this end, we propose a novel cascaded non-local neural network for segmentation. In our method, the non-local operation is performed on three levels, i.e., the neighborhood level, the superpoint level and the global level, corresponding to different scales of areas in point clouds. The non-local operation at the neighborhood level is applied to the neighboring points found with the nearest-neighbor (NN) algorithm. The non-local operation at the superpoint level is conducted within a superpoint, i.e., a set of points with isotropic geometric characteristics, while the global-level operation is performed on the global point cloud composed of all the superpoints. By cascading the three non-local operations, geometric structure information of neighboring points can be propagated level by level. Once the non-local features at the global level are obtained, we employ the encoder-decoder framework of PointNet++ to predict the label of each point for semantic segmentation. Experimental results on the S3DIS, ScanNet, vKITTI and SemanticKITTI datasets demonstrate the effectiveness of our proposed method.
In the cascaded non-local neural network, the non-local operation, as an attention mechanism, learns the attention weights of neighboring points at each level. Thus, different weights can be highlighted for different contexts from long-range neighboring points. In addition, the original non-local operation performed on the whole point cloud is time-consuming. By partitioning the whole point cloud into cascaded areas, the resulting cascaded non-local operation can largely reduce the computational cost of the original non-local module.
In summary, the main contributions of this paper are twofold. On one hand, we develop a novel cascaded non-local module where the non-local features of the centroid points at different levels can be extracted, so that context information from long-range neighboring points is propagated level by level. On the other hand, the cascaded non-local module largely reduces the computational complexity of the original non-local operation.
II Related Work
II-A Deep learning on point clouds
PointNet [16] is the pioneering algorithm for extracting deep features of unordered point clouds. PointNet++ [17] pays more attention to extracting local features of point clouds with a hierarchical structure. PointWeb [27] learns the weights of pairwise points to enhance the local features of point clouds. LatticeNet [18] adopts a sparse permutohedral lattice to characterize the local features of point clouds.

Graph-based methods mainly focus on depicting edge relationships between points, and thus boost local feature embedding. DGCNN [21] constructs a nearest-neighbor graph to characterize the local geometric structures of point clouds so that local features of the points can be extracted. GACNet [19] introduces a graph attention convolution that assigns weights to neighboring points and extracts local features of point clouds through an attention mechanism. Models based on the superpoint graph (SPG) framework [14, 13] partition point clouds into superpoints and then conduct feature embedding through a graph neural network built upon the SPG.
SqueezeSegV3 [22] adopts spherical projection to generate LiDAR images and proposes a spatially-adaptive convolution for point cloud feature embedding, where the spatial priors in the LiDAR images can be exploited. SalsaNet [1] is an encoder-decoder network for point cloud segmentation, where a bird's-eye-view image is used as the input and an auto-labeling process is employed to transfer labels from the camera to the LiDAR. To handle the imbalanced spatial distribution of LiDAR point clouds, PolarNet [25] proposes a polar bird's-eye-view representation that enables nearest-neighbor-free point cloud semantic segmentation.
II-B Non-local neural networks
The idea of the non-local operation was originally introduced in image denoising [5]. Wang et al. [20] propose a neural network combining non-local blocks with CNNs to extract image features. However, its vast computational consumption and massive memory occupation hinder its application. In [28], Zhu et al. propose an asymmetric non-local neural network that reduces the computational cost through a pyramid sampling module. Zhang et al. [24] introduce a novel graph structure to reduce the computational complexity of the non-local operation. In [10], only the pixels along a criss-cross path participate in the non-local operation. In [12], an adaptive sampling strategy is adopted to decrease the computational complexity. These methods for reducing the computational cost of the non-local operation rely on the regular structure of images, and thus cannot be directly extended to irregular and unordered point clouds for feature extraction.
III Our Method
In this section, we present our cascaded non-local neural network for point cloud segmentation. In Sec. III-A, we briefly revisit the non-local operation. In Sec. III-B, we describe the details of our cascaded non-local module. Finally, we present our network architecture in Sec. III-C and analyze the computational complexity of our cascaded non-local module in Sec. III-D.
III-A Revisiting the non-local operation
Recently, the non-local operation has been employed to construct deep non-local networks [20]. Given the feature map of an image, we denote $H$, $W$ and $C$ as the height and width of the image and the number of channels, respectively. Suppose that in each non-local block there are three $1\times1$ convolutions $\theta$, $\phi$ and $g$ for embedding, each producing $C'$ output channels. We then reshape the three embeddings to $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{N \times C'}$, where $N = H \times W$. The final non-local feature is calculated as:

$$\mathbf{Y} = \mathrm{softmax}\left(\mathbf{Q}\mathbf{K}^{\top}\right)\mathbf{V} \qquad (1)$$

where each row of $\mathbf{Y}$ is a weighted sum of the embeddings of all pixels.
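As a concrete illustration (not the authors' implementation), the image-domain non-local block of Eq. 1 can be sketched in NumPy; the embedding matrices here are random stand-ins for the learned $1\times1$ convolutions:

```python
import numpy as np

def nonlocal_block(x, w_theta, w_phi, w_g):
    """Eq. 1: Y = softmax(Q K^T) V over all N = H*W pixels.

    x: (N, C) flattened feature map; each w_*: (C, C') embedding weights.
    """
    q = x @ w_theta          # (N, C') queries
    k = x @ w_phi            # (N, C') keys
    v = x @ w_g              # (N, C') values
    logits = q @ k.T         # (N, N) pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)      # row-wise softmax
    return attn @ v          # (N, C'): weighted sum over all pixels

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))      # e.g. a 4x4 feature map, 8 channels
w = [rng.standard_normal((8, 4)) for _ in range(3)]
y = nonlocal_block(x, *w)
assert y.shape == (16, 4)
```

Note that the $(N, N)$ attention matrix is exactly what makes this block expensive, which motivates the sampling-based variants discussed next.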
However, the non-local operation incurs a high computational cost due to the large number of multiplications in Eq. 1. To tackle this problem, existing methods such as [10, 12, 28] mainly focus on sampling the feature maps to reduce the computational complexity of the matrix multiplications. Due to the regular grid structure of images, these methods can effectively reduce the computational cost of the non-local operation. Nonetheless, since point clouds are irregular and unordered, it is difficult to regularly sample their feature maps. Therefore, these methods cannot be directly applied to extract non-local features of point clouds.
III-B Cascaded non-local module
Given two point-wise features $\mathbf{x}_i$ and $\mathbf{x}_j$, the non-local operation for point clouds is formulated as follows:

$$\mathbf{y}_i = \sum_{j} f(\mathbf{x}_i, \mathbf{x}_j) \odot g(\mathbf{x}_j) \qquad (2)$$

where $f(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{W}_f(\mathbf{x}_i - \mathbf{x}_j)$ is the pairwise function embedding the difference between the two feature vectors, $g(\mathbf{x}_j) = \mathbf{W}_g\mathbf{x}_j$ is a unary function for feature embedding, the operator $\odot$ represents the Hadamard product, and $\mathbf{W}_f$ and $\mathbf{W}_g$ are the two weights to be learned.

In Eq. 1, the original non-local function employs a scalar value to depict the similarity between each pair of points. In contrast, our pairwise function employs a channel-wise vector to describe the relationship between two points, so our non-local feature can capture the geometric structure more accurately. Furthermore, to deal with the feature scales of different pairs of points, the pairwise function is normalized as follows:

$$\hat{f}^{c}(\mathbf{x}_i, \mathbf{x}_j) = \frac{\exp\left(f^{c}(\mathbf{x}_i, \mathbf{x}_j)\right)}{\sum_{j} \exp\left(f^{c}(\mathbf{x}_i, \mathbf{x}_j)\right)} \qquad (3)$$

where $f^{c}$ represents the $c$-th channel of the feature vector.
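A minimal NumPy sketch of the channel-wise non-local operation of Eqs. 2-3, assuming linear embeddings $f(\mathbf{x}_i,\mathbf{x}_j)=\mathbf{W}_f(\mathbf{x}_i-\mathbf{x}_j)$ and $g(\mathbf{x}_j)=\mathbf{W}_g\mathbf{x}_j$ (the paper only states that $f$ embeds the feature difference, so the linear form is an assumption), with a softmax over the neighbors computed independently per channel:

```python
import numpy as np

def channelwise_nonlocal(x_i, neighbors, w_f, w_g):
    """Eqs. 2-3: y_i = sum_j fhat(x_i, x_j) (Hadamard) g(x_j).

    x_i: (C,) query feature; neighbors: (K, C); w_f, w_g: (C, C').
    fhat is a softmax over j computed independently for each channel.
    """
    diff = x_i[None, :] - neighbors          # (K, C) feature differences
    f = diff @ w_f                           # (K, C') pairwise embedding
    f = f - f.max(axis=0, keepdims=True)     # stabilise the softmax
    fhat = np.exp(f) / np.exp(f).sum(axis=0, keepdims=True)  # Eq. 3
    g = neighbors @ w_g                      # (K, C') unary embedding
    return (fhat * g).sum(axis=0)            # (C',): sum over neighbors j

rng = np.random.default_rng(1)
y = channelwise_nonlocal(rng.standard_normal(6),
                         rng.standard_normal((5, 6)),
                         rng.standard_normal((6, 4)),
                         rng.standard_normal((6, 4)))
assert y.shape == (4,)
```

Unlike the scalar attention of Eq. 1, each output channel here gets its own attention distribution over the neighbors.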
Note that it is infeasible to directly perform the non-local operation on the whole point cloud since the operation defined in Eq. 2 requires huge memory. Therefore, in order to balance the memory cost of the non-local operation and the accurate depiction of geometric structures of point clouds, we propose a cascaded non-local module, where the non-local operation is performed on the point clouds at different levels. Specifically, the three-level non-local operations are conducted in three different scales of areas: the neighborhood area, the superpoint area and the global area, respectively. This greatly reduces the computational complexity of the non-local operation by controlling the number of points participating in each operation.
Neighborhood level. The non-local operation at the neighborhood level aims to extract the local features of the centroids in the corresponding neighborhoods. We first leverage farthest point sampling (FPS) to choose $N_1$ points as the centroids. For each centroid $\mathbf{x}_i$, the $K$ nearest neighboring points are used to construct the neighborhood area $\mathcal{N}(i)$ (Fig. 1). Consequently, we apply the non-local operation in the neighborhood area to extract the local feature of the centroid:

$$\mathbf{y}_i^{n} = \sum_{j \in \mathcal{N}(i)} \hat{f}(\mathbf{x}_i, \mathbf{x}_j) \odot g(\mathbf{x}_j) \qquad (4)$$

where $i$ is the index of the centroid point. From Eq. 4, one can see that the local feature of each centroid point is characterized by assigning different weights to the points in its neighborhood.
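The centroid selection and neighborhood construction described above can be sketched as follows (an illustrative greedy FPS and brute-force kNN, not the authors' optimized implementation):

```python
import numpy as np

def farthest_point_sampling(pts, n_centroids):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [0]
    d = np.linalg.norm(pts - pts[0], axis=1)     # distance to chosen set
    for _ in range(n_centroids - 1):
        chosen.append(int(d.argmax()))           # farthest remaining point
        d = np.minimum(d, np.linalg.norm(pts - pts[chosen[-1]], axis=1))
    return np.array(chosen)

def knn_indices(pts, centroid, k):
    """Indices of the k nearest points to a centroid (the area N(i) in Eq. 4)."""
    return np.argsort(np.linalg.norm(pts - centroid, axis=1))[:k]

rng = np.random.default_rng(2)
pts = rng.standard_normal((100, 3))
centroids = farthest_point_sampling(pts, 8)
neigh = knn_indices(pts, pts[centroids[0]], 16)
assert len(set(centroids.tolist())) == 8 and neigh.shape == (16,)
```

The channel-wise non-local operation of Eq. 4 is then applied within each such neighborhood rather than over all $N$ points.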
Superpoint level. Once the local features of the centroid points are extracted, we conduct the non-local operation at the superpoint level, so that geometric structure information of different neighborhoods within a superpoint can be effectively propagated. A superpoint is a set of points with isotropic geometric features. In general, the number of centroid points varies across superpoints. Therefore, in order to facilitate batch processing in our neural network, we randomly sample $T$ centroid points in each superpoint, forming the set $\mathcal{S}(i)$ for the superpoint containing centroid $i$. Thus, for the centroid point $\mathbf{x}_i$, the non-local feature at the superpoint level is defined as:

$$\mathbf{y}_i^{s} = \sum_{j \in \mathcal{S}(i)} \hat{f}(\mathbf{y}_i^{n}, \mathbf{y}_j^{n}) \odot g(\mathbf{y}_j^{n}) \qquad (5)$$

where $\mathbf{y}_j^{n}$ is the local feature of the centroid point $j$ at the neighborhood level.
Global level. In order to exploit semantic contexts from different superpoints in the point cloud, we further propagate the features of the centroid points at the superpoint level. Since each superpoint contains multiple centroids, we use a max pooling operation to extract the superpoint feature $\mathbf{s}_k$ of the $k$-th superpoint. As shown in Fig. 1, each small box represents the corresponding superpoint feature. For the centroid point $\mathbf{x}_i$, by assigning different weights to the superpoint features, we define the following non-local operation at the global level:

$$\mathbf{y}_i^{g} = \sum_{k=1}^{M} \hat{f}(\mathbf{y}_i^{s}, \mathbf{s}_k) \odot g(\mathbf{s}_k) \qquad (6)$$

where $M$ is the number of superpoints.
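The max-pooling step that produces the superpoint features for Eq. 6 can be sketched as follows (illustrative; the superpoint labels here are synthetic):

```python
import numpy as np

def superpoint_maxpool(centroid_feats, sp_labels, n_superpoints):
    """s_k = channel-wise max over the centroid features in superpoint k."""
    return np.stack([centroid_feats[sp_labels == k].max(axis=0)
                     for k in range(n_superpoints)])

rng = np.random.default_rng(3)
feats = rng.standard_normal((20, 8))   # features of 20 centroid points
labels = np.arange(20) % 4             # assign centroids to 4 superpoints
s = superpoint_maxpool(feats, labels, 4)
assert s.shape == (4, 8)
```

Each centroid then attends over these $M$ pooled vectors instead of over all points, which is what keeps the global level cheap.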
After propagating the non-local features of the centroids through the whole point cloud, we obtain the fused features with a mapping function $\psi$. The final feature of each centroid point is formulated as:

$$\mathbf{z}_i = \psi\left(\left[\mathbf{y}_i^{n}, \mathbf{y}_i^{s}, \mathbf{y}_i^{g}\right]\right) \qquad (7)$$

where $[\cdot]$ denotes the concatenation operation and $\mathbf{y}_i^{n}$, $\mathbf{y}_i^{s}$, $\mathbf{y}_i^{g}$ represent the non-local features at the three levels, respectively. With the cascaded non-local operations at the three levels, long-range dependencies across different neighboring points can be built, so that we obtain a discriminative feature of each centroid point for semantic segmentation.
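The fusion of Eq. 7 is a concatenation followed by a shared mapping; a sketch with a random linear map standing in for $\psi$ (an assumption, since the paper does not specify $\psi$'s form):

```python
import numpy as np

def fuse_levels(y_n, y_s, y_g, w_psi):
    """Eq. 7: z_i = psi([y^n, y^s, y^g]) via concatenation + linear map."""
    z = np.concatenate([y_n, y_s, y_g], axis=-1)  # (N1, 3*C')
    return z @ w_psi                              # (N1, C_out)

rng = np.random.default_rng(5)
yn, ys, yg = (rng.standard_normal((10, 4)) for _ in range(3))
z = fuse_levels(yn, ys, yg, rng.standard_normal((12, 6)))
assert z.shape == (10, 6)
```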
III-C Architecture
The overall architecture of our network is illustrated in Fig. 2. Our network is constructed on the PointNet++ framework, combining the cascaded non-local modules to build long-range dependencies between points. In the encoder, we employ four non-local modules for feature embedding, while in the decoder we adopt the upsampling strategy of PointNet++ to predict the semantic labels of the points.
In PointNet++, the local features of the centroid points are extracted by performing a max pooling operation on the features of single points in a local region. Different from PointNet++, our proposed method extracts more discriminative local features by weighting the neighboring points through the non-local mechanism. In addition, we propagate the local features of the centroid points at the superpoint and global levels with the non-local operation, which further boosts the discriminativeness of the local features of the centroid points. In contrast, PointNet++ mainly focuses on hierarchical local regions for feature extraction without considering non-local regions such as superpoints.
III-D Computational complexity analysis
In Eq. 2, the non-local feature of each point is computed as a weighted sum of the responses of all points in the point cloud. If we ignore the feature channels, the computational complexity is $O(N^2)$ for $N$ points. For our cascaded non-local module, the non-local operation is performed at three levels and the number of points participating in each operation is far smaller than $N$. Specifically, at the neighborhood level, the number of subsampled centroid points is $N_1$ in each of the four non-local modules, and each centroid point has $K$ neighboring points. At the superpoint level, for each centroid point, the non-local operation is performed on $T$ centroid points in the same superpoint. At the global level, $M$ superpoints are involved in the non-local operation for each centroid point. In our experiments, $K$, $T$ and $M$ are set to small constants. Therefore, the final computational complexity of our hierarchical non-local operation is:

$$O\left(N_1 \left(K + T + M\right)\right) \qquad (8)$$

where $K$, $T$ and $M$ are far smaller than $N_1$ and $N$. Thus, we can significantly reduce the computational complexity of the non-local operation on point clouds.
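To make the saving concrete, the operation counts can be compared with purely illustrative sizes (these particular values of $N$, $N_1$, $K$, $T$ and $M$ are assumptions, not the paper's settings):

```python
# Pairwise-term counts, channel dimension ignored; all sizes are assumed.
N, N1, K, T, M = 4096, 1024, 16, 8, 32

full = N * N                    # original non-local: every pair of points
cascaded = N1 * (K + T + M)     # Eq. 8: three small per-centroid sums

assert cascaded < full
print(f"{full} vs {cascaded}: {full / cascaded:.0f}x fewer pairwise terms")
```

Even with these modest sizes the cascade needs orders of magnitude fewer pairwise terms than the dense $O(N^2)$ operation.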
IV Experiments
In this section, we evaluate our proposed model on indoor and outdoor datasets.
IV-A Implementation details
To train our model, we use the SGD optimizer, where the base learning rate and mini-batch size are set to 0.05 and 16, respectively. The momentum is set to 0.9 and the weight decay to 0.0001. The superpoint partition follows [14]. For the S3DIS [2] dataset, we train for 100 epochs and decay the learning rate by 0.1 every 25 epochs. For the ScanNet [6] dataset, we train the network for 200 epochs and decay the learning rate by 0.1 every 50 epochs. For the vKITTI [9] dataset, we train the network for 100 epochs and decay the learning rate by 0.1 every 25 epochs. For the outdoor dataset SemanticKITTI [3], the learning rate is 0.003 and the weight decay is 0.0001; the scenes are partitioned into blocks that serve as superpoints. Note that our method does not need to build superpoint graphs.

IV-B Semantic segmentation on datasets
S3DIS. S3DIS [2] is an indoor 3D point cloud dataset. The point clouds are split into six large-scale areas, where each point is annotated with one of 13 semantic categories. Three metrics are used to quantitatively evaluate our method: mean per-class intersection over union (mIoU), mean per-class accuracy (mAcc) and overall accuracy (OA). For evaluation, we follow [17, 14] and test the model on Area 5 as well as with 6-fold cross validation. For training, we follow [17] to uniformly split the point clouds into blocks with an area of 1m × 1m and randomly sample 4096 points in each block. During testing, we adopt all the points for evaluation.
The quantitative results with 6-fold cross validation on S3DIS are shown in Tab. I. Our model, named PointNL, achieves better or comparable performance than the other methods listed in Tab. I, benefiting from the rich features provided by the non-local module. Note that although the additional hand-crafted geometry features of [14, 13] are not utilized in our model, PointNL still achieves comparable results. The visual results are shown in Fig. 4; our proposed method obtains more accurate segmentation results on the S3DIS dataset.
ScanNet. ScanNet [6] is an indoor scene dataset, which contains 1513 point clouds whose points are annotated with 20 categories. Following [17, 27], the dataset is split into 1201 scenes for training and 312 scenes for testing. During training, the points are divided into blocks of size 1.5m × 1.5m, and each block consists of 8192 points sampled on-the-fly. For testing, all the points in the test set are used for evaluation. Following [27], the stride between adjacent blocks is set to 0.5. The overall semantic voxel labeling accuracy (OA) is used to evaluate the compared segmentation methods.
We list the quantitative results on the validation set in Tab. I. From this table, one can see that our model outperforms the other compared methods. Thanks to the hierarchical non-local module, which captures the long-range context information of point clouds, our model obtains good segmentation results.
vKITTI. We also evaluate our method on the vKITTI [9] dataset, which mimics the real-world KITTI dataset. It contains 5 sequences of synthetic outdoor scenes annotated with 13 classes (including road, tree, terrain, car, etc.). The 3D point clouds are obtained by projecting the 2D depth images into 3D coordinates. For evaluation, we follow the strategy adopted in [23]: we split the dataset into 6 non-overlapping subsequences and conduct 6-fold cross validation. During training, we split the point cloud into 4m × 4m blocks and randomly sample 4096 points in each block. For evaluation, the mean per-class intersection over union (mIoU), mean per-class accuracy (mAcc) and overall accuracy (OA) are adopted.
The quantitative results are listed in Tab. I. From this table, one can see that our proposed PointNL yields better performance on this dataset with a significant gain in terms of mIoU. The visual results are shown in Fig. 3, which further demonstrates the effectiveness of our method.
SemanticKITTI. SemanticKITTI [3] is a large-scale outdoor 3D dataset with 43552 LiDAR scans, each containing a large number of points covering a wide area in 3D space. The dataset is split into sequences: sequences 00-07 and 09-10 (19130 scans) are used for training, sequence 08 (4071 scans) for validation, and sequences 11-21 (20351 scans) for online testing. The point clouds are annotated with 19 classes and contain no color information. For evaluation, the mean per-class intersection over union (mIoU) is adopted following [3].
On the SemanticKITTI dataset, PointNet [16], PointNet++ [17], DarkNet21Seg [3] and DarkNet53Seg [3] obtain mIoUs of 14.6%, 20.1%, 47.4% and 49.9%, respectively, while our PointNL achieves an mIoU of 52.2%, outperforming the others. In addition, Fig. 4 visualizes the segmentation results of our proposed PointNL on the SemanticKITTI dataset.
IV-C Ablation study
To better demonstrate the effectiveness of our proposed method, we conduct ablation studies on Area 5 of the S3DIS dataset to analyze the effects of the non-local operations at different levels. Tab. II shows the segmentation accuracies with the non-local operation at different levels. The neighborhood-level non-local operation alone obtains an mIoU of 60.03% and is already competitive with many recent point cloud segmentation approaches. Further adding the non-local operation at the superpoint level improves the mIoU by 2.15%. Cascading the non-local operations at all three levels improves the mIoU by a further 1.32%. In addition, we also implement the original non-local operation, which directly applies the non-local operation to the whole point cloud. In Tab. II, the mIoU of the original non-local operation (baseline) is only 54.76%, which is lower than that of PointNL. This implies that our hierarchical non-local model can effectively build semantic long-range dependencies between points to obtain discriminative local features of point clouds.
IV-D Computational cost
In terms of training time and GPU memory, we compare our method to PointWeb and the original non-local operation (baseline) on the PyTorch platform. For a fair comparison, all methods are run on Tesla P40 GPUs for 100 epochs on Area 5 of the S3DIS dataset. The computational cost, GPU memory occupancy and segmentation accuracy of the compared methods are shown in Tab. II. PointWeb [27] needs at least 4.2 days and 24GB of GPU memory and obtains an mIoU of 60.28%. The original non-local operation (baseline) costs about 6 days and 60GB of GPU memory and obtains an mIoU of 54.76%, while our method costs about 1.8 days and 8GB of GPU memory and achieves an mIoU of 63.50%.

The total number of parameters of our model is 3.5M. For testing, the inference time of each batch is 0.35s, while PointWeb requires 1.5s. Therefore, our method not only improves the performance of semantic segmentation but also greatly reduces the time cost and memory consumption.
V Conclusion
We proposed a novel cascaded non-local neural network for point cloud semantic segmentation. In the proposed network, we developed a new cascaded non-local module to capture the neighborhood-level, superpoint-level and global-level geometric structures of point clouds. By stacking the cascaded non-local modules, semantic context information of point clouds is propagated level by level so that the discriminativeness of the local features of point clouds is boosted. Experimental results on benchmark point cloud segmentation datasets demonstrate the effectiveness of our proposed PointNL in terms of both segmentation accuracy and computational cost.
References
 [1] (2019) SalsaNet: fast road and vehicle segmentation in LiDAR point clouds for autonomous driving. arXiv preprint arXiv:1909.08291. Cited by: §II-A.
 [2] (2016) 3D semantic parsing of large-scale indoor spaces. In CVPR. Cited by: §IV-A, §IV-B.
 [3] (2019) SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. In ICCV. Cited by: §IV-A, §IV-B.
 [4] (2019) Generalizing discrete convolutions for unstructured point clouds. arXiv preprint arXiv:1904.02375. Cited by: TABLE I.

 [5] (2005) A non-local algorithm for image denoising. In CVPR. Cited by: §II-B.
 [6] (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In CVPR. Cited by: §IV-A, §IV-B.
 [7] (2017) Exploring spatial context for 3D semantic segmentation of point clouds. In ICCV. Cited by: §I, TABLE I.
 [8] (2018) Know what your neighbors do: 3D semantic segmentation of point clouds. In ECCV. Cited by: TABLE I.
 [9] (2016) Virtual worlds as proxy for multi-object tracking analysis. In CVPR. Cited by: §IV-A, §IV-B.
 [10] (2019) CCNet: criss-cross attention for semantic segmentation. In ICCV. Cited by: §II-B, §III-A.
 [11] (2019) Hierarchical point-edge interaction network for point cloud semantic segmentation. In ICCV. Cited by: TABLE I.
 [12] (2014) Dynamic network energy management via proximal message passing. In CVPR. Cited by: §II-B, §III-A.
 [13] (2019) Point cloud oversegmentation with graph-structured deep metric learning. arXiv preprint arXiv:1904.02113. Cited by: §II-A, TABLE I, §IV-B.
 [14] (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In CVPR. Cited by: §II-A, TABLE I, §IV-A, §IV-B.
 [15] (2018) PointCNN: convolution on X-transformed points. In NeurIPS. Cited by: §I, TABLE I.

 [16] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In CVPR. Cited by: §I, §II-A, TABLE I, §IV-B.
 [17] (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS. Cited by: §I, §II-A, TABLE I, §IV-B.
 [18] (2019) LatticeNet: fast point cloud segmentation using permutohedral lattices. arXiv preprint arXiv:1912.05905. Cited by: §II-A.
 [19] (2019) Graph attention convolution for point cloud semantic segmentation. In CVPR. Cited by: §II-A.
 [20] (2018) Non-local neural networks. In CVPR. Cited by: §II-B, §III-A.
 [21] (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG). Cited by: §I, §II-A.
 [22] (2020) SqueezeSegV3: spatially-adaptive convolution for efficient point-cloud segmentation. arXiv preprint arXiv:2004.01803. Cited by: §II-A.

 [23] (2018) 3D recurrent neural networks with context fusion for point cloud semantic segmentation. In ECCV. Cited by: §I, TABLE I, §IV-B.
 [24] (2019) LatentGNN: learning efficient non-local relations for visual recognition. arXiv preprint arXiv:1905.11634. Cited by: §II-B.
 [25] (2020) PolarNet: an improved grid representation for online LiDAR point clouds semantic segmentation. In CVPR. Cited by: §II-A.

 [26] (2019) ShellNet: efficient point cloud convolutional neural networks using concentric shells statistics. In ICCV. Cited by: TABLE I.
 [27] (2019) PointWeb: enhancing local neighborhood features for point cloud processing. In CVPR. Cited by: §I, §II-A, TABLE I, §IV-B, §IV-D.
 [28] (2019) Asymmetric non-local neural networks for semantic segmentation. arXiv preprint arXiv:1908.07678. Cited by: §II-B, §III-A.