In recent years, 3D imaging sensors have developed rapidly, facilitating the acquisition of 3D point cloud data. With the explosive growth of 3D point cloud data, point cloud semantic segmentation has received increasing attention [23, 7, 15] in 3D scene understanding. Point cloud semantic segmentation aims to classify each point into a semantic category. Due to the unordered and irregular structure of 3D point clouds, exploiting their context information for semantic segmentation is very challenging.
Recently, various efforts have been made on point cloud semantic segmentation. PointNet directly employs a multi-layer perceptron (MLP) to extract the feature of each individual point for point cloud segmentation. Built on PointNet, PointNet++ and DGCNN aggregate different local features of point clouds with the max pooling operation for segmentation. Based on the PointNet++ framework, PointWeb learns the weights between each pair of points in a local region to extract local features. Although these methods can capture the geometric structures of different neighborhoods well, they ignore the relationships between long-range neighborhoods of point clouds.
In fact, the context from long-range neighboring points is important for point cloud semantic segmentation. It is desirable to exploit the long-range dependencies of different neighborhoods to characterize discriminative geometric structures of point clouds. To this end, we propose a novel cascaded non-local neural network for segmentation. In our method, the non-local operation is performed at three levels, i.e., the neighborhood level, the superpoint level and the global level, corresponding to different scales of areas in point clouds. The non-local operation at the neighborhood level is applied to the neighboring points found with the $k$-nearest neighbors ($k$-NN) algorithm. The non-local operation at the superpoint level is conducted in the superpoint area, which is a set of points with isotropic geometric characteristics, while the global-level operation is implemented over the global point cloud composed of all the superpoints. By cascading the three non-local operations, geometric structure information of neighboring points can be propagated level by level. Once the non-local features at the global level are obtained, we employ the encoder-decoder framework of PointNet++ to predict the label of each point for semantic segmentation. Experimental results on the S3DIS, ScanNet, vKITTI and SemanticKITTI datasets demonstrate the effectiveness of our proposed method.
In the cascaded non-local neural network, the non-local operation, as an attention mechanism, learns the attention weights of neighboring points at each level. Thus, different contexts from long-range neighboring points can be weighted differently. In addition, the original non-local operation performed on the whole point cloud is time-consuming. By partitioning the whole point cloud into cascaded areas, the resulting cascaded non-local operation can largely reduce the computational cost of the original non-local module.
In summary, the main contributions of this paper are two-fold. On one hand, we develop a novel cascaded non-local module where the non-local features of the centroid points at different levels can be extracted. Thus, context information from long-range neighboring points can be propagated level by level. On the other hand, the cascaded non-local module can largely reduce the computational complexity of the original non-local operation.
II Related Work
II-A Deep learning on point clouds
PointNet is the pioneering algorithm to extract deep features from unordered point clouds. PointNet++ pays more attention to extracting the local features of point clouds with a hierarchical structure. PointWeb learns the weights of pairwise points to enhance the local features of point clouds. LatticeNet adopts a sparse permutohedral lattice to characterize the local features of point clouds.
Graph-based methods mainly focus on depicting edge relationships between points, thus boosting the local feature embedding. DGCNN constructs a $k$-NN graph to characterize the local geometric structures of point clouds so that the local features of the points can be extracted. GACNet introduces a graph attention convolution to assign weights to neighboring points and extracts the local features of point clouds through an attention mechanism. Models based on the superpoint graph (SPG) framework [14, 13] partition point clouds into superpoints, and then conduct feature embedding through a graph neural network built upon the SPG.
SqueezeSegV3 adopts spherical projection to generate LiDAR images and proposes a spatially-adaptive convolution for point cloud feature embedding, where the spatial priors in the LiDAR images can be exploited. SalsaNet is an encoder-decoder network for point cloud segmentation, where the bird's-eye-view image is used as the input and an auto-labeling process is employed to transfer the labels from the camera to the LiDAR. Due to the imbalanced spatial distribution of LiDAR point clouds, PolarNet proposes a polar bird's-eye-view representation, enabling a nearest-neighbor-free method for point cloud semantic segmentation.
II-B Non-local neural networks
The idea of the non-local operation was originally introduced for image denoising. Wang et al. propose a neural network combining non-local blocks with CNNs to extract image features. However, its vast computational consumption and massive memory occupation hinder its application. Zhu et al. propose an asymmetric non-local neural network that reduces the computational cost through a pyramid sampling module. Zhang et al. introduce a novel graph structure to reduce the computational complexity of the non-local operation. In CCNet, only the pixels along the criss-cross path participate in the non-local operation, while other work adopts an adaptive sampling strategy to decrease the computational complexity. These methods for reducing the computational cost of the non-local operation rely on the regular structure of images, and thus cannot be directly extended to irregular and unordered point clouds for feature extraction.
III Our Method
In this section, we present our cascaded non-local neural network for point cloud segmentation. In Sec. III-A, we briefly revisit the non-local operation. In Sec. III-B, we describe the details of our cascaded non-local module. Finally, we present our network architecture in Sec. III-C and analyze the computational complexity of our cascaded non-local module in Sec. III-D.
III-A Revisiting the non-local operation
Recently, the non-local operation has been employed to construct deep non-local networks. Given the feature map $\mathbf{X} \in \mathbb{R}^{H \times W \times C}$ of an image, we denote $H$, $W$, and $C$ as the height and width of the image and the number of channels, respectively. Suppose that in each non-local block there are three $1 \times 1$ convolutions $\theta$, $\phi$, and $g$ for embedding, each producing $\hat{C}$ output channels. We then reshape these three embeddings to $\theta(\mathbf{X}), \phi(\mathbf{X}), g(\mathbf{X}) \in \mathbb{R}^{N \times \hat{C}}$, where $N = H \times W$. The final non-local feature is calculated as:

$\mathbf{Y} = \mathrm{softmax}\big(\theta(\mathbf{X})\,\phi(\mathbf{X})^{\top}\big)\, g(\mathbf{X})$   (1)

where each row of $\mathbf{Y} \in \mathbb{R}^{N \times \hat{C}}$ is a weighted sum of the embeddings of all pixels.
However, the non-local operation leads to high computational cost due to the large number of multiplications in Eq. 1. To tackle this problem, existing methods such as [10, 12, 28] mainly focus on sampling the feature maps to reduce the computational complexity of the matrix multiplications. Due to the regular grid structure of images, these methods can effectively reduce the computational cost of the non-local operation. Nonetheless, since point clouds are irregular and unordered, it is difficult to regularly sample their feature maps. Therefore, these methods cannot be directly applied to extract non-local features of point clouds.
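To make the operation in Eq. 1 concrete, here is a minimal NumPy sketch of the standard image non-local block, with the $1 \times 1$ convolutions $\theta$, $\phi$, $g$ reduced to plain linear maps; all variable names are illustrative, not the notation of any particular implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_image(X, W_theta, W_phi, W_g):
    """Standard non-local block on a flattened (N, C) feature map:
    each output row is a weighted sum of the g-embeddings of all N pixels."""
    theta = X @ W_theta            # (N, C_hat)
    phi = X @ W_phi                # (N, C_hat)
    g = X @ W_g                    # (N, C_hat)
    attn = softmax(theta @ phi.T)  # (N, N) pairwise attention weights
    return attn @ g                # (N, C_hat) non-local features

# toy example: an 8x8 "image" (N = 64 pixels) with C = 16 channels
rng = np.random.default_rng(0)
N, C, C_hat = 64, 16, 8
X = rng.standard_normal((N, C))
W_theta, W_phi, W_g = (0.1 * rng.standard_normal((C, C_hat)) for _ in range(3))
Y = nonlocal_image(X, W_theta, W_phi, W_g)
```

The $N \times N$ attention matrix is exactly what makes this block quadratic in the number of pixels, which motivates the sampling-based accelerations discussed above.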
III-B Cascaded non-local module
Given two pointwise features $\mathbf{x}_i$ and $\mathbf{x}_j$, the non-local operation for point clouds is formulated as follows:

$\mathbf{y}_i = \sum_{j} f(\mathbf{x}_i, \mathbf{x}_j) \odot g(\mathbf{x}_j)$   (2)

where $f(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{W}_f(\mathbf{x}_i - \mathbf{x}_j)$ is the pairwise function embedding the difference between the two feature vectors, $g(\mathbf{x}_j) = \mathbf{W}_g \mathbf{x}_j$ is a unary function for feature embedding, the operator $\odot$ represents the Hadamard product, and $\mathbf{W}_f$ and $\mathbf{W}_g$ are the two weights to be learned.
In Eq. 1, the original non-local function employs a scalar value to depict the similarity between each pair of points. In contrast, our pairwise function employs a channel-wise vector to describe the relationship between two points, so that our non-local feature can capture the geometric structure more accurately. Furthermore, to deal with the feature scales of different pairs of points, the pairwise function is normalized as follows:

$\hat{f}^{(c)}(\mathbf{x}_i, \mathbf{x}_j) = \dfrac{\exp\big(f^{(c)}(\mathbf{x}_i, \mathbf{x}_j)\big)}{\sum_{j'} \exp\big(f^{(c)}(\mathbf{x}_i, \mathbf{x}_{j'})\big)}$   (3)

where $f^{(c)}$ represents the $c$-th channel of the feature vector.
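A minimal NumPy sketch of this channel-wise variant for a single point, assuming the difference-based pairwise function and per-channel softmax over the neighbours described above (the weight names are illustrative):

```python
import numpy as np

def channelwise_nonlocal(x_i, X_nbr, W_f, W_g):
    """Non-local feature of one point x_i over its neighbours X_nbr (k, C).
    The pairwise term embeds feature differences (Eq. 2) and is normalised
    per channel with a softmax over the neighbours (Eq. 3)."""
    f = (x_i - X_nbr) @ W_f                   # (k, C_hat) channel-wise similarities
    f = np.exp(f - f.max(axis=0, keepdims=True))
    f = f / f.sum(axis=0, keepdims=True)      # per-channel softmax over neighbours
    g = X_nbr @ W_g                           # (k, C_hat) embedded neighbours
    return (f * g).sum(axis=0)                # Hadamard product, summed over neighbours

rng = np.random.default_rng(1)
x_i = rng.standard_normal(16)
X_nbr = rng.standard_normal((8, 16))          # 8 neighbouring points
W_f = 0.1 * rng.standard_normal((16, 4))
W_g = 0.1 * rng.standard_normal((16, 4))
y_i = channelwise_nonlocal(x_i, X_nbr, W_f, W_g)
```

Note that, unlike the scalar attention of Eq. 1, each of the $\hat{C}$ output channels here carries its own attention distribution over the neighbours.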
It is noted that directly performing the non-local operation on the whole point cloud is infeasible, since the non-local operation defined in Eq. 2 requires huge memory. Therefore, to balance the memory cost of the non-local operation against an accurate depiction of the geometric structures of point clouds, we propose a cascaded non-local module, where the non-local operation is performed on the point clouds at different levels. Specifically, the three-level non-local operations are conducted at three different scales: the neighborhood area, the superpoint area and the global area, respectively. This greatly reduces the computational complexity of the non-local operation by controlling the number of points participating in each non-local operation.
Neighborhood level The non-local operation at the neighborhood level aims to extract the local features of the centroids in the corresponding neighborhoods. We first leverage farthest point sampling (FPS) to choose $M$ points as centroids. For each centroid $\mathbf{x}_i$, its $k$ nearest neighboring points are used to construct the neighborhood area $\mathcal{N}(i)$ (Fig. 1). Consequently, we apply the non-local operation in the neighborhood area to extract the local feature of the centroid:

$\mathbf{l}_i = \sum_{j \in \mathcal{N}(i)} \hat{f}(\mathbf{x}_i, \mathbf{x}_j) \odot g(\mathbf{x}_j)$   (4)

where $i$ is the index of the centroid point. From Eq. 4, one can see that the local feature of each centroid point is characterized by assigning different weights to the points in its neighborhood.
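The neighborhood construction described above (FPS followed by $k$-NN grouping) can be sketched in plain NumPy; this is an illustration of the standard algorithms, not the paper's implementation:

```python
import numpy as np

def farthest_point_sampling(pts, m):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    idx = [0]
    dist = np.linalg.norm(pts - pts[0], axis=1)
    for _ in range(m - 1):
        nxt = int(dist.argmax())
        idx.append(nxt)
        # distance to the chosen set is the min over all chosen centroids
        dist = np.minimum(dist, np.linalg.norm(pts - pts[nxt], axis=1))
    return np.array(idx)

def knn_indices(pts, centroids, k):
    """Indices of the k nearest points to each centroid."""
    d = np.linalg.norm(pts[None, :, :] - centroids[:, None, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

rng = np.random.default_rng(2)
pts = rng.standard_normal((200, 3))
cid = farthest_point_sampling(pts, 16)   # 16 centroids
nbrs = knn_indices(pts, pts[cid], 8)     # 8 neighbours per centroid
```

Each row of `nbrs` indexes one neighborhood area $\mathcal{N}(i)$ on which the non-local operation of Eq. 4 is applied.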
Superpoint level Once the local features of the centroid points are extracted, we conduct the non-local operation at the superpoint level, so that geometric structure information of different neighborhoods within a superpoint can be effectively propagated. A superpoint is a set of points with isotropic geometric features. In general, the number of centroid points varies across superpoints. Therefore, in order to facilitate batch processing in our neural network, we randomly sample $T$ centroid points in each superpoint. Thus, for the centroid point $\mathbf{x}_i$, the non-local feature at the superpoint level is defined as:

$\mathbf{s}_i = \sum_{j \in \mathcal{T}(i)} \hat{f}(\mathbf{l}_i, \mathbf{l}_j) \odot g(\mathbf{l}_j)$   (5)

where $\mathbf{l}_j$ is the local feature of the $j$-th centroid point at the neighborhood level and $\mathcal{T}(i)$ denotes the sampled centroid points in the superpoint containing $\mathbf{x}_i$.
Global level In order to exploit semantic contexts from different superpoints in the point cloud, we further propagate the features of the centroid points at the superpoint level. Since each superpoint contains multiple centroids, we use a max pooling operation to extract the superpoint feature $\mathbf{p}_m$ of the $m$-th superpoint. As shown in Fig. 1, each small box represents the corresponding superpoint feature. For the centroid point $\mathbf{x}_i$, by assigning different weights to the superpoint features, we define the following non-local operation at the global level:

$\mathbf{g}_i = \sum_{m=1}^{K} \hat{f}(\mathbf{s}_i, \mathbf{p}_m) \odot g(\mathbf{p}_m)$   (6)

where $K$ is the number of superpoints.
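The max pooling that produces the superpoint features can be illustrated with a short NumPy scatter-max, assuming each centroid carries the id of the superpoint it belongs to:

```python
import numpy as np

def superpoint_features(centroid_feats, sp_ids, n_sp):
    """Channel-wise max pooling of centroid features within each superpoint."""
    out = np.full((n_sp, centroid_feats.shape[1]), -np.inf)
    for feat, s in zip(centroid_feats, sp_ids):
        out[s] = np.maximum(out[s], feat)
    return out

# three centroids, two superpoints: the first superpoint pools the
# first two centroids, the second superpoint gets the third
feats = np.array([[1.0, 2.0], [3.0, 0.0], [0.0, 5.0]])
p = superpoint_features(feats, [0, 0, 1], 2)
```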
After propagating the non-local features of the centroids through the whole point cloud, we obtain the fused features with a mapping function $\psi$. Therefore, the final feature of each centroid point is formulated as:

$\mathbf{z}_i = \psi([\mathbf{l}_i, \mathbf{s}_i, \mathbf{g}_i])$   (7)

where $[\cdot]$ denotes the concatenation operation and $\mathbf{l}_i$, $\mathbf{s}_i$, $\mathbf{g}_i$ represent the non-local features at the three levels, respectively. With the cascaded non-local operations at the three levels, long-range dependencies across different neighboring points can be built, so that we obtain a discriminative feature of each centroid point for semantic segmentation.
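The fusion step amounts to concatenation followed by a learned mapping; a minimal sketch in which a single linear layer plus ReLU stands in for $\psi$ (an assumption on our part, since the exact form of $\psi$ is not specified here):

```python
import numpy as np

def fuse_levels(l_i, s_i, g_i, W_psi):
    """Final centroid feature: concatenate the neighbourhood-, superpoint-
    and global-level non-local features, then map them with psi
    (here a single linear layer + ReLU stands in for the mapping function)."""
    z = np.concatenate([l_i, s_i, g_i])
    return np.maximum(z @ W_psi, 0.0)

rng = np.random.default_rng(3)
l_i, s_i, g_i = (rng.standard_normal(8) for _ in range(3))  # 8-dim features per level
W_psi = 0.1 * rng.standard_normal((24, 16))                 # maps 3*8 -> 16 channels
z_i = fuse_levels(l_i, s_i, g_i, W_psi)
```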
III-C Network architecture
The overall architecture of our network is illustrated in Fig. 2. Our network is built on the PointNet++ framework, combining the cascaded non-local modules to build long-range dependencies between the points. In the encoder, we employ four non-local modules for feature embedding, while in the decoder we adopt the upsampling strategy of PointNet++ to predict the semantic labels of the points.
In PointNet++, the local features of the centroid points are extracted by performing the max pooling operation on the features of single points in a local region. Different from PointNet++, our proposed method extracts more discriminative local features by weighting the neighboring points through the non-local mechanism. In addition, we propagate the local features of the centroid points at the superpoint and global levels with the non-local operation, so the discriminativeness of the local features of the centroid points can be further boosted. In contrast, PointNet++ mainly focuses on hierarchical local regions for feature extraction without considering non-local regions such as superpoints.
III-D Computational complexity analysis
In Eq. 2, the non-local feature of each point is computed as a weighted sum of the responses of all points in the point cloud. Ignoring the feature channels, the computational complexity is $O(N^2)$ for $N$ points. For our cascaded non-local module, the non-local operation is performed at three levels and the number of points participating in each operation is far smaller than $N$. Specifically, at the neighborhood level, the number of subsampled centroid points is $M$ and each centroid point has $k$ neighboring points. At the superpoint level, for each centroid point, the non-local operation is performed on $T$ centroid points in the same superpoint. At the global level, $K$ superpoints are involved in the non-local operation for each centroid point. Therefore, the final computational complexity of our hierarchical non-local operation is:

$O(M \times k + M \times T + M \times K) = O(M(k + T + K))$

where $k$, $T$ and $K$ are far smaller than $N$, and $M < N$. Thus, we can significantly reduce the computational complexity of the non-local operation on point clouds.
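A quick back-of-the-envelope comparison of the two costs, using illustrative values for $M$, $k$, $T$ and $K$ (these are not the paper's actual settings):

```python
# Multiplication counts (channel dimension ignored), illustrative values only.
N = 4096                        # points in the scene
M, k, T, K = 1024, 16, 32, 64   # centroids, neighbours, sampled centroids, superpoints

original = N * N                # full non-local: every point attends to every point
cascaded = M * (k + T + K)      # three cascaded levels, each over a small set

# with these numbers the cascaded module is over two orders of magnitude cheaper
ratio = original / cascaded
```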
IV Experiments
In this section, we evaluate our proposed model on indoor and outdoor datasets.
IV-A Implementation details
To train our model, we use the SGD optimizer, where the base learning rate and mini-batch size are set to 0.05 and 16, respectively. The momentum is set to 0.9 and the weight decay to 0.0001. The approach adopted for superpoint partition follows [14]. For the S3DIS dataset, we train for 100 epochs and decay the learning rate by 0.1 every 25 epochs. For the ScanNet dataset, we train the network for 200 epochs and decay the learning rate by 0.1 every 50 epochs. For the vKITTI dataset, we train the network for 100 epochs and decay the learning rate by 0.1 every 25 epochs. For the outdoor dataset SemanticKITTI, the learning rate is 0.003 and the weight decay is 0.0001. The scenes are partitioned into blocks which serve as superpoints. Note that our method does not need to build the superpoint graphs.
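The step-decay schedules above can be expressed as a small helper (the function name is ours, not from the paper):

```python
def lr_at_epoch(base_lr, epoch, decay_every, gamma=0.1):
    """Step-decay schedule: multiply the learning rate by `gamma`
    every `decay_every` epochs."""
    return base_lr * gamma ** (epoch // decay_every)

# S3DIS: base LR 0.05, decayed by 0.1 every 25 epochs over 100 epochs
schedule = [lr_at_epoch(0.05, e, 25) for e in (0, 25, 50, 75)]
```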
IV-B Semantic segmentation on datasets
S3DIS. S3DIS is an indoor 3D point cloud dataset. The point clouds are split into six large-scale areas, where each point is annotated with one of 13 semantic categories. Three metrics are used to quantitatively evaluate our method: mean of per-class intersection over union (mIoU), mean of per-class accuracy (mAcc), and overall accuracy (OA). For evaluation, we follow [17, 14] to test the model on Area 5 and with 6-fold cross validation. For training, we uniformly split the point clouds into blocks with an area size of 1m × 1m and randomly sample 4096 points in each block. During testing, we adopt all the points for evaluation.
The quantitative results with 6-fold cross validation on S3DIS are shown in Tab. I. Our model, named PointNL, achieves better than or comparable performance to the other methods listed in Tab. I, benefiting from the rich features provided by the non-local module. It is noted that although the additional hand-crafted geometric features used in [14, 13] are not utilized in our model, our PointNL can still achieve comparable results. The visual results are shown in Fig. 4. It can be seen that our proposed method obtains more accurate segmentation results on the S3DIS dataset.
ScanNet. ScanNet is an indoor scene dataset, which contains 1513 point clouds whose points are annotated with 20 categories. Following [17, 27], the dataset is split into 1201 scenes for training and 312 scenes for testing. During training, the points are divided into blocks of size 1.5m × 1.5m, and each block consists of 8192 points sampled on-the-fly. For testing, all the points in the test set are used for evaluation, and the stride between adjacent blocks is set to 0.5. The overall semantic voxel labeling accuracy (OA) is used to evaluate the compared segmentation methods.
We list the quantitative results on the validation set in Tab. I. From this table, one can see that our model outperforms the other compared methods. Due to the hierarchical non-local module, which captures the long-range context information of point clouds, our model obtains good segmentation results.
vKITTI. We also evaluate our method on the vKITTI dataset, which mimics the real-world KITTI dataset. There are 5 sequences of synthetic outdoor scenes, annotated with 13 classes (including road, tree, terrain, car, etc.). The 3D point clouds are obtained by projecting the 2D depth images into 3D coordinates. For evaluation, we split the dataset into 6 non-overlapping sub-sequences and conduct 6-fold cross validation. During training, we split the point cloud into 4m × 4m blocks and randomly sample 4096 points in each block. For evaluation, the mean of per-class intersection over union (mIoU), mean of per-class accuracy (mAcc) and overall accuracy (OA) are adopted.
The quantitative results are listed in Tab. I. From this table, one can see that our proposed PointNL yields better performance on this dataset with a significant gain in terms of mIoU. The visual results are shown in Fig. 3, which further demonstrates the effectiveness of our method.
SemanticKITTI. SemanticKITTI is a large-scale outdoor 3D dataset with 43552 LiDAR scans. Each scan contains a large number of points and covers a wide area in 3D space. The dataset is split into 22 sequences: sequences 00-07 and 09-10 (19130 scans) are used for training, sequence 08 (4071 scans) for validation, and sequences 11-21 (20351 scans) for online testing. The point clouds are annotated with 19 classes without color information. For evaluation, the mean of per-class intersection over union (mIoU) is adopted.
On SemanticKITTI dataset, PointNet , PointNet++ , DarkNet21Seg , DarkNet53Seg  can obtain the mIoUs of 14.6%, 20.1%, 47.4% and 49.9%, respectively, while our PointNL achieves the mIoU of 52.2%. It can be seen that our model outperforms the others. In addition, Fig. 4 visualizes the segmentation results on the SemanticKITTI dataset with our proposed PointNL.
IV-C Ablation study
To better demonstrate the effectiveness of our proposed method, we conduct ablation studies on Area 5 of the S3DIS dataset to analyze the effects of the non-local operations at different levels. Tab. II shows the segmentation accuracies with the non-local operation at different levels. The neighborhood-level non-local operation alone obtains an mIoU of 60.03% and is competitive with many recent point cloud segmentation approaches. When we further add the non-local operation at the superpoint level, the mIoU is improved by 2.15%. When we cascade the non-local operations at all three levels, the mIoU is further improved by 1.32%. In addition, we also implement the original non-local operation, which is directly applied to all the points. In Tab. II, the mIoU of the original non-local operation (baseline) is only 54.76%, which is lower than that of PointNL. This implies that our hierarchical non-local model can effectively build semantic long-range dependencies between points to obtain discriminative local features of point clouds.
IV-D Computational cost
In terms of training time and GPU memory, we compare our method with PointWeb and the original non-local operation (baseline) on the PyTorch platform. For a fair comparison, on Area 5 of the S3DIS dataset, all models are run on Tesla P40 GPUs for 100 epochs. The computational cost, GPU memory occupancy and segmentation accuracy of the compared methods are shown in Tab. II.
PointWeb needs at least 4.2 days and 24GB of GPU memory, and obtains an mIoU of 60.28%. The original non-local operation (baseline) costs about 6 days and 60GB of GPU memory, and obtains an mIoU of 54.76%. In contrast, our method costs about 1.8 days and 8GB of GPU memory, and achieves an mIoU of 63.50%.
The performance, computational cost and the GPU memory occupancy of the corresponding methods are shown in Tab. II. The total number of parameters of our model is 3.5M. For testing, the inference time of each batch is 0.35s, while PointWeb requires 1.5s. Therefore, our method not only improves the performance of semantic segmentation but also greatly reduces the time cost and memory consumption.
V Conclusion
We proposed a novel cascaded non-local neural network for point cloud semantic segmentation. In our network, we developed a new cascaded non-local module to capture the neighborhood-level, superpoint-level and global-level geometric structures of point clouds. By stacking the cascaded non-local modules, semantic context information of point clouds is propagated level by level, so that the discriminativeness of the local features of point clouds can be boosted. Experimental results on benchmark point cloud segmentation datasets demonstrate the effectiveness of our proposed PointNL in terms of both segmentation accuracy and computational cost.
- (2019) SalsaNet: fast road and vehicle segmentation in lidar point clouds for autonomous driving. arXiv preprint arXiv:1909.08291. Cited by: §II-A.
- (2016) 3D semantic parsing of large-scale indoor spaces. In CVPR. Cited by: §IV-A, §IV-B.
- (2019) SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. In ICCV. Cited by: §IV-A, §IV-B, §IV-B.
- (2019) Generalizing discrete convolutions for unstructured point clouds. arXiv preprint arXiv:1904.02375. Cited by: TABLE I.
- (2005) A non-local algorithm for image denoising. In CVPR. Cited by: §II-B.
- (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In CVPR. Cited by: §IV-A, §IV-B.
- (2017) Exploring spatial context for 3D semantic segmentation of point clouds. In ICCV. Cited by: §I, TABLE I.
- (2018) Know what your neighbors do: 3D semantic segmentation of point clouds. In ECCV. Cited by: TABLE I.
- (2016) Virtual worlds as proxy for multi-object tracking analysis. In CVPR. Cited by: §IV-A, §IV-B.
- (2019) CCNet: criss-cross attention for semantic segmentation. In ICCV. Cited by: §II-B, §III-A.
- (2019) Hierarchical point-edge interaction network for point cloud semantic segmentation. In ICCV. Cited by: TABLE I.
- (2014) Dynamic network energy management via proximal message passing. In CVPR. Cited by: §II-B, §III-A.
- (2019) Point cloud oversegmentation with graph-structured deep metric learning. arXiv preprint arXiv:1904.02113. Cited by: §II-A, TABLE I, §IV-B.
- (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In CVPR. Cited by: §II-A, TABLE I, §IV-A, §IV-B, §IV-B.
- (2018) PointCNN: convolution on X-transformed points. In NeurIPS. Cited by: §I, TABLE I.
- (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In CVPR. Cited by: §I, §II-A, TABLE I, §IV-B.
- (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS. Cited by: §I, §II-A, TABLE I, §IV-B, §IV-B, §IV-B.
- (2019) LatticeNet: fast point cloud segmentation using permutohedral lattices. arXiv preprint arXiv:1912.05905. Cited by: §II-A.
- (2019) Graph attention convolution for point cloud semantic segmentation. In CVPR. Cited by: §II-A.
- (2018) Non-local neural networks. In CVPR. Cited by: §II-B, §III-A.
- (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG). Cited by: §I, §II-A.
- (2020) SqueezeSegV3: spatially-adaptive convolution for efficient point-cloud segmentation. arXiv preprint arXiv:2004.01803. Cited by: §II-A.
- (2018) 3D recurrent neural networks with context fusion for point cloud semantic segmentation. In ECCV. Cited by: §I, TABLE I, §IV-B.
- (2019) LatentGNN: learning efficient non-local relations for visual recognition. arXiv preprint arXiv:1905.11634. Cited by: §II-B.
- (2020) PolarNet: an improved grid representation for online LiDAR point clouds semantic segmentation. In CVPR. Cited by: §II-A.
- (2019) ShellNet: efficient point cloud convolutional neural networks using concentric shells statistics. In ICCV. Cited by: TABLE I.
- (2019) PointWeb: enhancing local neighborhood features for point cloud processing. In CVPR. Cited by: §I, §II-A, TABLE I, §IV-B, §IV-D.
- (2019) Asymmetric non-local neural networks for semantic segmentation. arXiv preprint arXiv:1908.07678. Cited by: §II-B, §III-A.