Owing to their directness and robustness in acquiring 3D information, light detection and ranging (LiDAR) sensors are increasingly deployed on intelligent agents such as unmanned ground vehicles (UGVs) and unmanned aerial vehicles (UAVs) to perform localization, obstacle detection, exploration, etc. Consequently, efficient and effective understanding of large-scale 3D LiDAR point clouds is of great importance for machine perception, as it bridges the gap between 3D points and high-level information, whether structural, semantic, or both [44, 6, 56, 12]. However, due to electrical and mechanical disturbances and the reflectance properties of targets, point clouds often suffer from noise and outliers. Moreover, compared with 2D raster images, the topological relationship between objects in 3D point clouds is much weaker, which makes segmentation and understanding considerably more challenging. Therefore, autonomous understanding of large-scale point clouds remains an open problem, especially when both accuracy and efficiency are taken into account.
As with high-level tasks in the 2D image domain such as object detection, segmentation, and classification, point clouds understanding methods can be divided into traditional methods and deep learning based methods. Among the traditional methods, the representative histogram-based methods encode the k-Nearest-Neighbor (kNN) geometric features of a 3D point by calculating the multidimensional average curvature of its surroundings to describe local geometric variations. Signature-based and transform-based methods rely on handcrafted feature descriptions of point clouds for semantic understanding. However, the performance of these methods has only been demonstrated in well-controlled conditions under ideal assumptions such as noise-free and homogeneous environments. Deep learning based point clouds processing methods, on the other hand, have shown promising results in recent years. The mainstream methods can be roughly divided into three categories: projection-based [30, 47, 50, 8, 17, 22, 10], voxel-based [11, 5, 41, 29, 54], and direct point-based [18, 44, 32, 56, 34, 52, 26, 33, 38, 42, 20, 27, 48]. Representative works of each category are discussed in Section II, and their limitations are summarized here. First, the sampling operation of almost all these methods has high computational cost and memory consumption. For instance, the widely applied farthest point sampling (FPS) takes more than 1000 seconds to subsample points to about points. Furthermore, the subsequent perception networks usually rely on expensive operations such as voxelization or graph construction. Second, nearly all existing methods are designed for small-scale point clouds and do not consider the noise and outliers that are inevitable in practice. Moreover, large-scale point clouds typically suffer from severe class imbalance across semantic categories, and the points obtained by LiDAR in complex dynamic environments are often irregular, unordered, and carry semantic information that is distributed over long distances.
For example, in autonomous driving scenarios, typical objects either exhibit diverse geometric shapes and sizes (e.g., cyclists and persons) or are distributed non-uniformly across a long spatial range (e.g., roads, buildings, and vegetation). However, to the best of our knowledge, existing methods can hardly capture the complex geometry and the latent feature correlations in large-scale point clouds effectively.
To overcome the aforementioned challenges, we propose a general deep learning framework named FG-Net for large-scale point clouds understanding. We adopt deformable convolution for modelling the geometric structure, and pointwise attentional aggregation for mining the correlated features among point clouds. The deformable convolutional modelling adapts to the local geometry of objects through deformed kernels, while the correlated feature mining adaptively captures contextual information in spatial locations and semantic features that is distributed across a long spatial range. The modules in our network can be implemented with simple pointwise matrix multiplications and additions, which are easily parallelized on a GPU for acceleration. As shown in Fig. 1, our method outperforms state-of-the-art ones in terms of both accuracy and efficiency, making real-time perception on large-scale point clouds achievable. In summary, our work makes the following contributions:
We propose pointwise correlated feature mining and geometric-aware modelling modules for large-scale point clouds understanding. Furthermore, we interpret the effectiveness of our network by visualizing the complementary features captured by its modules.
We propose a feature pyramid based residual learning architecture to leverage patterns at different resolutions in a memory-efficient way. Extensive experiments on challenging real-world datasets demonstrate that our approach outperforms state-of-the-art ones in terms of accuracy and efficiency.
We propose a novel fast noise and outliers removal method and a points down-sampling strategy for large-scale point clouds, which together enhance the performance and improve the efficiency of semantic understanding tasks in large-scale scenes.
The paper is organized as follows: Section II reviews existing methods of point clouds understanding. Section III describes our proposed framework in detail, together with the optimization function formulation and training details, and presents the point clouds filtering and sampling methods that are specially designed to speed up our framework on large-scale point clouds. Section IV gives the results of our experiments and ablation studies. Section V concludes our work.
II Related Works
Deep learning techniques in the 2D image domain have been investigated extensively and have resulted in impressive performance [53, 51, 16, 4, 25]. Naturally, such techniques have been exploited for point clouds processing and understanding, and the published works can be roughly categorized into voxel-based [11, 5, 41, 29, 54], projection-based [30, 47, 50, 8, 17, 22, 10], and point-based methods [18, 44, 32, 56, 34, 52, 26, 33, 38, 42, 20, 27, 48]. The voxel-based and projection-based methods transform point clouds into different representations, while the point-based methods process point clouds directly. These methods are mainly designed for and tested on relatively small-scale point clouds of less than points with block partitioning. Directly extending them to large-scale point clouds results in prohibitively expensive computational costs. Here we discuss the advantages and shortcomings of these methods, and the rationale that motivates our modifications and improvements.
II-A Voxel-based and Projection-based Methods for Point Clouds Understanding
The most recent and typical voxel-based methods are SparseConv and Minkowski CNN. Voxel-based methods use 3D convolutions, which are intuitive extensions of their 2D counterparts. Their advantage is that spatial relationships are well preserved at high voxel resolutions, but they are computationally expensive: the computation cost and memory consumption of voxel-based models increase cubically with the resolution of the input point clouds. Conversely, the loss of geometric information becomes significant if the voxelization resolution is decreased. In addition, voxel-based methods rely on aggressive down-sampling to achieve larger receptive fields, resulting in even lower resolution in deeper network layers. Hence, it is hard to achieve real-time performance while balancing accuracy and computational cost. Projection is also used to map point clouds into range images [30, 47, 50] or multi-view images [8, 17, 22, 10] to facilitate the use of 2D CNNs. However, such projection inevitably leads to a loss of geometric information. When dealing with large-scale point clouds in practice, the drawbacks of voxel-based and projection-based methods become even more prohibitive.
II-B Point-based Methods for Point Clouds Understanding
PointNet is the pioneering work that extracts pointwise features directly using a shared multi-layer perceptron (MLP). It is invariant to the ordering of the points because it uses max-pooling operations. PointNet++ extracts local features using PointNet and considers local geometric relationships through hierarchical grouping and abstraction as well as multi-scale and multi-resolution grouping. More point-based methods [23, 32, 18] have been proposed recently with complicated network designs to aggregate local features. However, none of these methods can model the intrinsic geometric structures of points or effectively capture the non-local distributed contextual correlations in spatial locations and semantic features. There is also a series of new explorations on how to implement convolution on point clouds. The methods [38, 42, 20] focus on learning kernels that better capture the local geometry of points. However, the proposed convolutional kernels are too complicated to be directly applied in deep neural networks for large-scale point clouds understanding. Motivated by the challenges above, we propose a novel lightweight point-based method that considers distributed long-range dependencies and learns kernels to capture the local structures of point clouds.
II-C Efficient Large-Scale Point Clouds Understanding
Only recently has more attention been paid to efficient large-scale point clouds understanding. Previously, block partitioning was used to divide large-scale point clouds into small sub-blocks before feeding them to networks. However, such partitioning is time-consuming and damages the spatial geometric context among the objects of a large-scale scene. Although several attempts have been made at large-scale point clouds segmentation, some major problems remain. Firstly, the farthest point sampling (FPS) adopted by most previous methods has a large computational cost, which increases quadratically with the number of input points. Secondly, with block partitioning the semantics of a large-scale point cloud cannot be inferred within one scan, which limits the volume of point clouds that can be processed. Some methods also try to combine voxel-wise features with pointwise features to improve performance. Analogous to the super-pixel concept in the 2D image domain, the super-point method has been introduced to apply graph convolutions on large-scale points. However, due to the high computational cost of voxelization or graph construction, such methods can hardly achieve real-time performance.
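To make the quadratic growth of FPS concrete, the following minimal sketch shows a naive FPS loop in pure Python. It is an illustration under stated assumptions (brute-force distance updates, a fixed start index `start`, the function name `farthest_point_sampling`), not the implementation used by any of the cited methods.

```python
import math

def farthest_point_sampling(points, m, start=0):
    """Naive FPS: pick m indices, each time taking the point farthest from the chosen set."""
    n = len(points)
    min_dist = [math.inf] * n          # distance from every point to the chosen set
    selected = [start]
    for _ in range(m - 1):
        last = points[selected[-1]]
        best_i, best_d = -1, -1.0
        for i in range(n):             # full scan of all n points in every round
            d = math.dist(points[i], last)
            if d < min_dist[i]:
                min_dist[i] = d
            if min_dist[i] > best_d:
                best_i, best_d = i, min_dist[i]
        selected.append(best_i)
    return selected

pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
print(farthest_point_sampling(pts, 3))
```

Every round scans all n points, so keeping a fixed fraction of the input requires on the order of n² distance evaluations, which is why FPS becomes a bottleneck on scans with millions of points.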
III Proposed Methodology
In this section, a fast deep learning method leveraging correlated feature mining and geometric-aware modelling is proposed for large-scale point clouds understanding. As illustrated in Fig. 2, our FG-Net takes raw point clouds of a large-scale complex scene as input and gives the predictions of object classification and semantic segmentation simultaneously. The details are given as follows.
III-A Noise and Outliers Filtering
The point clouds obtained by LiDAR sensors contain noise and outliers that are harmful to the subsequent high-level processing, so we propose a novel filtering method for pre-processing. As shown in Algorithm 1, the set of input point clouds is defined as . A radius based nearest neighbour ball query is conducted to ensure robustness to density variations of the point clouds during sampling. Given a query point , we define the neighbouring points within radius as . The number of points is obtained, which can be regarded as an estimate of the point density within the radius and is reused in the subsequent inverse density based sampling (IDS). If ( is a threshold depending on the density of points acquired by LiDAR), we regard the point as an isolated point and remove it from . Next, the distances between points are modelled as a Gaussian distribution, and points are removed if their mean distance falls outside the confidence interval of that distribution. The mean of the distances between points can be computed as follows:
The noise and outliers can be removed at a speed of 0.61 s per million points. This simple but effective method removes noise and outliers from the point clouds and at the same time enhances the performance of point clouds understanding. The implementation details for accelerating our framework are introduced in Subsection III-C.
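The two filtering stages described above can be sketched as follows. This is a minimal pure-Python illustration, not Algorithm 1 itself: the brute-force ball query, the function name `filter_noise`, the parameters `min_neighbours` and `n_sigma`, and the choice to apply the Gaussian test to each point's mean neighbour distance are all assumptions made for the example.

```python
import math

def filter_noise(points, radius, min_neighbours, n_sigma=3.0):
    """Return (kept indices, neighbour counts) after a density test and a Gaussian test."""
    n = len(points)
    # Stage 1: brute-force ball query; the count is the density estimate reused later.
    neigh = [[j for j in range(n)
              if j != i and math.dist(points[i], points[j]) <= radius]
             for i in range(n)]
    density = [len(nb) for nb in neigh]
    dense = [i for i in range(n) if density[i] >= max(1, min_neighbours)]
    if not dense:
        return [], density
    # Stage 2: mean neighbour distance per point, modelled as Gaussian over the cloud.
    mean_d = {i: sum(math.dist(points[i], points[j]) for j in neigh[i]) / density[i]
              for i in dense}
    mu = sum(mean_d.values()) / len(mean_d)
    sigma = math.sqrt(sum((d - mu) ** 2 for d in mean_d.values()) / len(mean_d))
    kept = [i for i in dense if abs(mean_d[i] - mu) <= n_sigma * sigma + 1e-12]
    return kept, density

grid = [(float(x), float(y), 0.0) for x in range(3) for y in range(3)]
cloud = grid + [(100.0, 100.0, 100.0)]          # one isolated outlier
kept, density = filter_noise(cloud, radius=1.5, min_neighbours=2)
```

The returned `density` list is the per-point neighbour count, which is the quantity the text says is reused by the later inverse density based sampling. A practical implementation would replace the quadratic ball query with a spatial index.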
III-B Proposed Network Architecture for Large-Scale Point Clouds Understanding
We design the network module FG-Conv to capture the feature correlations and model the local geometry of point clouds simultaneously. Leveraging feature pyramid based residual learning framework, FG-Conv can be integrated into deep network FG-Net as the core module for large-scale point clouds understanding. As shown in Fig. 3, the large-scale point clouds can be processed in parallel by our novel design leveraging feature-level correlation mining and geometric convolutional modelling. The core network module FG-Conv includes 3 components: pointwise correlated features mining (PFM), geometric convolutional modelling (GCM), and attentional aggregation (AG), which are detailed as follows:
III-B1 Pointwise correlated features mining
The point clouds after filtering are represented as x-y-z coordinates with per-point features. The input features can consist of raw RGB, surface normal information, intensity of point clouds, and even learnt latent features. Denote the input point clouds as the matrix , where is the number of points and is the dimension of input features respectively. The i-th vector in P can be denoted as where . For the point , denote the k-th point vector in the spherical neighbourhood } as , in which is the radius of the neighbourhood. The similarity score of and is calculated as:
which is the inner product of and . It gives a good evaluation of the similarity of neighbouring points in spatial locations and features. For each of the K neighbouring points, can be calculated, and together they constitute a similarity score vector . However, these similarity scores are not tied to any specific task such as classification or segmentation. Thus, an attentional technique is introduced to make the new similarity score adaptive to the specific task through training of the deep network. It is calculated as:
where is the weight matrix to be learnt and is the softmax function that normalizes the attentional weights. All constitute the matrix . Then each element of is multiplied with the corresponding row of through element-to-row multiplication to obtain the augmented attentional feature matrix . From now on, the augmented features (e.g., ), which encode both geometry and feature correlations, are simply called features for the sake of brevity. Next, is concatenated with the corresponding input feature to obtain the enhanced feature . In this way, the local contextual relationship is captured, and the similarity of features is enhanced adaptively and selectively by the learnable attentive weighting: similar feature elements in the latent space are enhanced while distinct ones are attenuated.
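As a rough illustration of the steps above (inner-product similarity, learnable re-weighting, softmax normalization, element-to-row multiplication, concatenation), consider the following pure-Python sketch. The function name `pfm`, the K x K shape of the weight matrix `W`, and the way `W` is applied to the score vector are assumptions for the example, since the exact form of the learnt re-weighting is not reproduced here.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pfm(center, neighbours, W):
    """center: length-D list; neighbours: K lists of length D; W: K x K weights."""
    scores = [dot(center, nb) for nb in neighbours]     # inner-product similarity
    logits = [dot(row, scores) for row in W]            # learnable re-weighting (assumed form)
    attn = softmax(logits)                              # normalised attentional weights
    weighted = [[a * x for x in nb] for a, nb in zip(attn, neighbours)]
    enhanced = [w + nb for w, nb in zip(weighted, neighbours)]  # concatenate to length 2D
    return attn, enhanced

identity = [[1.0, 0.0], [0.0, 1.0]]
attn, enhanced = pfm([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], identity)
```

In this toy call the first neighbour is identical to the center, so it receives the larger attention weight, and each enhanced row holds the re-weighted feature followed by the original one.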
III-B2 Geometric convolutional modelling
After the pointwise correlated features mining, the local correlated features are largely captured, but the geometric structure of the points is not yet sufficiently modelled. Inspired by the great success of deformable convolution in image recognition, we extend deformable convolutions from images to point clouds to model irregular and unordered 3D structures. Similar to 2D deformable convolutions, the deformable 3D kernels in Euclidean space are a set of learnable points that conform to the local structures of the point clouds, so that the dominant local geometric shapes are activated by the corresponding kernels in the neighbourhood. Note that the kernel deformations adapt to the local geometry of the points in a learnable way through elaborately designed optimization functions. As shown in the bottom right of Fig. 3, as in convolutional neural networks for image processing, the convolution on points is defined as:
The core problem is that point clouds are unstructured and unordered, which makes it difficult for the point convolutional kernel function to learn representative local geometric patterns. We design the correlation function to measure the correspondence between kernel points and local geometry. More specifically, the closer a kernel point is to an input point, the higher the correlation value assigned. Denote the difference between and as . The pseudo kernel points centered at () are designed so as to imitate the convolutional kernels in image processing, (). The relative coordinates of pseudo kernel points and the center point are given as: . We set the correlation function as the Gaussian function formulated as:
where is the number of kernel points, is a constant, and is the parameter determining the influence distance of kernel points. Then the kernel function can be given as the sum of all relations with learnable weights as shown in Equation 7:
where is the weight matrix of MLP layers, and , are input and output channel numbers respectively. During the optimization process, the kernel points are forced to adapt to the dominant structures in the local point clouds. Finally, the feature after deformable convolution can be obtained. In this way, the local geometric structures are well captured by convolutional kernels and the dominant structural features are enhanced.
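A kernel point convolution with a Gaussian correlation can be sketched as below. This is a hedged pure-Python illustration: the names `gaussian_corr` and `kernel_point_conv`, the per-kernel-point weight matrices in `weights`, and the plain `exp(-d²/(2σ²))` form (without the additional constant mentioned in the text) are assumptions. Deformation would correspond to adding learnt offsets to `kernel_pts` before the call.

```python
import math

def gaussian_corr(offset, kernel_pt, sigma):
    """Correlation that is largest when the neighbour offset coincides with the kernel point."""
    d2 = sum((a - b) ** 2 for a, b in zip(offset, kernel_pt))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def kernel_point_conv(center, neigh_xyz, neigh_feat, kernel_pts, weights, sigma):
    """weights[k] is a D_in x D_out matrix attached to kernel point k."""
    d_out = len(weights[0][0])
    out = [0.0] * d_out
    for xyz, feat in zip(neigh_xyz, neigh_feat):
        offset = [a - b for a, b in zip(xyz, center)]   # neighbour relative to the center
        for k_pt, W in zip(kernel_pts, weights):
            h = gaussian_corr(offset, k_pt, sigma)      # closer kernel point gives larger h
            for i, f in enumerate(feat):
                for o in range(d_out):
                    out[o] += h * f * W[i][o]
    return out

out = kernel_point_conv((0.0, 0.0, 0.0), [(1.0, 0.0, 0.0)], [[2.0]],
                        [(1.0, 0.0, 0.0)], [[[3.0]]], sigma=0.5)
```

In the toy call the single neighbour sits exactly on the single kernel point, so the correlation is 1 and the output is simply the feature times the weight.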
III-B3 Attentional aggregation
The attention mechanism is utilized to leverage the feature level and geometric level patterns without large information loss. As shown in Fig. 3, the integrated neighbouring features can be represented as . Then the attentive score for aggregation is defined as , which will adaptively learn the importance score of each feature. The weighted attentional feature can be given as:
The summed feature can be given as: . Then the element-wise multiplication between and is utilized to obtain the learnt feature . The final feature is the sum of the original feature and the learnt feature: . We apply an MLP layer to flexibly control the dimension of the output vector and to produce a meaningful aggregated feature containing both local correlated features and enhanced local geometry.
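The aggregation step can be illustrated with a small attentive pooling sketch. The scoring function used here (a shared D x D matrix `W` followed by a softmax over the K neighbours, per channel) and the names `attentive_pool` and `aggregate` are assumptions for the example; the final MLP that sets the output dimension is omitted.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def attentive_pool(feats, W):
    """feats: K x D neighbour features; W: D x D shared scoring weights (assumed form)."""
    K, D = len(feats), len(feats[0])
    logits = [[sum(f[i] * W[i][d] for i in range(D)) for d in range(D)] for f in feats]
    pooled = []
    for d in range(D):
        a = softmax([logits[k][d] for k in range(K)])   # normalise over the K neighbours
        pooled.append(sum(a[k] * feats[k][d] for k in range(K)))
    return pooled

def aggregate(original, feats, W):
    """Residual sum of the original feature and the attention-pooled feature."""
    pooled = attentive_pool(feats, W)
    return [o + p for o, p in zip(original, pooled)]

zeros = [[0.0, 0.0], [0.0, 0.0]]
result = aggregate([1.0, 1.0], [[1.0, 2.0], [3.0, 4.0]], zeros)
```

With all-zero scoring weights the attention is uniform, so the pooled feature reduces to the neighbour mean; trained weights would shift the mass towards the more informative neighbours.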
III-B4 Feature pyramid hierarchical residual architecture
The ResNet based architecture has achieved great success in image recognition. Motivated by the residual learning paradigm, we propose a general deep network that is specially designed for classification and semantic segmentation of large-scale point clouds. To our knowledge, only recently have several methods started to use residual learning for point clouds recognition, and their attempts are limited to small-scale point clouds, so the fitting capacity of the residual architecture has not been fully demonstrated. We propose a feature pyramid based multi-scale fusion strategy for adaptively aggregating features from different layers of the network. Leveraging the deep residual structure, memory-efficient deep networks can be built.
As shown in Fig. 4, the encoder-decoder based network structure can be utilized to obtain point clouds at multiple resolutions. In image processing, the networks are supposed to extract large feature maps for small objects and small feature maps for big objects. It should be noted that the scale variation found in images does not exist in 3D point clouds: unlike in images, where the scale of objects varies with distance, the scale of objects in point clouds stays constant. Hence deconvolution by interpolation must be conducted to recover the points to the original resolution. As shown in Subfigure (a) of Fig. 4, in the residual learning block (RLB), denote the input dimension and output dimension of RLB as and respectively. Unlike some deep architectures which are memory consuming, we reduce the feature dimension in the original residual learning block to ( is adopted in our framework) by convolution before feeding the features into the FG-Conv module, which reduces the parameters by a factor of 9.6. The accuracy of classification and segmentation is nevertheless maintained through residual learning, as shown in the experiments. Another convolution is applied to recover the feature dimension. At the block connecting two stages, convolution should be applied in the skip link to increase the feature dimension. Then the global feature extraction in Subsubsection III-B5 is conducted to obtain the latent global features from , which can be directly utilized for classification predictions. The point clouds are all upsampled after the ( in our case) convolutional blocks. Unlike previous methods, which directly use the upsampled features for segmentation, we propose to fuse the predictions at different resolutions and use the supervised loss to guide the training process. This hierarchical structure turns out to give better results for pointwise large-scale point clouds segmentation.
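The interpolation step that recovers the original resolution can be sketched with nearest-neighbour feature propagation. The text does not state which interpolation scheme is used, so nearest-neighbour copying, the brute-force search, and the name `upsample_nearest` are assumptions made purely for illustration.

```python
import math

def upsample_nearest(coarse_xyz, coarse_feat, dense_xyz):
    """Copy to every dense point the feature of its nearest coarse point."""
    out = []
    for p in dense_xyz:
        j = min(range(len(coarse_xyz)), key=lambda k: math.dist(p, coarse_xyz[k]))
        out.append(list(coarse_feat[j]))
    return out

coarse_xyz = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
coarse_feat = [[1.0], [2.0]]
dense_xyz = [(1.0, 0.0, 0.0), (9.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
dense_feat = upsample_nearest(coarse_xyz, coarse_feat, dense_xyz)
```

Applying such a step once per decoder stage yields features (and hence predictions) at each resolution, which is what makes the multi-resolution fusion described above possible.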
III-B5 Point clouds global feature extraction
The global and long-range dependencies in point clouds should also be captured before doing upsampling and giving the pointwise predictions. Due to the limited receptive field of the neural layer mentioned above, the global contextual semantic patterns can not be fully obtained. We adopt the self-attentional module shown in Fig. 5 to selectively enhance the closely relevant elements in the global feature . After the global relationship mining by this module, both the local and global relationships in features and geometry will be captured adaptively. Then the feature representation with combined local or non-local semantic contextual correlations will be adaptively obtained to facilitate the subsequent recognition task. Given the original local feature map , in our case) as shown in Fig. 4 and Fig. 5, the convolution with weight in our case) is used to transform the feature map into latent representations and for further obtaining the similarity of each two elements in . After and are obtained, the dot product between them can be conducted to obtain the relevant score matrix which is given as:
Each element in gives the relevance score between the representation and . Then the softmax is applied to normalize the latent attentional scores to obtain the final self-relation weights of the latent representation . Each element of can be represented as:
The attention weights reveal the correlations among all local and non-local features. Closely related distributed features, even in non-local regions, can be effectively captured, and larger attention weights are assigned in to enhance their similar semantic contexts. Finally, the attention scores are applied to all elements in to produce the global attentional vector :
and the consolidated feature is the sum of and , which is given as: .
Ultimately, the global contextual representation is fused with the local aggregated representation for a comprehensive encoding of local and non-local correlated features. The predictions of classification can be directly obtained from the aggregated latent features and the segmentation results can also be learnt by up-sampling.
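The global attention described in this subsection follows the usual self-attention pattern: two latent projections, a dot-product relevance matrix, a row-wise softmax, and a residual sum. The sketch below illustrates that pattern in pure Python; the names `global_attention`, `Wq`, and `Wk` and the plain matrix products standing in for the convolutions are assumptions.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def global_attention(F, Wq, Wk):
    """F: N x C feature map; Wq, Wk: C x C' projection weights."""
    Q, K = matmul(F, Wq), matmul(F, Wk)                  # two latent representations
    R = [[sum(q * k for q, k in zip(qi, kj)) for kj in K] for qi in Q]  # relevance scores
    A = [softmax(row) for row in R]                      # each row sums to one
    G = matmul(A, F)                                     # attention-weighted global feature
    return [[f + g for f, g in zip(fr, gr)] for fr, gr in zip(F, G)]   # residual sum

F = [[1.0, 0.0], [3.0, 2.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
fused = global_attention(F, zeros, eye)
```

The relevance matrix has N x N entries, so this kind of module is only practical once the points have been down-sampled to a moderate number.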
III-B6 Optimization function formulation and data augmentation
As mentioned in Subsubsection 2, denote the relative coordinates of kernel points as and the learnt deformation as . The losses utilized for the deformable convolution are designed as:
which is utilized to match the kernel positions with local geometries of point clouds.
which is the repulsive loss utilized to keep distance between different kernels.
which keeps the kernel points from diverging and keeps them inside the query ball. The kernel loss is the sum of the above three losses, i.e. . As shown in Fig. 4, the losses at different stages of the network are also summed, which can be formulated as the cross-entropy loss denoted as :
denotes the weight at stage of the residual network, W denotes the weight of the entire network, denotes the upsampled point clouds at stage , denotes the fused point clouds, and denote the segmentation ground truth of points at different stages and fused points respectively. And denotes the segmentation prediction of the networks. We also propose to use the contextual loss shown in Fig. 4 to predict the presence of objects or not in the scene to consider semantic contexts of the scene, which can be given as:
where indicates whether the object is present in the scene or not and is the classification prediction. This loss helps the network consider all the semantic categories appearing in the scene equally. The total loss of the network can be given as: . The kernel positions and network parameters are jointly optimized in an end-to-end manner.
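The three kernel regularizers described earlier in this subsection (fitting, repulsion, and staying inside the query ball) can be sketched as follows. The exact formulas are not reproduced here, so the squared-distance fitting term, the hinge-style repulsion with a hypothetical threshold `min_gap`, the hinge-style boundary term, and the name `kernel_losses` are all assumptions that only mirror the stated intent of each loss.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kernel_losses(kernel_pts, neigh_offsets, radius, min_gap):
    """Return (fitting, repulsive, boundary, total) for deformed kernel points."""
    # Fitting: pull every kernel point towards its closest neighbouring point.
    fit = sum(min(sq_dist(k, y) for y in neigh_offsets) for k in kernel_pts)
    # Repulsion: penalise pairs of kernel points that are closer than min_gap.
    rep = 0.0
    for a in range(len(kernel_pts)):
        for b in range(a + 1, len(kernel_pts)):
            gap = math.sqrt(sq_dist(kernel_pts[a], kernel_pts[b]))
            rep += max(0.0, min_gap - gap) ** 2
    # Boundary: penalise kernel points that leave the query ball of the given radius.
    origin = (0.0,) * len(kernel_pts[0])
    bound = sum(max(0.0, math.sqrt(sq_dist(k, origin)) - radius) ** 2 for k in kernel_pts)
    return fit, rep, bound, fit + rep + bound

losses = kernel_losses([(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
                       [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                       radius=2.0, min_gap=1.0)
```

In the toy call the second kernel point lies outside the ball and far from any neighbour, so it contributes to both the fitting and the boundary terms, while the two kernel points are far enough apart that the repulsion term is zero.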
III-C Implementation Details for Acceleration of our Framework
III-C1 IGSAM for fast learning-based sampling
Sampling methods play a significant role in processing point clouds with convolutional neural networks. Feeding all raw points into the network for segmentation is inefficient because of the large computational cost, so an effective sampling method is required when efficiency is taken into consideration. To tackle the large computational overhead of processing millions of points, we propose an efficient sampling method, IGSAM, which leverages the advantages of inverse density sampling (IDS) and Gumbel softmax sampling (GSS) to achieve fast and effective point clouds understanding. We design a novel learning based GSS which adaptively selects the significant points based on the optimization objectives. With IDS, point clouds can be sampled efficiently with density awareness while the information and features that are meaningful for point clouds understanding are maintained. The sampling method is implemented with multi-thread parallelization on the CPU for acceleration.
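Inverse density sampling can be illustrated in a few lines. The sketch below assumes the common formulation in which points are ordered by a per-point density estimate (for example the ball-query neighbour count from the filtering step) and the least dense ones are kept; the function name `inverse_density_sample` and the tie-breaking by index are assumptions.

```python
def inverse_density_sample(density, m):
    """Keep the m points with the lowest density estimate (ties broken by index)."""
    order = sorted(range(len(density)), key=lambda i: density[i])
    return sorted(order[:m])

density = [5, 1, 3, 9]           # e.g. neighbour counts from a radius ball query
kept = inverse_density_sample(density, 2)
```

Because sparse regions are favoured, thin or distant structures are less likely to vanish during down-sampling than with uniform random selection, at the cost of one sort over the density values.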
To achieve a learning based sampling operation, the Gumbel-Softmax trick is utilized to turn the non-differentiable sampling operation into a differentiable selection of features and coordinates by reparameterization. In this way, the points that matter most for the task can be selected in a learnable way. Given the point clouds , with its coordinates and features, the probability score of a point being sampled can be estimated by the MLP based convolution, which is formulated as:
Then the Gumbel noise can be drawn from the Gumbel distribution . The Gumbel softmax (GS) operation on P can then be used to compute the sampled vector , where is the time constant, and and are the i-th elements in s and g respectively:
Utilizing Equation 18, is differentiable with respect to ; thus, differentiable sampling from the original point distribution can be realized in the training process. It should be noted that should start from a high value of 1.0 and gradually anneal to a smaller value of 0.05 during training. When , GS degenerates to the Gumbel-Max (GM) selection, and the discrete individual point candidate can be hard selected utilizing GM in inference as shown in Equation 19:
In this way, we can facilitate differentiable sampling in training while maintaining a discrete and hard GM sampling in network inference; hence the sampling operation can be integrated into deep networks. To be more specific, given the input points denoted as , the GSS can be conducted as:
where is weights of MLP layers in Equation 17, and are the points to be selected.
During the inference of the network, the discrete sampling can be realized by substituting GS with GM, which is formulated as:
Note that the matrix will be relatively memory consuming when the quantity of point clouds is large, hence, we only adopt it in deeper layers of network when the points are down sampled to a certain quantity, which is less than .
In our IGSAM, the IDS is conducted in the first four stages of the network while the GSS is adopted in the fifth layer of the network. In this way, the density awareness can be maintained while significant points can be selected in a learnable way before conducting local aggregation and mining the global relationship by the non-local attentional module as introduced above, which enforces the sampling process to select the significant points for the specific understanding task.
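The relationship between the soft Gumbel-Softmax selection used in training and the hard Gumbel-Max selection used in inference can be sketched with the standard formulation below. The function names, the explicit `noise` argument (passed in so that both selections share the same perturbation), and the requirement that all probabilities in `probs` are strictly positive are assumptions of this illustration; the MLP that produces the scores is not shown.

```python
import math
import random

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def gumbel_noise(n, rng):
    """Draw n samples from Gumbel(0, 1) via the inverse-CDF transform."""
    return [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in range(n)]

def gumbel_softmax(probs, noise, tau):
    """Differentiable relaxation used in training; tau is the annealed constant."""
    return softmax([(math.log(p) + g) / tau for p, g in zip(probs, noise)])

def gumbel_max(probs, noise):
    """Hard selection used at inference: index of the largest perturbed log-score."""
    logits = [math.log(p) + g for p, g in zip(probs, noise)]
    return max(range(len(logits)), key=lambda i: logits[i])

rng = random.Random(0)
probs = [0.1, 0.7, 0.2]
noise = gumbel_noise(len(probs), rng)
soft = gumbel_softmax(probs, noise, tau=1.0)
hard = gumbel_max(probs, noise)
```

As `tau` is lowered from 1.0 towards 0.05, the soft vector concentrates on a single entry, which is the index returned by the hard selection for the same noise.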
III-C2 Acceleration by multi-thread parallel computation on CPU
The noise and outliers filtering is implemented on the CPU while the deep network computations run on the GPU. Note that the radius based ball query is reused in both point clouds filtering and network operations for acceleration. As shown in Fig. 10, the next stream of point clouds is preloaded on the CPU before the network computation on the GPU has finished. Multi-thread computation on the CPU is also utilized to accelerate the query process, which notably reduces the idle period between subsequent CPU and GPU computations.
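The overlap between CPU pre-processing and GPU computation can be illustrated with a small producer-consumer sketch. It uses only the Python standard library; the generator name `prefetch`, the `loader` callback (standing in for filtering and ball queries on the CPU), and the queue depth are assumptions, and the GPU step is represented by whatever the consuming loop does with each batch.

```python
import queue
import threading

def prefetch(loader, items, depth=2):
    """Yield loader(item) results while a background thread prepares the next ones."""
    q = queue.Queue(maxsize=depth)
    sentinel = object()

    def worker():
        try:
            for item in items:
                q.put(loader(item))     # blocks when `depth` batches are already waiting
        finally:
            q.put(sentinel)             # always signal the end, even if loader fails

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            return
        yield batch

doubled = list(prefetch(lambda x: x * 2, [1, 2, 3]))
```

While the consumer processes one batch, the worker thread is already preparing the next, so the consumer only waits when pre-processing is slower than the network step.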
IV Experimental Results
IV-A Experimental Setup
We have tested our method extensively on 8 different large-scale point clouds understanding datasets. We implemented the network in TensorFlow and optimized it with the Adam optimizer and an initial learning rate of . For data augmentation, the point clouds are randomly rotated around each axis with an angle , and scaling is applied along axis with a scalar . The network is trained and tested in parallel with point clouds in each stream. The experiments are conducted on an NVIDIA GTX 1080 graphics card with 8 GB memory.
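The rotation and scaling augmentation can be sketched as below. The angle and scale ranges are not reproduced in the text above, so `max_angle` and `scale_range` are hypothetical parameters, and applying an independent scale factor per axis is likewise an assumption.

```python
import math
import random

def rotate(points, axis, angle):
    """Rotate 3D points about a coordinate axis (0 = x, 1 = y, 2 = z) by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    i, j = [(1, 2), (2, 0), (0, 1)][axis]    # the two coordinates mixed by the rotation
    out = []
    for p in points:
        q = list(p)
        q[i] = c * p[i] - s * p[j]
        q[j] = s * p[i] + c * p[j]
        out.append(tuple(q))
    return out

def augment(points, max_angle, scale_range, rng):
    """Random rotation about each axis, then random per-axis scaling."""
    for axis in range(3):
        points = rotate(points, axis, rng.uniform(-max_angle, max_angle))
    s = [rng.uniform(*scale_range) for _ in range(3)]
    return [(p[0] * s[0], p[1] * s[1], p[2] * s[2]) for p in points]

rng = random.Random(0)
augmented = augment([(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)], math.pi, (0.9, 1.1), rng)
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible across runs.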
IV-B Experiments of Filtering and Sampling Methods
IV-B1 Noise and Outliers filtering
We have tested the influence of the proposed noise and outliers filtering on the semantic segmentation performance on S3DIS. Note that we adopt 6-fold cross-validation for S3DIS to guarantee generality and robustness. The noise filtering results, in terms of mean intersection over union (mIOU), are shown in Table I; they demonstrate that filtering boosts the segmentation performance of diverse point clouds understanding methods. With unrelated isolated noise points removed, the meaningful semantics of the point clouds are retained, which improves segmentation.
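For reference, the mIOU metric used throughout the experiments can be computed from per-point labels as in the following sketch. The function name `mean_iou` and the convention of skipping classes that appear in neither the prediction nor the ground truth are assumptions; benchmark toolkits may handle absent classes differently.

```python
def mean_iou(pred, truth, num_classes):
    """Mean over classes of |pred ∩ truth| / |pred ∪ truth|, from per-point labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:                      # skip classes absent from both label sets
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy call class 0 has an IoU of 1/2 and class 1 an IoU of 2/3, so the mean is 7/12.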
IV-B2 Point clouds sampling methods
|FG-Net (Ours) (best)||66.8||70.3||70.8||70.3||69.5||69.9|
To compare the efficiency of our proposed IGSAM with different sampling methods, we have measured their GPU memory usage and processing time on a single GTX 1080 GPU with 8 GB memory. The sampling methods include Random Sampling (RS), Reinforcement Learning based Sampling (RLS), GSS, IDS, Farthest Point Sampling (FPS), and Generative Network (GS) based Sampling. The point clouds are divided into batches consisting of and points respectively, and the batches of points are then down-sampled 5 times, which imitates the down-sampling in our network shown in Fig. 4. The total time and memory consumption of the sampling methods on different numbers of points are illustrated in Fig. 14. RS has the fastest processing speed with the smallest memory consumption. However, RS results in a stochastic loss of meaningful information, which gives unsatisfactory segmentation results. As shown in Table II, mIOUs will drop significantly from to if RS is adopted. It should also be noted that GSS is not suitable for more than points because the GPU memory usage increases greatly with the number of points. Hence, we only use GSS in the last layer of the network when the number of points is less than . Leveraging IDS with its adaptability to the local density of points, our IGSAM achieves the best performance among the sampling methods with only a marginal increase in computational cost compared with RS. The segmentation mIOUs using different sampling methods are also shown in Table II. Our sampling method gives the best performance among all sampling methods on the S3DIS benchmark with an mIOU of 70.8%, which demonstrates the effectiveness of our proposed sampling strategy.
IV-C Experiments of Large-scale Scene Understanding
We evaluated our method extensively on nearly all existing large-scale point clouds understanding datasets, including ModelNet40, ShapeNet-Part, PartNet, S3DIS, NPM3D, Semantic3D, Semantic-KITTI, and ScanNet. The detailed information of the different datasets, together with the accuracy and speed of our framework on each of them, is shown in Table III. The qualitative results of large-scale real-world scene parsing are shown in Fig. 6, Fig. 7, Fig. 8, Fig. 9, Fig. 11, and Fig. 12. mIOUs of 78.2%, 70.8%, 82.3%, and 58.2% are attained on Semantic3D, S3DIS, NPM3D, and the fine-grained segmentation dataset PartNet respectively, with real-time segmentation at about 18.6 Hz for each LiDAR scan with points, which outperforms state-of-the-art methods in terms of accuracy and efficiency. Denote the network without the global attentive module in Fig. 5 as FG-Net-v0 and the one with it as FG-Net-v1. With global attention, the feature correlations across a long spatial range are effectively captured, and distributed objects such as roads and buildings are more precisely segmented by FG-Net-v1 than by FG-Net-v0. The transfer learning results shown in Fig. 13 also demonstrate that our networks learn an underlying latent model of feature representations that generalizes to new scenes.
|Datasets||Main Features||Accuracy/ mIOUs (%)||Time (s)/ points|
|ModelNet40||More than 12311 CAD models from 40 classes||93.1||0.0561|
|ShapeNet-Part||16,881 shapes from 16 classes, annotated with 50 parts in total||86.6||0.0552|
|PartNet||More than 27000 3D models in 24 object classes with 18 parts per object||58.2||0.0501|
|S3DIS||LiDAR scan of more than 6000 m² of 6 different areas from 3 buildings||70.8||0.0498|
|NPM3D||Kilometer-scale point clouds captured from multiple city roads||82.3||0.0528|
|Semantic3D||Largest-scale dataset of more than 4 billion annotated points with 8 classes||78.2||0.0523|
|Semantic-KITTI||Largest-scale LiDAR dataset for autonomous driving||53.8||0.0512|
|ScanNet||Reconstructed 1513 indoor scenes from 707 different areas||68.5||0.0495|
IV-D Visualization of the Network Modules
IV-D1 Visualization of the Deformable Convolutional Kernels
To better demonstrate the geometry-adaptive capacity of the deformable convolutions, the deformable kernels are visualized in Fig. 15. It can be seen that the kernel points are adaptively deformed to capture different geometric structures in the original point clouds. Hence, in the test phase, the specific geometric structures in an unseen scene are effectively captured and described by the deformable kernels. In this way, we model the geometry of the scene in a learnable way to enhance structural awareness in per-point processing.
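The effect of deforming kernel points can be illustrated with a small sketch in the style of kernel point convolution, where each kernel point weights nearby input points by a linear influence function that falls to zero at distance `sigma`. Everything here is an assumption made for illustration: the function `kernel_point_features`, the linear influence, the toy planar neighborhood, and the hand-set offset (in a trained network the offsets would be predicted from the input features). It is not the layer definition used in our network.

```python
import numpy as np


def kernel_point_features(neighbors, feats, kernel_pts, offsets, weights, sigma):
    """Aggregate neighbor features with (optionally shifted) kernel points.

    neighbors:  (N, 3) neighbor coordinates centered on the query point
    feats:      (N, C_in) neighbor features
    kernel_pts: (K, 3) rigid kernel point positions
    offsets:    (K, 3) per-kernel-point shifts (zeros give the rigid kernel)
    weights:    (K, C_in, C_out) one weight matrix per kernel point
    sigma:      influence radius of each kernel point
    """
    deformed = kernel_pts + offsets
    dist = np.linalg.norm(neighbors[:, None, :] - deformed[None, :, :], axis=-1)
    influence = np.maximum(0.0, 1.0 - dist / sigma)  # (N, K), zero beyond sigma
    weighted = influence.T @ feats                   # (K, C_in)
    return np.einsum("kc,kcd->d", weighted, weights)  # (C_out,)


# Toy neighborhood lying on the plane z = 0 (hypothetical data).
neighbors = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0], [0.0, 0.2, 0.0]])
feats = np.ones((3, 2))
kernel_pts = np.array([[0.0, 0.0, 1.0]])  # rigid kernel point far above the plane
weights = np.ones((1, 2, 1))

rigid = kernel_point_features(neighbors, feats, kernel_pts,
                              np.zeros((1, 3)), weights, sigma=0.5)
shifted = kernel_point_features(neighbors, feats, kernel_pts,
                                np.array([[0.0, 0.0, -1.0]]), weights, sigma=0.5)
print(rigid, shifted)
```

In this toy case the rigid kernel point is farther than `sigma` from every neighbor and contributes nothing, while the shifted kernel point lands on the surface and collects features from all three neighbors.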
IV-D2 Visualization of the learnt features
To figure out what has been learnt by our network, the inner activations of the 2 core network modules are visualized in Fig. 16. It should be noted that the activations of the 2 designed branches are complementary to each other: the deformable convolution captures coarse-grained continuous geometric features, while the correlated feature learning focuses on fine-grained isolated per-point features, which is in accordance with our design.
IV-D3 Visualization of the non-local activation
To further demonstrate the effectiveness of the non-local module, we also visualize the non-local activation. Fig. 17 shows that the non-local module captures the long-range dependencies within the same semantic category, such as chairs or bookcases, so that contexts far away from each other are modeled. It can also be observed that the non-local activation gives a rough segmentation prediction for the category of the query point, which is advantageous for the subsequent semantic segmentation.
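The behavior described above can be sketched with a generic non-local attention map: the activation of a query point over all other points is a softmax over feature similarities, so it depends on feature affinity rather than on spatial distance. The scaled dot-product form, the function name `non_local_attention`, the toy two-class features, and the uniform-weight threshold are assumptions for illustration and may differ from the exact formulation of our module.

```python
import numpy as np


def non_local_attention(feats, query_idx):
    """Softmax-normalized similarity between one query point and all points."""
    query = feats[query_idx]
    scores = feats @ query / np.sqrt(feats.shape[1])
    scores = scores - scores.max()  # subtract the max for numerical stability
    w = np.exp(scores)
    return w / w.sum()


# Toy per-point features: points 0, 1, 4 share one category, points 2, 3 another.
feats = np.array([[4.0, 0.0],
                  [4.0, 0.0],
                  [0.0, 4.0],
                  [0.0, 4.0],
                  [4.0, 0.0]])
att = non_local_attention(feats, query_idx=0)

# Rough category mask: points attended more strongly than a uniform weight.
mask = att > 1.0 / len(feats)
print(att.round(4), mask)
```

Points with features similar to the query receive almost all of the attention mass regardless of where they sit in the array, and thresholding the map yields a coarse mask of the query's category.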
IV-E Efficiency and Online Performance
To give a more convincing evaluation of the real-time performance of our method on large-scale real-scene point clouds segmentation, we tested our method against others on the whole sequence of the Semantic-KITTI dataset. The sequences are captured and fed into the networks at 25 Hz. The inference time and the GPU memory used are shown in Fig. 18, where PaiSeg is also a recently developed method for point clouds segmentation. With the residual learning framework, our method reaches 16.89 Hz, 19.53 Hz, 19.31 Hz, and 18.69 Hz for LiDAR scans 02, 04, 05, and 09 respectively. Compared with RSSP and RandLA-Net, the speed increases by 274% and 38.5% while the memory consumption is reduced by 46.5% and 8.6% respectively, which is prominent progress in speed and memory efficiency.
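The per-scan rates quoted above can be turned into an average frame rate and a per-scan latency with a few lines of arithmetic. The snippet uses only the four reported rates; the helper `speedup_percent` and its example inputs are hypothetical and merely show how a relative speed increase is computed from two frame rates.

```python
# Reported inference rates (Hz) for Semantic-KITTI scans 02, 04, 05, and 09.
rates_hz = {"02": 16.89, "04": 19.53, "05": 19.31, "09": 18.69}

# Average frame rate over the four scans.
mean_hz = sum(rates_hz.values()) / len(rates_hz)

# Per-scan processing time in milliseconds.
latency_ms = {seq: 1000.0 / hz for seq, hz in rates_hz.items()}


def speedup_percent(new_hz, old_hz):
    """Relative speed increase of new_hz over old_hz, in percent."""
    return (new_hz - old_hz) / old_hz * 100.0


print(f"mean rate: {mean_hz:.3f} Hz")
for seq, ms in latency_ms.items():
    print(f"scan {seq}: {ms:.1f} ms per scan")

# Hypothetical rates, only to illustrate the helper.
print(speedup_percent(15.0, 10.0))
```

The mean of the four rates is 18.605 Hz, consistent with the figure of about 18.6 Hz quoted in Section IV-C.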
IV-F Ablation Study of Network Modules
Remove pointwise feature relation mining (PFM)
Remove geometric convolutional modelling (GCM)
Remove attentional aggregation (AG)
Remove global feature extraction
The full network framework
Replace the backbone with a 6-stage network
Replace the backbone with a 4-stage network
Without semantic context loss
With 2 RLB2 in each convolutional block
Choose in RLB1 and RLB2
Our designed network modules can be easily integrated into or removed from existing point clouds processing architectures. Ablation studies were conducted to validate the effectiveness and necessity of the designed modules. As shown in Table IV, the 4 core modules are removed from our network in turn and the mIOUs of 6-fold cross-validation on the S3DIS dataset are recorded. Removing the geometric convolutional modelling results in an 11% performance drop, because learning the intrinsic geometric shape contexts of point clouds is vital for recognition. On the other hand, removing the global and the local correlated feature mining results in 5.6% and 7.3% drops in mIOU respectively, which demonstrates that capturing both local and long-range feature relationships is also essential to the segmentation task. Not using the attentional aggregation also degrades the performance, because some meaningful features are not retained. Furthermore, the experiments show that the 5-stage () network is superior to the 4-stage and 6-stage networks, because shallow networks have a poor fitting ability while deeper networks result in oversampling of the point clouds, both of which deteriorate the performance. We also tested in the RLB; however, the segmentation performance does not increase. Therefore is adopted for the sake of memory efficiency.
In this work, we have proposed FG-Net, a general solution to large-scale point clouds understanding with real-time speed and state-of-the-art performance. The filtering and sampling methods are specially designed for large-scale point clouds with high efficiency, and they boost the scene parsing performance. The network can effectively model the point clouds structures and find the feature correlations across a long spatial range. Leveraging feature pyramid based residual learning, hierarchical features at different resolutions can be fused in a memory-efficient way. Experiments in challenging circumstances show that our approach outperforms state-of-the-art methods in terms of both accuracy and efficiency.
[1] (2016) 3D semantic parsing of large-scale indoor spaces. pp. 1534–1543.
[2] (2019) SemanticKITTI: a dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9297–9307.
[3] (2020) FKAConv: Feature-Kernel Alignment for Point Cloud Convolution. In 15th Asian Conference on Computer Vision (ACCV).
[4] (2020) Deep category-level and regularized hashing with global semantic similarity learning. IEEE Transactions on Cybernetics.
[5] (2019) 4D spatio-temporal convnets: minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3075–3084.
[6] (2018) Speedup 3-D texture-less object recognition against self-occlusion for intelligent manufacturing. IEEE Transactions on Cybernetics 49 (11), pp. 3887–3897.
[7] (2017) Scannet: richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5828–5839.
[8] (2018) Gvcnn: group-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 264–272.
[9] (2020) PAI-conv: permutable anisotropic convolutional networks for learning on point clouds. arXiv preprint arXiv:2005.13135.
[10] (2020) Learning multiview 3D point cloud registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1759–1769.
[11] (2018) 3D semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9224–9232.
[12] (2014) 3D object recognition in cluttered scenes with local surface features: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (11), pp. 2270–2287.
[13] (2017) Semantic3d. net: a new large-scale point cloud classification benchmark. International Society for Photogrammetry and Remote Sensing.
[14] (2016-06) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] (2020) RandLA-net: efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11108–11117.
[16] (2020) I-keyboard: fully imaginary keyboard on touch devices empowered by deep neural decoder. IEEE Transactions on Cybernetics.
[17] (2020) Virtual multi-view fusion for 3D semantic segmentation. arXiv preprint arXiv:2007.13138.
[18] (2019) Point cloud oversegmentation with graph-structured deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7440–7449.
[19] (2020) SampleNet: differentiable point cloud sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7578–7588.
[20] (2020) Spherical kernel for efficient graph convolution on 3D point clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[21] (2019) Deepgcns: can gcns go as deep as cnns?. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9267–9276.
[22] (2020) End-to-end learning local multi-view descriptors for 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1919–1928.
[23] (2018) Pointcnn: convolution on x-transformed points. In Advances in Neural Information Processing Systems (NIPS), pp. 820–830.
[24] (2020) Deep learning for generic object detection: a survey. International Journal of Computer Vision (IJCV) 128 (2), pp. 261–318.
[25] (2019) Weakly supervised deep learning for brain disease prognosis using mri and incomplete clinical scores. IEEE Transactions on Cybernetics.
[26] (2019) Densepoint: learning densely contextual representation for efficient point cloud processing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5239–5248.
[27] (2019) Point-voxel cnn for efficient 3D deep learning. In Advances in Neural Information Processing Systems (NIPS), pp. 965–975.
[28] The concrete distribution: a continuous relaxation of discrete random variables. International Conference on Learning Representations (ICLR).
[29] (2019) Vv-net: voxel vae net with group convolutions for point cloud segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 8500–8508.
[30] (2019) RangeNet++: fast and accurate lidar semantic segmentation. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4213–4220.
[31] (2019) Partnet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 909–918.
[32] (2020) Adaptive hierarchical down-sampling for point cloud classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12956–12964.
[33] (2017) Pointnet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 652–660.
[34] (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems (NIPS), pp. 5099–5108.
[35] (2018) Paris-lille-3d: a large and high-quality ground-truth urban point cloud dataset for automatic segmentation and classification. The International Journal of Robotics Research (IJRR) 37 (6), pp. 545–557.
[36] (2009) Fast point feature histograms (fpfh) for 3D registration. In 2009 IEEE International Conference on Robotics and Automation (ICRA), pp. 3212–3217.
[37] (2008) Aligning point cloud views using persistent feature histograms. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3384–3391.
[38] (2018) Mining point cloud local structures by kernel correlation and graph pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4548–4557.
[39] (2020) Pv-rcnn: point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10529–10538.
[40] (2018) The importance of sampling in meta-reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pp. 9280–9290.
[41] (2020) Searching efficient 3D architectures with sparse point-voxel convolution. arXiv preprint arXiv:2007.16100.
[42] (2019) Kpconv: flexible and deformable convolution for point clouds. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 6411–6420.
[43] (2010) Unique signatures of histograms for local surface description. In European Conference on Computer Vision (ECCV), pp. 356–369.
[44] (2020) Real-time 3-D semantic scene parsing with lidar sensors. IEEE Transactions on Cybernetics.
[45] (2020) Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[46] (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (ToG) 38 (5), pp. 1–12.
[47] (2019) Squeezesegv2: improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In 2019 International Conference on Robotics and Automation (ICRA), pp. 4376–4382.
[48] (2019) Pointconv: deep convolutional networks on 3D point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9621–9630.
[49] (2015) 3D shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1912–1920.
[50] (2020) Squeezesegv3: spatially-adaptive convolution for efficient point-cloud segmentation. arXiv preprint arXiv:2004.01803.
[51] (2020) A new aggregation of dnn sparse and dense labeling for saliency detection. IEEE Transactions on Cybernetics.
[52] (2020) PointASNL: robust point clouds processing using nonlocal neural networks with adaptive sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5589–5598.
[53] Hierarchical deep embedding for aurora image retrieval. IEEE Transactions on Cybernetics.
[54] (2020) HVNet: hybrid voxel network for lidar based 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1631–1640.
[55] (2016) A scalable active framework for region annotation in 3D shape collections. ACM Transactions on Graphics (ToG) 35 (6), pp. 1–12.
[56] (2019) Shellnet: efficient point cloud convolutional neural networks using concentric shells statistics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1607–1616.
[57] (2019) Deformable convnets v2: more deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9308–9316.