I Introduction
Owing to their direct and robust acquisition of 3D information, light detection and ranging (LiDAR) sensors have proliferated and are widely deployed on intelligent agents such as unmanned ground vehicles (UGVs) and unmanned aerial vehicles (UAVs) to perform localization, obstacle detection, exploration, etc. Consequently, efficient and effective understanding of large-scale 3D LiDAR point clouds is of great importance for machine perception, as it bridges the gap between raw 3D points and high-level information, structural or semantic, or both [44, 6, 56, 12]. However, due to electrical and mechanical disturbances and the reflectance properties of targets, point clouds often suffer from noise and outliers. Moreover, compared with the 2D raster image, the topological relationships between objects in 3D point clouds are much weaker, rendering segmentation and understanding far more challenging. Therefore, autonomous large-scale point clouds understanding remains an open problem and urgently requires efforts to tackle these challenges, especially when both accuracy and efficiency are taken into account.
As with high-level tasks in the 2D image domain such as object detection, segmentation, or classification, point clouds understanding methods can be classified into traditional methods and deep learning based methods. In the traditional category, representative histogram-based methods [37, 36] encode the k-Nearest-Neighbor (kNN) geometric features of a 3D point by calculating its surrounding multi-dimensional average curvature to describe local geometric variations. In signature-based [43] and transform-based [12] methods, hand-crafted feature descriptors of point clouds have been proposed and exploited for semantic understanding. However, the performance of these methods has only been demonstrated under well-controlled conditions with ideal assumptions such as noise-free and homogeneous environments [12]. On the other hand, deep learning based point clouds processing methods have been proposed with promising results in recent years. Mainstream point clouds understanding methods can be roughly divided into three categories: projection-based [30, 47, 50, 8, 17, 22, 10], voxel-based [11, 5, 41, 29, 54], and direct point-based [18, 44, 32, 56, 34, 52, 26, 33, 38, 42, 20, 27, 48]. Representative works of each category will be discussed in Section II; their limitations are summarized here. First, the sampling operation of almost all these methods incurs high computational cost and memory consumption. For instance, the widely applied farthest point sampling (FPS) [34, 52] takes more than 1000 seconds to subsample points to about points. Furthermore, the subsequent perception networks usually rely on expensive operations such as voxelization [11, 5] or graph construction [18]. Second, nearly all existing methods are designed for small-scale point clouds without considering the noise and outliers that are inevitable in practice. Moreover, large-scale point clouds typically suffer from severe class imbalance among semantic categories, and the points obtained by LiDAR in complex dynamic environments are often irregular, orderless, and carry distantly distributed semantic information. For example, in autonomous driving scenarios, typical objects exhibit diverse geometric shapes with varying sizes (e.g., cyclists and persons) or are distributed non-uniformly across a long spatial range (e.g., roads, buildings, and vegetation).
However, to the best of our knowledge, existing methods can hardly capture the complex geometry and the latent feature correlations in large-scale point clouds effectively. To overcome the aforementioned challenges, we propose a general deep learning framework named FGNet for large-scale point clouds understanding. We adopt deformable convolution for modelling the geometric structure and point-wise attentional aggregation for mining correlated features among point clouds. The deformable convolutional modelling can effectively adapt to the local geometry of objects through deformed kernels that dynamically conform to diverse local structures, while the correlated feature mining can adaptively capture contextual information distributed across a long spatial range in both spatial locations and semantic features. The modules in our network can be implemented with simple point-wise matrix multiplications and additions, which are easily parallelized on GPU for acceleration. As shown in Fig. 1, our method outperforms state-of-the-art ones in terms of both accuracy and efficiency, making real-time perception on large-scale point clouds achievable. In summary, our work makes the following contributions:


We propose point-wise correlated feature mining and geometric-aware modelling modules for large-scale point clouds understanding. Furthermore, we interpret the effectiveness of our network by visualizing the complementary features captured by our network modules.

We propose a feature pyramid based residual learning architecture to leverage patterns at different resolutions in a memory-efficient way. Extensive experiments on challenging real-world datasets demonstrate that our approaches outperform state-of-the-art ones in terms of accuracy and efficiency.

We propose a novel fast noise and outliers removal method and a point downsampling strategy for large-scale point clouds, which simultaneously enhance the performance and improve the efficiency of semantic understanding tasks in large-scale scenes.
The paper is organized as follows: Section II gives a detailed review of existing point clouds understanding methods. In Section III, we illustrate our proposed framework thoroughly and give our optimization function formulation and training details. We also propose point clouds filtering and sampling methods specially designed to speed up our framework on large-scale point clouds. Section IV presents substantial experimental results and ablation studies. Section V concludes our work.
II Related Works
Advanced deep learning techniques in the 2D image domain have been investigated extensively and have achieved stunning performance [53, 51, 16, 4, 25]. Naturally, such techniques have been exploited for point clouds processing and understanding, and the published works can be roughly categorized into voxel-based [11, 5, 41, 29, 54], projection-based [30, 47, 50, 8, 17, 22, 10], and point-based methods [18, 44, 32, 56, 34, 52, 26, 33, 38, 42, 20, 27, 48]. The voxel-based and projection-based methods transform point clouds into different representations, while the point-based methods process point clouds directly. These methods are mainly designed and tested on relatively small-scale point clouds of less than points with block partitioning. Directly extending them to large-scale point clouds results in prohibitively expensive computational costs. Here we discuss the advantages and shortcomings of these methods thoroughly, together with the rationale that motivates our modifications and improvements.
II-A Voxel-based and Projection-based Methods for Point Clouds Understanding
The most recent and typical voxel-based methods are SparseConv [11] and Minkowski CNN [5]. Voxel-based methods use 3D convolutions, which are intuitive extensions of their 2D counterparts. Their advantage is that spatial relationships can be well preserved at high voxel resolutions, but these methods are computationally expensive: the computation cost and memory consumption of voxel-based models increase cubically with the resolution of the input point clouds. Conversely, the loss of geometric information becomes significant if the voxelization resolution is decreased. In addition, voxel-based methods rely on aggressive downsampling to achieve larger receptive fields, resulting in even lower resolution in deeper network layers. Hence, it is quite hard to achieve real-time performance while balancing accuracy and computational cost. Projection is also used to map point clouds into range images [30, 47, 50] or multi-view images [8, 17, 22, 10] to facilitate the use of 2D CNNs. However, such projection inevitably leads to the loss of geometric information. When dealing with large-scale point clouds in practice, the drawbacks of voxel-based and projection-based methods become even more prohibitive.
II-B Point-based Methods for Point Clouds Understanding
PointNet [33] is the pioneering work that extracts point-wise features directly using shared Multi-Layer Perceptrons (MLPs). It is invariant to the permutation of point orders because it uses max-pooling operations. PointNet++ [34] extracts local features using PointNet and considers local geometric relationships through hierarchical grouping and abstraction as well as multi-scale and multi-resolution grouping. More point-based methods [23, 32, 18] have been proposed recently with complicated network designs for aggregating local features. However, none of these methods can model the intrinsic geometric structures of points or effectively capture the non-local distributed contextual correlations in spatial locations and semantic features. There is also a series of new explorations on how to implement convolution on point clouds. The methods in [38, 42, 20] focus on learning kernels which can better capture the local geometry of points. However, the proposed convolutional kernels are too complicated to be directly applied in deep neural networks for large-scale point clouds understanding. Motivated by the challenges above, we propose a novel lightweight point-based method that considers distributed long-range dependencies and learns kernels to capture the local structures of point clouds.
II-C Efficient Large-Scale Point Clouds Understanding
It is only recently that more attention has been paid to efficient large-scale point clouds understanding. Previously, block partitioning [34, 33, 48] was utilized to divide large-scale point clouds into small sub-blocks before feeding them to networks. However, such partitioning is time-consuming and damages the spatial geometric context among the objects of a large-scale scene. Although several attempts [18, 27, 15] have been made at large-scale point clouds segmentation, major problems remain. Firstly, the farthest point sampling (FPS) adopted by most previous methods requires a large computational cost which increases quadratically [44] with respect to the number of input points. Secondly, block partitioning prevents large-scale point clouds semantics from being inferred within one scan, which limits the volume of point clouds that can be processed. Some methods [27, 39] also try to combine voxel-wise features with point-wise features to improve performance. Analogous to the superpixel concept in the 2D image domain, the superpoint method [18] has been introduced to apply graph convolutions on large-scale points. However, due to the high computational cost of voxelization or graph construction, such methods can hardly achieve real-time performance.
III Proposed Methodology
In this section, a fast deep learning method leveraging correlated feature mining and geometric-aware modelling is proposed for large-scale point clouds understanding. As illustrated in Fig. 2, our FGNet takes the raw point clouds of a large-scale complex scene as input and gives predictions for object classification and semantic segmentation simultaneously. The details are given as follows.
III-A Noise and Outliers Filtering
The point clouds obtained by LiDAR sensors contain noise and outliers which are harmful to the subsequent high-level processing; thus we propose a novel filtering method for pre-processing. As shown in Algorithm 1, the set of input point clouds is defined as . A radius based nearest-neighbour ball query is conducted to ensure robustness to variations of the density distribution of the point clouds during sampling. Given a query point , we define the neighbouring points within radius as . The number of points is obtained, which can be regarded as an estimate of the point density within radius , and will be reused in the following inverse density based sampling (IDS). If ( is a threshold depending on the density of points acquired by the LiDAR), we regard the point as an isolated point and remove it from . Next, the distances between points are modelled as a Gaussian distribution, and points are removed if their mean distance falls outside the confidence interval of this Gaussian distribution. The mean of the distances between points can be computed as follows:

(1)
(2) 
The noise and outliers can be removed very efficiently, at a speed of 0.61 s per million () points. This simple but effective method removes noise and outliers from point clouds while enhancing the performance of point clouds understanding at the same time. The implementation details for the acceleration of our framework will be introduced in Subsection III-C.
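Since Algorithm 1 is not reproduced here, the following is a minimal NumPy sketch of the filter described above, under our own naming; the radius, neighbour-count threshold, and confidence multiplier are illustrative defaults, and a KD-tree would replace the brute-force distance matrix in practice:

```python
import numpy as np

def filter_noise_and_outliers(points, radius=0.5, min_neighbors=3, alpha=2.0):
    """Remove isolated points (too few neighbours inside the query ball)
    and statistical outliers whose mean neighbour distance falls outside
    the Gaussian confidence interval. Brute-force O(N^2) for clarity."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)               # exclude self-distance
    in_ball = dists < radius
    counts = in_ball.sum(axis=1)                  # density estimate, reusable by IDS
    keep = counts >= min_neighbors                # drop isolated points
    # mean distance to neighbours inside the ball
    mean_d = np.where(in_ball, dists, 0.0).sum(axis=1) / np.maximum(counts, 1)
    mu, sigma = mean_d[keep].mean(), mean_d[keep].std()
    keep &= np.abs(mean_d - mu) <= alpha * sigma  # Gaussian confidence interval
    return points[keep], counts[keep]
```

The per-point neighbour counts are returned alongside the filtered points so that the subsequent inverse density sampling can reuse them, mirroring the reuse of the ball query mentioned above.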
III-B Proposed Network Architecture for Large-Scale Point Clouds Understanding
We design the network module FGConv to capture feature correlations and model the local geometry of point clouds simultaneously. Leveraging a feature pyramid based residual learning framework, FGConv can be integrated into the deep network FGNet as the core module for large-scale point clouds understanding. As shown in Fig. 3, the large-scale point clouds can be processed in parallel by our novel design leveraging feature-level correlation mining and geometric convolutional modelling. The core network module FGConv includes three components: point-wise correlated feature mining (PFM), geometric convolutional modelling (GCM), and attentional aggregation (AG), which are detailed as follows:
III-B1 Point-wise correlated features mining
The point clouds after filtering are represented as xyz coordinates with per-point features. The input features can consist of raw RGB, surface normals, the intensity of the point clouds, and even learnt latent features. Denote the input point clouds as the matrix , where is the number of points and is the dimension of the input features. The ith vector in P can be denoted as , where . For the point , denote the kth point vector in the spherical neighbourhood as , in which is the radius of the neighbourhood. The similarity score of and is calculated as:
(3) 
which is the inner product of and . It gives a good evaluation of the similarity of neighbouring points in spatial locations and features. For each of the K neighbouring points, can be calculated, and together they constitute a similarity score vector . However, these similarity scores are not tailored to any specific task such as classification or segmentation. Thus, the attentional technique is introduced to make the new similarity score adapt to the specific task through the training of deep networks; it is calculated as:
(4) 
where is the weight matrix to be learnt and is the softmax function normalizing the attentional weights. All constitute the matrix . Then each element of is multiplied with each row of through element-to-row multiplication to obtain the augmented attentional feature matrix . From now on, the augmented features (e.g., ), which encode both geometry and feature correlations, are simply called features for brevity. Next, is concatenated with the corresponding input feature to obtain the enhanced feature . In this way, the local contextual relationship can be captured, and the similarity of features is enhanced adaptively and selectively by attentive weighting in a learnable way: similar feature elements in the latent space are enhanced while distinct ones are attenuated.
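Because the symbols of Equations 3 and 4 are not fully reproduced here, the following NumPy sketch illustrates the PFM steps for a single query point under our own naming; `W` stands in for the learnable attention weights and a plain softmax replaces the full attention MLP:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pointwise_feature_mining(center, neighbors, W):
    """PFM sketch for one query point: inner-product similarity scores,
    a learnable re-weighting with softmax, element-to-row multiplication,
    and concatenation with the input features.
    center: (d,) feature of the query point; neighbors: (K, d) features of
    its ball neighbours; W: (K, K) learnable weight matrix (our stand-in)."""
    s = neighbors @ center                  # inner-product similarity, shape (K,)
    a = softmax(W @ s)                      # task-adaptive attention scores, (K,)
    augmented = a[:, None] * neighbors      # element-to-row weighting, (K, d)
    enhanced = np.concatenate([augmented, neighbors], axis=-1)  # (K, 2d)
    return enhanced
```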
III-B2 Geometric convolutional modelling
After the point-wise correlated feature mining, the local correlated features are largely captured, but the geometric structure of the points is not yet sufficiently modelled. Inspired by the great success of deformable convolution in image recognition [57], we extend deformable convolutions from images to point clouds to model irregular and unordered 3D structures. Similar to 2D deformable convolutions, the deformable 3D kernels in Euclidean space are a set of learnable points that conform to the local structures of the point clouds; thus, the dominant local geometric shapes of the points can be activated by the corresponding kernels in the neighbourhood. Note that the kernel deformations adapt to the local geometry of the points in a learnable way through elaborately designed optimization functions. As shown in the bottom right of Fig. 3, like convolutional neural networks in image processing, the convolution on points is defined as:
(5) 
The core problem is that point clouds are unstructured and unordered, which makes it difficult for a point convolutional kernel function to learn representative local geometric patterns. We design the correlation function to measure the correspondence between kernel points and the local geometry: the closer the kernel points are to the input points, the higher the correlation value should be. Denote the difference between and as . The pseudo kernel points centered at () are designed to imitate the convolutional kernels in image processing (). The relative coordinates of the pseudo kernel points and the center point are given as: . We set the correlation function as a Gaussian function, formulated as:
(6) 
where is the number of kernel points, is a constant, and is the parameter determining the influence distance of the kernel points. Then the kernel function can be given as the sum of all correlations with learnable weights, as shown in Equation 7:
(7)  
where is the weight matrix of the MLP layers, and and are the input and output channel numbers respectively. During optimization, the kernel points are forced to adapt to the dominant structures in the local point clouds. Finally, the feature after deformable convolution can be obtained. In this way, the local geometric structures are well captured by the convolutional kernels and the dominant structural features are enhanced.
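A minimal NumPy sketch of the Gaussian-correlation point convolution of Equations 6 and 7 follows; all names (and the default `sigma`) are ours, chosen for illustration, since the original symbols are not reproduced here:

```python
import numpy as np

def gaussian_correlation(rel_pts, kernel_pts, sigma):
    """Gaussian correlation between neighbour offsets (K, 3) and kernel
    points (M, 3): higher when a kernel point lies close to a neighbour."""
    d2 = ((rel_pts[:, None, :] - kernel_pts[None, :, :]) ** 2).sum(-1)  # (K, M)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def point_deformable_conv(rel_pts, feats, kernel_pts, W, sigma=0.3):
    """For each kernel point m, weight the neighbour features by their
    Gaussian correlation, mix channels with the learnable matrix W[m],
    and sum the contributions of all kernel points.
    rel_pts: (K, 3) neighbour coordinates relative to the center point;
    feats: (K, c_in); kernel_pts: (M, 3); W: (M, c_in, c_out)."""
    h = gaussian_correlation(rel_pts, kernel_pts, sigma)     # (K, M)
    out = np.zeros(W.shape[-1])
    for m in range(kernel_pts.shape[0]):
        out += (h[:, m:m + 1] * feats).sum(axis=0) @ W[m]    # sum over neighbours
    return out                                               # (c_out,)
```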
III-B3 Attentional aggregation
The attention mechanism is utilized to aggregate the feature-level and geometric-level patterns without large information loss. As shown in Fig. 3, the integrated neighbouring features can be represented as . Then the attentive score for aggregation is defined as , which adaptively learns the importance score of each feature. The weighted attentional feature can be given as:
(8) 
The summed feature can be given as: . Then the element-wise multiplication between and is utilized to obtain the learnt feature , and the final feature is the sum of the original feature and the learnt feature: . We apply an MLP layer to control the dimension of the output vector flexibly and give a meaningful aggregated feature containing both local correlated features and enhanced local geometry.
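The aggregation step can be sketched as follows; this is our own minimal formulation (a single linear score map plus a softmax over the neighbours), not the paper's exact attention network:

```python
import numpy as np

def attentional_aggregation(neigh_feats, W_score):
    """AG sketch: a learnable score function weights each neighbour
    feature, and the weighted features are summed into a single vector.
    neigh_feats: (K, d) integrated neighbour features;
    W_score: (d, d) learnable score weights (our stand-in for the MLP)."""
    scores = neigh_feats @ W_score                   # per-feature scores, (K, d)
    scores = np.exp(scores - scores.max(axis=0))
    scores /= scores.sum(axis=0, keepdims=True)      # softmax over the K neighbours
    aggregated = (scores * neigh_feats).sum(axis=0)  # weighted sum, (d,)
    return aggregated
```

With zero score weights the softmax is uniform and the operation reduces to plain mean pooling, which makes the learnable weighting easy to sanity-check.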
III-B4 Feature pyramid hierarchical residual architecture
ResNet [14] based architectures have achieved great success in image recognition. Motivated by the residual learning paradigm, we propose a general deep network specially designed for the classification and semantic segmentation of large-scale point clouds. To our knowledge, only recently have several methods [26, 21] started to use residual learning for point clouds recognition, but their attempts are limited to small-scale point clouds, so the fitting capacity of residual architectures has not been fully demonstrated. We propose a feature pyramid based multi-scale fusion strategy for adaptively aggregating features from different layers of the network. Leveraging the deep residual structure, memory-efficient deep networks can be built.
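A minimal sketch of such a memory-efficient residual block, assuming the bottleneck-style layout detailed later in this subsection (reduce channels before the core point convolution, restore them after, add a skip connection); all names are ours:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_learning_block(x, W_down, W_up, fgconv=lambda h: h):
    """Bottleneck residual block sketch: a 1x1-style linear map reduces
    the channel dimension before the expensive point convolution and a
    second one restores it, with an identity skip connection.
    x: (N, d); W_down: (d, d_r); W_up: (d_r, d); fgconv is a placeholder
    for the core FGConv operator."""
    h = relu(x @ W_down)     # reduce channels to cut memory and parameters
    h = fgconv(h)            # core point convolution (identity placeholder)
    h = relu(h) @ W_up       # restore the original channel dimension
    return x + h             # residual connection
```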
As shown in Fig. 4, an encoder-decoder based network structure can be utilized to obtain point clouds at multiple resolutions. In image processing, networks are supposed to extract large feature maps for small objects and small feature maps for big objects [24, 45]. It should be noted that the scale variation present in images does not exist in 3D point clouds: unlike images, in which the scale of objects varies with distance, the scale of point clouds remains constant. Hence, deconvolution by interpolation must be conducted to recover the points to the original resolution. As shown in Subfigure (a) of Fig. 4, in the residual learning block (RLB), denote the input and output dimensions of the RLB as and respectively. Unlike some memory-consuming deep architectures, we reduce the feature dimension in the original residual learning block to ( is adopted in our framework) by convolution before feeding the features into the FGConv module, which reduces the parameters by 9.6 times. The accuracy for classification and segmentation can still be maintained through residual learning, as shown in the experiments. Another convolution is applied to recover the feature dimension. At the block connecting two stages, a convolution is applied in the skip link to increase the feature dimension. Then the global feature extraction in Subsubsection III-B5 is conducted to obtain the latent global features from , which can be directly utilized for classification predictions. The point clouds are all upsampled after the ( in our case) convolutional blocks. Unlike previous methods which directly use the upsampled features for segmentation, we propose to fuse the predictions at different resolutions and use the supervised loss to guide the training process. It turns out that the hierarchical structure gives better results for point-wise large-scale point clouds segmentation.
III-B5 Point clouds global feature extraction
The global and long-range dependencies in point clouds should also be captured before upsampling and giving the point-wise predictions. Due to the limited receptive field of the neural layers mentioned above, the global contextual semantic patterns cannot be fully obtained. We adopt the self-attentional module shown in Fig. 5 to selectively enhance the closely relevant elements in the global feature . After the global relationship mining by this module, both the local and global relationships in features and geometry are captured adaptively. The feature representation combining local and non-local semantic contextual correlations is then adaptively obtained to facilitate the subsequent recognition task. Given the original local feature map ( in our case) as shown in Fig. 4 and Fig. 5, a convolution with weights ( in our case) is used to transform the feature map into latent representations and , from which the similarity of every two elements in is obtained. After and are obtained, the dot product between them is conducted to obtain the relevance score matrix , which is given as:
(9) 
Each element in gives the relevance score between the representations and . Then softmax is applied to normalize the latent attentional scores to obtain the final self-relation weights of the latent representation . Each element of can be represented as:
(10) 
The attention weights reveal the correlations among all local and non-local features. The more closely related distributed feature relationships, even in non-local regions, can be effectively captured, and larger attention weights are assigned in to enhance their similar semantic contexts. Finally, the attention scores are applied to all elements in to produce the global attentional vector :
(11) 
and the consolidated feature is the sum of and , which is given as: .
Ultimately, the global contextual representation is fused with the local aggregated representation for a comprehensive encoding of local and non-local correlated features. The predictions for classification can be directly obtained from the aggregated latent features, and the segmentation results can be learnt by upsampling.
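The self-attentional module of Equations 9-11 can be sketched in NumPy as follows; the variable names are ours, and the two projection matrices stand in for the convolutions described above:

```python
import numpy as np

def self_attention_global(F, Wq, Wk):
    """Non-local self-attention sketch: two linear maps produce latent
    representations, their dot product gives pairwise relevance scores,
    softmax normalizes the scores row-wise, and the attention-weighted sum
    plus a residual yields the consolidated feature.
    F: (N, d) local feature map; Wq, Wk: (d, d_k) projection weights."""
    Q, K = F @ Wq, F @ Wk                    # latent representations
    S = Q @ K.T                              # pairwise relevance scores, (N, N)
    S = np.exp(S - S.max(axis=1, keepdims=True))
    A = S / S.sum(axis=1, keepdims=True)     # softmax attention weights
    G = A @ F                                # global attentional features, (N, d)
    return F + G                             # residual consolidation
```

With zero projections the attention is uniform, so each output row is the input row plus the mean feature, a handy correctness check.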
III-B6 Optimization function formulation and data augmentation
As mentioned in Subsubsection III-B2, denote the relative coordinates of the kernel points as and the learnt deformations as . The losses utilized for the deformable convolution are designed as:
(12) 
which is utilized to match the kernel positions with the local geometries of the point clouds,
(13) 
which is the repulsive loss utilized to keep distance between different kernels, and
(14) 
which keeps the kernel points from diverging and constrains them inside the query ball. The kernel loss is the sum of the above three losses, i.e., . As shown in Fig. 4, the losses at different stages of the network are also summed, formulated as the cross-entropy loss denoted as :
(15)  
denotes the weight at stage of the residual network, W denotes the weights of the entire network, denotes the upsampled point clouds at stage , denotes the fused point clouds, and and denote the segmentation ground truth of the points at different stages and of the fused points respectively. denotes the segmentation prediction of the network. We also propose to use the contextual loss shown in Fig. 4, which predicts whether each object is present in the scene so as to take the semantic context of the scene into account; it can be given as:
(16) 
where indicates whether the object is present in the scene or not and is the classification prediction. This loss helps the network consider all semantic categories appearing in the scene equally. The total loss of the network can be given as: . The kernel positions and network parameters are jointly optimized in an end-to-end manner.
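Since Equations 12-14 are not reproduced here, the following is our own plausible formulation of the three kernel regularizers described above (attraction to the local geometry, mutual repulsion, containment inside the query ball); it is a sketch, not the paper's exact losses:

```python
import numpy as np

def kernel_losses(kernel_pts, neigh_pts, radius):
    """Three kernel regularizers (our formulation): a fitting loss pulling
    each kernel point toward its nearest input point, a repulsive loss
    pushing kernel points apart, and a hull loss penalizing kernel points
    that leave the query ball of the given radius.
    kernel_pts: (M, 3) relative kernel coordinates; neigh_pts: (K, 3)."""
    d2 = ((kernel_pts[:, None, :] - neigh_pts[None, :, :]) ** 2).sum(-1)
    fit = d2.min(axis=1).mean()                    # attract to local geometry
    kd = ((kernel_pts[:, None, :] - kernel_pts[None, :, :]) ** 2).sum(-1)
    m = kernel_pts.shape[0]
    rep = (1.0 / (kd + 1e-9))[~np.eye(m, dtype=bool)].mean()  # keep kernels apart
    norms = np.linalg.norm(kernel_pts, axis=1)
    hull = np.maximum(norms - radius, 0.0).mean()  # stay inside the query ball
    return fit, rep, hull
```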
III-C Implementation Details for Acceleration of our Framework
III-C1 IGSAM for fast learning-based sampling
Sampling methods play a very significant role in processing point clouds with convolutional neural networks. Directly using raw points for segmentation is rather inefficient due to the large computational cost of feeding them all into the networks; thus, an effective sampling method is essential when efficiency is taken into consideration. To tackle the large computational overhead of processing millions of points, we propose the efficient sampling method IGSAM, which leverages the advantages of inverse density sampling (IDS) and Gumbel softmax sampling (GSS) to achieve fast and effective point clouds understanding. We design a novel learning based GSS which adaptively selects the significant points according to the optimization objectives. Leveraging IDS, point clouds can be sampled efficiently with density awareness while the information and features meaningful for point clouds understanding are maintained. The sampling method is implemented with multi-thread CPU parallelization for acceleration.
To achieve a learning based sampling operation, the Gumbel-Softmax trick is utilized to turn the non-differentiable sampling operation into a differentiable selection of features and coordinates via the reparameterization trick [28]. In this way, the points that matter most for the task can be selected in a learnable way. Given the point clouds , with coordinates and features, the probability score of a point being sampled can be estimated by an MLP based convolution, formulated as:

(17)
Then the Gumbel noise [28] can be drawn from the Gumbel distribution , and the Gumbel softmax (GS) operation on P can be used to compute the sampled vector , where is the temperature, and and are the ith elements of s and g respectively:
(18) 
Utilizing Equation 18, is differentiable with respect to ; thus, differentiable sampling over the original point distribution can be realized during training. It should be noted that starts from a high value of 1.0 and gradually anneals to a smaller value of 0.05 during training. When , GS degenerates to the Gumbel-Max (GM) selection, and the discrete individual point candidate can be hard-selected utilizing GM during inference, as shown in Equation 19:
(19) 
In this way, we facilitate differentiable sampling during training while maintaining discrete, hard GM sampling during network inference; hence, the sampling operation can be integrated into deep networks. To be more specific, given the input points denoted as , the GSS can be conducted as:
(20) 
where is the weight matrix of the MLP layers in Equation 17, and are the points to be selected.
During network inference, discrete sampling is realized by substituting GM for GS, which is formulated as:
(21) 
Note that the matrix is relatively memory consuming when the number of points is large; hence, we only adopt it in the deeper layers of the network, when the points have been downsampled to a certain quantity, which is less than .
In our IGSAM, IDS is conducted in the first four stages of the network while GSS is adopted in the fifth stage. In this way, density awareness is maintained while significant points are selected in a learnable way before the local aggregation and the global relationship mining by the non-local attentional module introduced above, which enforces the sampling process to select the points that are significant for the specific understanding task.
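The soft (training) and hard (inference) selection of Equations 18 and 19 can be sketched as follows; this is our implementation of the standard Gumbel-Softmax trick, with illustrative names:

```python
import numpy as np

def gumbel_softmax_select(scores, tau=1.0, hard=False, rng=None):
    """Gumbel-Softmax point selection sketch. `scores` are per-point
    sampling logits from the MLP of Equation 17. Training uses the soft,
    differentiable weights at temperature `tau`; inference sets hard=True
    for the discrete Gumbel-Max pick."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-9, 1.0, size=scores.shape)
    g = -np.log(-np.log(u))                   # Gumbel(0, 1) noise
    if hard:                                  # Gumbel-Max: one-hot selection
        y = np.zeros_like(scores)
        y[np.argmax(scores + g)] = 1.0
        return y
    z = (scores + g) / tau                    # soft, differentiable weights
    e = np.exp(z - z.max())
    return e / e.sum()
```

As `tau` anneals from 1.0 toward 0.05 the soft weights sharpen toward the one-hot Gumbel-Max selection, which is why training can hand over to the hard pick at inference time.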
III-C2 Acceleration by multi-thread parallel computation on CPU
The noise and outliers filtering is implemented on the CPU while the deep network computations run on the GPU. It should be noted that we reuse the radius based ball query in both the point clouds filtering and the network operations for acceleration. As shown in Fig. 10, we preload the next stream of point clouds to the CPU before the network computation on the GPU finishes. Multi-thread computation on the CPU is also utilized to accelerate the query process, which notably reduces the idle periods between successive CPU and GPU computations.
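The CPU/GPU overlap described above can be sketched as a small producer-consumer pipeline; this is our own minimal scheme (function names are illustrative), not the paper's implementation:

```python
import queue
import threading

def pipelined_inference(load_scan, run_network, num_scans):
    """Overlap CPU preprocessing with GPU inference: a background thread
    loads and filters the next scan while the current one is processed,
    so the two stages run concurrently instead of serially.
    load_scan(i) stands for CPU-side loading/filtering of scan i;
    run_network(scan) stands for the GPU forward pass."""
    buf = queue.Queue(maxsize=2)              # small prefetch buffer

    def producer():
        for i in range(num_scans):
            buf.put(load_scan(i))             # CPU: preload the next scan
        buf.put(None)                         # sentinel: no more scans

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (scan := buf.get()) is not None:    # `is not None` keeps falsy scans
        results.append(run_network(scan))     # GPU: process the current scan
    return results
```

The bounded queue keeps memory use constant: the producer blocks once two scans are buffered, so preloading never runs far ahead of the consumer.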
IV Experimental Results
IV-A Experiments Setup
We have tested our method extensively on 8 different large-scale point clouds understanding datasets. We implemented the network in Tensorflow and optimized it with the Adam optimizer and an initial learning rate of . For data augmentation, the point clouds are randomly rotated around each axis by an angle , and scaling is applied along each axis by a scalar . The network is trained and tested in parallel with point clouds in each stream. The experiments are conducted on an Nvidia GTX 1080 graphics card with 8 GB memory.
IV-B Experiments of Filtering and Sampling Methods
IV-B1 Noise and outliers filtering
We have tested the influence of the proposed noise and outliers filtering on the semantic segmentation performance on S3DIS. Note that we adopt 6-fold cross-validation for S3DIS to guarantee generality and robustness. The noise filtering results with mean intersection over union (mIOU) are shown in Table I, which demonstrates that filtering boosts the segmentation performance of diverse point clouds understanding methods. With unrelated isolated noise points removed, the meaningful semantics of the point clouds are retained, which boosts segmentation performance.
IV-B2 Point clouds sampling methods
Method | RS | RLS | IGSAM | IDS | FPS | GS
SPG [18] | 61.7 | 62.2 | 63.2 | 62.6 | 61.8 | 62.5
Shellnet [56] | 66.2 | 66.3 | 67.6 | 66.9 | 66.1 | 66.3
PointCNN [48] | 64.9 | 65.3 | 66.8 | 66.1 | 65.9 | 65.3
Kpconv [42] | 66.5 | 67.1 | 67.5 | 67.3 | 66.7 | 66.3
DGCNN [46] | 55.2 | 56.0 | 58.4 | 57.5 | 57.9 | 56.0
FKAConv [3] | 64.6 | 65.2 | 68.6 | 66.2 | 66.1 | 65.3
FGNet (Ours) (best) | 66.8 | 70.3 | 70.8 | 70.3 | 69.5 | 69.9
To compare the efficiency of our proposed IGSAM with different sampling methods, we have measured their GPU memory usage and processing time on a single GTX 1080 GPU with 8 GB memory. The sampling methods include Random Sampling (RS), Reinforcement Learning based Sampling (RLS) [40], GSS [28], IDS, Farthest Point Sampling (FPS), and Generative Network based Sampling (GS) [19]. The point clouds are divided into batches consisting of and points respectively; the batches are then downsampled 5 times, imitating the downsampling in our network shown in Fig. 4. The total time and memory consumption of the sampling methods on different numbers of points are illustrated in Fig. 14. RS has the fastest processing speed with the smallest memory consumption. However, RS results in a stochastic loss of meaningful information, which gives unsatisfactory segmentation results: as shown in Table II, the mIOU drops significantly from to when RS is adopted. It should also be noted that GSS is not suitable for more than points because its GPU memory consumption grows greatly with the number of points; hence, we only use GSS in the last layer of the network, where the number of points is less than . Leveraging IDS with its adaptability to the local density of points, our IGSAM achieves the best performance among the different sampling methods with only a marginal increase in computational cost compared with RS. The segmentation mIOUs using the different sampling methods are also shown in Table II. Our sampling method gives the best performance among all sampling methods on the S3DIS benchmark with an mIOU of 70.8%, which demonstrates the effectiveness of the proposed sampling strategy.
IV-C Experiments of Large-Scale Scene Understanding
We have evaluated our method extensively on nearly all existing large-scale point cloud understanding datasets, including ModelNet40 [49], ShapeNetPart [55], PartNet [31], S3DIS [1], NPM3D [35], Semantic3D [13], SemanticKITTI [2], and ScanNet [7]. Detailed information on the datasets and the accuracy and speed of our framework are shown in Table III. Qualitative results of large-scale real-world scene parsing are shown in Fig. 6, Fig. 7, Fig. 8, Fig. 9, Fig. 11, and Fig. 12, respectively. mIoUs of 78.2%, 70.8%, 82.3%, and 58.2% are attained on Semantic3D [13], S3DIS [1], NPM3D [35], and the fine-grained segmentation dataset PartNet [31], respectively, with real-time segmentation at about 18.6 Hz per LiDAR scan, which outperforms state-of-the-art methods in terms of accuracy and efficiency. Denote the network with the global attentive module in Fig. 5 as FGNet-v0 and the one without it as FGNet-v1. With global attention, feature correlations across a long spatial range are effectively captured, and widely distributed objects such as roads and buildings are segmented more precisely by FGNet-v0 than by FGNet-v1. The transfer learning results in Fig. 13 also demonstrate that our network learns an underlying latent feature representation that generalizes across new scenes.
Datasets  Main Features  Accuracy/mIoU (%)  Time (s)/points

ModelNet40 [49]  More than 12311 CAD models from 40 classes  93.1  0.0561
ShapeNetPart [55]  16,881 shapes from 16 classes, annotated with 50 parts in total  86.6  0.0552
PartNet [31]  More than 27000 3D models in 24 object classes with 18 parts per object  58.2  0.0501
S3DIS [1]  LiDAR scans of 6 different areas from 3 buildings, covering more than 6000 m2  70.8  0.0498
NPM3D [35]  Kilometer-scale point clouds captured from multiple city roads  82.3  0.0528
Semantic3D [13]  Largest-scale dataset of more than 4 billion annotated points with 8 classes  78.2  0.0523
SemanticKITTI [2]  Largest-scale LiDAR dataset for autonomous driving  53.8  0.0512
ScanNet [7]  1513 reconstructed indoor scenes from 707 different areas  68.5  0.0495
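The mIoU figures reported above can be reproduced from per-point predictions with a standard confusion-matrix computation. A minimal sketch, assuming integer labels in [0, num_classes):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union from per-point labels.
    IoU per class = TP / (TP + FP + FN); mIoU averages over the
    classes that appear in the prediction or the ground truth."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (target, pred), 1)      # rows: ground truth, cols: prediction
    tp = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - tp  # TP + FP + FN per class
    valid = union > 0
    return (tp[valid] / union[valid]).mean()
```
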
IV-D Visualization of the Network Modules
IV-D1 Visualization of the Deformable Convolutional Kernels
To better demonstrate the geometry-adaptive capacity of the deformable convolutions, the deformable kernels are visualized in Fig. 15. The kernel points adaptively deform to capture different geometric structures in the original point clouds. Hence, in the test phase, the specific geometric structures of an unseen scene are effectively captured and described by the deformable kernels. In this way, we model the geometry of the scene in a learnable way to better enhance structural awareness in per-point processing.
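The deformation visualized in Fig. 15 can be illustrated with a KPConv-style [42] linear correlation between neighbors and shifted kernel points. In this sketch the offsets are placeholders for the learned per-kernel-point shifts:

```python
import numpy as np

def kpconv_weights(neighbors, kernel_points, offsets, sigma):
    """Influence of each (deformed) kernel point on each neighbor,
    using the linear correlation h(y, g) = max(0, 1 - ||y - g|| / sigma).
    neighbors: (N, 3) coordinates relative to the center point,
    kernel_points: (K, 3), offsets: (K, 3) learned shifts (placeholder here)."""
    deformed = kernel_points + offsets                 # shifted kernel positions
    d = np.linalg.norm(neighbors[:, None, :] - deformed[None, :, :], axis=-1)
    return np.maximum(0.0, 1.0 - d / sigma)            # (N, K) influence matrix
```

A neighbor lying exactly on a deformed kernel point receives weight 1, and the influence decays linearly to 0 at distance sigma, which is what lets the deformed kernel points "attach" to local geometric structures.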
IV-D2 Visualization of the Learnt Features
To figure out what has been learnt by our network, the inner activations of the 2 core network modules are visualized in Fig. 16. It should be noted that the activations of the 2 designed branches are complementary to each other: the deformable convolution captures coarse-grained continuous geometry features, while the correlated features learning focuses on fine-grained isolated per-point features, as shown in Fig. 16, which is in accordance with our design.
IV-D3 Visualization of the Non-local Activation
To further demonstrate the effectiveness of the non-local module, we also visualize the non-local activation. Fig. 17 demonstrates that the non-local module captures long-range dependencies within the same semantic category, such as chairs or bookcases. Contexts that are far away from each other can be nicely modeled and captured. It can also be observed that the non-local activation gives a rough segmentation prediction for the category of the query point, which is advantageous for further semantic segmentation tasks.
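The long-range affinities visualized in Fig. 17 follow the embedded-Gaussian form of the non-local operation. A minimal sketch, with random placeholder weight matrices standing in for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(feats, w_q, w_k, w_v):
    """Embedded-Gaussian non-local operation: every point aggregates
    features from all points, weighted by pairwise similarity, with a
    residual connection back to the input. feats: (N, C)."""
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
    attn = softmax(q @ k.T)      # (N, N) long-range affinities, rows sum to 1
    return feats + attn @ v      # residual connection
```

The (N, N) attention matrix is exactly what is rendered in Fig. 17: one row of it, reshaped onto the point cloud, shows which distant points respond to a given query point.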
IV-E Efficiency and Online Performance
To give a more convincing evaluation of the real-time performance of our method on large-scale real-scene point cloud segmentation, we have tested our method against others on whole sequences of the SemanticKITTI [2] dataset. The sequences are captured and fed into the networks at 25 Hz. The inference time and GPU memory usage are shown in Fig. 18; PAIConv [9], a recently developed method for point cloud segmentation, is included in the comparison. With the residual learning framework, our method reaches 16.89 Hz, 19.53 Hz, 19.31 Hz, and 18.69 Hz on LiDAR sequences 02, 04, 05, and 09, respectively. Compared with RSSP [44] and RandLA-Net [15], the speed increases by 274% and 38.5% while the memory consumption is reduced by 46.5% and 8.6%, respectively, which is a prominent improvement in speed and memory efficiency.
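Throughput figures such as those above can be measured with a simple harness; `infer` below is a hypothetical per-scan inference callable, not part of our framework:

```python
import time

def measure_hz(infer, scans):
    """Average number of scans processed per second when running a
    per-scan inference callable over a sequence of scans."""
    t0 = time.perf_counter()
    for scan in scans:
        infer(scan)
    elapsed = time.perf_counter() - t0
    return len(scans) / elapsed
```

For an online system, a sustained rate above the 25 Hz capture rate of the sensor sequence would indicate the network never falls behind the incoming stream.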
IV-F Ablation Study of Network Modules
Ablations  mIoU (%)

Remove point-wise feature relation mining (PFM)  66.2
Remove geometric convolutional modelling (GCM)  59.8
Remove attentional aggregation (AG)  67.1
Remove global feature extraction  63.5
The full network framework  70.8
Replace the backbone with a 6-stage network  69.1
Replace the backbone with a 4-stage network  67.9
Without semantic context loss  68.2
With 2 RLB2 in each convolutional block  69.7
Choose in RLB1 and RLB2  70.6
Our designed network modules can easily be integrated into, or removed from, existing point cloud processing architectures. Ablation studies were conducted to validate the effectiveness and necessity of the designed modules. As shown in Table IV, the 4 core modules are removed from our network in turn, and the mIoU of 6-fold cross-validation on the S3DIS dataset is recorded. Removing geometric convolutional modelling results in an 11% performance drop, because learning the intrinsic geometric shape contexts of point clouds is vital for recognition. On the other hand, removing global and local correlated feature mining results in 5.6% and 7.3% drops in mIoU, which demonstrates that both local and long-range feature relationship capturing are essential to the segmentation task. Dropping the attentional aggregation also degrades the performance, because some meaningful features are no longer retained. Furthermore, the experiments show that the 5-stage network is superior to the 4-stage and 6-stage networks: shallow networks have poor fitting ability, while deeper networks result in over-sampling of the point clouds, both of which deteriorate the performance. We also tested larger settings in the RLB; however, the segmentation performance did not increase, so the smaller setting is adopted for the sake of memory efficiency.
V Conclusion
In this work, we have proposed FGNet, a general solution to large-scale point cloud understanding with real-time speed and state-of-the-art performance. The filtering and sampling methods are specially designed for large-scale point clouds with high efficiency, and they boost scene parsing performance. The network effectively models point cloud structures and finds feature correlations across a long spatial range. Leveraging feature pyramid based residual learning, hierarchical features at different resolutions are fused in a memory-efficient way. Experiments in challenging circumstances show that our approach outperforms state-of-the-art methods in terms of both accuracy and efficiency.
References

[1] (2016) 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1534–1543.
[2] (2019) SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9297–9307.
[3] (2020) FKAConv: feature-kernel alignment for point cloud convolution. In 15th Asian Conference on Computer Vision (ACCV).
[4] (2020) Deep category-level and regularized hashing with global semantic similarity learning. IEEE Transactions on Cybernetics.
[5] (2019) 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3075–3084.
[6] (2018) Speed-up 3D textureless object recognition against self-occlusion for intelligent manufacturing. IEEE Transactions on Cybernetics 49 (11), pp. 3887–3897.
[7] (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5828–5839.
[8] (2018) GVCNN: group-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 264–272.
[9] (2020) PAIConv: permutable anisotropic convolutional networks for learning on point clouds. arXiv preprint arXiv:2005.13135.
[10] (2020) Learning multiview 3D point cloud registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1759–1769.
[11] (2018) 3D semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9224–9232.
[12] (2014) 3D object recognition in cluttered scenes with local surface features: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (11), pp. 2270–2287.
[13] (2017) Semantic3D.net: a new large-scale point cloud classification benchmark. International Society for Photogrammetry and Remote Sensing.
[14] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] (2020) RandLA-Net: efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11108–11117.
[16] (2020) iKeyboard: fully imaginary keyboard on touch devices empowered by deep neural decoder. IEEE Transactions on Cybernetics.
[17] (2020) Virtual multi-view fusion for 3D semantic segmentation. arXiv preprint arXiv:2007.13138.
[18] (2019) Point cloud oversegmentation with graph-structured deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7440–7449.
[19] (2020) SampleNet: differentiable point cloud sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7578–7588.
[20] (2020) Spherical kernel for efficient graph convolution on 3D point clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[21] (2019) DeepGCNs: can GCNs go as deep as CNNs? In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9267–9276.
[22] (2020) End-to-end learning local multi-view descriptors for 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1919–1928.
[23] (2018) PointCNN: convolution on X-transformed points. In Advances in Neural Information Processing Systems (NIPS), pp. 820–830.
[24] (2020) Deep learning for generic object detection: a survey. International Journal of Computer Vision (IJCV) 128 (2), pp. 261–318.
[25] (2019) Weakly supervised deep learning for brain disease prognosis using MRI and incomplete clinical scores. IEEE Transactions on Cybernetics.
[26] (2019) DensePoint: learning densely contextual representation for efficient point cloud processing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5239–5248.
[27] (2019) Point-voxel CNN for efficient 3D deep learning. In Advances in Neural Information Processing Systems (NIPS), pp. 965–975.
[28] (2017) The concrete distribution: a continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR).
[29] (2019) VV-Net: voxel VAE net with group convolutions for point cloud segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 8500–8508.
[30] (2019) RangeNet++: fast and accurate LiDAR semantic segmentation. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4213–4220.
[31] (2019) PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 909–918.
[32] (2020) Adaptive hierarchical down-sampling for point cloud classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12956–12964.
[33] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 652–660.
[34] (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems (NIPS), pp. 5099–5108.
[35] (2018) Paris-Lille-3D: a large and high-quality ground-truth urban point cloud dataset for automatic segmentation and classification. The International Journal of Robotics Research (IJRR) 37 (6), pp. 545–557.
[36] (2009) Fast point feature histograms (FPFH) for 3D registration. In 2009 IEEE International Conference on Robotics and Automation (ICRA), pp. 3212–3217.
[37] (2008) Aligning point cloud views using persistent feature histograms. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3384–3391.
[38] (2018) Mining point cloud local structures by kernel correlation and graph pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4548–4557.
[39] (2020) PV-RCNN: point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10529–10538.
[40] (2018) The importance of sampling in meta-reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pp. 9280–9290.
[41] (2020) Searching efficient 3D architectures with sparse point-voxel convolution. arXiv preprint arXiv:2007.16100.
[42] (2019) KPConv: flexible and deformable convolution for point clouds. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 6411–6420.
[43] (2010) Unique signatures of histograms for local surface description. In European Conference on Computer Vision (ECCV), pp. 356–369.
[44] (2020) Real-time 3D semantic scene parsing with LiDAR sensors. IEEE Transactions on Cybernetics.
[45] (2020) Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[46] (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (ToG) 38 (5), pp. 1–12.
[47] (2019) SqueezeSegV2: improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud. In 2019 International Conference on Robotics and Automation (ICRA), pp. 4376–4382.
[48] (2019) PointConv: deep convolutional networks on 3D point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9621–9630.
[49] (2015) 3D ShapeNets: a deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1912–1920.
[50] (2020) SqueezeSegV3: spatially-adaptive convolution for efficient point-cloud segmentation. arXiv preprint arXiv:2004.01803.
[51] (2020) A new aggregation of DNN sparse and dense labeling for saliency detection. IEEE Transactions on Cybernetics.
[52] (2020) PointASNL: robust point clouds processing using nonlocal neural networks with adaptive sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5589–5598.
[53] (2020) Hierarchical deep embedding for aurora image retrieval. IEEE Transactions on Cybernetics.
[54] (2020) HVNet: hybrid voxel network for LiDAR based 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1631–1640.
[55] (2016) A scalable active framework for region annotation in 3D shape collections. ACM Transactions on Graphics (ToG) 35 (6), pp. 1–12.
[56] (2019) ShellNet: efficient point cloud convolutional neural networks using concentric shells statistics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1607–1616.
[57] (2019) Deformable ConvNets v2: more deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9308–9316.