RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds
We study the problem of efficient semantic segmentation for large-scale 3D point clouds. By relying on expensive sampling techniques or computationally heavy pre/post-processing steps, most existing approaches can only be trained on and operate over small-scale point clouds. In this paper, we introduce RandLA-Net, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection approaches. Although remarkably computation- and memory-efficient, random sampling can discard key features by chance. To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details. Extensive experiments show that our RandLA-Net can process 1 million points in a single pass, up to 200× faster than existing approaches. Moreover, our RandLA-Net clearly surpasses state-of-the-art approaches for semantic segmentation on two large-scale benchmarks: Semantic3D and SemanticKITTI.
Efficient semantic segmentation of large-scale 3D point clouds is a fundamental and essential capability for real-time intelligent systems, such as autonomous driving and augmented reality. A key challenge is that the raw point clouds acquired by depth sensors are typically irregularly sampled, unstructured and unordered. Although deep convolutional networks show excellent performance in structured 2D computer vision tasks, they cannot be directly applied to this type of unstructured data.
Recently, the pioneering work PointNet [qi2017pointnet] has emerged as a promising approach for directly processing 3D point clouds. It learns per-point features using shared multilayer perceptrons (MLPs). This is computationally efficient but fails to capture wider context information for each point. To learn richer local structures, many dedicated neural modules have been subsequently and rapidly introduced. These modules can be generally categorized as: 1) neighbouring feature pooling [qi2017pointnet++, sonet, RSNet, pointweb, zhang2019shellnet], 2) graph message passing [dgcnn, KCNet, local_spectral, GACNet, clusternet, HPEIN, Agglomeration], 3) kernel-based convolution [su2018splatnet, hua2018pointwise, wu2018pointconv, octree_guided, ACNN, GeoCNN, thomas2019kpconv, mao2019interpolated], and 4) attention-based aggregation [xie2018attentional, PCAN, Yang2019ModelingPC, AttentionalPointNet]. Although these approaches achieve impressive results for object recognition and semantic segmentation, almost all of them are limited to extremely small 3D point clouds (e.g., 4k points or 1×1 meter blocks) and cannot be directly extended to larger point clouds (e.g., millions of points spanning up to 200×200 meters). The reasons for this limitation are threefold. 1) The commonly used point-sampling methods of these networks are either computationally expensive or memory inefficient. For example, the widely employed farthest-point sampling [qi2017pointnet++] takes over 200 seconds to sample 10% of 1 million points. 2) Most existing local feature learners rely on computationally expensive kernelisation or graph construction, and are thus unable to process a massive number of points. 3) For a large-scale point cloud, which usually consists of hundreds of objects, the existing local feature learners are either incapable of capturing complex structures, or do so inefficiently, due to their limited receptive fields.
A handful of recent works have started to tackle the task of directly processing large-scale point clouds. SPG [landrieu2018large]
pre-processes the large point clouds as super graphs before applying neural networks to learn per-superpoint semantics. Both FCPN [rethage2018fully] and PCT [PCT] combine voxelization and point-level networks to process massive point clouds. Although they achieve decent segmentation accuracy, the pre-processing and voxelization steps are too computationally heavy to be deployed in real-time applications.
In this paper, we aim to design a memory- and computationally-efficient neural architecture which is able to directly process large-scale 3D point clouds in a single pass, without requiring any pre/post-processing steps such as voxelization, block partitioning or graph construction. However, this task is extremely challenging as it requires: 1) a memory- and computationally-efficient sampling approach to progressively downsample large-scale point clouds to fit within the limits of current GPUs, and 2) an effective local feature learner to progressively increase the receptive field size to preserve complex geometric structures. To this end, we first systematically demonstrate that random sampling is a key enabler for deep neural networks to efficiently process large-scale point clouds. However, random sampling can discard key semantic information, especially for objects with low point densities. To counter the potentially detrimental impact of random sampling, we propose a new and efficient local feature aggregation module to capture complex local structures over progressively smaller point sets.
Amongst existing sampling methods, farthest point sampling and inverse density sampling are the most frequently used for small-scale point clouds [qi2017pointnet++, wu2018pointconv, li2018pointcnn, pointweb, Groh2018flexconv]. As point sampling is such a fundamental step within these networks, we investigate the relative merits of different approaches in Section 3.2, both by examining their computational complexity and empirically by measuring their memory consumption and processing time. From this, we see that the commonly used sampling methods limit scaling towards large point clouds and act as a significant bottleneck to real-time processing. However, we identify random sampling as by far the most suitable component for large-scale point cloud processing, as it is fast and scales efficiently. Random sampling is not without cost: prominent point features may be dropped by chance, and it cannot be used directly in existing networks without incurring a performance penalty. To overcome this issue, we design a new local feature aggregation module in Section 3.3, which is capable of effectively learning complex local structures by progressively increasing the receptive field size in each neural layer. In particular, for each 3D point, we firstly introduce a local spatial encoding (LocSE) unit to explicitly preserve local geometric structures. Secondly, we leverage attentive pooling to automatically keep the useful local features. Thirdly, we stack multiple LocSE units and attentive pooling units as a dilated residual block, greatly increasing the effective receptive field for each point. Note that all these neural components are implemented as shared MLPs, and are therefore remarkably memory- and computation-efficient.
Overall, being built on the principles of simple random sampling and an effective local feature aggregator, our efficient neural architecture, named RandLA-Net (code and data are available at: https://github.com/QingyongHu/RandLANet), not only is up to 200× faster than existing approaches on large-scale point clouds, but also surpasses the state-of-the-art semantic segmentation methods on both the Semantic3D [Semantic3D] and SemanticKITTI [behley2019semantickitti] benchmarks. Figure 1 shows qualitative results of our approach. Our key contributions are:
We analyse and compare existing sampling approaches, identifying random sampling as the most suitable component for efficient learning on large-scale point clouds.
We propose an effective local feature aggregation module to automatically preserve complex local structures by progressively increasing the receptive field for each point.
We demonstrate significant memory and computational gains over baselines, and surpass the state-of-the-art semantic segmentation methods on multiple large-scale benchmarks.
To extract features from 3D point clouds, traditional approaches usually rely on hand-crafted features [point_signatures, fast_hist]. Recent learning-based approaches mainly include projection-based, voxel-based and point-based schemes, which are outlined here.
(1) Projection and Voxel Based Networks. To leverage the success of 2D CNNs, many works [li2016vehicle_rss, chen2017multi, PIXOR, pointpillars] project/flatten 3D point clouds onto 2D images to address the task of object detection. However, many geometric details are lost during the projection. Alternatively, point clouds can be voxelized into 3D grids, after which powerful 3D CNNs are applied as in [sparse, pointgrid, 4dMinkpwski, vvnet, Fast_point_rcnn]. Although they achieve leading results on semantic segmentation and object detection, their primary limitation is the heavy computation cost, especially when processing large-scale point clouds.
(2) Point Based Networks. Inspired by PointNet/PointNet++ [qi2017pointnet, qi2017pointnet++], many recent works have introduced sophisticated neural modules to learn per-point local features. These modules can be generally classified as 1) neighbouring feature pooling [sonet, RSNet, pointweb, zhang2019shellnet], 2) graph message passing [dgcnn, KCNet, local_spectral, GACNet, clusternet, HPEIN, Agglomeration], 3) kernel-based convolution [su2018splatnet, hua2018pointwise, wu2018pointconv, octree_guided, ACNN, GeoCNN, thomas2019kpconv, mao2019interpolated], and 4) attention-based aggregation [xie2018attentional, PCAN, Yang2019ModelingPC, AttentionalPointNet]. Although these networks have shown promising results on small point clouds, most of them cannot directly scale up to large scenarios due to their high computational and memory costs. Compared with them, our proposed RandLA-Net is distinguished in three ways: 1) it only relies on random sampling within the network, thereby requiring much less memory and computation; 2) the proposed local feature aggregator can obtain successively larger receptive fields by explicitly considering the local spatial relationships and point features, thus being more effective and robust for learning complex local patterns; 3) the entire network consists only of shared MLPs without relying on any expensive operations such as graph construction and kernelisation, and is therefore highly efficient for large-scale point clouds.
(3) Learning for Large-scale Point Clouds. SPG [landrieu2018large] pre-processes the large point clouds as super graphs to learn per-superpoint semantics. The recent FCPN [rethage2018fully] and PCT [PCT] apply both voxel-based and point-based networks to process the massive point clouds. However, both the graph partitioning and voxelisation are computationally expensive. In contrast, our efficient RandLA-Net is end-to-end trainable without requiring any additional pre/post-processing steps.
As illustrated in Figure 2, given a large-scale point cloud with millions of points spanning up to hundreds of meters, processing it with a deep neural network inevitably requires those points to be progressively and efficiently downsampled in each neural layer, without losing the useful point features. In our RandLA-Net, we propose to use the simple and fast approach of random sampling to greatly decrease point density, whilst applying a carefully designed local feature aggregator to retain prominent features. This allows the entire network to achieve an excellent trade-off between efficiency and effectiveness.
Existing point sampling approaches [qi2017pointnet++, li2018pointcnn, Groh2018flexconv, learning2sample, concrete, wu2018pointconv]
can be roughly classified into heuristic and learning-based approaches. However, there is still no standard sampling strategy that is suitable for large-scale point clouds. Therefore, we analyse and compare their relative merits and complexity as follows.
(1) Heuristic Sampling
Farthest Point Sampling (FPS): In order to sample K points from a large-scale point cloud P with N points, FPS returns a reordering of the metric space {p_1, ..., p_k, ..., p_K}, such that each p_k is the farthest point from the first k−1 points. FPS is widely used in [qi2017pointnet++, li2018pointcnn, wu2018pointconv] for semantic segmentation of small point sets. Although it has good coverage of the entire point set, its computational complexity is O(N²). For a large-scale point cloud (N ~ 10^6), FPS takes up to 200 seconds to process on a single GPU. This shows that FPS is not suitable for large-scale point clouds.
Inverse Density Importance Sampling (IDIS): To sample K points from N points, IDIS reorders all N points according to the density of each point, after which the top K points are selected [Groh2018flexconv]. Its computational complexity is approximately O(N). Empirically, it takes 10 seconds to process 10^6 points. Compared with FPS, IDIS is more efficient, but also more sensitive to outliers. However, it is still too slow for use in a real-time system.
Random Sampling (RS): Random sampling uniformly selects K points from the original N points. Its computational complexity is O(1), which is agnostic to the total number of input points, i.e., it is constant-time and hence inherently scalable. Compared with FPS and IDIS, random sampling has the highest computational efficiency, regardless of the scale of the input point clouds. It only takes 0.004s to process 10^6 points. A minimal sketch contrasting naive FPS and RS is given after this list.
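To make the complexity gap above concrete, the following minimal NumPy sketch contrasts a naive farthest point sampling loop with uniform random sampling on a toy cloud; the function names and the timing harness are illustrative assumptions, not the released implementation.

```python
import time
import numpy as np

def farthest_point_sampling(points, k):
    """Naive FPS: iteratively pick the point farthest from the already-selected set.
    Every iteration re-scans all N points, so the loop costs O(K*N) distance updates
    (the full reordering described above is O(N^2))."""
    n = points.shape[0]
    selected = np.zeros(k, dtype=np.int64)
    dist_to_set = np.full(n, np.inf)
    for i in range(1, k):
        # distance of every point to the most recently selected point
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        dist_to_set = np.minimum(dist_to_set, d)
        selected[i] = np.argmax(dist_to_set)
    return points[selected]

def random_sampling(points, k):
    """Uniform random sampling: the per-sample cost is independent of N."""
    idx = np.random.choice(points.shape[0], k, replace=False)
    return points[idx]

if __name__ == "__main__":
    pts = np.random.rand(10_000, 3).astype(np.float32)   # toy cloud with 1e4 points
    for fn in (farthest_point_sampling, random_sampling):
        t0 = time.time()
        fn(pts, pts.shape[0] // 4)                        # 4x decimation, as used later
        print(f"{fn.__name__}: {time.time() - t0:.4f}s")
```

Even at this small scale the gap is several orders of magnitude, and it widens rapidly as N grows towards 10^6, mirroring the timings reported above.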
(2) Learning-based Sampling
Generator-based Sampling (GS): GS [learning2sample] learns to generate a small set of points to approximately represent the original large point set. However, FPS is usually used to match the generated subset with the original set at inference time, incurring additional computation. In our experiments, it takes up to 1200 seconds to sample 10% of 10^6 points.
Continuous Relaxation based Sampling (CRS): CRS approaches [concrete, Yang2019ModelingPC] use the reparameterization trick to relax the sampling operation to a continuous domain for end-to-end training. In particular, each sampled point is learnt based on a weighted sum over the full point cloud. This results in a large weight matrix when sampling all the new points simultaneously with a one-pass matrix multiplication, leading to an unaffordable memory cost. For example, it is estimated to take more than a 300 GB memory footprint to sample 10% of 10^6 points.
Policy Gradient based Sampling (PGS):
PGS formulates the sampling operation as a Markov decision process
[show_attend]. It sequentially learns a probability to sample each point. However, the learnt probability has high variance due to the extremely large exploration space when the point cloud is large. For example, to sample 10% of 10^6 points, the exploration space is on the order of C(10^6, 10^5), making it unlikely that an effective sampling policy can be learnt. We empirically find that the network is difficult to converge if PGS is used for large point clouds.
Overall, FPS, IDIS and GS are too computationally expensive to be applied to large-scale point clouds. CRS approaches have an excessive memory footprint and PGS is hard to learn. By contrast, random sampling has the following two advantages: 1) it is remarkably computationally efficient as it is agnostic to the total number of input points, and 2) it does not require extra memory for computation. Therefore, we safely conclude that random sampling is by far the most suitable approach to process large-scale point clouds compared with all existing alternatives. However, random sampling may result in many useful point features being dropped. To overcome this, we propose a powerful local feature aggregation module, presented in Section 3.3 below.
As shown in Figure 3, our local feature aggregation module is applied to each 3D point in parallel and it consists of three neural units: 1) local spatial encoding (LocSE), 2) attentive pooling, and 3) dilated residual block.
(1) Local Spatial Encoding
Given a point cloud together with per-point features (e.g., raw RGB, or intermediate learnt features), this local spatial encoding unit explicitly embeds the xyz coordinates of all neighbouring points, such that the corresponding point features are always aware of their relative spatial locations. This allows the LocSE unit to explicitly observe the local geometric patterns, eventually benefiting the entire network in effectively learning complex local structures. In particular, this unit includes the following steps:
Finding Neighbouring Points. For the i-th point p_i, its neighbouring points are firstly gathered by the simple K nearest neighbours (KNN) algorithm for efficiency. Note that the KNN is based on the point-wise Euclidean distances.
Relative Point Position Encoding. For each of the K nearest points {p_i^1, ..., p_i^K} of the center point p_i, we explicitly encode the relative point position as follows:
r_i^k = \text{MLP}\left( p_i \oplus p_i^k \oplus (p_i - p_i^k) \oplus \| p_i - p_i^k \| \right)    (1)
where p_i and p_i^k are the xyz positions of the points, \oplus is the concatenation operation, and \| \cdot \| calculates the Euclidean distance between the neighbouring and center points. It seems that r_i^k is encoded from redundant point position information. Interestingly, this tends to aid the network to learn local features and obtains good performance in practice.
Point Feature Augmentation. For each neighbouring point p_i^k, the encoded relative point positions r_i^k are concatenated with its corresponding point features f_i^k, obtaining an augmented feature vector \hat{f}_i^k.
Eventually, the output of the LocSE unit is a new set of neighbouring features \hat{F}_i = {\hat{f}_i^1, ..., \hat{f}_i^K}, which explicitly encodes the local geometric structures for the center point p_i. We notice that the recent work [liu2019relation] also uses point positions to improve semantic segmentation. However, the positions are used to learn point scores in [liu2019relation], while our LocSE explicitly encodes the relative positions to augment the neighbouring point features.
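As a rough illustration of these three steps, the following NumPy sketch performs a brute-force KNN search, encodes the relative positions of Equation (1), and concatenates them with the neighbouring features; the shared MLP is reduced to a single random linear layer, and all names and dimensions are our own assumptions rather than the released code.

```python
import numpy as np

def knn(points, k):
    """Brute-force K nearest neighbours (indices) for every point, by Euclidean distance."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # (N, N)
    return np.argsort(d, axis=1)[:, :k]                                   # (N, K)

def locse(points, features, k=16, d_out=8):
    """Local Spatial Encoding: encode relative neighbour positions (Eq. 1) and
    concatenate them with the corresponding neighbour features."""
    idx = knn(points, k)                               # (N, K)
    neigh_xyz = points[idx]                            # (N, K, 3)
    center_xyz = points[:, None, :]                    # (N, 1, 3)
    rel_xyz = center_xyz - neigh_xyz                   # relative position p_i - p_i^k
    dist = np.linalg.norm(rel_xyz, axis=-1, keepdims=True)  # ||p_i - p_i^k||
    # concatenate absolute and relative spatial information ...
    spatial = np.concatenate(
        [np.repeat(center_xyz, k, axis=1), neigh_xyz, rel_xyz, dist], axis=-1)  # (N, K, 10)
    # ... and pass it through a shared MLP (a single random linear layer + ReLU here)
    w = np.random.randn(spatial.shape[-1], d_out) * 0.1
    r = np.maximum(spatial @ w, 0.0)                   # encoded positions r_i^k, (N, K, d_out)
    neigh_feat = features[idx]                         # neighbour features f_i^k, (N, K, d)
    return np.concatenate([r, neigh_feat], axis=-1)    # augmented features, (N, K, d_out + d)

# toy usage
pts = np.random.rand(128, 3).astype(np.float32)
feat = np.random.rand(128, 8).astype(np.float32)
print(locse(pts, feat).shape)  # (128, 16, 16)
```

In the actual network the shared MLP weights are learnt end-to-end and the output width matches the per-point feature dimension of the corresponding layer.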
(2) Attentive Pooling
This neural unit is used to aggregate the set of neighbouring point features \hat{F}_i. Existing works [qi2017pointnet++, li2018pointcnn] typically use max/mean pooling to hard-integrate the neighbouring features, resulting in the majority of the information being lost. By contrast, we turn to the powerful attention mechanism to automatically learn important local features. In particular, inspired by [Yang_ijcv2019], our attentive pooling unit consists of the following steps.
Computing Attention Scores. Given the set of local features \hat{F}_i = {\hat{f}_i^1, ..., \hat{f}_i^K}, we design a shared function g() to learn a unique attention score for each feature. Basically, the function g() consists of a shared MLP followed by softmax. It is formally defined as follows:
s_i^k = g(\hat{f}_i^k, W)    (2)
where W is the learnable weights of the shared MLP.
Weighted Summation. The learnt attention scores can be regarded as a soft mask which automatically selects the important features. Formally, these features are weighted and summed as follows:
\tilde{f}_i = \sum_{k=1}^{K} \left( \hat{f}_i^k \cdot s_i^k \right)    (3)
To summarize, given the input point cloud P, for the i-th point p_i, our LocSE and Attentive Pooling units learn to aggregate the geometric patterns and features of its K nearest points, and finally generate an informative feature vector \tilde{f}_i.
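A corresponding NumPy sketch of Equations (2) and (3) is given below; as before, the shared MLP is a single random linear layer, the softmax is taken over the K neighbours, and all names are illustrative rather than taken from the released code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_pooling(neigh_features):
    """Aggregate a set of neighbouring features (N, K, d) into one vector per point (N, d).
    Eq. (2): attention scores from a shared MLP followed by softmax over the K neighbours.
    Eq. (3): weighted sum of the neighbouring features with those scores."""
    d = neigh_features.shape[-1]
    w = np.random.randn(d, d) * 0.1                 # learnable weights of the shared MLP
    scores = softmax(neigh_features @ w, axis=1)    # (N, K, d), softmax over neighbours
    return (neigh_features * scores).sum(axis=1)    # (N, d)

# toy usage on the output of the LocSE sketch above
x = np.random.rand(128, 16, 16).astype(np.float32)
print(attentive_pooling(x).shape)  # (128, 16)
```

Unlike max pooling, every neighbour contributes to the aggregated feature, with its contribution weighted by the learnt score.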
(3) Dilated Residual Block
Since the large point clouds are going to be substantially downsampled, it is desirable to significantly increase the receptive field for each point, such that the geometric details of the input point clouds are more likely to be preserved, even if some points are dropped. As shown in Figure 3, inspired by the successful ResNet [he2016deep] and the effective dilated networks [DPC], we stack multiple LocSE and Attentive Pooling units together with a skip connection as a dilated residual block.
To further illustrate the capability of our dilated residual block, Figure 4 shows that the red 3D point observes up to K neighbouring points after the first LocSE/Attentive Pooling operation, and is then able to receive information from up to K² neighbouring points, i.e., its two-hop neighbourhood, after the second. This is a cheap way of dilating the receptive field and expanding the effective neighbourhood through feature propagation. Theoretically, the more units we stack, the more powerful this block becomes, as its sphere of reach grows. However, more units would inevitably sacrifice the overall computational efficiency, and the entire network would be more likely to overfit. In our RandLA-Net, we simply stack two sets of LocSE and Attentive Pooling units as the standard residual block, achieving a satisfactory balance between efficiency and effectiveness.
Overall, our local feature aggregation module is designed to effectively preserve complex local structures via explicitly considering neighbouring geometries and significantly increasing receptive fields. Moreover, this module only consists of feedforward MLPs, thus being computationally efficient.
We implement RandLA-Net by stacking multiple local feature aggregation modules and random sampling layers. The detailed architecture is presented in the Appendix. We use the Adam optimizer with default parameters. The initial learning rate is set to 0.01 and decreases by 5% after each epoch. The number of nearest points K is set to 16. To train our RandLA-Net in parallel, we sample a fixed number of points (~10^5) from each point cloud as the input. During testing, the whole raw point cloud is fed into our network to infer per-point semantics without any pre/post-processing. All experiments are conducted on an NVIDIA RTX 2080Ti GPU.
In this section, we empirically evaluate the efficiency of existing sampling approaches, including FPS, IDIS, RS, GS, CRS, and PGS, which have been discussed in Section 3.2. In particular, we conduct the following four groups of experiments.
Group 1. Given a small-scale point cloud (~10^3 points), we use each sampling approach to progressively downsample it. Specifically, the point cloud is downsampled in five steps on a single GPU, with only 25% of the points being retained at each step, i.e., a four-fold decimation ratio. This means that only (1/4)^5 ≈ 0.1% of the original points are left in the end. This downsampling strategy emulates the procedure used in PointNet++ [qi2017pointnet++]. For each sampling approach, we sum up its time and memory consumption for comparison.
Group 2/3/4. The total number of points is increased towards large-scale, i.e., around 10^4, 10^5 and 10^6 points respectively. We use the same five sampling steps as in Group 1; the retained point counts under this protocol are illustrated below.
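For concreteness, the number of points remaining after each of the five four-fold decimation steps can be computed directly (the group sizes 10^3 to 10^6 are our reading of the experiment description above):

```python
# points remaining after each of the five 4x decimation steps, per experiment group
for n in (10**3, 10**4, 10**5, 10**6):
    sizes = [n // 4**step for step in range(6)]
    print(f"N = {n:>9,}: {sizes}")
# a 1e6-point cloud, for example, keeps only ~1e3 points after the fifth step
```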
Analysis. Figure 5 compares the total time and memory consumption of each sampling approach when processing point clouds of different scales. It can be seen that: 1) For small-scale point clouds (~10^3 points), all sampling approaches tend to have similar time and memory consumption, and are unlikely to incur a heavy or limiting computational burden. 2) For large-scale point clouds (~10^6 points), FPS/IDIS/GS/CRS/PGS are either extremely time-consuming or memory-costly. By contrast, random sampling has superior time and memory efficiency overall. This result clearly demonstrates that most existing networks [qi2017pointnet++, li2018pointcnn, wu2018pointconv, liu2019relation, pointweb, Yang2019ModelingPC] can only be optimized on small blocks of point clouds primarily because they rely on expensive sampling approaches. Motivated by this, we use the efficient random sampling strategy in our RandLA-Net.
In this section, we systematically evaluate the overall efficiency of our RandLA-Net on real-world large-scale point clouds for semantic segmentation. In particular, we evaluate RandLA-Net on the SemanticKITTI [behley2019semantickitti] dataset, measuring the total time our network takes to process Sequence 08, which contains 4071 frames of point clouds in total. We also evaluate the time consumption of recent representative works [qi2017pointnet, qi2017pointnet++, li2018pointcnn, landrieu2018large, thomas2019kpconv] on the same dataset. For a fair comparison, we feed the same number of points (i.e., 81920) from each scan into each neural network.
In addition, we also evaluate the memory consumption of RandLA-Net and the baselines. In particular, we not only report the total parameters of each network, but also measure the maximum number of 3D points each network can take as input in a single pass to infer per-point semantics. Note that all experiments are conducted on the same machine with an AMD 3700X @ 3.6 GHz CPU and an NVIDIA RTX 2080Ti GPU.
Analysis. Table 1 quantitatively shows the total time and memory consumption of different approaches. It can be seen that: 1) SPG [landrieu2018large] has the lowest number of network parameters, but takes the longest time to process the point clouds due to the expensive geometrical partitioning and super-graph construction steps; 2) PointNet++ [qi2017pointnet++] and PointCNN [li2018pointcnn] are also computationally expensive, mainly because of the FPS sampling operation; 3) PointNet [qi2017pointnet] and KPConv [thomas2019kpconv] are unable to take extremely large-scale point clouds (e.g., 10^6 points) in a single pass due to their memory-inefficient operations; 4) thanks to the simple random sampling together with the efficient MLP-based local feature aggregator, our RandLA-Net takes the shortest time (translating to roughly 23 frames per second) to infer the semantic labels for each large-scale point cloud (up to 10^6 points).
Methods  Total time (s)  Parameters (millions)  Maximum inference points (millions)
PointNet (Vanilla) [qi2017pointnet]  192  0.8  0.49
PointNet++ (SSG) [qi2017pointnet++]  9831  0.97  0.98
PointCNN [li2018pointcnn]  8142  11  0.05
SPG [landrieu2018large]  43584  0.25  –
KPConv [thomas2019kpconv]  717  14.9  0.54
RandLA-Net (Ours)  176  0.95  1.15
Methods  mIoU (%)  OA (%)  man-made.  natural.  high veg.  low veg.  buildings  hard scape  scanning art.  cars

SnapNet [snapnet]  59.1  88.6  82.0  77.3  79.7  22.9  91.1  18.4  37.3  64.4
SEGCloud [tchapmi2017segcloud]  61.3  88.1  83.9  66.0  86.0  40.5  91.1  30.9  27.5  64.3 
RF_MSSF [RF_MSSF]  62.7  90.3  87.6  80.3  81.8  36.4  92.2  24.1  42.6  56.6 
MSDeepVoxNet [msdeepvoxnet]  65.3  88.4  83.0  67.2  83.8  36.7  92.4  31.3  50.0  78.2 
ShellNet [zhang2019shellnet]  69.3  93.2  96.3  90.4  83.9  41.0  94.2  34.7  43.9  70.2 
GACNet [GACNet]  70.8  91.9  86.4  77.7  88.5  60.6  94.2  37.3  43.5  77.8 
SPG [landrieu2018large]  73.2  94.0  97.4  92.6  87.9  44.0  83.2  31.0  63.5  76.2 
KPConv [thomas2019kpconv]  74.6  92.9  90.9  82.2  84.2  47.9  94.9  40.0  77.3  79.7 
RandLA-Net (Ours)  76.0  94.4  96.5  92.0  85.1  50.3  95.0  41.1  68.2  79.4
In this section, we evaluate the semantic segmentation performance of our RandLA-Net on three large-scale public datasets: Semantic3D [Semantic3D], SemanticKITTI [behley2019semantickitti], and S3DIS [2D3DS].
Methods  Size  mIoU(%)  Params(M)  road  sidewalk  parking  other-ground  building  car  truck  bicycle  motorcycle  other-vehicle  vegetation  trunk  terrain  person  bicyclist  motorcyclist  fence  pole  traffic-sign

PointNet [qi2017pointnet]  50K pts  14.6  3  61.6  35.7  15.8  1.4  41.4  46.3  0.1  1.3  0.3  0.8  31.0  4.6  17.6  0.2  0.2  0.0  12.9  2.4  3.7
SPG [landrieu2018large]  50K pts  17.4  0.25  45.0  28.5  0.6  0.6  64.3  49.3  0.1  0.2  0.2  0.8  48.9  27.2  24.6  0.3  2.7  0.1  20.8  15.9  0.8
SPLATNet [su2018splatnet]  50K pts  18.4  0.8  64.6  39.1  0.4  0.0  58.3  58.2  0.0  0.0  0.0  0.0  71.1  9.9  19.3  0.0  0.0  0.0  23.1  5.6  0.0
PointNet++ [qi2017pointnet++]  50K pts  20.1  6  72.0  41.8  18.7  5.6  62.3  53.7  0.9  1.9  0.2  0.2  46.5  13.8  30.0  0.9  1.0  0.0  16.9  6.0  8.9
TangentConv [tangentconv]  50K pts  40.9  0.4  83.9  63.9  33.4  15.4  83.4  90.8  15.2  2.7  16.5  12.1  79.5  49.3  58.1  23.0  28.4  8.1  49.0  35.8  28.5
SqueezeSeg [wu2018squeezeseg]  64*2048 pixels  29.5  1  85.4  54.3  26.9  4.5  57.4  68.8  3.3  16.0  4.1  3.6  60.0  24.3  53.7  12.9  13.1  0.9  29.0  17.5  24.5
SqueezeSegV2 [wu2019squeezesegv2]  64*2048 pixels  39.7  1  88.6  67.6  45.8  17.7  73.7  81.8  13.4  18.5  17.9  14.0  71.8  35.8  60.2  20.1  25.1  3.9  41.1  20.2  36.3
DarkNet21Seg [behley2019semantickitti]  64*2048 pixels  47.4  25  91.4  74.0  57.0  26.4  81.9  85.4  18.6  26.2  26.5  15.6  77.6  48.4  63.6  31.8  33.6  4.0  52.3  36.0  50.0
DarkNet53Seg [behley2019semantickitti]  64*2048 pixels  49.9  50  91.8  74.6  64.8  27.9  84.1  86.4  25.5  24.5  32.7  22.6  78.3  50.1  64.0  36.2  33.6  4.7  55.0  38.9  52.2
RandLA-Net (Ours)  50K pts  50.3  0.95  90.4  67.9  56.9  15.5  81.1  94.0  42.7  19.8  21.4  38.7  78.3  60.3  59.0  47.5  48.8  4.6  49.7  44.2  38.1
(1) Evaluation on Semantic3D. The Semantic3D dataset [Semantic3D] consists of 15 point clouds for training and 15 for online testing. Each point cloud is extremely large, covering up to 160×240×30 meters in real-world 3D space. The raw 3D points belong to 8 classes and contain 3D coordinates, RGB information, and intensity. We only use the 3D coordinates and color information to train and test our RandLA-Net. The mean Intersection over Union (mIoU) and Overall Accuracy (OA) over all classes are used as the standard metrics. For a fair comparison, we only include the results of recently published strong baselines [snapnet, tchapmi2017segcloud, RF_MSSF, msdeepvoxnet, zhang2019shellnet, GACNet, landrieu2018large] and the current state-of-the-art approach KPConv [thomas2019kpconv].
Table 2 presents the quantitative results of different approaches. RandLA-Net clearly outperforms all existing methods in terms of both mIoU and OA. Notably, RandLA-Net also achieves superior performance on six of the eight classes, except low vegetation and scanning artifact.
(2) Evaluation on SemanticKITTI. SemanticKITTI [behley2019semantickitti] consists of 43552 densely annotated LIDAR scans belonging to 21 sequences. Each scan is a large-scale point cloud with ~10^5 points, spanning up to 160×160×20 meters in 3D space. Officially, sequences 00–07 and 09–10 (19130 scans) are used for training, sequence 08 (4071 scans) for validation, and sequences 11–21 (20351 scans) for online testing. The raw 3D points only have 3D coordinates without color information. The mIoU score over 19 categories is used as the standard metric.
Table 3 shows a quantitative comparison of our RandLA-Net with two families of recent approaches, i.e., 1) point-based methods [qi2017pointnet, landrieu2018large, su2018splatnet, qi2017pointnet++, tangentconv] and 2) projection-based approaches [wu2018squeezeseg, wu2019squeezesegv2, behley2019semantickitti], and Figure 6 shows some qualitative results of RandLA-Net on the validation split. It can be seen that our RandLA-Net surpasses all point-based approaches [qi2017pointnet, landrieu2018large, su2018splatnet, qi2017pointnet++, tangentconv] by a large margin. We also outperform all projection-based methods [wu2018squeezeseg, wu2019squeezesegv2, behley2019semantickitti], but not significantly, primarily because DarkNet [behley2019semantickitti] achieves much better results on small object categories such as traffic-sign. However, our RandLA-Net has far fewer network parameters than DarkNet [behley2019semantickitti] and is more computationally efficient, as it does not require the costly pre- and post-projection processing steps.
(3) Evaluation on S3DIS. The S3DIS dataset [2D3DS] consists of 271 rooms belonging to 6 large areas. Each point cloud is a medium-sized single room (~20×15×5 meters) with densely sampled 3D points. To evaluate the semantic segmentation of our RandLA-Net, we use the standard 6-fold cross-validation in our experiments. The mean IoU (mIoU), mean class Accuracy (mAcc) and Overall Accuracy (OA) over the 13 classes are compared.
As shown in Table 4, our RandLA-Net achieves on-par or better performance than state-of-the-art methods. Note that most of these baselines [qi2017pointnet++, li2018pointcnn, pointweb, zhang2019shellnet, dgcnn, chen2019lsanet] tend to use sophisticated but expensive operations or samplings to optimize the networks on small blocks (e.g., 1×1 meter) of point clouds, and the relatively small rooms act in their favour, as they can be divided into such tiny blocks. By contrast, RandLA-Net takes the entire rooms as input and is able to efficiently infer per-point semantics in a single pass.
Methods  OA(%)  mAcc(%)  mIoU(%)

PointNet [qi2017pointnet]  78.6  66.2  47.6 
PointNet++ [qi2017pointnet++]  81.0  67.1  54.5 
DGCNN [dgcnn]  84.1    56.1 
3PRNN [3PRNN]  86.9    56.3 
RSNet [RSNet]    66.5  56.5 
SPG [landrieu2018large]  85.5  73.0  62.1 
LSANet [chen2019lsanet]  86.8    62.2 
PointCNN [li2018pointcnn]  88.1  75.6  65.4 
PointWeb [pointweb]  87.3  76.2  66.7 
ShellNet [zhang2019shellnet]  87.1    66.8 
HEPIN [HPEIN]  88.2    67.8 
KPConv [thomas2019kpconv]    79.1  70.6 
RandLA-Net (Ours)  87.2  81.5  68.5
Since the impact of random sampling has been fully studied in Section 4.1, we conduct the following ablation studies for our local feature aggregation module. All ablated networks are trained on sequences 00–07 and 09–10, and tested on sequence 08 of the SemanticKITTI dataset [behley2019semantickitti].
(1) Removing local spatial encoding (LocSE). This unit enables each 3D point to explicitly observe its local geometry. After removing LocSE, we directly feed the local point features into the subsequent attentive pooling.
(2–4) Replacing attentive pooling by max/mean/sum pooling. The attentive pooling unit learns to automatically combine all local point features. By comparison, the widely used max/mean/sum poolings tend to hard-select or combine features, and their performance may therefore be suboptimal.
(5) Simplifying the dilated residual block. The dilated residual block stacks multiple LocSE units and attentive poolings, substantially dilating the receptive field for each 3D point. By simplifying this block, we use only one LocSE unit and attentive pooling per layer, i.e., we do not chain multiple units as in our original RandLA-Net.
Table 5 compares the mIoU scores of all ablated networks. From this, we can see that: 1) The greatest impact is caused by the removal of the chained spatial embedding and attentive pooling blocks. This is highlighted in Figure 4, which shows how using two chained blocks allows information to be propagated from a wider neighbourhood, i.e., approximately K² points as opposed to just K. This is especially critical with random sampling, which is not guaranteed to preserve a particular set of points. 2) The removal of the local spatial encoding unit shows the next greatest impact on performance, demonstrating that this module is necessary to effectively learn local and relative geometric context. 3) Removing the attention module diminishes performance because it can no longer effectively retain useful features. From this ablation study, we can see how the proposed neural units complement each other to attain our state-of-the-art performance.
Ablated networks  mIoU(%)

(1) Remove local spatial encoding  45.1
(2) Replace with max-pooling  47.1
(3) Replace with mean-pooling  45.2
(4) Replace with sum-pooling  45.7
(5) Simplify dilated residual block  41.5
(6) The full framework (RandLA-Net)  52.0
In this paper, we demonstrated that it is possible to efficiently and effectively segment large-scale point clouds by using a lightweight network architecture. In contrast to most current approaches that rely on expensive sampling strategies, we instead use random sampling in our framework to significantly reduce the memory footprint and computational cost. A local feature aggregation module is also introduced to effectively preserve useful features from a wide neighbourhood. Extensive experiments on multiple benchmarks demonstrate the high efficiency and the state-of-the-art performance of our approach. It would be interesting to extend our framework to end-to-end 3D instance segmentation on large-scale point clouds by drawing on the recent work [3dbonet], and also to real-time dynamic point cloud processing [liu2019meteornet].
We provide the implementation details of the different sampling approaches evaluated in Section 4.1. To sample K points (point features) from a large-scale point cloud P with N points (point features):
Farthest Point Sampling (FPS): We follow the implementation provided by PointNet++ [qi2017pointnet++] (https://github.com/charlesq34/pointnet2), which is also widely used in [li2018pointcnn, wu2018pointconv, liu2019relation, chen2019lsanet, pointweb]. In particular, FPS is implemented as an operator running on GPU.
Inverse Density Importance Sampling (IDIS): Given a point p_i, its density ρ is approximated by calculating the summation of the distances between p_i and its t nearest points [Groh2018flexconv]. Formally:

\rho(p_i) = \sum_{j=1}^{t} \left\| p_i - p_i^j \right\|    (4)

where p_i^j represents the coordinates (i.e., xyz) of the j-th point of the neighbour point set of p_i, and t is set to 16. All the points are ranked according to the inverse density 1/ρ of points. Finally, the top K points are selected.
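A brute-force NumPy sketch of this ranking is given below; the direction of the ranking (keeping the K points with the largest inverse density) follows our reading of the description above, and the function name is our own.

```python
import numpy as np

def inverse_density_sampling(points, k, t=16):
    """Approximate each point's density as the sum of distances to its t nearest
    neighbours (Eq. 4), rank all points by the inverse density 1/rho, and keep
    the top-K points. Brute-force O(N^2) neighbour search, for illustration only."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # (N, N)
    rho = np.sort(d, axis=1)[:, 1:t + 1].sum(axis=1)                      # Eq. (4)
    keep = np.argsort(1.0 / (rho + 1e-12))[-k:]                           # top-K by 1/rho
    return points[keep]

pts = np.random.rand(2048, 3).astype(np.float32)
print(inverse_density_sampling(pts, 512).shape)  # (512, 3)
```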
Random Sampling (RS): We implement random sampling with the python numpy package. Specifically, we first use the numpy function numpy.random.choice() to generate K indices. We then gather the corresponding spatial coordinates and per-point features from the point cloud using these indices.
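A minimal sketch of this procedure (the array and function names are placeholders):

```python
import numpy as np

def random_sample(xyz, features, k):
    """Draw K indices uniformly without replacement and gather both the
    spatial coordinates and the per-point features."""
    idx = np.random.choice(xyz.shape[0], k, replace=False)
    return xyz[idx], features[idx]

xyz = np.random.rand(100_000, 3).astype(np.float32)
features = np.random.rand(100_000, 8).astype(np.float32)
sub_xyz, sub_features = random_sample(xyz, features, 25_000)
```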
Generator based Sampling (GS): The implementation follows the code provided by [learning2sample] (https://github.com/orendv/learning_to_sample). We first train a ProgressiveNet [learning2sample] to transform the raw point clouds into ordered point sets according to their relevance to the task. After that, the first K points are kept, while the rest are discarded.
Continuous Relaxation based Sampling (CRS): CRS is implemented with the self-attended gumbel-softmax sampling [concrete][Yang2019ModelingPC]. Given a point cloud P with 3D coordinates and per-point features, we firstly estimate a probability score vector s through a score function parameterized by an MLP layer, which learns a categorical distribution. Then, Gumbel noise g is drawn from the distribution Gumbel(0, 1). Each sampled point feature vector y is calculated as follows:

y = \sum_{i=1}^{N} \frac{\exp\left( (\log s_i + g_i)/\tau \right)}{\sum_{j=1}^{N} \exp\left( (\log s_j + g_j)/\tau \right)} \, p_i    (5)

where s_i and g_i indicate the i-th element of the vectors s and g respectively, p_i represents the i-th row vector in the input matrix P, and τ > 0 is the annealing temperature. As τ → 0, Equation 5 approaches the discrete distribution and samples each row vector in P with probability p(y = p_i) = s_i.
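The following NumPy sketch illustrates Equation (5); the score function is reduced to a single random linear layer, and all names are our own rather than the released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def crs_sample(p, k, tau=1.0):
    """Continuous-relaxation sampling (Eq. 5): every sampled 'point' is a soft,
    Gumbel-softmax weighted sum over all N input rows, so a (K, N) weight matrix
    is needed when all K points are drawn in one matrix multiplication."""
    n, d = p.shape
    w = np.random.randn(d, 1) * 0.1                        # stand-in for the score MLP
    log_s = np.log(softmax((p @ w).ravel()))               # log categorical scores, (N,)
    g = np.random.gumbel(size=(k, n))                      # Gumbel(0, 1) noise per sample
    weights = softmax((log_s[None, :] + g) / tau, axis=1)  # (K, N) relaxation weights
    return weights @ p                                     # (K, d) soft-sampled rows

p = np.random.rand(4096, 6).astype(np.float32)             # e.g. xyz + rgb
print(crs_sample(p, 1024).shape)                           # (1024, 6)
# for N = 1e6 and K = 1e5 the (K, N) float32 weight matrix alone occupies hundreds of
# gigabytes, which is the memory blow-up discussed in Section 3.2
```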
Policy Gradients based Sampling (PGS): Given a point feature set with 3D coordinates and per point features, we first predict a score for each point, which is learnt by a MLP function, i.e., , where is a zeromean Gaussian noise with the variance for random exploration. After that, we sample vectors in with the top scores. To properly update the score function, we apply REINFORCE algorithm [sutton2000policy] as the gradient estimator. By modeling the entire sampling operation as a sequential Markov Decision Process (MDP), we formulate the policy function as:
(6) 
where is the binary decision of whether to sample the vector in , is the network parameter of the MLP. Then we apply the segmentation accuracy R as the reward value for the entire sampling process and maximize our reward function with the following estimated gradients:
(7)  
where is the batch size, and are two control variates [mnih2014neural] for alleviating the high variance problem of policy gradients.
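As a generic illustration of the score-function (REINFORCE) estimator described above, rather than the exact Equations (6)–(7), the following NumPy sketch computes a policy-gradient estimate for a per-point sample/skip decision; the linear score function, the toy reward, and all names are our own assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pgs_policy_gradient(p, k, reward_fn, theta, batch=4):
    """REINFORCE-style gradient estimate for a per-point Bernoulli sampling policy.
    A linear score parameterised by theta gives each point a sampling probability;
    the log-policy gradient of the chosen binary decisions is weighted by the
    episode reward (e.g. segmentation accuracy) and averaged over a small batch."""
    grads = np.zeros_like(theta)
    for _ in range(batch):
        probs = sigmoid(p @ theta)                           # per-point probability
        noisy = probs + np.random.gumbel(size=probs.shape)   # noisy top-K exploration
        keep = np.argsort(-noisy)[:k]
        a = np.zeros(p.shape[0])
        a[keep] = 1.0                                        # binary sampling decisions
        r = reward_fn(p[keep])                               # reward for this subset
        dlogpi = ((a - probs)[:, None] * p).sum(axis=0)      # d/dtheta log pi(a|p)
        grads += r * dlogpi
    return grads / batch

p = np.random.rand(2048, 3)
theta = np.zeros(3)
grad = pgs_policy_gradient(p, 512, reward_fn=lambda sub: sub[:, 2].mean(), theta=theta)
print(grad)
```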
Figure 7 shows the detailed architecture of RandLA-Net. The network follows the widely-used encoder-decoder architecture with skip connections. The input point cloud is first fed to a shared MLP layer to extract per-point features. Four encoding and four decoding layers are then used to learn features for each point. Finally, three fully-connected layers and a dropout layer are used to predict the semantic label of each point. The details of each part are as follows:
Network Input: The input is a large-scale point cloud with a size of N × d_in (the batch dimension is dropped for simplicity), where N is the number of points and d_in is the feature dimension of each input point. For both the S3DIS [2D3DS] and Semantic3D [Semantic3D] datasets, each point is represented by its 3D coordinates and color information (i.e., xyz-RGB), while each point of the SemanticKITTI [behley2019semantickitti] dataset is only represented by its 3D coordinates.
Encoding Layers: Four encoding layers are used in our network to progressively reduce the size of the point clouds and increase the per-point feature dimensions. Each encoding layer consists of a local feature aggregation module (Section 3.3) and a random sampling operation (Section 3.2). The point cloud is downsampled with a four-fold decimation ratio. In particular, only 25% of the point features are retained after each layer, i.e., (N → N/4 → N/16 → N/64 → N/256). Meanwhile, the per-point feature dimension is gradually increased at each layer to preserve more information, i.e., (8 → 32 → 128 → 256 → 512).
Decoding Layers: Four decoding layers are used after the above encoding layers. For each layer in the decoder, we first use the KNN algorithm to find one nearest neighbouring point for each query point; the point feature set is then upsampled through nearest-neighbour interpolation. Next, the upsampled feature maps are concatenated with the intermediate feature maps produced by the encoding layers through skip connections, after which a shared MLP is applied to the concatenated feature maps.
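The nearest-neighbour interpolation used here simply copies, for every query point, the feature of its single closest point in the coarser set; a brute-force sketch (names are placeholders):

```python
import numpy as np

def nearest_neighbour_upsample(query_xyz, coarse_xyz, coarse_features):
    """For each query point, copy the feature of its single nearest point
    in the downsampled (coarse) point set."""
    d = np.linalg.norm(query_xyz[:, None, :] - coarse_xyz[None, :, :], axis=-1)  # (Nq, Nc)
    nearest = d.argmin(axis=1)                                                   # (Nq,)
    return coarse_features[nearest]                                              # (Nq, d)

query_xyz = np.random.rand(4096, 3).astype(np.float32)
coarse_xyz = query_xyz[np.random.choice(4096, 1024, replace=False)]
coarse_features = np.random.rand(1024, 32).astype(np.float32)
print(nearest_neighbour_upsample(query_xyz, coarse_xyz, coarse_features).shape)  # (4096, 32)
```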
Final Semantic Prediction: The final semantic label of each point is obtained through three shared fully-connected layers (N, 64) → (N, 32) → (N, n_class) and a dropout layer. The dropout ratio is 0.5.
Network Output: The output of RandLA-Net is the predicted semantics of all points, with a size of N × n_class, where n_class is the number of classes.
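Putting these pieces together, the per-point tensor sizes through the network can be traced schematically as follows (the decimation ratios and feature widths are those stated above; everything else is a simplified illustration rather than the released TensorFlow graph):

```python
# schematic shape trace through the encoder (batch dimension omitted)
n, d_in = 10**5, 6                       # e.g. xyz + rgb input
widths = [8, 32, 128, 256, 512]          # per-point feature dimensions stated above

n_pts = n
print(f"input       : ({n}, {d_in}) -> shared MLP -> ({n}, {widths[0]})")
for i, w in enumerate(widths[1:], start=1):
    n_pts //= 4                          # random sampling with a 4x decimation ratio
    print(f"encoder {i}   : ({n_pts}, {w})")

# the decoder mirrors these four stages with nearest-neighbour upsampling and skip
# connections, ending with FC layers (N, 64) -> (N, 32) -> (N, n_class) and dropout 0.5
```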
In Section 3.3, we encode the relative point position based on the following equation:
r_i^k = \text{MLP}\left( p_i \oplus p_i^k \oplus (p_i - p_i^k) \oplus \| p_i - p_i^k \| \right)    (8)
We further investigate the effects of different spatial information in our framework. In particular, we conduct the following additional ablation experiments for LocSE:
1) Encoding the coordinates of the point p_i only.
2) Encoding the coordinates of the neighbouring points p_i^k only.
3) Encoding the coordinates of the point p_i and its neighbouring points p_i^k.
4) Encoding the coordinates of the point p_i, the neighbouring points p_i^k, and the Euclidean distance ||p_i − p_i^k||.
5) Encoding the coordinates of the point p_i, the neighbouring points p_i^k, and the relative position p_i − p_i^k.
LocSE  mIoU(%) 

(1)  40.7 
(2)  41.1 
(3)  42.5 
(4)  44.1 
(5)  48.8 
(6) (The Full Unit)  52.0 
Table 6 compares the mIoU scores of all ablated networks. We can see that: 1) Explicitly encoding all spatial information leads to the best mIoU performance. 2) The relative position p_i − p_i^k plays an important role in this component, primarily because the relative point position enables the network to be aware of the local geometric patterns. 3) Only encoding the point positions p_i or p_i^k is unlikely to improve the performance, because the relative local geometric patterns are not explicitly encoded.
In our RandLA-Net, we stack two LocSE and Attentive Pooling units as the standard dilated residual block to gradually increase the receptive field. To further evaluate how the number of aggregation units in the dilated residual block impacts the entire network, we conduct two additional groups of experiments.
1) We simplify the dilated residual block by using only one LocSE unit and attentive pooling.
2) We add one more LocSE unit and attentive pooling, i.e., three aggregation units are chained together.
Dilated residual block  mIoU(%) 

(1) one aggregation unit  41.9 
(2) three aggregation units  48.7 
(3) two aggregation units (The Standard Block)  52.0
Table 7 shows the mIoU scores of the different ablated networks on the validation split of the SemanticKITTI [behley2019semantickitti] dataset. It can be seen that: 1) Only one aggregation unit in the dilated residual block leads to a significant drop in segmentation performance, due to the limited receptive field. 2) Three aggregation units per block do not improve the accuracy as expected. This is because the significantly increased receptive field and the larger number of trainable parameters make the network prone to overfitting.
To better understand the attentive pooling, it is desirable to visualize the learned attention scores. However, since the attentive pooling operates on a relatively small local point set (i.e., K=16), it is hardly possible to recognize meaningful shapes from such small local regions. Alternatively, we visualize the learned attention weight matrix W defined in Equation 2 for each layer. As shown in Figure 8, the attention weights have large values in the first encoding layers, then gradually become smooth and stable in subsequent layers. This shows that the attentive pooling tends to choose prominent or key point features at the beginning. After the point cloud has been significantly downsampled, the attentive pooling layer tends to retain the majority of those point features.
More qualitative results of RandLA-Net on the Semantic3D [Semantic3D] dataset (reduced-8) are shown in Figure 9.
Figure 10 shows more qualitative results of our RandLA-Net on the validation set of SemanticKITTI [behley2019semantickitti]. The red boxes showcase the failure cases. It can be seen that points belonging to other-vehicle are likely to be misclassified as car, mainly because partial point clouds without colors are extremely difficult to distinguish between these two similar classes. In addition, our approach tends to fail on several minority classes such as bicycle, motorcycle, bicyclist and motorcyclist, due to the extremely imbalanced point distribution in the dataset. For example, the number of points for vegetation is 7000 times more than that of motorcyclist.
Methods  OA(%)  mAcc(%)  mIoU(%)  ceil.  floor  wall  beam  col.  wind.  door  table  chair  sofa  book.  board  clut.

PointNet [qi2017pointnet]  78.6  66.2  47.6  88.0  88.7  69.3  42.4  23.1  47.5  51.6  54.1  42.0  9.6  38.2  29.4  35.2 
RSNet [RSNet]    66.5  56.5  92.5  92.8  78.6  32.8  34.4  51.6  68.1  59.7  60.1  16.4  50.2  44.9  52.0 
3PRNN [3PRNN]  86.9    56.3  92.9  93.8  73.1  42.5  25.9  47.6  59.2  60.4  66.7  24.8  57.0  36.7  51.6 
SPG [landrieu2018large]  86.4  73.0  62.1  89.9  95.1  76.4  62.8  47.1  55.3  68.4  73.5  69.2  63.2  45.9  8.7  52.9 
PointCNN [li2018pointcnn]  88.1  75.6  65.4  94.8  97.3  75.8  63.3  51.7  58.4  57.2  71.6  69.11  39.1  61.2  52.2  58.6 
PointWeb [pointweb]  87.3  76.2  66.7  93.5  94.2  80.8  52.4  41.3  64.9  68.1  71.4  67.1  50.3  62.7  62.2  58.5 
ShellNet [zhang2019shellnet]  87.1    66.8  90.2  93.6  79.9  60.4  44.1  64.9  52.9  71.6  84.7  53.8  64.6  48.6  59.4 
KPConv [thomas2019kpconv]    79.1  70.6  93.6  92.4  83.1  63.9  54.3  66.1  76.6  57.8  64.0  69.3  74.9  61.3  60.3 
RandLA-Net (Ours)  87.1  81.5  68.5  92.7  95.6  79.2  61.7  47.0  63.1  67.7  68.9  74.2  55.3  63.4  63.0  58.7
We report the detailed 6-fold cross-validation results of our RandLA-Net on S3DIS [2D3DS] in Table 8. Figure 11 shows more qualitative results of our approach.