(CVPR 2021) Rank 1st in the public leaderboard of SemanticKITTI Panoptic Segmentation (2020-11-16)
State-of-the-art methods for large-scale driving-scene LiDAR semantic segmentation often project the point clouds into 2D space and process them there. The projection methods include spherical projection, bird-eye view projection, etc. Although this process makes the point cloud suitable for 2D CNN-based networks, it inevitably alters and abandons the 3D topology and geometric relations. A straightforward way to avoid the problems of 3D-to-2D projection is to keep the 3D representation and process the points in 3D space. In this work, we first perform an in-depth analysis of different representations and backbones in 2D and 3D space, and reveal the effectiveness of 3D representations and networks for LiDAR segmentation. Then, we develop a framework based on a 3D cylinder partition and 3D cylinder convolution, termed Cylinder3D, which exploits the 3D topology relations and structures of driving-scene point clouds. Moreover, a dimension-decomposition based context modeling module is introduced to explore the high-rank context information in point clouds in a progressive manner. We evaluate the proposed model on a large-scale driving-scene dataset, i.e., SemanticKITTI. Our method achieves state-of-the-art performance and outperforms existing methods by 6% mIoU.
The 3D LiDAR sensor has become an indispensable device in modern autonomous driving vehicles. It captures more precise and longer-range distance measurements of the surrounding environment than conventional visual cameras. These measurements naturally form 3D point clouds that can be used to understand the overall scene for autonomous driving planning and execution.
Semantic segmentation of 3D point clouds is crucial for driving-scene understanding. It aims to identify the pre-defined category that each 3D point belongs to, such as car, truck, pedestrian, etc., which provides point-wise perception information of the overall 3D scene.
Most existing point cloud segmentation algorithms focus on indoor scenes, where the point clouds are generally dense and have mostly uniform density. In contrast, only a few methods address segmentation of LiDAR point clouds in outdoor or autonomous driving scenes, where the point density varies with the distance to the sensor and poses great challenges to the algorithms.
Recent methods usually pay much attention to point feature representations [1, 2, 3]. Point feature representations for LiDAR point clouds fall into three major categories: range image [4, 1], bird-view image and voxel partition [3, 5]. The range image is obtained via spherical projection of the irregularly distributed 3D point cloud onto a dense 2D grid. The bird-view image squeezes the point height information and shares a global height feature for each location on the bird-view map. However, most of these approaches lose some accurate geometric information during the 3D-to-2D projection.
In this paper, we reposition the focus of LiDAR segmentation in autonomous driving scenes. We conduct experiments to compare the effectiveness of different point feature representations and neural network architectures. The experiments reveal that a 3D partition combined with 3D convolutional neural networks works better than its counterparts. Because driving-scene point clouds have varying densities, we propose a cylinder partition that balances the distribution of points across voxels. To match the cuboid objects in driving-scene LiDAR data, we propose the asymmetric residual block as the basic module of the 3D backbone. In addition to this backbone design, we also propose a new dimension decomposition block to efficiently exploit the context information via a series of low-rank convolution kernels.
The contributions of this work are three-fold. (1) We study state-of-the-art network architectures and different point feature representations, and show that directly processing point clouds without 3D-to-2D projection is crucial for achieving superior segmentation performance. (2) We propose a cylinder partition, a point cloud encoding scheme that better follows the inherent distribution of 3D driving-scene point clouds, and develop a 3D convolution based framework, in which the asymmetric residual block is designed as the basic module and a new dimension decomposition block is proposed to explore the context in a progressive manner. (3) Our proposed LiDAR segmentation algorithm outperforms state-of-the-art algorithms on driving-scene semantic segmentation benchmarks by a large margin, i.e., a 6% mIoU gain.
Indoor-scene Point Cloud Segmentation. Indoor-scene point clouds typically have a generally uniform density and cover a small range. Hence, most indoor-scene segmentation methods [6, 7, 8, 9, 10, 11, 12, 13] learn the point features directly from the raw points. PointNet is a classical network for point sets that uses a multi-layer perceptron to extract features from the input points. PointNet++ further proposed multi-scale sampling to aggregate global and local features. Another group of indoor-scene segmentation methods [9, 10] uses clustering (including KNN) to extract the point features. However, these methods are computationally costly and do not take varying sparsity (a property of outdoor-scene LiDAR) into consideration.
Outdoor-scene Point Cloud Segmentation. Most existing outdoor-scene point cloud segmentation methods convert the 3D point cloud to 2D grids to enable the use of 2D convolutional neural networks. SqueezeSeg, Darknet, SqueezeSegv2 and RangeNet++ use the spherical projection mechanism, which converts the point cloud to a frontal-view (range) image, and apply a 2D convolutional network to the pseudo image for segmentation. PolarNet follows the bird-view projection, which projects the point cloud data into small grids seen from the bird view and treats the height as a whole. Instead of partitioning points in a Cartesian coordinate system, it uses a polar coordinate system to encode point clouds. However, this 3D-to-2D projection inevitably compresses the 3D topology and fails to model the geometric information.
3D Voxel Partition. 3D voxel partition is another line of point cloud encoding [17, 18, 19, 20, 3]. It converts a point cloud into 3D voxels. 3D U-Net applies voxel partition and a 3D U-Net to biomedical data and shows successful application on difficult microscopic datasets. OccuSeg, SSCN and SEGCloud follow this line, using the voxel partition and applying 3D convolutions for LiDAR segmentation. Our work also follows this line, using a 3D grid and 3D convolutional networks, but with substantial differences. We use a 3D cylinder partition based on the cylinder coordinate system, which matches the varying sparsity of driving-scene LiDAR point clouds and balances the point distribution. Specifically, distant regions are much sparser than nearby ones, so the cylinder partition uses larger cells to cover the distant regions.
Network Architectures for Segmentation. The Fully Convolutional Network (FCN) is the fundamental work for segmentation in the deep-learning era. U-Net builds upon FCN and proposes a symmetric architecture to make use of the low-level features. Furthermore, many works explore dilated convolution for multi-scale context modeling, including DeepLab [23, 24] and PSP. Due to the great success of U-Net on 2D benchmarks, many studies on LiDAR segmentation adapt U-Net to 3D space and propose 3D U-Net. However, they often fail to exploit the distribution and properties of the driving-scene LiDAR point cloud. In this work, two modules, i.e., the Asymmetric Residual Block and Dimension-decomposition based Context Modeling, are designed to match the cuboid objects and model the high-rank context information, respectively.
Outdoor-scene point clouds differ significantly from indoor-scene point clouds. (1) A driving-scene point cloud may cover a very large area, extending over 100 meters. (2) It generally contains more points (100,000 points) but is much sparser than an indoor-scene point cloud. Hence, indoor segmentation methods that work on dense point sets with a fixed number of points are difficult to adapt to driving scenes with varying point densities.
Existing outdoor LiDAR segmentation methods mainly transform the 3D point clouds to 2D representations via projection, including spherical projection and bird-eye view projection, and then apply 2D convolutions to the 2D grid representations. However, as shown in Fig. 1 (right), the local spatial pattern in a 2D grid representation cannot capture 3D geometric structures well. The red rectangle in the 2D grid covers points that lie at different spatial locations in 3D. Hence, these 3D-to-2D projection methods may fail to encode certain 3D geometric structures and extract inaccurate patterns. The detailed survey is shown in Section 4.2. We perform extensive experiments with various partitions and networks among 2D, 2.5D and 3D. The consistent performance gain in the results indicates the effectiveness of our technical road map, namely, 3D partition and 3D networks.
Outdoor point clouds cover a large variety of urban scenes. Our task is to assign a semantic label to each point in the point cloud. Based on our investigation of the distribution of 2D and 3D point cloud representations, we find that the 2D representation obtained from projection abandons many available 3D structures. We therefore propose a new outdoor LiDAR segmentation approach based on a 3D representation and 3D neural networks.
As shown in Fig. 2, the framework consists of two components: a 3D cylinder partition (to obtain the 3D representation) and a 3D U-Net (to process the 3D representation). In particular, we design two modules to suit the properties of outdoor point clouds: the Asymmetric Residual Block, which matches the cuboid objects that often appear in driving scenes (cars, trucks, motorcycles, etc.), and the dimension-decomposition based context modeling module, which exploits the high-rank context information in point clouds in a decomposition-aggregation manner. In the following sections, we introduce these components in detail.
As mentioned above, an outdoor-scene LiDAR point cloud has varying density: nearby regions are much denser than distant regions. We thus replace the Cartesian grid partition with a partition in the cylinder coordinate system. Its cells grow with distance, so the farther regions are covered by larger cells; this distributes the points more evenly across cells and matches the distribution of outdoor points. Moreover, unlike projection-based methods, which project the points to a 2D view, we maintain the 3D grid representation to retain the geometric structure. The workflow is shown in Fig. 3. We first transform the points from the Cartesian coordinate system to the cylinder coordinate system, where the radius and azimuth are calculated. This step transforms the points $(x, y, z)$ to points $(\rho, \theta, z)$. The cylinder partition then splits these three dimensions uniformly. Note that with a uniform split, the farther a region is from the sensor, the larger its voxel. The points in each cylinder cell are fed to an MLP-based PointNet to get the cylinder features. After these steps, we obtain the 3D cylinder representation of size $C \times H \times W \times L$, where $C$ denotes the feature dimension.
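The coordinate transform and uniform split described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function name and the default `grid_size`, `rho_range` and `z_range` values are assumptions chosen for the example.

```python
import numpy as np

def cylinder_partition(points, grid_size=(480, 360, 32),
                       rho_range=(0.0, 50.0), z_range=(-3.0, 1.5)):
    """Map (N, 3) Cartesian points to cylinder coordinates and voxel indices.

    grid_size, rho_range and z_range are illustrative assumptions.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)      # radius: distance to the vertical sensor axis
    theta = np.arctan2(y, x)            # azimuth in [-pi, pi]

    cyl = np.stack([rho, theta, z], axis=1)
    lo = np.array([rho_range[0], -np.pi, z_range[0]])
    hi = np.array([rho_range[1], np.pi, z_range[1]])
    size = np.asarray(grid_size)

    # Uniform split of each cylindrical dimension; out-of-range points are clipped.
    cyl = np.clip(cyl, lo, hi)
    idx = np.floor((cyl - lo) / (hi - lo) * size).astype(np.int64)
    idx = np.minimum(idx, size - 1)     # the upper bound falls into the last cell
    return cyl, idx
```

Because the azimuth step is constant, the arc length covered by one cell grows linearly with the radius, which is why distant cells are larger even though the split is uniform in $(\rho, \theta, z)$.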
Autonomous driving scenes contain a large number of cuboid objects, including cars, trucks, buses and motorcycles. Inspired by text detection methods, where asymmetric convolutional kernels are used to match rectangular target regions, we design the asymmetric residual block to suit such cuboid objects. Moreover, the asymmetric residual block also significantly reduces the computational cost compared with conventional 3D convolutional kernels. Specifically, a $3 \times 1 \times 3$ convolution followed by a $1 \times 3 \times 3$ convolution is equivalent to sliding a two-layer network with the same receptive field as a 3D convolution with a $3 \times 3 \times 3$ kernel, but it is 33% cheaper than a $3 \times 3 \times 3$ convolution with the same number of output filters. The proposed asymmetric residual block is the basic component of the downsample block and the upsample block. The downsample block consists of an asymmetric residual block and a 3D convolution with stride 2 that performs the downsampling. The upsample block incorporates the low-level features and processes the fused features with an asymmetric residual block.
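The 33% figure can be checked with a short weight count. The sketch below assumes the asymmetric pair uses kernels of shape 3×1×3 and 1×3×3 against a full 3×3×3 kernel (shapes chosen here because they reproduce the stated saving) and an arbitrary channel width of 64; both are assumptions for illustration.

```python
def conv3d_weights(kernel, c_in, c_out):
    """Number of weights in a 3D convolution (bias ignored)."""
    kd, kh, kw = kernel
    return kd * kh * kw * c_in * c_out

c = 64  # hypothetical channel width, identical for input and output

full = conv3d_weights((3, 3, 3), c, c)
# Assumed asymmetric pair: two stacked kernels with 9 taps each.
asym = conv3d_weights((3, 1, 3), c, c) + conv3d_weights((1, 3, 3), c, c)

saving = 1 - asym / full
print(f"full: {full}, asymmetric pair: {asym}, saving: {saving:.1%}")
```

The ratio is 18/27 regardless of the channel width, so the saving is one third as long as the input and output channel counts of the layers match.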
Context varies widely (in 3D space, it differs from point cloud to point cloud and is highly diverse), so the context tensor should be high-rank [27, 28] to have enough capacity for encoding context information. Modeling such a context feature directly is very costly, especially in 3D space, because of its high-rank property. Inspired by high-rank matrix decomposition theory, we separate the high-rank context into several low-rank representations. In our task, the high-rank context can be divided along three dimensions, i.e., height, width and depth, where all three fragments are low-rank. We then build up the complete high-rank context from these fragments. In this way, the decompose-aggregate strategy tackles the high-rank difficulty from different views under low-rank constraints. As shown in Fig. 2 (bottom), three rank-1 kernels (i.e., $3 \times 1 \times 1$, $1 \times 3 \times 1$ and $1 \times 1 \times 3$) are used to generate the low-rank encodings along the three dimensions. The Sigmoid function then modulates the convolution results and generates weights for each dimension, in which the co-occurrence contextual information is mined from the rank-1 tensors of the different views. Finally, we sum the three low-rank activations to represent the complete context feature.
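The decompose-aggregate idea can be illustrated on a single-channel volume. The sketch below is a simplification under stated assumptions: it uses one fixed 3-tap kernel per axis instead of learned multi-channel convolutions, and it only performs the three steps named in the text (rank-1 convolution per dimension, Sigmoid, summation). The function names are hypothetical.

```python
import numpy as np

def conv1d_along_axis(volume, kernel, axis):
    """Zero-padded 'same' 1D correlation along one axis (a rank-1 3D kernel)."""
    pad = len(kernel) // 2
    moved = np.moveaxis(volume, axis, -1)
    padded = np.pad(moved, [(0, 0)] * (moved.ndim - 1) + [(pad, pad)])
    out = np.zeros_like(moved, dtype=float)
    n = moved.shape[-1]
    for k, w in enumerate(kernel):
        out += w * padded[..., k:k + n]   # shifted copy weighted by tap k
    return np.moveaxis(out, -1, axis)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ddcm_context(volume, kernels):
    """Sum of Sigmoid-modulated rank-1 responses along the three axes."""
    return sum(sigmoid(conv1d_along_axis(volume, k, axis))
               for axis, k in enumerate(kernels))
```

Each rank-1 kernel only mixes neighbours along one dimension, so the three responses are cheap to compute; the aggregation step is what combines them into a single context map.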
In this section, we give the details of the three parts: 3D cylinder partition, 3D segmentation backbone and segmentation head, as shown in Fig. 2. The cylinder partition uses a 4-layer MLP network with BatchNorm and ReLU to extract features for each point, and takes the maximum of the point features within each voxel as the voxel representation. Our 3D segmentation backbone is derived from U-Net, with sparse convolution used as the 3D convolution. As mentioned above, we replace the traditional residual block with the asymmetric residual block and insert a DDCM module before the final prediction. The input to the segmentation backbone is the 3D cylinder feature tensor. The third part is the segmentation head, for which we adopt a single 3D convolution layer as a light-weight head. After the whole pipeline, a voxel-based prediction is obtained.
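The step that turns per-point features into one feature per voxel can be sketched as a scatter-max. This is an illustration in NumPy with a hypothetical function name; the point features are assumed to have been computed already by the MLP.

```python
import numpy as np

def voxel_max_pool(point_features, voxel_idx):
    """Max-pool (N, C) point features over points that share a voxel.

    voxel_idx is an (N, 3) integer array of voxel indices. Returns the
    unique occupied voxels (M, 3) and their pooled features (M, C).
    """
    voxels, inverse = np.unique(voxel_idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)   # guard against an (N, 1) shape in some NumPy versions
    pooled = np.full((len(voxels), point_features.shape[1]), -np.inf)
    np.maximum.at(pooled, inverse, point_features)   # unbuffered element-wise max
    return voxels, pooled
```

Only occupied voxels are returned, which fits naturally with a sparse-convolution backbone that stores features for non-empty cells only.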
For network optimization, we use a weighted cross-entropy loss and a Lovász-Softmax loss to maximize the point accuracy and the intersection-over-union score over the classes. The two losses share the same weight. Thus, the total loss is $L = L_{wce} + L_{lovasz}$. For the optimizer, Adam with an initial learning rate of 0.001 is employed.
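A NumPy sketch of the equally weighted sum of the two losses is given below. It is an illustration rather than the training code: it operates on softmax probabilities instead of logits, the weighted cross-entropy is normalised by the total weight (one common convention, assumed here), and the Lovász-Softmax term averages over the classes present in the labels.

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """Class-weighted negative log-likelihood, normalised by total weight."""
    w = class_weights[labels]
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return float((w * nll).sum() / w.sum())

def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension of the Jaccard loss."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_softmax(probs, labels):
    """Lovasz-Softmax loss averaged over the classes present in `labels`."""
    losses = []
    for c in np.unique(labels):
        fg = (labels == c).astype(float)
        errors = np.abs(fg - probs[:, c])
        order = np.argsort(-errors)          # largest errors first
        losses.append(np.dot(errors[order], lovasz_grad(fg[order])))
    return float(np.mean(losses))

def total_loss(probs, labels, class_weights):
    """Equally weighted sum of the two terms."""
    return (weighted_cross_entropy(probs, labels, class_weights)
            + lovasz_softmax(probs, labels))
```

The cross-entropy term drives per-point accuracy, while the Lovász-Softmax term is a surrogate for the IoU itself, so summing them targets both quantities named in the text.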
SemanticKITTI is a large-scale outdoor-scene dataset for point cloud semantic segmentation. It is derived from the KITTI Vision Odometry Benchmark and was collected in Germany with a Velodyne HDL-64E LiDAR. The dataset consists of 22 sequences, with sequences 00 to 10 used as the training set and 11 to 21 as the test set. Overall, the dataset provides 23201 point clouds for training and 20351 for testing. Following previous literature, sequence 08 is used as the validation set. The dataset has 28 classes in total, where 6 classes are duplicated with a moving or non-moving attribute. After merging classes with different moving status and ignoring classes with very few points, 19 classes remain for training and evaluation. To evaluate the proposed method, we use the mean intersection-over-union (mIoU) metric over all classes, given by $IoU_i = \frac{TP_i}{TP_i + FP_i + FN_i}$, where $TP_i$, $FP_i$ and $FN_i$ represent the true positive, false positive and false negative predictions for class $i$, and the mIoU is the mean value of $IoU_i$ over all classes.
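The metric can be computed from a confusion matrix, as in the following NumPy sketch. The function name is hypothetical, and classes that never appear in either the prediction or the ground truth are skipped when averaging, which is an assumption of this example.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Per-class IoU and their mean from integer label arrays."""
    # conf[t, p] counts points with ground-truth class t predicted as class p.
    conf = np.bincount(target * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp      # predicted as class i but labelled otherwise
    fn = conf.sum(axis=1) - tp      # labelled class i but predicted otherwise
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return iou, float(np.nanmean(iou))
```

For example, with `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has one true positive and one false positive (IoU 1/2) and class 1 has two true positives and one false negative (IoU 2/3).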
For LiDAR segmentation in outdoor scenes, there is a large body of previous work proposing various partitions and backbones in 2D and 3D space. We choose two published cutting-edge networks and some variants with different partitions and backbones (in 2D and 3D space) as a reference group, and conduct extensive experiments to trace the path that led to our network design.
Spherical Projection. RangeNet++ is a typical spherical projection method, which projects the point cloud onto a spherical surface surrounding the sensor. Compared with analogous methods such as Darknet and SqueezeSeg, it achieves the best performance on the SemanticKITTI test set. Thus, we choose RangeNet53 as the spherical projection baseline and also replace the original RangeNet53 backbone with DeepLab-ResNet101. For a fairer comparison, we also adopt KNN as a post-processing method to reduce the spatial boundary effect of spherical projection.
Polar Bird View Projection. PolarNet is not a traditional bird-view method defined in Cartesian coordinates. It introduces polar coordinates on the radius-theta plane to represent the points effectively. Radius-theta encoding can reduce the learning complexity due to its small input size. We follow the polar image setting in PolarNet and show the results with different network architectures.
Cuboid 3D Voxelization is a common point representation used in LiDAR segmentation. It converts the point cloud into 3D voxels in Cartesian coordinates. These methods often have a huge computational cost because of the large cuboid voxel resolution and the 3D convolution backbone.
Cylinder 3D Voxelization is proposed in this paper. It divides the point cloud into small grids in the cylindrical coordinate system. As claimed in Section 3, the cylinder voxel partition matches the varying sparsity of driving-scene LiDAR point clouds and balances the point distribution.
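For reference, a spherical projection of this kind can be sketched as below. The image size and the vertical field of view (3° up, 25° down) are assumptions typical of a 64-beam sensor, not values taken from this text, and the function name is hypothetical.

```python
import numpy as np

def spherical_projection(points, height=64, width=2048,
                         fov_up_deg=3.0, fov_down_deg=-25.0):
    """Return the range-image (row, col) and depth of each (N, 3) point."""
    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / np.maximum(depth, 1e-8))

    u = 0.5 * (1.0 - yaw / np.pi) * width                          # column from azimuth
    v = (1.0 - (pitch - fov_down) / (fov_up - fov_down)) * height  # row from elevation
    col = np.clip(np.floor(u), 0, width - 1).astype(np.int64)
    row = np.clip(np.floor(v), 0, height - 1).astype(np.int64)
    return row, col, depth
```

Two points on the same ray, such as (10, 0, 0) and (20, 0, 0), land on the same pixel, so only the depth channel can tell them apart; this is one concrete way the projection gives up 3D structure.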
Analysis. As shown in Table 1, we conduct extensive experiments to evaluate different projections with different segmentation backbones in 2D and 3D space. For 2D projections, polar projection outperforms spherical projection across different segmentation backbones, such as ResNet-50-FCN, DRN-DeepLab and ResNet101-DeepLab, which demonstrates the superiority of polar projection. It is worth noting that our 2D and 3D backbones share the same architecture, as shown in Fig. 2. The main difference between the 2D and 3D backbones is the convolution layer: the 2D backbone uses 2D convolution instead. Based on the same polar projection, our 2D backbone outperforms PolarNet by 2.8% mIoU, which demonstrates the scalability of the proposed model even in 2D space. When we replace the polar projection with our cylinder 3D voxelization, our model gains 1.7% because it retains the 3D topology, which indicates the effectiveness of the 3D cylinder partition. After converting the 2D backbone to the 3D backbone, the proposed Cylinder3D obtains a 4.2% gain and achieves 64.3% mIoU on the validation set. The 3D convolution based framework thus significantly boosts the performance compared to the 2D backbone, which demonstrates that the combination of the 3D cylinder partition and 3D convolution benefits point cloud segmentation and supports our conjecture that 3D structure is a crucial aspect of LiDAR segmentation.
In this experiment, we report the results of our model on the SemanticKITTI test set from the official evaluation server. As shown in Table 2, our method achieves state-of-the-art performance on the SemanticKITTI test set in comparison with existing methods, including RangeNet++, PolarNet, SqueezeSegv3, RandLA-Net, etc. The proposed method outperforms other state-of-the-art methods by at least 6% in terms of mIoU.
In this experiment, we perform ablation studies to investigate the effects of different network components in Cylinder3D, including the asymmetric residual block, dimension-decomposition based context modeling (DDCM) and the flip test (a common technique to boost performance). We use the cylinder partition with a 3D U-Net (similar to our 3D network, but with the common residual block and without dimension-decomposition context modeling) as the baseline method. We then add the network components one by one to observe their effect. Replacing the residual block with the asymmetric residual block yields a gain of about 1.5% mIoU. With dimension-decomposition based context modeling added, the proposed Cylinder3D achieves 64.3% mIoU. Moreover, further incorporating the flip test, i.e., flipping the original point cloud via the x-axis, y-axis and x-y-axis and averaging the four predictions as the final result, increases the mIoU by another 0.9%. The ablation shows that both designed modules bring a consistent performance gain.
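The flip test amounts to averaging the per-point scores of four mirrored copies of the input. The sketch below assumes a hypothetical `predict_fn` that maps an (N, 3) point array to (N, num_classes) scores in the original point order; it stands in for the trained network.

```python
import numpy as np

def flip_test(predict_fn, points):
    """Average per-point class scores over the original and three flipped copies."""
    signs = [(1, 1), (-1, 1), (1, -1), (-1, -1)]   # identity, two single flips, double flip
    scores = []
    for sx, sy in signs:
        flipped = points.copy()
        flipped[:, 0] *= sx
        flipped[:, 1] *= sy
        scores.append(predict_fn(flipped))
    mean_scores = np.mean(scores, axis=0)
    return mean_scores.argmax(axis=1), mean_scores
```

Because flipping changes coordinates but not the order of the points, the four score arrays line up row by row and can be averaged directly without any re-indexing.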
| Baseline | Asymmetric residual block | DDCM | Flip Test | mIoU |
Some of the results are visualized in Fig. 5. It can be observed that the proposed Cylinder3D achieves good accuracy overall and separates nearby objects well, because it maintains the 3D topology and uses the geometric information (we highlight the corresponding regions with red rectangles).
In this paper, we follow the 3D nature of the LiDAR point cloud to reposition the focus of LiDAR segmentation from 2D to 3D representations and networks. We design a 3D point cloud representation, named cylinder partition, which suits the varying sparsity of driving-scene LiDAR point clouds, and propose a 3D convolution based network in which two basic modules, the Asymmetric Residual Block and Dimension-decomposition based Context Modeling, are introduced to reduce the computational cost and explore the high-rank context. With the cylinder partition and 3D convolution networks working together, our method achieves state-of-the-art performance on the SemanticKITTI test set.