The 3D LiDAR sensor has become an indispensable device in modern autonomous driving vehicles. It captures more precise and farther-away distance measurements of the surrounding environment than conventional visual cameras [49, 48, 34]. The measurements of the sensor naturally form 3D point clouds that can be used to realize a thorough scene understanding for autonomous driving planning and execution, in which LiDAR-based segmentation and detection are crucial for driving-scene perception and understanding.
Recently, advances in deep learning have significantly pushed forward the state of the art in the image domain, such as image segmentation and detection. Some existing LiDAR-based perception approaches follow this route: they project the 3D point clouds onto a 2D space and process them via 2D convolution networks, including range-image-based [33, 52] and bird’s-eye-view-image-based [62, 26] methods. However, this group of methods loses and alters the accurate 3D geometric information during the 3D-to-2D projection (as shown in the top row of Fig. 0(a)).
A natural alternative is to utilize 3D partition and 3D convolution networks to process the point cloud and maintain its 3D geometric relations. However, in our initial attempts, we directly applied 3D voxelization [18, 11] and 3D convolution networks to outdoor LiDAR point clouds, only to find a very limited performance gain (as shown in Fig. 0(b)). Our investigation into this issue reveals a key difficulty of outdoor LiDAR point clouds, namely sparsity and varying density, which is also the key difference from indoor scenes with dense and uniform-density points. Previous 3D voxelization methods treat the point cloud as uniform and split it into uniform cubes, neglecting the varying-density property of outdoor point clouds. Consequently, the attempt to apply 3D partition to outdoor point clouds meets a fundamental difficulty.
Motivated by these findings, we propose a new framework for outdoor LiDAR segmentation that consists of two key components, i.e., 3D cylindrical partition and asymmetrical 3D convolution networks, which maintain the 3D geometric information and handle these issues from the partition and network sides, respectively. Here, cylindrical partition resorts to cylinder coordinates to divide the point cloud dynamically according to the distance (regions that are far away from the origin have much sparser points and thus require a larger cell), which produces a more balanced point distribution (as shown in Fig. 0(a)), while asymmetrical 3D convolution networks strengthen the horizontal and vertical kernels to match the point distribution of objects in the driving scene and enhance the robustness to sparsity. Moreover, voxel-based methods might divide points of different categories into the same cell, and cell-label encoding would then inevitably cause information loss (for LiDAR-based segmentation tasks). To alleviate the interference of lossy label encoding, a point-wise module is introduced to further refine the features obtained from the voxel-based network. Overall, the cooperation of these components maintains the geometric relations and tackles the difficulty of outdoor point clouds, thus improving the effectiveness of 3D frameworks.
Since the learned features from our model can be used for downstream tasks, we benchmark our model on a variety of LiDAR-based perception tasks, such as LiDAR-based semantic segmentation, panoptic segmentation and 3D detection. For semantic segmentation, we evaluate the proposed method on several large-scale outdoor datasets, including SemanticKITTI, nuScenes and A2D2. Our method achieves state-of-the-art results on the SemanticKITTI leaderboard (both single-scan and multi-scan challenges) and also outperforms the existing methods on nuScenes and A2D2 by a large margin. We also extend the proposed cylindrical partition and asymmetrical 3D convolution networks to LiDAR panoptic segmentation and LiDAR 3D detection. For panoptic segmentation and 3D detection, experimental results on SemanticKITTI and nuScenes, respectively, show strong performance and good generalization capability.
The contributions of this work mainly lie in three aspects:
We reposition the focus of outdoor LiDAR segmentation from 2D projection to 3D structure, and further investigate the inherent properties (difficulties) of outdoor point cloud.
We introduce a new framework to explore the 3D geometric pattern and tackle these difficulties caused by sparsity and varying density, through cylindrical partition and asymmetrical 3D convolution networks.
The proposed method achieves the state of the art on LiDAR-based semantic segmentation, LiDAR panoptic segmentation and LiDAR point cloud 3D detection, which also demonstrates its strong generalization capability.
2 Related Work
Deep Learning for Indoor-scene Point Cloud. Indoor-scene point clouds exhibit several properties, including generally uniform density, a small number of points, and a small scene range. Mainstream methods [36, 44, 54, 51, 45, 29, 15, 61, 57, 46, 35, 37] for indoor point cloud segmentation learn point features directly from the raw points. They are often based on the pioneering work PointNet and improve the effectiveness of sampling, grouping and ordering to achieve better performance. Another group of methods utilizes clustering algorithms [51, 45] to extract hierarchical point features. However, these indoor-oriented methods adapt poorly to outdoor point clouds because of the varying density and the large scene range, and the large number of points also causes computational difficulties when they are deployed from indoor to outdoor scenes.
Deep Learning for Outdoor-scene Point Cloud. Most existing outdoor-scene point cloud methods focus on converting the 3D point cloud to 2D grids to enable the use of 2D convolutional neural networks. SqueezeSeg, Darknet, SqueezeSegv2, and RangeNet++ utilize the spherical projection mechanism, which converts the point cloud to a frontal-view image or a range image, and apply a 2D convolution network to the pseudo image for point cloud segmentation or detection. PolarNet follows the bird’s-eye-view projection, which projects point cloud data into a bird’s-eye-view representation under polar coordinates. However, these 3D-to-2D projection methods inevitably lose and alter the 3D topology and fail to model the geometric information. Moreover, in most outdoor scenes, a LiDAR device is used to produce the point cloud data, and its inherent properties, i.e., sparsity and varying density, are often neglected.
3D Voxel Partition. 3D voxel partition is another line of point cloud encoding [19, 43, 18, 11, 31, 65, 67]. It converts a point cloud into 3D voxels, which largely retains the 3D geometric information. OccuSeg, SSCN and SEGCloud follow this line, using voxel partition and regular 3D convolutions for LiDAR segmentation. It is worth noting that while the aforementioned efforts have shown encouraging results, the improvement on outdoor LiDAR point clouds remains limited. As mentioned above, a common issue is that these methods neglect the inherent properties of outdoor LiDAR point clouds, namely sparsity and varying density. Compared to these methods, our proposed method resorts to 3D cylindrical partition and asymmetrical 3D convolution networks to tackle these difficulties.
The fully convolutional network (FCN) is the fundamental work for segmentation tasks in the deep-learning era. Built upon the FCN, many works aim to improve the performance by exploring dilated convolution, multi-scale context modeling and attention modeling, including DeepLab [7, 8] and PSP. Recent work utilizes neural architecture search to find a more effective backbone for segmentation [27, 41]. In particular, U-Net proposes a symmetric architecture to incorporate low-level features. Given the great success of U-Net on 2D benchmarks and its good flexibility, many studies on LiDAR-based perception adapt U-Net to the 3D space. We also follow this structure to construct our asymmetrical 3D convolution networks.
3.1 Framework Overview
As shown in the top and middle rows of Fig. 2, we elaborate the pipeline of our model for the LiDAR-based segmentation and detection tasks. In the context of semantic segmentation, given a point cloud, the task is to assign a semantic label to each point. Based on the comparison between 2D and 3D representations and the investigation of the inherent properties of outdoor LiDAR point clouds, we seek a framework that explores the 3D geometric information and handles the difficulty caused by sparsity and varying density. To this end, we propose a new outdoor segmentation approach based on 3D partition and 3D convolution networks. We first employ the cylindrical partition to generate a more balanced point distribution (more robust to varying density), and then apply the asymmetrical 3D convolution networks to strengthen the horizontal and vertical weights, thus matching the object point distribution in driving scenes and enhancing the robustness to sparsity. The same backbone with cylindrical partition and asymmetrical convolution networks is also adapted to LiDAR-based 3D detection (shown in the middle row of Fig. 2).
Specifically, the framework consists of two major components: cylindrical partition and asymmetrical 3D convolution networks. The LiDAR point cloud is first divided by the cylindrical partition, and the features extracted from the MLP are then reassigned based on this partition. Asymmetrical 3D convolution networks are then used to generate the voxel-wise outputs. For segmentation tasks, a point-wise module is introduced to alleviate the interference of lossy cell-label encoding, thus refining the outputs. In the following sections, we present these components in detail.
3.2 Cylindrical Partition
As mentioned above, outdoor-scene LiDAR point clouds have varying density: nearby regions are much denser than farther-away regions. Therefore, uniform cells splitting the varying-density points yield an imbalanced distribution (for example, a larger proportion of empty cells). The cylinder coordinate system, in contrast, uses an increasing grid size to cover the farther-away regions, and thus distributes the points more evenly across different regions and gives a more balanced representation against the varying density. We compute statistics on the proportion of non-empty cells across different distances in Fig. 3. As the distance increases, cylindrical partition maintains a balanced non-empty proportion due to the increasing grid size, while cubic partition suffers from an imbalanced distribution, especially in the farther-away regions (about 6 times less than cylindrical partition). Moreover, unlike projection-based methods, which project the points to a 2D view, cylindrical partition maintains the 3D grid representation to retain the geometric structure.
The workflow is illustrated in Fig. 4. We first transform the points from the Cartesian coordinate system to the cylinder coordinate system. This step transforms the points $(x, y, z)$ to points $(\rho, \theta, z)$, where the radius $\rho$ (distance to the origin in the x-y plane) and the azimuth $\theta$ (angle from the x-axis to the y-axis) are calculated. The cylindrical partition then splits these three dimensions; note that in the cylinder coordinate system, the farther away the region is, the larger the cell will be. Point-wise features obtained from the MLP are reassigned based on the result of this partition to get the cylindrical features. Specifically, the point-cylinder mapping contains the index from each point-wise feature to its cylinder. Based on this mapping, point-wise features within the same cylinder are grouped together and processed via max-pooling to get the cylindrical features. After these steps, we unroll the cylinder from 0 degrees and get the 3D cylindrical representation of size $C \times H \times W \times L$, where $C$ denotes the feature dimension and $H$, $W$, $L$ denote the radius, azimuth and height. The subsequent asymmetrical 3D convolution networks operate on this representation.
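The partition and pooling steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `cylindrical_partition`, the grid size, the radius limit `rho_max` and the height range `z_range` are all assumed values chosen for the example, and the point-wise MLP is replaced by features passed in directly.

```python
import numpy as np

def cylindrical_partition(points, feats, grid=(8, 12, 4),
                          rho_max=50.0, z_range=(-4.0, 2.0)):
    """Max-pool point features into a cylindrical grid.

    points: (N, 3) Cartesian xyz; feats: (N, C) point-wise features.
    grid, rho_max and z_range are illustrative values, not the paper's.
    Returns the (N, 3) cell index of each point and a (C, H, W, L) volume.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)          # radius in the x-y plane
    theta = np.arctan2(y, x)                # azimuth in [-pi, pi]
    cyl = np.stack([rho, theta, z], axis=1)

    lo = np.array([0.0, -np.pi, z_range[0]])
    hi = np.array([rho_max, np.pi, z_range[1]])
    shape = np.array(grid)
    cell = (hi - lo) / shape
    # Clipping sends out-of-range points to the closest cell.
    idx = np.floor((np.clip(cyl, lo, hi) - lo) / cell).astype(np.int64)
    idx = np.minimum(idx, shape - 1)

    # Point-to-cylinder mapping as a flat cell id, then per-cell max-pooling.
    flat = np.ravel_multi_index(tuple(idx.T), grid)
    pooled = np.full((int(np.prod(grid)), feats.shape[1]), -np.inf)
    np.maximum.at(pooled, flat, feats)      # unbuffered per-cell maximum
    pooled[np.isneginf(pooled)] = 0.0       # empty cells become zeros
    return idx, pooled.T.reshape(feats.shape[1], *grid)

points = np.array([[1.0, 1.0, 0.0], [1.1, 1.0, 0.1], [-30.0, -1.0, 1.0]])
feats = np.array([[1.0, 5.0], [3.0, 2.0], [7.0, 7.0]])
idx, volume = cylindrical_partition(points, feats)
```

The first two points fall into the same cell, so their features are reduced to an element-wise maximum. Because the azimuth bins have a fixed angular width, the arc length of a cell grows with the radius, which is the property the text relies on for balancing sparse far-away regions.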
3.3 Asymmetrical 3D Convolution Network
Since driving-scene point clouds exhibit a specific object shape distribution, including cars, trucks, buses, motorcycles and other cubic objects, we follow this observation to enhance the representational power of a standard 3D convolution. Moreover, recent literature [50, 14] also shows that the central crisscross weights count more in the square convolution kernel. We therefore devise the asymmetrical residual block to strengthen the horizontal and vertical responses and match the object point distribution. Based on the proposed asymmetrical residual block, we further build the asymmetrical downsample block and asymmetrical upsample block to perform the downsample and upsample operations. Moreover, a dimension-decomposition based context modeling (termed DDCM) is introduced to explore the high-rank global context in a decompose-aggregate strategy. We detail these components in the bottom of Fig. 2.
Asymmetrical Residual Block. Motivated by the observation and conclusion in [50, 14], the asymmetrical residual block strengthens the horizontal and vertical kernels, which matches the point distribution of objects in the driving scene and explicitly makes the skeleton of the kernel powerful, thus enhancing the robustness to the sparsity of outdoor LiDAR point clouds. We use the car and motorcycle as examples to show the asymmetrical residual block in Fig. 6, where 3D convolutions are performed on the cylindrical grids. Moreover, the proposed asymmetrical residual block also saves computation and memory cost compared to the regular square-kernel 3D convolution block. By incorporating the asymmetrical residual block, the asymmetrical downsample block and upsample block are designed, and our asymmetrical 3D convolution networks are built by stacking these downsample and upsample blocks.
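To make the cost argument concrete, the following sketch counts convolution weights for a square kernel and for a pair of asymmetrical kernels. The kernel shapes (3×3×3 versus 3×1×3 followed by 1×3×3) and the channel width of 64 are assumptions made for illustration; the text above does not state the shapes used in the block, and `conv3d_params` is a hypothetical helper.

```python
def conv3d_params(kernel, c_in, c_out, bias=False):
    """Number of learnable weights in one 3D convolution layer."""
    kd, kh, kw = kernel
    return kd * kh * kw * c_in * c_out + (c_out if bias else 0)

channels = 64  # assumed channel width, for illustration only

# One regular square-kernel convolution.
regular = conv3d_params((3, 3, 3), channels, channels)

# Two asymmetrical convolutions applied in sequence (assumed shapes).
asym = (conv3d_params((3, 1, 3), channels, channels)
        + conv3d_params((1, 3, 3), channels, channels))

ratio = asym / regular
```

Under these assumptions the asymmetrical pair uses 18 weights per input-output channel pair instead of 27, i.e. two thirds of the square kernel, and the multiply-accumulate count per output location scales the same way.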
Dimension-Decomposition based Context Modeling. Since the global context features should be high-rank to have enough capacity to capture the large context varieties, it is hard to construct these features directly. We follow tensor decomposition theory to build the high-rank context as a combination of low-rank tensors: we use three rank-1 kernels to obtain the low-rank features and then aggregate them to get the final global context.
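A single-channel sketch of this decompose-then-aggregate idea is given below. It applies one 1D kernel along each of the three grid axes and combines the three responses. The sigmoid gating, the summation used as the aggregation step, and the final modulation of the input are assumptions for illustration, since the text only says the low-rank features are aggregated; `conv1d_along_axis` and `ddcm` are hypothetical names.

```python
import numpy as np

def conv1d_along_axis(vol, kernel, axis):
    """Convolve a 3D volume with a 1D kernel along one axis ('same' size).

    Each axis must be at least as long as the kernel. Note that
    np.convolve flips the kernel (true convolution, not correlation).
    """
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, vol)

def ddcm(vol, k_radius, k_azimuth, k_height):
    """Combine three rank-1 (single-axis) responses into one context map."""
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    kernels = (k_radius, k_azimuth, k_height)
    parts = [sigmoid(conv1d_along_axis(vol, k, axis))
             for axis, k in enumerate(kernels)]
    context = parts[0] + parts[1] + parts[2]   # assumed aggregation: sum
    return vol * context                       # assumed: modulate the input

vol = np.arange(4 * 5 * 6, dtype=float).reshape(4, 5, 6)
k = np.array([0.25, 0.5, 0.25])
out = ddcm(vol, k, k, k)
```

Each single-axis kernel touches only three weights, so the three branches together are far cheaper than one dense 3D kernel while still mixing information along every dimension.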
3.4 Sparse Activation Visualization
As mentioned above, the proposed cylindrical partition and asymmetrical 3D networks aim to tackle the difficulties caused by sparsity and varying density in outdoor point clouds. We thus visualize some filter activations from regular 3D convolution networks (with regular cubic partition) and asymmetrical 3D convolution networks (with cylindrical partition), respectively. The results are shown in Fig. 5. Fig. 5(a) and (b) are extracted from regular 3D convolution networks, which are activated at almost all regions, while the proposed asymmetrical 3D convolution networks produce sparser activations that mainly focus on certain regions (as shown in Fig. 5(c) and (d)). This suggests that the proposed model can adaptively handle the sparse point cloud input and focus on certain regions.
3.5 Point-wise Refinement Module
Partition-based methods predict one label for each cell. Although partition-based methods effectively explore large-range point clouds, this group of methods, including cube-based and cylinder-based ones, inevitably suffers from lossy cell-label encoding: points of different categories can be divided into the same cell, which causes information loss for the point cloud semantic segmentation task (as shown in the middle row of Fig. 2). We compute statistics on the effect of different label encoding methods with cylindrical partition in Fig. 7, where majority encoding means using the major category of points inside a cell as the cell label and minority encoding means using the minor category as the cell label. Neither of them reaches 100 percent mIoU (ideal encoding), so both inevitably lose information. Here, the point-wise refinement module is introduced to alleviate the interference of lossy cell-label encoding. We first project the cylindrical features back to the points based on the inverse point-cylinder mapping table (note that points inside the same cylinder are assigned the same cylindrical features). Then the point-wise module takes the point features both before and after the 3D convolution networks as input, and fuses them to refine the output. We also show the detailed structure of the MLPs in the point-wise refinement module and the cylindrical partition in Fig. 8.
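The loss from majority encoding can be illustrated with a toy example: each cell takes the most frequent label of its points, the cell labels are projected back through the point-to-cell index, and the fraction of recovered point labels is measured. The function name `majority_cell_labels`, the `ignore` value for empty cells and the toy data are assumptions for illustration.

```python
import numpy as np

def majority_cell_labels(cell_ids, labels, n_cells, n_classes, ignore=-1):
    """Label each cell with the most frequent class among its points."""
    votes = np.zeros((n_cells, n_classes), dtype=np.int64)
    np.add.at(votes, (cell_ids, labels), 1)    # one vote per point
    cell_labels = votes.argmax(axis=1)
    cell_labels[votes.sum(axis=1) == 0] = ignore  # empty cells
    return cell_labels

# Six points in four cells; cell 0 mixes classes 1 and 2, cell 3 is empty.
cell_ids = np.array([0, 0, 0, 1, 1, 2])
labels = np.array([1, 1, 2, 0, 0, 2])

cell_labels = majority_cell_labels(cell_ids, labels, n_cells=4, n_classes=3)
point_labels = cell_labels[cell_ids]           # inverse point-cell mapping
recovered = float((point_labels == labels).mean())
```

In this toy case the single class-2 point in cell 0 is overwritten by the majority class, so only five of the six point labels survive the round trip. This is the kind of per-point error that a point-wise branch can correct, because it sees features of individual points rather than of whole cells.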
3.6 Objective Function
For the LiDAR-based semantic segmentation task, the total objective of our method consists of two components: a voxel-wise loss and a point-wise loss. It can be formulated as $L = L_{voxel} + L_{point}$. For the voxel-wise loss ($L_{voxel}$), we follow existing methods [13, 21] and use the weighted cross-entropy loss and the lovasz-softmax loss to maximize the point accuracy and the intersection-over-union score, respectively. For the point-wise loss ($L_{point}$), we only use the weighted cross-entropy loss to supervise the training. During inference, the output from the point-wise refinement module is used as the final output.
For the LiDAR-based panoptic segmentation task, in addition to the semantic segmentation loss, the objective also contains the loss of the instance branch, which utilizes center regression to achieve the clustering.
In this section, we benchmark the proposed model on three downstream tasks. For semantic segmentation task, we evaluate the proposed method on several large-scale datasets, i.e., SemanticKITTI, nuScenes and A2D2. SemanticKITTI and nuScenes are also used in panoptic segmentation and 3D detection, respectively. Furthermore, extensive ablation studies on LiDAR semantic segmentation task are conducted to validate each component.
4.1 Dataset and Metric
SemanticKITTI is a large-scale driving-scene dataset for point cloud segmentation, including semantic segmentation and panoptic segmentation. It is derived from the KITTI Vision Odometry Benchmark and was collected in Germany with the Velodyne HDL-64E LiDAR. The dataset consists of 22 sequences: sequences 00 to 10 form the training set (where sequence 08 is used as the validation set), and sequences 11 to 21 form the test set. After merging classes with different moving status and ignoring classes with very few points, 19 classes remain for training and evaluation. The dataset comprises two challenges, namely single-scan and multi-scan point cloud semantic segmentation, where single-scan denotes single-frame point cloud semantic segmentation and multi-scan denotes multiple-frame point cloud segmentation. The key difference is that multi-scan semantic segmentation requires classifying the moving categories, including moving car, moving truck, moving person, moving bicyclist and moving motorcyclist.
nuScenes. It collects 1000 scenes of 20s duration with a 32-beam LiDAR sensor. The total number of frames is 40,000, sampled at 20Hz. The data is officially split into a training and a validation set. After merging similar classes and removing rare classes, 16 classes remain for LiDAR semantic segmentation.
A2D2. We follow the data pre-processing in to generate the labels and process the point cloud data. A2D2 uses five asynchronous LiDAR sensors, where each sensor covers a portion of the surrounding view. After LiDAR panoramic stitching, the A2D2 dataset is split into 22408, 2774 and 13264 training, validation and test scans, respectively, with 38-class segmentation annotation. Since A2D2 has 38 categories, some of which differ only subtly, it is harder than the other datasets, SemanticKITTI and nuScenes.
Implementation Details For these datasets, the Cartesian spaces are different which are related to the LiDAR sensor range. In our implementation, we fix the Cartesian spaces to be , , and for SemanticKITTI, nuScenes and A2D2, respectively. After transforming to the Cylindrical spaces, they are fixed to be , , and . In this way, the proposed cylindrical spaces can cover more than 99% of points for each point cloud scan on average and points out of the spaces are assigned to the closest cylindrical cell. For all datasets, cylindrical partition splits these point clouds into 3D representation with the size = , where three dimensions indicate the radius, angle and height, respectively. We also perform the ablation studies to investigate and cross-validate the effect of these parameters . We use NVIDIA V100 GPU with 16G memory to train the proposed model with batch size = 2.
To evaluate the proposed method, we follow the official guidance and use mean intersection-over-union (mIoU) as the evaluation metric defined in [3, 6], which can be formulated as $IoU_i = \frac{TP_i}{TP_i + FP_i + FN_i}$, where $TP_i$, $FP_i$ and $FN_i$ represent the true positive, false positive, and false negative predictions for class $i$, and the mIoU is the mean value of $IoU_i$ over all classes.
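The metric can be computed from a confusion matrix, as in the short sketch below. The function name `mean_iou` and the handling of classes that appear in neither the predictions nor the ground truth (they are skipped via NaN) are assumptions for illustration; official benchmark scripts may treat such classes differently.

```python
import numpy as np

def mean_iou(pred, target, n_classes):
    """Per-class IoU and their mean from integer label arrays."""
    conf = np.bincount(target * n_classes + pred,
                       minlength=n_classes ** 2)
    conf = conf.reshape(n_classes, n_classes)   # rows: target, cols: pred
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp                  # predicted as i, but not i
    fn = conf.sum(axis=1) - tp                  # is i, but predicted otherwise
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return iou, float(np.nanmean(iou))

target = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
iou, miou = mean_iou(pred, target, n_classes=3)
```

For this toy input the per-class IoU values are 1/3, 2/3 and 1/2, so the mIoU is 0.5.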
4.2 LiDAR-based Semantic Segmentation
Results on SemanticKITTI Single-scan Semantic Segmentation. In this experiment, we compare the results of our proposed method with existing state-of-the-art LiDAR segmentation methods on the SemanticKITTI single-scan test set. The target is to generate the semantic prediction for a single-frame point cloud. As shown in Table I, our method outperforms all existing methods in terms of mIoU. Compared to the projection-based methods on 2D space, including Darknet53, SqueezeSegv3, RangeNet++ and PolarNet, our method achieves an 8% to 17% performance gain in terms of mIoU due to the modeling of 3D geometric information. The proposed method also performs better than voxel partition and 3D convolution based methods, including FusionNet, TORNADONet (a multi-view fusion based method) and SPVNAS (which utilizes neural architecture search for LiDAR segmentation), because the cylindrical partition and asymmetrical 3D convolution networks handle the difficulty of driving-scene LiDAR point clouds that these methods neglect.
Visualization. We show some visualization results of single-scan segmentation in Fig. 9, which are sampled from the SemanticKITTI validation set. The proposed method achieves decent accuracy overall, and it separates nearby objects and identifies them accurately because it maintains the 3D topology and utilizes the geometric information (we highlight the corresponding regions with red rectangles). These visualizations support our claim that keeping the 3D structure and a more balanced point distribution benefits the segmentation results.
Results on SemanticKITTI Multi-scan Semantic Segmentation. Unlike single-scan semantic segmentation, multi-scan segmentation in SemanticKITTI takes multiple-frame point clouds as input and predicts additional categories with moving status, including moving car, moving truck, moving other-vehicle, moving person, moving bicyclist and moving motorcyclist. In this experiment, we first perform multiple-frame point cloud fusion. Specifically, the sequential point clouds in the LiDAR coordinate system are first transformed to the global coordinate system. Then, these sequential point clouds are fused in the global coordinate system. Finally, all these points are transformed to the coordinate system of the last frame. In this way, we achieve multiple-frame fusion, and we use 3 sequential point clouds as input data in our implementation. We show an example in Fig. 10. Moving cars appear as multiple shifted point clouds, while stationary cars keep all points in the same location.
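The three fusion steps can be sketched as follows, assuming each scan comes with a 4×4 homogeneous pose matrix that maps its LiDAR coordinates to the global frame. The function name `fuse_scans` and the toy poses are hypothetical; how poses are obtained is outside the scope of this sketch.

```python
import numpy as np

def fuse_scans(scans, poses):
    """Fuse sequential scans into the coordinate frame of the last scan.

    scans: list of (N_i, 3) arrays in each scan's own LiDAR frame.
    poses: list of (4, 4) LiDAR-to-global transforms, one per scan.
    """
    global_to_last = np.linalg.inv(poses[-1])
    fused = []
    for pts, pose in zip(scans, poses):
        homo = np.hstack([pts, np.ones((len(pts), 1))])   # (N, 4)
        # LiDAR frame -> global frame -> last frame, in one matrix.
        to_last = global_to_last @ pose
        fused.append((homo @ to_last.T)[:, :3])
    return np.vstack(fused)

def translation(dx):
    """Pose of a sensor that has moved dx metres along the x-axis."""
    pose = np.eye(4)
    pose[0, 3] = dx
    return pose

# A static object at global x = 5 seen from a sensor moving along x.
poses = [translation(0.0), translation(1.0), translation(2.0)]
scans = [np.array([[5.0, 0.0, 0.0]]),
         np.array([[4.0, 0.0, 0.0]]),
         np.array([[3.0, 0.0, 0.0]])]
fused = fuse_scans(scans, poses)
```

All three observations of the static object land on the same location in the last frame. A moving object would instead leave a trail of shifted points, which is the cue that distinguishes the moving categories.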
The results of multi-scan semantic segmentation are shown in Tables III and IV. Overall, our method outperforms all existing methods in terms of mIoU, achieving 0.3% and 8.4% gains over KPConv (ICCV2019) and SpSeqnet (CVPR2020), respectively. Our method obtains superior performance for most categories, even for some small objects such as bicycle and motorcycle. Among the moving categories, our method achieves the best performance on moving car and moving truck.
Results on nuScenes. For the nuScenes LiDARseg dataset, we report the results on its validation set. As shown in Table II, our method achieves better performance than existing methods in all categories, and this consistent improvement demonstrates the capability of the proposed model. Specifically, the proposed method obtains about a 4% to 7% performance gain over projection-based methods. Moreover, for categories with sparse points, such as bicycle and pedestrian, our method significantly outperforms existing approaches, which also demonstrates the effectiveness of the proposed method in tackling sparsity and varying density. Note that RangeNet++ and Salsanext perform post-processing, including KNN, etc.
Results on A2D2. We report the results on the A2D2 validation set. As shown in Tables V and VI, the proposed method outperforms existing methods by about 3% in terms of mIoU, including Squeezeseg, SqueezesegV2, DarkNet53 and PolarNet, all of which are based on 2D projection and 2D convolution networks. Specifically, our method achieves better performance on almost all categories consistently, which also demonstrates its effectiveness. Note that due to the more fine-grained categories in A2D2 (38 categories in total), it is harder than other datasets, such as SemanticKITTI and nuScenes, and there are more categories with zero values.
In general, our method achieves consistent state-of-the-art performance on all three datasets with different settings (single-scan and multi-scan) and sensor ranges. This demonstrates the effectiveness of the proposed method and its good generalization capability across different datasets.
4.3 LiDAR-based Panoptic Segmentation
Panoptic segmentation was first proposed in as a new task, in which semantic segmentation is performed for background classes and instance segmentation for foreground classes; these two groups of categories are termed stuff and things classes, respectively. Behley et al. extend the task to LiDAR point clouds and propose LiDAR panoptic segmentation. In this experiment, we conduct panoptic segmentation on the SemanticKITTI dataset and report results on the validation set. For the evaluation metrics, we follow the metrics defined in , which are the same as those of image panoptic segmentation defined in , including Panoptic Quality (PQ), Segmentation Quality (SQ) and Recognition Quality (RQ), calculated across all classes. PQ† is defined by replacing the PQ of each stuff class with its IoU and then averaging over all classes as PQ does. Since the categories in panoptic segmentation contain two groups, i.e., stuff and things, these metrics are also reported separately on these two groups, including PQTh, PQSt, RQTh, RQSt, SQTh and SQSt, where Panoptic Quality (PQ) is usually used as the first criterion. For the experimental setting, we follow the LiDAR semantic segmentation, where Adam optimizer with learning rate = is used for optimization.
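For reference, the per-class relation between PQ, SQ and RQ in the standard image panoptic segmentation definition can be sketched as below: SQ is the mean IoU of matched segment pairs, RQ is an F1-style detection score, and PQ is their product. The function name `panoptic_quality` and the toy numbers are illustrative; the matching step itself (predicted and ground-truth segments are paired when their IoU exceeds 0.5) is assumed to have been done already.

```python
def panoptic_quality(matched_ious, n_fp, n_fn):
    """PQ, SQ and RQ for one class.

    matched_ious: IoU of each matched (true positive) segment pair.
    n_fp: unmatched predicted segments; n_fn: unmatched ground-truth segments.
    """
    n_tp = len(matched_ious)
    denom = n_tp + 0.5 * n_fp + 0.5 * n_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / n_tp if n_tp else 0.0   # segmentation quality
    rq = n_tp / denom                                # recognition quality
    return sq * rq, sq, rq

# Two matched instances, one spurious prediction, one missed instance.
pq, sq, rq = panoptic_quality([0.8, 0.6], n_fp=1, n_fn=1)
```

Here SQ is 0.7 and RQ is 2/3, giving a PQ of about 0.467. The overall score averages the per-class PQ over all classes.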
In this experiment, we use the proposed cylindrical partition as the partition method and asymmetrical 3D convolution networks as the backbone. Moreover, a semantic branch is used to output the semantic labels for stuff categories, and an instance branch is introduced to generate the instance-level features and further extract the instance IDs for things categories through heuristic clustering algorithms (we use mean-shift in the implementation; the bandwidth of the mean shift used in our backbone method is set to , while the minimum number of points in a valid instance is set to 50 for SemanticKITTI).
We report the results in Table VII. Our method achieves much better performance than existing methods [32, 22]. In terms of PQ, we obtain an improvement of about 4.7% points, and for the thing categories in particular, our method significantly outperforms the state of the art in terms of PQTh and RQTh by a large margin of 10% points. This indicates that our cylindrical partition and asymmetrical 3D convolution networks significantly benefit the recognition of the things classes. It is worth noting that PointGroup and LPASD perform poorly on the outdoor LiDAR segmentation task, which indicates that these indoor methods are not suitable for the challenging outdoor point clouds due to the different scenarios and inherent properties. Experimental results demonstrate the effectiveness of the proposed method and its good generalization ability. We show several samples of panoptic segmentation results in Fig. 11, where different colors represent different vehicles.
|KPConv  + PV-RCNN ||51.7||57.4||63.1||78.9||46.8||56.8||81.5||55.2||67.8||77.1||63.1|
Table VIII: LiDAR 3D detection results on nuScenes.
|Method||mAP||NDS|
|PP + Reconfig||32.5||50.6|
|SECOND + Cylinder||34.3||49.6|
|SECOND + Asym-CNN||33.0||48.3|
|SECOND + CyAs||36.4||51.7|
|SSN + CyAs||47.7||58.2|
|SSNv2 + CyAs||52.8||64.0|
4.4 LiDAR-based 3D Detection
LiDAR 3D detection aims to localize and classify multi-class objects in the point cloud. SECOND first utilizes 3D voxelization and 3D convolution networks to perform single-stage 3D detection. In this experiment, we follow the SECOND method and replace the regular voxelization and 3D convolution with the proposed cylindrical partition and asymmetrical 3D convolution networks, respectively. Similarly, to verify its scalability, we also extend the proposed modules to SSN. Furthermore, another strong baseline, SSNv2, is also adapted to verify the effectiveness of our method when the baseline is very competitive. The experiments are conducted on nuScenes dataset and the cylindrical partition also generates the representation. For the evaluation metrics, we follow the official metrics defined in nuScenes, i.e., mean average precision (mAP) and nuScenes detection score (NDS). For the other experimental settings, including the optimization method, target assignment, anchor size and network architecture of the multiple heads, we follow the settings in SSN.
The results are shown in Table VIII. PP + Reconfig is a partition enhancement approach based on PointPillar, while our SECOND + CyAs performs better with a similar backbone, which indicates the superiority of the cylindrical partition. To verify the effect of the different components of our method (i.e., cylindrical partition and asymmetrical 3D convolution networks) on LiDAR 3D detection, we design two variants, i.e., SECOND + Cylinder and SECOND + Asym-CNN. The results in Table VIII demonstrate that these two components consistently improve the baseline method, by 2.8% points and 1.5% points in terms of NDS, respectively. We then extend the proposed method (i.e., CyAs) to two baseline methods, termed SECOND + CyAs and SSN + CyAs, respectively. Comparing the baselines with their extensions shows that the proposed cylindrical partition and asymmetrical 3D convolution networks boost the performance consistently, even for the strong baseline, i.e., SSNv2, which demonstrates the effectiveness and scalability of our model. Our method consistently benefits different backbones, such as SECOND and SSN, showing its good generalization ability. Several qualitative results on the nuScenes dataset are shown in Fig. 12.
4.5 Ablation Studies
In this section, we perform thorough ablation experiments on the LiDAR-based semantic segmentation task to investigate the effect of different components in our method. We also design several variants of the asymmetrical residual block to verify our claim that strengthening the horizontal and vertical kernels improves the representation ability for driving-scene point clouds. For the 3D representation after cylindrical partition, we also try several other hyper-parameters to cross-validate these values.
Effects of Network Components. In this part, we build several variants of our model to validate the contributions of the different components. The results on the SemanticKITTI validation set are reported in Table IX. The baseline denotes the framework using 3D voxel partition (with cubic cells) and 3D convolution networks. It can be observed that cylindrical partition performs much better than cubic partition, with about 3% mIoU gain, and that asymmetrical 3D convolution networks also boost the performance significantly, by about 3%, which demonstrates that both are crucial to the proposed method. Furthermore, dimension-decomposition based context modeling delivers effective global context features and yields an improvement of 1.4%. The point-wise refinement module further improves this already strong model by about 0.7%. Overall, the proposed cylindrical partition and asymmetrical 3D convolution networks contribute the most to the performance improvement.
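The gains above are all reported in mIoU. As a reminder of what that number measures, the following minimal sketch computes per-class intersection-over-union from flat label lists and averages it. The function name, the `ignore_label` parameter and the choice to skip classes that appear in neither prediction nor ground truth are assumptions made for this illustration and may differ from the official evaluation scripts.

```python
def mean_iou(pred, target, num_classes, ignore_label=None):
    """Mean intersection-over-union over classes for per-point labels."""
    tp = [0] * num_classes  # correctly labelled points per class
    fp = [0] * num_classes  # points wrongly assigned to the class
    fn = [0] * num_classes  # points of the class that were missed
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # unlabelled points do not count
        if p == t:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    ious = []
    for c in range(num_classes):
        union = tp[c] + fp[c] + fn[c]
        if union > 0:  # assumption: skip classes absent from both inputs
            ious.append(tp[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six points.
toy_miou = mean_iou([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
```

In the toy example the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5.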
Variants of Asymmetrical Residual Block. To verify the effectiveness of the proposed asymmetrical residual block, we design several variants to investigate the effect of horizontal and vertical kernel enhancement (as shown in Fig. 13). The first variant is the regular residual block without any asymmetrical structure. The second is the 1D-asymmetrical residual block, which uses 1D asymmetrical kernels without height and thus strengthens the horizontal or vertical kernels in one dimension only. The third is the proposed asymmetrical residual block, which strengthens both horizontal and vertical kernels. These variants strengthen the skeleton of the convolution kernels step by step (from the regular residual block, to asymmetrical kernels without height, then to both horizontal and vertical kernels with height).
We conduct the ablation studies on the SemanticKITTI validation set. Note that we use the cylindrical partition as the partition method and stack each variant to build the 3D convolution networks for this experiment. The results are reported in Table X. Although the 1D-asymmetrical residual block only strengthens the horizontal and vertical kernels in one dimension, it still achieves a 1.3% gain in mIoU and gains of more than 5% for motorcycle, other-vehicle and bicyclist, which demonstrates the effectiveness of strengthening the skeleton of the convolution kernel even without the height dimension. Once the height is taken into consideration, the proposed asymmetrical residual block better matches the object distribution in driving scenes and further strengthens the skeleton of the kernels, which enhances the robustness to sparsity. As Table X shows, the proposed asymmetrical residual block boosts the performance significantly, by about 3%, with large improvements (about 10%) on some instance categories, including bicycle, person, other-vehicle and motorcycle, because it matches the point distribution of objects and enhances the representational power.
Size of 3D Representation As mentioned in implementation details, we set the size of 3D representation to . In this experiment, we use other hyper-parameters to cross-validate these values, including and . They cover the denser and sparser representations compared with representation. Furthermore, we also introduce a cubic partition with size of as the counterpart to investigate the effectiveness and compactness of cylindrical partition.
We conduct the experiments on the SemanticKITTI validation set, and all experiments share the same settings except for the representation size. The results are shown in Table XI. It can be found that the 3D representation with performs better than the other two representations, with a 2% point improvement over and . Since representation delivers a more compact representation with larger cylindrical cells, it might mis-split points of different categories into the same cell, which inevitably increases the information loss; while for representation, it contains fine-grained cylindrical cells but generates a larger representation, which might burden the training of the 3D convolution networks and degrade the performance. Compared to the cubic partition with , all cylindrical partitions achieve much better performance, and this consistent gain demonstrates the effectiveness of cylindrical partition. With this experiment, we cross-validate the representation and investigate the effect of different sizes of 3D representation.
|Size of Representation||mIoU|
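The trade-off discussed above follows from how points are assigned to cylindrical cells. The following minimal sketch maps a Cartesian point to a cell index in a (radius, angle, height) grid. The grid size, the radial limit and the height range are hypothetical placeholders chosen for readability; they are not the settings used in the experiments.

```python
import math


def cylindrical_cell(x, y, z, grid=(4, 8, 2), rho_max=50.0,
                     z_min=-4.0, z_max=2.0):
    """Return the (radius, angle, height) cell index of a point.

    grid, rho_max, z_min and z_max are hypothetical values for illustration.
    """
    rho = math.hypot(x, y)      # distance from the sensor in the ground plane
    theta = math.atan2(y, x)    # azimuth in [-pi, pi]

    def to_index(value, low, high, bins):
        # Uniform binning with clipping so that out-of-range points
        # fall into the nearest boundary cell.
        frac = (value - low) / (high - low)
        return min(bins - 1, max(0, int(frac * bins)))

    return (to_index(rho, 0.0, rho_max, grid[0]),
            to_index(theta, -math.pi, math.pi, grid[1]),
            to_index(z, z_min, z_max, grid[2]))


near_cell = cylindrical_cell(10.0, 0.0, 0.0)
far_cell = cylindrical_cell(100.0, 0.0, 5.0)  # clipped to the outermost cell
```

Because the angular bins have a fixed angular width, the arc length covered by a cell grows with the radius, so distant cells are larger and collect the sparser far-away points. A coarser grid makes every cell larger and raises the chance of mixing categories in one cell, while a finer grid increases the size of the representation.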
5.1 Comparison of Inference Time
To investigate the efficiency of the proposed method, we further compare its inference time with that of existing methods. In this experiment, we keep the settings unchanged and run the model in evaluation mode to measure the inference time. The inference times of existing methods are taken directly from .
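A per-scan inference time of this kind can be measured with a simple timing loop. The sketch below is a generic harness and not the measurement code behind Table XII; `model` stands for any callable, and the warm-up count is an assumption. With a GPU framework, one would additionally switch the network to evaluation mode and synchronize the device before reading the clock.

```python
import time


def mean_inference_time(model, inputs, warmup=2):
    """Average wall-clock seconds per input for a callable `model`."""
    if not inputs:
        raise ValueError("need at least one input to time")
    # Warm-up runs are excluded so one-off setup costs do not skew the mean.
    for sample in inputs[:warmup]:
        model(sample)
    start = time.perf_counter()
    for sample in inputs:
        model(sample)
    elapsed = time.perf_counter() - start
    return elapsed / len(inputs)


# Usage with a stand-in model that simply sums a list of numbers.
seconds_per_scan = mean_inference_time(sum, [[1, 2, 3]] * 10)
```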
The results are shown in Table XII. Compared to the 2D projection based method (whose inference time consists of computation time and post-processing time), i.e., RandLA, our method achieves about a 5.0× speedup with a 14% performance improvement, because it requires no post-processing. Moreover, compared to other 3D based methods, including MinkowskiNet and SPVNAS, we also achieve better performance with less inference time. The main reasons are twofold: 1) the proposed cylindrical partition generates a more compact representation than the regular cubic partition. For example, the regular cubic partition often has the cell of , and it thus generates a 3D representation of , which is more than 4 times larger than the cylindrical partition. 2) the asymmetrical 3D convolution networks incur a smaller computational overhead and use fewer parameters than regular 3D convolution networks. Specifically, using a convolution with kernel= followed by a convolution is equivalent to sliding a two layer network with the same receptive field as in a 3D convolution with kernel= , but it has 33% cheaper computational cost than a convolution with the same number of output filters. The cooperation of these two parts leads to an effective and efficient approach.
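The 33% figure can be checked with a short calculation. The kernel sizes are missing from the text above, so the sketch assumes a 3×1×3 convolution followed by a 1×3×3 convolution compared against a single 3×3×3 convolution, with equal input and output channel counts. These kernel shapes are an assumption made for illustration, chosen because they reproduce the stated saving.

```python
def macs_per_output(kernel, in_channels):
    """Multiply-accumulates needed for one output value of a 3D convolution."""
    depth, height, width = kernel
    return depth * height * width * in_channels


channels = 32  # hypothetical channel count; it cancels out in the ratio

# Assumed kernel shapes (not stated in the surviving text).
regular = macs_per_output((3, 3, 3), channels)
asymmetric = (macs_per_output((3, 1, 3), channels)
              + macs_per_output((1, 3, 3), channels))

saving = 1.0 - asymmetric / regular  # fraction of computation saved
```

Under these assumptions the two stacked layers use 9 + 9 = 18 kernel taps per channel against 27 for the regular kernel, which is one third cheaper.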
5.2 Comparison with other methods dealing with sparsity issue
Our proposed method utilizes the cylindrical partition and asymmetrical 3D convolution networks to handle the inherent difficulties of outdoor point clouds, i.e., sparsity and varying density. Hence, we further compare it with other methods that tackle the sparsity issue, to verify its effectiveness. Specifically, SPVNAS proposes a sparse point-voxel convolution to preserve fine details and deal with sparsity. MinkowskiNet adopts sparse tensors and proposes a generalized sparse convolution. We take them as the counterparts dealing with the sparsity issue and compare our method with them. Note that in our implementation, we also use sparse convolution to build the asymmetrical 3D convolution networks.
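Sparse convolution operates only on occupied cells, so the input is stored as a list of cell coordinates with one feature vector each rather than as a dense grid. The following minimal sketch builds such a coordinate-list representation from per-point cell indices. The function name and the mean pooling of point features are assumptions made for this illustration; they are not taken from the implementation described here.

```python
def sparsify(cell_indices, point_features):
    """Group points by cell and return (coords, feats) for occupied cells only."""
    sums = {}
    counts = {}
    for idx, feat in zip(cell_indices, point_features):
        key = tuple(idx)
        if key not in sums:
            sums[key] = [0.0] * len(feat)
            counts[key] = 0
        for i, value in enumerate(feat):
            sums[key][i] += value
        counts[key] += 1
    coords = sorted(sums)  # deterministic order of occupied cells
    # Assumption: features of points sharing a cell are averaged.
    feats = [[s / counts[c] for s in sums[c]] for c in coords]
    return coords, feats


coords, feats = sparsify([(0, 0, 0), (0, 0, 0), (1, 2, 0)],
                         [[1.0], [3.0], [5.0]])
```

Three points in two occupied cells yield two stored entries, however large the full grid is, which is why the memory and computation of sparse networks scale with the number of occupied cells.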
The results are shown in Table XIII. Compared to other methods handling the sparsity issue, our method achieves both better performance and better efficiency, which again demonstrates its superiority.
In this paper, we have proposed cylindrical and asymmetrical 3D convolution networks for LiDAR segmentation, which maintain the 3D geometric relations. Specifically, two key components, the cylindrical partition and asymmetrical 3D convolution networks, are designed to handle the inherent difficulties of outdoor LiDAR point clouds, namely sparsity and varying density, effectively and robustly. We conduct extensive experiments and ablation studies, in which the model achieves state-of-the-art results on SemanticKITTI, A2D2 and nuScenes and generalizes well to other LiDAR-based tasks, including LiDAR panoptic segmentation and LiDAR 3D detection.
-  (2020) 3D-mininet: learning a 2d representation from point clouds for fast and efficient 3d lidar semantic segmentation. arXiv preprint arXiv:2002.10893. Cited by: §2.
-  (2019) SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In ICCV, Cited by: §4.2, TABLE III, TABLE IV, TABLE V, TABLE VI, §5.1, TABLE XII.
-  (2019) SemanticKITTI: a dataset for semantic scene understanding of lidar sequences. In ICCV, pp. 9297–9307. Cited by: §1, §2, §4.1, §4.1, §4.2, TABLE I.
-  (2020) A benchmark for lidar-based panoptic segmentation based on kitti. arXiv preprint arXiv:2003.02371. Cited by: §4.3.
-  (2018) The lovász-softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In CVPR, pp. 4413–4421. Cited by: §3.6.
-  (2019) NuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027. Cited by: §1, §4.1, §4.1.
-  (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §2.
-  (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pp. 801–818. Cited by: §2.
-  (2020) Tensor low-rank reconstruction for semantic segmentation. arXiv preprint arXiv:2008.00490. Cited by: §3.3.
-  (2019) 4d spatio-temporal convnets: minkowski convolutional neural networks. In CVPR, pp. 3075–3084. Cited by: §5.1, §5.2, TABLE XII, TABLE XIII.
-  (2016) 3D u-net: learning dense volumetric segmentation from sparse annotation. In MICCAI, pp. 424–432. Cited by: §1, §2, §2.
-  (2021) Input-output balanced framework for long-tailed lidar semantic segmentation. In 2021 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §2.
-  (2020) SalsaNext: fast, uncertainty-aware semantic segmentation of lidar point clouds for autonomous driving. Cited by: §2, §3.6, §4.2, TABLE I, TABLE II.
-  (2019) Acnet: strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In ICCV, pp. 1911–1920. Cited by: §3.3, §3.3.
-  (2020) 3D-mpa: multi-proposal aggregation for 3d semantic instance segmentation. In CVPR, pp. 9031–9040. Cited by: §2.
-  (2020) TORNADO-net: multiview total variation semantic segmentation with diamond inception module. arXiv preprint arXiv:2008.10544. Cited by: §4.2, TABLE I.
-  (2019) A2D2: aev autonomous driving dataset. Note: http://www.a2d2.audi 1 (4). Cited by: §1, §4.1, §4.2.
-  (2018) 3d semantic segmentation with submanifold sparse convolutional networks. In CVPR, pp. 9224–9232. Cited by: §1, §2.
-  (2020) OccuSeg: occupancy-aware 3d instance segmentation. In CVPR, pp. 2940–2949. Cited by: §2.
-  (2020) LiDAR-based panoptic segmentation via dynamic shifting network. arXiv preprint arXiv:2011.11964. Cited by: §2, §3.6.
-  (2020) RandLA-net: efficient semantic segmentation of large-scale point clouds. In CVPR, pp. 11108–11117. Cited by: Fig. 1, §2, §3.6, TABLE I, §5.1, TABLE XII.
-  (2020) PointGroup: dual-set point grouping for 3d instance segmentation. In CVPR, pp. 4867–4876. Cited by: §4.3, TABLE VII.
-  (2019) Panoptic segmentation. In CVPR, pp. 9404–9413. Cited by: §4.3.
-  (2020) KPRNet: improving projection-based lidar semantic segmentation. arXiv preprint arXiv:2007.12668. Cited by: TABLE I.
-  (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In CVPR, pp. 4558–4567. Cited by: §2.
-  (2019) Pointpillars: fast encoders for object detection from point clouds. In CVPR, pp. 12697–12705. Cited by: §1, §4.4, TABLE VIII.
-  (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In CVPR, pp. 82–92. Cited by: §2.
-  (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440. Cited by: §2.
-  (2020) Learning to segment 3d point clouds in 2d image space. In CVPR, pp. 12255–12264. Cited by: §2.
-  (2019) Trafficpredict: trajectory prediction for heterogeneous traffic-agents. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6120–6127. Cited by: §1.
-  (2019) Vv-net: voxel vae net with group convolutions for point cloud segmentation. In ICCV, pp. 8500–8508. Cited by: §2.
-  (2020) LiDAR Panoptic Segmentation for Autonomous Driving. In IROS, Cited by: §4.3, TABLE VII.
-  (2019) RangeNet++: fast and accurate lidar semantic segmentation. In IROS, pp. 4213–4220. Cited by: Fig. 1, §1, §2, §4.2, §4.2, TABLE I, TABLE II.
-  (2021) SIDE: center-based stereo 3d detector with structure-aware instance depth estimation. arXiv preprint arXiv:2108.09663. Cited by: §1.
-  (2019) JSIS3D: joint semantic-instance segmentation of 3d point clouds with multi-task pointwise networks and multi-value conditional random fields. In CVPR, pp. 8827–8836. Cited by: §2.
-  (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In CVPR, pp. 652–660. Cited by: §2.
-  (2017) 3d graph neural networks for rgbd semantic segmentation. In ICCV, pp. 5199–5208. Cited by: §2.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §2.
-  (2020) SpSequenceNet: semantic segmentation network on 4d point clouds. In CVPR, pp. 4574–4583. Cited by: §4.2, TABLE III, TABLE IV.
-  (2020) Pv-rcnn: point-voxel feature set abstraction for 3d object detection. In CVPR, pp. 10529–10538. Cited by: TABLE VII.
-  (2020) Searching efficient 3d architectures with sparse point-voxel convolution. arXiv preprint arXiv:2007.16100. Cited by: §2, §4.2, TABLE I, §5.1, §5.2, TABLE XII, TABLE XIII.
-  (2018) Tangent convolutions for dense prediction in 3d. In CVPR, pp. 3887–3896. Cited by: TABLE I, TABLE III, TABLE IV, TABLE XII.
-  (2017) Segcloud: semantic segmentation of 3d point clouds. In 3DV, pp. 537–547. Cited by: §2.
-  (2019) Kpconv: flexible and deformable convolution for point clouds. In ICCV, pp. 6411–6420. Cited by: §2, §4.2, TABLE I, TABLE III, TABLE IV, TABLE VII, TABLE XII.
-  (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §2.
-  (2019) Graph attention convolution for point cloud semantic segmentation. In CVPR, pp. 10296–10305. Cited by: §2.
-  (2020) Reconfigurable voxels: a new representation for lidar-based point clouds. Conference on Robot Learning. Cited by: §4.4, TABLE VIII.
-  (2021) FCOS3D: fully convolutional one-stage monocular 3d object detection. arXiv preprint arXiv:2104.10956. Cited by: §1.
-  (2021) Probabilistic and geometric depth: detecting objects in perspective. arXiv preprint arXiv:2107.14160. Cited by: §1.
-  (2019) Shape robust text detection with progressive scale expansion network. In CVPR, pp. 9336–9345. Cited by: §3.3, §3.3.
-  (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: §2.
-  (2018) Squeezeseg: convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In ICRA, pp. 1887–1893. Cited by: §1, §2, §4.2, TABLE V, TABLE VI.
-  (2019) Squeezesegv2: improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In ICRA, pp. 4376–4382. Cited by: §2, §4.2, TABLE V, TABLE VI.
-  (2019) Pointconv: deep convolutional networks on 3d point clouds. In CVPR, pp. 9621–9630. Cited by: §2.
-  (2020) Squeezesegv3: spatially-adaptive convolution for efficient point-cloud segmentation. arXiv preprint arXiv:2004.01803. Cited by: Fig. 1, §4.2, TABLE I.
-  (2019) Depth completion from sparse lidar data with depth-normal constraints. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2811–2820. Cited by: §1.
-  (2020) PointASNL: robust point clouds processing using nonlocal neural networks with adaptive sampling. In CVPR, pp. 5589–5598. Cited by: §2.
-  (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §3.6, §4.4, §4.4, TABLE VIII, §5.2.
-  (2020) Deep fusionnet for point cloud semantic segmentation. ECCV. Cited by: §2, §4.2, TABLE I.
-  (2019) Co-occurrent features in semantic segmentation. In CVPR, pp. 548–557. Cited by: §3.3.
-  (2020) Fusion-aware point convolution for online semantic 3d scene segmentation. In CVPR, pp. 4534–4543. Cited by: §2.
-  (2020) PolarNet: an improved grid representation for online lidar point clouds semantic segmentation. In CVPR, pp. 9601–9610. Cited by: Fig. 1, §1, §2, §4.1, §4.2, §4.2, TABLE I, TABLE II, TABLE V, TABLE VI.
-  (2017) Pyramid scene parsing network. In CVPR, pp. 2881–2890. Cited by: §2.
-  (2021) LIF-seg: lidar and camera image fusion for 3d lidar semantic segmentation. arXiv preprint arXiv:2108.07511. Cited by: §2.
-  (2020) Cylinder3d: an effective 3d framework for driving-scene lidar semantic segmentation. arXiv preprint arXiv:2008.01550. Cited by: §2.
-  (2020) SSN: shape signature networks for multi-class object detection from point clouds. ECCV. Cited by: §3.6, §4.4, TABLE VIII.
-  (2021) Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9939–9948. Cited by: §2.