1 Introduction
The 3D LiDAR sensor has become an indispensable device in modern autonomous driving vehicles [30]. It captures more precise and farther-away distance measurements [56] of the surrounding environment than conventional visual cameras [49, 48, 34]. The measurements of the sensor naturally form 3D point clouds that can be used to realize a thorough scene understanding for autonomous driving planning and execution, in which LiDAR-based segmentation and detection are crucial for driving-scene perception and understanding.
Recently, advances in deep learning have significantly pushed forward the state of the art in the image domain, such as image segmentation and detection. Some existing LiDAR-based perception approaches follow this route and project the 3D point clouds onto a 2D space, processing them via 2D convolution networks, including range-image-based [33, 52] and bird's-eye-view-based [62, 26] methods. However, this group of methods loses and alters the accurate 3D geometric information during the 3D-to-2D projection (as shown in the top row of Fig. 1(a)). A natural alternative is to utilize 3D partition and 3D convolution networks to process the point cloud and maintain its 3D geometric relations. However, in our initial attempts, we directly applied 3D voxelization [18, 11] and 3D convolution networks to outdoor LiDAR point clouds, only to find a very limited performance gain (as shown in Fig. 1(b)). Our investigation into this issue reveals a key difficulty of outdoor LiDAR point clouds, namely sparsity and varying density, which is also the key difference from indoor scenes with dense and uniform-density points. Previous 3D voxelization methods treat the point cloud as a uniform one and split it via uniform cubes, neglecting the varying-density property of outdoor point clouds. Consequently, directly applying 3D partition to outdoor point clouds meets this fundamental difficulty.
Motivated by these findings, we propose a new framework for outdoor LiDAR segmentation that consists of two key components, i.e., 3D cylindrical partition and asymmetrical 3D convolution networks, which maintain the 3D geometric information and handle these issues from the partition and network sides, respectively. Here, cylindrical partition resorts to cylinder coordinates to divide the point cloud dynamically according to distance (regions far away from the origin have much sparser points and thus require a larger cell), which produces a more balanced point distribution (as shown in Fig. 1(a)); meanwhile, the asymmetrical 3D convolution networks strengthen the horizontal and vertical kernels to match the point distribution of objects in the driving scene and enhance the robustness to sparsity. Moreover, voxel-based methods might divide points with different categories into the same cell, and cell-label encoding would inevitably cause information loss (for LiDAR-based segmentation tasks). To alleviate the interference of lossy label encoding, a point-wise module is introduced to further refine the features obtained from the voxel-based network. Overall, the cooperation of these components well maintains the geometric relations and tackles the difficulty of outdoor point clouds, thus improving the effectiveness of 3D frameworks.
Since the learned features from our model can be used for downstream tasks, we benchmark our model on a variety of LiDAR-based perception tasks, such as LiDAR-based semantic segmentation, panoptic segmentation and 3D detection. For semantic segmentation, we evaluate the proposed method on several large-scale outdoor datasets, including SemanticKITTI [3], nuScenes [6] and A2D2 [17]. Our method achieves the state of the art on the leaderboard of SemanticKITTI (both single-scan and multi-scan challenges) and also outperforms the existing methods on nuScenes and A2D2 by a large margin. We also extend the proposed cylindrical partition and asymmetrical 3D convolution networks to LiDAR panoptic segmentation and LiDAR 3D detection. For panoptic segmentation and 3D detection, experimental results on SemanticKITTI and nuScenes, respectively, show strong performance and good generalization capability.
The contributions of this work mainly lie in three aspects:

We reposition the focus of outdoor LiDAR segmentation from 2D projection to 3D structure, and further investigate the inherent properties (and difficulties) of outdoor point clouds.

We introduce a new framework to explore the 3D geometric pattern and tackle the difficulties caused by sparsity and varying density, through cylindrical partition and asymmetrical 3D convolution networks.

The proposed method achieves the state of the art on LiDAR-based semantic segmentation, LiDAR panoptic segmentation and LiDAR point cloud 3D detection, which also demonstrates its strong generalization capability.
2 Related Work
Deep Learning for Indoor-scene Point Cloud. Indoor-scene point clouds exhibit several properties, including generally uniform density, a small number of points, and a small scene range. Mainstream methods [36, 44, 54, 51, 45, 29, 15, 61, 57, 46, 35, 37] for indoor point cloud segmentation learn point features directly from the raw points; they often build on the pioneering work PointNet and improve the sampling, grouping and ordering to achieve better performance. Another group of methods utilizes clustering algorithms [51, 45] to extract hierarchical point features. However, these methods focusing on indoor point clouds struggle to adapt to outdoor point clouds due to the varying density and large scene range, and the large number of points also causes computational difficulties when deploying them from indoor to outdoor scenes.
Deep Learning for Outdoor-scene Point Cloud. Most existing approaches for outdoor-scene point clouds [21, 13, 64, 33, 1, 59, 25, 20, 12] focus on converting the 3D point cloud to 2D grids to enable the usage of 2D Convolutional Neural Networks. SqueezeSeg [52], Darknet [3], SqueezeSegV2 [53], and RangeNet++ [33] utilize the spherical projection mechanism, which converts the point cloud to a frontal-view image or a range image, and adopt 2D convolution networks on the pseudo image for point cloud segmentation or detection. PolarNet [62] follows the bird's-eye-view projection, which projects point cloud data into a bird's-eye-view representation under polar coordinates. However, these 3D-to-2D projection methods inevitably lose and alter the 3D topology and fail to model the geometric information. Moreover, in most outdoor scenes, a LiDAR device is used to produce the point cloud data, whose inherent properties, i.e., sparsity and varying density, are often neglected.
3D Voxel Partition. 3D voxel partition is another routine of point cloud encoding [19, 43, 18, 11, 31, 65, 67]. It converts a point cloud into 3D voxels, which largely retains the 3D geometric information. OccuSeg [19], SSCN [18] and SEGCloud [43] follow this line to utilize the voxel partition and apply regular 3D convolutions for LiDAR segmentation. It is worth noting that while the aforementioned efforts have shown encouraging results, the improvement on outdoor LiDAR point clouds remains limited. As mentioned above, a common issue is that these methods neglect the inherent properties of outdoor LiDAR point clouds, namely, sparsity and varying density. In contrast, our proposed method resorts to the 3D cylindrical partition and asymmetrical 3D convolution networks to tackle these difficulties.
Network Architectures for Feature Extraction. Fully Convolutional Network (FCN) [28] is the fundamental work for segmentation tasks in the deep-learning era. Built upon the FCN, many works aim to improve performance via dilated convolution, multi-scale context modeling and attention modeling, including DeepLab [7, 8] and PSP [63]. Recent work utilizes neural architecture search to find more effective backbones for segmentation [27, 41]. Particularly, U-Net [38] proposes a symmetric architecture to incorporate low-level features. With the great success of U-Net on 2D benchmarks and its good flexibility, many studies for LiDAR-based perception adapt the U-Net to the 3D space [11]. We also follow this structure to construct our asymmetrical 3D convolution networks.
3 Methodology
3.1 Framework Overview
As shown in the top and middle rows of Fig. 2, we elaborate the pipeline of our model for LiDAR-based segmentation and detection tasks. In the context of semantic segmentation, given a point cloud, the task is to assign a semantic label to each point. Based on the comparison between 2D and 3D representations and the investigation of the inherent properties of outdoor LiDAR point clouds, we aim to obtain a framework that explores the 3D geometric information and handles the difficulty caused by sparsity and varying density. To this end, we propose a new outdoor segmentation approach based on 3D partition and 3D convolution networks. To handle the difficulties of outdoor LiDAR point clouds, namely sparsity and varying density, we first employ the cylindrical partition to generate a more balanced point distribution (more robust to varying density), then apply the asymmetrical 3D convolution networks to strengthen the horizontal and vertical weights, thus well matching the object point distribution in the driving scene and enhancing robustness to sparsity. The same backbone with cylindrical partition and asymmetrical convolution networks is also adapted to LiDAR-based 3D detection (shown in the middle row of Fig. 2).
Specifically, the framework consists of two major components, namely cylindrical partition and asymmetrical 3D convolution networks. The LiDAR point cloud is first divided by the cylindrical partition, and the features extracted by an MLP are then reassigned based on this partition. Asymmetrical 3D convolution networks are then used to generate the voxel-wise outputs. For segmentation tasks, a point-wise module is introduced to alleviate the interference of lossy cell-label encoding, thus refining the outputs. In the following sections, we present these components in detail.
3.2 Cylindrical Partition
As mentioned above, outdoor-scene LiDAR point clouds possess the property of varying density, where nearby regions have much greater density than farther-away regions. Therefore, splitting the varying-density points with uniform cells leads to an imbalanced distribution (for example, a larger proportion of empty cells). The cylinder coordinate system, in contrast, utilizes an increasing grid size to cover the farther-away regions, and thus distributes the points more evenly across different regions and gives a more balanced representation against the varying density. We present a statistic showing the proportion of non-empty cells across different distances in Fig. 3. It can be found that as the distance grows, cylindrical partition maintains a balanced non-empty proportion due to the increasing grid size, while cubic partition suffers from an imbalanced distribution, especially in the farther-away regions (about 6 times fewer non-empty cells than cylindrical partition). Moreover, unlike projection-based methods that project the points onto a 2D view, cylindrical partition maintains the 3D grid representation to retain the geometric structure.
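As a concrete sketch of this partition, the following NumPy snippet performs the Cartesian-to-cylinder transform, quantizes each point into a cell, and max-pools point-wise features per cell; the grid size and coordinate ranges here are illustrative assumptions, not our experimental settings:

```python
import numpy as np

def cylindrical_partition(points, feats, grid=(24, 18, 8),
                          rho_max=50.0, z_range=(-4.0, 2.0)):
    """Sketch of cylindrical partition. points: (N, 3) Cartesian x, y, z;
    feats: (N, C) point-wise features. Grid and ranges are placeholders."""
    x, y, z = points.T
    rho = np.sqrt(x ** 2 + y ** 2)      # radius in the x-y plane
    theta = np.arctan2(y, x)            # azimuth in [-pi, pi)
    # quantize each cylinder coordinate into a cell index, clamping outliers
    r_i = np.clip((rho / rho_max * grid[0]).astype(int), 0, grid[0] - 1)
    a_i = np.clip(((theta + np.pi) / (2 * np.pi) * grid[1]).astype(int),
                  0, grid[1] - 1)
    h_i = np.clip(((z - z_range[0]) / (z_range[1] - z_range[0])
                   * grid[2]).astype(int), 0, grid[2] - 1)
    cell = np.ravel_multi_index((r_i, a_i, h_i), grid)  # point-cylinder mapping
    # max-pool point-wise features within each cell (empty cells stay -inf)
    vol = np.full((np.prod(grid), feats.shape[1]), -np.inf)
    np.maximum.at(vol, cell, feats)
    return vol.reshape(*grid, feats.shape[1]), cell
```

The returned mapping `cell` is what allows features to be scattered back to points later, as used by the point-wise refinement module.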
The workflow is illustrated in Fig. 4. We first transform the points from the Cartesian coordinate system to the cylinder coordinate system. This step transforms the points (x, y, z) to points (ρ, θ, z), where the radius ρ = √(x² + y²) (distance to the origin in the x-y plane) and the azimuth θ = arctan(y/x) (angle from the x-axis to the y-axis) are calculated. Then cylindrical partition performs the split on these three dimensions; note that in the cylinder coordinate, the farther away the region is, the larger the cell will be. Point-wise features obtained from the MLP are reassigned based on the result of this partition to get the cylindrical features. Specifically, the point-cylinder mapping contains the index from point-wise features to cylinders. Based on this mapping function, point-wise features within the same cylinder are grouped together and processed via max-pooling to get the cylindrical features. After these steps, we unroll the cylinder from 0 degrees and get the 3D cylindrical representation R ∈ ℝ^(C×H×W×L), where C denotes the feature dimension and H, W and L mean the radius, azimuth and height, respectively. Subsequent asymmetrical 3D convolution networks are performed on this representation.
3.3 Asymmetrical 3D Convolution Network
Since the driving-scene point cloud exhibits a specific object shape distribution, covering cars, trucks, buses, motorcycles and other cuboid objects, we aim to follow this observation to enhance the representational power of a standard 3D convolution. Moreover, recent literature [50, 14] also shows that the central criss-cross weights count more in the square convolution kernel. In this way, we devise the asymmetrical residual block to strengthen the horizontal and vertical responses and match the object point distribution. Based on the proposed asymmetrical residual block, we further build the asymmetrical downsample block and asymmetrical upsample block to perform the downsample and upsample operations. Moreover, a dimension-decomposition based context modeling (termed DDCM) is introduced to explore the high-rank global context in a decompose-aggregate strategy. We detail these components in the bottom row of Fig. 2.
Asymmetrical Residual Block. Motivated by the observations and conclusions in [50, 14], the asymmetrical residual block strengthens the horizontal and vertical kernels, which matches the point distribution of objects in the driving scene and explicitly makes the skeleton of the kernel powerful, thus enhancing robustness to the sparsity of outdoor LiDAR point clouds. We use Car and Motorcycle as examples to illustrate the asymmetrical residual block in Fig. 6, where the 3D convolutions are performed on the cylindrical grids. Moreover, the proposed asymmetrical residual block also saves computation and memory cost compared to the regular square-kernel 3D convolution block. By incorporating the asymmetrical residual block, the asymmetrical downsample block and upsample block are designed, and our asymmetrical 3D convolution networks are built by stacking these downsample and upsample blocks.
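The cost saving can be seen with a back-of-the-envelope parameter count; assuming the asymmetrical branch uses a 3×1×3 and a 1×3×3 kernel pair in place of one square 3×3×3 kernel (an illustrative assumption, with C input and C output channels and biases omitted):

```python
def square_params(c):
    # one 3x3x3 convolution: 27 * C^2 weights
    return 3 * 3 * 3 * c * c

def asym_params(c):
    # asymmetrical pair 3x1x3 + 1x3x3: (9 + 9) * C^2 = 18 * C^2 weights,
    # i.e. a one-third reduction versus the square kernel
    return (3 * 1 * 3 + 1 * 3 * 3) * c * c
```

The same ratio (2/3) holds for the multiply-accumulate count per output voxel, since both layouts slide over the same grid.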
Dimension-Decomposition based Context Modeling. Since the global context features should be high-rank to have enough capacity to capture the large context varieties [60], it is hard to construct these features directly. We follow tensor decomposition theory [9] to build the high-rank context as a combination of low-rank tensors, where we use three rank-1 kernels to obtain the low-rank features and then aggregate them together to get the final global context.
3.4 Sparse Activation Visualization
As mentioned above, the proposed cylindrical partition and asymmetrical 3D networks aim to tackle the difficulties caused by sparsity and varying density in outdoor point clouds. We thus visualize some filter activations from the regular 3D convolution networks (with regular cubic partition) and the asymmetrical 3D convolution networks (with cylindrical partition), respectively. The results are shown in Fig. 5. Fig. 5(a) and (b) are extracted from the regular 3D convolution networks, which are activated in almost all regions; in contrast, the proposed asymmetrical 3D convolution networks produce sparser activations and mainly focus on certain regions (as shown in Fig. 5(c) and (d)). This demonstrates that the proposed model can adaptively handle the sparse point cloud input and focus on the relevant regions.
3.5 Pointwise Refinement Module
Partition-based methods predict one label for each cell. Although partition-based methods effectively explore the large-range point cloud, this group of methods, including cube-based and cylinder-based, inevitably suffers from lossy cell-label encoding; e.g., points with different categories are divided into the same cell, which causes information loss for the point cloud semantic segmentation task (as shown in the middle row of Fig. 2). We present a statistic showing the effect of different label encoding methods with cylindrical partition in Fig. 7, where majority encoding means using the major category of points inside a cell as the cell label and minority encoding indicates using the minor category as the cell label. It can be observed that neither of them can reach 100 percent mIoU (ideal encoding), so both inevitably incur information loss. Hence, the point-wise refinement module is introduced to alleviate the interference of lossy cell-label encoding. We first project the cylindrical features back to the points based on the inverse point-cylinder mapping table (note that points inside the same cylinder are assigned the same cylindrical features). Then the point-wise module takes the point features before and after the 3D convolution networks as input and fuses them together to refine the output. We also show the detailed structure of the MLPs in the point-wise refinement module and cylindrical partition in Fig. 8.
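The lossy effect of majority encoding can be demonstrated with a toy example (the cell IDs and labels below are hypothetical): once a cell's points are replaced by the cell's majority label, any minority points in a mixed cell are mislabeled, so the decoded accuracy falls below 100%.

```python
import numpy as np

def majority_encoding_accuracy(cell_ids, point_labels):
    """Assign each cell its majority point label, broadcast it back to the
    points, and return the fraction of points whose label survives."""
    per_cell = {}
    for c, l in zip(cell_ids, point_labels):
        per_cell.setdefault(c, []).append(l)
    majority = {c: int(np.bincount(ls).argmax()) for c, ls in per_cell.items()}
    decoded = np.array([majority[c] for c in cell_ids])
    return float((decoded == point_labels).mean())

# cell 0 mixes classes 1 and 2 -> the class-2 point is lost by the encoding
acc_mixed = majority_encoding_accuracy(np.array([0, 0, 1]), np.array([1, 2, 1]))
# one point per cell -> encoding is lossless
acc_pure = majority_encoding_accuracy(np.array([0, 1, 2]), np.array([1, 2, 1]))
```

Here `acc_mixed` is below 1.0 while `acc_pure` equals 1.0, mirroring the gap between majority and ideal encoding in Fig. 7.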
3.6 Objective Function
For the LiDAR-based semantic segmentation task, the total objective of our method consists of two components, i.e., a voxel-wise loss and a point-wise loss. It can be formulated as L = L_voxel + L_point. For the voxel-wise loss (L_voxel), we follow the existing methods [13, 21] and use the weighted cross-entropy loss and the Lovasz-Softmax [5] loss to maximize the point accuracy and the intersection-over-union score, respectively. For the point-wise loss (L_point), we only use the weighted cross-entropy loss to supervise the training. During inference, the output of the point-wise refinement module is used as the final output.
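A minimal sketch of this objective is given below, with the Lovasz-Softmax term of the voxel-wise loss omitted for brevity; the probabilities and class weights are illustrative inputs, not values from our experiments:

```python
import numpy as np

def weighted_ce(probs, labels, class_weights):
    """Weighted cross-entropy over N samples; probs is the (N, C) softmax
    output, labels is (N,) integer classes, class_weights is (C,)."""
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-class_weights[labels] * np.log(picked + 1e-12)))

def total_loss(voxel_probs, voxel_labels, point_probs, point_labels, w):
    # L = L_voxel + L_point (Lovasz-Softmax term of L_voxel omitted here)
    return (weighted_ce(voxel_probs, voxel_labels, w)
            + weighted_ce(point_probs, point_labels, w))
```

The per-class weights are what counteract the heavy class imbalance of driving scenes (road and vegetation dominate rare classes such as bicyclist).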
For the LiDAR-based panoptic segmentation task, in addition to the semantic segmentation loss, the objective also contains the loss of the instance branch [20], which utilizes center regression to achieve the clustering.
4 Experiments
In this section, we benchmark the proposed model on three downstream tasks. For the semantic segmentation task, we evaluate the proposed method on several large-scale datasets, i.e., SemanticKITTI, nuScenes and A2D2. SemanticKITTI and nuScenes are also used for panoptic segmentation and 3D detection, respectively. Furthermore, extensive ablation studies on the LiDAR semantic segmentation task are conducted to validate each component.
4.1 Dataset and Metric
SemanticKITTI [3]
is a large-scale driving-scene dataset for point cloud segmentation, covering both semantic segmentation and panoptic segmentation. It is derived from the KITTI Vision Odometry Benchmark and was collected in Germany with a Velodyne HDL-64E LiDAR. The dataset consists of 22 sequences, with sequences 00 to 10 as the training set (where sequence 08 is used as the validation set) and sequences 11 to 21 as the test set. 19 classes remain for training and evaluation after merging classes with different moving statuses and ignoring classes with very few points. The dataset hosts two challenges, namely single-scan and multi-scan point cloud semantic segmentation, where single-scan denotes single-frame point cloud semantic segmentation and multi-scan denotes multiple-frame point cloud segmentation, respectively. The key difference is that multi-scan semantic segmentation requires classifying the moving categories, including moving car, moving truck, moving person, moving bicyclist and moving motorcyclist.
nuScenes [6] collects 1000 scenes of 20s duration with a 32-beam LiDAR sensor. The total number of frames is 40,000, sampled at 20Hz. The data is officially split into training and validation sets. After merging similar classes and removing rare classes, 16 classes remain for LiDAR semantic segmentation.
A2D2 [17] We follow the data preprocessing in [62] to generate the labels and process the point cloud data. A2D2 uses five asynchronous LiDAR sensors, where each sensor covers a portion of the surrounding view. After LiDAR panoramic stitching, the A2D2 dataset is split into 22408 training, 2774 validation and 13264 test scans, respectively, with 38-class segmentation annotation. Since there are 38 categories in the A2D2 dataset, some of which have only subtle differences, it is harder than the other datasets, SemanticKITTI and nuScenes.
Implementation Details For these datasets, the Cartesian spaces differ because they are related to the LiDAR sensor range. In our implementation, we fix the Cartesian spaces to be , , and for SemanticKITTI, nuScenes and A2D2, respectively. After transforming to the cylindrical spaces, they are fixed to be , , and . In this way, the proposed cylindrical spaces cover more than 99% of the points of each point cloud scan on average, and points out of the spaces are assigned to the closest cylindrical cell. For all datasets, cylindrical partition splits these point clouds into a 3D representation with size = , where the three dimensions indicate the radius, angle and height, respectively. We also perform ablation studies to investigate and cross-validate the effect of these parameters. We use an NVIDIA V100 GPU with 16GB memory to train the proposed model with batch size = 2.
Evaluation Metric To evaluate the proposed method, we follow the official guidance to leverage mean intersection-over-union (mIoU) as the evaluation metric defined in [3, 6], which can be formulated as: IoU_c = TP_c / (TP_c + FP_c + FN_c), where TP_c, FP_c and FN_c represent the true positive, false positive, and false negative predictions for class c, and mIoU is the mean value of IoU_c over all classes.
4.2 LiDAR-based Semantic Segmentation
Results on SemanticKITTI Single-scan Semantic Segmentation In this experiment, we compare the results of our proposed method with existing state-of-the-art LiDAR segmentation methods on the SemanticKITTI single-scan test set. The target is to generate the semantic prediction for a single-frame point cloud. As shown in Table I, our method outperforms all existing methods in terms of mIoU. Compared to the projection-based methods on 2D space, including Darknet53 [3], SqueezeSegV3 [55], RangeNet++ [33] and PolarNet [62], our method achieves an 8%–17% performance gain in terms of mIoU due to the modeling of 3D geometric information. Compared to voxel-partition and 3D-convolution based methods, including FusionNet [59], TORNADONet [16] (a multi-view fusion based method) and SPVNAS [41] (utilizing neural architecture search for LiDAR segmentation), the proposed method also performs better, as the cylindrical partition and asymmetrical 3D convolution networks well handle the difficulty of driving-scene LiDAR point clouds that is neglected by these methods.
Visualization We show some visualization results of single-scan segmentation in Fig. 9, sampled from the SemanticKITTI validation set. It can be observed that the proposed method achieves decent accuracy overall, and well separates nearby objects and accurately identifies them because it maintains the 3D topology and utilizes the geometric information (we highlight corresponding regions with red rectangles). These visualizations verify our claim that keeping the 3D structure and a more balanced point distribution benefits the segmentation results.
Results on SemanticKITTI Multi-scan Semantic Segmentation Unlike single-scan semantic segmentation, the multi-scan segmentation in SemanticKITTI takes multiple frames of point clouds as input and predicts additional categories with moving status, including moving car, moving truck, moving other-vehicle, moving person, moving bicyclist and moving motorcyclist. In this experiment, we first perform multiple-frame point cloud fusion. Specifically, the sequential point clouds in LiDAR coordinates are first transformed to global coordinates. Then, these sequential point clouds are fused in the global coordinate system. Finally, all points are transformed to the coordinate system of the last frame. In this way, we achieve multiple-frame fusion; we use 3 sequential point clouds as input data in our implementation. We show an example in Fig. 10. It can be found that moving cars have multiple shifted point clouds while stationary cars keep all points in the same location.
The results of multi-scan semantic segmentation are shown in Tables III and IV. Generally, our method outperforms all existing methods in terms of mIoU, achieving 0.3% and 8.4% gains compared to KPConv [44] (ICCV 2019) and SpSeqnet [39] (CVPR 2020), respectively. Our method obtains superior performance for most categories, even for some small objects like bicycle and motorcycle. Among the moving categories, our method achieves the best performance on moving car and moving truck.
Results on nuScenes For the nuScenes LiDAR-seg dataset, we report the results on its validation set. As shown in Table II, our method achieves better performance than existing methods in all categories, and this consistent improvement demonstrates the capability of the proposed model. Specifically, the proposed method obtains about a 4%–7% performance gain over projection-based methods. Moreover, for categories with sparse points, such as bicycle and pedestrian, our method significantly outperforms existing approaches, which also demonstrates its effectiveness in tackling sparsity and varying density. Note that RangeNet++ [33] and SalsaNext [13] perform post-processing, including KNN, etc.
Results on A2D2 We report the results on the A2D2 [17] validation set. As shown in Tables V and VI, the proposed method performs much better than existing methods, by about 3% in terms of mIoU, including SqueezeSeg [52], SqueezeSegV2 [53], DarkNet53 [2] and PolarNet [62], all of which are based on 2D projection and 2D convolution networks. Specifically, our method achieves better performance on almost all categories consistently, which also demonstrates its effectiveness. Note that due to the more fine-grained categories in A2D2 (38 in total), it is harder than the other datasets, such as SemanticKITTI and nuScenes, and more categories have zero values.
In general, our method achieves consistent state-of-the-art performance on all three datasets with different settings (single-scan and multi-scan) and sensor ranges. This clearly demonstrates the effectiveness of the proposed method and its good generalization capability across different datasets.
4.3 LiDARbased Panoptic Segmentation
Panoptic segmentation was first proposed in [23] as a new task, in which semantic segmentation is performed for background classes and instance segmentation for foreground classes; these two groups of categories are also termed stuff and things classes, respectively. Behley et al. [4] extend the task to LiDAR point clouds and propose LiDAR panoptic segmentation. In this experiment, we conduct panoptic segmentation on the SemanticKITTI dataset and report results on the validation set. For the evaluation metrics, we follow the metrics defined in [4], which are the same as those of image panoptic segmentation defined in [23], including Panoptic Quality (PQ), Segmentation Quality (SQ) and Recognition Quality (RQ), all calculated across all classes. PQ^{†} is defined by swapping the PQ of each stuff class to its IoU and then averaging over all classes as PQ does. Since the categories in panoptic segmentation contain two groups, i.e., stuff and things, these metrics are also computed separately on the two groups, yielding PQ^{Th}, PQ^{St}, RQ^{Th}, RQ^{St}, SQ^{Th} and SQ^{St}, where Panoptic Quality (PQ) is usually used as the primary criterion. For the experimental setting, we follow the LiDAR semantic segmentation setup, where the Adam optimizer with learning rate = is used for optimization.
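Following the PQ definition of [23], the per-class metrics can be computed from matched segment pairs as below; this is a simple sketch of the formula PQ = SQ × RQ, not the official evaluation script (segment matching at IoU > 0.5 is assumed to have been done already):

```python
def panoptic_quality(match_ious, n_fp, n_fn):
    """Per-class PQ, SQ, RQ. match_ious: IoU values of true-positive
    segment pairs (IoU > 0.5); n_fp / n_fn: unmatched predictions / GTs."""
    tp = len(match_ious)
    sq = sum(match_ious) / tp if tp else 0.0            # mean IoU of matches
    denom = tp + 0.5 * n_fp + 0.5 * n_fn
    rq = tp / denom if denom else 0.0                   # F1-style recognition
    return sq * rq, sq, rq                              # PQ = SQ * RQ
```

PQ^{†} then simply replaces the PQ of each stuff class with its IoU before averaging, as described above.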
In this experiment, we use the proposed cylindrical partition as the partition method and asymmetrical 3D convolution networks as the backbone. Moreover, a semantic branch is used to output the semantic labels for the stuff categories, and an instance branch is introduced to generate the instance-level features and further extract their instance IDs for the things categories through a heuristic clustering algorithm (we use mean-shift in the implementation; the bandwidth of the mean-shift used in our backbone method is set to while the minimum number of points in a valid instance is set to 50 for SemanticKITTI).
We report the results in Table VII. It can be found that our method achieves much better performance than existing methods [32, 22]. In terms of PQ, we obtain about a 4.7-point improvement, and particularly for the things categories, our method significantly outperforms the state of the art in terms of PQ^{Th} and RQ^{Th} by a large margin of 10 percentage points. This indicates that our cylindrical partition and asymmetrical 3D convolution networks significantly benefit the recognition of the things classes. It is worth noting that PointGroup and LPASD perform poorly on the outdoor LiDAR segmentation task, which indicates that these indoor methods are not suitable for the challenging outdoor point clouds due to the different scenarios and inherent properties. Experimental results demonstrate the effectiveness of the proposed method and its good generalization ability. We show several samples of panoptic segmentation results in Fig. 11, where different colors represent different vehicles.
Method  PQ  PQ^{†}  RQ  SQ  PQ^{Th}  RQ^{Th}  SQ^{Th}  PQ^{St}  RQ^{St}  SQ^{St}  mIoU 
KPConv [44] + PVRCNN [40]  51.7  57.4  63.1  78.9  46.8  56.8  81.5  55.2  67.8  77.1  63.1 
PointGroup [22]  46.1  54.0  56.6  74.6  47.7  55.9  73.8  45.0  57.1  75.1  55.7 
LPASD [32]  36.5  46.1  –  –  28.2  –  –  –  –  –  50.7
Ours  56.4  62.0  67.1  76.5  58.8  66.8  75.8  54.8  67.4  77.1  63.5 
Methods  mAP  NDS 

PointPillar [26]  30.5  45.3 
PP + Reconfig [47]  32.5  50.6 
SECOND [58]  31.6  46.8 
SECOND [58] + Cylinder  34.3  49.6 
SECOND [58] + AsymCNN  33.0  48.3 
SECOND [58] + CyAs  36.4  51.7 
SSN [66]  46.3  56.9 
SSN [66] + CyAs  47.7  58.2 
SSNv2 [66]  50.6  61.6 
SSNv2 [66] + CyAs  52.8  64.0 
4.4 LiDARbased 3D Detection
LiDAR 3D detection aims to localize and classify multi-class objects in the point cloud. SECOND [58] first utilizes 3D voxelization and 3D convolution networks to perform single-stage 3D detection. In this experiment, we follow the SECOND method and replace the regular voxelization and 3D convolution with the proposed cylindrical partition and asymmetrical 3D convolution networks, respectively. Similarly, to verify its scalability, we also extend the proposed modules to SSN [66]. Furthermore, another strong baseline, SSNv2 [66], is also adapted to verify the effectiveness of our method when the baseline is very competitive. The experiments are conducted on the nuScenes dataset, and the cylindrical partition also generates the 3D cylindrical representation. For the evaluation metrics, we follow the official metrics defined in nuScenes, i.e., mean average precision (mAP) and nuScenes detection score (NDS). For other experimental settings, including the optimization method, target assignment, anchor size and network architecture of the multiple heads, we follow the settings in SSN [66].
The results are shown in Table VIII. PP + Reconfig [47] is a partition enhancement approach based on PointPillar [26], while our SECOND + CyAs performs better with a similar backbone, which indicates the superiority of the cylindrical partition. To verify the effect of the different components (i.e., cylindrical partition and asymmetrical 3D convolution networks) of our method on LiDAR 3D detection, we design two variants, i.e., SECOND [58] + Cylinder and SECOND [58] + AsymCNN. The results in Table VIII show that these two components consistently improve the baseline method by 2.8 and 1.5 percentage points in terms of NDS, respectively. We then extend the proposed method (i.e., CyAs) to two baseline methods, termed SECOND + CyAs and SSN + CyAs, respectively. By comparing these two models with their extensions, it can be observed that the proposed cylindrical partition and asymmetrical 3D convolution networks boost the performance consistently, even for the strong baseline, i.e., SSNv2, which demonstrates the effectiveness and scalability of our model. For different backbones, like SECOND and SSN, our method consistently benefits them, showing its good generalization ability. Several qualitative results on the nuScenes dataset are shown in Fig. 12.
4.5 Ablation Studies
In this section, we perform thorough ablation experiments on the LiDAR-based semantic segmentation task to investigate the effect of the different components in our method. We also design several variants of the asymmetrical residual block to verify our claim that strengthening the horizontal and vertical kernels enhances the representational ability for driving-scene point clouds. For the 3D representation after the cylindrical partition, we also try several other hyper-parameter settings to cross-validate the chosen values.
| Baseline | Cylinder | AsymCNN | DDCM | PR | mIoU |
|----------|----------|---------|------|----|------|
| ✓        |          |         |      |    | 58.1 |
| ✓        | ✓        |         |      |    | 61.0 |
| ✓        | ✓        | ✓       |      |    | 63.8 |
| ✓        | ✓        | ✓       | ✓    |    | 65.2 |
| ✓        | ✓        | ✓       | ✓    | ✓  | 65.9 |
Effects of Network Components. In this part, we build several variants of our model to validate the contributions of the different components. The results on the SemanticKITTI validation set are reported in Table IX. The baseline denotes the framework using 3D voxel partition (with cubic cells) and 3D convolution networks. It can be observed that the cylindrical partition performs much better than the cubic partition, with about a 3% mIoU gain, and the asymmetrical 3D convolution networks also significantly boost the performance, by about another 3%, which demonstrates that both the cylindrical partition and the asymmetrical 3D convolution networks are crucial in the proposed method. Furthermore, the dimension-decomposition based context modeling (DDCM) delivers effective global context features, which yields an improvement of 1.4%. The point-wise refinement (PR) module further pushes the performance of this already strong model, by about 0.7%. Overall, the proposed cylindrical partition and asymmetrical 3D convolution networks contribute the most to the performance improvement.
Variants of the Asymmetrical Residual Block. To verify the effectiveness of the proposed asymmetrical residual block, we design several variants to investigate the effect of the horizontal and vertical kernel enhancement (as shown in Fig. 13). The first variant is the regular residual block without any asymmetrical structure. The second is the 1D-asymmetrical residual block, which utilizes 1D asymmetrical kernels without the height dimension and thus strengthens the horizontal or vertical kernels in only one dimension. The third is the proposed asymmetrical residual block, which strengthens both the horizontal and vertical kernels. These variants strengthen the skeleton of the convolution kernels step by step (from the regular residual block, to asymmetrical kernels without height, to both horizontal and vertical kernels with height).
We conduct the ablation studies on the SemanticKITTI validation set. Note that we use the cylindrical partition as the partition method and stack the proposed variants to build the 3D convolution networks for this ablation experiment. We report the results in Table X. Although the 1D-asymmetrical residual block only strengthens the horizontal and vertical kernels in one dimension, it still achieves a 1.3% gain in terms of mIoU and obtains more than a 5% gain for motorcycle, other-vehicle and bicyclist, which demonstrates the effectiveness of strengthening the skeleton of the convolution kernel, even without the height dimension. After taking the height into consideration, the proposed asymmetrical residual block further matches the object distribution in the driving scene and strengthens the kernel skeleton, which enhances the robustness to sparsity. As shown in Table X, the proposed asymmetrical residual block significantly boosts the performance, by about 3%, with large improvements on several instance categories (about a 10% gain), including bicycle, person, other-vehicle and motorcycle, because it matches the point distribution of these objects and enhances the representational power.
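To make the proposed variant concrete, the following is a minimal PyTorch-style sketch of an asymmetrical residual block. The kernel shapes (1×3×3 and 3×1×3) and the two-branch layout are our illustrative reading of Fig. 13, and dense `nn.Conv3d` stands in for the sparse convolutions used in the actual networks:

```python
import torch
import torch.nn as nn


class AsymResidualBlock(nn.Module):
    """Illustrative asymmetrical residual block: two branches of stacked
    asymmetrical 3D convolutions whose outputs are summed. Dense Conv3d is
    a stand-in for the sparse convolutions used in the real network."""

    def __init__(self, channels: int):
        super().__init__()

        def conv_bn_act(kernel, pad):
            return nn.Sequential(
                nn.Conv3d(channels, channels, kernel_size=kernel,
                          padding=pad, bias=False),
                nn.BatchNorm3d(channels),
                nn.LeakyReLU(inplace=True),
            )

        # branch A: horizontal-first (1x3x3 then 3x1x3)
        self.branch_a = nn.Sequential(
            conv_bn_act((1, 3, 3), (0, 1, 1)),
            conv_bn_act((3, 1, 3), (1, 0, 1)),
        )
        # branch B: vertical-first (3x1x3 then 1x3x3)
        self.branch_b = nn.Sequential(
            conv_bn_act((3, 1, 3), (1, 0, 1)),
            conv_bn_act((1, 3, 3), (0, 1, 1)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # summing the two asymmetrical branches strengthens the kernel skeleton
        return self.branch_a(x) + self.branch_b(x)
```

Under this reading, the regular residual block would use plain 3×3×3 convolutions in both positions, and the 1D-asymmetrical variant would drop the height dimension from the asymmetrical kernels.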
Size of the 3D Representation. As mentioned in the implementation details, we fix the size of the 3D representation. In this experiment, we cross-validate this choice against other hyper-parameter settings that cover both denser and sparser representations. Furthermore, we also introduce a cubic partition as the counterpart, to investigate the effectiveness and compactness of the cylindrical partition.
We conduct the experiments on the SemanticKITTI validation set, and all experiments use the same settings except for the representation size. The results are shown in Table XI. The chosen 3D representation performs better than the other two cylindrical representations, by about 2 points. The sparser setting delivers a more compact representation with larger cylindrical cells, but it is more likely to mis-split points of different categories into the same cell, which inevitably increases the information loss; the denser setting contains finer-grained cylindrical cells but generates a larger representation, which burdens the training of the 3D convolution networks and degrades the performance. Compared to the cubic partition, all cylindrical partitions achieve much better performance, and this consistent gain demonstrates the effectiveness of the cylindrical partition. From this experiment, we cross-validate the chosen representation and investigate the effect of different sizes of the 3D representation.
| Size of Representation | mIoU |
|------------------------|------|
| cubic partition        | 58.1 |
| cylindrical            | 63.4 |
| cylindrical            | 64.1 |
| cylindrical (ours)     | 65.9 |
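The cylindrical partition itself reduces to a coordinate transform followed by uniform binning in (ρ, θ, z). A minimal NumPy sketch is given below; the grid resolution and range bounds are illustrative assumptions, not the exact values used in our experiments:

```python
import numpy as np


def cylindrical_voxelize(points, grid=(480, 360, 32),
                         rho_max=50.0, z_min=-4.0, z_max=2.0):
    """Assign each point (x, y, z) to a cylindrical cell index
    (rho_idx, theta_idx, z_idx). All bounds are illustrative."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)        # radial distance to the sensor origin
    theta = np.arctan2(y, x)              # azimuth angle in [-pi, pi)

    # uniform binning in cylinder coordinates
    rho_idx = np.clip(rho / rho_max * grid[0], 0, grid[0] - 1).astype(int)
    theta_idx = np.clip((theta + np.pi) / (2 * np.pi) * grid[1],
                        0, grid[1] - 1).astype(int)
    z_idx = np.clip((z - z_min) / (z_max - z_min) * grid[2],
                    0, grid[2] - 1).astype(int)
    return np.stack([rho_idx, theta_idx, z_idx], axis=1)
```

Because the angular axis is divided uniformly, a cell at radius 2ρ spans twice the arc length of a cell at radius ρ, so distant sparse regions are covered by larger cells, which balances the per-cell point counts.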
5 Discussion
5.1 Comparison of Inference Time
| Methods            | Latency (ms) | mIoU (%) |
|--------------------|--------------|----------|
| TangentConv [42]   | 3000         | 40.9     |
| RandLA [21]        | 800          | 53.9     |
| KPConv [44]        | 263          | 58.8     |
| MinkowskiNet [10]  | 294          | 63.1     |
| SPVNAS-lite [41]   | 110          | 63.7     |
| SPVNAS [41]        | 259          | 66.4     |
| Ours               | 170          | 67.8     |
To investigate the efficiency of the proposed method, we further measure the inference time and compare it with existing methods. In this experiment, we keep the settings unchanged and set the model to evaluation mode when measuring the inference time. The inference times of existing methods are taken directly from [2].
The results are shown in Table XII. Compared to RandLA [21], whose inference time consists of both computation time and post-processing time, our method achieves about a 5.0× speedup with a 14% performance improvement, since it requires no post-processing. Moreover, compared to other 3D-based methods, including MinkowskiNet [10] and SPVNAS [41], we also achieve better performance with less inference time. The main reasons lie in two aspects: 1) the proposed cylindrical partition generates a more compact representation than the regular cubic partition; at a typical cell size, the cubic partition generates a 3D representation more than 4 times larger than the cylindrical one. 2) the asymmetrical 3D convolution networks incur less computational overhead and fewer parameters than regular 3D convolution networks. Specifically, using a 1×3×3 convolution followed by a 3×1×3 convolution is equivalent to sliding a two-layer network with the same receptive field as a 3×3×3 convolution, but it has a 33% cheaper computational cost for the same number of output filters. The cooperation of these two parts leads to an effective and efficient approach.
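The cost claim above can be checked with a quick count of the multiply-accumulates per output position, assuming the asymmetrical pair is a 1×3×3 convolution followed by a 3×1×3 convolution, with C channels throughout (the channel width below is illustrative):

```python
# Multiply-accumulates per output position, with C input and C output
# channels at every layer; bias terms are ignored.
C = 64                                        # illustrative channel width
full_3d = 3 * 3 * 3 * C * C                   # single 3x3x3 convolution
asym_pair = (1 * 3 * 3 + 3 * 1 * 3) * C * C   # 1x3x3 conv followed by 3x1x3 conv

# 18/27 = 2/3 of the cost, i.e. the asymmetrical pair is 33% cheaper
assert 3 * asym_pair == 2 * full_3d
print(f"3x3x3: {full_3d} MACs, asymmetrical pair: {asym_pair} MACs")
```

The ratio is independent of the channel width: the asymmetrical pair uses 9 + 9 = 18 kernel weights per channel pair versus 27 for the full 3×3×3 kernel.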
5.2 Comparison with Other Methods Dealing with the Sparsity Issue
Our proposed method utilizes the cylindrical partition and asymmetrical 3D convolution networks to handle the inherent difficulties of outdoor point clouds, i.e., sparsity and varying density. Hence, we further compare the proposed method with other methods tackling the sparsity issue to verify its effectiveness. Specifically, SPVNAS [41] proposes a sparse point-voxel convolution to preserve fine details and deal with sparsity, while MinkowskiNet [10] adopts sparse tensors and proposes a generalized sparse convolution. We take them as the counterparts dealing with the sparsity issue and compare against them. Note that in our implementation, we also use sparse convolution [58] to build the asymmetrical 3D convolution networks.
| Methods            | SparseConv | Latency (ms) | mIoU (%) |
|--------------------|------------|--------------|----------|
| MinkowskiNet [10]  | ✓          | 294          | 63.1     |
| SPVNAS [41]        | ✓          | 259          | 66.4     |
| Ours               | ✓          | 170          | 67.8     |
The results are shown in Table XIII. Compared to other methods handling the sparsity issue, our method achieves both better performance and higher efficiency, which further demonstrates its superiority.
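Sparse convolution pays off here because only a tiny fraction of cells in the 3D grid are occupied by LiDAR returns. The toy estimate below makes the point; the scan size, grid resolution and range distribution are all illustrative assumptions, not measurements from a real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, grid = 120_000, (480, 360, 32)  # illustrative scan size and resolution

# synthetic scan: points concentrated near the sensor, as in real LiDAR sweeps
rho = rng.exponential(scale=10.0, size=n_points)
theta = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
z = rng.normal(0.0, 1.0, size=n_points)

# bin the synthetic points into cylindrical cells
rho_idx = np.clip(rho / 50.0 * grid[0], 0, grid[0] - 1).astype(int)
theta_idx = (theta / (2.0 * np.pi) * grid[1]).astype(int) % grid[1]
z_idx = np.clip((z + 4.0) / 6.0 * grid[2], 0, grid[2] - 1).astype(int)

occupied = len(set(zip(rho_idx, theta_idx, z_idx)))
total = int(np.prod(grid))
print(f"occupied cells: {occupied} / {total} ({occupied / total:.1%})")
```

Since a dense 3D convolution would process every cell, restricting computation to the occupied ones, as sparse convolution does, removes the vast majority of the work.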
6 Conclusion
In this paper, we have proposed cylindrical and asymmetrical 3D convolution networks for LiDAR segmentation, which maintain the 3D geometric relations of the point cloud. Specifically, two key components, the cylindrical partition and the asymmetrical 3D convolution networks, are designed to handle the inherent difficulties of outdoor LiDAR point clouds, namely sparsity and varying density, effectively and robustly. We conduct extensive experiments and ablation studies, in which the model achieves the state of the art on SemanticKITTI, A2D2 and nuScenes, and generalizes well to other LiDAR-based tasks, including LiDAR panoptic segmentation and LiDAR 3D detection.
References
[1] (2020) 3D-MiniNet: learning a 2D representation from point clouds for fast and efficient 3D LiDAR semantic segmentation. arXiv preprint arXiv:2002.10893.
[2] (2019) SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. In ICCV.
[3] (2019) SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. In ICCV, pp. 9297–9307.
[4] (2020) A benchmark for LiDAR-based panoptic segmentation based on KITTI. arXiv preprint arXiv:2003.02371.
[5] (2018) The Lovász-Softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In CVPR, pp. 4413–4421.
[6] (2019) nuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027.
[7] (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
[8] (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pp. 801–818.
[9] (2020) Tensor low-rank reconstruction for semantic segmentation. arXiv preprint arXiv:2008.00490.
[10] (2019) 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In CVPR, pp. 3075–3084.
[11] (2016) 3D U-Net: learning dense volumetric segmentation from sparse annotation. In MICCAI, pp. 424–432.
[12] (2021) Input-output balanced framework for long-tailed LiDAR semantic segmentation. In ICME, pp. 1–6.
[13] (2020) SalsaNext: fast, uncertainty-aware semantic segmentation of LiDAR point clouds for autonomous driving. arXiv preprint arXiv:2003.03653.
[14] (2019) ACNet: strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In ICCV, pp. 1911–1920.
[15] (2020) 3D-MPA: multi-proposal aggregation for 3D semantic instance segmentation. In CVPR, pp. 9031–9040.

[16] (2020) TORNADO-Net: multi-view total variation semantic segmentation with diamond inception module. arXiv preprint arXiv:2008.10544.
[17] (2019) A2D2: AEV autonomous driving dataset. http://www.a2d2.audi.
[18] (2018) 3D semantic segmentation with submanifold sparse convolutional networks. In CVPR, pp. 9224–9232.
[19] (2020) OccuSeg: occupancy-aware 3D instance segmentation. In CVPR, pp. 2940–2949.
[20] (2020) LiDAR-based panoptic segmentation via dynamic shifting network. arXiv preprint arXiv:2011.11964.
[21] (2020) RandLA-Net: efficient semantic segmentation of large-scale point clouds. In CVPR, pp. 11108–11117.
[22] (2020) PointGroup: dual-set point grouping for 3D instance segmentation. In CVPR, pp. 4867–4876.
[23] (2019) Panoptic segmentation. In CVPR, pp. 9404–9413.
[24] (2020) KPRNet: improving projection-based LiDAR semantic segmentation. arXiv preprint arXiv:2007.12668.
[25] (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In CVPR, pp. 4558–4567.
[26] (2019) PointPillars: fast encoders for object detection from point clouds. In CVPR, pp. 12697–12705.
[27] (2019) Auto-DeepLab: hierarchical neural architecture search for semantic image segmentation. In CVPR, pp. 82–92.
[28] (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440.
[29] (2020) Learning to segment 3D point clouds in 2D image space. In CVPR, pp. 12255–12264.

[30] (2019) TrafficPredict: trajectory prediction for heterogeneous traffic-agents. In AAAI, Vol. 33, pp. 6120–6127.
[31] (2019) VV-Net: voxel VAE net with group convolutions for point cloud segmentation. In ICCV, pp. 8500–8508.
[32] (2020) LiDAR panoptic segmentation for autonomous driving. In IROS.
[33] (2019) RangeNet++: fast and accurate LiDAR semantic segmentation. In IROS, pp. 4213–4220.
[34] (2021) SIDE: center-based stereo 3D detector with structure-aware instance depth estimation. arXiv preprint arXiv:2108.09663.
[35] (2019) JSIS3D: joint semantic-instance segmentation of 3D point clouds with multi-task pointwise networks and multi-value conditional random fields. In CVPR, pp. 8827–8836.
[36] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In CVPR, pp. 652–660.
[37] (2017) 3D graph neural networks for RGBD semantic segmentation. In ICCV, pp. 5199–5208.
[38] (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241.
[39] (2020) SpSequenceNet: semantic segmentation network on 4D point clouds. In CVPR, pp. 4574–4583.
[40] (2020) PV-RCNN: point-voxel feature set abstraction for 3D object detection. In CVPR, pp. 10529–10538.
[41] (2020) Searching efficient 3D architectures with sparse point-voxel convolution. arXiv preprint arXiv:2007.16100.
[42] (2018) Tangent convolutions for dense prediction in 3D. In CVPR, pp. 3887–3896.
[43] (2017) SEGCloud: semantic segmentation of 3D point clouds. In 3DV, pp. 537–547.
[44] (2019) KPConv: flexible and deformable convolution for point clouds. In ICCV, pp. 6411–6420.
[45] (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.
[46] (2019) Graph attention convolution for point cloud semantic segmentation. In CVPR, pp. 10296–10305.
[47] (2020) Reconfigurable voxels: a new representation for LiDAR-based point clouds. In CoRL.
[48] (2021) FCOS3D: fully convolutional one-stage monocular 3D object detection. arXiv preprint arXiv:2104.10956.
[49] (2021) Probabilistic and geometric depth: detecting objects in perspective. arXiv preprint arXiv:2107.14160.
[50] (2019) Shape robust text detection with progressive scale expansion network. In CVPR, pp. 9336–9345.
[51] (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics 38 (5), pp. 1–12.
[52] (2018) SqueezeSeg: convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D LiDAR point cloud. In ICRA, pp. 1887–1893.
[53] (2019) SqueezeSegV2: improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud. In ICRA, pp. 4376–4382.
[54] (2019) PointConv: deep convolutional networks on 3D point clouds. In CVPR, pp. 9621–9630.
[55] (2020) SqueezeSegV3: spatially-adaptive convolution for efficient point-cloud segmentation. arXiv preprint arXiv:2004.01803.

[56] (2019) Depth completion from sparse LiDAR data with depth-normal constraints. In ICCV, pp. 2811–2820.
[57] (2020) PointASNL: robust point clouds processing using nonlocal neural networks with adaptive sampling. In CVPR, pp. 5589–5598.
[58] (2018) SECOND: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337.
[59] (2020) Deep FusionNet for point cloud semantic segmentation. In ECCV.
[60] (2019) Co-occurrent features in semantic segmentation. In CVPR, pp. 548–557.
[61] (2020) Fusion-aware point convolution for online semantic 3D scene segmentation. In CVPR, pp. 4534–4543.
[62] (2020) PolarNet: an improved grid representation for online LiDAR point clouds semantic segmentation. In CVPR, pp. 9601–9610.
[63] (2017) Pyramid scene parsing network. In CVPR, pp. 2881–2890.
[64] (2021) LIF-Seg: LiDAR and camera image fusion for 3D LiDAR semantic segmentation. arXiv preprint arXiv:2108.07511.
[65] (2020) Cylinder3D: an effective 3D framework for driving-scene LiDAR semantic segmentation. arXiv preprint arXiv:2008.01550.
[66] (2020) SSN: shape signature networks for multi-class object detection from point clouds. In ECCV.
[67] (2021) Cylindrical and asymmetrical 3D convolution networks for LiDAR segmentation. In CVPR, pp. 9939–9948.