
CPSeg: Cluster-free Panoptic Segmentation of 3D LiDAR Point Clouds

A fast and accurate panoptic segmentation system for LiDAR point clouds is crucial for autonomous driving vehicles to understand the surrounding objects and scenes. Existing approaches usually rely on proposals or clustering to segment foreground instances. As a result, they struggle to achieve real-time performance. In this paper, we propose a novel real-time end-to-end panoptic segmentation network for LiDAR point clouds, called CPSeg. In particular, CPSeg comprises a shared encoder, a dual decoder, a task-aware attention module (TAM), and a cluster-free instance segmentation head. TAM is designed to enforce the two decoders to learn rich task-aware features for semantic and instance embedding. Moreover, CPSeg incorporates a new cluster-free instance segmentation head that dynamically pillarizes foreground points according to the learned embedding, and then acquires instance labels by finding connected pillars through a pairwise embedding comparison. Thus, conventional proposal-based or clustering-based instance segmentation is transformed into a binary segmentation problem on the pairwise embedding comparison matrix. To help the network regress instance embedding, a fast and deterministic depth completion algorithm is proposed to calculate the surface normal of each point in real time. The proposed method is benchmarked on two large-scale autonomous driving datasets, namely SemanticKITTI and nuScenes. Notably, extensive experimental results show that CPSeg achieves state-of-the-art results among real-time approaches on both datasets.





1 Introduction

A panoptic segmentation system that is able to predict both semantic tags and instance-level segmentation has begun to draw the attention of the autonomous driving community, because both foreground dynamic objects (i.e., the things) and background static scenes (i.e., the stuff) can be perceived and output simultaneously. Compared with image-based approaches, panoptic segmentation using LiDAR point clouds remains under-explored, despite the fact that LiDAR is widely regarded as a primary perception sensor for autonomous driving due to its active sensing nature and high-resolution readings. The recently published GP-S3Net [razani2021gps3net] is the current state-of-the-art in the LiDAR panoptic segmentation task. The authors proposed using a 3D sparse convolution-based U-Net as a semantic backbone and a combination of HDBSCAN and GCNN to segment instances. However, GP-S3Net is computationally intensive and is thus not a real-time method. Among all published works on LiDAR panoptic segmentation, only a few [9340837, Zhou2021PanopticPolarNet, li2021smacseg] are capable of operating in real-time (see Figure 1). There still exists a performance gap between the real-time methods and the current state-of-the-art. A natural question to ask is: is it possible to fill this gap and build a panoptic segmentation method with a competitive PQ yet still a fast running time?

Given the runtime constraint of real-time systems, proposal-free methods are favorable as they are more computationally efficient. Current proposal-free approaches in the literature usually rely on clustering or graphs to segment foreground objects. These methods are mainly adopted from 2D image processing tasks. Designing an effective 3D panoptic segmentation system based on the unique characteristics of LiDAR point clouds is still an open research problem. This motivates us to find a more suitable and unique design targeting the panoptic segmentation task in the LiDAR domain. In this work, we take advantage of the strong geometric patterns in LiDAR point clouds and present a new proposal-free and cluster-free approach to segment foreground objects that is capable of running in real-time. In particular, we propose a network that predicts the object centroid as the embedding of each point and dynamically groups points with similar embeddings into pillars in the sparse 2D space. Objects are then formed by building connections between pillars.

Our main contributions can be summarized as,

  1. A real-time panoptic segmentation network that is end-to-end (i.e., not relying on clustering or proposals to segment instances) and achieves state-of-the-art results without extensive post-processing

  2. A novel Task-aware Attention Module (TAM) that enforces the dual decoder network to learn task-specific features

  3. A fast surface normal calculation module to aid the process of regressing foreground instance embedding with a novel deterministic depth completion algorithm

  4. A comprehensive qualitative and quantitative comparison demonstrating the proposed method as opposed to existing methods on both large-scale datasets of SemanticKITTI and nuScenes

  5. A thorough ablation analysis of how each proposed component contributes to the overall performance

2 Related Work

The panoptic segmentation task jointly optimizes semantic and instance segmentation. LiDAR-based semantic segmentation can be categorized into projection-based, voxel-based, or point-based methods depending on the format of the data being processed. Projection-based methods project a 3D point cloud onto a 2D image plane in spherical Range-View (RV) [milioto2019rangenet++, razani2021litehdseg], top-down Bird's-Eye-View (BEV) [8403277, 10.1007/978-3-030-11009-3_11], or multi-view format [gerdzhev2021tornadonet]. Voxel-based methods transform a point cloud into 3D volumetric grids to be processed using 3D convolutions. Processing these 3D grids with 3D convolutions is computationally expensive; therefore, some methods leverage sparse convolutions to alleviate this limitation and to fully exploit the sparsity of point clouds [cheng2021s3net, zhou2020cylinder3d, cheng2021af2s3net]. Point-based methods [qi2017pointnet, thomas2019kpconv], in contrast, process the unordered point cloud directly. Despite their high accuracy, these approaches are inefficient and consume a large amount of memory. Similar to instance segmentation, panoptic segmentation can be divided into top-down (proposal-based) and bottom-up (proposal-free) methods, as elaborated below.

2.1 Proposal-based panoptic segmentation

Top-down panoptic segmentation is a two-stage approach: first, foreground object proposals are generated; subsequently, they are further processed to extract instance information that is fused with background semantic information. Mask R-CNN [He_2017_ICCV] is commonly used for instance segmentation, together with a lightweight segmentation branch for stuff classes. To resolve the overlapping instance predictions produced by Mask R-CNN and the conflicts between instance and semantic predictions, several methods have been introduced. UPSNet [xiong2019upsnet] presents a panoptic head with the addition of an unknown class label. EfficientPS [sirohi2021efficientlps] proposes to fuse the instance and semantic heads according to their confidence. Inspired by image-based methods, MOPT [hurtado2020mopt] and EfficientLPS [sirohi2021efficientlps] attach a semantic head to Mask R-CNN to generate panoptic segmentation on the range image. However, these top-down methods involve multiple sequential processes in the pipeline and are usually slow.

2.2 Proposal-free panoptic segmentation

In contrast to proposal-based methods, bottom-up panoptic segmentation predicts semantic segmentation and groups the “thing" points into clusters to achieve instance segmentation. DeeperLab [yang2019deeperlab] performs instance segmentation by regressing the bounding box corners and object center. Later, Panoptic-DeepLab [cheng2020panoptic] proposes to predict the instance center locations and group pixels to their closest predicted centers. The pioneering panoptic method in the LiDAR domain, LPSAD [9340837], presents a shared encoder with a dual decoder, followed by a clustering algorithm that segments instances based on the predicted semantic embedding and the predicted object centroids in 3D space. Panoster [gasperini2021panoster] proposes a learnable clustering module to assign instance class labels to every point; it requires extensive post-processing steps (e.g., using DBSCAN to merge nearby object predictions) to refine the predictions in order to achieve results comparable with the other state-of-the-art methods. DS-Net [hong2020lidar], in turn, offers a learnable dynamic shifting module to shift points in 3D space towards the object centroids. Recently, GP-S3Net [razani2021gps3net] proposes a graph-based instance segmentation network for LiDAR-based panoptic segmentation, which achieves state-of-the-art performance on the nuScenes and SemanticKITTI panoptic segmentation benchmarks. Moreover, SMAC-Seg [li2021smacseg] introduces a Sparse Multi-directional Attention Clustering module with a novel repel loss to better supervise the network in separating instances, reaching the state-of-the-art among existing real-time LiDAR-based methods.

2.3 Normal surface calculation

Numerous point cloud processing algorithms, such as surface reconstruction, segmentation, shape modeling, and feature extraction, benefit from accurate normal vectors associated with each point. Existing methods for computing normal vectors for LiDAR point clouds can be divided into learning-based and traditional deterministic approaches. For instance, several works [ZHOU2020102916] use a neural network to estimate the normals of a point cloud directly. Learning-based approaches require extensive training data in order to achieve good performance and might need additional fine-tuning when applied to different types of LiDAR setups. On the other hand, deterministic geometry-based approaches like [zhou2018open3d] usually require a nearest-neighbour search to obtain the neighbourhood of each point and then estimate the surface by fitting a plane to the neighbourhood point cloud using the least-squares method. These approaches usually struggle to achieve real-time performance, as the neighbourhood search in 3D space is computationally heavy. [5980275] uses a range-based deterministic approach to compute surface normals, in which the 3D normal is transformed from derivatives of the surface in a sparse spherical depth map. This approach operates in real-time; however, the surface normal is incorrectly affected by the empty entries in the projected depth map, resulting in false sharp gradients. To overcome this limitation, we propose a novel depth completion algorithm that adaptively completes the neighbourhood around the valid points in the sparse depth map and then computes the normal from the 2D gradient.

3 Proposed Method

3.1 Problem Formulation

Let P = {p_1, …, p_N} be N unordered points of a point cloud, where x_i is the input feature for point p_i and the tuple (l_i, z_i) is the semantic class label and instance ID label for point p_i. L is a set of semantic class labels and Z is a set of instance IDs. L can be further divided into L_th and L_st, representing a set of countable foreground thing classes and a set of background stuff classes, respectively. Note that the instance label z_i is only valid if l_i ∈ L_th. The goal is to learn a function f_θ, parameterized by θ, that takes the input feature x_i and assigns a semantic label to each point, as well as an instance label if the point is part of the foreground.

To solve this problem, we propose CPSeg, an end-to-end network to generate predictions for panoptic segmentation without proposals or clustering algorithms.

3.2 Network Architecture

Figure 2: Overview of our panoptic segmentation system.
Figure 3: Illustration of CPSeg. The network consists of a dual decoder U-net which processes the input point cloud in RV to obtain semantic segmentation and instance embedding. Then, the Cluster-free Instance Segmentation Module separates the foreground into pillars and builds connections which leads to instances. Both semantic and instance predictions are then gathered and sent back to 3D view for post-processing.

The overview of our panoptic segmentation framework is depicted in Figure 2. We first transform the LiDAR point cloud, with Cartesian coordinates, remission, and depth as input features, into a 2D range image with spatial dimension H × W using a spherical projection similar to [9340837]. At the same time, we build a dense depth map, which will be utilized as guidance for the depth completion algorithm to extract surface normal features in the following stage. Then, CPSeg takes both inputs and predicts semantic and instance segmentation results in the range view (RV). When re-projecting the results to the 3D point cloud, KNN-based post-processing is utilized to refine the output, as introduced in [milioto2019rangenet++]. Lastly, we fuse the results to obtain panoptic labels and use majority voting to refine the semantic segmentation results where different semantics are predicted in the same instance.
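As a rough illustration, the spherical projection that produces the range image can be sketched as below. The vertical field-of-view bounds and image size here are illustrative defaults for a 64-beam sensor, not values taken from this paper:

```python
import numpy as np

def spherical_project(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) LiDAR point cloud to (H, W) range-image indices.

    Sketch of a standard range-view projection; fov_up/fov_down (degrees)
    are assumed sensor parameters, not the paper's settings.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                       # azimuth in [-pi, pi]
    pitch = np.arcsin(z / depth)                 # elevation angle
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    u = 0.5 * (1.0 - yaw / np.pi) * W            # column index
    v = (1.0 - (pitch - fov_down) / (fov_up - fov_down)) * H  # row index
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)
    return v, u, depth
```

Each point's features (x, y, z, remission, depth) are then scattered into the image at (v, u), keeping the closest point when several fall into the same pixel.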

Our proposed model is summarized in Figure 3. It consists of three main components: (A) a dual-decoder U-Net with Task-aware Attention Module (TAM), (B) a surface normal calculation module, which takes the depth maps and computes normal vectors to benefit instance embedding regression, (C) a cluster-free instance segmentation head, which segments the foreground instance embedding into objects.

A 2D range view representation of a LiDAR point cloud is fed into CPSeg. The outputs S and E of the dual-decoder U-Net are the semantic prediction and instance embedding of the projected point cloud, respectively. In particular, the semantic decoder generates S ∈ R^{K×H×W}, where K is the number of semantic class labels. With the Cartesian xy coordinates added as a prior, the instance decoder outputs E = P_xy + D, where P_xy is the xy coordinates of the point cloud in RV and D is the output from the last block of the instance decoder. Essentially, E is the instance embedding in 2D space, which can also be interpreted as the predicted 2D location of the object centroids. Then, a binary mask M is used to filter foreground points and can be expressed as,

M = 1[l ∈ L_th],

where 1[·] is the binary conditional function and l is the ground-truth semantic label during training, which is replaced by the predicted label during the test stage. We use M to obtain the corresponding embedding of the foreground thing points from the instance decoder, denoted as E_f ∈ R^{N_f×2}, where N_f is the number of foreground points and 2 refers to the embedding in 2D space. The foreground embedding is then used by the cluster-free instance segmentation head to segment objects.
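The masking and gathering step can be illustrated with a toy example (the class IDs and embedding values here are made up for illustration, not taken from the paper):

```python
import numpy as np

# Hypothetical scene: 6 points, where semantic classes {1, 2} are "thing".
sem = np.array([0, 1, 1, 2, 0, 2])                 # per-point semantic labels
emb = np.array([[0.0, 0.0], [1.1, 2.0], [1.0, 2.1],
                [5.0, 5.0], [9.9, 9.9], [5.1, 4.9]])  # 2D instance embedding

mask = np.isin(sem, [1, 2])   # binary foreground mask M
fg_emb = emb[mask]            # E_f: (N_f, 2) foreground embedding
```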

Basic architecture We adopt the CLSA module from [li2021smacseg] to extract contextual features. The CLSA block learns to recover the local geometry of the neighbourhood, which is beneficial for RV-based methods to learn contextual information. The output of the CLSA module is then fed to a shared encoder with five residual blocks, similar to [cortinhal2020salsanext], where we obtain multi-scale feature maps F_s (the subscript s indicates the stride with respect to the full resolution, downsampled by an AvgPool layer at the end of each encoder block). We provide the detailed architecture, such as the number of in/out channels, number of layers, and dimension of feature maps at each stage of the encoder and decoder blocks, in the Supplementary Material.

Figure 4: TAM: Task-aware Attention Module.

Task-aware Attention Module (TAM) Given feature maps in multiple resolutions, we upsample them to the full resolution in a hierarchical manner, as shown in Figure 4. Then a convolution layer with a residual skip connection is used to refine the boundary for each feature map. Next, we obtain channel-wise attention weights A_sem and A_ins with two MLPs targeting the semantic and instance segmentation tasks, respectively, using the following equation:

A_t = σ(MLP_t(⊕_s F̃_s)), t ∈ {sem, ins},

where F̃_s is the feature map upsampled and refined from stride s to the full resolution, ⊕ is the concatenation operation, and σ denotes the sigmoid operation. Lastly, each feature channel of the refined feature maps is multiplied with A_sem and A_ins to get F_sem and F_ins, which are then sent to the two decoders to be fused with a residual block and obtain the semantic segmentation and instance embedding.
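A minimal sketch of channel-wise task attention is given below. The global-average pooling step, MLP shapes, and random weights are assumptions for illustration only; the actual module learns one MLP per task on the concatenated multi-scale feature maps:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def task_attention(feats, w1, w2):
    """Channel-wise attention: pool, 2-layer MLP, sigmoid, rescale.

    feats: (C, H, W) concatenated feature maps; w1, w2: MLP weights.
    Returns feats with each channel scaled by a weight in (0, 1).
    """
    pooled = feats.mean(axis=(1, 2))            # (C,) global channel context
    attn = sigmoid(w2 @ np.tanh(w1 @ pooled))   # (C,) channel weights
    return feats * attn[:, None, None]          # reweighted feature maps

rng = np.random.default_rng(0)
feats = rng.standard_normal((32, 8, 16))
w1, w2 = rng.standard_normal((16, 32)), rng.standard_normal((32, 16))
out = task_attention(feats, w1, w2)
```

In the full module this is applied twice with separate weights, producing the semantic-specific and instance-specific feature maps.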

3.3 Surface Normal

The instance decoder learns to regress the instance embedding, which is the shifted location in 2D space starting from the original xy coordinates towards the object centroids. Surface normal vectors provide additional geometric features to aid the network in this process. Thus, we use a Diamond Inception Module, adopted from [gerdzhev2021tornadonet], to extract geometric features from surface normals and directly fuse them with the features in the instance decoder using a concatenation operation followed by convolutional layers to obtain the instance embedding. We provide the final results with and without this module in the Ablation Studies to further demonstrate its impact on the overall performance.

Figure 5: Proposed Depth Completion Algorithm.

In this section, we describe a deterministic way of calculating surface normal features of the point cloud with a novel depth completion algorithm, as depicted in Figure 5. The inputs to this module are (A) a sparse 2D depth map r at the full scale and (B) a dense 2D depth map at a reduced scale. The depth map is obtained by projecting the LiDAR point cloud onto a 2D map with a specified size using discretized indices from the spherical transformation, as introduced by [wu2018squeezeseg].

First, we obtain r̃_row, a completed depth map with a weighted row fill using row neighbours for every entry, as given by,

r̃_row[u, v] = ( Σ_{k=−K..K} w_k · o[u, v+k] · r[u, v+k] ) / ( Σ_{k=−K..K} w_k · o[u, v+k] ),

where r[u, v] and o[u, v] are the depth value and binary occupancy at row u and column v, respectively. The weights

w_k = exp(−k² / (2σ²))

are sampled from a Gaussian distribution where the center point receives the largest attention and the neighbouring points are weighted less as they deviate away from the center. Here, σ and K are hyperparameters which are set to 1 in our method. Then Equations 3 and 4 are re-used to obtain r̃_col using column neighbours to fill. Next, bilinear upsampling to the full scale is applied to the dense low-resolution input to obtain a coarse but dense depth map, denoted as r_d. From r_d, we calculate ∂r_d/∂θ and ∂r_d/∂φ using a finite-difference approximation along the horizontal and vertical directions, where θ and φ are the azimuth and elevation angles for each entry. Local geometry can be interpreted from these two signals; hence, they serve as the guidance signal to adaptively select a horizontal or vertical fill for each empty entry. For instance, when the magnitude of ∂r_d/∂φ is small (i.e., smaller than that of ∂r_d/∂θ), it indicates the point is on a pole-like or wall-like object. Therefore, a completion using a weighted average of the valid column neighbours is more desirable, since the change in depth in the vertical direction is relatively small. In summary, each entry in the completed depth map can be expressed as,

r_c = r,       if o = 1,
r_c = r̃_col,  if o = 0 and |∂r_d/∂φ| < |∂r_d/∂θ|,
r_c = r̃_row,  otherwise,

where the index [u, v] is omitted for brevity and o denotes occupancy. From the completed depth map r_c, we follow [badino2011normal] to calculate the gradients ∂r_c/∂θ and ∂r_c/∂φ and transform them into the 3D Cartesian frame centered at the LiDAR sensor to obtain the normal components (n_x, n_y, n_z). Note that the purpose of the depth completion algorithm above is to ensure the neighbourhood of the valid pixels is smooth, such that the gradients are not influenced by noise.
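The Gaussian-weighted row fill at the heart of the depth completion step can be sketched as follows. The window size and σ are illustrative; the full algorithm also performs a column fill and chooses between the two per entry using the dense-map gradients:

```python
import numpy as np

def weighted_row_fill(depth, occ, k=1, sigma=1.0):
    """Fill empty range-image entries from valid row neighbours.

    Gaussian-weighted average over a (2k+1) window along each row;
    entries with no valid neighbour stay unchanged. Sketch of the
    row-fill step only.
    """
    H, W = depth.shape
    out = depth.copy()
    w = np.exp(-0.5 * (np.arange(-k, k + 1) / sigma) ** 2)  # Gaussian weights
    for i in range(H):
        for j in range(W):
            if occ[i, j]:                       # valid entry: keep as-is
                continue
            lo, hi = max(0, j - k), min(W, j + k + 1)
            ww = w[lo - j + k:hi - j + k] * occ[i, lo:hi]  # mask invalid
            if ww.sum() > 0:
                out[i, j] = (ww * depth[i, lo:hi]).sum() / ww.sum()
    return out
```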

3.4 Cluster-free Instance Segmentation

Given the 2D embedding of the foreground E_f from the instance decoder, the goal of the cluster-free instance segmentation module is to segment it into instances. First, we dynamically group the foreground points into pillars according to E_f, their location in the 2D embedding space, such that points within grid size η are inside the same pillar (see bottom right of Figure 3). The embedding of each resulting pillar is the average embedding of the points being grouped together. The pillarized foreground embedding is denoted as E_p ∈ R^{M×2}, where M is the number of pillars. Next, we construct a pairwise comparison matrix C to find connected pillars, with each entry given as,

C_ij = f(e_i, e_j),

where e_i and e_j are the embeddings of pillars i and j. Each entry represents the connectivity probability of pillar i and pillar j. A large probability indicates the network is confident that the points in the two pillars belong to the same object. In order for C to provide meaningful connectivity indications, we need the function f to follow several constraints: 1) f(e_i, e_i) = 1, i.e., a pillar must be connected to itself with 100% confidence. 2) f(e_i, e_j) = f(e_j, e_i), i.e., the connectivity is symmetric; in other words, the network should output the same confidence when comparing pillar i to j and pillar j to i. 3) 0 < f(e_i, e_j) ≤ 1, with f(e_i, e_j) → 1 indicating the two pillars belong to the same object.

We define f(e_i, e_j) = exp(−α · ‖e_i − e_j‖), which satisfies all the constraints listed above. Note that α can be either learned from the pillar features or fixed as a hyperparameter constant. We discuss α in detail and compare the results in the Ablation Studies. Then, we use a threshold τ to obtain a binary connectivity matrix B; formally, B_ij = 1[C_ij > τ]. Note that B can be interpreted as an adjacency matrix for a graph structure, where each pillar is a node of the graph and a true entry in the matrix represents that the two nodes are connected. Then a simple algorithm is used to find the connected disjoint sets in B and assign them separate instance IDs. Lastly, we map the pillar instance IDs back to the range view using a point-index matching process. Both semantic and instance segmentation results are now ready to be re-projected back to the point cloud and post-processed.
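The pipeline above, i.e. dynamic pillarization, a pairwise comparison with an exponential kernel, thresholding, and connected-component labeling, can be sketched as follows. The grid size, α, τ, and kernel form are illustrative assumptions, not the paper's tuned settings:

```python
import numpy as np

def cluster_free_instances(fg_emb, grid=0.15, alpha=5.0, tau=0.5):
    """Pillarize foreground embeddings and label connected pillars.

    fg_emb: (N_f, 2) foreground embeddings. Returns per-point instance IDs.
    """
    cell = np.floor(fg_emb / grid).astype(np.int64)       # pillar index per point
    keys, inv = np.unique(cell, axis=0, return_inverse=True)
    inv = inv.reshape(-1)
    M = len(keys)
    pillar_emb = np.zeros((M, 2))
    for m in range(M):                                    # average embedding per pillar
        pillar_emb[m] = fg_emb[inv == m].mean(axis=0)
    d = np.linalg.norm(pillar_emb[:, None] - pillar_emb[None], axis=-1)
    adj = np.exp(-alpha * d) > tau                        # binary connectivity matrix
    labels = -np.ones(M, dtype=np.int64)                  # flood-fill components
    nxt = 0
    for s in range(M):
        if labels[s] >= 0:
            continue
        stack = [s]
        while stack:
            m = stack.pop()
            if labels[m] >= 0:
                continue
            labels[m] = nxt
            stack.extend(np.flatnonzero(adj[m] & (labels < 0)).tolist())
        nxt += 1
    return labels[inv]                                    # per-point instance IDs
```

Because the comparison is done on pillars rather than points, the pairwise matrix stays small even for dense scenes, which is what makes the head cheap at inference time.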

3.5 Loss Functions

Semantic Segmentation Loss We follow [gerdzhev2021tornadonet] to supervise the semantic segmentation output S with a weighted combination of weighted cross entropy (WCE), Lovász softmax, and Total Variation (TV) loss:

L_sem = L_wce(S, Y) + L_ls(m) + L_tv(S, Y),

where Y is the ground truth (GT) semantic label, L_ls is the Lovász extension of IoU introduced in [berman2018lovasz], and m is the absolute error between the predicted probability and the GT.

Instance Embedding Loss We use an L2 regression loss to supervise the learning of the instance embedding by taking the difference between the predicted instance embedding and the GT. Note that the instance embedding here can be interpreted as the mass centroid of an object in 2D BEV:

L_emb = (1 / N_f) Σ_{u,v} M_gt[u, v] · ‖E[u, v] − E_gt[u, v]‖₂²,

where (u, v) denotes the 2D index on the range image, M_gt is the GT foreground binary mask that eliminates the background points when calculating the loss, and E_gt is the GT instance embedding, which is the mass centroid of each instance calculated by taking the average of the xy coordinates of the points in the object.

Instance Segmentation Loss Essentially, the task here is to supervise binary segmentation on the pairwise matrix and optimize the IoUs for positive and negative predictions. Assuming points within the same pillar are from the same object, we construct the GT instance label of each pillar by taking the mode label of the points inside, denoted as z_i ∈ Z, where Z is the set of GT instance labels. The GT binary label for the pairwise comparison matrix, C_gt, is obtained with entries C_gt,ij = 1[z_i = z_j]:

L_ins = L_bce(C, C_gt) + L_ls(m),

where L_bce is the binary cross entropy loss, L_ls is the Lovász extension of IoU, and m is the absolute error between the predicted probability and the GT. The Lovász loss introduced by [berman2018lovasz] has been shown to be effective in optimizing IoU metrics. Further experimental results in the Ablation Studies show that adding this loss achieves better overall accuracy.
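Constructing the pillar GT labels and the GT pairwise matrix can be illustrated with a toy example (the instance IDs below are made up):

```python
import numpy as np

# Hypothetical pillar containing 5 points; their GT instance IDs:
point_ids = np.array([7, 7, 3, 7, 3])
vals, counts = np.unique(point_ids, return_counts=True)
pillar_gt = vals[np.argmax(counts)]          # mode label -> pillar GT instance ID

# GT pairwise matrix over 3 pillars: entry is 1 iff two pillars
# share the same GT instance ID.
pillar_ids = np.array([7, 7, 3])
gt_matrix = (pillar_ids[:, None] == pillar_ids[None]).astype(int)
```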

The total loss used to train the network is a weighted combination of the loss terms described above:

L_total = λ_1 · L_sem + λ_2 · L_emb + λ_3 · L_ins,

where λ_1, λ_2, and λ_3 are the weights for the semantic, instance embedding, and instance segmentation loss terms, respectively.

4 Experiments

In this section, we describe the experimental settings and evaluate the performance of CPSeg on the SemanticKITTI [DBLP:conf/iccv/BehleyGMQBSG19] and nuScenes [caesar2020nuscenes] datasets for panoptic segmentation. We compare our results with state-of-the-art approaches. We refer the readers to the Ablation Studies for the design choices and various components of the network.

Datasets SemanticKITTI [DBLP:conf/iccv/BehleyGMQBSG19] is the first available dataset for LiDAR-based panoptic segmentation of driving scenes. It contains 19,130 training frames, 4,071 validation frames, and 20,351 test frames. We provide ablation analysis and validation results on sequence 08, as well as test results on sequences 11-21. Each point in the dataset is provided with a semantic label from 28 classes, which are mapped to 19 classes for the task of panoptic segmentation. Among these 19 classes, 11 belong to stuff classes and the rest are considered thing classes, for which instance IDs are available.

nuScenes [caesar2020nuscenes] is another popular large-scale driving-scene dataset, with 700 scenes for training, 150 for validation, and 150 for testing. At the time of writing, the authors have not provided point-level panoptic segmentation labels for LiDAR scans. Thus, we generate the labels using the provided 3D bounding box annotations from the detection dataset and the semantic labels from the lidarseg dataset. In particular, we assign the same instance ID to points within a bounding box that share the same semantic label. Out of the 16 labeled classes in the lidarseg dataset, 8 human and vehicle classes are considered things, and the other 8 classes are considered stuff. We follow [Zhou2021PanopticPolarNet] to discard instances with fewer than 20 points during evaluation. We train our model on the 700 training scenes and report the results on the validation set of 150 scenes.

Baselines We use a dual-decoder U-Net based on SalsaNext [cortinhal2020salsanext] as the baseline. In particular, the two decoders generate the semantic segmentation and instance embedding, respectively. Then a clustering algorithm (e.g., BFS, HDBSCAN) is added after the instance decoder to segment the objects based on the predicted embedding. For a fair comparison, we add the CLSA Feature Extractor Module in front of the encoder to match our network design. Moreover, we implement LPSAD based on [9340837] as an additional baseline. Quantitative and qualitative results are compared against the proposed method on the SemanticKITTI and nuScenes validation sets.

Evaluation Metric We follow [panopticMetric] to use the mean Panoptic Quality (PQ) as our main metric to evaluate and compare the results with others. In addition, we also report Recognition Quality (RQ), and Segmentation Quality (SQ). They are calculated separately on stuff and thing classes, providing PQSt, SQSt, RQSt and PQTh, SQTh, RQTh.
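For reference, PQ for a single class can be computed from matched-segment IoUs and false-positive/false-negative counts as below, following the definition in [panopticMetric]; SQ and RQ factor PQ as SQ × RQ:

```python
def panoptic_quality(matched_ious, n_fp, n_fn):
    """PQ for one class: sum of TP IoUs / (|TP| + 0.5|FP| + 0.5|FN|).

    matched_ious: IoUs of matched (IoU > 0.5) prediction/GT segment pairs.
    Returns (PQ, SQ, RQ), with PQ = SQ * RQ.
    """
    tp = len(matched_ious)
    if tp + n_fp + n_fn == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0          # segmentation quality
    rq = tp / (tp + 0.5 * n_fp + 0.5 * n_fn)            # recognition quality
    return sq * rq, sq, rq
```

The reported mean PQ averages this quantity over all classes; PQTh/PQSt restrict the average to thing/stuff classes, respectively.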

4.1 Experimental Setup

For both datasets, we trained CPSeg end-to-end for 150 epochs using the SGD optimizer and an exponential-decay learning rate scheduler, with the initial learning rate starting at 0.01 and a decay rate of 0.99 every epoch. A weight decay of was used. The model was trained on 4 NVIDIA V100 GPUs with a batch size of 4 per GPU. The weights for the losses were set to , , . We used a range image with a resolution of (, ) for SemanticKITTI, and (, ) for nuScenes. Additionally, we provide CPSeg HR (high-resolution) using an input size of (, ) for nuScenes. The pillar grid size η of the final models was set to , with the pillar pairwise matrix threshold τ set to . The mapping parameter α was set to for the final model.

RangeNet++ [milioto2019rangenet++] + PointPillars [Lang_2019_CVPR_pointpillars]
PanopticTrackNet [hurtado2020mopt]
KPConv [thomas2019kpconv] + PointPillars [Lang_2019_CVPR_pointpillars]
Panoster [gasperini2021panoster]
DS-Net [hong2020lidar]
EfficientLPS [sirohi2021efficientlps] 83.0 87.8 60.5 74.6 79.5
GP-S3Net [razani2021gps3net] 60.0 69.0 72.1 65.0 74.5 70.8
LPSAD [9340837] 11.8
Panoptic-PolarNet [Zhou2021PanopticPolarNet] 87.2
SMAC-Seg [li2021smacseg] 58.4 72.3 79.3 63.3
CPSeg [Ours] 57.0 63.5 68.8 82.2 55.1 64.1 58.4 72.3 79.3
Table 1: Comparison of LiDAR panoptic segmentation performance on the SemanticKITTI [DBLP:conf/iccv/BehleyGMQBSG19] test dataset. Metrics are provided in [%] and FPS is in [Hz]. (*: sourced from [li2021smacseg])
DS-Net [hong2020lidar] 84.4
PanopticTrackNet [hurtado2020mopt]
EfficientLPS [sirohi2021efficientlps] 71.5 84.1
GP-S3Net [razani2021gps3net] 75.8
SMAC-Seg HiRes [li2021smacseg] 68.4 73.4 79.7 85.2 68.0 77.2 87.3
LPSAD [9340837] 22.3
SMAC-Seg [li2021smacseg] 71.8 78.2 65.2 74.2 72.2
Panoptic-PolarNet [Zhou2021PanopticPolarNet] 67.7 86.0 65.2 87.2 71.9 84.9 83.9
Dual-Dec UNet w/ BFS [Our Baseline]
Dual-Dec UNet w/ HDBSCAN [Our Baseline]
CPSeg [Ours] 21.3
CPSeg HR [Ours] 71.1 75.6 82.5 85.5 71.5 81.3 87.3 70.6 83.7 83.6 73.2
Table 2: Comparison of LiDAR panoptic segmentation performance on nuScenes [caesar2020nuscenes] validation dataset. Metrics are provided in [%] and FPS is in [Hz].

4.2 Quantitative Evaluation

In Table 1 and Table 2, we compile the results of CPSeg compared to other models. For evaluations on the SemanticKITTI test dataset (Table 1), we separate the models into two groups based on their inference speed. Only the models in rows 8-11 are known to have real-time performance, with FPS above 10 Hz. With a PQ of and an FPS of 10.6 Hz, CPSeg (row 11) achieves performance that matches state-of-the-art models. More importantly, it establishes a new benchmark in PQ for real-time models, surpassing the next best real-time model, SMAC-Seg, by . Specifically, with an increase in RQTh and a 0.5 Hz improvement in FPS over SMAC-Seg, we demonstrate that CPSeg is better at recognizing foreground objects while requiring less computation. These improvements can be mainly attributed to the cluster-free instance segmentation module and the incorporation of surface normals as a helpful part of the instance embedding.

For results on the nuScenes validation dataset (Table 2), since the methods for creating the instance labels are not standardized across publications, we separate the models into three groups for better comparison. The models in rows 1-8 are previously published models, grouped by inference speed as in Table 1. The baseline and proposed models listed in rows 9-12 use results from our experiments. Using range images as input, CPSeg HR (row 12) obtains the highest PQ of all models, outperforming the baseline models Dual Decoder (BFS) and Dual Decoder (HDBSCAN) by and , respectively. However, because scenes in the nuScenes dataset are more crowded than those in SemanticKITTI, with more instances per scene, the performance of CPSeg HR is no longer real-time. By reducing the input resolution, CPSeg (row 11) again achieves competitive real-time performance with only a small trade-off in PQ.

4.3 Qualitative Evaluation

Figure 6: Qualitative comparison of CPSeg with other methods on both the SemanticKITTI and nuScenes validation sets.

The panoptic segmentation performance of CPSeg can also be seen in Figure 6, where we compare its inference results to LPSAD (our implementation based on [9340837]) and our baseline models. In a close-up view of a scene from the SemanticKITTI dataset (row 1) where three cars are lined up closely, only CPSeg segments the instance points without errors. LPSAD identifies the car in the middle as two separate instances, whereas the baseline models produce even worse over-segmentation errors.

In a complex scene from nuScenes (row 2), with variations in instance classes and few sparse points describing each instance, correctly recognizing and distinguishing each instance proves more difficult. In areas where pedestrians walk close to each other or where cars are positioned further away, the baseline model using HDBSCAN and LPSAD are prone to under-segmentation errors. In such a complex scene, only CPSeg is able to segment each instance accurately. Additional examples and an explanation of the models' behaviours are provided in the Supplementary Material.

5 Ablation Studies

In this section, we present an extensive ablation analysis of the proposed components in CPSeg. Note that all results are compared on the SemanticKITTI validation set (Seq 08). First, we investigate the individual contribution of each component in the network and show the results in Table 3. The cluster-free instance segmentation module is the key component, increasing PQ (compared to the baseline with BFS) while eliminating the computation of clustering. Adding TAM leads to a further improvement, as the two decoders receive more meaningful task-specific features. Moreover, extracting surface normals brings another jump in PQ, since the network receives guidance on regressing the embedding for each foreground object. Lastly, the model achieves its best result by incorporating the binary Lovász loss in supervising the segmentation on the pairwise matrix.




3D Normal
Baseline w/ BFS
Baseline w/ HDBSCAN 52.7

Table 3: Ablation study of the proposed model with individual components vs. the baseline. Metrics are provided in [%].

We experiment with changing the parameter used to map the pillar embedding to the connectivity probability. We set the threshold to 0.5 and the pillar grid size to 0.15 for these experiments. In the first setting, the parameter is learned from the corresponding pillar features of the instance decoder, by applying a small network to the concatenation of the two pillars' features. In the second setting, we set it to various fixed values. From the results in Table 4, a constant value yields the best results. A fixed value works better than learning it from the features: for panoptic segmentation on outdoor autonomous driving datasets, the difference in the regressed 2D embeddings is sufficient to determine the connectivity of the pillars. However, a learned value could potentially work better if the scene is dense and crowded (e.g., indoor scenes), such that the network requires more information to make connections. We conclude that the parameter can be regarded as a function of the threshold and the pillar grid size, and we fix its value for the rest of the experiments. This design choice ensures that adjacent pillars are considered connected.
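The pairwise comparison and grouping described above can be sketched as follows. This is a minimal NumPy illustration under assumed choices, not the paper's implementation: the distance-to-probability mapping exp(-d/sigma), the function names, and the union-find pass over the binarized matrix are all ours for illustration.

```python
import numpy as np

def connectivity_matrix(emb, sigma):
    """Pairwise connectivity probabilities from 2D pillar embeddings."""
    # emb: (N, 2) regressed pillar embeddings; close pillars -> probability near 1
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    return np.exp(-d / sigma)

def instance_labels(emb, sigma, thresh=0.5):
    """Binarize the pairwise matrix and merge connected pillars via union-find."""
    adj = connectivity_matrix(emb, sigma) >= thresh
    parent = list(range(len(emb)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(len(emb)):
        for j in range(i + 1, len(emb)):
            if adj[i, j]:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj  # union: merge the two pillar groups
    return [find(i) for i in range(len(emb))]
```

Under this assumed mapping, picking sigma so that exp(-g/sigma) stays above the 0.5 threshold for grid size g is one way to realize the design choice above, since pillars one cell apart are then always connected.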



56.2 58.7 66.8

Table 4: Ablation study of different settings of the connectivity mapping parameter. Metrics are provided in [%].

One may be concerned about the complexity of the model, as it grows quadratically with the number of pillars. We provide the average number of pillars resulting from different grid sizes in Table 5. A SemanticKITTI LiDAR scan typically contains an average of 12 instances and 6.8k foreground points. We find that the number of pillars is proportional to the number of instances in the scan but significantly smaller than the number of points. As the point embeddings are learned to shift together, dynamically grouping the foreground points into pillars according to their embeddings significantly reduces the computation the network needs to perform.
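The dynamic pillarization step can be sketched as follows. This is an illustrative NumPy sketch, not the paper's code: grouping points by quantizing their regressed 2D embeddings into grid cells, and mean-pooling the embeddings within each cell, are assumptions made for this example.

```python
import numpy as np

def pillarize(embeddings, grid_size=0.15):
    """Group points whose regressed 2D embeddings fall in the same grid cell."""
    # Quantize each point's embedding to an integer cell index.
    cells = np.floor(embeddings / grid_size).astype(np.int64)
    # Occupied cells become pillars; inverse maps each point to its pillar.
    keys, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    # Pillar embedding = mean embedding of its member points.
    sums = np.zeros((len(keys), embeddings.shape[1]))
    np.add.at(sums, inverse, embeddings)
    counts = np.bincount(inverse).reshape(-1, 1)
    return sums / counts, inverse
```

Because points belonging to one object are regressed toward a common location, the number of occupied pillars tracks the instance count rather than the point count, which is why the quadratic pairwise comparison stays cheap.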

Grid Size (m) PQ PQTh RQTh SQTh Runtime (ms)
56.2 58.7 66.8
76.6 90

Table 5: Ablation study of different grid sizes for pillarizing foreground points by their embedding. Metrics are provided in [%].

6 Conclusion

In this work, we propose a novel real-time, proposal-free, and cluster-free panoptic segmentation network for 3D point clouds, called CPSeg. Our method builds upon an efficient semantic segmentation network and addresses instance segmentation with a unique cluster-free instance head, in which the foreground point cloud is dynamically pillarized in the sparse space according to the learned embeddings, and object instances are formed by connecting pillars. Moreover, a novel task-aware attention module is designed to enforce the two decoders to learn task-specific features. CPSeg outperforms existing real-time LiDAR-based panoptic segmentation methods on both the SemanticKITTI and nuScenes datasets. The thorough analysis illustrates the robustness and effectiveness of the proposed method, which could inspire the field and push panoptic segmentation research toward a proposal-free and cluster-free direction.