1 Introduction
A panoptic segmentation system that predicts both semantic tags and instance-level segmentation has begun to draw the attention of the autonomous driving community, because both foreground dynamic objects (i.e., the thing classes) and background static scenes (i.e., the stuff classes) can be perceived and output simultaneously. Compared with image-based approaches, panoptic segmentation on LiDAR point clouds remains under-researched, even though LiDAR is widely accepted as a primary perception sensor for autonomous driving owing to its active sensing nature and high-resolution sensor readings. The recently published GPS3Net [razani2021gps3net] is the current state of the art in the LiDAR panoptic segmentation task. The authors proposed using a 3D sparse convolution-based UNet as a semantic backbone and a combination of HDBSCAN and GCNN to segment instances. However, GPS3Net is computationally intensive and is thus not a real-time method. Among all published works on LiDAR panoptic segmentation, only a few [9340837, Zhou2021PanopticPolarNet, li2021smacseg] are capable of operating in real time (see Figure 1). A performance gap still exists between the real-time methods and the current state of the art. A natural question is whether it is possible to close this gap and build a panoptic segmentation method with a competitive PQ and a fast running time.
Given the runtime constraint of real-time systems, proposal-free methods are favorable as they are more computationally efficient. Current proposal-free approaches in the literature usually rely on clustering or graphs to segment foreground objects. These methods are mainly adopted from 2D image processing tasks. Designing an effective 3D panoptic segmentation system based on the unique characteristics of LiDAR point clouds is still an open research problem. This motivates us to find a more suitable design targeting the panoptic segmentation task in the LiDAR domain. In this work, we take advantage of the strong geometric patterns in LiDAR point clouds and present a new proposal-free and cluster-free approach to segment foreground objects that is capable of running in real time. In particular, we propose a network that predicts the object centroid as the embedding of each point and dynamically groups points with similar embeddings as pillars in the sparse 2D space. Objects are then formed by building connections between pillars.
Our main contributions can be summarized as follows:

A real-time panoptic segmentation network that is end-to-end (i.e., not relying on clustering or proposals to segment instances) and achieves state-of-the-art results without extensive post-processing

A novel Task-aware Attention Module (TAM) that forces the dual-decoder network to learn task-specific features

A fast surface normal calculation module, built on a novel deterministic depth completion algorithm, to aid the regression of the foreground instance embedding

A comprehensive qualitative and quantitative comparison of the proposed method with existing methods on two large-scale datasets, SemanticKITTI and nuScenes

A thorough ablation analysis of how each proposed component contributes to the overall performance
2 Related Work
The panoptic segmentation task jointly optimizes semantic and instance segmentation. LiDAR-based semantic segmentation methods can be categorized as projection-based, voxel-based, or point-based, depending on the format of the data being processed. Projection-based methods project a 3D point cloud onto a 2D image plane, either in spherical Range-View (RV) [milioto2019rangenet++, razani2021litehdseg], top-down Bird's-Eye-View (BEV) [8403277, 10.1007/9783030110093_11], or multi-view format [gerdzhev2021tornadonet]. Voxel-based methods transform a point cloud into 3D volumetric grids to be processed using 3D convolutions, which is computationally expensive. Therefore, some methods leverage sparse convolutions to alleviate this limitation and to fully exploit the sparsity of point clouds [cheng2021s3net, zhou2020cylinder3d, cheng2021af2s3net]. Point-based methods [qi2017pointnet, thomas2019kpconv], in contrast, process the unordered point cloud directly. Despite their high accuracy, point-based approaches are inefficient and consume a large amount of memory. Similar to instance segmentation, panoptic segmentation can be divided into top-down (proposal-based) and bottom-up (proposal-free) methods, as elaborated below.
2.1 Proposal-based panoptic segmentation
Top-down panoptic segmentation is a two-stage approach. First, foreground object proposals are generated; they are then further processed to extract instance information, which is fused with background semantic information. Mask RCNN [He_2017_ICCV] is commonly used for instance segmentation, together with a lightweight stuff segmentation branch. Several methods have been introduced to resolve the overlapping instance predictions of Mask RCNN and the conflict between instance and semantic predictions. UPSnet [xiong2019upsnet] presents a panoptic head with the addition of an unknown class label. EfficientPS [sirohi2021efficientlps] proposes to fuse the instance and semantic heads according to their confidence. Inspired by image-based methods, MOPT [hurtado2020mopt] and EfficientLPS [sirohi2021efficientlps] attach a semantic head to Mask RCNN to generate panoptic segmentation on the range image. However, because of the multiple sequential processes in their pipelines, these top-down methods are usually slow.
2.2 Proposal-free panoptic segmentation
In contrast to the proposal-based methods, bottom-up panoptic segmentation predicts semantic segmentation and groups the "thing" points into clusters to achieve instance segmentation. DeeperLab [yang2019deeperlab] performs instance segmentation by regressing the bounding box corners and object center. Later, PanopticDeepLab [cheng2020panoptic] proposes to predict the instance center locations and group pixels to their closest predicted centers. The pioneering panoptic method in the LiDAR domain, LPSAD [9340837], presents a shared encoder with a dual decoder, followed by a clustering algorithm that segments instances based on the predicted semantic embedding and the predicted object centroids in 3D space. Panoster [gasperini2021panoster] proposes a learnable clustering module to assign instance class labels to every point. It requires extensive post-processing steps (e.g., using DBSCAN to merge nearby object predictions) to refine the predictions in order to obtain results comparable with other state-of-the-art methods. DSNet [hong2020lidar], in turn, offers a learnable dynamic shifting module to shift points in 3D space towards the object centroids. Recently, GPS3Net [razani2021gps3net] proposes a graph-based instance segmentation network for LiDAR-based panoptic segmentation, which achieves state-of-the-art performance on the nuScenes and SemanticKITTI panoptic segmentation benchmarks. Moreover, SMACSeg [li2021smacseg] introduces a Sparse Multi-directional Attention Clustering module with a novel repel loss to better supervise the network in separating the instances. It reaches the state of the art among existing real-time LiDAR-based methods.
2.3 Normal surface calculation
Numerous point cloud processing algorithms, such as surface reconstruction, segmentation, shape modeling, and feature extraction, benefit from accurate normal vectors associated with each point. Existing methods for computing normal vectors for LiDAR point clouds can be divided into learning-based and traditional deterministic approaches. For instance, [https://doi.org/10.1111/cgf.12983, https://doi.org/10.1111/cgf.13343, ZHOU2020102916] use a neural network to estimate the normals of a point cloud directly. Learning-based approaches require extensive training data in order to achieve good performance and might need additional fine-tuning for different types of LiDAR setups. On the other hand, deterministic geometry-based approaches like [zhou2018open3d] usually require a nearest neighbour search to obtain the neighbourhood of each point and then estimate the surface by fitting a plane to the neighbourhood point cloud using the least squares method. These approaches usually struggle to achieve real-time performance, as the neighbourhood search in 3D space is computationally heavy. [5980275] uses a range-based deterministic approach to compute the surface normal, where the 3D normal is transformed from derivatives of the surface in a sparse spherical depth map. This approach operates in real time; however, the surface normal is corrupted by the empty entries in the projected depth map, which produce false sharp gradients. To overcome this limitation, we propose a novel depth completion algorithm that adaptively completes the neighbourhood around the valid points in the sparse depth map and then computes the normal from the 2D gradient.
3 Proposed Method
3.1 Problem Formulation
Let $\{(x_i, y_i)\}_{i=1}^{N}$ be $N$ unordered points of a point cloud, where $x_i$ is the input feature for point $i$ and the tuple $y_i = (c_i, o_i)$ is the semantic class label and instance ID label for point $i$. $\mathcal{C}$ is a set of semantic class labels and $\mathcal{O}$ is a set of instance IDs, with $c_i \in \mathcal{C}$ and $o_i \in \mathcal{O}$. $\mathcal{C}$ can be further divided into $\mathcal{C}^{Th}$ and $\mathcal{C}^{St}$, representing a set of countable foreground thing classes and a set of background stuff classes, respectively. Note that the instance label $o_i$ is only valid if $c_i \in \mathcal{C}^{Th}$. The goal is to learn a function $f_\theta$, parameterized by $\theta$, that takes the input feature $x_i$ and assigns a semantic label to each point, together with an instance label if the point is part of the foreground.
To solve this problem, we propose CPSeg, an end-to-end network that generates predictions for panoptic segmentation without proposals or clustering algorithms.
3.2 Network Architecture
The overview of our panoptic segmentation framework is depicted in Figure 2. We first transform the LiDAR point cloud, with Cartesian coordinates, remission and depth as input features, into a 2D range image using a spherical projection similar to [9340837]. At the same time, we build a dense depth map, which is later used as guidance for the depth completion algorithm when extracting surface normal features. Then, CPSeg takes both inputs and predicts semantic and instance segmentation results in the range view (RV). When re-projecting the results to the 3D point cloud, KNN-based post-processing is used to refine the output, as introduced in [milioto2019rangenet++]. Lastly, we fuse the results to obtain panoptic labels and use majority voting to refine the semantic segmentation results where different semantics are predicted within the same instance.
Our proposed model is summarized in Figure 3. It consists of three main components: (A) a dual-decoder UNet with a Task-aware Attention Module (TAM); (B) a surface normal calculation module, which takes the depth maps and computes normal vectors to benefit instance embedding regression; and (C) a cluster-free instance segmentation head, which segments the foreground instance embedding into objects.
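The spherical projection that produces the range image can be illustrated with a minimal NumPy sketch. It follows the commonly used range-view formulation rather than the authors' exact implementation; the function name, the 64 x 2048 image size, and the vertical field of view are assumptions chosen for a typical 64-beam sensor.

```python
import numpy as np


def spherical_projection(points, height=64, width=2048,
                         fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (N, 3) Cartesian points onto a (height, width) range image.

    The image size and vertical field of view are assumed values,
    not settings taken from the paper.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.sqrt(x ** 2 + y ** 2 + z ** 2)

    yaw = np.arctan2(y, x)                          # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.maximum(depth, 1e-8))  # elevation

    fov_up = np.deg2rad(fov_up_deg)
    fov_down = np.deg2rad(fov_down_deg)
    u = 0.5 * (1.0 - yaw / np.pi) * width                          # column
    v = (1.0 - (pitch - fov_down) / (fov_up - fov_down)) * height  # row

    cols = np.clip(np.floor(u), 0, width - 1).astype(np.int64)
    rows = np.clip(np.floor(v), 0, height - 1).astype(np.int64)

    # Write far points first so the closest point wins in each pixel.
    order = np.argsort(depth)[::-1]
    range_image = np.zeros((height, width), dtype=np.float32)
    range_image[rows[order], cols[order]] = depth[order]
    return range_image, rows, cols
```

The returned `rows` and `cols` give the pixel of every input point, which is what allows range-view predictions to be re-projected to the 3D point cloud afterwards.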
A 2D range view representation of a LiDAR point cloud is fed into CPSeg. The outputs of the dual-decoder UNet, $\hat{S}$ and $\hat{E}$, are the semantic prediction and instance embedding of the projected point cloud, respectively. In particular, the semantic decoder generates $\hat{S}$ with $C$ channels, where $C$ is the number of semantic class labels. With the Cartesian xy coordinates added as a prior, the instance decoder outputs $\hat{E} = P_{xy} + \Delta$, where $P_{xy}$ is the xy coordinates of the point cloud in RV and $\Delta$ is the output from the last block of the instance decoder. Essentially, $\hat{E}$ is the instance embedding in 2D space, which can also be interpreted as the predicted 2D location of the object centroids. Then, a binary mask $M$ is used to filter foreground points and can be expressed as,
$$M = \mathbb{1}\left[S \in \mathcal{C}^{Th}\right] \tag{1}$$
where $\mathbb{1}[\cdot]$ is the binary conditional function, $\mathcal{C}^{Th}$ is the set of foreground thing classes, and $S$ is the ground truth semantic label during training, which is replaced by the prediction $\hat{S}$ at test time. We use $M$ to obtain the corresponding embedding of the foreground thing points from the instance decoder, denoted as $\hat{E}^{Th} \in \mathbb{R}^{N_{fg} \times 2}$, where $N_{fg}$ is the number of foreground points and 2 refers to the embedding in 2D space. The foreground embedding is then used by the cluster-free instance segmentation head to segment objects.
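As an illustration of the foreground filtering step, the sketch below adds the xy prior to the decoder output and keeps only the pixels whose semantic label is a thing class. The function and argument names are hypothetical, and the arrays are assumed to be laid out as (H, W, 2) for embeddings and (H, W) for labels.

```python
import numpy as np


def foreground_embedding(xy, decoder_output, semantic_labels, thing_classes):
    """Return the foreground mask and the (N_fg, 2) foreground embedding.

    xy, decoder_output: (H, W, 2) arrays; semantic_labels: (H, W) integers.
    semantic_labels is the ground truth during training and the predicted
    label at test time.
    """
    embedding = xy + decoder_output                 # xy prior + regressed shift
    mask = np.isin(semantic_labels, thing_classes)  # True on thing pixels
    return mask, embedding[mask]
```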
Basic architecture We adopt the CLSA module from [li2021smacseg] to extract contextual features. The CLSA block learns to recover local geometry in the neighbourhood, which helps RV-based methods learn contextual information. The output of the CLSA module is then fed to a shared encoder with five residual blocks, similar to [cortinhal2020salsanext], from which we obtain multi-scale feature maps (the subscript of each feature map indicates its stride with respect to the full resolution; downsampling is performed by an AvgPool layer at the end of each encoder block). We provide the detailed architecture, such as the number of in/out channels, the number of layers, and the dimensions of the feature maps at each stage of the encoder and decoder blocks, in the Supplementary Material.
Task-aware Attention Module (TAM) Given feature maps in multiple resolutions, we upsample them to the full resolution in a hierarchical manner, as shown in Figure 4. Then a convolution layer with a residual skip connection is used to refine the boundary of each feature map. Next, we obtain two sets of channel-wise attention weights with two MLPs targeting the semantic and instance segmentation tasks, respectively, using the following equation.
(2) 
Here, each input is a feature map that has been upsampled from its stride to the full resolution and refined, the feature maps are combined with a concatenation operation, and a sigmoid is applied to the output of each MLP. Lastly, each feature channel of the refined feature maps is multiplied by the semantic and instance attention weights to obtain two task-specific sets of features, which are then sent to the two decoders, where they are fused with a residual block to obtain the semantic segmentation and the instance embedding.
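The channel-wise re-weighting performed by TAM can be sketched roughly as follows. This is only an illustration under stated assumptions: the global average pooling used to build the channel descriptor, the two-layer MLP shape, and all names (`task_aware_attention`, `random_mlp_params`) are hypothetical and are not taken from the paper, whose exact formulation is given by Equation 2.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mlp(x, params):
    """Two-layer perceptron with a ReLU in between (assumed structure)."""
    w1, b1, w2, b2 = params
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2


def task_aware_attention(feature_maps, sem_params, ins_params):
    """Re-weight concatenated full-resolution feature maps per task.

    feature_maps: list of (C_k, H, W) arrays already upsampled and refined.
    Returns one re-weighted (C_total, H, W) array per task.
    """
    stacked = np.concatenate(feature_maps, axis=0)   # (C_total, H, W)
    # Assumption: one descriptor per channel via global average pooling.
    descriptor = stacked.mean(axis=(1, 2))           # (C_total,)
    w_sem = sigmoid(mlp(descriptor, sem_params))     # semantic weights
    w_ins = sigmoid(mlp(descriptor, ins_params))     # instance weights
    return stacked * w_sem[:, None, None], stacked * w_ins[:, None, None]


def random_mlp_params(rng, channels, hidden):
    """Random parameters, used only to exercise the sketch."""
    return (rng.normal(size=(channels, hidden)), np.zeros(hidden),
            rng.normal(size=(hidden, channels)), np.zeros(channels))
```

Because the weights pass through a sigmoid, each task can only attenuate channels, which is one simple way for the two decoders to emphasise different parts of the shared features.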
3.3 Surface Normal
The instance decoder learns to regress the instance embedding, which is the shifted location in 2D space starting from the original xy coordinates towards the object centroids. Surface normal vectors provide additional geometric features to aid the network in this process. Thus, we use a Diamond Inception Module, adopted from [gerdzhev2021tornadonet], to extract geometric features from surface normals and directly fuse them with the features in the instance decoder using a concatenation operation followed by convolutional layers to obtain the instance embedding. We provide the final results with and without this module in the Ablation Studies to further demonstrate its impact on the overall performance.
In this section, we describe a deterministic way of calculating surface normal features of the point cloud with a novel depth completion algorithm, as depicted in Figure 5. The inputs to this module are (A) a sparse 2D depth map and (B) a dense 2D depth map, each with its own spatial scale. A depth map is obtained by projecting the LiDAR point cloud onto a 2D map of a specified size using discretized indices from the spherical transformation, as introduced by [wu2018squeezeseg].
First, we obtain a row-completed depth map by applying a weighted row fill, which uses the row neighbours, to every entry, as given by,
(3) 
(4) 
where the two input quantities are the depth value and binary occupancy at a given row and column, respectively, and $\lfloor \cdot \rfloor$ denotes the floor function. The weights are sampled from a Gaussian distribution, so that the center point receives the largest attention and the neighboring points are weighted less as they deviate from the center. The two associated hyperparameters are both set to 1 in our method. Then Equations 3 and 4 are reused, this time with column neighbours, to obtain a column-completed depth map. Next, bilinear upsampling is applied to obtain a coarse but dense depth map. From this dense map, we calculate the derivatives of the depth with respect to the azimuth and elevation angles of each entry, using a finite difference approximation along the horizontal and vertical directions. Local geometry can be interpreted from these two signals. Hence, they serve as the guidance signal to adaptively select a horizontal or vertical fill for each empty entry. For instance, when the magnitude of the derivative along the vertical direction is small, the point is likely to lie on a pole-like or wall-like object. In that case, a completion using the weighted average of the valid column neighbours is preferable, since the change in depth in the vertical direction is relatively small. In summary, each entry in the completed depth map can be expressed as,
(5)
where the row and column index is omitted for brevity and the remaining term denotes occupancy. From the completed depth map, r, we follow [badino2011normal] to calculate its gradients with respect to the azimuth and elevation angles and transform them into the 3D Cartesian frame centered at the LiDAR sensor to obtain the three components of the surface normal. Note that the purpose of the depth completion algorithm above is to ensure that the neighbourhood of the valid pixels is smooth, so that the gradients are not influenced by noise.
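The depth completion idea can be illustrated with a short NumPy sketch: a Gaussian-weighted average over the valid row (or column) neighbours, followed by a per-entry choice between the two fills driven by a guidance gradient. This is a sketch under assumptions, not the authors' Equations 3 to 5: the kernel size, the gradient threshold, the use of zero to mark empty entries, and the function names are all hypothetical.

```python
import numpy as np


def gaussian_weights(kernel_size, sigma):
    """Weights that peak at the center and decay with the offset."""
    offsets = np.arange(kernel_size) - kernel_size // 2
    return np.exp(-0.5 * (offsets / sigma) ** 2)


def weighted_row_fill(depth, kernel_size=5, sigma=1.0):
    """Weighted average of the valid row neighbours of every entry.

    Empty entries are assumed to be stored as 0; kernel_size must be odd.
    """
    occupancy = (depth > 0).astype(np.float64)
    weights = gaussian_weights(kernel_size, sigma)
    half = kernel_size // 2
    width = depth.shape[1]
    padded_depth = np.pad(depth * occupancy, ((0, 0), (half, half)))
    padded_occ = np.pad(occupancy, ((0, 0), (half, half)))
    numerator = np.zeros(depth.shape, dtype=np.float64)
    denominator = np.zeros(depth.shape, dtype=np.float64)
    for k, w in enumerate(weights):
        numerator += w * padded_depth[:, k:k + width]
        denominator += w * padded_occ[:, k:k + width]
    # Entries with no valid neighbour stay empty (0).
    return np.divide(numerator, denominator,
                     out=np.zeros_like(numerator), where=denominator > 0)


def adaptive_complete(depth, vertical_gradient, threshold=0.1, **fill_kwargs):
    """Pick the column fill where depth changes little vertically."""
    row_filled = weighted_row_fill(depth, **fill_kwargs)
    col_filled = weighted_row_fill(depth.T, **fill_kwargs).T
    use_column = np.abs(vertical_gradient) < threshold
    filled = np.where(use_column, col_filled, row_filled)
    # Valid measurements are kept untouched.
    return np.where(depth > 0, depth, filled)
```

In this sketch `vertical_gradient` stands in for the guidance signal computed from the coarse dense depth map; any array of the same shape as `depth` can be supplied.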
3.4 Cluster-free Instance Segmentation
Given the 2D embedding of the foreground from the instance decoder, the goal of the cluster-free instance segmentation module is to segment the foreground points into instances. First, we dynamically group the foreground points into pillars according to their location in the 2D embedding space, such that points within the same grid cell are inside the same pillar (see bottom right of Figure 3). The embedding of each resulting pillar is the average embedding of the points grouped together. The pillarized foreground embedding contains $M$ entries, where $M$ is the number of pillars. Next, we construct an $M \times M$ pairwise comparison matrix to find connected pillars, with each entry as,
(6) 
Each entry represents the connectivity probability of a pair of pillars. A large probability indicates that the network is confident that the points in the two pillars belong to the same object. In order for the matrix to provide meaningful connectivity indications, the connectivity function must satisfy several constraints: 1) a pillar must be connected to itself with 100% confidence; 2) the connectivity is symmetric, in other words, the network should output the same confidence regardless of the order in which the two pillars are compared; 3) the output is a valid probability, with a value above the threshold indicating that the two pillars belong to the same object.
We define a connectivity function that satisfies all the constraints listed above. Note that its mapping parameter can be either learned from the pillar features or fixed as a hyperparameter constant. We discuss this in detail and compare the results in the Ablation Studies. Then, we apply a threshold to the pairwise comparison matrix to obtain a binary connectivity matrix. Note that the binary connectivity matrix can be interpreted as the adjacency matrix of a graph in which each pillar is a node and a true entry indicates that the two nodes are connected. A simple algorithm is then used to find the connected disjoint sets in the graph and assign them separate instance IDs. Lastly, we map the pillar instance IDs back to the range view using a point index matching process. Both the semantic and instance segmentation results are then ready to be re-projected to the point cloud and post-processed.
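The pillarization and connectivity steps above can be sketched end to end as follows. The Gaussian kernel used for the connectivity function is an assumption: it satisfies the three listed constraints, but it is not necessarily the function defined in the paper, and the value of `alpha` is purely illustrative. The default grid size of 0.15 and threshold of 0.5 are the values quoted in the Ablation Studies.

```python
import numpy as np


def pillarize(embedding, grid_size):
    """Group (N, 2) embeddings into pillars on a regular 2D grid."""
    cells = np.floor(embedding / grid_size).astype(np.int64)
    _, point_to_pillar = np.unique(cells, axis=0, return_inverse=True)
    point_to_pillar = point_to_pillar.reshape(-1)
    num_pillars = int(point_to_pillar.max()) + 1
    sums = np.zeros((num_pillars, 2))
    np.add.at(sums, point_to_pillar, embedding)
    counts = np.bincount(point_to_pillar, minlength=num_pillars)
    return sums / counts[:, None], point_to_pillar


def connectivity(pillar_embedding, alpha):
    """Assumed kernel: exp(-alpha * squared distance), symmetric, 1 on the diagonal."""
    diff = pillar_embedding[:, None, :] - pillar_embedding[None, :, :]
    return np.exp(-alpha * (diff ** 2).sum(axis=-1))


def connected_components(adjacency):
    """Label the connected disjoint sets of a boolean adjacency matrix."""
    num_nodes = adjacency.shape[0]
    labels = -np.ones(num_nodes, dtype=np.int64)
    current = 0
    for seed in range(num_nodes):
        if labels[seed] >= 0:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:
            node = stack.pop()
            for neighbour in np.flatnonzero(adjacency[node] & (labels < 0)):
                labels[neighbour] = current
                stack.append(neighbour)
        current += 1
    return labels


def segment_instances(embedding, grid_size=0.15, alpha=10.0, threshold=0.5):
    """Return one instance ID per foreground point."""
    pillar_embedding, point_to_pillar = pillarize(embedding, grid_size)
    adjacency = connectivity(pillar_embedding, alpha) > threshold
    return connected_components(adjacency)[point_to_pillar]
```

Because the pairwise matrix is built over pillars rather than points, its size depends on the number of pillars only, which keeps the graph small.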
3.5 Loss Functions
Semantic Segmentation Loss We follow [gerdzhev2021tornadonet] to supervise the semantic segmentation output with a weighted combination of weighted cross entropy (WCE), Lovász softmax, and Total Variation (TV) losses.
(7) 
Here, all terms are computed against the ground truth (GT) semantic label; the Lovász softmax term is the Lovász extension of IoU introduced in [berman2018lovasz], and the TV term is the absolute error between the predicted probability and the GT.
Instance Embedding Loss We use an L2 regression loss to supervise the learning of the instance embedding, computed on the difference between the predicted instance embedding and the GT. Note that the instance embedding here can be interpreted as the mass centroid of an object in 2D BEV.
(8) 
Here, the loss is accumulated over the 2D indices of the range image, the GT foreground binary mask eliminates the background points when calculating the loss, and the GT instance embedding is the mass centroid of each instance, calculated by taking the average of the xy coordinates of the points in the object.
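A minimal sketch of the embedding target and the masked regression loss is given below. The squared L2 distance and the normalisation by the number of foreground pixels are assumptions, as are the function names; Equation 8 defines the loss actually used.

```python
import numpy as np


def gt_centroid_embedding(xy, instance_ids, fg_mask):
    """GT embedding: every foreground pixel points to its instance centroid.

    xy: (H, W, 2) coordinates, instance_ids: (H, W), fg_mask: (H, W) bool.
    """
    target = np.zeros_like(xy, dtype=np.float64)
    for inst in np.unique(instance_ids[fg_mask]):
        selected = fg_mask & (instance_ids == inst)
        target[selected] = xy[selected].mean(axis=0)
    return target


def embedding_loss(pred, target, fg_mask):
    """Masked squared L2 error averaged over foreground pixels (assumed form)."""
    squared_error = ((pred - target) ** 2).sum(axis=-1)
    num_foreground = max(int(fg_mask.sum()), 1)
    return float((squared_error * fg_mask).sum() / num_foreground)
```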
Instance Segmentation Loss Essentially, the task here is to supervise binary segmentation on the pairwise matrix and optimize the IoUs for positive and negative predictions. Assuming that points within the same pillar are from the same object, we construct the GT instance label of each pillar by taking the mode of the GT instance labels of the points inside it. The GT binary label for the pairwise comparison matrix is then obtained by marking an entry as positive when the two pillars share the same GT instance label.
(9) 
Here, the first term is the binary cross entropy loss, the second is the Lovász extension of IoU, and the third is the absolute error between the predicted probability and the GT. The Lovász loss introduced by [berman2018lovasz] has been shown to be effective in optimizing IoU metrics. Further experimental results in the Ablation Studies show that adding this loss achieves better overall accuracy.
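The construction of the pairwise GT label can be sketched as follows, together with the binary cross entropy term only; the Lovász and absolute-error terms are omitted for brevity. The function names are hypothetical, and every pillar is assumed to contain at least one point, which holds by construction when pillars are built from the foreground points.

```python
import numpy as np


def pillar_gt_labels(point_labels, point_to_pillar, num_pillars):
    """Mode of the GT instance labels of the points inside each pillar."""
    labels = np.zeros(num_pillars, dtype=np.int64)
    for pillar in range(num_pillars):
        values, counts = np.unique(point_labels[point_to_pillar == pillar],
                                   return_counts=True)
        labels[pillar] = values[np.argmax(counts)]
    return labels


def pairwise_gt(pillar_labels):
    """Entry (i, j) is 1 when pillars i and j share the same GT instance."""
    same = pillar_labels[:, None] == pillar_labels[None, :]
    return same.astype(np.float64)


def binary_cross_entropy(prob, target, eps=1e-7):
    """Mean BCE between predicted connectivity and the pairwise GT."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob)
                          + (1.0 - target) * np.log(1.0 - prob)))
```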
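The total loss that is used to train the network is a weighted combination of the loss terms described above,
$$\mathcal{L}_{total} = \beta_{sem}\,\mathcal{L}_{sem} + \beta_{emb}\,\mathcal{L}_{emb} + \beta_{ins}\,\mathcal{L}_{ins} \tag{10}$$
where $\mathcal{L}_{sem}$, $\mathcal{L}_{emb}$ and $\mathcal{L}_{ins}$ are the semantic, instance embedding, and instance segmentation losses of Equations 7, 8 and 9, and $\beta_{sem}$, $\beta_{emb}$ and $\beta_{ins}$ are the weights for the respective loss terms.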
4 Experiments
In this section, we describe the experimental settings and evaluate the performance of CPSeg on the SemanticKITTI [DBLP:conf/iccv/BehleyGMQBSG19] and nuScenes [caesar2020nuscenes] datasets for panoptic segmentation. We compare our results with state-of-the-art approaches. We refer the readers to the Ablation Studies for the design choices and the various components of the network.
Datasets SemanticKITTI [DBLP:conf/iccv/BehleyGMQBSG19] is the first available dataset on LiDAR-based panoptic segmentation for driving scenes. It contains 19,130 training frames, 4,071 validation frames, and 20,351 test frames. We provide ablation analysis and validation results on sequence 08, as well as test results on sequences 11 to 21. Each point in the dataset is provided with a semantic label from 28 classes, which are mapped to 19 classes for the task of panoptic segmentation. Among these 19 classes, 11 are stuff classes and the rest are considered thing classes, for which instance IDs are available.
nuScenes [caesar2020nuscenes] is another popular large-scale driving-scene dataset, with 700 scenes for training, 150 for validation, and 150 for testing. At the time of writing, the authors have not provided point-level panoptic segmentation labels for LiDAR scans. Thus, we generated the labels using the provided 3D bounding box annotations from the detection dataset and the semantic labels from the lidarseg dataset. In particular, we assign the same instance ID to points within a bounding box that have the same semantic label. Out of the 16 labeled classes in the lidarseg dataset, 8 human and vehicle classes are considered things, and the 8 other classes are considered stuff. We follow [Zhou2021PanopticPolarNet] in discarding instances with fewer than 20 points during evaluation. We train our model on the 700 training scenes and report the results on the validation set of 150 scenes.
Baselines We use a dual-decoder UNet based on SalsaNext [cortinhal2020salsanext] as the baseline. In particular, the two decoders generate semantic segmentation and instance embedding, respectively. A clustering algorithm (e.g., BFS, HDBSCAN) is then added after the instance decoder to segment the objects based on the predicted embedding. For a fair comparison, we add the CLSA Feature Extractor Module in front of the encoder to match our network design. Moreover, we implement LPSAD based on [9340837] as an additional baseline. Quantitative and qualitative results are compared against the proposed method on the SemanticKITTI and nuScenes validation sets.
Evaluation Metric We follow [panopticMetric] to use the mean Panoptic Quality (PQ) as our main metric to evaluate and compare the results with others. In addition, we also report Recognition Quality (RQ), and Segmentation Quality (SQ). They are calculated separately on stuff and thing classes, providing PQ^{St}, SQ^{St}, RQ^{St} and PQ^{Th}, SQ^{Th}, RQ^{Th}.
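For reference, the three metrics are related by PQ = SQ x RQ under the standard panoptic quality definition, in which a predicted segment and a ground truth segment are matched when their IoU exceeds 0.5. The sketch below computes them for a single class from already-matched segments; the function name and argument layout are illustrative.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Return (PQ, SQ, RQ) for one class.

    matched_ious: IoU of every matched (true positive) segment pair.
    """
    num_true_positives = len(matched_ious)
    denominator = (num_true_positives
                   + 0.5 * num_false_positives
                   + 0.5 * num_false_negatives)
    if denominator == 0:
        return 0.0, 0.0, 0.0
    # SQ: mean IoU of the matched pairs; RQ: F1-style recognition score.
    sq = sum(matched_ious) / num_true_positives if num_true_positives else 0.0
    rq = num_true_positives / denominator
    return sq * rq, sq, rq
```

The mean PQ reported in the tables averages the per-class values, and the thing and stuff variants average over the corresponding subsets of classes.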
4.1 Experimental Setup
For both datasets, we trained CPSeg end-to-end for 150 epochs using the SGD optimizer and an exponential-decay learning rate scheduler, with an initial learning rate of 0.01 and a decay rate of 0.99 every epoch. Weight decay was also applied. The model was trained on 4 NVIDIA V100 GPUs with a batch size of 4 per GPU. The three loss terms were weighted with fixed coefficients. We used separate range image resolutions for SemanticKITTI and nuScenes. Additionally, we provide CPSeg HR (high-resolution), which uses a larger input size for nuScenes. The pillar grid size, the pillar pairwise matrix threshold, and the mapping parameter were fixed for the final models.
Table 1: Comparison on the SemanticKITTI test set.
Method  PQ  PQ^{†}  RQ  SQ  PQ^{Th}  RQ^{Th}  SQ^{Th}  PQ^{St}  RQ^{St}  SQ^{St}  mIoU  FPS 

RangeNet++ [milioto2019rangenet++] + PointPillars [Lang_2019_CVPR_pointpillars]  
PanopticTrackNet [hurtado2020mopt]  
KPConv [thomas2019kpconv] + PointPillars [Lang_2019_CVPR_pointpillars]  
Panoster [gasperini2021panoster]  
DSNet [hong2020lidar]  
EfficientLPS [sirohi2021efficientlps]  83.0  87.8  60.5  74.6  79.5  
GPS3Net [razani2021gps3net]  60.0  69.0  72.1  65.0  74.5  70.8  
LPSAD [9340837]  11.8  
PanopticPolarNet [Zhou2021PanopticPolarNet]  87.2  
SMACSeg [li2021smacseg]  58.4  72.3  79.3  63.3  
CPSeg [Ours]  57.0  63.5  68.8  82.2  55.1  64.1  58.4  72.3  79.3 
Table 2: Comparison on the nuScenes validation set.
Method  PQ  PQ^{†}  RQ  SQ  PQ^{Th}  RQ^{Th}  SQ^{Th}  PQ^{St}  RQ^{St}  SQ^{St}  mIoU  FPS 
DSNet [hong2020lidar]  84.4  
PanopticTrackNet [hurtado2020mopt]  
EfficientLPS [sirohi2021efficientlps]  71.5  84.1  
GPS3Net [razani2021gps3net]  75.8  
SMACSeg HiRes [li2021smacseg]  68.4  73.4  79.7  85.2  68.0  77.2  87.3  
LPSAD [9340837]  22.3  
SMACSeg [li2021smacseg]  71.8  78.2  65.2  74.2  72.2  
PanopticPolarNet [Zhou2021PanopticPolarNet]  67.7  86.0  65.2  87.2  71.9  84.9  83.9  
DualDec UNet w/ BFS [Our Baseline]  
DualDec UNet w/ HDBSCAN [Our Baseline]  
CPSeg [Ours]  21.3  
CPSeg HR [Ours]  71.1  75.6  82.5  85.5  71.5  81.3  87.3  70.6  83.7  83.6  73.2 
4.2 Quantitative Evaluation
In Table 1 and Table 2, we compile the results of CPSeg compared to other models. For the evaluation on the SemanticKITTI test set (Table 1), we separate the models into two groups based on their inference speed. Only the models in rows 8 to 11 are known to have real-time performance, with an FPS above 10 Hz. With its PQ (Table 1) and an FPS of 10.6 Hz, CPSeg (row 11) achieves performance that matches state-of-the-art models. More importantly, it establishes a new benchmark in PQ for real-time models, surpassing the next best real-time model, SMACSeg. Specifically, with an increase in RQ^{Th} and a 0.5 Hz improvement in FPS over SMACSeg, we demonstrate that CPSeg is better at recognizing foreground objects while requiring less computation. These improvements can be mainly attributed to the use of the cluster-free instance segmentation module and the incorporation of the surface normal as a helpful input to the instance embedding.
For the results on the nuScenes validation set (Table 2), since the methods for creating the instance labels are not standardized across publications, we separate the models into three groups for better comparison. The models in rows 1 to 8 are previously published models, grouped by inference speed as in Table 1. The baseline and proposed models listed in rows 9 to 12 use results from our experiments. Using higher-resolution range images as input, CPSeg HR (row 12) obtains the highest PQ of all models, outperforming both baseline models, Dual Decoder (BFS) and Dual Decoder (HDBSCAN). However, because the nuScenes dataset is more crowded than the SemanticKITTI dataset, with more instances in each scene, the performance of CPSeg HR is no longer real-time. By reducing the input resolution, CPSeg (row 11) again achieves competitive real-time performance with only a small trade-off in PQ.
4.3 Qualitative Evaluation
The panoptic segmentation performance of CPSeg can also be seen in Figure 6, where we compare its inference results to those of LPSAD (our implementation based on [9340837]) and our baseline models. In a close-up view of a scene from the SemanticKITTI dataset (row 1), where three cars are lined up closely, only CPSeg segments the instance points without errors. LPSAD identifies the car in the middle as two separate instances, whereas the baseline models produce even worse over-segmentation errors.
In a complex scene from nuScenes (row 2), with variations in instance classes and few sparse points describing each instance, correctly recognizing and distinguishing each instance proves to be more difficult. In areas where pedestrians are walking close to each other or where cars are positioned further away, the baseline model using HDBSCAN and LPSAD are prone to under-segmentation errors. In such a complex scene, only CPSeg is able to segment each instance accurately. Additional examples and an explanation of the models' behaviours are provided in the Supplementary Material.
5 Ablation Studies
In this section, we present an extensive ablation analysis of the proposed components in CPSeg. Note that all results are compared on the SemanticKITTI validation set (Seq 08). First, we investigate the individual contribution of each component in the network and show the results in Table 3. The cluster-free instance segmentation module is the key component, introducing an increase in PQ (compared to the baseline with BFS) while eliminating the computation of clustering at the same time. Adding TAM leads to a further improvement, as the two decoders receive more meaningful task-specific features. Moreover, extracting the surface normal brings another jump in PQ, since the network receives guidance on regressing the embedding for each foreground object. Lastly, the model achieves the best result by incorporating the binary Lovász loss in supervising the segmentation on the pairwise matrix.
Table 3: Ablation of the proposed components on the SemanticKITTI validation set (Seq 08).
Architecture  Cluster-free  TAM  3D Normal  Lovász  mPQ 
Baseline w/ BFS  44.9  
Baseline w/ HDBSCAN  52.7  
Proposed  ✓  
✓  ✓  
✓  ✓  ✓  
✓  ✓  ✓  
✓  ✓  ✓  
✓  ✓  ✓  ✓  56.2 

We experiment with changing the mapping parameter, which maps the pillar embedding to the connectivity probability. We set the threshold to 0.5 and the pillar grid size to 0.15 for these experiments. In the first setting, the parameter is learned from the corresponding pillar features from the instance decoder: the features of the two pillars being compared are concatenated and used to predict it. In the second setting, we set the parameter to various fixed values. From the results in Table 4, a constant value yields the best results. A fixed value works relatively better than learning from the features; for panoptic segmentation tasks on outdoor autonomous driving datasets, the difference in the regressed 2D embedding is enough to determine the connectivity of the pillars. However, we think that a learned parameter could potentially work better if the scene is dense and crowded (e.g., indoor scenes), such that the network requires more information to make connections. We also conclude that the parameter can be regarded as a function of the threshold and the pillar grid size, and we choose it accordingly for the rest of the experiments. This design choice ensures that adjacent pillars are considered to be connected.
PQ  PQ^{Th}  RQ^{Th}  SQ^{Th}  

Fixed 

76.7  

56.2  58.7  66.8  


Learned 


One may be concerned about the complexity of the model, as it grows quadratically with the number of pillars. We provide the average number of pillars resulting from different grid sizes in Table 5. Typically, a SemanticKITTI LiDAR scan contains on average 12 instances and 6.8k foreground points. We find that the number of pillars is proportional to the number of instances in the scan but significantly smaller than the number of points. As the point embeddings are learned to shift together in the network, dynamically grouping the foreground points into pillars according to their embedding significantly reduces the computation the network needs to carry out.
Grid Size (m)  PQ  PQ^{Th}  RQ^{Th}  SQ^{Th}  Runtime (ms)  




56.2  58.7  66.8  



76.6  90  

6 Conclusion
In this work, we propose a novel real-time, proposal-free and cluster-free panoptic segmentation network for 3D point clouds, called CPSeg. Our method builds upon an efficient semantic segmentation network and addresses instance segmentation by incorporating a unique cluster-free instance head, in which the foreground point cloud is dynamically pillarized in the sparse space according to the learned embedding and object instances are formed by building connections between pillars. Moreover, a novel task-aware attention module is designed to force the two decoders to learn task-specific features. CPSeg outperforms existing real-time LiDAR-based panoptic segmentation methods on both the SemanticKITTI and nuScenes datasets. The thorough analysis illustrates the robustness and effectiveness of the proposed method, which could inspire the field and push panoptic segmentation research in a proposal-free and cluster-free direction.