Autonomous vehicles (AVs) have made significant strides in the past decade, in large part due to the advent of deep learning techniques and large-scale perception datasets. The majority of the existing AV datasets [Geiger2013IJRR, caesar2020nuscenes, chang2019argoverse, houston2020one, sun2020scalability] model objects as 3D bounding boxes with 7 degrees of freedom (DoF) and constant size, tracked over time. While bounding boxes form a good approximation for the shape of a car, they fail to tightly include extremities such as mirrors and antennas. Bounding boxes are also less suitable for other classes such as pedestrians with extended arms. When modeling the interactions of two pedestrians with bounding boxes, we lose most of the information about their spatial relation and contact points. This is exacerbated by the common assumption that object size remains constant across a track, which is not the case for articulated objects like pedestrians. Finally, the use of 7 DoF boxes means that the pitch of an object is not adjusted to the inclination of the road, which results in less accurate annotations. Furthermore, bounding boxes are only suitable for foreground objects (thing), but not for amorphous background regions (stuff) such as road or grass.
In the LiDAR semantic segmentation task [thomas2019kpconv, wu2018squeezeseg, milioto2019rangenet++], we label each LiDAR point with a semantic label rather than using bounding boxes. This enables a much higher level of granularity. It also includes labels for stuff classes that are missing in object detection and tracking. However, the lack of instance labels means that the interaction between objects of the same class still cannot be accurately modeled. In LiDAR panoptic segmentation [behley2019semantickitti, sirohi2021efficientlps], we assign each LiDAR point a semantic label and each point belonging to a thing class an instance label. However, existing datasets and metrics ignore the temporal dimension by evaluating only on a single scan. The novel panoptic tracking task [hurtado2020mopt] combines panoptic segmentation and tracking into a single coherent scene understanding problem.
Table I: Overview of relevant LiDAR semantic segmentation datasets.
| Toronto-3D [tan2020toronto3d] | 1x Canada | 8 | - | - | Point | ✗ | - |
| A2D2 [geyer2020a2d2] | 3x Germany | 38 | - | - | Pixel, Box | ✗ | - |
| PandaSet [pandaset] | 2x USA | 37 | 100 | - | Point, Box | ✓ | 0.2 h |
| SemanticKITTI [behley2021panoptickitti] | 1x Germany | 28 | 22 | 23k (270k) | Point, Instance | ✓ | 1.5 h |
| Panoptic nuScenes (ours) | Boston, SG | 32 | 1000 | 300k (1.2M) | Point, Box, Instance | ✓ | 5.5 h |
In this paper, we introduce Panoptic nuScenes, the first benchmark dataset for panoptic tracking (an equivalent task [aygun20214d] was concurrently proposed on SemanticKITTI). Many AV datasets only support a small number of AV tasks. Rather than developing a new dataset for each task, we extend the existing nuScenes [caesar2020nuscenes] dataset that already features object detection, bounding box tracking, and prediction tasks. We manually annotate 40,000 keyframes with 32 semantic classes, amounting to a total of 1.1B labeled LiDAR points. The nuScenes dataset contains 1,000 scenes from 4 locations in Singapore and Boston. Contrary to the SemanticKITTI [behley2019semantickitti] dataset, nuScenes has been primarily collected in dense urban environments with many dynamic agents such as different vehicles and pedestrians. The AV that was used to collect the dataset was equipped with sensors that cover the full 360 degrees of the surroundings and include radar. Moreover, the nuScenes dataset is much more diverse with 1,000 scenes compared to 21 scenes that are present in SemanticKITTI. Furthermore, the semantic class annotations are more fine-grained with 23 compared to 7 thing classes.
Our main contributions in this work are as follows:
We manually annotate 1.1B points in 1,000 scenes with 23 object classes, resulting in the most diverse LiDAR segmentation dataset to date.
We extend nuScenes to a holistic LiDAR benchmark that includes semantic segmentation, panoptic segmentation, and panoptic tracking tasks as illustrated in Fig. 1.
We propose a new instance-level Panoptic Tracking (PAT) metric that incorporates both panoptic quality and tracking quality while penalizing track fragmentation.
We provide extensive baselines for each of the three tasks and present ablation studies that demonstrate the novelty of our benchmark dataset.
The supplementary material can be found at https://arxiv.org/abs/2109.03805.
II Related Work
Datasets and Benchmarks: LiDAR datasets for autonomous driving have contributed significantly to the progress in the industry. However, most of these datasets focus solely on 3D object detection rather than point-level segmentation, which provides a more holistic and richer scene understanding. To fill this gap, recent driving datasets provide LiDAR point-level semantic annotations. However, they either do not provide instance annotations, have only a few moving object instances, or ignore the temporal dimension. Tab. I provides an overview of relevant LiDAR semantic segmentation datasets. Overall, there is a lack of LiDAR datasets that provide annotations for panoptic segmentation and tracking. SemanticKITTI [behley2021panoptickitti] introduced a dataset that provides fine-grained semantic and temporally consistent instance annotations. It provides 1.5 hours of data recorded in a single city. Recently, Aygün et al. [aygun20214d] extended this benchmark and proposed a point-centric evaluation metric for panoptic tracking [hurtado2020mopt].
In this work, we annotate the significantly more diverse nuScenes [caesar2020nuscenes] dataset with LiDAR segmentation labels for 1.1B points. This dataset is referred to as nuScenes-lidarseg. Since the initial release of nuScenes-lidarseg, some works have attempted to create a panoptic dataset by combining the 3D bounding boxes from nuScenes [caesar2020nuscenes] and the LiDAR segmentation labels from nuScenes-lidarseg. DS-Net [hong2021dynamic] and EfficientLPS [sirohi2021efficientlps] assign instance labels to each point based on the instance of the bounding box that contains it. EfficientLPS [sirohi2021efficientlps] ignores instances that have fewer than 15 points. PolarSeg-Panoptic [zhou2021panoptic] creates the panoptic instance annotation by assigning each thing object point to its closest 3D bounding box of the same label and removes outliers by omitting the thing object points that lie farther than a distance threshold from the centroid of the nearest bounding box. Only instances with at least 20 points are evaluated. Due to the lack of an official dataset and evaluation protocol, the performances reported by these works are not comparable. Moreover, none of these works have made their dataset public, nor have they released a standardized evaluation protocol or public evaluation server, making it difficult to reproduce or build on their works. In this work, we build upon nuScenes and introduce nuScenes-lidarseg and Panoptic nuScenes. With Panoptic nuScenes, we introduce official panoptic annotations, a benchmarking protocol, and an open challenge for panoptic segmentation and tracking.
Methods: There have been significant breakthroughs in LiDAR semantic segmentation in recent years. Methods such as KPConv [thomas2019kpconv] and RandLA [hu2020randla] utilize an encoder-decoder architecture that directly operates on the point cloud, whereas approaches such as SqueezeSeg [wu2018squeezeseg] and RangeNet++ [milioto2019rangenet++] use spherical projections of point clouds to enable the use of 2D CNN architectures. These methods typically employ a KNN-based post-processing step to account for reprojection errors. PolarNet [zhang2020polarnet], on the other hand, transforms the point cloud into polar BEV space to account for the imbalanced distribution of points. Cylinder3D [zhu2021cylindrical] opts for cylindrical partition and asymmetrical 3D convolution networks.
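The spherical projection used by range-image methods can be illustrated with a minimal sketch. It is not taken from any of the cited implementations: the function name, the image resolution, and the vertical field of view are illustrative assumptions, and the standard yaw/pitch mapping is used.

```python
import math

def spherical_projection(points, height=32, width=1024,
                         fov_up_deg=10.0, fov_down_deg=-30.0):
    """Project (x, y, z) points onto a height x width range image.

    Returns a dict mapping (row, col) -> range of the closest point that
    falls into that pixel. All sensor parameters are assumed values.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = {}
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # a point at the sensor origin has no direction
        yaw = math.atan2(y, x)      # horizontal angle in [-pi, pi]
        pitch = math.asin(z / r)    # vertical angle
        col = int(0.5 * (1.0 - yaw / math.pi) * width)
        row = int((1.0 - (pitch - fov_down) / fov) * height)
        # Clamp to the image bounds (points outside the assumed FOV).
        col = min(width - 1, max(0, col))
        row = min(height - 1, max(0, row))
        # Several points may share a pixel; keep the closest one.
        if (row, col) not in image or r < image[(row, col)]:
            image[(row, col)] = r
    return image
```

Because several 3D points can land in the same pixel, labels predicted on the 2D image have to be propagated back to all points, which is where the KNN-based post-processing step comes in.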
LiDAR panoptic segmentation and tracking approaches are broadly categorized into proposal-free and proposal-based methods. Proposal-free methods [milioto2020lidar, gasperini2021panoster, aygun20214d] generally infer semantic segmentation before detecting instances through keypoint/center estimation or vote/learnable clustering in the projection or BEV domain. In contrast, proposal-based methods [sirohi2021efficientlps, hurtado2020mopt] first generate region proposals from encoded features and then detect the instances in parallel with semantic segmentation. EfficientLPS [sirohi2021efficientlps] introduces several novel network modules to incorporate 3D information explicitly into a 2D panoptic segmentation architecture. PanopticTrackNet [hurtado2020mopt] further builds upon EfficientPS [mohan2020efficientps] and adds a tracking head to obtain temporally consistent instance labels for panoptic tracking. The subsequently proposed 4D-PLS [aygun20214d] employs point clustering to effectively leverage the sequential nature of several consecutive LiDAR scans. Given that LiDAR panoptic segmentation and tracking are critical tasks for autonomous driving and are considerably less explored than the image-based approaches, we believe that our large-scale benchmark dataset will encourage innovative research that addresses these tasks.
In this section, we describe the Panoptic nuScenes dataset and the protocol that we employ for annotating the groundtruth for semantic segmentation, panoptic segmentation, and panoptic tracking tasks. Panoptic nuScenes is an extension of the nuScenes dataset [caesar2020nuscenes] which contains 1000 scenes collected across multiple cities with both left and right-hand driving, and several dynamic agents such as different moving vehicles and pedestrians. As a result, Panoptic nuScenes is large-scale and geographically diverse with point-level semantic and instance segmentation annotations that are temporally consistent. Moreover, the highly varied scenes and a large number of moving agents in each scene make it a challenging dataset suitable for benchmarking the panoptic tracking task.
III-A Data Annotation
We annotate the Panoptic nuScenes dataset to consist of 32 semantic classes, with 23 thing and 9 stuff classes as shown in Fig. 2. We manually annotate the semantic label of each LiDAR point. To reduce the annotation effort, we use the 3D boxes to initialize the semantic labels of the points for the thing classes. Thereafter, we manually refine points that are included in multiple thing boxes or close to stuff points (e.g. vehicle wheels close to the ground plane). This approach eliminates a significant number of points and reduces the effort required to semantically annotate the remainder of the points, which belong to stuff classes. We then perform multiple rounds of validation to achieve high-quality LiDAR segmentation annotations. As part of the process, we rendered camera frames with the corresponding point cloud and segmentation labels overlaid to ensure that the labels match up with what is seen in the camera view. We render these frames sequentially into a video to help identify temporal inconsistencies in the point-wise semantic labels during the review process.
We combine the point-level labels with the 3D bounding boxes from nuScenes [caesar2020nuscenes] to obtain instance labels for each point. An instance consists of the points that fall within a 3D bounding box and have the same segmentation type as the box. For bounding boxes that overlap, we resolve them by labeling the overlapping points as noise. The percentage of points that are present in such overlapping regions is zero for 9 classes and remains small for all other classes. This holds even for classes such as 'bendy bus', whose instances are usually made up of multiple potentially overlapping bounding boxes. Using the track ID, we ensure that instances are temporally consistent.
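The combination of point-level labels and boxes described above can be sketched as follows. This is a simplified illustration rather than the annotation tooling itself: the box tuple layout, the "noise" label string, and the rule that a point in more than one box is marked as noise are assumptions made for the example.

```python
import math

NOISE = "noise"  # assumed label for points inside overlapping boxes

def point_in_box(point, box):
    """Check whether a point lies inside a yaw-rotated 3D box.

    box = (cx, cy, cz, length, width, height, yaw), i.e. a 7 DoF box.
    """
    cx, cy, cz, length, width, height, yaw = box
    dx, dy, dz = point[0] - cx, point[1] - cy, point[2] - cz
    # Rotate the offset by -yaw to express it in the box frame.
    local_x = math.cos(yaw) * dx + math.sin(yaw) * dy
    local_y = -math.sin(yaw) * dx + math.cos(yaw) * dy
    return (abs(local_x) <= length / 2
            and abs(local_y) <= width / 2
            and abs(dz) <= height / 2)

def assign_instances(points, semantic_labels, boxes):
    """Return one (semantic_label, instance_id) pair per point.

    boxes is a list of (box, class_label, track_id). A point receives the
    track ID of a box only if it lies inside that box and shares its class.
    """
    result = []
    for point, label in zip(points, semantic_labels):
        hits = [(cls, tid) for box, cls, tid in boxes
                if point_in_box(point, box)]
        if len(hits) > 1:
            result.append((NOISE, None))        # overlapping boxes
        elif len(hits) == 1 and hits[0][0] == label:
            result.append((label, hits[0][1]))  # instance point
        else:
            result.append((label, None))        # stuff or unmatched point
    return result
```

Using the track ID as the instance ID is what makes the resulting instance labels consistent across frames.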
III-B Dataset Analysis
Our dataset contains 1.1B LiDAR points annotated with one of 32 semantic labels and temporally consistent instance IDs for thing classes. Fig. 2 shows the number of points for each semantic class in our dataset and the number of instances for the relevant classes. The most common stuff classes are drivable surface, manmade, and vegetation, which can be useful for mapping and ground plane estimation [cattaneo2021lcdnet]. For the thing classes, instances of dynamic object classes such as cars and adult pedestrians occur most frequently. On the other end of the distribution, we have rare classes such as police and ambulances, which represent the long-tail challenging categories. Fig. 3 compares the number of moving object instances in SemanticKITTI [behley2021panoptickitti] and Panoptic nuScenes. With the exception of the bicycle class, Panoptic nuScenes contains significantly more moving object instances. Additionally, the variety of moving object classes in Panoptic nuScenes is also greater than SemanticKITTI [behley2021panoptickitti].
IV Tasks and Metrics
In this section, we describe the three benchmark tasks in Panoptic nuScenes and the corresponding evaluation metrics. For all the tasks, we only consider object instances that have more than 15 LiDAR points for evaluation. We also merge similar classes and remove rare or void classes, resulting in 10 thing and 6 stuff classes as shown in Tab. S.1 (supplementary).
IV-A LiDAR Semantic Segmentation
Bounding box-based 3D object detection and tracking are two of the main perception tasks in autonomous driving. Due to the release of many large-scale public benchmark datasets [Geiger2013IJRR, caesar2020nuscenes, sun2020scalability], the performance of methods that address these tasks has progressed rapidly. Similar to the evolution from 2D bounding box detection to semantic segmentation with images, 3D bounding box detection has evolved to point-level LiDAR segmentation, which has been attracting more and more attention in recent years. On the one hand, there are numerous use cases for LiDAR segmentation in autonomous driving, such as drivable surface segmentation, ground plane prediction, and vehicle door segmentation. On the other hand, the emergence of better and faster segmentation models enables real-time point-level semantic segmentation. To further boost the development, we set up a LiDAR semantic segmentation benchmark using the Panoptic nuScenes dataset. The goal is to predict the semantic category of each LiDAR point for both thing and stuff classes. The benchmark supports both LiDAR-only and multi-modal methods. To enable a fair comparison, the benchmark is split into the LiDAR track and the open track based on whether only LiDAR or multi-modal data is used. More details can be found at http://www.nuscenes.org/lidar-segmentation. To evaluate the performance of semantic segmentation methods in the benchmark, we primarily use the Intersection-over-Union (IoU) metric and rank methods based on the average IoU (mIoU) over all the classes. We also report the frequency-weighted IoU (fwIoU) that weights the relevance of each class depending on its point-level frequency.
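The two ranking quantities can be illustrated with a small sketch. It is not the official evaluation code: the function name is hypothetical, classes that appear in neither the groundtruth nor the prediction are skipped, and fwIoU is assumed to weight each class by its share of groundtruth points.

```python
def iou_metrics(gt, pred, num_classes):
    """Compute per-class IoU, mIoU, and fwIoU from per-point class indices."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    freq = [0] * num_classes  # number of groundtruth points per class
    for g, p in zip(gt, pred):
        freq[g] += 1
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1
            fp[p] += 1
    ious = {}
    for c in range(num_classes):
        denom = tp[c] + fp[c] + fn[c]
        if denom > 0:  # skip classes absent from both gt and prediction
            ious[c] = tp[c] / denom
    miou = sum(ious.values()) / len(ious)
    total = len(gt)
    fwiou = sum(freq[c] / total * ious[c] for c in ious)
    return ious, miou, fwiou
```

With groundtruth `[0, 0, 0, 1]` and prediction `[0, 0, 1, 1]`, class 0 has an IoU of 2/3 and class 1 an IoU of 1/2, so mIoU is 7/12 while fwIoU is 0.625 because the more frequent class 0 receives three quarters of the weight.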
IV-B LiDAR Panoptic Segmentation
While the semantic segmentation task requires predicting the semantic categories for each point, it does not distinguish different object instances. Similar to image panoptic segmentation [kirillov2019panoptic], the LiDAR panoptic segmentation task requires predicting the semantic categories for each point as well as instance IDs for thing classes. The panoptic segmentation benchmark also consists of the LiDAR track and the open track. More details can be found at http://www.nuscenes.org/panoptic. We primarily use the panoptic quality (PQ) [kirillov2019panoptic] metric to evaluate the panoptic segmentation performance in the benchmark. We individually compute the performance for thing ($\mathrm{PQ}^{\mathrm{Th}}$, $\mathrm{SQ}^{\mathrm{Th}}$, $\mathrm{RQ}^{\mathrm{Th}}$) and stuff ($\mathrm{PQ}^{\mathrm{St}}$, $\mathrm{SQ}^{\mathrm{St}}$, $\mathrm{RQ}^{\mathrm{St}}$) classes, and also report the $\mathrm{PQ}^{\dagger}$ [porzi2019seamless] metric for completeness.
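For reference, PQ as defined in [kirillov2019panoptic] factors into segmentation quality (SQ) and recognition quality (RQ), where a predicted and a groundtruth segment are matched if their IoU exceeds 0.5. The sketch below computes the three values for a single class; representing segments as sets of point indices and the function name are assumptions made for the illustration.

```python
def panoptic_quality(gt_segments, pred_segments):
    """Compute (PQ, SQ, RQ) for one class.

    gt_segments and pred_segments map a segment ID to a set of point indices.
    """
    matched_pred = set()
    iou_sum = 0.0
    tp = 0
    for g_pts in gt_segments.values():
        for pid, p_pts in pred_segments.items():
            if pid in matched_pred:
                continue
            union = len(g_pts | p_pts)
            iou = len(g_pts & p_pts) / union if union else 0.0
            if iou > 0.5:  # matches above 0.5 IoU are unique
                matched_pred.add(pid)
                iou_sum += iou
                tp += 1
                break
    fn = len(gt_segments) - tp    # unmatched groundtruth segments
    fp = len(pred_segments) - tp  # unmatched predicted segments
    denom = tp + 0.5 * fp + 0.5 * fn
    sq = iou_sum / tp if tp else 0.0
    rq = tp / denom if denom else 0.0
    return sq * rq, sq, rq
```

In a benchmark setting, the per-class values would then be averaged over the thing classes, the stuff classes, or all classes.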
IV-C LiDAR Panoptic Tracking
While panoptic segmentation resolves the point-level and instance-level prediction of the whole scene, another critical aspect for autonomous driving is the temporally consistent perception of the environment. The recently introduced panoptic tracking task [hurtado2020mopt] addresses this problem by unifying panoptic segmentation and multi-object tracking into a single coherent scene understanding task. The goal of this task is to predict temporally consistent semantic categories of each point and instance IDs for thing classes. While panoptic segmentation focuses on static frames, panoptic tracking additionally enforces temporal coherence and pixel-level association over time. Due to the lack of existing public benchmarks, we introduce a LiDAR panoptic tracking benchmark. Similar to the other tasks, the panoptic tracking benchmark also consists of the LiDAR track and open track. More details can be found at http://www.nuscenes.org/panoptic.
Evaluating LiDAR panoptic tracking requires assessing the performance of both panoptic segmentation and instance association across frames. Unlike panoptic segmentation, for which we use the well-established PQ metric, evaluating multi-object tracking (MOT) has typically been challenging. Simultaneously measuring detection and association can lead to a higher influence of either detection or association in the metric. Moreover, measuring the performance of panoptic tracking is even more challenging since it combines panoptic segmentation and MOT.
Recent works have proposed different metrics to evaluate this task. Hurtado et al. [hurtado2020mopt] propose panoptic tracking quality (PTQ) that adapts PQ to account for the incorrectly tracked objects. Weber et al. [weber2021step] analyze the drawbacks of PTQ and demonstrate that it assigns more weight to segmentation than association, as it depends on the correct segmentation to assess tracking quality. To deal with this imbalance, Aygün et al. [aygun20214d] adapt the STQ metric to LiDAR Segmentation and Tracking Quality (LSTQ). LSTQ disentangles semantics and temporal association and avoids penalizing tracking when an instance is incorrectly classified. LSTQ is computed at the point-level and evaluates if each point is assigned the right semantic class and if thing class objects are assigned the right instance IDs. However, LSTQ does not account for temporal information and is invariant to frame permutation. Importantly, it does not penalize the score when a tracking prediction is fragmented. In the context of autonomous driving, penalizing track fragmentation is more important than the correct point-level spatio-temporal prediction. A fragmented track prediction can lead to imprecise velocity estimates. Even if the segmentation is not entirely accurate and some points are misclassified, tracking the entire moving object instance is essential for decision making. Although point-level evaluation of this task enriches the granularity of segmentation, measuring the performance of tracking at the instance-level in a way that penalizes track fragmentation is more suitable for certain applications.
With this in mind and based on the requirements of a good metric identified in [weber2021step], we define a new instance-level metric with the following features:
Error-type differentiability: Allows separate analysis of tracking and panoptic segmentation components.
Penalize fragmentation of tracking: The metric is not invariant to frame permutation.
Long-term track consistency: The metric promotes long-term track consistency considering Association Recall.
Penalize ID transfer: The metric can detect association errors accounting for Association Precision.
Penalize the removal of correctly segmented instances with incorrect IDs: Metric robustness.
Table II: Results for the semantic segmentation benchmark.
| val set | (AF)2-S3Net [cheng20212] | 60.3 | 12.6 | 82.3 | 80.0 | 20.1 | 62.0 | 59.0 | 49.0 | 42.2 | 67.4 | 94.2 | 68.0 | 64.1 | 68.6 | 82.9 | 82.4 | 62.2 | 83.0 |
| test set | EfficientLPS [sirohi2021efficientlps] | 77.0 | 20.9 | 69.0 | 78.8 | 40.0 | 66.3 | 64.6 | 60.8 | 72.0 | 57.6 | 95.6 | 65.3 | 76.4 | 72.9 | 87.5 | 86.1 | 68.2 | 86.8 |
Accordingly, we introduce the Panoptic Tracking (PAT) metric based on two separable components that are explicitly related to the task and allow straightforward interpretation. PAT is computed as the harmonic mean of PQ and TQ as

$$\mathrm{PAT} = \frac{2 \times \mathrm{PQ} \times \mathrm{TQ}}{\mathrm{PQ} + \mathrm{TQ}},$$

where PQ is the panoptic quality [kirillov2019panoptic] and TQ is our proposed tracking quality that we compute as the following class-agnostic geometric mean for thing classes:

$$\mathrm{TQ}(g) = \sqrt{\left(1 - \frac{\mathrm{IDS}(g)}{N_{\mathrm{IDS}}(g)}\right) \times \mathrm{AS}(g)}, \qquad \mathrm{TQ} = \frac{1}{|G|} \sum_{g \in G} \mathrm{TQ}(g),$$

where $G$ is the set of unique instance IDs. TQ is comprised of two components. The first is the association score $\mathrm{AS}(g)$ that we define by adapting [aygun20214d] to compute it at the instance-level. We compute $\mathrm{AS}(g)$ using the true positive association (TPA), false negative association (FNA), and false positive association (FPA) sets for thing classes. More precisely, TPA is a set between a groundtruth $g$ of an object and any other object prediction $p$ with ID $id$ whose mask overlap exceeds the IoU matching threshold. Subsequently, FNA is a set of $g$ that has no matching predictions or has matching predictions with an ID other than $id$, and FPA is a set of predictions $p$ that has no matching $g$ or is matched with some $g'$ instead of $g$. $\mathrm{AS}(g)$ is then defined as

$$\mathrm{AS}(g) = \frac{1}{|g|} \sum_{p,\, |p \cap g| \neq 0} \mathrm{TPA}(p, g)\, \mathrm{IoU}_a(p, g),$$

where $\mathrm{IoU}_a$ is obtained with TPA, FNA, and FPA. This component of TQ encourages long-term track consistency and penalizes the removal of correctly segmented instances. The second component of TQ penalizes track fragmentation for $g$ with the rate of ID switches $\mathrm{IDS}(g) / N_{\mathrm{IDS}}(g)$, where $\mathrm{IDS}(g)$ is the number of ID switches and $N_{\mathrm{IDS}}(g)$ is the maximum possible number of ID switches over the track length. We define an ID switch as the case in which either of two consecutive frames of the instance does not have a matching prediction or the two frames are associated with predictions with inconsistent IDs. In Sec. V-B, we present a detailed analysis showing that our proposed metric satisfies all the outlined requirements of a good metric at the instance-level.
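A minimal sketch of how the two components could be combined is given below. It assumes that the per-instance association score and ID switch counts have already been computed, that the geometric mean is taken over the association score and one minus the ID switch rate, and that TQ averages the per-instance values; the function names and the input layout are hypothetical.

```python
import math

def tracking_quality(instances):
    """Average the per-instance tracking quality over all instances.

    instances is a list of (association_score, id_switches, max_id_switches)
    tuples, one per unique groundtruth instance ID.
    """
    scores = []
    for assoc_score, id_switches, max_id_switches in instances:
        # Rate of ID switches; tracks of length one cannot switch.
        rate = id_switches / max_id_switches if max_id_switches > 0 else 0.0
        # Geometric mean of the two components.
        scores.append(math.sqrt((1.0 - rate) * assoc_score))
    return sum(scores) / len(scores) if scores else 0.0

def panoptic_tracking(pq, tq):
    """Harmonic mean of panoptic quality and tracking quality."""
    if pq + tq == 0.0:
        return 0.0
    return 2.0 * pq * tq / (pq + tq)
```

Because a harmonic mean is dominated by the smaller of its two inputs, a method cannot reach a high PAT score with strong segmentation alone or strong tracking alone.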
V Experimental Evaluation
In this section, we present extensive quantitative comparisons and benchmarking results for the semantic segmentation, panoptic segmentation, and panoptic tracking tasks in Sec. V-A. We then present a detailed analysis of our PAT metric for panoptic tracking in Sec. V-B and ablation studies that demonstrate the utility of our proposed dataset in Sec. V-C.
V-A Baseline Results
We follow standard protocols to create/select baselines for the three tasks. In the case of LiDAR semantic segmentation, we frequently organize challenges for the task on Panoptic nuScenes and thus, report the test set results of the published challenge submissions as the baselines. However, for the validation set, we report the results of approaches from their respective published sources. For panoptic segmentation and panoptic tracking, we create two groups of baselines: end-to-end approaches and an independent combination of task-specific methods that are benchmarked on our challenge server. We trained the end-to-end approaches using the official implementations that have been publicly released by the authors after further tuning of hyperparameters to the best of our ability and report their results for both validation and test set.
V-A1 Semantic Segmentation
Results for the semantic segmentation benchmark are shown in Tab. II. We observe that stuff classes such as manmade and drivable surface are comparatively easier to segment, achieving an IoU of about 90%. Additionally, these classes generally exhibit negligible IoU degradation with variation in distance from the ego vehicle. Among the thing classes, bicycle appears to be the hardest class, as only a few approaches achieve an IoU greater than 40%. (AF)2-S3Net [cheng20212] achieves the highest bicycle class IoU of 52.2% and an overall mIoU of 78.3% due to the novel multi-branch attentive feature fusion module in the encoder and a unique adaptive feature selection module with feature map re-weighting in the decoder that enables more accurate segmentation of smaller instances.
Table III: Benchmarking results for LiDAR panoptic segmentation.
| val set | PanopticTrackNet [hurtado2020mopt] | 51.4 | 56.2 | 80.2 | 63.3 | 45.8 | 81.4 | 55.9 | 60.4 | 78.3 | 75.5 | 58.0 |
| test set | PanopticTrackNet [hurtado2020mopt] | 51.6 | 56.1 | 80.4 | 63.3 | 45.9 | 81.4 | 56.1 | 61.0 | 79.0 | 75.4 | 58.9 |
| | SPVNAS [tang2020searching] + CenterPoint [yin2021center] | 72.2 | 76.0 | 88.5 | 81.2 | 71.7 | 89.7 | 79.4 | 73.2 | 86.4 | 84.2 | 76.9 |
| | Cylinder3D++ [zhu2021cylindrical] + CenterPoint [yin2021center] | 76.5 | 79.4 | 89.6 | 85.0 | 76.8 | 91.1 | 84.0 | 76.0 | 87.2 | 86.6 | 77.3 |
| | (AF)2-S3Net [cheng20212] + CenterPoint [yin2021center] | 76.8 | 80.6 | 89.5 | 85.4 | 79.8 | 91.8 | 86.8 | 71.8 | 85.7 | 83.0 | 78.8 |
Table IV: Benchmarking results for LiDAR panoptic tracking.
| val set | PanopticTrackNet [hurtado2020mopt] | 44.0 | 43.4 | 50.9 | 51.6 | 38.5 | 58.4 | 32.3 |
| test set | PanopticTrackNet [hurtado2020mopt] | 45.7 | 44.8 | 51.6 | 51.7 | 40.9 | 58.9 | 36.7 |
| | AMVNet [liong2020amvnet] + OGR3MOT [zaech2021ogr3mot] | 63.2 | 61.7 | 61.5 | 61.9 | 64.7 | 63.6 | 59.9 |
| | Cylinder3D++ [zhu2021cylindrical] + OGR3MOT [zaech2021ogr3mot] | 62.7 | 61.7 | 61.3 | 61.6 | 63.8 | 64.0 | 59.4 |
| | (AF)2-S3Net [cheng20212] + OGR3MOT [zaech2021ogr3mot] | 62.9 | 62.4 | 60.9 | 61.3 | 64.5 | 65.0 | 59.9 |
| | EfficientLPS [sirohi2021efficientlps] + Kalman Filter | 67.1 | 63.7 | 62.3 | 63.6 | 71.2 | 67.4 | 60.2 |
V-A2 Panoptic Segmentation
To create the combination of task-specific baselines for panoptic segmentation, we merge submissions from the past nuScenes LiDAR segmentation and detection challenges. We do so by assigning unique instance IDs to all points lying within the predicted bounding boxes along with the semantic class ID of the box. Further, we apply a heuristic similar to [kirillov2019panoptic] to resolve bounding box overlaps. While combining the detection predictions, we investigated the filtering of low-confidence detection boxes based on the max F1 score per class or different threshold values. Interestingly, applying no filtering on the detected boxes achieves the highest performance for each method, improving the per-class PQ score on average compared to F1 thresholding.
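The merging step can be sketched as follows. This is an illustration under stated assumptions, not the challenge pipeline: boxes are simplified to axis-aligned min/max corners, overlaps are resolved by letting the higher-scoring detection claim a point first (one plausible reading of the heuristic in [kirillov2019panoptic]), and the function name and tuple layout are hypothetical.

```python
def merge_detection_and_segmentation(points, semantic_labels, detections):
    """Merge per-point semantic labels with detected boxes.

    detections is a list of (min_corner, max_corner, class_label, score).
    Returns (labels, instance_ids); instance ID 0 means "no instance".
    """
    labels = list(semantic_labels)
    instance_ids = [0] * len(points)
    # Process confident detections first so that they win overlaps.
    ordered = sorted(detections, key=lambda det: det[3], reverse=True)
    for inst_id, (lo, hi, cls, _score) in enumerate(ordered, start=1):
        for i, point in enumerate(points):
            if instance_ids[i] != 0:
                continue  # already claimed by a higher-scoring box
            if all(lo[k] <= point[k] <= hi[k] for k in range(3)):
                instance_ids[i] = inst_id
                labels[i] = cls  # the box class overrides the point label
    return labels, instance_ids
```

Points outside every box keep their semantic label and receive no instance ID, which is why the detector largely determines the instance segmentation quality of such a combined baseline.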
By combining 70 detection submissions and 21 semantic segmentation submissions, we generate a total of 1470 independently combined panoptic segmentation baselines as shown in Fig. S.6(a) of the supplementary material and compare them with three state-of-the-art end-to-end approaches [hurtado2020mopt, sirohi2021efficientlps, zhou2021panoptic]. For the combined baselines, we observe a general trend of combinations of higher-performing detection and semantic segmentation methods achieving higher PQ scores, with more emphasis on stronger detection approaches. This can be attributed to the fact that the detection methods are primarily responsible for the instance segmentation performance. Tab. III presents the benchmarking results for this task, where we report only the top three combination baselines that have published sub-task methods. Among the end-to-end baselines, PolarSeg-Panoptic achieves the highest PQ score, indicating that the polar bird's-eye-view representation is more suitable for LiDAR panoptic segmentation compared to other projection-based baselines. Overall, the independently combined models significantly outperform the end-to-end approaches, demonstrating the need for more research on end-to-end LiDAR panoptic segmentation methods. Thus, we believe that our challenging benchmark will pave the way for many innovative solutions.
V-A3 Panoptic Tracking
Following a similar protocol for creating baselines for panoptic tracking, we generate a total of 924 task-specific combination baselines by merging 21 submissions from the nuScenes LiDAR segmentation challenge and 44 submissions from the tracking challenge. Results from this experiment are presented in Fig. 4. Note that three thing classes (barrier, construction vehicle, and traffic cone) are not included in the tracking challenge and hence are absent in the merged panoptic tracking prediction. Essentially, the contribution of these classes is counted as zero in the relevant evaluation metrics. On the other hand, for the unified end-to-end approaches [hurtado2020mopt, aygun20214d], we train on all classes, as they will serve as the main baselines for any future work in this task. We observe that the performance of the task-specific combination baselines is positively correlated with the performance of both the semantic segmentation and tracking methods, similar to the observation in the panoptic segmentation results. Tab. IV presents the benchmarking results for this task, where we report only the top three combination baselines that have published methods. Among all the baselines, EfficientLPS with Kalman Filter achieves the highest PAT score on both the validation and test sets. This shows that initializing the initial state and noise covariance matrices in the filter with training set statistics is crucial for the filter's convergence, which yields accurate data associations. We also observe that the unified approaches are outperformed by the task-specific combination baselines. We expect that our benchmark dataset will accelerate research on unified end-to-end approaches that will surpass the performance of the independently combined baselines.
V-B Analysis of Panoptic Tracking Metrics
To evaluate the effectiveness of our proposed PAT metric for panoptic tracking, we consider five challenging tracking scenarios, each corresponding to one of the metric requirements described in Sec. IV-C. First, we analyze the capability of the metric to decouple panoptic and tracking errors. Subsequently, we analyze four challenging tracking cases in which we assume perfect segmentation for seven consecutive scans and study the performance of the tracking metrics. Tab. V illustrates the four cases in a figure, and the adjacent table presents the tracking scores comparing PAT with the previously proposed LSTQ and PTQ metrics.
V-B1 Error-Type Differentiability
In Tab. IV, we present the panoptic tracking results of three baselines with the same tracking method OGR3MOT combined with different segmentation methods: AMVNet, (AF)2-S3Net, and Cylinder3D++. The ranking of the methods differs between PAT and LSTQ. While LSTQ only considers segmentation quality with mIoU, our metric also accounts for instance identification by means of PQ. Since the panoptic tracking task aims to predict stuff and thing classes in the scene while preserving the thing IDs across frames, we consider PAT to be more informative for the task. In this case, our metric yields a higher score for the method with the better panoptic segmentation performance, which facilitates interpretability. Additionally, we observe in Tab. S.3 that the correlations of AMOTA and mIoU with PAT are 0.48 and 0.57, respectively. This represents an improvement with respect to PTQ (0.23 and 0.69), implying that, similar to LSTQ (0.55 and 0.62), PAT is a more balanced metric.
[Table V: Illustration of the four tracking scenarios (Case 1 to Case 4) with the corresponding PAT, LSTQ, and PTQ scores.]
V-B2 Track Fragmentation
In Tab. V, Pred 1 and 2 represent tracking predictions with the same number of incorrect track IDs but with permuted frames. We observe that PAT and PTQ are able to penalize the track fragmentation while LSTQ remains invariant to frame permutation. In the particular case of autonomous driving, we aim to provide a metric that encourages tracking consistency so that approaches correctly represent the dynamics of the scene.
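The effect of frame permutation can be reproduced with a toy example. The two ID sequences below are hypothetical and are not the predictions shown in Tab. V; the sketch assumes a single groundtruth track over seven scans and counts ID switches following the definition in Sec. IV-C.

```python
def count_id_switches(pred_ids):
    """Count ID switches along one groundtruth track.

    pred_ids holds the matched prediction ID per frame, or None when the
    frame has no matching prediction. A pair of consecutive frames counts
    as a switch if either frame is unmatched or the two IDs differ.
    """
    switches = 0
    for prev, curr in zip(pred_ids, pred_ids[1:]):
        if prev is None or curr is None or prev != curr:
            switches += 1
    return switches

# Two hypothetical predictions with the same per-frame IDs in a different order.
pred_a = [1, 1, 1, 1, 2, 2, 2]
pred_b = [1, 2, 1, 2, 1, 2, 1]

# An order-independent statistic cannot tell them apart ...
assert sorted(pred_a) == sorted(pred_b)
# ... whereas the number of ID switches differs substantially.
switches_a = count_id_switches(pred_a)
switches_b = count_id_switches(pred_b)
```

Here `pred_a` contains a single switch while `pred_b` switches at every one of the six frame transitions, so any metric that includes the ID switch rate separates the two predictions even though their per-frame contents are identical.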
V-B3 ID Transfer
Prediction 3 of case 2 shown in Tab. V refers to the situation in which a track finishes and a new track corresponding to a different object instance begins. Ideally, the metric should penalize predictions that do not reflect the track change, in order to account for association precision. We observe that both PAT and LSTQ correctly penalize this case, while PTQ incorrectly assesses it as the same track.
V-B4 Long-Term Tracking Consistency
Predictions 4 and 5 of case 3 shown in Tab. V exemplify a tracking error in which a single track is predicted as two different tracks. The difference between these predictions is the track duration. We aim to provide a metric that accounts for association recall, thereby encouraging long-term track consistency. The results show that the longer track prediction is evaluated with a higher score by our PAT and by LSTQ, while PTQ fails in this case.
V-B5 Void Instance Prediction
As previously discussed in [weber2021step], a suitable metric should not be susceptible to post-processing of the predictions that improves the score without improving the quality of the predictions. One such post-processing step is to ignore correctly segmented instances with incorrect track IDs during evaluation. We represent this as case 4 with predictions 6 and 7 in Tab. V, in which the instances assigned incorrect track IDs are kept and ignored for evaluation, respectively. The results show that all three metrics penalize ignoring the incorrectly tracked instances.
V-C Ablation Study
In this section, we present ablation studies that demonstrate the utility of our Panoptic nuScenes benchmark dataset.
V-C1 Influence of Pre-Training
Transfer learning by pre-training on a large-scale dataset followed by fine-tuning on the target dataset generally improves the performance of complex image segmentation tasks. Due to the lack of LiDAR panoptic segmentation datasets, transfer learning for panoptic segmentation has not been investigated thus far. To demonstrate the utility of our diverse Panoptic nuScenes dataset, we performed transfer learning experiments with SemanticKITTI, using our EfficientLPS model with a Kalman Filter. Tab. VI shows that pre-training on one dataset and fine-tuning on the other results in increased performance on the respective datasets. Overall, we observe an increase in the PAT, PQ, and mIoU scores for the model pre-trained on SemanticKITTI and fine-tuned on Panoptic nuScenes, and the conversely trained model likewise achieves an improvement in the PAT, PQ, and mIoU scores. The stronger performance gain when pre-training on SemanticKITTI is expected, as Panoptic nuScenes is sparser than SemanticKITTI. Thus, the rich feature representation captured during pre-training translates more efficiently from dense to sparse datasets. Nevertheless, the presence of two distinct datasets is highly beneficial for the research community, as each can leverage the other and, at the same time, it facilitates studying the generalization ability of proposed methods.
V-C2 Impact of Diverse Scenes in Training Set
In this section, we study the generalization ability of an approach trained on a dataset consisting of diverse scenes and objects. We train two EfficientLPS models for panoptic segmentation, one on Panoptic nuScenes and the other on SemanticKITTI. We evaluate these models on the unseen validation set of PandaSet [pandaset]. Note that PandaSet does not provide official panoptic segmentation annotations, so we generate them using an approach similar to the one described in Sec. III-A. For this experiment, we only consider the classes that are common among all three datasets. Tab. VII presents the results of this experiment. We find that the model trained on Panoptic nuScenes outperforms the one trained on SemanticKITTI on all the metrics. We observe the highest improvement in the RQ score, which can be attributed to the diverse scenes in Panoptic nuScenes that contain different types of thing classes and thereby help generalization. The improved instance segmentation capability consequently results in an overall higher panoptic quality.
|Training Dataset||Evaluation Dataset|
In this paper, we introduce Panoptic nuScenes, a large-scale public LiDAR benchmark dataset that facilitates research on semantic segmentation, panoptic segmentation, and panoptic tracking tasks. We released an evaluation server on a hidden test set for fair comparisons of the aforementioned tasks. Panoptic nuScenes alleviates the lack of a diverse urban dataset with a large number of scenes that contain many moving objects and point-wise annotations for the aforementioned tasks. We performed an extensive analysis of existing panoptic tracking metrics and proposed the novel PAT metric that addresses the shortcomings of existing metrics. We presented exhaustive benchmarking results of several baselines for all three tasks in Panoptic nuScenes. The results demonstrate the need for further research on end-to-end learning methods that effectively address these tasks in a coherent manner. We believe that this work will pave the way for novel research on scene understanding of dynamic urban environments.
This work was partly funded by Eva Mayr-Stihl Stiftung. The Panoptic nuScenes dataset was annotated by Scale.ai and we thank Jon Wilfong, Gerard Roy, and Galen Bertozzi. We thank Serene Chen at Motional for data inspection and quality control.
S.1 Additional Dataset Details
This section presents further details about the Panoptic nuScenes dataset and annotations.
Dynamic and Diverse Scenes: Panoptic nuScenes was primarily collected in dense urban environments with many dynamic agents. Such scenes include those near intersections and construction sites, which are of high traffic density and have the potential for interesting driving situations (e.g. jaywalkers, lane changes, turning). The scenes are also diverse in terms of, among others, geographical location (i.e. left-hand versus right-hand drive), weather, and lighting conditions. Fig. S.1 shows example panoptic segmentation annotations overlaid on the camera image.
Instance Statistics: We present additional analysis of object instances that are present in the Panoptic nuScenes dataset, both from a scan-wise and sequence-wise perspective. Fig. S.3 shows the distribution of non-moving and moving scan-wise instances for various semantic classes. For common classes such as adult and car, we have 152k and 114k moving scan-wise instances, respectively. For rarer classes such as police and construction vehicles, there are 882 and 298 moving scan-wise instances, respectively, which is a non-trivial amount.
Fig. S.4 further shows the distribution of the track lengths per sequence-wise instance for each semantic class. The median track lengths range from 9.5 frames to 39 frames across classes. For some of the less frequent classes such as wheelchair and construction vehicle, the median track length is on the higher end, which might be due to these being generally slower moving classes. Fig. S.5 shows an overview of the track lengths for instances stratified by class. With the non-trivial amount of short, medium, and long tracks, Panoptic nuScenes provides diversity in object track length and across a wide variety of classes. This challenges panoptic tracking approaches to be able to track objects for a relatively sustained period of time, while also being able to handle situations when an object only appears briefly.
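As a rough illustration of how such per-class statistics can be derived, the following sketch computes the median track length per class from a list of annotations. The tuple layout, the instance names, and the definition of track length as the number of distinct frames in which an instance is annotated are assumptions made for this example, not the format or definition used by the dataset.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple

# Hypothetical annotation record: (semantic class, instance identifier, frame index).
Observation = Tuple[str, str, int]


def median_track_length(observations: Iterable[Observation]) -> Dict[str, float]:
    """Median number of annotated frames per instance, grouped by semantic class."""
    frames_per_instance = defaultdict(set)
    for cls, instance, frame in observations:
        frames_per_instance[(cls, instance)].add(frame)

    lengths_per_class = defaultdict(list)
    for (cls, _instance), frames in frames_per_instance.items():
        lengths_per_class[cls].append(len(frames))

    return {cls: median(lengths) for cls, lengths in lengths_per_class.items()}


# Toy data: three car tracks (lengths 3, 2, 1) and two adult tracks (lengths 4, 2).
toy = (
    [("car", "car_a", f) for f in range(3)]
    + [("car", "car_b", f) for f in range(2)]
    + [("car", "car_c", 3)]
    + [("adult", "adult_x", f) for f in range(4)]
    + [("adult", "adult_y", f) for f in (2, 3)]
)
print(median_track_length(toy))
```

With an even number of tracks in a class, the median is the mean of the two middle lengths, which is how a non-integer median can arise.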
Panoptic nuScenes Panoptic Segmentation and Tracking Challenges: In Panoptic nuScenes, each LiDAR point is annotated as one of 32 semantic classes. However, for the Panoptic nuScenes panoptic segmentation and tracking challenges, we merge similar classes and remove rare or void classes, resulting in 10 thing and 6 stuff classes as discussed in Sec. IV. Table S.1 shows the class mapping from the general Panoptic nuScenes classes to the classes used in the panoptic segmentation and tracking challenges. In addition, the rightmost column indicates the percentage of points that fall into overlapping bounding boxes for each thing class. The panoptic labels for these points are assigned to noise. Fig. S.2 presents visualizations of groundtruth annotation examples from our Panoptic nuScenes dataset for each of the three challenges.
|General Panoptic Class||Challenge Class||Thing/Stuff||Overlap Percent (%)|
|traffic cone||traffic cone||thing||0.07|
|driveable surface||driveable surface||stuff||-|
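The handling of points in overlapping bounding boxes can be sketched as follows. This is a simplified illustration under stated assumptions rather than the annotation pipeline itself: the boxes are axis-aligned instead of oriented, the label values `NOISE` and `BACKGROUND` are hypothetical, and the overlap percentage is computed over all points that lie inside at least one box.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (min corner, max corner); axis-aligned for simplicity

NOISE = -1       # hypothetical label for points that fall into several boxes
BACKGROUND = 0   # hypothetical label for points outside every box


def inside(point: Point, box: Box) -> bool:
    """True if the point lies within the box along all three axes."""
    lo, hi = box
    return all(l <= p <= h for p, l, h in zip(point, lo, hi))


def label_points(points: Sequence[Point], boxes: Sequence[Box]) -> List[int]:
    """Label each point with the 1-based index of its box, or NOISE on overlap."""
    labels = []
    for point in points:
        hits = [i + 1 for i, box in enumerate(boxes) if inside(point, box)]
        if len(hits) == 1:
            labels.append(hits[0])
        elif len(hits) > 1:
            labels.append(NOISE)
        else:
            labels.append(BACKGROUND)
    return labels


def overlap_percent(labels: Sequence[int]) -> float:
    """Percentage of in-box points that were assigned to NOISE."""
    in_box = [label for label in labels if label != BACKGROUND]
    if not in_box:
        return 0.0
    return 100.0 * sum(1 for label in in_box if label == NOISE) / len(in_box)


boxes = [((0.0, 0.0, 0.0), (2.0, 2.0, 2.0)), ((1.0, 1.0, 0.0), (3.0, 3.0, 2.0))]
points = [(0.5, 0.5, 1.0), (1.5, 1.5, 1.0), (2.5, 2.5, 1.0), (5.0, 5.0, 1.0)]
labels = label_points(points, boxes)
print(labels, overlap_percent(labels))
```

In the toy scene, only the second point lies in both boxes, so it is the only one labeled as noise.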
S.2 Additional Panoptic Segmentation Results
In Table S.2, we present a comparison of per-class panoptic segmentation results on the Panoptic nuScenes dataset. We compare the top three of our combination baselines that have published task-specific methods and the end-to-end baselines. Across all the classes, it can be seen that the independently combined baselines outperform the end-to-end methods. This is more pronounced for the thing classes compared to the stuff classes. Among the thing classes, the average gap between the best-in-class result of the independently combined baselines compared to that of the end-to-end methods is 20.8 PQ, while for the stuff classes, the average gap is 4.6 PQ. This is likely due to the role of the stronger task-specific detection methods that are used for the instance segmentation of the thing classes.
|SPVNAS [tang2020searching] + CenterPoint [yin2021center]||78.1||62.3||59.4||91.2||46.5||83.2||92.9||90.7||49.0||63.4||97.4||49.4||72.6||54.5||85.4||79.9||72.2|
|Cylinder3D++ [zhu2021cylindrical] + CenterPoint [yin2021center]||79.2||70.5||63.8||92.8||56.8||83.3||93.6||92.8||66.1||69.0||97.6||55.1||74.1||57.1||87.0||85.1||76.5|
|(AF)2-S3Net [cheng20212] + CenterPoint [yin2021center]||80.8||79.4||67.3||91.7||63.8||86.4||93.8||93.7||68.7||72.9||97.1||45.6||69.4||51.5||84.6||82.3||76.8|
S.3 Correlation Analysis
In this section, we extend the analysis of the combination baselines to study the relationships between (i) panoptic segmentation performance and the component semantic segmentation and detection performance, and (ii) panoptic tracking performance and the component semantic segmentation and tracking performance.
Panoptic Segmentation: As discussed in Sec. V-A, panoptic segmentation can be achieved by combining the submissions of LiDAR semantic segmentation and LiDAR detection from the nuScenes challenges. We generate 1470 of these combinations, and Fig. S.6(a) gives an overview of the resulting panoptic segmentation performance as measured by the PQ score. Within each column, a perceptible change in color can be observed, implying that for the same LiDAR semantic segmentation method, the change in detection method influences the PQ. However, across each row, the change in color is insignificant, indicating that varying the LiDAR semantic segmentation method does not impact PQ as much as varying the LiDAR detection method.
|(a) Panoptic segmentation performance (PQ)||(b) Panoptic tracking performance (LSTQ)|
Panoptic Tracking: In Sec. V-B, we compare our proposed metric, PAT, with other existing metrics that have been proposed for panoptic tracking. Here, we provide additional analysis for the LSTQ metric with respect to the mIoU and AMOTA metrics. Fig. S.6(b) gives a qualitative overview of how LSTQ varies across different panoptic tracking baselines, which we generate by combining LiDAR semantic segmentation and tracking methods. As per Sec. V-A, we use mIoU and AMOTA as performance measures for the LiDAR semantic segmentation methods and the tracking methods, respectively. From the 924 independently combined panoptic tracking baselines, we can see that there is generally a perceptible change in color across each row and column. This implies that LSTQ takes into account both the LiDAR semantic segmentation performance and the tracking performance. In Table S.3, we show more quantitatively how various panoptic tracking metrics consider LiDAR semantic segmentation and tracking. PTQ demonstrates more bias towards LiDAR semantic segmentation, as observed from the significantly stronger correlation between mIoU and PTQ in contrast to that between AMOTA and PTQ. On the other hand, LSTQ and PAT are more balanced metrics, with a smaller disparity in the respective correlations.
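A correlation analysis of this kind can be sketched as below. The type of correlation coefficient is not specified in this section, so the Pearson coefficient is an assumption here. The scores are hypothetical toy values, and the composite score (a geometric mean of the two components) merely stands in for a panoptic tracking metric.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical component scores of four combined baselines (not results from the paper).
miou = [0.60, 0.70, 0.80, 0.90]
amota = [0.50, 0.30, 0.60, 0.40]
# Stand-in composite score; a real analysis would use PAT, LSTQ, or PTQ values.
composite = [math.sqrt(m * a) for m, a in zip(miou, amota)]

corr_miou = pearson(miou, composite)
corr_amota = pearson(amota, composite)
print(f"corr(mIoU, composite)  = {corr_miou:.2f}")
print(f"corr(AMOTA, composite) = {corr_amota:.2f}")
print(f"disparity              = {abs(corr_miou - corr_amota):.2f}")
```

A small disparity between the two coefficients indicates that the composite score responds to both components to a similar degree, which is the sense in which a metric is described as balanced above.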