Spatio-temporal interpretation of raw sensory data is important for autonomous vehicles to understand how to interact with the environment and perceive how trajectories of moving agents evolve in 3D space and time.
In the past, different aspects of dynamic scene understanding such as semantic segmentation [Everingham10IJCV, dai17cvpr, Milioto19IROS, Thomas19ICCV, zhang20CVPR, tang20eccv], object detection [Felzenszwalb08CVPR, Ren15NIPS, Lang19CVPR, Shi19CVPR, shi20cvpr, shi20pami], instance segmentation [He17ICCV], and multi-object tracking [Leibe08TPAMI, Bergmann19ICCV, Osep17ICRA, Braso20CVPR, Weng20iros, qi20cvpr, poschmann20arxiv] have been tackled independently. The developments in these fields were largely fueled by the rapid progress in deep learning-based image [Krizhevsky12NIPS] and point-set representation learning [Qi17CVPR_pointnet, Qi17NIPS, Thomas19ICCV], together with contributions of large-scale datasets, benchmarks, and unified evaluation metrics [Lin14ECCV, Everingham10IJCV, Geiger12CVPR, dendorfer20ijcv, Cordts16CVPR, Voigtlaender19CVPR, Gupta19CVPR, Behley19ICCV, dai17cvpr, Caesar20CVPR, sun20CVPR]. In the pursuit of image-based holistic scene understanding, recent community efforts have been moving towards convergence of tasks, such as multi-object tracking (MOT) and segmentation [Voigtlaender19CVPR, Yang19ICCV], and semantic and instance segmentation, i.e., panoptic segmentation [Kirillov19CVPR]. Recently, panoptic segmentation was extended to the video domain [kim20cvpr]. Here, the dataset, task formalization, and evaluation metrics focused on interpreting short and sparsely labeled video snippets in 3D (2D image+time) in an offline setting. Autonomous vehicles, however, need to continuously interpret sensory data and localize objects in a 4D continuum.
Tackling sequence-level LiDAR panoptic segmentation is a challenging problem, since state-of-the-art methods [Thomas19ICCV] usually need to downsample even single-scan point clouds to satisfy the memory constraints. Therefore, the common approach in (3D) multi-object tracking is detecting objects in individual scans, followed by temporal association [Frossard18ICRA, Weng20iros, weng20CVPR], often guided by a hand-crafted motion model. In this paper, we take a substantially different approach, inspired by the unified space-time treatment philosophy. We form overlapping 4D volumes of scans (see Fig. 1) and, in parallel, assign to 4D points a semantic interpretation while grouping object instances jointly in 4D space-time.
Importantly, these 4D volumes can be processed in a single network pass, and the temporal association is resolved implicitly via clustering. This way, we retain inference efficiency while resolving long-term association between overlapping volumes based on the point overlap, alleviating the need for explicit data association.
For the evaluation, we introduce a point-centric higher-order tracking metric, inspired by recent metrics for multi-object tracking [luiten20ijcv] and concurrent work on video panoptic segmentation [weber2021step], which differ from the available metrics [Kirillov19CVPR, Bernardin08JIVP] that overemphasize the recognition part of the tasks. Our metric consists of two intuitive terms, one measuring the semantic aspect and the other the spatio-temporal association aspect of the task. Together with the recently proposed SemanticKITTI [Behley19ICCV, Behley20arxiv] dataset, this gives us a test bed to analyze our method and compare it with existing LiDAR semantic/instance segmentation [Lang19CVPR, Thomas19ICCV, Weng20iros, milioto2020iros] approaches, adapted to the sequence-level domain.
In summary, our contributions are: (i) we propose a unified space-time perspective to the task of 4D LiDAR panoptic segmentation, and pose detection/segmentation/tracking jointly as point clustering, which can effectively leverage the sequential nature of the data and process several LiDAR scans while maintaining memory efficiency; (ii) we adopt a point-centric evaluation protocol that fairly weights semantic and association aspects of this task and summarizes the final performance with a single number; (iii) we establish a test bed for this task, which we use to thoroughly analyze our model’s performance and the existing LiDAR panoptic segmentation methods used in conjunction with a tracking-by-detection mechanism. Our code, experimental data (https://github.com/mehmetaygun/4d-pls), and benchmark (http://bit.ly/4d-panoptic-benchmark) are publicly available.
2 Related Work
Our work is related to tasks covering different aspects of scene perception, such as semantic segmentation, object detection/segmentation, and tracking. In the following, we review related methods and tasks.
Datasets and Metrics. The growing interest in autonomous vehicles has sparked interest in scene perception using LiDAR sensors. Here the progress has been fueled by advances in deep learning on point sets [Qi17CVPR_pointnet, Qi17NIPS, hua18cvpr, komarichev19cvpr, lan19cvpr, Wu18ICRA, Wu19ICRA, Tatarchenko18CVPR, Thomas19ICCV, milioto2020iros] and datasets with standardized benchmarks for 3D semantic/instance segmentation [Geiger12CVPR, Behley19ICCV] and 3D object detection and multi-object tracking [Caesar20CVPR, sun20CVPR]. This confirms the importance of advancing both spatial and temporal aspects of mobile robot perception. Our proposed task formulation and evaluation metric is the first that unifies both aspects to the best of our knowledge.
Recent community efforts in the field of image-based perception have been moving towards the convergence of different tasks. For instance, Kirillov et al. [Kirillov19CVPR] proposed to unify semantic and instance segmentation, which they termed panoptic segmentation, together with an evaluation metric, the panoptic quality (PQ). Others proposed to tackle multi-object tracking and instance segmentation (MOTS) in videos jointly [Voigtlaender19CVPR, Yang19ICCV]. Moreover, [kim20cvpr] recently extended panoptic segmentation to videos; however, the dataset and the evaluation metrics focus on interpreting short and sparse video snippets offline. This is reflected in the evaluation metric, which is essentially PQ evaluated based on the 3D IoU [Yang19ICCV] and averaged over temporal windows of varying sizes to compensate for the fact that the difficulty of the task depends on the sequence length. This setting is not suitable for autonomous vehicles that need to interpret raw sensor data continuously. Hurtado et al. [hurtado20arxiv] propose to combine ideas from MOTA [Bernardin08JIVP] and PQ [Kirillov19CVPR] by adding a penalty related to ID switches to the PQ. Nonetheless, both PQ and MOTA were criticized [porzi19cvpr, luiten20ijcv], and the proposed evaluation inherits all of their well-known issues.
In this paper, we propose a different approach and bring ideas recently introduced in the context of benchmarking vision-based multi-object tracking [luiten20ijcv] to the domain of sequential LiDAR semantic and instance segmentation. Together with the metric, we also propose an approach that operates directly on spatio-temporal point clouds providing object instances in space and time.
Point Cloud Segmentation. Semantic segmentation or point-wise classification of point clouds is a well-known research topic [anguelov05cvpr]. Traditionally, it was solved using feature extractors in combination with traditional classifiers [agrawal09icra] and conditional random fields to enforce label consistency of neighboring points [triebel06icra, munoz09cvpr, xiong11icra]. Availability of large-scale datasets, such as S3DIS [armeni16cvpr], Semantic3D [hackel2017isprs], and recently SemanticKITTI [Behley19ICCV], made it possible to also investigate end-to-end pipelines [landrieu18cvpr, Milioto19IROS, Thomas19ICCV, hu20cvpr, zhang20CVPR, shi20CVPRSpSequenceNet, Qi17NIPS, Qi17CVPR_pointnet, milioto2020iros]. Similar to recent trends in RGB-D [jiang20cvpr_pointgroup, Engelmann20CVPR] and LiDAR segmentation [wong20corl], our method performs bottom-up point grouping in a data-driven fashion. However, different from the aforementioned, we perform grouping in 3D space and time. We use the backbone by [Thomas19ICCV] that applies deformable point convolutions directly on the point clouds. In our case, this empirically performed better than backbones specifically designed for point sequences [choy20194d, shi20CVPRSpSequenceNet].
Multi-Object Tracking and Segmentation. The majority of vision-based MOT methods follow tracking-by-detection [Okuma04ECCV]. Here the idea is to first run a pre-trained object detector independently in each video frame and then associate detections across time. In the past, there was a strong focus on developing robust and, preferably, globally optimal methods for data association [Zhang08CVPR, Leal11ICCVW, Pirsiavash11CVPR, Brendel11CVPR, ButtCollins13CVPR]. Recent data-driven trends mainly focus on learning to associate detections [LealTaixe16CVPRW, Son17CVPR] or to regress targets [Bergmann19ICCV], often in combination with end-to-end learning [peng19ICCV, xu19cvpr, Braso20CVPR].
In the realm of robot vision, it is critical to localize object trajectories in 3D space and time. Early methods localized monocular detections in 3D, e.g., using stereo [Leibe08TPAMI, Osep17ICRA, Ess09PAMI], or performed tracking in a category-agnostic manner by first performing bottom-up segmentation based on spatial proximity followed by point-segment association [Teichman12IJRR, Held14RSS]. Recently, LiDAR-based MOT has become a popular task, thanks to the emergence of reliable 3D object detectors [Shi19CVPR, Lang19CVPR] and LiDAR-centric datasets [Caesar20CVPR, sun20CVPR]. Weng et al. [Weng20iros] demonstrated that even simple methods based on linear assignment and constant-velocity motion models can perform surprisingly well when object detections are localized reliably in 3D space. Our method departs from 3D object detection in the spatial domain, followed by the detection association in the temporal domain. Instead, we follow recent advances in image [Neven19CVPR, cheng20cvpr] and video instance segmentation [Athar20eccv]. We localize possible object instance centers within a 4D volume and associate points to estimated centers in a bottom-up manner, while a semantic branch assigns semantic classes to points.
In this paper, we propose a method and a metric for the 4D Panoptic LiDAR Segmentation task, which tackles LiDAR semantic segmentation and instance segmentation jointly in the spatial and temporal domain. Given a sequence of LiDAR scans, the goal of this task is to predict for each 3D point (i) a semantic label for both stuff and thing classes, and (ii) a unique, identity-preserving object instance ID that should persist over the whole sequence.
3.1 4D Panoptic LiDAR Segmentation: 4D-PLS
In this work, we take a different path compared to the tracking-by-detection paradigm to video-instance and video-panoptic segmentation [Voigtlaender19CVPR, Yang19ICCV, kim20cvpr, hurtado2020mopt]. We pose 4D panoptic segmentation as two joint processes. The first one is responsible for point grouping in the 4D continuum using clustering, while the second assigns a semantic interpretation to each point.
We provide an overview of our method in Fig. 2. In a nutshell, we first form 4D point clouds from several consecutive LiDAR scans. In parallel, within a single network pass, we localize the most likely object centers (inspired by point-based tracking methods by [zhou20ECCV, cheng20cvpr]) in the sequence (objectness map), assign semantic classes to points (semantic map), and compute per-point embeddings (embedding map) and variances (variance map).
The clustering can be performed efficiently by evaluating the probability of each 4D point belonging to a certain “seed” point, similar to what is done in the context of image and video segmentation [Neven19CVPR, Athar20eccv]. Finally, to associate 4D sub-volumes, we examine point intersections between overlapping point volumes.
4D Volume Formation. During inference and training, we form overlapping 4D point cloud volumes in an online setting. In particular, for a given scan and temporal window size, we align the point clouds within the temporal window using ego-motion estimates provided by a SLAM approach [behley18rss]. Our experiments in Sec. 4.1 reveal that processing multiple point clouds significantly improves spatial and temporal point association performance. However, due to the linear growth in memory requirements, stacking point clouds along the temporal dimension is prohibitively expensive. To overcome this issue, we build on the intuition that thing classes are most critical for a stable temporal association, since these classes correspond to potentially moving objects. As we operate in an online setting, where the past scans have already been processed, we can sample points from earlier scans that belong to thing classes or lie near an object center.
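The alignment step can be sketched as follows. This is a minimal illustration, assuming each scan comes with a 4x4 sensor-to-world pose stored as nested lists, and that the window covers the current scan plus the scans directly before it. The function names, the pose convention, and the exact window bounds are assumptions made for the example, not details taken from the method description.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Pose = Sequence[Sequence[float]]  # 4x4 row-major rigid transform (sensor -> world)


def invert_rigid(pose: Pose) -> List[List[float]]:
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    r_t = [[pose[c][r] for c in range(3)] for r in range(3)]  # transpose of R
    t = [pose[r][3] for r in range(3)]
    new_t = [-sum(r_t[r][c] * t[c] for c in range(3)) for r in range(3)]
    return [r_t[r] + [new_t[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def apply(pose: Pose, p: Point) -> Point:
    """Apply a 4x4 rigid transform to a 3D point."""
    return tuple(sum(pose[r][c] * p[c] for c in range(3)) + pose[r][3] for r in range(3))


def form_4d_volume(scans, poses, t, window):
    """Express the last `window` scans (ending at scan t) in the frame of scan t.

    Returns a list of (x, y, z, time) tuples; `time` is the scan index.
    """
    world_to_current = invert_rigid(poses[t])
    volume = []
    for k in range(max(0, t - window + 1), t + 1):
        for p in scans[k]:
            x, y, z = apply(world_to_current, apply(poses[k], p))
            volume.append((x, y, z, float(k)))
    return volume
```

In a memory-constrained setting, the points of the earlier scans would be subsampled before stacking rather than copied in full.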
We model object instances via Gaussian probability distributions. Given an estimate of the object center, i.e., a clustering “seed” point, we can assign points to their respective instance by evaluating each point under the Gaussian pdf based on the points’ embedding vectors. The estimated centers do not need to correspond to exact object centers but are merely used to initiate the clustering. Thus, our approach is in practice fairly robust to occlusions and cross-time view changes. We note that the Gaussian assumption is only valid for shorter temporal windows. In particular, given a point $p_i$ representing the instance center and its embedding vector $e_i$, and a query point $p_j$ with its embedding vector $e_j$, we can evaluate the probability of point $p_j$ belonging to its center “seed” point $p_i$ as:
$$\hat{p}_{ij} = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}\,(e_i - e_j)^{\top}\,\Sigma_i^{-1}\,(e_i - e_j)\right), \tag{1}$$
where $D$ is the dimensionality of the embedding vectors and $\Sigma_i$ is a diagonal matrix constructed using the variance prediction $\sigma_i$ of point $p_i$. We concatenate coordinate values with the learned point embedding vectors to combine spatial and temporal coordinates with learned embeddings. We account for these additional dimensions during the training of the variance map.
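As an illustration of the evaluation under a Gaussian pdf with diagonal covariance, the following sketch computes the density of a query vector around a seed vector. The function names and the plain-list data layout are hypothetical. The density is evaluated in log space only for numerical convenience.

```python
import math


def mixed_vector(xyzt, embedding):
    """Concatenate 4D coordinates (x, y, z, t) with a learned embedding vector."""
    return list(xyzt) + list(embedding)


def seed_probability(seed_vec, query_vec, seed_var):
    """Density of query_vec under N(seed_vec, diag(seed_var)).

    seed_var holds one positive variance per dimension, as predicted for the seed.
    """
    dim = len(seed_vec)
    mahalanobis = sum((q - s) ** 2 / v for s, q, v in zip(seed_vec, query_vec, seed_var))
    log_norm = 0.5 * (dim * math.log(2.0 * math.pi) + sum(math.log(v) for v in seed_var))
    return math.exp(-0.5 * mahalanobis - log_norm)
```

A point whose mixed vector lies closer to the seed (relative to the predicted variances) receives a higher value, which is what the clustering relies on.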
Network and Training. To perform such clustering, we need to identify the most likely instance centers, i.e., “seed” points, in a 4D point cloud. We also need variance predictions for each point to evaluate probability scores during clustering, and a posterior over all semantic classes.
We estimate all these quantities using an encoder-decoder architecture that operates directly on the 4D point cloud. The encoder network is based on the KP-Conv [Thomas19ICCV] backbone that uses deformable point convolutions. The decoder predicts point-wise feature embeddings using consecutive point convolutions. On top of the encoder, we add an object centerness decoder, a point variance decoder, and a semantic decoder. We train our network in an end-to-end manner and in an online fashion.
To train the semantic decoder, we use a cross-entropy classification loss. As the semantic classes are highly imbalanced, we sample points to ensure that the probability of sampling a point from a certain class is roughly uniform.
To learn the point centerness and point variance, we use three different losses. First, we impose the mean squared error (MSE) loss to train the object centerness decoder. Due to the sparsity of the LiDAR signal, there will generally be no points near the actual object centers, unlike in the image and video domain [Neven19CVPR, Athar20eccv]. Therefore, instead of predicting per-point centerness, we predict the proximity of the point to its instance center. We compute for each point $p_i$ the point objectness $o_i$ as the Euclidean distance between the point and its instance center, i.e., the mean point of all instance points, normalized to the range of $[0, 1]$. This objectness is then compared to the regressed objectness score $\hat{o}_i$ over all $N$ points:
$$L_{obj} = \frac{1}{N}\sum_{i=1}^{N} \left(o_i - \hat{o}_i\right)^2.$$
Since we want the embeddings of instances to form clusters in the spatio-temporal domain, we introduce our instance loss. It is defined over all points and all instances of a 4D point cloud, where the point-to-instance probability is evaluated under the Gaussian pdf (Eq. 1) using the point’s embedding together with the instance’s embedding and variance. In addition, we employ a variance smoothness loss, similar to [Athar20eccv, Neven19CVPR], for training the variance decoder. In summary, we use four different losses to train our network in an end-to-end manner: the classification, objectness, instance, and variance smoothness losses.
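The objectness targets described above can be sketched as follows. The sketch assumes that instance id 0 marks points without an instance, that distances are normalized per instance by the largest distance, and that the score is inverted so that points closest to the instance mean receive the highest value. These three choices, as well as the function names, are assumptions made for illustration.

```python
import math


def objectness_targets(points, instance_ids):
    """Per-point proximity to the instance mean, in [0, 1] (1 = at the mean)."""
    groups = {}
    for idx, inst in enumerate(instance_ids):
        if inst > 0:  # assumption: id 0 means "no instance"
            groups.setdefault(inst, []).append(idx)
    targets = [0.0] * len(points)
    for idxs in groups.values():
        dim = len(points[idxs[0]])
        center = [sum(points[i][k] for i in idxs) / len(idxs) for k in range(dim)]
        dists = [math.dist(points[i], center) for i in idxs]
        max_d = max(dists) or 1.0  # single-point instances: avoid division by zero
        for i, d in zip(idxs, dists):
            targets[i] = 1.0 - d / max_d
    return targets


def mse(pred, target):
    """Mean squared error between regressed and target objectness."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
```

Because LiDAR returns lie on object surfaces, the mean point itself is usually not part of the scan. The regression target therefore ranks existing points by how close they are to it.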
Inference. We resolve point-to-instance associations in two stages, first within a processed 4D volume, and then across volumes. First, based on the point cloud centerness map, we select the point with the highest objectness score. Then, we evaluate the probabilities (Eq. 1) for all candidate points and assign them to the cluster if the probability exceeds a threshold. The assigned points are then removed from the candidate pool. We repeat these steps until the next highest objectness score is below a certain threshold. To transfer identities across processed 4D volumes, we perform cross-volume association greedily based on the overlap score, taking all scans into account. When the overlap is below a threshold, we assign a new ID.
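The two inference stages can be sketched as below. To keep the threshold scale-free, the sketch scores membership with an unnormalized Gaussian (value 1 at the seed), and it measures overlap as the number of shared points. Both choices, all threshold values, and the function names are assumptions for illustration rather than the settings used in the experiments.

```python
import math


def membership(seed_vec, seed_var, query_vec):
    """Unnormalized Gaussian score in (0, 1]; equals 1 at the seed itself."""
    return math.exp(-0.5 * sum((q - s) ** 2 / v
                               for s, q, v in zip(seed_vec, query_vec, seed_var)))


def cluster_volume(vectors, variances, objectness, min_objectness=0.5, min_prob=0.5):
    """Greedy seed-based clustering; returns one local id per point (0 = unassigned)."""
    labels = [0] * len(vectors)
    candidates = set(range(len(vectors)))
    next_id = 1
    while candidates:
        seed = max(candidates, key=lambda i: objectness[i])
        if objectness[seed] < min_objectness:
            break  # no sufficiently confident seed is left
        members = [i for i in candidates
                   if membership(vectors[seed], variances[seed], vectors[i]) > min_prob]
        for i in members:
            labels[i] = next_id
        candidates.difference_update(members)  # the seed is always among the members
        next_id += 1
    return labels


def associate(prev_ids, curr_ids, next_track_id, min_overlap=1):
    """Greedy identity transfer between two overlapping volumes.

    prev_ids / curr_ids map a shared point key to the track id of the previous
    volume or the local instance id of the current volume, respectively.
    Returns (local id -> track id, next unused track id).
    """
    overlap = {}
    for key, local in curr_ids.items():
        track = prev_ids.get(key)
        if track is not None:
            overlap[(local, track)] = overlap.get((local, track), 0) + 1
    mapping, used = {}, set()
    for (local, track), count in sorted(overlap.items(), key=lambda kv: -kv[1]):
        if count >= min_overlap and local not in mapping and track not in used:
            mapping[local] = track
            used.add(track)
    for local in sorted(set(curr_ids.values())):
        if local not in mapping:  # insufficient overlap: start a new track
            mapping[local] = next_track_id
            next_track_id += 1
    return mapping, next_track_id
```

Since consecutive volumes share scans, the shared points carry the identities forward without a motion model or an explicit assignment solver.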
3.2 Measuring Performance
The central question when proposing a novel task and benchmark is how to evaluate and compare different methods. Preferably, we would like to summarize performance with a single number to rank the methods while retaining the capability of looking at different aspects of the task.
3.2.1 Existing Evaluation Measures
To motivate our approach to evaluation, we first briefly discuss established metrics for image-based panoptic segmentation (PQ [Kirillov19CVPR]) and multi-object tracking and segmentation (MOTSA/MOTSP [Bernardin08JIVP, Voigtlaender19CVPR]). Then, we discuss two recently proposed extensions of PQ to the temporal domain and argue why we do not promote their adaptation for the task of 4D LiDAR panoptic segmentation.
Segment-centric Evaluation. PQ and MOTSA/MOTSP are instance-centric evaluation metrics. Both first determine a unique matching between sets of ground-truth objects and model predictions for each frame individually to determine true positives (TPs), false positives (FPs), and false negatives (FNs). Both metrics provide measures for the segmentation and recognition aspects of the task. The segmentation quality (SQ) term of PQ and MOTSP sums the IoUs over the set of TPs and normalizes the result by the size of the TP set. The recognition quality (RQ) term of PQ is expressed as the F1 score. Similarly, MOTSA combines detection errors (FNs and FPs) with an ID switch (IDSW) penalty in a single term. An IDSW occurs when a track is lost and the tracker assigns a new identity to a tracked object. This is the only term that takes the temporal aspect of the task into account.
A criticism of PQ is that it over-emphasizes the importance of very small segments, and that stuff classes can be difficult to match [porzi19cvpr]. MOTSA overemphasizes the detection compared to the association aspect, and it is nonintuitive, since the score can be negative and is unbounded, as can be seen in Sec. 4. Furthermore, the influence of the ID switches on the final score depends on the frame rate, and MOTSA does not reward trackers that recover from incorrect associations. Importantly, both metrics are sensitive to the selection of the matching threshold. Thus, instances that slightly miss this threshold will cause both a FN and a FP. This is not the case for pixel- or point-centric metrics used for evaluating semantic segmentation. The standard mean IoU (mIoU) metric [Everingham10IJCV] computes the sets of TPs, FPs, and FNs on a per-pixel (or per-point) basis, effectively bypassing the segment matching.
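For reference, the two segment-centric scores can be written down in a few lines. The sketch below assumes that segment matching has already been done and only the resulting counts are available. The formulas follow the commonly used definitions of PQ and MOTSA, and the function names are illustrative.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for one class.

    matched_ious holds the IoU of every matched (true positive) segment pair.
    """
    tp = len(matched_ious)
    if tp + num_fp + num_fn == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)  # F1 score over segments
    return sq * rq, sq, rq


def motsa(num_tp, num_fp, num_idsw, num_gt):
    """MOTSA = (TP - FP - IDSW) / |GT|; bounded above by 1 but not bounded below."""
    return (num_tp - num_fp - num_idsw) / num_gt
```

With more false positives than true positives, for example `motsa(10, 25, 2, 20)`, the score becomes negative, which is the nonintuitive behavior discussed above. A segment that narrowly misses the matching threshold also hurts twice, since it increases both `num_fp` and `num_fn`.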
PQ Extensions. Recent work [kim20cvpr] proposes video panoptic quality, a variant of PQ for the sequential domain. Different from PQ, gt-to-prediction mapping is established based on the sequential IoU matching criterion, proposed in the context of video instance segmentation [Yang19ICCV]. As objects are not present throughout the clip and the difficulty of the task critically depends on the length of the temporal window, the final metric is averaged over varying temporal window sizes. This is suitable for the setting defined in [kim20cvpr], where the task is to evaluate short, sparsely labeled video snippets. However, this approach does not scale to real-world sequences of arbitrary length. Another extension to PQ, panoptic tracking quality (PTQ) [hurtado2020mopt] combines MOTA and PQ by adding an ID penalty to the PQ measure. This approach inherits issues from both PQ and MOTSA metrics.
3.2.2 LiDAR Segmentation and Tracking Quality
In the following, we assume a sequence of 3D point clouds of length $T$, sampled at discrete time-steps $t \in \{1, \ldots, T\}$. We define the ground-truth assignment function as $\mathrm{gt}(p, t)$ and a prediction function as $\mathrm{pr}(p, t)$, which map each 4D tuple $(p, t)$, consisting of a point $p$ and a timestamp $t$, to a certain class $c$ and identity $id$. In the following, we devise an evaluation metric that, for each pair $(p, t)$, evaluates whether (i) it was assigned to a correct class, and (ii) for the thing classes, whether it was assigned to the correct object instance. Inspired by the recently introduced Higher Order Tracking Accuracy (HOTA) [luiten20ijcv], proposed in the context of MOT, and concurrent work on video panoptic segmentation proposing the Segmentation and Tracking Quality (STQ) [weber2021step], our LSTQ (LiDAR Segmentation and Tracking Quality) consists of two terms, the classification score $S_{cls}$ and the association score $S_{assoc}$.
We adopt a fundamentally different evaluation philosophy compared to other metrics [luiten20ijcv, Kirillov19CVPR, kim20cvpr, hurtado20arxiv]. In particular, we drop the concept of the frame-level “detection” and do not match segments between ground-truth and prediction. Our association score measures point-to-instance association quality in a unified way, in space and time at the point level, which is more natural for segmentation tasks.
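Classification Score. For the classification score, we first define instance-agnostic ground-truth and prediction sets:
$$\mathrm{gt}(c) = \{(p, t) \mid \mathrm{gt}(p, t) = (c, *)\}, \qquad \mathrm{pr}(c) = \{(p, t) \mid \mathrm{pr}(p, t) = (c, *)\},$$
representing the ground truth and predicted points that belong to class $c$ irrespective of their assigned ids. Then, the TP, FP, FN sets are computed as in standard semantic segmentation evaluation with respect to the ground-truth and predicted class assignments:
$$\mathrm{TP}_c = \mathrm{pr}(c) \cap \mathrm{gt}(c), \qquad \mathrm{FP}_c = \mathrm{pr}(c) \setminus \mathrm{gt}(c), \qquad \mathrm{FN}_c = \mathrm{gt}(c) \setminus \mathrm{pr}(c).$$
The classification score then simply boils down to intersection-over-union (IoU) over these sets, which is the standard approach for evaluating semantic segmentation (however, this is different from segment-centric PQ, where points contribute to the term only if the segment that they belong to is matched). We follow the standard procedure and average over the $C$ classes:
$$S_{cls} = \frac{1}{C}\sum_{c=1}^{C} \frac{|\mathrm{TP}_c|}{|\mathrm{TP}_c| + |\mathrm{FP}_c| + |\mathrm{FN}_c|}.$$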
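A minimal sketch of the point-centric classification score is given below. It assumes the class labels of all 4D points of a sequence are available as two flat, aligned lists of integer class indices. Skipping classes that occur in neither list is an assumption of this sketch, and the function name is illustrative.

```python
def classification_score(gt_classes, pred_classes, num_classes):
    """Mean IoU over classes, accumulated directly on points (no segment matching)."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for g, p in zip(gt_classes, pred_classes):
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1  # the ground-truth class misses this point
            fp[p] += 1  # the predicted class wrongly claims it
    ious = []
    for c in range(num_classes):
        denom = tp[c] + fp[c] + fn[c]
        if denom:  # assumption: ignore classes absent from both gt and prediction
            ious.append(tp[c] / denom)
    return sum(ious) / len(ious) if ious else 0.0
```

Every point contributes to exactly one TP count or to one FP and one FN count, so no matching threshold enters the score.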
Association Score. To evaluate the association score, we introduce the following class-agnostic predictions and ground-truth for the thing classes, indexed by a ground-truth identity $g$ and a predicted identity $s$:
$$\mathrm{gt}_{id}(g) = \{(p, t) \mid \mathrm{gt}(p, t) = (*, g)\}, \qquad \mathrm{pr}_{id}(s) = \{(p, t) \mid \mathrm{pr}(p, t) = (*, s)\}.$$
We define the true positive association (TPA) set between a ground-truth object with identity $g$ and a prediction that was assigned identity $s$. This gives us a set of points with mutually consistent identities $g$ and $s$ over the whole 4D volume:
$$\mathrm{TPA}(s, g) = \mathrm{pr}_{id}(s) \cap \mathrm{gt}_{id}(g).$$
Analogously, we define the set of false positive associations:
$$\mathrm{FPA}(s, g) = \mathrm{pr}_{id}(s) \setminus \mathrm{gt}_{id}(g).$$
Intuitively, this set contains predicted point assignments with identity $s$ that were assigned a different ground-truth identity, or were not assigned to a valid object instance. Finally, the set of false negative assignments:
$$\mathrm{FNA}(s, g) = \mathrm{gt}_{id}(g) \setminus \mathrm{pr}_{id}(s)$$
contains ground-truth points with identity $g$ that were assigned an identity different from $s$, or were missed. We note that the concept of TPA, FPA and FNA was first introduced in the context of MOT evaluation for measuring the quality of temporal detection association. Therefore, to establish these sets, a bijective mapping between gt and pred needs to be established (as in the case of [Bernardin08JIVP]). However, in LSTQ, these sets are established with respect to each 4D point, treating association in space and time in a unified manner.
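Once we have quantified these sets, we can evaluate how well a predicted segment $s$ agrees with a ground-truth segment $g$ via their intersection-over-union, $\mathrm{IoU}(s, g) = |\mathrm{TPA}| / (|\mathrm{TPA}| + |\mathrm{FPA}| + |\mathrm{FNA}|)$. Because a ground truth segment may be explained by multiple different predictions, we sum contributions of all pairs with non-zero overlap:
$$S_{assoc} = \frac{1}{|\mathcal{I}|} \sum_{g \in \mathcal{I}} \frac{1}{|\mathrm{gt}_{id}(g)|} \sum_{s:\, |\mathrm{TPA}(s, g)| > 0} |\mathrm{TPA}(s, g)| \; \mathrm{IoU}(s, g),$$
where $\mathcal{I}$ denotes the set of ground-truth identities and $|\mathrm{gt}_{id}(g)|$ is the number of ground-truth points of instance $g$. In practice, we do not need to perform any point segment association, and even a prediction with a single common point will contribute to this term. We normalize these contributions by the tube volume, and we weigh each contribution by the volume of the TPA set. This weighting term ensures that instances with larger temporal spans have a higher contribution to the final score. Finally, our metric is computed as a geometric mean of the two terms: $\mathrm{LSTQ} = \sqrt{S_{cls} \times S_{assoc}}$. The advantage over the arithmetic mean is that the final score will become zero if any of the two terms approaches zero. This reflects our intuition that failing at either of the two aspects of the task should yield a very low final score.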
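The association term and the final score can be sketched as follows. The sketch assumes per-point identities over the whole sequence are given as two aligned lists, with 0 marking points without a thing instance, and it measures the agreement of a pair by the IoU of the two point sets. That agreement measure, the averaging over ground-truth instances, and the function names are assumptions of this sketch.

```python
import math
from collections import Counter


def association_score(gt_ids, pred_ids):
    """Point-level association score over a whole sequence (0 = no thing instance)."""
    gt_size = Counter(g for g in gt_ids if g)
    pred_size = Counter(s for s in pred_ids if s)
    tpa = Counter((g, s) for g, s in zip(gt_ids, pred_ids) if g and s)
    per_gt = Counter()
    for (g, s), inter in tpa.items():
        union = gt_size[g] + pred_size[s] - inter  # |TPA| + |FNA| + |FPA|
        per_gt[g] += inter * (inter / union)       # TPA-weighted IoU
    if not gt_size:
        return 0.0
    # Normalize each instance by its tube volume, then average over instances.
    return sum(per_gt[g] / gt_size[g] for g in gt_size) / len(gt_size)


def lstq(s_cls, s_assoc):
    """Geometric mean of the classification and association scores."""
    return math.sqrt(s_cls * s_assoc)
```

Only pairs that share at least one point appear in `tpa`, so no explicit matching step is needed. As a consistency check, `lstq(52.78, 36.28)` rounds to 43.76, the combination of values reported in the first row of the validation results.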
LSTQ tolerates different semantic predictions within a spatio-temporal segment by design (the term in Eq. 7 is evaluated in a class-agnostic manner). Following STQ [weber2021step], we decouple semantic and association errors; otherwise, e.g., a truck mistaken for a bus would be harshly penalized by the association term, even though it was tracked correctly. This behavior, which disentangles association and classification errors, is different from MOTSA/PTQ/VPQ, where semantics and temporal association are entangled.
4 Experimental Evaluation
In this section, we first evaluate different strategies for forming 4D point cloud volumes, assess the impact of processing multiple scans on the final performance, and discuss several possibilities for designing the embedding function used for point grouping. We compare our method to several baselines for single-scan LiDAR panoptic segmentation [Behley20arxiv] and 4D panoptic LiDAR segmentation by extending existing methods to the sequential domain.
We use the SemanticKITTI [Behley19ICCV] LiDAR dataset to conduct our experiments. It contains sequences from the KITTI odometry dataset [Geiger12CVPR] and provides point-wise semantic and temporally consistent instance annotations [Behley20arxiv]. We use the training/validation/test split by SemanticKITTI [Behley19ICCV, Behley20arxiv].
4.1 Ablation studies
We perform all ablation studies on the validation split and interpret results through the lens of the LSTQ metric (Sec. 3.2.2).
Point Propagation. As discussed in Sec. 3, we cannot simply stack point clouds along the temporal dimension due to the memory constraints. We build on the intuition that we can subsample a set of points from the past scans that are most beneficial for the end-task performance. As we are operating in an online setting, and the past scans have already been processed, we can leverage predictions from the past scans.
In this experiment, we discuss different temporal point sampling strategies for varying temporal window sizes. In the thing-propagation strategy, we exclusively sample points that are assigned to a thing class, as they represent only a small fraction of all points. In the importance sampling strategy, we sample 10% of points with a probability proportional to the objectness. This way, we focus on points likely to represent thing classes, while still allowing the propagation of points belonging to stuff classes, which can aid the semantic segmentation aspect of the task. Similarly, the temporal decay sampling uses the objectness score as a deciding factor, but we decay the number of sampled points based on the proximity to the current scan. Finally, the strided sampling samples points with a fixed stride along the temporal dimension.
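The importance sampling strategy can be sketched with weighted sampling without replacement. The sketch below uses the Efraimidis-Spirakis key trick as one possible way to realize "probability proportional to objectness". The small offset that keeps zero-objectness points selectable, the fixed seed, and the function name are assumptions made for illustration.

```python
import random


def importance_sample(objectness, fraction=0.1, seed=0):
    """Pick about `fraction` of the point indices without replacement.

    Points with a higher objectness are more likely to be kept
    (Efraimidis-Spirakis keys: u ** (1 / weight), keep the largest keys).
    """
    rng = random.Random(seed)
    k = max(1, round(fraction * len(objectness)))
    eps = 1e-6  # keeps zero-objectness (mostly stuff) points selectable
    keys = [(rng.random() ** (1.0 / (w + eps)), i) for i, w in enumerate(objectness)]
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:k])
```

The thing-propagation strategy corresponds to filtering by the predicted class instead, and the temporal decay variant would lower `fraction` for scans further away from the current one.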
As can be seen in Tab. 1, the importance sampling strategy yields higher performance compared to sampling only thing classes, at a slightly increased memory cost. As expected, this approach improves association quality and aids semantics as it also propagates points representing stuff classes.
Interestingly, even a short temporal window drastically improves the performance compared to a single-scan baseline, at a negligible increase in memory consumption. We observe the largest gains when the scans are temporally close: our 4-scan multi-scan baseline improves performance in terms of LSTQ. The association term benefits more from processing multiple scans than the segmentation term. This confirms that our model learns to exploit temporal cues very well. While temporal decaying does aid semantic or temporal aspect, introducing a temporal stride yields the highest performance gains for semantic point classification. However, denser sampling in the temporal domain benefits association, which is why, in the following experiments, we focus on the importance sampling strategy. In Tab. 7 (supplementary), we highlight the performance of this approach for varying temporal window sizes. As can be seen, the association accuracy increases with the window size and then saturates, while the classification accuracy also saturates and thereafter decreases only marginally.
|Method||LSTQ||$S_{assoc}$||$S_{cls}$||IoU (stuff)||IoU (thing)|| || || || |
|RangeNet++ [Milioto19IROS] + PP + MOT||43.76||36.28||52.78||60.49||42.17||34.58||33.83||-7.88||-4.57|
|KPConv [Thomas19ICCV] + PP + MOT||46.27||37.58||56.97||64.21||54.13||39.13||38.11||-6.16||-2.41|
|RangeNet++ [Milioto19IROS] + PP + SFP||43.38||35.66||52.78||60.49||42.17||35.83||35.46||-3.13||-0.01|
|KPConv [Thomas19ICCV] + PP + SFP||45.95||37.07||56.97||64.21||54.13||41.44||41.05||2.83||6.1|
|Ours (single scan) + MOT||51.92||45.16||59.69||64.60||60.40||48.36||47.84||6.65||12.69|
|Ours (single scan) + SFP||45.45||34.61||59.69||64.60||60.40||48.24||47.72||3.01||7.93|
|Ours (2 scans)||59.86||58.79||60.95||64.96||63.06||51.14||50.67||29.04||33.2|
|Ours (4 scans)||62.74||65.11||60.46||65.36||61.26||51.50||51.20||0.34||4.8|
Embedding Design. In this experiment, we study different design decisions to formulate the point embeddings for clustering and show our findings in Tab. 2. We investigate the base performance of using only 3D spatial and 4D spatio-temporal point coordinates, and of using only learned embeddings (Emb.). Next, we investigate the performance of the coordinate mixing formulation that combines learned embeddings with 3D spatial and 4D spatio-temporal coordinates. As can be seen, the variant in which we combine both yields the best results, not only in terms of the association score, but also the classification score. This shows that a well-designed embedding branch has a positive effect on learning the backbone features. Note that for the baseline that uses only spatio-temporal coordinates, we still use our fully trained network.
4.2 Benchmark Results
Single-scan Prediction. First, we evaluate our method using standard single-scan LiDAR panoptic segmentation [Behley19ICCV, Behley20arxiv] to demonstrate the effectiveness of our network solely in the spatial domain. We use points from single scans during training and testing. We follow the standard evaluation protocol and compare to published and peer-reviewed methods.
|Category||# Instances||% Instances||# Scans||TP||FP||FN||IDS||Precision||Recall||MOTSA||$S_{assoc}$||$S_{cls}$|
As can be seen from Tab. 5, our method achieves state-of-the-art results on all metrics for semantic and panoptic segmentation [Kirillov19CVPR, Behley19ICCV, Behley20arxiv]. The first two entries use two different networks for object detection and semantic segmentation, followed by fusion of the results. We use a single network to obtain both semantic and instance segmentation of the point cloud in a single network pass. We note that the recently proposed Panoptic RangeNet [milioto2020iros] and RangeNet++ [Milioto19IROS] combined with PointPillars [Lang19CVPR] detector operate on the range image and not the point cloud, and therefore, use a different backbone. However, KPConv combined with PointPillars uses the same backbone as our method.
4D Panoptic Segmentation. For evaluation in the multi-scan setting for the 4D panoptic segmentation task, we extend all single-scan methods reported in Tab. 5, except for the Panoptic RangeNet [milioto2020iros], as we were not able to obtain point cloud predictions from the authors. We adapt them to the sequential domain using two strategies. The first, AB3DMOT [Weng20iros], uses a constant-velocity motion model to obtain track predictions associated with object detections based on a 3D bounding box overlap. The second strategy, Scene Flow Propagation (SFP), is inspired by standard baselines that perform optical flow warping of points, followed by mask-IoU based association. This approach is commonly used in the domain of vision-based video object segmentation [Luiten18ACCV], video instance segmentation [Yang19ICCV], and multi-object tracking and segmentation [Osep18ICRA, Osep19arxiv, Voigtlaender19CVPR]. Instead of optical flow, we use state-of-the-art LiDAR scene flow by [mittal20cvpr]. We outline our results, obtained on the test set, in Tab. 6. As can be seen, the baseline that uses KPConv [Thomas19ICCV] to obtain per-pixel classification, the PointPillars (PP) detector [Lang19CVPR], and a network for point cloud propagation (SFP [mittal20cvpr]) performs slightly better in terms of association accuracy, compared to the standard 3D multi-object tracking baseline. Our method that unifies all three aspects in a single network outperforms all tracking-by-detection baselines by a large margin, including our single-scan baseline, even though we are using a single network. This confirms the importance of tackling all three aspects of these tasks in a unified manner. An important contribution of our paper is the finding that even when processing smaller overlapping sub-sequences with our network (and resolving intra-window associations with a simple overlap-based approach), we perform significantly better compared to single-scan baselines that use more elaborate association techniques (e.g., Kalman filter), as can be confirmed in Tab. 6.
|RangeNet++ [Milioto19IROS] + PointPillars [Lang19CVPR]||37.1||45.9||75.9||47.0||52.4|
|KPConv [Thomas19ICCV] + PointPillars [Lang19CVPR]||44.5||52.5||80.0||54.4||58.8|
|Panoptic RangeNet [milioto2020iros]||38.0||47.0||76.5||48.2||50.9|
|Our method (single scan)||50.3||57.8||81.6||61.0||61.3|
|RangeNet++ [Milioto19IROS] + PP + MOT||35.52||24.06||52.43||64.52||35.82|
|KPConv [Thomas19ICCV] + PP + MOT||38.01||25.86||55.86||66.90||47.66|
|RangeNet++ [Milioto19IROS] + PP + SFP||34.91||23.25||52.43||64.52||35.82|
|KPConv [Thomas19ICCV] + PP + SFP||38.53||26.58||55.86||66.90||47.66|
|Ours (single scan) + MOT||40.18||28.07||57.51||66.95||51.50|
|Ours (single scan) + SFP||43.88||33.48||57.51||66.95||51.50|
|Ours (multi scan)||56.89||56.36||57.43||66.86||51.64|
Metric Insights. In this section, we analyze the performance on the validation split (Tab. 3) through the lens of several evaluation metrics and analyze per-class performance (Tab. 4). Our method outperforms all baselines with respect to all metrics. However, while our 4-scan variant performs better than the 2-scan variant in terms of LSTQ, we observe a significant drop in the MOTSA score. Our analysis shows that this is because we obtain negative MOTSA scores on some classes due to a decrease in precision, despite having fewer ID switches (see Tab. 4 and Tab. 8). We visualize such a case in Fig. 3. As can be seen, the difference is due to the semantic interpretation of the points and not due to the segmentation and tracking quality at the instance level. This confirms the nonintuitive behavior of MOTSA, while our metric provides insights on both semantic interpretation and instance segmentation and tracking. For a more detailed discussion, we refer to the supplementary (Sec. B.2).
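For reference, the sign change can be read off the MOTSA definition. The formula is not restated in this section, so the following assumes the standard formulation of [Voigtlaender19CVPR], with $M$ denoting the set of ground-truth masks of a class (the notation here is ours):

```latex
\mathrm{MOTSA} = \frac{|TP| - |FP| - |IDS|}{|M|},
\qquad |M| = |TP| + |FN|
```

Under this definition, MOTSA becomes negative for a class as soon as $|FP| + |IDS| > |TP|$. A drop in precision, i.e., more false positives, can therefore outweigh a reduction in ID switches.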
In this paper, we extended LiDAR panoptic segmentation to the temporal domain resulting in the 4D Panoptic Segmentation task. We presented an evaluation metric suitable for analyzing this task’s performance and proposed a new model. Importantly, we have shown that a single model tackling semantic point classification and point-to-instance association jointly in space and time substantially outperforms methods that independently tackle these aspects. We hope that our unified view and model, accompanied by a public benchmark, will pave the road to future developments in this field.
Acknowledgements. This project was funded by the Humboldt Foundation through the Sofja Kovalevskaja Award and the EU Horizon 2020 research and innovation programme under grant agreement No. 101017008 (Harmony). We thank Ismail Elezi and the whole DVL group for helpful discussions.
Appendix A Implementation Details
In this section, we (i) provide details about the four different point propagation strategies we experimented with for forming 4D point clouds and (ii) detail the point-overlap-based association procedure we use to link 4D object instances across overlapping point clouds.
A.1 4D Point Cloud Formation
Our method works directly on 4D volumes, which are constructed from consecutive LiDAR scans. However, due to memory constraints, stacking all points is not feasible. To reduce memory usage, when we process the current scan together with the previous scans of the temporal window, we take all points from the current scan and sub-sample points from the other scans. Moreover, since the previous scans have already been processed, we know the semantic class and objectness score of all of their points at the current time step. We use four different strategies to sub-sample points from previous scans by leveraging this information.
Thing Propagation: In this strategy, we only sample points from previous scans if the points are assigned to a thing class. If the total number of points exceeds the GPU memory limit, we randomly sub-sample again.
Importance Sampling: We select 10% of the points from each previous scan using the objectness scores predicted by the network in the previous time steps. Thus, points with higher objectness scores have a higher chance of being used in the clustering process in the following scans.
Temporal Decay: In this strategy, we again use importance sampling based on objectness scores. However, instead of sampling 10% of the points from each past scan, we select the percentage of points based on the temporal proximity of the scans within the temporal window, so that more points are sampled from scans that are temporally closer to the current scan.
Temporal Stride: We use importance sampling in this strategy as well, but instead of using points from all previous scans, we use only every second scan. Points from the remaining scans are assigned the predictions of their closest points that have class and instance predictions.
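The Importance Sampling strategy described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `importance_sample`, the choice of sampling probabilities directly proportional to the objectness score, and sampling without replacement are all assumptions; only the 10% fraction and the use of objectness scores come from the text.

```python
import numpy as np


def importance_sample(objectness, fraction=0.1, rng=None):
    """Return indices of points sampled from one previous scan.

    objectness: array of shape (N,) with per-point objectness scores.
    fraction:   share of points to keep (10% in the text).
    Assumption: the sampling probability is proportional to the score.
    """
    rng = np.random.default_rng() if rng is None else rng
    objectness = np.asarray(objectness, dtype=np.float64)
    n = objectness.shape[0]
    if n == 0:
        return np.empty(0, dtype=np.int64)

    k = max(1, int(round(fraction * n)))
    # A small epsilon keeps every point selectable, so that sampling
    # without replacement is always well defined.
    weights = np.clip(objectness, 0.0, None) + 1e-12
    probs = weights / weights.sum()
    return rng.choice(n, size=k, replace=False, p=probs)
```

The returned indices would then select the points of a previous scan that are stacked with the full current scan. Thing Propagation could reuse the same interface with a binary thing/stuff mask in place of the objectness scores.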
Our method can cluster points with different semantics and does not provide a single class label for a specific instance. This can be adapted depending on the requirements of the downstream application (e.g., via majority vote). Moreover, if the number of points assigned to a specific cluster is lower than a threshold, we eliminate that instance from the final prediction.
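A possible post-processing step along these lines is sketched below. It is an illustration under stated assumptions rather than the procedure used in the paper: the function name `finalize_instances`, the convention that instance ID 0 means "no instance", and the default `min_points` value are hypothetical, since the text does not give the threshold.

```python
import numpy as np


def finalize_instances(instance_ids, semantic_labels, min_points=25):
    """Drop small clusters and give each instance one class by majority vote.

    instance_ids:    array (N,), 0 = no instance (assumed convention).
    semantic_labels: array (N,), per-point class predictions.
    min_points:      hypothetical threshold; the paper's value is not given here.
    """
    instance_ids = np.asarray(instance_ids).copy()
    semantic_labels = np.asarray(semantic_labels).copy()

    for inst in np.unique(instance_ids[instance_ids > 0]):
        mask = instance_ids == inst
        if np.count_nonzero(mask) < min_points:
            # Too few points: remove the instance, keep the semantic labels.
            instance_ids[mask] = 0
            continue
        # Majority vote over the per-point classes of this instance.
        classes, counts = np.unique(semantic_labels[mask], return_counts=True)
        semantic_labels[mask] = classes[np.argmax(counts)]

    return instance_ids, semantic_labels
```

Ties in the vote are resolved in favor of the smallest class index, which is an arbitrary choice of this sketch.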
As discussed in the main paper (Section 3), we process multiple scans together in an overlapping fashion. For a given temporal window size, at each time step we process the scans of the window together by overlapping them in a single 4D point cloud.
To associate instances across two consecutive time steps, we look at instance intersections in the scans that are common to both time steps. For instance, with a temporal window size of two, we first process the previous and the current scan together and, at the next time step, the current and the following scan. To transfer IDs from the previous time step to the newest scan, we look at the instance intersections in the scan that was processed at both time steps (here, the current scan). Since scans that are processed together share instance IDs, this completes the association between the overlapping 4D volumes.
For the intersection, we consider all common scans. When there is a conflict (i.e., one instance overlaps with two instances in the next step), we pick the instance pair with the higher intersection-over-union (IoU). If none of the intersections surpasses the IoU threshold, we create a new ID for the instance.
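The overlap-based association described above can be sketched as a greedy IoU matching on the points of the shared scans. This is a minimal sketch, not the authors' code: the function name `associate_ids`, the convention that ID 0 marks points without an instance, and the default `iou_thresh` are assumptions, as the threshold value is not given in the text.

```python
import numpy as np


def associate_ids(prev_ids, cur_ids, iou_thresh=0.5, next_id=None):
    """Map instance IDs of the current window to IDs of the previous window.

    prev_ids, cur_ids: arrays (N,) with the instance ID of every point in the
    shared scan(s), as predicted in the previous and in the current window.
    iou_thresh: hypothetical value; the paper's threshold is not given here.
    next_id:    first free track ID (defaults to max(prev_ids) + 1).
    """
    prev_ids = np.asarray(prev_ids)
    cur_ids = np.asarray(cur_ids)
    if next_id is None:
        next_id = int(prev_ids.max(initial=0)) + 1

    # IoU of every overlapping (current, previous) instance pair.
    candidates = []
    cur_instances = [int(c) for c in np.unique(cur_ids[cur_ids > 0])]
    for c in cur_instances:
        c_mask = cur_ids == c
        for p in np.unique(prev_ids[c_mask & (prev_ids > 0)]):
            p_mask = prev_ids == p
            inter = np.count_nonzero(c_mask & p_mask)
            union = np.count_nonzero(c_mask | p_mask)
            candidates.append((inter / union, c, int(p)))

    # Conflicts are resolved greedily: the pair with the higher IoU wins.
    mapping, used_prev = {}, set()
    for iou, c, p in sorted(candidates, reverse=True):
        if iou < iou_thresh:
            break
        if c not in mapping and p not in used_prev:
            mapping[c] = p
            used_prev.add(p)

    # Instances without a sufficiently overlapping match get a new ID.
    for c in cur_instances:
        if c not in mapping:
            mapping[c] = next_id
            next_id += 1
    return mapping
```

In a full pipeline, `next_id` would come from a sequence-level counter so that new IDs stay unique over the whole sequence, and the resulting mapping would be applied to all scans of the current window.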
Appendix B Additional Results
B.1 Ablation on the Temporal Window Size
In Tab. 7, we report the performance of our method for different temporal window sizes. As can be seen, the association accuracy increases with the window size and then saturates, while the classification accuracy saturates earlier; however, it only decreases marginally.
B.2 Per-class Evaluation
In this section, we analyze the performance on the validation split (Tab. 3) through the lens of several evaluation metrics and analyze per-class performance in Tab. 8 (this table extends Tab. 4 from the main paper). While our 4-scan variant performs better than the 2-scan variant in terms of LSTQ, we observe a significant drop in the MOTSA score. Our analysis shows that this is because we obtain negative MOTSA scores on some classes due to a decrease in precision, despite having fewer ID switches. This unintuitive behavior of MOTSA can be further validated by analysing the performance on a single class, e.g., other-vehicle. For this class, the number of ID switches decreases and the precision drops, while the recall improves (see Tab. 8). In our metric, this is reflected in a decrease of the classification score and an increase of the association score, whereas MOTSA unintuitively drops even though association capabilities improve.
We visualize such cases in Fig. 4. As can be seen, the difference is due to the semantic interpretation of the points and not due to the segmentation and tracking quality at the instance level. This confirms the nonintuitive behavior of MOTSA, while our metric provides insights on both semantic interpretation and instance segmentation and tracking. As shown in Figure 3(a)-3(c), our method successfully recovers the ID of the instance. This behavior is penalized by both MOTSA and PTQ, but not by the association score of our metric. Moreover, while the instances are tracked reasonably well in Figure 3(b), the MOTSA and PTQ scores decrease substantially due to poor segmentation of the instances.
Finally, we acknowledge that, while our method works very well on the most frequently occurring object class (car), segmenting and tracking objects that appear in the long tail of the object class distribution remains challenging.
|Category||# Scans||# Instances||% Instances||TP||FP||FN||IDS||Prec.||Recall||MOTSA||S||S|