PADLoC: LiDAR-Based Deep Loop Closure Detection and Registration using Panoptic Attention

by José Arce, et al.

A key component of graph-based SLAM systems is the ability to detect loop closures in a trajectory to reduce the drift accumulated over time from the odometry. Most LiDAR-based methods achieve this goal by using only the geometric information, disregarding the semantics of the scene. In this work, we introduce PADLoC, a LiDAR-based loop closure detection and registration architecture comprising a shared 3D convolutional feature extraction backbone, a global descriptor head for loop closure detection, and a novel transformer-based head for point cloud matching and registration. We present multiple methods for estimating the point-wise matching confidence based on diversity indices. Additionally, to improve forward-backward consistency, we propose the use of two shared matching and registration heads with their source and target inputs swapped by exploiting that the estimated relative transformations must be inverse of each other. Furthermore, we leverage panoptic information during training in the form of a novel loss function that reframes the matching problem as a classification task in the case of the semantic labels and as a graph connectivity assignment for the instance labels. We perform extensive evaluations of PADLoC on multiple real-world datasets demonstrating that it achieves state-of-the-art performance. The code of our work is publicly available at





I Introduction

Simultaneous Localization and Mapping (SLAM) is a core task of autonomous mobile robots. Typically, SLAM approaches consist of two steps: alignment of consecutive measurements, e.g., from wheel odometry, followed by loop closure detection and registration. Reliable loop closure detection enables a robot to recognize places it has seen before to optimize its world representation and belief of its current position, reducing the drift over time. Thus, it is considered a fundamental component of SLAM systems. Many SLAM systems have been proposed for different sensor modalities including cameras [voedisch2022continual] and LiDARs [li2021saloam]. While vision-based methods fail in challenging lighting conditions such as illumination changes, LiDAR-based approaches are more robust to such alterations and provide a more accurate representation of the environment. In this work, we address the joint problem of loop closure detection and map registration for LiDAR-based SLAM. A high-level overview of our approach is depicted in Fig. 1.

Similar to other fields, learning-based approaches have started to replace handcrafted methods due to their better generalization ability and faster runtime [bevsic2022dynamic, gosala2022bird]. Typically, deep neural networks predict point correspondences, which are then used in a differentiable singular value decomposition (SVD) to compute the transformation between two point clouds [cattaneo2022lcdnet, wang2019deep]. Motivated by the success of transformers in natural language processing and computer vision tasks, attention-based architectures were recently introduced for point cloud registration [wang2019deep, qin2022geometric, yew2022regtr] to encode context across points. While existing works do not consider the semantic meaning of the different inputs to a transformer cell, i.e., queries, keys, and values, we explicitly take advantage of the internal structure by feeding in abstract features and raw points separately.

Fig. 1: We propose PADLoC that jointly detects loop closures for LiDAR-based SLAM and simultaneously performs point cloud registration. In addition to geometric information, we leverage panoptic segmentation annotations during training to facilitate more robust point matching.

Although geometric information suffices for classical point cloud registration methods such as Iterative Closest Point (ICP) [zhang1994iterative], these methods can be further stabilized by integrating semantic information [li2021saloam, chen2019sumapp, kong2020semantic]. Inspired by recent semantic mapping approaches [chen2019sumapp, radwan2018vlocnet++] and methods that exploit panoptic information for vision-based loop closure detection [yuan2021svloop], we leverage panoptic segmentation of point clouds in this work. Unlike related methods, our approach requires panoptic labels only while training but not during deployment, making it more versatile. We evaluate the loop closure detection and point cloud registration performance on three real-world autonomous driving datasets, namely, KITTI [geiger2012are], Ford campus [pandey2011ford], and an in-house dataset recorded in Freiburg, Germany. We compare against both state-of-the-art handcrafted and deep learning-based methods and demonstrate that PADLoC achieves state-of-the-art performance. We also present several ablation studies on the different components of our approach, validating our architectural design choices.

The main contributions of this work are as follows:

  1. We propose PADLoC, a transformer encoder architecture for point cloud matching and registration. Unlike existing methods, we use separate inputs as keys, values, and queries, effectively exploiting the transformer structure.

  2. We define a novel loss function that leverages panoptic information for registration. We further propose formulating both geometric and panoptic registration losses as bidirectional functions that greatly improve performance.

  3. We study the effect of multiple weighting methods in SVD to enhance point matching.

  4. We extensively evaluate our proposed approach and compare it to other point cloud matching and registration methods, using two openly available datasets and in-house data recorded in Freiburg, Germany.

  5. We release our code and the trained models at

II Related Work

In this section, we first provide an overview of LiDAR-based loop closure detection techniques for SLAM, followed by various methods for point cloud registration, and finally describe approaches that leverage semantic segmentation for either task.

Loop Closure Detection: Traditionally, handcrafted methods for LiDAR loop closure detection can be categorized into local feature-based and global feature-based methods. Inspired by the success of local feature-based methods in images, approaches from the first category design similar descriptors and adapt them to 3D point cloud data. 3D keypoint descriptors such as Fast Point Feature Histograms (FPFH) [rusu2009fast] and Normal-Aligned Radial Features (NARF) [steder2011narf] are used to extract local features, which are then aggregated in a bag-of-words model to detect loop closures. More recently, HOPN [Lun2022] exploits a bird’s-eye view (BEV) representation and normal information to increase robustness to noise and viewpoint changes. Global feature-based approaches, on the other hand, summarize the whole point cloud into a single fingerprint, which is then compared against the fingerprints from past frames to detect loops. The M2DP [He2016] descriptor projects the point cloud into multiple 2D planes and combines density information computed on each plane into a global descriptor. Scan Context [giseop2018scan] combines a polar coordinate representation with partitioning to generate an image as a global descriptor. Subsequent works extended this method by adding additional information such as intensity [wang2020intensity] and semantic data [li2021ssc]. Recently, many deep learning-based approaches have been proposed to overcome some of the limitations of handcrafted methods. PointNetVLAD [Uy_2018_CVPR] is built on top of the PointNet [Qi_2017_CVPR] architecture and generates a compact descriptor. OverlapNet [chen2020overlapnet] projects the point cloud into a range image and predicts the overlap and the yaw misalignment between a pair of frames. To increase viewpoint robustness and to reduce inference time, OverlapTransformer [Junyi2022] adapts OverlapNet by including a transformer module. In this work, we build upon LCDNet [cattaneo2022lcdnet], which uses learning-based feature extraction to generate global descriptors. LCDNet significantly improves loop closure detection in challenging conditions, such as reverse loops, and, unlike other methods, does not require an ad-hoc function to compare two global descriptors.

Point Cloud Registration: Standard techniques for point cloud registration can be broadly classified into two main categories. The first category comprises the Iterative Closest Point (ICP) algorithm [zhang1994iterative] and its variants [chen2019sumapp, bouaziz2013sparse]. These methods require an initial guess on the transformation and then iteratively alternate between finding matches between points by exploiting some heuristics and estimating the transformation based on these matches. Methods of the second category use a two-stage approach. They first extract local point features, e.g., FPFH [rusu2009fast], and then regress the transformation using robust estimators such as RANSAC [fischler1981random]. While methods of the first category are prone to get stuck in local minima if the provided initial guess is not accurate enough, approaches of the second category are sensitive to noise and incorrect matches. Many deep learning-based approaches have also been proposed to solve the point cloud registration task. PointNetLK [Aoki_2019_CVPR] is a pioneering work that combines an architecture inspired by PointNet [Qi_2017_CVPR] and a modified Lucas-Kanade algorithm to iteratively improve the registration. Inspired by the success of transformers in other fields, Deep Closest Point [wang2019deep] uses an attention-based module to predict soft matches between two point clouds, which are fed to a differentiable SVD layer to infer a rigid transformation. Following the same idea, both GeoTransformer [qin2022geometric] and REGTR [yew2022regtr] directly learn to predict point correspondences using both self and cross-attention. Our previous work LCDNet [cattaneo2022lcdnet] combines a state-of-the-art feature extraction architecture with a place recognition head and a relative pose head for simultaneous loop closure detection and point cloud registration. In this work, we adapt LCDNet [cattaneo2022lcdnet] by integrating a transformer-based registration and matching module.

Semantic-Aided Mapping and Localization: Only a handful of works have proposed to leverage semantic information for large-scale mapping and localization [chen2019sumapp, ballardini2019], and particularly for loop closure detection. Based on semantic segmentation, SuMa++ [chen2019sumapp] filters dynamic objects from a LiDAR-based map and extends the ICP algorithm with additional semantic constraints. While SuMa++ does not utilize semantic information for loop closure detection, RINet [li2022rinet] explicitly addresses LiDAR-based place recognition via a rotation-invariant global descriptor combining semantic and geometric information. For the same task, Kong et al. [kong2020semantic] propose to build a graph representation of point clouds, which is enriched by both semantic and instance segmentation, and to perform graph similarity matching. SA-LOAM [li2021saloam] integrates a semantic-aided variant of ICP into the popular LOAM pipeline for point cloud registration. To address loop closure, it uses a graph representation similar to that of Kong et al. [kong2020semantic]. SV-Loop [yuan2021svloop] is a loop closure detection method for vision-based SLAM. It separately proposes loop closure candidates based on raw images and panoptic segmentation maps, which are then fused to extract the most feasible candidates. In our approach, we exploit panoptic annotations of point clouds for both loop closure detection and point cloud registration. Additionally, we only utilize them during the training process but not for deployment, making the method more versatile.

III Technical Approach

In this section, we introduce our novel PADLoC architecture for joint loop closure detection and point cloud registration. First, we detail the overall approach comprising the modules shown in Fig. 2. We then describe the loss functions that we employ, including our proposed loss that leverages panoptic annotations of point clouds.

Fig. 2: Overview of our proposed PADLoC architecture for joint loop closure detection and point cloud registration. It consists of a shared feature extractor (green) followed by a global descriptor head (blue) for loop closure detection and a registration and matching module (orange) to estimate the 6-DoF transform between two point clouds (red). To train the global descriptor, we use a triplet loss (purple) that compares the anchor point cloud with a positive and negative sample. For the registration module, we leverage losses (purple) based on both geometric and panoptic information.

III-A Model Architecture

In this section, we describe the individual components of the PADLoC architecture. We build upon our previously proposed LCDNet [cattaneo2022lcdnet], but instead of using a differentiable approximation of the optimal transport to obtain point matches, we propose to leverage the cross-attention matrices of transformers. The learnable key, query, and value weights yield a better latent representation of the features, and thus more reliable matches. As depicted in Fig. 2, the overall PADLoC architecture consists of three modules: feature extraction, loop closure detection, and point cloud registration. During training, we employ a triplet-based training scheme by feeding in an anchor point cloud along with a positive sample of a loop closure and a negative sample.

Feature Extraction: The feature extraction backbone converts raw input scans into a high-dimensional embedding that is used as a common input for both loop closure detection and point cloud registration. It effectively exploits global and local contexts and is built upon the PV-RCNN architecture [shi2020pvrcnn]. In detail, a point cloud, comprising 3D coordinates and reflectance values, is discretized into a voxel grid, which is then passed through four sparse 3D convolutional layers to generate feature maps at different resolutions. The final feature map is then stacked to form a BEV feature map. Additionally, the original point cloud is downsampled using the Farthest Point Sampling (FPS) algorithm to uniformly select a fixed number of keypoints. The feature vector of each sampled keypoint is assembled by combining the feature maps from each convolutional layer in a neighborhood of the sampled keypoint using the Voxel Set Abstraction module. The raw input of each sampled keypoint is also appended to each feature vector, along with the corresponding entry in the BEV feature map. Finally, these intermediate features are fed through a multilayer perceptron to obtain the final feature vector for each sampled point. This module thus outputs the sampled keypoints and the corresponding features.

Loop Closure Detection: The global descriptor module of PADLoC further encodes the previously extracted features to perform loop closure detection. For this task, we employ the NetVLAD layer [arandjelovic2018netvlad] to convert the feature vectors of the anchor, the positive, and the negative points to their respective final descriptors. In detail, NetVLAD learns clusters along with corresponding descriptors, which are aggregated in a single descriptor $\tilde{G}$ for the entire point cloud. The final descriptor $G$ of fixed length is then obtained via a context gating layer. This learnable pooling operation with weights $W$ and bias $b$ is defined as

$$G = \sigma(W \tilde{G} + b) \odot \tilde{G},$$

where $\sigma$ refers to the logistic sigmoid function and $\odot$ denotes the element-wise multiplication.
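As an illustration, the context gating step can be sketched in a few lines of NumPy. This is a minimal sketch and not the released implementation: the function name `context_gating` and the toy values are assumptions made here, and in a trained network `weight` and `bias` would be learned.

```python
import numpy as np

def context_gating(descriptor, weight, bias):
    """Re-weight each descriptor entry with a sigmoid gate: sigma(W x + b) * x."""
    gate = 1.0 / (1.0 + np.exp(-(weight @ descriptor + bias)))
    return gate * descriptor

# Toy values (assumed): zero weights and bias give a gate of 0.5 for every entry.
descriptor = np.array([1.0, -2.0, 0.5])
gated = context_gating(descriptor, np.zeros((3, 3)), np.zeros(3))
```

The gate only rescales entries, so the output keeps the length of the aggregated descriptor.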

During inference, the descriptors are stored in such a manner that allows for efficient querying of the nearest neighbor in descriptor space. If the distance between the descriptor of the current scan and its nearest neighbor is below a predefined threshold, they are considered to form a loop closure. To avoid matching consecutive scans, we introduce a small temporal distance between the current scan and potential neighbors.
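The inference-time query could look roughly like the sketch below. It uses a brute-force search instead of an efficient index structure, and the names `detect_loop`, `threshold`, and `min_gap` are hypothetical; the default of 50 frames follows the temporal gap stated in Sec. IV.

```python
import numpy as np

def detect_loop(descriptors, query_idx, threshold, min_gap=50):
    """Return the index of the best past match for scan `query_idx`, or None.

    descriptors: (N, D) array with one global descriptor per scan, in scan order.
    The `min_gap` most recent scans are skipped to avoid matching consecutive scans.
    """
    num_candidates = query_idx - min_gap
    if num_candidates <= 0:
        return None  # not enough history yet
    distances = np.linalg.norm(
        descriptors[:num_candidates] - descriptors[query_idx], axis=1)
    best = int(np.argmin(distances))
    return best if distances[best] < threshold else None
```

For long trajectories, the linear scan would typically be replaced by a nearest-neighbor index over the stored descriptors.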

Fig. 3: The matching module consists of a transformer encoder that takes the extracted features of the source keypoints as query, the features of the target keypoints as key, and the corresponding target keypoints as value. It outputs both soft correspondences and projected target points along with confidence weights. The projected points and weights are fed together with the source keypoints to a registration module that performs weighted SVD to estimate the final transform.

Point Matching: The matching module shown in Fig. 3 predicts soft correspondences between the keypoints of a source point cloud and those of a target point cloud. Additionally, it outputs projected target coordinates, which are linear combinations of the original target coordinates with a one-to-one pairing with the points of the source set, and a confidence weight for each of these matches. Inspired by the success of transformers in related tasks, we propose a novel architecture that performs cross-attention directly on the encoder part, obviating the need for a decoder by feeding independent inputs for the queries, keys, and values.

The transformer encoder layer is as defined in [vaswani2017attention], but applied to independent query, key, and value inputs. Learnable weights and biases are used to reduce the dimensionality of the output from the size of the features to 3D space. We directly use the encoder’s attention matrix as our matching matrix, since it already encodes the similarity between the features of the two sets of points. Moreover, each row in the attention matrix represents the probability distribution of matching the corresponding point from the source set to all of the points from the target set, given that it is non-negative and adds up to one due to the use of the softmax function.
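To make the role of the attention matrix concrete, the following is a simplified single-head sketch. It is an assumption-laden illustration rather than the PADLoC module: it omits multi-head attention, the feed-forward sublayer, and the output projection, and the names `cross_attention_match`, `w_query`, and `w_key` are hypothetical.

```python
import numpy as np

def cross_attention_match(src_feat, tgt_feat, tgt_pts, w_query, w_key):
    """Single-head cross-attention with independent query, key, and value inputs.

    src_feat: (N, D) source features (queries); tgt_feat: (M, D) target features
    (keys); tgt_pts: (M, 3) raw target keypoints (values).
    """
    queries = src_feat @ w_query
    keys = tgt_feat @ w_key
    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    matching = np.exp(scores)
    matching /= matching.sum(axis=1, keepdims=True)  # row-wise softmax
    projected = matching @ tgt_pts  # convex combination of target points
    return matching, projected

rng = np.random.default_rng(0)
src_feat, tgt_feat = rng.normal(size=(5, 8)), rng.normal(size=(7, 8))
tgt_pts = rng.normal(size=(7, 3))
matching, projected = cross_attention_match(
    src_feat, tgt_feat, tgt_pts, rng.normal(size=(8, 4)), rng.normal(size=(8, 4)))
```

Each row of `matching` is a probability distribution over the target keypoints, so every projected point lies inside the convex hull of the target keypoints.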

From the matching matrix, we compute a confidence weight for every pair of point correspondences by penalizing the dispersion of the distributions represented by each row. We propose using a diversity metric for that purpose, such as the Shannon entropy ($H$), the Hill number of order $q$ (${}^{q}D$), or the Berger-Parker index ($BP$), defined as

$$H(p) = -\sum_{i} p_i \log p_i, \qquad {}^{q}D(p) = \Big(\sum_{i} p_i^{\,q}\Big)^{1/(1-q)}, \qquad BP(p) = \max_{i} p_i,$$

where $p$ is a vector of probabilities.

The weights are obtained using any of the aforementioned metrics by normalizing their output to the range $[0, 1]$, where the two extreme weights of 0 and 1 respectively correspond to a uniform and an infinitely sharp distribution.
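The three metrics and one possible normalization can be sketched as below. The index definitions are the standard ones; the linear rescaling to [0, 1] and the default order `q=2.0` are assumptions made for this example and are not taken from the text.

```python
import math

def shannon_weight(p):
    """1 - H(p) / log(n): 0 for a uniform row, 1 for a one-hot row."""
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    return 1.0 - entropy / math.log(len(p))

def hill_weight(p, q=2.0):
    """Hill number of order q (q != 1), rescaled from [1, n] to [1, 0]."""
    n = len(p)
    hill = sum(x ** q for x in p) ** (1.0 / (1.0 - q))
    return (n - hill) / (n - 1)

def berger_parker_weight(p):
    """Largest probability, rescaled from [1/n, 1] to [0, 1]."""
    n = len(p)
    return (max(p) - 1.0 / n) / (1.0 - 1.0 / n)
```

All three functions expect one row of the matching matrix with at least two entries and return larger weights for sharper rows.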

Point Cloud Registration: To obtain the final relative transformation from a source point cloud to a target point cloud, we perform a weighted version of the Kabsch-Umeyama algorithm that finds the optimal translation and rotation between two sets of points by minimizing the root mean square error of the point pairs. First, the correspondences between the sampled source keypoints and the projected target keypoints are weighted by the matching confidences. Subsequently, the optimal translation is computed as the difference between the weighted centroids of the two point clouds. Finally, the optimal rotation is obtained via SVD of the weighted covariance matrix of the two sets of keypoints. This approach is fully differentiable and thus allows end-to-end training by measuring the error of the predicted transformation with respect to the ground truth relative pose.
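A NumPy sketch of a weighted Kabsch-Umeyama solve is shown below. It is not differentiable and only illustrates the linear algebra; the function name `weighted_kabsch` and the reflection handling via the determinant sign are choices made for this example.

```python
import numpy as np

def weighted_kabsch(src, tgt, weights):
    """Rigid (R, t) minimizing sum_i w_i * ||R @ src_i + t - tgt_i||^2.

    src, tgt: (N, 3) paired points; weights: (N,) non-negative confidences.
    """
    w = weights / weights.sum()
    src_centroid, tgt_centroid = w @ src, w @ tgt
    src_c, tgt_c = src - src_centroid, tgt - tgt_centroid
    cov = (src_c * w[:, None]).T @ tgt_c           # weighted 3x3 covariance
    u, _, vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(vt.T @ u.T))         # guard against reflections
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = tgt_centroid - rotation @ src_centroid
    return rotation, translation
```

With noise-free correspondences, the recovered transform is exact for any positive weights; the weights only matter once some of the matches are unreliable.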

III-B Loss Functions

Our total loss function consists of a weighted sum of the triplet loss for loop closure detection as well as a geometric loss and the newly proposed panoptic loss for point cloud registration. The following paragraphs describe these losses in greater detail.

Triplet Loss: For the loop closure detection task, we use the triplet loss. It enforces a small distance between the descriptors of an anchor point cloud and a positive point cloud, i.e., a loop closure LiDAR scan, while increasing the distance between the descriptors of the anchor and a negative point cloud, i.e., a LiDAR scan taken at a different place:

$$\mathcal{L}_{trp} = \max\big(0,\; d(G^{a}, G^{p}) - d(G^{a}, G^{n}) + m\big),$$

where the descriptors of the anchor, the positive, and the negative sample are denoted by $G^{a}$, $G^{p}$, and $G^{n}$, respectively, $d(\cdot, \cdot)$ is a given distance function, and $m$ refers to the desired separation margin.
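For a single triplet, the loss can be sketched as follows, assuming the L2 distance mentioned in Sec. IV-A. The function name and the toy descriptors are illustrative only.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 1.0])
far_negative = np.array([0.0, 3.0])    # already separated by more than the margin
near_negative = np.array([0.0, 1.2])   # violates the margin
```

The loss vanishes once the negative is farther from the anchor than the positive by at least the margin, so only violating triplets contribute gradients.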

Geometric Loss: We formulate our geometric loss as a sum of a pose loss and an auxiliary matching loss. For the pose loss, we compare the predicted relative transformation from the anchor to the positive sample with the ground truth transformation by applying both to the coordinates of the same sampled point cloud. Then we compute the mean absolute error in the Euclidean space.

We further evaluate the geometric correspondence between the sampled anchor and positive points leveraging the predicted matching matrix. In detail, we transform the anchor points with the ground truth transformation and project the positive sample with the matching matrix.
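The pose loss described above can be sketched as follows. The sketch assumes 4x4 homogeneous transforms and interprets the mean absolute error as the mean Euclidean distance between corresponding transformed points; both the interpretation and the helper names are assumptions.

```python
import numpy as np

def apply_transform(transform, points):
    """Apply a 4x4 homogeneous rigid transform to an (N, 3) array of points."""
    return points @ transform[:3, :3].T + transform[:3, 3]

def pose_loss(pred_transform, gt_transform, points):
    """Mean distance between the points moved by the predicted and the true pose."""
    diff = apply_transform(pred_transform, points) - apply_transform(gt_transform, points)
    return np.linalg.norm(diff, axis=1).mean()
```

Comparing transformed points instead of raw pose parameters expresses rotation and translation errors in the same metric unit.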

Fig. 4: Multi-matched object loss, with (a) object matching and (b) graph representation. The multi-matched object loss penalizes matching an object in the anchor point cloud to multiple objects in the positive sample. Unlike the semantic misclassification losses, the multi-matched object loss does not consider the semantic class, as depicted in (a). By exploiting a graph representation of the point cloud, shown in (b), it enforces that all points of the same object are matched to points of another object.

Panoptic Loss: In addition to the geometric point correspondences, we propose to leverage panoptic information to register two point clouds. In detail, we formulate a novel panoptic loss as the sum of two semantic misclassification losses, namely the semantic loss and the meta-semantic loss, as well as a multi-matched object loss.

We treat the matching process as a classification task, where the projected positive points are assigned a semantic class. While a cross-entropy loss is commonly used in classification problems, the proposed class logits are not the output of either a logistic or softmax activation, and we empirically found that a mean absolute error resulted in a more stable training process. First, we use the semantic labels to construct one-hot encoded matrices for the anchor and positive samples. Using the predicted matching matrix, we then define the semantic loss on these matrices.

Additionally, to allow flexibility in the semantic misclassification, we define a mapping from the semantic class labels to a set of super-classes, e.g., both car and truck belong to the vehicle class. Further details can be found in Sec. IV-A. Analogously to the semantic loss, we construct one-hot encoded matrices of the super-class labels and define the meta-semantic loss on them.
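One plausible reading of the two semantic misclassification losses is sketched below: the matching matrix projects the one-hot labels of the positive sample onto the anchor points, and the result is compared with the anchor labels via a mean absolute error. The exact formula is an assumption here, and the three-class label set and the `SUPER_CLASS` mapping are hypothetical.

```python
import numpy as np

def semantic_loss(matching, anchor_labels, positive_labels, num_classes):
    """MAE between anchor one-hot labels and matched positive one-hot labels."""
    anchor_onehot = np.eye(num_classes)[anchor_labels]
    projected = matching @ np.eye(num_classes)[positive_labels]
    return np.abs(anchor_onehot - projected).mean()

# Hypothetical labels: 0 = car, 1 = truck, 2 = person.
# Hypothetical super-classes: 0 = vehicle, 1 = human.
SUPER_CLASS = np.array([0, 0, 1])

def meta_semantic_loss(matching, anchor_labels, positive_labels):
    """Same loss after mapping every label to its super-class."""
    return semantic_loss(matching, SUPER_CLASS[anchor_labels],
                         SUPER_CLASS[positive_labels], num_classes=2)
```

In this toy setup, matching a car to a truck is penalized by the semantic loss but not by the meta-semantic loss, which mirrors the flexibility described above.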


In our novel multi-matched object loss, we further exploit the instance labels to encourage the network to match entire objects consistently from one point cloud to the other. This is done by penalizing matches of points from a single object in the anchor to multiple objects in the positive sample. Unlike the previously introduced semantic misclassification losses, the multi-matched object loss does not consider the semantic class of objects, as depicted in Fig. 4 (a).

Since instance labels may not be consistent throughout a driving sequence, it is not feasible to purely rely on the IDs. Therefore, we construct adjacency matrices of a graph representation of the anchor and positive point clouds, where nodes represent points and edges connect points of the same instance of a semantic class. The predicted matching matrices can then be considered as weighted, directed, bipartite graphs between the two sets of points (see Fig. 4 (b)). Finally, we formulate the multi-matched object loss on these graphs using element-wise multiplication.
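The adjacency construction can be illustrated with a short sketch; the loss itself is not reproduced here. The function name and the toy labels are assumptions, and the example only shows why the adjacency matrix does not depend on the concrete instance IDs.

```python
import numpy as np

def instance_adjacency(semantic_labels, instance_ids):
    """Binary (N, N) matrix: 1 where two points share both class and instance ID."""
    semantic_labels = np.asarray(semantic_labels)
    instance_ids = np.asarray(instance_ids)
    same_class = semantic_labels[:, None] == semantic_labels[None, :]
    same_instance = instance_ids[:, None] == instance_ids[None, :]
    return (same_class & same_instance).astype(float)

# Two scans of the same objects whose instance IDs changed between frames.
adj_first = instance_adjacency([1, 1, 1, 2], [5, 5, 6, 5])
adj_second = instance_adjacency([1, 1, 1, 2], [9, 9, 4, 9])
```

Both calls yield the same matrix, since only the grouping of points into objects matters and not the ID values.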

Reverse Losses: Finally, we add a second instance of the registration module that processes the swapped source and target inputs and predicts the inverse relative transformation. Both the geometric and the panoptic losses can be reformulated accordingly. The total loss is then formulated by averaging the results of both the original and the reverse versions.
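A small sketch of the bookkeeping behind the reverse losses is given below, assuming 4x4 homogeneous rigid transforms. The helper names are hypothetical; the sketch only shows how the reverse ground truth follows from the forward one and how the two loss values are averaged.

```python
import numpy as np

def invert_rigid(transform):
    """Invert a 4x4 rigid transform using R^T and -R^T t."""
    rotation, translation = transform[:3, :3], transform[:3, 3]
    inverse = np.eye(4)
    inverse[:3, :3] = rotation.T
    inverse[:3, 3] = -rotation.T @ translation
    return inverse

def bidirectional_loss(forward_loss, reverse_loss):
    """Average of the loss on (source, target) and on the swapped pair."""
    return 0.5 * (forward_loss + reverse_loss)
```

The reverse head would be supervised with `invert_rigid(gt_transform)`, so both directions are trained against mutually consistent targets.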

IV Experimental Evaluation

In this section, we evaluate our proposed PADLoC architecture with respect to multiple handcrafted and learning-based methods. We perform several experiments and present both the loop closure detection and the point cloud registration results. Finally, we evaluate the design choices in PADLoC by performing multiple ablation studies.

Datasets: KITTI Seq. 08 [geiger2012are], Ford Seq. 01 [pandey2011ford], and Freiburg (in-house). Rot. is the rotation error in degrees and Transl. is the translation error in meters.

| Type | Method | KITTI AP | KITTI Rot. [°] | KITTI Transl. [m] | Ford AP | Ford Rot. [°] | Ford Transl. [m] | Freiburg AP | Freiburg Rot. [°] | Freiburg Transl. [m] |
|---|---|---|---|---|---|---|---|---|---|---|
| Handcrafted | M2DP [He2016] | 0.05 | - | - | 0.89 | - | - | 0.60 | - | - |
| Handcrafted | Scan Context† [kim2022scan] | 0.65 | 3.11 | - | 0.97 | 16.68 | - | 0.74 | 52.70 | - |
| Handcrafted | LiDAR-Iris† [wang2020lidar] | 0.64 | 1.84 | - | 0.90 | 1.66 | - | 0.73 | 46.24 | - |
| Handcrafted | ISC† [wang2020intensity] | 0.31 | 6.27 | - | 0.62 | 6.15 | - | 0.38 | 51.02 | - |
| Handcrafted | ICP (pt2pt) [zhang1994iterative] | - | 160.63 | 2.41 | - | 9.56 | 2.79 | - | 89.43 | 2.37 |
| Handcrafted | ICP (pt2pl) [zhang1994iterative] | - | 160.73 | 2.49 | - | 9.16 | 2.62 | - | 89.25 | 2.25 |
| Learning | DCP [wang2019deep] | - | 46.06 | 2.59 | - | 12.14 | 3.42 | - | 45.70 | 2.30 |
| Learning | OverlapNet† [chen2020overlapnet] | 0.32 | 65.45 | - | 0.79 | 9.44 | - | 0.59 | 70.91 | - |
| Learning | LCDNet [cattaneo2022lcdnet] | 0.76 | 0.37 | 0.19 | 0.97 | 1.82 | 1.44 | 0.65 | 10.08 | 0.91 |
| Learning | PADLoC (ours) | 0.81 | 0.37 | 0.16 | 0.98 | 1.50 | 1.33 | 0.67 | 9.30 | 1.41 |

Comparison of the average precision (AP) for loop closure detection as well as rotation error and translation error for point cloud registration of PADLoC with previous methods. All learning-based models are trained on the KITTI odometry benchmark dataset. PADLoC uses panoptic annotations from the SemanticKITTI dataset. Methods denoted with † only estimate the yaw between two point clouds instead of a full 6-DoF transformation.

TABLE I: Comparison of loop closure detection and point cloud registration performance

IV-A Implementation Details

We perform experiments on two publicly available autonomous driving datasets, namely the KITTI odometry benchmark [geiger2012are] and the Ford campus vision and LiDAR dataset [pandey2011ford]. Additionally, we present results on a more challenging in-house dataset recorded in Freiburg, Germany. For training, we leverage the ground truth panoptic annotations from the SemanticKITTI dataset [behley2019semantickitti]. In particular, we train all models on sequences {00, 05, 06, 07, 09} and, if not specified otherwise, evaluate on sequence 08. We consider two point clouds to form a loop closure if their poses are within a fixed distance of each other and the scans were recorded at least 50 frames apart to avoid consecutive scans. Unless otherwise specified, the number of keypoints, the feature size, the descriptor length, and the number of clusters are kept fixed. To improve the invariance of the model with respect to the inputs’ position and orientation, we augment the data during training by applying a random rigid transformation to the input point clouds, with a uniformly sampled translation in the x and y axes and along z, and a uniformly sampled rotation for the roll and pitch angles and for the yaw. We train all our models on a server with 4 NVIDIA RTX A6000 GPUs for 150 epochs. We use the Adam optimizer with weight decay and an initial learning rate that is halved after epochs 40 and 80.

The total loss function is computed as a weighted sum of the components described in Sec. III-B. We use a fixed triplet margin and the L2 distance as the distance function in Eq. 6. For the semantic super-classes, we follow the definitions of Cityscapes [cordts2016the] and group the semantic labels into flat, human, vehicle, construction, object, nature, and void. Based on the ablation study presented in Sec. IV-D, we use the Berger-Parker index to compute the confidence weights.

IV-B Loop Closure Detection

To evaluate the loop closure detection performance, we compare PADLoC with the handcrafted methods M2DP [He2016], Intensity Scan Context (ISC) [wang2020intensity], Scan Context [kim2022scan], and LiDAR-Iris [wang2020lidar], as well as with the learning-based approaches LCDNet [cattaneo2022lcdnet], OverlapNet [chen2020overlapnet], and Deep Closest Point (DCP) [wang2019deep]. For DCP, we combine the feature extraction module of PADLoC with a full transformer-based matching module based on the authors’ code release. For the other methods, we directly use the official code published by the respective authors. To compute the results on OverlapNet, we download the model weights provided on the project website that are trained on KITTI. We re-train the other learning-based methods on sequences {00, 05, 06, 07, 09} of the KITTI odometry benchmark [geiger2012are], where PADLoC leverages the ground truth panoptic annotations from the SemanticKITTI dataset [behley2019semantickitti]. We evaluate all methods on sequence 08 of the KITTI dataset, sequence 01 of the Ford dataset, and an in-house dataset recorded in Freiburg, Germany.

When evaluating PADLoC, we generate a descriptor for every scan in a sequence and compute its similarity with the descriptors of all frames prior to the 50 previous scans. If the scan with the closest descriptor to that of the current scan has a similarity higher than a given threshold, then the pair is considered to form a loop closure. If the distance between the two ground truth poses is within a dataset-specific distance (one value for the KITTI dataset and another for the Ford and Freiburg datasets), then it is considered a true positive. Otherwise, it is considered a false positive. Conversely, if the pose distance is within this dataset-specific distance, but the similarity between the descriptors is below the threshold, then we regard it as a false negative. By varying the threshold, we obtain precision-recall pairs that are then used to compute the average precision (AP).
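The threshold sweep can be sketched as follows. The sketch assumes one best-match similarity per query scan together with a flag stating whether the ground truth poses are close enough, distinct similarity values, and a plain step-wise integration of the precision-recall curve; the interpolation used for the reported numbers may differ.

```python
def average_precision(similarities, is_true_loop):
    """Area under the precision-recall curve obtained by lowering the threshold."""
    total_true = sum(is_true_loop)
    if total_true == 0:
        return 0.0
    ranked = sorted(zip(similarities, is_true_loop), reverse=True)
    tp = fp = 0
    prev_recall = 0.0
    ap = 0.0
    for _, true_loop in ranked:   # each step admits one more detection
        if true_loop:
            tp += 1
        else:
            fp += 1
        recall = tp / total_true  # pairs still below the threshold count as FN
        precision = tp / (tp + fp)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

A ranking that places all true loops above all false ones yields an AP of 1.0, and every false positive ranked above a true loop lowers the score.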

(a) OverlapNet [chen2020overlapnet] (b) LCDNet [cattaneo2022lcdnet] (c) PADLoC (ours)
Fig. 5: Qualitative loop closure detection results on KITTI sequence 08. The ground truth path corresponds to true negatives. While LCDNet reduces both false positives and false negatives compared to OverlapNet, the proposed PADLoC further decreases false positives.

In Table I, we report the average precision (AP) of PADLoC and the aforementioned baseline methods. Notably, PADLoC achieves the highest performance across the board for the evaluation sequences of both KITTI and Ford datasets. For our in-house Freiburg dataset, PADLoC yields the highest AP compared to the other learning-based approaches. Although the proposed transformer-based registration head and the panoptic losses do not directly influence the loop closure detection module, by sharing the same feature extractor between the two branches and training for the two tasks jointly, the better feature representation learned using our novel module and losses also improves the loop closure detection performance compared to LCDNet, which achieved the second best AP on both KITTI and Ford. Qualitative results of these methods on the KITTI dataset are visualized in Fig. 5. Compared to OverlapNet, both LCDNet and PADLoC correctly detect a higher number of loop closures, and PADLoC further reduces the number of false positives. In Fig. 6, we plot the corresponding precision-recall curves that are used to compute the AP scores. We observe that PADLoC can maintain a higher precision for increased recall than LCDNet.

Fig. 6: Precision-recall curves for loop closure detection of learning-based methods evaluated on sequence 08 of the KITTI dataset.

IV-C Point Cloud Registration

To evaluate the point cloud registration performance, we compare PADLoC with the same handcrafted and learning-based methods described in Sec. IV-B, except for M2DP, which does not perform point cloud registration. Since these handcrafted methods only estimate the yaw between two point clouds instead of the full 6-DoF transformation, we additionally compare with the Iterative Closest Point algorithm (ICP) [zhang1994iterative], using both point-to-point and point-to-plane distances. Following the standard experimental setup [cattaneo2022lcdnet], for LCDNet, DCP, and PADLoC, we perform point cloud registration with RANSAC using the features extracted before the respective matching layers.

As a measure of registration accuracy, we compute the rotation error in degrees and the translation error in meters of all positive pairs. We then average the errors over the entire sequence and present the results in Table I. We observe that PADLoC yields the smallest rotation error compared to all the handcrafted and learning-based methods on each of the evaluation sequences in the datasets. Additionally, it yields the smallest translation error on both the KITTI and Ford datasets, as well as the second lowest translation error on our in-house Freiburg dataset. LCDNet achieves the second best performance in most evaluations while achieving the lowest translation error on the Freiburg dataset. This result shows that while the feature extraction architecture and the training scheme play an important role, leveraging the cross-modal attention matrices from the transformer architecture and the panoptic information during training further improves the point cloud registration performance. While LiDAR-Iris achieves the lowest rotation error across all the handcrafted methods, it only estimates the yaw angle instead of the full 6-DoF transformation.
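As an illustration of these two metrics, the following sketch computes the rotation error as the angle of the residual rotation between the estimated and ground truth transformations, and the translation error as the Euclidean distance between their translation vectors. The exact error definitions used for Table I are not spelled out here, so this formulation, the 4x4 homogeneous matrix convention, and all function names are assumptions.

```python
import numpy as np

def registration_errors(T_est, T_gt):
    """Rotation error in degrees and translation error in meters (assumed definitions)."""
    T_est = np.asarray(T_est, dtype=float)
    T_gt = np.asarray(T_gt, dtype=float)
    # Residual rotation between ground truth and estimate.
    R_delta = T_gt[:3, :3].T @ T_est[:3, :3]
    # Angle of a rotation matrix from its trace; clip guards against rounding.
    cos_angle = np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)
    rot_err_deg = float(np.degrees(np.arccos(cos_angle)))
    trans_err_m = float(np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3]))
    return rot_err_deg, trans_err_m

def mean_registration_errors(pairs):
    """Average both errors over a list of (T_est, T_gt) positive pairs."""
    errors = np.array([registration_errors(T_est, T_gt) for T_est, T_gt in pairs])
    return errors.mean(axis=0)

def yaw_transform(yaw_deg, translation):
    """Hypothetical helper: 4x4 transform with a rotation about the z-axis."""
    yaw = np.radians(yaw_deg)
    T = np.eye(4)
    T[:2, :2] = [[np.cos(yaw), -np.sin(yaw)], [np.sin(yaw), np.cos(yaw)]]
    T[:3, 3] = translation
    return T
```

For example, an estimate that is off by a 10 degree yaw and a (3, 4, 0) m offset from an identity ground truth gives a rotation error of 10 degrees and a translation error of 5 m.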

IV-D Ablation Studies

In this section, we present ablation studies to analyze the major design choices in the PADLoC architecture. As the RANSAC-based point cloud registration described in Sec. IV-C is applied only during inference and does not impact the training stage, all the experiments reported in this section do not exploit RANSAC.

Confidence Weighting: We investigate the effect of different weighting schemes on the performance of both the loop closure detection and point cloud registration tasks. In Table II, we present the average precision (AP) as well as the rotation and translation registration errors for the six weighting methods: uniform weights, corresponding to unweighted SVD; column sum, the method used in LCDNet [cattaneo2022lcdnet], where the weights are the sums along the columns of the matching matrix; and the diversity metrics from Sec. 3, i.e., the Shannon entropy, the order-r Hill number with r = 2 and r = 4, and the Berger-Parker index. We observe that both the Hill numbers and the Berger-Parker index outperform the other confidence weighting methods. Because the Berger-Parker index yields a substantially smaller translation error (1.43 m compared to 2.00 m for the next best method), we use this method in our final design.

Method          AP     Rotation error [°]   Translation error [m]
Uniform         0.73   4.63                 3.76
Column sum      0.76   6.34                 3.62
Shannon         0.50   21.86                3.99
Hill (r=2)      0.89   2.45                 2.00
Hill (r=4)      0.84   2.47                 2.12
Berger-Parker   0.81   2.35                 1.43

Average precision (AP) of loop closure detection as well as the mean error of point cloud registration, evaluated on KITTI sequence 08 for different weightings used in SVD.

TABLE II: Ablation study on confidence weights
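To make the diversity metrics compared in Table II more concrete, the sketch below evaluates them on a single row of a matching matrix, treated as a probability distribution over candidate target points. It uses the standard textbook definitions of the indices. How PADLoC converts these values into SVD weights is not reproduced here, and the function names are assumptions for illustration.

```python
import numpy as np

def shannon_entropy(p, eps=1e-12):
    """Shannon entropy -sum(p * log p); eps avoids log(0)."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def hill_number(p, r):
    """Order-r Hill number (sum(p^r))^(1 / (1 - r)), defined here for r != 1."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p ** r) ** (1.0 / (1.0 - r)))

def berger_parker(p):
    """Berger-Parker index: the largest matching probability in the row."""
    return float(np.max(np.asarray(p, dtype=float)))

# A peaked row (one clear match) versus a uniform row (ambiguous match).
peaked = [1.0, 0.0, 0.0, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]
for name, row in (("peaked", peaked), ("uniform", uniform)):
    print(name, shannon_entropy(row), hill_number(row, 2), berger_parker(row))
```

A peaked row has low entropy, a Hill number close to 1, and a Berger-Parker index close to 1, whereas a uniform row over four candidates has a Hill number of 4 and a Berger-Parker index of 0.25. Under the assumption that confident matches are those with concentrated probability mass, low diversity corresponds to high confidence.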

Effect of Losses: To demonstrate the efficacy of our proposed panoptic loss and the impact of formulating all losses in a bidirectional manner, we consecutively add them to the original geometric loss. We present the results for both the loop closure detection and point cloud registration tasks in Table III. We observe that adding the proposed panoptic losses increases the average loop closure detection precision by further constraining which points can be matched together based on their semantic and instance labels. Furthermore, by including the second matching and registration head, along with its corresponding reverse losses as illustrated in the bottom row, the added bidirectional consistency constraint yields the highest AP and the smallest registration errors.

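Losses                                 AP     Rotation error [°]   Translation error [m]
Geometric                              0.70   3.09                 1.62
Geometric + panoptic                   0.78   3.36                 1.71
Geometric + panoptic + bidirectional   0.81   2.35                 1.43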

Average precision (AP) of loop closure detection and the mean error of point cloud registration, evaluated on KITTI sequence 08 for the different loss functions.

TABLE III: Influence of the loss functions
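The bidirectional constraint rests on the fact that the source-to-target and target-to-source transformations must be inverses of each other, so their product should be the identity. The sketch below measures the deviation from that property with a Frobenius norm. This residual is only an assumed illustration of the constraint, not the loss formulation used in training, and the helper names are hypothetical.

```python
import numpy as np

def inverse_consistency_residual(T_st, T_ts):
    """Frobenius norm of (T_st @ T_ts - I); zero when the two are exact inverses."""
    T_st = np.asarray(T_st, dtype=float)
    T_ts = np.asarray(T_ts, dtype=float)
    return float(np.linalg.norm(T_st @ T_ts - np.eye(4)))

def rigid_transform(yaw_deg, translation):
    """Hypothetical helper: 4x4 rigid transform with a rotation about the z-axis."""
    yaw = np.radians(yaw_deg)
    T = np.eye(4)
    T[:2, :2] = [[np.cos(yaw), -np.sin(yaw)], [np.sin(yaw), np.cos(yaw)]]
    T[:3, 3] = translation
    return T

T_forward = rigid_transform(30.0, [1.0, 2.0, 0.5])
consistent = inverse_consistency_residual(T_forward, np.linalg.inv(T_forward))
inconsistent = inverse_consistency_residual(T_forward, np.eye(4))
print(consistent, inconsistent)
```

A pair of heads that agree produces a residual near zero, while any disagreement between the forward and backward estimates yields a positive value that could be penalized.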

V Conclusion

In this paper, we proposed the novel PADLoC architecture for LiDAR-based joint loop closure detection and point cloud registration. PADLoC is composed of a common feature extractor, a global descriptor head, and a transformer-based matching and registration module. Unlike previous approaches, we feed different inputs as value, query, and key to the transformer encoder, exploiting its internal structure. We further introduced a new loss function that leverages ground truth panoptic annotations by penalizing matches between points from different semantic classes as well as across multiple objects, and validated its positive impact. Through extensive experimental evaluations, we demonstrated the efficacy of PADLoC compared to both handcrafted and learning-based methods. In particular, we showed that the principles behind attention mechanisms can be used to design transformer-based models that have lower complexity than full encoder-decoder architectures and yield more accurate results. Future work will focus on exploiting panoptic information in an online manner and applying the matching approach of PADLoC to point cloud registration tasks in other domains.