Geometric Transformer for Fast and Robust Point Cloud Registration

by   Zheng Qin, et al.

We study the problem of extracting accurate correspondences for point cloud registration. Recent keypoint-free methods bypass the detection of repeatable keypoints, which is difficult in low-overlap scenarios, and show great potential in registration. They seek correspondences over downsampled superpoints, which are then propagated to dense points. Superpoints are matched based on whether their neighboring patches overlap. Such sparse and loose matching requires contextual features that capture the geometric structure of the point clouds. We propose Geometric Transformer to learn geometric features for robust superpoint matching. It encodes pair-wise distances and triplet-wise angles, making it robust in low-overlap cases and invariant to rigid transformation. The simplistic design attains surprisingly high matching accuracy such that no RANSAC is required in the estimation of the alignment transformation, leading to a 100-fold acceleration. Our method improves the inlier ratio by 17%∼30% and the registration recall by over 7% on the challenging 3DLoMatch benchmark. The code and models will be released at <>.




1 Introduction

Point cloud registration is a fundamental task in graphics, vision and robotics. Given two partially overlapping 3D point clouds, the goal is to estimate a rigid transformation that aligns them. The problem has gained renewed interest recently thanks to the fast growing of 3D point representation learning and differentiable optimization.

The recent advances have been dominated by learning-based, correspondence-based methods [deng2018ppfnet, gojcic2019perfect, choy2019fully, bai2020d3feat, huang2021predator, yu2021cofinet]. A neural network is trained to extract point correspondences between two input point clouds, based on which an alignment transformation is calculated with a robust estimator, e.g., RANSAC. Most correspondence-based methods rely on keypoint detection [choy2019fully, bai2020d3feat, ao2021spinnet, huang2021predator]. However, it is challenging to detect repeatable keypoints across two point clouds, especially when they have a small overlapping area. This usually results in a low inlier ratio in the putative correspondences.

Inspired by the recent advances in image matching [rocco2018neighbourhood, zhou2021patch2pix, sun2021loftr], keypoint-free methods [yu2021cofinet] downsample the input point clouds into superpoints and then match them through examining whether their local neighborhood (patch) overlaps. Such superpoint (patch) matching is then propagated to individual points, yielding dense point correspondences. Consequently, the accuracy of dense point correspondences highly depends on that of superpoint matches.


Figure 1: Given two low-overlap point clouds, GeoTransformer improves inlier ratio over vanilla transformer significantly, both for superpoint (patch) level (left) and for dense point level (right). A few representative patch correspondences are visualized with distinct colors. Notice how GeoTransformer preserves the spatial consistency of the matching patches across two point clouds. It corrects the wrongly matched patches around the symmetric corners of the chair back (see the yellow point cloud).

Superpoint matching is sparse and loose. The upside is that it reduces strict point matching into loose patch overlapping, thus relaxing the repeatability requirement. Meanwhile, patch overlapping is a more reliable and informative constraint than distance-based point matching for learning correspondence; consider that two spatially close points could be geodesically distant. On the other hand, superpoint matching calls for features capturing more global context.

To this end, Transformer [vaswani2017attention] has been adopted [wang2019deep, yu2021cofinet] to encode contextual information in point cloud registration. However, vanilla transformer overlooks the geometric structure of the point clouds, which makes the learned features geometrically less discriminative and induces numerous outlier matches (Fig. 1(top)). Although one can inject positional embeddings [zhao2021point, yang2019modeling], the coordinate-based encoding is transformation-variant, which is problematic when registering point clouds given in arbitrary poses. We advocate that a point transformer for the registration task should be learned with the geometric structure of the point clouds so as to extract transformation-invariant geometric features. We propose Geometric Transformer, or GeoTransformer for short, for 3D point clouds, which encodes only distances of point pairs and angles in point triplets.

Given a superpoint, we learn a non-local representation through geometrically “pinpointing” it w.r.t. all other superpoints based on pair-wise distances and triplet-wise angles. Self-attention mechanism is utilized to weigh the importance of those anchoring superpoints. Since distances and angles are invariant to rigid transformation, GeoTransformer learns geometric structure of point clouds efficiently, leading to highly robust superpoint matching even in low-overlap scenarios. Fig. 1(left) demonstrates that GeoTransformer significantly improves the inlier ratio of superpoint (patch) correspondences. For better convergence, we devise an overlap-aware circle loss to make GeoTransformer focus on superpoint pairs with higher patch overlap.

Benefitting from the high-quality superpoint matches, our method attains high-inlier-ratio dense point correspondences (Fig. 1(right)) using an optimal transport layer [sarlin2020superglue], as well as highly robust and accurate registration without relying on RANSAC. Therefore, the registration part of our method runs extremely fast, e.g., 0.01 s for two point clouds with 5K correspondences, 100 times faster than RANSAC. Extensive experiments on both indoor and outdoor benchmarks [zeng20173dmatch, geiger2012we] demonstrate the superiority of GeoTransformer. Our method improves the inlier ratio by 17%∼30% and the registration recall by over 7% on the challenging 3DLoMatch benchmark [huang2021predator]. Our main contributions are:

  • A fast and accurate point cloud registration method which is both keypoint-free and RANSAC-free.

  • A geometric transformer which learns transformation-invariant geometric representation of point clouds for robust superpoint matching.

  • An overlap-aware circle loss which reweights the loss of each superpoint match according to the patch overlap ratio for better convergence.

2 Related Work

Correspondence-based Methods.   Our work follows the line of the correspondence-based methods [deng2018ppfnet, deng2018ppf, gojcic2019perfect, choy2019fully]. They first extract correspondences between two point clouds and then recover the transformation with robust pose estimators, e.g., RANSAC. Thanks to the robust estimators, they achieve state-of-the-art performance in indoor and outdoor scene registration. These methods can be further categorized into two classes according to how they extract correspondences. The first class aims to detect more repeatable keypoints [bai2020d3feat, huang2021predator] and learn more powerful descriptors for the keypoints [choy2019fully, ao2021spinnet, wang2021you]. The second class [yu2021cofinet] retrieves correspondences without keypoint detection by considering all possible matches. Our method follows the detection-free methods and improves the accuracy of correspondences by leveraging the geometric information.

Direct Registration Methods.   Recently, direct registration methods have emerged. They estimate the transformation with a neural network in an end-to-end manner. These methods can be further classified into two classes. The first class [wang2019deep, wang2019prnet, yew2020rpm, fu2021robust] follows the idea of ICP [besl1992method], which iteratively establishes soft correspondences and computes the transformation with differentiable weighted SVD. The second class [aoki2019pointnetlk, huang2020feature, xu2021omnet] first extracts a global feature vector for each point cloud and regresses the transformation with the global feature vectors. Although direct registration methods have achieved promising results on single synthetic shapes, they could fail in large-scale scenes.


Deep Robust Estimators.   As traditional robust estimators such as RANSAC suffer from slow convergence and instability in case of high outlier ratio, deep robust estimators [pais20203dregnet, choy2020deep, bai2021pointdsc] have been proposed as alternatives. They usually contain a classification network to reject outliers and an estimation network to compute the transformation. Compared with traditional robust estimators, they achieve improvements in both accuracy and speed. However, they require training a specific network. In comparison, our method achieves fast and accurate registration with a parameter-free local-to-global registration scheme.

3 Method


Figure 2: Pipeline overview in four stages: (1) Feature Extraction, (2) Superpoint Matching, (3) Point Matching, (4) Local-to-Global Registration. The backbone downsamples the input point clouds and learns features in multiple resolution levels. The Superpoint Matching Module extracts high-quality superpoint correspondences between the superpoint sets $\hat{\mathcal{P}}$ and $\hat{\mathcal{Q}}$ using the Geometric Transformer, which iteratively encodes intra-point-cloud geometric structures and inter-point-cloud geometric consistency. The superpoint correspondences are then propagated to the dense points $\tilde{\mathcal{P}}$ and $\tilde{\mathcal{Q}}$ by the Point Matching Module. Finally, the transformation is computed with a local-to-global registration method.

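Given two point clouds $\mathcal{P} = \{\mathbf{p}_i \in \mathbb{R}^3 \mid i = 1, \dots, N\}$ and $\mathcal{Q} = \{\mathbf{q}_i \in \mathbb{R}^3 \mid i = 1, \dots, M\}$, our goal is to estimate a rigid transformation $\mathbf{T} = \{\mathbf{R}, \mathbf{t}\}$ which aligns the two point clouds, with a 3D rotation $\mathbf{R} \in SO(3)$ and a 3D translation $\mathbf{t} \in \mathbb{R}^3$. The transformation can be solved by:

$$\min_{\mathbf{R}, \mathbf{t}} \sum_{(\mathbf{p}^*_{x_i}, \mathbf{q}^*_{y_i}) \in \mathcal{C}^*} \lVert \mathbf{R} \cdot \mathbf{p}^*_{x_i} + \mathbf{t} - \mathbf{q}^*_{y_i} \rVert_2^2. \tag{1}$$

Here $\mathcal{C}^*$ is the set of ground-truth correspondences between $\mathcal{P}$ and $\mathcal{Q}$. Since $\mathcal{C}^*$ is unknown in reality, we need to first establish point correspondences between two point clouds and then estimate the alignment transformation.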

Our method adopts the hierarchical correspondence paradigm which finds correspondences in a coarse-to-fine manner. We adopt KPConv-FPN to simultaneously downsample the input point clouds and extract point-wise features (Sec. 3.1). The first and the last (coarsest) level downsampled points correspond to the dense points and the superpoints to be matched. A Superpoint Matching Module is used to extract superpoint correspondences whose neighboring local patches overlap with each other (Sec. 3.2). Based on that, a Point Matching Module then refines the superpoint correspondences to dense points (Sec. 3.3). At last, the alignment transformation is recovered from the dense correspondences without relying on RANSAC (Sec. 3.4). The pipeline is illustrated in Fig. 2.

3.1 Superpoint Sampling and Feature Extraction

We utilize the KPConv-FPN backbone [thomas2019kpconv, lin2017feature] to extract multi-level features for the point clouds. A byproduct of the point feature learning is point downsampling. We work on downsampled points since point cloud registration can actually be pinned down by the correspondences of a much coarser subset of points. The original point clouds are usually too dense so that point-wise correspondences are redundant and sometimes too clustered to be useful.

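The points corresponding to the coarsest resolution, denoted by $\hat{\mathcal{P}}$ and $\hat{\mathcal{Q}}$, are treated as superpoints to be matched. The associated learned features are denoted as $\hat{\mathbf{F}}^{\mathcal{P}} \in \mathbb{R}^{|\hat{\mathcal{P}}| \times \hat{d}}$ and $\hat{\mathbf{F}}^{\mathcal{Q}} \in \mathbb{R}^{|\hat{\mathcal{Q}}| \times \hat{d}}$. The dense point correspondences are computed at $1/2$ of the original resolution, i.e., the first level downsampled points denoted by $\tilde{\mathcal{P}}$ and $\tilde{\mathcal{Q}}$. Their learned features are represented by $\tilde{\mathbf{F}}^{\mathcal{P}} \in \mathbb{R}^{|\tilde{\mathcal{P}}| \times \tilde{d}}$ and $\tilde{\mathbf{F}}^{\mathcal{Q}} \in \mathbb{R}^{|\tilde{\mathcal{Q}}| \times \tilde{d}}$.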

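For each superpoint, we construct a local patch of points around it using the point-to-node grouping strategy [li2018so, yu2021cofinet]. In particular, each point in the dense point set $\tilde{\mathcal{P}}$ and its features from $\tilde{\mathbf{F}}^{\mathcal{P}}$ are assigned to its nearest superpoint in the geometric space:

$$\mathcal{G}^{\mathcal{P}}_i = \{\tilde{\mathbf{p}} \in \tilde{\mathcal{P}} \mid i = \operatorname{argmin}_j \lVert \tilde{\mathbf{p}} - \hat{\mathbf{p}}_j \rVert_2,\ \hat{\mathbf{p}}_j \in \hat{\mathcal{P}}\}. \tag{2}$$

This essentially leads to a Voronoi decomposition of the input point cloud seeded by superpoints. The feature matrix associated with the points in $\mathcal{G}^{\mathcal{P}}_i$ is denoted as $\mathbf{F}^{\mathcal{P}}_i \subset \tilde{\mathbf{F}}^{\mathcal{P}}$. The superpoints with an empty patch are removed. The patches $\{\mathcal{G}^{\mathcal{Q}}_i\}$ and the feature matrices $\{\mathbf{F}^{\mathcal{Q}}_i\}$ for $\mathcal{Q}$ are computed and denoted in a similar way. In what follows, the terms “superpoint” and “patch” will be used interchangeably unless otherwise noted.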
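The point-to-node grouping described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation; the helper name `point_to_node_grouping` and the toy coordinates are assumptions made for the example.

```python
import numpy as np

def point_to_node_grouping(points, superpoints):
    """Assign every dense point to its nearest superpoint.

    Hypothetical helper for illustration. Returns the owner index of each
    point and a dict {superpoint index: indices of its patch points};
    superpoints with an empty patch are left out of the dict.
    """
    # Squared distances between all points and all superpoints: (N, S).
    d2 = ((points[:, None, :] - superpoints[None, :, :]) ** 2).sum(axis=-1)
    owner = d2.argmin(axis=1)
    patches = {}
    for j in range(len(superpoints)):
        idx = np.flatnonzero(owner == j)
        if idx.size > 0:  # drop superpoints whose patch is empty
            patches[j] = idx
    return owner, patches

# Toy data: four dense points, three superpoints (the last one is far away).
points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                   [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
superpoints = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
owner, patches = point_to_node_grouping(points, superpoints)
```

The dense all-pairs distance matrix is fine for a toy example; for real scans a spatial index would likely be preferable.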

3.2 Superpoint Matching Module


Figure 3: Left: The structure of geometric self-attention module. Right: The computation graph of geometric self-attention.

Geometric Transformer.

Global context has proven critical in many computer vision tasks [dosovitskiy2020image, sun2021loftr, yu2021cofinet]. For this reason, transformer has been adopted to leverage global contextual information for point cloud registration. However, existing methods [wang2019deep, huang2021predator, yu2021cofinet] usually feed the transformer with only high-level point cloud features and do not explicitly encode the geometric structure. This makes the learned features geometrically less discriminative, which causes severe matching ambiguity and numerous outlier matches, especially in low-overlap cases. A straightforward recipe is to explicitly inject positional embeddings [yang2019modeling, zhao2021point] of 3D point coordinates. However, the resultant coordinate-based transformers are naturally transformation-variant, while registration requires transformation invariance since the input point clouds can be in arbitrary poses.

To this end, we propose Geometric Transformer which not only encodes high-level point features but also explicitly captures intra-point-cloud geometric structures and inter-point-cloud geometric consistency. GeoTransformer is composed of a geometric self-attention module for learning intra-point-cloud features and a feature-based cross-attention module for modeling inter-point-cloud consistency. The two modules are interleaved $L$ times to extract hybrid features $\hat{\mathbf{H}}^{\mathcal{P}}$ and $\hat{\mathbf{H}}^{\mathcal{Q}}$ for reliable superpoint matching (see Fig. 2 (bottom left)).

Geometric self-attention.

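We design a geometric self-attention to learn the global correlations in both feature and geometric spaces among the superpoints for each point cloud. In the following, we describe the computation for the superpoints $\hat{\mathcal{P}}$ and the same goes for $\hat{\mathcal{Q}}$. Given the input feature matrix $\mathbf{X} \in \mathbb{R}^{|\hat{\mathcal{P}}| \times d_t}$, the output feature matrix $\mathbf{Z} \in \mathbb{R}^{|\hat{\mathcal{P}}| \times d_t}$ is the weighted sum of all projected input features:

$$\mathbf{z}_i = \sum_{j=1}^{|\hat{\mathcal{P}}|} a_{i,j} (\mathbf{x}_j \mathbf{W}^V), \tag{3}$$

where the weight coefficient $a_{i,j}$ is computed by a row-wise softmax on the attention score $e_{i,j}$, and $e_{i,j}$ is computed as:

$$e_{i,j} = \frac{(\mathbf{x}_i \mathbf{W}^Q)(\mathbf{x}_j \mathbf{W}^K + \mathbf{r}_{i,j} \mathbf{W}^R)^T}{\sqrt{d_t}}. \tag{4}$$

Here, $\mathbf{r}_{i,j} \in \mathbb{R}^{d_t}$ is a geometric structure embedding to be described next. $\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V, \mathbf{W}^R \in \mathbb{R}^{d_t \times d_t}$ are the respective projection matrices for queries, keys, values and geometric structure embeddings. Fig. 3 shows the structure and the computation of geometric self-attention.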
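To make the attention computation concrete, here is a single-head NumPy sketch in which the geometric embedding is added to the keys before the dot product with the queries. It is an illustration under stated assumptions, not the authors' code: the function names, the random inputs, and the single-head layout are all choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def geometric_self_attention(x, r, w_q, w_k, w_v, w_r):
    """x: (n, d) superpoint features; r: (n, n, d) pairwise geometric embeddings."""
    q = x @ w_q            # queries, (n, d)
    k = x @ w_k            # keys, (n, d)
    v = x @ w_v            # values, (n, d)
    g = r @ w_r            # projected geometric embeddings, (n, n, d)
    d = q.shape[-1]
    # score[i, j] = q_i . (k_j + g_ij) / sqrt(d)
    scores = (q @ k.T + np.einsum("id,ijd->ij", q, g)) / np.sqrt(d)
    attn = softmax(scores, axis=-1)   # row-wise softmax
    return attn @ v, attn

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))
r = rng.normal(size=(n, n, d))
w_q, w_k, w_v, w_r = (rng.normal(size=(d, d)) for _ in range(4))
z, attn = geometric_self_attention(x, r, w_q, w_k, w_v, w_r)
# With a zero geometric embedding the module reduces to vanilla self-attention.
z_plain, attn_plain = geometric_self_attention(x, np.zeros_like(r), w_q, w_k, w_v, w_r)
```

The zero-embedding call shows that the geometric term is a purely additive bias on the keys, which is why vanilla self-attention is recovered as a special case.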

We design a novel geometric structure embedding to encode the transformation-invariant geometric structure of the superpoints. The core idea is to leverage the distances and angles computed with the superpoints which are consistent across different point clouds of the same scene. Given two superpoints $\hat{\mathbf{p}}_i, \hat{\mathbf{p}}_j \in \hat{\mathcal{P}}$, their geometric structure embedding consists of a pair-wise distance embedding and a triplet-wise angular embedding, which will be described below.

(1) Pair-wise Distance Embedding. Given the distance $\rho_{i,j} = \lVert \hat{\mathbf{p}}_i - \hat{\mathbf{p}}_j \rVert_2$ between $\hat{\mathbf{p}}_i$ and $\hat{\mathbf{p}}_j$, the distance embedding $\mathbf{r}^D_{i,j}$ between them is computed by applying a sinusoidal function [vaswani2017attention] on $\rho_{i,j} / \sigma_d$. Here, $\sigma_d$ is a hyper-parameter used to tune the sensitivity on distance variations. Please refer to Appx. A.1 for detailed computation.

(2) Triplet-wise Angular Embedding. We compute angular embedding with triplets of superpoints. We first select the $k$ nearest neighbors $\mathcal{K}_i$ of $\hat{\mathbf{p}}_i$. For each $\hat{\mathbf{p}}_x \in \mathcal{K}_i$, we compute the angle $\alpha^x_{i,j} = \angle(\Delta_{x,i}, \Delta_{j,i})$, where $\Delta_{i,j} := \hat{\mathbf{p}}_i - \hat{\mathbf{p}}_j$. The triplet-wise angular embedding $\mathbf{r}^A_{i,j,x}$ is then computed with a sinusoidal function on $\alpha^x_{i,j} / \sigma_a$, with $\sigma_a$ controlling the sensitivity on angular variations.

Finally, the geometric structure embedding $\mathbf{r}_{i,j}$ is computed by aggregating the pair-wise distance embedding and the triplet-wise angular embedding:

$$\mathbf{r}_{i,j} = \mathbf{r}^D_{i,j} \mathbf{W}^D + \max_x \{ \mathbf{r}^A_{i,j,x} \mathbf{W}^A \}, \tag{5}$$

where $\mathbf{W}^D, \mathbf{W}^A \in \mathbb{R}^{d_t \times d_t}$ are the respective projection matrices for the two types of embeddings. We use max pooling here to improve the robustness to the varying nearest neighbors of a superpoint due to self-occlusion.

Fig. 4 illustrates the computation of geometric structure embedding.
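The invariance claim behind this embedding can be checked numerically. The NumPy sketch below computes pair-wise distances and triplet-wise angles for a small random point set and confirms that both are unchanged by a random rigid transformation. All function names, the value `sigma_d = 0.2`, and the concatenated sine/cosine layout of the sinusoidal embedding are assumptions for illustration; the exact form used by the authors is the one given in their appendix.

```python
import numpy as np

def pairwise_distances(pts):
    return np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

def triplet_angles(pts, k):
    """angles[i, j, x]: angle at point i between the direction to its x-th
    nearest neighbor and the direction to point j, in radians."""
    dist = pairwise_distances(pts)
    knn = np.argsort(dist, axis=1)[:, 1:k + 1]      # (n, k), column 0 is the point itself
    to_j = pts[None, :, :] - pts[:, None, :]        # to_j[i, j] = p_j - p_i
    to_x = pts[knn] - pts[:, None, :]               # to_x[i, x] = p_x - p_i
    a = to_x[:, None, :, :]                         # (n, 1, k, 3)
    b = to_j[:, :, None, :]                         # (n, n, 1, 3)
    sin = np.linalg.norm(np.cross(a, b), axis=-1)
    cos = (a * b).sum(axis=-1)
    return np.arctan2(sin, cos)                     # (n, n, k)

def sinusoidal_embedding(values, dim):
    """Assumed layout: sines of all frequencies followed by cosines (dim must be even)."""
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    angles = values[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 3))
# Random rotation from a QR decomposition, flipped if needed so that det = +1.
rot, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(rot) < 0:
    rot[:, 0] *= -1
moved = pts @ rot.T + np.array([0.5, -1.0, 2.0])

sigma_d = 0.2  # assumed value, for illustration only
dist_emb = sinusoidal_embedding(pairwise_distances(pts) / sigma_d, dim=8)
```

Because the embeddings are functions of distances and angles only, any downstream feature built from them inherits the same invariance.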


Figure 4: An illustration of the distance-and-angle-based geometric structure encoding and its computation.

Feature-based cross-attention.

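Cross-attention is a typical module for the point cloud registration task [huang2021predator, wang2019deep, yu2021cofinet], used to perform feature exchange between two input point clouds. Given the self-attention feature matrices $\mathbf{X}^{\mathcal{P}}$, $\mathbf{X}^{\mathcal{Q}}$ for $\hat{\mathcal{P}}$, $\hat{\mathcal{Q}}$, respectively, the cross-attention feature matrix $\mathbf{Z}^{\mathcal{P}}$ of $\hat{\mathcal{P}}$ is computed with the features of $\hat{\mathcal{Q}}$:

$$\mathbf{z}^{\mathcal{P}}_i = \sum_{j=1}^{|\hat{\mathcal{Q}}|} a_{i,j} (\mathbf{x}^{\mathcal{Q}}_j \mathbf{W}^V). \tag{6}$$

Similarly, $a_{i,j}$ is computed by a row-wise softmax on the cross-attention score $e_{i,j}$, and $e_{i,j}$ is computed as the feature correlation between $\mathbf{X}^{\mathcal{P}}$ and $\mathbf{X}^{\mathcal{Q}}$:

$$e_{i,j} = \frac{(\mathbf{x}^{\mathcal{P}}_i \mathbf{W}^Q)(\mathbf{x}^{\mathcal{Q}}_j \mathbf{W}^K)^T}{\sqrt{d_t}}. \tag{7}$$
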
The cross-attention features for are computed in the same way. While the geometric self-attention module encodes the transformation-invariant geometric structure for each individual point cloud, the feature-based cross-attention module can model the geometric consistency across the two point clouds. The resultant hybrid features are both invariant to transformation and robust for reasoning correspondence.

Superpoint matching.

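To find the superpoint correspondences, we propose a matching scheme based on global feature correlation. We first normalize the hybrid superpoint features $\hat{\mathbf{H}}^{\mathcal{P}}$ and $\hat{\mathbf{H}}^{\mathcal{Q}}$ onto a unit hypersphere and compute a Gaussian correlation matrix $\mathbf{S} \in \mathbb{R}^{|\hat{\mathcal{P}}| \times |\hat{\mathcal{Q}}|}$ with $s_{i,j} = \exp(-\lVert \hat{\mathbf{h}}^{\mathcal{P}}_i - \hat{\mathbf{h}}^{\mathcal{Q}}_j \rVert_2^2)$. In practice, some patches of a point cloud are less geometrically discriminative and have numerous similar patches in the other point cloud. Besides our powerful hybrid features, we also perform a dual-normalization operation [rocco2018neighbourhood, sun2021loftr] on $\mathbf{S}$ to further suppress ambiguous matches, leading to $\bar{\mathbf{S}}$ with

$$\bar{s}_{i,j} = \frac{s_{i,j}}{\sum_{k=1}^{|\hat{\mathcal{Q}}|} s_{i,k}} \cdot \frac{s_{i,j}}{\sum_{k=1}^{|\hat{\mathcal{P}}|} s_{k,j}}. \tag{8}$$

We found that this suppression can effectively eliminate wrong matches. Finally, we select the largest $N_c$ entries in $\bar{\mathbf{S}}$ as the superpoint correspondences:

$$\hat{\mathcal{C}} = \{(\hat{\mathbf{p}}_{x_i}, \hat{\mathbf{q}}_{y_i}) \mid (x_i, y_i) \in \operatorname{topk}_{x,y}(\bar{s}_{x,y})\}. \tag{9}$$

Due to the powerful geometric structure encoding of GeoTransformer, our method is able to achieve accurate registration in low-overlap cases and with fewer point correspondences, and most notably, in a RANSAC-free manner.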
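The superpoint matching steps (unit-sphere normalization, Gaussian correlation, dual normalization, and top-k selection) can be sketched as follows. This is an illustrative NumPy version with an assumed function name and toy one-hot features, not the released implementation.

```python
import numpy as np

def match_superpoints(feats_p, feats_q, num_matches):
    """Return the num_matches highest-scoring (i, j) pairs and the score matrix."""
    # Project the features onto the unit hypersphere.
    hp = feats_p / np.linalg.norm(feats_p, axis=1, keepdims=True)
    hq = feats_q / np.linalg.norm(feats_q, axis=1, keepdims=True)
    # Gaussian correlation: exp(-squared feature distance).
    d2 = ((hp[:, None, :] - hq[None, :, :]) ** 2).sum(axis=-1)
    s = np.exp(-d2)
    # Dual normalization: row-normalized score times column-normalized score.
    s_bar = (s / s.sum(axis=1, keepdims=True)) * (s / s.sum(axis=0, keepdims=True))
    # Indices of the largest entries of the flattened score matrix.
    flat = np.argsort(s_bar, axis=None)[::-1][:num_matches]
    rows, cols = np.unravel_index(flat, s_bar.shape)
    return list(zip(rows.tolist(), cols.tolist())), s_bar

# Toy features: the second set is a permutation of the first.
feats_p = np.eye(3)
feats_q = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
matches, s_bar = match_superpoints(feats_p, feats_q, num_matches=3)
```

A score survives dual normalization only if it dominates both its row and its column, so a patch that resembles many others is down-weighted in every pairing.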

3.3 Point Matching Module

Having obtained the superpoint correspondences, we extract point correspondences using a simple yet effective Point Matching Module. At point level, we use only local point features learned by the backbone. The rationale is that point level matching is mainly determined by the vicinities of the two points being matched, once the global ambiguity has been resolved by superpoint matching. This design choice improves the robustness.

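For each superpoint correspondence $\hat{\mathcal{C}}_i = (\hat{\mathbf{p}}_{x_i}, \hat{\mathbf{q}}_{y_i})$, an optimal transport layer [sarlin2020superglue] is used to extract the local dense point correspondences between the patches $\mathcal{G}^{\mathcal{P}}_{x_i}$ and $\mathcal{G}^{\mathcal{Q}}_{y_i}$. Specifically, we first compute a cost matrix $\mathbf{C}_i \in \mathbb{R}^{n_i \times m_i}$:

$$\mathbf{C}_i = \mathbf{F}^{\mathcal{P}}_{x_i} (\mathbf{F}^{\mathcal{Q}}_{y_i})^T / \sqrt{\tilde{d}}, \tag{10}$$

where $n_i = |\mathcal{G}^{\mathcal{P}}_{x_i}|$, $m_i = |\mathcal{G}^{\mathcal{Q}}_{y_i}|$. The cost matrix $\mathbf{C}_i$ is then augmented into $\bar{\mathbf{C}}_i$ by appending a new row and a new column as in [sarlin2020superglue], filled with a learnable dustbin parameter $\alpha$. We then utilize the Sinkhorn algorithm [sinkhorn1967concerning] on $\bar{\mathbf{C}}_i$ to compute a soft assignment matrix $\bar{\mathbf{Z}}_i$, which is then recovered to $\mathbf{Z}_i$ by dropping the last row and the last column. We use $\mathbf{Z}_i$ as the confidence matrix of the candidate matches and extract point correspondences via mutual top-$k$ selection, where a point match is selected if it is among the $k$ largest entries of both the row and the column that it resides in:

$$\mathcal{C}_i = \{(\mathcal{G}^{\mathcal{P}}_{x_i}(x_j), \mathcal{G}^{\mathcal{Q}}_{y_i}(y_j)) \mid (x_j, y_j) \in \operatorname{mutual\_topk}_{x,y}(z^i_{x,y})\}. \tag{11}$$

The point correspondences computed from each superpoint match are then collected together to form the final global dense point correspondences: $\mathcal{C} = \bigcup_{i=1}^{N_c} \mathcal{C}_i$.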
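The optimal transport step can be illustrated with a log-domain Sinkhorn iteration on a score matrix augmented by a dustbin row and column, followed by mutual top-k selection. The sketch below follows the marginal convention commonly used with dustbins (each real point carries unit mass, each dustbin can absorb all points of the other side); that convention, the function names, the fixed `alpha`, and the iteration count are assumptions for this example. In the method above `alpha` is learned.

```python
import numpy as np

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn_with_dustbin(scores, alpha, num_iters=100):
    """Soft assignment of shape (n + 1, m + 1); the last row/column are dustbins."""
    n, m = scores.shape
    aug = np.full((n + 1, m + 1), alpha, dtype=float)
    aug[:n, :m] = scores
    norm = -np.log(n + m)
    # Target log-marginals: unit mass per real point, larger mass for the dustbins.
    log_mu = np.concatenate([np.full(n, norm), [np.log(m) + norm]])
    log_nu = np.concatenate([np.full(m, norm), [np.log(n) + norm]])
    u = np.zeros(n + 1)
    v = np.zeros(m + 1)
    for _ in range(num_iters):
        u = log_mu - logsumexp(aug + v[None, :], axis=1)
        v = log_nu - logsumexp(aug + u[:, None], axis=0)
    return np.exp(aug + u[:, None] + v[None, :] - norm)

def mutual_topk(conf, k):
    """Keep entries that rank among the k largest of both their row and column."""
    row_rank = np.argsort(np.argsort(-conf, axis=1), axis=1)
    col_rank = np.argsort(np.argsort(-conf, axis=0), axis=0)
    mask = (row_rank < k) & (col_rank < k)
    return [tuple(ij) for ij in np.argwhere(mask).tolist()]

scores = 5.0 * np.eye(3)              # toy scores with three clear matches
assign = sinkhorn_with_dustbin(scores, alpha=1.0)
conf = assign[:-1, :-1]               # drop the dustbin row and column
matches = mutual_topk(conf, k=1)
```

With these toy scores the three diagonal pairs are the ones selected, and points with no good partner would instead send their mass to the dustbin.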

3.4 RANSAC-free Local-to-Global Registration

Previous methods generally rely on robust pose estimators to estimate the transformation since the putative correspondences are often predominated by outliers. Most robust estimators such as RANSAC suffer from slow convergence. Given the high inlier ratio of GeoTransformer, we are able to achieve robust registration without relying on robust estimators, which also greatly reduces computation cost.

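We design a local-to-global registration (LGR) scheme. As a hypothesize-and-verify approach, LGR comprises a local phase of transformation candidate generation and a global phase of transformation selection. In the local phase, we solve for a transformation $\mathbf{T}_i = \{\mathbf{R}_i, \mathbf{t}_i\}$ for each superpoint match using its local point correspondences $\mathcal{C}_i$:

$$\mathbf{R}_i, \mathbf{t}_i = \operatorname{argmin}_{\mathbf{R}, \mathbf{t}} \sum_{(\tilde{\mathbf{p}}_{x_j}, \tilde{\mathbf{q}}_{y_j}) \in \mathcal{C}_i} w^i_j \lVert \mathbf{R} \cdot \tilde{\mathbf{p}}_{x_j} + \mathbf{t} - \tilde{\mathbf{q}}_{y_j} \rVert_2^2. \tag{12}$$

This can be solved in closed form using weighted SVD [besl1992method]. The corresponding confidence score for each correspondence in the confidence matrix $\mathbf{Z}_i$ is used as the weight $w^i_j$. Benefitting from the high-quality correspondences, the transformations obtained in this phase are already very accurate. In the global phase, we select the transformation which admits the most inlier matches over the entire global point correspondences $\mathcal{C}$:

$$\mathbf{R}, \mathbf{t} = \operatorname{argmax}_{\mathbf{R}_i, \mathbf{t}_i} \sum_{(\tilde{\mathbf{p}}_{x_j}, \tilde{\mathbf{q}}_{y_j}) \in \mathcal{C}} [\![ \lVert \mathbf{R}_i \cdot \tilde{\mathbf{p}}_{x_j} + \mathbf{t}_i - \tilde{\mathbf{q}}_{y_j} \rVert_2^2 < \tau_a ]\!], \tag{13}$$

where $[\![\cdot]\!]$ is the Iverson bracket and $\tau_a$ is the acceptance radius. We then iteratively re-estimate the transformation with the surviving inlier matches for $N_r$ times by solving Eq. 12. As shown in Sec. 4.1, our approach achieves comparable registration accuracy with RANSAC but reduces the computation time by more than 100 times. Moreover, unlike deep robust estimators [choy2020deep, pais20203dregnet, bai2021pointdsc], our method is parameter-free and no network training is needed.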
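The local-to-global scheme lends itself to a compact sketch: a weighted SVD (Kabsch) solve per superpoint match, selection of the candidate with the most inliers over all correspondences, and a few refinement rounds on the surviving inliers. The code below is a simplified NumPy illustration under stated assumptions: the function names and the synthetic data are hypothetical, and the inlier test compares the unsquared residual against `radius` for readability.

```python
import numpy as np

def weighted_svd(src, dst, weights):
    """Closed-form weighted rigid alignment (Kabsch): dst ~ rot @ src + trans."""
    w = weights / weights.sum()
    src_c = (w[:, None] * src).sum(axis=0)
    dst_c = (w[:, None] * dst).sum(axis=0)
    h = (src - src_c).T @ (w[:, None] * (dst - dst_c))
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))     # guard against reflections
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return rot, dst_c - rot @ src_c

def local_to_global_registration(patch_matches, radius, num_refine=5):
    """patch_matches: list of (src, dst, weights), one tuple per superpoint match."""
    src = np.concatenate([m[0] for m in patch_matches])
    dst = np.concatenate([m[1] for m in patch_matches])
    wts = np.concatenate([m[2] for m in patch_matches])

    def inliers(rot, trans):
        return np.linalg.norm(src @ rot.T + trans - dst, axis=1) < radius

    # Local phase: one candidate per superpoint match with enough points.
    candidates = [weighted_svd(*m) for m in patch_matches if len(m[0]) >= 3]
    # Global phase: keep the candidate with the most inliers overall.
    rot, trans = max(candidates, key=lambda c: inliers(*c).sum())
    for _ in range(num_refine):
        mask = inliers(rot, trans)
        if mask.sum() < 3:
            break
        rot, trans = weighted_svd(src[mask], dst[mask], wts[mask])
    return rot, trans

rng = np.random.default_rng(0)
angle = np.pi / 6
gt_rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
gt_trans = np.array([0.3, -0.2, 0.5])

def make_patch(offset):
    p = rng.normal(size=(5, 3))
    return p, p @ gt_rot.T + gt_trans + offset, np.ones(5)

# Two clean patches and one patch of consistently wrong matches.
patch_matches = [make_patch(0.0), make_patch(0.0), make_patch(5.0)]
rot, trans = local_to_global_registration(patch_matches, radius=0.1)
```

Only one closed-form solve is needed per superpoint match, so the number of hypotheses is bounded by the number of matches rather than by a sampling budget.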

3.5 Loss Functions

The loss function $\mathcal{L} = \mathcal{L}_{oc} + \mathcal{L}_p$ is composed of an overlap-aware circle loss $\mathcal{L}_{oc}$ for superpoint matching and a point matching loss $\mathcal{L}_p$ for point matching.

Overlap-aware circle loss.

Existing methods [sun2021loftr, yu2021cofinet] usually formulate superpoint matching as a multi-label classification problem and adopt a cross-entropy loss with dual-softmax [sun2021loftr] or optimal transport [sarlin2020superglue, yu2021cofinet]. Each superpoint is assigned (classified) to one or many of the other superpoints, where the ground truth is computed based on patch overlap and it is very likely that one patch could overlap with multiple patches. By analyzing the gradients from the cross-entropy loss, we find that the positive classes with high confidence scores are suppressed by positive gradients in the multi-label classification (the detailed analysis is presented in Appx. C). This hinders the model from extracting reliable superpoint correspondences.

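To address this issue, we opt to extract superpoint descriptors in a metric learning fashion. A straightforward solution is to adopt a circle loss [sun2020circle] similar to [bai2020d3feat, huang2021predator]. However, the circle loss overlooks the differences between the positive samples and weights them equally. As a result, it struggles in matching patches with relatively low overlap. For this reason, we design an overlap-aware circle loss to focus the model on those matches with high overlap. We select the patches in $\mathcal{P}$ which have at least one positive patch in $\mathcal{Q}$ to form a set of anchor patches, $\mathcal{A}$. A pair of patches are positive if they share at least 10% overlap, and negative if they do not overlap. All other pairs are omitted. For each anchor patch $\mathcal{G}^{\mathcal{P}}_i \in \mathcal{A}$, we denote the set of its positive patches in $\mathcal{Q}$ as $\varepsilon^i_p$, and the set of its negative patches as $\varepsilon^i_n$. The overlap-aware circle loss on $\mathcal{P}$ is then defined as:

$$\mathcal{L}^{\mathcal{P}}_{oc} = \frac{1}{|\mathcal{A}|} \sum_{\mathcal{G}^{\mathcal{P}}_i \in \mathcal{A}} \log\Big[1 + \sum_{\mathcal{G}^{\mathcal{Q}}_j \in \varepsilon^i_p} e^{\lambda^j_i \beta^{i,j}_p (d^j_i - \Delta_p)} \cdot \sum_{\mathcal{G}^{\mathcal{Q}}_k \in \varepsilon^i_n} e^{\beta^{i,k}_n (\Delta_n - d^k_i)}\Big], \tag{14}$$

where $d^j_i = \lVert \hat{\mathbf{h}}^{\mathcal{P}}_i - \hat{\mathbf{h}}^{\mathcal{Q}}_j \rVert_2$ is the distance in the feature space, $\lambda^j_i = (o^j_i)^{1/2}$, and $o^j_i$ represents the overlap ratio between $\mathcal{G}^{\mathcal{P}}_i$ and $\mathcal{G}^{\mathcal{Q}}_j$. The positive and negative weights are computed for each sample individually with $\beta^{i,j}_p = \gamma (d^j_i - \Delta_p)$ and $\beta^{i,k}_n = \gamma (\Delta_n - d^k_i)$. The margin hyper-parameters are set to $\Delta_p = 0.1$ and $\Delta_n = 1.4$. The overlap-aware circle loss reweights the loss values on $\varepsilon^i_p$ based on the overlap ratio so that the patch pairs with higher overlap are given more importance. The same goes for the loss $\mathcal{L}^{\mathcal{Q}}_{oc}$ on $\mathcal{Q}$. The overall loss is $\mathcal{L}_{oc} = (\mathcal{L}^{\mathcal{P}}_{oc} + \mathcal{L}^{\mathcal{Q}}_{oc}) / 2$.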
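A small NumPy sketch may help to see how the overlap ratio reweights the positive term. It is written under several assumptions that should be read as such: the function name is hypothetical, the default margins and the scale `gamma` are illustrative values, the positive scaling is taken as the square root of the overlap ratio, and the per-sample weights are clamped at zero as in the original circle loss.

```python
import numpy as np

def overlap_aware_circle_loss(dist, overlap, pos_mask, neg_mask, gamma,
                              margin_pos=0.1, margin_neg=1.4):
    """dist, overlap, masks: (num_p, num_q) arrays over all patch pairs.

    Margins and gamma are assumed example values, not authoritative settings.
    """
    # Per-sample weights, clamped at zero (assumption, as in circle loss).
    beta_p = gamma * np.maximum(dist - margin_pos, 0.0)
    beta_n = gamma * np.maximum(margin_neg - dist, 0.0)
    lam = np.sqrt(overlap)  # overlap-aware scaling of the positive term
    pos_term = np.where(pos_mask, np.exp(lam * beta_p * (dist - margin_pos)), 0.0).sum(axis=1)
    neg_term = np.where(neg_mask, np.exp(beta_n * (margin_neg - dist)), 0.0).sum(axis=1)
    anchors = pos_mask.any(axis=1)  # rows with at least one positive patch
    return float(np.log1p(pos_term * neg_term)[anchors].mean())

# One anchor patch with one positive and one negative candidate.
pos = np.array([[True, False]])
neg = np.array([[False, True]])
high_overlap = np.array([[0.9, 0.0]])
low_overlap = np.array([[0.1, 0.0]])

good = overlap_aware_circle_loss(np.array([[0.1, 1.4]]), high_overlap, pos, neg, gamma=10.0)
bad = overlap_aware_circle_loss(np.array([[1.0, 0.5]]), high_overlap, pos, neg, gamma=10.0)
bad_low = overlap_aware_circle_loss(np.array([[1.0, 0.5]]), low_overlap, pos, neg, gamma=10.0)
```

In this toy case a badly embedded positive pair is penalized more when its overlap is high than when it is low, which is the intended reweighting.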

Point matching loss.

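The ground-truth point correspondences are relatively sparse because they are available only for downsampled point clouds. We simply use a negative log-likelihood loss [sarlin2020superglue] on the augmented assignment matrix $\bar{\mathbf{Z}}_i$ of each superpoint correspondence, whose last row and last column are the dustbins. During training, we randomly sample $N_g$ ground-truth superpoint correspondences $\{\hat{\mathcal{C}}^*_i\}$ instead of using the predicted ones. For each $\hat{\mathcal{C}}^*_i$, a set of ground-truth point correspondences $\mathcal{M}_i$ is extracted with a matching radius $\tau$. The sets of unmatched points in the two patches are denoted as $\mathcal{I}_i$ and $\mathcal{J}_i$. The individual point matching loss for $\hat{\mathcal{C}}^*_i$ is computed as:

$$\mathcal{L}_{p,i} = -\sum_{(x,y) \in \mathcal{M}_i} \log \bar{z}^i_{x,y} - \sum_{x \in \mathcal{I}_i} \log \bar{z}^i_{x, m_i+1} - \sum_{y \in \mathcal{J}_i} \log \bar{z}^i_{n_i+1, y}, \tag{15}$$

where $n_i$ and $m_i$ are the numbers of points in the two patches. The final loss is computed by averaging the individual loss over all sampled superpoint matches: $\mathcal{L}_p = \frac{1}{N_g} \sum_{i=1}^{N_g} \mathcal{L}_{p,i}$.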
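As a worked example of the negative log-likelihood above, the following sketch evaluates the loss on a tiny augmented assignment matrix. The function name and the numbers in the matrix are made up for illustration.

```python
import numpy as np

def point_matching_loss(assign, matches, unmatched_p, unmatched_q):
    """NLL over an augmented assignment matrix of shape (n + 1, m + 1).

    The last row and the last column are the dustbins: matched pairs are
    scored at their own entry, unmatched points at their dustbin entry.
    """
    log_z = np.log(assign)
    dust_row, dust_col = assign.shape[0] - 1, assign.shape[1] - 1
    loss = -sum(log_z[x, y] for x, y in matches)
    loss -= sum(log_z[x, dust_col] for x in unmatched_p)
    loss -= sum(log_z[dust_row, y] for y in unmatched_q)
    return float(loss)

# Two points per patch (n = m = 2) plus the dustbin row and column.
assign = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.1, 0.7, 0.2]])
loss = point_matching_loss(assign, matches=[(0, 0)], unmatched_p=[1], unmatched_q=[1])
```

Here the loss is the sum of three negative logs: one for the matched pair (0.8) and one for each unmatched point's dustbin entry (0.7 each).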

4 Experiments

We evaluate GeoTransformer on indoor 3DMatch [zeng20173dmatch] and 3DLoMatch [huang2021predator] benchmarks (Sec. 4.1) and outdoor KITTI odometry [geiger2012we] benchmark (Sec. 4.2). Specifically, we interleave the geometric self-attention module and the feature-based cross-attention module $L = 3$ times to learn hybrid features, with $k = 3$ nearest neighbors in the triplet-wise angular embedding. We use $N_c = 256$ superpoint matches to extract dense point correspondences. The alignment transformation is iteratively recomputed $N_r = 5$ times in LGR. See more implementation details in Appx. A.3.

4.1 Indoor Benchmarks: 3DMatch & 3DLoMatch


3DMatch [zeng20173dmatch] contains 62 scenes, among which 46 are used for training, 8 for validation and 8 for testing. We use the training data preprocessed by [huang2021predator] and evaluate on both 3DMatch and 3DLoMatch [huang2021predator] protocols. The point cloud pairs in 3DMatch have more than 30% overlap, while those in 3DLoMatch have low overlap of 10%∼30%.

Following [bai2020d3feat, huang2021predator], we evaluate the performance with three metrics: (1) Inlier Ratio (IR), the fraction of putative correspondences whose residuals are below a certain threshold (i.e., 0.1 m) under the ground-truth transformation, (2) Feature Matching Recall (FMR), the fraction of point cloud pairs whose inlier ratio is above a certain threshold (i.e., 5%), and (3) Registration Recall (RR), the fraction of point cloud pairs whose transformation error is smaller than a certain threshold (i.e., RMSE < 0.2 m).
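The first two metrics are simple to compute once correspondences and the ground-truth transformation are available. The sketch below uses hypothetical function names and takes the thresholds as explicit arguments rather than fixing them; the toy values are for illustration only.

```python
import numpy as np

def inlier_ratio(src, dst, rot, trans, threshold):
    """Fraction of correspondences with residual below threshold under (rot, trans)."""
    residual = np.linalg.norm(src @ rot.T + trans - dst, axis=1)
    return float((residual < threshold).mean())

def feature_matching_recall(inlier_ratios, threshold):
    """Fraction of point cloud pairs whose inlier ratio exceeds threshold."""
    return float((np.asarray(inlier_ratios) > threshold).mean())

# Four correspondences under the identity transformation; the last one is off by 1.0.
src = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
dst = src.copy()
dst[3, 0] += 1.0
ir = inlier_ratio(src, dst, np.eye(3), np.zeros(3), threshold=0.1)
fmr = feature_matching_recall([ir, 0.02, 0.3], threshold=0.05)
```

Three of the four toy correspondences are inliers, and two of the three inlier ratios passed to `feature_matching_recall` clear its threshold.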

                                 | 3DMatch                      | 3DLoMatch
# Samples                        | 5000  2500  1000   500   250 | 5000  2500  1000   500   250
Feature Matching Recall (%)
PerfectMatch [gojcic2019perfect] | 95.0  94.3  92.9  90.1  82.9 | 63.6  61.7  53.6  45.2  34.2
FCGF [choy2019fully]             | 97.4  97.3  97.0  96.7  96.6 | 76.6  75.4  74.2  71.7  67.3
D3Feat [bai2020d3feat]           | 95.6  95.4  94.5  94.1  93.1 | 67.3  66.7  67.0  66.7  66.5
SpinNet [ao2021spinnet]          | 97.6  97.2  96.8  95.5  94.3 | 75.3  74.9  72.5  70.0  63.6
Predator [huang2021predator]     | 96.6  96.6  96.5  96.3  96.5 | 78.6  77.4  76.3  75.7  75.3
YOHO [wang2021you]               | 98.2  97.6  97.5  97.7  96.0 | 79.4  78.1  76.3  73.8  69.1
CoFiNet [yu2021cofinet]          | 98.1  98.3  98.1  98.2  98.3 | 83.1  83.5  83.3  83.1  82.6
GeoTransformer (ours)            | 97.9  97.9  97.9  97.9  97.6 | 88.3  88.6  88.8  88.6  88.3
Inlier Ratio (%)
PerfectMatch [gojcic2019perfect] | 36.0  32.5  26.4  21.5  16.4 | 11.4  10.1   8.0   6.4   4.8
FCGF [choy2019fully]             | 56.8  54.1  48.7  42.5  34.1 | 21.4  20.0  17.2  14.8  11.6
D3Feat [bai2020d3feat]           | 39.0  38.8  40.4  41.5  41.8 | 13.2  13.1  14.0  14.6  15.0
SpinNet [ao2021spinnet]          | 47.5  44.7  39.4  33.9  27.6 | 20.5  19.0  16.3  13.8  11.1
Predator [huang2021predator]     | 58.0  58.4  57.1  54.1  49.3 | 26.7  28.1  28.3  27.5  25.8
YOHO [wang2021you]               | 64.4  60.7  55.7  46.4  41.2 | 25.9  23.3  22.6  18.2  15.0
CoFiNet [yu2021cofinet]          | 49.8  51.2  51.9  52.2  52.2 | 24.4  25.9  26.7  26.8  26.9
GeoTransformer (ours)            | 71.9  75.2  76.0  82.2  85.1 | 43.5  45.3  46.2  52.9  57.7
Registration Recall (%)
PerfectMatch [gojcic2019perfect] | 78.4  76.2  71.4  67.6  50.8 | 33.0  29.0  23.3  17.0  11.0
FCGF [choy2019fully]             | 85.1  84.7  83.3  81.6  71.4 | 40.1  41.7  38.2  35.4  26.8
D3Feat [bai2020d3feat]           | 81.6  84.5  83.4  82.4  77.9 | 37.2  42.7  46.9  43.8  39.1
SpinNet [ao2021spinnet]          | 88.6  86.6  85.5  83.5  70.2 | 59.8  54.9  48.3  39.8  26.8
Predator [huang2021predator]     | 89.0  89.9  90.6  88.5  86.6 | 59.8  61.2  62.4  60.8  58.1
YOHO [wang2021you]               | 90.8  90.3  89.1  88.6  84.5 | 65.2  65.5  63.2  56.5  48.0
CoFiNet [yu2021cofinet]          | 89.3  88.9  88.4  87.4  87.0 | 67.5  66.2  64.2  63.1  61.0
GeoTransformer (ours)            | 92.0  91.8  91.8  91.4  91.2 | 75.0  74.8  74.2  74.1  73.5
Table 1: Evaluation results on 3DMatch and 3DLoMatch.

Correspondence results.

We first compare the correspondence results of our method with the recent state of the arts: PerfectMatch [gojcic2019perfect], FCGF [choy2019fully], D3Feat [bai2020d3feat], SpinNet [ao2021spinnet], Predator [huang2021predator], YOHO [wang2021you] and CoFiNet [yu2021cofinet] in Tab. 1(top and middle). Following [bai2020d3feat, huang2021predator], we report the results with different numbers of correspondences. The details of the correspondence sampling schemes are given in Appx. A.3. For Feature Matching Recall, our method achieves improvements of at least 5 percentage points on 3DLoMatch, demonstrating its effectiveness in low-overlap cases. For Inlier Ratio, the improvements are even more prominent. It surpasses the baselines consistently by 7∼33 percentage points on 3DMatch and 17∼31 percentage points on 3DLoMatch. The gain is larger with fewer correspondences. It implies that our method extracts more reliable correspondences.

Registration results.

To evaluate the registration performance, we first compare the Registration Recall obtained by RANSAC in Tab. 1(bottom). Following [bai2020d3feat, huang2021predator], we run 50K RANSAC iterations to estimate the transformation. GeoTransformer attains new state-of-the-art results on both 3DMatch and 3DLoMatch. It outperforms the previous best by 1.2∼4.2 percentage points on 3DMatch and 7.5∼12.5 percentage points on 3DLoMatch, showing its efficacy in both high- and low-overlap scenarios. More importantly, our method is quite stable under different numbers of samples, so it does not require sampling a large number of correspondences to boost the performance as previous methods do [choy2019fully, ao2021spinnet, wang2021you, yu2021cofinet].

Model                            | Estimator    | #Samples | RR (%)      | Time (s)
                                 |              |          | 3DM   3DLM  | Model   Pose    Total
FCGF [choy2019fully]             | RANSAC-50k   | 5000     | 85.1  40.1  |  0.052  3.326   3.378
D3Feat [bai2020d3feat]           | RANSAC-50k   | 5000     | 81.6  37.2  |  0.024  3.088   3.112
SpinNet [ao2021spinnet]          | RANSAC-50k   | 5000     | 88.6  59.8  | 60.248  0.388  60.636
Predator [huang2021predator]     | RANSAC-50k   | 5000     | 89.0  59.8  |  0.032  5.120   5.152
CoFiNet [yu2021cofinet]          | RANSAC-50k   | 5000     | 89.3  67.5  |  0.115  1.807   1.922
GeoTransformer (ours)            | RANSAC-50k   | 5000     | 92.0  75.0  |  0.075  1.558   1.633
FCGF [choy2019fully]             | weighted SVD | 250      | 42.1   3.9  |  0.052  0.008   0.056
D3Feat [bai2020d3feat]           | weighted SVD | 250      | 37.4   2.8  |  0.024  0.008   0.032
SpinNet [ao2021spinnet]          | weighted SVD | 250      | 34.0   2.5  | 60.248  0.006  60.254
Predator [huang2021predator]     | weighted SVD | 250      | 50.0   6.4  |  0.032  0.009   0.041
CoFiNet [yu2021cofinet]          | weighted SVD | 250      | 64.6  21.6  |  0.115  0.003   0.118
GeoTransformer (ours)            | weighted SVD | 250      | 86.5  59.9  |  0.075  0.003   0.078
CoFiNet [yu2021cofinet]          | LGR          | all      | 87.6  64.8  |  0.115  0.028   0.143
GeoTransformer (ours)            | LGR          | all      | 91.5  74.0  |  0.075  0.013   0.088
Table 2: Registration results w/o RANSAC on 3DMatch (3DM) and 3DLoMatch (3DLM). The model time is the time for network inference, while the pose time is for transformation estimation.

We then compare the registration results without using RANSAC in Tab. 2. We start with weighted SVD over correspondences in solving for alignment transformation. The baselines either fail to achieve reasonable results or suffer from severe performance degradation. In contrast, GeoTransformer (with weighted SVD) achieves the registration recall of on 3DMatch and on 3DLoMatch, close to Predator with RANSAC. Without outlier filtering by RANSAC, high inlier ratio is necessary for successful registration. However, high inlier ratio does not necessarily lead to high registration recall since the correspondences could cluster together as noted in [huang2021predator]. Nevertheless, our method without RANSAC performs well by extracting reliable and well-distributed superpoint correspondences.

When using our local-to-global registration (LGR) for computing the transformation, our method brings the registration recall to 91.5% on 3DMatch and 74.0% on 3DLoMatch, surpassing all RANSAC-based baselines by a large margin. The results are also very close to those of ours with RANSAC, but LGR is over 100 times faster than RANSAC in the pose time (0.013 s vs. 1.558 s). These results demonstrate the superiority of our method in both accuracy and speed.


Figure 5: Registration results of the models with vanilla self-attention and geometric self-attention. In the columns (e) and (f), we visualize the features of the patches with t-SNE. In the first row, the geometric self-attention helps find the inlier matches on the structure-less wall based on their geometric relationships to the more salient regions (e.g., the chairs). In the following rows, the geometric self-attention helps reject the outlier matches between the similar flat or corner patches based on their geometric relationships to the bed or the sofa.

Ablation studies.

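| Model | 3DMatch PIR (%) | 3DMatch FMR (%) | 3DMatch IR (%) | 3DMatch RR (%) | 3DLoMatch PIR (%) | 3DLoMatch FMR (%) | 3DLoMatch IR (%) | 3DLoMatch RR (%) |
|---|---|---|---|---|---|---|---|---|
| (a) graph neural network | 73.3 | 97.9 | 56.5 | 89.5 | 39.4 | 84.9 | 29.2 | 69.8 |
| (b) vanilla self-attention | 79.6 | 97.9 | 60.1 | 89.0 | 45.2 | 85.6 | 32.6 | 68.4 |
| (c) self-attention w/ ACE | 83.2 | 98.1 | 68.5 | 89.3 | 48.2 | 84.3 | 38.9 | 69.3 |
| (d) self-attention w/ RCE | 80.0 | 97.9 | 66.1 | 88.5 | 46.1 | 84.6 | 37.9 | 68.7 |
| (e) self-attention w/ RDE | 84.9 | 98.0 | 69.1 | 90.7 | 50.6 | 85.8 | 40.3 | 72.1 |
| (f) geometric self-attention | 86.1 | 97.7 | 70.3 | 91.5 | 54.9 | 88.1 | 43.3 | 74.0 |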
Table 3: Ablation experiments of the geometric self-attention.
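| Model | 3DMatch PIR (%) | 3DMatch FMR (%) | 3DMatch IR (%) | 3DMatch RR (%) | 3DLoMatch PIR (%) | 3DLoMatch FMR (%) | 3DLoMatch IR (%) | 3DLoMatch RR (%) |
|---|---|---|---|---|---|---|---|---|
| (a) cross-entropy loss | 80.0 | 97.7 | 65.7 | 90.0 | 45.9 | 85.1 | 37.4 | 68.4 |
| (b) weighted cross-entropy loss | 83.2 | 98.0 | 67.4 | 90.0 | 49.0 | 86.2 | 38.6 | 70.7 |
| (c) circle loss | 85.1 | 97.8 | 69.5 | 90.4 | 51.5 | 86.1 | 41.3 | 71.5 |
| (d) overlap-aware circle loss | 86.1 | 97.7 | 70.3 | 91.5 | 54.9 | 88.1 | 43.3 | 74.0 |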
Table 4: Ablation experiments of the overlap-aware circle loss.

We conduct extensive ablation studies for a better understanding of the various modules in our method. To evaluate superpoint (patch) matching, we introduce another metric Patch Inlier Ratio (PIR) which is the fraction of patch matches with actual overlap. The FMR and IR are reported with all dense point correspondences, with LGR being used for registration.

To study the effectiveness of the geometric self-attention, we compare six methods for intra-point-cloud feature learning in Tab. 3: (a) graph neural network [huang2021predator], (b) self-attention with no positional embedding [yu2021cofinet], (c) absolute coordinate embedding [sarlin2020superglue], (d) relative coordinate embedding [zhao2021point], (e) pair-wise distance embedding, and (f) geometric structure embedding. Generally, injecting geometric information boosts the performance, but the gains of coordinate-based embeddings are limited due to their transformation variance. Surprisingly, GNN performs well on RR. This is because k-NN graphs are transformation-invariant. However, it suffers from limited receptive fields, which harm the IR performance. Our method outperforms the alternatives by a large margin on all the metrics, especially in the low-overlap scenarios, even with only the pair-wise distance embedding. Note that all methods use the same point matching module, so the improvements come from the more accurate superpoint correspondences.
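The transformation-variance argument can be checked numerically: pair-wise distances are unchanged by a rigid transformation, whereas raw coordinates are not. The snippet below is an illustrative sketch only; the helper `pairwise_distances` and the random data are hypothetical.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance between every pair of points, shape (N, N)."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))

# A random rigid transformation: an orthogonal matrix from a QR
# decomposition (made a proper rotation) plus a translation.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] = -Q[:, 0]
moved = points @ Q.T + np.array([1.0, -2.0, 0.5])

# Distances are preserved; absolute coordinates are not.
distances_unchanged = np.allclose(pairwise_distances(points),
                                  pairwise_distances(moved))
coordinates_unchanged = np.allclose(points, moved)
```

An embedding computed from distances (or angles) therefore yields the same input to the attention layers regardless of how the two point clouds are posed.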


Figure 6: Visualizing geometric self-attention scores on four pairs of point clouds. The overlap areas are delineated with purple lines. The anchor patches (in correspondence) are highlighted in red and the attention scores to other patches are color-coded (deeper is larger). Note how the attention patterns of the two matching anchors are consistent even across disjoint overlap areas.

Next, we ablate the overlap-aware circle loss in Tab. 4. We compare four loss functions for supervising the superpoint matching: (a) cross-entropy loss [sarlin2020superglue], (b) weighted cross-entropy loss [yu2021cofinet], (c) circle loss [sun2020circle], and (d) overlap-aware circle loss. For the first two models, an optimal transport layer is used to compute the matching matrix as in [yu2021cofinet]. Circle loss works much better than the two variants of cross-entropy loss, verifying the effectiveness of supervising superpoint matching in a metric learning fashion. Our overlap-aware circle loss beats the vanilla circle loss by a large margin on all the metrics.

Qualitative results.

Fig. 5 provides a gallery of the registration results of the models with vanilla self-attention and our geometric self-attention. Geometric self-attention helps infer patch matches in structure-less regions from their geometric relationships to more salient regions (first row) and reject outlier matches which are similar in the feature space but different in positions (following rows).

Fig. 6 visualizes the attention scores learned by our geometric self-attention, which exhibit significant consistency between the anchor patch matches. It shows that our method is able to learn inter-point-cloud geometric consistency, which is important for accurate correspondences.

4.2 Outdoor Benchmark: KITTI odometry


KITTI odometry [geiger2012we] consists of 11 sequences of outdoor driving scenarios scanned by LiDAR. We follow [choy2019fully, bai2020d3feat, huang2021predator] and use sequences 0-5 for training, 6-7 for validation and 8-10 for testing. As in [choy2019fully, bai2020d3feat, huang2021predator], the ground-truth poses are refined with ICP and we only use point cloud pairs that are at most away for evaluation.


We follow [huang2021predator] to evaluate our GeoTransformer with three metrics: (1) Relative Rotation Error (RRE), the geodesic distance between estimated and ground-truth rotation matrices, (2) Relative Translation Error (RTE), the Euclidean distance between estimated and ground-truth translation vectors, and (3) Registration Recall (RR), the fraction of point cloud pairs whose RRE and RTE are both below certain thresholds (i.e., RRE < 5° and RTE < 2 m).

Model RTE(cm) RRE(°) RR(%)
3DFeat-Net [yew20183dfeat] 25.9 0.25 96.0
FCGF [choy2019fully] 9.5 0.30 96.6
D3Feat [bai2020d3feat] 7.2 0.30 99.8
SpinNet [ao2021spinnet] 9.9 0.47 99.1
Predator [huang2021predator] 6.8 0.27 99.8
CoFiNet [yu2021cofinet] 8.2 0.41 99.8
GeoTransformer (ours, RANSAC-50k) 7.4 0.27 99.8
FMR [huang2020feature] 66 1.49 90.6
DGR [choy2020deep] 32 0.37 98.7
HRegNet [lu2021hregnet] 12 0.29 99.7
GeoTransformer (ours, LGR) 6.8 0.24 99.8
Table 5: Registration results on KITTI odometry.

Registration results.

In Tab. 5(top), we compare to the state-of-the-art RANSAC-based methods: 3DFeat-Net [yew20183dfeat], FCGF [choy2019fully], D3Feat [bai2020d3feat], SpinNet [ao2021spinnet], Predator [huang2021predator] and CoFiNet [yu2021cofinet]. Our method performs on par with these methods, showing that it generalizes well to outdoor scenes. We further compare to three RANSAC-free methods in Tab. 5(bottom): FMR [huang2020feature], DGR [choy2020deep] and HRegNet [lu2021hregnet]. Our method outperforms all these baselines by a large margin. In addition, our method with LGR matches or beats all the RANSAC-based methods.

5 Conclusion

We have presented Geometric Transformer to learn accurate superpoint matching allowing for robust coarse-to-fine correspondence and point cloud registration. Through encoding pair-wise distances and triplet-wise angles among superpoints, our method captures the geometric consistency across point clouds with transformation invariance. Thanks to the reliable correspondences, it attains fast and accurate registration in a RANSAC-free manner. In the future, we would like to extend our method to handle cross-modality (e.g., 2D-3D) registration with richer applications.


Appendix A Network Architecture Details

A.1 Geometric Structure Embedding

First, we provide the detailed computation for our geometric structure embedding. The geometric structure embedding encodes distances in superpoint pairs and angles in superpoint triplets. Due to the continuity of the sinusoidal embedding function [vaswani2017attention], we use it instead of learned embedding vectors to compute the pair-wise distance embedding and the triplet-wise angular embedding.

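Given the distance $\rho_{i,j}$ between superpoints $\hat{p}_i$ and $\hat{p}_j$, the pair-wise distance embedding $r^D_{i,j}$ is computed as:

$$r^D_{i,j,2k} = \sin\left(\frac{\rho_{i,j}/\sigma_d}{10000^{2k/d_t}}\right), \qquad r^D_{i,j,2k+1} = \cos\left(\frac{\rho_{i,j}/\sigma_d}{10000^{2k/d_t}}\right),$$

where $d_t$ is the feature dimension, and $\sigma_d$ is a temperature which controls the sensitivity to distance variations.

The triplet-wise angular embedding can be computed in the same way. Given the angle $\alpha^x_{i,j}$, the triplet-wise angular embedding $r^A_{i,j,x}$ is computed as:

$$r^A_{i,j,x,2k} = \sin\left(\frac{\alpha^x_{i,j}/\sigma_a}{10000^{2k/d_t}}\right), \qquad r^A_{i,j,x,2k+1} = \cos\left(\frac{\alpha^x_{i,j}/\sigma_a}{10000^{2k/d_t}}\right),$$

where $\sigma_a$ is another temperature to control the sensitivity to angular variations.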
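As a rough illustration, a sinusoidal embedding of a scalar quantity (a distance or an angle) divided by a temperature can be sketched as below. This follows the standard formulation of [vaswani2017attention] with sine on even channels and cosine on odd channels; the function name, the channel layout and the temperature value used here are assumptions for illustration, not the released code.

```python
import numpy as np

def sinusoidal_embedding(values, dim, temperature):
    """Embed scalars into `dim` channels.

    Channel 2k holds sin(x / 10000^(2k/dim)) and channel 2k+1 holds the
    matching cosine, where x = value / temperature.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    scaled = np.asarray(values, dtype=float)[..., None] / temperature
    k = np.arange(dim // 2)
    freqs = 1.0 / (10000.0 ** (2.0 * k / dim))
    angles = scaled * freqs
    out = np.empty(angles.shape[:-1] + (dim,))
    out[..., 0::2] = np.sin(angles)
    out[..., 1::2] = np.cos(angles)
    return out

# Hypothetical usage: embed the pair-wise distances of three superpoints.
superpoints = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
dist = np.linalg.norm(superpoints[:, None] - superpoints[None], axis=-1)
# The temperature 0.2 is a placeholder value chosen for this example.
dist_emb = sinusoidal_embedding(dist, dim=256, temperature=0.2)
```

A smaller temperature stretches the input before the sinusoids are applied, so nearby distances (or angles) map to more distinct embeddings.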

A.2 Point Matching Module

For completeness, we then provide the details of the optimal transport layer [sarlin2020superglue] in the point matching module. For each superpoint correspondence , its local point correspondences are extracted from their local patches and . We first compute a cost matrix using the feature matrices of the two patches:


where , . The cost matrix is then augmented to by appending a new row and a new column filled with a learnable dustbin parameter as in [sarlin2020superglue]. The point matching problem can then be formulated as an optimal transport problem which maximizes , where is the soft assignment matrix satisfying:


Here can be solved by the differentiable Sinkhorn algorithm [sinkhorn1967concerning] with doubly-normalization iterations:




The algorithm starts with and . The assignment matrix is then computed as:


We run Sinkhorn iterations following [sarlin2020superglue]. is then recovered to by dropping the last row and the last column, which is used as the confidence matrix of the candidate matches. The local point correspondences are extracted by mutual top- selection on . We ignore the matches whose confidence scores are too small (i.e., ). The hyper-parameter controls the number of point correspondences as described in Sec. A.3. At last, the final global dense point correspondences are generated by combining the local point correspondences from all superpoint correspondences together.
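The procedure above can be sketched in a few lines. The version below works in the log domain and uses the marginals of the SuperGlue-style formulation [sarlin2020superglue] (each point carries unit mass and each dustbin absorbs the mass of the other side); these choices, the function names, the default of 100 iterations and the example threshold are assumptions for illustration rather than the released implementation.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def sinkhorn_with_dustbin(scores, dustbin, num_iters=100):
    """Soft assignment for an (n, m) score matrix augmented with a dustbin."""
    n, m = scores.shape
    # Append one dustbin row and one dustbin column.
    padded = np.full((n + 1, m + 1), float(dustbin))
    padded[:n, :m] = scores
    norm = -np.log(n + m)
    log_mu = np.full(n + 1, norm)
    log_mu[-1] = np.log(m) + norm
    log_nu = np.full(m + 1, norm)
    log_nu[-1] = np.log(n) + norm
    u = np.zeros(n + 1)
    v = np.zeros(m + 1)
    for _ in range(num_iters):
        # Alternate row and column normalization.
        u = log_mu - logsumexp(padded + v[None, :], axis=1)
        v = log_nu - logsumexp(padded + u[:, None], axis=0)
    log_assign = padded + u[:, None] + v[None, :] - norm
    # Drop the dustbin row and column to get the confidence matrix.
    return np.exp(log_assign[:n, :m])

def mutual_topk(conf, k, min_conf):
    """Index pairs that are in the top-k of both their row and their column."""
    row_idx = np.argsort(-conf, axis=1)[:, :k]
    col_idx = np.argsort(-conf, axis=0)[:k, :]
    row_mask = np.zeros_like(conf, dtype=bool)
    col_mask = np.zeros_like(conf, dtype=bool)
    np.put_along_axis(row_mask, row_idx, True, axis=1)
    np.put_along_axis(col_mask, col_idx, True, axis=0)
    return np.argwhere(row_mask & col_mask & (conf > min_conf))

# Toy example: three points per patch with a clearly diagonal score matrix.
scores = 5.0 * np.eye(3)
conf = sinkhorn_with_dustbin(scores, dustbin=0.0)
matches = mutual_topk(conf, k=1, min_conf=0.05)
```

Increasing `k` in the mutual top-k step admits more candidate matches per point, which is how the number of extracted correspondences can be controlled.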

A.3 Network Configurations


We use a KPConv-FPN backbone for feature extraction. The grid subsampling scheme [thomas2019kpconv] is used to downsample the point clouds. Before being fed into the backbone, the input point clouds are first downsampled with a voxel size of on 3DMatch and on KITTI. The voxel size is then doubled in each downsampling operation. We use a 4-stage backbone for 3DMatch and a 5-stage backbone for KITTI because the point clouds in KITTI are much larger than those in 3DMatch. The configurations of KPConv are the same as in [huang2021predator]. And we use group normalization [wu2018group] with groups after the KPConv layers. The detailed network configurations are shown in Tab. 6.
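To make the downsampling scheme concrete, the sketch below replaces all points that fall into the same voxel by their barycenter and doubles the voxel size at each stage. It is a simplified NumPy illustration that assumes barycenter-based grid subsampling; the function names and the voxel size used in the example are hypothetical.

```python
import numpy as np

def grid_subsample(points, voxel_size):
    """Replace the points inside each occupied voxel by their barycenter."""
    points = np.asarray(points, dtype=float)
    cells = np.floor(points / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(cells, axis=0,
                                   return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, points)  # accumulate coordinates per voxel
    return sums / counts[:, None]

def build_pyramid(points, base_voxel, num_stages):
    """Subsample repeatedly, doubling the voxel size at each stage."""
    levels = []
    current, voxel = points, base_voxel
    for _ in range(num_stages):
        current = grid_subsample(current, voxel)
        levels.append(current)
        voxel *= 2.0
    return levels

# Hypothetical example: 1000 random points in the unit cube, 4 stages.
rng = np.random.default_rng(0)
cloud = rng.uniform(0.0, 1.0, size=(1000, 3))
pyramid = build_pyramid(cloud, base_voxel=0.1, num_stages=4)
sizes = [len(level) for level in pyramid]
```

The points of the coarsest level play the role of the superpoints, while a finer level provides the dense points used for point matching.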

Stage 3DMatch KITTI
1 KPConv() KPConv()
ResBlock() ResBlock()
2 ResBlock(, strided) ResBlock(, strided)
ResBlock() ResBlock()
ResBlock() ResBlock()
3 ResBlock(, strided) ResBlock(, strided)
ResBlock() ResBlock()
ResBlock() ResBlock()
4 ResBlock(, strided) ResBlock(, strided)
ResBlock() ResBlock()
ResBlock() ResBlock()
5 - ResBlock(, strided)
6 - NearestUpsampling
7 NearestUpsampling NearestUpsampling
UnaryConv() UnaryConv()
8 NearestUpsampling NearestUpsampling
UnaryConv() UnaryConv()
Superpoint Matching Module
1 Linear() Linear()
2 GeometricSelfAttention(256, 4) GeometricSelfAttention(128, 4)
FeatureCrossAttention(256, 4) FeatureCrossAttention(128, 4)
3 GeometricSelfAttention(256, 4) GeometricSelfAttention(128, 4)
FeatureCrossAttention(256, 4) FeatureCrossAttention(128, 4)
4 GeometricSelfAttention(256, 4) GeometricSelfAttention(128, 4)
FeatureCrossAttention(256, 4) FeatureCrossAttention(128, 4)
5 Linear() Linear()
Table 6: Network architecture for 3DMatch and KITTI.

Superpoint Matching Module.

At the beginning of the superpoint matching module, a linear projection is used to compress the feature dimension. For 3DMatch, the feature dimension is 256. For KITTI, we halve the feature dimension to 128 to reduce the memory footprint. We then interleave the geometric self-attention module and the feature-based cross-attention module 3 times (see Tab. 6):


All attention modules have 4 attention heads. In the geometric structure embedding, we use on 3DMatch and on KITTI (i.e., the voxel size in the coarsest resolution level), while on both datasets. The computation of the feature-based cross-attention for is shown in Fig. 7. Afterwards, we use another linear projection to project the features to -d, i.e., the final and :



Figure 7: Left: The structure of feature-based cross-attention module. Right: The computation graph of cross-attention.

Local-to-Global Registration.

In the local-to-global registration, we only use the superpoint correspondences with at least local point correspondences to compute the transformation candidates. To select the best transformation, the acceptance radius is on 3DMatch and on KITTI. Finally, we iteratively recompute the transformation with the surviving inlier matches for times, which is similar to the post-refinement process in [bai2021pointdsc]. However, we do not change the weights of the correspondences during the refinement. The impact of the number of iterations in the refinement is studied in Sec. D.3.
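The overall flow of this procedure can be sketched as follows: one transformation candidate is fitted per superpoint correspondence, the candidate with the most inliers over all point correspondences is kept, and it is then refined on its surviving inliers with fixed weights. This is a hedged NumPy sketch; the function names, the argument layout, and the default values of `min_local` and `num_refine` are placeholders chosen for the example, not values taken from the text above.

```python
import numpy as np

def fit_rigid(src, dst, weights):
    """Closed-form weighted rigid fit (R, t) via SVD."""
    w = weights / weights.sum()
    src_c, dst_c = w @ src, w @ dst
    H = (src - src_c).T @ (w[:, None] * (dst - dst_c))
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, dst_c - R @ src_c

def local_to_global_registration(src, dst, weights, patch_ids, radius,
                                 min_local=3, num_refine=5):
    """src/dst: (N, 3) matched points; patch_ids: superpoint match of each pair."""
    def inliers(R, t):
        return np.linalg.norm(src @ R.T + t - dst, axis=1) < radius

    best, best_count = None, -1
    for pid in np.unique(patch_ids):
        sel = patch_ids == pid
        if sel.sum() < min_local:
            continue  # too few local correspondences for a candidate
        R, t = fit_rigid(src[sel], dst[sel], weights[sel])
        count = int(inliers(R, t).sum())
        if count > best_count:
            best, best_count = (R, t), count
    if best is None:
        raise ValueError("no superpoint correspondence has enough local matches")

    R, t = best
    for _ in range(num_refine):
        mask = inliers(R, t)
        if mask.sum() < 3:
            break
        # Weights are kept fixed; only the inlier set changes.
        R, t = fit_rigid(src[mask], dst[mask], weights[mask])
    return R, t

# Synthetic check: four patches of 20 matches each, one patch is pure outliers.
rng = np.random.default_rng(0)
src = rng.normal(size=(80, 3))
a = np.deg2rad(40.0)
R_gt = np.array([[np.cos(a), -np.sin(a), 0.0],
                 [np.sin(a), np.cos(a), 0.0],
                 [0.0, 0.0, 1.0]])
t_gt = np.array([0.3, 0.1, -0.4])
dst = src @ R_gt.T + t_gt
patch_ids = np.repeat(np.arange(4), 20)
dst[patch_ids == 3] = 5.0 * rng.normal(size=(20, 3))
R_est, t_est = local_to_global_registration(src, dst, np.ones(80), patch_ids,
                                            radius=0.1)
```

Since the number of candidates equals the number of superpoint correspondences, the cost of hypothesis generation is far below that of tens of thousands of RANSAC iterations.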

Implementation details.

We implement and evaluate our GeoTransformer with PyTorch [paszke2019pytorch] on a Xeon Gold 5218 CPU and an NVIDIA RTX 3090 GPU. The network is trained with the Adam optimizer [kingma2014adam] for epochs on 3DMatch and epochs on KITTI. The batch size is and the weight decay is . The learning rate starts from and decays exponentially by every epoch on 3DMatch and every epochs on KITTI. We use the matching radius of for 3DMatch and for KITTI (i.e., the voxel size in the resolution level of and ) to determine overlapping during the generation of both superpoint-level and point-level ground-truth matches. The same data augmentation as in [huang2021predator] is adopted. We randomly sample ground-truth superpoint matches during training, and use putative ones during testing.

Correspondences sampling strategy.

For 3DMatch, we vary the hyper-parameter in the mutual top- selection of the point matching module to control the number of the point correspondences for GeoTransformer, i.e., for // matches, for matches, and for matches. And we use top- selection to sample a certain number of the correspondences instead of random sampling as in [choy2019fully, huang2021predator, yu2021cofinet], which makes our correspondences deterministic. For the registration with LGR (Tab. 2(bottom) of our main paper), we use to generate around correspondences for each point cloud pair. For the baselines, we report the results from their original papers or official models in Tab. 1 of our main paper.

For the registration with weighted SVD (Tab. 2(middle) of our main paper), the correspondences of the baselines are extracted in the following manner: we first sample keypoints and generate the correspondences with mutual nearest neighbor selection in the feature space, and then the top 250 correspondences with the smallest feature distances are used to compute the transformation. The weights of the correspondences are computed as , where and are the respective descriptors of the correspondences. Among the sampling strategies that we have tried, this scheme achieves the best registration results.

For KITTI, we use and select the top point correspondences following [bai2020d3feat, huang2021predator]. All other hyperparameters are the same as those in 3DMatch.

Appendix B Metrics

Following common practice [bai2020d3feat, huang2021predator, choy2019fully], we use different metrics for 3DMatch and KITTI. On 3DMatch, we report Inlier Ratio, Feature Matching Recall and Registration Recall. We also report Patch Inlier Ratio to evaluate the superpoint (patch) correspondences. On KITTI, we report Relative Rotation Error, Relative Translation Error and Registration Recall.

B.1 3DMatch/3DLoMatch

Inlier Ratio (IR) is the fraction of inlier matches among all putative point matches. A match is considered as an inlier if the distance between the two points is smaller than under the ground-truth transformation :


where ⟦·⟧ is the Iverson bracket.

Feature Matching Recall (FMR) is the fraction of point cloud pairs whose IR is above . FMR measures the potential success during the registration:


where is the number of all point cloud pairs.
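In code, the two definitions above reduce to a thresholded residual under the ground-truth transformation and a thresholded average over pairs. The sketch below is illustrative only; the function names are hypothetical and both thresholds are left as parameters with example values chosen for this snippet.

```python
import numpy as np

def inlier_ratio(src_pts, dst_pts, R_gt, t_gt, dist_thresh):
    """Fraction of putative matches whose residual under the ground-truth
    transformation is below dist_thresh."""
    residuals = np.linalg.norm(src_pts @ R_gt.T + t_gt - dst_pts, axis=1)
    return float(np.mean(residuals < dist_thresh))

def feature_matching_recall(inlier_ratios, ratio_thresh):
    """Fraction of point cloud pairs whose inlier ratio exceeds ratio_thresh."""
    return float(np.mean(np.asarray(inlier_ratios) > ratio_thresh))

# Toy example with an identity ground-truth transformation.
src_pts = np.zeros((3, 3))
dst_pts = np.array([[0.01, 0.0, 0.0], [0.0, 0.02, 0.0], [1.0, 0.0, 0.0]])
ir = inlier_ratio(src_pts, dst_pts, np.eye(3), np.zeros(3), dist_thresh=0.1)
fmr = feature_matching_recall([0.5, 0.02, 0.3], ratio_thresh=0.05)
```

Here two of the three matches are inliers, and two of the three hypothetical pairs exceed the ratio threshold.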

Registration Recall (RR) is the fraction of correctly registered point cloud pairs. Two point clouds are correctly registered if their transformation error is smaller than . The transformation error is computed as the root mean square error of the ground-truth correspondences after applying the estimated transformation :
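A minimal sketch of this criterion, assuming the ground-truth correspondences are given as two aligned arrays, is shown below. The function names are hypothetical and the RMSE threshold is a parameter whose example value is chosen only for illustration.

```python
import numpy as np

def registration_rmse(gt_src, gt_dst, R_est, t_est):
    """Root mean square error of the ground-truth correspondences after
    applying the estimated transformation to the source points."""
    residuals = gt_src @ R_est.T + t_est - gt_dst
    return float(np.sqrt(np.mean(np.sum(residuals ** 2, axis=1))))

def registration_recall(rmses, rmse_thresh):
    """Fraction of point cloud pairs whose RMSE is below rmse_thresh."""
    return float(np.mean(np.asarray(rmses) < rmse_thresh))

# Toy example: a perfect estimate and one that is off by 0.3 along x.
rng = np.random.default_rng(0)
gt_src = rng.normal(size=(20, 3))
gt_dst = gt_src.copy()
rmse_good = registration_rmse(gt_src, gt_dst, np.eye(3), np.zeros(3))
rmse_bad = registration_rmse(gt_src, gt_dst, np.eye(3), np.array([0.3, 0.0, 0.0]))
rr = registration_recall([rmse_good, rmse_bad], rmse_thresh=0.2)
```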


Patch Inlier Ratio (PIR) is the fraction of superpoint (patch) matches with actual overlap under the ground-truth transformation. It reflects the quality of the putative superpoint (patch) correspondences:


where the matching radius is as stated in A.3.

B.2 KITTI

Relative Rotation Error (RRE) is the geodesic distance in degrees between estimated and ground-truth rotation matrices. It measures the differences between the predicted and the ground-truth rotation matrices.


Relative Translation Error (RTE) is the Euclidean distance between estimated and ground-truth translation vectors. It measures the differences between the predicted and the ground-truth translation vectors.


Registration Recall (RR) on KITTI is defined as the fraction of the point cloud pairs whose RRE and RTE are both below certain thresholds (i.e., RRE < 5° and RTE < 2 m).


Following [choy2019fully, bai2020d3feat, huang2021predator, lu2021hregnet, yu2021cofinet], we compute the mean RRE and the mean RTE only for the correctly registered point cloud pairs in KITTI.
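The KITTI metrics can be sketched as follows, using the standard trace formula for the geodesic distance between rotations and averaging the errors over successfully registered pairs only. The function names are hypothetical; the default thresholds are the 5° and 2 m values used for RR on KITTI.

```python
import numpy as np

def relative_rotation_error(R_est, R_gt):
    """Geodesic distance between two rotation matrices, in degrees."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def relative_translation_error(t_est, t_gt):
    """Euclidean distance between two translation vectors."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))

def summarize(rres, rtes, rre_thresh=5.0, rte_thresh=2.0):
    """Return (RR, mean RRE, mean RTE); the means cover successful pairs only."""
    rres, rtes = np.asarray(rres, dtype=float), np.asarray(rtes, dtype=float)
    ok = (rres < rre_thresh) & (rtes < rte_thresh)
    recall = float(ok.mean())
    mean_rre = float(rres[ok].mean()) if ok.any() else float("nan")
    mean_rte = float(rtes[ok].mean()) if ok.any() else float("nan")
    return recall, mean_rre, mean_rte

# Toy example: a 10 degree rotation about z compared with the identity.
a = np.deg2rad(10.0)
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a), np.cos(a), 0.0],
               [0.0, 0.0, 1.0]])
rre = relative_rotation_error(Rz, np.eye(3))
rte = relative_translation_error([1.0, 0.0, 0.0], [0.0, 0.0, 0.0])
recall, mean_rre, mean_rte = summarize([1.0, 10.0, 3.0], [0.5, 0.1, 3.0])
```

Note that averaging only over successful pairs means the mean RRE and RTE should be read together with RR, since failed pairs do not contribute to them.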

Appendix C Analysis of Cross-Entropy Loss

In this section, we first show that adopting the cross-entropy loss in a multi-label classification problem can suppress the classes with high confidence scores. Given the input vector and the label vector , the confidence vector z is computed by applying a softmax to y: