1 Introduction
(Footnote: First two authors contributed equally to this work.)
Recent work has made substantial progress in fully automatic, 3D feature-based point cloud registration. At first glance, benchmarks like 3DMatch appear to be saturated, with multiple state-of-the-art (SOTA) methods [15, 7, 3] reaching nearly 95% feature matching recall and successfully registering 80% of all scan pairs. One may get the impression that the registration problem is solved, but this is actually not the case. We argue that the high success rates are a consequence of lenient evaluation protocols. We have been making our task too easy: existing literature and benchmarks [4, 48, 20] consider only pairs of point clouds with at least 30% overlap to measure performance. Yet, the low-overlap regime is very relevant for practical applications. On the one hand, it may be difficult to ensure high overlap, for instance when moving along narrow corridors, or when closing loops in the presence of occlusions (densely built-up areas, forest, etc.). On the other hand, data acquisition is often costly, so practitioners aim for a low number of scans with only the necessary overlap.
Driven by the evaluation protocol, the high-overlap scenario became the focus of research, whereas the more challenging low-overlap examples were largely neglected (cf. Fig. 1). As a consequence, the registration performance of even the best known methods deteriorates rapidly when the overlap between the two point clouds falls below 30%, see Fig. 2. Human operators, in contrast, can still register such low overlap point clouds without much effort.
This discrepancy is the starting point of the present work. To study its reasons, we have constructed a low-overlap dataset 3DLoMatch (Sec. 4.1) from scans of the popular 3DMatch benchmark, and have analysed the individual modules/steps of the registration pipeline (Fig. 2). It turns out that the effective receptive field of modern (fully convolutional) feature point descriptors [7, 3] is local enough and the descriptors are hardly corrupted by non-overlapping parts of the scans. Rather than coming up with yet another way to learn better descriptors, the key to registering low overlap point clouds is learning where to look for feature points (Fig. 2, right). A large performance boost can be achieved if the feature points are predominantly sampled from the overlapping portions of the scans.
We follow this path and introduce Predator, a neural architecture for pairwise 3D point cloud registration that learns to (implicitly) detect the overlap region between two unregistered scans, and to focus on that region when extracting salient feature points. The main contributions of our work are:
an analysis why existing registration architectures break down in the low-overlap regime
a novel overlap attention block that allows for early information exchange between the two point clouds and focuses the subsequent steps on the overlap region
a scheme to refine the feature point descriptors, by conditioning them also on the respective other point cloud
a novel loss function to train matchability scores, which help to sample better and more repeatable interest points
Moreover, we make available the 3DLoMatch dataset, containing the previously ignored scan pairs of 3DMatch that have low (10-30%) overlap. In our experiments, Predator greatly outperforms existing methods in the low-overlap regime, increasing registration recall by more than 10 percentage points. It also sets a new state of the art for the conventional 3DMatch benchmark, reaching a registration recall of 89%.
2 Related work
Local 3D feature descriptors: Early local descriptors for point clouds [19, 29, 28, 36, 35] aimed to characterise the local geometry by using hand-crafted features. While often lacking robustness against clutter and occlusions, they have long been a default choice for downstream tasks because they naturally generalise across datasets . In the last years, learned 3D feature descriptors have taken over and now routinely outperform their hand-crafted counterparts.
Other methods first extract hand-crafted features, then map them to a compact representation using multi-layer perceptrons. PPFNet, and its self-supervised version PPF-FoldNet, combine point pair features with a PointNet architecture to extract descriptors that are aware of the global context. To alleviate artefacts caused by noise and voxelisation, [15] proposed to use a smoothed density voxel grid as input to a 3D CNN. These early works achieved strong performance, but still operate on individual local patches, which greatly increases the computational cost and limits the receptive field to a predefined size.
Fully convolutional architectures  that enable dense feature computation over the whole input in a single forward pass [11, 12, 27] have been adopted to design faster 3D feature descriptors. Building on sparse convolutions , FCGF  achieves a performance similar to the best patch-based descriptors , while being orders of magnitude more efficient. D3Feat  complements a fully convolutional feature descriptor with an interest point detector trained to detect salient points.
Contextual information: In the traditional pipeline, feature extraction is done independently per point cloud. Information is only mixed when computing pairwise similarities, although aggregating contextual information at an earlier stage could provide additional cues to robustify the descriptors and guide the matching step.
In 2D feature learning, D2D [43] uses an attention mechanism in the bottleneck of an encoder-decoder scheme to aggregate the contextual information, which is later used to condition the output of the decoder on the second image. SuperGlue infuses the contextual information into the learned descriptors with a whole series of self- and cross-attention layers, built upon a message-passing GNN. Early information mixing was previously also explored in the field of deep point cloud registration, where [40, 41] use a transformer module to extract task-specific 3D features that are reinforced with contextual information.
Interest point sampling: The classic principle to sample salient rather than random points has also found its way into learned 2D [11, 12, 27, 43] and 3D [46, 3] local feature extraction. All these methods implicitly assume that the saliency of a point fully determines its utility for downstream tasks. Here, we take a step back and argue that, while saliency is desirable for an interest point, it is not sufficient on its own. Indeed, in order to contribute to registration a point should not only be salient, but must also lie in the region where the two point clouds overlap—an essential property that, surprisingly, has largely been neglected thus far.
Deep point-cloud registration: Instead of combining learned feature descriptors with some off-the-shelf robust optimization at inference time, a parallel stream of work aims to embed the (differentiable) estimation of the transformation parameters into the learning pipeline. PointNetLK combines a PointNet-based global feature descriptor with a Lucas/Kanade-like optimization algorithm and estimates the relative transformation in an iterative fashion. DCP uses a DGCNN network to extract local features and computes soft correspondences before using the Kabsch algorithm to estimate the transformation parameters. To relax the need for strict one-to-one correspondence, DCP was later extended to PRNet, which includes a keypoint detection step and allows for partial correspondence. Instead of simply using soft correspondences, [47] estimate the similarity matrix with a differentiable Sinkhorn layer. Similar to other methods, the weighted Kabsch algorithm is used in [47] to estimate the transformation parameters. Finally, [14, 5] complement a learned feature descriptor with an outlier filtering network, which infers the points’ influence weights for later use in the weighted Kabsch algorithm.
3 Method
Predator is a two-stream encoder-decoder network. Our implementation uses residual blocks with KPConv-style point convolutions, but the architecture is agnostic w.r.t. the backbone and could also be implemented with other formulations of 3D convolutions, such as for instance sparse voxel convolutions. The architecture of Predator can be decomposed into three main modules:
encoding of the two point clouds into smaller sets of superpoints and associated latent feature encodings, with shared weights (Sec. 3.2);
the overlap attention module (in the bottleneck) that extracts co-contextual information between the feature encodings of the two point clouds, and assigns each superpoint two overlap scores that quantify how likely the superpoint itself and its soft-correspondence are located in the overlap between the two inputs (Sec. 3.3);
decoding of the mutually conditioned bottleneck representations to point-wise descriptors as well as refined per-point overlap and matchability scores (Sec. 3.4).
Before diving into each component we lay out the basic problem setting and notation in Sec. 3.1.
3.1 Problem setting
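Consider two point clouds $\mathbf{P} = \{\mathbf{p}_i \in \mathbb{R}^3 \mid i = 1 \dots N\}$ and $\mathbf{Q} = \{\mathbf{q}_j \in \mathbb{R}^3 \mid j = 1 \dots M\}$. Our goal is to recover a rigid transformation $\mathbf{T}_{\mathbf{P}}^{\mathbf{Q}}$ with parameters $\mathbf{R} \in SO(3)$ and $\mathbf{t} \in \mathbb{R}^3$ that aligns $\mathbf{P}$ to $\mathbf{Q}$. By a slight abuse of notation we use the same symbols for sets of points and for their corresponding matrices $\mathbf{P} \in \mathbb{R}^{N \times 3}$ and $\mathbf{Q} \in \mathbb{R}^{M \times 3}$.
Obviously $\mathbf{T}_{\mathbf{P}}^{\mathbf{Q}}$ can only ever be determined from the data if $\mathbf{P}$ and $\mathbf{Q}$ have sufficient overlap, meaning that after applying the ground truth transformation $\bar{\mathbf{T}}_{\mathbf{P}}^{\mathbf{Q}}$ the overlap ratio satisfies
$$\frac{1}{N} \Big| \big\{ \mathbf{p}_i : \big\| \bar{\mathbf{T}}_{\mathbf{P}}^{\mathbf{Q}}(\mathbf{p}_i) - \mathrm{NN}\big(\bar{\mathbf{T}}_{\mathbf{P}}^{\mathbf{Q}}(\mathbf{p}_i), \mathbf{Q}\big) \big\|_2 \le v \big\} \Big| > \tau, \tag{1}$$
where $\mathrm{NN}(\cdot,\cdot)$ denotes the nearest-neighbour operator w.r.t. its second argument, $\|\cdot\|_2$ is the Euclidean norm, $|\cdot|$ is the set cardinality, and $v$ is a tolerance that depends on the point density. (Footnote 2: For efficiency, $\tau$ is in practice determined after voxel-grid downsampling of the two point clouds.) Contrary to previous work [48, 20], where the threshold to even attempt the alignment is typically $\tau > 0.3$, we are interested in low-overlap point clouds with $\tau > 0.1$.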
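The overlap ratio described above can be illustrated with a few lines of plain Python. This is only a minimal sketch under stated assumptions: the function names `apply_rigid` and `overlap_ratio`, the brute-force nearest-neighbour search, and the toy point sets are illustrative choices and are not taken from the authors' code.

```python
import math

def apply_rigid(points, R, t):
    """Apply x -> R x + t to a list of 3D points (R is a 3x3 nested list)."""
    return [
        tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))
        for p in points
    ]

def overlap_ratio(P, Q, R, t, tol):
    """Fraction of points of P that have a neighbour in Q within `tol`
    after P has been aligned with the given (ground truth) transformation."""
    aligned = apply_rigid(P, R, t)
    hits = 0
    for p in aligned:
        nearest = min(math.dist(p, q) for q in Q)  # brute-force nearest neighbour
        if nearest <= tol:
            hits += 1
    return hits / len(P)

# Toy example (hypothetical data): four collinear points, shifted by one unit.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
Q = [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
ratio = overlap_ratio(P, Q, IDENTITY, (1.0, 0.0, 0.0), tol=0.05)
```

In this toy case two of the four aligned points find a neighbour, so the pair would count as overlapping under a 30% threshold. A real implementation would replace the quadratic search with a spatial index such as a k-d tree.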
3.2 Encoder
We follow [34] and first preprocess raw point clouds with grid-based subsampling, such that $\mathbf{P}$ and $\mathbf{Q}$ have reasonably uniform point density. In the shared encoder, a series of ResNet-like blocks and strided convolutions aggregate the raw points into superpoints $\mathbf{P}'$ and $\mathbf{Q}'$ with associated features $\mathbf{X}^{\mathbf{P}'}$ and $\mathbf{X}^{\mathbf{Q}'}$. Note that superpoints correspond to a fixed receptive field, so their number depends on the spatial extent of the input point cloud and may be different for the two inputs.
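To make the grid-based subsampling step concrete, the following sketch buckets points into cubic voxels and keeps one representative per voxel. It is an assumption that the representative is the barycentre of the voxel's points; the function name `grid_subsample` and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def grid_subsample(points, voxel_size):
    """Keep one point (the barycentre) per occupied voxel of edge `voxel_size`."""
    buckets = defaultdict(list)
    for p in points:
        # Integer voxel index of the point along each axis.
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets[key].append(p)
    return [
        tuple(sum(axis) / len(cell) for axis in zip(*cell))
        for cell in buckets.values()
    ]

# Toy example (hypothetical data): two nearby points collapse into one.
dense = [(0.01, 0.02, 0.0), (0.03, 0.01, 0.0), (0.26, 0.0, 0.0)]
sparse = grid_subsample(dense, voxel_size=0.25)
```

The output density is bounded by the voxel size, which is why later hyper-parameters can be expressed relative to it.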
3.3 Overlap attention module
So far, the features $\mathbf{X}^{\mathbf{P}'}$, $\mathbf{X}^{\mathbf{Q}'}$ in the bottleneck encode the geometry and context of the two point clouds. But $\mathbf{X}^{\mathbf{P}'}$ has no knowledge of point cloud $\mathbf{Q}$ and vice versa. In order to reason about their respective overlap regions, some cross-talk is necessary. We argue that it makes sense to add that cross-talk at the level of superpoints in the bottleneck, just like a human operator will first get a rough overview of the overall shape to determine likely overlap regions, and only after that identifies precise feature points in those regions.
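Graph convolutional neural network: Before connecting the two feature encodings, we first further aggregate and strengthen their contextual relations individually with a graph neural network (GNN). In the following, we describe the GNN for point cloud $\mathbf{P}'$. The GNN for $\mathbf{Q}'$ is the same. First, the superpoints in $\mathbf{P}'$ are linked into a graph in Euclidean space with the $k$-NN method. Let $\mathbf{x}_i$ denote the feature encoding of superpoint $\mathbf{p}'_i$, and $(i,j) \in \mathcal{E}$ the graph edge between superpoints $\mathbf{p}'_i$ and $\mathbf{p}'_j$. The encoder features are then iteratively updated as
$${}^{(k+1)}\mathbf{x}_i = \max_{(i,j) \in \mathcal{E}} h_\theta\big( \mathrm{cat}\big[{}^{(k)}\mathbf{x}_i,\ {}^{(k)}\mathbf{x}_j - {}^{(k)}\mathbf{x}_i\big] \big), \tag{2}$$
where $h_\theta(\cdot)$ denotes a linear layer followed by instance normalization and a LeakyReLU activation, $\max(\cdot)$ denotes element-/channel-wise max-pooling, and $\mathrm{cat}[\cdot,\cdot]$ means concatenation. This update is performed twice with separate (not shared) parameters $\theta$, and the final GNN features are obtained as
$$\mathbf{x}_i^{\mathrm{GNN}} = h_\theta\big( \mathrm{cat}\big[{}^{(0)}\mathbf{x}_i,\ {}^{(1)}\mathbf{x}_i,\ {}^{(2)}\mathbf{x}_i\big] \big). \tag{3}$$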
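Cross-attention block: Knowledge about potential overlap regions can only be gained by mixing information about both point clouds. To this end we adopt a cross-attention block based on the message passing formulation. First, each superpoint in $\mathbf{P}'$ is connected to all superpoints in $\mathbf{Q}'$ to form a bipartite graph. Inspired by the Transformer architecture, vector-valued keys $\mathbf{k}_j \in \mathbb{R}^b$ and queries $\mathbf{s}_i \in \mathbb{R}^b$ are learned for each superpoint and used to retrieve (also learned) values $\mathbf{v}_j \in \mathbb{R}^b$. The messages are then computed as weighted averages of the values,
$$\mathbf{m}_{i\leftarrow} = \sum_{j:(i,j)\in\mathcal{E}} a_{ij}\,\mathbf{v}_j, \tag{4}$$
with attention weights $a_{ij} = \mathrm{softmax}_j\big(\mathbf{s}_i^\top \mathbf{k}_j / \sqrt{b}\big)$. I.e., to update a superpoint $\mathbf{p}'_i$ one combines that point’s query with the keys and values of all superpoints $\mathbf{q}'_j$. The queries, keys, and values are linear projections of the corresponding features $\mathbf{x}^{\mathrm{GNN}}$. In line with the literature, in practice we use a multi-attention layer with four parallel attention heads. The co-contextual features are computed as
$$\mathbf{x}_i^{\mathrm{CA}} = \mathbf{x}_i^{\mathrm{GNN}} + \mathrm{MLP}\big(\mathrm{cat}[\mathbf{s}_i,\ \mathbf{m}_{i\leftarrow}]\big), \tag{5}$$
with $\mathrm{MLP}(\cdot)$ denoting a three-layer fully connected network with instance normalization and ReLU activations after the first two layers. The same cross-attention block is also applied in reverse direction, so that information flows in both directions, $\mathbf{P}' \to \mathbf{Q}'$ and $\mathbf{Q}' \to \mathbf{P}'$.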
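The message computation of the cross-attention block can be sketched as single-head scaled dot-product attention. This is a simplified illustration, not the authors' implementation: it assumes the queries, keys, and values have already been projected, uses one head instead of four, scales by the square root of the feature dimension as in the Transformer, and all names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention_messages(queries, keys, values):
    """For every query (a superpoint of one cloud), return the attention-weighted
    average of the values attached to all superpoints of the other cloud."""
    dim = len(keys[0])
    messages = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in keys])
        messages.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return messages

# Toy example (hypothetical data): the query matches the first key almost exclusively.
queries = [[10.0, 0.0]]
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
messages = cross_attention_messages(queries, keys, values)
```

Because every superpoint of one cloud attends to every superpoint of the other, the cost grows with the product of the two superpoint counts, which is one reason to place this block in the coarse bottleneck.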
Overlap scores of the bottleneck points: The above update with co-contextual information is done for each superpoint in isolation, without considering the local context within each point cloud. We therefore explicitly update the local context after the cross-attention block using another GNN that has the same architecture and underlying graph (within-point cloud links) as above, but separate parameters $\theta$. This yields the final latent feature encodings $\mathbf{F}^{\mathbf{P}'}$ and $\mathbf{F}^{\mathbf{Q}'}$, which are now conditioned on the features of the respective other point cloud. Those features are linearly projected to overlap scores $\mathbf{o}^{\mathbf{P}'}$ and $\mathbf{o}^{\mathbf{Q}'}$, which can be interpreted as probabilities that a certain superpoint lies in the overlap region. Additionally, one can compute soft correspondences between superpoints and from the correspondence weights predict the cross-overlap score $\tilde{o}_i^{\mathbf{P}'}$ of a superpoint $\mathbf{p}'_i$, i.e., the probability that its correspondence in $\mathbf{Q}'$ lies in the overlap region:
$$\tilde{o}_i^{\mathbf{P}'} := \mathbf{w}_i^\top \mathbf{o}^{\mathbf{Q}'}, \qquad w_{ij} := \mathrm{softmax}_j\Big(\tfrac{1}{t} \big\langle \mathbf{f}_i^{\mathbf{P}'}, \mathbf{f}_j^{\mathbf{Q}'} \big\rangle\Big), \tag{6}$$
where $\langle\cdot,\cdot\rangle$ is the inner product, and $t$ is the temperature parameter that controls the soft assignment. In the limit $t \to 0$, Eq. (6) converges to hard nearest-neighbour assignment.
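The effect of the temperature on the cross-overlap score can be seen in a small sketch. It assumes the soft correspondence weights are a softmax over temperature-scaled inner products, as the text describes; the function name `cross_overlap_scores` and the toy features are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_overlap_scores(feats_p, feats_q, overlap_q, temperature):
    """For each superpoint of P, average the overlap scores of Q's superpoints,
    weighted by the soft correspondence derived from feature similarity."""
    scores = []
    for f_p in feats_p:
        weights = softmax([dot(f_p, f_q) / temperature for f_q in feats_q])
        scores.append(sum(w * o for w, o in zip(weights, overlap_q)))
    return scores

# Toy example (hypothetical data): one superpoint in P, two in Q.
feats_p = [[1.0, 0.0]]
feats_q = [[1.0, 0.0], [0.0, 1.0]]
overlap_q = [0.9, 0.1]
soft = cross_overlap_scores(feats_p, feats_q, overlap_q, temperature=1.0)
sharp = cross_overlap_scores(feats_p, feats_q, overlap_q, temperature=0.01)
```

With a small temperature the score approaches the overlap score of the single best-matching superpoint, which mirrors the hard nearest-neighbour limit mentioned above.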
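3.4 Decoder
Our decoder starts from conditioned features $\mathbf{F}^{\mathbf{P}'}$, concatenates them with the overlap scores $\mathbf{o}^{\mathbf{P}'}$, $\tilde{\mathbf{o}}^{\mathbf{P}'}$, and outputs per-point feature descriptors $\mathbf{F}^{\mathbf{P}}$ and refined per-point overlap and matchability scores $\mathbf{o}^{\mathbf{P}}$, $\mathbf{m}^{\mathbf{P}}$. The matchability can be seen as a "conditional saliency" that quantifies how likely a point is to be matched correctly, given the points (resp. features) in the other point cloud $\mathbf{Q}$.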
The decoder architecture combines NN-upsampling with 4 PointNet-style MLP layers , and includes skip connections from the corresponding encoder layers. We deliberately keep the overlap score and the matchability separate to disentangle the reasons why a point is a good/bad candidate for matching: in principle a point can be unambiguously matchable but lie outside the overlap region, or it can lie in the overlap but have an ambiguous descriptor. Empirically, we find that the network learns to predict high matchability mostly for points in the overlap; probably reflecting the fact that the ground truth correspondences used for training, naturally, always lie in the overlap. For further details about the architecture, please refer to Sec. A.3 and the source code.
3.5 Loss function and training
Predator is trained end-to-end, using three losses w.r.t. ground truth correspondences as supervision.
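Circle loss: To supervise the point-wise feature descriptors we follow [3] and use the circle loss, a variant of the more common triplet loss. (Footnote 3: Added to the repository after publication, not mentioned in the paper.) Consider again a pair of overlapping point clouds $\mathbf{P}$ and $\mathbf{Q}$, this time aligned with the ground truth transformation. We start by extracting the points $\mathbf{p}_i \in \mathbf{P}_p \subset \mathbf{P}$ that have at least one (possibly multiple) correspondence in $\mathbf{Q}$, where the set of correspondences $\mathcal{E}_p(\mathbf{p}_i)$ is defined as points in $\mathbf{Q}$ that lie within a radius $r_p$ around $\mathbf{p}_i$. Similarly, all points of $\mathbf{Q}$ outside a (larger) radius $r_s$ form the set of negatives $\mathcal{E}_n(\mathbf{p}_i)$. The circle loss is then computed from $n_p$ points sampled randomly from $\mathbf{P}_p$:
$$\mathcal{L}_c^{\mathbf{P}} = \frac{1}{n_p} \sum_{i=1}^{n_p} \log\Big[ 1 + \sum_{j \in \mathcal{E}_p} e^{\beta_p^j (d_i^j - \Delta_p)} \cdot \sum_{k \in \mathcal{E}_n} e^{\beta_n^k (\Delta_n - d_i^k)} \Big], \tag{7}$$
where $d_i^j = \|\mathbf{f}_{\mathbf{p}_i} - \mathbf{f}_{\mathbf{q}_j}\|_2$ denotes distance in feature space, and $\Delta_n$ and $\Delta_p$ are negative and positive margins, respectively. The weights $\beta_p^j = \gamma (d_i^j - \Delta_p)$ and $\beta_n^k = \gamma (\Delta_n - d_i^k)$ are determined individually for each positive and negative example, using the empirical margins $\Delta_p$ and $\Delta_n$ with hyper-parameter $\gamma$. The reverse loss $\mathcal{L}_c^{\mathbf{Q}}$ is computed in the same way, for a total circle loss $\mathcal{L}_c = \frac{1}{2}(\mathcal{L}_c^{\mathbf{P}} + \mathcal{L}_c^{\mathbf{Q}})$.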
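As a rough illustration of how a distance-based circle loss behaves for a single anchor point, consider the sketch below. It assumes the per-example weights are clamped at zero, as in the original circle loss formulation, so that examples already on the right side of their margin stop contributing; the function name and all numeric values (margins, scale) are hypothetical.

```python
import math

def circle_loss_single(pos_dists, neg_dists, margin_pos, margin_neg, gamma):
    """Circle loss for one anchor, given feature-space distances to its
    positives (should be small) and negatives (should be large)."""
    pos_sum = sum(
        # weight gamma * max(0, d - margin_pos): only too-distant positives are penalised
        math.exp(gamma * max(0.0, d - margin_pos) * (d - margin_pos))
        for d in pos_dists
    )
    neg_sum = sum(
        # weight gamma * max(0, margin_neg - d): only too-close negatives are penalised
        math.exp(gamma * max(0.0, margin_neg - d) * (margin_neg - d))
        for d in neg_dists
    )
    return math.log(1.0 + pos_sum * neg_sum)

# Hypothetical margins and scale, chosen only for the example.
good = circle_loss_single([0.05], [1.5], margin_pos=0.1, margin_neg=1.4, gamma=10.0)
bad = circle_loss_single([1.0], [0.2], margin_pos=0.1, margin_neg=1.4, gamma=10.0)
```

A well-separated anchor bottoms out at a constant floor, whereas an anchor whose positive is far and whose negative is close in feature space receives a much larger loss.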
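Overlap loss: The estimation of the overlap probability is cast as binary classification and supervised using the overlap loss $\mathcal{L}_o = \frac{1}{2}(\mathcal{L}_o^{\mathbf{P}} + \mathcal{L}_o^{\mathbf{Q}})$, where
$$\mathcal{L}_o^{\mathbf{P}} = -\frac{1}{|\mathbf{P}|} \sum_{i=1}^{|\mathbf{P}|} \Big[ \bar{o}_{\mathbf{p}_i} \log(o_{\mathbf{p}_i}) + (1 - \bar{o}_{\mathbf{p}_i}) \log(1 - o_{\mathbf{p}_i}) \Big]. \tag{8}$$
The ground truth label $\bar{o}_{\mathbf{p}_i}$ of point $\mathbf{p}_i$ is defined as
$$\bar{o}_{\mathbf{p}_i} = \begin{cases} 1, & \big\| \bar{\mathbf{T}}_{\mathbf{P}}^{\mathbf{Q}}(\mathbf{p}_i) - \mathrm{NN}\big(\bar{\mathbf{T}}_{\mathbf{P}}^{\mathbf{Q}}(\mathbf{p}_i), \mathbf{Q}\big) \big\|_2 < r_o \\ 0, & \text{otherwise}, \end{cases} \tag{9}$$
with overlap threshold $r_o$. The reverse loss $\mathcal{L}_o^{\mathbf{Q}}$ is computed in the same way. The contributions from positive and negative examples are balanced with weights inversely proportional to their relative frequencies.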
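The overlap supervision can be sketched as label generation followed by a class-balanced binary cross-entropy. This is an illustrative sketch: weighting each class by the frequency of the other class is one way to obtain weights inversely proportional to the class frequencies, and the names and toy data are hypothetical.

```python
import math

def overlap_labels(P_aligned, Q, radius):
    """Label 1 if an (already aligned) point of P has a neighbour in Q within `radius`."""
    return [1 if min(math.dist(p, q) for q in Q) < radius else 0 for p in P_aligned]

def balanced_bce(probs, labels, eps=1e-7):
    """Binary cross-entropy with class weights inversely proportional to class frequency."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    w_pos = n_neg / len(labels)  # rare positives get a large weight
    w_neg = n_pos / len(labels)  # rare negatives get a large weight
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p)
    return total / len(labels)

# Toy example (hypothetical data): two overlapping points, one outlier.
P_aligned = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
Q = [(0.0, 0.0, 0.01), (1.02, 0.0, 0.0)]
labels = overlap_labels(P_aligned, Q, radius=0.05)
loss_good = balanced_bce([0.9, 0.9, 0.1], labels)
loss_bad = balanced_bce([0.1, 0.1, 0.9], labels)
```

Without the balancing, a network could reach a low loss on low-overlap pairs simply by predicting "no overlap" everywhere.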
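Matchability loss: Supervising the matchability scores is a bit more difficult, as it is not clear in advance which are the right points to take into account during correspondence search. We follow a simple intuition: good keypoints are those that can be matched successfully at a given point during training, with the current feature descriptors. Hence, we cast the prediction as binary classification and generate the ground truth labels on the fly. Again, we sum the two symmetric losses, $\mathcal{L}_m = \frac{1}{2}(\mathcal{L}_m^{\mathbf{P}} + \mathcal{L}_m^{\mathbf{Q}})$, with
$$\mathcal{L}_m^{\mathbf{P}} = -\frac{1}{|\mathbf{P}|} \sum_{i=1}^{|\mathbf{P}|} \Big[ \bar{m}_{\mathbf{p}_i} \log(m_{\mathbf{p}_i}) + (1 - \bar{m}_{\mathbf{p}_i}) \log(1 - m_{\mathbf{p}_i}) \Big], \tag{10}$$
where ground truth labels $\bar{m}_{\mathbf{p}_i}$ are computed on the fly via nearest neighbour search $\mathrm{NN}_{\mathbf{F}}(\cdot,\cdot)$ in feature space:
$$\bar{m}_{\mathbf{p}_i} = \begin{cases} 1, & \big\| \bar{\mathbf{T}}_{\mathbf{P}}^{\mathbf{Q}}(\mathbf{p}_i) - \mathrm{NN}_{\mathbf{F}}(\mathbf{p}_i, \mathbf{Q}) \big\|_2 < r_m \\ 0, & \text{otherwise}. \end{cases} \tag{11}$$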
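The on-the-fly matchability labels can be illustrated as follows: each point is matched to its nearest neighbour in feature space, and the label records whether that match is also close in 3D after ground truth alignment. The function name, the distance threshold, and the toy features are hypothetical.

```python
import math

def matchability_labels(P_aligned, feats_p, Q, feats_q, radius):
    """Label 1 if the feature-space nearest neighbour of a point of P
    lies within `radius` of that point in 3D (P is already aligned to Q)."""
    labels = []
    for p, f in zip(P_aligned, feats_p):
        # Index of the point of Q whose descriptor is closest to f.
        j = min(range(len(Q)), key=lambda idx: math.dist(f, feats_q[idx]))
        labels.append(1 if math.dist(p, Q[j]) < radius else 0)
    return labels

# Toy example (hypothetical data): both descriptors of P resemble the first
# descriptor of Q, but only the first point is spatially consistent with it.
P_aligned = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
Q = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
feats_p = [(1.0, 0.0), (0.9, 0.1)]
feats_q = [(1.0, 0.0), (0.0, 1.0)]
m_labels = matchability_labels(P_aligned, feats_p, Q, feats_q, radius=0.1)
```

Since the labels depend on the current descriptors, they change during training, which is consistent with adding this loss only once the descriptors have become meaningful.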
Implementation and training: Predator is implemented in PyTorch. For the 3DMatch dataset, we train for 30 epochs, using SGD with initial learning rate, momentum , and weight decay . The learning rate is exponentially decayed by 0.05 after each epoch. Due to memory constraints we use batch size in all experiments and sample at most positive pairs for the circle loss. Data augmentation includes random rotations around all three axes and Gaussian noise with cm, added independently to each coordinate. At the start of the training we supervise Predator only with the circle and overlap losses; the matchability loss is added only after a few epochs, when the point-wise features are already meaningful (i.e., 30% of all points in can be matched correctly). The three loss terms are weighted equally. The hyper-parameters are set relative to the voxel size of the initial grid subsampling (respectively, the average point-to-point distance). For the circle loss, the positive radius is set to , and the safe radius is set to . For the overlap loss, is also set to and is set to be , in accordance with the distance threshold for a valid correspondence in the subsequent RANSAC pose estimation. The training settings for the ModelNet dataset are given in Sec. A.4.
4 Experiments
We evaluate Predator and justify our design choices on real scan data, using 3DMatch and 3DLoMatch. Additionally, we compare Predator to direct registration methods on the synthetic, object-centric ModelNet40.
4.1 Datasets and preprocessing
3DMatch/3DLoMatch: The official 3DMatch dataset considers only scan pairs with >30% overlap. Here, we add its counterpart that considers only scan pairs with overlaps between 10 and 30% and call this collection 3DLoMatch. (Footnote 4: Due to a bug in the official implementation of the overlap computation for 3DMatch, a few (<7%) scan pairs are included in both datasets.) For both datasets we stick to the accepted split into 54 training and 8 test scenes.
ModelNet40: The dataset [44] contains 12,311 CAD models of man-made objects from 40 different categories. We follow [47] and use 5,112 samples for training and 1,202 samples for validation, from the first 20 categories. We then test on 1,266 samples from the other 20 categories. Like [47], we randomly sample planes that cut away 30% of the points, to obtain point clouds with 70% completeness and, on average, 73.5% pairwise overlap. For our purposes we additionally run a version where we cut away 50% of the points, to obtain a second test set with 53.6% average overlap, which we call ModelLoNet (lower overlap is not meaningful, due to the low number of points per model).
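The plane-based cropping used to create partial point clouds can be sketched as follows. It is an assumption that the cut is realised by ranking the points along a random direction and keeping the upper share; the function name and the synthetic cloud are hypothetical.

```python
import math
import random

def crop_with_random_plane(points, keep_fraction, rng):
    """Cut the cloud with a randomly oriented plane, keeping `keep_fraction` of the points."""
    # Random direction on the unit sphere via a normalised Gaussian sample.
    direction = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(c * c for c in direction))
    direction = [c / norm for c in direction]
    # Rank the points by their signed distance along the plane normal.
    ranked = sorted(points, key=lambda p: sum(a * b for a, b in zip(p, direction)))
    n_keep = int(round(keep_fraction * len(points)))
    return ranked[len(points) - n_keep:]

# Toy example (hypothetical data): 100 random points, 70% completeness.
rng = random.Random(0)
cloud = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(100)]
cropped = crop_with_random_plane(cloud, keep_fraction=0.7, rng=rng)
```

Cropping the two copies of a model with independent planes explains why the average pairwise overlap differs from the completeness of each individual cloud.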
4.2 Evaluation metrics
We use the standard metrics of 3DMatch to assess the performance of Predator and to compare it to three state-of-the-art methods: 3DSN [15], FCGF [7] and D3Feat [3]. Our main metric, corresponding to the actual aim of point cloud registration, is Registration Recall (RR), i.e., the fraction of scan pairs for which the correct transformation parameters are found with RANSAC. Following the literature, we also report Feature Match Recall (FMR), defined as the fraction of pairs that have >5% "inlier" matches with <10 cm residual under the ground truth transformation (without checking if the transformation can be recovered from those matches), and Inlier Ratio (IR), the fraction of correct correspondences among the putative matches.
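The relation between Inlier Ratio and Feature Match Recall can be made concrete with a short sketch. The 10 cm residual and the 5% threshold are taken from the text above; the function names, the representation of matches as point pairs, and the toy data are assumptions.

```python
import math

def inlier_ratio(matches, inlier_dist=0.1):
    """Fraction of putative matches (aligned source point, target point)
    whose residual under the ground truth alignment is below `inlier_dist` metres."""
    inliers = sum(1 for p, q in matches if math.dist(p, q) < inlier_dist)
    return inliers / len(matches)

def feature_match_recall(ratios, min_ratio=0.05):
    """Fraction of scan pairs whose inlier ratio exceeds `min_ratio`."""
    return sum(1 for r in ratios if r > min_ratio) / len(ratios)

# Toy example (hypothetical data): two scan pairs with a few matches each.
pair_a = [((0.0, 0.0, 0.0), (0.05, 0.0, 0.0)), ((1.0, 0.0, 0.0), (1.5, 0.0, 0.0)),
          ((2.0, 0.0, 0.0), (2.5, 0.0, 0.0)), ((3.0, 0.0, 0.0), (3.5, 0.0, 0.0))]
pair_b = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)), ((1.0, 0.0, 0.0), (3.0, 0.0, 0.0))]
ratios = [inlier_ratio(pair_a), inlier_ratio(pair_b)]
fmr = feature_match_recall(ratios)
```

Neither number says whether RANSAC actually recovers the pose from the matches, which is why Registration Recall is treated as the main metric.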
For ModelNet40 we follow [47] and measure the performance using the Relative Rotation Error (RRE) (geodesic distance between estimated and GT rotation matrices), the Relative Translation Error (RTE) (Euclidean distance between the estimated and GT translations), and the Chamfer distance (CD) between the two scans after applying the estimated transformation. For more details please see Sec. A.1.
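The two pose errors can be computed directly from the estimated and ground truth parameters. The sketch below uses the standard trace formula for the geodesic distance between rotations; reporting the angle in degrees and the function names are assumptions made for the example.

```python
import math

def relative_rotation_error(R_est, R_gt):
    """Geodesic distance (in degrees) between two 3x3 rotation matrices."""
    # trace(R_est^T R_gt) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against rounding
    return math.degrees(math.acos(cos_angle))

def relative_translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground truth translations."""
    return math.dist(t_est, t_gt)

# Toy example (hypothetical data): the estimate is off by 30 degrees about z and 0.1 units.
angle = math.radians(30.0)
R_est = [[math.cos(angle), -math.sin(angle), 0.0],
         [math.sin(angle), math.cos(angle), 0.0],
         [0.0, 0.0, 1.0]]
R_gt = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rre = relative_rotation_error(R_est, R_gt)
rte = relative_translation_error((0.1, 0.0, 0.0), (0.0, 0.0, 0.0))
```

The clamp guards against floating-point values marginally outside the valid range of the arccosine when the two rotations are nearly identical.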
4.3 3DMatch
Relative overlap ratio: We first evaluate if Predator achieves its goal to focus on the overlap. We discard points whose predicted overlap score falls below a threshold, compute the overlap ratio, and compare it to the one of the original scans. Fig. 4 shows that more than half of the low-overlap pairs are pushed over the 30% threshold that prior works considered the lower limit for registration. On average, discarding points with low overlap scores almost doubles the overlap in 3DLoMatch. Notably, it also increases the overlap in standard 3DMatch by, on average, >35%.
| | 3DMatch | | | | | 3DLoMatch | | | | |
| # Samples (k) | 5000 | 2500 | 1000 | 500 | 250 | 5000 | 2500 | 1000 | 500 | 250 |
| Inlier ratio (%) | | | | | | | | | | |
| filt. (o) + prob. (om) | 55.2 | 55.2 | 53.7 | 51.3 | 46.4 | 24.7 | 25.0 | 24.8 | 24.0 | 22.7 |
| Registration Recall (%) | | | | | | | | | | |
| filt. (o) + prob. (om) | 85.7 | 84.4 | 86.6 | 86.3 | 83.3 | 51.3 | 53.3 | 54.6 | 54.4 | 52.0 |
Interest point sampling: Predator significantly increases the effective overlap, but does that improve downstream registration performance? To test this we use the overlap scores and matchability scores to bias interest point sampling. We compare three variants: top-k (om), where we multiply the two scores and pick the top-$k$ points according to the combined score; prob. (om), where we instead sample points with probability proportional to the combined score; and filt. (o) + prob. (om), where we discard points with a low overlap score, then sample from the remaining ones with probability proportional to their score.
Tab. 1 shows that any of the informed sampling strategies greatly increases the inlier ratio, and as a consequence also the registration recall. The gains are larger when fewer points are sampled. In the low-overlap regime the inlier ratios more than triple for up to 1000 points. We observe that, as expected, high inlier ratio does not necessarily imply high registration recall: our scores are apparently well calibrated, so that top-k (om) indeed finds most inliers, but these are often clustered and too close to each other to reliably estimate the transformation parameters. We thus use the more robust prob. (om) sampling, which yields the best registration recall. It may be possible to achieve even higher registration recall by combining top-k (om) sampling with non-maxima suppression. We leave this for future work.
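The three sampling variants can be sketched in a few lines. Several details are assumptions made for illustration: sampling is done without replacement (via the Efraimidis-Spirakis key trick), the filter threshold of 0.5 is hypothetical, the filtered variant samples proportionally to the combined score, and all function names are invented.

```python
import random

def weighted_sample(weights, k, rng):
    """Sample up to k distinct indices with probability proportional to `weights`
    (Efraimidis-Spirakis: keep the k largest keys u ** (1 / w))."""
    keyed = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights) if w > 0.0]
    keyed.sort(reverse=True)
    return [i for _, i in keyed[:k]]

def top_k(overlap, match, k):
    """Deterministically pick the k points with the highest combined score."""
    combined = [o * m for o, m in zip(overlap, match)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)[:k]

def prob_sample(overlap, match, k, rng):
    """Sample points with probability proportional to the combined score."""
    return weighted_sample([o * m for o, m in zip(overlap, match)], k, rng)

def filt_prob_sample(overlap, match, k, rng, min_overlap=0.5):
    """Drop points with low overlap score, then sample among the rest."""
    keep = [i for i, o in enumerate(overlap) if o >= min_overlap]
    picked = weighted_sample([overlap[i] * match[i] for i in keep], k, rng)
    return [keep[j] for j in picked]

# Toy example (hypothetical scores for four points).
overlap = [0.9, 0.8, 0.2, 0.1]
match = [0.9, 0.5, 0.9, 0.1]
rng = random.Random(0)
best = top_k(overlap, match, k=2)
drawn = prob_sample(overlap, match, k=2, rng=rng)
filtered = filt_prob_sample(overlap, match, k=2, rng=rng)
```

The deterministic variant always returns the same high-scoring points, which matches the observation that such points tend to cluster, whereas the probabilistic variants spread the selection while still favouring the overlap.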
[Table 2: Feature Match Recall (%), Inlier ratio (%) and Registration Recall (%) of the compared methods on 3DMatch and 3DLoMatch.]
Comparison to feature-based methods: We compare Predator to recent feature-based registration methods (Tab. 2). For a more comprehensive assessment we follow prior work and report performance with different numbers of sampled interest points. Qualitative results are shown in Fig. 6. Predator greatly outperforms existing methods on the low-overlap 3DLoMatch dataset, improving registration recall by 10-20 percentage points (pp) over the closest competitor (variously FCGF or D3Feat). Moreover, it also consistently reaches the highest registration recall on standard 3DMatch, showing that its attention to the overlap pays off even for scans with moderately large overlap. In line with our motivation, what matters is not so much the choice of descriptors, but finding interest points that lie in the overlap region, especially if that region is small.
The results also support our claim that one should evaluate the complete registration pipeline: FCGF slightly beats Predator in terms of FMR, except in the low-overlap, small sample regime. But Predator mostly compensates that deficit when looking at the inlier ratio, i.e., a higher number of potentially matchable point pairs does not always translate to more usable matches. (Footnote 5: See Tab. 1, where top-k (om) sampling has even higher inlier ratio than FCGF, yet lower registration performance.) Even in cases where the inlier ratio remains a bit below that of FCGF, our method achieves higher registration recall.
Comparison to direct registration methods: We tried to compare Predator also to recent methods for direct registration of partial point clouds. Unfortunately, for both PRNet  and RPM-Net , training on 3DMatch failed to converge to reasonable results, as already observed in . It appears that their feature extraction is specifically tuned to synthetic, object-centric point clouds. Thus, in a further attempt we replaced the feature extractor of RPM-Net with FCGF. This brought the registration recall on 3DMatch to 54.9%, still far from the 85.1% that FCGF features achieve with RANSAC. We conclude that direct pairwise registration is at this point only suitable for geometrically simple objects in controlled settings like ModelNet40.
We ablate our point scoring functions in Tab. 3. By conditioning the decoder input on the respective other point cloud, registration recall increases only by 0.2 pp for 3DMatch, but by 3.3 pp for 3DLoMatch. Adding also the cross-overlap score brings a bigger gain of 1.2 pp for 3DMatch, but no further gain for 3DLoMatch. For more ablation studies, please see Sec. A.6.
4.4 ModelNet40
Relative overlap ratio: We check if Predator focuses on the overlap region. We extract 8,862 test pairs by varying the completeness of the input point clouds from 70 to 40%. As above, we then discard points whose predicted overlap score falls below a threshold, compute the overlap ratio, and compare it to the one of the original scans. Fig. 7 shows that Predator greatly increases the relative overlap and reduces the number of pairs with overlap <70% by more than 40 pp.
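| Method | ModelNet RRE | ModelNet RTE | ModelNet CD | ModelLoNet RRE | ModelLoNet RTE | ModelLoNet CD |
| Ours (prob. (om)) | 1.856 | 0.019 | 0.00088 | 5.462 | 0.133 | 0.0079 |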
Comparison to direct registration methods: To be able to compare Predator to RPM-Net  and DCP , we resort to the synthetic, object-centric dataset they were designed for. We failed to train PRNet  due to random crashes of the original code (also observed in ).
Remarkably, Predator can compete with methods specifically tuned for ModelNet, and in the low-overlap regime outperforms them in terms of RRE, see Tab. 4. Moreover, we observe a large boost by sampling points with overlap attention (prob. (om)) rather than randomly (rand). Fig. 7 (right) further underlines the importance of sampling in the overlap: Predator is a lot more robust in the low overlap regime (8 lower RRE at completeness 0.4).
5 Conclusion
We have introduced Predator, a deep model designed for pairwise registration of low-overlap point clouds. The core of the model is an overlap attention module that enables early information exchange between the point clouds’ latent encodings, in order to infer which of their points are likely to lie in their overlap region.
There are a number of directions in which Predator could be extended. At present it is tightly coupled to fully convolutional point cloud encoders, and relies on having a reasonable number of superpoints in the bottleneck. Moreover, it builds on the prevalent definition of the overlap region, which counts the fraction of points with a feasible correspondence. This might be a limitation in scenarios where the point density is very uneven. Finally, in future work it would be interesting to explore how our overlap-attention module can be integrated into direct point cloud registration methods, and into other neural architectures that have to handle two or more datasets with low overlap.
-  (2019) PointnetLK: robust & efficient point cloud registration using Pointnet. In CVPR, Cited by: §2.
-  (1987) Least-squares fitting of two 3-d point sets. IEEE TPAMI 9 (5), pp. 698–700. External Links: Cited by: §2.
-  (2020) D3Feat: joint learning of dense detection and description of 3d local features. In CVPR, Cited by: §A.7, Table 6, Table 9, §1, §1, §2, §2, §3.5, Figure 6, §4.2, §4.3, Table 2.
-  (2015) Robust reconstruction of indoor scenes. In CVPR, Cited by: §A.1, §1.
-  (2020) Deep global registration. In CVPR, Cited by: §2, §4.3, §4.4.
-  (2019) 4D spatio-temporal convnets: minkowski convolutional neural networks. In CVPR, Cited by: §2, §3.
-  (2019) Fully convolutional geometric features. In ICCV, Cited by: §A.7, Table 6, Table 9, §1, §1, Figure 2, §2, Figure 6, §4.2, Table 2.
-  (1996) A volumetric method for building complex models from range images. In ACM SIGGRAPH, Cited by: §A.2.
PPF-FoldNet: unsupervised learning of rotation invariant 3d local descriptors. In ECCV, Cited by: §2.
-  (2018) PPFNnet: global context aware local features for robust 3d point matching. In CVPR, Cited by: §A.1, §2.
-  (2018) Superpoint: self-supervised interest point detection and description. In CVPR Workshops, Cited by: §2, §2.
-  (2019) D2-Net: a trainable CNN for joint detection and description of local features. In CVPR, Cited by: §2, §2.
-  (2017) Neural message passing for quantum chemistry. In ICML, Cited by: §3.3.
-  (2020) Learning multiview 3d point cloud registration. In CVPR, Cited by: §2.
-  (2019) The perfect match: 3d point cloud matching with smoothed densities. In CVPR, Cited by: Table 6, §1, §2, §2, §4.2, Table 2.
-  (2018) Learned compact local feature descriptor for TLS-based geodetic monitoring of natural outdoor scenes.. In ISPRS Annals, Cited by: §2.
-  (2014) Performance evaluation of 3D local feature descriptors. In ACCV, Cited by: §2.
-  (2016) Structured global registration of RGB-D scans in indoor environments. arXiv preprint arXiv:1607.08539. Cited by: §A.2.
-  (1999) Using spin images for efficient object recognition in cluttered 3d scenes. IEEE TPAMI 21, pp. 433–449. Cited by: §2.
-  (2017) Learning compact geometric features. In ICCV, Cited by: §1, §2, §3.1.
-  (2014) Unsupervised feature learning for 3d scene labeling. In ICRA, Cited by: §A.2.
-  (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §2.
-  (1981) An iterative image registration technique with an application to stereo vision. In IJCAI, Cited by: §2.
-  (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML, Cited by: §3.3.
-  (2010) Rectified linear units improve restricted boltzmann machines. In ICML, Cited by: §3.3.
Pointnet: deep learning on point sets for 3d classification and segmentation. In CVPR, Cited by: §2, §2, §3.4.
-  (2019) R2D2: repeatable and reliable detector and descriptor. arXiv preprint arXiv:1906.06195. Cited by: §2, §2.
-  (2009) Fast point feature histograms (FPFH) for 3D registration. In ICRA, Cited by: §2.
-  (2008) Aligning point cloud views using persistent feature histograms. In IROS, Cited by: §2.
-  (2020) Superglue: learning feature matching with graph neural networks. In CVPR, Cited by: §2, §3.3.
-  (2013) Scene coordinate regression forests for camera relocalization in RGB-D images. In CVPR, Cited by: §A.2.
-  (1964) A relationship between arbitrary positive matrices and doubly stochastic matrices. The annals of mathematical statistics 35 (2), pp. 876–879. Cited by: §2.
-  (2020) Circle loss: a unified perspective of pair similarity optimization. In CVPR, Cited by: §3.5.
-  (2019) KPconv: flexible and deformable convolution for point clouds. In CVPR, Cited by: §3.2, §3.
-  (2010) Unique shape context for 3D data description. In ACM Workshop on 3D Object Retrieval, Cited by: §2.
-  (2010) Unique signatures of histograms for local surface description. In ECCV, Cited by: §2.
-  (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §3.3, §3.3.
-  (2016) Learning to navigate the energy landscape. In 3DV, Cited by: §A.2.
-  (2017) Attention is all you need. In NeurIPS, Cited by: §3.3.
-  (2019) Deep closest point: learning representations for point cloud registration. In ICCV, Cited by: §2, §2, §4.4, Table 4.
-  (2019) PRNet: self-supervised learning for partial-to-partial registration. In NeurIPS, Cited by: §2, §2, §4.3, §4.4.
-  (2019) Dynamic graph CNN for learning on point clouds. ACM TOG 38 (5). Cited by: §2, §3.3.
-  (2020) D2D: learning to find good correspondences for image matching and manipulation. arXiv preprint arXiv:2007.08480. Cited by: §2, §2.
-  (2015) 3d shapenets: a deep representation for volumetric shapes. In CVPR, Cited by: §4.1.
-  (2013) Sun3d: a database of big spaces reconstructed using sfm and object labels. In ICCV, Cited by: §A.2.
-  (2018) 3dfeat-net: weakly supervised local 3d features for point cloud registration. In ECCV, pp. 630–646. Cited by: §2.
-  (2020) RPM-Net: robust point matching using learned features. In CVPR, Cited by: §A.1, §A.2, §2, §4.1, §4.2, §4.3, §4.4, Table 4.
-  (2017) 3DMatch: learning local geometric descriptors from RGB-D reconstructions. In CVPR, Cited by: §1, §2, §3.1, §4.1, footnote 6.
A Supplementary material
In this supplementary material, we first provide rigorous definitions of the evaluation metrics (Sec. A.1), then describe the data pre-processing step (Sec. A.2), network architectures (Sec. A.3) and training on ModelNet40 (Sec. A.4) in more detail. We further provide additional results (Sec. A.5), ablation studies (Sec. A.6) as well as a runtime analysis (Sec. A.7). Finally, we show more visualisations on the 3DLoMatch and ModelLoNet benchmarks (Sec. A.8).
A.1 Evaluation metrics
The evaluation metrics, which we use to assess model performance in Sec. 4 of the main paper and Sec. A.5 of this supplementary material, are formally defined as follows:
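Inlier ratio looks at the set of putative correspondences $(\mathbf{p}, \mathbf{q}) \in \mathcal{K}$ found by reciprocal matching in feature space (footnote 6: We follow 3DMatch and apply reciprocal matching as a pre-filtering step.), and measures what fraction of them is "correct", in the sense that they lie within a threshold $\tau_1$ (in cm) after registering the two scans with the ground truth transformation $\overline{\mathbf{T}}$:
$$\mathrm{IR} = \frac{1}{|\mathcal{K}|} \sum_{(\mathbf{p}, \mathbf{q}) \in \mathcal{K}} \big[\, \|\overline{\mathbf{T}}(\mathbf{p}) - \mathbf{q}\|_2 < \tau_1 \,\big],$$
with $[\cdot]$ the Iverson bracket.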
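As an illustration of the inlier ratio, the following minimal sketch counts the correspondences that fall within the threshold after applying the ground truth transformation. The function names, the representation of the transformation as a 3x3 rotation plus a translation, and the parameter name `tau1` are assumptions made for this example, not part of the original evaluation code.

```python
import math

def apply_transform(R, t, p):
    """Apply a rigid transform (3x3 rotation R, translation t) to a 3D point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

def inlier_ratio(correspondences, R_gt, t_gt, tau1):
    """Fraction of putative correspondences (p, q) with ||T_gt(p) - q|| < tau1.

    `correspondences` is a list of (p, q) pairs of 3D points; tau1 must be
    given in the same unit as the point coordinates.
    """
    if not correspondences:
        return 0.0
    inliers = sum(
        1
        for p, q in correspondences
        if math.dist(apply_transform(R_gt, t_gt, p), q) < tau1
    )
    return inliers / len(correspondences)
```

The Iverson bracket of the definition corresponds to the `if` condition inside the generator expression.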
Feature Match Recall (FMR) measures the fraction of point cloud pairs for which, based on the number of inlier correspondences, it is likely that accurate transformation parameters can be recovered with a robust estimator such as RANSAC. Note that FMR only checks whether the inlier ratio is above a threshold $\tau_2$. It does not test if the transformation can actually be determined from those correspondences, which in practice is not always the case, since their geometric configuration may be (nearly) degenerate, e.g., they might lie very close together or along a straight edge. A single pair of point clouds counts as suitable for registration if
$$\mathrm{IR} > \tau_2.$$
Registration recall is the most reliable metric, as it measures end-to-end performance on the actual task of point cloud registration. Specifically, it looks at the set of ground truth correspondences $\mathcal{H}$ after applying the estimated transformation $\mathbf{T}$, computes their root mean square error,
$$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{H}|} \sum_{(\mathbf{p}, \mathbf{q}) \in \mathcal{H}} \|\mathbf{T}(\mathbf{p}) - \mathbf{q}\|_2^2},$$
and checks for what fraction of all point cloud pairs the RMSE stays below the acceptance threshold. In keeping with the original evaluation script of 3DMatch, immediately adjacent point clouds are excluded, since they have very high overlap by construction.
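The two pair-level recalls defined above can be sketched as follows. This is a minimal illustration under assumptions: the helper names and the parameters `tau2` and `max_rmse` are hypothetical, and the threshold values must be supplied by the caller.

```python
import math

def feature_match_recall(inlier_ratios, tau2):
    """Fraction of point cloud pairs whose inlier ratio exceeds tau2."""
    return sum(1 for ir in inlier_ratios if ir > tau2) / len(inlier_ratios)

def correspondence_rmse(gt_correspondences, R_est, t_est):
    """RMSE of ground truth correspondences (p, q) after applying the estimated transform."""
    squared_error = 0.0
    for p, q in gt_correspondences:
        moved = [sum(R_est[i][j] * p[j] for j in range(3)) + t_est[i] for i in range(3)]
        squared_error += sum((a - b) ** 2 for a, b in zip(moved, q))
    return math.sqrt(squared_error / len(gt_correspondences))

def registration_recall(rmses, max_rmse):
    """Fraction of point cloud pairs whose RMSE is below the acceptance threshold."""
    return sum(1 for e in rmses if e < max_rmse) / len(rmses)
```

Note that `feature_match_recall` only consumes inlier ratios, whereas `registration_recall` depends on the estimated transformation, which is why the two metrics can disagree.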
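Chamfer distance measures the quality of registration on synthetic data. We follow RPM-Net and use the modified Chamfer distance metric:
$$\tilde{CD}(\mathbf{P}, \mathbf{Q}) = \frac{1}{|\mathbf{P}|} \sum_{\mathbf{p} \in \mathbf{P}} \min_{\mathbf{q} \in \mathbf{Q}_{\mathrm{raw}}} \|\mathbf{T}(\mathbf{p}) - \mathbf{q}\|_2^2 + \frac{1}{|\mathbf{Q}|} \sum_{\mathbf{q} \in \mathbf{Q}} \min_{\mathbf{p} \in \mathbf{P}_{\mathrm{raw}}} \|\mathbf{q} - \mathbf{T}(\mathbf{p})\|_2^2,$$
where $\mathbf{P}_{\mathrm{raw}}$ and $\mathbf{Q}_{\mathrm{raw}}$ are the raw source and target point clouds, and $\mathbf{P}$ and $\mathbf{Q}$ are the input source and target point clouds.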
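A brute-force sketch of such a modified Chamfer distance is given below. It assumes that both the input and the raw source points have already been transformed with the estimated transformation; the function and argument names are chosen for this example only.

```python
def _squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _mean_min_squared_distance(points, reference):
    """Average, over `points`, of the squared distance to the nearest point in `reference`."""
    return sum(min(_squared_distance(p, r) for r in reference) for p in points) / len(points)

def modified_chamfer(src_aligned, tgt, src_raw_aligned, tgt_raw):
    """Sum of two one-sided terms, each measured against the *raw* cloud of the other side.

    src_aligned / src_raw_aligned: input and raw source points after the
    estimated transformation (assumption of this sketch).
    tgt / tgt_raw: input and raw target points.
    """
    return (
        _mean_min_squared_distance(src_aligned, tgt_raw)
        + _mean_min_squared_distance(tgt, src_raw_aligned)
    )
```

The nested loops cost O(N·M) per term, which is acceptable for clouds with a few thousand points; larger clouds would call for a spatial index.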
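Relative translation and rotation errors (RTE/RRE) measure the deviations from the ground truth pose as:
$$\mathrm{RTE} = \|\mathbf{t} - \overline{\mathbf{t}}\|_2, \qquad \mathrm{RRE} = \arccos\left(\frac{\mathrm{trace}(\mathbf{R}^{\top}\overline{\mathbf{R}}) - 1}{2}\right),$$
where $\mathbf{R}$ and $\mathbf{t}$ denote the estimated rotation matrix and translation vector, respectively, and $\overline{\mathbf{R}}$ and $\overline{\mathbf{t}}$ their ground truth counterparts.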
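Both pose errors can be computed in a few lines. The sketch below reports the rotation error in degrees and clamps the cosine to [-1, 1] to guard against rounding; the function names are assumptions for this example.

```python
import math

def relative_translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground truth translation."""
    return math.dist(t_est, t_gt)

def relative_rotation_error(R_est, R_gt):
    """Geodesic angle (in degrees) between two 3x3 rotation matrices."""
    # trace(R_est^T R_gt) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))
```

For example, comparing the identity with a rotation of 90 degrees about the z-axis gives a trace of 1 and therefore an angle of 90 degrees.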
A.2 Dataset preprocessing
3DMatch: This is a collection of 62 scenes, combining earlier data from Analysis-by-Synthesis, 7Scenes, SUN3D, RGB-D Scenes v.2, and Halber et al. The official specifications split the data into 54 scenes for training and 8 for testing. Individual scenes are not only captured in different indoor spaces (e.g., bedrooms, offices, living rooms, restrooms) but also with different depth sensors (e.g., Microsoft Kinect, Structure Sensor, Asus Xtion Pro Live, and Intel RealSense). 3DMatch provides great diversity and allows our model to generalize across different indoor spaces. Individual scenes of 3DMatch are split into point cloud fragments, which are generated by fusing 50 consecutive depth frames using TSDF volumetric fusion. As a preprocessing step, we apply voxel-grid downsampling to all point clouds, and if multiple points fall into the same voxel, we randomly pick one.
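The voxel-grid downsampling with a random pick per voxel can be sketched as follows. This is a plain-Python illustration, not the actual preprocessing code; the function name and the `rng` argument are assumptions.

```python
import math
import random

def voxel_downsample(points, voxel_size, rng=None):
    """Keep one randomly chosen point per occupied voxel of edge length `voxel_size`."""
    rng = rng or random.Random()
    buckets = {}
    for p in points:
        # Integer voxel index of the point along each axis.
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets.setdefault(key, []).append(p)
    return [rng.choice(bucket) for bucket in buckets.values()]
```

Picking an existing point, rather than averaging the points in a voxel, keeps every output point on the measured surface.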
ModelNet40: For each CAD model, 2048 points are first generated by uniform sampling and scaled to fit into a unit sphere. Then we follow RPM-Net to produce partial scans: for the source partial point cloud, we uniformly sample a plane through the origin that splits the unit sphere into two half-spaces, shift that plane along its normal until the desired number of points lies on one side, and discard the points on the other side; the target point cloud is generated in the same manner. The two resulting partial point clouds are then randomly rotated, translated and jittered with Gaussian noise. For the rotation, we sample a random axis and a random angle < 45°. The translation is sampled in the range . Gaussian noise is applied per coordinate with . Finally, 717 points are randomly sampled from the remaining points.
A.3 Network architecture
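The half-space cropping step can be sketched as below. Shifting the plane along its normal until a fixed number of points remains is equivalent to ranking the points by their projection onto the normal and keeping the top ones. The number of points to keep is passed in as the hypothetical parameter `num_keep`, and the function names are chosen for this example.

```python
import math
import random

def random_unit_vector(rng):
    """Uniformly distributed direction on the unit sphere (normalised Gaussian sample)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:
            return [c / norm for c in v]

def crop_half_space(points, num_keep, rng=None):
    """Keep the `num_keep` points lying furthest along a random plane normal.

    Returns the kept points together with the sampled normal.
    """
    rng = rng or random.Random()
    normal = random_unit_vector(rng)
    ranked = sorted(
        points,
        key=lambda p: sum(a * b for a, b in zip(p, normal)),
        reverse=True,
    )
    return ranked[:num_keep], normal
```

Calling the function twice with independent random normals yields a source and a target crop whose overlap varies from pair to pair.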
The detailed network architecture of Predator is depicted in Fig. 9. Our model is built on the KPConv implementation from the D3Feat repository (footnote 7: https://github.com/XuyangBai/D3Feat.pytorch). We complement each KPConv layer with instance normalisation and Leaky ReLU activations. The $l$-th strided convolution is applied to a point cloud downsampled with voxel size . Upsampling in the decoder is performed by querying the associated feature of the closest point from the previous layer.
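The nearest-neighbour upsampling used in the decoder can be illustrated with a brute-force sketch. The function name and the list-based data layout are assumptions; an actual implementation would rely on pre-computed neighbourhood indices rather than an exhaustive search.

```python
import math

def nearest_upsample(fine_points, coarse_points, coarse_features):
    """Give every fine point the feature of its closest coarse point."""
    upsampled = []
    for p in fine_points:
        nearest = min(
            range(len(coarse_points)),
            key=lambda k: math.dist(p, coarse_points[k]),
        )
        upsampled.append(coarse_features[nearest])
    return upsampled
```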
With 20k points after voxel-grid downsampling, the point clouds in 3DMatch are much denser than those of ModelNet40 with only 717 points. Moreover, they also have a larger spatial extent with bounding boxes up to , while ModelNet40 point clouds are normalised to fit into a unit sphere. To account for these large differences, we slightly adapt the encoder and decoder per dataset, but keep the same overlap attention model. Differences in network hyper-parameters are shown in Tab. 5.
|# strided convolutions||convolution radius||first conv. feature dim.||final feature dim.|
A.4 Implementation and training
This section complements Sec. 3.5 of the main paper, where implementation and training details are described only for 3DMatch. Here, we provide those details for ModelNet40.
We train Predator on ModelNet40 for 200 epochs, using SGD with initial learning rate 0.01, momentum 0.98, and weight decay . The learning rate is exponentially decayed by a factor of 0.95 after each epoch. We use batch size 1, but accumulate gradients over 4 steps. Similar to 3DMatch, the matchability loss is added when >30% of points in the overlap region can be matched correctly. Due to the sparsity of ModelNet, the input point clouds are not voxel-grid downsampled before the first convolution layer. In the strided convolutions, the voxel size is set to 0.06. For the circle loss, the positive radius is set to 0.018, the safe radius is 0.06. For overlap loss and matchability loss, and are both set to 0.04. RANSAC is run for 50,000 iterations, with distance threshold .
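The learning rate schedule and the gradient accumulation described above can be made concrete with a small, framework-free sketch. The function names are hypothetical; only the numbers 0.01, 0.95 and 4 are taken from the text.

```python
def learning_rate(epoch, initial_lr=0.01, decay=0.95):
    """Learning rate in effect during `epoch` (0-based) under per-epoch exponential decay."""
    return initial_lr * decay ** epoch

def optimizer_step_batches(num_batches, accumulation_steps=4):
    """1-based batch counts after which accumulated gradients are applied."""
    return [i for i in range(1, num_batches + 1) if i % accumulation_steps == 0]
```

With batch size 1 and 4 accumulation steps, each parameter update aggregates the gradients of 4 point cloud pairs. After 200 epochs, the learning rate has decayed by a factor of 0.95^199 relative to its initial value.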
A.5 Additional results
Detailed registration results: We report detailed per-scene Registration Recall (RR), Relative Rotation Error (RRE) and Relative Translation Error (RTE) in Tab. 6. RRE and RTE are only averaged over successfully registered pairs for each scene, such that the numbers are not dominated by gross errors from complete registration failures. We get the highest RR and the lowest or second lowest RTE and RRE for almost all scenes. This further shows that our overlap attention module, together with probabilistic sampling, supports not only robust but also accurate registration.
|Kitchen||Home 1||Home 2||Hotel 1||Hotel 2||Hotel 3||Study||MIT Lab||Avg.||STD||Kitchen||Home 1||Home 2||Hotel 1||Hotel 2||Hotel 3||Study||MIT Lab||Avg.||STD|
|Registration Recall (%)|
|Relative Rotation Error (°)|
|Relative Translation Error (m)|
Feature match recall: Finally, Fig. 8 shows that our descriptors are robust and perform well over a wide range of thresholds for the allowable inlier distance and the minimum inlier ratio. Notably, Predator consistently outperforms D3Feat, which uses a similar KPConv backbone.
A.6 Additional ablation studies
Ablations of overlap attention module: We compare Predator with a baseline model, which is a plain encoder-decoder architecture based on KPConv, without the proposed overlap attention module. It outputs 32-dimensional features without overlap and matchability scores. In the absence of those scores, we randomly sample 5,000 points and pass them to RANSAC for registration. As shown in Tab. 7, this baseline model achieves the second highest FMR on both benchmarks, but only reaches 82.5% and 38.9% RR on 3DMatch and 3DLoMatch, respectively, which is much worse than the four other variants that include (at least) the overlap score. The experiment again confirms that high FMR or IR does not imply high RR, i.e., good registration performance.
Ablations of matchability score: We find that probabilistic sampling guided by the product of the overlap and matchability scores attains the highest RR. Here we further analyse the impact of each individual component. We first construct a baseline which applies random sampling (rand) over conditioned features, then we sample points with probability proportional to the overlap scores (prob. (o)), to the matchability scores (prob. (m)), and to the combination of the two scores (prob. (om)). As shown in Tab. 8, rand fares clearly worse in all metrics. Either prob. (o) or prob. (m) achieves results comparable to prob. (om) on 3DMatch, but the gap widens on the more challenging 3DLoMatch dataset, where our prob. (om) is around 4 pp better in terms of RR.
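The prob. (om) variant can be sketched as weighted sampling of point indices, with weights given by the product of the two scores. The sketch below samples with replacement via `random.choices`, which is an assumption of this example, as are the function and argument names.

```python
import random

def sample_interest_points(overlap, matchability, num_samples, rng=None):
    """Draw point indices with probability proportional to overlap * matchability."""
    rng = rng or random.Random()
    weights = [o * m for o, m in zip(overlap, matchability)]
    if sum(weights) <= 0.0:
        raise ValueError("all sampling weights are zero")
    # random.choices normalises the weights internally and samples with replacement.
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)
```

Passing a list of ones for `matchability` or for `overlap` reproduces the prob. (o) and prob. (m) variants, respectively, and passing ones for both reduces to uniform random sampling (rand).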
We compare the runtime of Predator with FCGF (footnote 8: All experiments were done with MinkowskiEngine v0.4.2.) and D3Feat (footnote 9: We use its PyTorch implementation.) on 3DMatch. For all three methods we use the same voxel size and batch size 1. The test is run on a single GeForce GTX 1080 Ti with Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz, 32GB RAM. The most time-consuming step of our model, and also of D3Feat, is the data loader, as we have to pre-compute the neighborhood indices before the forward pass. With its smaller encoder and decoder, but the additional overlap attention module, Predator is still marginally faster than D3Feat. FCGF has a more efficient data loader that relies on sparse convolution and queries neighbors during the forward pass. See Tab. 9.