Encoding 3D local geometry into descriptors is an essential ingredient in many 3D vision problems, such as recognition [24, 59], retrieval, segmentation [32, 28], registration, and reconstruction [43, 25]. In this paper, we are interested in developing point descriptors for robust registration of 3D point clouds. Matching the 3D geometry of scans of real-world scenes is a challenging task, since the input data is often noisy and partial (Fig. 1). To accommodate such issues, learning-based descriptors have received significant attention in recent years, demonstrating superior performance over hand-crafted ones.
Existing literature has investigated both supervised and unsupervised approaches for descriptor learning. The supervised approaches [33, 15, 75, 21, 38] have demonstrated remarkable performance on existing point cloud registration benchmarks, such as the 3DMatch benchmark. A contributing factor to their success is that network training is exposed to challenging local geometry matching between partially overlapping point clouds. The training, however, requires abundant supervision, namely labels of point correspondences, to construct training pairs, triplets, or N-tuples [75, 21, 15] for metric learning. Obtaining these annotations is challenging, since it demands accurate scene reconstructions [75, 13]. In contrast, the unsupervised approaches [14, 76, 56] seek to remove such requirements on training data by using auto-encoder based architectures. Their reconstruction loss, however, does not directly optimize the similarity between descriptors, leaving room for improvement to close the gap with the supervised approaches.
In this work, we propose a novel unsupervised framework for descriptor learning. We diverge from the above auto-encoder paradigm. Instead, we extract descriptors with only an encoder network, but eschew training supervision by performing registration in the network. This allows us to formulate learning objectives with descriptor similarity both across and within point clouds. We name our learned descriptors UPDesc. Overall, there are two main parts in our method: descriptor extraction and unsupervised training.
To extract descriptors for points of interest, we build upon 3DSmoothNet, a supervised 3D CNN-based architecture that uses a voxel-based representation for the surrounding geometry of the interest points. Unlike other representations of 3D geometry (e.g., point pair features [15, 14] or multi-view images [28, 38]), which require additional information such as normals, the voxel-based representation only needs point positions. 3DSmoothNet, as well as many supervised and unsupervised methods [33, 75, 14, 76, 56], uses a predefined fixed-size local support for input parameterization. This design choice, however, may not be optimal for capturing sufficient local geometry in the descriptors. On the other hand, works like [38, 28] using a multi-view representation have demonstrated that incorporating broader context is beneficial to learning discriminative descriptors. Thus we are motivated to enable the network to learn the local support size in a data-driven manner. However, conventional voxelization [75, 21] is a discretization operation without defined gradients, which prevents gradient back-propagation to the voxel grids for size optimization. Inspired by [40, 38], we develop a differentiable voxelization module to bridge this gap.
For unsupervised training, we leverage a registration loss and a contrastive loss to guide the descriptor similarity optimization. The former is formulated on the 3D transformations estimated by in-network geometric registration, which builds upon descriptor matching between two point clouds. The latter complements the registration loss by introducing metric learning directly on the descriptor space, similar to the above supervised methods. For matching between point clouds, recent works [63, 64, 20] have employed learned descriptors to solve pairwise registration with differentiable singular value decomposition (SVD). However, they require ground-truth transformations to supervise the estimation results so that the descriptors can be optimized indirectly. Inspired by this learning framework, we observe that the transformations can be computed by weighted least-squares fitting instead of SVD, and that training signals can be obtained by measuring the deviation of the transformations from SE(3) (i.e., rigid transformations), which gives rise to our registration loss. The descriptor matching between point clouds is prone to error if the descriptors are not well distributed in the descriptor space, which leads to invalid registration and training signals. To alleviate this issue, the contrastive loss is used to directly optimize the descriptor similarity for points in each individual point cloud.
Our training losses require neither correspondence labels nor pose annotations of point clouds for supervision. Through extensive experiments on the point cloud registration benchmarks [75, 65], we validate the superior performance of our method over existing unsupervised ones, further narrowing the gap with the supervised techniques.
Our main contributions are as follows: (1) A novel differentiable voxelization layer, enabling in-network conversion from point clouds to a voxel-based representation and allowing data-driven local support size optimization; (2) An unsupervised approach for learning local descriptors, with the novel registration loss based on deviation from rigidity, outperforming other unsupervised methods. Our code will be made publicly available upon publication.
2 Related Work
Hand-crafted 3D Local Descriptors. A considerable amount of literature has investigated hand-crafted descriptors for encoding 3D local geometry. A widely adopted paradigm is to use a histogram-based representation to parameterize local geometry. Generally, the statistical information of points collected by the histograms can be categorized into spatial distributions and geometric attributes. The former serves as the basis for descriptors like Spin Image, 3D Shape Context, and Unique Shape Context, while the latter is adopted by descriptors like PFH, FPFH, and SHOT [60, 54]. We refer the reader to the comprehensive survey by Guo et al. on hand-crafted descriptors.
Learned 3D Local Descriptors. In recent years, with the development of deep neural networks, significant research attention has focused on encoding 3D local geometry into descriptors in a data-driven manner. There are supervised and unsupervised approaches currently under active investigation, as discussed below.
The supervised approaches to descriptor learning [75, 33, 15, 70, 21, 11, 17, 1, 38, 46] require correspondence labels to be available in point cloud pairs for training. Their training objective mainly aims at maximizing the descriptor similarity of corresponding points while minimizing the descriptor similarity of non-corresponding points. Such an objective has been translated to contrastive, triplet, and N-tuple losses. The supervised approaches generally differ in the choices of input parameterizations and network backbones. More specifically, works like 3DMatch and 3DSmoothNet adopt voxel grids, a structured representation for local geometry input, and 3D CNNs for descriptor extraction. Deng et al. examined PointNet for directly encoding point pair features extracted from input patches. Li et al. proposed to project local geometry as multi-view images and leverage 2D CNNs as the descriptor encoder. To densely extract per-point descriptors, the networks of [11, 4, 29] are built upon sparse convolutions or kernel point convolutions. Among the above descriptors, those of [33, 21, 1, 46] are rotationally invariant by design, while those of [75, 15, 70, 11, 38, 4] are not. In this work, we base the descriptor extraction step of our unsupervised framework on 3DSmoothNet, and address its issue of using predefined fixed-size local support with differentiable voxelization.
The unsupervised approaches to descriptor learning [14, 76, 56] seek to remove the labeling requirement on training data. To this end, the existing solutions base the learning framework on auto-encoders, whose features at the bottleneck are accepted as descriptors. Their training loss minimizes the discrepancy between the input local geometry and the reconstruction output, and can be expressed by the Chamfer distance. The unsupervised approaches vary in input parameterizations and encoder architectures. Concretely, to obtain a descriptor from input local geometry, PPF-FoldNet transforms the input to a point pair feature representation and employs PointNet as the encoder. With the same point pair feature representation, Zhao et al. proposed a 3D point-capsule network for feature encoding. By converting input patches into a spherical signal representation, Spezialetti et al. used Spherical CNNs as the encoder. The descriptors learned by these approaches are rotationally invariant. To reconstruct the input from a descriptor, the unsupervised methods [14, 76, 56] opt for FoldingNet as their decoder network, which warps a low-dimensional grid with the descriptor towards the input.
Instead of training with a reconstruction loss, we perform geometric registration with the descriptors in the network in a differentiable manner and formulate the deviation from rigidity of the resulting transformations in our registration loss. A contrastive loss is also taken into account to directly optimize the descriptor similarity within individual point clouds.
Learning-based Registration. A number of studies have incorporated the point cloud registration pipeline into deep neural networks for end-to-end learning [2, 63, 64, 41, 16, 36, 71, 20, 30, 9]. For example, PointNetLK employs PointNet as a global feature extractor and iteratively optimizes the transformation parameters of a point cloud pair in a similar manner to the Lucas & Kanade algorithm. Deep Closest Point (DCP) builds soft correspondences between point clouds with learned point descriptors and uses a differentiable SVD layer for rigid transformation estimation. Follow-up works [64, 71] improve DCP by handling partial-to-partial registration. To accommodate correspondence outliers, works such as [9, 44, 72] introduce correspondence classification blocks in the networks prior to registration. Supervision such as pose annotations is still essential in these end-to-end learning-based methods. In contrast, our work focuses on descriptor extraction with the unsupervised registration loss, which penalizes the deviation from rigidity of the estimated transformations.
UPDesc takes a 3D point cloud and a point of interest as input for descriptor extraction. We aim to obtain a descriptor that effectively encodes the geometry surrounding the point of interest in the point cloud, and we seek to learn this descriptor in an unsupervised manner. In general, our method has two stages: descriptor extraction (Sec. 3.2) and unsupervised training (Sec. 3.3), as illustrated in Fig. 2 and briefly described below.
Our descriptor extraction network is built upon 3DSmoothNet, which uses a voxel-based representation for input local geometry and handles rotation invariance explicitly with local reference frames (LRFs). 3DSmoothNet considers a fixed-size local neighborhood of the interest point as input, which, however, potentially limits the learning of an informative descriptor. We instead enable the network to learn the support size for each interest point, which is then used in voxelization. To this end, we integrate the input parameterization into the network through a differentiable voxelization module (Sec. 3.2), which performs probabilistic aggregation to convert the point-based representation to a voxel grid. The resulting voxel-based representation can be seamlessly fed to the subsequent 3D CNN for descriptor extraction.
To train the descriptor extraction network, we use a registration loss based on descriptor matching between two point clouds and a contrastive loss for guiding the descriptor similarity within each individual point cloud. To formulate the registration loss, we follow existing end-to-end registration works [63, 64, 71, 20] and use the learned descriptors to build putative correspondences between point clouds with differentiable nearest neighbor query. We solve for the 3D transformation with weighted least-squares fitting instead of SVD. Our registration loss (see Fig. 2) then penalizes the deviation of the resulting transformation from SE(3), including orthogonality and cycle consistency. We elaborate on this loss in Sec. 3.3 (Registration Loss). The registration could be unstable if the descriptors are not well distributed in the descriptor space, leading to erroneous correspondences. Similar to the supervised methods [75, 21, 15], we use the contrastive loss (see Fig. 2) to directly promote descriptor similarity for spatially close points and dissimilarity for spatially distant points in the same point cloud. Details of the contrastive loss are described in Sec. 3.3 (Contrastive Loss).
3.2 Descriptor Extraction
To extract a descriptor for a point of interest, we convert its surrounding geometry in the point cloud to a voxel grid (Fig. 3), building on the input parameterization in 3DSmoothNet. The voxel grid has a fixed, predefined resolution. Prior to voxelization, we need to determine the orientation and the size of the voxel grid. The orientation addresses rotation invariance, and the size determines the amount of local geometry information in the final descriptor. For the orientation, we explicitly compute a local reference frame (LRF) based on a local patch located at the interest point (Fig. 3). The local patch consists of the points within a predefined radius of the interest point. The LRF can be estimated from the local patch by the approach of [21, 68] based on eigendecomposition.
For the size of the voxel grid, 3DSmoothNet simply fits the grid to enclose all the points in the local patch. In contrast, we seek to optimize the grid size in a data-driven manner to capture local geometry not circumscribed by the predefined local support. Specifically, we feed the patch to a mini-PointNet to regress the grid size (Fig. 3). This support size estimation sub-network has its own learnable parameters.
Differentiable Voxelization. Conventional voxelization is non-differentiable and cannot back-propagate gradients of the training losses to the support size estimation sub-network. Inspired by [40, 38], we design a differentiable voxelization module, enabling point-to-voxel conversion within the network. We anchor the center of the voxel grid at the interest point and align it to the estimated LRF.
For each voxel in the transformed grid, we perform probabilistic aggregation to compute the voxel value. To simplify computation, we treat each voxel as a sphere. For a given point and a given voxel, we define the probability that the point is contained in the voxel. This probability depends on the center of the voxel (Fig. 4-a), a sign indicator, and a parameter that controls the sharpness of the probability distribution. The voxel value is then aggregated from the probabilities of the individual points.
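The probabilistic aggregation described above can be illustrated with a small sketch. The exact formulas are not reproduced here, so the code assumes a soft-rasterization-style form: a sigmoid of a signed squared distance to the voxel sphere, followed by a probabilistic OR over points. The function names, the default `sigma`, and this specific functional form are assumptions made for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def point_in_voxel_prob(point, center, radius, sigma):
    """Soft probability that `point` lies inside the sphere (center, radius).

    Assumed form: sigmoid(sign * gap^2 / sigma), where `gap` is the distance
    from the point to the sphere surface and `sigma` controls sharpness.
    """
    gap = math.dist(point, center) - radius
    sign = 1.0 if gap <= 0 else -1.0  # sign indicator: +1 inside, -1 outside
    return sigmoid(sign * gap * gap / sigma)


def voxel_value(points, center, radius, sigma=1e-3):
    """Aggregate per-point probabilities with a probabilistic OR:
    the voxel is occupied unless every point misses it."""
    prob_empty = 1.0
    for p in points:
        prob_empty *= 1.0 - point_in_voxel_prob(p, center, radius, sigma)
    return 1.0 - prob_empty
```

Because every operation is smooth in the point coordinates and in `radius`, gradients can flow back to whatever produced the voxel size, which is the property the module needs. In a real implementation this would be vectorized with tensor operations rather than Python loops.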
3D CNN. We use the 3D CNN-based architecture of 3DSmoothNet to encode the voxel grid as the final descriptor. The 3D CNN has its own learnable parameters. More specifically, the network comprises six convolutional layers, which progressively down-sample the input. Each convolutional layer is followed by a normalization layer and a ReLU activation layer. The output of the last convolutional layer is flattened and fed to a fully connected layer followed by L2 normalization, resulting in a unit-length descriptor.
3.3 Unsupervised Training
In this section, we describe our loss formulations for unsupervised descriptor learning. Our descriptor losses consist of a registration loss and a contrastive loss. In the following, we consider as input a pair of partially overlapping point clouds for training.
Registration Loss. The registration loss seeks to optimize the similarity of the learned descriptors across point clouds. In a similar spirit to other learning-based registration works [63, 64], we build a set of putative correspondences between the two point clouds in the descriptor space and compute a 3D transformation for registration within the network. By imposing losses on the resulting transformation matrix, the descriptors can be optimized to improve the registration. The learning-based registration works enforce the transformation to be in SE(3) via differentiable SVD [63, 41], and compare the transformation with the ground truth for supervision. In contrast, we relax the computation with weighted least-squares fitting. Although the resulting transformation may not be rigid, we see this deviation as an opportunity to obtain unsupervised training signals for optimizing the descriptor network, as done in the context of non-rigid shape matching.
To map from one point cloud to the other, we define a transformation that consists of a 3×3 matrix and a 3D translation vector. The inverse transformation, which maps in the opposite direction, is defined similarly. We describe the estimation of both transformations later in this section (i.e., Eq. (7)). In our registration loss, we encourage the estimated matrices of the forward and inverse transformations to be rotation matrices, i.e., to lie in SO(3). Eq. (4) measures the orthogonality of the two matrices with respect to the identity matrix, and Eq. (5) measures the determinant. We also enforce a cycle consistency between the forward and inverse transformations.
Our registration loss is a weighted sum of the orthogonality, determinant, and cycle consistency losses above (see the weight settings in Sec. 3.4).
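As a concrete illustration of the three terms, here is a minimal sketch in plain Python. The precise forms of Eqs. (4) to (6) are not reproduced here, so the squared Frobenius norms, the squared determinant residual, and the unit default weights below are assumptions, as are all function names.

```python
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    return [list(row) for row in zip(*a)]


def det3(a):
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


def frob_sq(a, b):
    # Squared Frobenius norm of (a - b).
    return sum((a[i][j] - b[i][j]) ** 2 for i in range(3) for j in range(3))


def orthogonality_loss(r):
    # Zero exactly when R^T R = I.
    return frob_sq(matmul(transpose(r), r), IDENTITY)


def determinant_loss(r):
    # Zero when det(R) = 1, which rules out reflections and scaling.
    return (det3(r) - 1.0) ** 2


def cycle_loss(r_ab, t_ab, r_ba, t_ba):
    # Forward then inverse, x -> R_ba (R_ab x + t_ab) + t_ba, should be identity.
    rot = matmul(r_ba, r_ab)
    trans = [sum(r_ba[i][k] * t_ab[k] for k in range(3)) + t_ba[i]
             for i in range(3)]
    return frob_sq(rot, IDENTITY) + sum(v * v for v in trans)


def registration_loss(r_ab, t_ab, r_ba, t_ba, w_orth=1.0, w_det=1.0, w_cyc=1.0):
    # Hypothetical unit weights; the actual settings are given in Sec. 3.4.
    orth = orthogonality_loss(r_ab) + orthogonality_loss(r_ba)
    det = determinant_loss(r_ab) + determinant_loss(r_ba)
    cyc = cycle_loss(r_ab, t_ab, r_ba, t_ba)
    return w_orth * orth + w_det * det + w_cyc * cyc
```

A pair of mutually inverse rigid transformations gives a loss of zero, while a scaled or reflected matrix is penalized. The sketch needs no ground-truth pose, which is what makes the loss usable without supervision.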
Next, we describe the estimation of the forward transformation based on the descriptor similarity between the two point clouds; the inverse transformation is computed similarly. We use the learned descriptors (Sec. 3.2) to construct a putative correspondence set with differentiable nearest neighbor query, where, for each point in one point cloud, its closest neighbor in the other point cloud in the descriptor space is retrieved as its match. To compute the transformation, we minimize a weighted quadratic error over the correspondences (Eq. (7)), which is a least-squares problem and can be solved in closed form. In Eq. (7), each correspondence is weighted by its confidence of being an inlier. We compute the confidence with the spectral matching technique, which estimates correspondence reliability from the isometry compatibility of pairwise correspondences. More details of spectral matching can be found in [37, 3, 74, 22].
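To make the closed-form solution concrete, the sketch below fits an unconstrained 3×3 matrix and a translation to weighted correspondences by solving the normal equations. It assumes the objective is the weighted sum of squared residuals between transformed source points and their matches. The function names and the singularity tolerance are illustrative, and the confidence weights are taken as given rather than computed by spectral matching.

```python
def inv3(m):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("degenerate point configuration")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]


def fit_weighted_affine(src, dst, weights):
    """Minimize sum_i w_i * ||A p_i + t - q_i||^2 over a 3x3 matrix A and vector t."""
    total = sum(weights)
    mean_p = [sum(w * p[k] for w, p in zip(weights, src)) / total for k in range(3)]
    mean_q = [sum(w * q[k] for w, q in zip(weights, dst)) / total for k in range(3)]
    s_pp = [[0.0] * 3 for _ in range(3)]  # weighted covariance of source points
    s_qp = [[0.0] * 3 for _ in range(3)]  # weighted cross-covariance
    for w, p, q in zip(weights, src, dst):
        dp = [p[k] - mean_p[k] for k in range(3)]
        dq = [q[k] - mean_q[k] for k in range(3)]
        for r in range(3):
            for c in range(3):
                s_pp[r][c] += w * dp[r] * dp[c]
                s_qp[r][c] += w * dq[r] * dp[c]
    inv = inv3(s_pp)
    a = [[sum(s_qp[r][k] * inv[k][c] for k in range(3)) for c in range(3)]
         for r in range(3)]
    t = [mean_q[r] - sum(a[r][c] * mean_p[c] for c in range(3)) for r in range(3)]
    return a, t
```

Unlike an SVD-based solver, nothing here forces the matrix to be a rotation. With clean correspondences the fit recovers the rigid motion, and with bad correspondences it drifts away from SO(3), which is the deviation the registration loss measures.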
Contrastive Loss. We incorporate a contrastive loss to evaluate the descriptor similarity in each individual point cloud. The registration loss optimizes the descriptor similarity across point clouds indirectly via registration, and the descriptor distinctiveness in turn affects the correspondences and the resulting transformation. If the descriptors are not well distributed in the descriptor space, the registration may be unstable with erroneous correspondences, leading to ineffective training signals and slow convergence. Thus, to facilitate the registration loss, we employ a direct metric learning loss in the descriptor space, similar to the supervised methods [75, 21, 15]. The intuition is that spatially nearby points tend to have similar descriptors, while spatially distant points are more likely to have dissimilar descriptors. We use the double-margin contrastive loss [39, 77] for each point cloud (Eq. (8)). Each term of the loss involves the descriptors of two points in the same point cloud, where the second point is in the nearest neighbor set of the first point in the descriptor space. We generate the label of each pair on the fly according to the spatial distance between the two points: the pair is positive if this distance is below a spatial distance threshold, and negative otherwise. Two margins are used, one for positive pairs and one for negative pairs. The contrastive loss for the other point cloud is computed similarly.
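The label generation and the two margins can be sketched as follows. This uses the common double-margin form, a squared hinge on the descriptor distance with separate margins for positive and negative pairs, which is an assumption about Eq. (8) rather than a transcription of it. The function names, the averaging over pairs, and all parameter names are illustrative.

```python
import math


def double_margin_term(desc_a, desc_b, is_positive, margin_pos, margin_neg):
    d = math.dist(desc_a, desc_b)
    if is_positive:
        # Pull positives together until they are within margin_pos.
        return max(0.0, d - margin_pos) ** 2
    # Push negatives apart until they are at least margin_neg away.
    return max(0.0, margin_neg - d) ** 2


def contrastive_loss(points, descs, neighbor_ids, dist_thresh, margin_pos, margin_neg):
    """points[i] and descs[i] belong to the same keypoint; neighbor_ids[i] lists
    the indices of its nearest neighbors in descriptor space."""
    total, count = 0.0, 0
    for i, nbrs in enumerate(neighbor_ids):
        for j in nbrs:
            # Label generated on the fly from the spatial distance.
            is_positive = math.dist(points[i], points[j]) < dist_thresh
            total += double_margin_term(descs[i], descs[j], is_positive,
                                        margin_pos, margin_neg)
            count += 1
    return total / count if count else 0.0
```

Since the pairs come from nearest neighbors in descriptor space, the negatives are hard negatives by construction: points that currently look alike to the network but are spatially far apart.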
We also add a regularization loss on the support size estimates to regularize their magnitude. Our final training loss (Eq. (9)) combines the registration, contrastive, and regularization losses.
We implemented our method with PyTorch. Following 3DSmoothNet, we set the descriptor dimension to 32. For the 3DMatch dataset  (Sec. 4.1), we use m and set the voxel grid resolution . In the contrastive loss Eq. (8), we set and m, and the margins are and according to . For the loss term weights, we use , , and in , and in . In each training step, our network takes as input a pair of point clouds. We sample 300 keypoints in each point cloud with farthest point sampling  for descriptor extraction and matching. We use Adam  as the network optimizer with an initial learning rate of 0.001. The network is trained for 16K steps, and the learning rate is decayed by 0.1 at the half of training. Training details for the ModelNet40 dataset  (Sec. 4.2) are given in the supplementary material.
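For readers unfamiliar with farthest point sampling, which is used above to pick keypoints, the following is a minimal sketch of the standard greedy procedure. The function name and the deterministic `start_index` are illustrative choices; practical implementations usually start from a random point and run on the GPU.

```python
import math


def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedily pick points so that each new pick is as far as possible
    from everything already selected. Returns the selected indices."""
    n = len(points)
    num_samples = min(num_samples, n)
    if num_samples == 0:
        return []
    selected = [start_index]
    # Distance from every point to its closest selected point so far.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        nxt = max(range(n), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected
```

Calling `farthest_point_sampling(cloud, 300)` would return 300 indices spread roughly evenly over the cloud, which gives better spatial coverage than uniform random sampling for the same number of keypoints.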
We validate the performance of our proposed UPDesc on existing point cloud registration benchmarks, including 3DMatch (Sec. 4.1) and ModelNet40 (Sec. 4.2). The 3DMatch dataset consists of point clouds of indoor scene scans, while the ModelNet40 dataset consists of object-centric point clouds generated from CAD models.
4.1 3DMatch Dataset
The 3DMatch dataset is widely adopted for evaluating descriptor performance on geometric registration. It is constructed from several existing RGB-D datasets [62, 55, 66, 35, 26]. In total, there are 62 indoor scenes: 54 of them for training and validation, and 8 of them for testing. Each testing point cloud has 5,000 randomly sampled keypoints for descriptor extraction and matching. We voxel-downsample the point clouds with a voxel size of 3 cm.
Evaluation Metrics. Following [75, 15, 7, 11], we compute inlier ratio (IR), feature-match recall (FMR), and registration recall (RR) on 3DMatch. Inlier ratio measures the fraction of inlier correspondences, given as input a set of putative correspondences built in the descriptor space for a point cloud pair. For an inlier correspondence, the distance between the two matching points should be less than a distance threshold under the ground-truth transformation of the point cloud pair. Feature-match recall is computed as the fraction of point cloud pairs whose IR is above an inlier ratio threshold, and we report it over a range of such thresholds. Registration recall is computed as the fraction of point cloud pairs with correct transformations estimated by a RANSAC-based registration pipeline. An estimated transformation is considered correct if it brings the root mean square error (RMSE) of the ground-truth correspondences below 0.2 m.
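The three metrics reduce to a few lines each. The sketch below is an illustration under stated assumptions: the distance and inlier ratio thresholds are passed in as parameters because their values are not given in the text above, and the function names are hypothetical. Only the 0.2 m RMSE threshold is taken from the text.

```python
import math


def apply_transform(rot, trans, p):
    return tuple(sum(rot[i][k] * p[k] for k in range(3)) + trans[i]
                 for i in range(3))


def inlier_ratio(correspondences, rot_gt, trans_gt, dist_thresh):
    """correspondences: list of (source_point, target_point) pairs."""
    if not correspondences:
        return 0.0
    inliers = sum(
        1 for src, tgt in correspondences
        if math.dist(apply_transform(rot_gt, trans_gt, src), tgt) < dist_thresh)
    return inliers / len(correspondences)


def feature_match_recall(inlier_ratios, ratio_thresh):
    """Fraction of point cloud pairs whose inlier ratio exceeds ratio_thresh."""
    if not inlier_ratios:
        return 0.0
    return sum(1 for ir in inlier_ratios if ir > ratio_thresh) / len(inlier_ratios)


def registration_is_correct(gt_correspondences, rot_est, trans_est, rmse_thresh=0.2):
    """RMSE of ground-truth correspondences under the estimated transformation."""
    if not gt_correspondences:
        return False
    sq = [math.dist(apply_transform(rot_est, trans_est, src), tgt) ** 2
          for src, tgt in gt_correspondences]
    return math.sqrt(sum(sq) / len(sq)) < rmse_thresh
```

Registration recall is then the fraction of test pairs for which `registration_is_correct` returns true. Note that IR and FMR depend only on the descriptors, whereas RR also depends on the RANSAC pipeline that produces the estimated transformation.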
Comparisons. We compare with hand-crafted descriptors and existing unsupervised descriptor learning methods. The former includes FPFH and SHOT, which are implemented in the PCL library and have descriptor dimensions of 33 and 352, respectively. The latter includes PPF-FoldNet, CapsuleNet, and S2CNN. Their implementations are based on publicly available codebases (https://github.com/XuyangBai/PPF-FoldNet, https://github.com/yongheng1991/3D-point-capsule-networks, and https://github.com/jonas-koehler/s2cnn), and they all have a descriptor dimension of 512. The high dimensionality can make nearest neighbor search computationally inefficient [33, 21]. For a more direct comparison with the state-of-the-art S2CNN, we also implemented an S2CNN variant that produces 32-dimensional descriptors.
Table 1 (bottom) shows the inlier ratio (IR) comparison for the methods without supervision. Our descriptor obtains the highest IR (45.1%) among all the unsupervised methods. UPDesc outperforms S2CNN (32) by 16 percentage points and S2CNN (512) by 10.6 percentage points, indicating the better quality of the correspondences built by our method.
For the feature-match recall (FMR) comparison in Table 1, at the easier inlier ratio threshold, our descriptor achieves an FMR of 94.1%, better than PPF-FoldNet, CapsuleNet, and S2CNN (32). Yet this is a relatively easy threshold, as discussed in [38, 21, 14], and descriptor performance tends to saturate. At the harder threshold, our descriptor obtains 79.5%, significantly outperforming S2CNN (32) by 19.6 percentage points and S2CNN (512) by 9.2 percentage points. Furthermore, Fig. 5 plots the FMR performance with respect to different inlier ratio thresholds. Our descriptor is less sensitive to the inlier ratio threshold, which can be ascribed to its better IR performance.
Table 1 also includes the registration recall (RR) performance. Our descriptor achieves the best RR performance (79.7%) among all the unsupervised methods and outperforms S2CNN (32) by 5.8 percentage points. Fig. 6 visualizes some challenging point cloud registration examples with a large portion of flat surfaces, on which our descriptor demonstrates better robustness.
For completeness, in Table 1 (top) we also include the performance of supervised descriptor learning methods. It is observed that our method narrows the gap with the state-of-the-art supervised methods [4, 38]. Interestingly, our descriptor achieves better IR and RR performance than 3DSmoothNet, which uses a fixed-size local support and a 3D CNN backbone but with supervision.
To test rotation invariance, we use the rotated 3DMatch dataset, where each point cloud is rotated about randomly sampled axes by randomly sampled angles. Table 2 reports the performance of the compared methods. Our descriptor maintains similar performance on this dataset, with the best IR (43.8%), FMR (78.8%), and RR (79.1%) scores.
Table 3 reports the running time comparison of the unsupervised descriptor learning methods. The results were collected on a desktop computer with an Intel Core i7 @ 3.6GHz and an NVIDIA GTX 1080Ti GPU. Note that for the input preparation, PPF-FoldNet and CapsuleNet compute point pair features, S2CNN performs spherical representation conversion, and UPDesc estimates LRFs. Overall, our method and PPF-FoldNet have comparable speed, while S2CNN is computationally much slower.
4.2 ModelNet40 Dataset
We perform comparisons with existing learning-based registration methods [2, 63, 64] on the ModelNet40 dataset, following the protocol of prior work. The point clouds fall into 40 man-made object categories. There are 9,843 point clouds for training and 2,468 point clouds for testing. The number of points in each point cloud is 1,024. To generate point cloud pairs, a new point cloud is obtained by transforming each testing point cloud with a random rigid transformation. The rotation angle about each axis and the 3D translation offset are each sampled from a fixed range. To synthesize partial overlap for a pair of point clouds, the 768 nearest neighbors of a randomly placed point in 3D space are collected in each point cloud.
Metrics. Given the rotations and translations estimated by a specific registration method, we compare them with the ground truths by measuring the root mean squared error (RMSE) and the coefficient of determination (R²). The rotation errors are computed with the Euler angle representation in degrees.
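Both metrics are standard, and the sketch below shows how they could be computed over flattened lists of predicted and ground-truth values, for instance per-axis Euler angles in degrees or translation components. The flattening convention and the function names are assumptions for illustration.

```python
import math


def rmse(pred, gt):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def r_squared(pred, gt):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_gt = sum(gt) / len(gt)
    ss_res = sum((g - p) ** 2 for p, g in zip(pred, gt))
    ss_tot = sum((g - mean_gt) ** 2 for g in gt)
    if ss_tot == 0.0:
        raise ValueError("ground truth has zero variance")
    return 1.0 - ss_res / ss_tot
```

A perfect estimator gives an RMSE of 0 and an R² of 1. R² can go negative when the predictions are worse than always predicting the ground-truth mean, so it complements RMSE by normalizing against the spread of the test transformations.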
Comparisons. As shown in Table 4, we compare with three categories of point cloud registration methods on ModelNet40. The first category is non-learning-based registration methods, including ICP, Go-ICP, and FGR. The second category is learning-based registration methods, including PointNetLK, DCP, PRNet, DeepGMR, and RPM-Net. These methods require supervision in training. The last category is RANSAC-based registration with learned descriptors, including PPF-FoldNet, CapsuleNet, S2CNN, and our UPDesc, which are trained without supervision. The methods based on RANSAC and learned descriptors generally perform better than the learning-based registration methods, demonstrating their wide applicability to different data modalities (e.g., object-centric point clouds and indoor scene scans). Our UPDesc achieves the best performance in all the computed metrics among the unsupervised methods and is even comparable to the supervised RPM-Net, which is highly specialized for object-centric datasets.
We further test the robustness of the different methods to noise. Each point in the testing point clouds is perturbed with independently sampled Gaussian noise, which is clipped to a bounded range. Table 5 reports the registration results. Our UPDesc shows better robustness to noise than all the other unsupervised methods, especially for rotation estimation.
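The noise augmentation can be sketched as below. The noise parameters are not given in the text above, so `sigma` and `clip` are left as required arguments, and zero-mean noise applied independently per coordinate is an assumption. The function name is illustrative.

```python
import random


def jitter_points(points, sigma, clip, rng):
    """Add independent Gaussian noise to every coordinate, clipped to [-clip, clip].

    `rng` is a random.Random instance so that the perturbation is reproducible.
    """
    noisy = []
    for p in points:
        noisy.append(tuple(
            c + max(-clip, min(clip, rng.gauss(0.0, sigma))) for c in p))
    return noisy
```

Clipping bounds the worst-case displacement of any point, so the perturbed cloud stays close to the original surface even when an occasional large noise sample is drawn.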
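| Method | Supervision | RMSE (rotation) | R² (rotation) | RMSE (translation) | R² (translation) |
| S2CNN (512) | w/o | 3.069 | 0.944 | 0.017 | 0.997 |
| S2CNN (32) | w/o | 3.234 | 0.938 | 0.014 | 0.998 |
| S2CNN (512) | w/o | 5.221 | 0.840 | 0.007 | 0.999 |
| S2CNN (32) | w/o | 5.040 | 0.850 | 0.009 | 0.999 |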
4.3 Ablation Study
We perform ablation studies on 3DMatch and report the results in Table 6. We first remove the support size estimation in the descriptor extraction stage (Sec. 3.2) to study its contribution. That is, we fit the voxel grid to the local patch, which reduces to the input parameterization scheme used by 3DSmoothNet. The performance of this variant (– CS in Table 6) drops significantly compared to our full model. This is because, without richer local geometry information in the learned descriptors, the putative correspondences across point clouds may not be reliable for registration, so the unsupervised registration loss cannot provide effective training signals.
We also study the contribution of the loss terms used in Eq. (9), including the registration loss, the contrastive loss, and the regularization loss. The corresponding results are shown in Table 6 (middle). Combining all three loss terms produces the best result. For the registration loss, we further examine the contribution of its three constituent terms: the orthogonality loss, the determinant loss, and the cycle consistency loss. The results are reported in Table 6 (right).
We have presented UPDesc, a new framework to learn point descriptors for robust point cloud registration in an unsupervised manner. Our framework is built upon a voxel-based representation and 3D CNNs for descriptor extraction. To enrich geometric information in the learned descriptors, we propose to learn the local support size in the online point-to-voxel conversion with differentiable voxelization. We introduce the registration loss and the contrastive loss to guide the learning of descriptors. Extensive experiments show that our descriptors achieve better performance than the state-of-the-art unsupervised methods on existing point cloud registration benchmarks. For future work, it would be interesting to combine our descriptors with other stages of the point cloud registration pipeline, such as keypoint detection or outlier filtering of correspondences. It is also worth investigating the extension of our loss formulations, in particular the registration loss, to these tasks for unsupervised learning.
-  (2020) SpinNet: learning a general surface descriptor for 3d point cloud registration. arXiv. Cited by: §2, Table 1.
-  (2019) PointNetLK: robust & efficient point cloud registration using pointnet. In Proc. IEEE CVPR, Cited by: §2, §4.2, §4.2, Table 4, Table 5.
-  (2021) PointDSC: robust point cloud registration using deep spatial consistency. In Proc. IEEE CVPR, Cited by: §3.3.
-  (2020) D3Feat: joint learning of dense detection and description of 3d local features. In Proc. IEEE CVPR, Cited by: §2, §4.1, Table 1, §5.
-  (1992) A method for registration of 3-d shapes. IEEE TPAMI 14 (2). Cited by: §4.2, Table 4, Table 5.
-  (2011-02) Shape google: geometric words and expressions for invariant shape retrieval. ACM TOG 30 (1). External Links: Cited by: §1.
-  (2015) Robust reconstruction of indoor scenes. In Proc. IEEE CVPR, Cited by: §4.1.
-  (2005) Learning a similarity metric discriminatively, with application to face verification. In Proc. IEEE CVPR, External Links: Cited by: §2.
-  (2020) Deep global registration. In Proc. IEEE CVPR, Cited by: §2, §5.
4D spatio-temporal convnets: minkowski convolutional neural networks. In Proc. IEEE CVPR, Cited by: §2.
-  (2019) Fully convolutional geometric features. In Proc. IEEE ICCV, Cited by: §2, §4.1, Table 1.
-  (2018) Spherical CNNs. arXiv. Cited by: §2.
-  (2017) ScanNet: richly-annotated 3d reconstructions of indoor scenes. In Proc. IEEE CVPR. Cited by: §1.
-  (2018) PPF-FoldNet: unsupervised learning of rotation invariant 3d local descriptors. In Proc. ECCV. Cited by: §1, §1, §2, §4.1, §4.1, §4.1, Table 1, Table 2, Table 3, Table 4, Table 5.
-  (2018) PPFNet: global context aware local features for robust 3d point matching. In Proc. IEEE CVPR. Cited by: §1, §1, §2, §3.1, §3.3, §4.1, Table 1.
-  (2019) 3D local features for direct pairwise registration. In Proc. IEEE CVPR. Cited by: §2.
-  (2020) DH3D: deep hierarchical 3d descriptors for robust large-scale 6dof relocalization. In Proc. ECCV. Cited by: §2.
-  (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24 (6). Cited by: §4.1.
-  (2004) Recognizing objects in range data using regional point descriptors. In Proc. ECCV. Cited by: §2.
-  (2020) Learning multiview 3d point cloud registration. In Proc. IEEE CVPR. Cited by: §1, §2, §3.1, §3.3.
-  (2019) The perfect match: 3D point cloud matching with smoothed densities. In Proc. IEEE CVPR. Cited by: Figure 1, §1, §1, §2, §3.1, §3.1, §3.2, §3.2, §3.2, §3.3, §4.1, §4.1, Table 1.
-  (1996) Matrix computations. Johns Hopkins University Press. Cited by: §3.3.
-  (2016) A comprehensive performance evaluation of 3d local feature descriptors. IJCV 116 (1). Cited by: §1, §2.
-  (2013) Rotational projection statistics for 3d local surface description and object recognition. IJCV 105 (1). Cited by: §1.
-  (2014) An accurate and robust range image registration algorithm for 3d object modeling. IEEE Transactions on Multimedia 16 (5), pp. 1377–1390. Cited by: §1.
-  (2016) Structured global registration of rgb-d scans in indoor environments. arXiv. Cited by: §4.1.
-  (2017) In defense of the triplet loss for person re-identification. arXiv. Cited by: §2.
-  (2017) Learning local shape descriptors from part correspondences with multiview convolutional networks. ACM TOG 37 (1). Cited by: §1, §1.
-  (2020) PREDATOR: registration of 3d point clouds with low overlap. arXiv. Cited by: §2.
-  (2020) Feature-metric registration: a fast semi-supervised approach for robust point cloud registration without correspondences. In Proc. IEEE CVPR. Cited by: §2.
-  (1999) Using spin images for efficient object recognition in cluttered 3d scenes. IEEE TPAMI 21 (5). Cited by: §2.
-  (2010-07) Learning 3d mesh segmentation and labeling. ACM TOG 29 (4). Cited by: §1.
-  (2017) Learning compact geometric features. In Proc. IEEE ICCV. Cited by: §1, §1, §2, §4.1, Table 1.
-  (2015) Adam: A method for stochastic optimization. In Proc. ICLR. Cited by: §3.4.
-  (2014) Unsupervised feature learning for 3d scene labeling. In Proc. ICRA. Cited by: §4.1.
-  (2020) Registration loss learning for deep probabilistic point set registration. arXiv. Cited by: §2.
-  (2005) A spectral technique for correspondence problems using pairwise constraints. In Proc. IEEE ICCV. Cited by: §3.3.
-  (2020) End-to-end learning local multi-view descriptors for 3D point clouds. In Proc. IEEE CVPR. Cited by: §1, §1, §2, §3.2, §4.1, §4.1, Table 1.
-  (2017) DeepHash for image instance retrieval: getting regularization, depth and fine-tuning right. In Proc. ICMR. Cited by: §3.3.
-  (2019) Soft Rasterizer: a differentiable renderer for image-based 3d reasoning. In Proc. IEEE ICCV. Cited by: §1, §3.2.
-  (2019) DeepVCP: an end-to-end deep neural network for point cloud registration. In Proc. IEEE ICCV. Cited by: §2, §3.3.
-  (1981) An iterative image registration technique with an application to stereo vision. Cited by: §2.
-  (2006) A novel representation and feature matching algorithm for automatic pairwise registration of range images. IJCV 66 (1). Cited by: §1.
-  (2020) 3DRegNet: a deep neural network for 3d point registration. In Proc. IEEE CVPR. Cited by: §2.
- PyTorch: an imperative style, high-performance deep learning library. In NeurIPS. Cited by: §3.4.
-  (2021) Distinctive 3d local deep descriptors. In Proc. ICPR. Cited by: §2.
-  (2017) PointNet: deep learning on point sets for 3d classification and segmentation. In Proc. IEEE CVPR. Cited by: §2, §2, §2.
-  (2016) Volumetric and multi-view cnns for object classification on 3d data. In Proc. IEEE CVPR. Cited by: §3.2.
-  (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS. Cited by: §3.2, §3.4.
-  (2019) Unsupervised deep learning for structured shape matching. In Proc. IEEE ICCV. Cited by: §1, §3.1, §3.3.
-  (2009) Fast point feature histograms (fpfh) for 3d registration. In Proc. ICRA. Cited by: §2, §4.1, Table 1, Table 2.
-  (2008) Aligning point cloud views using persistent feature histograms. In Proc. IROS. Cited by: §2.
-  (2011) 3D is here: Point Cloud Library (PCL). In Proc. ICRA. Cited by: §4.1.
-  (2014) SHOT: unique signatures of histograms for surface and texture description. CVIU 125. Cited by: §2, §4.1, Table 1, Table 2.
-  (2013) Scene coordinate regression forests for camera relocalization in rgb-d images. In Proc. IEEE CVPR. Cited by: §4.1.
-  (2019) Learning an effective equivariant 3d descriptor without supervision. In Proc. IEEE ICCV. Cited by: Figure 1, §1, §1, §2, §4.1, Table 1, Table 2, Table 3, Table 4, Table 5.
-  (2019) KPConv: flexible and deformable convolution for point clouds. In Proc. IEEE ICCV. Cited by: §2.
-  (2017) L2-Net: deep learning of discriminative patch descriptor in euclidean space. In Proc. IEEE CVPR. Cited by: §2.
-  (2010) Unique shape context for 3d data description. In Proc. 3DOR. Cited by: §1, §2.
-  (2010) Unique signatures of histograms for local surface description. In Proc. ECCV. Cited by: §2.
-  (2016) Instance Normalization: the missing ingredient for fast stylization. arXiv. Cited by: §3.2.
-  (2016) Learning to navigate the energy landscape. arXiv. Cited by: §4.1.
-  (2019) Deep closest point: learning representations for point cloud registration. In Proc. IEEE ICCV. Cited by: §1, §2, §3.1, §3.3, §4.2, §4.2, Table 4, Table 5.
- PRNet: self-supervised learning for partial-to-partial registration. In NeurIPS. Cited by: §1, §2, §3.1, §3.3, §4.2, §4.2, §4.2, §4.2, Table 4, Table 5.
-  (2015) 3D ShapeNets: a deep representation for volumetric shapes. In Proc. IEEE CVPR. Cited by: §1, §3.4, §4.
-  (2013) SUN3D: a database of big spaces reconstructed using sfm and object labels. In Proc. IEEE ICCV. Cited by: §4.1.
-  (2015) Go-ICP: a globally optimal solution to 3d icp point-set registration. IEEE TPAMI 38 (11). Cited by: §4.2, Table 4, Table 5.
-  (2017) TOLDI: an effective and robust approach for 3d local shape description. Pattern Recogn. 65. Cited by: §3.2.
-  (2018) FoldingNet: point cloud auto-encoder via deep grid deformation. In Proc. IEEE CVPR. Cited by: §1, §2.
-  (2018) 3DFeat-Net: weakly supervised local 3d features for point cloud registration. In Proc. ECCV. Cited by: §2.
-  (2020) RPM-Net: robust point matching using learned features. In Proc. IEEE CVPR. Cited by: §2, §3.1, §4.2, Table 4, Table 5.
-  (2018-06) Learning to find good correspondences. In Proc. IEEE CVPR. Cited by: §2.
- DeepGMR: learning latent gaussian mixture models for registration. In Proc. ECCV, pp. 733–750. Cited by: §4.2, Table 4, Table 5.
-  (2018) Deep learning of graph matching. In Proc. IEEE CVPR. Cited by: §3.3.
-  (2017) 3DMatch: learning local geometric descriptors from rgb-d reconstructions. In Proc. IEEE CVPR. Cited by: §1, §1, §1, §2, §3.1, §3.3, §3.4, §4.1, §4.1, Table 1, §4.
-  (2019) 3D point capsule networks. In Proc. IEEE CVPR. Cited by: §1, §1, §2, §4.1, Table 1, Table 2, Table 3, Table 4, Table 5.
-  (2018) Learning and matching multi-view descriptors for registration of point clouds. In Proc. ECCV. Cited by: §3.3, §3.4.
-  (2016) Fast global registration. In Proc. ECCV. Cited by: §4.2, Table 4, Table 5.
-  (2018) Open3D: A modern library for 3D data processing. arXiv. Cited by: §4.1.