1 Introduction
Encoding 3D local geometry into descriptors has been an essential ingredient in many 3D vision problems, such as recognition [24, 59], retrieval [6], segmentation [32, 28], and registration and reconstruction [43, 25]. In this paper, we are interested in developing point descriptors for robust registration of 3D point clouds. Matching the 3D geometry of scans of real-world scenes is a challenging task, since the input data is often noisy and partial (Fig. 1). To accommodate such issues, learning-based descriptors have received significant attention in recent years, demonstrating superior performance over handcrafted ones [23].
Existing literature has investigated both supervised and unsupervised approaches for descriptor learning. The supervised approaches [33, 15, 75, 21, 38] have demonstrated remarkable performance on existing point cloud registration benchmarks, such as the 3DMatch benchmark [75]. A contributing factor to their success is that training exposes the network to challenging local geometry matching between partially overlapping point clouds. The training, however, requires abundant supervision, namely labels of point correspondences, to construct training pairs, triplets, or N-tuples [75, 21, 15] for metric learning. Obtaining the annotations can be a challenging task, which demands accurate scene reconstructions [75, 13]. In contrast, the unsupervised approaches [14, 76, 56] seek to remove such requirements on training data via autoencoder-based architectures [69], whose reconstruction loss, however, falls short of directly optimizing the similarity between descriptors, leaving room for improvement to close the gap with the supervised approaches.
In this work, we propose a novel unsupervised framework for descriptor learning. We diverge from the above autoencoder paradigm. Instead, we extract descriptors with only an encoder network, but eschew training supervision by performing registration within the network. This allows us to formulate learning objectives with descriptor similarity both across and within point clouds. We name our learned descriptors UPDesc. Overall, our method has two main parts: descriptor extraction and unsupervised training.
To extract descriptors for points of interest, we build upon 3DSmoothNet [21], a supervised 3D CNN-based architecture that uses a voxel-based representation for the surrounding geometry of the interest points. Unlike other representations of 3D geometry (e.g., point pair features [15, 14] or multi-view images [28, 38]), which require additional information such as normals, the voxel-based representation only needs point positions. 3DSmoothNet, as well as many supervised and unsupervised methods [33, 75, 14, 76, 56], uses a predefined, fixed-size local support for input parameterization. This design choice, however, may not be optimal for capturing sufficient local geometry in the descriptors. On the other hand, works like [38, 28] using a multi-view representation have demonstrated that incorporating broader context is beneficial to learning discriminative descriptors. Thus, we are motivated to enable the network to learn the local support size in a data-driven manner. However, the conventional voxelization [75, 21] is a discretization operation without gradient definitions, preventing gradient backpropagation to voxel grids for their size optimization. Inspired by [40, 38], we develop a differentiable voxelization module to bridge this gap.
For unsupervised training, we leverage a registration loss and a contrastive loss to guide the descriptor similarity optimization. The former is formulated on the 3D transformations estimated by in-network geometric registration, which builds upon descriptor matching between two point clouds. The latter complements the registration loss by introducing metric learning directly on the descriptor space, similar to the above supervised methods. For matching between point clouds, recent works [63, 64, 20] have employed learned descriptors to solve pairwise registration with differentiable singular value decomposition (SVD). However, they require ground-truth transformations to supervise the estimation results so that the descriptors can be optimized indirectly. Inspired by this learning framework, we observe that the transformations can be computed by weighted least-squares fitting instead of SVD, and that training signals can be obtained by measuring the deviation of the transformations from SE(3) (i.e., rigid transformations) [50], which gives rise to our registration loss. The descriptor matching between point clouds is prone to error if the descriptors are not well distributed in the descriptor space, leading to invalid registration and training signals. To alleviate this issue, the contrastive loss is used to directly optimize the descriptor similarity for points in each individual point cloud. Our training losses require neither correspondence labels nor pose annotations of point clouds for supervision. Through extensive experiments on point cloud registration benchmarks [75, 65], we validate the superior performance of our method over existing unsupervised ones, further narrowing the gap with supervised techniques.
Our main contributions are as follows: (1) a novel differentiable voxelization layer, enabling in-network conversion from point clouds to a voxel-based representation and allowing data-driven local support size optimization; (2) an unsupervised approach for learning local descriptors, with a novel registration loss based on deviation from rigidity, outperforming other unsupervised methods. Our code will be made publicly available upon publication.
2 Related Work
Handcrafted 3D Local Descriptors. A considerable amount of literature has investigated handcrafted descriptors for encoding 3D local geometry. A widely adopted paradigm is to use a histogram-based representation to parameterize local geometry. Generally, the statistical information of points collected by the histograms can be categorized into spatial distributions and geometric attributes [23]. The former serves as the basis for descriptors like Spin Image [31], 3D Shape Context [19], and Unique Shape Context [59], while the latter is adopted by descriptors like PFH [52], FPFH [51], and SHOT [60, 54]. We refer the reader to the comprehensive survey by Guo et al. [23] on handcrafted descriptors.
Learned 3D Local Descriptors.
In recent years, with the development of deep neural networks, significant research attention has focused on encoding 3D local geometry into descriptors in a data-driven manner. Both supervised and unsupervised approaches are currently under active investigation, as discussed below.
The supervised approaches to descriptor learning [75, 33, 15, 70, 21, 11, 17, 1, 38, 46] require correspondence labels to be available in point cloud pairs for training. Their training objective mainly aims at maximizing the descriptor similarity of corresponding points while minimizing the descriptor similarity of non-corresponding points. Such an objective has been translated into contrastive [8], triplet [27], and N-tuple [15] losses. The supervised approaches generally differ in the choices of input parameterizations and network backbones. More specifically, works like 3DMatch [75] and 3DSmoothNet [21] adopt voxel grids, a structured representation for local geometry input, and 3D CNNs for descriptor extraction. Deng et al. [15] examined PointNet [47] for directly encoding point pair features extracted from input patches. Li et al. [38] proposed to project local geometry as multi-view images and leverage 2D CNNs [58] as the descriptor encoder. To densely extract per-point descriptors, the networks of [11, 4, 29] are built upon sparse convolutions [10] or kernel point convolutions [57]. Descriptors like [33, 21, 1, 46] are rotationally invariant by design, while [75, 15, 70, 11, 38, 4] are not. In this work, we base the descriptor extraction step of our unsupervised framework on 3DSmoothNet and address its use of a predefined, fixed-size local support with differentiable voxelization.

The unsupervised approaches to descriptor learning [14, 76, 56] seek to remove the labeling requirement on training data. To this end, the existing solutions base the learning framework on autoencoders [69], whose features at the bottleneck are taken as descriptors. Their training loss minimizes the discrepancy between the input local geometry and the reconstruction output, and can be expressed by the Chamfer distance. The unsupervised approaches vary in input parameterizations and encoder architectures. Concretely, to obtain a descriptor from input local geometry, PPF-FoldNet [14] transforms the input to a point pair feature representation and employs PointNet [47] as the encoder. With an input representation identical to [14], Zhao et al. [76] proposed a 3D point-capsule network for feature encoding. By converting input patches into a spherical signal representation, Spezialetti et al. [56] used Spherical CNNs [12] as the encoder. The descriptors learned by these approaches are rotationally invariant. To reconstruct the input from a descriptor, the unsupervised methods [14, 76, 56] opt for FoldingNet [69] as their decoder network, which warps a low-dimensional grid with the descriptor towards the input.
Instead of training with a reconstruction loss, we perform geometric registration with the descriptors in the network in a differentiable manner and formulate the deviation from rigidity of the resulting transformations as our registration loss. We additionally use a contrastive loss to directly optimize the descriptor similarity within individual point clouds.
Learning-based Registration. A number of studies have incorporated the point cloud registration pipeline into deep neural networks for end-to-end learning [2, 63, 64, 41, 16, 36, 71, 20, 30, 9]. For example, PointNetLK [2] employs PointNet [47] as a global feature extractor and iteratively optimizes the transformation parameters of a point cloud pair in a manner similar to the Lucas-Kanade algorithm [42]. Deep Closest Point (DCP) [63] builds soft correspondences between point clouds with learned point descriptors and uses a differentiable SVD layer for rigid transformation estimation. Follow-up works [64, 71] improve DCP by handling partial-to-partial registration. To accommodate correspondence outliers, works such as [9, 44, 72] introduce correspondence classification blocks in the networks prior to registration. Supervision such as pose annotations is still essential in these end-to-end learning-based methods. Differently, our work zooms in on descriptor extraction with an unsupervised registration loss, which penalizes the deviation from rigidity of the estimated transformations.

3 Method
3.1 Overview
UPDesc takes a 3D point cloud $\mathcal{P}$ and a point of interest $\mathbf{p}$ as input for descriptor extraction. We aim to obtain a descriptor $\mathbf{f}$ that effectively encodes the surrounding geometry of $\mathbf{p}$ in $\mathcal{P}$, and we seek to learn the descriptor in an unsupervised manner. In general, our method has two stages: descriptor extraction (Sec. 3.2) and unsupervised training (Sec. 3.3), as illustrated in Fig. 2 and briefly described below.
Our descriptor extraction network is built upon 3DSmoothNet [21], which uses a voxel-based representation for input local geometry and handles rotation invariance explicitly with local reference frames (LRFs). 3DSmoothNet considers a fixed-size local neighborhood of $\mathbf{p}$ as input, which, however, potentially limits the learning of an informative descriptor. We instead enable the network to learn the support size for $\mathbf{p}$, which is used in voxelization. To this end, we integrate the input parameterization into the network through a differentiable voxelization module (Sec. 3.2), which performs probabilistic aggregation to convert the point-based representation to a voxel grid. The resulting voxel-based representation can be seamlessly fed to the subsequent 3D CNN for descriptor extraction.
To train the descriptor extraction network, we use a registration loss based on descriptor matching between two point clouds and a contrastive loss for guiding the descriptor similarity within each individual point cloud. To formulate the registration loss, we follow existing end-to-end registration works [63, 64, 71, 20] and use the learned descriptors to build putative correspondences between point clouds with differentiable nearest neighbor query [20]. We solve for the 3D transformation with weighted least-squares fitting instead of SVD. Our registration loss ($\mathcal{L}_{reg}$ in Fig. 2) then penalizes the deviation of the resulting transformation from SE(3) [50], including orthogonality and cycle consistency. We elaborate this loss in Sec. 3.3 (Registration Loss). The registration could be unstable if the descriptors are not well distributed in the descriptor space, leading to erroneous correspondences. Similar to the supervised methods [75, 21, 15], we use the contrastive loss ($\mathcal{L}_{con}$ in Fig. 2) to directly promote descriptor similarity for spatially close points and dissimilarity for spatially distant points within the same point cloud. Details of $\mathcal{L}_{con}$ are described in Sec. 3.3 (Contrastive Loss).
3.2 Descriptor Extraction
To extract a descriptor for point $\mathbf{p}$, we convert its surrounding geometry in $\mathcal{P}$ to a voxel grid $\mathbf{V}$ (Fig. 3), building on the input parameterization in 3DSmoothNet [21]. Suppose that $\mathbf{V}$ has a resolution of $V \times V \times V$. Prior to voxelization, we need to determine the orientation and the size of $\mathbf{V}$. The orientation addresses rotation invariance [21], and the size determines the amount of local geometry information in the final descriptor. For the orientation, we follow [21] to explicitly compute a local reference frame (LRF) based on a local patch $\Omega$ located at $\mathbf{p}$ (Fig. 3). The local patch is defined as $\Omega = \{\mathbf{q} \in \mathcal{P} : \lVert \mathbf{q} - \mathbf{p} \rVert_2 \le r\}$, where $r$ is a predefined radius. The LRF can be estimated from $\Omega$ by the approach of [21, 68] based on eigendecomposition.
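As a concrete illustration of LRF estimation by eigendecomposition, here is a minimal NumPy sketch. The actual method of [21, 68] additionally uses distance weighting and a more careful sign disambiguation; the function name and the simple sign rule below are our assumptions.

```python
import numpy as np

def local_reference_frame(patch, p):
    """Toy LRF: eigendecomposition of the covariance of a patch around p.

    patch: (N, 3) array of points within radius r of the interest point p.
    Returns a 3x3 matrix whose columns are the frame axes.
    """
    centered = patch - p
    cov = centered.T @ centered / len(patch)
    # Eigenvectors sorted by descending eigenvalue form the frame axes.
    w, v = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
    axes = v[:, ::-1].copy()          # largest-variance direction first
    # Simplified sign disambiguation: point each axis toward the patch majority.
    for k in range(3):
        if np.sum(centered @ axes[:, k]) < 0:
            axes[:, k] = -axes[:, k]
    return axes
```

Because the columns come from an orthonormal eigenbasis, the returned frame is orthonormal by construction.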
For the size of $\mathbf{V}$, 3DSmoothNet [21] simply fits $\mathbf{V}$ to enclose all the points in $\Omega$. Differently, we seek to optimize the size of $\mathbf{V}$ in a data-driven manner to capture local geometry not circumscribed by the predefined local support $\Omega$. Specifically, we feed the patch $\Omega$ to a mini-PointNet [49] $g$ to regress the size $s$ of $\mathbf{V}$ (Fig. 3), that is, $s = g(\Omega; \theta_g)$, where $\theta_g$ denotes the learnable parameters of the subnetwork $g$.
Differentiable Voxelization. The conventional voxelization is non-differentiable and cannot backpropagate gradients of the training losses to the support size estimation subnetwork $g$. Inspired by [40, 38], we design a differentiable voxelization module, enabling point-to-voxel conversion within the network. We anchor the center of the voxel grid $\mathbf{V}$ at $\mathbf{p}$, scale it to the estimated size $s$, and align it to the estimated LRF.
For the $i$-th voxel $v_i$ in the transformed $\mathbf{V}$, we perform probabilistic aggregation to compute the voxel value. To simplify computation, we consider each voxel as a sphere [48] with a radius of $r_v$. We use $p_{ij}$ to denote the probability of a specific point $\mathbf{q}_j \in \Omega$ being contained in voxel $v_i$:

$$p_{ij} = \operatorname{sigmoid}\!\left(\delta_{ij} \cdot \frac{d_{ij}^2}{\tau}\right), \quad (1)$$

$$\delta_{ij} = \operatorname{sign}\big(r_v - \lVert \mathbf{q}_j - \mathbf{c}_i \rVert_2\big), \qquad d_{ij} = \big|\, \lVert \mathbf{q}_j - \mathbf{c}_i \rVert_2 - r_v \big|, \quad (2)$$

where $\mathbf{c}_i$ denotes the center of voxel $v_i$ (Fig. 4a), $\delta_{ij}$ is a sign indicator of whether $\mathbf{q}_j$ lies inside the voxel sphere, and $\tau$ controls the sharpness of the probability distribution and is set to a small constant. The voxel value of $v_i$ is aggregated by

$$v_i = 1 - \prod_{j} \big(1 - p_{ij}\big). \quad (3)$$

In Eq. (3), the voxel value is viewed as the probability of having at least one point contained in voxel $v_i$ (Fig. 4b).
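The probabilistic aggregation of Eqs. (1)-(3) can be sketched in NumPy as follows. The actual implementation is in PyTorch, and the sharpness value `tau`, the clipping, and all names here are illustrative assumptions.

```python
import numpy as np

def soft_voxelize(points, centers, radius, tau=0.01):
    """Sketch of probabilistic point-to-voxel aggregation.

    points:  (N, 3) local patch, already in the LRF-aligned grid frame.
    centers: (M, 3) voxel centers; each voxel is treated as a sphere.
    Returns (M,) soft occupancy values in [0, 1].
    """
    # Pairwise distances between voxel centers i and points j.
    d_center = np.linalg.norm(centers[:, None, :] - points[None, :, :], axis=-1)
    delta = np.sign(radius - d_center)   # +1 inside the sphere, -1 outside
    d = np.abs(d_center - radius)        # distance to the sphere boundary
    z = np.clip(delta * d**2 / tau, -50.0, 50.0)  # clip for numerical safety
    p = 1.0 / (1.0 + np.exp(-z))         # per-point containment probability
    # Noisy-or: probability that at least one point lies in the voxel.
    return 1.0 - np.prod(1.0 - p, axis=1)
```

A point deep inside a voxel sphere drives the voxel value toward 1, while a far-away point contributes almost nothing, and gradients flow to the point coordinates (and hence to the support size) through the sigmoid.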
3D CNN. We use the 3D CNN-based architecture from [21] to encode the voxel grid $\mathbf{V}$ as the final descriptor $\mathbf{f}$. Let $\mathbf{f} = h(\mathbf{V}; \theta_h)$, where $\theta_h$ denotes the learnable parameters of the 3D CNN $h$. More specifically, the network is comprised of six convolutional layers, which progressively downsample the input. Each convolutional layer is followed by a normalization layer [61] and a ReLU activation layer. The output of the last convolutional layer is flattened and fed to a fully connected layer followed by $L_2$ normalization, resulting in the unit-length descriptor $\mathbf{f}$.

3.3 Unsupervised Training
In this section, we describe our loss formulations for unsupervised descriptor learning. Our descriptor losses consist of a registration loss and a contrastive loss. In the following, we consider as input a pair of partially overlapping point clouds $\mathcal{X}$ and $\mathcal{Y}$ for training.
Registration Loss. The registration loss seeks to optimize the similarity of the learned descriptors across point clouds. In a similar spirit to other learning-based registration works [63, 64], we build a set of putative correspondences between $\mathcal{X}$ and $\mathcal{Y}$ in the descriptor space and compute a 3D transformation for registration within the network. By imposing losses on the resulting transformation matrix, the descriptors can be optimized to improve the registration. Existing learning-based registration works enforce the transformation to be in SE(3) via differentiable SVD [63, 41] and compare the transformation with the ground truth for supervision. Differently, we relax the computation with weighted least-squares fitting. Although the resulting transformation may not be rigid, we see this deviation as an opportunity to obtain unsupervised training signals for the descriptor network optimization, as done in the context of non-rigid shape matching [50].
To map from $\mathcal{X}$ to $\mathcal{Y}$, we define a transformation $T(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{t}$ for $\mathbf{x} \in \mathcal{X}$, where $\mathbf{A}$ is a $3 \times 3$ matrix and $\mathbf{t}$ is a 3D vector. The inverse transformation $\hat{T}(\mathbf{y}) = \hat{\mathbf{A}}\mathbf{y} + \hat{\mathbf{t}}$ is defined similarly. We describe the estimation of $T$ and $\hat{T}$ later in this section (cf. Eq. (7)). In our registration loss, we encourage the estimated matrices $\mathbf{A}$ and $\hat{\mathbf{A}}$ to be in $SO(3)$ as follows:

$$\mathcal{L}_{orth} = \big\lVert \mathbf{A}\mathbf{A}^{\top} - \mathbf{I} \big\rVert_F^2 + \big\lVert \hat{\mathbf{A}}\hat{\mathbf{A}}^{\top} - \mathbf{I} \big\rVert_F^2, \quad (4)$$

$$\mathcal{L}_{det} = \big(\det(\mathbf{A}) - 1\big)^2 + \big(\det(\hat{\mathbf{A}}) - 1\big)^2, \quad (5)$$

where $\mathbf{I}$ is a $3 \times 3$ identity matrix. Eq. (4) measures the orthogonality of $\mathbf{A}$ and $\hat{\mathbf{A}}$, and Eq. (5) measures the determinant. We enforce a cycle consistency between $T$ and $\hat{T}$ as follows:

$$\mathcal{L}_{cyc} = \frac{1}{|\mathcal{X}|} \sum_{\mathbf{x} \in \mathcal{X}} \big\lVert \hat{T}(T(\mathbf{x})) - \mathbf{x} \big\rVert_2^2 + \frac{1}{|\mathcal{Y}|} \sum_{\mathbf{y} \in \mathcal{Y}} \big\lVert T(\hat{T}(\mathbf{y})) - \mathbf{y} \big\rVert_2^2. \quad (6)$$

Our registration loss is composed of the above losses as $\mathcal{L}_{reg} = \alpha \mathcal{L}_{orth} + \beta \mathcal{L}_{det} + \gamma \mathcal{L}_{cyc}$, with the weights $\alpha$, $\beta$, and $\gamma$ (see the weight settings in Sec. 3.4).
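To make the three rigidity terms concrete, here is a minimal NumPy sketch of Eqs. (4)-(6). The exact norms and the averaging are our reconstruction, and the function name is an assumption; the paper's implementation is differentiable in PyTorch.

```python
import numpy as np

def rigidity_losses(A, A_hat, X, t, t_hat):
    """Deviation-from-rigidity terms for forward map x -> A x + t and
    backward map y -> A_hat y + t_hat, evaluated on points X (N, 3)."""
    I = np.eye(3)
    # Orthogonality: A A^T should equal the identity for a rotation.
    l_orth = (np.linalg.norm(A @ A.T - I) ** 2
              + np.linalg.norm(A_hat @ A_hat.T - I) ** 2)
    # Determinant: a proper rotation has det = +1.
    l_det = (np.linalg.det(A) - 1.0) ** 2 + (np.linalg.det(A_hat) - 1.0) ** 2
    # Cycle consistency: mapping X forward then backward should recover X.
    X_cycle = (A_hat @ (A @ X.T + t[:, None]) + t_hat[:, None]).T
    l_cyc = np.mean(np.sum((X_cycle - X) ** 2, axis=1))
    return l_orth, l_det, l_cyc
```

For an exact rotation with a consistent inverse, all three terms vanish; any non-rigid estimate produces a positive penalty that can drive descriptor optimization.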
Next, we describe the estimation of $T$ based on the descriptor similarity between $\mathcal{X}$ and $\mathcal{Y}$; $\hat{T}$ is computed similarly. We use the learned descriptors (Sec. 3.2) to construct a putative correspondence set $\mathcal{C}$ with differentiable nearest neighbor query [20], where the closest neighbor of $\mathbf{x}_i \in \mathcal{X}$ in the descriptor space of $\mathcal{Y}$ is retrieved as $\mathbf{y}_i$. To compute $T$, we minimize a weighted quadratic error as follows:

$$\min_{\mathbf{A},\, \mathbf{t}} \sum_{(\mathbf{x}_i, \mathbf{y}_i) \in \mathcal{C}} w_i \big\lVert \mathbf{A}\mathbf{x}_i + \mathbf{t} - \mathbf{y}_i \big\rVert_2^2, \quad (7)$$

which is a least-squares problem and can be solved in closed form. In Eq. (7), $w_i$ denotes the confidence of correspondence $(\mathbf{x}_i, \mathbf{y}_i)$ being an inlier. We compute $w_i$ based on the spectral matching technique [37], which estimates the correspondence reliability by the isometry compatibility of pairwise correspondences. More details of spectral matching can be found in [37, 3, 74, 22].
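Because $\mathbf{A}$ is an unconstrained $3 \times 3$ matrix, Eq. (7) is ordinary weighted linear least squares and admits a closed-form solution via the normal equations on homogeneous coordinates. A NumPy sketch under that assumption (names are ours; the paper solves this differentiably in PyTorch):

```python
import numpy as np

def weighted_linear_fit(X, Y, w):
    """Closed-form minimizer of sum_i w_i ||A x_i + t - y_i||^2.

    X, Y: (N, 3) corresponding points; w: (N,) confidence weights.
    Returns A (3x3, unconstrained) and t (3,).
    """
    Xh = np.hstack([X, np.ones((len(X), 1))])          # homogeneous coords (N, 4)
    W = w[:, None]
    # Normal equations: (Xh^T W Xh) M = Xh^T W Y, with M stacking [A^T; t^T].
    M = np.linalg.solve(Xh.T @ (W * Xh), Xh.T @ (W * Y))  # (4, 3)
    return M[:3].T, M[3]
```

With well-spread correspondences the $4 \times 4$ system is invertible; low-confidence (likely outlier) pairs contribute little through their weights.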
Contrastive Loss. We incorporate a contrastive loss to evaluate the descriptor similarity in each individual point cloud. The registration loss optimizes the descriptor similarity across point clouds only indirectly via registration, and the descriptor distinctiveness in turn affects the correspondences and the resulting transformation. If the descriptors are not well distributed in the descriptor space, the registration may be unstable with erroneous correspondences, leading to ineffective training signals and slow convergence. Thus, to facilitate $\mathcal{L}_{reg}$, we employ a direct metric learning loss in the descriptor space, similar to the supervised methods [75, 21, 15]. The intuition is that the descriptors of spatially nearby points tend to be similar, while the descriptors of spatially distant points are more likely to be dissimilar. We use the double-margin contrastive loss [39, 77] in $\mathcal{L}_{con}^{\mathcal{X}}$:

$$\mathcal{L}_{con}^{\mathcal{X}} = \sum_{(i,j)} \ell_{ij} \big[ d_{ij} - m_p \big]_+^2 + (1 - \ell_{ij}) \big[ m_n - d_{ij} \big]_+^2. \quad (8)$$

The symbol $[\cdot]_+$ denotes $\max(\cdot, 0)$, and $d_{ij} = \lVert \mathbf{f}_i - \mathbf{f}_j \rVert_2$. Descriptors $\mathbf{f}_i$ and $\mathbf{f}_j$ correspond to points $\mathbf{x}_i$ and $\mathbf{x}_j$, respectively. Point $\mathbf{x}_j$ is in the nearest neighbor set of point $\mathbf{x}_i$ in the descriptor space. We generate the label $\ell_{ij}$ on the fly according to the spatial distance between $\mathbf{x}_i$ and $\mathbf{x}_j$: $\ell_{ij} = 1$ if $\lVert \mathbf{x}_i - \mathbf{x}_j \rVert_2 < \tau_s$, otherwise $\ell_{ij} = 0$, where $\tau_s$ is a spatial distance threshold. $m_p$ and $m_n$ are the margins for positive and negative pairs, respectively. $\mathcal{L}_{con}^{\mathcal{Y}}$ is computed similarly.
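A per-pair sketch of the double-margin term in Eq. (8); the margin defaults below are illustrative assumptions, not the paper's settings:

```python
def double_margin_contrastive(d, label, m_pos=0.1, m_neg=1.4):
    """One pair's contribution to the double-margin contrastive loss.

    d: descriptor distance d_ij; label: 1 for a positive (spatially close)
    pair, 0 for a negative pair. Margin values are illustrative.
    """
    if label == 1:
        return max(0.0, d - m_pos) ** 2   # pull positives inside m_pos
    return max(0.0, m_neg - d) ** 2       # push negatives beyond m_neg
```

Positives closer than the positive margin and negatives farther than the negative margin incur zero loss, so already well-separated pairs stop contributing gradients.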
We also add a regularization loss $\mathcal{L}_{size}$ for the support size estimations $s$ to regularize their magnitude. Our final training loss is a combination of the aforementioned loss terms:

$$\mathcal{L} = \mathcal{L}_{reg} + \mathcal{L}_{con}^{\mathcal{X}} + \mathcal{L}_{con}^{\mathcal{Y}} + \eta\, \mathcal{L}_{size}. \quad (9)$$
3.4 Implementation
We implemented our method with PyTorch [45]. Following 3DSmoothNet, we set the descriptor dimension to 32. For the 3DMatch dataset [75] (Sec. 4.1), we configure the support radius $r$ (in meters) and the voxel grid resolution $V$. In the contrastive loss of Eq. (8), we set the nearest neighbor set size and the spatial distance threshold $\tau_s$, and choose the margins $m_p$ and $m_n$ according to [77]. Weights $\alpha$, $\beta$, and $\gamma$ are used for the terms of $\mathcal{L}_{reg}$, and $\eta$ for $\mathcal{L}_{size}$ in $\mathcal{L}$. In each training step, our network takes as input a pair of point clouds. We sample 300 keypoints in each point cloud with farthest point sampling [49] for descriptor extraction and matching. We use Adam [34] as the network optimizer with an initial learning rate of 0.001. The network is trained for 16K steps, and the learning rate is decayed by 0.1 halfway through training. Training details for the ModelNet40 dataset [65] (Sec. 4.2) are given in the supplementary material.

4 Experiments
We validate the performance of our proposed UPDesc on existing point cloud registration benchmarks including 3DMatch (Sec. 4.1) and ModelNet40 (Sec. 4.2). The 3DMatch dataset [75] consists of point clouds of indoor scene scans, while the ModelNet40 dataset [65] consists of objectcentric point clouds generated from CAD models.
4.1 3DMatch Dataset
The 3DMatch dataset is widely adopted for evaluating descriptor performance on geometric registration [75]. It is constructed from several existing RGB-D datasets [62, 55, 66, 35, 26]. In total, there are 62 indoor scenes: 54 of them for training and validation, and 8 of them for testing. Each testing point cloud has 5,000 randomly sampled keypoints for descriptor extraction and matching. We apply voxel downsampling with a voxel size of 3 cm [79] to the point clouds.
Evaluation Metrics. Following [75, 15, 7, 11], we compute inlier ratio (IR), feature-match recall (FMR), and registration recall (RR) on 3DMatch. Inlier ratio measures the fraction of inlier correspondences, given as input a set of putative correspondences built in the descriptor space for a point cloud pair. For an inlier correspondence, the distance between the two matching points should be less than a threshold $\tau_1$ under the ground-truth transformation of the point cloud pair. Feature-match recall is computed as the fraction of point cloud pairs whose IR is above a threshold $\tau_2$; we report FMR at multiple values of $\tau_2$ (e.g., 0.05 and 0.2). Registration recall is computed as the fraction of point cloud pairs with correct transformations estimated by a RANSAC-based registration pipeline [18]. An estimated transformation is considered correct if it brings the root mean square error (RMSE) of ground-truth correspondences below 0.2 m.
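The IR and FMR metrics above can be sketched as follows; the function names and threshold defaults are our assumptions.

```python
import numpy as np

def inlier_ratio(src, dst, T, tau_dist=0.1):
    """Fraction of putative correspondences (src[i], dst[i]) whose points lie
    within tau_dist (meters) of each other under the ground-truth 4x4
    transform T. The threshold default is illustrative."""
    src_h = np.hstack([src, np.ones((len(src), 1))])
    src_t = (T @ src_h.T).T[:, :3]
    return np.mean(np.linalg.norm(src_t - dst, axis=1) < tau_dist)

def feature_match_recall(irs, tau_ir=0.05):
    """Fraction of point cloud pairs whose inlier ratio exceeds tau_ir."""
    return np.mean(np.asarray(irs) > tau_ir)
```

Registration recall additionally requires running a RANSAC pipeline per pair, so it is omitted from this sketch.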
Comparisons. We compare with handcrafted descriptors and existing unsupervised descriptor learning methods. The former includes FPFH [51] and SHOT [54], which are implemented in the PCL library [53] and have descriptor dimensions of 33 and 352, respectively. The latter includes PPF-FoldNet [14], CapsuleNet [76], and S2CNN [56]. Their implementations are based on publicly available codebases (https://github.com/XuyangBai/PPFFoldNet, https://github.com/yongheng1991/3Dpointcapsulenetworks, https://github.com/jonaskoehler/s2cnn), and they all have a descriptor dimension of 512. The high dimensionality can make nearest neighbor search computationally inefficient [33, 21]. For a more direct comparison with the state-of-the-art S2CNN, we also implemented an S2CNN variant that produces 32-dimensional descriptors.
Method  Sup.  Dim.  IR  FMR (0.05)  FMR (0.2)  RR
3DMatch [75]  w/  512  8.3  57.3  7.7  51.9 
CGF [33]  w/  32  10.1  60.6  12.3  51.3 
PPFNet [15]  w/  64    62.3    71.0 
3DSmoothNet [21]  w/  32  36.0  95.0  72.9  78.8 
FCGF [11]  w/  32    95.2  67.4  82.0 
D3Feat [4]  w/  32  40.7  95.8  75.8  82.2 
LMVD [38]  w/  32  46.1  97.5  86.9  81.3 
SpinNet [1]  w/  32    97.6  85.7   
FPFH [51]  w/o  33  9.3  59.6  10.1  54.8 
SHOT [54]  w/o  352  14.9  73.3  26.9  59.4 
PPFFoldNet [14]  w/o  512  20.9  83.8  41.0  69.0 
CapsuleNet [76]  w/o  512  16.9  82.5  31.3  67.6 
S2CNN [56]  w/o  512  34.5  94.6  70.3  78.4 
S2CNN [56]  w/o  32  29.1  92.4  59.9  73.9 
UPDesc  w/o  32  45.1  94.1  79.5  79.7 
Table 1 (bottom) shows the inlier ratio (IR) comparison for the methods without supervision. Our descriptor obtains the highest IR (45.1%) among all the unsupervised methods. UPDesc outperforms S2CNN (32) by 16 percentage points and S2CNN (512) by 10.6 percentage points, indicating the better quality of correspondences built by our method.
For the feature-match recall (FMR) comparison in Table 1, at $\tau_2 = 0.05$ our descriptor achieves an FMR of 94.1%, better than PPF-FoldNet, CapsuleNet, and S2CNN (32). Yet $\tau_2 = 0.05$ is a relatively easy threshold, as discussed in [38, 21, 14], and the descriptor performance tends to saturate. In the harder case of $\tau_2 = 0.2$, our descriptor obtains 79.5%, significantly outperforming S2CNN (32) by 19.6 percentage points and S2CNN (512) by 9.2 percentage points. Furthermore, Fig. 5 plots the FMR performance with respect to different $\tau_2$ values. Our descriptor shows better insensitivity to the inlier ratio threshold, which can be ascribed to its better IR performance.
Table 1 also includes the registration recall (RR) performance. Our descriptor achieves the best RR (79.7%) among all the unsupervised methods and outperforms S2CNN (32) by 5.8 percentage points. Fig. 6 visualizes some challenging point cloud registration examples with a large portion of flat surfaces, where our descriptor demonstrates better robustness.
For completeness, in Table 1 (top) we also include the performance of supervised descriptor learning methods. It is observed that our method narrows the gap with the stateoftheart supervised methods [4, 38]. Interestingly, our descriptor achieves better IR and RR performance than 3DSmoothNet, which uses a fixedsize local support and a 3D CNN backbone but with supervision.
To test rotation invariance, following [14], we use the rotated 3DMatch dataset, where each point cloud is rotated about randomly sampled axes by randomly sampled angles. Table 2 reports the performance of the compared methods. Our descriptor maintains a similar performance on this dataset, with the best IR (43.8%), FMR (78.8%), and RR (79.1%) scores.
Method  Dim.  IR  FMR  RR
FPFH [51]  33  9.3  10.0  55.3
SHOT [54]  352  14.9  26.9  61.6
PPF-FoldNet [14]  512  21.0  41.6  68.7
CapsuleNet [76]  512  16.8  31.9  68.0
S2CNN [56]  512  34.6  70.5  78.3
S2CNN [56]  32  29.2  59.5  75.2
UPDesc  32  43.8  78.8  79.1
Table 3 reports the running time comparison of the unsupervised descriptor learning methods. The results were collected on a desktop computer with an Intel Core i7 @ 3.6 GHz and an NVIDIA GTX 1080 Ti GPU. Note that for input preparation, PPF-FoldNet and CapsuleNet compute point pair features, S2CNN performs spherical representation conversion, and UPDesc estimates LRFs. Overall, our method and PPF-FoldNet have comparable speed, while S2CNN is computationally much slower.
4.2 ModelNet40 Dataset
We perform comparisons with existing learning-based registration methods [2, 63, 64] on the ModelNet40 dataset, following [64]. The point clouds fall into 40 man-made object categories. There are 9,843 point clouds for training and 2,468 for testing; each point cloud has 1,024 points. To generate point cloud pairs, a new point cloud is obtained by transforming each testing point cloud with a random rigid transformation, where the rotation angle along each axis and the 3D translation offset are sampled within fixed ranges following [64]. To synthesize partial overlap for a pair of point clouds, the 768 nearest neighbors of a randomly placed point in 3D space are collected in each point cloud.
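The pair-generation protocol described above can be sketched as follows; the parameter names and the sampling-range defaults are our assumptions following [64].

```python
import numpy as np

def make_partial_pair(cloud, rng, keep=768, max_angle=45.0, max_trans=0.5):
    """Random rigid transform plus partial crop, sketching the ModelNet40
    pair protocol. Range defaults are assumptions, not verified settings."""
    # Random rotation from per-axis Euler angles within +/- max_angle degrees.
    ang = np.radians(rng.uniform(-max_angle, max_angle, size=3))
    cx, cy, cz = np.cos(ang)
    sx, sy, sz = np.sin(ang)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    R = Rz @ Ry @ Rx
    t = rng.uniform(-max_trans, max_trans, size=3)
    moved = cloud @ R.T + t
    # Partial overlap: keep each cloud's nearest `keep` points to one anchor.
    anchor = rng.normal(size=3)
    def crop(pc):
        idx = np.argsort(np.linalg.norm(pc - anchor, axis=1))[:keep]
        return pc[idx]
    return crop(cloud), crop(moved), R, t
```

The returned `R` and `t` serve as the ground truth against which estimated rotations and translations are scored.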
Metrics. Given the rotations and translations estimated by a registration method, we follow [64] and compare them with the ground truths by measuring the root mean squared error (RMSE) and the coefficient of determination ($R^2$). The rotation errors are computed with the Euler angle representation in degrees.
Comparisons. As shown in Table 4, we compare with three categories of point cloud registration methods on ModelNet40. The first category is non-learning-based registration methods, including ICP [5], GoICP [67], and FGR [78]. The second category is learning-based registration methods, including PointNetLK [2], DCP [63], PRNet [64], DeepGMR [73], and RPMNet [71]; these methods require supervision in training. The last category is RANSAC-based registration with learned descriptors, including PPF-FoldNet, CapsuleNet, S2CNN, and our UPDesc, which are trained without supervision. The methods based on RANSAC and learned descriptors generally perform better than the learning-based registration methods, demonstrating their wide applicability to different data modalities (e.g., object-centric point clouds and indoor scene scans). Our UPDesc achieves the best performance in all the computed metrics among the unsupervised methods and is even comparable to the supervised RPMNet, which is highly specialized for object-centric datasets.
We further follow [64] to test the robustness of the different methods to noise. Each point in the testing point clouds is augmented with clipped Gaussian noise independently sampled per point, as in [64]. Table 5 reports the registration results; our UPDesc shows better robustness to noise than all the unsupervised methods, especially for the rotation estimation.
Method  Sup.  RMSE(R)  R^2(R)  RMSE(t)  R^2(t)
ICP [5]  w/o  33.683  5.696  0.293  0.037 
GoICP [67]  w/o  13.999  0.157  0.033  0.987 
FGR [78]  w/o  11.238  0.256  0.030  0.989 
PointNetLK [2]  w/  16.735  0.654  0.045  0.975 
DCPv2 [63]  w/  6.709  0.732  0.027  0.991 
PRNet [64]  w/  3.199  0.939  0.016  0.997 
DeepGMR [73]  w/  19.156  1.164  0.037  0.983
RPMNet [71]  w/  1.290  0.990  0.005  1.000 
PPF-FoldNet [14]  w/o  2.285  0.969  0.013  0.998
CapsuleNet [76]  w/o  2.180  0.972  0.013  0.998 
S2CNN [56] (512)  w/o  3.069  0.944  0.017  0.997 
S2CNN [56] (32)  w/o  3.234  0.938  0.014  0.998 
UPDesc  w/o  1.912  0.978  0.011  0.998 
Method  Sup.  RMSE(R)  R^2(R)  RMSE(t)  R^2(t)
ICP [5]  w/o  35.067  6.252  0.294  0.045 
GoICP [67]  w/o  12.261  0.112  0.028  0.991 
FGR [78]  w/o  27.653  3.491  0.070  0.941 
PointNetLK [2]  w/  19.939  1.343  0.057  0.960 
DCPv2 [63]  w/  6.883  0.718  0.028  0.991 
PRNet [64]  w/  4.323  0.889  0.017  0.995 
DeepGMR [73]  w/  19.758  1.299  0.030  0.989
RPMNet [71]  w/  1.870  0.979  0.011  0.998 
PPF-FoldNet [14]  w/o  4.151  0.899  0.009  0.999
CapsuleNet [76]  w/o  4.274  0.893  0.009  0.999 
S2CNN [56] (512)  w/o  5.221  0.840  0.007  0.999 
S2CNN [56] (32)  w/o  5.040  0.850  0.009  0.999 
UPDesc  w/o  2.197  0.971  0.007  0.999 
4.3 Ablation Study
We perform ablation studies on 3DMatch and report the results in Table 6. We first remove the support size estimation in the descriptor extraction stage (Sec. 3.2) to study its contribution. That is, we fit the voxel grid $\mathbf{V}$ to the patch $\Omega$, thus reducing to the input parameterization scheme used by 3DSmoothNet. The performance of this variant (– CS in Table 6) drops significantly compared to our full model. This is because, without richer local geometry information in the learned descriptors, the putative correspondences across point clouds may not be reliable for registration, making the unsupervised registration loss ineffective at providing training signals.
We also study the contribution of the loss terms used in Eq. (9), including the registration loss $\mathcal{L}_{reg}$, the contrastive loss $\mathcal{L}_{con}$, and the regularization loss $\mathcal{L}_{size}$. The corresponding results are shown in Table 6 (middle). Note that combining all three loss terms produces the best result. For $\mathcal{L}_{reg}$, we further examine the contribution of its three constituent terms: the orthogonality loss $\mathcal{L}_{orth}$, the determinant loss $\mathcal{L}_{det}$, and the cycle consistency loss $\mathcal{L}_{cyc}$. The results are reported in Table 6 (right).
UPDesc  – CS  – $\mathcal{L}_{reg}$  – $\mathcal{L}_{con}$  – $\mathcal{L}_{size}$  – $\mathcal{L}_{orth}$  – $\mathcal{L}_{det}$  – $\mathcal{L}_{cyc}$
IR  45.1  22.4  40.2  23.9  37.1  44.2  40.5  44.8 
FMR  79.5  45.4  70.1  49.8  67.8  77.6  72.1  78.2 
RR  79.7  69.9  70.6  70.2  73.2  77.3  77.2  76.4 
5 Conclusion
We have presented UPDesc, a new framework for learning point descriptors for robust point cloud registration in an unsupervised manner. Our framework is built upon a voxel-based representation and 3D CNNs for descriptor extraction. To enrich geometric information in the learned descriptors, we propose to learn the local support size in the online point-to-voxel conversion with differentiable voxelization. We introduce a registration loss and a contrastive loss to guide the learning of descriptors. Extensive experiments show that our descriptors achieve better performance than state-of-the-art unsupervised methods on existing point cloud registration benchmarks. For future work, it would be interesting to combine our descriptors with other stages of the point cloud registration pipeline, such as keypoint detection [4] or outlier filtering of correspondences [9]. It is also worth investigating the extension of our loss formulations, in particular the registration loss, to these tasks for unsupervised learning.
References
[1] (2020) SpinNet: learning a general surface descriptor for 3D point cloud registration. arXiv.
[2] (2019) PointNetLK: robust & efficient point cloud registration using PointNet. In Proc. IEEE CVPR.
[3] (2021) PointDSC: robust point cloud registration using deep spatial consistency. In Proc. IEEE CVPR.
[4] (2020) D3Feat: joint learning of dense detection and description of 3D local features. In Proc. IEEE CVPR.
[5] (1992) A method for registration of 3-D shapes. IEEE TPAMI 14(2).
[6] (2011) Shape Google: geometric words and expressions for invariant shape retrieval. ACM TOG 30(1).
[7] (2015) Robust reconstruction of indoor scenes. In Proc. IEEE CVPR.
[8] (2005) Learning a similarity metric discriminatively, with application to face verification. In Proc. IEEE CVPR.
[9] (2020) Deep global registration. In Proc. IEEE CVPR.
[10] (2019) 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In Proc. IEEE CVPR.
[11] (2019) Fully convolutional geometric features. In Proc. IEEE ICCV.
[12] (2018) Spherical CNNs. arXiv.
[13] (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In Proc. IEEE CVPR.
[14] (2018) PPF-FoldNet: unsupervised learning of rotation invariant 3D local descriptors. In Proc. ECCV.
[15] (2018) PPFNet: global context aware local features for robust 3D point matching. In Proc. IEEE CVPR.
[16] (2019) 3D local features for direct pairwise registration. In Proc. IEEE CVPR.
[17] (2020) DH3D: deep hierarchical 3D descriptors for robust large-scale 6DoF relocalization. In Proc. ECCV.
[18] (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6).
[19] (2004) Recognizing objects in range data using regional point descriptors. In Proc. ECCV.
[20] (2020) Learning multiview 3D point cloud registration. In Proc. IEEE CVPR.
[21] (2019) The perfect match: 3D point cloud matching with smoothed densities. In Proc. IEEE CVPR.
[22] (1996) Matrix computations. Johns Hopkins University Press.
[23] (2016) A comprehensive performance evaluation of 3D local feature descriptors. IJCV 116(1).
[24] (2013) Rotational projection statistics for 3D local surface description and object recognition. IJCV 105(1).
[25] (2014) An accurate and robust range image registration algorithm for 3D object modeling. IEEE Transactions on Multimedia 16(5), pp. 1377–1390.
[26] (2016) Structured global registration of RGB-D scans in indoor environments. arXiv.
[27] (2017) In defense of the triplet loss for person re-identification. arXiv.
[28] (2017) Learning local shape descriptors from part correspondences with multi-view convolutional networks. ACM TOG 37(1).
[29] (2020) PREDATOR: registration of 3D point clouds with low overlap. arXiv.
[30] (2020) Feature-metric registration: a fast semi-supervised approach for robust point cloud registration without correspondences. In Proc. IEEE CVPR.
[31] (1999) Using spin images for efficient object recognition in cluttered 3D scenes. IEEE TPAMI 21(5).
[32] (2010) Learning 3D mesh segmentation and labeling. ACM TOG 29(4).
[33] (2017) Learning compact geometric features. In Proc. IEEE ICCV.
[34] (2015) Adam: a method for stochastic optimization. In Proc. ICLR.
[35] (2014) Unsupervised feature learning for 3D scene labeling. In Proc. ICRA.
[36] (2020) Registration loss learning for deep probabilistic point set registration. arXiv.
[37] (2005) A spectral technique for correspondence problems using pairwise constraints. In Proc. IEEE ICCV.
[38] (2020) End-to-end learning local multi-view descriptors for 3D point clouds. In Proc. IEEE CVPR.
[39] (2017) DeepHash for image instance retrieval: getting regularization, depth and fine-tuning right. In Proc. ICMR.
[40] (2019) Soft Rasterizer: a differentiable renderer for image-based 3D reasoning. In Proc. IEEE ICCV.
[41] (2019) DeepVCP: an end-to-end deep neural network for point cloud registration. In Proc. IEEE ICCV.
[42] (1981) An iterative image registration technique with an application to stereo vision.
[43] (2006) A novel representation and feature matching algorithm for automatic pairwise registration of range images. IJCV 66(1).
[44] (2020) 3DRegNet: a deep neural network for 3D point registration. In Proc. IEEE CVPR.
[45] (2019) PyTorch: an imperative style, high-performance deep learning library. In NeurIPS.
[46] (2021) Distinctive 3D local deep descriptors. In Proc. ICPR.
[47] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Proc. IEEE CVPR.
[48] (2016) Volumetric and multi-view CNNs for object classification on 3D data. In Proc. IEEE CVPR.
[49] (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS.
[50] (2019) Unsupervised deep learning for structured shape matching. In Proc. IEEE ICCV.
[51] (2009) Fast point feature histograms (FPFH) for 3D registration. In Proc. ICRA.
[52] (2008) Aligning point cloud views using persistent feature histograms. In Proc. IROS.
[53] (2011) 3D is here: Point Cloud Library (PCL). In Proc. ICRA.
[54] (2014) SHOT: unique signatures of histograms for surface and texture description. CVIU 125.
[55] (2013) Scene coordinate regression forests for camera relocalization in RGB-D images. In Proc. IEEE CVPR.
[56] (2019) Learning an effective equivariant 3D descriptor without supervision. In Proc. IEEE ICCV.
[57] (2019) KPConv: flexible and deformable convolution for point clouds. In Proc. IEEE ICCV.
[58] (2017) L2-Net: deep learning of discriminative patch descriptor in Euclidean space. In Proc. IEEE CVPR.
[59] (2010) Unique shape context for 3D data description. In Proc. 3DOR.
[60] (2010) Unique signatures of histograms for local surface description. In Proc. ECCV.
[61] (2016) Instance Normalization: the missing ingredient for fast stylization. arXiv.
[62] (2016) Learning to navigate the energy landscape. arXiv.
[63] (2019) Deep Closest Point: learning representations for point cloud registration. In Proc. IEEE ICCV.
[64] (2019) PRNet: self-supervised learning for partial-to-partial registration. In NeurIPS.
[65] (2015) 3D ShapeNets: a deep representation for volumetric shapes. In Proc. IEEE CVPR.
[66] (2013) SUN3D: a database of big spaces reconstructed using SfM and object labels. In Proc. IEEE ICCV.
[67] (2015) Go-ICP: a globally optimal solution to 3D ICP point-set registration. IEEE TPAMI 38(11).
[68] (2017) TOLDI: an effective and robust approach for 3D local shape description. Pattern Recognition 65.
[69] (2018) FoldingNet: point cloud auto-encoder via deep grid deformation. In Proc. IEEE CVPR.
[70] (2018) 3DFeat-Net: weakly supervised local 3D features for point cloud registration. In Proc. ECCV.
[71] (2020) RPM-Net: robust point matching using learned features. In Proc. IEEE CVPR.
[72] (2018) Learning to find good correspondences. In Proc. IEEE CVPR.
[73] (2020) DeepGMR: learning latent Gaussian mixture models for registration. In Proc. ECCV, pp. 733–750.
[74] (2018) Deep learning of graph matching. In Proc. IEEE CVPR.
[75] (2017) 3DMatch: learning local geometric descriptors from RGB-D reconstructions. In Proc. IEEE CVPR.
[76] (2019) 3D point capsule networks. In Proc. IEEE CVPR.
[77] (2018) Learning and matching multi-view descriptors for registration of point clouds. In Proc. ECCV.
[78] (2016) Fast global registration. In Proc. ECCV.
[79] (2018) Open3D: a modern library for 3D data processing. arXiv.