I Introduction
Point clouds are one of the key industrial measurements and are often aligned with 3D models. Such a process is known as point cloud registration, a critical task with various applications in many fields, including large-scale reconstruction [1], motion estimation [2, 3], object recognition and localization [4, 5, 6], and medical imaging [7]. It aims to align one point cloud (source) to another (target) by finding the optimal rigid transformation. Unlike regular images, point clouds have complex geometry and inevitably contain noise due to the inherent imperfections of LiDAR sensors. As a result, registration must remain robust under unfavorable conditions.
Point cloud registration has been studied for decades. ICP [8] is a widely used method comprising two iterative steps: correspondence search and rigid transformation estimation. Nevertheless, ICP is sensitive to initialization and easily trapped in local optima. Many methods [9, 10, 11, 12, 13] have been proposed to tackle these issues at the cost of increased computational complexity. On the other hand, with the rapid development of deep learning techniques, researchers have revisited rigid point cloud registration with many novel designs. Using ground-truth correspondences or category labels, recent studies [14, 15, 16, 17, 18] solve the problem in a supervised manner. These methods extract informative point cloud features and achieve superior performance, but they require ground-truth correspondences of high quality and quantity. However, exact point-to-point correspondences may not exist in real-world scenarios in the presence of noise and outliers. Achieving registration from unlabeled point cloud data without ground-truth correspondences is a significant yet rarely explored challenge in the literature. To the best of our knowledge, only two such works exist so far [19, 20]. PCRNet [19] alleviates the pose misalignment observed in PointNetLK [14] by replacing the Lucas-Kanade module with multi-layered perceptrons; it directly recovers transformation parameters from the concatenated global descriptors of the source and target point clouds. The other work, FMRNet [20], employs an autoencoder framework with PointNetLK as its backbone and achieves registration by minimizing a feature-metric projection error.
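As background for the learning-based methods above, the classical ICP loop (alternating correspondence search and closed-form transform estimation) can be sketched in a few lines. This is a minimal NumPy illustration with brute-force nearest neighbours and no outlier handling, not the implementation of any cited work:

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Closed-form (SVD / Kabsch) rigid transform mapping src onto dst,
    assuming rows of src and dst are already in correspondence."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    # reflection guard: force det(R) = +1
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cd - R @ cs
    return R, t

def icp(src, dst, iters=20):
    """Vanilla ICP: alternate nearest-neighbour correspondence search and
    closed-form estimation, accumulating the total rotation and translation."""
    R_acc, t_acc = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        matched = dst[d2.argmin(axis=1)]          # closest target points
        R, t = best_rigid_transform(cur, matched)
        cur = cur @ R.T + t
        R_acc, t_acc = R @ R_acc, R @ t_acc + t
    return R_acc, t_acc
```

Because the correspondences are re-estimated greedily, the sketch inherits ICP's known sensitivity to the initial pose; it only converges reliably when the initial misalignment is small.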
Although these approaches showcase the sharper edge of unsupervised learning, they mainly depend on global representations and neglect local ones. As a branch of unsupervised learning, self-supervised learning has achieved remarkable performance on visual tasks in recent years [21, 22, 23]. Self-supervised frameworks usually design pretext tasks and train convolutional neural networks (CNNs) to solve them, e.g., predicting image rotation [21] or classifying positive and negative pairs [22]. These pretext tasks enable CNNs to extract high-level semantic representations. Inspired by this, we aim to encode high-level local representations that are beneficial for registration through a pretext task. Since a rigid transformation transforms different local regions consistently and preserves the structural information, we design a pretext task that enables the feature extraction network to learn informative features from local geometries. To this end, we propose a local consistency loss that exploits the intrinsic structural consistency among local geometries under rigid transformations. Accordingly, we propose a novel self-supervised scheme for rigid point cloud registration with joint global and local representations (as shown in Fig. 1), namely Deep Versatile Descriptors (DVDs).

Furthermore, existing unsupervised registration techniques align two point clouds by relying on the difference between two global representations. However, they ignore a vital requirement: the global representations must have high transformation awareness, meaning that the difference between the global representations of two point clouds should be large when the two point clouds are not well aligned. To address this, we introduce two additional tasks, namely self-reconstruction and normal estimation. In summary, our main contributions can be summarized as follows:

We revisit the registration problem with the novel DVDs, which jointly extract high-level global and local representations in a self-supervised manner, requiring neither labeled data nor arduous correspondence search.

We use two additional tasks, namely self-reconstruction and normal estimation, to enhance the transformation awareness of the DVDs.

The proposed local consistency loss improves the performance of both correspondence-based and correspondence-free registration methods.

Extensive experimental results demonstrate that DVDs achieve state-of-the-art performance on both synthetic and real-world datasets.
This article is organized as follows. Section II introduces related work. Section III presents the problem statement with DVDs. Section IV provides detailed training procedures for the proposed DVDs. Section V demonstrates the superiority of DVDs through experiments. Section VI discusses the results, and Section VII concludes the article.
II Related Work
Point cloud registration is a vital research field due to various 3D perception applications. Here we broadly categorize the related work into traditional, supervised, and unsupervised registration methods.
II-A Traditional Registration
Iterative Closest Point (ICP) [8] is the best-known method for rigid point cloud registration; it iteratively computes the closest points as correspondences and estimates the optimal transformation from them. Subsequent works have improved the robustness of ICP [12, 10, 11]. In another branch, Gaussian Mixture Model (GMM) [24] and Hierarchical Gaussian Mixture Registration (HGMR) [25] reformulate registration as probability distribution matching. These probabilistic methods still require good initializations to avoid being trapped in local optima, because their objective functions are non-convex. Unlike local registration methods, global registration methods do not require good initializations. Along this line, Go-ICP [13], GOGMA [26], GOSMA [27], and GoTS [28] find the globally optimal registration with the branch-and-bound (BnB) method at the expense of increased computational complexity. In another line, PFH [29] and FPFH [30] construct hand-crafted feature descriptors and estimate potential correspondences; robust optimization approaches such as semidefinite programming [31] and RANSAC [32] can then be utilized to estimate exact correspondences. Fast Global Registration (FGR) [33] accelerates the optimization process with a graduated non-convexity strategy. A fast rigid point cloud registration method without SVD or eigendecomposition was proposed in [34]. Recently, a comprehensive evaluation of point cloud registration methods has been conducted [35]. However, the need for good initializations, noisy correspondences, and time constraints remain challenging for traditional registration techniques.

II-B Supervised Registration
The rapid development of deep learning methods paves the way for a new perspective on point cloud registration [36]. PointNetLK [14] utilizes the PointNet module to obtain global descriptors and aligns two point clouds iteratively by minimizing the distance between the two learned descriptors with a modified Lucas-Kanade (LK) algorithm [37]. Deep Closest Point [15] introduces a solution from another perspective: it extracts per-point features and uses a transformer followed by a differentiable SVD module to recover the transformation. PRNet [38] uses a keypoint detector to establish keypoint correspondences, solving partial-to-partial point cloud registration in a self-supervised way. RPMNet [16] proposes Sinkhorn normalization [39] to estimate soft correspondences. IDAM [17] computes pairwise correspondences with an iterative distance-aware similarity convolution module. DeepGMR [18] learns point-to-GMM correspondences by integrating a GMM. In addition, many representative techniques [40, 41, 42] have been presented to deal with large-scale registration. While recent works achieve state-of-the-art performance, this work offers a new way of achieving unsupervised registration without human annotations.
II-C Unsupervised Registration
In contrast to the supervised and correspondence-based point cloud registration methods, there are two prior attempts at unsupervised point cloud registration. PCRNet [19] improves PointNetLK by replacing the LK module with multi-layered perceptrons. FMRNet [20] adapts PointNetLK to a semi-supervised manner by jointly optimizing the feature extraction and transformation estimation processes. However, both achieve registration by depending heavily on global representations, paying no attention to the useful representations offered by local geometries. Meanwhile, as a subclass of unsupervised learning, self-supervised learning has achieved remarkable performance on visual tasks without expensive labels [23, 43]. Self-supervised models can be categorized into adversarial, generative, and contrastive [44]. Inspired by this, we propose a local consistency loss that forces the feature extraction network to learn useful representations from local geometries in a self-supervised manner. In this way, we resolve the issue with the proposed novel DVDs.
III Problem Statement with DVDs
In this section, we present the formulation of the rigid registration problem with the proposed DVDs. We denote $\mathbf{X}$ and $\mathbf{Y}$ as the source and the target point cloud, respectively. The objective of registration is to estimate the rigid transformation as [13]:

$$\min_{\mathbf{R},\,\mathbf{t}} \sum_{x_i \in \mathbf{X}} \min_{y_j \in \mathbf{Y}} \big\| \mathbf{R} x_i + \mathbf{t} - y_j \big\|_2^2, \qquad (1)$$
where $\mathbf{R} \in SO(3)$ represents a rotation matrix and $\mathbf{t} \in \mathbb{R}^3$ denotes a translation vector. Solving the problem in Eq. (1) based on the raw 3D coordinates of point clouds is not robust, because raw point clouds usually contain noise and outliers. To address the issue, PointNetLK [14] utilizes PointNet [36] to extract $K$-dimensional vectors as global representations of point clouds. In this way, the registration problem can be described as follows:

$$\min_{\mathbf{R},\,\mathbf{t}} \big\| \phi(\mathbf{R}\mathbf{X} + \mathbf{t}) - \phi(\mathbf{Y}) \big\|_2^2, \qquad (2)$$
where $\phi(\cdot)$ denotes the feature extraction function. Previous studies [20, 14] primarily solved Eq. (2) by adopting PointNet as the feature extraction function, resulting in a global descriptor. However, they fail to utilize the informative features of local geometries. Inspired by the success of self-supervised learning [45, 46], we model the registration problem jointly on global and local representations in a self-supervised manner.
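To make the role of the global descriptor in Eq. (2) concrete, the following toy, order-invariant encoder mimics the PointNet recipe (a shared per-point map followed by max pooling). The single linear layer and the weight matrix `W` are placeholder assumptions, not the paper's architecture:

```python
import numpy as np

def pointnet_global_feature(points, W):
    """Toy stand-in for a PointNet-style encoder: a shared per-point linear
    layer with ReLU, then channel-wise max pooling into one global vector.
    W is a hypothetical (3, K) weight matrix."""
    return np.maximum(points @ W, 0.0).max(axis=0)

def feature_metric_error(src, dst, W):
    """Global-descriptor registration residual in the spirit of Eq. (2),
    evaluated at the identity transformation."""
    return np.linalg.norm(pointnet_global_feature(src, W)
                          - pointnet_global_feature(dst, W))
```

The max pooling makes the descriptor invariant to the ordering of the points, which is why the residual depends only on the geometry of the two clouds.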
We revisit point cloud registration motivated by the observation that the source point cloud is transformed into the target point cloud by a rigid transformation operator $g$. In other words, all local geometries of the point cloud are transformed consistently by the same mapping function $g$. This property inspires us to exploit the high-level relations among different local geometries. Given two subsets $\mathbf{X}_1, \mathbf{X}_2 \subset \mathbf{X}$ and the corresponding transformed versions $g(\mathbf{X}_1), g(\mathbf{X}_2)$, the related high-level representations can be denoted by $\phi(\mathbf{X}_1)$, $\phi(\mathbf{X}_2)$, $\phi(g(\mathbf{X}_1))$, and $\phi(g(\mathbf{X}_2))$, respectively, where $\phi(\cdot)$ represents a feature extraction function. In this way, the feature change between the original $\mathbf{X}_i$ and the transformed $g(\mathbf{X}_i)$ is described as

$$\Delta_i = h\big( \phi(\mathbf{X}_i), \phi(g(\mathbf{X}_i)) \big), \quad i \in \{1, 2\}, \qquad (3)$$
where $h(\cdot, \cdot)$ is a metric function for modeling the differences of features. We employ a Fully Connected (FC) layer as the metric function $h$, such that

$$h\big( \phi(\mathbf{X}_i), \phi(g(\mathbf{X}_i)) \big) = \mathbf{W} \big[ \phi(\mathbf{X}_i) \oplus \phi(g(\mathbf{X}_i)) \big], \qquad (4)$$
where $\oplus$ is the concatenation operation and $\mathbf{W}$ denotes the weight matrix of the FC layer. In Eq. (4), the high-dimensional features $\phi(\mathbf{X}_i)$ and $\phi(g(\mathbf{X}_i))$ are combined by $\mathbf{W}$, whose goal is to perform feature extraction while maintaining the change of each feature.¹ The goal of $h$ is to abstract the high-level relation between a local region and its transformed version. (¹The employment of the FC layer is motivated by the network design in PointNet [36] and the FC-network-based intra prediction for image coding [47].)
Since all local geometries of the point cloud are transformed consistently by the same mapping function $g$, the feature changes of different regions should be consistent, i.e., $\Delta_1$ and $\Delta_2$ shall be consistent, as shown in Fig. 2. In this way, we have $\Delta_1 \sim P$ and $\Delta_2 \sim P$, where $P$ denotes a distribution. Based on the above analysis, the registration problem in Eq. (2) can be reformulated in a self-supervised manner as

$$\min_{\mathbf{R},\,\mathbf{t},\,\phi} \big\| \phi(\mathbf{R}\mathbf{X} + \mathbf{t}) - \phi(\mathbf{Y}) \big\|_2^2 \quad \text{s.t.} \quad \Delta_1 \sim P, \;\; \Delta_2 \sim P. \qquad (5)$$
Here, we deal with the point cloud registration problem via a joint design of global and local descriptors, which appear in the objective and the constraint of Eq. (5), respectively.
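The local consistency idea behind the constraint of Eq. (5) can be illustrated with a tiny descriptor-change computation in the spirit of Eqs. (3) and (4); `phi`, `W`, and `g` below are illustrative stand-ins, not the paper's trained components:

```python
import numpy as np

def feature_change(phi, W, region, g):
    """Descriptor change of one local region under the rigid map g,
    measured by a single FC layer with weight matrix W applied to the
    concatenated descriptors (cf. Eqs. (3)-(4))."""
    f = np.concatenate([phi(region), phi(g(region))])
    return f @ W
```

For a centroid descriptor and a differencing weight matrix, the change of any region under a fixed translation is identical, which is exactly the cross-region consistency the loss enforces; a learned `phi` and `W` are of course far richer than this toy pair.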
However, the constraint in Eq. (5) is non-convex: it requires $\Delta_1$ and $\Delta_2$ to be sampled from the same distribution, which is extremely hard to enforce for a network. Alternatively, we tackle the problem by minimizing a distance metric between their distributions as

$$\min_{\mathbf{R},\,\mathbf{t},\,\phi} \big\| \phi(\mathbf{R}\mathbf{X} + \mathbf{t}) - \phi(\mathbf{Y}) \big\|_2^2 + D\big( P(\Delta_1), P(\Delta_2) \big), \qquad (6)$$

where $D(\cdot, \cdot)$ measures the discrepancy between the two distributions.
In this way, we need to train a feature extraction function $\phi$ and estimate a transformation operator $g$. Given that jointly optimizing the function $\phi$ and the transformation operator $g$ in Eq. (6) is difficult and time-consuming, we employ a simplified training strategy that optimizes $\phi$ and $g$ alternately, i.e., in each iteration the transformation operator $g$ is optimized while the function $\phi$ is frozen, and vice versa. Specifically, we introduce a primary task and auxiliary tasks to better optimize the transformation operator $g$ and the function $\phi$, respectively.
IV Robust Point Registration with DVDs
The core of rigid-body registration without correspondences is to extract discriminative and transformation-aware features from two point clouds and to recover the related transformation parameters. We solve the problem by leveraging the proposed versatile descriptors with two auxiliary tasks. The overall framework is illustrated in Fig. 3. In short, we first embed the point clouds in a high-dimensional space and then perform optimization with one primary task and two auxiliary tasks (self-reconstruction and normal estimation). The primary task estimates the optimal transformation parameters while considering both global and local descriptors in a self-supervised way. The auxiliary tasks aim to improve the rotation awareness of the feature extraction function.
The overall objective to be minimized in the proposed method can be formalized as

$$\mathcal{L} = \mathcal{L}_{pri} + \lambda_1 \mathcal{L}_{rec} + \lambda_2 \mathcal{L}_{ne}, \qquad (7)$$

where $\mathcal{L}_{pri}$, $\mathcal{L}_{rec}$, and $\mathcal{L}_{ne}$ denote the losses of the primary, self-reconstruction, and normal estimation tasks detailed below, and the trade-off parameters $\lambda_1$ and $\lambda_2$ are positive constants that balance the effects of the different tasks. The optimization of Eq. (7) takes the source and the target point clouds as inputs without requiring the ground-truth transformation, allowing us to learn unsupervised representations for robust point cloud registration.
IV-A Primary Task: Learning with Local Consistency
This subsection elaborates on the proposed scheme. Based on the problem formulation in Section III and Eq. (6), we tackle the unsupervised registration problem as

$$\min_{\mathbf{R},\,\mathbf{t},\,\phi} \big\| \phi(\mathbf{R}\mathbf{X} + \mathbf{t}) - \phi(\mathbf{Y}) \big\|_2^2 + D_{\mathrm{KL}}\big( P(\Delta_1) \,\|\, P(\Delta_2) \big) + D_{\mathrm{KL}}\big( P(\Delta_2) \,\|\, P(\Delta_1) \big). \qquad (8)$$
In Eq. (8), the distribution alignment could also be measured by other distance metrics, such as Maximum Mean Discrepancy [48] or the Euclidean distance. We empirically employ the symmetric Kullback-Leibler divergence, which exhibits stable and superior performance compared with other distance metrics.
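For reference, the symmetric Kullback-Leibler divergence used above can be sketched for discrete histograms as follows; the paper's continuous formulation and its density estimates are not reproduced here:

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two discrete distributions.
    Inputs are treated as unnormalized histograms and normalized first;
    eps guards against log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return kl(p, q) + kl(q, p)
```

Unlike the plain KL divergence, the symmetrized form treats the two inputs interchangeably, which matches its use as a distance-like penalty between the two feature-change distributions.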
The first term in Eq. (8) is referred to as the global descriptor term, which learns global representations from both $\mathbf{X}$ and $\mathbf{Y}$. The estimation of the transformation parameters is given in the following. Given a frozen function $\phi$, estimating the transformation is still time-consuming. Therefore, we choose the computationally efficient Inverse Compositional (IC) method [37] to iteratively calculate the transformation increment as

$$\Delta\boldsymbol{\xi} = \mathbf{J}^{\dagger} \big[ \phi(\mathbf{Y}) - \phi(g \cdot \mathbf{X}) \big], \qquad (9)$$
where $\mathbf{J}$ is the Jacobian matrix of the global representation differences between the source and target point clouds, and $\mathbf{J}^{\dagger}$ denotes its pseudo-inverse. Unlike regular grid images, computing the Jacobian matrix of irregular point clouds requires taking gradients in the $x$, $y$, and $z$ directions. Following [14], the warp Jacobian in the IC-LK algorithm can be approximated by finite differences as

$$\mathbf{J}_i \approx \frac{\phi\big( g^{-1}(\delta \xi_i) \cdot \mathbf{Y} \big) - \phi(\mathbf{Y})}{\delta \xi_i}, \qquad (10)$$
where $\mathbf{J}_i$ denotes the $i$-th column of $\mathbf{J}$, the twist parameters $\boldsymbol{\xi} \in \mathbb{R}^6$ comprise three Euler angles and three translation parameters, and $g^{-1}$ is the inverted warp. In this way, the transformation parameters are updated as $g_{k+1} = \Delta g_k \cdot g_k$ after the $k$-th iteration.
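The flavour of the iterative update in Eq. (9) can be demonstrated on a drastically simplified case. The sketch below recovers a pure translation with a finite-difference Jacobian and a pseudo-inverse step; the actual IC-LK algorithm handles the full 6-DoF twist and precomputes the Jacobian on the target, whereas this toy forward variant recomputes it each step:

```python
import numpy as np

def ic_translation(phi, src, dst, delta=1e-4, iters=5):
    """Toy LK-style solver restricted to translation: build a
    finite-difference Jacobian of the global descriptor w.r.t. the three
    translation parameters, then take pseudo-inverse steps."""
    t = np.zeros(3)
    for _ in range(iters):
        f = phi(src + t)
        # one Jacobian column per translation axis, shape (K, 3)
        J = np.stack([(phi(src + t + delta * e) - f) / delta
                      for e in np.eye(3)], axis=1)
        t = t + np.linalg.pinv(J) @ (phi(dst) - f)
    return t
```

With a descriptor that is linear in the translation (e.g. the centroid), a single step already lands on the exact answer; for a learned nonlinear descriptor the same update is applied repeatedly, as in the text.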
To enhance the representation ability of the global descriptor, we adopt the idea of self-supervised learning, which encodes high-level semantic representations through a pretext learning task [49]. Specifically, we aim to extract useful local representations by exploiting the high-level relations among different local geometries based on the second term in Eq. (8), which can be considered a pretext learning task. In this way, we solve registration using joint global and local representations. The design of this self-supervised learning is based on the key observation that the distinctive local geometric structures captured by two subsets of points can be employed to enhance the representation ability of the feature extraction module.
In the following, we elaborate on the selection of the local distinctive subset points and the intuition behind it.
The selection of subset points. We employ two representative points related to the barycenter $\bar{x} = \frac{1}{N} \sum_{x_i \in \mathbf{X}} x_i$ of the source point cloud [6], namely the farthest point from the barycenter,

$$x_f = \arg\max_{x_i \in \mathbf{X}} \| x_i - \bar{x} \|_2, \qquad (11)$$

and the closest point to the barycenter,

$$x_c = \arg\min_{x_i \in \mathbf{X}} \| x_i - \bar{x} \|_2. \qquad (12)$$
There may be isolated points far from the barycenter in real scenes, so we apply an outlier rejection step following [31]: for example, points whose distance to the barycenter exceeds a threshold are rejected. We then take the point $x_f$ and search for its $k$ nearest neighbors in Euclidean distance, forming a local subset $\mathbf{X}_1$; the subset $\mathbf{X}_2$ is formed analogously around the point $x_c$. The effect of the local size $k$ on registration performance is discussed in the ablation studies.
The intuition behind this selection strategy is twofold: (1) we want to capture the distinctive geometric structures of the point cloud with two subsets of points to enhance the representation ability of the feature extraction module; from a geometric perspective, the farthest and the closest points are located in distinctive regions [50, 51]; (2) we need the two subsets to be independent, without overlap, and the farther apart they are, the more likely they are to be independent.
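The selection procedure of Eqs. (11)-(12) might be sketched as follows; the radius-based `tau` rejection rule is a simplified stand-in for the outlier handling cited above:

```python
import numpy as np

def select_local_subsets(points, k=64, tau=None):
    """Anchor on the farthest and the closest point from the barycenter
    (Eqs. (11)-(12)), then gather each anchor's k nearest neighbours.
    tau, if given, drops points farther than tau from the barycenter."""
    bary = points.mean(axis=0)
    d = np.linalg.norm(points - bary, axis=1)
    if tau is not None:
        points, d = points[d <= tau], d[d <= tau]
    anchors = points[[d.argmax(), d.argmin()]]
    subsets = []
    for a in anchors:
        idx = np.linalg.norm(points - a, axis=1).argsort()[:k]
        subsets.append(points[idx])
    return subsets[0], subsets[1]
```

Each anchor is its own nearest neighbour, so it always belongs to its subset; the two subsets correspond to the regions $\mathbf{X}_1$ and $\mathbf{X}_2$ used by the local consistency loss.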
IV-B Auxiliary Task I: Self-Reconstruction
Since discovering features sensitive to rotations and translations from unlabeled source and target point clouds is quite challenging, a single feature extraction module may not produce sufficient representations. If the DVDs possess rich transformation awareness, they provide proper supervision for local consistency, creating a virtuous circle in the optimization. On the contrary, the learning process may be trapped in local minima if the extracted representations lack transformation awareness. To avoid this issue, we introduce two auxiliary tasks that enhance the rotation awareness of the DVDs.
The feature extraction module aims to learn a transformation-aware descriptor that reflects the effects of the rigid-body transformation, enabling the modified inverse compositional algorithm to recover the transformation parameters. We choose the vanilla PointNet [36] as the feature extraction module $\phi$, which is composed of a Multi-Layer Perceptron (MLP) and a max-pooling operation.
We employ a folding-based [52] decoder as the reconstruction module in the self-reconstruction task. The Chamfer Distance [53] is adopted to define the reconstruction error, measuring the difference between an original point cloud $\mathbf{P}$ and its reconstructed version $\hat{\mathbf{P}}$:

$$d_{CD}(\mathbf{P}, \hat{\mathbf{P}}) = \frac{1}{|\mathbf{P}|} \sum_{p \in \mathbf{P}} \min_{\hat{p} \in \hat{\mathbf{P}}} \| p - \hat{p} \|_2^2 + \frac{1}{|\hat{\mathbf{P}}|} \sum_{\hat{p} \in \hat{\mathbf{P}}} \min_{p \in \mathbf{P}} \| \hat{p} - p \|_2^2. \qquad (13)$$
This work takes both the source and the target point clouds for reconstruction, i.e., $\mathbf{P} \in \{\mathbf{X}, \mathbf{Y}\}$. As a result, we have

$$\mathcal{L}_{rec} = d_{CD}(\mathbf{X}, \hat{\mathbf{X}}) + d_{CD}(\mathbf{Y}, \hat{\mathbf{Y}}). \qquad (14)$$
The Chamfer Distance can be computed efficiently. However, it is blind to certain visual inferiority [54]; Fig. 4 demonstrates such an example: two point clouds with a large relative rotation may still yield a small Chamfer Distance, resulting in weak rotation awareness. Therefore, we consider another auxiliary task for further performance improvement.
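A minimal Chamfer Distance sketch also reproduces the rotation blindness illustrated in Fig. 4: for a symmetric point set, a 180-degree rotation leaves the loss at exactly zero even though the pose error is maximal. This brute-force version is for illustration only:

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer Distance (squared-distance form, cf. Eq. (13)):
    average nearest-neighbour distance in both directions."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

Rotating a cube's corners by 180 degrees about the z-axis maps the point set onto itself, so the Chamfer Distance cannot distinguish this large rotation from perfect alignment, which is precisely why a second auxiliary signal is needed.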
IV-C Auxiliary Task II: Normal Estimation
Normal estimation is an essential step in many research areas of point clouds [55], e.g., rendering and surface reconstruction. Rather than pursuing estimation precision, we aim to improve the rotation awareness of the DVDs through the normal estimation task. Specifically, we use a lightweight MLP to generate the estimated normals $\hat{n}_i$ from the concatenated coordinates and global features. The estimation error is measured by the cosine loss function as

$$\mathcal{L}_{ne} = \frac{1}{N} \sum_{i=1}^{N} \big( 1 - |\langle n_i, \hat{n}_i \rangle| \big), \qquad (15)$$

where $n_i$ denotes the reference normal of the $i$-th point.
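A possible form of the cosine loss above is sketched below; treating normals as sign-ambiguous (absolute cosine) is an assumption made here for illustration, since a surface normal's orientation is not uniquely defined:

```python
import numpy as np

def normal_cosine_loss(n_pred, n_gt):
    """Cosine loss for normal estimation: 1 - |cos angle| averaged over
    points, so a prediction equal to the reference normal (or its
    negation) incurs zero loss."""
    n_pred = n_pred / np.linalg.norm(n_pred, axis=1, keepdims=True)
    n_gt = n_gt / np.linalg.norm(n_gt, axis=1, keepdims=True)
    cos = np.abs((n_pred * n_gt).sum(axis=1))
    return float((1.0 - cos).mean())
```

The loss is 0 for parallel or anti-parallel normals and 1 for orthogonal ones, so it penalizes direction errors independently of the normals' magnitudes.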
IV-D Intuitive Explanation
Here, we provide an intuitive explanation of why the rotation awareness of the feature extraction function is important for registering two point clouds. Fig. 5 visualizes the feature differences between the transformed source point cloud and the target point cloud during the iterative optimization process. Specifically, we compare the registration process under a feature extraction function with high rotation awareness and one with low rotation awareness; for better visualization, the feature differences are reshaped into a square matrix. Fig. 5 shows that with the high-rotation-awareness feature extraction function, the feature difference decreases and the registration becomes more accurate during optimization. With the low-rotation-awareness function, the feature difference also decreases, but the registration is trapped in a local minimum. In particular, the rightmost column of Fig. 5 shows that the extracted global features of the two point clouds are almost identical even though the two point clouds are not aligned. Therefore, the rotation awareness of the feature extraction function is important for identifying whether two point clouds are aligned. In the following section, we demonstrate the effectiveness of the proposed method with various experiments.
V Experiments
In this section, we design several experimental settings to demonstrate the generalization and superiority of DVDs. First, we evaluate DVDs on synthetic object-centric point cloud datasets. Then, we conduct a cross-dataset evaluation, i.e., DVDs are trained on the synthetic object-centric dataset and tested on a real-world dataset to assess their generalization ability. Furthermore, we evaluate DVDs on large-scale real-world 3D scenes. Since real-world industrial scenarios usually produce noisy and partially visible point clouds, robustness evaluations are conducted under these conditions. Meanwhile, an efficiency study shows the superiority of DVDs in industrial applications. Finally, we demonstrate that the proposed local consistency loss improves several point cloud registration techniques, indicating its generality.
Experimental Details. We conduct experiments on the ModelNet40 [56], ScanObjectNN [57], and 3DMatch [58] benchmarks. ModelNet40 is a synthetic dataset consisting of 12,311 CAD models. We randomly sample 1024 points from each CAD model twice with different random seeds and denote the sampled point clouds as the source point cloud $\mathbf{X}$ and the target point cloud $\mathbf{Y}$, respectively [59]. We rescale the point clouds into a unit sphere. Following the experimental settings in [14], all models are trained on the train set of the first 20 categories in ModelNet40. Euler angles are uniformly sampled in the range [0, ] and translations in the range for each axis during training [15]. We transform the source point cloud with the sampled rigid transformation; the source and target point clouds are then fed into the network, which aims to register them. The maximum number of iterations is set to 10. We train the model for a total of 200 epochs. All experiments are implemented on a single NVIDIA Titan Xp.
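The transformation sampling described above (per-axis Euler angles plus a random translation) can be sketched as follows; the Z-Y-X composition order and the parameter ranges are illustrative assumptions, not necessarily the paper's exact convention:

```python
import numpy as np

def euler_to_matrix(ax, ay, az):
    """Compose per-axis rotations in Z-Y-X order into one rotation matrix."""
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx

def random_rigid_pair(points, rng, max_angle, max_trans):
    """Create a (source, target) training pair by applying a random
    rigid transform sampled from the given ranges."""
    R = euler_to_matrix(*rng.uniform(0.0, max_angle, size=3))
    t = rng.uniform(-max_trans, max_trans, size=3)
    return points, points @ R.T + t
```

Since the transform is rigid, all pairwise distances within the cloud are preserved, which is a quick sanity check on the sampled matrices.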
V-A Evaluation on Unseen Objects
In the first experiment, we compare our method with the following algorithms: ICP [8], PointNetLK [14], PCRNet [19], FMRNet [20], DCP [15], FGR [33], and RANSAC [32]. All methods are tested on the test set of the remaining 20 categories; qualitative results are shown in Fig. 6. The differences between predicted values and ground truth are measured by the root mean squared error (RMSE). A registration is deemed successful if the rotation and translation errors are both smaller than predefined thresholds [42], and performance is evaluated by the recall, i.e., the percentage of successfully registered point cloud pairs. Fig. 7 shows that ICP performs worst among the competing algorithms, since it is a local registration method and is vulnerable to the initial position. PointNetLK uses a pretrained classification network as its feature extraction module and performs better than ICP. PCRNet and FMRNet, which jointly optimize the feature extraction and transformation estimation processes, also improve over ICP. The classical methods FGR and RANSAC are competitive with our method. Overall, the proposed method attains more accurate registration than the compared methods, demonstrating the effectiveness of the proposed DVDs.
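The recall metric used throughout this section reduces to a few lines; the thresholds are left as parameters since their exact values vary across experiments:

```python
import numpy as np

def registration_recall(rot_err, trans_err, rot_thresh, trans_thresh):
    """Fraction of pairs whose rotation AND translation errors are both
    below their respective thresholds."""
    rot_err = np.asarray(rot_err, dtype=float)
    trans_err = np.asarray(trans_err, dtype=float)
    ok = (rot_err < rot_thresh) & (trans_err < trans_thresh)
    return float(ok.mean())
```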
V-B Evaluation on Unseen Categories
In this subsection, we evaluate the generalization of the proposed model on unseen categories. To this end, all methods are trained on the train set of the first 20 categories and tested on the test set of the remaining 20 categories in ModelNet40. The differences between predicted values and ground truth are measured by the RMSE, and performance is evaluated by the recall. Fig. 8 shows a performance drop for DCP and PCRNet compared with the unseen-objects setting. In contrast, our method generalizes to unseen categories and is more robust than the other deep learning-based approaches. In addition, the unsupervised FMRNet generalizes better than the supervised PointNetLK, since it does not depend on category labels. The traditional registration methods, including ICP, FGR, and RANSAC, are less affected.
V-C Cross-Dataset Evaluation
It is also crucial to test the generalization ability of models under cross-dataset evaluation. We train models on the train set of the first 20 categories in ModelNet40 and test on the test set of ScanObjectNN [57], a recently published real-world dataset containing 2902 objects from 15 categories. Specifically, we define a successful registration as a rotation error less than and a translation error less than 0.01. The results are reported in Fig. 9. Compared with the results on the synthetic dataset, the performances of DCP and PCRNet degrade in the cross-dataset setting, while the classical methods FGR and RANSAC remain stable. The proposed method still outperforms the compared deep learning-based methods, demonstrating the effectiveness of its customized designs and its applicability across synthetic and real-world datasets.
V-D Evaluation on Real-World 3D Scenes
7-Scene [60] is a real-world indoor 3D scene registration benchmark. Following DGR [42], we generate 3D scan pairs with more than 70% overlap for training. During training, Euler angles are uniformly sampled in the range [0, ] and translations in the range [-0.5, 0.5]. In addition to PointNetLK [14], PCRNet [19], FMRNet [20], DCP [15], FGR [33], and RANSAC [32], we also compare with DGR [42] and FCGF [41], which are state-of-the-art methods on large-scale point cloud scenes. Two versions of RANSAC with 5000 and 10000 iterations are compared. We downsample to 10k points for training DCP, FMRNet, PCRNet, PointNetLK, and the proposed method; as DGR requires a pretrained FCGF model, we use all the points for training DGR. Qualitative results of the proposed method are shown in Fig. 10. As shown in Fig. 11, the proposed method is competitive with DGR; for rotation errors larger than , our method slightly outperforms DGR. We tried to train PCRNet and PointNetLK, but they did not converge. Unlike DGR, the proposed method and FMRNet use a globally pooled descriptor to represent the entire point cloud scene. However, representing complex scenes with a globally pooled descriptor is challenging, so our method does not attain highly accurate results in large-scale scenes.
Points  ICP  PointNetLK  DCP  FGR  RANSAC  FMRNet  PCRNet  Ours 

500  7  70  6  143  18  68  14  66 
1000  16  73  9  172  48  70  16  68 
2000  35  75  20  232  110  72  18  69 
V-E Robustness Evaluation
In this section, we evaluate the models in the presence of outliers and partial visibility. A successful registration is defined as a rotation error under and a translation error under 0.01. For the outlier scenario, noise sampled from a Gaussian distribution is added to each point of the source and target point clouds. All models are trained on clean data without noise augmentation. Though FGR and RANSAC are competitive with our method on clean data, they are sensitive to noisy data, as shown in Fig. 12 (a). Adding noise to the source and target point clouds breaks the exact point-to-point correspondences; therefore, the correspondence-based DCP shows limited performance. In contrast, correspondence-free learning-based approaches, such as FMRNet, PointNetLK, and the proposed method, are more robust to noise.

We also conduct experiments on partially visible data, as point clouds acquired in the real world are often partial due to occlusions. To simulate this condition, we generate partial source and target point clouds independently from random camera poses, following the settings in [14]. Note that all methods are trained on noise-free and fully visible data from the first 20 categories in ModelNet40. Fig. 12 (b) demonstrates that the proposed unsupervised model outperforms all the others under partial visibility, thanks to its use of global and local descriptors. It is therefore promising to extend our model to real-world scenarios.
V-F Computational Efficiency
In this subsection, we evaluate the efficiency of the proposed method. The experiment is conducted on the test set of the last 20 categories of ModelNet40. Rigid transformations are randomly sampled with Euler angles in the range [0, ] and translations in the range . To mimic real-world scenarios, noise sampled from a Gaussian distribution is added to each point of the source and target point clouds. We perform this experiment on a 2.10 GHz Intel E5-2620 and an NVIDIA Titan Xp; ICP, FGR, and RANSAC are executed on the CPU. The average running times of all compared methods are shown in Table I. When processing large-scale point clouds (e.g., 2000 points), the proposed method is faster than RANSAC, FGR, FMRNet, and PointNetLK, while it is slower than ICP, DCP, and PCRNet.
V-G Generality of Local Consistency Loss
We use the proposed local consistency loss to improve several classical point cloud registration models, including DCP [15], PRNet [38], IDAM [17], and DeepGMR [18]. All models are trained on the first 20 categories and tested on the remaining 20 categories of ModelNet40. Table II indicates that equipping both correspondence-based and correspondence-free registration methods with the local consistency loss enables them to estimate more accurate transformation parameters. The proposed local consistency loss is therefore a generic component complementary to existing point cloud registration techniques.
Method  RMSE(R)  MAE(R)  RMSE(t)  MAE(t) 

DCP [15]  4.834  2.795  0.020  0.010 
PRNet [38]  1.506  0.874  0.015  0.010 
IDAM [17]  2.084  1.122  0.022  0.014 
DeepGMR [18]  3.472  1.526  0.004  0.001 
LC + DCP  3.950  2.298  0.020  0.010 
LC + PRNet  1.293  0.783  0.014  0.010 
LC + IDAM  1.527  0.933  0.016  0.010 
LC + DeepGMR  3.192  1.247  0.004  0.001 
Method  Unsupervised  Transformation awareness  Local descriptors  Computational complexity  Performance 

DCP  low  moderate  
PointNetLK  moderate  moderate  
PCRNet  ✓  low  low  
FMRNet  ✓  moderate  moderate  
DVDs*  ✓  ✓  ✓  moderate  high 
V-H Ablation Studies
To examine the effectiveness of each component in Eq. (7), we conduct detailed ablation studies on ModelNet40. The first results are shown in Fig. 13 (a). Model A denotes the baseline, trained with the self-reconstruction loss only. The baseline equipped with the auxiliary normal estimation task (model B) outperforms model A, which verifies the effectiveness of the normal estimation task. Our full model C incorporates all the tasks and outperforms the others, demonstrating that every individual task contributes to the superior performance. Moreover, we study the effect of the size of the local geometries, i.e., the cardinalities of the two local subsets. As shown in Fig. 13 (b), the model achieves the best performance with 64 points, outperforming both a larger (96 points) and a smaller (32 points) local size.
VI Discussion
The advances over previous methods are summarized in Table III, which we compile from the context and the experimental results. From Table III, we find that PCRNet, FMRNet, and the proposed method are unsupervised, whereas DCP and PointNetLK are supervised methods requiring human annotations or labels. Besides, none of the compared methods considers local descriptors or the transformation awareness of the feature extraction network. Furthermore, as demonstrated by the experimental results on synthetic and real-world datasets, DVDs achieve the best registration performance among all compared methods with moderate computational complexity.
VII Conclusion
This paper proposed a novel unsupervised representation, the DVDs, for robust point cloud registration, which takes advantage of high-dimensional features jointly obtained from global and local geometries. Besides, the transformation awareness of the DVDs is further enhanced by two additional tasks (reconstruction and normal estimation). Numerical experiments on both synthetic and real datasets revealed several critical features of the proposed registration algorithm: 1) regardless of the dataset, it always achieves the best performance among several competing methods; 2) it remains robust in various scenarios, such as unseen categories, severe noise, partial visibility, and even cross-dataset evaluation; 3) it comprises one primary task and two additional tasks, all of which contribute to accurate registration, indicating the superiority of DVDs in encouraging joint representations of discriminative features; 4) the proposed local consistency loss can improve the performance of both correspondence-based and correspondence-free registration methods.
References
 [1] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski, “Building rome in a day,” Communications of the ACM, vol. 54, no. 10, pp. 105–112, 2011.
 [2] Y. Guo, M. Bennamoun, F. Sohel, M. Lu, and J. Wan, “An integrated framework for 3d modeling, object detection, and pose estimation from point-clouds,” IEEE Transactions on Instrumentation and Measurement, vol. 64, no. 3, pp. 683–693, 2014.
 [3] J. Peng, W. Xu, B. Liang, and A.-G. Wu, “Virtual stereovision pose measurement of noncooperative space targets for a dual-arm space robot,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 1, pp. 76–88, 2019.
 [4] J. M. Wong, V. Kee, T. Le, S. Wagner, G.L. Mariottini, A. Schneider, L. Hamilton, R. Chipalkatty, M. Hebert, D. M. Johnson et al., “Segicp: Integrated deep semantic segmentation and pose estimation,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 5784–5789.
 [5] Y. Cui, Y. An, W. Sun, H. Hu, and X. Song, “Lightweight attention module for deep learning on classification and segmentation of 3d point clouds,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–12, 2020.
 [6] D. Liu, C. Chen, C. Xu, Q. Cai, L. Chu, F. Wen, and R. Qiu, “A robust and reliable point cloud recognition network under rigid transformation,” IEEE Transactions on Instrumentation and Measurement, 2022.
 [7] M. A. Audette, F. P. Ferrie, and T. M. Peters, “An algorithmic overview of surface registration techniques for medical imaging,” Medical Image Analysis, vol. 4, no. 3, pp. 201–217, 2000.
 [8] P. J. Besl and N. D. McKay, “Method for registration of 3d shapes,” in Sensor fusion IV: control paradigms and data structures, vol. 1611. International Society for Optics and Photonics, 1992, pp. 586–606.
 [9] A. W. Fitzgibbon, “Robust registration of 2d and 3d point sets,” Image and Vision Computing, vol. 21, no. 13–14, pp. 1145–1153, 2003.
 [10] S. Bouaziz, A. Tagliasacchi, and M. Pauly, “Sparse iterative closest point,” in Computer graphics forum, vol. 32, no. 5. Wiley Online Library, 2013, pp. 113–123.
 [11] D. Chetverikov, D. Svirko, D. Stepanov, and P. Krsek, “The trimmed iterative closest point algorithm,” in Object recognition supported by user interaction for service robots, vol. 3. IEEE, 2002, pp. 545–548.
 [12] A. Segal, D. Haehnel, and S. Thrun, “Generalized-icp,” in Robotics: Science and Systems, vol. 2, no. 4. Seattle, WA, 2009, p. 435.
 [13] J. Yang, H. Li, D. Campbell, and Y. Jia, “Go-icp: A globally optimal solution to 3d icp point-set registration,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 38, no. 11, pp. 2241–2254, 2015.
 [14] Y. Aoki, H. Goforth, R. A. Srivatsan, and S. Lucey, “Pointnetlk: Robust & efficient point cloud registration using pointnet,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7163–7172.
 [15] Y. Wang and J. M. Solomon, “Deep closest point: Learning representations for point cloud registration,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 3523–3532.
 [16] Z. J. Yew and G. H. Lee, “Rpmnet: Robust point matching using learned features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 824–11 833.
 [17] J. Li, C. Zhang, Z. Xu, H. Zhou, and C. Zhang, “Iterative distance-aware similarity matrix convolution with mutual-supervised point elimination for efficient point cloud registration,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16. Springer, 2020, pp. 378–394.
 [18] W. Yuan, B. Eckart, K. Kim, V. Jampani, D. Fox, and J. Kautz, “Deepgmr: Learning latent gaussian mixture models for registration,” in European Conference on Computer Vision. Springer, 2020, pp. 733–750.
 [19] V. Sarode, X. Li, H. Goforth, Y. Aoki, R. A. Srivatsan, S. Lucey, and H. Choset, “Pcrnet: point cloud registration network using pointnet encoding,” arXiv preprint arXiv:1908.07906, 2019.
 [20] X. Huang, G. Mei, and J. Zhang, “Feature-metric registration: A fast semi-supervised approach for robust point cloud registration without correspondences,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 366–11 374.
 [21] N. Komodakis and S. Gidaris, “Unsupervised representation learning by predicting image rotations,” in International Conference on Learning Representations (ICLR), 2018.
 [22] X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 750–15 758.
 [23] J. S. L. Senanayaka, H. Van Khang, and K. G. Robbersmyr, “Toward self-supervised feature learning for online diagnosis of multiple faults in electric powertrains,” IEEE Transactions on Industrial Informatics, vol. 17, no. 6, pp. 3772–3781, 2020.
 [24] B. Jian and B. C. Vemuri, “Robust point set registration using gaussian mixture models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1633–1645, 2010.
 [25] B. Eckart, K. Kim, and J. Kautz, “Hgmr: Hierarchical gaussian mixtures for adaptive 3d registration,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 705–721.
 [26] D. Campbell and L. Petersson, “Gogma: Globally-optimal gaussian mixture alignment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5685–5694.
 [27] D. Campbell, L. Petersson, L. Kneip, H. Li, and S. Gould, “The alignment of the spheres: Globally-optimal spherical mixture alignment for camera pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 796–11 806.
 [28] Y. Liu, C. Wang, Z. Song, and M. Wang, “Efficient global point cloud registration by matching rotation invariant features through translation search,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 448–463.
 [29] R. B. Rusu, N. Blodow, Z. C. Marton, and M. Beetz, “Aligning point cloud views using persistent feature histograms,” in 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2008, pp. 3384–3391.
 [30] R. B. Rusu, N. Blodow, and M. Beetz, “Fast point feature histograms (fpfh) for 3d registration,” in 2009 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2009, pp. 3212–3217.
 [31] H. Yang, J. Shi, and L. Carlone, “Teaser: Fast and certifiable point cloud registration,” IEEE Transactions on Robotics, vol. 37, no. 2, pp. 314–333, 2020.
 [32] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
 [33] Q.Y. Zhou, J. Park, and V. Koltun, “Fast global registration,” in European Conference on Computer Vision. Springer, 2016, pp. 766–782.
 [34] J. Wu, “Rigid 3d registration: A simple method free of svd and eigendecomposition,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 10, pp. 8288–8303, 2020.
 [35] B. Zhao, X. Chen, X. Le, J. Xi, and Z. Jia, “A comprehensive performance evaluation of 3d transformation estimation techniques in point cloud registration,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–14, 2021.
 [36] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 652–660.
 [37] S. Baker and I. Matthews, “Lucas-kanade 20 years on: A unifying framework,” International Journal of Computer Vision, vol. 56, no. 3, pp. 221–255, 2004.
 [38] Y. Wang and J. M. Solomon, “Prnet: Selfsupervised learning for partialtopartial registration,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 8814–8826.
 [39] R. Sinkhorn, “A relationship between arbitrary positive matrices and doubly stochastic matrices,” The annals of mathematical statistics, vol. 35, no. 2, pp. 876–879, 1964.
 [40] H. Deng, T. Birdal, and S. Ilic, “3d local features for direct pairwise registration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3244–3253.
 [41] C. Choy, J. Park, and V. Koltun, “Fully convolutional geometric features,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8958–8966.
 [42] C. Choy, W. Dong, and V. Koltun, “Deep global registration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2514–2523.
 [43] I. Achituve, H. Maron, and G. Chechik, “Self-supervised learning for domain adaptation on point clouds,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 123–133.
 [44] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive,” IEEE Transactions on Knowledge and Data Engineering, 2021.
 [45] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
 [46] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” in International Conference on Learning Representations, 2019.
 [47] J. Li, B. Li, J. Xu, R. Xiong, and W. Gao, “Fully connected networkbased intra prediction for image coding,” IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3236–3247, 2018.
 [48] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur, “Optimal kernel choice for largescale twosample tests,” in Advances in Neural Information Processing systems (NeurIPS), 2012, pp. 1205–1213.
 [49] A. Kolesnikov, X. Zhai, and L. Beyer, “Revisiting selfsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1920–1929.
 [50] Y. Eldar, M. Lindenbaum, M. Porat, and Y. Y. Zeevi, “The farthest point strategy for progressive image sampling,” IEEE Transactions on Image Processing, vol. 6, no. 9, pp. 1305–1315, 1997.
 [51] M. I. Shamos and D. Hoey, “Closestpoint problems,” in 16th Annual Symposium on Foundations of Computer Science (sfcs 1975). IEEE, 1975, pp. 151–162.
 [52] Y. Yang, C. Feng, Y. Shen, and D. Tian, “Foldingnet: Point cloud autoencoder via deep grid deformation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 206–215.
 [53] H. Fan, H. Su, and L. Guibas, “A point set generation network for 3d object reconstruction from a single image,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2463–2471.
 [54] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas, “Learning representations and generative models for 3d point clouds,” in International Conference on Machine Learning (ICML). PMLR, 2018, pp. 40–49.
 [55] Y. Liu, B. Fan, S. Xiang, and C. Pan, “Relation-shape convolutional neural network for point cloud analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8895–8904.
 [56] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1912–1920.
 [57] M. A. Uy, Q.H. Pham, B.S. Hua, T. Nguyen, and S.K. Yeung, “Revisiting point cloud classification: A new benchmark dataset and classification model on realworld data,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 1588–1597.
 [58] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser, “3dmatch: Learning local geometric descriptors from rgbd reconstructions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1802–1811.
 [59] H. Xu, S. Liu, G. Wang, G. Liu, and B. Zeng, “Omnet: Learning overlapping mask for partialtopartial point cloud registration,” arXiv preprint arXiv:2103.00937, 2021.
 [60] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in rgbd images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2930–2937.