Point clouds are one of the key industrial measurements and are often aligned with 3D models. This process is known as point cloud registration, a critical task with applications in many fields, including large-scale reconstruction, motion estimation [2, 3], object recognition and localization [4, 5, 6], and medical imaging. It aims to align one point cloud (the source) to another (the target) by finding the optimal rigid transformation. Unlike regular images, point clouds have complex geometry and inevitably contain noise due to the inherent imperfections of LiDAR. As a result, registration should be conducted robustly under unfavorable conditions.
Point cloud registration has been studied for decades. ICP is a widely used method with two iterative optimization steps: correspondence search and rigid transformation estimation. Nevertheless, ICP is sensitive to initialization and easily trapped in local optima. Many methods [9, 10, 11, 12, 13] have been proposed to tackle these issues at the cost of increased computational complexity. On the other hand, with the rapid development of deep learning techniques, researchers have revisited rigid point cloud registration with many novel designs. Using ground-truth correspondences or category labels, recent studies [14, 15, 16, 17, 18] solve the problem in a supervised manner. These methods effectively obtain informative features of point clouds and achieve superior performance, but they require ground-truth correspondences of high quality and quantity. However, exact point-to-point correspondences may not exist in real-world scenarios in the presence of noise and outliers. Realizing registration from unlabeled point cloud data without ground-truth correspondences is a significant yet rarely explored challenge in the literature. To the best of our knowledge, only two works [19, 20] exist so far. PCRNet alleviates the pose misalignment observed in PointNetLK by replacing the Lucas-Kanade module with multi-layer perceptrons; it directly recovers transformation parameters from the concatenated global descriptors of the source and target point clouds. The other significant work, FMR-Net, employs an autoencoder framework with PointNetLK as its backbone and achieves registration by minimizing a feature-metric projection error.
Although these approaches showcase the sharper edge of unsupervised learning, they mainly depend on global representations and neglect local representations. As a branch of unsupervised learning, self-supervised learning has achieved remarkable performance on visual tasks [21, 22, 23] in recent years. Self-supervised learning frameworks usually design pretext tasks and train convolutional neural networks (CNNs) to solve them, e.g., predicting image rotation or classifying positive and negative pairs. These pretext tasks enable CNNs to extract high-level semantic representations. Inspired by this, we aim to encode high-level local representations that are beneficial for registration through a pretext task. Since a rigid transformation transforms different local regions consistently and preserves the structural information, we design a pretext task that enables the feature extraction network to learn informative features from local geometries. To this end, we propose a local consistency loss that exploits the intrinsic structural consistency among local geometries under rigid transformations. Accordingly, we propose a novel self-supervised scheme for rigid point cloud registration with joint global and local representations (as shown in Fig. 1), namely Deep Versatile Descriptors (DVDs).
Furthermore, existing unsupervised point cloud registration techniques align two point clouds by relying on the difference between two global representations. However, they ignore a vital requirement: obtaining global representations with high transformation awareness, i.e., the difference between the global representations of two point clouds should be large when the clouds are not well aligned. To tackle this problem, we introduce two additional tasks, self-reconstruction and normal estimation. Our main contributions are summarized as follows:
- We revisit the registration problem with the novel DVDs, which jointly extract high-level global and local representations in a self-supervised manner, requiring neither labeled data nor arduous correspondence search.
- We introduce two additional tasks, namely self-reconstruction and normal estimation, to enhance the transformation awareness of the DVDs.
- The proposed local consistency loss can improve the performance of both correspondence-based and correspondence-free registration methods.
- Extensive experimental results demonstrate that DVDs achieve state-of-the-art performance on both synthetic and real-world datasets.
This article is organized as follows. Section II introduces related work. Section III presents the problem statement with DVDs. Section IV provides the detailed training procedures for the proposed DVDs. Section V demonstrates the superiority of DVDs through extensive experiments. Section VI concludes the article.
II Related Work
Point cloud registration is a vital research field due to various 3D perception applications. Here we broadly categorize the related work into traditional, supervised, and unsupervised registration methods.
II-A Traditional Registration
Iterative Closest Point (ICP) is the best-known method for rigid point cloud registration; it iteratively computes the closest points as correspondences and estimates the optimal transformation from those correspondences. Subsequent works have been proposed to improve the robustness of ICP [12, 10, 11]. In another branch, Gaussian Mixture Model (GMM) and Hierarchical Gaussian Mixture Registration (HGMR) reformulate the registration problem as probability distribution matching. These probabilistic methods still require good initializations to avoid being trapped in local optima, because their objective functions are non-convex. Unlike local registration methods, global registration methods do not require good initializations. Along this line, Go-ICP, GOGMA, GOSMA, and GoTS find globally optimal registration by the branch-and-bound (BnB) method at the expense of increased computational complexity. In another line, PFH and FPFH construct handcrafted feature descriptors and estimate potential correspondences; robust optimization approaches such as semidefinite programming and RANSAC can then be utilized to estimate exact correspondences. Fast Global Registration (FGR) accelerates the optimization process with a graduated non-convexity strategy. A fast rigid point cloud registration method without SVD or eigendecomposition has also been proposed, and a comprehensive evaluation of point cloud registration methods has recently been conducted. However, the need for good initializations, noisy correspondences, and time constraints remain challenges for traditional registration techniques.
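To make the classical baseline concrete, ICP alternates nearest-neighbor correspondence search with a closed-form rigid update (the Kabsch/Procrustes solution via SVD). The following is a minimal NumPy sketch, not the implementation of any cited method; the brute-force nearest-neighbor search and the fixed iteration count are simplifying assumptions:

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Closed-form least-squares rotation/translation (Kabsch/Procrustes)."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                   # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(src, dst, iters=50):
    """Vanilla ICP: closest-point correspondences + rigid update per iteration."""
    R_acc, t_acc = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        # correspondence search: closest target point for each source point
        d = np.linalg.norm(cur[:, None, :] - dst[None, :, :], axis=-1)
        matched = dst[d.argmin(axis=1)]
        R, t = best_rigid_transform(cur, matched)
        cur = cur @ R.T + t
        R_acc, t_acc = R @ R_acc, R @ t_acc + t   # compose with previous estimate
    return R_acc, t_acc
```

As the surrounding text notes, this local scheme only converges when the initial pose is close enough for the closest-point matches to be mostly correct.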
II-B Supervised Registration
The rapid development of deep learning paves the way for new perspectives on point cloud registration. PointNetLK utilizes the PointNet module to obtain global descriptors and aligns two point clouds iteratively by minimizing the distance between the two learned descriptors with a modified Lucas-Kanade (LK) algorithm. Deep Closest Point introduces a solution from another perspective: it extracts per-point features and uses a transformer followed by a differentiable SVD module to recover the transformation. PRNet uses a keypoint detector to establish keypoint correspondences and solve partial-to-partial point cloud registration in a self-supervised way. RPMNet applies Sinkhorn normalization to estimate soft correspondences. IDAM computes pairwise correspondences with an iterative distance-aware similarity convolution module. DeepGMR learns point-to-GMM correspondences by integrating GMMs. In addition, many representative techniques [40, 41, 42] have been presented to deal with large-scale registration. While these works achieve state-of-the-art performance, they rely on supervision; in contrast, this work offers a new way of achieving unsupervised registration without human annotations.
II-C Unsupervised Registration
In contrast to the supervised and correspondence-based point cloud registration methods, there are two prior attempts at unsupervised point cloud registration. PCRNet improves PointNetLK by replacing the LK module with multi-layer perceptrons. FMR-Net adapts PointNetLK to a semi-supervised manner by jointly optimizing the feature extraction and transformation estimation processes. However, both achieve registration by relying heavily on global representations, paying no attention to the useful representations offered by local geometries. Meanwhile, as a subclass of unsupervised learning, self-supervised learning has achieved remarkable performance on visual tasks without expensive labels [23, 43]. Self-supervised models can be categorized into adversarial, generative, and contrastive approaches. Inspired by this, we propose a local consistency loss that forces the feature extraction network to learn useful representations from local geometries in a self-supervised manner. In this way, we resolve the issue with the proposed novel DVDs.
III Problem Statement with DVDs
In this section, we present the formulation of the rigid registration problem with the proposed DVDs. We denote X and Y as the source and the target point cloud, respectively. The objective of registration is to estimate the rigid transformation T = (R, t):

    min_{R, t} d(R X + t, Y),   (1)

where R ∈ SO(3) represents a rotation matrix, t ∈ R^3 denotes a translation vector, and d(·,·) is a distance between the transformed source and the target. Solving the problem in Eq. (1) based on the raw 3D coordinates of point clouds is not robust, because raw point clouds usually contain noise and outliers. To address the issue, PointNetLK utilizes PointNet to extract K-dimensional vectors as global representations of point clouds. In this way, the registration problem can be described as follows:

    min_{R, t} || φ(R X + t) − φ(Y) ||^2,   (2)
where φ denotes the feature extraction function. Previous studies [20, 14] primarily solved Eq. (2) by adapting PointNet as the feature extraction function, resulting in a global descriptor. However, they fail to utilize the informative features of local geometries. Inspired by the success of self-supervised learning [45, 46], we model the registration problem jointly based on global and local representations in a self-supervised manner.
We revisit point cloud registration motivated by the observation that the source point cloud X is transformed into the target point cloud Y by a rigid transformation operator T. In other words, all local geometries of the point cloud are transformed consistently by the same mapping function T. This property inspires us to exploit the high-level relation among different local geometries. Given two subsets X_1, X_2 ⊂ X and the corresponding transformed versions T(X_1), T(X_2), the related high-level representations can be denoted by φ(X_1), φ(X_2), φ(T(X_1)), and φ(T(X_2)), respectively, where φ represents a feature extraction function. In this way, the feature change between the original X_i and the transformed T(X_i) is described as

    Δ_i = g(φ(X_i), φ(T(X_i))),   (3)

where g is a metric function for modeling the differences of features. We employ a Fully Connected (FC) layer as the metric function g, such that

    Δ_i = FC([φ(X_i); φ(T(X_i))]),   (4)

where [·;·] is the concatenation operation. In Eq. (4), the high-dimensional features φ(X_i) and φ(T(X_i)) are combined by concatenation, and the FC layer performs feature extraction while preserving the change of each feature.¹ The goal of g is to abstract the high-level relation between a local region and its transformed version.

¹The employment of the FC layer is motivated by the network design in PointNet and by FC network-based intra prediction for image coding.

Since all local geometries of point clouds are transformed consistently by the same mapping function T, the feature change between different regions should be consistent, i.e., Δ_1 and Δ_2 shall be consistent, as shown in Fig. 2. In this way, we have Δ_1 ∼ P and Δ_2 ∼ P, where P denotes a distribution. Based on the above analysis, the registration problem in Eq. (2) can be reformulated in a self-supervised manner as

    min_{R, t} || φ(R X + t) − φ(Y) ||^2   s.t.   Δ_1 ∼ P,  Δ_2 ∼ P.   (5)
Here, we deal with the point cloud registration problem with a joint design of global and local descriptors, which appear respectively in the objective and the constraint of Eq. (5).
However, the constraint in Eq. (5) is non-convex. It requires Δ_1 and Δ_2 to be sampled from the same distribution, which is extremely hard to achieve for a network function. Alternatively, we tackle the problem by minimizing a distance metric between their distributions:

    min_{R, t, φ} || φ(R X + t) − φ(Y) ||^2 + D(Δ_1, Δ_2),   (6)

where D(·,·) measures the distance between the two feature-change distributions.
In this way, we need to train a feature extraction function φ and estimate a transformation operator T. Given that jointly optimizing the function φ and the transformation operator T in Eq. (6) is difficult and time-consuming, we employ a simplified training strategy that optimizes φ and T in an alternating manner: in each iteration, the transformation operator T is optimized while the function φ is frozen, and vice versa. Specifically, we introduce primary and auxiliary tasks to better optimize the function φ and the transformation operator T, respectively.
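The alternating strategy above can be sketched as follows; `estimate_transform` and `update_features` are hypothetical stand-ins for the transformation-estimation step (feature extractor frozen) and the feature-learning step (transformation frozen), not names used by the paper:

```python
def alternating_optimization(phi, estimate_transform, update_features,
                             src, tgt, rounds=10):
    """Alternating scheme: per round, (1) freeze the feature extractor
    `phi` and re-estimate the transform T, then (2) freeze T and take a
    feature-learning step. All callables are hypothetical stand-ins."""
    T = None
    for _ in range(rounds):
        T = estimate_transform(phi, src, tgt)    # phi frozen
        phi = update_features(phi, src, tgt, T)  # T frozen
    return phi, T
```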
IV Robust Point Registration with DVDs
The core of rigid-body registration without correspondences is to extract discriminative, transformation-aware features from two point clouds and recover the related transformation parameters. We solve the problem by leveraging the proposed versatile descriptors with two auxiliary tasks. The overall framework is illustrated in Fig. 3. In short, we first embed the point clouds in a high-dimensional space and then perform optimization with one primary task and two auxiliary tasks (self-reconstruction and normal estimation). The primary task estimates the optimal transformation parameters while considering both global and local descriptors in a self-supervised way. The auxiliary tasks aim to improve the rotation awareness of the feature extraction function.
The overall objective to be minimized in the proposed method can be formalized as

    L = L_primary + λ_1 L_rec + λ_2 L_ne,   (7)

where the trade-off parameters λ_1 and λ_2 are positive constants that balance the effect of the different tasks. The optimization of Eq. (7) takes the source and the target point cloud as inputs without requiring the ground-truth transformation, allowing us to learn unsupervised representations for robust point cloud registration.
IV-A Primary Task: Learning with Local Consistency
The primary-task loss combines a global alignment term with a local consistency term:

    L_primary = || φ(R X + t) − φ(Y) ||^2 + D(Δ_1, Δ_2).   (8)

Candidate choices for the distance metric D(·,·) include the Kullback-Leibler divergence and the Euclidean distance. We empirically employ the symmetric Kullback-Leibler divergence, which shows stable and superior performance against other distance metrics.
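As a reference, a symmetric Kullback-Leibler divergence between two feature-change vectors can be computed as below. Softmax normalization (so the vectors can be read as distributions) and the `eps` smoothing constant are implementation assumptions, since the text only names the metric:

```python
import numpy as np

def symmetric_kl(p_logits, q_logits, eps=1e-8):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two
    feature-change vectors. Softmax normalization is an assumption;
    the paper only states that symmetric KL is the chosen metric."""
    def softmax(x):
        e = np.exp(x - x.max())    # shift for numerical stability
        return e / e.sum()
    p = softmax(p_logits) + eps
    q = softmax(q_logits) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```

By construction the value is symmetric in its two arguments and zero when they coincide, which is what the consistency constraint asks for.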
The first term in Eq. (8) is referred to as the global descriptor term, which learns global representations from both X and Y. The estimation of the transformation parameters is given in the following. Given a frozen function φ, estimating the transformation is still time-consuming. Therefore, we use the computationally efficient Inverse Compositional (IC) method to iteratively calculate the transformation increment as

    Δξ = J^+ [ φ(Y) − φ(T_k(X)) ],   (9)

where J is the Jacobian matrix of the global representation differences between the source and target point clouds and J^+ is its pseudo-inverse. Unlike regular grid images, computing the Jacobian matrix of irregular point clouds requires taking gradients in the x, y, and z directions. Following the IC-LK algorithm, the warp Jacobian can be approximated by finite differences,

    J_i ≈ [ φ(G(−ξ_i) · Y) − φ(Y) ] / t_i,   (10)

where the twist parameters ξ denote three Euler angles and three translation parameters, and G(−ξ_i) is the inverted warp with a small perturbation t_i in the i-th twist component. In this way, the transformation parameters are updated after the k-th iteration by composing the increment with the current estimate.
To enhance the representation ability of the global descriptor, we adopt the idea of self-supervised learning, which encodes high-level semantic representations through a pretext learning task. In particular, we extract useful local representations by exploiting the high-level relation among different local geometries via the second term in Eq. (8), which can be considered a pretext learning task. In this way, registration is solved using joint global and local representations. The design of the self-supervised learning is based on the key observation that the distinctive local geometric structures captured by two subsets of points can be employed to enhance the representation ability of the feature extraction module. In the following, we elaborate on the selection of the local distinctive subset points and the intuition behind it.
The selection of subset points. We employ two representative points related to the barycenter b of the source point cloud X: the farthest point from the barycenter,

    p_f = argmax_{x ∈ X} || x − b ||,

and the closest point to the barycenter,

    p_c = argmin_{x ∈ X} || x − b ||.
There may be isolated points far from the barycenter in real scenes, so we apply outlier rejection: a point is rejected if its distance from the barycenter exceeds a threshold. We then take the point p_f and search for its nearest neighbors based on Euclidean distance, forming a local set X_1. Similarly, the subset X_2 is formed around the point p_c. The effect of the local size on registration performance is discussed in the ablation studies.
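The selection procedure above can be sketched as follows; `k` defaults to 64 (the best-performing local size in the ablation studies), while `outlier_thresh` is a hypothetical parameter standing in for the paper's outlier-rejection threshold:

```python
import numpy as np

def select_local_subsets(points, k=64, outlier_thresh=None):
    """Pick the farthest and closest points from the barycenter and
    gather their k nearest neighbors as the two local subsets.
    `outlier_thresh` (max distance from barycenter) is a hypothetical
    stand-in for the paper's outlier-rejection threshold."""
    bary = points.mean(axis=0)
    d_bary = np.linalg.norm(points - bary, axis=1)
    valid = np.ones(len(points), dtype=bool)
    if outlier_thresh is not None:
        valid &= d_bary <= outlier_thresh      # reject isolated points
    idx_valid = np.flatnonzero(valid)
    p_far = points[idx_valid[d_bary[idx_valid].argmax()]]
    p_near = points[idx_valid[d_bary[idx_valid].argmin()]]

    def knn(center):
        d = np.linalg.norm(points - center, axis=1)
        return points[np.argsort(d)[:k]]       # includes the center itself
    return knn(p_far), knn(p_near)
```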
The intuition behind this selection strategy is twofold: (1) we want to capture the distinctive geometric structures of the point cloud through two subsets of points to enhance the representation ability of the feature extraction module; from the perspective of geometry, the farthest and the closest points are located in distinctive regions [50, 51]; and (2) we need the two subsets to be independent, without overlap, and since they are far away from each other, they are more likely to be independent.
IV-B Auxiliary Task I: Self-reconstruction
Since discovering features sensitive to rotation and translation from unlabeled source and target point clouds is quite challenging, a single feature extraction module may not yield sufficiently rich representations. If the DVDs possess rich transformation awareness, they offer proper supervision for local consistency, creating a virtuous circle for the optimization. On the contrary, lacking transformation awareness, the learning process may be trapped in local minima. To avoid this issue, we introduce two auxiliary tasks that enhance the rotation awareness of the DVDs.
The feature extraction module aims to learn a transformation-aware descriptor that reflects the effects of a rigid-body transformation, enabling the modified inverse compositional algorithm to recover the transformation parameters. We choose the vanilla PointNet as the feature extraction module, which is composed of a Multi-Layer Perceptron (MLP) and a max-pooling operation.
We employ a folding-based decoder as the reconstruction module in the self-reconstruction task. The Chamfer Distance is adopted to define the reconstruction error, measuring the difference between an original point cloud P and its reconstructed version P̂:

    CD(P, P̂) = (1/|P|) Σ_{p ∈ P} min_{q ∈ P̂} || p − q ||^2 + (1/|P̂|) Σ_{q ∈ P̂} min_{p ∈ P} || q − p ||^2.   (11)

This work takes both the source and the target point clouds for reconstruction, i.e., P ∈ {X, Y}. As a result, we have

    L_rec = CD(X, X̂) + CD(Y, Ŷ).   (12)
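A direct NumPy sketch of the symmetric Chamfer Distance between two clouds; the brute-force pairwise distance matrix is fine for small clouds, while a KD-tree would normally replace it at scale:

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer Distance between two point clouds (N x 3, M x 3):
    mean squared distance from each point to its nearest neighbor in the
    other cloud, summed over both directions."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1) ** 2
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Note that the value is zero whenever the two sets contain the same points, regardless of ordering, which is exactly the blindness to rigid misalignment discussed next.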
The Chamfer Distance can be computed efficiently. However, it is blind to certain visual differences; Fig. 4 demonstrates such an example. Two point clouds with a large relative rotation may still yield a small Chamfer Distance, resulting in weak rotation awareness. Therefore, we consider another auxiliary task for further performance improvement.
IV-C Auxiliary Task II: Normal Estimation
Normal estimation is an essential step in many research areas of point clouds, e.g., rendering and surface reconstruction. Rather than pursuing estimation precision, we aim to improve the rotation awareness of the DVDs through the normal estimation task. Specifically, we use a lightweight MLP to generate the estimated normals from the concatenated coordinates and global features. The estimation error is measured by the cosine loss function as

    L_ne = (1/N) Σ_{i=1}^{N} ( 1 − cos⟨ n̂_i, n_i ⟩ ),   (13)

where n̂_i and n_i denote the estimated and ground-truth normals of the i-th point.
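A minimal sketch of a cosine loss over per-point normals follows. The plain 1 − cos form is an assumption; a common variant takes the absolute cosine to ignore the sign ambiguity of normals, and the paper does not specify which is used:

```python
import numpy as np

def normal_cosine_loss(pred, gt, eps=1e-8):
    """Mean cosine loss 1 - cos(angle) between predicted and ground-truth
    normals (N x 3 each). The plain form (no absolute value on the
    cosine) is an assumption; sign-invariant variants also exist."""
    pred = pred / (np.linalg.norm(pred, axis=1, keepdims=True) + eps)
    gt = gt / (np.linalg.norm(gt, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred * gt, axis=1)))
```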
IV-D Intuitive Explanation
Here, we provide an intuitive explanation of why the rotation awareness of the feature extraction function is important for registering two point clouds. Fig. 5 visualizes the feature differences between the transformed source point cloud and the target point cloud during the iterative optimization process. Specifically, we compare the registration process under a feature extraction function with high rotation awareness and one with low rotation awareness, reshaping the feature differences into a square matrix for better visualization. Fig. 5 shows that, with the high rotation-awareness function, the feature difference decreases and the registration becomes more accurate during optimization. With the low rotation-awareness function, the feature difference also decreases, but the registration is trapped in a local minimum. In particular, the rightmost column of Fig. 5 shows that the extracted global features of the two point clouds are almost identical even though the clouds are not aligned. Therefore, the rotation awareness of the feature extraction function is crucial for identifying whether two point clouds are aligned. In the following section, we demonstrate the effectiveness of the proposed method with various experiments.
V Experimental Results

In this section, we design several experimental settings to demonstrate the generalization ability and superiority of DVDs. First, we evaluate DVDs on synthetic object-centric point cloud datasets. Then, we conduct a cross-dataset evaluation: DVDs are trained on a synthetic object-centric dataset and tested on a real-world dataset to assess generalization. Furthermore, we evaluate DVDs on large-scale real-world 3D scenes. Since real-world industrial scenarios usually produce noisy and partially visible point clouds, robustness evaluations are conducted under these conditions. Meanwhile, the efficiency of DVDs shows its superiority in industrial applications. Finally, we demonstrate that the proposed local consistency loss can improve several point cloud registration techniques, indicating its generality.
Experimental Details. We conduct experiments on the ModelNet40, ScanObjectNN, and 3DMatch benchmarks. ModelNet40 is a synthetic dataset consisting of 12,311 CAD models. We randomly sample 1,024 points from each CAD model twice, with different random seeds, and name the sampled point clouds the source point cloud and the target point cloud, respectively. We rescale point clouds into a unit sphere. Following the experimental settings in the literature, all models are trained on the train set of the first 20 categories in ModelNet40. During training, Euler angles are uniformly sampled in the range [0, ] and translations in the range for each axis. We transform the source point cloud with the sampled rigid transformation. The source and target point clouds are fed into the network, which aims to register them. The maximum number of iterations is set to 10. We train the model for a total of 200 epochs. All experiments are run on a single NVIDIA Titan Xp.
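The training-pair construction above can be sketched as follows. The numeric angle and translation bounds did not survive extraction here, so `max_angle` and `max_trans` are hypothetical placeholder defaults, not the paper's settings:

```python
import numpy as np

def random_rigid_transform(rng, max_angle=np.pi / 4, max_trans=0.5):
    """Sample a random rigid transform: one Euler angle in [0, max_angle]
    per axis and translations in [-max_trans, max_trans]. The bounds are
    placeholders for the paper's (unrecovered) numeric ranges."""
    ax, ay, az = rng.uniform(0, max_angle, size=3)
    def rot(a, i, j):
        # elementary rotation in the (i, j) coordinate plane
        R = np.eye(3)
        R[i, i] = R[j, j] = np.cos(a)
        R[i, j], R[j, i] = -np.sin(a), np.sin(a)
        return R
    R = rot(az, 0, 1) @ rot(ay, 0, 2) @ rot(ax, 1, 2)
    t = rng.uniform(-max_trans, max_trans, size=3)
    return R, t
```

Applying `src @ R.T + t` to a sampled cloud then yields the target of each training pair.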
V-A Evaluation on Unseen Objects
In the first experiment, we compare our method with the following algorithms: ICP, PointNetLK, PCRNet, FMR-Net, DCP, FGR, and RANSAC. All methods are tested on the test set of the remaining 20 categories, and qualitative results are shown in Fig. 6. The differences between predicted values and ground truth are measured by the root mean squared error (RMSE). A registration is counted as successful if the rotation and translation errors are both smaller than predefined thresholds, and performance is evaluated by the recall, i.e., the percentage of successfully registered point cloud pairs. Fig. 7 shows that ICP performs worse than the other competing algorithms, since it is a local registration method and is vulnerable to the initial position. PointNetLK uses a pre-trained classification network as the feature extraction module and obtains better performance than ICP. PCRNet and FMR-Net, which jointly optimize the feature extraction and transformation estimation processes, also improve over ICP. The classical methods FGR and RANSAC are competitive with our method. Overall, the proposed method attains more accurate registration than the compared ones, demonstrating the effectiveness of the proposed DVDs.
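The recall metric used throughout the evaluation reduces to a simple thresholded count; a minimal sketch over per-pair rotation and translation errors:

```python
import numpy as np

def registration_recall(rot_errs, trans_errs, rot_thresh, trans_thresh):
    """Recall = fraction of test pairs whose rotation AND translation
    errors both fall under the given thresholds."""
    rot_errs = np.asarray(rot_errs)
    trans_errs = np.asarray(trans_errs)
    ok = (rot_errs < rot_thresh) & (trans_errs < trans_thresh)
    return float(ok.mean())
```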
V-B Evaluation on Unseen Categories
In this subsection, we evaluate the generalization of the proposed model to unseen categories. To this end, all methods are trained on the train set of the first 20 categories and tested on the test set of the remaining 20 categories in ModelNet40. The differences between predicted values and ground truth are measured by RMSE, and performance is evaluated by the recall. Fig. 8 shows a performance drop for DCP and PCRNet compared to the unseen-objects setting. In contrast, our method generalizes to unseen categories and is more robust than the other deep learning-based approaches. Moreover, the unsupervised FMR-Net generalizes better than the supervised PointNetLK, since it does not depend on category labels. The traditional registration methods, including ICP, FGR, and RANSAC, are less affected.
V-C Cross Dataset Evaluation
It is also crucial to test the generalization ability of the models under cross-dataset evaluation. We train models on the train set of the first 20 categories in ModelNet40 and test on the test set of ScanObjectNN, a recently published real-world dataset containing 2,902 objects from 15 categories. Specifically, we set the success criteria to a rotation error of less than and a translation error of less than 0.01. The results are reported in Fig. 9. Compared to the results on the synthetic dataset, the performance of DCP and PCRNet degrades in the cross-dataset setting, while the classical methods FGR and RANSAC remain stable. The proposed method still outperforms the compared deep learning-based methods, demonstrating the effectiveness of the customized designs in the proposed method and its applicability across synthetic and real-world datasets.
V-D Evaluation on Real-World 3D Scenes
7Scene is a real-world indoor 3D scene registration benchmark. Following DGR, we generate 3D scan pairs with more than 70% overlap for training. During training, Euler angles are uniformly sampled in the range [0, ] and translations in the range [-0.5, 0.5]. In addition to PointNetLK, PCRNet, FMR-Net, DCP, FGR, and RANSAC, we also compare with DGR and FCGF, which are state-of-the-art methods on large-scale point cloud scenes. Two versions of RANSAC are compared, with 5,000 and 10,000 iterations, respectively. We down-sample 10k points for training DCP, FMR-Net, PCRNet, PointNetLK, and the proposed method. As DGR requires a pre-trained FCGF model, we use all points for training DGR. Qualitative results of the proposed method are shown in Fig. 10. The proposed method is competitive with DGR, as shown in Fig. 11; for rotation errors larger than , our method slightly outperforms DGR. We tried to train PCRNet and PointNetLK, but they did not converge. Compared with DGR, the proposed method and FMR-Net use a globally pooled descriptor to represent the entire point cloud scene. However, it is challenging to represent complex scenes with globally pooled descriptors; therefore, our method does not achieve highly accurate results in large-scale scenes.
V-E Robustness Evaluation
In this section, we evaluate the models in the presence of outliers and partial visibility. The success criterion is defined as a rotation error under and a translation error under 0.01. For the outlier scenario, noise is sampled from a Gaussian distribution and added to each point of the source and target point clouds. All models are trained on clean data without noise augmentation. Although FGR and RANSAC are competitive with our method on clean data, they are sensitive to noisy data, as shown in Fig. 12 (a). Adding noise to the source and target point clouds breaks the exact point-to-point correspondences; therefore, the correspondence-based DCP shows limited performance. In contrast, correspondence-free learning-based approaches, such as FMR-Net, PointNetLK, and the proposed method, are more robust to noise.
We also conduct experiments on partially visible data, since point clouds acquired from the real world are often only partially visible due to occlusions. To simulate this condition, we generate partial source and target point clouds independently from random camera poses following the settings in . Note that all methods are trained on noise-free and fully visible data from the first 20 categories in ModelNet40. Fig. 12 (b) shows that the proposed unsupervised model outperforms all the others under partial visibility thanks to utilizing both global and local descriptors. Therefore, it is promising to extend our model to real-world scenarios.
V-F Computational Efficiency
In this subsection, we evaluate the efficiency of the proposed method. The experiment is conducted on the test set of the last 20 categories of ModelNet40. Rigid transformations are randomly sampled with Euler angles in the range [0, ] and translations in the range . To mimic real-world scenarios, noise sampled from a Gaussian distribution is added to each point of the source and target point clouds. We perform this experiment on a 2.10 GHz Intel E5-2620 and an NVIDIA Titan Xp; ICP, FGR, and RANSAC are executed on the CPU. The average running times of all compared methods are shown in Table I. When processing large-scale point clouds (e.g., 2,000 points), the proposed method is faster than RANSAC, FGR, FMR-Net, and PointNetLK, whilst it is slower than ICP, DCP, and PCRNet.
V-G Generality of Local Consistency Loss
We use the proposed local consistency loss to improve the performance of several classical point cloud registration models, including DCP , PRNet , IDAM , and DeepGMR . All models are trained on the first 20 categories and tested on the rest 20 categories in ModelNet40. Table II indicates that being equipped with local consistency loss enables both the correspondence-based and correspondence-free registration methods to estimate more accurate transformation parameters. Therefore, the proposed local consistency loss is a generic method complementary to the existing point cloud registration techniques.
| Method | Rotation RMSE | Rotation MAE | Translation RMSE | Translation MAE |
| LC + DCP | 3.950 | 2.298 | 0.020 | 0.010 |
| LC + PRNet | 1.293 | 0.783 | 0.014 | 0.010 |
| LC + IDAM | 1.527 | 0.933 | 0.016 | 0.010 |
| LC + DeepGMR | 3.192 | 1.247 | 0.004 | 0.001 |
| Method | Unsupervised | Transformation awareness | Local descriptors | Computational complexity | Performance |
V-H Ablation Studies
To examine the effectiveness of each component in Eq. (7), we conduct detailed ablation studies on ModelNet40. The first results are shown in Fig. 13 (a). Model A denotes the baseline, trained with the self-reconstruction loss only. The baseline equipped with the auxiliary normal estimation task (model B) improves over model A, which convincingly verifies the effectiveness of the normal estimation task. Our full model C incorporates all the tasks and outperforms the others, demonstrating that every individual task contributes to the superior performance. Moreover, we study the effect of the size of the local geometries, i.e., the cardinalities of X_1 and X_2. As shown in Fig. 13 (b), the model achieves the best performance with 64 points, compared with a larger (96 points) or smaller (32 points) local size.
The advantages over previous methods are summarized in Table III, which we compile from the context and the experimental results. From Table III, we find that PCRNet, FMR-Net, and the proposed method are unsupervised, whereas DCP and PointNetLK are supervised methods requiring human annotations or labels. Besides, none of the compared methods considers local descriptors or the transformation awareness of the feature extraction network. Furthermore, as demonstrated by the experimental results on synthetic and real-world datasets, DVDs obtain the best registration performance among all compared methods with moderate computational complexity.
This paper proposed DVDs, a novel unsupervised representation for robust point cloud registration that takes advantage of high-dimensional features jointly obtained from global and local geometries. The transformation awareness of the DVDs is further enhanced by two additional tasks (reconstruction and normal estimation). Numerical experiments on both synthetic and real datasets revealed several critical features of the proposed registration algorithm: 1) regardless of the dataset, it consistently achieves the best performance among the competing methods; 2) it is robust in various scenarios, such as unseen categories, severe noise, partial visibility, and even cross-dataset evaluation; 3) it consists of one primary task and two additional tasks, all of which contribute to accurate registration, indicating the superiority of DVDs, which encourage joint representations of discriminative features; 4) the proposed local consistency loss improves the performance of both correspondence-based and correspondence-free registration methods.
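For reference, the rigid transformation estimation that registration methods ultimately perform has a classical closed form: given corresponding point pairs, the Kabsch/SVD solution recovers the optimal rotation and translation. The sketch below shows this standard textbook procedure (not the paper's learned pipeline); the function name is an illustrative choice:

```python
import numpy as np

def best_rigid_transform(src, tgt):
    """Kabsch/SVD closed form for the R, t minimizing sum ||R src_i + t - tgt_i||^2,
    assuming the i-th rows of src and tgt correspond."""
    src_c, tgt_c = src.mean(axis=0), tgt.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (tgt - tgt_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ src_c
    return R, t
```

Learned methods such as those compared above typically replace the hard correspondence assumption with soft, feature-based matching, but many still finish with exactly this closed-form step.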
-  S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski, “Building rome in a day,” Communications of the ACM, vol. 54, no. 10, pp. 105–112, 2011.
-  Y. Guo, M. Bennamoun, F. Sohel, M. Lu, and J. Wan, “An integrated framework for 3-d modeling, object detection, and pose estimation from point-clouds,” IEEE Transactions on Instrumentation and Measurement, vol. 64, no. 3, pp. 683–693, 2014.
-  J. Peng, W. Xu, B. Liang, and A.-G. Wu, “Virtual stereovision pose measurement of noncooperative space targets for a dual-arm space robot,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 1, pp. 76–88, 2019.
-  J. M. Wong, V. Kee, T. Le, S. Wagner, G.-L. Mariottini, A. Schneider, L. Hamilton, R. Chipalkatty, M. Hebert, D. M. Johnson et al., “Segicp: Integrated deep semantic segmentation and pose estimation,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 5784–5789.
-  Y. Cui, Y. An, W. Sun, H. Hu, and X. Song, “Lightweight attention module for deep learning on classification and segmentation of 3-d point clouds,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–12, 2020.
-  D. Liu, C. Chen, C. Xu, Q. Cai, L. Chu, F. Wen, and R. Qiu, “A robust and reliable point cloud recognition network under rigid transformation,” IEEE Transactions on Instrumentation and Measurement, 2022.
-  M. A. Audette, F. P. Ferrie, and T. M. Peters, “An algorithmic overview of surface registration techniques for medical imaging,” Medical Image Analysis, vol. 4, no. 3, pp. 201–217, 2000.
-  P. J. Besl and N. D. McKay, “Method for registration of 3-d shapes,” in Sensor fusion IV: control paradigms and data structures, vol. 1611. International Society for Optics and Photonics, 1992, pp. 586–606.
-  A. W. Fitzgibbon, “Robust registration of 2d and 3d point sets,” Image and vision computing, vol. 21, no. 13-14, pp. 1145–1153, 2003.
-  S. Bouaziz, A. Tagliasacchi, and M. Pauly, “Sparse iterative closest point,” in Computer graphics forum, vol. 32, no. 5. Wiley Online Library, 2013, pp. 113–123.
-  D. Chetverikov, D. Svirko, D. Stepanov, and P. Krsek, “The trimmed iterative closest point algorithm,” in Object recognition supported by user interaction for service robots, vol. 3. IEEE, 2002, pp. 545–548.
-  A. Segal, D. Haehnel, and S. Thrun, “Generalized-icp.” in Robotics: science and systems, vol. 2, no. 4. Seattle, WA, 2009, p. 435.
-  J. Yang, H. Li, D. Campbell, and Y. Jia, “Go-icp: A globally optimal solution to 3d icp point-set registration,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 38, no. 11, pp. 2241–2254, 2015.
-  Y. Aoki, H. Goforth, R. A. Srivatsan, and S. Lucey, “Pointnetlk: Robust & efficient point cloud registration using pointnet,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
-  Y. Wang and J. M. Solomon, “Deep closest point: Learning representations for point cloud registration,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 3523–3532.
-  Z. J. Yew and G. H. Lee, “Rpm-net: Robust point matching using learned features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11824–11833.
-  J. Li, C. Zhang, Z. Xu, H. Zhou, and C. Zhang, “Iterative distance-aware similarity matrix convolution with mutual-supervised point elimination for efficient point cloud registration,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16. Springer, 2020, pp. 378–394.
-  W. Yuan, B. Eckart, K. Kim, V. Jampani, D. Fox, and J. Kautz, “Deepgmr: Learning latent gaussian mixture models for registration,” in European Conference on Computer Vision. Springer, 2020, pp. 733–750.
-  V. Sarode, X. Li, H. Goforth, Y. Aoki, R. A. Srivatsan, S. Lucey, and H. Choset, “Pcrnet: point cloud registration network using pointnet encoding,” arXiv preprint arXiv:1908.07906, 2019.
-  X. Huang, G. Mei, and J. Zhang, “Feature-metric registration: A fast semi-supervised approach for robust point cloud registration without correspondences,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11366–11374.
-  N. Komodakis and S. Gidaris, “Unsupervised representation learning by predicting image rotations,” in International Conference on Learning Representations (ICLR), 2018.
-  X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15750–15758.
-  J. S. L. Senanayaka, H. Van Khang, and K. G. Robbersmyr, “Toward self-supervised feature learning for online diagnosis of multiple faults in electric powertrains,” IEEE Transactions on Industrial Informatics, vol. 17, no. 6, pp. 3772–3781, 2020.
-  B. Jian and B. C. Vemuri, “Robust point set registration using gaussian mixture models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1633–1645, 2010.
-  B. Eckart, K. Kim, and J. Kautz, “Hgmr: Hierarchical gaussian mixtures for adaptive 3d registration,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 705–721.
-  D. Campbell and L. Petersson, “Gogma: Globally-optimal gaussian mixture alignment,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5685–5694.
-  D. Campbell, L. Petersson, L. Kneip, H. Li, and S. Gould, “The alignment of the spheres: Globally-optimal spherical mixture alignment for camera pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11796–11806.
-  Y. Liu, C. Wang, Z. Song, and M. Wang, “Efficient global point cloud registration by matching rotation invariant features through translation search,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 448–463.
-  R. B. Rusu, N. Blodow, Z. C. Marton, and M. Beetz, “Aligning point cloud views using persistent feature histograms,” in 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2008, pp. 3384–3391.
-  R. B. Rusu, N. Blodow, and M. Beetz, “Fast point feature histograms (fpfh) for 3d registration,” in 2009 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2009, pp. 3212–3217.
-  H. Yang, J. Shi, and L. Carlone, “Teaser: Fast and certifiable point cloud registration,” IEEE Transactions on Robotics, vol. 37, no. 2, pp. 314–333, 2020.
-  M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
-  Q.-Y. Zhou, J. Park, and V. Koltun, “Fast global registration,” in European Conference on Computer Vision. Springer, 2016, pp. 766–782.
-  J. Wu, “Rigid 3-d registration: A simple method free of svd and eigendecomposition,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 10, pp. 8288–8303, 2020.
-  B. Zhao, X. Chen, X. Le, J. Xi, and Z. Jia, “A comprehensive performance evaluation of 3-d transformation estimation techniques in point cloud registration,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–14, 2021.
-  C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 652–660.
-  S. Baker and I. Matthews, “Lucas-kanade 20 years on: A unifying framework,” International Journal of Computer Vision, vol. 56, no. 3, pp. 221–255, 2004.
-  Y. Wang and J. M. Solomon, “Prnet: Self-supervised learning for partial-to-partial registration,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 8814–8826.
-  R. Sinkhorn, “A relationship between arbitrary positive matrices and doubly stochastic matrices,” The annals of mathematical statistics, vol. 35, no. 2, pp. 876–879, 1964.
-  H. Deng, T. Birdal, and S. Ilic, “3d local features for direct pairwise registration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3244–3253.
-  C. Choy, J. Park, and V. Koltun, “Fully convolutional geometric features,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8958–8966.
-  C. Choy, W. Dong, and V. Koltun, “Deep global registration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2514–2523.
-  I. Achituve, H. Maron, and G. Chechik, “Self-supervised learning for domain adaptation on point clouds,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 123–133.
-  X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive,” IEEE Transactions on Knowledge and Data Engineering, 2021.
-  L. Jing and Y. Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
-  Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” in International Conference on Learning Representations, 2019.
-  J. Li, B. Li, J. Xu, R. Xiong, and W. Gao, “Fully connected network-based intra prediction for image coding,” IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3236–3247, 2018.
-  A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur, “Optimal kernel choice for large-scale two-sample tests,” in Advances in Neural Information Processing systems (NeurIPS), 2012, pp. 1205–1213.
-  A. Kolesnikov, X. Zhai, and L. Beyer, “Revisiting self-supervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1920–1929.
-  Y. Eldar, M. Lindenbaum, M. Porat, and Y. Y. Zeevi, “The farthest point strategy for progressive image sampling,” IEEE Transactions on Image Processing, vol. 6, no. 9, pp. 1305–1315, 1997.
-  M. I. Shamos and D. Hoey, “Closest-point problems,” in 16th Annual Symposium on Foundations of Computer Science (sfcs 1975). IEEE, 1975, pp. 151–162.
-  Y. Yang, C. Feng, Y. Shen, and D. Tian, “Foldingnet: Point cloud auto-encoder via deep grid deformation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 206–215.
-  H. Fan, H. Su, and L. Guibas, “A point set generation network for 3d object reconstruction from a single image,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2463–2471.
-  P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas, “Learning representations and generative models for 3d point clouds,” in International Conference on Machine Learning (ICML). PMLR, 2018, pp. 40–49.
-  Y. Liu, B. Fan, S. Xiang, and C. Pan, “Relation-shape convolutional neural network for point cloud analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8895–8904.
-  Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1912–1920.
-  M. A. Uy, Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung, “Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 1588–1597.
-  A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser, “3dmatch: Learning local geometric descriptors from rgb-d reconstructions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1802–1811.
-  H. Xu, S. Liu, G. Wang, G. Liu, and B. Zeng, “Omnet: Learning overlapping mask for partial-to-partial point cloud registration,” arXiv preprint arXiv:2103.00937, 2021.
-  J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2930–2937.