Self-supervised Point Cloud Registration with Deep Versatile Descriptors

01/25/2022
by Dongrui Liu, et al.

Recent years have witnessed an increasing trend toward solving point cloud registration problems with various deep learning-based algorithms. Compared to supervised/semi-supervised registration methods, unsupervised methods require no human annotations. However, unsupervised methods mainly depend on global descriptors, which ignore the high-level representations of local geometries. In this paper, we propose a self-supervised registration scheme with novel Deep Versatile Descriptors (DVD), jointly considering global and local representations. The DVD is motivated by a key observation that the distinctive local geometric structures of the point cloud, captured by two subsets of points, can be employed to enhance the representation ability of the feature extraction module. Furthermore, we utilize two additional tasks (reconstruction and normal estimation) to enhance the transformation awareness of the proposed DVDs. Lastly, we conduct extensive experiments on synthetic and real-world datasets, demonstrating that our method achieves state-of-the-art performance against competing methods over a wide range of experimental settings.

I Introduction

Point clouds are one of the key industrial measurements and are often aligned with 3D models. This process is known as point cloud registration, a critical task with various applications in many fields, including large-scale reconstruction [1], motion estimation [2, 3], object recognition and localization [4, 5, 6], and medical imaging [7]. It aims to align one point cloud (the source) to another (the target) by finding the optimal rigid transformation. Unlike regular images, point clouds have complex geometry and inevitably contain noise due to the inherent imperfections of LiDAR. As a result, registration should be conducted robustly under unfavorable conditions.

Point cloud registration has been studied for decades. ICP [8] is a widely used method with two iterative optimization steps: correspondence search and rigid transformation estimation. Nevertheless, ICP is sensitive to initialization and easily trapped in local optima. Many methods [9, 10, 11, 12, 13] have been proposed to tackle these issues at the cost of increased computational complexity. On the other hand, with the rapid development of deep learning techniques, researchers have revisited rigid point cloud registration with many novel designs. With ground-truth correspondences or category labels, recent studies [14, 15, 16, 17, 18] tried to solve the problem in a supervised manner. These methods effectively extract informative features from point clouds and achieve superior performance, but they require ground-truth correspondences of high quality and quantity. However, exact point-to-point correspondences may not exist in real-world scenarios in the presence of noise and outliers. Realizing registration from unlabeled point cloud data without ground-truth correspondences is a significant yet rarely explored challenge in the literature. To the best of our knowledge, there exist only two such works [19, 20] so far. PCRNet [19] alleviates the pose misalignment observed in PointNetLK [14] by replacing the Lucas-Kanade module with multi-layer perceptrons; it directly recovers transformation parameters from the concatenated global descriptors of the source and target point clouds. The other significant work, FMR-Net [20], employs an autoencoder framework with PointNetLK as its backbone and achieves registration by minimizing the feature-metric projection error.

Fig. 1: Illustration of our self-supervised scheme. Given a source point cloud and a target point cloud, which is rigidly transformed from the source, we use these two co-occurring inputs to train a feature extraction network that learns global and local descriptors in a self-supervised manner. Based on these descriptors, the registration problem can be solved accurately without human annotations or labels.

Although these approaches demonstrate the promise of unsupervised learning, they mainly depend on global representations and neglect local representations. As a branch of unsupervised learning, self-supervised learning has achieved remarkable performance on visual tasks [21, 22, 23] in recent years. Self-supervised frameworks usually design pretext tasks and train convolutional neural networks (CNNs) to solve them, e.g., predicting image rotations [21] or classifying positive and negative pairs [22]. These pretext tasks enable CNNs to extract high-level semantic representations. Inspired by this, we aim to encode high-level local representations that are beneficial for registration through a pretext task. Since a rigid transformation transforms different local regions consistently and preserves structural information, we design a pretext task that enables the feature extraction network to learn informative features from local geometries. To this end, we propose a local consistency loss that exploits the intrinsic structural consistency among local geometries under rigid transformations. Accordingly, we propose a novel self-supervised scheme for rigid point cloud registration with joint global and local representations (as shown in Fig. 1), namely Deep Versatile Descriptors (DVDs).

Furthermore, existing unsupervised point cloud registration techniques align two point clouds by relying on the difference between their global representations. However, they ignore a vital requirement: global representations should have high transformation awareness, meaning that the difference between the global representations of two point clouds is large when the point clouds are not well aligned. To tackle this problem, we introduce two auxiliary tasks, self-reconstruction and normal estimation. Our main contributions are summarized as follows:

  • We revisit the registration problem with the novel DVDs, which jointly extract high-level global and local representations in a self-supervised manner, requiring neither labeled data nor laborious correspondence search.

  • We use two auxiliary tasks, namely self-reconstruction and normal estimation, to enhance the transformation awareness of the DVDs.

  • The proposed local consistency loss improves the performance of both correspondence-based and correspondence-free registration methods.

  • Extensive experimental results demonstrate that DVDs achieve state-of-the-art performance on both synthetic and real-world datasets.

This article is organized as follows. Section II introduces related work. Section III formulates the problem with DVDs. Section IV provides the detailed training procedure for the proposed DVDs. Section V demonstrates the superiority of DVDs with experimental results. Section VI concludes the paper.

II Related Work

Point cloud registration is a vital research field due to various 3D perception applications. Here we broadly categorize the related work into traditional, supervised, and unsupervised registration methods.

II-A Traditional Registration

Iterative Closest Point (ICP) [8] is a well-known method for rigid point cloud registration, which iteratively computes the closest points as correspondences and estimates the optimal transformation from those correspondences. Subsequent works have been proposed to improve the robustness of ICP [12, 10, 11]. In another branch, Gaussian Mixture Model (GMM) [24] and Hierarchical Gaussian Mixture Registration (HGMR) [25] reformulate the registration problem as probability distribution matching. These probabilistic methods still require good initializations to avoid being trapped in local optima, because their objective functions are non-convex. Unlike local registration methods, global registration methods do not require good initializations. Along this line, Go-ICP [13], GOGMA [26], GOSMA [27], and GoTS [28] find the globally optimal registration with the branch-and-bound (BnB) method at the expense of increased computational complexity. In another line, PFH [29] and FPFH [30] construct handcrafted feature descriptors and estimate potential correspondences; robust optimization approaches such as semidefinite programming [31] and RANSAC [32] can then be utilized to estimate exact correspondences. Fast Global Registration (FGR) [33] accelerates the optimization process with a graduated non-convexity strategy. A fast rigid point cloud registration method free of SVD and eigendecomposition has also been proposed [34]. Recently, a comprehensive evaluation of point cloud registration methods has been conducted [35]. However, the need for good initializations, noisy correspondences, and time constraints remain challenging for traditional registration techniques.

II-B Supervised Registration

The rapid development of deep learning paves the way for a new perspective on point cloud registration [36]. PointNetLK [14] utilizes the PointNet module to obtain global descriptors and aligns two point clouds iteratively by minimizing the distance between the two learned descriptors with a modified Lucas-Kanade (LK) algorithm [37]. Deep Closest Point [15] introduces a solution from another perspective: it extracts per-point features and uses a transformer followed by a differentiable SVD module to recover the transformation. PRNet [38] uses a keypoint detector to establish keypoint correspondences and solves partial-to-partial point cloud registration in a self-supervised way. RPMNet [16] employs Sinkhorn normalization [39] to estimate soft correspondences. IDAM [17] computes pairwise correspondences with an iterative distance-aware similarity convolution module. DeepGMR [18] learns point-to-GMM correspondences by integrating GMMs. In addition, many representative techniques [40, 41, 42] have been presented to deal with large-scale registration. While these works achieve state-of-the-art performance, our work offers a new way of achieving unsupervised registration without human annotations.

II-C Unsupervised Registration

In contrast to supervised and correspondence-based point cloud registration methods, there are two prior attempts at unsupervised point cloud registration. PCRNet [19] improves PointNetLK by replacing the LK module with multi-layer perceptrons. FMR-Net [20] adapts PointNetLK to a semi-supervised setting by jointly optimizing the feature extraction and transformation estimation processes. However, both rely heavily on global representations and pay no attention to the useful representations offered by local geometries. Meanwhile, as a subclass of unsupervised learning, self-supervised learning has achieved remarkable performance on visual tasks without expensive labels [23, 43]. Self-supervised models can be categorized into adversarial, generative, and contrastive [44]. Inspired by this, we propose a local consistency loss to force the feature extraction network to learn useful representations from local geometries in a self-supervised manner, and thereby resolve the above issue with the proposed DVDs.

III Problem Statement with DVDs

In this section, we present the formulation of the rigid registration problem with the proposed DVDs. We denote $\mathbf{X}$ and $\mathbf{Y}$ as the source and the target point cloud, respectively. The objective of registration is to estimate the rigid transformation as [13]:

$$\min_{\mathbf{R} \in SO(3),\, \mathbf{t} \in \mathbb{R}^3} \sum_{\mathbf{x}_i \in \mathbf{X}} \min_{\mathbf{y}_j \in \mathbf{Y}} \big\| \mathbf{R}\mathbf{x}_i + \mathbf{t} - \mathbf{y}_j \big\|_2^2, \tag{1}$$

where $\mathbf{R}$ represents a rotation matrix and $\mathbf{t}$ denotes a translation vector. Solving the problem in Eq. (1) based on raw 3D coordinates of point clouds is not robust because raw point clouds usually contain noise and outliers. To address this issue, PointNetLK [14] utilizes PointNet [36] to extract $K$-dimensional vectors as global representations of point clouds. In this way, the registration problem can be described as follows:

$$\min_{T} \big\| F(T(\mathbf{X})) - F(\mathbf{Y}) \big\|_2, \tag{2}$$

where $F(\cdot)$ denotes the feature extraction function and $T$ denotes the rigid transformation. Previous studies [20, 14] primarily solved Eq. (2) by adopting PointNet as the feature extraction function, resulting in a global descriptor. However, they fail to utilize the informative features from local geometries. Inspired by the success of self-supervised learning [45, 46], we model the registration problem jointly based on global and local representations in a self-supervised manner.

We revisit point cloud registration motivated by the observation that the source point cloud $\mathbf{X}$ is transformed into the target point cloud $\mathbf{Y}$ by a rigid transformation operator $T$. In other words, all local geometries of the point cloud are transformed consistently with the same mapping function $T$. This property inspires us to exploit the high-level relation among different local geometries. Given two subsets $\mathcal{S}_1, \mathcal{S}_2 \subset \mathbf{X}$ and their transformed versions $T(\mathcal{S}_1), T(\mathcal{S}_2)$, the related high-level representations can be denoted by $F(\mathcal{S}_1)$, $F(\mathcal{S}_2)$, $F(T(\mathcal{S}_1))$, and $F(T(\mathcal{S}_2))$, respectively, where $F(\cdot)$ represents a feature extraction function. In this way, the feature change between an original subset and its transformed version is described as

$$\mathbf{d}_i = g\big(F(\mathcal{S}_i), F(T(\mathcal{S}_i))\big), \quad i = 1, 2, \tag{3}$$

where $g(\cdot, \cdot)$ is a metric function for modeling differences between features. We employ a Fully Connected (FC) layer as the metric function $g$, such that

$$\mathbf{d}_i = \mathrm{FC}\big(\big[F(\mathcal{S}_i) \,\|\, F(T(\mathcal{S}_i))\big]\big), \tag{4}$$

where $[\cdot \,\|\, \cdot]$ is the concatenation operation. In Eq. (4), the high-dimensional features $F(\mathcal{S}_i)$ and $F(T(\mathcal{S}_i))$ are combined by the FC layer, whose goal is to perform feature extraction while maintaining the change of each feature. (The employment of the FC layer is motivated by the network design in PointNet [36] and by FC network-based intra prediction for image coding [47].) The goal of $g$ is to abstract the high-level relation between a local region and its transformed version.
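As an illustration, a minimal sketch of the FC-based metric in Eq. (4) might look as follows; the feature and output dimensions are placeholder assumptions, not values taken from the paper:

```python
import torch
import torch.nn as nn


class FeatureChange(nn.Module):
    """Sketch of Eqs. (3)-(4): an FC layer over the concatenated descriptors of a
    local region and its rigidly transformed copy. Dimensions are assumptions."""

    def __init__(self, feat_dim=1024, out_dim=256):
        super().__init__()
        self.fc = nn.Linear(2 * feat_dim, out_dim)

    def forward(self, feat_region, feat_region_transformed):
        # feat_*: (B, feat_dim) descriptors of the subset and of its transformed version
        return self.fc(torch.cat([feat_region, feat_region_transformed], dim=-1))
```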

Since all local geometries of the point cloud are transformed consistently with the same mapping function $T$, the feature change between different regions should be consistent, i.e., $\mathbf{d}_1$ and $\mathbf{d}_2$ shall be consistent, as shown in Fig. 2. In this way, we have $\mathbf{d}_1 \sim \mathcal{P}$ and $\mathbf{d}_2 \sim \mathcal{P}$, where $\mathcal{P}$ denotes a distribution. Based on the above analysis, the registration problem in Eq. (2) can be reformulated in a self-supervised manner as

$$\min_{T,\, F} \big\| F(T(\mathbf{X})) - F(\mathbf{Y}) \big\|_2, \quad \text{s.t.} \ \ \mathbf{d}_1 \sim \mathcal{P}, \ \mathbf{d}_2 \sim \mathcal{P}. \tag{5}$$

Here, we address the point cloud registration problem with a joint design of global and local descriptors, shown respectively in the objective and the constraint of Eq. (5).

However, the constraint in Eq. (5) is non-convex. It requires $\mathbf{d}_1$ and $\mathbf{d}_2$ to be sampled from the same distribution, which is extremely hard to enforce for a network functional. Alternatively, we tackle the problem by minimizing a distance metric between their distributions as

$$\min_{T,\, F} \big\| F(T(\mathbf{X})) - F(\mathbf{Y}) \big\|_2 + D\big(\mathbf{d}_1, \mathbf{d}_2\big), \tag{6}$$

where $D(\cdot, \cdot)$ measures the discrepancy between the distributions of $\mathbf{d}_1$ and $\mathbf{d}_2$. In this way, we need to train a feature extraction function $F$ and estimate a transformation operator $T$. Given that jointly optimizing $F$ and $T$ in Eq. (6) is difficult and time-consuming, we employ a simplified training strategy that optimizes $F$ and $T$ in an alternating manner: in each iteration, the transformation operator $T$ is optimized while the function $F$ is frozen, and vice versa. Specifically, we introduce primary and auxiliary tasks to better optimize the function $F$ and the transformation operator $T$, respectively.

Fig. 2: The illustration of our main idea. We propose to model the relation of different local geometries under a rigid transformation as various local parts of point clouds are transformed consistently. Given two local regions, as shown in the red and purple circles, changes of transformed and original descriptors should be consistent.
Fig. 3: The overall architecture of our proposed unsupervised point cloud registration method. The feature extraction module learns the global descriptors of the source and target point clouds. A novel local consistency loss is proposed to model the relation of local geometries under transformation via local descriptors. We further introduce two auxiliary tasks to enhance the transformation awareness of the global and local descriptors. The difference between the two extracted global features is used to recover the transformation increment.

IV Robust Point Registration with DVDs

The core of rigid-body registration without correspondences is to extract discriminative and transformation-aware features from two point clouds and recover the related transformation parameters. We solve the problem by leveraging the proposed versatile descriptors with two auxiliary tasks. The overall framework is illustrated in Fig. 3. In short, we first embed the point clouds in a high-dimensional space and then perform optimization with one primary task and two auxiliary tasks (self-reconstruction and normal estimation). The primary task estimates the optimal transformation parameters while considering both global and local descriptors in a self-supervised way. The auxiliary tasks aim to improve the rotation awareness of the feature extraction function.

The overall objective to be minimized in the proposed method can be formalized as

$$\mathcal{L} = \mathcal{L}_{\mathrm{primary}} + \lambda_1 \mathcal{L}_{\mathrm{rec}} + \lambda_2 \mathcal{L}_{\mathrm{normal}}, \tag{7}$$

where the trade-off parameters $\lambda_1$ and $\lambda_2$ are positive constants that balance the effect of the different tasks. The optimization of Eq. (7) takes the source and the target point cloud as inputs without requiring the ground-truth transformation, allowing us to learn unsupervised representations for robust point cloud registration.
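For clarity, the overall objective in Eq. (7) simply combines the three loss terms through the trade-off weights; a minimal sketch (the function name and default weight values are illustrative assumptions, not the paper's settings):

```python
def dvd_objective(primary_loss, recon_loss, normal_loss, lam1=1.0, lam2=1.0):
    """Eq. (7): the primary loss plus two weighted auxiliary losses.
    The default values of lam1 and lam2 are placeholders."""
    return primary_loss + lam1 * recon_loss + lam2 * normal_loss
```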

IV-A Primary Task: Learning with Local Consistency

This subsection elaborates on the proposed scheme. Based on the problem formulation in Section III and Eq. (6), we tackle the unsupervised registration problem as

$$\mathcal{L}_{\mathrm{primary}} = \big\| F(T(\mathbf{X})) - F(\mathbf{Y}) \big\|_2 + D\big(\mathbf{d}_1, \mathbf{d}_2\big). \tag{8}$$

In Eq. (8), the distribution alignment term $D$ could also be measured by other distance metrics, such as Maximum Mean Discrepancy [48] or the Euclidean distance. We empirically employ the symmetric Kullback-Leibler divergence, which shows stable and superior performance compared with the other distance metrics.
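A minimal sketch of this local consistency term with the symmetric Kullback-Leibler divergence is given below; converting each feature-change vector into a probability distribution with a softmax is our assumption, since the normalization step is not spelled out here:

```python
import torch
import torch.nn.functional as F


def local_consistency_loss(d1, d2, eps=1e-8):
    """Symmetric KL divergence between the feature changes of two local regions.

    d1, d2: (B, C) feature-change vectors from Eq. (4). Each vector is turned into
    a probability distribution with a softmax before computing the divergence."""
    p = F.softmax(d1, dim=-1).clamp_min(eps)
    q = F.softmax(d2, dim=-1).clamp_min(eps)
    kl_pq = (p * (p / q).log()).sum(dim=-1)
    kl_qp = (q * (q / p).log()).sum(dim=-1)
    return 0.5 * (kl_pq + kl_qp).mean()
```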

The first term in Eq. (8) corresponds to the global descriptor, which learns global representations from both $\mathbf{X}$ and $\mathbf{Y}$. The estimation of the transformation parameters is given in the following. Even with a frozen function $F$, estimating the transformation is still time-consuming. Therefore, we use the computationally efficient Inverse Compositional (IC) method [37] to iteratively calculate the transformation increment as

$$\Delta\boldsymbol{\xi} = \mathbf{J}^{+} \big( F(\mathbf{Y}) - F(T \cdot \mathbf{X}) \big), \tag{9}$$

where $\mathbf{J}^{+}$ is the pseudoinverse of the Jacobian matrix $\mathbf{J}$ relating changes of the global representation to changes of the transformation parameters. Unlike regular grid images, computing the Jacobian matrix of irregular point clouds requires taking gradients in the $x$, $y$, and $z$ directions. Following [14], the warp Jacobian in the IC-LK algorithm can be approximated by finite differences as

$$\mathbf{J}_i \approx \frac{F\big(\exp(-t_i \hat{\boldsymbol{\xi}}_i) \cdot \mathbf{Y}\big) - F(\mathbf{Y})}{t_i}, \tag{10}$$

where the twist parameters $\boldsymbol{\xi} \in \mathbb{R}^6$ denote three Euler angles and three translation parameters, $t_i$ is a small perturbation of the $i$-th parameter, and $\exp(-t_i \hat{\boldsymbol{\xi}}_i)$ is the inverted warp. In this way, the transformation parameters are updated as $T \leftarrow \Delta T \cdot T$ after each iteration.
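The following sketch illustrates an IC-style update in the spirit of Eqs. (9)-(10) with a finite-difference Jacobian; `extract_feat` and `apply_warp` are assumed helper functions (global feature extraction and pose warping), and the additive pose composition is a simplification of the actual SE(3) composition:

```python
import torch


def ic_lk_register(src, tgt, extract_feat, apply_warp,
                   num_iters=10, delta=1e-2, tol=1e-7):
    """Sketch of the inverse compositional update of Eqs. (9)-(10).

    extract_feat(points) -> (K,) global descriptor; apply_warp(xi, points) -> warped
    points, where xi holds 6 pose parameters. Both are assumed helpers."""
    xi = torch.zeros(6)                       # accumulated pose parameters
    feat_tgt = extract_feat(tgt)

    # Finite-difference Jacobian of the target feature w.r.t. the inverted warp (Eq. 10),
    # computed once before the iterations, as in the IC formulation.
    jac = torch.stack([
        (extract_feat(apply_warp(-delta * torch.eye(6)[i], tgt)) - feat_tgt) / delta
        for i in range(6)
    ], dim=1)                                 # (K, 6)
    jac_pinv = torch.linalg.pinv(jac)         # (6, K)

    for _ in range(num_iters):
        feat_src = extract_feat(apply_warp(xi, src))
        dxi = jac_pinv @ (feat_tgt - feat_src)    # Eq. (9): pose increment
        xi = xi + dxi                             # simplified additive composition
        if dxi.norm() < tol:
            break
    return xi
```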

To enhance the representation ability of the global descriptor, we adopt the idea of self-supervised learning, which encodes high-level semantic representations through a pretext learning task [49]. Specifically, we aim to extract useful local representations by exploiting the high-level relation among different local geometries based on the second term in Eq. (8), which can be considered a pretext learning task. In this way, we solve registration with joint global and local representations. The design of this self-supervised task is based on the key observation that the distinctive local geometric structures of the point cloud, captured by two subsets of points, can be employed to enhance the representation ability of the feature extraction module.

In the following, we elaborate on the selection of the local subsets and the intuition behind it.

The selection of subset points. We employ two representative points related to the barycenter $\bar{\mathbf{x}}$ of the source point cloud [6]: the farthest point from the barycenter,

$$\mathbf{p}_f = \arg\max_{\mathbf{x}_i \in \mathbf{X}} \left\| \mathbf{x}_i - \bar{\mathbf{x}} \right\|_2, \tag{11}$$

and the closest point to the barycenter,

$$\mathbf{p}_c = \arg\min_{\mathbf{x}_i \in \mathbf{X}} \left\| \mathbf{x}_i - \bar{\mathbf{x}} \right\|_2. \tag{12}$$

There may be isolated points far from the barycenter in real scenes, so we apply outlier rejection following [31]; for example, points whose distance to the barycenter exceeds a threshold are rejected. We then take the point $\mathbf{p}_f$ and search for its nearest neighbors based on Euclidean distance, forming the local set $\mathcal{S}_1$; similarly, the subset $\mathcal{S}_2$ is formed around the point $\mathbf{p}_c$. The effect of the local size on registration performance is discussed in the ablation studies.

The intuition behind this selection strategy is twofold: (1) we want to capture the distinctive geometric structures of the point cloud through the two subsets to enhance the representation ability of the feature extraction module; from a geometric perspective, the farthest and the closest points are located in distinctive regions [50, 51]; (2) we need the two subsets to be independent without overlap, and the farther they are from each other, the more likely they are to be independent.
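A minimal sketch of this subset selection (barycenter, farthest/closest points, k-nearest neighbors) is shown below; the function name, the optional outlier threshold, and the default local size are illustrative assumptions:

```python
import torch


def select_local_subsets(points, k=64, outlier_thresh=None):
    """Sketch of the subset selection in Eqs. (11)-(12).

    points: (N, 3) tensor. Returns one k-NN neighborhood around the point farthest
    from the barycenter and one around the point closest to it."""
    barycenter = points.mean(dim=0, keepdim=True)         # (1, 3)
    dist_to_center = (points - barycenter).norm(dim=1)    # (N,)

    if outlier_thresh is not None:                         # optional outlier rejection
        keep = dist_to_center <= outlier_thresh
        points, dist_to_center = points[keep], dist_to_center[keep]

    p_far = points[dist_to_center.argmax()]                # Eq. (11)
    p_close = points[dist_to_center.argmin()]              # Eq. (12)

    def knn(query):
        d = (points - query).norm(dim=1)
        return points[d.topk(k, largest=False).indices]    # k nearest neighbors

    return knn(p_far), knn(p_close)                        # local subsets S1 and S2
```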

IV-B Auxiliary Task I: Self-reconstruction

Since discovering features sensitive to rotations and translations from unlabeled source and target point clouds is quite challenging, a single feature extraction module may not yield sufficient representations. If the DVDs possess rich transformation awareness, they provide proper supervision for the local consistency, creating a virtuous circle in the optimization. On the contrary, the learning process may be trapped in local minima if the extracted representations lack transformation awareness. To avoid this issue, we introduce two auxiliary tasks to enhance the rotation awareness of the DVDs.

The feature extraction module aims to learn a transformation-aware descriptor, which needs to reflect the effects of the rigid-body transformation, enabling the modified inverse compositional algorithm to recover the transformation parameters. We choose the vanilla PointNet [36] as the feature extraction module $F$, which is composed of a Multi-Layer Perceptron (MLP) and a max-pooling operation.

Fig. 4: The weak rotation awareness of Chamfer Distance. The figure shows Chamfer Distances between two point sets under rotation transformations. The angular measurement is in radians.
Input: Source point cloud $\mathbf{X}$ and target point cloud $\mathbf{Y}$; the number of epochs $E$ and the number of iterations $K$;
Output: Rigid transformation $T$
1 Initialize the weights and biases of $F$
2 while the training has not reached $E$ epochs do
3       Feature extraction $F(\mathbf{X})$ and $F(\mathbf{Y})$
4       Update network parameters using the primary loss, the auxiliary self-reconstruction loss, and the auxiliary normal estimation loss based on Eq. (8), Eq. (14), and Eq. (15), respectively
5       Freeze the parameters of $F$
6       Calculate the Jacobian $\mathbf{J}$ and its pseudoinverse $\mathbf{J}^{+}$
7       while the iteration count is below $K$ and the increment has not converged do
8             Compute the increment $\Delta\boldsymbol{\xi}$ by Eq. (9)
9             Update $T$ based on $\Delta\boldsymbol{\xi}$
10            Transform the source point cloud with the updated $T$
11      end
12 end
Algorithm 1 Proposed Registration Algorithm
Fig. 5: The optimization process of registration with a high rotation-awareness feature extraction function (top row) and a low rotation-awareness feature extraction function (bottom row). Red and blue represent the transformed source point cloud and the target point cloud, respectively.

We employ a folding-based [52] decoder as the reconstruction module in the self-reconstruction task. Besides, the Chamfer Distance [53] is adopted to define the reconstruction error, measuring the difference between an original point cloud $\mathbf{P}$ and its reconstructed version $\hat{\mathbf{P}}$:

$$\mathcal{L}_{\mathrm{CD}}(\mathbf{P}, \hat{\mathbf{P}}) = \frac{1}{|\mathbf{P}|} \sum_{\mathbf{p} \in \mathbf{P}} \min_{\hat{\mathbf{p}} \in \hat{\mathbf{P}}} \big\| \mathbf{p} - \hat{\mathbf{p}} \big\|_2^2 + \frac{1}{|\hat{\mathbf{P}}|} \sum_{\hat{\mathbf{p}} \in \hat{\mathbf{P}}} \min_{\mathbf{p} \in \mathbf{P}} \big\| \hat{\mathbf{p}} - \mathbf{p} \big\|_2^2. \tag{13}$$

This work takes both the source and the target point clouds for reconstruction. As a result, we have

$$\mathcal{L}_{\mathrm{rec}} = \mathcal{L}_{\mathrm{CD}}(\mathbf{X}, \hat{\mathbf{X}}) + \mathcal{L}_{\mathrm{CD}}(\mathbf{Y}, \hat{\mathbf{Y}}). \tag{14}$$

The Chamfer Distance can be computed efficiently. However, it is blind to certain visual deficiencies [54]. Fig. 4 demonstrates such an example: two point clouds with large relative rotation angles may still yield a small Chamfer Distance, resulting in weak rotation awareness. Therefore, we consider another auxiliary task for further performance improvement.
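A short sketch of the Chamfer Distance of Eq. (13) between two point sets (using squared nearest-neighbor distances, as in [53]) could read:

```python
import torch


def chamfer_distance(p, q):
    """Chamfer Distance between point sets p: (N, 3) and q: (M, 3),
    i.e., the symmetric mean of squared nearest-neighbor distances (Eq. 13)."""
    d2 = torch.cdist(p, q).pow(2)              # (N, M) squared pairwise distances
    return d2.min(dim=1).values.mean() + d2.min(dim=0).values.mean()
```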

IV-C Auxiliary Task II: Normal Estimation

Normal estimation is an essential step in many research areas of point clouds [55], e.g., rendering and surface reconstruction. Rather than pursuing estimation precision, we aim to improve the rotation awareness of the DVDs through the normal estimation task. Specifically, we use a lightweight MLP to generate the estimated normals from the concatenated per-point coordinates and global features. The estimation error is measured by the cosine loss function as

$$\mathcal{L}_{\mathrm{normal}} = \frac{1}{N} \sum_{i=1}^{N} \big( 1 - \cos\langle \hat{\mathbf{n}}_i, \mathbf{n}_i \rangle \big), \tag{15}$$

where $\hat{\mathbf{n}}_i$ and $\mathbf{n}_i$ denote the estimated and the ground-truth normal of the $i$-th point, respectively. Combining the results in Eq. (8), Eq. (14), and Eq. (15), we finally obtain the overall objective shown in Eq. (7), where the trade-off parameters $\lambda_1$ and $\lambda_2$ are fixed positive constants. The detailed implementation of DVDs is illustrated in Algorithm 1.
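A minimal sketch of the cosine loss in Eq. (15) is given below; whether the sign ambiguity of normals is handled (e.g., with an absolute value) is left open here, as it is not specified above:

```python
import torch
import torch.nn.functional as F


def normal_cosine_loss(pred_normals, gt_normals):
    """Cosine loss of Eq. (15) between predicted and ground-truth normals, both (N, 3)."""
    cos = F.cosine_similarity(pred_normals, gt_normals, dim=-1)
    return (1.0 - cos).mean()
```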

IV-D Intuitive Explanation

Here, we provide an intuitive explanation of why the rotation awareness of the feature extraction function is important for registering two point clouds. Fig. 5 visualizes the feature differences between the transformed source point cloud and the target point cloud during the iterative optimization process. Specifically, we compare the registration process with a high rotation-awareness feature extraction function and with a low rotation-awareness one. We reshape the feature differences into a square matrix for better visualization. Fig. 5 shows that, with the high rotation-awareness feature extraction function, the feature difference decreases and the registration becomes more accurate during the optimization process. With the low rotation-awareness feature extraction function, the feature difference also decreases, but the registration is trapped in a local minimum. In particular, the rightmost column of Fig. 5 demonstrates that the extracted global features of the two point clouds are almost the same even though the point clouds are not aligned. Therefore, the rotation awareness of the feature extraction function is important for identifying whether two point clouds are aligned. In the following section, we demonstrate the effectiveness of the proposed method with various experiments.

V Experiments

In this section, we design several experimental settings to demonstrate the generalization and superiority of DVDs. Firstly, we evaluate DVDs on synthetic object-centric point cloud datasets. Then, we conduct a cross-dataset evaluation, i.e., DVDs are trained on the synthetic object-centric dataset and tested on a real-world dataset to examine their generalization ability. Furthermore, we evaluate DVDs on large-scale real-world 3D scenes. Since real-world industrial scenarios usually yield noisy and partially visible point clouds, robustness evaluations are conducted under these conditions. Meanwhile, an efficiency comparison shows the suitability of DVDs for industrial applications. Finally, we demonstrate that the proposed local consistency loss improves several point cloud registration techniques, indicating its generality.

Experimental Details. We conduct experiments on the ModelNet40 [56], ScanObjectNN [57], and 3DMatch [58] benchmarks. ModelNet40 is a synthetic dataset consisting of 12,311 CAD models. We randomly sample 1024 points from each CAD model twice with different random seeds and take the sampled point clouds as the source point cloud and the target point cloud, respectively [59]. We rescale point clouds into a unit sphere. Following the experimental settings in [14], all models are trained on the train set of the first 20 categories in ModelNet40. During training, Euler angles are uniformly sampled within a fixed range starting at 0, and translations within a fixed range, for each axis [15]. We transform the source point cloud by the sampled rigid transformation. The source and target point clouds are fed into the network, which aims to register them. The maximum number of iterations is set to 10. We train the model for a total of 200 epochs. All experiments are implemented on a single NVIDIA Titan Xp.
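To make the pair-generation step concrete, a sketch of how a training pair could be produced from one sampled point cloud is given below; the rotation and translation ranges are placeholders, since the exact values are not reproduced here:

```python
import numpy as np
from scipy.spatial.transform import Rotation


def random_rigid_pair(points, max_angle_deg=45.0, max_trans=0.5, seed=None):
    """Create a (source, target) pair by applying a random rigid transformation.

    points: (N, 3) array. Euler angles and the translation are sampled uniformly
    per axis; the ranges here are illustrative only."""
    rng = np.random.default_rng(seed)
    euler = rng.uniform(0.0, max_angle_deg, size=3)             # degrees per axis
    R = Rotation.from_euler("zyx", euler, degrees=True).as_matrix()
    t = rng.uniform(-max_trans, max_trans, size=3)
    target = points @ R.T + t                                    # rigidly transformed copy
    return points, target, (R, t)
```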

Fig. 6: Qualitative results of the proposed method on ModelNet and ScanObjectNN. The source, target, and transformed point clouds are green, blue, and red, respectively. The top row shows results from ModelNet, and the bottom row shows results from ScanObjectNN.
Fig. 7: Evaluation on unseen objects.

V-A Evaluation on Unseen Objects

In the first experiment, we compare our method with the following algorithms: ICP [8], PointNetLK [14], PCRNet [19], FMR-Net [20], DCP [15], FGR [33], and RANSAC [32]. All methods are tested on the held-out test set of the same first 20 categories, i.e., on unseen objects from seen categories. Qualitative results of these algorithms are shown in Fig. 6. The differences between predicted values and ground truth are measured by the root mean squared error (RMSE). A registration is considered successful if the rotation and translation errors are smaller than predefined thresholds [42]. Performance is evaluated by the recall, i.e., the percentage of successfully registered point cloud pairs. Fig. 7 shows that ICP performs worse than the other competing algorithms since it is a local registration method and is sensitive to the initial position. PointNetLK uses a pre-trained classification network as the feature extraction module and obtains better performance than ICP. PCRNet and FMR-Net, which jointly optimize the feature extraction and transformation estimation processes, also improve over ICP. The classical methods FGR and RANSAC are competitive with our method. Overall, the proposed method attains more accurate registration than the other compared methods, demonstrating the effectiveness of the proposed DVDs.
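As a sketch, the recall can be computed from per-pair rotation and translation errors as below; using the relative rotation angle as the rotation error and the listed threshold values are assumptions, since different works define these thresholds differently:

```python
import numpy as np


def registration_recall(pred_R, gt_R, pred_t, gt_t,
                        rot_thresh_deg=5.0, trans_thresh=0.01):
    """Fraction of pairs whose rotation and translation errors fall below thresholds.

    pred_R/gt_R: lists of (3, 3) rotation matrices; pred_t/gt_t: lists of (3,) vectors.
    The threshold values are placeholders."""
    successes = []
    for R_p, R_g, t_p, t_g in zip(pred_R, gt_R, pred_t, gt_t):
        R_rel = R_p.T @ R_g                                     # relative rotation
        angle = np.degrees(np.arccos(np.clip((np.trace(R_rel) - 1) / 2, -1.0, 1.0)))
        t_err = np.linalg.norm(t_p - t_g)
        successes.append(angle < rot_thresh_deg and t_err < trans_thresh)
    return float(np.mean(successes))
```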

Fig. 8: Evaluation on unseen categories.
Fig. 9: Cross-dataset evaluation on the real-world ScanObjectNN.

V-B Evaluation on Unseen Categories

In this subsection, we evaluate the generalization of the proposed model to unseen categories. To this end, all methods are trained on the train set of the first 20 categories and tested on the test set of the remaining 20 categories in ModelNet40. The differences between predicted values and ground truth are measured by the root mean squared error (RMSE), and performance is evaluated by the recall. Fig. 8 shows a performance drop for DCP and PCRNet compared with the unseen-objects setting. In contrast, our method generalizes to unseen categories and is more robust than the other deep learning-based approaches. On the other hand, the unsupervised FMR-Net, which does not depend on category labels, generalizes better than the supervised PointNetLK. Traditional registration methods, including ICP, FGR, and RANSAC, are less affected.

Fig. 10: Qualitative results of the proposed method on 7-Scenes. Purple represents the transformed point cloud and green the target point cloud.

V-C Cross Dataset Evaluation

It is also crucial to test the generalization ability of models under cross-dataset evaluation. We train models on the train set of the first 20 categories in ModelNet40 and test on the test set of ScanObjectNN [57], a recently published real-world dataset containing 2,902 objects from 15 categories. A registration is considered successful if the rotation error is below a fixed angular threshold and the translation error is below 0.01. The results are reported in Fig. 9. Compared with the results on the synthetic dataset, the performance of DCP and PCRNet degrades in the cross-dataset setting, while the classical methods FGR and RANSAC maintain stable performance. The proposed method still outperforms the compared deep learning-based methods, demonstrating the effectiveness of its customized designs and its applicability across synthetic and real-world datasets.

V-D Evaluation on Real-World 3D Scenes

7-Scenes [60] is a real-world indoor 3D scene registration benchmark. Following DGR [42], we generate 3D scan pairs with more than 70% overlap for training. During training, Euler angles are uniformly sampled within a fixed range starting at 0 and translations in the range [-0.5, 0.5]. In addition to PointNetLK [14], PCRNet [19], FMR-Net [20], DCP [15], FGR [33], and RANSAC [32], we also compare with DGR [42] and FCGF [41], which are state-of-the-art methods on large-scale point cloud scenes. Two versions of RANSAC are compared, with 5000 and 10000 iterations, respectively. We down-sample to 10k points for training DCP, FMR-Net, PCRNet, PointNetLK, and the proposed method. As DGR requires a pre-trained FCGF model, we use all points for training DGR. Qualitative results of the proposed method are shown in Fig. 10. The proposed method is competitive with DGR, as shown in Fig. 11, and for larger rotation errors it slightly outperforms DGR. We tried to train PCRNet and PointNetLK, but they did not converge. Compared with DGR, the proposed method and FMR-Net use a globally pooled descriptor to represent the entire point cloud scene. However, it is challenging to represent complex scenes with globally pooled descriptors; therefore, our method does not achieve highly accurate results in large-scale scenes.

Fig. 11: Evaluation on 7-Scenes.
Fig. 12: Robustness evaluation. (a) Robustness to noise. (b) Robustness to partial overlap.
Points ICP PointNetLK DCP FGR RANSAC FMR-Net PCRNet Ours
500 7 70 6 143 18 68 14 66
1000 16 73 9 172 48 70 16 68
2000 35 75 20 232 110 72 18 69
TABLE I: Running time comparisons for registering a point cloud pair (in milliseconds).

V-E Robustness Evaluation

In this section, we evaluate the models in the presence of outliers and partial visibility. A registration is considered successful if the rotation error is below a fixed angular threshold and the translation error is below 0.01. For the outlier scenario, noise sampled from a Gaussian distribution is added to each point of the source and target point clouds. All models are trained on clean data without noise augmentation. Although FGR and RANSAC are competitive with our method on clean data, they are sensitive to noisy data, as shown in Fig. 12(a). Adding noise to the source and target point clouds breaks the exact point-to-point correspondences; therefore, the correspondence-based DCP shows limited performance. In contrast, correspondence-free learning-based approaches, such as FMR-Net, PointNetLK, and the proposed method, are more robust to noise.

We also conduct experiments on partially visible data, as point clouds acquired in the real world are often partial due to occlusions. To simulate this condition, we generate partial source and target point clouds independently from random camera poses, following the settings in [14]. Note that all methods are trained on noise-free and fully visible data from the first 20 categories in ModelNet40. Fig. 12(b) demonstrates that the proposed unsupervised model outperforms all the others under partial visibility because it utilizes both global and local descriptors. Therefore, it is promising to extend our model to real-world scenarios.

V-F Computational Efficiency

In this subsection, we evaluate the efficiency of the proposed method. The experiment is conducted on the test set of the last 20 categories from ModelNet40. Rigid transformations are randomly sampled, with Euler angles drawn uniformly from a fixed range starting at 0 and translations from a fixed range. To mimic real-world scenarios, Gaussian noise is added to each point of the source and target point clouds. We perform this experiment on a 2.10 GHz Intel E5-2620 and an NVIDIA Titan Xp; ICP, FGR, and RANSAC are executed on the CPU. The average running times of all compared methods are shown in Table I. When processing larger point clouds (e.g., 2000 points), the proposed method is faster than RANSAC, FGR, FMR-Net, and PointNetLK, while it is slower than ICP, DCP, and PCRNet.

V-G Generality of Local Consistency Loss

We use the proposed local consistency loss to improve the performance of several classical point cloud registration models, including DCP [15], PRNet [38], IDAM [17], and DeepGMR [18]. All models are trained on the first 20 categories and tested on the rest 20 categories in ModelNet40. Table II indicates that being equipped with local consistency loss enables both the correspondence-based and correspondence-free registration methods to estimate more accurate transformation parameters. Therefore, the proposed local consistency loss is a generic method complementary to the existing point cloud registration techniques.

Method RMSE(R) MAE(R) RMSE(t) MAE(t)
DCP [15] 4.834 2.795 0.020 0.010
PRNet [38] 1.506 0.874 0.015 0.010
IDAM [17] 2.084 1.122 0.022 0.014
DeepGMR [18] 3.472 1.526 0.004 0.001
LC + DCP 3.950 2.298 0.020 0.010
LC + PRNet 1.293 0.783 0.014 0.010
LC + IDAM 1.527 0.933 0.016 0.010
LC + DeepGMR 3.192 1.247 0.004 0.001
TABLE II: Local consistency loss can be integrated into various point cloud registration models. LC denotes the proposed local consistency loss.
Method Unsupervised Transformation awareness Local descriptors Computational complexity Performance
DCP No No No low moderate
PointNetLK No No No moderate moderate
PCRNet Yes No No low low
FMR-Net Yes No No moderate moderate
DVDs* Yes Yes Yes moderate high
TABLE III: The comparison between DVDs and previous registration methods. * denotes the proposed method.

V-H Ablation Studies

To examine the effectiveness of each component in Eq. (7), we conduct detailed ablation studies on ModelNet40. The first results are shown in Fig. 13(a). Model A denotes the baseline, which is trained with the self-reconstruction loss only. The baseline equipped with the auxiliary normal estimation task (model B) outperforms model A, which verifies the effectiveness of the normal estimation task. Our full model C incorporates all the tasks and outperforms the others, demonstrating that every individual task contributes to the superior performance. Moreover, we study the effect of the size of the local geometries, i.e., the cardinalities of $\mathcal{S}_1$ and $\mathcal{S}_2$. As shown in Fig. 13(b), the model achieves the best performance with 64 points, compared with a larger (96 points) or smaller (32 points) local size.

Fig. 13: Ablation studies of our model. (a) Effects of each component. (b) Effects of local size.

V-I Discussion

The advantages over previous methods are summarized in Table III, which is compiled from the preceding discussion and the experimental results. From Table III, we find that PCRNet, FMR-Net, and the proposed method are unsupervised, whereas DCP and PointNetLK are supervised methods requiring human annotations or labels. Besides, none of the compared methods considers local descriptors or the transformation awareness of the feature extraction network. Furthermore, as demonstrated by the experimental results on synthetic and real-world datasets, DVDs obtain the best registration performance among all compared methods with moderate computational complexity.

VI Conclusion

This paper proposed a novel unsupervised representation for robust point cloud registration based on the DVDs, which take advantage of high-dimensional features jointly obtained from global and local geometries. Besides, the transformation awareness of the DVDs is further enhanced by two auxiliary tasks (reconstruction and normal estimation). Numerical experiments on both synthetic and real datasets revealed several critical features of the proposed registration algorithm: 1) regardless of the dataset, it consistently achieves the best performance against several competing methods; 2) it performs robustly in various scenarios, such as unseen categories, severe noise, partial visibility, and even cross-dataset evaluation; 3) it includes one primary task and two auxiliary tasks, all of which contribute to accurate registration, indicating the superiority of DVDs, which encourage joint representations of discriminative features; 4) the proposed local consistency loss improves the performance of both correspondence-based and correspondence-free registration methods.

References

  • [1] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski, “Building rome in a day,” Communications of the ACM, vol. 54, no. 10, pp. 105–112, 2011.
  • [2] Y. Guo, M. Bennamoun, F. Sohel, M. Lu, and J. Wan, "An integrated framework for 3-d modeling, object detection, and pose estimation from point-clouds," IEEE Transactions on Instrumentation and Measurement, vol. 64, no. 3, pp. 683–693, 2014.
  • [3] J. Peng, W. Xu, B. Liang, and A.-G. Wu, “Virtual stereovision pose measurement of noncooperative space targets for a dual-arm space robot,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 1, pp. 76–88, 2019.
  • [4] J. M. Wong, V. Kee, T. Le, S. Wagner, G.-L. Mariottini, A. Schneider, L. Hamilton, R. Chipalkatty, M. Hebert, D. M. Johnson et al., “Segicp: Integrated deep semantic segmentation and pose estimation,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 5784–5789.
  • [5] Y. Cui, Y. An, W. Sun, H. Hu, and X. Song, “Lightweight attention module for deep learning on classification and segmentation of 3-d point clouds,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–12, 2020.
  • [6] D. Liu, C. Chen, C. Xu, Q. Cai, L. Chu, F. Wen, and R. Qiu, “A robust and reliable point cloud recognition network under rigid transformation,” IEEE Transactions on Instrumentation and Measurement, 2022.
  • [7] M. A. Audette, F. P. Ferrie, and T. M. Peters, “An algorithmic overview of surface registration techniques for medical imaging,” Medical Image Analysis, vol. 4, no. 3, pp. 201–217, 2000.
  • [8] P. J. Besl and N. D. McKay, “Method for registration of 3-d shapes,” in Sensor fusion IV: control paradigms and data structures, vol. 1611.   International Society for Optics and Photonics, 1992, pp. 586–606.
  • [9] A. W. Fitzgibbon, “Robust registration of 2d and 3d point sets,” Image and vision computing, vol. 21, no. 13-14, pp. 1145–1153, 2003.
  • [10] S. Bouaziz, A. Tagliasacchi, and M. Pauly, “Sparse iterative closest point,” in Computer graphics forum, vol. 32, no. 5.   Wiley Online Library, 2013, pp. 113–123.
  • [11] D. Chetverikov, D. Svirko, D. Stepanov, and P. Krsek, “The trimmed iterative closest point algorithm,” in Object recognition supported by user interaction for service robots, vol. 3.   IEEE, 2002, pp. 545–548.
  • [12] A. Segal, D. Haehnel, and S. Thrun, “Generalized-icp.” in Robotics: science and systems, vol. 2, no. 4.   Seattle, WA, 2009, p. 435.
  • [13] J. Yang, H. Li, D. Campbell, and Y. Jia, “Go-icp: A globally optimal solution to 3d icp point-set registration,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 38, no. 11, pp. 2241–2254, 2015.
  • [14] Y. Aoki, H. Goforth, R. A. Srivatsan, and S. Lucey, "Pointnetlk: Robust & efficient point cloud registration using pointnet," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7163–7172.
  • [15] Y. Wang and J. M. Solomon, “Deep closest point: Learning representations for point cloud registration,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 3523–3532.
  • [16] Z. J. Yew and G. H. Lee, “Rpm-net: Robust point matching using learned features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 824–11 833.
  • [17] J. Li, C. Zhang, Z. Xu, H. Zhou, and C. Zhang, “Iterative distance-aware similarity matrix convolution with mutual-supervised point elimination for efficient point cloud registration,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16.   Springer, 2020, pp. 378–394.
  • [18] W. Yuan, B. Eckart, K. Kim, V. Jampani, D. Fox, and J. Kautz, “Deepgmr: Learning latent gaussian mixture models for registration,” in European Conference on Computer Vision.   Springer, 2020, pp. 733–750.
  • [19] V. Sarode, X. Li, H. Goforth, Y. Aoki, R. A. Srivatsan, S. Lucey, and H. Choset, “Pcrnet: point cloud registration network using pointnet encoding,” arXiv preprint arXiv:1908.07906, 2019.
  • [20] X. Huang, G. Mei, and J. Zhang, “Feature-metric registration: A fast semi-supervised approach for robust point cloud registration without correspondences,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 366–11 374.
  • [21] N. Komodakis and S. Gidaris, “Unsupervised representation learning by predicting image rotations,” in International Conference on Learning Representations (ICLR), 2018.
  • [22] X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 750–15 758.
  • [23] J. S. L. Senanayaka, H. Van Khang, and K. G. Robbersmyr, “Toward self-supervised feature learning for online diagnosis of multiple faults in electric powertrains,” IEEE Transactions on Industrial Informatics, vol. 17, no. 6, pp. 3772–3781, 2020.
  • [24] B. Jian and B. C. Vemuri, “Robust point set registration using gaussian mixture models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1633–1645, 2010.
  • [25] B. Eckart, K. Kim, and J. Kautz, “Hgmr: Hierarchical gaussian mixtures for adaptive 3d registration,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 705–721.
  • [26] D. Campbell and L. Petersson, “Gogma: Globally-optimal gaussian mixture alignment,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5685–5694.
  • [27] D. Campbell, L. Petersson, L. Kneip, H. Li, and S. Gould, “The alignment of the spheres: Globally-optimal spherical mixture alignment for camera pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 796–11 806.
  • [28] Y. Liu, C. Wang, Z. Song, and M. Wang, “Efficient global point cloud registration by matching rotation invariant features through translation search,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 448–463.
  • [29] R. B. Rusu, N. Blodow, Z. C. Marton, and M. Beetz, “Aligning point cloud views using persistent feature histograms,” in 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2008, pp. 3384–3391.
  • [30] R. B. Rusu, N. Blodow, and M. Beetz, “Fast point feature histograms (fpfh) for 3d registration,” in 2009 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2009, pp. 3212–3217.
  • [31] H. Yang, J. Shi, and L. Carlone, “Teaser: Fast and certifiable point cloud registration,” IEEE Transactions on Robotics, vol. 37, no. 2, pp. 314–333, 2020.
  • [32] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
  • [33] Q.-Y. Zhou, J. Park, and V. Koltun, “Fast global registration,” in European Conference on Computer Vision.   Springer, 2016, pp. 766–782.
  • [34] J. Wu, “Rigid 3-d registration: A simple method free of svd and eigendecomposition,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 10, pp. 8288–8303, 2020.
  • [35] B. Zhao, X. Chen, X. Le, J. Xi, and Z. Jia, “A comprehensive performance evaluation of 3-d transformation estimation techniques in point cloud registration,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–14, 2021.
  • [36] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 652–660.
  • [37] S. Baker and I. Matthews, “Lucas-kanade 20 years on: A unifying framework,” International Journal of Computer Vision, vol. 56, no. 3, pp. 221–255, 2004.
  • [38] Y. Wang and J. M. Solomon, “Prnet: Self-supervised learning for partial-to-partial registration,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 8814–8826.
  • [39] R. Sinkhorn, “A relationship between arbitrary positive matrices and doubly stochastic matrices,” The annals of mathematical statistics, vol. 35, no. 2, pp. 876–879, 1964.
  • [40] H. Deng, T. Birdal, and S. Ilic, “3d local features for direct pairwise registration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3244–3253.
  • [41] C. Choy, J. Park, and V. Koltun, “Fully convolutional geometric features,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8958–8966.
  • [42] C. Choy, W. Dong, and V. Koltun, “Deep global registration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2514–2523.
  • [43] I. Achituve, H. Maron, and G. Chechik, “Self-supervised learning for domain adaptation on point clouds,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 123–133.
  • [44] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive,” IEEE Transactions on Knowledge and Data Engineering, 2021.
  • [45] L. Jing and Y. Tian, "Self-supervised visual feature learning with deep neural networks: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [46] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” in International Conference on Learning Representations, 2019.
  • [47] J. Li, B. Li, J. Xu, R. Xiong, and W. Gao, “Fully connected network-based intra prediction for image coding,” IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3236–3247, 2018.
  • [48] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur, “Optimal kernel choice for large-scale two-sample tests,” in Advances in Neural Information Processing systems (NeurIPS), 2012, pp. 1205–1213.
  • [49] A. Kolesnikov, X. Zhai, and L. Beyer, “Revisiting self-supervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1920–1929.
  • [50] Y. Eldar, M. Lindenbaum, M. Porat, and Y. Y. Zeevi, “The farthest point strategy for progressive image sampling,” IEEE Transactions on Image Processing, vol. 6, no. 9, pp. 1305–1315, 1997.
  • [51] M. I. Shamos and D. Hoey, “Closest-point problems,” in 16th Annual Symposium on Foundations of Computer Science (sfcs 1975).   IEEE, 1975, pp. 151–162.
  • [52] Y. Yang, C. Feng, Y. Shen, and D. Tian, “Foldingnet: Point cloud auto-encoder via deep grid deformation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 206–215.
  • [53] H. Fan, H. Su, and L. Guibas, “A point set generation network for 3d object reconstruction from a single image,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2463–2471.
  • [54] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas, "Learning representations and generative models for 3d point clouds," in International Conference on Machine Learning (ICML).   PMLR, 2018, pp. 40–49.
  • [55] Y. Liu, B. Fan, S. Xiang, and C. Pan, “Relation-shape convolutional neural network for point cloud analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8895–8904.
  • [56] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1912–1920.
  • [57] M. A. Uy, Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung, “Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 1588–1597.
  • [58] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser, “3dmatch: Learning local geometric descriptors from rgb-d reconstructions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1802–1811.
  • [59] H. Xu, S. Liu, G. Wang, G. Liu, and B. Zeng, “Omnet: Learning overlapping mask for partial-to-partial point cloud registration,” arXiv preprint arXiv:2103.00937, 2021.
  • [60] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2930–2937.