Rigid point cloud registration is the task of finding a rigid transformation that aligns two point clouds. It has long been a fundamental problem in computer vision and robotics, with many important applications such as autonomous driving [ref1, ref2], surgical navigation [ref3] and SLAM [ref4, ref5]. Point cloud registration involves two interlocked subproblems: finding the transformation that aligns the two point clouds and finding the correspondences between the points [ref6]. Although each subproblem is easy to solve once the solution to the other is known, solving both together is difficult. Registration becomes even harder in the presence of outliers, i.e., points with no correspondence in the other point cloud. Outliers may come from imperfections of the sensors used to collect the point clouds or from situations in which the two point clouds do not fully overlap.
Iterative closest point (ICP) [ref7, ref8] is arguably the most widely used method for rigid point cloud registration. It starts from an initial transformation and alternately updates the correspondences and the transformation. A major limitation of ICP is that it can only converge to a local optimum near the initialization, and its convergence basin is fairly small, especially in the presence of noise and outliers. A series of global registration methods based on branch-and-bound (BnB) [ref9, ref10, ref11] alleviate the need for initialization by finding the globally optimal solution, but the time-consuming BnB search limits their practical application. Another way to avoid initialization is keypoint extraction and matching [ref12, ref13]. Based on the correspondences established by matching keypoints, RANSAC-like methods can be applied for registration [ref12, ref13]. However, the speed and accuracy of this type of method are sensitive to outliers and repetitive geometry [ref14]. Several recent methods integrate deep neural networks for establishing correspondences with a differentiable singular value decomposition (SVD) algorithm for computing the transformation, building end-to-end trainable networks for point cloud registration, such as DCP [ref15], RPM-Net [ref16] and IDAM [ref17]; these methods need no transformation initialization. They rely on deep features to establish correspondences, but the discrimination ability of features extracted from point clouds is poor, as shown in Figure 1, which leads to a large proportion of incorrect correspondences and consequently devastates registration accuracy.
In this paper, we propose a robust point cloud registration framework that utilizes deep graph matching to better handle outliers, which we denote RGM. By constructing graphs from the point clouds to be registered and capturing the high-order structure of these graphs, RGM finds robust and accurate point-to-point correspondences and thus better solves the registration problem. To the best of our knowledge, this is the first time deep graph matching has been applied to point cloud registration. RGM contains an end-to-end deep neural network. Its first part is a feature extractor that computes a deep local feature for each point from its neighboring points. Instead of matching these local point features directly, we construct a graph for each of the two point clouds and embed [ref18] both the graph nodes (local features of each point) and the graph structure (second-order or high-order structure) into the node feature space. We then introduce a module consisting of an affinity layer, instance normalization and Sinkhorn, denoted the AIS module, to predict soft correspondences from the node features of the two graphs. By using graph matching in the AIS module, not only the local geometry of each node but also its structure and topology at a larger scale are considered when establishing correspondences, so more correct correspondences are found. During training, the binary cross-entropy loss between the predicted soft correspondences and the ground-truth correspondences is adopted, which directly drives the network to learn better point-to-point correspondences. During testing, we use a linear assignment problem (LAP) solver [ref52] based on the Hungarian algorithm [ref40] to transform the soft correspondences into one-to-one hard correspondences, and SVD is then employed to compute the transformation from the hard correspondences. Similar to existing methods such as RPM-Net and ICP, we iteratively refine the registration result.
Our main contributions are as follows:
We propose using deep graph matching to solve the point cloud registration problem for the first time. Instead of only using the features of each point, graph matching can leverage the features of other nodes and the structural information of graphs when establishing correspondences so that it can better address the problem of outliers.
We introduce the AIS module to establish reliable correspondences between nodes of two given graphs. The AIS module calculates an affinity matrix between any two nodes based on the embedded features, and by analyzing the affinity matrix globally and utilizing the Sinkhorn algorithm, it can effectively reduce the proportion of incorrect correspondences.
We propose using a transformer to generate soft graph edges. In registering partial-to-partial point clouds, better correspondences can be established for the overlapping parts by utilizing the attention and co-attention mechanism in the transformer.
Our method achieves state-of-the-art performance on clean, noisy and partial-to-partial datasets, as well as on datasets of unseen categories.
2 Related Work
2.1 Traditional Registration Method
A large proportion of traditional methods need an initial transformation and find a locally optimal solution near the initialization; ICP [ref7, ref8] is an early and representative example. ICP starts with an initial transformation and iteratively alternates between two trivial subproblems: finding the closest points as correspondences under the current transformation, and computing the optimal transformation by SVD from the found correspondences. Many variants have been proposed to improve ICP [ref19, ref20, ref21]. Nevertheless, ICP and its variants can only converge to a local optimum, and their success heavily relies on a good initialization. To improve robustness to noise and outliers and enlarge the convergence basin, some methods transform point clouds into probability distributions and reformulate registration as matching two probability distributions, such as GMM [ref22] and HGMR [ref23]. These methods do not alternately solve for correspondences and transformation, but their objective functions are nonconvex, so they still need a good initialization to avoid converging to a bad local optimum. Recently, a series of globally optimal methods based on BnB have been proposed, such as Go-ICP [ref9], GOGMA [ref10], GOSMA [ref11] and GoTS [ref24], but they are very slow and practical only in limited scenarios. Another line of work avoids transformation initialization by establishing correspondences: keypoints are first extracted from the original point clouds, feature descriptors are constructed for them, and potential correspondences are established through feature matching [ref12, ref13]. RANSAC-like algorithms can then be used to find the correct correspondences for registration. Unlike RANSAC-like methods, FGR [ref25] optimizes a correspondence-based objective function with a graduated nonconvexity strategy and achieves state-of-the-art performance among correspondence-based methods. However, correspondence-based methods are sensitive to repeated structures and partial-to-partial point clouds, because a large proportion of the potential correspondences are incorrect in these scenarios. In summary, the lack of good initialization, large proportions of outliers and time constraints remain major challenges for traditional point cloud registration methods.
2.2 Learning-based Registration Method
The development of deep learning on point clouds allows researchers to build on existing networks, such as PointNet [ref26] and DGCNN [ref27], to extract point cloud features for downstream tasks. These studies have stimulated interest in applying deep learning to point cloud registration. One of the earliest works is PointNetLK [ref29], which computes global feature descriptors of the two point clouds with PointNet and iteratively uses the IC-LK algorithm [ref30, ref31] to minimize the distance between the two descriptors. PCRNet [ref14] replaces the IC-LK algorithm in PointNetLK with a deep neural network. DCP [ref15] utilizes a transformer [ref42, ref43] to compute soft correspondences between the two point clouds and a differentiable SVD algorithm to compute the transformation. Although these methods are fast and some need no transformation initialization, they cannot effectively handle partial-to-partial point cloud registration. PRNet [ref34] proposes a keypoint detector and uses keypoint-to-keypoint correspondences in a self-supervised way to solve partial-to-partial registration. DeepGMR [ref35] extracts pose-invariant correspondences between raw point clouds and Gaussian mixture model (GMM) parameters and then recovers the transformation from the matched Gaussian mixture models. IDAM [ref17] integrates an iterative distance-aware similarity convolution module into the matching process, overcoming the shortcomings of using inner products to obtain pointwise similarity. RPM-Net [ref16] proposes a network to predict optimal annealing parameters and uses annealing and Sinkhorn [ref36] to obtain soft correspondences from local features. Soft correspondences increase robustness but decrease registration accuracy, as shown in our experiment on clean data. Although these methods handle partial-to-partial registration to some extent, there is still room for improvement in accuracy and robustness. The difference between our method and existing learning-based methods is that we construct graphs from the original point clouds and merge the structural information of the graphs into the node features so that the nodes can be better matched.
2.3 Graph Matching

Graph matching has been widely studied in computer vision and pattern recognition [ref47, ref50, ref51]. Recently, learning-based graph matching has attracted considerable research interest [ref37, ref38, ref18], but, to the best of our knowledge, no prior work uses learning-based graph matching to solve the point cloud registration problem.
3 Problem Formulation
3D rigid point cloud registration refers to estimating a rigid transformation $\{R, t\}$, with rotation $R \in SO(3)$ and translation $t \in \mathbb{R}^3$, to align a source point cloud $\mathbf{X} = \{x_i \in \mathbb{R}^3\}_{i=1}^{N}$ and a target point cloud $\mathbf{Y} = \{y_j \in \mathbb{R}^3\}_{j=1}^{M}$, where $N$ and $M$ represent the numbers of points in $\mathbf{X}$ and $\mathbf{Y}$, respectively. The correspondences between points in $\mathbf{X}$ and $\mathbf{Y}$ are represented by a matrix $\mathbf{C} \in \{0, 1\}^{N \times M}$. If $x_i$ and $y_j$ are a pair of corresponding points, $\mathbf{C}_{ij}$ is 1; otherwise, it is 0. We first consider the simple case in which there are strict one-to-one correspondences between the points in $\mathbf{X}$ and $\mathbf{Y}$, so that $N = M$. The rigid point cloud registration problem can then be formulated as minimizing the following objective function:
$$\min_{R,\, t,\, \mathbf{C}} \; \sum_{i=1}^{N} \sum_{j=1}^{M} \mathbf{C}_{ij} \left\lVert R x_i + t - y_j \right\rVert_2^2, \quad \text{s.t.} \;\; \sum_{j=1}^{M} \mathbf{C}_{ij} = 1 \;\; \forall i, \qquad \sum_{i=1}^{N} \mathbf{C}_{ij} = 1 \;\; \forall j.$$
In the more difficult case where there are no one-to-one correspondences, the equality constraints no longer hold and become inequality constraints. As in [ref16], we can introduce slack variables in $\mathbf{C}$ to convert the inequality constraints back into equality constraints. The row constraints are converted as follows, and the column constraints are converted similarly:
$$\sum_{j=1}^{M} \mathbf{C}_{ij} \leq 1 \;\; \Longrightarrow \;\; \sum_{j=1}^{M+1} \mathbf{C}_{ij} = 1, \quad i = 1, \dots, N.$$
Note that $\mathbf{C}$ becomes an $(N+1) \times (M+1)$ matrix after introducing one slack row and one slack column, and the sums of the added row and column are not restricted to be one.
In this paper, we use an end-to-end neural network to predict $\mathbf{C}$. Once the correspondences are known, the rigid transformation can be obtained by SVD.
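Recovering the rigid transformation from a set of hard correspondences has a closed-form SVD solution (the classical Kabsch/Procrustes construction). A minimal NumPy sketch, with function and variable names our own rather than the paper's:

```python
import numpy as np

def kabsch(src, tgt):
    """Closed-form rigid transform (R, t) minimizing sum ||R @ src_i + t - tgt_i||^2.

    src, tgt: (N, 3) arrays of already-corresponding points.
    """
    c_src, c_tgt = src.mean(axis=0), tgt.mean(axis=0)   # centroids
    H = (src - c_src).T @ (tgt - c_tgt)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))              # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_tgt - R @ c_src
    return R, t
```

Given a ground-truth rotation and translation, applying `kabsch` to a point set and its transformed copy recovers them exactly (up to numerical precision).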
4 Method

Figure 2 (a) shows the overall pipeline of RGM. RGM consists of four components: the local feature extractor, the edge generator, the graph feature extractor with the AIS module, and LAP-SVD. During training, the shared local feature extractor extracts local features for each point in $\mathbf{X}$ and $\mathbf{Y}$, and these local features serve as the node features of the initial graphs. Next, the edge generator generates edges and builds the source and target graphs, which are fed into the graph feature extractor; it processes the two graphs, outputs new node features and uses them to update the graphs. The AIS module predicts the soft correspondence matrix $\tilde{\mathbf{C}}$ between the nodes of the two graphs. By applying several blocks composed of the edge generator, graph feature extractor and AIS module, with the same structure but different weights, we obtain node features with better discrimination capability and a more accurate soft correspondence matrix $\tilde{\mathbf{C}}$. Finally, the training loss is the cross-entropy between $\tilde{\mathbf{C}}$ and the ground-truth correspondences. During testing, the two point clouds are first fed into the network to obtain the soft correspondence matrix $\tilde{\mathbf{C}}$. The soft correspondences are then converted to hard correspondences with the LAP solver, and the transformation is solved by SVD. We also update the transformation iteratively, similar to ICP. The details of each component are explained in the following subsections.
4.1 Local Feature Extractor
To establish the correspondence matrix between two point clouds, it is necessary to embed the source and target point clouds into a common feature space. We use only the coordinates of the points to build a low-dimensional local feature descriptor for each point. The local feature descriptor of $x_i$ is built from $x_i$ and its neighborhood $\mathcal{N}(x_i)$, where $\mathcal{N}(x_i)$ denotes the $k$-nearest neighboring points of $x_i$.
The low-dimensional local feature descriptors are mapped to a high-dimensional local feature space through a nonlinear function $f_{\theta}$, where $d$ is the dimensionality of the final high-dimensional local feature. The implementation of $f_{\theta}$ is shown in Figure 2 (b), where $\theta$ represents the parameters of the nonlinear function, which consists of shared multilayer perceptrons (MLPs), max-pooling and concatenation. We use the high-dimensional local features as the node features of the initial graph; the node feature of $x_i$ is thus $f_{\theta}$ applied to its local feature descriptor.
Inspired by the idea of Siamese networks [ref41], the two point clouds share the same local feature extractor. As the two point clouds become closer, their local features also become more similar, so this structure is well suited to iterative registration.
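The paper's exact descriptor is not reproduced here; the following NumPy sketch shows the generic pattern such extractors share before the shared MLP and max-pooling: gather each point's $k$ nearest neighbors and stack absolute coordinates with relative offsets. All names are ours:

```python
import numpy as np

def knn_local_descriptor(points, k=3):
    """Gather each point's k nearest neighbors and stack [x_i, x_j - x_i]
    edge features -- the generic input to a shared-MLP + max-pool extractor.

    points: (N, 3) array; returns (N, k, 6).
    """
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)  # (N, N)
    np.fill_diagonal(d2, np.inf)                                        # exclude self
    idx = np.argsort(d2, axis=1)[:, :k]                                 # (N, k) neighbor indices
    neighbors = points[idx]                                             # (N, k, 3)
    center = np.repeat(points[:, None, :], k, axis=1)                   # (N, k, 3)
    return np.concatenate([center, neighbors - center], axis=-1)        # (N, k, 6)
```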
If only local features are used to predict the correspondences between point clouds, incorrect correspondences are easily produced, especially when there are outliers. The reason is that local features contain neither the structural information of the point cloud at a larger scale (self-correlation) nor the association between the two point clouds (cross-correlation). Inspired by Wang’s research on deep graph matching [ref18], we construct graphs from the point clouds and use deep graph matching to establish better correspondences. Section 4.2 describes how to build graphs from point clouds, and Section 4.3 introduces how to predict the correspondences using deep graph matching and the AIS module.
4.2 Edge Generator Based on Transformer
The graphs built from $\mathbf{X}$ and $\mathbf{Y}$ are denoted the source graph $\mathcal{G}_s$ and target graph $\mathcal{G}_t$, respectively. The graph nodes are the original points, and the graph edges are represented by an adjacency matrix $A$. The node features of $\mathcal{G}_s$ and $\mathcal{G}_t$ are denoted $F_s$ and $F_t$, respectively. There are trivial ways to generate the edges, such as full connection, nearest-neighbor connection and Delaunay triangulation, but with them the graph features cannot be effectively aggregated, as shown in Figure 5 (d). Inspired by the success of BERT [ref42] in NLP, we introduce a transformer [ref43] module to dynamically learn a soft edge between any two nodes within a point cloud. The transformer-based edge generator is illustrated in Figure 2 (c). The transformer consists of several stacked encoder-decoder layers: the encoder uses a self-attention layer and a shared MLP to encode node features, and the decoder associates and encodes features with a co-attention mechanism. The transformer takes node features as input and encodes them into embedding features $F$. The soft edge adjacency matrix is obtained by applying a softmax to the inner products of the embedding features:
$$A_{ij} = \operatorname{softmax}_j\!\left( \langle F_i, F_j \rangle \right).$$
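The final step above, turning embedded node features into a soft adjacency, can be sketched as a row-wise softmax over pairwise inner products (the transformer layers themselves are omitted; names are ours):

```python
import numpy as np

def soft_edges(embeddings):
    """Soft adjacency from embedded node features: row-wise softmax of
    pairwise inner products. embeddings: (N, d) -> (N, N), rows sum to 1."""
    logits = embeddings @ embeddings.T               # (N, N) inner products
    logits -= logits.max(axis=1, keepdims=True)      # subtract row max for stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)
```

Every entry is positive and each row sums to one, so row $i$ can be read as a distribution of soft edge weights from node $i$ to all other nodes.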
4.3 Graph Feature Extractor and AIS Module
This part is shown in Figure 2 (d) and consists of three consecutive steps. First, we use intra-graph convolution to explore the self-correlation of node features, aggregating features from nodes along the edges within each graph. The message passing scheme between nodes is the same as in PCA-GM [ref18]. The self-correlation feature of node $x_i$ is computed by intra-graph convolution as
$$v_i^{\mathrm{self}} = f_1(v_i) + \sum_{j} \hat{A}_{ij}\, f_2(v_j),$$
and likewise for the target graph. Here, $\hat{A}$ is the row-normalized adjacency matrix calculated from $A$, and $f_1$ and $f_2$ are message passing functions implemented by fully connected layers and ReLU.
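One round of such intra-graph message passing can be sketched as follows; the single-matrix weights and plain ReLU are illustrative stand-ins, not the paper's exact configuration:

```python
import numpy as np

def intra_graph_conv(A, F, W_msg, W_self):
    """One round of message passing within a graph (PCA-GM style, simplified):
    aggregate transformed neighbor features along the row-normalized adjacency
    and add a transformed self feature.

    A: (N, N) adjacency, F: (N, d) node features, W_msg/W_self: (d, d)."""
    A_norm = A / A.sum(axis=1, keepdims=True)        # row normalization
    msg = A_norm @ np.maximum(F @ W_msg, 0.0)        # aggregated messages (ReLU)
    return np.maximum(F @ W_self, 0.0) + msg         # self term + neighborhood term
```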
Second, the AIS module is used to calculate a soft correspondence matrix. The AIS module consists of an affinity layer, instance normalization and Sinkhorn. The affinity matrix between the two graphs is computed as
$$M = F_s\, W\, F_t^{\top},$$
where $W$ is the learnable parameter matrix of the affinity layer. If $F_s \in \mathbb{R}^{N \times d}$ and $F_t \in \mathbb{R}^{M \times d}$, then $M \in \mathbb{R}^{N \times M}$.
Before using Sinkhorn to compute the soft correspondence matrix $\tilde{\mathbf{C}}$, we need to transform $M$ into a matrix whose elements are positive and finite. There are two approaches: the naive one is to apply softmax over rows or columns. The problem with this approach is that it processes each row or column independently and does not consider the matrix as a whole, so a smaller value in $M$ may be transformed into a larger value in the resulting matrix. This phenomenon is illustrated in Figure 3: the second element of the first column is smaller than the last element of the second column, but it becomes larger after applying softmax to the two columns separately. To avoid this, we use instance normalization [ref44] instead of softmax to transform $M$. Instance normalization considers all elements globally, and an exponential function ensures that all elements are positive. To handle outliers, we add an additional row and column of ones to the transformed matrix and then input it into Sinkhorn [ref36], which calculates the soft correspondence matrix $\tilde{\mathbf{C}}$ by iteratively alternating row and column normalizations.
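The tail of the AIS module described above — global normalization, exponentiation, slack padding and Sinkhorn iterations — can be sketched as follows; the normalization constants and iteration count are our own choices, not the paper's:

```python
import numpy as np

def ais_normalize_and_sinkhorn(M, n_iters=50):
    """Sketch of the AIS tail: normalize the affinity matrix globally
    (instance-normalization style), exponentiate so every entry is positive,
    pad one slack row and column of ones for outliers, then run Sinkhorn's
    alternating row/column normalization (slack row/column left unconstrained)."""
    Mn = (M - M.mean()) / (M.std() + 1e-8)                  # global normalization
    S = np.exp(Mn)                                          # positive, finite
    S = np.pad(S, ((0, 1), (0, 1)), constant_values=1.0)    # slack row and column
    for _ in range(n_iters):
        S[:-1] /= S[:-1].sum(axis=1, keepdims=True)         # normalize non-slack rows
        S[:, :-1] /= S[:, :-1].sum(axis=0, keepdims=True)   # normalize non-slack columns
    return S[:-1, :-1]                                      # drop slack -> soft correspondences
```

The returned matrix is nearly doubly stochastic: each point's total correspondence mass is at most one, with the remainder absorbed by the slack row/column.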
Finally, we enhance the node features by exploring cross-correlation through cross-graph convolution. Cross-graph convolution is similar to intra-graph convolution, except that features are aggregated from the node features of the other graph, with the adjacency matrix replaced by the soft correspondence matrix $\tilde{\mathbf{C}}$. The more similar a pair of nodes from the two graphs is, the higher the corresponding weight in $\tilde{\mathbf{C}}$ will be. The new feature of each node is obtained by fusing its self-correlation and cross-correlation features with a function consisting of feature concatenation and a fully connected layer, which is shared between the source and target graphs.
4.4 LAP Solver and SVD
To compute the binary hard correspondence matrix $\mathbf{C}$, we sum the elements of each row and each column of $\tilde{\mathbf{C}}$, keep the rows and columns whose sums are greater than 0.5, and apply a LAP solver based on the Hungarian algorithm [ref40] to the resulting matrix to obtain a binary matrix. The elements of this binary matrix are then placed into a zero matrix of shape $N \times M$ according to their positions in $\tilde{\mathbf{C}}$, and the result is the desired hard correspondence matrix $\mathbf{C}$. Finally, we take $\mathbf{C}$ as input to predict the transformation by SVD.
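For tiny matrices, this soft-to-hard conversion can be illustrated with a brute-force assignment in place of the Hungarian-algorithm LAP solver the paper uses (all names are ours; brute force is exponential and only for illustration):

```python
from itertools import permutations

def hard_correspondences(soft, keep_thresh=0.5):
    """Convert a soft correspondence matrix to a binary one.
    Rows/columns whose total mass is <= keep_thresh are treated as outliers;
    the rest are assigned one-to-one by maximizing the total soft score.
    Brute force over permutations -- a stand-in for a Hungarian LAP solver."""
    n, m = len(soft), len(soft[0])
    rows = [i for i in range(n) if sum(soft[i]) > keep_thresh]
    cols = [j for j in range(m) if sum(soft[i][j] for i in range(n)) > keep_thresh]
    k = min(len(rows), len(cols))
    best, best_score = (), float("-inf")
    for perm in permutations(cols, k):
        score = sum(soft[i][j] for i, j in zip(rows, perm))
        if score > best_score:
            best, best_score = perm, score
    hard = [[0] * m for _ in range(n)]          # zero matrix of the original shape
    for i, j in zip(rows, best):
        hard[i][j] = 1                          # place assignments at original positions
    return hard
```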
4.5 Loss

Our loss function takes the ground-truth correspondences directly as supervision, unlike previous studies [ref15, ref16, ref35] that define losses on the transformation parameters. The cross-entropy loss between the soft correspondence matrix $\tilde{\mathbf{C}}$ and the ground-truth correspondence matrix $\mathbf{C}^{gt}$ is adopted to train our model:
$$\mathcal{L} = -\sum_{i,j} \left[ \mathbf{C}^{gt}_{ij} \log \tilde{\mathbf{C}}_{ij} + \left(1 - \mathbf{C}^{gt}_{ij}\right) \log\left(1 - \tilde{\mathbf{C}}_{ij}\right) \right].$$
Since our loss function depends only on the soft correspondence matrix $\tilde{\mathbf{C}}$, the calculations in Section 4.4 do not need to be differentiable.
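The loss reduces to elementwise binary cross-entropy over the correspondence matrix; a NumPy sketch (names ours):

```python
import numpy as np

def correspondence_bce(soft, gt, eps=1e-9):
    """Binary cross-entropy between a predicted soft correspondence matrix
    and a 0/1 ground-truth matrix, summed over all node pairs.
    eps-clipping keeps log() finite at 0 and 1."""
    soft = np.clip(soft, eps, 1.0 - eps)
    return -np.sum(gt * np.log(soft) + (1.0 - gt) * np.log(1.0 - soft))
```

A prediction close to the ground truth yields a loss near zero, while an uninformative uniform prediction is penalized heavily.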
4.6 Implementation Details
Our local feature extractor considers a neighborhood of $k = 20$ and outputs final high-dimensional local features with dimensionality $d = 1024$. We set in this study. We train the network using the SGD optimizer with an initial learning rate of 1e-3. The network is implemented in PyTorch. For more implementation details, please see the supplementary material.
5 Experiments

5.1 Datasets and Evaluation Metrics
All experiments are conducted on the ModelNet40 [ref45] dataset, which includes 12,311 meshed CAD models from 40 categories. We randomly sample 2,048 points from the mesh faces and rescale the points into a unit sphere. Each category has official train/test splits. We take 80% and 20% of the official train split as the training and validation sets, respectively, and use the official test split for testing. For each object, we randomly sample 1,024 points as the source point cloud $\mathbf{X}$, apply a random transformation to $\mathbf{X}$ to obtain the target point cloud $\mathbf{Y}$, and shuffle the point order. For the transformation, we randomly sample three Euler angles in a fixed range for rotation and three displacements in a fixed range along each axis for translation. Unless otherwise noted, these settings are used by default in all experiments.
We use six evaluation metrics; the first four are calculated from the estimated transformation parameters. They are the mean isotropic errors (MIE) of $R$ and $t$ proposed in RPM-Net [ref16], and the mean absolute errors (MAE) of $R$ and $t$ used in DCP [ref15], which are anisotropic. All rotation-related metrics are in units of degrees.
In addition, we propose a new metric, the clip chamfer distance (CCD), which measures how close the two point clouds are brought to each other:
$$\mathrm{CCD}(\hat{\mathbf{X}}, \mathbf{Y}) = \sum_{\hat{x}_i \in \hat{\mathbf{X}}} \min\!\Big( \min_{y_j \in \mathbf{Y}} \lVert \hat{x}_i - y_j \rVert_2,\; d \Big) + \sum_{y_j \in \mathbf{Y}} \min\!\Big( \min_{\hat{x}_i \in \hat{\mathbf{X}}} \lVert y_j - \hat{x}_i \rVert_2,\; d \Big),$$
where $\hat{\mathbf{X}}$ is the transformed source point cloud after registration and $\hat{x}_i$ is its $i$-th point. To avoid the influence of outliers in partial-to-partial registration, point pairs whose distance is larger than 0.1 are excluded from the calculation; this is implemented by setting the clip threshold $d = 0.1$.
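A hedged NumPy sketch of such a clipped chamfer distance (the paper's exact normalization may differ; names are ours):

```python
import numpy as np

def clip_chamfer_distance(x_aligned, y, clip=0.1):
    """Clip chamfer distance (sketch): symmetric sum of nearest-neighbor
    distances, each clipped at `clip`, so distant outlier pairs contribute
    only a bounded amount. x_aligned: transformed source (N, 3); y: (M, 3)."""
    d = np.linalg.norm(x_aligned[:, None, :] - y[None, :, :], axis=-1)  # (N, M) pairwise distances
    fwd = np.minimum(d.min(axis=1), clip).sum()   # source -> target term
    bwd = np.minimum(d.min(axis=0), clip).sum()   # target -> source term
    return fwd + bwd
```

Perfectly registered identical clouds score zero; clouds far apart saturate at `clip` per point instead of growing without bound.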
Finally, we also report the recall, defined as the fraction of registrations with MAE($R$) and MAE($t$) below fixed thresholds. The best results are marked in bold in the tables.
5.2 Comparing Methods
We compare our method with ICP [ref7], fast global registration (FGR) [ref25], and three recent learning-based methods: RPM-Net [ref16], IDAM [ref17] and DeepGMR [ref35]. Earlier learning-based methods, such as DCP and PointNetLK, are not directly compared because experiments in [ref16, ref17, ref35] have already shown that the newer methods perform better. Our method performs two iterations during testing. We adopt the ICP and FGR implementations from Intel Open3D [ref46]. For IDAM and DeepGMR, we use the code provided by the authors and train the models according to the authors' settings. For RPM-Net, we use the code provided by the authors and estimate normals in all experiments except the clean one. The number of iterations of RPM-Net is set to 5, following the authors' article. ICP uses the identity matrix as initialization; none of the other methods need transformation initialization. All networks are retrained because no trained models are available.
5.3 Clean Point Cloud
We first evaluate registration performance on clean point clouds, following the sampling and transformation settings in Section 5.1. The ground-truth correspondences are obtained from the strict correspondences between $\mathbf{X}$ and $\mathbf{Y}$. All models are trained and evaluated on clean data, and Table 1 shows the performance of our method and its peers. Our method achieves the best performance and greatly outperforms the strongest learning-based competitor. In addition, the success rate of RGM reaches 100%, and most of its error metrics are close to 0, which no other existing method achieves. Although DeepGMR also achieves a 100% success rate, its errors are larger than those of RGM. Some qualitative comparisons are shown in Figure 4 (a).
5.4 Gaussian Noise
To evaluate robustness to noise, Gaussian noise, sampled from a zero-mean distribution and clipped to a fixed range, is independently added to each coordinate of the points in the clean point clouds. This noise may destroy the original correspondences, so we must rebuild them to train the models that need ground-truth correspondences. First, we compute the distances between point pairs in $\hat{\mathbf{X}}$ and $\mathbf{Y}$, where $\hat{\mathbf{X}}$ is obtained by applying the ground-truth transformation to $\mathbf{X}$. Then, if $\hat{x}_i$ and $y_j$ satisfy Eq. 13, they are regarded as a corresponding point pair and no longer appear in the next round of calculation. Finally, we find corresponding point pairs again according to Eq. 13 among the remaining points. To avoid selecting long-distance point pairs as correspondences, we only consider point pairs whose distance is less than 0.1. The reason for searching again among the remaining points is that the distance between two truly corresponding points may be the second smallest rather than the smallest, so the pair is not found in the first round.
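The two-round rebuilding procedure can be sketched as follows, assuming the Eq. 13 test amounts to a mutual-nearest-neighbor check combined with the 0.1 distance threshold (an assumption on our part; names are ours):

```python
import numpy as np

def rebuild_correspondences(x_aligned, y, thresh=0.1, rounds=2):
    """Rebuild ground-truth correspondences on noisy data: repeatedly match
    mutual nearest neighbors closer than `thresh`, remove matched points, and
    search again so pairs at second-nearest distance can still be recovered.
    (Assumes the paper's Eq. 13 is a mutual-nearest-neighbor test.)"""
    free_x, free_y = set(range(len(x_aligned))), set(range(len(y)))
    pairs = []
    for _ in range(rounds):
        xs, ys = sorted(free_x), sorted(free_y)
        if not xs or not ys:
            break
        d = np.linalg.norm(x_aligned[xs][:, None] - y[ys][None, :], axis=-1)
        nn_x = d.argmin(axis=1)                      # nearest target per source
        nn_y = d.argmin(axis=0)                      # nearest source per target
        for a, b in enumerate(nn_x):
            if nn_y[b] == a and d[a, b] < thresh:    # mutual NN and close enough
                pairs.append((xs[a], ys[b]))
                free_x.discard(xs[a])
                free_y.discard(ys[b])
    return pairs
```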
All models are trained and evaluated on the noisy data. The results are shown in Table 2. Our method is clearly more accurate than both the latest learning-based methods and the traditional methods, and its recall is close to 100%. Some qualitative comparisons are shown in Figure 4 (b).
5.5 Partial-to-Partial

Partial-to-partial registration is the most challenging case for point cloud registration, and it is important because it occurs frequently in real-world applications. To generate partial-to-partial point cloud pairs, we follow the protocol in RPM-Net [ref16], which is close to real-world conditions: for each point cloud, we independently create a random plane passing through the origin, translate it along its normal, and retain 70% of the points. All models are trained and evaluated on partial-to-partial data, and the results are shown in Table 3. Our method is clearly more accurate than the other methods, and its success rate is higher than 90%. RPM-Net is the second-best method, but its errors are still twice as large as ours. Some qualitative comparisons are shown in Figure 4 (c). For the inference times of our method and the comparison methods, please refer to the supplementary material.
5.6 Unseen Categories
To test each method's generalization to unseen shape categories, we take the official train and test splits of the first 20 categories as the training and validation sets, respectively, and test on the official test splits of the last 20 categories. The other experimental settings are the same as in the partial-to-partial experiment. The results are summarized in Table 4. The performance of the traditional methods does not change significantly. RPM-Net also generalizes well, but our method clearly works better. The other learning-based methods do not generalize well to unseen categories. Some qualitative comparisons are shown in Figure 4 (d).
5.7 Ablation Studies
In this section, we present the results of the ablation study to analyze the effectiveness of two key components. All ablation studies are performed on the partial-to-partial dataset. We analyze the two key components as follows:
To demonstrate the effectiveness of the AIS module, we design a variant that replaces it, denoted RGMVar1. The variant computes the distance matrix between the nodes of the two graphs as the L2 norm between node features, transforms it into a positive matrix with finite values by a fixed formula, and uses Sinkhorn to calculate the soft correspondences. The results are listed in the first row of Table 5. Registration accuracy becomes very poor with this variant, which shows that the proposed AIS module effectively improves registration performance. This is because the AIS module generates more correct matches than its variant; an illustrative example of the hard correspondences generated by the AIS module and its variant is shown in Figure 5 (b) and (c).
To assess the importance of our edge generator, we design a variant that uses fully connected edges instead of edges built by the transformer, denoted RGMVar2. The results, shown in the second row of Table 5, are also inferior to those obtained with the transformer-generated edges. An example of the hard correspondences generated by this variant is shown in Figure 5 (d).
6 Conclusion

We introduce deep graph matching to the point cloud registration problem for the first time and propose a novel deep learning framework, RGM, that achieves state-of-the-art performance. We propose the AIS module to establish accurate correspondences between graph nodes, greatly improving registration performance. In addition, the transformer-based edge generator offers a new way to build graph edges beyond full connection, nearest-neighbor connection and Delaunay triangulation. We believe the deep graph matching approach has the potential to be applied to other registration problems, including 2D-3D registration and deformable registration.