Alignment between 3D objects is a fundamental problem in computer vision, playing an important role in medical imaging [medical_registration], autonomous driving [autonomous_registration], robotics [robotics_registration], and more. At the root of the rigid alignment problem is the need to find the linear transformation inducing the difference between the objects, where noise, sparsity, partiality, and extreme deformations can distort the input shapes.
The task is even harder in the 3D domain, where data is usually represented by point clouds, thus having irregular topology, a varying number of neighbors per point, and an unordered data description. A canonical work in the field is the Iterative Closest Point (ICP) algorithm [icp], which aligns point clouds in Euclidean space iteratively until convergence, by matching points according to spatial proximity and then solving a least-squares problem for the global transformation [rtsolve]. However, ICP is extremely sensitive to noise and initial conditions, and since the problem is non-convex, ICP gets stuck in sub-optimal local solutions in most cases. In recent years, many methods [dcp, pointnetlk, prnet, itnet] have presented deep-learning-based improvements to the original idea of ICP. Among the renowned methods is Deep Closest Point (DCP) [dcp], which replaced the Euclidean nearest-point step of ICP with a learnable per-point embedding network, followed by a high-dimensional feature-matching stage. While improving ICP substantially, all the methods mentioned above still use the globally optimal least-squares solution to the transformation problem, and are thus sensitive to noise, outliers, and errors in the feature-alignment step.
We herein provide a new line of thought for solving the rigid alignment challenge. We introduce a single feed-forward network, used both at training and inference, with a voting mechanism on top of a learnable confidence-guided sampling step over the source points. Our framework shows superior results in all common test configurations, and even larger improvements in performance when examining rotations on the full spectrum of SO(3).
In this work, we claim that the dense mapping between shapes is not only a proxy to the rigid alignment task, but also provides useful information about the credibility of each point that participates in the process. Here, we optimize for a dense correspondence mapping directly, using a new unsupervised contrastive loss, in search of alignment and confidence in a fully differentiable model.
In our experimentation (Section 4.2) we exemplify the state-of-the-art results our network provides. In the ablation study (Section 4.4) and the supplementary we demonstrate the significance of each of the three network components, and also present run-time performance, convergence time, and visual comparisons.
Our contributions include the following:
Present a paradigm shift for learning rigid alignment, where the dense correspondence is optimized directly based on the confidence each point has in its mapping.
Report SOTA results on multiple datasets in a substantial number of tests, with comparison to classic and recently published methods.
DWC is a single feed-forward model during training and testing, converging up to 10 times faster than current state-of-the-art architectures, with a real-time performance at inference.
Support the full spectrum of SO(3) without any decrease in performance (RMSE error), unlike previously published methods.
2 Related Work
Axiomatic point cloud registration
Axiomatic methods for 3D rigid registration have been around for more than 40 years [icp, chen1992object, zhang1994iterative, debevec1996modeling], with the most renowned of all being the Iterative Closest Point algorithm [icp]. ICP presents an iterative solution to the problem that alternates between two steps: finding correspondences between the two point clouds, and calculating the transformation that aligns the corresponding points, followed by applying the transformation. These steps are repeated until convergence, but are prone to reaching a local minimum.
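The two-step loop can be sketched with NumPy (a minimal illustration, not a production implementation; the brute-force nearest-neighbor search is used here only for clarity):

```python
import numpy as np

def icp_step(src, tgt):
    """One ICP iteration: match by nearest neighbor, then solve least squares.

    src, tgt: (N, 3) arrays. Returns (R, t) aligning src toward tgt.
    """
    # 1) Correspondence: nearest target point for every source point.
    d2 = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)  # (N, M) squared distances
    matched = tgt[d2.argmin(axis=1)]

    # 2) Transformation: closed-form least squares (Kabsch) on the matches.
    sc, mc = src.mean(0), matched.mean(0)
    H = (src - sc).T @ (matched - mc)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mc - R @ sc
    return R, t

def icp(src, tgt, iters=20):
    """Repeat the two steps, accumulating the transformation."""
    R_acc, t_acc = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        R, t = icp_step(cur, tgt)
        cur = cur @ R.T + t
        R_acc, t_acc = R @ R_acc, R @ t_acc + t
    return R_acc, t_acc
```

With a clean cloud and a small initial rotation the loop converges to the exact transformation; with a large initial rotation it illustrates the local-minimum failure mode described above.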
ICP variants have tried to solve several problems of the original method, such as sensitivity to noise or input sparsity [icpv1, icpv2], with the drawback of being slower and demanding more computing resources. The most prominent problems in these solutions are the convergence to local minima and the severe sensitivity to the initial conditions. While many methods have tried to address these issues [icpl1, icpl2], the instability with respect to the input transformation was still present, with running times slower than the original method by up to two orders of magnitude.
Graph neural networks
The term Graph Neural Networks (GNNs), coined in [scarselli2008graph], refers to the family of neural architectures designed for graph-structured data. [duvenaud2015convolutional] defines the convolution operation on graphs in a way analogous to the definition of CNNs for images, although here, unlike grid-structured data, new obstacles arise, such as a varying number of neighbors and irregularity in data order. Among the class of deep graph neural networks are architectures that work on geometric data, such as point clouds and meshes.
PointNet [pointnet] was the first to show how a simple Multi-Layer Perceptron (MLP) per point, followed by global pooling, can extract meaningful high-dimensional features for both classification and segmentation tasks. PointNet++ [pointnetpp] and DGCNN [dgcnn], which came after, exemplified the strength of the message-passing concept by aggregating neighbor point features using another learnable function. While PointNet++ uses the Euclidean k-nearest neighbors as the graph topology throughout training, DGCNN defines the graph structure dynamically, by the k-nearest neighbors in the high-dimensional space per layer. PointCNN [pointcnn] applies a learnable transformation on the point cloud, followed by a Euclidean convolution, while [sonet] presents a hierarchical feature extraction for point clouds, and [kpconv] offers adaptive convolution kernels based on the point neighborhood density and topology.
Learning point cloud registration
Harvesting the power of graph neural networks, many new methods for the rigid registration problem were proposed, providing significant improvements over axiomatic techniques. One such milestone is Deep Closest Point (DCP) [dcp], which proposed a feature learning scheme followed by a soft alignment of the two point clouds. While DCP suggested many new ideas in the domain, robustness to outliers was still an issue, as it took all points into account when evaluating the transformation. Following DCP, many iterative methods [prnet, pointnetlk, itnet] extended the feature-matching idea, where the general scheme was to learn the mapping, apply the inferred transformation to the source point cloud, and learn a new alignment map, repeating until convergence. Among these methods is PRNet [prnet], which tries to solve the outlier sensitivity by choosing a subset of points in the source and target shapes and evaluating the correspondence between the subsets. While PRNet reduces the chance of outliers, it suffers from accuracy degradation due to the limitations of its sampling. Indeed, PRNet is not on par with state-of-the-art results in the field.
A significant drawback of all iterative methods is the run-time performance, which depends linearly on the number of refinement steps, making some of them unsuitable for real-time applications.
Lately, a new set of methods [deepglobal, robustpointcloud] offered to fuse learned features with spatial coordinates throughout training, as initially introduced in PointNet++ [pointnetpp], to boost the results. While these methods indeed show relative improvements compared to the previously presented works, using spatial coordinates in deeper stages of the network harms the ability to work under considerable transformations, such as rotations larger than 90 degrees, which are common in real-world applications.
3 Deep Weighted Consensus
Deep Weighted Consensus is a self-supervised deep network designed to solve the rigid alignment problem of 3D shapes.
In short, DWC extracts rotation-invariant features of the input point clouds and builds local and global features for the two shapes. The fusion of these high-dimensional features defines the soft correspondence mapping between the shapes, computed via the cosine similarity of the features. Next, the network defines a probability distribution based on the certainty of each source point in its map, and samples confident points that partake in the consensus problem. At inference, DWC uses the sampled points to find the optimal rigid transformation (see Figure 2 for an illustration).
Rotation invariant descriptors
The first step of the DWC pipeline is to extract rotation-invariant (RI) descriptors out of the Euclidean coordinates of the input point clouds. These features are the input descriptors for the Feature Extraction Network (FEN). The registration problem in the full range of SO(3) is hard, and state-of-the-art methods such as [dgcnn, pointcnn, sonet] fail to generate meaningful representations under such deformations, as proven in [rot1, rot2] and demonstrated in the experimentation shown in Section 4.
One reason networks that use the Euclidean coordinates as inputs to the FEN struggle is the predicted transformation output for symmetric shapes. On symmetric shapes transformed by a rotation, such networks will suggest the exact opposite transformation to the correct rotation, as locally, symmetric parts have identical embeddings given the same input features. An illustration of this phenomenon can be seen in the plane example in Figure 1.
Inspired by the local rotation-invariant features presented in [rot1], we present global and local features for every source point, conditioned on its neighborhood, which for point clouds is usually taken to be the k-nearest neighbors.
Our features need neither normal approximations as [rot3], nor geometric median computation as [rot2], hence they are suitable for real-time applications, as demonstrated in the timing evaluation provided in the supplementary.
Given the Euclidean center C of the point cloud, a source point p with neighborhood N(p), and c, the Euclidean center of N(p), the rotation-invariant descriptors of p are built from the relative geometry of p, c, and C: the lengths of the vectors connecting them, where, e.g., p - C is the vector from C to p in 3D space, and the angle between p - C and p - c. The aggregated feature concatenates these quantities per point.
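As a rough sketch of the idea (the exact descriptor set of the paper is richer; the quantities below are illustrative), rotation-invariant per-point features can be built from distances and an angle among the point, its local neighborhood center, and the global center:

```python
import numpy as np

def ri_descriptors(pts, k=8):
    """Illustrative rotation-invariant descriptors (distances and an angle).

    Every quantity depends only on relative geometry, so it is unchanged by
    any global rotation of pts. pts: (N, 3) array, k: neighborhood size.
    """
    C = pts.mean(0)                                  # global Euclidean center
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    knn = d2.argsort(axis=1)[:, 1:k + 1]             # k nearest neighbors (excl. self)
    c = pts[knn].mean(1)                             # local center of each neighborhood

    v1, v2 = pts - C, pts - c                        # global / local relative vectors
    n1 = np.linalg.norm(v1, axis=1)
    n2 = np.linalg.norm(v2, axis=1)
    cosang = (v1 * v2).sum(1) / np.maximum(n1 * n2, 1e-12)
    # distances + angle: all invariant to a global rotation of the input
    return np.stack([n1, n2, np.linalg.norm(c - C, axis=1), cosang], axis=1)
```

Rotating the input cloud leaves these descriptors unchanged, which is exactly the property the FEN input requires.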
Feature Extraction Network
The objective of the feature extraction network is to embed the two unaligned input point clouds into a common high-dimensional feature space. The FEN does this by combining two variants of DGCNN [dgcnn], namely the classification and segmentation networks. For both networks, we use the representations generated before the last aggregation function, and define them such that for a point cloud with N points, the dimensions of the global and the per-point local output features are network hyper-parameters.
As opposed to the original paper [dgcnn] or subsequent works [dcp, dvir3], the FEN does not use the dynamically changing neighborhood of the graph when applying convolutions in deeper layers of the network, as it causes message passing between distinct parts of the graph that may have similar characteristics. While this is a good quality for segmentation networks, identifying graph symmetries is crucial for solving the alignment problem. Numerically, we have found that this change improves results substantially (Section 4.4).
The final step of the FEN is to fuse the global and local features per point. Given the global and local representations, we concatenate the global feature with each per-point local feature, and propagate the result through a series of dense non-linear layers, yielding the final per-point embedding. We mark this Global-Local-Fusion process by GLF. A detailed layout of the FEN is provided in the supplementary.
Soft correspondence mapping
DWC has the ability to match key points that are similar between the source and target shapes with high probability, and to extract the transformation using these points. Therefore, DWC avoids using outliers in the transformation computation with high probability, in contrast to previous works [dcp, pointnetlk]. To find these key points we define the soft correspondence mapping P, given by the cosine similarity of the latent representations of the two shapes.
P is a pseudo-probability matrix (it contains negative values and its rows do not sum to 1), where P_ij can be interpreted as the probability that source point i corresponds to target point j.
Using P, we define the point-wise mapping between the source and target shapes to be the most probable target point per source point, i.e., the argmax of each row of P.
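A minimal sketch of the soft correspondence matrix and the induced point-wise map, assuming row-wise argmax selection as described above:

```python
import numpy as np

def soft_correspondence(fs, ft):
    """Cosine-similarity matrix P between source/target embeddings, and the
    induced hard map m(i) = argmax_j P[i, j]. fs: (N, d), ft: (M, d)."""
    fs_n = fs / np.linalg.norm(fs, axis=1, keepdims=True)
    ft_n = ft / np.linalg.norm(ft, axis=1, keepdims=True)
    P = fs_n @ ft_n.T            # P[i, j] in [-1, 1]; rows need not sum to 1
    mapping = P.argmax(axis=1)   # most likely target index per source point
    return P, mapping
```

When the two embedding sets are identical up to a permutation, the argmax map recovers that permutation exactly.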
In the theoretical case of perfect samples, given 3 non-collinear points that are correctly matched between the graphs by the mapping, a closed-form solution to the problem is feasible. In the case of noise and outliers in the mapping, an optimal solution in the least-squares sense was suggested in the Kabsch algorithm [rtsolve], which solves for the optimal rotation and translation given the singular value decomposition of the cross-covariance matrix of the matched points.
In the paragraphs that follow, we explain how a sampling procedure based on defines the subset of points that we use as input to the Kabsch algorithm.
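For reference, a compact NumPy implementation of the Kabsch solver (the standard SVD solution with the reflection correction; variable names are ours):

```python
import numpy as np

def kabsch(src, tgt):
    """Least-squares optimal (R, t) such that tgt ~= src @ R.T + t."""
    sc, tc = src.mean(0), tgt.mean(0)
    H = (src - sc).T @ (tgt - tc)             # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tc - R @ sc
    return R, t
```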
Confidence based sampling
The soft correspondence mapping defines a dense alignment between the shapes. DWC presents a differentiable reduction procedure from this dense mapping to the rigid alignment solution. Here, unlike previous works that used all points [dcp, pointnetlk], or chose a fixed subset of points using a heuristic [prnet], we construct a categorical distribution over the source points, by which we sample the points that participate in the solution to the alignment problem.
We define a confidence metric c_i per source point from its row in P. c_i is a proxy for the confidence level of source point i in its mapping; that is, for a point where DWC has high certainty in the mapping, c_i will be high, while for points where the network is not certain, c_i will be low (while not common, degenerate values of c_i are clipped to zero). We then normalize the confidences to sum to one, which defines the confidence sampling distribution over the source points.
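A sketch of the confidence distribution, assuming the confidence of a point is taken as the peak of its row in P and that negative values are clipped (both are our assumptions; the paper defines the metric on P as well):

```python
import numpy as np

def confidence_distribution(P):
    """Turn the soft map P into a sampling distribution over source points."""
    conf = np.maximum(P.max(axis=1), 0.0)  # clip rare negative peaks to zero
    return conf / conf.sum()               # normalize to a categorical distribution

def sample_confident(P, k, rng):
    """Draw k distinct source indices according to their confidence."""
    D = confidence_distribution(P)
    return rng.choice(len(D), size=k, replace=False, p=D)
```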
In what follows we explain how this distribution defines both the registration solution and the optimization objective.
Confidence based consensus
Based on the confidence distribution over the source points, DWC samples k source points (k is a hyper-parameter set in the experiments). These points are used both for the optimization of the dense alignment, and for the solution of the optimal transformation parameters using the SVD decomposition. Using the described formulation, we allow the network to learn which points are the most significant for the matching, and to factor them accordingly.
At inference, we sample several "experiment" groups uniformly from the k confident points, each group containing the number of points fed to the alignment solver. Since the k points were drawn according to the confidence-based distribution, sampling uniformly from them keeps each "experiment" distributed according to that distribution. We set the chamfer distance [chamfer] between the target shape and the source shape after applying the inferred transformation as the evaluation metric. The extracted transformation parameters of the experiment with the lowest chamfer distance are chosen as the output. Although our inference scheme can be seen as a weighted variant of RANSAC [ransac], a major difference is that, unlike previous methods that used RANSAC, our consensus voting is confidence-based and takes part in the learning pipeline itself, not only at test time. The incorporation of our weighted consensus problem into the network leads to faster convergence and improved results, as demonstrated in the experiments (Section 4).
The fact that DWC performs the experiments separately is not only beneficial for achieving the highest score among all state-of-the-art methods, but also provides a leap in performance due to parallel computing utilization; solving many small systems with few equations each is up to two orders of magnitude faster than solving one system with all the equations, as done by [prnet, dcp, pointnetlk].
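The consensus procedure can be sketched as follows (a simplified single-threaded version; `n_exp` and `n_pts` are illustrative hyper-parameter names, not the paper's):

```python
import numpy as np

def _kabsch(src, tgt):
    # Closed-form least-squares rotation/translation (see previous section).
    sc, tc = src.mean(0), tgt.mean(0)
    U, _, Vt = np.linalg.svd((src - sc).T @ (tgt - tc))
    R = Vt.T @ np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))]) @ U.T
    return R, tc - R @ sc

def chamfer(a, b):
    """Symmetric chamfer distance between two point clouds."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def consensus_vote(src, tgt, mapping, conf, n_exp=32, n_pts=8, seed=0):
    """Sample n_exp subsets proportionally to confidence, solve each small
    Kabsch problem, and keep the transformation with the lowest chamfer."""
    rng = np.random.default_rng(seed)
    p = conf / conf.sum()
    best = (np.inf, None, None)
    for _ in range(n_exp):
        idx = rng.choice(len(src), size=n_pts, replace=False, p=p)
        R, t = _kabsch(src[idx], tgt[mapping[idx]])
        d = chamfer(src @ R.T + t, tgt)
        if d < best[0]:
            best = (d, R, t)
    return best[1], best[2]
```

The per-experiment solves are independent, which is what makes the batched GPU version of this loop so cheap.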
DWC optimizes the feature correlation directly by presenting three constraints on the soft correspondence matrix. During our self-supervised training, each source shape is augmented by rotation, translation, and noise. As the same points are "chosen" for both the source and the target, the dense mapping between them is the identity map. We use this observation to define the hard mapping loss, which penalizes the diagonal entries of P for deviating from 1, striving to have the cosine similarity between corresponding points equal 1, de facto teaching the network to be invariant to the augmentations presented.
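Assuming the diagonal form described above (our reading; the exact normalization may differ), the hard mapping term is a one-liner:

```python
import numpy as np

def hard_mapping_loss(P):
    """Self-supervised hard-mapping term: under augmentation, source point i
    corresponds to target point i, so the diagonal similarity is pushed to 1.
    (Assumed mean form over the diagonal of P.)"""
    return float((1.0 - np.diag(P)).mean())
```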
Furthermore, DWC presents a novel contrastive loss that motivates the embedding of a source point to be close to the neighbors of its corresponding target point, while far from the points not in that target point's proximity. The neighborhood of a target point is defined to be its Euclidean nearest neighbors. We denote by A the adjacency matrix of the target shape induced by these neighborhoods, with an entry of 1 for each neighbor pair and 0 otherwise.
We define the negation of A, with an entry of 1 for non-neighbors and 0 otherwise. Using the above notation, the positive and negative contrastive losses are defined per point, pulling the similarity of positive pairs above a positive margin and pushing the similarity of negative pairs below a negative margin. The contrastive loss incorporates two significant advantages over previous methods:
By applying the contrastive loss we fuse the topology of the graphs not only to the learning process but to the objective function itself.
As all target points take part in the adjacency structure, the network is exposed to all points, and not only to a chosen subset.
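A possible hinge-style realization of the two contrastive terms (the margin form is our assumption; only the structure, positive pairs pulled up and negative pairs pushed down, is taken from the text):

```python
import numpy as np

def contrastive_losses(P, A, m_pos=0.8, m_neg=0.2):
    """Margin-based contrastive terms on the similarity matrix P.

    P: (N, N) cosine similarities, A: (N, N) 0/1 target adjacency.
    Positive pairs are pushed above m_pos, negative pairs below m_neg.
    Returns per-source-point positive and negative losses.
    """
    A_bar = 1.0 - A  # negation of A: 1 for non-neighbors, 0 otherwise
    pos = (A * np.maximum(m_pos - P, 0.0)).sum(1) / np.maximum(A.sum(1), 1)
    neg = (A_bar * np.maximum(P - m_neg, 0.0)).sum(1) / np.maximum(A_bar.sum(1), 1)
    return pos, neg
```

A well-trained embedding (neighbors similar, non-neighbors dissimilar) drives both terms to zero.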
We further define the sampling count of each source point as the number of times it was chosen by the confidence-based sampling, and factor the contrastive loss of each point accordingly. By integrating the sampling counts, the contrastive loss is weighted based on the confidence of each source point. This compels DWC to mark points as confident only if they are similar to the neighborhood of their matched target point, increasing the chances that they are good "anchor" points for the alignment solution.
The final optimization function of DWC is the sum of the hard mapping loss and the weighted positive and negative contrastive losses.
Table 1 compares the methods under three settings: unseen point clouds, unseen categories, and Gaussian noise.
4 Experiments
4.1 Datasets and Metrics
We compare DWC against two axiomatic methods, ICP [icp] and Go-ICP [icpl1], and five different learnable models, DCP [dcp], PR-net [prnet], PointNetLK [pointnetlk], IT-Net [itnet] and RPM-net [robustpointcloud]. We used the publicly available code released by the authors for all of the above methods.
The two datasets used for the evaluation are:
ModelNet40 [modelnet]
A point cloud dataset containing 12,311 synthetically generated shapes from various categories.
FAUST scans [faust]
Real human scans of 10 different subjects in 30 different poses each, with about 80,000 points per shape. The FAUST scans suffer from partiality, noise, and uneven sampling.
While ModelNet40 is extremely popular in this domain, we believe evaluating on non-synthetic datasets such as FAUST is extremely important for assessing the robustness of the methods and their ability to perform well in real-world scenarios.
Furthermore, unlike ModelNet40, which also has ground-truth information of the per-point normal to the surface, non-synthetic point clouds can only approximate such information, which is noisy and partial [normal1, normal2, normal3]. In order to suit real-world datasets, our method uses only the Euclidean coordinates as inputs, unlike RPM-Net [robustpointcloud], which also demands the existence of the point normal to the surface.
To assess the performance of DWC on the rigid alignment problem, we compare the predicted with the ground-truth transformations using the root mean squared error (RMSE) and mean absolute error (MAE) metrics. This results in four metrics (two each for rotation and translation), where lower is better and zero is the optimal score. We measure the rotation difference using Euler angles and report the score in units of degrees.
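For concreteness, the rotation metrics can be computed as follows (the Euler convention below is our assumption; the paper does not state which axis order it uses):

```python
import numpy as np

def rotation_errors_deg(R_pred, R_gt):
    """RMSE and MAE between predicted and ground-truth rotations, measured
    on Euler angles in degrees (XYZ convention assumed, no gimbal guard)."""
    def euler_xyz(R):
        return np.degrees(np.array([
            np.arctan2(R[2, 1], R[2, 2]),
            np.arcsin(np.clip(-R[2, 0], -1.0, 1.0)),
            np.arctan2(R[1, 0], R[0, 0]),
        ]))
    diff = euler_xyz(R_pred) - euler_xyz(R_gt)
    return float(np.sqrt((diff ** 2).mean())), float(np.abs(diff).mean())
```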
4.2 ModelNet40
ModelNet40 [modelnet] consists of 12,311 CAD models from various categories, such as planes, furniture, cars, etc. We do not use any input features other than the Euclidean coordinates, and unlike previous works [pointnet, dcp, prnet], we do not rescale the shapes to fit the unit sphere, as we found our network is resilient to the small scale deformations this dataset holds. To further emphasize the rotation invariance of DWC, we provide Figure 4, where we evaluate the methods when applying rotations on the full spectrum of SO(3), in contrast to [pointnet, dcp, prnet, robustpointcloud], where the rotations are limited to a restricted range. While testing with such a partial spectrum of SO(3) is an interesting proof of concept, most real-world applications encounter more extreme rotations, as in the case of multi-view registration, for example. In order to fairly compare the performance of DWC and the other methods, we compare the registration performance under the restricted rotation range in Table 1, while showing the incompetence of all other architectures under large rotations in Figure 4. During training, we follow others [pointnet, dcp, prnet] and subsample 1024 points randomly for each source shape. We then draw a random transformation from SO(3) (the deformation parameters) and set the target shape to be the result of applying it to the source. For a fair and full comparison, we re-train the baselines for each experiment until convergence, with rotations in the relevant spectrum.
ModelNet40 experimentation is divided into 4 sub-experiments as follows:
Random train/test split
In this experiment the dataset is split randomly into 9,843 train samples (80% of the dataset), taking the rest as test samples, i.e., samples that the network does not see in any deformation during training. The left part of Table 1 presents the results in the described setting, where we achieve large margins in comparison to all other methods. Specifically, we see a substantial relative improvement in rotation RMSE compared to the second-best model (RPM-net).
Unseen categories
To evaluate the generalization ability of DWC to scenarios where the test data statistics are not present during training, we split ModelNet40 such that some of the categories are not present during training. Specifically, we train on the first 20 categories and test on the rest. In the resulting split we train on categories such as planes and cars, but evaluate, for example, on chairs and plants. The middle of Table 1 shows the results for this experiment, where again, we present superior results compared to all others. Besides DWC and RPM-net, all methods show large drops in performance compared to the previous experiment, varying from 20% to 30% relative degradation. We attribute our stability to the consensus voting, where we suggest and vote for multiple transformations, instead of solving one globally optimal alignment problem.
Gaussian noise
One of the biggest caveats of ICP is its lack of noise resilience, which may dramatically harm its results. It is thus important to test noise durability. We evaluate this by adding random Gaussian noise to the target points. In our setup, the added noise is 5 times higher than in previous papers, to emphasize our noise-resistance capabilities. In detail, we add Gaussian noise to each point in the cloud, and do not perform any noise clipping, unlike others. Without clipping, there may be significant outliers, as usually happens with real-world sensors. The right part of Table 1 shows the results of this test. As in the two previous experiments, we present results that are up to 50% better than previous state-of-the-art methods, with a substantial margin of 11% over the second-best model. This advantage is due to the confidence sampling (Section 3.1), which assigns low confidence to outliers, so these points are not taken into account in the transformation computation.
Noise and Rotation resilience
To stress the most prominent traits of DWC, which are the rotation invariance and noise resilience, we provide figures 4,4, comparing the effect of increasing the noise and rotation on the different methods.
From the noise resilience test, we see an almost linear dependence between the noise and the RMSE for all methods, where RPM-Net shows better results for lower noise levels and DWC performs best from that point onward.
For the rotation experiment, each method was trained for each rotation factor until convergence, resulting in 100 experiments per method over the rotation range. DWC shows a 0.3% degradation in the results, from 1.51 RMSE to 1.55 RMSE, while all other methods become irrelevant, with RMSE ranging from 80 to 120 when exposed to the full spectrum of SO(3). Such high RMSE reveals the impracticality of these methods under big deformations, as seen in Figure 1.
4.3 FAUST scans
The FAUST dataset [faust] contains 300 real human scans, without any post-acquisition filters, such as denoising filters or shape completion. The resulting dataset contains noisy point clouds with partial information in places that were occluded from the acquiring sensor. Evaluating rigid alignment in real-world applications is crucial; thus, proving the ability to generate reliable results for a dataset such as FAUST is important.
In order to further examine the generalization ability of the evaluated models, we do not train on FAUST scans, and test the converged models trained on ModelNet40 (Figure 5).
In Table 2 we present two evaluations: the first, where we test the models trained with bounded rotations, and the second, where we evaluate the architectures after training on ModelNet40 with the full spectrum of SO(3).
RPM-net requires point normals as input features. As these are not provided by FAUST, we use RNe [normal1], an axiomatic robust normal estimation for point clouds, to make the inference on FAUST relevant for RPM-net as well. DWC presents superior results to all methods, with trends that correlate with the ModelNet40 evaluations. As previously shown in Figure 4, all other methods fail when exposed to the full spectrum of SO(3). As for RPM-net, which was the second-best model on ModelNet40, we see a dramatic shift, emphasizing the disadvantage of normal-based methods in the point-cloud domain.
Table 3 reports the ablation results on ModelNet40 [modelnet] and FAUST scans [faust].
4.4 Ablation study
Table 3 presents an ablation study for DWC, evaluated on the ModelNet40 train-test split, and FAUST scans datasets with rotations from the full spectrum of SO(3).
The following variants are considered:
(i) RI features - We replace the RI features (the RI features are not extra parameters, but rather are derived directly from the Euclidean coordinates) with the Cartesian coordinates of the shapes as the input descriptors to the FEN module.
(ii) Global features - We discard the global feature vector generated by the classification module. As a consequence, the GLF step is discarded as well.
(iii) Static-Graph - We use the dynamic graph topology suggested in DGCNN [dgcnn] in the FEN, instead of the topology defined by the local Euclidean neighborhood.
(iv) Confidence sampling - In order to evaluate the significance of the confidence-based sampling, we replace it with a uniform sampling over all source points.
(v) Consensus voting (full) - We replicate the inference procedure presented in DCP [dcp], and replace our consensus voting with solving the alignment problem on all the points in the cloud.
(vi) Consensus voting (top confidence) - We replicate the inference procedure presented in PRNet [prnet], and replace our consensus voting with solving the alignment problem on the source points ranked highest by the confidence metric.
(vii) Contrastive loss - Our novel contrastive loss changes the training paradigm in the rigid correspondence domain to training the dense mapping directly. We test the effectiveness of this concept by replacing the contrastive loss with a loss on the predicted transformation parameters, as previous works have done.
The ablation study affirms the significance of all the mentioned parts of the network. The use of RI features presents dramatic performance enhancements, improving RMSE by 8 points on average. Nevertheless, we see that even without them, our results on the full spectrum are dramatically better than those of other methods (Figure 4). After the RI features, the biggest contributions come from the contrastive loss, the consensus voting, and our confidence metric, all of which offer a paradigm shift from the mainstream approach to solving the rigid alignment problem. In the supplementary we provide a performance analysis of the baseline models when their input features are replaced with our RI features.
5 Conclusion
DWC offers a paradigm shift for solving the rigid alignment problem by explicitly optimizing the dense map. In addition, unlike previous methods that solved for the globally optimal parameters, which tend to be extremely sensitive to outliers, DWC presents a differentiable reduction procedure based on the confidence of each source point in its correspondence map, and weights the alignment solution accordingly. We exemplify, under multiple experiments, how these novelties provide a significant boost in performance on all popular benchmarks and, for the first time, solve the alignment problem for extreme rigid transformations in the full spectrum of SO(3).