PyTorch implementation of DeepUME: Learning the Universal Manifold Embedding for Robust Point Cloud Registration (BMVC 2021)
Registration of point clouds related by rigid transformations is one of the fundamental problems in computer vision. However, a solution to the practical scenario of aligning sparsely and differently sampled observations in the presence of noise is still lacking. We approach registration in this scenario with a fusion of the closed-form Universal Manifold Embedding (UME) method and a deep neural network. The two are combined into a single unified framework, named DeepUME, trained end-to-end and in an unsupervised manner. To successfully provide a global solution in the presence of large transformations, we employ an SO(3)-invariant coordinate system to learn both a joint-resampling strategy of the point clouds and SO(3)-invariant features. These features are then utilized by the geometric UME method for transformation estimation. The parameters of DeepUME are optimized using a metric designed to overcome an ambiguity problem emerging in the registration of symmetric shapes, when noisy scenarios are considered. We show that our hybrid method outperforms state-of-the-art registration methods in various scenarios, and generalizes well to unseen data sets. Our code is publicly available.
The massive development of 3D range sensors [collis1970lidar, smisek20133d] led to an intense interest in 3D data analysis. As 3D data is commonly acquired in the form of a point cloud, many related applications have been studied in recent years for that data form. In a wide range of applications, specifically in medical imaging [hill2001medical], autonomous driving [bresson2017simultaneous] and robotics [durrant2006simultaneous], the alignment of 3D objects into a coherent world model is a crucial problem. Point cloud rigid alignment is a deep-rooted problem in computer vision and graphics, and various methods for point cloud registration have been suggested [pomerleau2015review]. In general, the point clouds to be registered are sampled from a physical object. When two point clouds are sampled at two different poses of an object, and especially when sampling is sparse, it is unlikely that the same set of object points is sampled in both. The difference between the sampling patterns of the object may result in model mismatch when performing registration, and we therefore refer to it as sampling noise. Registration of point clouds in the presence of noise has been extensively studied by both closed-form [besl1992method, yang2015go, zhou2016fast, rusinkiewicz2001efficient] and learning-based [aoki2019pointnetlk, wang2019deep, yuan2020deepgmr, Fu2021RGM] methods. In most of these works, the noise is modeled as an Additive White Gaussian Noise (AWGN) on the coordinates. However, in many registration applications the point clouds are sampled differently and sparsely. In such applications, these sampling effects are dominant and adversely affect registration performance, yet they cannot be modeled by an AWGN. In this work we address the global registration of 3D under-sampled point clouds, where the point clouds are differently sampled, and the samples are subject to the presence of an additive coordinate noise.
Our strategy is to combine the closed-form Universal Manifold Embedding (UME) registration method [efraim2019universal] and a learning-based framework. The UME non-linearly maps functions related by geometric transformations of coordinates (rigid, in our case) to matrices that are linearly related by the transformation parameters. In the UME framework, the embedding of the orbit of possible observations on the object into the space of matrices is based on constructing an operator that evaluates a sequence of low-order geometric moments of some function defined on the point clouds to be registered. This representation is therefore more resilient to noise than local operators, as under reasonable noise the geometric structure of the point cloud is preserved. Since the UME is an operator defined on functions of the coordinates, these functions (features) need to be invariant to the transformation in order to enable registration. While in the original UME framework the invariant features are hand-crafted functions, in this work we learn them from data using an unsupervised deep neural network architecture. The proposed framework is a cascade of three blocks: The first is a pre-processing step that employs an SO(3)-invariant coordinate system, constructed using PCA, to enable estimation of large transformations. The second block is the neural network, designed to implement joint-resampling and embedding of the raw point clouds. The final block implements the UME. Our trained model is tested on both seen and unseen data sets, to demonstrate generalization capabilities. Inference performance is evaluated for different noise scenarios using metrics that are invariant to the ambiguity arising in symmetric-shape registration when noisy scenarios are considered. Our main contributions are as follows:
We integrate, for the first time, the closed-form UME registration methodology and a data-driven approach, by both adapting the UME method to the DNN framework and designing the DNN architecture to optimize the UME performance. We address the highly practical yet less studied case of registering point clouds that are sparsely and randomly sampled in the presence of large transformations (a full range of rotations). Our hybrid model is trained end-to-end, is label-free, and achieves substantial performance gains compared to competing state-of-the-art methods.
To enhance the DeepUME performance in the presence of large transformations, and since the point clouds are sparsely and differently sampled, we present a learned joint resampling of both point clouds to be registered, using a novel approach for integrating a PCA module, a Transformer module and a DGCNN module. By mapping the input data to an SO(3)-invariant coordinate system, we overcome the inability of DGCNN to learn invariant features for registration in the case of large rotations. The Transformer is used for implementing a joint-resampling strategy for both point clouds to be registered. However, while learned joint-resampling procedures are usually applied in a high-dimensional feature space, in our framework the joint-resampling is aimed at "equalizing" the differences in the sampling patterns of the observations, and is therefore implemented in the low-dimensional coordinate space.
There are many approaches to 3D point cloud registration. One of the commonly practiced approaches is to extract and match spatially local features, e.g., [Guo2016, SPINIM, RUSU2008, Yang2016, Yang2017, LRF]. Many of the existing methods are 3D adaptations of 2D image processing solutions, such as variants of 3D-SIFT [3DSIFT] and the 3D Harris key-point detector [Harris3D]. In 3D, with the absence of a regular sampling grid, artifacts, and sampling noise, key-point matching is prone to high outlier rates and localization errors. Hence, global alignment estimated by key-point matching usually employs outlier rejection methods such as RANSAC [fischler1981random], followed by a refinement stage using local optimization algorithms [ICP1, Zhang1994, POTTMANN2004, 3DNDT]. DGR [Choy_2020_CVPR] follows a similar paradigm, but its inlier detection is learnable. Numerous works have been proposed for handling outliers and noise [chetverikov2005robust], formulating robust minimizers [fitzgibbon2003robust], or proposing more suitable distance metrics. The standard algorithm in the category of refinement algorithms, also known as local registration, is the Iterative Closest Point algorithm (ICP) [ICP1, Zhang1994]. It constructs point correspondences based on spatial proximity, followed by a transformation estimation step. Over the years, many variants of the ICP algorithm have been proposed in an attempt to improve the convergence rate, robustness, and accuracy of the algorithm. Registration methods are not restricted to methods based on the extraction and matching of key-points. In [isola2011makes], for example, an initial alignment is found by employing a matched filter in the frequency space of local orientation histograms. In [Shimshoni], an initial alignment is found by clustering the orientations of local point cloud descriptors, followed by estimating the relative rotation between clusters. In this work, a global closed-form solution that employs the UME [efraim2019universal] representation of the shapes to be registered is integrated; as a result, an efficient and accurate registration scheme is achieved, requiring no initial alignment. Other types of registration methods adopt learning-based techniques.
Pioneered by PointNet [qi2017pointnet] and subsequently by DGCNN [wang2019dynamic], point cloud representations are learned from data in a task-specific manner. These can, in turn, be leveraged for robust point cloud registration, e.g., [qi2017pointnet++, su2018splatnet, wang2019deep, wang2019prnet]. PointNetLK [aoki2019pointnetlk] minimizes a learned feature distance by a differentiable Lucas-Kanade algorithm [lucas1981iterative]. DCP [wang2019deep] addresses feature matching via an attention-based module and a differentiable SVD module for point-to-point registration. The recently proposed RGM [Fu2021RGM] transforms point clouds into graphs and performs deep graph matching to extract a soft correspondence matrix from deep features. In Section 5 we show that registration performance using the aforementioned architectures deteriorates when observations are related by large rotations. We relax that limitation by employing an SO(3)-invariant coordinate system, and therefore provide a global registration solution. DeepGMR [yuan2020deepgmr] also addresses this limitation by extracting pose-invariant correspondences between raw point clouds and Gaussian mixture model (GMM) parameters, and recovers the transformation from the matched mixtures. However, its performance deteriorates in the presence of sampling noise.
A point cloud is a finite set of points in R^3. Usually, these points are samples from a physical object (we may think of it as a surface or a manifold). Viewing point clouds as sets of samples, the registration problem may be formulated as follows: Let O be a physical object and T(x) = Rx + t a rigid map (R in SO(3) is a rotation matrix and t in R^3 is a translation vector). We consider the transformed object O' = T(O). Let P and Q be two point clouds sampled from the object O and the transformed object O', respectively. In the registration problem, the objective is to estimate the transformation parameters R and t given only P and Q. Since point clouds are generated by a sampling procedure, the effects of sampling must be addressed when solving the registration problem. Ideally, the two sampled point clouds P and Q (sampled from O and O', respectively) satisfy the relation Q = T(P). Unfortunately, when point clouds are sparsely sampled at two different poses of some object, it is unlikely that the same set of object points is sampled in both: if we assume a uniformly distributed sampling pattern on a continuous surface, it may be easily proved that the probability of having such a relation is null. As we show in our experiments, once sparse and differently-sampled point clouds are considered, the sampling differences result in substantial registration errors, with different characteristics from those of AWGN on the coordinates. Under-sampled point cloud registration is considered in the following scenarios:
Full intersection (Vanilla model) - where Q = T(P).
Sampling noise - Two cases are considered: Partial intersection, where P and T^{-1}(Q) may intersect, but are not identical; Zero intersection, where P and T^{-1}(Q) have no samples in common.
Gaussian noise - Q = T(P) + N, where Q is the result of a rigid transformation of P with its coordinates perturbed by AWGN N.
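To make the three observation models concrete, the following numpy sketch simulates them for a synthetic cloud (the rotation, the survival probability p, the half-and-half split in the zero-intersection case, and the noise level sigma are illustrative choices, not the paper's experimental settings):

```python
import numpy as np

rng = np.random.default_rng(0)

P = rng.standard_normal((1024, 3))          # stand-in for a sampled shape
theta = 0.9                                 # a fixed example rotation about z
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.1, -0.3, 0.2])

# Full intersection (vanilla model): Q = T(P) exactly.
Q = P @ R.T + t

# Sampling noise (partial intersection): each point survives independently.
p = 0.7                                     # survival probability (illustrative)
P_s = P[rng.random(len(P)) < p]
Q_s = Q[rng.random(len(Q)) < p]

# Zero intersection: remove disjoint index sets from the two clouds, so that
# no point of one cloud is a transformed point of the other.
idx = rng.permutation(len(P))
P_z, Q_z = P[idx[:512]], Q[idx[512:]]

# Gaussian noise: AWGN on the coordinates of the transformed cloud.
sigma = 0.02
Q_g = Q + sigma * rng.standard_normal(Q.shape)
```

Note that in the zero-intersection construction the two clouds still cover the same underlying object, yet share no corresponding sample pairs.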
Following the strategy of integrating the UME registration methodology into a deep neural network, we both adjust the UME method [efraim2019universal], adapting it to a DNN framework, and design our architecture to optimize the UME performance. The proposed framework is a cascade of three blocks: The first is a pre-processing step that employs an SO(3)-invariant coordinate system, constructed using PCA, that enables estimation of large transformations. The second block is the neural network, designed to implement joint-resampling and embedding of the raw point clouds, aimed at "equalizing" the differences in the sampling patterns of the observations and consequently boosting the performance of the third block, which implements the UME. The framework is illustrated in Figure 2 (see supplementary for detailed illustrations). The details of its building blocks, and in particular the adaptations of the Transformer and DGCNN architectures to the UME framework, are provided in the following. In the UME registration framework the input is composed of two point clouds P and Q satisfying the relation Q = T(P), and an invariant feature f (that is, a function defined on a point cloud such that f(T(p)) = f(p) for any point p). Applying the UME operator to each of the point clouds, two matrices UME(P) and UME(Q) are obtained that are linearly related by the transformation parameters. The geometric nature of the UME motivates us to use it as a basis for our framework. While many registration methods find corresponding points in the reference and the transformed point clouds in order to solve the registration problem, in the UME methodology a new set of "corresponding points" is constructed by evaluating low-order geometric moments of the invariant feature. This is a significant advantage in noisy scenarios, where point correspondences between the reference and transformed point clouds may not exist at all.
The use of moments allows us to exploit the geometric structure of the objects to be registered, which is invariant under sampling (as long as the sampling is reliable), resulting in an improved immunity to sampling noise. The goal of the deep neural network in our framework is to construct multiple high-quality invariant features, in order to maximize the performance of the UME in various noisy scenarios. We adopt the joint usage of DGCNN and Transformer blocks [wang2019deep], which has been proven to be very efficient in creating high-dimensional embeddings for point clouds, and adapt it to the UME framework in order to learn multiple high-quality invariant features. UME registration The UME framework [itmanifolds, efraim2019universal] is designed for registering two functions with compact supports related by a geometric transformation of coordinates (rigid, affine). Zero- and first-order moments (integrals) of the functions are evaluated in constructing the UME matrices, which are linearly related by the transformation parameters. Following the principles of the UME, we derive a new discrete closed-form implementation of the UME for the registration of point clouds undergoing rigid transformations. More specifically, let P and Q be two point clouds related by a rigid transformation T. Then, an invariant feature (function) on P and Q is a function that assigns any point p and the transformed point T(p) the same value. A simple example of such an invariant feature is the one that assigns each point its distance from the point cloud center of mass. Since for finite support objects it is straightforward to reduce the problem of computing the rigid transformation to a rotation-only problem, i.e., Q = RP, we next show that the moment integral calculations involved in evaluating the UME operator may be replaced by computing moments of the invariant functions using summations:
Let R be a rotation matrix and let P and Q be two point clouds satisfying the relation Q = RP. Let f be an invariant function on P and Q. Then,

m_f(Q) = R m_f(P), where m_f(P) = sum_{p in P} f(p) p. (1)
(See the supplementary material for the proof). We call m_f(P) the moment vector of the transformation-invariant function f defined on P. Given two sets of points in R^3, {x_i} and {y_i}, satisfying the relation y_i = R x_i for all i, we may find R by a standard procedure proposed by Horn et al. [horn1988closed]. Hence, we conclude from (1) that in the absence of noise, finding a set of invariant functions f_1, ..., f_k whose moment vectors span R^3 yields a closed-form solution to the registration problem. However, in the presence of noise (sampling or additive), (1) no longer holds, as the moment vectors of the two clouds are no longer related exactly by R; in fact, with high probability the two clouds do not share any point in common. Therefore, the estimated rotation matrix is noisy. This is the point where a deep neural network comes into play. The registration error obviously depends on the difference between the moment vectors, and on the function remaining rigid-transformation invariant despite the noise. To that extent, we employ a deep neural network in order to learn how to construct good transformation-invariant functions. These invariant functions are designed to exploit the geometry of the point cloud so that their invariance to the transformation is minimally affected by the noise, resulting in a smaller registration error. Feature Extraction Towards obtaining noise-resilient SO(3)-invariant functions for effectively evaluating the UME moments, we aim at learning features capturing the geometric structure of the point cloud. This structure is determined by the point cloud coordinates (global information) and the neighborhood of each point (local information). The DGCNN block is designed to perform a per-point embedding, such that information on neighboring points is well incorporated.
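The closed-form recovery of the rotation from moment vectors can be sketched as follows. The invariant features here are simple radial powers (illustrative stand-ins for the learned features), and the rotation is recovered by an SVD-based orthogonal Procrustes step, a standard closed-form alternative to Horn's quaternion procedure:

```python
import numpy as np

def moment_vector(X, f):
    """First-order moment of an invariant function f over point cloud X (n x 3)."""
    w = f(X)                                   # per-point invariant values, shape (n,)
    return (w[:, None] * X).sum(axis=0)

def rotation_from_moments(P, Q, feats):
    """Estimate R from pairs of moment vectors via orthogonal Procrustes (SVD)."""
    M_P = np.stack([moment_vector(P, f) for f in feats], axis=1)   # 3 x k
    M_Q = np.stack([moment_vector(Q, f) for f in feats], axis=1)   # 3 x k
    U, _, Vt = np.linalg.svd(M_Q @ M_P.T)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # guard against reflections
    return U @ D @ Vt

def dist_feature(k):
    """Rotation-invariant feature: k-th power of distance to the center of mass."""
    return lambda X: np.linalg.norm(X - X.mean(axis=0), axis=1) ** k

rng = np.random.default_rng(1)
P = rng.standard_normal((500, 3))
P -= P.mean(axis=0)                            # reduce to a rotation-only problem
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
Q = P @ R_true.T

R_est = rotation_from_moments(P, Q, [dist_feature(k) for k in (1, 2, 3)])
```

In the noise-free case the moment vectors of the two clouds are related exactly by the rotation, so R is recovered to numerical precision; under sampling noise the moment vectors deviate, which is precisely where learned invariant features help.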
Each input point in the cloud is characterized by the coordinates of the points in its neighborhood in addition to its own coordinates. This approach results in a new representation of each point by a 6 x k matrix, where k is the number of nearest neighbors: each column of that matrix represents one of the neighbors and consists of the coordinates of the observed point stacked on top of the coordinates of the neighbor point. DGCNN processes one point cloud at a time, sharing its weights between the two clouds. Ideally, our embedding network should produce identical features (function values) for two corresponding points in the reference point cloud and in its transformed version. SO(3)-invariant coordinate system followed by joint-resampling Since the DGCNN weights are shared, when the coordinates of corresponding points in the two point clouds are significantly different, the learning process fails. Clearly, this situation occurs when transformations of large magnitude, e.g., large rotations, are considered (demonstrated in Section 5). Therefore, the DGCNN architecture in its original design cannot be employed for constructing invariant features when rotations by large angles are considered. We overcome this inherent difficulty by mapping the input point clouds to an alternative coordinate system, which is SO(3)-invariant: Given a point cloud P, the cloud center of mass (denoted by c) is subtracted from each point's coordinates, to obtain a centered representation. We then construct a new coordinate system for the point cloud using PCA. That is, the axes of the new coordinate system are the principal vectors of the point cloud covariance matrix given by

C = (1/|P|) sum_{p in P} (p - c)(p - c)^T.
The principal vectors form the axes of the new coordinate system, and the new coordinates of each point are the projection coefficients on these axes. Formally, for a point p, the new coordinates of p are defined to be V^T (p - c), where V is the matrix whose columns are the principal vectors. The point cloud in the new coordinates is denoted by P~. It is easy to verify (see supplementary) that the new axes (columns of the PCA matrix) are co-variant under a rigid transformation. Therefore, if Q is obtained by a rigid transformation of P, it holds that Q~ = P~. Furthermore, since the change of coordinate system is invertible, the original point cloud can be reconstructed. However, in our setting, where the point clouds to be registered are sparse, differently sampled and noisy, the relation Q~ = P~ does not hold anymore. The loss of information caused by the low sampling rate makes the resulting representations of the clouds significantly different, yielding axes that are no longer co-variant and thus projections that are no longer invariant. Nonetheless, employing the invariant representation of the point clouds, the difference between P~ and Q~ is sufficiently small to enable learning multiple invariant features using a DGCNN block. The strategy of DCP [wang2019deep] is to jointly resample the embedded point clouds in a high-dimensional feature space via a Transformer [vaswani2017attention]. In DeepUME we adopt the joint-resampling strategy, but DeepUME resamples in the low-dimensional coordinate space, rather than in the high-dimensional feature space. We note that resampling point clouds in the coordinate space has a complexity advantage, since the dimension of the coordinate space is 3, while the dimension of the embedding space is much higher (both in [wang2019deep] and in our case). In terms of architecture blocks, with this sampling approach the Transformer leads the embedding network (and not the other way around, as in [wang2019deep]).
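A minimal numpy sketch of this PCA coordinate mapping follows. The sign of each principal vector is inherently ambiguous, so a fixed disambiguation convention is applied here for reproducibility (the convention itself is our own choice, not the paper's):

```python
import numpy as np

def pca_coordinates(X):
    """Map a point cloud X (n x 3) to its PCA coordinate system."""
    c = X.mean(axis=0)
    Xc = X - c
    C = (Xc.T @ Xc) / len(Xc)             # 3x3 covariance matrix
    _, V = np.linalg.eigh(C)              # columns of V are the principal vectors
    V = V[:, ::-1]                        # sort axes by decreasing variance
    for k in range(3):                    # fix signs so axes are reproducible
        if V[np.argmax(np.abs(V[:, k])), k] < 0:
            V[:, k] *= -1
    return Xc @ V, V                      # projection coefficients, axes

rng = np.random.default_rng(2)
P = rng.standard_normal((800, 3)) * np.array([3.0, 2.0, 1.0])  # anisotropic cloud
theta = 1.1
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
Q = P @ R.T + np.array([0.3, -0.2, 0.5])

P_inv, _ = pca_coordinates(P)
Q_inv, _ = pca_coordinates(Q)
# With identical samples, the invariant coordinates of P and Q agree
# up to per-axis sign flips left by the eigenvector sign ambiguity.
```

When the two clouds contain the same samples, the projections agree (up to the sign convention); with sparse, different samplings the axes are only approximately co-variant, as discussed above.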
We therefore employ the Transformer for learning a resampling strategy of the projected samples P~ such that the resampling depends on the sampling of Q~, and vice versa. Denoting the asymmetric function of the Transformer by phi, we have

P~' = P~ + phi(P~, Q~), Q~' = Q~ + phi(Q~, P~).
The additive terms, phi(P~, Q~) and phi(Q~, P~), are optimized to improve registration, by jointly "equalizing" the sampling patterns of both point clouds. The resampled point clouds produced using the Transformer are employed for evaluating the UME moments, by re-projecting them onto the corresponding principal axes to obtain

P' = {V_P x + c_P : x in P~'}, Q' = {V_Q x + c_Q : x in Q~'}.
The point clouds P' and Q' are re-sampled versions of the original point clouds, related by the same rigid transformation that relates P and Q. We therefore apply the UME registration to the resampled point clouds P' and Q', using the DGCNN-generated functions designed to be SO(3)-invariant for differently and sparsely sampled point clouds corrupted by noise. As demonstrated by the experimental results, applying UME registration to the re-projected re-sampled point clouds provides improved performance. Loss In order to overcome the ambiguity problem in registering symmetric objects, discussed in Section 5, we adopt the Chamfer distance [barrow1977parametric]
as our loss function. That is, if T^ is the estimated transformation, the loss is the Chamfer distance between the registered clouds:

L(P, Q) = d_CD(T^(P), Q).
Using this loss, ambiguous symmetric objects do not hinder the learning process: even if an ambiguity exists and the registration is successful, the Chamfer distance remains small. The Chamfer loss function has the additional advantage that no labels are required for the learning process, which makes it unsupervised.
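The Chamfer distance, and the Hausdorff distance used alongside it as an evaluation metric, can be sketched in numpy as follows (a brute-force O(nm) pairwise computation for clarity; the squared-distance form of the Chamfer distance is one common convention):

```python
import numpy as np

def chamfer_distance(X, Y):
    """Symmetric Chamfer distance between clouds X (n x 3) and Y (m x 3):
    mean of each cloud's nearest-neighbor squared distances to the other."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # n x m squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def hausdorff_distance(X, Y):
    """Symmetric Hausdorff distance: worst-case nearest-neighbor distance."""
    d = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Tiny worked example: Y has one extra point at unit distance from X.
X = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
Y = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
cd = chamfer_distance(X, Y)   # 0 + mean(0, 0, 1) = 1/3
hd = hausdorff_distance(X, Y) # max(0, 1) = 1
```

Both distances vanish for identical clouds and, unlike the rotation RMSE, stay small whenever any symmetry-preserving transformation aligns the shapes well; the Hausdorff distance emphasizes worst-case error.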
We conduct experiments on three data sets: ModelNet40 [wu20153d], FAUST [bogo2014faust] and the Stanford 3D Scanning Repository [StanfordScanRep], where the latter two are used only for testing. We train our network using ModelNet40, which consists of 12,311 CAD models in 40 categories, of which 9,843 samples are used for training and the rest for testing. Following the experimental settings of previous works, we uniformly sample 1,024 points from each model's outer surface and further center and rescale the model into the unit sphere. In our training and testing procedure, for each point cloud P, we choose a rotation drawn uniformly from the full range of Euler angles and a translation vector drawn uniformly in each axis. We apply the rigid transformation obtained from the resulting parameters to P, followed by a random shuffling of the point order, and obtain Q. In noisy scenarios, a suitable noise is applied to Q. We train our framework in the scenario of Bernoulli noise (see below), in a fixed parameter setting, and test all scenarios using the trained configuration. Each experiment is evaluated using four metrics: the RMSE metric, both for the rotation and the translation, as well as the Chamfer distance and the Hausdorff distance [liu2018point, DBLP:conf/eccv/UrbachBL20], proposed next as alternative metrics to resolve ambiguity issues. We compare our performance with the basic implementation of the UME method (coded by ourselves), ICP (implemented in Intel Open3D [zhou2018open3d]), and four learned methods: PointNetLK and DCP (benchmark point cloud registration networks), as well as the recently proposed DeepGMR [yuan2020deepgmr] and RGM [Fu2021RGM]. We retrain the baselines, adapting the code released by the authors. The experimental results are detailed below and summarized in Tables 1 and 2; further experimental results are provided in the supplementary.
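The data-generation protocol above can be sketched as follows (the Z-Y-X Euler convention and the translation range are our assumptions; the text does not specify them):

```python
import numpy as np

rng = np.random.default_rng(3)

def euler_to_matrix(a, b, c):
    """Rotation from Z-Y-X Euler angles (one common convention)."""
    ca, sa = np.cos(a), np.sin(a)
    cb, sb = np.cos(b), np.sin(b)
    cc, sc = np.cos(c), np.sin(c)
    Rz = np.array([[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cc, -sc], [0.0, sc, cc]])
    return Rz @ Ry @ Rx

# Stand-in for 1,024 surface samples, centered and rescaled to the unit sphere.
P = rng.standard_normal((1024, 3))
P -= P.mean(axis=0)
P /= np.linalg.norm(P, axis=1).max()

angles = rng.uniform(-np.pi, np.pi, size=3)   # full range of Euler angles
t = rng.uniform(-0.5, 0.5, size=3)            # translation range is an assumption
R = euler_to_matrix(*angles)
Q = (P @ R.T + t)[rng.permutation(1024)]      # transform, then shuffle point order
```

The shuffling step matters: it removes any index-level correspondence between P and Q, so methods must recover the alignment from geometry alone.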
The ambiguity problem in the registration of symmetric objects An ambiguity problem arises in point cloud registration whenever symmetric objects are considered, as more than a single rigid transformation (depending on the degree of symmetry of the object) can correctly align the two point clouds. In the noise-free scenario such an ambiguity is trivially handled, since a fixed point constellation undergoes a rigid transformation, which in the case of nonuniform sampling guarantees the uniqueness of the solution. However, when the observations are noisy, the ambiguity is harder to resolve, as the constellation structure breaks. In that scenario, Q is a noisy version of T(P), and possibly no rigid transformation can perfectly align the point clouds. In that case, if the point clouds to be aligned represent a symmetric shape, there are multiple transformations that approximately align the two clouds together. For symmetric shapes, which are very common in man-made objects, the rotation-angle RMSE metric may assign large errors to successful registrations. To resolve this ambiguity we replace this metric by the Chamfer and Hausdorff distances, defined for point clouds X and Y by

d_CD(X, Y) = (1/|X|) sum_{x in X} min_{y in Y} ||x - y||^2 + (1/|Y|) sum_{y in Y} min_{x in X} ||x - y||^2,
d_HD(X, Y) = max{ max_{x in X} min_{y in Y} ||x - y||, max_{y in Y} min_{x in X} ||x - y|| }.
Using ModelNet40, we test performance in four scenarios: a noise-free model, sampling noise (Bernoulli noise and zero-intersection noise) and AWGN. We note that in the presence of sampling noise we indeed observe large errors in RMSE(R) due to the symmetry of objects, although the symmetric shapes are well aligned. Moreover, on the unseen data sets, where real-world data is used, which is naturally asymmetric, the rotation RMSE is indeed small and indicates successful registration. Therefore we consider the Chamfer and Hausdorff distances to be more reliable metrics for registration in the presence of symmetries. ModelNet40: Noise free model We examine the case where no noise is applied to the measurements. In that case, we see that the estimation error is practically null. DeepGMR is shown to be the second-best learned method, while all other tested methods yield large errors, meaning that the registration fails when a large range of rotation angles is considered. ModelNet40: Bernoulli noise The Bernoulli noise case is related to the scenario of sampling noise, in which the point clouds contain different numbers of points. In that case, we choose randomly (uniformly) two survival probabilities p1 and p2. Then, each point in P is removed with probability 1 - p1, and each point in Q with probability 1 - p2, independently of the rest of the points, and we perform registration on the resulting point clouds. We note that the number of points in the two resulting clouds averages to p1*N and p2*N respectively, and is likely to be different between them, while the number of corresponding points between the resulting clouds averages to p1*p2*N (as the probability for the two point clouds to share a specific point is p1*p2). We note that we were not able to evaluate RGM performance in that scenario. ModelNet40: Zero-intersection noise Zero-intersection noise is the extreme (and most realistic) case of sampling noise. In that case, we randomly choose a subset of points of Q to be removed. Next, we remove from P the points that correspond to the remaining points of Q.
Thus, by construction, no point in one cloud is the result of applying a rigid transformation to a point in the other cloud. This scenario is visualized in Figure 1, which shows registration results of several baseline methods and of our proposed method on a representative example from the unseen data set FAUST. ModelNet40: AWGN
In this set of experiments, each coordinate of every point in Q is perturbed by an additive white Gaussian random variable drawn from N(0, sigma^2), where sigma is chosen randomly. All noise components are independent and no value clipping is performed. The results in Table 1 indicate that, since the UME is an integral (summation) operator, the registration error of both the UME and the DeepUME is lower than the error of the other tested methods. We find that for a sufficiently large additive noise variance, the rotation RMSE in the AWGN and sampling-noise models are comparable, yet these different types of noise have a very different impact on registration, as illustrated in Figure 3. In order to obtain the same registration error as in the zero-intersection model on the Stanford data set, an AWGN with a variance of approximately 0.09 is required. As shown in Figure 3, such a noise causes a severe distortion, which practically makes the original shape unrecognizable. On the other hand, the original shape of the bunny in the sampling-noise scenario is well preserved.
[Table 1: registration results in the noise-free, Bernoulli-noise, and Gaussian-noise scenarios]
Unseen data sets The generalization ability of a model to unseen data is an important aspect of any learning-based framework. In order to demonstrate that our framework indeed generalizes well, we test it on two unseen data sets. The first is the FAUST data set, which contains human scans of 10 different subjects in 30 different poses each, with about 80,000 points per shape; the other is the Stanford 3D Scanning Repository. We generate the objects to be registered using a methodology similar to that employed for the ModelNet40 data set. Our framework achieves accurate registration results in all scenarios checked, and shows superior performance over the compared methods.
[Table 2: registration results on ModelNet40 [wu20153d], FAUST [bogo2014faust], and the Stanford 3D Scanning Repository [StanfordScanRep]]
We derived a novel solution to the highly practical problem of aligning sparsely and differently sampled point clouds in the presence of noise and large transformations. Our model, named DeepUME, integrates the closed-form UME registration method into a DNN framework. The two are combined into a single unified framework, trained end-to-end and in an unsupervised manner. DeepUME employs an SO(3)-invariant coordinate system to learn both a joint-resampling strategy of the point clouds and SO(3)-invariant features. The resampling is performed in the low-dimensional coordinate space (rather than in the high-dimensional feature space), to minimize the effect of the sampling noise on the registration performance. The constructed features are utilized by the geometric UME method for transformation estimation. The parameters of DeepUME are optimized using a metric designed to overcome the ambiguity emerging in the registration of symmetric shapes, when noisy scenarios are considered. We show that our hybrid method outperforms state-of-the-art registration methods in various scenarios, and generalizes well to unseen data sets. Future work will extend the proposed method to the registration of sub-parts and key-points, incorporating both local and global information to allow the registration of partially overlapping scenes.