Point clouds are a major form of 3D information representation in perception, with numerous applications today, including intelligent robots and self-driving cars. The registration of two different point clouds capturing the same underlying object or scene refers to estimating the relative transformation that aligns them. Point cloud registration enables information aggregation across several observations, which is essential for many tasks such as mapping, localization, shape reconstruction, and object tracking. Given perfect point-to-point pairings between the two point clouds, the relative transformation can be solved in closed form using Horn’s method [Horn:88].
However, such a pairing is unavailable or corrupted in real-world problems, which poses a significant challenge for point cloud registration, namely data association. In the deep learning era, the strong feature learning capacity of neural networks enables better description and discrimination of points, improving their matching. Overall, these methods can be categorized as correspondence-based registration.
An alternative approach, called correspondence-free registration, is to completely circumvent the matching of points by learning a global representation for the entire point cloud and estimating the transformation by aligning these representations instead of the original points. Like correspondence-based registration, correspondence-free registration can be performed with or without deep learning. The main obstacle in this type of approach is that the mapping between the input Euclidean space and the output feature space realized by neural networks is nonlinear and obscure. Thus, it is common to rely on iterative optimization with local linearization, such as Gauss-Newton [wedderburn1974quasi]. The performance of these approaches deteriorates as the initial pose difference grows.
In this paper, we bring equivariant neural networks and implicit shape learning into correspondence-free registration by constructing a feature space with two major advantages. First, thanks to the equivariance property, the feature space preserves the same rotation operation as the Euclidean input space, allowing us to solve the feature-space registration in closed form using Horn’s method. As a result, the accuracy is independent of the initial pose difference. Second, implicit shape learning encourages the network to learn features describing the underlying geometry rather than the specific points sampled on the geometry, which makes the features more robust to noise. In this work, we focus on the rotational registration task.
In particular, this work has the following contributions.
We propose a correspondence-free point cloud registration method via $\mathrm{SO}(3)$-equivariance representation learning and closed-form pose estimation.
The proposed method achieves accurate registration results independent of initial pose differences with robustness against noisy points.
Open source software will be available (after the final decision).
II Preliminaries on Equivariance and Vector Neurons
Symmetry (or equivariance) in a neural network is an important ingredient for efficient learning and generalization. It also allows us to build a clear connection between changes in the input space and changes in the output space. For example, convolutional neural networks are translation-equivariant, which lets a network generalize to image shifts and saves a large number of parameters. However, other symmetries are less explored. For example, most point cloud processing networks (e.g., PointNet [qi2016pointnet], DGCNN [wang18dgcnn]) are not equivariant to rotations.
A function $f: X \to Y$ is equivariant to a set of transformations $G$ if, for any $g \in G$, $f$ and $g$ commute, i.e., $f(g \cdot x) = g \cdot f(x)$ for all $x \in X$. For example, applying a translation to a 2D image and then passing it through a convolution layer is identical to processing the original image with the convolution layer and then shifting the output feature map. Therefore, convolution layers are translation-equivariant.
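The commutation property can be checked numerically. The following is a minimal sketch, not taken from the paper: it uses a 1D signal with periodic boundaries instead of a 2D image (an assumption that keeps the example short), and the helper name `circular_conv` is hypothetical.

```python
import numpy as np

def circular_conv(signal, kernel):
    """Circular cross-correlation of a 1D signal with a short kernel."""
    n = len(signal)
    return np.array([
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ])

rng = np.random.default_rng(0)
x = rng.standard_normal(16)   # toy "image" (1D, periodic boundary assumed)
w = rng.standard_normal(3)    # convolution kernel
shift = 5                     # the transformation g: a cyclic shift

shift_then_conv = circular_conv(np.roll(x, shift), w)   # f(g . x)
conv_then_shift = np.roll(circular_conv(x, w), shift)   # g . f(x)

# Largest deviation between the two orders of operation.
equivariance_gap = np.max(np.abs(shift_then_conv - conv_then_shift))
print(equivariance_gap)
```

With zero padding instead of periodic boundaries, the equality holds only away from the borders.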
Vector Neurons extend the equivariance property to rotations for point cloud neural networks. The key idea is to augment the scalar feature in each feature dimension to a vector in $\mathbb{R}^3$ and redesign the linear, nonlinear, pooling, and normalization layers accordingly. In a regular network, the feature matrix with feature dimension $C$ corresponding to a set of $N$ points is $V \in \mathbb{R}^{N \times C}$. In Vector Neuron networks, it is $V \in \mathbb{R}^{N \times C \times 3}$. The mapping between layers can be written as $f_l: \mathbb{R}^{N \times C_l \times 3} \to \mathbb{R}^{N \times C_{l+1} \times 3}$, where $l$ is the layer index. Following this design, the representation of rotations in feature space is straightforward: $g(R)\,V := V R$, where $g(R)$ denotes the rotation operation in the feature space, parameterized by the 3-by-3 rotation matrix $R \in \mathrm{SO}(3)$. In the following, we ignore the first dimension $N$ of $V$ for simplicity.
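The linear layer for Vector Neurons is similar to a traditional MLP: $V_{l+1} = W V_l$, where $W \in \mathbb{R}^{C_{l+1} \times C_l}$. It is easy to see that such a mapping is $\mathrm{SO}(3)$-equivariant:
$$W \left( V_l R \right) = \left( W V_l \right) R = V_{l+1} R.$$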
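The equivariance of the linear layer can be verified with a few lines of NumPy. This is an illustrative sketch only: the function names `random_rotation` and `vn_linear`, the channel sizes, and the random weights are assumptions, not the authors' implementation.

```python
import numpy as np

def random_rotation(rng):
    """Draw a 3x3 rotation matrix from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))      # fix the sign ambiguity of the factorization
    if np.linalg.det(q) < 0:         # flip one axis so that det = +1
        q[:, 0] = -q[:, 0]
    return q

def vn_linear(W, V):
    """Vector Neuron linear layer: mixes channels and leaves the 3D axis untouched."""
    return W @ V

rng = np.random.default_rng(0)
C_in, C_out = 8, 5                          # hypothetical channel sizes
V = rng.standard_normal((C_in, 3))          # one vector feature per channel
W = rng.standard_normal((C_out, C_in))      # layer weights
R = random_rotation(rng)

rotate_then_layer = vn_linear(W, V @ R)
layer_then_rotate = vn_linear(W, V) @ R
linear_gap = np.max(np.abs(rotate_then_layer - layer_then_rotate))
print(linear_gap)
```

The check works because the weights act on the channel axis from the left while the rotation acts on the coordinate axis from the right, so the two operations commute by associativity of matrix multiplication.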
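Designing nonlinearities for Vector Neuron networks is less trivial. A naive ReLU on all elements of the feature matrix will break the equivariance. Instead, a vectorized version of ReLU is designed as the truncation of vectors along a hyperplane. Specifically, given an input feature matrix $V$, predict a canonical direction $k = U V \in \mathbb{R}^{1 \times 3}$, where $U \in \mathbb{R}^{1 \times C}$. Then the ReLU is defined as:
$$v' = \begin{cases} v, & \text{if } \langle v, k \rangle \geq 0, \\ v - \left\langle v, \dfrac{k}{\|k\|} \right\rangle \dfrac{k}{\|k\|}, & \text{otherwise,} \end{cases}$$
for each row $v$ of $V$. This design preserves equivariance because $v$ and $k$ will rotate together, and $\langle vR, kR \rangle = \langle v, k \rangle$, meaning that rotations do not change the activation.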
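To make the contrast with a naive ReLU concrete, the sketch below implements the truncation described above and compares both nonlinearities under a rotation. The function names, the random weights `U`, and the small constant `eps` are assumptions for illustration; the original Vector Neuron layer may differ in details.

```python
import numpy as np

def vn_relu(V, U, eps=1e-12):
    """Truncate each row of V (C x 3) along the hyperplane orthogonal to k = U V."""
    k = U @ V                                  # (1, 3) direction, rotates with V
    k_hat = k / (np.linalg.norm(k) + eps)      # unit direction
    proj = V @ k_hat.T                         # (C, 1) signed length along k_hat
    # Keep rows on the positive side; remove the negative component otherwise.
    return np.where(proj >= 0, V, V - proj * k_hat)

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

rng = np.random.default_rng(1)
V = rng.standard_normal((8, 3))
U = rng.standard_normal((1, 8))     # stands in for learned weights
R = rot_z(0.7) @ rot_x(-1.1)        # an arbitrary rotation

vn_gap = np.max(np.abs(vn_relu(V @ R, U) - vn_relu(V, U) @ R))
naive_gap = np.max(np.abs(np.maximum(V @ R, 0) - np.maximum(V, 0) @ R))
print(vn_gap, naive_gap)
```

The first gap stays at the level of floating-point error, while the element-wise ReLU produces a clearly nonzero gap because it depends on the coordinate axes.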
For the design of pooling layers and normalization layers of Vector Neurons, please refer to the original paper for details [deng2021vector].
The overall idea of correspondence-free registration is to transform the alignment of points in the input space into the alignment of global features, where point-wise matching is not needed. In this work, we employ an $\mathrm{SO}(3)$-equivariant neural network to construct a feature space that preserves the same rotation representation as the input space, which simplifies solving for the rotation in feature alignment. Furthermore, deep implicit representation learning improves the robustness of the features against noise and allows registration between two different scans of the same geometry.
III-A $\mathrm{SO}(3)$-equivariant point cloud global feature extractor
We choose PointNet as the feature extractor backbone. Its permutation invariance implies that two point clouds map to the same feature embedding if they differ only in the permutation of points. We replace the layers in PointNet with the corresponding Vector Neuron versions, similar to the work of Deng et al. [deng2021vector]. Denote the feature extraction network as $f$, a point cloud as $X \in \mathbb{R}^{N \times 3}$, and its rotated and permuted copy as $X' = P X R$, where $P$ is a permutation matrix and $R$ is a 3-by-3 rotation matrix. Then we have
$$Z' = Z R, \tag{3}$$
where $Z = f(X)$ and $Z' = f(X')$, guaranteed by the $\mathrm{SO}(3)$-equivariance and permutation-invariance properties, which is essential for feature-space registration.
III-B Deep implicit representation learning
Being able to register two rotated copies of the same point cloud is not enough for practical use. In real applications, the measurements are noisy, and the individual points captured by different scans generally do not correspond to the same physical points. To our knowledge, we are among the first to address the need for registration across noisy and different scans of point clouds in deep correspondence-free registration methods.
Our solution is to reuse the global feature (ultimately meant for registration) for implicit shape representation. Following Occupancy Network [mescheder2019occupancy], we build an encoder-decoder network. The encoder is the aforementioned $\mathrm{SO}(3)$-equivariant feature extractor, and the decoder takes the feature and a queried position as input, predicting the occupancy value of that position. In this way, the encoder is encouraged to predict identical features if the input point clouds correspond to the same geometry, even if they sample different points. Gaussian noise is also added to the input during training to improve robustness. To output an identical implicit function field, we adopt the same invariant layer designed by Deng et al. [deng2021vector] to convert the $\mathrm{SO}(3)$-equivariant features into $\mathrm{SO}(3)$-invariant ones before feeding them into the decoder.
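One way to obtain rotation-invariant quantities from equivariant features is to multiply the feature matrix by the transpose of a second equivariant matrix, so that the rotation cancels. The sketch below illustrates this cancellation under simplifying assumptions: the second matrix is produced by a single linear map with hypothetical random weights `W_t`, whereas the actual invariant layer of the Vector Neuron paper may use a deeper sub-network.

```python
import numpy as np

def vn_invariant(V, W_t):
    """Map equivariant features V (C x 3) to rotation-invariant features (C x 3)."""
    T = W_t @ V        # (3, 3); equivariant, since W_t (V R) = (W_t V) R
    return V @ T.T     # (V R)(T R)^T = V R R^T T^T = V T^T, so R cancels

rng = np.random.default_rng(2)
V = rng.standard_normal((8, 3))
W_t = rng.standard_normal((3, 8))   # stands in for learned weights

# A 120-degree rotation about the axis (1, 1, 1): a cyclic permutation of the axes.
R = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

invariance_gap = np.max(np.abs(vn_invariant(V @ R, W_t) - vn_invariant(V, W_t)))
print(invariance_gap)
```

Because the output no longer changes with the input orientation, a decoder fed with it predicts the same function field for any rotated copy of the shape.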
III-C Point cloud registration through feature alignment
With the above network design, we can further relax the connection between the two point clouds $X$ and $X'$ in (3). They do not have to be permutations of each other, as long as they are sampled from the same geometry. However, the equation then holds only approximately, since the implicit shape model is not trained perfectly.
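One can see that (3) is exactly in the form of the orthogonal Procrustes problem. The learned features $Z$ and $Z'$ can be regarded as two pseudo point clouds whose rows are automatically matched. This problem has a standard closed-form solution. First, calculate the cross-covariance matrix
$$H = Z^\top Z'.$$
Then conduct Singular Value Decomposition (SVD) on $H$: $H = U \Sigma V^\top$ (here $U$ and $V$ denote the singular-vector matrices, not the quantities of Sec. II). The optimal rotation matrix is given as:
$$\hat{R} = U \, \mathrm{diag}\!\left(1, 1, \det\!\left(U V^\top\right)\right) V^\top.$$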
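The closed-form solution can be sketched in NumPy as follows. The row-vector convention (features satisfy `Z_prime ≈ Z @ R`), the function names, the feature size of 64 rows, and the noise level are assumptions made for this illustration, and random matrices stand in for learned features.

```python
import numpy as np

def solve_rotation(Z, Z_prime):
    """Least-squares rotation R such that Z_prime ≈ Z @ R (rows are 3-vectors)."""
    H = Z.T @ Z_prime                                   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(U @ Vt))                  # guard against reflections
    return U @ np.diag([1.0, 1.0, d]) @ Vt

def rotation_error_deg(R_est, R_gt):
    """Angle of the residual rotation between estimate and ground truth."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.rad2deg(np.arccos(np.clip(cos, -1.0, 1.0)))

rng = np.random.default_rng(3)
Z = rng.standard_normal((64, 3))                        # stand-in for f(X)
q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R_true = q * np.sign(np.linalg.det(q))                  # force det = +1

# Exact case: features related by a pure rotation.
R_exact = solve_rotation(Z, Z @ R_true)
exact_gap = np.max(np.abs(R_exact - R_true))

# Approximate case: the relation holds only up to a small perturbation.
Z_noisy = Z @ R_true + 0.01 * rng.standard_normal(Z.shape)
noisy_error_deg = rotation_error_deg(solve_rotation(Z, Z_noisy), R_true)
print(exact_gap, noisy_error_deg)
```

No initial guess enters the computation, which is why the result does not depend on the size of the rotation between the two inputs.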
Each row of the feature matrix can be regarded as a point in Euclidean space. It can also be interpreted as an element of $\mathfrak{so}(3)$ and visualized as a vector field acting on the Euclidean space where the input point clouds live. Each feature generates such a vector field. Because of the rotation-equivariance property, the vector field generated from the feature of the rotated point cloud is identical to the rotated vector field generated from the feature of the original point cloud. See Fig. 2 for an example.
Consistent with the existing literature, we conduct experiments on the ModelNet40 dataset [wu20153d]. It is composed of 12311 CAD models from 40 categories of objects. Following the official split, 9843 models are used for training, and 2468 models are used for testing. The models are preprocessed using the tool provided by Stutz and Geiger [Stutz2018ARXIV] to obtain watertight and simplified models that are centered and scaled to a unit cube.
During training, 1000 points are randomly sampled on each mesh model as the input point cloud. The occupancy value is queried at 2048 points in the unit cube to compose the loss function. The sampling strategy of the querying points is consistent with the Occupancy Network[mescheder2019occupancy]. The dimension of the learned global feature is set as , corresponding to dimensions for a traditional feature vector, close to the dimension chosen by previous literature including PointNetLK [aoki19pointnetlk], PCRNet [Sarode2019PCRNetPC], and Feature-Metric Registration (FMR) [huang2020feature]. The network is trained on a single Tesla V100 GPU with batch size 24.
IV-A Registration with rotated copies of point clouds
Table I: Rotation error after registration for each method versus the max initial rotation angle (0, 30, 60, 90, 120, 150, and 180 degrees). Tested using the ModelNet40 official test set. The max initial rotation angle refers to the maximum rotation angle allowed in the generation of initial random rotations during testing. The rotation axis is fully randomized. The last column (180 degrees) corresponds to unconstrained initial rotations. All values are in degrees.
We first test the registration of two point clouds that differ only by rotation and permutation. Both point clouds have 1000 points sampled from the mesh model. The initial rotations are generated randomly using the axis-angle representation, which lets us specify the maximum magnitude of the initial rotations. We compare with two representative correspondence-free deep registration methods as baselines: PCRNet [Sarode2019PCRNetPC] and FMR [huang2020feature]. The experiments on the baseline methods are based on their official open-source implementations and pretrained weights. The results are shown in Table I. We can see that both PCRNet and FMR are sensitive to the initial rotation angle and only work properly when the initial rotation is small, because they rely on iterative refinements. In comparison, the registration accuracy of our method is independent of the initial rotation angle, and it provides almost perfect rotational registration.
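The test protocol can be sketched as follows. The source states only that rotations are drawn in axis-angle form with a bounded magnitude and a fully random axis; drawing the angle uniformly from the allowed range is an assumption of this sketch, and all function names are hypothetical.

```python
import numpy as np

def rotation_from_axis_angle(axis, angle_rad):
    """Rodrigues' formula for a rotation of angle_rad about the given axis."""
    a = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle_rad) * K + (1.0 - np.cos(angle_rad)) * (K @ K)

def random_bounded_rotation(max_angle_deg, rng):
    """Random axis, angle drawn uniformly in [0, max_angle_deg] (assumed)."""
    axis = rng.standard_normal(3)              # isotropic random direction
    angle = np.deg2rad(rng.uniform(0.0, max_angle_deg))
    return rotation_from_axis_angle(axis, angle)

def rotation_error_deg(R_est, R_gt):
    """Geodesic distance on SO(3) between two rotations, in degrees."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.rad2deg(np.arccos(np.clip(cos, -1.0, 1.0)))

rng = np.random.default_rng(4)
max_angle = 60.0
samples = [random_bounded_rotation(max_angle, rng) for _ in range(100)]

# The magnitude of each sampled rotation is its error relative to the identity.
magnitudes = np.array([rotation_error_deg(R, np.eye(3)) for R in samples])
print(magnitudes.max())
```

Setting `max_angle` to 180 removes the constraint on the rotation magnitude, matching the last column of the tables.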
IV-B Registration with rotated copies of point clouds corrupted with Gaussian noise
Table II: Rotation error after registration for each method versus the max initial rotation angle (0, 30, 60, 90, 120, 150, and 180 degrees).
In practice, the pair of point clouds for registration will not be identical to each other. Therefore, we test the case where the source and target point clouds are both corrupted by noise. We add a Gaussian noise with to all points before passing them through the networks. The rest of the setup is the same as in Sec. IV-A. As shown in Table II, PCRNet and FMR show a similar trend of increasing error as the initial angle gets larger. Our method, though showing a slightly larger registration error than in the noise-free case, still outperforms the baselines in most columns. Most importantly, the error remains consistent across different initializations. The result shows that our method not only works in an ideal situation where the point clouds are noise-free but also delivers accurate rotation estimation under noise corruption.
IV-C Registration with point clouds with different densities
Table III: Rotation error after registration for each method versus the max initial rotation angle (0, 30, 60, 90, 120, 150, and 180 degrees).
To further test the robustness of the proposed method against variations in the point clouds, we test the registration performance given two point clouds of different densities. Here we sample 1000 and 500 points for the pair of point clouds, respectively. The registration accuracy is shown in Table III. The trend of the three methods is similar to what is shown in Sec. IV-A and Sec. IV-B. We do observe an increase in the error of our method in this experiment. A possible reason is that our PointNet encoder begins with an edge-convolution layer that initializes the feature dimensions of each point and depends on the point's neighborhood. A change of density may change the features at the first layer, affecting the feature-space registration. This issue is to be analyzed in more detail in future work. However, our result still outperforms the baselines when the initial rotation angle is larger than 90 degrees.
Overall, our proposed orientation registration method produces consistent estimates independent of the initial rotation. It outperforms the baselines even in the presence of noise, showing the practicality of the method. Under density variation, the accuracy of our method deteriorates, which indicates a direction for improvement. Nevertheless, it still outperforms the baseline methods under large initial rotations.
V Related Work and Discussion
In this work, we connect ideas from several different fields, including sensor registration, implicit shape learning, and equivariant neural networks. A broader context for each of these fields may help in understanding our method. Here we first review point cloud registration methods with and without data association, with a focus on deep learning-based approaches. Then we briefly review the literature on equivariant neural networks and deep implicit models.
V-A Correspondence-based point cloud registration
A major challenge here is to recover the corresponding pairs of points from a pair of point clouds. ICP simply matches the closest points together, solves the transformation, and iteratively rematches the closest points after aligning the two point clouds using the estimated transformation [besl1992icp]. Point-to-line [censi2008icl], point-to-plane [chen1992surfacenormalicp], plane-to-plane [mitra04_surface_icp], and Generalized-ICP [segal2009gicp] build local geometric structures into the loss formulation.
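The match-solve-rematch loop of point-to-point ICP can be sketched as below. This is a minimal illustration with brute-force nearest-neighbor search and a fixed iteration count, both assumptions made for brevity; the function names are hypothetical and the sketch is not the implementation of any cited method.

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Least-squares R, t with Q ≈ P @ R + t for row-wise matched points."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt
    return R, q_mean - p_mean @ R

def nearest_neighbors(src, dst):
    """Index of the closest dst point for every src point, and squared distances."""
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, d2[np.arange(len(src)), idx]

def icp(source, target, iters=30):
    """Point-to-point ICP; returns R, t with aligned = source @ R + t."""
    R_total, t_total = np.eye(3), np.zeros(3)
    current = source.copy()
    for _ in range(iters):
        idx, _ = nearest_neighbors(current, target)          # match closest points
        R, t = best_rigid_transform(current, target[idx])    # solve the transform
        current = current @ R + t                            # align, then rematch
        R_total, t_total = R_total @ R, t_total @ R + t
    return R_total, t_total, current

rng = np.random.default_rng(5)
target = rng.standard_normal((60, 3))
c, s = np.cos(0.1), np.sin(0.1)
R0 = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # small initial rotation
source = rng.permutation(target) @ R0 + np.array([0.05, -0.02, 0.03])

mse_before = nearest_neighbors(source, target)[1].mean()
R_icp, t_icp, aligned = icp(source, target)
mse_after = nearest_neighbors(aligned, target)[1].mean()
print(mse_before, mse_after)
```

Each iteration cannot increase the nearest-neighbor error, but the loop only finds a local optimum, which is why a good initial alignment matters for this family of methods.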
A challenge in matching is to distinguish points corresponding to different underlying locations and to recognize points corresponding to the same location. This requires strong feature descriptors for points, at which deep learning approaches excel. Through metric learning, a feature descriptor can be learned such that matching points are close to each other in the feature space, while non-matching points are far away. Approaches following this idea include PPFNet [deng2018ppfnet], 3DSmoothNet [gojcic2019perfect], SpinNet [ao2020spinnet], and FCGF [FCGF2019]. Good point matching leads to accurate pose estimation when solving the orthogonal Procrustes problem. Therefore, the error in pose estimation can be used to supervise point matching and feature learning. DCP [wang2019dcp], RPM-Net [yew2020rpm], DGR [choy2020deepglobal], and 3DRegNet [pais20203dregnet] are some of the methods leveraging ground truth pose to supervise point feature learning. In a point cloud, it is likely that a subset of points (e.g., corner points) have more salient features and are easier to identify than others (e.g., points on a flat surface). These are called keypoints. Keypoints can be matched more reliably, resulting in better pose estimation. In this regard, neural networks are designed to better pick the keypoints (e.g., USIP [li2019usip] and SampleNet [lang2020samplenet]). Keypoint selection and feature learning may also be considered jointly. Representative works include 3DFeat-Net [yew2018-3dfeatnet], D3Feat [bai2020d3feat], DeepICP [lu2019deepvcp], and PRNet [wang19prnet].
The main remaining challenge is that such matching pairs may not exist in the input point clouds in the first place, because point clouds are sparse and a point may not be captured repeatedly by different scans. Soft matching (or many to many) [gold1998softassign, granger02emicp] is proposed to address this problem, but it is at best an approximation of the underlying true matching using sparse samples, which can deteriorate the performance when the sparsity increases.
V-B Correspondence-free point cloud registration
Correspondence-free registration treats a point cloud as a whole rather than a collection of points, requiring a global representation for an entire point cloud. In this way, the limitation brought by sparsity, as mentioned in Sec. V-A, is circumvented. CVO [MGhaffari-RSS-19, zhang2020new] represents a point cloud as a function in a reproducing kernel Hilbert space, transforming the registration problem into maximizing the inner product of two functions. A series of deep-learning-based methods attempt to extract a global feature embedding for a point cloud and solve the registration by aligning the global features. Examples include PointNetLK [aoki19pointnetlk], PCRNet [Sarode2019PCRNetPC], and Feature-Metric Registration [huang2020feature]. A limitation of global-feature-based registration is that the nonlinearity of the feature extraction networks leaves us few clues about the structure and properties of the feature space. Therefore, these methods rely on iterative local optimization such as the Gauss-Newton algorithm. Consequently, they require good initialization to achieve decent results. Our method differs from previous work in that the feature extraction network is equivariant, enabling well-behaved optimization in the feature space (detailed in Sec. III).
V-C Group equivariant neural networks
Neural networks today mainly preserve symmetry against translation, limiting the capacity of networks to deal with broader transformations presented in data. Researchers have been working on group-equivariant neural networks to address this issue. For example, the group is a classical Lie group and frequently appears in practice [chirikjian2001engineering, chirikjian2009stochastic]. One line of work forms kernels as some steerable functions so that the rotations in the input space can be transformed into rotations in the output space [cohen2016steerable, esteves2018learning, fuchs2020se]. However, this strategy can limit the expressiveness of the feature extractor since the form of the kernels is constrained.
Another strategy is to lift the input space to a higher-dimensional space where the group is contained so that equivariance is obtained naturally [kondor2018generalization, finzi2020generalizing]. The simplest example is that the group of 2D translations is contained in $\mathbb{R}^2$; therefore, 2D convolution embeds equivariance to 2D translations. However, for each input lifted to group elements, we need to integrate over the entire group to compute its convolution. While mathematically sound, this approach involves the use of the group $\exp$ and $\log$ maps (for mapping from and to the Lie algebra) and a Monte Carlo approximation to compute the convolution integral over a fixed set of group elements [finzi2020generalizing, Sections 4.2-4.4].
Cohen and Welling [cohen2014learning] studied the problem of learning the irreducible representations of commutative Lie groups. Kondor and Trivedi [kondor2018generalization] studied the generalization of equivariance and convolution in neural networks to the action of compact groups. The latter also discussed group convolutions and their connection with Fourier analysis [chirikjian2001engineering, chirikjian2009stochastic]. In this context, the convolution theorem naturally extends to group-valued functions and representation theory [hall2015lie].
A new design of equivariant neural networks is proposed by Deng et al. [deng2021vector]. This approach allows incorporating the equivariance property into existing network architectures such as PointNet and DGCNN by replacing the linear, nonlinear, pooling, and normalization layers with their Vector Neuron versions. Our equivariant feature learner builds on this simple and useful idea.
V-D Deep implicit shape modeling
Implicit shape modeling represents a shape as a field on which a function such as an occupancy or signed distance function is defined. The surface is typically defined by the equation $F(\mathbf{x}) = c$ for some constant $c$, where $F$ denotes the implicit function. In recent years, deep models have been developed to model such implicit function fields, where a network is trained to predict the function value given a query point coordinate. Some representative works include DeepSDF [park2019deepsdf], OccNet [mescheder2019occupancy], and NeRF [mildenhall2020nerf].
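As a concrete instance, an analytic sphere can be written as an implicit function in a few lines. The sketch below is purely illustrative: the radius and the function names are assumptions, and the learned models cited above replace the analytic formula with a neural network.

```python
import numpy as np

def sphere_sdf(points, radius=0.5):
    """Signed distance to a sphere centered at the origin (negative inside)."""
    return np.linalg.norm(points, axis=1) - radius

def sphere_occupancy(points, radius=0.5):
    """Occupancy value: 1.0 for points inside or on the sphere, 0.0 outside."""
    return (sphere_sdf(points, radius) <= 0.0).astype(float)

queries = np.array([[0.0, 0.0, 0.0],    # center of the sphere
                    [0.5, 0.0, 0.0],    # exactly on the surface
                    [1.0, 0.0, 0.0]])   # outside

sdf_values = sphere_sdf(queries)
occ_values = sphere_occupancy(queries)
print(sdf_values, occ_values)
```

The surface is the zero level set of the signed distance function, and it can be queried at any point without reference to a fixed set of samples.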
Implicit functions model a shape continuously, at arbitrarily fine resolution, which offers opportunities in registration to circumvent the data association challenge caused by sparsity in point clouds. DI-Fusion [huang2020di] leverages deep implicit models for surface registration and mapping by optimizing the position of the query points to minimize the absolute SDF value. iNeRF [yen2020inerf] performs RGB-only registration through deep implicit models by minimizing the pixel residual between observed images and neural-rendered images. In this work, we also exploit the continuous nature of the deep implicit model, but we use the learned latent feature instead of the decoded function field.
In this paper, we presented a correspondence-free rotational registration method for point clouds. The method is built upon the developments in equivariant neural networks and implicit shape representation learning. We construct a feature space where the rotational relations in Euclidean space are preserved, in which registration can be done efficiently. We circumvent the need for data association and solve the rotation in closed form, achieving low errors independent of initial rotation angles. Furthermore, we leverage implicit representation learning to enhance the network robustness against non-ideal cases where the points are subject to noise.
Nevertheless, the robustness of our method against density variations still has room for improvement. Furthermore, two open issues remain. The first is generalizing the $\mathrm{SO}(3)$-equivariant neural network to handle translation, so that rotation and translation can be registered together. The second is generalizing the registration to larger-scale problems, such as outdoor LiDAR or stereo camera scans. These problems are left for future studies.
Toyota Research Institute provided funds to support this work.