1 Introduction
Point cloud registration is the task of aligning two point clouds that differ from each other by a transformation, a task made more challenging by the presence of noise, outliers and occlusions. In this work, we consider rigid transformations composed of rotations and translations. An example registration problem can be found in Fig. 1. The rigid registration problem has widespread applications in robotics, medical imaging and computer vision. The task becomes trivial once the correspondences between the points in the two point clouds are known, and estimating them is a popular approach to solving the registration problem. However, finding correspondences is in itself a challenging problem, particularly in the presence of outliers and sampling differences, which mean that some or all points do not have exact correspondences.
Recently, works such as Aoki et al. (2019); Wang and Solomon (2019b, a); Yew and Lee (2020); Li et al. (2020) proposed learning-based methods that provided promising results. The vast majority of these approaches attempt to find point or feature correspondences and then apply a singular value decomposition (SVD) to obtain the transformation. We, instead, frame the point cloud registration problem as a Markov Decision Process (MDP) and solve it by learning reward functions using deep supervised learning, which does not require the prediction of correspondences. Unlike conventional reinforcement learning methods, the training samples are taken i.i.d. and thus no experience replay buffer is needed. Deep supervised learning was also used in this way by Liao et al. (2017), who applied it to the problem of image registration. To the best of our knowledge, our work is the first to pose the point cloud registration problem as an iterative agent solving an MDP and trained using deep supervised learning. Our method does not rely on estimating correspondences and therefore encourages the use of a new paradigm to approach this problem. The work in Bauer et al. (2021) also avoids point correspondences by using deep reinforcement learning. However, it requires an experience replay buffer and a full sampling along a registration trajectory, as well as the use of imitation learning. In contrast, our proposed method requires neither replay buffers nor imitation learning, resulting in lower memory requirements and an easier training process.
Contributions.
Our main contributions are as follows:

We pose the point set registration problem as an MDP coupled with a reward function that measures the reduction of distance in the transformation space.

We devise and train an artificial agent using deep supervised learning based on the MDP formulation, providing a novel approach without resorting to point correspondences.

We perform an unbiased sampling of the manifold and conduct in-depth experiments to show the benefits of our approach on clean, noisy and partially visible datasets (tested on unseen categories).
2 Related Work
Traditional registration methods.
Iterative Closest Point (ICP) is possibly the most well-known approach to the point cloud registration problem. ICP iterates between estimating the rigid transformation and point correspondences. Given correspondences, finding the transformation reduces to an SVD-based operation. The performance of ICP largely hinges on the initialization, and it can easily get trapped in suboptimal local minima depending on the initial correspondences or transformation assumed. There are many variants, such as Segal et al. (2009); Yang et al. (2016); Rusinkiewicz and Levoy (2001); Fitzgibbon (2003), that aim to overcome the shortcomings of classical ICP. In particular, Yang et al. (2016) offers a global approach through a branch-and-bound scheme, although it is inefficient for practical purposes. Another global method optimizes a global objective on the correspondences between source and target point clouds Zhou et al. (2016). Yang et al. (2021) proposed a certifiable algorithm that provides high robustness to outliers using a truncated least-squares loss.
Learning based registration methods.
Encoding point clouds directly into deep neural networks was made possible by the pioneering work of PointNet Qi et al. (2017), which takes into account the permutation invariance of the points to produce a global embedding for a given point cloud. Building on PointNet, Aoki et al. (2019) combines this global embedding with the classical, iterative Lucas-Kanade method Lucas and Kanade (1981) to perform registration by aligning the global features of the transformed source and target point clouds. Wang and Solomon (2019a) trains an end-to-end network by using the DGCNN embedding, proposed in Wang et al. (2019), with an attention mechanism, matching the point embeddings and then extracting the transformation from the matching matrix through a differentiable SVD layer. Wang and Solomon (2019b) proposes a more robust variant that matches only select keypoints, showing improved experimental results for partial-to-partial registration. Yew and Lee (2020) takes advantage of surface normals given in the data to perform registration based on the classical RPM method. The work of Bauer et al. (2021) is similar to ours in that it is based on an artificial agent, but it requires imitation learning as part of its initialisation as well as an experience replay buffer. Fu et al. (2021) embeds each point cloud as a graph and performs embedding matching to obtain the transformation. Li et al. (2020) has a matching process that takes into account both the embedding space and the Euclidean point space. In contrast to our work, the vast majority of these methods perform some variant of point matching to extract the final transformation. An artificial agent similar to our approach is applied to image registration in Liao et al. (2017). Although they also learn a reward vector over a set of incremental actions to perform the registration, their embedding uses convolutional neural networks along with a different reward scheme that couples rotations and translations.
3 Problem Statement
We are concerned with the problem of rigid point cloud registration. Given a source point cloud and a target point cloud, the aim is to find a rigid transformation that minimizes the following cost function:
(1)  
where a norm distance between the registered and permuted point sets is minimized with respect to the transformation parameters and the permutation matrix. The constraints on the permutation matrix allow for the existence of points in one set that do not correspond to a point in the other, such as in the presence of sampling differences and occlusions. The manifold constraint on the rotation and the binary value constraint on the permutation matrix result in a nonconvex problem that is challenging to solve. Most of the aforementioned learning-based methods Wang and Solomon (2019a, b); Yew and Lee (2020) obtain an estimate of the permutation matrix and then solve the resulting easier problem via the use of polar decomposition.
In this work we take a different approach by using an MDP-based formulation equipped with a reward function that measures the reduction of distance in the transformation space. We define the MDP by a tuple consisting of the set of states, the set of actions, the transition probability of arriving at a given state at the next time step after taking an action in the current state, and the reward associated with taking an action in a state. For our problem, we take the set of states to be the set of all possible rigid transformations, SE(3), and the set of actions to be a finite set of rotation or translation actions. Concretely, the actions consist of positive and negative translations of fixed magnitudes in each of the x, y and z directions, as well as rotations of fixed angles around each of the three axes in the positive and negative directions. When an action is applied, the new state is given by the composition of the transformations represented by the previous state and the action. This composition is deterministic, giving deterministic transitions, so that every transition probability is either zero or one. The result of the composition depends only on the previous state and is independent of other past states, fulfilling the Markov property for all time steps.
The registration problem can then be solved by performing an optimal set of actions that maximize the reward. This is formalized by the policy, which returns a probability distribution over actions given a state. We consider a deterministic and greedy policy, which selects the action with the highest reward in the current state.
4 Artificial Agent
We solve the MDP-based point cloud registration problem through a greedy approach inspired by Q-Learning. In this setup, we aim to learn the reward function by modelling it as a deep neural network. The policy we adopt is to take the action that maximizes the reward function for the state at the current time step. To learn this, we sample source point clouds from the dataset and transformations from the manifold uniformly at random. From these, we create the corresponding target point clouds, and feed the two point clouds into the neural network. The network then outputs a reward vector, where each element represents the reward for one action; this vector is regressed against the ground truth rewards.
The samples are taken i.i.d. and as such, we do not continue the registration trajectory with the policy during training time. This eliminates the need for a memory replay buffer, thereby reducing the memory requirements during training. Similarly, for the validation phase, we compute the loss of the reward vector instead of the registration error. At test time, we apply the action with the highest reward iteratively to the source point cloud. The reward is designed so that the source point cloud converges towards the target point cloud as the registration progresses. Thus, instead of implementing a stopping action, the registration process is allowed to run for a fixed number of steps.
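The test-time procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `predict_rewards` is a hypothetical stand-in for the trained network, and the toy oracle, the translation-only action set and the step size `STEP` are assumptions chosen only to make the loop runnable.

```python
import numpy as np

def greedy_register(source, target, actions, predict_rewards, num_steps):
    """Apply the highest-reward action for a fixed number of steps.

    predict_rewards(source, target) is a hypothetical callable standing in
    for the trained network; it returns one reward per action.
    """
    current = source.copy()
    for _ in range(num_steps):
        rewards = predict_rewards(current, target)
        best = int(np.argmax(rewards))      # deterministic, greedy choice
        current = actions[best](current)    # no stopping action is used
    return current

# Toy setup (assumption): translation-only actions with a single step size.
STEP = 0.1
DIRECTIONS = [s * np.eye(3)[i] for i in range(3) for s in (+1.0, -1.0)]
ACTIONS = [lambda pts, d=d: pts + STEP * d for d in DIRECTIONS]

def oracle_rewards(source, target):
    """Toy stand-in for the network: reduction in centroid distance per action."""
    gap = target.mean(axis=0) - source.mean(axis=0)
    before = np.linalg.norm(gap)
    return np.array([before - np.linalg.norm(gap - STEP * d) for d in DIRECTIONS])

rng = np.random.default_rng(0)
source = rng.normal(size=(100, 3))
target = source + np.array([0.5, -0.3, 0.2])
registered = greedy_register(source, target, ACTIONS, oracle_rewards, num_steps=20)
final_gap = np.linalg.norm(target.mean(axis=0) - registered.mean(axis=0))
```

Because the loop has no stopping action, the toy iterate keeps moving once it is aligned, but it never ends more than one step size away from the target.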
4.1 Reward function
We chose a reward function that gives the reduction of distance to the identity transformation:
(2) 
where denotes the composition. The distance function is of the form
where
and
It can be shown that the rotational term is equivalent to half the rotation angle of the combined rotation. It is a metric on SO(3) and it has a linear relationship with the geodesic on the sphere (see Huynh (2009) for a more in-depth discussion). For actions that are purely rotational, the translational contributions to the two terms in (2) are identical and therefore cancel out. The same holds for purely translational actions and rotational contributions.
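The half-angle property can be checked numerically. The sketch below assumes that the rotational term is the quaternion inner-product distance arccos(|q1 · q2|), one of the metrics analysed by Huynh (2009); the exact form used in the paper is not recoverable from the text, and the function names and the 30 and 4 degree angles are illustrative only.

```python
import numpy as np

def quat_from_axis_angle(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` about `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    return np.concatenate(([np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis))

def quat_multiply(q, p):
    """Hamilton product q * p (rotation p is applied first, then q)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = p
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

IDENTITY = np.array([1.0, 0.0, 0.0, 0.0])

def rot_distance_to_identity(q):
    """Assumed rotational distance: arccos(|<q, identity>|), i.e. half the angle."""
    return float(np.arccos(np.clip(abs(np.dot(q, IDENTITY)), 0.0, 1.0)))

def rotation_reward(q_state, q_action):
    """Reduction of the distance to identity obtained by applying the action."""
    after = quat_multiply(q_action, q_state)
    return rot_distance_to_identity(q_state) - rot_distance_to_identity(after)

state = quat_from_axis_angle([0, 0, 1], np.deg2rad(30.0))
half_angle = rot_distance_to_identity(state)
good = rotation_reward(state, quat_from_axis_angle([0, 0, 1], np.deg2rad(-4.0)))
bad = rotation_reward(state, quat_from_axis_angle([0, 0, 1], np.deg2rad(4.0)))
```

Under this assumption, a 30 degree state has distance 15 degrees, an action that rotates back towards the identity earns a positive reward, and the opposite action earns a negative one.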
4.2 Application of actions
Since the contributions of purely rotational and translational actions are decoupled in our reward function, we apply these in a similarly decoupled fashion during inference time. This is particularly important for rotations, where the naive method would be to apply each rotation directly to the point cloud at each iteration. However, this would induce an additional translation since the source point cloud is not centered at the origin in general. To see this, consider a point cloud that is centered around the origin and then translated by some vector. Applying an additional rotational action directly to the resulting point cloud would rotate the translation vector along with the points.
Instead, we would like to obtain the rotated point cloud with the translation vector left unmodified. For iterative rotations and iterative translations we thus combine these transformations so that the rotations act on the centered point cloud and the translations are accumulated separately.
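The argument in this subsection can be checked with a small numerical example. It is an illustrative sketch, not the authors' code: the names `rot_z`, `centered` and `t`, the 90 degree rotation and the unit translation are all assumptions, and rotating about the current centroid is used here as one simple way to leave the translation untouched.

```python
import numpy as np

def rot_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
centered = rng.normal(size=(200, 3))
centered -= centered.mean(axis=0)          # point cloud centered at the origin
t = np.array([1.0, 0.0, 0.0])              # translation vector
moved = centered + t
R = rot_z(np.deg2rad(90.0))

# Naive: rotate the translated cloud about the origin; the centroid moves to R @ t.
naive = moved @ R.T
# Decoupled: rotate about the current centroid; the centroid stays at t.
centroid = moved.mean(axis=0)
decoupled = (moved - centroid) @ R.T + centroid

naive_centroid = naive.mean(axis=0)
decoupled_centroid = decoupled.mean(axis=0)
```

Here the naive update moves the centroid from (1, 0, 0) to (0, 1, 0), whereas the decoupled update keeps it at (1, 0, 0) and only changes the orientation.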
4.3 Architecture
Fig. 2 shows the architecture that we use. We use a standard embedding for the point clouds and then extract the rewards from this using a multilayer perceptron (MLP). The action step sizes are learned implicitly and we do not provide these to the network. Our embedding is inspired by that of DCP Wang and Solomon (2019a), using DGCNN Wang et al. (2019) followed by a Transformer. DGCNN creates an embedding that aims to exploit local geometric features. It does so by constructing a dynamic k-NN graph and applying convolution-like operations on the edges iteratively for multiple layers. At each layer, the embedding of every point is updated from the edges that connect it to its neighbors in the graph at that layer, using the weights associated with that layer. Given a point cloud, DGCNN outputs one embedding vector per point.
The Transformer module allows for an attention mechanism between the point clouds. It takes the source and target embeddings generated by DGCNN and outputs updated source and target embeddings through the function learned by the Transformer.
As both output embeddings contain one vector per point, we derive a global embedding through the max-pool operator. We then use an MLP to extract the reward vector from this embedding. For this, we consider that the rotational and translational actions are completely different in nature, although the form of the reward shares some similarities. We thus make use of a single MLP to learn any shared information, followed by separate MLPs for the two types of actions.
4.4 Sampling Rotations from the Manifold
When generating the transformations used for the training samples, many learning-based methods in the literature obtain the rotations by sampling the Euler angles uniformly at random Wang and Solomon (2019a); Yew and Lee (2020). This, however, causes a bias in sampling the SO(3) space, as shown in Fig. 3 (b), because the rotations do not adequately cover the manifold. We instead sample our transformations uniformly in SO(3) by sampling the axis of rotation uniformly at random and sampling the angle according to the Haar measure of SO(3) (see Diaconis and Shahshahani (1987); Kuffner (2004)). As shown in Fig. 3 (a), this approach removes the sampling bias from the learned transformations and results in a better coverage of the rotation manifold. A bias in sampling can result in subpar performance for iterative learning-based methods, as successive transformations can end up in parts of the manifold that have not been sampled well. This phenomenon is illustrated with the highlighted paths in Fig. 3 for a sample registration problem. When the registration path wanders into the undersampled parts of the manifold, misregistration is likely, as the learning-based methods might not have seen enough samples from these parts during training. Fig. 4 shows the registration results for such an example, which is misregistered when the network is trained with the naive sampling method. In Section 5.5, we compare the results of the full test set between the two sampling methods.
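One way to realise this sampling scheme is sketched below. It relies on the standard fact that, under the Haar measure on SO(3), the rotation angle has density proportional to 1 - cos(theta) on [0, pi]. The rejection sampler, the function names and the 45 degree cap are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def sample_rotation(rng, max_angle=np.pi):
    """Sample a rotation uniformly (Haar) from SO(3), restricted to angles <= max_angle."""
    # Uniform axis on the unit sphere: normalize an isotropic Gaussian vector.
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    # The Haar density of the angle is proportional to (1 - cos(theta)) on [0, pi];
    # sample it by rejection from a uniform proposal on [0, max_angle].
    bound = 1.0 - np.cos(max_angle)
    while True:
        theta = rng.uniform(0.0, max_angle)
        if rng.uniform(0.0, bound) <= 1.0 - np.cos(theta):
            break
    # Rodrigues' formula for the rotation matrix.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def rotation_angle(R):
    """Rotation angle of R in radians, recovered from its trace."""
    return float(np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)))

rng = np.random.default_rng(0)
MAX_ANGLE = np.deg2rad(45.0)   # hypothetical cap, for illustration only
samples = [sample_rotation(rng, max_angle=MAX_ANGLE) for _ in range(500)]
angles = np.array([rotation_angle(R) for R in samples])
```

Larger angles are drawn more often than smaller ones, which is the opposite of what a uniform draw of the angle would give and is what keeps the coverage of the manifold even.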
4.5 Loss function
The network outputs a reward vector, where each element of the vector estimates the reward for one of the actions in the action set. We use a straightforward loss function to measure the agreement of this reward vector with the ground truth, and include a weight decay term on the network weights.
5 Experiments
The experiments are conducted using the ModelNet40 dataset Wu et al. (2014). We split the official training set into distinct training and validation sets, and retain the official test set. We train and validate on the first 20 categories, then test on the latter 20. This gives us 4,603 training samples, 509 validation samples and 1,266 testing samples.
Since our method uses discrete steps, which places a limit on the achievable accuracy, we apply ICP on top of the result. We call this method PRANetv2. The method without ICP is denoted PRANetv1.
Benchmarks and metrics.
We benchmark our approach against both learning-based and classical methods. For the former category, we chose DCPv2 Wang and Solomon (2019a) and RPMNet Yew and Lee (2020), whereas for the latter we chose ICP and FGR. We use the Open3D 0.9.0 implementations of ICP and FGR, and adapt the released code for DCPv2 and RPMNet for our experiments. For both RPMNet and FGR, the point cloud features were enhanced with surface normals, whereas the other methods, including ours, only had positional data as input.
In terms of the evaluation metrics, we use the isotropic rotation error and translation error, the L2 norm of the complete and clean point clouds and the modified Chamfer distance as proposed by Yew and Lee (2020). For all these metrics, a lower value indicates a better registration result.
Parameters.
The architecture details of our network are as follows. We have 5 EdgeConv layers in the DGCNN component, where the numbers of filters in the layers are [64, 64, 128, 256, 1024]. There are 4 heads in the multi-head attention in the Transformer component, and an embedding dimension of 1024. We use LayerNorm without Dropout. We use the SGD optimizer with an initial learning rate of
, which is then decayed at epochs 300, 700 and 1000 by a factor of 10. We use a weight decay constant of
. The network is implemented using PyTorch 1.1.0. We set the random seed as 1234.
5.1 Clean Dataset
The first experiment uses the ModelNet40 dataset with no noise or occlusions added. Following the methodology of Qi et al. (2017) and Wang and Solomon (2019a), we only use the positional features of the point cloud and randomly sample 1024 points from the surface of each object. The point cloud is then centered at the origin and normalized to fit in the unit sphere. As for the transformation, we sample the rotation uniformly at random according to the geometry of the rotation manifold, up to a maximum angle about any axis, and sample the translation uniformly within a fixed range. To enhance training, we make use of curriculum learning, where the network starts with a smaller range of rotation angles and translation magnitudes at epochs 0 to 69, and continues with the full range of transformations for the remaining epochs, up to a total of 1300 epochs. For the actions, we use two fixed angles (a larger and a smaller one) around each of the three axes and two fixed translation magnitudes in each of the three axis directions. With positive and negative directions, this gives 2 × 3 × 2 = 12 rotation actions and 12 translation actions, for a total of 24 actions. The corresponding rewards for each action step size are normalized separately to have unit norm. At test time, the registration is allowed to run for 60 iterations. For the first 20 iterations, only the larger actions and their rewards are considered and applied. For the remaining 40 iterations, only the smaller actions and their corresponding rewards are used. Unless otherwise stated, this is our default configuration in all subsequent experiments. Table 1 shows that PRANetv2 is competitive with the state of the art among classical and learning-based methods: it attains the lowest rotation error and MCD, while RPMNet attains the lowest translation and L2 errors.
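The action set and the coarse-to-fine schedule described in this subsection can be enumerated as in the following sketch. The step sizes in `ROT_STEPS_DEG` and `TRANS_STEPS` are hypothetical placeholders, since the actual values are not recoverable from the text, and the tuple layout is only one possible encoding.

```python
from itertools import product

# Hypothetical step sizes: placeholders, not the values used in the experiments.
ROT_STEPS_DEG = {"large": 4.0, "small": 1.0}
TRANS_STEPS = {"large": 0.04, "small": 0.01}

def build_actions():
    """Enumerate (kind, scale, axis, signed_step) tuples for all actions."""
    actions = []
    for scale, axis, sign in product(("large", "small"), "xyz", (+1, -1)):
        actions.append(("rotate", scale, axis, sign * ROT_STEPS_DEG[scale]))
        actions.append(("translate", scale, axis, sign * TRANS_STEPS[scale]))
    return actions

def allowed_scale(iteration, coarse_iters=20):
    """Coarse-to-fine schedule: larger steps first, smaller steps afterwards."""
    return "large" if iteration < coarse_iters else "small"

ACTIONS = build_actions()
schedule = [allowed_scale(i) for i in range(60)]
candidates_at_start = [a for a in ACTIONS if a[1] == allowed_scale(0)]
```

At any given iteration only 12 of the 24 actions are candidates, so the argmax over rewards is restricted to the step size that the schedule currently allows.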
Table 1: Results on the clean dataset (Section 5.1).
Model      Rot.     Trans.      L2          MCD
ICP        5.872    0.0463      0.0507      0.00215
FGR*       0.052    0.000407    0.000342    0.000107
RPMNet*    0.061    0.000334    0.000184    0.0000053
DCPv2      3.441    0.0250      0.0283      0.00157
PRANetv1   1.554    0.0246      0.0249      0.00131
PRANetv2   0.045    0.000471    0.000254    0.0000043
Table 2: Results with Gaussian noise (Section 5.2).
Model      Rot.     Trans.      L2          MCD
ICP        6.393    0.0474      0.0517      0.00221
FGR*       3.879    0.0307      0.0322      0.00270
RPMNet*    0.637    0.00769     0.00786     0.000672
DCPv2      11.723   0.0869      0.0922      0.00482
PRANetv1   5.083    0.0448      0.0457      0.00183
PRANetv2   2.900    0.0207      0.0213      0.000768
Table 3: Results with partial visibility (Section 5.3).
Model      Rot.     Trans.      L2          MCD
ICP        24.692   0.251       0.270       0.0134
FGR*       73.734   0.477       0.540       0.0447
RPMNet*    3.514    0.0388      0.0400      0.00161
DCPv2      21.318   0.204       0.228       0.0155
PRANetv1   8.902    0.119       0.124       0.00625
PRANetv2   7.237    0.0759      0.0785      0.00239
Table 4: Inference time (Section 5.4).
Model      Time (ms)
ICP        131
FGR        157
RPMNet     173
DCPv2      24
PRANetv1   742
Table 5: Comparison of the distance used in the reward function (Section 5.5).
Distance in Reward   Rot.     Trans.    L2        MCD
SE(3)                1.554    0.0246    0.0249    0.00131
L2 distance          6.090    0.0480    0.0456    0.00189
MCD                  6.034    0.0577    0.0588    0.00219
Table 6: Comparison of rotation sampling methods (Section 5.5).
Method      Rot.     Trans.    L2        MCD
Isotropic   1.555    0.0240    0.0247    0.00129
Naive       2.771    0.0302    0.0307    0.00128
Table 7: Comparison of training sampling schedules (Section 5.5).
Method       Rot.     Trans.    L2        MCD
Curriculum   1.554    0.0246    0.0249    0.00131
Uniform      4.611    0.0399    0.0400    0.00179
Ad hoc       3.374    0.0289    0.0284    0.00074
Table 8: Comparison of policies (Section 5.5).
Policy          Rot.     Trans.    L2        MCD
deterministic   1.554    0.0246    0.0249    0.00131
stoch1          1.550    0.0248    0.0250    0.00132
stoch2          3.669    0.0544    0.0556    0.00484
5.2 Gaussian Noise
In this experiment, we evaluate our method in the presence of noisy data with sampling differences bringing us closer to real world scenarios. Using a similar methodology to that in Yew and Lee (2020), we sample 1024 points from the source and target point cloud independently, and then add noise sampled from , which is clipped to for each axis. Table 2 shows a summary of the results. Note that the normal data for both RPMNet and FGR are clean and had no noise added, allowing these methods to perform particularly well. PRANetv2 is the second best performing method, while PRANetv1 performs comparably with FGR, showing a superior performance in terms of the modified Chamfer distance.
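The data generation for this experiment can be sketched as follows. The values of `sigma` and `clip` are placeholders, since the ones used in the experiments are not recoverable from the text, and the random `surface` array stands in for a dense sampling of a model. The rigid transformation of the target is omitted to keep the example short.

```python
import numpy as np

def jitter(points, rng, sigma, clip):
    """Add per-axis Gaussian noise, clipped to [-clip, clip]."""
    noise = np.clip(rng.normal(0.0, sigma, size=points.shape), -clip, clip)
    return points + noise

def sample_pair(surface, rng, num_points, sigma, clip):
    """Draw source and target subsets independently, then add noise to each."""
    src_idx = rng.choice(len(surface), size=num_points, replace=False)
    tgt_idx = rng.choice(len(surface), size=num_points, replace=False)
    return (jitter(surface[src_idx], rng, sigma, clip),
            jitter(surface[tgt_idx], rng, sigma, clip))

rng = np.random.default_rng(0)
surface = rng.uniform(-1.0, 1.0, size=(2048, 3))   # stand-in for a dense model
# sigma and clip are placeholder values, not the ones used in the experiments.
source, target = sample_pair(surface, rng, num_points=1024, sigma=0.01, clip=0.05)
```

Because the two subsets are drawn independently, most source points have no exact counterpart in the target even before the noise is added.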
5.3 Partial Visibility
We further increase the difficulty by considering partially visible point clouds, while retaining the Gaussian noise and the sampling differences described in the previous section. Such deviations occur frequently in real world point cloud data. As is done in Yew and Lee (2020), we sample a random plane that passes through the origin, translate it along its normal vector, and then retain 70% of the points. The point cloud is also downsampled to 717 points to maintain a similar point density to that in previous experiments. Similar to the previous section, no noise was added to the normals, which are additional input features for RPMNet and FGR. Table 3 shows a summary of the results. Both versions of PRANet outperform all the other approaches apart from RPMNet.
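The cropping step can be sketched as below. Shifting a random plane along its normal until 70% of the points remain is implemented here as a threshold on the signed distance to the plane. The function name and the random test cloud are assumptions, not part of the original pipeline.

```python
import numpy as np

def crop_half_space(points, rng, keep_ratio=0.7):
    """Keep the keep_ratio fraction of points lying furthest along a random direction."""
    normal = rng.normal(size=3)
    normal /= np.linalg.norm(normal)       # normal of a random plane through the origin
    depth = points @ normal                # signed distance of each point to that plane
    # Shifting the plane along its normal until keep_ratio of the points remain
    # is equivalent to keeping the points with the largest signed distances.
    num_keep = int(round(keep_ratio * len(points)))
    keep = np.argsort(depth)[-num_keep:]
    return points[keep], normal

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))
partial, normal = crop_half_space(cloud, rng)
```

With 1024 input points and a ratio of 0.7, the crop keeps 717 points, which matches the point count quoted above.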
5.4 Computational Efficiency
Table 4 shows the inference time of our method as compared with the other approaches. These experiments were done on a machine with an Intel Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 1 NVIDIA GeForce GTX Titan X GPU and 25 GB of memory. We estimate the average time by computing the time taken on all 1,266 samples of the unseen categories and averaging. Note that the ICP and FGR implementations run entirely on the CPU. As PRANet is iterative, it requires multiple queries to the artificial agent and hence is more computationally costly than the other methods.
5.5 Agent Design
We conduct various experiments to justify our setup. All experiments are done without the addition of ICP.
Number of iterations.
Fig. 6 shows the Chamfer distance and the isotropic rotation error (the angle) over the number of iterations, averaged over the entire test set. Fig. 5 gives some qualitative results. From both figures, it can be seen that the performance gain is maximal towards the start of the registration process, and then gradually decreases as the registration progresses.
Reward function.
Our reward function is based on the distances of the transformations. Instead of using the distance of the transformation, an alternative would be to use point-based distances. We formulate these in terms of the L2 distance of the clean and complete point clouds as well as the modified Chamfer distance. In particular, for each action, the reward is computed from the distance in question, evaluated on the source point cloud transformed by that action. Similar to the transformation distance case, the rotation actions are disentangled from the translations by rotating around the centroid of the observed source point cloud. Table 5 shows the performance of these alternate reward functions. The reward based on the transformation distance outperforms both point-based reward functions.
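A point-based reward of this kind can be sketched as follows. The example uses a plain symmetric Chamfer distance rather than the modified variant of Yew and Lee (2020), and it assumes that the reward is the reduction in distance, mirroring the form of the transformation-based reward. Both choices, and all names below, are illustrative assumptions.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance with squared Euclidean nearest-neighbour terms."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)   # shape (len(a), len(b))
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

def point_based_reward(source, target, action, distance=chamfer_distance):
    """Assumed form: reduction in point-set distance after applying `action`."""
    return distance(source, target) - distance(action(source), target)

rng = np.random.default_rng(0)
target = rng.normal(size=(128, 3))
offset = np.array([0.1, 0.0, 0.0])
source = target - offset
toward = point_based_reward(source, target, lambda p: p + 0.5 * offset)
away = point_based_reward(source, target, lambda p: p - 0.5 * offset)
```

An action that moves the source towards the target earns a positive reward and the opposite action a negative one, but the value depends on nearest-neighbour assignments rather than on the transformation itself.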
Sampling rotations.
To compare the results for the two sampling methods introduced in Section 4.4, we create another test set that samples the transformations according to the naive method, up to a maximum rotation around each axis. In the axis-angle representation, this corresponds to a maximum angle that is only reached for a small set of rotation axes. For the naive method, the transformations are sampled exactly in this range at training time. As for the isotropic sampling method, the transformations are sampled up to a maximum angle that can be reached for any rotation axis. Table 6 gives the results, showing that the isotropic method achieves lower rotation, translation and L2 errors than the naive method, while the MCD values of the two are nearly identical.
Curriculum learning.
We applied curriculum learning to sample the transformations of the point clouds during training time. This ensures that rewards are learned explicitly when transformations are small at the initial phase of the training. To test whether the use of curriculum learning is critical to the success of our method, we evaluate this against a uniform sampling schedule and an ad hoc sampling method, where small and large transformations are each sampled with a probability of 50%. This comparison is shown in Table 7. The sampling method with curriculum learning outperforms the other two methods in all metrics apart from the MCD, where the ad hoc method does better.
Stochasticity of the policy.
Our current policy is entirely deterministic. Given the vector of rewards, we always pick the action with the largest reward. We evaluate whether a stochastic policy, which could potentially help with local minima, would give better results. The first modification that we make is to have the top three actions picked with probability 0.85, 0.15, and 0.05 respectively. We denote this modification “stoch1”. Another variant that we consider is where all actions with positive rewards are considered equally, and are attributed with equal probabilities. This is motivated by the fact that each such action would bring a reduction in the distance. In the event that none of the actions have a positive reward, the action with the top reward is picked. We denote this modification “stoch2”. Table 8 gives a comparison of these policy choices. The “stoch1” policy gives similar results to the deterministic policy, while the other stochastic policy leads to a worse performance.
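The three policies compared here can be sketched as follows. The function names are hypothetical, and the weights passed to `top_k_policy` in the example are illustrative placeholders that the function normalizes to sum to one.

```python
import numpy as np

def deterministic_policy(rewards):
    """Always pick the action with the largest reward."""
    return int(np.argmax(rewards))

def top_k_policy(rewards, weights, rng):
    """Pick among the top-len(weights) actions with the given (normalized) weights."""
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()
    top = np.argsort(rewards)[::-1][:len(probs)]
    return int(rng.choice(top, p=probs))

def positive_uniform_policy(rewards, rng):
    """Pick uniformly among positive-reward actions; fall back to the best action."""
    positive = np.flatnonzero(np.asarray(rewards) > 0)
    if positive.size == 0:
        return deterministic_policy(rewards)
    return int(rng.choice(positive))

rng = np.random.default_rng(0)
rewards = np.array([0.3, -0.1, 0.7, 0.0, 0.2])
greedy = deterministic_policy(rewards)
stoch1_draws = {top_k_policy(rewards, [0.8, 0.15, 0.05], rng) for _ in range(2000)}
stoch2_draws = {positive_uniform_policy(rewards, rng) for _ in range(200)}
fallback = positive_uniform_policy(np.array([-0.5, -0.2, -0.9]), rng)
```

In this example both stochastic variants only ever choose among actions 0, 2 and 4, but the second one gives the weakest positive action the same probability as the best one.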
6 Conclusion
We propose a greedy approach to train an artificial agent by learning rewards using deep supervised learning, and show promising results in experiments with the ModelNet40 dataset with clean, noisy and partially visible data. Unlike the prevailing learning-based methods, our approach does not rely on point matching, and frames the point cloud registration problem as a Markov Decision Process.
Open problems include possible additions and modifications that can enhance the performance. For instance, we used a generic point cloud embedding that was designed for point cloud matching. It could be possible to design a more specific embedding for learning rewards. We also use the max-pool operator to obtain a global embedding, from which we derive the reward vector. An improved architecture would take local information into account as well to obtain the final reward vector. In addition, for simplicity, we used a greedy approach, where we only take the action that maximizes the reward of the next action. An extension could include taking the action that maximizes the cumulative reward of the next steps, with a possible discount factor.
References
Aoki et al. (2019). PointNetLK: Robust & Efficient Point Cloud Registration using PointNet. Conference on Computer Vision and Pattern Recognition. arXiv:1903.05711.
Bauer et al. (2021). ReAgent: Point Cloud Registration using Imitation and Reinforcement Learning. Conference on Computer Vision and Pattern Recognition. arXiv:2103.15231.
Diaconis and Shahshahani (1987). The Subgroup Algorithm for Generating Uniform Random Variables. Probability in the Engineering and Informational Sciences 1 (1), pp. 15–32. ISSN 1469-8951.
Fitzgibbon (2003). Robust registration of 2D and 3D point sets. In Image and Vision Computing, Vol. 21, pp. 1145–1153. ISSN 0262-8856.
Fu et al. (2021). Robust Point Cloud Registration Framework Based on Deep Graph Matching. Conference on Computer Vision and Pattern Recognition. arXiv:2103.04256.
Huynh (2009). Metrics for 3D rotations: Comparison and analysis. Journal of Mathematical Imaging and Vision 35 (2), pp. 155–164. ISSN 0924-9907.
Kuffner (2004). Effective sampling and distance metrics for 3D rigid body path planning. In Proceedings - IEEE International Conference on Robotics and Automation, Vol. 2004, pp. 3993–3998. ISSN 1050-4729.
Li et al. (2020). Iterative Distance-Aware Similarity Matrix Convolution with Mutual-Supervised Point Elimination for Efficient Point Cloud Registration. European Conference on Computer Vision.
Liao et al. (2017). An artificial agent for robust image registration. In AAAI Conference on Artificial Intelligence, Vol. 31, pp. 4168–4175. arXiv:1611.10336.
Lucas and Kanade (1981). An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence.
Qi et al. (2017). PointNet: Deep learning on point sets for 3D classification and segmentation. Conference on Computer Vision and Pattern Recognition, pp. 77–85. arXiv:1612.00593.
Rusinkiewicz and Levoy (2001). Efficient variants of the ICP algorithm. Proceedings of International Conference on 3D Digital Imaging and Modeling, 3DIM, pp. 145–152.
Segal et al. (2009). Generalized-ICP. Robotics: Science and Systems 2, pp. 435.
Wang and Solomon (2019a). Deep Closest Point: Learning Representations for Point Cloud Registration. In International Conference on Computer Vision.
Wang and Solomon (2019b). PRNet: Self-supervised learning for partial-to-partial registration. In Advances in Neural Information Processing Systems, Vol. 32. arXiv:1910.12240.
Wang et al. (2019). Dynamic Graph CNN for Learning on Point Clouds. ACM Transactions on Graphics 38 (5), pp. 13. arXiv:1801.07829. ISSN 1557-7368.
Wu et al. (2014). 3D ShapeNets: A Deep Representation for Volumetric Shapes. In Conference on Computer Vision and Pattern Recognition. arXiv:1406.5670.
Yang et al. (2021). Teaser: Fast and certifiable point cloud registration. IEEE Transactions on Robotics 37 (2), pp. 314–333.
Yang et al. (2016). Go-ICP: A Globally Optimal Solution to 3D ICP Point-Set Registration. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (11), pp. 2241–2254. arXiv:1605.03344.
Yew and Lee (2020). RPM-Net: Robust point matching using learned features. In Conference on Computer Vision and Pattern Recognition. arXiv:2003.13479. ISSN 1063-6919.
Zhou et al. (2016). Fast Global Registration. In European Conference on Computer Vision. arXiv:1612.02155.