The local reference frame (LRF) is a canonical coordinate system established on a 3D local surface and is a useful geometric cue for 3D point clouds. An LRF possesses two intriguing traits. One is that rotation invariance can be achieved if the local surface is transformed with respect to the LRF. The other is that useful geometric information can be mined with the LRF. These traits make LRFs popular in many geometry-related tasks, especially local shape description and six-degree-of-freedom (6-DoF) pose estimation.
For local shape description, two corresponding local surfaces can be converted into the same pose and full 3D geometric information can be employed, which benefits the performance of local descriptors. Some hand-crafted local shape descriptors, e.g., signature of histograms of orientations (SHOT) and signature of rotational projection statistics (RoPS), estimate an LRF from the local surface and then translate local geometric information with respect to the estimated LRF into distinctive and rotation-invariant feature representations. Some learned local descriptors leverage LRFs to overcome the sensitivity of geometric deep learning networks to rotations. Therefore, the LRF is critical for both traditional and learned local shape descriptors. For 6-DoF pose estimation, an LRF can significantly improve efficiency. Traditional 6-DoF pose estimation is usually performed via RANSAC, which randomly samples correspondences from an initial correspondence pool for pose prediction. Such a random sampling method is neither reliable nor computationally efficient. By contrast, we can directly predict an initial pose from two corresponding LRFs, reducing the computational complexity from $O(n^3)$ to $O(n)$, where $n$ is the number of correspondences.
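To make the single-correspondence idea concrete, the following is a minimal sketch of how a pose could be recovered from one pair of corresponding keypoints and their LRFs. It assumes each LRF is stored as a 3×3 orthonormal matrix whose columns are the x-, y-, and z-axes expressed in the coordinates of its own point cloud; the function name `pose_from_lrfs` and this storage convention are illustrative assumptions, not details stated in the text.

```python
import numpy as np

def pose_from_lrfs(p_model, lrf_model, p_scene, lrf_scene):
    """Rigid transform (R, t) that maps the model onto the scene.

    Assumption: each LRF is a 3x3 orthonormal matrix with its axes as columns.
    """
    R = lrf_scene @ lrf_model.T   # rotates the model axes onto the scene axes
    t = p_scene - R @ p_model     # then aligns the two keypoints
    return R, t
```

If both LRFs are perfectly repeatable, `R` and `t` equal the true transformation, so a single correspondence suffices where a coordinate-only estimator needs at least three.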
The desirable properties of an LRF are twofold. The first is invariance to rigid transformations (e.g., translations and rotations). The second is robustness to common disturbances (e.g., noise, clutter, occlusion, and varying mesh resolutions). To achieve these goals, many LRF methods have been proposed in the past decade, and they can be categorized into two classes: covariance analysis (CA)-based [9, 15] and point spatial distribution (PSD)-based [12, 13, 18]. CA-based LRFs compute the eigenvectors of a covariance matrix calculated for either the points or the triangles in the local surface. PSD-based LRFs usually estimate the axes successively, with the main effort put on determining the x-axis. However, most CA-based LRFs still suffer from sign ambiguity, and PSD-based LRFs show limited robustness to high levels of noise and variations in mesh resolution. Methods in both classes usually apply a weighting strategy to improve their repeatability. However, their weights are determined heuristically, and repeatability in challenging 3D matching cases cannot be guaranteed.
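The sign ambiguity of CA-based LRFs can be seen in a small sketch. The code below is a generic covariance-analysis construction and not any of the cited methods; the function name `ca_lrf` and the unweighted scatter matrix are assumptions made purely for illustration.

```python
import numpy as np

def ca_lrf(keypoint, neighbors):
    """Generic covariance-analysis LRF (illustrative, unweighted)."""
    offsets = neighbors - keypoint
    cov = offsets.T @ offsets / len(offsets)   # 3x3 scatter around the keypoint
    _, vecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    x = vecs[:, 2]                             # direction of largest spread
    z = vecs[:, 0]                             # direction of smallest spread
    y = np.cross(z, x)
    return np.stack([x, y, z], axis=1)         # axes stored as columns
```

An eigenvector and its negation are equally valid solutions, so `x` and `z` are defined only up to sign unless an additional disambiguation rule is applied.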
Motivated by these considerations, we propose a learned approach to LRF estimation (named LRF-Net), which considers the contribution of all neighboring points (Fig. 1). Our key insight is that each neighboring point in the local surface makes a unique contribution to LRF construction, which can be quantitatively represented by assigning weights to these points. Given a local surface centered at a keypoint, we first take the normal of the keypoint, computed within a subset of the radius neighbors, as the z-axis; the repeatability of this choice has been confirmed in prior work. Compared with the z-axis, estimating the x-axis is more challenging due to noise, clutter, and occlusion. By collecting angle and distance attributes within a local neighborhood, we formulate the estimation of the x-axis as a weighted prediction problem with respect to these geometric attributes. Unlike previous CA-based and PSD-based approaches, such a learned weighting strategy is shown to be invariant to rigid transformations and robust to noise, clutter, occlusion, and varying mesh resolutions. Our network can be trained in a weakly supervised manner. Specifically, it needs only the correspondences between local patches, rather than ground-truth LRFs and/or exact pose variation information between patches. Extensive analysis and comparative experiments on three public datasets addressing different application scenarios demonstrate that LRF-Net, although trained on only one dataset, is more repeatable and robust than several state-of-the-art LRF methods. In addition, LRF-Net can significantly boost local shape description and 6-DoF pose estimation performance when matching 3D point clouds. The major contributions of this paper are summarized as follows:
We propose LRF-Net, which is based on a Siamese network that needs only weak supervision and achieves state-of-the-art repeatability under noise, varying mesh resolutions, clutter, and occlusion. To the best of our knowledge, we are the first to design an LRF for local surfaces with deep learning.
LRF-Net can significantly boost the performance of local shape description and 6-DoF pose estimation.
The rest of this paper is organized as follows. Section 2 presents a detailed description of our proposed LRF-Net. Section 3 presents the experimental evaluation of LRF-Net on three public datasets, with comparisons against several state-of-the-art methods. Concluding remarks are drawn in Section 4.
This section presents the details of our proposed LRF-Net for 3D local surfaces. We first introduce the technical approach for calculating the three axes of an LRF and then describe a weakly supervised approach for training LRF-Net.
2.1 A Learned LRF Proposal
The whole architecture of LRF-Net is shown in Fig. 2(a). LRF-Net predicts the directions of the three axes successively. For a local surface, we first estimate its z-axis via its normal vector computed over a small subset of the local point set. Then, a unique weight is learned for each point in the local surface. The x-axis is calculated by integrating projection vectors with the learned weights using a vector-sum operation. Finally, the y-axis is calculated as the cross product of the z-axis and the x-axis.
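LRF definition: Given a local surface centered at keypoint $\mathbf{p}$, the LRF at $\mathbf{p}$ (denoted by $\mathbf{L}_{\mathbf{p}}$) can be represented as:
$$\mathbf{L}_{\mathbf{p}} = [\mathbf{x}(\mathbf{p}),\ \mathbf{y}(\mathbf{p}),\ \mathbf{z}(\mathbf{p})],$$
where $\mathbf{x}(\mathbf{p})$, $\mathbf{y}(\mathbf{p})$, and $\mathbf{z}(\mathbf{p})$ denote the x-axis, y-axis, and z-axis of $\mathbf{L}_{\mathbf{p}}$, respectively. As the three axes are orthogonal, the estimation of the LRF contains two parts: estimation of the z-axis and of the x-axis.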
A naive way to learn an LRF for a local surface is to train a network that directly regresses the axes, on the premise that ground-truth LRFs are labeled for local surfaces. Unfortunately, a network trained in this manner faces two difficulties. The first is that the definition of ground-truth LRFs for local surfaces remains an open issue in the community. The second, which is more important, is that the orthogonality of the three axes cannot be guaranteed. We therefore suggest estimating the z-axis and the x-axis independently.
z-axis: We take the normal of the keypoint as the z-axis, which has been confirmed to be quite repeatable. To resist the impact of clutter and occlusion, we use a small subset of the local surface to calculate the normal. Further details of this normal estimation can be found in the literature.
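A keypoint normal computed over a subset of the support can be sketched as follows. This is only one plausible realization: the plane fit via the smallest-eigenvalue eigenvector, the subset size (`subset_ratio`, one third of the support radius), and the sign rule that points the normal away from the bulk of the neighbors are all assumptions for illustration and are not specified in the text.

```python
import numpy as np

def estimate_z_axis(keypoint, neighbors, support_radius, subset_ratio=1.0 / 3.0):
    """Normal at the keypoint from a small subset of its neighbors (sketch)."""
    offsets = neighbors - keypoint
    dist = np.linalg.norm(offsets, axis=1)
    subset = neighbors[dist <= subset_ratio * support_radius]
    if len(subset) < 3:
        raise ValueError("need at least 3 points to fit a plane")
    centered = subset - subset.mean(axis=0)
    _, vecs = np.linalg.eigh(centered.T @ centered)
    z = vecs[:, 0]                      # eigenvector of the smallest eigenvalue
    # Assumed sign rule: point away from the majority of the neighbors.
    if np.sum(offsets @ z) > 0:
        z = -z
    return z
```

Using only the inner part of the support keeps points near the patch boundary, where clutter and occlusion are most likely, out of the normal estimate.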
x-axis: Once the z-axis is determined, the remaining task is to compute the x-axis. Compared with the z-axis, the x-axis is more challenging to estimate due to noise, clutter, and occlusion. We argue that each neighboring point in the local surface makes a unique contribution to LRF construction. Hence, we predict a weight for each neighboring point and leverage all neighboring points with their learned weights for x-axis prediction. The main steps are as follows.
First, to make the estimated LRF invariant to rigid transformations, our network consumes invariant geometric attributes rather than point coordinates. In particular, two attributes, i.e., the relative distance $a_{dist}$ and the surface variation angle $a_{angle}$, are used in LRF-Net, as illustrated in Fig. 2(b). For a neighbor $\mathbf{p}_i$ of $\mathbf{p}$, the two attributes of $\mathbf{p}_i$ are computed as:
$$a_{dist} = \frac{\|\mathbf{p}_i - \mathbf{p}\|}{r}, \qquad a_{angle} = \cos\angle\big(\mathbf{z}(\mathbf{p}),\ \mathbf{p}_i - \mathbf{p}\big),$$
where $\|\cdot\|$ is the $L_2$ norm and $r$ represents the support radius of the local surface. The ranges of $a_{dist}$ and $a_{angle}$ are $[0, 1]$ and $[-1, 1]$, respectively. Thus, every radius neighbor is represented by two attributes, which are later encoded into a weight value by LRF-Net. The two attributes employed in LRF-Net have at least two merits. First, the unique spatial information of a radius neighbor in the local surface can be well represented, as shown in Fig. 3; the two attributes are complementary to each other. Second, the two attributes are calculated with respect to the keypoint and are therefore rotation invariant, which makes the learned weights rotation invariant as well.
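The rotation invariance of such attributes can be checked with a short sketch. It assumes that the relative distance is the Euclidean distance to the keypoint divided by the support radius, and that the surface variation angle is represented by the cosine of the angle between the keypoint normal and the direction to the neighbor; the function name and this exact parameterization are assumptions for illustration.

```python
import numpy as np

def geometric_attributes(keypoint, z_axis, neighbors, support_radius):
    """Per-neighbor (relative distance, angle cosine) pairs, shape (n, 2)."""
    offsets = neighbors - keypoint
    dist = np.linalg.norm(offsets, axis=1)
    a_dist = dist / support_radius
    # Cosine between the keypoint normal and the keypoint-to-neighbor direction.
    a_angle = (offsets @ z_axis) / np.maximum(dist, 1e-12)
    return np.stack([a_dist, a_angle], axis=1)
```

Both quantities depend only on lengths and inner products, so applying the same rigid transformation to the keypoint, its normal, and its neighbors leaves them unchanged.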
Second, with the geometric attributes as input, we use a U-Net with multilayer perceptron (MLP) layers only to predict weights for neighboring points. The details of the network are illustrated in Fig. 4. The network is very simple, yet sufficient to predict stable and informative weights for neighboring points (as will be verified in the experiments).
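Third, because the x-axis is orthogonal to the z-axis, we project each neighbor $\mathbf{p}_i$ onto the tangent plane of the z-axis and compute a projection vector $\mathbf{v}_i$ for $\mathbf{p}_i$ as:
$$\mathbf{v}_i = (\mathbf{p}_i - \mathbf{p}) - \big((\mathbf{p}_i - \mathbf{p}) \cdot \mathbf{z}(\mathbf{p})\big)\,\mathbf{z}(\mathbf{p}).$$
We integrate all projection vectors in a weighted vector-sum manner:
$$\mathbf{x}(\mathbf{p}) = \frac{\sum_{i=1}^{n} w_i \mathbf{v}_i}{\left\|\sum_{i=1}^{n} w_i \mathbf{v}_i\right\|},$$
where $n$ denotes the total number of radius neighbors of keypoint $\mathbf{p}$ and $w_i$ is the weight learned by LRF-Net for $\mathbf{p}_i$. Another way to determine the x-axis from these weights is to choose the vector with the maximum weight, as in many PSD-based LRFs [12, 13]. However, this fails to leverage all neighboring information, and we will show in the experiments that it is inferior to the vector-sum operation.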
y-axis: Based on the calculated z-axis and x-axis, the y-axis is computed as their cross product.
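The three steps can be put together in a minimal sketch that assembles an LRF from a given z-axis and per-point weights. The weights are taken as an input array here (in LRF-Net they would come from the network); the function name `build_lrf`, the column storage of the axes, and the cross-product order z × x (chosen to give a right-handed frame) are assumptions for illustration.

```python
import numpy as np

def build_lrf(keypoint, z_axis, neighbors, weights):
    """Assemble an LRF from a z-axis and per-neighbor weights (sketch)."""
    z = z_axis / np.linalg.norm(z_axis)
    offsets = neighbors - keypoint
    # Project every neighbor onto the tangent plane of the z-axis.
    proj = offsets - np.outer(offsets @ z, z)
    # Weighted vector sum, normalized to unit length, gives the x-axis.
    x = (weights[:, None] * proj).sum(axis=0)
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)                      # completes a right-handed frame
    return np.stack([x, y, z], axis=1)      # axes stored as columns
```

Because the x-axis is built from vectors that already lie in the tangent plane, orthogonality to the z-axis holds by construction and never has to be learned.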
2.2 Weakly Supervised Training Scheme
Our training data consist of a series of corresponding local surface patches. The correspondences are obtained from the ground-truth rigid transformation between two whole point clouds. In particular, LRF-Net needs only the correspondences between local surface patches, rather than ground-truth LRFs and/or exact pose variation information between patches. Therefore, our network can be trained in a weakly supervised manner.
We train LRF-Net with two streams in a Siamese fashion, where each stream independently predicts an LRF for a local surface. Specifically, the two streams take as inputs the local surfaces of two corresponding keypoints, one sampled from the model point cloud and the other from the scene point cloud. Both streams share the same architecture and underlying weights. We use the LRFs predicted by the two streams to transform the two local surfaces into the coordinate systems of their respective LRFs. Then, we calculate the Chamfer Distance between the two transformed local surfaces as the loss function to train LRF-Net.
Notably, we argue that it is difficult to define a “good” LRF for a single local surface. For 3D shape matching, LRFs that can align the poses of two local surface patches are judged as repeatable. This motivates us to consider two local patches simultaneously and to employ the Chamfer Distance to train the network.
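The training signal described in this subsection can be illustrated with a small NumPy sketch. It uses one common form of the Chamfer Distance (the sum of the two mean squared nearest-neighbor distances); whether this matches the exact form used for training is an assumption, as are the function names and the column storage of LRF axes. An actual training loop would need an automatic-differentiation framework rather than plain NumPy.

```python
import numpy as np

def to_local(points, keypoint, lrf):
    """Express points in the LRF at the keypoint (LRF axes stored as columns)."""
    return (points - keypoint) @ lrf

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance with squared Euclidean distances (one variant)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def lrf_alignment_loss(patch_a, key_a, lrf_a, patch_b, key_b, lrf_b):
    """Loss is small when the two LRFs bring the patches into the same pose."""
    return chamfer_distance(to_local(patch_a, key_a, lrf_a),
                            to_local(patch_b, key_b, lrf_b))
```

The loss needs only the two corresponding patches and their predicted LRFs; no ground-truth frame or relative pose enters the computation.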
In this section, we first evaluate the repeatability of our LRF-Net on three standard datasets, namely the Bologna retrieval (BR) dataset, the UWA 3D modeling (UWA3M) dataset, and the UWA object recognition (UWAOR) dataset, together with a comparison with other state-of-the-art LRFs. Second, we apply LRF-Net to local shape description and 6-DoF pose estimation to verify the practicability of our method. Third, analysis experiments are conducted to improve the explainability of the proposed LRF-Net.
3.1 Experimental Setup
The details of our experiments, including the datasets and the compared methods, are introduced before the evaluation. The experiments were conducted on a Windows Server with an Intel Xeon E5-2640 2.39 GHz CPU and 96 GB of RAM. We train LRF-Net using a batch size of 512 local surface pairs and the ADAM optimizer with an initial learning rate of 1e-4, which decays every epoch. Each sampled local surface contains 256 points. The maximum number of epochs is set to 20.
Our experimental datasets include three standard datasets with different application scenarios. The variety among these public 3D datasets helps us evaluate the performance of our method comprehensively. The main properties of these datasets are summarized in Table 1.
These datasets are also injected with five levels of Gaussian noise (from 0.1 mr to 0.5 mr) and four levels of mesh decimation, expressed as fractions of the original mesh resolution. Here, the unit mr denotes mesh resolution. Notably, the noise-free BR dataset is used to train LRF-Net; the remaining noisy data in the BR dataset and the data in the UWA3M and UWAOR datasets are used for testing.
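Noise injection in units of mesh resolution can be sketched as follows. The sketch assumes that a noise level of k mr means zero-mean Gaussian noise with standard deviation k times the mesh resolution, and it approximates the mesh resolution of a point cloud by the mean nearest-neighbor distance; both choices, and the function names, are assumptions for illustration.

```python
import numpy as np

def approx_mesh_resolution(points):
    """Mean nearest-neighbor distance, used here as a stand-in for mr."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)            # ignore each point's distance to itself
    return float(np.sqrt(d2.min(axis=1)).mean())

def add_gaussian_noise(points, level_mr, mesh_resolution, rng=None):
    """Add zero-mean Gaussian noise with std = level_mr * mesh_resolution."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = level_mr * mesh_resolution
    return points + rng.normal(scale=sigma, size=points.shape)
```

The pairwise distance matrix is quadratic in the number of points, so a spatial index would be preferable for full scans.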
3.1.2 Compared Methods
We compare our LRF-Net with several existing LRF methods for a thorough evaluation. Specifically, the compared methods are those proposed by Mian et al., Tombari et al., Petrelli et al., Guo et al., and Yang et al. We dub them Mian, Tombari, Petrelli, Guo, and Yang, respectively. For a fair comparison, we set the support radius of all LRFs to 15 mr. The properties of these LRFs are shown in Table 2.
To evaluate the local shape description performance of our method, we replace the LRF in four LRF-based descriptors (i.e., snapshots, SHOT, RoPS, and TOLDI) and assess the performance variations. To measure the 6-DoF pose estimation performance of our method, we adapt LRF-Net to the RANSAC pipeline and compare it with the original RANSAC.
3.2 Performance Evaluation of LRF-Net
3.2.1 Repeatability Performance
We evaluate the repeatability of all LRFs via a popular metric that measures the overall angular error between two LRFs. The repeatability results of the evaluated LRFs are shown in Fig. 5 and Fig. 6. Several observations can be made from these figures.
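An angular repeatability score of this kind can be sketched as a mean cosine between corresponding axes after the model LRF has been mapped into the scene with the ground-truth rotation. The exact definition of the metric is not restated in the text, so the form below (averaging the x-axis and z-axis cosines only, since the y-axis follows from the other two) and the function name `mean_cos` are assumptions.

```python
import numpy as np

def mean_cos(lrf_model, lrf_scene, R_gt):
    """Mean cosine between the x- and z-axes of two LRFs (axes as columns)."""
    aligned = R_gt @ lrf_model          # model LRF expressed in the scene frame
    cos_x = float(aligned[:, 0] @ lrf_scene[:, 0])
    cos_z = float(aligned[:, 2] @ lrf_scene[:, 2])
    return 0.5 * (cos_x + cos_z)
```

A score of 1 means the two frames coincide, while sign flips of both axes drive the score to -1.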
First, as witnessed by Fig. 5, our LRF together with Tombari, Petrelli, and Yang achieve decent performance on the BR dataset. On the UWA3M and UWAOR datasets, our LRF-Net achieves the best performance. Second, as shown in Fig. 6(a), LRF-Net and Tombari achieve a comparably stable performance on the BR dataset with respect to different levels of Gaussian noise. Fig. 6(b) and Fig. 6(c) indicate that LRF-Net achieves the best performance under all levels of Gaussian noise on the UWA3M and UWAOR datasets, surpassing the others by a very significant gap. Note that UWA3M and UWAOR datasets also include nuisances such as clutter, self-occlusion, and occlusion. Third, results in Fig. 6(d)-(f) suggest that LRF-Net is the best competitor with , , and mesh decimation on all datasets.
These results clearly demonstrate the strong robustness of our LRF-Net with respect to Gaussian noise, mesh decimation, clutter, and occlusion. The reasons are at least twofold. One is that all points are leveraged to generate the critical x-axis, which improves robustness to Gaussian noise and low-level mesh decimation. The other is that LRF-Net can learn stable and informative weights for neighboring points, which improves its robustness to common nuisances.
3.2.2 Local Shape Description Performance
We further evaluate our LRF-Net by replacing the LRFs in four LRF-based descriptors (i.e., snapshots, SHOT, RoPS, and TOLDI) with our LRF-Net. We then compare their descriptor matching performance, measured via the recall vs. 1-precision curve (RPC) [6, 15]. Notably, the original LRF methods employed by snapshots, SHOT, RoPS, and TOLDI are Mian, Tombari, Guo, and Yang, respectively. We conduct this experiment on the original BR, UWA3M, and UWAOR datasets. Fig. 7 reports the RPC results of all tested descriptors.
As witnessed by the figure, all LRF-based descriptors equipped with our LRF-Net outperform their original versions. Specifically, snapshots achieves a dramatic performance improvement with our LRF-Net on the BR dataset; the performance of SHOT also climbs significantly on the UWA3M and UWAOR datasets with the help of the proposed LRF-Net. Therefore, we can draw a conclusion that LRF plays an important role in local shape description, where a repeatable LRF can effectively improve the description performance of an LRF-based descriptor without changing its feature representation. It also indicates that the proposed LRF-Net can bring positive impacts on a number of existing local shape descriptors.
3.2.3 6-DoF Pose Estimation Performance
A general 6-DoF pose estimation process with local descriptors consists of correspondence generation followed by pose estimation from correspondences with potential outliers. RANSAC is arguably the de facto 6-DoF pose estimator in many applications. However, a key limitation of RANSAC is that its computational complexity is $O(n^3)$, so estimating a reasonable pose requires a huge number of iterations. With LRFs, a single correspondence is able to generate a 6-DoF pose, decreasing the computational complexity from $O(n^3)$ to $O(n)$. Therefore, we apply LRF-Net to 6-DoF pose estimation, following a RANSAC-style pipeline. The difference is that we sample one correspondence per iteration. Two criteria, i.e., the rotation error between the predicted rotation and the ground-truth rotation, and the translation error between the predicted translation vector and the ground-truth translation vector, are employed to evaluate 6-DoF pose estimation performance.
The initial feature correspondence set is generated by matching TOLDI descriptors (equipped with our LRF-Net) and keeping the 100 correspondences with the highest similarity scores. Our method and RANSAC are assigned 100 and 1000 iterations, respectively. The average rotation errors and translation errors of the two estimators on the three experimental datasets are shown in Table 3.
Two salient observations can be made from the table. First, both RANSAC and our method achieve accurate pose estimation results on the BR dataset, which contains point cloud pairs with large overlapping ratios. However, our method needs only one tenth of the iterations required by RANSAC. Second, on the more challenging datasets, i.e., UWA3M and UWAOR, our method significantly outperforms RANSAC. This demonstrates that LRF-Net can improve the accuracy and the efficiency of RANSAC for 6-DoF pose estimation simultaneously.
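The single-correspondence pipeline of this subsection can be sketched as below. The hypothesis scoring by inlier count with a fixed distance threshold, the error definitions (geodesic rotation angle and Euclidean translation difference), and all function names are assumptions for illustration; the text does not specify these details.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Angle of the residual rotation between prediction and ground truth."""
    c = (np.trace(R_gt.T @ R_pred) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))

def translation_error(t_pred, t_gt):
    return float(np.linalg.norm(t_pred - t_gt))

def one_point_ransac(kp_m, lrf_m, kp_s, lrf_s, inlier_thresh, iters=100, rng=None):
    """Sample one correspondence per iteration; LRFs are (n, 3, 3), axes as columns."""
    rng = np.random.default_rng() if rng is None else rng
    best_R, best_t, best_inliers = np.eye(3), np.zeros(3), -1
    for _ in range(iters):
        i = rng.integers(len(kp_m))
        R = lrf_s[i] @ lrf_m[i].T            # pose hypothesis from one LRF pair
        t = kp_s[i] - R @ kp_m[i]
        residual = np.linalg.norm(kp_m @ R.T + t - kp_s, axis=1)
        inliers = int((residual < inlier_thresh).sum())
        if inliers > best_inliers:
            best_R, best_t, best_inliers = R, t, inliers
    return best_R, best_t, best_inliers
```

With 100 correspondences and 100 iterations, as in the setup above, most correspondences are expected to be tried at least once, which is why a small iteration budget can suffice when the LRFs are repeatable.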
3.3 Analysis Experiments
3.3.1 Verifying the Rationality of LRF-Net
To verify the rationality of the main technical components of our LRF-Net, we conduct the following experiments. First, to verify the choice of the weighted vector-sum operation for x-axis calculation, we test an approach that uses the vector with the maximum weight as the x-axis (dubbed “Max”). Second, to demonstrate that the axes of an LRF are not suitable for direct regression, we compare our method with one that regresses the x-axis via a network shown on the left of Fig. 8 (dubbed “DR”). The results are shown on the right of Fig. 8.
Clearly, LRF-Net achieves the best performance among tested methods. It verifies that learning weights rather than directly learning axes is more reasonable. In addition, vector-sum is more appropriate for integrating projection vectors with learned weights for LRF-Net.
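The difference between the vector-sum and “Max” variants lies only in how the weighted projection vectors are reduced to a single direction, which the following sketch makes explicit. It assumes the projection vectors have already been computed in the tangent plane of the z-axis; the function names are hypothetical.

```python
import numpy as np

def x_axis_vector_sum(proj_vectors, weights):
    """x-axis from all projection vectors, combined by a weighted sum."""
    v = (weights[:, None] * proj_vectors).sum(axis=0)
    return v / np.linalg.norm(v)

def x_axis_max(proj_vectors, weights):
    """x-axis from the single projection vector with the largest weight."""
    v = proj_vectors[np.argmax(weights)]
    return v / np.linalg.norm(v)
```

The two variants coincide only when a single point carries all of the weight; otherwise the vector sum averages over many points, so a perturbation of one point changes the result less.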
Fig. 9 visualizes the weights learned by our LRF-Net for several sample local surfaces, which reveals two interesting findings. First, closer points do not seem to make greater contributions. Many existing CA- and PSD-based LRF methods, including Tombari, Guo, and Yang, assume that closer points should have greater weights; however, they are inferior to our LRF-Net in terms of repeatability. Second, x-axis estimation is generally determined by a particular area, rather than by a single salient point as employed by many PSD-based methods, e.g., Petrelli. These visualization results also support our view that each neighboring point in the local surface makes a unique contribution to LRF construction.
In this paper, we have proposed LRF-Net, a learned LRF for 3D local surfaces that is repeatable and robust to a number of nuisances. LRF-Net assumes that each neighboring point in the local surface makes a unique contribution to LRF construction and measures such contributions via learned weights. Experiments showed that our LRF-Net outperforms many state-of-the-art LRF methods on datasets addressing different application scenarios. In addition, LRF-Net can significantly boost local shape description and 6-DoF pose estimation performance. In the future, we expect to further improve LRF-Net by considering RGB cues and multi-scale geometric information.
[1] (2018) PPF-FoldNet: unsupervised learning of rotation invariant 3D local descriptors. In Proc. European Conference on Computer Vision, pp. 602–618.
[2] (2019) 3D local features for direct pairwise registration. arXiv preprint arXiv:1904.04281.
[3] (2010) Overview of the RANSAC algorithm. Image Rochester NY 4 (1), pp. 2–3.
[4] (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395.
[5] (2019) The perfect match: 3D point cloud matching with smoothed densities. In Proc. IEEE International Conference on Computer Vision and Pattern Recognition, pp. 5545–5554.
[6] (2016) A comprehensive performance evaluation of 3D local feature descriptors. International Journal of Computer Vision 116 (1), pp. 66–89.
[7] (2013) Rotational projection statistics for 3D local surface description and object recognition. International Journal of Computer Vision 105 (1), pp. 63–86.
[8] (2007) Snapshots: a novel local surface descriptor and matching algorithm for robust 3D surface alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (7), pp. 1285–1290.
[9] (2010) On the repeatability and quality of keypoints for local feature-based 3D object retrieval from cluttered scenes. International Journal of Computer Vision 89 (2-3), pp. 348–361.
[10] (2006) A novel representation and feature matching algorithm for automatic pairwise registration of range images. International Journal of Computer Vision 66 (1), pp. 19–40.
[11] (2006) Three-dimensional model-based object recognition and segmentation in cluttered scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (10), pp. 1584–1601.
[12] (2011) On the repeatability of the local reference frame for partial shape matching. In Proc. IEEE International Conference on Computer Vision, pp. 2244–2251.
[13] (2012) A repeatable and efficient canonical reference for surface matching. In Proc. Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, pp. 403–410.
[14] (2019) Learning an effective equivariant 3D descriptor without supervision. In Proc. IEEE International Conference on Computer Vision, pp. 6401–6410.
[15] (2010) Unique signatures of histograms for local surface description. In Proc. European Conference on Computer Vision, pp. 356–369.
[16] (2013) Performance evaluation of 3D keypoint detectors. International Journal of Computer Vision 102 (1-3), pp. 198–220.
[17] (2018) Toward the repeatability and robustness of the local reference frame for 3D shape matching: an evaluation. IEEE Transactions on Image Processing 27 (8), pp. 3766–3781.
[18] (2017) TOLDI: an effective and robust approach for 3D local shape description. Pattern Recognition 65, pp. 175–187.