1 Introduction
The local reference frame (LRF) is a canonical coordinate system established on a 3D local surface and a useful geometric cue for 3D point clouds. The LRF possesses two intriguing traits. One is that rotation invariance can be achieved if the local surface is transformed with respect to the LRF [7]. The other is that useful geometric information can be mined with the LRF [12]. These traits make LRFs popular in many geometry-relevant tasks, especially local shape description and six-degree-of-freedom (6DoF) pose estimation.
For local shape description, two corresponding local surfaces can be converted into the same pose, and the full 3D geometric information can be employed, which is beneficial to improving the performance of local descriptors. Some hand-crafted local shape descriptors, e.g., signature of histograms of orientations (SHOT) [15] and rotational projection statistics (RoPS) [7], estimate an LRF from the local surface and then translate local geometric information with respect to the estimated LRF into distinctive and rotation-invariant feature representations. Some learned local descriptors, e.g., [5] and [14], leverage LRFs to overcome the sensitivity of geometric deep learning networks to rotations. Therefore, the LRF is critical for both traditional and learned local shape descriptors. For 6DoF pose estimation, an LRF can significantly improve efficiency. Traditional 6DoF pose estimation is usually performed via RANSAC [3], which randomly samples correspondences from an initial correspondence pool for pose prediction. Such random sampling is neither reliable nor computationally efficient [2]. By contrast, an initial pose can be predicted directly from two corresponding LRFs, reducing the computational complexity from $O(N^3)$ to $O(N)$.

The desirable properties for an LRF are twofold [15]. The first is invariance to rigid transformations (i.e., translations and rotations). The second is robustness to common disturbances (e.g., noise, clutter, occlusion, and varying mesh resolutions). To achieve these goals, many LRF methods have been proposed in the past decade; they can be categorized into two classes [17]: covariance analysis (CA)-based [9, 15] and point spatial distribution (PSD)-based [12, 13, 18]
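The pose-from-LRFs idea can be sketched as follows: given two corresponding keypoints and their LRFs (written as 3x3 matrices with the axes as columns, an assumed convention), a single correspondence already determines a full rigid transform.

```python
import numpy as np

def pose_from_lrfs(p_model, L_model, p_scene, L_scene):
    """Predict a rigid transform (R, t) mapping model to scene coordinates
    from ONE correspondence, given the 3x3 LRF matrices at each keypoint
    (axes stored as columns)."""
    # Rotation that carries the model LRF onto the scene LRF.
    R = L_scene @ L_model.T
    # Translation that aligns the two keypoints under that rotation.
    t = p_scene - R @ p_model
    return R, t
```

With ideal LRFs this removes the need to sample minimal point triples, which is where the complexity reduction over plain RANSAC comes from.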
. CA-based LRFs compute the eigenvectors of a covariance matrix calculated from either the points or the triangles of the local surface. PSD-based LRFs usually estimate the axes successively, with the main effort devoted to determining the x-axis [17]. However, most CA-based LRFs still suffer from sign ambiguity, and PSD-based LRFs show limited robustness to high levels of noise and variations of mesh resolution [13]. Methods in both classes usually apply a weighting strategy to improve repeatability. However, their weights are determined heuristically, and repeatable performance in challenging 3D matching cases cannot be guaranteed.
Motivated by these considerations, we propose a learned approach to LRF estimation (named LRFNet), which considers the contribution of all neighboring points (Fig. 1). Our key insight is that each neighboring point in the local surface makes a unique contribution to LRF construction, which can be quantitatively represented by assigning weights to these points. Given a local surface centered at a keypoint, we first take the normal of the keypooint, computed within a subset of the radius neighbors, as its z-axis; the repeatability of this choice has been confirmed in [12]. Compared with the z-axis, estimating the x-axis is more challenging due to noise, clutter, and occlusion. By collecting angle and distance attributes within the local neighborhood, we formulate the estimation of the x-axis as a weighted prediction problem with respect to these geometric attributes. Unlike previous CA-based and PSD-based approaches, this learned weighting strategy is shown to be invariant to rigid transformations and robust to noise, clutter, occlusion, and varying mesh resolutions. Our network can be trained in a weakly supervised manner: it needs only the corresponding relationships between local patches, rather than ground-truth LRFs and/or exact pose variation information between patches. Extensive analysis and comparative experiments on three public datasets addressing different application scenarios demonstrate that LRFNet, although trained on a single dataset only, is more repeatable and robust than several state-of-the-art LRF methods. In addition, LRFNet can significantly boost local shape description and 6DoF pose estimation performance when matching 3D point clouds. The major contributions of this paper are summarized as follows:
- LRFNet, based on a Siamese network that needs only weak supervision, achieves state-of-the-art repeatability under the impacts of noise, varying mesh resolutions, clutter, and occlusion. To the best of our knowledge, we are the first to design LRFs for local surfaces with deep learning.

- LRFNet can significantly boost the performance of local shape description and 6DoF pose estimation.
The rest of this paper is organized as follows. Section 2 presents a detailed description of our proposed LRFNet. Section 3 presents the experimental evaluation of LRFNet on three public datasets with comparisons with several stateoftheart methods. Several concluding remarks are drawn in Section 4.
2 Method
This section presents the details of our proposed LRFNet for 3D local surfaces. We first introduce the technical approach for calculating the three axes of an LRF and then describe a weakly supervised approach for training LRFNet.
2.1 A Learned LRF Proposal
The whole architecture of LRFNet is shown in Fig. 2(a). LRFNet predicts the directions of the three axes successively. For a local surface, we first estimate its z-axis via the normal vector computed over a small subset of the local point set. Then, a unique weight is learned for each point in the local surface. The x-axis is calculated by integrating projection vectors with the learned weights via a vector-sum operation. Finally, the y-axis is calculated as the cross-product of the z-axis and the x-axis.
LRF definition: Given a local surface centered at a keypoint $p$, the LRF at $p$ (denoted by $\mathbf{L}_p$) can be represented as:

$$\mathbf{L}_p = [\mathbf{x}_p, \mathbf{y}_p, \mathbf{z}_p], \quad (1)$$

where $\mathbf{x}_p$, $\mathbf{y}_p$, and $\mathbf{z}_p$ denote the x-axis, y-axis, and z-axis of $\mathbf{L}_p$, respectively. As the three axes are orthogonal, the estimation of an LRF consists of two parts: estimation of the z-axis and estimation of the x-axis.
A naive way to learn an LRF for a local surface is to train a network that directly regresses the axes, under the premise that ground-truth LRFs are labeled for local surfaces. Unfortunately, a network trained in this manner meets two difficulties. The first is that the definition of ground-truth LRFs for local surfaces remains an open issue in the community [17]. The second, which is more important, is that the orthogonality of the three axes cannot be guaranteed. We therefore suggest estimating the z-axis and the x-axis separately.
z-axis: As for the z-axis, we take the normal of the keypoint, which has been confirmed [12] to be quite repeatable. To resist the impact of clutter and occlusion, we collect a small subset of the local surface to calculate the normal. For more details, readers are referred to [18].
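A covariance-based normal estimate over a small neighbor subset, as commonly used for this step [18], can be sketched as follows (the viewpoint-based sign disambiguation is an illustrative assumption):

```python
import numpy as np

def z_axis(keypoint, neighbors, viewpoint=None):
    """Estimate the keypoint normal from a small neighbor subset (N, 3)
    via covariance analysis; the eigenvector of the smallest eigenvalue
    is the surface normal."""
    centered = neighbors - neighbors.mean(axis=0)
    cov = centered.T @ centered / len(neighbors)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    n = eigvecs[:, 0]                       # smallest-eigenvalue direction
    # Disambiguate the sign, e.g. by pointing toward an assumed viewpoint.
    if viewpoint is not None and np.dot(viewpoint - keypoint, n) < 0:
        n = -n
    return n / np.linalg.norm(n)
```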
x-axis: Once the z-axis is determined, the remaining task is to compute the x-axis. Compared with the z-axis, the x-axis is more challenging to estimate due to noise, clutter, and occlusion [17].

We argue that each neighboring point in the local surface makes a unique contribution to LRF construction. Hence, we predict a weight for each neighboring point and leverage all neighboring points, with their learned weights, for x-axis prediction. The main steps are as follows.
First, to make the estimated LRF invariant to rigid transformations, our network consumes invariant geometric attributes rather than point coordinates. In particular, two attributes, i.e., the relative distance and the surface variation angle, are used in LRFNet, as illustrated in Fig. 2(b). For a neighbor $\mathbf{q}_i$ of $p$, the two attributes of $\mathbf{q}_i$ are computed as:

$$d_i = \frac{\|\mathbf{q}_i - \mathbf{p}\|_2}{R}, \qquad \theta_i = \angle(\mathbf{q}_i - \mathbf{p},\ \mathbf{z}_p), \quad (2)$$

where $\|\cdot\|_2$ is the L2 norm and $R$ represents the support radius of the local surface. The ranges of $d_i$ and $\theta_i$ are $[0, 1]$ and $[0, \pi]$, respectively. Thus, every radius neighbor is represented by two attributes that will later be encoded into a weight value by LRFNet. The two attributes employed in LRFNet have at least two merits. First, the unique spatial information of a radius neighbor in the local surface is well represented, as shown in Fig. 3; the two attributes are complementary to each other. Second, both attributes are computed with respect to the keypoint and are therefore rotation invariant, which makes the learned weights rotation invariant as well.
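The two attributes can be sketched in NumPy as follows, assuming the angle is measured between the keypoint-to-neighbor vector and the z-axis (our reading of the surface variation angle):

```python
import numpy as np

def attributes(p, z, neighbors, R):
    """Rotation-invariant attributes per neighbor: normalized distance
    d in [0, 1] and angle theta in [0, pi] between (q - p) and the z-axis."""
    diff = neighbors - p                        # (N, 3)
    dist = np.linalg.norm(diff, axis=1)
    d = dist / R
    cos = (diff @ z) / np.maximum(dist, 1e-12)  # guard zero-length vectors
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    return np.stack([d, theta], axis=1)         # (N, 2)
```

Because both quantities depend only on distances and angles relative to the keypoint and its z-axis, rotating the whole neighborhood leaves them unchanged.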
Second, with the geometric attributes as input, we use a U-Net composed of multi-layer perceptron (MLP) layers only to predict the weights of the neighboring points. The details of the network are illustrated in Fig. 4. The network is very simple, yet it is sufficient to predict stable and informative weights for neighboring points (as will be verified in the experiments).
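As an illustrative stand-in for the weight network (the layer sizes, activations, and untrained random parameters are our assumptions for checking shapes only, not the paper's trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_weights(attrs, hidden=32):
    """Toy MLP mapping each (d, theta) attribute pair to one positive
    scalar weight. attrs: (N, 2) -> weights: (N,)."""
    W1 = rng.standard_normal((2, hidden)); b1 = np.zeros(hidden)
    W2 = rng.standard_normal((hidden, 1)); b2 = np.zeros(1)
    h = np.maximum(attrs @ W1 + b1, 0.0)          # ReLU hidden layer
    w = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))      # sigmoid -> (0, 1)
    return w[:, 0]                                # one weight per point
```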
Third, because the x-axis is orthogonal to the z-axis, we project each neighbor $\mathbf{q}_i$ onto the tangent plane of the z-axis and compute a projection vector $\mathbf{v}_i$ for $\mathbf{q}_i$ as:

$$\mathbf{v}_i = (\mathbf{q}_i - \mathbf{p}) - \left[(\mathbf{q}_i - \mathbf{p}) \cdot \mathbf{z}_p\right]\mathbf{z}_p. \quad (3)$$

We then integrate all projection vectors in a weighted vector-sum manner:

$$\mathbf{x}_p = \frac{\sum_{i=1}^{N} w_i \mathbf{v}_i}{\left\|\sum_{i=1}^{N} w_i \mathbf{v}_i\right\|_2}, \quad (4)$$

where $N$ denotes the total number of radius neighbors of keypoint $p$ and $w_i$ is the weight learned by LRFNet for $\mathbf{q}_i$. Another way of determining the x-axis based on these weights is to choose the projection vector with the maximum weight, as in many PSD-based LRFs [12, 13]. However, this fails to leverage all neighboring information, and we will show in the experiments that it is inferior to the vector-sum operation.
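The tangent-plane projection and weighted vector-sum steps above can be sketched as follows (hypothetical names; the weights would come from the trained network):

```python
import numpy as np

def x_axis(p, z, neighbors, weights):
    """x-axis as the normalized weighted vector sum of the neighbors'
    projections onto the tangent plane of the z-axis."""
    diff = neighbors - p
    proj = diff - np.outer(diff @ z, z)       # remove the z component
    v = (weights[:, None] * proj).sum(axis=0) # weighted vector sum
    return v / np.linalg.norm(v)
```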
y-axis: Based on the calculated z-axis and x-axis, the y-axis is computed as their cross-product, i.e., $\mathbf{y}_p = \mathbf{z}_p \times \mathbf{x}_p$.
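Putting the pieces together, the completed frame can be sketched as follows (the axes-as-columns layout and right-handed convention y = z x x are our assumptions):

```python
import numpy as np

def assemble_lrf(x, z):
    """Complete the frame from unit x- and z-axes: y = z x x, so that
    the three axes form a right-handed orthonormal basis."""
    y = np.cross(z, x)
    return np.stack([x, y, z], axis=1)  # 3x3 matrix, axes as columns
```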
2.2 Weakly Supervised Training Scheme
Our training data consist of a series of corresponding local surface patches. The correspondences are obtained from the ground-truth rigid transformation between two whole point clouds. Notably, LRFNet needs only these corresponding relationships between local surface patches, rather than ground-truth LRFs and/or exact pose variation information between patches. Therefore, our network can be trained in a weakly supervised manner.
We train LRFNet with two streams in a Siamese fashion, where each stream independently predicts an LRF for a local surface. Specifically, the two streams take the local surfaces $\mathcal{P}$ and $\mathcal{Q}$ of keypoints $p$ and $q$ as inputs, respectively. Here, $p$ and $q$ are two corresponding keypoints sampled from the model and scene point clouds. Both streams share the same architecture and underlying weights. We use the LRFs $\mathbf{L}_p$ and $\mathbf{L}_q$ predicted by the two streams to transform the local surfaces $\mathcal{P}$ and $\mathcal{Q}$ into the coordinate systems of the two LRFs. Then, we calculate the Chamfer Distance [1] between the two transformed local surfaces as the loss function to train LRFNet:

$$loss = d_{CD}(\mathcal{P}', \mathcal{Q}'), \quad (5)$$

where $\mathcal{P}'$ and $\mathcal{Q}'$ are the transformed local surfaces and

$$d_{CD}(X, Y) = \frac{1}{|X|}\sum_{x \in X}\min_{y \in Y}\|x - y\|_2 + \frac{1}{|Y|}\sum_{y \in Y}\min_{x \in X}\|x - y\|_2. \quad (6)$$
Notably, we argue that it is difficult to define a “good” LRF for a single local surface in isolation. For 3D shape matching, LRFs that can align the poses of two corresponding local surface patches are judged repeatable. This motivates us to consider two local patches simultaneously and to employ the Chamfer Distance to train the network.
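A minimal NumPy sketch of the symmetric Chamfer Distance used as the training loss (the dense pairwise-distance formulation is an illustrative choice; an efficient implementation would use a KD-tree):

```python
import numpy as np

def chamfer(X, Y):
    """Symmetric Chamfer Distance between point sets X (n, 3) and Y (m, 3):
    mean nearest-neighbor distance in both directions."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # (n, m)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

The loss is zero exactly when the two transformed patches coincide, which is what a repeatable pair of LRFs should achieve.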
3 Experiments
In this section, we first evaluate the repeatability of our LRFNet on three standard datasets, i.e., the Bologna retrieval (BR) dataset [16], the UWA 3D modeling (UWA3M) dataset [10], and the UWA object recognition (UWAOR) dataset [11], together with a comparison against other state-of-the-art LRFs. Second, we apply our LRFNet to local shape description and 6DoF pose estimation to verify the practicability of our method. Third, analysis experiments are conducted to improve the explainability of the proposed LRFNet.
3.1 Experimental Setup
The details of our experiments, including the description of the datasets and of all compared methods, are introduced before the evaluation. The experiments were conducted on a Windows server with an Intel Xeon E5-2640 2.39 GHz CPU and 96 GB of RAM. We train LRFNet using a batch size of 512 local surface pairs and the ADAM optimizer with an initial learning rate of 1e-4, which decays every epoch. Each sampled local surface contains 256 points. The maximum epoch count is set to 20.
3.1.1 Datasets
Our experiments involve three standard datasets with different application scenarios. The variety among these public 3D datasets helps us evaluate the performance of our method in a comprehensive manner. The main properties of these datasets are summarized in Table 1.
These datasets are additionally injected with five levels of Gaussian noise (from 0.1 mr to 0.5 mr) and four levels of mesh decimation (fractions of the original mesh resolution). Here, the unit mr denotes mesh resolution. Notably, the noise-free BR dataset is used to train our LRFNet; the remaining noisy BR data and all data in the UWA3M and UWAOR datasets are used for testing.
Dataset               BR         UWA3M          UWAOR
Scenario              Retrieval  Registration   Recognition
Challenge             —          —              —
# Models              6          4              5
# Scenes              18         75             50
# Model-scene pairs   18         75             188
3.1.2 Compared Methods
We compare our LRFNet with several existing LRF methods for a thorough evaluation. Specifically, the compared methods were proposed by Mian et al. [9], Tombari et al. [15], Petrelli et al. [13], Guo et al. [7], and Yang et al. [18]; we dub them Mian, Tombari, Petrelli, Guo, and Yang, respectively. For a fair comparison, we set the support radius of all LRFs to 15 mr. The properties of these LRFs are shown in Table 2.
To evaluate the local shape description performance of our method, we replace the LRF in four LRFbased descriptors (i.e., snapshots [8], SHOT [15], RoPS [7] and TOLDI [18]) and assess the performance variations. To measure the 6DoF pose estimation performance of our method, we adapt LRFNet to the RANSAC pipeline and compare with the original RANSAC [4].
Method      Mian  Tombari  Guo  Petrelli  Yang  Ours
Category    CA    CA       CA   PSD       PSD   PSD
Data type   P     P        M    P         P     P
Weight      H     H        H    —         H     L
3.2 Performance Evaluation of LRFNet
3.2.1 Repeatability Performance
We evaluate the repeatability of all LRFs via the popular metric of [15], which measures the overall angular error between two LRFs. The repeatability results of the evaluated LRFs are shown in Fig. 5 and Fig. 6. Several observations can be made from these figures.
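Treating two LRFs as rotation matrices, their overall angular deviation can be sketched with the standard rotation-distance formula (an assumed form; the exact metric is defined in [15]):

```python
import numpy as np

def angular_error(L1, L2):
    """Angular deviation (degrees) between two LRFs treated as rotation
    matrices: theta = arccos((trace(L1^T L2) - 1) / 2)."""
    c = (np.trace(L1.T @ L2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```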
First, as witnessed by Fig. 5, our LRF, together with Tombari, Petrelli, and Yang, achieves decent performance on the BR dataset; on the UWA3M and UWAOR datasets, our LRFNet achieves the best performance. Second, as shown in Fig. 6(a), LRFNet and Tombari achieve comparably stable performance on the BR dataset with respect to different levels of Gaussian noise. Fig. 6(b) and Fig. 6(c) indicate that LRFNet achieves the best performance under all levels of Gaussian noise on the UWA3M and UWAOR datasets, surpassing the others by a significant gap; note that the UWA3M and UWAOR datasets also include nuisances such as clutter, self-occlusion, and occlusion. Third, the results in Fig. 6(d)-(f) suggest that LRFNet is the best competitor under the tested levels of mesh decimation on all datasets.
These results clearly demonstrate the strong robustness of our LRFNet with respect to Gaussian noise, mesh decimation, clutter, and occlusion. The reasons are at least twofold. One is that all points are leveraged to generate the critical x-axis, which guarantees robustness to Gaussian noise and low levels of mesh decimation. The other is that LRFNet learns stable and informative weights for neighboring points, which improves its robustness to common nuisances.
3.2.2 Local Shape Description Performance
We further evaluate our LRFNet by replacing the LRFs in four LRF-based descriptors (i.e., snapshots, SHOT, RoPS, and TOLDI) with our LRFNet. We then compare their descriptor matching performance measured via the recall vs. 1-precision curve (RPC) [6, 15]. Notably, the original LRF methods employed by snapshots, SHOT, RoPS, and TOLDI are Mian, Tombari, Guo, and Yang, respectively. We conduct this experiment on the original BR, UWA3M, and UWAOR datasets. Fig. 7 reports the RPC results of all tested descriptors.
As witnessed by the figure, all LRFbased descriptors equipped with our LRFNet outperform their original versions. Specifically, snapshots achieves a dramatic performance improvement with our LRFNet on the BR dataset; the performance of SHOT also climbs significantly on the UWA3M and UWAOR datasets with the help of the proposed LRFNet. Therefore, we can draw a conclusion that LRF plays an important role in local shape description, where a repeatable LRF can effectively improve the description performance of an LRFbased descriptor without changing its feature representation. It also indicates that the proposed LRFNet can bring positive impacts on a number of existing local shape descriptors.
3.2.3 6DoF Pose Estimation Performance
A general 6DoF pose estimation pipeline with local descriptors consists of correspondence generation followed by pose estimation from correspondences with potential outliers [3]. RANSAC is arguably the de facto 6DoF pose estimator in many applications. However, a key limitation of RANSAC is its computational complexity of $O(N^3)$, so estimating a reasonable pose requires a huge number of iterations. With LRFs, a single correspondence suffices to generate a 6DoF pose, decreasing the computational complexity from $O(N^3)$ to $O(N)$. Therefore, we apply LRFNet to 6DoF pose estimation, following a RANSAC-fashion pipeline; the difference is that we sample one correspondence per iteration. Two criteria, i.e., the rotation error between the predicted rotation and the ground-truth one, and the translation error between the predicted translation vector and the ground-truth one [10], are employed for evaluating 6DoF pose estimation performance. The initial feature correspondence set is generated by first matching TOLDI descriptors (equipped with our LRFNet) and keeping the 100 correspondences with the highest similarity scores. 100 iterations are assigned to our method and 1000 to RANSAC. The average rotation and translation errors of the two estimators on the three experimental datasets are shown in Table 3.
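The one-correspondence sampling scheme can be sketched as follows (function names, the inlier threshold tau, and the residual test are our assumptions, not the paper's exact pipeline):

```python
import numpy as np

def one_point_ransac(km, Lm, ks, Ls, iters=100, tau=0.1):
    """RANSAC-style pose estimation sampling ONE correspondence per
    iteration. km/ks: (K, 3) matched model/scene keypoints; Lm/Ls:
    (K, 3, 3) their LRFs. Each sampled pair of LRFs yields a full 6DoF
    hypothesis, scored by how many correspondences it explains."""
    rng = np.random.default_rng(0)
    best, best_cnt = None, -1
    for _ in range(iters):
        i = rng.integers(len(km))
        R = Ls[i] @ Lm[i].T              # hypothesis from a single pair
        t = ks[i] - R @ km[i]
        res = np.linalg.norm(km @ R.T + t - ks, axis=1)
        cnt = int((res < tau).sum())     # inlier count as the score
        if cnt > best_cnt:
            best, best_cnt = (R, t), cnt
    return best
```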
                            BR       UWA3M   UWAOR
RANSAC   Rotation error     0.000    7.929   9.513
         Translation error  0.0298   0.696   0.769
LRFNet   Rotation error     0.000    6.088   4.392
         Translation error  0.0239   0.608   0.405
Two salient observations can be made from the table. First, both RANSAC and our method achieve accurate pose estimation results on the BR dataset, which contains point cloud pairs with large overlap ratios; however, our method needs only one tenth of the iterations required by RANSAC. Second, on the more challenging UWA3M and UWAOR datasets, our method significantly outperforms RANSAC. This demonstrates that LRFNet can simultaneously improve the accuracy and efficiency of RANSAC-based 6DoF pose estimation.
3.3 Analysis Experiments
3.3.1 Verifying the Rationality of LRFNet
To verify the rationality of the main technical components of our LRFNet, we conduct the following experiments. First, to verify the choice of the weighted vector-sum operation for x-axis calculation, we test a variant that takes the projection vector with the maximum weight as the x-axis (dubbed “Max”). Second, to demonstrate that the axes of an LRF are not suited to direct regression, we compare our method with one that regresses the x-axis via the network shown on the left of Fig. 8 (dubbed “DR”). The results are shown on the right of Fig. 8.
Clearly, LRFNet achieves the best performance among tested methods. It verifies that learning weights rather than directly learning axes is more reasonable. In addition, vectorsum is more appropriate for integrating projection vectors with learned weights for LRFNet.
3.3.2 Visualization
Fig. 9 visualizes the weights learned by our LRFNet for several sample local surfaces, revealing two interesting findings. First, closer points do not necessarily make greater contributions. The assumption that closer points should receive greater weights is common to many existing CA-based and PSD-based LRF methods, including Tombari, Guo, and Yang, yet these methods are inferior to our LRFNet in terms of repeatability. Second, x-axis estimation is generally determined by a particular area, rather than by a single salient point as employed by many PSD-based methods, e.g., Petrelli. These visualization results also support our opinion that each neighboring point in the local surface makes a unique contribution to LRF construction.
4 Conclusion
In this paper, we have proposed LRFNet, a learned LRF for 3D local surfaces that is repeatable and robust to a number of nuisances. LRFNet assumes that each neighboring point in the local surface makes a unique contribution to LRF construction and measures such contributions via learned weights. Experiments showed that our LRFNet outperforms many state-of-the-art LRF methods on datasets addressing different application scenarios. In addition, LRFNet can significantly boost local shape description and 6DoF pose estimation performance. In the future, we expect to further improve LRFNet by considering RGB cues and multi-scale geometric information.
References

[1] (2018) PPF-FoldNet: unsupervised learning of rotation invariant 3D local descriptors. In Proc. European Conference on Computer Vision, pp. 602-618.
[2] (2019) 3D local features for direct pairwise registration. arXiv preprint arXiv:1904.04281.
[3] (2010) Overview of the RANSAC algorithm. Image Rochester NY 4(1), pp. 2-3.
[4] (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), pp. 381-395.
[5] (2019) The perfect match: 3D point cloud matching with smoothed densities. In Proc. IEEE International Conference on Computer Vision and Pattern Recognition, pp. 5545-5554.
[6] (2016) A comprehensive performance evaluation of 3D local feature descriptors. International Journal of Computer Vision 116(1), pp. 66-89.
[7] (2013) Rotational projection statistics for 3D local surface description and object recognition. International Journal of Computer Vision 105(1), pp. 63-86.
[8] (2007) Snapshots: a novel local surface descriptor and matching algorithm for robust 3D surface alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(7), pp. 1285-1290.
[9] (2010) On the repeatability and quality of keypoints for local feature-based 3D object retrieval from cluttered scenes. International Journal of Computer Vision 89(2-3), pp. 348-361.
[10] (2006) A novel representation and feature matching algorithm for automatic pairwise registration of range images. International Journal of Computer Vision 66(1), pp. 19-40.
[11] (2006) Three-dimensional model-based object recognition and segmentation in cluttered scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(10), pp. 1584-1601.
[12] (2011) On the repeatability of the local reference frame for partial shape matching. In Proc. IEEE International Conference on Computer Vision, pp. 2244-2251.
[13] (2012) A repeatable and efficient canonical reference for surface matching. In Proc. Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, pp. 403-410.
[14] (2019) Learning an effective equivariant 3D descriptor without supervision. In Proc. IEEE International Conference on Computer Vision, pp. 6401-6410.
[15] (2010) Unique signatures of histograms for local surface description. In Proc. European Conference on Computer Vision, pp. 356-369.
[16] (2013) Performance evaluation of 3D keypoint detectors. International Journal of Computer Vision 102(1-3), pp. 198-220.
[17] (2018) Toward the repeatability and robustness of the local reference frame for 3D shape matching: an evaluation. IEEE Transactions on Image Processing 27(8), pp. 3766-3781.
[18] (2017) TOLDI: an effective and robust approach for 3D local shape description. Pattern Recognition 65, pp. 175-187.