1 Introduction
3D interest point or keypoint detection refers to the problem of finding stable points with well-defined positions that are highly repeatable on 3D point clouds under arbitrary SE(3) transformations. These detected keypoints play important roles in many computer vision and robotics tasks, where 3D point clouds are widely adopted as the data structure to represent objects and scenes in 3D space. Examples include geometric registration for 3D object modeling
[1] or point cloud-based Simultaneous Localization and Mapping (SLAM) [23], and 3D object [15, 19] or place recognition [34]. In these tasks, the detected keypoints are respectively used as correspondences to compute rigid transformations, and locations to extract representative signatures for efficient retrievals. Hence, a keypoint detector that cannot produce highly repeatable and well-localized keypoints from 3D point clouds under arbitrary transformations would cause these tasks to fail catastrophically.

Despite the high number of successful handcrafted detectors proposed for 2D images [25, 20, 12], significantly fewer handcrafted detectors [32] with limited success have been proposed for 3D point clouds. This difference can be largely attributed to the difficulty of handcrafting powerful algorithms to extract meaningful information solely from the Euclidean coordinates of the point cloud, in comparison to images that contain richer information from the additional RGB channels. The problem is further aggravated by the fact that it is difficult to handcraft 3D detectors to handle 3D point clouds under arbitrary transformations, i.e., different reference coordinate frames. In particular, different transformations applied to the same 3D point cloud cause the Euclidean coordinates of each point to change significantly, thus severely affecting the repeatability of the keypoints from the 3D detectors.
It seems evident that all the above-mentioned problems with handcrafted detectors for 3D point clouds can be resolved by the highly successful data-driven deep networks. However, very few deep learning-based 3D keypoint detectors exist (only one deep learning-based approach [36] exists to date), in contrast to the increasing success of learning 3D keypoint descriptors [7, 6, 38, 16]. This is due to the lack of ground truth training datasets to supervise deep learning-based detectors on 3D point clouds. Unlike 3D descriptors that are supervised by easily available ground truth registered overlapping 3D point clouds [7, 6, 16, 38, 36, 11], it is impossible for anyone to identify and label the "ground truth" keypoints on 3D point clouds. Consequently, most of the works on 3D descriptors [7, 6, 38, 16] ignored the detector problem and are built on top of existing handcrafted 3D detectors or uniform sampling.
In view of the challenges on both handcrafted and deep learning-based 3D detectors, we propose the USIP detector: an Unsupervised Stable Interest Point deep learning-based detector that can detect highly repeatable and accurately localized keypoints from 3D point clouds under arbitrary transformations without the need for any ground truth training data. To this end, we design a Feature Proposal Network (FPN) that outputs a set of keypoints and their respective saliency uncertainties from an input 3D point cloud. Our FPN improves keypoint localization by estimating keypoint positions, in contrast to existing 3D detectors [29, 36, 39] that select existing points in the point cloud as keypoints, which causes quantization errors. During training, we apply randomly generated SE(3) transformations on each point cloud to get a set of corresponding pairs of transformed point clouds as inputs to the FPN. Furthermore, we identify and prevent the degeneracy of our USIP detector. We encourage high repeatability and accurate localization of the keypoints with a probabilistic chamfer loss that minimizes the distances between the detected keypoints from the training point cloud pairs. Additionally, we introduce a point-to-point loss to enforce the constraint that keypoints lie close to the point cloud. We verify our USIP detector by performing extensive repeatability tests on several simulated and real-world benchmark 3D point cloud datasets from Lidar, RGBD and CAD models. Some qualitative results are shown in Fig. 1.

Our key contributions are summarized as follows:

Our USIP detector is fully unsupervised, thus avoiding the need for ground truth keypoint labels that are impossible to obtain.

We provide degeneracy analysis of our USIP detector and suggest solutions to prevent it.

Our FPN improves keypoint localization by estimating the keypoint position instead of choosing it from an existing point in the point cloud.

We introduce the probabilistic chamfer loss and point-to-point loss to encourage high repeatability and accurate keypoint localization.

The use of randomly generated transformations on point clouds during training inherently allows our network to achieve good performance under rotations.
2 Related Work
Unlike the recent success of deep learning-based 3D keypoint descriptors [7, 6, 16, 38, 36, 11], most existing 3D keypoint detectors remain handcrafted. A comprehensive review and evaluation of existing handcrafted 3D keypoint detectors can be found in [32]. Local Surface Patches (LSP) [3] and Shape Index (SI) [8] are based on the maximum and minimum principal curvatures of a point, and consider the point as a keypoint if it is a global extremum in a predefined neighborhood. Intrinsic Shape Signatures (ISS) [39] and KeyPoint Quality (KPQ) [22] select salient points that have a local neighborhood with large variations along each principal axis. MeshDoG [37] and Salient Points (SP) [2] construct a scale-space of the curvature with the Difference-of-Gaussian (DoG) operator similar to SIFT [20]. Points with local extremum values over a one-ring neighborhood are selected as keypoints. These methods can be regarded as 3D extensions of SIFT. Laplace-Beltrami Scale-space (LBSS) [33] computes the saliency by applying a Laplace-Beltrami operator on increasing supports for each point.
More recently, LORAX [9] proposes to project the point set into a depth map and use Principal Component Analysis (PCA) to select keypoints with commonly found geometric characteristics. All handcrafted approaches share the common trait of relying on the local geometric properties of the points to select keypoints. Hence, the performances of these detectors deteriorate under disturbances such as noise, density variations and/or arbitrary transformations. In contrast, our deep learning-based USIP detector is more resilient to these disturbances by learning from data. To the best of our knowledge, the only existing deep learning-based 3D keypoint detector is the weakly supervised 3DFeatNet [36], which is trained with GPS/INS tagged point clouds. However, the training of 3DFeatNet is largely focused on learning discriminative descriptors using the Siamese architecture, with an attention score map that estimates the saliency of each point as its byproduct. It does not ensure good performance of the keypoint detection. In comparison, our USIP is designed to encourage high repeatability and accurate localization of the keypoints. Furthermore, our method is fully unsupervised and does not rely on any form of ground truth datasets.

3 Our USIP Detector
Fig. 2 shows an illustration of the pipeline to train our USIP detector. We denote a point cloud from the training dataset as X = {x_1, ..., x_N}, x_n ∈ ℝ³. A set of L transformation matrices {T_1, ..., T_L}, where T_l ∈ SE(3), is randomly generated and applied to the point cloud to form L pairs of training inputs denoted as (X, X̃_l), where X̃_l = T_l ∘ X. Here, we use the operator ∘ to denote matrix multiplication under homogeneous coordinates with a slight abuse of notation. We drop the index l for brevity and refer to a triplet of a training pair of point clouds and their corresponding transformation matrix as (X, X̃, T). During training, X and X̃ are respectively fed into the FPN, which outputs M proposal keypoints and their saliency uncertainties denoted as (Q, Σ) and (Q̃, Σ̃) for the respective point clouds, where Q = {q_1, ..., q_M}, Q̃ = {q̃_1, ..., q̃_M}, Σ = {σ_1, ..., σ_M} and Σ̃ = {σ̃_1, ..., σ̃_M}. We enforce σ_m > 0 and σ̃_m > 0 so that each gives a valid rate parameter in our probabilistic chamfer loss (see later paragraph). To improve keypoint localization, it is not necessary for any q_m to be one of the points in X. A similar condition applies to all q̃_m.
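As a concrete illustration of this pairing step, the sketch below generates a random SE(3) transformation, applies it to a point cloud, and undoes it. This is a minimal NumPy sketch; the sampling ranges and point count are illustrative assumptions, not our training settings.

```python
import numpy as np

def random_se3(rng):
    # Random rotation via QR decomposition of a Gaussian matrix,
    # sign-corrected so that det(R) = +1, plus a random translation.
    A = rng.standard_normal((3, 3))
    Q, _ = np.linalg.qr(A)
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0
    t = rng.uniform(-1.0, 1.0, size=3)
    T = np.eye(4)
    T[:3, :3] = Q
    T[:3, 3] = t
    return T

def transform(T, X):
    # Apply a 4x4 transform T to an (N, 3) point cloud X
    # under homogeneous coordinates.
    return X @ T[:3, :3].T + T[:3, 3]

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 3))            # training point cloud
T = random_se3(rng)
X_tilde = transform(T, X)                      # second element of the pair
# Undoing the transformation recovers the original cloud.
X_back = transform(np.linalg.inv(T), X_tilde)
```

The same `transform` with the inverse matrix is what "undoing" T⁻¹ ∘ Q̃ amounts to in the loss computation below.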
We undo the transformation on Q̃ with a slight abuse of notation to get Q̃' = T⁻¹ ∘ Q̃, so that Q̃' can be compared directly to Q. Here, we make the assumption that the saliency uncertainties remain unaffected by the transformation, i.e., Σ̃ is unchanged. The objectives of detecting keypoints that are highly repeatable and accurately localized from 3D point clouds under arbitrary transformations can now be achieved by formulating a loss function that minimizes the difference between (Q, Σ) and (Q̃', Σ̃). To this end, we propose the loss function L = L_c + λ L_point, where L_c is the probabilistic chamfer loss that minimizes the probabilistic distances between all corresponding pairs of keypoints in Q and Q̃'. L_point is the point-to-point loss that minimizes the distance of the estimated keypoints to their respective nearest neighbors in the point cloud. This constrains the estimated keypoints to be close to the point cloud. λ is a hyperparameter that adjusts the relative contributions of L_c and L_point to the total loss. More specifically:

Probabilistic Chamfer Loss
A straightforward way to minimize the difference between Q and Q̃' = T⁻¹ ∘ Q̃ is to use the chamfer loss:

(1)  L_chamfer = Σ_{q ∈ Q} min_{q̃' ∈ Q̃'} ||q − q̃'||₂ + Σ_{q̃' ∈ Q̃'} min_{q ∈ Q} ||q̃' − q||₂
that minimizes the distance of each point in one keypoint set to its nearest neighbor in the other set. However, the proposals are not equally salient. The receptive field of a point can be a featureless surface since the receptive field is limited to a small volume. In this case, it is detrimental to force the FPN to minimize the distance between q_m and q̃'_j, where q̃'_j is the nearest neighbor of q_m in Q̃'.
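The plain chamfer loss can be sketched as follows (a minimal NumPy version with brute-force nearest neighbors; averaging over each direction instead of summing is an illustrative choice):

```python
import numpy as np

def chamfer(Q, Q2):
    # Q, Q2: (M, 3) keypoint sets. Returns the symmetric chamfer
    # distance: the mean distance from each point to its nearest
    # neighbor in the other set, accumulated over both directions.
    D = np.linalg.norm(Q[:, None, :] - Q2[None, :, :], axis=-1)  # (M, M)
    return D.min(axis=1).mean() + D.min(axis=0).mean()
```

Note that this treats every proposal as equally salient, which is exactly the weakness addressed by the probabilistic version below.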
To mitigate the above problem, we design our FPN to learn the saliency uncertainties Σ and Σ̃ of the proposal keypoints Q and Q̃ with a probabilistic chamfer loss L_c. In particular, we propose to formulate L_c with an exponential distribution that measures the probabilistic distances between Q and Q̃' with the saliency uncertainties Σ and Σ̃. More formally, the probability distribution of the distance d_mj = ||q_m − q̃'_j||₂, where q̃'_j is the nearest neighbor of q_m in Q̃', is given by:

(2)  p(d_mj | σ_mj) = (1/σ_mj) exp(−d_mj / σ_mj),  with σ_mj = (σ_m + σ̃_j) / 2.
p(d_mj | σ_mj) is a valid probability distribution since it integrates to 1. A shorter distance between the proposal keypoints q_m and q̃'_j gives a higher probability that q_m and q̃'_j are highly repeatable and accurately localized keypoints in the point clouds X and X̃. Assuming i.i.d. distances for all m, the joint distribution in the direction Q → Q̃' is given by:

(3)  p(Q → Q̃') = ∏_{m=1}^{M} p(d_mj | σ_mj)

It is important to note that the probability distribution is not symmetrical when the order of the keypoint sets is swapped, i.e., Q → Q̃' and Q̃' → Q, due to a different set of nearest neighbors in each direction. Hence, the joint distribution in the opposite direction Q̃' → Q is given by:

(4)  p(Q̃' → Q) = ∏_{j=1}^{M} p(d_jm | σ_jm)
Finally, the probabilistic chamfer loss between (Q, Σ) and (Q̃', Σ̃) is given by the negative log-likelihood of the joint distributions defined in Eq. 3 and 4:

(5)  L_c = −ln p(Q → Q̃') − ln p(Q̃' → Q) = Σ_{m=1}^{M} (ln σ_mj + d_mj/σ_mj) + Σ_{j=1}^{M} (ln σ_jm + d_jm/σ_jm)
We further analyze the physical meaning of σ_mj (or σ_jm) by computing the extrema of Eq. 2 from its first derivative over σ:

(6)  ∂p(d | σ) / ∂σ = (1/σ³)(d − σ) exp(−d/σ)

and solve for the stationary point:

(7)  σ* = d.
Furthermore, the negative second derivative at σ* = d means that given a fixed d, the highest probability is achieved at σ = d. Consider any triplet of proposal keypoints {q_a, q̃_b, q_c}, where d_ab and d_cb are the distances between the nearest neighbor pairs {q_a, q̃_b} and {q_c, q̃_b} (q̃_b can be the nearest neighbor in both orders Q → Q̃' and Q̃' → Q since the chamfer matching is not bijective). σ_c has to take a large value when d_ab is small and d_cb is large, because we have shown that σ = d at the optimum. Furthermore, a small d_ab and a large d_cb imply that {q_a, q̃_b} are repeatable and accurately localized keypoints while q_c is not. Hence, a large saliency uncertainty for a bad proposal keypoint q_c at the optimum shows that our probabilistic chamfer loss is guiding the FPN to learn correctly.
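The optimum in Eq. 7 is easy to check numerically: for a fixed d, p(d | σ) = (1/σ) exp(−d/σ) peaks at σ = d (a small sanity check, not part of training):

```python
import numpy as np

d = 0.7
sigmas = np.linspace(0.05, 5.0, 2000)
p = np.exp(-d / sigmas) / sigmas   # Eq. 2 for a fixed distance d
sigma_star = sigmas[p.argmax()]
# The maximizer lands (up to grid resolution) at sigma = d.
```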
PointtoPoint Loss
To avoid quantization errors in the positions of the keypoints, we design the FPN such that it is not necessary for the proposal keypoints Q to be any of the points in X. However, this can cause the FPN to give erroneous proposal keypoints that are far away from the point cloud X. We circumvent this problem by adding a loss function that penalizes Q for being too far from X. We also apply a similar penalty on Q̃ and X̃. This loss can be formulated as either the point-to-point loss [1]:

(8)  L_point = Σ_{m=1}^{M} ||q_m − x_NN(m)||₂

where x_NN(m) is the nearest neighbor of q_m in X, or the point-to-plane loss [26, 4]:

(9)  L_plane = Σ_{m=1}^{M} |(q_m − x_NN(m)) · n_NN(m)|

where n_NN(m) is the surface normal in X at the nearest neighbor x_NN(m) of q_m (and analogously for Q̃ and X̃). We use the point-to-point loss by default since we found experimentally that both loss functions give similar performances.
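Both penalties can be sketched as follows (minimal NumPy versions for a single cloud; averaging over keypoints and the explicit normals argument are illustrative assumptions):

```python
import numpy as np

def nearest_indices(Q, X):
    # Index of the nearest point in X for each proposal keypoint in Q.
    D = np.linalg.norm(Q[:, None, :] - X[None, :, :], axis=-1)
    return D.argmin(axis=1)

def point_to_point(Q, X):
    # Eq. 8: distance of each proposal keypoint to its nearest
    # neighbor in the input cloud.
    nn = nearest_indices(Q, X)
    return np.linalg.norm(Q - X[nn], axis=-1).mean()

def point_to_plane(Q, X, N):
    # Eq. 9: residual projected onto the unit surface normal
    # (N: (len(X), 3)) of the nearest neighbor.
    nn = nearest_indices(Q, X)
    return np.abs(np.sum((Q - X[nn]) * N[nn], axis=-1)).mean()
```

The point-to-plane variant only penalizes displacement off the local surface, so keypoints may slide along flat regions; this is one intuition for why both variants perform similarly here.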
4 Feature Proposal Network
The network architecture of our FPN is shown in Fig. 2. We first sample M nodes denoted as S = {s_1, ..., s_M} with Farthest Point Sampling (FPS) from a given input point cloud X. A neighborhood of points is built for each node using point-to-node grouping [18, 17]. The advantage of point-to-node association over node-to-point kNN search or radius-based ball search is twofold: (1) Every point in X is associated with one node, while some points may be left out in node-to-point kNN search and ball search. (2) Point-to-node grouping automatically adapts to various scales and point densities, while kNN search and ball search are vulnerable to density variation and varying scales, respectively. To make the FPN translation-equivariant, we normalize each neighborhood point by subtracting its respective node, i.e., x'_n = x_n − s_m. Each cluster of normalized local neighborhood points is then fed into a PointNet-like network [24] shown in Fig. 2 to get a local feature vector associated with each node. A kNN grouping layer is applied on the set of local feature vectors to achieve hierarchical information aggregation. Specifically, the K nearest neighboring nodes of each node are retrieved, and their local feature vectors are normalized with respect to the respective node to get a position-independent neighborhood, before being fed into another network to get a set of feature vectors. A simple Multi-Layer Perceptron (MLP) is then used to estimate the M proposal keypoints and saliency uncertainties from these feature vectors. Finally, we un-normalize each proposal by adding back its node, i.e., q_m = q'_m + s_m, to get the final proposal keypoints Q. It is important to note that the size of the receptive field is controlled by the number of proposals M and the number of neighbors K in the kNN layers, and it determines the level-of-detail for each feature. A large receptive field leads to features that are salient on a large scale, and vice versa.

5 Degeneracy Analysis
Let us denote the FPN as f(X) = Q, where X is the input of the network. We further denote a transformation matrix T = [R | t], where R ∈ SO(3) and t ∈ ℝ³ are the rotation matrix and translation vector in T. We get T ∘ X = RX ⊕ t, where ⊕ is the operator to denote the addition of t to every entry of the other term. We say that the network is degenerate when it outputs trivial solutions where f(T ∘ X) = T ∘ f(X) is satisfied for all R and t.
Lemma 1.
f(T ∘ X) = T ∘ f(X) when f outputs the centroid of the input point cloud, i.e., f(X) = x̄ = (1/N) Σ_{n=1}^{N} x_n.
Proof.
Putting X̃ = RX ⊕ t into f, we get f(X̃) = (1/N) Σ_n (R x_n + t) = R x̄ ⊕ t = T ∘ f(X). Hence, f(T ∘ X) = T ∘ f(X), which completes our proof that the network degenerates when it outputs the centroid of the input point cloud. ∎
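Lemma 1 can also be verified numerically: a trivial "detector" that outputs only the centroid commutes with every rigid transformation, so the chamfer-style losses are minimized without learning anything (illustrative sketch):

```python
import numpy as np

def centroid_detector(X):
    # Degenerate "detector": ignores the geometry and returns
    # only the centroid of the cloud.
    return X.mean(axis=0, keepdims=True)

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
# A fixed rigid transform: 90-degree rotation about z, plus translation.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t = np.array([2.0, -1.0, 0.5])
lhs = centroid_detector(X @ R.T + t)    # f(T o X)
rhs = centroid_detector(X) @ R.T + t    # T o f(X)
# lhs equals rhs for every R and t, so the training loss gives
# no gradient signal against this trivial solution.
```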
Lemma 2.
f(T ∘ X) = T ∘ f(X) when f is translation-equivariant, i.e., f(X ⊕ t) = f(X) ⊕ t, and outputs points that are in the linear subspace of any principal axis from the input point cloud, i.e.,

(10)  f(X) = { x̄ + α_m u | m = 1, ..., M },

where u can be any principal axis in X and the α_m are scalar coefficients in ℝ.
Proof.
Let C and C̃ denote the covariance matrices of X and X̃, respectively, and let x̄ and x̄̃ be the centroids of X and X̃, respectively. Putting X̃ = RX ⊕ t into C̃ and x̄̃, we get:

(11)  C̃ = R C Rᵀ,  x̄̃ = R x̄ + t

Taking the Singular Value Decomposition (SVD) of C and C̃, we get C = U S Uᵀ and C̃ = Ũ S̃ Ũᵀ, where S and S̃ are the diagonal matrices of singular values, and U and Ũ are the eigenvectors that are also the principal axes of X and X̃, respectively. Putting the SVD of C and C̃ into Eq. 11, we get:

(12)  Ũ = R U,  S̃ = S

Putting the relationship from Eq. 12 into f(X̃), and using the translation equivariance of f, we get:

(13)  f(X̃) = { x̄̃ + α_m ũ } = { R x̄ + t + α_m R u } = R f(X) ⊕ t = T ∘ f(X),

which completes our proof that the network degenerates when it outputs a set of points on any principal axis. ∎
Discussions
We note that the network requires sufficient global semantic information of the input point cloud, i.e., the input is the whole point cloud or clusters of local neighborhood points with large receptive fields, to learn the trivial solutions of the centroid or a set of points on the principal axes. Hence, the degeneracies can be easily prevented by limiting the receptive fields of the FPN. We achieve this by setting the number of clusters M and the number of nearest neighbors K in the FPN (refer to Sec. 4 for the definitions of M and K) to reasonable values. Small values of M or high values of K increase the receptive field and cause the FPN to degenerate. Fig. 3 shows some examples of the degeneracies with different K values at a fixed M. It is interesting to note that the principal axis degeneracy occurs when K is set to a mid-range value, and the centroid degeneracy occurs when K is set to a high value. This implies that a larger receptive field, i.e., more global semantic information, is needed for the network to learn the centroid. We also notice experimentally that the degeneracies (both centroid and principal axis) occur in point clouds with more regular shapes, e.g., objects from ModelNet40, where the centroid and principal axes are more well-defined.
6 Experiments
Following [32], we evaluate the repeatability (Sec. 6.1), distinctiveness (Sec. 6.2) and computational efficiency (Sec. 6.3) of our USIP detector on 4 datasets from object models, outdoor Lidar and indoor RGBD scans. Additionally, we compare our evaluations to existing detectors: ISS [39], Harris3D [12], SIFT3D [20] and 3DFeatNet [36].
Implementation Details
Three USIP detectors are respectively trained for outdoor Lidar, RGBD scans and object models. Specifically, we use the Oxford RobotCar dataset [21] for outdoor Lidar, the "RGBD reconstruction dataset" [38] for RGBD, and ModelNet40 [35] for object models. The PCL [29] implementations of the classical detectors, i.e., ISS, Harris3D and SIFT3D, are used for the comparisons. We take the pre-trained models of 3DFeatNet [36] for KITTI [10] and Oxford, and train separate models for Redwood and ModelNet40 using the code provided by 3DFeatNet.
Qualitative Visualization
Fig. 7 shows some results from our USIP detector on ModelNet40. We can clearly see that our USIP learns keypoints on corners, edges, centers of small surfaces, etc. Keypoints in the first row of Fig. 7 are selected with Non-Maximum Suppression (NMS) and thresholding on the saliency uncertainty σ. In the second row, keypoints are selected with only NMS. Keypoints with small σ are shown in bright red and become darker with larger σ.
Table 1. Details of the test datasets.

|                   | KITTI          | Oxford     | Redwood  | ModelNet40 |
|-------------------|----------------|------------|----------|------------|
| Type              | Velodyne lidar | SICK lidar | RGBD     | CAD Model  |
| Scale             | 200m           | 60m        | 10m      | 2          |
| # points          | 16,384         | 16,384     | 10,240   | 5,000      |
| ε in Eq. 14       | 0.5m           | 0.5m       | 0.1m     | 0.03       |
| Rotation          | 2D             | 2D         | 3D       | 3D         |
| Noise             | Sensor         | Sensor     | Gaussian | Gaussian   |
| Occlusion         | Yes            | Yes        | Yes      | No         |
| Density Variation | Yes            | No         | No       | No         |
| Missing Parts     | Yes            | Yes        | Yes      | No         |
Table 2. Registration failure rate (%) and keypoint inlier ratio (%) on KITTI over combinations of keypoint detectors and descriptors.

| Detector          | Fail: Our Desc. | Fail: 3DFeatNet [36] | Fail: FPFH [27] | Fail: SHOT [31] | Inlier: Our Desc. | Inlier: 3DFeatNet | Inlier: FPFH | Inlier: SHOT |
|-------------------|-----------------|----------------------|-----------------|-----------------|-------------------|-------------------|--------------|--------------|
| Random            | 18.83           | 42.14                | 49.95           | 68.39           | 7.47              | 4.48              | 5.45         | 4.46         |
| SIFT3D [29, 20]   | 15.44           | 42.63                | 79.72           | 84.49           | 7.36              | 5.47              | 4.24         | 4.11         |
| ISS [29, 39]      | 5.97            | 25.96                | 37.09           | 69.83           | 8.52              | 4.71              | 4.44         | 3.45         |
| Harris3D [29, 12] | 3.81            | 13.56                | 49.49           | 51.29           | 10.57             | 6.58              | 4.78         | 5.00         |
| 3DFeatNet [36]    | 2.61            | 2.26                 | 12.15           | 11.76           | 15.66             | 10.76             | 9.55         | 8.46         |
| USIP              | 1.41            | 1.55                 | 8.37            | 5.40            | 32.20             | 22.48             | 18.77        | 18.21        |
6.1 Repeatability
Repeatability refers to the ability of a detector to detect keypoints at the same locations under various disturbances such as viewpoint variations, noise, missing parts, etc. It is often taken as the most important measure of keypoint detectors because it is a standalone measure that depends only on the detector (without a descriptor). Given two point clouds X and X̃ of a scene captured from different viewpoints such that they are related by a rotation matrix R and a translation vector t, a keypoint detector detects sets of keypoints K and K̃ from X and X̃, respectively. A keypoint k_m ∈ K is repeatable if the distance between R k_m + t and its nearest neighbor k̃_j ∈ K̃ is less than a threshold ε, i.e.,

(14)  || R k_m + t − k̃_j ||₂ < ε
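The test of Eq. 14, normalized by the number of detected keypoints, can be sketched as follows (illustrative NumPy with brute-force nearest neighbors):

```python
import numpy as np

def relative_repeatability(K1, K2, R, t, eps):
    # Eq. 14: a keypoint in K1 is repeatable if, after applying the
    # ground-truth rigid transform (R, t), its nearest neighbor in K2
    # lies within the threshold eps. Returns the fraction of
    # repeatable keypoints.
    K1_t = K1 @ R.T + t
    D = np.linalg.norm(K1_t[:, None, :] - K2[None, :, :], axis=-1)
    repeatable = D.min(axis=1) < eps
    return repeatable.sum() / len(K1)
```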
Test Datasets
We evaluate repeatability on four test datasets: KITTI, Oxford, Redwood and ModelNet40. Note that our USIP is not trained on KITTI nor Redwood. We use the KITTI and Oxford test datasets prepared by 3DFeatNet [36]. Each pair of point clouds is captured from nearby locations within 10m and manually augmented with random 2D rotations. The pairs in Redwood are from simulated RGBD cameras with 3D rotations / translations and Gaussian noise. The overlapping areas between pairs can be as low as 30%. In ModelNet40, X̃ is obtained by augmenting X with random 3D rotations. Points in KITTI, Oxford and Redwood are in their original scale, while points in ModelNet40 are normalized to a diameter of 2. Details of the datasets are shown in Tab. 1. The scale refers to the diameter of the point clouds.
Relative Repeatability
We use relative repeatability, which normalizes over the total number of detected keypoints for fair comparisons, i.e., r = M_rep / M, where M_rep is the number of keypoints that pass the repeatability test in Eq. 14 and M is the total number of detected keypoints. We set the parameters of each keypoint detector in each dataset to generate 4, 8, 16, 32, 64, 128, 256 and 512 keypoints, or close to these numbers when it is not possible to set the detectors (SIFT3D, Harris3D and ISS) to generate an exact number of keypoints. Note that in general the repeatability should increase with the number of keypoints. In the extreme case that every point is regarded as a keypoint, the repeatability is the same as the percentage of overlap between the point cloud pair. As shown in Fig. 4, our USIP generally outperforms other detectors by a significant margin on the 4 datasets over the 8 different numbers of keypoints. In the extremely hard case that only 4 keypoints are detected, our method achieves relative repeatability of 34%, 23%, 10% and 60% for KITTI, Oxford, Redwood and ModelNet40, respectively. In the case of 64 keypoints, our performance is roughly 4.2x, 2.8x, 1.3x and 2.6x higher than the second best detector.
Robustness to Noise
The original points in KITTI and Oxford are already corrupted with sensor noise. We further augment the point clouds in the 4 datasets with zero-mean Gaussian noise whose standard deviation is increased up to dataset-specific maxima (in meters for KITTI, Oxford and Redwood, and unitless for ModelNet40). The number of keypoints is fixed to 128. Our USIP is a lot more robust than other detectors, as shown in Fig. 5. In KITTI and Oxford, the performances of other detectors fall to the level of random sampling at moderate noise levels, while our USIP does not show a significant drop in performance even at the largest noise level. In Redwood, other methods except USIP and ISS deteriorate to random sampling at the largest noise level. In ModelNet40, our method maintains a high repeatability of 91% at the largest noise level, while all other methods drop below 8%.
Robustness to Downsampling
We evaluate the repeatability of the detectors on input point clouds downsampled by various factors using random selection. The results are shown in Fig. 6, where a downsample factor of k means the number of points is reduced to 1/k of the original number shown in Tab. 1. We can see that the repeatability of our USIP remains satisfactory even under heavy downsampling on KITTI, Oxford and ModelNet40. The only exception is the Redwood dataset, where almost all detectors perform poorly at high downsample factors. Indoor RGBD scans in Redwood consist of many large and flat surfaces such as walls, ceilings, etc. Furthermore, there are very few distinguishable and non-occluded structures, which is further aggravated by severe downsampling. Hence, it is difficult to detect repeatable keypoints in these RGBD scans.
6.2 Distinctiveness: Point Cloud Registration
Distinctiveness is a measure of the performance of keypoint detectors and descriptors for finding correspondences in point cloud registration. However, distinctiveness is a weaker evaluation criterion for keypoint detectors than repeatability because it is confounded with the performance of the descriptor. We mitigate this limitation by evaluating point cloud registration over several existing keypoint descriptors. We also use the results to show that our USIP detector works with different existing keypoint descriptors.
Experiment Setup
We follow the point cloud registration pipeline from 3DFeatNet [36] on their KITTI test dataset. Four descriptors are used to perform keypoint description, i.e., three off-the-shelf descriptors: 3DFeatNet, FPFH [27], SHOT [31], and our own descriptor inspired by 3DFeatNet with minor modifications, which is denoted as "Our Desc." (details are in our supplementary material). Registration of a pair of point clouds involves 4 steps: (a) Extract keypoints and their corresponding descriptor vectors from each point cloud. (b) Establish keypoint-to-keypoint correspondences by nearest neighbor search of the descriptor vectors. (c) Perform RANSAC on the two matched keypoint sets to find the rotation and translation that have the most inliers. (d) Compare the resulting rotation and translation with the ground truth. A pair of point clouds is regarded as successfully registered if the relative translational error (RTE, in m) and relative rotation error (RRE) are below fixed thresholds.
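Step (c) relies on estimating a rigid transformation from sampled correspondences inside RANSAC. A common closed-form choice for that inner step is the SVD-based (Kabsch) solution sketched below; this is an illustrative sketch with the RANSAC loop omitted, not necessarily the exact solver used in the pipeline:

```python
import numpy as np

def rigid_from_correspondences(P, Q):
    # Least-squares R, t aligning P to Q (both (M, 3), matched rows):
    # minimize sum_i || R p_i + t - q_i ||^2.
    p0, q0 = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p0).T @ (Q - q0)                  # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Sign correction keeps R a proper rotation (det = +1).
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = q0 - R @ p0
    return R, t
```

Inside RANSAC, this solver is run on minimal correspondence samples, and the hypothesis with the most inliers is kept.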
Registration Results
We perform registration evaluations over the combinations of 6 keypoint detectors and 4 descriptors. The registration failure rate and keypoint inlier ratio are shown in Tab. 2. Compared to other detectors, our USIP achieves the lowest registration failure rate and the highest inlier ratio with a considerable margin on all 4 descriptors. The significance of the results in Tab. 2 is twofold. First, our USIP works well with various handcrafted (FPFH and SHOT) and deep learning-based (Our Desc. and 3DFeatNet) descriptors. Second, our USIP produces more distinctive keypoints since it consistently outperforms other keypoint detectors over the different descriptors on registration failure rate and keypoint inlier ratio. The experimental configurations in Tab. 2 are not the optimal settings for our USIP detector and descriptor, nor for 3DFeatNet, because we have to fix the number of keypoints for a fair comparison. In Tab. 3, we show the best registration results for our USIP and 3DFeatNet on KITTI without limiting the number of keypoints. We again achieve a lower failure rate and a higher inlier ratio. In addition, we show the visualization of keypoint matching results of two examples from KITTI and Oxford in Fig. 8.
Table 3. Best registration results on KITTI without limiting the number of keypoints.

| Detector  | Descriptor | Fail (%) | Inlier (%) | RTE (m) | RRE (°) |
|-----------|------------|----------|------------|---------|---------|
| 3DFeatNet | 3DFeatNet  | 0.57     | 12.9       |         |         |
| USIP      | Our Desc.  | 0.24     | 28.0       |         |         |
6.3 Computational Efficiency
Handcrafted detectors are deployed with single-thread C++ code on an Intel i7 6950X CPU. Our USIP and 3DFeatNet are deployed on an Nvidia 1080Ti, with PyTorch and TensorFlow, respectively. Computational efficiency is evaluated on 2,391 KITTI point clouds, where each point cloud is downsampled to 16,384 points. We record the average time taken to extract 128 keypoints from each point cloud. As shown in Tab. 4, our USIP is an order of magnitude faster than the other detectors except random sampling.

Table 4. Average time taken to extract 128 keypoints from each point cloud.

| Random | SIFT3D | ISS   | Harris3D | 3DFeatNet | USIP  |
|--------|--------|-------|----------|-----------|-------|
| 0.0005 | 0.163  | 0.388 | 0.150    | 0.438     | 0.011 |
7 Conclusion
In this paper, we present the USIP detector, an unsupervised deep learningbased keypoint detector for 3D point clouds. A probabilistic chamfer loss is proposed to guide the network to learn highly repeatable keypoints. We provide mathematical analysis and solutions for network degeneracy, which are supported by experimental results. Extensive evaluations are performed with Lidar scans, RGBD images and CAD models. Our USIP detector outperforms existing detectors by a significant margin in terms of repeatability, distinctiveness and computational efficiency.
References
 [1] P. J. Besl and N. D. McKay. Method for registration of 3d shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pages 586–607. International Society for Optics and Photonics, 1992.
 [2] U. Castellani, M. Cristani, S. Fantoni, and V. Murino. Sparse points matching by combining 3d mesh saliency with statistical descriptors. In Computer Graphics Forum, volume 27, pages 643–652. Wiley Online Library, 2008.
 [3] H. Chen and B. Bhanu. 3d freeform object recognition in range images using local surface patches. Pattern Recognition Letters, 28(10):1252–1262, 2007.
 [4] Y. Chen and G. Medioni. Object modelling by registration of multiple range images. Image and vision computing, 10(3):145–155, 1992.
 [5] S. Choi, Q.Y. Zhou, and V. Koltun. Robust reconstruction of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5556–5565, 2015.
 [6] H. Deng, T. Birdal, and S. Ilic. Ppffoldnet: Unsupervised learning of rotation invariant 3d local descriptors. arXiv preprint arXiv:1808.10322, 2018.
 [7] H. Deng, T. Birdal, and S. Ilic. Ppfnet: Global context aware local features for robust 3d point matching. Computer Vision and Pattern Recognition (CVPR). IEEE, 1, 2018.
 [8] C. Dorai and A. K. Jain. Cosmos: a representation scheme for 3d freeform objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(10):1115–1130, 1997.

 [9] G. Elbaz, T. Avraham, and A. Fischer. 3d point cloud registration for localization using a deep neural network autoencoder. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 2472–2481. IEEE, 2017.
 [10] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
 [11] Z. Gojcic, C. Zhou, J. D. Wegner, and A. Wieser. The perfect match: 3d point cloud matching with smoothed densities. arXiv preprint arXiv:1811.06879, 2018.
 [12] C. G. Harris, M. Stephens, et al. A combined corner and edge detector. In Alvey vision conference, volume 15, pages 10–5244. Citeseer, 1988.
 [13] B.S. Hua, Q.H. Pham, D. T. Nguyen, M.K. Tran, L.F. Yu, and S.K. Yeung. Scenenn: A scene meshes dataset with annotations. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 92–101. IEEE, 2016.
 [14] A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3d scenes. IEEE Transactions on Pattern Analysis & Machine Intelligence, (5):433–449, 1999.
 [15] J. W. Tangelder and R. C. Veltkamp. A survey of content based 3d shape retrieval methods. 2008.
 [16] M. Khoury, Q.Y. Zhou, and V. Koltun. Learning compact geometric features. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pages 153–61, 2017.

 [17] T. Kohonen. The self-organizing map. Neurocomputing, 21(1):1–6, 1998.
 [18] J. Li, B. M. Chen, and G. H. Lee. So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9397–9406, 2018.
 [19] Z. Lian and A. A. Godil. A comparison of methods for nonrigid 3d shape retrieval. 2012.
 [20] D. G. Lowe. Distinctive image features from scaleinvariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
 [21] W. Maddern, G. Pascoe, C. Linegar, and P. Newman. 1 Year, 1000km: The Oxford RobotCar Dataset. The International Journal of Robotics Research (IJRR), 36(1):3–15, 2017.
 [22] A. Mian, M. Bennamoun, and R. Owens. On the repeatability and quality of keypoints for local feature-based 3d object retrieval from cluttered scenes. International Journal of Computer Vision, 89(2-3):348–361, 2010.
 [23] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, et al. Fastslam: A factored solution to the simultaneous localization and mapping problem. AAAI/IAAI, pages 593–598, 2002.
 [24] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593, 2016.
 [25] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. Orb: An efficient alternative to sift or surf. In Computer Vision (ICCV), 2011 IEEE international conference on, pages 2564–2571. IEEE, 2011.
 [26] S. Rusinkiewicz and M. Levoy. Efficient variants of the icp algorithm. In 3D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on, pages 145–152. IEEE, 2001.
 [27] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (fpfh) for 3d registration. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pages 3212–3217. Citeseer, 2009.
 [28] R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu. Fast 3d recognition and pose using the viewpoint feature histogram. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 2155–2162. IEEE, 2010.
 [29] R. B. Rusu and S. Cousins. 3d is here: Point cloud library (pcl). In 2011 IEEE International Conference on Robotics and Automation, pages 1–4, May 2011.
 [30] F. Tombari, S. Salti, and L. Di Stefano. Unique shape context for 3d data description. In Proceedings of the ACM workshop on 3D object retrieval, pages 57–62. ACM, 2010.
 [31] F. Tombari, S. Salti, and L. Di Stefano. Unique signatures of histograms for local surface description. In European conference on computer vision, pages 356–369. Springer, 2010.
 [32] F. Tombari, S. Salti, and L. Di Stefano. Performance evaluation of 3d keypoint detectors. International Journal of Computer Vision, 102(1–3):198–220, 2013.
 [33] R. Unnikrishnan and M. Hebert. Multiscale interest regions from unorganized point clouds. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8. IEEE, 2008.
 [34] M. A. Uy and G. H. Lee. PointNetVLAD: Deep point cloud based retrieval for large-scale place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 [35] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
 [36] Z. J. Yew and G. H. Lee. 3DFeat-Net: Weakly supervised local 3d features for point cloud registration. In Proceedings of the European Conference on Computer Vision (ECCV), pages 607–623, 2018.
 [37] A. Zaharescu, E. Boyer, K. Varanasi, and R. Horaud. Surface feature detection and description with applications to mesh matching. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 373–380. IEEE, 2009.
 [38] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser. 3DMatch: Learning local geometric descriptors from RGB-D reconstructions. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 199–208. IEEE, 2017.
 [39] Y. Zhong. Intrinsic shape signatures: A shape descriptor for 3d object recognition. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 689–696. IEEE, 2009.
Appendix A Overview
We provide more details on the algorithms and experiments described in the main paper. Sec. B presents more examples of the network degeneracy. Sec. C evaluates the effect of the point-to-point loss on the keypoint repeatability. Sec. D illustrates the details of our feature descriptor design. Sec. E gives more experiments on point cloud registration tasks. Sec. F presents visualizations of our USIP keypoints on various datasets.
Appendix B More Examples on Degeneracy
As analyzed in Sec. 5, our FPN degenerates when the receptive field becomes sufficiently large, i.e., when it has gained sufficient global semantic information. The receptive field of the FPN is controlled by two parameters: the number of keypoint proposals M and the number of neighbors K in the kNN feature aggregation. More specifically, the receptive field size is proportional to K and inversely proportional to M. In this section, we visualize the network degeneracy by gradually enlarging the receptive field. Fig. 14 shows the degeneracies that appear as M is decreased, and Fig. 15 shows the degeneracies that appear as K is increased.
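To make the role of the two receptive-field parameters concrete, the following numpy sketch mimics a single, simplified kNN grouping layer; the variable names, the random sampling of proposal centers, and the coverage measure are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def knn_group(points, centers, k):
    """For each proposal center, gather the indices of its k nearest points.

    This mimics the kNN feature aggregation step: each proposal node only
    "sees" its k nearest input points, so its receptive field grows with k
    and, for a fixed k, shrinks as more centers share the same cloud.
    """
    # Pairwise squared distances between centers and points: shape (M, N).
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]          # (M, k) neighbor indices

rng = np.random.default_rng(0)
pts = rng.normal(size=(1024, 3))

for m, k in [(64, 9), (16, 64)]:                  # small vs. large receptive field
    centers = pts[rng.choice(len(pts), m, replace=False)]
    idx = knn_group(pts, centers, k)
    coverage = len(np.unique(idx)) / len(pts)     # fraction of the cloud seen per layer
    print(m, k, round(coverage, 2))
```

With more neighbors K (or fewer proposals M sharing the cloud), each node aggregates information from a larger portion of the input, which is the effect visualized in Figs. 14 and 15.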
Appendix C Effect of λ in the Point-to-Point Loss
Sec. 3 of the main paper describes the point-to-point loss that penalizes keypoints for being too far from the input point cloud. The point-to-point loss is added to the loss function with the weight λ. Here, we show that our USIP is very robust to the value of λ. Specifically, the repeatability of our USIP keypoints remains almost the same over a wide range of values for λ. Keypoint repeatability is illustrated in Fig. 10 for various λ. Fig. 10 shows that the USIP keypoints are highly repeatable even when λ is small. This is probably because our design of limiting the receptive field already guides the network to learn repeatable keypoints even without the point-to-point loss. On the other hand, the network fails to converge when λ is too large because the point-to-point loss dominates the training process. Nonetheless, training the network without the point-to-point loss does not ensure that the keypoints are close to the input point cloud. The top row of Fig. 9 shows keypoints from our USIP detector trained with the point-to-point loss, i.e., λ > 0. They are close to the input point cloud. In comparison, the bottom row of Fig. 9 shows keypoints from our USIP detector trained without the point-to-point loss, i.e., λ = 0. These are less desirable keypoints that are farther from the input point cloud.
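The exact point-to-point loss is given in Sec. 3 of the main paper; the following is only a minimal numpy sketch of the idea, a nearest-neighbor distance from each predicted keypoint to the input cloud, weighted into the total loss by λ:

```python
import numpy as np

def point_to_point_loss(keypoints, cloud):
    """Mean distance from each predicted keypoint to its nearest input point.

    A sketch of a point-to-point regularizer in the spirit of the paper's
    loss: keypoints that drift away from the input cloud are penalized.
    The exact formulation in the paper may differ.
    """
    # Pairwise squared distances keypoints -> cloud: shape (M, N).
    d2 = ((keypoints[:, None, :] - cloud[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1)).mean()

# The total training loss would then be of the form:
#   total_loss = task_loss + lam * point_to_point_loss(keypoints, cloud)
# where `lam` plays the role of the weight λ discussed above (name assumed).
```

A keypoint lying exactly on the cloud contributes zero, so the term only activates when keypoints float away from the surface.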
Appendix D Our Descriptor a.k.a “Our Desc.”
Fig. 11 shows the network design of “Our Desc.”, inspired by 3DFeat-Net [36], as mentioned in Sec. 6.2 of the main paper. Given the keypoints output by the FPN, a ball of points from the point cloud within a radius r is built around each keypoint. A keypoint descriptor is then extracted from each ball. The descriptor can be trained with either weak [36] or strong supervision [38, 16]. We improve the keypoint descriptor training by utilizing the keypoint saliency uncertainty in Secs. D.1, D.2 and E.
D.1 Weak Supervision
Weak supervision of the descriptor is based on a triplet loss and the ground truth coarse registrations of the point clouds in the training dataset. Similar to [36], point clouds from the dataset are selected as the anchor samples during training. All point clouds that overlap with the anchor are defined as positive samples, while non-overlapping point clouds are defined as negative samples. We denote the sets of keypoint descriptors extracted from the anchor, positive and negative samples as A, P and N, respectively. We generate these training samples from the Oxford RobotCar and KITTI datasets. More formally, the triplet loss is given by:

L_triplet = Σ_{aᵢ ∈ A} wᵢ [ ‖aᵢ − pᵢ‖₂ − ‖aᵢ − nᵢ‖₂ + m ]₊ ,   (15)
where aᵢ is a descriptor from the anchor sample and m is the margin. For each descriptor aᵢ, we minimize the Euclidean distance to its nearest neighbor pᵢ ∈ P and maximize the Euclidean distance to its nearest neighbor nᵢ ∈ N. In addition, a normalized weight wᵢ is added to our triplet loss. wᵢ is derived from our USIP keypoint saliency uncertainty, which indicates the reliability of aᵢ and pᵢ. More specifically:

wᵢ ∝ min( 1 / (σ(aᵢ) σ(pᵢ)), γ ),   (16)

where σ(·) is the saliency uncertainty of the keypoint associated with a descriptor, and γ is a threshold that serves as the upper bound of the unnormalized weight.
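A minimal numpy sketch of such a weighted triplet loss with nearest-neighbor mining is given below; the margin value, the variable names, and the exact weighting are assumptions for illustration, not the paper's code:

```python
import numpy as np

def weighted_triplet_loss(anchor, pos, neg, w, margin=0.2):
    """Sketch of a per-anchor weighted triplet loss with NN mining.

    For each anchor descriptor we take its nearest neighbor in the positive
    descriptor set and in the negative descriptor set, and weight the hinge
    term by w (derived from keypoint saliency uncertainty in the paper).
    """
    def nn_dist(q, s):
        # Distance from each row of q to its nearest row of s: shape (A,).
        return np.sqrt(((q[:, None] - s[None]) ** 2).sum(-1)).min(1)

    hinge = np.maximum(0.0, nn_dist(anchor, pos) - nn_dist(anchor, neg) + margin)
    return (w * hinge).mean()
```

When the nearest negative is already farther than the nearest positive by more than the margin, the hinge is zero and the anchor contributes nothing to the loss.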
D.2 Strong Supervision
We perform strong supervision of the descriptor network on datasets with ground truth poses, i.e., SceneNN [13] and the “3D reconstruction dataset” [38]. The loss function for strong supervision, defined on a pair of overlapping point clouds P₁ and P₂ with ground truth poses T₁ and T₂, is given by:

L_strong = Σᵢ wᵢ [ ‖fᵢ − fᵢ⁺‖₂ − ‖fᵢ − fᵢ⁻‖₂ + m ]₊ ,   (17)
where fᵢ and fⱼ are keypoint descriptors from P₁ and P₂, respectively. Additionally, fᵢ⁺ is a descriptor fⱼ whose keypoint location xⱼ is within a distance τ from the keypoint location xᵢ of the descriptor fᵢ, i.e., ‖T₁xᵢ − T₂xⱼ‖₂ < τ. To achieve hard negative mining, we randomly select 50% of the negatives fᵢ⁻ from descriptors whose keypoint distance to xᵢ is larger than τ. The other 50% are chosen from the keypoints with the shortest keypoint distances to xᵢ that are still larger than τ.
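The 50/50 negative mining described above can be sketched as follows; the function and variable names are assumptions, and the keypoint locations are assumed to already be in a common frame (i.e., after applying the ground truth poses):

```python
import numpy as np

def mine_negatives(anchor_xyz, cand_xyz, tau, rng):
    """Sketch of 50/50 negative mining for one anchor keypoint.

    Candidates whose keypoint lies farther than `tau` from the anchor
    keypoint are valid negatives; half of the selected negatives are drawn
    at random, the other half are the "hardest" ones, i.e., the valid
    candidates geometrically closest to the anchor (shortest distance > tau).
    Assumes at least one valid candidate exists.
    """
    geo = np.linalg.norm(cand_xyz - anchor_xyz, axis=1)
    valid = np.where(geo > tau)[0]                 # negatives must lie beyond tau
    n = max(1, len(valid) // 2)
    rand_half = rng.choice(valid, n, replace=False)
    hard_half = valid[np.argsort(geo[valid])[:n]]  # closest-beyond-tau candidates
    return np.concatenate([rand_half, hard_half])
```

The random half keeps the training distribution broad, while the hardest half keeps the loss informative once easy negatives are already well separated.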
Appendix E More Point Cloud Registration Results
Table 5: Point cloud registration results on Oxford RobotCar and KITTI (RTE and RRE are mean ± std. dev.).

Method                  | Oxford                                           | KITTI
                        | RTE (m)     RRE (°)     Fail %  Inlier %  # Iter | RTE (m)     RRE (°)     Fail %  Inlier %  # Iter
ISS [39] + FPFH [27]    | 0.40±0.29   1.60±1.02    7.68     8.6     7171   | 0.33±0.27   1.04±0.77   39.00     8.8     8000
ISS [39] + SI [14]      | 0.42±0.31   1.61±1.12   12.55     4.7     9888   | 0.35±0.31   1.11±0.93   41.86     4.6     9401
ISS [39] + USC [30]     | 0.32±0.27   1.22±0.95    5.98     8.6     7084   | 0.27±0.28   0.83±0.76   18.62     7.7     8149
ISS [39] + CGF [16]     | 0.43±0.32   1.62±1.10   12.64     4.9     9628   | 0.23±0.25   0.69±0.60    8.90     8.4     7670
ISS [39] + 3DMatch [38] | 0.49±0.37   1.78±1.21   30.94     5.4     9131   | 0.30±0.28   0.80±0.67    7.14     8.4     7165
3DFeat-Net [36]         | 0.30±0.26   1.07±0.85    1.90    13.7     2940   | 0.26±0.26   0.56±0.46    0.57    12.9     3768
USIP + Our Desc.        | 0.28±0.26   0.81±0.74    0.93    28.1      523   | 0.21±0.24   0.42±0.32    0.24    28.0      600
We follow the experimental setup and pipeline of 3DFeat-Net [36] to provide more evaluation results on point cloud registration. More specifically, we compare the performance of our USIP detector and “Our Desc.” with other existing keypoint detectors and descriptors. The evaluations are done on the Oxford RobotCar and KITTI datasets prepared by [36]. Refer to Sec. 6.2 of the main paper for the details of the registration steps. A fixed number of 256 keypoints is extracted from each point cloud. We extract the keypoints without Non-Maximum Suppression (NMS). Furthermore, keypoints with high saliency uncertainty, i.e., a large σ, are filtered out.
Datasets
The Oxford RobotCar dataset consists of 40 traversals of the same route over a year. 3D point clouds are built by accumulating the 2D scans from a SICK LMS-151 LiDAR with the GPS/INS readings. We use 35 traversals, i.e., 21,875 point clouds, for training. The remaining 5 traversals, i.e., 828 point clouds and 3,426 overlapping pairs, are used for evaluation. Random rotations around the up-axis are applied to each evaluation point cloud. In KITTI, 3D point clouds are directly provided by a Velodyne HDL-64E. We use the 2,831 overlapping pairs of point clouds prepared by [36] for the registration evaluation.
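The random up-axis rotations used to perturb the evaluation point clouds can be generated as in this sketch (z is assumed to be the up-axis):

```python
import numpy as np

def random_up_rotation(points, rng):
    """Rotate a point cloud by a random angle about the up-axis (assumed z).

    This is the kind of perturbation applied to each evaluation cloud so
    that registration must recover an arbitrary heading.
    """
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    return points @ R.T
```

Rotating only about the up-axis preserves the z-coordinates and the gravity alignment of the LiDAR clouds, which matches how a vehicle's heading varies in practice.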
Performance
Tab. 5 shows the point cloud registration performances. Our USIP detector + “Our Desc.” outperforms previous methods with the lowest registration failure rate (Fail %), Relative Translational Error (RTE) and Relative Rotation Error (RRE), and the highest inlier ratio (Inlier %). In particular, our registration failure rate is about half, and our inlier ratio about twice, that of the second best keypoint detector + descriptor combination. We further analyze the performance over different numbers of RANSAC iterations. The registration failure rate versus the maximum number of RANSAC iterations is shown in Fig. 12. Due to its high repeatability, our USIP detector (red line) shows very little drop in performance with a decreasing number of RANSAC iterations, while all other algorithms show rapid drops in performance. Additionally, we replace our USIP detector + “Our Desc.” with Random Sampling + “Our Desc.” to demonstrate the effectiveness of our USIP detector. It can be seen from Fig. 12 that the performance of Random Sampling + “Our Desc.” (black line) drops as quickly as the other methods with a decreasing number of RANSAC iterations.
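Inside each RANSAC iteration, a rigid transform is estimated from a minimal sample of keypoint correspondences. A standard Kabsch/SVD solver for that step (a generic sketch, not the paper's implementation) is:

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) mapping src -> dst (Kabsch).

    This is the transform-estimation step run inside each RANSAC iteration
    of the registration pipeline.
    """
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)                   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs
```

A higher inlier ratio among the keypoint matches makes a random minimal sample far more likely to be all-inlier, which is why the USIP + “Our Desc.” pipeline needs so few RANSAC iterations in Tab. 5.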
Effect of USIP Keypoint Saliency Uncertainty on Descriptor Training
We show that the keypoint saliency uncertainty from our USIP detector improves the performance of “Our Desc.”. To this end, we compare the performance of “Our Desc.” trained with USIP keypoints and with randomly sampled keypoints, respectively. In particular, the weight wᵢ from Eq. 15 or Eq. 17 is set to 1 for the randomly sampled keypoints. We denote the descriptor trained with randomly sampled keypoints as “Desc. w. RS”. Tab. 6 shows the registration failure rates of “Desc. w. USIP” and “Desc. w. RS”. The results show that “Desc. w. USIP” performs better than “Desc. w. RS”, which means that the keypoints and saliency uncertainty from our USIP detector improve descriptor training.
Table 6: Registration failure rates (%) of “Our Desc.” trained with USIP keypoints vs. randomly sampled (RS) keypoints.

Failure %  | Oxford                       | KITTI
           | Desc. w. USIP  Desc. w. RS   | Desc. w. USIP  Desc. w. RS
USIP       | 0.93           1.20          | 0.24           1.02
Effect of Parameters
We demonstrate the point cloud registration failure rate (%) in Fig. 13 when various USIP detector parameters, i.e., M, K and λ, are selected. In Fig. 13 we use the same descriptor mentioned in Sec. D. As shown in Fig. 13, our method outperforms existing methods over a wide range of parameter settings. We notice that our network performance decreases significantly when M is too small or K is too large, i.e., when the receptive field is too large. This further verifies our design of limiting the receptive field. In addition, Fig. 13 shows that the registration failure rate remains satisfactory when λ is small. This is consistent with Fig. 10, which shows that our USIP is able to detect repeatable keypoints even without the point-to-point loss. Nonetheless, it is still important to include the point-to-point loss to ensure that the keypoints are close to the input point cloud.
Appendix F Qualitative Visualization of USIP Keypoints
We show more visualizations of the keypoints detected by our USIP detector on ModelNet40, KITTI, Oxford RobotCar and Redwood in Figs. 16, 17, 18 and 19, respectively. NMS and saliency uncertainty thresholding are applied here. A limitation of our USIP detector is shown in Fig. 16, where there are no or very few keypoints on objects that are highly symmetrical or have smooth surfaces. The saliency uncertainties of the keypoints detected on these objects are large, and they are thus discarded by the thresholding.
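For reference, a generic greedy 3D NMS of the kind applied here can be sketched as follows; the paper's exact variant and its uncertainty-based ranking may differ:

```python
import numpy as np

def nms_3d(xyz, saliency, radius):
    """Greedy 3D non-maximum suppression (a common scheme).

    Keep the most salient keypoint, suppress all remaining keypoints within
    `radius` of it, then repeat on the remainder. Higher `saliency` is
    assumed better here; with saliency *uncertainties*, sort ascending instead.
    """
    order = np.argsort(-saliency)                 # most salient first
    keep = []
    for i in order:
        if all(np.linalg.norm(xyz[i] - xyz[j]) > radius for j in keep):
            keep.append(int(i))
    return keep
```

Combined with a threshold on the saliency uncertainty, this yields a spatially spread-out set of reliable keypoints, which is why the symmetric and smooth objects in Fig. 16 end up with few or no keypoints.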