USIP: Unsupervised Stable Interest Point Detection from 3D Point Clouds

03/30/2019 ∙ by Jiaxin Li, et al.

In this paper, we propose the USIP detector: an Unsupervised Stable Interest Point detector that can detect highly repeatable and accurately localized keypoints from 3D point clouds under arbitrary transformations without the need for any ground truth training data. Our USIP detector consists of a feature proposal network that learns stable keypoints from input 3D point clouds and their respective transformed pairs from randomly generated transformations. We provide degeneracy analysis of our USIP detector and suggest solutions to prevent it. We encourage high repeatability and accurate localization of the keypoints with a probabilistic chamfer loss that minimizes the distances between the detected keypoints from the training point cloud pairs. Extensive experimental results of repeatability tests on several simulated and real-world 3D point cloud datasets from Lidar, RGB-D and CAD models show that our USIP detector significantly outperforms existing hand-crafted and deep learning-based 3D keypoint detectors. Our code is available at the project website.




1 Introduction

Figure 1: Examples of keypoints detected by our USIP detector on four datasets: (a) ModelNet40 [35], object model. (b) Redwood [5] (Trained on “RGB-D reconstruction dataset” [38]), indoor RGB-D. (c) Oxford RobotCar [21], outdoor SICK LiDAR. (d) KITTI [10] (Trained on Oxford), outdoor Velodyne LiDAR.

3D interest point or keypoint detection refers to the problem of finding stable points with well-defined positions that are highly repeatable on 3D point clouds under arbitrary SE(3) transformations. These detected keypoints play important roles in many computer vision and robotics tasks, where 3D point clouds are widely adopted as the data structure to represent objects and scenes in the 3D space. Examples include geometric registration for 3D object modeling

[1] or point cloud-based Simultaneous Localization and Mapping (SLAM) [23], and 3D object [15, 19] or place recognition [34]. In these tasks, the detected keypoints are respectively used as correspondences to compute rigid transformations, and locations to extract representative signatures for efficient retrievals. Hence, a keypoint detector that cannot produce highly repeatable and well-localized keypoints from 3D point clouds under arbitrary transformations would cause these tasks to fail catastrophically.

Despite the high number of successful hand-crafted detectors proposed for 2D images [25, 20, 12], significantly fewer hand-crafted detectors with limited success have been proposed for 3D point clouds [32]. This difference can be largely attributed to the difficulty in hand-crafting powerful algorithms to extract meaningful information solely from the Euclidean coordinates of the point cloud, in comparison to images that contain richer information from the additional RGB channels. The problem is further aggravated by the fact that it is difficult to hand-craft 3D detectors to handle 3D point clouds under arbitrary transformations, i.e., different reference coordinate frames. In particular, different transformations applied to the same 3D point cloud cause the Euclidean coordinates of each point to change significantly, thus severely affecting the repeatability of the keypoints from the 3D detectors.

It seems evident that all the above mentioned problems with hand-crafted detectors for 3D point clouds can be resolved by the highly successful data-driven deep networks. However, very few deep learning-based 3D keypoint detectors exist (only one deep learning-based approach [36] exists to date), in contrast to the increasing success of deep learning on 3D keypoint descriptors [7, 6, 38, 16]. This is due to the lack of ground truth training datasets to supervise deep learning-based detectors on 3D point clouds. Unlike 3D descriptors that are supervised by easily available ground truth registered overlapping 3D point clouds [7, 6, 16, 38, 36, 11], it is impossible for anyone to identify and label the "ground truth" keypoints on 3D point clouds. Consequently, most of the works on 3D descriptors [7, 6, 38, 16] ignore the detector problem and are built on top of existing hand-crafted 3D detectors or uniform sampling.

In view of the challenges on both hand-crafted and deep learning-based 3D detectors, we propose the USIP detector: an Unsupervised Stable Interest Point deep learning-based detector that can detect highly repeatable and accurately localized keypoints from 3D point clouds under arbitrary transformations without the need for any ground truth training data. To this end, we design a Feature Proposal Network (FPN) that outputs a set of keypoints and their respective saliency uncertainties from an input 3D point cloud. Our FPN improves keypoint localization by estimating the keypoint positions, in contrast to existing 3D detectors [29, 36, 39] that select existing points in the point cloud as keypoints, which causes quantization errors. During training, we apply randomly generated SE(3) transformations on each point cloud to get a set of corresponding pairs of transformed point clouds as inputs to the FPN. Furthermore, we identify and prevent the degeneracy of our USIP detector. We encourage high repeatability and accurate localization of the keypoints with a probabilistic chamfer loss that minimizes the distances between the detected keypoints from the training point cloud pairs. Additionally, we introduce a point-to-point loss to enforce the constraint of getting keypoints that lie close to the point cloud. We verify our USIP detector by performing extensive repeatability tests on several simulated and real-world benchmark 3D point cloud datasets from Lidar, RGB-D and CAD models. Some qualitative results are shown in Fig 1.

Our key contributions are summarized as follows:

  • Our USIP detector is fully unsupervised, thus avoiding the need for ground truth keypoint labels, which are impossible to obtain.

  • We provide degeneracy analysis of our USIP detector and suggest solutions to prevent it.

  • Our FPN improves keypoint localization by estimating the keypoint position instead of choosing it from an existing point in the point cloud.

  • We introduce the probabilistic chamfer loss and point-to-point loss to encourage high repeatability and accurate keypoint localization.

  • The use of randomly generated transformations on point clouds during training inherently allows our network to achieve good performance under rotations.

2 Related Work

Unlike the recent success of deep learning-based 3D keypoint descriptors [7, 6, 16, 38, 36, 11], most existing 3D keypoint detectors remain hand-crafted. A comprehensive review and evaluation of existing hand-crafted 3D keypoint detectors can be found in [32]. Local Surface Patches (LSP) [3] and Shape Index (SI) [8] are based on the maximum and minimum principal curvatures of a point, and consider the point as a keypoint if it is a global extremum in a predefined neighborhood. Intrinsic Shape Signatures (ISS) [39] and KeyPoint Quality (KPG) [22] select salient points that have local neighborhoods with large variations along each principal axis. MeshDoG [37] and Salient Points (SP) [2] construct a scale-space of the curvature with the Difference-of-Gaussian (DoG) operator similar to SIFT [20]. Points with local extrema values over a one-ring neighborhood are selected as keypoints. These methods can be regarded as 3D extensions of SIFT. Laplace-Beltrami Scale-space (LBSS) [33] computes the saliency by applying a Laplace-Beltrami operator on increasing supports for each point.

More recently, LORAX [9] projects the point set into a depth map and uses Principal Component Analysis (PCA) to select keypoints with commonly found geometric characteristics. All hand-crafted approaches share the common trait of relying on the local geometric properties of the points to select keypoints. Hence, the performance of these detectors deteriorates under disturbances such as noise, density variations and/or arbitrary transformations. In contrast, our deep learning-based USIP detector is more resilient to these disturbances by learning from data. To the best of our knowledge, the only existing deep learning-based 3D keypoint detector is the weakly supervised 3DFeat-Net [36], which is trained with GPS/INS tagged point clouds. However, the training of 3DFeat-Net is largely focused on learning discriminative descriptors using the Siamese architecture, with an attention score map that estimates the saliency of each point as its by-product. It does not ensure good performance of the keypoint detection. In comparison, our USIP is designed to encourage high repeatability and accurate localization of the keypoints. Furthermore, our method is fully unsupervised and does not rely on any form of ground truth datasets.

Figure 2: (a) The training pipeline of USIP detector. (b) The architecture of our Feature Proposal Network (FPN). See text for more detail.

3 Our USIP Detector

Fig. 2 shows the illustration of the pipeline to train our USIP detector. We denote a point cloud from the training dataset as X ∈ ℝ^{3×N}. A set of transformation matrices {T_1, …, T_L}, where T_l ∈ SE(3), is randomly generated and applied to the point cloud X to form pairs of training inputs denoted as {X, X̃_l}, where X̃_l = T_l ∘ X. Here, we use the operator ∘ to denote matrix multiplication under homogeneous coordinates with a slight abuse of notation. We drop the indices for brevity and refer to a triplet of a training pair of point clouds and their corresponding transformation matrix as {X, X̃, T}. During training, X and X̃ are respectively fed into the FPN, which outputs proposal keypoints and their saliency uncertainties denoted as {Q, Σ} and {Q̃, Σ̃} for the respective point clouds, where Q = {Q_1, …, Q_M}, Q̃ = {Q̃_1, …, Q̃_M}, Σ = {σ_1, …, σ_M} and Σ̃ = {σ̃_1, …, σ̃_M}. We enforce σ_i > 0 and σ̃_i > 0 so that each is a valid rate parameter in our probabilistic chamfer loss (see later paragraph). To improve keypoint localization, it is not necessary for all Q_i to be any of the points in X. A similar condition applies to all Q̃_i.
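The training-pair generation above can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation; the transform ranges (`max_angle`, `max_trans`) are hypothetical, since the paper only states that the transformations are randomly generated:

```python
import numpy as np

def random_se3(max_angle=np.pi, max_trans=2.0, rng=None):
    """Sample a random SE(3) transform as a 4x4 homogeneous matrix.
    Ranges are illustrative assumptions, not the paper's settings."""
    rng = np.random.default_rng() if rng is None else rng
    # Random rotation via axis-angle (Rodrigues' formula).
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = rng.uniform(-max_angle, max_angle)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    R = np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = rng.uniform(-max_trans, max_trans, size=3)
    return T

def transform(T, X):
    """Apply T (4x4) to a 3xN point cloud: the 'T o X' operator."""
    return T[:3, :3] @ X + T[:3, 3:4]
```

A training pair is then simply `(X, transform(random_se3(), X))`, with the sampled matrix kept so the transformation can later be undone.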

We undo the transformation on {Q̃, Σ̃} with a slight abuse of notation to get {Q′ = T⁻¹ ∘ Q̃, Σ′}, so that Q′ can be compared directly to Q. Here, we make the assumption that the saliency uncertainties remain unaffected by the transformation, i.e., Σ′ = Σ̃. The objectives of detecting keypoints that are highly repeatable and accurately localized from 3D point clouds under arbitrary transformations can now be achieved by formulating a loss function that minimizes the difference between {Q, Σ} and {Q′, Σ′}. To this end, we propose the loss function L = L_c + λ L_point, where L_c is the probabilistic chamfer loss that minimizes the probabilistic distances between all correspondence pairs of keypoints in Q and Q′. L_point is the point-to-point loss that minimizes the distance of the estimated keypoints to their respective nearest neighbors in the point cloud. This constrains the estimated keypoints to be close to the point cloud. λ is a hyperparameter that adjusts the relative contributions of L_c and L_point to the total loss. More specifically:

Probabilistic Chamfer Loss

A straightforward way to minimize the difference between Q and Q′ is to use the chamfer loss:

L_chamfer = (1/M) Σ_i min_j ‖Q_i − Q′_j‖₂ + (1/M) Σ_j min_i ‖Q_i − Q′_j‖₂,   (1)

that minimizes the distance of each point in one point cloud to its nearest neighbor in the other point cloud. However, the proposals are not equally salient. The receptive field of a proposal keypoint can be a featureless surface since the receptive field is limited to a small volume. In this case, it is detrimental to force the FPN to minimize the distance between Q_i and Q′_j, where Q′_j is the nearest neighbor of Q_i in Q′.
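The plain chamfer loss of Eq. 1 can be sketched in a few lines of NumPy (an illustrative implementation, assuming 3×M keypoint arrays):

```python
import numpy as np

def chamfer_loss(Q, Qp):
    """Symmetric chamfer loss between two 3xM keypoint sets (Eq. 1 sketch)."""
    # Pairwise Euclidean distances, shape (M, M').
    d = np.linalg.norm(Q[:, :, None] - Qp[:, None, :], axis=0)
    # Each keypoint is pulled toward its nearest neighbor in the other set.
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

For identical sets the loss is zero; for a single pair of points one unit apart it is 2 (one unit per direction).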

To mitigate the above problem, we design our FPN to learn the saliency uncertainties Σ and Σ′ of the proposal keypoints Q and Q′ with a probabilistic chamfer loss L_c. In particular, we propose to formulate L_c with an exponential distribution that measures the probabilistic distances between Q and Q′ with the saliency uncertainties Σ and Σ′. More formally, the probability distribution for the nearest-neighbor distance d_ij = ‖Q_i − Q′_j‖₂ with σ_ij = (σ_i + σ′_j)/2 is given by:

p(d_ij | σ_ij) = (1/σ_ij) e^{−d_ij/σ_ij}.   (2)

p(d_ij | σ_ij) is a valid probability distribution since it integrates to 1. A shorter distance d_ij between the proposal keypoints Q_i and Q′_j gives a higher probability that Q_i and Q′_j are highly repeatable and accurately localized keypoints in the point clouds X and X̃. Assuming i.i.d. d_ij for all i, the joint distribution between Q and Q′ is given by:

p(Q | Q′) = Π_i (1/σ_ij) e^{−d_ij/σ_ij},  where Q′_j is the nearest neighbor of Q_i in Q′.   (3)

It is important to note that the probability distribution is not symmetrical when the order of the point clouds is swapped, i.e., p(Q | Q′) ≠ p(Q′ | Q), due to different sets of nearest neighbors in the two directions. Hence, the joint distribution between Q′ and Q is given by:

p(Q′ | Q) = Π_j (1/σ_ji) e^{−d_ji/σ_ji},  where Q_i is the nearest neighbor of Q′_j in Q.   (4)

Finally, the probabilistic chamfer loss between Q and Q′ is given by the negative log-likelihood of the joint distributions defined in Eq. 3 and 4:

L_c = Σ_i (ln σ_ij + d_ij/σ_ij) + Σ_j (ln σ_ji + d_ji/σ_ji).   (5)
We further analyze the physical meaning of σ_ij or σ_ji by computing the extrema of Eq. 2 from its first derivative over σ_ij:

∂p/∂σ_ij = (e^{−d_ij/σ_ij} / σ_ij³)(d_ij − σ_ij),   (6)

and solve for the stationary points:

σ_ij = d_ij.   (7)

Furthermore, the negative second derivative at this stationary point means that given a fixed d_ij, the highest probability is achieved at σ_ij = d_ij. Consider any triplet of proposal keypoints {Q_a, Q′_b, Q_c}, where d_ab and d_cb are the distances between the nearest neighbors {Q_a, Q′_b} and {Q_c, Q′_b} (Q′_b can be the nearest neighbor in both orders of p(Q | Q′) and p(Q′ | Q) since the chamfer distance is not bijective). σ_cb has to take a large value when d_cb is large because we have shown that σ_ab = d_ab and σ_cb = d_cb at optimum. Furthermore, a small d_ab and a large d_cb imply that {Q_a, Q′_b} are repeatable and accurately localized keypoints while Q_c is not. Hence, a large saliency uncertainty for a bad proposal keypoint at optimum shows that our probabilistic chamfer loss is guiding the FPN to learn correctly.
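A minimal NumPy sketch of Eq. 5, under the pairing convention reconstructed above (per-keypoint uncertainties averaged across each nearest-neighbor pair); this is an illustration, not the authors' code:

```python
import numpy as np

def probabilistic_chamfer_loss(Q, Qp, sigma, sigma_p):
    """Negative log-likelihood of exponential distributions over
    nearest-neighbor distances between 3xM keypoint sets (Eq. 5 sketch).
    sigma, sigma_p: positive per-keypoint uncertainties, shape (M,)."""
    d = np.linalg.norm(Q[:, :, None] - Qp[:, None, :], axis=0)  # (M, M')
    j = d.argmin(axis=1)   # nearest neighbor of each Q_i in Q'
    i = d.argmin(axis=0)   # nearest neighbor of each Q'_j in Q
    s_ij = 0.5 * (sigma + sigma_p[j])
    s_ji = 0.5 * (sigma_p + sigma[i])
    return (np.log(s_ij) + d[np.arange(len(j)), j] / s_ij).sum() \
         + (np.log(s_ji) + d[i, np.arange(len(i))] / s_ji).sum()
```

Consistent with the analysis above, for a fixed distance d the loss term ln σ + d/σ is minimized at σ = d, so badly localized proposals are pushed toward large uncertainties.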

Point-to-Point Loss

To avoid quantization error in the positions of the keypoints, we design the FPN such that it is not necessary for the proposal keypoints Q to be any of the points in X. However, this can cause the FPN to give erroneous proposal keypoints that are far away from the point cloud X. We circumvent this problem by adding a loss function that penalizes Q for being too far from X. We also apply a similar penalty on Q̃ and X̃. This loss can be formulated as either the point-to-point loss [1]:

L_point = (1/M) Σ_i ‖Q_i − X_{j*}‖₂,   (8)

where X_{j*} is the nearest neighbor of Q_i in X, or the point-to-plane loss [26, 4]:

L_plane = (1/M) Σ_i |(Q_i − X_{j*}) · n_{j*}|,   (9)

where n_{j*} and ñ_{j*} are the surface normals at the nearest neighbors in X to Q_i and in X̃ to Q̃_i, respectively. We use the point-to-point loss by default since we found experimentally that both loss functions give similar performances.
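The point-to-point variant of Eq. 8 reduces to a one-directional nearest-neighbor penalty; a minimal NumPy sketch (illustrative, assuming 3×M proposals and a 3×N cloud):

```python
import numpy as np

def point_to_point_loss(Q, X):
    """Penalize proposal keypoints Q (3xM) that drift away from the
    input cloud X (3xN): mean distance to the nearest input point."""
    d = np.linalg.norm(Q[:, :, None] - X[:, None, :], axis=0)  # (M, N)
    return d.min(axis=1).mean()
```

Note the asymmetry: only the proposals are pulled toward the cloud, not vice versa, so the detector is free to ignore most input points.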

4 Feature Proposal Network

The network architecture of our FPN is shown in Fig. 2. We first sample M nodes denoted as S = {S_1, …, S_M} with Farthest Point Sampling (FPS) from a given input point cloud X. A neighborhood of points is built for each node using point-to-node grouping [18, 17]. N_m represents the number of points associated with each of the M nodes in S. The advantage of point-to-node association over node-to-point kNN search or radius-based ball-search is two-fold: (1) Every point in X is associated with one node, while some points may be left out in node-to-point kNN search and ball-search. (2) Point-to-node grouping automatically adapts to varying scale and point density, while kNN search and ball-search are vulnerable to density variation and varying scales, respectively. To make the FPN translation equivariant, we normalize each neighborhood point by subtracting from it its respective node S_m. Each cluster of normalized local neighborhood points is then fed into a PointNet-like network [24] shown in Fig. 2 to get a local feature vector F_m associated with node S_m. A kNN grouping layer is applied on the set of local feature vectors to achieve hierarchical information aggregation. Specifically, the K nearest neighbors of each node-feature pair {S_m, F_m} are retrieved. These K local feature vectors are then normalized by subtracting their respective node coordinates to get a position-independent neighborhood, before being fed into another network to get a set of aggregated feature vectors. A simple Multi-Layer Perceptron (MLP) is then used to estimate M proposal keypoints and saliency uncertainties σ_1, …, σ_M from these feature vectors. Finally, we un-normalize each proposal with its respective node S_m to get the final proposal keypoints Q. It is important to note that the size of the receptive field is controlled by the number of proposals M and by K in the kNN layers, and it determines the level-of-detail for each feature. A large receptive field leads to features that are salient at a large scale, and vice versa.
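The point-to-node grouping and node-relative normalization steps can be sketched as follows (a NumPy illustration of the idea, not the authors' implementation):

```python
import numpy as np

def point_to_node_grouping(X, S):
    """Assign every point of X (3xN) to its nearest node in S (3xM) and
    return node-normalized clusters (points minus their node).
    Unlike kNN or ball search, every point lands in exactly one cluster,
    regardless of local density."""
    d = np.linalg.norm(X[:, None, :] - S[:, :, None], axis=0)  # (M, N)
    assign = d.argmin(axis=0)  # node index for each point
    clusters = [X[:, assign == m] - S[:, m:m+1] for m in range(S.shape[1])]
    return assign, clusters
```

Subtracting the node coordinate makes each cluster translation invariant, which is what gives the overall FPN its translation equivariance once the node coordinate is added back to the predicted offset.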

5 Degeneracy Analysis

Let us denote the FPN as g(·), where the point cloud X is the input of the network. We further denote a transformation matrix T, where R ∈ SO(3) and t ∈ ℝ³ are the rotation matrix and translation vector in T. We get T ∘ X = RX ⊕ t, where ⊕ is the operator to denote the addition of t to every column of the other term. We say that the network is degenerate when it outputs trivial solutions where g(T ∘ X) = T ∘ g(X) is satisfied for all R and t.

Lemma 1.

g(T ∘ X) = T ∘ g(X) for all R and t when g outputs the centroid of the input point cloud, i.e., g(X) = (1/N) Σ_n x_n.

Proof.
Putting g(X) = (1/N) Σ_n x_n into g(T ∘ X), we get g(T ∘ X) = (1/N) Σ_n (R x_n + t) = R ((1/N) Σ_n x_n) ⊕ t = R g(X) ⊕ t = T ∘ g(X). Hence, the network degenerates when it outputs the centroid of the input point cloud, which completes our proof. ∎
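Lemma 1 can be checked numerically: a "detector" that collapses to the centroid satisfies the equivariance condition for any rigid transform, so it minimizes the training loss without learning anything useful (NumPy sketch with an arbitrary rotation and translation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))

# An arbitrary rotation about z and a translation.
a = 0.7
R = np.array([[np.cos(a), -np.sin(a), 0],
              [np.sin(a),  np.cos(a), 0],
              [0, 0, 1]])
t = np.array([[1.0], [-2.0], [0.5]])

g = lambda P: P.mean(axis=1, keepdims=True)  # degenerate "detector": centroid

lhs = g(R @ X + t)   # g(T o X)
rhs = R @ g(X) + t   # T o g(X)
assert np.allclose(lhs, rhs)  # equivariance holds trivially
```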

Lemma 2.

g(T ∘ X) = T ∘ g(X) when g is translation equivariant, i.e., g(X ⊕ t) = g(X) ⊕ t, and outputs points that are in the linear subspace of any principal axis of the centered input point cloud X_c = X ⊕ (−c), i.e.,

g(X_c) = [β_1 u, β_2 u, …, β_M u],   (10)

where u can be any principal axis of X_c and β_1, …, β_M are scalar coefficients in ℝ.

Proof.
Let C and C̃ denote the covariance matrices of X and X̃ = T ∘ X, respectively, and let c and c̃ be the centroids of X and X̃, respectively. Putting X̃ = RX ⊕ t into c̃ and C̃, we get:

c̃ = Rc + t,  C̃ = R C Rᵀ.   (11)

Taking the Singular Value Decomposition (SVD) of C and C̃, we get C = U S Uᵀ and C̃ = Ũ S̃ Ũᵀ, where S and S̃ are the diagonal matrices of singular values, and U and Ũ are the eigenvectors that are also the principal axes of X and X̃, respectively. Putting the SVD of C and C̃ into Eq. 11, we get:

Ũ = RU,  S̃ = S.   (12)

Putting the relationship from Eq. 12 into g(X̃), and using translation equivariance with Eq. 10, we get:

g(X̃) = R g(X_c) ⊕ c̃ = R g(X) ⊕ t = T ∘ g(X),   (13)

which completes our proof that the network degenerates when it outputs a set of points on any principal axis. ∎


We note that the network requires sufficient global semantic information of the input point cloud, i.e., the input is the whole point cloud or clusters of local neighbor points with large receptive fields, to learn the trivial solutions of the centroid or a set of points on the principal axes. Hence, the degeneracies can be easily prevented by limiting the receptive fields of the FPN. We achieve this by setting the number of clusters M and the number of nearest neighbors K of the clusters in the FPN (refer to Sec. 4 for the definitions of M and K) to reasonable values. Small values for M or high values for K increase the receptive field and cause the FPN to degenerate. Fig. 3 shows some examples of the degeneracies with different K values at a fixed M. It is interesting to note that the principal axis degeneracy occurs when K is set to a mid-range value, and the centroid degeneracy occurs when K is set to a high value. This implies that a larger receptive field, i.e., more global semantic information, is needed for the network to learn the centroid. We also notice experimentally that the degeneracies (both centroid and principal axis) occur in point clouds with more regular shapes, e.g., objects from ModelNet40, where the centroid and principal axes are more well-defined.

Figure 3: Increasing K values in the FPN cause degeneracies. (a) No degeneracy with a low K value. (b) Principal axis degeneracy with a mid-range K value. (c) Centroid degeneracy with a high K value.

6 Experiments

Figure 4: Relative repeatability when different number of keypoints are detected. Left to right: KITTI, Oxford, Redwood, ModelNet40.
Figure 5: Relative repeatability when Gaussian noise is added to the input point clouds. Keypoint number is fixed to 128.
Figure 6: Relative repeatability when the input point cloud is randomly downsampled by some factors. Keypoint number is fixed to 128.

Following [32], we evaluate the repeatability (Sec. 6.1), distinctiveness (Sec. 6.2) and computational efficiency (Sec. 6.3) of our USIP detector on 4 datasets from object models, outdoor Lidar and indoor RGB-D scans. Additionally, we compare our evaluations to existing detectors: ISS [39], Harris-3D [12], SIFT-3D [20] and 3DFeat-Net [36].

Implementation Details

Three USIP detectors are respectively trained for outdoor Lidars, RGB-D scans and object models. Specifically, we use Oxford [21] for outdoor Lidar, the "RGB-D reconstruction dataset" [38] for RGB-D, and ModelNet40 [35] for object models. The PCL [29] implementations of the classical detectors, i.e., ISS, Harris-3D and SIFT-3D, are used for the comparisons. We take the pretrained models of 3DFeat-Net [36] for KITTI [10] and Oxford, and train separate models for Redwood and ModelNet40 using the code provided by 3DFeat-Net.

Qualitative Visualization

Fig. 7 shows some results from our USIP detector on ModelNet40. We can clearly see that our USIP learns keypoints on corners, edges, centers of small surfaces, etc. Keypoints in the first row of Fig. 7 are selected with Non-Maximum Suppression (NMS) and thresholding on the saliency uncertainty σ. In the second row, keypoints are selected with only NMS. Keypoints with small σ are shown in bright red and become darker with larger σ.

Figure 7: Examples of keypoints from our USIP on ModelNet40.
                  KITTI           Oxford      Redwood   ModelNet40
Type              Velodyne lidar  SICK lidar  RGB-D     CAD model
Scale             200m            60m         10m       2
# points          16,384          16,384      10,240    5,000
ε in Eq. 14       0.5m            0.5m        0.1m      0.03
Rotation          2D              2D          3D        3D
Noise             Sensor          Sensor      Gaussian  Gaussian
Occlusion         Yes             Yes         Yes       No
Density variation Yes             No          No        No
Missing parts     Yes             Yes         Yes       No
Table 1: Datasets used in evaluating keypoint repeatability.
                     Registration Failure Rate (%)           Inlier Ratio (%)
                     Our Desc. 3DFeatNet[36] FPFH[27] SHOT[31]   Our Desc. 3DFeatNet FPFH  SHOT
Random               18.83     42.14         49.95    68.39       7.47      4.48     5.45  4.46
SIFT-3D[29, 20]      15.44     42.63         79.72    84.49       7.36      5.47     4.24  4.11
ISS[29, 39]           5.97     25.96         37.09    69.83       8.52      4.71     4.44  3.45
Harris-3D[29, 12]     3.81     13.56         49.49    51.29      10.57      6.58     4.78  5.00
3DFeatNet[36]         2.61      2.26         12.15    11.76      15.66     10.76     9.55  8.46
USIP                  1.41      1.55          8.37     5.40      32.20     22.48    18.77 18.21
Table 2: Point cloud registration results on KITTI. The number of keypoints is fixed to 256.

6.1 Repeatability

Repeatability refers to the ability of a detector to detect keypoints at the same locations under various disturbances such as view-point variations, noise, missing parts, etc. It is often taken as the most important measure of keypoint detectors because it is a standalone measure that depends only on the detector (without a descriptor). Given two point clouds {X, X̃} of a scene captured from different view-points such that {X, X̃} are related by a rotation matrix R and a translational vector t, a keypoint detector detects sets of keypoints Q and Q̃ from {X, X̃}, respectively. A keypoint Q_i is repeatable if the distance between the transformed keypoint R Q_i + t and its nearest neighbor Q̃_j in Q̃ is less than a threshold ε, i.e.,

‖(R Q_i + t) − Q̃_j‖₂ < ε.   (14)
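The repeatability criterion of Eq. 14 and its normalized version (see "Relative Repeatability" below) can be sketched as (illustrative NumPy, assuming 3×M keypoint arrays):

```python
import numpy as np

def relative_repeatability(Q, Q_tilde, R, t, eps):
    """Fraction of keypoints Q (3xM), detected in X, that land within eps
    of some keypoint Q_tilde (3xM') detected in the transformed cloud."""
    Qw = R @ Q + t.reshape(3, 1)  # bring Q into the frame of Q_tilde
    d = np.linalg.norm(Qw[:, :, None] - Q_tilde[:, None, :], axis=0)
    return float((d.min(axis=1) < eps).mean())
```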
Test Datasets

We evaluate repeatability on four test datasets: KITTI, Oxford, Redwood and ModelNet40. Note that our USIP is trained on neither KITTI nor Redwood. We use the KITTI and Oxford test datasets prepared by 3DFeat-Net [36]. Each pair of point clouds is captured from nearby locations within 10m and manually augmented with random 2D rotations. The pairs {X, X̃} in Redwood are from simulated RGB-D cameras with 3D rotations / translations and Gaussian noise. The overlapping areas between {X, X̃} can be as low as 30%. In ModelNet40, X̃ is obtained by augmenting X with random 3D rotations. Points in KITTI, Oxford and Redwood are in their original scale, while points in ModelNet40 are normalized. Details of the datasets are shown in Tab. 1. The scale refers to the diameter of the point clouds.

Relative Repeatability

We use relative repeatability that normalizes the number of repeatable keypoints over the total number of detected keypoints for fair comparisons, i.e., the ratio of the number of keypoints that passed the repeatability test in Eq. 14 to the number of detected keypoints. We set the parameters of each keypoint detector on each dataset to generate 4, 8, 16, 32, 64, 128, 256 and 512 keypoints, or close to these numbers when it is not possible to set the detectors (SIFT-3D, Harris-3D and ISS) to generate an exact number of keypoints. Note that in general the repeatability should be proportional to the number of keypoints. In the extreme case that every point is regarded as a keypoint, the repeatability is the same as the percentage of overlap between {X, X̃}. As shown in Fig. 4, our USIP generally outperforms other detectors by a significant margin on the 4 datasets over 8 different numbers of keypoints. In the extremely hard case that only 4 keypoints are detected, our method achieves relative repeatability of 34%, 23%, 10% and 60% for KITTI, Oxford, Redwood and ModelNet40, respectively. In the case of 64 keypoints, our performance is roughly 4.2x, 2.8x, 1.3x and 2.6x higher than the second best detector.

Robustness to Noise

The original points in KITTI and Oxford are already corrupted with sensor noise. We further augment the point clouds in the 4 datasets with zero-mean Gaussian noise of increasing standard deviation (in meters for KITTI, Oxford and Redwood, and unit-less for ModelNet40 since its point clouds are normalized). The number of keypoints is fixed to 128. Our USIP is a lot more robust than other detectors, as shown in Fig. 5. In KITTI and Oxford, the performances of the other detectors fall to the level of random sampling under moderate noise, while our USIP does not show a significant drop in performance even at the largest noise level. In Redwood, all methods except USIP and ISS deteriorate to random sampling as the noise increases. In ModelNet40, our method maintains a high repeatability of 91% at the largest noise level, while all other methods drop below 8%.

Robustness to Downsampling

We evaluate the repeatability of the detectors on input point clouds downsampled by some factors using random selection. The results are shown in Fig. 6, where a down-sample factor of k means the number of points is reduced to 1/k of the original number shown in Tab. 1. We can see that the repeatability of our USIP remains satisfactory even under heavy downsampling on KITTI, Oxford and ModelNet40. The only exception is the Redwood dataset, where almost all detectors perform poorly at high downsample factors. Indoor RGB-D scans in Redwood consist of many large and flat surfaces like walls, ceilings, etc. Furthermore, there are very few distinguishable and non-occluded structures, a problem that is further aggravated by severe downsampling. Hence, it is difficult to detect repeatable keypoints in these RGB-D scans.

6.2 Distinctiveness: Point Cloud Registration

Distinctiveness is a measure of the performance of keypoint detectors and descriptors for finding correspondences in point cloud registration. Hence, distinctiveness is not as good an evaluation criterion for keypoint detectors as repeatability because it is confounded with the performance of the descriptor. We mitigate this limitation by evaluating point cloud registration over several existing keypoint descriptors. We also use the results to show that our USIP detector works with different existing keypoint descriptors.

Experiment Setup

We follow the point cloud registration pipeline from 3DFeat-Net [36] on their KITTI test dataset. Four descriptors are used to perform keypoint description, i.e., three off-the-shelf descriptors: 3DFeatNet, FPFH [28], SHOT [31], and our own descriptor inspired by 3DFeat-Net with minor modifications, which is denoted as "Our Desc." (details are in our supplementary material). Registration of a pair of point clouds involves 4 steps: (a) Extract keypoints and their corresponding descriptor vectors from each point cloud. (b) Establish keypoint-to-keypoint correspondences by nearest neighbor search of the descriptor vectors. (c) Perform RANSAC on the two matched keypoint sets to find the rotation and translation with the most inliers. (d) Compare the resulting rotation and translation with the ground truth. A pair of point clouds is regarded as successfully registered if its relative translational and rotational errors are within the respective thresholds.
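Steps (b) and (c) above can be sketched as follows. This is a minimal NumPy illustration of descriptor matching plus RANSAC with a least-squares (Kabsch) fit; the iteration count and inlier threshold are hypothetical, not the paper's settings:

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid transform (R, t) aligning 3xK points P onto Q."""
    cp, cq = P.mean(axis=1, keepdims=True), Q.mean(axis=1, keepdims=True)
    H = (P - cp) @ (Q - cq).T
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard keeps det(R) = +1.
    D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cq - R @ cp

def ransac_register(kp1, desc1, kp2, desc2, iters=200, thresh=0.5, rng=None):
    """Match descriptors by nearest neighbor, then RANSAC over 3-point
    samples to find the rigid transform with the most inliers."""
    rng = np.random.default_rng() if rng is None else rng
    # (b) nearest-neighbor correspondences in descriptor space
    dd = np.linalg.norm(desc1[:, :, None] - desc2[:, None, :], axis=0)
    m = dd.argmin(axis=1)
    src, dst = kp1, kp2[:, m]
    best, best_inl = (np.eye(3), np.zeros((3, 1))), -1
    # (c) RANSAC over minimal 3-point samples
    for _ in range(iters):
        idx = rng.choice(src.shape[1], 3, replace=False)
        R, t = kabsch(src[:, idx], dst[:, idx])
        inl = (np.linalg.norm(R @ src + t - dst, axis=0) < thresh).sum()
        if inl > best_inl:
            best, best_inl = (R, t), inl
    return best
```

With noiseless keypoints and perfect matches, any non-degenerate 3-point sample already recovers the exact transform; in practice the inlier count arbitrates among noisy hypotheses.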

Registration Results

We perform registration evaluations over the combination of 6 keypoint detectors and 4 descriptors. The registration failure rate and keypoint inlier ratio are shown in Tab. 2. Compared to other detectors, our USIP achieves the lowest registration failure rate and the highest inlier ratio by a considerable margin on all 4 descriptors. The significance of the results in Tab. 2 is two-fold. First, our USIP works well with various hand-crafted (FPFH and SHOT) and deep learning-based (Our Desc. and 3DFeat-Net) descriptors. Second, our USIP produces more distinctive keypoints since it consistently outperforms the other keypoint detectors over the different descriptors on registration failure rate and keypoint inlier ratio, as shown in Tab. 2. The experimental configurations in Tab. 2 are not the optimal settings for our USIP detector and descriptor nor for 3DFeatNet because we have to fix the number of keypoints for fair comparison. In Tab. 3, we illustrate the best registration results for our USIP and 3DFeatNet on KITTI without limitation on the number of keypoints. We again achieve a lower failure rate and a higher inlier ratio. In addition, we show the visualization of keypoint matching results of two examples from KITTI and Oxford in Fig. 8.

Detector Descriptor Fail(%) Inlier(%) RTE(m) RRE (°)
3DFeat-Net 3DFeat-Net 0.57 12.9
USIP Our Desc. 0.24 28.0
Table 3: Point cloud registration on KITTI from the optimal configurations of 3DFeat-Net and our USIP.
Figure 8: Keypoints and matches from our USIP detector and “Our Desc.”. Best view with color and zoom-in.

6.3 Computational Efficiency

Hand-crafted detectors are deployed with single-thread C++ code on an Intel i7 6950X CPU. Our USIP and 3DFeatNet are deployed on an Nvidia 1080Ti, with PyTorch and TensorFlow, respectively. Computational efficiency is evaluated on 2,391 KITTI point clouds, where each point cloud is downsampled to 16,384 points. We record the average time taken to extract 128 keypoints from each point cloud. As shown in Tab. 4, our USIP is an order of magnitude faster than the other detectors except random sampling.

Random SIFT-3D ISS Harris-3D 3DFeatNet USIP
0.0005 0.163 0.388 0.150 0.438 0.011
Table 4: Average time (in seconds) to extract 128 keypoints from KITTI point clouds respectively downsampled to 16,384 points.

7 Conclusion

In this paper, we present the USIP detector, an unsupervised deep learning-based keypoint detector for 3D point clouds. A probabilistic chamfer loss is proposed to guide the network to learn highly repeatable keypoints. We provide mathematical analysis of and solutions for network degeneracy, which are supported by experimental results. Extensive evaluations are performed with Lidar scans, RGB-D images and CAD models. Our USIP detector outperforms existing detectors by a significant margin in terms of repeatability, distinctiveness and computational efficiency.


  • [1] P. J. Besl and N. D. McKay. Method for registration of 3-d shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pages 586–607. International Society for Optics and Photonics, 1992.
  • [2] U. Castellani, M. Cristani, S. Fantoni, and V. Murino. Sparse points matching by combining 3d mesh saliency with statistical descriptors. In Computer Graphics Forum, volume 27, pages 643–652. Wiley Online Library, 2008.
  • [3] H. Chen and B. Bhanu. 3d free-form object recognition in range images using local surface patches. Pattern Recognition Letters, 28(10):1252–1262, 2007.
  • [4] Y. Chen and G. Medioni. Object modelling by registration of multiple range images. Image and vision computing, 10(3):145–155, 1992.
  • [5] S. Choi, Q.-Y. Zhou, and V. Koltun. Robust reconstruction of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5556–5565, 2015.
  • [6] H. Deng, T. Birdal, and S. Ilic. Ppf-foldnet: Unsupervised learning of rotation invariant 3d local descriptors. arXiv preprint arXiv:1808.10322, 2018.
  • [7] H. Deng, T. Birdal, and S. Ilic. Ppfnet: Global context aware local features for robust 3d point matching. Computer Vision and Pattern Recognition (CVPR). IEEE, 1, 2018.
  • [8] C. Dorai and A. K. Jain. Cosmos-a representation scheme for 3d free-form objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(10):1115–1130, 1997.
  • [9] G. Elbaz, T. Avraham, and A. Fischer. 3d point cloud registration for localization using a deep neural network auto-encoder. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 2472–2481. IEEE, 2017.
  • [10] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [11] Z. Gojcic, C. Zhou, J. D. Wegner, and A. Wieser. The perfect match: 3d point cloud matching with smoothed densities. arXiv preprint arXiv:1811.06879, 2018.
  • [12] C. G. Harris, M. Stephens, et al. A combined corner and edge detector. In Alvey Vision Conference, volume 15, 1988.
  • [13] B.-S. Hua, Q.-H. Pham, D. T. Nguyen, M.-K. Tran, L.-F. Yu, and S.-K. Yeung. Scenenn: A scene meshes dataset with annotations. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 92–101. IEEE, 2016.
  • [14] A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3d scenes. IEEE Transactions on Pattern Analysis & Machine Intelligence, (5):433–449, 1999.
  • [15] J. W. Tangelder and R. C. Veltkamp. A survey of content based 3d shape retrieval methods. Multimedia Tools and Applications, 2008.
  • [16] M. Khoury, Q.-Y. Zhou, and V. Koltun. Learning compact geometric features. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pages 153–61, 2017.
  • [17] T. Kohonen. The self-organizing map. Neurocomputing, 21(1):1–6, 1998.
  • [18] J. Li, B. M. Chen, and G. H. Lee. So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9397–9406, 2018.
  • [19] Z. Lian and A. A. Godil. A comparison of methods for non-rigid 3d shape retrieval. 2012.
  • [20] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
  • [21] W. Maddern, G. Pascoe, C. Linegar, and P. Newman. 1 Year, 1000km: The Oxford RobotCar Dataset. The International Journal of Robotics Research (IJRR), 36(1):3–15, 2017.
  • [22] A. Mian, M. Bennamoun, and R. Owens. On the repeatability and quality of keypoints for local feature-based 3d object retrieval from cluttered scenes. International Journal of Computer Vision, 89(2-3):348–361, 2010.
  • [23] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, et al. Fastslam: A factored solution to the simultaneous localization and mapping problem. In AAAI/IAAI, pages 593–598, 2002.
  • [24] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593, 2016.
  • [25] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. Orb: An efficient alternative to sift or surf. In Computer Vision (ICCV), 2011 IEEE international conference on, pages 2564–2571. IEEE, 2011.
  • [26] S. Rusinkiewicz and M. Levoy. Efficient variants of the icp algorithm. In 3-D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on, pages 145–152. IEEE, 2001.
  • [27] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (fpfh) for 3d registration. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pages 3212–3217. Citeseer, 2009.
  • [28] R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu. Fast 3d recognition and pose using the viewpoint feature histogram. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 2155–2162. IEEE, 2010.
  • [29] R. B. Rusu and S. Cousins. 3d is here: Point cloud library (pcl). In 2011 IEEE International Conference on Robotics and Automation, pages 1–4, May 2011.
  • [30] F. Tombari, S. Salti, and L. Di Stefano. Unique shape context for 3d data description. In Proceedings of the ACM workshop on 3D object retrieval, pages 57–62. ACM, 2010.
  • [31] F. Tombari, S. Salti, and L. Di Stefano. Unique signatures of histograms for local surface description. In European conference on computer vision, pages 356–369. Springer, 2010.
  • [32] F. Tombari, S. Salti, and L. Di Stefano. Performance evaluation of 3d keypoint detectors. International Journal of Computer Vision, 102(1-3):198–220, 2013.
  • [33] R. Unnikrishnan and M. Hebert. Multi-scale interest regions from unorganized point clouds. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8. IEEE, 2008.
  • [34] M. A. Uy and G. H. Lee. Pointnetvlad: Deep point cloud based retrieval for large-scale place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [35] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
  • [36] Z. J. Yew and G. H. Lee. 3dfeat-net: Weakly supervised local 3d features for point cloud registration. In Proceedings of the European Conference on Computer Vision (ECCV), pages 607–623, 2018.
  • [37] A. Zaharescu, E. Boyer, K. Varanasi, and R. Horaud. Surface feature detection and description with applications to mesh matching. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 373–380. IEEE, 2009.
  • [38] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser. 3dmatch: Learning local geometric descriptors from rgb-d reconstructions. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 199–208. IEEE, 2017.
  • [39] Y. Zhong. Intrinsic shape signatures: A shape descriptor for 3d object recognition. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 689–696. IEEE, 2009.

Appendix A Overview

We provide more details on the algorithms and experiments described in the main paper. Sec. B presents more examples of the network degeneracy. Sec. C evaluates the effect of point-to-point loss on the keypoint repeatability. Sec. D illustrates the details of our feature descriptor design. Sec. E gives more experiments on point cloud registration tasks. Sec. F presents visualizations of our USIP keypoints in various datasets.

Appendix B More Examples on Degeneracy

As analyzed in Sec. 5, our FPN degenerates when the receptive field becomes sufficiently large, i.e., when it has gained sufficient global semantic information. The receptive field of the FPN is controlled by two parameters: the number of keypoint proposals and the number of neighbors in the kNN feature aggregation. More specifically, the receptive field size is proportional to the number of neighbors and inversely proportional to the number of proposals. In this section, we visualize the network degeneracy by gradually enlarging the receptive field. Figs. 14 and 15 show the resulting degeneracies for two different parameter settings.
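The effect of the two parameters on the receptive field can be illustrated with a small numpy sketch. This is a toy measurement on a random cloud, not the actual FPN; the function names and the cell-radius proxy are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def cell_radius(points, num_nodes):
    """Mean radius of point-to-node cells: each point is assigned to its
    nearest node; fewer proposals -> larger cells -> larger receptive field."""
    nodes = points[rng.choice(len(points), num_nodes, replace=False)]
    d = np.linalg.norm(points[:, None, :] - nodes[None, :, :], axis=-1)
    assign = d.argmin(axis=1)                    # nearest node per point
    return np.mean([d[assign == j, j].max() for j in range(num_nodes)
                    if np.any(assign == j)])

def knn_radius(points, k):
    """Mean distance to the k-th nearest neighbor: more neighbors ->
    larger aggregation radius -> larger receptive field."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k].mean()       # column 0 is the point itself

cloud = rng.standard_normal((1024, 3))
assert cell_radius(cloud, 16) > cell_radius(cloud, 256)   # fewer proposals
assert knn_radius(cloud, 64) > knn_radius(cloud, 8)       # more neighbors
```

The two assertions mirror the claim above: shrinking the number of proposals or growing the neighborhood size both enlarge the region each proposal effectively sees.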

Figure 9: Visualization of USIP keypoints trained with (first row) and without (second row) the point-to-point loss.

Appendix C Effect of the Point-to-Point Loss Weight

Sec. 3 of the main paper describes the point-to-point loss, which penalizes keypoints for lying too far from the input point cloud. The point-to-point loss is added to the loss function with a weight. Here, we show that our USIP detector is very robust to the value of this weight: the repeatability of our USIP keypoints remains almost the same over a wide range of values. Keypoint repeatability for various weights is illustrated in Fig. 10, which shows that the USIP keypoints are highly repeatable even when the weight is small. This is probably because our design of limiting the receptive field already guides the network to learn repeatable keypoints, even without the point-to-point loss. On the other hand, the network fails to converge when the weight is too large, because the point-to-point loss then dominates the training process. Nonetheless, training the network without the point-to-point loss does not ensure that the keypoints stay close to the input point cloud. The top row of Fig. 9 shows keypoints from our USIP detector trained with the point-to-point loss; they are close to the input point cloud. In comparison, the bottom row of Fig. 9 shows keypoints from our USIP detector trained without the point-to-point loss. These are less desirable keypoints that are farther from the input point cloud.

Figure 10: Relative repeatability for different weights of the point-to-point loss. The number of keypoints is fixed to 128. Left to right: KITTI, Oxford, Redwood, ModelNet40.
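A minimal numpy sketch of how a point-to-point term can be combined with a chamfer-style repeatability term. The name `lam` for the weight and the plain (non-probabilistic) chamfer distance are simplifications of the actual loss:

```python
import numpy as np

def nn_dist(a, b):
    """For each point in a, the distance to its nearest neighbor in b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1)

def keypoint_loss(kp1, kp2, cloud1, cloud2, lam=0.1):
    """Chamfer term pulls the keypoints detected on the two transformed
    copies together (repeatability); the point-to-point term, weighted by
    `lam`, keeps keypoints close to the input clouds. The probabilistic
    weighting of the paper's chamfer loss is omitted here."""
    chamfer = nn_dist(kp1, kp2).mean() + nn_dist(kp2, kp1).mean()
    p2p = nn_dist(kp1, cloud1).mean() + nn_dist(kp2, cloud2).mean()
    return chamfer + lam * p2p
```

With `lam = 0` the repeatability term alone can still be minimized by keypoints that drift off the surface, which is exactly the failure mode shown in the bottom row of Fig. 9.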

Appendix D Our Descriptor a.k.a “Our Desc.”

Figure 11: Network architecture of “Our Desc.”.

Fig. 11 shows the network design of “Our Desc.”, inspired by 3DFeat-Net [36] as mentioned in Sec. 6.2 of the main paper. Given the keypoints output by the FPN, a ball of points from the point cloud within a fixed radius is built around each keypoint, and a keypoint descriptor is extracted for each ball. The descriptor can be trained with either weak [36] or strong supervision [38, 16]. We improve the keypoint descriptor training by utilizing the keypoint saliency uncertainty, as detailed in Secs. D.1, D.2 and E.

D.1 Weak Supervision

Weak supervision of the descriptor is based on a triplet loss and the ground truth coarse registrations of the point clouds in the training dataset. Similar to [36], point clouds from the dataset are selected as anchor samples during training. All point clouds that overlap with an anchor are defined as its positive samples, while non-overlapping point clouds are defined as its negative samples. We generate these training samples from the Oxford RobotCar and KITTI datasets. The triplet loss takes the standard form: for each descriptor from the anchor sample, we minimize the Euclidean distance to its nearest neighbor among the positive descriptors and maximize the Euclidean distance to its nearest neighbor among the negative descriptors. In addition, a normalized weight is applied to each triplet. This weight is derived from our USIP keypoint saliency uncertainties, which indicate the reliability of the anchor and positive keypoints; the uncertainty is clamped at a threshold that serves as its upper bound.
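A hedged numpy sketch of such a saliency-weighted triplet loss. The names `T` and `margin`, and the exact form of the weighting and normalization, are our assumptions rather than the paper's equations:

```python
import numpy as np

def weighted_triplet_loss(anchor, positive, negative, sigma_a, sigma_p,
                          T=1.0, margin=0.5):
    """anchor/positive/negative: (N, D) descriptor sets; sigma_a/sigma_p:
    (N,) saliency uncertainties of the anchor and positive keypoints."""
    # nearest positive is pulled in, nearest negative is pushed out
    d_pos = np.linalg.norm(anchor[:, None] - positive[None], axis=-1).min(1)
    d_neg = np.linalg.norm(anchor[:, None] - negative[None], axis=-1).min(1)
    # low saliency uncertainty -> high weight; T caps the uncertainty
    w = 1.0 - np.minimum(sigma_a + sigma_p, T) / T
    w = w / (w.sum() + 1e-12)                    # normalize the weights
    return np.sum(w * np.maximum(d_pos - d_neg + margin, 0.0))
```

Triplets built from unreliable (high-uncertainty) keypoints thus contribute little to the gradient, which is the intended effect of the weighting.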

D.2 Strong Supervision

We train the descriptor network with strong supervision on datasets that provide ground truth poses, i.e., SceneNN [13] and the “3D reconstruction dataset” [38]. The loss for strong supervision is defined on a pair of overlapping point clouds with ground truth poses, using the keypoint descriptors extracted from each cloud. A positive for a given descriptor is a descriptor from the other cloud whose keypoint location, after alignment with the ground truth poses, lies within a distance threshold of the given descriptor's keypoint. To achieve hard negative mining, we randomly select 50% of the negatives from the descriptors whose keypoint distance to the given keypoint is larger than the threshold, and choose the other 50% from the keypoints with the shortest distances that are still larger than the threshold.
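The positive selection and 50/50 negative mining described above can be sketched as follows; the function name and the single distance threshold `d` are illustrative:

```python
import numpy as np

def mine_pairs(loc1, loc2, pose1, pose2, d=0.5, rng=None):
    """loc1/loc2: (N,3)/(M,3) keypoint locations in the local frames of two
    overlapping clouds; pose1/pose2: (4,4) ground-truth poses. Returns, for
    each keypoint in loc1, the index of a positive (within distance d in the
    world frame, or -1 when none exists) and a mined negative in loc2."""
    rng = rng or np.random.default_rng(0)
    to_world = lambda loc, T: loc @ T[:3, :3].T + T[:3, 3]
    w1, w2 = to_world(loc1, pose1), to_world(loc2, pose2)
    dist = np.linalg.norm(w1[:, None] - w2[None], axis=-1)   # (N, M)
    pos = np.where(dist.min(1) < d, dist.argmin(1), -1)
    neg = np.empty(len(loc1), dtype=int)
    for i, row in enumerate(dist):
        far = np.flatnonzero(row > d)            # all valid negatives
        if rng.random() < 0.5:                   # 50%: random negative
            neg[i] = rng.choice(far)
        else:                                    # 50%: hardest negative
            neg[i] = far[row[far].argmin()]      # closest among the far ones
    return pos, neg
```

The "closest among the far ones" branch is what makes the negatives hard: they are just outside the positive radius, forcing the descriptor to discriminate nearby but distinct surface patches.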

Appendix E More Point Cloud Registration Results

| Method | Oxford: RTE (m) / RRE (°) / Fail % / Inlier % / # Iter | KITTI: RTE (m) / RRE (°) / Fail % / Inlier % / # Iter |
|---|---|---|
| ISS [39] + FPFH [27] | 0.40±0.29 / 1.60±1.02 / 7.68 / 8.6 / 7171 | 0.33±0.27 / 1.04±0.77 / 39.00 / 8.8 / 8000 |
| ISS [39] + SI [14] | 0.42±0.31 / 1.61±1.12 / 12.55 / 4.7 / 9888 | 0.35±0.31 / 1.11±0.93 / 41.86 / 4.6 / 9401 |
| ISS [39] + USC [30] | 0.32±0.27 / 1.22±0.95 / 5.98 / 8.6 / 7084 | 0.27±0.28 / 0.83±0.76 / 18.62 / 7.7 / 8149 |
| ISS [39] + CGF [16] | 0.43±0.32 / 1.62±1.10 / 12.64 / 4.9 / 9628 | 0.23±0.25 / 0.69±0.60 / 8.90 / 8.4 / 7670 |
| ISS [39] + 3DMatch [38] | 0.49±0.37 / 1.78±1.21 / 30.94 / 5.4 / 9131 | 0.30±0.28 / 0.80±0.67 / 7.14 / 8.4 / 7165 |
| 3DFeat-Net [36] | 0.30±0.26 / 1.07±0.85 / 1.90 / 13.7 / 2940 | 0.26±0.26 / 0.56±0.46 / 0.57 / 12.9 / 3768 |
| USIP + Our Desc. | 0.28±0.26 / 0.81±0.74 / 0.93 / 28.1 / 523 | 0.21±0.24 / 0.42±0.32 / 0.24 / 28.0 / 600 |

Table 5: Geometric registration performance on Oxford RobotCar and KITTI. The combination of our USIP keypoint detector and “Our Desc.” outperforms existing methods in all criteria, with an inlier ratio of around 28%.
Figure 12: Registration failure rate versus maximum RANSAC iterations on Oxford RobotCar (left) and KITTI (right). Note that the x-axis is in logarithmic scale. Our USIP detector + “Our Desc.” (red line) shows very little drop in performance with a decreasing number of RANSAC iterations.
Figure 13: Point cloud registration error rate (%) on KITTI (trained on Oxford). The dashed line marks the best performance among existing methods.

We follow the experimental setup and pipeline of 3DFeat-Net [36] to provide more evaluation results on point cloud registration. More specifically, we compare the performance of our USIP detector and “Our Desc.” against existing keypoint detectors and descriptors. The evaluations are done on the Oxford RobotCar and KITTI datasets prepared by [36]; refer to Sec. 6.2 of the main paper for details of the registration steps. A fixed number of 256 keypoints is extracted from each point cloud, without Non-Maximum Suppression (NMS). Furthermore, keypoints with high saliency uncertainty are filtered out.


The Oxford RobotCar dataset consists of 40 traversals of the same route over a year. 3D point clouds are built by accumulating the 2D scans from a SICK LMS-151 LiDAR with the GPS/INS readings. We use 35 traversals, i.e., 21,875 point clouds, for training. The remaining 5 traversals, i.e., 828 point clouds and 3,426 overlapping pairs, are used for evaluation. Random rotations around the up-axis are applied to each evaluation point cloud. In KITTI, 3D point clouds are directly provided by a Velodyne HDL-64E. We use the 2,831 overlapping pairs of point clouds prepared by [36] for the registration evaluation.


Tab. 5 shows the point cloud registration performance. Our USIP detector + “Our Desc.” outperforms previous methods with the lowest registration failure rate (Fail %), Relative Translational Error (RTE) and Relative Rotation Error (RRE), and the highest inlier ratio (Inlier %). In particular, our registration failure rate is about half that of the second best keypoint detector + descriptor combination, and our inlier ratio is about twice as high. We further analyze the performance over different numbers of RANSAC iterations. The registration failure rate versus the maximum number of RANSAC iterations is shown in Fig. 12. Due to its high repeatability, our USIP detector (red line) shows very little drop in performance with a decreasing number of RANSAC iterations, while all other algorithms degrade rapidly. Additionally, we replace our USIP detector + “Our Desc.” with Random Sampling + “Our Desc.” to demonstrate the effectiveness of our USIP detector: as seen in Fig. 12, the performance of Random Sampling + “Our Desc.” (black line) drops as quickly as that of the other methods.
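The registration stage itself can be approximated with a short RANSAC loop around the Kabsch algorithm. This is a simplified stand-in for the pipeline of [36], assuming descriptor matching has already produced putative correspondences:

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) aligning src onto dst."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)                # 3x3 covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs

def ransac_register(src, dst, iters=100, thresh=0.3, rng=None):
    """src/dst: (N, 3) matched keypoint pairs. Returns the (R, t) with the
    most inliers; with a high inlier ratio, few iterations suffice."""
    rng = rng or np.random.default_rng(0)
    best, best_in = (np.eye(3), np.zeros(3)), -1
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)
        R, t = kabsch(src[idx], dst[idx])        # minimal 3-point model
        resid = np.linalg.norm(src @ R.T + t - dst, axis=-1)
        n_in = int((resid < thresh).sum())
        if n_in > best_in:
            best_in = n_in
            best = kabsch(src[resid < thresh], dst[resid < thresh])
    return best
```

Because a higher inlier ratio raises the probability that a random 3-point sample is all-inlier, detectors with repeatable keypoints (and hence clean matches) keep working when the iteration budget shrinks, which is the trend Fig. 12 illustrates.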

Effect of USIP Keypoint Saliency Uncertainty on Descriptor Training

We show that the keypoint saliency uncertainty from our USIP detector improves the performance of “Our Desc.”. To this end, we compare the performance of “Our Desc.” trained with USIP keypoints and with randomly sampled keypoints, respectively. In particular, the weight from Eq. 15 or Eq. 17 is set to 1 for the randomly sampled keypoints. We denote the descriptor trained with randomly sampled keypoints as “Desc. w. RS”. Tab. 6 shows the registration failure rates of “Desc. w. USIP” and “Desc. w. RS”. The results show that “Desc. w. USIP” performs better, which means that the keypoints and saliency uncertainties from our USIP detector improve descriptor training.

| Failure % | Oxford: Desc. w. USIP / Desc. w. RS | KITTI: Desc. w. USIP / Desc. w. RS |
|---|---|---|
| USIP | 0.93 / 1.20 | 0.24 / 1.02 |

Table 6: Registration failure rate for “Our Desc.” trained with keypoints from our USIP detector versus randomly sampled keypoints.

Effect of Parameters

Fig. 13 shows the point cloud registration failure rate (%) for various settings of the USIP detector parameters, using the same descriptor described in Sec. D. As shown in Fig. 13, our method outperforms existing methods over a wide range of parameter values. We notice that performance decreases significantly when the number of keypoint proposals is too small or the number of neighbors is too large, i.e., when the receptive field is too large. This further validates our design of limiting the receptive field. In addition, Fig. 13 shows that the registration failure rate remains satisfactory when the point-to-point loss weight is small. This is consistent with Fig. 10: our USIP detector is able to detect repeatable keypoints even without the point-to-point loss. Nonetheless, it is still important to include the point-to-point loss to ensure that the keypoints lie close to the input point cloud.

Figure 14: Visualization of FPN degeneracy. The receptive field of the FPN increases from left to right.
Figure 15: Visualization of FPN degeneracy. The receptive field of the FPN increases from left to right.

Appendix F Qualitative Visualization of USIP Keypoints

We show more visualizations of the keypoints detected by our USIP detector on ModelNet40, KITTI, Oxford RobotCar and Redwood in Figs. 16, 17, 18 and 19, respectively. NMS and saliency-uncertainty thresholding are applied here. A limitation of our USIP detector is shown in Fig. 16: there are no or very few keypoints on objects that are highly symmetrical or have smooth surfaces. The saliency uncertainties of the keypoints detected on such objects are large, so the keypoints are discarded by the thresholding.
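The NMS and uncertainty-thresholding step can be sketched as follows; the radius and threshold values are illustrative, not the paper's settings:

```python
import numpy as np

def filter_keypoints(kp, sigma, nms_radius=0.5, sigma_max=1.0):
    """kp: (N, 3) keypoints; sigma: (N,) saliency uncertainties (lower is
    better). Discards high-uncertainty keypoints, then greedily keeps the
    most reliable keypoint within every nms_radius neighborhood."""
    keep = sigma < sigma_max                     # uncertainty thresholding
    kp, sigma = kp[keep], sigma[keep]
    order = np.argsort(sigma)                    # most reliable first
    selected = []
    for i in order:
        if all(np.linalg.norm(kp[i] - kp[j]) > nms_radius for j in selected):
            selected.append(i)                   # non-maximum suppression
    return kp[selected]
```

On highly symmetrical or smooth objects, nearly all keypoints would fail the `sigma < sigma_max` test, which reproduces the empty-output behavior noted above.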

Figure 16: Visualization of USIP keypoints on ModelNet40. Best viewed in color with zoom-in.
Figure 17: Visualization of USIP keypoints on KITTI, with our USIP detector trained on the Oxford RobotCar dataset. Best viewed in color with zoom-in.
Figure 18: Visualization of USIP keypoints on Oxford RobotCar. Best viewed in color with zoom-in.
Figure 19: Visualization of USIP keypoints on Redwood, with our USIP detector trained on the “3D Reconstruction Dataset” [38]. Best viewed in color with zoom-in.