3D point cloud registration plays an important role in many real-world applications such as 3D Lidar-based mapping and localization for autonomous robots, and 3D model acquisition for archaeological studies, geo-surveying and architectural inspections. Compared to images, point clouds exhibit less variation and can be matched under strong lighting changes, e.g. day and night, or summer and winter (Fig. 1). A two-step process is commonly used to solve the point cloud registration problem: (1) establish 3D-3D point correspondences between the source and target point clouds, and (2) find the optimal rigid transformation between the two point clouds that minimizes the total Euclidean distance over all point correspondences. Unfortunately, the critical step of establishing 3D-3D point correspondences is non-trivial. Even though many handcrafted 3D feature detectors [35, 5] and descriptors [26, 25, 13, 30, 28] have been proposed over the years, the performance of establishing 3D-3D point correspondences remains unsatisfactory. As a result, iterative algorithms such as Iterative Closest Point (ICP), which circumvent the need for wide-baseline 3D-3D point correspondences by relying on a good initialization and nearest-neighbor associations, are often used instead. This severely limits their usage in applications such as global localization / pose estimation and loop closure, which require wide-baseline correspondences.
Inspired by the success of deep learning for computer vision tasks such as image-based object recognition
, several deep learning based works that learn 3D feature descriptors for finding wide-baseline 3D-3D point matches have been proposed in recent years. Despite the improvements of these learned 3D descriptors over traditional handcrafted ones, none of them proposes a full pipeline that uses deep learning to concurrently learn both the 3D feature detector and descriptor. This is because most existing deep learning approaches are based on supervised learning, which requires huge amounts of hand-labeled training data, and it is infeasible to manually identify and label salient 3D features in a point cloud. Hence, most existing approaches focus only on learning the 3D descriptors, while the 3D features are detected by random selection [34, 8]. On the other hand, it is interesting to note the availability of an abundance of GPS/INS tagged 3D point cloud datasets collected over large environments, e.g. the Oxford RobotCar and KITTI datasets. This naturally leads to the question: “Can we design a deep learning framework that concurrently learns the 3D feature detector and descriptor from GPS/INS tagged 3D point clouds?”
In view of the difficulty of obtaining accurately labeled salient 3D features for training deep networks, we propose a weakly supervised deep learning framework, the 3DFeat-Net, that holistically learns a 3D feature detector and descriptor from GPS/INS tagged 3D point clouds. Specifically, our 3DFeat-Net is a Siamese architecture that learns to recognize whether two given 3D point clouds are taken from the same location. We leverage the recently proposed PointNet [23, 24]
to directly use the 3D point cloud as input to our network. The output of our 3DFeat-Net is a set of local descriptor vectors. The network is trained by minimizing a triplet loss, where the positive and “hardest” negative samples are chosen from the similarity measures between all pairs of descriptors from two input point clouds. Furthermore, we add an attention layer that learns importance weights which weigh the contribution of each input descriptor towards the triplet loss. During inference, we use the output of the attention layer to determine the saliency likelihood of an input 3D point, and we take the output descriptor vector from our network as the descriptor for finding good 3D-3D correspondences. Experimental results on real-world datasets [19, 9, 22] validate the feasibility of our 3DFeat-Net.
Our contributions in this paper can be summarized as follows:
Propose a weakly supervised network that holistically learns a 3D feature detector and descriptor using only GPS/INS tagged 3D point clouds.
Use an attention layer that allows our network to learn the saliency likelihood of an input 3D point.
Create training and benchmark datasets from the Oxford RobotCar dataset.
We have made our source code and dataset available online at https://github.com/yewzijian/3DFeatNet.
2 Related Work
Existing approaches to local 3D feature detection and description are heavily influenced by the widely studied 2D local feature methods [17, 2], and can be broadly categorized into handcrafted [35, 5, 26, 25, 13, 30, 28] and learned [34, 8, 15, 27, 6] approaches, i.e. pre- and post-deep learning approaches.
Handcrafted 3D Features Several handcrafted 3D features were proposed before the popularity of deep learning. The design of these features is largely based on the researchers’ domain knowledge of 3D point clouds. Earlier works detect salient keypoints that exhibit large variations in the principal directions, or unique curvatures. The similarity between keypoints can then be estimated using descriptors. PFH and FPFH consider pairwise combinations of surface normals to describe the curvature around a keypoint. Other 3D descriptors [13, 30] build histograms based on the number of points falling into each spatial bin around the keypoint. Comprehensive evaluations of the common handcrafted 3D detectors and descriptors exist in the literature. As we show in our evaluation, many of these handcrafted detectors and descriptors do not work well on real-world point clouds, which can be noisy and sparse.
Learned 2D Features Some recent works have applied deep learning to learn detectors and descriptors on images for 2D-2D correspondences. LIFT learns to distinguish between matching and non-matching pairs with a Siamese CNN, where the matching pairs are obtained from feature matches that survive the Structure from Motion (SfM) pipeline. TILDE learns to detect keypoints that can be reliably matched over different lighting conditions. These works rely on the matches provided by handcrafted 2D features, e.g. SIFT, but handcrafted 3D features are less robust and do not provide as good a starting point for learning better features. On the other hand, the recently proposed DELF uses a weakly supervised framework to learn salient local 2D image features through an attention mechanism in a landmark classification task. This motivates our use of an attention mechanism to identify good 3D local keypoints and descriptors for matching.
Learned 3D Features The increasing success and popularity of deep learning has recently inspired many learned 3D features. 3DMatch uses a 3D convolutional network to learn local descriptors for indoor RGB-D matching by training on matching and non-matching pairs. PPFNet operates on raw points, incorporating point pair features and global context to improve the descriptor representation. Other works such as CGF and LORAX utilize deep learning to reduce the dimension of their handcrafted descriptors. Despite the good performance of these works, none of them learns to detect keypoints; the descriptors are computed either on all points or on randomly sampled ones. On the other hand, another line of work learns to detect keypoints that give good matching performance with the handcrafted SHOT descriptor. This provides the intuition for our work: a good keypoint is one that gives good descriptor performance.
3 Problem Formulation
A point cloud is represented as a set of 3D points {x_i ∈ R^3}. Each point cloud is cropped to a ball with fixed radius R around its centroid c. We assume the absolute pose of the point cloud is available during training, e.g. from GPS/INS, but is not sufficiently accurate to infer point-to-point correspondences. We define the distance between two point clouds X and X' as the Euclidean distance between their centroids, i.e. d(X, X') = ||c − c'||_2.
We train our network using a set of triplets containing an anchor, positive and negative point cloud (X^anc, X^pos, X^neg), similar to typical instance retrieval approaches [1, 10]. We define positive instances as point clouds with distance to the anchor below a threshold, d < τ_p. Similarly, negative instances are point clouds far away from the anchor, i.e. d > τ_n. The thresholds τ_p and τ_n are chosen such that positive and negative point clouds have large and small overlaps with the anchor point cloud respectively.
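As a sketch, this triplet sampling can be implemented from centroid positions alone; `build_triplets`, `tau_p` and `tau_n` are illustrative names, and the concrete threshold values used below are not the paper's:

```python
import numpy as np

def build_triplets(centroids, tau_p, tau_n, rng=None):
    """Form (anchor, positive, negative) index triplets from point cloud
    centroids, using only the centroid distances described in the text."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(centroids)
    # Pairwise centroid distances d(X, X') = ||c - c'||_2.
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    triplets = []
    for a in range(n):
        pos = np.flatnonzero((d[a] < tau_p) & (np.arange(n) != a))
        neg = np.flatnonzero(d[a] > tau_n)
        if pos.size and neg.size:
            triplets.append((a, int(rng.choice(pos)), int(rng.choice(neg))))
    return triplets
```

Negatives can be sampled uniformly at random here because, as discussed in Sec. 4.1.4, the alignment loss itself mines the hardest negative descriptors.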
The objective of our network is to learn to find a set of correspondences between subsets of points in two point clouds X and X'. Our network learning is weakly supervised in two ways. Firstly, only model-level annotations in the form of relative poses of the point clouds are provided, and we do not explicitly specify which subset of points to choose as 3D features. Secondly, the ground truth poses are not accurate enough to infer point-to-point correspondences.
4 Our 3DFeat-Net
4.1 Network Architecture
Fig. 2 shows the three-branch Siamese architecture of our 3DFeat-Net. Each branch takes an entire point cloud as input. Point clusters are sampled from the point cloud in a clustering layer. For each cluster C_i, an orientation θ_i and attention score w_i are predicted by a detector network. A descriptor network then rotates the cluster to a canonical configuration using the predicted orientation and computes a descriptor f_i.
We train our network with the triplet loss to minimize the difference between the anchor and positive point clouds, while maximizing the difference between anchor and negative point clouds. To allow the loss to take individual cluster descriptors into account, we use an alignment model  to align each descriptor to its best match before aggregating the loss. Since not all sampled clusters have the same distinctiveness, the predicted attention from the detector is used to weigh the contribution of each cluster descriptor in the training loss. These attention weights are learned on arbitrarily sampled clusters during training, and later used to detect distinctive keypoints in the point cloud during inference.
The first stage of the network samples clusters from the point cloud. To this end, we use the sampling and grouping layers of PointNet++. The sampling layer samples a set of points from an input point cloud. The coordinates of these sampled points and the point cloud are then passed into the grouping layer, which outputs clusters of points. Each cluster is a collection of points in a local region of a predefined radius around a sampled point. These clusters are used as support regions to compute local descriptors, analogous to the 2D image patches around detected keypoints in 2D feature matching frameworks. We use the iterative farthest point sampling scheme of PointNet++, but any form of sampling that sufficiently covers the point cloud (e.g. Random Sphere Cover Set) is also suitable. Such sampling schemes increase the likelihood that each sampled cluster in an anchor point cloud has a nearby cluster in the positive point cloud.
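A minimal NumPy sketch of the sampling and grouping steps (function names are ours; the actual network uses the PointNet++ layers):

```python
import numpy as np

def farthest_point_sample(points, k):
    """Iterative farthest point sampling as in PointNet++: greedily pick the
    point farthest from everything selected so far, giving good coverage."""
    n = points.shape[0]
    idx = np.zeros(k, dtype=int)  # start from point 0
    dist = np.full(n, np.inf)
    for i in range(1, k):
        # Distance of every point to the nearest already-selected point.
        dist = np.minimum(dist, np.linalg.norm(points - points[idx[i - 1]], axis=1))
        idx[i] = int(dist.argmax())
    return idx

def group_clusters(points, sample_idx, radius):
    """Grouping layer: each cluster is all points within `radius` of a sample."""
    return [points[np.linalg.norm(points - points[i], axis=1) < radius]
            for i in sample_idx]
```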
Each cluster sampled in the clustering step is passed to the detector network, which predicts an orientation θ_i and an attention score w_i. The attention score is a positive number that indicates the saliency of the input cluster. This design is inspired by typical 2D feature detectors, e.g. SIFT. The predicted orientation is used to rotate the cluster into a canonical orientation, so that the final descriptor is invariant to the cluster’s original orientation.
We construct our detector network using the point symmetric functions defined in PointNet, which take the form f({x_1, ..., x_n}) = g(h(x_1), ..., h(x_n)), where h is a shared function that transforms each individual point x_i, and g is a symmetric function over all transformed elements. These functions are invariant to point ordering and generate fixed-length features for arbitrarily sized point clouds. We implement h with three fully connected layers (64-128-256 nodes). The symmetric function g is implemented as a max-pooling followed by two fully connected layers (128-64 nodes), before branching into two single-layer paths for the orientation and attention predictions.
We predict only a single 1D rotation angle θ_i, avoiding unnecessary equivariances to retain higher discriminating power. This is reasonable since point clouds are usually aligned to the gravity direction due to the sensor setup (e.g. a Velodyne Lidar mounted upright on a car); in other cases, the gravity vector obtained from an IMU can be used to rotate the point clouds into an upright orientation. Similar to previous work, we do not regress the angle directly. Instead, we regress two separate values s_i and c_i that denote the sine and cosine of the angle, and use a normalization layer to enforce the constraint s_i^2 + c_i^2 = 1 so that they are valid sine and cosine values. The final angle is computed as θ_i = arctan2(s_i, c_i). For the attention weights w_i, we use the softplus activation, as suggested in prior work, to prevent the network from learning negative attention weights.
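The normalization and angle recovery can be sketched as follows (a NumPy stand-in for the network's normalization layer; `predict_angle` is an illustrative name):

```python
import numpy as np

def predict_angle(raw_sin, raw_cos):
    """Project the two regressed values onto the unit circle so they form a
    valid (sin, cos) pair, then recover the 1D rotation angle with arctan2."""
    norm = np.hypot(raw_sin, raw_cos)  # enforces s^2 + c^2 = 1 after division
    return np.arctan2(raw_sin / norm, raw_cos / norm)
```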
Our descriptor network takes each cluster from the clustering layer and the orientation θ_i from the detector network as inputs, and generates a descriptor f_i for each cluster. More specifically, θ_i is first used to rotate the cluster into a canonical configuration, before it is passed into another point symmetric function to generate the descriptor. In practice, we find it helpful to aggregate contextual information in the computation of the descriptor. Hence, after applying max-pooling to obtain a cluster feature vector, we concatenate this cluster feature vector with the individual point features to incorporate context. We then apply a single fully connected layer before another max-pooling. Finally, we apply another fully connected layer and normalization to produce the final descriptor f_i for cluster C_i. The addition of contextual information improves the discriminating power of the descriptor.
4.1.4 Feature Alignment Triplet Loss
The output of the descriptor network in the previous stage is one feature descriptor per cluster. Since the supervision is given as model-level annotations, we use the alignment objective introduced in previous work to compare individual descriptors. Instead of the dot product similarity measure used there, we adopt the Euclidean distance measure, which is more commonly used for comparing feature descriptors. Specifically, the distance between the descriptors of two point clouds X and X' with n and n' clusters is given by:

D(X, X') = Σ_{i=1}^{n} ŵ_i · min_{j ∈ {1,...,n'}} ||f_i − f'_j||_2,

where ŵ_i = w_i / Σ_k w_k is the normalized attention weight. Under this formulation, every descriptor from the first point cloud aligns to its closest descriptor in the second one. Intuitively, in a matching point cloud pair, clusters in the first point cloud should have a similar cluster in the second point cloud. For non-matching pairs, the above distance simply aligns a descriptor to the one most similar to itself, i.e. its hardest negative. This consideration of the hardest negative descriptor in the non-matching point cloud has the advantage that no explicit mining for hard negatives is required; our model trains well with randomly sampled negative point clouds. We formulate the triplet loss for each training tuple as:

L = [D(X^anc, X^pos) − D(X^anc, X^neg) + γ]_+,

where [·]_+ denotes the hinge loss, and γ denotes the margin enforced between the positive and negative pairs.
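A NumPy sketch of the alignment distance and triplet loss, for illustration only (the actual loss is computed on network outputs inside the training graph, and the margin value in the test below is arbitrary):

```python
import numpy as np

def alignment_distance(desc_a, desc_b, attn_a):
    """Attention-weighted alignment distance: every descriptor in the first
    point cloud aligns to its closest descriptor in the second one."""
    # Pairwise Euclidean distances between all descriptors of the two clouds.
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    w = attn_a / attn_a.sum()  # normalized attention weights
    return float((w * d.min(axis=1)).sum())

def triplet_loss(anc, pos, neg, attn, margin):
    """Hinge triplet loss on the alignment distances; for non-matching pairs
    the min over descriptors acts as implicit hard-negative mining."""
    return max(0.0, alignment_distance(anc, pos, attn)
                    - alignment_distance(anc, neg, attn) + margin)
```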
4.2 Inference Pipeline
The keypoints and descriptors are computed in two separate stages during inference. In the first stage, the attention scores of all points in a given point cloud are computed. We apply non-maximal suppression over a fixed radius around each point and keep the remaining points with the highest attention scores; points with attention below a threshold are also rejected. The remaining points are our detected keypoints. In the second stage, the descriptor network computes descriptors only for these keypoints. This separation of inference into detector and descriptor stages is computationally and memory efficient, since only the detector network processes all the points, while the descriptor network processes only the clusters corresponding to the selected keypoints. As a result, our network can scale to larger point clouds.
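The keypoint selection step can be sketched as a radius non-maximal suppression over attention scores (parameter names and the values in the test are illustrative, not the paper's settings):

```python
import numpy as np

def detect_keypoints(points, attention, nms_radius, min_attention, max_kp=1024):
    """Greedy radius NMS on attention scores: visit points in decreasing
    attention, keep a point if it is not yet suppressed and sufficiently
    salient, then suppress all points within `nms_radius` of it."""
    order = np.argsort(-attention)
    keep = []
    suppressed = np.zeros(len(points), dtype=bool)
    for i in order:
        if suppressed[i] or attention[i] < min_attention:
            continue
        keep.append(int(i))
        if len(keep) == max_kp:
            break
        suppressed |= np.linalg.norm(points - points[i], axis=1) < nms_radius
    return np.array(keep, dtype=int)
```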
5 Evaluations and Results
5.1 Benchmark Datasets
5.1.1 Oxford RobotCar
We use the open-source Oxford RobotCar dataset for training and evaluation. This dataset consists of a large number of traversals over the same route in central Oxford, UK, at different times over a year. The push-broom 2D scans are accumulated into 3D point clouds using the associated GPS/INS poses. For each traversal, we create a 3D point cloud with 30m radius at every 10m interval whenever good GPS/INS poses are available. Each point cloud is then downsampled using a VoxelGrid filter with a grid size of 0.2m. We split the data into two disjoint sets: the first 35 traversals are used for training and the last 5 traversals are used for testing. We obtain a total of 21,875 training point clouds and an initial 828 testing point clouds.
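A minimal stand-in for the VoxelGrid downsampling step (the experiments use PCL's filter; this NumPy version keeps one centroid per occupied voxel):

```python
import numpy as np

def voxel_grid_filter(points, grid_size=0.2):
    """Bin points into cubic voxels of side `grid_size` and replace the
    points in each occupied voxel by their centroid, in the spirit of PCL's
    VoxelGrid filter."""
    keys = np.floor(points / grid_size).astype(np.int64)
    # Map every point to the index of its (unique) voxel.
    inv = np.unique(keys, axis=0, return_inverse=True)[1].ravel()
    counts = np.bincount(inv).astype(float)
    out = np.zeros((counts.size, 3))
    for dim in range(3):
        out[:, dim] = np.bincount(inv, weights=points[:, dim]) / counts
    return out
```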
We make use of the pairwise relative poses computed from the GPS/INS poses as ground truth to evaluate the performance on the test set. However, the GPS/INS poses may contain errors in the order of several meters. To improve the fidelity of our test set, we refine these poses using ICP. We set one of the test traversals as reference, and register all test point clouds within 10m to their respective point clouds in the reference. We retain matches with an estimated Root Mean Square Error (RMSE) below a threshold, and perform manual visual inspection to filter out bad matches. This leaves 794 of the 828 test point clouds, giving 3426 pairwise relative poses for testing. Lastly, we randomly rotate each point cloud around the vertical axis to evaluate robustness to rotation, and randomly downsample each test point cloud to 16,384 points.
|                           | # Traversals | # Point clouds | # Matched pairs |
| Test (after registration) |      5       |      794       |      3426       |
5.1.2 KITTI Dataset
We evaluate the performance of our trained network on the 11 training sequences of the KITTI Odometry dataset to understand how well our trained detector and descriptor generalize to point clouds captured in a different city with a different sensor. The KITTI dataset contains point clouds captured with a Velodyne-64 3D Lidar scanner in Karlsruhe, Germany. We sample the Lidar scans at 10m intervals to obtain 2369 point clouds, and downsample them using a VoxelGrid filter with a grid size of 0.2m. We consider the 2831 point cloud pairs that are captured within 10m range of each other, and use the provided GPS/INS poses as ground truth.
5.1.3 ETH Dataset
Our network is also evaluated on the “challenging dataset for point cloud registration algorithms”. This dataset is captured with a ground Lidar scanner and, unlike the previous two datasets, contains largely unstructured vegetation. Following previous work, we accumulate point clouds captured in one season of the Gazebo and Wood scenes to build a global point cloud, and register local point clouds captured in the other season to it. We take the liberty of building the global point cloud for both scenes from the summer data, since the season used was not stated. During pre-processing, we downsample the point clouds using a VoxelGrid filter with a 0.1m grid size; we choose this finer resolution because of the finer features in the vegetation of this dataset.
5.2 Experimental Setup
We train with a batch size of 6 triplets, using the ADAM optimizer with a constant learning rate of 1e-5. We use points within a fixed radius from the centroid of the point cloud for training, and sample clusters of a fixed radius. Fixed distance thresholds define the positive and negative point clouds. We randomly downsample each point cloud to 4096 points on the fly during training; we found that training with this random input dropout leads to better generalization, as also observed in prior work. We apply the following data augmentations during training: random jitter on the individual points, random shifts, and small random rotations. Random rotations are also applied around the z-axis, i.e. the upright axis, in order to learn rotation equivariance. Note that our training data is already oriented in the upright direction using its GPS/INS pose.
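The training-time augmentations can be sketched as follows; the magnitudes are illustrative assumptions rather than the paper's values, and the small random rotation about an in-plane axis is omitted for brevity:

```python
import numpy as np

def augment(points, rng, jitter_std=0.01, max_shift=0.2):
    """Apply per-point jitter, a random global shift, and a random rotation
    about the z-axis (the upright axis used to learn rotation equivariance)."""
    pts = points + rng.normal(0.0, jitter_std, points.shape)  # point jitter
    pts = pts + rng.uniform(-max_shift, max_shift, (1, 3))    # random shift
    theta = rng.uniform(-np.pi, np.pi)                        # z-axis rotation
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return pts @ rot_z.T
```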
Our network is end-to-end differentiable, but we find it sometimes hard to train in practice. Hence, we train in two phases to improve stability. We first pretrain the network without the detector for 2 epochs, i.e. the clusters are fed directly into the descriptor network without rotation, and all clusters are given equal attention weights. During this phase, we apply all data augmentations except for large 1D rotations. We use these learned weights to initialize the descriptor in the second phase, where we train the entire network with all of the above data augmentations. Training took 34 hours on an Nvidia GeForce GTX 1080 Ti.
For inference, we use the same non-maximal suppression radius and attention threshold for both the Oxford and KITTI odometry datasets. For the ETH dataset, we adjust these parameters to boost the number of detected keypoints in its semi-structured environments. We limit the number of keypoints to 1024 in all cases, except for the global model in the ETH dataset, for which we use 2048 due to its larger size. Inference took around 0.8s for a point cloud with 16,384 points.
5.3 Baseline Algorithms
We compare our algorithm with three commonly used handcrafted 3D feature descriptors: Fast Point Feature Histograms (FPFH) (33 dimensions), SpinImage (SI) (153 dimensions), and Unique Shape Context (USC) (1980 dimensions). We use the implementations provided in the Point Cloud Library (PCL) for all handcrafted descriptors. In addition, we include two recent learned 3D descriptors in our comparisons: Compact Geometric Features (CGF) (32 dimensions) and 3DMatch (512 dimensions). Note that these comparisons use their provided weights, which were pretrained on indoor datasets; we are unable to train CGF and 3DMatch on the weakly supervised Oxford dataset, as these networks require strong supervision. We also train a modified PointNet++ (PN++) in a weakly supervised manner on a retrieval task, which we use as a baseline to show the importance of descriptor alignment in learning local descriptors. We modify PointNet++ as follows: the first set abstraction layer is replaced by our detection and description network, the second and third set abstraction layers remain unchanged, and the subsequent fully connected layers are replaced with fc1024-dropout-fc512 layers. During inference, we extract descriptors as the output of the first set abstraction layer. We tuned the parameters of all descriptors to the best of our ability, except in Section 5.4, where we use the same cluster radius for all descriptors to ensure they all “see” the same information.
5.4 Descriptor Matching
We first evaluate the ability of the descriptors to distinguish between different clusters, following the procedure in previous work. Matching clusters are extracted at randomly selected locations from matching model pairs; non-matching clusters are extracted from two random point clouds at locations that are at least 20m apart. We extract 30,000 pairs of 3D clusters, equally split between matches and non-matches. As in previous work, our evaluation metric is the false-positive rate at 95% recall. To ensure all descriptors have access to similar amounts of information, we fix the cluster radius to 2.0m for all descriptors in this experiment.
Fig. 3 shows the performance of our descriptor at different dimensionalities. We observe diminishing returns above 32 dimensions, and use a feature dimension of 32 for the rest of the experiments.
Table 2 compares our descriptor with the baseline algorithms. Our learned descriptor yields a lower error rate than all other descriptors despite having a similar or smaller dimension. It performs significantly better than the best handcrafted descriptor (FPFH), which uses explicitly computed normals. The other two handcrafted descriptors (USC and SI), as well as the learned CGF, count the number of points falling in each subregion around the keypoint and could not differentiate the sparse point cloud clusters. 3DMatch performs well in differentiating randomly sampled clusters, but requires a larger feature dimension. Lastly, the modified PointNet++ network did not learn a good descriptor in the weakly supervised setting and gives significantly higher error than our descriptor, despite having the same descriptor network structure.
5.5 Keypoint Detection and Feature Description
We follow the procedure in previous work to evaluate the joint performance of keypoints and descriptors. For each keypoint descriptor in the first model of a pair, we find its nearest neighbor in the second model via exhaustive search, and compute the distance between this nearest neighbor and its ground truth location. We obtain a precision plot (Fig. 4) by varying the distance threshold for considering a match correct and plotting the proportion of correct matches. We evaluate all the baseline descriptors, and also show the performance of our descriptor with the ISS detector, random sampling of points (RS), and points obtained using the Random Sphere Cover Set (RSCS). We tuned the cluster sizes for all baseline descriptors, as many of them require larger cluster sizes to work well. Nevertheless, our keypoint detector and descriptor combination still yields the best precision at all distance thresholds, obtaining a precision of 15.6% at 1m. We also note that our descriptor underperforms when used with the two random sampling methods or the generic ISS detector, indicating the importance of a dedicated feature detector.
Analysis of Keypoint Detection Fig. 5 shows the attention weights computed by our network, as well as the retrieved keypoints. We also show the keypoints obtained using ISS for comparison. Our network learns to give higher attention to the lower regions of walls (near the ground), and mostly ignores the ground and cars (which are transient and not useful for matching). In comparison, ISS detects many non-distinctive keypoints on the ground and on cars.
5.6 Geometric Registration
We test our keypoint detection and description algorithm on the geometric registration problem. We perform nearest neighbor matching on the computed keypoints and descriptors, and use RANSAC on these nearest neighbor matches to estimate a rigid transformation between the two point clouds. The number of RANSAC iterations is automatically adjusted based on 99% confidence, but is limited to 10,000. No subsequent refinement, e.g. with ICP, is performed. We evaluate the estimated pose against the ground truth using the Relative Translational Error (RTE) and Relative Rotation Error (RRE), as in [8, 18]. We consider a registration successful when the RTE and RRE are both below predefined thresholds of 2m and 5°, and report the average RTE and RRE values over the successful cases, as well as the average number of RANSAC iterations.
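The rigid transformation fitted inside each RANSAC iteration can be computed in closed form with the Kabsch algorithm. This is a standard-practice sketch, not code from the paper:

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rigid transform (Kabsch) mapping src onto dst, i.e. the
    model fitted to a minimal sample of putative keypoint matches."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centered correspondences.
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cd - R @ cs
    return R, t
```

Inliers are then counted as matches whose transformed source keypoint lands within a distance threshold of its target keypoint.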
| Method           | RTE (m) | RRE (°) | Success Rate | Avg # iter |
| ISS + FPFH       |    –    |    –    |    92.32%    |    7171    |
| ISS + SI         |    –    |    –    |    87.45%    |    9888    |
| ISS + USC        |    –    |    –    |    94.02%    |    7084    |
| ISS + CGF        |    –    |    –    |    87.36%    |    9628    |
| RS + 3DMatch     |    –    |    –    |    54.64%    |    9848    |
| ISS + 3DMatch    |    –    |    –    |    69.06%    |    9131    |
| ISS + PN++       |    –    |    –    |    48.86%    |    9904    |
| RS + Our Desc    |    –    |    –    |    90.28%    |    9941    |
| RSCS + Our Desc  |    –    |    –    |    92.64%    |    9913    |
| ISS + Our Desc   |    –    |    –    |    97.66%    |    7127    |
| Our Kpt + Desc   |    –    |    –    |    98.10%    |    2940    |
5.6.1 Performance on Oxford RobotCar
Table 3 shows the performance on the Oxford dataset. We observe the following: (1) using a keypoint detector instead of random sampling improves geometric registration performance, even for 3DMatch, which is designed for random keypoints; (2) our learned descriptor gives good accuracy even when used with handcrafted detectors or random sampling, suggesting that it generalizes well to generic point clusters; (3) our detector and descriptor combination gives the highest success rate and lowest errors. This highlights the importance of designing the keypoint detector and descriptor simultaneously, and the applicability of our approach to geometric registration. Some qualitative registration results can be found in Fig. 6(a).
5.6.2 Performance on KITTI Dataset
We evaluate the generalization performance of our network on the KITTI odometry dataset by comparing its geometric registration performance against the baselines in Table 4. We use the same parameters as for the Oxford dataset for all algorithms, and did not fine-tune our network in any way. Nevertheless, our 3DFeat-Net outperforms the other algorithms in most measures. It slightly underperforms CGF in terms of RTE, but has a significantly higher success rate and requires far fewer RANSAC iterations. We show some matching results on the KITTI dataset in Fig. 6(b).
| Method          | RTE (m) | RRE (°) | Success Rate | Avg # iter |
| ISS + FPFH      |    –    |    –    |    58.95%    |    7462    |
| ISS + SI        |    –    |    –    |    55.92%    |    9219    |
| ISS + USC       |    –    |    –    |    78.24%    |    7873    |
| ISS + CGF       |    –    |    –    |    87.81%    |    7442    |
| RS + 3DMatch    |    –    |    –    |    83.96%    |    8674    |
| ISS + 3DMatch   |    –    |    –    |    89.12%    |    7292    |
| Our Kpt + Desc  |    –    |    –    |    95.97%    |    3798    |
5.6.3 Performance on ETH Dataset
We compare our performance against LORAX, which evaluates on 9 models from the Gazebo dataset and 3 models from the Wood dataset. Since the exact point clouds used were not explicitly stated, we show in Table 5 the performance on the best performing point clouds for each algorithm. Note that success in this experiment refers to an RTE below 1m, for consistency with LORAX. We also report the success rate over the entire dataset; detailed results can be found in the supplementary material. LORAX considers the 3 best descriptor matches for each keypoint; these matches are used to compute multiple pose hypotheses, which are then refined using ICP for robustness. For our algorithm and the baselines, we consider only the best match, compute a single pose hypothesis, and do not perform any refinement of the pose. Despite this, our approach outperforms LORAX and most baseline algorithms. It underperforms only USC, which uses a much larger descriptor (1980 dimensions). Fig. 6(c) shows an example of a successful match by our approach.
| Method          | RTE (m) | RRE (°) | Success Rate | Success (All) |
| ISS + SI        |    –    |    –    |     100%     |     93.7%     |
| ISS + USC       |    –    |    –    |     100%     |     100%      |
| ISS + CGF       |    –    |    –    |     100%     |     92.1%     |
| RS + 3DMatch    |    –    |    –    |     91.7%    |     33.3%     |
| ISS + 3DMatch   |    –    |    –    |     100%     |     33.3%     |
| Our Kpt + Desc  |    –    |    –    |     100%     |     95.2%     |
6 Conclusion
We proposed the 3DFeat-Net model that learns the detection and description of keypoints in Lidar point clouds in a weakly supervised fashion, using a triplet loss that takes into account individual descriptor similarities and the saliency of input 3D points. Experimental results show that our learned detector and descriptor compare favorably against previous handcrafted and learned ones on several outdoor gravity-aligned datasets. However, our network is unable to train well on overly noisy point clouds, and the use of PointNet limits the maximum size of the input point cloud. Moreover, our descriptor is not fully rotation invariant. We leave these issues as future work.
Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: Netvlad: Cnn architecture for weakly supervised place recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5297–5307 (2016).https://doi.org/10.1109/CVPR.2016.572
-  Bay, H., Tuytelaars, T., Van Gool, L.: Surf: Speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) European Conference on Computer Vision (ECCV). pp. 404–417. Springer Berlin Heidelberg, Berlin, Heidelberg (2006). https://doi.org/10.1007/1174402332
-  Besl, P.J., McKay, N.D.: A method for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 14(2), 239–256 (1992). https://doi.org/10.1109/34.121791
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a “siamese” time delay neural network. In: Advances in Neural Information Processing Systems. pp. 737–744 (1994)
-  Chen, H., Bhanu, B.: 3d free-form object recognition in range images using local surface patches. In: International Conference on Pattern Recognition (ICPR). vol. 3, pp. 136–139 (2004). https://doi.org/10.1109/ICPR.2004.1334487
-  Deng, H., Birdal, T., Ilic, S.: Ppfnet: Global context aware local features for robust 3d point matching. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
-  Dubé, R., Dugas, D., Stumm, E., Nieto, J., Siegwart, R., Cadena, C.: Segmatch: Segment based loop-closure for 3d point clouds. arXiv preprint arXiv:1609.07720 (2016)
-  Elbaz, G., Avraham, T., Fischer, A.: 3d point cloud registration for localization using a deep neural network auto-encoder. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2472–2481 (2017). https://doi.org/10.1109/CVPR.2017.265
-  Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3354–3361 (2012). https://doi.org/10.1109/CVPR.2012.6248074
-  Gordo, A., Almazán, J., Revaud, J., Larlus, D.: Deep image retrieval: Learning global representations for image search. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) European Conference on Computer Vision (ECCV). pp. 241–257. Springer International Publishing (2016). https://doi.org/10.1007/978-3-319-46466-4_15
-  Hänsch, R., Weber, T., Hellwich, O.: Comparison of 3d interest point detectors and descriptors for point cloud fusion. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2(3), 57 (2014)
-  Holz, D., Ichim, A.E., Tombari, F., Rusu, R.B., Behnke, S.: Registration with the point cloud library: A modular framework for aligning in 3-d. IEEE Robotics Automation Magazine 22(4), 110–124 (Dec 2015). https://doi.org/10.1109/MRA.2015.2432331
-  Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3d scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 21(5), 433–449 (1999). https://doi.org/10.1109/34.765655
-  Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3128–3137 (2015). https://doi.org/10.1109/CVPR.2015.7298932
-  Khoury, M., Zhou, Q.Y., Koltun, V.: Learning compact geometric features. In: International Conference on Computer Vision (ICCV). pp. 153–161 (2017). https://doi.org/10.1109/ICCV.2017.26
-  Lee, G.H., Li, B., Pollefeys, M., Fraundorfer, F.: Minimal solutions for the multi-camera pose estimation problem. International Journal of Robotics Research (IJRR) pp. 837–848 (2015). https://doi.org/10.1177/0278364914557969
-  Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International journal of computer vision 60(2), 91–110 (2004). https://doi.org/10.1023/B:VISI.0000029664.99615.94
-  Ma, Y., Guo, Y., Zhao, J., Lu, M., Zhang, J., Wan, J.: Fast and accurate registration of structured point clouds with small overlaps. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 643–651 (2016). https://doi.org/10.1109/CVPRW.2016.86
-  Maddern, W., Pascoe, G., Linegar, C., Newman, P.: 1 year, 1000km: The oxford robotcar dataset. The International Journal of Robotics Research (IJRR) 36(1), 3–15 (2017). https://doi.org/10.1177/0278364916679498
-  Noh, H., Araujo, A., Sim, J., Weyand, T., Han, B.: Large-scale image retrieval with attentive deep local features. In: IEEE International Conference on Computer Vision (ICCV). pp. 3476–3485 (2017). https://doi.org/10.1109/ICCV.2017.374
-  Pavel, F.A., Wang, Z., Feng, D.D.: Reliable object recognition using sift features. In: IEEE International Workshop on Multimedia Signal Processing (MMSP). pp. 1–6 (2009). https://doi.org/10.1109/MMSP.2009.5293282
-  Pomerleau, F., Liu, M., Colas, F., Siegwart, R.: Challenging data sets for point cloud registration algorithms. The International Journal of Robotics Research (IJRR) 31(14), 1705–1711 (Dec 2012)
-  Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 77–85 (2017). https://doi.org/10.1109/CVPR.2017.16
-  Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems. pp. 5105–5114 (2017)
-  Rusu, R.B., Blodow, N., Beetz, M.: Fast point feature histograms (fpfh) for 3d registration. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 3212–3217 (2009). https://doi.org/10.1109/ROBOT.2009.5152473
-  Rusu, R.B., Blodow, N., Marton, Z.C., Beetz, M.: Aligning point cloud views using persistent feature histograms. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 3384–3391 (2008). https://doi.org/10.1109/IROS.2008.4650967
-  Salti, S., Tombari, F., Spezialetti, R., Stefano, L.D.: Learning a descriptor-specific 3d keypoint detector. In: IEEE International Conference on Computer Vision (ICCV). pp. 2318–2326 (2015). https://doi.org/10.1109/ICCV.2015.267
-  Salti, S., Tombari, F., Di Stefano, L.: Shot: Unique signatures of histograms for surface and texture description. Computer Vision and Image Understanding 125, 251–264 (2014)
-  Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 815–823 (2015). https://doi.org/10.1109/CVPR.2015.7298682
-  Tombari, F., Salti, S., Di Stefano, L.: Unique shape context for 3d data description. In: ACM Workshop on 3D Object Retrieval. pp. 57–62. 3DOR ’10, ACM (2010). https://doi.org/10.1145/1877808.1877821
-  Verdie, Y., Yi, K.M., Fua, P., Lepetit, V.: TILDE: A Temporally Invariant Learned DEtector. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5279–5288 (2015). https://doi.org/10.1109/CVPR.2015.7299165
-  Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: Lift: Learned invariant feature transform. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) European Conference on Computer Vision (ECCV). pp. 467–483. Springer International Publishing (2016). https://doi.org/10.1007/978-3-319-46466-4_28
-  Yi, K.M., Verdie, Y., Fua, P., Lepetit, V.: Learning to assign orientations to feature points. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 107–116 (2016). https://doi.org/10.1109/CVPR.2016.19
-  Zeng, A., Song, S., Nießner, M., Fisher, M., Xiao, J., Funkhouser, T.: 3dmatch: Learning local geometric descriptors from rgb-d reconstructions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 199–208 (2017). https://doi.org/10.1109/CVPR.2017.29
-  Zhong, Y.: Intrinsic shape signatures: A shape descriptor for 3d object recognition. In: IEEE International Conference on Computer Vision Workshops, (ICCVW). pp. 689–696 (2009). https://doi.org/10.1109/ICCVW.2009.5457637