Establishing reliable point-to-point correspondences between 3D rigid shapes represented by point clouds, meshes, or depth maps is a fundamental problem in 3D computer vision, computer graphics, photogrammetry, and robotics. To judge whether two points correspond, a typical solution is to first represent the geometric information in the neighborhood of each point by a feature vector, known as the local geometric descriptor, and then measure the similarity between the two feature descriptors. As such, distinctive and robust feature representations are desired to achieve reliable shape correspondences.
Following the trend in the 2D vision area lowe2004distinctive ; Yu2015Multi , a number of 3D local geometric descriptors, either hand-crafted johnson1999using ; rusu2009fast ; yang2017RCS_jrnl ; tombari2010unique ; guo2013rotational ; yang2016fast ; yang2017toldi or learned zeng20173dmatch ; khoury2017learning ; elbaz20173d ; deng2018ppfnet ; yew20183dfeat ; Deng2018PPFold , have been proposed for the purpose of fully encoding the geometric information of a rigid local shape (this paper concerns rigid data only; some literature on feature representations for non-rigid data is introduced in Sect. 2). A variety of low-level geometric features are employed to represent the local 3D structure, e.g., normal deviation rusu2009fast , signed distance malassiotis2007snapshots , contour yang2017RCS_jrnl , and density guo2013rotational . To achieve information complementarity, many descriptors aggregate multiple low-level features from different subspaces tombari2010unique ; yang2016fast or viewpoints guo2013rotational ; yang2017RCS_jrnl , which has been demonstrated to be a simple yet effective approach guo2013rotational ; yang2017RCS_jrnl . These descriptors mainly resort to concatenation for feature fusion. However, concatenation has two key drawbacks. One is information redundancy and the resultant high-dimensional feature vectors, which are inefficient to store and match. The other is that all bins in the considered low-level features are assigned identical weights, which appears improper because each bin may contribute differently (or even have a negative impact) in a specific scenario yang2016accumulated . There are also some works that perform Principal Component Analysis (PCA) on the concatenated feature, e.g., johnson1999using ; guo2015novel .
Although more compact feature variants can be achieved with PCA, PCA still fails to discover a reasonable feature embedding to integrate the beneficial information provided by each feature bin while suppressing the impact of noisy bins simultaneously khoury2017learning , possibly owing to the lack of matching supervision.
In addition to low-level feature fusion, some studies in the image matching realm suggest using multiple feature descriptors to perform feature matching hsu2015robust ; hu2016progressive because the optimal descriptor for feature matching may vary from image to image, or even pixel to pixel. This idea was later applied to 3D feature matching in buch2016local , where a min pooling method is proposed to fuse high-level geometric features. The above fusion methods for high-level features tend to select the optimal feature for a particular pair of image/surface patches. Accordingly, the sub-optimal features are discarded when measuring the similarity of two patches. However, we will show that there still exists valuable information in the other features that contributes to more accurate patch similarity calculation.
Although learning for local geometric feature fusion remains unexplored in the field of 3D rigid data matching, there are already a number of deep learning-based local geometric descriptors zeng20173dmatch ; khoury2017learning ; deng2018ppfnet , and the majority of them directly learn features from raw data. For the task of scene registration, these learned descriptors have already outperformed traditional descriptors by a clear margin. Unfortunately, most existing learned descriptors are intrinsically sensitive to rotation Charles2017PointNet , which greatly limits their deployment in real-world applications. By contrast, the fusion result obtained by learning from a set of rotationally invariant traditional feature representations, as will be done in this work, inherits the merit of being invariant to rotation khoury2017learning . Also, we will show that an ultra-lightweight network is sufficient to learn a reasonable feature fusion and that the resultant descriptor is competitive with existing deep learned descriptors.
Motivated by these considerations, we present a deep learning-based approach to fuse local geometric features for 3D rigid data matching. More specifically, a compact and distinctive new feature is generated by feeding multiple low/high-level features to a Neural Network (NN) under the triplet framework. (i) Compared with previous linear fusion methods, we show that an NN has the ability to discover and fully leverage the advantageous information provided by each feature with a non-linear feature embedding. A comparison of our NN-fused feature and PCA-fused feature can be found in Fig. 1. (ii) Compared with existing deep learned descriptors, our method is fully rotationally invariant while requiring a dramatically more lightweight network for training. We show that our method greatly promotes the distinctiveness of traditional local geometric feature descriptors via learning for fusion, achieving competitive performance with several deep learned descriptors on a scene registration benchmark. To train the network, we propose a new loss that improves the original triplet loss in two aspects. First, we consider all negative pairs in the triplet when computing the negative distance, to more reliably separate matches and non-matches. Second, a pairwise term that pulls the positive samples together is attached to achieve a tighter cluster formed by matches. Experiments have been carried out on four standard datasets with different application contexts and data modalities. Comparison with prior geometric fusion methods and several deep learned descriptors confirms the overall superiority of the presented approach. Our main contributions are summarized as follows:
We present a deep learning-based approach to non-linearly fusing multiple local geometric features and reveal that neural networks have the ability to effectively discover and integrate the complementary information within a geometric feature set.
An improved triplet loss that explores all pairwise relationships within the triplet is proposed. It compares favorably to the classical triplet loss and contrastive loss in the local geometric feature fusion context.
Compared with deep learned local geometric descriptors, the descriptor produced by the proposed fusion approach achieves comparable performance while being more lightweight to train and truly rotationally invariant. Fusing traditional local geometric features with deep learning methods therefore paves a new path for research on local geometric feature description.
The remainder of this paper is organized as follows. Sect. 2 gives a review of relevant literature for local geometric feature fusion. Sect. 3 details the proposed fusion method using a triplet network architecture for both low-level and high-level local geometric features. Sect. 4 presents the experiments conducted to validate the effectiveness of our method together with necessary explanations and discussions. Sect. 5 draws the conclusions and discusses potential future research directions.
2 Related work
Low-level geometric feature fusion. Fusion of several low-level geometric features from multiple viewpoints or subspaces is prevalent in modern local geometric descriptors. 3D-to-2D projection provides a straightforward way to describe the local geometry by virtue of 2D cues. Representative examples include Point Fingerprint sun2003point and Snapshots malassiotis2007snapshots . To gain more discriminative information, a rotation and projection mechanism is proposed in Rotational Projection Statistics (RoPS) guo2013rotational and Rotational Contour Signatures (RCS) yang2017RCS_jrnl , which concatenate features extracted from multiple viewpoints. There also exist methods that perform feature description in 3D space, such as 3D Shape Context frome2004recognizing and Signatures of Histograms of OrienTations (SHOT) tombari2010unique , usually with subspace partition to achieve spatial information characterization. In these descriptors, features from different spatial subspaces are concatenated as the final descriptor. With the observation that different geometric attributes could bring complementary information to each other, Local Feature Statistics Histograms (LFSH) yang2016fast concatenates histograms of normal deviation, signed distance, and point density. Besides concatenation, Triple Spin Image (TriSI) guo2015novel applies PCA to the concatenated feature vector consisting of three spin image signatures to achieve a more compact variant. In the existing literature, descriptors based on low-level geometric feature fusion thus employ either concatenation or PCA.
High-level geometric feature fusion. (i) Non-rigid shape retrieval. A typical application of the fusion of multiple high-level geometric features can be found in 3D non-rigid object retrieval studies. For non-rigid data, one of the most challenging factors is deformation Tam2013Registration . The fusion of multiple or multi-scale high-level features such as Heat Kernel Signatures (HKS) sun2009concise and Wave Kernel Signatures (WKS) aubry2011wave is a popular way of improving the performance of shape retrieval. Yang and Chen yang2007ofs presented an optimized feature selection strategy for combining a set of high-level features for the task of object retrieval. Papadakis et al. papadakis20083d concatenated multiple 3D features based on spherical harmonics as well as 2D features based on depth buffers as the feature vector, which is further compressed via scalar quantization. Tabia et al. tabia2014covariance proposed using the covariance matrices of the descriptors to efficiently fuse shape descriptors for 3D face matching and retrieval. Xie et al. xie2015deepshape proposed a deep shape descriptor for 3D object retrieval that concatenates the features from the hidden layers of several auto-encoder networks. Fang et al. Yi20153D proposed combining HKS and a heat shape descriptor to parametrize raw 3D shapes for deep feature learning. Bu et al. Bu2014Multimodal proposed fusing the different modalities of 3D shapes via deep learning to promote the discriminability of unimodal features. Furuya et al. Furuya2016bmvc proposed a deep local feature aggregation network that integrates the extraction of rotation-invariant 3D local features and their fusion in a single deep network; these local features are finally aggregated into a global feature for shape retrieval. (ii) Rigid 3D matching. In the field of 3D rigid data matching, min pooling proposed in buch2016local is, to the best of our knowledge, the only approach that leverages multiple 3D feature descriptors to enhance the feature matching performance in the 3D object recognition context. Specifically, min pooling suggests selecting the feature resulting in the highest feature similarity to judge two patches as a match or non-match. However, the potentially complementary information of the other descriptors is thereby discarded.
Learning for feature fusion. Since the deep learning architecture provides flexible interfaces to tasks with various inputs and allows outputting aggregated results such as classification vectors and regression values from multiple input channels, it is straightforward to fuse features within a deep learning framework. Examples include the deep fusion of RGB and depth features gupta2014learning , view features from different viewpoints su2015multi , and different geometric features fang20153d . Besides deep learning, there is also a field called multi-view learning that introduces one function to model a particular feature and jointly optimizes all the functions to find the redundant features of the same input data as well as improve the learning performance xu2013survey ; Jing2017Multi . It includes three categories of learning approaches, i.e., co-training, multiple kernel learning, and subspace learning. They have been successfully applied to a number of computer vision tasks such as object classification kumar2007support , object recognition kembhavi2009incremental , facial expression recognition zheng2006facial , image classification zhang2012combining , and image retrieval li2011difficulty . Nonetheless, the above learning-based feature fusion techniques have not yet found applications in the realm of 3D rigid shape matching. Inspired by the great success of deep learning for feature fusion in the previously mentioned applications, we present the first attempt to fuse local geometric features with a deep neural network to improve feature matching and rigid geometric registration.
In this section, we detail the proposed fusion method for local geometric features. More specifically, we train a deep neural network to combine a number of local geometric features into a new feature that achieves information complementarity while remaining compact.
3.1 Network architecture
Learned descriptors in 2D and 3D domains have typically been trained with 2-branch simo2015discriminative ; zeng20173dmatch , 3-branch kumar2016learning ; khoury2017learning , and N-branch deng2018ppfnet Siamese networks. The 3-branch network is shown to be more suitable than the 2-branch network when feature histograms are the input khoury2017learning , and requires far fewer hardware resources than the N-branch network. Therefore, we adopt the Siamese network under the triplet framework as our basic architecture, with the purpose of pulling matches together while pushing non-matches apart.
Fig. 2 presents the proposed network architecture. Each training triplet consists of an anchor sample, a positive sample, and a negative sample that are fed to the network together. In our context, the training samples refer to local shape patches. We first extract the geometric features to be fused for each sample. Then, the features of each sample are embedded by an NN that non-linearly merges all input features into a new feature in Euclidean space. We keep the weights of the three NN modules identical during parameter updating. The NN module includes two blocks, i.e., an intra-feature fusion block and an inter-feature fusion block. The intra-feature fusion block allows bin-level fusion for each independent feature. Since the bins of the feature after intra-feature fusion are combinations of all initial feature bins with different weights, this can be seen as information exchange within a feature. By contrast, the most popular concatenation operation for local geometric fusion assigns identical weights to all bins within a feature and ignores the combinational information of bins. The features after intra-feature fusion are then concatenated into a single vector and finally passed to the inter-feature fusion block, which combines feature-level information.
We use 3 fully connected (FC) layers in the intra-feature fusion block for each feature, where each FC layer is followed by a ReLU layer. The inter-feature fusion block is composed of 5 FC + ReLU layers, where the first 4 FC layers share the same number of nodes and the last FC layer outputs the final fused feature. The layer widths of the two blocks are tuned for a specific input feature combination. We also consider different dimensionalities of the final fused feature to achieve a balance between compactness and distinctiveness.
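To make the two-block design concrete, the following is a minimal NumPy sketch of the forward pass only, not the authors' TensorFlow implementation. The layer widths (256, 128, 64, 512) and the 32-dimensional output are hypothetical placeholders; only the layer counts (3 FC + ReLU layers per intra-feature block, 5 FC + ReLU layers in the inter-feature block) and the SI/SHOT input sizes follow the text.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(rng, sizes):
    """Random (weight, bias) pairs for consecutive layer sizes."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Stack of FC + ReLU layers, used for both blocks."""
    for w, b in params:
        x = relu(x @ w + b)
    return x

def init_fusion_net(rng, feature_dims, intra_widths, inter_width, out_dim):
    # One 3-layer intra-feature block per input feature.
    intra = [init_mlp(rng, [d, *intra_widths]) for d in feature_dims]
    concat_dim = intra_widths[-1] * len(feature_dims)
    # Inter-feature block: 4 equally wide FC layers plus the output FC layer.
    inter = init_mlp(rng, [concat_dim] + [inter_width] * 4 + [out_dim])
    return {"intra": intra, "inter": inter}

def fuse(net, features):
    """features: list of arrays, one per input feature, each (batch, dim_k)."""
    fused = [mlp(p, f) for p, f in zip(net["intra"], features)]
    return mlp(net["inter"], np.concatenate(fused, axis=1))

rng = np.random.default_rng(0)
# Hypothetical widths; the inputs mimic SI (153 dim.) and SHOT (352 dim.).
net = init_fusion_net(rng, [153, 352], (256, 128, 64), 512, 32)
batch = [rng.standard_normal((4, 153)), rng.standard_normal((4, 352))]
out = fuse(net, batch)
print(out.shape)
```

In a triplet setup, the same `net` would be applied to the anchor, positive, and negative samples, which is one simple way to keep the three branches' weights identical.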
3.2 Loss function
The network is trained to minimize an improved triplet loss proposed in this paper. We first revisit the original triplet loss in the following.
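Let $\mathcal{X}=\{\mathbf{x}_1,\dots,\mathbf{x}_N\}$ be the extracted feature sets for $N$ samples, where $\mathbf{x}_i$ denotes the local geometric features to be fused for the $i$-th sample. Consider a set of triplets of $\mathcal{X}$, i.e., $\mathcal{T}=\{(\mathbf{x}_i^a,\mathbf{x}_i^p,\mathbf{x}_i^n)\}$. The triplet loss is defined as:
$$L(\theta)=\frac{1}{|\mathcal{T}|}\sum_{(\mathbf{x}_i^a,\mathbf{x}_i^p,\mathbf{x}_i^n)\in\mathcal{T}}\Big[D\big(f(\mathbf{x}_i^a;\theta),f(\mathbf{x}_i^p;\theta)\big)-D\big(f(\mathbf{x}_i^a;\theta),f(\mathbf{x}_i^n;\theta)\big)+m\Big]_+ \qquad (1)$$
where $D(\cdot,\cdot)$ denotes the distance between two vectors, $\theta$ are the parameters of mapping $f$, $[\cdot]_+$ denotes $\max(\cdot,0)$, and $m$ is the margin set to separate positive and negative samples.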
However, the original triplet loss fails to fully exploit the relationships within the triplet in two aspects. First, there are two negative distances, i.e., the anchor-negative distance $D\big(f(\mathbf{x}_i^a),f(\mathbf{x}_i^n)\big)$ and the positive-negative distance $D\big(f(\mathbf{x}_i^p),f(\mathbf{x}_i^n)\big)$, while it only considers the former. Since the objective of the triplet loss is to separate positive and negative samples by a predefined margin, it is more reasonable to consider all negative distances. Second, the triplet loss ignores the compactness of the cluster formed by positive samples. Intuitively, negative samples may be scattered in the feature space because they are only supposed to be distant from the anchor sample, and no pairwise affinity exists between two negative samples. By contrast, positive samples show clear consistency, as the distance between any two positive samples should be close to 0 in the ideal case. In the context of geometric feature matching, where perturbations such as noise, partial overlap, clutter, and occlusion commonly exist in real-world applications, even two corresponding keypoints may exhibit dissimilar geometric features. Therefore, it is desirable to further pull the anchor and positive samples together such that challenging matches, typically located in border or incomplete regions, can be recognized by the network during testing.
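Based on these considerations, we propose the following improved triplet loss:
$$L(\theta)=\frac{1}{|\mathcal{T}|}\sum_{(\mathbf{x}_i^a,\mathbf{x}_i^p,\mathbf{x}_i^n)\in\mathcal{T}}\Big(\Big[D\big(f(\mathbf{x}_i^a;\theta),f(\mathbf{x}_i^p;\theta)\big)-D_i^{*}+m\Big]_+ +\lambda\,D\big(f(\mathbf{x}_i^a;\theta),f(\mathbf{x}_i^p;\theta)\big)\Big) \qquad (2)$$
where $D_i^{*}$ is the minimum of $D\big(f(\mathbf{x}_i^a;\theta),f(\mathbf{x}_i^n;\theta)\big)$ and $D\big(f(\mathbf{x}_i^p;\theta),f(\mathbf{x}_i^n;\theta)\big)$, $\lambda$ is a parameter to control the weight of the last pairwise term, and $D$, $f$, $[\cdot]_+$, and $m$ are as in Eq. 1. Compared with the triplet loss (Eq. 1), our improved loss considers both negative distances within the triplet to ensure a better separation of positive and negative samples, and minimizes the positive distance to achieve a tight positive sample cluster.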
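The difference between the two losses can be illustrated with a short NumPy sketch. It assumes a Euclidean distance and assumes that the pairwise term is the anchor-positive distance scaled by a weight; the function names and the `margin`/`weight` parameters are illustrative and not taken from the authors' code.

```python
import numpy as np

def euclidean(u, v):
    """Row-wise Euclidean distance between two batches of embeddings."""
    return np.linalg.norm(u - v, axis=1)

def triplet_loss(fa, fp, fn, margin):
    """Original triplet loss: only the anchor-negative distance is used."""
    d_ap = euclidean(fa, fp)
    d_an = euclidean(fa, fn)
    return float(np.mean(np.maximum(d_ap - d_an + margin, 0.0)))

def improved_triplet_loss(fa, fp, fn, margin, weight):
    """Uses the smaller of the two negative distances and adds a pairwise
    term that pulls the anchor and the positive together (assumed form)."""
    d_ap = euclidean(fa, fp)
    d_neg = np.minimum(euclidean(fa, fn), euclidean(fp, fn))
    hinge = np.maximum(d_ap - d_neg + margin, 0.0)
    return float(np.mean(hinge + weight * d_ap))

# One toy triplet of already-embedded features on a line.
fa = np.array([[0.0, 0.0]])   # anchor
fp = np.array([[1.0, 0.0]])   # positive, distance 1 from the anchor
fn = np.array([[3.0, 0.0]])   # negative, distance 3 from anchor, 2 from positive

print(triplet_loss(fa, fp, fn, margin=1.0))                        # 0.0
print(improved_triplet_loss(fa, fp, fn, margin=1.0, weight=0.5))   # 0.5
```

In this toy case the original loss is already zero, whereas the improved loss still penalizes the remaining anchor-positive distance, which is the intended tightening effect on the positive cluster.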
3.3 Training data preparation
We sample triplets of points for feature extraction and training from a number of shape pairs with overlaps. Given a shape pair composed of a source shape $\mathcal{P}_s$ and a target shape $\mathcal{P}_t$ together with the ground truth rotation $\mathbf{R}$ and translation $\mathbf{t}$, any point in $\mathcal{P}_s$, e.g., $\mathbf{p}_s$, that satisfies the following condition is considered an anchor sample:
$$\|\tilde{\mathbf{p}}_s-\mathbf{p}_t\|<\tau \qquad (3)$$
where $\tilde{\mathbf{p}}_s=\mathbf{R}\mathbf{p}_s+\mathbf{t}$, $\mathbf{p}_t$ is the nearest neighbor of $\tilde{\mathbf{p}}_s$ in the target shape $\mathcal{P}_t$, and $\tau$ is a distance threshold. To select the positive samples for $\mathbf{p}_s$, we find points in $\mathcal{P}_t$ whose distances to the ground truth corresponding point of $\mathbf{p}_s$ are smaller than a distance threshold and use them as the positive samples for $\mathbf{p}_s$. As for negative samples, we follow khoury2017learning and divide the negative examples to be sampled into two portions. The first portion includes samples randomly selected from the set of points in $\mathcal{P}_t$ whose distances to the ground truth corresponding point lie between a lower and an upper bound. Because these points are located near the ground truth corresponding point of $\mathbf{p}_s$, they serve as hard negative examples. The second portion is composed of samples that are randomly selected from the points in $\mathcal{P}_t$ that are at least a given distance from the ground truth corresponding point. A visual illustration of the sampling process is shown in Fig. 4. In this manner, triplets are sampled for each point in the overlapping region of a shape pair.
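The sampling procedure can be sketched as follows. This is an illustrative brute-force version under several assumptions: distances are measured from the transformed anchor, the parameter names (`tau`, `r_pos`, `r_lo`, `r_hi`, `n_hard`, `n_easy`) are hypothetical, and the easy negatives are taken beyond the same upper bound used for the hard negatives. A spatial index would replace the dense distance matrix for real point clouds.

```python
import numpy as np

def sample_triplets(src, tgt, R, t, tau, r_pos, r_lo, r_hi, n_hard, n_easy, rng):
    """Return (anchor_idx_in_src, positive_idx_in_tgt, negative_idx_in_tgt)."""
    moved = src @ R.T + t                       # source points in the target frame
    dists = np.linalg.norm(moved[:, None, :] - tgt[None, :, :], axis=2)
    triplets = []
    # Anchors: source points whose nearest target neighbor is closer than tau.
    for i in np.flatnonzero(dists.min(axis=1) < tau):
        d = dists[i]
        positives = np.flatnonzero(d < r_pos)
        hard = np.flatnonzero((d >= r_lo) & (d <= r_hi))   # near the true match
        easy = np.flatnonzero(d > r_hi)                     # far from the true match
        negatives = []
        if hard.size:
            negatives.extend(rng.choice(hard, size=min(n_hard, hard.size), replace=False))
        if easy.size:
            negatives.extend(rng.choice(easy, size=min(n_easy, easy.size), replace=False))
        triplets += [(int(i), int(p), int(n)) for p in positives for n in negatives]
    return triplets

# Toy example: identity transform, four target points on a line.
tgt = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0], [5.0, 0, 0]])
src = np.array([[0.0, 0, 0], [100.0, 0, 0]])    # second point is outside the overlap
triplets = sample_triplets(src, tgt, np.eye(3), np.zeros(3), tau=0.5, r_pos=0.5,
                           r_lo=0.5, r_hi=2.5, n_hard=2, n_easy=1,
                           rng=np.random.default_rng(0))
print(sorted(triplets))
```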
3.4 Implementation details
The proposed method was implemented in TensorFlow v1.8 abadi1603tensorflow . The training samples are permuted and partitioned into minibatches with a size of 512. We use the Adam kinga2015method optimizer with and
In this section, we present the deployed experiments to verify the effectiveness of the proposed method from two perspectives, i.e., feature matching and geometric registration. The following parameter settings are used throughout all the experiments: and (Eq. 2); pr, pr and pr (Sect. 3.3); and (Sect. 3.3). Here, pr denotes point cloud resolution, i.e., the average shortest distance among neighboring points computed for a whole dataset.
All the experiments are conducted on a laptop with a 4-core Intel i7-6700HQ CPU, 24 GB of memory, and an NVIDIA GTX1060 graphics card with CUDA v9.0 and cuDNN v7.0.4. The compared methods, feature matching, and geometric registration are implemented based on the Point Cloud Library (PCL) rusu20113d .
Four well-known 3D rigid matching datasets with different application scenarios and data modalities are considered for experiments. Some samples from these datasets are shown in Fig. 5.
U3M mian2006novel . It is a LiDAR registration dataset that incorporates 22, 16, 16, and 21 partial views of four 3D objects, namely the Chef, Chicken, Parasaurolophus, and T-Rex. This dataset was acquired using a Minolta Vivid 910 Scanner. In particular, we consider any two partial views of the same object with a minimum of 30% overlap, resulting in a total of 313 valid matching pairs in this dataset. The ground truth motion between two views is obtained by manual alignment followed by ICP refinement besl1992method .
U3OR mian2006three ; mian2010repeatability . This dataset addresses the 3D object recognition scenario and is very popular for local shape descriptor assessment guo2016comprehensive . It consists of 5 3D models, i.e., Chef, Chicken, Parasaurolophus, T-Rex, and Rhino, and 50 scenes. The scenes are obtained by randomly placing four or five objects together and scanning them from a particular view using a Minolta Vivid 910 Scanner. The challenges posed by U3OR are mainly various degrees of clutter and occlusion. There are 188 valid matching pairs in this dataset. Notably, the Rhino model is used to create clutter, and matching pairs are only found for the other four objects.
BMR tombari2010unique . BMR is a low-resolution dataset collected using a Microsoft Kinect sensor. It is composed of 15, 16, 20, 13, 16, and 15 partial views of 6 objects, i.e., Doll, Duck, Frog, Mario, PeterRabbit, and Squirrel. Analogously to the U3M dataset, we consider shape pairs with at least 30% overlap and eventually obtain 321 valid matching pairs.
Aug_ICL-NUIM choi2015robust . The Augmented ICL-NUIM (Aug_ICL-NUIM) dataset was collected via a Microsoft Kinect sensor with four indoor scene sequences including Living room 1, Living room 2, Office 1, and Office 2. These 2.5D sequences include 57, 47, 53, and 50 point clouds, respectively. Different from widely used object datasets, Aug_ICL-NUIM contains far fewer geometric details. Fragment pairs in the four sequences that are not temporally adjacent to each other are considered for experiments choi2015robust ; zhou2016fast .
The U3M and U3OR datasets are widely employed for the evaluation of feature matching performance guo2013rotational ; yang2017RCS_jrnl , so we use these two datasets for feature matching evaluation; the larger-scale BMR and Aug_ICL-NUIM datasets are considered for geometric registration evaluation. For U3M and U3OR, the matching pairs of Chef and Chicken, Parasaurolophus, and T-Rex are employed for training, validation, and testing, respectively. For the BMR dataset, we use the matching pairs of Doll, Duck, and Frog for training and the remaining ones for testing. For the Aug_ICL-NUIM dataset, we follow khoury2017learning and train our network on the SceneNN dataset (https://marckhoury.github.io/CGF/). Notably, there are benchmark results of several deep learned descriptors for the Aug_ICL-NUIM dataset, making it possible to fairly compare our fused descriptors with deep learned descriptors.
|SI johnson1999using +SHOT tombari2010unique||512||512||5|
|SI johnson1999using +SHOT tombari2010unique +RCS yang2017RCS_jrnl||512||512||5|
We use the Recall versus Precision curve (RPC), as suggested by many previous works tombari2010unique ; guo2013rotational , to examine the feature matching performance of a local shape descriptor. It is calculated as follows: given a source shape, a target shape, and the ground truth transformation, we randomly sample 1000 keypoints on the source shape and locate their corresponding points in the target shape using the ground truth transformation. Local shape features are extracted for the keypoints, and each target feature is matched against all source features to find the closest feature. If the ratio of the smallest distance to the second smallest distance is below a threshold, the target feature and the closest source feature are considered a match. A match is further considered correct if it conforms to the ground truth transformation. Precision refers to the ratio of the number of feature-identified correct matches to the number of matches. Recall refers to the ratio of the number of feature-identified correct matches to the total number of correct matches. The curve is obtained by varying the threshold. We also present the area under curve (AUC) value of each RPC to measure the overall feature matching performance.
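A single point of the RPC can be computed as in the following NumPy sketch. It assumes Euclidean feature distances and a hypothetical ground-truth encoding in which `gt_src_of_tgt[j]` gives the index of the source keypoint that truly corresponds to target keypoint `j`, so the total number of correct matches equals the number of target keypoints.

```python
import numpy as np

def precision_recall(src_feats, tgt_feats, gt_src_of_tgt, ratio_threshold):
    """Precision and recall of ratio-test matches at one threshold."""
    d = np.linalg.norm(tgt_feats[:, None, :] - src_feats[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    nearest, second = order[:, 0], order[:, 1]
    rows = np.arange(len(tgt_feats))
    # Ratio of the smallest to the second smallest feature distance.
    ratio = d[rows, nearest] / np.maximum(d[rows, second], 1e-12)
    is_match = ratio < ratio_threshold
    is_correct = is_match & (nearest == gt_src_of_tgt)
    n_matches = int(is_match.sum())
    precision = is_correct.sum() / n_matches if n_matches else 0.0
    recall = is_correct.sum() / len(tgt_feats)
    return float(precision), float(recall)

# Toy 2D "features": the third target feature is closest to the wrong source.
src_feats = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
tgt_feats = np.array([[0.1, 0.0], [10.0, 0.1], [6.0, 5.0]])
gt = np.array([0, 1, 2])
for thr in (0.8, 0.9):
    print(thr, precision_recall(src_feats, tgt_feats, gt, thr))
```

Sweeping `ratio_threshold` over a range of values and plotting the resulting (recall, precision) pairs yields the curve, and numerical integration of that curve gives an AUC value.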
Regarding geometric registration performance assessment, we employ the $\alpha$-recall suggested in zhou2016fast for the BMR dataset. It is defined as the fraction of registrations that succeed, i.e., whose RMSE values are smaller than $\alpha$. RMSE is computed on the distances between ground truth correspondences after registration yang2016fast ; choi2015robust , which is defined as:
where is the ground-truth correspondence set between and ; and are ground-truth rotation and translation, respectively. Regarding the Aug_ICL-NUIM dataset, we employ the benchmarking metrics, i.e., Precision and Recall of successfully aligned data pairs (we additionally consider the F-score, i.e., , as an aggregated metric). Specifically, a registration is considered as correct if its RMSE (Eq. 4) is smaller than 0.2 choi2015robust .
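The aggregate registration metrics are simple to compute once per-pair RMSE values or precision/recall figures are available. The sketch below assumes the usual harmonic-mean definition of the F-score and treats the recall threshold as a plain parameter; both helper names are illustrative.

```python
def threshold_recall(rmse_values, threshold):
    """Fraction of registrations whose RMSE is below the given threshold."""
    if not rmse_values:
        return 0.0
    return sum(r < threshold for r in rmse_values) / len(rmse_values)

def f_score(precision, recall):
    """Harmonic mean of precision and recall (assumed F-score definition)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Hypothetical RMSE values for four registrations.
print(threshold_recall([0.1, 0.3, 0.05, 0.5], threshold=0.2))   # 0.5
print(f_score(0.5, 1.0))
```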
4.1.3 Features for fusion
For low-level local geometric feature fusion, we consider LFSH yang2016fast (30 dim.) and RCS yang2017RCS_jrnl (72 dim.). LFSH is the concatenation of three sub-histograms that encode signed distance, normal deviation, and density attributes, respectively. RCS is the concatenation of six signatures that represent the contour information from different views.
Regarding high-level feature fusion, two combinations are considered: spin image (SI) johnson1999using (153 dim.)+SHOT tombari2010unique (352 dim.), and SHOT tombari2010unique (352 dim.)+RoPS guo2013rotational (135 dim.)+RCS yang2017RCS_jrnl (72 dim.). A common trait of the considered combinations is that the features within each combination encode local geometry from different perspectives and are therefore expected to be complementary to each other.
We set the support radius of all descriptors to 15pr on the U3M, U3OR, and BMR datasets. For the Aug_ICL-NUIM dataset with large-scale indoor data, 60pr is used to include more structural details. The network parameters used to fuse different features are reported in Table 1.
4.1.4 Compared methods
Existing methods for low-level geometric feature fusion, i.e., concatenation tombari2010unique ; yang2016fast and PCA johnson1999using ; guo2015novel , are taken into comparison. Specifically, PCA is trained over a set of descriptors computed for each point of the Chef and Chicken models. For high-level geometric feature fusion, we compare our method with the min pooling approach buch2016local . We also investigate the effectiveness of applying concatenation and PCA for descriptor-level fusion.
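For reference, the concatenation and PCA baselines can be sketched in a few lines of NumPy. This is a generic SVD-based PCA rather than the exact procedure of the cited works; the random arrays merely stand in for real SI (153 dim.) and SHOT (352 dim.) descriptors, and the 32-dimensional target size is a hypothetical choice.

```python
import numpy as np

def concat_features(features):
    """Concatenation baseline: stack per-descriptor vectors side by side."""
    return np.concatenate(features, axis=1)

def fit_pca(X, k):
    """Return the training mean and the top-k principal directions."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def apply_pca(X, mean, components):
    """Project descriptors onto the learned principal directions."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
si = rng.standard_normal((200, 153))      # placeholder for SI descriptors
shot = rng.standard_normal((200, 352))    # placeholder for SHOT descriptors
X = concat_features([si, shot])
mean, comps = fit_pca(X, 32)
Z = apply_pca(X, mean, comps)
print(X.shape, Z.shape)
```

Note that PCA is fitted without any matching supervision: it only sees the descriptors themselves, not which pairs of descriptors should be close or far apart.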
Because the proposed method is a fusion approach, we mainly focus on the comparison with existing fusion methods for local geometric features, but we still include comparisons with some traditional and deep learned descriptors.
4.2 Low-level feature fusion performance
We present the feature matching performance of local geometric descriptors created by our fusion method (dubbed as NN) as well as the compared approaches on the test sets of the U3M and U3OR datasets in Fig. 6. Different feature lengths are specifically examined for PCA and NN to see whether a good balance could be achieved between distinctiveness and compactness.
For fusion results on LFSH in Fig. 6(a)-(b), one can see that the proposed method achieves better performance than the original LFSH feature (i.e., generated via histogram concatenation) with 16 and 30 dimensions. The gap is especially significant on the U3OR dataset. Notably, our NN-fused LFSH with merely 8 dimensions outperforms the original version on the U3OR dataset. Although LFSH is known to be fast and compact yang2016fast , NN can further improve its compactness while making it more discriminative (the NN fusion process was performed in real time). PCA, though making the LFSH feature more compact, results in a slight performance deterioration. For the RCS feature, consistent results can be witnessed in Fig. 6(a)-(b). Specifically, our NN method surpasses all methods in terms of both compactness and distinctiveness by a clear margin on the U3M and U3OR datasets. These results demonstrate that integrating features non-linearly through matching-guided learning is able to fully exploit the components of the feature set that benefit feature matching and to aggregate them in a compact manner.
4.3 High-level feature fusion performance
, and our loss), network architectures, and inputs on the validation sets of the U3M and U3OR datasets. “w/o intra.” indicates removing the intra-feature fusion block and using the concatenated feature vector as the input to fully connected layers, as done in khoury2017learning . represents the best performance when using any features for fusion from the prepared feature set. The learned features from (a) to (b) have 16, 48, 256, and 256 dimensions, respectively. The best results are in bold fonts.
Fig. 7 shows the feature matching performance of local geometric descriptors and their fused results by several high-level feature fusion methods. For the fusion of SI and SHOT (Fig. 7(a)-(b)), NN achieves the best performance on both datasets under all tested dimensions, outperforming the second best, i.e., min pooling, by a significant margin. Concatenation and PCA behave poorly. The reason is reflected by the observation that SI exhibits comparable performance with min pooling, indicating that in most patch matching cases SI produces a higher similarity score than SHOT. In this sense, attaching SHOT to SI introduces many noisy bins into the concatenated feature (we will show in Sect. 4.4 that SHOT actually has a positive impact on SI when fused via NN).
A similar conclusion can be drawn from the fusion results of SHOT, RoPS, and RCS as presented in Fig. 7(c)-(d), i.e., NN outperforms all others. Since min pooling returns better performance than using any single feature, we can infer that these descriptors provide complementary information to each other. Compared with min pooling, NN mines such information more effectively. In addition, min pooling spends more time on feature matching because the feature similarity score needs to be calculated once per descriptor for every candidate pair.
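The cost argument can be seen from a simplified sketch of min pooling, written here from the description in this paper (keep the most favorable per-descriptor score) rather than from the implementation in buch2016local . It assumes Euclidean distances and ignores any per-descriptor normalization that a real implementation may need to make distances comparable.

```python
import numpy as np

def min_pooling_distance(feats_a, feats_b):
    """One distance per descriptor type; the smallest (most similar) is kept."""
    return min(float(np.linalg.norm(a - b)) for a, b in zip(feats_a, feats_b))

def concat_distance(feats_a, feats_b):
    """Concatenation baseline: a single distance over the stacked vectors."""
    return float(np.linalg.norm(np.concatenate(feats_a) - np.concatenate(feats_b)))

# Two patches, each described by two toy descriptors of different length.
patch_a = [np.array([0.0, 0.0]), np.array([1.0, 1.0, 1.0])]
patch_b = [np.array([3.0, 4.0]), np.array([1.0, 1.0, 2.0])]
print(min_pooling_distance(patch_a, patch_b))   # 1.0
print(concat_distance(patch_a, patch_b))
```

A fused descriptor, by contrast, needs only one distance evaluation per candidate pair, on a vector that can be much shorter than the concatenation.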
4.4 Method analysis
For the purpose of verifying the effectiveness of our improved triplet loss and the rationality of our network design, we conduct a set of experiments on the validation sets of U3M and U3OR datasets with different loss functions, network architectures, and inputs. The results are reported in Table 2.
Three main observations can be made from the results. First, the intra-feature fusion block, which performs information exchange within a feature, is demonstrated to be beneficial for multi-feature fusion as it brings performance gains in all tested cases. Second, recall that directly concatenating all features often leads to performance inferior to that of a particular single feature, as shown in Fig. 7. Yet, under the proposed NN fusion framework, the best performance is achieved when taking all features as the input. For instance, SI is more distinctive than the concatenated SI+SHOT feature, while NN reveals that SHOT does provide complementary information to SI since a clear performance gain is witnessed when using SI+SHOT for learning. This is not surprising because SI and SHOT describe the local surface from different perspectives, i.e., density and normal deviations, and intuitively we should expect implicit complementary information within SI+SHOT. Based on this observation, we can conclude that NN is capable of fully mining the complementary information within multiple features. Third, the improved triplet loss, when keeping other settings identical, generally yields better results than the original triplet loss khoury2017learning , showing the effectiveness of our improvement. We also try the contrastive loss zeng20173dmatch , but it turns out to be far inferior to the other losses. Similar results have also been reported in khoury2017learning when learning representations from local geometric features.
4.5 3D rigid geometric registration
For task-level evaluation, i.e., 3D rigid registration, we consider RCS and SHOT+RoPS+RCS, which respectively belong to the low-level and high-level feature fusion scenarios, to generate local geometric features. To perform registration, we follow the standard pipeline in PCL holz2015registration , including keypoint detection, feature description, feature matching, transformation estimation, and refinement. We use the uniform sampling provided by PCL to detect keypoints, the features fused by the tested methods for feature description, KNN muja2014scalable for feature matching, and RANSAC fischler1981random for transformation estimation. No refinement such as iterative closest point (ICP) besl1992method is performed in this experiment.
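The transformation estimation step of this pipeline can be sketched as follows. This is a minimal NumPy illustration that assumes three-point sampling, an SVD-based least-squares fit, and a hypothetical inlier threshold; it is not the PCL implementation used in the experiments.

```python
import numpy as np

def estimate_rigid(src, dst):
    """Least-squares R, t such that dst is approximately src @ R.T + t."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

def ransac_registration(src, dst, iterations=1000, inlier_thresh=0.05, seed=0):
    """src[i] and dst[i] are putatively corresponding 3D keypoints.

    inlier_thresh is a hypothetical distance threshold in the data's units.
    """
    rng = np.random.default_rng(seed)
    best_R, best_t, best_count = np.eye(3), np.zeros(3), -1
    for _ in range(iterations):
        idx = rng.choice(len(src), size=3, replace=False)  # minimal sample
        R, t = estimate_rigid(src[idx], dst[idx])
        residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
        count = int((residuals < inlier_thresh).sum())
        if count > best_count:                   # keep the largest consensus set
            best_R, best_t, best_count = R, t, count
    return best_R, best_t, best_count
```

Here `src` and `dst` would hold the coordinates of the keypoints paired by feature matching, so the quality of the fused descriptors directly determines how many of the rows are true correspondences.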
1) Results on the BMR dataset
We consider different RANSAC iterations since this parameter often varies with the quality of feature correspondences and will result in various registration speeds. The -recall results are presented in Fig. 8.
For RCS fusion, NN surpasses the other methods with all tested RANSAC iterations. The margin is more obvious with 200 iterations, showing that matching NN-fused RCS features yields a higher ratio of inliers in the initially established feature correspondences between two point clouds, so that RANSAC can quickly find the main cluster formed by inliers. For SHOT+RoPS+RCS fusion, we can see that NN outperforms all others by a significant margin. These results demonstrate that NN is able to generate distinctive representations during feature fusion on noisy and low-resolution Kinect data. Finally, we present a qualitative result of two sample registration outcomes by the tested methods in Fig. 9. The figure suggests that the point clouds in the BMR dataset are very noisy and contain limited geometric information. However, NN-based registration still accurately aligns the point cloud pairs, while PCA and min pooling produce larger errors.
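The link between inlier ratio and the number of RANSAC iterations can be made concrete with the standard iteration-count estimate. The sketch below assumes three-point samples and a 99% confidence level, neither of which is stated for the experiments above, and the function name is hypothetical.

```python
import math

def ransac_iterations(inlier_ratio, sample_size=3, confidence=0.99):
    """Iterations needed to draw at least one all-inlier sample.

    Uses N = log(1 - p) / log(1 - w**s), with inlier ratio w, sample size s,
    and desired success probability p.
    """
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must lie in (0, 1]")
    p_good = inlier_ratio ** sample_size      # chance that one sample is clean
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))
```

Because the clean-sample probability falls with the cube of the inlier ratio, a modest gain in descriptor distinctiveness can reduce the required iterations by an order of magnitude, which is consistent with the larger margin observed at the smallest iteration budget.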
2) Results on the Aug_ICL-NUIM dataset
| Method | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|
| Super 4PCS mellado2014super | 10.4 | 17.8 | 13.1 |
| CGF (FGR) khoury2017learning | 9.4 | 60.7 | 16.3 |
| CGF (CZK) khoury2017learning | 14.6 | 72.0 | 24.3 |
| Ours (Comb. 1) | 15.8 | 58.6 | 24.9 |
| Ours (Comb. 2) | 26.0 | 71.2 | 38.1 |
The results of our approach with 1000 RANSAC iterations, together with the benchmark results of several state-of-the-art methods, are reported in Table 3.
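The F-score column in Table 3 is consistent with the harmonic mean of precision and recall. The short check below assumes that definition (it is not spelled out in this section) and uses a hypothetical helper name.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# (precision, recall) pairs as listed in Table 3.
table3 = {
    "Super 4PCS": (10.4, 17.8),
    "CGF (FGR)": (9.4, 60.7),
    "CGF (CZK)": (14.6, 72.0),
    "Ours (Comb. 1)": (15.8, 58.6),
    "Ours (Comb. 2)": (26.0, 71.2),
}
for name, (p, r) in table3.items():
    print(f"{name}: {f_score(p, r):.1f}")
```

Since the harmonic mean is dominated by the smaller of the two values, the low precision of all methods on this dataset keeps the F-scores well below the recall figures.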
One can see that the fusion of RCS achieves performance comparable to CGF (CZK), while the fusion of SHOT+RoPS+RCS exceeds all compared methods. This result demonstrates the effectiveness of our fusion approach in two respects. First, the fusion of traditional local geometric descriptors can greatly enhance their discriminative power. Second, our method is not only more effective than existing fusion approaches, but also competitive with deep learned descriptors (e.g., 3DMatch zeng20173dmatch and CGF khoury2017learning ). In addition, an appealing trait of our fused descriptor is that it is intrinsically invariant to rotation, as opposed to most existing learned ones. We believe even better results could be achieved by trying more distinctive feature sets. Although the selection of features for fusion is somewhat tricky, simply fusing RCS already achieves a better F-score than CGF (CZK). Selecting the input features is therefore not a cumbersome task.
Finally, we present some registration results on this dataset obtained with the fused SHOT+RoPS+RCS descriptor in Fig. 10. Even for rigid data with limited geometric information, our method still achieves accurate registrations.
5 Conclusions and future work
We have presented a simple yet effective deep learning-based method for local geometric feature fusion that employs a neural network to mine complementary information within a set of geometric features. Feature matching and geometric registration experiments on several publicly available datasets with different modalities and application scenarios confirm that our method achieves more distinctive features than existing fusion approaches while occupying fewer dimensions. The fused features are also competitive with several deep learned geometric descriptors, yet are more lightweight to train and rotationally invariant. In light of the experimental results, the following conclusions can be drawn.
In terms of feature fusion, the proposed approach comprehensively exploits the complementary information within a feature set while using very compact representations.
Besides superior fusion performance, the resultant descriptor after fusion achieves performance competitive with learned descriptors. Since one of the most challenging problems for learned point cloud features is how to achieve rotation invariance yew20183dfeat , our work points to a new direction: fusing traditional features using a lightweight NN, which greatly improves their distinctiveness while remaining rotationally invariant.
Potential future research directions include (i) investigating more advanced training strategies (e.g., hard negative mining simo2015discriminative ; Lin2017Focal ) to further improve the fusion performance; and (ii) applying our method to other domains such as 2D image feature and non-rigid feature description, as the inference of our method can be seamlessly applied to these tasks.
The authors would like to thank the publishers of the experimental datasets used in our work. Financial support from the National Natural Science Foundation of China under Grant 61876211 is gratefully acknowledged.
- (1) D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
- (2) L. Yu, S. Liu, Z. Wang, Multi-focus image fusion with dense sift, Information Fusion 23 (C) (2015) 139–155.
- (3) A. E. Johnson, M. Hebert, Using spin images for efficient object recognition in cluttered 3d scenes, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (5) (1999) 433–449.
- (4) R. B. Rusu, N. Blodow, M. Beetz, Fast point feature histograms (fpfh) for 3d registration, in: Proc. IEEE International Conference on Robotics and Automation, 2009, pp. 3212–3217.
- (5) J. Yang, Q. Zhang, K. Xian, Y. Xiao, Z. Cao, Rotational contour signatures for both real-valued and binary feature representations of 3d local shape, Computer Vision and Image Understanding 160 (2017) 133–147.
- (6) F. Tombari, S. Salti, L. Di Stefano, Unique signatures of histograms for local surface description, in: Proc. European Conference on Computer Vision, 2010, pp. 356–369.
- (7) Y. Guo, F. Sohel, M. Bennamoun, M. Lu, J. Wan, Rotational projection statistics for 3d local surface description and object recognition, International Journal of Computer Vision 105 (1) (2013) 63–86.
- (8) J. Yang, Z. Cao, Q. Zhang, A fast and robust local descriptor for 3d point cloud registration, Information Sciences 346 (2016) 163–179.
- (9) J. Yang, Q. Zhang, Y. Xiao, Z. Cao, Toldi: An effective and robust approach for 3d local shape description, Pattern Recognition 65 (2017) 175–187.
- (10) A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, T. Funkhouser, 3dmatch: Learning local geometric descriptors from rgb-d reconstructions, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2017, pp. 199–208.
- (11) M. Khoury, Q.-Y. Zhou, V. Koltun, Learning compact geometric features, in: Proc. IEEE International Conference on Computer Vision, 2017, pp. 153–161.
- (12) G. Elbaz, T. Avraham, A. Fischer, 3d point cloud registration for localization using a deep neural network auto-encoder, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2017, pp. 2472–2481.
- (13) H. Deng, T. Birdal, S. Ilic, Ppfnet: Global context aware local features for robust 3d point matching, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 199–208.
- (14) Z. J. Yew, G. H. Lee, 3dfeat-net: Weakly supervised local 3d features for point cloud registration, in: Proc. European Conference on Computer Vision, 2018, pp. 630–646.
- (15) H. Deng, T. Birdal, S. Ilic, Ppf-foldnet: Unsupervised learning of rotation invariant 3d local descriptors, in: Proc. European Conference on Computer Vision, 2018, pp. 602–618.
- (16) S. Malassiotis, M. G. Strintzis, Snapshots: A novel local surface descriptor and matching algorithm for robust 3d surface alignment, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (7) (2007) 1285–1290.
- (17) T.-Y. Yang, Y.-Y. Lin, Y.-Y. Chuang, Accumulated stability voting: A robust descriptor from descriptors of multiple scales, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 327–335.
- (18) Y. Guo, F. Sohel, M. Bennamoun, et al., A novel local surface feature for 3d object recognition under clutter and occlusion, Information Sciences 293 (2015) 196–213.
- (19) K.-J. Hsu, Y.-Y. Lin, Y.-Y. Chuang, et al., Robust image alignment with multiple feature descriptors and matching-guided neighborhoods, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1921–1930.
- (20) Y.-T. Hu, Y.-Y. Lin, Progressive feature matching with alternate descriptor selection and correspondence enrichment, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 346–354.
- (21) A. G. Buch, H. G. Petersen, N. Krüger, Local shape feature fusion for improved matching, pose estimation and 3d object recognition, SpringerPlus 5 (1) (2016) 297.
- (22) R. Q. Charles, S. Hao, K. Mo, L. J. Guibas, Pointnet: Deep learning on point sets for 3d classification and segmentation, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 195–205.
- (23) A. S. Mian, M. Bennamoun, R. A. Owens, A novel representation and feature matching algorithm for automatic pairwise registration of range images, International Journal of Computer Vision 66 (1) (2006) 19–40.
- (24) Y. Sun, J. Paik, A. Koschan, D. L. Page, M. A. Abidi, Point fingerprint: a new 3-d object representation scheme, IEEE Transactions on Systems, Man, and Cybernetics, Part B 33 (4) (2003) 712–717.
- (25) A. Frome, D. Huber, R. Kolluri, T. Bülow, J. Malik, Recognizing objects in range data using regional point descriptors, in: Proc. European Conference on Computer Vision, 2004, pp. 224–237.
- (26) G. K. L. Tam, C. Zhi-Quan, L. Yu-Kun, F. C. Langbein, L. Yonghuai, M. David, R. R. Martin, S. Xian-Fang, P. L. Rosin, Registration of 3d point clouds and meshes: a survey from rigid to nonrigid, IEEE Transactions on Visualization and Computer Graphics 19 (7) (2013) 1199–1217.
- (27) J. Sun, M. Ovsjanikov, L. Guibas, A concise and provably informative multi-scale signature based on heat diffusion, in: Computer graphics forum, Vol. 28, Wiley Online Library, 2009, pp. 1383–1392.
- (28) M. Aubry, U. Schlickewei, D. Cremers, The wave kernel signature: A quantum mechanical approach to shape analysis, in: Proc. IEEE International Conference on Computer Vision Workshops, IEEE, 2011, pp. 1626–1633.
- (29) F. Yang, B. Leng, Ofs: A feature selection method for shape-based 3d model retrieval, in: Proc. IEEE International Conference on Computer-Aided Design and Computer Graphics, IEEE, 2007, pp. 114–119.
- (30) P. Papadakis, I. Pratikakis, T. Theoharis, G. Passalis, S. Perantonis, 3d object retrieval using an efficient and compact hybrid shape descriptor, in: Eurographics Workshop on 3D Object Retrieval, 2008.
- (31) H. Tabia, H. Laga, D. Picard, P.-H. Gosselin, Covariance descriptors for 3d shape matching and retrieval, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4185–4192.
- (32) J. Xie, Y. Fang, F. Zhu, E. Wong, Deepshape: Deep learned shape descriptor for 3d shape matching and retrieval, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1275–1283.
- (33) F. Yi, X. Jin, G. Dai, W. Meng, Z. Fan, T. Xu, E. Wong, 3d deep shape descriptor, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2319–2328.
- (34) S. Bu, S. Cheng, Z. Liu, J. Han, Multimodal feature fusion for 3d shape recognition and retrieval, IEEE Multimedia 21 (4) (2014) 38–46.
- (35) T. Furuya, O. Ryutarou, Deep aggregation of local 3d geometric features for 3d model retrieval, in: Proc. British Machine Vision Conference, 2018, pp. 121–1.
- (36) S. Gupta, R. Girshick, P. Arbeláez, J. Malik, Learning rich features from rgb-d images for object detection and segmentation, in: Proc. European Conference on Computer Vision, Springer, 2014, pp. 345–360.
- (37) H. Su, S. Maji, E. Kalogerakis, E. Learned-Miller, Multi-view convolutional neural networks for 3d shape recognition, in: Proc. IEEE International Conference on Computer Vision, 2015, pp. 945–953.
- (38) Y. Fang, J. Xie, G. Dai, M. Wang, F. Zhu, T. Xu, E. Wong, 3d deep shape descriptor, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2319–2328.
- (39) C. Xu, D. Tao, C. Xu, A survey on multi-view learning, arXiv preprint arXiv:1304.5634.
- (40) Z. Jing, X. Xie, X. Xin, S. Sun, Multi-view learning overview: Recent progress and new challenges, Information Fusion 38 (2017) 43–54.
- (41) A. Kumar, C. Sminchisescu, Support kernel machines for object recognition, in: Proc. IEEE International Conference on Computer Vision, IEEE, 2007, pp. 1–8.
- (42) A. Kembhavi, B. Siddiquie, R. Miezianko, S. McCloskey, L. S. Davis, Incremental multiple kernel learning for object recognition, in: Proc. IEEE International Conference on Computer Vision, IEEE, 2009, pp. 638–645.
- (43) W. Zheng, X. Zhou, C. Zou, L. Zhao, Facial expression recognition using kernel canonical correlation analysis (kcca), IEEE Transactions on Neural Networks 17 (1) (2006) 233–238.
- (44) L. Zhang, L. Zhang, D. Tao, X. Huang, On combining multiple features for hyperspectral remote sensing image classification, IEEE Transactions on Geoscience and Remote Sensing 50 (3) (2012) 879–893.
- (45) Y. Li, B. Geng, Z.-J. Zha, D. Tao, L. Yang, C. Xu, Difficulty guided image retrieval using linear multiview embedding, in: Proc. ACM International Conference on Multimedia, ACM, 2011, pp. 1169–1172.
- (46) E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, F. Moreno-Noguer, Discriminative learning of deep convolutional feature point descriptors, in: Proc. IEEE International Conference on Computer Vision, 2015, pp. 118–126.
- (47) B. Kumar, G. Carneiro, I. Reid, et al., Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5385–5394.
- (48) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, et al., Tensorflow: Large-scale machine learning on heterogeneous distributed systems, arXiv preprint arXiv:1603.04467 (2016).
- (49) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Proc. International Conference on Learning Representations, 2015.
- (50) R. B. Rusu, S. Cousins, 3d is here: Point cloud library (pcl), in: Proc. IEEE International Conference on Robotics and Automation, 2011, pp. 1–4.
- (51) P. J. Besl, N. D. McKay, Method for registration of 3-d shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (2) (1992) 239–256.
- (52) A. S. Mian, M. Bennamoun, R. Owens, Three-dimensional model-based object recognition and segmentation in cluttered scenes, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (10) (2006) 1584–1601.
- (53) A. Mian, M. Bennamoun, R. Owens, On the repeatability and quality of keypoints for local feature-based 3d object retrieval from cluttered scenes, International Journal of Computer Vision 89 (2-3) (2010) 348–361.
- (54) Y. Guo, M. Bennamoun, F. Sohel, M. Lu, J. Wan, N. M. Kwok, A comprehensive performance evaluation of 3d local feature descriptors, International Journal of Computer Vision 116 (1) (2016) 66–89.
- (55) S. Choi, Q.-Y. Zhou, V. Koltun, Robust reconstruction of indoor scenes, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5556–5565.
- (56) Q.-Y. Zhou, J. Park, V. Koltun, Fast global registration, in: Proc. European Conference on Computer Vision, Springer, 2016, pp. 766–782.
- (57) D. Holz, A. E. Ichim, F. Tombari, R. B. Rusu, S. Behnke, Registration with the point cloud library: A modular framework for aligning in 3-d, IEEE Robotics & Automation Magazine 22 (4) (2015) 110–124.
- (58) M. Muja, D. G. Lowe, Scalable nearest neighbor algorithms for high dimensional data, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (11) (2014) 2227–2240.
- (59) M. A. Fischler, R. C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM 24 (6) (1981) 381–395.
- (60) B. Drost, M. Ulrich, N. Navab, S. Ilic, Model globally, match locally: Efficient and robust 3d object recognition, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 998–1005.
- (61) N. Mellado, D. Aiger, N. J. Mitra, Super 4pcs fast global pointcloud registration via smart indexing, in: Computer Graphics Forum, Vol. 33, Wiley Online Library, 2014, pp. 205–215.
- (62) T. Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollar, Focal loss for dense object detection, IEEE Transactions on Pattern Analysis and Machine Intelligence (99) (2017) 2999–3007.