Implementation of ECCV'18 paper - GeoDesc: Learning Local Descriptors by Integrating Geometry Constraints
Learned local descriptors based on Convolutional Neural Networks (CNNs) have achieved significant improvements on patch-based benchmarks, but have not demonstrated strong generalization ability on recent benchmarks of image-based 3D reconstruction. In this paper, we mitigate this limitation by proposing a novel local descriptor learning approach that integrates geometry constraints from multi-view reconstructions, which benefit the learning process in data generation, data sampling and loss computation. We refer to the proposed descriptor as GeoDesc, and demonstrate its superior performance on various large-scale benchmarks, and in particular show its great success on challenging reconstruction cases. Moreover, we provide guidelines towards practical integration of learned descriptors in Structure-from-Motion (SfM) pipelines, showing the good trade-off that GeoDesc delivers to 3D reconstruction tasks between accuracy and efficiency.
Computing local descriptors on interest regions serves as a subroutine of various computer vision applications such as panorama stitching, wide baseline matching [3], and Structure-from-Motion (SfM) [4, 5, 6, 7]. A powerful descriptor is expected to be invariant to both photometric and geometric changes, such as illumination, blur, rotation, scale and perspective changes. Owing to their reliability, efficiency and portability, hand-crafted descriptors such as SIFT have dominated this field for more than a decade. More recently, great efforts have been made to develop learned descriptors based on Convolutional Neural Networks (CNNs), which have achieved surprising results on patch-based benchmarks such as the HPatches dataset. However, on image-based datasets such as the ETH local features benchmark, learned descriptors are found to underperform advanced variants of hand-crafted descriptors. These contradictory findings raise concerns about integrating the purportedly better descriptors in real applications, and show significant room for improvement in developing more powerful descriptors that generalize to a wider range of scenarios.
One possible cause of the above contradictions, as demonstrated in prior work, is the lack of generalization ability as a consequence of data insufficiency. Although previous research [11, 12, 13] discusses several effective sampling methods that produce a seemingly large amount of training data, the generalization ability is still bounded by limited data sources, e.g., the widely-used Brown dataset with only 3 image sets. Hence, it is not surprising that the resulting descriptors tend to overfit to particular scenarios. To overcome this, research such as [15, 16] applies extra regularization for compact feature learning. Meanwhile, LIFT and related work seek to enhance data diversity and generate training data from reconstructions of Internet tourism data. However, the existing limitation has not yet been fully mitigated, and intermediate geometric information is overlooked in the learning process despite the robust geometric properties that local patches preserve, e.g., the good approximation of local deformations.
Besides, we lack guidelines for integrating learned descriptors in practical pipelines such as SfM. In particular, the ratio criterion, suggested and justified in earlier work, has received almost no individual attention or has been considered inapplicable to learned descriptors, even though it delivers excellent matching efficiency and accuracy improvements, and is a necessity for pipelines such as SfM to reject false matches and seed feasible initialization. A general method for applying the ratio criterion to learned descriptors is needed in practice.
In this paper, we tackle the above issues by presenting a novel learning framework that takes advantage of geometry constraints from multi-view reconstructed data. In particular, we address the importance of data sampling for descriptor learning and summarize our contributions as threefold. i) We propose a novel batch construction method that simulates pair-wise matching and effectively samples useful data for the learning process. ii) Collaboratively, we propose a loss formulation to reduce overfitting and improve the performance with geometry constraints. iii) We provide guidelines on the ratio criterion, compactness and scalability towards practical portability of learned descriptors.
We evaluate the proposed descriptor, referred to as GeoDesc, on a traditional benchmark and two recent large-scale datasets [9, 10]. Superior performance is shown over the state-of-the-art hand-crafted and learned descriptors. We mitigate previous limitations by showing consistent improvements on both patch-based and image-based datasets, and further demonstrate its success on challenging 3D reconstructions.
Network design. Due to weak semantics and efficiency requirements, existing descriptor learning often relies on shallow and thin networks, e.g., three-layer networks in DDesc with 128-dimensional output features. Moreover, although widely used in high-level computer vision tasks, max pooling is found to be unsuitable for descriptor learning, and is replaced by L2 pooling in DDesc or even removed in L2-Net. To further incorporate scale information, DeepCompare and L2-Net use a two-stream central-surround structure, which delivers consistent improvements at extra computational cost. To improve rotational invariance, an orientation estimator has also been proposed. Besides feature learning, previous efforts have also been made on joint metric learning as in [22, 24, 25], whereas comparison in Euclidean space is preferred by recent works [11, 12, 15, 17, 26] in order to guarantee efficiency.
Loss formulation. Various loss formulations have been explored for effective descriptor learning. Initially, networks with a learned metric use softmax loss [22, 24] and cast descriptor learning as a binary classification problem (similar/dissimilar). With weakly annotated data, the loss has also been formulated on keypoint bags. More generally, pair-wise loss [12, 17] and triplet loss [11, 25, 26] are used by networks without a learned metric. Both loss formulations encourage matching patches to be close and non-matching patches to be far apart in some measure space. In particular, triplet loss delivers better results [11, 25] as it suffers less from overfitting. For effective training, the recent L2-Net and HardNet use a structured loss for data sampling, which drastically improves the performance. To further boost the performance, extra regularizations are introduced for learning compact representations in [15, 16].
Evaluation protocol. Previous works often evaluate on datasets such as [30, 31, 21]. However, those datasets are either small or lack the diversity to generalize well to various applications of descriptors. As a result, the evaluation results are commonly inconsistent or even contradictory to each other, as pointed out in prior work, which limits the application of learned descriptors. Two novel benchmarks, HPatches and the ETH local descriptor benchmark, have recently been introduced with clearly defined protocols and better generalization properties. However, inconsistency still exists between the two benchmarks: the HPatches benchmark shows learned descriptors significantly outperforming hand-crafted ones, whereas the ETH local descriptor benchmark finds that advanced variants of the traditional descriptor are at least on par with the learning-based ones. The inconclusive results indicate that there is still significant room for improvement in learning more powerful feature descriptors.
We borrow the network in L2-Net, where the feature tower is constructed by eschewing pooling layers and using strided convolutional layers for in-network downsampling. Each convolutional layer except the last one is followed by a batch normalization (BN) layer whose weighting and bias parameters are fixed to 1 and 0. The L2-normalization layer after the last convolution produces the final 128-dimensional feature vector.
Acquiring high quality training data is important in learning tasks. In this section, we discuss a practical pipeline that automatically produces well-annotated data suitable for descriptor learning.
2D correspondence generation. Similar to LIFT, we rely on successful 3D reconstructions to generate ground truth 2D correspondences in an automatic manner. First, sparse reconstructions are obtained from a standard SfM pipeline. Then, 2D correspondences are generated by projecting 3D point clouds. In general, SfM is used to filter out most mismatches among images.
Although verified by SfM, the generated correspondences are still contaminated by outliers from image noise and wrongly registered cameras. This happens particularly often on Internet tourism datasets such as [33, 34] (illustrated in Fig. 1(a)), and such outliers are unlikely to be filtered by simply limiting the reprojection error. To improve data quality, we take one step further than LIFT by computing a visibility check based on 3D Delaunay triangulation, which is widely used for outlier filtering in dense stereo. Empirically, a portion of the 3D points is discarded after the filtering, and only points with high precision are kept for ground truth generation. Fig. 1(b) gives an example to illustrate its effect.
Matching patch generation. Next, the interest region of a 2D projection is cropped similarly to LIFT, which is formulated as a similarity transformation (Equation 1) between the input and output regular sampling grids, parameterized by the keypoint parameters (coordinates, scale and orientation) from the SIFT detector. A constant, set as in LIFT, determines the size of the resulting patches.
Due to the robust estimation of the scale and orientation parameters of SIFT even in extreme cases, the resulting patches are mostly free of scale and rotation differences, and thus suitable for training. In later experiments on image matching or SfM, we rely on the same cropping method to achieve scale and rotation invariance for learned descriptors.
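The cropping step can be pictured with a small sketch. It is only an illustration under stated assumptions, not the paper's Equation 1: the output grid is assumed to span [-1, 1] on both axes, the patch half-width is assumed to be `k * scale / 2`, the orientation is assumed to be in radians, and the function name `sampling_grid` is hypothetical.

```python
import math

def sampling_grid(x, y, scale, orientation, k, size):
    """Map a regular size x size output grid to input-image coordinates.

    Assumptions (not from the text): the output grid spans [-1, 1], the
    patch half-width is k * scale / 2, and orientation is in radians.
    """
    half = k * scale / 2.0
    cos_t, sin_t = math.cos(orientation), math.sin(orientation)
    grid = []
    for r in range(size):
        v = -1.0 + 2.0 * r / (size - 1)      # normalized output row coordinate
        row = []
        for c in range(size):
            u = -1.0 + 2.0 * c / (size - 1)  # normalized output column coordinate
            # Rotate by the keypoint orientation, scale, then translate
            # to the keypoint location.
            xs = x + half * (cos_t * u - sin_t * v)
            ys = y + half * (sin_t * u + cos_t * v)
            row.append((xs, ys))
        grid.append(row)
    return grid
```

Sampling the image at the returned coordinates (for example with bilinear interpolation) would yield a patch that is normalized for scale and rotation, which is the property the paragraph above relies on.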
Geometries at a 3D point are robust and provide rich information. Inspired by the MVS (Multi-View Stereo) accuracy measurement in prior work, we define two types of geometric similarity: patch similarity and image similarity, which will facilitate later data sampling and loss formulation.
Patch similarity. We define patch similarity to measure the difficulty of matching a patch pair with respect to perspective changes. Formally, given a patch pair, we relate it to its corresponding 3D track, which is seen by the two cameras that observe the patches. Next, we compute the vertex normal at the 3D track from the surface model. The geometric relationship is illustrated in Fig. 2(a). Finally, we formulate the patch similarity from two angle terms. The first measures the intersection angle between the two viewing rays from the 3D track, while the second measures the difference of incidence angles between a viewing ray and the vertex normal at the 3D track; an angle metric maps each of them to a similarity value. As an interpretation, the two terms measure the perspective change regarding a 3D point and the local 3D surface, respectively. The effect of the patch similarity is illustrated in Fig. 2(b).
The accuracy of the two terms depends on sparse and mesh reconstructions, respectively, and is generally sufficient for this use. The similarity does not consider scale and rotation changes, as these are already resolved by Equation 1. The parameters of the angle metric (in degrees) are chosen empirically.
Image similarity. Based on the patch similarity, we define the image similarity as the average patch similarity of the correspondences between an image pair. The image similarity measures the difficulty to match an image pair and can be interpreted as a measurement of perspective change. Examples are given in Fig. 2(c). The image similarity will be beneficial for data sampling in Sec. 3.4.
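The two geometric similarities can be sketched as follows. The angle computations follow the description above; everything else is an assumption made for illustration: the angle metric is taken to be a Gaussian falloff, the two terms are combined by a product, and the names `patch_similarity`, `image_similarity`, `sigma_ray` and `sigma_normal` are hypothetical.

```python
import math

def _angle_deg(a, b):
    """Angle in degrees between two 3D vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    cos = max(-1.0, min(1.0, dot / (na * nb)))
    return math.degrees(math.acos(cos))

def _falloff(angle_deg, sigma_deg):
    # Assumed angle metric: Gaussian falloff that equals 1.0 at zero angle.
    return math.exp(-(angle_deg ** 2) / (2.0 * sigma_deg ** 2))

def patch_similarity(point, normal, cam_a, cam_b, sigma_ray, sigma_normal):
    """Similarity of a patch pair from the geometry at its 3D point."""
    ray_a = [c - p for c, p in zip(cam_a, point)]
    ray_b = [c - p for c, p in zip(cam_b, point)]
    # Term 1: intersection angle between the two viewing rays.
    ray_angle = _angle_deg(ray_a, ray_b)
    # Term 2: difference of incidence angles w.r.t. the vertex normal.
    incidence_diff = _angle_deg(ray_a, normal) - _angle_deg(ray_b, normal)
    # Assumed combination: product of the two per-term similarities.
    return _falloff(ray_angle, sigma_ray) * _falloff(incidence_diff, sigma_normal)

def image_similarity(patch_similarities):
    """Average patch similarity over the correspondences of an image pair."""
    return sum(patch_similarities) / len(patch_similarities)
```

Two cameras on the same viewing ray give a similarity of 1.0 under these assumptions, and the value decays as the viewpoints diverge.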
For descriptor learning, most existing frameworks take patch pairs (matching/non-matching) or patch triplets (reference, matching and non-matching) as input. As shown in previous studies, the convergence rate is highly dependent on being able to see useful data. Here, “useful” data often refers to patch pairs/triplets that produce a meaningful loss for learning. However, the effective sampling of such data is generally challenging due to the intractably large number of patch pair/triplet combinations in the database. Hence, on the one hand, sampling strategies such as hard negative mining and anchor swap are proposed, while on the other hand, effective batch construction is used in [15, 25, 29] to compare the reference patch against all the in-batch samples in the loss computation.
Inspired by previous works, we propose a novel batch construction method that effectively samples “useful” data by relying on geometry constraints from SfM, including the image matching results and the image similarity, to simulate pair-wise image matching and sample data. Formally, given one image pair, we extract a match set of fixed size, in which each element is a matching patch pair surviving the SfM verification. A training batch is then constructed from several match sets. Hence, the learning objective becomes improving the matching quality for each match set. In Sec. 3.5, we will discuss the loss computation on each match set and batch data.
Compared with L2-Net and HardNet, whose training batches are randomly sampled from the whole database, the proposed method produces harder samples and thus raises greater challenges for learning. As shown in the example in Fig. 2(d), the training batch constructed by the proposed method consists of many similar patterns, due to spatially close keypoints or repetitive textures. In general, such a training batch has two major advantages for descriptor learning:
It reflects the in-practice complexity. In real applications, image patches are often extracted between image pairs for matching. The proposed method simulates this scenario so that training and testing become more consistent.
It generates hard samples for training. As observed in [11, 12, 29, 38], hard samples are critical to fast convergence and performance improvement for descriptor learning. The proposed method effectively generates batch data that is sufficiently hard, while avoiding overfitting since it is constructed from real matching results instead of model inference results.
To further boost the training efficiency, we exclude image pairs that are too similar to contribute to the learning. Those pairs are effectively identified by the image similarity defined in Sec. 3.3. In practice, we discard image pairs whose image similarity is larger than a threshold (e.g., the topmost pair in Fig. 2(c)), which shrinks the set of training samples.
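A minimal sketch of this batch construction, assuming each image pair is stored as a dictionary with its SfM-verified matches and its image similarity. The key names, the function name `build_batch` and the threshold parameter are hypothetical, and uniform random sampling is an assumption.

```python
import random

def build_batch(image_pairs, set_size, num_sets, max_image_similarity, rng=None):
    """Build one training batch as a list of match sets.

    image_pairs: list of dicts with hypothetical keys 'matches' (list of
    SfM-verified patch pairs) and 'image_similarity' (float).
    """
    rng = rng or random.Random()
    # Skip pairs that are too similar, or that have too few verified matches
    # to fill one match set.
    usable = [p for p in image_pairs
              if p["image_similarity"] <= max_image_similarity
              and len(p["matches"]) >= set_size]
    # Each match set comes from a single image pair, so in-batch negatives
    # are patches from the same two images.
    chosen = rng.sample(usable, num_sets)
    return [rng.sample(p["matches"], set_size) for p in chosen]
```

Because every match set is drawn from one image pair, the non-matching patches seen in the loss are the ones a matcher would actually have to tell apart in that pair.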
We formulate the loss with two terms: structured loss and geometric loss.
Structured loss. The structured loss used in L2-Net and HardNet is well suited to consuming the batch samples constructed in Sec. 3.4. In particular, the formulation in HardNet, based on the “hardest-in-batch” strategy and a distance margin, proves more effective than the log-likelihood formulation in L2-Net. However, we observe overfitting when applying the HardNet loss to our batch data, which we ascribe to the overly strong constraint of the “hardest-in-batch” strategy. In this strategy, the loss is computed on the data sample that produces the largest loss, and a large margin is set to push the non-matching pairs away from matching pairs. In our batch data, we already effectively sample “hard” data that is often visually similar, so forcing a large margin is inapplicable and stalls the learning. One simple solution is to decrease the margin value, but the performance then drops significantly in our experiments.
To avoid the above limitation and better take advantage of our batch data, we propose the following loss formulation. First, we compute the structured loss for one match set. Given the normalized features computed on both sides of all patch pairs in a match set, the cosine similarity matrix is derived as the product of one feature matrix and the transpose of the other. Next, we formulate the loss on the elements of this matrix using a distance ratio, which mimics the behavior of the ratio test and pushes non-matching pairs away from matching pairs. Finally, we take the average of the loss on each match set to derive the final loss for one training batch.
The proposed formulation differs from HardNet in three aspects. First, we compute the cosine similarity instead of the Euclidean distance for computational efficiency. Second, we apply a distance ratio margin instead of a fixed distance margin, as an adaptive margin to reduce overfitting. Finally, we compute the mean value of the loss elements instead of the maximum (“hardest-in-batch”) in order to work with the proposed batch construction.
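The three aspects can be made concrete with a small sketch of a ratio-margin structured loss on one match set. This is one plausible reading of the description, not the paper's exact equation: the ratio margin is assumed to enter as `alpha` times the matching similarity, hinge terms are assumed along both the row and the column of each match, and the names `structured_loss` and `alpha` are hypothetical.

```python
def structured_loss(feats_a, feats_b, alpha):
    """Ratio-margin structured loss on one match set (assumed form).

    feats_a[i] and feats_b[i] are L2-normalized descriptors of the i-th
    matching patch pair; alpha in (0, 1) plays the role of the distance ratio.
    """
    n = len(feats_a)
    # Cosine similarity matrix: sim[i][j] = <feats_a[i], feats_b[j]>.
    sim = [[sum(p * q for p, q in zip(fa, fb)) for fb in feats_b]
           for fa in feats_a]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Penalize a non-matching similarity that exceeds alpha times the
            # matching similarity, along both the row and the column of i.
            total += max(0.0, sim[i][j] - alpha * sim[i][i])
            total += max(0.0, sim[j][i] - alpha * sim[i][i])
    # Mean over the ordered non-matching pairs, not the in-batch maximum.
    return total / (n * (n - 1))
```

Since the margin scales with the matching similarity itself, a weakly matching pair is asked for a smaller gap than a strongly matching one, which is the adaptive behavior contrasted with a fixed distance margin.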
Geometric loss. Although the structured loss ensures that matching patch pairs are distant from the non-matching ones, it does not explicitly encourage matching pairs to be close in the measure space. One simple solution is to apply a typical pair-wise loss, which however carries a risk of positive collapse and overfitting, as observed in prior work. To overcome this, we adaptively set up the margin according to the patch similarity defined in Sec. 3.3, serving as a soft constraint for maximizing the positive similarity. We refer to this term as the geometric loss. It compares an adaptive margin, determined by the patch similarity of each matching patch pair, against the cosine similarity of that pair, namely the corresponding element of the cosine similarity matrix. We use the combination of the structured loss and the geometric loss as the final loss, with its parameters set empirically.
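As a rough sketch of the geometric loss, assuming a hinge on the gap between the adaptive margin and the cosine similarity of each matching pair, averaged over the match set. The step-wise mapping from patch similarity to margin and all numeric values used with it are hypothetical placeholders, not the paper's settings.

```python
def margin_from_patch_similarity(s_patch, steps):
    """Pick an adaptive margin for one matching pair.

    steps: list of (min_patch_similarity, margin) sorted by descending
    min_patch_similarity. The values are hypothetical placeholders.
    """
    for lower, margin in steps:
        if s_patch >= lower:
            return margin
    return steps[-1][1]

def geometric_loss(match_sims, margins):
    """Hinge on matching pairs whose cosine similarity is below their margin.

    match_sims[i]: cosine similarity of the i-th matching pair.
    margins[i]: adaptive margin chosen from that pair's patch similarity.
    """
    hinge = [max(0.0, m - s) for s, m in zip(match_sims, margins)]
    return sum(hinge) / len(hinge)
```

The intent of such a mapping is that pairs with little perspective change (high patch similarity) are asked to reach a higher cosine similarity than pairs seen from very different viewpoints, so the constraint stays soft for hard pairs.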
We use image sets as in LIFT, together with additional SfM data, and further collect several image sets to form the training database. Based on COLMAP, we run 3D reconstructions to establish the necessary geometry constraints. Image sets that overlap with the benchmark data are manually excluded. We train the networks from scratch using Adam with weight decay and a step-wise learning rate decay. Data augmentation includes random flipping, rotation, and brightness and contrast adjustment. The match set size and batch size are 64 and 12, respectively. Input patches are standardized to have zero mean and unit norm.
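The input standardization mentioned above can be sketched in a few lines. The function name `standardize_patch` and the handling of constant patches are assumptions made for illustration.

```python
import math

def standardize_patch(pixels):
    """Standardize one patch to zero mean and unit L2 norm.

    pixels: flat list of grayscale values for a single patch.
    """
    mean = sum(pixels) / len(pixels)
    centered = [p - mean for p in pixels]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        # Constant patch: return zeros rather than divide by zero (assumed policy).
        return centered
    return [c / norm for c in centered]
```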
We evaluate the proposed descriptor on three datasets: the patch-based HPatches benchmark, the image-based Heinly benchmark and the ETH local features benchmark. We further demonstrate on challenging SfM examples.
The HPatches benchmark defines three complementary tasks: patch verification, patch matching, and patch retrieval. Different levels of geometrical perturbations are imposed to form EASY, HARD and TOUGH patch groups. In the task of verification, two subtasks are defined based on whether negative pairs are sampled from images within the same sequence (SAMESEQ) or from different sequences (DIFFSEQ). In the task of matching, two subtasks are defined based on whether the principal variation is viewpoint (VIEW) or illumination (ILLUM). Following the benchmark protocol, we use mean average precision (mAP) to measure the performance for all three tasks on the HPatches split ‘full’.
We select five descriptors to compare: SIFT as the baseline, and RSIFT and DDesc as the best-performing hand-crafted and learned descriptors, as concluded in prior work. Moreover, we experiment with the recent learned descriptors L2-Net and HardNet. The proposed descriptor is referred to as GeoDesc.
As shown in Fig. 3, GeoDesc surpasses all the other descriptors on all three tasks by a large margin. In particular, the performance on the TOUGH patch group is significantly improved, which indicates the superior invariance of GeoDesc to large image changes. Interestingly, comparing GeoDesc with HardNet, we observe some performance drop on EASY groups, especially for illumination changes, which can be ascribed to the data bias of SfM data. Although we apply random perturbations such as illumination changes during training, we cannot fully mitigate this limitation, which calls for more diverse real data in descriptor learning.
In addition, we evaluate different configurations of GeoDesc on HPatches as shown in Tab. 1 to demonstrate the effect of each part of our method.
Config. 1: the HardNet framework as the baseline.
Config. 3: equipped with the proposed batch construction in Sec. 3.4. As discussed in Sec. 3.5, the “hardest-in-batch” strategy in HardNet is inapplicable to hard batch data and thus leads to a performance drop compared with Config. 2. In practice, we need to reduce the margin value used in HardNet, otherwise the training will not even converge. Though trainable, the smaller margin value harms the final performance.
Config. 4: equipped with the modified structured loss in Sec. 3.5. Notable performance improvements are achieved over Config. 2 due to the collaborative use of proposed methods, showing the effectiveness of simulating pair-wise matching and sampling hard data. Besides, replacing the distance margin with distance ratio can improve the training efficiency, as shown in Fig. 4.
Config. 5: equipped with the geometric loss in Sec. 3.5. Further improvements are obtained over Config. 4, as the geometric loss constrains the solution space and enhances the training efficiency.
To sum up, the “hardest-in-batch” strategy is beneficial when no other sampling is applied and most in-batch samples do not contribute to the loss. However, with harder batch data effectively constructed, it is advantageous to replace the “hardest-in-batch” and further boost the descriptor performance.
|GeoDesc Configuration||HPatches Benchmark Tasks|
|No.||SfM Data||Batch Construct.||Verification||Matching||Retrieval|
Different from HPatches, which experiments on image patches, the benchmark by Heinly et al. evaluates pair-wise image matching regarding different types of photometric or geometric changes, aiming to provide practical insights into the strengths and weaknesses of descriptors. We use the two standard metrics of that benchmark to quantify the matching quality. First, the Matching Score = #Inlier Matches / #Features. Second, the Recall = #Inlier Matches / #True Matches. Four descriptors are selected for comparison: SIFT, the baseline hand-crafted descriptor; DSP-SIFT, the best hand-crafted descriptor, which was even found superior to previous learning-based descriptors; and L2-Net and HardNet, the recent advanced learned descriptors. For a fair comparison, no ratio test is applied and only cross check (mutual test) is used for all descriptors.
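The cross check and the two metrics can be illustrated with a brute-force sketch. It assumes Euclidean distance between descriptors and takes the inlier and ground-truth counts as given; the function names are hypothetical and the benchmark's own inlier verification is not reproduced here.

```python
def mutual_matches(desc_a, desc_b):
    """Cross check: keep (i, j) only if i and j are each other's nearest neighbor."""
    def nearest(query, pool):
        # Squared Euclidean distance is enough for ranking neighbors.
        dists = [sum((p - q) ** 2 for p, q in zip(query, cand)) for cand in pool]
        return dists.index(min(dists))

    a_to_b = [nearest(d, desc_b) for d in desc_a]
    b_to_a = [nearest(d, desc_a) for d in desc_b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]

def matching_score(num_inliers, num_features):
    """Matching Score = #Inlier Matches / #Features."""
    return num_inliers / num_features

def recall(num_inliers, num_true_matches):
    """Recall = #Inlier Matches / #True Matches."""
    return num_inliers / num_true_matches
```

In this toy form the cross check is the only filter on putative matches, which mirrors the setting where no ratio test is applied.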
|Matching Score in %||Recall in %|
Evaluation results are shown in Tab. 2. Compared with DSP-SIFT, GeoDesc performs comparably regarding image quality changes (compression, blur), while notably better for illumination and geometric changes (rotation, scale, viewpoint). On the other hand, GeoDesc delivers significant improvements over L2-Net and HardNet and particularly narrows the gap in terms of photometric changes, which makes GeoDesc applicable to different scenarios in real applications.
The ETH local features benchmark focuses on image-based 3D reconstruction tasks. We compare GeoDesc with SIFT, DSP-SIFT and L2-Net, and follow the protocols of the benchmark to quantify the SfM quality, including the number of registered images (# Registered), reconstructed sparse points (# Sparse Points), image observations (# Observations), mean track length (Track Length) and mean reprojection error (Reproj. Error). For a fair comparison, we apply no distance ratio test for descriptor matching and extract features at the same keypoints as in the benchmark.
As observed in Tab. 3, first, GeoDesc performs best on # Registered, which is generally considered the most important SfM metric as it directly quantifies the reconstruction completeness. Second, GeoDesc achieves the best results on # Sparse Points and # Observations, which indicates superior matching quality in the early steps of SfM. However, GeoDesc does not obtain the best Track Length and Reproj. Error, as these two metrics are computed on a significantly larger number of sparse points. On small-scale datasets with similar numbers of tracks (Fountain, Herzjesu), GeoDesc gives the longest Track Length.
To sum up, GeoDesc surpasses both the previous best-performing DSP-SIFT and recent advanced L2-Net by a notable margin. In addition, it is noted that L2-Net also shows consistent improvements over DSP-SIFT, which demonstrates the power of taking structured loss for learned descriptors.
|Dataset||Descriptor||# Images||# Registered||# Sparse Points||# Observations||Track Length||Reproj. Error|
|Tower of London||SIFT||1,576||702||142,746||963K||6.75||0.53px|
To further demonstrate the effect of the proposed descriptor in the context of 3D reconstruction, we showcase selected image sets whose reconstructions fail or are of low quality with a typical SIFT-based 3D reconstruction pipeline but are significantly improved by integrating GeoDesc.
From the examples shown in Fig. 5, the benefit of deploying GeoDesc in a reconstruction pipeline is clear. First, thanks to robust matching that is resistant to photometric and geometric changes, a more complete sparse reconstruction with more registered cameras can be obtained. Second, due to more accurate camera pose estimation, a refined final mesh reconstruction is then derived.
In this section, we discuss several practical guidelines to complement the performance evaluation and provide insights towards real applications. The following experiments are conducted with extra high-resolution image pairs, whose keypoints are downsampled to 10k per image. We use a single NVIDIA GTX 1080 GPU with TensorFlow, and forward each batch with 256 patches.
The ratio criterion compares the distances to the first and the second nearest neighbor, and establishes a match if the former is smaller than the latter by some ratio. For SfM tasks, the ratio criterion improves overall matching quality and RANSAC efficiency, and seeds robust initialization. Despite those benefits, the ratio criterion has received little attention, or has even been considered inapplicable to learned descriptors in previous studies. Here, we propose a general method to determine a ratio that cooperates well with existing SfM pipelines.
The general idea is simple: the new ratio should function similarly to SIFT’s, as most SfM pipelines are parameterized for SIFT. To quantify the effect of the ratio criterion, we use the metric Precision = #Inlier Matches / #Putative Matches, and determine the ratio that achieves a Precision similar to SIFT’s. As an example in Fig. 6, we compute the Precision of SIFT and GeoDesc on our experimental dataset, and take as the best ratio for GeoDesc the one at which its Precision is similar to that of SIFT. This ratio is applied to the experiments in Sec. 4.4 and shows robust results and compatibility in the practical SfM pipeline.
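The calibration procedure can be sketched as a small search over candidate ratios. It assumes that, for each keypoint, the first and second nearest-neighbor distances and an inlier flag are available; the function names and the candidate grid are hypothetical.

```python
def ratio_test(dist_first, dist_second, ratio):
    """Accept a match when the nearest distance is below ratio * second nearest."""
    return dist_first < ratio * dist_second

def precision_at_ratio(candidates, ratio):
    """Precision = #Inlier Matches / #Putative Matches after the ratio test.

    candidates: list of (dist_first, dist_second, is_inlier), one per keypoint.
    """
    kept = [inl for d1, d2, inl in candidates if ratio_test(d1, d2, ratio)]
    return sum(kept) / len(kept) if kept else 0.0

def calibrate_ratio(candidates, target_precision, ratios):
    """Pick the candidate ratio whose Precision is closest to the target,
    e.g. the Precision that SIFT reaches with its usual ratio."""
    return min(ratios,
               key=lambda r: abs(precision_at_ratio(candidates, r) - target_precision))
```

A typical use would first measure SIFT's Precision on the calibration images and then pass it as `target_precision` with the learned descriptor's candidate matches.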
A compact feature representation generally indicates better performance with respect to discriminativeness and scalability. To quantify the compactness, we rely on the intermediate result of Principal Component Analysis (PCA). First, we compute the explained variance $v_j$ of each feature dimension indexed by $j$, stored in decreasing order. Then we estimate the compact dimensionality (denoted as Compact-Dim) by finding the minimal $k$ that satisfies $\sum_{j=1}^{k} v_j / \sum_{j=1}^{D} v_j \ge \epsilon$, where $\epsilon$ is a given threshold and $D$ is the original feature dimensionality. With the threshold fixed, the Compact-Dim can be interpreted as the minimal dimensionality required to convey more than the given fraction of the information in the original feature. A larger Compact-Dim thus indicates less redundancy, namely greater compactness.
As a result, the Compact-Dim estimated on 4 million feature vectors for SIFT, DSP-SIFT, L2-Net and GeoDesc is 56, 63, 75 and 100, respectively. The ranking of Compact-Dim corresponds well to the previous performance evaluations, where descriptors with larger Compact-Dim yield better results.
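The Compact-Dim estimate can be sketched as a cumulative sum over PCA explained variances. The sketch assumes the variances are accumulated from largest to smallest and that the threshold is a fraction in (0, 1]; the function name `compact_dim` is hypothetical, and the PCA itself is taken as already computed.

```python
def compact_dim(explained_variance, threshold):
    """Smallest number of principal components whose cumulative explained
    variance reaches the given fraction of the total variance."""
    ordered = sorted(explained_variance, reverse=True)  # largest first (assumed)
    total = sum(ordered)
    running = 0.0
    for k, v in enumerate(ordered, start=1):
        running += v
        if running / total >= threshold:
            return k
    return len(ordered)
```

A flat variance spectrum needs many components to reach the threshold and gives a large Compact-Dim, whereas a spectrum dominated by a few components gives a small one.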
Computational cost. As evaluated in [9, 10], the efficiency of learned descriptors is on par with traditional descriptors such as CPU-based SIFT. Here, we further compare with GPU-based SIFT to provide insights about practicability. We evaluate the running time in three steps: first, keypoint detection and canonical orientation estimation by SIFT-GPU; next, patch cropping by Equ. 1; finally, feature inference on image patches. The computational cost and memory demand are shown in Tab. 4, indicating that with GPU support, not surprisingly, SIFT is still faster than the learned descriptor, though the gap is narrow due to the parallel implementation. For applications heavily relying on matching quality (e.g., 3D reconstruction), the proposed descriptor achieves a good trade-off to replace SIFT.
Quantization. To conserve disk space, I/O and memory, we linearly map feature vectors of GeoDesc from [-1, 1] to [0, 255] and round each element to an unsigned-char value. The quantization does not affect the performance as evaluated on the HPatches benchmark.
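A minimal sketch of such a quantization, assuming each element of an L2-normalized descriptor lies in [-1, 1] and is stored as an 8-bit value in [0, 255]. The function names and the round-half-up rule are assumptions for illustration.

```python
import math

def quantize(feature):
    """Map each element from [-1, 1] to an integer in [0, 255] (assumed ranges)."""
    # Linear map: -1 -> 0 and 1 -> 255, rounded half up and clamped to the byte range.
    return [min(255, max(0, math.floor((v + 1.0) * 127.5 + 0.5))) for v in feature]

def dequantize(codes):
    """Approximate inverse of quantize, back to [-1, 1]."""
    return [c / 127.5 - 1.0 for c in codes]
```

Under these assumptions the round-trip error per element is at most half a quantization step, i.e. about 0.004, and a 128-dimensional descriptor occupies 128 bytes instead of 512 bytes in 32-bit floats.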
Table 4: Computational cost and memory demand of feature extraction of GeoDesc in three steps: SIFT-GPU extraction, patch cropping and feature inference. The total time cost is evaluated with the three steps implemented in a parallel fashion.
In contrast to prior work, we have addressed the advantages of integrating geometry constraints for descriptor learning, which benefits the learning process in terms of ground truth data generation, data sampling and loss computation. We have also discussed several guidelines, in particular the ratio criterion, towards practical portability. Finally, we have demonstrated the superior performance and generalization ability of the proposed descriptor, GeoDesc, on three benchmark datasets in different scenarios. We have further shown the significant improvement of GeoDesc on challenging reconstructions, and the good trade-off between efficiency and accuracy when deploying GeoDesc in real applications.
Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimizing Global Loss Functions. CVPR (2016)
TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv (2016)