GeoDesc: Learning Local Descriptors by Integrating Geometry Constraints

07/17/2018 ∙ by Zixin Luo, et al. ∙ The Hong Kong University of Science and Technology 6

Learned local descriptors based on Convolutional Neural Networks (CNNs) have achieved significant improvements on patch-based benchmarks, whereas not having demonstrated strong generalization ability on recent benchmarks of image-based 3D reconstruction. In this paper, we mitigate this limitation by proposing a novel local descriptor learning approach that integrates geometry constraints from multi-view reconstructions, which benefit the learning process in data generation, data sampling and loss computation. We refer to the proposed descriptor as GeoDesc, and demonstrate its superior performance on various large-scale benchmarks, and in particular show its great success on challenging reconstruction cases. Moreover, we provide guidelines towards practical integration of learned descriptors in Structure-from-Motion (SfM) pipelines, showing the good trade-off that GeoDesc delivers to 3D reconstruction tasks between accuracy and efficiency.



There are no comments yet.


page 4

page 6

page 12

Code Repositories


Implementation of ECCV'18 paper - GeoDesc: Learning Local Descriptors by Integrating Geometry Constraints

view repo


TensorFlow implementation of GeoDesc (ECCV'18), ContextDesc (CVPR'19) and ASLFeat (CVPR'20)

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Computing local descriptors on interest regions serves as the subroutine of various computer vision applications such as panorama stitching 

[1], wide baseline matching [2]

, image retrieval 

[3], and Structure-from-Motion (SfM) [4, 5, 6, 7]. A powerful descriptor is expected to be invariant to both photometric and geometric changes, such as illumination, blur, rotation, scale and perspective changes. Due to the reliability, efficiency and portability, hand-crafted descriptors such as SIFT [8] have been influentially dominating this field for more than a decade. Until recently, great efforts have been made on developing learned descriptors based on Convolutional Neural Networks (CNNs), which have achieved surprising results on patch-based benchmarks such as HPatches dataset [9]. However, on image-based datasets such as ETH local features benchmark [10], learned descriptors are found to underperform advanced variants of hand-crafted descriptors. The contradictory findings raise the concern of integrating those purportedly better descriptors in real applications, and show significant room of improvement for developing more powerful descriptors that generalize to a wider range of scenarios.

One possible cause of above contradictions, as demonstrated in [10], is the lack of generalization ability as a consequence of data insufficiency. Although previous research [11, 12, 13] discusses several effective sampling methods that produce seemingly large amount of training data, the generalization ability is still bounded to limited data sources, e.g., the widely-used Brown dataset [14] with only 3 image sets. Hence, it is not surprising that resulting descriptors tend to overfit to particular scenarios. To overcome it, research such as [15, 16] applies extra regularization for compact feature learning. Meanwhile, LIFT [17] and [18] seek to enhance data diversity and generate training data from reconstructions of Internet tourism data. However, the existing limitation has not yet been fully mitigated, while intermediate geometric information is overlooked in the learning process despite the robust geometric property that local patch preserves, e.g., the well approximation of local deformations [19].

Besides, we lack guidelines for integrating learned descriptors in practical pipelines such as SfM. In particular, the ratio criterion, as suggested in [8] and justified in [20], has received almost no individual attention or was considered inapplicable for learned descriptors [10], whereas it delivers excellent matching efficiency and accuracy improvements, and serves as the necessity for pipelines such as SfM to reject false matches and seed feasible initialization. A general method to apply ratio criterion for learned descriptors is in need in practice.

In this paper, we tackle above issues by presenting a novel learning framework that takes advantage of geometry constraints from multi-view reconstructed data. In particular, we address the importance of data sampling for descriptor learning and summarize our contributions threefold. i) We propose a novel batch construction method that simulates the pair-wise matching and effectively samples useful data for learning process. ii) Collaboratively, we propose a loss formulation to reduce overfitting and improve the performance with geometry constraints. iii) We provide guidelines about ratio criterion, compactness and scalability towards practical portability of learned descriptors.

We evaluate the proposed descriptor, referred to as GeoDesc, on traditional [21] and recent two large-scale datasets [9, 10]. Superior performance is shown over the state-of-the-art hand-crafted and learned descriptors. We mitigate previous limitations by showing consistent improvements on both patch-based and image-based datasets, and further demonstrate its success on challenging 3D reconstructions.

2 Related Works

Networks design. Due to weak semantics and efficiency requirements, existing descriptor learning often relies on shallow and thin networks, e.g., three-layer networks in DDesc [12]

with 128-dimensional output features. Moreover, although widely-used in high-level computer vision tasks, max pooling is found to be unsuitable for descriptor learning, which is then replaced by L2 pooling in DDesc 

[12] or even removed in L2-Net [15]. To further incorporate scale information, DeepCompare [22] and L2-Net [15]

use a two-stream central-surround structure which delivers consistent improvements at extra computational cost. To improve the rotational invariance, an orientation estimator is proposed in 

[23]. Besides of feature learning, previous efforts are also made on joint metric learning as in [22, 24, 25], whereas comparison in Euclidean space is more preferable by recent works [11, 12, 15, 17, 26] in order to guarantee its efficiency.

Loss formulation Various of loss formulations have been explored for effective descriptor learning. Initially, networks with a learned metric use softmax loss [22, 24] and cast the descriptor learning to a binary classification problem (similar/dissimilar). With weakly annotated data, [27] formulates the loss on keypoint bags. More generally, pair-wise loss [12, 17] and triplet loss [11, 25, 26] are used by networks without a learned metric. Both loss formulations encourage matching patches to be close whereas non-matching patches to be far-away in some measure space. In particular, triplet loss delivers better results [11, 25] as it suffers less overfitting [28]. For effective training, recent L2-Net [15] and HardNet [29] use the structured loss for data sampling which drastically improves the performance. To further boost the performance, extra regularizations are introduced for learning compact representation in [15, 16].

Evaluation protocol Previous works often evaluate on datasets such as [30, 31, 21]. However, those datasets either are small, or lack diversity to generalize well to various applications of descriptors. As a result, the evaluation results are commonly inconsistent or even contradictory to each other as pointed out in [9], which limits the application of learned descriptors. Two novel benchmarks, HPatches [9] and ETH local descriptor benchmark [10] have been recently introduced with clearly defined protocols and better generalization properties. However, inconsistency still exists in the two benchmarks, where HPatches [9] benchmark demonstrates the significant outperformance from learned descriptors over the handcrafted, whereas the ETH local descriptor benchmark [10] finds that the advanced variants of the traditional descriptor are at least on par with the learning-based. The inconclusive results indicate that there is still significant room for improvement to learn more powerful feature descriptors.

3 Method

3.1 Network architecture

We borrow the network in L2-Net [15]

, where the feature tower is constructed by eschewing pooling layers and using strided convolutional layers for in-network downsampling. Each convolutional layer except the last one is followed by a batch normalization (BN) layer whose weighting and bias parameters are fixed to

and . The L2-normalization layer after the last convolution produces the final

-dimensional feature vector.

3.2 Training data generation

Acquiring high quality training data is important in learning tasks. In this section, we discuss a practical pipeline that automatically produces well-annotated data suitable for descriptor learning.

2D correspondence generation. Similar to LIFT [17], we rely on successful 3D reconstructions to generate ground truth 2D correspondences in an automatic manner. First, sparse reconstructions are obtained from standard SfM pipeline [32]. Then, 2D correspondences are generated by projecting 3D point clouds. In general, SfM is used to filter out most mismatches among images.

Although verified by SfM, the generated correspondences are still outlier-contaminated from image noise and wrongly registered cameras. It happens particularly often on Internet tourism datasets such as 

[33, 34] (illustrated in Fig. 1(a)), and usually not likely to be filtered by simply limiting reprojection error. To improve data quality, we take one step further than LIFT by computing the visibility check based on 3D Delaunay triangulation [35] which is widely-used for outlier filtering in dense stereo. Empirically, of 3D points will be discarded after the filtering while only points with high precision are kept for ground truth generation. Fig. 1(b) gives an example to illustrate its effect.

Figure 1: (a) Outlier matches after SfM verification (by COLMAP [32]) on Gendarmenmarkt dataset [33]. The reprojection error (next to the image) cannot be used to identify false matches. (b) Reconstructed sparse point cloud (top), where points in red (bottom) indicate being filtered by Delaunay triangulation and only reliable points in green are kept. The number of points decreases from 75k to 53k after the filtering.

Matching patch generation. Next, the interest region of a 2D projection is cropped similar to LIFT, which is formulated by an similarity transformation


where are input and output regular sampling grids, and are keypoint parameters ( coordinates, scale and orientation) from SIFT detector. The constant is set to as in LIFT, resulting in patches.

Due to the robust estimation of scale () and orientation () parameters of SIFT even in extreme cases [36], the resulting patches are mostly free of scale and rotation differences, thus suitable for training. In later experiments of image matching or SfM, we rely on the same cropping method to achieve scale and rotation invariance for learned descriptors.

3.3 Geometric similarity estimation

Geometries at a 3D point are robust and provide rich information. Inspired by the MVS (Multi-View Stereo) accuracy measurement in [37], we define two types of geometric similarity: patch similarity and image similarity, which will facilitate later data sampling and loss formulation.

Patch similarity. We define patch similarity to measure the difficulty to have a patch pair matched with respect to perspective changes. Formally, given a patch pair, we relate it to its corresponding 3D track which is seen by cameras centering at and . Next, we compute the vertex normal at from the surface model. The geometric relationship is illustrated in Fig. 2(a). Finally, we formulate as


where measures the intersection angle between two viewing rays from the 3D track (), while measures the difference of incidence angles between a viewing ray and the vertex normal from the 3D track (). The angle metric is defined as . As an interpretation, and measure the perspective change regarding a 3D point and local 3D surface, respectively. The effect of is illustrated in Fig. 2(b).

The accuracy of and depends on sparse and mesh reconstructions, respectively, and is generally sufficient for its use as shown in [37]. The similarity does not consider scale and rotation changes as already resolved from Equation 1. Empirically, we choose and (in degree).

Image similarity. Based on the patch similarity, we define the image similarity as the average patch similarity of the correspondences between an image pair. The image similarity measures the difficulty to match an image pair and can be interpreted as a measurement of perspective change. Examples are given in Fig. 2(c). The image similarity will be beneficial for data sampling in Sec. 3.4.

Figure 2: (a) The patch similarity relies on the geometric relationship between cameras, tracks and surface normal. (b) The effect of patch similarity, which measures the difficulty to have a patch pair matched with respect to the perspective change. (c) The effect of image similarity, which measures the perspective change between an image pairs. (d) Batch data constructed by L2-Net [15] and HardNet [29] (top), whose in-batch patch pairs are often distinctive to each other and thus contribute nothing to the loss in the late learning (e.g., the margin-based loss). However, the batch data from the proposed batch construction method (bottom) consists of similar patch pairs due to the spatially close keypoints or repetitive patterns, which are considered harder to distinguish and thus raise greater challenges for learning

3.4 Batch construction

For descriptor learning, most existing frameworks take patch pairs (matching/non-matching) or patch triplets (reference, matching and non-matching) as input. As in previous studies, the convergence rate is highly dependent on being able to see useful data [38]. Here, “useful” data often refers to patch pairs/triplets that produce meaningful loss for learning. However, the effective sampling of such data is generally challenging due to the intractably large number of patch pair/triplet combination in the database. Hence, on one hand, sampling strategies such as hard negative mining [12] and anchor swap [11] are proposed, while on the other hand, effective batch construction is used in [15, 25, 29] to compare the reference patch against all the in-batch samples in the loss computation.

Inspired by previous works, we propose a novel batch construction method that effectively samples “useful” data by relying on geometry constraints from SfM, including the image matching results and image similarity , to simulate the pair-wise image matching and sample data. Formally, given one image pair, we extract a match set , where is the set size and is a matching patch pair surviving the SfM verification. A training batch is then constructed on match sets. Hence, the learning objective becomes to improve the matching quality for each match set. In Sec. 3.5, we will discuss the loss computation on each match set and batch data.

Compared with L2-Net [15] and HardNet [29] whose training batches are random sampled from the whole database, the proposed method produces harder samples and thus raises greater challenges for learning. As an example shown in Fig. 2(d), the training batch constructed by the proposed method consists of many similar patterns, due to the spatially close keypoints or repetitive textures. In general, such training batch has two major advantages for descriptor learning:

  • [leftmargin=*, noitemsep, topsep=0pt]

  • It reflects the in-practice complexity. In real applications, image patches are often extracted between image pairs for matching. The proposed method simulates this scenario so that training and testing become more consistent.

  • It generates hard samples for training. As observed in [11, 12, 29, 38], hard samples are critical to fast convergence and performance improvement for descriptor learning. The proposed method effectively generates batch data that is sufficiently hard, while not being overfitting as constructed on real matching results instead of model inference results [12].

To further boost the training efficiency, we exclude image pairs that are too similar to contribute to the learning. Those pairs are effectively identified by the image similarity defined in Sec. 3.3. In practice, we discard image pairs whose are larger than (e.g., the toppest pair in Fig. 2(c)), which results in a shrink of training samples.

3.5 Loss formulation

We formulate the loss with two terms: structured loss and geometric loss.

Structured loss. The structured loss used in L2-Net [15] and HardNet [29] is essentially suitable to consume the batch samples constructed in Sec. 3.4. In particular, the formulation in HardNet based on the “hardest-in-batch” strategy and a distance margin shows to be more effective than the log-likelihood formulation in L2-Net. However, we observe successive overfitting when applying the HardNet loss to our batch data, which we ascribe to the too strong constraint of “hardest-in-batch” strategy. In this strategy, the loss is computed on the data sample that produces the largest loss, and a margin with a large value ( in HardNet) is set to push the non-matching pairs away from matching pairs. In our batch data, we already effectively sample the “hard” data which is often visually similar, thus forcing a large margin is inapplicable and stalls the learning. One simple solution is to decrease the margin value, whereas the performance drops significantly in our experiments.

To avoid above limitation and better take advantage of our batch data, we propose the loss formulation as follows. First, we compute the structured loss for one match set. Given normalized features computed on match set for all

, the cosine similarity matrix is derived by

. Next, we compute and formulate the loss as


where is the element in , and is the distance ratio mimicking the behavior of ratio test [8] and pushing away non-matching pairs from matching pairs. Finally, we take the average of the loss on each match set to derive the final loss for one training batch.

The proposed formulation is distinctive from HardNet in three aspects. First, we compute the cosine similarity instead of Euclidean distance for computational efficiency. Second, we apply a distance ratio margin instead of a fixed distance margin as an adaptive margin to reduce overfitting. Finally, we compute the mean value of each loss element instead of the maximum (“hardest-in-batch”) in order to cooperate the proposed batch construction.

Geometric loss. Although ensures matching patch pairs to be distant from the non-matching, it does not explicitly encourage matching pairs to be close in its measure space. One simple solution is to apply a typical pair-wise loss in [12], whereas taking a risk of positive collapse and overfitting as observed in [28]. To overcome it, we adaptively set up the margin regarding the patch similarity defined in Sec. 3.3, serving as a soft constraint for maximizing the positive similarity. We refer to this term as geometric loss and formulate it as


where is the adaptive margin, is the element in , namely, the cosine similarity of patch pair , while is the patch similarity for . We use as the final loss, and empirically set and to and .

3.6 Training

We use image sets [33] as in LIFT [17], the SfM data in [34], and further collect several image sets to form the training database. Based on COLMAP [32], we run 3D reconstructions to establish necessary geometry constraints. Image sets that are overlapping with the benchmark data are manually excluded. We train the networks from scratch using Adam with a base learning rate of and weight decay of . The learning rate decays by every steps. Data augmentation includes randomly flipping, degrees rotation and brightness and contrast adjustment. The match set size and batch size are 64 and 12, respectively. Input patches are standardized to have zero mean and unit norm.

4 Experiments

We evaluate the proposed descriptor on three datasets: the patch-based HPatches [9] benchmark, the image-based Heinly benchmark [21] and ETH local features benchmark [10]. We further demonstrate on challenging SfM examples.

4.1 HPatches benchmark

HPatches benchmark [9]

defines three complementary tasks: patch verification, patch matching, and patch retrieval. Different levels of geometrical perturbations are imposed to form EASY, HARD and TOUGH patch groups. In the task of verification, two subtasks are defined based on whether negative pairs are sampled from images within the same (SAMESEQ) or different sequences (DIFFSEQ). In the task of matching, two subtasks are defined based on whether the principle variance is viewpoint (VIEW) or illumination (ILLUM). Following 

[9], we use mean average precision (mAP) to measure the performance for all three tasks on HPatches split ‘full’.

We select five descriptors to compare: SIFT as the baseline, RSIFT [39] and DDesc [12] as the best-performing hand-crafted and learned descriptors concluded in [9]. Moreoever, we experiment with recent learned descriptors L2-Net [15] and HardNet [29]. The proposed descriptor is referred to as GeoDesc.

Figure 3: Left to right: Verification, matching and retrieval results on HPatches dataset, split ‘full’. Results on different patch groups are colorized, while DIFFSEQ/SAMESEQ and ILLUM/VIEW denote the subtasks of verification and matching, respectively

As shown in Fig. 3, GeoDesc surpasses all the other descriptors on all three tasks by a large margin. In particular, the performance on TOUGH patch group is significantly improved, which indicates the superior invariance to large image changes of GeoDesc. Interestingly, comparing GeoDesc with HardNet, we observe some performance drop on EASY groups especially for illumination changes, which can be ascribed to the data bias for SfM data. Though applying randomness such as illumination during the training, we cannot fully mitigate this limitation which asks more diverse real data in descriptor learning.

In addition, we evaluate different configurations of GeoDesc on HPatches as shown in Tab. 1 to demonstrate the effect of each part of our method.

  • [leftmargin=*, noitemsep, topsep=0pt]

  • Config. 1: the HardNet framework as the baseline.

  • Config. 2: trained with the SfM data in Sec. 3.2. Compared with Config. 1, it is shown that crowd-sourced training data essentially improves the generalization ability. Meanwhile, on the other hand, Config. 2 can be regarded as an extension of LIFT [17] with more advanced loss formulation.

  • Config. 3: equipped with the proposed batch construction in Sec. 3.4. As discussed in Sec. 3.5, the “hardest-in-batch” strategy in HardNet is inapplicable to hard batch data and thus leads to performance drop compared with Config. 2. In practice, we need to adjust the margin value from in HardNet to , otherwise the training will not even converge. Though trainable, the smaller margin value harms the final performance.

  • Config. 4: equipped with the modified structured loss in Sec. 3.5. Notable performance improvements are achieved over Config. 2 due to the collaborative use of proposed methods, showing the effectiveness of simulating pair-wise matching and sampling hard data. Besides, replacing the distance margin with distance ratio can improve the training efficiency, as shown in Fig. 4.

  • Config. 5: equipped with the geometric loss in Sec. 3.5. Further improvements are obtained over Config. 4 as constrains the solution space and enhances the training efficiency.

To sum up, the “hardest-in-batch” strategy is beneficial when no other sampling is applied and most in-batch samples do not contribute to the loss. However, with harder batch data effectively constructed, it is advantageous to replace the “hardest-in-batch” and further boost the descriptor performance.


GeoDesc Configuration HPathces Benchmark Tasks


No. SfM Data Batch Construct. Verification Matching Retrieval


1 - - - - 88.4 52.8 69.8
2 Y - - - 90.1 57.0 73.2
3 Y Y - - 89.9 50.2 70.4
4 Y Y Y - 90.9 58.5 74.5
5 Y Y Y Y 91.1 59.1 74.9


Figure 4: Effect of taking distance ratio in loss computation. The metric is the validation accuracy of patch triplets with a margin of by cosine similarity.
Table 1: Evaluation of different configurations of GeoDesc on HPatches. Modules are enabled if marked with “Y” otherwise with “-”. SfM Data denotes the training with our SfM data, Batch Construct. denotes the equipment of proposed batch construction, while and denote the use of proposed structured loss and geometric loss, respectively. The last configuration (Config. 5) is our best model with GeoDesc

4.2 Heinly benchmark

Different from HPatches which experiments on image patches, the benchmark by Heinly et al. [21] evaluates pair-wise image matching regarding different types of photometric or geometric changes, targeting to provide practical insights for strengths and weaknesses of descriptors. We use two standard metrics as in [21] to quantify the matching quality. First, the Matching Score = #Inlier Matches / #Features. Second, the Recall = #Inlier Matches / #True Matches. Four descriptor are selected to compare: SIFT, the baseline hand-crafted descriptor; DSP-SIFT, the best hand-crafted descriptor even superior to the previous learning-based as evaluated in [10]; L2-Net and HardNet, the recent advanced learned descriptors. For fairness comparison, no ratio test and only cross check (mutual test) is applied for all descriptors.


Matching Score in % Recall in %


SIFT DSP-SIFT L2-Net HardNet GeoDesc SIFT DSP-SIFT L2-Net HardNet GeoDesc


JPEG 31.9 35.1 25.7 27.0 34.7 60.7 66.9 49.0 51.5 66.1
Blur 12.4 14.3 9.1 11.3 14.4 41.0 47.3 30.1 37.4 47.7
Exposure 32.9 34.8 33.9 34.9 36.3 78.2 82.6 80.4 82.8 86.4
Day-Night 5.6 5.7 6.8 7.4 7.5 29.2 29.7 35.6 38.9 39.6
Scale 35.8 34.7 32.6 34.8 37.8 81.2 78.8 73.6 79.0 85.8
Rotation 56.3 49.1 55.9 57.4 59.8 82.4 71.8 81.9 84.0 87.6
Scale-rotation 12.6 12.0 10.7 12.1 14.3 29.6 28.1 25.0 28.5 33.7
Planar 23.8 24.8 25.6 27.4 29.1 48.2 50.4 51.9 55.6 59.1


Table 2: Evaluation results on pair-wise image matching on benchmark by Heinly et al. [21] with respect to different types of image changes

Evaluation results are shown in Tab. 2. Compared with DSP-SIFT, GeoDesc performs comparably regarding image quality changes (compression, blur), while notably better for illumination and geometric changes (rotation, scale, viewpoint). On the other hand, GeoDesc delivers significant improvements on L2-Net and HardNet and particularly narrows the gap in terms of photometric changes, which makes GeoDesc applicable to different scenarios in real applications.

4.3 ETH local features benchmark

The ETH local features benchmark [10] focuses on image-based 3D reconstruction tasks. We compare GeoDesc with SIFT, DSP-SIFT and L2-Net, and follow the same protocols in [10] to quantify the SfM quality, including the number of registered images (# Registered), reconstructed sparse points (# Sparse Points), image observations (# Observations), mean track length (Track Length) and mean reprojection error (Reproj. Error). For fairness comparison, we apply no distance ratio test for descriptor matching and extract features at the same keypoints as in [10].

As observed in Tab. 3, first, GeoDesc performs best on # Registered, which is generally considered as the most important SfM metric that directly quantifies the reconstruction completeness. Second, GeoDesc achieves best results on # Sparse Points and # Observations, which indicates the superior matching quality in the early step of SfM. However, GeoDesc fails to get best statistics about Track Length and Reproj. Error as GeoDesc computes the two metrics on significantly larger # Sparse Points. In terms of datasets whose scale is small and have similar track number (Fountain, Herzjesu), GeoDesc gives the longest Track Length.

To sum up, GeoDesc surpasses both the previous best-performing DSP-SIFT and recent advanced L2-Net by a notable margin. In addition, it is noted that L2-Net also shows consistent improvements over DSP-SIFT, which demonstrates the power of taking structured loss for learned descriptors.


# Images # Registered # Sparse Points # Observations Track Length Reproj. Error


Fountain SIFT 11 11 10,004 44K 4.49 0.30px
DSP-SIFT 11 14,785 71K 4.80 0.41px
L2-Net 11 16,119 78K 4.86 0.43px
GeoDesc 11 16,687 83K 4.99 0.46px


Herzjesu SIFT 8 8 4,916 19K 4.00 0.32px
DSP-SIFT 8 7,760 32K 4.19 0.45px
L2-Net 8 8,473 36K 4.27 0.47px
GeoDesc 8 8,720 38K 4.34 0.55px


South Building SIFT 128 128 62,780 353K 5.64 0.42px
DSP-SIFT 128 110,394 664K 6.02 0.57px
L2-Net 128 155,780 798K 5.13 0.58px
GeoDesc 128 170,306 887K 5.21 0.64px


Madrid Metropolis SIFT 1,344 440 62,729 416K 6.64 0.53px
DSP-SIFT 476 107,028 681K 6.36 0.64px
L2-Net 692 254,142 1,067K 4.20 0.69px
GeoDesc 809 306,976 1,200K 3.91 0.66px


Gendarmenmarkt SIFT 1,463 950 169,900 1,010K 5.95 0.64px
DSP-SIFT 975 321,846 1,732K 5.38 0.74px
L2-Net 1,168 667,392 2,611K 3.91 0.73px
GeoDesc 1,208 779,814 2,903K 3.72 0.74px


Tower of London SIFT 1,576 702 142,746 963K 6.75 0.53px
DSP-SIFT 755 236,598 1,761K 7.44 0.64px
L2-Net 1,049 558,673 2,617K 4.68 0.67px
GeoDesc 1,081 622,076 2,852K 4.58 0.69px


Alamo SIFT 2,915 743 120,713 1,384K 11.47 0.54px
DSP-SIFT 754 144,341 1,815K 12.58 0.66px
L2-Net 882 318,787 2,932K 9.17 0.76px
GeoDesc 893 353,329 3,159K 8.94 0.84px


Roman Forum SIFT 2,364 1,407 242,192 1,805K 7.45 0.61px
DSP-SIFT 1,583 372,573 2,879K 7.73 0.71px
L2-Net 1,537 708,794 4,530K 6.39 0.69px
GeoDesc 1,566 770,363 5,051K 6.56 0.73px


Cornell SIFT 6,514 4,999 1,010,544 6,317K 6.25 0.53px
DSP-SIFT 4,946 1,177,916 7,233K 6.14 0.67px
L2-Net 5,557 2,706,215 15,710K 5.81 0.72px
GeoDesc 5,823 3,076,476 17,550K 5.70 0.96px


Table 3: Evaluation results on ETH local features benchmark [10] for SfM tasks

4.4 Challenging 3D reconstructions

To further demonstrate the effect of the proposed descriptor in a context of 3D reconstruction, we showcase selective image sets whose reconstructions fail or are in low quality with a typical SIFT-based 3D reconstruction pipeline but get significantly improved by integrating GeoDesc.

Figure 5: Testing cases of challenging image sets, where a traditional SIFT-based reconstruction pipeline fails to apply but GeoDesc delivers significant improvement.

From examples shown in Fig. 5, it is clear to see the benefit of deploying GeoDesc in a reconstruction pipeline. First, by robust matching resistant to photometric and geometric changes, a complete sparse reconstruction registered with more cameras can be obtained. Second, due to more accurate camera pose estimation, the final fined mesh reconstruction is then derived.

5 Practical Guidelines

In this section, we discuss several practical guidelines to complement the performance evaluation and provide insights towards real applications. Following experiments are conducted with extra high-resolution image pairs, whose keypoints are downsampled to

10k per image. We use a single NVIDIA GTX 1080 GPU with TensorFlow 

[40], and forward each batch with 256 patches.

5.1 Ratio criterion

The ratio criterion [8] compares the distance between the first and the second nearest neighbor, and establishes a match if the former is smaller than the latter to some ratio. For SfM tasks, the ratio criterion improves overall matching quality, RANSAC efficiency, and seeds robust initialization. Despite those benefits, the ratio criterion has received little attention, or even been considered inapplicable to learned descriptors in previous studies [10]. Here, we propose a general method to determine the ratio that well cooperates with existing SfM pipelines.

The general idea is simple: the new ratio should function similarly as SIFT’s, as most SfM pipelines are parameterized for SIFT. To quantify the effect of the ratio criterion, we use the metric Precision = #Inlier Matches / #Putative matches, and determine the ratio that achieves similar Precision as SIFT’s. As an example in Fig. 6, we compute the Precision of SIFT and GeoDesc on our experimental dataset, and find the best ratio for GeoDesc is at which it gives similar Precision () as SIFT (). This ratio is applied to experiments in Sec. 4.4 and shows robust results and compatibility in the practical SfM pipeline.

5.2 Compactness study

A compact feature representation generally indicates better performance with respective to discriminativeness and scalability. To quantify the compactness, we reply on the intermediate result in Principal Component Analysis (PCA). First, we compute the explained variance

which is stored in increasing order for each feature dimension indexed by . Then we estimate the compact dimensionality (denoted as Compact-Dim) by finding the minimal that satisfies , where is a given threshold and is the original feature dimensionality. In this experiment, we set , so that the Compact-Dim can be interpreted as the minimal dimensionality required to convey more than information of the original feature. Obviously, larger Compact-Dim indicates less redundancy, namely greater compactness.

As a result, the Compact-Dim estimated on 4 millions feature vectors for SIFT, DSP-SIFT, L2-Net and GeoDesc is 56, 63, 75 and 100, respectively. The ranking of Compact-Dim effectively responds to previous performance evaluations, where descriptors with larger Compact-Dim yield better results.

5.3 Scalability study

Computational cost. As evaluated in [9, 10], the efficiency of learned descriptors is on par with traditional descriptors such as CPU-based SIFT. Here, we further compare with GPU-based SIFT [41] to provide insights about practicability. We evaluate the running time in three steps. First, keypoint detection and canonical orientation estimation by SIFT-GPU. Next, patches cropping by Equ. 1. Finally, feature inference of image patches. The computational cost and memory demand are shown in Tab. 4, indicating that with GPU support, not surprisingly, SIFT () is still faster than the learned descriptor (), with a narrow gap due to the parallel implementation. For applications heavily relying on matching quality (e.g., 3D reconstruction), the proposed descriptor achieves a good trade-off to replace SIFT.

Quantization. To conserve disk space, I/O and memory, we linearly map feature vectors of GeoDesc from to and round each element to unsigned-char value. The quantization does not affect the performance as evaluated on HPatches benchmark.

Figure 6: Determine the ratio criterion of GeoDesc so that it has the same Precision as SIFT (at )
Table 4:

Computational cost and memory demand of feature extraction of GeoDesc in three steps: SIFT-GPU extraction, patch cropping and feature inference. The total time cost is evaluated with three steps implemented in a parallel fashion


SIFT Crop. Infer. Total


Device GPU CPU GPU -
Memory (GB) 3.3 2.7 0.3 -
Time (s) 0.20 0.28 0.31 0.31


6 Conclusions

In contrast to prior work, we have addressed the advantages of integrating geometry constraints for descriptor learning, which benefits the learning process in terms of ground truth data generation, data sampling and loss computation. Also, we have discussed several guidelines, in particular, the ratio criterion, towards practical portability. Finally, we have demonstrated the superior performance and generalization ability of the proposed descriptor, GeoDesc, on three benchmark datasets in different scenarios, We have further shown the significant improvement of GeoDesc on challenging reconstructions, and the good trade-off between efficiency and accuracy to deploy GeoDesc in real applications.


  • [1] Li, S., Yuan, L., Sun, J., Quan, L.: Dual-feature warping-based motion model estimation. ICCV (2015)
  • [2] Mishkin, D., Matas, J., Perdoch, M., Lenc, K.: Wxbs: Wide baseline stereo generalizations. BMVC (2015)
  • [3] Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. CVPR (2007)
  • [4] Zhu, S., Zhang, R., Zhou, L., Shen, T., Fang, T., Tan, P., Quan, L.: Very large-scale global sfm by distributed motion averaging. CVPR (2018)
  • [5] Shen, T., Zhu, S., Fang, T., Zhang, R., Quan, L.: Graph-based consistent matching for structure-from-motion. European Conference on Computer Vision (2016)
  • [6] Zhang, R., Zhu, S., Fang, T., Quan, L.: Distributed very large scale bundle adjustment by global camera consensus. ICCV (2017)
  • [7] Zhu, S., Fang, T., Xiao, J., Quan, L.: Local readjustment for high-resolution 3d reconstruction. CVPR (2014)
  • [8] Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. IJCV (2004)
  • [9] Balntas, V., Lenc, K., Vedaldi, A., Mikolajczyk, K.: HPatches: A Benchmark and Evaluation of Handcrafted and Learned Local Lescriptors. CVPR (2017)
  • [10] Schönberger, J.L., Hardmeier, H., Sattler, T., Pollefeys, M.: Comparative evaluation of hand-crafted and learned local features. CVPR (2017)
  • [11] Balntas, V., Riba, E., Ponsa, D., Mikolajczyk, K.: Learning Local Feature Descriptors with Triplets and Shallow Convolutional Neural Networks. BMVC (2016)
  • [12] Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., Moreno-Noguer, F.: Discriminative Learning of Deep Convolutional Feature Point Descriptors. CVPR (2015)
  • [13] Sohn, K.: Improved Deep Metric Learning with Multi-class N-pair Loss Objective. NIPS (2016)
  • [14] Brown, M.A., Hua, G., Winder, S.A.J.: Discriminative Learning of Local Image Descriptors. PAMI (2011)
  • [15] Tian, BFY, Wu, F: L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space. CVPR (2017)
  • [16] Zhang, X., Yu, F.X., Kumar, S., Chang, S.F.: Learning spread-out local feature descriptors. CVPR (2017)
  • [17] Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: LIFT - Learned Invariant Feature Transform. ECCV (2016)
  • [18] Mitra, R., Doiphode, N., Gautam, U., Narayan, S., Ahmed, S., Chandran, S., Jain, A.: A large dataset for improving patch matching. arXiv, year=2018
  • [19] Morel, J.M., Yu, G.: Asift: A new framework for fully affine invariant image comparison. SIAM journal on imaging sciences
  • [20] Kaplan, A., Avraham, T., Lindenbaum, M.: Interpreting the ratio criterion for matching sift descriptors. ECCV (2016)
  • [21] Heinly, Jared, Dunn, Enrique, Frahm, Jan-Michael: Comparative Evaluation of Binary Features. ECCV (2012)
  • [22] Zagoruyko, S., Komodakis, N.: Learning to compare image patches via convolutional neural networks. CVPR (2015)
  • [23] Yi, K.M., Verdie, Y., Fua, P., Lepetit, V.: Learning to Assign Orientations to Feature Points. CVPR (2015)
  • [24] Han, X., Leung, T., Jia, Y., Sukthankar, R., Berg, A.C.: MatchNet - Unifying Feature and Metric Learning for Patch-based Matching. CVPR (2015)
  • [25] G, V.K.B., Carneiro, G., Reid, I.:

    Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimizing Global Loss Functions.

    CVPR (2016)
  • [26] Balntas, V., Johns, E., Tang, L., Mikolajczyk, K.: PN-Net: Conjoined Triple Deep Network for Learning Local Image Descriptors. arXiv (2016)
  • [27] Markuš, N., Pandžić, I.S., Ahlberg, J.: Learning local descriptors by optimizing the keypoint-correspondence criterion. ICPR (2016)
  • [28] Lin, J., Morere, O., Chandrasekhar, V., Veillard, A., Goh, H.: Deephash: Getting regularization, depth and fine-tuning right. arXiv (2015)
  • [29] Mishchuk, A., Mishkin, D., Radenovic, F.: Working Hard to Know Your Neighbor’s Margins: Local Descriptor Learning Loss. NIPS (2017)
  • [30] Mikolajczyk, K., Schmid, C.: A Performance Evaluation of Local Descriptors. PAMI (2005)
  • [31] Winder, S., Hua, G., Brown, M.: Picking the best daisy. CVPR (2009)
  • [32] Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. CVPR (2016)
  • [33] Wilson, K., Snavely, N.: Robust Global Translations with 1DSfM. ECCV (2014)
  • [34] Radenovic, F., Tolias, G., Chum, O.: CNN Image Retrieval Learns from BoW - Unsupervised Fine-Tuning with Hard Examples. ECCV (2016)
  • [35] Labatut, P., Pons, J.P., Keriven, R.: Efficient multi-view reconstruction of large-scale scenes using interest points, delaunay triangulation and graph cuts. ICCV (2007)
  • [36] Zhou, L., Zhu, S., Shen, T., Wang, J., Fang, T., Quan, L.: Progressive large scale-invariant image matching in scale space. ICCV (2017)
  • [37] Zhang, R., Li, S., Fang, T., Zhu, S., Quan, L.: Joint Camera Clustering and Surface Segmentation for Large-Scale Multi-view Stereo. ICCV (2015)
  • [38] Movshovitz-Attias, Y., Toshev, A., Leung, T.K., Ioffe, S., Singh, S.: No Fuss Distance Metric Learning using Proxies. ICCV (2017)
  • [39] Arandjelovic, R., Zisserman, A.: Three Things Everyone Should Know to Improve Object Retrieval. CVPR (2012)
  • [40] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.:

    Tensorflow: Large-scale machine learning on heterogeneous distributed systems.

    arXiv (2016)
  • [41] Wu, C.: Siftgpu: A gpu implementation of sift. http://cs. unc. edu/~ ccwu/siftgpu (2007)