1 Introduction
Computing local descriptors on interest regions serves as a subroutine of various computer vision applications such as panorama stitching [1], wide baseline matching [2, 3], and Structure-from-Motion (SfM) [4, 5, 6, 7]. A powerful descriptor is expected to be invariant to both photometric and geometric changes, such as illumination, blur, rotation, scale and perspective changes. Due to their reliability, efficiency and portability, handcrafted descriptors such as SIFT [8] have dominated this field for more than a decade. Recently, great efforts have been made on developing learned descriptors based on Convolutional Neural Networks (CNNs), which have achieved surprising results on patch-based benchmarks such as the HPatches dataset [9]. However, on image-based datasets such as the ETH local features benchmark [10], learned descriptors are found to underperform advanced variants of handcrafted descriptors. These contradictory findings raise concerns about integrating the purportedly better descriptors in real applications, and show significant room for improvement in developing more powerful descriptors that generalize to a wider range of scenarios.
One possible cause of the above contradictions, as demonstrated in [10], is the lack of generalization ability as a consequence of data insufficiency. Although previous research [11, 12, 13] discusses several effective sampling methods that produce a seemingly large amount of training data, the generalization ability is still bounded by limited data sources, e.g., the widely-used Brown dataset [14] with only 3 image sets. Hence, it is not surprising that the resulting descriptors tend to overfit to particular scenarios. To overcome this, research such as [15, 16] applies extra regularization for compact feature learning. Meanwhile, LIFT [17] and [18] seek to enhance data diversity and generate training data from reconstructions of Internet tourism data.
However, this limitation has not yet been fully mitigated, and intermediate geometric information is overlooked in the learning process, despite the robust geometric properties that local patches preserve, e.g., their good approximation of local deformations [19].
Besides, we lack guidelines for integrating learned descriptors in practical pipelines such as SfM. In particular, the ratio criterion, as suggested in [8] and justified in [20], has received almost no individual attention or has even been considered inapplicable to learned descriptors [10], even though it delivers excellent improvements in matching efficiency and accuracy, and is a necessity for pipelines such as SfM to reject false matches and seed feasible initialization. A general method for applying the ratio criterion to learned descriptors is needed in practice.
In this paper, we tackle the above issues by presenting a novel learning framework that takes advantage of geometry constraints from multi-view reconstructed data. In particular, we address the importance of data sampling for descriptor learning, and our contributions are threefold. i) We propose a novel batch construction method that simulates pairwise matching and effectively samples useful data for the learning process. ii) Collaboratively, we propose a loss formulation that reduces overfitting and improves performance with geometry constraints. iii) We provide guidelines on the ratio criterion, compactness and scalability towards practical portability of learned descriptors.
We evaluate the proposed descriptor, referred to as GeoDesc, on a traditional benchmark [21] and two recent large-scale datasets [9, 10]. Superior performance is shown over state-of-the-art handcrafted and learned descriptors. We mitigate previous limitations by showing consistent improvements on both patch-based and image-based datasets, and further demonstrate its success on challenging 3D reconstructions.
2 Related Works
Network design. Due to weak semantics and efficiency requirements, existing descriptor learning often relies on shallow and thin networks, e.g., the three-layer network in DDesc [12] with 128-dimensional output features. Moreover, although widely used in high-level computer vision tasks, max pooling is found to be unsuitable for descriptor learning, and is thus replaced by L2 pooling in DDesc [12] or even removed in L2Net [15]. To further incorporate scale information, DeepCompare [22] and L2Net [15] use a two-stream central-surround structure which delivers consistent improvements at extra computational cost. To improve rotational invariance, an orientation estimator is proposed in [23]. Besides feature learning, previous efforts have also been made on joint metric learning as in [22, 24, 25], whereas comparison in Euclidean space is preferred by recent works [11, 12, 15, 17, 26] in order to guarantee efficiency.
Loss formulation. Various loss formulations have been explored for effective descriptor learning. Initially, networks with a learned metric use a softmax loss [22, 24] and cast descriptor learning as a binary classification problem (similar/dissimilar). With weakly annotated data, [27] formulates the loss on keypoint bags. More generally, pairwise loss [12, 17] and triplet loss [11, 25, 26] are used by networks without a learned metric. Both loss formulations encourage matching patches to be close and non-matching patches to be far away in some measure space. In particular, the triplet loss delivers better results [11, 25] as it suffers less from overfitting [28]. For effective training, the recent L2Net [15] and HardNet [29] use a structured loss for data sampling, which drastically improves performance. To further boost performance, extra regularizations are introduced for learning compact representations in [15, 16].
Evaluation protocol. Previous works often evaluate on datasets such as [30, 31, 21]. However, those datasets are either small, or lack the diversity to generalize well to the various applications of descriptors. As a result, evaluation results are commonly inconsistent or even contradictory to each other, as pointed out in [9], which limits the application of learned descriptors. Two novel benchmarks, HPatches [9] and the ETH local features benchmark [10], have recently been introduced with clearly defined protocols and better generalization properties. However, inconsistency still exists between the two benchmarks: HPatches [9] shows learned descriptors significantly outperforming handcrafted ones, whereas the ETH benchmark [10] finds that advanced variants of traditional descriptors are at least on par with the learning-based ones. The inconclusive results indicate that there is still significant room for improvement in learning more powerful feature descriptors.
3 Method
3.1 Network architecture
We borrow the network of L2Net [15], where the feature tower is constructed by eschewing pooling layers and using strided convolutional layers for in-network downsampling. Each convolutional layer except the last one is followed by a batch normalization (BN) layer whose weighting and bias parameters are fixed to 1 and 0, respectively. An L2-normalization layer after the last convolution produces the final 128-dimensional feature vector.
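As a cross-check of the borrowed design, the spatial footprint of such a fully convolutional tower can be traced in a few lines of Python. The channel counts, kernels and strides below follow the published L2Net tower and should be read as an assumption about the adopted design, not as our exact specification:

```python
import math

# Layer spec: (out_channels, kernel, stride, padding), following the
# published L2Net tower (assumed here): strided convs replace pooling.
LAYERS = [
    (32, 3, 1, "same"), (32, 3, 1, "same"),
    (64, 3, 2, "same"), (64, 3, 1, "same"),
    (128, 3, 2, "same"), (128, 3, 1, "same"),
    (128, 8, 1, "valid"),  # final conv collapses the remaining 8x8 map
]

def conv_out(size, kernel, stride, padding):
    """Spatial output size of a (possibly strided) convolution."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1

def tower_shape(size=32):
    """Trace the spatial size of an input patch through the feature tower."""
    channels = 1
    for channels, kernel, stride, padding in LAYERS:
        size = conv_out(size, kernel, stride, padding)
    return size, channels
```

With a 32x32 input patch, the spatial dimension collapses to 1 and the channel dimension reaches 128, matching the final 128-dimensional feature vector after L2 normalization.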
3.2 Training data generation
Acquiring high-quality training data is important in learning tasks. In this section, we discuss a practical pipeline that automatically produces well-annotated data suitable for descriptor learning.
2D correspondence generation. Similar to LIFT [17], we rely on successful 3D reconstructions to generate ground-truth 2D correspondences in an automatic manner. First, sparse reconstructions are obtained from a standard SfM pipeline [32]. Then, 2D correspondences are generated by projecting 3D point clouds. In general, SfM is used to filter out most mismatches among images.
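The projection step can be sketched as follows; the pinhole model with pose (R, t) and intrinsics K is standard, and the camera parameterization here is only illustrative:

```python
import numpy as np

def project(X, R, t, K):
    """Project a 3D point X into a pinhole camera with pose (R, t) and intrinsics K."""
    x_cam = R @ X + t
    x_img = K @ (x_cam / x_cam[2])   # perspective division
    return x_img[:2]

def correspondence(X, cam1, cam2):
    """Ground-truth 2D correspondence of one 3D track in two registered views."""
    return project(X, *cam1), project(X, *cam2)
```

Every 3D track seen by two registered cameras thus yields one verified 2D correspondence, prior to the outlier filtering described next.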
Although verified by SfM, the generated correspondences are still contaminated by outliers arising from image noise and wrongly registered cameras. This happens particularly often on Internet tourism datasets such as [33, 34] (illustrated in Fig. 1(a)), and such outliers are usually unlikely to be filtered out by simply limiting the reprojection error. To improve data quality, we take one step further than LIFT by performing a visibility check based on 3D Delaunay triangulation [35], which is widely used for outlier filtering in dense stereo. Empirically, a fraction of 3D points is discarded after the filtering, while only points with high precision are kept for ground-truth generation. Fig. 1(b) gives an example to illustrate its effect.
Matching patch generation. Next, the interest region of a 2D projection is cropped similarly to LIFT, which is formulated by a similarity transformation
P_in = c σ R(θ) P_out + (x, y)^T,    (1)
where P_in and P_out are the input and output regular sampling grids, (x, y, σ, θ) are the keypoint parameters (center coordinates, scale and orientation) from the SIFT detector, and R(θ) is the 2D rotation matrix. The constant c is set as in LIFT, resulting in fixed-size patches.
Due to the robust estimation of the scale (σ) and orientation (θ) parameters by SIFT even in extreme cases [36], the resulting patches are mostly free of scale and rotation differences, and thus suitable for training. In later experiments on image matching and SfM, we rely on the same cropping method to achieve scale and rotation invariance for learned descriptors.
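A minimal sketch of this cropping, assuming a radius multiplier `radius_mult` relative to the keypoint scale (a placeholder for the elided constant) and nearest-neighbor sampling for brevity:

```python
import numpy as np

def crop_patch(image, x, y, scale, angle, patch_size=32, radius_mult=6.0):
    """Crop a square patch around a SIFT keypoint via a similarity transform.
    `radius_mult` (the cropping constant relative to the keypoint scale) and
    nearest-neighbor sampling are illustrative assumptions."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    # regular output sampling grid in [-1, 1] x [-1, 1]
    lin = np.linspace(-1.0, 1.0, patch_size)
    gy, gx = np.meshgrid(lin, lin, indexing="ij")
    grid = np.stack([gx.ravel(), gy.ravel()])               # 2 x N
    src = (R @ grid) * (radius_mult * scale) + [[x], [y]]   # similarity transform
    cols = np.clip(np.round(src[0]).astype(int), 0, image.shape[1] - 1)
    rows = np.clip(np.round(src[1]).astype(int), 0, image.shape[0] - 1)
    return image[rows, cols].reshape(patch_size, patch_size)
```

In practice, bilinear interpolation (e.g., via OpenCV's affine warp) would replace the nearest-neighbor lookup; the transform itself is what matters here.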
3.3 Geometric similarity estimation
The geometry at a 3D point is robust and provides rich information. Inspired by the MVS (Multi-View Stereo) accuracy measurement in [37], we define two types of geometric similarity, patch similarity and image similarity, which will facilitate the later data sampling and loss formulation.
Patch similarity. We define the patch similarity S_patch to measure the difficulty of matching a patch pair with respect to perspective changes. Formally, given a patch pair, we relate it to its corresponding 3D track X, which is seen by two cameras centered at c1 and c2. Next, we compute the vertex normal n at X from the surface model. The geometric relationship is illustrated in Fig. 2(a). Finally, we formulate S_patch as
(2)
where α1 measures the intersection angle between the two viewing rays from the 3D track (∠(c1, X, c2)), while α2 measures the difference of incidence angles between each viewing ray and the vertex normal at the 3D track. The angle metric d(·) maps an angle to a similarity score. As an interpretation, α1 and α2 measure the perspective change with respect to a 3D point and the local 3D surface, respectively. The effect of α2 is illustrated in Fig. 2(b).
The accuracy of α1 and α2 depends on the sparse and mesh reconstructions, respectively, and is generally sufficient for our use as shown in [37]. The similarity does not consider scale and rotation changes, as those are already resolved by Equation 1. Empirically, we choose the two angle thresholds (in degrees) for d(·).
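The two angles can be computed directly from the sparse and mesh geometry. In the sketch below, the clipped linear falloff `d()` and the thresholds `t1`, `t2` are illustrative assumptions standing in for the paper's elided angle metric and empirical values:

```python
import numpy as np

def angle_deg(u, v):
    """Angle between two vectors, in degrees."""
    u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
    return np.degrees(np.arccos(np.clip(u @ v, -1.0, 1.0)))

def patch_similarity(X, n, c1, c2, t1=30.0, t2=30.0):
    """Geometric similarity of a patch pair at 3D track X with normal n,
    seen from camera centers c1 and c2. The falloff d() and the thresholds
    t1, t2 are assumed placeholders, not the paper's exact metric."""
    r1, r2 = c1 - X, c2 - X                        # viewing rays
    a1 = angle_deg(r1, r2)                         # intersection angle
    a2 = abs(angle_deg(r1, n) - angle_deg(r2, n))  # incidence-angle difference
    d = lambda a, t: max(0.0, 1.0 - a / t)         # clipped linear falloff (assumed)
    return d(a1, t1) * d(a2, t2)
```

Identical viewpoints yield similarity 1, while a wide-baseline pair with a grazing view of the surface drops toward 0.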
Image similarity. Based on the patch similarity, we define the image similarity as the average patch similarity over the correspondences between an image pair. The image similarity measures the difficulty of matching an image pair and can be interpreted as a measurement of perspective change. Examples are given in Fig. 2(c). The image similarity will be beneficial for data sampling in Sec. 3.4.
3.4 Batch construction
For descriptor learning, most existing frameworks take patch pairs (matching/non-matching) or patch triplets (reference, matching and non-matching) as input. As in previous studies, the convergence rate is highly dependent on being able to see useful data [38]. Here, “useful” data often refers to patch pairs/triplets that produce meaningful loss for learning. However, effectively sampling such data is generally challenging due to the intractably large number of patch pair/triplet combinations in the database. Hence, on one hand, sampling strategies such as hard negative mining [12] and anchor swap [11] have been proposed, while on the other hand, effective batch construction is used in [15, 25, 29] to compare the reference patch against all in-batch samples in the loss computation.
Inspired by previous works, we propose a novel batch construction method that effectively samples “useful” data by relying on geometry constraints from SfM, including the image matching results and the image similarity, to simulate pairwise image matching and sample data. Formally, given one image pair, we extract a match set M = {(p_i, p'_i)}, i = 1, …, N, where N is the set size and (p_i, p'_i) is a matching patch pair surviving the SfM verification. A training batch is then constructed on several match sets. Hence, the learning objective becomes improving the matching quality within each match set. In Sec. 3.5, we will discuss the loss computation on each match set and on the batch data.
Compared with L2Net [15] and HardNet [29], whose training batches are randomly sampled from the whole database, the proposed method produces harder samples and thus raises greater challenges for learning. As the example in Fig. 2(d) shows, a training batch constructed by the proposed method consists of many similar patterns, due to spatially close keypoints or repetitive textures. In general, such a training batch has two major advantages for descriptor learning:


It reflects the complexity encountered in practice. In real applications, image patches are often extracted between image pairs for matching. The proposed method simulates this scenario so that training and testing become more consistent.

It generates hard samples for training. As observed in [11, 12, 29, 38], hard samples are critical to fast convergence and performance improvement for descriptor learning. The proposed method effectively generates batch data that is sufficiently hard, without causing overfitting, as it is constructed from real matching results instead of model inference results [12].
To further boost training efficiency, we exclude image pairs that are too similar to contribute to the learning. Those pairs are effectively identified by the image similarity defined in Sec. 3.3. In practice, we discard image pairs whose image similarity exceeds a threshold (e.g., the top pair in Fig. 2(c)), which shrinks the set of training samples.
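The batch construction, together with the image-similarity filter, can be sketched as follows. The similarity threshold `sim_threshold` is a placeholder for the elided value, while the set size (64) and sets-per-batch (12) follow Sec. 3.6:

```python
import random

def build_batch(pair_data, sim_threshold=0.85, sets_per_batch=12, set_size=64):
    """Construct one training batch of match sets, simulating pairwise matching.
    pair_data: {image_pair: (image_similarity, [matching patch pairs])}.
    `sim_threshold` is an assumed placeholder for the paper's elided value."""
    # discard image pairs that are too similar to contribute to learning
    usable = [p for p, (sim, matches) in pair_data.items()
              if sim <= sim_threshold and len(matches) >= set_size]
    batch = []
    for pair in random.sample(usable, sets_per_batch):
        _, matches = pair_data[pair]
        batch.append(random.sample(matches, set_size))  # one match set
    return batch  # list of match sets, each a list of (patch, patch) pairs
```

Each match set then contributes one block to the structured loss of Sec. 3.5, so the in-batch negatives come from the same simulated image pair rather than from random patches.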
3.5 Loss formulation
We formulate the loss with two terms: structured loss and geometric loss.
Structured loss. The structured loss used in L2Net [15] and HardNet [29] is naturally suited to consuming the batch samples constructed in Sec. 3.4. In particular, the formulation in HardNet, based on the “hardest-in-batch” strategy and a distance margin, shows itself to be more effective than the log-likelihood formulation in L2Net. However, we observe severe overfitting when applying the HardNet loss to our batch data, which we ascribe to the overly strong constraint of the “hardest-in-batch” strategy. In this strategy, the loss is computed on the data sample that produces the largest loss, and a large margin (as in HardNet) is set to push non-matching pairs away from matching pairs. In our batch data, we already effectively sample “hard” data that is often visually similar, so forcing a large margin is inapplicable and stalls the learning. One simple solution is to decrease the margin value, but then the performance drops significantly in our experiments.
To avoid the above limitation and better take advantage of our batch data, we propose the following loss formulation. First, we compute the structured loss for one match set. Given the normalized features f_i and f'_i computed on match set M for all i, the cosine similarity matrix S is derived by s_ij = f_i^T f'_j. Next, we formulate the structured loss as
(3)
where s_ij is the element of S, and the distance ratio mimics the behavior of the ratio test [8], pushing non-matching pairs away from matching pairs. Finally, we take the average of the loss over each match set to derive the final loss for one training batch.
The proposed formulation is distinct from HardNet in three aspects. First, we compute cosine similarity instead of Euclidean distance for computational efficiency. Second, we apply a distance ratio margin instead of a fixed distance margin, serving as an adaptive margin that reduces overfitting. Finally, we compute the mean of the loss elements instead of the maximum (“hardest-in-batch”) in order to cooperate with the proposed batch construction.
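Since the exact form of Eq. (3) is not reproduced here, the sketch below shows one plausible mean-based, ratio-margin instantiation of the three ingredients just described (cosine similarity, distance-ratio margin, mean instead of max); the ratio value is an assumption:

```python
import numpy as np

def structured_loss(F1, F2, ratio=0.8):
    """Illustrative mean-based structured loss over one match set.
    F1, F2: L2-normalized features (N x D) of matching patch pairs, row i
    of F1 matching row i of F2. The ratio-margin form and its value are
    assumptions standing in for the paper's elided Eq. (3)."""
    S = F1 @ F2.T                          # cosine similarity matrix
    pos = np.diag(S)                       # matching-pair similarities
    neg = np.where(np.eye(len(S), dtype=bool), -np.inf, S)
    hardest = neg.max(axis=1)              # hardest non-matching per row
    # push the hardest negative below ratio * positive, averaged in-batch
    return np.maximum(0.0, hardest - ratio * pos).mean()
```

The mean over rows (rather than the in-batch maximum) is what lets the loss tolerate the many visually similar negatives that the match-set batches contain.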
Geometric loss. Although the structured loss ensures that matching patch pairs are distant from non-matching ones, it does not explicitly encourage matching pairs to be close in the measure space. One simple solution is to apply a typical pairwise loss as in [12], at the risk of positive collapse and overfitting as observed in [28]. To overcome this, we adaptively set the margin according to the patch similarity defined in Sec. 3.3, which serves as a soft constraint for maximizing the positive similarity. We refer to this term as the geometric loss and formulate it as
(4)
where β_i is the adaptive margin, s_ii is the diagonal element of the similarity matrix, namely the cosine similarity of patch pair i, and S_patch,i is the patch similarity of pair i. We use the weighted sum of the structured and geometric losses as the final loss, with the weighting parameters set empirically.
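A hypothetical instantiation of the geometric loss, assuming the adaptive margin scales linearly with the patch similarity (the paper's exact margin formula in Eq. (4) and its constants are elided):

```python
import numpy as np

def geometric_loss(pos_sim, patch_sim, beta0=1.0):
    """Illustrative geometric loss with an adaptive margin.
    pos_sim: cosine similarity s_ii of each matching pair;
    patch_sim: the geometric patch similarity of Sec. 3.3.
    The margin form beta_i = beta0 * patch_sim_i is an assumption."""
    margin = beta0 * np.asarray(patch_sim)
    # geometrically easy pairs (high patch similarity) must score high;
    # hard pairs (low similarity) are only softly constrained
    return np.maximum(0.0, margin - np.asarray(pos_sim)).mean()
```

The intent matches the text: the more similar two views of a patch are geometrically, the closer their descriptors are pushed, without forcing collapse on genuinely hard pairs.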
3.6 Training
We use the image sets of [33] as in LIFT [17] and the SfM data of [34], and further collect several image sets to form the training database. Based on COLMAP [32], we run 3D reconstructions to establish the necessary geometry constraints. Image sets that overlap with the benchmark data are manually excluded. We train the networks from scratch using Adam with weight decay; the learning rate decays by a fixed factor at regular step intervals. Data augmentation includes random flipping, rotation, and brightness and contrast adjustment. The match set size and batch size are 64 and 12, respectively. Input patches are standardized to have zero mean and unit norm.
4 Experiments
We evaluate the proposed descriptor on three datasets: the patch-based HPatches benchmark [9], and the image-based Heinly benchmark [21] and ETH local features benchmark [10]. We further demonstrate it on challenging SfM examples.
4.1 HPatches benchmark
The HPatches benchmark [9] defines three complementary tasks: patch verification, patch matching, and patch retrieval. Different levels of geometric perturbations are imposed to form the EASY, HARD and TOUGH patch groups. In the verification task, two subtasks are defined based on whether negative pairs are sampled from images within the same sequence (SAMESEQ) or different sequences (DIFFSEQ). In the matching task, two subtasks are defined based on whether the principal variation is viewpoint (VIEW) or illumination (ILLUM). Following [9], we use mean average precision (mAP) to measure the performance for all three tasks on the HPatches ‘full’ split.
We select five descriptors to compare: SIFT as the baseline, and RSIFT [39] and DDesc [12] as the best-performing handcrafted and learned descriptors, respectively, as concluded in [9]. Moreover, we experiment with the recent learned descriptors L2Net [15] and HardNet [29]. The proposed descriptor is referred to as GeoDesc.
As shown in Fig. 3, GeoDesc surpasses all the other descriptors on all three tasks by a large margin. In particular, the performance on the TOUGH patch group is significantly improved, which indicates GeoDesc's superior invariance to large image changes. Interestingly, comparing GeoDesc with HardNet, we observe some performance drop on EASY groups, especially for illumination changes, which can be ascribed to the data bias of SfM data. Though we apply photometric randomness such as illumination changes during training, we cannot fully mitigate this limitation, which calls for more diverse real data in descriptor learning.
In addition, we evaluate different configurations of GeoDesc on HPatches, as shown in Tab. 1, to demonstrate the effect of each part of our method.


Config. 1: the HardNet framework as the baseline.

Config. 2: trained with the SfM data generated in Sec. 3.2.

Config. 3: equipped with the proposed batch construction in Sec. 3.4. As discussed in Sec. 3.5, the “hardest-in-batch” strategy in HardNet is inapplicable to hard batch data and thus leads to a performance drop compared with Config. 2. In practice, we need to decrease the margin value used in HardNet, otherwise the training does not even converge. Though trainable, the smaller margin value harms the final performance.

Config. 4: equipped with the modified structured loss in Sec. 3.5. Notable performance improvements are achieved over Config. 2 due to the collaborative use of the proposed methods, showing the effectiveness of simulating pairwise matching and sampling hard data. Besides, replacing the distance margin with the distance ratio improves training efficiency, as shown in Fig. 4.

Config. 5: equipped with the geometric loss in Sec. 3.5. Further improvements are obtained over Config. 4, as the geometric loss constrains the solution space and enhances training efficiency.
To sum up, the “hardest-in-batch” strategy is beneficial when no other sampling is applied and most in-batch samples do not contribute to the loss. However, with harder batch data effectively constructed, it is advantageous to replace “hardest-in-batch” and further boost descriptor performance.


GeoDesc Configuration  HPatches Benchmark Tasks

No.  SfM Data  Batch Constr.  Struct. Loss  Geom. Loss  Verification  Matching  Retrieval

1  -  -  -  -  88.4  52.8  69.8
2  Y  -  -  -  90.1  57.0  73.2
3  Y  Y  -  -  89.9  50.2  70.4
4  Y  Y  Y  -  90.9  58.5  74.5
5  Y  Y  Y  Y  91.1  59.1  74.9

4.2 Heinly benchmark
Different from HPatches, which experiments on image patches, the benchmark by Heinly et al. [21] evaluates pairwise image matching with respect to different types of photometric or geometric changes, aiming to provide practical insights into the strengths and weaknesses of descriptors. We use two standard metrics as in [21] to quantify matching quality. First, the Matching Score = #Inlier Matches / #Features. Second, the Recall = #Inlier Matches / #True Matches. Four descriptors are selected for comparison: SIFT, the baseline handcrafted descriptor; DSPSIFT, the best handcrafted descriptor, superior even to previous learning-based ones as evaluated in [10]; and L2Net and HardNet, the recent advanced learned descriptors. For fair comparison, no ratio test is used and only cross-checking (mutual test) is applied for all descriptors.


Matching Score in %  Recall in %  


SIFT  DSPSIFT  L2Net  HardNet  GeoDesc  SIFT  DSPSIFT  L2Net  HardNet  GeoDesc  


JPEG  31.9  35.1  25.7  27.0  34.7  60.7  66.9  49.0  51.5  66.1 
Blur  12.4  14.3  9.1  11.3  14.4  41.0  47.3  30.1  37.4  47.7 
Exposure  32.9  34.8  33.9  34.9  36.3  78.2  82.6  80.4  82.8  86.4 
DayNight  5.6  5.7  6.8  7.4  7.5  29.2  29.7  35.6  38.9  39.6 
Scale  35.8  34.7  32.6  34.8  37.8  81.2  78.8  73.6  79.0  85.8 
Rotation  56.3  49.1  55.9  57.4  59.8  82.4  71.8  81.9  84.0  87.6 
Scalerotation  12.6  12.0  10.7  12.1  14.3  29.6  28.1  25.0  28.5  33.7 
Planar  23.8  24.8  25.6  27.4  29.1  48.2  50.4  51.9  55.6  59.1 

Evaluation results are shown in Tab. 2. Compared with DSPSIFT, GeoDesc performs comparably with respect to image quality changes (compression, blur), while notably better for illumination and geometric changes (rotation, scale, viewpoint). On the other hand, GeoDesc delivers significant improvements over L2Net and HardNet and particularly narrows the gap in terms of photometric changes, which makes GeoDesc applicable to different scenarios in real applications.
4.3 ETH local features benchmark
The ETH local features benchmark [10] focuses on image-based 3D reconstruction tasks. We compare GeoDesc with SIFT, DSPSIFT and L2Net, and follow the same protocols as in [10] to quantify SfM quality, including the number of registered images (# Registered), reconstructed sparse points (# Sparse Points), image observations (# Observations), mean track length (Track Length) and mean reprojection error (Reproj. Error). For fair comparison, we apply no distance ratio test for descriptor matching and extract features at the same keypoints as in [10].
As observed in Tab. 3, first, GeoDesc performs best on # Registered, which is generally considered the most important SfM metric as it directly quantifies reconstruction completeness. Second, GeoDesc achieves the best results on # Sparse Points and # Observations, which indicates superior matching quality in the early steps of SfM. However, GeoDesc does not obtain the best statistics for Track Length and Reproj. Error, as it computes these two metrics over a significantly larger # Sparse Points. On datasets that are small in scale and have similar track numbers (Fountain, Herzjesu), GeoDesc gives the longest Track Length.
To sum up, GeoDesc surpasses both the previously best-performing DSPSIFT and the recent advanced L2Net by a notable margin. In addition, it is noted that L2Net also shows consistent improvements over DSPSIFT, which demonstrates the power of the structured loss for learned descriptors.


# Images  # Registered  # Sparse Points  # Observations  Track Length  Reproj. Error  


Fountain  SIFT  11  11  10,004  44K  4.49  0.30px 
DSPSIFT  11  14,785  71K  4.80  0.41px  
L2Net  11  16,119  78K  4.86  0.43px  
GeoDesc  11  16,687  83K  4.99  0.46px  


Herzjesu  SIFT  8  8  4,916  19K  4.00  0.32px 
DSPSIFT  8  7,760  32K  4.19  0.45px  
L2Net  8  8,473  36K  4.27  0.47px  
GeoDesc  8  8,720  38K  4.34  0.55px  


South Building  SIFT  128  128  62,780  353K  5.64  0.42px 
DSPSIFT  128  110,394  664K  6.02  0.57px  
L2Net  128  155,780  798K  5.13  0.58px  
GeoDesc  128  170,306  887K  5.21  0.64px  


Madrid Metropolis  SIFT  1,344  440  62,729  416K  6.64  0.53px 
DSPSIFT  476  107,028  681K  6.36  0.64px  
L2Net  692  254,142  1,067K  4.20  0.69px  
GeoDesc  809  306,976  1,200K  3.91  0.66px  


Gendarmenmarkt  SIFT  1,463  950  169,900  1,010K  5.95  0.64px 
DSPSIFT  975  321,846  1,732K  5.38  0.74px  
L2Net  1,168  667,392  2,611K  3.91  0.73px  
GeoDesc  1,208  779,814  2,903K  3.72  0.74px  


Tower of London  SIFT  1,576  702  142,746  963K  6.75  0.53px 
DSPSIFT  755  236,598  1,761K  7.44  0.64px  
L2Net  1,049  558,673  2,617K  4.68  0.67px  
GeoDesc  1,081  622,076  2,852K  4.58  0.69px  


Alamo  SIFT  2,915  743  120,713  1,384K  11.47  0.54px 
DSPSIFT  754  144,341  1,815K  12.58  0.66px  
L2Net  882  318,787  2,932K  9.17  0.76px  
GeoDesc  893  353,329  3,159K  8.94  0.84px  


Roman Forum  SIFT  2,364  1,407  242,192  1,805K  7.45  0.61px 
DSPSIFT  1,583  372,573  2,879K  7.73  0.71px  
L2Net  1,537  708,794  4,530K  6.39  0.69px  
GeoDesc  1,566  770,363  5,051K  6.56  0.73px  


Cornell  SIFT  6,514  4,999  1,010,544  6,317K  6.25  0.53px 
DSPSIFT  4,946  1,177,916  7,233K  6.14  0.67px  
L2Net  5,557  2,706,215  15,710K  5.81  0.72px  
GeoDesc  5,823  3,076,476  17,550K  5.70  0.96px  

4.4 Challenging 3D reconstructions
To further demonstrate the effect of the proposed descriptor in the context of 3D reconstruction, we showcase selected image sets whose reconstructions fail or are of low quality with a typical SIFT-based 3D reconstruction pipeline, but are significantly improved by integrating GeoDesc.
From the examples shown in Fig. 5, the benefit of deploying GeoDesc in a reconstruction pipeline is clear. First, with robust matching resistant to photometric and geometric changes, a complete sparse reconstruction registered with more cameras can be obtained. Second, due to more accurate camera pose estimation, a finer final mesh reconstruction is derived.
5 Practical Guidelines
In this section, we discuss several practical guidelines to complement the performance evaluation and provide insights towards real applications. The following experiments are conducted with extra high-resolution image pairs, whose keypoints are downsampled to 10k per image. We use a single NVIDIA GTX 1080 GPU with TensorFlow [40], and forward each batch with 256 patches.
5.1 Ratio criterion
The ratio criterion [8] compares the distances to the first and the second nearest neighbor, and establishes a match if the former is smaller than the latter by some ratio. For SfM tasks, the ratio criterion improves overall matching quality and RANSAC efficiency, and seeds robust initialization. Despite these benefits, the ratio criterion has received little attention, or has even been considered inapplicable to learned descriptors in previous studies [10]. Here, we propose a general method to determine a ratio that cooperates well with existing SfM pipelines.
The general idea is simple: the new ratio should function similarly to SIFT's, as most SfM pipelines are parameterized for SIFT. To quantify the effect of the ratio criterion, we use the metric Precision = #Inlier Matches / #Putative Matches, and determine the ratio that achieves a Precision similar to SIFT's. As an example in Fig. 6, we compute the Precision of SIFT and GeoDesc on our experimental dataset, and find the ratio for GeoDesc at which it gives a Precision similar to SIFT's. This ratio is applied to the experiments in Sec. 4.4 and shows robust results and compatibility in the practical SfM pipeline.
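The calibration procedure can be sketched as follows: sweep candidate ratios, measure Precision on matches with known inlier labels, and keep the largest ratio that still reaches the SIFT target. Function names and the ratio grid are illustrative:

```python
import numpy as np

def precision_at_ratio(d1, d2, inliers, ratio):
    """Precision of putative matches that pass the ratio test.
    d1/d2: first/second nearest-neighbor distances per query keypoint;
    inliers: whether each putative match is geometrically an inlier."""
    passed = d1 < ratio * d2
    if not passed.any():
        return 0.0
    return inliers[passed].mean()

def calibrate_ratio(d1, d2, inliers, target, grid=None):
    """Largest ratio whose Precision still reaches the target (e.g., SIFT's)."""
    if grid is None:
        grid = np.linspace(0.5, 0.99, 50)
    feasible = [r for r in grid if precision_at_ratio(d1, d2, inliers, r) >= target]
    return max(feasible) if feasible else None
```

Taking the largest feasible ratio keeps as many putative matches as possible while matching the Precision the downstream SfM pipeline was tuned for.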
5.2 Compactness study
A compact feature representation generally indicates better performance with respect to discriminativeness and scalability. To quantify compactness, we rely on the intermediate result of Principal Component Analysis (PCA). First, we compute the explained variance, sorted in decreasing order, for each feature dimension indexed by i. Then we estimate the compact dimensionality (denoted as CompactDim) by finding the minimal number of leading dimensions whose cumulative explained variance ratio exceeds a given threshold of the total over the original feature dimensionality. In this experiment, we set the threshold so that CompactDim can be interpreted as the minimal dimensionality required to convey most of the information of the original feature. Obviously, a larger CompactDim indicates less redundancy, namely greater compactness.
As a result, the CompactDim estimated on 4 million feature vectors for SIFT, DSPSIFT, L2Net and GeoDesc is 56, 63, 75 and 100, respectively. The ranking of CompactDim corresponds well to the previous performance evaluations, where descriptors with larger CompactDim yield better results.
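The CompactDim estimate reduces to a cumulative explained-variance search over the PCA spectrum; the threshold value below is an assumed stand-in for the elided setting:

```python
import numpy as np

def compact_dim(features, threshold=0.95):
    """Minimal number of principal directions whose cumulative explained
    variance exceeds `threshold` of the total. The threshold value is an
    assumed placeholder for the paper's elided setting."""
    centered = features - features.mean(axis=0)
    # singular values (descending) squared = unnormalized explained variances
    var = np.linalg.svd(centered, compute_uv=False) ** 2
    ratio = np.cumsum(var) / var.sum()
    return int(np.searchsorted(ratio, threshold) + 1)
```

A descriptor whose variance concentrates on few axes yields a small CompactDim (redundant dimensions), whereas variance spread over many axes yields a large one.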
5.3 Scalability study
Computational cost. As evaluated in [9, 10], the efficiency of learned descriptors is on par with traditional CPU-based descriptors such as SIFT. Here, we further compare with GPU-based SIFT [41] to provide insights into practicability. We evaluate the running time in three steps. First, keypoint detection and canonical orientation estimation by SIFT-GPU. Next, patch cropping by Equation 1. Finally, feature inference on image patches. The computational cost and memory demand are shown in Tab. 4, indicating that with GPU support, not surprisingly, SIFT (0.20s) is still faster than the learned descriptor (0.31s in total), though with a narrow gap due to the parallel implementation. For applications heavily relying on matching quality (e.g., 3D reconstruction), the proposed descriptor achieves a good trade-off as a replacement for SIFT.
Quantization. To conserve disk space, I/O and memory, we linearly map the feature vectors of GeoDesc to the unsigned-char range and round each element. The quantization does not affect performance, as evaluated on the HPatches benchmark.
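A sketch of the quantization, assuming the unit-norm features lie in [-1, 1] per element (the exact source range is elided in the text):

```python
import numpy as np

def quantize(features):
    """Map float features, assumed in [-1, 1] per element, to uint8 storage."""
    q = np.round((features + 1.0) * 0.5 * 255.0)
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q):
    """Approximate inverse mapping back to floats for matching."""
    return q.astype(np.float32) / 255.0 * 2.0 - 1.0
```

The round-trip error is bounded by half a quantization step, which is well below typical descriptor-distance margins and explains why benchmark performance is unaffected.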
Computational cost and memory demand of feature extraction with GeoDesc in three steps: SIFT-GPU extraction, patch cropping and feature inference. The total time cost is evaluated with the three steps implemented in a parallel fashion.


SIFT  Crop.  Infer.  Total  


Device  GPU  CPU  GPU   
Memory (GB)  3.3  2.7  0.3   
Time (s)  0.20  0.28  0.31  0.31 

6 Conclusions
In contrast to prior work, we have addressed the advantages of integrating geometry constraints into descriptor learning, which benefits the learning process in terms of ground-truth data generation, data sampling and loss computation. Also, we have discussed several practical guidelines, in particular the ratio criterion, towards practical portability. Finally, we have demonstrated the superior performance and generalization ability of the proposed descriptor, GeoDesc, on three benchmark datasets in different scenarios. We have further shown the significant improvement of GeoDesc on challenging reconstructions, and the good trade-off between efficiency and accuracy when deploying GeoDesc in real applications.
References
 [1] Li, S., Yuan, L., Sun, J., Quan, L.: Dualfeature warpingbased motion model estimation. ICCV (2015)
 [2] Mishkin, D., Matas, J., Perdoch, M., Lenc, K.: Wxbs: Wide baseline stereo generalizations. BMVC (2015)
 [3] Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. CVPR (2007)
 [4] Zhu, S., Zhang, R., Zhou, L., Shen, T., Fang, T., Tan, P., Quan, L.: Very large-scale global SfM by distributed motion averaging. CVPR (2018)
 [5] Shen, T., Zhu, S., Fang, T., Zhang, R., Quan, L.: Graph-based consistent matching for structure-from-motion. ECCV (2016)
 [6] Zhang, R., Zhu, S., Fang, T., Quan, L.: Distributed very large scale bundle adjustment by global camera consensus. ICCV (2017)
 [7] Zhu, S., Fang, T., Xiao, J., Quan, L.: Local readjustment for high-resolution 3D reconstruction. CVPR (2014)
 [8] Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. IJCV (2004)
 [9] Balntas, V., Lenc, K., Vedaldi, A., Mikolajczyk, K.: HPatches: A Benchmark and Evaluation of Handcrafted and Learned Local Descriptors. CVPR (2017)
 [10] Schönberger, J.L., Hardmeier, H., Sattler, T., Pollefeys, M.: Comparative evaluation of handcrafted and learned local features. CVPR (2017)
 [11] Balntas, V., Riba, E., Ponsa, D., Mikolajczyk, K.: Learning Local Feature Descriptors with Triplets and Shallow Convolutional Neural Networks. BMVC (2016)
 [12] Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., Moreno-Noguer, F.: Discriminative Learning of Deep Convolutional Feature Point Descriptors. CVPR (2015)
 [13] Sohn, K.: Improved Deep Metric Learning with Multi-class N-pair Loss Objective. NIPS (2016)
 [14] Brown, M.A., Hua, G., Winder, S.A.J.: Discriminative Learning of Local Image Descriptors. PAMI (2011)
 [15] Tian, Y., Fan, B., Wu, F.: L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space. CVPR (2017)
 [16] Zhang, X., Yu, F.X., Kumar, S., Chang, S.F.: Learning spreadout local feature descriptors. CVPR (2017)
 [17] Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: LIFT: Learned Invariant Feature Transform. ECCV (2016)
 [18] Mitra, R., Doiphode, N., Gautam, U., Narayan, S., Ahmed, S., Chandran, S., Jain, A.: A large dataset for improving patch matching. arXiv (2018)
 [19] Morel, J.M., Yu, G.: ASIFT: A new framework for fully affine invariant image comparison. SIAM Journal on Imaging Sciences (2009)
 [20] Kaplan, A., Avraham, T., Lindenbaum, M.: Interpreting the ratio criterion for matching sift descriptors. ECCV (2016)
 [21] Heinly, J., Dunn, E., Frahm, J.M.: Comparative Evaluation of Binary Features. ECCV (2012)
 [22] Zagoruyko, S., Komodakis, N.: Learning to compare image patches via convolutional neural networks. CVPR (2015)
 [23] Yi, K.M., Verdie, Y., Fua, P., Lepetit, V.: Learning to Assign Orientations to Feature Points. CVPR (2015)
 [24] Han, X., Leung, T., Jia, Y., Sukthankar, R., Berg, A.C.: MatchNet: Unifying Feature and Metric Learning for Patch-based Matching. CVPR (2015)

 [25] G, V.K.B., Carneiro, G., Reid, I.: Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimizing Global Loss Functions. CVPR (2016)
 [26] Balntas, V., Johns, E., Tang, L., Mikolajczyk, K.: PN-Net: Conjoined Triple Deep Network for Learning Local Image Descriptors. arXiv (2016)
 [27] Markuš, N., Pandžić, I.S., Ahlberg, J.: Learning local descriptors by optimizing the keypoint-correspondence criterion. ICPR (2016)
 [28] Lin, J., Morere, O., Chandrasekhar, V., Veillard, A., Goh, H.: DeepHash: Getting regularization, depth and fine-tuning right. arXiv (2015)
 [29] Mishchuk, A., Mishkin, D., Radenovic, F., Matas, J.: Working Hard to Know Your Neighbor's Margins: Local Descriptor Learning Loss. NIPS (2017)
 [30] Mikolajczyk, K., Schmid, C.: A Performance Evaluation of Local Descriptors. PAMI (2005)
 [31] Winder, S., Hua, G., Brown, M.: Picking the best daisy. CVPR (2009)
 [32] Schönberger, J.L., Frahm, J.M.: Structure-from-Motion Revisited. CVPR (2016)
 [33] Wilson, K., Snavely, N.: Robust Global Translations with 1DSfM. ECCV (2014)
 [34] Radenovic, F., Tolias, G., Chum, O.: CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples. ECCV (2016)
 [35] Labatut, P., Pons, J.P., Keriven, R.: Efficient multi-view reconstruction of large-scale scenes using interest points, Delaunay triangulation and graph cuts. ICCV (2007)
 [36] Zhou, L., Zhu, S., Shen, T., Wang, J., Fang, T., Quan, L.: Progressive Large Scale-Invariant Image Matching in Scale Space. ICCV (2017)
 [37] Zhang, R., Li, S., Fang, T., Zhu, S., Quan, L.: Joint Camera Clustering and Surface Segmentation for Large-Scale Multi-view Stereo. ICCV (2015)
 [38] MovshovitzAttias, Y., Toshev, A., Leung, T.K., Ioffe, S., Singh, S.: No Fuss Distance Metric Learning using Proxies. ICCV (2017)
 [39] Arandjelovic, R., Zisserman, A.: Three Things Everyone Should Know to Improve Object Retrieval. CVPR (2012)

 [40] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv (2016)
 [41] Wu, C.: SiftGPU: A GPU implementation of SIFT. http://cs.unc.edu/~ccwu/siftgpu (2007)