1 Introduction
Finding local feature correspondence is a fundamental component of many computer vision tasks, such as structure from motion (SfM)
[56] and visual localization [53]. Recently, learned feature descriptors [41, 58, 64] have shown significant improvements over handcrafted ones [4, 25, 36] on standard benchmarks. However, other recent work has observed that, when applied to real-world unseen scenarios, learned descriptors do not always generalize well [38, 57]. One potential cause of such limited generalization is the insufficiency of high-quality training data in both quantity and diversity [57]. Ideally, one would train descriptors on fully accurate, dense ground-truth correspondence between image pairs. However, it is hard to collect such data for real imagery, and only a few datasets of this form exist [7, 11]. As an alternative, many previous methods resort to SfM datasets that provide pseudo ground-truth correspondences given by matched and reconstructed feature points [38, 41, 47, 64], but these correspondences are sparse and potentially biased by the keypoints used in the SfM pipeline. Another option for obtaining correspondence annotations is synthetic image pairs warped by homographies [14, 39]. However, homographies do not capture the full range of geometric and photometric variations observed in real images.
In this paper, we address the challenge of limited training data in descriptor learning by relaxing the requirement of ground-truth pixel-level correspondences. We propose to learn descriptors solely from relative camera poses between pairs of images. Camera poses can be obtained via a variety of non-vision-based sensors, such as IMUs and GPS, and can also be estimated reliably using SfM pipelines [56]. By reducing the supervision requirement to camera poses, it becomes possible to learn better descriptors on much larger and more diverse datasets.
However, existing metric-learning-based methods for learning descriptors cannot utilize camera poses as supervision, as the triplet or contrastive losses used in such methods cannot be defined with respect to camera poses. Hence, we propose a novel framework to leverage camera pose supervision. Specifically, we translate the relative camera pose between an image pair into an epipolar constraint on pixel coordinates of matched points as our supervision signal (Fig. 2). The remaining challenge is to make the coordinates of matched points differentiable with respect to descriptors for training, for which we introduce a new differentiable matching layer (Fig. 3(a)). To further reduce the computation cost and accelerate training, we use a coarse-to-fine matching scheme (Fig. 3(b)) that computes correspondence at a lower resolution, then refines it locally at a finer scale.
Once trained, our system can generate dense feature descriptors for an arbitrary input image, which can then be combined with existing keypoint detectors for downstream tasks. To evaluate the performance of our learned descriptors, we conduct extensive experiments on sparse and dense feature matching, homography estimation, relative pose estimation, and 3D reconstruction. Despite the fact that we only train with
weak camera pose supervision, our learned descriptors are on par with or even outperform prior fully-supervised state-of-the-art methods that train with ground-truth correspondence annotations. Fig. 1 summarizes our approach. To conclude, our main contributions are:

We show that camera poses alone suffice to learn good descriptors, which to our knowledge has not been explored in the literature.

To enable learning from camera poses, we depart from existing metric-learning-based approaches and design a novel loss function as well as a new, efficient network architecture.

We achieve state-of-the-art performance across a range of geometric tasks.
2 Related Work
Descriptor Learning. The dominant paradigm for learning feature descriptors is essentially deep metric learning [9], which encourages matching points to be close and non-matching points to be far apart in feature space. Various loss functions (e.g., pairwise and triplet losses [3, 9, 14, 30, 65], structured losses [41, 46, 59, 61]) have been developed. Based on the input type, current descriptor learning approaches roughly fall into two categories: patch-based and dense descriptor methods. Patch-based methods [3, 16, 22, 26, 38, 41, 43, 44, 47, 58, 61, 64] produce a feature descriptor for each patch defined by a keypoint detector, and can be viewed as direct counterparts of handcrafted feature descriptors [4, 6, 36, 52]. Dense descriptor methods [10, 14, 15, 17, 33, 49, 55] instead use fully-convolutional neural networks [34] to extract dense feature descriptors for the whole image in one forward pass. Our method produces dense descriptors, but unlike prior work that requires ground-truth correspondence annotations to train, we are able to learn descriptors from the weak supervision of camera pose.
Correspondence Learning. Our differentiable matching layer is related to the correlation layers and cost volumes widely used to compute stereo correspondence [8, 27] or optical flow [18, 23, 60] in a differentiable manner. However, the search space in those problems is limited to either a single scanline or a local patch, while in wide-baseline matching we must search for matches over the whole image; this necessitates the efficient coarse-to-fine architecture we use. Our method is also related to semantic correspondence approaches [24, 28, 50, 67], which are also often weakly supervised. However, they usually assume a simpler parametric transformation between images and tolerate much coarser correspondences than what is required for geometric tasks. Recent work [39, 51] explores dense geometric correspondence, but focuses on global optimization of the estimated correspondences rather than on the descriptors themselves; these contributions are complementary to ours. In contrast to this prior work, we propose a new architecture that is more suitable for descriptor learning.
3 Method
Given only image pairs with camera pose, standard deep metric learning methods do not apply, as positive and negative training matches are unavailable. Therefore, we devise a new method to exploit the geometric information of camera pose for descriptor learning. Specifically, we translate relative camera pose into an epipolar constraint between image pairs, and enforce the predicted matches to obey this constraint (Sec. 3.1). Since this constraint is imposed on pixel coordinates, we must make the coordinates of correspondences differentiable with respect to the feature descriptors. For this we devise a differentiable matching layer (Sec. 3.2). To further improve efficiency, we introduce a coarse-to-fine architecture (Sec. 3.3) that accelerates training and also boosts descriptor performance. We elaborate on our method below.
3.1 Loss Formulation
Our training data consists of image pairs with relative camera poses. To train our correspondence system with such data, we propose to use two complementary loss terms: a novel epipolar loss, and a cycle consistency loss (Fig. 2).
Given the relative pose and camera intrinsics for a pair of images $I_1$ and $I_2$, one can compute the fundamental matrix $F$. The epipolar constraint states that $x_2^\top F x_1 = 0$ holds if $x_1 \in I_1$ and $x_2 \in I_2$ represent a true match, where $F x_1$ can be interpreted as the epipolar line corresponding to $x_1$ in $I_2$. (For simplicity, we use the same symbols for homogeneous and Cartesian coordinates.) We treat $x_1$ as the query point and refashion this constraint into an epipolar loss based on the distance between the predicted correspondence location and the ground-truth epipolar line:

$$L_{\mathrm{epi}}(x_1) = \operatorname{dist}\!\big(\hat{h}_{1\to 2}(x_1),\, F x_1\big) \quad (1)$$

where $\hat{h}_{1\to 2}(x_1)$ is the predicted correspondence in $I_2$ for the point $x_1$ in $I_1$, and $\operatorname{dist}(\cdot,\cdot)$ is the distance between a point and a line.
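As a concrete illustration, the epipolar term of Eq. (1) reduces to a standard point-to-line distance. Below is a minimal NumPy sketch; the function and variable names are ours, not from any released implementation:

```python
import numpy as np

def epipolar_loss(x1, x2_pred, F):
    """Distance from the predicted match x2_pred (in image 2) to the
    epipolar line l = F @ x1 induced by the query point x1 (in image 1).

    x1, x2_pred: pixel coordinates (x, y); F: 3x3 fundamental matrix.
    """
    x1_h = np.append(x1, 1.0)        # homogeneous coordinates
    x2_h = np.append(x2_pred, 1.0)
    line = F @ x1_h                  # epipolar line (a, b, c): ax + by + c = 0
    # point-to-line distance: |ax + by + c| / sqrt(a^2 + b^2)
    return np.abs(x2_h @ line) / np.linalg.norm(line[:2])
```

A prediction lying exactly on the epipolar line incurs zero loss, which is precisely why the cycle consistency term below is needed as a complement.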
The epipolar loss alone only encourages a predicted match to lie on the epipolar line, rather than near the ground-truth correspondence location (which is at an unknown position along the line). To provide additional supervision, we introduce a cycle consistency loss, which encourages the forward-backward mapping of a point to be spatially close to itself [63]:

$$L_{\mathrm{cycle}}(x_1) = \big\| \hat{h}_{2\to 1}\!\big(\hat{h}_{1\to 2}(x_1)\big) - x_1 \big\|_2 \quad (2)$$
This term encourages the network to find true correspondences and suppress other outputs, especially those that satisfy the epipolar constraint alone.
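In code, the cycle term is a tiny helper that composes the forward and backward soft matches; a sketch under our naming assumptions:

```python
import numpy as np

def cycle_consistency_loss(x1, h_12, h_21):
    """Forward-backward reprojection error: map x1 into image 2 with h_12,
    map the result back with h_21, and penalize deviation from x1.
    h_12 / h_21 are callables returning predicted match locations."""
    return np.linalg.norm(h_21(h_12(x1)) - x1)
```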
Full Training Objective. For each image pair, our total objective is a weighted sum of the epipolar and cycle consistency losses, totaled over $n$ sampled query points:

$$L = \sum_{i=1}^{n} L_{\mathrm{epi}}(x_1^i) + \lambda\, L_{\mathrm{cycle}}(x_1^i) \quad (3)$$

where $x_1^i$ is the $i$-th training point in $I_1$, and $\lambda$ is a weight for the cycle consistency loss term. At the end of Sec. 3.2, we further show how we can reweight individual training instances in Eq. (3) to improve training.
3.2 Differentiable Matching Layer
The objective defined above is a simple function of the pixel locations of the predicted correspondences. Minimizing this objective through gradient descent therefore requires these locations to be differentiable with respect to the network parameters. Many prior methods establish correspondence by identifying nearest neighbor matches, which unfortunately is a nondifferentiable operation.
[Fig. 3: (a) Differentiable matching layer; (b) coarse-to-fine matching. We use the location of highest probability at the coarse level (red circle) to determine the location of a local window at the fine level. During training, we compute correspondence locations at both the coarse and fine levels from their respective distributions and impose our loss functions on both. This allows us to train both coarse- and fine-level features simultaneously. Please refer to Sec. 3.5 for implementation details.]

To address this challenge, we propose a differentiable matching layer, illustrated in Fig. 3(a). Given a pair of images, we first use convolutional networks with shared weights to extract dense feature descriptors $f_1$ and $f_2$. To compute the correspondence for a query point $x_1$ in $I_1$, we correlate its feature descriptor $f_1(x_1)$ with all of $f_2$. Following a 2D softmax operation [20], we obtain a distribution over 2D pixel locations of $I_2$, indicating the probability of each location being the correspondence of $x_1$. We denote this probability distribution as $p(x \mid x_1)$:

$$p(x \mid x_1) = \frac{\exp\!\big(f_1(x_1)^\top f_2(x)\big)}{\sum_{y \in I_2} \exp\!\big(f_1(x_1)^\top f_2(y)\big)} \quad (4)$$
where $x$ varies over the pixel grid of $I_2$. A single 2D match can then be computed as the expectation of this distribution:

$$\hat{h}_{1\to 2}(x_1) = \mathbb{E}_{x \sim p(x \mid x_1)}[x] = \sum_{x \in I_2} x\, p(x \mid x_1) \quad (5)$$
This makes the entire system end-to-end trainable. Since the correspondence location is computed from the correlation between feature descriptors, enforcing it to be correct drives descriptor learning. Below we discuss additional advantages of our differentiable matching layer.
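The matching layer just described (correlation, 2D softmax, expectation) can be sketched in NumPy as follows; this is an illustrative re-implementation, not the paper's code:

```python
import numpy as np

def soft_match(f_query, feat_map2):
    """Differentiable matching: correlate a query descriptor with every
    location of image 2, softmax over the 2D grid, and return the expected
    match location plus the full probability map.

    Shapes: f_query (C,), feat_map2 (H, W, C). Coordinates are (x, y)."""
    H, W, _ = feat_map2.shape
    corr = feat_map2.reshape(H * W, -1) @ f_query          # correlation scores
    corr = corr - corr.max()                               # numerical stability
    p = np.exp(corr) / np.exp(corr).sum()                  # 2D softmax, flattened
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1)    # (H*W, 2) as (x, y)
    expected = (p[:, None] * coords).sum(axis=0)           # expectation over grid
    return expected, p.reshape(H, W)
```

With a sharply peaked correlation, the expectation collapses to the peak location; with a diffuse distribution, it averages over modes, which is exactly what the uncertainty measure of the next subsection quantifies.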
Leveraging Uncertainty during Training. This differentiable matching also provides an interpretable measure of uncertainty. For each query point $x_1$, we can calculate the total variance $\sigma^2(x_1)$, defined as the trace of the covariance matrix of the 2D distribution $p(x \mid x_1)$, as an uncertainty measure. High variance indicates multiple or diffuse modes, signifying an unreliable prediction.

This uncertainty can help identify unreliable correspondences and improve training. Essentially, due to the lack of ground-truth correspondence annotations, it is unknown during training whether a query point has a true correspondence in the other image (it could be missing due to occlusion or truncation). Minimizing the loss for such points can lead to incorrect training signals. To alleviate this issue, we reweight the losses for individual points using the total variance defined above, resulting in the final weighted loss function:

$$L = \sum_{i=1}^{n} w_i \big( L_{\mathrm{epi}}(x_1^i) + \lambda\, L_{\mathrm{cycle}}(x_1^i) \big), \qquad w_i = \frac{1/\sigma^2(x_1^i)}{\sum_{j=1}^{n} 1/\sigma^2(x_1^j)} \quad (6)$$
where the weights are normalized so that they sum to one. This weighting strategy weakens the effect of infeasible and non-discriminative training points, which we find to be critical for rapid convergence. To prevent the network from minimizing the loss simply by increasing the variance, gradients are not back-propagated through the variance during training.
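To make the reweighting concrete, here is a NumPy sketch of the total variance (trace of the covariance of the match distribution) and an inverse-variance weighting normalized to sum to one; this is our reading of the scheme, and the paper's exact weighting may differ in detail:

```python
import numpy as np

def total_variance(p, coords):
    """Trace of the covariance of a 2D match distribution: Var[x] + Var[y].
    p: (N,) probabilities; coords: (N, 2) pixel coordinates."""
    mean = (p[:, None] * coords).sum(axis=0)
    return (p[:, None] * (coords - mean) ** 2).sum()

def reweighted_loss(per_point_losses, variances):
    """Inverse-variance weights, normalized to sum to one. In training,
    gradients would be stopped through `variances` (detached)."""
    w = 1.0 / np.asarray(variances)
    w = w / w.sum()
    return float((w * np.asarray(per_point_losses)).sum())
```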
3.3 CoarsetoFine Architecture
During training, we impose supervision only on sparsely sampled query points for each pair of images. While this keeps the computational cost manageable, having to search for correspondence over the entire image is still costly. To overcome this issue, we propose a coarse-to-fine architecture that significantly improves computational efficiency while preserving the resolution of the learned descriptors. Fig. 3(b) illustrates the coarse-to-fine module. Instead of generating a flat feature descriptor map, we produce both coarse-level and fine-level feature descriptors.
Coarse-to-fine matching works as follows. Given a query point $x_1$, we first compute the distribution over all locations of the coarse feature map. At the fine level, in contrast, we compute the fine-level distribution only within a local window centered at the highest-probability location of the coarse-level distribution (with coordinates rescaled appropriately). Given the coarse- and fine-level distributions, correspondences at both levels can be computed. We then impose our loss function (Eq. (6)) on correspondences at both levels, which allows us to train coarse and fine feature descriptors simultaneously.
The coarse-to-fine architecture allows us to learn fine descriptors without evaluating the full correlation between large feature maps, significantly reducing computational cost. In addition, as observed by Liu et al. [32], we find that coarse-to-fine matching not only improves efficiency but also boosts matching accuracy (Sec. 4.3). By concatenating the coarse- and fine-level descriptors, we obtain final hierarchical descriptors [17] that capture both abstract and detailed information.
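The coarse-level peak selection and fine-level window placement can be sketched as below; the rescaling rule and window size here are our illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def coarse_to_fine_window(p_coarse, fine_hw, win):
    """Pick the argmax of the coarse match distribution, rescale it to
    fine-map resolution, and return a clipped local search window
    (x0, y0, x1, y1); `win` is a half-size in fine-map pixels."""
    Hc, Wc = p_coarse.shape
    Hf, Wf = fine_hw
    iy, ix = np.unravel_index(np.argmax(p_coarse), p_coarse.shape)
    cx, cy = ix * Wf // Wc, iy * Hf // Hc      # coarse peak -> fine grid
    x0, y0 = max(cx - win, 0), max(cy - win, 0)
    x1, y1 = min(cx + win + 1, Wf), min(cy + win + 1, Hf)
    return x0, y0, x1, y1
```

The fine-level softmax and expectation are then computed only over this window, which is what makes the full-resolution correlation unnecessary.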
3.4 Discussion
Effectiveness of Epipolar Constraint. While the epipolar constraint may appear to provide very weak supervision at first glance, the experimental results in Sec. 4 suggest that it provides empirically sufficient supervision for descriptor learning. One key reason is that the epipolar constraint suppresses a large number of incorrect correspondences (every point not on the epipolar line). Moreover, among all valid predictions that satisfy the epipolar constraint, true correspondences are most likely to have similar feature encodings, given their local appearance similarity. Therefore, by aggregating this geometric constraint over all training data, the network learns to encode the similarity between true correspondences, leading to effective learned feature descriptors.
Training with Ground-truth Correspondence Annotations. Although our focus is on learning from camera poses alone, our system can also be trained with ground-truth correspondence annotations when such data is available. In this case, we can replace our loss functions with a distance between the predicted and ground-truth correspondences.
Matching at Test Time. The descriptors learned by our system can be integrated into standard feature matching pipelines. Given a detected keypoint, feature vectors in the coarse and fine feature maps are extracted by interpolation and concatenated to form the final descriptor. We then match features using the standard Euclidean distance between them.
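The test-time descriptor extraction can be sketched as bilinear sampling of both feature maps followed by concatenation; the helper names and coordinate convention are ours:

```python
import numpy as np

def describe_keypoint(pt, coarse_map, fine_map):
    """Descriptor for a keypoint at pt = (x, y) in fine-map coordinates:
    bilinearly interpolate each map (coarse coordinates rescaled by the
    resolution ratio) and concatenate the two feature vectors."""
    def bilinear(fmap, x, y):
        H, W, _ = fmap.shape
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
        dx, dy = x - x0, y - y0
        return ((1 - dx) * (1 - dy) * fmap[y0, x0] + dx * (1 - dy) * fmap[y0, x1]
                + (1 - dx) * dy * fmap[y1, x0] + dx * dy * fmap[y1, x1])
    x, y = pt
    sy = coarse_map.shape[0] / fine_map.shape[0]
    sx = coarse_map.shape[1] / fine_map.shape[1]
    return np.concatenate([bilinear(coarse_map, x * sx, y * sy),
                           bilinear(fine_map, x, y)])
```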
3.5 Implementation Details
Architecture. We use an ImageNet-pretrained ResNet-50 [12, 21, 48] architecture, truncated after layer3, as our backbone. With an additional convolutional layer we obtain the coarse-level feature map. The fine-level feature map is obtained by further convolutional layers along with upsampling and skip connections. The sizes of the coarse- and fine-level feature maps are and of the original image size, respectively. They both have a feature dimensionality of . The size of the local window at the fine level is the size of the fine-level feature map.

Training Data. We train using the MegaDepth dataset [31], which consists of 196 different scenes reconstructed from over 1M internet photos using COLMAP [56]. 130 of the 196 scenes are used for training and the rest for validation and testing. This gives us millions of training pairs with known camera poses. We train our system on these pairs using only the provided camera poses and intrinsics.
Training Details. We train the network using Adam [29] with a base learning rate of . The weight for the cycle consistency term is set to . query points are used in each training image pair due to memory constraints. These query points consist of 90% SIFT [36] keypoints and 10% random points.
For more implementation details, please refer to the supplementary material.
4 Experimental Results
To evaluate our descriptors as well as the impact of the various design choices, we conduct three sets of experiments:

Feature matching experiments: The most direct evaluation of our descriptors is in terms of how accurately they can be matched between images. We evaluate both sparse and dense feature matching on the HPatches dataset [2].

Downstream task experiments: Feature matches are rarely the end goal. Instead, they form a core part of many 3D reconstruction tasks. We evaluate the impact of our descriptors on two-view geometry estimation (homography estimation on HPatches as well as relative pose estimation on MegaDepth [31] and ScanNet [11]) and 3D reconstruction (as part of an SfM pipeline in the ETH local features benchmark [57]).

Ablation study: We evaluate the impact of each proposed contribution using the HPatches dataset.
4.1 Feature Matching Results
We evaluate our learned descriptors on both sparse and dense feature matching on the HPatches dataset [2]. HPatches is a homography dataset containing 116 sequences, where 57 sequences have illumination changes and 59 have viewpoint changes. In each sequence, the first image is taken as a reference and it forms pairs with subsequent images with increasing variations.
Sparse Feature Matching. Given a pair of images, we extract keypoints in both images and match them using feature descriptors. We follow the same evaluation protocol as in D2-Net [15] and use mean matching accuracy (MMA) as the evaluation metric. The MMA score is defined as the average percentage of correct matches per image pair under a given pixel error threshold. Only mutual nearest-neighbor matches are considered.
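The MMA metric itself is straightforward to compute; a sketch, assuming matched keypoints have already been projected into a common frame by the ground-truth homography:

```python
import numpy as np

def mean_matching_accuracy(pred_pts, gt_pts, thresholds=(1, 3, 5)):
    """MMA: for one image pair, the fraction of putative matches whose
    reprojection error falls below each pixel threshold. Benchmark scores
    then average these fractions over all pairs.

    pred_pts, gt_pts: (N, 2) arrays of matched / ground-truth locations.
    The threshold values here are illustrative defaults."""
    err = np.linalg.norm(np.asarray(pred_pts) - np.asarray(gt_pts), axis=1)
    return {t: float((err <= t).mean()) for t in thresholds}
```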
We combine our descriptor with SIFT [36] and SuperPoint [14] keypoints, which are representative of handcrafted and learned keypoints, respectively. We compare to several baselines: Hessian affine detector [40] with RootSIFT descriptor [36, 1] (HesAff + RootSIFT), HesAffNet [43] regions with HardNet++ descriptors [42] (HAN + HN++), DELF [45], SuperPoint [14], LF-Net [47], multiscale D2-Net [15] (D2-Net MS), SIFT detector with ContextDesc descriptors [37] (SIFT + ContextDesc), as well as R2D2 [49].
Fig. 4 shows MMA results on the HPatches dataset. We report results for the whole dataset, as well as for the subsets corresponding to illumination and viewpoint changes. Following D2-Net [15], we additionally report the mean number of detected features per image and of mutual nearest-neighbor matches per pair. Our descriptor combined with SuperPoint keypoints achieves the best overall performance, and our descriptor combined with SIFT keypoints is also competitive. In addition, with the same detectors, our descriptor shows clear improvements over its counterparts ("SIFT + Ours" vs. "SIFT + ContextDesc", "SuperPoint + Ours" vs. "SuperPoint").
Dense Feature Matching. To evaluate our dense matching capability, we extract keypoints on a grid in the first image and find their nearest-neighbor matches in the full second image. The percentage of correct keypoints (PCK) metric [9, 35, 66] is used to measure performance: the predicted match for a query point is deemed correct if it is within a certain pixel threshold of the ground-truth match.
We compare to three baseline methods that produce dense descriptors: Dense SIFT [36], SuperPoint [14] (while SuperPoint produces dense descriptors, it is trained on sparse interest points), and D2-Net [15]. Fig. 5(a) shows the mean PCK over all image pairs in the HPatches dataset. Our method outperforms the other methods by a large margin. Fig. 5(b) shows the qualitative performance of our dense correspondence.
4.2 Results on Downstream Tasks
Next, we evaluate how well our learned descriptors facilitate downstream tasks. We focus on two tasks related to two-view geometry estimation, homography estimation and relative camera pose estimation, and a third task related to 3D reconstruction. For the two-view geometry tasks, we compare our method to five existing descriptor methods: SIFT [36], LF-Net [47], SuperPoint [14], D2-Net [15], and ContextDesc [37]. As in Sec. 4.1, we evaluate our descriptor in combination with SIFT keypoints and SuperPoint keypoints.
Homography Estimation. We use the same HPatches dataset as in Sec. 4.1 to evaluate our feature descriptors on the homography estimation task. We follow the corner correctness metric used in SuperPoint [13, 14]: the four corners of one image are transformed into the other image using the estimated homography and compared with the corners computed using the ground-truth homography. The estimated homography is deemed correct if the average error of the four corners is below a given pixel threshold. The accuracy is averaged over all image pairs.
Following SuperPoint [14], for all methods we extract a maximum of 1,000 keypoints from each image and robustly estimate the homography from mutual nearest-neighbor matches. Homography accuracy at three pixel thresholds is shown in Tab. 1. As can be seen, our descriptor improves over both the SIFT and SuperPoint descriptors. With SuperPoint keypoints, our method outperforms all other methods, even without training on annotated correspondences.
Tab. 1: Homography estimation accuracy [%] on HPatches at three increasing corner-error thresholds.

Method                    Accuracy [%]
SIFT [36]                 40.5   68.1   77.6
LF-Net [47]               34.8   62.9   73.8
SuperPoint [14]           37.4   73.1   82.8
D2-Net [15]               16.7   61.0   75.9
ContextDesc [37]          41.0   73.1   82.2
Ours w/ SIFT kp.          34.6   72.2   81.7
Ours w/ SuperPoint kp.    44.8   74.5   85.7
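The corner correctness metric described above can be sketched as follows; the image dimensions and names are illustrative:

```python
import numpy as np

def corner_error(H_est, H_gt, w, h):
    """Mean distance between the four image corners warped by the estimated
    and ground-truth 3x3 homographies. An estimated homography is deemed
    correct if this error is below the chosen pixel threshold."""
    corners = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=float)
    def warp(H, pts):
        pts_h = np.hstack([pts, np.ones((4, 1))]) @ H.T   # homogeneous warp
        return pts_h[:, :2] / pts_h[:, 2:3]               # perspective divide
    return float(np.linalg.norm(warp(H_est, corners) - warp(H_gt, corners), axis=1).mean())
```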
Tab. 2: Relative pose estimation accuracy [%] (rotation / translation) on ScanNet and MegaDepth.

Method                     ScanNet: frame interval                MegaDepth
                           10           30           60           easy         moderate     hard
SIFT [36]                  91.0 / 14.1  65.1 / 15.6  41.4 / 11.9  58.9 / 20.2  26.9 / 11.8  13.6 / 9.6
SIFT w/ ratio test [36]    91.2 / 15.9  67.1 / 19.8  44.3 / 15.9  63.9 / 25.6  36.5 / 17.0  20.8 / 13.2
SuperPoint [14]            94.4 / 17.5  75.9 / 26.3  53.4 / 22.1  67.2 / 27.1  38.7 / 18.8  24.5 / 14.1
LF-Net [47]                93.6 / 17.4  76.0 / 22.4  49.9 / 18.0  52.3 / 18.6  25.5 / 13.2  15.4 / 11.1
D2-Net [15]                91.6 / 13.3  68.4 / 19.5  42.0 / 14.6  61.8 / 23.6  35.2 / 19.2  19.1 / 12.2
ContextDesc [37]           91.5 / 16.3  73.8 / 21.8  51.4 / 18.5  68.9 / 27.1  43.1 / 21.5  27.5 / 14.1
Ours w/ SIFT kp.           92.3 / 16.3  74.8 / 22.5  50.8 / 20.9  70.0 / 30.5  50.2 / 24.8  36.8 / 16.1
Ours w/ SuperPoint kp.     96.1 / 17.1  79.5 / 27.2  59.3 / 26.1  72.9 / 30.5  53.5 / 27.9  38.1 / 19.2
Relative Pose Estimation. We also evaluate the performance of our learned descriptors on the task of relative camera pose estimation. Note that we train only on MegaDepth [31] but test on both MegaDepth and ScanNet [11], an indoor dataset that we use to test the generalization of our descriptors. For MegaDepth, we generate overlapping image pairs from test scenes and classify them into three subsets (easy, moderate, and hard) according to relative rotation angle. For ScanNet, we follow LF-Net [47] and randomly sample image pairs at three different frame intervals: 10, 30, and 60. Larger frame intervals imply harder pairs for matching. Each subset in MegaDepth and ScanNet consists of 1,000 image pairs.

To estimate relative pose, we first compute mutual nearest-neighbor matches between detected keypoints. We then use RANSAC [19] to estimate the essential matrix and decompose it to obtain the relative camera pose. For SIFT [36] we additionally prune matches using the ratio test [36], since that is the gold standard for camera pose estimation (i.e., we report the performance of both plain SIFT and SIFT with a carefully-tuned ratio test).
Following UCN [9], we evaluate the estimated camera pose using the angular deviation for both rotation and translation. We consider a rotation or translation to be correct if its angular deviation is less than a threshold, and report the average accuracy for that threshold, using different thresholds for ScanNet and MegaDepth since MegaDepth is harder due to larger illumination changes. Results for all methods are reported in Tab. 2. Our descriptor improves performance over the SIFT and SuperPoint descriptors, and "Ours w/ SuperPoint keypoints" outperforms all other methods. Qualitative results on MegaDepth test images are shown in Fig. 6.
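The rotation part of the angular-deviation metric can be computed from the relative rotation between the estimated and ground-truth matrices; a minimal sketch:

```python
import numpy as np

def rotation_angle_deg(R_est, R_gt):
    """Angular deviation between two 3x3 rotation matrices: the angle of
    the relative rotation R_est @ R_gt.T, via trace(R) = 1 + 2*cos(theta)."""
    cos = (np.trace(R_est @ R_gt.T) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```

The translation deviation is measured analogously as the angle between the estimated and ground-truth (unit-norm) translation directions.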
Tab. 3: Results on the ETH local features benchmark [57].

Dataset              Method         #Registered  #Sparse Points  #Obs.    Track Len.  Reproj. Err.
Madrid Metropolis    SIFT [36]      500          116K            734K     6.32        0.61px
(1,344 images)       GeoDesc [38]   809          307K            1,200K   3.91        0.66px
                     D2-Net [15]    501          84K             -        6.33        1.28px
                     SOSNet [62]    844          335K            1,411K   4.21        0.70px
                     Ours           851          242K            1,489K   6.16        1.03px
Gendarmenmarkt       SIFT           1,035        339K            1,872K   5.52        0.70px
(1,463 images)       GeoDesc        1,208        780K            2,903K   3.72        0.74px
                     D2-Net         1,053        250K            -        5.08        1.19px
                     SOSNet         1,201        816K            3,255K   3.98        0.77px
                     Ours           1,179        627K            3,330K   5.31        1.00px
Tower of London      SIFT           804          240K            1,863K   7.77        0.62px
(1,576 images)       GeoDesc        1,081        622K            2,852K   4.58        0.69px
                     D2-Net         785          180K            -        5.32        1.24px
                     Ours           1,104        452K            2,627K   5.81        0.98px
3D Reconstruction. Finally, we evaluate the effectiveness of our learned feature descriptors in the context of 3D reconstruction. We use the ETH local features benchmark [57], which evaluates descriptors for the SfM task. We extract our feature descriptors at keypoint locations provided by [57] and feed them into the evaluation protocol. Following [38], we do not conduct the ratio test, in order to investigate the direct matching performance of the descriptors. To quantify the quality of SfM, we report the number of registered images (#Registered), sparse 3D points (#Sparse Points), and image observations (#Obs.), the mean track length of the 3D points (Track Len.), and the mean reprojection error (Reproj. Err.).
We use SIFT [36], GeoDesc [38], D2-Net [15], and SOSNet [62] as our baselines and show the results in Tab. 3. Note that SIFT and D2-Net [15] apply the ratio test but the other methods do not. Our method is comparable to or even outperforms our baselines in terms of the completeness of the sparse reconstruction (i.e., the number of registered images, sparse points, and observations). However, we do not achieve the lowest reprojection error. A similar situation is observed in [38, 62], and can be explained by the trade-off between completeness of reconstruction and low reprojection error: fewer matches tend to lead to lower reprojection error. Taking all metrics into consideration, the performance of our descriptor for SfM is competitive, indicating the advantages of our descriptors even when trained with only weak pose supervision.
4.3 Ablation Analysis
In this section, we conduct an ablation analysis to demonstrate the effectiveness of our proposed camera pose supervision and architectural designs. We follow the evaluation protocol in Sec. 4.1 and report MMA and PCK scores over all image pairs in the HPatches dataset [2]. For sparse feature matching, we combine our descriptors with SIFT [36] keypoints. The variants of our default method (Ours) are introduced below. For a fair comparison, we train each variant on the same training data (20K image pairs) from MegaDepth [31] for 10 epochs.
Variants. Ours from scratch is trained from scratch instead of using ImageNet [12] pretrained weights. Ours supervised is trained on sparse ground-truth correspondences provided by the SfM models of MegaDepth [31]; we simply change the epipolar loss to a loss between predicted and ground-truth correspondence locations. Triplet Loss is also trained on sparse ground-truth correspondences, but uses a standard triplet loss and a hard-negative mining strategy [9]. Ours w/o c2f is a single-scale version of our method, where the coarse-level feature maps are removed and only the fine-level feature maps are trained and used as descriptors. Ours w/o cycle does not use the cycle consistency loss term, and Ours w/o reweighting removes the uncertainty-based reweighting from our final loss function and applies uniform weights during training. Below we provide a detailed analysis based on these variants. The results are shown in Fig. 7.
Analysis of Supervision Signal. The relatively small margin between Ours and Ours supervised validates the effectiveness of camera pose supervision. In addition, perhaps because Ours supervised is trained only on sparsely reconstructed SIFT matches, it achieves a negligible improvement in PCK for dense feature matching. It is also worth noting that both Ours supervised and Ours outperform the plain version of Triplet Loss, where Ours supervised and Triplet Loss share the same correspondence annotations but Ours uses only camera pose. In terms of loss functions, cycle consistency provides only a marginal improvement. Moreover, if we enable only the cycle consistency loss without the epipolar loss, training fails, which validates the importance of the epipolar constraint. Ours from scratch shows that even with randomly initialized weights, our network still converges and learns good descriptors, further validating the effectiveness of our loss functions.
Analysis of Architecture Design. As shown in Fig. 7, the coarsetofine module significantly improves performance. Two explanations for this improvement include: 1) At the fine level, correspondence is computed within a local window, which may reduce issues arising from multimodal distributions compared to a flat model that computes expectations over the whole image; and 2) The coarsetofine module produces hierarchical feature descriptors that capture both global and local information, which may be beneficial for feature matching.
5 Conclusion
In this paper, we propose a novel descriptor learning framework that can be trained using only camera pose supervision. We present both new loss functions that exploit epipolar constraints and a new, efficient architectural design that enables learning by making correspondence differentiable. Experiments show that our method achieves state-of-the-art performance across a range of geometric tasks, outperforming fully-supervised counterparts without using any correspondence annotations for training. In future work, we will study how to further improve the invariance of the learned descriptors to large geometric transformations. It is also worth investigating whether pose supervision and traditional metric learning losses are complementary, and whether their combination can lead to even better performance.
Acknowledgements. We thank Kai Zhang, Zixin Luo, Zhengqi Li for helpful discussion and comments. This work was partly supported by a DARPA LwLL grant, and in part by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program.
References
 [1] Arandjelović, R., Zisserman, A.: Three things everyone should know to improve object retrieval. In: CVPR (2012)
 [2] Balntas, V., Lenc, K., Vedaldi, A., Mikolajczyk, K.: Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In: CVPR (2017)
 [3] Balntas, V., Riba, E., Ponsa, D., Mikolajczyk, K.: Learning local feature descriptors with triplets and shallow convolutional neural networks. In: BMVC (2016)
 [4] Bay, H., Tuytelaars, T., Van Gool, L.: Surf: Speeded up robust features. In: ECCV (2006)
 [5] Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: ICML (2009)
 [6] Calonder, M., Lepetit, V., Strecha, C., Fua, P.: Brief: Binary robust independent elementary features. In: ECCV (2010)
 [7] Chang, A., Dai, A., Funkhouser, T., Halber, M., Niessner, M., Savva, M., Song, S., Zeng, A., Zhang, Y.: Matterport3d: Learning from rgbd data in indoor environments. arXiv preprint arXiv:1709.06158 (2017)
 [8] Chang, J.R., Chen, Y.S.: Pyramid stereo matching network. In: CVPR (2018)
 [9] Choy, C.B., Gwak, J., Savarese, S., Chandraker, M.: Universal correspondence network. In: NeurIPS (2016)
 [10] Christiansen, P.H., Kragh, M.F., Brodskiy, Y., Karstoft, H.: UnsuperPoint: End-to-end unsupervised interest point detector and descriptor. arXiv preprint arXiv:1907.04011 (2019)
 [11] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In: CVPR (2017)
 [12] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR (2009)
 [13] DeTone, D., Malisiewicz, T., Rabinovich, A.: Deep image homography estimation. arXiv preprint arXiv:1606.03798 (2016)
 [14] DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperPoint: Self-supervised interest point detection and description. In: CVPR Workshops (2018)
 [15] Dusmanu, M., Rocco, I., Pajdla, T., Pollefeys, M., Sivic, J., Torii, A., Sattler, T.: D2-Net: A trainable CNN for joint detection and description of local features. arXiv preprint arXiv:1905.03561 (2019)
 [16] Ebel, P., Mishchuk, A., Yi, K.M., Fua, P., Trulls, E.: Beyond Cartesian representations for local descriptors. In: ICCV (2019)
 [17] Fathy, M.E., Tran, Q.H., Zeeshan Zia, M., Vernaza, P., Chandraker, M.: Hierarchical metric learning and matching for 2D and 3D geometric correspondences. In: ECCV (2018)
 [18] Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., Van der Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852 (2015)
 [19] Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–395 (1981)

 [20] Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016), http://www.deeplearningbook.org
 [21] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
 [22] He, K., Lu, Y., Sclaroff, S.: Local descriptors optimized for average precision. In: CVPR (2018)
 [23] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: CVPR (2017)
 [24] Jeon, S., Kim, S., Min, D., Sohn, K.: PARN: Pyramidal affine regression networks for dense semantic correspondence. In: ECCV (2018)
 [25] Ke, Y., Sukthankar, R.: PCA-SIFT: A more distinctive representation for local image descriptors. In: CVPR (2004)
 [26] Keller, M., Chen, Z., Maffra, F., Schmuck, P., Chli, M.: Learning deep descriptors with scale-aware triplet networks. In: CVPR (2018)
 [27] Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A., Bry, A.: End-to-end learning of geometry and context for deep stereo regression. In: ICCV (2017)

 [28] Kim, S., Lin, S., Jeon, S.R., Min, D., Sohn, K.: Recurrent transformer networks for semantic correspondence. In: NeurIPS (2018)
 [29] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
 [30] Kumar, B., Carneiro, G., Reid, I., et al.: Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In: CVPR (2016)
 [31] Li, Z., Snavely, N.: MegaDepth: Learning single-view depth prediction from internet photos. In: CVPR (2018)
 [32] Liu, C., Yuen, J., Torralba, A., Sivic, J., Freeman, W.T.: SIFT Flow: Dense correspondence across different scenes. In: ECCV (2008)
 [33] Liu, Y., Shen, Z., Lin, Z., Peng, S., Bao, H., Zhou, X.: GIFT: Learning transformation-invariant dense visual descriptors via group CNNs. In: NeurIPS (2019)
 [34] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
 [35] Long, J.L., Zhang, N., Darrell, T.: Do convnets learn correspondence? In: NeurIPS (2014)
 [36] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
 [37] Luo, Z., Shen, T., Zhou, L., Zhang, J., Yao, Y., Li, S., Fang, T., Quan, L.: ContextDesc: Local descriptor augmentation with cross-modality context. In: CVPR (2019)
 [38] Luo, Z., Shen, T., Zhou, L., Zhu, S., Zhang, R., Yao, Y., Fang, T., Quan, L.: GeoDesc: Learning local descriptors by integrating geometry constraints. In: ECCV (2018)
 [39] Melekhov, I., Tiulpin, A., Sattler, T., Pollefeys, M., Rahtu, E., Kannala, J.: DGC-Net: Dense geometric correspondence network. In: WACV (2019)
 [40] Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point detectors. IJCV 60(1), 63–86 (2004)
 [41] Mishchuk, A., Mishkin, D., Radenovic, F., Matas, J.: Working hard to know your neighbor’s margins: Local descriptor learning loss. In: NeurIPS (2017)
 [42] Mishchuk, A., Mishkin, D., Radenovic, F., Matas, J.: Working hard to know your neighbor’s margins: Local descriptor learning loss. In: NeurIPS (2017)
 [43] Mishkin, D., Radenovic, F., Matas, J.: Repeatability is not enough: Learning affine regions via discriminability. In: ECCV (2018)
 [44] Mukundan, A., Tolias, G., Chum, O.: Explicit spatial encoding for deep local descriptors. In: CVPR (2019)

 [45] Noh, H., Araujo, A., Sim, J., Weyand, T., Han, B.: Large-scale image retrieval with attentive deep local features. In: ICCV (2017)
 [46] Oh Song, H., Xiang, Y., Jegelka, S., Savarese, S.: Deep metric learning via lifted structured feature embedding. In: CVPR (2016)
 [47] Ono, Y., Trulls, E., Fua, P., Yi, K.M.: LF-Net: Learning local features from images. In: NeurIPS (2018)
 [48] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: NIPS Autodiff Workshop (2017)
 [49] Revaud, J., Weinzaepfel, P., de Souza, C.R., Humenberger, M.: R2D2: repeatable and reliable detector and descriptor. In: NeurIPS (2019)
 [50] Rocco, I., Arandjelovic, R., Sivic, J.: Convolutional neural network architecture for geometric matching. In: CVPR (2017)
 [51] Rocco, I., Cimpoi, M., Arandjelović, R., Torii, A., Pajdla, T., Sivic, J.: Neighbourhood consensus networks. In: NeurIPS (2018)
 [52] Rublee, E., Rabaud, V., Konolige, K., Bradski, G.R.: ORB: An efficient alternative to SIFT or SURF. In: Proc. Int. Conf. on Computer Vision (ICCV) (2011)
 [53] Sattler, T., Maddern, W., Toft, C., Torii, A., Hammarstrand, L., Stenborg, E., Safari, D., Okutomi, M., Pollefeys, M., Sivic, J., et al.: Benchmarking 6DOF outdoor visual localization in changing conditions. In: CVPR (2018)
 [54] Sattler, T., Weyand, T., Leibe, B., Kobbelt, L.: Image retrieval for image-based localization revisited. In: BMVC. p. 4 (2012)
 [55] Schmidt, T., Newcombe, R., Fox, D.: Self-supervised visual descriptor learning for dense correspondence. IEEE Robotics and Automation Letters 2(2), 420–427 (2016)
 [56] Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)
 [57] Schönberger, J.L., Hardmeier, H., Sattler, T., Pollefeys, M.: Comparative evaluation of hand-crafted and learned local features. In: CVPR (2017)
 [58] Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., Moreno-Noguer, F.: Discriminative learning of deep convolutional feature point descriptors. In: ICCV (2015)
 [59] Sohn, K.: Improved deep metric learning with multi-class N-pair loss objective. In: NeurIPS (2016)
 [60] Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In: CVPR (2018)
 [61] Tian, Y., Fan, B., Wu, F.: L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. In: CVPR (2017)
 [62] Tian, Y., Yu, X., Fan, B., Wu, F., Heijnen, H., Balntas, V.: SOSNet: Second order similarity regularization for local descriptor learning. In: CVPR (2019)
 [63] Wang, X., Jabri, A., Efros, A.A.: Learning correspondence from the cycle-consistency of time. In: CVPR (2019)
 [64] Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: LIFT: Learned invariant feature transform. In: ECCV (2016)
 [65] Zhang, L., Rusinkiewicz, S.: Learning local descriptors with a CDF-based dynamic soft margin. In: ICCV (2019)
 [66] Zhou, T., Jae Lee, Y., Yu, S.X., Efros, A.A.: FlowWeb: Joint image set alignment by weaving consistent, pixel-wise correspondences. In: CVPR (2015)
 [67] Zhou, T., Krahenbuhl, P., Aubry, M., Huang, Q., Efros, A.A.: Learning dense correspondence via 3D-guided cycle consistency. In: CVPR (2016)
6 Supplementary Material
In this supplementary material, we provide additional experimental results, visualizations, and implementation details. In Sec. 6.1, we demonstrate the performance of our learned descriptors on a standard visual localization benchmark. In Sec. 6.2, we visualize the probability distributions over correspondences for given query points in example image pairs. In Sec. 6.3, we show more qualitative results for dense feature matching. Finally, in Sec. 6.4, we provide additional implementation details, including the network architecture and training details.
6.1 Visual Localization Results
As an additional experiment, we evaluate how well our descriptors facilitate the task of visual localization. Following D2-Net [15], we use the Aachen Day-Night dataset [54] and the standard protocol proposed in [53] to evaluate the performance of our descriptors in the context of long-term localization (https://www.visuallocalization.net/). Specifically, this protocol first performs exhaustive feature matching among the daytime images with known poses, and then uses triangulation to obtain 3D scene structure. The resulting 3D models are then used to localize 98 nighttime query images in the dataset. For each nighttime image, up to 20 relevant daytime images with known camera poses are given as reference images.
Following D2-Net [15], we report the percentage of correctly localized nighttime queries under given error thresholds on camera position and orientation. As in the paper, we report the performance of our descriptors combined with both SIFT [36] and SuperPoint [14] keypoints. We compare our method with other state-of-the-art methods and report the results in Tab. 4. Combined with SuperPoint [14] keypoints, our descriptors achieve the best overall performance. When combined with SIFT [36] keypoints, our descriptors are also comparable to the baseline methods. Note that our descriptors are learned using only the weak supervision of camera poses.
Table 4: Correctly localized queries (%) on the Aachen Day-Night dataset.
Methods  (0.5m, 2°)  (1m, 5°)  (5m, 10°)
SIFT [36]  36.7  54.1  72.5
HardNet [41]  41.8  61.2  84.7
SuperPoint [14]  42.9  57.1  77.6
DELF [45]  38.8  62.2  85.7
D2-Net [15]  45.9  68.4  88.8
R2D2 [49]  46.9  66.3  88.8
SOSNet [62]  42.9  64.3  85.7
ContextDesc [37]  48.0  63.3  88.8
Ours w/ SIFT kp.  44.9  68.4  87.8
Ours w/ SuperPoint kp.  45.9  69.4  88.8
6.2 Visualizing Predicted Distributions
At training time, our method obtains the correspondence of a given query point in one image from a probability distribution over pixel locations in the other image. The quality of these distributions reflects the quality of the learned descriptors. Therefore, we visualize the distributions at both the coarse and fine levels for test image pairs drawn from the MegaDepth dataset [31] and illustrate the results in Fig. 8. This figure shows that, even under challenging illumination and perspective changes, the peaks of the distributions still indicate correct correspondences. Moreover, compared with the coarse-level distributions, the fine-level distributions tend to be more sharply peaked, demonstrating the discriminability of the fine-level feature descriptors. However, our method can fail when repetitive structures are present in the image, as shown in the last row of Fig. 8.
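Extracting a single correspondence from such a distribution is typically done with a 2D soft-argmax, i.e., the expected pixel location under a softmax over the correlation map. The following NumPy sketch illustrates the idea; the temperature parameter and array layout are assumptions, not our exact training code.

```python
import numpy as np

def soft_argmax_2d(corr, temperature=1.0):
    """Expected (x, y) location under a softmax distribution over a single
    query point's correlation map `corr` of shape (H, W). Being an
    expectation, this is differentiable w.r.t. the scores, unlike argmax."""
    h, w = corr.shape
    logits = corr.ravel() / temperature
    p = np.exp(logits - logits.max())    # stable softmax
    p /= p.sum()
    ys, xs = np.divmod(np.arange(h * w), w)
    return np.array([np.dot(p, xs), np.dot(p, ys)])  # expected (x, y)
```

Because the output is a weighted average of pixel coordinates, gradients flow from a loss on the predicted location (e.g., an epipolar loss) back into the descriptors that produced the correlation map.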
6.3 More Qualitative Results on Dense Feature Matching
In this section, we present more qualitative results on dense feature matching. To perform dense feature matching, we resize the coarse-level feature descriptors to the size of the fine-level feature descriptors and concatenate both to form our final dense feature descriptors, whose spatial resolution matches that of the fine-level feature map. For any given point in the first image, we find its nearest-neighbor match in the second image as its correspondence. No post-processing techniques are used. The dense correspondences are visualized in Fig. 9, and are reasonable under various illumination and viewpoint changes.
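The dense matching procedure above can be sketched as follows. This minimal NumPy version uses nearest-neighbor upsampling in place of the resizing used in our experiments, and the (C, H, W) array layout is an assumption for illustration.

```python
import numpy as np

def dense_descriptors(coarse, fine):
    """Upsample coarse descriptors (Cc, Hc, Wc) to the fine resolution
    (Cf, Hf, Wf), concatenate along channels, and L2-normalize per pixel."""
    s = fine.shape[1] // coarse.shape[1]         # integer upsampling factor
    up = coarse.repeat(s, axis=1).repeat(s, axis=2)
    desc = np.concatenate([up, fine], axis=0)
    return desc / np.linalg.norm(desc, axis=0, keepdims=True)

def nearest_neighbor_match(desc1, desc2):
    """For every pixel of image 1, the (x, y) of its best match in image 2,
    found by maximum cosine similarity (descriptors are unit-normalized)."""
    c, h, w = desc2.shape
    d1 = desc1.reshape(desc1.shape[0], -1).T     # (N1, C)
    d2 = desc2.reshape(c, -1)                    # (C, N2)
    idx = np.argmax(d1 @ d2, axis=1)             # best index per query pixel
    return np.stack([idx % w, idx // w], axis=1)
```

No ratio test or mutual-check post-processing is applied here, matching the setting described above.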
6.4 Implementation Details
In this section, we describe our network architecture and additional techniques that we find useful for improving the robustness and speed of training.
Network Architecture. During training, our system takes a pair of images as input and extracts their features using a network with shared weights. We use ResNet-50 [21], as implemented in PyTorch [48], as the backbone of our network. Our network is fully convolutional and accepts input images of any size. We present the detailed network architecture in Tab. 5. Our code and model will be made available soon.
Input (id)  Layer  Output (id)
0  Conv, stride 2  1
1  MaxPool, stride 2  2
2  Residual Block 1  3
3  Residual Block 2  4
4  Residual Block 3  5
5  Conv  Coarse
5  Upconv  6
[4, 6]  Conv  7
7  Upconv  8
[3, 8]  Conv  9
9  Conv  Fine
“Coarse” and “Fine” indicate the coarse- and fine-level output feature descriptors, respectively. Note that “Conv” stands for a sequence of operations: convolution, rectified linear units (ReLU), and batch normalization. The default stride is 1. “Upconv” stands for bilinear upsampling by a certain factor, followed by a “Conv” operation with stride 1. “[·, ·]” is the channel-wise concatenation of two feature maps.
Curriculum Learning. In general, we find that simply using ImageNet [12] pretrained weights gives the network a good initialization. To help the network converge faster, we can also optionally leverage curriculum learning [5], i.e., presenting easier training examples before harder ones. Specifically, we sort image pairs by their relative rotation angles and feed easy pairs into the network at the beginning of training for better initialization.
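The curriculum ordering can be sketched as follows, assuming a world-to-camera rotation matrix is available for each image; the pair tuple layout is a hypothetical convention for illustration, not our training code.

```python
import numpy as np

def rotation_angle_deg(R1, R2):
    """Geodesic angle (degrees) between two camera rotations, from the
    trace identity cos(theta) = (trace(R1^T R2) - 1) / 2."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def curriculum_order(pairs):
    """Sort image pairs easiest-first by relative rotation angle.
    pairs: list of (name_i, name_j, R_i, R_j) tuples (hypothetical layout)."""
    return sorted(pairs, key=lambda p: rotation_angle_deg(p[2], p[3]))
```

Feeding the sorted list in order realizes the easy-to-hard schedule described above.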
Training Points Pruning. As mentioned in Sec. 3.2 of the paper, the sampled query points in the first image may not have true correspondences in the second image, and enforcing losses on these points could hinder training. To alleviate this issue, we introduce the reweighting strategy in the paper. Besides this “soft” filtering, we can also perform “hard” filtering to discard points that are unlikely to have true correspondences. Specifically, we assume the depth range of the scene to be [d_min, d_max]. For each query point in the first image, this range yields a segment along its epipolar line in the second image. If this line segment does not intersect the image plane of the second image, the point is excluded from training. Since we do not assume known depths, we can only roughly determine the depth range for each image. Even so, this strategy is effective at removing a large fraction of unwanted training points. For MegaDepth [31], we approximate d_min and d_max as fixed fractions and multiples, respectively, of the maximum distance between two cameras in each reconstructed model.
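A rough sketch of this “hard” filtering is given below. It approximates the segment-image intersection test by sampling depths along the epipolar segment; the inverse-depth sampling scheme and the pose convention (a point X in camera 1 maps to RX + t in camera 2) are assumptions for illustration, not our exact implementation.

```python
import numpy as np

def has_potential_match(x1, K1, K2, R, t, d_min, d_max, h, w, n_samples=32):
    """Approximate test of whether the epipolar segment of query pixel x1
    (induced by the assumed depth range [d_min, d_max]) touches image 2."""
    ray = np.linalg.inv(K1) @ np.array([x1[0], x1[1], 1.0])  # back-projected ray
    for inv_d in np.linspace(1.0 / d_max, 1.0 / d_min, n_samples):
        X = ray / inv_d                  # 3D point at the sampled depth
        Xc2 = R @ X + t                  # in camera-2 coordinates
        if Xc2[2] <= 0:
            continue                     # behind camera 2
        u, v, z = K2 @ Xc2
        u, v = u / z, v / z              # project to pixels
        if 0 <= u < w and 0 <= v < h:
            return True                  # segment touches image 2
    return False                         # prune this query point
```

Query points for which this test fails are excluded from the loss, removing many points that cannot have a true correspondence.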
Training Curves. To give a direct illustration of the effectiveness of our loss function, we show the curves of our training loss as well as the training PCK [9] in Fig. 10. For each training pair, the PCK is computed on sparsely reconstructed keypoints from MegaDepth [31] with a threshold of 5 pixels. A higher PCK means more correct correspondences are found on the training data, indicating better descriptors. Fig. 10 shows that as our training loss decreases, the PCK score increases, verifying that minimizing our loss terms does improve the feature descriptors.
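For reference, PCK at a pixel threshold is simply the fraction of predicted correspondences within that distance of the ground truth, as in this small sketch:

```python
import numpy as np

def pck(pred, gt, threshold=5.0):
    """Percentage of Correct Keypoints: fraction of predicted matches
    within `threshold` pixels of the ground-truth locations.
    pred, gt: (N, 2) pixel coordinates."""
    err = np.linalg.norm(pred - gt, axis=1)  # per-point reprojection error
    return float(np.mean(err <= threshold))
```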