Designing powerful local features is an essential basis for a broad range of computer vision tasks [31, 43, 44, 30, 41, 15]. During the past few years, the joint learning of local feature detectors and descriptors has gained increasing popularity, with promising results achieved in real applications. However, we consider two limitations that may have hindered further performance gains: 1) the lack of shape-awareness of feature points for acquiring stronger geometric invariance, and 2) the lack of keypoint localization accuracy for solving camera geometry robustly.
Traditionally, the local shape is parameterized by hand-crafted scale/rotation estimation [17, 29] or affine shape adaptation, while more recently, data-driven approaches [23, 22, 39] have emerged that build a separate network to regress the shape parameters and then transform the patch inputs before feature description. With the increasing prevalence of joint learning with keypoint detectors [6, 25, 27, 7, 4], research focus has shifted to frameworks that densely extract features from image inputs, where no pre-defined keypoint is given and thus previous patch-wise shape estimation becomes inapplicable. As an alternative, LF-Net extracts dense features and transforms intermediate feature maps via Spatial Transformer Networks (STN), but multiple forward passes are needed and only sparse predictions of shape parameters are practically feasible. In this view, a solution that enables efficient local shape estimation in a dense prediction framework is still lacking.
Besides, the localization accuracy of learned keypoints remains a concern in solving geometry-sensitive problems. For instance, LF-Net  and D2-Net  empirically yield low precision in two-view matching or introduce large reprojection error in Structure-from-Motion (SfM) tasks, which in essence can be ascribed to the lack of spatial accuracy, as the detections are derived from low-resolution feature maps (e.g., times the original size). To restore the spatial resolution, SuperPoint  learns to upsample the feature maps with pixel-wise supervision from artificial points, while R2D2  employs dilated convolutions to maintain the spatial resolution at the cost of excessive GPU computation and memory usage. Moreover, it is questionable whether the detections from the deepest layer are capable of identifying low-level structures (corners, edges, etc.) where keypoints are often located. Although widely discussed in dense prediction tasks [28, 10, 16], in our context, neither the keypoint localization accuracy nor the low-level nature of keypoint detection has received adequate attention.
To mitigate the above limitations, we present ASLFeat, with three light-weight yet effective modifications. First, we employ deformable convolutional networks (DCN) [5, 45] in the dense prediction framework, which allows for not only pixel-wise estimation of local transformation, but also progressive shape modelling by stacking multiple DCNs. Second, we leverage the inherent feature hierarchy, and propose a multi-level detection mechanism that restores not only the spatial resolution without extra learning weights, but also low-level details for accurate keypoint localization. Finally, we base our methods on an improved D2-Net  that is trained from scratch, and further propose a peakiness measurement for more selective keypoint detection.
Although the key insights behind the above modifications are familiar, we address their importance in our specific context, fully optimize the implementation in a non-trivial way, and thoroughly study their effect by comparing different design choices. To summarize, we aim to provide answers to two critical questions: 1) what deformation parameterization is needed for local descriptors (geometrically constrained [23, 22, 39] or free-form modelling [5, 45]), and 2) what feature fusion is effective for keypoint detectors (multi-scale input [27, 7], in-network multi-scale inference , or multi-level fusion ). Finally, we extensively evaluate our methods across various practical scenarios, including image matching [1, 2], 3D reconstruction  and visual localization . We demonstrate drastic improvements upon the backbone architecture, D2-Net, and report state-of-the-art results on popular benchmarks.
2 Related works
Local shape estimation. Most existing descriptor learning methods [19, 18, 21, 37, 36, 41] do not explicitly model the local shape, but rely on geometric data augmentation (scaling/rotational perturbation) or hand-crafted shape estimation (scale/rotation estimation [17, 29]) to acquire geometric invariance. Instead, OriNet  and LIFT  propose to learn a canonical orientation of feature points, AffNet  predicts more affine parameters to improve the modelling power, and the log-polar representation  is used to handle scale changes in particular. Despite the promising results, those methods are limited to taking image patches as input, and introduce a considerable amount of computation since two independent networks are constructed for predicting patch shape and patch description separately. As an alternative, LF-Net  takes images as input and performs STN  on intermediate features, but multiple forward passes are needed to transform each individual “feature patch”, and thus only prediction on sparse locations is practically applicable.
Meanwhile, the modelling of local shape has been shown to be crucial in image recognition tasks, which inspires works such as scale-adaptive convolution (SAC) for flexible-size dilations  and deformable convolution networks (DCN) for tunable grid sampling locations [5, 45]. In this paper, we adopt a similar idea in our context, and propose to equip the network with DCN for dense local transformation prediction, whose inference requires only a single forward pass and is thus highly efficient.
Joint local feature learning. The joint learning of feature detectors and descriptors has received increasing attention, where a unified network is constructed to share most computations of the two tasks for fast inference. In terms of descriptor learning, the ranking loss [25, 7, 6, 4, 27] has been primarily used as a de-facto standard. However, due to the difficulty of acquiring unbiased ground-truth data, no general consensus has yet been reached regarding an effective loss design for keypoint detector learning. For instance, LF-Net  warps the detection map and minimizes the difference at selected pixels in two views, while SuperPoint  operates a self-supervised paradigm with a bootstrap training on synthetic data and multi-round adaptations on real data. More recent R2D2  enforces grid-wise peakiness in conjunction with reliability prediction for descriptor, while UnsuperPoint  and Key.Net  learn grid-wise offsets to localize keypoints.
By contrast, D2-Net eschews learning extra weights for a keypoint detector, but hand-crafts a selection rule to derive keypoints from the same feature maps that are used for extracting feature descriptors. This design essentially couples the capability of the feature detector and descriptor, and results in a clean framework without complex heuristics in loss formulation. However, it is a known issue that D2-Net lacks keypoint localization accuracy, as keypoints are derived from low-resolution feature maps. In this paper, we base our methods on D2-Net, and mitigate the above limitation by a light-weight modification that cheaply restores both the spatial resolution and low-level details.
The backbone architecture in this work is built upon 1) deformable convolutional networks (DCN) [5, 45] that predict and apply dense spatial transformation, and 2) D2-Net  that jointly learns keypoint detector and descriptor.
Deformable convolutional networks (DCN) [5, 45] aim to learn dynamic receptive fields to better model geometric variations. Formally, given a regular grid that samples values over the input feature maps , the output features of a standard convolution for each spatial position can be written as:
As the offset is typically fractional, Eq. 2 is implemented via bilinear interpolation, while the feature amplitude is limited to . During training, the initial values of and are respectively set to and , following the settings in .
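To make the sampling step concrete, the following NumPy sketch evaluates one output value of a modulated deformable convolution on a single-channel map, using bilinear interpolation for the fractional sampling positions. It assumes the usual DCN form in which each kernel tap is shifted by a learned offset and scaled by a learned amplitude. The function names, the 3×3 grid and the array layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Bilinearly interpolate the 2D map x at a fractional position (py, px).
    Samples falling outside the map contribute zero."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for yy, xx in ((y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)):
        if 0 <= yy < h and 0 <= xx < w:
            weight = (1 - abs(py - yy)) * (1 - abs(px - xx))
            val += weight * x[yy, xx]
    return val

def deform_conv_at(x, weights, p, offsets, amplitudes):
    """One output value of a 3x3 modulated deformable convolution at position p.

    weights:    (9,)   kernel weights, row-major over the 3x3 grid
    offsets:    (9, 2) per-tap (dy, dx) offsets (assumed layout)
    amplitudes: (9,)   per-tap feature amplitudes
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for n, (dy, dx) in enumerate(grid):
        sy = p[0] + dy + offsets[n, 0]
        sx = p[1] + dx + offsets[n, 1]
        out += weights[n] * bilinear_sample(x, sy, sx) * amplitudes[n]
    return out
```

With all offsets set to zero and all amplitudes set to one, the function reduces to a regular 3×3 convolution at that position.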
D2-Net  proposes a describe-and-detect strategy to jointly extract feature descriptions and detections. Over the last feature maps , D2-Net applies channel-wise L2-normalization to obtain dense feature descriptors, while the feature detections are derived from 1) the local score and 2) the channel-wise score. Specifically, for each location in (), the local score is obtained by:
where is the set of neighboring pixels around , e.g., 9 neighbours defined by a 3×3 kernel. Next, the channel-wise score is obtained by:
The final detection score is combined as:
The detection score will be later used as a weighting term in loss formulation (Sec. 3.4), and will allow for top-K selection of keypoints during testing.
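As an illustration of this scoring scheme, the sketch below computes a D2-Net-style detection map in NumPy. It assumes the local score is a softmax over the spatial neighborhood, the channel-wise score is the ratio-to-max along channels, and the final score is the maximum of their product over channels. The zero padding at image borders and the function name are assumptions made for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def d2net_score(feat, ksize=3):
    """Detection score for a non-negative feature map feat of shape (H, W, C)."""
    r = ksize // 2
    # Shift by the global max for numerical stability; it cancels in the ratio.
    e = np.exp(feat - feat.max())
    padded = np.pad(e, ((r, r), (r, r), (0, 0)))  # zero padding (assumed)
    # Windows over the two spatial axes: shape (H, W, C, ksize, ksize).
    win = sliding_window_view(padded, (ksize, ksize), axis=(0, 1))
    alpha = e / win.sum(axis=(-2, -1))                        # local score
    beta = feat / (feat.max(axis=-1, keepdims=True) + 1e-12)  # ratio-to-max
    return (alpha * beta).max(axis=-1)                        # combined score
```

The returned map can then be thresholded or ranked for top-K keypoint selection.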
3.2 DCN with Geometric Constraints
The original free-form DCN predicts local transformation of high degrees of freedom (DOF), e.g., offsets for a kernel. On the one hand, it has the potential to model complex deformation such as non-planarity, while on the other hand, it risks over-parameterizing the local shape, for which a simpler affine or perspective transformation is often considered a good approximation [20, 23, 22]. To find out what deformation is needed in our context, we compare three shape modellings by enforcing different geometric constraints in DCN, including 1) similarity, 2) affine and 3) homography. The shape properties of the investigated variants are summarized in Tab. 1.
|Variants||Modelling power||DOF|
|s.t. similarity||scale, rotation||2|
|s.t. affine||scale, rotation, shear||4|
Affine-constrained DCN. Traditionally, the local shape is often modelled by similarity transformation with estimates of rotation and scale [17, 29]. In a learning framework such as [23, 25], this transformation is decomposed as:
Moreover, a few works such as HesAff  further include an estimate of shearing, which is cast as a learnable problem by AffNet . Here, we follow AffNet and decompose the affine transformation as:
where . The network is implemented to predict one scalar for scaling (), another two for rotation (, ), and the other three for shearing ().
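A minimal sketch of how such predictions could be composed into a 2×2 affine matrix is given below. It assumes the AffNet-style decomposition into a scale, a rotation, and a lower-triangular residual shear normalized to unit determinant. The rotation sign convention, the normalization steps and the function name are assumptions for illustration only.

```python
import numpy as np

def compose_affine(scale, rot, shear):
    """Compose A = scale * R * A_res from raw predictions.

    scale: positive scalar
    rot:   two scalars, normalized here to a unit (cos, sin) pair
    shear: three scalars (a11, a21, a22) of a lower-triangular matrix,
           assumed to satisfy a11 * a22 != 0
    """
    c, s = np.asarray(rot, dtype=float) / np.linalg.norm(rot)
    R = np.array([[c, s], [-s, c]])  # sign convention is an assumption
    a11, a21, a22 = shear
    A_res = np.array([[a11, 0.0], [a21, a22]], dtype=float)
    # Normalize the residual shear to unit determinant, so that the
    # overall scale is carried by the scale parameter only.
    A_res = A_res / np.sqrt(abs(np.linalg.det(A_res)))
    return scale * R @ A_res
```

Dropping the shear term (passing `(1, 0, 1)`) leaves the similarity transformation of scale and rotation only.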
Homography-constrained DCN. In principle, the local deformation can be better approximated by a homography (perspective) transformation, and we follow [24] to solve the 4-point parameterization of in a differentiable manner.
Formally, a linear system can be created that solves , where and is a vector with 9 elements consisting of the entries of , and each correspondence provides two equations in . By enforcing the last element of to equal 1  and omitting the translation, we set and , then rewrite the above system of equations as , where and for each correspondence,
and consists of 6 elements from the first two columns of . By stacking the equations of 4 correspondences, we derive the final linear system:
Suppose that the correspondence points are not collinear; then can be efficiently and uniquely solved by using the differentiable pseudo-inverse of (implemented via the function tf.matrix_solve in TensorFlow). In practice, we initialize 4 corner points at , and implement the network to predict 8 corresponding offsets lying in so as to avoid collinearity.
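The linear system described above can be sketched in NumPy as follows. With the last element of the homography fixed to 1 and the translation omitted, six unknowns remain (the first two columns of the matrix), and four correspondences give eight equations that are solved here with a pseudo-inverse. The ordering of the unknowns and the function name are assumptions, and NumPy stands in for the TensorFlow solver mentioned in the text.

```python
import numpy as np

def solve_homography_no_translation(src, dst):
    """Solve H = [[h1, h2, 0], [h4, h5, 0], [h7, h8, 1]] from 4 correspondences.

    src, dst: arrays of shape (4, 2) holding (u, v) and (u', v') points.
    Each correspondence contributes two linear equations:
        h1*u + h2*v - u'*(h7*u + h8*v) = u'
        h4*u + h5*v - v'*(h7*u + h8*v) = v'
    """
    rows, rhs = [], []
    for (u, v), (up, vp) in zip(src, dst):
        rows.append([u, v, 0.0, 0.0, -up * u, -up * v])
        rhs.append(up)
        rows.append([0.0, 0.0, u, v, -vp * u, -vp * v])
        rhs.append(vp)
    A = np.asarray(rows, dtype=float)  # (8, 6)
    b = np.asarray(rhs, dtype=float)   # (8,)
    h1, h2, h4, h5, h7, h8 = np.linalg.pinv(A) @ b
    return np.array([[h1, h2, 0.0], [h4, h5, 0.0], [h7, h8, 1.0]])
```

In a learning framework the destination points would be the initial corner points plus the predicted offsets, and the solve would be carried out with a differentiable operator.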
After forming the above transformation , the offset values in Eq. 2 are now obtained by:
so that geometric constraints are enforced in DCN. More implementation details can be found in the Appendix.
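One plausible reading of this step is that each kernel sampling position is warped by the constrained transformation and the offset is the displacement from its original position. The sketch below follows that assumption for a square kernel. The (x, y) ordering of grid points and the function name are illustrative, and a 3×3 input is treated as a homography with a homogeneous division.

```python
import numpy as np

def constrained_offsets(T, ksize=3):
    """Offsets of a ksize x ksize sampling grid under a transformation T.

    T is either a 2x2 matrix (similarity/affine) or a 3x3 homography.
    Returns an array of shape (ksize * ksize, 2) with (dx, dy) offsets,
    i.e. warped grid position minus original grid position.
    """
    r = ksize // 2
    grid = np.array([(dx, dy) for dy in range(-r, r + 1)
                     for dx in range(-r, r + 1)], dtype=float)
    T = np.asarray(T, dtype=float)
    if T.shape == (2, 2):
        warped = grid @ T.T
    else:
        hom = np.hstack([grid, np.ones((len(grid), 1))]) @ T.T
        warped = hom[:, :2] / hom[:, 2:3]  # homogeneous division
    return warped - grid
```

An identity transformation yields zero offsets, which recovers the regular convolution grid.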
3.3 Selective and Accurate Keypoint Detection
Keypoint peakiness measurement. As introduced in Sec. 3.1, D2-Net scores a keypoint regarding both spatial and channel-wise responses. Among many possible metrics, D2-Net implements a ratio-to-max (Eq. 4) to evaluate channel-wise extremeness, whose possible limitation is that it only weakly relates to the actual distribution of all responses along the channel.
To study this effect, we first trivially modify Eq. 4 with a channel-wise softmax, but this modification deteriorates the performance in our experiments. Instead, inspired by [27, 40], we propose to use peakiness as a keypoint measurement in D2-Net, which rewrites Eq. 4 as:
where softplus activates the peakiness to a positive value. To balance the scales of both scores, we also rewrite Eq. 3 in a similar form:
and the two scores are again combined as in Eq. 5.
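The peakiness-based scores can be sketched as follows, under the assumption that peakiness is measured as the response minus its mean (along channels for the channel-wise score, over the spatial neighborhood for the local score) before the softplus activation. The edge padding, the undilated neighborhood and the function names are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def peakiness_score(feat, ksize=3):
    """Peakiness-based detection score for a feature map feat of shape (H, W, C)."""
    r = ksize // 2
    padded = np.pad(feat, ((r, r), (r, r), (0, 0)), mode="edge")
    # Windows over the two spatial axes: shape (H, W, C, ksize, ksize).
    win = sliding_window_view(padded, (ksize, ksize), axis=(0, 1))
    local_mean = win.mean(axis=(-2, -1))
    alpha = softplus(feat - local_mean)                         # spatial peakiness
    beta = softplus(feat - feat.mean(axis=-1, keepdims=True))   # channel peakiness
    # Combined as before: maximum over channels of the product.
    return (alpha * beta).max(axis=-1)
```

Because softplus is strictly positive, both scores stay on a comparable scale and their product never vanishes.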
Multi-level keypoint detection (MulDet). As aforementioned, one known limitation of D2-Net  is the lack of keypoint localization accuracy, since detections are obtained from low-resolution feature maps. There are multiple options to restore the spatial resolution, for instance, by learning an additional feature decoder (SuperPoint ) or employing dilated convolutions (R2D2 ). However, those methods either increase the number of learning parameters, or consume huge GPU memory or computation. Instead, we propose a simple yet effective solution without introducing extra learning weights, by leveraging the inherent pyramidal feature hierarchy of ConvNets and combining detections from multiple feature levels.
Specifically, given a feature hierarchy consisting of feature maps at different levels strided by , respectively, we apply the aforementioned detection at each level to get a set of score maps . Next, each score map is upsampled to have the same spatial resolution as input image, and finally combined by taking the weighted sum:
To better demonstrate the advantage of the proposed method, we also implement 1) the multi-scale detection used in D2-Net  and R2D2  (Fig. 4a) by constructing an image pyramid with multiple forward passes, 2) the in-network multi-scale prediction used in LF-Net  (Fig. 4b) by resizing the intermediate feature maps, and 3) the standard U-Net architecture  (Fig. 4c) that builds a feature decoder and skip connections from low-level feature maps.
The proposed multi-level detection (Fig. 4d) is advantageous in three aspects. Firstly, it adopts implicit multi-scale detection that conforms to classical scale-space theory  by having different sizes of receptive field for localizing keypoints. Secondly, compared with U-Net architecture, it cheaply restores the spatial resolution without introducing extra learning weights to achieve pixel-wise accuracy. Thirdly, different from U-Net that directly fuses low-level and high-level features, it keeps the low-level features untouched, but fuses the detections of multi-level semantics, which helps to better preserve the low-level structures such as corners or edges. The implementation details of above variants can be found in the Appendix.
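The multi-level detection described above can be sketched in NumPy as follows: each per-level score map is bilinearly upsampled to the input resolution and the maps are combined by a weighted sum. Normalizing by the sum of weights, the corner-aligned resampling and the function names are assumptions of this example, and the weights are left as arguments.

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Resize a 2D map with corner-aligned bilinear interpolation."""
    in_h, in_w = x.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bottom = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def multi_level_score(score_maps, weights, out_hw):
    """Fuse per-level score maps into one map at the input resolution."""
    out_h, out_w = out_hw
    total = np.zeros((out_h, out_w))
    for s, w in zip(score_maps, weights):
        total += w * bilinear_resize(np.asarray(s, dtype=float), out_h, out_w)
    return total / float(np.sum(weights))  # normalization is an assumption
```

No learnable parameters are involved in this fusion, which is consistent with the claim that the spatial resolution is restored without extra learning weights.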
3.4 Learning Framework
Network architecture. The network architecture is illustrated in Fig. 1. To reduce computation, we replace the VGG backbone  used in D2-Net with the more light-weight L2-Net . Similar to R2D2 , we further replace the last convolution of L2-Net by three convolutions, resulting in feature maps of dimension and times resolution of the input. Finally, the last three convolutions, conv6, conv7 and conv8, are substituted with DCN (Sec. 3.1). Three levels, conv1, conv3 and conv8, are selected to perform the proposed MulDet (Sec. 3.3). The combination weights in Eq. 13 are empirically set to , and the dilation rate to find neighboring pixels in Eq. 3 is set to , respectively, which we find to deliver the best trade-off between attention to low-level and abstracted features.
Loss design. We identify a set of correspondences for an image pair via densely warping to regarding ground-truth depths and camera parameters. To derive the training loss for both detector and descriptor, we adopt the formulation in D2-Net , written as:
where and are combined detection scores in Eq. 13 for image and , and are their corresponding descriptors, and is the ranking loss for representation learning. Instead of using the hardest-triplet loss in D2-Net , we adopt the hardest-contrastive form in FCGF , which we find yields better convergence when training from scratch and equipping DCN, written as:
where denotes the Euclidean distance measured between two descriptors, and are respectively set to for positives and negatives. Similar to D2-Net , a safe radius sized 3 is set to avoid taking spatially too close feature points as negatives.
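A hedged NumPy sketch of this loss design is given below. It assumes the hardest-contrastive term penalizes positive distances above a positive margin and hardest-negative distances below a negative margin, with negatives mined in both images and excluded inside the safe radius, and that the per-correspondence terms are weighted by normalized products of detection scores. The margins are left as arguments, and the function names and the inclusive radius test are assumptions.

```python
import numpy as np

def hardest_contrastive(desc_a, desc_b, m_pos, m_neg,
                        coords_a=None, coords_b=None, safe_radius=3):
    """Per-correspondence hardest-contrastive terms.

    desc_a[i] and desc_b[i] are matching descriptors, shape (N, D).
    coords_a / coords_b: optional (N, 2) keypoint positions for the safe radius.
    """
    n = len(desc_a)
    dist = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    pos = np.diag(dist)
    excl_b = np.eye(n, dtype=bool)  # invalid negatives when mining in image B
    excl_a = np.eye(n, dtype=bool)  # invalid negatives when mining in image A
    if coords_b is not None:
        gap = np.linalg.norm(coords_b[:, None] - coords_b[None], axis=-1)
        excl_b |= gap <= safe_radius
    if coords_a is not None:
        gap = np.linalg.norm(coords_a[:, None] - coords_a[None], axis=-1)
        excl_a |= gap <= safe_radius
    neg_b = np.where(excl_b, np.inf, dist).min(axis=1)
    neg_a = np.where(excl_a, np.inf, dist).min(axis=0)
    hardest_neg = np.minimum(neg_b, neg_a)
    return np.maximum(pos - m_pos, 0.0) + np.maximum(m_neg - hardest_neg, 0.0)

def detection_weighted_loss(terms, score_a, score_b):
    """Weight each term by the normalized product of its two detection scores."""
    w = score_a * score_b
    return float(np.sum(w / w.sum() * terms))
```

Correspondences with high detection scores in both images dominate the sum, so lowering the loss encourages confident detections to coincide with discriminative descriptors.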
Training. In contrast to D2-Net, which starts from an ImageNet pretrained model with only the last convolution fine-tuned, we train our model from scratch with ground-truth cameras and depths obtained from [33, 26] (the same data used in [19, 18]). The training consumes image pairs sized and batched . Learning gradients are computed for image pairs that have at least matches, while the maximum match number is limited to . Each input image is standardized to have zero mean and unit norm, and is independently augmented with random photometric changes including brightness, contrast and blurriness. The SGD optimizer is used with momentum of 0.9, and the base learning rate is set to .
Although end-to-end learning with DCN is feasible, we find that a two-stage training yields better results in practice. Specifically, in the first round we train the model with all regular convolutions for iterations. In the second round, we tune only the DCNs with the base learning rate divided by for another iterations. Our implementation is made in TensorFlow on a single NVIDIA RTX 2080Ti card, and the training finishes within hours.
Testing. A non-maximum suppression (NMS) sized is applied to remove detections that are spatially too close. Similar to D2-Net, we postprocess the keypoints with the SIFT-like edge elimination (with threshold set to ) and sub-pixel refinement; the descriptors are then bilinearly interpolated at the refined locations. We select top-K keypoints regarding detection scores obtained in Eq. 13, and empirically discard those whose scores are lower than .
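The selection part of this test-time procedure (NMS, score thresholding and top-K ranking) can be sketched as below. Edge elimination and sub-pixel refinement are omitted, the NMS window is assumed to be odd-sized, and all parameter values are left to the caller. The function name is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def select_keypoints(score, nms_size, top_k, min_score):
    """Pick keypoints from a 2D score map.

    Keeps pixels that are the maximum of their nms_size x nms_size window,
    drops those scoring below min_score, and returns the top_k best as
    (row, col) coordinates together with their scores.
    """
    r = nms_size // 2
    padded = np.pad(score, r, mode="constant", constant_values=-np.inf)
    windows = sliding_window_view(padded, (nms_size, nms_size))
    local_max = windows.max(axis=(-2, -1))
    keep = (score == local_max) & (score >= min_score)
    ys, xs = np.nonzero(keep)
    order = np.argsort(-score[ys, xs])[:top_k]
    return np.stack([ys[order], xs[order]], axis=1), score[ys, xs][order]
```

Descriptors would then be sampled at the returned locations, after any refinement step.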
In the following sections, we evaluate our methods across several practical scenarios, including image matching, 3D reconstruction and visual localization. Further experiments on dense reconstruction and image retrieval can be found in the Appendix.
4.1 Image Matching
Datasets. First, we use the popular HPatches dataset , which includes image sequences with ground-truth homography. Following D2-Net , we exclude high-resolution sequences, leaving and sequences with illumination or viewpoint variations, respectively.
Though widely used, the HPatches dataset exhibits only homography transformation, which may not comprehensively reflect the performance in real applications. Thus, we resort to the newly proposed FM-Bench , which comprises four datasets captured in practical scenarios: the TUM dataset  in indoor SLAM settings, the KITTI dataset  in driving scenes, the Tanks and Temples dataset (T&T)  for wide-baseline reconstruction, and the Community Photo Collection (CPC)  for wild reconstruction from web images. For each dataset, overlapping image pairs are randomly chosen for evaluation, with the ground-truth fundamental matrix pre-computed.
Evaluation protocols. On the HPatches dataset , three standard metrics are used: 1) Keypoint repeatability (%Rep.), i.e., the ratio of possible matches to the minimum number of keypoints in the shared view. 2) Matching score (%M.S.), i.e., the ratio of correct matches to the minimum number of keypoints in the shared view. 3) Mean matching accuracy (%MMA), i.e., the ratio of correct matches to possible matches. Here, a pair of keypoints is counted as a possible match if the point distance is below some error threshold after homography warping, and a correct match is further required to be a mutual nearest neighbor during brute-force searching. For the above metrics, we report their average scores over all image pairs in the dataset.
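A simplified sketch of these three metrics for a single image pair is shown below. It assumes that all keypoints already lie in the shared view, counts a keypoint in the first image as a possible match if any keypoint in the second image lies within the threshold after warping, and uses Euclidean descriptor distance for the mutual nearest-neighbor check. These simplifications and the function name are assumptions, and the official benchmark code may differ in detail.

```python
import numpy as np

def hpatches_metrics(kpts_a, kpts_b, desc_a, desc_b, H, thr=3.0):
    """Return (repeatability, matching score, MMA) for one image pair.

    kpts_*: (N, 2) keypoints as (x, y); desc_*: (N, D) descriptors;
    H: 3x3 ground-truth homography mapping image A to image B.
    """
    hom = np.hstack([kpts_a, np.ones((len(kpts_a), 1))]) @ H.T
    warped = hom[:, :2] / hom[:, 2:3]
    geo = np.linalg.norm(warped[:, None] - kpts_b[None], axis=-1)  # (Na, Nb)
    possible = int((geo.min(axis=1) < thr).sum())

    dd = np.linalg.norm(desc_a[:, None] - desc_b[None], axis=-1)
    nn_ab = dd.argmin(axis=1)      # best match in B for each A
    nn_ba = dd.argmin(axis=0)      # best match in A for each B
    idx_a = np.arange(len(kpts_a))
    mutual = nn_ba[nn_ab] == idx_a
    correct = int((mutual & (geo[idx_a, nn_ab] < thr)).sum())

    denom = min(len(kpts_a), len(kpts_b))
    rep = possible / denom
    ms = correct / denom
    mma = correct / possible if possible > 0 else 0.0
    return rep, ms, mma
```

Averaging the returned values over all image pairs gives the dataset-level scores.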
In terms of FM-Bench, a full matching pipeline including outlier rejection (e.g., ratio test) and geometric verification (e.g., RANSAC) is performed, and the final pose recovery accuracy is evaluated. To determine the correctness of a pose estimate, FM-Bench uses the ground-truth pose to generate a set of virtual correspondences, then measures the average normalized symmetric epipolar distance regarding a pose estimate, and finally computes %Recall as the ratio of estimates where the distance error is below a certain threshold (0.05 by default). At the correspondence level, FM-Bench also reports intermediate results such as the inlier ratio (%Inlier/%Inlier-m) and correspondence number (#Corr/#Corr-m) after/before RANSAC.
|HPatches dataset (error threshold @ 3px)|
|Configs||Variants||%Rep.||%M.S.||%MMA|
|+ MulDet||multi-scale (pyramid)||46.12||32.55||48.72|
|+ MulDet||s.t. affine||78.49||45.35||71.80|
| ||free-form, 1 layer||78.27||45.12||71.08|
Comparative methods. We compare our methods with 1) patch descriptors, including HardNet++  with the SIFT  detector (SIFT + HN++) or with the shape estimator HesAffNet  (HAN + HN++), and ContextDesc  with the SIFT detector (SIFT + ContextDesc), and 2) joint local feature learning approaches, including SuperPoint , LF-Net , D2-Net (fine-tuned)  and the more recent R2D2 . Unless otherwise specified, we report either results reported in the original papers, or results derived from the authors’ public implementations with default parameters. We limit the maximum numbers of features of our methods to 5K and 20K on the HPatches dataset and FM-Bench, respectively.
On FM-Bench, both the mutual check and ratio test  are applied to reject outliers before RANSAC. A ratio at is used for all methods except for D2-Net and R2D2. (Footnote 2: We use for D2-Net as suggested in its original paper, and conduct a parameter search for LF-Net, SuperPoint and R2D2, obtaining the ratio at , , and , respectively, to achieve overall best performance.)
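For illustration, the two outlier rejection steps can be sketched as a brute-force matcher in NumPy. The sketch assumes Euclidean descriptor distances and the standard form of the ratio test (nearest distance below a fraction of the second-nearest). The ratio is left as an argument and the function name is hypothetical.

```python
import numpy as np

def match_with_checks(desc_a, desc_b, ratio):
    """Brute-force matching with ratio test and mutual nearest-neighbor check.

    desc_a: (Na, D), desc_b: (Nb, D) with Nb >= 2.
    Returns an (M, 2) array of index pairs (i in A, j in B).
    """
    dd = np.linalg.norm(desc_a[:, None] - desc_b[None], axis=-1)  # (Na, Nb)
    order = np.argsort(dd, axis=1)
    nn1, nn2 = order[:, 0], order[:, 1]       # nearest and second nearest in B
    idx = np.arange(len(desc_a))
    passes_ratio = dd[idx, nn1] < ratio * dd[idx, nn2]
    mutual = dd.argmin(axis=0)[nn1] == idx    # B's nearest in A points back
    keep = passes_ratio & mutual
    return np.stack([idx[keep], nn1[keep]], axis=1)
```

The surviving pairs would then be passed to RANSAC for geometric verification.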
Baseline. To avoid overstatement, we first present our re-implementation of D2-Net (our impl.) as the baseline. As mentioned in Sec. 3.4 and Sec. 3.5, the new baseline differs from the original D2-Net (orig.) in three aspects: 1) Different backbone architecture (L2-Net  with -d output vs. VGG  with -d output). 2) Different loss formulation (hardest-contrastive  vs. hardest-triplet ). 3) Different training settings (trained from scratch vs. fine-tuned only the last convolution from a pre-trained model). As shown in Tab. 2 and Tab. 3, the new baseline outperforms original D2-Net in general, while being more parameter- and computation-efficient regarding model size.
|TUM  (indoor SLAM settings)||KITTI  (driving settings)|
|Methods||%Recall||%Inlier||%Inlier-m||#Corrs (-m)||%Recall||%Inlier||%Inlier-m||#Corrs (-m)|
|SIFT ||57.40||75.33||59.21||65 (316)||91.70||98.20||87.40||154 (525)|
|SIFT + HN++ ||58.90||75.74||62.07||67 (315)||92.00||98.21||91.25||159 (535)|
|HAN + HN++ ||51.70||75.70||62.06||101 (657)||90.40||98.09||90.64||233 (1182)|
|SIFT + ContextDesc ||59.70||75.53||62.61||69 (325)||92.20||98.23||91.92||160 (541)|
|LF-Net (MS) ||53.00||70.97||56.25||143 (851)||80.40||95.38||84.66||202 (1045)|
|D2-Net (MS) ||34.50||67.61||49.01||74 (1279)||71.40||94.26||73.25||103 (1832)|
|SuperPoint ||45.80||72.79||64.06||39 (200)||86.10||98.11||91.52||73 (392)|
|R2D2 (MS) ||57.70||73.70||61.53||260 (1912)||78.80||97.53||86.49||278 (1804)|
|D2-Net (our impl.)||39.10||70.09||61.58||64 (337)||70.80||97.04||91.97||81 (683)|
|ASLFeat (w/o peakiness meas.)||53.30||74.96||68.29||116 (703)||89.60||98.47||95.36||223 (1376)|
|ASLFeat||60.20||76.34||69.09||148 (739)||92.20||98.69||96.25||444 (1457)|
|ASLFeat (MS)||59.90||76.72||69.50||258 (1332)||92.20||98.76||96.16||630 (2222)|
|T&T  (wide-baseline reconstruction)||CPC  (wild reconstruction from web images)|
|SIFT||70.00||75.20||53.25||85 (795)||29.20||67.14||48.07||60 (415)|
|SIFT + HN++||79.90||81.05||63.61||96 (814)||40.30||76.73||62.30||69 (400)|
|HAN + HN++||82.50||84.71||70.29||97 (920)||47.40||82.58||72.22||65 (405)|
|SIFT + ContextDesc||81.60||83.32||69.92||94 (728)||41.80||84.01||72.21||61 (306)|
|LF-Net (MS)||57.40||66.62||60.57||54 (362)||19.40||44.27||44.35||50 (114)|
|D2-Net (MS)||68.40||71.79||55.51||78 (2603)||31.30||56.57||49.85||84 (1435)|
|SuperPoint||81.80||83.87||70.89||52 (535)||40.50||75.28||64.68||31 (225)|
|R2D2 (MS)||73.00||80.81||65.31||84 (1462)||43.00||82.40||67.28||91 (954)|
|D2-Net (our impl.)||83.20||84.19||75.32||74 (1009)||46.60||83.72||77.31||51 (464)|
|ASLFeat (w/o peakiness meas.)||86.30||84.71||77.84||171 (1775)||49.50||85.80||80.39||97 (780)|
|ASLFeat||89.90||85.33||79.08||295 (2066)||51.50||87.98||82.24||165 (989)|
|ASLFeat (MS)||88.70||85.68||79.74||327 (2465)||54.40||89.33||82.76||185 (1159)|
The proposed peakiness measurement (Sec. 3.3) notably improves the results regarding all evaluation metrics on the HPatches dataset. This effect is also validated on FM-Bench, where it holds across all scenarios, as shown in Tab. 3 (ASLFeat w/o peakiness meas.). Our later modifications are thus based on this model.
Ablations on MulDet. As shown in Tab. 2, applying multi-scale detection alone has no obvious effect, as spatial accuracy is still lacking. Instead, adopting multi-level detection, with spatial resolution restored, remarkably boosts the performance, which confirms the necessity of pixel-level accuracy, especially when only a small pixel error is tolerated. It is also noteworthy that, despite fewer learning weights and less computation, the proposed multi-level detection outperforms the U-Net variant, highlighting the low-level nature of this task, where a better preservation of low-level features is beneficial. Although the proposed multi-level detection also includes feature fusion of different scales, we find that additionally combining a more explicit multi-scale (pyramid) detection (free-form, multi-scale) is particularly advantageous for handling scale changes. This combination is denoted as ASLFeat (MS) in the following.
Ablations on DCN. As shown in Tab. 2, all investigated variants of DCN are valid and notably boost the performance. Among those designs, the free-form variant slightly outperforms the constrained version, despite the fact that HPatches datasets exhibit only homography transformation. This confirms that modelling non-planarity is feasible and useful for local features, and we thus opt for the free-form DCN to better handle geometric variations. Besides, we also implement a single-layer DCN (free-form, 1 layer) that replaces only the last regular convolution (i.e., conv8 in Fig. 1), showing that stacking more DCNs is beneficial and the shape estimation can be learned progressively.
Comparisons with other methods. As illustrated in Fig. 3, both ASLFeat and its multi-scale (MS) variant achieve overall best results on HPatches dataset regarding both illumination and viewpoint variations at different error thresholds. Specifically, ASLFeat delivers remarkable improvements upon its backbone architecture, D2-Net, especially at low error thresholds, which in particular demonstrates that the keypoint localization error has been largely reduced. Besides, ASLFeat notably outperforms the more recent R2D2 ( vs. for MMA@3 overall), while being more computationally efficient by eschewing the use of dilated convolutions for restoring spatial resolution.
In addition, as shown in Tab. 3 on FM-Bench, ASLFeat remarkably outperforms other joint learning approaches. In particular, ASLFeat largely improves the state-of-the-art results on two MVS datasets, T&T and CPC, whose scenarios are consistent with the training data. It is also noteworthy that our methods generalize well to unseen scenarios: TUM (indoor scenes) and KITTI (driving scenes). As is common practice, adding more task-specific training data is expected to further boost the performance.
Visualizations. We here present some sample detection results on FM-Bench in Fig. 4, and more visualizations are provided in the Appendix.
4.2 3D Reconstruction
Evaluation protocols. We exhaustively match all image pairs for each dataset with both ratio test at and mutual check for outlier rejection, then run SfM and MVS algorithms by COLMAP . For sparse reconstruction, we report the number of registered images (#Reg. Images), the number of sparse points (#Sparse Points), average track length (Track Length) and mean reprojection error (Reproj. Error). For dense reconstruction, we report the number of dense points (#Dense Points). We limit the maximum number of features of ASLFeat to 20K.
Results. As shown in Tab. 4, ASLFeat produces the most complete reconstructions regarding #Reg. Images and #Dense Points. Besides, ASLFeat results in Reproj. Error that is on par with SuperPoint and smaller than D2-Net, which again validates the effect of the proposed MulDet for restoring spatial information. However, the reprojection error produced by hand-crafted keypoints (e.g., RootSIFT) is still notably smaller than all learning methods, which implies that future effort can be spent to further improve the keypoint localization in a learning framework.
|Dataset||Method||#Reg. Images||#Sparse Points||Track Length||Reproj. Error||#Dense Points|
|1344 images||SuperPoint||438||29K||9.03||1.02px||1.55M|
|1463 images||SuperPoint||967||93K||7.22||1.03px||3.81M|
|1576 images||SuperPoint||681||52K||8.67||0.96px||2.77M|
4.3 Visual Localization
Datasets. We resort to the Aachen Day-Night dataset  to demonstrate the effect on visual localization tasks, where the key challenge lies in matching images with extreme day-night changes for queries.
Evaluation protocols. We use the evaluation pipeline provided by The Visual Localization Benchmark (https://www.visuallocalization.net/), which takes custom features as input, then relies on COLMAP  for image registration, and finally reports the percentages of successfully localized images within three error tolerances: (0.5m, 2°) / (1m, 5°) / (5m, 10°). The maximum feature number of our methods is limited to 20K.
Results. As shown in Tab. 5, although only mediocre results are obtained in previous evaluations, D2-Net performs surprisingly well under challenging illumination variations. This can probably be ascribed to the superior robustness of low-level features pre-trained on ImageNet. On the other hand, our method outperforms the plain implementation of R2D2, while a specialized R2D2 model (R2D2 (fine-tuned)) achieves the state-of-the-art results with doubled model size, by training on day images from the Aachen dataset and using photo-realistic style transfer to generate night images.
|Methods||#Features||Dim||0.5m, 2°||1m, 5°||5m, 10°|
|HAN + HN++||11K||128||37.8||54.1||75.5|
|SIFT + ContextDesc||11K||128||40.8||55.1||80.6|
|R2D2 (MS, fine-tuned)||10K||128||45.9||66.3||88.8|
|D2-Net (our impl.)||10K||128||40.8||59.2||77.6|
In this paper, we have used D2-Net as the backbone architecture to jointly learn the local feature detector and descriptor. Three light-weight yet effective modifications have been proposed that drastically boost the performance in two aspects: the ability to model the local shape for stronger geometric invariance, and the ability to localize keypoints accurately for robustly solving camera geometry. We have conducted extensive experiments to study the effect of each modification, and demonstrated the superiority and practicability of our methods across various applications.
Acknowledgments. This work is supported by Hong Kong RGC GRF 16206819, 16203518 and T22-603/15N.
- (2017) HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. In CVPR.
- (2019) An evaluation of feature matchers for fundamental matrix estimation. In BMVC.
- (2019) Fully convolutional geometric features. In ICCV.
- (2019) UnsuperPoint: end-to-end unsupervised interest point detector and descriptor. arXiv.
- (2017) Deformable convolutional networks. In ICCV.
- (2018) SuperPoint: self-supervised interest point detection and description. In CVPRW.
- (2019) D2-Net: a trainable CNN for joint detection and description of local features. In CVPR.
- (2019) Beyond cartesian representations for local descriptors. In ICCV.
- (2012) Are we ready for autonomous driving? the KITTI vision benchmark suite. In CVPR.
- (2017) Unsupervised monocular depth estimation with left-right consistency. In CVPR.
- (2003) Multiple view geometry in computer vision. Cambridge University Press.
- (2015) Spatial transformer networks. In NeurIPS.
- (2017) Tanks and temples: benchmarking large-scale scene reconstruction. ToG.
- (2019) Key.Net: keypoint detection by handcrafted and learned CNN filters. In ICCV.
- (2015) Dual-feature warping-based motion model estimation. In ICCV.
- (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In CVPR.
- (2004) Distinctive image features from scale-invariant keypoints. IJCV.
- (2019) ContextDesc: local descriptor augmentation with cross-modality context. In CVPR.
- (2018) GeoDesc: learning local descriptors by integrating geometry constraints. In ECCV.
- (2002) An affine invariant interest point detector. In ECCV.
- (2017) Working hard to know your neighbor's margins: local descriptor learning loss. In NeurIPS.
- (2018) Repeatability is not enough: learning affine regions via discriminability. In ECCV.
- (2016) Learning to assign orientations to feature points. In CVPR.
- (2018) Unsupervised deep homography: a fast and robust homography estimation model. IEEE Robotics and Automation Letters.
- (2018) LF-Net: learning local features from images. In NeurIPS.
- (2016) CNN image retrieval learns from BoW: unsupervised fine-tuning with hard examples. In ECCV.
- (2019) R2D2: repeatable and reliable detector and descriptor. In NeurIPS.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI.
- (2011) ORB: an efficient alternative to SIFT or SURF. In ICCV.
- (2012) Image retrieval for image-based localization revisited. In BMVC.
- (2016) Structure-from-motion revisited. In CVPR.
- (2017) Comparative evaluation of hand-crafted and learned local features. In CVPR.
- (2018) Matchable image retrieval by learning from surface reconstruction. In ACCV.
- (2014) Very deep convolutional networks for large-scale image recognition. In ICLR.
- (2012) A benchmark for the evaluation of RGB-D SLAM systems. In IROS.
- (2017) L2-Net: deep learning of discriminative patch descriptor in euclidean space. In CVPR.
- (2019) SOSNet: second order similarity regularization for local descriptor learning. In CVPR.
- (2014) Robust global translations with 1DSfM. In ECCV.
- (2016) LIFT: learned invariant feature transform. In ECCV.
- (2018) Learning to detect features in texture images. In CVPR.
- (2019) Learning local descriptors with a CDF-based dynamic soft margin. In ICCV.
- (2017) Scale-adaptive convolutions for scene parsing. In ICCV.
- (2017) Distributed very large scale bundle adjustment by global camera consensus. In ICCV.
- (2018) Learning and matching multi-view descriptors for registration of point clouds. In ECCV.
- (2019) Deformable convnets v2: more deformable, better results. In CVPR.
A. Supplementary Appendix
A.1 Implementation Details of DCN
Since no native implementation of modulated DCN is currently available in TensorFlow, we implement our own. To reduce the number of learnable weights, only one set of deformation parameters is predicted and then applied along all channels, similar to the setting num_deformable_groups=1 in the PyTorch implementation (https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/blob/master/modules/deform_conv.py).
To model the rotation, two scalars are predicted as scaled cosine and sine, which are then used to compute an angle by taking:
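A minimal sketch of this step, assuming the angle is recovered with a two-argument arctangent, which is insensitive to the common positive scale of the two predictions. The function name and the example inputs are hypothetical, and the exact expression used by the authors may differ.

```python
import math


def rotation_from_scaled_cos_sin(c, s):
    """Recover an angle and a 2x2 rotation matrix from two scalars.

    Assumption: c and s are proportional to cos(theta) and sin(theta)
    with the same positive scale, so atan2 removes that scale.
    """
    theta = math.atan2(s, c)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotation = [[cos_t, -sin_t],
                [sin_t, cos_t]]
    return theta, rotation


# Hypothetical network outputs: any positive scaling gives the same angle.
theta_a, _ = rotation_from_scaled_cos_sin(c=3.0, s=3.0)
theta_b, _ = rotation_from_scaled_cos_sin(c=0.1, s=0.1)
```

Predicting two scalars instead of the angle itself avoids the wrap-around discontinuity of a directly regressed angle.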
To compose the affine shape matrix , we implement the network to predict the residual shape, and enforce by:
where and lie in range through an activation. In contrast to the observation in AffNet, we do not suffer from degeneration when jointly learning all affine parameters in DCN.
In this paper, we have concluded that the free-form DCN is preferable to other deformation parameterizations subject to geometric constraints in the context of local feature learning. However, as shown in Tab. 2, the difference is not pronounced. We ascribe this to the lack of meaningful supervision for learning complex deformations. As also discussed in AffNet, a specialized loss may be needed to guide the local shape estimation, whereas our current implementation uses the same loss for both local feature learning and deformation learning (we have tried the loss in AffNet, but observed no consistent improvement). In the future, we will further explore this direction to better unlock the potential of DCN.
A.2 Implementation Details of MulDet
To implement the multi-scale (pyramid) variant, we follow D2-Net and R2D2, and feed an image pyramid to the network, which is constructed from the input image sized up to , and downsampled by and blurred by a Gaussian kernel factored for each scale, until the longest side is smaller than 128 pixels. At each scale, a set of keypoints is identified whose scores are above some threshold, e.g., , and the final top-K keypoints are selected from the keypoints combined from all scales. The inference time is doubled when this configuration is enabled.
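The scale schedule and the cross-scale keypoint selection can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the downsampling factor (`2 ** 0.5`), the score threshold (`0.5`) and `top_k` are hypothetical placeholders, the pyramid is assumed to stop before the longest side drops below 128 pixels, and the Gaussian blurring and the network forward pass are omitted.

```python
import numpy as np


def pyramid_scales(height, width, factor=2 ** 0.5, min_side=128):
    """Scales 1, 1/f, 1/f^2, ... kept while the longest side is >= min_side."""
    scales, scale = [], 1.0
    while max(height, width) * scale >= min_side:
        scales.append(scale)
        scale /= factor
    return scales


def merge_keypoints(per_scale, score_thld=0.5, top_k=3):
    """Combine thresholded keypoints from all scales and keep the top-K.

    per_scale: list of (scale, xy, scores), where xy is an (N, 2) array of
    coordinates in the resized image and scores is an (N,) array.
    """
    all_xy, all_scores = [], []
    for scale, xy, scores in per_scale:
        keep = scores > score_thld
        all_xy.append(xy[keep] / scale)  # map back to original resolution
        all_scores.append(scores[keep])
    xy = np.concatenate(all_xy, axis=0)
    scores = np.concatenate(all_scores, axis=0)
    order = np.argsort(-scores)[:top_k]  # highest scores first
    return xy[order], scores[order]


# Toy detections at two scales (hypothetical values).
detections = [
    (1.0, np.array([[10.0, 20.0], [30.0, 40.0]]), np.array([0.9, 0.2])),
    (0.5, np.array([[5.0, 5.0]]), np.array([0.7])),
]
kp_xy, kp_scores = merge_keypoints(detections, score_thld=0.5, top_k=2)
```

Selecting the top-K after pooling all scales, rather than per scale, lets the strongest responses win regardless of the pyramid level they come from.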
To implement the multi-scale (in-network) variant, we follow LF-Net, and resize the feature maps from the last convolution, i.e., conv8, into multiple scales. Specifically, the resizing is repeated for times, resulting in scales from to , where and . Each corresponding score map is generated as in Eq. 5, and the final scale-space score map is obtained by merging all the score maps via weighted summation, where the weights are computed by a softmax function. Since DCN already handles in-network scale invariance, we did not find this variant useful when combined with our methods.
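The merging step can be illustrated with the sketch below. It assumes that the per-scale score maps have already been resized to a common resolution and that the softmax is taken per pixel over the scale axis of the scores themselves; both are assumptions made for illustration, and `merge_score_maps` is a hypothetical name.

```python
import numpy as np


def merge_score_maps(score_maps):
    """Softmax-weighted sum of per-scale score maps.

    score_maps: array of shape (S, H, W), one map per scale, already
    resized to the same resolution (assumption).
    """
    s = np.asarray(score_maps, dtype=np.float64)
    # Subtract the per-pixel maximum for numerical stability.
    weights = np.exp(s - s.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights * s).sum(axis=0)


# Two toy 2x2 score maps: all zeros and all ones.
merged = merge_score_maps(np.stack([np.zeros((2, 2)), np.ones((2, 2))]))
```

Because the weights sum to one at every pixel, the merged score always lies between the smallest and largest per-scale score, leaning toward the largest.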
To implement the multi-level (U-Net) variant, we build skip connections from two levels, i.e., conv1 and conv3, and fuse different levels via feature concatenation. The same training scheme is applied as in the main paper, except that the keypoints are now derived from high-resolution feature maps.
A.3 Additional Experiments
Evaluation on dense reconstruction. In Sec. 4.1, we have used the T&T dataset to evaluate the performance in two-view image matching. Here, we resort to its original evaluation protocol defined for evaluating dense reconstruction, and integrate ASLFeat into a dense reconstruction pipeline of our own to further demonstrate its superiority.
Specifically, we use the training set of T&T, including scans with ground-truth scanned models, and use the F-score defined in to jointly measure the reconstruction accuracy (precision) and reconstruction completeness (recall). For comparison, we choose RootSIFT and GeoDesc with the SIFT detector, and sample the features to 10K for each method. Next, we apply the same matching strategy (mutual check plus a ratio test at 0.8), SfM, and dense reconstruction algorithms to obtain the final dense point clouds. As shown in Tab. 6, ASLFeat delivers consistent improvements on dense reconstruction. Since T&T exhibits less scale difference, ASLFeat without the multi-scale detection yields the overall best results.
|Methods||RootSIFT ||GeoDesc ||ASLFeat||ASLFeat (MS)|
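The matching strategy mentioned above (mutual check plus a ratio test at 0.8) can be sketched as follows. This is a brute-force illustration rather than the pipeline actually used; it assumes Euclidean descriptor distances, a ratio test applied to unsquared distances, and at least two descriptors in the second set. The function name is hypothetical.

```python
import numpy as np


def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Mutual nearest-neighbor matching followed by a ratio test.

    desc_a: (Na, D) array, desc_b: (Nb, D) array with Nb >= 2.
    Returns an (M, 2) integer array of (index_in_a, index_in_b) pairs.
    """
    # Pairwise Euclidean distances, shape (Na, Nb).
    dist = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nn_ab = dist.argmin(axis=1)  # best match in B for each A
    nn_ba = dist.argmin(axis=0)  # best match in A for each B
    idx_a = np.arange(desc_a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a
    # Ratio test: the best distance must be clearly below the second best.
    sorted_dist = np.sort(dist, axis=1)
    passes_ratio = sorted_dist[:, 0] < ratio * sorted_dist[:, 1]
    keep = mutual & passes_ratio
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)


# Toy 2-D descriptors (hypothetical values).
desc_a = np.array([[1.0, 0.0], [0.0, 1.0]])
desc_b = np.array([[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]])
matches = match_descriptors(desc_a, desc_b)
```

The mutual check removes one-sided matches, while the ratio test discards matches whose nearest and second-nearest neighbors are nearly equidistant and therefore ambiguous.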
Application on image retrieval.
We use an open-source implementation of VocabTree (https://github.com/hlzz/libvot) for evaluating image retrieval performance on the popular Oxford Buildings and Paris datasets. For simplicity, we do not apply advanced post-processing (e.g., query expansion) or re-ranking methods (e.g., spatial verification), and report the mean average precision (mAP) for all compared methods. For a fair comparison, we sample the top-10K keypoints for each method to build the vocabulary tree. As shown in Tab. 7, the proposed feature also performs well in this task, which further extends its usability in real applications.
|RootSIFT ||GeoDesc ||ContextDesc ||ASLFeat||ASLFeat (MS)|
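For reference, the mAP metric can be computed as in the sketch below. It uses the plain, non-interpolated definition of average precision; the official Oxford and Paris evaluation protocols additionally handle "junk" images and use a slightly different AP estimate, so this is an assumption-laden illustration with hypothetical function names, not the benchmark code.

```python
def average_precision(ranked_relevance, num_relevant):
    """Non-interpolated AP for one query.

    ranked_relevance: iterable of 0/1 flags in retrieval order.
    num_relevant: total number of relevant database images for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / num_relevant if num_relevant else 0.0


def mean_average_precision(queries):
    """queries: list of (ranked_relevance, num_relevant) tuples."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)


# Two toy queries (hypothetical rankings).
toy_queries = [([1, 0, 1], 2), ([0, 1], 1)]
toy_map = mean_average_precision(toy_queries)
```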
Integration with a learned matcher. In contrast to R2D2, which strengthens the model with additional task-specific training data and data augmentation by style transfer, we explore equipping the pipeline with a learnable matcher that rejects outlier matches to improve the recovery of camera poses. Specifically, we resort to the recent OANet, and train a matcher with ASLFeat using the authors' public implementation (https://github.com/zjhthu/OANet.git). Finally, we integrate the resulting matcher into the evaluation pipeline of the Aachen Day-Night dataset. As shown in Tab. 8, this integration (ASLFeat + OANet) further boosts the localization results.
A.4 Discussions
End-to-end learning with DCN. As mentioned in Sec. 3.5, we find that two-stage training for deformation parameters yields better results, i.e., 72.64 for MMA@3 on the HPatches dataset (Tab. 2), while end-to-end training results in 70.45. Although we have tried different training strategies, e.g., dividing the learning rate of deformation parameters by 10 during end-to-end training, none of them has shown better results than the simple separate training. Whether end-to-end learning can further benefit this process remains under exploration.
A.5 More Visualizations
We provide visualizations in Fig. 6 for comparing the keypoints from different local features, including SIFT, D2-Net and the proposed method.
-  R. Arandjelović and A. Zisserman. Three things everyone should know to improve object retrieval. In CVPR, 2012.
-  T. Shen, S. Zhu, T. Fang, R. Zhang, and L. Quan. Graph-based consistent matching for structure-from-motion. In ECCV, 2016.
-  J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
-  J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.