
HDD-Net: Hybrid Detector Descriptor with Mutual Interactive Learning

Local feature extraction remains an active research area due to advances in fields such as SLAM, 3D reconstruction, and AR applications. Success in these applications relies on the performance of the feature detector and descriptor. While the detector-descriptor interaction of most methods is based on unifying detections and descriptors in a single network, we propose a method that treats both extractions independently and focuses on their interaction in the learning process rather than on parameter sharing. We formulate the classical hard-mining triplet loss as a new detector optimisation term to refine candidate positions based on the descriptor map. We propose a dense descriptor that uses a multi-scale approach and a hybrid combination of hand-crafted and learned features to obtain rotation and scale robustness by design. We evaluate our method extensively on different benchmarks and show improvements over the state of the art in terms of image matching on HPatches and 3D reconstruction quality, while remaining on par on camera localisation tasks.





1 Introduction

Figure 1: Effect of Training Strategies on Result. Correct matches and target detection response maps on London Bridge sequence (HPatches) for different training strategies.

At its core, a feature extraction method aims at identifying locations within a scene that are repeatable and distinctive, so that they can be detected with high accuracy under different camera conditions and be matched between different views. The results in vision applications such as image retrieval [41], 3D reconstruction [38], or medical applications [6], among others, have shown the performance advantages of using sparse features over direct methods.

Classical methods [25, 1, 23] compute keypoints and descriptors independently. For instance, SIFT [25] focused on finding blobs in images and extracting gradient histograms as descriptors. Recently proposed descriptors, especially patch-based ones [42, 30, 43, 26], are computed for DoG keypoints, and although they may perform well with other detectors, their test performance is better if the models are trained with patches extracted with the same detector. Most detectors are trained independently of the descriptors and optimise the local repeatability of keypoints [11, 33, 40]. The methods that attempt to use the descriptor information to train the detector [14, 35, 47, 27] predict score maps that focus on either the repeatability or the reliability of a local feature. In our approach, motivated by the limited descriptor influence on the detector, we adapt the descriptor-based hard-mining triplet cost function [30] to train the detector model. Thus, keypoint locations are optimised based on the descriptor performance jointly with the detector repeatability. This approach leads to finding features that are both repeatable and discriminative, as shown in figure 1. We extend the models to a multi-scale framework, such that the detector/descriptor networks use different levels of detail when making predictions.

Our approach is motivated by the observation that jointly learnt detector-descriptor models [14, 35] lack keypoint localization accuracy, which is critical for SLAM, SfM, or pose estimation. Furthermore, keypoints are typically well localised on simple structures such as edges or corners, while descriptors require more context to be discriminative. We argue that despite the recent trend for end-to-end and joint detector-descriptor methods, separate extractors allow for shallow models that can perform well in terms of accuracy and efficiency.

Besides that, in contrast to patch-based descriptors, dense image descriptors make it more difficult to locally rectify the image regions for invariance. To address this issue, we introduce an approach based on a block of hand-crafted features and a multi-scale representation within the descriptor architecture, making our network robust to small rotations and scale changes. We term our approach HDD-Net: Hybrid Detector and Descriptor Network.

In summary, the contributions are: 1) A new detector loss based on the hard-mining triplet cost function. Although the hard-mining triplet is widely used for descriptors it has not been adapted to the training of keypoint detectors. 2) A novel multi-scale sampling scheme to simultaneously train our detector and descriptor. 3) The first dense descriptor architecture that uses a block of hand-crafted features and multi-scale representation to improve the robustness to rotation and scale changes.

2 Related Work

Classical hand-crafted methods have been extensively studied in [44, 9]. We focus the review of related work on learned methods. For further details we refer to [2, 22, 5, 48].

Detectors. Machine learning detectors were introduced with FAST [36], a learned algorithm to speed up the detection of corners in images. Later, TILDE [45] proposed to train multiple piecewise regressors that were robust under photometric changes in images. DNET [21] and TCDET [49] based their learning on a formulation of the covariant constraint, enforcing the architecture to propose the same feature location in corresponding patches. Key.Net [4] expanded the covariant constraint to a multi-scale formulation, and used a hybrid architecture composed of hand-crafted and learned feature blocks.

Descriptors. Descriptors have attracted much attention, particularly patch-based methods [3, 42, 30] due to the simplicity of the task and available benchmarks. Recently, SOSNet [43] improved on the state-of-the-art by adding a regularisation term to the triplet loss to include the second-order similarity relationships among descriptors. DOAP [17] reformulated the training of descriptors as a ranking problem, by optimising the mean average precision instead of the distance between patches. GeoDesc [26] integrated geometry constraints to obtain better training data. Following the idea of improving the data, [34] presented a new patch-based dataset containing scenes under different weather and seasonal conditions.

Joint Detectors and Descriptors. LIFT [47] was the first CNN based method to integrate detection, orientation estimation, and description. SuperPoint [11] used a single encoder and two decoders to perform dense feature detection and description. It was first pretrained to detect corners on a synthetic dataset, and then improved by applying random homographies to the training images, improving the stability of the ground truth positions under different viewpoints. Similar to LIFT, LF-Net [33] computed position, scale, orientation, and description. LF-Net trained its detector score and scale estimator in full images without external keypoint supervision. RF-Net [40] extended LF-Net by exploiting the information provided by the receptive fields. D2-Net [14] proposed to perform feature detection in the descriptor space, showing that an already pre-trained network could be used for feature extraction even though it was optimized for a different task. R2D2 [35] introduced a dense version of the L2Net [42] to predict descriptors and two keypoint score maps based on their repeatability and reliability. Recently, ASLFeat [27] proposed an accurate detector and invariant descriptor with multi-level connections and deformable convolutional networks [10, 50].

3 Method

This section presents the architecture and training of our Hybrid Detector and Descriptor Network (HDD-Net).

3.1 HDD-Net Architecture

HDD-Net consists of two independent architectures for inferring the keypoint and descriptor maps, which allows the use of different hand-crafted blocks designed specifically for each of these two tasks.

Descriptor. As our method estimates dense descriptors in the entire image, an affine rectification of independent patches or rotation invariance by construction [15] is not possible. To circumvent this, we design a hand-crafted block which explicitly addresses the robustness to rotation. We incorporate this block into an architecture based on L2-Net [42]. We replace the last convolutional layer by a bilinear upsampling operator to upscale the map to its original image resolution. Moreover, we use a multi-scale image representation to extract features from different scale levels. Multi-scale L2-Net features are fused into a final descriptor map by a last convolutional layer.

Rotation Robustness. Transformation equivariance in CNNs has been extensively discussed in [7, 16, 46, 13]. The two main approaches differ in whether the transformations are applied to the input image [19] or to the filters [12]. Rotating the filters is more efficient since they are smaller than the input images and therefore have lower memory requirements. Unlike [12], which applies the rotation to all the layers in their convolutional model, we focus on the input filters only, which further reduces the computational complexity. In contrast, we apply more rotations than [12] to the input filters to provide sufficient robustness. The feature extraction is illustrated in figure 2. First, we rotate the input filter multiple times and apply a circular mask to avoid artifacts at the filter corners. Subsequently, we extract the feature maps and apply a cyclic max-pooling operator. Max-pooling is applied along the rotation dimension over every three neighbouring feature maps with a channel-wise stride of two. Thus, instead of providing a single maximum over the entire rotation space, cyclic pooling returns the maxima in different quadrants. We experimentally found that returning the local maxima provides better results than using only the global one. As the max-pooling operator is driven to positive values, we split each feature map $f$ into three parts [8]: $f$, $(f)^{+}$ and $(f)^{-}$, where the $(\cdot)^{+}$ and $(\cdot)^{-}$ operators respectively keep the positive and negative parts of the feature map $f$.
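The pooling and splitting steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the function names, the toy number of rotations (8), and the convention of storing the negative part as a magnitude are all assumptions made for the example. The pooling window of three and stride of two follow the text.

```python
import numpy as np

def split_pos_neg(f):
    """Return a feature map together with its positive and negative parts."""
    pos = np.maximum(f, 0.0)    # positive responses, zero elsewhere
    neg = np.maximum(-f, 0.0)   # magnitude of negative responses (assumed convention)
    return f, pos, neg

def cyclic_max_pool(maps, window=3, stride=2):
    """Max-pool a (R, H, W) stack along its rotation axis, wrapping around at the end."""
    n_rot = maps.shape[0]
    pooled = []
    for start in range(0, n_rot, stride):
        idx = [(start + k) % n_rot for k in range(window)]  # cyclic neighbourhood
        pooled.append(maps[idx].max(axis=0))
    return np.stack(pooled)

# Toy example: responses of 8 rotated copies of one filter on a 2x2 image.
responses = np.arange(8 * 2 * 2, dtype=float).reshape(8, 2, 2) - 16.0
raw, pos, neg = split_pos_neg(responses)
pooled_pos = cyclic_max_pool(pos)   # shape (4, 2, 2): one map per pooled rotation group
pooled_neg = cyclic_max_pool(neg)
```

Because the index wraps around, the last pooled map combines the final rotations with the first one, which is what makes the pooling cyclic rather than a plain strided max-pool.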

Scale Robustness. Gaussian scale-space has been extensively exploited for local feature extraction [1, 28, 47]. In [33, 40, 4], the scale-space representation was used not only to extract multi-scale features but also to learn to combine their information. However, the fusion of multi-scale features is only used during the detection, while, in deep descriptors, it is either implemented via consecutive convolutional layers [11], or as independent multi-scale extraction [35, 14, 27]. In contrast, we extend the Gaussian pyramid to the descriptor extraction and design a network that is able to compute and combine multi-scale information in a single forward pass. The descriptor encoder shares the weights of each multi-scale stream, thus, boosting its ability to extract features robust to scale changes. Figure 3 depicts the multi-scale descriptor.

Figure 2: Hand-crafted Block. Rotation robustness is given by rotating an input filter and sampling from its rotation space. The $(\cdot)^{+}$ and $(\cdot)^{-}$ operators split positive and negative parts before the cyclic max-pooling is applied to all features.

Detector. We adopt the architecture of Key.Net [4], which combines specific hand-crafted filters for feature detection and a multi-scale shallow network. It has recently been shown to achieve state-of-the-art results in repeatability.

Figure 3: Multi-Scale Hybrid Descriptor. A Gaussian pyramid is fed into the block of hand-crafted features that serve as the input to L2-Net. Multi-scale L2-Net features are upsampled and combined through a final convolution.
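To make the data flow of the multi-scale descriptor concrete, the following NumPy sketch builds a Gaussian pyramid, applies one shared encoder to every level, upsamples the per-level features to the input resolution, and fuses them with a 1x1 convolution. Everything here is an assumption chosen for illustration: the pyramid factor of 2, the number of levels, nearest-neighbour upsampling (the paper uses bilinear upsampling), and the `toy_encoder` that stands in for the hand-crafted block followed by L2-Net.

```python
import numpy as np

def gaussian_blur(img, sigma=1.0):
    """Separable Gaussian blur (zero-padded borders); expects sides >= kernel length."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    smooth = lambda row: np.convolve(row, k, mode="same")
    img = np.apply_along_axis(smooth, 0, img)
    return np.apply_along_axis(smooth, 1, img)

def gaussian_pyramid(img, levels=3):
    """Blur and subsample by 2 at each level (the factor is an assumption)."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(gaussian_blur(pyr[-1])[::2, ::2])
    return pyr

def upsample_nearest(feat, factor, out_hw):
    """Nearest-neighbour upsampling, a simplification of bilinear upsampling."""
    up = np.repeat(np.repeat(feat, factor, axis=0), factor, axis=1)
    return up[:out_hw[0], :out_hw[1]]

def multi_scale_descriptor(img, encoder, fusion_weights, levels=3):
    """Apply the same encoder to every pyramid level, then fuse with a 1x1 convolution."""
    h, w = img.shape
    feats = [upsample_nearest(encoder(level), 2 ** i, (h, w))
             for i, level in enumerate(gaussian_pyramid(img, levels))]
    stacked = np.concatenate(feats, axis=-1)   # (H, W, levels * C)
    return stacked @ fusion_weights            # (H, W, D)

# Hypothetical stand-in for the hand-crafted block + L2-Net: two channels per pixel.
toy_encoder = lambda x: np.stack([x, x ** 2], axis=-1)

rng = np.random.default_rng(0)
image = rng.random((16, 16))
fusion = rng.standard_normal((3 * 2, 4))       # 3 levels x 2 channels -> 4-D descriptor
dense_desc = multi_scale_descriptor(image, toy_encoder, fusion)
```

Passing the same `encoder` to every level mirrors the weight sharing between the multi-scale streams, and the single matrix product plays the role of the final convolution that combines them.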

3.2 Descriptor-Detector Training

Detector learning has focused on localising features that are repeatable in a sequence of images [11, 33, 40, 45, 22, 4], with a few works that determine whether these features are adequate for the matching stage [32, 35, 47]. Since a good feature should be repeatable as well as discriminative [44], we formulate the descriptor triplet loss function as a new detector learning term to refine the feature candidates towards more discriminative positions. Unlike AffNet [32], which estimates the affine shape of the features, we refine only their locations, as these are the main parameters that are often used for end tasks such as SfM, SLAM or AR. R2D2 [35] inferred two independent response maps, seeking discriminativeness of the features and their repeatability. Our approach combines both objectives into a single detection map. LIFT [47] training was based on finding the locations with the closest descriptors; in contrast, we propose a function based on a triplet loss with a hard-negative mining strategy.

Figure 4: Triplet loss function optimises detections based on their descriptor map, refining the feature candidates towards more discriminative positions.

Detector Learning with Triplet Loss. Hard-negative triplet learning maximises the Euclidean distance between a positive pair and their closest negative sample. In the original work [30], the optimisation happens in the descriptor part. In contrast, we propose to freeze the descriptor such that the sampling locations proposed by the detector are updated to minimise the loss term, as shown in figure 4.

Given a pair of corresponding images, we create a grid on each image with a fixed window size of $s_1 \times s_1$. From each window, we extract a soft-descriptor and its positive and negative samples as illustrated in figure 5. To compute the soft-descriptor, we aggregate all the descriptors within the window based on the detection score map, so that the final soft-descriptor and the scores within a window are entangled. Note that if Non-Maximum Suppression (NMS) were used to select the maximum coordinates and their descriptor, we would only be able to back-propagate through the selected pixels and not the entire map. Consider a window $w$ of size $s_1 \times s_1$ with the score value $r(u, v)$ at each coordinate $[u, v]$ within the window. A softmax provides:

$$p(u, v) = \frac{e^{r(u, v)}}{\sum_{(j, k) \in w} e^{r(j, k)}} \tag{1}$$

Window $w$ has the associated score map $p(u, v)$ and descriptor vector $\mathbf{d}(u, v)$ at each coordinate $[u, v]$ within the window. We compute the soft-score, $\bar{r}$, and soft-descriptor, $\bar{\mathbf{d}}$, as:

$$\bar{r} = \sum_{(u, v) \in w} r(u, v)\, p(u, v), \qquad \bar{\mathbf{d}} = \sum_{(u, v) \in w} \mathbf{d}(u, v)\, p(u, v) \tag{2}$$

We use L2 normalisation for the soft-descriptor by projecting it onto the unit hypersphere. Similar to [31, 40], we sample the hardest negative candidate from a non-neighbouring area. This geometric constraint is illustrated in figure 5. We can define our detector triplet loss with soft-descriptors in window $w$ as:

$$\mathcal{L}(w) = \max\left(0,\; \mu + \delta^{+} - \delta^{-}\right) \tag{3}$$

where $\mu$ is a margin parameter, and $\delta^{+}$ and $\delta^{-}$ are the Euclidean distances from the soft-descriptor to its positive and hardest negative samples, respectively. We weight the contribution of each window by its soft-score to control the participation of meaningless windows, e.g., flat areas. The final loss is defined as the aggregation of losses on all windows of size $s_1$:

$$\mathcal{L}_{Trip}(s_1) = \sum_{w} \bar{r}_w\, \mathcal{L}(w) \tag{4}$$

Figure 5: Triplet Formation Pipeline. Soft-descriptors are extracted from each window together with their respective positives (green lines) and the hardest negatives (red lines). The negatives are extracted only from non-neighbouring areas (denoted as non-red areas).
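The triplet formation for a single window can be sketched as follows. This is a forward-pass illustration in NumPy only (no gradients), under assumptions: the function names, the margin value of 1.0, and the `min_dist` rule used to define a non-neighbouring area are hypothetical choices, and the soft-score is taken to be the softmax-weighted mean of the window scores.

```python
import numpy as np

def soft_window(scores, descs):
    """scores: (s, s) detector scores; descs: (s, s, D) dense descriptors of one window."""
    e = np.exp(scores - scores.max())                 # numerically stable softmax
    p = e / e.sum()
    soft_score = float((p * scores).sum())            # softmax-weighted score
    soft_desc = (p[..., None] * descs).sum(axis=(0, 1))
    soft_desc = soft_desc / (np.linalg.norm(soft_desc) + 1e-12)  # unit hypersphere
    return soft_score, soft_desc

def hardest_negative(anchor, candidates, centres, anchor_centre, min_dist):
    """Index of the closest candidate descriptor whose window is not a neighbour."""
    far = np.linalg.norm(centres - anchor_centre, axis=1) >= min_dist
    dists = np.linalg.norm(candidates - anchor, axis=1)
    return int(np.argmin(np.where(far, dists, np.inf)))

def window_triplet_loss(anchor, positive, negative, soft_score, margin=1.0):
    """Margin triplet loss of one window, weighted by its soft-score."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return soft_score * max(0.0, margin + d_pos - d_neg)

# Toy window: uniform scores, identical descriptors at every pixel.
scores = np.ones((2, 2))
descs = np.tile(np.array([3.0, 4.0]), (2, 2, 1))
score, desc = soft_window(scores, descs)
```

Since every pixel of the window contributes to `soft_desc` through the softmax weights, a change to any score in the window changes the loss, which is the property that a hard NMS selection would not provide.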

Multi-Scale Context Aggregation. We extend equation 4 to a multi-scale approach to learn features that are discriminative across a range of scales. Multi-scale learning was used in keypoint detection [4, 33, 40]; we extend these works by using the multi-scale sampling strategy on the descriptor part. Thus, we sample local soft-descriptors with varying window sizes, $s_n$, as shown in figure 5, and combine their losses with control parameters $\lambda_n$ in a final term:

$$\mathcal{L}_{MS\text{-}Trip} = \sum_{n} \lambda_n\, \mathcal{L}_{Trip}(s_n) \tag{5}$$

where $\mathcal{L}_{Trip}(s_n)$ is the loss of equation 4 computed on windows of size $s_n \times s_n$.


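Repeatable & Discriminative. The detector triplet loss optimises the model to find locations that can potentially be matched. As stated in [44], discriminativeness is not sufficient to train a suitable detector. Therefore, we combine our discriminative loss and the repeatability term M-SIP proposed in [4] with a control parameter $\beta$ to balance their contributions:

$$\mathcal{L}_{R\&D} = \mathcal{L}_{M\text{-}SIP} + \beta\, \mathcal{L}_{MS\text{-}Trip} \tag{6}$$

where $\mathcal{L}_{MS\text{-}Trip}$ is the multi-scale triplet term of equation 5.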
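The multi-scale aggregation and the combination with the repeatability term reduce to a grid split followed by two weighted sums. The sketch below assumes hypothetical helper names, non-overlapping windows, and a weight placed on the triplet term; the per-scale losses and the M-SIP value are passed in as plain numbers because their computation is outside the scope of this example.

```python
import numpy as np

def grid_windows(x, s):
    """Split an (H, W, ...) array into non-overlapping s x s windows."""
    h, w = x.shape[0] // s * s, x.shape[1] // s * s   # drop incomplete border windows
    x = x[:h, :w]
    x = x.reshape(h // s, s, w // s, s, *x.shape[2:])
    return x.swapaxes(1, 2).reshape(-1, s, s, *x.shape[4:])

def multi_scale_triplet(loss_per_scale, lambdas):
    """Weighted sum of the per-window-size triplet losses."""
    return float(sum(lam * l for lam, l in zip(lambdas, loss_per_scale)))

def detector_objective(l_msip, l_ms_trip, beta):
    """Repeatability term plus the weighted discriminative term (weighting is assumed)."""
    return l_msip + beta * l_ms_trip

# Toy score map split into 2x2 windows.
score_map = np.arange(16, dtype=float).reshape(4, 4)
windows = grid_windows(score_map, 2)   # shape (4, 2, 2)
```

The same `grid_windows` call works for the dense descriptor map, since trailing dimensions such as the descriptor channels are carried along unchanged.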


Entangled Detector-Descriptor Learning. We frame our joint optimisation strategy as follows. The detector is optimised by equation 6, while the descriptor learning is based on the hard-mining triplet loss [30]. For the descriptor learning, we use the same sampling approach as in figure 5; however, instead of sampling soft-descriptors, we sample a point-wise descriptor per window. The location at which to sample the descriptor is provided by an NMS on the detector score map. Hence, our detector refines its candidate positions using the descriptor space, while the descriptor learning is conditioned by the detector score map sampling. The interaction between the two parts tightly couples the tasks and allows for mutual refinement. We alternate the detector and descriptor optimisation steps during training until mutual convergence is reached. Although it is possible to formulate our optimisation as a single objective minimisation problem, in practice the alternation helped the optimiser converge to a satisfactory minimum.

4 Implementation Details

This section introduces relevant implementation details, such as dataset generation and HDD-Net training methodology.

Training Dataset. We synthetically create pairs of images by applying random homography transformations to ImageNet images [20]. The random homographies are parameterised by rotation, scale and skew. For tackling illumination changes, we use the AMOS dataset [34], which contains sequences of images taken from the same position at different times of the year. We further filter the AMOS dataset and keep only images that are taken during summer between sunrise and midnight. We generate separate sets of images for training and validation.
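One way to generate such synthetic pairs is to compose a rotation, an isotropic scale and a skew about the image centre. The sketch below is an assumption-laden illustration: the sampling ranges (`max_rot_deg`, `scale_range`, `max_skew`) are hypothetical placeholders rather than the values used for training, and the composition order is one of several reasonable choices.

```python
import numpy as np

def random_homography(rng, centre, max_rot_deg=30.0, scale_range=(0.5, 2.0), max_skew=0.3):
    """Sample a 3x3 transform from rotation, scale and skew (ranges are hypothetical)."""
    theta = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    s = rng.uniform(*scale_range)
    k = rng.uniform(-max_skew, max_skew)
    c, si = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -si, 0.0], [si, c, 0.0], [0.0, 0.0, 1.0]])
    scale = np.diag([s, s, 1.0])
    skew = np.array([[1.0, k, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    cx, cy = centre
    to_origin = np.array([[1.0, 0.0, -cx], [0.0, 1.0, -cy], [0.0, 0.0, 1.0]])
    back = np.array([[1.0, 0.0, cx], [0.0, 1.0, cy], [0.0, 0.0, 1.0]])
    return back @ rot @ scale @ skew @ to_origin   # transform about the image centre

def warp_points(points, H):
    """Apply a homography to an (N, 2) array of (x, y) points."""
    pts = np.hstack([points, np.ones((len(points), 1))]) @ H.T
    return pts[:, :2] / pts[:, 2:3]

rng = np.random.default_rng(0)
H = random_homography(rng, centre=(64.0, 48.0))
```

Because the transform is known, `warp_points` gives the ground-truth position of any keypoint in the second image, which is what makes the correspondences available for training without manual labels.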

HDD-Net Training and Testing. Although the detector triplet loss function is applied to the full image, we only use the top-scoring detections of each image for training the descriptor, so the total number of triplets in every training batch is determined by this number and the batch size. On the detector side, the window sizes, their loss weights and the parameter balancing the repeatability and discriminative terms were set by a hyper-parameter search on the validation set. We fix the HDD-Net descriptor size to 256 dimensions. At test time, we apply NMS to select candidate locations on the detector score map. Networks and dataset generation were implemented in TensorFlow 1.15 and will be released. Training concludes within 48 hours on a single GTX 1080Ti.
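The test-time keypoint selection can be illustrated with a simple window-based NMS in NumPy. The window size, the positive-score threshold and the `top_k` argument are assumptions made for this sketch; the source does not give the values used here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def nms_keypoints(score_map, window=5, top_k=None):
    """Keep pixels that are the maximum of their (odd-sized) window, sorted by score."""
    score_map = np.asarray(score_map, dtype=float)
    pad = window // 2
    padded = np.pad(score_map, pad, mode="constant", constant_values=-np.inf)
    neighbourhood_max = sliding_window_view(padded, (window, window)).max(axis=(2, 3))
    keep = (score_map == neighbourhood_max) & (score_map > 0.0)  # ignore flat zero areas
    ys, xs = np.nonzero(keep)
    order = np.argsort(-score_map[ys, xs])                       # strongest first
    if top_k is not None:
        order = order[:top_k]
    coords = np.stack([ys[order], xs[order]], axis=1)            # (row, col) pairs
    return coords, score_map[ys[order], xs[order]]

# Toy score map with two well-separated peaks and one suppressed neighbour.
toy = np.zeros((7, 7))
toy[2, 2], toy[2, 3], toy[5, 5] = 1.0, 0.5, 0.8
coords, scores = nms_keypoints(toy, window=3)
```

Limiting the output with `top_k` corresponds to the fixed keypoint budgets used in the evaluation, where only the highest-scoring detections are kept.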

 L2Net-Backbone  1st Order  2nd Order  Gabor Filter  Fully Learnt  (·)+ & (·)−  Multi-Scale  Heinly MMA (%)
 ✓ - - - - - 41.8
 ✓ - - - - - - 42.0
 ✓ - - - - - 42.5
 ✓ - - - - - 43.1
 ✓ - - - - - 43.3
 ✓ - - - - - 43.4
 ✓ - - - - 43.6
 ✓ - - - - 44.1
 ✓ - - - 44.5
Table 1: Ablation Study. MMA (%) on Heinly dataset [18] for different descriptor designs. Best results are obtained with Gabor filters in the hand-crafted block, the $(\cdot)^{+}$ and $(\cdot)^{-}$ operators, and multi-scale feature fusion.

5 Experimental Evaluation

This section presents the evaluation results of our method in several application scenarios. The comparison focuses on state-of-the-art joint detector and descriptor approaches.

5.1 Architecture Design

Dataset. We use the Heinly dataset [18] to validate our architecture design choices. It is a small SfM and homography dataset; we focus on its homography set and use only the sequences that are not part of HPatches [2]. We compute the Mean Matching Accuracy (MMA) [29] as the ratio between the number of correctly matched features, within a threshold of 5 pixels, and the total number of detected features.
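For a single image pair related by a known homography, the matching accuracy defined above can be computed as in the following sketch. The function and argument names are hypothetical; matches are assumed to be given as two aligned arrays of (x, y) keypoints, and the dataset-level MMA would be the mean of this value over all image pairs.

```python
import numpy as np

def matching_accuracy(kpts_a, kpts_b, H, n_detected, thresh=5.0):
    """Fraction of detected features whose match lands within `thresh` pixels.

    kpts_a, kpts_b: (N, 2) arrays of matched (x, y) keypoints in images A and B.
    H: 3x3 ground-truth homography mapping image A onto image B.
    n_detected: total number of detected features used as the denominator.
    """
    pts = np.hstack([kpts_a, np.ones((len(kpts_a), 1))]) @ H.T
    proj = pts[:, :2] / pts[:, 2:3]                 # keypoints of A projected into B
    err = np.linalg.norm(proj - kpts_b, axis=1)     # reprojection error per match
    return float((err <= thresh).sum()) / n_detected

# Toy pair under the identity homography: two of three matches are within 5 pixels.
a = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]])
b = np.array([[3.0, 3.0], [10.0, 16.0], [20.0, 20.0]])
acc = matching_accuracy(a, b, np.eye(3), n_detected=4)
```

Dividing by the number of detected features rather than the number of matches means that detections which never find a match also lower the score.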

Ablation Study. We evaluate a set of hand-crafted filters for extracting features that are robust to rotation. Specifically, we test 1st and 2nd order derivatives as well as Gabor filters. In addition, we test a fully learnt approach without the hand-crafted filters. We also report results showing the impact of splitting the hand-crafted positive and negative features. Finally, our multi-scale approach is tested against a single-pass architecture without multi-scale feature fusion.

Results in table 1 show that Gabor filters obtain better results than 1st or 2nd order derivatives. They are especially effective for rotation since they are designed to detect patterns under specific orientations. Besides, results without constraining the rotational block to any specific filter are slightly lower than the baseline. The fully learnt model could be improved by adding more filters, but if we restrict the design to a single filter, a hand-crafted filter with the $(\cdot)^{+}$ and $(\cdot)^{-}$ operators gives the best performance. Lastly, a notable boost over the baseline comes from our proposed multi-scale pyramid and feature fusion within the architecture.

Figure 6: Mean Matching Accuracy (MMA) on HPatches dataset [2] for top 100, 500 and 1,000 extracted points. Methods are sorted in descending order by their score on each of the splits.

5.2 Image Matching

Dataset. We use the HPatches dataset [2] with 116 sequences, including viewpoint and illumination changes. We compute results for sequences with image resolution smaller than 1200 × 1600, following the approach in [14]. To demonstrate the impact of the detector and to make a fair comparison between different methods, we extend the detector evaluation protocol proposed in [22] to the matching metrics by computing the MMA score for the top 100, 500, and 1,000 keypoints.

Effect of Triplet Learning on Detector. Table 2 shows HDD-Net results when training its detections to be repeatable and/or discriminative. The performance obtained with only the discriminative triplet term is lower than with only the repeatability term, which is in line with [35]. Repeatable features are crucial for matching images; however, the best results are obtained when combining repeatable and discriminative loss terms for the detector learning. The results show that combining both principles into a single detection score map is effective.

Comparison to SOTA. Figure 6 compares our HDD-Net to different algorithms. HDD-Net outperforms all the other methods for viewpoint and illumination sequences on every threshold, excelling especially under viewpoint change, which includes the scale and rotation transformations for which HDD-Net was designed. SuperPoint [11] performance is lower when using only the top 100 keypoints, and even though no method was trained with such a constraint, the other models keep their performance very close to their 500 or 1,000 results. When constraining the number of keypoints, D2Net-SS [14] results are higher than for its multi-scale version D2Net-MS, although D2Net-MS was reported in [14] to achieve higher performance when using an unlimited number of features.

HPatches (MMA)
Detector objective             View   Illum
Discriminative only            26.4   34.9
Repeatable only                38.3   35.5
Repeatable & discriminative    38.9   41.5
Table 2: MMA (%) results for different detector optimisation objectives on HPatches [2] dataset.

5.3 3D Reconstruction

Dataset. We use the ETH SfM benchmark [39] for the 3D reconstruction task. We select three sequences: Madrid Metropolis, Gendarmenmarkt, and Tower of London. We report results in terms of registered images, sparse points, track length, and reprojection error. The top 2,048 points are used as in [48], which still provides a fair comparison between methods at a much lower cost. The sparse and dense reconstructions are performed using the COLMAP [38] software. In addition, we use one-third of the images in each dataset to reduce the computational time.

Results. Table 3 presents the results for the 3D reconstruction experiment. HDD-Net and SuperPoint obtain the best results overall. While HDD-Net recovers more sparse points and registers more images in Madrid Metropolis and Tower of London, SuperPoint does so for Gendarmenmarkt. Their accuracy leads to denser reconstructions than those of the D2-Net or R2D2 networks. D2-Net features did not allow any model to be reconstructed on Madrid Metropolis within the evaluation protocol, i.e., with a small budget on the number of extracted keypoints. Due to challenging examples with moving objects within the images, and sometimes the object of interest being seen from distant views, recovering a 3D model from a subset of keypoints makes the reconstruction task even harder. Even so, limiting the total number of extracted points for each method also gives an indicator of the precision and relevance of such keypoints. In terms of track length, that is, the number of images in which at least one feature was successfully tracked, R2D2 and HDD-Net outperform all the other methods. LF-Net reports the smallest reprojection error, followed by SIFT and HDD-Net. Although the reprojection error of LF-Net is small, its numbers of sparse points and registered images are below those of its competitors.

                      Reg. Images   Sparse Points   Track Length   Reproj. Err.
Madrid Metropolis (448 Images)
 SIFT [25]                 27           1140            4.34           0.69
 LF-Net [33]               19            467            4.22           0.62
 SuperPoint [11]           39           1258            5.08           0.96
 D2-Net SS [14]             –              –               –              –
 D2-Net MS [14]             –              –               –              –
 R2D2 [35]                 22            984            4.85           0.88
 HDD-Net                   43           1374            5.25           0.80

Gendarmenmarkt (488 Images)
 SIFT [25]                132           5332            3.68           0.86
 LF-Net [33]               99           3460            4.65           0.90
 SuperPoint [11]          156           6470            5.93           1.21
 D2-Net SS [14]            17            610            3.31           1.04
 D2-Net MS [14]            14            460            3.02           0.99
 R2D2 [35]                115           3834            7.12           1.05
 HDD-Net                  154           6174            6.30           0.98

Tower of London (526 Images)
 SIFT [25]                 75           4621            3.21           0.71
 LF-Net [33]               76           3847            4.63           0.56
 SuperPoint [11]          111           5760            5.41           0.75
 D2-Net SS [14]            10            360            2.93           0.94
 D2-Net MS [14]            10             64            5.95           0.93
 R2D2 [35]                 81           3756            6.02           1.03
 HDD-Net                  116           6039            5.45           0.80
Table 3: 3D Reconstruction results on ETH 3D benchmark [39]. Best results are in bold. Dash symbol (–) means that COLMAP could not reconstruct any model.

5.4 Camera Localisation

Aachen Day-Night: Correct Localised Queries (%)
Localisation Thres.   0.5m, 2°   1m, 5°   5m, 10°
SIFT [25]               33.7      52.0      65.3
SuperPoint [11]         42.9      61.2      85.7
D2-Net SS [14]          44.9      65.3      88.8
D2-Net MS [14]          41.8      68.4      88.8
R2D2 [35]               45.9      66.3      88.8
HDD-Net                 43.9      62.2      82.7
Table 4: Aachen Day-Night [37] results on camera localisation.

Dataset. The Aachen Day-Night dataset [37] contains more than 5,000 images, with separate queries for day and night (we use the benchmark from the CVPR 2019 workshop on Long-term Visual Localization). Due to the challenging data, and to avoid convergence issues, we increase the number of keypoints to 8,000. Despite that, localisation with LF-Net features did not converge, so LF-Net is not included in table 4.

Results. The best results for the most permissive error threshold are reported by the D2-Net networks and R2D2. Note that D2-Net and R2D2 are trained on the MegaDepth [24] and Aachen datasets, respectively, which contain real 3D scenes under similar geometric conditions. In contrast, SuperPoint and HDD-Net use synthetic training data, and while they perform better on image matching or 3D reconstruction, their performance is lower on localisation. As a remark, results are much closer for the most restrictive error threshold, showing that HDD-Net and SuperPoint are on par with their competitors for more accurate camera localisation.

6 Conclusion

In this paper, we have introduced a new detector-descriptor method based on a hand-crafted block and multi-scale image representation within the descriptor. Moreover, we have reformulated the triplet loss function to not only learn the descriptor part but also to refine the proposed keypoint locations from the detector. We validate our contributions in the image matching task, where HDD-Net outperforms the baseline by a wide margin. Furthermore, we show through extensive experiments across different tasks that our approach outperforms or performs as well as the top joint detector-descriptor algorithms in terms of matching accuracy, number of registered images and reconstructed 3D points, despite using only synthetic data and far fewer training samples.


  • [1] P. F. Alcantarilla, J. Nuevo, and A. Bartoli (2013) Fast explicit diffusion for accelerated features in nonlinear scale spaces. BMVC. Cited by: §1, §3.1.
  • [2] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk (2017) HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5173–5182. Cited by: §2, Figure 6, §5.1, §5.2, Table 2.
  • [3] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk (2016) Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC, Vol. 1, pp. 3. Cited by: §2.
  • [4] A. Barroso-Laguna, E. Riba, D. Ponsa, and K. Mikolajczyk (2019) keypoint detection by handcrafted and learned cnn filters. International Conference on Computer Vision. Cited by: §2, §3.1, §3.1, §3.2, §3.2, §3.2.
  • [5] D. Bojanić, K. Bartol, T. Pribanić, T. Petković, Y. D. Donoso, and J. S. Mas (2019) On the comparison of classic and deep keypoint detector and descriptor methods. In 2019 11th International Symposium on Image and Signal Processing and Analysis (ISPA), pp. 64–69. Cited by: §2.
  • [6] B. Busam, P. Ruhkamp, S. Virga, B. Lentes, J. Rackerseder, N. Navab, and C. Hennersperger (2018) Markerless inside-out tracking for 3d ultrasound compounding. In Simulation, Image Processing, and Ultrasound Systems for Assisted Diagnosis and Navigation, pp. 56–64. Cited by: §1.
  • [7] T. Cohen and M. Welling (2016) Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999. Cited by: §3.1.
  • [8] A. Crivellaro and V. Lepetit (2014) Robust 3d tracking with descriptor fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3414–3421. Cited by: §3.1.
  • [9] G. Csurka, C. R. Dance, and M. Humenberger (2018) From handcrafted to deep local features. arXiv preprint arXiv:1807.10254. Cited by: §2.
  • [10] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773. Cited by: §2.
  • [11] D. DeTone, T. Malisiewicz, and A. Rabinovich (2018) Superpoint: self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 224–236. Cited by: §1, §2, §3.1, §3.2, §5.2, Table 3, Table 4.
  • [12] S. Dieleman, J. De Fauw, and K. Kavukcuoglu (2016) Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660. Cited by: §3.1.
  • [13] S. Dieleman, K. W. Willett, and J. Dambre (2015) Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly notices of the royal astronomical society 450 (2), pp. 1441–1459. Cited by: §3.1.
  • [14] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler (2019) D2-net: a trainable cnn for joint detection and description of local features. arXiv preprint arXiv:1905.03561. Cited by: §1, §1, §2, §3.1, §5.2, §5.2, Table 3, Table 4.
  • [15] P. Ebel, A. Mishchuk, K. M. Yi, P. Fua, and E. Trulls (2019) Beyond cartesian representations for local descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pp. 253–262. Cited by: §3.1.
  • [16] P. Follmann and T. Bottger (2018) A rotationally-invariant convolution module by feature map back-rotation. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 784–792. Cited by: §3.1.
  • [17] K. He, Y. Lu, and S. Sclaroff (2018) Local descriptors optimized for average precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 596–605. Cited by: §2.
  • [18] J. Heinly, E. Dunn, and J. Frahm (2012) Comparative evaluation of binary features. In European Conference on Computer Vision, pp. 759–773. Cited by: Table 1, §5.1.
  • [19] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025. Cited by: §3.1.
  • [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §4.
  • [21] K. Lenc and A. Vedaldi (2016) Learning covariant feature detectors. In European Conference on Computer Vision, pp. 100–117. Cited by: §2.
  • [22] K. Lenc and A. Vedaldi (2018) Large scale evaluation of local image feature detectors on homography datasets. BMVC. Cited by: §2, §3.2, §5.2.
  • [23] S. Leutenegger, M. Chli, and R. Siegwart (2011) BRISK: binary robust invariant scalable keypoints. In 2011 IEEE International Conference on Computer Vision (ICCV), pp. 2548–2555. Cited by: §1.
  • [24] Z. Li and N. Snavely (2018) Megadepth: learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2041–2050. Cited by: §5.4.
  • [25] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110. Cited by: §1, Table 3, Table 4.
  • [26] Z. Luo, T. Shen, L. Zhou, S. Zhu, R. Zhang, Y. Yao, T. Fang, and L. Quan (2018) Geodesc: learning local descriptors by integrating geometry constraints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 168–183. Cited by: §1, §2.
  • [27] Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y. Yao, S. Li, T. Fang, and L. Quan (2020) ASLFeat: learning local features of accurate shape and localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §1, §2, §3.1.
  • [28] K. Mikolajczyk and C. Schmid (2001) Indexing based on scale invariant interest points. ICCV. Cited by: §3.1.
  • [29] K. Mikolajczyk and C. Schmid (2005) A performance evaluation of local descriptors. Cited by: §5.1.
  • [30] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas (2017) Working hard to know your neighbor’s margins: local descriptor learning loss. In Advances in Neural Information Processing Systems, pp. 4826–4837. Cited by: §1, §2, §3.2, §3.2.
  • [31] D. Mishkin, J. Matas, and M. Perdoch (2015) MODS: fast and robust method for two-view matching. Computer Vision and Image Understanding 141, pp. 81–93. Cited by: §3.2.
  • [32] D. Mishkin, F. Radenovic, and J. Matas (2018) Repeatability is not enough: learning affine regions via discriminability. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 284–300. Cited by: §3.2.
  • [33] Y. Ono, E. Trulls, P. Fua, and K. M. Yi (2018) LF-net: learning local features from images. In Advances in Neural Information Processing Systems, pp. 6234–6244. Cited by: §1, §2, §3.1, §3.2, §3.2, Table 3.
  • [34] M. Pultar, D. Mishkin, and J. Matas (2019) Leveraging outdoor webcams for local descriptor learning. arXiv preprint arXiv:1901.09780. Cited by: §2, §4.
  • [35] J. Revaud, P. Weinzaepfel, C. De Souza, N. Pion, G. Csurka, Y. Cabon, and M. Humenberger (2019) R2D2: repeatable and reliable detector and descriptor. arXiv preprint arXiv:1906.06195. Cited by: §1, §1, §2, §3.1, §3.2, §5.2, Table 3, Table 4.
  • [36] E. Rosten and T. Drummond (2006) Machine learning for high-speed corner detection. In European Conference on Computer Vision, pp. 430–443. Cited by: §2.
  • [37] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, et al. (2018) Benchmarking 6dof outdoor visual localization in changing conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8601–8610. Cited by: §5.4, Table 4.
  • [38] J. L. Schonberger and J. Frahm (2016) Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113. Cited by: §1, §5.3.
  • [39] J. L. Schonberger, H. Hardmeier, T. Sattler, and M. Pollefeys (2017) Comparative evaluation of hand-crafted and learned local features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1482–1491. Cited by: §5.3, Table 3.
  • [40] X. Shen, C. Wang, X. Li, Z. Yu, J. Li, C. Wen, M. Cheng, and Z. He (2019) RF-net: an end-to-end image matching network based on receptive field. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8132–8140. Cited by: §1, §2, §3.1, §3.2, §3.2, §3.2.
  • [41] M. Teichmann, A. Araujo, M. Zhu, and J. Sim (2019) Detect-to-retrieve: efficient regional aggregation for image search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5109–5118. Cited by: §1.
  • [42] Y. Tian, B. Fan, and F. Wu (2017) L2-net: deep learning of discriminative patch descriptor in euclidean space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 661–669. Cited by: §1, §2, §2, §3.1.
  • [43] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas (2019) SOSNet: second order similarity regularization for local descriptor learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11016–11025. Cited by: §1, §2.
  • [44] T. Tuytelaars and K. Mikolajczyk (2008) Local invariant feature detectors: a survey. Foundations and Trends in Computer Graphics and Vision. Cited by: §2, §3.2, §3.2.
  • [45] Y. Verdie, K. Yi, P. Fua, and V. Lepetit (2015) TILDE: a temporally invariant learned detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5279–5288. Cited by: §2, §3.2.
  • [46] D. E. Worrall and M. Welling (2019) Deep scale-spaces: equivariance over scale. arXiv preprint arXiv:1905.11697. Cited by: §3.1.
  • [47] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua (2016) Lift: learned invariant feature transform. In European Conference on Computer Vision, pp. 467–483. Cited by: §1, §2, §3.1, §3.2.
  • [48] Y. Jin, D. Mishkin, A. Mishchuk, J. Matas, P. Fua, K. M. Yi, and E. Trulls (2020) Image matching across wide baselines: from paper to practice. arXiv preprint arXiv:2003.01587. Cited by: §2, §5.3.
  • [49] X. Zhang, F. X. Yu, S. Karaman, and S. Chang (2017) Learning discriminative and transformation covariant local feature detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6818–6826. Cited by: §2.
  • [50] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets v2: more deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9308–9316. Cited by: §2.