HyNet: Local Descriptor with Hybrid Similarity Measure and Triplet Loss

06/17/2020 ∙ by Yurun Tian, et al. ∙ Facebook, Imperial College London

Recent works show that local descriptor learning benefits from the use of L2 normalisation; however, an in-depth analysis of this effect is lacking in the literature. In this paper, we investigate how L2 normalisation affects the back-propagated descriptor gradients during training. Based on our observations, we propose HyNet, a new local descriptor that leads to state-of-the-art results in matching. HyNet introduces a hybrid similarity measure for triplet margin loss, a regularisation term constraining the descriptor norm, and a new network architecture that performs L2 normalisation of all intermediate feature maps and the output descriptors. HyNet surpasses previous methods by a significant margin on standard benchmarks that include patch matching, verification, and retrieval, and it outperforms full end-to-end methods on 3D reconstruction tasks.




1 Introduction

Local feature detectors and descriptors play a key role in many computer vision tasks such as 3D reconstruction or visual localisation. Recently, joint detection and description Yi et al. (2016); Ono et al. (2018); DeTone et al. (2018); Dusmanu et al. (2019); Revaud et al. (2019); Liu et al. (2019b); Luo et al. (2020); Germain et al. (2020) has drawn significant attention. Despite the alluring idea of end-to-end detection and description, the classic two-stage strategy has withstood years of tests in many computer vision tasks and still gives competitive performance in standard benchmarks Brown et al. (2011); Balntas et al. (2017); Schönberger et al. (2017); Jin et al. (2020). Moreover, customised matchers Moo Yi et al. (2018); Rocco et al. (2018, 2020); Brachmann and Rother (2019); Sarlin et al. (2019) have also contributed to boosting the matching performance, where the time complexity is critical. Despite the progress in end-to-end methods, the two-stage process still deserves attention since it often leads to competitive results in the overall matching system Tian et al. (2019).

Deep descriptors Simo-Serra et al. (2015); Balntas et al. (2016); Tian et al. (2017); Mishchuk et al. (2017); Keller et al. (2018); He et al. (2018); Tian et al. (2019); Zhang and Rusinkiewicz (2019b) have shown superiority over hand-crafted ones Lowe (2004); Wang et al. (2011) in different tasks Balntas et al. (2017); Jin et al. (2020); Brown et al. (2011); Schönberger et al. (2017). Current works mainly focus on improving the loss function or the sampling strategy. L2-Net Tian et al. (2017) introduces a progressive batch sampling with an N-Pair loss. HardNet Mishchuk et al. (2017) uses a simple yet effective hard negative mining strategy, justifying the importance of the sampling. Other than contrastive or triplet loss, DOAP He et al. (2018) employs a retrieval based ranking loss. GeoDesc Luo et al. (2018) integrates geometry constraints from multi-view reconstructions to benefit the training. Besides the first-order optimisation, SOSNet Tian et al. (2019) shows that second-order constraints further improve the descriptors.

It has been widely observed that L2 normalisation of the descriptors leads to consistent improvements. Methods such as Tian et al. (2017); Mishchuk et al. (2017); He et al. (2018); Ebel et al. (2019); Tian et al. (2019); Zhang et al. (2019); Zhang and Rusinkiewicz (2019b), which L2 normalise their descriptors, significantly outperform early unnormalised descriptors Simo-Serra et al. (2015); Balntas et al. (2016). Moreover, even hand-crafted descriptors can be improved with L2 normalisation Balntas et al. (2017). All such observations indicate that descriptors are better distinguished by their vector directions rather than their magnitudes (L2 norms), and similar conclusions can also be found in other feature embedding tasks Wang et al. (2018); Deng et al. (2019); Liu et al. (2019a).

We therefore analyse the impact of L2 normalisation on learning from the gradients perspective. Since the gradients for each layer are generated via the chain rule Goodfellow et al. (2016), we analyse them at the beginning of the chain, where they are generated by the given similarity measure. Our intuition is that the gradient direction should benefit the optimisation of descriptor directions, while the gradient magnitude should be adaptive to the level of hardness of the training samples. Consequently, HyNet is introduced to make better use of the gradient signals in terms of direction and magnitude.

Despite the evolving design of loss functions, triplet loss is still employed in state-of-the-art local descriptors Mishchuk et al. (2017); Tian et al. (2019). Furthermore, triplet loss has also earned noticeable popularity in various embedding tasks, e.g., face recognition Schroff et al. (2015); Parkhi et al. (2015) and person re-identification Cheng et al. (2016); Hermans et al. (2017). An interesting observation in Musgrave et al. (2020) indicates that the improvements over the classic contrastive and triplet loss are marginal. In this work, we further show that a state-of-the-art local descriptor can be learned by triplet loss with a better designed similarity measure.

Specifically, we propose: 1) a hybrid similarity measure that can balance the gradient contributions from positive and negative samples, 2) a regularisation term which provides suitable constraints on descriptor norms, and 3) a new network architecture that is able to L2 normalise the intermediate feature maps.

2 Gradient Analysis

In this section, we explore how the widely used inner product and L2 distance provide gradients for training normalised and unnormalised descriptors.

2.1 Preliminaries

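We denote $L(s(\mathbf{x}, \mathbf{y}))$ as the loss for a descriptor pair $(\mathbf{x}, \mathbf{y})$, where $s(\cdot, \cdot)$ is a similarity measure. Whether $(\mathbf{x}, \mathbf{y})$ are positive (matching) or negative (non-matching), the gradients with respect to the descriptors are calculated as:

$$\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial s}\,\frac{\partial s}{\partial \mathbf{x}}, \qquad \frac{\partial L}{\partial \mathbf{y}} = \frac{\partial L}{\partial s}\,\frac{\partial s}{\partial \mathbf{y}}, \qquad (1)$$

where the arguments $(\mathbf{x}, \mathbf{y})$ are omitted for clarity. Importantly, the gradients for learnable weights of a network are derived from Eqn. (1) at the beginning of the chain, and play a key role during training. Note that $\partial L / \partial s$ is a scalar, while the direction of the gradient is determined by the partial derivatives of $s$. We consider the two most commonly used similarity measures, namely inner product and L2 distance, for descriptors with and without L2 normalisation:

$$s_I = \mathbf{x}^\top \mathbf{y}, \qquad s_{\hat I} = \frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}, \qquad s_L = \|\mathbf{x} - \mathbf{y}\|, \qquad s_{\hat L} = \left\| \frac{\mathbf{x}}{\|\mathbf{x}\|} - \frac{\mathbf{y}}{\|\mathbf{y}\|} \right\|, \qquad (2)$$

where $\|\cdot\|$ denotes the L2 norm ($\|\mathbf{x}\| = \sqrt{\mathbf{x}^\top \mathbf{x}}$). $s_{\hat I}$ and $s_{\hat L}$ are similarity measures for normalised descriptors while $s_I$ and $s_L$ are for the unnormalised ones. (To ensure consistency, we refer to the L2 distance also as a similarity measure even though it measures the inverse similarity.) Note that we consider L2 normalisation as a part of the similarity measure. We then obtain the partial derivatives:

$$\frac{\partial s_I}{\partial \mathbf{x}} = \mathbf{y}, \qquad \frac{\partial s_{\hat I}}{\partial \mathbf{x}} = \frac{1}{\|\mathbf{x}\|}\left( \frac{\mathbf{y}}{\|\mathbf{y}\|} - \frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\|^2\,\|\mathbf{y}\|}\,\mathbf{x} \right),$$

$$\frac{\partial s_L}{\partial \mathbf{x}} = \frac{\mathbf{x} - \mathbf{y}}{\|\mathbf{x} - \mathbf{y}\|}, \qquad \frac{\partial s_{\hat L}}{\partial \mathbf{x}} = \frac{1}{s_{\hat L}\,\|\mathbf{x}\|}\left( \frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\|^2\,\|\mathbf{y}\|}\,\mathbf{x} - \frac{\mathbf{y}}{\|\mathbf{y}\|} \right), \qquad (3)$$

with the derivatives with respect to $\mathbf{y}$ obtained by exchanging $\mathbf{x}$ and $\mathbf{y}$.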
In the following sections we analyse the above gradients in terms of directions and magnitudes.
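As a sanity check on the preliminaries, the following minimal Python sketch implements the inner product and the L2 distance, with and without L2 normalisation, together with their partial derivatives with respect to the first descriptor, and compares the analytic gradients against central finite differences. The function names and the example vectors are illustrative assumptions and are not taken from the original implementation.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def unit(a):
    n = norm(a)
    return [p / n for p in a]

# The four similarity measures: inner product and L2 distance,
# each without and with L2 normalisation of the descriptors.
def s_inner(x, y):
    return dot(x, y)

def s_inner_norm(x, y):
    return dot(unit(x), unit(y))

def s_l2(x, y):
    return norm([p - q for p, q in zip(x, y)])

def s_l2_norm(x, y):
    return s_l2(unit(x), unit(y))

# Analytic partial derivatives with respect to x.
def grad_inner(x, y):
    return list(y)

def grad_inner_norm(x, y):
    xh, yh = unit(x), unit(y)
    c = dot(xh, yh)
    return [(q - c * p) / norm(x) for p, q in zip(xh, yh)]

def grad_l2(x, y):
    d = s_l2(x, y)
    return [(p - q) / d for p, q in zip(x, y)]

def grad_l2_norm(x, y):
    d = s_l2_norm(x, y)
    return [-g / d for g in grad_inner_norm(x, y)]

def numeric_grad(s, x, y, eps=1e-6):
    """Central finite differences of s(x, y) with respect to x."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((s(hi, y) - s(lo, y)) / (2 * eps))
    return grad

x, y = [3.0, 1.0, -2.0], [1.0, 2.0, 0.5]
pairs = [(s_inner, grad_inner), (s_inner_norm, grad_inner_norm),
         (s_l2, grad_l2), (s_l2_norm, grad_l2_norm)]
max_err = max(abs(a - n)
              for s, g in pairs
              for a, n in zip(g(x, y), numeric_grad(s, x, y)))
print("largest analytic vs. numeric difference:", max_err)
```

The agreement between the two gradients only confirms the calculus; it says nothing about which measure trains better, which is the subject of the following sections.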

2.2 Gradient Direction

Figure 1: Gradient descent directions derived in Eqn. 3 for positive and negative pairs. $\theta$ is the angle between the descriptors. Black arrows: descriptors before L2 normalisation. Red arrow: gradient descent direction. Green arrow: parallel component. Blue arrows: orthogonal component. The vector lengths are irrelevant in this figure. Better viewed in colour.

Optimal gradient direction is the key for convergence, i.e., a learning process will not converge given incorrectly directed gradients, regardless of the learning rate. We denote $\Delta = \Delta_\parallel + \Delta_\perp$, where $\Delta$ is the gradient direction, and $\Delta_\parallel$ and $\Delta_\perp$ are the components parallel and orthogonal to the descriptor, respectively.

(a) Distribution of $\theta$.
(b) $g_I(\theta)$ and $g_L(\theta)$.
Figure 2: Gradient magnitude and distribution of $\theta$. Note that from Eqn. 4 the gradients are periodic functions with a period of $2\pi$.

According to Eqn. (3), for the normalised inner product $s_{\hat I}$ we obtain $\mathbf{x}^\top \frac{\partial s_{\hat I}}{\partial \mathbf{x}} = 0$, and similarly for its gradient with respect to $\mathbf{y}$ and for both gradients of the normalised L2 distance $s_{\hat L}$, i.e., gradients are always orthogonal to the descriptors, indicating that L2 normalised descriptors only have the orthogonal component $\Delta_\perp$. Meanwhile, unnormalised descriptors have both components non-zero. For better understanding, we illustrate 2D descriptors and the corresponding gradient descent directions (negative gradient direction) in Fig. 1, where $\theta$ is the angle between descriptors. Specifically, the parallel component $\Delta_\parallel$ modifies the descriptor magnitude (L2 norm), while $\Delta_\perp$ updates the descriptor direction. However, since descriptor magnitudes can be harmful for matching (see Sec. 1), the training should focus on the optimisation of the descriptor directions, which can be achieved with L2 normalised descriptors. An interesting question is whether it is possible to make better use of $\Delta_\parallel$. We address this problem in Sec. 3.1 and show that detailed analysis leads to training constraints that improve the performance.
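The decomposition into parallel and orthogonal components can be illustrated with a short Python sketch. It assumes 2D descriptors as in Fig. 1 and uses the standard gradients of the raw inner product and of the cosine similarity; the helper `split` and the example vectors are hypothetical names and values chosen for illustration.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def split(grad, x):
    """Split grad into components parallel and orthogonal to descriptor x."""
    k = dot(grad, x) / dot(x, x)
    parallel = [k * p for p in x]
    orthogonal = [g - p for g, p in zip(grad, parallel)]
    return parallel, orthogonal

x, y = [2.0, 0.5], [0.5, 1.5]

# Unnormalised inner product: the gradient with respect to x is simply y.
par_raw, orth_raw = split(y, x)

# Normalised inner product (cosine similarity): gradient with respect to x.
cos_xy = dot(x, y) / (norm(x) * norm(y))
grad_cos = [(q / norm(y) - cos_xy * p / norm(x)) / norm(x) for p, q in zip(x, y)]
par_cos, orth_cos = split(grad_cos, x)

print("raw inner product  |parallel|, |orthogonal|:", norm(par_raw), norm(orth_raw))
print("cosine similarity  |parallel|, |orthogonal|:", norm(par_cos), norm(orth_cos))
```

In this example the parallel component of the cosine-similarity gradient vanishes up to floating-point rounding, whereas the raw inner product yields a gradient that also changes the descriptor norm.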

2.3 Gradient Magnitude

The training gradients should have not only the optimal directions but also properly scaled magnitudes. The magnitude should be adapted to the level of 'hardness' of the training samples, i.e., hard samples should receive a stronger update than easy ones.

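We focus on L2 normalised descriptors whose gradients have optimal directions. We denote $\mathbf{x}$ and $\mathbf{y}$ as two descriptors normalised with L2. With a slight abuse of notation, we use $s(\theta)$ and $g(\theta)$ to represent the similarity measure and gradient magnitude, respectively, with angle $\theta$ between $\mathbf{x}$ and $\mathbf{y}$:

$$s_I(\theta) = \mathbf{x}^\top \mathbf{y} = \cos\theta, \qquad s_L(\theta) = \|\mathbf{x} - \mathbf{y}\| = \sqrt{2 - 2\cos\theta},$$

$$g_I(\theta) = \left| \frac{\partial s_I}{\partial \theta} \right| = |\sin\theta|, \qquad g_L(\theta) = \left| \frac{\partial s_L}{\partial \theta} \right| = \frac{|\sin\theta|}{\sqrt{2 - 2\cos\theta}}, \qquad (4)$$

where $\theta = \arccos(\mathbf{x}^\top \mathbf{y}) \in [0, \pi]$, and $|\cdot|$ denotes the absolute value.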

We analyse the similarities and gradient magnitudes from Eqn. (4) in the real descriptor space during training. Fig. 2(a) shows the distribution of the angle $\theta$ from 512K descriptor pairs, where the number of positive and negative pairs is 50% each. Specifically, following the hard negative mining strategy of Mishchuk et al. (2017), we sample 512 triplets (one positive pair and one negative) from each of the 1K randomly constructed batches of size 1024. Fig. 2(a) shows the $\theta$ distribution of HardNet and SOSNet in training, i.e., both models are trained and tested on Liberty. As shown, almost all hard negatives and positives have $\theta$ in the range $[0, \pi/2]$. It is worth noting that easy negatives may have $\theta > \pi/2$; however, sampling hard negatives only has been proven to be effective Mishchuk et al. (2017). Similarly, we observe how the gradient magnitudes of the inner product, $g_I(\theta)$, and of the L2 distance, $g_L(\theta)$, behave in the range $[0, \pi/2]$, which is highlighted in Fig. 2(b). The gradients differ, i.e., $g_I(\theta)$ is monotonically increasing while $g_L(\theta)$ is decreasing. This indicates that the inner product $s_I$ is more beneficial for the optimisation of positives, since hard positives (large $\theta$) generate large gradients compared to easy positives (small $\theta$). In contrast, the L2 distance $s_L$ favours negatives, as hard negatives (small $\theta$) generate large updates compared to the easy negatives (large $\theta$). These observations lead to the conclusion that neither the inner product nor the L2 distance on its own can balance the optimisation with positives and negatives.
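The sampling step mentioned above can be sketched as follows. This is a minimal illustration that assumes the hard negative mining selects, for every matching pair in a batch, the closest non-matching descriptor; the function names, the Euclidean distance on already normalised toy descriptors, and the tiny batch are assumptions made for the example rather than details confirmed by the text.

```python
import math

def l2(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def hardest_negative_distances(anchors, positives):
    """For each matching pair i, distance to its closest non-matching descriptor."""
    n = len(anchors)
    result = []
    for i in range(n):
        candidates = []
        for j in range(n):
            if j == i:
                continue
            candidates.append(l2(anchors[i], positives[j]))  # anchor i vs. other positives
            candidates.append(l2(anchors[j], positives[i]))  # positive i vs. other anchors
        result.append(min(candidates))
    return result

# Toy batch of three matching pairs (anchor[i] matches positive[i]).
anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
positives = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.2]]

pos_dist = [l2(a, p) for a, p in zip(anchors, positives)]
neg_dist = hardest_negative_distances(anchors, positives)
for i, (dp, dn) in enumerate(zip(pos_dist, neg_dist)):
    print(f"pair {i}: positive distance {dp:.3f}, hardest negative distance {dn:.3f}")
```

Each (anchor, positive, hardest negative) combination forms one triplet, so a batch of matching pairs yields as many triplets as it has pairs.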

It is also worth noting that according to Eqn. (1), the overall gradient magnitude is further weighted by $\partial L / \partial s$, which means a better form of the loss $L$ may alleviate the inherent flaws of $g_I$ and $g_L$. Consequently, in Sec. 3.2 we show that a carefully designed similarity measure leads to state-of-the-art performance with the standard triplet loss.
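The opposite trends of the two gradient magnitudes can be reproduced numerically. The sketch below assumes unit-norm descriptors, so that the inner product equals $\cos\theta$ and the L2 distance equals $\sqrt{2 - 2\cos\theta}$; it uses the identity $|\sin\theta| / \sqrt{2 - 2\cos\theta} = |\cos(\theta/2)|$ to avoid a division by zero at $\theta = 0$. Function names are illustrative.

```python
import math

def s_inner(theta):
    """Inner product of two unit vectors separated by angle theta."""
    return math.cos(theta)

def s_l2(theta):
    """L2 distance between two unit vectors separated by angle theta."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * math.cos(theta)))

def g_inner(theta):
    """Gradient magnitude of the inner product: |d cos(theta) / d theta|."""
    return abs(math.sin(theta))

def g_l2(theta):
    """Gradient magnitude of the L2 distance, written as |cos(theta / 2)|."""
    return abs(math.cos(theta / 2.0))

# Sample the range [0, pi/2], where most training pairs are reported to fall.
thetas = [i * (math.pi / 2) / 10 for i in range(11)]
for t in thetas:
    print(f"theta={t:.2f}  g_inner={g_inner(t):.3f}  g_l2={g_l2(t):.3f}")
```

On this range the inner-product magnitude grows with the angle while the L2 magnitude shrinks, which is the imbalance between positives and negatives discussed above.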

3 Method

Building upon the analysis from the previous section, we propose to improve the descriptor learning by 1) introducing a regularisation term that provides a beneficial parallel gradient component $\Delta_\parallel$, 2) a hybrid similarity measure that can strike a balance between the contribution of positives and negatives to the gradient update, 3) a new network architecture that normalises the intermediate feature maps with affine L2 such that they are optimised in their directions rather than the magnitudes.

3.1 L2 Norm Regularisation

Section 2.2 shows that L2 normalisation excludes the parallel gradients $\Delta_\parallel$, i.e., there are no constraints on the descriptor norms, which can vary with the scaling of image intensities. Intuitively, a possible way of making positive contributions from $\Delta_\parallel$ to the optimisation is to introduce the following constraint before the L2 normalisation:

$$R_{L_2} = \frac{1}{N} \sum_{i=1}^{N} \left( \|\mathbf{x}_i\| - \|\mathbf{x}_i^{+}\| \right)^2, \qquad (5)$$

where $\mathbf{x}_i$ and $\mathbf{x}_i^{+}$ are a positive pair of descriptors before L2 normalisation and $N$ is the number of pairs. As a regularisation term, $R_{L_2}$ drives the network to be robust to image intensity changes, e.g., caused by different illuminations.
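A minimal Python sketch of such a norm constraint is given below. It assumes that the regulariser penalises the squared difference between the L2 norms of the two unnormalised descriptors of a positive pair, averaged over the batch; the function name and the toy descriptors are illustrative.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))

def norm_regulariser(anchors, positives):
    """Mean squared difference between the norms of matching raw descriptors."""
    diffs = [(l2_norm(a) - l2_norm(p)) ** 2 for a, p in zip(anchors, positives)]
    return sum(diffs) / len(diffs)

# Two positive pairs of raw (pre-normalisation) descriptors.
# Pair 0 has identical direction but twice the norm, e.g. a brighter patch.
anchors = [[3.0, 4.0], [1.0, 0.0]]
positives = [[6.0, 8.0], [0.0, 1.0]]

penalty = norm_regulariser(anchors, positives)
print(penalty)  # ((5 - 10)^2 + (1 - 1)^2) / 2 = 12.5
```

The first pair would be a perfect match after L2 normalisation, yet it is still penalised here, which is how the term can act on the descriptor norm that the normalised similarity measure ignores.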

3.2 Hybrid Similarity Measure and Triplet Loss

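The standard triplet loss is defined as:

$$L_{Triplet} = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\; m + s(\theta_i^{+}) - s(\theta_i^{-})\right), \qquad (6)$$

where $m$ is the margin, $N$ is the number of triplets, and $s$ is a distance-like measure that is small for similar descriptors. $\theta_i^{+}$ and $\theta_i^{-}$ are the angles for the positive and negative pairs of the $i$-th descriptor triplet, i.e., the angles between the anchor descriptor and its positive and negative samples.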

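Remarkable improvements have been made by modifying the standard triplet loss Mishchuk et al. (2017); Yu et al. (2018); Zhang and Rusinkiewicz (2019a); Ge (2018). From the gradient perspective, when the margin constraint in Eqn. (6) is not satisfied, we obtain $|\partial L / \partial s| = 1$ (up to the averaging constant), otherwise $\partial L / \partial s = 0$. Hence, according to Eqn. (1), the update from a violated triplet is governed by $\partial s / \partial \theta$, which is directly related to the gradient magnitude. As discussed in Sec. 2.3, the inner product $s_I$ and the L2 distance $s_L$ lead to significantly different updates from the positive and negative examples. Intuitively, a direct solution would be to use $s_I$ for positives and $s_L$ for negatives; however, as we show in Sec. 5, this strategy is not optimal. Instead, we propose a hybrid similarity measure that combines the inner product and the L2 distance:

$$s_H(\theta) = \frac{1}{Z} \left[ \alpha \left(1 - s_I(\theta)\right) + s_L(\theta) \right], \qquad (7)$$

where $s_I(\theta) = \cos\theta$ and $s_L(\theta) = \sqrt{2 - 2\cos\theta}$ for unit-norm descriptors, $\alpha$ is a scalar ranging from $0$ to $+\infty$, and $Z$ is the normalising factor ensuring the gradient has the maximum magnitude of $1$.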

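Finally, our overall loss function is defined as:

$$L_{HyNet} = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\; m + s_H(\theta_i^{+}) - s_H(\theta_i^{-})\right) + \gamma R_{L_2}, \qquad (8)$$

with $\gamma$ as a regularisation parameter weighting the L2 norm regularisation term $R_{L_2}$, and $\alpha$ balancing the contributions from the inner product $s_I$ and the L2 distance $s_L$ within the hybrid measure $s_H$. The optimal $\alpha$ and margin $m$ can be found by a grid search, which is discussed in Sec. 5.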
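To make the combination concrete, the sketch below implements one plausible form of a hybrid measure on unit-norm descriptors, namely a weighted sum of $1 - \cos\theta$ and the L2 distance $\sqrt{2 - 2\cos\theta}$, scaled so that its largest gradient magnitude on $[0, \pi]$ is 1, and plugs it into a triplet margin loss. The exact functional form, the numerical search for the scaling constant, and the values `alpha = 2.0` and `margin = 1.0` are assumptions chosen for illustration, not the authors' settings; the norm regulariser is omitted.

```python
import math

def s_hybrid(theta, alpha, z):
    """Hybrid measure: alpha * (1 - cos) + L2 distance, scaled by 1 / z."""
    s_inner = math.cos(theta)
    s_l2 = math.sqrt(max(0.0, 2.0 - 2.0 * s_inner))
    return (alpha * (1.0 - s_inner) + s_l2) / z

def g_hybrid(theta, alpha, z=1.0):
    """Gradient magnitude of the hybrid measure w.r.t. theta on [0, pi]."""
    return (alpha * math.sin(theta) + math.cos(theta / 2.0)) / z

def normaliser(alpha, steps=10000):
    """Choose z so the largest gradient magnitude on [0, pi] equals 1 (grid search)."""
    return max(g_hybrid(i * math.pi / steps, alpha) for i in range(steps + 1))

def hybrid_triplet_loss(pos_angles, neg_angles, alpha, margin):
    """Mean triplet margin loss using the hybrid measure as a distance."""
    z = normaliser(alpha)
    terms = [max(0.0, margin + s_hybrid(tp, alpha, z) - s_hybrid(tn, alpha, z))
             for tp, tn in zip(pos_angles, neg_angles)]
    return sum(terms) / len(terms)

alpha, margin = 2.0, 1.0  # arbitrary illustrative values
easy = hybrid_triplet_loss([0.1], [1.5], alpha, margin)  # well separated triplet
hard = hybrid_triplet_loss([1.0], [0.5], alpha, margin)  # negative closer than positive
print("easy triplet loss:", easy)
print("hard triplet loss:", hard)
```

With `alpha = 0` the measure reduces to the scaled L2 distance, and for large `alpha` it is dominated by the inner-product term, so `alpha` interpolates between the two gradient behaviours analysed in Sec. 2.3.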

Figure 3: HyNet architecture. It consists of 7 convolutional layers, of which all but the last are followed by an FRN normalisation and a TLU non-linearity Singh and Krishnan (2019).

3.3 Network Architecture

Our intuition for designing the network architecture is based on the analysis in Sec. 2.2 that, similarly to the output descriptors, L2 normalisation needs to be applied to the intermediate feature maps. However, we found that additional affine scaling of normalised maps has a positive effect on the output descriptors. To this end, we apply the Filter Response Normalisation (FRN) Singh and Krishnan (2019), which has recently been proposed and has shown promising results in the classification task. Specifically, FRN normalises each layer of feature maps by:

$$\hat{\mathbf{f}}_i = \gamma_i \sqrt{N}\, \frac{\mathbf{f}_i}{\|\mathbf{f}_i\|} + \beta_i, \qquad (9)$$

where $\gamma_i$ and $\beta_i$ are learned parameters, $\mathbf{f}_i$ is the flattened feature map of the $i$-th channel and $N$ is the number of pixels. As argued in Singh and Krishnan (2019), the gradients w.r.t. $\mathbf{f}_i$ are always orthogonal to it; hence, as discussed in Sec. 2.2, the training can focus on optimising the directions of feature vectors.

Our HyNet architecture is based on L2-Net Tian et al. (2017), which consists of seven convolutional layers and outputs 128-dimensional descriptors. As shown in Fig. 3, all Batch Normalisation (BN) Ioffe and Szegedy (2015) layers, except the last one before the final L2 normalisation in the original L2-Net, are replaced with FRN layers. Moreover, as recommended in Singh and Krishnan (2019), each FRN is followed by the Thresholded Linear Unit (TLU) instead of the conventional ReLU. Thus, HyNet has the same number of convolutional weights as HardNet Mishchuk et al. (2017) and SOSNet Tian et al. (2019).
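The normalisation and activation used in each block can be sketched for a single channel as follows. This is a plain-Python illustration of FRN followed by TLU as described by Singh and Krishnan (2019): the channel is divided by the root of its mean squared activation, an affine transform is applied, and the result is thresholded from below. The small constant `eps`, the parameter values, and the function names are assumptions for this example; a real implementation would operate on batched tensors with learned per-channel parameters.

```python
import math

def frn(channel, gamma, beta, eps=1e-6):
    """Filter Response Normalisation of one flattened channel (list of floats)."""
    nu2 = sum(v * v for v in channel) / len(channel)  # mean squared activation
    scale = gamma / math.sqrt(nu2 + eps)
    return [scale * v + beta for v in channel]

def tlu(values, tau):
    """Thresholded Linear Unit: element-wise max(v, tau), tau being learnable."""
    return [max(v, tau) for v in values]

channel = [1.0, -2.0, 3.0, -4.0]
normalised = frn(channel, gamma=1.0, beta=0.0)
rescaled = frn([10.0 * v for v in channel], gamma=1.0, beta=0.0)
activated = tlu(normalised, tau=-0.5)

print("normalised:", normalised)
print("after 10x input scaling:", rescaled)
print("after TLU:", activated)
```

Scaling the input by a constant leaves the normalised output essentially unchanged, which is the sense in which only the direction of the flattened feature vector is retained before the affine transform.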

4 Experiment

Our novel architecture and training are implemented in PyTorch Paszke et al. (2017). The network is trained for 200 epochs with a batch size of 1024 and the Adam optimizer
Kingma and Ba (2014). Training starts from scratch, and the threshold in TLU for each layer is initialised with . We set and . In the following experiments, HyNet is compared with recent deep local descriptors Balntas et al. (2016); Tian et al. (2017); Mishchuk et al. (2017); Tian et al. (2019) as well as end-to-end methods DeTone et al. (2018); Dusmanu et al. (2019); Revaud et al. (2019) on three standard benchmarks Brown et al. (2011); Balntas et al. (2017); Schönberger et al. (2017).

4.1 UBC

The UBC dataset Brown et al. (2011) consists of three subsets (scenes), namely Liberty, Notredame and Yosemite. The benchmark is focused on the patch pair verification task, i.e., whether the match is positive or negative. Following the evaluation protocol of Brown et al. (2011), models are trained on one subset and tested on the other two. In Table 1, we report the standard measure of false positive rate at 95% recall (FPR@95) Brown et al. (2011) on six train and test splits. We can observe that, while the performance is nearly saturated, HyNet still shows remarkable improvements over previous methods.
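The FPR@95 measure can be sketched as below: pairs are sorted by descriptor distance, the threshold is raised until 95% of the matching pairs are accepted, and the fraction of non-matching pairs accepted at that threshold is reported. The function name, the handling of ties, and the toy data are assumptions for illustration; the benchmark's reference evaluation code may differ in such details.

```python
import math

def fpr_at_recall(labels, distances, recall=0.95):
    """False positive rate at the distance threshold that reaches the given recall.

    labels: 1 for matching pairs, 0 for non-matching pairs.
    distances: descriptor distance of each pair (smaller means more similar).
    """
    ranked = sorted(zip(distances, labels))       # ascending distance
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    needed = math.ceil(recall * n_pos - 1e-9)     # positives that must be accepted
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        if tp >= needed:
            break
    return fp / n_neg

# Toy example: four matching and four non-matching pairs.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
distances = [0.1, 0.2, 0.4, 0.9, 0.3, 0.5, 0.6, 0.8]

print(fpr_at_recall(labels, distances, recall=0.75))  # 0.25
print(fpr_at_recall(labels, distances, recall=0.95))  # 1.0
```

In the toy data a single distant matching pair forces every non-matching pair to be accepted at 95% recall, which illustrates why the measure is sensitive to the hardest positives.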



Train                          ND      YOS     LIB     YOS     LIB     ND
Test                           LIB     LIB     ND      ND      YOS     YOS     Mean
SIFT Lowe (2004)               29.84 (LIB)     22.53 (ND)      27.29 (YOS)     26.55
TFeat Balntas et al. (2016)    7.39    10.13   3.06    3.80    8.06    7.24    6.64
L2-Net Tian et al. (2017)      2.36    4.70    0.72    1.29    2.57    1.71    2.23
HardNet Mishchuk et al. (2017) 1.49    2.51    0.53    0.78    1.96    1.84    1.51
DOAP He et al. (2018)          1.54    2.62    0.43    0.87    2.00    1.21    1.45
SOSNet Tian et al. (2019)      1.08    2.12    0.35    0.67    1.03    0.95    1.03
HyNet                          0.89    1.37    0.34    0.61    0.88    0.96    0.84

Table 1: Patch verification performance on the UBC phototour dataset. Numbers denote false positive rates at 95% recall (FPR@95). ND: Notredame, LIB: Liberty, YOS: Yosemite.

4.2 HPatches

The HPatches dataset Balntas et al. (2017) evaluates three tasks: patch verification, patch retrieval, and image matching, for viewpoint and illumination changes between local patches. Based on different levels of geometric noise, the results are divided into 3 groups: easy, hard, and tough. We show the results in Fig. 4, where all models are trained on Liberty, which is the protocol proposed in Balntas et al. (2017). HyNet improves the MAP over the previous state-of-the-art SOSNet Tian et al. (2019) by a large margin, i.e., 0.89%, 2.35%, and 1.75% for the three tasks. Note that the improvement of SOSNet over its predecessor HardNet Mishchuk et al. (2017) was 0.03%, 0.96%, and 1.14% at the time of its publication.
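For readers unfamiliar with the MAP figures quoted above, the following sketch computes average precision for a ranked list and the mean over several queries. It shows the generic definition only; the function names and toy rankings are illustrative, and the official HPatches evaluation code may differ in how rankings are built for each task.

```python
def average_precision(ranked_labels):
    """AP of one ranked list; ranked_labels[k] is 1 if the k-th ranked item is correct."""
    hits = 0
    precisions = []
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)   # precision at each correct item
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two toy queries: items are already sorted by descriptor distance.
rankings = [[1, 0, 1, 0], [0, 1]]
print(average_precision(rankings[0]))    # (1/1 + 2/3) / 2
print(mean_average_precision(rankings))
```

A descriptor that ranks all correct items first obtains an AP of 1 for that query, so small differences in MAP between strong descriptors reflect changes in the ordering of a few hard cases.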

Figure 4: Results on test set ‘a’ of HPatches Balntas et al. (2017). HyNet outperforms the state-of-the-art SOSNet Tian et al. (2019) and other local image descriptors in all metrics on this benchmark.

4.3 ETH

The ETH SfM benchmark Schönberger et al. (2017) evaluates local descriptors in the task of Structure from Motion (SfM) for outdoor scenes. To quantify the SfM quality, in Table 2, we follow the protocol from Schönberger et al. (2017) and report the number of registered images, reconstructed sparse and dense points, mean track length, and mean reprojection error. First, we compare HyNet with HardNet Mishchuk et al. (2017) and SOSNet Tian et al. (2019) by using the same local patches extracted from the DoG detector, which is presented above the dashed lines. Since the detector is fixed, the results reflect the performance of the descriptors. To ensure a fair comparison, HardNet, SOSNet, and HyNet are all trained on Liberty from the UBC dataset Brown et al. (2011). In this benchmark, HyNet exhibits significant superiority by registering more images for large scenes and reconstructing more sparse points, while the results for the other metrics are on par with top performing descriptors. Next, we compare HyNet to the recent end-to-end methods, namely SuperPoint DeTone et al. (2018), D2-Net Dusmanu et al. (2019) and R2D2 Revaud et al. (2019). DoG+HyNet shows significantly better performance on larger scenes, for example, Madrid Metropolis and Gendarmenmarkt, where it gives over 50% more reconstructed sparse points in 3D. Note that in the SfM task, the number of registered images and reconstructed points is crucial for the quality of 3D models. Moreover, results also show that HyNet generalises well to different patches provided by the state-of-the-art detector Key.Net Barroso-Laguna et al. (2019), where the average track length is increased for a number of scenes.

Method                   #Reg.     #Sparse   #Dense    Track     Reproj.
                         Images    Points    Points    Length    Error

Herzjesu (8 images)
SIFT (11.3K)             8         7.5K      241K      4.22      0.43px
DoG+HardNet              8         8.7K      239K      4.30      0.50px
DoG+SOSNet               8         8.7K      239K      4.31      0.50px
DoG+HyNet                8         8.9K      246K      4.32      0.52px
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SuperPoint (6.1K)        8         5K        244K      4.47      0.79px
D2-Net (13.1K)           8         13K       221K      2.87      1.37px
R2D2 (12.1K)             8         10K       244K      4.48      1.04px
Key.Net+HyNet (11.9K)    8         9.4K      246K      5.24      0.69px

Fountain (11 images)
SIFT (11.8K)             11        14.7K     292K      4.79      0.39px
DoG+HardNet              11        16.3K     303K      4.91      0.47px
DoG+SOSNet               11        16.3K     306K      4.92      0.46px
DoG+HyNet                11        16.5K     303K      4.93      0.48px
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SuperPoint (5.5K)        11        7K        304K      4.93      0.81px
D2-Net (12.5K)           11        19K       301K      3.03      1.40px
R2D2 (12.6K)             11        13.4K     308K      5.02      1.47px
Key.Net+HyNet (11.9K)    11        12.0K     307K      7.81      0.69px

South Building (128 images)
SIFT (13.3K)             128       108K      2.14M     6.04      0.54px
DoG+HardNet              128       159K      2.12M     5.18      0.62px
DoG+SOSNet               128       160K      2.12M     5.17      0.63px
DoG+HyNet                128       166K      2.12M     5.14      0.64px
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SuperPoint (10.6K)       128       125K      2.13M     7.10      0.83px
D2-Net (12.4K)           128       178K      2.06M     3.11      1.36px
R2D2 (13.2K)             128       136K      3.31M     5.60      1.43px
Key.Net+HyNet (12.9K)    128       100K      2.11M     12.03     0.74px

Madrid Metropolis (1344 images)
SIFT (7.4K)              500       116K      1.82M     6.32      0.60px
DoG+HardNet              697       261K      1.27M     4.16      0.98px
DoG+SOSNet               675       240K      1.27M     4.40      0.94px
DoG+HyNet                697       337K      1.25M     3.93      0.98px
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SuperPoint (2.1K)        702       125K      1.14M     4.43      1.05px
D2-Net (7.74K)           787       229K      0.96M     5.50      1.27px
R2D2 (12.9K)             790       158K      1.15M     7.26      1.20px
Key.Net+HyNet (9.3K)     897       386K      1.62M     5.87      1.05px

Gendarmenmarkt (1463 images)
SIFT (8.5K)              1035      338K      4.22M     5.52      0.69px
DoG+HardNet              1018      827K      2.06M     2.56      1.09px
DoG+SOSNet               1129      729K      3.05M     3.85      0.95px
DoG+HyNet                1181      927K      2.93M     3.49      1.05px
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SuperPoint (2.3K)        1112      236K      2.49M     4.74      1.10px
D2-Net (8.0K)            1225      541K      2.60M     5.21      1.30px
R2D2 (13.3K)             1226      529K      3.80M     6.38      1.21px
Key.Net+HyNet (10.6K)    1259      897K      3.58M     5.79      1.13px

Table 2: Evaluation results on ETH dataset Schönberger et al. (2017) for SfM. The improvement is in the number of registered images and sparse points, for large scenes in particular.

5 Discussion

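Choice               Setting   Other components          MAP
$R_{L_2}$            ✗         FRN, $s_H$                53.58
                     ✓         FRN, $s_H$                53.97
Similarity measure   $s_I$     FRN, $R_{L_2}$ ✓          52.10
                     $s_L$     FRN, $R_{L_2}$ ✓          53.19
                     $s_H$     FRN, $R_{L_2}$ ✓          53.97
Norm type            BN        $s_H$, $R_{L_2}$ ✓        52.04
                     IN        $s_H$, $R_{L_2}$ ✓        52.47
                     FRN       $s_H$, $R_{L_2}$ ✓        53.97

Table 3: Ablation of HyNet's components.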

In this section, we first investigate how each building block of HyNet contributes to the overall performance, then observe the impact of hyper-parameters, and finally, we show the advantage of the proposed hybrid similarity measure over other possible solutions.

Ablation Study is presented in Table 3, which shows how the L2 norm regularisation term $R_{L_2}$, the similarity measure and the feature map normalisation affect the performance. Specifically, we train different models on Liberty Brown et al. (2011) and report average MAP on the HPatches Balntas et al. (2017) matching task. First, we can see that $R_{L_2}$ helps to boost the performance, justifying our intuition that it optimises the network to be robust to illumination changes. Next, we experiment with different similarities for Eqn. (8), where the best results (through grid search for optimal margin) for each similarity are reported.

As shown, the hybrid similarity $s_H$ improves over the inner product $s_I$ and the L2 distance $s_L$ by 1.87% and 0.78% respectively, indicating its effectiveness in balancing the gradient magnitude obtained from the positive and negative samples. Finally, Filter Response Normalisation (FRN) Singh and Krishnan (2019) is compared to Batch Normalisation (BN) Ioffe and Szegedy (2015) and Instance Normalisation (IN) Ulyanov et al. (2016), where the network with BN is commonly used by previous methods Tian et al. (2017); Mishchuk et al. (2017); Tian et al. (2019); He et al. (2018). FRN surpasses BN and IN by at least 1.5%, which demonstrates the effectiveness of the affine L2 normalisation of the intermediate feature maps. Above all, by integrating $R_{L_2}$, $s_H$ and FRN together, we achieve the best result.

Effect of $\alpha$ and $m$ is investigated with grid search and reported in Fig. 5(a), from which the best-performing pair of values is selected. Furthermore, we denote $g_H(\theta)$ as the gradient magnitude for the hybrid similarity $s_H$, and show its curve in Fig. 5(b). As seen, the curve of $g_H$ lies in between those of the inner product, $g_I$, and the L2 distance, $g_L$, balancing the contributions from positives and negatives.

Other possible solutions for using different metrics for the positives and negatives include:


Specifically, the first variant uses the inner product $s_I$ for positives and the L2 distance $s_L$ for negatives, which is intuitively the most direct approach for adaptive gradient magnitude. Meanwhile, the second variant stacks two triplet losses, each with its own margin. We conduct a grid search over the margins of both variants. Following Tian et al. (2019), we compare their training curves with our HyNet loss in Fig. 5(c), where networks are trained on Liberty and FPR@95 values are averaged over Notredame and Yosemite. As shown, our HyNet loss using the hybrid similarity surpasses the other two solutions. It is worth noting that the direct combination in the first variant does not show an advantage. We believe that the triplet loss with a linear margin does not fit well the nonlinear transformation between $s_I$ and $s_L$, i.e., $s_L = \sqrt{2 - 2 s_I}$, but we leave it for future investigation. Meanwhile, stacking triplet losses with different similarity measures is also sub-optimal, which further justifies the effectiveness of the proposed hybrid similarity.

Figure 5: (a) Effect of parameter $\alpha$ in the proposed hybrid loss. We give the matching MAP on HPatches Balntas et al. (2017) for different $\alpha$ and margin $m$ from Eqn. (7) and (8). (b) Gradient magnitude of the proposed HyNet loss for different $\alpha$. (c) Comparison of the proposed loss to other variants that combine the inner product and L2 loss.

6 Conclusion

We have introduced a new deep local descriptor named HyNet, which is inspired by the analysis and optimisation of the descriptor gradients. HyNet further benefits from a regularisation term that constrains the descriptor magnitude before L2 normalisation, a hybrid similarity measure that balances the gradient contributions from positive and negative pairs, and a new network architecture which L2 normalises the intermediate feature maps. Empirically, HyNet outperforms previous methods by a significant margin on various tasks. Moreover, a comprehensive ablation study reveals the contribution of each proposed component to the final performance.


  • [1] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk (2017) HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 4, pp. 6.
  • [2] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk (2016) Learning local feature descriptors with triplets and shallow convolutional neural networks. In British Machine Vision Conference (BMVC), Vol. 1, pp. 3.
  • [3] A. Barroso-Laguna, E. Riba, D. Ponsa, and K. Mikolajczyk (2019) Key.net: keypoint detection by handcrafted and learned cnn filters. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • [4] E. Brachmann and C. Rother (2019) Neural-guided ransac: learning where to sample model hypotheses. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4322–4331.
  • [5] M. Brown, G. Hua, and S. Winder (2011) Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1), pp. 43–57.
  • [6] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1335–1344.
  • [7] O. Chum, T. Werner, and J. Matas (2005) Two-view geometry estimation unaffected by a dominant plane. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [8] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4690–4699.
  • [9] D. DeTone, T. Malisiewicz, and A. Rabinovich (2018) Superpoint: self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 224–236.
  • [10] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler (2019) D2-Net: a trainable CNN for joint detection and description of local features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [11] P. Ebel, A. Mishchuk, K. M. Yi, P. Fua, and E. Trulls (2019) Beyond Cartesian representations for local descriptors. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • [12] W. Ge (2018) Deep metric learning with hierarchical triplet loss. In European Conference on Computer Vision (ECCV), pp. 269–285.
  • [13] H. Germain, G. Bourmaud, and V. Lepetit (2020) S2DNet: learning accurate correspondences for sparse-to-dense feature matching. arXiv preprint arXiv:2004.01673.
  • [14] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press.
  • [15] K. He, Y. Lu, and S. Sclaroff (2018) Local descriptors optimized for average precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 596–605.
  • [16] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
  • [17] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  • [18] Y. Jin, D. Mishkin, A. Mishchuk, J. Matas, P. Fua, K. M. Yi, and E. Trulls (2020) Image matching across wide baselines: from paper to practice. arXiv preprint arXiv:2003.01587.
  • [19] M. Keller, Z. Chen, F. Maffra, P. Schmuck, and M. Chli (2018) Learning deep descriptors with scale-aware triplet networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [20] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [21] H. Liu, X. Zhu, Z. Lei, and S. Z. Li (2019) Adaptiveface: adaptive margin and sampling for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11947–11956.
  • [22] Y. Liu, Z. Shen, Z. Lin, S. Peng, H. Bao, and X. Zhou (2019) GIFT: learning transformation-invariant dense visual descriptors via group cnns. In Advances in Neural Information Processing Systems (NeurIPS), pp. 6990–7001.
  • [23] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110.
  • [24] Z. Luo, T. Shen, L. Zhou, S. Zhu, R. Zhang, Y. Yao, T. Fang, and L. Quan (2018) GeoDesc: learning local descriptors by integrating geometry constraints. In European Conference on Computer Vision (ECCV), pp. 170–185.
  • [25] Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y. Yao, S. Li, T. Fang, and L. Quan (2020) ASLFeat: learning local features of accurate shape and localization. arXiv preprint arXiv:2003.10071.
  • [26] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas (2017) Working hard to know your neighbor's margins: local descriptor learning loss. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4826–4837.
  • [27] K. Moo Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua (2018) Learning to find good correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2666–2674.
  • [28] K. Musgrave, S. Belongie, and S. Lim (2020) A metric learning reality check. arXiv preprint arXiv:2003.08505.
  • [29] Y. Ono, E. Trulls, P. Fua, and K. M. Yi (2018) LF-net: learning local features from images. In Advances in Neural Information Processing Systems (NeurIPS), pp. 6234–6244.
  • [30] O. M. Parkhi, A. Vedaldi, and A. Zisserman (2015) Deep face recognition. In BMVC.
  • [31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS-W.
  • [32] J. Revaud, P. Weinzaepfel, C. De Souza, N. Pion, G. Csurka, Y. Cabon, and M. Humenberger (2019) R2D2: repeatable and reliable detector and descriptor. arXiv preprint arXiv:1906.06195.
  • [33] I. Rocco, R. Arandjelović, and J. Sivic (2020) Efficient neighbourhood consensus networks via submanifold sparse convolutions. arXiv preprint arXiv:2004.10566.
  • [34] I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, and J. Sivic (2018) Neighbourhood consensus networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1651–1662.
  • [35] P. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich (2019) SuperGlue: learning feature matching with graph neural networks. arXiv preprint arXiv:1911.11763.
  • [36] J. L. Schönberger, H. Hardmeier, T. Sattler, and M. Pollefeys (2017) Comparative evaluation of hand-crafted and learned local features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [37] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823.
  • [38] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer (2015) Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 118–126.
  • [39] S. Singh and S. Krishnan (2019) Filter response normalization layer: eliminating batch dependence in the training of deep neural networks. arXiv preprint arXiv:1911.09737. Cited by: Figure 3, §3.3, §3.3, §5.
  • [40] Y. Tian, B. Fan, and F. Wu (2017) L2-net: deep learning of discriminative patch descriptor in euclidean space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 6. Cited by: §1, §1, §3.3, Table 1, §4, §5.
  • [41] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas (2019) SOSNet: second order similarity regularization for local descriptor learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11016–11025. Cited by: §A.1, §A.2, Table 4, Table 5, §1, §1, §1, §1, §3.3, Figure 4, §4.2, §4.3, Table 1, §4, §5, §5.
  • [42] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §5.
  • [43] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu (2018) Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5265–5274. Cited by: §1.
  • [44] Z. Wang, B. Fan, and F. Wu (2011) Local intensity order pattern for feature description. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 603–610. Cited by: §1.
  • [45] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua (2016) Lift: learned invariant feature transform. In European Conference on Computer Vision (ECCV), pp. 467–483. Cited by: §1.
  • [46] B. Yu, T. Liu, M. Gong, C. Ding, and D. Tao (2018) Correcting the triplet selection bias for triplet loss. In European Conference on Computer Vision (ECCV), pp. 71–87. Cited by: §3.2.
  • [47] L. Zhang and S. Rusinkiewicz (2019) Learning local descriptors with a cdf-based dynamic soft margin. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2969–2978. Cited by: §3.2.
  • [48] L. Zhang and S. Rusinkiewicz (2019) Learning local descriptors with a cdf-based dynamic soft margin. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2969–2978. Cited by: §1, §1.
  • [49] X. Zhang, L. Zhang, Z. Zheng, Y. Liu, J. Bian, and M. Cheng (2019) AdaSample: adaptive sampling of hard positives for descriptor learning. arXiv preprint arXiv:1911.12110. Cited by: §1.

Appendix A Appendix

A.1 Image Matching Challenge 2020

We further evaluate HyNet on the recently proposed Image Matching Challenge (IMC) dataset [18] (https://vision.uvic.ca/image-matching-challenge/benchmark/). It consists of two tasks: wide-baseline stereo and multi-view reconstruction. Since the ground truth for the test set is not released, we report performance on the validation set. For a fair comparison, we use Key.Net [3] as the detector and compare HyNet with two other state-of-the-art descriptors, HardNet [26] and SOSNet [41]. The evaluation protocol allows a maximum of 2048 keypoints per image and the standard descriptor size (512 bytes). We use DEGENSAC [7] for geometric verification and a nearest-neighbour matcher, with the first-to-second nearest-neighbour ratio test to filter false-positive matches. Please refer to [18] for the exact details of the challenge’s settings.
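
The matching step of this protocol can be illustrated with a minimal sketch of nearest-neighbour matching followed by the first-to-second nearest-neighbour ratio test. This is not the challenge's implementation: the function names and the `max_ratio=0.8` threshold are assumptions made for illustration, and the actual settings are those described in [18].

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ratio_test_matches(desc_a, desc_b, max_ratio=0.8):
    """Match each descriptor in desc_a to its nearest neighbour in desc_b.

    A match (i, j) is kept only if d1 < max_ratio * d2, where d1 and d2 are
    the distances to the first and second nearest neighbours in desc_b.
    max_ratio=0.8 is an illustrative value, not the challenge's setting.
    """
    matches = []
    if len(desc_b) < 2:
        return matches  # the ratio test needs at least two candidates
    for i, da in enumerate(desc_a):
        # Sort candidates by distance; ties are broken by candidate index.
        dists = sorted((l2_distance(da, db), j) for j, db in enumerate(desc_b))
        (d1, j1), (d2, _) = dists[0], dists[1]
        if d1 < max_ratio * d2:
            matches.append((i, j1))
    return matches


# Toy usage with 2-D "descriptors"; real descriptors are much longer.
image_a = [[1.0, 0.0], [0.6, 0.6]]
image_b = [[0.9, 0.1], [0.0, 1.0], [1.0, 1.0]]
print(ratio_test_matches(image_a, image_b))
```

In the toy data, the first descriptor has one clearly closest candidate and is kept, while the second has two almost equally close candidates and is rejected as ambiguous. The surviving matches would then be passed to a geometric verifier such as DEGENSAC [7].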

mAA (%)        Stereo   Multi-View   Average
HardNet [26]   63.40    74.41        68.91
SOSNet [41]    63.41    74.51        68.96
HyNet          64.07    74.84        69.46
Table 4: Mean Average Accuracy (mAA) at on IMC dataset [18].

As can be seen from Table 4, HyNet surpasses the previous state-of-the-art methods HardNet and SOSNet on both tasks, which further validates its effectiveness.

A.2 Integrating HyNet with SOSR

In this section, we test HyNet in combination with the Second Order Similarity Regularisation (SOSR) proposed in [41]. Results are shown in Table 5 and Fig. 6. HyNet generalises well with the extra supervision signal from SOSR, which indicates that it could be boosted further by other third-party loss terms.
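
For readers unfamiliar with SOSR, the following is a rough sketch of a second-order similarity term in the spirit of [41]: it penalises differences between the distance from anchor i to anchor j and the distance between their corresponding positives. It rests on several assumptions: the function names are hypothetical, the sum runs over all other pairs in the batch rather than a selected neighbourhood, and how the term is weighted against the HyNet loss is not specified in this section.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def second_order_similarity_reg(anchors, positives):
    """Simplified second-order similarity regulariser (illustrative only).

    Computes (1/N) * sum_i sqrt(sum_{j != i} (d(a_i, a_j) - d(p_i, p_j))^2),
    where a_i and p_i are the descriptors of the i-th matching patch pair.
    """
    n = len(anchors)
    if n != len(positives) or n == 0:
        raise ValueError("need equally sized, non-empty anchor/positive lists")
    total = 0.0
    for i in range(n):
        squared_sum = 0.0
        for j in range(n):
            if j != i:
                diff = l2(anchors[i], anchors[j]) - l2(positives[i], positives[j])
                squared_sum += diff * diff
        total += math.sqrt(squared_sum)
    return total / n


# The term vanishes when anchors and positives share the same pairwise distances.
batch = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(second_order_similarity_reg(batch, batch))
```

In a training setup this scalar would be added to the main descriptor loss. A practical version would operate on batched tensors in an automatic differentiation framework rather than on Python lists.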



SIFT [23] 29.84 22.53 27.29 26.55
HardNet [26] 1.49 2.51 0.53 0.78 1.96 1.84 1.51
SOSNet [41] 1.08 2.12 0.35 0.67 1.03 0.95 1.03
HyNet 0.89 1.37 0.34 0.61 0.88 0.96 0.84
HyNet+SOSR [41] 0.91 1.62 0.31 0.54 0.78 0.73 0.82
Table 5: Patch verification performance on the UBC phototour dataset. Numbers denote false positive rates at 95% recall (FPR@95). ND: Notredame, LIB: Liberty, YOS: Yosemite.
Figure 6: Results on test set ‘a’ of HPatches [1]. Colour of the marker indicates EASY, HARD, and TOUGH noise. The type of marker corresponds to the variants of the experimental settings.
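
The FPR@95 metric reported in Table 5 can be sketched as follows: choose the distance threshold at which 95% of the matching patch pairs are accepted, then measure the fraction of non-matching pairs that fall under the same threshold. This is a minimal illustration with a hypothetical function name; the benchmark's own evaluation code may handle ties and threshold selection differently.

```python
import math


def fpr_at_recall(distances, labels, recall=0.95):
    """False positive rate at the distance threshold that reaches `recall`.

    distances: descriptor distance per patch pair (smaller = more similar).
    labels: 1 for a matching pair, 0 for a non-matching pair.
    """
    pos = sorted(d for d, y in zip(distances, labels) if y == 1)
    neg = [d for d, y in zip(distances, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    # Smallest number of positives that must be accepted to reach the recall.
    k = max(1, math.ceil(recall * len(pos)))
    threshold = pos[k - 1]
    # Negatives at or below the threshold are accepted by mistake.
    false_pos = sum(1 for d in neg if d <= threshold)
    return false_pos / len(neg)


# Toy usage: four matching and four non-matching pairs.
dists = [0.1, 0.2, 0.3, 0.4, 0.35, 0.5, 0.6, 0.7]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(fpr_at_recall(dists, labels))
```

Lower values are better: a descriptor that separates matching from non-matching pairs perfectly would score zero.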