1 Introduction
Learning discriminative local descriptors from image patches is a fundamental ingredient of various computer vision tasks, including structure-from-motion [1], retrieval [24], and panorama stitching [6]. Conventional approaches mostly utilize hand-crafted descriptors, such as SIFT [17], which have been successfully employed in a variety of applications. Recently, with the emergence of large-scale annotated datasets [5, 3], data-driven methods have started to demonstrate their effectiveness, and learning-based descriptors have gradually come to dominate this field. Specifically, convolutional neural network (CNN) based descriptors [10, 35, 34, 30, 21, 31] achieve state-of-the-art performance on various tasks, including patch retrieval and 3D reconstruction.
Notably, the triplet loss is adopted in many well-performing descriptor learning frameworks. Nevertheless, the quality of the learned descriptors heavily relies on triplet selection, and mining suitable triplets from a large database is challenging. To address this challenge, Balntas et al. [4] propose an in-triplet hard negative mining strategy called anchor swapping. Tian et al. [30] progressively sample unbalanced training pairs in favor of negatives, and Mishchuk et al. [21] further simplify this idea by mining the hardest negatives within the mini-batch. Despite the significant progress in performance and generalization ability, two potential problems remain in the current hardest-in-batch sampling solution: i) hard negatives are mined at the batch level, while the randomly selected matching pairs can still be easily discriminated by the descriptor network; ii) it does not take the interaction between the training progress and the hardness of the training samples into consideration. To this end, we propose a novel triplet mining pipeline that adaptively constructs high-informativeness batches in a principled manner.
Our proposed method is named AdaSample, in which matching pairs are sampled from the dataset based on their informativeness to construct mini-batches. The methodology builds on an informativeness analysis, where informativeness is defined via the contributing gradients of the potential samples and helps estimate their optimal sampling probabilities. Moreover, we propose a novel training protocol inspired by maximum loss minimization [26] to boost the generalization ability of the descriptor network. Under this training framework, we can adaptively adjust the overall hardness of the training examples fed to the network based on the training progress. Comprehensive evaluation results and ablation studies on several standard benchmarks [5, 3] demonstrate the effectiveness of our proposed method.
In summary, our contributions are threefold:

We theoretically analyze the informativeness of potential training examples and formulate a principled sampling approach for descriptor learning.

We propose a hardness-aware training protocol inspired by maximum loss minimization, in which the overall hardness of the generated triplets is adaptively adjusted to match the training progress.

Comprehensive evaluation results on popular benchmarks demonstrate the efficacy of our proposed AdaSample framework.
2 Related work
Local Descriptor Learning.
Traditional descriptors [17, 15] mostly utilize hand-crafted features to extract low-level textures from image patches. The seminal work, SIFT [17], computes smoothed weighted histograms over the gradient field of the image patch. PCA-SIFT [15] further improves the descriptors by applying Principal Component Analysis (PCA) to the normalized image gradient. A comprehensive overview of hand-crafted descriptors can be found in [20].
Recently, owing to the rapid development of deep learning, CNN-based methods enable learning feature descriptors directly from raw image patches. MatchNet [10] proposes a two-stage Siamese architecture to extract feature embeddings and measure patch similarity, which significantly improves performance and demonstrates the great potential of CNNs in descriptor learning. DeepDesc [28] trains the network with the Euclidean distance and adopts a mining strategy to sample hard examples. DeepCompare [35] explores various architectures of the Siamese network and develops a two-stream network focusing on image centers.
With the advances in metric learning, triplet-based architectures have gradually replaced pair-based ones. TFeat [4] adopts the triplet loss and mines in-triplet hard negatives with a strategy named anchor swapping. L2-Net [30] employs progressive sampling and requires that matching patches have minimal distances within the mini-batch. HardNet [21] further develops this idea to mine the hardest-in-batch negatives with a simple triplet margin loss. DOAP [12] imposes a ranking-based loss directly optimized for average precision. GeoDesc [18] incorporates geometric constraints from multi-view reconstructions and achieves significant improvement on the 3D reconstruction task. SOSNet [31] proposes a second-order similarity regularization term and achieves more compact patch clusters in the feature space. A very recent work [36] relaxes the hard margin in the triplet margin loss with a dynamic soft margin, avoiding manual tuning of the margin by human heuristics.
From prior art, we find that the triplet mining framework can generally be decoupled into two stages: batch construction from the dataset, and triplet generation within the mini-batch. Previous works [4, 30, 21] mostly focus on mining hard negatives in the second stage, while neglecting batch construction in the first. Besides, their sampling approaches do not take the training progress into account when generating triplets. Therefore, we argue that these triplet mining solutions cannot exploit the full potential of the entire dataset to produce triplets of suitable hardness. To alleviate this issue, we analyze the contributing gradients of the potential training examples and sample informative matching pairs for batch construction. Then, we propose a hardness-aware training protocol inspired by maximum loss minimization, in which the overall hardness of the selected triplets is correlated with the training progress. Incorporating the hardest-in-batch negative mining solution, we formulate a powerful triplet mining framework, AdaSample, for descriptor learning, in which the quality of the learned descriptors is significantly improved by a simple sampling strategy.
Hard Negative Mining.
Hard negative mining has been widely used in deep metric learning, e.g., for face verification [25], as it progressively selects hard negatives for triplet loss and Siamese networks, boosting performance and speeding up convergence. FaceNet [25] samples semi-hard triplets within the mini-batch to avoid overfitting outliers. Wu et al. [33] select training examples based on their relative distances. Zheng et al. [38] augment the training data by adaptively synthesizing hardness-aware and label-preserving examples. Our sampling solution differs from these in that we analyze the informativeness of the training data and ensure that the sampled data provide the gradients contributing most to the parameter update. Besides, our method adaptively adjusts the hardness of the selected training data as training progresses. In this way, well-classified samples are filtered out, and the network is always fed with informative triplets of suitable hardness. Comprehensive evaluation results demonstrate the consistent performance improvement contributed by our approach.
3 Methodology
3.1 Problem Overview
Given a dataset consisting of C classes, each containing several matching patches (the term "class" stands for the image patches that come from the same 3D location; for our sampling purpose, patches from a single class are matching, while non-matching pairs come from different classes), we decompose triplet generation into two stages. First, we select n matching pairs (positives) to form a mini-batch, where n is the batch size; this is done by our proposed AdaSample, as introduced in Sec. 3.2. Second, we mine the hardest-in-batch negative for each matching pair and use the triplet loss to supervise the network training, as in Sec. 3.3. See Fig. 1 for an illustration of the two-stage sampling pipeline. Finally, the overall solution is summarized in Sec. 3.4.
3.2 AdaSample
Previous works [30, 21] sample positives randomly to construct mini-batches, yielding a majority of similar matching pairs that can be easily discriminated by the network. This practice may reduce the overall hardness of the triplets. Motivated by the hardest-in-batch mining strategy in [21], a straightforward solution is to always select the most dissimilar matching pairs. However, a potential issue arises: the network may be trained with a bias in favor of the most dissimilar matching pairs, while other cases are under-represented. We evaluate this solution, which we call Hardpos, in our experiments (Sec. 5.4).
A more principled solution is to sample positives based on their informativeness. Here, we assume that informative pairs are those contributing most to the optimization, namely, providing effective gradients for parameter updates. Therefore, we quantify the informativeness of the matching pairs by measuring their contributing gradients during training. Moreover, we employ maximum loss minimization [26] to improve the generalization ability of the learned model and show that the resulting gradient estimator is an unbiased estimator of the actual gradient. In the following, we introduce our derivation and elaborate on the theoretical justification in Sec. 4.
Informativeness Based Sampling.
In the end-to-end deep learning literature, training data contribute to optimization via gradients, so we measure the informativeness of training examples by analyzing their resulting gradients. Consider the generic deep learning setting. Let (x_i, y_i) be the data-label pairs of the training set, f(\cdot; \theta) be the model parameterized by \theta, and \ell(\cdot, \cdot) be a differentiable loss function. The goal is to find the optimal model parameter \theta^* that minimizes the average loss, i.e.,
\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i; \theta), y_i),    (1)
where N denotes the number of training examples. Then, we proceed with the following definition of informativeness.
Definition 1.
The informativeness of a training example (x_i, y_i) is quantified by its resulting gradient norm at iteration t, namely,
I_i^{(t)} = \left\| \nabla_{\theta} \ell(f(x_i; \theta_t), y_i) \right\|.    (2)
At iteration t, let p_i^{(t)} be the sampling probabilities of each datum in the training set. More generally, we also reweight each sample by a scalar w_i^{(t)}. Let the random variable I_t denote the sampled index at iteration t, so that P(I_t = i) = p_i^{(t)}. We record the reweighted gradient induced by the i-th training sample as
G_i = w_i \nabla_{\theta} \ell(f(x_i; \theta_t), y_i).    (3)
For simplicity, we omit the superscript (t) when no ambiguity arises. By setting w_i = 1 / (N p_i), we can make the gradient estimator G_{I_t} an unbiased estimator of the actual gradient, i.e.,
\mathbb{E}[G_{I_t}] = \sum_i p_i w_i \nabla_{\theta} \ell_i = \frac{1}{N} \sum_i \nabla_{\theta} \ell_i = \nabla_{\theta} L(\theta_t).    (4)
Without loss of generality, we use stochastic gradient descent (SGD) to update the model parameters:
\theta_{t+1} = \theta_t - \eta_t G_{I_t},    (5)
where \eta_t is the learning rate at iteration t. As the goal is to find the optimal \theta^*, we define the expected progress towards the optimum at each iteration as follows.
Definition 2.
At iteration t, the expected parameter rectification R_t is defined as the expected reduction of the squared distance between the parameter and the optimum after iteration t,
R_t = \mathbb{E}\left[ \|\theta_t - \theta^*\|^2 - \|\theta_{t+1} - \theta^*\|^2 \right].    (6)
Generally, tens of thousands of iterations are included in training, so the empirical average parameter rectification converges to the average of R_t asymptotically. Therefore, by maximizing R_t, we guarantee the most progressive step towards the parameter optimum at each iteration, in the expectation sense. Inspired by the greedy algorithm [8], we aim to maximize R_t at each iteration.
It can be shown that maximizing R_t is equivalent to minimizing \mathrm{Tr}(\Sigma[G_{I_t}]), the trace of the covariance of the gradient estimator (Thm. 1). Under this umbrella, we show that the optimal sampling probability is proportional to the per-sample gradient norm (a special case of Thm. 2). Therefore, the optimal sampling probability of each datum happens to be proportional to its informativeness. This property justifies our definition of informativeness as the resulting gradient norm of each training example.
However, as the neural network has multiple layers with a large number of parameters, it is computationally prohibitive to calculate the full gradient norm. Instead, we prove in Sec. 4.2 that the matching distance in the feature space is a good approximation to the informativeness (the approximation is up to a constant factor, which is insignificant as it is offset by the learning rate; the same reasoning applies to the gradient approximation in the Maximum Loss Minimization paragraph). Concretely, for each class consisting of patches \{x_1, \ldots, x_m\}, we first select a patch a randomly, which serves as the anchor patch, and then sample a matching patch x_j with probability
p_j = \frac{d(f(a), f(x_j))}{\sum_{j'} d(f(a), f(x_{j'}))},    (7)
where f(x) is the extracted descriptor of x, and d(\cdot, \cdot) measures the discrepancy of the extracted descriptors. See the specific choice of d in Sec. 3.4.
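The sampling rule of Eqn. 7 can be sketched in a few lines of NumPy (an illustrative snippet; the function name and the Euclidean placeholder metric are ours, not the paper's):

```python
import numpy as np

def positive_sampling_probs(anchor, candidates, dist=None):
    """Eqn. 7: the probability of picking matching patch j is proportional to
    the descriptor-space distance between the anchor and candidate j.
    anchor: (D,) descriptor; candidates: (M, D) descriptors of the
    remaining patches in the same class."""
    if dist is None:  # Euclidean distance as a placeholder metric
        dist = lambda a, b: np.linalg.norm(a - b, axis=-1)
    d = dist(anchor, candidates)
    return d / d.sum()

rng = np.random.default_rng(0)
anchor = np.array([1.0, 0.0])
cands = np.array([[0.9, 0.1], [0.0, 1.0]])   # near vs. far matching patch
p = positive_sampling_probs(anchor, cands)
j = rng.choice(len(cands), p=p)              # index of the sampled positive
```

Dissimilar matching patches thus receive proportionally more probability mass, but similar ones are never excluded outright.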
Maximum Loss Minimization.
Minimizing the average loss may be suboptimal because training tends to be overwhelmed by well-classified examples that provide noisy gradients [16]. On the contrary, well-classified examples can be adaptively filtered out by minimizing the maximum loss [26], which can further improve the generalization ability. However, directly minimizing the maximum loss may lead to insufficient usage of the training data and sensitivity to outliers, so we approximate the gradient of the maximum loss by g_k = \sum_i \ell_i^{k-1} \nabla_{\theta} \ell_i / \sum_i \ell_i^{k-1}, in which k is sufficiently large. As G_{I_t} is used to update the parameters, consider its expectation
\mathbb{E}[G_{I_t}] = \sum_i p_i w_i \nabla_{\theta} \ell_i.    (8)
To guarantee that G_{I_t} is an unbiased estimator of g_k (we impose the unbiasedness constraint due to its theoretical convergence guarantees; for example, the non-asymptotic error bound induced by unbiased gradient estimates is given in [22], and improved convergence rates for reweighted SGD, as in our case, can be found in [23]), it suffices to set
p_i w_i = \frac{\ell_i^{k-1}}{\sum_j \ell_j^{k-1}},    (9)
as in this case,
\mathbb{E}[G_{I_t}] = \sum_i p_i w_i \nabla_{\theta} \ell_i = \frac{\sum_i \ell_i^{k-1} \nabla_{\theta} \ell_i}{\sum_j \ell_j^{k-1}} = g_k.    (10)
Following the previous reasoning, we need to minimize \mathrm{Tr}(\Sigma[G_{I_t}]) under the constraints specified by Eqn. 9 in order to step most progressively at each iteration. In Thm. 2, we show that the optimal sampling probability and reweighting scalar should be given by
p_i \propto \ell_i^{k-1} \|\nabla_{\theta} \ell_i\|, \quad w_i \propto \frac{1}{\|\nabla_{\theta} \ell_i\|}.    (11)
As previously claimed, we approximate the gradient norm via the matching distance d_i in the feature space. Besides, in our case, the hinge triplet loss (Eqn. 14) is positively (or even linearly) correlated with the squared matching distance, so we use the squared matching distance as an approximation of the hinge triplet loss. Thus, the sampling probability and reweighting scalar are given by
p_i \propto d_i^{2k-1}, \quad w_i \propto \frac{1}{d_i}.    (12)
Moreover, for a better approximation, it is preferable to adjust k adaptively, namely, to increase k as training progresses: intuitively, once easy matching pairs have been correctly classified, we should focus more on hard ones. A good indicator of the training progress is the average loss. As a result, instead of predefining a sufficiently large k, we set k = \alpha / \bar{L}, where \alpha is a tunable hyper-parameter, and \bar{L} is the moving average of the history loss. Formally, we formulate our sampling probability and reweighting scalar as
p_i \propto d_i^{2\alpha/\bar{L} - 1}, \quad w_i \propto \frac{1}{d_i}.    (13)
The exponent increases adaptively as training progresses, so that hardness-aware training examples can be generated and fed to the network. Our sampling approach is thus named AdaSample.
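The adaptive probabilities and weights can be sketched as follows (a minimal NumPy illustration of our reconstruction of Eqns. 11-13; the function name and the normalization choices are our assumptions):

```python
import numpy as np

def adasample_probs(d, alpha, loss_avg):
    """Sketch of Eqn. 13: sampling probabilities and reweighting scalars
    computed from matching distances d (shape (N,)), the hyper-parameter
    alpha, and the moving average loss_avg of the history loss."""
    k = alpha / loss_avg            # exponent grows as the average loss shrinks
    p = d ** (2.0 * k - 1.0)
    p = p / p.sum()                 # normalize to a probability distribution
    w = 1.0 / d
    w = w / np.dot(p, w)            # scale so that sum_i p_i * w_i = 1
    return p, w

d = np.array([0.5, 1.0, 2.0])                       # matching distances
p_early, w_early = adasample_probs(d, 1.0, 1.0)     # early training, large loss
p_late, w_late = adasample_probs(d, 1.0, 0.25)      # later training, small loss
```

As the moving-average loss decreases, the exponent grows and probability mass shifts towards the hardest matching pairs.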
3.3 Triplet generation by hardestinbatch
AdaSample focuses on the batch construction stage; for a complete triplet mining framework, we also need to mine negatives from the mini-batch. Here, we adopt the hardest-in-batch strategy of [21]. Formally, given a mini-batch of matching pairs \{(a_i, p_i)\}_{i=1}^{n}, let \{(f(a_i), f(p_i))\} be the descriptors extracted from the patches (for clarity, (a_i, p_i) denotes the i-th selected matching pair, with different pairs belonging to different classes). For each matching pair (a_i, p_i), we select the non-matching patch n_i that lies closest to one of the two matching patches in the feature space. Then, the Hinge Triplet (HT) loss is defined as follows:
L_{HT} = \frac{1}{n} \sum_{i=1}^{n} \max\left(0, m + d(f(a_i), f(p_i)) - d(f(a_i), f(n_i))\right),    (14)
where m denotes the margin. Incorporating the reweighting scalar, we update the model parameters via the gradient estimator G_{I_t} = w_{I_t} \nabla_{\theta} \ell_{I_t}.
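The hardest-in-batch step above can be sketched as follows (an illustrative NumPy snippet with squared Euclidean distances; HardNet [21] mines the hardest negative over both anchors and positives, while for brevity this sketch mines only over anchor-to-positive distances, and the margin value is arbitrary):

```python
import numpy as np

def hardest_in_batch_ht_loss(A, P, margin=1.0):
    """Hinge triplet loss with hardest-in-batch negatives (cf. Eqn. 14).
    A, P: (n, D) descriptors of anchors and positives; row i of A matches
    row i of P, and all other rows are non-matching."""
    n = A.shape[0]
    # pairwise squared distances between anchors and positives
    D = ((A[:, None, :] - P[None, :, :]) ** 2).sum(-1)
    pos = np.diag(D)                   # matching distances d(a_i, p_i)
    off = D + np.eye(n) * 1e9          # mask out the matching pairs
    neg = off.min(axis=1)              # hardest negative per anchor
    return np.maximum(0.0, margin + pos - neg).mean()
```

For two orthonormal descriptors the squared negative distance is 2, so the loss is zero at margin 1 and becomes positive once the margin exceeds the positive-negative gap.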
3.4 Distance Metric
The Euclidean distance is widely used in previous works [28, 30, 21, 31]. However, as the descriptors lie on the unit hypersphere (Sec. 5.1), it is more natural to adopt the geodesic distance on the embedded manifold. Therefore, we adopt the angular distance [7] as follows:
d_{\mathrm{ang}}(x, y) = \arccos\left(\langle x, y \rangle\right),    (15)
where \langle \cdot, \cdot \rangle denotes the inner product. We name our loss function the Angular Hinge Triplet (AHT) loss, which is demonstrated to yield consistent performance improvements (Sec. 5.4).
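Eqn. 15 amounts to a one-line function on unit-norm descriptors (a minimal sketch; the clipping guard is our addition):

```python
import numpy as np

def angular_distance(x, y):
    """Eqn. 15: geodesic distance between unit descriptors on the hypersphere."""
    # clip guards against floating-point values slightly outside [-1, 1]
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))
```

Identical descriptors have distance 0, and orthogonal ones have distance pi/2, matching the geodesic arc length on the unit sphere.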
Alg. 1 summarizes the overall triplet generation framework. For each training iteration, we first randomly pick n distinct classes from the dataset and extract descriptors for the patches belonging to these classes (Steps 1-2). Then, we randomly choose an anchor patch from each of the selected classes (Step 4) and adopt our proposed AdaSample to select an informative matching patch (Step 5). With the generated mini-batch, we mine hard negatives following [21] and compute the Angular Hinge Triplet (AHT) loss (Step 7).
4 Theoretical Analysis
In this section, we complete the theoretical analysis of informativeness in Sec. 4.1, and prove that the matching distance can serve as a good approximation of informativeness in Sec. 4.2.
4.1 Informativeness Formulation
Following the notation in Sec. 3.2, we reformulate R_t (Eqn. 6) and give an equivalent condition for maximizing it; the same conclusion can be found in [14]. Expanding the SGD update (Eqn. 5),
R_t = 2\eta_t \nabla_{\theta} L(\theta_t)^{\top} (\theta_t - \theta^*) - \eta_t^2 \|\mathbb{E}[G_{I_t}]\|^2 - \eta_t^2 \mathrm{Tr}(\Sigma[G_{I_t}]).    (16)
Due to unbiasedness (Eqn. 4), the first two terms in Eqn. 16 are fixed, so maximizing R_t is equivalent to minimizing \mathrm{Tr}(\Sigma[G_{I_t}]). Thm. 2 specifies the optimal probabilities that minimize this trace under a more general assumption.
Theorem 2.
Let G_i be defined as in Eqn. 3, and suppose the sampled index I obeys the distribution P(I = i) = p_i. Then, given the constraints p_i w_i = c_i and \sum_i p_i = 1, \mathrm{Tr}(\Sigma[G_I]) is minimized by the following optimal sampling probabilities:
p_i^* = \frac{c_i \|\nabla_{\theta} \ell_i\|}{\sum_j c_j \|\nabla_{\theta} \ell_j\|}.    (17)
Proof.
As G_I is an unbiased estimator of the actual gradient (Eqn. 4), \mathbb{E}[G_I] is fixed in our case, denoted by \mu for short. By the linearity of the trace and \mathbb{E}[G_I] = \mu, we have
\mathrm{Tr}(\Sigma[G_I]) = \mathbb{E}[\|G_I\|^2] - \|\mu\|^2 = \sum_i \frac{c_i^2 \|\nabla_{\theta} \ell_i\|^2}{p_i} - \|\mu\|^2.    (18)
Mathematically, given the constraints \sum_i p_i = 1, p_i > 0, the sum \sum_i c_i^2 \|\nabla_{\theta} \ell_i\|^2 / p_i reaches its minimum when the probabilities satisfy
p_i \propto c_i \|\nabla_{\theta} \ell_i\|.    (19)
Dividing by a normalization factor, we obtain the expression in Eqn. 17. ∎
Note that in the special case of w_i = 1/(N p_i), the constraints degrade into p_i w_i = 1/N, and the optimal sampling probabilities become p_i^* \propto \|\nabla_{\theta} \ell_i\|.
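The special case can be checked numerically: for fixed per-sample gradient norms and c_i = 1/N, the variance term from Eqn. 18 is smallest when p_i is proportional to the gradient norm (a small NumPy check of our own, not from the paper):

```python
import numpy as np

def variance_term(p, g):
    """Sum_i c_i^2 g_i^2 / p_i from Eqn. 18, with c_i = 1/N."""
    n = len(g)
    return ((g / n) ** 2 / p).sum()

g = np.array([0.1, 1.0, 3.0])            # hypothetical per-sample gradient norms
uniform = np.full(3, 1.0 / 3.0)          # random sampling
optimal = g / g.sum()                    # p_i proportional to the gradient norm
rng = np.random.default_rng(0)
other = rng.dirichlet(np.ones(3))        # an arbitrary alternative distribution
```

By the Cauchy-Schwarz inequality, the proportional choice attains the minimum over all valid distributions, which the assertions below confirm for these examples.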
4.2 Approximation of Informativeness
As mentioned in Sec. 3.2, the matching distance can serve as a good approximation of the informativeness; we justify this here. For simplicity, we introduce some notation for an L-layer multi-layer perceptron (MLP). Let W^{(l)} be the weight matrix of layer l, and \sigma(\cdot) be a Lipschitz-continuous activation function. The multi-layer perceptron can then be formulated as follows:
f(x) = W^{(L)} \sigma\left( W^{(L-1)} \sigma\left( \cdots \sigma\left( W^{(1)} x \right) \right) \right).    (20)
Note that although our notation describes only MLPs without bias terms, the analysis holds for any affine transformation followed by a Lipschitz-continuous nonlinearity; the reasoning therefore extends naturally to CNNs. With the layer-wise pre-activations and activations
z^{(l)} = W^{(l)} h^{(l-1)}, \quad h^{(l)} = \sigma(z^{(l)}), \quad h^{(0)} = x,    (21)
we have, by the chain rule,
\left\| \nabla_{W^{(l)}} \ell \right\| \le \left\| h^{(l-1)} \right\| \cdot \prod_{l' > l} \mathrm{Lip}(\sigma) \left\| W^{(l')} \right\| \cdot \left\| \nabla_{f(x)} \ell \right\|.    (22)
Various data pre-processing, weight initialization [9, 11], and activation normalization [13, 2, 32] techniques uniformize the activations of each layer across samples. Therefore, the variation of the gradient norm is mostly captured by the gradient of the loss function w.r.t. the output of the neural network,
I_i \approx C \left\| \nabla_{f(x_i)} \ell_i \right\|,    (23)
where C is a constant, and \|\nabla_{f(x_i)} \ell_i\| serves as a precise approximation of the full gradient norm. For simplicity, we consider the hinge triplet loss (Eqn. 14) with squared Euclidean distances here. Then, the gradient norm w.r.t. the descriptor of the matching patch is just twice the matching distance (this relation holds only when the hinge triplet loss is positive; empirically, due to the relatively large margin, the hinge loss never becomes zero),
\left\| \nabla_{f(p_i)} \ell_i \right\| = \left\| 2 \left( f(p_i) - f(a_i) \right) \right\| = 2 \, d(f(a_i), f(p_i)).    (24)
As a result, we reach the conclusion that the matching distance is a good approximation to the informativeness. We also verify this empirically in Sec. 5.4.
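Eqn. 24 can be verified with a finite-difference check on a toy hinge triplet loss with squared Euclidean distances (an illustrative setup of our own; the descriptor values and margin are arbitrary, chosen so the hinge stays active):

```python
import numpy as np

def ht_loss(a, p, n, m=4.0):
    """Hinge triplet loss with squared Euclidean distances (cf. Eqn. 14)."""
    return max(0.0, m + ((a - p) ** 2).sum() - ((a - n) ** 2).sum())

a = np.array([1.0, 0.0])     # anchor descriptor
p = np.array([0.6, 0.3])     # matching descriptor
n = np.array([-1.0, 0.0])    # negative descriptor

# numerical gradient of the loss w.r.t. the matching descriptor p
eps = 1e-6
grad = np.array([(ht_loss(a, p + eps * e, n) - ht_loss(a, p - eps * e, n)) / (2 * eps)
                 for e in np.eye(2)])
match_dist = np.linalg.norm(a - p)
```

While the loss is active, the gradient w.r.t. f(p) is 2(f(p) - f(a)), so its norm equals twice the matching distance.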
5 Experiments
Descriptor  Length  Train  Notredame  Yosemite  Liberty  Yosemite  Liberty  Notredame  Mean 
Test  Liberty  Notredame  Yosemite  
SIFT [17]  128  29.84  22.53  27.29  26.55  
DeepDesc [27]  128  10.9  4.40  5.69  6.99  
GeoDesc [18]  128  5.47  1.94  4.72  4.05  
MatchNet [10]  4096  7.04  11.47  3.82  5.65  11.60  8.70  8.05  
L2-Net [30]  128  3.64  5.29  1.15  1.62  4.43  3.30  3.24  
CS L2-Net [30]  256  2.55  4.24  0.87  1.39  3.81  2.84  2.61  
HardNet [21]  128  1.47  2.67  0.62  0.88  2.14  1.65  1.57  
HardNetGOR [37]  128  1.72  2.89  0.63  0.91  2.10  1.59  1.64  
HardNet*  128  1.80  2.89  0.68  0.90  1.93  1.71  1.65  
AdaSample* (Ours)  128  1.64  2.62  0.61  0.88  1.92  1.46  1.52  
TFeatM+ [4]  128  7.39  10.31  3.06  3.80  8.06  7.24  6.64  
L2-Net+ [30]  128  2.36  4.70  0.72  1.29  2.57  1.71  2.23  
CS L2-Net+ [30]  256  1.71  3.87  0.56  1.09  2.07  1.30  1.76  
HardNet+ [21]  128  1.49  2.51  0.53  0.78  1.96  1.84  1.51  
HardNetGOR+ [37]  128  1.48  2.43  0.51  0.78  1.76  1.53  1.41  
DOAP+ [12]  128  1.54  2.62  0.43  0.87  2.00  1.21  1.45  
HardNet+*  128  1.32  2.31  0.41  0.67  1.51  1.24  1.24  
AdaSample+* (Ours)  128  1.25  2.21  0.40  0.63  1.40  1.14  1.17 
5.1 Implementation Details
We adopt the architecture of L2-Net [30] to embed local patches into 128-dimensional descriptors lying on the unit hypersphere. Following prior works [30, 21], all patches are resized to 32x32 and normalized to zero per-patch mean and unit per-patch variance. We train our model from scratch in the PyTorch library [29] using the SGD optimizer with momentum and weight decay; the learning rate is decayed in steps at fixed epoch milestones, and the batch size n, margin m, and hyper-parameter \alpha are kept fixed unless otherwise specified.
We compare our method with both hand-crafted and deep methods (note that the training dataset of GeoDesc [18] is not released, so the comparison may be unfair; besides, some recent works [31, 36] explore different directions and their training code is not publicly available, so we leave the efficacy comparison and system combination for future work), including SIFT [17], DeepDesc [27], TFeat [4], L2-Net [30], HardNet [21], HardNet with global orthogonal regularization (GOR) [37], DOAP [12], and GeoDesc [18]. Comprehensive evaluation results and ablation studies on two standard descriptor datasets, UBC Phototour [5] (Sec. 5.2) and HPatches [3] (Sec. 5.3), demonstrate the efficacy of our proposed sampling framework.
5.2 UBC Phototour
UBC Phototour [5], also known as the Brown dataset, consists of three subsets, Liberty, Notre Dame, and Yosemite, each containing several hundred thousand normalized patches. Keypoints are detected by the DoG detector [17] and verified against 3D correspondences. The test set consists of 100k matching and non-matching pairs for each sequence. For evaluation, models are trained on one subset and tested on the other two. The metric is the false positive rate (FPR) at 95% true positive recall. The evaluation results are reported in Tab. 1.
Our method outperforms other approaches by a significant margin. Models trained with data augmentation (random flips and 90-degree rotations) are denoted by +. Besides, for our method, we also generate additional positive patches by random rotation so that each class contains more patches, denoted by *. We augment matching pairs because very few patches (two or three) correspond to one class in the UBC Phototour dataset [5], which limits the capacity of our method. To analyze its effect, we also apply it to the HardNet [21] baseline. It can be seen that our method consistently outperforms the baseline, indicating the effectiveness of our adaptive sampling solution.
5.3 HPatches
HPatches [3] consists of 116 sequences of images. The dataset is split into two parts: viewpoint (59 sequences with significant viewpoint change) and illumination (57 sequences with significant illumination change). According to the level of geometric noise, the patches are further divided into three groups: easy, hard, and tough. There are three evaluation tasks: patch verification, image matching, and patch retrieval. Following the standard evaluation protocols of the dataset, we show results in Fig. 2. Our method compares favorably against other methods on the patch verification task, which is consistent with the patch classification results in Tab. 1. Furthermore, our descriptors achieve the best results on the more challenging image matching and patch retrieval tasks, indicating the improved generalization ability contributed by our approach.
5.4 Ablation Study
Informativeness Approximation.
We empirically verify the conclusion in Sec. 4.2 that the probabilities induced by the matching distance approximate well the ones induced by the informativeness (Fig. 3, Left). Besides, the results show that the Pearson correlation between them remains consistently high during training (Fig. 3, Right), indicating that the two probabilities are strongly correlated statistically.
Impact of α and Distance Metric.
We experiment with varying \alpha in AdaSample to control the overall hardness of the selected matching pairs. A large \alpha indicates that hard matching pairs are more likely to be selected. When the sampling exponent vanishes, our method degrades into random sampling and the overall framework becomes HardNet [21]; as \alpha \to \infty, the framework becomes Hardpos. Therefore, both HardNet and Hardpos are special cases of our proposed AdaSample. Tab. 2 shows the results on the HPatches [3] dataset, where a moderate \alpha leads to the best results in most cases. This demonstrates the advantage of our balanced sampling strategy over the hardest-only solution. Also, Tab. 2 demonstrates that the angular hinge triplet (AHT) loss outperforms the commonly-used hinge triplet (HT) loss in most cases.
Task  Verification  Matching  Retrieval  
Loss  AHT  HT  AHT  HT  AHT  HT 
93.84  93.17  64.09  62.64  81.26  79.97  
94.72  94.56  66.04  65.92  83.58  83.34  
94.78  94.76  65.89  65.68  83.80  83.54  
94.78  94.60  65.46  65.37  83.98  83.62  
94.60  94.69  64.56  64.84  83.56  83.69  
94.42  94.51  63.81  64.02  83.41  83.29 
Stability and Reproducibility.
Our sampling procedure is inherently stochastic. To ensure reproducibility, we conduct experiments over five runs with different random seeds and show the means and standard deviations of the patch classification results in Tab. 3, which demonstrates the stability of our sampling solution. We argue that a possible explanation for this stability is the unbiasedness of the gradient estimator (Eqn. 10): as the number of training triplets is huge, the estimated gradients converge to the actual gradient asymptotically. Therefore, the gradients can guide the network towards the parameter optimum as training progresses, regardless of the specific random seed.
Train  Test  HardNet+*  AdaSample+*  Rel  p value
Notr  Lib  1.316±0.044  1.254±0.026  4.71%  0.031 
Yos  2.310±0.063  2.212±0.049  4.28%  0.018  
Lib  Notr  0.406±0.011  0.400±0.016  1.58%  0.337 
Yos  0.671±0.010  0.627±0.012  6.62%  0.006  
Lib  Yos  1.513±0.084  1.395±0.050  7.80%  0.030 
Notr  1.241±0.044  1.137±0.036  8.38%  0.011 
Statistical Significance.
Since previous methods are approaching saturation on the UBC Phototour [5] dataset, it is challenging to improve on the HardNet [21] baseline. However, with the proposed method, we still observe a consistent improvement, as demonstrated in Tab. 3: our method gives a relative improvement of up to 8.38% in patch classification performance, indicating its superiority. To be more principled, we also demonstrate the statistical significance of our improvement over the baseline. Specifically, we adopt non-parametric hypothesis testing, i.e., the classic Mann-Whitney U test [19], which tests whether one random variable is stochastically larger than another. In our setting, the two random variables are the performance of AdaSample and of the HardNet baseline, respectively, and the null hypothesis is that our method does not significantly improve the performance. The p-values under the different experimental settings are summarized in Tab. 3. With a significance level of 0.05, we can reject the null hypothesis in 5 of the 6 experiments in total. For the only exception, i.e., training on Notredame and testing on Liberty, we conjecture that the reason lies in the extremely strong performance of the HardNet baseline (about 0.4% FPR). Therefore, we argue that the statistical significance under the other 5 experimental settings is sufficient to verify the effectiveness of our approach.
6 Conclusion
This paper proposes AdaSample for descriptor learning, which adaptively samples hard positives to construct informative minibatches during training. We demonstrate the efficacy of our method from both theoretical and empirical perspectives. Theoretically, we give a rigorous definition of informativeness of potential training examples. Then, we reformulate the problem and derive a tractable sampling probability expression (Eqn. 13) to generate hardnessaware training triplets. Empirically, we enjoy a consistent and statistically significant performance gain on top of the HardNet [21] baseline when evaluated on various tasks, including patch classification, patch verification, image matching, and patch retrieval.
References
 [1] (2009) Building Rome in a day. In IEEE International Conference on Computer Vision (ICCV), pp. 72–79. Cited by: §1.
 [2] (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §4.2.

 [3] (2017) HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5173–5182. Cited by: §1, §1, Figure 2, §5.1, §5.3, §5.4, Table 2.
 [4] (2016) Learning local feature descriptors with triplets and shallow convolutional neural networks. In British Machine Vision Conference (BMVC), Cited by: §1, §2, §2, §5.1, Table 1.
 [5] (2011) Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 33 (1), pp. 43–57. Cited by: §1, §1, Figure 2, §5.1, §5.2, §5.2, §5.4, Table 1, Table 3.
 [6] (2007) Automatic panoramic image stitching using invariant features. International Journal on Computer Vision (IJCV) 74 (1), pp. 59–73. Cited by: §1.

 [7] (2019) ArcFace: additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4690–4699. Cited by: §3.4.
 [8] (1971) Matroids and the greedy algorithm. Mathematical Programming 1 (1), pp. 127–136. Cited by: §3.2.

 [9] (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. Cited by: §4.2.
 [10] (2015) MatchNet: unifying feature and metric learning for patch-based matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3279–3286. Cited by: §1, §2, Table 1.

 [11] (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034. Cited by: §4.2.
 [12] (2018) Local descriptors optimized for average precision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 596–605. Cited by: §2, §5.1, Table 1.

 [13] (2015) Batch Normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pp. 448–456. Cited by: §4.2.
 [14] (2018) Not all samples are created equal: deep learning with importance sampling. In International Conference on Machine Learning (ICML), PMLR, pp. 2525–2534. Cited by: §4.1.
 [15] (2004) PCASIFT: a more distinctive representation for local image descriptors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 506–513. Cited by: §2.
 [16] (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988. Cited by: §3.2.
 [17] (2004) Distinctive image features from scaleinvariant keypoints. International Journal on Computer Vision (IJCV) 60 (2), pp. 91–110. Cited by: §1, §2, §5.1, §5.2, Table 1.
 [18] (2018) GeoDesc: learning local descriptors by integrating geometry constraints. In European Conference on Computer Vision (ECCV), pp. 168–183. Cited by: §2, §5.1, Table 1, footnote 6.
 [19] (1947) On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, pp. 50–60. Cited by: §5.4.
 [20] (2005) A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pp. 1615–1630. Cited by: §2.
 [21] (2017) Working hard to know your neighbor’s margins: local descriptor learning loss. In Neural Information Processing Systems (NIPS), pp. 4826–4837. Cited by: §1, §1, §2, §2, §3.2, §3.3, §3.4, §3.4, §5.1, §5.1, §5.2, §5.4, §5.4, Table 1, Table 3, §6.

 [22] (2011) Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Neural Information Processing Systems (NIPS), pp. 451–459. Cited by: footnote 3.
 [23] (2014) Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Neural Information Processing Systems (NIPS), pp. 1017–1025. Cited by: footnote 3.
 [24] (2010) Descriptor learning for efficient retrieval. In European Conference on Computer Vision (ECCV), pp. 677–691. Cited by: §1.
 [25] (2015) FaceNet: a unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823. Cited by: §2.
 [26] (2016) Minimizing the maximal loss: how and why. In International Conference on Machine Learning (ICML), pp. 793–801. Cited by: §1, §3.2, §3.2.
 [27] (2015) Discriminative learning of deep convolutional feature point descriptors. In IEEE International Conference on Computer Vision (ICCV), pp. 118–126. Cited by: §5.1, Table 1.
 [28] (2015) Discriminative learning of deep convolutional feature point descriptors. In IEEE International Conference on Computer Vision (ICCV), pp. 118–126. Cited by: §2, §3.4.
 [29] (2019) PyTorch: an imperative style, highperformance deep learning library. In Neural Information Processing Systems (NIPS), Cited by: §5.1.
 [30] (2017) L2Net: deep learning of discriminative patch descriptor in euclidean space. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 661–669. Cited by: §1, §1, §2, §2, §3.2, §3.4, §5.1, §5.1, Table 1.
 [31] (2019) SOSNet: second order similarity regularization for local descriptor learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.4, footnote 6.
 [32] (2016) Instance Normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §4.2.
 [33] (2017) Sampling matters in deep embedding learning. In IEEE International Conference on Computer Vision (ICCV), pp. 2840–2848. Cited by: §2.
 [34] (2016) LIFT: learned invariant feature transform. In European Conference on Computer Vision (ECCV), pp. 467–483. Cited by: §1.
 [35] (2015) Learning to compare image patches via convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4353–4361. Cited by: §1, §2.
 [36] (2019) Learning local descriptors with a cdfbased dynamic soft margin. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2969–2978. Cited by: §2, footnote 6.
 [37] (2017) Learning spreadout local feature descriptors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4595–4603. Cited by: §5.1, Table 1.
 [38] (2019) HardnessAware deep metric learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 72–81. Cited by: §2.