AdaSample: Adaptive Sampling of Hard Positives for Descriptor Learning

11/27/2019 ∙ by Xin-Yu Zhang, et al.

Triplet loss has been employed in a wide range of computer vision tasks, including local descriptor learning. The effectiveness of the triplet loss relies heavily on triplet selection, for which a common practice is to first sample intra-class patches (positives) from the dataset for batch construction and then mine in-batch negatives to form triplets. To collect highly informative triplets, researchers mostly focus on mining hard negatives in the second stage, while paying less attention to constructing informative batches. To alleviate this issue, we propose AdaSample, an adaptive online batch sampler. Specifically, hard positives are sampled based on their informativeness. In this way, we formulate a hardness-aware positive mining pipeline within a novel maximum loss minimization training protocol. The efficacy of the proposed method is evaluated on several standard benchmarks, where it demonstrates a significant and consistent performance gain over strong existing baselines.


1 Introduction

Learning discriminative local descriptors from image patches is a fundamental ingredient of various computer vision tasks, including structure-from-motion [1], image retrieval [24], and panorama stitching [6]. Conventional approaches mostly utilize hand-crafted descriptors, such as SIFT [17], which have been successfully employed in a variety of applications. Recently, with the emergence of large-scale annotated datasets [5, 3], data-driven methods have started to demonstrate their effectiveness, and learning-based descriptors have gradually dominated this field. Specifically, convolutional neural network (CNN) based descriptors [10, 35, 34, 30, 21, 31] can achieve state-of-the-art performance on various tasks, including patch retrieval and 3D reconstruction.

Notably, triplet loss is adopted in many well-performing descriptor learning frameworks. Nevertheless, the quality of the learned descriptors relies heavily on triplet selection, and mining suitable triplets from a large database is challenging. To address this challenge, Balntas et al. [4] propose an in-triplet hard negative mining strategy called anchor swapping. Tian et al. [30] progressively sample unbalanced training pairs in favor of negatives, and Mishchuk et al. [21] further simplify this idea to mine the hardest negatives within the mini-batch. Despite the significant progress in performance and generalization ability, two potential problems remain in the current hardest-in-batch sampling solution: i) hard negatives are mined at the batch level, while randomly selected matching pairs can still be easily discriminated by the descriptor network; ii) it does not take the interaction between the training progress and the hardness of the training samples into consideration. To this end, we propose a novel triplet mining pipeline to adaptively construct high-informativeness batches in a principled manner.

Our proposed method is named AdaSample, in which matching pairs are sampled from the dataset based on their informativeness to construct mini-batches. The methodology is built on an informativeness analysis, where informativeness is defined via the contributing gradients of the potential samples and helps estimate their optimal sampling probabilities. Moreover, we propose a novel training protocol inspired by maximum loss minimization [26] to boost the generalization ability of the descriptor network. Under this training framework, we can adaptively adjust the overall hardness of the training examples fed to the network, based on the training progress. Comprehensive evaluation results and ablation studies on several standard benchmarks [5, 3] demonstrate the effectiveness of our proposed method.

In summary, our contributions are three-fold:

  • We theoretically analyze the informativeness of potential training examples and formulate a principled sampling approach for descriptor learning.

  • We propose a hardness-aware training protocol inspired by maximum loss minimization, in which the overall hardness of the generated triplets is adaptively adjusted to match the training progress.

  • Comprehensive evaluation results on popular benchmarks demonstrate the efficacy of our proposed AdaSample framework.

2 Related work

Local Descriptor Learning.

Traditional descriptors [17, 15] mostly utilize hand-crafted features to extract low-level textures from image patches. The seminal work, i.e., SIFT [17], computes smoothed weighted histograms using the gradient field of the image patch. PCA-SIFT [15] further improves the descriptors by applying Principal Component Analysis (PCA) to the normalized image gradient. A comprehensive overview of the hand-crafted descriptors can be found in [20].

Recently, due to the rapid development of deep learning, CNN-based methods enable us to learn feature descriptors directly from the raw image patches. MatchNet [10] proposes a two-stage Siamese architecture to extract feature embeddings and measure patch similarity, which significantly improves the performance and demonstrates the great potential of CNNs in descriptor learning. DeepDesc [28] trains the network with Euclidean distance and adopts a mining strategy to sample hard examples. DeepCompare [35] explores various architectures of the Siamese network and develops a two-stream network focusing on image centers.

With the advances of metric learning, triplet-based architectures have gradually replaced the pair-based ones. TFeat [4] adopts the triplet loss and mines in-triplet hard negatives with a strategy named anchor swapping. L2-Net [30] employs progressive sampling and requires that matching patches have minimal distances within the mini-batch. HardNet [21] further develops the idea to mine the hardest-in-batch negatives with a simple triplet margin loss. DOAP [12] imposes a ranking-based loss directly optimized for the average precision. GeoDesc [18] further incorporates the geometric constraints from multi-view reconstructions and achieves significant improvement on the 3D reconstruction task. SOSNet [31] proposes a second-order similarity regularization term and achieves more compact patch clusters in the feature space. A very recent work [36] relaxes the hard margin in the triplet margin loss with a dynamic soft margin to avoid manual tuning of the margin.

From prior work, we find that the triplet mining framework can generally be decoupled into two stages, i.e., batch construction from the dataset and triplet generation within the mini-batch. Previous works [4, 30, 21] mostly focus on mining hard negatives in the second stage, while neglecting batch construction in the first stage. Besides, their sampling approaches do not take the training progress into consideration when generating triplets. Therefore, we argue that their triplet mining solutions still cannot exploit the full potential of the entire dataset to produce triplets with suitable hardness. To alleviate this issue, we analyze the contributing gradients of the potential training examples and sample informative matching pairs for batch construction. Then, we propose a hardness-aware training protocol inspired by maximum loss minimization, in which the overall hardness of the selected triplets is correlated with the training progress. Incorporating the hardest-in-batch negative mining solution, we formulate a powerful triplet mining framework, AdaSample, for descriptor learning, in which the quality of the learned descriptors can be significantly improved by a simple sampling strategy.

Hard Negative Mining.

Hard negative mining has been widely used in deep metric learning, such as face verification [25], as it can progressively select hard negatives for triplet loss and Siamese networks to boost the performance and speed up the convergence. FaceNet [25] samples semi-hard triplets within the mini-batch to avoid overfitting the outliers. Wu et al. [33] select training examples based on their relative distances. Zheng et al. [38] augment the training data by adaptively synthesizing hardness-aware and label-preserving examples. However, our sampling solution differs from them in that we analyze the informativeness of the training data and ensure that the sampled data provide the gradients that contribute most to the parameter update. Besides, our method can adaptively adjust the hardness of the selected training data as training progresses. In this way, well-classified samples are filtered out, and the network is always fed with informative triplets with suitable hardness. Comprehensive evaluation results demonstrate consistent performance improvement contributed by our proposed approach.

3 Methodology

3.1 Problem Overview

Given a dataset that consists of $C$ classes, each containing $k$ matching patches, we decompose the triplet generation into two stages. (The term "class" stands for the image patches that come from the same 3D location. For our sampling purpose, patches from a single class are matching, while non-matching pairs come from different classes.) Firstly, we select $n$ matching pairs (positives) to form a mini-batch, where $n$ is the batch size. This is done by our proposed AdaSample, as introduced in Sec. 3.2. Secondly, we mine the hardest-in-batch negatives for each matching pair and use the triplet loss to supervise the network training, as in Sec. 3.3. See Fig. 1 for an illustration of the two-stage sampling pipeline. Finally, the overall solution is summarized in Sec. 3.4.


Figure 1: Illustration of the two-stage descriptor learning pipeline.

3.2 AdaSample

Previous works [30, 21] sample positives randomly to construct mini-batches, yielding a majority of similar matching pairs which can be easily discriminated by the network. This practice may reduce the overall hardness of the triplets. Motivated by the hardest-in-batch mining strategy in [21], a straightforward solution is to select the most dissimilar matching pairs. However, this raises a potential issue: the network may be trained with a bias in favor of the most dissimilar matching pairs, while other cases receive less attention. We validate this solution, referred to as Hardpos, in experiments (Sec. 5.4).

A more principled solution is to sample positives based on their informativeness. Here, we assume that informative pairs are those contributing most to the optimization, namely, providing effective gradients for parameter updates. Therefore, we quantify the informativeness of the matching pairs by measuring their contributing gradients during training. Moreover, we employ maximum loss minimization [26] to improve the generalization ability of the learned model and show that the resulting gradient estimator is an unbiased estimator of the actual gradient. In the following, we introduce our derivation and elaborate on the theoretical justification in Sec. 4.

Informativeness Based Sampling.

In the end-to-end deep learning literature, the training data contribute to optimization via gradients, so we measure the informativeness of training examples by analyzing their resulting gradients. We consider the generic deep learning framework. Let $\{(x_i, y_i)\}_{i=1}^{N}$ be the data-label pairs of the training set, $f(\cdot\,; \theta)$ be the model parameterized by $\theta$, and $\mathcal{L}(\cdot, \cdot)$ be a differentiable loss function. The goal is to find the optimal model parameter $\theta^*$ that minimizes the average loss, i.e.,

$$\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f(x_i; \theta), y_i), \tag{1}$$

where $N$ denotes the number of training examples. Then, we proceed with the following definition of informativeness.

Definition 1.

The informativeness of a training example $(x_i, y_i)$ is quantified by its resulting gradient norm at iteration $t$, namely,

$$\left\| \nabla_{\theta_t} \mathcal{L}(f(x_i; \theta_t), y_i) \right\|_2. \tag{2}$$

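At iteration $t$, let $P_t = \{p_1^t, \dots, p_N^t\}$ be the sampling probabilities of each datum in the training set. More generally, we also re-weight each sample by $w_1^t, \dots, w_N^t$. Let the random variable $I_t$ denote the sampled index at iteration $t$; then $I_t \sim P_t$, namely, $\mathbb{P}(I_t = i) = p_i^t$. We record the re-weighted gradient induced by the training sample $(x_{I_t}, y_{I_t})$ as

$$G_t = w_{I_t}^t \, \nabla_{\theta_t} \mathcal{L}(f(x_{I_t}; \theta_t), y_{I_t}). \tag{3}$$

For simplicity, we omit the superscript $t$ when no ambiguity arises. By setting $w_i = 1 / (N p_i)$, we can make the gradient estimator $G_t$ an unbiased estimator of the actual gradient, i.e.,

$$\mathbb{E}_{I_t \sim P_t}[G_t] = \nabla_{\theta_t} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f(x_i; \theta_t), y_i). \tag{4}$$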
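The unbiasedness claim can be checked numerically. The following is a minimal sketch, not the authors' code: the per-sample gradients and the sampling probabilities are made-up toy values, and the function names are hypothetical. It computes the exact expectation of the re-weighted gradient under a non-uniform sampler and compares it with the plain average gradient.

```python
def reweighted_gradient_mean(grads, probs):
    """Exact expectation of w_I * g_I when I ~ probs and w_i = 1 / (N * p_i)."""
    n = len(grads)
    dim = len(grads[0])
    expectation = [0.0] * dim
    for g, p in zip(grads, probs):
        w = 1.0 / (n * p)  # re-weighting scalar that cancels the sampling bias
        for k in range(dim):
            expectation[k] += p * w * g[k]
    return expectation


def average_gradient(grads):
    """Plain average of the per-sample gradients (the 'actual' gradient)."""
    n = len(grads)
    return [sum(g[k] for g in grads) / n for k in range(len(grads[0]))]


# Toy per-sample gradients and a non-uniform sampling distribution (assumed values).
grads = [[1.0, -2.0], [0.5, 0.5], [-3.0, 4.0], [2.0, 0.0]]
probs = [0.1, 0.2, 0.3, 0.4]

expected = reweighted_gradient_mean(grads, probs)
actual = average_gradient(grads)
print(expected, actual)
```

Any strictly positive sampling distribution gives the same expectation, which is why the choice of distribution can be optimized separately for variance.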

Without loss of generality, we use stochastic gradient descent (SGD) to update model parameters:

$$\theta_{t+1} = \theta_t - \eta_t G_t, \tag{5}$$

where $\eta_t$ is the learning rate at iteration $t$. As the goal is to find the optimal $\theta^*$, we define the expected progress towards the optimum at each iteration as follows.

Definition 2.

At iteration $t$, the expected parameter rectification $R_t$ is defined as the expected reduction of the squared distance between the parameter $\theta$ and the optimum $\theta^*$ after iteration $t$,

$$R_t = -\mathbb{E}_{I_t \sim P_t}\left[ \|\theta_{t+1} - \theta^*\|_2^2 - \|\theta_t - \theta^*\|_2^2 \right]. \tag{6}$$

Generally, training includes tens of thousands of iterations, so the empirical average parameter rectification converges to the average of $R_t$ asymptotically. Therefore, by maximizing $R_t$, we guarantee the most progressive step towards the parameter optimum at each iteration in the expectation sense. Inspired by the greedy algorithm [8], we aim to maximize $R_t$ at each iteration.

It can be shown that maximizing $R_t$ is equivalent to minimizing $\mathrm{Tr}(\mathbb{V}[G_t])$ (Thm. 1). Under this umbrella, we show that the optimal sampling probability is proportional to the per-sample gradient norm (a special case of Thm. 2). Therefore, the optimal sampling probability of each datum happens to be proportional to its informativeness. This property justifies our definition of informativeness as the resulting gradient norm of each training example.
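To make the variance argument concrete, the sketch below compares the trace of the covariance of the re-weighted gradient estimator under uniform sampling and under sampling proportional to the gradient norm. It is an illustration under assumptions: the gradients are toy values, the re-weighting is the unbiased choice of one over the number of samples times the sampling probability, and `trace_of_variance` is a hypothetical helper name.

```python
import math


def trace_of_variance(grads, probs):
    """Tr(V[G]) for G = g_I / (N * p_I) with I ~ probs (all probs must be > 0)."""
    n = len(grads)
    dim = len(grads[0])
    mean = [sum(g[k] for g in grads) / n for k in range(dim)]
    # E[||G||^2] = sum_i p_i * ||g_i||^2 / (N * p_i)^2 = sum_i ||g_i||^2 / (N^2 * p_i)
    second_moment = sum(
        sum(x * x for x in g) / (n * n * p) for g, p in zip(grads, probs)
    )
    return second_moment - sum(m * m for m in mean)


# Toy per-sample gradients (assumed values, all non-zero).
grads = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [0.5, 0.5]]
norms = [math.sqrt(sum(x * x for x in g)) for g in grads]

uniform = [1.0 / len(grads)] * len(grads)
proportional = [v / sum(norms) for v in norms]

var_uniform = trace_of_variance(grads, uniform)
var_proportional = trace_of_variance(grads, proportional)
print(var_uniform, var_proportional)
```

In this toy setting the norm-proportional sampler yields the smaller trace, which matches the Cauchy-Schwarz argument behind the optimal probabilities.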

However, as the neural network has multiple layers with a large number of parameters, it is computationally prohibitive to calculate the full gradient norm. Instead, we prove in Sec. 4.2 that the matching distance in the feature space is a good approximation to the informativeness. (The approximation is up to a constant factor, which is insignificant as it will be offset by the learning rate. The same reasoning applies to the approximation of gradients in the Maximum Loss Minimization paragraph.) Concretely, for each class consisting of $k$ patches $\{x_i\}_{i=1}^{k}$, we first select a patch $x_a$ randomly, which serves as the anchor patch, and then sample a matching patch $x_i$ ($i \neq a$) with probability

$$p_i \propto d(f(x_a), f(x_i)), \tag{7}$$

where $f(x)$ is the extracted descriptor of $x$, and $d(\cdot, \cdot)$ measures the discrepancy of the extracted descriptors. See specific choices of $d$ in Sec. 3.4.
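The distance-proportional selection of a matching patch can be sketched as follows. This is a hedged illustration rather than the authors' implementation: `sample_positive` is a hypothetical name, descriptors are plain Python lists, and Euclidean distance is used as a stand-in discrepancy measure.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two descriptors (stand-in discrepancy measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def sample_positive(descriptors, anchor_idx, rng, distance=euclidean):
    """Pick a matching patch with probability proportional to its distance to the anchor."""
    candidates = [i for i in range(len(descriptors)) if i != anchor_idx]
    weights = [distance(descriptors[anchor_idx], descriptors[i]) for i in candidates]
    if sum(weights) == 0.0:
        # All patches coincide with the anchor: fall back to uniform sampling.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy class with three patches; patch 1 duplicates the anchor, patch 2 is far away.
descriptors = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
picked = sample_positive(descriptors, anchor_idx=0, rng=random.Random(0))
print(picked)
```

Patches that the network already maps close to the anchor receive little probability mass, so the batch is filled with harder positives without always taking the single hardest one.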

Algorithm 1 Pipeline of AdaSample framework.

Require: Dataset of $C$ classes, each containing $k$ matching patches; moving average of history loss $\ell_{avg}$; hyperparameter $\lambda$.
1:  Randomly select $n$ distinct classes from the dataset without replacement;
2:  Extract descriptors of the patches belonging to the selected classes;
3:  for each selected class with patches $\{x_i\}_{i=1}^{k}$ do
4:     Sample an anchor patch randomly;
5:     Sample a matching patch from the remaining patches with probabilities specified by Eqn. 13;
6:  end for
7:  With the sampled positive pairs and their descriptors, compute the Angular Hinge Triplet loss by Eqn. 15;
8:  Backpropagate and update model parameters via the gradient estimator $G_t$.

Maximum Loss Minimization.

Minimizing the average loss may be sub-optimal because the training tends to be overwhelmed by well-classified examples that provide noisy gradients [16]. On the contrary, well-classified examples can be adaptively filtered out by minimizing the maximum loss [26], which can further improve the generalization ability. However, directly minimizing the maximum loss may lead to insufficient usage of training data and sensitivity to outliers, so we approximate the gradient of maximum loss by , in which is sufficiently large. As is used to update parameters, consider its expectation

(8)

To guarantee is an unbiased estimator333 We impose the unbiasedness constraints due to its theoretical convergence guarantees. For example, the non-asymptotic error bound induced by unbiased gradient estimates is referred to [22]. For re-weighted SGD, as in our case, improved convergence rate can be found in [23]. of , it suffices to set

(9)

as in this case,

(10)

Following the previous reasoning, we need to minimize under the constraints specified by Eqn. 9 in order to step most progressively at each iteration. In Thm. 2, we show that the optimal sampling probability and re-weighting scalar should be given by

(11)

As previously claimed, we approximate the gradient norm via the matching distance in the feature space. Besides, in our case, the hinge triplet loss (Eqn. 14) is positively (or even linearly) correlated with the matching distance squared. Therefore, we use the matching distance squared as an approximation of the hinge triplet loss. Thus, the sampling probability and re-weighting scalar are given by

(12)

Moreover, for better approximation, it is preferable to adjust adaptively, namely, to increase with training. Intuitively, when easy matching pairs have been correctly classified, we focus more on hard ones. A good indicator of the training progress is the average loss. As a result, instead of pre-defining a sufficiently large , we set , where is a tunable hyperparameter, and is the moving average of history loss. Formally, we formulate our sampling probability and re-weighting scalar as

(13)

The exponent increases adaptively as training progresses so that hardness-aware training examples can be generated and fed to the network. Our sampling approach is thus named as AdaSample.
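The adaptive behaviour described above can be illustrated with a small sketch. The exact form of the sampling probability is not reproduced here; the code assumes that the probability is proportional to the matching distance raised to the tunable hyperparameter divided by the moving average of the loss, and it assumes an exponential moving average with a hypothetical momentum of 0.9. The names `lam`, `loss_avg`, and both function names are illustrative only.

```python
def adaptive_sampling_probs(distances, lam, loss_avg):
    """Sampling probabilities proportional to distance ** (lam / loss_avg) (assumed form)."""
    exponent = lam / loss_avg
    weights = [d ** exponent for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


def update_loss_average(loss_avg, batch_loss, momentum=0.9):
    """Exponential moving average of the training loss (momentum is an assumption)."""
    return momentum * loss_avg + (1.0 - momentum) * batch_loss


# Toy matching distances for three candidate positives (assumed values).
distances = [0.5, 1.0, 1.5]
early = adaptive_sampling_probs(distances, lam=1.0, loss_avg=1.0)   # high loss early on
late = adaptive_sampling_probs(distances, lam=1.0, loss_avg=0.25)   # low loss later
print(early, late)
```

As the average loss falls, the exponent grows and the distribution concentrates on the most distant (hardest) positive; with the hyperparameter set to zero the sampler is uniform.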

3.3 Triplet generation by hardest-in-batch

AdaSample focuses on the batch construction stage, and for a complete triplet mining framework, we need to mine negatives from the mini-batch as well. Here, we adopt the hardest-in-batch strategy in [21]. Formally, given a mini-batch of matching pairs , let be the descriptors extracted from 444For clarity, denotes the selected matching pairs, with different pairs belonging to different classes. denotes a generic patch in a specific class, where denotes the placeholder for the index.. For each matching pair , we select the non-matching patch which lies closest to one of the matching patches in the feature space. Then, the Hinge Triplet (HT) loss is defined as follows:

(14)

where denotes the margin. Incorporating the re-weighting scalar, we update the model parameter via the gradient estimator .
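A hardest-in-batch hinge triplet loss can be sketched as follows. This is an illustration in the style of the strategy in [21], not the authors' code: it assumes un-squared Euclidean distances, a margin of 1.0, and a batch of at least two pairs, and the function name is hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hardest_in_batch_triplet_loss(anchors, positives, margin=1.0, distance=euclidean):
    """Mean hinge triplet loss with the closest in-batch non-matching patch as negative."""
    n = len(anchors)
    losses = []
    for i in range(n):
        d_pos = distance(anchors[i], positives[i])
        # Hardest negative: the non-matching patch closest to either member of pair i.
        d_neg = min(
            min(distance(anchors[i], positives[j]), distance(positives[i], anchors[j]))
            for j in range(n)
            if j != i
        )
        losses.append(max(0.0, margin + d_pos - d_neg))
    return sum(losses) / n


# Toy batch of two matching pairs arranged on a unit square (assumed values).
anchors = [[0.0, 0.0], [1.0, 0.0]]
positives = [[0.0, 1.0], [1.0, 1.0]]
loss = hardest_in_batch_triplet_loss(anchors, positives)
print(loss)
```

Because negatives come from the other pairs in the batch, the quality of the batch construction stage directly determines how hard the resulting triplets can be.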

3.4 Distance Metric

Euclidean distance is widely used in previous works [28, 30, 21, 31]. However, as the descriptors lie on the unit hypersphere in 128-dimensional space (Sec. 5.1), it is more natural to adopt the geodesic distance of the embedded manifold. Therefore, we adopt the angular distance [7] as follows:

$$d(f_1, f_2) = \arccos \langle f_1, f_2 \rangle, \tag{15}$$

where $f_1$ and $f_2$ are unit-norm descriptors and $\langle \cdot, \cdot \rangle$ denotes the inner product operator. We name our loss function the Angular Hinge Triplet (AHT) loss, which is demonstrated to result in consistent performance improvement (Sec. 5.4).
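The geodesic distance on the unit hypersphere is the angle between two descriptors. A minimal sketch, assuming plain Python lists as descriptors and a hypothetical function name; the explicit normalization and the clamping step are defensive additions for numerical safety, not something stated in the text.

```python
import math


def angular_distance(u, v):
    """Angle (in radians) between two non-zero descriptors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding errors cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


print(angular_distance([1.0, 0.0], [0.0, 1.0]))
```

For unit-norm descriptors the angle and the Euclidean distance are monotonically related, so both rank pairs identically; they differ in how strongly large separations are penalized.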

Alg. 1 summarizes the overall triplet generation framework. For each training iteration, we first randomly pick $n$ distinct classes from the dataset and extract descriptors for patches belonging to these classes (Step 1, 2). Then, we randomly choose a patch as the anchor from each of the selected classes (Step 4) and adopt our proposed AdaSample to select an informative matching patch (Step 5). With the generated mini-batch, we mine hard negatives following [21] and compute Angular Hinge Triplet (AHT) loss (Step 7).

4 Theoretical Analysis

In this section, we complete the theoretical analysis of informativeness in Sec. 4.1, and prove that the matching distance can serve as a good approximation of informativeness in Sec. 4.2.

4.1 Informativeness Formulation

Following notations in Sec. 3.2, we reformulate $R_t$ (Eqn. 6), and give an equivalent condition for maximizing $R_t$. The same conclusion can be found in [14].

Theorem 1.

Let $R_t$, $\theta^*$, and $G_t$ be defined as in Eqn. 6, 1 and 3, respectively. Then, we have

$$R_t = 2\eta_t (\theta_t - \theta^*)^\top \mathbb{E}[G_t] - \eta_t^2 \, \mathbb{E}[G_t]^\top \mathbb{E}[G_t] - \eta_t^2 \, \mathrm{Tr}(\mathbb{V}[G_t]). \tag{16}$$

Due to unbiasedness (Eqn. 4), the first two terms in Eqn. 16 are fixed, so maximizing $R_t$ is equivalent to minimizing $\mathrm{Tr}(\mathbb{V}[G_t])$. Thm. 2 specifies the optimal probabilities to minimize the aforementioned trace under a more general assumption.
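The decomposition in Thm. 1 follows from expanding one SGD step. The short derivation below is a sketch in assumed notation: $\theta_t$ is the parameter at iteration $t$, $\theta^*$ the optimum, $\eta_t$ the learning rate, $G_t$ the re-weighted gradient estimator, and $R_t$ the expected parameter rectification; the update $\theta_{t+1} = \theta_t - \eta_t G_t$ is taken as given.

```latex
\begin{align*}
\|\theta_{t+1} - \theta^*\|_2^2
  &= \|\theta_t - \theta^* - \eta_t G_t\|_2^2 \\
  &= \|\theta_t - \theta^*\|_2^2
     - 2\eta_t (\theta_t - \theta^*)^\top G_t
     + \eta_t^2 \, G_t^\top G_t, \\
R_t
  &= -\mathbb{E}\!\left[\|\theta_{t+1} - \theta^*\|_2^2 - \|\theta_t - \theta^*\|_2^2\right] \\
  &= 2\eta_t (\theta_t - \theta^*)^\top \mathbb{E}[G_t]
     - \eta_t^2 \, \mathbb{E}[G_t^\top G_t], \\
\mathbb{E}[G_t^\top G_t]
  &= \mathbb{E}[G_t]^\top \mathbb{E}[G_t] + \mathrm{Tr}(\mathbb{V}[G_t]).
\end{align*}
```

Substituting the last identity into the expression for $R_t$ gives the three-term form, in which only the trace term depends on the sampling distribution once the estimator is unbiased.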

Theorem 2.

Let be defined in Eqn. 3 and suppose the sampled index obeys distribution . Then, given the constraints , is minimized by the following optimal sampling probabilities:

(17)
Proof.

As is an unbiased estimator of the actual gradient (Eqn. 4), is fixed in our case, denoted by for short. By the linearity of trace and , we have

(18)

Mathematically, given the constraints

, the aforementioned harmonic mean of

reaches its minimum when the probabilities satisfy

(19)

Dividing by a normalization factor, we get the expression in Eqn. 17. ∎

Note that in the special case of , the constraints degrade into , and the optimal sampling probabilities become .

4.2 Approximation of Informativeness

As mentioned in Sec. 3.2, the matching distance can serve as a good approximation of informativeness. We justify this here. For simplicity, we introduce some notations for a

-layer multi-layer perceptron (MLP). Let

be the weight matrix for layer and

be a Lipschitz continuous activation function. Then the multi-layer perceptron can be formulated as follows:

(20)

Note that although our notations describe only MLPs without bias, our analysis holds for any affine transformation followed by a Lipschitz continuous non-linearity. Therefore, our reasoning can naturally extend to CNNs. With

(21)

we have

(22)

Various data preprocessing, weight initialization [9, 11], and activation normalization [13, 2, 32] techniques uniformize the activations of each layer across samples. Therefore, the variation of gradient norms is mostly captured by the gradient of the loss function w.r.t. the output of neural networks,

(23)

where is a constant, and serves as a precise approximation of the full gradient norm. For simplicity, we consider hinge triplet loss (Eqn. 14) here. Then, the gradient norm w.r.t. the descriptor of the matching patch is just twice the matching distance555This relation holds only when the hinge triplet loss is positive. Empirically, due to the relatively large margin, the hinge loss never becomes zero.,

(24)

As a result, we reach the conclusion that the matching distance is a good approximation to the informativeness. Also, we empirically verify this in Sec. 5.4.

5 Experiments

Descriptor            Length  Train  Notredame  Yosemite   Liberty    Yosemite   Liberty    Notredame  Mean
                              Test   Liberty               Notredame             Yosemite
SIFT [17]             128            29.84                 22.53                 27.29                 26.55
DeepDesc [27]         128            10.9                  4.40                  5.69                  6.99
GeoDesc [18]          128            5.47                  1.94                  4.72                  4.05
MatchNet [10]         4096           7.04       11.47      3.82       5.65       11.60      8.70       8.05
L2-Net [30]           128            3.64       5.29       1.15       1.62       4.43       3.30       3.24
CS-L2-Net [30]        256            2.55       4.24       0.87       1.39       3.81       2.84       2.61
HardNet [21]          128            1.47       2.67       0.62       0.88       2.14       1.65       1.57
HardNet-GOR [37]      128            1.72       2.89       0.63       0.91       2.10       1.59       1.64
HardNet*              128            1.80       2.89       0.68       0.90       1.93       1.71       1.65
AdaSample* (Ours)     128            1.64       2.62       0.61       0.88       1.92       1.46       1.52
TFeat-M+ [4]          128            7.39       10.31      3.06       3.80       8.06       7.24       6.64
L2-Net+ [30]          128            2.36       4.70       0.72       1.29       2.57       1.71       2.23
CS-L2-Net+ [30]       256            1.71       3.87       0.56       1.09       2.07       1.30       1.76
HardNet+ [21]         128            1.49       2.51       0.53       0.78       1.96       1.84       1.51
HardNet-GOR+ [37]     128            1.48       2.43       0.51       0.78       1.76       1.53       1.41
DOAP+ [12]            128            1.54       2.62       0.43       0.87       2.00       1.21       1.45
HardNet+*             128            1.32       2.31       0.41       0.67       1.51       1.24       1.24
AdaSample+* (Ours)    128            1.25       2.21       0.40       0.63       1.40       1.14       1.17
Table 1: Patch classification results on UBC Phototour dataset [5]. The false positive rate at 95% recall is reported. + indicates data augmentation and * indicates positive generation.


Figure 2: Evaluation results on HPatches dataset [3]. By default, descriptors are trained on Liberty subset of UBC Phototour [5] dataset, and “-HP” indicates descriptors trained on HPatches training set of split a. Marker color indicates the level of geometrical noises and marker type indicates the experimental setup. Inter and Intra indicate the source of negative examples for the verification task. Viewp and Illum indicate the sequence type for the matching task.

5.1 Implementation Details

We adopt the architecture of L2-Net [30] to embed local descriptors into the unit hypersphere in 128-dimensional space. Following prior works [30, 21], all patches are resized to and normalized to zero per-patch mean and unit per-patch variance. We train our model from scratch in PyTorch library [29] using SGD optimizer with initial learning rate , momentum , and weight decay . Batch size is , margin , and unless otherwise specified. We generate matching pairs for each epoch, and the total number of epochs is . The learning rate is divided by at the end of , , epochs.
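The per-patch normalization mentioned above (zero mean and unit variance for each patch) can be sketched as follows. This is an illustration only: patches are represented as nested Python lists, and the small `eps` guard against constant patches is an assumption, not a detail from the text.

```python
import math


def normalize_patch(patch, eps=1e-8):
    """Shift and scale one patch to zero mean and (approximately) unit variance."""
    pixels = [v for row in patch for v in row]
    mean = sum(pixels) / len(pixels)
    variance = sum((v - mean) ** 2 for v in pixels) / len(pixels)
    std = math.sqrt(variance)
    # eps avoids division by zero for constant patches.
    return [[(v - mean) / (std + eps) for v in row] for row in patch]


# Toy 2x2 patch (assumed values).
normalized = normalize_patch([[0.0, 1.0], [2.0, 3.0]])
print(normalized)
```

Normalizing each patch independently removes brightness and contrast offsets before the patch reaches the network.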

We compare our method with both handcrafted and deep methods, including SIFT [17], DeepDesc [27], TFeat [4], L2-Net [30], HardNet [21], HardNet with global orthogonal regularization (GOR) [37], DOAP [12], and GeoDesc [18]. (Note that the training dataset of GeoDesc [18] is not released, so the comparison may be unfair. Besides, some recent works [31, 36] explore different directions, and their training codes are not publicly available, so we leave the efficacy comparison and system combination to future work.) Comprehensive evaluation results and ablation studies on two standard descriptor datasets, UBC Phototour [5] (Sec. 5.2) and HPatches [3] (Sec. 5.3), demonstrate the efficacy of our proposed sampling framework.

5.2 UBC Phototour

UBC Phototour [5], also known as Brown dataset, consists of three subsets: Liberty, Notre Dame, and Yosemite, with about normalized patches in each subset. Keypoints are detected by DoG detector [17] and verified by model. The testing set consists of matching and non-matching pairs for each sequence. For evaluation, models are trained on one subset and tested on the other two. The metric is the false positive rate (FPR) at 95% true positive recall. The evaluation results are reported in Tab. 1.
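The metric can be computed from labeled pair distances as in the sketch below. It is a simplified illustration, not the benchmark's official evaluation script: the function name is hypothetical, smaller distances are assumed to mean more similar patches, and the threshold is the smallest one that accepts the requested fraction of matching pairs.

```python
import math


def fpr_at_recall(labels, distances, recall=0.95):
    """False positive rate at the distance threshold reaching the given recall.

    labels: 1 for matching pairs, 0 for non-matching pairs.
    """
    positives = sorted(d for label, d in zip(labels, distances) if label == 1)
    negatives = [d for label, d in zip(labels, distances) if label == 0]
    # Smallest threshold that accepts at least `recall` of the matching pairs.
    k = math.ceil(recall * len(positives))
    threshold = positives[k - 1]
    false_positives = sum(1 for d in negatives if d <= threshold)
    return false_positives / len(negatives)


# Toy distances for four matching and four non-matching pairs (assumed values).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
distances = [0.1, 0.2, 0.3, 0.4, 0.15, 0.25, 0.5, 0.6]
print(fpr_at_recall(labels, distances, recall=0.5))
```

Lower values are better: the metric counts how many non-matching pairs slip under the threshold needed to recover the requested share of true matches.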

Our method outperforms other approaches by a significant margin. We randomly flip and rotate by degrees for data augmentation, denoted by +. Besides, for our method, we also generate positive patches by random rotation such that each class has patches, denoted by *. We augment matching pairs because there are too few patches (two or three) per class in the UBC Phototour dataset [5], which limits the capacity of our method. To analyze its effect, we also apply it to the HardNet [21] baseline. Our method consistently outperforms the baseline, indicating the effectiveness of our adaptive sampling solution.

5.3 HPatches

HPatches [3] consists of sequences of images. The dataset is split into two parts: viewpoint (sequences with significant viewpoint change) and illumination (sequences with significant illumination change). According to the level of geometric noise, the patches can be further divided into three groups: easy, hard, and tough. There are three evaluation tasks: patch verification, image matching, and patch retrieval. Following standard evaluation protocols of the dataset, we show results in Fig. 2. Our method compares favorably with other methods on the patch verification task, which is consistent with the patch classification results in Tab. 1. Furthermore, our descriptors achieve the best results on the more challenging image matching and patch retrieval tasks, indicating the improved generalization ability contributed by our approach.

5.4 Ablation Study

Informativeness Approximation.

We empirically verify the conclusion in Sec. 4.2 that the probability induced by the matching distance closely approximates the one induced by informativeness (Fig. 3, Left). Besides, the results show that the Pearson correlation is consistently greater than during training (Fig. 3, Right), which indicates that these probabilities are strongly correlated.

Figure 3: (Left) Probabilities induced by informativeness and matching distance. (Right) Pearson correlation between probabilities and training epochs.

Impact of $\lambda$ and Distance Metric.

We experiment with varying $\lambda$ in AdaSample to control the overall hardness of the selected matching pairs. A large $\lambda$ indicates that hard matching pairs are more likely to be selected. When $\lambda = 0$, our method degrades into random sampling and the overall framework becomes HardNet [21], and as $\lambda \to \infty$, the framework becomes Hardpos. Therefore, both HardNet and Hardpos are special cases of our proposed AdaSample. Tab. 2 shows the results on HPatches [3] dataset, where leads to the best results in most cases. It demonstrates the advantages of our balanced sampling strategy against the hardest solution. Also, Tab. 2 demonstrates that the angular hinge triplet (AHT) loss outperforms the commonly-used hinge triplet (HT) loss in most cases.

Task Verification Matching Retrieval
Loss AHT HT AHT HT AHT HT
93.84 93.17 64.09 62.64 81.26 79.97
94.72 94.56 66.04 65.92 83.58 83.34
94.78 94.76 65.89 65.68 83.80 83.54
94.78 94.60 65.46 65.37 83.98 83.62
94.60 94.69 64.56 64.84 83.56 83.69
94.42 94.51 63.81 64.02 83.41 83.29
Table 2: Ablation studies on the impact of $\lambda$. All experiments are conducted on HPatches [3] benchmark.

Stability and Reproducibility.

Sampling is inherently stochastic. To ensure reproducibility, we conduct five runs with different random seeds and report the means and standard deviations of the patch classification results in Tab. 3. The results demonstrate the stability of our sampling solution. A possible explanation of the stability is the unbiasedness of the gradient estimator (Eqn. 10). As the number of training triplets is huge, the estimated gradients converge to the actual gradient asymptotically. Therefore, the gradients can guide the network towards the parameter optimum as training progresses, regardless of the specific random seed.

Train   Test   HardNet+*        AdaSample+*      Rel      p value
Notr    Lib    1.316 ± 0.044    1.254 ± 0.026    4.71%    0.031
Yos     Lib    2.310 ± 0.063    2.212 ± 0.049    4.28%    0.018
Lib     Notr   0.406 ± 0.011    0.400 ± 0.016    1.58%    0.337
Yos     Notr   0.671 ± 0.010    0.627 ± 0.012    6.62%    0.006
Lib     Yos    1.513 ± 0.084    1.395 ± 0.050    7.80%    0.030
Notr    Yos    1.241 ± 0.044    1.137 ± 0.036    8.38%    0.011
Table 3: Reproducibility and statistical significance of our proposed AdaSample. The repeated experiments are conducted on UBC Phototour [5] dataset. Here, “Rel” indicates the relative improvement upon the HardNet [21] baseline.

Statistical Significance.

Since previous methods are approaching saturation in terms of performance on the UBC Phototour [5] dataset, it is challenging to make progress on top of the HardNet [21] baseline. However, with the proposed method, we still observe a consistent improvement, as demonstrated in Tab. 3. Our method gives a relative reduction of up to 8.38% in the patch classification FPR. To be more principled, we also demonstrate the statistical significance of our improvement upon the baseline. Specifically, we adopt non-parametric hypothesis testing, i.e., the classic Mann-Whitney U test [19], to test whether one random variable is stochastically larger than the other. In our setting, the two random variables are the performance of AdaSample and the HardNet baseline, respectively, and the null hypothesis is that our method cannot significantly improve the performance. The p values under different experimental settings are summarized in Tab. 3. With a significance level of 0.05, we can reject the null hypothesis in 5 of the 6 experiments. For the only anomaly, i.e., training on Liberty and testing on Notredame, we conjecture that the reason lies in the extremely high performance of the HardNet baseline (about 0.4% in terms of FPR). Therefore, we argue that the statistical significance under the other 5 experimental settings is sufficient to verify the effectiveness of our approach.
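For small samples such as five runs per method, a one-sided Mann-Whitney test can be evaluated exactly by enumerating all relabelings. The sketch below is an illustration under assumptions: the error values are made-up toy numbers and not results from this paper, the function names are hypothetical, and whether the reported p values come from an exact or an asymptotic variant is not stated in the text.

```python
from itertools import combinations


def mann_whitney_u(x, y):
    """U statistic: number of pairs with x_i < y_j (ties count one half)."""
    return sum(1.0 if a < b else 0.5 if a == b else 0.0 for a in x for b in y)


def exact_one_sided_p(x, y):
    """P(U >= observed U) under the null, by enumerating every split of the pooled data."""
    pooled = list(x) + list(y)
    observed = mann_whitney_u(x, y)
    count = 0
    total = 0
    for idx in combinations(range(len(pooled)), len(x)):
        chosen = set(idx)
        xs = [pooled[i] for i in idx]
        ys = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if mann_whitney_u(xs, ys) >= observed:
            count += 1
    return count / total


# Made-up error rates for five runs each (lower is better); not data from the paper.
ours = [1.0, 1.1, 1.2, 1.3, 1.4]
baseline = [1.5, 1.6, 1.7, 1.8, 1.9]
p_value = exact_one_sided_p(ours, baseline)
print(p_value)
```

With five runs per method there are 252 possible splits, so the smallest attainable one-sided p value is 1/252, reached when every run of one method beats every run of the other.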

6 Conclusion

This paper proposes AdaSample for descriptor learning, which adaptively samples hard positives to construct informative mini-batches during training. We demonstrate the efficacy of our method from both theoretical and empirical perspectives. Theoretically, we give a rigorous definition of the informativeness of potential training examples. Then, we reformulate the problem and derive a tractable expression for the sampling probability (Eqn. 13) to generate hardness-aware training triplets. Empirically, AdaSample achieves a consistent and statistically significant performance gain on top of the HardNet [21] baseline when evaluated on various tasks, including patch classification, patch verification, image matching, and patch retrieval.

References

  • [1] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski (2009) Building Rome in a day. In IEEE International Conference on Computer Vision (ICCV), pp. 72–79. Cited by: §1.
  • [2] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §4.2.
  • [3] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk (2017) HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5173–5182. Cited by: §1, §1, Figure 2, §5.1, §5.3, §5.4, Table 2.
  • [4] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk (2016) Learning local feature descriptors with triplets and shallow convolutional neural networks. In British Machine Vision Conference (BMVC). Cited by: §1, §2, §2, §5.1, Table 1.
  • [5] M. Brown, G. Hua, and S. Winder (2011) Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 33 (1), pp. 43–57. Cited by: §1, §1, Figure 2, §5.1, §5.2, §5.2, §5.4, Table 1, Table 3.
  • [6] M. Brown and D. G. Lowe (2007) Automatic panoramic image stitching using invariant features. International Journal of Computer Vision (IJCV) 74 (1), pp. 59–73. Cited by: §1.
  • [7] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) ArcFace: additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4690–4699. Cited by: §3.4.
  • [8] J. Edmonds (1971) Matroids and the greedy algorithm. Mathematical programming 1 (1), pp. 127–136. Cited by: §3.2.
  • [9] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §4.2.
  • [10] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg (2015) MatchNet: unifying feature and metric learning for patch-based matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3279–3286. Cited by: §1, §2, Table 1.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034. Cited by: §4.2.
  • [12] K. He, Y. Lu, and S. Sclaroff (2018) Local descriptors optimized for average precision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 596–605. Cited by: §2, §5.1, Table 1.
  • [13] S. Ioffe and C. Szegedy (2015) Batch Normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pp. 448–456. Cited by: §4.2.
  • [14] A. Katharopoulos and F. Fleuret (2018) Not all samples are created equal: deep learning with importance sampling. PMLR, pp. 2525–2534. Cited by: §4.1.
  • [15] Y. Ke, R. Sukthankar, et al. (2004) PCA-SIFT: a more distinctive representation for local image descriptors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 506–513. Cited by: §2.
  • [16] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988. Cited by: §3.2.
  • [17] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV) 60 (2), pp. 91–110. Cited by: §1, §2, §5.1, §5.2, Table 1.
  • [18] Z. Luo, T. Shen, L. Zhou, S. Zhu, R. Zhang, Y. Yao, T. Fang, and L. Quan (2018) GeoDesc: learning local descriptors by integrating geometry constraints. In European Conference on Computer Vision (ECCV), pp. 168–183. Cited by: §2, §5.1, Table 1, footnote 6.
  • [19] H. B. Mann and D. R. Whitney (1947) On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, pp. 50–60. Cited by: §5.4.
  • [20] K. Mikolajczyk and C. Schmid (2005) A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pp. 1615–1630. Cited by: §2.
  • [21] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas (2017) Working hard to know your neighbor’s margins: local descriptor learning loss. In Neural Information Processing Systems (NIPS), pp. 4826–4837. Cited by: §1, §1, §2, §2, §3.2, §3.3, §3.4, §3.4, §5.1, §5.1, §5.2, §5.4, §5.4, Table 1, Table 3, §6.
  • [22] E. Moulines and F. R. Bach (2011) Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Neural Information Processing Systems (NIPS), pp. 451–459. Cited by: footnote 3.
  • [23] D. Needell, R. Ward, and N. Srebro (2014) Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Neural Information Processing Systems (NIPS), pp. 1017–1025. Cited by: footnote 3.
  • [24] J. Philbin, M. Isard, J. Sivic, and A. Zisserman (2010) Descriptor learning for efficient retrieval. In European Conference on Computer Vision (ECCV), pp. 677–691. Cited by: §1.
  • [25] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823. Cited by: §2.
  • [26] S. Shalev-Shwartz and Y. Wexler (2016) Minimizing the maximal loss: how and why. In International Conference on Machine Learning (ICML), pp. 793–801. Cited by: §1, §3.2, §3.2.
  • [27] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer (2015) Discriminative learning of deep convolutional feature point descriptors. In IEEE International Conference on Computer Vision (ICCV), pp. 118–126. Cited by: §5.1, Table 1.
  • [28] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer (2015) Discriminative learning of deep convolutional feature point descriptors. In IEEE International Conference on Computer Vision (ICCV), pp. 118–126. Cited by: §2, §3.4.
  • [29] B. Steiner, Z. DeVito, S. Chintala, S. Gross, A. Paszke, F. Massa, A. Lerer, G. Chanan, Z. Lin, E. Yang, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Neural Information Processing Systems (NIPS). Cited by: §5.1.
  • [30] Y. Tian, B. Fan, and F. Wu (2017) L2-Net: deep learning of discriminative patch descriptor in Euclidean space. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 661–669. Cited by: §1, §1, §2, §2, §3.2, §3.4, §5.1, §5.1, Table 1.
  • [31] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas (2019) SOSNet: second order similarity regularization for local descriptor learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2, §3.4, footnote 6.
  • [32] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance Normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §4.2.
  • [33] C. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl (2017) Sampling matters in deep embedding learning. In IEEE International Conference on Computer Vision (ICCV), pp. 2840–2848. Cited by: §2.
  • [34] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua (2016) LIFT: learned invariant feature transform. In European Conference on Computer Vision (ECCV), pp. 467–483. Cited by: §1.
  • [35] S. Zagoruyko and N. Komodakis (2015) Learning to compare image patches via convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4353–4361. Cited by: §1, §2.
  • [36] L. Zhang and S. Rusinkiewicz (2019) Learning local descriptors with a CDF-based dynamic soft margin. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2969–2978. Cited by: §2, footnote 6.
  • [37] X. Zhang, F. X. Yu, S. Kumar, and S. Chang (2017) Learning spread-out local feature descriptors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4595–4603. Cited by: §5.1, Table 1.
  • [38] W. Zheng, Z. Chen, J. Lu, and J. Zhou (2019) Hardness-Aware deep metric learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 72–81. Cited by: §2.