1 Introduction
Capturing visual similarity between images is the core of virtually every computer vision task, such as image retrieval [60, 52, 38, 35], pose understanding [34, 7, 2, 53, 48] and style transfer [27]. Measuring similarity requires finding a representation which maps similar images close together and dissimilar images far apart. This task is naturally formulated as Deep Metric Learning (DML), in which individual pairs of images are compared [16, 52, 37] or contrasted against a third image [48, 60, 57] to learn a distance metric that reflects image similarity. Such triplet learning constitutes the basis of powerful learning algorithms [44, 38, 46, 62]. However, with growing training set size, leveraging every single triplet for learning becomes computationally infeasible, limiting training to only a subset of all possible triplets. Thus, a careful selection of those triplets which drive learning best is crucial. This raises the question: how do we determine which triplets to present to our model, and when, during training?
As training progresses, more and more triplet relations will be correctly represented by the model. Thus, ever fewer triplets will still provide novel, valuable information. Conversely, leveraging only triplets which are hard to learn [48, 8, 63] and therefore informative impairs optimization due to high gradient variance [60]. Consequently, a reasonable mixture of triplets with varying difficulty would provide an informative and stable training signal. The question remains: when to present which triplet? Sampling from a fixed distribution over difficulties may serve as a simple proxy [60] and is a typical remedy in representation learning in general [26, 4]. However, (i) choosing a proper distribution is difficult; (ii) the abilities and state of our model evolve as training progresses, so a fixed distribution cannot optimally support every stage of training; and (iii) triplet sampling should actively contribute to the learning objective rather than being chosen independently. Since a manually predefined sampling distribution does not fulfill these requirements, we need to learn and adapt it while training a representation.
Such online adaptation of the learning algorithm, and of the parameters that control it, is typically framed as a teacher-student setup and optimized using Reinforcement Learning (RL). When modelling a flexible sampling process (the student), a controller network (the teacher) learns to adjust the sampling such that the DML model is steadily provided with an optimal training signal. Fig. 1 compares progressions of learned sampling distributions adapted to the DML model with a typical fixed sampling distribution [60].
This paper presents how to learn a novel triplet sampling strategy which effectively supports the learning process of a DML model at every stage of training. To this end, we model a sampling distribution such that it is easily adjustable to yield triplets of arbitrary mixtures of difficulty. To adapt it to the training state of the DML model, we employ Reinforcement Learning to update the adjustment policy. Directly optimizing the policy to improve performance on a held-back validation set adjusts the sampling process to optimally support DML training. Experiments show that our adaptive sampling strategy significantly improves over fixed, manually designed triplet sampling strategies on multiple datasets. Moreover, we perform diverse analyses and ablations to provide additional insights into our method.
2 Related Work
Metric learning has become the leading paradigm for learning distances between images, with a broad range of applications including image retrieval [36, 30, 60], image classification [10, 64], face verification [48, 19, 31] and human pose analysis [34, 7]. Ranking losses formulated on pairs [52, 16], triplets [48, 60, 57, 11] or even higher-order tuples of images [6, 37, 58] have emerged as the most widely used basis for DML [45]. As datasets have grown larger with the advent of CNNs, different strategies have been developed to cope with the increasing complexity of the learning problem.
Complexity management in DML: The main line of research is negative sampling strategies [48, 60, 17] based on distances between an anchor and a negative image. FaceNet [48] leverages only the hard negatives in a minibatch. Wu et al. [60] sample negatives uniformly over the whole range of distances to avoid large gradient variance during optimization. Harwood et al. [17] restrict and control the search space for triplets using precomputed sets of nearest neighbors, estimating triplet utility by linearly regressing the training loss. Each of these successfully enables effective DML training. However, these works are based on fixed, manually predefined sampling strategies. In contrast, we learn an adaptive sampling strategy which provides an optimal input stream of triplets conditioned on the training state of our model.
Orthogonal to sampling negatives from the training set is the generation of hard negatives in the form of images [8] or feature vectors [65, 63]. These approaches thus also resort to hard negatives, while our sampling process yields negatives of any mixture of difficulty depending on the model state.
Finally, proxy-based techniques reduce the complexity of the learning problem by learning one [36] or more [42] virtual representatives for each class, which are used as negatives. These approaches approximate the negative distributions, while our sampling adaptively yields individual negative samples.
Advanced DML: Building on the standard DML losses, many works improve model performance using more advanced techniques. Ensemble methods [38, 62, 46] learn and combine multiple embedding spaces to capture more information. HORDE [22] additionally forces feature representations of related images to have matching higher moments. Roth et al. [44] combine class-discriminative features with features learned from characteristics shared across classes. Similarly, Lin et al. [30] propose to learn the intra-class distributions in addition to the inter-class distribution. All these approaches are applied on top of the standard ranking losses discussed above. In contrast, our work presents a novel triplet sampling strategy and is thus complementary to these advanced DML methods.
Adaptive Learning: Curriculum Learning [3] gradually increases the difficulty of the samples presented to the model. Hacohen et al. [15] employ a batch-based learnable scoring function to provide a batch curriculum for training, while we learn how to adapt a sampling process to the training state. Graves et al. [14] divide the training data into fixed subsets before learning in which order to use them for training. Further, Gopal et al. [13] employ an empirical online importance sampling distribution over inputs based on their gradient magnitudes during training. Similarly, Shreyas et al. [47] learn an importance sampling over instances. In contrast, we learn an online policy for selecting triplet negatives, and thus instance relations. Meta Learning aims at learning how to learn. It has been successfully applied to various components of a learning process, such as activation functions [43], input masking [9], self-supervision [5], finetuning [51, 20], optimizer parameters [1] and model architectures [41, 61]. In this work, we learn a sampling distribution to improve triplet-based learning.
3 Distance-based Sampling for DML
Let $\phi(x) \in \mathbb{R}^D$ be a $D$-dimensional embedding of an image $x$, with $\phi$ represented by a deep neural network parametrized by $\theta$. Further, $\phi$ is normalized to a unit hypersphere $\mathbb{S}^{D-1}$ for regularization purposes [48]. The objective of DML is then to learn $\phi$ such that images are mapped close to one another if they are similar and far apart otherwise, under a standard distance function $d(\phi(x_i), \phi(x_j))$. Commonly, $d$ is the Euclidean distance, i.e. $d_{ij} := \|\phi(x_i) - \phi(x_j)\|_2$.
A popular family of training objectives for learning $\phi$ are ranking losses [48, 60, 52, 37, 16] operating on tuples of images. Their most widely used representative is arguably the triplet loss [48], which is defined as an ordering task between images $(x_a, x_p, x_n)$, formulated as
$\mathcal{L}_{tri}(x_a, x_p, x_n) = \max\left(0,\; d_{ap} - d_{an} + \gamma\right) \qquad (1)$
Here, $x_a$ and $x_p$ are the anchor and positive sharing the same class label, and $x_n$ acts as the negative from a different class. Optimizing $\mathcal{L}_{tri}$ pushes $x_p$ closer to $x_a$ and $x_n$ further away from $x_a$, as long as the constant distance margin $\gamma$ is violated.
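As an illustration, a minimal NumPy sketch of the triplet loss in Eq. 1 (the margin value gamma=0.2 is an arbitrary placeholder, not a setting from the paper):

```python
import numpy as np

def triplet_loss(phi_a, phi_p, phi_n, gamma=0.2):
    """Triplet loss (Eq. 1): penalize when the anchor-negative distance
    does not exceed the anchor-positive distance by the margin gamma."""
    d_ap = np.linalg.norm(phi_a - phi_p)  # anchor-positive distance
    d_an = np.linalg.norm(phi_a - phi_n)  # anchor-negative distance
    return max(0.0, d_ap - d_an + gamma)
```

A triplet whose negative lies well beyond the positive plus margin incurs zero loss and thus contributes no gradient, which is precisely why sampling informative triplets matters.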
3.1 Static Triplet sampling strategies
While ranking losses have proven powerful, the number of possible tuples grows dramatically with the size of the training set. Thus, training quickly becomes infeasible, turning efficient tuple sampling strategies into a key component of successful learning, as discussed next.
When performing DML using ranking losses like Eq. 1, fewer and fewer triplets violate the triplet margin as training progresses. Naively employing random triplet sampling therefore entails many of the selected triplets being uninformative, as distances on $\mathbb{S}^{D-1}$ are strongly biased towards larger values due to the normalization of $\phi$. Consequently, recent sampling strategies explicitly leverage triplets which violate the triplet margin and, thus, are difficult and informative.
(Semi-)Hard negative sampling: Hard negative sampling methods focus on triplets violating the margin the most, i.e. sampling negatives $x_n$ with minimal $d_{an}$. While this speeds up convergence, it may result in collapsed models [48] due to a strong focus on few data outliers and very hard negatives. FaceNet [48] proposes a relaxed, semi-hard negative sampling strategy restricting the sampling set to a single minibatch by employing negatives $x_n$ with $d_{an} > d_{ap}$. Based on this idea, different online [39, 52] and offline [17] strategies emerged.
(Static) Distance-based sampling: By considering the hardness of a negative, one can successfully discard easy and uninformative triplets. However, triplets that are too hard lead to noisy learning signals due to overall high gradient variance [60]. As a remedy, to control the variance while maintaining sufficient triplet utility, sampling can be extended to also consider easier negatives, i.e. by introducing a sampling distribution over the range of distances $d_{an}$ between anchor and negatives. Wu et al. [60] propose to sample from a static uniform prior over the range of $d_{an}$, thus equally considering negatives from the whole spectrum of difficulties. As pairwise distances on $\mathbb{S}^{D-1}$ are strongly biased towards larger $d_{an}$, their sampling distribution weighs negatives inversely to the analytical distance distribution on $\mathbb{S}^{D-1}$, $q(d) \propto d^{D-2}\left[1 - \tfrac{1}{4}d^2\right]^{\frac{D-3}{2}}$ for large $D$ [55]. Distance-based sampling from the static, uniform prior is then performed by
$x_n^* \sim p(x_n \mid x_a) \propto \min\left(\lambda,\; q^{-1}(d_{an})\right) \qquad (2)$
with $\lambda$ being a clipping hyperparameter for regularization.
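A sketch of this inverse-density weighting (Eq. 2), assuming the analytic hypersphere density $q(d) \propto d^{D-2}[1 - \tfrac{1}{4}d^2]^{(D-3)/2}$ stated above; weights are computed in log-space for numerical stability, and the clipping here acts on weights normalized to a maximum of one, a simplification of the raw $\min(\lambda, q^{-1})$:

```python
import numpy as np

def distance_weighted_probs(d_an, dim=128, lam=0.7):
    """Sampling probabilities over candidate negatives: proportional to the
    inverse of the analytic distance density q(d) on the unit hypersphere,
    clipped at lam so the hardest candidates cannot dominate."""
    d = np.clip(np.asarray(d_an, float), 1e-4, 2.0 - 1e-4)  # valid range on the sphere
    log_q = (dim - 2) * np.log(d) + 0.5 * (dim - 3) * np.log(1.0 - 0.25 * d ** 2)
    w = np.exp(-(log_q - log_q.min()))   # proportional to 1/q, rescaled to max 1
    w = np.minimum(w, lam)               # clip overly dominant (hardest) candidates
    return w / w.sum()

# drawing one negative index according to these probabilities
def sample_negative(d_an, dim=128, lam=0.7, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    p = distance_weighted_probs(d_an, dim, lam)
    return rng.choice(len(p), p=p)
```

Since $q(d)$ concentrates near $\sqrt{2}$ in high dimensions, the inverse weighting shifts probability mass towards small (hard) and very large distances, which the clip $\lambda$ then tempers.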
4 Learning an Adaptive Negative Sampling
Distance-based sampling of negatives has proven to offer a good trade-off between fast convergence and a stable, informative training signal. However, a static sampling distribution provides a stream of training data independent of the changing needs of a DML model during learning. While samples of mixed difficulty may be useful at the beginning, later training stages call for samples of increased difficulty, as e.g. analyzed by curriculum learning [3]. Unfortunately, as different models and even different model initializations [12] exhibit distinct learning dynamics, finding a generally applicable learning schedule is challenging. Thus, again, heuristics [15] are typically employed, inferring changes after a fixed number of training epochs or iterations. To provide an optimal training signal, however, we want the sampling to adapt to the training state of the DML model rather than merely to the training iteration. Such an adaptive negative sampling allows for adjustments which directly facilitate maximal DML performance. Since manually designing such a strategy is difficult, learning it is the most viable option.
Subsequently, we first present how to find a parametrization of the sampling distribution $p(d_{an})$ that is able to represent arbitrary, potentially multimodal distributions, and is thus able to yield negatives of any mixture of difficulty needed. Using this, we learn a policy which effectively alters $p(d_{an})$ to optimally support learning of the DML model.
4.1 Modelling a flexible sampling distribution
Since learning benefits from a diverse distribution of negatives, unimodal distributions (e.g. Gaussians or Binomials) are insufficient. Thus, we utilize a discrete probability mass function $p(d_{an})$, where the bounded interval $[\lambda_{\min}, \lambda_{\max}]$ of possible distances is discretized into $K$ disjoint, equidistant bins $u_k$. The probability of drawing from bin $u_k$ is $p_k := p(d_{an} \in u_k)$, with $\sum_{k=1}^{K} p_k = 1$ and $p_k \geq 0$. Fig. 2 illustrates this discretized sampling distribution.
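A minimal sketch of such a discretized sampling distribution; the bin count and the distance limits here are illustrative defaults (the interval [0.1, 1.4] matches the setting reported in Sec. 5, the rest is assumed):

```python
import numpy as np

class BinnedDistanceSampler:
    """Discrete pmf over K equidistant distance bins on [d_min, d_max]."""
    def __init__(self, K=10, d_min=0.1, d_max=1.4, probs=None, seed=0):
        self.edges = np.linspace(d_min, d_max, K + 1)  # K+1 bin edges
        self.p = np.full(K, 1.0 / K) if probs is None else np.asarray(probs, float)
        self.rng = np.random.default_rng(seed)

    def sample_distance(self):
        """Draw a bin according to p, then a distance uniformly inside it."""
        k = self.rng.choice(len(self.p), p=self.p)
        return self.rng.uniform(self.edges[k], self.edges[k + 1])
```

Because each bin's probability can be set independently, the pmf can represent multimodal shapes, e.g. emphasizing both hard and easy negatives simultaneously, which a unimodal parametric family cannot.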
This representation of the negative sampling distribution effectively controls which samples are used to learn $\phi$. As $\phi$ changes during learning, $p(d_{an})$ should also adapt to always provide the most useful training samples, i.e. to control when to use which sample. Hence, the probabilities $p_k$ need to be updated while learning $\phi$. We subsequently solve this task by learning a stochastic adjustment policy $\pi$ for the $p_k$, implemented as a neural network parametrized by $\omega$.
4.2 Learning an adjustment policy for $p(d_{an})$
Our sampling process based on $p(d_{an})$ should provide optimal training signals for learning $\phi$ at every stage of training. Thus, we adjust the $p_k$ by multiplicative updates $a$ conditioned on the current representation (or state) $s$ of $\phi$ during learning. We introduce a conditional distribution $\pi(a \mid s)$ to control which adjustment to apply at which state of training. To learn $\pi$, we measure the utility of these adjustments for learning $\phi$ using a reward signal $r$. We now first describe how to model each of these components, before presenting how to efficiently optimize the adjustment policy alongside $\phi$.
Adjustments $a$: To adjust $p(d_{an})$, $\pi$ proposes adjustments to the $p_k$. To lower the complexity of the action space, we use a limited set of actions to individually decrease, maintain, or increase the probability of each bin $u_k$, i.e. $a = \{a_k\}_{k=1}^{K}$ with $a_k \in \{\alpha, 1, \beta\}$, where $\alpha < 1 < \beta$ are fixed constants. Updating $p(d_{an})$ is then simply performed by bin-wise updates $p_k \leftarrow p_k \cdot a_k$, followed by re-normalization. Using a multiplicative adjustment accounts for the exponential distribution of distances on $\mathbb{S}^{D-1}$ (cf. Sec. 3.1).
Training states $s$: Adjustments $a$ depend on the present state $s$ of the representation $\phi$. Unfortunately, we cannot use the current model weights of the embedding network directly, as their dimensionality would be too high, making optimization of $\pi$ infeasible. Instead, we represent the current training state using representative statistics describing the learning progress: running averages over Recall@1 [23], NMI [32], and average distances between and within classes on a fixed held-back validation set $\mathcal{I}_{val}$. Additionally, we use past parametrizations of $p(d_{an})$ and the relative training iteration (cf. implementation details, Sec. 5).
Rewards $r$: An optimal sampling distribution yields triplets whose training signal consistently improves the evaluation performance of $\phi$ during learning. Thus, we compute the reward for adjustments $a$ by directly measuring the relative improvement of the current model $\phi_{t+1}$ over the previous training state $\phi_t$. This improvement is quantified through DML evaluation metrics $e(\phi, \mathcal{I}_{val})$ on the validation set $\mathcal{I}_{val}$. More precisely, we define $r$ as
$r = \operatorname{sign}\left(e(\phi_{t+1}, \mathcal{I}_{val}) - e(\phi_t, \mathcal{I}_{val})\right) \qquad (3)$
where $\phi_{t+1}$ is reached from $\phi_t$ after $M$ DML training iterations using $p(d_{an})$. We choose $e$ to be the sum of Recall@1 [23] and NMI [32]. Both metrics are in the range $[0, 1]$ and target slightly different performance aspects. Further, similar to [20], we utilize the sign function for consistent learning signals even during saturated training stages.
Learning of $\pi$: Adjusting $p(d_{an})$ is a stochastic process controlled by actions $a$ sampled from $\pi(a \mid s)$ based on a current state $s$. This defines a Markov Decision Process (MDP), naturally optimized by Reinforcement Learning. The policy objective is formulated to maximize the total expected reward over training episodes of tuples $(s, a, r)$ collected from sequences of $T$ timesteps, i.e.
$J(\omega) = \mathbb{E}_{\pi}\left[\textstyle\sum_{t=1}^{T} r_t\right] \qquad (4)$
Hence, $\pi$ is optimized to predict adjustments for $p(d_{an})$ which yield high rewards, thereby improving the performance of $\phi$. Common approaches use episodes comprising long state trajectories which potentially cover multiple training epochs [9]. As a result, there is a large temporal discrepancy between model and policy updates. However, in order to closely adapt $\pi$ to the learning of $\phi$, this discrepancy needs to be minimized. In fact, our experiments show that single-step episodes, i.e. $T = 1$, are sufficient for optimizing $\pi$ to infer meaningful adjustments for $p(d_{an})$. Such a setup is also successfully adopted by contextual bandits [29] (opposed to bandits, in our RL setup actions sampled from $\pi$ influence future training states of the learner; thus the policy implicitly learns state-transition dynamics). In summary, a training episode consists of updating $p(d_{an})$ using a sampled adjustment $a \sim \pi(a \mid s)$, performing $M$ DML training iterations based on the adjusted $p(d_{an})$, and updating $\pi$ using the resulting reward $r$. Optimizing Eq. 4 is then performed by standard RL algorithms which approximate different variations of the policy gradient based on the gain $G$,
$\nabla_{\omega} J(\omega) = \mathbb{E}_{\pi}\left[\nabla_{\omega} \log \pi(a \mid s) \cdot G\right] \qquad (5)$
The choice of the exact form of $G$ gives rise to different optimization methods, e.g. REINFORCE [59], where $G$ is the sampled return, or Advantage Actor-Critic (A2C) [54], where $G$ is an advantage estimate. Other RL algorithms, such as TRPO [49] or PPO [50], replace Eq. 4 by surrogate objective functions. Fig. 3 provides an overview of the learning procedure. Moreover, in the supplementary material we compare different RL algorithms and summarize the learning procedure in Alg. 1, using PPO [50] for policy optimization.
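To make the single-step episode structure concrete, the following toy sketch couples the pmf adjustment, the sign reward of Eq. 3, and a plain REINFORCE gradient. The DML training step and validation metrics are stubbed out, the policy is state-free, and all constants are illustrative; the paper's actual implementation conditions the policy network on validation statistics and optimizes it with A2C + PPO:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5                          # number of distance bins
p = np.full(K, 1.0 / K)        # sampling pmf over bins
W = np.zeros((K, 3))           # policy logits per bin: {decrease, keep, increase}
lr, alpha, beta = 0.1, 0.8, 1.25

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fake_dml_step(p):
    """Stub for M DML iterations + validation: pretends performance
    improves when more mass lies on the two hardest (lowest-distance) bins."""
    return float(p[:2].sum())

score_old = fake_dml_step(p)
for episode in range(300):
    probs = softmax(W)                                 # (K, 3) action probabilities
    acts = np.array([rng.choice(3, p=probs[k]) for k in range(K)])
    p = p * np.array([alpha, 1.0, beta])[acts]         # bin-wise multiplicative update
    p = p / p.sum()                                    # re-normalize the pmf
    score_new = fake_dml_step(p)
    r = np.sign(score_new - score_old)                 # sign reward as in Eq. 3
    score_old = score_new
    for k in range(K):                                 # REINFORCE: r * grad log pi(a)
        W[k] += lr * r * (np.eye(3)[acts[k]] - probs[k])
```

In this toy environment the policy is rewarded whenever its adjustment shifts mass towards the stubbed "useful" bins, mirroring how the real policy is rewarded for adjustments that improve validation Recall@1 + NMI.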
Initialization of $p(d_{an})$: We find that an initialization with a slight emphasis towards smaller distances works best. However, as shown in Tab. 5, other initializations also work well. In addition, the limits $[\lambda_{\min}, \lambda_{\max}]$ of the distance interval can be controlled for additional regularization, as done in [60] and analysed in Tab. 5.
Self-Regularization: As noted in [44], the utilization of intra-class features can be beneficial to generalization. Our approach easily allows for a learnable inclusion of such features: as positive samples are generally closest to the anchors, we can merge them into the set of negative samples and let the policy learn to place higher sampling probability on such low-distance cases. We find that this further improves generalization performance.
Computational costs: The computational overhead over fixed sampling strategies [48, 60] comes from the estimation of the reward $r$, requiring a forward pass over the validation set $\mathcal{I}_{val}$ and the computation of the evaluation metrics; for our chosen episode length $M$, the computation time per epoch increases only marginally.
Dataset  CUB200-2011[56]  CARS196[28]  SOP[37]

Approach  Dim  R@1  R@2  R@4  NMI  R@1  R@2  R@4  NMI  R@1  R@10  R@100  NMI 
Margin[60] + dist (orig)  128  63.6  74.4  83.1  69.0  79.6  86.5  90.1  69.1  72.7  86.2  93.8  90.7 
Margin[60] + dist (ReImp, )  128  63.5  74.9  84.4  68.1  80.1  87.4  91.9  67.6  74.6  87.5  94.2  90.7 
Margin[60] + dist (ReImp, )  128  63.0  74.3  83.0  66.9  79.7  87.0  91.8  67.1  73.5  87.2  93.9  89.3 
Margin[60] + PADS (Ours)  128  67.3  78.0  85.9  69.9  83.5  89.7  93.8  68.8  76.5  89.0  95.4  89.9 
Triplet[48] + semihard (orig)  64  42.6  55.0  66.4  55.4  51.5  63.8  73.5  53.4  66.7  82.4  91.9  89.5 
Triplet[48] + semihard (ReImp)  128  60.6  72.3  82.1  65.5  71.9  81.5  88.5  64.1  73.5  87.5  94.9  89.2 
Triplet[48] + dist (ReImp)  128  62.2  73.2  82.8  66.3  78.0  85.6  91.4  65.7  73.9  87.7  94.5  89.3 
Triplet[48] + PADS (Ours)  128  64.0  75.5  84.3  67.8  79.9  87.5  92.3  67.1  74.8  88.2  95.0  89.5 
5 Experiments
In this section we provide implementation details, evaluations on standard metric learning datasets, ablations studies and analysis experiments.
Implementation details. We follow the training protocol of [60] with ResNet50. During training, images are resized to with random crop to and random horizontal flipping. For completeness, we also evaluate on Inception-BN [21] following standard practice in the supplementary. The initial learning rates are set to . We choose triplet parameters according to [60], with . For margin loss, we evaluate margins and . Our policy $\pi$ is implemented as a two-layer fully-connected network with ReLU nonlinearity in between and 128 neurons per layer. Action values are set to . Episode iterations $M$ are determined via cross-validation within [30, 150]. The sampling range of $p(d_{an})$ is set to [0.1, 1.4], with . The sampling probability of negatives corresponding to distances outside this interval is set to . For the input state $s$ we use running averages of validation recall, NMI and average intra- and inter-class distance, based on running-average lengths of 2, 8, 16 and 32 to account for short- and long-term changes. We also incorporate the metrics of the previous 20 iterations. Finally, we include the sampling distribution of the previous iteration and the training progress normalized over the total training length. For optimization, we utilize an A2C + PPO setup with ratio limit . The history policy is updated every 5 policy iterations. For implementation we use the PyTorch framework [40] on a single NVIDIA Titan X.
Benchmark datasets. We evaluate performance on three common benchmark datasets. For each dataset, the first half of classes is used for training and the other half for testing. Further, we use a random subset of the training images for our validation set $\mathcal{I}_{val}$. We use:
CARS196[28], with 16,185 images from 196 car classes.
CUB2002011[56], 11,788 bird images from 200 classes.
Stanford Online Products (SOP)[37], containing 120,053 images divided in 22,634 classes.
5.1 Results
In Tab. 1 we apply our adaptive sampling strategy to two widely adopted ranking losses: triplet [48] and margin loss [60]. For each loss, we compare against the most commonly used static sampling strategies, semi-hard [48] (semihard) and distance-based sampling [60] (dist), on the CUB200-2011, CARS196 and SOP datasets. We measure image retrieval performance using recall accuracy R@k [23] following [38]. For completeness we additionally report the normalized mutual information score (NMI) [32], despite it not fully correlating with retrieval performance. For both losses and each dataset, our learned negative sampling significantly improves performance over the non-adaptive sampling strategies. Especially the strong margin loss greatly benefits from the adaptive sampling, with notable boosts on CUB200-2011, CARS196 and SOP. This clearly demonstrates the importance of adjusting triplet sampling to the learning process of a DML model, especially for smaller datasets.
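The Recall@k retrieval metric used throughout these tables can be sketched as follows (a straightforward brute-force version for small embedding sets):

```python
import numpy as np

def recall_at_k(embeddings, labels, k=1):
    """Recall@k: fraction of queries whose k nearest neighbours
    (excluding the query itself) contain at least one same-class sample."""
    d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbours
    hits = (labels[nn] == labels[:, None]).any(axis=1)
    return float(hits.mean())
```

In practice, large-scale evaluation would replace the dense pairwise distance matrix with an approximate nearest-neighbour index, but the metric itself is unchanged.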
Next, we compare these results with the current state-of-the-art in DML, which extends these basic losses using diverse additional training signals (MIC [44], DVML [30], HORDE [22], ABIER [38]), ensembles of embedding spaces (DREML [62], D&C [46], Rank [58]) and/or significantly more network parameters (HORDE [22], SOFTTRIPLE [42]). Tab. 2 shows that our results, despite not using such extensions, compete with and partly even surpass these strong methods. On CUB200-2011 we outperform all methods, including the powerful ensembles, in Recall accuracy. On CARS196 [28] we rank second behind the top performing non-ensemble method D&C [46]. On SOP [37] we trail MIC [44], which, in turn, we surpass on both CUB200-2011 and CARS196. This highlights the strong benefit of our adaptive sampling.
Dataset  CUB2002011[56]  CARS196[28]  SOP[37]  

Approach  Dim  R@1  R@2  R@4  NMI  R@1  R@2  R@4  NMI  R@1  R@10  R@100  NMI 
HTG[63]  512  59.5  71.8  81.3    76.5  84.7  90.4           
HDML[65]  512  53.7  65.7  76.7  62.6  79.1  87.1  92.1  69.7  68.7  83.2  92.4  89.3 
HTL[11]  512  57.1  68.8  78.7    81.4  88.0  92.7    74.8  88.3  94.8   
DVML[30]  512  52.7  65.1  75.5  61.4  82.0  88.4  93.3  67.6  70.2  85.2  93.8  90.8 
ABIER[38]  512  57.5  68.7  78.3    82.0  89.0  93.2    74.2  86.9  94.0   
MIC[44]  128  66.1  76.8  85.6  69.7  82.6  89.1  93.2  68.4  77.2  89.4  95.6  90.0 
D&C[46]  128  65.9  76.6  84.4  69.6  84.6  90.7  94.1  70.3  75.9  88.4  94.9  90.2 
Margin[60]  128  63.6  74.4  83.1  69.0  79.6  86.5  90.1  69.1  72.7  86.2  93.8  90.8 
Ours (Margin[60] + PADS, R50)  128  67.3  78.0  85.9  69.9  83.5  89.7  93.8  68.8  76.5  89.0  95.4  89.9 
Significant increase in network parameters:  
HORDE[22]+contrastive loss[16]  512  66.3  76.7  84.7    83.9  90.3  94.1           
SOFTTRIPLE[42]  512  65.4  76.4  84.5    84.5  90.7  94.5  70.1  78.3  90.3  95.9  92.0 
Ensemble Methods:  
Rank[58]  1536  61.3  72.7  82.7  66.1  82.1  89.3  93.7  71.8  79.8  91.3  96.3  90.4 
DREML[62]  9216  63.9  75.0  83.1  67.8  86.0  91.7  95.0  76.4         
ABE[24]  512  60.6  71.5  79.8    85.2  90.5  94.0    76.3  88.4  94.8   
5.2 Analysis
We now present various analysis experiments providing detailed insights into our learned adaptive sampling strategy.
Training progression of $p(d_{an})$:
In Fig. 4 we analyze how our adaptive sampling distribution progresses during training by averaging the results of multiple training runs with different network initializations. While on CARS196 the distribution strongly emphasizes smaller distances, on CUB200-2011 and SOP we generally observe a larger variance of $p(d_{an})$. Further, on each dataset, during the first half of training $p(d_{an})$ quickly peaks on a sparse set of bins, as intuitively expected, since most triplets are still informative. As training continues, $p(d_{an})$ begins to yield both harder and easier negatives, thus effectively sampling from a wider distribution. This observation confirms the result of Wu et al. [60], who propose to ease the large gradient variance introduced by hard negatives by also adding easier negatives. Moreover, for each dataset we observe a different progression of $p(d_{an})$, which indicates that manually designing similar sampling strategies is difficult, as also confirmed by our results in Tab. 1 and 4.
Init.  Reference  fix $\pi$  fix last $p$
R@1 ($\neq$)  65.4  64.3  59.0
R@1 ($=$)  65.4  65.8  57.6
Dataset  CUB2002011[56]  CARS196[28]  

Metrics  R@1  NMI  R@1  NMI 
Ours  67.3  69.9  83.5  68.8 
linear CL  59.1  63.1  72.2  64.0 
nonlinear CL  63.6  68.4  78.1  66.8 
Transfer of $\pi$ and $p(d_{an})$:
Tab. 3 investigates how well a trained policy $\pi$ and the final sampling distribution from a reference run transfer to a differently ($\neq$) or equally ($=$) initialized training run.
We observe that fixing a trained policy and applying it (fix $\pi$) to a new training run with the same network initialization ($=$) improves performance. This is explained by the immediate utility of $\pi$ for learning, since it is already fully adapted to the reference learning process. In contrast, applying the trained policy to a differently initialized training run ($\neq$) leads to a drop in performance. Since the fixed $\pi$ cannot adapt to the learning states of the new model, its support for optimizing $\phi$ is diminished. Note that the policy has only been trained on a single training run and thus cannot fully generalize to training runs with different learning dynamics. This shows the importance of an adaptive sampling.
Next, we investigate whether the distribution obtained at the end of training can be regarded as an optimal sampling distribution over $d_{an}$, as $\phi$ is then fully trained. To this end we fix and apply the distribution after its last adjustment by $\pi$ (fix last $p$) when training the reference run from scratch. As intuitively expected, in both cases performance drops strongly, as (i) we now have a static sampling process and (ii) the sampling distribution is optimized for a specific training state. Given our strong results, this proves that our sampling process indeed adapts to the learning of $\phi$.
Curriculum Learning: To compare our adaptive sampling with basic curriculum learning strategies, we predefine two sampling schedules: (1) a linear increase of negative hardness, starting from a semi-hard distance interval [48], and (2) a non-linear schedule using distance-based sampling [60], where the distribution is gradually shifted towards harder negatives. We visualize the corresponding progression of the sampling distribution in the supplementary material. Tab. 4 shows that both fixed, predefined curriculum schedules perform clearly worse than our learned, adaptive sampling distribution on CUB200-2011; on CARS196 the performance gap is even larger. The strong difference between datasets further demonstrates the difficulty of finding broadly applicable, effective fixed sampling strategies.


Table 5: Ablation results for (a) the distance interval, (b) the number of bins $K$ and (c) the initialization of $p(d_{an})$; $\mathcal{N}(\mu, \sigma)$ denotes a normal distribution. 
5.3 Ablation studies
Subsequently, we ablate different parameters for learning our sampling distribution on the CUB200-2011 dataset. More ablations are shown in the appendix. To make the following experiments comparable, no learning rate scheduling was applied, as convergence may change significantly with different parameter settings. In contrast, the results in Tab. 1 and 2 are obtained with our best parameter settings and a fixed learning rate scheduling. Without scheduling, our best parameter setting achieves a recall value of and NMI of on CUB200-2011.
Distance interval $[\lambda_{\min}, \lambda_{\max}]$: As presented in Sec. 4.1, $p(d_{an})$ is defined on a fixed interval of distances. Similar to other works [60, 17], this allows us to additionally regularize the sampling process by clipping the tails of the true range of distances on $\mathbb{S}^{D-1}$. Tab. 5 (a) compares different interval combinations. We observe that, while each option leads to a significant performance boost compared to the static sampling strategies, a suitably chosen interval results in the most effective sampling process.
Number of bins $K$: Next, we analyze the impact of the resolution of $p(d_{an})$ in Tab. 5 (b), i.e. the number of bins $K$. This affects the flexibility of $p(d_{an})$, but also the complexity of the actions to be predicted. As intuitively expected, increasing $K$ allows for better adaptation and performance until the complexity grows too large.
Initialization of $p(d_{an})$: Finally, we analyze how the initialization of $p(d_{an})$ impacts learning. Tab. 5 (c) compares performance using different initial distributions, such as a neutral uniform initialization (i.e. random sampling), an emphasis on harder negatives early on, or a proxy to the static distribution of [60]. We observe that our learned sampling process is robust to the initial configuration of $p(d_{an})$ and in each case effectively adapts to the learning process of $\phi$.
6 Conclusion
This paper presents a learned adaptive triplet sampling strategy using Reinforcement Learning. We optimize a teacher network to adjust the negative sampling distribution to the ongoing training state of a DML model. By training the teacher to directly improve the evaluation metric on a held-back validation set, the resulting training signal optimally facilitates DML learning. Our experiments show that our adaptive sampling strategy improves significantly over static sampling distributions. Thus, even though it is only built on top of basic triplet losses, we achieve competitive or even superior performance compared to the state-of-the-art of DML on multiple standard benchmark sets.
Acknowledgements
We thank David YuTung Hui (MILA) for valuable insights regarding the choice of RL Methods. This work has been supported in part by Bayer AG, the German federal ministry BMWi within the project “KI Absicherung”, and a hardware donation from NVIDIA corporation.
References
 [1] (2016) Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, Cited by: §2.
 [2] (2016) Cliquecnn: deep unsupervised exemplar learning. In Advances in Neural Information Processing Systems, pp. 3846–3854. Cited by: §1.

 [3] (2009) Curriculum learning. In International Conference on Machine Learning, Cited by: §2, §4.
 [4] (2017) Unsupervised learning by predicting noise. In Proceedings of the 34th International Conference on Machine Learning, Cited by: §1.
 [5] (2018) Improving spatiotemporal selfsupervision by deep reinforcement learning. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.

 [6] (2017) Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
 [7] (2018) Human motion analysis with deep metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1, §2.
 [8] (2018) Deep adversarial metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
 [9] (2017) Learning what data to learn. External Links: 1702.08635 Cited by: §2, §4.2.
 [10] (2019) Self-supervised representation learning by rotation feature decoupling. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
 [11] (2018) Deep metric learning with hierarchical triplet loss. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–285. Cited by: Table 6, §2, Table 2.
 [12] (2010) Understanding the difficulty of training deep feedforward neural networks.. JMLR Proceedings. Cited by: §4.
 [13] (2016) Adaptive sampling for SGD by exploiting side information. In International Conference on Machine Learning, Cited by: §2.
 [14] (2017) Automated curriculum learning for neural networks. In International Conference on Machine Learning, Cited by: §2.
 [15] (2019) On the power of curriculum learning in training deep networks. Cited by: §2, §4.
 [16] (2006) Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 6, §1, §2, §3, Table 2.
 [17] (2017) Smart mining for deep metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2821–2829. Cited by: §2, §3.1, §5.3.
 [18] (2017) Rainbow: combining improvements in deep reinforcement learning. External Links: 1710.02298 Cited by: 3rd item.
 [19] (2014) Discriminative deep metric learning for face verification in the wild. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
 [20] (2019) Addressing the loss-metric mismatch with adaptive loss alignment. In ICML, Cited by: §2, §4.2.
 [21] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning. Cited by: Appendix A, §5.
 [22] (2019) Metric learning with HORDE: high-order regularizer for deep embeddings. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 6, §2, §5.1, Table 2.
 [23] (2011) Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: §4.2, §4.2, §5.1.
 [24] (2018) Attentionbased ensemble for deep metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: Table 6, Table 2.
 [25] (2015) Adam: a method for stochastic optimization. Cited by: Appendix A.
 [26] (2013) Autoencoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1.
 [27] (2019) Content and style disentanglement for artistic style transfer. In Proceedings of the Intl. Conf. on Computer Vision (ICCV), Cited by: §1.
 [28] (2013) 3d object representations for finegrained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561. Cited by: Table 6, Appendix A, Table 1, §5.1, Table 2, Table 4, §5.
 [29] (2008) The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 20, J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis (Eds.), pp. 817–824. External Links: Link Cited by: §4.2.
 [30] (2018) Deep variational metric learning. In The European Conference on Computer Vision (ECCV), Cited by: Table 6, §2, §5.1, Table 2.
 [31] (2017) SphereFace: deep hypersphere embedding for face recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
 [32] (2010) Introduction to information retrieval. Natural Language Engineering 16 (1), pp. 100–103. Cited by: §4.2, §4.2, §5.1.
 [33] (2018) UMAP: uniform manifold approximation and projection. The Journal of Open Source Software 3 (29), pp. 861. Cited by: Appendix D.
 [34] (2017) Unsupervised video understanding by reconciliation of posture similarities. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.
 [35] (2020) Unsupervised representation learning by discovering reliable image relations. Pattern Recognition (PR) 102. Cited by: §1.
 [36] (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368. Cited by: Table 6, Appendix A, §2.
 [37] (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012. Cited by: §1, §2, §3, Table 1, §5.1, Table 2, §5.
 [38] (2018) Deep metric learning with bier: boosting independent embeddings robustly. IEEE transactions on pattern analysis and machine intelligence. Cited by: Table 6, §1, §2, §5.1, Table 2.
 [39] (2015) Deep face recognition. In British Machine Vision Conference, Cited by: §3.1.
 [40] (2017) Automatic differentiation in pytorch. In NIPSW, Cited by: §5.
 [41] (2018) Efficient neural architecture search via parameter sharing. International Conference on Machine Learning. Cited by: §2.
 [42] (2019) SoftTriple loss: deep metric learning without triplet sampling. Cited by: Table 6, Appendix A, §2, §5.1, Table 2.
 [43] (2017) Searching for activation functions. CoRR abs/1710.05941. Cited by: §2.
 [44] (201910) MIC: mining interclass characteristics for improved metric learning. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Table 6, §1, §2, §4.2, §5.1, Table 2.
 [45] (2020) Revisiting training strategies and generalization performance in deep metric learning. CoRR abs/2002.08473. Cited by: §2.
 [46] (2019) Divide and conquer the embedding space for metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 6, §1, §2, §5.1, Table 2.
 [47] (2019) Data parameters: a new family of parameters for learning a differentiable curriculum. In Advances in Neural Information Processing Systems, Cited by: §2.
 [48] (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: Figure 5, §1, §2, §3.1, §3, §4.2, Table 1, §5.1, §5.2.
 [49] (2015) Trust region policy optimization. In International Conference on Machine Learning, Cited by: §4.2.
 [50] (2017) Proximal policy optimization algorithms. CoRR. Cited by: 4th item, §4.2.
 [51] (2015) Learning where to sample in structured prediction. In Artificial Intelligence and Statistics, Cited by: §2.
 [52] (2016) Improved deep metric learning with multi-class N-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: §1, §2, §3.1, §3.
 [53] (2017) Self-supervised learning of pose embeddings from spatiotemporal relations in videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1.
 [54] (1998) Reinforcement learning: an introduction. The MIT Press. Cited by: 2nd item, §4.2.
 [55] (2017) The sphere game in n dimensions. http://faculty.madisoncollege.edu/alehnen/sphere/hypers.htm. Cited by: §3.1.
 [56] (2011) The Caltech-UCSD Birds-200-2011 dataset. Cited by: Table 6, Appendix A, Appendix C, Appendix D, Figure 6, Table 1, Table 2, Table 4, §5.
 [57] (2017) Deep metric learning with angular loss. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2593–2601. Cited by: §1, §2.
 [58] (2019) Ranked list loss for deep metric learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Table 6, §2, §5.1, Table 2.
 [59] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning. Cited by: 1st item, §4.2.
 [60] (2017) Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848. Cited by: Figure 5, Table 6, 8(a), Appendix A, Appendix B, Table 10, Figure 1, §1, §2, §3.1, §3, §4.2, Table 1, §5.1, §5.2, §5.2, §5.3, §5.3, Table 2, §5.
 [61] (2019) SNAS: stochastic neural architecture search. Cited by: §2.
 [62] (2018) Deep randomized ensembles for metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 723–734. Cited by: Table 6, §1, §2, §5.1, Table 2.
 [63] (2018) An adversarial approach to hard triplet generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 501–517. Cited by: Table 6, §1, §2, Table 2.
 [64] (2018) Directional statistics-based deep metric learning for image classification and retrieval. Pattern Recognition 93. Cited by: §2.
 [65] (2019) Hardness-aware deep metric learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Table 6, §2, Table 2.
Supplementary Material
This part contains supporting or additional experiments to the main paper, such as additional ablations and qualitative evaluations.
Appendix A Additional Ablation Experiments
We now conduct further ablation experiments for different aspects of our proposed approach based on the CUB200-2011 [56] dataset. Note that, as in our main paper, we did not apply any learning rate scheduling for the results of our approach, in order to establish comparable training settings.
Performance with Inception-BN:
For a fair comparison, we also evaluate using Inception-V1 with Batch Normalization [21]. We follow the standard pipeline (see e.g. [36, 42]), utilizing Adam [25] with images resized and randomly cropped to 224×224. The learning rate is set to . We retain the size of the policy network and all other hyperparameters. The results on CUB200-2011 [56] and CARS196 [28] are listed in Table 6. On CUB200, we achieve results competitive with previous state-of-the-art methods. On CARS196, we achieve a significant boost over the baseline values and competitive performance with the state of the art.
Validation set:
The validation set is sampled from the training set, composed either as a fixed, disjoint held-back subset or repeatedly resampled from the training set during training. Further, we can sample across all classes or include entire classes. We found (Tab. 7 (d)) that sampling from each class works much better than holding back entire classes. Further, resampling provides no significant benefit at the cost of an additional hyperparameter to tune.
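The two splitting strategies compared above can be illustrated with a short sketch. Function name, signature, and the default fraction are ours for illustration, not the paper's exact protocol:

```python
import random
from collections import defaultdict

def split_validation(labels, frac=0.1, per_class=True, seed=0):
    """Split sample indices into train/validation parts.

    per_class=True: hold back a fraction of samples from every class
    (the variant favored in the ablation above); per_class=False:
    hold back entire classes.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    val = []
    if per_class:
        for idxs in by_class.values():          # a slice of every class
            rng.shuffle(idxs)
            n_val = max(1, int(len(idxs) * frac))
            val.extend(idxs[:n_val])
    else:
        classes = list(by_class)                # whole classes held back
        rng.shuffle(classes)
        n_cls = max(1, int(len(classes) * frac))
        for c in classes[:n_cls]:
            val.extend(by_class[c])
    val_set = set(val)
    train = [i for i in range(len(labels)) if i not in val_set]
    return train, sorted(val_set)
```

Note that with `per_class=False` the validation classes are never seen by the DML model during training, which is why the per-class variant provides a more representative reward signal.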
Composition of states and target metric: Choosing meaningful target metrics for computing rewards, as well as a representative composition of the training state, increases the utility of our learned policy. To this end, Tab. 8 compares different combinations of state compositions and employed target metrics. We observe that incorporating information about the current structure of the embedding space into the state, such as intra- and inter-class distances, is most crucial for effective learning and adaptation. Moreover, incorporating performance metrics which directly represent the current performance of the model, e.g. Recall@1 or NMI, additionally adds some useful information.
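A minimal sketch of how such embedding-structure statistics could be computed, assuming L2-normalized embeddings (the exact state components used in the paper are given in its implementation details):

```python
import numpy as np

def embedding_state(emb, labels):
    """Sketch of an embedding-structure descriptor: mean intra-class
    distance, mean inter-class distance, and their gap."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit sphere
    # pairwise Euclidean distances from cosine similarities
    d = np.sqrt(np.maximum(0.0, 2.0 - 2.0 * emb @ emb.T))
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = d[same & off_diag].mean()   # distances within classes
    inter = d[~same].mean()             # distances across classes
    return np.array([intra, inter, inter - intra])
```

Such distance statistics can then be concatenated with performance metrics like Recall@1 or NMI to form the full state vector.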
Frequency of updating the sampling distribution: We compute the reward for an adjustment to the sampling distribution after a fixed number of DML training iterations. Long update intervals reduce the variance of the rewards, however, at the cost of slow policy updates, which may result in large discrepancies between the policy and the DML model. Tab. 9 (a) shows that a moderate update interval results in a good trade-off between the stability of the rewards and the adaptation of the sampling distribution to the model. Moreover, we also show the result of performing no updates at all, i.e. using the initial distribution throughout training without adaptation. Fixing this distribution performs worse than the reference method, Margin loss with static distance-based sampling [60]. In contrast, frequently adjusting the distribution leads to significantly superior performance, which indicates that our policy effectively adapts to the training state of the model.
Importance of long-term information in states: For optimal learning, the state should contain information not only about the current training state of the model, but also about some history of the learning process. Therefore, we compose the state of a set of running averages over different lengths for various training-state components, as discussed in the implementation details of the main paper. Tab. 9 (b) confirms the importance of long-term information for stable adaptation and learning. Moreover, we see that a set of moving averages over multiple window lengths works best.
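The multi-length running averages described above could be implemented along the following lines; the class name and the default window lengths are illustrative:

```python
from collections import deque

class MultiScaleAverage:
    """Running averages of a scalar state component over several
    window lengths, giving the policy both short- and long-term
    context about the training process."""

    def __init__(self, lengths=(2, 8, 16, 32)):  # illustrative windows
        self.buffers = [deque(maxlen=l) for l in lengths]

    def update(self, value):
        """Append a new observation and return one average per window."""
        for buf in self.buffers:
            buf.append(value)
        return [sum(b) / len(b) for b in self.buffers]
```

One such tracker per state component (e.g. intra-class distance, Recall@1) yields the history-augmented state fed to the policy.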
Dataset  CUB200-2011 [56]  CARS196 [28]  
Approach  Dim  R@1  R@2  R@4  NMI  R@1  R@2  R@4  NMI 
HTG[63]  512  59.5  71.8  81.3    76.5  84.7  90.4   
HDML[65]  512  53.7  65.7  76.7  62.6  79.1  87.1  92.1  69.7 
HTL[11]  512  57.1  68.8  78.7    81.4  88.0  92.7   
DVML[30]  512  52.7  65.1  75.5  61.4  82.0  88.4  93.3  67.6 
ABIER[38]  512  57.5  68.7  78.3    82.0  89.0  93.2   
MIC[44]  128  66.1  76.8  85.6  69.7  82.6  89.1  93.2  68.4 
D&C[46]  128  65.9  76.6  84.4  69.6  84.6  90.7  94.1  70.3 
Margin[60]  128  63.6  74.4  83.1  69.0  79.6  86.5  90.1  69.1 
Reimpl. Margin[60], IBN  512  63.8  75.3  84.7  67.9  79.7  86.9  91.4  67.2 
Ours(Margin[60] + PADS, IBN)  512  66.6  77.2  85.6  68.5  81.7  88.3  93.0  68.2 
Significant increase in network parameters:  
HORDE[22]+Contr.[16]  512  66.3  76.7  84.7    83.9  90.3  94.1   
SOFTTRIPLE[42]  512  65.4  76.4  84.5    84.5  90.7  94.5  70.1 
Ensemble Methods:  
Rank[58]  1536  61.3  72.7  82.7  66.1  82.1  89.3  93.7  71.8 
DREML[62]  9216  63.9  75.0  83.1  67.8  86.0  91.7  95.0  76.4 
ABE[24]  512  60.6  71.5  79.8    85.2  90.5  94.0   
Validation Set:

Tab. 8: Recall@1 / NMI for combinations of state composition (rows) and target metric used for the reward (columns).

State composition (R@1 / NMI)  NMI  R@1  R@1 + NMI
Recall, Dist., NMI  63.9 / 68.5  65.5 / 68.9  65.6 / 69.2
Recall, Dist.  65.0 / 68.5  65.7 / 69.2  64.4 / 69.4
Recall, NMI  63.7 / 68.4  63.9 / 68.2  64.2 / 68.5
Dist., NMI  65.3 / 68.8  65.3 / 68.7  65.1 / 68.5
Dist.  65.3 / 68.8  65.5 / 69.1  64.3 / 68.6
Recall  64.2 / 67.8  65.1 / 69.0  64.9 / 68.4
NMI  64.3 / 68.7  64.8 / 69.2  63.9 / 68.4


Appendix B Curriculum Evaluations
In Fig. 5 we visually illustrate the fixed curriculum schedules which we applied for the comparison experiment in Sec. 5.3 of our main paper. We evaluated various schedules: a linear progression of sampling intervals starting at semi-hard negatives and moving to hard negatives, and a progressive shift of the distance-based sampling distribution of [60] towards harder negatives. The visualized schedules were among the best performing ones on both the CUB200 and CARS196 datasets.
Appendix C Comparison of RL Algorithms
We evaluate the applicability of the following RL algorithms for optimizing our policy (Eq. 4 in the main paper):
Approach  R@1  NMI
Margin [60]
REINFORCE
REINFORCE, EMA
REINFORCE, A2C
PPO, EMA
PPO, A2C
Q-Learning
Q-Learning, PR/2-Step
For a comparable evaluation setting, we use the CUB200-2011 [56] dataset without learning rate scheduling and a fixed 150 epochs of training. Within this setup, the hyperparameters of each method are optimized via cross-validation. Tab. 10 shows that all methods, except for vanilla Q-Learning, result in an adjustment policy which outperforms static sampling strategies. Moreover, policy-based methods generally perform better than Q-Learning based methods, with PPO being the best performing algorithm. We attribute this to the reduced search space (Q-Learning methods need to evaluate the state-action space, unlike policy-based methods, which operate directly on the action space), as well as to not employing replay buffers, i.e. not acting off-policy, since state-action pairs from previous training iterations may no longer be representative of current training stages.
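For reference, the clipped surrogate objective that distinguishes PPO [50] from plain REINFORCE can be written in a few lines (a generic sketch of the standard PPO loss, not our full training code):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped PPO surrogate loss (to be minimized).

    ratio: pi_theta(a|s) / pi_old(a|s) per sampled action.
    advantage: advantage estimate per sampled action.
    Clipping the ratio to [1-eps, 1+eps] bounds the policy update.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # pessimistic bound: take the smaller objective, negate for a loss
    return -np.minimum(unclipped, clipped).mean()
```

The clipping keeps successive policies close, which matters here because the DML training state that generated old state-action pairs quickly becomes stale.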
Appendix D Qualitative UMAP Visualization
Figure 6 shows a UMAP [33] embedding of test image features for CUB200-2011 [56] learned by our model using PADS. We can see clear groupings for birds of the same and similar classes. Clustering based on similar backgrounds is primarily due to dataset bias, e.g. certain types of birds occur only in conjunction with specific backgrounds.
Appendix E Pseudo-Code
Algorithm 1 gives an overview of our proposed PADS approach using PPO with A2C as underlying RL method.
Before training, our sampling distribution is initialized with an initial distribution. Further, we initialize both the adjustment policy and the pre-update auxiliary policy used to estimate the PPO probability ratio. Then, DML training is performed using triplets with random anchor-positive pairs and negatives sampled from the current sampling distribution. After a fixed number of iterations, all reward and state metrics are computed on the embeddings of the validation set. These values are aggregated into a training reward and an input state. While the reward is used to update the current policy, the state is fed into the updated policy to estimate adjustments to the sampling distribution. Finally, after a fixed number of iterations, the auxiliary policy is updated with the current policy weights.
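The overall control flow can be condensed into a toy skeleton. Every component below (the discrete difficulty bins, the scalar stand-in policy, the random reward, the interval constants) is a placeholder for the paper's actual networks and metrics; only the loop structure mirrors Algorithm 1:

```python
import random

def pads_sketch(n_iters=100, t_reward=10, t_old=30, n_bins=5, seed=0):
    """Toy skeleton of the PADS loop: a discrete negative-sampling
    distribution over difficulty bins is adjusted every t_reward
    iterations; the pre-update auxiliary policy is synced every
    t_old iterations. All components are stand-ins."""
    rng = random.Random(seed)
    p = [1.0 / n_bins] * n_bins              # initial sampling distribution
    policy = policy_old = 0.0                # stand-in policy "weights"
    for it in range(1, n_iters + 1):
        # sample a negative's difficulty bin from the current distribution
        bin_idx = rng.choices(range(n_bins), weights=p)[0]
        # ... one DML update with a triplet from bin_idx would go here ...
        if it % t_reward == 0:
            reward = rng.uniform(-1, 1)      # stand-in: validation improvement
            policy += 0.1 * reward           # stand-in policy update
            p[bin_idx] += 0.05               # stand-in distribution adjustment
            total = sum(p)
            p = [x / total for x in p]       # renormalize
        if it % t_old == 0:
            policy_old = policy              # sync pre-update policy (PPO ratio)
    return p
```

Replacing the stand-ins with the DML model, the policy network, and the validation-set reward recovers the full procedure of Algorithm 1.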
Appendix F Typical image retrieval failure cases
Fig. 7 shows nearest neighbors for good and bad test-set retrievals. Even though the nearest neighbors do not always share the same class label as the anchor, all neighbors are very similar to the bird species depicted in the anchor images. Failures are due to very subtle differences.