
PADS: Policy-Adapted Sampling for Visual Similarity Learning

by   Karsten Roth, et al.

Learning visual similarity requires learning relations, typically between triplets of images. Although triplet approaches are powerful, their computational complexity mostly limits training to only a subset of all possible training triplets. Thus, sampling strategies that decide when to use which training sample during learning are crucial. Currently, the prominent paradigm is fixed or curriculum sampling strategies that are predefined before training starts. However, the problem truly calls for a sampling process that adjusts based on the actual state of the similarity representation during training. We therefore employ reinforcement learning and have a teacher network adjust the sampling distribution based on the current state of the learner network, which represents visual similarity. Experiments on benchmark datasets using standard triplet-based losses show that our adaptive sampling strategy significantly outperforms fixed sampling strategies. Moreover, although our adaptive sampling is only applied on top of basic triplet-learning frameworks, we reach results competitive with state-of-the-art approaches that employ diverse additional learning signals or strong ensemble architectures. Code can be found under



1 Introduction

Capturing visual similarity between images is the core of virtually every computer vision task, such as image retrieval[60, 52, 38, 35], pose understanding[34, 7, 2, 53], face detection[48] and style transfer[27]. Measuring similarity requires finding a representation which maps similar images close together and dissimilar images far apart. This task is naturally formulated as Deep Metric Learning (DML), in which individual pairs of images are compared[16, 52, 37] or contrasted against a third image[48, 60, 57] to learn a distance metric that reflects image similarity. Such triplet learning constitutes the basis of powerful learning algorithms[44, 38, 46, 62]. However, with growing training set size, leveraging every single triplet for learning becomes computationally infeasible, limiting training to only a subset of all possible triplets. Thus, a careful selection of those triplets which drive learning best is crucial. This raises the question: which triplets should be presented to our model, and when, during training?
As training progresses, more and more triplet relations will be correctly represented by the model. Thus, ever fewer triplets will still provide novel, valuable information. Conversely, leveraging only triplets which are hard to learn[48, 8, 63], and therefore informative, impairs optimization due to high gradient variance[60]. Consequently, a reasonable mixture of triplets with varying difficulty would provide an informative and stable training signal. The question remains: when to present which triplet? Sampling from a fixed distribution over difficulties may serve as a simple proxy[60] and is a typical remedy in representation learning in general[26, 4]. However, (i) choosing a proper distribution is difficult; (ii) the abilities and state of our model evolve as training progresses and, thus, a fixed distribution cannot optimally support every stage of training; and (iii) triplet sampling should actively contribute to the learning objective rather than being chosen independently. Since a manually predefined sampling distribution does not fulfill these requirements, we need to learn and adapt it while training a representation.
Such online adaptation of the learning algorithm and the parameters that control it during training is typically framed as a teacher-student setup and optimized using Reinforcement Learning (RL). When modelling a flexible sampling process (the student), a controller network (the teacher) learns to adjust the sampling such that the DML model is steadily provided with an optimal training signal. Fig. 1 compares progressions of learned sampling distributions adapted to the DML model with a typical fixed sampling distribution[60].
This paper presents how to learn a novel triplet sampling strategy which is able to effectively support the learning process of a DML model at every stage of training. To this end, we model a sampling distribution so that it is easily adjustable to yield triplets of arbitrary mixtures of difficulty. To adapt to the training state of the DML model, we employ Reinforcement Learning to update the adjustment policy. Directly optimizing the policy so that it improves performance on a held-back validation set adjusts the sampling process to optimally support DML training. Experiments show that our adaptive sampling strategy significantly improves over fixed, manually designed triplet sampling strategies on multiple datasets. Moreover, we perform diverse analyses and ablations to provide additional insights into our method.

2 Related Work

Metric learning has become the leading paradigm for learning distances between images with a broad range of applications, including image retrieval[36, 30, 60], image classification [10, 64], face verification[48, 19, 31] or human pose analysis[34, 7]. Ranking losses formulated on pairs[52, 16], triplets[48, 60, 57, 11] or even higher order tuples of images[6, 37, 58] emerged as the most widely used basis for DML [45]. As with the advent of CNNs datasets are growing larger, different strategies are developed to cope with the increasing complexity of the learning problem.
Complexity management in DML: The main line of research is negative sampling strategies[48, 60, 17] based on distances between an anchor and a negative image. FaceNet[48] leverages only the hard negatives in a mini-batch. Wu et al. [60] sample negatives uniformly over the whole range of distances to avoid large variances in the gradients during optimization. Harwood et al. [17] restrict and control the search space for triplets using pre-computed sets of nearest neighbors by linearly regressing the training loss. Each of them successfully enables effective DML training. However, these works are based on fixed and manually predefined sampling strategies. In contrast, we learn an adaptive sampling strategy to provide an optimal input stream of triplets conditioned on the training state of our model.

Orthogonal to sampling negatives from the training set is the generation of hard negatives in the form of images[8] or feature vectors[65, 63]. Thus, these approaches also resort to hard negatives, while our sampling process yields negatives of any mixture of difficulty depending on the model state.
Finally, proxy based techniques reduce the complexity of the learning problem by learning one[36] or more [42] virtual representatives for each class, which are used as negatives. Thus, these approaches approximate the negative distributions, while our sampling adaptively yields individual negative samples.
Advanced DML: Based on the standard DML losses, many works improve model performance using more advanced techniques. Ensemble methods[38, 62, 46] learn and combine multiple embedding spaces to capture more information. HORDE[22] additionally forces feature representations of related images to have matching higher moments. Roth et al. [44] combine class-discriminative features with features learned from characteristics shared across classes. Similarly, Lin et al. [30] propose to learn the intra-class distributions next to the inter-class distribution. All these approaches are applied in addition to the standard ranking losses discussed above. In contrast, our work presents a novel triplet sampling strategy and, thus, is complementary to these advanced DML methods.
Adaptive Learning: Curriculum Learning[3] gradually increases the difficulty of the samples presented to the model. Hacohen et al. [15] employ a batch-based learnable scoring function to provide a batch-curriculum for training, while we learn how to adapt a sampling process to the training state. Graves et al. [14] divide the training data into fixed subsets before learning in which order to use them for training. Further, Gopal et al. [13] employ an empirical online importance sampling distribution over inputs based on their gradient magnitudes during training. Similarly, Shreyas et al. [47] learn an importance sampling over instances. In contrast, we learn an online policy for selecting triplet negatives, thus instance relations. Meta Learning aims at learning how to learn. It has been successfully applied to various components of a learning process, such as activation functions[43], input masking[9], self-supervision[5], finetuning[51], loss functions[20], optimizer parameters[1] and model architectures[41, 61]. In this work, we learn a sampling distribution to improve triplet-based learning.

Figure 2: Sampling distribution $p(I_n|I_a)$. We discretize the distance interval $U = [\lambda_{min}, \lambda_{max}]$ into $K$ equisized bins $u_k$ with individual sampling probabilities $p_k$.

3 Distance-based Sampling for DML

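Let $\phi_i := \phi(I_i; \zeta)$ be a $D$-dimensional embedding of an image $I_i$, with $\phi(I_i; \zeta)$ being represented by a deep neural network parametrized by $\zeta$. Further, $\phi$ is normalized to a unit hypersphere $\mathbb{S}$ for regularization purposes[48]. Thus, the objective of DML is to learn an embedding space $\Phi \subseteq \mathbb{S}$ such that images $I_i, I_j$ are mapped close to one another if they are similar and far apart otherwise, under a standard distance function $d(\phi_i, \phi_j)$. Commonly, $d$ is the Euclidean distance, i.e. $d_{ij} := \|\phi_i - \phi_j\|_2$.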
A popular family of training objectives for learning $\phi$ are ranking losses[48, 60, 52, 37, 16] operating on tuples of images. Their most widely used representative is arguably the triplet loss[48], which is defined as an ordering task between images $\{I_a, I_p, I_n\}$, formulated as

$$\mathcal{L}_{triplet}(\{I_a, I_p, I_n\}; \zeta) = \max(0,\; d_{ap}^2 - d_{an}^2 + \gamma) \quad (1)$$

Here, $I_a$ and $I_p$ are the anchor and positive with the same class label. $I_n$ acts as the negative from a different class. Optimizing $\mathcal{L}_{triplet}$ pushes $\phi_a$ closer to $\phi_p$ and further away from $\phi_n$ as long as a constant distance margin $\gamma$ is violated.
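As a minimal sketch of the triplet loss on unit-normalized embeddings, the following pure-Python snippet may help. It assumes the squared-distance hinge form used by FaceNet[48]; the function names and the margin value of 0.2 are illustrative assumptions, not taken from the text.

```python
import math

def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared distances: max(0, d_ap^2 - d_an^2 + margin).
    The margin default is an assumed, illustrative value."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, d_ap ** 2 - d_an ** 2 + margin)

# Toy 2-D embeddings (hypothetical data).
a = l2_normalize([1.0, 0.0])
p = l2_normalize([0.9, 0.1])
n = l2_normalize([0.0, 1.0])

easy = triplet_loss(a, p, n)   # negative is far away: margin satisfied
hard = triplet_loss(a, n, p)   # roles swapped: margin violated
```

Only the second triplet produces a non-zero loss, which is why uninformative (already satisfied) triplets contribute no gradient.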

3.1 Static Triplet sampling strategies

While ranking losses have proven to be powerful, the number of possible tuples grows dramatically with the size of the training set. Thus, training quickly becomes infeasible, turning efficient tuple sampling strategies into a key component for successful learning as discussed here.
When performing DML using ranking losses like Eq. 1, triplets decreasingly violate the triplet margin $\gamma$ as training progresses. Naively employing random triplet sampling entails many of the selected triplets being uninformative, as distances on $\Phi$ are strongly biased towards larger distances $d$ due to its regularization to $\mathbb{S}$. Consequently, recent sampling strategies explicitly leverage triplets which violate the triplet margin and, thus, are difficult and informative.
(Semi-)Hard negative sampling: Hard negative sampling methods focus on triplets violating the margin $\gamma$ the most, i.e. by sampling negatives $I_n^* = \arg\min_{I_n: d_{an} < d_{ap}} d_{an}$. While it speeds up convergence, it may result in collapsed models[48] due to a strong focus on few data outliers and very hard negatives. FaceNet[48] proposes a relaxed, semi-hard negative sampling strategy restricting the sampling set to a single mini-batch $\mathcal{B}$ by employing negatives $I_n^* = \arg\min_{I_n \in \mathcal{B}: d_{an} > d_{ap}} d_{an}$. Based on this idea, different online[39, 52] and offline[17] strategies emerged.
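The two selection rules can be sketched as follows. This is an illustration under the assumption that anchor-negative distances for one anchor are already available as a plain list; the function names are hypothetical.

```python
def hardest_negative(d_an):
    """Index of the negative closest to the anchor (the hardest one)."""
    return min(range(len(d_an)), key=lambda i: d_an[i])

def semihard_negative(d_ap, d_an):
    """Index of the closest negative that is still farther away than the
    positive; returns None if the batch contains no such negative."""
    candidates = [i for i, d in enumerate(d_an) if d > d_ap]
    if not candidates:
        return None
    return min(candidates, key=lambda i: d_an[i])

# Hypothetical distances within one mini-batch.
d_ap = 0.5
d_an = [0.3, 0.9, 0.6, 1.2]
hard_idx = hardest_negative(d_an)            # picks the 0.3 negative
semi_idx = semihard_negative(d_ap, d_an)     # picks the 0.6 negative
```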
(Static) Distance-based sampling: By considering the hardness of a negative, one can successfully discard easy and uninformative triplets. However, triplets that are too hard lead to noisy learning signals due to overall high gradient variance[60]. As a remedy, to control the variance while maintaining sufficient triplet utility, sampling can be extended to also consider easier negatives, i.e. introducing a sampling distribution $I_n \sim p(I_n|I_a)$ over the range of distances $d_{an}$ between anchor and negatives. Wu et al. [60] propose to sample from a static uniform prior on the range of $d_{an}$, thus equally considering negatives from the whole spectrum of difficulties. As pairwise distances on $\Phi$ are strongly biased towards larger $d_{an}$, their sampling distribution requires weighting $p(I_n|I_a)$ inversely to the analytical distance distribution on $\Phi$: $q(d) \propto d^{D-2}\left[1 - \frac{1}{4}d^2\right]^{\frac{D-3}{2}}$ for large $D$[55]. Distance-based sampling from the static, uniform prior is then performed by

$$I_n \sim p(I_n|I_a) \propto \min\left(\lambda,\; q^{-1}(d_{an})\right) \quad (2)$$

with $\lambda$ being a clipping hyperparameter for regularization.
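A hedged sketch of this inverse-density weighting is given below. It works in log-space for numerical stability, so the clipping constant is passed as a log value; the function names, the default dimension of 128 and the clipping value are assumptions made for illustration only.

```python
import math
import random

def log_q(d, dim):
    """Log of the unnormalized density of pairwise distances on the unit
    hypersphere: d^(dim-2) * (1 - d^2/4)^((dim-3)/2), valid for 0 < d < 2."""
    return (dim - 2) * math.log(d) + 0.5 * (dim - 3) * math.log(1.0 - 0.25 * d * d)

def distance_weighted_probs(d_an, dim=128, log_lam=10.0):
    """Sampling probabilities proportional to min(lambda, 1 / q(d_an)).
    log_lam is the (assumed) log of the clipping hyperparameter."""
    log_w = [min(log_lam, -log_q(d, dim)) for d in d_an]
    m = max(log_w)                      # subtract the max before exponentiating
    w = [math.exp(x - m) for x in log_w]
    total = sum(w)
    return [x / total for x in w]

# Hypothetical anchor-negative distances.
d_an = [0.5, 1.0, 1.4]
probs = distance_weighted_probs(d_an)
negative_idx = random.choices(range(len(d_an)), weights=probs)[0]
```

With these inputs the two smaller distances both hit the clipping value and receive equal probability, while the negative near the mode of $q$ is sampled only rarely.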

4 Learning an Adaptive Negative Sampling

Figure 3: Overview of approach. Blue denotes the standard Deep Metric Learning (DML) setup using triplets $\{I_a, I_p, I_n\}$. Our proposed adaptive negative sampling is shown in green: (1) We compute the current training state $s$ using the validation set $\mathcal{I}_{val}$. (2) Conditioned on $s$, our policy $\pi_\theta(a|s)$ predicts adjustments to the $p_k$. (3) We perform bin-wise adjustments of $p(I_n|I_a)$. (4) Using the adjusted $p(I_n|I_a)$ we train the DML model. (5) Finally, $\pi_\theta$ is updated based on the reward $r$.

Distance-based sampling of negatives has proven to offer a good trade-off between fast convergence and a stable, informative training signal. However, a static sampling distribution provides a stream of training data independent of the changing needs of a DML model during learning. While samples of mixed difficulty may be useful at the beginning, later training stages call for samples of increased difficulty, as e.g. analyzed by curriculum learning[3]. Unfortunately, as different models and even different model initializations[12] exhibit distinct learning dynamics, finding a generally applicable learning schedule is challenging. Thus, again, heuristics are typically employed, inferring changes after a fixed number of training epochs or iterations. To provide an optimal training signal, however, we want $p(I_n|I_a)$ to adapt to the training state of the DML model rather than merely to the training iteration. Such an adaptive negative sampling allows for adjustments which directly facilitate maximal DML performance. Since manually designing such a strategy is difficult, learning it is the most viable option.
Subsequently, we first present how to find a parametrization of $p(I_n|I_a)$ that is able to represent arbitrary, potentially multi-modal distributions, thus being able to sample negatives $I_n$ of any mixture of difficulty needed. Using this, we can learn a policy which effectively alters $p(I_n|I_a)$ to optimally support learning of the DML model.

4.1 Modelling a flexible sampling distribution

Since learning benefits from a diverse distribution $p(I_n|I_a)$ of negatives, uni-modal distributions (e.g. Gaussians or Binomials) are insufficient. Thus, we utilize a discrete probability mass function $p(I_n|I_a) := \Pr\{d_{an} \in u_k\} = p_k$, where the bounded interval $U = [\lambda_{min}, \lambda_{max}]$ of possible distances $d_{an}$ is discretized into $K$ disjoint equidistant bins $u_1, \dots, u_K$. The probability of drawing $I_n$ from bin $u_k$ is $p_k$ with $p_k \geq 0$ and $\sum_k p_k = 1$. Fig. 2 illustrates this discretized sampling distribution.
This representation of the negative sampling distribution effectively controls which samples are used to learn $\phi$. As $\phi$ changes during learning, $p(I_n|I_a)$ should also adapt to always provide the most useful training samples, i.e. to control when to use which sample. Hence the probabilities $p_k$ need to be updated while learning $\phi$. We subsequently solve this task by learning a stochastic adjustment policy $\pi_\theta$ for the $p_k$, implemented as a neural network parametrized by $\theta$.
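One possible way to draw a negative from such a binned distribution is sketched below. The two-stage scheme (pick a non-empty bin according to its probability, then pick uniformly inside the bin) is an assumption made for illustration and not necessarily the authors' implementation; all names are hypothetical.

```python
import random

def bin_index(d, lo, hi, num_bins):
    """Map a distance to its equidistant bin on [lo, hi); None if outside."""
    if not (lo <= d < hi):
        return None
    k = int((d - lo) / (hi - lo) * num_bins)
    return min(k, num_bins - 1)         # guard against float rounding at the edge

def sample_negative(d_an, probs, lo, hi, rng=random):
    """Draw one negative index: choose a non-empty bin by its probability,
    then choose uniformly among the negatives falling into that bin."""
    bins = {}
    for i, d in enumerate(d_an):
        k = bin_index(d, lo, hi, len(probs))
        if k is not None:
            bins.setdefault(k, []).append(i)
    occupied = [k for k in bins if probs[k] > 0]
    if not occupied:
        return None
    k = rng.choices(occupied, weights=[probs[j] for j in occupied])[0]
    return rng.choice(bins[k])

# Hypothetical example: 4 bins of width 0.5 on [0, 2), all mass on bin 1.
d_an = [0.2, 0.7, 0.8, 1.9]
probs = [0.0, 1.0, 0.0, 0.0]
idx = sample_negative(d_an, probs, 0.0, 2.0, random.Random(0))
```

Because empty bins are skipped, the bin probabilities are implicitly renormalized over the bins that actually contain negatives for the current anchor.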

4.2 Learning an adjustment policy for $p(I_n|I_a)$

Our sampling process based on $p(I_n|I_a)$ should provide optimal training signals for learning $\phi$ at every stage of training. Thus, we adjust the $p_k$ by a multiplicative update $a$ conditioned on the current representation (or state) $s$ of $\phi$ during learning. We introduce a conditional distribution $\pi_\theta(a|s)$ to control which adjustment to apply at which state $s$ of training $\phi$. To learn $\pi_\theta$, we measure the utility of these adjustments for learning $\phi$ using a reward signal $r$. We now first describe how to model each of these components, before presenting how to efficiently optimize the adjustment policy $\pi_\theta$ alongside $\phi$.
Adjustments $a$: To adjust $p(I_n|I_a)$, $\pi_\theta(a|s)$ proposes adjustments $a$ to the $p_k$. To lower the complexity of the action space, we use a limited set of actions $\{\alpha, 1, \beta\}$ to individually decrease, maintain, or increase the probabilities $p_k$ for each bin $u_k$, i.e. $a := [a_k \in \{\alpha, 1, \beta\}]_{k=1}^{K}$. Further, $\alpha, \beta$ are fixed constants with $0 < \alpha < 1 < \beta$. Updating $p(I_n|I_a)$ is then simply performed by bin-wise updates $p_k \leftarrow p_k \cdot a_k$, followed by re-normalization. Using a multiplicative adjustment accounts for the exponential distribution of distances on $\Phi$ (cf. Sec. 3.1).
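The bin-wise multiplicative update with re-normalization can be sketched in a few lines. The action encoding (-1, 0, +1) and the factor values 0.5 and 2.0 are assumptions chosen for a readable example, not values stated in the text.

```python
def adjust_bins(probs, actions, alpha=0.5, beta=2.0):
    """Multiply each bin probability by alpha (decrease), 1 (keep) or
    beta (increase), then re-normalize so the bins sum to one.
    actions[k] is -1, 0 or +1; alpha and beta are illustrative values."""
    factor = {-1: alpha, 0: 1.0, 1: beta}
    scaled = [p * factor[a] for p, a in zip(probs, actions)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical uniform start with K = 4 bins.
p = [0.25, 0.25, 0.25, 0.25]
p_new = adjust_bins(p, actions=[1, 0, 0, -1])
```

Repeated application compounds geometrically, so a bin that is increased several times in a row gains mass quickly, while a decreased bin never reaches exactly zero.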
Training states $s$: Adjustments $a$ depend on the present state $s$ of the representation $\phi$. Unfortunately, we cannot use the current model weights $\zeta$ of the embedding network, as the dimensionality of $s$ would be too high, thus making optimization of $\pi_\theta$ infeasible. Instead, we represent the current training state using representative statistics describing the learning progress: running averages over Recall@1[23], NMI[32] and average distances between and within classes on a fixed held-back validation set $\mathcal{I}_{val}$. Additionally, we use past parametrizations of $p(I_n|I_a)$ and the relative training iteration (cf. Implementation details, Sec. 5).
Rewards $r$: An optimal sampling distribution $p(I_n|I_a)$ yields triplets whose training signal consistently improves the evaluation performance of $\phi$ while learning. Thus, we compute the reward $r$ for adjustments $a \sim \pi_\theta(a|s)$ by directly measuring the relative improvement of $\phi(\cdot; \zeta)$ over $\phi(\cdot; \zeta')$ from the previous training state. This improvement is quantified through DML evaluation metrics $e(\phi(\cdot; \zeta), \mathcal{I}_{val})$ on the validation set $\mathcal{I}_{val}$. More precisely, we define $r$ as

$$r = \operatorname{sign}\left(e(\phi(\cdot; \zeta), \mathcal{I}_{val}) - e(\phi(\cdot; \zeta'), \mathcal{I}_{val})\right) \quad (3)$$

where $\zeta$ was reached from $\zeta'$ after $M$ DML training iterations using $p(I_n|I_a)$. We choose $e$ to be the sum of Recall@1[23] and NMI[32]. Both metrics are in the range $[0, 1]$ and target slightly different performance aspects. Further, similar to [20], we utilize the sign function for consistent learning signals even during saturated training stages.
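The sign-based reward is simple enough to state directly in code. The sketch below assumes Recall@1 and NMI have already been computed on the validation set; the function names and the example numbers are hypothetical.

```python
def evaluation_metric(recall_at_1, nmi):
    """Validation score: sum of Recall@1 and NMI, each assumed in [0, 1]."""
    return recall_at_1 + nmi

def reward(score_now, score_prev):
    """Sign of the change in validation score: +1, 0 or -1."""
    diff = score_now - score_prev
    return (diff > 0) - (diff < 0)

# Hypothetical scores before and after a block of training iterations.
prev = evaluation_metric(0.612, 0.655)
now = evaluation_metric(0.618, 0.654)
r = reward(now, prev)
```

Using only the sign keeps the reward magnitude constant, so small late-stage improvements count as much as the large gains seen early in training.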
Learning of $\pi_\theta$: Adjusting $p(I_n|I_a)$ is a stochastic process controlled by actions $a$ sampled from $\pi_\theta(a|s)$ based on a current state $s$. This defines a Markov Decision Process (MDP) naturally optimized by Reinforcement Learning. The policy objective $J(\theta)$ is formulated to maximize the total expected reward $R(\tau) = \sum_t r_t$ over training episodes of tuples $\tau = \{(a_t, s_t, r_t) \mid t = 1, \dots, T\}$ collected from sequences of $T$ time-steps, i.e.

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[R(\tau)\right] \quad (4)$$

Hence, $\pi_\theta$ is optimized to predict adjustments $a$ for $p(I_n|I_a)$ which yield high rewards, thereby improving the performance of $\phi$. Common approaches use episodes $\tau$ comprising long state trajectories which potentially cover multiple training epochs[9]. As a result, there is a large temporal discrepancy between model and policy updates. However, in order to closely adapt $p(I_n|I_a)$ to the learning of $\phi$, this discrepancy needs to be minimized. In fact, our experiments show that single-step episodes, i.e. $T = 1$, are sufficient for optimizing $\pi_\theta$ to infer meaningful adjustments $a$ for $p(I_n|I_a)$. Such a setup is also successfully adopted by contextual bandits[29]. (Footnote 1: In contrast to bandits, in our RL setup actions sampled from $\pi_\theta$ influence future training states of the learner. Thus, the policy implicitly learns state-transition dynamics.) In summary, our training episodes $\tau$ consist of updating $p(I_n|I_a)$ using a sampled adjustment $a$, performing $M$ DML training iterations based on the adjusted $p(I_n|I_a)$ and updating $\pi_\theta$ using the resulting reward $r$. Optimizing Eq. 4 is then performed by standard RL algorithms which approximate different variations of the policy gradient based on the gain $G(s, a)$,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, G(s_t, a_t)\right] \quad (5)$$

The choice of the exact form of $G = G(s, a)$ gives rise to different optimization methods, e.g. REINFORCE[59] ($G = R(\tau)$), Advantage Actor Critic (A2C)[54] ($G = A(s, a)$), etc. Other RL algorithms, such as TRPO[49] or PPO[50], replace Eq. 4 by surrogate objective functions. Fig. 3 provides an overview of the learning procedure. Moreover, in the supplementary material we compare different RL algorithms and summarize the learning procedure in Alg. 1 using PPO[50] for policy optimization.
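To make the single-step policy-gradient update concrete, here is a small sketch using plain REINFORCE with the gain set to the immediate reward. It deliberately swaps in a linear-softmax policy and REINFORCE in place of the two-layer network and the PPO setup the authors report using, and every class and parameter name is hypothetical.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    t = sum(e)
    return [x / t for x in e]

class BinPolicy:
    """Linear-softmax policy with one 3-way categorical per bin
    (0 = decrease, 1 = keep, 2 = increase). Illustrative only."""

    def __init__(self, num_bins, state_dim, num_actions=3):
        self.W = [[[0.0] * state_dim for _ in range(num_actions)]
                  for _ in range(num_bins)]

    def probs(self, state):
        return [softmax([sum(w * s for w, s in zip(row, state)) for row in Wk])
                for Wk in self.W]

    def sample(self, state, rng=random):
        return [rng.choices(range(len(p)), weights=p)[0]
                for p in self.probs(state)]

    def reinforce_step(self, state, actions, reward, lr=0.1):
        """Gradient ascent on reward * log pi(a|s) for a one-step episode."""
        for Wk, p, a in zip(self.W, self.probs(state), actions):
            for j, row in enumerate(Wk):
                grad_logit = (1.0 if j == a else 0.0) - p[j]
                for i, s in enumerate(state):
                    row[i] += lr * reward * grad_logit * s

policy = BinPolicy(num_bins=2, state_dim=1)
state = [1.0]
before = policy.probs(state)
policy.reinforce_step(state, actions=[2, 0], reward=1)
after = policy.probs(state)
```

A positive reward raises the probability of the actions that were taken in this state; a negative reward lowers it. In a full loop, each step would be preceded by the bin adjustment and a block of DML training iterations that produce the reward.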
Initialization of $p(I_n|I_a)$: We find that an initialization with a slight emphasis towards smaller distances $d_{an}$ works best. However, as shown in Tab. 5, other initializations also work well. In addition, the limits of the distance interval $U = [\lambda_{min}, \lambda_{max}]$ can be controlled for additional regularization as done in [60], which is analysed in Tab. 5.
Self-Regularisation: As noted in [44], the utilisation of intra-class features can be beneficial to generalization. Our approach easily allows for a learnable inclusion of such features. As positive samples are generally closest to anchors, we can merge positive samples into the set of negative samples and have the policy learn to place higher sampling probability on such low-distance cases. We find that this additionally improves generalization performance.
Computational costs: Computational overhead over fixed sampling strategies[48, 60] comes from the estimation of $r$, requiring a forward pass over $\mathcal{I}_{val}$ and the computation of the evaluation metrics. For example, setting increases the computation time per epoch by less than .

| Approach | Dim | CUB200-2011[56] R@1 | CUB R@2 | CUB R@4 | CUB NMI | CARS196[28] R@1 | CARS R@2 | CARS R@4 | CARS NMI | SOP[37] R@1 | SOP R@10 | SOP R@100 | SOP NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Margin[60] + $\mathcal{U}$-dist (orig) | 128 | 63.6 | 74.4 | 83.1 | 69.0 | 79.6 | 86.5 | 90.1 | 69.1 | 72.7 | 86.2 | 93.8 | 90.7 |
| Margin[60] + $\mathcal{U}$-dist (ReImp, ) | 128 | 63.5 | 74.9 | 84.4 | 68.1 | 80.1 | 87.4 | 91.9 | 67.6 | 74.6 | 87.5 | 94.2 | 90.7 |
| Margin[60] + $\mathcal{U}$-dist (ReImp, ) | 128 | 63.0 | 74.3 | 83.0 | 66.9 | 79.7 | 87.0 | 91.8 | 67.1 | 73.5 | 87.2 | 93.9 | 89.3 |
| Margin[60] + PADS (Ours) | 128 | 67.3 | 78.0 | 85.9 | 69.9 | 83.5 | 89.7 | 93.8 | 68.8 | 76.5 | 89.0 | 95.4 | 89.9 |
| Triplet[48] + semihard (orig) | 64 | 42.6 | 55.0 | 66.4 | 55.4 | 51.5 | 63.8 | 73.5 | 53.4 | 66.7 | 82.4 | 91.9 | 89.5 |
| Triplet[48] + semihard (ReImp) | 128 | 60.6 | 72.3 | 82.1 | 65.5 | 71.9 | 81.5 | 88.5 | 64.1 | 73.5 | 87.5 | 94.9 | 89.2 |
| Triplet[48] + $\mathcal{U}$-dist (ReImp) | 128 | 62.2 | 73.2 | 82.8 | 66.3 | 78.0 | 85.6 | 91.4 | 65.7 | 73.9 | 87.7 | 94.5 | 89.3 |
| Triplet[48] + PADS (Ours) | 128 | 64.0 | 75.5 | 84.3 | 67.8 | 79.9 | 87.5 | 92.3 | 67.1 | 74.8 | 88.2 | 95.0 | 89.5 |

Table 1: Comparison of our proposed adaptive negative sampling (PADS) against common static negative sampling strategies: semihard negative mining[37] (semihard) and static distance-based sampling ($\mathcal{U}$-dist)[60] using triplet[48] and margin loss[60]. ReImp. denotes our re-implementations and Dim the dimensionality of $\phi$.

5 Experiments

In this section we provide implementation details, evaluations on standard metric learning datasets, ablation studies and analysis experiments.
Implementation details. We follow the training protocol of [60] with ResNet50. During training, images are resized to with random crop to and random horizontal flipping. For completeness, we also evaluate on Inception-BN[21] following standard practice in the supplementary. The initial learning rates are set to . We choose triplet parameters according to [60], with . For margin loss, we evaluate margins and . Our policy $\pi_\theta$ is implemented as a two-layer fully-connected network with a ReLU nonlinearity in between and 128 neurons per layer. Action values are set to . Episode iterations $M$ are determined via cross-validation within [30, 150]. The sampling range $[\lambda_{min}, \lambda_{max}]$ of $p(I_n|I_a)$ is set to [0.1, 1.4], with . The sampling probability of negatives corresponding to distances outside this interval is set to . For the input state $s$ we use running averages of validation recall, NMI and average intra- and interclass distance based on running average lengths of 2, 8, 16 and 32 to account for short- and long-term changes. We also incorporate the metrics of the previous 20 iterations. Finally, we include the sampling distributions of the previous iteration and the training progress normalized over the total training length. For optimization, we utilize an A2C + PPO setup with ratio limit . The history policy is updated every 5 policy iterations. For implementation we use the PyTorch framework[40] on a single NVIDIA Titan X.
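The running-average part of the input state can be sketched as follows, using the window lengths 2, 8, 16 and 32 mentioned above. The class name, the way features are concatenated and the example values are assumptions for illustration; the remaining state components (raw metric history, previous sampling distribution) are omitted.

```python
from collections import deque

class RunningStats:
    """Keeps recent values of one validation metric and reports its mean
    over several window lengths (short- and long-term trends)."""

    def __init__(self, windows=(2, 8, 16, 32)):
        self.windows = windows
        self.history = deque(maxlen=max(windows))

    def update(self, value):
        self.history.append(value)

    def features(self):
        vals = list(self.history)
        if not vals:
            return [0.0] * len(self.windows)
        # A window longer than the history simply averages what is available.
        return [sum(vals[-w:]) / len(vals[-w:]) for w in self.windows]

def build_state(trackers, progress):
    """Concatenate the running averages of all tracked metrics with the
    normalized training progress in [0, 1]."""
    state = []
    for tracker in trackers:
        state.extend(tracker.features())
    return state + [progress]

recall = RunningStats()
for v in (1.0, 2.0, 3.0, 4.0):      # hypothetical metric values
    recall.update(v)
state = build_state([recall], progress=0.25)
```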
Benchmark datasets. We evaluate the performance on three common benchmark datasets. For each dataset the first half of classes is used for training and the other half is used for testing. Further, we use a random subset of of the training images for our validation set . We use:
CARS196[28], with 16,185 images from 196 car classes.
CUB200-2011[56], 11,788 bird images from 200 classes.
Stanford Online Products (SOP)[37], containing 120,053 images divided into 22,634 classes.

Figure 4: Averaged progression of $p(I_n|I_a)$ over multiple training runs on CUB200-2011, CARS196 and SOP.

5.1 Results

In Tab. 1 we apply our adaptive sampling strategy to two widely adopted basic ranking losses: triplet[48] and margin loss[60]. For each loss, we compare against the most commonly used static sampling strategies, semi-hard[48] (semihard) and distance-based sampling[60] ($\mathcal{U}$-dist), on the CUB200-2011, CARS196 and SOP datasets. We measure image retrieval performance using recall accuracy R@k[23] following [38]. For completeness, we additionally show the normalized mutual information score (NMI)[32], despite it not fully correlating with retrieval performance. For both losses and each dataset, our learned negative sampling significantly improves the performance over the non-adaptive sampling strategies. Especially the strong margin loss greatly benefits from the adaptive sampling, resulting in R@1 boosts over the strongest re-implemented static baseline of up to 3.8% on CUB200-2011, 3.4% on CARS196 and 1.9% on SOP. This clearly demonstrates the importance of adjusting triplet sampling to the learning process of a DML model, especially for smaller datasets.
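For readers unfamiliar with the retrieval metric, a minimal sketch of R@k is given below: a query counts as a hit if at least one of its k nearest neighbours (excluding itself) shares its class. The function name and the toy data are hypothetical, and the brute-force sort is only meant for small examples.

```python
def recall_at_k(dist, labels, k):
    """Fraction of queries with at least one same-class sample among
    their k nearest neighbours, excluding the query itself."""
    n = len(labels)
    hits = 0
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        hits += any(labels[j] == labels[i] for j in order[:k])
    return hits / n

# Toy 1-D embeddings (hypothetical) and their pairwise distances.
points = [0.0, 1.0, 1.5, 3.2]
labels = [0, 1, 0, 1]
dist = [[abs(a - b) for b in points] for a in points]

r1 = recall_at_k(dist, labels, 1)
r2 = recall_at_k(dist, labels, 2)
```

In this toy layout every nearest neighbour belongs to the other class, so R@1 is zero, while widening the neighbourhood to two recovers a same-class sample for three of the four queries.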
Next, we compare these results with the current state-of-the-art in DML, which extends these basic losses using diverse additional training signals (MIC[44], DVML[30], HORDE[22], A-BIER[38]), ensembles of embedding spaces (DREML[62], D&C[46], Rank[58]) and/or significantly more network parameters (HORDE[22], SOFT-TRIPLE[42]). Tab. 2 shows that our results, despite not using such additional extensions, compete with and partly even surpass these strong methods. On CUB200-2011 we outperform all methods, including the powerful ensembles, by at least 1.0% in R@1. On CARS196[28] we rank second behind the top performing non-ensemble method D&C[46]. On SOP[37] we lose to MIC[44] which, in turn, we surpass on both CUB200-2011 and CARS196. This highlights the strong benefit of our adaptive sampling.

| Approach | Dim | CUB200-2011[56] R@1 | CUB R@2 | CUB R@4 | CUB NMI | CARS196[28] R@1 | CARS R@2 | CARS R@4 | CARS NMI | SOP[37] R@1 | SOP R@10 | SOP R@100 | SOP NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HTG[63] | 512 | 59.5 | 71.8 | 81.3 | - | 76.5 | 84.7 | 90.4 | - | - | - | - | - |
| HDML[65] | 512 | 53.7 | 65.7 | 76.7 | 62.6 | 79.1 | 87.1 | 92.1 | 69.7 | 68.7 | 83.2 | 92.4 | 89.3 |
| HTL[11] | 512 | 57.1 | 68.8 | 78.7 | - | 81.4 | 88.0 | 92.7 | - | 74.8 | 88.3 | 94.8 | - |
| DVML[30] | 512 | 52.7 | 65.1 | 75.5 | 61.4 | 82.0 | 88.4 | 93.3 | 67.6 | 70.2 | 85.2 | 93.8 | 90.8 |
| A-BIER[38] | 512 | 57.5 | 68.7 | 78.3 | - | 82.0 | 89.0 | 93.2 | - | 74.2 | 86.9 | 94.0 | - |
| MIC[44] | 128 | 66.1 | 76.8 | 85.6 | 69.7 | 82.6 | 89.1 | 93.2 | 68.4 | 77.2 | 89.4 | 95.6 | 90.0 |
| D&C[46] | 128 | 65.9 | 76.6 | 84.4 | 69.6 | 84.6 | 90.7 | 94.1 | 70.3 | 75.9 | 88.4 | 94.9 | 90.2 |
| Margin[60] | 128 | 63.6 | 74.4 | 83.1 | 69.0 | 79.6 | 86.5 | 90.1 | 69.1 | 72.7 | 86.2 | 93.8 | 90.8 |
| Ours (Margin[60] + PADS, R50) | 128 | 67.3 | 78.0 | 85.9 | 69.9 | 83.5 | 89.7 | 93.8 | 68.8 | 76.5 | 89.0 | 95.4 | 89.9 |
| Significant increase in network parameters: | | | | | | | | | | | | | |
| HORDE[22] + contrastive loss[16] | 512 | 66.3 | 76.7 | 84.7 | - | 83.9 | 90.3 | 94.1 | - | - | - | - | - |
| SOFT-TRIPLE[42] | 512 | 65.4 | 76.4 | 84.5 | - | 84.5 | 90.7 | 94.5 | 70.1 | 78.3 | 90.3 | 95.9 | 92.0 |
| Ensemble methods: | | | | | | | | | | | | | |
| Rank[58] | 1536 | 61.3 | 72.7 | 82.7 | 66.1 | 82.1 | 89.3 | 93.7 | 71.8 | 79.8 | 91.3 | 96.3 | 90.4 |
| DREML[62] | 9216 | 63.9 | 75.0 | 83.1 | 67.8 | 86.0 | 91.7 | 95.0 | 76.4 | - | - | - | - |
| ABE[24] | 512 | 60.6 | 71.5 | 79.8 | - | 85.2 | 90.5 | 94.0 | - | 76.3 | 88.4 | 94.8 | - |

Table 2: Comparison to the state-of-the-art DML methods on CUB200-2011[56], CARS196[28] and SOP[37]. Dim denotes the dimensionality of $\phi$. R50 and IBN denote implementations using ResNet50 and Inception-BN, respectively.

5.2 Analysis

We now present various analysis experiments providing detailed insights into our learned adaptive sampling strategy.
Training progression of $p(I_n|I_a)$: We now analyze in Fig. 4 how our adaptive sampling distribution progresses during training by averaging the results of multiple training runs with different network initializations. While on CARS196 the distribution $p(I_n|I_a)$ strongly emphasizes smaller distances $d_{an}$, on CUB200-2011 and SOP we generally observe a larger variance of $p(I_n|I_a)$. Further, on each dataset, during the first half of training $p(I_n|I_a)$ quickly peaks on a sparse set of bins $u_k$, as intuitively expected, since most triplets are still informative. As training continues, $p(I_n|I_a)$ begins to yield both harder and easier negatives, thus effectively sampling from a wider distribution. This observation confirms the result of Wu et al. [60], who propose to ease the large gradient variance introduced by hard negatives by also adding easier negatives. Moreover, for each dataset we observe a different progression of $p(I_n|I_a)$, which indicates that manually designing similar sampling strategies is difficult, as also confirmed by our results in Tab. 1 and 4.

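| Init. | Reference | fix $\pi_\theta$ | fix last $p(I_n \mid I_a)$ |
|---|---|---|---|
| R@1 ($\neq$) | 65.4 | 64.3 | 59.0 |
| R@1 ($=$) | 65.4 | 65.8 | 57.6 |

Table 3: Transferring a fixed trained policy $\pi_\theta$ and fixed final distribution $p(I_n|I_a)$ to training runs with different ($\neq$) and the same network initialization ($=$). Reference denotes the training run from which $\pi_\theta$ and $p(I_n|I_a)$ are obtained.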
| Approach | CUB200-2011[56] R@1 | CUB200-2011[56] NMI | CARS196[28] R@1 | CARS196[28] NMI |
|---|---|---|---|---|
| Ours | 67.3 | 69.9 | 83.5 | 68.8 |
| linear CL | 59.1 | 63.1 | 72.2 | 64.0 |
| non-linear CL | 63.6 | 68.4 | 78.1 | 66.8 |

Table 4: Comparison to curriculum learning strategies with predefined linear and non-linear progression of $p(I_n|I_a)$.

Transfer of $\pi_\theta$ and $p(I_n|I_a)$: Tab. 3 investigates how well a trained policy $\pi_\theta$ and the final sampling distribution $p(I_n|I_a)$ from a reference run transfer to a differently ($\neq$) or equally ($=$) initialized training run. We observe that fixing and applying a trained policy (fix $\pi_\theta$) to a new training run with the same network initialization ($=$) improves performance by 0.4% (65.8 vs. 65.4 R@1). This is explained by the immediate utility of $\pi_\theta$ for learning $\phi$, since $\pi_\theta$ is already fully adapted to the reference learning process. In contrast, applying the trained policy to a differently initialized training run ($\neq$) leads to a drop in performance by 1.1% (64.3 vs. 65.4 R@1). Since the fixed $\pi_\theta$ cannot adapt to the learning states of the new model, its support for optimizing $\phi$ is diminished. Note that the policy has only been trained on a single training run, thus it cannot fully generalize to training runs with different learning dynamics. This shows the importance of an adaptive sampling.
Next, we investigate whether the distribution $p(I_n|I_a)$ obtained at the end of training can be regarded as an optimal sampling distribution over $d_{an}$, as $\phi$ is fully trained. To this end, we fix and apply the distribution $p(I_n|I_a)$ after its last adjustment by $\pi_\theta$ (fix last $p(I_n|I_a)$) in training the reference run. As intuitively expected, in both cases performance drops strongly, as (i) we now have a static sampling process and (ii) the sampling distribution is optimized to a specific training state. Given our strong results, this shows that our sampling process indeed adapts to the learning of $\phi$.

Curriculum Learning: To compare our adaptive sampling with basic curriculum learning strategies, we pre-define two sampling schedules: (1) a linear increase of negative hardness, starting from a semi-hard distance interval[48], and (2) a non-linear schedule using distance-based sampling[60], where the distribution is gradually shifted towards harder negatives. We visualize the corresponding progression of the sampling distribution in the supplementary material. Tab. 4 illustrates that both fixed, pre-defined curriculum schedules perform worse than our learned, adaptive sampling distribution by at least 3.7% in R@1 on CUB200-2011. On CARS196 the performance gap is even larger. The strong difference between datasets further demonstrates the difficulty of finding broadly applicable, effective fixed sampling strategies.

Table 5: Ablation experiments analyzing various parameters for learning $p(I_n|I_a)$: (a) varying the interval of distances used for learning $p(I_n|I_a)$, with the number of bins kept fixed; (b) varying the number of bins $K$ used to discretize the range of distances used for learning $p(I_n|I_a)$; (c) comparison of initializations of $p(I_n|I_a)$ on the distance interval, including a uniform emphasis within the interval with low probabilities outside it and a normal distribution.

5.3 Ablation studies

Subsequently we ablate different parameters for learning our sampling distribution on the CUB200-2011 dataset. More ablations are shown in the appendix. To make the following experiments comparable, no learning rate scheduling was applied, as convergence may significantly change with different parameter settings. In contrast, the results in Tab 1-2 are obtained with our best parameter settings and a fixed learning rate scheduling. Without scheduling, our best parameter setting achieves a recall value of and NMI of on CUB200-2011.

Distance interval: As presented in Sec. 4.1, the sampling distribution is defined on a fixed interval of distances. Similar to other works[60, 17], this allows us to additionally regularize the sampling process by clipping the tails of the true range of distances. Tab. 5 (a) compares different combinations of interval bounds. We observe that each option leads to a significant performance boost compared to the static sampling strategies, with the most effective interval identified in Tab. 5 (a).

Number of bins: Next, we analyze the impact of the resolution of the sampling distribution in Tab. 5 (b), i.e. the number of bins. This affects the flexibility of the sampling distribution, but also the complexity of the actions to be predicted. As intuitively expected, increasing the number of bins allows for better adaptation and performance until the complexity grows too large.

Initialization of the sampling distribution: Finally, we analyze how the initialization of the sampling distribution impacts learning. Tab. 5 (c) compares the performance using different initial distributions, such as a neutral uniform initialization (i.e. random sampling), emphasizing harder negatives early on, or a proxy to [60]. We observe that our learned sampling process is robust against the initial configuration of the sampling distribution and in each case effectively adapts to the learning process of the model.
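To make the three ablated quantities (distance interval, number of bins, initial distribution) concrete, the following sketch shows one way a binned sampling distribution over anchor-negative distances could be represented and used to draw a negative. It is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, distances outside the interval are simply clipped to the outermost bins, and a candidate is chosen uniformly within the selected bin.

```python
import bisect
import random


def make_bin_edges(lower, upper, num_bins):
    """Return num_bins + 1 equally spaced edges covering [lower, upper]."""
    width = (upper - lower) / num_bins
    return [lower + k * width for k in range(num_bins + 1)]


def bin_index(distance, edges):
    """Map a distance to its bin; values outside the interval are clipped."""
    num_bins = len(edges) - 1
    k = bisect.bisect_right(edges, distance) - 1
    return min(max(k, 0), num_bins - 1)


def sample_negative(distances, bin_probs, edges, rng=random):
    """Return the index of one negative candidate.

    distances: anchor-to-candidate distances of all negative candidates
    bin_probs: probability mass assigned to each distance bin
    """
    # Group candidate indices by the distance bin they fall into.
    members = {}
    for idx, d in enumerate(distances):
        members.setdefault(bin_index(d, edges), []).append(idx)
    # Only bins that actually contain a candidate can be drawn.
    occupied = sorted(members)
    weights = [bin_probs[k] for k in occupied]
    if sum(weights) <= 0:
        # All mass lies on empty bins: fall back to uniform over occupied bins.
        weights = [1.0] * len(occupied)
    chosen_bin = rng.choices(occupied, weights=weights, k=1)[0]
    return rng.choice(members[chosen_bin])
```

A uniform initialization corresponds to `bin_probs = [1.0 / num_bins] * num_bins`; other initial distributions only change this list.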

6 Conclusion

This paper presents a learned adaptive triplet sampling strategy using Reinforcement Learning. We optimize a teacher network to adjust the negative sampling distribution to the ongoing training state of a DML model. By training the teacher to directly improve the evaluation metric on a held-back validation set, the resulting training signal optimally facilitates DML learning. Our experiments show that our adaptive sampling strategy improves significantly over static sampling distributions. Thus, even though only built on top of basic triplet losses, we achieve competitive or even superior performance compared to the state-of-the-art of DML on multiple standard benchmark sets.


We thank David Yu-Tung Hui (MILA) for valuable insights regarding the choice of RL Methods. This work has been supported in part by Bayer AG, the German federal ministry BMWi within the project “KI Absicherung”, and a hardware donation from NVIDIA corporation.


  • [1] M. Andrychowicz, M. Denil, S. Gómez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas (2016) Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [2] M. A. Bautista, A. Sanakoyeu, E. Tikhoncheva, and B. Ommer (2016) Cliquecnn: deep unsupervised exemplar learning. In Advances in Neural Information Processing Systems, pp. 3846–3854. Cited by: §1.
  • [3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In International Conference on Machine Learning, Cited by: §2, §4.
  • [4] P. Bojanowski and A. Joulin (2017) Unsupervised learning by predicting noise. In Proceedings of the 34th International Conference on Machine Learning, Cited by: §1.
  • [5] U. Büchler, B. Brattoli, and B. Ommer (2018) Improving spatiotemporal self-supervision by deep reinforcement learning. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.
  • [6] W. Chen, X. Chen, J. Zhang, and K. Huang (2017) Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [7] H. Coskun, D. J. Tan, S. Conjeti, N. Navab, and F. Tombari (2018) Human motion analysis with deep metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1, §2.
  • [8] Y. Duan, W. Zheng, X. Lin, J. Lu, and J. Zhou (2018-06) Deep adversarial metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • [9] Y. Fan, F. Tian, T. Qin, J. Bian, and T. Liu (2017) Learning what data to learn. External Links: 1702.08635 Cited by: §2, §4.2.
  • [10] Z. Feng, C. Xu, and D. Tao (2019) Self-supervised representation learning by rotation feature decoupling. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [11] W. Ge (2018) Deep metric learning with hierarchical triplet loss. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–285. Cited by: Table 6, §2, Table 2.
  • [12] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks.. JMLR Proceedings. Cited by: §4.
  • [13] S. Gopal (2016) Adaptive sampling for sgd by exploiting side information. In International Conference on Machine Learning, Cited by: §2.
  • [14] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu (2017) Automated curriculum learning for neural networks. In International Conference on Machine Learning, Cited by: §2.
  • [15] G. Hacohen and D. Weinshall (2019) On the power of curriculum learning in training deep networks. Cited by: §2, §4.
  • [16] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 6, §1, §2, §3, Table 2.
  • [17] B. Harwood, B. Kumar, G. Carneiro, I. Reid, T. Drummond, et al. (2017) Smart mining for deep metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2821–2829. Cited by: §2, §3.1, §5.3.
  • [18] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2017) Rainbow: combining improvements in deep reinforcement learning. External Links: 1710.02298 Cited by: 3rd item.
  • [19] J. Hu, J. Lu, and Y. Tan (2014) Discriminative deep metric learning for face verification in the wild. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [20] C. Huang, S. Zhai, W. Talbott, M. Á. Bautista, S. Sun, C. Guestrin, and J. Susskind (2019) Addressing the loss-metric mismatch with adaptive loss alignment. In ICML, Cited by: §2, §4.2.
  • [21] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning. Cited by: Appendix A, §5.
  • [22] P. Jacob, D. Picard, A. Histace, and E. Klein (2019) Metric learning with horde: high-order regularizer for deep embeddings. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 6, §2, §5.1, Table 2.
  • [23] H. Jegou, M. Douze, and C. Schmid (2011) Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: §4.2, §4.2, §5.1.
  • [24] W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon (2018) Attention-based ensemble for deep metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: Table 6, Table 2.
  • [25] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. Cited by: Appendix A.
  • [26] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1.
  • [27] D. Kotovenko, A. Sanakoyeu, S. Lang, and B. Ommer (2019) Content and style disentanglement for artistic style transfer. In Proceedings of the Intl. Conf. on Computer Vision (ICCV), Cited by: §1.
  • [28] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561. Cited by: Table 6, Appendix A, Table 1, §5.1, Table 2, Table 4, §5.
  • [29] J. Langford and T. Zhang (2008) The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 20, J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis (Eds.), pp. 817–824. External Links: Link Cited by: §4.2.
  • [30] X. Lin, Y. Duan, Q. Dong, J. Lu, and J. Zhou (2018-09) Deep variational metric learning. In The European Conference on Computer Vision (ECCV), Cited by: Table 6, §2, §5.1, Table 2.
  • [31] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
  • [32] C. Manning, P. Raghavan, and H. Schütze (2010) Introduction to information retrieval. Natural Language Engineering 16 (1), pp. 100–103. Cited by: §4.2, §4.2, §5.1.
  • [33] L. McInnes, J. Healy, N. Saul, and L. Grossberger (2018) UMAP: uniform manifold approximation and projection. The Journal of Open Source Software 3 (29), pp. 861. Cited by: Appendix D.
  • [34] T. Milbich, M. Bautista, E. Sutter, and B. Ommer (2017) Unsupervised video understanding by reconciliation of posture similarities. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.
  • [35] T. Milbich, O. Ghori, F. Diego, and B. Ommer (2020-06) Unsupervised representation learning by discovering reliable image relations. Pattern Recognition (PR) 102. Cited by: §1.
  • [36] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368. Cited by: Table 6, Appendix A, §2.
  • [37] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012. Cited by: §1, §2, §3, Table 1, §5.1, Table 2, §5.
  • [38] M. Opitz, G. Waltner, H. Possegger, and H. Bischof (2018) Deep metric learning with bier: boosting independent embeddings robustly. IEEE transactions on pattern analysis and machine intelligence. Cited by: Table 6, §1, §2, §5.1, Table 2.
  • [39] O. M. Parkhi, A. Vedaldi, and A. Zisserman (2015) Deep face recognition. In British Machine Vision Conference, Cited by: §3.1.
  • [40] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS-W, Cited by: §5.
  • [41] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. International Conference on Machine Learning. Cited by: §2.
  • [42] Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin (2019) SoftTriple loss: deep metric learning without triplet sampling. Cited by: Table 6, Appendix A, §2, §5.1, Table 2.
  • [43] P. Ramachandran, B. Zoph, and Q. V. Le (2017) Searching for activation functions. CoRR abs/1710.05941. Cited by: §2.
  • [44] K. Roth, B. Brattoli, and B. Ommer (2019-10) MIC: mining interclass characteristics for improved metric learning. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Table 6, §1, §2, §4.2, §5.1, Table 2.
  • [45] K. Roth, T. Milbich, S. Sinha, P. Gupta, B. Ommer, and J. P. Cohen (2020) Revisiting training strategies and generalization performance in deep metric learning. CoRR abs/2002.08473. Cited by: §2.
  • [46] A. Sanakoyeu, V. Tschernezki, U. Buchler, and B. Ommer (2019) Divide and conquer the embedding space for metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 6, §1, §2, §5.1, Table 2.
  • [47] S. Saxena, O. Tuzel, and D. DeCoste (2019) Data parameters: a new family of parameters for learning a differentiable curriculum. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [48] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: Figure 5, §1, §2, §3.1, §3, §4.2, Table 1, §5.1, §5.2.
  • [49] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International Conference on Machine Learning, Cited by: §4.2.
  • [50] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. CoRR. Cited by: 4th item, §4.2.
  • [51] T. Shi, J. Steinhardt, and P. Liang (2015) Learning where to sample in structured prediction. In Artificial Intelligence and Statistics, Cited by: §2.
  • [52] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: §1, §2, §3.1, §3.
  • [53] Ö. Sümer, T. Dencker, and B. Ommer (2017) Self-supervised learning of pose embeddings from spatiotemporal relations in videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1.
  • [54] R. S. Sutton and A. G. Barto (1998) Reinforcement learning: an introduction. The MIT Press. Cited by: 2nd item, §4.2.
  • [55] (2017) The sphere game in n dimensions. http://faculty. Cited by: §3.1.
  • [56] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: Table 6, Appendix A, Appendix C, Appendix D, Figure 6, Table 1, Table 2, Table 4, §5.
  • [57] J. Wang, F. Zhou, S. Wen, X. Liu, and Y. Lin (2017) Deep metric learning with angular loss. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2593–2601. Cited by: §1, §2.
  • [58] X. Wang, Y. Hua, E. Kodirov, G. Hu, R. Garnier, and N. M. Robertson (2019) Ranked list loss for deep metric learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Table 6, §2, §5.1, Table 2.
  • [59] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning. Cited by: 1st item, §4.2.
  • [60] C. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl (2017) Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848. Cited by: Figure 5, Table 6, 8(a), Appendix A, Appendix B, Table 10, Figure 1, §1, §2, §3.1, §3, §4.2, Table 1, §5.1, §5.2, §5.2, §5.3, §5.3, Table 2, §5.
  • [61] S. Xie, H. Zheng, C. Liu, and L. Lin (2019) SNAS: stochastic neural architecture search. Cited by: §2.
  • [62] H. Xuan, R. Souvenir, and R. Pless (2018) Deep randomized ensembles for metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 723–734. Cited by: Table 6, §1, §2, §5.1, Table 2.
  • [63] Y. Zhao, Z. Jin, G. Qi, H. Lu, and X. Hua (2018) An adversarial approach to hard triplet generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 501–517. Cited by: Table 6, §1, §2, Table 2.
  • [64] X. Zhe, S. Chen, and H. Yan (2018) Directional statistics-based deep metric learning for image classification and retrieval. Pattern Recognition 93. Cited by: §2.
  • [65] W. Zheng, Z. Chen, J. Lu, and J. Zhou (2019) Hardness-aware deep metric learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Table 6, §2, Table 2.

Supplementary Material

This part contains supporting or additional experiments to the main paper, such as additional ablations and qualitative evaluations.

Appendix A Additional Ablation Experiments

We now conduct further ablation experiments for different aspects of our proposed approach based on the CUB200-2011[56] dataset. Note that, as in our main paper, we did not apply any learning rate scheduling for the results of our approach, in order to establish comparable training settings.
Performance with Inception-BN: For fair comparison, we also evaluate using Inception-V1 with Batch-Normalization [21]. We follow the standard pipeline (see e.g. [36, 42]), utilizing Adam [25] with images resized and random cropped to 224x224. The learning rate is set to . We retain the size of the policy network and other hyperparameters. The results on CUB200-2011[56] and CARS196[28] are listed in Table 6. On CUB200, we achieve results competitive to previous state-of-the-art methods. On CARS196, we achieve a significant boost over baseline values and competitive performance to the state-of-the-art.
Validation set: The validation set is sampled from the training set, composed either as a fixed, disjoint, held-back subset or repeatedly re-sampled from the training set during training. Further, we can sample across all classes or include entire classes. We found (Tab. 7 (d)) that sampling across all classes works much better than holding out entire classes. Further, re-sampling provides no significant benefit, at the cost of an additional hyperparameter to tune.
Composition of states and target metric: Choosing meaningful target metrics for computing rewards and a representative composition of the training state increases the utility of our learned policy. To this end, Tab. 8 compares different combinations of state compositions and employed target metrics. We observe that incorporating information about the current structure of the embedding space into the state, such as intra- and inter-class distances, is most crucial for effective learning and adaptation. Moreover, also incorporating performance metrics which directly represent the current performance of the model, e.g. Recall@1 or NMI, additionally adds some useful information.
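As an illustration of the state components named above, the sketch below computes mean intra- and inter-class distances and Recall@1 for a small set of embeddings. It is a brute-force toy version under our own assumptions (Euclidean distance, hypothetical function names) and says nothing about how the authors compute these quantities at scale.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def class_distance_stats(embeddings, labels):
    """Return (mean intra-class distance, mean inter-class distance)."""
    intra, inter = [], []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            d = euclidean(embeddings[i], embeddings[j])
            (intra if labels[i] == labels[j] else inter).append(d)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(intra), mean(inter)


def recall_at_1(embeddings, labels):
    """Fraction of samples whose nearest other sample shares their label."""
    hits = 0
    for i, query in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        nearest = min(others, key=lambda j: euclidean(query, embeddings[j]))
        hits += labels[nearest] == labels[i]
    return hits / len(embeddings)
```

Concatenating such values (and, where used, NMI) gives a compact description of the embedding space that a policy can take as input.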
Frequency of updating the policy: We compute the reward for an adjustment to the sampling distribution after a fixed number of DML training iterations. High values of this update interval reduce the variance of the rewards, however, at the cost of slow policy updates, which result in potentially large discrepancies to the continuously updated model. Tab. 9 (a) shows that intermediate values result in a good trade-off between the stability of the rewards and the adaptation of the sampling distribution to the model. Moreover, we also show the result for never updating, i.e. using the initial distribution throughout training without adaptation. Fixing this distribution performs worse than the reference method, Margin loss with static distance-based sampling[60]. Nevertheless, frequently adjusting the sampling distribution leads to significantly superior performance, which indicates that our policy effectively adapts to the training state of the model.
Importance of long-term information for states: For optimal learning, a state should not only contain information about the current training state of the model, but also about some history of the learning process. Therefore, we compose each state of a set of running averages over different lengths for various training state components, as discussed in the implementation details of the main paper. Tab. 9 (b) confirms the importance of long-term information for stable adaptation and learning, and shows which set of moving average lengths works best.
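The idea of describing one state component by several running averages of different lengths can be sketched as below. The window lengths are taken from Tab. 9 (b); the use of simple (unweighted) windowed means and the class name are assumptions made for this illustration, since the exact averaging scheme is described in the main paper rather than here.

```python
from collections import deque


class RunningAverages:
    """Track one scalar state component at several history lengths."""

    def __init__(self, lengths=(2, 8, 16, 32)):
        # One bounded window per length; old values drop out automatically.
        self.windows = {n: deque(maxlen=n) for n in lengths}

    def update(self, value):
        """Record the newest measurement in every window."""
        for window in self.windows.values():
            window.append(value)

    def features(self):
        """Return one mean per window length, ordered as in `lengths`."""
        return [sum(w) / len(w) if w else 0.0 for w in self.windows.values()]
```

Short windows react quickly to the current training state, while long windows summarize the recent history of the learning process.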

| Approach | Dim | CUB200-2011[56] R@1 | R@2 | R@4 | NMI | CARS196[28] R@1 | R@2 | R@4 | NMI |
|---|---|---|---|---|---|---|---|---|---|
| HTG[63] | 512 | 59.5 | 71.8 | 81.3 | - | 76.5 | 84.7 | 90.4 | - |
| HDML[65] | 512 | 53.7 | 65.7 | 76.7 | 62.6 | 79.1 | 87.1 | 92.1 | 69.7 |
| HTL[11] | 512 | 57.1 | 68.8 | 78.7 | - | 81.4 | 88.0 | 92.7 | - |
| DVML[30] | 512 | 52.7 | 65.1 | 75.5 | 61.4 | 82.0 | 88.4 | 93.3 | 67.6 |
| A-BIER[38] | 512 | 57.5 | 68.7 | 78.3 | - | 82.0 | 89.0 | 93.2 | - |
| MIC[44] | 128 | 66.1 | 76.8 | 85.6 | 69.7 | 82.6 | 89.1 | 93.2 | 68.4 |
| D&C[46] | 128 | 65.9 | 76.6 | 84.4 | 69.6 | 84.6 | 90.7 | 94.1 | 70.3 |
| Margin[60] | 128 | 63.6 | 74.4 | 83.1 | 69.0 | 79.6 | 86.5 | 90.1 | 69.1 |
| Reimpl. Margin[60], IBN | 512 | 63.8 | 75.3 | 84.7 | 67.9 | 79.7 | 86.9 | 91.4 | 67.2 |
| Ours (Margin[60] + PADS, IBN) | 512 | 66.6 | 77.2 | 85.6 | 68.5 | 81.7 | 88.3 | 93.0 | 68.2 |
| Significant increase in network parameters: | | | | | | | | | |
| HORDE[22]+Contr.[16] | 512 | 66.3 | 76.7 | 84.7 | - | 83.9 | 90.3 | 94.1 | - |
| SOFT-TRIPLE[42] | 512 | 65.4 | 76.4 | 84.5 | - | 84.5 | 90.7 | 94.5 | 70.1 |
| Ensemble Methods: | | | | | | | | | |
| Rank[58] | 1536 | 61.3 | 72.7 | 82.7 | 66.1 | 82.1 | 89.3 | 93.7 | 71.8 |
| DREML[62] | 9216 | 63.9 | 75.0 | 83.1 | 67.8 | 86.0 | 91.7 | 95.0 | 76.4 |
| ABE[24] | 512 | 60.6 | 71.5 | 79.8 | - | 85.2 | 90.5 | 94.0 | - |
Table 6: Comparison to the state-of-the-art DML methods on CUB200-2011[56] and CARS196[28] using the Inception-BN Backbone (see e.g. [36, 42]) and embedding dimension of 512.
Validation Set:
Table 7: Composition of . Superscript / denotes usage of entire classes/sampling across classes. denotes re-sampling during training with best found frequency of .
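| State composition | Reported metric | Target: NMI | Target: R@1 | Target: R@1 + NMI |
|---|---|---|---|---|
| Recall, Dist., NMI | R@1 | 63.9 | 65.5 | 65.6 |
| | NMI | 68.5 | 68.9 | 69.2 |
| Recall, Dist. | R@1 | 65.0 | 65.7 | 64.4 |
| | NMI | 68.5 | 69.2 | 69.4 |
| Recall, NMI | R@1 | 63.7 | 63.9 | 64.2 |
| | NMI | 68.4 | 68.2 | 68.5 |
| Dist., NMI | R@1 | 65.3 | 65.3 | 65.1 |
| | NMI | 68.8 | 68.7 | 68.5 |
| Dist. | R@1 | 65.3 | 65.5 | 64.3 |
| | NMI | 68.8 | 69.1 | 68.6 |
| Recall | R@1 | 64.2 | 65.1 | 64.9 |
| | NMI | 67.8 | 69.0 | 68.4 |
| NMI | R@1 | 64.3 | 64.8 | 63.9 |
| | NMI | 68.7 | 69.2 | 68.4 |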
Table 8: Comparison of different compositions of the training state and reward metric . Dist. denotes average intra- and inter-class distances. Recall in state composition denotes all Recall@k-values, whereas for the target metric only Recall@1 was utilized.
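| DML iterations per policy update | 10 | 30 | 50 | 70 | 100 | no updates | [60] |
|---|---|---|---|---|---|---|---|
| R@1 | 64.4 | 65.7 | 65.4 | 65.2 | 65.1 | 61.9 | 63.5 |
| NMI | 68.3 | 69.2 | 69.2 | 68.9 | 69.0 | 67.0 | 68.1 |

(a) Evaluation of the policy update frequency.

| Moving average lengths | 2 | 2, 32 | 2, 8, 16, 32 | 2, 8, 16, 32, 64 |
|---|---|---|---|---|
| R@1 | 64.5 | 65.4 | 65.7 | 65.6 |
| NMI | 68.6 | 69.1 | 69.2 | 69.3 |

(b) Evaluation of various sets of moving average lengths.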
Table 9: Ablation experiments: (a) evaluates the influence of the number of DML iterations performed before updating the policy using a reward and, thus, the update frequency of . (b) analyzes the benefit of long-term learning progress information added to training states by means of using various moving average lengths .
Figure 5: Visual comparison between fixed sampling curriculums and a learned progression of by PADS. Left: log-scale over , right: original scale. Top row: learned sampling schedule (PADS); middle row: linear shift of a sampling interval from semihard[48] negatives to hard negatives; bottom row: shifting a static distance-based sampling[60] to gradually sample harder negatives.

Appendix B Curriculum Evaluations

In Fig. 5 we visually illustrate the fixed curriculum schedules which we applied for the comparison experiment in Sec. 5.2 of our main paper. We evaluated various schedules: a linear progression of sampling intervals, starting at semi-hard negatives and moving to hard negatives, and a progressive shift of distance-based sampling[60] towards harder negatives. The schedules visualized were among the best performing ones on both the CUB200 and CARS196 datasets.

Appendix C Comparison of RL Algorithms

We evaluate the applicability of the following RL algorithms for optimizing our policy (Eq. 4 in the main paper):

  • REINFORCE algorithm[59] with and without Exponential Moving Average (EMA)
  • Advantage Actor Critic (A2C)[54]
  • Rainbow Q-Learning[18] without extensions (vanilla) and using Priority Replay and 2-Step updates
  • Proximal Policy Optimization (PPO)[50] applied to REINFORCE with EMA and to A2C.

Table 10: Comparison of different RL algorithms. For policy-based algorithms (REINFORCE, PPO) we either use Exponential Moving Average (EMA) as a variance-reducing baseline or employ Advantage Actor Critic (A2C). In addition, we also evaluate Q-Learning methods (vanilla and Rainbow Q-Learning). For the Rainbow setup we use Priority Replay and 2-Step value approximation. Margin loss[60] is used as a representative reference for static sampling strategies.

For a comparable evaluation setting we use the CUB200-2011[56] dataset without learning rate scheduling and a fixed 150 epochs of training. Within this setup, the hyperparameters related to each method are optimized via cross-validation. Tab. 10 shows that all methods, except for vanilla Q-Learning, result in an adjustment policy for the sampling distribution which outperforms static sampling strategies. Moreover, policy-based methods in general perform better than Q-Learning based methods, with PPO being the best performing algorithm. We attribute this to the reduced search space (Q-Learning methods need to evaluate in state-action space, unlike policy methods, which work directly over the action space), as well as to not employing replay buffers, i.e. not acting off-policy, since state-action pairs of previous training iterations may no longer be representative of current training stages.
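For readers unfamiliar with the policy-based variants compared above, the following sketch shows the standard PPO clipped surrogate term for a single action together with an EMA reward baseline. It illustrates the general techniques from [50] and the EMA baseline idea only; the clipping range, the decay and all names are assumed values for illustration and are not taken from the paper.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one action (to be maximised).

    logp_new: log-probability of the action under the current policy
    logp_old: log-probability under the pre-update (auxiliary) policy
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes the incentive to move the ratio far from 1.
    return min(ratio * advantage, clipped_ratio * advantage)


class EmaBaseline:
    """Exponential moving average of past rewards, used to reduce variance."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.value = 0.0

    def advantage(self, reward):
        """Return reward minus the current baseline, then update the baseline."""
        adv = reward - self.value
        self.value = self.decay * self.value + (1.0 - self.decay) * reward
        return adv
```

With A2C, the advantage would instead come from a learned value estimate; the clipped surrogate itself stays the same.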

Appendix D Qualitative UMAP Visualization

Figure 6 shows a UMAP[33] embedding of test image features for CUB200-2011[56] learned by our model using PADS. We can see clear groupings for birds of the same and similar classes. Clusterings based on similar backgrounds are primarily due to dataset bias, e.g. certain types of birds occur only in conjunction with specific backgrounds.

Appendix E Pseudo-Code

Algorithm 1 gives an overview of our proposed PADS approach using PPO with A2C as underlying RL method.
Before training, our sampling distribution is initialized with an initial distribution. Further, we initialize both the adjustment policy and the pre-update auxiliary policy used for estimating the PPO probability ratio. Then, DML training is performed using triplets with random anchor-positive pairs and negatives sampled from the current sampling distribution. After a fixed number of iterations, all reward and state metrics are computed on the embeddings of the validation set. These values are aggregated into a training reward and an input state. While the reward is used to update the current policy, the state is fed into the updated policy to estimate adjustments to the sampling distribution. Finally, after a fixed number of policy updates, the auxiliary policy is updated with the current policy weights.

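Input : training data, validation data, Train labels, Val. labels, total iterations
Parameter : Reward metrics, State metrics + running average lengths, Num. of bins, multipliers, num. of iterations before policy updates, num. of policy updates before auxiliary policy updates

// Initialization
initialize the sampling distribution, the adjustment policy and the auxiliary policy

for i in (number of policy updates) do
       // Update DML Model
       for j in (DML iterations before a policy update) do
             draw a batch with random anchor-positive pairs
             sample negatives from the current sampling distribution // within batch
             update the DML model with the triplet-based loss
       end for
       // Update policy
       compute reward and state metrics on the validation embeddings
       update the policy with the reward via PPO
       adjust the sampling distribution with the updated policy, given the state
       if i mod (auxiliary update interval) == 0 then
             copy the current policy weights to the auxiliary policy
       end if
end for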
Algorithm 1 Training one epoch via PADS by PPO
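Algorithm 1 lists multipliers among its parameters and describes the policy as estimating adjustments to the sampling distribution. One plausible reading, sketched here purely as an assumption, is that the policy picks one of three actions per distance bin (decrease, keep, increase), the bin probability is scaled by the matching multiplier, and the result is renormalized. The function name, the action encoding and the multiplier values are hypothetical.

```python
def adjust_distribution(bin_probs, actions, decrease=0.8, increase=1.25):
    """Scale each bin by its action's multiplier and renormalise.

    actions: one entry per bin, -1 (decrease), 0 (keep) or +1 (increase).
    decrease / increase: hypothetical multipliers, not values from the paper.
    """
    multipliers = {-1: decrease, 0: 1.0, 1: increase}
    scaled = [p * multipliers[a] for p, a in zip(bin_probs, actions)]
    total = sum(scaled)
    # Renormalising keeps the result a valid probability distribution.
    return [s / total for s in scaled]
```

Applied once per policy update, such small multiplicative steps let the sampling distribution drift gradually rather than jump between unrelated shapes.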
Figure 6: UMAP embedding based on the image embeddings obtained from our proposed approach on CUB200-2011[56] (Test Set).

Appendix F Typical image retrieval failure cases

Fig. 7 shows nearest neighbours for good/bad test set retrievals. Even though the nearest neighbors do not always share the same class label as the anchor, all neighbors are very similar to the bird species depicted in the anchor images. Failures are due to very subtle differences.

Figure 7: Selection of good and bad nearest neighbour retrieval cases on CUB200-2011 (Test). Orange bounding box marks query images, green/red boxes denote correct/incorrect retrievals.