1 Introduction
Metric learning aims to learn a metric that measures the distance between examples and captures a certain notion of human-defined similarity. Deep metric learning (DML) has emerged as an effective approach for learning such a metric by training a deep neural network. Simply speaking, a deep neural network induces a new feature embedding of examples, and it is trained in such a way that the Euclidean distance between the embeddings of two similar examples shall be small and that between the embeddings of two dissimilar examples shall be large. DML has been widely used in many tasks such as face recognition (Fan et al. (2017)), image retrieval (Chen & Deng (2019)), and classification (Qian et al. (2015); Li et al. (2019)). However, unlike training a deep neural network by minimizing the classification error, training a deep neural network for metric learning is notoriously more difficult (Qian et al. (2018); Wang et al. (2017)). Many studies have attempted to address this challenge by focusing on several issues. The first issue is how to define a loss function over pairs of examples. A variety of loss functions have been proposed, such as contrastive loss (Hadsell et al. (2006)), binomial deviance loss (Yi et al.), margin loss (Wu et al. (2017)), lifted-structure (LS) loss (Oh Song et al. (2016)), N-pair loss (Sohn (2016)), triplet loss (Schroff et al. (2015)), and multi-similarity (MS) loss (Wang et al. (2019)). The major difference between these pair-based losses lies in how the pairs interact with each other in a mini-batch. In simple pairwise losses such as binomial deviance loss, contrastive loss, and margin loss, pairs are regarded as independent of each other. In triplet loss, a positive pair only interacts with one negative pair. In N-pair loss, a positive pair interacts with all negative pairs. In LS loss and MS loss, a positive pair interacts with all positive pairs and all negative pairs. The trend is that the loss functions become increasingly complicated but difficult to understand.

In parallel with the loss function, how to select informative pairs to construct the loss has also received great attention. Traditional approaches that construct pairs or triplets over all examples before training suffer from a prohibitive $O(n^2)$ or $O(n^3)$ sample complexity, where $n$ is the total number of examples. To tackle this issue, constructing pairs within a mini-batch is widely used in practice. Although it helps mitigate the computational and storage burden, slow convergence and model degeneration with inferior performance still commonly occur when using all pairs in a mini-batch to update the model. To combat this issue, various pair mining methods have been proposed to complement the design of the loss function, such as hard (semi-hard) mining for triplet loss (Schroff et al. (2015)), distance weighted sampling (DWS) for margin loss (Wu et al. (2017)), and MS sampling for MS loss (Wang et al. (2019)). These sampling methods usually keep all positive (similar) pairs and select roughly the same order of negative (dissimilar) pairs according to some criterion.
Despite these great efforts, existing studies either fail to explain the most fundamental problem in DML or fail to propose the most effective approach towards addressing it. It is evident that the loss functions become more and more complicated, but it is still unclear why these complicated losses are effective and how the pair mining methods affect the overall loss within a mini-batch. In this paper, we propose a novel and effective solution to DML and bring new insights from the perspective of learning theory that can guide the discovery of new methods. Our philosophy is simple: cast DML as a simple pairwise classification problem and focus on addressing the most critical issue, i.e., the sheer imbalance between positive pairs and negative pairs. To this end, we employ simple pairwise loss functions (e.g., margin loss, binomial deviance loss) and propose a flexible distributionally robust optimization (DRO) framework for defining a robust loss over pairs within a mini-batch. The idea of DRO is to assign different weights to different pairs, where the weights are optimized by maximizing the weighted loss over an uncertainty set for the distributional variable. The model is updated by stochastic gradient descent with stochastic gradients computed on pairs sampled according to the optimal distributional variable.
The DRO framework allows us to (i) connect to advanced learning theories that have already exhibited their power for imbalanced data, hence providing theoretical explanation for the proposed framework; (ii) unify pair-sampling and loss-based methods, providing a unified perspective on existing solutions; and (iii) induce simple and effective methods for DML, leading to state-of-the-art performance on several benchmark datasets. The contributions of our work are summarized as follows:


We propose a general solution framework for DML, i.e., by defining a robust overall loss based on the DRO formulation and updating the model based on pairs sampled according to the optimized sampling probabilities. We provide theoretical justification of the proposed framework from the perspective of advanced learning theories.

We show that the general DRO framework can recover existing methods based on complicated pair-based losses, namely LS loss and MS loss, by specifying different uncertainty sets for the distributional variable in DRO. This verifies that our method is general and brings a unified perspective on pair sampling and complicated losses over all pairs within a batch.

We also propose simple solutions under the general DRO framework for tackling DML. Experimental results show that the proposed variants of our DRO framework outperform state-of-the-art methods on several benchmark datasets.
2 Related Work
Loss Design. The loss function is usually defined over the similarities or distances between the induced feature embeddings of pairs. Simple pairwise losses, e.g., contrastive loss, binomial deviance loss, and margin loss, regard DML as a binary classification problem and use an averaged loss over pairs. It is notable that the binomial deviance loss proposed in (Yi et al.) assigns asymmetric weights to positive and negative pairs, which can mitigate the imbalance issue to a certain degree. The principle behind the newly designed complicated pair-based losses can be summarized as heuristically discovering specific kinds of relevant information between groups of pairs to boost training. The key difference between these complicated losses lies in how the pairs are grouped: N-pair loss puts one positive pair and all negative pairs together, while lifted-structure loss and MS loss group all positive pairs together and all negative pairs together for each example. In contrast, our DRO framework employs a simple pairwise loss but induces a complicated overall loss in a systematic and interpretable way.
Pair Mining/Pair Weighting. Wu et al. (2017) point out that pair mining plays an important role in distance metric learning. Different pair mining methods have been proposed, including semi-hard sampling for triplet loss, distance weighted sampling (DWS) for margin loss, and MS mining for MS loss. These pair mining methods aim to select hard positive and negative pairs for each anchor. For instance, Schroff et al. (2015) select hard negative pairs whose distance is smaller than that of the positive pair in a triplet; Shi et al. (2016) select the hardest positive pair whose distance is smaller than that of the nearest negative pair in a batch; and MS mining (Wang et al. (2019)) selects hard negative pairs whose distance is smaller than the largest distance between positive pairs and, at the same time, hard positive pairs whose distance is larger than the smallest distance between negative pairs. The DWS method keeps all positive pairs but samples negative pairs according to their distance distribution within a batch. The proposed DRO framework induces a pair sampling method by using the optimal distributional variable that defines the robust loss over pairs within a mini-batch. As a result, the sampling probabilities induced by our DRO framework are automatically adaptive to the pair-based losses. Other works study the problem from the perspective of pair weighting instead of pair sampling. For example, Yu et al. (2018) heuristically design exponential weights for the different pairs in a triplet, which is a special case of our DRO framework; details are provided in the supplementary. However, since the quality of anchors varies greatly, it may not be reasonable to sample the same number of pairs from every anchor.
Imbalance Data Classification.
There are many studies in machine learning that have tackled the imbalance issue. Commonly used tricks include over-sampling, under-sampling, and cost-sensitive learning. However, these approaches do not take the differences between examples into account. Other effective approaches grounded in advanced learning theories include minimizing maximal losses (Shalev-Shwartz & Wexler, 2016), minimizing top-k losses (Fan et al., 2017), and minimizing variance-regularized losses (Namkoong & Duchi, 2017). However, these approaches are not efficient for deep learning with big data, which is a severe issue in DML. In contrast, the proposed DRO formulation is defined over a mini-batch of examples, which inherits the theoretical explanation from the literature and is much more efficient for DML. In addition, the losses induced by our DRO formulation include the maximal loss, the top-k loss, and the variance-regularized loss as special cases obtained by specifying different uncertainty sets for the distributional variable.
3 DML As A DRO-Based Binary Classification Problem
In this section, we first present a general framework for DML based on DRO with theoretical justification. We then discuss three simple variants of the proposed framework and also show how the proposed framework recovers existing complicated losses for DML.
Preliminaries. Let $\mathbf{x}$ denote an input datum (e.g., an image) and $f(\mathbf{x};\theta)$ denote the feature embedding function defined by a deep neural network parameterized by $\theta$. The central task in DML is to update the model parameter $\theta$ by leveraging pairs of similar and dissimilar examples. Following most existing works, at each iteration we sample a mini-batch of $B$ examples denoted by $\mathbf{x}_1,\ldots,\mathbf{x}_B$. We can construct $B^2$ pairs between these examples (for simplicity, we consider all pairs including self-pairs), and let $y_{ij}$ denote the label of a pair, i.e., $y_{ij}=1$ if the pair $(\mathbf{x}_i,\mathbf{x}_j)$ is similar (positive) and $y_{ij}=-1$ if the pair is dissimilar (negative). The label of a pair can be either defined by users or derived from the class labels of individual examples. Existing works on DML follow the same paradigm for learning the deep neural network: a loss function $L(\theta)$ is first defined over the pairs within a mini-batch, and the model parameter is updated by gradient-based methods. Various gradient-based methods can be used, including stochastic gradient descent (SGD), stochastic momentum methods, and adaptive gradient methods (e.g., Adam). Taking SGD as an example, the model parameter can be updated by $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$, where $\eta$ denotes the learning rate. The focus here is how to define the loss function over all pairs within a mini-batch. As mentioned earlier, we cast the problem as a simple binary classification problem, i.e., classifying a pair into positive or negative. To this end, we use $\ell(\mathbf{x}_i,\mathbf{x}_j;\theta)$ to denote the pairwise classification loss between $\mathbf{x}_i$ and $\mathbf{x}_j$ in the embedding space (e.g., margin loss (Wu et al. (2017)), binomial deviance loss (Yi et al.)). A naive approach for DML is to use the averaged loss over all pairs, i.e., $L_{\text{avg}}(\theta)=\frac{1}{B^2}\sum_{i,j}\ell(\mathbf{x}_i,\mathbf{x}_j;\theta)$. However, this approach suffers from a severe imbalance issue: most pairs are negative, so the gradient of $L_{\text{avg}}$ will be dominated by that of the negative pairs.
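To make the setup concrete, the following sketch computes the all-pairs margin losses and the naive averaged loss over a mini-batch. The margin-loss form follows Wu et al. (2017), but the function names and default hyper-parameter values are illustrative assumptions (NumPy is used here in place of a PyTorch implementation):

```python
import numpy as np

def pairwise_margin_losses(embeddings, labels, alpha=0.2, beta=1.2):
    """All-pairs margin losses within a mini-batch.

    ell_ij = max(0, alpha + y_ij * (d_ij - beta)), where y_ij = +1 for a
    positive (same-class) pair, y_ij = -1 for a negative pair, and d_ij is
    the Euclidean distance between the two embeddings.
    """
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))               # (B, B) distances
    y = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)
    return np.maximum(0.0, alpha + y * (dist - beta))      # (B, B) losses

def naive_average_loss(embeddings, labels, **kwargs):
    # Naive objective: average over all B^2 pairs -- dominated by negatives.
    return pairwise_margin_losses(embeddings, labels, **kwargs).mean()
```

Because negatives vastly outnumber positives in the $B^2$ pairs, the mean above is the quantity the DRO framework replaces with a robust weighted loss.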
3.1 General DROBased Framework
To address the imbalanced pair issue, we propose a general DRO formulation to compute a robust loss. The formulation of our DRObased loss over all pairs within a minibatch is given below:
(1)  $L_{\text{DRO}}(\theta) \;=\; \max_{\mathbf{p}\in\mathcal{U}} \;\sum_{i,j} p_{ij}\,\ell(\mathbf{x}_i,\mathbf{x}_j;\theta)$
where $\mathbf{p}=(p_{11},\ldots,p_{BB})$ is a non-negative vector with each element $p_{ij}$ representing a weight for an individual pair, and $\mathcal{U}$ denotes the decision set of $\mathbf{p}$, which encodes some prior knowledge about $\mathbf{p}$. In the literature on DRO (Namkoong & Duchi (2017)), $\mathbf{p}$ is interpreted as a probability vector such that $\sum_{i,j}p_{ij}=1$, called the distributional variable, and $\mathcal{U}$ denotes the uncertainty set that specifies how $\mathbf{p}$ deviates from the uniform probabilities $p_{ij}=1/B^2$. In the next subsection, we propose simple variants of the above general framework by specifying different constraints or regularizations for $\mathbf{p}$. Below, we provide some theoretical evidence to justify the above framework.

To theoretically justify the above loss, we connect (1) to existing works in machine learning by considering three different uncertainty sets for $\mathbf{p}$. First, we can consider the simple constraint $\mathcal{U}=\Delta:=\{\mathbf{p}: p_{ij}\geq 0, \sum_{i,j}p_{ij}=1\}$. As a result, $L_{\text{DRO}}(\theta)$ becomes the maximal loss over all pairs. Shalev-Shwartz & Wexler (2016) show that minimizing the maximum loss is robust to imbalanced data distributions and also derive a better generalization error for imbalanced data with a rare class. However, the maximal loss is more sensitive to outliers (Zhu et al., 2019). To address this issue, the top-$K$ loss (Fan et al., 2017) and the variance-regularized loss (Namkoong & Duchi, 2017) were proposed, both of which can be induced by the above DRO framework. If we set $\mathcal{U}=\{\mathbf{p}\in\Delta: p_{ij}\leq 1/K\}$, $L_{\text{DRO}}(\theta)$ becomes the top-$K$ loss $\frac{1}{K}\sum_{s=1}^{K}\ell_{[s]}$, where $\ell_{[s]}$ denotes the $s$-th largest loss over all pairs. If we set $\mathcal{U}=\{\mathbf{p}\in\Delta: D_\phi(\mathbf{p}\,\|\,\mathbf{1}/n)\leq \rho/n\}$, where $D_\phi$ denotes the $\chi^2$-divergence between two distributions and $n=B^2$, then the DRO-based loss becomes the variance-regularized loss under a certain condition on the variance of the random loss, i.e., for a set of $n$ i.i.d. random losses $\ell_1,\ldots,\ell_n$ we could have
$\max_{\mathbf{p}\in\mathcal{U}}\sum_{i}p_i\ell_i = \frac{1}{n}\sum_{i=1}^n \ell_i + \sqrt{\frac{2\rho\,\widehat{\mathrm{Var}}_n(\ell)}{n}},$
where $\widehat{\mathrm{Var}}_n(\ell)$ denotes the empirical variance of $\ell_1,\ldots,\ell_n$. The second term on the R.H.S. of the above equation involves the variance, which can play the role of a regularizer. The variance-regularized loss has been justified from advanced learning theory by Namkoong & Duchi (2017), and its promising performance for imbalanced data has been observed as well.
Before ending this subsection, we discuss how to update the model parameter based on the robust loss defined by (1). A simple approach is to find an optimal distributional variable $\mathbf{p}^*$ of (1) and then update $\theta$ according to the subgradient of the weighted loss $\sum_{i,j}p^*_{ij}\ell(\mathbf{x}_i,\mathbf{x}_j;\theta)$, which is justified by the following lemma.
Lemma 1
Let $\psi(\mathbf{p},\theta)=\sum_{i,j}p_{ij}\ell(\mathbf{x}_i,\mathbf{x}_j;\theta)$. Assume that $-\psi$ is proper and lower semi-continuous in $(\mathbf{p},\theta)$ and level-bounded in $\mathbf{p}$ locally uniformly in $\theta$. Then the subgradient satisfies $\partial L_{\text{DRO}}(\theta)\subseteq \mathrm{conv}\{\nabla_\theta \psi(\mathbf{p}^*,\theta):\mathbf{p}^*\in P^*(\theta)\}$, where $P^*(\theta)$ denotes the optimal solution set of the maximization problem in (1). Furthermore, when $\psi$ is smooth in $\theta$ and $P^*(\theta)$ is a singleton, i.e., $\mathbf{p}^*$ is unique, we have $\nabla L_{\text{DRO}}(\theta)=\nabla_\theta\psi(\mathbf{p}^*,\theta)$.
Remark 1
The above lemma can be proved by Theorem 10.13 in Rockafellar & Wets (2009). It shows that even if we may not directly compute $\nabla L_{\text{DRO}}(\theta)$, our framework can at least obtain an element of a superset of $\partial L_{\text{DRO}}(\theta)$. In particular, under the additional conditions that $\psi$ is smooth in $\theta$ and the optimal solution is unique (as in our regularized formulation below), it is theoretically guaranteed that our framework exactly computes $\nabla L_{\text{DRO}}(\theta)$.
3.2 Proposed Three Variants of Our Framework
In this subsection, we present three variants of our general framework. In order to contrast them with the variants recovering existing complicated losses presented in the next subsection, we introduce some notation and make some simplifications. For each example $\mathbf{x}_i$ that serves as an anchor, let $\mathcal{P}_i$ and $\mathcal{N}_i$ denote the index sets of its positive and negative pairs, respectively, and let $\mathcal{P}=\cup_i\mathcal{P}_i$ and $\mathcal{N}=\cup_i\mathcal{N}_i$. We denote the cardinality of a set $\mathcal{S}$ by $|\mathcal{S}|$, and for simplicity we write $\ell_{ij}=\ell(\mathbf{x}_i,\mathbf{x}_j;\theta)$. Since pairs with zero loss usually do not contribute to the computation of the subgradient for updating the model, we can simply eliminate them from consideration.
The first variant simply selects the top $K$ pairs with the largest losses, which is equivalent to the DRO formulation with the uncertainty set $\mathcal{U}=\{\mathbf{p}\in\Delta: p_{ij}\leq 1/K\}$, yielding the robust loss $L_{\text{topK}}(\theta)=\frac{1}{K}\sum_{s=1}^{K}\ell_{[s]}$, where $K$ is a hyper-parameter and $\ell_{[s]}$ denotes the $s$-th largest pairwise loss. The gradient of the robust loss can be simply computed by sorting the pairwise losses and averaging the gradients of the top $K$ losses.
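A minimal sketch of this variant, assuming the per-pair losses for a mini-batch are given as a matrix; the helper name and the use of NumPy are our own choices:

```python
import numpy as np

def select_topk_pairs(loss_matrix, k):
    """Indices of the k pairs with the largest losses in the batch, plus the
    robust loss (their average); the model update then back-propagates only
    through these selected pairs."""
    flat = loss_matrix.ravel()
    idx = np.argpartition(flat, -k)[-k:]                  # unordered top-k
    rows, cols = np.unravel_index(idx, loss_matrix.shape)
    return rows, cols, flat[idx].mean()
```

`argpartition` avoids a full sort, so selection costs $O(B^2)$ rather than $O(B^2\log B^2)$ per batch.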
The second variant is a variant of the variance-regularized loss. Instead of specifying the uncertainty set, we use a regularization term for ease of computation; the robust loss is defined by
$L_{\text{KL}}(\theta)=\max_{\mathbf{p}\in\Delta}\sum_{i,j}p_{ij}\ell(\mathbf{x}_i,\mathbf{x}_j;\theta)-\lambda\,\mathrm{KL}(\mathbf{p}\,\|\,\mathbf{1}/B^2),$
where $\lambda>0$ is a hyper-parameter and $\mathrm{KL}(\cdot\|\cdot)$ denotes the KL divergence between two probability vectors. The optimal solution $\mathbf{p}^*$ can be easily computed in closed form following Namkoong & Duchi (2016), namely $p^*_{ij}\propto\exp(\ell(\mathbf{x}_i,\mathbf{x}_j;\theta)/\lambda)$. It is notable that the optimal solution is not necessarily sparse. Hence, computing the gradient of the robust loss requires computing the gradient of the pairwise loss for all pairs, which could be prohibitive in practice when the mini-batch size is large. To alleviate this issue, we can simply sample a subset of pairs according to the probabilities in $\mathbf{p}^*$ and then compute the averaged gradient over these sampled pairs.
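A sketch of the closed-form weights and the subsequent subset sampling, under the KL-regularized formulation above; the names and the sampling interface are illustrative:

```python
import numpy as np

def kl_dro_weights(losses, lam):
    """Closed-form maximizer of <p, losses> - lam * KL(p || uniform) over the
    simplex: p_i proportional to exp(losses_i / lam), i.e., a softmax over
    the pair losses; lam controls how sharply hard pairs are up-weighted."""
    z = np.asarray(losses, dtype=float) / lam
    z = z - z.max()                                       # numerical stability
    p = np.exp(z)
    return p / p.sum()

def sample_pair_subset(losses, lam, n_samples, rng):
    # Sample a sparse subset of pairs according to the optimal weights; the
    # averaged gradient over this subset then drives the model update
    # (the gradient step itself is omitted in this sketch).
    p = kl_dro_weights(losses, lam)
    return rng.choice(len(p), size=n_samples, replace=False, p=p)
```

A large `lam` flattens the weights toward the naive average, while a small `lam` concentrates them on the hardest pairs.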
The third variant of our DRO framework explicitly balances the number of positive pairs and negative pairs by choosing the top pairs from each class, which is denoted by DRO-TopK-PN. For implementation, we simply select the $K_+$ positive pairs and the $K_-$ negative pairs with the largest losses, respectively, and compute the averaged gradient of the pairwise losses of the selected pairs for updating the model parameter.
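A minimal sketch of this balanced selection, assuming the positive-pair and negative-pair losses are given as separate arrays; the counts appear as `k_pos` and `k_neg`:

```python
import numpy as np

def select_topk_pn(pos_losses, neg_losses, k_pos, k_neg):
    """Balanced top-k selection: take the k_pos hardest positive and the
    k_neg hardest negative pairs, so both classes contribute to every update
    regardless of the (heavily negative-skewed) pair ratio."""
    pos_idx = np.argsort(pos_losses)[-k_pos:]
    neg_idx = np.argsort(neg_losses)[-k_neg:]
    robust = (pos_losses[pos_idx].sum() + neg_losses[neg_idx].sum()) / (k_pos + k_neg)
    return pos_idx, neg_idx, robust
```

Unlike the plain top-$K$ variant, this selection cannot be crowded out by negatives even when they dominate the batch.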
3.3 Recovering the Method based on SOTA PairBased Loss
Next, we show that the proposed framework can recover methods based on SOTA complicated losses. With the induced feature vectors normalized to have unit norm, we define the similarity of two examples as $S_{ij}=f(\mathbf{x}_i;\theta)^\top f(\mathbf{x}_j;\theta)$, i.e., their dot product. Specifically, we consider two SOTA loss functions, the LS and MS losses, which are defined below:
(2)  $L_{\text{LS}}=\frac{1}{B}\sum_{i=1}^{B}\Big[\log\sum_{k\in\mathcal{P}_i}e^{\lambda-S_{ik}}+\log\sum_{k\in\mathcal{N}_i}e^{S_{ik}-\lambda}\Big]_+$

(3)  $L_{\text{MS}}=\frac{1}{B}\sum_{i=1}^{B}\Big\{\frac{1}{\alpha}\log\Big[1+\sum_{k\in\mathcal{P}_i}e^{-\alpha(S_{ik}-\lambda)}\Big]+\frac{1}{\beta}\log\Big[1+\sum_{k\in\mathcal{N}_i}e^{\beta(S_{ik}-\lambda)}\Big]\Big\}$

where $\alpha$, $\beta$, and $\lambda$ are hyper-parameters of these losses.
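For concreteness, the following sketch evaluates the MS loss from a batch similarity matrix, following the commonly cited form from Wang et al. (2019); the default hyper-parameter values here are illustrative, not the paper's tuned settings:

```python
import numpy as np

def ms_loss(sim, labels, alpha=2.0, beta=50.0, lam=0.5):
    """MS loss from a (B, B) similarity matrix (self-pairs excluded).

    Per anchor i: (1/alpha) * log(1 + sum_{k in P_i} exp(-alpha*(S_ik - lam)))
                + (1/beta)  * log(1 + sum_{k in N_i} exp( beta *(S_ik - lam)))
    """
    B = sim.shape[0]
    total = 0.0
    for i in range(B):
        same = labels == labels[i]
        pos = sim[i, same & (np.arange(B) != i)]   # positives, no self-pair
        neg = sim[i, ~same]
        total += np.log1p(np.exp(-alpha * (pos - lam)).sum()) / alpha
        total += np.log1p(np.exp(beta * (neg - lam)).sum()) / beta
    return total / B
```

The positive term grows when same-class similarities fall and the negative term grows when different-class similarities rise, which is the behavior the DRO weights reproduce below.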
The key to our argument is that the gradient computed based on these losses can be exactly recovered by our DRO framework by choosing an appropriate constraint set and setting the pairwise loss to a margin loss. To this end, we first consider the gradient of the LS loss, which can be computed by (Wang et al., 2019):
(4)  $\frac{\partial L_{\text{LS}}}{\partial\theta}=\sum_{i=1}^{B}\sum_{j\neq i}\frac{\partial L_{\text{LS}}}{\partial S_{ij}}\frac{\partial S_{ij}}{\partial\theta},$
which can be alternatively written as
(5)  $\frac{\partial L_{\text{LS}}}{\partial\theta}=\sum_{i=1}^{B}\Big(\sum_{j\in\mathcal{N}_i}w_{ij}\frac{\partial S_{ij}}{\partial\theta}-\sum_{j\in\mathcal{P}_i}w_{ij}\frac{\partial S_{ij}}{\partial\theta}\Big).$
It can be shown for the LS loss (derivations are provided in the supplementary) that
(6)  $w_{ij}=\frac{e^{\lambda-S_{ij}}}{\sum_{k\in\mathcal{P}_i}e^{\lambda-S_{ik}}}\ \text{for}\ j\in\mathcal{P}_i,\qquad w_{ij}=\frac{e^{S_{ij}-\lambda}}{\sum_{k\in\mathcal{N}_i}e^{S_{ik}-\lambda}}\ \text{for}\ j\in\mathcal{N}_i.$
To recover the gradient of the LS loss under our DRO framework, we employ a pairwise margin loss for $\ell$, i.e., $\ell(\mathbf{x}_i,\mathbf{x}_j;\theta)=[\alpha_0+y_{ij}(\beta_0-S_{ij})]_+$, where $\alpha_0$ and $\beta_0$ are two hyper-parameters. Assume that the margin parameter $\alpha_0$ is sufficiently large such that $\ell(\mathbf{x}_i,\mathbf{x}_j;\theta)>0$ for all pairs. The key to deriving the same gradient as the LS loss under our framework is to group the distributional variables in $\mathbf{p}$ for the positive and negative pairs according to the anchor data. Let $\mathbf{p}_i^+$ and $\mathbf{p}_i^-$ denote the corresponding vectors of weights of positive and negative pairs for the anchor $\mathbf{x}_i$, respectively. Let us consider the following DRO formulation:
(7)  $\displaystyle\max_{\{\mathbf{p}_i^+,\mathbf{p}_i^-\}}\ \sum_{i=1}^{B}\Big(\sum_{j\in\mathcal{P}_i}p^+_{ij}\ell_{ij}-\lambda_i^+\,\mathrm{KL}\big(\mathbf{p}_i^+\|\mathbf{1}/|\mathcal{P}_i|\big)\Big)+\sum_{i=1}^{B}\Big(\sum_{j\in\mathcal{N}_i}p^-_{ij}\ell_{ij}-\lambda_i^-\,\mathrm{KL}\big(\mathbf{p}_i^-\|\mathbf{1}/|\mathcal{N}_i|\big)\Big)$
s.t.  $\mathbf{p}_i^+\in\Delta_{|\mathcal{P}_i|},\ \mathbf{p}_i^-\in\Delta_{|\mathcal{N}_i|},\quad i=1,\ldots,B,$
where $\lambda_i^+$ and $\lambda_i^-$ for $i=1,\ldots,B$ are hyper-parameters. We can easily derive the closed-form solution, i.e., $p^{+*}_{ij}\propto\exp(\ell_{ij}/\lambda_i^+)$ and $p^{-*}_{ij}\propto\exp(\ell_{ij}/\lambda_i^-)$. Computing the gradient of the robust loss with respect to $\theta$ using the above optimal solution, we have
$\nabla_\theta L_{\text{DRO}}(\theta)=\sum_{i=1}^{B}\Big(\sum_{j\in\mathcal{N}_i}p^{-*}_{ij}\frac{\partial S_{ij}}{\partial\theta}-\sum_{j\in\mathcal{P}_i}p^{+*}_{ij}\frac{\partial S_{ij}}{\partial\theta}\Big),$
which exactly recovers the gradient in (6) by setting $\lambda_i^+=\lambda_i^-=1$.
Finally, we can recover the gradient based on the MS loss in a very similar way. The difference is to add a pseudo positive pair and a pseudo negative pair with zero loss for each anchor $\mathbf{x}_i$, and to augment each $\mathbf{p}_i^+$ and $\mathbf{p}_i^-$ by one additional dimension. The details are provided in the supplementary.
4 Experiments
Our method was implemented in PyTorch using the BN-Inception network (Ioffe & Szegedy (2015)) pretrained on ImageNet ILSVRC (Russakovsky et al. (2015)) for a fair comparison with other works. The same as Wang et al. (2019), an FC layer following the global pooling layer was added on top of the model structure with random initialization for our task. The Adam optimizer was used for all our experiments. We verify our DRO framework on the image retrieval task with three standard datasets: CUB-200-2011, Cars-196, and In-Shop, split according to the standard protocol. For CUB-200-2011, the first 100 classes with 5,864 images are used for training, and the other 100 classes with 5,924 images are reserved for testing. Cars-196 consists of 196 car models with 16,185 images; we use the first 98 classes with 8,054 images for training and the remaining 98 classes with 8,131 images for testing. For In-Shop, 997 classes with 25,882 images are used for training; the test set is further partitioned into a query set with 14,218 images of 3,985 classes and a gallery set containing 3,985 classes with 12,612 images. Batches are constructed with the following rule: we first sample a certain number of classes and then randomly sample the same number of instances for each class. The standard Recall@$K$ evaluation metric is used in all our experiments, with $K\in\{1,2,4,8,16,32\}$ on CUB-200-2011 and Cars-196 and $K\in\{1,10,20,30,40,50\}$ on In-Shop. We apply margin loss (Wu et al. (2017)) and binomial deviance loss (Yi et al.) as base losses for our DRO framework; the margin and similarity-threshold hyper-parameters of these base losses are tuned per dataset.

4.1 Quantitative Results
In this experiment, we compare our DRO framework with other SOTA baselines on CUB-200-2011, Cars-196, and In-Shop, including Wang et al. (2019); Yu et al. (2018); Kim et al. (2018); Opitz et al. (2018); Ge (2018); Harwood et al. (2017); Wu et al. (2017); Oh Song et al. (2017). Among them, Clustering, HDC, Margin, Smart Mining, and HDL are mining-based methods; A-BIER and ABE are ensemble methods; and HAP2S_E and MS are sampling-based methods, which are most closely related to ours. For our DRO framework, we test all three variants proposed in Section 3 with two base loss functions, margin loss and binomial deviance loss. Since DRO-KL sampling works jointly on all pairs in a batch, we instantiate it with a single base loss. This makes five variants of our DRO framework in total, denoted by DRO-TopK$_m$, DRO-TopK$_b$, DRO-TopKPN$_m$, DRO-TopKPN$_b$, and DRO-KL, where the subscripts $m$ and $b$ indicate the variants using margin loss and binomial deviance loss, respectively. The embedding dimension and batch size are fixed per dataset, and $K$ and the regularization parameter $\lambda$ are tuned over a small range on each dataset.
CUB-200-2011  Cars-196  

Recall  1  2  4  8  16  32  1  2  4  8  16  32 
Clustering (Oh Song et al. (2017))  48.2  61.4  71.8  81.9      58.1  70.6  80.3  87.8     
HDC(Oh Song et al. (2017))  53.6  65.7  77.0  85.6  91.5  95.5  73.7  83.2  89.5  93.8  96.7  98.4 
Margin(Wu et al. (2017))  63.6  74.4  83.1  90.0  94.2    79.6  86.5  91.9  95.1  97.3   
Smart Mining(Harwood et al. (2017))  49.8  62.3  74.1  83.3      64.7  76.2  84.2  90.2     
HDL(Ge (2018))  57.1  68.8  78.7  86.5  92.5  95.5  81.4  88.0  92.7  95.7  97.4  99.0 
ABIER(Opitz et al. (2018))  57.5  68.7  78.3  86.2  91.9  95.5  82.0  89.0  93.2  96.1  97.8  98.7 
ABE(Kim et al. (2018))  60.6  71.5  79.8  87.4      85.2  90.5  94.0  96.1     
HAP2SE(Yu et al. (2018))  56.1  68.3  79.2  86.9      74.1  83.5  89.9  94.1     
MS(Wang et al. (2019))  65.7  77.0  86.3  91.3  94.8  97.0  84.1  90.4  94.0  96.5  98.0  98.9 
DRO-TopK$_m$ (Ours)  67.4  77.7  85.9  91.6  95.0  97.3  86.0  91.7  95.0  97.3  98.5  99.2 
DRO-TopK$_b$ (Ours)  68.1  78.4  86.0  91.4  95.1  97.6  85.4  91.0  94.2  96.5  98.0  99.0 
DRO-TopKPN$_m$ (Ours)  67.3  77.6  85.7  91.2  95.0  97.7  86.1  91.7  95.1  97.1  98.4  99.1 
DRO-TopKPN$_b$ (Ours)  67.6  77.9  86.0  91.8  95.2  97.7  86.2  91.7  95.8  97.4  98.6  99.3 
DRO-KL (Ours)  67.7  78.0  86.1  91.8  95.6  97.8  86.4  91.9  95.4  97.5  98.7  99.3 
Table 1 and the table in Section 4.2.1 report the experimental results. We mark the best performer in bold for the corresponding evaluation measure in each column; for our framework, in particular, we mark in bold those variants that outperform all other SOTA methods. We can see that our five variants outperform the other SOTA methods on Recall@1 on all three datasets. On Cars-196 in particular, our five variants outperform the other SOTA methods on all recall measures. On CUB-200-2011, DRO-TopK$_b$ achieves a higher Recall@1 than the best SOTA method, MS. On Cars-196, DRO-KL has the best performance, improving Recall@1 over both the best non-ensemble SOTA, MS, and the best ensemble SOTA, ABE. On In-Shop, DRO-TopKPN improves Recall@1 over the best SOTA result, obtained by MS. These results verify 1) the effectiveness of our DRO sampling methods and 2) the flexibility of our DRO framework in adopting different losses.
4.2 Ablation Study
4.2.1 Comparison with LS loss and MS loss
CUB-200-2011  Cars-196  

Recall  1  2  4  8  16  32  1  2  4  8  16  32 
MS  55.6  67.7  77.4  86.3  92.1  95.8  73.2  81.5  87.6  92.6     
LS  56.8  67.9  77.5  85.6  91.2  95.2  69.7  79.3  86.2  91.1     
DRO-KL-G  56.4  68.3  78.9  86.3  91.7  95.8  70.5  79.8  86.6  91.6  94.9  97.1 
DRO-KL-G  56.8  68.7  79.0  86.6  92.1  95.9  72.5  81.9  88.1  92.3  95.4  97.3 
DRO-KL-G  57.0  69.4  79.9  87.0  92.3  95.9  73.1  82.2  88.8  93.4  96.2  98.0 
DRO-KL-G  56.7  68.5  79.0  87.3  92.6  96.0  75.0  83.4  89.5  93.7  96.6  98.3 
In Section 3.3, we theoretically show that the LS loss and MS loss can be viewed as special cases of our DRO framework. In this experiment, we aim to empirically demonstrate that our framework is general enough to recover the LS loss. Specifically, we show that 1) with the hyper-parameter setting of Section 3.3, our framework performs similarly to the LS loss; 2) our framework can be seen as a generalized LS loss by treating $\lambda$ as a hyper-parameter; and 3) our generalized LS loss outperforms the MS loss, even though the ordinary LS loss is inferior to the MS loss.
We adopt the embedding dimension and batch size used in the ablation study of Wang et al. (2019). Therefore, we report the existing results of the MS and LS losses presented in Wang et al. (2019) on Cars-196; for CUB-200-2011 and In-Shop, we implement the MS and LS losses following Wang et al. (2019), with the MS-loss hyper-parameters set as in that paper. For our DRO framework, we apply grouping to $\mathbf{p}$ via equation (7) and denote this variant DRO-KL-G; the similarity threshold is shared across all three losses (MS, LS, and margin loss). As pairs with zero loss do not contribute to the model update but do affect the calculation of $\mathbf{p}^*$ in the DRO framework, we remove pairs with zero loss to further promote training.
Table 2 and the corresponding tables show the experimental results on CUB-200-2011, Cars-196, and In-Shop, respectively. As can be seen, the MS loss performs better than the LS loss on all three datasets, particularly on Cars-196, which also agrees with the ablation study in Wang et al. (2019). With the hyper-parameter setting of Section 3.3, our method performs similarly to the LS loss, which verifies that our method recovers the LS loss. Furthermore, when we treat $\lambda$ as a hyper-parameter and regard our framework as a generalized LS loss, our method obtains improved performance compared to the ordinary LS loss. Lastly, even though the MS loss exploits pseudo positive and negative pairs, our generalized LS loss outperforms it.
4.2.2 Capacity to Handle Pair Imbalance.
In this experiment, we compare our DRO framework with different sampling methods, i.e., semi-hard (SH) mining and DWS, in terms of sensitivity to the imbalance ratio. By varying the batch size, we obtain different positive-negative pair ratios. For all methods, we apply the margin loss and fix the number of instances per class and the embedding dimension. SH mining is originally designed for triplet loss; since there is no straightforward choice for the positive pair, we use an upper bound to simulate the similarity of the positive pair in a triplet. For DWS, we follow the parameter setting in the original paper (Wu et al. (2017)). We apply the margin loss in the three proposed variants of our DRO framework, denoted by DRO-TopK, DRO-TopKPN, and DRO-KL, respectively, and use the same $K$ for both DRO-TopK and DRO-TopKPN. We evaluate Recall@1 for all methods and report the results in Figure 1.
Figure 1 shows that DWS performs better when the positive-negative pair ratio is relatively large, but encounters a sharp decrease in Recall@1 when the ratio decreases. The other four methods obtain better performance as the positive-negative ratio increases. Among them, DRO-TopK and DRO-KL perform similarly to SH across all positive-negative pair ratios, while performing slightly better than SH when the ratios are small. DRO-TopKPN consistently outperforms all other methods. The reason why DWS performs poorly when the positive-negative pair ratio is small may be that DWS aims to sample pairs uniformly in terms of distance (Wu et al. (2017)), while our DRO framework and SH focus more on hard pairs. To sum up, our framework achieves very competitive performance against SOTA sampling methods and maintains increasing recall as the positive-negative ratio increases. These two observations together demonstrate the effectiveness of our DRO framework in handling pair imbalance.
4.2.3 Sensitivity of $K$ in Top-K
As mentioned in Section 1, selecting too many pairs within a batch leads to poor performance of the model. On the other hand, when the number of selected pairs is too small, the model becomes sensitive to outliers. In this experiment, we study the sensitivity of $K$ in our DRO framework, i.e., how the performance of our DRO framework is affected by the value of $K$. We fix the batch size and the number of instances per class, which determines the number of positive and negative pairs in a batch, vary $K$ over a range, and evaluate Recall@1 for models trained with each value. The range of $K$ is chosen according to the number of pairs selected by DWS and SH in Section 4.2.2 (both select roughly the same number of pairs).
Figure 3 illustrates how different values of $K$ affect Recall@1 on In-Shop. We can see that Recall@1 is stable over the entire tested range; our DRO framework is not sensitive to $K$ when $K$ lies in a reasonably large range.
4.2.4 Runtime Comparison
Next, we compare the running time of our three proposed variants of the DRO framework with different pair mining methods and with the MS and LS losses on In-Shop. Our experiments were conducted on eight GTX 1080Ti GPUs, and results are compared under different batch sizes. As in previous experiments, the same $K$ is used for DRO-TopK and DRO-TopKPN. SH is implemented according to Schroff et al. (2015) and Wu et al. (2017); DWS and MS are implemented based on the code provided by the authors; and the LS loss is implemented following the code provided by Wang et al. (2019).
Figure 3 reports the average running time per iteration over 200 epochs. We can see that all three proposed variants of the DRO framework run faster than the other anchor-based mining methods and losses. For all three of our variants, pairs are selected directly from all pairs at once, whereas the other methods incur additional cost by selecting pairs anchor by anchor. The LS loss is slower than the MS loss because MS mining is applied to the MS loss, which reduces the number of pairs involved in computing subgradients when updating the model. For DWS, the distance distribution of negative pairs is only calculated once per iteration, so it only needs to select pairs according to the precomputed distribution for each anchor. In contrast, SH needs to compare negative pairs with the lower and upper bounds of an interval at each iteration for each anchor, which increases the computational burden; this may be the reason why SH is slower than DWS.

5 Conclusion
In this paper, we cast DML as a simple pairwise binary classification problem and formulate it within a DRO framework. Compared to existing DML methods that leverage all pairs in a batch or employ heuristic approaches to sample pairs, our DRO framework constructs a robust loss to sample informative pairs, which also comes with theoretical justification from the perspective of learning theory. Our framework is general, since many designs can be encoded in its uncertainty set, and its flexibility allows it to recover state-of-the-art loss functions. Experiments show that our framework outperforms state-of-the-art DML methods on benchmark datasets and is efficient, general, and flexible.
References

Chen & Deng (2019) Binghui Chen and Weihong Deng. Hybrid-attention based decoupled metric learning for zero-shot image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2750–2759, 2019.
 Fan et al. (2017) Yanbo Fan, Siwei Lyu, Yiming Ying, and Baogang Hu. Learning with average top-k loss. In Advances in Neural Information Processing Systems, pp. 497–505, 2017.
 Ge (2018) Weifeng Ge. Deep metric learning with hierarchical triplet loss. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–285, 2018.
 Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pp. 1735–1742. IEEE, 2006.
 Harwood et al. (2017) Ben Harwood, BG Kumar, Gustavo Carneiro, Ian Reid, Tom Drummond, et al. Smart mining for deep metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2821–2829, 2017.
 Hermans et al. (2017) Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Kim et al. (2018) Wonsik Kim, Bhavya Goyal, Kunal Chawla, Jungmin Lee, and Keunjoo Kwon. Attention-based ensemble for deep metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 736–751, 2018.
 Li et al. (2019) Xiaomeng Li, Lequan Yu, Chi-Wing Fu, Meng Fang, and Pheng-Ann Heng. Revisiting metric learning for few-shot image classification. arXiv preprint arXiv:1907.03123, 2019.
 Liu et al. (2016) Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1096–1104, 2016.
 Namkoong & Duchi (2016) Hongseok Namkoong and John C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pp. 2208–2216, 2016.
 Namkoong & Duchi (2017) Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems (NIPS), pp. 2975–2984, 2017.
 Oh Song et al. (2016) Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012, 2016.
 Oh Song et al. (2017) Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy. Deep metric learning via facility location. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5382–5390, 2017.
 Opitz et al. (2018) Michael Opitz, Georg Waltner, Horst Possegger, and Horst Bischof. Deep metric learning with bier: Boosting independent embeddings robustly. IEEE transactions on pattern analysis and machine intelligence, 2018.
 Qian et al. (2015) Qi Qian, Rong Jin, Shenghuo Zhu, and Yuanqing Lin. Fine-grained visual categorization via multi-stage metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 Qian et al. (2018) Qi Qian, Jiasheng Tang, Hao Li, Shenghuo Zhu, and Rong Jin. Large-scale distance metric learning with uncertainty. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8542–8550, 2018.
 Rockafellar & Wets (2009) R Tyrrell Rockafellar and Roger JB Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
 Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823, 2015.
 Shalev-Shwartz & Wexler (2016) Shai Shalev-Shwartz and Yonatan Wexler. Minimizing the maximal loss: How and why. In ICML, pp. 793–801, 2016.
 Shi et al. (2016) Hailin Shi, Yang Yang, Xiangyu Zhu, Shengcai Liao, Zhen Lei, Wei-Shi Zheng, and Stan Z Li. Embedding deep metric for person re-identification: A study against large variations. In European conference on computer vision, pp. 732–748. Springer, 2016.
 Sohn (2016) Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865, 2016.
 Wang et al. (2017) Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. Deep metric learning with angular loss. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2593–2601, 2017.
 Wang et al. (2019) Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R Scott. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5022–5030, 2019.
 Wu et al. (2017) ChaoYuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848, 2017.
 Yi et al. (2014) D Yi, Z Lei, and SZ Li. Deep metric learning for practical person re-identification. ArXiv e-prints, 2014.
 Yu et al. (2018) Rui Yu, Zhiyong Dou, Song Bai, Zhaoxiang Zhang, Yongchao Xu, and Xiang Bai. Hard-aware point-to-set deep metric for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 188–204, 2018.

Zhu et al. (2019) Dixian Zhu, Zhe Li, Xiaoyu Wang, Boqing Gong, and Tianbao Yang. A robust zero-sum game framework for pool-based active learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 517–526, 2019.
6 Supplementary
6.1 Derivation of Recovering SOTA Losses
In this section, we show how our DRO framework recovers the SOTA loss functions, namely the LS loss and the MS loss. Their respective definitions are as follows.
(8) 
where is the margin hyperparameter.
(9) 
where is the margin hyperparameter and and are coefficient hyperparameters.
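As a concrete reference for the two definitions above, here is a minimal sketch of both losses following their published forms (Oh Song et al. (2016); Wang et al. (2019)); the variable names and the per-pair/per-anchor granularity are our own choices rather than the paper's notation:

```python
import numpy as np

def ls_pair_loss(d_ij, neg_dists_i, neg_dists_j, margin=1.0):
    # Lifted-structure (LS) loss for one positive pair (i, j):
    # log-sum-exp over the negatives of both endpoints, offset by the
    # margin, plus the positive distance, clipped at zero.
    neg = np.concatenate([np.asarray(neg_dists_i, float),
                          np.asarray(neg_dists_j, float)])
    return max(0.0, np.log(np.sum(np.exp(margin - neg))) + d_ij)

def ms_anchor_loss(pos_sims, neg_sims, alpha=2.0, beta=50.0, margin=0.5):
    # Multi-similarity (MS) loss for one anchor: a soft aggregation of
    # positive and negative similarities around the margin, with
    # coefficient hyperparameters alpha and beta.
    pos = np.asarray(pos_sims, float)
    neg = np.asarray(neg_sims, float)
    pos_term = np.log1p(np.sum(np.exp(-alpha * (pos - margin)))) / alpha
    neg_term = np.log1p(np.sum(np.exp(beta * (neg - margin)))) / beta
    return pos_term + neg_term
```

In both losses, a positive pair's contribution grows with the hardness of the surrounding negatives, which is the interaction structure the DRO framework recovers below.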
6.1.1 LS Loss under Our Framework
Recall that the objective function is decomposable in terms of . For simplicity, we denote when , and when . The Lagrangian function of (7) can be represented as:
(10) 
where
(11)  
According to the KKT conditions, and are the optimal solutions of the dual problem, and is the optimal solution of the primal problem (7), if and only if
(12)  
(13) 
We first derive in terms of using equation (12), i.e.:
(14) 
where . Then the closed form of for positive pairs and negative pairs can be written as follows
(15) 
Substituting into equation (13) implies that and need to satisfy:
(16) 
Although (16) also equals 0 when or , or , the corresponding optimal solutions do not satisfy the equality constraints, i.e., and , in the original formulation (7). Therefore, we only have
(17) 
Then from equation (17), we can get
(18) 
Plugging them into (15) and applying the margin loss as the base loss function for each pair, , we derive the weighting representation of the LS loss:
(19)  
Thus, when updating the model parameter , we are going to minimize the following objective function:
(20)  
Taking the gradient of equation (20) with respect to , we get:
(21) 
Similarly, taking the gradient of the loss function (8), we get:
(22) 
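The derivation pattern above, where Lagrangian stationarity yields explicit pair weights, can also be illustrated on a KL-regularized DRO objective, which is the standard setting behind variants such as DRO-KL; the formulation below is a generic textbook version and may differ in details from formulation (7):

```latex
% Generic KL-regularized DRO over pair losses \ell_i (illustrative only):
\max_{p\in\Delta_n}\ \sum_{i=1}^{n} p_i\,\ell_i
  \;-\; \lambda\,\mathrm{KL}\!\left(p \,\middle\|\, \tfrac{\mathbf{1}}{n}\right),
\qquad
\Delta_n = \Big\{\, p \ge 0 : \textstyle\sum_{i} p_i = 1 \,\Big\}.
% Stationarity of the Lagrangian (with multiplier \mu for the simplex
% constraint) gives the closed-form softmax weighting:
\ell_i - \lambda\big(\log(n p_i) + 1\big) - \mu = 0
\;\Longrightarrow\;
p_i = \frac{\exp(\ell_i/\lambda)}{\sum_{j=1}^{n}\exp(\ell_j/\lambda)}.
```

Here harder pairs (larger losses) receive exponentially larger weights, with the temperature controlling how aggressively the weighting concentrates on hard pairs.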
6.1.2 MS Loss under Our Framework
MS loss, a combination of the binomial loss and the LS loss, can also be formulated under our DRO framework. The LS loss only considers the lifted structure between pairs, whereas the binomial loss focuses on the intrinsic property of each independent pair while encoding the pair class information. To recover the MS loss under our framework, we redefine by adding one more element to and . Therefore, we now have and , where the newly added element corresponds to a zero loss and thus does not contribute to the computation of the overall loss. Then, based on the formulation of the LS loss, the formulation of the MS loss can be written as:
(23)  
s.t. 
Note that
(24) 
Following the analysis of the LS loss, we can also obtain the representation of the MS loss under our DRO framework from formulation (23), i.e.,