
Multi-level Distance Regularization for Deep Metric Learning

by   Yonghyun Kim, et al.
Kakao Corp.

We propose a novel distance-based regularization method for deep metric learning called Multi-level Distance Regularization (MDR). MDR explicitly disturbs the learning procedure by regularizing pairwise distances between embedding vectors into multiple levels that represent degrees of similarity between a pair. In the training stage, the model is trained with both MDR and an existing loss function of deep metric learning simultaneously; the two losses interfere with each other's objective, which makes the learning process difficult. Moreover, MDR prevents some examples from being ignored or from exerting excessive influence in the learning process. These properties allow the parameters of the embedding network to settle on a local optimum with better generalization. Without bells and whistles, MDR with a simple Triplet loss achieves state-of-the-art performance on various benchmark datasets: CUB-200-2011, Cars-196, Stanford Online Products, and In-Shop Clothes Retrieval. We perform extensive ablation studies on its behaviors to show the effectiveness of MDR. Because MDR is easy to adopt, previous approaches can be improved in performance and generalization ability.





Code Repositories


Official Code of AAAI 2021 Paper "Multi-level Distance Regularization for Deep Metric Learning"


1 Introduction

Deep Metric Learning (DML) aims to learn an appropriate metric that measures the semantic difference between a pair of images as a distance between embedding vectors. Many research areas such as image retrieval sohn2016improved; yuan2017hard; oh2017deep; duan2018deep; ge2018deep and face recognition NormFace; SphereFace; CosFace; ArcFace are based on DML to seek appropriate metrics among instances. Those studies focus on devising a better loss function for DML.

Most previous loss functions sohn2016improved; bromley1994signature; hadsell2006dimensionality; yideep2014; hoffer2015deep; schroff2015facenet use binary supervision that indicates whether a given pair is positive or negative. Their common objective is to minimize the distance between a positive pair and maximize the distance between a negative pair (Figure 1(a)). However, without any constraints, a model trained with such an objective is prone to overfitting on the training set, because positive pairs can be pulled too close together while negative pairs can be pushed too far apart in the embedding space. Therefore, several loss functions employ additional terms to prevent positive pairs from becoming too close and negative pairs from becoming too far, e.g., the margin in Triplet loss schroff2015facenet and Contrastive loss hadsell2006dimensionality. Despite these attempts, they can still suffer from overfitting due to the lack of explicit regularization of the distances.

Our insight is that the learning procedure of DML can be enhanced by explicitly regularizing the distance between pairs, so that the DML loss function is disturbed while it optimizes the embedding network; one easy way to constrain a distance is to pull its value toward a predefined level. Conventional loss functions of DML adjust the distance according to its label. Explicit distance-based regularization, on the other hand, prevents the distance from deviating from the predefined level. The two interfere with each other's objective, which makes the learning process difficult and allows the embedding network to generalize more robustly. Additionally, we regularize distances with multiple levels over disjoint intervals rather than a single level, because the degree of inter-class similarity or intra-class variation can differ depending on classes or instances.

Figure 1: Conceptual comparison between the conventional learning scheme and our learning scheme. Panels: (a) Conventional Learning; (b) Our Learning. (a) illustrates triplet learning schroff2015facenet, a representative conventional learning scheme. It pushes the distance of a negative pair to exceed that of a positive pair by more than a margin. (b) illustrates our learning combined with triplet learning. It has multiple levels with disjoint intervals to reflect various degrees of similarity between pairs. It disturbs the learning procedure to construct an efficient embedding space by preventing the pairwise distances from deviating from the levels to which they belong.

We propose a novel method called Multi-level Distance Regularization (MDR) that makes it difficult for conventional DML loss functions to converge by holding each distance so that it does not deviate from the level to which it belongs. First, MDR normalizes the pairwise distances among the embedding vectors of a mini-batch with their mean and standard deviation, to obtain an objective degree of similarity between a pair by considering the overall distribution. MDR defines multiple levels that represent various degrees of similarity for pairwise distances, and the levels and the distances assigned to them are trained to approach each other (Figure 1(b)). A conventional DML loss function must then optimize the model while overcoming the disturbance from the proposed regularization. As a result, the learning process yields a model with better generalization ability. We summarize our contributions:

  • We introduce MDR, a novel regularization method for DML. The method disturbs the optimization of pairwise distances by preventing them from deviating from the levels to which they belong, for better generalization.

  • MDR achieves state-of-the-art performance on various DML benchmark datasets CUB-200; Cars-196; Song2016DeepML; InShop. Moreover, our extensive ablation studies show that MDR can be adopted with any backbone network and any distance-based loss function to improve the performance of a model.

2 Related Work

Loss Function. Improving the loss function is one of the key objectives in recent DML studies. One family of loss functions sohn2016improved; bromley1994signature; schroff2015facenet; Song2016DeepML; wang2019multi; wu2017sampling focuses on optimizing pairwise distances between instances. The common objective of these functions is to minimize the distance between positive pairs and to maximize the distance between negative pairs in an embedding space. Contrastive loss bromley1994signature samples pairs of two instances, whereas Triplet loss schroff2015facenet samples triplets of anchor, positive and negative instances; both losses then optimize the distances between the sampled instances. Also, Global Loss kumar2016learning minimizes the mean and variance of all pairwise distances between positive examples and maximizes the mean of pairwise distances between all negative examples; Global Loss helps to optimize examples that are not selected by the example mining of DML. Histogram Loss minimizes the probability that a randomly sampled positive pair has a smaller similarity than a randomly sampled negative pair. To extend the number of relations explored at once, NPair sohn2016improved samples a positive and all negative instances for each example in a given mini-batch; similar loss functions Song2016DeepML; wang2019multi also sample a large number of instances to fully explore the pairwise relations in the mini-batch. On the other hand, some loss functions cakir2019deep; revaud2019learning focus on learning to rank according to the similarity between pairs. The performance of loss functions that optimize pairwise distances depends on the sampling method; thus, several studies have focused on pair sampling suh2019stochastic; schroff2015facenet; wu2017sampling for stable learning and better accuracy. A recent work wang2020cross even samples pairs across mini-batches to collect a sufficient number of negative examples. Instead of designing a sampling method manually, another work roth2020pads employs reinforcement learning to learn the sampling policy. As a regularizer, MDR can be combined with those loss functions to improve the generalization ability of a model.

Figure 2: Learning procedure of the proposed MDR. The embedding network generates embedding vectors from given images. Our MDR computes a matrix of pairwise distances for the embedding vectors, and then, the distances are normalized after vectorization. In our learning scheme, a model is trained by simultaneously optimizing the conventional metric learning loss such as Triplet loss schroff2015facenet and the proposed loss, which regularizes the normalized pairwise distances with multiple levels.

Generalization Ability. Another goal of DML is to improve the generalization ability of a given model. An ensemble of multiple heads that share the backbone network opitz_2018_pami; Kim_2018_ECCV; JACOB_2019_ICCV; sanakoyeu2019divide has the key objective of diversifying each head to achieve reliable embedding. Boosting can be used to re-weight the importance of instances differently on each head opitz_2018_pami; sanakoyeu2019divide, or a spatial attention module can be used to differentiate the spatial region on which each head focuses Kim_2018_ECCV. HORDE JACOB_2019_ICCV makes each head approximate a different higher-order moment. Those methods focus on changing the architecture of a model, whereas our MDR, as a regularizer, focuses on making the learning procedure harder in order to improve generalization ability. Without adding any extra computational cost or changing the architecture of the model, MDR can be easily integrated with those DML methods by simply adding our loss function.

3 Proposed Method

In this section, we introduce a new regularization method called Multi-level Distance Regularization (MDR), which makes the learning procedure difficult by preventing each pairwise distance from deviating from a corresponding level, to learn a robust feature representation.

3.1 Multi-level Distance Regularization

We describe the detailed procedure of MDR to regulate pairwise distances in three steps (Figure 2).

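(1) Distance Normalization. This step is performed to obtain an objective degree of distance by considering the overall distribution, for stable regularization. Here, an embedding network $f$ maps an image $x$ into an embedding vector $e$ with a certain dimensionality: $e = f(x)$. A distance is defined as the Euclidean distance between two given embedding vectors, $d(e_i, e_j) = \lVert e_i - e_j \rVert_2$. We normalize the distance as:

$$\bar{d}(e_i, e_j) = \frac{d(e_i, e_j) - \mu}{\sigma},$$

where $\mu$ is the mean of distances and $\sigma$ is the standard deviation of distances over a set of pairs $\mathcal{P} = \{(e_i, e_j) \mid i \neq j\}$, which is formed from all instances of a mini-batch. To more widely consider the overall dataset, we employ the momentum updates:

$$\mu^{*}_{t} = \gamma \mu^{*}_{t-1} + (1 - \gamma)\mu, \qquad \sigma^{*}_{t} = \gamma \sigma^{*}_{t-1} + (1 - \gamma)\sigma,$$

where $\mu^{*}_{t}$ and $\sigma^{*}_{t}$ are respectively the momented mean and momented standard deviation at iteration $t$, and $\gamma$ is the momentum. With the momented statistics, the normalized distance is re-written:

$$\bar{d}(e_i, e_j) = \frac{d(e_i, e_j) - \mu^{*}_{t}}{\sigma^{*}_{t}}.$$

(2) Level Assignment. MDR designates a level that acts as a regularization goal for each normalized distance. We define a set of levels $\mathcal{S} = \{s_1, \ldots, s_n\}$, and the levels are initialized with predefined values; each level $s \in \mathcal{S}$ is interpreted as a multiplier of the standard deviation of the normalized distance. $g(d, s)$ is an assignment function that outputs whether the given distance $d$ and the given level $s$ are the closest or not, and is defined as:

$$g(d, s) = \begin{cases} 1, & \text{if } s = \arg\min_{s_k \in \mathcal{S}} \lvert d - s_k \rvert, \\ 0, & \text{otherwise.} \end{cases}$$

By adopting the assignment function, MDR selects valid regularization levels for each distance with the consideration of various degrees of similarity.

(3) Regularization. Finally, this step is performed to prevent pairwise distances from deviating from the levels to which they belong. MDR minimizes the difference between a given normalized pairwise distance and the assigned level:

$$\mathcal{L}_{MDR} = \frac{1}{\lvert \mathcal{P} \rvert} \sum_{(e_i, e_j) \in \mathcal{P}} \sum_{s \in \mathcal{S}} g\big(\bar{d}(e_i, e_j), s\big) \cdot \big\lvert \bar{d}(e_i, e_j) - s \big\rvert.$$
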
The levels are learnable parameters and are updated to optimally regularize the pairwise distances. Each normalized distance is trained to become closer to the assigned level; the assigned level is also trained to become closer to the corresponding distances. As iterations pass, the levels are trained to properly divide the normalized distances into multiple intervals. Each level is a representative value of a certain interval in the normalized distance. We describe the initial configuration of the levels in Section 4.3.
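The mutual attraction between a distance and its level can be made explicit with a short subgradient calculation. This is only a sketch under simplifying assumptions: it considers a single pair with normalized distance $\bar{d}$ assigned to level $s$, treats $\bar{d}$ as a free variable (ignoring how the normalization statistics depend on the embeddings), and uses a hypothetical step size $\eta > 0$.

```latex
% Regularization term for one pair assigned to level s (with \bar{d} \neq s):
\ell(\bar{d}, s) = \lvert \bar{d} - s \rvert

% Subgradients with respect to the distance and the level:
\frac{\partial \ell}{\partial \bar{d}} = \operatorname{sign}(\bar{d} - s),
\qquad
\frac{\partial \ell}{\partial s} = -\operatorname{sign}(\bar{d} - s)

% One gradient-descent step with step size \eta:
\bar{d} \leftarrow \bar{d} - \eta \, \operatorname{sign}(\bar{d} - s),
\qquad
s \leftarrow s + \eta \, \operatorname{sign}(\bar{d} - s)
```

If $\bar{d} > s$, the distance decreases and the level increases; if $\bar{d} < s$, the opposite happens. In both cases the two values move toward each other, which matches the description above.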

In conclusion, MDR has two functional effects of regularization: (1) the multiple levels of MDR disturb the optimization of the pairwise distances among examples, and (2) the outermost levels of MDR prevent the positive pairs from getting too close and the negative pairs from getting too far. Through the former effect, the learning process does not easily suffer from overfitting. Through the latter effect, the learning process does not suffer from the diminishing loss of easy examples, nor from being too biased toward certain examples such as hard examples. Therefore, MDR stabilizes the learning procedure to achieve a better generalization ability on a test dataset.
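The three steps of Section 3.1 can be sketched as a forward pass in plain Python. This is a minimal illustration, not the official implementation: the class name `MDRSketch`, the initialization of the momented statistics from the first batch, and the level and momentum values in the usage lines are assumptions made for the example. The levels are kept fixed here, whereas the paper trains them, and unordered pairs are used, which gives the same mean and standard deviation as ordered pairs.

```python
import math
from itertools import combinations


def pairwise_distances(embeddings):
    """Euclidean distance for every unordered pair (i < j) in the batch."""
    return [math.dist(a, b) for a, b in combinations(embeddings, 2)]


class MDRSketch:
    """Forward pass of the three MDR steps; no gradients are computed here."""

    def __init__(self, levels, momentum):
        self.levels = list(levels)  # learnable in the paper, fixed in this sketch
        self.momentum = momentum
        self.mean = None            # momented mean
        self.std = None             # momented standard deviation

    def loss(self, embeddings):
        dists = pairwise_distances(embeddings)
        mu = sum(dists) / len(dists)
        sigma = math.sqrt(sum((d - mu) ** 2 for d in dists) / len(dists))

        # (1) Momentum update of the statistics, then normalization.
        if self.mean is None:       # assumption: the first batch initializes the stats
            self.mean, self.std = mu, sigma
        else:
            g = self.momentum
            self.mean = g * self.mean + (1 - g) * mu
            self.std = g * self.std + (1 - g) * sigma
        normed = [(d - self.mean) / self.std for d in dists]

        # (2) Assign each normalized distance to its closest level.
        assigned = [min(self.levels, key=lambda s: abs(d - s)) for d in normed]

        # (3) Average the absolute gap between each distance and its level.
        gaps = [abs(d - s) for d, s in zip(normed, assigned)]
        return sum(gaps) / len(gaps), assigned


# Illustrative values only; they are not taken from the text above.
mdr = MDRSketch(levels=(-3.0, 0.0, 3.0), momentum=0.9)
batch = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
value, assigned = mdr.loss(batch)
```

In a real training loop the same computation would run inside an automatic-differentiation framework so that both the embedding network and the levels receive gradients.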

3.2 Learning

| Method | CUB-200 R@1 | R@2 | R@4 | R@8 | Cars-196 R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|---|---|---|---|
| HTL ge2018deep | 57.1 | 68.8 | 78.7 | 86.5 | 81.4 | 88.0 | 92.7 | 95.7 |
| RLL-H wang2019ranked | 57.4 | 69.7 | 79.2 | 86.9 | 74.0 | 83.6 | 90.1 | 94.1 |
| NSM zhaiclassification2019 | 59.6 | 72.0 | 81.2 | 88.4 | 81.7 | 88.9 | 93.4 | 96.0 |
| MS wang2019multi | 65.7 | 77.0 | 86.3 | 91.2 | 84.1 | 90.4 | 94.0 | 96.5 |
| SoftTriple qian2019softtriple | 65.4 | 76.4 | 84.5 | 90.4 | 84.5 | 90.7 | 94.5 | 96.9 |
| HORDE JACOB_2019_ICCV | 66.3 | 76.7 | 84.7 | 90.6 | 83.9 | 90.3 | 94.1 | 96.3 |
| DiVA milbich2020diva | 66.8 | 77.7 | - | - | 84.1 | 90.7 | - | - |

(a) CUB-200 CUB-200 and Cars-196 Cars-196

| Method | SOP R@1 | R@10 | R@100 | R@1000 | In-Shop R@1 | R@10 | R@20 | R@40 |
|---|---|---|---|---|---|---|---|---|
| NSM zhaiclassification2019 | 73.8 | 88.1 | 95.0 | - | - | - | - | - |
| MS wang2019multi | 78.2 | 90.5 | 96.0 | 98.7 | 89.7 | 97.9 | 98.5 | 99.1 |
| SoftTriple qian2019softtriple | 78.3 | 90.3 | 95.9 | - | - | - | - | - |
| HORDE JACOB_2019_ICCV | 80.1 | 91.3 | 96.2 | 98.7 | 90.4 | 97.8 | 98.4 | 98.9 |
| DiVA milbich2020diva | 78.1 | 90.6 | - | - | - | - | - | - |

(b) SOP Song2016DeepML and In-Shop InShop
Table 1: Recall@K comparison with state-of-the-art methods. The baseline methods and MDR are grouped in the gray-colored rows. indicates that the model is trained and tested with large images of following the setting of JACOB_2019_ICCV. We round reported values to the first decimal place.

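Loss Function. The proposed MDR can be applied to any loss function such as Contrastive loss bromley1994signature, Triplet loss schroff2015facenet and Margin loss wu2017sampling. We mostly adopt Triplet loss as the baseline for our experiments:

$$\mathcal{L}_{Triplet} = \frac{1}{\lvert \mathcal{T} \rvert} \sum_{(a, p, n) \in \mathcal{T}} \big[ d(e_a, e_p) - d(e_a, e_n) + m \big]_{+},$$

where $\mathcal{T}$ is a set of triplets of an anchor $a$, a positive $p$, and a negative $n$ sampled from a mini-batch, $d(\cdot, \cdot)$ is the Euclidean distance between embedding vectors, and $m$ is a margin. The final loss function $\mathcal{L}$ is defined as the sum of $\mathcal{L}_{Triplet}$ and the MDR loss $\mathcal{L}_{MDR}$ with a multiplier $\lambda$ that balances the losses:

$$\mathcal{L} = \mathcal{L}_{Triplet} + \lambda \mathcal{L}_{MDR}.$$

$\mathcal{L}_{Triplet}$ optimizes the model by minimizing the distance of positive pairs and maximizing the distance of negative pairs. $\mathcal{L}_{MDR}$ regularizes the pairwise distances by constraining them with multiple levels. The embedding network is trained simultaneously with these different objectives.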
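As a small worked example of how the two objectives are combined, the following sketch evaluates a hinge-based triplet loss on precomputed distances and adds a weighted regularization value. The function names, the averaging over triplets, the balancing weight `lam=0.5`, and the regularizer value `l_mdr=0.4` are hypothetical choices for illustration; the margin of 0.2 is the value stated in the implementation details.

```python
def triplet_loss(triplets, margin):
    """Mean hinge loss over (d_ap, d_an) distance pairs, one pair per triplet."""
    hinge = [max(0.0, d_ap - d_an + margin) for d_ap, d_an in triplets]
    return sum(hinge) / len(hinge)


def total_loss(l_triplet, l_mdr, lam):
    """Weighted sum of the metric-learning loss and the regularizer."""
    return l_triplet + lam * l_mdr


# Two triplets given as (anchor-positive distance, anchor-negative distance).
# The first already satisfies the margin; the second violates it.
triplets = [(0.5, 1.0), (0.9, 0.8)]
l_triplet = triplet_loss(triplets, margin=0.2)

# Hypothetical regularizer value and balancing weight.
l_total = total_loss(l_triplet, l_mdr=0.4, lam=0.5)
```

Only the second triplet contributes to the triplet term, while the regularizer contributes regardless of whether the margin is satisfied, which is how it keeps influencing pairs that the triplet loss would otherwise ignore.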

Embedding Normalization Trick for MDR. In our learning procedure, $L_2$ Normalization ($L_2$ Norm) is not adopted because it can disturb the proper regularization effect of MDR. However, the lack of $L_2$ Norm can make it difficult to find appropriate hyper-parameters of the DML loss, such as the margin in Triplet loss, because no prior knowledge of the scale of the embedding vectors is available. To overcome this difficulty, we normalize the embedding vectors by dividing them by the mean pairwise distance $\mu$ during the training stage, such that the expected pairwise distance is one: $\mathbb{E}[d(e_i / \mu, e_j / \mu)] = 1$. We adopt this trick on several loss functions such as Contrastive loss hadsell2006dimensionality, Margin loss wu2017sampling, and Triplet loss in our experiments.
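The effect of this rescaling is easy to check numerically: because Euclidean distance scales linearly with the vectors, dividing every embedding by the mean pairwise distance makes the new mean pairwise distance exactly one. The sketch below assumes the mean is taken over the current batch; the helper names are hypothetical.

```python
import math
from itertools import combinations


def mean_pairwise_distance(embeddings):
    """Mean Euclidean distance over all unordered pairs."""
    dists = [math.dist(a, b) for a, b in combinations(embeddings, 2)]
    return sum(dists) / len(dists)


def rescale(embeddings):
    """Divide every vector by the batch's mean pairwise distance."""
    mu = mean_pairwise_distance(embeddings)
    return [tuple(x / mu for x in e) for e in embeddings]


batch = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # pairwise distances 5, 10, 5
scaled = rescale(batch)
scaled_mean = mean_pairwise_distance(scaled)   # equals 1 up to rounding
```

With the scale fixed in this way, a margin such as 0.2 has a consistent meaning relative to the typical distance, without constraining each vector to unit length.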

4 Experiments

To show the effectiveness of MDR and its behaviors, we extensively perform ablation studies and experiments. We follow the standard evaluation protocol and data splits proposed in Song2016DeepML. For an unbiased evaluation, we conduct 5 independent runs for each experiment and report the mean and the standard deviation of them.
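For readers unfamiliar with the metric, Recall@K can be sketched as follows. This is an illustrative version that assumes every item is used as a query against all remaining items with Euclidean distance, which corresponds to the single-set protocol; a query/gallery split such as the one used for In-Shop would need a separate gallery argument. The function name and the toy data are hypothetical.

```python
import math


def recall_at_k(embeddings, labels, k):
    """Fraction of queries with a same-class item among their k nearest neighbors."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Rank all other items by Euclidean distance to the query.
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: math.dist(query, embeddings[j]))
        if any(labels[j] == labels[i] for j in others[:k]):
            hits += 1
    return hits / len(embeddings)


embeddings = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.2, 0.1)]
labels = ["a", "a", "b", "b"]
r1 = recall_at_k(embeddings, labels, k=1)  # the last item's nearest neighbor is class "a"
r3 = recall_at_k(embeddings, labels, k=3)  # every query finds a match within 3 neighbors
```

Here three of the four queries retrieve a same-class item first, so Recall@1 is 0.75, and Recall@3 is 1.0.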

Datasets. We employ the four standard datasets of deep metric learning for evaluations: CUB-200-2011 CUB-200 (CUB-200), Cars-196 Cars-196, Stanford Online Products Song2016DeepML (SOP) and In-Shop Clothes Retrieval InShop (In-Shop). CUB-200 has 5,864 images of the first 100 classes for training and 5,924 images of the remaining classes for evaluation. Cars-196 has 8,054 images of the first 98 classes for training and 8,131 images of the remaining classes for evaluation. SOP has 59,551 images of 11,318 classes for training and 60,502 images of the remaining classes for evaluation. In-Shop has 25,882 images of 3,997 classes for training, and the remaining 7,970 classes with 26,830 images are partitioned into two subsets (query set and gallery set) for evaluation.

4.1 Implementation Details

Embedding Network. All the compared methods and our method use the Inception architecture with Batch Normalization (IBN) as a backbone network. IBN is pre-trained on the ImageNet ILSVRC 2012 dataset deng2009imagenet and then fine-tuned on the target dataset. We attach a fully-connected layer after the last pooling layer of IBN, and its output activation is used as the embedding vector. For models trained with MDR, $L_2$ Norm is not applied to the embedding vectors because it disturbs the effect of the regularization. For a fair comparison with the conventional implementation of Triplet loss schroff2015facenet, which is used as a baseline, we apply $L_2$ Norm to those baseline models.

Learning. We employ Adam kingma2014adam optimizer with a weight decay of . For CUB-200 and Cars-196, a learning rate and the size of mini-batch are set to and 128. For SOP and In-Shop, a learning rate and the size of mini-batch are set to and 256. We mainly apply our method to Triplet loss schroff2015facenet. As a triplet sampling method, we employ the distance weighted sampling wu2017sampling. The margin of Triplet loss is set to 0.2. We summarized the hyper-parameters of MDR: the configuration of the levels is initialized to three levels of , and the momentum is set to . is set differently for each dataset: for CUB-200, for Cars-196 and for SOP and In-Shop. For most of the datasets, of is enough to improve a given model; on CUB-200, a strong regularization is more effective because it is a small dataset with only 5,864 training images where a model may easily suffer from overfitting. Those hyper-parameters are not very sensitive to tune, and we explain the effects of each hyper-parameter in the ablation studies at Section 4.3.

Image Setting. During training, we follow the standard image augmentation process Song2016DeepML; wang2019multi with the following order: resizing to , random cropping, random horizontal flipping, and resizing to . For evaluation, images are center-cropped.

4.2 Comparison with State-of-the-art Methods

We compare MDR with recent state-of-the-art methods (Table 1). All compared methods use embedding vectors of 512 dimensions. Our baseline model is trained with Triplet loss without $L_2$ Norm (Triplet), and we also report the conventional Triplet loss with $L_2$ Norm (Triplet+$L_2$ Norm). The lack of the constraint that $L_2$ Norm imposes on the embedding space results in poor generalization performance, and it is known that Triplet loss is effective when $L_2$ Norm is applied schroff2015facenet. However, the models with MDR outperform the Triplet+$L_2$ Norm models on all the datasets. Those results demonstrate the effectiveness of the proposed distance-based regularization.

Experimental Results. MDR improves performance on all the datasets, and the improvements are especially large on the small datasets. For CUB-200, MDR improves Recall@1 by 3.7 percentage points over the conventional Triplet+$L_2$ Norm; the result is 11.5 percentage points higher than the Recall@1 of Triplet. For Cars-196, MDR improves Recall@1 by 8.7 percentage points over the conventional Triplet+$L_2$ Norm; the result is 12.3 percentage points higher than the Recall@1 of Triplet. MDR also improves the recall performance compared to the baselines on SOP and In-Shop. Moreover, our method significantly outperforms the other state-of-the-art methods in all recall criteria for all datasets.

Figure 3: Panels: (a) Dimensionality; (b) Learning Curves. (a) compares the three methods on various dimensionalities of the embedding vector on CUB-200 and Cars-196. (b) shows the learning curves of the three methods for the training and test set on CUB-200.

Table 2: Recall@1 comparison with various backbone networks, loss functions, and level configurations. Panels: (a) Backbone Network (CUB-200, Cars-196); (b) Loss Function (CUB-200, Cars-196); (c) Level Configuration (Fixed, Learnable). The models of (a) are trained with Triplet loss. The models of (b) use IBN as the backbone network. In (a) and (b), the marked column indicates that the models are trained with MDR.

Table 3: Recall@1 comparison showing the effect of $L_2$ Norm at inference time for the models trained without $L_2$ Norm. The marked column indicates that the trained models are evaluated with $L_2$ Norm.

Figure 4: Panels: (a) Expectation of Two-Norm; (b) Coefficient of Variation of Two-Norm. (a) compares the expectation of the two-norm of the embedding vectors for the test set on CUB-200. (b) compares the coefficient of variation of the two-norm of the embedding vectors for the test set on CUB-200.

4.3 Ablation Studies

We extensively perform ablation studies on the behaviors of the proposed MDR.

Backbone Network. MDR is widely applicable to any backbone network (Table 2(a)). We apply MDR to IBN ioffe2015batch, ResNet18 (R18) and ResNet50 he2016deep (R50), and achieve significant improvements for all backbone networks. In particular, a light-weight backbone, R18, with MDR even outperforms the baseline models with heavy-weight backbones such as R50 and IBN on both datasets.

Loss Function. Our MDR is also widely applicable to any distance-based loss function (Table 2(b)). We apply MDR to Contrastive loss hadsell2006dimensionality, Margin loss wu2017sampling and Triplet loss. MDR achieves significant improvements for all loss functions.

Figure 5: Class centers in the embedding space of two models trained without MDR (Triplet and Triplet+$L_2$ Norm) and one model trained with MDR (Triplet+MDR). Panels: (a) Triplet; (b) Triplet+$L_2$ Norm; (c) Triplet+MDR. We visualize using t-SNE t_SNE on CUB-200.

Figure 6: Visualization of assigned positive and negative pairs at each level on CUB-200. Regardless of whether a pair is positive or negative, a visually close pair is assigned to level 1, and a visually distant pair is assigned to level 3; even birds of the same species can vary in appearance because of differences in perspective, pose, and environment.

Level Configuration . Even though the levels are learnable, we should properly set the number of levels and the initial values of levels. We perform experiments on various initial configurations of levels and validate the importance of the learnability of levels (Table (c)c). From the experiments, we find that a sufficiently spaced configuration is better than a tightly spaced configuration; is better than , and a configuration of three levels is sufficient.

4.4 Discussion

Effectiveness in Small Dimensionality. We perform an experiment on various dimensionalities of embedding vector such as , , , and . MDR significantly improves the Recall@1 of the models, especially in small dimensionality. In the experiment, our MDR only with 64 dimensionality is similar to or surpasses the performance of other methods with 512 dimensionality (Figure (a)a). The result indicates that our MDR constructs a highly efficient embedding space in compact dimensionality. Moreover, the improvements are larger compared to Triplet+ Norm for all dimensionality.

Prevention of Overfitting as Regularizer. We investigate the learning curves of three models: Triplet, Triplet+$L_2$ Norm and Triplet+MDR (Figure 3(b)). There are two crucial observations: (1) on the training set, Triplet+MDR fits the data less closely than the other two methods, yet it shows the highest performance on the test set; (2) the recall of Triplet+MDR does not drop until the end of learning, unlike the other methods, which suffer from severe overfitting. These observations indicate that our MDR is an effective regularizer for DML.

Equalizing the Two-Norm of Embedding Vectors. We find that the embedding vectors of a model trained with MDR have almost the same two-norm (Figure 4(a) and 4(b)). This shows that the embedding vectors lie almost on a hypersphere, even though the model is trained without $L_2$ Norm. Therefore, the model trained with MDR achieves similar performance even if $L_2$ Norm is applied at inference time (Table 3). This observation implies that, by the end of training, MDR has an effect similar to that of $L_2$ Norm, even though MDR is a distance-based regularization and $L_2$ Norm is a norm-based regularization.
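The two statistics discussed here, the expectation and the coefficient of variation of the two-norm, can be computed as in the sketch below. The helper name and the toy vectors are hypothetical; a coefficient of variation close to zero means that the vectors lie near a common hypersphere.

```python
import math


def norm_statistics(embeddings):
    """Return (mean, coefficient of variation) of the embedding two-norms."""
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    mean = sum(norms) / len(norms)
    std = math.sqrt(sum((n - mean) ** 2 for n in norms) / len(norms))
    return mean, std / mean


on_sphere = [(3.0, 4.0), (4.0, 3.0), (0.0, 5.0)]  # all norms equal 5
spread = [(1.0, 0.0), (0.0, 3.0), (3.0, 4.0)]     # norms 1, 3, 5

sphere_mean, sphere_cv = norm_statistics(on_sphere)
spread_mean, spread_cv = norm_statistics(spread)
```

The first set has a coefficient of variation of zero, while the second has a clearly positive one even though both are valid embeddings, so this statistic separates the two situations compared in Figure 4.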

Discriminative Representation. To show the effectiveness of our method, we visualize how MDR constructs an embedding space. In the embedding spaces of Triplet and Triplet+$L_2$ Norm, the class centers are often located close to each other (Figure 5(a) and 5(b)). However, in the embedding space of Triplet+MDR, the class centers are evenly spaced with a large margin (Figure 5(c)). This result indicates that MDR constructs a more discriminative representation than the conventional methods.

Qualitative Analysis on Level Assignment. In the level assignment step, a lower level indicates that the pairs are closely aligned in the embedding space, and a higher level indicates the opposite. Most of the positive pairs fall between levels 1 and 2, and most of the negative pairs fall between levels 2 and 3. However, hard-positive pairs may belong to level 3, while hard-negative pairs may belong to level 1 (Figure 6). Therefore, levels are assigned to each pair regardless of the given binary supervision. The learning procedure tries to overcome the disturbance that pulls the distances toward the levels to which they belong, while accounting for the various degrees of distance; this multi-level disturbance leads to the improvement of the generalization ability.

5 Conclusion

We introduce a new distance-based regularization method that adjusts pairwise distances into multiple levels for better generalization. We demonstrate the effectiveness of MDR by showing improvements that greatly exceed the existing methods, and by performing extensive ablation studies of its behaviors. By applying our MDR, many methods can be significantly improved without any extra burden at inference time.


We would like to thank the AI R&D team of Kakao Enterprise for the helpful discussion. In particular, we would like to thank Yunmo Park, who designed the visual materials.

Potential Ethical Impact
Due to the gap between a training dataset and real-world data, it is important to build a reliable model with better generalization ability on unseen data, e.g., a test set, for its practicality. Our MDR is a regularization method to improve the generalization ability of a deep neural network on the task of deep metric learning. As a positive aspect, our method can be applied to many practical applications such as image retrieval and item recommendation. These applications are used for our convenience, and the proposed MDR can improve their performance more reliably. We believe that our method does not have particular negative aspects because it is a fundamental method that assists conventional approaches in improving reliability on unseen datasets.