# Energy Confused Adversarial Metric Learning for Zero-Shot Image Retrieval and Clustering

Deep metric learning has been widely applied in many computer vision tasks, and has recently become more attractive in zero-shot image retrieval and clustering (ZSRC), where a good embedding is required such that unseen classes can be distinguished well. Most existing works deem this 'good' embedding to be simply a discriminative one and thus race to devise powerful metric objectives or hard-sample mining strategies for learning discriminative embeddings. In this paper, however, we first emphasize that generalization ability is also a core ingredient of this 'good' embedding and in fact largely affects metric performance in zero-shot settings. We then propose the Energy Confused Adversarial Metric Learning (ECAML) framework to explicitly optimize a robust metric. This is mainly achieved by introducing an Energy Confusion regularization term, which breaks away from the traditional metric learning idea of devising discriminative objectives and instead seeks to 'confuse' the learned model, so as to encourage its generalization ability by reducing overfitting on the seen classes. We train this confusion term together with the conventional metric objective in an adversarial manner. Although it may seem odd to 'confuse' the network, we show that ECAML indeed serves as an efficient regularization technique for metric learning and is applicable to various conventional metric methods. This paper empirically demonstrates the importance of learning embeddings with good generalization, achieving state-of-the-art performance on the popular CUB, CARS, Stanford Online Products and In-Shop datasets for ZSRC tasks. Code available at http://www.bhchen.cn/.


## 1. Introduction

Since zero-shot learning (ZSL) removes the requirement of category consistency between the training and testing sets, it has become increasingly attractive: the model is required to learn concepts from seen classes and is then expected to distinguish unseen classes. ZSL has been widely explored in image classification [Changpinyo et al.2016, Fu et al.2015, Zhang and Saligrama2015, Zhang, Xiang, and Gong2017] and retrieval tasks [Dalton, Allan, and Mirajkar2013, Shen et al.2018, Oh Song et al.2016], etc. In this paper, we focus on zero-shot image retrieval and clustering (ZSRC) tasks.

In order to accurately retrieve and cluster the unseen classes, most existing works employ Deep Metric Learning to optimize a good embedding, for example by exploring tuple-based loss functions [Sun et al.2014, Yuan, Yang, and Zhang2017, Wu et al.2017, Schroff2015, Oh Song et al.2016, Wang et al.2017, Huang, Loy, and Tang2016, Sohn2016] or proposing efficient hard-sample mining strategies [Kumar et al.2017, Wu et al.2017, Schroff2015]. However, these methods deem the 'good' embedding to be simply a discriminative one and concentrate on discriminative learning over the seen classes, while neglecting the generalization ability of the learned metric, which is equally significant in ZSRC. As a result, without any robustness constraint they easily overfit the concepts of the seen classes, and helpful or general knowledge for unseen classes is likely to be left out.

To be specific, in ZSRC, we emphasize that the generalization ability of the learned embedding is seriously affected by the following problem: "the biased learning behavior of deep models". Concretely, as illustrated in Fig.1.(a), for a functional learner parameterized by a CNN to correctly distinguish classes A and B, it will selectively learn the partial, biased attribute concepts that most easily reduce the current training loss over the seen classes (here head knowledge is enough to separate class A from B and thus is learned), instead of learning all-sided details and concepts. This yields overfitting on the seen classes and worse generalization to unseen ones (classes C and D). In other words, in order to correctly recognize classes, deep networks easily learn to focus on surface statistical regularities rather than more general abstract concepts. (In fact, the learned partial biased knowledge is more complicated and cannot be easily illustrated in a figure; for intuitive understanding, we translate it into some single body-part knowledge.)

Therefore, when learning the embedding as in the aforementioned conventional metric learning methods, this issue exists and impedes the learning of the desired good embedding. Without an explicit and benign robustness constraint, the learned embedding is unable to generalize well to the unseen classes. Most ZSRC works ignore the importance of learning robust descriptors. It is therefore important, especially in ZSRC tasks, to propose an efficient regularization method that lets conventional metric learning produce metrics with good generalization.

In this paper, we propose the Energy Confused Adversarial Metric Learning (ECAML) framework, a regularization strategy that alleviates the generalization problem in ZSRC tasks by randomly confusing the learned metric during each iteration. This is mainly achieved by a novel and simple Energy Confusion (EC) term which is 'plug and play' and can be generally applied to many existing deep metric learning approaches. Concretely, this confusion term, which intends to minimize the expected value of the Euclidean distances between paired images from two different categories, plays an adversarial role against the conventional metric learning objective. As illustrated in Fig.1.(b), confusing the biased head-based metric makes the model less discriminative on the seen classes by reducing its dependence on head learning, and thus gives it the chance to explore other complementary and general knowledge, preventing overfitting on the seen classes and improving the generalization ability of the embedding in an adversarial manner. In other words, the EC term allows the SGD solver to escape from the 'bad' local-minima region induced by the seen classes and to explore further for a more robust one. The main contributions of this work can be summarized as follows:

• We emphasize that a crucial issue in ZSRC, i.e. the biased learning behavior of deep models, is the key stumbling block to improving the generalization ability of the learned embedding.

• We propose the Energy Confused Adversarial Metric Learning (ECAML) framework to reinforce the robustness of the embedding in an adversarial manner. The Energy Confusion (EC) term is 'plug and play' and can work in conjunction with many existing metric methods. To our knowledge, this is the first work to introduce confusion for deep metric learning.

• Extensive experiments have been performed on several popular datasets for ZSRC, including CARS [Krause et al.2013], CUB [Wah et al.2011], Stanford Online Products [Oh Song et al.2016] and In-Shop [Liu et al.2016], achieving state-of-the-art performance.

## 2. Related Work

Zero-shot setting: ZSL has been widely explored in many computer vision tasks, such as image classification [Changpinyo et al.2016, Fu et al.2015, Zhang and Saligrama2015] and image retrieval [Dalton, Allan, and Mirajkar2013, Shen et al.2018]. Most of these ZSL methods exploit extra auxiliary supervision information about the unseen classes (e.g. word representations of semantic names), thus aligning the learned features in an explicit manner. However, in real applications, collecting and labelling this auxiliary information is time-consuming and impractical. Our ECAML concentrates on a more realistic scenario where only the seen class labels are available.

Deep metric learning for ZSRC: The commonly used contrastive [Sun et al.2014] and triplet losses [Schroff2015] have been broadly studied. Additionally, there are several other deep metric learning works: Smart-mining [Kumar et al.2017] combines a local triplet loss and a global loss to optimize the deep metric with hard-sample mining. Sampling-Matters [Wu et al.2017] proposes a distance-weighted sampling strategy. Angular loss [Wang et al.2017] optimizes a triangle-based angular function. Proxy-NCA [Movshovitz-Attias et al.2017] explains why the popular classification loss works from a proxy-agent view, and its implementation is very similar to Softmax. ALMN [Chen and Deng2018] proposes to generate geometrical virtual negative points instead of employing hard-sample mining for learning a discriminative embedding. However, all the above methods address the metric by designing discriminative losses or exploring sample-mining strategies, and thus easily suffer from the aforementioned issue. Additionally, HDC [Yuan, Yang, and Zhang2017] employs cascaded models and selects hard samples from different levels and models, and BIER loss [Opitz et al.2017, Opitz et al.2018] adopts online gradient boosting. These methods try to improve performance by resorting to the ensemble idea. Different from all these methods, our ECAML has the clear objective of improving the generalization ability of the learned metric by introducing the Energy Confusion regularization term.

Regularization technique: Regularization methods are often important for deep models, since deep models are largely data-driven. Some works inject random noise into deep networks so as to ensure robust training: Bengio et al. [Bengio, Léonard, and Courville2013] and Gulcehre et al. [Gulcehre et al.2016] add noise in the ReLU and Sigmoid activation functions respectively, while Blundell et al. [Blundell et al.2015], Graves [Graves2011] and Neelakantan et al. [Neelakantan et al.2015] add noise in weights and gradients respectively. Moreover, some research works regularize deep models at the top layer, i.e. the Softmax classifier layer. For example, Szegedy et al. [Szegedy et al.2016] propose the label-smoothing regularization technique for training deep models, Xie et al. [Xie et al.2016] propose the label-disturbing technique for improving the generalization ability of deep models, and Chen et al. [Chen, Deng, and Du2017] inject annealed noise into the softmax activations so as to boost the generalization ability by postponing early Softmax saturation. However, unlike these methods, which are mainly devised for classification tasks and applied at the Softmax classifier layer, our ECAML aims to promote the generalization ability of metric learning in ZSRC tasks, and it is achieved by training the EC term in an adversarial manner.

## 3. Notations and Preliminaries

In this section, we review some notations and the necessary preliminaries on the relation between semimetrics and RKHS kernels. They will be used later to interpret the differences between our EC (Sec. 4.1) and some existing methods, i.e. the general energy distance and the maximum mean discrepancy.

If not specified, we will assume that $\mathcal{Z}$ is any topological space on which Borel measures can be defined. Denote by $\mathcal{M}(\mathcal{Z})$ the set of all finite signed Borel measures on $\mathcal{Z}$, and by $\mathcal{M}_+^1(\mathcal{Z})$ the set of all Borel probability measures on $\mathcal{Z}$.

###### Definition 1.

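(RKHS) Let $\mathcal{H}$ be a Hilbert space of real-valued functions defined on $\mathcal{Z}$. A function $k:\mathcal{Z}\times\mathcal{Z}\to\mathbb{R}$ is called a reproducing kernel of $\mathcal{H}$ if (i) $\forall z\in\mathcal{Z}$, $k(\cdot,z)\in\mathcal{H}$, and (ii) $\forall z\in\mathcal{Z}$, $\forall f\in\mathcal{H}$, $\langle f, k(\cdot,z)\rangle_{\mathcal{H}}=f(z)$. If $\mathcal{H}$ has a reproducing kernel, it is called a reproducing kernel Hilbert space (RKHS).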

###### Definition 2.

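(Semimetric) Let $\mathcal{Z}$ be a nonempty set and let $\rho:\mathcal{Z}\times\mathcal{Z}\to[0,\infty)$ be a function such that $\forall z,z'\in\mathcal{Z}$, (i) $\rho(z,z')=0$ iff $z=z'$, and (ii) $\rho(z,z')=\rho(z',z)$. Then $(\mathcal{Z},\rho)$ is called a semimetric space and $\rho$ is a semimetric.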

###### Definition 3.

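(Negative type) A semimetric space $(\mathcal{Z},\rho)$ is said to have negative type if $\forall n\ge 2$, $z_1,\dots,z_n\in\mathcal{Z}$ and $\alpha_1,\dots,\alpha_n\in\mathbb{R}$ with $\sum_{i=1}^{n}\alpha_i=0$, we have $\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\rho(z_i,z_j)\le 0$.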

Then we have the following propositions, which are derived from [Van Den Berg, Christensen, and Ressel2012].

###### Proposition 1.

If $\rho$ satisfies Def.3, then so does $\rho^q$, where $0<q<1$.

###### Proposition 2.

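$\rho$ is a semimetric of negative type iff there exists a Hilbert space $\mathcal{H}$ and an injective map $\varphi:\mathcal{Z}\to\mathcal{H}$, such that

$$\rho(z,z')=\|\varphi(z)-\varphi(z')\|_{\mathcal{H}}^2 \tag{1}$$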

This shows that $(\mathbb{R}^d,\|\cdot-\cdot\|_2^2)$ is of negative type, and by taking $q=1/2$ in Prop.1, we conclude that all Euclidean spaces are of negative type [Sejdinovic et al.2012, Sejdinovic et al.2013], which will be used to reason about our Energy Confusion term. Semimetrics of negative type and symmetric positive definite kernels are in fact closely related, as shown by the following Lemma (for more details please refer to [Van Den Berg, Christensen, and Ressel2012]).

###### Lemma 1.

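For a nonempty $\mathcal{Z}$, let $\rho$ be a semimetric on $\mathcal{Z}$. Let $z_0\in\mathcal{Z}$, and denote $k(z,z')=\frac{1}{2}\left[\rho(z,z_0)+\rho(z',z_0)-\rho(z,z')\right]$. Then $k$ is positive definite iff $\rho$ is of negative type.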

We call the kernel $k$ defined above the distance-induced kernel; it is induced by the semimetric $\rho$ and centered at $z_0$. By varying the center point $z_0$, we obtain a kernel family $\mathcal{K}_\rho=\left\{\frac{1}{2}\left[\rho(z,z_0)+\rho(z',z_0)-\rho(z,z')\right]\right\}_{z_0\in\mathcal{Z}}$ induced by $\rho$. We can then always express Eq.1 in terms of the canonical feature map of the RKHS, as in the following proposition.

###### Proposition 3.

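Let $(\mathcal{Z},\rho)$ be a semimetric space of negative type, and $k\in\mathcal{K}_\rho$. Then:

1. $k$ is nondegenerate, i.e. the Aronszajn map $z\mapsto k(\cdot,z)$ is injective.

2. $\rho(z,z')=k(z,z)+k(z',z')-2k(z,z')=\|k(\cdot,z)-k(\cdot,z')\|_{\mathcal{H}_k}^2$.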

For the above valid $k$, we say that $k$ generates $\rho$. The above proposition implies that the Aronszajn map $z\mapsto k(\cdot,z)$ is an isometric embedding of the metric space $(\mathcal{Z},\rho^{1/2})$ into $\mathcal{H}_k$, for each $k\in\mathcal{K}_\rho$. Lem.1 and Prop.3 reveal the general link between semimetrics of negative type and RKHS kernels from different views. By taking some special cases of $\rho$ and $k$, we are able to elucidate our EC in the following sections.

## 4. Proposed Approach

### 4.1 Energy Confusion

As discussed in Sec. 1, without explicitly taking the generalization ability into consideration, simply optimizing a discriminative metric objective or applying hard-sample mining strategies, as in most existing metric learning works, will not lead to a robust metric for ZSRC tasks. The 'biased learning behavior of deep models' will mostly force the network to fit surface statistical regularities rather than more general abstract concepts, i.e. it will only highlight the concepts that are discriminative for the seen classes instead of keeping all-sided information, resulting in overfitting on the seen categories and limiting the generalization ability of the learned embedding.

The biased learning behavior is actually induced by the nature of model training: in order to correctly distinguish different seen classes, the deep metric has to be as confident as possible about the feature distribution prediction over the current seen classes (e.g. features of different classes should be far away from each other). As a result, only the partial, biased knowledge that is discriminative enough to separate the seen categories, as shown in Fig.1, is captured, while other potentially helpful knowledge is omitted. To this end, a natural solution is to introduce an opposite optimization objective, i.e. a feature distribution confusion term, into the conventional metric learning phase so as to 'confuse' the network and reduce the over-confident predictions of distances between feature distributions on the seen classes. Specifically, denote the input features by $X=\{x_i\}$ and the corresponding label inputs by $Y=\{y_i\}$ with $y_i\in\{1,\dots,C\}$, where $C$ is the number of seen classes. The conventional metric optimization goal is to make the distance measurement $D(x_i,x_j)$ as large as possible if $y_i\neq y_j$, and otherwise as small as possible. It can be formulated as:

$$\theta_f=\arg\min_{\theta_f} L_m(\theta_f;T,D) \tag{2}$$

where $L_m$ is some specific metric loss function, $T$ indicates some instance-tuple, e.g. contrastive tuple [Sun et al.2014], triplet tuple [Schroff2015] or N-Pair tuple [Sohn2016], $D$ is the distance distribution measurement, e.g. Euclidean measurement [Oh Song et al.2016, Yuan, Yang, and Zhang2017, Huang, Loy, and Tang2016, Schroff2015, Wu et al.2017] or inner-product measurement [Opitz et al.2017, Sohn2016], and $\theta_f$ denotes the metric parameters to be learned.

Therefore, in order to prevent the biased learning behavior by confusing the feature distribution learning, we would like to learn $\theta_f$ that makes the feature distributions from different classes closer under some specific choice of loss, tuple and measurement. It may seem that the commonly adopted family of $f$-divergences for measuring the difference between two probability distributions, such as the KL-divergence [Kullback and Leibler1951], Hellinger distance [Hellinger1909] and total variation distance, would be a suitable choice. However, we emphasize that they cannot be directly applied here, since they mostly work with the probability measure itself (which is normalized to have total mass one), whereas our confusion goal is based on the statistical distance between two random vectors following some probability distributions. To this end, we propose the Energy Confusion term as follows:

$$L_{ec}(\theta_f;X_I,X_J)=\mathbb{E}_{\tilde{X}_I\tilde{X}_J}\left(\|\tilde{X}_I-\tilde{X}_J\|_2^2\right)=\sum_{i,j}p_{i,j}\|x_i-x_j\|_2^2 \tag{3}$$

where $\mathbb{E}$ indicates the expected value, $X_I, X_J$ are two different class sets, $\tilde{X}_I,\tilde{X}_J$ are random feature vectors which obey some certain distribution, $x_i, x_j$ are the corresponding feature observations and $p_{i,j}$ is the joint probability. Since during training the samples are uniformly sampled and the classes are independent, we have $p_{i,j}=p_ip_j$ with $p_i=1/|X_I|$ and $p_j=1/|X_J|$. In this case, the loss function, instance-tuple and distance measurement are the expected value function, the contrastive tuple and the Euclidean measurement, respectively.
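As a minimal illustration of Eq. 3, the sketch below computes the EC term for two small sets of embeddings in plain Python, assuming uniform pair probabilities so that the expectation reduces to an average over all cross-class pairs. The function names and toy feature values are illustrative assumptions, not the authors' implementation (which is written in Caffe).

```python
from itertools import product


def squared_euclidean(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def energy_confusion(feats_i, feats_j):
    """Average squared distance over all cross-class pairs (uniform p_ij)."""
    pairs = list(product(feats_i, feats_j))
    return sum(squared_euclidean(x, y) for x, y in pairs) / len(pairs)


# Toy embeddings for two classes (illustrative values only).
class_a = [[0.0, 0.0], [2.0, 0.0]]
class_b = [[0.0, 1.0], [2.0, 1.0]]
ec = energy_confusion(class_a, class_b)
print(ec)  # 3.0
```

Minimizing this quantity pulls the two class distributions together, which is the opposite of what the metric loss in Eq. 2 asks for.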

From Eq. 3, one can observe that the EC term intends to minimize the expected distance between different classes so as to confuse the metric. As discussed above, the learned embedding represents the learned concepts to some extent, and the more accurate the prediction of distances on the seen classes, the greater the risk of concept overfitting. EC serves as a regularization term that prevents the model from being over-confident about the seen classes and mitigates the biased learning issue by keeping the learner from getting stuck in training-data-specific concepts. In other words, metric learning is regularized by explicitly reducing the model's dependence on partial, biased knowledge, and this is mainly achieved by the idea of feature distribution confusion. Moreover, 'confusing' also gives the SGD solver the chance to escape from the 'partial' and 'bad' local minima induced by the seen instances, and then to explore other solution regions for more 'general' ones.

Discussion: From the above analysis, it may seem that the commonly used general energy distance (GED) and maximum mean discrepancy (MMD) could also be used here to confuse the network by pushing different feature distributions closer. Below we relate our EC to these two methods, and explain theoretically why neither of them can be directly applied here.

Relation to GED: Let $(\mathcal{Z},\rho)$ be a semimetric space of negative type, and let $P$ and $Q$ be Borel probability measures on $\mathcal{Z}$. Then the general energy distance (GED) between $P$ and $Q$, w.r.t. $\rho$, is:

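$$D_{E,\rho}(P,Q)=2\,\mathbb{E}_{\tilde{P}\tilde{Q}}\,\rho(\tilde{P},\tilde{Q})-\mathbb{E}_{\tilde{P}\tilde{P}'}\,\rho(\tilde{P},\tilde{P}')-\mathbb{E}_{\tilde{Q}\tilde{Q}'}\,\rho(\tilde{Q},\tilde{Q}') \tag{4}$$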

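where $\tilde{P},\tilde{P}'\overset{i.i.d.}{\sim}P$ and $\tilde{Q},\tilde{Q}'\overset{i.i.d.}{\sim}Q$. $D_{E,\rho}$ is a general extension of the energy distance [Székely and Rizzo2004, Székely and Rizzo2005] to metric spaces. Then we have: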

###### Lemma 2.

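For two different class sets $X_I, X_J$, let $\rho$ be the squared Euclidean metric, i.e. $\rho(z,z')=\|z-z'\|_2^2$, then:

$$L_{ec}(\theta_f;X_I,X_J)\ge\frac{1}{2}D_{E,\rho}(X_I,X_J)$$

###### Proof.

From Prop.2, if $\rho$ is the squared Euclidean metric, then $\rho$ is of negative type, and thus from Eq.4

$$\frac{1}{2}D_{E,\rho}(X_I,X_J)=\mathbb{E}\left(\|\tilde{X}_I-\tilde{X}_J\|_2^2\right)-\frac{1}{2}\left\{\mathbb{E}\left(\|\tilde{X}_I-\tilde{X}'_I\|_2^2\right)+\mathbb{E}\left(\|\tilde{X}_J-\tilde{X}'_J\|_2^2\right)\right\}$$

Since $\mathbb{E}\left(\|\tilde{X}_I-\tilde{X}'_I\|_2^2\right)+\mathbb{E}\left(\|\tilde{X}_J-\tilde{X}'_J\|_2^2\right)\ge 0$ always holds, we have

$$\frac{1}{2}D_{E,\rho}(X_I,X_J)\le\mathbb{E}\left(\|\tilde{X}_I-\tilde{X}_J\|_2^2\right)$$

By substituting Eq. 3 here, the proof is completed. ∎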

Remark: From Lem.2, one can observe that our EC can be viewed as an upper bound of GED, so minimizing this upper bound is equivalent to optimizing GED to some extent. Moreover, it may seem that directly optimizing GED with the squared Euclidean $\rho$ is reasonable as well, since GED itself is a statistical distance between two probability distributions. However, by comparing EC with GED, we emphasize that directly minimizing GED will additionally make $\mathbb{E}\left(\|\tilde{X}_I-\tilde{X}'_I\|_2^2\right)+\mathbb{E}\left(\|\tilde{X}_J-\tilde{X}'_J\|_2^2\right)$ large, i.e. it pushes points in the same class far away from each other. This violates the basic discrimination criterion of metric learning and degrades the model into a noisy counterpart, which is not what we desire. Therefore, GED cannot be directly applied here.
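The bound in Lem.2 can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the original work: it draws two hypothetical sets of random features, treats each set as an empirical distribution (so expectations become averages over all pairs), and compares the EC term with half of the GED.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_pairwise(xs, ys):
    """Average squared distance over all pairs drawn from xs and ys."""
    return sum(sq_dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


# Hypothetical 4-dimensional features for two classes.
rng = random.Random(0)
feats_i = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
feats_j = [[rng.gauss(1.0, 1.0) for _ in range(4)] for _ in range(8)]

ec = mean_pairwise(feats_i, feats_j)                 # cross-class term (Eq. 3)
within = (mean_pairwise(feats_i, feats_i)
          + mean_pairwise(feats_j, feats_j))         # within-class spread
half_ged = ec - 0.5 * within                         # one half of Eq. 4

print(ec >= half_ged)  # True: EC upper-bounds half of the GED
```

The gap between the two quantities is exactly half of the within-class spread, which is the part that minimizing GED would try to enlarge.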

Relation to MMD: Let $k$ be a kernel on $\mathcal{Z}$, and let $P$ and $Q$ be Borel probability measures on $\mathcal{Z}$. The maximum mean discrepancy (MMD) between $P$ and $Q$ is:

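$$\begin{aligned}\gamma_k^2(P,Q)&=\|\mu_k(P)-\mu_k(Q)\|_{\mathcal{H}_k}^2=\|\mathbb{E}_{\tilde{P}}k(\cdot,\tilde{P})-\mathbb{E}_{\tilde{Q}}k(\cdot,\tilde{Q})\|_{\mathcal{H}_k}^2\\&=\mathbb{E}_{\tilde{P}\tilde{P}'}k(\tilde{P},\tilde{P}')+\mathbb{E}_{\tilde{Q}\tilde{Q}'}k(\tilde{Q},\tilde{Q}')-2\,\mathbb{E}_{\tilde{P}\tilde{Q}}k(\tilde{P},\tilde{Q})\end{aligned} \tag{5}$$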

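where $\mu_k(P)=\mathbb{E}_{\tilde{P}}k(\cdot,\tilde{P})$ is the kernel embedding, and $\tilde{P},\tilde{P}'\overset{i.i.d.}{\sim}P$, $\tilde{Q},\tilde{Q}'\overset{i.i.d.}{\sim}Q$. Then we have: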

###### Lemma 3.

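For two different class sets $X_I, X_J$, let $k$ be the 1-homogeneous polynomial kernel, i.e. $k(z,z')=z^\top z'$, then:

$$L_{ec}(\theta_f;X_I,X_J)\ge\gamma_k^2(X_I,X_J)$$

###### Proof.

Inserting the distance-induced kernel $k$ of the corresponding $\rho$ from Lem.1 into Eq. 5, and cancelling out the terms that depend on a single random variable, we have:

$$\begin{aligned}\gamma_k^2(X_I,X_J)&=\frac{1}{2}\mathbb{E}_{\tilde{X}_I\tilde{X}'_I}\left[\rho(\tilde{X}_I,z_0)+\rho(\tilde{X}'_I,z_0)-\rho(\tilde{X}_I,\tilde{X}'_I)\right]\\&\quad+\frac{1}{2}\mathbb{E}_{\tilde{X}_J\tilde{X}'_J}\left[\rho(\tilde{X}_J,z_0)+\rho(\tilde{X}'_J,z_0)-\rho(\tilde{X}_J,\tilde{X}'_J)\right]\\&\quad-\mathbb{E}_{\tilde{X}_I\tilde{X}_J}\left[\rho(\tilde{X}_I,z_0)+\rho(\tilde{X}_J,z_0)-\rho(\tilde{X}_I,\tilde{X}_J)\right]\\&=\mathbb{E}_{\tilde{X}_I\tilde{X}_J}\rho(\tilde{X}_I,\tilde{X}_J)-\frac{1}{2}\mathbb{E}_{\tilde{X}_I\tilde{X}'_I}\rho(\tilde{X}_I,\tilde{X}'_I)-\frac{1}{2}\mathbb{E}_{\tilde{X}_J\tilde{X}'_J}\rho(\tilde{X}_J,\tilde{X}'_J)\end{aligned} \tag{6}$$

i.e. $\gamma_k^2(X_I,X_J)=\frac{1}{2}D_{E,\rho}(X_I,X_J)$. Since $k$ is the 1-homogeneous polynomial kernel, from Prop.3 the corresponding generated $\rho$ is the squared Euclidean metric, $\rho(z,z')=\|z-z'\|_2^2$. Then by using Lem.2, we have $L_{ec}(\theta_f;X_I,X_J)\ge\gamma_k^2(X_I,X_J)$. ∎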

Remark: From Lem.3, one can observe that our EC can also be viewed as an upper bound of MMD. Moreover, it may seem that directly optimizing MMD with the 1-homogeneous polynomial kernel, i.e. $k(z,z')=z^\top z'$, is reasonable as well, since many existing works employ it to pull two probability distributions closer, such as in transfer learning [Long et al.2015, Long et al.2016, Tzeng et al.2014]. However, by expanding this $\gamma_k^2$, we have $\gamma_k^2(X_I,X_J)=\mathbb{E}\,\tilde{X}_I^\top\tilde{X}'_I+\mathbb{E}\,\tilde{X}_J^\top\tilde{X}'_J-2\,\mathbb{E}\,\tilde{X}_I^\top\tilde{X}_J$. In this case, if we minimize $\gamma_k^2$ so as to pull the distributions of different classes closer and thus confuse the metric learning, we will additionally force $\mathbb{E}\,\tilde{X}_I^\top\tilde{X}'_I+\mathbb{E}\,\tilde{X}_J^\top\tilde{X}'_J$ to be small, which implicitly pushes points within the same class further apart as their inner products become small. This result is also not what we desire and will degrade the model into a noisy counterpart. Therefore, MMD cannot be directly applied here either.
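To make the MMD remark concrete, the following sketch evaluates the three terms of Eq. 5 for a linear kernel $k(z,z')=z^\top z'$ on toy data. The linear kernel is assumed here because it is the distance-induced kernel of the squared Euclidean distance centered at the origin; all names and values are hypothetical.

```python
def dot(a, b):
    """Inner product, i.e. the linear kernel k(z, z') = z^T z'."""
    return sum(ai * bi for ai, bi in zip(a, b))


def mean_kernel(xs, ys):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(dot(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd2_linear(xs, ys):
    """Squared MMD of Eq. 5 with the linear kernel (empirical version)."""
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


class_a = [[0.0, 0.0], [2.0, 0.0]]
class_b = [[0.0, 1.0], [2.0, 1.0]]
print(mmd2_linear(class_a, class_b))  # 1.0

# The within-class terms reward small inner products between same-class points:
tight = [[1.0, 0.0], [1.0, 0.0]]     # identical unit vectors
spread = [[1.0, 0.0], [-1.0, 0.0]]   # opposite unit vectors
print(mean_kernel(tight, tight), mean_kernel(spread, spread))  # 1.0 0.0
```

For the same toy classes the EC term of Eq. 3 equals 3.0, consistent with the bound in Lem.3, and the last two values show why minimizing the within-class kernel terms favors spread-out classes.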

Remark Summary: We theoretically derive the relations between our EC and both GED and MMD, and also explain why they cannot be directly applied here even though they have been widely adopted in many machine learning tasks for measuring probability distributions. Thus, we will focus on 'confusing' the metric learning via our EC term.

### 4.2 Energy Confused Adversarial Metric Learning

The framework of ECAML can be generally applied to various metric learning objective functions, where we simultaneously train our Energy Confusion term and the distance metric term as follows:

$$\min_{\theta_f}L=L_m(\theta_f;T,D)+\lambda\sum_{I,J,I\neq J}L_{ec}(\theta_f;X_I,X_J) \tag{7}$$

where $\lambda$ is the trade-off hyper-parameter and the class sets $X_I, X_J$ are randomly chosen in the current minibatch. In order to demonstrate the effectiveness of the proposed ECAML framework, we instantiate it with several state-of-the-art metric learning objective functions:

ECAML(Tri): For the triplet-tuple $T$ and Euclidean measurement $D$, we employ [Schroff2015, Wang and Gupta2015]:

$$L_m(\theta_f;T,D)=\sum_{i}^{N}\left[\|x_i-x_i^+\|_2^2-\|x_i-x_i^-\|_2^2+m\right]_+ \tag{8}$$

where the objective requires the distances of negative pairs to be larger than those of positive pairs by a margin $m$, and the features $x$ are assumed to lie on the unit sphere. The value of $m$ is chosen experimentally.
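A minimal sketch of the triplet objective in Eq. 8 is shown below. The margin value and the toy unit-norm features are placeholders chosen for illustration and are not the settings used in the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def triplet_loss(anchors, positives, negatives, margin):
    """Sum of hinge terms [d(a, p) - d(a, n) + margin]_+ over all triplets."""
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        total += max(0.0, sq_dist(a, p) - sq_dist(a, n) + margin)
    return total


# Two toy triplets on the unit circle; the margin 0.5 is a placeholder.
anchors = [[1.0, 0.0], [1.0, 0.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]   # second positive is far from its anchor
negatives = [[0.0, 1.0], [1.0, 0.0]]   # second negative coincides with its anchor
print(triplet_loss(anchors, positives, negatives, margin=0.5))  # 2.5
```

The first triplet already satisfies the margin and contributes nothing, while the second is violated and contributes 2.0 + 0.5.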

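ECAML(N-Pair): For the N-tuple $T$ and inner-product measurement $D$, we employ [Sohn2016]:

$$L_m(\theta_f;T,D)=\sum_{i=1}^{N}\log\left(1+\sum_{j=1,\,y_j\neq y_i}^{N}\exp\left(x_i^\top x_j-x_i^\top x_i^+\right)\right) \tag{9}$$

where the objective requires the inner product of each negative pair, $x_i^\top x_j$, to be smaller than that of the positive pair, $x_i^\top x_i^+$.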
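Eq. 9 can be sketched in a few lines. The version below follows the equation literally, taking as negatives for anchor $x_i$ all other anchors $x_j$ in the batch with a different label; this batch layout, like the function names, is an assumption made for illustration.

```python
import math


def dot(a, b):
    """Inner product between two vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def n_pair_loss(anchors, positives, labels):
    """Sum over anchors of log(1 + sum_j exp(x_i.x_j - x_i.x_i+)), y_j != y_i."""
    total = 0.0
    for i, (x_i, x_pos) in enumerate(zip(anchors, positives)):
        pos_score = dot(x_i, x_pos)
        neg_sum = sum(
            math.exp(dot(x_i, x_j) - pos_score)
            for j, x_j in enumerate(anchors)
            if labels[j] != labels[i]
        )
        total += math.log1p(neg_sum)
    return total


# One (anchor, positive) pair per class, as in an N-pair batch.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
loss = n_pair_loss(anchors, positives, labels)
```

Here each anchor has positive score 1 and negative score 0, so the loss is $2\log(1+e^{-1})$.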

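ECAML(Binomial): For the contrastive-tuple $T$ and cosine measurement $D$, we employ [Yi et al.2014, Opitz et al.2017]:

$$L_m(\theta_f;T,D)=\sum_{i,j}\log\left(1+e^{-(2s_{ij}-1)\alpha(D_{ij}-\beta)\eta_{ij}}\right) \tag{10}$$

where $s_{ij}=1$ if $x_i, x_j$ are from the same class and $s_{ij}=0$ otherwise, $\alpha$ and $\beta$ are the scaling and translation parameters respectively, $\eta_{ij}$ is the penalty coefficient, whose value depends on whether $s_{ij}=1$, and $D_{ij}$ is the cosine similarity between $x_i$ and $x_j$.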

Moreover, for numerical stability, we extend our EC to a logarithmic counterpart and thus Eq.7 becomes:

$$\min_{\theta_f}L=L_m(\theta_f;T,D)+\lambda\sum_{I,J,I\neq J}\log\left(1+L_{ec}(\theta_f;X_I,X_J)\right) \tag{11}$$

Discussion: From Eq.11, our ECAML is achieved by jointly training the conventional metric objective and the proposed Energy Confusion objective. These two terms form an adversarial learning scheme by optimizing opposite objective functions. Specifically, $L_m$ acts as a 'defender' and $L_{ec}$ acts as an 'attacker': the attacker intends to confuse the metric so that it confounds the training data, while in order to correctly distinguish the training data, the defender has to learn more 'general' and complementary concepts. As this defending-attacking game proceeds, the learned embedding becomes less prone to prejudiced concepts, which prevents the biased learning behavior and improves the generalization ability. Moreover, we experimentally find that the overfitting mainly appears at the fc layer, so our EC term is only used to constrain the learning of the fc layer.
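The combined objective of Eq. 11 can be sketched as follows. The example assumes the minibatch has already been grouped by class label, takes the metric loss as a precomputed number, and sums the logarithmic EC term over unordered class pairs (summing over ordered pairs would only rescale $\lambda$). The helper names and the value of `lam` are hypothetical.

```python
import math
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def energy_confusion(feats_i, feats_j):
    """Mean squared distance over all cross-class pairs (Eq. 3, uniform p_ij)."""
    total = sum(sq_dist(x, y) for x in feats_i for y in feats_j)
    return total / (len(feats_i) * len(feats_j))


def ecaml_objective(metric_loss, feats_by_class, lam):
    """Metric loss plus lam * sum of log(1 + EC) over class pairs (Eq. 11)."""
    reg = 0.0
    for ci, cj in combinations(sorted(feats_by_class), 2):
        reg += math.log1p(energy_confusion(feats_by_class[ci], feats_by_class[cj]))
    return metric_loss + lam * reg


# Toy minibatch with two classes; metric_loss would come from Eq. 8, 9 or 10.
batch = {0: [[0.0, 0.0], [2.0, 0.0]], 1: [[0.0, 1.0], [2.0, 1.0]]}
total = ecaml_objective(metric_loss=1.0, feats_by_class=batch, lam=0.1)
```

In a real training loop both terms would be differentiated with respect to the embedding parameters; this sketch only shows how the scalar objective is assembled.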

## 5. Experiments and Results

Implementation details: Following many other works, e.g. [Oh Song et al.2016, Sohn2016], we choose the pretrained GooglenetV1 [Szegedy et al.2014] as our backbone CNN and add a randomly initialized fully connected layer. If not specified, we set the embedding size to 512 throughout our experiments. We also adopt exactly the same data preprocessing method [Oh Song et al.2016] so as to make fair comparisons with other works (only the images in the CARS dataset are preprocessed differently; see the details underneath Tab.4). For training, the optimizer is Adam [Kingma and Ba2014] with weight decay, and the number of training iterations is set separately for CUB, for CARS, and for Stanford Online Products and In-Shop. The new fc-layer is optimized with 10 times the learning rate for fast convergence. Moreover, for fair comparison, we use minibatches of size 128 throughout our experiments, each composed of 64 randomly selected classes with two instances per class. Our work is implemented in Caffe [Jia et al.2014].

Evaluation and datasets: As in many other works, the retrieval performance is evaluated by the Recall@K metric. Following [Oh Song et al.2016], we evaluate the clustering performance via normalized mutual information (NMI) and the F$_1$ metric. The input of NMI is a set of clusters $\Omega=\{\omega_1,\dots,\omega_K\}$ and the ground truth classes $\mathbb{C}=\{c_1,\dots,c_K\}$, where $\omega_i$ represents the samples that belong to the $i$-th cluster, and $c_j$ is the set of samples with label $j$. NMI is defined as the ratio of the mutual information and the mean entropy of the clusters and the ground truth, NMI$(\Omega,\mathbb{C})=\frac{2I(\Omega;\mathbb{C})}{H(\Omega)+H(\mathbb{C})}$, and the F$_1$ metric is the harmonic mean of precision $P$ and recall $R$, F$_1=\frac{2PR}{P+R}$. Our ECAML is then evaluated over the widely used benchmarks with the standard zero-shot evaluation protocol [Oh Song et al.2016]:

1. CARS [Krause et al.2013] contains 16,185 car images from 196 classes. We use the first 98 classes for training (8,054 images) and the remaining 98 classes for testing (8,131 images).

2. CUB [Wah et al.2011] includes 11,788 bird images from 200 classes. We use the first 100 classes for training (5,864 images) and the remaining 100 classes for testing (5,924 images).

3. Stanford Online Products [Oh Song et al.2016] has 11,318 classes for training (59,551 images) and the other 11,316 classes for testing (60,502 images).

4. In-Shop [Liu et al.2016] contains 3,997 classes for training (25,882 images) and the remaining 3,985 classes for testing (26,830 images). The test set is partitioned into a query set of 3,985 classes (14,218 images) and a retrieval database set of 3,985 classes (12,612 images).
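The Recall@K and NMI metrics described above can be sketched in plain Python as follows. This is an illustrative version, assuming every test embedding is used as a query against all other test embeddings and that cluster assignments are already available; the function names are hypothetical, and the official evaluation code of [Oh Song et al.2016] may differ in details such as the In-Shop query/database split.

```python
import math
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def recall_at_k(embeddings, labels, k):
    """Fraction of queries with a same-class item among their k nearest neighbours."""
    n = len(embeddings)
    hits = 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: sq_dist(embeddings[i], embeddings[j]))
        if any(labels[j] == labels[i] for j in others[:k]):
            hits += 1
    return hits / n


def entropy(assignments):
    """Shannon entropy of a list of discrete assignments."""
    n = len(assignments)
    return -sum((c / n) * math.log(c / n) for c in Counter(assignments).values())


def nmi(clusters, classes):
    """NMI = 2 I(clusters; classes) / (H(clusters) + H(classes))."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    size_cluster = Counter(clusters)
    size_class = Counter(classes)
    mutual = sum(
        (c / n) * math.log(c * n / (size_cluster[a] * size_class[b]))
        for (a, b), c in joint.items()
    )
    denom = entropy(clusters) + entropy(classes)
    return 2.0 * mutual / denom if denom > 0 else 1.0


embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels = ["a", "a", "b", "b"]
r1 = recall_at_k(embeddings, labels, k=1)
score = nmi([0, 0, 1, 1], labels)
```

With these well-separated toy embeddings, every query finds a same-class neighbour first and the clustering matches the labels exactly, so both metrics reach their maximum.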

### 5.1 Ablation Experiments

We show the primary results below; the qualitative analysis (embedding visualization) is provided in the Supplementary.

Regularization ability: To demonstrate the regularization ability of our ECAML, we plot the R@1 retrieval curves on the training (seen) and testing (unseen) sets in Fig.2. For example, from the figures in the left column, one can observe that the training curve of the conventional Triplet method rises quickly to a relatively high level, but its testing curve rises only a little at first and then drops to quite a low level, showing that the metric learned by the conventional Triplet method is more likely to overfit the seen classes and generalizes worse to the unseen classes in zero-shot settings. Conversely, with our ECAML(Tri), the training curve rises much more slowly than that of the original Triplet and stops rising at a relatively lower level, whereas the testing curve of ECAML(Tri) rises quickly to quite a high level. This implies that ECAML(Tri) indeed serves as a regularization method and improves the generalization ability of the learned metric by suppressing the learning of the biased metric over seen classes caused by the 'biased learning behavior'. A similar phenomenon can be observed for ECAML(N-Pair, Binomial).

Ablation experiments on $\lambda$: To show the effect of the parameter $\lambda$, for simplicity we only report the results of ECAML(Tri, N-Pair, Binomial) with different $\lambda$ on the CARS benchmark, as in Tab.1. It can be observed that when $\lambda=0$ our ECAML degenerates into the corresponding conventional metric learning method and the performance is unsatisfactory. As $\lambda$ increases, the performances of ECAML(Tri, N-Pair, Binomial) peak at their respective best values of $\lambda$ and outperform the baselines (Triplet, N-Pair, Binomial) by a large margin, validating the effectiveness and importance of our ECAML.

Ablation experiments on embedding size: We also conduct quantitative experiments on the embedding size with ECAML(Binomial). From Tab.2, it can be observed that for the conventional Binomial metric learning method, most of the evaluation metrics (e.g. R@4, R@8, NMI and F$_1$) do not increase with the embedding size (from 128-dim to 512-dim) and even show a decreasing trend. This shows that the risk of overfitting increases with the feature size, and that without robustness learning the performance of the learned embedding cannot be guaranteed even though its theoretical representation ability increases with the feature size. In contrast, by employing our ECAML, the performance is consistently improved and indeed increases with the embedding size, demonstrating the importance and superiority of robust metric learning in ZSRC tasks.

Ablation study on regularization method: Some other research works aim at imposing regularization at the top layer of the network, such as label-smoothing [Szegedy et al.2016], label-disturbing [Xie et al.2016] and Noisy-Softmax [Chen, Deng, and Du2017]. However, these methods are all designed for the Softmax classifier layer and cannot be applied to metric learning methods. Therefore, in order to show the effectiveness of our ECAML in the metric learning framework, we compare it with the commonly used 'Dropout' method. The dropout layer is placed after the CNN model. From Tab.3, one can observe that although dropout improves most of the performances over the baseline, the improvements are limited. In contrast to Dropout, our ECAML surpasses the baseline model by a large margin. We conjecture that this is because Dropout is not specially designed for metric learning and the tested datasets are all fine-grained, so simply setting neurons to zero largely affects the estimated distributions of these fine-grained classes regardless of the ratio value, due to the small inter-class variations (for example, the performance is still reduced when a smaller ratio is used). In summary, our ECAML regularization method is specially designed for deep metric learning and indeed performs well.

### 5.2 Comparison with State-of-the-art

To highlight the significance of our ECAML framework, we compare it with the aforementioned corresponding baseline methods, i.e. the widely used Triplet [Schroff2015], N-Pair [Sohn2016] and Binomial [Yi et al.2014], and also with other SOTA methods. The experimental results on CUB, CARS, Stanford Online Products and In-Shop are in Tab.4-Tab.7 respectively; bold numbers indicate improvement over the baseline method, and red and blue numbers show the best and second best results respectively. From these tables, one can observe that our ECAML consistently improves the performance of the original metric learning methods (i.e. Triplet, N-Pair and Binomial) on all the benchmark datasets by a large margin, demonstrating the necessity of explicitly enhancing the generalization ability of the learned metric and validating the universality and effectiveness of our ECAML. Furthermore, our ECAML(Binomial) also surpasses all the listed state-of-the-art approaches. In summary, learning 'general' concepts by avoiding the biased learning behavior is important in ZSRC tasks, and the generalization ability of the optimized metric heavily affects the performance of conventional metric learning methods.

## 6. Conclusion

In this paper, we propose the Energy Confused Adversarial Metric Learning (ECAML) framework for ZSRC tasks, a method generally applicable to various conventional metric learning approaches, which explicitly strengthens the generalization ability of the learned embedding with the help of our Energy Confusion term. Extensive experiments on the popular ZSRC benchmarks (CUB, CARS, Stanford Online Products and In-Shop) demonstrate the significance and necessity of our idea of learning metrics with good generalization by energy confusion.

Acknowledgments: This work was partially supported by the National Natural Science Foundation of China under Grant Nos. 61573068 and 61871052, Beijing Nova Program under Grant No. Z161100004916088, and supported by BUPT Excellent Ph.D. Students Foundation CX2019307.