Learning where to look: Semantic-Guided Multi-Attention Localization for Zero-Shot Learning

03/01/2019 · Yizhe Zhu et al. · Binghamton University, Rutgers University

Zero-shot learning extends conventional object classification to unseen-class recognition by introducing semantic representations of classes. Existing approaches predominantly focus on learning the proper mapping function for visual-semantic embedding, while neglecting the effect of learning discriminative visual features. In this paper, we study the significance of discriminative region localization. We propose a semantic-guided multi-attention localization model, which automatically discovers the most discriminative parts of objects for zero-shot learning without any human annotations. Our model jointly learns cooperative global and local features from the whole object as well as the detected parts to categorize objects based on semantic descriptions. Moreover, with the joint supervision of an embedding softmax loss and a class-center triplet loss, the model is encouraged to learn features with high inter-class dispersion and intra-class compactness. Through comprehensive experiments on three widely used zero-shot learning benchmarks, we show the efficacy of the multi-attention localization, and our proposed approach improves the state-of-the-art results by a considerable margin.







1 Introduction

Deep convolutional neural networks have achieved significant advances in object recognition. The main shortcoming of deep learning methods is the inevitable requirement of large-scale labeled training data, which must be collected and annotated by costly human labor. Although images of ordinary objects can be readily found, a tremendous number of objects still have insufficient and sparse visual data [Zhu et al.2018]. This has drawn many researchers' interest in recognizing objects with few or even no training samples, problems known as few-shot learning and zero-shot learning, respectively.

Zero-shot learning mimics the human ability to recognize objects only from a description in terms of concepts in some semantic vocabulary [Morgado and Vasconcelos2017]. The underlying key is to learn the association between visual representations and semantic concepts, and to use this learned association to extend recognition to unseen objects. Broadly, the common scheme of state-of-the-art zero-shot learning approaches is (1) to extract feature representations of visual data from CNN models pretrained on a large-scale dataset (e.g., ImageNet), and (2) to learn mapping functions that project the visual features and semantic representations into a shared space. The mapping functions are optimized either by a ridge regression loss [Zhang et al.2017, Kodirov et al.2017] or by a ranking loss on the compatibility scores of the two mapped features [Akata et al.2016b, Xian et al.2016]. Taking advantage of the success of generative models (e.g., GAN [Goodfellow et al.2014], VAE [Kingma and Welling2013]) in data generation, several recent methods [Zhu et al.2018, Xian et al.2018b] resort to hallucinating visual features of unseen classes, converting zero-shot learning into a conventional object recognition problem.

Figure 1: The demonstration of the localized discriminative regions for “blue jay” in CUB and “antelope” in AwA. Two attention maps are produced by our SGMA model. The part patches are located and marked with red and blue bounding boxes.

All aforementioned methods neglect the significance of discriminative visual feature learning. Since the CNN models are pretrained on a traditional object recognition task, the extracted features may not be representative enough for the zero-shot learning task. Especially in fine-grained scenarios, features learned from coarse object recognition can hardly capture the subtle differences between classes. Although several recent works [Morgado and Vasconcelos2017, Li et al.2018] address this problem in an end-to-end manner capable of discovering more distinctive visual information suitable for zero-shot recognition, they still extract only the global visual feature of the whole image, without considering the effect of discriminative part regions. We argue that there are multiple discriminative part regions that are key to recognizing objects, especially fine-grained ones. For instance, the head and the tail are crucial to distinguish bird species. To capture such discriminative regions, we propose a semantic-guided attention localization model to pinpoint where the most significant parts are. Compactness and diversity losses on the multiple attention maps are proposed to encourage each attention map to be compact around its most crucial region while remaining divergent across different attention maps.

We combine the whole image and the multiple discovered regions to provide a richer visual expression, and learn global and local visual features (i.e., image features and region features) for a visual-semantic embedding model that is trained in an end-to-end fashion. In the zero-shot learning scenario, the embedding softmax loss [Morgado and Vasconcelos2017, Li et al.2018] is used by embedding the class semantic representations into a multi-class classification framework. However, the softmax loss only encourages the inter-class separability of features; the resulting features are not sufficient for recognition tasks [Wen et al.2016].

To encourage high intra-class compactness, the class-center triplet loss [Wang et al.2017] assigns an adaptive "center" to each class and forces learned features to be closer to the "center" of the corresponding class than to those of other classes. In this paper, we employ both the embedding softmax loss and the class-center triplet loss as the supervision of feature learning. We argue that these cooperative losses can efficiently enhance the discriminative power of the learned features.

To the best of our knowledge, this is the first work to jointly optimize multi-attention localization with global and local feature learning for zero-shot learning tasks in an end-to-end fashion. Our main contributions are summarized as follows:

  • We present a weakly-supervised multi-attention localization model optimized for zero-shot recognition, which jointly discovers the crucial regions and learns feature representations under the guidance of semantic descriptions.

  • We propose a multi-attention loss to encourage compact and diverse attention distribution by applying geometric constraints over attention maps.

  • We jointly learn global and local features under the supervision of embedding softmax loss and class-center triplet loss to provide an enhanced visual representation for zero-shot recognition.

  • We conduct extensive experiments and analysis on three ZSL datasets and demonstrate the excellent performance of our proposed method on both part detection and zero-shot learning.

2 Related Work

Zero-Shot Learning Methods While several early works on zero-shot learning [Lampert et al.2014] make use of attributes as intermediate information to infer the label of an image, the current majority of zero-shot learning approaches treat the problem as one of visual-semantic embedding. A bilinear compatibility function between the images and the attribute space is learned using the ranking loss in ALE [Akata et al.2016b] or the ridge regression loss in ESZSL [Romera-Paredes and Torr2015]. Some other zero-shot learning approaches learn non-linear multi-modal embeddings. LatEm [Xian et al.2016] learns a piecewise linear model by selecting among multiple learned linear mappings. DEM [Zhang et al.2017] presents a deep zero-shot learning model that raises expressive power by adding the non-linear ReLU activation.

More related to our work, several end-to-end learning methods have been proposed to address the pitfall that discriminative feature learning is neglected. SCoRe [Morgado and Vasconcelos2017] combines two semantic constraints to supervise attribute prediction and visual-semantic embedding respectively. LDF [Li et al.2018] takes one step further and integrates a zoom network into the model to discover significant regions automatically and learn discriminative visual feature representations. But the zoom mechanism can only discover the whole object by cropping out the background with a square shape, and is thus still restricted to the global feature. In contrast, our multi-attention localization network can find multiple finer part regions (e.g., head, tail) that are discriminative for zero-shot learning.

Figure 2: The framework of the proposed Semantic-Guided Multi-Attention localization model (SGMA). The model takes as input the original image and produces n part attention maps (here n=2). The multi-attention loss keeps the attention areas compact within each map and divergent across maps. The part images from the cropping subnet and the original images are fed into the joint feature learning subnet for object recognition guided by semantic descriptions.

Multi-Attention Localization Several previous methods leverage extra annotations of part bounding boxes to localize significant regions for fine-grained zero-shot recognition [Akata et al.2016a, Elhoseiny et al.2017, Zhu et al.2018].  [Akata et al.2016a] straightforwardly extracts part features by feeding annotated part regions into a CNN pretrained on ImageNet.  [Elhoseiny et al.2017, Zhu et al.2018] train a multiple-part detector with groundtruth annotations to produce the bounding boxes of parts and learn part features with conventional recognition tasks. However, the heavy involvement of human labor for part annotations makes such tasks costly in real large-scale problems. Therefore, learning part attention in a weakly supervised way is desirable in the zero-shot recognition scenario. Recently, several attention localization models have been presented in the fine-grained classification scenario.  [Wang et al.2015, Zhang et al.2016b] learn a set of part detectors by analyzing filter responses that consistently react to specific patterns. Spatial transformer [Jaderberg et al.2015] proposes a dynamic mechanism that actively spatially transforms an image. Our work differs in three aspects: (1) we learn part attention models from convolutional channel responses; (2) instead of being supervised by the classification loss, our model discovers the parts under semantic guidance, making the located parts more discriminative for zero-shot learning; (3) the zero-shot recognition model and the attention localization model are trained jointly to ensure that part localization is optimized for zero-shot object recognition.

3 Method

We start by introducing some notations and the problem definition.

Assume there are N labeled instances {(x_i, y_i, a_{y_i})}_{i=1}^{N} from seen classes as training data, where x_i denotes the image, y_i ∈ Y_s is the corresponding class label, and a_{y_i} represents the semantic representation of the corresponding class. Given an image x from the unseen classes and a set of semantic representations {a_c^u}_{c=1}^{C_u} of the unseen classes, where C_u denotes the number of unseen classes, the task of zero-shot learning is to predict the class label y^u ∈ Y_u of the image, where Y_s and Y_u are disjoint.

The framework of our approach is demonstrated in Figure 2. It consists of three modules: the multi-attention subnet, the region cropping subnet, and the joint feature learning subnet. The multi-attention subnet generates multiple attention maps corresponding to distinct parts of the object. The region cropping subnet crops the discriminative parts with differentiable operations. The joint feature learning subnet takes as input the cropped parts and the original image, and learns the global and local visual features for the final zero-shot recognition. The whole model is trained in an end-to-end fashion.

3.1 Multi-Attention Subnet

LDF [Li et al.2018] presents a cascaded zooming mechanism to gradually localize the object-centric region while cropping out background noise. Different from LDF, our method considers multiple finer discriminative areas, which provide richer cues for object recognition. Our method starts with the multi-attention subnet to produce attention maps.

As shown in Figure 2, the input images first pass through the convolutional network backbone to extract feature maps F of size W × H × C. An attention weight vector a^i over the channels is then obtained for each attended part i based on the extracted feature maps. The attention map M^i is finally obtained as the weighted sum of the feature maps over channels with the attention weight vector a^i. To encourage different attention maps to discover different discriminating regions, we design compactness and diversity losses. Details will be shown later.

To be specific, the channel descriptor z ∈ R^C, encoding the global spatial information, is first obtained by applying global average pooling on the extracted feature maps. Formally, the features are shrunk through their spatial dimensions W × H. The c-th element of z is calculated by:

z_c = (1 / (W · H)) Σ_{p=1}^{W} Σ_{q=1}^{H} f_c(p, q),    (1)

where f_c is the feature map in the c-th channel. To make use of the information in the channel descriptor z, we follow it with stacked fully connected layers to fully capture the channel-wise dependencies of each part. A sigmoid activation is then employed to play the role of a gating mechanism. Formally, the channel-wise attention weight a^i for part i is obtained by

a^i = σ(W_2^i δ(W_1^i z)),    (2)

where δ refers to the ReLU activation function, σ is the sigmoid function, and W_1^i, W_2^i are the weights of the stacked FC layers. a^i can be considered as the soft-attention weight of the channels associated with the i-th part. As discovered in many works [Zhou et al.2016, Selvaraju et al.2017], each channel of features focuses on a certain pattern or a certain part of the object. Ideally, our model aims to assign high weights (i.e., a_c^i close to 1) to channels that are associated with a certain part, while giving low weights to channels irrelevant to that part. We apply unsupervised k-means clustering to group channels based on their peak activation positions and initialize a^i with pseudo labels generated by the clustering. Details are shown in Appendix A.

The attention map M^i for part i is then generated by the weighted sum of all channels followed by the sigmoid activation:

M^i = sigmoid( Σ_{c=1}^{C} a_c^i f_c ),    (3)

where the subscript c indexes the channel and C is the number of channels. For brevity, we omit the part index i in the rest of the paper where it is clear from context. With the sigmoid activation, the attention map serves as the gating mechanism in a soft-attention scheme, forcing the network to focus on the discriminative parts.
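The channel-descriptor, gating, and attention-map computation above can be sketched in NumPy as follows; the FC weights w1 and w2, the reduction ratio r, and all shapes are illustrative assumptions rather than the paper's actual configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def part_attention_map(feats, w1, w2):
    """Channel attention weights and attention map for one part.

    feats: (C, W, H) feature maps from the backbone.
    w1, w2: weights of the two stacked FC layers (hypothetical shapes
            (C//r, C) and (C, C//r) with reduction ratio r).
    """
    C = feats.shape[0]
    # Channel descriptor via global average pooling (Eq. 1).
    z = feats.reshape(C, -1).mean(axis=1)            # (C,)
    # Gating: sigmoid(W2 * ReLU(W1 * z)) (Eq. 2).
    a = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))        # (C,)
    # Attention map: sigmoid of channel-weighted sum (Eq. 3).
    M = sigmoid(np.tensordot(a, feats, axes=1))      # (W, H)
    return a, M

# Toy usage with random feature maps and weights.
rng = np.random.default_rng(0)
C, W, H, r = 8, 4, 4, 2
feats = rng.normal(size=(C, W, H))
w1 = rng.normal(size=(C // r, C))
w2 = rng.normal(size=(C, C // r))
a, M = part_attention_map(feats, w1, w2)
```

In the full model one such gating branch exists per part, so n parts yield n attention maps from the same backbone features.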

3.1.1 Multi-Attention Objective

To enable our model to discover diverse regions over the attention maps, we design the multi-attention loss by applying geometric constraints. The proposed loss consists of two components:

L_MA = L_cpt + λ L_div,    (4)

where L_cpt and L_div are the compactness and diversity losses with a balance factor λ, summed over the n attention maps. Ideally, we want each attention map to concentrate around its peak position rather than being dispersed. The ideal concentrated attention map M̃^i for part i is created as a Gaussian blob whose peak coincides with the peak activation of the attention map M^i. Let x be a position in the attention map, and X be the set of all positions. The compactness loss is defined using the following ℓ2 loss:

L_cpt = (1 / |X|) Σ_i Σ_{x∈X} ( M^i(x) − M̃^i(x) )²,    (5)

where M^i(x) and M̃^i(x) denote the generated attention map and the ideal concentrated attention map at location x for part i respectively, and |X| denotes the size of the attention maps. This heatmap regression loss has been widely used in human pose estimation scenarios to localize keypoints [Bulat and Tzimiropoulos2016, Wei et al.2016].

Intuitively, we also want the attention maps to attend to different discriminative parts. For example, one map attends to the head while another attends to the tail. To fulfill this goal, we design a diversity loss to encourage divergent attention distributions across different attention maps. Formally, it is formulated as:

L_div = Σ_i Σ_{x∈X} M^i(x) · max(0, max_{j≠i} M^j(x) − mrg),    (6)

where max_{j≠i} M^j(x) represents the maximum of the other attention maps at location x and mrg denotes a margin. The max-margin design here makes the loss less sensitive to noise and improves robustness. The motivation of the diversity loss is that when the activation at a particular position in one attention map is high, the loss prefers lower activations of the other attention maps at the same position. From another perspective, Σ_{x∈X} M^i(x) M^j(x) can be roughly considered as the inner product of two flattened attention maps, which measures their similarity.
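A minimal NumPy sketch of the multi-attention loss described above; the Gaussian blob width sigma is a hand-picked illustrative value, since the exact blob shape is not specified here:

```python
import numpy as np

def gaussian_blob(shape, peak, sigma=1.5):
    """Ideal concentrated attention map: a Gaussian centered at `peak`."""
    ys, xs = np.indices(shape)
    return np.exp(-((ys - peak[0]) ** 2 + (xs - peak[1]) ** 2)
                  / (2.0 * sigma ** 2))

def multi_attention_loss(maps, lam=1.0, margin=0.2):
    """Compactness + diversity loss over n attention maps of shape (n, W, H)."""
    n = maps.shape[0]
    # Compactness: L2 regression toward a Gaussian at each map's peak.
    l_cpt = 0.0
    for M in maps:
        peak = np.unravel_index(np.argmax(M), M.shape)
        l_cpt += np.mean((M - gaussian_blob(M.shape, peak)) ** 2)
    # Diversity: penalize positions where the other maps are also high.
    l_div = 0.0
    for i in range(n):
        others = np.max(np.delete(maps, i, axis=0), axis=0)
        l_div += np.sum(maps[i] * np.maximum(0.0, others - margin))
    return l_cpt + lam * l_div

rng = np.random.default_rng(1)
maps = rng.uniform(size=(2, 6, 6))   # two toy attention maps
loss = multi_attention_loss(maps)
```

Both terms are non-negative by construction, so driving the loss toward zero simultaneously sharpens each map around its peak and separates the maps spatially.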

3.2 Region Cropping Subnet

With the attention maps in hand, the region could be directly cropped with a square centered at the peak value of each attention map. However, it is hard to optimize such a non-continuous cropping operation with backward propagation. To make our model end-to-end trainable, we design a cropping network to approximate region cropping. Specifically, assuming a square part region for computational efficiency, our cropping network takes as input the attention maps from the multi-attention subnet and outputs three parameters:

[t_x, t_y, t_s] = g(M),    (7)

where g denotes the cropping network and consists of two FC layers, t_x and t_y represent the x-axis and y-axis coordinates of the square center respectively, and t_s is the side length of the square. Inspired by [Fu et al.2017], we produce a two-dimensional continuous boxcar mask V:

V(p, q) = [σ_k(p − t_x + 0.5 t_s) − σ_k(p − t_x − 0.5 t_s)] · [σ_k(q − t_y + 0.5 t_s) − σ_k(q − t_y − 0.5 t_s)],    (8)

where σ_k(t) = 1 / (1 + exp(−kt)) is a sigmoid with a large scale factor k, so that V is close to 1 inside the square and close to 0 outside. We obtain the cropped region x_part^i by element-wise multiplication between the original image x and the continuous mask V^i:

x_part^i = x ⊙ V^i,    (9)

where i is the index of parts.
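The differentiable boxcar mask can be sketched as follows; the image size, crop parameters, and clipping bounds are toy values chosen for illustration:

```python
import numpy as np

def boxcar_mask(W, H, tx, ty, ts, k=10.0):
    """Differentiable approximation of a square crop.

    sigma_k(t) = 1 / (1 + exp(-k t)) approaches a step function as k grows,
    so the product of sigmoid differences approaches a hard box indicator.
    Clipping the exponent avoids floating-point overflow.
    """
    sig = lambda t: 1.0 / (1.0 + np.exp(np.clip(-k * t, -60.0, 60.0)))
    xs = np.arange(W, dtype=float)
    ys = np.arange(H, dtype=float)
    mx = sig(xs - (tx - 0.5 * ts)) - sig(xs - (tx + 0.5 * ts))  # (W,)
    my = sig(ys - (ty - 0.5 * ts)) - sig(ys - (ty + 0.5 * ts))  # (H,)
    return np.outer(mx, my)                                     # (W, H)

# A 10-pixel square crop centered at (16, 16) of a 32x32 grid.
V = boxcar_mask(32, 32, tx=16.0, ty=16.0, ts=10.0)
```

Multiplying the image by V keeps gradients flowing to (t_x, t_y, t_s), which a hard indexing crop would not allow.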

3.3 Joint Feature Learning Subnet

To provide enhanced visual representations of images for zero-shot learning, we jointly learn the global and local visual features given the original image and part images produced by the region cropping subnet.

As shown in Figure 2, the original image and part patches are resized to 224 × 224 and fed into separate CNN backbone networks (with the identical VGG19 architecture). The convolutional layers are followed by global average pooling to obtain the visual feature vector θ(x).

To learn discriminative features for the zero-shot learning task, we employ two cooperative losses: the embedding softmax loss L_sfx and the class-center triplet loss L_cct. The former encourages a higher inter-class distinction, while the latter forces the learned features of each class to be concentrated with a lower intra-class divergence.

Embedding Softmax Loss

Let φ(y) denote the semantic feature of class y. The compatibility score of the multi-modal features is defined as s = θ(x)ᵀ W φ(y), where W is a trainable transform matrix. If the compatibility scores are considered as logits in a softmax, the embedding softmax loss can be given by:

L_sfx = −(1/N) Σ_{i=1}^{N} log [ exp(s_{i, y_i}) / Σ_c exp(s_{i, c}) ],    (10)

where s_{i,c} = θ(x_i)ᵀ W φ(y_c), y_i is the ground-truth class of x_i, and N is the number of training samples.

In order to combine the global and local features without increasing the complexity of the model, we adopt a late fusion strategy. The overall compatibility scores are obtained by summing up the compatibility scores from each CNN and are used to perform the softmax loss. Note that this strategy significantly reduces the number of parameters of the network by discarding the additional dimension reduction layer (i.e., an FC layer) after feature concatenation used in [Li et al.2018]. Formally, we substitute s in Eq. 10 with s̄ = Σ_k s_k, where s_k = θ_k(x)ᵀ W_k φ(y) and k indexes the part images and the original image.
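A toy NumPy sketch of the late-fusion embedding softmax loss; the number of streams, the feature dimensions, and the log-sum-exp stabilization are illustrative choices:

```python
import numpy as np

def embedding_softmax_loss(feats, sems, W_mats, labels):
    """Late-fusion embedding softmax loss.

    feats: list of (N, d_v) visual features, one per stream (parts + image).
    sems: (C, d_s) semantic vectors of the seen classes.
    W_mats: list of (d_v, d_s) trainable transform matrices, one per stream.
    labels: (N,) ground-truth class indices.
    """
    # Overall compatibility: sum of per-stream scores theta_k(x)^T W_k phi(y).
    scores = sum(f @ Wm @ sems.T for f, Wm in zip(feats, W_mats))  # (N, C)
    # Numerically stable log-softmax cross-entropy.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_p = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(2)
N, d_v, d_s, Cn = 5, 16, 8, 4
feats = [rng.normal(size=(N, d_v)) for _ in range(3)]        # 2 parts + image
W_mats = [rng.normal(size=(d_v, d_s)) * 0.1 for _ in range(3)]
sems = rng.normal(size=(Cn, d_s))
labels = rng.integers(0, Cn, size=N)
loss = embedding_softmax_loss(feats, sems, W_mats, labels)
```

Summing the per-stream scores before the softmax is what lets the model drop the extra dimension-reduction layer that concatenation-based fusion would require.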

Class-Center Triplet Loss

The class-center triplet loss was originally designed to minimize the intra-class distances of deep visual features in face recognition tasks. In our case, we jointly train the network with the class-center triplet loss to encourage the intra-class compactness of features. Let y_i be the class index of sample i; the loss is formulated as:

L_cct = (1/N) Σ_{i=1}^{N} Σ_{j ≠ y_i} max(0, m + ||f̄_i − c_{y_i}||² − ||f̄_i − c_j||²),    (11)

where m is the margin, f̄_i is the mapped visual feature in the semantic feature space (i.e., f̄_i = norm(Wᵀ θ(x_i))), c_y denotes the "center" of each class, which are trainable parameters, and norm(·) means the ℓ2 normalization operation. The normalization operation is involved to make the feature points lie on the surface of a unit sphere, which eases setting a proper margin. Moreover, the class-center triplet loss exempts the necessity of triplet sampling in the naive triplet loss.
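A direct (loop-based, unoptimized) NumPy sketch of the class-center triplet loss, under the assumption that both features and centers are ℓ2-normalized before the distance computation:

```python
import numpy as np

def class_center_triplet_loss(mapped, labels, centers, m=0.8):
    """Class-center triplet loss on L2-normalized features.

    mapped: (N, d) visual features already mapped into the semantic space.
    labels: (N,) ground-truth class indices.
    centers: (C, d) trainable per-class "centers".
    """
    f = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    loss, N = 0.0, len(f)
    for i in range(N):
        # Squared distance to the own-class center (the "positive").
        d_pos = np.sum((f[i] - c[labels[i]]) ** 2)
        for j in range(len(c)):
            if j != labels[i]:
                # Every other class center acts as a "negative".
                d_neg = np.sum((f[i] - c[j]) ** 2)
                loss += max(0.0, m + d_pos - d_neg)
    return loss / N

rng = np.random.default_rng(3)
mapped = rng.normal(size=(6, 8))
labels = np.array([0, 0, 1, 1, 2, 2])
centers = rng.normal(size=(3, 8))
loss = class_center_triplet_loss(mapped, labels, centers)
```

Because the negatives are the learned class centers rather than sampled instances, no triplet mining is needed, matching the motivation stated above.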

Overall, the proposed SGMA model is trained in an end-to-end manner with the objective:

L = L_sfx + λ_1 L_cct + λ_2 L_MA,    (12)

where the balance factors λ_1 and λ_2 are consistently set to 1 in all experiments.

4 Inference from SGMA model

We provide two ways to infer the labels of unseen-class images from the SGMA model. The first is to straightforwardly choose the class label with the maximal overall compatibility score, as the green path in Figure 2. An alternative way is to utilize the features learned in the class-center branch, as the purple path in Figure 2. We employ the inner product to measure the similarities between the feature f̄ of the test image and the prototypes P_u of the unseen classes, which can be obtained by the following steps. The prototypes P_s of the seen classes are obtained by averaging the features of all images of each class. We assume the semantic descriptions of unseen classes can be represented by a linear combination of those of seen classes. Let W_c be the weight matrix of such a combination; W_c can be obtained by solving the ridge regression:

W_c = argmin_W ||A_u − W A_s||² + γ ||W||²,    (13)

where A_u and A_s are the semantic matrices of the unseen and seen classes, with each row being the semantic vector of a class. Equipped with the learned W_c describing the relationship between the semantic vectors of seen and unseen classes, we can obtain the prototypes for the unseen classes by applying the same combination: P_u = W_c P_s.
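The ridge regression has a closed-form solution, so the prototype transfer can be sketched in a few lines; all shapes and the regularization weight gamma are illustrative:

```python
import numpy as np

def unseen_prototypes(A_u, A_s, P_s, gamma=1.0):
    """Infer unseen-class prototypes from seen-class prototypes.

    Solves W = argmin ||A_u - W A_s||^2 + gamma ||W||^2 in closed form,
    then transfers the same combination to visual prototypes: P_u = W P_s.
    A_u: (C_u, d_s) unseen semantic matrix, A_s: (C_s, d_s) seen semantic
    matrix, P_s: (C_s, d_v) seen visual prototypes (per-class feature means).
    """
    C_s = A_s.shape[0]
    # Closed-form ridge solution: W = A_u A_s^T (A_s A_s^T + gamma I)^-1.
    W = A_u @ A_s.T @ np.linalg.inv(A_s @ A_s.T + gamma * np.eye(C_s))
    return W @ P_s

rng = np.random.default_rng(4)
A_s = rng.normal(size=(10, 6))   # 10 seen classes, 6-dim semantics
A_u = rng.normal(size=(3, 6))    # 3 unseen classes
P_s = rng.normal(size=(10, 12))  # seen visual prototypes
P_u = unseen_prototypes(A_u, A_s, P_s)
```

Each row of P_u is then compared against the test feature f̄ via an inner product during inference.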

To combine the global and local descriptions of images, we concatenate the visual features generated by the different CNNs. Moreover, to combine the two ways of inference, the compatibility score of the embedding softmax branch and the inner product of the class-center triplet branch are added as the final prediction score of the test image w.r.t. each unseen class:

score(x, c) = Σ_k θ_k(x)ᵀ W_k φ(y_c) + f̄ᵀ p_c,    (14)

where c ∈ Y_u, and p_c denotes the row of the matrix P_u corresponding to class c.

5 Experiment

To evaluate the empirical performance of our proposed approach, we conduct experiments on three standard ZSL datasets and compare our method with the state-of-the-art ones. We then show the performance of multi-attention localization.

5.1 Initialization of Attention Weights

In this section, we introduce how to initialize the attention weight a^i in Eq. 2. Each channel of the feature maps focuses on a certain pattern or a certain part of the object. We assume that if the peak-value positions of several feature channels are near to each other, if not the same, these channels correspond to the same part of the object. Inspired by [Zheng et al.2017], we leverage the peak locations as the representations of feature channels and apply k-means clustering to cluster the feature channels. Specifically, we feed all training samples into the CNN trained for the conventional classification task and extract the coordinates of the peak for each channel. Formally, the representation r_c for channel c is given by:

r_c = [ (1/Ω) Σ_{s=1}^{Ω} p_x^{c,s}, (1/Ω) Σ_{s=1}^{Ω} p_y^{c,s} ],    (15)

where (p_x^{c,s}, p_y^{c,s}) are the coordinates of the peak of channel c for the s-th sample, and Ω is the number of samples. Applying k-means clustering, we obtain the n=2 groups of channels corresponding to the two attention maps. We initialize the channel-wise attention weights ā^i for part i by an indicator function over all feature channels:

ā_c^i = 1[c ∈ cluster i],  c = 1, …, C,    (16)

where 1[c ∈ cluster i] equals one if channel c belongs to cluster i and zero otherwise, and C is the number of channels. With these pseudo values of ā^i, we can initialize the stacked FC layers in Eq. 2 by regression with an ℓ2 loss:

L_init = Σ_i || a^i − ā^i ||².    (17)
The stacked FC layers, as a part of our model, are optimized later with the SGMA objective in an end-to-end fashion.
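The clustering step can be sketched with a plain k-means over mean peak locations; the toy data below simulates channels whose peaks concentrate around two distinct locations, and the iteration count and seeds are arbitrary:

```python
import numpy as np

def cluster_channels(peak_coords, n_parts=2, iters=20, seed=0):
    """Group channels by mean peak location and build indicator weights.

    peak_coords: (C, S, 2) peak (x, y) of each of C channels over S samples.
    Returns pseudo attention weights a_bar of shape (n_parts, C).
    """
    reps = peak_coords.mean(axis=1)                  # (C, 2): mean peak per channel
    rng = np.random.default_rng(seed)
    centers = reps[rng.choice(len(reps), n_parts, replace=False)]
    for _ in range(iters):                           # plain k-means iterations
        dists = ((reps[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = np.argmin(dists, axis=1)
        for i in range(n_parts):
            if np.any(assign == i):
                centers[i] = reps[assign == i].mean(axis=0)
    # Indicator initialization: 1 if the channel belongs to cluster i, else 0.
    a_bar = np.stack([(assign == i).astype(float) for i in range(n_parts)])
    return a_bar

# Toy channels: 6 peaking near (5, 5) ("head") and 6 near (20, 20) ("tail").
rng = np.random.default_rng(5)
head = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(6, 10, 2))
tail = rng.normal(loc=[20.0, 20.0], scale=0.3, size=(6, 10, 2))
a_bar = cluster_channels(np.concatenate([head, tail]))
```

The resulting 0/1 vectors serve as the regression targets ā^i for pretraining the gating FC layers.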

5.2 Initialization of Cropping Subnet

We leverage the attention maps from the initialized multi-attention subnet to pretrain the region cropping subnet. Specifically, we obtain the attended region in the attention maps as a discriminative square centered at the peak response of each attention map. The side length of the squares is assumed to be a quarter of the image size. The coordinates of the attended region (t_x, t_y, t_s) are utilized to pretrain the cropping subnet with an ℓ2 loss.

5.3 Implementation details

We implement our approach in the PyTorch framework. For the multi-attention subnet, we take images of size 448 by 448 as input to achieve high-resolution attention maps. For the joint feature learning subnet, we resize all input images to 224 by 224. We consistently adopt VGG19 as the backbone and train the model with a batch size of 32 on two GPUs (Titan X). We use the SGD optimizer with momentum and weight decay to optimize the objectives. The learning rate is decayed by a factor of 0.1 on plateau. Hyper-parameters in our models are obtained by grid search on the validation set. The margins in Eq. 6 and Eq. 11 are set to 0.2 and 0.8 respectively. The scale factor k in Eq. 8 is set to 10. We only generate two attention maps, as we find more maps lead to severe overlap among attended regions and do not improve the ZSL performance.

5.4 Datasets and Experiment settings

We use three widely used zero-shot learning datasets: Caltech-UCSD Birds 200-2011 (CUB) [Wah et al.2011], Oxford Flowers (FLO) [Nilsback and Zisserman2008], and Animals with Attributes (AwA) [Lampert et al.2014]. CUB is a fine-grained dataset of bird species, containing 11,788 images from 200 different species with 312 attributes. FLO is a fine-grained dataset containing 8,189 images from 102 different types of flowers without attribute annotations; however, visual descriptions are available, collected by [Reed et al.2016]. Finally, AwA is a coarse-grained dataset with 30,475 images, 50 classes of animals, and 85 attributes.

To fairly compare with baselines, we use the attribute or sentence features provided by [Xian et al.2018a, Xian et al.2018b] as semantic features for all methods. For non-end-to-end methods, we consistently use the 2048-D features extracted from a pretrained 101-layer ResNet provided by [Xian et al.2018a], and for end-to-end methods, we adopt VGG19 as the backbone network. Besides, [Xian et al.2018a] points out that several test classes in the standard split (marked as SS) of the zero-shot learning setting were used to train the feature extraction network, which violates the spirit of zero-shot learning that test classes should never be seen before. Therefore, we also evaluate methods on the split proposed by [Xian et al.2018b] (marked as PS). We measure the quantitative performance of methods by Mean Class Accuracy (MCA).

Method Head Tail Average
Ours w/o MA
Table 1: Part detection results measured by average precision (%).

5.5 Part Detection Results

To evaluate the efficacy of weakly supervised part detection, we compare our detection results on CUB with SPDA-CNN [Zhang et al.2016a], a state-of-the-art work on part detectors trained with groundtruth part annotations. We observe that our model consistently attends to the head or the tail in the two attention maps respectively. Therefore, we compare the detected parts with the head and tail groundtruth annotations. A part detection is considered correct if it has at least 0.5 overlap with the ground truth (i.e., IoU ≥ 0.5).

As shown in Table 1, SPDA-CNN can be considered as an upper bound since it leverages part annotations to train the detectors. We also provide the result of random crops, which serves as a lower bound. Compared with the random crops, our method achieves a substantial absolute improvement on average. Although there is still a small gap between the performance of ours and SPDA-CNN due to the lack of precise part annotations, the results are promising since our model is more practical in large-scale real-world tasks where costly annotations are not available. Besides, if we remove the proposed multi-attention loss (marked as "ours w/o MA"), the performance suffers a significant drop, confirming the effect of the multi-attention loss.

We also show the qualitative results of part localization in Figures 3, 4, and 5. The results are grouped in triples: the detected parts are marked with blue and red bounding boxes in the first image, and the remaining two images are the generated attention maps. The detected parts are well aligned with the semantic parts of objects. In CUB, the two parts are associated with the head and the legs of birds, while in AwA the parts are the head and the rear body of the animals. In the FLO benchmark, the stamen and pistil are roughly detected in the red box, while the petals are localized as another crucial part.

5.6 Zero-Shot Classification Results

We compare our method with two groups of state-of-the-art methods: non-end-to-end methods that use visual features extracted from a pretrained CNN, and end-to-end methods that jointly train the CNN and the visual-semantic embedding network. The former group includes DAP [Lampert et al.2014], CONSE [Norouzi et al.2014], CMT [Socher et al.2013], SSE [Zhang and Saligrama2015], LATEM [Xian et al.2016], ALE [Akata et al.2016b], DEVISE [Frome et al.2013], SJE [Akata et al.2015], ESZSL [Romera-Paredes and Torr2015], SYNC [Changpinyo et al.2016], SAE [Kodirov et al.2017], DEM [Zhang et al.2017], and GAZSL [Zhu et al.2018]; the latter includes SCoRe [Morgado and Vasconcelos2017] and LDF [Li et al.2018]. The evaluation results are shown in Table 2. Different groups of approaches are separated by horizontal lines. The scores of the baselines (DAP-SAE) are obtained from [Xian et al.2018a, Xian et al.2018b]. As the codes of DEM, GAZSL, and SCoRe are available online, we obtain results by running the codes on settings not reported in the published papers. We obtained all the results of LDF from the authors.

In general, we observe that the end-to-end methods outperform the non-end-to-end methods. This confirms that jointly training the CNN model and the embedding model eliminates the discrepancy, present in non-end-to-end methods, between features learned for conventional object recognition and those needed for zero-shot recognition. It is worth noting that LDF learns object localization by integrating an additional zoom network into the whole model, while our approach further involves part-level patches to provide local features of objects. It is clear that our proposed model consistently outperforms previous approaches, achieving impressive gains over the state of the art on the fine-grained datasets, under both the SS and PS settings on CUB and on FLO.

Method SS PS SS PS
DAP (2013) -
CONSE (2014) -
CMT (2013) -
SSE (2015) -
LATEM (2016)
ALE (2015)
DEVISE (2013)
SJE (2015)
ESZSL (2015)
SYNC (2016) -
SAE (2017)
DEM (2017)
GAZSL (2018)
SCoRe (2017)
LDF (2018) -
Table 2: Zero-shot learning results on CUB, AWA, FLO benchmarks. The best scores and second best ones are marked bold and underline respectively.

5.7 Ablation study

In this section, we study the effectiveness of the detected object regions and finer part patches, as well as the joint supervision of embedding softmax and class-center triplet loss. We set our baseline to the model without localizing parts and with only embedding softmax loss as the objective.

Effect of discriminative regions. The upper part of Table 3 shows the performance of our method with different image inputs.

Our model with only part regions performs the worst because part regions provide only local features of an object, such as the features of the head or legs. Although these local features are discriminative at the part level, they miss much of the information contained in other regions and thus cannot recognize the whole object well alone. Combining the original image with the localized parts, the performance improves substantially over the baseline.

To further demonstrate the effectiveness of the localized parts and objects, we combine the object with randomly cropped parts of the same part size. From the results, we observe that in most cases adding random parts hurts the performance. We believe this is due to the lack of alignment of randomly cropped parts. For instance, one random part in one image may roughly cover the head of the object while in another image it may focus on the legs. In contrast, our localized parts have better semantic alignment, as shown in Figures 3, 4, and 5.

Method                  CUB    AWA    FLO    Avg
Baseline+Parts          67.4   64.3   63.9   65.2
Baseline+Random Parts
Embedding Softmax
Class-Center Triplet
Combined                63.5   65.7   61.8   63.7
Table 3: Performance of model variants on zero-shot learning under the PS setting. The best scores are marked in bold.

Effect of joint loss. The bottom part of Table 3 shows the results of different inference strategies when our model is trained with the joint loss as the objective and with only the original image as input. Compared with the baseline, the results inferred from the embedding softmax branch improve slightly, as the class-center triplet loss can be regarded as a regularizer that enhances discriminative features. The results inferred from the class-center triplet branch are better, and we obtain the best results by combining the inferences of the two branches, further improving on the baseline.
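A minimal sketch of a class-center triplet term, assuming one learned center per class (the paper's exact formulation and margin may differ): the feature is pulled toward its own class center and pushed at least a margin farther from the nearest rival center, which is what produces the intra-class compactness and inter-class dispersion the ablation measures.

```python
import numpy as np

def class_center_triplet_loss(feat, label, centers, margin=1.0):
    """Hinge loss on distances to class centers: own-center distance
    should be smaller than the closest other-center distance by `margin`."""
    d = np.linalg.norm(centers - feat, axis=1)  # distance to every center
    pos = d[label]                              # own-class center
    neg = np.min(np.delete(d, label))           # closest rival center
    return max(0.0, pos - neg + margin)
```

Adding this term alongside the embedding softmax loss acts as the regularizer described above; at inference, distances to the centers themselves can serve as the "class-center triplet branch" scores.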

6 Conclusion

In this paper, we show the significance of discriminative parts for zero-shot object recognition. This motivates us to design a semantic-guided attention localization model that detects such discriminative parts of objects under the guidance of semantic representations. A multi-attention loss is proposed to encourage compact and diverse attentions. Our model jointly learns global and local features from the original image and the discovered parts, supervised by the embedding softmax loss and the class-center triplet loss in an end-to-end fashion. Extensive experiments show that the proposed method outperforms state-of-the-art methods.

Figure 3: Part detection results on CUB
Figure 4: Part detection results on AWA1
Figure 5: Part detection results on FLO


  • [Akata et al.2015] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, pages 2927–2936, 2015.
  • [Akata et al.2016a] Zeynep Akata, Mateusz Malinowski, Mario Fritz, and Bernt Schiele. Multi-cue zero-shot learning with strong supervision. In CVPR, pages 59–68, 2016.
  • [Akata et al.2016b] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for image classification. IEEE Transactions on PAMI, 38(7):1425–1438, 2016.
  • [Bulat and Tzimiropoulos2016] Adrian Bulat and Georgios Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, pages 717–732. Springer, 2016.
  • [Changpinyo et al.2016] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classifiers for zero-shot learning. In CVPR, pages 5327–5336, 2016.
  • [Elhoseiny et al.2017] Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, and Ahmed Elgammal. Link the head to the ”beak”: Zero shot learning from noisy text description at part precision. In CVPR, 2017.
  • [Frome et al.2013] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise: A deep visual-semantic embedding model. In NIPS, 2013.
  • [Fu et al.2017] Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, pages 4438–4446, 2017.
  • [Goodfellow et al.2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
  • [Jaderberg et al.2015] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.
  • [Kingma and Welling2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [Kodirov et al.2017] Elyor Kodirov, Tao Xiang, and Shaogang Gong. Semantic autoencoder for zero-shot learning. arXiv preprint arXiv:1704.08345, 2017.
  • [Lampert et al.2014] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on PAMI, 36(3):453–465, 2014.
  • [Li et al.2018] Yan Li, Junge Zhang, Jianguo Zhang, and Kaiqi Huang. Discriminative learning of latent features for zero-shot recognition. In CVPR, 2018.
  • [Morgado and Vasconcelos2017] Pedro Morgado and Nuno Vasconcelos. Semantically consistent regularization for zero-shot recognition. In CVPR, 2017.
  • [Nilsback and Zisserman2008] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP’08, pages 722–729. IEEE, 2008.
  • [Norouzi et al.2014] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
  • [Reed et al.2016] Scott Reed, Zeynep Akata, Bernt Schiele, and Honglak Lee. Learning deep representations of fine-grained visual descriptions. In CVPR, 2016.
  • [Romera-Paredes and Torr2015] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In ICML, pages 2152–2161, 2015.
  • [Selvaraju et al.2017] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
  • [Socher et al.2013] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In NIPS, pages 935–943, 2013.
  • [Wah et al.2011] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • [Wang et al.2015] Dequan Wang, Zhiqiang Shen, Jie Shao, Wei Zhang, Xiangyang Xue, and Zheng Zhang. Multiple granularity descriptors for fine-grained categorization. In ICCV, 2015.
  • [Wang et al.2017] Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. NormFace: L2 hypersphere embedding for face verification. In ACM MM. ACM, 2017.
  • [Wei et al.2016] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In CVPR, pages 4724–4732, 2016.
  • [Wen et al.2016] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pages 499–515. Springer, 2016.
  • [Xian et al.2016] Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele. Latent embeddings for zero-shot classification. In CVPR, pages 69–77, 2016.
  • [Xian et al.2018a] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on PAMI, 2018.
  • [Xian et al.2018b] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In CVPR, 2018.
  • [Zhang and Saligrama2015] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, pages 4166–4174, 2015.
  • [Zhang et al.2016a] Han Zhang, Tao Xu, Mohamed Elhoseiny, Xiaolei Huang, Shaoting Zhang, Ahmed Elgammal, and Dimitris Metaxas. Spda-cnn: Unifying semantic part detection and abstraction for fine-grained recognition. In CVPR, 2016.
  • [Zhang et al.2016b] Xiaopeng Zhang, Hongkai Xiong, Wengang Zhou, Weiyao Lin, and Qi Tian. Picking deep filter responses for fine-grained image recognition. In CVPR, 2016.
  • [Zhang et al.2017] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding model for zero-shot learning. In CVPR, 2017.
  • [Zheng et al.2017] Heliang Zheng, Jianlong Fu, Tao Mei, and Jiebo Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In ICCV, 2017.
  • [Zhou et al.2016] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
  • [Zhu et al.2018] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In CVPR, 2018.