Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning

by Faisal Alamri, et al.

Zero-Shot Learning (ZSL) aims to recognise unseen object classes, which are not observed during the training phase. Most existing work on ZSL relies on pretrained visual features and lacks an explicit attribute localisation mechanism on images. In this work, we propose an attention-based model in the ZSL problem setting to learn attributes useful for unseen class recognition. Our method uses an attention mechanism adapted from the Vision Transformer to capture and learn discriminative attributes by splitting images into small patches. We conduct experiments on three popular ZSL benchmarks (i.e., AWA2, CUB and SUN) and set new state-of-the-art harmonic mean results on all three datasets, which illustrates the effectiveness of our proposed method.






1 Introduction

Relying on massive annotated datasets, significant progress has been made on many visual recognition tasks, mainly due to the widespread use of different deep learning architectures [20, 7, 12]. Despite these advancements, recognising an arbitrary real-world object remains a daunting challenge, as it is unrealistic to label every existing object class. Zero-Shot Learning (ZSL) addresses this problem: it requires images only from the seen classes during training, yet is able to recognise unseen classes during inference [29, 32, 33, 8]. The central insight is that all existing categories share a common semantic space, and the task of ZSL is to learn a mapping from the image space to the semantic space with the help of side information (attributes, word embeddings) [30, 16, 19] available for the seen classes during training, so that the mapping can be used to predict the class of unseen-class samples at inference time.

Most of the existing ZSL methods [27, 22] depend on pretrained visual features and consequently focus on learning a compatibility function between the visual features and semantic attributes. Although modern neural network models encode local visual information and object parts [32], they are not sufficient to solve the localisation issue in ZSL models. Some attempts have also been made to learn visual attention that focuses on some object parts [37]. However, designing a model that can exploit a stronger attention mechanism is relatively unexplored.

Therefore, to alleviate the above shortcomings of visual representations in ZSL models, we propose a Vision Transformer (ViT) [7] based multi-head self-attention model for solving the ZSL task. Our main contribution is to introduce ViT to enhance visual feature localisation for zero-shot learning. To the best of our knowledge, this is the first attempt to introduce ViT into ZSL, and it requires no object part-level annotation or detection. As illustrated in Figure 1, our method maps the visual features of images to the semantic space with the help of the scaled dot-product multi-head attention employed in ViT. We have also performed detailed experiments on three public datasets (i.e., AWA2, CUB and SUN) following the Generalised Zero-Shot Learning (GZSL) setting and achieved very encouraging results on all of them, including a new state-of-the-art harmonic mean on every dataset.

2 Related Work

Zero-Shot Learning: ZSL is employed to bridge the gap between seen and unseen classes using semantic information, which is done by computing a similarity function between visual features and previously learned knowledge [21]. Various approaches address the ZSL problem by learning probabilistic attribute classifiers to predict class labels [14, 17], or by learning linear [9, 2, 1] and non-linear [28] compatibility functions associating image features and semantic information. Recently proposed generative models synthesise visual features for the unseen classes [27, 22]. Although those models achieve better performance than classical models, they rely on features of trained CNNs. Recently, attention mechanisms have been adapted to ZSL to integrate discriminative local and global visual features. Among them, SGA [34] and AREN [32] use an attention-based network with two branches to guide the visual features to generate discriminative regions of objects. SGMA [37] also applies attention to jointly learn global and local features from the whole image and multiple discovered object parts. Very recently, APN [33] proposed dividing an object into eight groups and learning a set of attribute prototypes, which further help the model to decorrelate the visual features. Partly inspired by the success of attention-based models, in this paper we propose to learn local and global features using multi-head scaled dot-product self-attention via the Vision Transformer model; to the best of our knowledge, this is the first work on ZSL involving the Vision Transformer. In this model, we employ multi-head attention after splitting the image into fixed-size patches, so that it can attend to each patch to capture discriminative features among them and generate a compact representation of the entire image.

Vision Transformer: Self-attention-based architectures, especially the Transformer [24], have shown major success in various Natural Language Processing (NLP) [4] as well as Computer Vision tasks [3, 7]; the reader is referred to [12] for further reading on Vision Transformer based literature. Specifically, CaiT [23] introduces deeper transformer networks, and Swin Transformer [15] proposes a hierarchical Transformer, where the representation is computed using self-attention via shifted windows. In addition, TNT [11] proposes a transformer-backbone method that models not only the patch-level features but also the pixel-level representations. CrossViT [6] shows how a dual-branch Transformer combining different-sized image patches produces stronger image features. Since the applicability of transformer-based models is growing, we aim to extend and assess their capability on GZSL tasks, which, to the best of our knowledge, is still unexplored. Therefore, different from the existing works, we employ ViT to map the visual information to the semantic space, benefiting from the strong performance of multi-head self-attention to learn class-level attributes.

3 Vision Transformer for Zero-shot Learning (ViT-ZSL)

We follow the inductive approach for training our model, i.e., during training the model only has access to the images and corresponding image/object attributes from the seen classes. Each training sample pairs an RGB image with its class-level attribute vector, annotated with the attributes provided with the dataset. As depicted in Figure 2, an image of fixed resolution with three (RGB) channels is fed into the model. The model follows ViT [7] as closely as possible; hence the image is divided into a sequence of fixed-size patches. Each patch, with a resolution of 16 × 16 pixels, is encoded into a patch embedding by a trainable 2D convolution layer (i.e., Conv2d with kernel size=(16, 16) and stride=(16, 16)). Position embeddings are then attached to the patch embeddings to preserve positional information about the order of the sequence, which would otherwise be lost because the Transformer has no recurrence. An extra learnable classification token is prepended to the sequence to encode the global image representation. Patch embeddings are then projected through a linear projection to the model dimension, as in Eq. 1. Embeddings are then passed to the Transformer encoder, which consists of Multi-Head Attention (MHA) (Eq. 2) and MLP blocks (Eq. 3). Layer normalisation (Norm) is applied before every block, and a residual connection is applied after every block. The image representation is produced as in Eq. 4.
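The patch-splitting step described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names are hypothetical, and the 224 × 224 input used in the usage note is an assumption, since the input resolution is not stated here. Only the 16 × 16 patch size comes from the text.

```python
def patch_sequence_length(height, width, patch_size=16, add_class_token=True):
    """Number of tokens the encoder receives for one image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    n_patches = (height // patch_size) * (width // patch_size)
    # The learnable classification token adds one extra position.
    return n_patches + (1 if add_class_token else 0)


def split_into_patches(image, patch_size=16):
    """Split an image (a list of pixel rows) into non-overlapping square
    patches, returned in row-major order."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# Toy 32 x 32 single-channel "image" whose pixel value encodes its position.
toy_image = [[r * 32 + c for c in range(32)] for r in range(32)]
toy_patches = split_into_patches(toy_image)
```

Under the assumed 224 × 224 input, `patch_sequence_length(224, 224)` gives 14 × 14 = 196 patches plus one classification token, i.e. 197 tokens. A strided convolution with kernel and stride 16 performs this same non-overlapping split and the embedding in a single step.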

Figure 2: ViT-ZSL Architecture. An image is split into small patches fed into the Transformer encoder after attaching positional embeddings. During the training the output of the encoder is compared with the semantic information of the corresponding image via MSE loss. At inference the encoder output is used to search for the nearest class label.

In terms of MHA, self-attention is performed for every patch in the sequence of patch embeddings; attention thus operates simultaneously over all the patches and, with several attention heads running in parallel, yields multi-head self-attention. Three vectors, namely Query (Q), Key (K) and Value (V), are created by multiplying the encoder's input (i.e., patch embeddings) by three weight matrices that are learned during training. The Q and K vectors undergo a dot-product to output a scoring matrix representing how much a patch embedding has to attend to every other embedding; the higher the score, the more attention is paid. The score matrix is then scaled down and passed through a softmax to convert the scores into probabilities, which are then multiplied by the V vectors, as in Eq. 5, where the scaling depends on the dimension of the K vectors. Since the multi-head attention mechanism is employed, the self-attention outputs of all heads are then concatenated, fed into a linear layer and passed to the regression head.
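As a rough illustration of the scaled dot-product multi-head self-attention described above, the following NumPy sketch computes softmax(QKᵀ / √d) V per head and concatenates the heads. It is a sketch under stated assumptions rather than the authors' code: the function names, the random weights and the tiny dimensions (5 tokens, model width 8, 2 heads) are all hypothetical, and scaling by the square root of the per-head key dimension follows the standard formulation in [24].

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(z, w_q, w_k, w_v, w_o, num_heads):
    """z: (n_tokens, d_model); every weight matrix: (d_model, d_model).
    Returns the attended tokens and the per-head attention maps."""
    n_tokens, d_model = z.shape
    d_head = d_model // num_heads

    def split_heads(x):
        # (n_tokens, d_model) -> (num_heads, n_tokens, d_head)
        return x.reshape(n_tokens, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(z @ w_q)
    k = split_heads(z @ w_k)
    v = split_heads(z @ w_v)

    # Scaled dot-product scores between every pair of tokens, per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)          # each row sums to 1
    heads = attn @ v                         # (num_heads, n_tokens, d_head)

    # Concatenate the heads and apply the output linear layer.
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ w_o, attn


# Hypothetical toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_tokens, d_model, num_heads = 5, 8, 2
z = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, attn = multi_head_self_attention(z, w_q, w_k, w_v, w_o, num_heads)
```

Each row of `attn` is a probability distribution over the tokens, which is what attention maps such as those in Figure 3 visualise for the image patches.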


We argue that self-attention allows our model to attend to image regions that can be semantically relevant for classification and to learn visual features across the entire image. Since the standard ViT has one classification head implemented by an MLP, we modify it to meet our model objective: to predict the attribute vector, whose length depends on the dataset used. The motivation behind this is that the network is assumed to learn the notion of classes in order to predict attributes. For the objective function, we employ the Mean Squared Error (MSE) loss, as continuous attributes are used (Eq. 6); the loss is computed between the observed attributes and the predicted ones.
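For reference, the MSE objective described above is conventionally written as below. The symbols are assumptions introduced here for illustration: M is the number of attributes, y_i the observed value of attribute i and \hat{y}_i its prediction. The notation of Eq. 6 may differ, for example by also averaging over a mini-batch.

```latex
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{M} \sum_{i=1}^{M} \left( y_i - \hat{y}_i \right)^2
```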


During testing, instead of applying the widely used dot product as in [33], we use the cosine similarity as in [10] to predict class labels. The cosine similarity between the predicted attributes and every class embedding is measured, and the test image is assigned the class label with the highest similarity.
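The inference rule above amounts to a nearest-neighbour search under cosine similarity. A minimal sketch in plain Python follows. The function names and the three-attribute class embeddings are made up for illustration and are not taken from any dataset or from the authors' code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def predict_class(predicted_attributes, class_embeddings):
    """Return the class whose attribute vector is most similar to the
    attributes predicted for the image."""
    return max(
        class_embeddings,
        key=lambda name: cosine_similarity(predicted_attributes,
                                           class_embeddings[name]),
    )


# Toy, made-up class-level attribute vectors.
class_embeddings = {
    "zebra": [0.9, 0.1, 0.8],
    "dolphin": [0.0, 1.0, 0.1],
}
label = predict_class([0.7, 0.2, 0.9], class_embeddings)
```

Unlike the dot product, cosine similarity ignores vector magnitude, so scaling the predicted attribute vector does not change the chosen label.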

4 Experiments

Implementation Details: All images used in training and testing are taken from the ZSL datasets described below and resized to a fixed resolution without any data augmentation. We employ the Large variant of ViT (ViT-L) [7] with an input patch size of 16 × 16; in [7] this variant has a hidden dimension of 1024, 24 encoder layers and 16 attention heads per layer. There are 307M parameters in total in this architecture. ViT-L is then fine-tuned using the Adam optimiser with a fixed learning rate. All methods are implemented in PyTorch and run on a machine with an NVIDIA RTX GPU and a Xeon processor. Our code is available online.

Datasets: We have conducted our experiments on three popular ZSL datasets: AWA2, CUB, and SUN, whose details are presented in Table 1. The main aim of this experimentation is to validate our proposed method, ViT-ZSL, demonstrate its effectiveness and compare it with the existing state of the art. Among these datasets, AWA2 [30] consists of 37,322 images of 50 categories (40 seen + 10 unseen). Each category is annotated with 85 binary as well as continuous class attributes. CUB [25] contains 11,788 images of 200 different types of birds, of which 150 classes are considered as seen and the other 50 as unseen, following the split of [1]. Together with the images, the CUB dataset also contains 312 attributes describing birds. Finally, SUN [18] has the largest number of classes of the three. It consists of 717 types of scene, divided into 645 seen and 72 unseen classes. The SUN dataset contains 14,340 images with 102 annotated attributes.

Dataset      Granularity   # Classes (S + U)   # Attributes   # Images
AWA2 [30]    coarse        50 (40 + 10)        85             37,322
CUB [25]     fine          200 (150 + 50)      312            11,788
SUN [18]     fine          717 (645 + 72)      102            14,340
Table 1: Dataset statistics in terms of granularity, number of classes (seen + unseen classes shown in parentheses), number of attributes and number of images.

Evaluation: In this work, we train our ViT-ZSL model following the inductive approach [26]. Following [29], we measure the top-1 accuracy for both seen and unseen classes. To capture the trade-off between the performance on the two sets of classes, we use the harmonic mean, which is the primary evaluation criterion for our model. Following recent papers (e.g., [33], [5]), we apply Calibrated Stacking [5] to evaluate the considered methods under the GZSL setting, where the calibration factor is dataset-dependent and chosen on a validation set.
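The two evaluation ingredients above can be sketched in a few lines of plain Python. The harmonic mean is H = 2·S·U / (S + U), and Calibrated Stacking [5] subtracts a calibration factor from the scores of seen classes before taking the arg max. The function names, the toy scores and the value 0.05 for the calibration factor are illustrative assumptions, not values from the paper.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen and unseen top-1 accuracies."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)


def calibrated_stacking_predict(scores, seen_classes, gamma):
    """Subtract gamma from the score of every seen class, then pick the
    highest-scoring class. `scores` maps class name -> compatibility score."""
    adjusted = {
        name: (score - gamma) if name in seen_classes else score
        for name, score in scores.items()
    }
    return max(adjusted, key=adjusted.get)


# Toy, made-up scores: the seen class wins narrowly without calibration.
scores = {"horse": 0.82, "zebra": 0.80}
seen = {"horse"}
uncalibrated = calibrated_stacking_predict(scores, seen, gamma=0.0)
calibrated = calibrated_stacking_predict(scores, seen, gamma=0.05)

# Seen and unseen accuracies reported for ViT-ZSL on AWA2 in Table 2.
h_awa2 = harmonic_mean(90.0, 51.9)
```

With the accuracies from Table 2, `h_awa2` rounds to 65.8, matching the reported harmonic mean. The toy example shows how a small positive calibration factor can shift a borderline prediction from a seen class to an unseen one, which counteracts the bias towards seen classes.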

Quantitative Results: We consider the AWA2, CUB and SUN datasets to show the performance of our proposed model and compare it with related methods. Table 2 shows the quantitative comparison between the proposed model and various other GZSL models. The performance of each model is reported in terms of accuracy on Seen (S) and Unseen (U) classes and their harmonic mean (H).

                       AWA2                CUB                 SUN
Method                 S     U     H       S     U     H       S     U     H
DAP [14]               84.7  0.0   0.0     67.9  1.7   3.3     25.1  4.2   7.2
IAP [14]               87.6  0.9   1.8     72.8  0.2   0.4     37.8  1.0   1.8
DeViSE [9]             74.7  17.1  27.8    53.0  23.8  32.8    30.5  14.7  19.8
ConSE [17]             90.6  0.5   1.0     72.2  1.6   3.1     39.9  6.8   11.6
SSE [35]               82.5  8.1   14.8    46.9  8.5   14.4    36.4  2.1   4.0
SJE [2]                73.9  8.0   14.4    59.2  23.5  33.6    30.5  14.7  19.8
ESZSL [21]             77.8  5.9   11.0    63.8  12.6  21.0    27.9  11.0  15.8
LATEM [28]             77.3  11.5  20.0    57.3  15.2  24.0    28.8  14.7  19.5
ALE [1]                81.8  14.0  23.9    62.8  23.7  34.4    33.1  21.8  26.3
SAE [13]               82.2  1.1   2.2     54.0  7.8   13.6    18.0  8.8   11.8
AREN [32]              92.9  15.6  26.7    78.7  38.9  52.1    38.8  19.0  25.5
SGMA [37]              87.1  37.6  52.5    71.3  36.7  48.5    -     -     -
APN [33]               78.0  56.5  65.5    69.3  65.3  67.2    34.0  41.1  37.6
*GAZSL [36]            86.5  19.2  31.4    60.6  23.9  34.3    34.5  21.7  26.7
*f-CLSWGAN [27]        64.4  57.9  59.6    57.7  43.7  49.7    36.6  42.6  39.4
Our model (ViT-ZSL)    90.0  51.9  65.8    75.2  67.3  71.0    55.3  44.5  49.3
  • S, U and H denote the accuracy on Seen classes, the accuracy on Unseen classes, and their Harmonic mean, respectively. * indicates generative representation learning methods.

Table 2: Generalised zero-shot classification performance on AWA2, CUB and SUN

DAP and IAP [14] are some of the earliest works in ZSL and perform poorly compared to other models. This is due to the assumptions these approaches make about attribute dependency. In the real world, attributes such as ‘terrestrial’ and ‘farm’ are dependent for animals, but such models assume them to be independent, an assumption noted as incorrect by [1]. Our model ViT-ZSL does not make this assumption; rather, it considers the correlation between attributes, which self-attention helps to achieve by considering both positional and contextual information of the entire sequence of patches. DeViSE [9] and ConSE [17] learn a linear mapping between images and their semantic embedding space. They both make use of the same text model trained on 5.4B words from Wikipedia to construct 500-dimensional word embedding vectors. Both use the same baseline model, but DeViSE replaces the last layer (i.e., the softmax layer) with a linear transformation layer. In contrast, ConSE keeps it and computes the predictions via a convex combination of the class label embedding vectors. As presented in Table 2, ConSE outperforms DeViSE on the seen classes, but DeViSE generally performs better on the unseen classes. Similarly, SJE [2] learns a bilinear compatibility function using the structural SVM objective function to maximise the compatibility between image and class embeddings. ESZSL [21] uses the square loss to learn bilinear compatibility. Although ESZSL is claimed to be easy to implement, its performance, in particular for GZSL, is poor. ALE [1], which belongs to the bilinear compatibility group, performs better than most other members of its group. LATEM [28], instead of learning a single bilinear map, extends the bilinear compatibility of SJE [2] to a piecewise-linear one by learning multiple linear mappings. It performs better than SJE on unseen classes but with a lower harmonic mean due to its poor performance on seen classes.

Generative ZSL models such as GAZSL [36] and f-CLSWGAN [27] are seen to reduce the effect of the bias problem due to the inclusion of synthesised features for the unseen classes. However, this does not apply to our method, as no synthesised features are used in our case; instead, only the features extracted from seen classes are used during training. AREN [32], SGMA [37] and APN [33] are non-generative ZSL models focusing on object region localisation using image attention. They are the most relevant works to ours, as an attention mechanism is included in their architectures. However, their models consist of two branches, where the first learns local discriminative visual features and the second captures the image’s global context. In contrast, our model uses only one compact network, whose input is the image patches, so that the global and local discriminative features can be learned using the multi-head self-attention mechanism.

Figure 3: Representative examples of attention. First row: original images; middle row: attention maps; last row: attention fusions. From left to right, ViT-ZSL is able to focus on object-level attributes and learn the objects’ discriminative features when objects are partly captured (first three columns), occluded (fourth column) or fully presented (last two columns).

Our model ViT-ZSL, as shown in Table 2, achieves the best harmonic mean on AWA2. It also ranks third on both seen and unseen classes. On seen classes it scores 90.02%, where the highest is AREN with 92.9% accuracy. Since the comparison follows the GZSL setting, which uses the harmonic mean as the primary evaluation metric, ViT-ZSL outperforms all state-of-the-art models on this dataset. On the CUB dataset, our method achieves the second-highest accuracy for seen classes and the highest for unseen classes. In addition, ViT-ZSL obtains the best harmonic mean score among all the reported approaches. On the SUN dataset, which has the largest number of classes among the three datasets, our model performs best on both seen and unseen classes, achieving a harmonic mean of 49.3%, the highest of all models.

Attention Maps: In Figure 3, we show how our model attends to image regions semantically relevant to the object class. For example, in the images of the first three columns, the objects are only partly visible (i.e., only the top part is captured), and in the image in the fourth column, the groove-billed ani is occluded by a human hand. Although these images suffer from partial views or occlusion, our model accurately attends to the objects in the image. Thus, we believe that ViT-ZSL benefits from the attention mechanism, which is also involved in the human recognition system. This suggests that our method has learned to map the relevance of local regions to representations in the semantic space, making predictions from the visible attribute-based regions. Similarly, in the last two columns of Figure 3, the model pays more attention to some object-level attributes (e.g., Deer: forest, agility, furry, etc., and Vermilion Flycatcher: solid and red breast, perching-like shape, notched tail). The model also focuses on the context of the object, as in the second column. This may be due to the guidance of some attributes (e.g., forest, jungle, ground and tree) that are associated with the leopard class. However, as shown in the first column, the model did not pay much attention to the bird’s beak compared to the head and the rest of the body. This needs to be investigated further, and building an explainable model as in [31] could be a way to accomplish this.

5 Conclusion

In this paper, we proposed a Vision Transformer-based Zero-Shot Learning (ViT-ZSL) model that specifically exploits the multi-head self-attention mechanism for relating visual and semantic attributes. Our qualitative results showed that the attention mechanism involved in our model focuses on the most relevant image regions related to the object class to predict the semantic information, which is used to find out the class label during inference. Our results on the GZSL task, including the highest harmonic mean scores on the AWA2, CUB and SUN datasets, illustrate the effectiveness of our proposed method.

Although our method achieves very encouraging results for the GZSL task on three publicly available benchmarks, the bias problem towards seen classes remains a challenge, which will be addressed in future work. Training the model in a transductive setting, where visual information for unseen classes could be included, is a direction to be examined.


This work was supported by the Defence Science and Technology Laboratory and the Alan Turing Institute. The TITAN Xp and TITAN V used for this research were donated by the NVIDIA Corporation.


  • [1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid (2016) Label-embedding for image classification. IEEE TPAMI.
  • [2] Z. Akata, S. E. Reed, D. Walter, H. Lee, and B. Schiele (2015) Evaluation of output embeddings for fine-grained image classification. In CVPR.
  • [3] F. Alamri, S. Kalkan, and N. Pugeault (2021) Transformer-encoder detector module: using context to improve robustness to adversarial attacks on object detection. In ICPR.
  • [4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. In NeurIPS.
  • [5] W. Chao, S. Changpinyo, B. Gong, and F. Sha (2016) An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV.
  • [6] C. Chen, Q. Fan, and R. Panda (2021) CrossViT: cross-attention multi-scale vision transformer for image classification. arXiv.
  • [7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In ICLR.
  • [8] M. Federici, A. Dutta, P. Forré, N. Kushman, and Z. Akata (2020) Learning robust representations via multi-view information bottleneck. In ICLR.
  • [9] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) DeViSE: a deep visual-semantic embedding model. In NIPS.
  • [10] S. Gidaris and N. Komodakis (2018) Dynamic few-shot visual learning without forgetting. In CVPR.
  • [11] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang (2021) Transformer in transformer. arXiv.
  • [12] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. Khan, and M. Shah (2021) Transformers in vision: a survey. arXiv.
  • [13] E. Kodirov, T. Xiang, and S. Gong (2017) Semantic autoencoder for zero-shot learning. In CVPR.
  • [14] C. H. Lampert, H. Nickisch, and S. Harmeling (2009) Learning to detect unseen object classes by between-class attribute transfer. In CVPR.
  • [15] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv.
  • [16] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS.
  • [17] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. Corrado, and J. Dean (2014) Zero-shot learning by convex combination of semantic embeddings. In ICLR.
  • [18] G. Patterson and J. Hays (2012) SUN attribute database: discovering, annotating, and recognizing scene attributes. In CVPR.
  • [19] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In EMNLP.
  • [20] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE TPAMI.
  • [21] B. Romera-Paredes and P. Torr (2015) An embarrassingly simple approach to zero-shot learning. In ICML.
  • [22] E. Schönfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata (2019) Generalized zero- and few-shot learning via aligned variational autoencoders. In CVPR.
  • [23] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou (2021) Going deeper with image transformers. arXiv.
  • [24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS.
  • [25] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology.
  • [26] W. Wang, V. Zheng, H. Yu, and C. Miao (2019) A survey of zero-shot learning. ACM-TIST.
  • [27] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata (2018) Feature generating networks for zero-shot learning. In CVPR.
  • [28] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele (2016) Latent embeddings for zero-shot classification. In CVPR.
  • [29] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata (2019) Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. IEEE TPAMI.
  • [30] Y. Xian, B. Schiele, and Z. Akata (2017) Zero-shot learning - the good, the bad and the ugly. In CVPR.
  • [31] Y. Xian, S. Sharma, B. Schiele, and Z. Akata (2019) F-VAEGAN-D2: a feature generating framework for any-shot learning. In CVPR.
  • [32] G. Xie, L. Liu, X. Jin, F. Zhu, Z. Zhang, J. Qin, Y. Yao, and L. Shao (2019) Attentive region embedding network for zero-shot learning. In CVPR.
  • [33] W. Xu, Y. Xian, J. Wang, B. Schiele, and Z. Akata (2020) Attribute prototype network for zero-shot learning. In NIPS.
  • [34] Y. Yu, Z. Ji, Y. Fu, J. Guo, Y. Pang, and Z. Zhang (2018) Stacked semantics-guided attention model for fine-grained zero-shot learning. In NeurIPS.
  • [35] Z. Zhang and V. Saligrama (2015) Zero-shot learning via semantic similarity embedding. In ICCV.
  • [36] Y. Zhu, M. Elhoseiny, B. Liu, X. Peng, and A. Elgammal (2018) Imagine it for me: generative adversarial approach for zero-shot learning from noisy texts. In CVPR.
  • [37] Y. Zhu, J. Xie, Z. Tang, X. Peng, and A. Elgammal (2019) Semantic-guided multi-attention localization for zero-shot learning. In NIPS.