Multi-Headed Self-Attention via Vision Transformer for Zero-Shot Learning (ViT-ZSL)
Zero-Shot Learning (ZSL) aims to recognise unseen object classes, which are not observed during the training phase. The existing body of work on ZSL mostly relies on pretrained visual features and lacks an explicit attribute localisation mechanism on images. In this work, we propose an attention-based model in the ZSL setting to learn attributes useful for unseen class recognition. Our method uses an attention mechanism adapted from the Vision Transformer to capture and learn discriminative attributes by splitting images into small patches. We conduct experiments on three popular ZSL benchmarks (i.e., AWA2, CUB and SUN) and set new state-of-the-art harmonic mean results on all three datasets, which illustrates the effectiveness of our proposed method.
Relying on massive annotated datasets, significant progress has been made on many visual recognition tasks, mainly due to the widespread use of different deep learning architectures [20, 7, 12]. Despite these advancements, recognising any arbitrary real-world object remains a daunting challenge, as it is unrealistic to label all the existing object classes on Earth. Zero-Shot Learning (ZSL) addresses this problem: it requires images only from the seen classes during training, but has the capability of recognising unseen classes during inference [29, 32, 33, 8]. The central insight is that all the existing categories share a common semantic space. The task of ZSL is thus to learn a mapping from the imagery space to the semantic space with the help of side information (attributes, word embeddings) [30, 16, 19] available for the seen classes during the training phase, so that it can be used to predict the class information of the unseen classes at inference time.
The existing body of work mostly depends on pretrained visual features and focuses on learning a compatibility function between the visual features and semantic attributes. Although modern neural network models encode local visual information and object parts, they do not fully solve the localisation issue in ZSL models. Some attempts have also been made to learn visual attention that focuses on object parts. However, designing a model that can exploit a stronger attention mechanism remains relatively unexplored.
Therefore, to alleviate the above shortcomings of visual representations in ZSL models, in this paper, we propose a Vision Transformer (ViT) 
based multi-head self-attention model for solving the ZSL task. Our main contribution is to introduce ViT to enhance visual feature localisation for zero-shot learning. To the best of our knowledge, this is the first attempt to introduce ViT into ZSL, without any object part-level annotation or detection. As illustrated in Figure 1, our method maps the visual features of images to the semantic space with the help of the scaled dot-product multi-head attention employed in ViT. We have also performed detailed experimentation on three public datasets (i.e., AWA2, CUB and SUN) following the Generalised Zero-Shot Learning (GZSL) setting and achieved very encouraging results on all of them, including new state-of-the-art harmonic means on all the datasets.
Zero-Shot Learning: ZSL is employed to bridge the gap between seen and unseen classes using semantic information, which is done by computing similarity function between visual features and previously learned knowledge 
. Various approaches address the ZSL problem by learning probabilistic attribute classifiers to predict class labels [14, 17], or by learning linear [9, 2, 1] and non-linear compatibility functions associating image features and semantic information. Recently proposed generative models synthesise visual features for the unseen classes [27, 22]. Although these models achieve better performance than classical ones, they rely on features from pretrained CNNs. Recently, attention mechanisms have been adapted in ZSL to integrate discriminative local and global visual features. Among them, SGA  and AREN  use an attention-based network with two branches to guide the visual features to generate discriminative regions of objects. SGMA  also applies attention to jointly learn global and local features from the whole image and multiple discovered object parts. Very recently, APN  proposed to divide an object into eight groups and to learn a set of attribute prototypes, which further help the model to decorrelate the visual features. Partly inspired by the success of attention-based models, in this paper we propose to learn local and global features using multi-head scaled dot-product self-attention via the Vision Transformer model, which, to the best of our knowledge, is the first work on ZSL involving the Vision Transformer. In this model, we employ multi-head attention after splitting the image into fixed-size patches, so that it can attend to each patch to capture discriminative features among them and generate a compact representation of the entire image.
Vision Transformer: Self-attention-based architectures, especially Transformers, have shown major success in various Natural Language Processing (NLP) tasks as well as in Computer Vision tasks [3, 7]; the reader is referred to  for further reading on Vision Transformer based literature. Specifically, CaiT
introduces deeper transformer networks, and the Swin Transformer proposes a hierarchical Transformer, where the representation is computed using self-attention via shifted windows. In addition, TNT  proposes a transformer-backbone method modelling not only patch-level features but also pixel-level representations. CrossViT  shows how a dual-branch Transformer combining different-sized image patches produces stronger image features. Since the applicability of transformer-based models is growing, we aim to examine their capability for GZSL tasks, which, to the best of our knowledge, is still unexplored. Therefore, differently from existing works, we employ ViT to map the visual information to the semantic space, benefiting from the strong performance of multi-head self-attention to learn class-level attributes.
We follow the inductive approach for training our model, i.e., during training the model only has access to the images and the corresponding image/object attributes of the seen classes, where each training sample consists of an RGB image and
its class-level attribute vector, annotated with different attributes as provided with the dataset. As depicted in Figure 2, an image of a given resolution and number of channels is fed into the model. The model follows ViT  as closely as possible; hence the image is divided into a sequence of patches. Each patch, with a fixed resolution,
is encoded into a patch embedding by a trainable 2D convolution layer (i.e., Conv2d with kernel size=(16, 16) and stride=(16, 16)). Position embeddings are then added to the patch embeddings to preserve the relative positional information of the sequence, since the Transformer has no recurrence. An extra learnable classification token () is prepended to the sequence to encode the global image representation. Patch embeddings () are then projected through a linear projection to dimension (i.e., ) as in Eq. 1. The embeddings are then passed to the Transformer encoder, which consists of Multi-Head Attention (MHA) (Eq. 2) and MLP blocks (Eq. 3). Before every block a layer normalisation (Norm) is applied, and residual connections are added after every block. The image representation () is produced as in Eq. 4.
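Since the exact dimensions are elided above, the input pipeline can be sketched as follows. This is a minimal NumPy illustration, assuming a 224x224 RGB image, 16x16 patches (matching the Conv2d kernel above) and embedding dimension D = 1024; the random weights stand in for the trained projection:

```python
import numpy as np

# Hypothetical sizes: 224x224 RGB image, 16x16 patches, D = 1024.
H = W = 224; C = 3; P = 16; D = 1024
num_patches = (H // P) * (W // P)          # 196 patches

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# Split the image into a sequence of flattened P x P x C patches.
patches = image.reshape(H // P, P, W // P, P, C)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(num_patches, P * P * C)

# Linear projection to D dimensions (equivalent to the strided Conv2d).
W_embed = rng.standard_normal((P * P * C, D)) * 0.02
patch_embeddings = patches @ W_embed        # (196, 1024)

# Prepend the learnable classification token, then add position embeddings.
cls_token = np.zeros((1, D))
pos_embed = rng.standard_normal((num_patches + 1, D)) * 0.02
tokens = np.concatenate([cls_token, patch_embeddings], axis=0) + pos_embed
print(tokens.shape)                         # (197, 1024)
```

The resulting token sequence is what the Transformer encoder consumes; after the encoder, the classification-token row serves as the image representation.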
In terms of MHA, self-attention is performed for every patch in the sequence of patch embeddings independently; thus, attention operates on all the patches simultaneously, leading to multi-head self-attention. Three vectors, namely Query (), Key () and Value (), are created by multiplying the encoder's input (i.e., the patch embeddings) by three weight matrices (i.e., ,  and ) learned during training. The Query and Key vectors undergo a dot product to output a score matrix representing how much each patch embedding has to attend to every other embedding; the higher the score, the more attention is paid. The score matrix is then scaled down and passed through a softmax to convert the scores into probabilities, which are multiplied by the Value vectors, as in Eq. 5, where  is the dimension of the vectors. Since multi-head attention is employed, the self-attention outputs are concatenated, fed into a linear layer and passed to the regression head.
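The scaled dot-product multi-head self-attention described above can be sketched as follows. This is a minimal NumPy illustration with toy dimensions; the weight matrices are random stand-ins for the learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head scaled dot-product self-attention over a patch sequence.

    X: (seq_len, D). Each head attends over all patches simultaneously."""
    seq_len, D = X.shape
    d_head = D // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Split into heads: (num_heads, seq_len, d_head).
    def split(M):
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)

    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    out = attn @ V                                    # (heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, D)  # concatenate heads
    return out @ Wo, attn

rng = np.random.default_rng(0)
D, heads, seq = 64, 4, 10
Ws = [rng.standard_normal((D, D)) * 0.1 for _ in range(4)]
Y, attn = multi_head_attention(rng.standard_normal((seq, D)), *Ws, heads)
```

Each row of every head's attention matrix is a probability distribution over the patches, which is what makes the attention maps in Figure 3 interpretable.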
We argue that self-attention allows our model to attend to image regions that are semantically relevant for classification and to learn visual features across the entire image. Since the standard ViT has a single classification head implemented by an MLP, it has been modified to meet our objective: to predict the attribute vector, whose dimensionality depends on the dataset used. The motivation is that the network is assumed to learn the notion of classes in order to predict attributes. For the objective function, we employ the Mean Squared Error (MSE) loss, since continuous attributes are used, as in Eq. 6, where  denotes the observed attributes and  the predicted ones.
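A hypothetical sketch of the attribute-regression head and the MSE objective follows; the variable names and the size K = 85 (the AWA2 attribute count) are our illustrative assumptions, not the authors' code:

```python
import numpy as np

# Sketch: the ViT classification head is replaced by a regression head
# mapping the classification-token representation to K continuous
# attributes (K = 85 as for AWA2), trained with MSE (Eq. 6).
rng = np.random.default_rng(0)
D, K = 1024, 85                        # embedding dim, attribute count
W_head = rng.standard_normal((D, K)) * 0.02
b_head = np.zeros(K)

cls_repr = rng.standard_normal(D)      # image representation from encoder
phi_true = rng.random(K)               # annotated class-level attributes

phi_pred = cls_repr @ W_head + b_head  # predicted attribute vector
mse = np.mean((phi_true - phi_pred) ** 2)   # MSE loss, as in Eq. 6
```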
During testing, instead of applying the extensively used dot product as in 
, we consider the cosine similarity as in to predict class labels. The cosine similarity between the predicted attributes and every class embedding is measured. The output of the similarity measure is then used to determine the class label of the test images.
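The cosine-similarity-based label prediction can be illustrated as follows; `predict_class` and the toy embeddings are hypothetical helpers of ours, not the authors' implementation:

```python
import numpy as np

def predict_class(phi_pred, class_embeddings):
    """Return the index of the class embedding with the highest cosine
    similarity to the predicted attribute vector (hypothetical helper)."""
    A = np.asarray(class_embeddings)
    sims = (A @ phi_pred) / (np.linalg.norm(A, axis=1)
                             * np.linalg.norm(phi_pred))
    return int(np.argmax(sims)), sims

# Toy example with 3 candidate classes described by 4 attributes.
classes = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0, 0.0],
                    [1.0, 1.0, 0.0, 0.0]])
label, sims = predict_class(np.array([0.9, 0.1, 0.0, 0.8]), classes)
print(label)  # 0 -- closest in direction to the first class embedding
```

At test time the candidate set contains the unseen (or, under GZSL, all) class embeddings, and the argmax gives the predicted label.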
Implementation Details: All images used in training and testing are taken from the ZSL datasets mentioned below and resized, without any data augmentation. We employ the Large variant of ViT (ViT-L), with its standard input patch size, hidden dimension, number of layers and number of attention heads per layer; there are 307M parameters in total in this architecture. ViT-L is then fine-tuned using the Adam optimiser with a fixed learning rate and batch size. All methods are implemented in PyTorch (our code is available at: https://github.com/FaisalAlamri0/ViT-ZSL) and run on an NVIDIA RTX GPU with a Xeon processor.
Datasets: We have conducted our experiments on three popular ZSL datasets: AWA2, CUB, and SUN, whose details are presented in Table 1. The main aim of this experimentation is to validate our proposed method, ViT-ZSL, demonstrate its effectiveness and compare it with the existing state of the art. Among these datasets, AWA2  consists of 37,322 images of 50 categories (40 seen + 10 unseen). Each category is annotated with 85 binary as well as continuous class attributes. CUB  contains 11,788 images of 200 different types of birds, of which 150 classes are considered seen and the other 50 unseen. Together with the images, the CUB dataset also provides 312 attributes describing the birds. Finally, SUN  has the largest number of classes: it consists of 717 types of scene, divided into 645 seen and 72 unseen classes, and contains 14,340 images with 102 annotated attributes.
| Datasets | Granularity | # Classes (S + U) | # Attributes | # Images |
|----------|-------------|-------------------|--------------|----------|
| AWA2     | coarse      | 50 (40 + 10)      | 85           | 37,322   |
| CUB      | fine        | 200 (150 + 50)    | 312          | 11,788   |
| SUN      | fine        | 717 (645 + 72)    | 102          | 14,340   |
Evaluation: In this work, we train our ViT-ZSL model following the inductive approach . Following , we measure the top-1 accuracy for both seen and unseen classes. To capture the trade-off between the performance on both sets of classes, we use their harmonic mean, which is the primary evaluation criterion for our model. Following recent papers (e.g., , ), we apply Calibrated Stacking  to evaluate the considered methods under the GZSL setting, where the calibration factor is dataset-dependent and decided based on a validation set.
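The evaluation protocol described here — per-class top-1 accuracy, the harmonic mean, and Calibrated Stacking — can be sketched as follows (a minimal illustration; the function names are ours):

```python
import numpy as np

def per_class_top1(y_true, y_pred):
    """Average of per-class top-1 accuracies, as standard in (G)ZSL."""
    accs = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean H of the seen and unseen per-class accuracies."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

def calibrated_scores(sims, seen_mask, gamma):
    """Calibrated Stacking: subtract a calibration factor gamma from the
    similarity scores of seen classes to reduce the bias towards them."""
    return sims - gamma * seen_mask

# Sanity check with the AWA2 seen/unseen accuracies reported in Table 2.
print(round(harmonic_mean(0.900, 0.519), 3))  # 0.658
```

A larger gamma trades seen-class accuracy for unseen-class accuracy; the factor is tuned on a validation set, as noted above.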
Quantitative Results: We consider the AWA2, CUB and SUN datasets to evaluate our proposed model and compare it with related work. Table 2 shows the quantitative comparison between the proposed model and various other GZSL models. The performance of each model is reported in terms of Seen (S) and Unseen (U) class accuracy and their harmonic mean (H).
|                      | AWA2 S | AWA2 U | AWA2 H | CUB S | CUB U | CUB H | SUN S | SUN U | SUN H |
|----------------------|--------|--------|--------|-------|-------|-------|-------|-------|-------|
| Our model (ViT-ZSL)  | 90.0   | 51.9   | 65.8   | 75.2  | 67.3  | 71.0  | 55.3  | 44.5  | 49.3  |
S, U and H denote the top-1 accuracy on Seen classes, Unseen classes, and their Harmonic mean, respectively. For each scenario, the best result is in red and the second-best in blue. * indicates generative representation learning methods.
DAP and IAP  are among the earliest works in ZSL and perform poorly compared to other models. This is due to the assumption of attribute independence made in these approaches. In the real world, attributes such as 'terrestrial' and 'farm' are correlated, yet such models assume them independent, which is noted as incorrect by . Our model ViT-ZSL makes no such assumption; rather, it considers the correlation between attributes, which self-attention helps to achieve by considering both the positional and the contextual information of the entire sequence of patches. DeViSE  and ConSE
learn a linear mapping between images and their semantic embedding space. They both use the same text model, trained on 5.4B words from Wikipedia, to construct 500-dimensional word embedding vectors, and the same baseline vision model, but DeViSE replaces the last (softmax) layer with a linear transformation layer, whereas ConSE keeps it and computes predictions via a convex combination of the class label embedding vectors. ConSE, as presented in Table 2, outperforms DeViSE overall, although DeViSE generally performs better on the unseen classes. Similarly, SJE  learns a bilinear compatibility function using the structural SVM objective to maximise the compatibility between image and class embeddings. ESZSL  uses the square loss to learn a bilinear compatibility; although it is claimed to be easy to implement, its performance, in particular for GZSL, is poor. ALE , which also belongs to the bilinear compatibility group, performs better than most of its members. LATEM , instead of learning a single bilinear map, extends the bilinear compatibility of SJE  to be image-class pairwise linear by learning multiple linear mappings. It performs better than SJE on unseen classes but attains a lower harmonic mean due to its poor performance on seen classes. Generative ZSL models such as GAZSL  and f-CLSWGAN
are seen to reduce the effect of the bias problem through the inclusion of synthesised features for the unseen classes. This does not apply to our method, as no synthesised features are used in our case; instead, solely the features extracted from seen classes are used during training. AREN , SGMA  and APN  are non-generative ZSL models focusing on object region localisation using image attention. They are the works most relevant to ours, as the attention mechanism is part of their architectures. However, they consist of two branches, where the first learns local discriminative visual features and the second captures the image's global context. In contrast, our model uses a single compact network whose input is the image patches, so that the global and local discriminative features can be learned jointly using the multi-head self-attention mechanism.
Our model ViT-ZSL, as shown in Table 2
, achieves the best harmonic mean on AWA2. It also ranks third on both seen and unseen classes: it scores 90.0% on seen classes, where the highest is AREN with 92.9% accuracy. As the comparison follows the GZSL setting, with the harmonic mean as the primary evaluation metric, ViT-ZSL outperforms all state-of-the-art models. On the CUB dataset, our method achieves the second-highest accuracy for seen classes but the highest for unseen classes, and obtains the best harmonic mean among all the reported approaches. On the SUN dataset, which has the largest number of object classes, our model performs best on both seen and unseen classes, achieving a harmonic mean of 49.3%, the highest compared to all other models.
Attention Maps: In Figure 3, we show how our model attends to image regions semantically relevant to the object class. For example, in the images of the first three columns, the objects' entire shapes are absent (i.e., only the top part is captured), and in the image of the fourth column, the groove-billed ani bird is occluded by a human hand. Although these images suffer from occlusion, our model accurately attends to the objects. Thus, we believe that ViT-ZSL clearly benefits from the attention mechanism, which is also involved in the human recognition system. We can say that our method has learned to map the relevance of local regions to representations in the semantic space, making predictions based on the visible attribute-based regions. Similarly, in the images of the last two columns of Figure 3, the model pays more attention to certain object-level attributes (i.e., Deer: forest, agility, furry, etc., and Vermilion Flycatcher: solid and red breast, perching-like shape, notched tail). The model also focuses on the context of the object, as in the images of the second column; this may be due to the guidance of attributes (i.e., forest, jungle, ground and tree) associated with the leopard class. However, as shown in the first column, the model did not pay much attention to the bird's beak compared to the head and the rest of the body; this needs to be investigated further, and building an explainable model as in  could be a way to accomplish this.
In this paper, we proposed a Vision Transformer-based Zero-Shot Learning (ViT-ZSL) model that specifically exploits the multi-head self-attention mechanism for relating visual and semantic attributes. Our qualitative results showed that the attention mechanism involved in our model focuses on the most relevant image regions related to the object class to predict the semantic information, which is used to find out the class label during inference. Our results on the GZSL task, including the highest harmonic mean scores on the AWA2, CUB and SUN datasets, illustrate the effectiveness of our proposed method.
Although our method achieves very encouraging results for the GZSL task on three publicly available benchmarks, the bias problem towards seen classes remains a challenge, which will be addressed in future work. Training the model in a transductive setting, where visual information for unseen classes could be included, is a direction to be examined.
This work was supported by the Defence Science and Technology Laboratory and the Alan Turing Institute. The TITAN Xp and TITAN V used for this research were donated by the NVIDIA Corporation.