[ICCV 2021] Official PyTorch implementation of Discriminative Region-based Multi-Label Zero-Shot Learning. SOTA results on NUS-WIDE and Open Images.
Multi-label zero-shot learning (ZSL) is a more realistic counterpart of standard single-label ZSL since several objects can co-exist in a natural image. However, the occurrence of multiple objects complicates the reasoning and requires region-specific processing of visual features to preserve their contextual cues. We note that the best existing multi-label ZSL method takes a shared approach towards attending to region features with a common set of attention maps for all the classes. Such shared maps lead to diffused attention, which does not discriminatively focus on relevant locations when the number of classes is large. Moreover, mapping spatially-pooled visual features to the class semantics leads to inter-class feature entanglement, thus hampering the classification. Here, we propose an alternative approach towards region-based discriminability-preserving multi-label zero-shot classification. Our approach maintains the spatial resolution to preserve region-level characteristics and utilizes a bi-level attention module (BiAM) to enrich the features by incorporating both region and scene context information. The enriched region-level features are then mapped to the class semantics and only their class predictions are spatially pooled to obtain image-level predictions, thereby keeping the multi-class features disentangled. Our approach sets a new state of the art on two large-scale multi-label zero-shot benchmarks: NUS-WIDE and Open Images. On NUS-WIDE, our approach achieves an absolute gain of 6.9 mAP for ZSL, compared to the best published results.
Multi-label classification strives to recognize all the categories (labels) present in an image. In the standard multi-label classification [33, 40, 16, 5, 22, 41, 42] setting, the category labels in both the train and test sets are identical. In contrast, the task of multi-label zero-shot learning (ZSL) is to recognize multiple new unseen categories in images at test time, without having seen the corresponding visual examples during training. In the generalized ZSL (GZSL) setting, test images can simultaneously contain multiple seen and unseen classes. GZSL is particularly challenging in the large-scale multi-label setting, where several diverse categories occur in an image (e.g., maximum of labels per image in NUS-WIDE) along with a large number of unseen categories at test time (e.g., unseen classes in Open Images). Here, we investigate this challenging problem of multi-label (generalized) zero-shot classification.
Existing multi-label (G)ZSL methods tackle the problem by using global image features [21, 47], structured knowledge graphs, and attention schemes. Among these, the recently introduced LESA proposes a shared attention scheme based on region-based feature representations and achieves state-of-the-art results. LESA learns multiple attention maps that are shared across all categories. The region-based image features are weighted by these shared attentions and then spatially aggregated. Subsequently, the aggregated features are projected to the label space via a joint visual-semantic embedding space.
with significant inter and intra-class variations. Secondly, attended features are spatially pooled before projection to the label space, thus entangling the multi-label information in the collapsed image-level feature vectors. Since multiple diverse labels can appear in an image, the class-specific discriminability within such a collapsed representation is severely hampered.
To address the aforementioned problems, we pose large-scale multi-label ZSL as a region-level classification problem. We introduce a simple yet effective region-level classification framework that maintains the spatial resolution of features to keep the multi-class information disentangled for dealing with a large number of co-existing classes in an image. Our framework comprises a bi-level attention module (BiAM) to contextualize and obtain highly discriminative region-level feature representations. Our BiAM contains region and global (scene) contextualized blocks and enables reasoning about all the regions together using pair-wise relations between them, in addition to utilizing the holistic scene context. The region contextualized block enriches each region feature by attending to all regions within the image, whereas the scene contextualized block enhances the region features based on their congruence to the scene feature representation. The resulting discriminative features, obtained through our BiAM, are then utilized to perform region-based classification through a compatibility function. Afterwards, a spatial top-k pooling is performed over each class to obtain the final predictions. Experiments are performed on two challenging large-scale multi-label zero-shot benchmarks: NUS-WIDE and Open Images. Our approach performs favorably against existing methods, setting a new state of the art on both benchmarks. Particularly, on NUS-WIDE, our approach achieves an absolute gain of 6.9 in terms of mAP for the ZSL task, over the best published results.
Here, we introduce a region-based discriminability-preserving multi-label zero-shot classification framework aided by learning rich features that explicitly encodes both region as well as global scene contexts in an image.
Problem Formulation: Let denote the feature instances of a multi-label image and the corresponding multi-hot labels from the set of seen class labels . Further, let denote the -dimensional attribute embeddings, which encode the semantic relationships between seen classes. With as the number of positive labels in an image, we denote the set of attribute embeddings for the image as , where . The goal in (generalized) zero-shot learning is to learn a mapping aided by the attribute embeddings , such that the mapping can be adapted to include the unseen classes (with embeddings ) at test time, , for ZSL and for the GZSL setting. Here, represents the total number of seen and unseen classes.
As discussed earlier, recognizing a diverse and wide range of category labels in images under the (generalized) zero-shot setting is challenging. The problem arises primarily due to the entanglement of features of the different classes present in an image. Fig. 2(a) illustrates this feature entanglement in the shared attention-based classification pipeline that integrates multi-label features by performing a weighted spatial averaging of the region-based features based on the shared-attention maps. In this work, we argue that entangled feature representations are sub-optimal for multi-label classification and instead propose to alleviate this issue by posing large-scale multi-label ZSL as a region-level classification problem. To this end, we introduce a simple but effective region-level classification framework that first enriches the region-based features by the proposed feature enrichment mechanism. It then classifies the enriched region-based features, followed by spatially pooling the per-class region-based scores to obtain the final image-level class predictions (see Fig. 2(b)). Consequently, our framework minimizes inter-class feature entanglement and enhances the classification performance. Fig. 3 shows our overall proposed framework. Let be the output region-based features, which are to be classified, from our proposed enrichment mechanism (i.e., BiAM). Here, denote the spatial extent of the region-based features with regions. These features are first aligned with the class-specific attribute embeddings of the seen classes. This alignment is performed, i.e., a joint visual-semantic space is learned, so that the classifier can be adapted to the unseen classes at test time. The aligned region-based features are classified to obtain class-specific response maps given by,
where is a learnable weight matrix that is used to reshape the visual features to attribute embeddings of seen classes (). The response maps are then top-k pooled along the spatial dimensions to obtain image-level per-class scores , which are then utilized for training the network (in Sec. 2.3). Such a region-level classification, followed by a score-level pooling, helps to preserve the discriminability of the features in each of the regions by minimizing the feature entanglement of different positive classes occurring in the image.
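As a rough illustration of the region-level classification and score-level pooling described above, the following NumPy sketch scores every region against every class embedding and then pools the highest responses per class. The function name, the variable names, the toy dimensions, and the choice to average the k highest responses are assumptions made for illustration; they are not taken from the official implementation.

```python
import numpy as np

def region_level_scores(features, W, attributes, k):
    """Classify each region, then pool the per-class scores spatially.

    features:   (N, d_r) region features, N = number of regions
    W:          (d_a, d_r) projection into the attribute space (hypothetical name)
    attributes: (C, d_a) class attribute embeddings
    k:          number of highest region responses kept per class (assumed averaging)
    """
    projected = features @ W.T                  # (N, d_a): regions in the semantic space
    response = attributes @ projected.T         # (C, N): class-specific response maps
    top_k = np.sort(response, axis=1)[:, -k:]   # (C, k): k highest responses per class
    return response, top_k.mean(axis=1)         # image-level per-class scores, shape (C,)

# Toy sizes only; they do not correspond to any dataset.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))    # 16 regions, 8 channels
W = rng.normal(size=(4, 8))         # 4-dimensional attribute space
attrs = rng.normal(size=(5, 4))     # 5 classes
response, scores = region_level_scores(feats, W, attrs, k=3)
print(response.shape, scores.shape)
```

Because pooling happens on the scores rather than on the features, each class can draw its evidence from a different set of regions.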
The aforementioned region-level multi-label ZSL framework relies on discriminative region-based features. Standard region-based features only encode local region-specific information and do not explicitly reason about all the regions together. Moreover, region-based features do not possess image-level holistic scene information. Next, we introduce a bi-level attention module (BiAM) to enhance feature discriminability and generate enriched features .
Here, we present a bi-level attention module (BiAM) that enhances region-based features by incorporating both region and scene context information, without sacrificing the spatial resolution. Our BiAM comprises region and scene contextualized blocks, which are described next.
The region-contextualized block (RCB) enriches the region-based latent features by capturing the contexts from different regions in the image. We observe that encoding the individual contexts of different regions in an image improves the discriminability of standard region-based features, e.g., the context of a region with a window can aid in identifying other, possibly texture-less, regions in the image as house or building. Thus, inspired by multi-headed self-attention, our RCB allows the features in different regions to interact with each other and identify the regions to be paid more attention to for enriching themselves (see Fig. 3(b)). To this end, the input features are first processed by a convolution layer to obtain latent features . These latent features are then projected to a low-dimensional space () to create query-key-value triplets using a total of projection heads,
where and are learnable weights of convolution layers with input and output channels as and , respectively. The query vector (of length ) derived from each region feature (footnote 1) is used to find its correlation with the keys obtained from all the region features, while the value embedding holds the status of the current form of each region feature.
Footnote 1: The query can be considered as queries represented by features each. A similar observation holds for keys, values, etc.
Given these triplets for each head, first, an intra-head processing is performed by relating each query vector with ‘keys’ derived from the region features. The resulting normalized relation scores () from the softmax function () are used to reweight the corresponding ‘value’ vectors. Without loss of generality (footnote 1), the attended features are given by,
Next, these low-dimensional self-attended features from each head are channel-wise concatenated and processed by a convolution layer to generate output ,
To encourage the network to selectively focus on adding complementary information to the ‘source’ latent feature , a residual branch is added to the attended features and further processed with a small residual sub-network , comprising two convolution layers, to help the network first focus on the local neighbourhood and then progressively pay attention to the other-level features. The enriched region-based features from the RCB are given by,
Consequently, the discriminability of the latent features is enhanced by self-attending to the context of different regions in the image, resulting in enriched features .
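The RCB steps above (per-head query-key-value projection, softmax-normalized relation scores, concatenation of heads, output projection, residual connection) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: 1x1 convolutions are written as per-region matrix products, the scaling by the square root of the head dimension is borrowed from standard scaled dot-product attention, the additional two-layer residual sub-network is omitted, and all names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_context_block(latent, Wq, Wk, Wv, Wo):
    """Multi-head self-attention over regions with a residual connection.

    latent:     (N, d) latent region features
    Wq, Wk, Wv: (H, d, d_h) per-head projection weights (hypothetical names)
    Wo:         (H * d_h, d) output projection applied to the concatenated heads
    """
    heads = []
    for q_w, k_w, v_w in zip(Wq, Wk, Wv):
        q, k, v = latent @ q_w, latent @ k_w, latent @ v_w     # each (N, d_h)
        # Scaled dot-product relation scores; the scaling is an assumption.
        relation = softmax(q @ k.T / np.sqrt(q.shape[1]))      # (N, N), rows sum to 1
        heads.append(relation @ v)                             # reweighted values
    attended = np.concatenate(heads, axis=1) @ Wo              # (N, d)
    return latent + attended                                   # residual branch

rng = np.random.default_rng(0)
N, d, H, d_h = 6, 8, 2, 4
latent = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(H, d, d_h)) for _ in range(3))
Wo = rng.normal(size=(H * d_h, d))
enriched = region_context_block(latent, Wq, Wk, Wv, Wo)
print(enriched.shape)
```

The output keeps one feature vector per region, so the spatial resolution is unchanged by the block.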
As discussed earlier, the RCB captures the regional context in the image, enabling reasoning about all regions together using pair-wise relations between them. In this way, RCB enriches the latent feature inputs . However, such a region-based contextual attention does not effectively encode the global scene-level context of the image, which is necessary for understanding abstract scene concepts like night-time, protest, clouds, etc. Understanding such labels from local regional contexts is challenging due to their abstract nature. Thus, in order to better capture the holistic scene-level context, we introduce a scene contextualized block (SCB) within our BiAM. Our SCB attends to the region-based latent features , based on their congruence with the global image feature (see Fig. 3(c)). To this end, the learnable weights project the features to a -dimensional space to obtain the global ‘key’ vectors , while the latent features are spatially average pooled to create the ‘query’ vectors ,
The region-based latent features are retained as ‘value’ features . Given these query-key-value triplets, first, the query is used to find its correlation with the key . The resulting relation score vectors are then used to reweight the corresponding channels in value features to obtain the attended features , given by,
where and denote channel-wise and element-wise multiplications. The channel-wise operation is chosen here since we want to use the global contextualized features to dictate kernel-wise importance of the feature channels for aggregating relevant contextual cues without disrupting the local filter signature. Similar to RCB, to encourage the network to selectively focus on adding complementary information to the ‘source’ , a residual branch is added after processing the attended features through a convolution layer . The scene-context enriched features from the SCB are given by,
In order to ensure the enrichment due to both region and global contexts are well captured, the enriched features ( and ) from both region and scene contextualized blocks are channel-wise concatenated and processed through a channel-reducing convolution layer to obtain the final enriched features , given by,
Fig. 4 shows that encoding scene context into the region-based features improves the attention maps of scene-level labels (e.g., protest), which were hard to attend to using only the region context. Consequently, our bi-level attention module effectively reasons about all the image regions together using pair-wise relations between them, while being able to utilize the whole image (holistic) scene as context.
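The scene contextualized block and the final fusion step can be sketched in the same style. The sketch below assumes that the query-key correlation is an element-wise product passed through a sigmoid, which then gates the channels of the value features; it also writes 1x1 convolutions as matrix products. All function names, weight names, and sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scene_context_block(latent, global_feat, Wg, Wc):
    """Gate the channels of every region feature with a scene-level relation vector.

    latent:      (N, d) latent region features, also used as the values
    global_feat: (d_g,) global image feature
    Wg:          (d_g, d) projection of the global feature to the key (hypothetical name)
    Wc:          (d, d) layer applied to the attended features before the residual
    """
    query = latent.mean(axis=0)        # (d,): spatially average-pooled latent features
    key = global_feat @ Wg             # (d,): projected global feature
    gate = sigmoid(query * key)        # (d,): element-wise relation scores (assumed form)
    attended = latent * gate           # channel-wise reweighting, broadcast over regions
    return latent + attended @ Wc      # residual branch

def fuse(region_enriched, scene_enriched, Wf):
    """Concatenate both enriched features channel-wise and reduce back to d channels."""
    return np.concatenate([region_enriched, scene_enriched], axis=1) @ Wf

rng = np.random.default_rng(0)
N, d, d_g = 6, 8, 10
latent = rng.normal(size=(N, d))
global_feat = rng.normal(size=d_g)
Wg = rng.normal(size=(d_g, d))
Wc = rng.normal(size=(d, d))
Wf = rng.normal(size=(2 * d, d))
scene_enriched = scene_context_block(latent, global_feat, Wg, Wc)
# In the full module the first argument would be the RCB output; latent stands in here.
final = fuse(latent, scene_enriched, Wf)
print(scene_enriched.shape, final.shape)
```

The gate has one value per channel and is shared by all regions, so the scene context rescales feature channels without mixing information across spatial locations.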
As discussed earlier, discriminative region-based features are learned and region-wise classified to obtain class-specific response maps (using Eq. 1). The response maps are further top-k pooled spatially to compute the image-level per-class scores . The network is trained using a simple, yet effective ranking loss on the predicted scores , given by,
Here, denotes the positive labels in image . The ranking loss ensures that the predicted scores of the positive labels present in the image rank ahead, by a margin of at least , of the negative label scores.
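A pairwise hinge formulation is one common way to realize such a ranking objective. The sketch below penalizes every (positive, negative) label pair in which the positive score does not exceed the negative score by the margin. The default margin value and the plain sum over pairs, without normalization, are assumptions, since neither is stated in the surrounding text.

```python
import numpy as np

def ranking_loss(scores, labels, margin=1.0):
    """Pairwise hinge ranking loss for one image.

    scores: (C,) predicted image-level per-class scores
    labels: (C,) multi-hot ground truth (1 = positive, 0 = negative)
    margin: required gap between positive and negative scores (assumed value)
    """
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if pos.size == 0 or neg.size == 0:
        return 0.0
    # violations[i, j] > 0 when negative j is not ranked at least `margin` below positive i
    violations = np.maximum(margin + neg[None, :] - pos[:, None], 0.0)
    return float(violations.sum())

labels = np.array([1, 0, 0, 1])
well_ranked = ranking_loss(np.array([2.0, 0.5, -1.0, 1.5]), labels)
one_violation = ranking_loss(np.array([0.2, 0.5, -1.0, 1.5]), labels)
print(well_ranked, one_violation)
```

In the first call every positive score leads every negative score by at least the margin, so the loss is zero; in the second, the positive scored 0.2 falls below the negative scored 0.5 and contributes the only non-zero term.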
At test time, for the multi-label ZSL task, the unseen class attribute embeddings of the respective unseen classes are used (in place of ) for computing the class-specific response maps in Eq. 1. As in training, these response maps are then top-k pooled spatially to compute the image-level per-class scores . Similarly, for the multi-label GZSL task, the concatenated embeddings () of all the classes are used to classify the multi-label images.
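The switch between the ZSL and GZSL settings amounts to changing which attribute embeddings are passed to the classifier, while the learned projection stays fixed. The following sketch illustrates this with a hypothetical `classify` helper that averages the k highest region responses per class; the helper, the pooling choice, and the toy sizes are assumptions.

```python
import numpy as np

def classify(features, W, attributes, k):
    """Region-level classification followed by top-k pooling (assumed to be an average)."""
    response = attributes @ (features @ W.T).T             # (C, N) response maps
    return np.sort(response, axis=1)[:, -k:].mean(axis=1)  # (C,) image-level scores

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))       # enriched region features of one test image
W = rng.normal(size=(4, 8))            # projection learned on seen classes only
seen = rng.normal(size=(5, 4))         # seen-class attribute embeddings
unseen = rng.normal(size=(3, 4))       # unseen-class attribute embeddings

zsl_scores = classify(feats, W, unseen, k=3)                           # unseen classes only
gzsl_scores = classify(feats, W, np.concatenate([seen, unseen]), k=3)  # all classes
print(zsl_scores.shape, gzsl_scores.shape)
```

Each class is scored independently of the others, so an unseen class receives the same score in both settings; what changes under GZSL is that it must also be ranked against the seen classes.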
Datasets: We evaluate our approach on two benchmarks: NUS-WIDE  and Open Images .
The NUS-WIDE dataset comprises nearly K images with human-annotated categories, in addition to the labels obtained from Flickr user tags. As in [14, 47], the and labels are used as seen and unseen classes, respectively.
The Open Images (v4) is a large-scale dataset comprising nearly million training images along with and images in validation and test sets.
It has annotations with human and machine-generated labels. Here, labels, with at least training images, are selected as seen classes. The most frequent test labels that are absent in the training data are selected as unseen classes, as in .
Evaluation Metrics: We use F1 score at top-K predictions and mean Average Precision (mAP) as evaluation metrics, as in [32, 14]. The model’s ability to correctly rank labels in each image is measured by the F1, while its image ranking accuracy for each label is captured by the mAP.
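To make the distinction between the two metrics concrete, the sketch below computes an F1 score from the K highest-scoring labels of each image (ranking labels within an image) and an mAP from the ranking of images for each label. The aggregation of true positives over all images before computing precision and recall, and the skipping of labels without positives, are assumptions about the protocol rather than details given in the text.

```python
import numpy as np

def f1_at_k(scores, labels, k):
    """F1 over the k highest-scoring labels of each image.

    scores, labels: (num_images, num_classes); labels is an integer multi-hot array.
    """
    top = np.argsort(-scores, axis=1)[:, :k]     # indices of the k best labels per image
    pred = np.zeros_like(labels)
    np.put_along_axis(pred, top, 1, axis=1)      # mark the predicted labels
    tp = float((pred * labels).sum())
    precision = tp / pred.sum()
    recall = tp / max(labels.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

def mean_average_precision(scores, labels):
    """Mean over labels of the average precision of the per-label image ranking."""
    aps = []
    for c in range(labels.shape[1]):
        y = labels[:, c]
        if y.sum() == 0:
            continue                              # skip labels without any positive image
        hits = y[np.argsort(-scores[:, c])]       # ground truth in ranked order
        precision_at_rank = np.cumsum(hits) / (np.arange(len(hits)) + 1)
        aps.append((precision_at_rank * hits).sum() / hits.sum())
    return float(np.mean(aps))

scores = np.array([[0.9, 0.1, 0.8],
                   [0.2, 0.7, 0.3]])
labels = np.array([[1, 0, 1],
                   [0, 1, 0]])
f1 = f1_at_k(scores, labels, k=1)
m_ap = mean_average_precision(scores, labels)
print(f1, m_ap)
```

In this toy case both top-1 predictions are correct, but one of the three ground-truth labels is missed, so precision is 1 and recall is 2/3; every label ranks its positive image first, giving an mAP of 1.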
NUS-WIDE: The state-of-the-art comparison for zero-shot (ZSL) and generalized zero-shot (GZSL) classification is presented in Tab. 1. The results are reported in terms of mAP and F1 score at top- predictions (). The approach of Fast0Tag , which finds principal directions in the attribute embedding space for ranking the positive tags ahead of negative tags, achieves mAP on the ZSL task. The recently introduced LESA , which employs a shared multi-attention mechanism to recognize labels in an image, improves the performance over Fast0Tag, achieving mAP. Our approach outperforms LESA with an absolute gain of mAP. Furthermore, our approach achieves consistent improvement over the state-of-the-art in terms of F1 (), achieving gains as high as at .
Similarly, on the GZSL task, our approach achieves an mAP score of , outperforming LESA with an absolute gain of . Moreover, consistent performance improvement in terms of F1 is achieved over LESA by our approach, with absolute gains of and at and .
Open Images: Tab. 2 shows the state-of-the-art comparison for multi-label ZSL and GZSL tasks. The results are reported in terms of mAP and F1 score at top- predictions (). We follow the same evaluation protocol as in the concurrent work of SDL . Since Open Images has significantly larger number of labels, in comparison to NUS-WIDE, ranking them within an image is more challenging. This is reflected by the lower F1 scores in the table. Among existing methods, LESA obtains an mAP of for the ZSL task. In comparison, our approach outperforms LESA by achieving mAP with an absolute gain of . Furthermore, our approach performs favorably against the best existing approach with F1 scores of and at and . It is worth noting that the ZSL task is challenging due to the high number of unseen labels (). As in ZSL, our approach obtains a significant gain of mAP over the best published results for GZSL and also achieves favorable performance in F1. Additional details and results are presented in Appendix A.
Impact of region-based classification: To analyse this impact, we train our proposed framework without region-based classification, where the enriched features are spatially average-pooled to a single feature representation (of size ) per image and then classified. Fig. 5 shows the performance comparison between our frameworks trained with and without region-based classification in terms of mAP and F1. Since images have large and diverse set of positive labels, spatially aggregating features without the region-based classification (blue bars), leads to inter-class feature entanglement, as discussed in Sec. 2.1. Instead, preserving the spatial dimension by classifying the region-based features, as in the proposed framework (red bars), mitigates the inter-class feature entanglement to a large extent. This leads to a superior performance for the region-based classification on both multi-label ZSL and GZSL tasks. These results suggest the importance of region-based classification for learning discriminative features in large-scale multi-label (G)ZSL tasks. Furthermore, Fig. 6 presents a t-SNE visualization showing the impact of our region-level classification framework on unseen classes from NUS-WIDE.
Impact of the proposed BiAM: Here, we analyse the impact of our feature enrichment mechanism (BiAM) to obtain discriminative feature representations. Tab. 3 presents the comparison between region-based classification pipelines based on standard features and discriminative features obtained from our BiAM on NUS-WIDE. We also present results of our RCB and SCB blocks alone. Both RCB alone and SCB alone consistently improve the (G)ZSL performance over the standard region-based features. This shows that our region-based classification pipeline benefits from the discriminative features obtained through the two complementary attention blocks. Furthermore, best results are obtained with our BiAM that comprises both RCB and SCB blocks, demonstrating the importance of encoding both region and scene context information. Fig. 8 shows a comparison between the standard features-based classification and the proposed classification framework utilizing BiAM on example unseen class images.
Varying the attention modules: Tab. 4 (left) shows the comparison on NUS-WIDE when ablating RCB and SCB modules in our BiAM. Including LayerNorm in RCB, replacing its softmax with sigmoid, or replacing sigmoid with softmax in SCB results in sub-optimal performance compared to our final BiAM. Similarly, replacing our BiAM with existing Non-Local and Criss-cross attention blocks also results in reduced performance (see Tab. 4 (right)). This shows the efficacy of BiAM, which integrates both region and holistic scene context.
Compute and run-time complexity: Tab. 5 shows that our approach achieves significant performance gains of % and % over LESA with comparable FLOPs, memory cost, training and inference run-times, on NUS-WIDE and Open Images, respectively. For a fair comparison, both methods are run on the same Tesla V100.
Additional examples and failure cases of our model, such as confusing abstract classes (e.g., sunset vs. sunrise) and fine-grained classes, are provided in Appendix B.
Varying the hyperparameters: Fig. 7 shows the ZSL performance of our framework when varying heads , k in top-k and number of regions (). Performance improves as is increased till and drops beyond , likely due to overfitting to seen classes. Similarly, as top-k increases beyond , features of spatially-small classes entangle and reduce the discriminability. Furthermore, decreasing the regions leads to multiple classes overlapping in the same regions, causing feature entanglement and a performance drop.
In addition to multi-label (generalized) zero-shot classification, we evaluate our proposed region-based classification framework on the standard multi-label classification task. Here, image instances for all the labels are present in training. The state-of-the-art comparison for the standard multi-label classification on NUS-WIDE with human annotated labels is shown in Tab. 6. Among existing methods, the work of  and LESA  achieve mAP scores of and , respectively. Our approach outperforms all published methods and achieves a significant gain of mAP over the state of the art. Furthermore, our approach performs favorably against existing methods in terms of F1.
Several works [39, 28, 19, 20, 44, 23, 45] have researched the conventional single-label ZSL problem. In contrast, a few works [21, 47, 18, 14, 12] have investigated the more challenging problem of multi-label ZSL. Mensink et al. propose an approach based on using co-occurrence statistics for multi-label ZSL. Zhang et al. introduce a method that utilizes linear mappings and non-linear deep networks to approximate the principal direction from an input image. The work of investigates incorporating knowledge graphs to reason about relationships between multiple labels. Recently, Huynh and Elhamifar introduce a shared attention-based multi-label ZSL approach, where the shared attentions are label-agnostic and are trained to focus on relevant foreground regions by utilizing a formulation based on multiple loss terms.
Prior works have shown that deep convolutional network-based visual recognition models implicitly rely on contextual information. Recently, self-attention models have achieved promising performance for machine translation and natural language processing [31, 38, 7, 8]. This has inspired studies to investigate self-attention and related ideas for vision tasks, such as object recognition, image synthesis and video prediction. Self-attention strives to learn the relationships between elements of a sequence by estimating the relevance of one item to other items. Motivated by its success in several vision tasks, we introduce a multi-label zero-shot region-based classification approach that utilizes self-attention in the proposed bi-level attention module to reason about all regions together using pair-wise relations between these regions. To complement the self-attentive region features with the holistic scene context information, we integrate a global scene prior, which enables us to enrich the region-level features with both region and scene context information.
We proposed a region-based classification framework comprising a bi-level attention module for large-scale multi-label zero-shot learning. The proposed classification framework design preserves the spatial resolution of features to keep the multi-class information disentangled. This enables it to deal effectively with a large number of co-existing categories in an image. To contextualize and enrich the region features in our classification framework, we introduced a bi-level attention module that incorporates both region and scene context information, generating discriminative feature representations. Our simple but effective approach sets a new state of the art on two large-scale benchmarks and obtains absolute gains as high as ZSL mAP, compared to the best published results.
Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In ICLR.
Transformer-XL: attentive language models beyond a fixed-length context. In ACL.
Maximizing subset accuracy with recurrent neural networks in multi-label classification. In NeurIPS.
Non-local neural networks. In CVPR.
Self-attention generative adversarial networks. In ICML.
Similar to Sec. 3.3, where we evaluate our approach for the standard multi-label classification on the NUS-WIDE dataset , here, we also evaluate on the large-scale Open Images dataset . Tab. 7 shows the state-of-the-art comparison for the standard multi-label classification on Open Images. Here, classes are used for both training and evaluation. Test samples with missing labels for these classes are removed during evaluation, as in . Due to significantly larger number of labels in Open Images, ranking the labels within an image is more challenging. This is reflected by the lower F1 scores in the table. Among existing methods, Fast0Tag  and LESA  achieve an F1 score of and at . Our approach achieves favorable performance against the existing approaches, achieving an F1 score of at . The proposed approach also achieves superior performance in terms of mAP score, compared to existing methods and obtains an absolute gain of mAP over the best existing method.
In Sec. 3, for a fair comparison with existing works such as Fast0Tag and LESA, we employed a pretrained VGG-19 as the backbone for extracting region-level and global-level features of images. However, a backbone pretrained with supervision does not strictly conform to the zero-shot paradigm if there is any overlap between the unseen classes and the classes used for pretraining. To avoid using a network pretrained with supervision, we conduct an experiment using the recent self-supervised DINO ResNet-50 backbone trained on ImageNet without any labels. Tab. 8 shows that our approach (BiAM) significantly outperforms LESA even with a self-supervised pretrained backbone on both benchmarks: NUS-WIDE and Open Images. Absolute gains as high as mAP are obtained for NUS-WIDE on the ZSL task. Similar favorable gains are also obtained for the GZSL task on both datasets. These results show that, irrespective of the backbone used for extracting the image features, our BiAM approach performs favorably against existing methods, achieving significant gains across different datasets on both ZSL and GZSL tasks.
Fig. 9 shows the qualitative results for multi-label (generalized) zero-shot learning. Nine example images from the test set of the NUS-WIDE dataset are presented in each figure. The comparison is shown between the standard region-based features and our discriminative region-based features. Alongside each image, top-K predictions for both approaches are shown with true positives and false positives. In general, our approach learns discriminative region-based features and achieves increased true positive predictions along with reduced false positives, compared to the standard region-based features. For example, categories such as reflection and water in Fig. 9(b), ocean and sky in Fig. 9(g), boat and sky in Fig. 9(j) along with graveyard and england in Fig. 9(k) are correctly predicted. Both approaches predict a few confusing classes such as beach and surf in Fig. 9(d), in addition to sunrise and sunset in Fig. 9(l), which are hard to differentiate using visual cues alone. Moreover, false positives that are predicted by the standard region-based features are reduced by our discriminative region-based features, e.g., vehicle in Fig. 9(g), soccer in Fig. 9(h), balloons in Fig. 9(j), and ocean in Fig. 9(k). These results suggest that our approach based on discriminative region features achieves promising performance against the standard features, for multi-label (generalized) zero-shot classification.
Fig. 10 and 11 show the visualizations of attention maps for the ground truth classes in example test images from NUS-WIDE and Open Images, respectively. Alongside each example, class-specific maps for the unseen classes are shown with the corresponding labels on top. In general, we observe that these maps focus reasonably well on the desired classes. For example, promising class-specific attention is captured for zebra in Fig. 10(a), vehicle in Fig. 10(b), buildings in Fig. 10(d), Keelboat in Fig. 11(c), Boeing 717 in Fig. 11(e) and Exercise in Fig. 11(i). Although we observe that the attention maps of visually similar classes overlap for sky and clouds in Fig. 10(d), these abstract categories, including reflection in Fig. 10(a) and nighttime in Fig. 10(c), are well captured. These qualitative results show that our proposed approach (BiAM) generates promising class-specific attention maps, leading to improved multi-label (generalized) zero-shot classification.