Discriminative Region-based Multi-Label Zero-Shot Learning

08/20/2021 · Sanath Narayan et al.

Multi-label zero-shot learning (ZSL) is a more realistic counterpart of standard single-label ZSL since several objects can co-exist in a natural image. However, the occurrence of multiple objects complicates the reasoning and requires region-specific processing of visual features to preserve their contextual cues. We note that the best existing multi-label ZSL method takes a shared approach towards attending to region features with a common set of attention maps for all the classes. Such shared maps lead to diffused attention, which does not discriminatively focus on relevant locations when the number of classes is large. Moreover, mapping spatially-pooled visual features to the class semantics leads to inter-class feature entanglement, thus hampering the classification. Here, we propose an alternate approach towards region-based discriminability-preserving multi-label zero-shot classification. Our approach maintains the spatial resolution to preserve region-level characteristics and utilizes a bi-level attention module (BiAM) to enrich the features by incorporating both region and scene context information. The enriched region-level features are then mapped to the class semantics and only their class predictions are spatially pooled to obtain image-level predictions, thereby keeping the multi-class features disentangled. Our approach sets a new state of the art on two large-scale multi-label zero-shot benchmarks: NUS-WIDE and Open Images. On NUS-WIDE, our approach achieves an absolute gain of 6.9 mAP for ZSL, compared to the best published results.

Code repository: BiAM — official PyTorch implementation (ICCV 2021), with state-of-the-art results on NUS-WIDE and Open Images.

1 Introduction

Multi-label classification strives to recognize all the categories (labels) present in an image. In the standard multi-label classification [33, 40, 16, 5, 22, 41, 42] setting, the category labels in both the train and test sets are identical. In contrast, the task of multi-label zero-shot learning (ZSL) is to recognize multiple new unseen categories in images at test time, without having seen the corresponding visual examples during training. In the generalized ZSL (GZSL) setting, test images can simultaneously contain multiple seen and unseen classes. GZSL is particularly challenging in the large-scale multi-label setting, where several diverse categories can occur together in a single image (as in NUS-WIDE [6]) along with a large number of unseen categories at test time (e.g., 400 unseen classes in Open Images [17]). Here, we investigate this challenging problem of multi-label (generalized) zero-shot classification.
Existing multi-label (G)ZSL methods tackle the problem by using global image features [21, 47], structured knowledge graphs [18] and attention schemes [14]. Among these, the recently introduced LESA [14] proposes a shared attention scheme based on region-based feature representations and achieves state-of-the-art results. LESA learns multiple attention maps that are shared across all categories. The region-based image features are weighted by these shared attentions and then spatially aggregated. Subsequently, the aggregated features are projected to the label space via a joint visual-semantic embedding space.
While achieving promising results, LESA suffers from two key limitations. Firstly, classification is performed on features obtained using a set of attention maps that are shared across all the classes. In such a shared attention framework, many categories are observed to be inferred from only a few dominant attention maps, which tend to be diffused across an image rather than discriminatively focusing on regions likely belonging to a specific class (see Fig. 1). This is problematic for large-scale benchmarks comprising several hundred categories, e.g., more than 7,000 seen classes in Open Images [17], with significant inter- and intra-class variations. Secondly, attended features are spatially pooled before projection to the label space, thus entangling the multi-label information in the collapsed image-level feature vectors. Since multiple diverse labels can appear in an image, the class-specific discriminability within such a collapsed representation is severely hampered.

Figure 1: Comparison, in terms of attention visualization, between shared attention-based LESA [14] and our approach on example NUS-WIDE test images. For each image, the attention visualizations of the positive labels within that image are shown for LESA (top row) and our approach (bottom row). In the case of LESA, all classes in these examples are inferred from the eighth shared attention module, except for the dog class in (b), which is inferred from the ninth module. As seen in these examples, these dominant attention maps struggle to discriminatively focus on relevant (class-specific) regions. In contrast, our proposed approach based on a bi-level attention module (BiAM) produces attention maps that preserve class-specific discriminability, leading to an enriched feature representation. Our BiAM effectively captures region-level semantics as well as global scene-level context, thereby enabling it to accurately attend to object classes (e.g., the window class in (a)) and abstract concepts (e.g., the reflection class in (a)). Best viewed zoomed in.

1.1 Contributions

To address the aforementioned problems, we pose large-scale multi-label ZSL as a region-level classification problem. We introduce a simple yet effective region-level classification framework that maintains the spatial resolution of features to keep the multi-class information disentangled when dealing with a large number of co-existing classes in an image. Our framework comprises a bi-level attention module (BiAM) to contextualize and obtain highly discriminative region-level feature representations. Our BiAM contains region and global (scene) contextualized blocks and enables reasoning about all the regions together using pair-wise relations between them, in addition to utilizing the holistic scene context. The region contextualized block enriches each region feature by attending to all regions within the image, whereas the scene contextualized block enhances the region features based on their congruence to the scene feature representation. The resulting discriminative features, obtained through our BiAM, are then utilized to perform region-based classification through a compatibility function. Afterwards, a spatial top-k pooling is performed over each class to obtain the final predictions. Experiments are performed on two challenging large-scale multi-label zero-shot benchmarks: NUS-WIDE [6] and Open Images [17]. Our approach performs favorably against existing methods, setting a new state of the art on both benchmarks. Particularly, on NUS-WIDE, our approach achieves an absolute gain of 6.9 in terms of mAP for the ZSL task, over the best published results [14].

2 Proposed Method

Here, we introduce a region-based discriminability-preserving multi-label zero-shot classification framework, aided by learning rich features that explicitly encode both the region and the global scene context of an image.

Problem Formulation: Let $\mathbf{x}$ denote the feature instance of a multi-label image and $\mathbf{y} \in \{0,1\}^{S}$ the corresponding multi-hot label over the set of $S$ seen class labels $\mathcal{Y}^{s}$. Further, let $\mathbf{A}^{s} = [\mathbf{a}_1, \ldots, \mathbf{a}_S] \in \mathbb{R}^{d \times S}$ denote the $d$-dimensional attribute embeddings, which encode the semantic relationships between seen classes. With $P$ as the number of positive labels in an image, we denote the set of attribute embeddings for the image as $\{\mathbf{a}_p\}_{p=1}^{P}$, where $\mathbf{a}_p \in \mathbb{R}^{d}$. The goal in (generalized) zero-shot learning is to learn a mapping from $\mathbf{x}$ to the label space $\mathcal{Y}$, aided by the attribute embeddings $\mathbf{A}^{s}$, such that the mapping can be adapted to include the $U$ unseen classes (with embeddings $\mathbf{A}^{u}$) at test time, i.e., $\mathcal{Y} = \mathcal{Y}^{u}$ for ZSL and $\mathcal{Y} = \mathcal{Y}^{s} \cup \mathcal{Y}^{u}$ for the GZSL setting. Here, $C = S + U$ represents the total number of seen and unseen classes.

Figure 2: Comparison of our region-level classification framework (b) with the shared attention-based classification pipeline (a) in [14]. The shared attention-based pipeline performs an attention-weighted spatial averaging of the region-based features to generate a feature vector per shared attention. These (spatially pooled) features are then classified to obtain class scores per shared attention, which are max-pooled to obtain image-level class predictions. In contrast, our framework minimizes inter-class feature entanglement by enhancing the region-based features through a feature enrichment mechanism, which preserves the spatial resolution of the features. Each region-based enriched feature representation is then classified over the seen classes. Afterwards, the per-class top-k activations are aggregated to obtain image-level predictions.

2.1 Region-level Multi-label ZSL

As discussed earlier, recognizing a diverse and wide range of category labels in images under the (generalized) zero-shot setting is challenging. The problem arises, primarily, due to the entanglement of features of the various different classes present in an image. Fig. 2(a) illustrates this feature entanglement in the shared attention-based classification pipeline [14], which integrates multi-label features by performing a weighted spatial averaging of the region-based features based on the shared attention maps. In this work, we argue that entangled feature representations are sub-optimal for multi-label classification and instead propose to alleviate this issue by posing large-scale multi-label ZSL as a region-level classification problem. To this end, we introduce a simple but effective region-level classification framework that first enriches the region-based features through the proposed feature enrichment mechanism. It then classifies the enriched region-based features, followed by spatially pooling the per-class region-based scores to obtain the final image-level class predictions (see Fig. 2(b)). Consequently, our framework minimizes inter-class feature entanglement and enhances the classification performance. Fig. 3 shows our overall proposed framework. Let $\mathbf{E} \in \mathbb{R}^{H \times W \times D}$ be the output region-based features, which are to be classified, from our proposed enrichment mechanism (i.e., BiAM). Here, $H \times W$ denotes the spatial extent of the region-based features with $R = H \cdot W$ regions. These features are first aligned with the class-specific attribute embeddings $\mathbf{A}^{s}$ of the seen classes. This alignment is performed, i.e., a joint visual-semantic space is learned, so that the classifier can be adapted to the unseen classes at test time. The aligned region-based features are classified to obtain class-specific response maps given by,

$\mathbf{J} = \mathbf{E}\,\mathbf{W}_e\,\mathbf{A}^{s} \in \mathbb{R}^{R \times S}, \qquad (1)$

where $\mathbf{E}$ is reshaped to $\mathbb{R}^{R \times D}$ and $\mathbf{W}_e \in \mathbb{R}^{D \times d}$ is a learnable weight matrix that maps the visual features to the attribute-embedding space of the seen classes ($\mathbf{A}^{s}$). The response maps $\mathbf{J}$ are then top-$k$ pooled along the spatial dimension to obtain image-level per-class scores $\mathbf{s} \in \mathbb{R}^{S}$, which are then utilized for training the network (in Sec. 2.3). Such a region-level classification, followed by a score-level pooling, helps to preserve the discriminability of the features in each of the regions by minimizing the feature entanglement of the different positive classes occurring in the image.
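The following PyTorch sketch illustrates this region-level classification with top-k spatial pooling (Eq. 1). It is a minimal illustration under stated assumptions, not the official BiAM code: the module name, the use of a linear layer for $\mathbf{W}_e$ and the averaging of the top-k responses are illustrative choices.

```python
import torch
import torch.nn as nn

class RegionLevelClassifier(nn.Module):
    """Maps enriched region features to per-region class responses (Eq. 1)
    and spatially top-k pools them into image-level class scores."""
    def __init__(self, feat_dim, attr_dim, topk=10):
        super().__init__()
        # Learnable W_e projecting visual features into the attribute space
        self.W_e = nn.Linear(feat_dim, attr_dim, bias=False)
        self.topk = topk

    def forward(self, region_feats, attr_embeds):
        # region_feats: (B, R, D) enriched features from BiAM
        # attr_embeds:  (S, d) class attribute embeddings (e.g., GloVe vectors)
        aligned = self.W_e(region_feats)                # (B, R, d)
        response = aligned @ attr_embeds.t()            # (B, R, S) class-specific response maps J
        topk_vals, _ = response.topk(self.topk, dim=1)  # k highest region responses per class
        return topk_vals.mean(dim=1)                    # (B, S) image-level per-class scores s
```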

The aforementioned region-level multi-label ZSL framework relies on discriminative region-based features. Standard region-based features only encode local region-specific information and do not explicitly reason about all the regions together. Moreover, region-based features do not possess image-level holistic scene information. Next, we introduce a bi-level attention module (BiAM) to enhance feature discriminability and generate the enriched features $\mathbf{E}$.

2.2 Bi-level Attention Module

Here, we present a bi-level attention module (BiAM) that enhances region-based features by incorporating both region and scene context information, without sacrificing the spatial resolution. Our BiAM comprises region and scene contextualized blocks, which are described next.

Figure 3: Our region-level multi-label (G)ZSL framework: The top row shows an overview of our network architecture. Given an image, the region-level features are first obtained using a backbone. The region features are enriched using a Bi-level Attention Module (BiAM). This module incorporates region (b) and scene (c) contextualized blocks, which learn to aggregate region-level and scene-specific context, respectively, which is in turn used to enhance the region features. The enriched features are mapped to the joint visual-semantic space to relate them with the class semantics, obtaining class-specific response maps. Per-class region-based prediction scores are then spatially pooled to generate the final image-level predictions. Notably, our design ensures region-level feature enrichment while preserving the spatial resolution until class predictions are made, which minimizes inter-class feature entanglement, a key requisite for large-scale multi-label (G)ZSL.

2.2.1 Region Contextualized Block

The region contextualized block (RCB) enriches the region-based latent features by capturing the contexts from different regions in the image. We observe that encoding the individual contexts of different regions in an image improves the discriminability of standard region-based features, e.g., the context of a region containing a window can aid in identifying other, possibly texture-less, regions in the image as house or building. Thus, inspired by the multi-headed self-attention [31], our RCB allows the features in different regions to interact with each other and identify the regions to be paid more attention to for enriching themselves (see Fig. 3(b)). To this end, the input features $\mathbf{X}$ are first processed by a convolution layer to obtain latent features $\tilde{\mathbf{X}} \in \mathbb{R}^{R \times D}$. These latent features are then projected to a low-dimensional space ($D_h = D/H$) to create query-key-value triplets using a total of $H$ projection heads,

$\mathbf{Q}_h = \tilde{\mathbf{X}}\,\mathbf{W}^{Q}_{h}, \quad \mathbf{K}_h = \tilde{\mathbf{X}}\,\mathbf{W}^{K}_{h}, \quad \mathbf{V}_h = \tilde{\mathbf{X}}\,\mathbf{W}^{V}_{h}, \qquad (2)$

where $\mathbf{W}^{Q}_{h}$, $\mathbf{W}^{K}_{h}$ and $\mathbf{W}^{V}_{h}$ are learnable weights of convolution layers with input and output channels $D$ and $D_h$, respectively. The query vector (of length $D_h$) derived from each region feature (the query $\mathbf{Q}_h$ can be considered as $R$ queries represented by $D_h$-dimensional features each; similar observations hold for the keys and values) is used to find its correlation with the keys obtained from all the region features, while the value embedding holds the status of the current form of each region feature.

Given these triplets for each head, first, an intra-head processing is performed by relating each query vector with the 'keys' derived from the region features. The resulting normalized relation scores from the softmax function are used to reweight the corresponding 'value' vectors. Without loss of generality, the attended features of head $h$ are given by,

$\hat{\mathbf{V}}_h = \mathrm{softmax}\!\left(\mathbf{Q}_h\,\mathbf{K}_h^{\top}/\sqrt{D_h}\right)\mathbf{V}_h. \qquad (3)$

Next, these low-dimensional self-attended features from each head are channel-wise concatenated and processed by a convolution layer $f_r$ to generate the output $\mathbf{F}_r$,

$\mathbf{F}_r = f_r\big(\big[\hat{\mathbf{V}}_1;\, \hat{\mathbf{V}}_2;\, \ldots;\, \hat{\mathbf{V}}_H\big]\big). \qquad (4)$

To encourage the network to selectively focus on adding complementary information to the 'source' latent features $\tilde{\mathbf{X}}$, a residual branch is added to the attended features, which are further processed with a small residual sub-network $\mathcal{G}$, comprising two convolution layers, to help the network first focus on the local neighbourhood and then progressively pay attention to the other-level features. The enriched region-based features from the RCB are given by,

$\mathbf{E}_r = \mathcal{G}\big(\tilde{\mathbf{X}} + \mathbf{F}_r\big). \qquad (5)$

Consequently, the discriminability of the latent features is enhanced by self-attending to the context of different regions in the image, resulting in the enriched features $\mathbf{E}_r$.
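Below is a minimal PyTorch sketch of an RCB-style block (Eqs. 2–5). It assumes 1x1 convolutions for the projections, scaled dot-product attention and a two-convolution residual sub-network; the exact layer configuration of the official implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionContextualizedBlock(nn.Module):
    """Sketch of the RCB: multi-head self-attention over the R = H*W regions."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.heads, self.dim_head = heads, dim // heads
        self.to_latent = nn.Conv2d(dim, dim, kernel_size=1)   # latent features X~
        self.to_qkv = nn.Conv2d(dim, 3 * dim, kernel_size=1)  # per-head Q, K, V (Eq. 2)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)        # f_r after head concat (Eq. 4)
        self.G = nn.Sequential(                               # small residual sub-network G (Eq. 5)
            nn.Conv2d(dim, dim, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=1))

    def forward(self, x):
        # x: (B, D, H, W) region-based backbone features
        b, d, h, w = x.shape
        latent = self.to_latent(x)
        q, k, v = self.to_qkv(latent).chunk(3, dim=1)         # each (B, D, H, W)

        def split(t):  # reshape to (B, heads, R, dim_head)
            return t.reshape(b, self.heads, self.dim_head, h * w).transpose(2, 3)

        q, k, v = split(q), split(k), split(v)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.dim_head ** 0.5, dim=-1)  # (B, heads, R, R), Eq. 3
        out = attn @ v                                        # attended values
        out = out.transpose(2, 3).reshape(b, d, h, w)         # channel-wise concat of heads
        f_r = self.proj(out)                                  # Eq. 4
        return self.G(latent + f_r)                           # enriched features E_r (Eq. 5)
```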

2.2.2 Scene Contextualized Block

As discussed earlier, the RCB captures the regional context in the image, enabling reasoning about all regions together using pair-wise relations between them. In this way, the RCB enriches the latent feature inputs $\tilde{\mathbf{X}}$. However, such region-based contextual attention does not effectively encode the global scene-level context of the image, which is necessary for understanding abstract scene concepts like night-time, protest, clouds, etc. Understanding such labels from local regional contexts is challenging due to their abstract nature. Thus, in order to better capture the holistic scene-level context, we introduce a scene contextualized block (SCB) within our BiAM. Our SCB attends to the region-based latent features $\tilde{\mathbf{X}}$, based on their congruence with the global image feature $\mathbf{g}$ (see Fig. 3(c)). To this end, the learnable weights $\mathbf{W}_g$ project the global feature to a $D$-dimensional space to obtain the global 'key' vector $\mathbf{k}$, while the latent features are spatially average pooled to create the 'query' vector $\mathbf{q}$,

$\mathbf{k} = \mathbf{W}_g\,\mathbf{g}, \qquad \mathbf{q} = \tfrac{1}{R}\sum_{i=1}^{R}\tilde{\mathbf{x}}_i. \qquad (6)$

The region-based latent features $\tilde{\mathbf{X}}$ are retained as the 'value' features. Given these query-key-value triplets, first, the query $\mathbf{q}$ is used to find its correlation with the key $\mathbf{k}$. The resulting relation score vector is then used to reweight the corresponding channels in the value features to obtain the attended features $\mathbf{F}_s$, given by,

$\mathbf{F}_s = \sigma\big(\mathbf{q} \odot \mathbf{k}\big) \otimes \tilde{\mathbf{X}}, \qquad (7)$

where $\sigma$ denotes the sigmoid function, and $\otimes$ and $\odot$ denote channel-wise and element-wise multiplications. The channel-wise operation is chosen here since we want to use the global contextualized features to dictate the kernel-wise importance of the feature channels for aggregating relevant contextual cues without disrupting the local filter signature. Similar to the RCB, to encourage the network to selectively focus on adding complementary information to the 'source' $\tilde{\mathbf{X}}$, a residual branch is added after processing the attended features through a convolution layer $f_s$. The scene-context enriched features from the SCB are given by,

$\mathbf{E}_s = \tilde{\mathbf{X}} + f_s\big(\mathbf{F}_s\big). \qquad (8)$
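A corresponding sketch of an SCB-style block (Eqs. 6–8) is given below. The sigmoid gating, the 4096-dimensional global feature and the 1x1 convolution before the residual addition are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class SceneContextualizedBlock(nn.Module):
    """Sketch of the SCB: channel-wise re-weighting of the region features
    based on their congruence with the global (scene) feature."""
    def __init__(self, dim=512, global_dim=4096):
        super().__init__()
        self.W_g = nn.Linear(global_dim, dim, bias=False)  # projects global feature to D dims (key, Eq. 6)
        self.f_s = nn.Conv2d(dim, dim, kernel_size=1)      # 1x1 conv before the residual add (Eq. 8)

    def forward(self, x, g):
        # x: (B, D, H, W) region-based latent features, g: (B, global_dim) scene feature
        b, d, h, w = x.shape
        key = self.W_g(g)                                  # (B, D) global 'key'
        query = x.mean(dim=(2, 3))                         # (B, D) spatially averaged 'query' (Eq. 6)
        gate = torch.sigmoid(query * key)                  # (B, D) channel relation scores (Eq. 7)
        f_s = x * gate.view(b, d, 1, 1)                    # channel-wise re-weighting of the 'value' features
        return x + self.f_s(f_s)                           # scene-enriched features E_s (Eq. 8)
```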
Figure 4: Effect of enhancing the region-based features through our feature enrichment mechanism: BiAM. The two complementary RCB and SCB blocks in BiAM integrate region-level semantics and global scene-level context, leading to a more discriminative feature representation. While the RCB alone (on the left) is able to capture the region-level semantics of the person class, it confuses the regions related to the protest label. However, encoding the global scene-level context from the SCB in BiAM (on the right) improves the semantic recognition of scene-level concepts like protest.

In order to ensure that the enrichment due to both the region and the global contexts is well captured, the enriched features ($\mathbf{E}_r$ and $\mathbf{E}_s$) from the region and scene contextualized blocks are channel-wise concatenated and processed through a channel-reducing convolution layer $f_e$ to obtain the final enriched features $\mathbf{E}$, given by,

$\mathbf{E} = f_e\big(\big[\mathbf{E}_r;\, \mathbf{E}_s\big]\big). \qquad (9)$

Fig. 4 shows that encoding the scene context into the region-based features improves the attention maps of scene-level labels (e.g., protest), which were hard to attend to using only the region context. Consequently, our bi-level attention module effectively reasons about all the image regions together using pair-wise relations between them, while also being able to utilize the whole image (holistic) scene as context.
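Putting the two blocks together, the sketch below fuses their outputs with a channel-reducing 1x1 convolution as in Eq. 9, reusing the illustrative RegionContextualizedBlock and SceneContextualizedBlock defined above; it is a minimal sketch, not the official module.

```python
import torch
import torch.nn as nn

class BiAM(nn.Module):
    """Sketch of the full bi-level attention module: RCB and SCB outputs are
    concatenated channel-wise and fused by a channel-reducing 1x1 conv (Eq. 9)."""
    def __init__(self, dim=512, global_dim=4096, heads=8):
        super().__init__()
        self.rcb = RegionContextualizedBlock(dim, heads)
        self.scb = SceneContextualizedBlock(dim, global_dim)
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)  # channel-reducing f_e

    def forward(self, x, g):
        e_r = self.rcb(x)                                   # region-contextualized features (Sec. 2.2.1)
        e_s = self.scb(x, g)                                # scene-contextualized features (Sec. 2.2.2)
        return self.fuse(torch.cat([e_r, e_s], dim=1))      # final enriched features E
```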

2.3 Training and Inference

As discussed earlier, discriminative region-based features are learned and region-wise classified to obtain the class-specific response maps $\mathbf{J}$ (using Eq. 1). The response maps are further top-$k$ pooled spatially to compute the image-level per-class scores $\mathbf{s}$. The network is trained using a simple, yet effective ranking loss on the predicted scores $\mathbf{s}$, given by,

$\mathcal{L}_{rank} = \sum_{p \in \mathcal{P}} \sum_{n \notin \mathcal{P}} \max\big(0,\ m + s_n - s_p\big). \qquad (10)$

Here, $\mathcal{P}$ denotes the set of positive labels in the image. The ranking loss ensures that the predicted scores of the positive labels present in the image rank ahead, by a margin of at least $m$, of the negative label scores.
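A vectorized PyTorch sketch of this ranking loss is given below; the hinge form follows Eq. 10, while the concrete margin value and the normalization over positive-negative pairs are assumptions for illustration.

```python
import torch

def ranking_loss(scores, labels, margin=1.0):
    """Pairwise ranking loss (Eq. 10): every positive label of an image should
    score higher than every negative label by at least `margin`."""
    # scores: (B, S) image-level class scores, labels: (B, S) multi-hot ground truth
    pos = scores.unsqueeze(2)                                    # (B, S, 1) scores treated as positives
    neg = scores.unsqueeze(1)                                    # (B, 1, S) scores treated as negatives
    pair_mask = labels.unsqueeze(2) * (1 - labels).unsqueeze(1)  # 1 for (positive, negative) pairs
    hinge = torch.clamp(margin + neg - pos, min=0.0)             # margin violations
    return (hinge * pair_mask).sum() / pair_mask.sum().clamp(min=1.0)
```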

At test time, for the multi-label ZSL task, the attribute embeddings $\mathbf{A}^{u}$ of the respective unseen classes are used (in place of $\mathbf{A}^{s}$) for computing the class-specific response maps in Eq. 1. As in training, these response maps are then top-$k$ pooled spatially to compute the image-level per-class scores. Similarly, for the multi-label GZSL task, the concatenated embeddings ($[\mathbf{A}^{s};\, \mathbf{A}^{u}]$) of all the classes are used to classify the multi-label images.
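At inference, this amounts to swapping the embedding matrix used by the region-level classifier, as in the sketch below (reusing the illustrative RegionLevelClassifier from Sec. 2.1); the function name and signature are hypothetical.

```python
import torch

def predict(region_feats, classifier, attr_seen, attr_unseen, mode="gzsl"):
    """ZSL/GZSL inference sketch: score only the unseen classes (ZSL) or the
    concatenation of seen and unseen classes (GZSL) with the same classifier."""
    attr = attr_unseen if mode == "zsl" else torch.cat([attr_seen, attr_unseen], dim=0)
    return classifier(region_feats, attr)   # (B, num_classes) image-level scores
```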

3 Experiments

Datasets: We evaluate our approach on two benchmarks: NUS-WIDE [6] and Open Images [17]. The NUS-WIDE dataset comprises nearly 270K images with 81 human-annotated categories, in addition to the 925 labels obtained from Flickr user tags. As in [14, 47], the 925 and 81 labels are used as seen and unseen classes, respectively. Open Images (v4) is a large-scale dataset comprising nearly 9 million training images along with separate validation and test sets. It has annotations with human and machine-generated labels. Here, 7,186 labels, each with at least 100 training images, are selected as seen classes. The 400 most frequent test labels that are absent in the training data are selected as unseen classes, as in [14].
Evaluation Metrics: We use the F1 score at top-K predictions and the mean Average Precision (mAP) as evaluation metrics, as in [32, 14]. The model's ability to correctly rank labels in each image is measured by the F1 score, while its image ranking accuracy for each label is captured by the mAP.
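For reference, a minimal NumPy/scikit-learn sketch of these two metrics is given below; tie-breaking and the handling of labels without positives are simplifying assumptions and may differ from the official evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def f1_at_k(scores, labels, k=3):
    """Image-centric F1@K: precision/recall over the top-K predictions per image."""
    # scores, labels: (N, C) arrays; labels are 0/1 ground-truth indicators
    topk = np.argsort(-scores, axis=1)[:, :k]                 # indices of the K highest-scored labels
    hits = np.take_along_axis(labels, topk, axis=1).sum()     # correct labels within the top-K sets
    precision = hits / (k * len(scores))
    recall = hits / max(labels.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

def mean_ap(scores, labels):
    """Label-centric mAP: average precision of ranking images for each class."""
    aps = [average_precision_score(labels[:, c], scores[:, c])
           for c in range(labels.shape[1]) if labels[:, c].any()]
    return float(np.mean(aps))
```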
Implementation Details: A pretrained VGG-19 [29] is used to extract features from the multi-label images, as in [47, 14]. The region-based features (of size 14x14x512) are extracted from the last convolutional block, along with the 4096-dimensional global features from the fully-connected layer. As in [14], l2-normalized 300-dimensional GloVe [26] vectors of the class names are used as the attribute embeddings. The two convolutions in the residual sub-network are followed by ReLU and batch normalization layers. The value of k for top-k pooling and the number of heads H are kept fixed (their effect is analysed in Sec. 3.2). For training, we use the ADAM optimizer with a gradual warm-up learning rate scheduler. Our model is trained with mini-batches for a fixed number of epochs on NUS-WIDE and Open Images.
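For reference, the backbone features described above can be extracted with torchvision as sketched below; the exact layer split used in the paper is an assumption here.

```python
import torch
import torchvision.models as models

# Backbone feature extraction sketch (torchvision >= 0.13 API): region features
# from the last conv block of VGG-19 (14x14x512 for a 224x224 input, obtained by
# dropping the final max-pool) and a 4096-d global feature from the penultimate
# fully-connected layer.
vgg = models.vgg19(weights="IMAGENET1K_V1").eval()
conv_body = torch.nn.Sequential(*list(vgg.features.children())[:-1])  # drop the last MaxPool
fc_body = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten(),
                              *list(vgg.classifier.children())[:-1])  # up to the 4096-d layer

with torch.no_grad():
    img = torch.randn(1, 3, 224, 224)      # placeholder for a pre-processed image
    region_feats = conv_body(img)          # (1, 512, 14, 14) region-based features
    global_feat = fc_body(img)             # (1, 4096) holistic scene feature
```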

3.1 State-of-the-art Comparison

NUS-WIDE: The state-of-the-art comparison for zero-shot (ZSL) and generalized zero-shot (GZSL) classification is presented in Tab. 1. The results are reported in terms of mAP and F1 score at top-K predictions (K ∈ {3, 5}). The approach of Fast0Tag [47], which finds principal directions in the attribute embedding space for ranking the positive tags ahead of negative tags, achieves 15.1 mAP on the ZSL task. The recently introduced LESA [14], which employs a shared multi-attention mechanism to recognize labels in an image, improves the performance over Fast0Tag, achieving 19.4 mAP. Our approach outperforms LESA with an absolute gain of 6.9 mAP. Furthermore, our approach achieves consistent improvement over the state of the art in terms of F1 (K ∈ {3, 5}), achieving gains as high as 2.0 at K = 5.

Similarly, on the GZSL task, our approach achieves an mAP score of 9.3, outperforming LESA with an absolute gain of 3.7. Moreover, consistent performance improvement in terms of F1 is achieved over LESA by our approach, with absolute gains of 1.7 and 2.2 at K = 3 and K = 5.

Method                      Task    mAP    F1 (K=3)   F1 (K=5)
CONSE [24]                  ZSL     9.4    21.6       20.2
                            GZSL    2.1     7.0        8.1
LabelEM [1]                 ZSL     7.1    19.2       19.5
                            GZSL    2.2     9.5       11.3
Fast0Tag [47]               ZSL    15.1    27.8       26.4
                            GZSL    3.7    11.5       13.5
Attention per Label [15]    ZSL    10.4    25.8       23.6
                            GZSL    3.7    10.9       13.2
Attention per Cluster [14]  ZSL    12.9    24.6       22.9
                            GZSL    2.6     6.4        7.7
LESA [14]                   ZSL    19.4    31.6       28.7
                            GZSL    5.6    14.4       16.8
Our Approach                ZSL    26.3    33.1       30.7
                            GZSL    9.3    16.1       19.0

Table 1: State-of-the-art comparison for multi-label ZSL and GZSL tasks on NUS-WIDE. We report the results in terms of mAP and F1 score at K ∈ {3, 5}. Our approach outperforms the state of the art for both ZSL and GZSL tasks, in terms of both mAP and F1 score.

Open Images: Tab. 2 shows the state-of-the-art comparison for multi-label ZSL and GZSL tasks. The results are reported in terms of mAP and F1 score at top-K predictions (K ∈ {10, 20}). We follow the same evaluation protocol as in the concurrent work of SDL [2]. Since Open Images has a significantly larger number of labels than NUS-WIDE, ranking them within an image is more challenging. This is reflected by the lower F1 scores in the table. Among existing methods, LESA obtains an mAP of 41.7 for the ZSL task. In comparison, our approach achieves 73.6 mAP, an absolute gain of 31.9. Furthermore, our approach performs favorably against the best existing approach with F1 scores of 8.3 and 5.5 at K = 10 and K = 20. It is worth noting that the ZSL task is challenging due to the high number of unseen labels (400). As in ZSL, our approach obtains a significant gain of 39.1 mAP over the best published results for GZSL and also achieves favorable performance in F1. Additional details and results are presented in Appendix A.

Method                      Task    mAP    F1 (K=10)  F1 (K=20)
CONSE [24]                  ZSL    40.4     0.4        0.3
                            GZSL   43.5     2.6        2.4
LabelEM [1]                 ZSL    40.5     0.5        0.4
                            GZSL   45.2     5.2        5.1
Fast0Tag [47]               ZSL    41.2     0.7        0.6
                            GZSL   45.2    16.0       12.9
Attention per Cluster [14]  ZSL    40.7     1.2        0.9
                            GZSL   44.9    16.9       13.5
LESA [14]                   ZSL    41.7     1.4        1.0
                            GZSL   45.4    17.4       14.3
Our Approach                ZSL    73.6     8.3        5.5
                            GZSL   84.5    19.1       15.9

Table 2: State-of-the-art comparison for multi-label ZSL and GZSL tasks on Open Images. Results are reported in terms of mAP and F1 score at K ∈ {10, 20}. Our approach sets a new state of the art for both tasks, in terms of both mAP and F1 score.

3.2 Ablation Study

Figure 5: Impact of region-based classification for the ZSL task on NUS-WIDE, in terms of mAP and F1 at K ∈ {3, 5}. Classifying spatially pooled features (blue bars) entangles the features of the different classes, resulting in sub-optimal performance. In contrast, our proposed approach, which classifies each region individually and then spatially pools the per-region class scores (red bars), minimizes the inter-class feature entanglement and achieves superior classification performance.

Impact of region-based classification: To analyse this impact, we train our proposed framework without region-based classification, where the enriched features are spatially average-pooled to a single feature vector per image and then classified. Fig. 5 shows the performance comparison between our frameworks trained with and without region-based classification in terms of mAP and F1. Since images have a large and diverse set of positive labels, spatially aggregating features without the region-based classification (blue bars) leads to inter-class feature entanglement, as discussed in Sec. 2.1. Instead, preserving the spatial dimension by classifying the region-based features, as in the proposed framework (red bars), mitigates the inter-class feature entanglement to a large extent. This leads to a superior performance for the region-based classification on both multi-label ZSL and GZSL tasks. These results suggest the importance of region-based classification for learning discriminative features in large-scale multi-label (G)ZSL tasks. Furthermore, Fig. 6 presents a t-SNE visualization showing the impact of our region-level classification framework on unseen classes from NUS-WIDE.

Method                    Task    mAP    F1 (K=3)   F1 (K=5)
Standard region features  ZSL    21.1    28.0       26.9
                          GZSL    6.8    12.0       14.5
RCB alone                 ZSL    23.7    31.9       29.0
                          GZSL    7.6    14.7       17.6
SCB alone                 ZSL    23.2    29.4       27.8
                          GZSL    8.6    14.0       16.7
BiAM (RCB + SCB)          ZSL    26.3    33.1       30.7
                          GZSL    9.3    16.1       19.0

Table 3: Impact of the proposed BiAM comprising the RCB and SCB blocks. Note that all results here are reported with the same region-level classification framework and only the features utilized within the classification framework differ. Both RCB alone and SCB alone achieve consistently improved performance over standard region features. For both ZSL and GZSL tasks, the best performance is obtained when utilizing the discriminative features obtained from the proposed BiAM.
Method                       mAP        Method                  mAP
BiAM: RCB w/ LayerNorm       25.0       Non-Local [34]          23.1
BiAM: RCB w/ sigmoid         24.6       Criss-Cross Atn [13]    23.9
BiAM: SCB w/ softmax         24.3       BiAM (Ours)             26.3
BiAM: Final                  26.3

Table 4: ZSL comparison on NUS-WIDE with attention variants: ablations of our attention (left) and other attention blocks [34, 13] (right).

Impact of the proposed BiAM: Here, we analyse the impact of our feature enrichment mechanism (BiAM) in obtaining discriminative feature representations. Tab. 3 presents the comparison between region-based classification pipelines based on standard features and the discriminative features obtained from our BiAM on NUS-WIDE. We also present results for our RCB and SCB blocks alone. Both RCB alone and SCB alone consistently improve the (G)ZSL performance over the standard region-based features. This shows that our region-based classification pipeline benefits from the discriminative features obtained through the two complementary attention blocks. Furthermore, the best results are obtained with our BiAM, which comprises both RCB and SCB blocks, demonstrating the importance of encoding both region and scene context information. Fig. 8 shows a comparison between the standard features-based classification and the proposed classification framework utilizing BiAM on example unseen class images.

Figure 6: t-SNE visualization showing the impact of the proposed region-level classification framework on the inter-class feature entanglement. We present the comparison on unseen classes of NUS-WIDE. On the left: the single feature representation-based classification pipeline, where the enriched features are spatially aggregated to obtain a single feature vector and then classified. On the right: the proposed region-level classification framework, which classifies the region-level features first and then spatially pools the class scores to obtain image-level predictions. Our classification framework maintains the spatial resolution to preserve the region-level characteristics, thereby effectively minimizing the inter-class feature entanglement.
Figure 7: ZSL comparison on NUS-WIDE when varying the number of heads H, k in top-k pooling and the number of regions. Results improve slightly as the number of heads increases up to an optimum and drop beyond it, likely due to overfitting to seen classes. A similar trend is observed when k in top-k pooling is increased. Decreasing the regions from 14x14 to 9x9 does not affect the performance much.

Varying the attention modules: Tab. 4 (left) shows the comparison on NUS-WIDE when ablating the RCB and SCB modules in our BiAM. Including LayerNorm in the RCB, replacing its softmax with a sigmoid, or replacing the sigmoid with a softmax in the SCB results in sub-optimal performance compared to our final BiAM. Similarly, replacing our BiAM with the existing Non-Local [34] and Criss-Cross [13] attention blocks also results in reduced performance (see Tab. 4 (right)). This shows the efficacy of BiAM, which integrates both region and holistic scene context.

Varying the hyperparameters: Fig. 7 shows the ZSL performance of our framework when varying the number of heads H, k in top-k pooling and the number of regions. Performance improves as the number of heads is increased up to an optimum and drops beyond it, likely due to overfitting to the seen classes. Similarly, as k in top-k pooling increases beyond the optimum, the features of spatially small classes entangle and reduce the discriminability. Furthermore, decreasing the number of regions leads to multiple classes overlapping in the same regions, causing feature entanglement and a performance drop.
Compute and run-time complexity: Tab. 5 shows that our approach achieves significant mAP gains of 6.7 and 31.3 over LESA on NUS-WIDE and Open Images, respectively, with comparable FLOPs, memory cost, training and inference run-times. For a fair comparison, both methods are run on the same Tesla V100 GPU.
Additional examples and failure cases of our model, such as confusing abstract classes (e.g., sunset vs. sunrise) and fine-grained classes, are provided in Appendix B.

Method        mAP (NUS / OI)   Train (NUS / OI)    Inference   FLOPs    Memory
LESA [14]     19.4 / 41.7      9.1 hrs / 35 hrs    1.4 ms      0.46 G   2.6 GB
BiAM (Ours)   26.1 / 73.0      7.5 hrs / 26 hrs    2.3 ms      0.59 G   2.8 GB

Table 5: Comparison of our BiAM with LESA in terms of ZSL performance (mAP), training and inference time, FLOPs and memory cost on NUS-WIDE (NUS) and Open Images (OI). Our BiAM achieves a significant gain in performance over LESA, with comparable compute and run-time complexity.
Figure 8: Qualitative comparison on four test examples from NUS-WIDE, between the standard region features and our discriminative features. Top-K predictions per image for both approaches are shown with true positives and false positives. Compared to the standard region-based features, our approach learns discriminative region-based features and performs favorably.

3.3 Standard Multi-label Classification

In addition to multi-label (generalized) zero-shot classification, we evaluate our proposed region-based classification framework on the standard multi-label classification task. Here, image instances for all the labels are present during training. The state-of-the-art comparison for standard multi-label classification on NUS-WIDE with the human-annotated labels is shown in Tab. 6. Among existing methods, the attention-based approach of [15] and LESA [14] achieve mAP scores of 32.6 and 31.5, respectively. Our approach outperforms all published methods and achieves a significant gain of 15.2 mAP over the state of the art. Furthermore, our approach performs favorably against existing methods in terms of F1.

4 Related Work

Several works [39, 28, 19, 20, 44, 23, 45] have researched the conventional single-label ZSL problem. In contrast, only a few works [21, 47, 18, 14, 12] have investigated the more challenging problem of multi-label ZSL. Mensink et al. [21] propose an approach based on using co-occurrence statistics for multi-label ZSL. Zhang et al. [47] introduce a method that utilizes linear mappings and non-linear deep networks to approximate the principal direction from an input image. The work of [18] investigates incorporating knowledge graphs to reason about relationships between multiple labels. Recently, Huynh and Elhamifar [14] introduce a shared attention-based multi-label ZSL approach, where the shared attentions are label-agnostic and are trained to focus on relevant foreground regions by utilizing a formulation based on multiple loss terms.

Context is known to play a crucial role in several vision problems, such as object recognition [25, 9, 37, 46]. Studies [10, 3] have shown that deep convolutional network-based visual recognition models implicitly rely on contextual information. Recently, self-attention models have achieved promising performance in machine translation and natural language processing [31, 38, 7, 8]. This has inspired studies to investigate self-attention and related ideas for vision tasks, such as object recognition [27], image synthesis [43] and video prediction [35]. Self-attention strives to learn the relationships between the elements of a sequence by estimating the relevance of one item to the other items. Motivated by its success in several vision tasks, we introduce a multi-label zero-shot region-based classification approach that utilizes self-attention in the proposed bi-level attention module to reason about all regions together using pair-wise relations between these regions. To complement the self-attentive region features with holistic scene context information, we integrate a global scene prior, which enables us to enrich the region-level features with both region and scene context information.

Method                      mAP    F1 (K=3)   F1 (K=5)
WARP [11]                    3.1    54.4       49.4
WSABIE [36]                  3.1    53.8       49.2
Logistic [30]               21.6    51.1       46.1
Fast0Tag [47]               22.4    53.8       48.6
CNN-RNN [33]                28.3    55.2       50.8
LESA [14]                   31.5    58.0       52.0
Attention per Cluster [14]  31.7    56.6       50.7
Attention per Label [15]    32.6    56.8       51.3
Our Approach                47.8    59.6       53.4

Table 6: State-of-the-art performance comparison for standard multi-label classification on NUS-WIDE. The results are reported in terms of mAP and F1 score at K ∈ {3, 5}. Our proposed approach achieves superior performance compared to existing methods, with gains as high as 15.2 in terms of mAP.

5 Conclusion

We proposed a region-based classification framework comprising a bi-level attention module for large-scale multi-label zero-shot learning. The proposed classification framework preserves the spatial resolution of features to keep the multi-class information disentangled. This enables it to effectively deal with a large number of co-existing categories in an image. To contextualize and enrich the region features in our classification framework, we introduced a bi-level attention module that incorporates both region and scene context information, generating discriminative feature representations. Our simple but effective approach sets a new state of the art on two large-scale benchmarks and obtains absolute ZSL mAP gains as high as 6.9 on NUS-WIDE and 31.9 on Open Images, compared to the best published results.

References

  • [1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid (2015) Label-embedding for image classification. TPAMI. Cited by: Table 1, Table 2.
  • [2] A. Ben-Cohen, N. Zamir, E. B. Baruch, I. Friedman, and L. Zelnik-Manor (2021) Semantic diversity learning for zero-shot multi-label classification. arXiv preprint arXiv:2105.05926. Cited by: §3.1.
  • [3] W. Brendel and M. Bethge (2019) Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In ICLR, Cited by: §4.
  • [4] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin (2021) Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294. Cited by: §A.2, Table 8.
  • [5] Z. Chen, X. Wei, P. Wang, and Y. Guo (2019) Multi-label image recognition with graph convolutional networks. In CVPR, Cited by: §1.
  • [6] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng (2009) NUS-wide: a real-world web image database from national university of singapore. In CIVR, Cited by: §A.1, §A.2, Figure 10, Appendix B, §1.1, §1, §3.
  • [7] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov (2019) Transformer-XL: attentive language models beyond a fixed-length context. In ACL, Cited by: §4.
  • [8] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §4.
  • [9] C. Galleguillos and S. Belongie (2010) Context based object categorization: a critical survey. CVIU. Cited by: §4.
  • [10] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. Wichmann, and W. Brendel (2019) ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR, Cited by: §4.
  • [11] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe (2013) Deep convolutional ranking for multilabel image annotation. arXiv preprint arXiv:1312.4894. Cited by: Table 7, Table 6.
  • [12] A. Gupta, S. Narayan, S. Khan, F. S. Khan, L. Shao, and J. van de Weijer (2021) Generative multi-label zero-shot learning. arXiv preprint arXiv:2101.11606. Cited by: §4.
  • [13] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu (2019) Ccnet: criss-cross attention for semantic segmentation. In ICCV, Cited by: §3.2, Table 4, Table 4.
  • [14] D. Huynh and E. Elhamifar (2020) A shared multi-attention framework for multi-label zero-shot learning. In CVPR, Cited by: §A.1, §A.2, Table 7, Table 8, Figure 1, §1.1, §1, Figure 2, §2.1, §3.1, §3.3, Table 1, Table 2, §3, Table 6, §4.
  • [15] J. Kim, J. Jun, and B. Zhang (2018) Bilinear attention networks. In NeurIPS, Cited by: §3.3, Table 1, Table 6.
  • [16] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1.
  • [17] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, et al. (2018) The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982. Cited by: §A.1, §A.2, Figure 11, §1.1, §1, §3.
  • [18] C. Lee, W. Fang, C. Yeh, and Y. Frank Wang (2018) Multi-label zero-shot learning with structured knowledge graphs. In CVPR, Cited by: §1, §4.
  • [19] J. Liu, H. Bai, H. Zhang, and L. Liu (2021) Near-real feature generative network for generalized zero-shot learning. In ICME), Cited by: §4.
  • [20] D. Mandal, S. Narayan, S. K. Dwivedi, V. Gupta, S. Ahmed, F. S. Khan, and L. Shao (2019) Out-of-distribution detection for generalized zero-shot action recognition. In CVPR, Cited by: §4.
  • [21] T. Mensink, E. Gavves, and C. G. Snoek (2014) Costa: co-occurrence statistics for zero-shot classification. In CVPR, Cited by: §1, §4.
  • [22] J. Nam, E. Loza Mencía, H. J. Kim, and J. Fürnkranz (2017) Maximizing subset accuracy with recurrent neural networks in multi-label classification. NeurIPS. Cited by: §1.
  • [23] S. Narayan, A. Gupta, F. S. Khan, C. G. Snoek, and L. Shao (2020) Latent embedding feedback and discriminative features for zero-shot classification. In ECCV, Cited by: §4.
  • [24] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean (2013) Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650. Cited by: Table 1, Table 2.
  • [25] A. Oliva and A. Torralba (2007) The role of context in object recognition. Trends Cogn Sci 7. Cited by: §4.
  • [26] J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In EMNLP, Cited by: §3.
  • [27] P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens (2019) Stand-alone self-attention in vision models. In NeurIPS, Cited by: §4.
  • [28] Y. Shen, J. Qin, L. Huang, L. Liu, F. Zhu, and L. Shao (2020) Invertible zero-shot recognition flows. In ECCV, Cited by: §4.
  • [29] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §A.2, §3.
  • [30] G. Tsoumakas and I. Katakis (2007) Multi-label classification: an overview. IJDWM. Cited by: Table 7, Table 6.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §2.2.1, §4.
  • [32] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie (2017) Learning from noisy large-scale datasets with minimal supervision. In CVPR, Cited by: §3.
  • [33] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu (2016) Cnn-rnn: a unified framework for multi-label image classification. In CVPR, Cited by: Table 7, §1, Table 6.
  • [34] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, Cited by: §3.2, Table 4.
  • [35] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, Cited by: §4.
  • [36] J. Weston, S. Bengio, and N. Usunier (2011) Wsabie: scaling up to large vocabulary image annotation. In IJCAI, Cited by: Table 7, Table 6.
  • [37] S. Woo, J. Park, J. Lee, and I. S. Kweon (2018) CBAM: convolutional block attention module. In ECCV, Cited by: §4.
  • [38] F. Wu, A. Fan, A. Baevski, Y. Dauphin, and M. Auli (2019) Pay less attention with lightweight and dynamic convolutions. In ICLR, Cited by: §4.
  • [39] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata (2018) Feature generating networks for zero-shot learning. In CVPR, Cited by: §4.
  • [40] V. O. Yazici, A. Gonzalez-Garcia, A. Ramisa, B. Twardowski, and J. v. d. Weijer (2020) Orderless recurrent models for multi-label classification. In CVPR, Cited by: §1.
  • [41] J. Ye, J. He, X. Peng, W. Wu, and Y. Qiao (2020) Attention-driven dynamic graph convolutional network for multi-label image recognition. In ECCV, Cited by: §1.
  • [42] R. You, Z. Guo, L. Cui, X. Long, Y. Bao, and S. Wen (2020) Cross-modality attention with semantic graph embedding for multi-label classification.. In AAAI, Cited by: §1.
  • [43] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In ICML, Cited by: §4.
  • [44] H. Zhang, H. Bai, Y. Long, L. Liu, and L. Shao (2021) A plug-in attribute correction module for generalized zero-shot learning. PR. Cited by: §4.
  • [45] H. Zhang, L. Liu, Y. Long, Z. Zhang, and L. Shao (2020) Deep transductive network for generalized zero shot learning. PR. Cited by: §4.
  • [46] M. Zhang, C. Tseng, and G. Kreiman (2020) Putting visual object recognition in context. In CVPR, Cited by: §4.
  • [47] Y. Zhang, B. Gong, and M. Shah (2016) Fast zero-shot image tagging. In CVPR, Cited by: §A.1, §A.2, Table 7, §1, §3.1, Table 1, Table 2, §3, Table 6, §4.

Appendix A Additional Quantitative Results

A.1 Standard Multi-Label Learning

Similar to Sec. 3.3, where we evaluate our approach for standard multi-label classification on the NUS-WIDE dataset [6], here we also evaluate on the large-scale Open Images dataset [17]. Tab. 7 shows the state-of-the-art comparison for standard multi-label classification on Open Images. Here, the 7,186 seen classes are used for both training and evaluation. Test samples with missing labels for these classes are removed during evaluation, as in [14]. Due to the significantly larger number of labels in Open Images, ranking the labels within an image is more challenging. This is reflected by the lower F1 scores in the table. Among existing methods, Fast0Tag [47] and LESA [14] achieve F1 scores of 16.2 and 17.8 at K = 10. Our approach achieves favorable performance against the existing approaches, achieving an F1 score of 20.4 at K = 10. The proposed approach also achieves superior performance in terms of mAP score, compared to existing methods, and obtains an absolute gain of 35.6 mAP over the best existing method.

A.2 Robustness to Backbone Variation

In Sec. 3, for a fair comparison with existing works such as Fast0Tag [47] and LESA [14], we employed a pretrained VGG-19 [29] as the backbone for extracting region-level and global-level features of images. However, such a supervised pretrained backbone will not strictly conform with the zero-shot paradigm if there is any overlap between the unseen classes and the classes used for pretraining. To avoid using a supervised pretrained network, we conduct an experiment using the recent self-supervised DINO [4] ResNet-50 backbone, trained on ImageNet without any labels. Tab. 8 shows that our approach (BiAM) significantly outperforms LESA [14] even with a self-supervised pretrained backbone on both benchmarks: NUS-WIDE [6] and Open Images [17]. Absolute gains as high as 6.9 mAP are obtained for NUS-WIDE on the ZSL task. Similar favorable gains are also obtained for the GZSL task on both datasets. These results show that, irrespective of the backbone used for extracting the image features, our BiAM approach performs favorably against existing methods, achieving significant gains across different datasets on both ZSL and GZSL tasks.

Method                          mAP    F1 (K=10)  F1 (K=20)
WARP [11]                       46.0     7.7        7.4
WSABIE [36]                     47.2     2.2        2.2
CNN-RNN [33]                    41.0     9.6       10.5
Logistic [30]                   49.4    13.3       11.8
Fast0Tag [47]                   45.4    16.2       13.1
One Attention per Cluster [14]  45.1    16.3       13.0
LESA [14]                       45.6    17.8       14.5
Our Approach                    85.0    20.4       17.3

Table 7: State-of-the-art performance comparison for standard multi-label classification on Open Images. The results are reported in terms of mAP and F1 score at K ∈ {10, 20}. In comparison to existing approaches, our approach achieves favorable performance in terms of both mAP and F1.

Backbone              Task    NUS-WIDE (mAP)           Open Images (mAP)
                              LESA    BiAM (Ours)      LESA    BiAM (Ours)
DINO ResNet-50 [4]    ZSL     20.5    27.4             41.9    74.0
                      GZSL     6.4    10.2             45.5    84.8

Table 8: ZSL/GZSL performance comparison with LESA on NUS-WIDE and Open Images, when using the recent DINO ResNet-50 backbone pretrained on ImageNet without any labels. Our BiAM outperforms LESA [14] by a large margin on both datasets.

Appendix B Additional Qualitative Results

Multi-label zero-shot classification: Fig. 9 shows qualitative results for multi-label (generalized) zero-shot learning on nine example images from the test set of the NUS-WIDE dataset [6]. The comparison is shown between the standard region-based features and our discriminative region-based features. Alongside each image, the top-K predictions for both approaches are shown with true positives and false positives. In general, our approach learns discriminative region-based features and achieves increased true positive predictions along with reduced false positives, compared to the standard region-based features. E.g., categories such as reflection and water in Fig. 9(b), ocean and sky in Fig. 9(g), boat and sky in Fig. 9(j), along with graveyard and england in Fig. 9(k), are correctly predicted. Both approaches predict a few confusing classes, such as beach and surf in Fig. 9(d), in addition to sunrise and sunset in Fig. 9(l), which are hard to differentiate using visual cues alone. Moreover, false positives that are predicted by the standard region-based features are reduced by our discriminative region-based features, e.g., vehicle in Fig. 9(g), soccer in Fig. 9(h), balloons in Fig. 9(j), and ocean in Fig. 9(k). These results suggest that our approach based on discriminative region features achieves promising performance compared to the standard features, for multi-label (generalized) zero-shot classification.

Figure 9: Qualitative comparison for multi-label zero-shot classification on nine example images from the NUS-WIDE test set, between the standard region-based features and our discriminative features. Top-K predictions per image for both approaches are shown with true positives and false positives. Generally, in comparison to the standard region-based features, our approach learns discriminative region-based features and results in increased true positive predictions along with reduced false positives. E.g., reflection and water in (b), ocean and sky in (g), boat and sky in (j), along with graveyard and england in (k), are correctly predicted. Though a few confusing classes are predicted (e.g., beach and surf in (d)), the obvious false positives, such as vehicle in (g), soccer in (h), balloons in (j) and ocean in (k), which are predicted by the standard region-based features, are reduced by our discriminative region-based features. These qualitative results suggest that our approach based on discriminative region features achieves promising performance in comparison to the standard features, for the task of multi-label (generalized) zero-shot classification.
Visualization of attention maps: Fig. 10 and 11 show visualizations of the attention maps for the ground-truth classes in example test images from NUS-WIDE and Open Images, respectively. Alongside each example, class-specific maps for the unseen classes are shown with the corresponding labels on top. In general, we observe that these maps focus reasonably well on the desired classes. E.g., promising class-specific attention is captured for zebra in Fig. 10(a), vehicle in Fig. 10(b), buildings in Fig. 10(d), Keelboat in Fig. 11(c), Boeing 717 in Fig. 11(e) and Exercise in Fig. 11(i). Although we observe that the attention maps of visually similar classes overlap, e.g., sky and clouds in Fig. 10(d), abstract categories, including reflection in Fig. 10(a) and nighttime in Fig. 10(c), are well captured. These qualitative results show that our proposed approach (BiAM) generates promising class-specific attention maps, leading to improved multi-label (generalized) zero-shot classification.

Figure 10: Qualitative results with attention maps generated by our proposed approach, on example test images from the NUS-WIDE [6] dataset. For each image, class-specific maps for the ground-truth unseen classes are shown with the corresponding labels on top. Generally, we observe that these maps focus reasonably well on the desired classes. E.g., promising attention/focus is observed on classes such as zebra in (a), vehicle in (b), buildings in (d) and statue in (f). Although we observe that the attention maps of visually similar classes such as sky and clouds overlap, as in (d), abstract classes, including reflection in (a), (d) and nighttime in (c), are well captured. These qualitative results show that our proposed approach generates promising class-specific attention maps, leading to improved multi-label (generalized) zero-shot classification.
Figure 11: Qualitative results with attention maps generated by our proposed approach, on example test images from the Open Images [17] dataset. For each image, class-specific maps for the ground-truth unseen classes are shown with the corresponding labels on top. Although there are overlapping attention regions for visually similar and fine-grained classes (e.g., Caridean shrimp and Fried prawn in (f), Canaan dog and Akita inu in (j)), these maps generally focus reasonably well on the desired classes. E.g., promising class-specific attention is captured for Keelboat in (c), Boeing 717 in (e) and Exercise in (i). These qualitative results show that our proposed approach generates promising class-specific attention maps, resulting in improved multi-label (generalized) zero-shot classification.