Image recognition has become a popular and successful research area in recent years, due to the development of large-scale datasets deng2009imagenet ; kuznetsova2020open and advanced model architectures dosovitskiy2020image ; he2016deep ; liu2021swin ; simonyan2014very . However, the majority of image recognition approaches have focused on single-label prediction, which ignores the intrinsic multi-label nature of images. Unlike single-label recognition dosovitskiy2020image ; he2016deep ; liu2021swin ; simonyan2014very , multi-label image recognition aims to recognize all semantic labels present in an image chen2019multi ; chua2009nus ; liu2017semantic ; liu2018multi ; sarafianos2018deep ; yazici2020orderless ; wang2020multi , providing a more comprehensive understanding and benefiting applications like image retrieval, video analysis, and recommendation systems.
Multi-label recognition typically deals with images of complex scenes and diverse objects. Collecting multi-label annotations is difficult to scale up, for two reasons: (i) annotating images with the full semantic label set is laborious and (ii) samples of particular categories can be hard to find. The first challenge can be addressed by multi-label recognition with partial labels, where only some of the categories are annotated for each training image. Recent works proposed solutions to partial-label MLR based on semi-supervised learning joulin2016learning ; mahajan2018exploring , normalized training objectives durand2019learning , or label correlations chen2022structured ; huynh2020interactive ; pu2022semantic . The second setting involves zero-shot MLR, where novel unseen categories are recognized by transferring knowledge from seen categories, with solutions like principal image features ben2021semantic ; zhang2016fast lee2018multi , and attention mechanisms huynh2020shared ; narayan2021discriminative . Despite significant progress on the two settings, existing approaches are not designed to handle both. We propose to unify these settings as limited-annotation MLR and design a solution that can handle practical scenarios with either partial or missing labels.
Successful solutions to the above problems transfer knowledge from fully-annotated categories to partially-labeled and novel categories by learning an alignment between images and category names chen2022structured ; pu2022semantic ; zhang2016fast . Recently, vision-language pretraining models have been bridging the visual-textual gap via large-scale pretraining, e.g., CLIP radford2021learning is trained with 400 million image-text pairs. In this work, we draw inspiration from the recent success of prompt learning for such models huang2022unsupervised ; ju2021prompting ; luddecke2021mm ; radford2019language ; zhou2021denseclip ; zhou2022conditional . Prompt learning provides a convenient way to transfer pretrained vision-language models to other tasks. It designs additional templated or learnable prompt tokens for textual input to “inform” the model about downstream tasks and avoids finetuning the entire model, which can be inefficient and data-hungry. By doing so, recent works like CoOp zhou2021learning have demonstrated CLIP’s remarkable generalisation to various zero-shot image tasks huang2022unsupervised ; radford2021learning ; zhou2022conditional . However, these methods mainly focus on matching each image with a single label, hence they are not able to handle the multi-label setting.
To adapt the knowledge learned in CLIP to multi-label image recognition, we propose the DualCoOp framework. As shown in Fig. 1 (c), DualCoOp learns a pair of differentiable prompts to provide positive and negative contexts for the target class. Instead of using hand-crafted thresholding to determine positive labels ridnik2021asymmetric , the dual prompts naturally result in a positive and a negative classifier, so the existence of the target class in the image can be easily decided by comparing their scores. Unlike prior models, shown in Fig. 1 (a)(b), we avoid fine-tuning the full vision-language model and only learn the prompts, which are much smaller than the entire model. Therefore, our simple framework achieves much higher efficiency when adapting to different datasets. Additionally, we modify the attention mechanism of CLIP to better model spatial information in images, improving its ability to recognize multiple objects in MLR. With these design choices, we achieve a unified framework for addressing the general challenges of multi-label recognition with limited annotations.
We summarize our contributions as follows:
- We propose DualCoOp to quickly adapt powerful vision-language models to solve multi-label recognition tasks using limited annotations.
- We propose dual (positive and negative) prompts to drastically reduce the number of learnable parameters, and improve the spatial modeling of the visual encoder to better distinguish multiple objects.
- We conduct extensive experiments on partial-label MLR (on MS-COCO lin2014microsoft and VOC2007 everingham2010pascal ) and zero-shot MLR (on MS-COCO and NUS-WIDE chua2009nus ). Notably, DualCoOp improves mAP on VOC2007 in the partial-label setting (Sec. 4.1), and improves the F1-score at Top-3 predictions by 10.8 for zero-shot MLR on NUS-WIDE (Sec. 4.2).
2 Related Works
Multi-Label Recognition with Limited Annotations. Multi-label image recognition has drawn increasing attention in past years. One straightforward solution to this problem is to individually learn a binary classifier for each category liu2015optimality ; misra2016seeing ; tsoumakas2007multi , which however does not consider correlations among labels. Hence, recent works have focused on incorporating semantic dependencies among labels via graph neural networks chen2019multi ; chua2009nus ; wang2020multi or RNN/LSTM liu2017semantic ; wang2016cnn ; wang2017multi ; yazici2020orderless . Some works also consider the spatial distribution of labels in the image, and exploit object proposals li2016human ; liu2018multi ; wang2016beyond or attention mechanisms sarafianos2018deep ; wang2017multi ; zhu2017learning as a regularization to rectify the prediction. However, despite achieving significant progress, these methods require a large-scale and completely annotated dataset to train models krishna2017visual ; lin2014microsoft . This limits their application to more practical scenarios where data is partially annotated for training bucak2011multi ; chen2013fast ; mahajan2018exploring ; sun2017revisiting ; sun2010multi ; xie2018partial and unseen (zero-shot) categories may appear during testing chen2020knowledge ; gupta2021generative ; lee2018multi ; mensink2014costa ; zhang2016fast .
With partially labeled data, where merely some labels of each sample are known, Mahajan et al. mahajan2018exploring and Joulin et al. joulin2016learning attempt to use web supervision to automatically generate the pseudo labels, which unfortunately leads to poor performance as the web supervision is noisy and incomplete zhang2021understanding . To avoid external noise, Durand et al. durand2019learning exploit the proportion of annotated samples for different labels and propose a normalized BCE loss to train models based on the given partial labels. More recent works explicitly transfer information from known labels to complement unknown labels by utilizing category-specific feature blending pu2022semantic or label co-occurrences chen2022structured at both instance-level and prototype-level.
Unlike partial annotation of the same label set for training and testing, zero-shot multi-label image recognition needs to handle novel categories during testing, hence inspiring a different route based on a joint visual-label embedding space chen2020knowledge ; huynh2020shared ; mensink2014costa ; narayan2021discriminative ; zhang2016fast . Zhang et al. zhang2016fast propose to find a principal direction that ranks related labels first in the joint embedding space optimized via a tailored zero-shot ranking loss. Ben-Cohen et al. ben2021semantic further improve the idea by learning multiple principal vectors to support the semantic diversity. Huynh et al. huynh2020shared consider spatial regularization and propose a shared multi-attention model that obviates the need for explicit region proposals rahman2019deep0tag . Narayan et al. narayan2021discriminative propose to enhance the region-based features so as to minimize inter-class feature entanglement.
Though significant progress has been made in each of the directions, existing methods still require a lot of MLR data and complex architectures/losses. Our approach reduces the need for hard-to-get MLR data by pretraining on unsupervised text-image pairs. While it may seem unfair to compare existing MLR methods with ones based on such pretraining, we point out that the pretraining data is unsupervised and thus easier to obtain. We also provide experiments comparing DualCoOp to baselines using the same pretraining. Importantly, previous methods are designed for only one task, hence have limitations in practical applications. In contrast, our proposed framework can be easily adapted with small data and can address both partial and zero-shot tasks at the same time.
Prompt Learning for Vision-Language Models. Vision-Language Models jia2021scaling ; radford2021learning based on contrastive learning have demonstrated impressive ability to learn generic visual representations. As a milestone, CLIP radford2021learning is trained with 400 million curated image-text pairs, and shows remarkable transfer capability for over 30 classification datasets. With such powerful vision-language models, several follow-ups gao2021clip ; wortsman2021robust ; yao2021cpt ; zhang2021tip have been proposed to explore strategies for adapting them to downstream classification tasks. Instead of fine-tuning the entire model dong2019unified ; he2016deep , which may damage the learned representation space, recent approaches adopt the prompt-based paradigm that formalizes NLP tasks as masked language modeling (prompt templates) lester2021power ; li2021prefix ; shin2020autoprompt . Zhou et al. zhou2021learning propose to tune prompts for downstream classification tasks, and further introduce input-conditional prompts for better generalization ability zhou2022conditional . Lu et al. lu2022prompt learn the distribution of diverse prompts to handle the varying visual representations. Huang et al. huang2022unsupervised generate pseudo labels for images to learn prompts in an unsupervised way. Though achieving promising improvements for downstream tasks, these methods address multi-class zero-shot image recognition, assuming each image has one label, and hence lack the ability to handle the multi-label setting. In this paper, we present a novel framework to efficiently transfer VLMs to address multi-label image recognition with limited annotations.
Problem Definition. We formally define multi-label recognition with limited annotations as follows: Consider a set of categories that describe objects or attributes in images. Given a training image, the existence of each category can be positive, negative, or unknown, and the label of that category takes one of three corresponding values. During inference, we predict each label of interest for an input image.
Many existing MLR problems fit into this broad definition. In this paper, we consider the settings with partial or missing labels: (1) Partial-label MLR chen2022structured ; durand2019learning ; pu2022semantic , in which only a subset of labels is known (positive or negative) for each training image and we are interested in predicting all existing labels during inference. (2) Zero-shot MLR ben2021semantic ; huynh2020shared ; rahman2018deep , in which each label is either known (seen) or unknown (unseen) for all images during training and we are interested in predicting either all labels or only unknown (unseen) labels during inference. We propose a unified framework to address limited-annotation MLR in both scenarios.
Approach Overview. To compensate for insufficient or missing image labels, it is important to learn how the meanings of category names are related to each other, so that we can transfer knowledge between related categories. This is usually done by learning an alignment between the visual and textual spaces. However, our dataset is too limited to learn a broad and generalizable mapping. We propose DualCoOp to instead leverage the strong alignment of visual and textual feature spaces learned by large-scale vision-language pretraining (CLIP radford2021learning ) with a light-weight learnable overhead which quickly adapts to the MLR task with limited semantic annotations. Figure 2 provides an overview of our proposed approach. DualCoOp learns a pair of “prompt” contexts in the form of two learnable sequences of word vectors, to provide positive and negative contextual surroundings of a given category name. These prompts are fed into the pretrained text encoder to generate positive and negative textual features. Furthermore, to better recognize multiple objects, which can be located at different locations in the image, the spatial aggregation step is modified. We first compute the similarity score of the projected visual feature at each location with the positive/negative textual features to obtain prediction logits over regions. For each class, we perform aggregation of all spatial logits, in which the weight for each logit is determined by its relative magnitude. We call this Class-Specific Region Feature Aggregation. During training, we optimize the learnable prompts via the ASL loss ridnik2021asymmetric while keeping all other network components frozen. During inference, we directly compare the final positive and negative logits to make a prediction for each label.
Dual Learnable Prompts. Instead of learning a single prompt for a class zhou2021learning , we propose Dual Context Optimization (DualCoOp), which learns two contrastive prompt contexts for each class. The learnable parts of the dual prompts carry positive and negative contextual surroundings individually and can be optimized end-to-end from data via a binary classification loss. Specifically, each of the two prompts given to the text encoder consists of a sequence of learnable word embedding vectors (e.g. with dimension 512 in CLIP radford2021learning ) followed by CLS, the given category name. The positive and negative prompts may contain different numbers of learned word tokens; for simplicity, we use the same number for both in our experiments. We learn a pair of positive and negative prompts for each class (i.e. a class-specific prompt pair) when solving MLR with partial labels, and learn a pair of prompts shared by all classes in zero-shot MLR.
With a pair of prompts, we compute the binary classification output as the predicted probability that a given (image, label) pair is a positive example. This probability is computed from the cosine similarities between the visual features and the textual features of the positive and negative prompts, where the visual and textual features are produced by the visual and textual encoders from the vision-language pretraining. A new aggregation function adaptively reduces the spatial dimension of the visual features for each class, which will be discussed next.
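As a minimal sketch of the dual-prompt classifier described above, the snippet below builds a prompt from learnable context vectors followed by class-name embeddings and converts a pair of cosine similarities into a probability. The function names, the temperature value, and the choice of a two-way softmax over the positive and negative similarities are assumptions made for illustration; they are not taken from the original implementation.

```python
import numpy as np

def build_prompt(context, class_embedding):
    """Concatenate learnable context vectors (N, D) with class-name embeddings (L, D)."""
    return np.concatenate([context, class_embedding], axis=0)

def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dual_prompt_probability(image_feat, pos_text_feat, neg_text_feat, temperature=0.07):
    """Probability that the class is present, assuming a two-way softmax.

    The temperature is a hypothetical value chosen for this sketch.
    """
    s_pos = cosine(image_feat, pos_text_feat) / temperature
    s_neg = cosine(image_feat, neg_text_feat) / temperature
    # A softmax over two logits equals a sigmoid of their difference.
    return 1.0 / (1.0 + np.exp(s_neg - s_pos))

# Toy usage with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
pos_text = rng.normal(size=512)
neg_text = rng.normal(size=512)
image = pos_text + 0.1 * rng.normal(size=512)  # image feature close to the positive prompt
p = dual_prompt_probability(image, pos_text, neg_text)
is_present = p > 0.5  # equivalent to comparing the positive and negative scores
```

Under this formulation, no hand-tuned threshold is needed: the class is predicted as present exactly when the positive similarity exceeds the negative one.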
Class-Specific Region Feature Aggregation. In multi-label image recognition, it is common that multiple objects appear in different regions of the image. Pooling to produce a single image-level feature vector for all classes gives sub-optimal performance since spatial information is reduced and different objects are mixed. In this work, we reformulate the last multi-headed attention pooling layer of the visual encoder in CLIP radford2021learning and apply class-specific pooling to adaptively aggregate region features in the multi-label setting. The original attention pooling layer in CLIP pools the visual feature map first, and then projects the global feature vector into the text space. The layer applies independent linear embedding layers (query, key, and value) to the output feature map of the visual encoder. By removing the pooling operation, we can project the visual feature of each region to the textual space zhou2021denseclip .
For each region and each class, we compute the cosine similarity between the projected region feature and the positive textual feature to obtain a positive regional logit, and compute the negative regional logit in the same way from the negative textual feature. In order to make a single prediction for the whole image, we aggregate the positive and negative regional logits into image-level positive and negative logits, weighting each region according to the magnitude of its positive logit.
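One way to write this aggregation, consistent with the softmax-over-positive-logits weighting described in the ablation of the aggregation function, is sketched below. The notation is assumed for illustration: $S^{+}_{i,j}$ and $S^{-}_{i,j}$ denote the positive and negative logits of class $i$ at region $j$, and any temperature inside the softmax is omitted.

```latex
% Assumed notation: S^{+}_{i,j}, S^{-}_{i,j} are the regional logits of class i at region j.
w_{i,j} = \frac{\exp\big(S^{+}_{i,j}\big)}{\sum_{j'} \exp\big(S^{+}_{i,j'}\big)}, \qquad
S^{+}_{i} = \sum_{j} w_{i,j}\, S^{+}_{i,j}, \qquad
S^{-}_{i} = \sum_{j} w_{i,j}\, S^{-}_{i,j}
```

Both image-level logits share the same weights, so regions where the class is most likely present dominate the positive and the negative evidence alike.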
Notably, we do not introduce any new parameters in our re-formulation of the spatial aggregation function. All parameters used to project visual features to the textual space are inherited from the original multi-headed attention pooling layer in CLIP.
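The parameter-free nature of this reformulation can be illustrated with a simplified, single-head sketch: because the final projection is linear, projecting the attention-pooled feature gives the same result as attention-pooling the per-region projections. The weight matrices, toy sizes, and single-head simplification below are assumptions for illustration and do not reproduce CLIP's actual layer.

```python
import numpy as np

rng = np.random.default_rng(0)
HW, C, D = 49, 8, 4                 # regions, visual channels, text dimension (toy sizes)
x = rng.normal(size=(HW, C))        # output feature map, one row per region
W_q, W_k, W_v = (rng.normal(size=(C, C)) for _ in range(3))  # hypothetical q/k/v layers
W_proj = rng.normal(size=(C, D))    # hypothetical visual-to-text projection

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(x):
    """Pool first, then project: returns the global feature and the attention weights."""
    query = x.mean(axis=0) @ W_q
    attn = softmax((x @ W_k) @ query / np.sqrt(C))
    pooled = attn @ (x @ W_v)
    return pooled @ W_proj, attn

def project_regions(x):
    """Skip the pooling: project every region into the text space with the same weights."""
    return (x @ W_v) @ W_proj

global_feat, attn = attention_pool(x)
region_feats = project_regions(x)   # shape (HW, D), one text-space feature per region
```

In this sketch `global_feat` equals `attn @ region_feats`, which is why the per-region features can reuse the pretrained weights unchanged.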
Optimization. We apply the Asymmetric Loss (ASL) ridnik2021asymmetric to handle the inherent positive-negative imbalance in the optimization of multi-label recognition. Specifically, we compute separate losses for a positive (image, label) pair and a negative (image, label) pair. For negative pairs, the predicted probability is shifted by hard thresholding via a margin. We set the hyper-parameters so that ASL down-weights and hard-thresholds easy negative samples. The pair of learnable prompts is updated by back-propagating ASL through the frozen text encoder.
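The following is a minimal sketch of an asymmetric loss of this kind, following the general form of ASL (focal-style weighting with a probability margin for negatives). The hyper-parameter values are placeholders chosen for illustration, the label encoding (1 positive, -1 negative, 0 unknown) is an assumption, and skipping unknown labels is one plausible way to handle partial annotations.

```python
import numpy as np

def asymmetric_loss(p, y, gamma_pos=1.0, gamma_neg=2.0, margin=0.05, eps=1e-8):
    """Per-label asymmetric loss.

    p: predicted probabilities of the positive class.
    y: labels with an assumed encoding of 1 (positive), -1 (negative), 0 (unknown).
    gamma_pos, gamma_neg, margin: illustrative placeholder values.
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y)
    # Hard-threshold easy negatives: probabilities below the margin become zero.
    p_shifted = np.clip(p - margin, 0.0, None)
    loss_pos = -((1.0 - p) ** gamma_pos) * np.log(np.clip(p, eps, 1.0))
    loss_neg = -(p_shifted ** gamma_neg) * np.log(np.clip(1.0 - p_shifted, eps, 1.0))
    # Unknown labels contribute nothing to the loss.
    return np.where(y == 1, loss_pos, np.where(y == -1, loss_neg, 0.0))

losses = asymmetric_loss([0.9, 0.03, 0.6, 0.7], [1, -1, -1, 0])
```

With these placeholder values, the negative with probability 0.03 falls below the margin and is ignored entirely, while the harder negative at 0.6 still produces a gradient signal.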
4.1 Multi-Label Recognition with Partial Labels
|Curriculum labeling durand2019learning|38M|26.7|31.8|51.5|65.4|70.0|71.9|74.0|77.4|78.0|60.7|
|Partial BCE durand2019learning|38M|61.6|70.5|74.1|76.3|77.2|77.7|78.2|78.4|78.5|74.7|
|PASCAL VOC 2007 everingham2010pascal|
|Curriculum labeling durand2019learning|38M|44.7|76.8|88.6|90.2|90.7|91.1|91.6|91.7|91.9|84.1|
|Partial BCE durand2019learning|38M|80.7|88.4|89.9|90.7|91.2|91.8|92.3|92.4|92.5|90.0|
Datasets. We conduct experiments on MS-COCO lin2014microsoft and VOC2007 everingham2010pascal to evaluate multi-label recognition with partial labels. MS-COCO lin2014microsoft contains 80 common object categories and we use the official train2014 (82K images) and val2014 (40K images) splits for training and testing. VOC2007 everingham2010pascal contains 20 object categories and we use the official trainval (5K images) and test (5K images) splits for training and testing. To create the training set with partial labels, we randomly mask out labels from the fully annotated training set and use the remaining labels for training, following standard practice chen2022structured ; durand2019learning ; pu2022semantic . (Footnote 1: the difference in performance across independent runs of this random masking stays within a small margin.) In this work, we vary the proportion of kept labels over a range of values, as in chen2022structured ; pu2022semantic .
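The random masking step can be sketched as follows. The label encoding (1 positive, -1 negative, 0 unknown), the function name, and the independent per-entry masking are assumptions for illustration; with independent masking the kept proportion is only approximately equal to `keep_ratio` on a finite dataset.

```python
import numpy as np

def mask_labels(labels, keep_ratio, seed=0):
    """Randomly hide labels of a fully annotated label matrix.

    labels: array of shape (num_images, num_classes) with entries in {1, -1}.
    keep_ratio: probability that each individual label is kept.
    Returns a copy in which hidden labels are set to 0 (assumed 'unknown' code).
    """
    rng = np.random.default_rng(seed)
    keep = rng.random(labels.shape) < keep_ratio
    return np.where(keep, labels, 0)

full = np.array([[1, -1, 1, -1],
                 [-1, -1, 1, 1]])
partial = mask_labels(full, keep_ratio=0.5)
```

Fixing the seed makes the masked training set reproducible, which matters when comparing methods at the same proportion of kept labels.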
Evaluation. On the MS-COCO and VOC2007 datasets, we follow chen2022structured ; durand2019learning ; pu2022semantic and report the mean average precision (mAP) for each proportion of labels available for optimization, together with its average over all proportions. We count the learnable parameters of each baseline and DualCoOp to measure the complexity of optimization (indicated as #P in Tables 1-3). For baselines without a publicly released implementation, we only count the major part of the learnable parameters based on the description in their papers. We also report the per-class and overall precision (CP and OP), recall (CR and OR), and F1 (CF1 and OF1) of DualCoOp under different proportions of training labels in the supplementary material due to the page limit.
Implementation. We adopt ResNet-101 he2016deep as the visual encoder in all baselines and DualCoOp for input resolution 448×448, and use the same Transformer radford2019language ; vaswani2017attention as in CLIP radford2021learning as the text encoder. The visual and text encoders are initialized from the CLIP pretrained model and kept frozen during optimization. For each class/label, we learn two independent context vectors with 16 context tokens (N = 16) following zhou2021learning , which is the only learnable part in DualCoOp. We use the SGD optimizer with an initial learning rate of 0.002, which is decayed by the cosine annealing rule. We train context vectors for 50 epochs with a batch size of 32/8 for MS-COCO/VOC2007, respectively. For the ASL loss, we choose the hyper-parameters via validation. Training is done on one RTX A6000.
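For reference, the cosine annealing rule can be sketched as below with the initial learning rate (0.002) and training length (50 epochs) stated above. The per-epoch update granularity and a minimum learning rate of zero are assumptions; the original schedule may differ in these details.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=50, base_lr=0.002, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing (assumed min_lr of 0)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

# Learning rate at the start of each of the 50 epochs.
schedule = [cosine_annealing_lr(e) for e in range(50)]
```

The rate starts at 0.002, reaches half of that value at the midpoint of training, and decays smoothly towards zero at the end.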
Baselines. To evaluate the effectiveness of DualCoOp, we compare with the following baselines: (1). SSGRL chen2019learning , GCN-ML chen2019multi and KGGR chen2020knowledge adopt graph neural networks to model label dependencies. We follow chen2022structured to report their performance in the partial-label setting. (2). Curriculum labeling durand2019learning and SST chen2022structured generate pseudo labels for unknown labels. (3). Partial BCE durand2019learning uses a normalized BCE loss to better exploit partial labels. (4). SARB pu2022semantic blends category-specific representation across different images to transfer information of known labels to complement unknown labels.
), we further substitute the ImageNet pretrained weights he2016deep with the CLIP pretrained weights radford2021learning when initializing their visual encoders; the resulting variants are reported in Table 1. Since we learn class-specific prompts, DualCoOp on MS-COCO adopts more learnable parameters than on VOC2007. Our proposed DualCoOp achieves the best performance across all proportions of labels available during training with the smallest learnable overhead (1.3M vs. 29.6M on MS-COCO and 0.3M vs. 29.6M on VOC2007). Notably, DualCoOp yields a large improvement over the second-best method on both MS-COCO and VOC2007, especially when only a small proportion of labels is provided during training. This indicates that DualCoOp can quickly adapt to the multi-label recognition task with a few labels by taking advantage of the powerful vision-language pretraining.
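The reported learnable overhead is consistent with a simple count of prompt parameters. The sketch below assumes two prompts per class, 16 context tokens per prompt, and a word embedding dimension of 512 (the example dimension mentioned for CLIP); the function name is hypothetical.

```python
def prompt_parameters(num_classes, tokens_per_prompt=16, embed_dim=512, prompts_per_class=2):
    """Number of learnable values in class-specific dual prompts (assumed layout)."""
    return num_classes * prompts_per_class * tokens_per_prompt * embed_dim

coco_params = prompt_parameters(80)  # MS-COCO has 80 categories
voc_params = prompt_parameters(20)   # VOC2007 has 20 categories
print(f"MS-COCO: {coco_params / 1e6:.1f}M, VOC2007: {voc_params / 1e6:.1f}M")
```

Under these assumptions the counts are 1,310,720 and 327,680, which round to the 1.3M and 0.3M figures quoted above and explain why the overhead scales with the number of classes.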
4.2 Zero-shot Multi-Label Recognition
Datasets. Following ben2021semantic ; huynh2020shared , we conduct experiments on MS-COCO lin2014microsoft and NUS-WIDE chua2009nus to perform zero-shot multi-label recognition. On MS-COCO, we follow bansal2018zero ; ben2021semantic to split the dataset into 48 seen classes and 17 unseen classes. NUS-WIDE chua2009nus dataset includes 270K images. Following ben2021semantic ; huynh2020shared we use 81 human-annotated categories as unseen classes and an additional set of 925 labels obtained from Flickr tags as seen classes.
Evaluation. We follow ben2021semantic and report precision, recall, and F1 score at Top-3 predictions in each image on MS-COCO. We also follow ben2021semantic ; huynh2020shared to report mAP over all categories as well as precision, recall, and F1 score at Top-3 and Top-5 predictions in each image on NUS-WIDE. We evaluate all methods with both zero-shot setting (test only on unseen classes) and generalized zero-shot setting (test on both seen and unseen classes).
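A minimal sketch of precision, recall, and F1 at Top-K predictions is given below. It assumes the counts are pooled over all images before computing the ratios, which is one common convention; the exact averaging used in the cited protocols is not specified in this section, and the function name is hypothetical.

```python
import numpy as np

def topk_precision_recall_f1(scores, targets, k=3):
    """Precision, recall, and F1 when the k highest-scoring labels are predicted per image.

    scores: (num_images, num_classes) prediction scores.
    targets: (num_images, num_classes) binary ground truth.
    Counts are pooled over all images (an assumed convention).
    """
    scores = np.asarray(scores, dtype=float)
    targets = np.asarray(targets, dtype=bool)
    topk = np.argsort(-scores, axis=1)[:, :k]
    pred = np.zeros_like(targets)
    np.put_along_axis(pred, topk, True, axis=1)
    tp = np.logical_and(pred, targets).sum()
    precision = tp / pred.sum()
    recall = tp / max(targets.sum(), 1)
    f1 = 2 * precision * recall / (precision + recall) if tp > 0 else 0.0
    return precision, recall, f1

scores = [[0.9, 0.8, 0.1, 0.2],
          [0.1, 0.7, 0.6, 0.5]]
targets = [[1, 0, 0, 1],
           [0, 1, 1, 0]]
precision, recall, f1 = topk_precision_recall_f1(scores, targets, k=2)
```

Because exactly K labels are predicted per image, precision is capped whenever an image has fewer than K true labels, which is worth keeping in mind when reading Top-3 and Top-5 numbers side by side.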
Implementation. We adopt ResNet-50 he2016deep , similar to ben2021semantic , as the visual encoder in DualCoOp for input resolution 224. Instead of learning class-specific prompts, we learn class-agnostic context vectors with 64 context tokens (N = 64) for all classes, which is the only learnable part in DualCoOp. We optimize context vectors for 50 epochs with a batch size of 32/192 for MS-COCO/NUS-WIDE, respectively. Other implementation details are the same as in Sec. 4.1.
Baselines. To evaluate the effectiveness of DualCoOp in the zero-shot setting, we compare with the following baselines: (1). CONSE norouzi2013zero adopts an ensemble of classifiers for unseen classes. (2). LabelEM akata2015label learns a joint image-label embedding. (3). Fast0Tag zhang2016fast and SDL ben2021semantic estimate one or multiple diverse principal directions of the input images. (4). Deep0Tag rahman2018deep and LESA huynh2020shared estimate the relevant regions via region proposals and attention techniques respectively. (5). BiAM narayan2021discriminative enhances the region-based features to minimize inter-class feature entanglement.
|SDL (M=2) ben2021semantic|30.6M|26.3|65.3|37.5|59.0|60.8|59.9|

|Method|#P|P@3|R@3|F1@3|P@5|R@5|F1@5|mAP|
|Zero-Shot Learning (ZSL)|
|One Attention per Label Kim2018|12.8M|20.9|33.5|25.8|16.2|43.2|23.6|10.4|
|LESA (M=10) huynh2020shared|0.45M|25.7|41.1|31.6|19.7|52.5|28.7|19.4|
|SDL (M=7) ben2021semantic|33.6M|24.2|41.3|30.5|18.8|53.4|27.8|25.9|
|Generalized Zero-Shot Learning (GZSL)|
|One Attention per Label Kim2018|12.8M|17.9|7.9|10.9|15.6|11.5|13.2|3.7|
|LESA (M=10) huynh2020shared|0.45M|23.6|10.4|14.4|19.8|14.6|16.8|5.6|
|SDL (M=7) ben2021semantic|33.6M|27.7|13.9|18.5|23.0|19.3|21.0|12.1|
Results. Tables 2-3 show the comparison between DualCoOp and all SOTA methods of zero-shot learning and generalized zero-shot learning on the MS-COCO and NUS-WIDE datasets. DualCoOp achieves the best F1 score in all cases with a very light learnable overhead (0.02M) and improves the performance of zero-shot learning (unseen labels) by a significant margin: the F1 score improves by 12.5 @Top-3 on MS-COCO, and by 10.8 @Top-3 and 10.9 @Top-5 on NUS-WIDE. This shows the power of exploiting the pretrained alignment of textual and visual spaces in CLIP via DualCoOp to solve multi-label recognition.
4.3 Ablation Studies
Effectiveness of Text Supervision. To show the effectiveness of text supervision from the label space, we compare a model learned with a discrete label space (“Discrete Label”) against three methods (SST chen2022structured , SARB pu2022semantic , and DualCoOp) that introduce the textual space to utilize the contextual correlation of labels (Table 4). We find that methods with text supervision usually perform better than the method using only discrete labels. However, when the semantic annotations are limited, text supervision sometimes yields worse performance (e.g., the mAP of SST falls below that of Discrete Label when only a small proportion of labels is available). By adopting the well-pretrained visual-textual alignment, DualCoOp achieves strong performance, exceeding Discrete Label, and quickly adapts to the dataset even with limited labels.
|M1|Hand-crafted Pos./Neg. Templates + Classname|0|25.6|63.6|36.5|31.0|36.2|33.4|
|M2|Pos. Learnable Prompt + Classname (N=64)|0.01M|31.2|77.5|44.5|55.7|65.0|60.0|
|M3|Neg. Learnable Prompt + Classname (N=64)|0.01M|9.3|23.0|13.2|2.6|3.0|2.8|
|M4|Dual Learnable Prompts + Classname (N=64)|0.02M|35.3|87.6|50.3|58.4|68.1|62.9|
|M5|Dual Learnable Prompts + Classname (N=32)|0.01M|35.8|88.9|51.0|57.4|67.0|61.9|
Ablation of Prompt Design. We compare our proposed dual learnable prompts with two hand-crafted prompts and one prompt learning method on the MS-COCO dataset in the zero-shot setting (see Table 5). Hand-crafted prompts can use either contextless class names li2017learning or manually designed prompt templates. In our experiments, we carefully choose the positive and negative prompt templates as “a photo of a [classname]” and “a photo without a [classname]”. In addition to performing binary classification for each class with dual learnable prompts as the input, we also experiment with learning a single prompt of positive or negative context and using a chosen threshold (0.5 in our experiment) to make the prediction for each class. As we can see, the single positive prompt learning method (M2) performs better than the non-learnable methods (M0 and M1), and a single negative learnable prompt (M3) achieves much worse accuracy than its positive counterpart (M2). However, when we include both positive and negative prompts, dual prompts (M4) perform even better than a single prompt, which indicates that DualCoOp learns complementary and beneficial information in the dual prompt pair. To keep the same number of learnable parameters as in the single-prompt settings, we also halve the number of tokens (M5), and find that DualCoOp still outperforms the two single prompts in M2 and M3 by large gaps, demonstrating the effectiveness of our dual-prompt design.
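The two decision rules compared in this ablation can be sketched as follows. The template strings are the ones quoted above; the function names are hypothetical, and the single-prompt rule assumes a score already mapped to the range [0, 1].

```python
def handcrafted_prompts(classname):
    """Hand-crafted positive and negative templates quoted in the ablation."""
    return f"a photo of a {classname}", f"a photo without a {classname}"

def predict_with_single_prompt(score, threshold=0.5):
    """Single-prompt rule: compare one score against a chosen threshold."""
    return score > threshold

def predict_with_dual_prompts(pos_logit, neg_logit):
    """Dual-prompt rule: compare the positive and negative logits directly."""
    return pos_logit > neg_logit

pos_text, neg_text = handcrafted_prompts("dog")
```

The dual rule needs no threshold, since the negative prompt acts as a learned, class-dependent reference point.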
|Visual Aggregation||Finetune.||Train Res.||Test Res.|
Multi-Headed Attention vs. Class-Specific Region Aggregation. In Table 6, we compare the adaptive ability of these two visual aggregation methods when training/testing with a larger resolution, which is crucial in multi-label recognition as spatial details matter. For a fair comparison, we only replace the class-specific region aggregation in DualCoOp with the original multi-headed attention layer in CLIP radford2021learning at the end of the visual encoder. We adaptively resize the input feature map to match the input dimension of the multi-headed attention layer.
As shown in Table 6, multi-headed attention is bound to the pre-training image resolution (224 in CLIP), while our class-specific region aggregation benefits from the increased input resolution either during training or at inference. Our class-specific feature aggregation uses the original weights, yet performs better than finetuning the original multi-headed attention layer.
Ablation of Aggregation Function. We experiment with different functions to aggregate the regional logits for each class in Fig. 3. We compute final logits in three ways: (1) taking the average of logits at all spatial locations (“Ave”), (2) taking the region with the largest positive logit (“Max”), and (3) generating aggregating weights for all spatial locations via a softmax function over the positive logits (“Ours”). “Max” performs better than “Ave”, which indicates the regional feature is more informative than the global feature in multi-label recognition. Furthermore, by taking account of both the regional and the global features, “Ours” gives the best performance.
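The three aggregation functions can be sketched side by side as below. The array layout (classes by regions), the function name, and the choice to read the negative logit at the same region as the largest positive logit in the “Max” variant are assumptions for illustration.

```python
import numpy as np

def aggregate(pos, neg, mode="softmax"):
    """Aggregate regional logits of shape (num_classes, num_regions) per class.

    mode: "ave" (uniform weights), "max" (region with the largest positive logit),
    or "softmax" (weights from a softmax over the positive logits).
    """
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    if mode == "ave":
        weights = np.full_like(pos, 1.0 / pos.shape[1])
    elif mode == "max":
        weights = np.zeros_like(pos)
        weights[np.arange(pos.shape[0]), pos.argmax(axis=1)] = 1.0
    elif mode == "softmax":
        e = np.exp(pos - pos.max(axis=1, keepdims=True))
        weights = e / e.sum(axis=1, keepdims=True)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # The same weights are applied to the positive and the negative logits.
    return (weights * pos).sum(axis=1), (weights * neg).sum(axis=1)

pos = np.array([[0.0, 1.0, 3.0]])   # one class, three regions
neg = np.zeros_like(pos)
ave_pos, _ = aggregate(pos, neg, "ave")
max_pos, _ = aggregate(pos, neg, "max")
soft_pos, _ = aggregate(pos, neg, "softmax")
```

On this toy input the softmax-weighted logit lies between the average and the maximum, which matches the intuition that it blends regional and global evidence.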
In this paper, we propose a unified framework, DualCoOp, for two types of multi-label recognition with limited annotations. It utilizes the powerful vision-language pretraining from a large-scale dataset. By introducing a lightweight learnable overhead, it can quickly adapt to solve multi-label recognition after receiving a small amount of labels. In DualCoOp, we learn a pair of positive and negative prompts followed by the target class name as the linguistic input. Furthermore, to better aggregate visual region features for each class, we reformulate the original visual attention in the pretraining model as a class-specific region feature aggregation. We conduct extensive experiments for both partial-label MLR and Zero-Shot MLR across MS-COCO, VOC2007, and NUS-WIDE datasets showing the efficacy of our proposed approach over state-of-the-art methods.
Limitations. Since the vision-language pretraining adopts a large Transformer-based language model and all labels need to be fed forward through the text encoder, the large language model limits the size of the label set. Also, compared to training the model with both seen and unseen labels, we still get worse performance for the zero-shot unseen classes even though we have used 400M auxiliary samples in the pretraining. This highlights the difficulty of zero-shot MLR.
Negative Societal Impacts.
Negative impacts of our research are difficult to predict; however, it shares many of the pitfalls associated with deep learning models. These include susceptibility to adversarial attacks and data poisoning, dataset bias, and lack of interpretability. Other risks associated with the deployment of computer vision systems include privacy violations when images are captured without consent or used to track individuals for profit, and increased automation resulting in job losses. While we believe that these issues should be mitigated, they are beyond the scope of this paper. Furthermore, we should be cautious about failures of the system, which could impact the performance and user experience of high-level AI systems based on our research.
- (1) Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for image classification. IEEE Trans. on PAMI, 38(7):1425–1438, 2015.
- (2) Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. Zero-shot object detection. In ECCV, 2018.
- (3) Avi Ben-Cohen, Nadav Zamir, Emanuel Ben-Baruch, Itamar Friedman, and Lihi Zelnik-Manor. Semantic diversity learning for zero-shot multi-label classification. In ICCV, 2021.
- (4) Serhat Selcuk Bucak, Rong Jin, and Anil K Jain. Multi-label learning with incomplete class assignments. In CVPR, 2011.
- (5) Minmin Chen, Alice Zheng, and Kilian Weinberger. Fast image tagging. In ICML, 2013.
- (6) Tianshui Chen, Liang Lin, Xiaolu Hui, Riquan Chen, and Hefeng Wu. Knowledge-guided multi-label few-shot learning for general image recognition. IEEE Trans. on PAMI, 2020.
- (7) Tianshui Chen, Tao Pu, Hefeng Wu, Yuan Xie, and Liang Lin. Structured semantic transfer for multi-label recognition with partial labels. 2022.
- (8) Tianshui Chen, Muxin Xu, Xiaolu Hui, Hefeng Wu, and Liang Lin. Learning semantic-specific graph representation for multi-label image recognition. In ICCV, 2019.
- (9) Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, and Yanwen Guo. Multi-label image recognition with graph convolutional networks. In CVPR, 2019.
- (10) Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. Nus-wide: a real-world web image database from national university of singapore. In ICIVR, 2009.
- (11) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
- (12) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. NeurIPS, 2019.
- (13) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- (14) Thibaut Durand, Nazanin Mehrasa, and Greg Mori. Learning a deep convnet for multi-label classification with partial labels. In CVPR, 2019.
- (15) Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
- (16) Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
- (17) Akshita Gupta, Sanath Narayan, Salman Khan, Fahad Shahbaz Khan, Ling Shao, and Joost van de Weijer. Generative multi-label zero-shot learning. arXiv preprint arXiv:2101.11606, 2021.
- (18) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- (19) Tony Huang, Jack Chu, and Fangyun Wei. Unsupervised prompt learning for vision-language models. arXiv preprint arXiv:2204.03649, 2022.
- (20) Dat Huynh and Ehsan Elhamifar. Interactive multi-label cnn learning with partial labels. In CVPR, 2020.
- (21) Dat Huynh and Ehsan Elhamifar. A shared multi-attention framework for multi-label zero-shot learning. In CVPR, 2020.
- (22) Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- (23) Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016.
- (24) Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. arXiv preprint arXiv:2112.04478, 2021.
- (25) Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear Attention Networks. In NIPS, 2018.
- (26) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73, 2017.
- (27) Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4. IJCV, 128(7):1956–1981, 2020.
- (28) Chung-Wei Lee, Wei Fang, Chih-Kuan Yeh, and Yu-Chiang Frank Wang. Multi-label zero-shot learning with structured knowledge graphs. In CVPR, 2018.
- (29) Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
- (30) Ang Li, Allan Jabri, Armand Joulin, and Laurens Van Der Maaten. Learning visual n-grams from web data. In ICCV, 2017.
- (31) Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
- (32) Yining Li, Chen Huang, Chen Change Loy, and Xiaoou Tang. Human attribute recognition by deep hierarchical contexts. In ECCV, 2016.
- (33) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
- (34) Feng Liu, Tao Xiang, Timothy M Hospedales, Wankou Yang, and Changyin Sun. Semantic regularisation for recurrent image annotation. In CVPR, 2017.
- (35) Weiwei Liu and Ivor Tsang. On the optimality of classifier chain for multi-label classification. NIPS, 28, 2015.
- (36) Yongcheng Liu, Lu Sheng, Jing Shao, Junjie Yan, Shiming Xiang, and Chunhong Pan. Multi-label image classification via knowledge distillation from weakly-supervised detection. In ACM MM, 2018.
- (37) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
- (38) Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, and Xinmei Tian. Prompt distribution learning. arXiv preprint arXiv:2205.03340, 2022.
- (39) Timo Lüddecke and Alexander S. Ecker. Prompt-based multi-modal image segmentation. arXiv preprint arXiv:2112.10003, 2021.
- (40) Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
- (41) Thomas Mensink, Efstratios Gavves, and Cees GM Snoek. Costa: Co-occurrence statistics for zero-shot classification. In CVPR, 2014.
- (42) Ishan Misra, C Lawrence Zitnick, Margaret Mitchell, and Ross Girshick. Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels. In CVPR, 2016.
- (43) Sanath Narayan, Akshita Gupta, Salman Khan, Fahad Shahbaz Khan, Ling Shao, and Mubarak Shah. Discriminative region-based multi-label zero-shot learning. In ICCV, 2021.
- (44) Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
- (45) Tao Pu, Tianshui Chen, Hefeng Wu, and Liang Lin. Semantic-aware representation blending for multi-label image recognition with partial labels. In AAAI, 2022.
- (46) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
- (47) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019.
- (48) Shafin Rahman and Salman Khan. Deep multiple instance learning for zero-shot image tagging. In ACCV, 2018.
- (49) Shafin Rahman, Salman Khan, and Nick Barnes. Deep0tag: Deep multiple instance learning for zero-shot image tagging. IEEE Trans. on MM, 22(1):242–255, 2019.
- (50) Tal Ridnik, Emanuel Ben-Baruch, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. Asymmetric loss for multi-label classification. In ICCV, 2021.
- (51) Nikolaos Sarafianos, Xiang Xu, and Ioannis A Kakadiaris. Deep imbalanced attribute classification using visual attention aggregation. In ECCV, 2018.
- (52) Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
- (53) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- (54) Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
- (55) Yu-Yin Sun, Yin Zhang, and Zhi-Hua Zhou. Multi-label learning with weak label. In AAAI, 2010.
- (56) Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1–13, 2007.
- (57) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NIPS, 2017.
- (58) Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. Cnn-rnn: A unified framework for multi-label image classification. In CVPR, 2016.
- (59) Meng Wang, Changzhi Luo, Richang Hong, Jinhui Tang, and Jiashi Feng. Beyond object proposals: Random crop pooling for multi-label image recognition. IEEE Trans on IP, 25(12):5678–5688, 2016.
- (60) Ya Wang, Dongliang He, Fu Li, Xiang Long, Zhichao Zhou, Jinwen Ma, and Shilei Wen. Multi-label classification with label graph superimposing. In AAAI, 2020.
- (61) Zhouxia Wang, Tianshui Chen, Guanbin Li, Ruijia Xu, and Liang Lin. Multi-label image recognition by recurrently discovering attentional regions. In ICCV, 2017.
- (62) Mitchell Wortsman, Gabriel Ilharco, Mike Li, Jong Wook Kim, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. arXiv preprint arXiv:2109.01903, 2021.
- (63) Ming-Kun Xie and Sheng-Jun Huang. Partial multi-label learning. In AAAI, 2018.
- (64) Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Cpt: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797, 2021.
- (65) Vacit Oguz Yazici, Abel Gonzalez-Garcia, Arnau Ramisa, Bartlomiej Twardowski, and Joost van de Weijer. Orderless recurrent models for multi-label classification. In CVPR, 2020.
- (66) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
- (67) Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
- (68) Yang Zhang, Boqing Gong, and Mubarak Shah. Fast zero-shot image tagging. In CVPR, 2016.
- (69) Chong Zhou, Chen Change Loy, and Bo Dai. Denseclip: Extract free dense labels from clip. arXiv preprint arXiv:2112.01071, 2021.
- (70) Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134, 2021.
- (71) Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. arXiv preprint arXiv:2203.05557, 2022.
- (72) Feng Zhu, Hongsheng Li, Wanli Ouyang, Nenghai Yu, and Xiaogang Wang. Learning spatial regularization with image-level supervisions for multi-label image classification. In CVPR, 2017.
Appendix A Different Prompt Length
We compare the performance of DualCoOp with different lengths of prompt context in all three experimental scenarios (see Fig. 4 and 5). In MLR with partial labels, we learn class-specific prompts, and thus DualCoOp performs well when the prompt length is small, such as 8 or 16. For zero-shot learning in MLR, we learn uniform prompts shared by all classes, which requires a larger prompt length (e.g. 32 or 64) for good performance. In the main paper, we use for all experiments of MLR with partial labels and use for experiments in zero-shot learning.
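To make the trade-off concrete, the sketch below counts learnable prompt parameters for class-specific versus shared prompts at several context lengths. It assumes that each context token is a vector of dimension `embed_dim` and that one positive and one negative prompt are learned per prompt set. The values `embed_dim = 512` and `num_classes = 80` (the number of MS-COCO categories) are illustrative, and `prompt_param_count` is a hypothetical helper rather than part of any library.

```python
def prompt_param_count(context_len, embed_dim, num_classes, class_specific):
    """Count learnable context parameters for a positive/negative prompt pair.

    Assumption: every context token is an `embed_dim`-dimensional vector, and
    class-specific prompts keep a separate pair of contexts for each class.
    """
    per_prompt = context_len * embed_dim
    num_sets = num_classes if class_specific else 1
    return 2 * num_sets * per_prompt  # factor 2: positive and negative prompt


embed_dim, num_classes = 512, 80  # illustrative values, not from the paper
for context_len in (8, 16, 32, 64):
    specific = prompt_param_count(context_len, embed_dim, num_classes, True)
    shared = prompt_param_count(context_len, embed_dim, num_classes, False)
    print(f"length={context_len:2d}  class-specific={specific:,}  shared={shared:,}")
```

Under these assumptions, class-specific prompts carry `num_classes` times as many learnable parameters as shared prompts of the same length, which is one plausible reading of why a short context already suffices in the partial-label setting.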
Appendix B Performance on the Full Dataset
We also fine-tune the visual backbone on the full multi-label recognition dataset MS-COCO and achieve mAP with ResNet-101 and input resolution 448, compared to achieved by the same setting using ASL ridnik2021asymmetric .
Appendix C Full performance of MLR with Partial Labels
In this section, we report the average per-class and average overall precision (CP and OP), recall (CR and OR), and F1 score (CF1 and OF1) of DualCoOp for MLR with partial labels on MS-COCO lin2014microsoft and VOC2007 everingham2010pascal (see Tables 7 and 8 in the supplementary material), supplementing Table 1 in the main paper.
| Amount of Labels | CP | CR | CF1 | OP | OR | OF1 | mAP |
| Amount of Labels | CP | CR | CF1 | OP | OR | OF1 | mAP |
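For readers unfamiliar with the per-class and overall metrics named in this appendix, the sketch below computes CP, CR, CF1, OP, OR, and OF1 from binary label matrices using their standard definitions. It is an illustrative sketch rather than the evaluation script used for the tables: it assumes predictions have already been binarized, treats a ratio with a zero denominator as 0, and does not compute mAP. The function name `multilabel_prf` and the toy matrices are hypothetical.

```python
import numpy as np


def multilabel_prf(y_true, y_pred):
    """Per-class (C*) and overall (O*) precision, recall, and F1.

    y_true, y_pred: (num_images, num_classes) binary arrays.
    Assumption: a ratio with a zero denominator is counted as 0.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    correct = (y_true & y_pred).sum(axis=0).astype(float)  # true positives per class
    predicted = y_pred.sum(axis=0).astype(float)           # predicted positives per class
    ground = y_true.sum(axis=0).astype(float)              # ground-truth positives per class

    def safe_div(a, b):
        return np.divide(a, b, out=np.zeros_like(a), where=b > 0)

    def f1(p, r):
        return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

    # Per-class metrics: compute the ratio for each class, then average.
    cp = float(safe_div(correct, predicted).mean())
    cr = float(safe_div(correct, ground).mean())
    # Overall metrics: pool the counts over all classes, then take the ratio.
    op = float(correct.sum() / predicted.sum()) if predicted.sum() > 0 else 0.0
    or_ = float(correct.sum() / ground.sum()) if ground.sum() > 0 else 0.0

    return {"CP": cp, "CR": cr, "CF1": f1(cp, cr),
            "OP": op, "OR": or_, "OF1": f1(op, or_)}


# Toy example: 4 images, 3 classes.
y_true = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 1, 1]]
metrics = multilabel_prf(y_true, y_pred)
print({name: round(value, 4) for name, value in metrics.items()})
```

The per-class variants weight every category equally, so rare classes influence them as much as frequent ones, whereas the overall variants are dominated by frequent classes.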
Appendix D Visualization of Class-Specific Region Feature Aggregation
We visualize the class-specific region feature aggregation on the MS-COCO dataset (Fig. 6). DualCoOp assigns high attention scores to the correct objects.