Learning to Compose Diversified Prompts for Image Emotion Classification

01/26/2022
by Sinuo Deng, et al.

Contrastive Language-Image Pre-training (CLIP) represents the latest incarnation of pre-trained vision-language models. Although CLIP has recently shown its superior power on a wide range of downstream vision-language tasks such as Visual Question Answering, it remains underexplored for Image Emotion Classification (IEC). Adapting CLIP to the IEC task poses three significant challenges: the tremendous gap between the training objectives of pretraining and IEC, shared prompts that are suboptimal, and prompts that are invariant across all instances. In this paper, we propose a general framework that shows how CLIP can be effectively applied to IEC. We first introduce a prompt tuning method that mimics the pretraining objective of CLIP and can thus leverage the rich image and text semantics entailed in CLIP. We then automatically compose instance-specific prompts by conditioning them on the categories and image contents of instances, which diversifies the prompts and avoids the suboptimality problem. Evaluations on six widely used affective datasets demonstrate that our proposed method outperforms state-of-the-art methods by a large margin (i.e., up to 9.29 on the EmotionROI dataset) on IEC tasks, while training only a small number of parameters. Our code will be made publicly available for research purposes.
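To make the idea concrete, below is a minimal sketch, not the authors' released implementation, of what instance-conditioned prompt tuning on top of a frozen CLIP might look like in PyTorch with the openai/CLIP package. The learnable context vectors, the small meta-network that shifts them per image (in the style of conditional prompt learning), the emotion class names, and all hyperparameters are illustrative assumptions.

```python
# Sketch: instance-specific prompt composition over a frozen CLIP backbone.
# Only the context vectors and the meta-network are trained, in keeping
# with the abstract's claim of training only a small number of parameters.
import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP

device = "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
for p in clip_model.parameters():           # freeze CLIP entirely
    p.requires_grad_(False)

# Assumed emotion categories; actual label sets vary per affective dataset.
classnames = ["amusement", "anger", "awe", "contentment",
              "disgust", "excitement", "fear", "sadness"]
n_ctx = 4                                   # number of learnable context tokens
ctx_dim = clip_model.ln_final.weight.shape[0]
vis_dim = clip_model.visual.output_dim

# Tokenize placeholder prompts "X X X X <class>." once; the "X" slots are
# later overwritten by the learned, image-conditioned context vectors.
prompt_texts = [" ".join(["X"] * n_ctx) + " " + c + "." for c in classnames]
tokenized = torch.cat([clip.tokenize(t) for t in prompt_texts]).to(device)
with torch.no_grad():
    embed = clip_model.token_embedding(tokenized)         # (n_cls, 77, ctx_dim)
prefix, suffix = embed[:, :1, :], embed[:, 1 + n_ctx:, :]  # SOS / class+EOS

ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)   # shared context
meta_net = nn.Sequential(                                 # image -> ctx shift
    nn.Linear(vis_dim, vis_dim // 16), nn.ReLU(),
    nn.Linear(vis_dim // 16, ctx_dim))

def encode_text(prompts):
    """Run CLIP's text transformer on pre-built prompt embeddings."""
    x = prompts + clip_model.positional_embedding
    x = clip_model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = clip_model.ln_final(x)
    eos = tokenized.argmax(dim=-1)          # features at the EOS position
    return x[torch.arange(x.shape[0]), eos] @ clip_model.text_projection

def logits_for(images):
    img_feat = clip_model.encode_image(images)            # (B, vis_dim)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    out = []
    for f in img_feat:                      # compose prompts per instance
        shifted = ctx + meta_net(f)         # (n_ctx, ctx_dim)
        prompts = torch.cat(
            [prefix, shifted.expand(len(classnames), -1, -1), suffix], dim=1)
        txt = encode_text(prompts)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        out.append(clip_model.logit_scale.exp() * f @ txt.t())
    return torch.stack(out)                 # (B, n_cls)

# One training step on a stand-in batch; only ctx and meta_net get gradients.
opt = torch.optim.AdamW([ctx, *meta_net.parameters()], lr=2e-3)
images = torch.randn(2, 3, 224, 224)
labels = torch.tensor([0, 3])
opt.zero_grad()
loss = nn.functional.cross_entropy(logits_for(images), labels)
loss.backward()
opt.step()
```

Scoring classes by cosine similarity between image and prompt features, scaled by CLIP's learned temperature, mirrors CLIP's contrastive pretraining objective, while the per-image shift from the meta-network is what makes each prompt instance-specific rather than shared and invariant.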

Related research

07/13/2021 · How Much Can CLIP Benefit Vision-and-Language Tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained vis...

07/05/2022 · Vision-and-Language Pretraining
With the burgeoning amount of data of image-text pairs and diversity of ...

06/04/2023 · Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes
Training models to apply common-sense linguistic knowledge and visual co...

12/10/2021 · Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation
Most existing vision-language pre-training methods focus on understandin...

09/14/2022 · CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
The pre-trained image-text models, like CLIP, have demonstrated the stro...

04/23/2021 · Playing Lottery Tickets with Vision and Language
Large-scale transformer-based pre-training has recently revolutionized v...

12/21/2022 · Contrastive Language-Vision AI Models Pretrained on Web-Scraped Multimodal Data Exhibit Sexual Objectification Bias
Nine language-vision AI models trained on web scrapes with the Contrasti...
