Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models

03/30/2023
by Sifan Long, et al.

Prompt learning has become one of the most efficient paradigms for adapting large pre-trained vision-language models (VLMs) to downstream tasks. Current state-of-the-art methods, such as CoOp and ProDA, adopt soft prompts to learn an appropriate prompt for each specific task. The more recent CoCoOp further boosts base-to-new generalization performance via an image-conditional prompt. However, it directly fuses identical image semantics into the prompts of different labels, which significantly weakens the discrimination among different classes, as shown in our experiments. Motivated by this observation, we first propose a class-aware text prompt (CTP) to enrich generated prompts with label-related image information. Unlike CoCoOp, CTP can effectively involve image semantics while avoiding extra ambiguities among different prompts. On the other hand, instead of retaining the complete image representations, we propose text-guided feature tuning (TFT) to make the image branch attend to class-related representations. A contrastive loss is employed to align these augmented text and image representations on downstream tasks. In this way, the image-to-text CTP and text-to-image TFT mutually promote each other to enhance the adaptation of VLMs to downstream tasks. Extensive experiments demonstrate that our method outperforms existing methods by a significant margin. In particular, compared to CoCoOp, we achieve an average improvement of 4.03 on new classes and 3.19
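The abstract does not spell out the contrastive objective used to align the augmented text and image representations. As a hedged sketch only, a generic CLIP-style symmetric contrastive (InfoNCE) loss over matched image/text feature pairs could look like the following; the function name, NumPy formulation, and temperature value are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def contrastive_alignment_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss aligning image and text features.

    Generic CLIP-style sketch (assumed, not the paper's exact loss).
    img_feats, txt_feats: (N, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (N, N) similarity matrix
    labels = np.arange(len(logits))         # matched pairs sit on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average of image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) +
                  cross_entropy(logits.T, labels))
```

Minimizing such a loss pulls each image feature toward the text feature of its own class and pushes it away from the others, which is one plausible way the mutual promotion between CTP and TFT could be trained.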

Related research

- MaPLe: Multi-modal Prompt Learning (10/06/2022)
  Pre-trained vision-language (V-L) models such as CLIP have shown excelle...
- Prompt Distribution Learning (05/06/2022)
  We present prompt distribution learning for effectively adapting a pre-t...
- Understanding and Improving Visual Prompting: A Label-Mapping Perspective (11/21/2022)
  We revisit and advance visual prompting (VP), an input prompting techniq...
- Contextual Prompt Learning for Vision-Language Understanding (07/03/2023)
  Recent advances in multimodal learning has resulted in powerful vision-l...
- Variational prompt tuning improves generalization of vision-language models (10/05/2022)
  Prompt tuning provides an efficient mechanism to adapt large vision-lang...
- CPL: Counterfactual Prompt Learning for Vision and Language Models (10/19/2022)
  Prompt tuning is a new few-shot transfer learning technique that only tu...
- UOR: Universal Backdoor Attacks on Pre-trained Language Models (05/16/2023)
  Backdoors implanted in pre-trained language models (PLMs) can be transfe...
