Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment

09/08/2023
by   Hongyu Hu, et al.

Large-scale vision-language models (VLMs), e.g., CLIP, learn broad visual concepts from massive training data and show superb generalization ability. A number of prompt learning methods have been proposed to efficiently adapt VLMs to downstream tasks with only a few training samples. We introduce a novel method, called Dual-Aligned Prompt Tuning (DuAl-PT), that improves prompt learning for vision-language models by incorporating pre-trained large language models (LLMs). Learnable prompts, as in CoOp, model context implicitly through end-to-end training, which makes them difficult to control and interpret. Explicit context descriptions generated by LLMs, such as GPT-3, can be used directly for zero-shot classification, but such prompts rely heavily on LLMs and remain underexplored in few-shot settings. With DuAl-PT, we propose to learn more context-aware prompts that benefit from both explicit and implicit context modeling. To achieve this, we introduce a pre-trained LLM to generate context descriptions and encourage the prompts to absorb the LLM's knowledge through alignment, alongside an alignment between prompts and local image features. Empirically, DuAl-PT achieves superior performance on 11 downstream datasets for few-shot recognition and base-to-new generalization. We hope DuAl-PT can serve as a strong baseline. Code will be available.
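The abstract describes two alignment objectives: pulling learnable prompts toward LLM-generated context descriptions (explicit context) and toward local image features (implicit context). The paper's actual loss formulation is not given here, so the following is only a minimal sketch of what such a dual-alignment objective could look like; the function names, the cosine-similarity form of each term, and the mean-pooling of local features are all assumptions for illustration, not the authors' method.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity matrix between two embedding sets."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def dual_alignment_loss(prompt_emb, desc_emb, local_img_emb, lam=0.5):
    """Hypothetical dual-alignment objective (illustrative only).

    prompt_emb:    (C, D) learnable prompt embeddings, one per class
    desc_emb:      (C, D) embeddings of LLM-generated class descriptions
    local_img_emb: (C, P, D) local image patch features per class
    lam:           weight balancing the two alignment terms
    """
    # Explicit-context term: align each prompt with its class description.
    l_desc = 1.0 - np.mean(np.diag(cosine_sim(prompt_emb, desc_emb)))
    # Implicit-context term: align each prompt with pooled local features.
    pooled = local_img_emb.mean(axis=1)  # (C, D)
    l_img = 1.0 - np.mean(np.diag(cosine_sim(prompt_emb, pooled)))
    return l_desc + lam * l_img
```

If the prompt embeddings already match both targets perfectly, both cosine terms equal 1 and the loss is zero; training would minimize this jointly with the usual classification objective.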


research
06/20/2023

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Prompt tuning, like CoOp, has recently shown promising vision recognizin...
research
08/17/2022

Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model

With the emergence of large pre-trained vision-language models like CLIP, ...
research
10/09/2022

Learning to Decompose Visual Features with Latent Textual Prompts

Recent advances in pre-training vision-language models like CLIP have sh...
research
10/05/2022

Variational prompt tuning improves generalization of vision-language models

Prompt tuning provides an efficient mechanism to adapt large vision-lang...
research
07/15/2023

SINC: Self-Supervised In-Context Learning for Vision-Language Tasks

Large Pre-trained Transformers exhibit an intriguing capacity for in-con...
research
10/13/2022

Visual Classification via Description from Large Language Models

Vision-language models (VLMs) such as CLIP have shown promising performa...
research
02/18/2023

StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization

Large-scale foundation models (e.g., CLIP) have shown promising zero-sho...
