MaPLe: Multi-modal Prompt Learning

10/06/2022
by Muhammad Uzair Khattak, et al.

Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal, since it does not allow the flexibility to dynamically adjust both representation spaces for a downstream task. In this work, we propose Multi-modal Prompt Learning (MaPLe) for both the vision and language branches to improve alignment between the vision and language representations. Our design promotes strong coupling between the vision-language prompts to ensure mutual synergy and discourages learning independent uni-modal solutions. Further, we learn separate prompts across different early stages to progressively model stage-wise feature relationships and allow rich context learning. We evaluate the effectiveness of our approach on three representative tasks: generalization to novel classes, new target datasets, and unseen domain shifts. Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on the overall harmonic mean, averaged over 11 diverse image recognition datasets. Code: https://tinyurl.com/2dzs8f3w.
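To make the coupling idea concrete, the following is a minimal PyTorch-style sketch of how multi-modal prompts could be structured: learnable text prompts at several early transformer layers, with vision prompts obtained by projecting them so the two branches stay tied rather than drifting into independent uni-modal solutions. The class name, prompt counts, embedding dimensions, depth, and the linear coupling function are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class MultiModalPrompts(nn.Module):
    """Sketch: learnable prompts for both CLIP branches, coupled via a projection."""
    def __init__(self, n_prompts=2, depth=9, text_dim=512, vision_dim=768):
        super().__init__()
        # One set of learnable text prompts per prompted (early) layer.
        self.text_prompts = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(n_prompts, text_dim)) for _ in range(depth)]
        )
        # Coupling functions: vision prompts are projected from the text prompts,
        # which discourages the branches from learning independent solutions.
        self.couplers = nn.ModuleList(
            [nn.Linear(text_dim, vision_dim) for _ in range(depth)]
        )

    def layer_prompts(self, layer_idx):
        # Return the (text, vision) prompt pair for one prompted layer.
        t = self.text_prompts[layer_idx]
        v = self.couplers[layer_idx](t)
        return t, v

# Usage sketch: prepend these prompts to the token sequences of the first
# `depth` layers of the frozen CLIP text and image encoders; only the prompt
# parameters and the couplers would be trained.
prompts = MultiModalPrompts()
t0, v0 = prompts.layer_prompts(0)
print(t0.shape, v0.shape)  # torch.Size([2, 512]) torch.Size([2, 768])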


Related research

06/20/2023 | MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models
Prompt tuning, like CoOp, has recently shown promising vision recognizin...

05/10/2023 | Multi-Prompt with Depth Partitioned Cross-Modal Learning
In recent years, soft prompt learning methods have been proposed to fine...

03/30/2023 | Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models
Prompt learning has become one of the most efficient paradigms for adapt...

09/02/2021 | Learning to Prompt for Vision-Language Models
Vision-language pre-training has recently emerged as a promising alterna...

06/27/2021 | Multi-Modal Chorus Recognition for Improving Song Search
We discuss a novel task, Chorus Recognition, which could potentially ben...

07/01/2022 | VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations
Vision-Language Pretraining (VLP) models have recently successfully faci...

10/12/2022 | One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Current multimodal models, aimed at solving Vision and Language (V+L) ta...
