CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

09/24/2021
by Yuan Yao, et al.

Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabilities in grounding natural language in image data, facilitating a broad variety of cross-modal tasks. However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for large quantities of labeled data to stimulate the visual grounding capability of VL-PTMs for downstream tasks. To address the challenge, we present Cross-modal Prompt Tuning (CPT, alternatively, Colorful Prompt Tuning), a novel paradigm for tuning VL-PTMs, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, our prompt tuning approach enables strong few-shot and even zero-shot visual grounding capabilities of VL-PTMs. Comprehensive experimental results show that prompt-tuned VL-PTMs outperform their fine-tuned counterparts by a large margin (e.g., 17.3% absolute accuracy improvement, and 73.8% relative standard deviation reduction on average with one shot in RefCOCO evaluation). All the data and code will be available to facilitate future research.
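
To make the fill-in-the-blank reformulation concrete, the following is a minimal Python sketch of the idea described in the abstract, not the authors' implementation. It assumes candidate image regions are already given as boxes, marks each with a distinct color word, turns the query into a sentence with a masked color slot, and returns the region whose color scores highest. The color list, the template wording, the `[MASK]` token, and every function name here are illustrative assumptions; `toy_scorer` is a stand-in for the masked-token scores a real VL-PTM would produce.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); hypothetical region format

# Hypothetical color vocabulary; the abstract does not specify which colors are used.
COLORS: List[str] = ["red", "green", "blue", "yellow", "purple"]


def assign_colors(boxes: List[Box]) -> Dict[str, Box]:
    """Pair each candidate region with a distinct color word (the visual marker)."""
    if len(boxes) > len(COLORS):
        raise ValueError("more regions than available colors")
    return dict(zip(COLORS, boxes))


def build_prompt(query: str, mask_token: str = "[MASK]") -> str:
    """Turn a referring expression into a fill-in-the-blank sentence (assumed template)."""
    return f"{query} is in {mask_token} color"


def ground(
    query: str,
    boxes: List[Box],
    score_colors: Callable[[str, List[str]], Dict[str, float]],
) -> Box:
    """Return the region whose color word best fills the masked slot."""
    color_to_box = assign_colors(boxes)
    prompt = build_prompt(query)
    scores = score_colors(prompt, list(color_to_box))
    best_color = max(scores, key=scores.get)
    return color_to_box[best_color]


def toy_scorer(prompt: str, colors: List[str]) -> Dict[str, float]:
    """Stand-in for a VL-PTM: always prefers 'green'. Not a real model."""
    return {color: (1.0 if color == "green" else 0.0) for color in colors}


if __name__ == "__main__":
    regions = [(0, 0, 10, 10), (5, 5, 20, 20)]
    print(build_prompt("the horse watched by the woman"))
    print(ground("the horse watched by the woman", regions, toy_scorer))
```

In a real setting the image regions would also be visually overlaid with their assigned colors before being passed to the model, so that the color word in the text and the colored region in the image refer to the same object; the sketch omits that step and only shows the text-side bookkeeping.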

research
05/05/2022

Declaration-based Prompt Tuning for Visual Question Answering

In recent years, the pre-training-then-fine-tuning paradigm has yielded ...
research
12/19/2022

Position-guided Text Prompt for Vision-Language Pre-training

Vision-Language Pre-Training (VLP) has shown promising capabilities to a...
research
09/03/2023

VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

Large-scale text-to-image diffusion models have shown impressive capabil...
research
04/16/2022

Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting

We study multimodal few-shot object detection (FSOD) in this paper, usin...
research
07/01/2022

VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations

Vision-Language Pretraining (VLP) models have recently successfully faci...
research
07/15/2022

Probing Semantic Grounding in Language Models of Code with Representational Similarity Analysis

Representational Similarity Analysis is a method from cognitive neurosci...
research
05/23/2022

PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

Vision-language pre-training (VLP) has shown impressive performance on a...
