VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

12/04/2021, by Renrui Zhang, et al.

Contrastive Vision-Language Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. Supervised by large-scale image-text pairs, CLIP is able to align paired images and texts and thus conduct zero-shot recognition in open-vocabulary scenarios. However, there exists a semantic gap between specific downstream applications and the generally pre-trained knowledge, which makes the matching sub-optimal on downstream tasks. In this paper, we propose VT-CLIP to enhance vision-language modeling via visual-guided texts. Specifically, we guide the text feature to adaptively explore informative regions of the image and aggregate the visual feature through a cross-attention mechanism. In this way, the visual-guided text becomes more semantically correlated with the image, which greatly benefits the matching process. In few-shot settings, we evaluate VT-CLIP on 11 well-known classification datasets and conduct extensive ablation studies to demonstrate its effectiveness. The code will be released soon.
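As a rough illustration of the cross-attention step the abstract describes, the sketch below lets CLIP text embeddings (as queries) attend over image patch tokens (as keys/values), then matches the refined text features against the global image feature in the usual CLIP cosine-similarity fashion. This is a minimal PyTorch sketch under assumed dimensions and layer choices; the module and function names are hypothetical, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualGuidedText(nn.Module):
    """Cross-attention block: text features (queries) attend over image
    patch features (keys/values) so each class text becomes conditioned
    on informative image regions. Illustrative sketch only."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, patch_feats):
        # text_feats:  (num_classes, dim) -- CLIP text embeddings, one per class
        # patch_feats: (num_patches, dim) -- spatial tokens from the visual encoder
        q = text_feats.unsqueeze(0)        # (1, num_classes, dim)
        kv = patch_feats.unsqueeze(0)      # (1, num_patches, dim)
        guided, _ = self.attn(q, kv, kv)   # text queries aggregate visual regions
        # Residual connection keeps the pre-trained CLIP text prior.
        return self.norm(guided.squeeze(0) + text_feats)

def classify(image_feat, guided_text, logit_scale):
    # Standard CLIP matching: cosine similarity between the global image
    # feature and the visual-guided text features, scaled by temperature.
    image_feat = F.normalize(image_feat, dim=-1)    # (dim,)
    guided_text = F.normalize(guided_text, dim=-1)  # (num_classes, dim)
    return logit_scale * image_feat @ guided_text.t()  # (num_classes,) logits
```

In this sketch only the cross-attention block would be trained in the few-shot setting, while the frozen CLIP encoders supply the text embeddings, patch tokens, and global image feature.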
