How Much Can CLIP Benefit Vision-and-Language Tasks?

07/13/2021
by   Sheng Shen, et al.

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders that use a relatively small set of manually annotated data (compared to web-crawled data) to perceive the visual world. However, it has been observed that large-scale pretraining can usually result in better generalization performance; for example, CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown a strong zero-shot capability on various vision tasks. To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks. We release our code at https://github.com/clip-vil/CLIP-ViL.
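Both scenarios amount to swapping the region-based visual encoder of a V&L model for CLIP's image encoder and fine-tuning it with the task head. The snippet below is a minimal sketch of scenario 1 (task-specific fine-tuning), assuming PyTorch and the HuggingFace transformers CLIP implementation; the VQA answer head, its output dimension, and the pooling strategy are illustrative assumptions rather than the authors' released implementation (see the linked repository for that).

```python
# Sketch only: plugging a pre-trained CLIP visual encoder into task-specific
# fine-tuning. The ClipVqaHead below is a hypothetical answer classifier,
# not the CLIP-ViL code.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

class ClipVqaHead(nn.Module):
    """Hypothetical VQA answer head on top of CLIP grid features."""
    def __init__(self, hidden_dim: int, num_answers: int = 3129):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, grid_features: torch.Tensor) -> torch.Tensor:
        # Mean-pool the patch tokens (dropping the CLS token), then classify.
        pooled = grid_features[:, 1:, :].mean(dim=1)
        return self.classifier(pooled)

# Load a CLIP visual encoder; during fine-tuning its weights are updated
# jointly with the task head.
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
head = ClipVqaHead(hidden_dim=encoder.config.hidden_size)

# A blank dummy image stands in for a real VQA example here.
image = Image.new("RGB", (224, 224))
inputs = processor(images=image, return_tensors="pt")

features = encoder(**inputs).last_hidden_state  # (1, num_patches + 1, hidden)
logits = head(features)                         # (1, num_answers)
```

In scenario 2, the same CLIP features would instead feed a V&L pre-training objective before transfer to downstream tasks.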

Related research

03/14/2022 · CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment
CLIP has shown a remarkable zero-shot capability on a wide range of visi...

04/03/2023 · Vision-Language Models for Vision Tasks: A Survey
Most visual recognition studies rely heavily on crowd-labelled data in d...

01/26/2022 · Learning to Compose Diversified Prompts for Image Emotion Classification
Contrastive Language-Image Pre-training (CLIP) represents the latest inc...

05/31/2023 · Too Large; Data Reduction for Vision-Language Pre-Training
This paper examines the problems of severe image-text misalignment and h...

05/23/2023 · VisorGPT: Learning Visual Prior via Generative Pre-Training
Various stuff and things in visual data possess specific traits, which c...

04/04/2023 · Exploring Vision-Language Models for Imbalanced Learning
Vision-Language models (VLMs) that use contrastive language-image pre-tr...

03/20/2023 · EVA-02: A Visual Representation for Neon Genesis
We launch EVA-02, a next-generation Transformer-based visual representat...
