CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment

03/14/2022
by   Haoyu Song, et al.

CLIP has shown remarkable zero-shot capability on a wide range of vision tasks. Until now, CLIP has mainly been regarded as a powerful visual encoder. However, after being pre-trained with language supervision on a large number of image-caption pairs, CLIP itself should also have acquired some few-shot abilities for vision-language tasks. In this work, we empirically show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language. We first evaluate CLIP's zero-shot performance on a typical visual question answering task and demonstrate its zero-shot cross-modality transfer capability on the visual entailment task. We then propose a parameter-efficient fine-tuning strategy to boost few-shot performance on the VQA task. We achieve competitive zero/few-shot results on the visual question answering and visual entailment tasks without introducing any additional pre-training procedure.
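To make the zero-shot VQA setting concrete, the sketch below shows one common way to use CLIP for it: recast each candidate answer as a caption-like statement and let CLIP's image-text similarity pick the best one. This is a minimal illustration, not necessarily the authors' exact pipeline; the Hugging Face checkpoint, image path, prompt template, and candidate answer list are all illustrative assumptions.

```python
# Minimal sketch of zero-shot VQA with CLIP: score answer-filled statements
# against the image and take the highest-similarity candidate.
# Assumptions: Hugging Face transformers CLIP checkpoint, a hand-written
# declarative template, and a small illustrative answer set.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
question = "What color is the cat?"
candidate_answers = ["black", "white", "orange", "gray"]  # illustrative answer vocabulary

# Turn each candidate answer into a caption-like statement for CLIP to score.
# Here the template is hand-written for this one question; a full system would
# derive such statements from the question automatically.
prompts = [f"the cat is {ans}" for ans in candidate_answers]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape: (1, num_answers)

best = logits_per_image.softmax(dim=-1).argmax(dim=-1).item()
print("predicted answer:", candidate_answers[best])
```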

Related research:
- 07/13/2021 · How Much Can CLIP Benefit Vision-and-Language Tasks?
- 06/01/2023 · Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering
- 12/22/2022 · When are Lemons Purple? The Concept Association Bias of CLIP
- 10/31/2022 · Towards Zero-Shot and Few-Shot Table Question Answering using GPT-3
- 10/16/2021 · A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models
- 04/19/2022 · ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
- 01/15/2022 · CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks
