Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

10/11/2021
by Yangguang Li, et al.

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, thereby restricting its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently. Instead of using the single image-text contrastive supervision, we fully exploit the data potential through the use of (1) self-supervision within each modality; (2) multi-view supervision across modalities; (3) nearest-neighbor supervision from other similar pairs. Benefiting from this intrinsic supervision, our DeCLIP-ResNet50 can achieve 60.4% zero-shot top-1 accuracy on ImageNet, 0.8% above the CLIP-ResNet50, while using 7.1x fewer data. Our DeCLIP-ResNet50 also outperforms its counterpart on 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, scaling up the model and compute also works well in our framework. Our code, dataset and models are released at: https://github.com/Sense-GVT/DeCLIP
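The abstract describes combining the standard image-text contrastive objective with self-supervision within each modality, multi-view supervision across modalities, and nearest-neighbor supervision from similar pairs. The sketch below is a minimal, illustrative rendering of how such a combined loss could be assembled in PyTorch; it is not the authors' implementation. The function names (info_nce, declip_style_loss), the equal loss weights, and the nn_queue placeholder are assumptions, and the text-side self-supervision (a masked-language-model term in the paper) is omitted for brevity.

# Minimal sketch (not the authors' code) of a DeCLIP-style combined objective.
# Encoders are replaced by random embeddings; names and weights are illustrative.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # Symmetric CLIP-style contrastive loss between two batches of embeddings.
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def declip_style_loss(img_emb, img_emb_aug, txt_emb, txt_emb_aug, nn_queue):
    # (0) Original CLIP supervision: image vs. text.
    l_clip = info_nce(img_emb, txt_emb)
    # (1) Self-supervision within the image modality: two augmented views of the same image.
    l_self_img = info_nce(img_emb, img_emb_aug)
    # (2) Multi-view supervision across modalities: augmented views paired with the other modality.
    l_multiview = 0.5 * (info_nce(img_emb_aug, txt_emb) + info_nce(img_emb, txt_emb_aug))
    # (3) Nearest-neighbor supervision: the closest text embedding in a memory queue
    #     of past text features serves as an additional positive for the image.
    sims = F.normalize(txt_emb, dim=-1) @ F.normalize(nn_queue, dim=-1).t()
    nn_txt = nn_queue[sims.argmax(dim=-1)]
    l_nn = info_nce(img_emb, nn_txt)
    return l_clip + l_self_img + l_multiview + l_nn

# Toy usage with random tensors standing in for encoder outputs.
B, D, Q = 8, 64, 128
loss = declip_style_loss(torch.randn(B, D), torch.randn(B, D),
                         torch.randn(B, D), torch.randn(B, D),
                         torch.randn(Q, D))
print(loss.item())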


Related research

08/16/2023  ALIP: Adaptive Language-Image Pre-training with Synthetic Caption
Contrastive Language-Image Pre-training (CLIP) has significantly boosted...

09/27/2022  UniCLIP: Unified Framework for Contrastive Language-Image Pre-training
Pre-training vision-language models with contrastive objectives has show...

01/19/2023  Self Supervision Does Not Help Natural Language Supervision at Scale
Self supervision and natural language supervision have emerged as two ex...

03/11/2022  Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision
Contrastive Language-Image Pretraining (CLIP) has emerged as a novel par...

06/02/2022  Prefix Conditioning Unifies Language and Label Supervision
Vision-language contrastive learning suggests a new learning paradigm by...

07/26/2022  Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
Large-scale multi-modal contrastive pre-training has demonstrated great ...

06/22/2022  Prototypical Contrastive Language Image Pretraining
Contrastive Language Image Pretraining (CLIP) received widespread attent...
