Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

03/11/2022
by Yufeng Cui, et al.

Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm for learning visual models from language supervision. While researchers continue to push the frontier of CLIP, reproducing these works remains challenging, because researchers do not use consistent training recipes and even train on different data, which hampers fair comparison between methods. In this work, we propose CLIP-benchmark, a first attempt to evaluate, analyze, and benchmark CLIP and its variants. We conduct a comprehensive analysis of three key factors: data, supervision, and model architecture. We report several findings, some intuitive and some counter-intuitive: (1) Data quality has a significant impact on performance. (2) Certain supervision has different effects on Convolutional Networks (ConvNets) and Vision Transformers (ViT); applying more suitable supervision can effectively improve the performance of CLIP. (3) Curtailing the text encoder reduces the training cost without much affecting the final performance. Moreover, we combine DeCLIP with FILIP, yielding the strongest variant, DeFILIP. The CLIP-benchmark will be released at https://github.com/Sense-GVT/DeCLIP for future CLIP research.


research
10/11/2021

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) has...
research
11/05/2021

MQBench: Towards Reproducible and Deployable Model Quantization Benchmark

Model quantization has emerged as an indispensable technique to accelera...
research
09/27/2022

UniCLIP: Unified Framework for Contrastive Language-Image Pre-training

Pre-training vision-language models with contrastive objectives has show...
research
06/02/2022

Prefix Conditioning Unifies Language and Label Supervision

Vision-language contrastive learning suggests a new learning paradigm by...
research
01/19/2023

Self Supervision Does Not Help Natural Language Supervision at Scale

Self supervision and natural language supervision have emerged as two ex...
research
11/29/2021

A Simple Long-Tailed Recognition Baseline via Vision-Language Model

The visual world naturally exhibits a long-tailed distribution of open c...
