Self Supervision Does Not Help Natural Language Supervision at Scale

01/19/2023
by Floris Weers et al.

Self-supervision and natural language supervision have emerged as two exciting ways to train general-purpose image encoders that excel at a variety of downstream tasks. Recent works such as M3AE and SLIP have suggested that these approaches can be effectively combined, but notably their results use small pre-training datasets (<50M samples) and do not reflect the large-scale regime (>100M examples) that is commonly used for these approaches. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. We find that a combination of two state-of-the-art approaches, masked auto-encoders (MAE) and contrastive language-image pre-training (CLIP), provides a benefit over CLIP alone when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) when trained on a large corpus of 1.4B images. Our work provides some much-needed clarity into the effectiveness (or lack thereof) of self-supervision for large-scale image-text training.
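
As a rough illustration of the kind of joint objective the abstract refers to, the sketch below combines a CLIP-style contrastive loss with an MAE-style masked-patch reconstruction loss. This is a minimal PyTorch sketch under assumed inputs (pre-computed image/text embeddings, decoder patch predictions, and a patch mask), not the authors' implementation; the `mae_weight` coefficient is a hypothetical hyperparameter for weighting the two terms.

```python
# Minimal sketch (not the paper's code) of a combined CLIP + MAE training objective.
# Assumes an image encoder / text encoder producing (B, D) embeddings and an
# MAE-style decoder producing per-patch pixel predictions; names are hypothetical.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over matched image-text pairs in a batch."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def mae_reconstruction_loss(pred_patches, target_patches, mask):
    """Pixel MSE computed only on masked patches, as in MAE."""
    loss = (pred_patches - target_patches) ** 2              # (B, N, patch_dim)
    loss = loss.mean(dim=-1)                                  # per-patch error, (B, N)
    return (loss * mask).sum() / mask.sum().clamp(min=1)      # average over masked patches


def combined_loss(image_emb, text_emb, pred_patches, target_patches, mask,
                  mae_weight=1.0):
    """Joint objective: CLIP contrastive term plus a weighted MAE term."""
    return (clip_contrastive_loss(image_emb, text_emb) +
            mae_weight * mae_reconstruction_loss(pred_patches, target_patches, mask))
```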

