Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation

04/18/2021
by   Ruizhe Cheng, et al.

Traditional computer vision models are trained to predict a fixed set of predefined categories. Recently, natural language has been shown to be a broader and richer source of supervision, providing finer descriptions of visual concepts than supervised "gold" labels. Previous works, such as CLIP, use a simple pretraining task of predicting the pairings between images and text captions. CLIP, however, is data hungry and requires more than 400M image-text pairs for training. We propose a data-efficient contrastive distillation method that uses soft labels to learn from noisy image-text pairs. Our model transfers knowledge from pretrained image and sentence encoders and achieves strong performance with only 3M image-text pairs, 133x smaller than CLIP. Our method exceeds the previous SoTA of general zero-shot learning on ImageNet 21k+1k by 73%. We also beat CLIP by 10.5% in zero-shot accuracy on Google Open Images (19,958 classes).
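To make the objective concrete, below is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss extended with soft labels, as the abstract describes. This is not the authors' released code: the function name, the blending weight `alpha`, the temperature `temp`, and the teacher-derived `soft_targets` matrix are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def clip_style_loss(img_emb, txt_emb, soft_targets=None, alpha=0.5, temp=0.07):
    """Symmetric contrastive loss over a batch of N image-text pairs.

    With soft_targets=None this reduces to the standard CLIP objective:
    cross-entropy against the identity pairing (hard one-hot targets).
    With a teacher-provided soft_targets matrix of shape [N, N], the hard
    targets are blended with the teacher's soft pairing distribution,
    which is the soft-label distillation idea sketched in the abstract.
    """
    # L2-normalize embeddings so the logits are scaled cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temp          # [N, N]; matched pairs on the diagonal
    n = logits.shape[0]
    hard = np.eye(n)
    targets = hard if soft_targets is None else (1 - alpha) * hard + alpha * soft_targets
    log_p_i2t = np.log(softmax(logits, axis=1))    # image -> text direction
    log_p_t2i = np.log(softmax(logits.T, axis=1))  # text -> image direction
    return -0.5 * ((targets * log_p_i2t).sum(1)
                   + (targets.T * log_p_t2i).sum(1)).mean()
```

The soft targets let partially-matching captions in a noisy batch contribute gradient instead of being treated as pure negatives, which is why the loss tolerates noisy image-text pairs better than hard one-hot supervision.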

