Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation

12/17/2021
by Bichen Wu, et al.

Traditional computer vision models are trained to predict a fixed set of predefined categories. Recently, natural language has been shown to be a broader and richer source of supervision, providing finer-grained descriptions of visual concepts than supervised "gold" labels. Previous works, such as CLIP, use the InfoNCE loss to train a model to predict the pairing between images and text captions. CLIP, however, is data hungry and requires more than 400M image-text pairs for training. This inefficiency can be partially attributed to the fact that the image-text pairs are noisy. To address this, we propose OTTER (Optimal TransporT distillation for Efficient zero-shot Recognition), which uses online entropic optimal transport to find a soft image-text match as labels for contrastive learning. Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs. Compared with the InfoNCE loss, label smoothing, and knowledge distillation, OTTER consistently outperforms these baselines in zero-shot evaluation on Google Open Images (19,958 classes) and the multi-labeled ImageNet-10K (10,032 classes) from Tencent ML-Images. Across 42 evaluations covering 7 different dataset/architecture settings × 6 metrics, OTTER outperforms (32) or ties (2) all baselines in 34 of them.
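The core mechanism described above — replacing InfoNCE's one-hot image-text pairing targets with soft targets obtained from entropic optimal transport over the in-batch similarities — can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' released code: the `sinkhorn` and `soft_contrastive_loss` functions, the `epsilon`, `temperature`, and `ot_weight` values, and the choice to mix the OT plan with the identity targets are assumptions made for the example.

```python
# Illustrative sketch: entropic optimal transport (Sinkhorn iterations) produces
# a soft image-text matching, which replaces the one-hot targets of InfoNCE.
# Hyperparameters and the identity/OT mixing scheme are assumptions, not the
# paper's exact recipe.
import math
import torch
import torch.nn.functional as F


def sinkhorn(cost, epsilon=0.05, n_iters=50):
    """Entropic OT with uniform marginals; returns a transport plan whose rows
    are normalized so they can be used as soft label distributions."""
    n = cost.size(0)
    log_K = -cost / epsilon  # log of the Gibbs kernel exp(-C / epsilon)
    log_u = torch.zeros(n, device=cost.device)
    log_v = torch.zeros(n, device=cost.device)
    log_marginal = torch.full((n,), -math.log(n), device=cost.device)
    for _ in range(n_iters):
        # Alternate scaling updates in log space for numerical stability.
        log_u = log_marginal - torch.logsumexp(log_K + log_v[None, :], dim=1)
        log_v = log_marginal - torch.logsumexp(log_K + log_u[:, None], dim=0)
    plan = torch.exp(log_u[:, None] + log_K + log_v[None, :])
    return plan / plan.sum(dim=1, keepdim=True)


def soft_contrastive_loss(image_feats, text_feats, temperature=0.07, ot_weight=0.5):
    """Cross-entropy between the image-to-text similarity distribution and a
    target that mixes the identity pairing with the OT-derived soft matching."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature

    with torch.no_grad():
        # Cost = dissimilarity between every image and text caption in the batch.
        cost = 1.0 - image_feats @ text_feats.t()
        identity = torch.eye(len(logits), device=logits.device)
        targets = (1.0 - ot_weight) * identity + ot_weight * sinkhorn(cost)

    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()


# Example usage with random features standing in for encoder outputs.
images = torch.randn(8, 512)
texts = torch.randn(8, 512)
loss = soft_contrastive_loss(images, texts)
```

In practice a symmetric text-to-image term is typically added, and the soft matching can be computed from a separate (e.g. teacher or momentum) pair of encoders rather than the online features used in this sketch.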
