A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision

12/27/2021
by Ajinkya Tejankar, et al.

Using natural language as supervision for training visual recognition models holds great promise. Recent works have shown that if such supervision is used in the form of alignment between images and captions in large training datasets, then the resulting aligned models perform well on zero-shot classification as downstream tasks. In this paper, we focus on teasing out what parts of the language supervision are essential for training zero-shot image classification models. Through extensive and careful experiments, we show that: 1) A simple Bag-of-Words (BoW) caption can be used as a replacement for most of the image captions in the dataset. Surprisingly, we observe that this approach improves the zero-shot classification performance when combined with word balancing. 2) Using a BoW pretrained model, we can obtain more training data by generating pseudo-BoW captions on images that do not have a caption. Models trained on images with real and pseudo-BoW captions achieve stronger zero-shot performance. On ImageNet-1k zero-shot evaluation, our best model, which uses only 3M image-caption pairs, performs on par with a CLIP model trained on 15M image-caption pairs (31.5
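To make the idea of a Bag-of-Words caption concrete, here is a minimal sketch of how a full caption might be reduced to an unordered set of content words. It is an illustration under stated assumptions, not the authors' pipeline: the stop-word list, the tokenization rule, the alphabetical ordering, and the function names `caption_to_bow` and `bow_caption` are all hypothetical, and word balancing and pseudo-BoW caption generation are not covered.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the abstract does not say which words, if any, are filtered out.
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "on", "in", "at", "with", "and", "is", "are", "to"}
)


def caption_to_bow(caption: str) -> List[str]:
    """Reduce a caption to its unique content words, discarding word order."""
    # Assumed tokenization: lowercase, then keep runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", caption.lower())
    unique_words = {t for t in tokens if t not in STOP_WORDS}
    # Sorting gives a deterministic order that carries no grammatical structure.
    return sorted(unique_words)


def bow_caption(caption: str) -> str:
    """Join the bag of words into a string usable in place of the caption."""
    return " ".join(caption_to_bow(caption))


if __name__ == "__main__":
    print(bow_caption("A dog and a cat on the sofa"))
```

In this sketch the sentence structure is dropped entirely, so a text encoder trained on the output sees only which words occur, not how they relate to each other.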

research
04/18/2021

Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation

Traditional computer vision models are trained to predict a fixed set of...
research
12/17/2021

Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation

Traditional computer vision models are trained to predict a fixed set of...
research
09/21/2023

Exploiting CLIP-based Multi-modal Approach for Artwork Classification and Retrieval

Given the recent advances in multimodal image pretraining where visual m...
research
04/20/2022

K-LITE: Learning Transferable Visual Models with External Knowledge

Recent state-of-the-art computer vision systems are trained from natural...
research
07/17/2023

Zero-Shot Image Harmonization with Generative Model Prior

Recent image harmonization methods have demonstrated promising results. ...
research
12/29/2016

Learning Visual N-Grams from Web Data

Real-world image recognition systems need to recognize tens of thousands...
research
05/17/2023

Equivariant Few-Shot Learning from Pretrained Models

Efficient transfer learning algorithms are key to the success of foundat...
