CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations

12/14/2021
by Aman Shrivastava, et al.

We propose CLIP-Lite, an information-efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information-efficient lower bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and smaller batch sizes while obtaining better performance than CLIP. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +15.4 mAP absolute gain on Pascal VOC classification and a +22.1 top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, by performing explicit image-text alignment during representation learning, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks.
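The key idea — a mutual-information lower bound that needs only one negative pair per positive pair — can be sketched with a Jensen-Shannon-style contrastive objective, as used in JSD-based MI estimators. The dot-product critic and function names below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def softplus(x):
    # numerically stable log(1 + exp(x))
    return np.logaddexp(0.0, x)

def jsd_mi_lower_bound(score_pos, score_neg):
    """Jensen-Shannon-style lower bound on mutual information.

    Unlike InfoNCE (which tightens with many negatives, hence CLIP's
    huge batches), this bound uses a single negative score per
    positive score."""
    return -softplus(-score_pos) - softplus(score_neg)

def clip_lite_style_loss(image_emb, text_emb, neg_text_emb):
    # Dot-product critic scoring image-text alignment (a hypothetical,
    # simple choice of critic for this sketch).
    score_pos = float(image_emb @ text_emb)
    score_neg = float(image_emb @ neg_text_emb)
    # Maximizing the bound = minimizing its negation.
    return -jsd_mi_lower_bound(score_pos, score_neg)
```

As expected, the loss is lower when the image embedding aligns with its paired caption and not with the negative caption, so gradient descent pulls matched pairs together using just one negative per example.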

