Non-Contrastive Learning Meets Language-Image Pre-Training

10/17/2022
by Jinghao Zhou, et al.

Contrastive language-image pre-training (CLIP) serves as a de-facto standard for aligning images and texts. Nonetheless, the loose correlation between the images and texts of web-crawled data renders the contrastive objective data-inefficient and reliant on a large training batch size. In this work, we explore the validity of non-contrastive language-image pre-training (nCLIP) and study whether the desirable properties exhibited by visual self-supervised models can emerge. We empirically observe that the non-contrastive objective benefits representation learning but falls significantly short in zero-shot recognition. Based on this study, we further introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics. The synergy between the two objectives lets xCLIP enjoy the best of both worlds: superior performance in both zero-shot transfer and representation learning. We conduct a systematic evaluation spanning a wide variety of downstream tasks, including zero-shot classification, out-of-domain classification, retrieval, visual representation learning, and textual representation learning, and observe a consistent performance gain that validates the effectiveness of xCLIP.
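The contrastive objective referenced above is the standard CLIP-style symmetric InfoNCE loss, where each image is a positive for its paired caption and a negative for every other caption in the batch, which is why the objective wants large batches of well-aligned pairs. As a rough sketch only (not the authors' code; function names and the temperature value are illustrative assumptions), it can be written as:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: arrays of shape (B, D), row i of each being a matched pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature    # (B, B); matched pairs lie on the diagonal
    labels = np.arange(len(logits))

    def cross_entropy(logits, labels):
        # Numerically stable row-wise softmax cross-entropy.
        z = logits - logits.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Because every off-diagonal entry acts as a negative, noisy web captions that partially match other images directly degrade this loss; nCLIP, by dropping the explicit negatives, sidesteps that sensitivity at the cost of weaker zero-shot alignment.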

Related research

12/04/2021 - VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts
Contrastive Vision-Language Pre-training (CLIP) has drawn increasing att...

12/14/2021 - CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations
We propose CLIP-Lite, an information efficient method for visual represe...

08/25/2022 - MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
This paper presents a simple yet effective framework MaskCLIP, which inc...

09/15/2022 - Exploring Visual Interpretability for Contrastive Language-Image Pre-training
Contrastive Language-Image pre-training (CLIP) learns rich representatio...

01/17/2023 - Learning Customized Visual Models with Retrieval-Augmented Knowledge
Image-text contrastive learning models such as CLIP have demonstrated st...

06/02/2022 - Prefix Conditioning Unifies Language and Label Supervision
Vision-language contrastive learning suggests a new learning paradigm by...

06/07/2023 - Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning
Single-cell RNA sequencing (scRNA-seq) data is a potent tool for compreh...