Learning Visual Representations via Language-Guided Sampling

02/23/2023
by   Mohamed El Banani, et al.

Although an object may appear in numerous contexts, we often describe it in a limited number of ways. This happens because language abstracts away visual variation to represent and communicate concepts. Building on this intuition, we propose an alternative approach to visual learning: using language similarity to sample semantically similar image pairs for contrastive learning. Our approach deviates from image-based contrastive learning by using language to sample pairs instead of hand-crafted augmentations or learned clusters. It also deviates from image-text contrastive learning by relying on pre-trained language models to guide the learning rather than minimizing a cross-modal similarity. Through a series of experiments, we show that language-guided learning can learn better features than both image-image and image-text representation learning approaches.
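The core idea above can be sketched in a few lines: embed each image's caption with a pre-trained language model, pick the language-nearest neighbor of each caption as the positive pair, and train the image encoder with an InfoNCE-style contrastive loss over those pairs. The sketch below is illustrative, not the authors' implementation; the function names and the toy NumPy embeddings are assumptions standing in for real caption embeddings and an image encoder.

```python
import numpy as np

def nearest_caption_pairs(caption_emb):
    """For each caption, return the index of its most similar other caption.

    caption_emb: (N, d) array of pre-trained language-model embeddings
    (illustrative stand-in; the paper uses a frozen pre-trained model).
    """
    z = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    sim = z @ z.T                      # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)     # exclude self-matches
    return sim.argmax(axis=1)          # language-nearest neighbor per image

def info_nce(img_emb, pair_idx, tau=0.1):
    """InfoNCE loss where the positive for image i is image pair_idx[i],
    i.e. the image whose caption is most similar in language space,
    rather than an augmented view of image i itself."""
    z = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = z @ z.T / tau
    np.fill_diagonal(logits, -np.inf)  # never contrast an image with itself
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(z)), pair_idx].mean()
```

In training, `nearest_caption_pairs` would be computed once over the dataset's caption embeddings, and `info_nce` applied to the image encoder's outputs for each batch; only the pair-sampling step involves language, so no cross-modal similarity is ever minimized.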


