Curriculum Learning for Data-Efficient Vision-Language Alignment

07/29/2022
by   Tejas Srinivasan, et al.
10

Aligning image and text encoders from scratch using contrastive learning requires large amounts of paired image-text data. We alleviate this need by aligning individually pre-trained language and vision representation models using a much smaller amount of paired data, augmented with a curriculum learning algorithm to learn fine-grained vision-language alignments. TOnICS (Training with Ontology-Informed Contrastive Sampling) initially samples minibatches whose image-text pairs contain a wide variety of objects to learn object-level alignment, and progressively samples minibatches where all image-text pairs contain the same object to learn finer-grained contextual alignment. Aligning pre-trained BERT and VinVL models to each other using TOnICS outperforms CLIP on downstream zero-shot image retrieval while using less than 1

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/23/2023

CoBIT: A Contrastive Bi-directional Image-Text Generation Model

The field of vision and language has witnessed a proliferation of pre-tr...
research
08/22/2023

ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data

Vision-language models (VLMs), such as CLIP and ALIGN, are generally tra...
research
07/30/2021

CLIP-Art: Contrastive Pre-Training for Fine-Grained Art Classification

Existing computer vision research in artwork struggles with artwork's fi...
research
05/09/2023

Boosting Visual-Language Models by Exploiting Hard Samples

Large vision and language models, such as Contrastive Language-Image Pre...
research
04/29/2022

PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Large-scale vision-language pre-training has achieved promising results ...
research
12/04/2021

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Contrastive Vision-Language Pre-training (CLIP) has drown increasing att...
research
11/16/2022

An Efficient COarse-to-fiNE Alignment Framework @ Ego4D Natural Language Queries Challenge 2022

This technical report describes the CONE approach for Ego4D Natural Lang...

Please sign up or login with your details

Forgot password? Click here to reset