ALIP: Adaptive Language-Image Pre-training with Synthetic Caption

08/16/2023
by Kaicheng Yang, et al.

Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks by scaling up datasets of image-text pairs collected from the web. However, intrinsic noise and unmatched image-text pairs in web data can degrade representation learning. To address this issue, we first use the OFA model to generate synthetic captions that focus on the image content. The generated captions contain complementary information that benefits pre-training. We then propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions. As the core components of ALIP, the Language Consistency Gate (LCG) and the Description Consistency Gate (DCG) dynamically adjust the weights of samples and image-text/caption pairs during training. Meanwhile, the adaptive contrastive loss effectively reduces the impact of noisy data and improves the efficiency of the pre-training data. We validate ALIP with experiments across different model and pre-training dataset scales. Experimental results show that ALIP achieves state-of-the-art performance on multiple downstream tasks, including zero-shot image-text retrieval and linear probing. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/ALIP.
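To make the bi-path weighted objective concrete, here is a minimal PyTorch sketch of a per-sample weighted contrastive loss that combines raw-text and synthetic-caption supervision. This is an illustration under stated assumptions, not the authors' implementation: the helper names (weighted_clip_loss, alip_style_loss), the fixed temperature, and the weight tensors w_text and w_caption are hypothetical, and the abstract does not specify how LCG and DCG actually compute those weights.

```python
# Sketch of an ALIP-style bi-path weighted contrastive objective.
# NOT the released implementation: per-sample weights are assumed to be
# produced elsewhere by gating modules such as the paper's LCG/DCG.
import torch
import torch.nn.functional as F

def weighted_clip_loss(img_emb, txt_emb, weights, temperature=0.07):
    """Symmetric InfoNCE loss with per-sample weights (hypothetical helper).

    img_emb, txt_emb: (B, D) embeddings; weights: (B,) sample weights.
    The temperature value 0.07 is an assumption, not from the abstract.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction="none")  # text -> image
    per_sample = 0.5 * (loss_i2t + loss_t2i)
    return (weights * per_sample).mean()

def alip_style_loss(img_emb, raw_txt_emb, syn_cap_emb, w_text, w_caption):
    """Combine supervision from raw web text and synthetic captions (sketch)."""
    return (weighted_clip_loss(img_emb, raw_txt_emb, w_text)
            + weighted_clip_loss(img_emb, syn_cap_emb, w_caption))
```

In this sketch the gates enter only through the weight vectors; setting both w_text and w_caption to ones recovers an unweighted dual-caption CLIP objective, which is why down-weighting noisy pairs directly reduces their contribution to the gradient.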


