NLIP: Noise-robust Language-Image Pre-training

12/14/2022
by Runhui Huang, et al.

Large-scale cross-modal pre-training paradigms have recently shown ubiquitous success on a wide range of downstream tasks, e.g., zero-shot classification, retrieval and image captioning. However, their success relies heavily on the scale and quality of web-crawled data, which naturally contain incomplete and noisy information (e.g., wrong or irrelevant content). Existing works either design manual rules to clean the data or generate pseudo-targets as auxiliary signals to reduce the impact of noise, but neither explicitly tackles the incorrect and incomplete challenges simultaneously. In this paper, to automatically mitigate the impact of noise by mining solely over existing data, we propose a principled Noise-robust Language-Image Pre-training framework (NLIP) that stabilizes pre-training via two schemes: noise-harmonization and noise-completion. First, in the noise-harmonization scheme, NLIP estimates the noise probability of each pair according to the memorization effect of cross-modal transformers, then adopts noise-adaptive regularization to harmonize the cross-modal alignments to varying degrees. Second, in the noise-completion scheme, to enrich the missing object information in the text, NLIP injects a concept-conditioned cross-modal decoder that generates semantically consistent synthetic captions to complete noisy ones, using the visual concepts (i.e., object names) retrieved for the corresponding image to guide caption generation. By collaboratively optimizing the noise-harmonization and noise-completion schemes, NLIP alleviates common noise effects during image-text pre-training in a more efficient way. Extensive experiments show that NLIP, using only 26M data, achieves significant performance improvements over existing pre-trained models (e.g., CLIP, FILIP and BLIP) on 12 zero-shot classification datasets, MSCOCO image captioning and zero-shot image-text retrieval.
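To make the noise-harmonization idea concrete, below is a minimal, hedged sketch of a CLIP-style contrastive loss in which each image-text pair is down-weighted by an estimated noise probability. The abstract does not give NLIP's exact formulation; the function name `noise_adaptive_clip_loss`, the simple `1 - noise_prob` weighting, and the placeholder noise estimates are all illustrative assumptions, not the paper's method.

```python
# Illustrative sketch only (not NLIP's exact formulation): a CLIP-style InfoNCE
# loss where each image-text pair is down-weighted by an assumed per-pair noise
# probability, mimicking the "noise-adaptive regularization" idea described above.
import torch
import torch.nn.functional as F

def noise_adaptive_clip_loss(image_emb, text_emb, noise_prob, temperature=0.07):
    """image_emb, text_emb: (B, D) L2-normalized embeddings.
    noise_prob: (B,) assumed probability that each pair is mismatched."""
    logits = image_emb @ text_emb.t() / temperature            # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)

    # Per-pair cross-entropy in both directions (image->text and text->image).
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction="none")
    per_pair = 0.5 * (loss_i2t + loss_t2i)

    # Harmonize alignments to varying degrees: noisier pairs contribute less.
    weights = 1.0 - noise_prob
    return (weights * per_pair).sum() / weights.sum().clamp(min=1e-6)

# Example usage with random embeddings and placeholder noise estimates.
if __name__ == "__main__":
    B, D = 8, 512
    img = F.normalize(torch.randn(B, D), dim=-1)
    txt = F.normalize(torch.randn(B, D), dim=-1)
    noise = torch.rand(B) * 0.5    # in NLIP these would come from the memorization effect
    print(noise_adaptive_clip_loss(img, txt, noise).item())
```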


Related research

- Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment (11/14/2022)
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation (04/10/2022)
- VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training (01/30/2022)
- GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training (08/22/2023)
- EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling (09/10/2021)
- Position-guided Text Prompt for Vision-Language Pre-training (12/19/2022)
- SLAN: Self-Locator Aided Network for Cross-Modal Understanding (11/28/2022)
