Dense Contrastive Visual-Linguistic Pretraining

09/24/2021
by Lei Shi, et al.

Inspired by the success of BERT, several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. In particular, LXMERT and UNITER adopt visual region feature regression and label classification as pretext tasks. However, these tasks suffer from noisy labels and sparse semantic annotations, because the visual features are pretrained on a crowdsourced dataset with limited and inconsistent semantic labeling. To overcome these issues, we propose unbiased Dense Contrastive Visual-Linguistic Pretraining (DCVLP), which replaces region regression and classification with cross-modality region contrastive learning that requires no annotations. Two data augmentation strategies (Mask Perturbation and Intra-/Inter-Adversarial Perturbation) are developed to improve the quality of negative samples used in contrastive learning. Overall, DCVLP enables cross-modality dense region contrastive learning in a self-supervised setting, independent of any object annotations. We compare our method against prior visual-linguistic pretraining frameworks to validate the superiority of dense contrastive learning for multimodal representation learning.
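To make the idea of region-level contrastive learning concrete, the following is a minimal sketch of a generic InfoNCE-style loss over region features. It is not the DCVLP loss itself, which the abstract does not specify. The function name `region_info_nce`, the temperature value of 0.07, and the assumption that region i in one view is the positive for region i in the other view (with all other regions acting as negatives) are illustrative choices, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def region_info_nce(queries, keys, temperature=0.07):
    """Generic InfoNCE loss over per-region features (illustrative only).

    queries[i]: feature of region i from a perturbed view (assumption).
    keys[i]:    feature of region i from a reference view (assumption).
    Region i in `keys` is the positive for query i; every other region
    in `keys` is treated as a negative. No object labels are needed.
    """
    if not queries or len(queries) != len(keys):
        raise ValueError("queries and keys must be non-empty and equal in length")
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity to every key, scaled by the temperature.
        logits = [dot(qi, kj) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching region.
        total += log_denom - logits[i]
    return total / len(q)
```

In this sketch the loss is small when each query is most similar to its own region and large when it is more similar to a different region, which is the behavior a label-free region-level pretext task relies on.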


