
Contrastive Visual-Linguistic Pretraining

by   Lei Shi, et al.

Several multimodal representation learning approaches, such as LXMERT and ViLBERT, have been proposed recently. These approaches achieve superior performance thanks to the high-level semantic information captured during large-scale multimodal pretraining. However, because ViLBERT and LXMERT adopt visual region regression and classification losses, they often suffer from domain gap and noisy label problems, since their visual features are pretrained on the Visual Genome dataset. To overcome these issues, we propose unbiased Contrastive Visual-Linguistic Pretraining (CVLP), which constructs a visual self-supervised loss built upon contrastive learning. We evaluate CVLP on several downstream tasks, including VQA, GQA and NLVR2, to validate the superiority of contrastive learning for multimodal representation learning. Our code is available at:
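The visual self-supervised contrastive loss mentioned in the abstract can be illustrated with a generic InfoNCE-style objective, where each visual feature is pulled toward its matching (positive) key and pushed away from the other keys in the batch. This is a minimal NumPy sketch of that general idea, not the authors' exact formulation; the function name and temperature value are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(queries, keys, temperature=0.07):
    """InfoNCE-style contrastive loss (illustrative sketch).

    The positive key for queries[i] is keys[i]; all other keys in the
    batch serve as negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature  # (N, N) similarity matrix

    # Numerically stable log-softmax over each row; positives lie on
    # the diagonal of the logits matrix.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy against the diagonal (matched) entries.
    return -np.mean(np.diag(log_prob))

# Usage: matched query/key pairs yield a much lower loss than random pairs.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
y = rng.normal(size=(8, 16))
loss_matched = info_nce_loss(x, x)
loss_random = info_nce_loss(x, y)
```

In a multimodal pretraining setting, the queries and keys would come from two augmented views (or a momentum-updated encoder) of the same image regions, so no region-label supervision is needed.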



