MVPTR: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment

01/29/2022
by Zejun Li, et al.

In this paper, we propose a Multi-stage Vision-language Pre-TRaining (MVPTR) framework to learn cross-modal representations via multi-level semantic alignment. We introduce concepts in both modalities to construct two-level semantic representations for language and vision. Based on this multi-level input, we train the cross-modality model in two stages: uni-modal learning and cross-modal learning. The former stage enforces within-modality interactions to learn multi-level semantics for each modality on its own. The latter stage enforces interactions across modalities via both coarse-grained and fine-grained semantic alignment tasks. Image-text matching and masked language modeling are then used to further optimize the pre-trained model. Our model achieves state-of-the-art results on several vision-and-language tasks. Our code is available at https://github.com/Junction4Nako/mvp_pytorch.
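To make the two objectives mentioned above concrete, here is a minimal, framework-free sketch of an image-text matching (ITM) loss and masked-language-modeling (MLM) input masking. This is illustrative only, not the authors' implementation: the function names, the sigmoid-over-cosine-similarity matching score, and the 15% masking rate are assumptions chosen for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def itm_loss(img_emb, txt_emb, match):
    # Illustrative ITM objective: squash a similarity score through a
    # sigmoid and apply binary cross-entropy against the match label
    # (1 = paired image/text, 0 = mismatched negative pair).
    score = 1.0 / (1.0 + np.exp(-cosine_sim(img_emb, txt_emb)))
    return -(match * np.log(score) + (1 - match) * np.log(1.0 - score))

def mlm_mask(tokens, mask_token="[MASK]", p=0.15):
    # Randomly replace a fraction p of tokens with a mask symbol;
    # the model is trained to recover the original tokens at those
    # positions (returned here as a position -> token mapping).
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

# A matched pair (identical embeddings) yields a small positive loss;
# a mismatched negative with the same embeddings is penalized more.
v = np.ones(4)
print(itm_loss(v, v, 1) < itm_loss(v, v, 0))
```

In an actual pre-training pipeline the embeddings would come from the cross-modal encoder and the masked positions would be scored with a softmax over the vocabulary; the sketch only shows the shape of the two losses.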

Related research:

08/20/2019 · LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Vision-and-language reasoning requires an understanding of visual concep...

08/03/2022 · Masked Vision and Language Modeling for Multi-modal Representation Learning
In this paper, we study how to use masked signal modeling in vision and ...

04/22/2022 · A Multi-level Alignment Training Scheme for Video-and-Language Grounding
To solve video-and-language grounding tasks, the key is for the network ...

08/18/2020 · AssembleNet++: Assembling Modality Representations via Attention Connections
We create a family of powerful video models which are able to: (i) learn...

12/01/2019 · Integrate Image Representation to Text Model on Sentence Level: a Semi-supervised Framework
Integrating visual features has been proved useful in language represent...

04/06/2023 · MemeFier: Dual-stage Modality Fusion for Image Meme Classification
Hate speech is a societal problem that has significantly grown through t...

05/31/2023 · ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Two-Tower Vision-Language (VL) models have shown promising improvements ...
