SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels

03/14/2021
by Chenliang Li, et al.

Vision-language pre-training (VLP) on large-scale image-text pairs has recently seen rapid progress in learning cross-modal representations. Existing pre-training methods either directly concatenate image and text representations at the feature level as input to a single-stream Transformer, or use a two-stream cross-modal Transformer to align image and text representations in a high-level semantic space. In real-world image-text data, we observe that some image-text pairs align easily at the level of simple semantics in both modalities, while others become related only after higher-level abstraction. Therefore, in this paper, we propose a new pre-training method, SemVLP, which jointly aligns both low-level and high-level semantics between image and text representations. The model is pre-trained iteratively in two prevalent fashions: single-stream pre-training to align at a fine-grained feature level, and two-stream pre-training to align high-level semantics, using a shared Transformer network with a pluggable cross-modal attention module. An extensive set of experiments on four well-established vision-language understanding tasks demonstrates the effectiveness of the proposed SemVLP in aligning cross-modal representations at different semantic granularities.
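The abstract's core architectural idea — one shared encoder that runs either as a single-stream model over concatenated image and text tokens, or as a two-stream model with a pluggable cross-modal attention module — can be illustrated with a toy sketch. This is a minimal illustration of the general pattern, not the paper's actual implementation; the layer names, the use of one shared projection set, and the plain scaled dot-product attention are all simplifying assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Standard scaled dot-product attention.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v


class SharedEncoderLayer:
    """Toy Transformer-style layer whose projection weights are shared by
    both pre-training modes (an assumption for illustration); cross-modal
    attention is a pluggable extra path."""

    def __init__(self, dim, rng):
        self.wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def self_attn(self, x):
        return attention(x @ self.wq, x @ self.wk, x @ self.wv)

    def cross_attn(self, x, y):
        # Queries come from one modality; keys/values from the other.
        return attention(x @ self.wq, y @ self.wk, y @ self.wv)


def encode_single_stream(layer, text, image):
    # Low-level alignment: concatenate token features from both
    # modalities and self-attend over the joint sequence.
    joint = np.concatenate([text, image], axis=0)
    return layer.self_attn(joint)


def encode_two_stream(layer, text, image):
    # High-level alignment: encode each stream separately, then plug in
    # the cross-modal attention module between the two streams.
    t = layer.self_attn(text)
    i = layer.self_attn(image)
    return layer.cross_attn(t, i), layer.cross_attn(i, t)
```

Iterating between the two encoding modes while reusing the same projection weights is what lets a single network serve both the fine-grained feature-level objective and the high-level semantic objective.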

