mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

by   Chenliang Li, et al.

Large-scale pretrained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from the problems of low computational efficiency and information asymmetry brought by the long visual sequence in cross-modal alignment. To address these problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections, which creates inter-layer shortcuts that skip a certain number of layers for time-consuming full self-attention on the vision side. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability when directly transferred to multiple video-language tasks.


page 1

page 2

page 3

page 4


E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning

Vision-language pre-training (VLP) on large-scale image-text pairs has a...

3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

3D vision-language grounding (3D-VL) is an emerging field that aims to c...

V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Building artificial intelligence (AI) systems on top of a set of foundat...

CoCa: Contrastive Captioners are Image-Text Foundation Models

Exploring large-scale pretrained foundation models is of significant int...

SLAN: Self-Locator Aided Network for Cross-Modal Understanding

Learning fine-grained interplay between vision and language allows to a ...

Deeply Coupled Cross-Modal Prompt Learning

Recent advancements in multimodal foundation models (e.g., CLIP) have ex...

TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

In this paper, we propose a novel cross-modal distillation method, calle...

Please sign up or login with your details

Forgot password? Click here to reset