mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

02/01/2023
by Haiyang Xu, et al.

Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with a modularized design for multi-modal pretraining, which benefits from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms that rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration and disentangles modality-specific modules to deal with modality entanglement. Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal image-text and video-text understanding and generation, and uni-modal text-only, image-only, and video-only understanding. Notably, mPLUG-2 sets new state-of-the-art results of 48.0% top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video captioning tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released at https://github.com/alibaba/AliceMind.
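To make the modularized design concrete, below is a minimal, hypothetical PyTorch sketch of a multi-module composition network in this spirit: modality-specific modules stay disentangled, a shared universal module is reused across modalities, and modules are selected per task. All class names, dimensions, and task heads here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the mPLUG-2 codebase): modality-specific modules
# are kept separate ("disentangled"), while a shared universal module is
# reused across modalities ("collaboration"). Names and sizes are hypothetical.
import torch
import torch.nn as nn


class UniversalModule(nn.Module):
    """Shared transformer block reused by every modality."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)


class ModularModel(nn.Module):
    """Composes disentangled modality modules with a shared universal module."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Disentangled, modality-specific input modules (toy projections here).
        self.modality_modules = nn.ModuleDict({
            "text": nn.Linear(300, dim),    # e.g. projects word embeddings
            "image": nn.Linear(1024, dim),  # e.g. projects patch features
            "video": nn.Linear(1024, dim),  # e.g. projects frame features
        })
        # Common universal module shared across modalities.
        self.universal = UniversalModule(dim)
        # Task-specific heads; modules are composed per downstream task.
        self.heads = nn.ModuleDict({
            "classification": nn.Linear(dim, 10),
            "captioning": nn.Linear(dim, 30522),  # vocab-sized logits
        })

    def forward(self, features: torch.Tensor, modality: str, task: str) -> torch.Tensor:
        h = self.modality_modules[modality](features)  # disentangled stage
        h = self.universal(h)                          # shared collaborative stage
        return self.heads[task](h.mean(dim=1))         # pooled task output


model = ModularModel()
video_feats = torch.randn(2, 16, 1024)  # (batch, frames, feature dim)
logits = model(video_feats, modality="video", task="classification")
print(logits.shape)  # torch.Size([2, 10])
```

The point of the sketch is the composition pattern: swapping the `modality` and `task` keys selects a different module pipeline without retraining the shared component, which is the kind of flexibility the abstract describes.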


Related research

09/15/2022 · OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
This paper presents OmniVL, a new foundation model to support both image...

08/20/2021 · Knowledge Perceived Multi-modal Pretraining in E-commerce
In this paper, we address multi-modal pretraining of product data in the...

03/30/2020 · InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
Multi-modal pretraining for learning high-level multi-modal representati...

03/10/2023 · MuLTI: Efficient Video-and-Language Understanding with MultiWay-Sampler and Multiple Choice Modeling
Video-and-language understanding has a variety of applications in the in...

12/09/2022 · VindLU: A Recipe for Effective Video-and-Language Pretraining
The last several years have witnessed remarkable progress in video-and-l...

11/14/2022 · EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
We launch EVA, a vision-centric foundation model to explore the limits o...

11/27/2022 · Unified Discrete Diffusion for Simultaneous Vision-Language Generation
The recently developed discrete diffusion models perform extraordinarily...
