VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

11/03/2021
by   Wenhui Wang, et al.
0

We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs. Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks, including VQA and NLVR2. The code and pretrained models are available at https://aka.ms/vlmo.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/23/2023

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

Building scalable vision-language models to learn from diverse, multimod...
research
08/09/2019

VisualBERT: A Simple and Performant Baseline for Vision and Language

We propose VisualBERT, a simple and flexible framework for modeling a br...
research
09/24/2019

Unified Vision-Language Pre-Training for Image Captioning and VQA

This paper presents a unified Vision-Language Pre-training (VLP) model. ...
research
04/20/2023

CoT-MoTE: Exploring ConTextual Masked Auto-Encoder Pre-training with Mixture-of-Textual-Experts for Passage Retrieval

Passage retrieval aims to retrieve relevant passages from large collecti...
research
12/08/2021

MLP Architectures for Vision-and-Language Modeling: An Empirical Study

We initiate the first empirical study on the use of MLP architectures fo...
research
10/19/2022

A Unified View of Masked Image Modeling

Masked image modeling has demonstrated great potential to eliminate the ...
research
09/06/2023

Gene-induced Multimodal Pre-training for Image-omic Classification

Histology analysis of the tumor micro-environment integrated with genomi...

Please sign up or login with your details

Forgot password? Click here to reset