MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

10/09/2022
by Zijia Zhao, et al.

Multimodal representation learning has shown promising improvements on various vision-language tasks. Most existing methods excel at building global-level alignment between vision and language but lack effective fine-grained image-text interaction. In this paper, we propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations. Our method performs joint masking on the image-text input and integrates both implicit and explicit targets for the masked signals to recover. The implicit target provides a unified and debiased objective for vision and language, where the model predicts latent multimodal representations of the unmasked input. The explicit target further enriches the multimodal representations by recovering high-level, semantically meaningful information: momentum visual features of image patches and concepts of word tokens. Through such a masked modeling process, our model not only learns fine-grained multimodal interaction but also avoids the semantic gap between high-level representations and low- or mid-level prediction targets (e.g., image pixels), thus producing semantically rich multimodal representations that perform well in both zero-shot and fine-tuned settings. Our pre-trained model (named MAMO) achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
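
To make the joint-masking objective concrete, below is a minimal PyTorch-style sketch of the implicit target: both modalities are randomly masked, a momentum (EMA) teacher encodes the unmasked input, and the student regresses the teacher's latent representations at the masked positions. This is an illustration only, not the authors' released code; the encoder, masking ratios, embedding dimensions, and loss choice (smooth L1) are all hypothetical placeholders.

```python
# Illustrative sketch of joint masked multimodal modeling with an implicit
# (latent-regression) target. NOT the authors' code; all names and
# hyperparameters here are hypothetical.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Stand-in for a multimodal transformer encoder (hypothetical)."""
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        return self.encoder(x)


def random_mask(x, mask_token, ratio):
    """Replace a random subset of positions with a learned mask token."""
    b, n, d = x.shape
    mask = torch.rand(b, n, device=x.device) < ratio  # True = masked
    x = torch.where(mask.unsqueeze(-1), mask_token.expand(b, n, d), x)
    return x, mask


class MaskedMultimodalModel(nn.Module):
    def __init__(self, dim=256, momentum=0.995):
        super().__init__()
        self.student = TinyEncoder(dim)
        self.teacher = copy.deepcopy(self.student)  # momentum (EMA) copy
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.momentum = momentum

    @torch.no_grad()
    def ema_update(self):
        """Slowly track the student with the momentum teacher."""
        for ps, pt in zip(self.student.parameters(),
                          self.teacher.parameters()):
            pt.mul_(self.momentum).add_(ps, alpha=1 - self.momentum)

    def forward(self, img_emb, txt_emb, img_ratio=0.4, txt_ratio=0.15):
        # 1) Jointly mask both modalities, then fuse into one sequence.
        img_m, m_i = random_mask(img_emb, self.mask_token, img_ratio)
        txt_m, m_t = random_mask(txt_emb, self.mask_token, txt_ratio)
        masked = torch.cat([img_m, txt_m], dim=1)
        mask = torch.cat([m_i, m_t], dim=1)

        # 2) Implicit target: the teacher encodes the UNMASKED input;
        #    the student must predict those latents at masked positions.
        with torch.no_grad():
            target = self.teacher(torch.cat([img_emb, txt_emb], dim=1))
        pred = self.student(masked)

        # Regression on masked positions only.
        return F.smooth_l1_loss(pred[mask], target[mask])


# Usage with dummy patch/token embeddings:
model = MaskedMultimodalModel()
img = torch.randn(2, 49, 256)  # 2 images, 49 patch embeddings each
txt = torch.randn(2, 20, 256)  # 2 captions, 20 token embeddings each
loss = model(img, txt)
loss.backward()
model.ema_update()
```

The explicit target described in the abstract would add further prediction heads on top of the student output (regressing momentum visual features for image patches and classifying concepts for word tokens); it is omitted here for brevity.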

Related research:

- 08/04/2022 · Fine-Grained Semantically Aligned Vision-Language Pre-Training
- 10/28/2022 · DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
- 05/23/2023 · Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
- 02/17/2022 · On Guiding Visual Attention with Language Specification
- 07/16/2021 · Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
- 09/18/2022 · ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding
- 06/13/2022 · Compositional Mixture Representations for Vision and Text
