UNITER: Learning UNiversal Image-TExt Representations

09/25/2019
by Yen-Chun Chen, et al.

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are jointly processed for visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design three pre-training tasks: Masked Language Modeling (MLM), Image-Text Matching (ITM), and Masked Region Modeling (MRM, with three variants). Unlike concurrent work on multimodal pre-training that applies joint random masking to both modalities, we use conditioned masking in our pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text). Comprehensive analysis shows that conditioned masking yields better performance than unconditioned masking. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves a new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR2.
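
The conditioned-masking idea can be illustrated with a minimal sketch: for MLM, only text tokens are masked while every image region stays observed; for MRM, only region features are masked while the full sentence stays observed. The helper names, token ids, and masking rate below are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of conditioned masking (hypothetical helpers, not UNITER's code).
# MLM masks text only (regions fully observed); MRM masks regions only (text fully observed).
import random

MASK_TOKEN_ID = 103   # assumed BERT-style [MASK] id
MASK_PROB = 0.15      # assumed masking rate

def mask_text_conditioned(token_ids, region_feats, mask_prob=MASK_PROB):
    """MLM input: mask a subset of text tokens; image regions are left untouched."""
    masked_ids, labels = [], []
    for tid in token_ids:
        if random.random() < mask_prob:
            masked_ids.append(MASK_TOKEN_ID)
            labels.append(tid)      # predict the original token
        else:
            masked_ids.append(tid)
            labels.append(-100)     # ignored by the loss
    return masked_ids, region_feats, labels

def mask_regions_conditioned(token_ids, region_feats, mask_prob=MASK_PROB):
    """MRM input: zero out a subset of region features; text is left untouched."""
    masked_feats, targets = [], []
    for feat in region_feats:
        if random.random() < mask_prob:
            masked_feats.append([0.0] * len(feat))
            targets.append(feat)    # reconstruct / classify the original region
        else:
            masked_feats.append(feat)
            targets.append(None)
    return token_ids, masked_feats, targets

if __name__ == "__main__":
    tokens = [101, 2023, 2003, 1037, 4937, 102]   # toy token ids
    regions = [[0.1] * 4, [0.2] * 4, [0.3] * 4]   # toy region features
    print(mask_text_conditioned(tokens, regions))
    print(mask_regions_conditioned(tokens, regions))
```

In contrast, joint random masking would apply both masking passes to the same training example, so the model could be asked to recover a masked word while the corresponding region is also hidden.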
