LXMERT: Learning Cross-Modality Encoder Representations from Transformers

08/20/2019
by Hao Tan, et al.

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help in learning both intra-modality and cross-modality relationships. After fine-tuning from our pre-trained parameters, our model achieves state-of-the-art results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our pre-trained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR2, and improve the previous best result by 22% absolute. Lastly, we present detailed ablation studies to prove that both our novel model components and pre-training strategies significantly contribute to our strong results. Code and pre-trained models are publicly available at: https://github.com/airsplay/lxmert
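To make the three-encoder design described above more concrete, below is a minimal, illustrative PyTorch sketch of an LXMERT-style stack: a language encoder over word embeddings, an object-relationship encoder over detected-region (RoI) features plus box positions, and a cross-modality encoder with bidirectional cross-attention. The class names, layer counts, hidden sizes, and feature dimensions here are placeholder assumptions for illustration, not the authors' exact implementation; see the GitHub link above for the official code.

```python
# Illustrative sketch of an LXMERT-style three-encoder stack in PyTorch.
# All hyperparameters below are placeholders, not the paper's exact values.
import torch
import torch.nn as nn


class CrossModalityLayer(nn.Module):
    """One cross-modality layer: bidirectional cross-attention followed by
    per-modality self-attention and feed-forward blocks."""

    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.lang_cross = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.vis_cross = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.lang_self = nn.TransformerEncoderLayer(hidden, heads, 4 * hidden, batch_first=True)
        self.vis_self = nn.TransformerEncoderLayer(hidden, heads, 4 * hidden, batch_first=True)
        self.lang_norm = nn.LayerNorm(hidden)
        self.vis_norm = nn.LayerNorm(hidden)

    def forward(self, lang, vis):
        # Language queries attend over vision keys/values, and vice versa.
        lang2, _ = self.lang_cross(lang, vis, vis)
        vis2, _ = self.vis_cross(vis, lang, lang)
        lang = self.lang_norm(lang + lang2)
        vis = self.vis_norm(vis + vis2)
        # Intra-modality self-attention + feed-forward inside each stream.
        return self.lang_self(lang), self.vis_self(vis)


class LXMERTStyleModel(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, heads=12,
                 lang_layers=9, vis_layers=5, cross_layers=5, roi_feat_dim=2048):
        super().__init__()
        # Language encoder: word embeddings + Transformer self-attention layers.
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.lang_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, heads, 4 * hidden, batch_first=True),
            num_layers=lang_layers)
        # Object-relationship encoder: projected RoI features + box positions,
        # then Transformer self-attention layers over the detected objects.
        self.feat_proj = nn.Linear(roi_feat_dim, hidden)
        self.box_proj = nn.Linear(4, hidden)  # normalized (x1, y1, x2, y2)
        self.vis_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, heads, 4 * hidden, batch_first=True),
            num_layers=vis_layers)
        # Cross-modality encoder stack.
        self.cross_layers = nn.ModuleList(
            CrossModalityLayer(hidden, heads) for _ in range(cross_layers))

    def forward(self, token_ids, roi_feats, roi_boxes):
        lang = self.lang_encoder(self.word_emb(token_ids))
        vis = self.vis_encoder(self.feat_proj(roi_feats) + self.box_proj(roi_boxes))
        for layer in self.cross_layers:
            lang, vis = layer(lang, vis)
        return lang, vis  # cross-modal language and vision representations


# Usage with random inputs: 20 tokens and 36 detected regions per image.
model = LXMERTStyleModel()
tokens = torch.randint(0, 30522, (2, 20))
feats = torch.randn(2, 36, 2048)
boxes = torch.rand(2, 36, 4)
lang_out, vis_out = model(tokens, feats, boxes)
```

In this sketch, pre-training heads (masked language modeling, masked object prediction, cross-modality matching, and image question answering) would be attached on top of the returned language and vision outputs; they are omitted here for brevity.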


