VisualBERT: A Simple and Performant Baseline for Vision and Language

08/09/2019
by Liunian Harold Li, et al.

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks (VQA, VCR, NLVR2, and Flickr30K) show that VisualBERT outperforms or rivals state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and the image regions corresponding to their arguments.
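The single-stream design described above is compact enough to sketch in code. Below is a minimal PyTorch illustration of the idea, not the authors' released implementation: word-piece embeddings and projected detector region features (e.g. from Faster R-CNN) are concatenated into one sequence, distinguished only by a segment embedding, so every self-attention layer can attend across the two modalities. All names (VisualBERTSketch, region_proj) and the default sizes are illustrative assumptions; position embeddings, attention masking, and the two pre-training heads are omitted for brevity.

```python
import torch
import torch.nn as nn

class VisualBERTSketch(nn.Module):
    """Sketch of the VisualBERT idea: text tokens and image region
    features share one Transformer stack, so self-attention can align
    words with regions without explicit supervision."""

    def __init__(self, vocab_size=30522, hidden=768, region_dim=2048,
                 layers=4, heads=12):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        # Project detector region features into the text embedding space.
        self.region_proj = nn.Linear(region_dim, hidden)
        # Segment embeddings mark each position as text (0) or visual (1).
        self.segment_emb = nn.Embedding(2, hidden)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, token_ids, region_feats):
        # token_ids:    (batch, text_len) word-piece ids
        # region_feats: (batch, n_regions, region_dim) pooled region features
        text = self.token_emb(token_ids)
        text = text + self.segment_emb(torch.zeros_like(token_ids))
        vis = self.region_proj(region_feats)
        seg_v = torch.ones(vis.shape[:2], dtype=torch.long, device=vis.device)
        vis = vis + self.segment_emb(seg_v)
        # One joint sequence, one self-attention stack over both modalities.
        return self.encoder(torch.cat([text, vis], dim=1))

# Toy usage: 2 captions of 8 tokens each, paired with 5 detected regions.
model = VisualBERTSketch()
out = model(torch.randint(0, 30522, (2, 8)), torch.randn(2, 5, 2048))
print(out.shape)  # torch.Size([2, 13, 768])
```

The point of this single-stream layout is that word-region alignment is not engineered in; it can emerge from joint self-attention, which is exactly what the grounding analysis summarized in the abstract probes.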


