ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

08/06/2019
by Jiasen Lu, et al.

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.
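To make the two-stream interaction concrete, below is a minimal PyTorch sketch of a co-attentional layer in which each modality's queries attend to the other modality's keys and values. This is an illustrative sketch, not ViLBERT's released code: the class and variable names are ours, the feed-forward sublayers of a full transformer block are omitted for brevity, and nn.MultiheadAttention stands in for the paper's attention blocks. The widths are chosen to reflect a BERT-base-sized text stream alongside a wider visual stream.

```python
# Minimal sketch of a co-attentional transformer layer (illustrative names).
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """Each stream attends over the *other* stream's features:
    visual queries attend to text keys/values, and vice versa."""

    def __init__(self, vis_dim=1024, txt_dim=768, num_heads=8):
        super().__init__()
        # Cross-attention: queries from one modality, keys/values from the other.
        self.vis_attends_txt = nn.MultiheadAttention(
            embed_dim=vis_dim, kdim=txt_dim, vdim=txt_dim,
            num_heads=num_heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(
            embed_dim=txt_dim, kdim=vis_dim, vdim=vis_dim,
            num_heads=num_heads, batch_first=True)
        self.vis_norm = nn.LayerNorm(vis_dim)
        self.txt_norm = nn.LayerNorm(txt_dim)

    def forward(self, vis, txt):
        # vis: (batch, num_regions, vis_dim); txt: (batch, num_tokens, txt_dim)
        vis_out, _ = self.vis_attends_txt(query=vis, key=txt, value=txt)
        txt_out, _ = self.txt_attends_vis(query=txt, key=vis, value=vis)
        # Residual connection + layer norm, transformer-style.
        return self.vis_norm(vis + vis_out), self.txt_norm(txt + txt_out)

# Example: 36 image regions and 20 text tokens for a batch of 2.
layer = CoAttentionLayer()
vis = torch.randn(2, 36, 1024)
txt = torch.randn(2, 20, 768)
vis, txt = layer(vis, txt)
print(vis.shape, txt.shape)  # torch.Size([2, 36, 1024]) torch.Size([2, 20, 768])
```

Because queries stay within their own modality while keys and values come from the other, each stream keeps its own width and depth, which is the key design choice separating this two-stream approach from single-stream multi-modal BERT variants.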

Related research:

12/05/2019
12-in-1: Multi-Task Vision and Language Representation Learning
Much of vision-and-language research focuses on a small but diverse set ...

12/13/2020
KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning
Reasoning is a critical ability towards complete visual understanding. T...

02/10/2023
Is multi-modal vision supervision beneficial to language?
Vision (image and video) - Language (VL) pre-training is the recent popu...

03/25/2021
Visual Grounding Strategies for Text-Only Natural Language Processing
Visual grounding is a promising path toward more robust and accurate Nat...

10/27/2022
Masked Vision-Language Transformer in Fashion
We present a masked vision-language transformer (MVLT) for fashion-speci...

01/15/2021
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
The limits of applicability of vision-and-language models are defined by...

08/10/2021
Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
Language-guided robots performing home and office tasks must navigate in...
