VL-BERT: Pre-training of Generic Visual-Linguistic Representations

08/22/2019
by Weijie Su, et al.

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone and extends it to take both visual and linguistic embedded features as input. Each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image, and the model is designed to fit most vision-and-language downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset with three tasks: masked language modeling with visual clues, masked RoI classification with linguistic clues, and sentence-image relationship prediction. Extensive empirical analysis demonstrates that the pre-training procedure better aligns visual-linguistic clues and benefits downstream tasks such as visual question answering, visual commonsense reasoning, and referring expression comprehension. Notably, VL-BERT achieved first place among single models on the leaderboard of the VCR benchmark.
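To make the input scheme concrete, below is a minimal PyTorch sketch of how word tokens and RoI features might be embedded into a single sequence and fed to a Transformer encoder, in the spirit of the abstract above. All class names, dimensions, and toy inputs (e.g. VLBERTStyleInput, roi_feat_dim=2048) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class VLBERTStyleInput(nn.Module):
    """Embeds a mixed sequence of word and RoI elements (illustrative
    sketch; names and sizes are assumptions, not the released code)."""

    def __init__(self, vocab_size=30522, hidden=768, roi_feat_dim=2048,
                 max_pos=512, num_segments=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        # Project detector RoI features (e.g. from Fast(er) R-CNN) to model width.
        self.roi_proj = nn.Linear(roi_feat_dim, hidden)
        self.seg_emb = nn.Embedding(num_segments, hidden)  # e.g. sentence A / B / image
        self.pos_emb = nn.Embedding(max_pos, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, roi_feats, seg_ids):
        words = self.word_emb(token_ids)        # (B, Lt, H) linguistic elements
        rois = self.roi_proj(roi_feats)         # (B, Lr, H) visual elements
        x = torch.cat([words, rois], dim=1)     # one unified input sequence
        pos = torch.arange(x.size(1), device=x.device).unsqueeze(0)
        return self.norm(x + self.seg_emb(seg_ids) + self.pos_emb(pos))

# Toy forward pass: 16 word tokens plus 10 image RoIs per example.
embed = VLBERTStyleInput()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)
tokens = torch.randint(0, 30522, (2, 16))
rois = torch.randn(2, 10, 2048)
segs = torch.cat([torch.zeros(2, 16, dtype=torch.long),
                  torch.full((2, 10), 2, dtype=torch.long)], dim=1)
hidden = encoder(embed(tokens, rois, segs))     # shape (2, 26, 768)
```

Treating words and RoIs as elements of one sequence lets the masked language modeling and masked RoI classification pre-training tasks each attend across both modalities, which is the alignment effect the abstract describes.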


