VL-BERT: Pre-training of Generic Visual-Linguistic Representations

by   Weijie Su, et al.

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. Each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. The model is designed to fit most vision-and-language downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset with three tasks: masked language modeling with visual clues, masked RoI classification with linguistic clues, and sentence-image relationship prediction. Extensive empirical analysis demonstrates that the pre-training procedure better aligns the visual-linguistic clues and benefits downstream tasks such as visual question answering, visual commonsense reasoning and referring expression comprehension. It is worth noting that VL-BERT achieved first place among single models on the leaderboard of the VCR benchmark.





Code Repositories


Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".


1 Introduction

Pre-training of generic feature representations applicable to a variety of tasks in a domain is a hallmark of the success of deep networks. First, in computer vision, backbone networks designed for and pre-trained on ImageNet classification were found to be effective for improving numerous image recognition tasks. More recently, in natural language processing (NLP), Transformer networks (vaswani2017transformer) pre-trained with the “masked language model” (MLM) objective (devlin2018bert) on large language corpora excel at a variety of NLP tasks.

Meanwhile, for tasks at the intersection of vision and language, such as image captioning (young2014image; chen2015microsoft; sharma2018conceptual), visual question answering (VQA) (antol2015vqa; johnson2017clevr; goyal2017making; hudson2019gqa), and visual commonsense reasoning (VCR) (zellers2019vcr; gao2019two), such pre-trained generic feature representations are lacking. The previous practice is to combine base networks pre-trained for image recognition and NLP, respectively, in a task-specific way. The task-specific model is directly finetuned for the specific target task, without any generic visual-linguistic pre-training. Such a model may well suffer from overfitting when data for the target task is scarce. Also, because of the task-specific model design, it is difficult to benefit from pre-training when the pre-training task differs from the target. There is no common ground for studying the feature design and pre-training of visual-linguistic tasks in general.

In the various network architectures designed for different visual-linguistic tasks, a key goal is to effectively aggregate the multi-modal information in both the visual and linguistic domains. For example, to pick the right answer in the VQA task, the network should be able to integrate linguistic information from the question and the answers, aggregate visual information from the input image, and align the linguistic meanings with the visual clues. Thus, we seek to derive generic representations that can effectively aggregate and align visual and linguistic information.

In the meantime, we see the successful application of Transformer attention (vaswani2017transformer) in NLP, together with its MLM-based pre-training technique in BERT (devlin2018bert). The attention module is powerful and flexible in aggregating and aligning word embedded features in sentences, while the pre-training in BERT further enhances the capability.

Inspired by that, we developed VL-BERT, a pre-trainable generic representation for visual-linguistic tasks, as shown in Fig. 1. The backbone of VL-BERT is a (multi-modal) Transformer attention module taking both visual and linguistic embedded features as input. Each element is either a word from the input sentence or a region-of-interest (RoI) from the input image, together with certain special elements to disambiguate different input formats. Each element can adaptively aggregate information from all the other elements according to the compatibility defined on their contents, positions, categories, etc. The content features of a word / an RoI are produced by domain-specific networks (BERT for word features, Fast R-CNN (girshick2015fast) for RoI features). By stacking multiple layers of multi-modal Transformer attention modules, the derived representation gains rich capability in aggregating and aligning visual-linguistic clues. Task-specific branches can be added on top for specific visual-linguistic tasks.

To better exploit the generic representation, we pre-train VL-BERT on a large visual-linguistic corpus. Here we pre-train on the Conceptual Captions dataset (sharma2018conceptual), with around 3.3 million image-caption pairs. Similar to the practice in BERT, the pre-training is driven by losses incurred by predicting randomly masked words or RoIs, together with losses from deciding whether sampled images and captions belong to the same pair. Such pre-training sharpens the capability of VL-BERT in aggregating and aligning visual-linguistic clues.

Comprehensive empirical evidence demonstrates that our proposed VL-BERT with pre-training achieves state-of-the-art performance on various downstream vision-and-language tasks, such as visual question answering, visual commonsense reasoning and referring expression comprehension. In particular, we achieved first place among single models on the leaderboard of visual commonsense reasoning.

2 Related Work

Pre-training for Computer Vision. Prior to the era of deep networks, sharing features among different tasks and improving them via pre-training was far from mature. The models for various computer vision tasks had design choices too diverse to derive a generic representation. With the success of AlexNet (krizhevsky2012alexnet) in ImageNet (deng2009imagenet) classification, we saw the renaissance of convolutional neural networks (CNNs) in the vision community. Soon after that, researchers found that ImageNet pre-trained CNNs can serve well as a generic feature representation for various downstream tasks (donahue2014decaf), such as object detection (girshick2014rich), semantic segmentation (long2015fcn), and instance segmentation (hariharan2014simultaneous). Improvements in backbone networks designed for ImageNet classification further improve the downstream tasks. Recently, there have been research works on directly training CNNs from scratch on massive-scale target datasets, without ImageNet pre-training (he2018rethinking). They achieved performance on par with, or even slightly higher than, models with ImageNet pre-training. These works also note, however, that pre-training on a proper massive dataset is vital for improving performance on target tasks with scarce data.

Pre-training for Natural Language Processing (NLP). It is interesting to note that the development of pre-training techniques in NLP lags considerably behind computer vision. There are previous research works on improving word embeddings (mikolov2013efficient; pennington2014glove; kiros2015skip), which are a low-level linguistic feature representation. On top of that, numerous diverse architectures were designed for various NLP tasks. In the milestone work on Transformers (vaswani2017transformer), the Transformer attention module was proposed as a generic building block for various NLP tasks. After that, a series of approaches were proposed for pre-training the generic representation, mainly based on Transformers, such as GPT (radford2018GPT), BERT (devlin2018bert), GPT-2 (radford2019GPT-2), XLNet (yang2019xlnet), XLM (lample2019xlm), and RoBERTa (liu2019roberta). Among them, BERT is perhaps the most popular one due to its simplicity and superior performance.


Status | Method | Architecture | Visual Token | Pre-train Datasets | Pre-train Tasks | Downstream Tasks
Published Works | VideoBERT (sun2019videobert) | single cross-modal Transformer | video frame | Cooking312K (sun2019videobert) | 1) sentence-image alignment; 2) masked language modeling; 3) masked visual-words prediction | 1) zero-shot action classification; 2) video captioning
Works Under Review / Just Got Accepted | CBT (sun2019contrastive) | two single-modal Transformer (vision & language respectively) + one cross-modal Transformer | video frame | Cooking312K (sun2019videobert) | 1) sentence-image alignment; 2) masked language modeling; 3) masked visual-feature regression | 1) action anticipation; 2) video captioning
Works Under Review / Just Got Accepted | ViLBERT (lu2019vilbert) | one single-modal Transformer (language) + one cross-modal Transformer (with restricted attention pattern) | image RoI | Conceptual Captions (sharma2018conceptual) | 1) sentence-image alignment; 2) masked language modeling; 3) masked visual-feature classification | 1) visual question answering; 2) visual commonsense reasoning; 3) grounding referring expressions; 4) image retrieval; 5) zero-shot image retrieval
Works Under Review / Just Got Accepted | B2T2 (alberti2019fusion) | single cross-modal Transformer | image RoI | Conceptual Captions (sharma2018conceptual) | 1) sentence-image alignment; 2) masked language modeling | 1) visual commonsense reasoning
Works Under Review / Just Got Accepted | LXMERT (tan2019lxmert) | two single-modal Transformer (vision & language respectively) + one cross-modal Transformer | image RoI | ‡ COCO Caption + VG Caption + VG QA + VQA + GQA | 1) sentence-image alignment; 2) masked language modeling; 3) masked visual-feature classification; 4) masked visual-feature regression; 5) visual question answering | 1) visual question answering; 2) natural language visual reasoning
Works in Progress | VisualBERT (li2019visualbert) | single cross-modal Transformer | image RoI | COCO Caption (chen2015microsoft) | 1) sentence-image alignment; 2) masked language modeling | 1) visual question answering; 2) visual commonsense reasoning; 3) natural language visual reasoning; 4) grounding phrases
Works in Progress | Unicoder-VL (li2019unicodervl) | single cross-modal Transformer | image RoI | Conceptual Captions (sharma2018conceptual) | 1) sentence-image alignment; 2) masked language modeling; 3) masked visual-feature classification | 1) image-text retrieval; 2) zero-shot image-text retrieval
 | Our VL-BERT | single cross-modal Transformer | image RoI | Conceptual Captions (sharma2018conceptual) | 1) sentence-image alignment; 2) masked language modeling; 3) masked visual-feature classification | 1) visual question answering; 2) visual commonsense reasoning; 3) grounding referring expressions

‡ LXMERT is pre-trained on COCO Caption (chen2015microsoft), VG Caption (krishna2017visual), VG QA (zhu2016visual7w), VQA (antol2015vqa) and GQA (hudson2019gqa).
Table 1: Comparison among our VL-BERT and other works seeking to derive pre-trainable generic representations for visual-linguistic tasks.

Pre-training for Vision-and-Language Tasks. The development of models for vision-and-language tasks has followed a course quite similar to that in the computer vision and NLP communities. Previously, task-specific models were designed, wherein the features derived from off-the-shelf computer vision and NLP models were combined in an ad-hoc way for specific tasks. Model training was performed on the dataset for the specific task only.

VideoBERT (sun2019videobert) is the first work seeking to conduct pre-training for vision-and-language tasks. In it, video clips are processed by off-the-shelf networks for action recognition, and are assigned to different clusters based on the derived features. The cluster ids tokenize all the video clips to form visual words. The visual words are processed together with the linguistic words by a BERT model, with almost no change in design. Specifically, when a visual word corresponding to a video clip is masked, the attached classifier is trained to predict its corresponding cluster id. The work demonstrates improved performance on a cooking instruction dataset, with models pre-trained on massive cooking instruction videos from YouTube. However, because the video clips are coarsely clustered during tokenization, considerable visual content information may well be lost, and updating the network parameters that derive the visual features becomes impossible. In the follow-up work of CBT (sun2019contrastive), this tokenization mechanism is removed. Instead, each video clip is represented by its raw feature extracted by an action recognition network. In pre-training, the loss incurred on a masked video clip is the error in predicting the masked raw features. Both works are applied to videos, which have a linear structure in the time dimension, the same as sentences. Studies on the well-established image-based vision-and-language tasks are therefore highly desired.

Concurrent with our work, multiple works released on arXiv very recently also seek to derive a pre-trainable generic representation for vision-and-language tasks. Table 1 compares them. We briefly discuss some of these works here.

In ViLBERT (lu2019vilbert) and LXMERT (tan2019lxmert), which are under review or were just accepted, the network architectures consist of two single-modal networks applied to input sentences and images respectively, followed by a cross-modal Transformer combining information from the two sources. The attention pattern in the cross-modal Transformer is restricted, which the authors believe improves performance. The authors of ViLBERT claim that such a two-stream design is superior to a single-stream unified model. In contrast, the proposed VL-BERT is a unified architecture based on Transformers without any restriction on the attention patterns. The visual and linguistic contents are fed as input to VL-BERT, wherein they interact early and freely. We found that our unified VL-BERT model outperforms such two-stream designs.

VisualBERT (li2019visualbert), B2T2 (alberti2019fusion), and Unicoder-VL (li2019unicodervl), which are works in progress or under review, also adopt a unified single-stream architecture. The differences among these works are compared in Table 1. The concurrent emergence of these research works indicates the importance of deriving a generic pre-trainable representation for vision-and-language tasks.

3 VL-BERT

3.1 Revisit BERT Model

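Let $x = \{x_1, \dots, x_N\}$ be the input elements in BERT (devlin2018bert), which are embedded features encoding sentence words. They are processed by a multi-layer bidirectional Transformer (vaswani2017transformer), where the embedding features of each element are transformed layer by layer by aggregating features from the other elements with adaptive attention weights. Let $x^l = \{x_1^l, \dots, x_N^l\}$ be the features of the $l$-th layer ($x^0$ is set as the input $x$). The features of the $(l+1)$-th layer, $x^{l+1}$, are computed by

\begin{align}
\tilde{h}_i^{l+1} &= \sum_{m=1}^{M} W_m^{l+1} \Big\{ \sum_{j=1}^{N} A_{i,j}^{m} \cdot V_m^{l+1} x_j^{l} \Big\} && \text{Multi-head Attention} \tag{1} \\
h_i^{l+1} &= \mathrm{LayerNorm}\big(x_i^{l} + \tilde{h}_i^{l+1}\big) && \text{Residual Connection} \tag{2} \\
\tilde{x}_i^{l+1} &= W_2^{l+1} \cdot \mathrm{GELU}\big(W_1^{l+1} h_i^{l+1} + b_1^{l+1}\big) + b_2^{l+1} && \text{Feed-forward} \tag{3} \\
x_i^{l+1} &= \mathrm{LayerNorm}\big(h_i^{l+1} + \tilde{x}_i^{l+1}\big) && \text{Residual Connection} \tag{4}
\end{align}

where $m$ in Eq. 1 indexes over the attention heads, and $A_{i,j}^{m} \propto \exp\big[(Q_m^{l+1} x_i^{l})^{T} (K_m^{l+1} x_j^{l})\big]$ denotes the attention weights between elements $i$ and $j$ in the $m$-th head, which are normalized by $\sum_{j=1}^{N} A_{i,j}^{m} = 1$. $W_m^{l+1}$, $Q_m^{l+1}$, $K_m^{l+1}$ and $V_m^{l+1}$ are learnable weights for the $m$-th attention head, and $W_1^{l+1}, W_2^{l+1}$ and $b_1^{l+1}, b_2^{l+1}$ in Eq. 3 are learnable weights and biases, respectively. Note that the operations in Eqs. 1-4 are independent of the order of the input sequence, i.e. the final BERT representation of a permuted input is the same as the final BERT representation of the original input after the same permutation. The position of an element in BERT is encoded in its own embedding features by the sequence positional embedding. Thanks to such a decoupled representation, the BERT model is flexible enough to be pre-trained and finetuned for a variety of NLP tasks.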
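The layer update labelled (1) to (4) above, and the order-independence remark that follows it, can be illustrated with a minimal pure-Python sketch. It assumes the usual post-LayerNorm Transformer layer with a GELU feed-forward block, leaves out the learned LayerNorm gain and bias and the customary scaling of attention scores, and uses illustrative names (`transformer_layer`, `heads`) that do not come from the paper's code.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def layer_norm(x, eps=1e-5):
    """Zero-mean, unit-variance normalization (learned gain/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def attention_head(X, Q, K, V):
    """One head: each element attends to all elements with softmax weights."""
    q = [matvec(Q, x) for x in X]
    k = [matvec(K, x) for x in X]
    v = [matvec(V, x) for x in X]
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # sums to 1 over j
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out

def transformer_layer(X, heads, W1, b1, W2, b2):
    """heads: list of (W_out, Q, K, V) weight tuples, one per attention head."""
    n, dim = len(X), len(X[0])
    attn = [[0.0] * dim for _ in range(n)]
    for W_out, Q, K, V in heads:  # step (1): sum over heads
        head_out = attention_head(X, Q, K, V)
        for i in range(n):
            proj = matvec(W_out, head_out[i])
            attn[i] = [a + p for a, p in zip(attn[i], proj)]
    out = []
    for i in range(n):
        h = layer_norm([x + a for x, a in zip(X[i], attn[i])])        # step (2)
        hidden = [gelu(u + b) for u, b in zip(matvec(W1, h), b1)]
        ff = [u + b for u, b in zip(matvec(W2, hidden), b2)]          # step (3)
        out.append(layer_norm([a + b for a, b in zip(h, ff)]))        # step (4)
    return out
```

Because no step depends on an element's index, feeding a permuted input yields the same outputs in permuted order; positional information has to come from the embeddings themselves.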

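In BERT pre-training, the masked language modeling (MLM) task is introduced. The embedded features of a certain input word are randomly masked out (the token embedding channels capturing the word content are replaced by a special [MASK] token). The BERT model is trained to predict the masked word from the linguistic clues of all the other unmasked elements. As explained in wang2019bertmrf, the overall MLM-based training of BERT is equivalent to optimizing the following joint probability distribution

$$P(x \mid \theta) = \frac{1}{Z(\theta)} \prod_{i=1}^{N} \phi_i(x \mid \theta),$$

where $\phi_i(x \mid \theta)$ is the potential function for the $i$-th input element, with parameters $\theta$, and $Z(\theta)$ is the partition function. Each log-potential term $\log \phi_i(x \mid \theta)$ is defined as

$$\log \phi_i(x \mid \theta) = x_i^{T} f_i(x_{\backslash i} \mid \theta),$$

where $f_i(x_{\backslash i} \mid \theta)$ denotes the final output feature of BERT corresponding to the $i$-th element for input $x_{\backslash i}$, where $x_{\backslash i}$ is defined as $x_{\backslash i} = \{x_1, \dots, x_{i-1}, \text{[MASK]}, x_{i+1}, \dots, x_N\}$. The incurred MLM-based loss is

$$L_{\text{MLM}}(\theta) = -E_{x \sim D,\; i \sim \{1, \dots, N\}} \log \phi_i(x \mid \theta),$$

where $x$ is a randomly sampled sentence from the training set $D$, and $i$ is a randomly sampled location for masking words.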

The second pre-training task, Next Sentence Prediction, focuses on modeling the relationship between two sentences. Two sentences are sampled from the input document, and the model should predict whether the second sentence is the direct successor of the first. In BERT, the two sampled sentences are concatenated into one input sequence, with special elements [CLS] and [SEP] inserted prior to the first and the second sentences, respectively. A Sigmoid classifier is appended on the final output feature corresponding to the [CLS] element to make the prediction. Let $x$ be the input sequence, and let $t \in \{0, 1\}$ indicate the relationship between the two sentences. The loss function is defined as

$$L_{\text{NSP}}(\theta) = -E_{(x, t) \sim D} \big[ t \log g(x_0^{L}) + (1 - t) \log\big(1 - g(x_0^{L})\big) \big],$$

where $x_0^{L}$ is the final output feature of the [CLS] element (at the $L$-th layer), and $g(x_0^{L})$ is the classifier output.

3.2 Model Architecture

Figure 1: Architecture for pre-training VL-BERT. All parameters in this architecture, including VL-BERT and Fast R-CNN, are jointly trained in both the pre-training and fine-tuning phases.

Figure 1 illustrates the architecture of VL-BERT. Basically, it modifies the original BERT (devlin2018bert) model by adding new elements to accommodate the visual contents, and a new type of visual feature embedding to the input feature embeddings. Similar to BERT, the backbone is a multi-layer bidirectional Transformer encoder (vaswani2017transformer), enabling dependency modeling among all the input elements. Unlike BERT, which processes sentence words only, VL-BERT takes both visual and linguistic elements as input, which are features defined on regions-of-interest (RoIs) in images and sub-words from input sentences, respectively. The RoIs can either be bounding boxes produced by object detectors, or be annotated ones in certain tasks.

It is worth noting that the input formats vary for different visual-linguistic tasks (e.g., <Caption, Image> for image captioning, and <Question, Answer, Image> for VQA (antol2015vqa; johnson2017clevr; goyal2017making; hudson2019gqa) and VCR (zellers2019vcr; gao2019two)). But thanks to the unordered representation nature of Transformer attention (e.g., the position of a word in a sentence is encoded by the positional embedding only, rather than by its order in the input sequence), a generic representation can be derived as long as the input elements and embedding features are properly designed. Three types of input elements are involved, namely, visual, linguistic, and special elements for disambiguating different input formats. The input sequence always starts with a special classification element ([CLS]), continues with the linguistic elements, follows with the visual elements, and ends with a special ending element ([END]). A special separation element ([SEP]) is inserted between different sentences in the linguistic elements, and between the linguistic and visual elements. For each input element, its embedding feature is the summation of four types of embedding, namely, token embedding, visual feature embedding, segment embedding, and sequence position embedding. Among them, the visual feature embedding is newly introduced for capturing visual clues, while the other three embeddings follow the design in the original BERT paper.

Token Embedding. Following the practice in BERT, the linguistic words are embedded with WordPiece embeddings (wu2016google) with a vocabulary of 30,000 tokens. A special token is assigned to each special element. For the visual elements, a special [IMG] token is assigned to each one of them.

Visual Feature Embedding. A visual feature embedding is attached to each of the input elements; it is the concatenation of a visual appearance feature and a visual geometry feature. For a visual element corresponding to an RoI, the visual appearance feature is extracted by applying a Faster R-CNN (ren2015faster) detector, where the feature vector prior to the output layer of each RoI is used as the visual feature embedding. For the non-visual elements, the corresponding visual appearance features are extracted on the whole input image. They are obtained by applying Faster R-CNN on an RoI covering the whole input image.

The visual geometry feature is designed to inform VL-BERT of the geometric location of each input visual element in the image. Each RoI is characterized by a 4-d vector, $\big(\frac{x_{LT}}{W}, \frac{y_{LT}}{H}, \frac{x_{RB}}{W}, \frac{y_{RB}}{H}\big)$, where $(x_{LT}, y_{LT})$ and $(x_{RB}, y_{RB})$ denote the coordinates of the top-left and bottom-right corners respectively, and $W$, $H$ are the width and height of the input image. Following the practice in Relation Networks (hu2018relation), the 4-d vector is embedded into a high-dimensional representation (2048-d in this paper) by computing sine and cosine functions of different wavelengths.
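As a rough illustration of the geometry embedding described above, the sketch below normalizes the box corners by the image size and expands each of the four values into sine and cosine features at several wavelengths. The wavelength base (`base=1000.0`), the pre-scaling factor (`scale=100.0`), and the ordering of the output features are assumptions made for this example; the text only states that sine and cosine functions of different wavelengths are used.

```python
import math

def geometry_embedding(box, image_w, image_h, dim=2048, base=1000.0, scale=100.0):
    """Embed a box (x_lt, y_lt, x_rb, y_rb) into `dim` sine/cosine features.

    `base` and `scale` are illustrative choices, not values taken from the text.
    """
    x_lt, y_lt, x_rb, y_rb = box
    # Normalize corner coordinates by image width / height.
    coords = [x_lt / image_w, y_lt / image_h, x_rb / image_w, y_rb / image_h]
    n_freq = dim // 8  # 4 coordinates x (sin, cos) x n_freq wavelengths
    embedding = []
    for c in coords:
        angles = [scale * c / base ** (k / n_freq) for k in range(n_freq)]
        embedding += [math.sin(a) for a in angles]
        embedding += [math.cos(a) for a in angles]
    return embedding
```

With `dim=2048`, each coordinate contributes 256 sine and 256 cosine values, so the four coordinates together fill the 2048-d vector.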

Segment Embedding. Three types of segment, A, B, C, are defined to separate input elements from different sources, namely, A and B for the words from the first and second input sentence respectively, and C for the RoIs from the input image. For example, for the input format of <Question, Answer, Image>, A denotes Question, B denotes Answer, and C denotes Image. For the input format of <Caption, Image>, A denotes Caption, and C denotes Image. A learned segment embedding is added to every input element to indicate which segment it belongs to.

Sequence Position Embedding. A learnable sequence position embedding is added to every input element, indicating its order in the input sequence. Because there is no natural order among input visual elements, any permutation of them in the input sequence should achieve the same result. Thus the sequence position embedding is the same for all visual elements.
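The input layout described in this section can be sketched as follows. The function builds the token, segment and position-id lists for up to two sentences plus a number of RoIs. The segment labels "A", "B", "C" are placeholder names used here for the first sentence, second sentence and image segments, and the segment and position assigned to the special elements ([CLS], [SEP], [END]) are assumptions of this example, since the text does not spell them out.

```python
def build_input(sentences, num_rois):
    """Return parallel lists of tokens, segment labels and position ids.

    sentences: list of one or two word lists (e.g. [caption] or [question, answer]).
    """
    tokens, segments = ["[CLS]"], ["A"]
    for label, words in zip("AB", sentences):
        # Each sentence is closed by [SEP]; the last one also separates
        # the linguistic elements from the visual ones.
        tokens += list(words) + ["[SEP]"]
        segments += [label] * (len(words) + 1)
    positions = list(range(len(tokens)))

    # All visual elements share a single position id: they have no natural order.
    visual_pos = len(tokens)
    tokens += ["[IMG]"] * num_rois
    segments += ["C"] * num_rois
    positions += [visual_pos] * num_rois

    tokens.append("[END]")
    segments.append("C")
    positions.append(visual_pos + 1)
    return tokens, segments, positions
```

For a four-word caption and three RoIs this yields ten elements, with the three [IMG] elements sharing position id 6.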

3.3 Pre-training VL-BERT

The generic feature representation of VL-BERT enables us to pre-train it on massive-scale visual-linguistic datasets, with properly designed pre-training tasks. Here we pre-train VL-BERT on the recently released Conceptual Captions dataset (sharma2018conceptual), with around 3.3 million images annotated with captions. The dataset is harvested from web data, which is processed through an automatic pipeline with balanced goals of caption cleanliness, informativeness, fluency and learnability. We expect pre-training VL-BERT on such a massive-scale dataset to enhance its capability in modeling dependencies between visual and linguistic elements. The input format to VL-BERT is <Caption, Image>, where the RoIs in the image are localized and categorized by a pre-trained Faster R-CNN (ren2015faster) object detector. Three pre-training tasks are used to incur loss during pre-training, as follows.

Task #1 Sentence-Image Relationship Prediction. Many important downstream tasks, such as Visual Question Answering (VQA) (antol2015vqa; johnson2017clevr; goyal2017making; hudson2019gqa) and Caption-based Image Retrieval (young2014image; chen2015microsoft), require understanding the relationship between the given sentence and image, i.e. whether the contents of the given sentence and image are highly related. To train a model that understands this relationship, the Sentence-Image Relationship Prediction task is used. Specifically, both caption and image are randomly sampled from the dataset. For 50% of the time the caption actually describes the image (labeled as [Related]), and for the other 50% of the time the caption is randomly chosen from the whole dataset (labeled as [NotRelated]). As shown in Figure 1, the prediction is made by adding a randomly initialized classifier on top of the final output feature corresponding to the [CLS] element, driven by a sigmoid cross-entropy loss during training.
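A minimal sketch of the sampling rule and loss for Task #1, assuming the dataset is a plain list of (caption, image) pairs. The function names are hypothetical, and the label encoding (1 for [Related], 0 for [NotRelated]) is an assumption of this example.

```python
import math
import random

def sample_pair(dataset, index, rng):
    """dataset: list of (caption, image) pairs. Returns (caption, image, label)."""
    caption, image = dataset[index]
    if rng.random() < 0.5:
        return caption, image, 1  # [Related]: the caption describes the image
    # [NotRelated]: caption drawn at random from the whole dataset.
    # As in the description, the draw is not forced to differ from `index`.
    other_caption, _ = dataset[rng.randrange(len(dataset))]
    return other_caption, image, 0

def sigmoid_cross_entropy(logit, label):
    """Binary cross-entropy on the classifier logit of the [CLS] feature."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
```

In a real pipeline the logit would come from the classifier applied to the final [CLS] feature; here it is just a number passed in by the caller.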

Task #2 Masked Language Modeling with Visual Clues. This task is very similar to the Masked Language Modeling (MLM) task used in BERT (devlin2018bert). The key difference is that visual clues are incorporated in VL-BERT to capture the dependencies among visual and linguistic contents. During pre-training, each word in the input sentence(s) is randomly masked (with a probability of 15%). For a masked word, its token is replaced with a special [MASK] token. The model is trained to predict the masked words based on the unmasked words and the visual features. The task drives the network not only to model the dependencies among sentence words, but also to align the visual and linguistic contents. For example, in Figure 1, for “kitten drinking from [MASK]”, without the input image the masked word could be any container, such as “bowl”, “spoon” or “bottle”. The representation should capture the correspondence between the word “bottle” and the corresponding RoIs in the image to make the right guess. During pre-training, the final output feature corresponding to the masked word is fed into a classifier over the whole vocabulary, driven by a Softmax cross-entropy loss.
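The word-masking step of Task #2 might look like the sketch below. It masks each word independently with probability 15% and always substitutes [MASK]; any finer replacement policy is outside what the description states, so none is modeled. `mask_words` is a hypothetical helper name.

```python
import random

def mask_words(words, rng, mask_prob=0.15):
    """Return (masked_words, targets); targets[i] is the original word
    where a mask was applied and None elsewhere."""
    masked, targets = [], []
    for word in words:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(word)   # the classifier must recover this word
        else:
            masked.append(word)
            targets.append(None)   # no loss at this position
    return masked, targets
```

Only positions with a non-None target would contribute to the Softmax cross-entropy loss over the vocabulary.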

Task #3 Masked RoI Classification with Linguistic Clues. This is the dual of Task #2. Each RoI in the image is randomly masked out (with 15% probability), and the pre-training task is to predict the category label of the masked RoI from the other clues. For the visual element corresponding to a masked RoI, its visual feature embedding is replaced by a special learnable embedding that is independent of the image content. To avoid any visual clue leakage from the visual feature embeddings of other elements, the pixels lying in the masked RoI are set to zero before applying Fast R-CNN (girshick2015fast). During pre-training, the final output feature corresponding to the masked RoI is fed into a classifier with a Softmax cross-entropy loss for object category classification. The category label predicted by the pre-trained Faster R-CNN (ren2015faster) is used as the ground truth. An example is shown in Figure 1. The RoI corresponding to the cat in the image is masked out, and the corresponding category cannot be predicted from any visual clues. But with the input caption of “kitten drinking from bottle”, the model can infer the category by exploiting the linguistic clues.
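The pixel-zeroing step that prevents visual leakage in Task #3 can be sketched on a toy image stored as nested lists. The box convention used here (pixel coordinates, inclusive start, exclusive end) is an assumption of the example, and `zero_out_roi` is a hypothetical helper name.

```python
def zero_out_roi(image, box):
    """Return a copy of `image` with all pixels inside `box` set to 0.

    image: list of rows of pixel values.
    box: (x0, y0, x1, y1) in pixels, start inclusive, end exclusive.
    """
    x0, y0, x1, y1 = box
    return [
        [0 if (y0 <= y < y1 and x0 <= x < x1) else pixel
         for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]
```

Zeroing happens before feature extraction, so overlapping RoIs and the whole-image feature cannot reveal the masked content.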

In summary, Task #1 facilitates downstream tasks that require understanding the relationship (i.e. highly related or not) between a given sentence and image, while Tasks #2 and #3 improve the detailed alignment between visual and linguistic contents. Such detailed alignment is vital for many downstream tasks (for example, in Visual Grounding (kazemzadeh2014referitgame), the model locates the most relevant object or region in an image based on a natural language query).

3.4 Fine-tuning VL-BERT

VL-BERT is designed to be a generic feature representation for various visual-linguistic tasks. It is relatively simple to finetune VL-BERT for various downstream tasks. We simply need to feed VL-BERT with properly formatted input and output, and finetune all the network parameters end-to-end. For the input, the typical formats of <Caption, Image> and <Question, Answer, Image> cover the majority of visual-linguistic tasks. VL-BERT also supports more sentences and more images, as long as appropriate segment embeddings are introduced to identify the different input sources. At the output, the final output feature of the [CLS] element is typically used for sentence-image-relation level prediction. The final output features of words or RoIs are used for word-level or RoI-level prediction. In addition to the input and output format, task-specific loss functions and training strategies also need to be tuned. See Section 4.2 for the detailed design choices and settings.

4 Experiment

4.1 Pre-training

As described in Section 3.3, we pre-train VL-BERT on the Conceptual Captions dataset (sharma2018conceptual). As VL-BERT is developed via adding new inputs capturing visual information to the original BERT model (), we initialize the parameters the same as the original BERT described in (devlin2018bert), which is pre-trained on text-only BooksCorpus (zhu2015aligning) and English Wikipedia. The parameters corresponding to the newly added inputs capturing visual information are randomly initialized from a Gaussian distribution with mean of 0 and standard deviation of 0.02. Visual content embedding is produced by a Faster R-CNN model with ResNet-101, initialized from parameters pre-trained on Visual Genome (krishna2017visual) for object detection (see BUTD (anderson2018bottom) for more details).

Prior to pre-training on the Conceptual Captions dataset, the Faster R-CNN detector pre-trained on Visual Genome is applied to extract RoIs. Specifically, at most 100 RoIs with detection scores higher than 0.5 are selected for each image. The minimum number of RoIs selected from one image is 10, regardless of the detection score threshold.
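The RoI selection rule above (at most 100 RoIs above the 0.5 score threshold, but never fewer than 10) can be sketched as follows. The representation of detections as `(box, score)` tuples and the function name are assumptions for illustration.

```python
def select_rois(detections, score_thresh=0.5, min_rois=10, max_rois=100):
    """detections: list of (box, score). Returns the kept detections, best first."""
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    num_above = sum(1 for _, score in ranked if score > score_thresh)
    # Keep everything above the threshold, capped at max_rois,
    # but fall back to the top min_rois when too few pass the threshold.
    keep = max(min_rois, min(num_above, max_rois))
    return ranked[:keep]
```

If an image has fewer than `min_rois` detections in total, this sketch simply returns all of them.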

The VL-BERT model is pre-trained on 16 Tesla V100 GPUs for 10 epochs by SGD. In each mini-batch, 256 sampled <Caption, Image> pairs are utilized. In SGD, the Adam optimizer (kingma2014adam) is applied, with base learning rate of , , , weight decay of , learning rate warmed up over the first 8,000 steps, and linear decay of the learning rate.
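The warm-up and linear-decay schedule mentioned above could be written as the sketch below. It assumes the warm-up is linear from zero and that the rate decays linearly to zero at the last step; neither detail is stated here. `base_lr` and `total_steps` are left as caller-supplied parameters rather than guessed.

```python
def learning_rate(step, total_steps, base_lr, warmup_steps=8000):
    """Linear warm-up to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)
```

For instance, with `total_steps=100000` the rate peaks at step 8,000 and has fallen to half of `base_lr` by step 54,000.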

4.2 Fine-tuning on Downstream Tasks

(a) Input and output format for Visual Commonsense Reasoning (VCR) dataset
(b) Input and output format for Visual Question Answering (VQA) dataset
(c) Input and output format for Referring Expression task on RefCOCO+ dataset
Figure 2: Input and output formats for fine-tuning different visual-linguistic downstream tasks.

The pre-trained VL-BERT model can be fine-tuned for various downstream visual-linguistic tasks, with simple modifications on the input format, output prediction, loss function and training strategy. Details are illustrated in Figure 2 and described as follows.

4.2.1 Visual Commonsense Reasoning (VCR)


Model | Q → A val | Q → A test | QA → R val | QA → R test | Q → AR val | Q → AR test
R2C (zellers2019vcr) | 63.8 | 65.1 | 67.2 | 67.3 | 43.1 | 44.0
ViLBERT (lu2019vilbert)† | 72.4 | 73.3 | 74.5 | 74.6 | 54.0 | 54.8
VisualBERT (li2019visualbert)† | 70.8 | 71.6 | 73.2 | 73.2 | 52.2 | 52.4
B2T2 (alberti2019fusion)† | 71.9 | 72.6 | 76.0 | 75.7 | 54.9 | 55.0
VL-BERT w/o pre-training | 73.5 | - | 74.4 | - | 54.8 | -
VL-BERT | 73.7 | 74.0 | 74.5 | 74.8 | 55.0 | 55.5


Table 2: Results compared to the state-of-the-art methods with single model on VCR dataset.
† indicates concurrent works.

Visual Commonsense Reasoning (VCR) focuses on higher-order cognitive and commonsense understanding of the given image. In the dataset of zellers2019vcr, given an image and a list of categorized RoIs, a cognition-level question is raised. The model should pick the right answer to the question and provide the rationale for it. For each question, there are 4 candidate answers and 4 candidate rationales. This holistic task (Q → AR) is decomposed into two sub-tasks for which researchers can train separate models: question answering (Q → A) and answer justification (QA → R). The released VCR dataset consists of 265k pairs of questions, answers, and rationales, over 100k unique movie scenes (100k images). They are split into training, validation, and test sets consisting of 213k questions and 80k images, 27k questions and 10k images, and 25k questions and 10k images, respectively.

Our experimental protocol for VCR follows that of R2C (zellers2019vcr). The model is trained on the train split and evaluated on the val and test sets. In the original R2C work, task-specific “Grounding”, “Contextualization” and “Reasoning” modules are designed. Here we simply adopt the generic representation of VL-BERT for the task. Figure 2 (a) illustrates the input format, a (Question, Answer, Image) triplet. For the sub-task of Q → A, ‘Q’ and ‘A’ are filled into the Question section and the Answer section, respectively. For the sub-task of QA → R, the concatenation of ‘Q’ and ‘A’ is filled into the Question section, and ‘R’ is filled into the Answer section. The input RoIs to VL-BERT are the ground-truth annotations in the dataset. The final output feature of the [CLS] element is fed to a Softmax classifier that predicts whether the given Answer is the correct choice. During fine-tuning, we adopt two losses: the classification loss over the correctness of the Answer, and the masked RoI classification with linguistic clues. We also experimented with adding masked language modeling with visual clues during fine-tuning, but found that introducing this loss degrades the accuracy.
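The way the two sub-tasks reuse the same Question and Answer sections can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, tokens are plain strings, and selecting the candidate with the highest [CLS] correctness score at inference is an assumption rather than something stated in the text.

```python
def q_to_a_inputs(question, answers):
    """Q -> A: the Question section holds Q, the Answer section holds one candidate A."""
    return [(list(question), list(answer)) for answer in answers]


def qa_to_r_inputs(question, answer, rationales):
    """QA -> R: the Question section holds Q concatenated with A,
    the Answer section holds one candidate R."""
    return [(list(question) + list(answer), list(rationale)) for rationale in rationales]


def pick_candidate(scores):
    """Assumed inference rule: choose the candidate with the highest correctness score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy example (made-up tokens and scores).
question = ["why", "is", "person1", "smiling", "?"]
answers = [["she", "is", "happy"], ["she", "is", "cold"]]
rationales = [["she", "got", "a", "gift"], ["it", "is", "snowing"]]

qa_inputs = q_to_a_inputs(question, answers)
qar_inputs = qa_to_r_inputs(question, answers[0], rationales)
best = pick_candidate([0.91, 0.07])
```

Each resulting pair would be combined with the image RoIs to form one input triplet for the model.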

For VCR, the fine-tuning is conducted on 16 Tesla V100 GPUs for 20 epochs. In each mini-batch, 256 triplets of Question, Answer, Image are sampled. In SGD, the basic mini-batch gradient descent is conducted, with base learning rate of , momentum of 0.9, and weight decay of . The learning rate is linearly warmed up in the first 1,000 steps from an initial learning rate of 0, and is decayed by 0.1 at the 14-th and the 18-th epochs.
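The VCR learning-rate schedule described above (linear warm-up from 0 over the first 1,000 steps, then a 0.1 decay at the 14th and 18th epochs) can be sketched as a small function. The base learning rate is left as a parameter, `steps_per_epoch` is a hypothetical value, and treating the decay as taking effect once the zero-based epoch index reaches 14 and 18 is an assumption about the epoch counting.

```python
def vcr_learning_rate(step, base_lr, steps_per_epoch,
                      warmup_steps=1000, milestones=(14, 18), gamma=0.1):
    """Step-wise learning rate: linear warm-up, then multi-step decay.

    Assumptions: `step` counts optimizer updates from 0, and a milestone is
    applied as soon as the zero-based epoch index reaches it.
    """
    if step < warmup_steps:
        # Linear warm-up from an initial learning rate of 0.
        return base_lr * step / warmup_steps
    epoch = step // steps_per_epoch
    num_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** num_decays


# Placeholder values for illustration only (not the paper's settings).
BASE_LR = 1.0
STEPS_PER_EPOCH = 2000

lr_mid_warmup = vcr_learning_rate(500, BASE_LR, STEPS_PER_EPOCH)
lr_after_warmup = vcr_learning_rate(1000, BASE_LR, STEPS_PER_EPOCH)
lr_epoch_14 = vcr_learning_rate(14 * STEPS_PER_EPOCH, BASE_LR, STEPS_PER_EPOCH)
lr_epoch_18 = vcr_learning_rate(18 * STEPS_PER_EPOCH, BASE_LR, STEPS_PER_EPOCH)
```

With a base rate of 1.0 the returned value is simply the multiplier applied to whatever base learning rate is actually used.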

Table 2 presents the experimental results. Pre-training VL-BERT only slightly improves the performance (e.g., from 54.8 to 55.0 on the Q → AR val set). This might be because the pre-training task of image captioning is at the perceptual level, while the VCR task is at the cognitive understanding level. Compared with R2C, we do not use ad-hoc task-specific modules. Instead, we simply adopt the generic representation of VL-BERT and jointly train the whole model end-to-end. Despite using the same input, output and experimental protocol as R2C, VL-BERT outperforms R2C by large margins (e.g., 74.0 vs. 65.1 on the Q → A test set), indicating the power of our simple cross-modal architecture. Compared with the concurrent works ViLBERT, VisualBERT and B2T2, VL-BERT achieves the best results on Q → A and on the holistic Q → AR task, while B2T2 remains higher on QA → R.

4.2.2 Visual Question Answering (VQA)


Model | test-dev | test-std
BUTD (anderson2018bottom) | 65.32 | 65.67
ViLBERT (lu2019vilbert)† | 70.55 | 70.92
VisualBERT (li2019visualbert)† | 70.80 | 71.00
LXMERT (tan2019lxmert)† | 72.42 | 72.54
VL-BERT w/o pre-training | 69.58 | -
VL-BERT | 70.50 | 70.83


Table 3: Results compared to the state-of-the-art methods with single model on VQA dataset.
† indicates concurrent works.

In the VQA task, given a natural image, a question at the perceptual level is asked, and the algorithm should generate or choose the correct answer. Here we conduct experiments on the widely-used VQA v2.0 dataset (goyal2017making), which is built on COCO (lin2014microsoft) images. The VQA v2.0 dataset is split into train (83k images and 444k questions), validation (41k images and 214k questions), and test (81k images and 448k questions) sets. Following the experimental protocol in BUTD (anderson2018bottom), for each question, the algorithm should pick the corresponding answer from a shared set of 3,129 answers.

Figure 2 (b) illustrates the input format for the VQA task, which is a (Question, Answer, Image) triplet. As the possible answers come from a shared pool that is independent of the question, we only fill a [MASK] element into the Answer section. As in BUTD (anderson2018bottom), the input RoIs to VL-BERT are generated by a Faster R-CNN detector pre-trained on Visual Genome (krishna2017visual). The answer prediction is made by a multi-class classifier on top of the output feature of the [MASK] element. During fine-tuning, the network training is driven by the multi-class cross-entropy loss over the possible answers.
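As a rough illustration of the answer head, the sketch below computes a multi-class cross-entropy loss and an arg-max prediction from per-answer logits, which here stand in for the classifier output on the [MASK] feature. The function names and the three-word toy vocabulary are hypothetical (the actual shared pool has 3,129 answers), and a single hard target index per question is an assumption.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def answer_loss(logits, target_index):
    """Multi-class cross-entropy for one question (assumes one hard target)."""
    return -log_softmax(logits)[target_index]


def predict_answer(logits, answer_vocab):
    """Return the answer string with the highest logit."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return answer_vocab[best]


# Toy vocabulary standing in for the shared answer pool.
answer_vocab = ["yes", "no", "2"]
logits = [0.1, 2.0, -1.0]

loss = answer_loss(logits, target_index=1)
uniform_loss = answer_loss([0.0, 0.0, 0.0], target_index=0)
prediction = predict_answer(logits, answer_vocab)
```

Because the answer pool is shared across questions, the same classifier output layer serves every question, which is why a single [MASK] element in the Answer section suffices.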

For VQA, the fine-tuning is conducted on 16 Tesla V100 GPUs for 20 epochs. In each mini-batch, 256 triplets of Question, Answer, Image are sampled. In SGD, Adam optimizer is applied, with base learning rate of , , , weight decay of , learning rate warmed up over the first 2,000 steps, and linear decay of the learning rate.
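The warm-up followed by linear decay described above can be sketched as a function of the training step. The text does not state the shape of the warm-up or the end point of the decay, so this sketch assumes a linear warm-up from 0 and a linear decay that reaches 0 at the last step; `total_steps` and the base learning rate used below are placeholders.

```python
def warmup_linear_decay_lr(step, base_lr, total_steps, warmup_steps):
    """Learning rate at `step` for linear warm-up followed by linear decay.

    Assumptions: warm-up starts at 0, the peak equals `base_lr` at
    `warmup_steps`, and the rate decays linearly to 0 at `total_steps`.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


# Placeholder values for illustration only.
BASE = 1.0
TOTAL = 10000
WARMUP = 2000  # VQA warms up over 2,000 steps; RefCOCO+ uses 500.

lr_half_warm = warmup_linear_decay_lr(1000, BASE, TOTAL, WARMUP)
lr_peak = warmup_linear_decay_lr(2000, BASE, TOTAL, WARMUP)
lr_half_decay = warmup_linear_decay_lr(6000, BASE, TOTAL, WARMUP)
lr_end = warmup_linear_decay_lr(10000, BASE, TOTAL, WARMUP)
```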

Table 3 presents our experimental results. Pre-training VL-BERT improves the test-dev accuracy by nearly 1% (from 69.58 to 70.50), which validates the effectiveness of pre-training. VL-BERT shares the same input (i.e., question, image, and RoIs), output and experimental protocol with BUTD, a prevalent model specifically designed for the task. Still, VL-BERT surpasses BUTD by over 5% in accuracy. Except for LXMERT, VL-BERT achieves performance comparable to the other concurrent works, i.e., ViLBERT and VisualBERT. LXMERT achieves an accuracy about 2% higher than VL-BERT. This is because LXMERT is pre-trained on massive visual question answering data (aggregating almost all the VQA datasets based on COCO and Visual Genome), whereas our model is pre-trained only on a captioning dataset, which still leaves a gap to the VQA task.

4.2.3 Referring Expression Comprehension


Model | Ground-truth val | Ground-truth testA | Ground-truth testB | Detected val | Detected testA | Detected testB
MAttNet (yu2018mattnet) | 71.01 | 75.13 | 66.17 | 65.33 | 71.62 | 56.02
ViLBERT (lu2019vilbert)† | - | - | - | 72.34 | 78.52 | 62.61
VL-BERT w/o pre-training | 73.96 | 76.65 | 67.64 | 65.92 | 72.30 | 55.55
VL-BERT | 78.44 | 81.30 | 71.18 | 71.84 | 77.59 | 60.57


Table 4: Results compared to the state-of-the-art methods with single model on RefCOCO+ dataset.
† indicates concurrent works.

A referring expression is a natural language phrase that refers to an object in an image. The referring expression comprehension task is to localize the object in an image given the referring expression. We adopt the RefCOCO+ (kazemzadeh2014referitgame) dataset for evaluation, which consists of 141k expressions for 50k referred objects in 20k images from the COCO dataset (lin2014microsoft). The referring expressions in RefCOCO+ are forbidden from using absolute location words, e.g., “left dog”. Therefore, the referring expressions focus on purely appearance-based descriptions. RefCOCO+ is split into four sets: a training set (train), a validation set (val), and two testing sets (testA and testB). Images containing multiple people are in the testA set, while images containing multiple objects of other categories are in the testB set. There is no overlap between the training, validation and testing images.

Figure 2 (c) illustrates the input format for referring expression comprehension, which is a (Query, Image) pair. Model training and evaluation are conducted either on the ground-truth RoIs or on the detected boxes from MAttNet (yu2018mattnet), and the results are reported in the track of ground-truth regions or that of detected regions, respectively. During training, we compute the classification scores for all the input RoIs, and a binary classification loss is applied to each RoI. During inference, we directly choose the RoI with the highest classification score as the object referred to by the input expression.
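The per-RoI training loss and the arg-max inference rule can be sketched as below. The per-RoI logits stand in for the classifier outputs on the RoI elements; the function names are hypothetical, the 0/1 labels are taken as given, and averaging the loss over RoIs is an assumption not stated in the text.

```python
import math


def binary_cross_entropy_with_logit(logit, label):
    """Numerically stable binary cross-entropy for one RoI; label is 0 or 1."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def roi_classification_loss(logits, labels):
    """Binary classification loss applied to every input RoI.

    Assumption: the per-RoI losses are averaged over the RoIs of one image.
    """
    losses = [binary_cross_entropy_with_logit(x, y) for x, y in zip(logits, labels)]
    return sum(losses) / len(losses)


def pick_referred_roi(logits):
    """Inference: index of the RoI with the highest classification score."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy example with three RoIs, the second being the referred object.
roi_logits = [-1.2, 3.4, 0.5]
roi_labels = [0, 1, 0]

train_loss = roi_classification_loss(roi_logits, roi_labels)
referred = pick_referred_roi(roi_logits)
```

Since the sigmoid is monotonic, taking the arg-max over logits selects the same RoI as taking it over probabilities.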

For RefCOCO+, the fine-tuning is conducted on 16 Tesla V100 GPUs for 20 epochs. In each mini-batch, 256 pairs of Query, Image are sampled. In SGD, Adam optimizer is applied, with base learning rate of , , , weight decay of , learning rate warmed up over the first 500 steps, and linear decay of the learning rate.

Table 4 presents our experimental results. Pre-training VL-BERT significantly improves the performance on the validation set, by 4.5% using the ground-truth regions and by 5.9% using the detected regions. Compared with MAttNet, VL-BERT is much simpler, without task-specific architecture designs, yet much better, improving val accuracy by over 6% in both tracks. On detected regions, VL-BERT achieves performance comparable to that of the concurrent work ViLBERT.
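The margins quoted above can be re-derived directly from the Table 4 entries. The short sketch below only does that arithmetic; the dictionary layout and the helper name are illustrative choices, and the numbers are copied from the table.

```python
# Accuracy on (val, testA, testB), copied from Table 4.
table4 = {
    "MAttNet": {"gt": (71.01, 75.13, 66.17), "det": (65.33, 71.62, 56.02)},
    "VL-BERT w/o pre-training": {"gt": (73.96, 76.65, 67.64), "det": (65.92, 72.30, 55.55)},
    "VL-BERT": {"gt": (78.44, 81.30, 71.18), "det": (71.84, 77.59, 60.57)},
}


def gain(model, baseline, track):
    """Per-split accuracy difference (model minus baseline), rounded to 2 decimals."""
    pairs = zip(table4[model][track], table4[baseline][track])
    return tuple(round(a - b, 2) for a, b in pairs)


# Effect of pre-training on ground-truth and detected regions.
pretrain_gain_gt = gain("VL-BERT", "VL-BERT w/o pre-training", "gt")
pretrain_gain_det = gain("VL-BERT", "VL-BERT w/o pre-training", "det")

# Margin over MAttNet in both tracks.
mattnet_gain_gt = gain("VL-BERT", "MAttNet", "gt")
mattnet_gain_det = gain("VL-BERT", "MAttNet", "det")
```

The first element of each tuple is the val-set difference referred to in the text; the testA and testB differences are somewhat smaller.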