UNITER: Learning UNiversal Image-TExt Representations

by Yen-Chun Chen, et al.

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are jointly processed for visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design three pre-training tasks: Masked Language Modeling (MLM), Image-Text Matching (ITM), and Masked Region Modeling (MRM, with three variants). Different from concurrent work on multimodal pre-training that applies joint random masking to both modalities, we use conditioned masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). Comprehensive analysis shows that conditioned masking yields better performance than unconditioned masking. We also conduct a thorough ablation study to find an optimal setting for the combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR2.



1 Introduction

Most Vision-and-Language tasks rely on joint multimodal embeddings to bridge the semantic gap between visual and textual clues in images and text, although such representations are usually tailored for specific tasks. For example, MCB (Fukui et al., 2017) and BAN (Kim et al., 2018) proposed advanced multimodal fusion methods for Visual Question Answering (VQA) (Antol et al., 2015). SCAN (Lee et al., 2018) and MAttNet (Yu et al., 2018) studied learning latent alignment between words and image regions for Image-Text Retrieval (Wang et al., 2016) and Referring Expression Comprehension (Kazemzadeh et al., 2014) tasks. While each of these proposed models has pushed the state of the art on respective benchmarks, their architectures are diverse and the learned representations are highly task-specific, preventing them from being generalized to other tasks. This raises a million-dollar question: can we learn a universal image-text representation for all V+L tasks?

To answer this question, we introduce UNiversal Image-TExt Representations (UNITER), a large-scale pre-trained model for multimodal embedding. We adopt Transformer (Vaswani et al., 2017) as the core of our model, to leverage its elegant self-attention mechanism designed for learning contextualized representations. Inspired by BERT (Devlin et al., 2019), which has successfully applied Transformer to NLP tasks through large-scale language modeling, we pre-train UNITER through three pre-training tasks: (i) Masked Language Modeling (MLM) conditioned on image; (ii) Masked Region Modeling (MRM) conditioned on text; and (iii) joint Image-Text Matching (ITM). To further investigate the effectiveness of MRM, we propose three MRM variants: (i) Masked Region Classification (MRC); (ii) Masked Region Feature Regression (MRFR); and (iii) Masked Region Classification with KL-divergence (MRC-kl).

As shown in Figure 1, UNITER first encodes image regions (visual features and bounding box features) and textual words (tokens and positions) into a common embedding space with Image Embedder and Text Embedder, then applies a Transformer module to learn generalizable contextualized embeddings for each region and word through the aforementioned pre-training tasks. Compared with LXMERT (Tan and Bansal, 2019) and ViLBERT (Lu et al., 2019) that use two streams (one Transformer for each modality), our UNITER model can learn joint contextualized representations for image regions and textual words through a single Transformer. Besides, our masked language/region modeling is conditioned on full observation of image/text, different from other concurrent pre-trained models that apply joint random masking to both modalities. We show that the conditional masking strategy can successfully ease the misalignment between images and text, and obtain better joint embeddings for downstream tasks. A detailed ablation study also demonstrates that the combination of MLM+ITM+MRC-kl+MRFR yields the best pre-training performance.

To demonstrate the power of UNITER, we evaluate on six V+L tasks across nine datasets, including: (i) VQA; (ii) Visual Commonsense Reasoning (VCR) (Zellers et al., 2019); (iii) NLVR2 (Suhr et al., 2019); (iv) Visual Entailment (Xie et al., 2019); (v) Image-Text Retrieval (including zero-shot setting) (Lee et al., 2018); and (vi) Referring Expression Comprehension. Our UNITER model is trained on a large-scale V+L dataset composed of four subsets: (i) COCO (Lin et al., 2014); (ii) Visual Genome (VG) (Krishna et al., 2017); (iii) Conceptual Captions (CC) (Sharma et al., 2018); and (iv) SBU Captions (Ordonez et al., 2011). Experiments show that UNITER achieves new state of the art with significant performance boost across all six downstream tasks. Moreover, training on additional CC and SBU data (containing unseen images/text in downstream tasks) further boosts model performance over training on COCO and VG only.

Our contributions can be summarized as follows: (i) We introduce UNITER, a powerful UNiversal Image-TExt Representation for Vision-and-Language tasks. (ii) We achieve new state of the art (SOTA) on multiple V+L benchmarks, outperforming existing SOTA and concurrent multimodal pre-training methods by a large margin. (iii) We present extensive experiments and analysis to provide useful insights on the effectiveness of each pre-training task/dataset for multimodal encoder training.

Figure 1: Overview of the proposed UNITER model (best viewed in color), consisting of an Image Embedder, a Text Embedder and a multi-layer self-attention Transformer, learned through three pre-training tasks.

2 Related Work

Self-supervised learning utilizes original data as its own source of supervision, which has been applied to many Computer Vision tasks, such as image colorization (Zhang et al., 2016), solving jigsaw puzzles (Noroozi and Favaro, 2016; Trinh et al., 2019), inpainting (Pathak et al., 2016), rotation prediction (Gidaris et al., 2018), and relative location prediction (Doersch et al., 2015). Recently, pre-trained language models such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), GPT2 (Radford et al., 2019), and XLNet (Yang et al., 2019) have shown great advances for NLP tasks. There are two keys to their success: effective pre-training tasks over large language corpora, and the use of Transformer (Vaswani et al., 2017) for learning contextualized text representations.

More recently, there has been some concurrent work on self-supervised learning for multimodal tasks, by pre-training on large-scale image/video and text pairs, then finetuning on downstream tasks. For example, VideoBERT (Sun et al., 2019) applied BERT to learn a bidirectional joint distribution over quantized video frame features and linguistic tokens from video-text pairs. ViLBERT (Lu et al., 2019) and LXMERT (Tan and Bansal, 2019) introduced the two-stream architecture, where two Transformers are applied to images and text independently, and their outputs are fused by a third Transformer in a later stage. On the other hand, VisualBERT (Li et al., 2019b), Unicoder-VL (Li et al., 2019a), VL-BERT (Su et al., 2019) and B2T2 (Alberti et al., 2019) proposed the single-stream architecture, where a single Transformer is applied to both image and text. Specifically, the LXMERT model was pre-trained with downstream tasks such as VQA (Antol et al., 2015) and GQA (Hudson and Manning, 2019), while the others were pre-trained on image-text pairs only. Our UNITER model belongs to the second family. One key difference between UNITER and the other methods is the masking approach on pre-training tasks. Instead of randomly masking both image regions and sentence words, we use conditional masking, i.e., masking only one modality while keeping the other untainted. In addition, we examine the best combination of pre-training tasks through a thorough ablation study on the effects of each pre-training task and dataset on downstream tasks.

3 UNiversal Image-TExt Representations

In this section, we first introduce the model architecture of UNITER (Section 3.1), then describe the designed pre-training tasks and V+L datasets used for pre-training (Sections 3.2 and 3.3).

3.1 Model Overview

The model architecture of UNITER is illustrated in Figure 1. Given a pair of image and sentence, UNITER takes the visual regions of the image and textual tokens of the sentence as the input. We design an Image Embedder and a Text Embedder to extract their respective embeddings. These embeddings are then fed into a multi-layer self-attention Transformer to learn a cross-modality contextualized embedding between visual regions and textual tokens. Note that the self-attention mechanism in Transformer is order-less, thus it is necessary to explicitly encode positions/locations of tokens/regions as additional inputs.

Specifically, in Image Embedder, we first use Faster R-CNN (pre-trained on Visual Genome object+attribute data; Anderson et al., 2018) to extract the visual features (pooled ROI features) for each region. We also encode the location features for each region via a 7-dimensional vector (normalized top/left/bottom/right coordinates, width, height, and area). Both visual and location features are then fed through a fully-connected (FC) layer, to be projected into the same embedding space. The final visual embedding for each region is obtained by summing up the two FC outputs and then passing through a layer normalization (LN) layer. For Text Embedder, we follow BERT (Devlin et al., 2019) and tokenize the input sentence into WordPieces (Wu et al., 2016). The final representation for each sub-word token (we use word/sub-word and token interchangeably throughout the rest of the paper) is obtained via summing up its word embedding and position embedding, followed by another LN layer. We also use a special modality embedding to help the model distinguish between textual and visual input, which is similar to the 'segment embedding' in BERT. This embedding is also summed before the LN layer in each embedder. For simplicity, this modality embedding is omitted in Figure 1.
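As a rough illustration of the Image Embedder described above, the following sketch builds the 7-dimensional location vector and combines the two FC projections with a sum followed by LN. The function names, the parameter dictionary, and the coordinate order inside the vector are assumptions made for this example; the learned LN gain/bias and the modality embedding are left out for brevity.

```python
import math

def location_feature(box, img_w, img_h):
    """Return the 7-d location vector for one region.

    box is (left, top, right, bottom) in pixels; the coordinate order inside
    the vector is an assumption of this sketch.
    """
    x1, y1, x2, y2 = box
    nx1, ny1, nx2, ny2 = x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h
    w, h = nx2 - nx1, ny2 - ny1
    return [nx1, ny1, nx2, ny2, w, h, w * h]

def linear(x, weight, bias):
    """Fully-connected layer: weight is a list of rows, one per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def layer_norm(x, eps=1e-12):
    """Layer normalization without learned gain/bias (omitted for brevity)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def image_embedding(roi_feat, loc_feat, params):
    """Project both inputs to the hidden size, sum them, then apply LN."""
    vis = linear(roi_feat, params["w_vis"], params["b_vis"])
    loc = linear(loc_feat, params["w_loc"], params["b_loc"])
    return layer_norm([a + b for a, b in zip(vis, loc)])
```

The Text Embedder would follow the same sum-then-LN pattern, with word and position embeddings in place of the two FC outputs.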

We introduce three main tasks to pre-train our model: Masked Language Modeling conditioned on image regions (MLM), Masked Region Modeling conditioned on input text (with three variants) (MRM), and Image-Text Matching (ITM). As shown in Figure 1, our MRM and MLM are analogous to BERT, where we randomly mask some words or regions from the input and learn to recover the words or regions as the output of Transformer. Specifically, word masking is realized by replacing the token with a special token [MASK], and region masking is implemented by replacing the visual feature vector with all zeros. Note that each time we only mask one modality while keeping the other modality intact, instead of randomly masking both modalities like ViLBERT and LXMERT. This prevents potential misalignment when a masked region happens to be described by a masked word. Empirically, we show that with conditional masking, our model is able to learn better embeddings (in Section 4.2). Lastly, we also learn an instance-level alignment (rather than token/region-level) between the whole image and the sentence via ITM. During training, we sample both positive and negative image-sentence pairs and learn their matching scores.
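The conditional masking rule can be sketched as follows: in each call exactly one modality is masked (words become [MASK], region features become zeros) while the other is returned untouched. The function name, its signature, and the representation of regions as plain lists of floats are assumptions of this example rather than details from the paper.

```python
import random

MASK = "[MASK]"

def conditional_mask(tokens, regions, modality, prob=0.15, rng=random):
    """Mask only one modality; the other one is returned unchanged.

    tokens:   list of word tokens
    regions:  list of region feature vectors (lists of floats)
    modality: "text" to mask words, "image" to mask regions
    Returns (tokens, regions, masked_indices) as new objects.
    """
    tokens = list(tokens)
    regions = [list(r) for r in regions]
    masked = []
    if modality == "text":
        for i in range(len(tokens)):
            if rng.random() < prob:
                tokens[i] = MASK
                masked.append(i)
    elif modality == "image":
        for i in range(len(regions)):
            if rng.random() < prob:
                regions[i] = [0.0] * len(regions[i])  # zero out the visual feature
                masked.append(i)
    else:
        raise ValueError("modality must be 'text' or 'image'")
    return tokens, regions, masked
```

Because a masked word always sees every region (and vice versa), a word and the region it describes can never be hidden at the same time, which is the misalignment case the text warns about.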

3.2 Pre-training Tasks

Masked Language Modeling (MLM)

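We denote the image regions as $\mathbf{v} = \{\mathbf{v}_1, \dots, \mathbf{v}_K\}$, the input words as $\mathbf{w} = \{\mathbf{w}_1, \dots, \mathbf{w}_T\}$, and the mask indices as $\mathbf{m} \in \mathbb{N}^M$. In MLM, we randomly mask out the input words with a probability of 15%, and replace the masked ones $\mathbf{w}_{\mathbf{m}}$ with the special token [MASK] (following BERT, we decompose this 15% into 10% random word, 10% unchanged, and 80% [MASK]). The goal is to predict these masked words based on the observation of their surrounding words $\mathbf{w}_{\setminus \mathbf{m}}$ and all image regions $\mathbf{v}$, by minimizing the negative log-likelihood:

$$\mathcal{L}_{\text{MLM}}(\theta) = -\mathbb{E}_{(\mathbf{w}, \mathbf{v}) \sim D} \log P_\theta(\mathbf{w}_{\mathbf{m}} \mid \mathbf{w}_{\setminus \mathbf{m}}, \mathbf{v}),$$

where $\theta$ denotes the trainable parameters. Each pair $(\mathbf{w}, \mathbf{v})$ is sampled from the whole training set $D$.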
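A minimal sketch of the word-masking step and the resulting loss is given below. It selects each token with probability 15% and then applies the 80% [MASK] / 10% random word / 10% unchanged split. The helper names, the toy vocabulary argument, and averaging the negative log-likelihood over masked positions are assumptions of this example.

```python
import math
import random

def mlm_mask(tokens, vocab, mask_prob=0.15, rng=random):
    """Return (inputs, labels); labels[i] is None where no prediction is needed."""
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                      # token not selected for prediction
        labels[i] = tok                   # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"          # 80%: replace with the special token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab) # 10%: replace with a random word
        # remaining 10%: keep the token unchanged
    return inputs, labels

def mlm_loss(true_token_probs):
    """Negative log-likelihood of the original tokens, averaged over masked positions.

    true_token_probs holds, for each masked position, the probability the model
    assigned to the original token given the unmasked words and all regions.
    """
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
```

Only the text side is altered here; the image regions are passed to the model in full, as required by conditional masking.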

Image-Text Matching (ITM)

In ITM, an additional special token [CLS] is fed into our model, which indicates the fused representation of both modalities. We apply an FC layer on top of the Transformer output of [CLS], scoring how well the input image and the text are matched with each other. The scoring function is denoted as $s_\theta(\mathbf{w}, \mathbf{v})$. During training, we sample a positive or negative pair $(\mathbf{w}, \mathbf{v})$ from the dataset $D$ at each step. The negative pair is created by replacing the image or text in a paired sample with a randomly-selected one from other samples. We denote the label as $y \in \{0, 1\}$, indicating if the sampled pair is a match. Then we apply a binary cross-entropy loss for optimization:

$$\mathcal{L}_{\text{ITM}}(\theta) = -\mathbb{E}_{(\mathbf{w}, \mathbf{v}) \sim D} \left[ y \log s_\theta(\mathbf{w}, \mathbf{v}) + (1 - y) \log \left(1 - s_\theta(\mathbf{w}, \mathbf{v})\right) \right].$$
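The sampling and loss for ITM might be sketched as below. The 50/50 ratio between positive and negative pairs, the equal chance of swapping the image or the text, and the helper names are assumptions of this example; the score passed to the loss stands for the sigmoid output of the FC layer on [CLS].

```python
import math
import random

def sample_itm_pair(pairs, rng=random):
    """Draw one training pair from a list of matched (image, text) samples.

    Returns (image, text, label) with label 1 for a match and 0 for a negative
    built by swapping in the image or the text of a different sample.
    """
    idx = rng.randrange(len(pairs))
    image, text = pairs[idx]
    if rng.random() < 0.5:                # assumed positive/negative ratio
        return image, text, 1
    other = rng.choice([j for j in range(len(pairs)) if j != idx])
    if rng.random() < 0.5:
        image = pairs[other][0]           # replace the image
    else:
        text = pairs[other][1]            # replace the text
    return image, text, 0

def itm_loss(score, label, eps=1e-12):
    """Binary cross-entropy for a matching score in (0, 1)."""
    return -(label * math.log(score + eps) + (1 - label) * math.log(1 - score + eps))
```

Note that in datasets where one image has several captions, a randomly swapped text can still describe the image, so such negatives are only approximately negative.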


Masked Region Modeling (MRM)

Similar to MLM, we also sample image regions and mask their visual features with a probability of 15%. The model is trained to reconstruct the masked regions $\mathbf{v}_{\mathbf{m}}$ given the remaining regions $\mathbf{v}_{\setminus \mathbf{m}}$ and all the words $\mathbf{w}$. The visual features of the masked region are replaced by zeros. Unlike textual tokens that are represented as discrete labels, visual features are high-dimensional and continuous, thus cannot be supervised via class likelihood. Instead, we propose three variants for Masked Region Modeling, which share the same objective base:

$$\mathcal{L}_{\text{MRM}}(\theta) = \mathbb{E}_{(\mathbf{w}, \mathbf{v}) \sim D} \, f_\theta(\mathbf{v}_{\mathbf{m}} \mid \mathbf{v}_{\setminus \mathbf{m}}, \mathbf{w}).$$


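1) Masked Region Feature Regression (MRFR) MRFR learns to regress the Transformer output of each masked region $\mathbf{v}_{\mathbf{m}}^{(i)}$ to its visual features. Specifically, we apply an FC layer to convert its Transformer output into a vector $h_\theta(\mathbf{v}_{\mathbf{m}}^{(i)})$ of the same dimension as the input ROI pooled feature $r(\mathbf{v}_{\mathbf{m}}^{(i)})$. Then we apply L2 regression between the two: $f_\theta(\mathbf{v}_{\mathbf{m}} \mid \mathbf{v}_{\setminus \mathbf{m}}, \mathbf{w}) = \sum_{i=1}^{M} \| h_\theta(\mathbf{v}_{\mathbf{m}}^{(i)}) - r(\mathbf{v}_{\mathbf{m}}^{(i)}) \|_2^2$.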

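2) Masked Region Classification (MRC) MRC learns to predict the object semantic class for each masked region. We first feed the Transformer output of the masked region $\mathbf{v}_{\mathbf{m}}^{(i)}$ into an FC layer to predict the scores of $K$ object classes, which further go through a softmax function to be transformed into a normalized distribution $g_\theta(\mathbf{v}_{\mathbf{m}}^{(i)}) \in \mathbb{R}^K$. Note that there is no ground-truth label, as the object categories are not provided. Thus, we use the object detection output from Faster R-CNN, and take the detected object category (with the highest confidence score) as the label of the masked region, which will be converted into a one-hot vector $c(\mathbf{v}_{\mathbf{m}}^{(i)}) \in \mathbb{R}^K$. The final objective minimizes the cross-entropy (CE) loss: $f_\theta(\mathbf{v}_{\mathbf{m}} \mid \mathbf{v}_{\setminus \mathbf{m}}, \mathbf{w}) = \sum_{i=1}^{M} \mathrm{CE}\left(c(\mathbf{v}_{\mathbf{m}}^{(i)}), g_\theta(\mathbf{v}_{\mathbf{m}}^{(i)})\right)$.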

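3) Masked Region Classification with KL-Divergence (MRC-kl) MRC takes the most likely object class from the object detection model as the hard label (w.p. 0 or 1). We can also use its soft label as supervision signals, which is the raw output from the detector, i.e., a distribution of object classes $\tilde{c}(\mathbf{v}_{\mathbf{m}}^{(i)})$. MRC-kl aims to distill such knowledge into UNITER, by minimizing the KL divergence between two distributions: $f_\theta(\mathbf{v}_{\mathbf{m}} \mid \mathbf{v}_{\setminus \mathbf{m}}, \mathbf{w}) = \sum_{i=1}^{M} D_{\mathrm{KL}}\left(\tilde{c}(\mathbf{v}_{\mathbf{m}}^{(i)}) \,\|\, g_\theta(\mathbf{v}_{\mathbf{m}}^{(i)})\right)$.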
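The three MRM variants differ only in the per-region loss, which the following sketch makes concrete for a single masked region. The function names are hypothetical, `scores` stands for the FC-layer outputs over object classes, and `detector_probs` stands for the class distribution produced by the object detector.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mrfr_loss(pred_feat, roi_feat):
    """MRFR: squared L2 distance between the regressed and the original ROI feature."""
    return sum((p - r) ** 2 for p, r in zip(pred_feat, roi_feat))

def mrc_loss(scores, detector_probs):
    """MRC: cross-entropy against the detector's most confident class (hard label)."""
    pred = softmax(scores)
    label = max(range(len(detector_probs)), key=lambda k: detector_probs[k])
    return -math.log(pred[label])

def mrc_kl_loss(scores, detector_probs, eps=1e-12):
    """MRC-kl: KL(detector distribution || predicted distribution) (soft label)."""
    pred = softmax(scores)
    return sum(
        c * math.log((c + eps) / (q + eps))
        for c, q in zip(detector_probs, pred)
        if c > 0
    )
```

Summing any one of these over the masked regions gives the corresponding $f_\theta$ term of the MRM objective.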

3.3 Pre-training Datasets

We construct our pre-training dataset based on four existing V+L datasets: COCO (Lin et al., 2014), Visual Genome (VG) (Krishna et al., 2017), Conceptual Captions (CC) (Sharma et al., 2018), and SBU Captions (Ordonez et al., 2011). Only image and sentence pairs are used for our pre-training purpose, which makes the model framework more scalable, as additional image-sentence pairs are easy to harvest for further pre-training.

To study the effects of different datasets on pre-training, we divide the four datasets into two categories. The first one consists of image captioning data from COCO and dense captioning data from VG. We call it "In-domain" data, as most V+L tasks are built on top of these two datasets. To obtain a 'fair' data split, we merge the raw training and validation splits from COCO, and exclude all validation and test images that appear in downstream tasks. We also exclude all co-occurring Flickr30K (Plummer et al., 2015) images via URL matching, as both COCO and Flickr30K images were crawled from Flickr and may have overlaps (a total of 222 images were eliminated through this process). The same rule was applied to Visual Genome as well. In this way, we obtain 5.6M image-text pairs for training and 131K image-text pairs for our internal validation, which is half the size of the dataset used in LXMERT (Tan and Bansal, 2019), due to the filtering of overlapping images and the use of image-text pairs only. We also use additional Out-of-domain data from Conceptual Captions (Sharma et al., 2018) and SBU Captions (Ordonez et al., 2011) for model training (we apply the same URL matching method, excluding 109 images from the training set). The statistics on the cleaned splits are provided in Table 1.

        In-domain                             Out-of-domain
Split   COCO Captions   VG Dense Captions     Conceptual Captions   SBU Captions
train   533K (106K)     5.06M (101K)          3.0M (3.0M)           990K (990K)
val     25K (5K)        106K (2.1K)           14K (14K)             10K (10K)
Table 1: Statistics on datasets used for pre-training. Each cell shows #image-text pairs (#images).

    Task                   Datasets    Image Src.    #Images   #Text   Metric
1   VQA                    VQA         COCO          204K      1.1M    VQA-score
2   VCR                    VCR         Movie Clips   110K      290K    Accuracy
3   NLVR2                  NLVR2       Web Crawled   214K      107K    Accuracy
4   Visual Entailment      SNLI-VE     Flickr30K     31K       507K    Accuracy
5   Image-Text Retrieval   COCO        COCO          92K       460K    Recall@1,5,10
                           Flickr30K   Flickr30K     32K       160K
6   RE Comprehension       RefCOCO     COCO          20K       142K    Accuracy
                           RefCOCO+                  20K       142K
                           RefCOCOg                  26K       95K
Table 2: Statistics on the datasets of downstream tasks.

4 Experiments

We evaluate UNITER on six V+L tasks (listed in Table 2), by transferring the pre-trained model to each target task and finetuning through end-to-end training. We report experimental results on two model sizes: UNITER-base with 12 layers and UNITER-large with 24 layers (UNITER-base: L=12, H=768, A=12, Total Parameters=86M; UNITER-large: L=24, H=1024, A=16, Total Parameters=303M, where L is the number of stacked Transformer blocks, H the hidden activation dimension, and A the number of attention heads). Both models were pre-trained on V100 GPUs.
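As a sanity check on the reported model sizes, the parameter count of the Transformer stack alone can be estimated from L and H. The sketch below assumes a standard BERT-style block (four H×H attention projections, a feed-forward layer of width 4H, two layer norms, all with biases); this block layout is an assumption of the example, and embedders and task heads are not counted.

```python
def transformer_block_params(hidden):
    """Parameters of one BERT-style Transformer block with hidden size `hidden`."""
    attention = 4 * (hidden * hidden + hidden)           # Q, K, V and output projections
    ffn = (hidden * 4 * hidden + 4 * hidden) + (4 * hidden * hidden + hidden)
    layer_norms = 2 * 2 * hidden                         # gain and bias for two LNs
    return attention + ffn + layer_norms

def encoder_params(layers, hidden):
    """Parameters of a stack of `layers` identical blocks."""
    return layers * transformer_block_params(hidden)

base = encoder_params(12, 768)     # UNITER-base:  L=12, H=768
large = encoder_params(24, 1024)   # UNITER-large: L=24, H=1024
print(f"base  ~ {base / 1e6:.1f}M, large ~ {large / 1e6:.1f}M")
```

Under these assumptions the stack accounts for roughly 85M and 302M parameters, close to the 86M and 303M reported for the two models; the remainder would plausibly sit in the embedders and output layers.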

Pre-training Data          #    Pre-training Tasks                           Meta-Sum   VQA        IR (Flickr)   TR (Flickr)   NLVR2   RefCOCO+
                                                                                        test-dev   val           val           dev     val
None                       1    None                                         314.34     67.03      61.74         65.55         51.02   68.73
Wikipedia +                2    MLM (text only)                              346.24     69.39      73.92         83.27         50.86   68.80
In-domain (COCO+VG)        3    MRFR                                         344.66     69.02      72.10         82.91         52.16   68.47
                           4    ITM                                          385.29     70.04      78.93         89.91         74.08   72.33
                           5    MLM                                          386.10     71.29      77.88         89.25         74.79   72.89
                           6    MLM + ITM                                    393.04     71.55      81.64         91.12         75.98   72.75
                           7    MLM + ITM + MRC                              393.97     71.46      81.39         91.45         76.18   73.49
                           8    MLM + ITM + MRFR                             396.24     71.73      81.76         92.31         76.21   74.23
                           9    MLM + ITM + MRC-kl                           397.09     71.63      82.10         92.57         76.28   74.51
                           10   MLM + ITM + MRC-kl + MRFR                    399.97     71.92      83.73         92.87         76.93   74.52
                           11   MLM + ITM + MRC-kl + MRFR (w/o cond. mask)   396.51     71.68      82.31         92.08         76.15   74.29
Out-of-domain (CC+SBU)     12   MLM + ITM + MRC-kl + MRFR                    395.45     71.47      83.10         92.21         75.58   73.09
In-domain + Out-of-domain  13   MLM + ITM + MRC-kl + MRFR                    402.50     72.27      84.68         93.69         77.14   74.72
Table 3: Evaluation on pre-training tasks and datasets using VQA, Image-Text Retrieval on Flickr30K, NLVR2, and RefCOCO+ as benchmarks. All results are obtained from UNITER-base. Averages of R@1, R@5 and R@10 on Flickr30K for Image Retrieval (IR) and Text Retrieval (TR) are reported. Dark and light grey colors highlight the top and second best results across all the tasks trained with In-domain data.




Task             Split     SOTA    ViLBERT   VL-BERT   Unicoder-VL   VisualBERT   LXMERT   UNITER-base   UNITER-large
VQA              test-dev  70.63   70.55     70.50     -             70.80        72.42    72.27         73.24
                 test-std  70.90   70.92     70.83     -             71.00        72.54    72.46         73.40
VCR              Q→A       72.60   73.30     74.00     -             71.60        -        75.00         77.30
                 QA→R      75.70   74.60     74.80     -             73.20        -        77.20         80.80
                 Q→AR      55.00   54.80     55.50     -             52.40        -        58.20         62.80
NLVR2            dev       54.80   -         -         -             67.40        74.90    77.14         78.40
                 test-P    53.50   -         -         -             67.00        74.50    77.87         79.50
SNLI-VE          val       71.56   -         -         -             -            -        78.56         79.28
                 test      71.16   -         -         -             -            -        78.02         78.98
ZS IR (Flickr)   R@1       -       31.86     -         42.40         -            -        62.34         65.82
                 R@5       -       61.12     -         71.80         -            -        85.62         88.88
                 R@10      -       72.80     -         81.50         -            -        91.48         93.52
IR (Flickr)      R@1       48.60   58.20     -         68.30         -            -        71.50         73.66
                 R@5       77.70   84.90     -         90.30         -            -        91.16         93.06
                 R@10      85.20   91.52     -         94.60         -            -        95.20         95.98
IR (COCO)        R@1       38.60   -         -         44.50         -            -        48.42         51.72
                 R@5       69.30   -         -         74.40         -            -        76.68         78.41
                 R@10      80.40   -         -         84.00         -            -        85.90         86.93
ZS TR (Flickr)   R@1       -       -         -         61.60         -            -        75.10         77.50
                 R@5       -       -         -         84.80         -            -        93.70         96.30
                 R@10      -       -         -         90.10         -            -        95.50         98.50
TR (Flickr)      R@1       67.90   -         -         82.30         -            -        84.70         88.20
                 R@5       90.30   -         -         95.10         -            -        97.10         98.40
                 R@10      95.80   -         -         97.80         -            -        99.00         99.00
TR (COCO)        R@1       50.40   -         -         59.60         -            -        63.28         66.60
                 R@5       82.20   -         -         85.10         -            -        87.04         89.42
                 R@10      90.00   -         -         91.80         -            -        93.08         94.26
RefCOCO          val       87.51   -         -         -             -            -        91.64         91.84
                 testA     89.02   -         -         -             -            -        92.26         92.65
                 testB     87.05   -         -         -             -            -        90.46         91.19
                 val^d     77.48   -         -         -             -            -        81.24         81.41
                 testA^d   83.37   -         -         -             -            -        86.48         87.04
                 testB^d   70.32   -         -         -             -            -        73.94         74.17
RefCOCO+         val       75.38   -         78.44     -             -            -        82.84         84.04
                 testA     80.04   -         81.30     -             -            -        85.70         85.87
                 testB     69.30   -         71.18     -             -            -        78.11         78.89
                 val^d     68.19   72.34     71.84     -             -            -        74.72         74.94
                 testA^d   75.97   78.52     77.59     -             -            -        80.65         81.37
                 testB^d   57.52   62.61     60.57     -             -            -        65.15         65.35
RefCOCOg         val       81.76   -         -         -             -            -        86.52         87.85
                 test      81.75   -         -         -             -            -        86.52         87.73
                 val^d     68.22   -         -         -             -            -        74.31         74.86
                 test^d    69.46   -         -         -             -            -        74.51         75.77
Table 4: Results on downstream V+L tasks from UNITER model, compared with task-specific state-of-the-art (SOTA) and concurrent pre-trained models. ZS: Zero-Shot, IR: Image Retrieval and TR: Text Retrieval. Splits marked with ^d are evaluated on detected proposals.

4.1 Downstream Tasks

In VQA, VCR and NLVR2 tasks, given an input image (or a pair of images) and a natural language question (or description), the model predicts an answer (or judges the correctness of the description) based on the visual content in the image. For Visual Entailment, we evaluate on the SNLI-VE dataset. The goal is to predict whether a given image semantically entails an input sentence. Classification accuracy over three classes ("Entailment", "Neutral" and "Contradiction") is used to measure model performance. For Image-Text Retrieval, we consider two datasets (COCO and Flickr30K) and evaluate the model in two settings: Image Retrieval (IR) and Text Retrieval (TR). Referring Expression (RE) Comprehension requires the model to select the target from a set of image region proposals given the query description. Models are evaluated on both ground-truth objects and detected proposals (MAttNet (Yu et al., 2018)); the evaluation splits of RE Comprehension that use detected proposals are denoted as val^d, test^d, etc.


For VQA, VCR, NLVR2, Visual Entailment and Image-Text Retrieval, we extract the joint embedding of the input image-text pairs via a multi-layer perceptron (MLP) from the representation of the [CLS] token. For RE Comprehension, we use the MLP to compute the region-wise alignment scores. These MLP layers are learned during the finetuning stage. Specifically, we formulate VQA, VCR, NLVR2, Visual Entailment and RE Comprehension as classification problems and minimize the cross-entropy loss over the ground-truth answers/responses. For Image-Text Retrieval, we formulate it as a ranking problem. During finetuning, we sample three pairs of image and text, one positive pair from the dataset and two negative pairs by randomly replacing its sentence/image with others. We compute the similarity scores (based on the joint embedding) for both positive and negative pairs, and maximize the margin between them through triplet loss.
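The ranking objective for Image-Text Retrieval can be illustrated with a hinge-style margin loss over one positive and two negative similarity scores. The function name and the margin value of 0.2 are hypothetical choices for this sketch; the text above does not state the margin.

```python
def retrieval_ranking_loss(pos_score, neg_scores, margin=0.2):
    """Triplet-style hinge loss: each negative should score at least
    `margin` below the positive pair, otherwise it contributes a penalty."""
    return sum(max(0.0, margin - pos_score + neg) for neg in neg_scores)

# One positive pair and two negatives (replaced sentence, replaced image).
loss = retrieval_ranking_loss(pos_score=0.5, neg_scores=[0.6, 0.4])
```

The loss is zero once both negatives are separated from the positive by the margin, so well-ranked triplets stop contributing gradient.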

4.2 Evaluation on Pre-training Tasks

We analyze the effectiveness of different pre-training settings through ablation studies over VQA, NLVR2, Flickr30K and RefCOCO+ as representative V+L benchmarks. In addition to standard metrics for each benchmark (listed in Table 2), we also use Meta-Sum (sum of all the scores across all the benchmarks) as a global metric.
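Meta-Sum is plain addition over the per-benchmark scores, as the short sketch below shows for the L10 and L13 rows of Table 3. The dictionary keys are labels chosen for this example; the numbers are copied from the table.

```python
def meta_sum(scores):
    """Sum of all benchmark scores, rounded to two decimals as in Table 3."""
    return round(sum(scores.values()), 2)

row_10 = {  # MLM + ITM + MRC-kl + MRFR, In-domain data
    "VQA test-dev": 71.92,
    "IR (Flickr) val": 83.73,
    "TR (Flickr) val": 92.87,
    "NLVR2 dev": 76.93,
    "RefCOCO+ val": 74.52,
}
row_13 = {  # same tasks, In-domain + Out-of-domain data
    "VQA test-dev": 72.27,
    "IR (Flickr) val": 84.68,
    "TR (Flickr) val": 93.69,
    "NLVR2 dev": 77.14,
    "RefCOCO+ val": 74.72,
}
print(meta_sum(row_10), meta_sum(row_13))
```

Because the five scores are on the same 0-100 scale, a change in Meta-Sum can be read as the total number of points gained or lost across benchmarks.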

Firstly, we establish two baselines: Line 1 (L1) in Table 3 indicates no pre-training is involved, and L2 shows the results from MLM initialized with pre-trained weights from Devlin et al. (2019). Although MLM trained on text only did not absorb any image information during pre-training, we see a gain of approximately +30 on Meta-Sum over L1. Hence, we use the pre-trained weights in L2 to initialize our model for the following experiments.

Secondly, we validate the effectiveness of each pre-training task through a thorough ablation study. Comparing L2 and L3, MRFR (L3) achieves better results than MLM (L2) only on NLVR2. On the other hand, when pre-trained on ITM (L4) or MLM (L5) only, we observe a significant improvement across all the tasks over L1 and L2 baselines. When combining different pre-training tasks, MLM + ITM (L6) improves over single ITM (L4) or MLM (L5). When MLM, ITM and MRM are jointly trained (L7-L10), we observe consistent performance gain across all the benchmarks. Among the three variants of MRM (L7-L9), we observe that MRC-kl (L9) achieves the best performance (397.09) when combined with MLM + ITM, while MRC (L7) achieves the worst (393.97). When combining MRC-kl and MRFR together with MLM and ITM (L10), we find that they are complementary to each other, which leads to the highest Meta-Sum score. We use this as the optimal pre-training setting for further experiments.

Additionally, we validate the contributions of conditional masking through a comparison study. When we perform random masking on both modalities simultaneously during pre-training, i.e., w/o conditional masking (L11), we observe a decrease in Meta-Sum score (396.51) compared to that with conditional masking (399.97). This indicates that the conditional masking strategy enables the model to learn better joint image-text representations effectively.

Lastly, we study the effects of pre-training datasets. Our experiments so far have been focused on In-domain data. In this study, we pre-train our model on Out-of-domain data (Conceptual Captions + SBU Captions). A performance drop (395.45 in L12) from the model trained on In-domain data (COCO + Visual Genome) (399.97 in L10) shows that although Out-of-domain data contain more images, the model still benefits more from being exposed to similar downstream images during pre-training. We further pre-train our model on both In-domain and Out-of-domain data. With doubled data size, the model continues to improve (402.50 in L13).

4.3 Results on Downstream Tasks

Table 4 presents the results of UNITER on all downstream tasks. Both our base and large models are pre-trained on In-domain+Out-of-domain datasets, with the optimal pre-training setting: MLM+ITM+MRC-kl+MRFR. The implementation details of each task are provided in Appendix A.2. We compare with both task-specific models and concurrent pre-trained models on each downstream task. SOTA task-specific models include: MCAN (Yu et al., 2019) for VQA, MaxEnt (Suhr et al., 2019) for NLVR2, B2T2 (Alberti et al., 2019) for VCR, SCAN (Lee et al., 2018) for Image-Text Retrieval, EVE-Image (Xie et al., 2019) for SNLI-VE, and MAttNet for RE Comprehension (RefCOCO, RefCOCO+ and RefCOCOg). MAttNet results are updated using the same features as the others; more details are provided in the Appendix. Concurrent pre-trained models include: ViLBERT, LXMERT, Unicoder-VL, VisualBERT and VL-BERT.

Results show that our UNITER-large model achieves new state of the art across all the benchmarks. The UNITER-base model also outperforms the others by a large margin across all tasks except VQA. Specifically, our UNITER-base model outperforms SOTA for VCR on Q→AR, for NLVR2, for SNLI-VE, on R@1 for Image-Text Retrieval (including the zero-shot setting), and for RE Comprehension (see Table 4 for the individual margins).

Note that LXMERT pre-trains with downstream VQA (+VG+GQA) data, which may help adapt the model to the VQA task. However, when evaluated on unseen tasks such as NLVR2, UNITER-base achieves 3% gain over LXMERT. In addition, among all the models pre-trained on image-text pairs only, our UNITER-base outperforms the others on VQA (see Table 4).

It is also worth mentioning that both ViLBERT and LXMERT observed that the two-stream model outperforms the single-stream model, while our results show empirically that with our pre-training setting, a single-stream model can achieve new state-of-the-art results with far fewer parameters (UNITER-base: 86M, LXMERT: 183M, ViLBERT: 221M). The word embedding layer contains excessive rare words and is thus excluded from the parameter counts.

For VCR, we propose a two-stage pre-training approach: (i) pre-train on standard pre-training datasets; and then (ii) pre-train on the downstream VCR dataset. Interestingly, while VL-BERT and B2T2 observed that pre-training is not very helpful on VCR, we find that the second-stage pre-training can significantly boost model performance, while the first-stage pre-training still helps but with limited effects (results shown in Table 5). This indicates that the proposed two-stage approach is highly effective for adapting our pre-trained model to new data that are unseen in pre-training datasets.

Different from other tasks, NLVR2 takes two images as input. Thus, directly finetuning UNITER pre-trained with image-sentence pairs might not lead to optimal performance, as the interactions between paired images are not learned during the pre-training stage. Thus, we experimented with three modified settings on NLVR2: (i) Triplet: joint embedding of image pairs and query captions; (ii) Pair: individual embedding of each image and each query caption; and (iii) Pair-biattn: a bidirectional attention is added to the Pair model to learn the interactions between the paired images.

Comparison results are presented in Table 6. The Pair setting achieves better performance than the Triplet setting even without cross-attention between the image pairs. We hypothesize that this is because UNITER is pre-trained with image-text pairs, which makes it difficult to finetune a pair-based pre-trained model on triplet input. The bidirectional attention mechanism in the Pair-biattn setting, however, compensates for the lack of cross-attention between images, hence yielding the best performance by a large margin. This shows that with minimal surgery on the top layer of UNITER, our pre-trained model can adapt to new tasks that are very different from pre-training tasks.

Stage I   Stage II   Q→A     QA→R    Q→AR
N         N          72.44   73.71   53.52
N         Y          73.52   75.34   55.6
Y         N          72.83   75.25   54.94
Y         Y          74.56   77.03   57.76
Table 5: Experiments on two-stage pre-training for VCR. Results are from UNITER-base on VCR val split. Stage I and Stage II denote first-stage and second-stage pre-training.

Setting       dev     test-P
Triplet       72.76   73.55
Pair          75.37   75.97
Pair-biattn   77.14   77.87
Table 6: Experiments on three modified settings for NLVR2. All models use pre-trained UNITER-base.

5 Conclusion

In this paper, we present UNITER, a large-scale pre-trained model providing UNiversal Image-TExt Representations for Vision-and-Language tasks. Three main pre-training tasks are proposed and evaluated through extensive ablation studies. Trained with both in-domain and out-of-domain datasets, UNITER outperforms state-of-the-art models over multiple V+L tasks by a significant margin. Future work includes studying early interaction between raw image pixels and sentence tokens, as well as developing more effective pre-training tasks.


  • C. Alberti, J. Ling, M. Collins, and D. Reitter (2019) Fusion of detected objects in text for visual question answering. In EMNLP, Cited by: §2, §4.3.
  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, Cited by: §A.1, §A.2, footnote 1.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: Visual Question Answering. In ICCV, Cited by: §1, §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) Bert: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §1, §2, §3.1, §4.2.
  • C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In ICCV, Cited by: §2.
  • A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach (2017) Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, Cited by: §1.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In ICLR, Cited by: §2.
  • D. A. Hudson and C. D. Manning (2019) GQA: a new dataset for real-world visual reasoning and compositional question answering. In CVPR, Cited by: §2.
  • A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In CVPR, Cited by: §A.2.
  • S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg (2014) Referitgame: referring to objects in photographs of natural scenes. In EMNLP, Cited by: §1.
  • J. Kim, J. Jun, and B. Zhang (2018) Bilinear attention networks. In NeurIPS, Cited by: §1.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV. Cited by: §1, §3.3.
  • K. Lee, X. Chen, G. Hua, H. Hu, and X. He (2018) Stacked cross attention for image-text matching. In ECCV, Cited by: §A.2, §1, §1, §4.3.
  • G. Li, N. Duan, Y. Fang, D. Jiang, and M. Zhou (2019a) Unicoder-vl: a universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066. Cited by: §2.
  • L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019b) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: §1, §3.3.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In ICLR, Cited by: §A.2.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, Cited by: §1, §2.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, Cited by: §2.
  • V. Ordonez, G. Kulkarni, and T. L. Berg (2011) Im2text: describing images using 1 million captioned photographs. In NeurIPS, Cited by: §1, §3.3, §3.3.
  • M. Ott, S. Edunov, D. Grangier, and M. Auli (2018) Scaling neural machine translation. In WMT, Cited by: §A.2.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §A.2.
  • D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, Cited by: §2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In NAACL, Cited by: §2.
  • B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, Cited by: §3.3.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §2.
  • P. Sharma, N. Ding, S. Goodman, and R. Soricut (2018) Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, Cited by: §1, §3.3, §3.3.
  • W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2019) VL-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530. Cited by: §2.
  • A. Suhr, S. Zhou, I. Zhang, H. Bai, and Y. Artzi (2019) A corpus for reasoning about natural language grounded in photographs. ACL. Cited by: §1, §4.3.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) Videobert: a joint model for video and language representation learning. In ICCV, Cited by: §2.
  • H. Tan and M. Bansal (2019) LXMERT: learning cross-modality encoder representations from transformers. In EMNLP, Cited by: §1, §2, §3.3.
  • T. H. Trinh, M. Luong, and Q. V. Le (2019) Selfie: self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §A.2, §1, §2.
  • L. Wang, Y. Li, and S. Lazebnik (2016) Learning deep structure-preserving image-text embeddings. In CVPR, Cited by: §1.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §3.1.
  • N. Xie, F. Lai, D. Doran, and A. Kadav (2019) Visual entailment: a novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706. Cited by: §1, §4.3.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §2.
  • L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg (2018) MAttNet: modular attention network for referring expression comprehension. In CVPR, Cited by: §1, §4.1.
  • Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian (2019) Deep modular co-attention networks for visual question answering. In CVPR, Cited by: §A.2, §4.3.
  • R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi (2019) From recognition to cognition: visual commonsense reasoning. In CVPR, Cited by: §1.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In ECCV, Cited by: §2.

Appendix A

A.1 Dataset Collection

As introduced, our full dataset is composed of four existing V+L datasets: COCO, Visual Genome, Conceptual Captions, and SBU Captions. Building the dataset is not simply a matter of combining them, as we need to make sure none of the downstream evaluation images are seen during pre-training. Among them, COCO is the trickiest to clean, as several downstream tasks are built on it. Figure 2 lists the splits from VQA, Image-Text Retrieval, COCO Captioning, RefCOCO/RefCOCO+/RefCOCOg, and the bottom-up top-down (BUTD) detection (Anderson et al., 2018), all from COCO images.

As observed, the validation and test splits of different tasks are scattered across the raw COCO splits. Therefore, we exclude all those evaluation images that appeared in the downstream tasks. In addition, we also exclude all co-occurring Flickr30K images via URL matching, making sure the zero-shot image-text retrieval evaluation on Flickr is fair. The remaining images become the COCO subset within our full dataset, as shown in Figure 2 bottom row. We apply the same rules to Visual Genome, Conceptual Captions, and SBU Captions.
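The exclusion rule above amounts to a set difference over image identifiers and URLs. The following is a minimal sketch of that idea, not the pipeline actually used; the function name `filter_pretraining_images`, the id format, and the URLs are all hypothetical.

```python
def filter_pretraining_images(candidates, eval_image_ids, flickr30k_urls):
    """Keep only images that are safe to use for pre-training.

    candidates: dict mapping image id -> source URL (hypothetical format).
    eval_image_ids: ids appearing in any downstream val/test split.
    flickr30k_urls: URLs of Flickr30K images, used for URL matching.
    """
    eval_image_ids = set(eval_image_ids)
    flickr30k_urls = set(flickr30k_urls)
    return {
        image_id: url
        for image_id, url in candidates.items()
        if image_id not in eval_image_ids and url not in flickr30k_urls
    }


# Toy data; every id and URL below is made up for illustration.
candidates = {
    "coco_1": "http://example.com/a.jpg",
    "coco_2": "http://example.com/b.jpg",
    "coco_3": "http://example.com/c.jpg",
}
kept = filter_pretraining_images(
    candidates,
    eval_image_ids=["coco_2"],
    flickr30k_urls=["http://example.com/c.jpg"],
)
```

Here `coco_2` is dropped because it appears in a downstream evaluation split, and `coco_3` is dropped because its URL matches a Flickr30K image.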

Figure 2: Different data splits from downstream tasks based on COCO images. Our UNITER pre-training avoids seeing any downstream evaluation images.

A.2 Implementation Details

Our models are implemented based on PyTorch (https://pytorch.org/) (Paszke et al., 2017). To speed up training, we use Nvidia Apex (https://github.com/NVIDIA/apex) for mixed precision training. All pre-training experiments are run on Nvidia V100 GPUs (16GB VRAM; PCIe connection). Finetuning experiments are implemented on the same hardware or Titan RTX GPUs (24GB VRAM). To further speed up training, we implement dynamic sequence length to reduce padding and batch examples by number of input units (text tokens + image regions). For large pre-training experiments, we use Horovod (https://github.com/horovod/horovod) + NCCL (https://github.com/NVIDIA/nccl) for multi-node communications (on TCP connections through ethernet) with up to 4 nodes of 4x V100 servers. Gradient accumulation (Ott et al., 2018) is also applied to reduce multi-GPU communication overheads.
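Batching by number of input units can be illustrated with a short sketch. This is one plausible reading rather than the actual implementation: it assumes that examples are sorted by length to limit padding and that the budget applies to the padded size of a batch (longest example times batch size). The function name `batch_by_input_units` is hypothetical.

```python
def batch_by_input_units(examples, max_units):
    """Group examples so that each padded batch stays within max_units.

    examples: list of (num_text_tokens, num_image_regions) tuples.
    Returns a list of batches, each a list of indices into `examples`.
    """
    # Sort by total length so examples in one batch need little padding.
    order = sorted(range(len(examples)), key=lambda i: sum(examples[i]))
    batches, current, longest = [], [], 0
    for i in order:
        length = sum(examples[i])
        new_longest = max(longest, length)
        # Padded cost of the batch if this example were added (assumption).
        if current and new_longest * (len(current) + 1) > max_units:
            batches.append(current)
            current, new_longest = [], length
        current.append(i)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


# (text tokens, image regions) per example; toy numbers.
examples = [(12, 36), (8, 20), (30, 70), (10, 22), (25, 60)]
batches = batch_by_input_units(examples, max_units=200)
```

With this scheme, short examples are packed into larger batches and long examples into smaller ones, so the amount of computation per step stays roughly constant.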

Visual Question Answering (VQA)

We follow Yu et al. (2019) to take most frequent answers as answer candidates, and assign a soft target score to each candidate based on its relevancy to the human responses. To finetune on the VQA dataset, we use a binary cross-entropy loss to train a multi-label classifier using batch size of input units over maximum K steps. We use AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of and weight decay of . At inference time, the max-probable answer is selected as the predicted answer. For results on test-dev and test-std splits, both training and validation sets are used for training, and additional question-answer pairs from Visual Genome are used for data augmentation as in Yu et al. (2019).
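To make the multi-label objective concrete, the sketch below computes soft targets and the binary cross-entropy loss for a single question. The scoring rule min(1, votes / 3) is a convention commonly used for VQA and is assumed here; the text does not state the exact rule. The answer vocabulary, human answers, and logits are toy values.

```python
import math
from collections import Counter


def soft_targets(human_answers, candidates):
    """Soft score per candidate answer: min(1, votes / 3) (assumed rule)."""
    votes = Counter(human_answers)
    return [min(1.0, votes[c] / 3.0) for c in candidates]


def binary_cross_entropy(logits, targets):
    """Mean binary cross-entropy over candidates, with soft targets."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(logits)


candidates = ["yes", "no", "2", "red"]            # toy answer vocabulary
human_answers = ["yes"] * 7 + ["no"] * 2 + ["maybe"]
targets = soft_targets(human_answers, candidates)
logits = [2.0, -1.0, -3.0, -3.0]                   # made-up classifier outputs
loss = binary_cross_entropy(logits, targets)

# Inference: pick the candidate with the highest score.
predicted = candidates[max(range(len(logits)), key=logits.__getitem__)]
```

Because each candidate is scored independently, several answers can receive non-zero targets for the same question, which is why a binary rather than a softmax cross-entropy is used.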

Visual Commonsense Reasoning (VCR)

VCR can be decomposed into two multiple-choice sub-tasks: a question-answering task (Q→A) and an answer-justification task (QA→R). In the holistic setting (Q→AR), a model needs to first choose an answer from the answer choices, then select a supporting rationale from rationale choices if the chosen answer is correct. We train our model in the two settings simultaneously. When testing in the holistic setting, we first apply the model to predict an answer, then obtain the rationale from the same model based on the given question and the predicted answer. To finetune on the VCR dataset, we concatenate the question (the question and the ground-truth answer) and each answer (rationale) choice from the four possible answer (rationale) candidates. The ‘modality embedding’ is extended to help distinguish question, answer and rationale. Cross-entropy loss is used to train a classifier over two classes (“right” or “wrong”) for each question-answer pair (question-answer-rationale triplet) with a batch size of 4096 input units over maximum K steps. We use AdamW optimizer with a learning rate of and weight decay of .
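The holistic test procedure can be sketched as a two-step selection. In the illustration below, `rationale_scores_given` is a hypothetical hook standing in for a second forward pass of the same model, conditioned on the predicted answer; the scores are made-up values, not model outputs.

```python
def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def holistic_correct(answer_scores, rationale_scores_given,
                     gold_answer, gold_rationale):
    """Holistic check for one example.

    answer_scores: "right" scores for the four answer choices.
    rationale_scores_given: function mapping the chosen answer index to
        the "right" scores of the four rationale choices (hypothetical).
    """
    pred_answer = argmax(answer_scores)
    pred_rationale = argmax(rationale_scores_given(pred_answer))
    # The example counts as correct only if both predictions are right.
    return pred_answer == gold_answer and pred_rationale == gold_rationale


def fake_rationale_scores(answer_idx):
    """Toy stand-in for the model's rationale scores."""
    table = {0: [0.1, 0.2, 0.6, 0.1], 1: [0.7, 0.1, 0.1, 0.1]}
    return table.get(answer_idx, [0.25] * 4)


answer_scores = [0.1, 0.8, 0.05, 0.05]
both_right = holistic_correct(answer_scores, fake_rationale_scores,
                              gold_answer=1, gold_rationale=0)
wrong_answer = holistic_correct(answer_scores, fake_rationale_scores,
                                gold_answer=0, gold_rationale=0)
```

A wrong answer makes the whole example incorrect regardless of the rationale, which explains why holistic accuracy is lower than the accuracy of either sub-task.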

Since the images and text in the VCR dataset are very different from our pre-training dataset, we further pre-train our model on VCR, using MLM, MRFR and MRC-kl as the pre-training tasks. ITM is discarded because the text in VCR does not explicitly describe the image. The results of both pre-training stages on VCR are reported in Table 5 and discussed in the main text. In conclusion, for downstream tasks that contain new data which is very different from the pre-training datasets, second-stage pre-training helps further boost the performance.

In our implementation, the second-stage pre-training is implemented with a batch size of 4096 input units, a learning rate of and a weight decay of over maximum K steps. After second-stage pre-training, we finetune our model with a learning rate of over maximum K steps.

Natural Language for Visual Reasoning for Real (NLVR)

NLVR is a new challenging task for visual reasoning. The goal is to determine whether a natural language statement is true about the given image pair. Here we discuss the three architecture variants of NLVR finetuning in detail. Since UNITER only handles one image and one text input at pre-training, the ‘modality embedding’ is extended to help distinguish the additional image presented in the NLVR task. For the Triplet setup, we concatenate the image regions and then feed them into the UNITER model. An MLP transform is applied on the [CLS] output for binary classification. For the Pair setup, we treat one input example as two text-image pairs by repeating the text. The two [CLS] outputs from UNITER are then depth concatenated as the joint embedding for the example. Another MLP further transforms this embedding for the final classification. For the Pair-biattn setup, the input format is the same as the Pair setup. As for the joint representation, instead of relying on only the two [CLS] outputs, we apply a multi-head attention layer (Vaswani et al., 2017) on one sequence of joint image-text embeddings to attend to the other sequence of embeddings, and vice versa. After these ‘bidirectional’ attention interactions, a simple attentional pooling is applied on each output sequence, and then a final concat+MLP layer transforms the cross-attended joint representation for true/false classification.
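The data flow of the Pair-biattn head can be illustrated with a small NumPy sketch. Several simplifications are assumed: a single attention head without learned projections replaces the multi-head layer, one linear layer stands in for the final MLP, and random arrays stand in for the UNITER output sequences and for all parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(query_seq, context_seq):
    """Single-head scaled dot-product attention from one sequence to another."""
    d = query_seq.shape[-1]
    weights = softmax(query_seq @ context_seq.T / np.sqrt(d))
    return weights @ context_seq


def attention_pool(seq, w):
    """Pool a sequence (L, d) into one vector (d,) using weights w (d,)."""
    alpha = softmax(seq @ w, axis=0)
    return alpha @ seq


rng = np.random.default_rng(0)
d = 8
seq_a = rng.normal(size=(5, d))        # stand-in for outputs of (image 1, text)
seq_b = rng.normal(size=(7, d))        # stand-in for outputs of (image 2, text)
w_pool = rng.normal(size=d)            # pooling parameters (would be learned)
w_out = rng.normal(size=(2 * d, 2))    # linear layer standing in for the MLP

a_attends_b = cross_attend(seq_a, seq_b)   # shape (5, d)
b_attends_a = cross_attend(seq_b, seq_a)   # shape (7, d)
joint = np.concatenate([attention_pool(a_attends_b, w_pool),
                        attention_pool(b_attends_a, w_pool)])
logits = joint @ w_out                      # true/false scores, shape (2,)
```

The point of the sketch is the order of operations: attention in both directions, pooling of each attended sequence, then concatenation and classification. Only these top layers are new; the pre-trained encoder itself is used unchanged.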

We finetune UNITER on NLVR for 8K steps with a batch size of 10K input units. AdamW optimizer is used with learning rate of and weight decay of .

Image-Text Retrieval

Two datasets are considered for this task: COCO and Flickr30K. COCO consists of K images, each accompanied by five human-written captions. We follow Karpathy and Fei-Fei (2015) to split the data into 82K/5K/5K training/validation/test images. An additional 30K images from the MSCOCO validation set are also included to improve training, as in Lee et al. (2018). The Flickr30K dataset contains 31K images collected from the Flickr website, with five textual descriptions per image. We follow Karpathy and Fei-Fei (2015) to split the data into 30K/1K/1K training/validation/test splits. During finetuning, we sample two negative image-text pairs per positive sample from image and text sides, respectively. For COCO, we use batch size of 60 examples, learning rate of and finetune our model for K steps. For Flickr30K, we finetune our model with a batch size of 120 examples and a learning rate of over maximum K steps.

To obtain the final results in Table 4, we further sample hard negatives to facilitate the finetuning. At a fixed interval of training steps, we randomly sample 128 negative images per text input and obtain a sparse scoring matrix for the whole training set. For each image, we choose the top 20 ranked negative sentences as hard negative samples. Similarly, we get 20 hard negative images for each sentence according to their scores. The hard negatives are sent to the model as additional negative samples. In the end, we have two randomly sampled negatives and two hard negative samples per positive sample. The interval is set to 4000 steps for COCO and 2500 steps for Flickr30K.
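The hard negative mining step can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: `score` is a hypothetical hook for the matching score of the current model, the sparse scoring matrix is stored as a dictionary, and the toy data uses much smaller sample sizes than the 128 and 20 described above.

```python
import random


def mine_hard_negatives(texts, images, positives, score,
                        num_sampled=128, top_k=20, seed=0):
    """Return hard negative images per text and hard negative texts per image.

    positives: set of (text, image) ground-truth pairs.
    score: function (text, image) -> matching score (hypothetical hook).
    """
    rng = random.Random(seed)
    sparse_scores = {}  # (text, image) -> score, only for sampled pairs
    for t in texts:
        pool = [im for im in images if (t, im) not in positives]
        for im in rng.sample(pool, min(num_sampled, len(pool))):
            sparse_scores[(t, im)] = score(t, im)

    by_text, by_image = {}, {}
    for (t, im), s in sparse_scores.items():
        by_text.setdefault(t, []).append((s, im))
        by_image.setdefault(im, []).append((s, t))

    def top(pairs):
        ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
        return [item for _, item in ranked[:top_k]]

    hard_images = {t: top(pairs) for t, pairs in by_text.items()}
    hard_texts = {im: top(pairs) for im, pairs in by_image.items()}
    return hard_images, hard_texts


# Toy setup: text "t<k>" matches image "i<k>"; the score function is made up.
texts = ["t0", "t1", "t2", "t3"]
images = ["i0", "i1", "i2", "i3"]
positives = {(t, "i" + t[1:]) for t in texts}


def toy_score(t, im):
    return 10 * int(im[1:]) + int(t[1:])


hard_images, hard_texts = mine_hard_negatives(
    texts, images, positives, toy_score, num_sampled=3, top_k=1)
```

Scoring only a random subset of negatives per text keeps the cost far below that of scoring every image-text pair in the training set, at the price of a sparse scoring matrix.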

Visual Entailment (SNLI-VE)

Visual Entailment is a task derived from Flickr30K images and Stanford Natural Language Inference (SNLI) dataset, where the goal is to determine the logical relationship between a natural language statement and an image. Similar to BERT for Natural Language Inference (NLI), we treat SNLI-VE as a three-way classification problem and apply an MLP Transform on [CLS] output. The UNITER model is finetuned using cross-entropy loss. The batch size is set to 10K input units and we use AdamW with learning rate of to train for 3K steps.

Referring Expression Comprehension

We use three referring expression datasets: RefCOCO, RefCOCO+, and RefCOCOg for the evaluation, all collected on COCO images. To finetune UNITER on this task, we add an MLP layer on top of the region outputs from Transformer, to compute the alignment score between the query phrase/sentence and each region. Since only one object is paired with the query phrase/sentence, we apply cross-entropy loss on the normalized alignment scores. The finetuning is efficient: we train the model with a batch size of 64 examples and a learning rate of for only 5 epochs, and achieve state-of-the-art performance.
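The loss over region alignment scores can be written out in a few lines. The sketch assumes that the normalization is a softmax over the regions of one image; the scores are made-up values standing in for the MLP outputs.

```python
import math


def region_cross_entropy(alignment_scores, target_region):
    """Softmax-normalize per-region scores; return (loss, probabilities)."""
    m = max(alignment_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in alignment_scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target_region]), probs


scores = [0.5, 2.0, -1.0]   # made-up alignment scores for three regions
loss, probs = region_cross_entropy(scores, target_region=1)

# At inference time, the region with the highest score is the prediction.
predicted_region = max(range(len(probs)), key=probs.__getitem__)
```

Treating the regions as mutually exclusive classes matches the observation that exactly one object is paired with each query.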

Note that all works, including ours, use off-the-shelf object detectors trained on COCO (and Visual Genome) to extract the visual features. While this does not affect other downstream tasks, it raises an issue for RE comprehension, as the val/test images of RefCOCO, RefCOCO+, and RefCOCOg are a subset of COCO’s training split. Strictly speaking, our object detector should not be trained with these val/test images. However, for a “fair” comparison with concurrent works, we ignore this issue and use the same features (Anderson et al., 2018) as the others. We also update the results of MAttNet using these “contaminated” features, whose accuracy is 1.5% higher than the original one. As aforementioned, the interaction between sentence and image could start from tokens and pixels instead of the extracted features. We leave this study and RE comprehension with strictly correct features to future work.