
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

by   Zhicheng Huang, et al.

We study the joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT), which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics of the paired natural language. In this paper, we propose SOHO to "See Out of tHe bOx": it takes a whole image as input and learns vision-language representation in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task Masked Visual Modeling (MVM). We conduct experiments on four well-established vision-language tasks by following standard VLPT settings. In particular, SOHO achieves absolute gains of 2.0% R@1 score on MSCOCO text retrieval 5k test split, 1.5% accuracy on NLVR test-P split, and 6.7% accuracy on SNLI-VE test split, respectively.





1 Introduction

Figure 1: Comparisons of SOHO and region-based methods by top-1 image-to-text retrieval (TR) and visual question answering (VQA) results. Baselines lack global context and fail to understand the image. SOHO discovers visual clues out of region boxes and infers correct human activities. [Best viewed in color.]

With the success of Transformer and self-supervised learning, we have recently witnessed a growing number of research works on cross-modal learning, especially on vision-language pre-training (VLPT) [chen2020uniter, li2019unicoder, li2019visualbert, lu2019vilbert, su2019vl, tan2019lxmert, zhou2019unified]. VLPT models learn better cross-modal representations from large-scale, easily accessible image-text pairs. They have established state-of-the-art results in many vision-language tasks, such as visual question answering (VQA) [Antol_2015_ICCV], image-text retrieval [lin2014microsoft], natural language for visual reasoning (NLVR) [suhr2019corpus], etc. Visual representation plays an important role in VLPT models. The recent success of VLPT models has been accompanied by the usage of region-based image features, which are extracted by object detectors pre-trained on the Visual Genome dataset [Anderson_2018_CVPR]. However, there are three challenges in directly utilizing region-based image features for vision-language understanding. Firstly, regions focus on the objects inside bounding boxes while neglecting the contextual information outside the boxes, which is important for relation understanding and reasoning. For example, in Figure 1 we can easily detect "man", "woman" and "boat" in the image. However, without the contextual information outside these boxes, a model will misunderstand the relation as "people boating" and produce an incorrect answer for either text retrieval or VQA. Secondly, visual understanding of images is limited to the pre-defined region categories. Thirdly, most region-based image features are extracted by a detection model, which suffers from low quality, noise, and over-sampling [Anderson_2018_CVPR] and relies on large-scale box annotation data. Although some works try to train detection models with weak supervision [2017Multiple, zeng2019wsod2], their performance is far below the requirements. Recently, some works have shown that grid-based convolutional features are also effective for learning visual representations [desai2021virtex, huang2020pixel, jiang2020defense, sariyildiz2020icmlm]. Among them, Jiang et al. show that grid features can be equally effective as region features for VQA [jiang2020defense]. Sariyildiz et al. and Desai et al. use image-text data to train visual backbones for recognition tasks (e.g., image classification) [desai2021virtex, sariyildiz2020icmlm]. These models are designed either for a specific vision-language task [jiang2020defense] or for vision tasks [desai2021virtex, sariyildiz2020icmlm]. In this paper, we focus on VLPT and propose one of the first end-to-end VLPT models that does not rely on region features.

To overcome the limitations of region-based image features and better utilize image-text pairs for cross-modal understanding, we propose SOHO, an end-to-end vision-language pre-training framework that directly learns image embedding, language embedding, and their semantic alignment from image-text pairs. Compared with existing VLPT works, SOHO adopts a simple pipeline that does not require a complicated visual backbone for pre-training, reducing the design effort for VLPT tasks. Without the requirement of laboriously annotated categories or boxes, SOHO can enrich visual semantics by directly optimizing visual representations with a wider range of image-text data.

End-to-end learning for vision and language is challenging due to the different representations of the two modalities. Visual representation at the pixel level is much more diverse and dense than language embedding, and the lack of explicit pixel-level supervision from language adds difficulty to alignment learning. To tackle these problems, we introduce a visual dictionary (VD) that represents more comprehensive and compact semantics in the visual domain. To learn the visual dictionary, we design a moving-averaged encoder to group visual pixels with similar visual semantics. The VD can be dynamically updated through our trainable CNN backbone directly from vision-language data during pre-training. For pre-training tasks, we propose a novel Masked Visual Modeling (MVM) task based on the learned visual dictionary, besides two commonly used tasks, Masked Language Modeling (MLM) and Image-Text Matching (ITM).

Our contributions are summarized as follows: (i) We propose SOHO, one of the first end-to-end VLPT models to learn cross-modal representations directly from image-text pairs. Without the need to extract bounding boxes, our model achieves at least a 10 times speedup at inference. (ii) To better align visual features and language tokens, we propose a novel dynamically updated visual dictionary that represents a visual abstraction of similar semantics in images. (iii) We conduct extensive experiments on four well-established downstream tasks. Experimental results show that SOHO improves the SOTA performance with absolute gains of 2.0% R@1 score on MSCOCO text retrieval 5k test split, 1.5% accuracy on NLVR test-P split, 6.7% accuracy on SNLI-VE test split, and 0.56% VQA score on VQA2.0 test-std split. We will release both the model and the code to facilitate the research community.

2 Related Work

2.1 Visual Representation for Vision-Language

Visual representation learning for vision-language understanding is a long-standing research topic. Early works use CNN classification models pre-trained on ImageNet to extract visual features [deng2009imagenet, li2018tell, liu2018beyond, NIPS2016_9dcb88e0, yang2016stacked, yu2017multi]. Later on, Anderson et al. propose a Bottom-Up and Top-Down Attention (BUTD) detection model [Anderson_2018_CVPR] pre-trained on the Visual Genome dataset to extract salient region features as visual inputs for VQA and image captioning tasks. The BUTD features are adopted by many vision-language works [Anderson_2018_CVPR, krishna2017visual, singh2018pythia, suhr2019corpus] and pre-training works [chen2020uniter, kim2018bilinear, tan2019lxmert]. Recently, some works propose to directly learn visual representations in the form of grid features with convolutional networks for specific vision-language tasks [jiang2020defense] or vision recognition tasks [desai2021virtex, sariyildiz2020icmlm]. Our work shares a similar format of visual representation with [jiang2020defense], while we focus on vision-language pre-training and propose the first end-to-end VLPT model that does not rely on box annotations.

VideoBERT [sun2019videobert] and the bag-of-words [fei2005bayesian] literature also use vector quantization to represent visual information. The key difference between our VD and related works is that we dynamically update the VD-based embedding with the output of a trainable visual encoder, instead of using pre-computed input features. The dynamic updating mechanism enables the VD to capture text-guided semantics from the vision-language dataset, so the model can be directly optimized with high-level semantics for VL understanding and alignment.

Figure 2: The framework of the proposed end-to-end pre-training model SOHO. For an input text (a), we use the text embedding operation (b) to extract the textual embedding features. For an input image (d), we propose to use a trainable CNN-based encoder (e) to extract visual representations. To further transform image features to consistent semantics, we apply a visual dictionary-based image embedding (f) to the image encoder outputs. Finally, we apply multi-layer Transformers to the output of multi-modal concatenation (c) with three pre-training tasks. Note that the index matrix in (f) will be used as labels in the masked VM task in (g). [Best viewed in color.]

2.2 Pre-training for Vision-Language

Many vision-language pre-training (VLPT) works have been proposed to learn cross-modal representations [chen2020uniter, li2019unicoder, li2019visualbert, lu2019vilbert, su2019vl, sun2019videobert, tan2019lxmert, zhou2019unified]. They can be categorized as two-stream or single-stream models. Two-stream models process visual and language information separately and fuse them afterward with another Transformer layer [lu2019vilbert, tan2019lxmert]. In contrast, single-stream models use BERT [devlin2018bert] to learn a bi-directional joint distribution over detection bounding box features and text embedding features [alberti2019fusion, chen2020uniter, li2019unicoder, li2019visualbert, su2019vl, zhou2019unified]. Both types use Transformer-based models to learn vision-language joint embedding features, while neglecting that visual representation learning is also important to vision-language tasks.

The key differences between our SOHO and existing VLPT works are: 1) SOHO adopts a simple VLPT pipeline; our vision backbone uses only ImageNet pre-trained parameters, yet achieves even higher performance than existing VLPT works that use VG features on four downstream tasks. 2) SOHO uses the fewest annotations to achieve SOTA performance. 3) SOHO enriches visual semantics by directly optimizing visual inputs for the target language tasks.

3 Approach

The overall architecture of our proposed vision-language pre-training framework SOHO is shown in Figure 2. SOHO is an end-to-end framework which consists of a trainable CNN-based visual encoder, a visual dictionary (VD) embedding module, and a multi-layer Transformer. The visual encoder takes an image as input and produces visual features. The VD embedding module aggregates diverse visual semantic information into visual tokens with the proposed visual dictionary. The Transformer fuses features from the visual and language modalities and produces task-specific outputs. SOHO can be pre-trained end-to-end with the Masked Visual Modeling (MVM), Masked Language Modeling (MLM), and Image-Text Matching (ITM) tasks, and can be easily adapted to several downstream tasks including image-text retrieval, VQA, NLVR, and visual entailment.

3.1 Trainable Visual Encoder

Most recent vision-language research follows Bottom-Up and Top-Down attention [Anderson_2018_CVPR] to extract region-level visual features with a Faster R-CNN [ren2015faster] detector pre-trained on the Visual Genome [krishna2017visual] dataset. The representation ability of such region-based features is limited by the pre-defined object and attribute categories. Besides, some contextual information outside the regions is important for VL understanding, yet it is neglected because it falls out of the pre-defined categories. Even though taking the whole image as a single region and extracting its feature as a global representation is an improved solution, the detector cannot guarantee the feature quality because such global regions are unseen in its training stage. Despite that, most recent VLPT models adopt pre-extracted region-level visual features because it is complicated to fine-tune an object detector end-to-end in VL tasks. Moreover, the extracted region-level visual features have a semantic gap with the language domain, while existing works try to bridge this gap with only one or several fully-connected layers.

To keep all visual information, we propose to use a trainable CNN visual encoder, which takes the whole image as input and produces image-level visual features instead of region-level features. Without the limitation of bounding boxes, the visual encoder can be learned and updated end-to-end from pre-training losses or downstream task-specific losses, which in turn further optimizes cross-modal learning. Given an input image $I$, we obtain its features by:

$V = \{v_1, v_2, \dots, v_l\} = E(I; \theta), \quad v_i \in \mathbb{R}^c, \qquad (1)$

where $E$ is the visual feature encoder with parameters $\theta$, $l$ denotes the number of embedded feature vectors, and $c$ is the embedded dimension. We adopt ResNet [he2016deep] pre-trained on ImageNet [deng2009imagenet], followed by a convolutional layer and a max pooling layer, as the architecture of the encoder $E$. For simplicity, we use $v_i$ to denote the $i$-th feature vector of $V$ in the rest of this paper.

3.2 Visual Dictionary

The visual features extracted by the visual feature encoder are more diverse and denser than language word tokens, which brings difficulty to the learning of cross-modal understanding. To bridge the representation gap with language tokens, we propose a visual dictionary (VD) to tokenize the visual features by aggregating similar visual semantics into the same representative feature.

Visual Dictionary Embedding. We define a visual dictionary (VD) as a matrix $D = \{d_1, d_2, \dots, d_k\}$ which contains $k$ embedding vectors of dimension $c$. For a visual feature $v_i$, we compute its mapping index $h_i$ by searching the nearest neighbor in $D$:

$h_i = \arg\min_j \|v_i - d_j\|_2. \qquad (2)$

We define the visual dictionary embedding as a mapping function $f$ which maps $v_i$ to $d_{h_i}$:

$f(v_i) = d_{h_i}, \qquad (3)$

which uses the nearest embedding vector to represent the visual feature. We denote $f^{-1}(j)$ as the inverse mapping, which maps index $j$ back to the group of visual features quantized to $d_j$; we use $|f^{-1}(j)|$ to represent the group size, and $f(V)$ to represent the encoded features.
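
As an illustration, the nearest-neighbor mapping and embedding described above can be sketched in a few lines of NumPy (a minimal sketch with hypothetical names and shapes, not the released implementation):

```python
import numpy as np

def vd_lookup(features, dictionary):
    """Quantize visual features with a visual dictionary.

    features:   (l, c) array of visual features from the encoder.
    dictionary: (k, c) array of VD embedding vectors d_j.
    Returns the mapping indices h and the quantized features f(v).
    """
    # Pairwise L2 distances between every feature and every dictionary entry.
    dists = np.linalg.norm(features[:, None, :] - dictionary[None, :, :], axis=-1)
    h = dists.argmin(axis=1)      # nearest-neighbor index per feature
    return h, dictionary[h]       # each feature is represented by its nearest entry
```

Each feature is thus replaced by its nearest dictionary entry, and the index serves later as the virtual semantic label for MVM.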

Momentum Learning for Visual Dictionary Update. The visual dictionary is randomly initialized, and further updated by a moving-average operation within each mini-batch:

$\hat{d}_j = \gamma \cdot d_j + (1 - \gamma) \cdot \frac{\sum_{v_i \in f^{-1}(j)} v_i}{|f^{-1}(j)|}, \qquad (4)$

where $\hat{d}_j$ indicates the updated embedding vector of $d_j$, and $\gamma$ is a momentum coefficient with value range $[0, 1)$. Note that Eqn. (4) is only applied when $|f^{-1}(j)| > 0$.
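
A minimal NumPy sketch of this per-batch moving-average update (hypothetical names; the momentum value is illustrative):

```python
import numpy as np

def momentum_update(dictionary, features, h, gamma=0.99):
    """Moving-average update of the visual dictionary from one mini-batch.

    dictionary: (k, c) current VD entries.
    features:   (l, c) batch of visual features.
    h:          (l,) mapping index of each feature.
    Entries with no assigned feature are left unchanged.
    """
    updated = dictionary.copy()
    for j in range(len(dictionary)):
        group = features[h == j]          # f^{-1}(j): features mapped to entry j
        if len(group) > 0:                # only update non-empty groups
            updated[j] = gamma * dictionary[j] + (1.0 - gamma) * group.mean(axis=0)
    return updated
```

The momentum term keeps each entry stable across batches while slowly pulling it toward the mean of the features currently assigned to it.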

Gradient Back Propagation. Since the argmin operation is not differentiable, gradient back-propagation is stopped at the visual dictionary. To make the visual feature encoder trainable, we follow [van2017neural] to update $v_i$ by:

$f(v_i) = \mathrm{sg}(d_{h_i} - v_i) + v_i, \qquad (5)$

where $\mathrm{sg}(\cdot)$ is the stop-gradient operator.

The visual dictionary performs an online clustering on visual feature maps based on feature similarity, and represents each feature vector by its cluster center. Feature vectors sharing similar semantics will be aggregated into the same cluster, and the clustered index can be considered as a virtual visual semantic label. Since the clustering can be affected by the vision-language learning tasks, the learned semantics of each embedding vector is more suitable for cross-modal understanding and alignment.

The visual dictionary faces a cold-start problem: directly copying the gradient from randomly initialized embedding vectors to visual feature maps will lead to an incorrect model optimization direction (i.e., mode collapse). Therefore, we freeze the parameters of ResNet in the visual feature encoder during the initial training epochs.

Task | Dataset | Train Split | Test Split | Metric
Pre-training | VG [krishna2017visual] | train | - | -
Pre-training | MSCOCO [lin2014microsoft] | train+restval* | - | -
Image-Text Retrieval | MSCOCO [lin2014microsoft] | train+restval* | test* | Recall@1,5,10
Image-Text Retrieval | Flickr30K [plummer2015flickr30k] | train | test* | Recall@1,5,10
Visual Question Answering | VQA2.0 [goyal2017making] | train+val | test-dev/test-std | VQA-score [goyal2017making]
Visual Reasoning | NLVR [suhr2019corpus] | train | dev/test-P | Top-1 Accuracy
Visual Entailment | SNLI-VE [xie2018visual] | train | val/test | Top-1 Accuracy
Table 1: Statistics of different tasks. Notation "*" denotes the Karpathy split [karpathy2015deep]. Notation "-" denotes not applicable. Detailed train/test image and text numbers can be found in the supplementary material.

3.3 Pre-training Pipeline

We apply a multi-layer Transformer to learn cross-modal representations with the fusion of visual and language features. In order to learn a universal representation for vision and language-related tasks, we apply the self-supervised method to pre-train the model on a large aggregated dataset. We follow the existing works [chen2020uniter, li2019unicoder, lu2019vilbert, su2019vl, tan2019lxmert, zhou2019unified] to adopt Masked Language Modeling (MLM) and Image-Text Matching (ITM) pre-training tasks. Besides, we propose a novel Masked Visual Modeling (MVM) pre-training task based on the virtual visual semantic labels produced by the visual dictionary.

Cross-Modal Transformer. For visual representation, we utilize a 2-D position embedding computed by sine functions to encode the spatial information of visual tokens, following other works [carion2020end, dosovitskiy2020image, parmar2018image]. For the input sentence, we follow BERT [devlin2018bert] to tokenize it, and represent the tokens by embedding vectors $W = \{w_1, w_2, \dots, w_n\}$, where $w_i$ denotes the embedding vector of the $i$-th token. The word embeddings and the VD embeddings share the same output dimension. We concatenate the VD embedding vectors and the word embedding vectors to form an input sequence for cross-modal learning. Similar to other VLPT models, we add two special tokens [CLS] and [SEP] into the input sequence to indicate the classification position and the end of the text, respectively. A multi-layer Transformer takes the joint vision-language input and outputs the attended features.
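
One common construction of such a 2-D sinusoidal position embedding splits the channels between the row and column coordinates; the sketch below follows that convention (the half/half split and the frequency base 10000 are standard conventions, assumed here rather than taken from the paper):

```python
import numpy as np

def sincos_2d_position_embedding(h, w, dim):
    """2-D sinusoidal position embedding for an h x w grid of visual tokens.

    Half of the channels encode the row index and half the column index,
    each with the standard sine/cosine scheme.
    """
    assert dim % 4 == 0
    half = dim // 2
    # Geometric frequency schedule, as in the 1-D sinusoidal embedding.
    freqs = 1.0 / (10000 ** (np.arange(0, half, 2) / half))
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    def encode(pos):                        # pos: (h, w) -> (h, w, half)
        angles = pos[..., None] * freqs     # broadcast positions over frequencies
        return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

    # Row encoding in the first half of channels, column encoding in the second.
    return np.concatenate([encode(ys), encode(xs)], axis=-1)   # (h, w, dim)
```

The resulting (h, w, dim) grid can be flattened and added to the visual token embeddings before concatenation with the word embeddings.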

Masked Language Modeling. We follow [chen2020uniter] and adopt Masked Language Modeling (MLM) to encourage the model to build the mapping between language tokens and visual contents. The goal of MLM is to predict a masked word token $w_m$ based on the other word tokens $w_{\setminus m}$ and all image features $V$ by minimizing the negative log-likelihood:

$\mathcal{L}_{\mathrm{MLM}} = -\mathbb{E}_{(V, W) \sim \mathcal{D}} \log P(w_m \mid w_{\setminus m}, V), \qquad (6)$

where $\mathcal{D}$ hereinafter indicates the whole training dataset. We adopt the same masking strategy used in BERT [devlin2018bert].

Masked Visual Modeling. We propose Masked Visual Modeling (MVM) based on the visual dictionary, as a symmetry to MLM. We randomly mask image features before feeding them into the Transformer. The learning target of MVM is:

$\mathcal{L}_{\mathrm{MVM}} = -\mathbb{E}_{(V, W) \sim \mathcal{D}} \log P(v_m \mid v_{\setminus m}, W). \qquad (7)$

The goal of MVM is to predict a masked image feature $v_m$ based on its surrounding image features $v_{\setminus m}$ and all language tokens $W$ by minimizing the negative log-likelihood. MVM encourages the model to infer visual knowledge from the contextual visual information as well as the language. When an image feature is masked, its mapping index in the VD is taken as its label. In visual feature maps, neighboring features may have similar values and thus share the same mapping index. This would let the model lazily copy the label from surrounding features as its prediction. To prevent this, in the masking stage we first randomly select an existing label index $j$, then replace all visual embedding vectors in $f^{-1}(j)$ with the special [MASK] token embedding vector.
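
The label-level masking strategy can be sketched as follows (a simplified illustration with hypothetical names; masking rates and the [MASK] replacement itself are omitted):

```python
import numpy as np

def mvm_mask(indices, rng=None):
    """Select one existing VD label index and mask every feature sharing it.

    indices: (l,) VD mapping index of each visual token.
    Returns a boolean mask; masked positions would be replaced by the
    [MASK] embedding, and their shared VD index is the prediction target.
    """
    if rng is None:
        rng = np.random.default_rng()
    target = rng.choice(np.unique(indices))   # pick an existing label index j
    return indices == target                  # mask the whole group f^{-1}(j)
```

Masking the whole group at once prevents the model from lazily copying the label of an unmasked neighbor that shares the same index.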

Image-Text Matching. To enhance cross-modal matching, we adopt the Image-Text Matching (ITM) task for pre-training as in previous works [chen2020uniter]. We apply a binary classifier on the joint embedding feature of the [CLS] token to predict a matching score $s(V, W)$ indicating whether the input image and text are matched. The ITM task is driven by the following loss:

$\mathcal{L}_{\mathrm{ITM}} = -\mathbb{E}_{(V, W) \sim \mathcal{D}} \left[ y \log s(V, W) + (1 - y) \log\left(1 - s(V, W)\right) \right], \qquad (8)$

where $y$ indicates whether the image and text are matched ($y = 1$) or not ($y = 0$).
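
For a single pair, the ITM objective reduces to a standard binary cross-entropy; a minimal sketch (hypothetical names, with the matching score assumed to come from the classifier on the [CLS] feature):

```python
import numpy as np

def itm_loss(score, y):
    """Binary cross-entropy for one image-text pair.

    score: predicted matching probability s(V, W), in (0, 1).
    y:     1 if the pair is matched, 0 otherwise.
    """
    return -(y * np.log(score) + (1 - y) * np.log(1 - score))
```

The loss pushes the score toward 1 for matched pairs and toward 0 for mismatched ones.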

The visual feature encoder, the VD-based image embedding module, and the cross-modal Transformer are jointly trainable end-to-end. We assign equal loss weights to the three pre-training objectives, and thus the full pre-training objective of SOHO is:

$\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{MVM}} + \mathcal{L}_{\mathrm{ITM}}. \qquad (9)$
3.4 Pre-training Datasets

Several large-scale datasets have been proposed to facilitate VL pre-training. Following the typical settings in UNITER [chen2020uniter], these datasets can be categorized into two classes: "in-domain" and "out-of-domain". In our work, we use the "in-domain" data for pre-training, as most VL pre-training works are built on it [chen2020uniter, li2019visualbert, tan2019lxmert]. We construct our pre-training datasets from MSCOCO [lin2014microsoft] and VG [krishna2017visual].

To avoid data leakage, we only use the train and restval splits of the MSCOCO dataset, and the train and val splits of the VG dataset, in the training stage. The detailed statistics of our pre-training datasets can be found in Table 1. Detailed comparisons of the pre-training dataset usage of most VLPT works, including our train/test image and text numbers, are included in the supplementary material.

Model Backbone | 1K Test set: TR R@1 R@5 R@10, IR R@1 R@5 R@10 | 5K Test set: TR R@1 R@5 R@10, IR R@1 R@5 R@10
VSE++[faghri2017vse++] R152 64.6 90.0 95.7 52.0 84.3 92.0 41.3 71.1 81.2 30.3 59.4 72.4
SCAN[lee2018stacked] R101 72.7 94.8 98.4 58.8 88.4 94.8 50.4 82.2 90.0 38.6 69.3 80.4
Unicoder-VL[li2019unicoder] - 84.3 97.3 99.3 69.7 93.5 97.2 62.3 87.1 92.8 46.7 76.0 85.3
UNITER[chen2020uniter] R101 - - - - - - 64.4 87.4 93.1 50.3 78.5 87.2
SOHO (ours) R101 85.1 97.4 99.4 73.5 94.5 97.5 66.4 88.2 93.8 50.6 78.0 86.7
Table 2: Evaluation of image-to-text retrieval (TR) and text-to-image retrieval (IR) on the MSCOCO dataset. "-" indicates the result is not reported.

Model Backbone TR IR
R@1 R@5 R@10 R@1 R@5 R@10
VSE++[faghri2017vse++] R152 52.9 80.5 87.2 39.6 70.1 79.5
SCAN[lee2018stacked] R101 67.4 90.3 95.8 48.6 77.7 85.2
ViLBERT[lu2019vilbert] R101 - - - 58.2 84.9 91.5
Unicoder-VL[li2019unicoder] - 86.2 96.3 99.0 71.5 90.9 94.9
UNITER[chen2020uniter] R101 85.9 97.1 98.8 72.5 92.4 96.1
SOHO (ours) R101 86.5 98.1 99.3 72.5 92.7 96.1
Table 3: Evaluation of image-to-text retrieval (TR) and text-to-image retrieval (IR) on the Flickr30K dataset. "-" indicates the result is not reported.

4 Experiment

4.1 Implementation Details

For language processing, we follow BERT and use the WordPiece tokenizer [wu2016google] to split each text into language tokens. For visual processing, since most previous works adopt feature extractors with a fixed input resolution, we follow the same setting and resize the shorter edge of input images while limiting the longer edge accordingly, for a fair comparison. We initialize the parameters of our visual backbone and Transformer with models pre-trained on the publicly accessible ImageNet [deng2009imagenet] and BERT [devlin2018bert], respectively. Specifically, we adopt the widely used ResNet-101 backbone and a 12-layer Transformer to compare fairly with other works, while we adopt a lightweight ResNet-18 backbone and a 3-layer Transformer in our ablation studies to reduce experiment cost. We use RX to denote an X-layer ResNet in the rest of this paper for simplicity (e.g., R101 denotes ResNet-101). Since the visual backbone and the Transformer favor different kinds of optimizers [zhang2019adam], we follow the suggestion of Zhang et al. [zhang2019adam] and use SGD for the visual backbone and AdamW for the Transformer, each with its own learning rate and weight decay. We pre-train our model on 32 NVIDIA Tesla V100 GPUs with a large batch of image-text pairs. Training takes 40 epochs until convergence, and we empirically decay the learning rate by a factor of 10 twice during training.

We adopt mixed-precision training to reduce memory cost and speed up the training procedure. Each image is paired with four texts per batch during pre-training, including two positive pairs and two negative pairs. We only apply MLM and MVM to the positive image-text pairs.

4.2 Downstream Tasks and Results

We test the performance of SOHO on four well-established downstream tasks, including image-text retrieval, visual question answering (VQA), natural language for visual reasoning (NLVR), and fine-grained visual reasoning (Visual Entailment, or VE). The image-text retrieval task includes two sub-tasks, i.e., image-to-text retrieval (TR) and text-to-image retrieval (IR), and is conducted on the Flickr30K [plummer2015flickr30k] and MSCOCO [lin2014microsoft] datasets. The tasks of VQA, NLVR, and VE are conducted on the VQA 2.0 [goyal2017making], NLVR [suhr2019corpus], and SNLI-VE [xie2018visual] datasets, respectively. Table 1 summarizes the statistics of all our downstream tasks.

We compare our approach with several task-specific methods and pre-training models. Most pre-training models adopt Transformer-like architectures [vaswani2017attention] with BERT-like objectives [devlin2018bert] to learn cross-modal representations [chen2020uniter, li2019unicoder, li2019visualbert, lu2019vilbert, su2019vl, tan2019lxmert, zhou2019unified]. For downstream tasks, we find that using the input features of the VD module as the visual representation is better than directly applying the VD embedding, and we adopt the former setting in our experiments. This suggests that the VD suits visual representations learned with a very large scale of semantics, while dense features provide more details on a relatively small dataset.

4.2.1 Task I: Image-Text Retrieval

Image-text retrieval requires a model to retrieve the most relevant caption given an image, or vice versa. It is one of the most typical tasks in vision-language learning and enables a broad range of applications (e.g., image search). Image-text retrieval includes two sub-tasks: image-to-text retrieval (TR) and text-to-image retrieval (IR). During training, we construct aligned and unaligned pairs inside a mini-batch, as most image-text retrieval models do. We randomly sample aligned image-caption pairs from the ground-truth annotations to form a mini-batch, and all the other captions serve as unaligned captions for each image. To encourage the model to predict the right labels for both aligned and unaligned pairs, we formulate the retrieval task as a binary classification problem.

In our implementation, we use the joint embedding representation of the [CLS] token from the Transformer to predict whether an image-caption pair is aligned or not. Since the objective of the image-text retrieval task is consistent with the image-text matching (ITM) task in the pre-training stage, the pre-trained parameters can be directly inherited for fine-tuning. We adopt the AdamW optimizer and train for 20 epochs until convergence, empirically halving the learning rate at three intermediate epochs.

We conduct experiments on MSCOCO [lin2014microsoft] and Flickr30K [plummer2015flickr30k]; the results are shown in Table 2 and Table 3 respectively. It is worth noting that UNITER additionally uses out-of-domain datasets, and its results are expected to be better than those obtained with in-domain datasets alone, as reported [chen2020uniter]. Unicoder-VL [li2019unicoder] adopts only out-of-domain datasets, which is also not directly comparable to our SOHO. Nevertheless, SOHO outperforms the most recent VLPT works under most metrics on both MSCOCO and Flickr30K. The performance improvements indicate that SOHO learns better image-text embeddings through our end-to-end pre-training architecture, and exploits a comprehensive yet compact visual semantic abstraction through the proposed visual dictionary.

Model Backbone test-dev test-std
MUTAN[ben2017mutan] R152 60.17 -
BUTD[Anderson_2018_CVPR] R101 65.32 65.67
Unified VLP [zhou2019unified] X101 70.50 70.70
ViLBERT[lu2019vilbert] R101 70.55 70.92
VisualBERT[li2019visualbert] R152 70.80 71.00
VLBERT[su2019vl] R101 71.79 72.22
LXMERT[tan2019lxmert] R101 72.42 72.54
UNITER[chen2020uniter] R101 72.70 72.91
SOHO (Ours) R101 73.25 73.47
Table 4: Evaluation of VQA on the VQA 2.0 dataset. "-" indicates the result is not reported. X101 denotes the ResNeXt-101 architecture [xie2017aggregated].

4.2.2 Task II: Visual Question Answering

Visual Question Answering (VQA) requires a model to take an image and a question as input and output an answer. The task requires machines to reason jointly across vision and language, as humans do. We model VQA as a classification problem by learning a multi-layer perceptron on top of the [CLS] token. We follow [kim2018bilinear] to treat it as a multi-answer classification problem over a fixed answer vocabulary, and optimize the model via a binary cross-entropy loss. We fine-tune for 18 epochs with a batch size of 256 until convergence. The optimizer and the initial learning rates are set the same as in the pre-training stage, and we empirically decay the learning rate by a factor of 10 twice during fine-tuning.

Results are presented in Table 4. The most directly comparable baseline to SOHO is LXMERT [tan2019lxmert], which adopts the same backbone and pre-training datasets. SOHO obtains 0.83% and 0.93% absolute improvements over LXMERT on the test-dev and test-std splits respectively. It is worth noting that SOHO outperforms UNITER [chen2020uniter] even under an inferior experimental setting, since UNITER additionally uses out-of-domain datasets in the pre-training stage. The promising results of SOHO on VQA demonstrate that our end-to-end pre-training approach enables intelligent question answering on visual contents.

Model Backbone dev test-P
Image Only[suhr2019corpus] R152 51.60 51.90
CNN+RNN[suhr2019corpus] R152 53.50 52.40
MaxEnt[suhr2019corpus] R152 54.10 54.80
VisualBERT[li2019visualbert] R152 67.40 67.00
LXMERT[tan2019lxmert] R101 74.90 74.50
UNITER[chen2020uniter] R101 75.85 75.80
SOHO (Ours) R101 76.37 77.32
Table 5: Evaluation of Visual Reasoning on NLVR dataset.

4.2.3 Task III: Visual Reasoning

Visual Reasoning with Natural Language (NLVR) requires a model to predict whether a text is true for a given pair of images. Compared with VQA, NLVR addresses the challenge of compositional visual reasoning on relations, comparisons, and quantities. We conduct this task on the NLVR dataset [suhr2019corpus]. In our implementation, we follow LXMERT [tan2019lxmert] and UNITER [chen2020uniter] to feed the two image-text pairs into the Transformer and obtain two embedding vectors from the [CLS] tokens. We then learn a classifier over "true" or "false" on the concatenation of the two embedding vectors with a cross-entropy loss. The settings of the optimizer, epoch number, and learning rate are the same as for VQA. Since the number of input images for NLVR is twice that of VQA, the batch size for NLVR is half that of VQA.

We mainly compare with the SOTA results provided by LXMERT [tan2019lxmert] and UNITER [chen2020uniter] under the same settings for fair comparison. From the results in Table 5, we observe 0.52% and 1.52% absolute gains of SOHO over UNITER on the dev and test-P splits respectively. This validates that SOHO also has advantages when applied to compositional visual reasoning tasks.

Model Backbone val test
EVE-Image[xie2018visual] R101 71.56 71.16
UNITER[chen2020uniter] R101 78.59 78.28
SOHO (Ours) R101 85.00 84.95
Table 6: Evaluation of Visual Entailment on SNLI-VE.
VD size Text Retrieval Image Retrieval VQA NLVR SNLI-VE
R@1 R@5 R@10 R@1 R@5 R@10 test-dev test-std dev test-P val test
w/o VD - 72.80 93.20 96.90 58.22 88.32 94.40 66.08 66.33 62.62 62.61 82.28 82.16
w/ VD 1024 73.40 92.10 97.00 58.55 88.84 94.70 66.75 66.95 63.32 64.60 82.47 82.55
2048 75.50 93.50 97.30 59.03 88.88 94.84 66.69 67.09 64.62 65.32 82.56 82.54
4096 71.20 93.20 97.30 58.50 88.92 94.96 66.76 66.91 63.60 64.80 82.53 82.55
8192 72.10 92.30 96.50 58.01 88.08 94.70 66.65 67.10 63.15 64.49 82.29 82.69
Δ (2048) 2.70 0.30 0.40 0.81 0.56 0.44 0.61 0.76 2.00 2.71 0.28 0.38
Table 7: Ablation study on the effectiveness of the Visual Dictionary (VD) and the embedding vector size of VD. Results are obtained under the settings of a ResNet-18 backbone and a 3-layer Transformer architecture. Image-text retrieval is conducted on the MSCOCO 1k test set. The top-1 and top-2 results of each metric are highlighted in bold and underlined respectively. Notation Δ indicates the performance gains of the 2048 VD size results over the baseline results without VD.

4.2.4 Task IV: Visual Entailment

Visual Entailment (VE) is a fine-grained visual reasoning task that predicts whether an image semantically entails a text. The relationship between an image and a text in the VE task is more fine-grained than in VQA and NLVR²: it can be true (entailment), false (contradiction), or neutral. The SNLI-VE dataset [xie2018visual] is proposed for the VE task and is constructed from the Stanford Natural Language Inference (SNLI) [bowman2015large] and Flickr30K [plummer2015flickr30k] datasets. We follow UNITER [chen2020uniter] to treat the VE task as a three-way classification problem and predict the score of each class with a fully-connected layer on the [CLS] token representation from the Transformer output. We fine-tune the model for 6 epochs with a batch size of 128 until convergence. The learning rate is initialized to 1e-4 and decayed to 1e-5 after four epochs, chosen empirically.

We compare SOHO with the VLPT model UNITER [chen2020uniter] and the task-specific method EVE-Image [xie2018visual]. As reported in Table 6, SOHO achieves 85.00% and 84.95% accuracy on the val and test splits respectively, significantly outperforming the SOTA results of UNITER [chen2020uniter] with 6.41% and 6.67% absolute accuracy improvements on the two splits. The results indicate the advantage of our end-to-end framework, which refines the CNN backbone together with the cross-modal Transformer to facilitate thorough vision-language alignment.

4.3 Ablation Study

We perform ablation studies to validate the effectiveness of the visual dictionary (VD) on all downstream tasks. We first establish a baseline model without VD, then incorporate VD into the baseline, and further evaluate the influence of the embedding vector size of VD (VD size).

Results are presented in Table 7. Generally, we observe that for most tasks, a VD size of 2048 or 4096 achieves the best results among the four sizes ranging from 1024 to 8192. This is reasonable, as VD is designed to aggregate similar visual semantics into the same image feature. With such a design, a bigger VD can group more fine-grained and complete visual semantics, which benefits vision-language alignment as expected. However, when visual semantics become too fine-grained and are split across different image features, the abstraction of visual semantics may deteriorate, which in turn harms VL alignment. We empirically find that a VD size of 2048 works best in most cases, so we adopt it as our default setting.

Compared with the baseline without VD, our proposed method with VD achieves better performance under almost all metrics across a wide range of VD sizes (i.e., 1024, 2048, and 4096). This validates the effectiveness of VD and shows that it is generally applicable to a broad range of tasks.
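To make the ablated component concrete, here is a hypothetical sketch of a visual dictionary with k embeddings: each CNN grid feature is assigned to its nearest dictionary entry, and matched entries are refreshed on-the-fly. The class name and the momentum moving-average update rule are our assumptions for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np

class VisualDictionary:
    """Sketch of a k-entry visual dictionary over d-dimensional grid
    features. Lookup is nearest-neighbor in Euclidean distance; matched
    entries are updated with a momentum moving average (assumed rule)."""
    def __init__(self, k, dim, momentum=0.99, seed=0):
        rng = np.random.default_rng(seed)
        self.emb = rng.normal(size=(k, dim))
        self.m = momentum

    def lookup(self, feats):
        # feats: (n, dim) grid features -> index of nearest entry each
        d2 = ((feats[:, None, :] - self.emb[None, :, :]) ** 2).sum(-1)
        return d2.argmin(axis=1)

    def update(self, feats, idx):
        # Pull each matched entry slightly toward the feature it served.
        for i, f in zip(idx, feats):
            self.emb[i] = self.m * self.emb[i] + (1.0 - self.m) * f

vd = VisualDictionary(k=2048, dim=16)
feats = np.random.default_rng(1).normal(size=(49, 16))  # e.g., a 7x7 grid
idx = vd.lookup(feats)
vd.update(feats, idx)
```

The VD size ablated in Table 7 corresponds to `k` here: larger `k` yields finer-grained groups, at the risk of splitting one visual concept across several entries.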

4.4 Visualization of Visual Dictionary

To share insights on what the proposed visual dictionary (VD) learns, we visualize some representative VD indices in Figure 3. As introduced in Sec. 3.2, a VD index is associated with many visual features, where each visual feature corresponds to an image patch. We randomly sample indices from the VD and visualize their corresponding image patches. As shown in Figure 3, the VD groups meaningful and consistent image patches under different indices, reflecting an abstraction of visual semantics. The visualization shows the strong capability of the learned VD. More cases can be found in the supplementary materials.

4.5 Inference Time

BUTD-based methods mainly include three inference stages: CNN forwarding, region feature generation, and Transformer forwarding [Anderson_2018_CVPR]. In contrast, SOHO only includes two inference stages: CNN forwarding and Transformer forwarding. To compare the inference efficiency of SOHO and BUTD-based methods, we set up an experiment on a V100 GPU with the same input resolution, a ResNet-101 backbone, a 12-layer Transformer, 100 boxes, and a sentence padding length of 16. The input sequence lengths of the Transformer differ between BUTD-based methods and SOHO, and so do their Transformer forwarding times. For BUTD-based methods, in addition to the time cost of region feature generation, the main time cost comes from the non-maximum suppression that is required to be applied to all categories. Consequently, an inference step of SOHO is about 10 times faster than that of BUTD-based methods. Therefore, our highly efficient SOHO can be better applied to real applications.

Figure 3: Visualization of VD. The left and right indices reflect the semantics of “head” and “building” with consistent visual patterns, respectively.

5 Conclusion

In this paper, we present a new perspective on vision-language model design. In particular, we propose SOHO, one of the first end-to-end vision-language pre-training models that learn comprehensive yet compact visual representations for cross-modal understanding. To generate visual features that can be fused with language tokens, we propose a novel visual dictionary to transform an image into concrete semantics. Three pre-training tasks are conducted to build connections between images and language. Performance on four downstream tasks shows the superiority of SOHO over pre-training models with region-based image features. Moreover, we remove the requirement for bounding box annotations, reducing heavy human labeling costs. The end-to-end framework also accelerates inference in vision-language tasks by about 10 times, which enables more online vision-language applications. In the future, we will further explore vision-language generation tasks and study the utilization of large-scale unpaired multi-modal data for cognition-level visual understanding.

6 Acknowledgments

This work is supported by Ministry of Science and Technology International Exchange Project (No. 43-17).

Appendix A Appendix

a.1 Dataset Statistics

Here we first summarize the detailed train/test image and text counts of our pre-training and downstream datasets in Table 9. We then provide a detailed comparison of the pre-training dataset usage of recent VLPT works in Table 10.

We follow UNITER [chen2020uniter] to classify pre-training datasets into two classes: “in-domain” and “out-of-domain”. MSCOCO Captions (MSCOCO) [lin2014microsoft] and Visual Genome Dense Captions (VG) [krishna2017visual] are typical in-domain datasets for many VL downstream tasks (e.g., image-text retrieval). In contrast, Conceptual Captions [sharma2018conceptual] and SBU Captions [ordonez2011im2text] are out-of-domain datasets, which are noisier than in-domain ones. We show the dataset usage of recent VLPT works in Table 10. For example, VisualBERT [li2019visualbert], LXMERT [tan2019lxmert] and UNITER [chen2020uniter] pre-train with in-domain datasets. Among them, UNITER [chen2020uniter] additionally uses out-of-domain data for model training; its ablation study shows that this additional out-of-domain data further improves performance.

In our work, we focus on in-domain datasets as they are commonly used in many VL tasks (e.g., image-text retrieval) and adopted by many VLPT works (e.g., VisualBERT [li2019visualbert], LXMERT [tan2019lxmert] and UNITER [chen2020uniter]). When comparing with UNITER, we fairly compare with its in-domain pre-training results if they are provided. Otherwise, our “in-domain” dataset setting is inferior to the “in-domain+out-of-domain” pre-training setting of UNITER, and our results are not directly comparable.

We plan to include out-of-domain data in our pre-training data as a future work.

Dataset Split #Image (K) #Text (K)
VG train 105.9 472.7
COCO train 82.8 414.1
val restval* 30.5 152.6
val* 5.0 25.0
test* 5.0 25.0
VQA2.0 train 82.8 443.8
val 40.5 214.4
test-dev 81.4 447.8
NLVR train 103.2 86.4
dev 8.2 7.0
test-P 8.1 7.0
Flickr30K train* 29.0 145.0
val* 1.0 5.0
test* 1.0 5.0
SNLI-VE train 29.8 529.5
val 1.0 17.9
test 1.0 17.9
Table 9: Statistics of different datasets. Notation “*” denotes Karpathy split [karpathy2015deep].

a.2 Implementation Details

We adopt two strategies to speed up training. First, we use mixed-precision training to reduce memory cost and accelerate the training procedure. Second, we re-organize the input data within each mini-batch: if an image has multiple corresponding texts, we forward it through the visual backbone only once, and concatenate its features with each text before feeding the cross-modal Transformer. For example, during pre-training an image is paired with four texts in each batch, including two positive pairs and two negative pairs. We only apply MLM and MVM to the positive image-text pairs.
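The mini-batch re-organization can be sketched as follows. The helper and data layout are hypothetical, with a string standing in for the single CNN forward per unique image; the point is that four image-text pairs are produced from one backbone pass.

```python
def build_pairs(batch):
    """Forward each unique image once, then pair the cached feature with
    every text associated with it in the mini-batch entry."""
    cache = {}   # image_id -> backbone feature (computed once)
    pairs = []   # (image_feature, text, is_positive) tuples for the Transformer
    for image_id, texts in batch:
        if image_id not in cache:
            cache[image_id] = f"feat({image_id})"  # stand-in for one CNN forward
        for text, is_positive in texts:
            pairs.append((cache[image_id], text, is_positive))
    return pairs, len(cache)

# One image paired with two positive and two negative texts, as in pre-training.
batch = [("img1", [("a dog", True), ("a dog runs", True),
                   ("a red car", False), ("two cats", False)])]
pairs, n_cnn_forwards = build_pairs(batch)
```

With this layout the backbone cost is amortized over all texts of an image, while masked modeling losses (MLM/MVM) would be applied only to the pairs flagged positive.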

a.3 Visualization of Visual Dictionary

To show the semantics of visual dictionary (VD) entries, we visualize the image patches grouped under each index. We have shown two examples in the paper; in this supplementary material, we randomly select ten more indices from the VD. From the visualization in Figure 4, we can see that each entry in the VD carries meaningful and consistent semantics. In other words, our model learns unified representations for different semantics of an image even though no object bounding box annotations are available for supervision.

Figure 4: Visualization of the visual dictionary (VD) learned by SOHO. Apart from the two indices shown in the paper, we randomly select another ten indices from the visual dictionary. The results show that our visual dictionary learns to group meaningful and consistent semantics of image patches under different indices; each index thus reflects an abstraction of visual semantics. [Best viewed in color.]

a.4 Discussion

For the image-text retrieval task, traditional approaches [faghri2017vse++] first project an image and a text into a common representation space and then correlate their representations by late fusion. The most widely-used late fusion method computes cosine similarity via a dot product, which is simple and fast. In contrast, Transformer-based approaches fuse the image and text early with a multi-layer Transformer to obtain a unified representation. The unified representation captures the deep relation between an image and a text through self-attention, and is thus able to achieve better results than late fusion. However, early-fusion Transformer-based approaches cannot produce separate representations for images and texts, and thus suffer from slow speed due to exhaustive computation over every possible image-text combination. Our model, like other vision-language pre-training models, is based on Transformers, and inference speed has become a bottleneck for applying such models to real-world search engines. For future work, we are curious how to speed up Transformer-based approaches in the image-text retrieval task.

In-domain Out-of-domain
Dataset Visual Genome [krishna2017visual] MSCOCO [lin2014microsoft] Conceptual Captions [sharma2018conceptual] SBU [ordonez2011im2text]
Caption/Image Num 5,060K/101K 533K/106K 3,000K/3,000K 990K/990K
Unified VLP [zhou2019unified]
ViLBERT [lu2019vilbert]
VLBERT [su2019vl]
Unicoder-VL [li2019unicoder]
VisualBERT [li2019visualbert]
LXMERT [tan2019lxmert]
UNITER [chen2020uniter]
Table 10: Statistics on the datasets used in recent vision-and-language pre-training works.