Contrastive Visual-Linguistic Pretraining

by Lei Shi, et al.

Several multi-modality representation learning approaches such as LXMERT and ViLBERT have been proposed recently. Such approaches can achieve superior performance due to the high-level semantic information captured during large-scale multimodal pretraining. However, as ViLBERT and LXMERT adopt visual region regression and classification losses, they often suffer from domain gap and noisy label problems, since the visual features they rely on are produced by a detector pretrained on the Visual Genome dataset. To overcome these issues, we propose unbiased Contrastive Visual-Linguistic Pretraining (CVLP), which constructs a visual self-supervised loss built upon contrastive learning. We evaluate CVLP on several downstream tasks, including VQA, GQA and NLVR2, to validate the superiority of contrastive learning for multi-modality representation learning. Our code is available at:




1 Introduction

Language pretraining Devlin et al. (2018); Radford et al. (2018) has revolutionized Natural Language Understanding (NLU), and strong models such as BERT Devlin et al. (2018) and RoBERTa Liu et al. (2019) are widely used across numerous NLP tasks. Building on this, Visual-Linguistic Pretraining (VLP) has been proposed to add an extra mask-predict self-supervised strategy for the visual branch Tan and Bansal (2019); Lu et al. (2019). Compared with VQA models that are trained from scratch such as DCN Nguyen and Okatani (2018), BAN Kim et al. (2018), DFAF Gao et al. (2019a) and MCAN Yu et al. (2019), VLP relies on a similar network structure as previous methods but can achieve superior performance and better generalization ability thanks to the semantic information acquired from large-scale pretraining.

The two prominent VLP methods, LXMERT Tan and Bansal (2019) and ViLBERT Lu et al. (2019), usually perform feature regression or classification for masked visual regions as the pretext task of self-supervised learning. However, we have identified several important problems: 1) Noisy Labels: feature regression and classification suffer from the noisy annotations in Visual Genome Krishna et al. (2017). 2) Domain Gap: as the visual features are generated by an object detector pretrained on Visual Genome, feature regression and classification of masked regions make the pretrained visual-linguistic model inherit the bias of Visual Genome, which results in weak generalization on other downstream tasks. Taking LXMERT as an example, it generalizes better on GQA than on NLVR2 due to the overlap between the pretraining and finetuning domains (the visual inputs of GQA Hudson and Manning (2019) are borrowed from Visual Genome), whereas the images in NLVR2 Suhr et al. (2018) are collected online and form an entirely different image manifold from the images used for pretraining.

To solve the aforementioned domain gap and noisy label problems, we propose a novel Contrastive Visual-Linguistic Pretraining (CVLP), which borrows ideas from the popular contrastive learning framework in metric learning. Specifically, CVLP replaces region regression and classification with contrastive learning. Contrastive learning aims to discriminate between positive and negative examples; it does not require any annotation and thus sidesteps the noisy label and domain bias problems directly. However, due to the tremendous memory cost of Transformers Vaswani et al. (2017), scaling up the batch size for contrastive learning is difficult, and a conspicuous problem of contrastive learning is that its performance is highly constrained by the number of negative examples, which is bounded by the batch size. Motivated by the idea of memory banks Wu et al. (2018); He et al. (2019), we build a dynamic memory queue that caches contextual region features from previous iterations and serves as the pool of negative examples for contrastive learning. As the network is updated, the cached features gradually drift, invalidating the older negative features in the memory queue. Therefore, motivated by MoCo He et al. (2019), we extract features with a slowly evolving key network and store them in the memory queue; once the queue is full, the oldest visual contextual features are eliminated. A naive implementation of contrastive learning would still fail because the network learns to discriminate between positive and negative examples too easily. To solve this problem, we increase feature diversity by adopting a randomly layer-dropping key network Fan et al. (2019).

Our contributions can be summarized as below:

  • We propose a novel contrastive learning framework for visual-linguistic pretraining that solves the domain bias and noisy label problems encountered with previous visual-linguistic pretraining approaches such as LXMERT and ViLBERT.

  • We carry out extensive ablation studies over CVLP to validate our proposed approach. Our CVLP pretraining can achieve significant improvements over a strong baseline (LXMERT), especially when the domain gap between the pretraining and finetuning stages becomes larger. CVLP can surpass the performance of LXMERT on all three datasets (VQA, NLVR2, and GQA).

Figure 1: Example question or caption for VQA, NLVR2, GQA datasets. GQA questions are usually longer and more fine-grained than VQA ones, while NLVR2 offers a caption on a pair of images. Our CVLP consistently beats LXMERT across all three vision–language datasets.

2 Related Work

2.1 Self-supervised Learning in Vision, Language and Multi-modality

Deep Neural Networks (DNNs) trained on ImageNet Deng et al. (2009) have revolutionized automatic feature representation learning Krizhevsky et al. (2012). Compared to supervised training, which incurs a substantial cost for data annotation, self-supervised learning learns useful features automatically by constructing a loss from a pretext task that does not require human annotation. In computer vision, context encoders Pathak et al. (2016) learn features by image inpainting. Jigsaw Noroozi and Favaro (2016) learns features by predicting the positions of permuted image patches. Kolesnikov et al. (2019) carry out a large-scale study of previously proposed self-supervised learning methods and show that the performance of self-supervised tasks varies as the backbone changes. In Natural Language Understanding (NLU), large-scale pretraining with next-word prediction (GPT) Radford et al. (2018), next-sentence prediction, or masked-word prediction (BERT) Devlin et al. (2018), typically trained with the Transformer architecture Vaswani et al. (2017), has significantly improved the accuracy of NLU, e.g., on the GLUE benchmark Wang et al. (2018). Motivated by the success of self-supervised learning in both vision and language, LXMERT Tan and Bansal (2019) and ViLBERT Lu et al. (2019) have shown that masked words and visual regions can also yield a good visual-linguistic representation.

2.2 Contrastive Learning

Contrastive learning is a sub-branch of self-supervised learning that employs a contrastive loss to learn representations useful for downstream tasks. The contrastive loss encourages the encoded instance features to be similar to positive keys while keeping them away from negative ones. Different contrastive learning methods adopt different strategies to generate positive and negative keys, which is an essential factor for the quality of the learned representation. Wu et al. (2018) select keys from a large memory bank that stores the instance features of the entire training dataset. Tian et al. (2019); Chen et al. (2020a) generate keys using the current mini-batch samples. MoCo He et al. (2019); Chen et al. (2020b) uses a momentum encoder to generate keys on-the-fly and stores them in a fixed-size queue.

2.3 Multi-modality Reasoning

The backbone of current visual-linguistic pretraining is built upon previous architectures for multi-modal reasoning. Image captioning and VQA Lin et al. (2014); Antol et al. (2015); Gao et al. (2018, 2019b) are two popular tasks that motivate the architecture design for multi-modality fusion. In particular, attention-based architectures have been widely used in multimodal fusion. Xu et al. (2015) proposed the first soft and hard attention models, showing that attention can yield good performance and interpretability. Yang et al. (2016) proposed a multi-layer attention model by stacking attention models. Besides attention, bilinear models such as MCB Fukui et al. (2016), MLB Kim et al. (2016) and MUTAN Ben-Younes et al. (2017) have explored the benefit of channel interactions for multimodal understanding. Subsequently, bottom-up and top-down attention Anderson et al. (2018) illustrated the benefit of employing object-level features. More recently, modeling relationships between objects and words has been explored in the DCN Nguyen and Okatani (2018), BAN Kim et al. (2018), DFAF Gao et al. (2019a), MCAN Yu et al. (2019), QBN Shi et al. (2020), CA-RN Geng et al. (2020b) and STSGR Geng et al. (2020a) methods.

3 Contrastive Visual-Linguistic Pretraining (CVLP)

Figure 2: The overall architecture of the proposed CVLP approach. CVLP includes a Query Network, a Key Network and maintains a dynamic memory queue. The entire model is trained with a combination of three cross-modality losses.

As illustrated in Figure 2, the architecture of CVLP consists of a Query Network (QueryNet) and a Key Network (KeyNet). Both contain a Language Transformer Encoder, a Vision Transformer Encoder and a Multi-modality Fusion Transformer Encoder. At initialization, KeyNet is copied from QueryNet with the same layers and parameters. QueryNet produces cross-modality embeddings with a masking strategy applied to both visual and textual inputs, while KeyNet generates contextualized visual features with masking applied only to textual inputs. The output features of KeyNet are pushed into a dynamic memory queue, which continuously provides negative samples for calculating the Cross-Modality Contrastive Loss. The full CVLP model is trained with a combination of the Cross-Modality Masked Language Modeling Loss, Matching Loss and Contrastive Loss. The remainder of this section is organized as follows: Section 3.1 introduces how visual and textual features are extracted and fused through self-attention and co-attention; Sections 3.2 and 3.3 describe the design of the mask loss for the language branch and the contrastive loss for the visual branch, respectively; Section 3.4 provides further details on the dynamic memory queue mechanism and the droplayer strategy.

3.1 Multi-modality Fusion

Given image–sentence pairs from a vision–language dataset, we first tokenize each sentence with the WordPiece technique Wu et al. (2016) and map each token $w_i$, $i \in \{1, \dots, n\}$, to its corresponding embedding $\mathrm{emb}(w_i)$. In addition, visual regions and their features are extracted by a Faster R-CNN Ren et al. (2015) detector pretrained on Visual Genome Krishna et al. (2017): for each image, we detect $m$ regions, where region $j$ has bounding-box coordinates $b_j$ and a feature vector $f_j \in \mathbb{R}^{d_o}$. Then we can calculate the visual inputs $v_j$ and textual inputs $t_i$ of CVLP as follows:

$$v_j = \frac{\mathrm{FC}_f(f_j) + \mathrm{FC}_b(b_j)}{2}, \qquad t_i = \mathrm{emb}(w_i) + \mathrm{PE}(i),$$

where $\mathrm{FC}_f$ and $\mathrm{FC}_b$ are two fully-connected layers that map $f_j$ and $b_j$, respectively, to the common feature dimensionality $d$, while $\mathrm{PE}$ is a positional encoding function for the position $i$ of token $w_i$.
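To make the projection step concrete, the following NumPy sketch shows how region features and box coordinates could be mapped to a common dimensionality. The layer shapes, dimensionalities and the LXMERT-style averaging of the two projections are assumptions for illustration, not values stated in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 768         # common feature dimensionality d (assumed value)
D_OBJ = 2048    # detector feature size d_o (assumed value)
N_REGIONS = 36  # number of detected regions m (assumed value)

# hypothetical learned parameters of the two fully-connected layers
W_f, b_f = 0.01 * rng.standard_normal((D_OBJ, D)), np.zeros(D)
W_b, b_b = 0.01 * rng.standard_normal((4, D)), np.zeros(D)

def visual_input(region_feats, region_boxes):
    """Project region features and their box coordinates to dimension D
    and average the two projections (LXMERT-style position-aware embedding)."""
    return ((region_feats @ W_f + b_f) + (region_boxes @ W_b + b_b)) / 2

feats = rng.standard_normal((N_REGIONS, D_OBJ))  # Faster R-CNN features
boxes = rng.random((N_REGIONS, 4))               # normalized box coordinates
v = visual_input(feats, boxes)
assert v.shape == (N_REGIONS, D)
```

The textual input would be built analogously by summing the token embedding with a positional encoding.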

Taking and as inputs, CVLP adopts masking for both QueryNet and KeyNet. For QueryNet, we uniformly choose 15% of the input textual tokens for replacement. Some of the chosen tokens are replaced by the special [MASK] token, while the other tokens are substituted by a random token. For visual regions, we use a different masking strategy: the features of the chosen regions can either be set to zero or be replaced by region features from other images. Different from QueryNet, KeyNet only employs masking on the textual inputs, while keeping all visual region features unchanged. KeyNet and QueryNet are initialized to have the same layers and parameters. They both contain 9 layers of Language Transformer Encoder, 5 layers of Vision Transformer Encoder and 5 layers of Multi-Modality Fusion Transformer Encoder. For example, all layers in a KeyNet can be represented as:


$$\mathrm{KeyNet} = [\, r_1, \dots, r_5,\; l_1, \dots, l_9,\; c_1, \dots, c_5 \,],$$

where $r_i$ stands for a self-attention layer in the visual branch, $l_i$ stands for a self-attention layer in the language branch, and $c_i$ stands for a co-attention layer in the multi-modality fusion branch.
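The two masking strategies described above can be sketched in a few lines of NumPy. The 80/20 split between [MASK] and random-token replacement, the [MASK] token id, and the 50/50 choice between zeroing and swapping a region are assumptions, since the exact ratios are not given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 103  # id of the special [MASK] token (assumed)

def mask_tokens(token_ids, vocab_size, p=0.15):
    """QueryNet text masking: uniformly choose 15% of tokens; replace most
    with [MASK] and the rest with a random token."""
    ids = token_ids.copy()
    chosen = rng.random(len(ids)) < p
    use_mask = rng.random(len(ids)) < 0.8          # assumed 80/20 split
    ids[chosen & use_mask] = MASK_ID
    rand = rng.integers(0, vocab_size, len(ids))
    ids[chosen & ~use_mask] = rand[chosen & ~use_mask]
    return ids, chosen

def mask_regions(region_feats, other_image_feats, p=0.15):
    """QueryNet visual masking: zero out a chosen region's features or
    replace them with a region feature from another image."""
    feats = region_feats.copy()
    chosen = rng.random(len(feats)) < p
    for i in np.flatnonzero(chosen):
        if rng.random() < 0.5:                     # assumed 50/50 choice
            feats[i] = 0.0
        else:
            feats[i] = other_image_feats[rng.integers(len(other_image_feats))]
    return feats, chosen
```

KeyNet would call only `mask_tokens`, leaving all region features unchanged.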

The three encoders are implemented by three modules, namely the visual self-attention, language self-attention and visual-linguistic co-attention modules. Visual self-attention performs information fusion between region features by using these features as key, query and value in the attention model. We denote the key, query and value features as $K_v$, $Q_v$, $V_v$ for the visual branch and $K_l$, $Q_l$, $V_l$ for the language branch. The intra-modality information fusion for visual and language features can then be denoted as:

$$\hat{V} = \mathrm{Att}(Q_v, K_v, V_v), \qquad \hat{L} = \mathrm{Att}(Q_l, K_l, V_l),$$
where the attention module of a Transformer layer is defined as:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V.$$
After deploying intra-modality information flow for language and visual signals, we invoke an inter-modality fusion module to fuse the information from both language and visual features. The inter-modality fusion process is bi-directional, with information flowing from language to vision and vice versa:

$$\tilde{L} = \mathrm{Att}(Q_l, K_v, V_v), \qquad \tilde{V} = \mathrm{Att}(Q_v, K_l, V_l).$$
After intra-inter modality feature fusion, we can acquire a multi-modality contextual feature embedding for each word and visual region. A contextual feature encodes the multi-modality interaction in a compact feature vector. The contextual features are used by CVLP for the mask loss in the language branch and the contrastive loss in the visual branch.
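The attention operator used throughout the intra- and inter-modality fusion above is standard scaled dot-product attention; a minimal, self-contained NumPy sketch:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V
```

Intra-modality fusion calls this operator with Q, K and V from the same modality; the co-attention layers reuse it with Q from one modality and K, V from the other.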

3.2 Mask Loss for Language Branch

In the pretraining stage, CVLP performs different pretext tasks than LXMERT. CVLP does not contain a supervised learning task and is thus independent of human-annotated labels. For the language branch, we keep masked language modeling and image–sentence matching prediction as the two pretext tasks. The mask loss was first proposed by BERT; subsequent visual-linguistic BERT approaches add a visual feature mask loss alongside the masked language modeling loss. This loss masks out part of the input and predicts each masked element from the contextual representation obtained in Section 3.1. By optimizing the mask loss, the Transformer implicitly learns to encode contextual information, which facilitates generalization on downstream tasks. In CVLP, we only utilize the mask loss for the text inputs. Additionally, we add a matching loss: a binary Yes/No classification that predicts whether the sentence matches the visual features or not. The mask loss can be formalized as follows:

$$L_{\mathrm{MLM}} = -\,\mathbb{E}\left[\log P_{\theta}\!\left(w_m \mid w_{/m}\right)\right],$$

where $\theta$ denotes the parameters of the Language Transformer Encoder, and $w_m$ and $w_{/m}$ are the masked token to be predicted and the contextual tokens that are not masked, respectively. The matching loss is defined as:

$$L_{\mathrm{match}} = -\left[\, y \log p_{\theta}(f_{\mathrm{CLS}}) + (1 - y)\log\!\left(1 - p_{\theta}(f_{\mathrm{CLS}})\right) \right],$$

which is a binary classification task with match label $y \in \{0, 1\}$. In the above equation, $f_{\mathrm{CLS}}$ stands for the feature of the [CLS] token, which encodes the visual-linguistic information for tackling the image–sentence matching pretext task.
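The matching loss is ordinary binary cross-entropy on a logit derived from the [CLS] feature. A minimal sketch, where the classifier head that produces the scalar logit is assumed:

```python
import numpy as np

def matching_loss(cls_logit, label):
    """Binary cross-entropy on the [CLS] logit: label is 1 if the
    sentence matches the image and 0 otherwise."""
    p = 1.0 / (1.0 + np.exp(-cls_logit))  # sigmoid
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))
```

A confident correct prediction yields a near-zero loss, while a confident wrong prediction is penalized heavily.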

3.3 Contrastive Loss for Visual Branch

Contrastive learning performs self-supervised representation learning by discriminating a positive pair of visually similar features from a group of negative features. Given the visual region features extracted by Faster R-CNN, we obtain a positive query–key pair by feeding the same features into both QueryNet and KeyNet; region features from other batches serve as negative keys. We then conduct contrastive learning by updating the network weights to minimize the following loss:

$$L_{\mathrm{contrast}} = -\log \frac{\exp\!\left(s(q, k_{+})/\tau\right)}{\sum_{i=0}^{K} \exp\!\left(s(q, k_i)/\tau\right)},$$

where $\tau$ is the temperature of the softmax, $s(\cdot,\cdot)$ is a similarity function (the dot product of normalized features), $k_{+}$ is the positive key of query $q$, and the $K$ queued features serve as negative examples. Traditional contrastive learning is constrained by the number of negative examples, and in practice it is time-consuming to acquire a large pool of negative samples. Motivated by Momentum Contrast (MoCo) He et al. (2019), we build a dynamic visual memory queue to store the features generated by the KeyNet. The visual memory queue is empty at first, and features generated by the KeyNet are gradually pushed into it; as training goes on, we obtain a large visual queue to serve as negative examples. The performance of contrastive learning depends significantly on the feature diversity of this queue. Once the queue is full, we eliminate the oldest features. We denote the visual memory queue as:


$$\mathrm{queue} = \left\{\, f^{t}_{j,i} \,\right\},$$

where $f^{t}_{j,i}$ represents the visual feature of the $i$-th region of the $j$-th image in the $t$-th iteration batch. One drawback of the visual memory queue is feature drift during training: as the neural network is updated rapidly, the extracted features may become outdated fairly soon, which invalidates the negative examples stored in the visual queue. To resolve this, we define the weights of KeyNet as a moving average of QueryNet's weights as QueryNet is trained through stochastic gradient descent. The update of the network is denoted as:


where stands for a momentum value, and are the parameters of KeyNet and QueryNet respectively. This form of contrastive learning can achieve superior performance due to the large visual memory queue and the small feature drift during the training progress.
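The memory queue, the contrastive loss, and the momentum update described above can be sketched together in a few lines of NumPy. The queue capacity and the use of dot-product similarity on L2-normalized features follow MoCo conventions and are assumptions here.

```python
import numpy as np
from collections import deque

TAU = 0.07                  # softmax temperature (value from Section 4)
queue = deque(maxlen=4096)  # assumed capacity; oldest features fall off

def contrastive_loss(q, k_pos, negatives, tau=TAU):
    """InfoNCE: pull the query toward its positive key and push it away
    from the (queued) negative keys."""
    q = q / np.linalg.norm(q)
    keys = np.stack([k_pos] + list(negatives))
    keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = keys @ q / tau
    logits -= logits.max()                   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                 # positive key sits at index 0

def momentum_update(theta_k, theta_q, m=0.9999):
    """KeyNet parameters follow QueryNet as an exponential moving average."""
    return m * theta_k + (1 - m) * theta_q
```

After each training step, the KeyNet's contextual region features are pushed into `queue`, and `momentum_update` is applied parameter-wise to the KeyNet.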

3.4 Randomly Layer-Dropping Key Network

An important factor in unsupervised representation learning by contrastive learning is diversifying the negative examples. Contrastive learning is highly susceptible to overfitting, which invalidates the representation learning process. Indeed, we observe that the contrastive loss becomes very small as training proceeds, suggesting that overfitting has occurred. We thus increase the diversity of the features stored in the visual memory queue through a randomly layer-dropping key network. The droplayer strategy randomly drops self-attention and co-attention layers in KeyNet, which increases feature diversity and prevents overfitting during contrastive learning. The randomly layer-dropping key network can be defined as follows:

$$\mathrm{KeyNet}_{\mathrm{drop}} = \left[\, r_1, \mathrm{SPL}(r_2), r_3, \mathrm{SPL}(r_4), r_5,\; l_1, \mathrm{SPL}(l_2), \dots, \mathrm{SPL}(l_8), l_9,\; c_1, \mathrm{SPL}(c_2), c_3, \mathrm{SPL}(c_4), c_5 \,\right],$$

where $\mathrm{SPL}(\cdot)$ stands for randomly dropping the layer or keeping it, and $r_i$, $l_i$ and $c_i$ denote the visual self-attention, language self-attention and co-attention layers, respectively. As the above equation shows, the even layers of the KeyNet may be dropped during pretraining with a sampling probability of 0.5.
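The droplayer sampling, including the delayed variant evaluated in Table 3, can be sketched as follows. Layers are represented by strings for illustration, and the epoch threshold of 21 corresponds to the "Epoch 21–40" policy.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_keynet_layers(layers, epoch, start_epoch=21, p=0.5):
    """Delayed even-droplayer policy: from start_epoch on, each
    even-indexed KeyNet layer is skipped with probability p for the
    current forward pass; odd-indexed layers are always kept."""
    if epoch < start_epoch:
        return list(layers)
    return [layer for i, layer in enumerate(layers, start=1)
            if i % 2 == 1 or rng.random() >= p]
```

Epochs 1–20 run the full KeyNet; afterwards each forward pass samples a thinner KeyNet, diversifying the features pushed into the memory queue.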

4 Experiments

In this section, we first introduce the implementation details of the proposed contrastive visual-linguistic pretraining network. Then we conduct extensive ablation studies to demonstrate the effectiveness of the proposed method. CVLP is pretrained on the same dataset as LXMERT, namely MSCOCO and Visual Genome. To assess the learned visual-linguistic features, we conduct finetuning experiments and compare CVLP with state-of-the-art methods on three downstream tasks, i.e., VQA v2 Goyal et al. (2017), GQA Hudson and Manning (2019) and NLVR2 Suhr et al. (2018).

Figure 3: We show the compositions of pretext tasks used by various visual-linguistic pre-training models. Different pretext tasks require different levels of annotations and have multiple effects on downstream tasks. The height of the bars reflects the performance of each method on VQA Test-std.

Implementation Details.

Following LXMERT, we pretrain CVLP on the same image–text pairs from MSCOCO Lin et al. (2014) and Visual Genome Krishna et al. (2017). In the pretraining stage, the batch size is set to 256 and the initial learning rate is 1e-4. During finetuning, the batch size is reduced to 32 and the initial learning rates for the downstream tasks VQA, GQA and NLVR2 are 5e-5, 1e-5 and 5e-5, respectively. The temperature in the contrastive loss is set to 0.07. In both the pretraining and finetuning stages, CVLP is optimized with Adam Kingma and Ba (2014) on four Nvidia Tesla P100 GPUs.

4.1 Comparison with State-of-The-Art VLP Methods

We compare our proposed CVLP with previous visual-linguistic pretraining models, including ViLBERT Lu et al. (2019), VisualBERT Li et al. (2019), UNITER Chen et al. (2019) and LXMERT Tan and Bansal (2019). The pretraining losses utilized by each method are presented in Figure 3. All previous methods adopt masked visual region classification and regression; CVLP, in contrast, only needs a mask loss on the text modality and a contrastive loss on the visual modality. With the help of contrastive learning, CVLP achieves better performance on all three downstream tasks than previous approaches. In Table 1, we can also see that CVLP improves by 2.36% over the runner-up model UNITER on NLVR2. This is the largest of CVLP's improvements across the three datasets, suggesting that CVLP generalizes well in settings with a large domain gap.

| Method | VQA Test-dev / Test-std | GQA Test-dev-2020 | NLVR2 Test-P |
| --- | --- | --- | --- |
| Human | – | 89.30 | 96.30 |
| Image Only | – | 17.80 | 51.90 |
| Language Only | 44.30 / – | 41.10 | 51.10 |
| LXMERT Tan and Bansal (2019) | 72.42 / 72.54 | 61.39 | 74.45 |
| ViLBERT Lu et al. (2019) | 70.55 / 70.92 | – | – |
| VisualBERT Li et al. (2019) (w/o Ensemble) | 70.08 / 71.00 | – | 67.00 |
| UNITER Chen et al. (2019) | 72.27 / 72.46 | – | 75.58 |
| CVLP (finetune w/o momentum) | 72.77 / 72.90 | 61.55 | 76.20 |
| CVLP (finetune with momentum) | 72.87 / 73.05 | 61.78 | 76.81 |

Table 1: Performance comparison between CVLP and state-of-the-art visual-linguistic pretraining approaches on the test splits of VQA v2, GQA 2020, and NLVR2. For all datasets, the best accuracy is in bold while the second best accuracy is underlined.
| Momentum value for pretraining | m = 0.999 | m = 0.9999 | m = 0.99995 | m = 0.99999 |
| --- | --- | --- | --- | --- |
| Acc (%) | 70.18 | 70.40 | 70.62 | 70.08 |

Table 2: Comparison of momentum values in the pretraining stage on the VQA Dev-set.
| Droplayer Policy | KeyNet w/o Droplayers (Epoch 1–40) | KeyNet with Even Droplayers (Epoch 1–40) | KeyNet with Delayed Even Droplayers (Epoch 21–40) |
| --- | --- | --- | --- |
| Acc (%) | 70.27 | 70.06 | 70.62 |

Table 3: Comparison of different droplayer policies on the VQA Dev-set.
| Methods | VQA Dev-set | GQA Dev-set | NLVR2 Dev-set |
| --- | --- | --- | --- |
| No Vision Task | 66.30 | 57.10 | 50.90 |
| Feature Regression | 69.10 | 59.45 | 72.89 |
| Feature Regression + Label | 69.90 | 59.80 | 74.51 |
| Contrastive Learning | 70.62 | 59.21 | 76.47 |

Table 4: Comparison of different loss compositions on the dev splits of VQA, GQA and NLVR2.
| Momentum value for fine-tuning | m = 0.9995 | m = 0.99995 | m = 0.99997 |
| --- | --- | --- | --- |
| NLVR2 (1 Epoch = 2699 Iterations) | 76.47 | 72.19 | – |
| VQA (1 Epoch = 19753 Iterations) | 70.31 | 70.62 | – |
| GQA (1 Epoch = 33595 Iterations) | – | 58.98 | 59.21 |

Table 5: Comparison of momentum values in the fine-tuning stage on the dev splits of NLVR2, VQA, and GQA.

4.2 Ablation Studies and Analyses of CVLP

Effects of Momentum Value. The momentum controls how quickly the weights of the key network move. A large momentum results in slow feature drift. From Table 2, we can infer that a larger momentum yields better performance on VQA because the feature drift is reduced. However, as the momentum approaches 1, performance drops significantly because the weights of the key network stop accepting new information. In our experiments, we empirically determine a proper momentum value as approximately $m = 1 - \frac{1}{s}$, where $s$ is the number of iteration steps in one epoch.
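A quick arithmetic check, assuming the rule m = 1 - 1/s with s the number of iterations per epoch: the resulting values line up with the best-performing momentum columns of Table 5 for VQA and GQA, and fall near the closest tested value for NLVR2.

```python
# iterations per epoch are taken from Table 5
for dataset, s in [("NLVR2", 2699), ("VQA", 19753), ("GQA", 33595)]:
    m = 1 - 1 / s
    print(f"{dataset}: m = {m:.5f}")
# NLVR2: m = 0.99963
# VQA:   m = 0.99995
# GQA:   m = 0.99997
```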

Effects of Droplayer Policy. Due to its powerful discrimination ability, contrastive learning easily overfits the training data. The droplayer in the key network is an important technique to tackle this overfitting problem. As shown in Table 3, applying the droplayer on even layers throughout training decreases performance. In contrast, a delayed droplayer policy, which takes effect on even layers after 20 epochs, significantly improves performance over the key network without droplayer. This experiment demonstrates the effectiveness of our proposed droplayer technique.

Effects of Loss Composition. In Table 4, we perform an ablation study over different loss combinations. "No Vision Task" performs visual-linguistic pretraining without adding masks on the visual features. Adding feature regression over masked visual regions improves performance, and adding label classification on top improves it further. After replacing the feature regression and label classification losses with the contrastive loss, we achieve improved performance over all three LXMERT variants on VQA and NLVR2; the performance on GQA is worse than LXMERT. This supports our claim that contrastive learning performs better when the gap between the pretraining and finetuning domains is large.

Figure 4: Illustration of attention weights in odd layers of the Co-attention Encoder. The lighter the color, the greater the attention weight.

Visualizing the CVLP Encoder.

In Figure 4, we visualize the attention weights in the odd layers (i.e., the 1st, 3rd and 5th layers) of the Co-attention Encoder in CVLP. We can see that as the depth grows, the attention weights indicating correct word–bounding-box matches also increase gradually.

5 Conclusion

In this paper, we propose a contrastive learning based pretraining approach for visual-linguistic representation. CVLP is not biased by the visual features pretrained on Visual Genome and can achieve superior performance, particularly when there is a substantial gap between pretraining and the downstream task. Extensive experiments and ablation studies over VQA, NLVR2, as well as GQA demonstrate the effectiveness of CVLP.

Broader Impact

Deep learning algorithms frequently achieve superior performance on supervised tasks. However, due to their large numbers of parameters, they often require high-quality and abundant training labels, and such annotation can be time-consuming and expensive. Our proposed CVLP performs high-quality representation learning based on self-supervised pretext tasks. We believe our research can benefit many deep learning applications and decrease the overall cost of training and deploying deep learning systems. Large-scale pretraining with models that can cope with a domain gap has the potential to reduce energy usage, as one does not need to train a model from scratch for new domains. Moreover, self-supervised learning allows us to learn from more of the available unlabeled data, helping to mitigate the well-known problems of bias and fairness in human annotations. Still, it remains important to consider the distribution of the unlabeled data to avoid introducing biases into the model.


  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6077–6086. Cited by: §2.3.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §2.3.
  • H. Ben-Younes, R. Cadene, M. Cord, and N. Thome (2017) Mutan: multimodal tucker fusion for visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2612–2620. Cited by: §2.3.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020a) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §2.2.
  • X. Chen, H. Fan, R. Girshick, and K. He (2020b) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §2.2.
  • Y. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2019) Uniter: learning universal image-text representations. arXiv preprint arXiv:1909.11740. Cited by: §4.1, Table 1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §2.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §2.1.
  • A. Fan, E. Grave, and A. Joulin (2019) Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556. Cited by: §1.
  • A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847. Cited by: §2.3.
  • P. Gao, Z. Jiang, H. You, P. Lu, S. C. Hoi, X. Wang, and H. Li (2019a) Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6639–6648. Cited by: §1, §2.3.
  • P. Gao, H. Li, S. Li, P. Lu, Y. Li, S. C. Hoi, and X. Wang (2018) Question-guided hybrid convolution for visual question answering. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 469–485. Cited by: §2.3.
  • P. Gao, H. You, Z. Zhang, X. Wang, and H. Li (2019b) Multi-modality latent interaction network for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5825–5835. Cited by: §2.3.
  • S. Geng, P. Gao, C. Hori, J. L. Roux, and A. Cherian (2020a) Spatio-temporal scene graphs for video dialog. arXiv preprint arXiv:2007.03848. Cited by: §2.3.
  • S. Geng, J. Zhang, Z. Fu, P. Gao, H. Zhang, and G. de Melo (2020b) Character matters: video story understanding with character-aware relations. arXiv preprint arXiv:2005.08646. Cited by: §2.3.
  • Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA matter: elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §1, §2.2, §3.3.
  • D. A. Hudson and C. D. Manning (2019) GQA: a new dataset for real-world visual reasoning and compositional question answering. Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §4.
  • J. Kim, J. Jun, and B. Zhang (2018) Bilinear attention networks. In Advances in Neural Information Processing Systems, pp. 1564–1574. Cited by: §1, §2.3.
  • J. Kim, K. On, W. Lim, J. Kim, J. Ha, and B. Zhang (2016) Hadamard product for low-rank bilinear pooling. arXiv preprint arXiv:1610.04325. Cited by: §2.3.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1920–1929. Cited by: §2.1.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §1, §3.1, §4.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §2.1.
  • L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §4.1, Table 1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §2.3, §4.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13–23. Cited by: §1, §1, §2.1, §4.1, Table 1.
  • D. Nguyen and T. Okatani (2018) Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6087–6096. Cited by: §1, §2.3.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Cited by: §2.1.
  • D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2536–2544. Cited by: §2.1.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-assets/researchcovers/languageunsupervised/language understanding paper. pdf. Cited by: §1, §2.1.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §3.1.
  • L. Shi, S. Geng, K. Shuang, C. Hori, S. Liu, P. Gao, and S. Su (2020) Multi-layer content interaction through quaternion product for visual question answering. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4412–4416. Cited by: §2.3.
  • A. Suhr, M. Lewis, J. Yeh, and Y. Artzi (2017) A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 217–223. Cited by: §1.
  • A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, and Y. Artzi (2018) A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491. Cited by: §4.
  • H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §1, §1, §2.1, §4.1, Table 1.
  • Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §2.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) Glue: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. Cited by: §2.1.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §3.1.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §1, §2.2.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §2.3.
  • Z. Yang, X. He, J. Gao, L. Deng, and A. Smola (2016) Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 21–29. Cited by: §2.3.
  • Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian (2019) Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6281–6290. Cited by: §1, §2.3.