mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

by   Chenliang Li, et al.

Large-scale pretrained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from the problems of low computational efficiency and information asymmetry brought by the long visual sequence in cross-modal alignment. To address these problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections, which creates inter-layer shortcuts that skip a certain number of layers for time-consuming full self-attention on the vision side. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability when directly transferred to multiple video-language tasks.





1 Introduction

Large-scale pre-training of vision-language models has recently achieved tremendous success on a wide range of cross-modal tasks [54, 11, 25, 36, 65, 33, 58]. Such vision-language models learn cross-modal representations from large quantities of image-text pairs by aligning the visual and linguistic modalities. A central challenge in learning vision-language models is finding a good alignment between the two modalities to close the semantic gap between them.

Figure 1: Illustration of two conventional cross-modal fusion networks and our proposed cross-modal skip-connected network. We compare the running time and performance of different fusion networks, where the total fusion layers, image encoder and text encoder are all kept the same. The running time is the total forward time of 100 samples in different fusion networks.

To discover a cross-modal alignment, prior studies [36, 3, 69] employ a pre-trained object detector to extract salient regions from images, which are then aligned with their language counterparts. Such an architecture, however, is generally limited by the power of the object detector, the pre-defined visual semantics it can represent, and the quantity of annotations available. Besides, it is also computationally expensive to extract region-based visual features from high-resolution (e.g. 600×1000) images. More recent work [25, 58, 33, 29, 16], which scales and performs better on many vision-language tasks, drops the requirement of pre-trained object detection and enables a direct alignment between the image and text representations in an end-to-end manner. These models extract finer-grained visual representations with a long sequence of image patches or grids for good vision understanding [16]. However, there are two significant problems in modeling long visual sequences: 1) efficiency: full self-attention on long visual sequences requires much more computation than that on textual sequences, and 2) information asymmetry: the caption text in widely-used image-text pre-training data is usually short and highly abstract, while more detailed and diverse information can be extracted from the image. This asymmetry presents challenges for effective multi-modal fusion between the modalities.

One straightforward way of multi-modal fusion is the connected-attention network shown in Figure 1 (a). It adopts a single Transformer [55] network for early fusion of vision and language by simply taking the concatenation of visual and linguistic features as input [34]. This paradigm allows self-attention to discover alignments between the modalities from the bottom level, but it requires full self-attention on the concatenation of cross-modal sequences, which is rather time-consuming. Besides, this type of method processes information from both modalities equally, and may therefore suffer from information asymmetry, especially when there is a big difference in information density or sequence length between the modalities.

Another line of work keeps separate Transformer networks for both textual and visual features, and uses techniques such as cross-attention to enable cross-modal interaction [16], as shown in Figure 1 (b). This architecture design conducts multi-modal fusion on both modalities independently, which can help alleviate the information asymmetry problem. However, it still suffers from computation inefficiency for full self-attention on long visual sequences, and it is not that parameter-efficient with two separate Transformer networks.

In this work, we propose mPLUG, a unified Multi-modal Pre-training framework for both vision-Language Understanding and Generation. mPLUG performs effective and efficient vision-language learning with novel cross-modal skip-connections to address the fundamental information asymmetry problem. Instead of fusing visual and linguistic representations at the same levels, the cross-modal skip-connections enable the fusion to occur at disparate levels in the abstraction hierarchy across the modalities. They create inter-layer shortcuts that skip a certain number of layers for visual representations to reflect the semantic richness of language compared to vision. As shown in Figure 1 (c), in each block of our cross-modal skip-connected network, mPLUG first adopts an asymmetric co-attention architecture at the first few layers for efficiency, by removing the co-attention on the vision side. It is then followed by one layer of connected-attention, which takes the concatenation of the original visual representation and the co-attention output on the language side as input. In addition to the modeling efficacy due to the asymmetry, the cross-modal skip-connections ease model training by alleviating vanishing gradients with the inserted shortcuts. Figure 1 shows that the new cross-modal skip-connected network achieves superior performance while running at least four times faster than the other cross-modal fusion networks.

Our key contributions can be summarized as follows:

  • We propose mPLUG, a unified vision-language pre-trained model for cross-modal understanding and generation that targets both effectiveness and efficiency in cross-modal learning.

  • We introduce a new asymmetric vision-language architecture with novel cross-modal skip-connections, to address two fundamental problems of information asymmetry and computation inefficiency in multi-modal fusion.

  • mPLUG achieves state-of-the-art performance on a wide range of vision-language tasks, including image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability when directly transferred to a wide range of vision-language and video-language tasks.

2 Related Work

2.1 Vision-Language Pre-training

Vision-Language pre-training (VLP) has recently received tremendous success and achieved state-of-the-art results across a variety of vision-language tasks [4, 9, 66]. In terms of how information from different modalities is aggregated, typical approaches to VLP [54, 11, 25, 65, 33, 46, 26] can be roughly divided into two categories: dual encoder and fusion encoder. The dual encoder approach utilizes two single-modal encoders to encode images and text separately, and then uses simple functions such as the dot product to model the instance-level cross-modal interaction between image and text. The advantage of dual encoder models like CLIP [46] and ALIGN [26] is that image and text representations can be pre-computed and cached, which is quite computation-efficient and more appropriate for retrieval tasks. However, they tend to fail in handling more complicated VL understanding tasks that require complex reasoning, such as visual question answering [4]. In contrast, the fusion encoder approach uses deep fusion functions such as multi-layer self-attention and cross-attention networks to model the fine-grained cross-modal interaction between image and text sequences. Representative methods of this category include single-stream architectures such as UNITER [11] and OSCAR [36], and two-stream architectures such as LXMERT [54], ALBEF [33] and ERNIE-ViL [65]. Methods of this kind can better capture the underlying association between image and text for vision-language understanding tasks, but they need to jointly encode all possible image-text pairs, which leads to a relatively slow inference speed.

To improve the inference speed, some recent work such as Pixel-BERT [25], E2E-VLP [60] and ViLT [29] removes the complicated object detector in feature extraction, and conducts end-to-end VL learning with CNN-based grid features and linearly projected patch embeddings, respectively. To combine the benefits of both categories of architectures, VLMo [57] further unifies the dual encoder and fusion encoder modules with a shared mixture-of-modality-experts Transformer. In this work, mPLUG introduces a new cross-modal fusion mechanism with cross-modal skip-connections, which enables the fusion to occur at disparate levels in the abstraction hierarchy across the modalities. It achieves superior effectiveness and efficiency across a wide range of VL tasks.

2.2 Skip-connection

Skip-connection is a popular technique to bypass the gradient exploding or vanishing problem for model optimization in deep neural networks, and it is widely used in CV and NLP architectures such as ResNet [22] and Transformer [55]. A variety of skip-connection methods have been proposed in recent years [51, 22, 55, 24, 53, 38]. ResNet [22] introduces summed shortcut connections between different layers using a simple identity mapping, while the highway network [51] designs a transform gating function to control the balance of the input and the transformed input. DenseNet [24] designs new architectures with concatenated skip-connections, allowing the subsequent layers to re-use all the middle representations of previous layers. Layer normalization and recursive skip connections are further used in combination with the plain skip connection to further stabilize model optimization and better incorporate the transformed input [55, 38]. In this work, mPLUG proposes a new cross-modal skip-connection method to address the cross-modal fusion problem, and combines the concatenated skip-connection and the summed skip-connection to choose, at each layer, whether to attend to all the concatenated representations of different modalities or just focus on the cross-modal interaction part.

Figure 2: The model architecture and objectives of mPLUG, which consists of two unimodal encoders for images and text separately, a cross-modal skip-connected network and a decoder for text generation. An image-text contrastive loss is first applied to align the unimodal representations from the visual encoder and text encoder. Then, we use a novel cross-modal skip-connected network to fuse the visual and linguistic representations effectively and efficiently. We adopt connected cross-modal fusion after every $S$ asymmetric co-attention layers, where $S$ is a fixed stride value. Based on the connected representation of the image and prefix sub-sequence, the decoder is trained with a prefix language modeling (Prefix LM) loss by generating the remaining caption.


3 mPLUG

In this section, we first introduce our new model architecture with the key module of the cross-modal skip-connected network, and then give the details of the pre-training objectives and scalable training infrastructure.

3.1 Model Architecture

As shown in Figure 2, mPLUG consists of two unimodal encoders for image and text independently, a cross-modal skip-connected network and a decoder for text generation. To better model the inherent modality bias information, we first use two unimodal encoders to encode image and text separately. Following [16, 50], we use a visual transformer [15] directly on the image patches as the visual encoder, which is more computation-friendly than using pre-trained object detectors for visual feature extraction [3, 69]. The visual encoder divides an input image into patches and encodes them as a sequence of embeddings $\{v_{cls}, v_1, v_2, \ldots, v_M\}$ with an additional [CLS] token. The input text is fed to the text encoder and represented as a sequence of embeddings $\{l_{cls}, l_1, l_2, \ldots, l_T\}$, where $l_{cls}$ is the embedding of the [CLS] token and is used to summarize the input text. Then, the visual and linguistic representations are fed into a cross-modal skip-connected network, which consists of multiple skip-connected fusion blocks. In each skip-connected fusion block, we apply connected cross-modal fusion after every $S$ asymmetric co-attention layers, where $S$ is a fixed stride value. The aim of this network is to take advantage of the effectiveness of the connected cross-modal fusion and the efficiency of the asymmetric co-attention for enhanced cross-modal fusion in a recursive manner. Finally, the output cross-modal representations are fed into a Transformer decoder for sequence-to-sequence learning, which equips mPLUG with both understanding and generation capabilities.

3.2 Cross-modal Skip-connected Network

The cross-modal skip-connected network consists of $N$ skip-connected fusion blocks. In each skip-connected fusion block, we apply one connected-attention layer after every $S$ asymmetric co-attention layers, where $S$ is a fixed stride value. We first pass the text feature and image feature from the unimodal encoders through the $S$ asymmetric co-attention layers, and then connect the output text feature and the image feature as input to one connected-attention layer. We repeat the skip-connected fusion block $N$ times to obtain the final connected image and text representation.

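Specifically, the asymmetric co-attention is composed of the self-attention (SA) layer, the cross-attention (CA) layer and the feed-forward network (FFN). The input text feature $l^{n-1}$ is first fed to the self-attention layer, and then the visual feature $v^{n-1}$ is injected into the text feature $l^{n}_{SA}$ by the cross-attention layer, which gives $l^{n}_{CA}$. The outputs of self-attention and cross-attention are added up and fed to the FFN layer to obtain the visual-aware text representation $l^{n}$:

$$l^{n}_{SA} = \mathrm{LN}\big(\mathrm{SA}(l^{n-1}) + l^{n-1}\big)$$

$$l^{n}_{CA} = \mathrm{LN}\big(\mathrm{CA}(l^{n}_{SA}, v^{n-1}) + l^{n}_{SA}\big)$$

$$l^{n} = \mathrm{LN}\big(\mathrm{FFN}(l^{n}_{CA}) + l^{n}_{CA}\big)$$

where LN is short for layer normalization.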

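The connected-attention layer is composed of the self-attention (SA) layer and the feed-forward network (FFN). We connect the image feature $v^{n-1}$ and the input text feature $l^{n-1}$, where $l^{n-1}$ is the output of the $S$ asymmetric co-attention layers. The connected image and text features $[v^{n-1}; l^{n-1}]$ are fed to the self-attention layer and the FFN layer:

$$[v^{n}_{SA}; l^{n}_{SA}] = \mathrm{LN}\big(\mathrm{SA}([v^{n-1}; l^{n-1}]) + [v^{n-1}; l^{n-1}]\big)$$

$$[v^{n}; l^{n}] = \mathrm{LN}\big(\mathrm{FFN}([v^{n}_{SA}; l^{n}_{SA}]) + [v^{n}_{SA}; l^{n}_{SA}]\big)$$

Then $[v^{n}; l^{n}]$ is fed into the next skip-connected fusion block, and this is repeated to obtain the final connected image and text representation. Finally, the connected output is fed into a Transformer decoder for sequence-to-sequence learning.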
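The data flow of one skip-connected fusion block can be illustrated with a minimal pure-Python sketch. It is a toy under loudly stated assumptions, not the authors' implementation: attention is single-head with no learned projections, the FFN is replaced by an element-wise ReLU, and all function names (`asymmetric_co_attention`, `connected_attention`, `fusion_network`) are hypothetical. What it does preserve is the structure described above: the image feature is left untouched for `stride` text-side layers and is only updated in the connected-attention layer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize each row vector to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((a - mean) ** 2 for a in row) / len(row)
        out.append([(a - mean) / math.sqrt(var + eps) for a in row])
    return out

def attention(queries, keys_values):
    """Toy scaled dot-product attention without learned projections."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(d)])
    return out

def ffn(x):
    """Stand-in for the feed-forward network (assumption: plain ReLU)."""
    return [[max(0.0, a) for a in row] for row in x]

def add(x, y):
    """Element-wise residual sum of two equally shaped sequences."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]

def asymmetric_co_attention(text, image):
    """SA on text, CA from text to image, then FFN; the image is not updated."""
    l_sa = layer_norm(add(attention(text, text), text))
    l_ca = layer_norm(add(attention(l_sa, image), l_sa))
    return layer_norm(add(ffn(l_ca), l_ca))

def connected_attention(image, text):
    """Full self-attention over the concatenated [image; text] sequence."""
    joint = image + text  # list concatenation along the sequence axis
    h = layer_norm(add(attention(joint, joint), joint))
    h = layer_norm(add(ffn(h), h))
    return h[:len(image)], h[len(image):]

def skip_connected_block(image, text, stride):
    """`stride` asymmetric layers followed by one connected-attention layer."""
    for _ in range(stride):
        text = asymmetric_co_attention(text, image)
    return connected_attention(image, text)

def fusion_network(image, text, num_blocks, stride):
    """Repeat the block and return the final connected representation."""
    for _ in range(num_blocks):
        image, text = skip_connected_block(image, text, stride)
    return image + text
```

In this sketch the original image feature reaches the connected-attention layer directly, which is the cross-modal shortcut; the expensive full self-attention over the long visual sequence runs once per block rather than once per layer.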

3.3 Pre-training Tasks

We perform four pre-training tasks including three understanding tasks (Image-Text Contrastive Learning, Image-Text Matching, Masked Language Modeling) and one generation task (Prefix Language Modeling). These pre-training tasks are optimized jointly.

Image-Text Contrastive (ITC): Following [33], we employ this task to align the image features and the text features from the unimodal encoders. Specifically, we calculate the softmax-normalized image-to-text and text-to-image similarity, and maintain two dynamic memory queues (text, image) to increase the number of negative examples, as in MoCo.


Image-Text Matching (ITM): This task aims to predict, based on the cross-modal representation, whether an image and a sentence match each other. We also select hard negative image-text pairs based on the contrastive text-image similarity, as in [33].

Masked Language Modeling (MLM): The task setup is basically the same as in BERT [14], where we randomly mask 15% of the tokens in the text and the model is asked to predict these masked words with the cross-modal representations.

Prefix Language Modeling (PrefixLM): This task aims to generate the caption given an image and to predict the text segment subsequent to the cross-modal context, as in [6]. It optimizes a cross-entropy loss by maximizing the likelihood of the text in an autoregressive manner.
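To make the image-text contrastive objective above concrete, the following is a minimal sketch of a symmetric softmax-normalized contrastive loss over a batch. It is an illustration under assumptions rather than the paper's code: the temperature value `0.07` and the function name `itc_loss` are hypothetical, the features are assumed to be already L2-normalized, and the momentum queues of extra negatives are omitted for brevity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def itc_loss(image_feats, text_feats, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    The k-th image and the k-th text form the positive pair; every other
    item in the batch acts as a negative (no memory queue in this sketch).
    """
    n = len(image_feats)
    sim = [[dot(img, txt) / temperature for txt in text_feats] for img in image_feats]
    # Image-to-text: normalize each row of the similarity matrix.
    i2t = sum(-math.log(softmax(sim[k])[k]) for k in range(n)) / n
    # Text-to-image: normalize each column of the similarity matrix.
    t2i = sum(-math.log(softmax([sim[r][k] for r in range(n)])[k]) for k in range(n)) / n
    return (i2t + t2i) / 2
```

A batch whose matching pairs point in the same direction yields a loss close to zero, while pairing each image with the wrong caption yields a large loss, which is the signal that pulls the two unimodal encoders into a shared space.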

4 Distributed Learning on a Large Scale

Training a big model like mPLUG on large-scale datasets faces many efficiency challenges. We increase the throughput from the perspective of reducing memory usage and computation time, thereby accelerating the training of the model.

The memory usage during model training is mainly composed of two parts: the static memory usage for parameters, optimizer states and gradients, and the runtime memory usage caused by intermediate variables such as activation values. For the static memory overhead, we use the ZeRO [47] technique to partition parameters, optimizer states and gradients across the entire data-parallel group, so that the static memory overhead of a single GPU is reduced to approximately $1/N_d$ of the unpartitioned footprint, where $N_d$ denotes the number of GPU cards. For the runtime memory cost, we use gradient checkpointing [8], which greatly reduces the runtime memory usage at the expense of extra forward computation: part of the activation values are recomputed during the backward pass instead of being kept in memory.
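As a rough worked example of the static-memory saving described above, the sketch below divides the model-state footprint evenly across the data-parallel group. The default of 16 bytes per parameter is an assumption taken from the ZeRO paper's accounting for mixed-precision Adam (2 bytes of weights, 2 bytes of gradients and 12 bytes of optimizer state); the 1-billion-parameter model in the usage note and the function name `static_memory_per_gpu` are hypothetical and do not describe mPLUG itself.

```python
def static_memory_per_gpu(num_params, num_gpus, bytes_per_param=16):
    """Approximate static memory in bytes per GPU with fully partitioned model states.

    bytes_per_param=16 is an assumed figure for mixed-precision Adam
    (weights + gradients + optimizer states), not a number from this paper.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_gpus

# Hypothetical 1B-parameter model on the 16 GPUs used for pre-training.
unpartitioned = static_memory_per_gpu(1_000_000_000, 1)
partitioned = static_memory_per_gpu(1_000_000_000, 16)
```

Under these assumptions the per-GPU static footprint drops from 16 GB to 1 GB; real savings are somewhat smaller because of communication buffers and memory fragmentation.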

To reduce the computation time, we use BF16 precision training. BF16 is a data type supported by NVIDIA Ampere-architecture GPUs such as the A100. Compared with the previously widely used mixed-precision training with FP16 and FP32, BF16 has the same representation range as FP32, which reduces the risk of numerical overflow and helps keep model convergence stable, while offering the same fast computing speed as FP16.
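The range argument can be checked with the standard library alone. The sketch below emulates BF16 by keeping the top 16 bits of the FP32 encoding, so the 8 exponent bits are preserved and only mantissa precision is lost. This truncation is a simplification (hardware typically rounds to nearest), and the helper names `to_bfloat16` and `fits_in_fp16` are hypothetical.

```python
import struct

def to_bfloat16(x):
    """Emulate BF16 by truncating the float32 bit pattern to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def fits_in_fp16(x):
    """Return True if x can be packed as an IEEE half-precision float."""
    try:
        struct.pack("<e", x)
        return True
    except (OverflowError, struct.error):
        return False

value = 70000.0                     # larger than the FP16 maximum of 65504
bf16_value = to_bfloat16(value)     # still finite, but coarser than FP32
fp16_ok = fits_in_fp16(value)       # FP16 cannot represent this magnitude
```

A value such as 70000 overflows FP16 but survives the BF16 conversion with reduced precision, which is why loss scaling is far less critical with BF16 than with FP16.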

5 Experiments

5.1 Data & Setup

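| Models | Data | CE B@4 | CE M | CE C | CE S | CIDEr-opt B@4 | CIDEr-opt M | CIDEr-opt C | CIDEr-opt S | NoCaps C | NoCaps S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Encoder-Decoder | CC12M | - | - | 110.9 | - | - | - | - | - | 90.2 | 12.1 |
| E2E-VLP [60] | 4M | 36.2 | - | 117.3 | - | - | - | - | - | - | - |
| VinVL [69] | 5.65M | 38.5 | 30.4 | 130.8 | 23.4 | 41.0 | 31.1 | 140.9 | 25.2 | 97.3 | 13.8 |
| OSCAR [36] | 6.5M | - | - | - | - | 41.7 | 30.6 | 140.0 | 24.5 | 83.4 | 11.4 |
| SimVLM [58] | 1.8B | 40.3 | 33.4 | 142.6 | 24.7 | - | - | - | - | - | - |
| LEMON [23] | 200M | 40.6 | 30.4 | 135.7 | 23.5 | 42.3 | 31.2 | 144.3 | 25.3 | 113.4 | 15.0 |
| BLIP [32] | 129M | 40.4 | - | 136.7 | - | - | - | - | - | 113.2 | 14.8 |
| OFA [56] | 18M | - | - | - | - | 43.5 | 31.9 | 149.6 | 26.1 | - | - |
| mPLUG | 14M | 43.1 | 31.4 | 141.0 | 24.2 | 46.5 | 32.0 | 155.1 | 26.0 | 114.8 | 14.8 |

Table 1: Evaluation results on the COCO Caption "Karpathy" test split and the NoCaps validation set. CE: COCO Caption with cross-entropy optimization; CIDEr-opt: COCO Caption with CIDEr optimization; B@4: BLEU@4, M: METEOR, C: CIDEr, S: SPICE.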

Following previous work [33], we use the same pre-training dataset of 14M images with texts, which includes two in-domain datasets (MS COCO [37] and Visual Genome [30]) and three out-of-domain web datasets (Conceptual Captions [49], Conceptual 12M [7] and SBU Captions [44]).

We pre-train the model for 30 epochs with a total batch size of 1024 on 16 NVIDIA A100 GPUs. We use a 6-layer Transformer for both the text encoder and the cross-modal skip-connected network, and a 12-layer Transformer for the decoder. The text encoder is initialized using the first 6 layers of the BERT-base [14] model and the skip-connected network is initialized using the last 6 layers of BERT-base. We initialize the visual encoder with CLIP-ViT [46] pre-trained on 400M noisy image-text pairs. The visual transformer with ViT-B/16 is used as our base architecture, and the one with ViT-L/14 as the large architecture. We use the AdamW [39] optimizer with a weight decay of 0.02. The learning rate is warmed up to 1e-5 (ViT-B/16) and 1e-4 (BERT-base) for mPLUG (ViT-B), and to 5e-6 (ViT-L/14) and 5e-5 (BERT-base) for mPLUG (ViT-L), in the first 1000 iterations, and then decayed to 1e-6 following a cosine schedule. During pre-training, we take random image crops of resolution 256×256 (ViT-B/16) or 224×224 (ViT-L/14) as input, and also apply RandAugment [13] to improve the generalization of the vision encoders. For VQA and image captioning tasks, we perform additional continued pre-training on 4M image-text pairs. We increase the image resolution during fine-tuning. For image-text contrastive learning, the queue size is set to 65,536 and the momentum coefficient is set to 0.995.
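The warm-up and cosine decay described above can be sketched as a single function of the training step. The peak (1e-5), floor (1e-6) and 1000 warm-up iterations are the values quoted for the ViT-B/16 visual encoder; the linear shape of the warm-up, the `total_steps` value used in the usage note and the function name `learning_rate` are assumptions made for illustration.

```python
import math

def learning_rate(step, total_steps, peak=1e-5, final=1e-6, warmup_steps=1000):
    """Linear warm-up to `peak`, then cosine decay down to `final`.

    The linear warm-up shape is an assumption; the text only states that
    the rate is warmed up during the first 1000 iterations.
    """
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return final + 0.5 * (peak - final) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run length of 10,000 steps, sampled at a few points.
schedule = [learning_rate(s, 10_000) for s in (0, 500, 1000, 5500, 10_000)]
```

The same function covers the other parameter groups by passing their own `peak` value, for example `peak=1e-4` for the BERT-initialized layers of the base model.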

5.2 Evaluation on Vision-Language Tasks

We compare our pre-trained model against other VLP models on six downstream V+L tasks. We introduce each task and our fine-tuning strategy below. Details of the datasets and fine-tuning hyperparameters are in the Appendix.

| Models | Data | Test-dev | Test-std |
|---|---|---|---|
| Pretrained on COCO, VG, SBU and CC datasets | | | |
| VLBERT [40] | 4M | 71.16 | - |
| E2E-VLP [60] | 4M | 73.25 | 73.67 |
| VL-T5 [12] | 4M | - | 71.30 |
| UNITER [11] | 4M | 72.70 | 72.91 |
| OSCAR [36] | 4M | 73.16 | 73.44 |
| CLIP-ViL [50] | 4M | 76.48 | 76.94 |
| METER [16] | 4M | 77.68 | 77.64 |
| ALBEF [33] | 4M | 74.54 | 74.70 |
| mPLUG (ViT-B) | 4M | 77.94 | 77.96 |
| Models pretrained on more data | | | |
| ALBEF [33] | 14M | 75.84 | 76.04 |
| BLIP [32] | 129M | 78.25 | 78.32 |
| SimVLM [58] | 1.8B | 80.03 | 80.34 |
| Florence [67] | 0.9B | 80.16 | 80.36 |
| OFA [56] | 18M | 79.87 | 80.02 |
| VLMo [57] | - | 79.94 | 79.98 |
| mPLUG (ViT-B) | 14M | 79.79 | 79.81 |
| mPLUG (ViT-L) | 14M | 81.27 | 81.26 |

Table 2: Evaluation results on the VQA test sets.

5.2.1 Visual Question Answering

The VQA task [4] requires the model to answer natural language questions given an image. Most methods [54, 57, 36, 58] treat visual question answering as multi-label classification over a predefined answer set. This strategy achieves strong performance, but it is not suitable for real-world open scenarios. We treat VQA as an answer generation task and directly use unconstrained open-vocabulary generation during inference, which differs from models that use constrained closed-vocabulary generation [33, 56]. Following [36, 56], we concatenate the question with the object labels and OCR tokens extracted from the image. As shown in Table 2, mPLUG (ViT-L) achieves 81.27 on the Test-dev split and 81.26 on the Test-std split, outperforming SOTA models including SimVLM and Florence, which use roughly 100× and 60× more pre-training image-text pairs, respectively. With the same 4M pre-training data, mPLUG outperforms CLIP-ViL and METER, which also use CLIP [46] as the visual encoder. Besides, under the same settings, mPLUG consistently and significantly outperforms ALBEF and BLIP, which rely only on co-attention from images to text for cross-modal fusion. The gain likely derives from the design of the cross-modal skip-connections, which specifically targets the information asymmetry of the two modalities. Neither ALBEF nor BLIP addresses this problem well, and both are biased towards the language modality.

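| Models | # Pretrain data | COCO TR R@1 | COCO TR R@5 | COCO TR R@10 | COCO IR R@1 | COCO IR R@5 | COCO IR R@10 | Flickr30K TR R@1 | Flickr30K TR R@5 | Flickr30K TR R@10 | Flickr30K IR R@1 | Flickr30K IR R@5 | Flickr30K IR R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E2E-VLP [60] | 4M | - | - | - | - | - | - | 86.2 | 97.5 | 98.92 | 73.6 | 92.4 | 96.0 |
| UNITER [11] | 4M | 65.7 | 88.6 | 93.8 | 52.9 | 79.9 | 88.0 | 87.3 | 98.0 | 99.2 | 75.6 | 94.1 | 96.8 |
| OSCAR [36] | 4M | 70.0 | 91.1 | 95.5 | 54.0 | 80.8 | 88.5 | - | - | - | - | - | - |
| UNIMO [35] | 4M | - | - | - | - | - | - | 89.4 | 98.9 | 99.8 | 78.0 | 94.2 | 97.1 |
| VLMo [57] | 4M | 78.2 | 94.4 | 97.4 | 60.6 | 84.4 | 91.0 | 95.3 | 99.9 | 100.0 | 84.5 | 97.3 | 98.6 |
| ALIGN [26] | 1.8B | 77.0 | 93.5 | 96.9 | 59.9 | 83.3 | 89.8 | 95.3 | 99.8 | 100.0 | 84.9 | 97.4 | 98.6 |
| ALBEF [33] | 14M | 77.6 | 94.3 | 97.2 | 60.7 | 84.3 | 90.5 | 95.9 | 99.8 | 100.0 | 85.6 | 97.5 | 98.9 |
| Florence [67] | 0.9B | 81.8 | 95.2 | - | 63.2 | 85.7 | - | 97.2 | 99.9 | - | 87.9 | 98.1 | - |
| BLIP [32] | 14M | 80.6 | 95.2 | 97.6 | 63.1 | 85.3 | 91.1 | 96.6 | 99.8 | 100.0 | 87.2 | 97.5 | 98.8 |
| BLIP [32] | 129M | 82.4 | 95.4 | 97.9 | 65.1 | 86.3 | 91.8 | 97.4 | 99.8 | 99.9 | 87.6 | 97.7 | 99.0 |
| mPLUG | 14M | 82.8 | 96.1 | 98.3 | 65.8 | 87.3 | 92.6 | 97.6 | 100.0 | 100.0 | 88.4 | 97.9 | 99.1 |

Table 3: Image-text retrieval results on the MSCOCO (5K test set) and Flickr30K (1K test set) datasets. TR: image-to-text retrieval; IR: text-to-image retrieval.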

5.2.2 Image Captioning

The image captioning task requires a model to generate an appropriate and fluent caption for a given image. We evaluate image captioning on two datasets, COCO Caption [10] and NoCaps [1]. We fine-tune mPLUG on the COCO Caption training data and test it on the Karpathy test split [36, 58] and on the NoCaps validation set; for NoCaps, we directly use the best COCO Caption checkpoint for prediction. Following [36, 56], we first fine-tune mPLUG with the cross-entropy loss and then with CIDEr optimization [48] for 5 extra epochs. As shown in Table 1, mPLUG with only 14M pre-training images is competitive with or outperforms SOTA models such as LEMON and SimVLM on the COCO Caption and NoCaps datasets, although these models use more than 10× and 100× as much pre-training data, respectively. On COCO Caption, mPLUG achieves the best CIDEr score under CIDEr optimization and surpasses the previous SOTA model by a large margin of 5.5 points (155.1 vs. 149.6 for OFA) on the Karpathy test set.

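| Model | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-u | RefCOCOg test-u |
|---|---|---|---|---|---|---|---|---|
| VLBERT [40] | - | - | - | 72.59 | 78.57 | 62.30 | - | - |
| UNITER [11] | 81.41 | 87.04 | 74.17 | 75.90 | 81.45 | 66.70 | 74.86 | 75.77 |
| VILLA [18] | 82.39 | 87.48 | 74.84 | 76.17 | 81.54 | 66.84 | 76.18 | 76.71 |
| MDETR [27] | 86.75 | 89.58 | 81.41 | 79.52 | 84.09 | 70.62 | 81.64 | 80.89 |
| UNICORN [63] | 88.29 | 90.42 | 83.06 | 80.30 | 85.05 | 71.88 | 83.44 | 83.93 |
| OFA [56] | 90.05 | 92.93 | 85.26 | 84.49 | 90.10 | 77.77 | 84.54 | 85.20 |
| mPLUG | 92.40 | 94.51 | 88.42 | 86.02 | 90.17 | 78.17 | 85.88 | 86.42 |

Table 4: Visual grounding results (Acc@0.5) on RefCOCO, RefCOCO+, and RefCOCOg.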

5.2.3 Image-Text Retrieval

We conduct experiments for both image-to-text retrieval (TR) and text-to-image retrieval (IR) on the COCO [37] and Flickr30K [45] datasets. Following [33, 32], we jointly optimize the ITC loss and the ITM loss during fine-tuning. During inference, we first select the top-$k$ candidates by computing the dot-product similarity between the image and text encoder features, and then rerank the selected candidates based on their ITM scores. The value of $k$ is set separately for COCO and for Flickr30K. As shown in Table 3, mPLUG outperforms existing methods on both datasets on nearly every metric. Using 14M images, mPLUG achieves better performance than BLIP with 129M and Florence with 0.9B pre-training data. Using the same 14M pre-training images, mPLUG substantially outperforms the previous best model BLIP, by +2.2 points in TR recall@1 and +2.7 points in IR recall@1 on COCO, and by +1.0 point in TR recall@1 on Flickr30K.
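The two-stage inference procedure described above (cheap dot-product filtering followed by ITM reranking) can be sketched as follows. This is an illustrative outline, not the released code: `retrieve` is a hypothetical name, and `itm_score` stands for any callable that returns the matching score of a candidate index, which in the real model would require a forward pass through the fusion network.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_feat, candidate_feats, itm_score, k):
    """Select top-k candidates by dot-product similarity, then rerank by ITM score.

    query_feat / candidate_feats come from the unimodal encoders;
    itm_score(index) is a hypothetical stand-in for the fusion-based matcher.
    """
    by_similarity = sorted(
        range(len(candidate_feats)),
        key=lambda i: dot(query_feat, candidate_feats[i]),
        reverse=True,
    )
    top_k = by_similarity[:k]
    # Only the k shortlisted candidates are scored by the expensive matcher.
    return sorted(top_k, key=itm_score, reverse=True)

# Toy example: four candidates, shortlist of three, made-up ITM scores.
candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
toy_itm = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.5}
ranking = retrieve([1.0, 0.0], candidates, toy_itm.get, k=3)
```

The design trades a small risk of missing a good candidate outside the shortlist (candidate 2 in the toy example) for a large reduction in the number of expensive fusion passes.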

5.2.4 Visual Grounding

Given a query in plain text and an image, visual grounding requires models to localize the referred object in the image. Instead of regressing the bounding boxes directly, we concatenate visual features and attended textual features and feed them into the decoder to predict the coordinates. Table 4 shows that mPLUG outperforms all the SOTA methods. We observe that in RefCOCO testB the images often contain arbitrary objects, and in RefCOCOg test-u the expressions are longer than in the other datasets. Compared with the previous best model OFA, mPLUG achieves a 3.16% absolute improvement on RefCOCO testB and a 1.22% absolute improvement on RefCOCOg test-u. This suggests that mPLUG learns better multi-modal interaction from the cross-modal skip-connections and is better at handling complex images and long queries.

5.2.5 Visual Reasoning

We consider two datasets for visual reasoning: NLVR2 [52] and SNLI-VE [59]. The NLVR2 [52] task requires the model to predict whether a sentence describes a pair of images. Following [32], we use two cross-attention layers to process the two input images, and their outputs are merged and fed to the FFN. An MLP classifier is then applied on the output embedding of the language [CLS] token. The SNLI-VE [59] task requires the model to evaluate how the given image and text are semantically correlated, i.e., entailment, neutral, or contradiction. Following [56], the image premise, text premise and text hypothesis are fed to the encoder. We remove the decoder and use only the encoder modules for three-way classification, which saves nearly half of the total computation cost. We predict the class probabilities using the multimodal encoder's output representation of the language [CLS] token. As shown in Table 5, mPLUG obtains performance competitive with the SOTA models on both visual reasoning tasks (the SOTA models such as OFA and VLMo both add large-scale text-only and image-only pre-training data to improve reasoning ability), and even outperforms SimVLM [58] and BLIP [32], which use far more pre-training data.

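| Model | NLVR2 dev | NLVR2 test-P | SNLI-VE dev | SNLI-VE test |
|---|---|---|---|---|
| LXMERT [54] | 74.90 | 74.50 | - | - |
| VL-T5 [12] | - | 73.6 | - | - |
| UNITER [11] | 79.12 | 79.98 | 79.39 | 79.38 |
| CLIP-ViL [50] | - | - | 80.61 | 80.20 |
| METER [16] | 82.33 | 83.05 | 80.86 | 81.19 |
| UNIMO [35] | - | - | 81.11 | 80.63 |
| ALBEF [33] | 82.55 | 83.14 | 80.80 | 80.91 |
| BLIP [32] | 82.67 | 82.30 | - | - |
| SimVLM [58] | 84.13 | 84.84 | 85.68 | 85.62 |
| VLMo [57] | 85.64 | 86.86 | - | - |
| OFA [56] | - | - | 90.30 | 90.20 |
| mPLUG | 84.58 | 84.95 | 89.45 | 89.29 |

Table 5: Evaluation results on NLVR2 and SNLI-VE.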

5.3 Effectiveness and Efficiency

To validate the effectiveness and efficiency of our proposed cross-modal skip-connected network, we conduct in-depth analysis on different stride values and various cross-modal fusion methods.

5.3.1 Analysis of Stride for Skip

Figure 3: Results w.r.t different stride values in cross-modal skip-connected network on running time and performance of VQA test-dev and NLVR2 test-P, where the running time is the total forward time of 100 samples.

The stride $S$ is the key factor controlling the tradeoff between effectiveness and efficiency. Therefore, we further compare the running time and performance of different stride values $S$ in the cross-modal skip-connected network on the VQA and NLVR2 tasks. Specifically, we test four different stride values, each of which divides the total number of cross-modal fusion layers. The model is mPLUG (ViT-B), and all the other experiment settings are kept the same. As shown in Figure 3, the larger $S$ is, the more efficient the cross-modal fusion is: by skipping the vision co-attention layers, the running time can be reduced by about 5 times across the tested stride values. The performance of mPLUG on both datasets gradually increases with $S$ at first and slightly decreases afterwards. With a stride larger than the best-performing one, mPLUG achieves comparable performance while running nearly 30% faster. We therefore use the larger stride on mPLUG (ViT-L) for faster pre-training.
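A back-of-the-envelope count shows why a larger stride is cheaper. The sketch below counts only attention-score entries (query-key pairs) and ignores projections, FFN cost and the encoders, so it is a rough illustration and not a reproduction of Figure 3. Several quantities are assumptions: the interpretation that the 6 fusion layers are all asymmetric co-attention layers with one connected-attention layer added per stride, the text length of 30 tokens, and the function name `attention_cost`. The 257 image tokens correspond to a 256×256 crop with 16×16 patches plus a [CLS] token.

```python
def attention_cost(stride, total_co_attention_layers=6,
                   num_image_tokens=257, num_text_tokens=30):
    """Count query-key pairs in the fusion network for a given stride.

    Assumption: one connected-attention layer follows every `stride`
    asymmetric co-attention layers; projections and FFNs are ignored.
    """
    if total_co_attention_layers % stride != 0:
        raise ValueError("stride must divide the number of co-attention layers")
    m, t = num_image_tokens, num_text_tokens
    # Text self-attention plus text-to-image cross-attention.
    co_attention_layer = t * t + t * m
    # Full self-attention over the concatenated image and text sequence.
    connected_layer = (m + t) ** 2
    num_blocks = total_co_attention_layers // stride
    return total_co_attention_layers * co_attention_layer + num_blocks * connected_layer

costs = {s: attention_cost(s) for s in (1, 2, 3, 6)}
```

Because the connected-attention term is quadratic in the long visual sequence, it dominates the count, and halving the number of connected layers nearly halves the total under these assumptions.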

5.3.2 Analysis of Cross-modal Fusion

Figure 4: Results w.r.t different cross-modal fusions on running time and performance on VQA test-dev and NLVR2 test-P, where the running time is the total forward time of 100 samples.

We compare the effectiveness and efficiency of different cross-modal fusion variants in terms of running time and performance on the VQA and NLVR2 tasks. Specifically, we pre-train mPLUG with different cross-modal fusion networks based on the same image encoder and text encoder. All the pre-training settings and the number of fusion layers are kept the same as in the original mPLUG pre-training. As shown in Figure 4, the co-attention and connected-attention fusion methods both require much more running time due to the long visual sequence. Compared with these two fusion methods, our proposed skip-connected network is 4× faster and obtains better performance on both datasets. We also compare it with the asymmetric co-attention used in ALBEF and BLIP [33, 32], which relies only on the co-attention layers from images to text. Despite running slightly faster than the skip-connected network, the asymmetric co-attention is less accurate on both datasets. We attribute this performance degradation to the information asymmetry and the bias towards language, as discussed in Section 5.2.1.

Model Throughput (Samples/S)
baseline 124.0
+ BFloat16 182.7
+ Gradient Checkpoint 238.2
+ ZeRO 422.5
Table 6: Training Throughput

5.3.3 Large-scale Training

Combining the techniques introduced in Section 4 dramatically increases the training throughput. With the memory-saving and accelerated training techniques, the throughput of mPLUG improves by more than 3×, from 124 samples per second to 422 samples per second, as shown in Table 6.
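The speedups implied by Table 6 can be checked with a few lines of arithmetic. The sketch below uses only the four throughput figures reported in the table and assumes that each row adds one technique on top of the previous row; the function and variable names are illustrative and do not come from the paper.

```python
# Throughput figures (samples per second) copied from Table 6.
throughput = [
    ("baseline", 124.0),
    ("+ BFloat16", 182.7),
    ("+ Gradient Checkpoint", 238.2),
    ("+ ZeRO", 422.5),
]

baseline = throughput[0][1]


def speedups(rows, base):
    """Return (name, speedup over base, speedup over the previous row)."""
    out = []
    prev = base
    for name, value in rows:
        out.append((name, value / base, value / prev))
        prev = value
    return out


for name, cumulative, step in speedups(throughput, baseline):
    print(f"{name:<24} cumulative x{cumulative:.2f}  step x{step:.2f}")
```

Under that assumption, the cumulative factor for the last row is roughly 3.4, which is consistent with the "more than 3×" improvement stated in the text.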

Model In Near Out Overall
SimVLM[58] 83.2 84.1 82.5 83.5
SimVLM[58] 101.2 100.4 102.3 101.4
Oscar[36] 85.4 84.0 80.3 83.4
VinVL[69] 103.7 95.6 83.8 94.3
SimVLM[58] 113.7 110.9 115.2 112.2
mPLUG 86.34 81.5 90.49 84.02
mPLUG 116.7 113.75 117.0 114.8
Table 7: Image captioning results on NoCaps validation split (zero-shot and finetuned), and {In, Near, Out} refer to in-domain, near-domain and out-of-domain respectively. denotes the models finetuned on COCO Caption dataset.
Model TR IR
R@1 R@5 R@1 R@5
CLIP [46] 88.0 98.7 68.7 90.6
ALIGN  [26] 88.6 98.7 75.7 93.8
FILIP [64] 89.8 99.2 75.0 93.4
Florence  [67] 90.9 99.1 76.7 93.6
ALBEF  [33] 94.1 99.5 82.8 96.3
BLIP  [32] 94.8 99.7 84.9 96.7
mPLUG 93.0 99.5 82.2 95.8
mPLUG 95.8 99.8 86.4 97.6
Table 8: Zero-shot image-text retrieval results on Flickr30K. denotes the models finetuned on COCO.
Model # Pretrain MSRVTT-Retrieval
data R@1 R@5 R@10
MIL-NCE [42] How100M 9.9 24.0 32.4
VideoCLIP [61] How100M 10.4 22.2 30.0
VATT  [2] How100M, AudSet - - 29.7
ALPRO  [31] W2M, C3M 24.1 44.7 55.4
VIOLET  [17] Y180M, W2M, C3M 25.9 49.5 59.7
CLIP [46] WIT400M 26.0 49.4 60.7
Florence  [67] FLD900M 37.6 63.8 72.6
BLIP  [32] 129M 43.3 65.6 74.7
mPLUG 14M 38.1 59.2 68.2
mPLUG 14M 44.3 66.4 75.4
VideoCLIP [61] How100M 30.9 55.4 66.8
ALPRO  [31] C3M, W2M 33.9 60.7 73.2
VIOLET  [17] Y180M, C3M, W2M 34.5 63.0 73.4
Table 9: Zero-shot video-language results on text-to-video retrieval on the 1k test split of the MSRVTT dataset. denotes the models finetuned on COCO. Video datasets include HowTo100M [43], WebVid-2M(W2M) [5], YT-Temporal-180M( Y180M) [68]. Image datasets include CC3M(C3M) [49], FLD900M [67], WIT400M [46]. Audio datasets include AudioSet(AudSet) [19].

5.4 Zero-shot Transferability

In this section, we examine the generalization of mPLUG and compare zero-shot results on two vision-language and three video-language tasks.

5.4.1 Zero-shot Vision-Language Tasks

The pretraining of mPLUG adopts image-text contrastive and prefix language modeling tasks on large-scale image-text pairs. Thus, mPLUG has zero-shot generalization ability in image-text retrieval and image captioning.

Image Caption: First, we take the pretrained mPLUG model and directly decode on the NoCaps validation set without further finetuning. Following [58, 32], we feed a prefix prompt “A picture of” into the text encoder to improve the quality of decoded captions. As shown in Table 7, the zero-shot performance of mPLUG is competitive with fully supervised baselines such as Oscar and VinVL. With further finetuning on the MSCOCO dataset, mPLUG outperforms SimVLM, which uses more pre-training image-text pairs and has more model parameters.

Image-text Retrieval: We perform zero-shot retrieval on Flickr30K. The results are shown in Table 8, where zero-shot mPLUG outperforms models (CLIP, ALIGN, Florence) pretrained with more image-text pairs. Following [32], we also evaluate zero-shot retrieval with the model finetuned on the MSCOCO dataset. Table 8 shows that mPLUG achieves better performance than the previous SOTA models.
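For readers unfamiliar with the R@K columns in Tables 8 and 9, the following is a minimal sketch of how Recall@K is commonly computed from a query-candidate similarity matrix. It assumes a single ground-truth candidate per query, which is a simplification (Flickr30K text retrieval has several captions per image), and the function name and toy scores are made up for illustration; this is not the paper's evaluation code.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose ground-truth item is among the top-k scores.

    similarity[i][j] is the score of candidate j for query i;
    ground_truth[i] is the index of the correct candidate for query i.
    """
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        # Rank candidate indices by descending score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy example: 3 queries, 4 candidates, made-up scores.
sim = [
    [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # correct candidate 1 is ranked second
    [0.5, 0.6, 0.1, 0.7],  # correct candidate 2 is ranked last
]
gt = [0, 1, 2]

for k in (1, 2, 4):
    print(f"R@{k} = {recall_at_k(sim, gt, k):.3f}")
```

The tables report this quantity as a percentage, once with text as the query (IR) and once with the image as the query (TR).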

5.4.2 Zero-shot Transfer to Video-Language Tasks

To evaluate the generalization ability of mPLUG on video-language tasks, we conduct zero-shot experiments on video-text retrieval, video captioning and video question answering. Following [32], we uniformly sample frames for each video ( for Retrieval, for QA, for Caption), and concatenate the frame features into a single sequence.

Video-text Retrieval: We evaluate the mPLUG models that are pretrained and further finetuned on the COCO-retrieval image-text dataset, without any video pre-training or supervision. Table 9 shows that zero-shot mPLUG can outperform the SOTA models pretrained on far more data (e.g., Florence, BLIP), and can even outperform models finetuned on the supervised video dataset without using temporal information (e.g., VideoCLIP, VIOLET).

Video Question Answering: Following BLIP [32], we treat Video QA as an answer generation task and perform evaluation based on models finetuned on VQA. As shown in Table 10, zero-shot mPLUG outperforms BLIP, which is pretrained with more image-text pairs.

Video Caption: We use a prefix prompt “A video of” to improve the quality of decoded captions. Table 10 shows that zero-shot mPLUG also achieves better performance than BLIP.
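The frame handling described above (uniform sampling, then concatenation of per-frame features into one sequence) can be sketched as follows. This is only an illustration under stated assumptions: the helper names, the "middle frame of each segment" convention, and the sizes in the usage example are hypothetical, and the per-frame features stand in for the output of an image encoder.

```python
def uniform_frame_indices(num_frames, num_samples):
    """Pick num_samples indices spread evenly over a video of num_frames frames.

    The video is split into num_samples equal segments and the middle frame
    of each segment is taken (one common convention; an assumption here).
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    segment = num_frames / num_samples
    return [
        min(num_frames - 1, int(segment * i + segment / 2))
        for i in range(num_samples)
    ]


def concat_frame_features(frame_features):
    """Concatenate per-frame token sequences into one long sequence.

    frame_features is a list of frames, each a list of token feature vectors.
    """
    sequence = []
    for tokens in frame_features:
        sequence.extend(tokens)
    return sequence


# Hypothetical sizes: a 100-frame video, 4 sampled frames, 3 tokens per frame.
indices = uniform_frame_indices(num_frames=100, num_samples=4)
features = [[[float(i), float(t)] for t in range(3)] for i in indices]
sequence = concat_frame_features(features)
print(indices, len(sequence))
```

Because the frames are simply concatenated, the resulting visual sequence grows linearly with the number of sampled frames, and no explicit temporal modeling is involved.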

Acc Acc CIDEr
VQA-T  [62] 2.9 7.5 -
BLIP  [32] 19.2 35.2 37.4
mPLUG 21.1 37.2 42.0
Table 10: Zero-shot video-language results on Question-Answer and Caption tasks.

6 Conclusion

This paper presents mPLUG, an effective and efficient VLP framework for both cross-modal understanding and generation. mPLUG introduces a new asymmetric vision-language architecture with novel cross-modal skip-connections, to address two fundamental problems of information asymmetry and computation efficiency in cross-modal alignment. Pretrained on large-scale image-text pairs, mPLUG achieves state-of-the-art performance on a wide range of vision-language tasks. mPLUG also demonstrates strong zero-shot transfer ability when directly applied to multiple video-language tasks. Our work explores the cross-modal alignment with a newly-designed VLP architecture and we hope it can help promote future research on image-text foundation models.


  • [1] H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson (2018) Nocaps: novel object captioning at scale. CoRR abs/1812.08658. Cited by: §5.2.2.
  • [2] H. Akbari, L. Yuan, R. Qian, W. Chuang, S. Chang, Y. Cui, and B. Gong (2021) Vatt: transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems 34. Cited by: Table 9.
  • [3] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6077–6086. Cited by: §1, §3.1.
  • [4] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §2.1, §5.2.1.
  • [5] M. Bain, A. Nagrani, G. Varol, and A. Zisserman (2021) Frozen in time: a joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1728–1738. Cited by: Table 9.
  • [6] B. Bi, C. Li, C. Wu, M. Yan, W. Wang, S. Huang, F. Huang, and L. Si (2020) Palm: pre-training an autoencoding&autoregressive language model for context-conditioned generation. arXiv preprint arXiv:2004.07159. Cited by: §3.3.
  • [7] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut (2021) Conceptual 12m: pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3558–3568. Cited by: §5.1.
  • [8] T. Chen, B. Xu, C. Zhang, and C. Guestrin (2016) Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174. Cited by: §4.
  • [9] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §2.1.
  • [10] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft COCO captions: data collection and evaluation server. CoRR abs/1504.00325. Cited by: §5.2.2.
  • [11] Y. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2020) Uniter: universal image-text representation learning. In European conference on computer vision, pp. 104–120. Cited by: §1, §2.1, Table 2, Table 3, Table 4, Table 5.
  • [12] J. Cho, J. Lei, H. Tan, and M. Bansal (2021) Unifying vision-and-language tasks via text generation. In Proceedings of the 38th International Conference on Machine Learning, M. Meila and T. Zhang (Eds.), Proceedings of Machine Learning Research, Vol. 139, pp. 1931–1942. Cited by: Table 2, Table 5.
  • [13] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2020) Randaugment: practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703. Cited by: §5.1.
  • [14] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.3, §5.1.
  • [15] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, Cited by: §3.1.
  • [16] Z. Dou, Y. Xu, Z. Gan, J. Wang, S. Wang, L. Wang, C. Zhu, Z. Liu, M. Zeng, et al. (2021) An empirical study of training end-to-end vision-and-language transformers. arXiv preprint arXiv:2111.02387. Cited by: §1, §1, §3.1, Table 2, Table 5.
  • [17] T. Fu, L. Li, Z. Gan, K. Lin, W. Y. Wang, L. Wang, and Z. Liu (2021) VIOLET: end-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681. Cited by: Table 9.
  • [18] Z. Gan, Y. Chen, L. Li, C. Zhu, Y. Cheng, and J. Liu (2020) Large-scale adversarial training for vision-and-language representation learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.). Cited by: Table 4.
  • [19] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 776–780. Cited by: Table 9.
  • [20] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the v in vqa matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913. Cited by: §7.1.
  • [21] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738. Cited by: §3.3.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.2.
  • [23] X. Hu, Z. Gan, J. Wang, Z. Yang, Z. Liu, Y. Lu, and L. Wang (2021) Scaling up vision-language pre-training for image captioning. CoRR abs/2111.12233. Cited by: Table 1.
  • [24] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §2.2.
  • [25] Z. Huang, Z. Zeng, B. Liu, D. Fu, and J. Fu (2020) Pixel-bert: aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849. Cited by: §1, §1, §2.1, §2.1.
  • [26] C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Sung, Z. Li, and T. Duerig (2021) Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918. Cited by: §2.1, Table 3, Table 8.
  • [27] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion (2021) MDETR - modulated detection for end-to-end multi-modal understanding. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 1760–1770. Cited by: Table 4.
  • [28] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §7.1.
  • [29] W. Kim, B. Son, and I. Kim (2021) Vilt: vision-and-language transformer without convolution or region supervision. arXiv preprint arXiv:2102.03334. Cited by: §1, §2.1.
  • [30] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 123 (1), pp. 32–73. Cited by: §5.1, §7.1.
  • [31] D. Li, J. Li, H. Li, J. C. Niebles, and S. C. Hoi (2021) Align and prompt: video-and-language pre-training with entity prompts. arXiv preprint arXiv:2112.09583. Cited by: Table 9.
  • [32] J. Li, D. Li, C. Xiong, and S. Hoi (2022) Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086. Cited by: §5.2.3, §5.2.5, §5.3.2, §5.4.1, §5.4.2, Table 1, Table 10, Table 2, Table 3, Table 5, Table 8, Table 9.
  • [33] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi (2021) Align before fuse: vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems 34. Cited by: §1, §1, §2.1, §3.3, §3.3, §5.1, §5.2.1, §5.2.3, §5.3.2, Table 2, Table 3, Table 5, Table 8, §7.1, §7.1.
  • [34] L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §1.
  • [35] W. Li, C. Gao, G. Niu, X. Xiao, H. Liu, J. Liu, H. Wu, and H. Wang (2020) UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409. Cited by: Table 3, Table 5.
  • [36] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al. (2020) Oscar: object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pp. 121–137. Cited by: §1, §1, §2.1, §5.2.1, §5.2.2, Table 1, Table 2, Table 3, Table 7, §7.1.
  • [37] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §5.1, §5.2.3.
  • [38] F. Liu, X. Ren, Z. Zhang, X. Sun, and Y. Zou (2021) Rethinking skip connection with layer normalization in transformers and resnets. arXiv preprint arXiv:2105.07205. Cited by: §2.2.
  • [39] I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §5.1.
  • [40] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13–23. Cited by: Table 2, Table 4.
  • [41] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy (2016) Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 11–20. Cited by: §7.1.
  • [42] A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman (2020) End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9879–9889. Cited by: Table 9.
  • [43] A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019) Howto100m: learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2630–2640. Cited by: Table 9.
  • [44] V. Ordonez, G. Kulkarni, and T. L. Berg (2011) Im2text: describing images using 1 million captioned photographs. In Advances in neural information processing systems, pp. 1143–1151. Cited by: §5.1.
  • [45] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pp. 2641–2649. Cited by: §5.2.3.
  • [46] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. Cited by: §2.1, §5.1, §5.2.1, Table 8, Table 9.
  • [47] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He (2020) Zero: memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16. Cited by: §4.
  • [48] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1179–1195. Cited by: §5.2.2, §7.1.
  • [49] P. Sharma, N. Ding, S. Goodman, and R. Soricut (2018) Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565. Cited by: §5.1, Table 9.
  • [50] S. Shen, L. H. Li, H. Tan, M. Bansal, A. Rohrbach, K. Chang, Z. Yao, and K. Keutzer (2021) How much can clip benefit vision-and-language tasks?. arXiv preprint arXiv:2107.06383. Cited by: §3.1, Table 2, Table 5.
  • [51] R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Highway networks. arXiv preprint arXiv:1505.00387. Cited by: §2.2.
  • [52] A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, and Y. Artzi (2018) A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491. Cited by: §5.2.5, §7.1.
  • [53] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence. Cited by: §2.2.
  • [54] H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §1, §2.1, §5.2.1, Table 5.
  • [55] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.2.
  • [56] P. Wang, A. Yang, R. Men, J. Lin, S. Bai, Z. Li, J. Ma, C. Zhou, J. Zhou, and H. Yang (2022) Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052. Cited by: §5.2.1, §5.2.2, §5.2.5, Table 1, Table 2, Table 4, Table 5, §7.1.
  • [57] W. Wang, H. Bao, L. Dong, and F. Wei (2021) VLMo: unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358. Cited by: §2.1, §5.2.1, Table 2, Table 3, Table 5.
  • [58] Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, and Y. Cao (2021) SimVLM: simple visual language model pretraining with weak supervision. CoRR abs/2108.10904. Cited by: §1, §1, §5.2.1, §5.2.2, §5.2.5, §5.4.1, Table 1, Table 2, Table 5, Table 7.
  • [59] N. Xie, F. Lai, D. Doran, and A. Kadav (2019) Visual entailment: A novel task for fine-grained image understanding. CoRR abs/1901.06706. Cited by: §5.2.5, §7.1.
  • [60] H. Xu, M. Yan, C. Li, B. Bi, S. Huang, W. Xiao, and F. Huang (2021) E2E-vlp: end-to-end vision-language pre-training enhanced by visual learning. arXiv preprint arXiv:2106.01804. Cited by: §2.1, Table 1, Table 2, Table 3.
  • [61] H. Xu, G. Ghosh, P. Huang, D. Okhonko, A. Aghajanyan, F. Metze, L. Zettlemoyer, and C. Feichtenhofer (2021) VideoCLIP: contrastive pre-training for zero-shot video-text understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6787–6800. Cited by: Table 9.
  • [62] A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid (2021) Just ask: learning to answer questions from millions of narrated videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1686–1697. Cited by: Table 10.
  • [63] Z. Yang, Z. Gan, J. Wang, X. Hu, F. Ahmed, Z. Liu, Y. Lu, and L. Wang (2021) Crossing the format boundary of text and boxes: towards unified vision-language modeling. CoRR abs/2111.12085. Cited by: Table 4.
  • [64] L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C. Xu (2021) FILIP: fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783. Cited by: Table 8.
  • [65] F. Yu, J. Tang, W. Yin, Y. Sun, H. Tian, H. Wu, and H. Wang (2021) ERNIE-vil: knowledge enhanced vision-language representations through scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 3208–3216. Cited by: §1, §2.1.
  • [66] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg (2016) Modeling context in referring expressions. In European Conference on Computer Vision, pp. 69–85. Cited by: §2.1, §7.1.
  • [67] L. Yuan, D. Chen, Y. Chen, N. Codella, X. Dai, J. Gao, H. Hu, X. Huang, B. Li, C. Li, et al. (2021) Florence: a new foundation model for computer vision. arXiv preprint arXiv:2111.11432. Cited by: Table 2, Table 3, Table 8, Table 9.
  • [68] R. Zellers, X. Lu, J. Hessel, Y. Yu, J. S. Park, J. Cao, A. Farhadi, and Y. Choi (2021) Merlot: multimodal neural script knowledge models. Advances in Neural Information Processing Systems 34. Cited by: Table 9.
  • [69] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao (2021) Vinvl: revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5579–5588. Cited by: §1, §3.1, Table 1, Table 7.

7 More Experiments Details

7.1 Downstream Task Details

We evaluate mPLUG on six downstream vision-language tasks. The hyperparameters used for finetuning on the downstream tasks are listed in Table 11. Following [33], all tasks adopt RandAugment, the AdamW optimizer with a weight decay of 0.05, and a cosine learning rate schedule. We use an image resolution of 336×336, except for VQA, where we use 504×504 images. For the VQA and image captioning tasks, we also perform additional continued pre-training on 4M image-text pairs, which brings an accuracy improvement of about 0.2 or more. Next, we introduce the dataset settings in detail.
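As a brief aside on the cosine learning rate schedule mentioned above, the sketch below shows one common form of cosine decay. The paper does not spell out warmup, the minimum learning rate, or whether the schedule is stepped per iteration or per epoch, so the absence of warmup, the `min_lr` default of zero, and the step count in the usage lines are assumptions made for illustration. The base rate of 2e-5 is the first VQA learning rate from Table 11.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    # Clamp progress to [0, 1] so out-of-range steps stay at the endpoints.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run of 1000 steps starting from the VQA learning rate in Table 11.
total = 1000
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step, total, base_lr=2e-5):.2e}")
```

The weight decay of 0.05 is a separate setting applied by the optimizer and is not part of the schedule itself.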


VQA.

We conduct experiments on the VQA2.0 dataset [20], which contains 83k/41k/81k images for training/validation/test. Following [33], we use both the training and validation splits for training, and incorporate additional training data from Visual Genome [30].

Task LR (ViT-L/) batch size epochs
VQA 2e-5/5e-6 1024 8
Captioning 1e-5 & 8e-7 256 5
Retrieval 1e-5/2e-6 256 5
Visual Grounding 2e-5/2e-6 512 120
NLVR2 5e-5/5e-6 256 15
SNLI-VE 2e-5 64 5
Table 11: Finetuning hyperparameters for downstream tasks. denotes two stages fine-tuning.
Image Captioning.

We finetune on COCO’s Karpathy train split, and evaluate on COCO’s Karpathy test split and the NoCaps validation split. Following [36, 56], we first fine-tune mPLUG with cross-entropy loss for 5 epochs with a learning rate of 1e-5 and a batch size of 256. Based on the fine-tuned model, we then fine-tune it with CIDEr optimization [48] for 5 extra epochs with a smaller learning rate of 8e-7. During inference, we use beam search with a beam size of 10, and set the maximum generation length to 20.
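To make the inference setting concrete, here is a minimal, generic beam search sketch with the beam size of 10 and maximum length of 20 used above as defaults. It is not the paper's decoder: `next_log_probs` is a hypothetical stand-in for the caption model, `toy_model` is a made-up three-token distribution, and no length normalization or prompt handling is applied.

```python
import math


def beam_search(next_log_probs, bos, eos, beam_size=10, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    next_log_probs(prefix) -> dict mapping each candidate next token to its
    log-probability given the prefix (a stand-in for the caption decoder).
    """
    beams = [([bos], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        # Keep only the beam_size best hypotheses at this step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished hypotheses if nothing reached eos in time.
    pool = finished if finished else beams
    return max(pool, key=lambda c: c[1])[0]


def toy_model(prefix):
    # Hypothetical distribution: prefers "a", then strongly prefers to stop.
    if len(prefix) >= 3:
        return {"<eos>": math.log(0.9), "a": math.log(0.1)}
    return {"a": math.log(0.6), "b": math.log(0.3), "<eos>": math.log(0.1)}


print(beam_search(toy_model, "<bos>", "<eos>", beam_size=2, max_len=5))
```

A larger beam explores more partial captions at each step at a proportionally higher decoding cost, while the maximum length bounds the number of generated tokens.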

Image-Text Retrieval.

We adopt the widely used Karpathy split [28] for both COCO and Flickr30K. COCO contains 113k/5k/5k images for train/validation/test, and Flickr30K contains 29k/1k/1k images for train/validation/test.

Visual Grounding.

We evaluate our method on three referring expression grounding datasets: RefCOCO, RefCOCO+ [66] and RefCOCOg [41]. The RefCOCO and RefCOCO+ datasets share 19K images and contain 142K/141K queries, respectively. The RefCOCOg dataset contains 25K images and 95K queries. To make full use of the training data, we first train the model on a mixed dataset with a learning rate of 2e-5. We then continue fine-tuning the model on each dataset with a learning rate of 2e-6.

NLVR2 & SNLI-VE.

We conduct experiments on the official splits of both datasets [52, 59].

7.2 Pre-training Dataset Details

Table 12 shows the statistics of the 14M pre-training images with texts.

image 113K 100K 860K 3M 10M
text 567K 769K 860K 3M 10M
Table 12: Statistics of the pre-training datasets.