Large-scale pre-training of vision-language models has recently achieved tremendous success on a wide range of cross-modal tasks [54, 11, 25, 36, 65, 33, 58]. Such models learn cross-modal representations from large quantities of image-text pairs by aligning the visual and linguistic modalities. A central challenge in learning vision-language models is to find a good alignment between the two modalities that closes the semantic gap between them.
To discover a cross-modal alignment, prior studies [36, 3, 69] employ a pre-trained object detector to extract salient regions from images, which are then aligned with their language counterparts. Such an architecture, however, is generally limited by the power of the object detector, the pre-defined visual semantics it can represent, and the quantity of annotations available. It is also computationally expensive to extract region-based visual features from high-resolution (e.g., 600×1000) images. More recent work [25, 58, 33, 29, 16], which scales and performs better on many vision-language tasks, drops the requirement of a pre-trained object detector and enables direct alignment between image and text representations in an end-to-end manner. These models extract finer-grained visual representations from a long sequence of image patches or grids for good visual understanding. However, there are two significant problems in modeling long visual sequences: 1) efficiency: full self-attention on long visual sequences requires much more computation than on textual sequences, and 2) information asymmetry: the caption text in widely used image-text pre-training data is usually short and highly abstract, while more detailed and diverse information can be extracted from the image. This asymmetry makes effective multi-modal fusion between the modalities challenging.
One straightforward way of multi-modal fusion is the connected-attention network shown in Figure 1 (a). It adopts a single Transformer network for early fusion of vision and language by simply taking the concatenation of visual and linguistic features as input. This paradigm allows self-attention to discover alignments between the modalities from the bottom level, but it requires full self-attention over the concatenated cross-modal sequence, which is time-consuming. Besides, methods of this type process information from both modalities equally, and may therefore suffer from the information asymmetry, especially when there is a big difference in information density or sequence length between the modalities.
Another line of work keeps separate Transformer networks for textual and visual features, and uses techniques such as cross-attention to enable cross-modal interaction, as shown in Figure 1 (b). This architecture conducts multi-modal fusion on each modality independently, which can help alleviate the information asymmetry problem. However, it still suffers from the computational inefficiency of full self-attention on long visual sequences, and it is not parameter-efficient because it keeps two separate Transformer networks.
In this work, we propose mPLUG, a unified Multi-modal Pre-training framework for both vision-Language Understanding and Generation. mPLUG performs effective and efficient vision-language learning with novel cross-modal skip-connections to address the fundamental information asymmetry problem. Instead of fusing visual and linguistic representations at the same levels, the cross-modal skip-connections enable the fusion to occur at disparate levels in the abstraction hierarchy across the modalities. They create inter-layer shortcuts that skip a certain number of layers for visual representations, reflecting the semantic richness of language compared to vision. As shown in Figure 1 (c), in each block of our cross-modal skip-connected network, mPLUG first adopts an asymmetric co-attention architecture in the first few layers for efficiency, by removing the co-attention on the vision side. These layers are followed by one layer of connected-attention, which takes as input the concatenation of the original visual representation and the co-attention output on the language side. In addition to the modeling efficacy due to the asymmetry, the cross-modal skip-connections ease model training by alleviating vanishing gradients through the inserted shortcuts. Figure 1 shows that the new cross-modal skip-connected network achieves superior performance while running at least four times faster than other cross-modal fusion networks.
Our key contributions can be summarized as follows:
We propose mPLUG, a unified vision-language pre-trained model for cross-modal understanding and generation that is both effective and efficient in cross-modal learning.
We introduce a new asymmetric vision-language architecture with novel cross-modal skip-connections, to address two fundamental problems of information asymmetry and computation inefficiency in multi-modal fusion.
mPLUG achieves state-of-the-art performance on a wide range of vision-language tasks, including image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability when directly transferred to a wide range of vision-language and video-language tasks.
2 Related Work
2.1 Vision-Language Pre-training
Vision-Language pre-training (VLP) has recently achieved tremendous success and state-of-the-art results across a variety of vision-language tasks [4, 9, 66]. In terms of how information from different modalities is aggregated, typical approaches to VLP [54, 11, 25, 65, 33, 46, 26] can be roughly divided into two categories: dual encoder and fusion encoder. The dual encoder approach utilizes two single-modal encoders to encode images and text separately, and then uses simple functions such as the dot product to model the instance-level cross-modal interaction between image and text. The advantage of dual encoder models like CLIP and ALIGN is that image and text representations can be pre-computed and cached, which is computationally efficient and well suited to retrieval tasks. However, they tend to fail on more complicated VL understanding tasks that require complex reasoning, such as visual question answering. In contrast, the fusion encoder approach uses deep fusion functions such as multi-layer self-attention and cross-attention networks to model the fine-grained cross-modal interaction between image and text sequences. Representative methods of this category include single-stream architectures such as UNITER and OSCAR, and two-stream architectures such as LXMERT, ALBEF and ERNIE-ViL. Methods of this kind can better capture the underlying association between image and text for vision-language understanding tasks, but they need to jointly encode all possible image-text pairs, which leads to relatively slow inference.
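To make the caching advantage of dual encoders concrete, here is a minimal sketch, assuming embeddings are plain Python lists. The function names (`l2_normalize`, `rank_images`) and the toy vectors are hypothetical and are not taken from CLIP or ALIGN; they only show that image embeddings can be computed once and scored against any text query with a dot product.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank_images(text_emb, cached_image_embs):
    """Return image indices sorted by decreasing similarity to the text query."""
    query = l2_normalize(text_emb)
    scores = [dot(query, img) for img in cached_image_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Image embeddings are encoded once, normalized and cached (toy 2-d vectors).
image_cache = [l2_normalize(v) for v in ([1.0, 0.0], [0.6, 0.8], [0.0, 1.0])]

# Each new text query only costs one dot product per cached image.
ranking = rank_images([0.1, 0.9], image_cache)
```

A fusion encoder, by contrast, would have to run its cross-attention layers once per image-text pair, which is why it cannot reuse such a cache.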
Pixel-BERT and ViLT remove the complicated object detector from feature extraction and conduct end-to-end VL learning with CNN-based grid features and linearly projected patch embeddings, respectively. To combine the benefits of both categories of architectures, VLMo further unifies the dual encoder and fusion encoder modules with a shared mixture-of-modality-experts Transformer. In this work, mPLUG introduces a new cross-modal fusion mechanism with cross-modal skip-connections, which enables the fusion to occur at disparate levels in the abstraction hierarchy across the modalities. It achieves superior effectiveness and efficiency across a wide range of VL tasks.
Skip-connection is a popular technique for bypassing the exploding or vanishing gradient problem when optimizing deep neural networks, and it is widely used in CV and NLP architectures such as ResNet and Transformer. A variety of skip-connection methods have been proposed in recent years [51, 22, 55, 24, 53, 38]. ResNet introduces summed shortcut connections between different layers using a simple identity mapping, while the highway network designs a transform gating function to control the balance between the input and the transformed input. DenseNet designs new architectures with concatenated skip-connections, allowing subsequent layers to re-use all the intermediate representations of previous layers. Layer normalization and recursive skip-connections are further used in combination with plain skip-connections to stabilize model optimization and better incorporate the transformed input [55, 38]. In this work, mPLUG proposes a new cross-modal skip-connection method to address the cross-modal fusion problem. It combines concatenated and summed skip-connections to choose, at each layer, whether to attend to all the concatenated representations of the different modalities or to focus only on the cross-modal interaction part.
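The three skip-connection styles mentioned above can be contrasted in a few lines. This is an illustrative sketch on plain Python lists, assuming `x` is a layer input and `fx` the transformed input; the function names are hypothetical, and the gate in `highway_skip` is passed in as raw logits rather than computed by a learned layer.

```python
import math


def summed_skip(x, fx):
    """ResNet-style shortcut: identity mapping added to the transformed input."""
    return [a + b for a, b in zip(x, fx)]


def highway_skip(x, fx, gate_logits):
    """Highway-style shortcut: a sigmoid gate t balances fx against x."""
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    return [t * h + (1.0 - t) * a for a, h, t in zip(x, fx, gates)]


def concatenated_skip(x, fx):
    """DenseNet-style shortcut: later layers see both representations side by side."""
    return x + fx  # list concatenation, not element-wise addition


x, fx = [1.0, 2.0], [0.5, -1.0]
summed = summed_skip(x, fx)
gated = highway_skip([1.0], [3.0], [0.0])  # gate of 0.5 gives the midpoint
stacked = concatenated_skip(x, fx)
```

The summed form keeps the feature width fixed, whereas the concatenated form grows it, which is the trade-off the cross-modal variant has to manage when sequences from two modalities are joined.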
Figure 2 (caption fragment): $S$ is a fixed stride value. Based on the connected representation of the image and prefix sub-sequence, the decoder is trained with a prefix language modeling (Prefix LM) loss by generating the remaining caption.
In this section, we will first introduce our new model architecture with the key module of the cross-modal skip-connected network, and then give the details of the pre-training objectives and scalable training infrastructure.
3.1 Model Architecture
As shown in Figure 2, mPLUG consists of two unimodal encoders for image and text, a cross-modal skip-connected network and a decoder for text generation. To better model the inherent modality bias information, we first use the two unimodal encoders to encode image and text separately. Following [16, 50], we apply a visual transformer directly to the image patches as the visual encoder, which is more computation-friendly than using pre-trained object detectors for visual feature extraction [3, 69]. The visual encoder divides an input image into patches and encodes them as a sequence of embeddings $\{v_{cls}, v_1, v_2, \dots, v_M\}$ with an additional [CLS] token. The input text is fed to the text encoder and represented as a sequence of embeddings $\{l_{cls}, l_1, l_2, \dots, l_T\}$, where $l_{cls}$ is the embedding of the [CLS] token and is used to summarize the input text. Then, the visual and linguistic representations are fed into a cross-modal skip-connected network, which consists of multiple skip-connected fusion blocks. In each skip-connected fusion block, we apply connected cross-modal fusion after every $S$ asymmetric co-attention layers, where $S$ is a fixed stride value. The aim of this network is to combine the effectiveness of connected cross-modal fusion with the efficiency of asymmetric co-attention for enhanced cross-modal fusion in a recursive manner. Finally, the output cross-modal representations are fed into a Transformer decoder for sequence-to-sequence learning, which equips mPLUG with both understanding and generation capabilities.
3.2 Cross-modal Skip-connected Network
The cross-modal skip-connected network consists of $N$ skip-connected fusion blocks. In each skip-connected fusion block, we apply one connected-attention layer after every $S$ asymmetric co-attention layers, where $S$ is a fixed stride value. We first pass the text feature and image feature from the unimodal encoders through the $S$ asymmetric co-attention layers, and then connect the output text feature and the image feature as input to one connected-attention layer. We repeat the skip-connected fusion block $N$ times to obtain the final connected image and text representation.
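The layer ordering described above can be written down as a simple schedule. The sketch below is only an illustration: `fusion_layout`, its argument names and the layer labels are hypothetical, and the block and stride counts in the usage lines are arbitrary rather than the paper's configuration.

```python
def fusion_layout(num_blocks, stride):
    """List the layer types of a skip-connected network, in execution order.

    Each block runs `stride` asymmetric co-attention layers (text side only)
    followed by one connected-attention layer (vision and text together).
    """
    layout = []
    for _ in range(num_blocks):
        layout.extend(["asymmetric_co_attention"] * stride)
        layout.append("connected_attention")
    return layout


def layers_updating_vision(layout):
    """Count the layers in which the visual sequence is actually recomputed."""
    return sum(1 for layer in layout if layer == "connected_attention")


layout = fusion_layout(num_blocks=2, stride=3)
vision_layers = layers_updating_vision(layout)
```

With two blocks and a stride of three, only two of the eight layers touch the long visual sequence, which is where the efficiency gain comes from.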
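Specifically, the asymmetric co-attention layer is composed of a self-attention (SA) layer, a cross-attention (CA) layer and a feed-forward network (FFN). The input text feature $l^{n-1}$ is first fed to the self-attention layer, and the visual feature $v^{n-1}$ is then injected into the text feature $l^{n}_{SA}$ by the cross-attention layer, which gives $l^{n}_{CA}$. The outputs of self-attention and cross-attention are added up and fed to the FFN layer to obtain the visual-aware text representation $l^{n}$:
$$l^{n}_{SA} = \mathrm{LN}(\mathrm{SA}(l^{n-1}) + l^{n-1})$$
$$l^{n}_{CA} = \mathrm{LN}(\mathrm{CA}(l^{n}_{SA}, v^{n-1}) + l^{n}_{SA})$$
$$l^{n} = \mathrm{LN}(\mathrm{FFN}(l^{n}_{CA}) + l^{n}_{CA})$$
where LN is short for layer normalization.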
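The connected-attention layer is composed of a self-attention (SA) layer and a feed-forward network (FFN). We connect the image feature $v^{n-1}$ and the input text feature $l^{n-1}$, where $l^{n-1}$ is the output of the asymmetric co-attention layers. The connected image and text feature $[v^{n-1}; l^{n-1}]$ is fed to the self-attention layer and the FFN layer:
$$[v^{n}_{SA}; l^{n}_{SA}] = \mathrm{LN}(\mathrm{SA}([v^{n-1}; l^{n-1}]) + [v^{n-1}; l^{n-1}])$$
$$[v^{n}; l^{n}] = \mathrm{LN}(\mathrm{FFN}([v^{n}_{SA}; l^{n}_{SA}]) + [v^{n}_{SA}; l^{n}_{SA}])$$
Then $[v^{n}; l^{n}]$ is fed into the next skip-connected fusion block, and this is repeated to obtain the final connected image and text representation. Finally, the connected output is fed into a Transformer decoder for sequence-to-sequence learning.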
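As a rough illustration of one skip-connected fusion block, the following pure-Python sketch runs the two layer types on toy inputs. It makes several simplifying assumptions that are not from the paper: attention is single-head with no learned query/key/value projections, layer normalization has no learned gain or bias, the FFN uses ReLU, and the same FFN weights are shared by every layer. All function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(rows, eps=1e-5):
    out = []
    for r in rows:
        mean = sum(r) / len(r)
        var = sum((x - mean) ** 2 for x in r) / len(r)
        out.append([(x - mean) / math.sqrt(var + eps) for x in r])
    return out


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def attention(queries, keys_values):
    """Scaled dot-product attention (single head, no learned projections)."""
    d = len(queries[0])
    scores = matmul(queries, transpose(keys_values))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, keys_values)


def ffn(rows, w1, w2):
    hidden = [[max(0.0, h) for h in row] for row in matmul(rows, w1)]
    return matmul(hidden, w2)


def asymmetric_co_attention(text, vision, w1, w2):
    """Update the text only: self-attention, cross-attention to vision, FFN."""
    t_sa = layer_norm(add(attention(text, text), text))
    t_ca = layer_norm(add(attention(t_sa, vision), t_sa))
    return layer_norm(add(ffn(t_ca, w1, w2), t_ca))


def connected_attention(text, vision, w1, w2):
    """Update both modalities with self-attention over the joined sequence."""
    joint = vision + text
    j_sa = layer_norm(add(attention(joint, joint), joint))
    j_out = layer_norm(add(ffn(j_sa, w1, w2), j_sa))
    return j_out[len(vision):], j_out[:len(vision)]


def skip_connected_block(text, vision, stride, w1, w2):
    for _ in range(stride):
        # The visual sequence is skipped here and reused unchanged.
        text = asymmetric_co_attention(text, vision, w1, w2)
    return connected_attention(text, vision, w1, w2)


rng = random.Random(0)
dim, hidden_dim = 4, 8
w1 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(dim)]
w2 = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden_dim)]
vision = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]  # 6 patch tokens
text = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]    # 3 text tokens
text_out, vision_out = skip_connected_block(text, vision, stride=2, w1=w1, w2=w2)
```

Note that the attention score matrix in the asymmetric layers has one row per text token, while the connected layer builds a square matrix over all visual and text tokens; skipping the latter for most layers is what saves computation on long visual sequences.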
3.3 Pre-training Tasks
We perform four pre-training tasks including three understanding tasks (Image-Text Contrastive Learning, Image-Text Matching, Masked Language Modeling) and one generation task (Prefix Language Modeling). These pre-training tasks are optimized jointly.
Image-Text Contrastive (ITC): Following prior work, we employ this task to align the image features and the text features from the unimodal encoders. Specifically, we calculate the softmax-normalized image-to-text and text-to-image similarities, and maintain two dynamic memory queues (text, image) to increase the number of negative examples, as in MoCo.
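A minimal sketch of a contrastive objective of this kind is given below, under stated assumptions: embeddings are already normalized Python lists, the temperature value of 0.07 is an assumed default rather than a value from this paper, and the momentum encoders that normally fill the queues are omitted. The name `itc_loss` is hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(scores):
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def itc_loss(image_embs, text_embs, text_queue, image_queue, temperature=0.07):
    """Average image-to-text and text-to-image cross-entropy over a batch.

    Pair i of the batch is the positive; every other batch item and every
    queue entry serves as a negative.
    """
    all_texts = text_embs + text_queue
    all_images = image_embs + image_queue
    loss = 0.0
    for i in range(len(image_embs)):
        i2t = log_softmax([dot(image_embs[i], t) / temperature for t in all_texts])
        t2i = log_softmax([dot(text_embs[i], v) / temperature for v in all_images])
        loss -= 0.5 * (i2t[i] + t2i[i])
    return loss / len(image_embs)


images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]
queue_t, queue_i = [[-1.0, 0.0]], [[0.0, -1.0]]

loss_aligned = itc_loss(images, texts_aligned, queue_t, queue_i)
loss_swapped = itc_loss(images, texts_swapped, queue_t, queue_i)
```

The queue entries only enlarge the denominator of each softmax, so they add negatives without requiring a larger batch.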
Image-Text Matching (ITM): This task aims to predict, from the cross-modal representation, whether an image and a sentence match each other. Following prior work, we also select hard negative image-text pairs based on the contrastive text-image similarity.
Masked Language Modeling (MLM): The task setup is basically the same as in BERT: we randomly mask 15% of the tokens in the text, and the model is asked to predict the masked words from the cross-modal representations.
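The masking step can be sketched as follows. This is a simplified illustration: the default rate of 0.15 is BERT's masking rate, every selected token is replaced by a `[MASK]` string (BERT's additional random and keep-as-is replacements are left out), and `mask_tokens` is a hypothetical helper name.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)  # position is ignored by the loss
    return masked, labels


caption = "a dog runs across the green field".split()
masked, labels = mask_tokens(caption, seed=0)
all_masked, all_labels = mask_tokens(caption, mask_prob=1.0)
```

In the multi-modal setting the prediction for each masked position is conditioned on the image as well as the remaining text, which is what distinguishes it from text-only MLM.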
Prefix Language Modeling (PrefixLM): Following prior work, this task aims to generate the caption given an image, predicting the text segment that follows the cross-modal context. It optimizes a cross-entropy loss by maximizing the likelihood of the text in an autoregressive manner.
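The objective can be illustrated with a small sketch that splits a caption into a prefix (part of the cross-modal context) and a target segment, then scores the target by its mean negative log-likelihood. The helper names and the example probabilities are hypothetical; a real model would produce the per-token probabilities autoregressively.

```python
import math


def split_prefix(tokens, prefix_len):
    """Split a caption into the context prefix and the segment to generate."""
    return tokens[:prefix_len], tokens[prefix_len:]


def prefix_lm_loss(target_token_probs):
    """Mean negative log-likelihood of the target tokens.

    target_token_probs[j] is the probability the decoder assigns to the j-th
    target token given the image, the prefix and the earlier target tokens.
    """
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


prefix, target = split_prefix(["a", "dog", "runs", "fast"], prefix_len=2)
loss_uncertain = prefix_lm_loss([0.5, 0.5])
loss_perfect = prefix_lm_loss([1.0, 1.0])
```

Minimizing this quantity is the same as maximizing the likelihood of the remaining caption.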
4 Distributed Learning on a Large Scale
Training a big model like mPLUG on large-scale datasets faces many efficiency challenges. We increase the throughput from the perspective of reducing memory usage and computation time, thereby accelerating the training of the model.
The memory usage during model training is mainly composed of two parts: the static memory usage of parameters, optimizer states and gradients, and the runtime memory usage caused by intermediate variables such as activation values. For the static memory overhead, we use the ZeRO technique to partition parameters, optimizer states and gradients across the entire data-parallel group, so that the static memory overhead of a single GPU is reduced to approximately $1/N$ of the original, where $N$ denotes the number of GPU cards. For the runtime memory cost, we use gradient checkpointing, which greatly reduces runtime memory usage at the expense of extra forward computation: part of the activation values are recomputed during the backward pass instead of being kept in memory.
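A back-of-the-envelope calculation shows the effect of partitioning. The sketch assumes the accounting used in the ZeRO paper for mixed-precision Adam, namely 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for full-precision optimizer states); the parameter count and GPU count in the example are arbitrary, and `static_memory_gb` is a hypothetical helper.

```python
def static_memory_gb(num_params, num_gpus, partitioned=True):
    """Approximate per-GPU static memory in GB under mixed-precision Adam."""
    bytes_per_param = 2 + 2 + 12  # weights + gradients + optimizer states (assumed)
    total_bytes = num_params * bytes_per_param
    if partitioned:
        total_bytes /= num_gpus   # each GPU holds 1/N of every component
    return total_bytes / 1e9


replicated = static_memory_gb(1_000_000_000, num_gpus=16, partitioned=False)
sharded = static_memory_gb(1_000_000_000, num_gpus=16, partitioned=True)
```

For a model with one billion parameters on 16 GPUs this gives 16 GB per GPU when replicated and 1 GB per GPU when fully partitioned, before counting activations.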
To reduce the computation time, we use BF16 precision training. BF16 is a data type supported by NVIDIA's Ampere architecture GPUs such as the A100. Compared with the previously widely used mixed-precision training with FP16 and FP32, BF16 has the same representation range as FP32, which reduces the risk of numerical overflow and helps keep model convergence stable, while offering the same fast computing speed as FP16.
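The range claim can be checked directly from the bit layouts. The sketch below assumes the standard formats (FP16: 5 exponent and 10 mantissa bits; BF16: 8 and 7; FP32: 8 and 23) and computes the largest finite value of each; `max_finite` is a hypothetical helper name.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved for infinities and NaNs.
    max_exponent = (2 ** exponent_bits - 2) - bias
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


fp16_max = max_finite(5, 10)
bf16_max = max_finite(8, 7)
fp32_max = max_finite(8, 23)
```

BF16 shares FP32's 8 exponent bits, so its maximum is within one percent of FP32's, whereas FP16 overflows just above 65504. The price is a shorter mantissa, that is, lower precision per value.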
5.1 Data & Setup
Following previous work, we use the same pre-training dataset of 14M images with texts, which includes two in-domain datasets (MS COCO and Visual Genome) and three out-of-domain web datasets (Conceptual Captions, Conceptual 12M and SBU Captions).
We pre-train the model for 30 epochs with a total batch size of 1024 on 16 NVIDIA A100 GPUs. We use a 6-layer Transformer for both the text encoder and the cross-modal skip-connected network, and a 12-layer Transformer for the decoder. The text encoder is initialized with the first 6 layers of the BERT$_{\text{base}}$ model, and the skip-connected network is initialized with its last 6 layers. We initialize the visual encoder with CLIP-ViT pre-trained on 400M noisy image-text pairs. The visual transformer with ViT-B/16 is used as our base architecture, and the one with ViT-L/14 as the large architecture. We use the AdamW optimizer with a weight decay of 0.02. In the first 1000 iterations, the learning rate is warmed up to 1e-5 (ViT-B/16) and 1e-4 (BERT$_{\text{base}}$) for mPLUG$_{\text{ViT-B}}$, and to 5e-6 (ViT-L/14) and 5e-5 (BERT$_{\text{base}}$) for mPLUG$_{\text{ViT-L}}$; it is then decayed to 1e-6 following a cosine schedule. During pre-training, we take random image crops of resolution 256×256 (ViT-B/16) or 224×224 (ViT-L/14) as input, and also apply RandAugment to improve the generalization of the vision encoders. For VQA and image captioning tasks, we additionally continue pre-training on 4M image-text pairs. We increase the image resolution during fine-tuning. For image-text contrastive learning, the queue size is set to 65,536 and the momentum coefficient to 0.995.
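The warm-up and cosine decay described above can be sketched as a function of the training step. The peak rate, final rate and warm-up length default to the values quoted for the ViT-B/16 visual encoder; the linear shape of the warm-up and the `total_steps` value in the example are assumptions, and `learning_rate` is a hypothetical helper.

```python
import math


def learning_rate(step, total_steps, peak_lr=1e-5, final_lr=1e-6, warmup_steps=1000):
    """Warm up to peak_lr (linearly, by assumption), then cosine-decay to final_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


total = 10_000  # hypothetical number of training steps
lr_start = learning_rate(0, total)
lr_peak = learning_rate(1000, total)
lr_mid = learning_rate(5500, total)
lr_end = learning_rate(total, total)
```

Different parameter groups (visual encoder versus text and fusion layers) would call the same function with their own `peak_lr`.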
5.2 Evaluation on Vision-Language Tasks
We compare our pre-trained model against other VLP models on six downstream V+L tasks. We introduce each task and our fine-tuning strategy below. Details of the datasets and fine-tuning hyperparameters are given in the Appendix.
5.2.1 Visual Question Answering
The VQA task requires the model to answer natural language questions given an image. Most methods [54, 57, 36, 58] treat visual question answering as multi-label classification over predefined answer sets. This strategy achieves strong performance, but it is not suitable for real-world open scenarios. We treat VQA as an answer generation task and directly use unconstrained open-vocabulary generation during inference, which differs from constrained closed-vocabulary generation models [33, 56]. Following [36, 56], we concatenate the question with the object labels and OCR tokens extracted from the image. As shown in Table 2, mPLUG achieves 81.27 on the Test-std split and outperforms the SOTA models including SimVLM and Florence, which use 100× and 60× more pre-training image-text pairs, respectively. Based on the same 4M pre-training data, mPLUG outperforms CLIP-ViL and METER, which also use CLIP as the visual encoder. Besides, under the same settings, mPLUG consistently and significantly outperforms ALBEF and BLIP, which rely only on co-attention from images to text for cross-modal fusion. The gain likely derives from the design of the cross-modal skip-connections, which specifically targets the information asymmetry between the two modalities. Neither ALBEF nor BLIP addresses this problem well, as both are biased towards the language modality.
5.2.2 Image Captioning
The image captioning task requires a model to generate an appropriate and fluent caption for a given image. We evaluate image captioning on two datasets, COCO Caption and NoCaps. We fine-tune mPLUG on the COCO Caption training data and test it on both the Karpathy test split [36, 58] and the NoCaps validation set. Following [36, 56], we first fine-tune mPLUG with cross-entropy loss and then with CIDEr optimization for 5 extra epochs. As shown in Table 1, mPLUG with only 14M pre-training images outperforms the SOTA models on both COCO Caption and NoCaps, including LEMON and SimVLM, which use more than 10× and 100× as much pre-training data, respectively. On COCO Caption, mPLUG performs best on CIDEr and surpasses the SOTA model by a large margin of 5.5 on the Karpathy test set. We use the best checkpoint on COCO Caption to predict on the NoCaps validation set directly.
5.2.3 Image-Text Retrieval
We conduct experiments for both image-to-text retrieval (TR) and text-to-image retrieval (IR) on the COCO and Flickr30K datasets. Following [33, 32], we jointly optimize the ITC loss and the ITM loss during fine-tuning. During inference, we first select the top-k candidates by computing the dot-product similarity between the image and text encoder features, and then rerank the selected candidates based on their ITM scores. The value of k is set separately for COCO and for Flickr30K. As shown in Table 3, mPLUG outperforms all existing methods on both datasets. Using 14M images, mPLUG achieves better performance than BLIP with 129M and Florence with 0.9B pre-training data. Using the same 14M pre-training images, mPLUG substantially outperforms the previous best model BLIP, by +2.7% in TR recall@1 on COCO and +1.0% in TR recall@1 on Flickr30K.
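The two-stage inference procedure can be sketched as follows. The cheap dot-product similarity shortlists k candidates, and only those are passed to the more expensive matching score. Here `itm_score` is a hypothetical stand-in (a lookup table) for the ITM head of the fusion network, and the vectors and k are toy values.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_emb, candidate_embs, itm_score, k):
    """Shortlist the top-k candidates by dot product, then rerank them by ITM score."""
    sims = [dot(query_emb, c) for c in candidate_embs]
    top_k = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    return sorted(top_k, key=itm_score, reverse=True)


candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
fake_itm = {0: 0.2, 1: 0.9, 2: 0.99}  # stand-in for the fusion network's scores

result = retrieve([1.0, 0.0], candidates, itm_score=fake_itm.get, k=2)
```

Candidate 2 has the highest matching score but is never scored by the fusion network because it does not survive the shortlist, so k trades recall against inference cost.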
5.2.4 Visual Grounding
Given a query in plain text and an image, visual grounding requires models to localize the referred object in the image. Instead of regressing the bounding boxes directly, we concatenate visual features and attended textual features and feed them into the decoder to predict the coordinates. Table 4 shows that mPLUG outperforms all the SOTA methods. We observe that in RefCOCO testB the images often contain arbitrary objects, and in RefCOCOg test-u the expressions are longer than in other datasets. Compared with the previous best model OFA, mPLUG achieves a 3.16% absolute improvement on RefCOCO testB and a 1.22% absolute improvement on RefCOCOg test-u. This demonstrates that mPLUG learns better multi-modal interaction from cross-modal skip-connections and is better at handling complex images and long queries.
5.2.5 Visual Reasoning
We consider two datasets for visual reasoning: NLVR2 and SNLI-VE. The NLVR2 task requires the model to predict whether a sentence describes a pair of images. Following prior work, we use two cross-attention layers to process the two input images, and their outputs are merged and fed to the FFN. An MLP classifier is then applied to the output embedding of the language [CLS] token. The SNLI-VE task requires the model to evaluate how the given image and text are semantically related, i.e., entailment, neutral, or contradiction. Following prior work, the image premise, text premise and text hypothesis are fed to the encoder. We remove the decoder and use only the encoder modules for three-way classification, which saves nearly half of the total computation cost. We predict the class probabilities from the multimodal encoder's output representation of the language [CLS] token. As shown in Table 5, mPLUG obtains performance competitive with the SOTA models on both visual reasoning tasks (SOTA models such as OFA and VLMo both add large-scale text-only and image-only pre-training data to improve reasoning ability), and even outperforms SimVLM and BLIP, which use far more pre-training data.
5.3 Effectiveness and Efficiency
To validate the effectiveness and efficiency of our proposed cross-modal skip-connected network, we conduct in-depth analysis on different stride values and various cross-modal fusion methods.
5.3.1 Analysis of Stride for Skip
The stride $S$ is the key factor controlling the tradeoff between effectiveness and efficiency. We therefore compare the running time and performance of different stride values $S$ in the cross-modal skip-connected network on the VQA and NLVR2 tasks. Specifically, we test four different stride values, each of which divides the total number of cross-modal fusion layers. The model is mPLUG$_{\text{ViT-B}}$ and all other experimental settings are kept the same. As shown in Figure 3, the larger $S$ is, the more efficient cross-modal fusion becomes: by skipping the vision co-attention layers, the running time is reduced by about 5 times across the tested range of strides. The performance of mPLUG on both datasets gradually increases up to an intermediate stride and decreases slightly afterwards. Moving to a larger stride keeps performance comparable while speeding up by nearly 30%, so we use the larger stride on mPLUG$_{\text{ViT-L}}$ for faster pre-training.
5.3.2 Analysis of Cross-modal Fusion
We compare the effectiveness and efficiency of different cross-modal fusion variants in terms of running time and performance on the VQA and NLVR2 tasks. Specifically, we pre-train mPLUG with different cross-modal fusion networks based on the same image encoder and text encoder. All pre-training settings and the number of fusion layers are kept the same as in the original mPLUG pre-training. As shown in Figure 4, the co-attention and connected-attention fusion methods both require much more running time because of the long visual sequence. Compared with these two fusion methods, our proposed skip-connected network is 4× faster and obtains better performance on both datasets. We also compare it with the asymmetric co-attention used in BLIP [33, 32], which relies only on co-attention layers from images to text. Although it runs slightly faster than the skip-connected network, the asymmetric co-attention is less accurate on both datasets. We attribute this degradation to the information asymmetry and the bias towards language, as discussed in Section 5.2.1.
|+ Gradient Checkpoint||238.2|
5.3.3 Large-scale Training
Combining the techniques introduced in Section 4 dramatically increases the training throughput. With the memory-saving and accelerated training techniques, the throughput of mPLUG improves by more than 3×, from 124 samples per second to 422 samples per second, as shown in Table 6.
|Model||Pre-training data||R@1||R@5||R@10|
|VATT||How100M, AudSet||-||-||29.7|
|ALPRO||W2M, C3M||24.1||44.7||55.4|
|VIOLET||Y180M, W2M, C3M||25.9||49.5||59.7|
|ALPRO||C3M, W2M||33.9||60.7||73.2|
|VIOLET||Y180M, C3M, W2M||34.5||63.0||73.4|
5.4 Zero-shot Transferability
In this section, we examine the generalization of mPLUG and compare zero-shot results on two vision-language and three video-language tasks.
5.4.1 Zero-shot Vision-Language Tasks
The pre-training of mPLUG adopts image-text contrastive and prefix language modeling tasks on large-scale image-text pairs. Thus, mPLUG has zero-shot generalization ability in image-text retrieval and image captioning. Image Caption: First, we take the pre-trained mPLUG model and directly decode on the NoCaps validation set without further fine-tuning. Following [58, 32], we feed a prefix prompt “A picture of” into the text encoder to improve the quality of the decoded captions. As shown in Table 7, the zero-shot performance of mPLUG is competitive with fully supervised baselines such as Oscar and VinVL. With further fine-tuning on the MSCOCO dataset, mPLUG outperforms SimVLM, which uses more pre-training image-text pairs and has more model parameters. Image-text Retrieval: We perform zero-shot retrieval on Flickr30K. The result is shown in Table 8, where zero-shot mPLUG outperforms models (CLIP, ALIGN, Florence) pre-trained with more image-text pairs. Following prior work, we also evaluate zero-shot retrieval with the model fine-tuned on the MSCOCO dataset. Table 8 shows that mPLUG achieves better performance than the previous SOTA models.
5.4.2 Zero-shot Transfer to Video-Language Tasks
To evaluate the generalization ability of mPLUG to video-language tasks, we conduct zero-shot experiments on video-text retrieval, video captioning and video question answering. Following prior work, we uniformly sample a fixed number of frames from each video (with a separate setting for retrieval, QA and captioning), and concatenate the frame features into a single sequence. Video-text Retrieval: We evaluate the mPLUG models pre-trained and further fine-tuned on the COCO retrieval image-text dataset, without any video pre-training or supervision. Table 9 shows that zero-shot mPLUG outperforms SOTA models pre-trained on far more data (e.g., Florence, BLIP), and even outperforms models fine-tuned on the supervised video dataset without using temporal information (e.g., VideoCLIP, VIOLET). Video Question Answering: Following BLIP, we treat video QA as an answer generation task and perform evaluation with models fine-tuned on VQA. As shown in Table 10, zero-shot mPLUG outperforms BLIP pre-trained with more image-text pairs. Video Caption: We use a prefix prompt “A video of” to improve the quality of the decoded captions. Table 10 shows that zero-shot mPLUG also achieves better performance than BLIP.
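Uniform frame sampling and concatenation can be sketched as below. The choice of taking the middle frame of each equal-length segment is one common convention and is an assumption here, as are the helper names and the frame counts in the example.

```python
def uniform_frame_indices(num_video_frames, num_samples):
    """Pick the middle frame of each of num_samples equal-length segments."""
    segment = num_video_frames / num_samples
    return [min(num_video_frames - 1, int(segment * (i + 0.5)))
            for i in range(num_samples)]


def concat_frame_features(per_frame_tokens):
    """Join per-frame token sequences into one long visual sequence."""
    return [token for frame in per_frame_tokens for token in frame]


indices = uniform_frame_indices(num_video_frames=100, num_samples=4)

# Two frames with three tokens each become a single sequence of six tokens.
sequence = concat_frame_features([["f0_t0", "f0_t1", "f0_t2"],
                                  ["f1_t0", "f1_t1", "f1_t2"]])
```

Because the visual sequence length grows linearly with the number of sampled frames, the cost of any layer that attends over vision grows with it, which makes the skipped visual layers particularly useful for video input.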
This paper presents mPLUG, an effective and efficient VLP framework for both cross-modal understanding and generation. mPLUG introduces a new asymmetric vision-language architecture with novel cross-modal skip-connections, to address two fundamental problems of information asymmetry and computation efficiency in cross-modal alignment. Pretrained on large-scale image-text pairs, mPLUG achieves state-of-the-art performance on a wide range of vision-language tasks. mPLUG also demonstrates strong zero-shot transfer ability when directly applied to multiple video-language tasks. Our work explores the cross-modal alignment with a newly-designed VLP architecture and we hope it can help promote future research on image-text foundation models.
-  (2018) Nocaps: novel object captioning at scale. CoRR abs/1812.08658. External Links: Cited by: §5.2.2.
Vatt: transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems 34. Cited by: Table 9.
-  (2018) Bottom-up and top-down attention for image captioning and visual question answering. In , pp. 6077–6086. Cited by: §1, §3.1.
-  (2015) Vqa: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §2.1, §5.2.1.
-  (2021) Frozen in time: a joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1728–1738. Cited by: Table 9.
Palm: pre-training an autoencoding&autoregressive language model for context-conditioned generation. arXiv preprint arXiv:2004.07159. Cited by: §3.3.
-  (2021) Conceptual 12m: pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3558–3568. Cited by: §5.1.
-  (2016) Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174. Cited by: §4.
-  (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §2.1.
-  (2015) Microsoft COCO captions: data collection and evaluation server. CoRR abs/1504.00325. External Links: Cited by: §5.2.2.
-  (2020) Uniter: universal image-text representation learning. In European conference on computer vision, pp. 104–120. Cited by: §1, §2.1, Table 2, Table 3, Table 4, Table 5.
Unifying vision-and-language tasks via text generation.
Proceedings of the 38th International Conference on Machine Learning, M. Meila and T. Zhang (Eds.), Proceedings of Machine Learning Research, Vol. 139, pp. 1931–1942. External Links: Cited by: Table 2, Table 5.
-  (2020) Randaugment: practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703. Cited by: §5.1.
-  (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.3, §5.1.
-  (2020) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, Cited by: §3.1.
-  (2021) An empirical study of training end-to-end vision-and-language transformers. arXiv preprint arXiv:2111.02387. Cited by: §1, §1, §3.1, Table 2, Table 5.
-  (2021) VIOLET: end-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681. Cited by: Table 9.
-  (2020) Large-scale adversarial training for vision-and-language representation learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Cited by: Table 4.
-  (2017) Audio set: an ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 776–780. Cited by: Table 9.
-  (2017) Making the v in vqa matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913. Cited by: §7.1.
-  (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738. Cited by: §3.3.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.2.
-  (2021) Scaling up vision-language pre-training for image captioning. CoRR abs/2111.12233. Cited by: Table 1.
-  (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §2.2.
-  (2020) Pixel-bert: aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849. Cited by: §1, §1, §2.1, §2.1.
-  (2021) Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918. Cited by: §2.1, Table 3, Table 8.
-  (2021) MDETR - modulated detection for end-to-end multi-modal understanding. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 1760–1770. Cited by: Table 4.
-  (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §7.1.
-  (2021) Vilt: vision-and-language transformer without convolution or region supervision. arXiv preprint arXiv:2102.03334. Cited by: §1, §2.1.
-  (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 123 (1), pp. 32–73. Cited by: §5.1, §7.1.
-  (2021) Align and prompt: video-and-language pre-training with entity prompts. arXiv preprint arXiv:2112.09583. Cited by: Table 9.
-  (2022) Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086. Cited by: §5.2.3, §5.2.5, §5.3.2, §5.4.1, §5.4.2, Table 1, Table 10, Table 2, Table 3, Table 5, Table 8, Table 9.
-  (2021) Align before fuse: vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems 34. Cited by: §1, §1, §2.1, §3.3, §3.3, §5.1, §5.2.1, §5.2.3, §5.3.2, Table 2, Table 3, Table 5, Table 8, §7.1, §7.1.
-  (2019) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §1.
-  (2020) UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409. Cited by: Table 3, Table 5.
-  (2020) Oscar: object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pp. 121–137. Cited by: §1, §1, §2.1, §5.2.1, §5.2.2, Table 1, Table 2, Table 3, Table 7, §7.1.
-  (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §5.1, §5.2.3.
-  (2021) Rethinking skip connection with layer normalization in transformers and resnets. arXiv preprint arXiv:2105.07205. Cited by: §2.2.
-  (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §5.1.
-  (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13–23. Cited by: Table 2, Table 4.
-  (2016) Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 11–20. Cited by: §7.1.
-  (2020) End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9879–9889. Cited by: Table 9.
-  (2019) Howto100m: learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2630–2640. Cited by: Table 9.
-  (2011) Im2text: describing images using 1 million captioned photographs. In Advances in neural information processing systems, pp. 1143–1151. Cited by: §5.1.
-  (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pp. 2641–2649. Cited by: §5.2.3.
-  (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. Cited by: §2.1, §5.1, §5.2.1, Table 8, Table 9.
-  (2020) Zero: memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16. Cited by: §4.
-  (2017) Self-critical sequence training for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1179–1195. Cited by: §5.2.2, §7.1.
-  (2018) Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565. Cited by: §5.1, Table 9.
-  (2021) How much can clip benefit vision-and-language tasks?. arXiv preprint arXiv:2107.06383. Cited by: §3.1, Table 2, Table 5.
-  (2015) Highway networks. arXiv preprint arXiv:1505.00387. Cited by: §2.2.
-  (2018) A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491. Cited by: §5.2.5, §7.1.
-  (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence. Cited by: §2.2.
-  (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §1, §2.1, §5.2.1, Table 5.
-  (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.2.
-  (2022) Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052. Cited by: §5.2.1, §5.2.2, §5.2.5, Table 1, Table 2, Table 4, Table 5, §7.1.
-  (2021) VLMo: unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358. Cited by: §2.1, §5.2.1, Table 2, Table 3, Table 5.
-  (2021) SimVLM: simple visual language model pretraining with weak supervision. CoRR abs/2108.10904. Cited by: §1, §1, §5.2.1, §5.2.2, §5.2.5, §5.4.1, Table 1, Table 2, Table 5, Table 7.
-  (2019) Visual entailment: A novel task for fine-grained image understanding. CoRR abs/1901.06706. Cited by: §5.2.5, §7.1.
-  (2021) E2E-vlp: end-to-end vision-language pre-training enhanced by visual learning. arXiv preprint arXiv:2106.01804. Cited by: §2.1, Table 1, Table 2, Table 3.
-  (2021) VideoCLIP: contrastive pre-training for zero-shot video-text understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6787–6800. Cited by: Table 9.
-  (2021) Just ask: learning to answer questions from millions of narrated videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1686–1697. Cited by: Table 10.
-  (2021) Crossing the format boundary of text and boxes: towards unified vision-language modeling. CoRR abs/2111.12085. Cited by: Table 4.
-  (2021) FILIP: fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783. Cited by: Table 8.
-  (2021) ERNIE-vil: knowledge enhanced vision-language representations through scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 3208–3216. Cited by: §1, §2.1.
-  (2016) Modeling context in referring expressions. In European Conference on Computer Vision, pp. 69–85. Cited by: §2.1, §7.1.
-  (2021) Florence: a new foundation model for computer vision. arXiv preprint arXiv:2111.11432. Cited by: Table 2, Table 3, Table 8, Table 9.
-  (2021) Merlot: multimodal neural script knowledge models. Advances in Neural Information Processing Systems 34. Cited by: Table 9.
-  (2021) Vinvl: revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5579–5588. Cited by: §1, §3.1, Table 1, Table 7.
7 More Experimental Details
7.1 Downstream Task Details
We evaluate mPLUG on six downstream vision-language tasks. The hyperparameters used for fine-tuning on the downstream tasks are listed in Table 11. Following , all tasks adopt RandAugment, the AdamW optimizer with a weight decay of 0.05, and a cosine learning rate schedule. We use an image resolution of 336 × 336, except for VQA, where we use 504 × 504 images. For the VQA and image captioning tasks, we additionally continue pre-training on 4M image-text pairs, which brings an accuracy improvement of about 0.2 or more. Next, we introduce the dataset settings in detail.
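As a rough illustration of the cosine learning rate schedule mentioned above, the following sketch computes the decayed learning rate at a given step. It is not the authors' training code: the function name `cosine_lr`, the absence of warmup, the decay floor `min_lr = 0`, and stepping once per epoch are all assumptions made for the example, and the AdamW weight decay of 0.05 is not modeled here.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine-decayed learning rate at `step` (0-indexed) out of `total_steps`.

    Hypothetical helper: no warmup, decays from `base_lr` down to `min_lr`.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside [0, total_steps] stay at the endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: a base learning rate of 1e-5 decayed over 5 epochs,
# stepping once per epoch (an assumption; per-iteration stepping is also common).
schedule = [cosine_lr(epoch, 5, 1e-5) for epoch in range(5)]
```

The schedule starts at the base learning rate and decreases monotonically, so the largest updates happen early in fine-tuning.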
We conduct experiments on the VQA2.0 dataset , which contains 83k/41k/81k images for training/validation/test. Following , we use both the training and validation splits for training, and incorporate additional training data from Visual Genome .
| Task | LR (ViT-L/) | Batch size | Epochs |
| Captioning | 1e-5 & 8e-7 | 256 | 5 |
We fine-tune on COCO’s Karpathy train split, and evaluate on COCO’s Karpathy test split and the NoCaps validation split. Following [36, 56], we first fine-tune mPLUG with a cross-entropy loss for 5 epochs with a learning rate of 1e-5 and a batch size of 256. Starting from this fine-tuned model, we then fine-tune it with CIDEr optimization  for an extra 5 epochs with a smaller learning rate of 8e-7. During inference, we use beam search with a beam size of 10 and set the maximum generation length to 20.
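To make the inference setting above concrete, here is a minimal beam search sketch with a beam size of 10 and a maximum generation length of 20. It is an illustration under stated assumptions, not the decoder used in the paper: `step_fn` is a hypothetical interface that maps a token prefix to next-token log-probabilities, `toy_step` is a made-up distribution standing in for a caption decoder, and no length normalization is applied.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: maps a token prefix to {next_token: log_probability}.
StepFn = Callable[[Tuple[str, ...]], Dict[str, float]]
Hypothesis = Tuple[Tuple[str, ...], float]


def beam_search(step_fn: StepFn, beam_size: int = 10, max_len: int = 20,
                eos: str = "<eos>") -> List[Hypothesis]:
    """Return hypotheses sorted by total log-probability, best first."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live prefix by every possible next token.
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the beam_size highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    # Prefixes that reached max_len without emitting eos are kept as they are.
    finished.extend(beams)
    return sorted(finished, key=lambda c: c[1], reverse=True)


def toy_step(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Made-up next-token distribution standing in for a caption decoder."""
    if len(prefix) == 0:
        return {"a": math.log(0.6), "the": math.log(0.4)}
    if len(prefix) == 1:
        return {"dog": math.log(0.7), "cat": math.log(0.3)}
    return {"<eos>": math.log(0.9), "runs": math.log(0.1)}


results = beam_search(toy_step)
best_seq, best_score = results[0]
```

Because scores are summed log-probabilities without normalization, this sketch favors shorter captions; real systems often add a length penalty, which the text above does not specify.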
We adopt the widely-used Karpathy split  for both COCO and Flickr30K. COCO contains 113k/5k/5k images for train/validation/test, and Flickr30K contains 29k/1k/1k images for train/validation/test.
We evaluate our method on three referring expression grounding datasets: RefCOCO, RefCOCO+  and RefCOCOg . The RefCOCO and RefCOCO+ datasets share 19K images and contain 142K/141K queries, respectively. The RefCOCOg dataset contains 25K images and 95K queries. To make full use of the training data, we first train the model on a mixed dataset with a learning rate of 2e-5. Then we continue fine-tuning the model on each dataset with a learning rate of 2e-6.
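The two-stage grounding recipe described above can be sketched as a simple training plan. This is only an illustrative layout: the helper name `build_grounding_plan`, the dictionary keys, and the reading that the mixed dataset combines all three grounding datasets and that each per-dataset stage starts from the mixed-stage checkpoint are assumptions, not details confirmed by the text.

```python
from typing import Dict, List, Sequence, Union

Stage = Dict[str, Union[str, float, List[str]]]


def build_grounding_plan(datasets: Sequence[str],
                         mixed_lr: float = 2e-5,
                         per_dataset_lr: float = 2e-6) -> List[Stage]:
    """Hypothetical two-stage plan: one mixed stage, then one stage per dataset."""
    # Stage 1: train once on the union of all datasets (assumed composition).
    plan: List[Stage] = [{
        "stage": "mixed",
        "train_on": list(datasets),
        "lr": mixed_lr,
        "init_from": "pretrained",
    }]
    # Stage 2: continue from the mixed-stage checkpoint on each dataset separately.
    for name in datasets:
        plan.append({
            "stage": f"finetune-{name}",
            "train_on": [name],
            "lr": per_dataset_lr,
            "init_from": "mixed",
        })
    return plan


plan = build_grounding_plan(["RefCOCO", "RefCOCO+", "RefCOCOg"])
```

The second stage uses a learning rate ten times smaller than the first, so it adapts the mixed-stage model to each dataset with comparatively small updates.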
NLVR2 & SNLI-VE.
7.2 Pre-training Dataset Details
Table 12 shows the statistics of the 14M pre-training images with texts.