
VLP: A Survey on Vision-Language Pre-training

by   Feilong Chen, et al.

In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era. Substantial works have shown they are beneficial for downstream uni-modal tasks and avoid training a new model from scratch. So can such pre-trained models be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize the specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey on VLP. We hope that this survey can shed light on future research in the VLP field.





1 Introduction

Making machines respond in ways similar to humans has been a relentless goal of AI researchers. To enable machines to perceive and think, researchers have proposed a series of related tasks, such as face recognition, reading comprehension, and human-machine dialogue, to train and evaluate the intelligence of machines in a particular aspect. Specifically, domain experts manually construct standard datasets and then train and evaluate relevant models on them. However, due to the limitations of related technologies, it is often necessary to train on a large amount of labelled data to obtain a better and more capable model. The recent emergence of pre-training models based on the Transformer structure has alleviated this problem. They are first pre-trained via self-supervised learning, which typically exploits auxiliary tasks (pre-training objectives) to mine supervision signals from large-scale unlabelled data, thereby learning universal representations. Then they can achieve surprising effectiveness by fine-tuning with only a tiny amount of manually-labelled data on downstream tasks. Since the advent of BERT [DBLP:conf/naacl/DevlinCLT19] in natural language processing (NLP), various pre-training models have sprung up in the uni-modal field, such as Vision Transformer (ViT) [dosovitskiy2020image] in computer vision (CV) and wav2vec [DBLP:conf/interspeech/SchneiderBCA19] in speech. Substantial works have shown they are beneficial for downstream uni-modal tasks and avoid training a new model from scratch.

As in the uni-modal field, high-quality labelled data is scarce in the multi-modal field. The natural question is, can the above pre-training method be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. In this paper, we focus on mainstream vision-language pre-training (VLP), including image-text and video-text pre-training. VLP mainly learns the semantic correspondence between different modalities by pre-training on large-scale data. For example, in image-text pre-training, we expect the model to associate “dog” in text with what “dog” looks like in images. In video-text pre-training, we expect the model to map objects/actions in the text to objects/actions in the video. To achieve this goal, the VLP objectives and model architecture need to be cleverly designed to allow the model to mine the associations between different modalities.

To give readers a better global grasp of VLP, we first comprehensively review its recent advances and focus on five significant aspects:

  • Feature extraction. This section includes the preprocessing and representation methods of image, video, and text in VLP models (see Section 2).

  • Model architecture. We introduce the architecture of the VLP models from two different perspectives: Single-stream versus Dual-stream from multi-modal fusion perspective, and Encoder-only versus Encoder-decoder from the overall architectural design perspective (see Section 3).

  • Pre-training objectives. Pre-training objectives are the core of VLP, mainly used to guide the model to learn vision-language associated information. We summarize typical and characteristic pre-training objectives divided into completion, matching, temporal, and particular types (see Section 4).

  • Pre-training datasets. Data is critical for VLP. We briefly introduce mainstream corpora for VLP and their specific sizes (see Section 5).

  • Downstream tasks. Various tasks require a cooperative knowledge of both vision and language. We divide them into five categories: classification, regression, retrieval, generation, and other tasks. We also discuss the basic details and goals of these tasks (see Section 6).

Then we summarize the specific state-of-the-art (SOTA) VLP models in detail (see Section 7). Finally, we conclude the paper and broadly discuss new frontiers in VLP (see Section 8).

To the best of our knowledge, this is the first survey on VLP. We hope that our survey can help researchers better understand this field and inspire them to design better models.

2 Feature Extraction

This section describes how VLP models preprocess and represent images, videos, and text to obtain the corresponding features.

2.1 Feature Extraction

2.1.1 Image Feature Extraction

(1) OD-based Region Features (OD-RFs).

Most previous work on VLP utilizes pre-trained object detectors to extract visual features. The most commonly used object detection model is Faster R-CNN with bottom-up attention [anderson2018bottom]. It is designed to identify objects belonging to certain classes and localize them with bounding boxes. By using Faster R-CNN, VLP models obtain the OD-based region feature embedding of an image with selected regions. Each region feature is a 2048-d Region-of-Interest (RoI) feature with its bounding box. The bounding box is defined by the coordinates of the bottom-left and top-right corners of the region. VLP models use the bounding boxes to construct low-dimensional position vectors, and each vector is embedded into a high-dimensional representation (2048-d) named the visual geometry embedding. The OD-RFs are obtained by adding the OD-based region feature embedding to its visual geometry embedding. Although OD-RFs have brought impressive performance, extracting region features can be time-consuming. To relieve this problem, the pre-trained object detectors are usually frozen during pre-training, which can limit the capacity of VLP models.
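The construction of OD-RFs described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular VLP model: the 5-d geometry layout (normalized corners plus relative area) and the linear projection `w_geo`, `b_geo` are assumptions made for the example, and the RoI features are taken as given from a frozen detector.

```python
import numpy as np

def od_region_features(roi_feats, boxes, img_w, img_h, w_geo, b_geo):
    """Add an embedded box-geometry vector to each RoI feature.

    roi_feats: (K, D) region features from a frozen object detector.
    boxes:     (K, 4) boxes as (x1, y1, x2, y2) in pixels.
    w_geo, b_geo: hypothetical linear projection from geometry to D dims.
    """
    x1, y1, x2, y2 = boxes.T
    # Assumed geometry vector: normalized corners plus relative box area.
    area = (x2 - x1) * (y2 - y1) / (img_w * img_h)
    geo = np.stack([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, area], axis=1)
    geo_emb = geo @ w_geo + b_geo   # (K, D) visual geometry embedding
    return roi_feats + geo_emb      # (K, D) OD-RFs
```

Because the two embeddings are summed, the geometry projection must map to the same dimensionality as the region features.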

(2) CNN-based Grid Features (CNN-GFs).

VLP models extract visual features by utilizing convolutional neural networks (CNNs) to obtain the grid features. On the one hand, VLP models can train the CNNs end-to-end by using the grid features directly. On the other hand, VLP models can also first discretize grid features using a learned vision dictionary, then feed them into the cross-modal module.

(3) ViT-based Patch Features (ViT-PFs).

Inspired by ViT, VLP models reshape the image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW / P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. An input image is encoded into a sequence of embeddings: $\{v_{cls}, v_1, \ldots, v_N\}$, where $v_{cls}$ is the embedding of the [CLS] token.
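The patch reshaping can be sketched in NumPy as below. This is an illustrative sketch under the assumption that the image height and width are divisible by the patch size; the function name `patchify` is hypothetical, and the learned linear projection and [CLS] embedding that follow in ViT are omitted.

```python
import numpy as np

def patchify(image, patch):
    """Reshape an (H, W, C) image into (N, P*P*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by P"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)   # (gh, gw, P, P, C)
    return x.reshape(gh * gw, patch * patch * c)
```

For a 224×224 RGB image with 16×16 patches, this yields 196 patches of length 768.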

2.1.2 Video Feature Extraction

A video clip is denoted as $M$ frames (images). VLP models extract the frame features by using the methods mentioned above. The two most commonly used features are CNN-GFs and ViT-PFs. For CNN-GFs, VLP models first use ResNet pre-trained on ImageNet and SlowFast pre-trained on Kinetics to extract 2D and 3D visual features for each video frame. These features are concatenated as visual features and fed through a fully-connected (FC) layer to be projected into the same lower-dimensional space as token embeddings. For ViT-PFs, a video clip $V \in \mathbb{R}^{M \times 3 \times H \times W}$ consists of $M$ frames of resolution $H \times W$, where $M = 1$ for images. Following the protocol in ViT and TimeSformer, the input video clip is divided into $M \times N$ non-overlapping spatio-temporal patches of size $P \times P$, where $N = HW / P^2$.
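As a rough sketch of the ViT-PF case for video, the snippet below splits a clip into per-frame patches. It assumes a channels-first clip layout of shape (frames, channels, height, width) and a patch size that divides the frame resolution; the function name `video_patches` is hypothetical.

```python
import numpy as np

def video_patches(clip, patch):
    """Split an (M, C, H, W) clip into (M*N, C*P*P) patches, N = HW / P^2."""
    m, c, h, w = clip.shape
    gh, gw = h // patch, w // patch
    x = clip.reshape(m, c, gh, patch, gw, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)   # (M, gh, gw, C, P, P)
    return x.reshape(m * gh * gw, c * patch * patch)
```

A single image is simply the special case of a one-frame clip, so the same routine covers both image and video inputs.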

2.1.3 Text Feature Extraction

For the textual features, following BERT, VLP models first segment the input sentence into a sequence of subwords. They then insert a start-of-sequence token at the beginning and an end-of-sequence token at the end of the sequence to generate the input text sequence. Text input representations are computed by summing the corresponding word embedding, text position embedding, and text type embedding.
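The summation of the three embeddings can be sketched as below. The token ids `CLS_ID` and `SEP_ID`, the embedding tables, and the single text type id are placeholders chosen for illustration; real models use their own tokenizer vocabulary and learned tables.

```python
import numpy as np

CLS_ID, SEP_ID = 1, 2  # hypothetical ids for the start/end-of-sequence tokens

def text_input_embeddings(subword_ids, word_emb, pos_emb, type_emb, type_id=0):
    """Wrap subword ids with special tokens and sum the three embeddings."""
    ids = np.array([CLS_ID, *subword_ids, SEP_ID])
    positions = np.arange(len(ids))
    # word + position + type embedding, one row per input token
    return word_emb[ids] + pos_emb[positions] + type_emb[type_id]
```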

2.2 Feature Representation

To make full use of uni-modal pre-trained models, VLP models can send the visual or text features to a transformer encoder. Specifically, VLP models can utilize a standard transformer encoder with random initialization to generate the visual or textual representation. In addition, VLP models can utilize a pre-trained visual transformer, such as ViT and DeiT [pmlr-v139-touvron21a], to encode the ViT-PFs, and a pre-trained textual transformer, such as BERT, to encode the textual features. For simplicity, we name these transformers Xformer.

3 Model Architecture

In this section, we introduce the architecture of the VLP models from two different perspectives: (1) Single-stream versus Dual-stream from multi-modal fusion perspective, and (2) Encoder-only versus Encoder-decoder from the overall architectural design perspective.


Figure 1: Illustration of two types of model architectures for VLP.

3.1 Single-stream versus Dual-stream

Single-stream Architecture.

In the single-stream architecture, the text and visual features are concatenated and then fed into a single transformer block, as shown in Figure 1 (a). The single-stream structure utilizes merged attention to fuse multimodal inputs. The single-stream architecture is more parameter-efficient, as the same set of parameters is used for both modalities.

Dual-stream Architecture.

In the dual-stream architecture, the text and visual features are not concatenated but are sent to two different transformer blocks independently, as shown in Figure 1 (b). These two transformer blocks do not share parameters. To achieve higher performance, cross-attention (shown by the dotted line in Figure 1 (b)) is used to enable cross-modal interaction. To achieve higher efficiency, the cross-attention between the visual and textual transformer blocks can also be omitted.
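The difference between merged attention and cross-attention can be illustrated with a single attention head. This is a simplified sketch: the query, key, and value projections are taken to be the identity, and the function names are hypothetical rather than taken from any specific model.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

def merged_attention(text, vision):
    """Single-stream: concatenate both modalities, then self-attend."""
    x = np.concatenate([text, vision], axis=0)
    return attention(x, x, x)

def cross_attention(text, vision):
    """Dual-stream: each stream queries the other modality."""
    return attention(text, vision, vision), attention(vision, text, text)
```

In the merged case one set of parameters would serve both modalities, while in the cross-attention case each stream keeps its own block and only exchanges information through the other stream's keys and values.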

3.2 Encoder-only versus Encoder-decoder

Many VLP models adopt the encoder-only architecture, where the cross-modal representations are directly fed into an output layer to generate the final outputs. In contrast, other VLP models advocate using a transformer encoder-decoder architecture, where the cross-modal representations are first fed into a decoder and then to an output layer.

4 Pre-training Objectives

This section introduces how we pre-train VLP models by using different pre-training objectives, which are crucial for learning the universal representation of vision-language. We summarize the pre-training objectives into four categories: completion, matching, temporal, and particular types.

  • Completion reconstructs the masked elements by leveraging the unmasked remainder to understand the modality (see Sections 4.1, 4.2, and 4.3).

  • Matching unifies vision and language into a shared hidden space to generate a universal vision-language representation (see Sections 4.4, 4.5, and 4.6).

  • Temporal learns a good representation by reordering the disrupted input sequence (see Section 4.7).

  • Particular types consist of other pre-training objectives, such as visual question answering and visual captioning (see Section 4.8).

Now we introduce the most used pre-training objectives.

4.1 Masked Language Modeling

Masked language modeling (MLM), first proposed by Taylor [taylor1953cloze], became widely known when the BERT model adopted it as a novel pre-training task. MLM in VLP models is similar to MLM in pre-training language models (PLMs) but predicts the masked textual tokens not only from the rest of the textual tokens but also from the visual tokens. Empirically, VLP models following BERT randomly select each textual input token for masking with a probability of 15%. A selected token is replaced by the special token [MASK] 80% of the time, by a random textual token 10% of the time, and left unchanged 10% of the time.
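The 15% / 80-10-10 masking scheme can be sketched as follows. The `MASK_ID` value and the `-100` ignore label are assumptions made for the example (the latter mirrors a common deep-learning convention for positions that do not contribute to the loss).

```python
import random

MASK_ID = 103   # hypothetical vocabulary id of the [MASK] token
IGNORE = -100   # assumed label for positions that are not predicted

def mask_tokens(token_ids, vocab_size, mask_prob=0.15, rng=random):
    """Return (masked inputs, labels) following BERT-style masking."""
    inputs, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                      # token not selected
        labels[i] = tok                   # model must recover the original
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID           # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels
```

In a VLP model the prediction for each selected position is conditioned on both the remaining text tokens and the visual tokens.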

4.2 Prefix Language Modeling

Prefix Language Modeling (PrefixLM) unifies masked language modeling and language modeling (LM). PrefixLM is proposed to equip the model with solid generation capability that enables text-induced zero-shot generalization without fine-tuning. PrefixLM differs from the standard LM in that it enables bi-directional attention on the prefix sequence and conducts autoregressive factorization only on the remaining tokens. Under the sequence-to-sequence (seq2seq) framework, PrefixLM not only enjoys the bidirectional contextualized representation as in MLM but can also perform text generation similar to LM.
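One way to picture the attention pattern of PrefixLM is as a boolean mask in which entry (i, j) says whether position i may attend to position j. The sketch below is an illustration under that convention; the function name is hypothetical.

```python
import numpy as np

def prefix_lm_mask(seq_len, prefix_len):
    """Bidirectional attention inside the prefix, causal attention after it."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal base
    mask[:, :prefix_len] = True   # every position sees the whole prefix
    return mask
```

With `prefix_len = 0` this reduces to the standard causal LM mask, and with `prefix_len = seq_len` every position attends to every other position.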

4.3 Masked Vision Modeling

Like MLM, masked vision modeling (MVM) samples vision (image or video) regions or patches and usually masks their visual features with a probability of 15%. VLP models need to reconstruct the masked visual features given the remaining visual features and all the textual features. The masked visual features are set to zeros. Because visual features are high-dimensional and continuous, VLP models propose two variants for MVM.

(1) Masked Features Regression

learns to regress the model output of masked features to its original visual features. VLP models convert the model output of the masked features to a vector of the same dimension as the original visual features first and apply L2 regression between the original visual features and the vector.

(2) Masked Feature Classification

learns to predict the object semantic class for the masked features. VLP models first feed the output of the masked features into an FC layer to predict the object class scores, which then go through a softmax function to be transformed into a normalized prediction distribution. Note that there is no ground-truth label. There are two ways to train VLP models. In the first, VLP models take the most likely object class from the object detection model as a hard label (with probability 0 or 1), assuming the detected object class is the ground-truth label for the masked features, and apply a cross-entropy loss to minimize the gap between the prediction and the pseudo class. In the second, VLP models use a soft label as the supervision signal, which is the raw output of the detector (i.e., a distribution over object classes), and minimize the KL divergence between the two distributions.
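The two MVM variants and the hard/soft labelling options can be written as loss functions. The sketch below is illustrative: the function names are hypothetical, the inputs are assumed to cover only the masked positions, and the direction of the KL divergence (detector distribution relative to the model prediction) is an assumption.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mfr_loss(pred_feats, target_feats):
    """Masked feature regression: squared L2 distance, averaged over regions."""
    return np.mean(np.sum((pred_feats - target_feats) ** 2, axis=-1))

def mfc_hard_loss(logits, detector_probs):
    """Cross-entropy against the detector's most likely class (pseudo label)."""
    labels = detector_probs.argmax(axis=-1)
    log_p = np.log(softmax(logits) + 1e-12)
    return -np.mean(log_p[np.arange(len(labels)), labels])

def mfc_soft_loss(logits, detector_probs):
    """KL divergence between the detector distribution and the prediction."""
    p = softmax(logits)
    kl = detector_probs * (np.log(detector_probs + 1e-12) - np.log(p + 1e-12))
    return np.mean(kl.sum(axis=-1))
```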

4.4 Vision-Language Matching

Vision-Language Matching (VLM) is the most commonly used pre-training objective to align vision and language. Single-stream VLP models use the representation of the special token [CLS] as the fused representation of both modalities. Dual-stream VLP models concatenate the visual representation of the special visual token [CLS] and the textual representation of the special textual token [CLS] as the fused representation of both modalities. VLP models feed the fused representation to an FC layer and a sigmoid function to predict a score between 0 and 1, where 0 indicates that the vision and language are mismatched and 1 indicates that they are matched. During training, VLP models sample positive or negative pairs from the dataset at each step. A negative pair is created by replacing the vision or text in a paired sample with one randomly selected from other samples.
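A minimal sketch of VLM training data and loss is given below. For brevity it only replaces the text side when building negatives (the vision side could be replaced in the same way), and the 50% negative rate, the function names, and the FC parameters `w`, `b` are assumptions for illustration.

```python
import numpy as np

def make_vlm_batch(vision, text, rng, neg_prob=0.5):
    """Turn aligned pairs into matched (1) and mismatched (0) pairs."""
    n = len(vision)
    text_idx = np.arange(n)
    labels = np.ones(n)
    for i in range(n):
        if n > 1 and rng.random() < neg_prob:
            j = rng.integers(n - 1)
            text_idx[i] = j if j < i else j + 1   # any index except i
            labels[i] = 0.0
    return vision, text[text_idx], labels

def vlm_loss(fused, w, b, labels):
    """FC layer + sigmoid score, trained with binary cross-entropy."""
    scores = 1.0 / (1.0 + np.exp(-(fused @ w + b)))
    eps = 1e-12
    return -np.mean(labels * np.log(scores + eps)
                    + (1 - labels) * np.log(1 - scores + eps))
```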

4.5 Vision-Language Contrastive Learning

Vision-Language Contrastive Learning (VLC) predicts the $N$ matched vision-language pairs from the $N^2$ possible vision-language pairs given a batch of $N$ vision-language pairs. Note that there are $N^2 - N$ negative vision-language pairs within a training batch. VLP models use the visual representation of the special visual token [CLS] and the textual representation of the special textual token [CLS] to denote the aggregated representation of the vision and language, respectively. VLP models compute the softmax-normalized vision (image or video)-to-text similarity and text-to-vision similarity and leverage cross-entropy losses over the vision-to-text and text-to-vision similarities to update themselves. The similarity is often implemented by dot products.
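The symmetric contrastive loss can be sketched as below, where row i of each input is the [CLS] representation of the i-th pair in the batch. The L2 normalization and the temperature value are assumptions borrowed from common practice, not details stated above.

```python
import numpy as np

def log_softmax(x, axis):
    z = x - x.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def vlc_loss(vision_cls, text_cls, temperature=0.07):
    """Cross-entropy over vision-to-text and text-to-vision similarities."""
    v = vision_cls / np.linalg.norm(vision_cls, axis=1, keepdims=True)
    t = text_cls / np.linalg.norm(text_cls, axis=1, keepdims=True)
    sim = v @ t.T / temperature   # (batch, batch); diagonal = matched pairs
    v2t = -np.mean(np.diag(log_softmax(sim, axis=1)))
    t2v = -np.mean(np.diag(log_softmax(sim, axis=0)))
    return 0.5 * (v2t + t2v)
```

Every off-diagonal entry of the similarity matrix acts as a negative pair, so larger batches provide more negatives at no extra labelling cost.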

4.6 Word-Region Alignment

Word-Region Alignment (WRA) is an unsupervised pre-training objective to align vision regions (vision patches) and words. VLP models utilize Optimal Transport to learn the alignment between vision and language. Empirically, VLP models use the IPOT algorithm to approximate the OT distance since the exact minimization is computationally intractable. After solving minimization, the OT distance serves as the WRA loss to train VLP models.
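As a rough sketch of how the OT distance can be approximated, the snippet below follows the general form of the IPOT proximal-point iteration with uniform marginals over words and regions. The cost matrix (for example, one minus cosine similarity between word and region embeddings), the step size `beta`, and the iteration counts are assumptions for illustration.

```python
import numpy as np

def ipot_distance(cost, beta=0.5, n_outer=50, n_inner=1):
    """Approximate the OT distance for an (n, m) cost matrix."""
    n, m = cost.shape
    mu, nu = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # uniform marginals
    b = np.full(m, 1.0 / m)
    kernel = np.exp(-cost / beta)
    plan = np.ones((n, m))
    for _ in range(n_outer):
        q = kernel * plan                 # proximal step around current plan
        for _ in range(n_inner):
            a = mu / (q @ b)              # rescale rows
            b = nu / (q.T @ a)            # rescale columns
        plan = a[:, None] * q * b[None, :]
    return float(np.sum(plan * cost)), plan
```

The returned scalar is the inner product of the transport plan and the cost matrix, which is the quantity used as the WRA loss.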

4.7 Frame Order Modeling

To better model the timing of the video, VLP models randomly disrupt the order of some input frames and then predict the actual position of each frame. Frame Order Modeling (FOM) is modeled as a classification task in practice.
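The data preparation for FOM can be sketched as follows: a subset of frame positions is selected, the frames at those positions are permuted among themselves, and the classification target for each selected position is the original index of the frame now placed there. The selection rate and the `-1` ignore label are assumptions for illustration.

```python
import numpy as np

def shuffle_frames(frames, rng, shuffle_prob=0.15):
    """Return (reordered frames, per-position targets) for FOM."""
    m = len(frames)
    picked = np.flatnonzero(rng.random(m) < shuffle_prob)
    order = np.arange(m)
    order[picked] = rng.permutation(picked)   # permute only picked positions
    targets = np.full(m, -1)                  # -1: position is not predicted
    targets[picked] = order[picked]           # original index of each frame
    return frames[order], targets
```

The model then predicts, for every selected position, a class among the possible frame indices.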

4.8 Particular Pre-training Objectives

VLP models also sometimes use the training objectives of some downstream tasks, such as visual question answering (VQA) and visual captioning (VC), as pre-training objectives. For VQA, VLP models take the fused representation mentioned above, apply an FC layer, and use the transformed representation to classify over predefined answer candidates. Besides treating the task as classification over predefined answer candidates, VLP models can also directly generate answers in their original text format. For VC, VLP models employ an auto-regressive decoder to generate a textual description of the image or video, reconstructing the input sentence and thereby acquiring generation capability.

Note that due to space limitations, we only introduce some popular pre-training objectives. We omit some specific pre-training objectives such as grounding referring expression (GRE), image-conditioned denoising autoencoding (IDA) [xia2021xgpt], text-conditioned image feature generation (TIFG) [xia2021xgpt], object detection (OD) [kamath2021mdetr] and aligned Kaleido patch modeling (AKPM) [zhuge2021kaleido]. Moreover, we put masked action prediction into the category of MVM.

5 Pre-training Datasets

Most datasets for VLP are constructed by combining public datasets across different multi-modal tasks. However, some previous works, such as VideoBERT [sun2019videobert], ImageBERT [qi2020imagebert], ALIGN [jia2021scaling], and CLIP [radford2021learning], process a huge amount of data collected from the internet and conduct pre-training with their self-constructed datasets. Here, some mainstream corpora and their details are shown in Table 1.

| Dataset | # Images | # Image-text Pairs | Duration (hrs) |
|---|---|---|---|
| SBU [ordonez2011im2text] | 875K | 875K | - |
| FLKR [young2014image] | 29K | 145K | - |
| COCO [lin2014microsoft] | 113K | 567K | - |
| VG [krishna2017visual] | 108K | 5.4M | - |
| VGQA [krishna2017visual] | 108K | 1.8M | - |
| VQA [goyal2017making] | 83K | 444K | - |
| Matterport3D [chang2017matterport3d] | 104K | 104K | - |
| FashionGen [rostamzadeh2018fashion] | 260K | 260K | - |
| CC3M [sharma2018conceptual] | 3M | 3M | - |
| GQA [hudson2019gqa] | 82K | 1M | - |
| LAIT [qi2020imagebert] | 10M | 10M | - |
| CC12M [changpinyo2021conceptual] | 12M | 12M | - |
| AltText [jia2021scaling] | 1.8B | 1.8B | - |
| Kinetics [kay2017kinetics] | - | - | 1.4K |
| TV [lei2018tvqa] | - | - | 461 |
| HT100M [miech2019howto100m] | - | - | 134K |
| WebVid2M [bain2021frozen] | - | - | 13K |

Table 1: Details of some mainstream datasets for VLP.

6 Downstream Tasks

A diverse range of tasks requires a cooperative knowledge of vision and language. In this section, we introduce the fundamental details and goals of such tasks and divide them into five categories: classification, regression, retrieval, generation and other tasks, where classification, regression, and retrieval tasks are also known as understanding tasks.

6.1 Classification Tasks

Visual Question Answering (VQA).

Given a visual input (image or video), VQA is the task of correctly answering a question about it. It is usually regarded as a classification task where the model predicts the most suitable answer from a pool of choices.

Visual Reasoning and Compositional Question Answering (GQA).

GQA is an upgraded version of VQA and aims to advance research on the visual reasoning of natural scenes [hudson2019gqa]. The images, questions, and answers in its dataset have matching semantic representations. The advantage of this structured representation is that the distribution of answers can be more uniform, and we can analyze the model’s performance from more dimensions.

Video-Language Inference (VLI).

Given a video clip with aligned subtitles as a premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.

Natural Language for Visual Reasoning (NLVR).

The input of the NLVR task is two images and a text description, and the output is whether the corresponding relationship between the images and the text description is consistent (two labels: true or false).

Visual Entailment (VE).

In the VE task, the image is the premise and the text is the hypothesis. The goal is to predict whether the image entails the text. There are three labels: Entailment, Neutral, and Contradiction.

Visual Commonsense Reasoning (VCR).

VCR takes the form of multiple-choice questions. For each question, there are several candidate answers. The model must choose an answer from these candidates and then select the reason for choosing it from several candidate reasons. We can follow VCR’s leaderboard to track VLP’s latest ideas.

Grounding Referring Expressions (GRE).

The GRE task is to localize an image region given a text reference. The model can output a score for each region, and the region with the highest score is used as the prediction region.

Category Recognition (CR).

CR refers to identifying the category of a product which is a vital attribute for describing a product.

6.2 Regression Tasks

Multi-modal Sentiment Analysis (MSA).

MSA aims to detect sentiments in videos by leveraging multi-modal signals (e.g., vision, language, etc.). It predicts the affective orientation of an utterance as a continuous intensity variable.

6.3 Retrieval Tasks

Vision-Language Retrieval (VLR).

VLR involves understanding both the vision (image or video) and language domains with appropriate matching strategies. It includes two subtasks, vision-to-text and text-to-vision retrieval: vision-to-text retrieval fetches the most relevant text descriptions from a larger pool of descriptions for a given vision input, and vice versa.
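Given a similarity matrix between vision inputs and text descriptions, both retrieval directions reduce to ranking one axis of the matrix. The sketch below assumes such a matrix is already available (for example, from dot products of aggregated representations); the function name is hypothetical.

```python
import numpy as np

def retrieve_top_k(sim, k=1):
    """For each query (row), return indices of the k most similar candidates."""
    return np.argsort(-sim, axis=1)[:, :k]

# sim[i, j]: similarity between vision input i and text description j
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.4],
                [0.1, 0.7, 0.6]])
vision_to_text = retrieve_top_k(sim, k=1)     # texts for each vision input
text_to_vision = retrieve_top_k(sim.T, k=1)   # vision inputs for each text
```

Transposing the matrix is all that is needed to switch from vision-to-text to text-to-vision retrieval.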

6.4 Generation Tasks

Visual Captioning (VC).

VC aims to generate semantically and syntactically appropriate text descriptions for a given visual (image or video) input.

Novel Object Captioning at Scale (NoCaps).

NoCaps extends the VC task to test a model’s capability of describing novel objects from the Open Images dataset, which are unseen in the training corpus [agrawal2019nocaps].

Visual Dialogue (VD).

In VD, the model is given an image (or video), a dialogue history, and a natural language question, and must generate an answer to the question.

6.5 Other Tasks

Multi-modal Machine Translation (MMT).

MMT is a two-fold task of translation and text generation: translating text from one language to another with additional information from other modalities, e.g., images.

Vision-Language Navigation (VLN).

VLN is a language grounding task in which an agent moves through and explores a real-world environment, guided by what it sees and by linguistic instructions.

Optical Character Recognition (OCR).

OCR generally refers to detecting and recognizing text information in images, which includes two parts: text detection (similar to regression) and text recognition (similar to classification).

In addition, there are some video-related downstream tasks for evaluating video-text pre-training models, including action classification (AC), action segmentation (AS), and action step localization (ASL).

| Model | Domain | Vision FE | Language FE | Multimodal Fusion | Decoder | PT Objectives | PT Datasets | Downstream Tasks |
|---|---|---|---|---|---|---|---|---|
| VisualBERT [li2019visualbert] | Image | OD-RFs | Emb | Single-stream | No | MLM+VLM | COCO | GRE+NLVR+VCR+VQA |
| ViLBERT [lu2019vilbert] | Image | OD-RFs | Emb | Dual-stream | No | MLM+VLM+MVM | COCO+VG | VLR+NLVR+VE+VQA |
| LXMERT [tan2019lxmert] | Image | OD-RFs+Xformer | Xformer | Dual-stream | No | MLM+VLM+MVM+VQA | COCO+VG+VQA+GQA+VGQA | GQA+NLVR+VQA |
| B2T2 [alberti2019fusion] | Image | CNN-GFs | Emb | Single-stream | No | MLM+VLM | CC3M | VCR |
| Unicoder-VL [li2020unicoder] | Image | OD-RFs | Emb | Single-stream | No | MLM+VLM+MVM | CC3M+SBU | VLR+VCR |
| VL-BERT [su2019vl] | Image | OD-RFs | Emb | Single-stream | No | MLM+MVM | CC3M | GRE+VCR+VQA |
| VLP [zhou2020unified] | Image | OD-RFs | Emb | Dual-stream | Yes | MLM+LM | CC3M | VC+VQA |
| UNITER [chen2020uniter] | Image | OD-RFs | Emb | Single-stream | No | MLM+VLM+MVM+WRA | COCO+VG+SBU+CC3M | GRE+VLR+NLVR+VCR+VE+VQA |
| 12-IN-1 [lu202012] | Image | OD-RFs | Emb | Single-stream | No | MLM+MVM | MTL | GQA+GRE+VC+NLVR+VE+VQA |
| VisDial-BERT [murahari2020large] | Image | OD-RFs | Emb | Dual-stream | No | MLM+VLM+MVM | CC3M+VQA | VD |
| ImageBERT [qi2020imagebert] | Image | OD-RFs | Emb | Single-stream | No | MLM+VLM+MVM | LAIT+CC3M+SBU | VLR |
| PREVALENT [hao2020towards] | Image | CNN-GFs+Xformer | Xformer | Single-stream | No | MLM+MVM | Matterport3D | VLN |
| XGPT [xia2021xgpt] | Image | OD-RFs | Emb | Dual-stream | Yes | MLM+IDA+VC+TIFG | CC3M | VC+VLR |
| InterBERT [lin2020interbert] | Image | OD-RFs | Emb | Single-stream | No | MLM+VLM+MVM | COCO+CC3M+SBU | VLR+VCR |
| PixelBERT [huang2020pixel] | Image | CNN-GFs | Emb | Single-stream | No | MLM+VLM | COCO+VG | VLR+NLVR+VQA |
| VLN-BERT [hong2021vln] | Image | OD-RFs | Emb | Dual-stream | No | MLM+VLM+MVM | CC3M | VLN |
| FashionBERT [gao2020fashionbert] | Image | Xformer | Emb | Single-stream | No | MLM+VLM+MVM | FashionGen | VLR |
| VILLA [gan2020large] | Image | OD-RFs+Xformer | Xformer | Single-stream | No | MLM+VLM+MVM | COCO+VG+CC3M+SBU | GRE+VLR+NLVR+VCR+VE+VQA |
| ERNIE-ViL [yu2020ernie] | Image | OD-RFs | Emb | Single-stream | No | MLM+MVM | CC3M+SBU | GRE+VLR+VCR+VQA |
| RVL-BERT [chiou2021visual] | Image | OD-RFs | Emb | Single-stream | No | MLM+VLM+MVM | CC3M | VC+VQA |
| VinVL [zhang2021vinvl] | Image | OD-RFs | Emb | Single-stream | No | MLM+VLM | COCO+CC3M+SBU+FLKR+VQA+GQA+VGQA | GQA+VC+VLR+NLVR+NoCaps+VQA |
| ViLT [kim2021vilt] | Image | ViT-PFs | Emb | Single-stream | No | MLM+VLM | COCO+VG+SBU+CC3M | VLR+NLVR+VQA |
| ALIGN [jia2021scaling] | Image | CNN-GFs | Xformer | Dual-stream | No | VLC | AltText | VLR |
| Kaleido-BERT [zhuge2021kaleido] | Image | CNN-GFs | Emb | Single-stream | No | MLM+VLM+AKPM | FashionGen | CR+VC+VLR |
| MDETR [kamath2021mdetr] | Image | Xformer | Xformer | Single-stream | Yes | OD+MLM+VLC | COCO+VG+FLKR+GQA | GQA+VQA |
| SOHO [huang2021seeing] | Image | CNN-GFs | Emb | Single-stream | No | MLM+VLM+MVM | COCO+VG | VLR+NLVR+VE+VQA |
| E2E-VLP [xu2021e2e] | Image | CNN-GFs | Emb | Single-stream | Yes | OD+MLM+VLM | COCO+VG | VC+VLR+NLVR+VQA |
| Visual Parsing [xue2021probing] | Image | Xformer | Emb | Single-stream | No | MLM+VLM+MVM | COCO+VG | VLR+VCR+VE+VQA |
| CLIP-ViL [shen2021much] | Image | CNN-GFs | Emb | Single-stream | Yes | MLM+VLM+VQA | COCO+VG+VQA+GQA+VGQA | VE+VLN+VQA |
| ALBEF [li2021align] | Image | Xformer | Xformer | Dual-stream | No | MLM+VLM+VLC | COCO+VG+CC3M+SBU | VLR+NLVR+VQA |
| SimVLM [wang2021simvlm] | Image | CNN-GFs | Emb | Single-stream | Yes | PrefixLM | AltText | VC+NLVR+VE+VQA |
| MURAL [jain2021mural] | Image | CNN-GFs | Xformer | Dual-stream | No | VLC | CC12M+AltText | VC+VLR |
| VLMO [vlmo] | Image | ViT-PFs | Emb | Single-stream | No | MLM+VLC+VLM | COCO+VG+CC3M+SBU | VQA+NLVR+VLR |
| METER [dou2021empirical] | Image | Xformer | Xformer | Dual-stream | No | MLM+VLM | COCO+VG+CC3M+SBU | VLR+NLVR+VE+VQA |
| VideoBERT [sun2019videobert] | Video | CNN-GFs | Emb | Single-stream | No | MLM+VLM+MVM | SC | AC+VC |
| CBT [sun2019learning] | Video | CNN-GFs+Xformer | Xformer | Single-stream | No | VLC | Kinetics | AC+AS+VC |
| UniVL [luo2020univl] | Video | CNN-GFs | Xformer | Dual-stream | Yes | MLM+VLM+VC | HT100M | AS+ASL+MSA+VC+VLR |
| HERO [li2020hero] | Video | CNN-GFs+Xformer | Xformer | Single-stream | No | MLM+VLM+MVM+FOM | HT100M+TV | VC+VLI+VQA+VLR |
| MMFT-BERT [urooj2020mmft] | Video | OD-RFs+Xformer | Xformer | Single-stream | No | VQA | TV | VQA |
| ActBERT [zhu2020actbert] | Video | OD-RFs+CNN | Emb | Single-stream | No | MLM+VLM+MVM | HT100M | AS+ASL+VC+VQA+VLR |
| CLIP [radford2021learning] | Image / Video | CNN/Xformer | Xformer | Dual-stream | No | VLC | SC | OCR+AC etc. |
| Frozen [bain2021frozen] | Video | ViT-PFs | Emb | Dual-stream | No | VLC | WebVid2M+CC3M | VLR |
| Region-Learner [yan2021video] | Video | ViT-PFs | Emb | Dual-stream | No | VLC | WebVid2M+CC3M | VLR |

Table 2: The summary of mainstream VLP models. The number of downstream tasks determines whether the model is generic or domain-specific VLP. FE: Feature Extraction. PT: Pre-training. Emb: Embedding. SC in the Datasets column: self-constructed or self-collected. MTL in the Datasets column: all datasets for multi-task learning in the corresponding work. See Table 1 for the other abbreviations in the Datasets column.

7 SOTA VLP models

Image-Text VLP models.

VisualBERT [li2019visualbert], known as the first image-text pre-training model, uses the visual features extracted by Faster R-CNN, concatenates the visual features and textual embeddings, and then feeds the concatenated features to a single transformer initialized from BERT. Many VLP models [li2020unicoder, su2019vl, chen2020uniter, qi2020imagebert] follow a similar feature extraction and architecture to VisualBERT while adjusting the pre-training objectives and pre-training datasets. Recently, VLMO [vlmo] leverages patch embeddings for images and word embeddings for text, feeds the concatenated embeddings into a single transformer with modality experts, and achieves impressive performance. METER [dou2021empirical] explores how to use uni-modal pre-trained models and proposes a dual-stream architecture to handle the multi-modal fusion, which achieves SOTA performance on many downstream tasks.

Video-Text VLP models.

VideoBERT [sun2019videobert], known as the first video-text pre-training model, extends the BERT model to process videos and texts simultaneously. VideoBERT uses the pre-trained ConvNet and S3D [xie2017rethinking] to extract video features and concatenates them with textual word embeddings to feed into a transformer initialized from BERT. ConvNet and S3D are frozen when training VideoBERT, which means the approach is not end-to-end. Recently, inspired by ViT, Frozen [bain2021frozen] and Region-Learner [yan2021video] first split video clips into frames and then obtain patch embeddings for each frame in the same way that ViT processes images. Frozen and Region-Learner are optimized in an end-to-end manner and achieve SOTA performance.

More existing mainstream VLP models are summarized in Table 2.

8 Conclusion and New Frontiers

In this paper, we provide the first VLP survey. We review its recent advances from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks and summarize the specific SOTA VLP models in detail. We hope our survey can help researchers understand VLP better and inspire new works to advance this field. In the future, based on existing works, VLP can be further developed from the following aspects:

Incorporating Acoustic Information.

Most previous works on multi-modal pre-training emphasize the joint modeling of language and vision but ignore the information buried in audio. Although the semantic information in audio might intersect with language, audio could provide extra emotion information, acoustic boundary information, etc. Moreover, pre-training with audio makes the model capable of downstream tasks with acoustic inputs. Until now, joint modeling and representation across text, vision, and audio is still an open problem left for further investigation. Several cutting-edge works have shed light on the future of this research field. Unlike previous VLP models, VATT [akbari2021vatt] takes the raw audio as input and learns multi-modal representations with noise contrastive estimation (NCE). Differing from VATT, OPT [liu2021opt] learns cross-modal representations across text, image, and audio jointly with various multi-level masking strategies, and it is also capable of generating text and images. Some other works, such as AudioCLIP [guzhov2021audioclip] and MERLOT Reserve [zellers2022merlot], also show their unique approaches to learning cross-modal representations over three modalities.

Knowledgeable Learning and Cognitive.

Although the existing VLP models have achieved remarkable performance, their essence is to fit large-scale multimodal datasets. Making VLP models more knowledgeable is important for future VLP. For input vision and text, there is rich related external common sense world knowledge and illustrative situational knowledge [chen2021kbvlp], which can be used to augment the input and accelerate the model training and inference. The solution to this problem requires unified cognitive model architectures, knowledge-guided pre-training objectives, and the support of interacting with new knowledge.

Prompt Tuning.

Currently, fine-tuning is the dominant method to transfer the knowledge of VLP to downstream tasks. However, as the scale of the model increases, each downstream task requires its own set of fine-tuned parameters, leading to parameter inefficiency. Moreover, the diverse downstream tasks also make the design of the pre-training and fine-tuning stages cumbersome, leading to a gap between them. Recently, prompt tuning has been getting more and more attention in NLP. By designing discrete or continuous prompts and using MLM for specific downstream tasks, these models could: 1) reduce the computational cost of fine-tuning the enormous number of parameters; 2) bridge the gap between pre-training and fine-tuning. Prompt tuning is a promising way to stimulate the linguistic and world knowledge distributed in PLMs. In the next step, it can be improved and transferred to multi-modal scenarios, breaking the traditional paradigm and solving the pain points of VLP [tsimpoukelli2021multimodal].