Making machines respond in ways similar to humans has been a relentless goal of AI researchers. To enable machines to perceive and think, researchers propose a series of related tasks, such as face recognition, reading comprehension, and human-machine dialogue, to train and evaluate the intelligence of machines in a particular aspect. Specifically, domain experts manually construct standard datasets and then train and evaluate relevant models on them. However, due to the limitations of related technologies, it is often necessary to train on a large amount of labelled data to obtain a better and more capable model. The recent emergence of pre-training models based on the Transformer structure[vaswani2017attention]
has alleviated this problem. They are first pre-trained via self-supervised learning, which typically exploits auxiliary tasks (pre-training objectives) to mine supervision signals from large-scale unlabelled data, thereby learning universal representations. They can then achieve surprising effectiveness on downstream tasks by fine-tuning with only a tiny amount of manually-labelled data. Since the advent of BERT [DBLP:conf/naacl/DevlinCLT19] in natural language processing (NLP), various pre-training models have sprung up in the uni-modal field, such as Vision Transformer (ViT) [dosovitskiy2020image] in computer vision (CV) and Wav2Vec [DBLP:conf/interspeech/SchneiderBCA19] in speech. Substantial work has shown that they benefit downstream uni-modal tasks and avoid training a new model from scratch.
Similar to the uni-modal field, the multi-modal field also suffers from a shortage of high-quality labelled data. The natural question is: can the above pre-training method be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. In this paper, we focus on mainstream vision-language pre-training (VLP), including image-text and video-text pre-training. VLP mainly learns the semantic correspondence between different modalities by pre-training on large-scale data. For example, in image-text pre-training, we expect the model to associate “dog” in text with what “dog” looks like in images. In video-text pre-training, we expect the model to map objects/actions in the text to objects/actions in the video. To achieve this goal, the pre-training objectives and model architecture need to be cleverly designed so that the model can mine the associations between different modalities.
To give readers a better global grasp of VLP, we first comprehensively review its recent advances and focus on five significant aspects:
Feature extraction. This section includes the preprocessing and representation methods of image, video, and text in VLP models (see Section 2).
Model architecture. We introduce the architecture of the VLP models from two different perspectives: Single-stream versus Dual-stream from multi-modal fusion perspective, and Encoder-only versus Encoder-decoder from the overall architectural design perspective (see Section 3).
Pre-training objectives. Pre-training objectives are the core of VLP, mainly used to guide the model to learn vision-language associated information. We summarize typical and characteristic pre-training objectives divided into completion, matching, temporal, and particular types (see Section 4).
Pre-training datasets. Data is critical for VLP. We briefly introduce mainstream corpora for VLP and their specific sizes (see Section 5).
Downstream tasks. Various tasks require a cooperative understanding of both vision and language. We divide them into five categories: classification, regression, retrieval, generation, and other tasks. We also discuss the basic details and goals of these tasks (see Section 6).
To the best of our knowledge, this is the first survey on VLP. We hope that our survey can help researchers better understand this field and inspire them to design better models.
2 Feature Extraction
This section describes how VLP models preprocess and represent an image, video and text to obtain counterpart features.
2.1 Feature Extraction
2.1.1 Image Feature Extraction
(1) OD-based Region Features (OD-RFs).
Most previous work on VLP utilizes pre-trained object detectors to extract visual features. The most commonly used object detection model is Faster R-CNN with bottom-up attention [anderson2018bottom]. It is designed to identify objects belonging to certain classes and localize them with bounding boxes. Using Faster R-CNN, VLP models obtain the OD-based Region feature embedding of an image for the selected regions. Each region feature is a high-dimensional Region-of-Interest (RoI) feature paired with its bounding box, which is defined by the coordinates of the bottom-left and top-right corners of the region. VLP models use the bounding boxes to construct low-dimensional position vectors, and each vector is embedded into a high-dimensional representation (2048-d) named the visual geometry embedding. The OD-RFs are obtained by adding the OD-based Region feature embedding to its visual geometry embedding. Although OD-RFs have brought impressive performance, extracting region features is time-consuming. To relieve this problem, the pre-trained object detectors are usually frozen during pre-training, which can limit the capacity of VLP models.
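As a rough illustration of how the two embeddings are combined, the sketch below builds a hypothetical 5-d geometry vector from normalized box corners plus relative area and projects it to the feature dimension with a random matrix standing in for the learned FC layer; the exact geometry encoding and projection vary across models.

```python
import numpy as np

def od_region_features(roi_feats, boxes, img_w, img_h, rng):
    """Add a visual geometry embedding to OD-based region features.
    The 5-d geometry vector (normalized corners plus relative area) and
    the random projection replacing the learned FC layer are illustrative."""
    x1, y1, x2, y2 = boxes.T
    area = (x2 - x1) * (y2 - y1) / (img_w * img_h)
    geom = np.stack([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, area], axis=1)
    proj = rng.standard_normal((5, roi_feats.shape[1]))   # learned in practice
    return roi_feats + geom @ proj                        # add, as described above

rng = np.random.default_rng(0)
feats = od_region_features(rng.standard_normal((36, 2048)),
                           np.tile([10.0, 20.0, 110.0, 220.0], (36, 1)),
                           640, 480, rng)
# one 2048-d OD-RF per region
```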
(2) CNN-based Grid Features (CNN-GFs).
VLP models extract visual features by utilizing convolutional neural networks (CNNs) to obtain the grid features. On the one hand, VLP models can train the CNNs end-to-end by using the grid features directly. On the other hand, VLP models can also first discretize grid features using a learned vision dictionary, then feed them into the cross-modal module.
(3) ViT-based Patch Features (ViT-PFs).
Inspired by ViT, VLP models reshape the image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. An input image is thus encoded into a sequence of embeddings $\{v_{cls}, v_1, \ldots, v_N\}$, where $v_{cls}$ is the embedding of the [CLS] token.
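The patch-flattening step can be sketched with plain NumPy; for a 224x224x3 image and 16x16 patches this yields $N = 196$ patches, each of dimension $P^2 \cdot C = 768$:

```python
import numpy as np

def image_to_patches(image, P):
    """Reshape an (H, W, C) image into N = HW / P**2 flattened P x P patches."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0
    x = image.reshape(H // P, P, W // P, P, C)   # split height and width into patch grids
    x = x.transpose(0, 2, 1, 3, 4)               # bring the two grid dims together
    return x.reshape((H // P) * (W // P), P * P * C)

img = np.arange(224 * 224 * 3, dtype=float).reshape(224, 224, 3)
patches = image_to_patches(img, 16)
# 196 patches, each flattened to a 16 * 16 * 3 = 768-d vector
```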
2.1.2 Video Feature Extraction
A video clip is denoted as a sequence of $M$ frames (images). VLP models extract frame features using the methods mentioned above. The two most commonly used features are CNN-GFs and ViT-PFs. For CNN-GFs, VLP models first use ResNet pre-trained on ImageNet and SlowFast pre-trained on Kinetics to extract 2D and 3D visual features for each video frame. These features are concatenated as visual features and fed through a fully-connected (FC) layer to be projected into the same lower-dimensional space as the token embeddings. For ViT-PFs, a video clip consists of $M$ frames of resolution $H \times W$, where $M = 1$ for images. Following the protocol in ViT and TimeSformer, the input video clip is divided into $M \times N$ non-overlapping spatio-temporal patches of size $P \times P$, where $N = HW/P^2$.
2.1.3 Text Feature Extraction
For the textual features, following BERT, VLP models first segment the input sentence into a sequence of subwords, and then insert a start-of-sequence token and an end-of-sequence token at the beginning and the end of the sequence to generate the input text sequence. Text input representations are computed by summing the corresponding word embedding, text position embedding, and text type embedding.
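A minimal sketch of this summation, with a toy vocabulary and randomly initialized tables standing in for BERT's learned embeddings:

```python
import numpy as np

# Toy vocabulary and embedding tables; in practice these come from BERT.
rng = np.random.default_rng(0)
vocab = {"[CLS]": 0, "[SEP]": 1, "a": 2, "dog": 3, "runs": 4}
d = 8
word_emb = rng.standard_normal((len(vocab), d))
pos_emb = rng.standard_normal((32, d))       # one vector per position
type_emb = rng.standard_normal((2, d))       # e.g. 0 = text, 1 = vision

def text_input_representation(tokens, type_id=0):
    """Sum of word, position, and type embeddings for each token, with
    [CLS]/[SEP] as the start- and end-of-sequence tokens."""
    ids = [vocab["[CLS]"]] + [vocab[t] for t in tokens] + [vocab["[SEP]"]]
    return np.stack([word_emb[i] + pos_emb[p] + type_emb[type_id]
                     for p, i in enumerate(ids)])

reps = text_input_representation(["a", "dog", "runs"])
# one d-dimensional vector per token: [CLS], 3 subwords, [SEP]
```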
2.2 Feature Representation
To make full use of uni-modal pre-trained models, VLP models can send the visual or textual features to a transformer encoder. Specifically, VLP models can utilize a standard transformer encoder with random initialization to generate the visual or textual representation. Alternatively, VLP models can utilize a pre-trained visual transformer, such as ViT or DeiT [pmlr-v139-touvron21a], to encode the ViT-PFs, and a pre-trained textual transformer, such as BERT, to encode the textual features. For simplicity, we name these transformers Xformer.
3 Model Architecture
In this section, we introduce the architecture of the VLP models from two different perspectives: (1) Single-stream versus Dual-stream from multi-modal fusion perspective, and (2) Encoder-only versus Encoder-decoder from the overall architectural design perspective.
3.1 Single-stream versus Dual-stream
In the single-stream architecture, the text and visual features are concatenated together and then fed into a single transformer block, as shown in Figure 1 (a). The single-stream structure utilizes merged attention to fuse the multimodal inputs. It is more parameter-efficient, as the same set of parameters is used for both modalities.
In the dual-stream architecture, the text and visual features are not concatenated but are sent to two different transformer blocks independently, as shown in Figure 1 (b). These two transformer blocks do not share parameters. To achieve higher performance, cross-attention (shown by the dotted lines in Figure 1 (b)) is used to enable cross-modal interaction. To achieve higher efficiency, the cross-attention between the visual and textual transformer blocks can also be omitted.
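The difference between the two fusion styles can be sketched with a bare-bones single-head attention (no learned projections; purely illustrative):

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no learned projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
text, vision = rng.standard_normal((4, 16)), rng.standard_normal((9, 16))

# Single-stream: concatenate, then self-attend over the merged sequence.
merged = np.concatenate([text, vision])
single_out = attention(merged, merged, merged)      # 13 fused token outputs

# Dual-stream: separate blocks; cross-attention lets text query vision.
text_out = attention(text, vision, vision)          # 4 text tokens attend to vision
```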
3.2 Encoder-only versus Encoder-decoder
Many VLP models adopt the encoder-only architecture, where the cross-modal representations are directly fed into an output layer to generate the final outputs. In contrast, other VLP models advocate using a transformer encoder-decoder architecture, where the cross-modal representations are first fed into a decoder and then to an output layer.
4 Pre-training Objectives
This section introduces how we pre-train VLP models by using different pre-training objectives, which are crucial for learning the universal representation of vision-language. We summarize the pre-training objectives into four categories: completion, matching, temporal, and particular types.
Completion learns to reconstruct the masked elements from the unmasked parts (see Sections 4.1-4.3).
Matching learns to align vision and language into a unified representation (see Sections 4.4-4.6).
Temporal learns good representations by reordering a disrupted input sequence (see Section 4.7).
Particular types consist of other pre-training objectives, such as visual question answering and visual captioning (see Section 4.8).
We now introduce the most widely used pre-training objectives.
4.1 Masked Language Modeling
Masked language modeling (MLM), first proposed by Taylor [taylor1953cloze], became widely known after the BERT model adopted it as a pre-training task. MLM in VLP models is similar to MLM in pre-trained language models (PLMs), but it predicts the masked textual tokens not only from the rest of the textual tokens but also from the visual tokens. Empirically, VLP models following BERT randomly mask each textual input token with a probability of 15% and replace the masked one with the special token [MASK] 80% of the time, a random textual token 10% of the time, and the original token 10% of the time.
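The 15% / 80-10-10 masking scheme can be sketched as follows (hypothetical helper name, toy vocabulary):

```python
import random

def mlm_mask(tokens, vocab, p=0.15, seed=0):
    """BERT-style masking: each token is selected w.p. 15%; a selected token
    becomes [MASK] 80% of the time, a random token 10%, unchanged 10%."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            labels.append(tok)                   # model must predict the original
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")
            elif r < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)                  # not predicted at this position
    return inputs, labels

inp, lab = mlm_mask(["a", "dog", "runs", "fast"] * 5, ["a", "dog", "runs", "fast"])
```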
4.2 Prefix Language Modeling
Prefix Language Modeling (PrefixLM) unifies masked language modeling and language modeling (LM). PrefixLM is proposed to equip the model with a solid generation capability that enables text-induced zero-shot generalization without fine-tuning. PrefixLM differs from the standard LM in that it enables bi-directional attention on the prefix sequence and conducts autoregressive factorization only on the remaining tokens. Under the sequence-to-sequence (seq2seq) framework, PrefixLM not only enjoys the bidirectional contextualized representation as in MLM but can also perform text generation similar to LM.
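The asymmetric attention pattern can be made concrete as a mask matrix; a sketch assuming a prefix of length 3 in a sequence of length 6:

```python
import numpy as np

def prefix_lm_mask(prefix_len, total_len):
    """Attention mask for PrefixLM: entry (i, j) = 1 means position i may
    attend to position j. Prefix tokens attend bidirectionally among
    themselves; the remaining tokens attend to the prefix and causally to
    earlier generated tokens."""
    mask = np.tril(np.ones((total_len, total_len), dtype=int))  # causal base
    mask[:prefix_len, :prefix_len] = 1                          # bidirectional prefix
    return mask

m = prefix_lm_mask(3, 6)
# row 0 (a prefix token) sees all 3 prefix positions: [1, 1, 1, 0, 0, 0]
# row 4 (a target token) sees the prefix plus itself and earlier targets
```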
4.3 Masked Vision Modeling
Like MLM, masked vision modeling (MVM) samples vision (image or video) regions or patches and usually masks their visual features with a probability of 15%. VLP models need to reconstruct the masked visual features given the remaining visual features and all the textual features. The masked visual features are set to zeros. Because visual features are high-dimensional and continuous, VLP models propose two variants for MVM.
(1) Masked Features Regression
learns to regress the model output of masked features to its original visual features. VLP models convert the model output of the masked features to a vector of the same dimension as the original visual features first and apply L2 regression between the original visual features and the vector.
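A minimal sketch of this loss, assuming the model output has already been projected to the 2048-d visual-feature dimension (the projection layer is omitted):

```python
import numpy as np

def masked_feature_regression_loss(model_out, target_feats, mask):
    """L2 regression on masked positions only: squared error between the
    (already projected) model output and the original visual features,
    averaged over the masked regions."""
    diff = (model_out - target_feats)[mask]
    return (diff ** 2).sum(axis=-1).mean()

rng = np.random.default_rng(0)
out, tgt = rng.standard_normal((10, 2048)), rng.standard_normal((10, 2048))
mask = np.zeros(10, dtype=bool)
mask[[2, 7]] = True                      # regions 2 and 7 were masked
loss = masked_feature_regression_loss(out, tgt, mask)
```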
(2) Masked Feature Classification
learns to predict the object semantic class for the masked features. VLP models first feed the output of the masked features into an FC layer to predict the scores over object classes, which further go through a softmax function to be transformed into a normalized prediction distribution. Note that there is no ground-truth label. There are two ways to train VLP models. One is to take the most likely object class from the object detection model as a hard label (a one-hot distribution), assuming the detected object class is the ground-truth label for the masked features, and apply a cross-entropy loss to minimize the gap between the prediction and the pseudo class. The other is to use a soft label as the supervision signal, which is the raw output from the detector (i.e., a distribution over object classes), and minimize the KL divergence between the two distributions.
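Both supervision variants can be sketched in NumPy (hypothetical helper names; the 1600-way label space is just an illustrative detector vocabulary):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hard_label_loss(logits, detector_logits):
    """Cross-entropy against the detector's most likely class (pseudo label)."""
    p = softmax(logits)
    pseudo = detector_logits.argmax(axis=-1)
    return -np.log(p[np.arange(len(p)), pseudo]).mean()

def soft_label_loss(logits, detector_logits):
    """KL divergence between the detector's distribution and the prediction."""
    p, q = softmax(detector_logits), softmax(logits)
    return (p * (np.log(p) - np.log(q))).sum(axis=-1).mean()

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 1600))       # model scores for 4 masked regions
det_logits = rng.standard_normal((4, 1600))   # detector scores (pseudo supervision)
hard = hard_label_loss(logits, det_logits)
soft = soft_label_loss(logits, det_logits)
```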
4.4 Vision-Language Matching
Vision-Language Matching (VLM) is the most commonly used pre-training objective to align vision and language. In the single-stream VLP models, they use the representation of the special token [CLS] as the fused representation of both modalities. In the dual-stream VLP models, they concatenate the visual representation of the special visual token [CLS] and the textual representation of the special textual token [CLS]
as the fused representation of both modalities. VLP models feed the fused representation of both modalities to an FC layer and a sigmoid function to predict a score between 0 and 1, where 0 indicates that the vision and language are mismatched and 1 indicates that they are matched. During training, VLP models sample positive or negative pairs from the dataset at each step. A negative pair is created by replacing the vision or text in a paired sample with one randomly selected from another sample.
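A sketch of the scoring head and the negative sampling (hypothetical helper names; a real model learns the FC weights):

```python
import numpy as np

def vlm_score(fused, W, b):
    """FC layer + sigmoid on the fused [CLS] representation; a score near 1
    means the vision-language pair is predicted as matched."""
    return 1.0 / (1.0 + np.exp(-(fused @ W + b)))

def make_negative(pair, text_pool, rng):
    """Create a negative pair by swapping in a randomly chosen text."""
    vision, _ = pair
    return (vision, text_pool[rng.integers(len(text_pool))])

rng = np.random.default_rng(0)
W = rng.standard_normal(768) / np.sqrt(768)   # scaled stand-in for a learned layer
score = vlm_score(rng.standard_normal(768), W, 0.0)
neg = make_negative(("image_feats", "a dog runs"), ["a cat", "a dog runs", "hi"], rng)
```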
4.5 Vision-Language Contrastive Learning
Vision-Language Contrastive Learning (VLC) predicts the $N$ matched vision-language pairs from the $N^2$ possible pairs given a batch of $N$ vision-language pairs. Note that there are $N^2 - N$ negative vision-language pairs within a training batch. VLP models use the visual representation of the special visual token [CLS] and the textual representation of the special textual token [CLS] to denote the aggregated representations of the vision and language, respectively. VLP models compute the softmax-normalized vision (image or video)-to-text and text-to-vision similarities and leverage cross-entropy losses over both to update themselves. The similarity is often implemented by dot products.
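The symmetric contrastive loss can be sketched as follows, assuming dot-product similarity on L2-normalized [CLS] embeddings and a temperature of 0.07 (an illustrative choice):

```python
import numpy as np

def vlc_loss(vision_cls, text_cls, temperature=0.07):
    """Symmetric contrastive loss over a batch of N pairs: the i-th vision
    embedding should match the i-th text embedding among the N candidates."""
    v = vision_cls / np.linalg.norm(vision_cls, axis=1, keepdims=True)
    t = text_cls / np.linalg.norm(text_cls, axis=1, keepdims=True)
    sim = v @ t.T / temperature                      # N x N similarity matrix

    def ce(logits):                                  # cross-entropy, targets on diagonal
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    return 0.5 * (ce(sim) + ce(sim.T))               # vision-to-text + text-to-vision

rng = np.random.default_rng(0)
loss = vlc_loss(rng.standard_normal((8, 256)), rng.standard_normal((8, 256)))
```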
4.6 Word-Region Alignment
Word-Region Alignment (WRA) is an unsupervised pre-training objective to align vision regions (vision patches) and words. VLP models utilize Optimal Transport (OT) to learn the alignment between vision and language. Empirically, VLP models use the IPOT algorithm to approximate the OT distance, since the exact minimization is computationally intractable. After solving the minimization, the OT distance serves as the WRA loss to train VLP models.
4.7 Frame Order Modeling
To better model the temporal order of videos, VLP models randomly disrupt the order of some input frames and then predict the actual position of each frame. In practice, Frame Order Modeling (FOM) is modeled as a classification task.
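A sketch of how FOM training targets can be constructed (hypothetical helper name; the fraction of shuffled frames is an illustrative choice):

```python
import numpy as np

def fom_example(num_frames, shuffle_frac, rng):
    """Pick a subset of frame positions, permute the frames there, and
    record the shown order; the model must classify, for each shuffled
    position, which original frame index it holds (num_frames-way)."""
    k = max(2, int(shuffle_frac * num_frames))
    idx = rng.choice(num_frames, size=k, replace=False)   # positions to disrupt
    order = np.arange(num_frames)
    order[idx] = rng.permutation(idx)                     # frame shown at i is order[i]
    return order, idx                                     # targets: order[i] at each idx

order, shuffled = fom_example(8, 0.3, np.random.default_rng(0))
```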
4.8 Particular Pre-training Objectives
VLP models also sometimes use the training objectives of some downstream tasks, such as visual question answering (VQA) and visual captioning (VC), as pre-training objectives. For VQA, VLP models take the fused representation mentioned above, apply an FC layer, and use the transformed representation to predict a classification over predefined answer candidates. Besides tackling the task as classification over predefined answer candidates, VLP models can also directly generate answers in their original text format. For VC, to endow VLP models with generation capability by reconstructing the input sentence, VLP models employ an auto-regressive decoder to generate a corresponding textual description of the image or video.
Note that due to space limitations, we only introduce some popular pre-training objectives. We omit some specific pre-training objectives such as grounding referring expression (GRE), image-conditioned denoising autoencoding (IDA)[xia2021xgpt], text-conditioned image feature generation (TIFG) [xia2021xgpt], object detection (OD) [kamath2021mdetr] and aligned Kaleido patch modeling (AKPM) [zhuge2021kaleido]. Moreover, we put masked action prediction into the category of MVM.
5 Pre-training Datasets
Most datasets for VLP are constructed by combining public datasets across different multi-modal tasks. However, some previous works, such as VideoBERT [sun2019videobert], ImageBERT [qi2020imagebert], ALIGN [jia2021scaling], and CLIP [radford2021learning], process a huge amount of data collected from the internet and conduct pre-training with their self-constructed datasets. Here, some mainstream corpora and their details are shown in Table 1.
6 Downstream Tasks
A diverse range of tasks requires a cooperative knowledge of vision and language. In this section, we introduce the fundamental details and goals of such tasks and divide them into five categories: classification, regression, retrieval, generation and other tasks, where classification, regression, and retrieval tasks are also known as understanding tasks.
6.1 Classification Tasks
Visual Question Answering (VQA).
Given a visual input (image or video), VQA is the task of correctly answering a question about it. It is usually regarded as a classification task in which the model predicts the most suitable answer from a pool of choices.
Visual Reasoning and Compositional Question Answering (GQA).
GQA is an upgraded version of VQA and aims to advance research on the visual reasoning of natural scenes [hudson2019gqa]. The images, questions, and answers in its dataset have matching semantic representations. The advantage of this structured representation is that the distribution of answers can be more uniform, and we can analyze the model’s performance from more dimensions.
Video-Language Inference (VLI).
Given a video clip with aligned subtitles as a premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
Natural Language for Visual Reasoning (NLVR).
The input of the NLVR task is two images and a text description, and the output is whether the corresponding relationship between the images and the text description is consistent (two labels: true or false).
Visual Entailment (VE).
In the VE task, the image is the premise and the text is the hypothesis. The goal is to predict whether the text is entailed by the image. There are three labels: Entailment, Neutral, and Contradiction.
Visual Commonsense Reasoning (VCR).
VCR exists in the form of multiple-choice questions. For each question, there are several candidate answers. The model must choose an answer from the candidates and then select, from several candidate reasons, the reason for choosing that answer. We can follow VCR’s leaderboard (https://visualcommonsense.com/leaderboard/) to track the latest VLP ideas.
Grounding Referring Expressions (GRE).
The GRE task is to localize an image region given a text reference. The model can output a score for each region, and the region with the highest score is used as the prediction region.
Category Recognition (CR).
CR refers to identifying the category of a product, which is a vital attribute for describing it.
6.2 Regression Tasks
Multi-modal Sentiment Analysis (MSA).
MSA aims to detect sentiments in videos by leveraging multi-modal signals (e.g., vision, language, etc.). The task is to predict the affective orientation of an utterance as a continuous intensity variable.
6.3 Retrieval Tasks
Vision-Language Retrieval (VLR).
VLR involves understanding both vision (image or video) and language domains with appropriate matching strategies. It includes two subtasks, vision-to-text and text-to-vision retrieval, where vision-to-text retrieval fetches the top-most relevant text descriptions from a larger pool given the vision input, and vice versa.
6.4 Generation Tasks
Visual Captioning (VC).
VC aims to generate semantically and syntactically appropriate text descriptions for a given visual (image or video) input.
Novel Object Captioning at Scale (NoCaps).
NoCaps extends the VC task to test a model’s capability of describing novel objects from the Open Images dataset, which are unseen in the training corpus [agrawal2019nocaps].
Visual Dialogue (VD).
In VD, given an image (or video), a dialogue history, and a language question, the model must generate an answer to the question.
6.5 Other Tasks
Multi-modal Machine Translation (MMT).
MMT is a two-fold task of translation and text generation: translating text from one language to another with additional information from other modalities, e.g., images.
Vision-Language Navigation (VLN).
VLN is a language-grounding task in which an agent navigates and explores real-world environments by following linguistic instructions.
Optical Character Recognition (OCR).
OCR generally refers to detecting and recognizing text information in images, which includes two parts: text detection (similar to regression) and text recognition (similar to classification).
In addition, there are some video-related downstream tasks for evaluating video-text pre-training models, including action classification (AC), action segmentation (AS), and action step localization (ASL).
| Model | Domain | Vision FE | Language FE | Multimodal Fusion | Decoder | PT Objectives | PT Datasets | Downstream Tasks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Visual Parsing [xue2021probing] | Image | Xformer | Emb | Single-stream | No | MLM+VLM+MVM | COCO+VG | VLR+VCR+VE+VQA |
| CLIP [radford2021learning] | Image / Video | CNN/Xformer | Xformer | Dual-stream | No | VLC | SC | OCR+AC etc. |
7 SOTA VLP models
Image-Text VLP models.
VisualBERT [li2019visualbert], known as the first image-text pre-training model, uses visual features extracted by Faster R-CNN, concatenates the visual features with the textual embeddings, and then feeds the concatenated features into a single transformer initialized with BERT. Many VLP models [li2020unicoder, su2019vl, chen2020uniter, qi2020imagebert] follow similar feature extraction and architecture as VisualBERT while adjusting the pre-training objectives and datasets. Recently, VLMO [vlmo] leverages patch embeddings for images and word embeddings for text, feeds the concatenated embeddings into a single transformer with modality experts, and achieves impressive performance. METER [dou2021empirical] explores how to use uni-modal pre-trained models and proposes a dual-stream architecture to handle multimodal fusion, achieving SOTA performance on many downstream tasks.
Video-Text VLP models.
VideoBERT [sun2019videobert], known as the first video-text pre-training model, extends the BERT model to process videos and texts simultaneously. VideoBERT uses a pre-trained ConvNet and S3D [xie2017rethinking] to extract video features and concatenates them with textual word embeddings to feed into a transformer initialized with BERT. The ConvNet and S3D are frozen when training VideoBERT, which means the approach is not end-to-end. Recently, inspired by ViT, Frozen [bain2021frozen] and Region-Learner [yan2021video] first process video clips into frames and obtain patch embeddings for each frame following ViT’s way of processing images. Frozen and Region-Learner optimize themselves in an end-to-end manner and achieve SOTA performance.
More existing mainstream VLP models are summarized in Table 2.
8 Conclusion and New Frontiers
In this paper, we provide the first VLP survey. We review its recent advances from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks and summarize the specific SOTA VLP models in detail. We hope our survey can help researchers understand VLP better and inspire new works to advance this field. In the future, based on existing works, VLP can be further developed from the following aspects:
Incorporating Acoustic Information.
Most previous works on multi-modal pre-training emphasize the joint modeling of language and vision but ignore the information buried in audio. Although the semantic information in audio might overlap with language, audio can provide extra information such as emotion and acoustic boundaries. Moreover, pre-training with audio makes the model capable of handling downstream tasks with acoustic inputs. Until now, joint modeling and representation across text, vision, and audio remains an open problem left for further investigation. Several cutting-edge works have shed light on the future of this research field. Unlike previous VLP models, VATT [akbari2021vatt]
takes raw audio as input and learns multi-modal representations with noise contrastive estimation (NCE). Differing from VATT, OPT [liu2021opt] learns cross-modal representations across text, image, and audio jointly with various multi-level masking strategies, and it is also capable of generating text and images. Some other works, such as AudioCLIP [guzhov2021audioclip] and MERLOT Reserve [zellers2022merlot], also show their unique approaches to learning cross-modal representations over the three modalities.
Knowledgeable Learning and Cognition.
Although the existing VLP models have achieved remarkable performance, their essence is to fit large-scale multimodal datasets. Making VLP models more knowledgeable is important for future VLP. For input vision and text, there is rich related external common sense world knowledge and illustrative situational knowledge [chen2021kbvlp], which can be used to augment the input and accelerate the model training and inference. The solution to this problem requires unified cognitive model architectures, knowledge-guided pre-training objectives, and the support of interacting with new knowledge.
Prompt Tuning.
Currently, fine-tuning is the dominant method of transferring the knowledge of VLP models to downstream tasks. However, as the scale of the model increases, each downstream task keeps its own set of fine-tuned parameters, leading to parameter inefficiency. Moreover, the diverse downstream tasks make the design of the pre-training and fine-tuning stages cumbersome, leading to a gap between them. Recently, prompt tuning has attracted increasing attention in NLP. By designing discrete or continuous prompts and using MLM for specific downstream tasks, these models can: 1) reduce the computational cost of fine-tuning enormous numbers of parameters; and 2) bridge the gap between pre-training and fine-tuning. Prompt tuning is a promising way to stimulate the linguistic and world knowledge distributed in PLMs. As a next step, it can be improved and transferred to multi-modal scenarios, breaking the traditional paradigm and solving the pain points of VLP [tsimpoukelli2021multimodal].