A Survey of Vision-Language Pre-Trained Models

02/18/2022
by   Yifan Du, et al.
20

As Transformer evolved, pre-trained models have advanced at a breakneck pace in recent years. They have dominated the mainstream techniques in natural language processing (NLP) and computer vision (CV). How to adapt pre-training to the field of Vision-and-Language (V-L) learning and improve the performance on downstream tasks becomes a focus of multimodal learning. In this paper, we review the recent progress in Vision-Language Pre-Trained Models (VL-PTMs). As the core content, we first briefly introduce several ways to encode raw images and texts to single-modal embeddings before pre-training. Then, we dive into the mainstream architectures of VL-PTMs in modeling the interaction between text and image representations. We further present widely-used pre-training tasks, after which we introduce some common downstream tasks. We finally conclude this paper and present some promising research directions. Our survey aims to provide multimodal researchers a synthesis and pointer to related research.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/18/2022

VLP: A Survey on Vision-Language Pre-training

In the past few years, the emergence of pre-training models has brought ...
research
09/21/2021

Survey: Transformer based Video-Language Pre-training

Inspired by the success of transformer-based pre-training methods on nat...
research
06/12/2023

A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation

Large language models such as BERT and the GPT series started a paradigm...
research
07/05/2022

Vision-and-Language Pretraining

With the burgeoning amount of data of image-text pairs and diversity of ...
research
08/03/2022

GROWN+UP: A Graph Representation Of a Webpage Network Utilizing Pre-training

Large pre-trained neural networks are ubiquitous and critical to the suc...
research
10/28/2022

DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

Vision-and-language (V-L) tasks require the system to understand both vi...
research
03/17/2022

POLARIS: A Geographic Pre-trained Model and its Applications in Baidu Maps

Pre-trained models (PTMs) have become a fundamental backbone for downstr...

Please sign up or login with your details

Forgot password? Click here to reset