Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation

12/10/2021
by Tianyi Liu, et al.

Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like objectives (masked language modeling and image-text matching) during pre-training. Although they perform well on many downstream understanding tasks, e.g., visual question answering, image-text retrieval, and visual entailment, they lack the ability to generate text. To tackle this problem, we propose Unified multimodal pre-training for both Vision-Language understanding and generation (UniVL). The proposed UniVL is capable of handling both understanding and generative tasks. We augment existing pre-training paradigms, which use only random masks, with causal masks, i.e., triangular masks that block attention to future tokens, so that the pre-trained models have autoregressive generation ability by design. We formulate several previously studied understanding tasks as text generation tasks and propose a prompt-based method for fine-tuning on different downstream tasks. Our experiments show that there is a trade-off between understanding and generation tasks when the same model is used for both, and that a feasible way to improve both is to use more data. Our UniVL framework attains performance comparable to recent vision-language pre-training methods on both understanding and generation tasks. Moreover, we demonstrate that prompt-based fine-tuning is more data-efficient: it outperforms discriminative methods in few-shot scenarios.
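As a rough sketch of the masking idea described above, the snippet below contrasts a BERT-style random token mask with a causal (lower-triangular) attention mask that blocks future positions. The function names and the PyTorch-based formulation are illustrative assumptions, not the paper's actual implementation.

import torch

def random_token_mask(seq_len: int, mask_prob: float = 0.15) -> torch.Tensor:
    # BERT-style masking: a random subset of positions is selected for
    # masked language modeling (an understanding-oriented objective).
    return torch.rand(seq_len) < mask_prob

def causal_attention_mask(seq_len: int) -> torch.Tensor:
    # Triangular mask: position i may attend only to positions <= i,
    # which is what gives the model autoregressive generation ability.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

seq_len = 6
print(random_token_mask(seq_len))      # random positions selected for MLM
print(causal_attention_mask(seq_len))  # lower-triangular boolean matrix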


