Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

by   Emanuele Bugliarello, et al.

Large-scale pretraining and task-specific fine-tuning are now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorized into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five V&L BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.
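The single-stream vs. dual-stream distinction drawn in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it uses plain NumPy single-head attention, omits feed-forward sublayers, residuals, and intra-modal attention in the dual-stream case, and the function names are illustrative only.

```python
# Hedged sketch: contrast single-stream and dual-stream V&L encoder layers.
# Assumptions: toy single-head attention, no projections/residuals/FFN.
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention over row-vectors of shape (n, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def single_stream_layer(text, image):
    # Single-stream: concatenate both modalities into one sequence so a
    # single self-attention jointly attends over all tokens.
    x = np.concatenate([text, image], axis=0)
    out = attention(x, x, x)
    return out[: len(text)], out[len(text):]

def dual_stream_layer(text, image):
    # Dual-stream: each modality is kept in its own stream and queries the
    # other via cross-attention (intra-modal attention omitted here).
    return attention(text, image, image), attention(image, text, text)

rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))    # 4 text tokens, dim 8
image = rng.standard_normal((6, 8))   # 6 image regions, dim 8
st, si = single_stream_layer(text, image)
dt, di = dual_stream_layer(text, image)
print(st.shape, si.shape, dt.shape, di.shape)
```

Both variants map a (text, image) pair to contextualized (text, image) outputs of the same shapes, which is the shared interface the paper's unified framework builds on.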




Vision-and-Language Pretrained Models: A Survey

Pretrained models have produced great success in both Computer Vision (C...

Vision-and-Language Pretraining

With the burgeoning amount of data of image-text pairs and diversity of ...

How Much Can CLIP Benefit Vision-and-Language Tasks?

Most existing Vision-and-Language (V&L) models rely on pre-trained vis...

VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations

Vision-Language Pretraining (VLP) models have recently successfully faci...

MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning

Large-scale pretraining is fast becoming the norm in Vision-Language (VL...

Prompt Tuning for Generative Multimodal Pretrained Models

Prompt tuning has become a new paradigm for model tuning and it has demo...

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

We propose Unified-IO, a model that performs a large variety of AI tasks...

Code Repositories


Code and data for the framework in "Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs", TACL 2021.



Code for our paper "Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs", TACL 2021.
