Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs

11/30/2020
by Emanuele Bugliarello, et al.

Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorized into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five V&L BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.
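The single-stream versus dual-stream distinction mentioned in the abstract can be pictured with a minimal PyTorch sketch. This is not the paper's VOLTA implementation; the hidden size, layer counts, and sequence lengths below are illustrative assumptions. A single-stream encoder concatenates the text and image-region embeddings and runs one Transformer over the joint sequence, while a dual-stream encoder processes each modality in its own Transformer and exchanges information through cross-attention.

```python
# Minimal sketch of the two V&L encoder families (assumes PyTorch is installed).
# This is NOT the paper's VOLTA code; sizes and layer counts are illustrative only.
import torch
import torch.nn as nn

D = 768  # shared hidden size (assumption)

# Single-stream: one Transformer over the concatenated text + image sequence.
single_stream = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True),
    num_layers=2,
)

# Dual-stream: separate Transformers per modality, fused via cross-attention.
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True), num_layers=2)
image_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True), num_layers=2)
text_to_image_attn = nn.MultiheadAttention(D, num_heads=12, batch_first=True)

text = torch.randn(1, 20, D)    # 20 word-piece embeddings (dummy inputs)
image = torch.randn(1, 36, D)   # 36 region-feature embeddings (dummy inputs)

# Single-stream fusion: both modalities attend to each other in every layer.
joint = single_stream(torch.cat([text, image], dim=1))        # (1, 56, D)

# Dual-stream fusion: intra-modal encoding first, then cross-modal attention.
t, v = text_encoder(text), image_encoder(image)
fused_text, _ = text_to_image_attn(query=t, key=v, value=v)   # (1, 20, D)
```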

Related Research

Vision-and-Language Pretrained Models: A Survey (04/15/2022)
Pretrained models have produced great success in both Computer Vision (C...

Vision-and-Language Pretraining (07/05/2022)
With the burgeoning amount of data of image-text pairs and diversity of ...

How Much Can CLIP Benefit Vision-and-Language Tasks? (07/13/2021)
Most existing Vision-and-Language (V&L) models rely on pre-trained vis...

VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations (07/01/2022)
Vision-Language Pretraining (VLP) models have recently successfully faci...

MAGMA: Multimodal Augmentation of Generative Models through Adapter-based Finetuning (12/09/2021)
Large-scale pretraining is fast becoming the norm in Vision-Language (VL...

Prompt Tuning for Generative Multimodal Pretrained Models (08/04/2022)
Prompt tuning has become a new paradigm for model tuning and it has demo...

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks (06/17/2022)
We propose Unified-IO, a model that performs a large variety of AI tasks...

Code Repositories

volta
Code and data for the framework in "Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs", TACL 2021.

mpre-unmasked
Code for our paper "Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs", TACL 2021.