Are we pretraining it right? Digging deeper into visio-linguistic pretraining

04/19/2020
by Amanpreet Singh, et al.

Numerous recent works have proposed pretraining generic visio-linguistic representations and then finetuning them for downstream vision and language tasks. While architecture and objective function design choices have received attention, the choice of pretraining datasets has received little attention. In this work, we question some of the default choices made in the literature. For instance, we systematically study how varying similarity between the pretraining dataset domain (textual and visual) and the downstream domain affects performance. Surprisingly, we show that automatically generated data in a domain closer to the downstream task (e.g., VQA v2) is a better choice for pretraining than "natural" data from a slightly different domain (e.g., Conceptual Captions). On the other hand, some seemingly reasonable choices of pretraining datasets were found to be entirely ineffective for some downstream tasks. This suggests that despite the numerous recent efforts, vision and language pretraining does not quite work "out of the box" yet. Overall, as a by-product of our study, we find that simple design choices in pretraining can help us achieve close to state-of-the-art results on downstream tasks without any architectural changes.
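The pretrain-then-finetune protocol the abstract refers to can be illustrated with a minimal sketch: a shared image-text encoder is first trained on a pretraining corpus with a proxy objective (here image-text matching), then the same encoder is reused and trained on a downstream VQA v2-style answer classification task. All names (ToyVLModel, run_stage) and the synthetic random tensors are illustrative assumptions, not the authors' code or data.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class ToyVLModel(nn.Module):
    """Tiny fusion model: projected image and text features are summed into a joint vector."""
    def __init__(self, img_dim=2048, txt_dim=300, hidden=256, num_answers=3129):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.fusion = nn.Sequential(nn.ReLU(), nn.Linear(hidden, hidden))
        self.itm_head = nn.Linear(hidden, 2)            # pretraining head: image-text matching
        self.vqa_head = nn.Linear(hidden, num_answers)  # downstream head: VQA v2 answer classification

    def encode(self, img, txt):
        return self.fusion(self.img_proj(img) + self.txt_proj(txt))

def run_stage(model, head, loader, lr):
    """One training stage: optimize the shared encoder together with the given head."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for img, txt, target in loader:
        loss = loss_fn(head(model.encode(img, txt)), target)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Synthetic stand-ins for a pretraining corpus (e.g. Conceptual Captions, or
# automatically generated in-domain data) and a downstream VQA v2-style dataset.
pretrain_data = TensorDataset(torch.randn(64, 2048), torch.randn(64, 300), torch.randint(0, 2, (64,)))
vqa_data = TensorDataset(torch.randn(64, 2048), torch.randn(64, 300), torch.randint(0, 3129, (64,)))

model = ToyVLModel()
run_stage(model, model.itm_head, DataLoader(pretrain_data, batch_size=16), lr=1e-4)  # stage 1: pretrain
run_stage(model, model.vqa_head, DataLoader(vqa_data, batch_size=16), lr=5e-5)       # stage 2: finetune
```

Under this protocol, swapping the contents of the pretraining corpus while keeping the model and objectives fixed is exactly the kind of dataset-domain comparison the paper studies.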

Related research

07/05/2022  Vision-and-Language Pretraining
With the burgeoning amount of data of image-text pairs and diversity of ...

08/16/2020  DeVLBert: Learning Deconfounded Visio-Linguistic Representations
In this paper, we propose to investigate the problem of out-of-domain vi...

06/06/2021  Meta-learning for downstream aware and agnostic pretraining
Neural network pretraining is gaining attention due to its outstanding p...

09/09/2018  How clever is the FiLM model, and how clever can it be?
The FiLM model achieves close-to-perfect performance on the diagnostic C...

08/24/2021  SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
With recent progress in joint modeling of visual and textual representat...

12/08/2020  LAMP: Label Augmented Multimodal Pretraining
Multi-modal representation learning by pretraining has become an increas...

06/14/2023  Deblurring Masked Autoencoder is Better Recipe for Ultrasound Image Recognition
Masked autoencoder (MAE) has attracted unprecedented attention and achie...
