Pre-training image-language transformers for open-vocabulary tasks

09/09/2022
by AJ Piergiovanni, et al.

We present a pre-training approach for vision-and-language transformer models based on a mixture of diverse tasks. We explore both the use of image-text captioning data in pre-training, which does not require additional supervision, and object-aware strategies for pre-training the model. We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment, and captioning, and demonstrate large gains over standard pre-training methods.
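The abstract frames pre-training as a mixture of diverse, text-generative tasks handled by a single image-and-language transformer. As a rough illustration only, the sketch below shows what such a task-mixture training loop could look like in PyTorch; the toy model, the task names, and the synthetic batches are assumptions for exposition and do not reflect the paper's actual architecture, objectives, or data.

```python
# Minimal, hypothetical sketch of mixture-of-tasks pre-training for an
# image-text encoder-decoder model. Sizes, task names, and data below are
# illustrative assumptions, not the paper's implementation.
import random
import torch
import torch.nn as nn

VOCAB, D, PATCHES, FEAT = 1000, 128, 16, 32  # toy dimensions (assumed)

class ToyImageTextModel(nn.Module):
    """Tiny image encoder + autoregressive text decoder for illustration."""
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Linear(FEAT, D)   # project patch features to visual tokens
        self.tok_embed = nn.Embedding(VOCAB, D)
        layer = nn.TransformerDecoderLayer(d_model=D, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D, VOCAB)

    def forward(self, patches, text_in):
        memory = self.patch_embed(patches)       # (B, PATCHES, D) visual tokens
        tgt = self.tok_embed(text_in)            # (B, T, D) text tokens
        causal = nn.Transformer.generate_square_subsequent_mask(text_in.size(1))
        return self.lm_head(self.decoder(tgt, memory, tgt_mask=causal))

def fake_batch(task):
    """Stand-in for real data; every task is cast as image-conditioned text generation."""
    patches = torch.randn(8, PATCHES, FEAT)
    text = torch.randint(0, VOCAB, (8, 12))
    return patches, text

model = ToyImageTextModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
# Assumed task mixture: plain captioning plus object-aware and QA-style variants.
tasks = ["captioning", "object_aware_captioning", "vqa_style_generation"]

for step in range(100):
    task = random.choice(tasks)                  # sample one task per training step
    patches, text = fake_batch(task)
    logits = model(patches, text[:, :-1])        # teacher forcing on shifted text
    loss = loss_fn(logits.reshape(-1, VOCAB), text[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the sketch is that every task in the mixture is optimized with the same generative cross-entropy loss over the same model, which is what allows captioning-style data and object-aware objectives to be combined without task-specific heads.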

Related research

05/02/2022  Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
03/10/2022  MVP: Multimodality-guided Visual Pre-training
10/17/2022  Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
09/28/2020  VIVO: Surpassing Human Performance in Novel Object Captioning with Visual Vocabulary Pre-Training
06/04/2022  Rethinking the Openness of CLIP
04/12/2023  CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes
12/23/2021  LaTr: Layout-Aware Transformer for Scene-Text VQA
