All You May Need for VQA are Image Captions

05/04/2022
by Soravit Changpinyo et al.

Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that the same model trained on human-annotated VQA data lacks.
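The abstract only gestures at the pipeline. As a rough illustration of the general caption-to-QA idea (not the paper's actual method), the sketch below pairs hand-picked answer candidates from a caption with questions produced by an off-the-shelf text-to-text question-generation model; the checkpoint name, the "generate question:" prefix, and the <hl> answer-highlighting format are assumptions about that particular model, not details taken from the paper.

from transformers import pipeline

# Assumed checkpoint and prompt format; any seq2seq question-generation
# model that supports answer highlighting could be substituted here.
qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")

def qa_pairs_from_caption(caption, answer_candidates):
    """Turn one caption into candidate (question, answer) pairs."""
    pairs = []
    for answer in answer_candidates:
        # Mark the answer span so the model knows what to ask about.
        highlighted = caption.replace(answer, f"<hl> {answer} <hl>", 1)
        out = qg(f"generate question: {highlighted}")
        pairs.append((out[0]["generated_text"], answer))
    return pairs

# Answer candidates would normally be extracted automatically from the
# caption (noun phrases, numbers, etc.); here they are given by hand.
print(qa_pairs_from_caption(
    "A brown dog catches a frisbee in the park.",
    ["brown dog", "frisbee", "park"],
))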


Related research

10/17/2022
Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training
Visual question answering (VQA) is a hallmark of vision and language rea...

12/04/2020
Self-Supervised VQA: Answering Visual Questions using Images and Captions
Methodologies for training VQA models assume the availability of dataset...

10/21/2019
Good, Better, Best: Textual Distractors Generation for Multi-Choice VQA via Policy Gradient
Textual distractors in current multi-choice VQA datasets are not challen...

11/02/2018
Zero-Shot Transfer VQA Dataset
Acquiring a large vocabulary is an important aspect of human intelligenc...

10/29/2019
Learning Rich Image Region Representation for Visual Question Answering
We propose to boost VQA by leveraging more powerful feature extractors b...

03/15/2022
CARETS: A Consistency And Robustness Evaluative Test Suite for VQA
We introduce CARETS, a systematic test suite to measure consistency and ...
06/10/2021
Supervising the Transfer of Reasoning Patterns in VQA
Methods for Visual Question Answering (VQA) are notorious for leveraging ...
