A Picture May Be Worth a Hundred Words for Visual Question Answering

06/25/2021
by   Yusuke Hirota, et al.

How far can we go with textual representations for understanding pictures? In image understanding, it is essential to use concise but detailed image representations. Deep visual features extracted by vision models, such as Faster R-CNN, are prevalently used in multiple tasks, especially in visual question answering (VQA). However, conventional deep visual features may struggle to convey all the details in an image as we humans do. Meanwhile, with recent progress in language models, descriptive text may offer an alternative. This paper delves into the effectiveness of textual representations for image understanding in the specific context of VQA. We propose to take description-question pairs as input, instead of deep visual features, and feed them into a language-only Transformer model, simplifying the process and reducing the computational cost. We also experiment with data augmentation techniques to increase the diversity of the training set and avoid learning statistical bias. Extensive evaluations show that textual representations require only about a hundred words to compete with deep visual features on both VQA 2.0 and VQA-CP v2.
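The core idea of replacing deep visual features with text can be sketched as building a single sequence from a description-question pair and capping the description at roughly a hundred words, as the abstract suggests. This is a minimal illustration, not the paper's exact preprocessing: the function name, the `[CLS]`/`[SEP]` delimiter format, and the word-level truncation are all assumptions.

```python
def build_vqa_text_input(description: str, question: str,
                         max_desc_words: int = 100) -> str:
    """Join an image description and a question into one text sequence
    for a language-only Transformer (hypothetical helper; the paper's
    actual input format and tokenizer are not specified here)."""
    # Cap the description at ~100 words, the budget the paper finds
    # sufficient to compete with deep visual features.
    truncated = " ".join(description.split()[:max_desc_words])
    return f"[CLS] {truncated} [SEP] {question} [SEP]"


# Example: a short description passes through untruncated.
text = build_vqa_text_input("a red apple on a wooden table",
                            "What is on the table?")
print(text)
# A long description is cut to the word budget before the question.
long_text = build_vqa_text_input(" ".join(["word"] * 150), "How many?")
```

The resulting string could then be tokenized and classified over the answer vocabulary by any text-only Transformer, with no region-feature extractor in the pipeline.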

