Visual Question Answering as Reading Comprehension

11/29/2018
by Hui Li, et al.

Visual question answering (VQA) demands simultaneous comprehension of both the visual content of an image and a natural language question. In some cases, the reasoning requires common sense or general knowledge, which usually appears in the form of text. Current methods jointly embed the visual information and the textual features into the same space. However, modeling the complex interactions between these two different modalities is not an easy task. Instead of struggling with multimodal feature fusion, in this paper we propose to unify all the input information as natural language, thereby converting VQA into a machine reading comprehension problem. With this transformation, our method can not only tackle VQA datasets that focus on observation-based questions, but can also be naturally extended to handle knowledge-based VQA, which requires exploring large-scale external knowledge bases. It is a step towards exploiting large volumes of text and natural language processing techniques to address the VQA problem. Two types of models are proposed to deal with open-ended VQA and multiple-choice VQA, respectively. We evaluate our models on three VQA benchmarks. Performance comparable to the state of the art demonstrates the effectiveness of the proposed method.
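The core idea above can be sketched in a few lines: visual content is first expressed as natural-language sentences, and the question is then answered by "reading" that text. This is a minimal illustrative sketch with hypothetical stand-in functions (`describe_image`, `answer_by_reading`); a real system would use a trained captioning model and a trained reading-comprehension model rather than word overlap.

```python
def describe_image(image_regions):
    """Stand-in for a captioning model: turn detected visual facts
    (object, spatial relation) into natural-language sentences."""
    return " ".join(f"There is a {obj} {relation}." for obj, relation in image_regions)

def answer_by_reading(passage, question, candidates):
    """Stand-in for a reading-comprehension model: score each candidate
    answer by naive word overlap with the textual passage."""
    passage_words = set(passage.lower().split())
    return max(candidates,
               key=lambda c: len(set(c.lower().split()) & passage_words))

# Visual content unified into text, then "read" to answer a multiple-choice question.
regions = [("red umbrella", "above the woman"), ("dog", "on the grass")]
passage = describe_image(regions)
answer = answer_by_reading(passage, "What is above the woman?",
                           ["red umbrella", "bicycle", "cat"])
print(answer)
```

Because everything is text at this point, external knowledge (e.g., retrieved encyclopedia sentences) can simply be appended to the passage, which is what makes the formulation extend naturally to knowledge-based VQA.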


