Improving Visual Question Answering by Referring to Generated Paragraph Captions

06/14/2019
by Hyounghun Kim, et al.

Paragraph-style image captions describe diverse aspects of an image, as opposed to the more common single-sentence captions that only provide an abstract description of the image. These paragraph captions can hence contain substantial information about the image for tasks such as visual question answering. Moreover, this textual information is complementary to the visual information present in the image, because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and scenes that can be directly matched with the textual question and copied into the textual answer (i.e., an easier modality match). Hence, we propose a combined Visual and Textual Question Answering (VTQA) model that takes as input a paragraph caption as well as the corresponding image, and answers the given question based on both inputs. In our model, the inputs are fused to extract related information by cross-attention (early fusion), then fused again in the form of a consensus (late fusion), and finally the expected answers are given an extra score to increase their chance of selection (later fusion). Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help correctly answer more visual questions. Overall, when trained on the Visual Genome dataset, our joint model significantly improves VQA performance over a strong baseline model.
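To make the three fusion stages concrete, below is a minimal PyTorch sketch of the pipeline. All module names, feature dimensions, and the consensus and scoring heuristics here are illustrative assumptions for this sketch, not the authors' exact architecture.

```python
# Minimal sketch of early / late / "later" fusion for VTQA, in PyTorch.
# Dimensions, heads, and the +1.0 score bonus are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VTQAFusionSketch(nn.Module):
    def __init__(self, d=512, num_answers=3000):
        super().__init__()
        self.q_proj = nn.Linear(d, d)   # question features
        self.v_proj = nn.Linear(d, d)   # image-region features
        self.t_proj = nn.Linear(d, d)   # paragraph-caption token features
        self.v_head = nn.Linear(d, num_answers)  # visual-branch classifier
        self.t_head = nn.Linear(d, num_answers)  # textual-branch classifier

    def cross_attend(self, query, context):
        # Early fusion: attend over context features conditioned on the query.
        scores = torch.matmul(query, context.transpose(-1, -2))   # (B, 1, N)
        attn = F.softmax(scores / query.size(-1) ** 0.5, dim=-1)
        return torch.matmul(attn, context).squeeze(1)             # (B, d)

    def forward(self, q, v, t, candidate_mask=None):
        # q: (B, d) question; v: (B, Nv, d) regions; t: (B, Nt, d) caption tokens
        q = self.q_proj(q).unsqueeze(1)
        v_ctx = self.cross_attend(q, self.v_proj(v))  # question-attended image
        t_ctx = self.cross_attend(q, self.t_proj(t))  # question-attended caption

        # Late fusion: each modality votes over the answer vocabulary, and the
        # votes are combined as a consensus (here, a simple average of logits).
        logits = 0.5 * (self.v_head(v_ctx) + self.t_head(t_ctx))

        # "Later" fusion: expected answers (e.g., words appearing in the
        # generated paragraph) receive an extra additive score bonus.
        if candidate_mask is not None:  # (B, num_answers) binary mask
            logits = logits + 1.0 * candidate_mask
        return logits
```

For example, `VTQAFusionSketch()(torch.randn(2, 512), torch.randn(2, 36, 512), torch.randn(2, 20, 512))` returns a `(2, 3000)` logit tensor. The additive bonus in the final step mirrors the later-fusion idea described above: candidate answers expected from the paragraph caption get a fixed score boost before selection.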
