Visual Question Answering based on Local-Scene-Aware Referring Expression Generation

01/22/2021
by Jung-Jun Kim, et al.

Visual question answering requires a deep understanding of both images and natural language. However, most methods mainly focus on visual concepts, such as the relationships between various objects. The limited use of object categories combined with their relationships, or of simple question embeddings, is insufficient for representing complex scenes and explaining decisions. To address this limitation, we propose the use of text expressions generated for images, because such expressions have few structural constraints and can provide richer descriptions of images. The generated expressions can be incorporated with visual features and question embeddings to obtain the question-relevant answer. A joint-embedding multi-head attention network is also proposed to model the three information modalities with co-attention. We quantitatively and qualitatively evaluated the proposed method on the VQA v2 dataset and compared it with state-of-the-art methods in terms of answer prediction. The quality of the generated expressions was also evaluated on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Experimental results demonstrate the effectiveness of the proposed method and reveal that it outperformed all of the competing methods in terms of both quantitative and qualitative results.
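To illustrate the kind of fusion the abstract describes, the following is a minimal sketch (not the authors' released code) of a multi-head co-attention block that jointly embeds three modalities: region-level visual features, question tokens, and generated referring-expression tokens. All dimensions, module names, and the specific attention wiring are illustrative assumptions.

```python
# Hypothetical sketch of tri-modal co-attention for VQA; dimensions and
# structure are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class TriModalCoAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # question tokens attend over image region features
        self.q_over_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        # question tokens attend over generated expression tokens
        self.q_over_e = nn.MultiheadAttention(dim, heads, batch_first=True)
        # joint embedding of the three modalities
        self.fuse = nn.Sequential(
            nn.Linear(3 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, v, q, e):
        # v: (B, R, dim) region features, q: (B, T, dim) question tokens,
        # e: (B, S, dim) expression tokens
        v_ctx, _ = self.q_over_v(q, v, v)   # question-guided visual context
        e_ctx, _ = self.q_over_e(q, e, e)   # question-guided expression context
        joint = torch.cat([q, v_ctx, e_ctx], dim=-1)
        fused = self.fuse(joint)            # (B, T, dim) joint embedding
        return fused.mean(dim=1)            # pooled feature for an answer classifier

if __name__ == "__main__":
    B, R, T, S, D = 2, 36, 14, 10, 512
    block = TriModalCoAttention(dim=D, heads=8)
    out = block(torch.randn(B, R, D), torch.randn(B, T, D), torch.randn(B, S, D))
    print(out.shape)  # torch.Size([2, 512])
```

In this sketch the question serves as the query for both the visual and the expression streams, and the resulting contexts are concatenated and projected into a single joint embedding; the actual network may combine the modalities differently.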


