A Focused Dynamic Attention Model for Visual Question Answering

04/06/2016
by   Ilija Ilievski, et al.

Visual Question Answering (VQA) problems are attracting increasing interest from multiple research disciplines. Solving a VQA problem requires techniques from computer vision for understanding the visual content of a presented image or video, as well as techniques from natural language processing for understanding the semantics of the question and generating the answer. Regarding visual content modeling, most existing VQA methods adopt the strategy of extracting global features from the image or video, which inevitably fails to capture fine-grained information such as the spatial configuration of multiple objects. Extracting features from automatically generated regions -- as some region-based image recognition methods do -- cannot fully address this problem and may introduce overwhelming features irrelevant to the question. In this work, we propose a novel Focused Dynamic Attention (FDA) model that provides an image content representation better aligned with the proposed question. Aware of the key words in the question, FDA employs an off-the-shelf object detector to identify important regions and fuses the information from these regions with global features via an LSTM unit. This question-driven representation is then combined with the question representation and fed into a reasoning unit that generates the answer. Extensive evaluation on a large-scale benchmark dataset, VQA, clearly demonstrates the superior performance of FDA over well-established baselines.
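The fusion step the abstract describes -- feeding question-relevant region features and the global image feature through an LSTM, then combining the result with a question representation -- can be sketched as follows. This is a minimal toy illustration in plain NumPy, not the authors' implementation: the dimensions, random features, and elementwise fusion are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gate pre-activations are packed as [i, f, o, g]."""
    z = W @ x + U @ h + b
    H = h.size
    i = 1.0 / (1.0 + np.exp(-z[:H]))        # input gate
    f = 1.0 / (1.0 + np.exp(-z[H:2*H]))     # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2*H:3*H]))   # output gate
    g = np.tanh(z[3*H:])                    # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

D, H = 8, 4  # toy feature and hidden sizes
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

# Hypothetical CNN features for detector-proposed, question-relevant
# regions, followed by the global image feature, as in the FDA pipeline.
region_feats = [rng.normal(size=D) for _ in range(3)]
global_feat = rng.normal(size=D)

h, c = np.zeros(H), np.zeros(H)
for x in region_feats + [global_feat]:
    h, c = lstm_step(x, h, c, W, U, b)

# Stand-in for an LSTM encoding of the question.
question_repr = rng.normal(size=H)

# Elementwise fusion of visual and question representations before
# a downstream answer classifier (the "reasoning unit").
joint = np.tanh(h) * np.tanh(question_repr)
print(joint.shape)  # (4,)
```

The key design point the abstract emphasizes is the ordering: region features are consumed before the global feature, so the final hidden state summarizes question-focused detail in the context of the whole image.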


