Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

11/17/2015
by Huijuan Xu, et al.

We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but these fail to model spatial inference. To remedy this, we propose a model we call the Spatial Memory Network and apply it to the VQA task. Memory networks are recurrent neural networks with an explicit attention mechanism that selects certain parts of the information stored in memory. Our Spatial Memory Network stores neuron activations from different spatial regions of the image in its memory and uses the question to choose relevant regions for computing the answer, a process that constitutes a single "hop" in the network. We propose a novel spatial attention architecture that aligns words with image patches in the first hop, and we obtain improved results by adding a second attention hop that considers the whole question to choose visual evidence based on the results of the first hop. To better understand the inference process learned by the network, we design synthetic questions that specifically require spatial inference and visualize the attention weights. We evaluate our model on two published visual question answering datasets, DAQUAR [1] and VQA [2], and obtain improved results compared to a strong deep baseline model (iBOWIMG) that concatenates image and question features to predict the answer [3].
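To make the attention mechanism concrete, here is a minimal PyTorch sketch of one question-guided attention "hop" as described above: the question embedding scores each spatial region's CNN activation, and the visual evidence is the attention-weighted sum over regions. All names here (SpatialAttentionHop, d_img, d_q, d) and the dot-product alignment are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of one question-guided spatial attention hop.
# Assumed, illustrative names throughout; not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionHop(nn.Module):
    """One attention 'hop': the question scores each image region,
    and the evidence is the attention-weighted sum of region features."""
    def __init__(self, d_img: int, d_q: int, d: int):
        super().__init__()
        self.img_proj = nn.Linear(d_img, d)  # embed each region's CNN activation
        self.q_proj = nn.Linear(d_q, d)      # embed the question representation

    def forward(self, regions: torch.Tensor, question: torch.Tensor):
        # regions:  (batch, num_regions, d_img) spatial CNN activations
        # question: (batch, d_q) question vector
        v = self.img_proj(regions)                  # (B, R, d)
        q = self.q_proj(question).unsqueeze(1)      # (B, 1, d)
        scores = (v * q).sum(dim=-1)                # dot-product alignment, (B, R)
        attn = F.softmax(scores, dim=-1)            # attention weights over regions
        evidence = (attn.unsqueeze(-1) * v).sum(1)  # (B, d) weighted visual evidence
        return evidence, attn
```

A second hop would repeat this step with a new query formed from the first hop's evidence combined with the question vector, which mirrors the paper's two-hop design at a high level.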


Related research

05/31/2016  Hierarchical Question-Image Co-Attention for Visual Question Answering
A number of recent works have proposed attention models for Visual Quest...

07/25/2017  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Top-down visual attention mechanisms have been used extensively in image...

02/13/2020  Sparse and Structured Visual Attention
Visual attention mechanisms are widely used in multimodal tasks, such as...

06/12/2016  Training Recurrent Answering Units with Joint Loss Minimization for VQA
We propose a novel algorithm for visual question answering based on a re...

08/31/2016  Towards Transparent AI Systems: Interpreting Visual Question Answering Models
Deep neural networks have shown striking progress and obtained state-of-...

02/22/2017  Task-driven Visual Saliency and Attention-based Visual Question Answering
Visual question answering (VQA) has witnessed great progress since May, ...

01/10/2020  In Defense of Grid Features for Visual Question Answering
Popularized as 'bottom-up' attention, bounding box (or region) based vis...
