Question-Agnostic Attention for Visual Question Answering

08/09/2019
by   Moshiur R Farazi, et al.

Visual Question Answering (VQA) models employ attention mechanisms to discover the image locations that are most relevant for answering a specific question. For this purpose, several multimodal fusion strategies have been proposed, ranging from relatively simple operations (e.g., linear sum) to more complex ones (e.g., Block). The resulting multimodal representations define an intermediate feature space for capturing the interplay between visual and semantic features, which helps the model selectively focus on image content. In this paper, we propose a question-agnostic attention mechanism that is complementary to existing question-dependent attention mechanisms. Our proposed model parses object instances to obtain an `object map' and applies this map to the visual features to generate Question-Agnostic Attention (QAA) features. In contrast to question-dependent attention approaches, which are learned end-to-end, the proposed QAA does not involve question-specific training and can be incorporated into almost any existing VQA model as a generic, light-weight pre-processing step, adding minimal computational overhead during training. Further, when used alongside question-dependent attention, QAA allows the model to focus on regions containing objects that might otherwise be overlooked by the learned attention representation. Through extensive evaluation on the VQAv1, VQAv2 and TDIUC datasets, we show that incorporating complementary QAA allows state-of-the-art VQA models to perform better, and provides a significant boost to simple VQA models, enabling them to perform on par with highly sophisticated fusion strategies.
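The core idea described above, pooling visual features over an instance-level `object map' to obtain question-agnostic features, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `qaa_features`, the mask-area selection heuristic, and the use of average pooling are all assumptions for illustration.

```python
import numpy as np

def qaa_features(visual_feats, object_map, num_objects, top_k=4):
    """Hypothetical sketch of Question-Agnostic Attention (QAA) pre-processing.

    visual_feats: (C, H, W) CNN feature grid for the image.
    object_map:   (H, W) integer map from an instance parser;
                  0 = background, 1..num_objects = object instance ids.
    Returns up to (top_k, C) pooled features, one row per selected object.
    """
    C, H, W = visual_feats.shape
    pooled_per_object = []
    for obj_id in range(1, num_objects + 1):
        mask = (object_map == obj_id)          # boolean mask of this instance
        if not mask.any():
            continue
        # Average-pool the visual features over the object's spatial extent,
        # yielding one question-agnostic feature vector per object.
        pooled = visual_feats[:, mask].mean(axis=1)
        pooled_per_object.append((int(mask.sum()), pooled))
    # Keep the top_k largest objects (an assumed selection heuristic; the
    # paper may rank or select instances differently).
    pooled_per_object.sort(key=lambda t: -t[0])
    if not pooled_per_object:
        return np.zeros((0, C))
    return np.stack([p for _, p in pooled_per_object[:top_k]])
```

Because this step depends only on the image (never on the question), it can be computed once per image and cached, which is why it adds so little overhead when bolted onto an existing question-dependent VQA pipeline.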

