
Reciprocal Attention Fusion for Visual Question Answering

by   Moshiur R Farazi, et al.

Existing attention mechanisms attend to either local image-grid or object-level features for Visual Question Answering (VQA). Motivated by the observation that questions can relate both to object instances and to their parts, we propose a novel attention mechanism that jointly considers the reciprocal relationships between these two levels of visual detail. The bottom-up attention thus generated is further coalesced with top-down information to focus only on the scene elements most relevant to a given question. Our design hierarchically fuses multi-modal information, i.e., language, object-level and grid-level features, through an efficient tensor decomposition scheme. The proposed model improves the state-of-the-art single-model performance from 67.9%, a significant boost.
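The "efficient tensor decomposition scheme" mentioned above refers to fusing language and visual features through a factorized bilinear interaction rather than a full (and intractable) third-order tensor. A minimal sketch of one common low-rank variant of such multi-modal fusion is below; the dimensions, random weights, and function name are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: question embedding, visual feature,
# shared joint space, and answer vocabulary.
d_q, d_v, d_joint, d_out = 300, 2048, 512, 1000

# Random matrices standing in for learned projection weights.
W_q = rng.standard_normal((d_joint, d_q)) * 0.01
W_v = rng.standard_normal((d_joint, d_v)) * 0.01
W_o = rng.standard_normal((d_out, d_joint)) * 0.01

def low_rank_bilinear_fusion(q, v):
    """Fuse a question embedding q and a visual feature v by projecting
    both into a shared space and taking an element-wise product -- a
    low-rank factorization of a full bilinear (tensor) interaction."""
    h = np.tanh(W_q @ q) * np.tanh(W_v @ v)  # Hadamard product in joint space
    return W_o @ h                           # joint embedding -> answer scores

q = rng.standard_normal(d_q)   # e.g. an RNN encoding of the question
v = rng.standard_normal(d_v)   # e.g. an object- or grid-level CNN feature
scores = low_rank_bilinear_fusion(q, v)
print(scores.shape)  # (1000,)
```

The element-wise product after projection replaces a dense d_q x d_v x d_out weight tensor with three small matrices, which is what makes applying the fusion at both object and grid levels affordable.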

Multimodal grid features and cell pointers for Scene Text Visual Question Answering

This paper presents a new model for the task of scene text visual questi...

Inverse Visual Question Answering with Multi-Level Attentions

In this paper, we propose a novel deep multi-level attention model to ad...

Visual Question Answering based on Local-Scene-Aware Referring Expression Generation

Visual question answering requires a deep understanding of both images a...

Attention Guided Semantic Relationship Parsing for Visual Question Answering

Humans explain inter-object relationships with semantic labels that demo...

ORD: Object Relationship Discovery for Visual Dialogue Generation

With the rapid advancement of image captioning and visual question answe...

Learning Visual Question Answering by Bootstrapping Hard Attention

Attention mechanisms in biological perception are thought to select subs...

Question-Agnostic Attention for Visual Question Answering

Visual Question Answering (VQA) models employ attention mechanisms to di...