MUREL: Multimodal Relational Reasoning for Visual Question Answering

02/25/2019
by Rémi Cadene, et al.

Multimodal attentional networks are currently the state-of-the-art models for Visual Question Answering (VQA) tasks involving real images. Although attention allows the model to focus on the visual content relevant to the question, this simple mechanism is arguably insufficient to capture the complex reasoning required for VQA and other high-level tasks. In this paper, we propose MuRel, a multimodal relational network learned end-to-end to reason over real images. Our first contribution is the MuRel cell, an atomic reasoning primitive that represents interactions between the question and image regions with a rich vectorial representation and models region relations through pairwise combinations. Second, we incorporate the cell into a full MuRel network, which progressively refines visual and question interactions and can be leveraged to define visualization schemes finer than mere attention maps. We validate the relevance of our approach with various ablation studies and show its superiority over attention-based methods on three datasets: VQA 2.0, VQA-CP v2 and TDIUC. Our final MuRel network is competitive with or outperforms state-of-the-art results in this challenging context. Our code is available at https://github.com/Cadene/murel.bootstrap.pytorch
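The pairwise relational update described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the actual MuRel cell uses learned bilinear fusion modules, whereas the elementwise product and pairwise sum below are simplified stand-ins, and all names are illustrative.

```python
import numpy as np

def murel_cell(regions, question):
    """One step of a MuRel-like cell (simplified sketch).

    regions:  (n, d) array of visual region features
    question: (d,) question embedding

    Returns updated (n, d) region features.
    """
    # Multimodal fusion: combine each region with the question.
    # MuRel uses a learned bilinear fusion; an elementwise
    # product is a crude stand-in here.
    fused = regions * question  # (n, d)

    # Pairwise relational term: combine every pair of fused regions
    # via broadcasting, then aggregate each region's partners with a
    # max (MuRel uses a learned pairwise fusion followed by max pooling).
    pairwise = fused[:, None, :] + fused[None, :, :]  # (n, n, d)
    context = pairwise.max(axis=1)                    # (n, d)

    # Residual update lets the cell be applied iteratively while
    # progressively refining the region representations.
    return regions + context

# Usage: iterate the cell a few times, as the full MuRel network does.
rng = np.random.default_rng(0)
states = rng.normal(size=(4, 8))   # 4 regions, 8-dim features
q = rng.normal(size=(8,))
for _ in range(3):
    states = murel_cell(states, q)
```

The residual form is what allows the network to stack several cells and "progressively refine" the visual representation, rather than producing a single attention map.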


Related research

- 09/27/2021 · Multimodal Integration of Human-Like Attention in Visual Question Answering — Human-like attention as a supervisory signal to guide neural attention h...
- 02/01/2018 · Dual Recurrent Attention Units for Visual Question Answering — We propose an architecture for VQA which utilizes recurrent layers to ge...
- 01/31/2019 · BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection — Multimodal representation learning is gaining more and more interest wit...
- 08/07/2017 · Structured Attentions for Visual Question Answering — Visual attention, which assigns weights to image regions according to th...
- 04/20/2022 · Attention in Reasoning: Dataset, Analysis, and Modeling — While attention has been an increasingly popular component in deep neura...
- 03/17/2022 · MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering — Knowledge-based visual question answering requires the ability of associ...
- 12/06/2021 · MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering — Textbook Question Answering (TQA) is a complex multimodal task to infer ...
