Multimodal Integration of Human-Like Attention in Visual Question Answering

09/27/2021
by Ekta Sood, et al.

Human-like attention as a supervisory signal to guide neural attention has shown significant promise but is currently limited to uni-modal integration - even for inherently multimodal tasks such as visual question answering (VQA). We present the Multimodal Human-like Attention Network (MULAN) - the first method for multimodal integration of human-like attention on image and text during training of VQA models. MULAN integrates attention predictions from two state-of-the-art text and image saliency models into neural self-attention layers of a recent transformer-based VQA model. Through evaluations on the challenging VQAv2 dataset, we show that MULAN achieves a new state-of-the-art performance of 73.98% accuracy and, at the same time, has approximately 80% fewer trainable parameters than prior work. Overall, our work underlines the potential of integrating multimodal human-like and neural attention for VQA.
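The abstract does not spell out how the saliency predictions enter the self-attention layers. One common way to inject an external human-attention prior into a transformer is to mix it into the attention distribution over tokens (words or image regions). The sketch below is a generic, hypothetical PyTorch illustration of that idea under assumed names (e.g. `SaliencyBiasedSelfAttention`, `saliency_weight`); it is not the actual MULAN formulation.

    # Hypothetical sketch: biasing transformer self-attention with a predicted
    # human-saliency distribution over tokens. The module name, the learnable
    # mixing weight, and the convex-combination scheme are illustrative
    # assumptions, not the MULAN implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class SaliencyBiasedSelfAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.qkv = nn.Linear(dim, 3 * dim)
            self.out = nn.Linear(dim, dim)
            # Learnable scalar controlling how strongly the human-saliency
            # prior influences the neural attention distribution (assumption).
            self.saliency_weight = nn.Parameter(torch.tensor(0.5))

        def forward(self, x: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
            # x:        (batch, seq_len, dim)  token features (words or regions)
            # saliency: (batch, seq_len)       predicted human attention, sums to 1
            b, n, d = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # Reshape to (batch, heads, seq_len, head_dim)
            q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
            k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
            v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

            # Standard scaled dot-product attention weights.
            attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)

            # Convexly mix neural attention with the human-saliency prior,
            # broadcast over heads and query positions, then renormalize.
            prior = saliency[:, None, None, :]          # (b, 1, 1, n)
            alpha = torch.sigmoid(self.saliency_weight)
            attn = (1 - alpha) * attn + alpha * prior
            attn = attn / attn.sum(dim=-1, keepdim=True)

            out = (attn @ v).transpose(1, 2).reshape(b, n, d)
            return self.out(out)

Per the abstract, MULAN obtains the text and image saliency distributions from two pretrained saliency models and applies this kind of integration during training of the VQA model; the mixing above is only one plausible way to realize it.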



Related research

09/27/2021
VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering
We present VQA-MHUG - a novel 49-participant dataset of multimodal human...

09/19/2017
Exploring Human-like Attention Supervision in Visual Question Answering
Attention mechanisms have been widely applied in the Visual Question Ans...

08/12/2019
Multimodal Unified Attention Networks for Vision-and-Language Interactions
Learning an effective attention mechanism for multimodal data is importa...

02/25/2019
MUREL: Multimodal Relational Reasoning for Visual Question Answering
Multimodal attentional networks are currently state-of-the-art models fo...

11/02/2016
Dual Attention Networks for Multimodal Reasoning and Matching
We propose Dual Attention Networks (DANs) which jointly leverage visual ...

02/28/2023
VQA with Cascade of Self- and Co-Attention Blocks
The use of complex attention modules has improved the performance of the...

01/11/2022
On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering
In recent years, multi-modal transformers have shown significant progres...
