Multimodal Integration of Human-Like Attention in Visual Question Answering

09/27/2021
by Ekta Sood, et al.

Human-like attention as a supervisory signal to guide neural attention has shown significant promise but is currently limited to uni-modal integration - even for inherently multimodal tasks such as visual question answering (VQA). We present the Multimodal Human-like Attention Network (MULAN) - the first method for multimodal integration of human-like attention on image and text during training of VQA models. MULAN integrates attention predictions from two state-of-the-art text and image saliency models into neural self-attention layers of a recent transformer-based VQA model. Through evaluations on the challenging VQAv2 dataset, we show that MULAN achieves a new state-of-the-art performance of 73.98% accuracy and, at the same time, has approximately 80% fewer trainable parameters than prior work. Overall, our work underlines the potential of integrating multimodal human-like and neural attention for VQA.
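The abstract does not spell out how the saliency predictions enter the self-attention layers. One common way to inject an external human-attention prior into a transformer is to mix it into the attention distribution over tokens (words or image regions). The sketch below is a generic, hypothetical PyTorch illustration of that idea under assumed names (e.g. `SaliencyBiasedSelfAttention`, `saliency_weight`); it is not the actual MULAN formulation.

    # Hypothetical sketch: biasing transformer self-attention with a predicted
    # human-saliency distribution over tokens. The module name, the learnable
    # mixing weight, and the convex-combination scheme are illustrative
    # assumptions, not the MULAN implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class SaliencyBiasedSelfAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.qkv = nn.Linear(dim, 3 * dim)
            self.out = nn.Linear(dim, dim)
            # Learnable scalar controlling how strongly the human-saliency
            # prior influences the neural attention distribution (assumption).
            self.saliency_weight = nn.Parameter(torch.tensor(0.5))

        def forward(self, x: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
            # x:        (batch, seq_len, dim)  token features (words or regions)
            # saliency: (batch, seq_len)       predicted human attention, sums to 1
            b, n, d = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # Reshape to (batch, heads, seq_len, head_dim)
            q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
            k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
            v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

            # Standard scaled dot-product attention weights.
            attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)

            # Convexly mix neural attention with the human-saliency prior,
            # broadcast over heads and query positions, then renormalize.
            prior = saliency[:, None, None, :]          # (b, 1, 1, n)
            alpha = torch.sigmoid(self.saliency_weight)
            attn = (1 - alpha) * attn + alpha * prior
            attn = attn / attn.sum(dim=-1, keepdim=True)

            out = (attn @ v).transpose(1, 2).reshape(b, n, d)
            return self.out(out)

Per the abstract, MULAN obtains the text and image saliency distributions from two pretrained saliency models and applies this kind of integration during training of the VQA model; the mixing above is only one plausible way to realize it.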



Related research

09/27/2021
VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering
We present VQA-MHUG - a novel 49-participant dataset of multimodal human...

09/19/2017
Exploring Human-like Attention Supervision in Visual Question Answering
Attention mechanisms have been widely applied in the Visual Question Ans...

08/12/2019
Multimodal Unified Attention Networks for Vision-and-Language Interactions
Learning an effective attention mechanism for multimodal data is importa...

02/25/2019
MUREL: Multimodal Relational Reasoning for Visual Question Answering
Multimodal attentional networks are currently state-of-the-art models fo...

11/02/2016
Dual Attention Networks for Multimodal Reasoning and Matching
We propose Dual Attention Networks (DANs) which jointly leverage visual ...

02/28/2023
VQA with Cascade of Self- and Co-Attention Blocks
The use of complex attention modules has improved the performance of the...

01/11/2022
On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering
In recent years, multi-modal transformers have shown significant progres...
