
VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering

We present VQA-MHUG - a novel 49-participant dataset of multimodal human gaze on both images and questions during visual question answering (VQA), collected using a high-speed eye tracker. We use our dataset to analyze the similarity between human and neural attentive strategies learned by five state-of-the-art VQA models: Modular Co-Attention Network (MCAN) with either grid or region features, Pythia, Bilinear Attention Network (BAN), and the Multimodal Factorized Bilinear Pooling Network (MFB). While prior work has focused on studying the image modality, our analyses show - for the first time - that, for all models, higher correlation with human attention on text is a significant predictor of VQA performance. This finding points to a potential for improving VQA performance and, at the same time, calls for further research on neural text attention mechanisms and their integration into architectures for vision and language tasks, in VQA and potentially beyond.





1 Introduction

Visual question answering (VQA) has gained popularity as a practically useful and challenging task at the intersection of natural language processing (NLP) and computer vision (CV) Antol.2015. The key challenge in VQA is to develop computational models that are able to reason over questions and images in order to generate answers that are well-grounded in both modalities P.Zhang.2015; agrawal-etal-2016-analyzing; Goyal.2017; Kafle.2019. Attention mechanisms, originally introduced in NLP for monomodal language tasks, have been successfully applied to multimodal tasks like VQA and established a new state of the art Correia.2021_survey; kim2018bilinear; Yu.2019_mcan.

These advances have, in turn, triggered research into understanding the reasons for these improvements. A body of work has studied similarities between neural and human attention qiuxia2020understanding; yun2013studying; Das.2016. Models seem to learn very different attention strategies, and similarity to human attention might only improve performance for specific model types Sood_2020_Interp. However, although VQA is an inherently multimodal task, all of these analyses have focused only on image attention. The most likely reason for this is that existing datasets only offer mono-modal attention on the image Das.2016; fosco2020much; Chen.2020. In addition, due to the challenges involved in recording human gaze data at scale, prior works have instead used mouse data as a proxy for attention Jiang.2015. However, mouse data was shown to over-estimate some image areas Tavakoli.2017; Das.2016 or to miss relevant background information altogether YusukeSugano.2016; Tavakoli.2017b. As of now, there is no publicly available dataset that offers human gaze data on both the images and questions. This severely impedes further progress in this emerging area of research.

Our work fills this gap by introducing VQA-MHUG – the first dataset of multimodal human gaze on both images and questions in VQA. To collect our dataset, we conducted a 49-participant eye tracking study. We used a commercial, high-speed eye tracker to record gaze data on images and corresponding questions of the VQAv2 validation set. VQA-MHUG contains 11,970 gaze samples for 3,990 question-image pairs, tagged and balanced by reasoning type and difficulty. We ensured a large overlap in question-image pairs with nine other VQA datasets to maximize the usefulness of VQA-MHUG for future multimodal studies on human and neural attention mechanisms. Using our dataset, we conduct detailed analyses of the similarity between human and neural attentive strategies, the latter of which we obtained from five top-performing models in the VQA challenges 2017-2020: Modular Co-Attention Network (MCAN) with grid or region features, Pythia, Bilinear Attention Network (BAN), and the Multimodal Factorized Bilinear Pooling Network (MFB). These analyses show, for the first time, that correlation with human attention on text is a significant predictor of accuracy for all the studied state-of-the-art VQA models. This suggests a potential for significant performance improvements in VQA by guiding models to "read the questions" more similarly to humans. In summary, our work contributes:

  1. VQA-MHUG, a novel 49-participant dataset of multimodal human gaze on both images and questions during visual question answering collected using a high-speed eye tracker.

  2. Detailed analysis of the similarity between human and neural attentive strategies indicating that human-like attention to text could yield significant performance improvements.

2 Related Work

Our work is related to previous work on 1) neural machine attention, 2) attention in VQA, and 3) comparison of neural and human attention.

Neural Machine Attention.

Inspired by the human visual system, neural machine attention allows neural networks to selectively focus on particular parts of the input, resulting in significant improvements in performance and interpretability Correia.2021_survey. Single-modal attention bahdanau2014neural and approaches that build on it, such as self-attention xu2015show; vaswani.2017 or stacked attention yang2016stacked; yang2016hierarchical; Zhang.2018_learn_count; anderson2018bottom, have been shown to be particularly helpful for sequence learning tasks in NLP and CV. Initially, attention mechanisms were often combined with recurrent and convolutional architectures to encode the input features bahdanau2014neural; yu2017multi; Tavakoli.2017; Kim.05.06.2016; Lu.2016; Jabri.2016; agrawal-etal-2016-analyzing. More recently, Transformer-based architectures have been introduced that rely solely on attention vaswani.2017; Yu.2019_mcan; Khan.27.10.2020. Large-scale, pre-trained language models are a key application of Transformers that enabled their current performance lead in both NLP and multimodal vision-language tasks devlin2018bert; yang2019xlnet; Yu.2019_mcan; lu2019vilbert.

Attention in VQA.

Increased interest in capturing multimodal relationships with attention mechanisms has put the focus on VQA as a benchmark task Malinowski_Multi_2014; malinowski2015ask; Lu.2016; yu2017multi; nguyen2018improved; yang2019co; li2019beyond. In fact, attention mechanisms have been extensively explored in VQA and have repeatedly dominated the important VQAv2 challenge (anderson2018bottom; Yu.2019_mcan; Jiang.2020_grid_feat). Although attention-based models have achieved remarkable success, it often remains unclear how and why different attention mechanisms actually work jain2019attention; serrano2019attention.

Comparing Neural and Human Attention.

Several prior works have proposed datasets of human attention on images to study the differences between neural and human attention in VQA Das.2016; fosco2020much; Chen.2020. In particular, free-viewing and task-specific mouse tracking from SALICON Jiang.2015 and VQA-HAT Das.2016, as well as free-viewing and task-specific gaze data from SBU Gaze Yun.2015 and AiR-D Chen.2020 have been compared to neural attention. All of these works were limited to images only and found mouse tracking to overestimate relevant areas and miss scene context YusukeSugano.2016; Tavakoli.2017; Tavakoli.2017b; He.2019. Furthermore, while integrating human attention over the image showed performance improvements in VQA Park_2018_CVPR; Qiao.2018; Chen.2020, the influence of integrating human text attention remains unclear.

There is currently no multimodal dataset including real human gaze on VQA questions and images. This represents a major limitation for two lines of research: research aiming to better understand and improve neural attention mechanisms, and research focusing on integrating human attention to improve VQA performance.

3 The VQA-MHUG Dataset

We present Visual Question Answering with Multi-Modal Human Gaze (VQA-MHUG); the dataset is publicly available. To the best of our knowledge, this is the first resource containing multimodal human gaze data over a textual question and the corresponding image. Our corpus encompasses task-specific gaze on a subset of the benchmark dataset VQAv2 val goyal2017making. We specifically focused on question-image pairs that machines struggle with but humans answer easily (determined by high inter-annotator agreement and confidence in the VQAv2 annotations). We then balanced the selection by evenly picking questions based on a machine difficulty score and from different reasoning types. Thus, VQA-MHUG covers a wide range of challenging reasoning capabilities and overlaps with many VQAv2-related datasets (see Table LABEL:tab:mhug-dataset-overlap in Appendix LABEL:sec:VQA-MHUG_Overlap).

Reasoning Types.

VQAv2 groups question-image pairs based on question words: what, who, how, when, and where. Instead, we binned our pairs by the reasoning capabilities required to answer them. We incorporated the categories proposed by Kafle.2017 for their task directed image understanding challenge (TDIUC) and extended them with an additional category, reading, for questions that are answered by reading text on the images. This resulted in 12 reasoning types that align better with commonly-diagnosed error cases (see Appendix LABEL:sec:bins for details on the reasoning type tagging). We binned VQAv2 val pairs accordingly by training an LSTM-based classifier on 1.6M TDIUC and 145K VQAv2 train+val samples, which we labelled using regular expressions. The classifier predicted the reasoning type for a given question-answer pair. The final model achieved 99.67% accuracy on a 20% held-out test set.
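The regular-expression labelling of the classifier's training data could be sketched as follows. The patterns and category names below are illustrative stand-ins, not the expressions actually used for the 12 reasoning types:

```python
import re

# Hypothetical patterns for a few reasoning types; the real expressions
# used to label the TDIUC/VQAv2 samples are not reproduced here.
REASONING_PATTERNS = [
    ("counting", re.compile(r"^how many\b", re.IGNORECASE)),
    ("color", re.compile(r"\bwhat color\b", re.IGNORECASE)),
    ("reading", re.compile(r"\b(say|written|sign)\b", re.IGNORECASE)),
]

def label_reasoning_type(question: str) -> str:
    """Return the first matching reasoning type, or a fallback bin."""
    for reasoning_type, pattern in REASONING_PATTERNS:
        if pattern.search(question):
            return reasoning_type
    return "other"
```

Such weak labels are cheap to produce at the scale of 1.6M samples, which is what makes training the LSTM classifier feasible.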

Machine Difficulty Score.

To assess how difficult a question-image pair is for a machine to answer, we ran two popular VQA models – MFB yu2017multi for multimodal fusion and MCAN Yu.2019_mcan for Transformer attention – inspired by Sood_2020_Interp. A difficult question results in low answer accuracy, particularly after rephrasing or asking further control questions. To test this, we evaluated on four datasets and averaged their corresponding normalized metrics: (1) VQAv2 accuracy, (2) VQA-CP accuracy on reduced bias agrawal2018don, (3) VQA-Introspect's consistency with respect to visual perception selvaraju2020squinting and (4) VQA-Rephrasings' robustness against linguistic variations shah2019cycle (see Appendix LABEL:sec:diffscores).
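The averaging step can be sketched as below; the metric key names are our own, and we assume each metric has already been normalized to [0, 1], so a low average indicates a pair the models find difficult:

```python
def machine_difficulty_score(metrics: dict) -> float:
    """Average the four normalized evaluation metrics for one
    question-image pair; a LOW value marks a difficult pair."""
    keys = [
        "vqa_v2_accuracy",          # (1) VQAv2 accuracy
        "vqa_cp_accuracy",          # (2) VQA-CP accuracy on reduced bias
        "introspect_consistency",   # (3) VQA-Introspect consistency
        "rephrasings_robustness",   # (4) VQA-Rephrasings robustness
    ]
    return sum(metrics[k] for k in keys) / len(keys)
```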

Participants and Experimental Setup.

We recruited 49 participants at the local university (18 identified as female and 31 as male) with normal or corrected-to-normal vision, aged between 19 and 35 years, and compensated them for their participation. The university ethics committee approved our study. All participants had an English level of C1 or above (8 were native speakers). After providing their consent, we collected basic demographic information for each participant; the anonymized data is available with the dataset.

Questions and images were presented one after the other on a 24.5" monitor with a resolution of 1920x1080 px. They were centered on a white background and scaled or line-wrapped to fit 26.2x11.5° of visual angle in the center of the screen. For the questions, we used a monospace font of size 0.6° and a line spacing such that the bounding boxes around each word covered 1.8° vertically. Binocular gaze data was collected with an EyeLink 1000 Plus remote eye tracker at 2 kHz, with an average measured tracking error of 0.62° (see Appendix LABEL:sec:experimental_setup).

Participants had unlimited viewing time but were instructed to move on as soon as they understood the question, gave an answer, or decided to skip. They completed a set of practice recordings to familiarize themselves with the study procedure, so the task was known to participants and both the question reading and the subsequent image viewing were conditioned on VQA. They then completed three blocks of 110 recordings in randomized order, with 5-minute breaks in between.

Dataset Statistics.

VQA-MHUG contains gaze on 3,990 stimuli from VQAv2 val. For each stimulus, we provide three recordings from different participants over text and image, their corresponding answer, and whether they answered the question correctly (as compared to the VQAv2 annotations). For 3,177 stimuli (79.6%), the majority of participant answers appear in the VQAv2 annotations.

Human Attention Maps.

To generate human attention maps, we used the fixation detection algorithm of the EyeLink software with default parameters. We always picked the eye with the lower validation error to prioritize accuracy Hooge.2019 and represented fixations by Gaussian kernels. For our experiments, we assumed that the majority of gaze is valid and averaged the three recordings per stimulus, yielding a single attention map.
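The fixations-to-map step can be sketched as follows. This is a minimal illustration, assuming duration-weighted fixation impulses blurred by a Gaussian; the kernel width `sigma` here is an arbitrary placeholder value, not the one used for the dataset:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def attention_map(fixations, shape=(1080, 1920), sigma=35.0):
    """Build a normalized spatial attention map from (x, y, duration)
    fixations by placing duration-weighted impulses and smoothing them
    with a Gaussian kernel (sigma in pixels, illustrative value)."""
    heat = np.zeros(shape, dtype=np.float64)
    for x, y, dur in fixations:
        if 0 <= int(y) < shape[0] and 0 <= int(x) < shape[1]:
            heat[int(y), int(x)] += dur
    heat = gaussian_filter(heat, sigma=sigma)
    total = heat.sum()
    return heat / total if total > 0 else heat

def average_maps(maps):
    # Average the three per-participant maps for one stimulus into a
    # single ground-truth attention map.
    return np.mean(maps, axis=0)
```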

Dataset Validation.

To validate that the attention maps indeed contain the relevant image regions, we masked 300 stimuli with our recorded VQA-MHUG maps (see Figure 0(b)) and showed these masked stimuli to two additional participants. Their answer accuracy was comparable to that of the participants who saw the full images (62.43% vs. 63.87% in the main study). Therefore, our VQA-MHUG maps contain sufficient image areas to answer the questions while masking distracting objects, as illustrated in Figure 1.

(a) collection study
(b) validation study
Figure 1: Example images for the question "How ripe are the bananas?". Validation images (b) were masked using the attention maps from our VQA-MHUG dataset.

Comparison to Related Datasets.

We further measured the center bias and compared VQA-MHUG to related human attention datasets Jiang.2015; Das.2016; Chen.2020 on their overlapping samples. All datasets use mouse tracking as a proxy to collect human attention, except for the eye-tracking dataset AiR-D Chen.2020 which is similar to our recording paradigm, yet has no overlap with VQAv2. Therefore, we showed participants 195 additional stimuli from the AiR-D dataset for comparison. Table 1 shows the mean rank correlation of VQA-MHUG with a synthetic center fixation, inter-participant, and the other datasets. The high correlation between VQA-MHUG and AiR-D indicates that our data is of comparable quality. Our center bias is smaller compared to AiR-D but, as expected from human eye behaviour Tatler2007TheCF, larger than in the mouse tracking proxies SALICON and VQA-HAT. We observe that both mouse tracking datasets have significantly lower correlation with VQA-MHUG than the eye-tracking AiR-D corpus.
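The synthetic center-fixation baseline used in this comparison can be sketched as a single Gaussian placed at the stimulus center; the map size and spread below are illustrative assumptions:

```python
import numpy as np

def center_fixation_map(shape=(14, 14), sigma=3.0):
    """Synthetic center-bias baseline: one Gaussian at the map center,
    normalized to a probability distribution."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    cy, cx = (shape[0] - 1) / 2, (shape[1] - 1) / 2
    g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    return g / g.sum()
```

Rank-correlating recorded maps against this baseline quantifies how much of a dataset's attention is explained by center bias alone.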

Table 1: Spearman's rank correlation of VQA-MHUG with itself (inter-participant), with related datasets (G: gaze, M: mouse-tracking), and with a synthetic center fixation – mean over all samples in the intersection of the datasets and three VQA-MHUG participants. The standard deviation is the mean error over participants. Only VQA-HAT and VQA-MHUG provide multiple attention maps per sample, allowing us to calculate the standard deviation when comparing to the synthetic center fixation.

4 Comparison of Human and Machine Attention

The collected data enabled us to analyze whether models achieve a higher accuracy on VQAv2 val the more their attentive behavior over the text and image correlates with human ground-truth attention. Hence, we investigated the attention weights over text and image features of different SOTA VQA models.

4.1 VQA Models

We selected five top performing VQA models of the VQA challenges 2017 to 2020:

  • MFB yu2017multi (Runner-up 2017);

  • BAN kim2018bilinear (Runner-up 2018);

  • Pythia v0.1 jiang2018pythia (Winner 2018);

  • MCAN-R with region image features Yu.2019_mcan (Winner 2019);

  • MCAN-G with grid image features Jiang.2020_grid_feat (Winner 2020).

Instead of using the text and image features directly for classification, these models re-weight the features using linear, bilinear and Transformer vaswani.2017 (co-)attention mechanisms, whose attention maps we extracted and compared to human ground-truths from VQA-MHUG.

Pythia and MFB use co-attention: they first use a projected attention map to re-weight text features, then fuse them with the image features using linear (Pythia) and bilinear (MFB) fusion and subsequently re-weight the image features using an attention map projected from the fused features. In this way, the text attention influences the image attention. BAN avoids separating the attention into text and image streams and reduces both input streams simultaneously with a bilinear attention map projected from the fused features. Finally, MCAN as a Transformer model stacks co-attention modules with multi-headed scaled dot-product attention for each modality. After the last Transformer layer in both the text and image stream, another attention map is used to project the feature matrix into a single feature vector.

4.2 Extracting Model Attention

We used an official implementation of the Pythia v0.1 architecture and the OpenVQA implementations yu2019openvqa for MFB, BAN, and MCAN. We re-implemented the grid image feature loader for MCAN-G, since it is not available in OpenVQA.

Following previous work Sood_2020_Interp, we trained each network architecture twelve times with random seeds on the VQAv2 training set and then chose the top nine models based on the validation accuracy.

For models based on region image features, we used the extracted features provided by Anderson et al. anderson2018bottom, while we trained MCAN-G with ResNeXt xie2017aggregated grid features as provided by the authors Jiang.2020_grid_feat.

For MFB and Pythia, we extracted the two projected attention maps that re-weight the text and image features; for BAN, we extracted the single bilinear attention map. To obtain separate attention maps for text and image from BAN's bilinear attention map, we marginalized over each dimension, as suggested by the authors kim2018bilinear. MFB, BAN, and Pythia generate multiple such attention maps, called "glimpses", by using multiple projections. We averaged the glimpses after extraction, yielding a single attention map for each modality. Since it is unclear how the Transformer layer weights relate to the original input features, we instead extracted the attention weights of the final projection layer in the text and image streams for MCAN-R and MCAN-G.
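The marginalization and glimpse averaging can be sketched as below; the tensor layout (glimpses x text tokens x image regions) and the final renormalization are our assumptions:

```python
import numpy as np

def split_bilinear_attention(bilinear_att):
    """Given a BAN-style bilinear attention tensor of shape
    (glimpses, n_text_tokens, n_image_regions), marginalize over each
    dimension to obtain separate text and image attention vectors,
    then average over glimpses and renormalize."""
    text_att = bilinear_att.sum(axis=2)    # sum out image regions
    image_att = bilinear_att.sum(axis=1)   # sum out text tokens
    text_att = text_att.mean(axis=0)       # average the glimpses
    image_att = image_att.mean(axis=0)
    return text_att / text_att.sum(), image_att / image_att.sum()
```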

The extracted image attention maps contain one weight per feature. To compare them with the human spatial attention maps collected in VQA-MHUG, we mapped the features back to their source regions in the image. For region-based features, we assigned the attention weights to the corresponding bounding boxes, normalized by region size. Analogously, for grid-based features, we mapped the attention weights to their corresponding grid cells. The text attention vector was directly mapped back to the question token sequence. We excluded 74 samples due to tokenization differences between models.
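The two projection steps can be sketched as follows; the output resolution and the exact normalization by box area are illustrative assumptions:

```python
import numpy as np

def region_attention_to_map(weights, boxes, shape=(448, 448)):
    """Spread each region weight over its bounding box, normalized by
    box area. boxes are (x0, y0, x1, y1) in pixel coordinates."""
    att = np.zeros(shape, dtype=np.float64)
    for w, (x0, y0, x1, y1) in zip(weights, boxes):
        area = max((x1 - x0) * (y1 - y0), 1)
        att[y0:y1, x0:x1] += w / area
    return att

def grid_attention_to_map(weights, grid=(14, 14), shape=(448, 448)):
    """Tile each grid cell's weight over its corresponding image patch."""
    cell_h, cell_w = shape[0] // grid[0], shape[1] // grid[1]
    return np.kron(np.asarray(weights).reshape(grid),
                   np.ones((cell_h, cell_w)))
```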

(a) comparison to other attention datasets
(b) model attention - text, image, and inter-modal comparison
Figure 2: Attention maps visualized across question types. Image attention seems mostly plausible across models. Previous datasets lack attention on the questions; our data reveals that text attention is not always human-like, nor even plausible. The mouse-tracking datasets SALICON and VQA-HAT seem to over-estimate the relevant areas.

4.3 Performance Metrics

We compared the multimodal attention extracted from the five models to our human data in VQA-MHUG using three approaches: Spearman's rank correlation to compare the importance ranking of image regions and words, Jensen-Shannon divergence to measure the distance between the human and neural attention distributions, and a regression model to study the suitability of text and image correlation as predictors of per-document model accuracy.

Spearman’s rank correlation and Jensen-Shannon divergence.

Similar to prior work, we downsampled all attention maps to 14x14 matrices and calculated the mean Spearman's rank correlation Das.2016 and Jensen-Shannon divergence (JSD) Sood_2020_Interp; Sood_2020_Improve between the neural attention and the corresponding human attention. We computed both metrics for both the image and text modalities. We also evaluated the corresponding accuracy scores on the VQAv2 validation set VQAEval.
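Both metrics can be sketched on a pair of (already downsampled) maps as follows; note that SciPy's `jensenshannon` returns the distance, so we square it to obtain the divergence:

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import jensenshannon

def compare_maps(human_map, model_map):
    """Return Spearman's rank correlation and the Jensen-Shannon
    divergence between two attention maps (e.g. 14x14 matrices)."""
    h = np.asarray(human_map, dtype=np.float64).ravel()
    m = np.asarray(model_map, dtype=np.float64).ravel()
    rho, _ = spearmanr(h, m)
    jsd = jensenshannon(h / h.sum(), m / m.sum()) ** 2
    return rho, jsd
```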

Ordinal Logistic Regression.

Averaging correlation over the whole dataset is too coarse and obscures the impact that similarity to human attention has on accuracy. Additionally, rank correlation does not allow analyzing the effect of two independent variables on a dependent variable Bewick2003StatisticsR7, e.g. the effect of image and text attention correlation on accuracy. To account for this, and to study on a per-document basis which modality factors influence the likelihood of a model predicting the answer correctly, we performed an ordinal logistic regression. The official VQAv2 evaluation score VQAEval per document is based on agreement with ten human annotator answers, where each match increases the score by 0.3 (capped at 1.0, i.e. at 4 agreed answers). Since our response variable (accuracy score) is not necessarily ordered equidistant, we binned the accuracy scores for each document into a likelihood scale (accuracy correctness).
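Concretely, the score-to-bin mapping can be sketched as follows; the five ordinal bins below are our assumption:

```python
def accuracy_bin(n_matching_answers: int) -> int:
    """Map the number of annotator answers matching the model's answer
    to the VQAv2 accuracy score (each match adds 0.3, capped at 1.0)
    and then to an ordinal correctness bin (illustrative boundaries)."""
    score = min(0.3 * n_matching_answers, 1.0)
    # 0 = wrong, 1-3 = partially correct, 4 = fully correct
    return {0.0: 0, 0.3: 1, 0.6: 2, 0.9: 3, 1.0: 4}[round(score, 1)]
```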

The model predicts the likelihood of accuracy correctness for each document from three predictors: the text correlation (ρ_text), the image correlation (ρ_image), and the interaction between the text and image correlation (ρ_text × ρ_image). We deem the latter the inter-modal correlation predictor, as it allows us to test whether the interaction between the text and image correlations impacts accuracy. Given that the dependent variable is ordinal, we opted for ordered logistic regression to predict each accuracy bin.

5 Results

5.1 Human and Neural Attention Relationship – Averaged Over Documents

Table LABEL:tab:inter-model-avg shows the overall accuracy scores of the five models on the VQAv2 validation set when trained only on the training partition. The models improved over the challenge years – MCAN-G is the current SOTA Jiang.2020_grid_feat. For each model and modality, we report the Spearman's rank correlation and JSD scores averaged over the entire VQA-MHUG corpus (cf. Section 4.3). All figures were averaged over nine model runs, and the standard deviation is given over those instances. Given that one cannot average p-values, we used a paired t-test to check whether the differences in correlation and JSD per document and between models were statistically significant (see Appendix LABEL:sec:test_signf_between_models).

Image attention.

Models using region features, i.e. all but MCAN-G, are more correlated with human visual attention on images. MCAN-R achieves the highest correlation and MFB the lowest among the region-feature models, and the general trend shows that models with higher correlation had higher overall validation accuracy. Although MCAN-G achieves the highest accuracy, it had the lowest correlation with human image attention. For all model types, the difference between image correlation scores is significant, except between Pythia and BAN (see Appendix LABEL:sec:test_signf_between_models). With respect to the JSD, we observed similar patterns, except for the Pythia model, which was more dissimilar to human attention (had a higher overall JSD) compared to BAN. For all model types, the difference between image JSD scores was statistically significant (see Appendix LABEL:sec:test_signf_between_models).

Text attention.

Both the correlation and JSD scores indicate that Pythia is the most similar to human text attention, followed by MFB. Models with higher overall accuracy do not show high similarity to human attention over text on either the JSD or correlation metric. For both metrics, the difference in text attention between every model pairing is statistically significant, except for the JSD scores between pairings of BAN, MCAN-R, and MCAN-G (see Appendix LABEL:sec:test_signf_between_models).