
VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering

We present VQA-MHUG - a novel 49-participant dataset of multimodal human gaze on both images and questions during visual question answering (VQA), collected using a high-speed eye tracker. We use our dataset to analyze the similarity between human and neural attentive strategies learned by five state-of-the-art VQA models: Modular Co-Attention Network (MCAN) with either grid or region features, Pythia, Bilinear Attention Network (BAN), and the Multimodal Factorized Bilinear Pooling Network (MFB). While prior work has focused on studying the image modality, our analyses show - for the first time - that, for all models, higher correlation with human attention on text is a significant predictor of VQA performance. This finding points to a potential for improving VQA performance and, at the same time, calls for further research on neural text attention mechanisms and their integration into architectures for vision and language tasks, in VQA and potentially beyond.





1 Introduction

Visual question answering (VQA) has gained popularity as a practically useful and challenging task at the intersection of natural language processing (NLP) and computer vision (CV) Antol.2015. The key challenge in VQA is to develop computational models that are able to reason over questions and images in order to generate answers that are well-grounded in both modalities P.Zhang.2015; agrawal-etal-2016-analyzing; Goyal.2017; Kafle.2019. Attention mechanisms, originally introduced in NLP for monomodal language tasks, have been successfully applied to multimodal tasks like VQA and established a new state of the art Correia.2021_survey; kim2018bilinear; Yu.2019_mcan.

These advances have, in turn, triggered research into understanding the reasons for these improvements. A body of work has studied similarities between neural and human attention qiuxia2020understanding; yun2013studying; Das.2016. Models seem to learn very different attention strategies, and similarity to human attention might only improve performance for specific model types Sood_2020_Interp. However, although VQA is an inherently multimodal task, all of these analyses have focused only on image attention. The most likely reason for this is that existing datasets only offer mono-modal attention on the image Das.2016; fosco2020much; Chen.2020. In addition, due to the challenges involved in recording human gaze data at scale, prior works have instead used mouse data as a proxy for attention Jiang.2015. However, mouse data was shown to over-estimate some image areas Tavakoli.2017; Das.2016 or to miss relevant background information altogether YusukeSugano.2016; Tavakoli.2017b. As of now, there is no publicly available dataset that offers human gaze data on both the images and questions. This severely impedes further progress in this emerging area of research.

Our work fills this gap by introducing VQA-MHUG – the first dataset of multimodal human gaze on both images and questions in VQA. To collect our dataset, we conducted a 49-participant eye tracking study. We used a commercial, high-speed eye tracker to record gaze data on images and corresponding questions of the VQAv2 validation set. VQA-MHUG contains 11,970 gaze samples for 3,990 question-image pairs, tagged and balanced by reasoning type and difficulty. We ensured a large overlap in question-image pairs with nine other VQA datasets to maximize the usefulness of VQA-MHUG for future multimodal studies on human and neural attention mechanisms. Using our dataset, we conduct detailed analyses of the similarity between human and neural attentive strategies, the latter of which we obtained from five top-performing models in the VQA challenges 2017-2020: Modular Co-Attention Network (MCAN) with grid or region features, Pythia, Bilinear Attention Network (BAN), and the Multimodal Factorized Bilinear Pooling Network (MFB). These analyses show, for the first time, that correlation with human attention on text is a significant predictor of accuracy for all the studied state-of-the-art VQA models. This suggests a potential for significant performance improvements in VQA by guiding models to "read the questions" more similarly to humans. In summary, our work contributes:

  1. VQA-MHUG, a novel 49-participant dataset of multimodal human gaze on both images and questions during visual question answering collected using a high-speed eye tracker.

  2. Detailed analysis of the similarity between human and neural attentive strategies indicating that human-like attention to text could yield significant performance improvements.

2 Related Work

Our work is related to previous work on 1) neural machine attention, 2) attention in VQA, and 3) comparison of neural and human attention.

Neural Machine Attention.

Inspired by the human visual system, neural machine attention allows neural networks to selectively focus on particular parts of the input, resulting in significant improvements in performance and interpretability Correia.2021_survey. Single-modal attention bahdanau2014neural and approaches that build on it, such as self-attention xu2015show; vaswani.2017 or stacked attention yang2016stacked; yang2016hierarchical; Zhang.2018_learn_count; anderson2018bottom, have been shown to be particularly helpful for sequence learning tasks in NLP and CV. Initially, attention mechanisms were often combined with recurrent and convolutional architectures to encode the input features bahdanau2014neural; yu2017multi; Tavakoli.2017; Kim.05.06.2016; Lu.2016; Jabri.2016; agrawal-etal-2016-analyzing. More recently, Transformer-based architectures have been introduced that rely solely on attention vaswani.2017; Yu.2019_mcan; Khan.27.10.2020. Large-scale, pre-trained language models are a key application of Transformers that enabled their current performance lead in both NLP and multimodal vision-language tasks devlin2018bert; yang2019xlnet; Yu.2019_mcan; lu2019vilbert.

Attention in VQA.

Increased interest in capturing multimodal relationships with attention mechanisms has put the focus on VQA as a benchmark task Malinowski_Multi_2014; malinowski2015ask; Lu.2016; yu2017multi; nguyen2018improved; yang2019co; li2019beyond. In fact, attention mechanisms have been extensively explored in VQA and have repeatedly dominated the important VQAv2 challenge (anderson2018bottom; Yu.2019_mcan; Jiang.2020_grid_feat). Although attention-based models have achieved remarkable success, it often remains unclear how and why different attention mechanisms actually work jain2019attention; serrano2019attention.

Comparing Neural and Human Attention.

Several prior works have proposed datasets of human attention on images to study the differences between neural and human attention in VQA Das.2016; fosco2020much; Chen.2020. In particular, free-viewing and task-specific mouse tracking from SALICON Jiang.2015 and VQA-HAT Das.2016, as well as free-viewing and task-specific gaze data from SBU Gaze Yun.2015 and AiR-D Chen.2020 have been compared to neural attention. All of these works were limited to images only and found mouse tracking to overestimate relevant areas and miss scene context YusukeSugano.2016; Tavakoli.2017; Tavakoli.2017b; He.2019. Furthermore, while integrating human attention over the image showed performance improvements in VQA Park_2018_CVPR; Qiao.2018; Chen.2020, the influence of integrating human text attention remains unclear.

There is currently no multimodal dataset including real human gaze on VQA questions and images. This represents a major limitation for two lines of research: research aiming to better understand and improve neural attention mechanisms, and research focusing on integrating human attention to improve VQA performance.

3 The VQA-MHUG Dataset

We present Visual Question Answering with Multi-Modal Human Gaze (VQA-MHUG); the dataset is publicly available. To the best of our knowledge, this is the first resource containing multimodal human gaze data over a textual question and the corresponding image. Our corpus encompasses task-specific gaze on a subset of the benchmark dataset VQAv2 val goyal2017making. We specifically focused on question-image pairs that machines struggle with but humans answer easily (determined by high inter-annotator agreement and confidence in the VQAv2 annotations). We then balanced the selection by evenly picking questions based on a machine difficulty score and from different reasoning types. Thus, VQA-MHUG covers a wide range of challenging reasoning capabilities and overlaps with many VQAv2-related datasets (see Table LABEL:tab:mhug-dataset-overlap in Appendix LABEL:sec:VQA-MHUG_Overlap).

Reasoning Types.

VQAv2 groups question-image pairs based on question words: what, who, how, when, and where. Instead, we binned our pairs by the reasoning capabilities required to answer them. We incorporated the categories proposed by Kafle.2017 for their task directed image understanding challenge (TDIUC) and extended them with an additional category, reading, for questions that are answered by reading text on the images. This resulted in 12 reasoning types that align better with commonly-diagnosed error cases (see Appendix LABEL:sec:bins for details on the reasoning type tagging). We binned VQAv2 val pairs accordingly by training an LSTM-based classifier on 1.6M TDIUC and 145K VQAv2 train+val samples, which we labelled using regular expressions. The classifier predicted the reasoning type for a given question-answer pair. The final model achieved 99.67% accuracy on a 20% held-out test set.
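The regular-expression labelling of the classifier's training data could be sketched as follows. The patterns and category names below are illustrative stand-ins, not the expressions actually used for the 12 reasoning types:

```python
import re

# Hypothetical patterns for a few reasoning types; the real expressions
# used to label the TDIUC/VQAv2 samples are not reproduced here.
REASONING_PATTERNS = [
    ("counting", re.compile(r"^how many\b", re.IGNORECASE)),
    ("color", re.compile(r"\bwhat color\b", re.IGNORECASE)),
    ("reading", re.compile(r"\b(say|written|sign)\b", re.IGNORECASE)),
]

def label_reasoning_type(question: str) -> str:
    """Return the first matching reasoning type, or a fallback bin."""
    for reasoning_type, pattern in REASONING_PATTERNS:
        if pattern.search(question):
            return reasoning_type
    return "other"
```

Such weak labels are cheap to produce at the scale of 1.6M samples, which is what makes training the LSTM classifier feasible.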

Machine Difficulty Score.

To assess how difficult a question-image pair is for a machine to answer, we ran two popular VQA models – MFB yu2017multi for multimodal fusion and MCAN Yu.2019_mcan for Transformer attention – inspired by Sood_2020_Interp. A difficult question results in low answer accuracy, particularly after rephrasing or asking further control questions. To test this, we evaluated on four datasets and averaged their corresponding normalized metrics: (1) VQAv2 accuracy, (2) VQA-CP accuracy on reduced bias agrawal2018don, (3) VQA-Introspect's consistency with respect to visual perception selvaraju2020squinting and (4) VQA-Rephrasings' robustness against linguistic variations shah2019cycle (see Appendix LABEL:sec:diffscores).
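The averaging step can be sketched as below; the metric key names are our own, and we assume each metric has already been normalized to [0, 1], so a low average indicates a pair the models find difficult:

```python
def machine_difficulty_score(metrics: dict) -> float:
    """Average the four normalized evaluation metrics for one
    question-image pair; a LOW value marks a difficult pair."""
    keys = [
        "vqa_v2_accuracy",          # (1) VQAv2 accuracy
        "vqa_cp_accuracy",          # (2) VQA-CP accuracy on reduced bias
        "introspect_consistency",   # (3) VQA-Introspect consistency
        "rephrasings_robustness",   # (4) VQA-Rephrasings robustness
    ]
    return sum(metrics[k] for k in keys) / len(keys)
```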

Participants and Experimental Setup.

We recruited 49 participants at the local university (18 identified as female and 31 as male) with normal or corrected-to-normal vision, aged between 19 and 35 years, and compensated them for their participation. The university ethics committee approved our study. All participants had an English level of C1 or above (8 were native speakers). After providing their consent, we collected basic demographic information for each participant; the anonymized data is available with the dataset.

Questions and images were presented one after the other on a 24.5" monitor with a resolution of 1920x1080 px. They were centered on a white background and scaled or line-wrapped to fit 26.2x11.5° of visual angle in the center of the screen. For the questions, we used a monospace font of size 0.6° and a line spacing such that the bounding boxes around each word covered 1.8° vertically. Binocular gaze data was collected with an EyeLink 1000 Plus remote eye tracker at 2 kHz, with an average measured tracking error of 0.62° (see Appendix LABEL:sec:experimental_setup).

Participants had unlimited viewing time but were instructed to move on as soon as they understood the question, gave an answer, or decided to skip. They completed a set of practice recordings to familiarize themselves with the study procedure, so the task was known to participants and both the question reading and the subsequent image viewing were conditioned on VQA. They then completed three blocks of 110 recordings in randomized order, with 5-minute breaks in between.

Dataset Statistics.

VQA-MHUG contains gaze on 3,990 stimuli from VQAv2 val. For each stimulus, we provide three recordings from different participants over text and image, their corresponding answer, and whether they answered the question correctly (as compared to the VQAv2 annotations). For 3,177 stimuli (79.6%), the majority of participant answers appear in the VQAv2 annotations.

Human Attention Maps.

To generate human attention maps, we used the fixation detection algorithm of the EyeLink software with default parameters. We always picked the eye with the lower validation error to prioritize accuracy Hooge.2019 and represented fixations by Gaussian kernels. For our experiments, we assumed that the majority of gaze is valid and averaged the three recordings per stimulus, yielding a single attention map.
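The fixations-to-map step can be sketched as follows. This is a minimal illustration, assuming duration-weighted fixation impulses blurred by a Gaussian; the kernel width `sigma` here is an arbitrary placeholder value, not the one used for the dataset:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def attention_map(fixations, shape=(1080, 1920), sigma=35.0):
    """Build a normalized spatial attention map from (x, y, duration)
    fixations by placing duration-weighted impulses and smoothing them
    with a Gaussian kernel (sigma in pixels, illustrative value)."""
    heat = np.zeros(shape, dtype=np.float64)
    for x, y, dur in fixations:
        if 0 <= int(y) < shape[0] and 0 <= int(x) < shape[1]:
            heat[int(y), int(x)] += dur
    heat = gaussian_filter(heat, sigma=sigma)
    total = heat.sum()
    return heat / total if total > 0 else heat

def average_maps(maps):
    # Average the three per-participant maps for one stimulus into a
    # single ground-truth attention map.
    return np.mean(maps, axis=0)
```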

Dataset Validation.

To validate that the attention maps indeed contain the relevant image regions, we masked 300 stimuli with our recorded VQA-MHUG maps (see Figure 0(b)) and showed these masked stimuli to two additional participants. Their answer accuracy was comparable to that of the participants who saw the full images (62.43% vs. 63.87% in the main study). Therefore, our VQA-MHUG maps contain sufficient image areas to answer the questions while masking distracting objects, as illustrated in Figure 1.

(a) collection study
(b) validation study
Figure 1: Example images for the question "How ripe are the bananas?". Validation images (b) were masked using the attention maps from our VQA-MHUG dataset.

Comparison to Related Datasets.

We further measured the center bias and compared VQA-MHUG to related human attention datasets Jiang.2015; Das.2016; Chen.2020 on their overlapping samples. All datasets use mouse tracking as a proxy to collect human attention, except for the eye-tracking dataset AiR-D Chen.2020 which is similar to our recording paradigm, yet has no overlap with VQAv2. Therefore, we showed participants 195 additional stimuli from the AiR-D dataset for comparison. Table 1 shows the mean rank correlation of VQA-MHUG with a synthetic center fixation, inter-participant, and the other datasets. The high correlation between VQA-MHUG and AiR-D indicates that our data is of comparable quality. Our center bias is smaller compared to AiR-D but, as expected from human eye behaviour Tatler2007TheCF, larger than in the mouse tracking proxies SALICON and VQA-HAT. We observe that both mouse tracking datasets have significantly lower correlation with VQA-MHUG than the eye-tracking AiR-D corpus.
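The synthetic center-fixation baseline used in this comparison can be sketched as a single Gaussian placed at the stimulus center; the map size and spread below are illustrative assumptions:

```python
import numpy as np

def center_fixation_map(shape=(14, 14), sigma=3.0):
    """Synthetic center-bias baseline: one Gaussian at the map center,
    normalized to a probability distribution."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    cy, cx = (shape[0] - 1) / 2, (shape[1] - 1) / 2
    g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    return g / g.sum()
```

Rank-correlating recorded maps against this baseline quantifies how much of a dataset's attention is explained by center bias alone.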

Table 1: Spearman's rank correlation of VQA-MHUG with itself (inter-participant), with related datasets (G: gaze, M: mouse-tracking), and with a synthetic center fixation – mean over all samples in the intersection of the datasets and three VQA-MHUG participants. The standard deviation is the mean error over participants. Only VQA-HAT and VQA-MHUG provide multiple attention maps per sample, allowing us to calculate the standard deviation when comparing to the synthetic center fixation.

4 Comparison of Human and Machine Attention

The collected data enabled us to analyze whether models achieve a higher accuracy on VQAv2 val the more their attentive behavior over the text and image correlates with human ground-truth attention. Hence, we investigated the attention weights over text and image features of different SOTA VQA models.

4.1 VQA Models

We selected five top performing VQA models of the VQA challenges 2017 to 2020:

  • MFB yu2017multi (Runner-up 2017);

  • BAN kim2018bilinear (Runner-up 2018);

  • Pythia v0.1 jiang2018pythia (Winner 2018);

  • MCAN-R with region image features Yu.2019_mcan (Winner 2019);

  • MCAN-G with grid image features Jiang.2020_grid_feat (Winner 2020).

Instead of using the text and image features directly for classification, these models re-weight the features using linear, bilinear and Transformer vaswani.2017 (co-)attention mechanisms, whose attention maps we extracted and compared to human ground-truths from VQA-MHUG.

Pythia and MFB use co-attention: they first use a projected attention map to re-weight text features, then fuse them with the image features using linear (Pythia) and bilinear (MFB) fusion and subsequently re-weight the image features using an attention map projected from the fused features. In this way, the text attention influences the image attention. BAN avoids separating the attention into text and image streams and reduces both input streams simultaneously with a bilinear attention map projected from the fused features. Finally, MCAN as a Transformer model stacks co-attention modules with multi-headed scaled dot-product attention for each modality. After the last Transformer layer in both the text and image stream, another attention map is used to project the feature matrix into a single feature vector.

4.2 Extracting Model Attention

We used an official implementation of the Pythia v0.1 architecture and the OpenVQA implementations yu2019openvqa for MFB, BAN, and MCAN. We re-implemented the grid image feature loader for MCAN-G, since it is not available in OpenVQA.

Following previous work Sood_2020_Interp, we trained each network architecture twelve times with random seeds on the VQAv2 training set and then chose the top nine models based on the validation accuracy.

For models based on region image features, we used the extracted features provided by Anderson et al. anderson2018bottom, while we trained MCAN-G with ResNeXt xie2017aggregated grid features as provided by the authors Jiang.2020_grid_feat.

For MFB and Pythia, we extracted the two projected attention maps that re-weight the text and image features; for BAN, we extracted the single bilinear attention map. To obtain separate attention maps for text and image from BAN's bilinear attention map, we marginalized over each dimension, as suggested by the authors kim2018bilinear. MFB, BAN, and Pythia generate multiple such attention maps, called "glimpses", by using multiple projections. We averaged the glimpses after extraction, yielding a single attention map for each modality. Since it is unclear how the Transformer layer weights relate to the original input features, we instead extracted the attention weights of the final projection layer in the text and image streams for MCAN-R and MCAN-G.
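The marginalization and glimpse averaging can be sketched as below; the tensor layout (glimpses x text tokens x image regions) and the final renormalization are our assumptions:

```python
import numpy as np

def split_bilinear_attention(bilinear_att):
    """Given a BAN-style bilinear attention tensor of shape
    (glimpses, n_text_tokens, n_image_regions), marginalize over each
    dimension to obtain separate text and image attention vectors,
    then average over glimpses and renormalize."""
    text_att = bilinear_att.sum(axis=2)    # sum out image regions
    image_att = bilinear_att.sum(axis=1)   # sum out text tokens
    text_att = text_att.mean(axis=0)       # average the glimpses
    image_att = image_att.mean(axis=0)
    return text_att / text_att.sum(), image_att / image_att.sum()
```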

The extracted image attention maps contain one weight per feature. To compare them with the human spatial attention maps collected in VQA-MHUG, we mapped the features back to their source regions in the image. For region-based features, we assigned the attention weights to the corresponding bounding boxes, normalized by region size. Analogously, for grid-based features, we mapped the attention weights to their corresponding grid cells. The text attention vector was directly mapped back to the question token sequence. We excluded 74 samples due to tokenization differences between models.
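The two projection steps can be sketched as follows; the output resolution and the exact normalization by box area are illustrative assumptions:

```python
import numpy as np

def region_attention_to_map(weights, boxes, shape=(448, 448)):
    """Spread each region weight over its bounding box, normalized by
    box area. boxes are (x0, y0, x1, y1) in pixel coordinates."""
    att = np.zeros(shape, dtype=np.float64)
    for w, (x0, y0, x1, y1) in zip(weights, boxes):
        area = max((x1 - x0) * (y1 - y0), 1)
        att[y0:y1, x0:x1] += w / area
    return att

def grid_attention_to_map(weights, grid=(14, 14), shape=(448, 448)):
    """Tile each grid cell's weight over its corresponding image patch."""
    cell_h, cell_w = shape[0] // grid[0], shape[1] // grid[1]
    return np.kron(np.asarray(weights).reshape(grid),
                   np.ones((cell_h, cell_w)))
```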

(a) comparison to other attention datasets
(b) model attention - text, image, and inter-modal comparison
Figure 2: Attention maps visualized across question types. Image attention seems mostly plausible across models. Previous datasets lack attention on the questions; our data reveals that text attention is not always human-like, nor even plausible. The mouse-tracking datasets SALICON and VQA-HAT seem to over-estimate the relevant areas.

4.3 Performance Metrics

We compared the multimodal attention extracted from the five models to our human data in VQA-MHUG using three approaches: Spearman's rank correlation to compare the importance ranking of image regions and words, Jensen-Shannon divergence to measure the distance between the human and neural attention distributions, and a regression model to study the suitability of text and image correlation as predictors of per-document model accuracy.

Spearman’s rank correlation and Jensen-Shannon divergence.

Similar to prior work, we downsampled all attention maps to 14x14 matrices and calculated the mean Spearman's rank correlation Das.2016 and Jensen-Shannon divergence (JSD) Sood_2020_Interp; Sood_2020_Improve between the neural attention and the corresponding human attention. We computed both metrics for both the image and text modalities. We also evaluated the corresponding accuracy scores on the VQAv2 validation set VQAEval.
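Both metrics can be sketched on a pair of (already downsampled) maps as follows; note that SciPy's `jensenshannon` returns the distance, so we square it to obtain the divergence:

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import jensenshannon

def compare_maps(human_map, model_map):
    """Return Spearman's rank correlation and the Jensen-Shannon
    divergence between two attention maps (e.g. 14x14 matrices)."""
    h = np.asarray(human_map, dtype=np.float64).ravel()
    m = np.asarray(model_map, dtype=np.float64).ravel()
    rho, _ = spearmanr(h, m)
    jsd = jensenshannon(h / h.sum(), m / m.sum()) ** 2
    return rho, jsd
```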

Ordinal Logistic Regression.

Averaging correlation over the whole dataset is too coarse and obscures the impact that similarity to human attention has on accuracy. Additionally, rank correlation does not allow analyzing the effect of two independent variables on a dependent variable Bewick2003StatisticsR7, e.g. the effect of image and text attention correlation on accuracy. To account for this, and to study on a per-document basis which modality factors influence the likelihood of a model predicting the answer correctly, we performed an ordinal logistic regression. The official VQAv2 evaluation score VQAEval per document is based on agreement with ten human annotator answers, where each match increases the score by 0.3 (capped at 1.0, i.e. at 4 agreed answers). Since our response variable (accuracy score) is not necessarily ordered equidistant, we binned the accuracy scores for each document into a likelihood scale (accuracy correctness).
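Concretely, the score-to-bin mapping can be sketched as follows; the five ordinal bins below are our assumption:

```python
def accuracy_bin(n_matching_answers: int) -> int:
    """Map the number of annotator answers matching the model's answer
    to the VQAv2 accuracy score (each match adds 0.3, capped at 1.0)
    and then to an ordinal correctness bin (illustrative boundaries)."""
    score = min(0.3 * n_matching_answers, 1.0)
    # 0 = wrong, 1-3 = partially correct, 4 = fully correct
    return {0.0: 0, 0.3: 1, 0.6: 2, 0.9: 3, 1.0: 4}[round(score, 1)]
```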

The model predicts the likelihood of accuracy correctness for each document from three predictors: the text correlation (ρ_text), the image correlation (ρ_image), and the interaction between the text and image correlation (ρ_text × ρ_image). We deem the latter the inter-modal correlation predictor, as it allows us to test whether the interaction between the text and image correlations impacts accuracy. Given that the dependent variable is ordinal, we opted for ordered logistic regression to predict each accuracy bin.

5 Results

5.1 Human and Neural Attention Relationship – Averaged Over Documents

Table LABEL:tab:inter-model-avg shows the overall accuracy scores of the five models on the VQAv2 validation set when trained only on the training partition. The models improved over the challenge years – MCAN-G is the current SOTA Jiang.2020_grid_feat. For each model and modality, we report the Spearman's rank correlation and JSD scores averaged over the entire VQA-MHUG corpus (cf. Section 4.3). All figures were averaged over nine model runs, and the standard deviation is given over those instances. Given that one cannot average p-values, we used a paired t-test to check whether the differences in correlation and JSD per document and between models were statistically significant (see Appendix LABEL:sec:test_signf_between_models).

Image attention.

Models using region features, i.e. all but MCAN-G, are more correlated with human visual attention on images. MCAN-R achieves the highest correlation and MFB the lowest among the region-feature models, and the general trend shows that models with higher correlation had higher overall validation accuracy. Although MCAN-G achieves the highest accuracy, it had the lowest correlation with human image attention. For all model types, the difference between image correlation scores is significant, except between Pythia and BAN (see Appendix LABEL:sec:test_signf_between_models). With respect to the JSD, we observed similar patterns, except for the Pythia model, which was more dissimilar to human attention (had a higher overall JSD) compared to BAN. For all model types, the difference between image JSD scores was statistically significant (see Appendix LABEL:sec:test_signf_between_models).

Text attention.

Both the correlation and JSD scores indicate that Pythia is the most similar to human text attention, followed by MFB. Models with higher overall accuracy do not show high similarity to human attention over text on either the JSD or correlation metric. For both metrics, the difference in text attention between every model pairing is statistically significant, except for the JSD scores between pairings of BAN, MCAN-R, and MCAN-G (see Appendix LABEL:sec:test_signf_between_models).